ABSTRACT
We describe FoundHaplo, a novel identity-by-descent algorithm designed to identify individuals with known, untyped, disease-causing variants using only SNP array data. FoundHaplo leverages knowledge of shared disease haplotypes for inherited disease-causing variants to identify individuals who share the disease haplotype and are, therefore, likely to carry the rare (MAF<0.01) variant. We performed a simulation study to evaluate the performance of FoundHaplo across 33 known disease-harbouring loci. We demonstrated the ability of FoundHaplo to infer the presence of two rare (MAF<0.01) pathogenic variants, SCN1B c.363C>G (p.Cys121Trp) and WWOX c.49G>A (p.E17K), which can cause mild dominant and severe recessive epilepsy respectively, in two large cohorts including 1,573 individuals with epilepsy from the Epi25 cohort and 468,481 individuals from the UK Biobank. We demonstrate that FoundHaplo performs substantially better at inferring the presence of these variants than existing genome-wide imputation approaches. FoundHaplo is a valuable, low-cost screening tool that can be applied to search SNP genotyping array data for disease-causing variants with known founder effects based on shared disease haplotypes. FoundHaplo is available at https://github.com/bahlolab/FoundHaplo.
INTRODUCTION
Detecting disease-causing variants (DCVs) is essential for identifying individuals at high risk of disease 1,2, enabling appropriate patient care. A pathogenic variant observed in multiple unrelated individuals or families may have arisen independently in each (a recurrent variant) or may have been inherited from a common ancestor. In the latter case, carriers share a core haplotype around the variant, a phenomenon known as a founder effect 3–10. DCVs with founder events thus have an associated disease haplotype inherited from the common ancestor, shared by all the descendant variant carriers in subsequent generations 3–10. Some pathogenic variants initially thought to be recurrent have been reclassified as inherited based on haplotype sharing from a common ancestral founder 6,10. Shared haplotypes inherited from a common ancestor are defined as being identical by descent (IBD). Haplotypes shared by carriers of the DCV decrease in size over generations due to recombination events 3–10. Regardless of the time elapsed since the original founder event, an IBD segment persists in current-day descendants carrying the DCV. This suggests that detecting associated disease haplotypes through an IBD approach can also infer the presence of these inherited DCVs.
Founder events for many DCVs have been previously described. Table S1 lists an illustrative set of inherited genetic disorders with reported founder effects. Some genetic disorders show evidence of multiple founder events, each with its own unique haplotype. Examples of founder events include the Huntington’s disease repeat expansion, which displays multiple founder events 7,11–13, the CFTR p.F508del Cystic fibrosis-causing variant 14 and the GOSR2 p.G144W progressive myoclonus epilepsy-causing variant 15. In general, the population frequency of these highly or fully penetrant DCVs is low, with a minor allele frequency (MAF) <0.01 or 1%, typically leading to rare diseases. This low MAF typically leads to these variants being excluded in genome-wide association studies.
Most published IBD methods seek to identify genome-wide IBD tracts rather than directly screening individuals for DCVs 16–22. They do not make use of DCV haplotype information. DRIVE is a recent IBD tool developed to cluster carriers of DCVs to try and identify additional DCV carriers using existing genome-wide IBD methods 23.
Imputation platforms such as The Michigan Imputation Server (MIS) 24 and the TOPMed server 25 utilise linkage disequilibrium (LD) for genome-wide imputation of millions of variants not directly genotyped by SNP arrays. However, these rely on having DCV haplotypes in their reference databases, which is often not the case. With lower MAF, these imputation servers also perform more poorly. Hence, these approaches are not designed to identify rare DCVs (MAF<0.01).
Here, we introduce FoundHaplo, a novel IBD-based tool, developed using a first-order Hidden Markov Model (HMM), to detect DCVs with known founder effects from shared disease haplotypes requiring only SNP genotyping array data. FoundHaplo bases its inference on pre-existing information about the location and haplotype of the DCV. This tool is particularly relevant given the widespread use of SNP genotyping arrays in genome-wide association studies (GWAS) of patient cohorts and in large biobanks such as the UK Biobank (UKBB) 26, where many individuals are genotyped but not sequenced, either because genome sequencing is considerably more expensive than SNP genotyping arrays or because no remaining biospecimens are suitable for sequencing 27,28. Even though SNP genotyping arrays are cost-effective and commonly available, many DCVs are not captured directly on SNP arrays. FoundHaplo addresses the gap in identifying DCVs that are neither directly genotyped nor imputable with existing tools due to their low MAF and lack of representation in the large databases leveraged for imputation 29–33.
We perform a comprehensive simulation study to demonstrate the performance of FoundHaplo under single and multiple founder effects and then apply the algorithm to identify DCVs in cohorts, including the UKBB 26, demonstrating that FoundHaplo is a useful, low-cost screening tool which could be applied to bespoke catalogues of DCV haplotypes to identify individuals that merit further sequencing.
MATERIAL AND METHODS
FoundHaplo HMM
The FoundHaplo HMM aims to differentiate between random haplotype sharing and IBD between a known disease haplotype and a test individual in the vicinity of a DCV in a hypothesis-testing framework to infer the presence of a DCV in an individual’s phased and imputed genotyping data. The null hypothesis (H0) asserts no IBD between the individual’s haplotypes and the disease haplotype, indicating no common founder inheritance of the DCV. The alternative hypothesis (H1) suggests at least one haplotype presents evidence of IBD with the disease haplotype, indicating inheritance from a common founder. The FoundHaplo HMM, focusing on biallelic SNPs, models IBD to determine the hidden IBD state, discerning between no IBD (0) and IBD (1) based on the observed reference or alternate (0 or 1) alleles. HMMs in FoundHaplo replace the typical “waiting time” by the genetic map distance in Morgans from a known DCV locus to the next recombination event. With unknown IBD sharing boundaries, the algorithm starts Markov chains at the DCV locus and extends in opposite directions, comprising two Markov chains as illustrated in Figure 1.
The FoundHaplo algorithm calculates the likelihood ratio of IBD versus non-IBD at each genetic marker surrounding the DCV (denoted by marker 0) for a disease haplotype and a test pair of imputed haplotypes. The evidence for IBD is summarised in the FoundHaplo (FH) score, defined as the natural logarithm of this ratio, i.e. the log-likelihood ratio (LLR) (Supplemental material and methods).
The algorithm updates the likelihood of hidden IBD (L0 for the null and L1 for the alternate hypothesis) in the FH score. It accounts for a fixed rate of genotype and imputation errors, and switches between the test individual’s two haplotypes to handle phasing errors.
Genotype and imputation errors are indistinguishable in this model and are treated similarly, assumed to occur at a fixed, uniform rate (g) of 1% across the genome 34,35 (Figure 1, Figure S1 and Supplemental material and methods). Genotype markers with missing values are excluded from the analysis.
To propagate the LLR, the FoundHaplo algorithm switches to the alternate haplotype of the test individual when haplotype sharing ceases on the current one, as depicted by orange arrows in Figure 1. This approach captures potential sharing on the other haplotype and accommodates block phasing errors introduced by LD-based phasing tools. The Markov chains end when sharing around the DCV between the disease haplotype and test individual stops (Supplemental material and methods). When multiple known disease haplotypes for a single disease exist, the FH score is determined as the maximum of individual FH scores across all available disease haplotypes for that variant (Supplemental material and methods).
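As a rough illustration of the idea, and not the authors' implementation (whose derivation is given in the Supplemental material and methods), the sketch below accumulates a log-likelihood ratio outward from the DCV locus in one direction using a two-state forward recursion. It assumes that under IBD the test allele matches the disease allele except for errors at rate g, that under non-IBD the test allele is drawn from the population allele frequency, and that IBD is lost over a genetic distance d (in Morgans) with probability 1 - exp(-meioses * d). The function name, the `meioses` parameter, the absorbing non-IBD state and the omission of haplotype switching are all simplifying assumptions made for this example.

```python
import math

def one_sided_llr(disease, test, alt_freq, dist_morgans, g=0.01, meioses=20):
    """Sketch of ln[P(data | IBD at the DCV) / P(data | no IBD)] in one direction.

    disease, test : lists of 0/1 alleles ordered outward from the DCV locus
    alt_freq      : population frequency of allele 1 at each marker
    dist_morgans  : genetic distance from the previous marker (or the DCV)
    g             : assumed genotype/imputation error rate
    meioses       : assumed number of meioses separating the two haplotypes
    """
    p_ibd, p_non = 1.0, 0.0   # the H1 chain starts in the IBD state at the DCV
    log_l1 = 0.0              # running log-likelihood under H1
    log_l0 = 0.0              # running log-likelihood under H0 (never IBD)
    for d_allele, t_allele, q, d in zip(disease, test, alt_freq, dist_morgans):
        q = min(max(q, 1e-6), 1.0 - 1e-6)  # keep frequencies away from 0 and 1
        # Emission under non-IBD: population frequency of the observed allele.
        e_non = q if t_allele == 1 else 1.0 - q
        # Emission under IBD: alleles match unless an error occurred.
        e_ibd = (1.0 - g) if t_allele == d_allele else g
        # IBD survives a stretch of d Morgans with probability exp(-meioses * d);
        # once lost it is not regained in this sketch.
        stay = math.exp(-meioses * d)
        new_ibd = p_ibd * stay * e_ibd
        new_non = (p_ibd * (1.0 - stay) + p_non) * e_non
        scale = new_ibd + new_non
        log_l1 += math.log(scale)
        p_ibd, p_non = new_ibd / scale, new_non / scale
        log_l0 += math.log(e_non)
    return log_l1 - log_l0
```

Under these assumptions, a test haplotype that matches the disease haplotype at low-frequency alleles yields a positive score, whereas repeated mismatches drive the score below zero until the chain is effectively in the non-IBD state.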
Using empirical p-values for FH score evaluation
In the FoundHaplo algorithm, while the log-likelihood ratio test statistics under the null hypothesis are theoretically asymptotically chi-square distributed 36, the actual distribution deviates due to linkage disequilibrium (Figure S5). Therefore, the significance of the FoundHaplo statistic is assessed using the empirical distribution of FH scores from a control population. A test individual is identified as having IBD sharing with a disease haplotype if their FH score exceeds a critical threshold, set based on the 99th percentile of the FH score distribution in a control cohort of the same ancestral population as the test cohort, typically using data from the 1000 Genomes phase 3 haplotypes 37.
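The thresholding step described above can be sketched as follows. This is a minimal example that assumes FH scores have already been computed for the control and test cohorts; the function names and the nearest-rank percentile definition are assumptions made for illustration and are not taken from the FoundHaplo package.

```python
import math

def empirical_threshold(control_scores, percentile=99.0):
    """Score at the given percentile of the control distribution.

    Uses the nearest-rank definition (an assumption; other definitions
    interpolate between neighbouring scores).
    """
    ordered = sorted(control_scores)
    rank = math.ceil(percentile * len(ordered) / 100.0)  # 1-based rank
    return ordered[max(rank, 1) - 1]

def predicted_carriers(test_scores, control_scores, percentile=99.0):
    """IDs of test individuals whose FH score exceeds the control threshold."""
    cutoff = empirical_threshold(control_scores, percentile)
    return [sample for sample, score in test_scores.items() if score > cutoff]
```

Raising `percentile` to 99.5 or 99.8 tightens the cutoff, trading sensitivity for fewer false positives.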
Inputs required by FoundHaplo
The FoundHaplo algorithm relies on two parameters: the minor allele frequencies (MAF) of genetic markers and the recombination frequency between markers. Ancestry-agnostic recombination rates for the human genome are commonly used in IBD algorithms 5,9,22. Recombination rates in FoundHaplo are calculated from the genetic maps available from the HapMap project 38,39. We recommend using the MAF of the relevant ancestry of the test cohort taken from the gnomAD database 40,41 since MAF varies by ancestral population 24,32,33,37. The algorithm does not incorporate LD since IBD segments are typically longer than LD blocks.
FoundHaplo requires phased genotype data for both test individuals and disease haplotypes. To enhance the limited variant coverage of SNP array data, imputation is used to increase marker density. Imputed markers with R-squared ⩾0.3 are retained for FoundHaplo analysis in line with standard quality measures 24,42,43. Both imputation and phasing can be performed together using LD-based genome-wide imputation and phasing tools or servers such as the Michigan Imputation Server (MIS) 24 or TOPMed server 25.
FoundHaplo requires accurately phased disease haplotypes, best achieved through pedigree-phasing with another confirmed carrier of the same DCV from the same family. This avoids errors from LD-based genome-wide phasing, which only resolves phasing to LD block resolution. Additionally, individuals with long homozygosity regions due to related parents can be used as the source of recessive disease haplotypes, as the homozygosity tracts are often much longer than any shared IBD tracts.
FoundHaplo is freely available as an R package, together with a disease haplotype database schema, from https://github.com/bahlolab/FoundHaplo.
Researchers can create and manage their own database instances of disease haplotypes, maintaining data confidentiality while running FoundHaplo. A detailed mathematical derivation of the algorithm is provided in the Supplemental material and methods.
Simulation study
We performed a simulation study for 33 DCVs to evaluate the performance of the FoundHaplo algorithm using 503 unrelated individuals with European ancestry from the 1000 Genomes Project phase 3 dataset 37. Most of the DCVs we simulated are located at known repeat expansion loci (Table S3). Repeat expansions are rare, often inherited with known founder effects 3,4,8,44 and cannot be detected using SNP genotyping array data. They are, therefore, an excellent candidate set of diseases to demonstrate the utility of FoundHaplo. The simulation study investigated: (i) single founder effects, where multiple different versions of a single disease haplotype are found in the present time that are all distantly related and descended from a single ancestor, and (ii) multiple founder effects, where the same DCV has arisen independently in multiple unrelated founders, resulting in multiple unique haplotypes (Figure S2). Test cohorts were constructed by designating a fraction of individuals as cases. Each case was simulated by replacing the haplotype spanning the DCV locus with a randomly selected disease haplotype to simulate the presence of an inherited DCV. Cases were simulated to share different sizes of the disease haplotypes (0.5, 1, 2 and 5 cM) surrounding the DCV to simulate pairs of individuals with varying times to the most recent common ancestor (MRCA) (Figure S2). The rest of the test cohort remained unchanged, acting as controls. In our simulations, we introduced genotype and imputation errors by altering 1% of marker alleles genome-wide. Additionally, we simulated phasing errors in all individuals (except those used to derive the disease haplotypes) by switching blocks of adjacent marker alleles to the alternate haplotype with a rate of one switch per 20.05 Mbp 45.
For each of the 33 simulated disease loci, we created ten founder scenarios, generating ten disease haplotypes and 50 cases (5 per disease haplotype) for each scenario. This resulted in 1,320 simulation datasets encompassing both single and multiple founder effects with varying sharing lengths (Table S3 and Algorithm S2).
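The two error processes used in the simulations can be sketched as below. This is an illustrative approximation only: the function names are hypothetical, alleles are assumed to be coded 0/1, and switch points are placed between adjacent markers with probability proportional to their physical spacing, which is one simple way to obtain an average of one switch per 20.05 Mbp and may differ from the procedure in Algorithm S2.

```python
import random

def add_genotype_errors(haplotype, rate=0.01, rng=random):
    """Flip each 0/1 allele independently with probability `rate`."""
    return [1 - a if rng.random() < rate else a for a in haplotype]

def add_switch_errors(hap_a, hap_b, positions_bp, bp_per_switch=20.05e6, rng=random):
    """Swap the two haplotypes downstream of randomly placed switch points.

    A switch is placed between adjacent markers with probability equal to
    their spacing divided by `bp_per_switch` (capped at 1), an assumed model.
    """
    out_a, out_b = [], []
    swapped = False
    prev = positions_bp[0]
    for a, b, pos in zip(hap_a, hap_b, positions_bp):
        if rng.random() < min(1.0, (pos - prev) / bp_per_switch):
            swapped = not swapped
        prev = pos
        out_a.append(b if swapped else a)
        out_b.append(a if swapped else b)
    return out_a, out_b
```

Switch errors leave the unphased genotype at every marker unchanged and only alter which haplotype carries each allele, which is why haplotype switching within the algorithm can recover sharing across them.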
Detecting the SCN1B c.363C>G and WWOX c.49G>A rare epilepsy variants
FoundHaplo was used to predict carriers of the SCN1B c.363C>G (p.Cys121Trp) and WWOX c.49G>A (p.E17K) rare variants in two cohorts. SCN1B c.363C>G has a MAF of 0.01047% in gnomAD 40,41 and causes autosomal dominant genetic epilepsy with febrile seizures plus (OMIM: 604233) 46–49. WWOX c.49G>A has a MAF of 0.01037% in gnomAD 40,41 and causes autosomal recessive developmental and epileptic encephalopathy (OMIM: 616211) 50–52.
Cohort 1 consisted of 1,573 individuals with different types of epilepsy recruited in Australia or New Zealand as part of the international Epi25 study 53. Cohort 2 is the UKBB cohort (n=468,481) accessed through project ID 36610 26. Both cohorts are primarily of European ancestry and have WES and SNP genotyping data (Supplemental Material and Methods). Analysis of the WES data identified two individuals in the Epi25 cohort and 171 individuals in the UKBB cohort who carried the SCN1B c.363C>G variant, and 172 individuals in the UKBB who carried the WWOX c.49G>A variant.
For FoundHaplo analysis, five SCN1B c.363C>G and three WWOX c.49G>A disease haplotypes were created using duo and trio genotype data of eight different families (Supplemental Material and Methods). The European (EUR) cohort of 1000 Genomes Phase 3 37 was used as the control cohort when using FoundHaplo. None of the samples in the EUR cohort of the 1000 Genomes data carried either of the two variants.
FoundHaplo predictions were computed using critical values at the 99th, 99.5th and 99.8th percentiles from the 1000 Genomes data. The algorithm’s effectiveness was evaluated using the area under the precision-recall (PR) curve (AUPRC), which is appropriate for imbalanced datasets 54,55. The expected AUPRC of a random classifier is given by the baseline rate, which is the ratio of positives to the total cohort size 54,55.
Performance comparison to DRIVE
We compared FoundHaplo with DRIVE 23, the only other known algorithm capable of detecting individuals with DCVs using haplotypes. DRIVE’s performance was evaluated for identifying carriers of SCN1B c.363C>G and the WWOX c.49G>A variants in the Epi25 cohort. To run the DRIVE algorithm, known carriers (five SCN1B c.363C>G and three WWOX c.49G>A carriers) of these variants were merged into the Epi25 dataset in two separate analyses for the two DCVs. DRIVE assessed if the individuals in the Epi25 cohort carrying these DCVs clustered with the original known carriers (Supplemental material and methods). We did not run DRIVE on the UKBB cohort (n=468,481) as the run time required for such a large number of samples was not feasible.
Briefly, DRIVE uses the output from existing genome-wide IBD detection tools and identifies clusters of test individuals containing individuals already known to harbour the DCV who pairwise share an IBD segment overlapping a locus of interest 23.
This study was approved by the Austin Health Human Research Ethics Committee. Informed consent was obtained and archived from all participants or their legal guardian. Research was approved by the Human Research Ethics Committee at The Walter and Eliza Hall Institute of Medical Research (G20/01, 17/09LR).
RESULTS
Simulation study
We performed a simulation study using the European cohort of the 1000 Genomes phase 3 data to test FoundHaplo’s ability to identify shared disease haplotypes. The performance in distinguishing simulated cases from controls was assessed under a variety of settings, including varying shared haplotype sizes, number of disease haplotypes, and DCV loci. The performance was evaluated for critical value percentiles of 99, 99.5 and 99.8.
Figure 2 displays FoundHaplo’s sensitivity at the default 99th percentile, measured by its ability to identify simulated cases sharing an ancestor with the disease haplotypes in use. The algorithm showed 81% sensitivity for single founder effects and 69% for multiple founder effects with a shared length of at least 2 cM, maintaining an empirical false positive rate of 1%. Sensitivity is higher for single founder effects due to having a shared core ancestral haplotype. The sensitivity increases with larger simulated shared segments that can be better differentiated from the general population, indicative of a more recent common ancestor. Variations in performance across disease loci may arise from the frequency of chosen disease haplotypes in simulations or differing LD levels at certain loci.
Figure 3 demonstrates the variation in sensitivity and area under the PR curve (AUPRC) with increasing numbers of disease haplotypes. Sensitivity and AUPRC were calculated based on accurately predicting all simulated cases in each test cohort. The AUPRC of a random classifier in each simulation is 0.1 (50 cases/492 total cohort size), shown as black horizontal lines in Figure 3C-D (Algorithm S2).
Sensitivity and AUPRC are greater for single founder effects in FoundHaplo. The algorithm cannot identify shared haplotypes that have no common ancestor with the known disease haplotypes, leading to lower sensitivity for multiple founder effects when fewer disease haplotypes are available. This illustrates how having more unique disease haplotypes enhances performance, especially for DCVs with multiple founder effects, as shown in Figures 3B and D.
Detecting individuals with the SCN1B c.363C>G and WWOX c.49G>A rare epilepsy variants
The original five SCN1B c.363C>G (p.Cys121Trp) carriers shared a core haplotype of 4.1 cM around the SCN1B c.363C>G variant, and the three WWOX c.49G>A (p.E17K) carriers shared a core haplotype of 3.9 cM around the WWOX c.49G>A variant, suggesting a common ancestor between the families for each of the two variants (Figures S10 and S11). A common ancestor for the SCN1B c.363C>G variant has already been identified by Grinton et al 6, and here we demonstrate evidence of a founder effect for the WWOX c.49G>A (p.E17K) variant.
Analysis of WES data identified two individuals in the Epi25 cohort and 171 in the UKBB cohort carrying the SCN1B c.363C>G variant and 172 individuals in the UKBB cohort carrying the recurrent WWOX c.49G>A variant. None of the Epi25 individuals were identified to carry the WWOX c.49G>A variant. We compared the disease haplotypes for these variants with the confirmed carriers based on WES analysis in the UKBB and Epi25 cohorts. All of the 178 SCN1B c.363C>G carriers shared a core haplotype of 55 kbp. Among carriers in the Epi25 and the UKBB, a minimum pairwise sharing of 63 kbp, a median of 4,650 kbp and a maximum of 12,321 kbp was observed. All of the 175 WWOX c.49G>A carriers shared a core haplotype of 157 kbp. Among carriers in the UKBB, a minimum pairwise sharing of 540 kbp, a median of 1,762 kbp and a maximum of 10,000 kbp was observed (Figures S10 and S11). This suggests that all the carriers have a common ancestor for each of the two variants. The two Epi25 carriers shared the shortest genomic region with the core haplotype (63 kbp and 373 kbp) around the SCN1B c.363C>G locus, implying that they are more distantly related compared to the rest of the carriers.
The distribution of the FH scores in the Epi25 (n=1,573) and the UKBB cohorts (n=468,481) and in their respective control cohorts for the constructed SCN1B c.363C>G and WWOX c.49G>A disease haplotypes are shown in Figure 4. Non-carriers above the 99th percentile critical value are shown in black for the Epi25 but not for the UKBB cohort, due to the large number of individuals (n=4,685) at this percentile under the null hypothesis alone. Using the 99th percentile, FoundHaplo predicted both SCN1B c.363C>G variant carriers and 13 non-carriers (100% sensitivity and 0.8% false positive rate) in the Epi25 cohort. There are no WWOX c.49G>A carriers in the Epi25 cohort, but FoundHaplo identified 23 non-carriers (1.5% false positive rate). FoundHaplo predicted 166 SCN1B c.363C>G variant carriers (97% sensitivity and 2% false positive rate) and 167 WWOX c.49G>A carriers correctly (97% sensitivity and 0.9% false positive rate) in the UKBB cohort.
PR curve analysis was not performed for the Epi25 cohort due to the limited number of SCN1B carriers and the absence of WWOX carriers. In the UKBB cohort, the AUPRC of a random classifier is 0.00037 (carriers/total cohort size) for both variants 54,55, whereas FoundHaplo achieved an AUPRC of 0.46 for the SCN1B c.363C>G variant and 0.6 for the WWOX c.49G>A variant, indicating its effectiveness in distinguishing carriers from non-carriers in the UKBB cohort.
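The random-classifier baselines quoted for the simulation study and for the UKBB cohort follow directly from the carrier counts and cohort sizes reported in the text. A short check, using a helper name chosen only for this example:

```python
def pr_baseline(n_positives, cohort_size):
    """Expected precision (and AUPRC) of a classifier that ranks at random."""
    return n_positives / cohort_size

print(round(pr_baseline(50, 492), 2))        # simulated test cohorts
print(round(pr_baseline(171, 468_481), 5))   # SCN1B c.363C>G carriers in the UKBB
print(round(pr_baseline(172, 468_481), 5))   # WWOX c.49G>A carriers in the UKBB
```

Because carriers make up fewer than 4 in 10,000 UKBB participants, the PR curve is a more informative summary here than a ROC curve would be.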
The total number of predictions above the 99th percentile for the UKBB (n=9,523 for SCN1B c.363C>G and n=4,459 for WWOX c.49G>A) is typically too high for further screening (Tables S4-S5). For large cohorts, FoundHaplo can prioritize predictions by selecting a specified number of samples with the highest FH scores for screening. Using this approach for the UKBB cohort, we assessed the top 100 samples for each of the two variants. This resulted in correctly predicting 53 carriers for the SCN1B c.363C>G variant and 74 carriers for the WWOX c.49G>A variant, corresponding to sensitivities of 31% and 43% and true discovery rates of 53% and 74%, respectively (Tables S4-S5).
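The trade-off between the two selection strategies can be reproduced from the reported counts. The sketch below uses a hypothetical helper and assumes that the totals above the 99th percentile (9,523 and 4,459) include the correctly predicted carriers.

```python
def screening_metrics(true_pos, n_predicted, n_carriers, cohort_size):
    """Sensitivity, false positive rate and precision from raw counts."""
    false_pos = n_predicted - true_pos
    non_carriers = cohort_size - n_carriers
    return {
        "sensitivity": true_pos / n_carriers,
        "false_positive_rate": false_pos / non_carriers,
        "precision": true_pos / n_predicted,
    }

ukbb = 468_481
for label, tp, predicted, carriers in [
    ("SCN1B, 99th percentile", 166, 9_523, 171),
    ("SCN1B, top 100", 53, 100, 171),
    ("WWOX, 99th percentile", 167, 4_459, 172),
    ("WWOX, top 100", 74, 100, 172),
]:
    metrics = screening_metrics(tp, predicted, carriers, ukbb)
    print(label, {name: round(value, 3) for name, value in metrics.items()})
```

Restricting follow-up to the top 100 samples lowers sensitivity but raises precision from under 4% to 53% and 74%, which is the relevant quantity when confirmatory sequencing is the limiting cost.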
Comparing FoundHaplo performance with DRIVE
We evaluated the ability of DRIVE 23 to identify the SCN1B c.363C>G and WWOX c.49G>A variant carriers in the Epi25 cohort. Carriers of each variant in the Epi25 cohort were inferred in two separate analyses by merging the Epi25 cohort with the original variant carriers that were used to define the disease haplotypes in FoundHaplo. DRIVE estimated two clusters at the SCN1B c.363C>G locus and seven clusters at the WWOX c.49G>A locus.
All five original SCN1B c.363C>G carriers clustered together in a single cluster. The two candidate SCN1B c.363C>G carriers in the Epi25 cohort were not clustered with any of the five original carriers. The three original WWOX c.49G>A carriers were clustered together. None of the other Epi25 samples were clustered with the original SCN1B c.363C>G or WWOX c.49G>A carriers. We could not evaluate the performance of DRIVE on the UKBB cohort as the sample size was too large, and the required run time was not feasible.
Based on the analysis, DRIVE did not detect the two SCN1B c.363C>G variant carriers in the Epi25, whereas FoundHaplo identified them at the 99th percentile. However, DRIVE did not give any false positives for the two variants as none of the Epi25 samples clustered with any of the original variant carriers. FoundHaplo gave 13 and 23 false positives for the SCN1B c.363C>G and WWOX c.49G>A variants, respectively. However, the FoundHaplo analysis used an empirical 99% threshold; therefore, a 1% false positive rate is expected.
Comparing FoundHaplo performance with genome-wide imputation tools
The SCN1B c.363C>G (p.Cys121Trp) and WWOX c.49G>A (p.E17K) variants were neither genotyped on SNP arrays, nor imputed by the MIS 24 or the TOPMed server 25 in the Epi25 cohort, possibly due to the absence or scarcity of variant-carrying haplotypes in reference panels. In the publicly available UKBB cohort, the SCN1B c.363C>G variant was imputed as heterozygous in only nine of the 171 carriers (5% sensitivity) using the IMPUTE2 tool 56 with the HRC 57, UK10K 58 and 1000 Genomes phase 3 37 reference panels. The WWOX c.49G>A variant is not imputed in the UKBB cohort, likely because there are no WWOX c.49G>A carriers in the reference panels used for imputation. In contrast, FoundHaplo correctly predicted 55 SCN1B c.363C>G carriers and 74 WWOX c.49G>A carriers, using the 99th percentile critical value for the Epi25 cohort and the top 100 samples for the UKBB cohort. This corresponds to an overall sensitivity of 37% across the two variants (129 of 345 carriers), a substantial improvement over genome-wide imputation tools.
DISCUSSION
Achieving a genetic diagnosis is critical, providing opportunities for improved patient care by tailoring therapy appropriately, and potentially impacting the diagnosis of other family members, including distant relatives, who may also be at risk 1,2. With declining cost and widespread use of SNP genotyping, there are SNP arrays for millions of individuals in public databases. Hidden in these data are individuals who have inherited known, rare DCVs that are not ascertained directly by the SNP array and cannot be imputed due to their rarity.
FoundHaplo uses SNP genotyping array data to identify individuals with rare DCVs based on known disease haplotypes, unlike traditional IBD algorithms that target genome-wide IBD regions 16–22. In our simulation study, FoundHaplo successfully detected 75% of cases sharing at least 2 cM of a disease haplotype. We evaluated the ability of FoundHaplo to identify two rare variants, SCN1B c.363C>G (p.Cys121Trp) and WWOX c.49G>A (p.E17K), that can cause epilepsy: one a dominant allele, and the other a recessive variant requiring the presence of a second allele to cause disease. We showed that FoundHaplo’s targeted approach achieved an overall 37% sensitivity in detecting the two rare epilepsy variants compared to the 5% sensitivity achieved using LD-based genome-wide imputation tools, despite the presence of multiple DCV-carrying haplotypes in the imputation reference panels, indicating that FoundHaplo outperforms genome-wide imputation methods in inferring known DCVs using surrogate disease haplotypes. FoundHaplo also demonstrated greater sensitivity than DRIVE, a similar recently developed approach, albeit at the cost of a higher false positive rate.
We have shown that FoundHaplo can successfully identify disease haplotypes; however, the algorithm has a number of limitations. It uses a fixed error rate for genotype and imputation (1% by default), regardless of minor allele frequency (MAF) variations. A more refined method would adjust the error rate based on MAF, accommodating a higher error rate for rarer variants 32,33,59,60.
FoundHaplo does not account for linkage disequilibrium (LD). While LD blocks are typically short and the algorithm targets longer IBD segments for the founder effects we seek to identify, incorporating LD might improve the detection of shorter IBD segments (≤1 cM) with older common ancestors. The effect of LD can be seen in the simulation results, with performance varying by disease locus due to differences in background haplotype sharing between controls, caused by locus-specific LD.
FoundHaplo presumes accurate phasing of disease haplotypes, typically requiring multiple family members with a known DCV for pedigree phasing. Other LD-based genome-wide phasing approaches, like those in TOPMed or MIS, only offer block phasing. Additionally, FoundHaplo cannot determine the exact number of disease haplotype copies in a test individual. This is not relevant for autosomal dominant diseases since only one copy is sufficient to cause the disease. For recessive diseases, FoundHaplo can only predict individuals that carry at least one copy of the disease haplotype and further testing is required to determine the number of copies; however, this does not impact the utility of FoundHaplo as a screening tool. It will identify both carriers (one copy) and those individuals with two copies. These individuals may be homozygous or compound heterozygotes for inherited DCVs.
The power of the FH statistic increases with the number of unique disease haplotypes available. Additionally, FoundHaplo performs best when the disease and test individuals are more closely related to each other, allowing the preservation of a larger ancestral disease haplotype, and this is more likely to occur when more unique disease haplotypes are available.
One important consideration when using FoundHaplo is the choice of critical threshold. The best choice depends on the appropriate balance between increasing sensitivity and minimising the number of false positives. The “false positives” identified by FoundHaplo that do not share the DCV may still share the disease haplotype since FoundHaplo uses disease haplotypes as surrogates for DCVs. How often this occurs depends on the time since the DCV arose and on how unique the haplotype was on which it arose, both of which are generally unknown. For example, the SCN1B c.363C>G variant associated haplotype is present in ∼1% of the population 6; however, only a fraction of those individuals inherited the version of this haplotype with the DCV.
FoundHaplo, primarily designed for SNP genotyping data, can also utilize biallelic SNP genotypes from whole-genome sequencing (WGS) to identify disease haplotypes, which may be helpful for variants difficult to identify with short-read WGS, such as non-coding variants, cryptic splice variants and structural variants. We strongly recommend not using FoundHaplo on whole-exome sequencing (WES) data, where its effectiveness is likely limited due to sparse SNP markers, potentially increasing false negatives. FoundHaplo should be able to be used on any recombining genome for predicting inherited genetic variants.
The novelty of the FoundHaplo approach lies in using prior knowledge of known disease haplotypes to find local IBD segments specific to disease-causing variants of interest. We demonstrated the ability of FoundHaplo to detect two inherited rare variants that cause epilepsy. There are many other similar founder effects ideally suited for screening with this method. FoundHaplo could significantly aid in identifying carriers of known disease variants using SNP array data who might otherwise be unlikely to undergo targeted genetic testing or receive a genetic diagnosis.
Data Availability
This research has been conducted using data from UK Biobank, a major biomedical database. The UK Biobank is an open access resource. To access the UKBB datasets, you need to register as a UKBB researcher (https://www.ukbiobank.ac.uk/enable-your-research/register). Additional genetic data used in this study is not available due to patient privacy and ethical restrictions. FoundHaplo is available at https://github.com/bahlolab/FoundHaplo.
Declaration of Interests
Ingrid Scheffer has served on scientific advisory boards for BioMarin, Chiesi, Eisai, Encoded Therapeutics, GlaxoSmithKline, Knopp Biosciences, Nutricia, Rogcon, Takeda Pharmaceuticals, UCB, Xenon Pharmaceuticals, Cerecin; has received speaker honoraria from GlaxoSmithKline, UCB, BioMarin, Biocodex, Chiesi, Liva Nova, Nutricia, Zuellig Pharma, Stoke Therapeutics and Eisai; has received funding for travel from UCB, Biocodex, GlaxoSmithKline, Biomarin, Encoded Therapeutics, Stoke Therapeutics and Eisai; has served as an investigator for Anavex Life Sciences, Cerevel Therapeutics, Eisai, Encoded Therapeutics, EpiMinder Inc, Epygenyx, ES-Therapeutics, GW Pharma, Marinus, Neurocrine BioSciences, Ovid Therapeutics, Takeda Pharmaceuticals, UCB, Ultragenyx, Xenon Pharmaceuticals, Zogenix and Zynerba; and has consulted for Care Beyond Diagnosis, Epilepsy Consortium, Atheneum Partners, Ovid Therapeutics, UCB, Zynerba Pharmaceuticals, BioMarin, Encoded Therapeutics and Biohaven Pharmaceuticals; and is a Non-Executive Director of Bellberry Ltd and a Director of the Australian Academy of Health and Medical Sciences and the Australian Council of Learned Academies Limited. She may accrue future revenue on pending patent WO61/010176 (filed: 2008): Therapeutic Compound; has a patent for SCN1A testing held by Bionomics Inc and licensed to various diagnostic companies; has a patent molecular diagnostic/theranostic target for benign familial infantile epilepsy (BFIE) [PRRT2] 2011904493 & 2012900190 and PCT/AU2012/001321 (TECH ID:2012-009).
Web Resources
BCFtools, https://samtools.github.io/bcftools/bcftools.html
DRIVE, https://drive-ibd.readthedocs.io/en/latest/index.html
FoundHaplo, https://github.com/bahlolab/FoundHaplo
Genotype Harmonizer, https://github.com/molgenis/systemsgenetics/wiki/Genotype-Harmonizer
gnomAD, https://gnomad.broadinstitute.org/
HapMap project, https://www.genome.gov/10001688/international-hapmap-project
Michigan Imputation Server, https://imputationserver.sph.umich.edu/
OMIM, http://www.omim.org/
Plink 1.9, https://www.cog-genomics.org/plink/1.9/
SHAPEIT4, https://odelaneau.github.io/shapeit4/
TOPMed Imputation Server, https://imputation.biodatacatalyst.nhlbi.nih.gov/
UK Biobank, https://www.ukbiobank.ac.uk/
VCFtools, https://vcftools.github.io/index.html
1000 Genomes Project, https://www.internationalgenome.org/
Acknowledgements
We thank the Epi25 principal investigators, local staff from individual cohorts, and the individuals with epilepsy who participated in Epi25 for making possible this global collaboration and resource to advance epilepsy genetics research. This research was conducted with data from UK Biobank (www.ukbiobank.ac.uk), a major biomedical database, under data use agreement 36610 (PI Bahlo).
Funding support was provided by an Australian National Health and Medical Research Council (NHMRC) Investigator grant APP1195236 (MB); an NHMRC Senior Investigator Grant [GNT1172897] (IES); an NHMRC Senior Investigator Grant [APP196637] (SFB); a Melbourne Research Scholarship (392655) (ER); a CURE Epilepsy Taking Flight Award (MFB); an Australian Government Research Training Program Scholarship APP533086 (KLO); DHB Foundation Centenary Postdoctoral Fellowship in Neurogenetic Systems Biology (LGF); Health Research Council of New Zealand and Cure Kids (LS). This work was also supported by the Victorian Government’s Operational Infrastructure Support Program and the NHMRC Independent Research Institute Infrastructure Support Scheme. We would also like to acknowledge Professor Jozef Gecz and Dr Mark Corbett for valuable discussions on early work.