Abstract
Complex diseases, with many associated genetic and environmental factors, are a challenging target for genomic risk assessment. Genome-wide association studies (GWAS) associate disease status with, and compute risk from, individual common variants, which can be problematic for diseases with many interacting or rare variants. In addition, GWAS typically employ a reference genome which is not built from the subjects of the study, whose genetic background may differ from the reference and whose genetic characterization may be limited. We present a complementary method based on disease association with collections of genotypes, called frequented regions, on a pangenomic graph built from subjects’ genomes. We introduce the pangenomic genotype graph, which is better suited than sequence graphs to human disease studies. Our method draws out collections of features, across multiple genomic segments, which are associated with disease status. We show that the frequented regions method consistently improves machine-learning classification of disease status over GWAS classification, allowing incorporation of rare or interacting variants. Notably, genomic segments that have few or no variants of genome-wide signif-icance (p < 5 × 10−8) provide much-improved classification with frequented regions, encouraging their application across the entire genome. Frequented regions may also be utilized for purposes such as choice of treatment in addition to prediction of disease risk.
INTRODUCTION
Complex diseases, with many associated genetic and environmental factors, present a challenging diagnostic landscape.[1,2] It is difficult to assess disease risk from genomic data due to the polygenic nature of associations.[3] Genome-wide association studies (GWAS) are standard measures of genomic association with disease[4], but they typically treat associated features, often single-nucleotide polymorphisms (SNPs), independently, and assess only common variants even though rare alleles collectively represent a much larger pool of disease risk loci.[5] A whole-genome approach is motivated by the fact that GWAS often find strong associations with variants on genes and intergenic segments on many chromosomes, as well as the observation that these features often appear to interact to enhance disease risk.[3,6–8] A method which emphasizes the combined effect of disease-associated genomic features across the entire genome would provide a complementary and illuminating approach.
Though methods exist to analyze combinations of features from GWAS, the number of variant interaction combinations leads to computational limitations that often require filtering out variant combinations up front based on biological or statistical criteria.[9,10] While computational complexity increases when considering interactions among rare as well as common variants, including rare variants and their interactions can increase predictive power and identify clinically-relevant variants for common diseases.[10,11] Moving from a pairwise variant analysis to analyses based on genomic segments and application of machine learning techniques have both been proposed to overcome computational challenges in identifying interactions between variants or genes and for improving disease prediction.[12–15] Here, we apply a new region-based method and machine learning to link common and rare variants to human disease.
A promising type of genomic feature for this purpose is termed a frequented region, which groups genomic segments shared, or partly shared, among sample subsets. First introduced by Cleary et al.[16], frequented regions (FRs) are regions of a pangenomic graph that frequently occur within a subset of the population. Here we demonstrate the application of FRs to case-control genomic studies of human disease.
Pangenome graphs[17] are powerful ways to represent genomic variation within a cohort of individuals without resorting to variant calls against a reference genome or genotyping chips that cover only a fraction of the variation in a population. A graph may be built from individuals’ assembled genomes or even directly from whole-genome sequencing (WGS) reads. This is currently an area of active research.[18,19]
A pangenome-wide association study (PWAS) using FRs has been presented for 49 traits across 100 strains of yeast[20]. That study employed machine learning of FRs to predict the traits associated with each of the 100 strain genomes. In this report, we apply FRs to human case-control disease studies, where we have many similar but distinct genomes, labeled “case” or “control” depending on disease status of the individual. We also apply machine learning of FRs, in this case to predict disease status.
While FRs provide insight into polygenic associations with disease, our goal is to develop an assay which can estimate the risk of disease in undiagnosed individuals as a complement to conventional polygenic risk scores[6]. Another application would be to guide treatment of affected individuals, by predicting the success of medications when multiple choices are available. Frequented regions provide a natural framework for machine learning (ML)[16,20], in particular for supervised classification[21], so we present results of supervised classification using the support vector machine (SVM) method in our example of a complex disease, schizophrenia.
We present three example case-control analyses to illustrate the methodology and demonstrate its potential benefits: of sickle-cell disease (SCD), Huntington disease (HD), and schizophrenia (SCZ). The first two examples highlight distinct aspects of the method on Mendelian diseases for explanatory purposes; the third example, of a highly complex heritable disease, demonstrates that supervised classification using frequented regions performs better than classification using the same variants treated independently (GWAS).
METHODS
Frequented regions and application to case-control disease studies
Frequented regions linked to disease are collections of genomic features which highlight multi-genotypic or polygenic associations with a disease; a frequented region is a portion of a pangenome graph that represents sequence that is approximately conserved in the genomes that support that region. One searches for FRs of a pangenomic graph, guided by parameters chosen to bring out a desired type of collection: adjacent variants, multi-allelic variants, or non-adjacent variants on distant segments. The conceptual basis of FRs is that complex diseases are associated with a variety of features across the entire genome, while any given individual’s genome contains a subset of those associated features. By enumerating the FRs on a pangenomic graph, one collects relevant combinations of features for all of the individuals. While a GWAS uses a single variant as an input to risk prediction, an FR-based prediction uses a single FR, composed of many variants, as an input. As there can be thousands of individual variants associated with a disease, there can be thousands of FRs associated with a disease. The particular combination of variants will vary based on the individual genome, and many of the control-labeled genomes will also share some of the causative loci. Therefore, it is important to consider large sets of genomes. Our application of FRs is therefore a method in pangenome-wide association studies (PWAS).
In the original formulation[16], the pangenome graph is built from genomic sequences, and the graph is termed a sequence graph. Each vertex, or node on the graph represents a genomic segment, perhaps many bases long, that is carried by one or more individuals. The frequented region is designated by a cluster of nodes, and the task at hand is to find FRs that are deemed interesting, i.e. which satisfy some desirable characteristic, such as consensus among individuals or, in the application reported here, association with affected versus unaffected individuals in a case-control disease study. Two parameters, α and κ, described below, further specify FRs.
Each individual’s genome presents a path or paths through the graph. On a sequence graph, each diploid individual has two paths, which require phased assembly[22]. We introduce an alternative pangenome genotype graph, on which nodes represent genotypes, perhaps many bases long, rather than sequence (alleles) and an individual is represented by a single path through the graph. This typically increases the number of nodes (the genotypes A/A, A/T, and T/T are three nodes on a genotype graph, while occupying two nodes, A and T, on a sequence graph), while halving the number of paths. Importantly, genotype is associated with disease, and we are interested in the analysis of disease. ([23] presents a statistical argument for using genotypes rather than alleles in disease association studies, as the latter presumes Hardy-Weinberg equilibrium in the controls, which is often not the case.) Figure 1 compares a sequence graph with a genotype graph for the first 400 bases of the HTT gene from 37 individuals, 27 cases and 10 controls, in a study of Huntington disease (dbGaP study accession phs001071.v1.p1), both built from the provided variant calls against the GRCh37 human genome reference sequence.
Genotype graphs have a useful advantage over sequence graphs: they need not be constructed from contiguous sequence. We are free to construct a genotype graph from disjoint segments, such as exons in the case of whole exome sequencing (WES) data. We can exploit this advantage to study selected segments across the genome with limited computing power. We now turn to the definition of frequented regions and their associated quantities.
A path in a pangenome graph supports an FR if it satisfies two qualifying conditions based on two parameters, which deal with portions of the given path, termed subpaths:
α sets the minimum fraction of nodes in the FR’s node cluster that a path’s supporting subpath must traverse; it ranges from ϵ to 1 (where ϵ is an arbitrarily small positive number to ensure that at least one node in the cluster is traversed by a supporting subpath). One can alternatively call 1 − α “the maximum fraction of missing FR nodes on a subpath.”
κ sets the maximum number of consecutive nodes along a subpath that are not in the FR’s node cluster; it varies from 0 to ∞. When κ = 0, the subpath must traverse its FR nodes consecutively. One can call κ “the maximum number of consecutive inserted nodes on a subpath.”
In addition, qualifying subpaths must begin and end on FR nodes and be maximal : not part of a larger qualifying subpath. It is common for a path’s support to be greater than 1 because it contains distinct supporting subpaths. With α = 1, κ = 0, an FR is supported only by individuals that are genomically identical across the span of the FRs nodes (at the measured loci). For other values of α and κ, supporting paths may lack some FR nodes or contain nodes not belonging to the FR.
We note that FR analysis is sensitive to presence/absence, such as the case with indels. An insertion results in an extra node traversed by the individual’s path; a deletion results in lack of a node traversed by the individual’s path. These path variations lead to variations in FR support, depending on α and κ.
A few other terms: the size of an FR is the number of nodes in its cluster; the total support S of an FR is the total number of supporting subpaths. One can require that interesting FRs satisfy minimum size and/or minimum support. Figure 2 demonstrates a path and the variation of its support of an FR with different α and κ values.
On the graphs studied here, each path is labeled “case” or “control”, corresponding to the individual’s status in a disease study. We separately tally an FR’s case and control subpath support, Scase and Sctrl, and prioritize FRs on case vs. control association using the support-based odds ratio
where Ncase and Nctrl are the number of case and control paths. The FR’s priority, an integer in our implementation, is defined by
setting P = ± 2000 for Sctrl = 0 or Scase = 0. (We use an integer for compatibility with integer-based priorities like total support.) To break ties, we favor larger total support, followed by smaller size. We can prioritize control-enhanced FRs by reversing the sign of P or, as we do here, prioritize case- and control-enhanced FRs equally by using |P|.
The problem of finding FRs is computationally hard : it is #P-complete[24], meaning it is not computationally feasible to find all FRs even for small data sets. However, we can identify FRs that are useful for our classification task using, for example, hierarchical clustering[25]. Identifying such clusters on clinical-scale data, however, is an extreme clustering task [26], requiring powerful computational resources. In order to analyze large regions one must implement an FR search algorithm that runs in polynomial rather than exponential time[27]. Cleary et al.[16] describe a heuristic algorithm using bottom-up, agglomerative clustering[25]. For the purpose of showing the efficacy of an FR-based approach for disease classification, and to gain insight into the algorithm specialization that will be necessary to perform this task at scale, we used a brute-force search for the results presented here, with some computation-reducing additions described in the examples. Our implementation1 has many optional parameters to control the search. The work was performed on a 128-CPU, 1-TB machine running Java under Clear Linux.
The graphs studied here were built from variant calls against the human reference genome provided in VCF files, rather than directly from sequencing reads. This does not affect the purpose of this report, which is to demonstrate the efficacy of the methods described herein. We set our FR search routine to leave no-call genotypes (denoted “./.” in the VCF) out of FRs; paths continue to traverse them and they are included in the support calculation. In the future, we plan to build graphs directly from whole-genome DNA sequencing reads, or reference-guided assemblies, which can reveal novel genomic content in the study population which impacts disease risk.
Machine learning and supervised classification of disease status
Frequented regions, on their own, provide an informative way of viewing genomic variation that can be linked to disease. However, of more interest to the clinician is the ability to estimate genomic disease risk of a given individual. For this purpose, FRs provide a natural framework for machine learning, in particular supervised classification[16,21].
Supervised classification consists of training a classifier from a labeled training set, validating the classifier against a separate set of labeled data not used for training, and testing the classifier on unlabeled data. Classifiers operate on feature vectors which encode each sample’s quantitative association with each feature. In our work, those features are FRs and the associated quantity is path support: the feature vectors are FR path support vectors, one for each individual in a study.
Table 1 displays ten path support vectors for five FRs of a graph containing only four nodes. Classifying these data is an easy task, since case and control support are quite distinct. For complex diseases, association with disease status is much less clear – our thesis is that FRs provide useful classication results which are complementary to other methods.
For comparison with our FR-based supervised classification, we perform a GWAS-type classification on graph path traversal vectors : vectors of 1’s and 0’s which denote presence/absence of each genotype, i.e. which nodes are/are not traversed by an individual’s path on the graph. This mode of classification is akin to standard GWAS, where each variant is treated independently. To assess the association of individual loci with the disease, we employ the standard measurement of p-value using the Cochran-Armitage test for trend, and use the term “GWAS-significant” for loci which have p < 5× 10−8 [28].
An important exercise in supervised learning is k-fold cross-validation.[29] One separates the individuals into k equally-sized groups and then uses each group to validate the classifier trained on the remaining individuals, until all individuals have been classified, and then summarizes the classification results. We ran 10-fold cross-validation with LIBSVM[30] as well as a number of binary classifiers from the Weka package[31]. Since LIBSVM outperformed the Weka classifiers in almost every case, we report only the LIBSVM results here. Cross-validations were run 10 times, using a different random number seed each time, to generate the reported mean values and variance of the classification results.
RESULTS
The research presented here on frequented regions in case-control disease studies demonstrates the efficacy of the method. Our brute-force algorithm allowed us to analyze graphs with up to around 1000 nodes on the available equipment. We now present results for three examples: sickle-cell disease (SCD), Huntington disease (HD), and schizophrenia (SCZ).
Example 1: Sickle-cell disease (SCD)
Our first example is of a simple Mendelian disease: sickle-cell disease (SCD), an autosomal dominant disease associated with a SNP, rs334, on the hemoglobin subunit beta (HBB) gene on Chr 11.[33] Not having access to a case-control study of SCD with genomic data, we found that 137 of the 2,504 individuals in the 1kG database carry the deleterious allele. Since phenotype data is not provided, for the sake of demonstration we labeled those 137 individuals as “case” and labeled the other 2,367 individuals as “control”.
A pangenome graph provides information similar to, but more extensive than, a GWAS Manhattan plot. Figure 3 compares a GWAS Manhattan plot across the HBB gene with the genotype graph, for the full cohort of 2,504 subjects. One can see the associative genotypes in both cases, while the genotype graph provides a more extensive visualization of the variation amongst individuals. For explanatory purposes, we now discuss the meaning of certain choices of α and κ when searching for FRs on this graph.
With α = 1 and κ = 0, supporting paths must traverse every FR node consecutively, without any inserted nodes; that is, with α = 1, κ = 0 an FR is supported only by individuals that are genomically identical across the span of the FR’s nodes (at the measured loci). As a result, path support for α = 1, κ = 0 tends to be low for FRs of large size.
The two nodes in Figure 3 corresponding to rs334 are 141 (T/T) and 142 (the deleterious T/A). The single-node FRs {141} and {142} therefore have complementary support and, by design, log10 ORs = ±∞. Similarly, the single-node FRs {47}, {48}, and {49}, which correspond to the A/A, G/G, and G/A genotypes of GWAS-significant rs1609812, have case/control support of 125/1212, 0/266, and 12/889, or log10 ORs = 0.25, −∞, and − 0.63. These five FRs would all be deemed interesting since they have P = ±2000, 250 and − 630. (It is worth noting that only 266 individuals, or 10.6%, all controls, carry the homozygous reference genotype, reminding us that the human reference genome often represents a minority genotype in a diverse population like 1kG[34].)
We note that with α = 1, κ = 0, single-node FR analysis is similar to GWAS, where each genotype is treated independently. An aspect of α = 1, κ = 0 is that path support decreases when one views larger FRs. For example, the case-associated FR {142,143,145,147} has support 121 rather than 137.
In order to find FRs which reveal multi-genotypic associations with non-shared intermediate nodes we set κ > 0. In fact, setting κ = ∞ is often desirable, since we are not concerned with the distance on the graph between nodes but, rather, are interested in discovering polygenic associations, regardless of where the variants are located on the genome. In practical terms, increasing kappa allows for the identification of shared genotypes among samples, even if those genotypes are separated by non-shared genotypes. Capturing these shared but distant genotypes in an FR allows us to test them for phenotype interactions as a group, thereby uncovering phenotypic effects due to genotype interactions even if the individual genotypes do not have a detectable effect on phenotype.
Figure 3 displays three strongly associated loci for our case/control assignment besides rs334 which was used for splitting cases and controls: rs1609812 (nodes 47–49), rs10768683 (nodes 108–110) and rs713040 (nodes 147–149). The case-enhanced genotypes form the FR {47,108,147} which, with α = 1, κ = ∞, has case/control support 124/1205 and log10 ORs = 0.25 or P = 250. All but 13 of the case paths traverse these three nodes and therefore support this FR. 137 case-labeled subjects is too small a number from which to draw conclusions, but this particular FR might motivate study of the collective impact of these three loci on SCD, although it is also possible that their significance comes from their linkage to rs334 rather than from a biological effect of the variants. (In fact, dbSNP and ClinVar[35] report rs1609812:A, rs10768683:G, and rs713040:G as benign or likely benign alleles associated with SCD and/or β thalassemia.) With α = 1, κ =∞, FRs are supported by paths which traverse all of the FR nodes, revealing polygenic disease associations.
In conclusion, for a monogenic single-allele Mendelian disease like SCD, frequented regions of genotype graphs provide an alternative visualization of variation as well as an opportunity to discover multi-genotypic associations. They do not offer a large gain over GWAS in this scenario; their advantages become apparent in the analysis of multi-allelic Mendelian diseases with structural variants like Huntington disease (Example 2), and, more importantly, complex polygenic diseases like schizophrenia (Example 3).
Example 2: Huntington disease (HD)
Huntington disease, or Huntington’s chorea, is a fully penetrant neurodegenerative monogenic disease caused by a dominantly inherited CAG trinucleotide repeat extension in the huntingtin gene (HTT) on Chr 4 (dbSNP rs71180116, ClinVar Variation ID 31916).[36] Unaffected individuals have fewer than 36 CAG repeats at this locus on both chromosomes; individuals carrying 36–39 repeats may or may not be affected, but pass a 50% disease risk to their offspring; individuals carrying 40 or more repeats suffer from the disease. Thus, there are a variety of rs71180116 genotypes which do and do not lead to disease status, as shown in Figure 1(b). We employed the NINDS Family-Based Whole-Genome Sequencing to Find Modifiers of Age of Onset in Huntington’s Disease study, dbGaP accession phs001071.v1.p1, and used the provided variant calls against human reference GRCh37 along with separately assayed CAG repeat lengths provided by the study authors. We labeled 27 case and 10 control subjects based on the reported phenotype and/or the length of the longer CAG repeat. All case individuals were heterozygous for the damaging allele. We limited the span of analysis to the first 400 bp of HTT, which includes rs71180116.
GWAS, often based on SNP array data or SNP calls using WGS or WES reads, are not typically designed to study structural variations such as this. Genotype graphs, conversely, make no distinction between simple variants like SNPs and larger variants: a node may contain any size genotype.
To enable analysis of a multi-allelic disease like HD, we set α < 1 to enable FR support by paths which traverse different nodes at the same locus. Table 2 displays the highest case- and control-supported FRs on the graph in Figure 1(b) as α is decreased. The graph has 6 control and 7 case nodes at the rs71180116 locus. When α ≤1/6, the FR {4,9,10,12,13,14}, containing all of the non-disease genotypes, is supported by all of the control paths; when α ≤ 1/7, the FR {3,5,6,7,8,11,15}, containing all of the disease-associated genotypes, is supported by all of the case paths. Decreasing α reveals multi-allelic FRs.
We applied 10-fold cross-validation to the path support vectors for 692 FRs generated with α = E, κ = 0. LIBSVM consistently delivered perfect cross-validation. This is unsurprising, given the association of the HD genotypes with disease status. The final example, exploring a case-control schizophrenia study, exhibits imperfect classification which improves when the classifier is trained on FR path support rather than graph path traversal.
Example 3: Schizophrenia (SCZ)
Schizophrenia is a heritable psychiatric disorder that affects up to 1% of the general population.[37,38] Large-scale GWAS point to a large number of genes contributing to its pathophysiology, due to both rare and common variants.[39–42] A composite picture of heritability remains elusive.
In this example, we explore the application of frequented regions to a case-control study of 12,380 Swedish individuals, Sweden-Schizophrenia Population-Based Case-Control Exome Sequencing, dbGaP study accession phs000473.v2.p2.[41] We used the variant calls against GRCh37 provided by the study.
First, we removed subjects diagnosed with bipolar disorder rather than SCZ, leaving 11,209 individuals. Then, we removed randomly selected control-labeled subjects to stratify the cohort at 4,966 cases and 4,966 controls, since balanced designs are known to provide better supervised classification[43].
With our algorithm and computational power, we must explore limited genomic segments. The most strongly disease-associated segment is the HLA region of Chr 6, shown in Figure 4, from which we selected the HLA-A, HLA-B, and HLA-C genes to build graphs labeled HLAA, HLAB and HLAC, with 1,166, 1,291, and 1,091 nodes, respectively. For comparison, we selected two low-association segments: one on Chr 6:151627034–151939181 for a graph labeled SCZ6A with 1,093 nodes, and one on Chr 14:31349968–31647448 for a graph labeled SCZ14C with 1,084 nodes. GWAS Manhattan plots of these five segements are shown in Figure 5.
The goal of this example is to determine whether supervised classification is improved when we train a classifier on FR path support rather than graph path traversal (GWAS). FRs bring out disease association with clusters of loci. For example, no locus in the SCZ14C range has a p-value less than 1× 10−2, yet many FRs, supported by a subset of paths, can be found that exhibit strong disease association. The combination of such FRs may provide improved classification. We now make a few points about frequented regions using the HLA-A gene, which contains eight GWAS-significant loci (Figure 4).
First, it is worth noting the effect of haplotypes on FR support. Figure 6(a) shows that two of the eight associating loci are strongly linked based on minimal changes to their support: rs74544126 and rs3098019 are on consecutive bases and form haplotype pairs: GT/GT and GT/CG. (With a graph built from assemblies or reads rather than SNP calls, these haplotypes and much larger genotypes will naturally appear.) The FR {784,787}, corresponding to GT/CG, has total support equal to one less than {784} alone. The presence of haplotypes across separate nodes increases an FR’s size with little effect on its support.
However, when α = 1, combination of less strongly-linked nodes reduces support. This is the case with the FR shown in Figure 6(b). While {786} has S = 6016 and {824} has S = 6655, the combined FR {786,824 has S = 5239. When α = 1, combining nodes generally decreases support. When α is small, however, paths traversing {786} or {824} will support {786,824}, resulting in S = 7482. With small α, combining nodes increases support.
The eight GWAS-significant variants on HLA-A shown in Figure 4 provide eight case-enhanced nodes on the HLAA graph. All are shared by many affected and unaffected individuals: with α = 1, 49.3% of the case paths and 44.5% of the control paths support the full eight-node FR, giving ORs = 1.106. With α = 1 we emphasize fully shared genotypes, i.e. the intersection of individuals’ genotypes. But complex diseases are associated with a wide array of genotypes, of which a subset are carried by any particular individual. Therefore, small α is valuable in revealing FRs that represent the union of individuals’ genotypes.
Repeating the above with α = ϵ to determine support by paths that contain at least one of the eight genotypes results in support by 79.1% of the case paths and 72.9% of the control paths. Since FRs with small α are supported by a larger number of paths, they provide more extensive input to a supervised learning classifier.
For small α and κ =∞, a path will support an FR by either 0 or 1, since the maximal subpath that contains FR nodes is unique when any number of inserted nodes is allowed. One can think of α = E, κ = as a binary test of whether a path traverses any of the FR’s nodes. For small α and κ = 0, however, a path’s support will be nearly equal to the number of FR nodes that it traverses, reduced slightly when the path traverses two or more FR nodes consecutively. One can think of α = E, κ = 0 as providing an approximate count of the FR nodes that a path traverses. The same paths will provide support in both cases, but with κ = 0 the path support vectors will have a variety of support values, rather than 0 or 1 with κ = ∞, and this may impact supervised learning and classification. (A range of α and κ values may be scanned to find an optimal combination for classification; we have done that, and found that extreme values of α and κ are sufficient for our purposes here.)
To establish a baseline of supervised classification, we ran 10-fold cross-validation on the path traversal of each graph (GWAS), disregarding no-call nodes. Results are presented in Table 3. The best path traversal classification is found on HLAC with 54.5% of individuals called correctly, specificity=0.502, sensitivity=0.588, and Matthew’s Correlation Coefficient MCC=0.091. (A random classifier would call 50% correctly, with specificity=sensitivity=0.5 and MCC=0.) Our low-association segments, SCZ6A and SCZ14C, provide poorer path traversal classification, as one would expect. If frequented regions are to be a useful alternative to standard GWAS, they must perform better than this baseline.
We then searched for FRs on each graph, requiring S ≥ 100 and|P| ≥ 1, which is comparable to GWAS requiring MAF>1% and log10 OR > 0.001. This is too large a problem for our algorithm without further reduction. To make the problem feasible, we added three constraints to the FR search routine:
(1) Interesting FRs must contain a specified starter node (FRFinder parameter --requirednodes).
(2) Interesting FRs must have higher priority than those previously found or be a subset of a previously found FR with the same priority (--keep=subset).
(3) Interesting FRs in step n must contain the nodes in the FR found in step n−1 (--requirebestnodeset).
(1) and (2) do not incur a loss of generality since we run every node as a starter node, and supersets of an FR with the same priority are generally uninteresting. Constraint (3), however, which extends the most interesting FR one node at a time, greatly shrinks the available space of FRs. We found that (3) still provides a useful set for classification. (Note that a node will generally appear in many more FR runs than the one in which it is the starter.) For each starter node, we keep the best (last and largest) FR found for training of a supervised classifier.
With those constraints in place, we set α = E, κ =∞ and searched for FRs of each graph. The resulting 10-fold cross-validations are shown in Table 3 and Figure 7. Supervised learning of FR path support consistently provided better classification than that of path traversal. Of particular note is the improvement on SCZ6A, from 51.7% to 55.2% correct calls. A segment with no significantly associated loci provided better classification using frequented regions than even highly-associated segments did using path traversal.
Finally, in order to explore the improvement that occurs when combining low-association segments, we built a graph from SCZ6A and SCZ14C. This graph has 2,168 nodes and led to 1,173 interesting FRs. The classification results were surprising: while path traversal classification resulted in 52.4% correct calls (only slightly better than SCZ6A and SCZ14C individually), classification with frequented regions resulted in 59.5% correct calls and MCC=0.190, both higher than any of the other runs. Figure 7 shows how sensitivity and specificity improved equally, leading to a large increase in total correct calls. This particular result encourages the pursuit of frequented regions composed of intervals across the entire genome.
Although, overall, the classification gains are not huge, it is clear that, on this SCZ dataset, frequented regions provide a better basis for supervised learning than do individual genotypes.
Twin studies provide us with an estimate of the maximum achievable classification of SCZ. Since there are many identical twins for which one sibling is affected and the other is not, we know that perfect classification is impossible even if we were to use the entire genome as input, since environmental and other non-genomic factors affect disease status. The probandwise concordance rates for several studies of identical twins are reported in [44] and cluster around 45%. If one had a cohort of 100 twins, for which one or both siblings is affected by SCZ, along with a cohort of 200 control individuals and a perfect classifier that classifies all SCZ genomes correctly, one would incorrectly classify 55 individuals as SCZ-positive, for a FPR=55/255=0.22, with 86% total correct calls and specificity=0.78.
We conducted a graph path traversal (GWAS) analysis of the 307 GWAS-significant loci on autosomal chromosomes in the Swedish SCZ study (891 total nodes). LIBSVM classified 60.1 0.2% of a 4969/4969 cohort correctly, with sensitivity=0.493± 0.002, specificity=0.708 ± 0.002, and MCC=0.206 ± 0.004. Given the results presented here, we expect that a genome-wide analysis of frequented regions would add at least 2–5% to this 60% correct classification, and perhaps much more.
DISCUSSION
This report presents a proof of principle for using frequented regions of genotype graphs in the study of complex diseases. The sickle-cell disease and Huntington disease examples help to explain the methodology in a simplified setting, while the Swedish schizophrenia study demonstrates improvement in genomic classification of a complex disease. The SCZ example employed small graphs built from exonic variants, mostly SNPs, which are thought to capture only 23% of SCZ susceptibility[40]. We believe that larger graphs, built directly from whole genome sequencing reads, without reliance on the human reference genome, will provide much better classification. In any case, our preliminary results suggest that frequented regions are worthy of further study.
We are not the first to apply machine learning to SCZ: a recent study[45] applied ML to feature vectors composed of SNP calls from the same Swedish study as used here, on 50 genes chosen a priori based on bioinformatic criteria. In particular, the authors focused on rare (predicted functional) variants, with MAF≤ 1%. They obtained remarkably good classification results from a selected 2545/2545 cohort. Our approach differs greatly from theirs: our method emphasizes the discovery of associated collections of genotypes, without regard to rate of their occurrence in the general population, including both common and rare alleles. In fact, ours is a standalone analysis which uses as input solely the pangenomic graph of the study individuals’ genotypes.
It is important to emphasize that our classification and cross-validation operated on individuals that were members of a relatively non-diverse population: ethnic Swedes. It is unlikely to perform well, as trained, on genomes from other populations. In fact, we believe that this is a fundamental aspect of genomic diagnosis of complex diseases: in order to find signatures that associate with disease status, one must have a sufficiently uniform genomic background. If FR analysis is to result in a clinical diagnostic, that diagnostic will have to be tuned for the population of the individuals being tested, based on ethnicity or perhaps a distinguishing set of markers. This is not a statistical problem for schizophrenia, which is common in all human populations. It becomes a problem, however, for rare diseases, when statistical power may be insufficient within a particular population.
Another challenge with classification based on FRs is the task of applying the classifier to an individual that is not in the cohort that was used to build the graph. In principle, if the graph is comprehensive enough, such an individual will have a fully-defined path for which the FR support vectors may be computed. However, there is no guarantee that an individual being tested does not have a relevant genotype that is missing from the training cohort.
We hope to find FRs on much larger graphs built from WGS reads, without involvement of a reference genome, and to measure their effectiveness in disease classification and other tasks. We have the long-term goal of building an FR-based disease classification appliance, which would be trained for a particular disease and population, and then used to test unlabeled genomes from that population. This sort of device could be useful in applications such as treatment, where, say, the successful medications for affected individuals would be used to label training data in order to suggest a medication to try with new patients. Case-control studies are only one of many types of study that can be analyzed with frequented regions.
That pursuit will likely find new and important disease-associated genotypes that are not present in the human reference genome. We believe that new pangenomic analyses like frequented regions of genotype graphs are an extremely important path for genomic disease research.
Data Availability
The case-control data used are available from dbGaP and the 1000 Genomes Consortium, the latter publicly available, the former requiring dbGaP approval. The dbGaP accessions used are phs001071.v1.p1 and phs000473.v2.p2. The code written and used is publicly available on GitHub under the repositories sammyjava/GWAS and sammyjava/pangenomics.
https://github.com/sammyjava/GWAS
https://github.com/sammyjava/pangenomics
https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001071.v1.p1
https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000473.v2.p2
Acknowledgements
We thank Callum Bell and Andrew Farmer for insightful comments on drafts of the manuscript.
The NINDS Family-Based Whole-Genome Sequencing to Find HD Modifiers datasets used for the analysis described in this manuscript were obtained from dbGaP at http://www.ncbi.nlm.nih.gov/gap through dbGaP accession number phs001071.v1.p1. We thank the submitting investigator Steven Finkbeiner.
The Sweden-Schizophrenia Population-Based Case-Control Exome Sequencing datasets used for the analysis described in this manuscript were obtained from dbGaP at http://www.ncbi.nlm.nih.gov/gap through dbGaP accession number phs000473.v2.p2. Samples used for data analysis were provided by the Swedish Cohort Collection supported by the NIMH grant R01MH077139, the Sylvan C. Herman Foundation, the Stanley Medical Research Institute and The Swedish Research Council (grants 2009-4959 and 2011-4659). Support for the exome sequencing was provided by the NIMH Grand Opportunity grant RCMH089905, the Sylvan C. Herman Foundation, a grant from the Stanley Medical Research Institute and multiple gifts to the Stanley Center for Psychiatric Research at the Broad Institute of MIT and Harvard.
We also thank the International Genome Sample Resource for keeping the 1000 Genomes data freely available.
The work presented here was partly funded and computational resources provided by the National Center for Genome Resources, Santa Fe, New Mexico, USA. Partial funding also came from NSF award #1759522.
Footnotes
↵1 FRFinder available at https://github.com/sammyjava/pangenomics