Abstract
Background Genomic variants of disease are often discovered nowadays through population-based genome-wide association studies (GWAS). Identifying genomic variations potentially underlying a phenotype, such as hypertension, in an individual is important for designing personalized treatment; however, population-level models, such as GWAS, may not capture all of the important, individualized factors well. In addition, GWAS typically requires a large sample size to detect association of low-frequency genomic variants with sufficient power. Here, we report an individualized Bayesian inference (IBI) algorithm for estimating the genomic variants that influence complex traits such as hypertension at the level of an individual (e.g., a patient). By modeling at the level of the individual, IBI seeks to find genomic variants observed in the individual’s genome that provide a strong explanation of the phenotype observed in this individual.
Results We applied the IBI algorithm to the data from the Framingham Heart Study to explore genomic influences of hypertension. Among the top-ranking variants identified by IBI and GWAS, there is a significant number of shared variants (intersection); the unique variants identified only by IBI tend to have relatively lower minor allele frequency than those identified by GWAS. In addition, we observed that IBI discovered more individualized and diverse variants that explain the hypertension patients better than did GWAS. Furthermore, IBI found several well-known low-frequency variants as well as genes related to blood pressure that were missed by GWAS in the same cohort. Finally, IBI identified top-ranked variants that predicted hypertension better than did GWAS, according to the area under the ROC curve.
Conclusions The results provide support for IBI as a promising approach for complementing GWAS especially in detecting low-frequency genomic variants as well as learning personalized genomic variants of clinical traits and disease, such as the complex trait of hypertension, to help advance precision medicine.
Background
Hypertension (HTN) is a key risk factor for many cardiovascular diseases, and it was primarily responsible for about 7.8 million world-wide deaths in 2015 alone. Previous studies indicate that in addition to environmental factors, genomic factors play a significant role in blood pressure (BP) regulation [1]. Hypertension is a polygenic disease [2] burdening a large population across the globe. Current efforts at identifying significant genomic variants mostly involve the use of genome-wide association studies (GWAS). Although GWAS has successfully identified more than 1000 significant single nucleotide polymorphisms (SNPs; the most common type of genomic variants among people) for BP [3], there are limitations to this commonly used approach. In general, GWAS requires a large cohort to gain enough power to identify the significant SNPs, especially the ones with low minor allele frequency (MAF). That is why before 2015 there were only about 64 significant SNPs identified for blood pressure, and only recently were more SNPs identified due to the increased sample sizes (∼ 1 million individuals) [4-6]. Still, most of the SNPs identified so far are common SNPs with small effect sizes, and the total genetic variance in blood pressure explained by these ∼1000 SNPs is small (∼5.7%) [3]. It is likely that there are a significant number of non-common variants missed by GWAS that can help explain much of the remaining genomic variance [7].
GWAS is a population-based approach, and it extracts significant SNPs from a population level, not considering the specific genome of a given individual. Therefore, GWAS is not tailored to identify the genomic influences of HTN in an individual, which is the focus of personalized medicine. It is not uncommon that a HTN patient does not have any of the significant variants identified at the population level. Thus, identifying the most probable genomic variants of individual patients is important but remains an unmet need.
We have developed an individualized Bayesian inference (IBI) algorithm for estimating the genomic factors influencing the development of hypertension and other complex traits in a given individual. As a general machine learning framework, IBI applies a Bayesian method to identify the significant genomic variants in a given individual or patient. Bayesian methods including Bayesian multiple logistic [8] or linear regression have been used for identifying the causal SNPs among the significant genomic regions identified by GWAS (i.e., fine mapping) [9, 10]. However, none of these are individualized. IBI evolved from a tumor-specific causal inference algorithm (TCI) that members of our team developed for estimating the somatic mutations driving the development of individual cancerous tumors [11]. In contrast to TCI, IBI is designed to model and learn the relationships between an individual genome and a complex trait, such as HTN. Also, IBI was optimized for efficient computation with whole-genome data, whereas TCI was developed to use whole-exome data.
IBI identifies significant and potentially causal genomic variants for each individual based on his or her specific genomic background (and available training data on many similar individuals). By concentrating on the genomic variants observed in a particular individual, IBI has the potential to discover significant variants of low frequency that exist only in a small number of individuals and could have been missed by GWAS. The genomic variants identified being significant by IBI could help inform the design of personalized treatment for individuals with or at-risk for hypertension.
Methods
Overview of Bayesian Networks
A Bayesian network (BN) [12, 13] is a probabilistic graphical model which has two components. One is a graphical structure containing nodes and directed edges. Nodes represent domain variables such as genomic variants or clinical traits. Directed edges represent conditional dependencies between variables. The other component of a BN is a set of parameters which are conditional probabilities. Basically, each node has a conditional probability given its potential causes, which can be described by a conditional probability function. The joint probabilities of all nodes can be written as a product of each node’s conditional probability given its direct causes, based on the local causal Markov condition. A BN is a flexible framework for modeling the probabilistic relationships among variables in a complex domain via representing the joint probability of all the variables modeled in a probabilistic structure. A bipartite BN is a particular class of BN with less complexity, where there are only two sets of nodes in level 1 and level 2, and potential causal relationships only occur from nodes in level 1 to nodes in level 2.
How do we search for the most probable BN given data? A very popular class of methods are score-based algorithms that assign a Bayesian score to the BN model and return the BN with the highest score [12, 13]. This Bayesian score of the BN model is assigned based on how well this BN is supported by both the data and prior knowledge [14]. In this study, we will use a popular Bayesian score for modeling discrete variables, the Bayesian Dirichlet equivalent uniform (BDeu) score [14] as TCI did [11].
The general framework of individualized Bayesian inference
As mentioned, IBI is based on TCI [11] and has been further developed and adapted to fit the circumstances of modeling a variety of complex diseases or traits and whole-genome genotyping or sequencing data. IBI is designed to estimate the significant genomic variants, such as SNPs, in a specific individual or patient for downstream clinical and molecular phenotypes. IBI uses a bipartite BN [12, 13] for modeling the conditional-dependency or predictive relationships between the genomic variants as a set of V nodes and the downstream traits or phenotypes as set of T nodes; directed edges between V and T nodes represent the probabilistic or predictive relationships from variants to traits (Figure 1A). Within this bipartite BN, among all the variants in one individual, IBI assigns a posterior probability for each variant (represented by an V node) in influencing or predicting the trait of interest (represented by node T) specific for this individual (Figure 1A, D).
For a current individual h, let Vs be a variable that represents a specific genomic variant s (e.g., a SNP) and let Ti be a specific trait i (e.g., HTN) of this individual. Let be a vector representing all the genomic variants in individual h. We will examine the predictive relationship (Vs → Ti) in this individual for each possible Vs. Let P(Vs → Ti) be the prior probability for Vs predicting or influencing Ti, which could be estimated using biological background knowledge or could be set using a uniform prior that assumes all the genomic variants have the same prior probability of predicting or influencing Ti. Let D be the training data without inclusion of individual h. Let Ms represent the log form of the marginal likelihood of P(D |Vs → Ti) that was derived by assuming one genomic variant s as the potential predictor for the entire population D; Ms can be further normalized by the summation of Ms′ across all the SNPs to derive the posterior probability (PP). When scoring Vs → Ti for the entire population D (i.e., it is not individualized) using Bayesian learning and a uniform prior, PP is proportional to Ms; thus, the ranking of the specific driver Vs by Ms or PP as a potential predictor of a trait at the population level is the same. Thus, we will use Ms as the score for Vs → Ti. When evaluating the effect of Vs on Ti in the entire population using GWAS, the p-value is derived to indicate the significance of the association between Vs and Ti (Figure 1B).
IBI partitions the overall population into two subpopulations (Figure 1C). Suppose the current patient has Vs = 1, which represents the minor-allele of this SNP. Let represent the patient-like-me subpopulation, where all the patients in this subpopulation contain the value Vs = 1. IBI evaluates how well Vs predicts the HTN status within , which has a marginal likelihood score of that we abbreviate as (Figure 1C, D). Let represent the remaining cases that do not have Vs = 1, but rather, have Vs = 0. To predict the data in , IBI finds the SNP Vr (where “r” denotes the remaining cases) that maximizes the marginal likelihood of Vr → Ti, namely, , which we abbreviate as . The marginal likelihood for all of the data, given V as an individualized predictor and Vr as the best predictor of the remaining cases, is , which we refer to as Ms,r (Figure 1C, D). This score of Ms,r can be used to evaluate and rank the capability of Vs in explaining the patients-like-me subpopulation that contain this minor allele as well as in helping reduce the noises for the remaining subpopulation.
The marginal likelihood is computed using the BDeu score [14] (Figure 1C, D; refer to the TCI paper [8]). Individualized posterior probabilities of the form Vs → Ti are further derived relative to the SNPs that are minor alleles in the genome of the current patient h. Thus, the posterior probability takes into consideration the specific genomic background of the given individual (Figure 1D, Equation 2). In summary, IBI is individualized in the following ways: (1) The overall marginal likelihood for each arc Vs → Ti (Equation 1) contains an individualized component that uses the subpopulation of “patients like me” that have the same variant (i.e., Vs = 1). (2) Each individual has a unique set of genomic variants. Depending on the specific set of variants, the posterior probability for the same arc of Vs → Ti may be different in different individuals (Equation 2). The individualized nature of IBI makes it a potential tool for advancing precision medicine where personalized treatments are desired for individuals of varying genetic backgrounds. IBI is implemented in python with vectorization and matrix operations for efficient computation involving millions of variants, and has been tested on whole genome sequencing data on the BioData Catalyst platform [15].
Genome-Wide Association Studies
GWAS is the standard approach for identifying the significant variants associated with traits at the population level (e.g., p-value < 5 × 10−8 for genome wide significance). Conventional GWAS uses standard logistic regression models or Fisher’s exact test for discrete traits [16]. We performed GWAS using Fisher’s exact test on the same datasets to which we applied IBI and compared results.
Data and data preprocessing
We used the whole genome genotyping data of Affymetrix HuGeneFocused50K from the Framingham Heart Study (FHS) cohort (dbGaP Study Accession: phs000007.v30.p11), which covered about 50K gene-centric and coding SNPs across the genome [17, 18]. We used the following functions from plink for further filtration and quality control, and have acquired 38,342 SNPs: --mind 0.03 --geno 0.03 --maf 0.01 –hwe 10e-6 –me 0.05 0.1 – sexCheck. We filled missing SNP values with the most frequent value for that particular SNP across the entire population. Dominant coding was then performed in plink, and thus, the final SNP values are 0 or 1 where 0 represents zero copy of the minor allele (risk allele) and 1 represents one or two copies of the minor allele. The focus of this paper is to predict the risk (minor) allele SNPs that influence hypertension (high blood pressure) rather than protecting the subject from hypertension. Therefore, we further removed the SNPs that have risk ratio smaller than 1 resulting in a total of 19,276 SNPs of interest.
Clinical phenotype data included harmonized systolic BP (SBP) and diastolic BP (DBP) data which were downloaded from PIC-SURE on the NHLBI BioData Catalyst platform [15]. SBP and DBP are specifically harmonized by the Trans-omics for Precision Medicine (TOPMed) Data Coordinating Center [19] by taking the average of two SBP or DBP measurements obtained at a single clinic visit. 10 and 5mm Hg were specifically added for SBP and DBP for individuals who were taking antihypertensive drugs [20]. If SBP>=140, or DBP>=90 or an individual who was taking antihypertensive drugs, we considered this individual as having HTN and assigned ‘HTN = 1’; otherwise, we classified this individual as not having HTN, and we assigned ‘HTN = 0’. After merging the SNP data and BP data, we obtained a total of 6,613 patients with 19,276 SNPs. We performed a stratified random split to produce an 80% training set (5,290 subjects) and a 20% test set (1,323 subjects), and we reserved this test set for the prediction task.
Results
To evaluate IBI in inferring significant genomic variants for HTN, we compared its performance to that of GWAS. As a proof of concept, we applied both IBI and GWAS to the whole genome data of Affymetrix HuGeneFocused50K measurements and harmonized phenome data of BP measurements from the FHS cohort [18] as described above.
A Bayesian method for GWAS analysis
As was explained in the Methods section, when using a Bayesian method and a uniform prior to study a single variant’s effect on HTN in the population level, the derived Ms for this variant is proportional to its global posterior probability (GPP), making it possible to use the marginal likelihood to find the top predicting SNP (Figure 2A). We observed that when using a uniform prior, as we did in this study, a population-based (i.e., not individualized) Bayesian approach to identifying top-ranked SNPs based on Ms yielded similar results to the population-based GWAS method (Figure 2A). The Spearman correlation coefficient between the p-value and Ms across all the 19,276 SNPs is -0.9. We further examined and compared the top 189 SNPs ranked by Ms or p-value. The top SNPs identified by high Ms values or low p-values are highly overlapping: 164 out of the top 189 SNPs and 18 out of the top 20 SNPs overlapped between these two rankings (Figure 2). Furthermore, the ranking of these top SNPs by Ms and p-value are either exactly the same or very similar where Ms is negatively correlated with p-value (Figure 2).
IBI complements GWAS and better detects significant variants of low MAF
We applied both IBI and GWAS to the training (discovery) subset of 5,290 FHS subjects with 19,276 SNPs and HTN status, and derived the IBI marginal values of Ms,r (Figure 3A) and GWAS p-values (Figure 3B) for all the SNPs in the Manhattan plots. In Figure 3A, the values of Ms,r were normalized with (−2436 − Ms,r) / (−2436 − (−2463)) considering -2436 as the maximum and -2463 as the minimum, based on the min-max normalization technique. For the GWAS analysis, if considering 0.05 / 19276 = 2.59e−6 as the significance level for p-value after the Bonferroni correction, five SNPs reach such significance (Figure 3B). In Figure 2, the population-level Ms values were derived by assuming one SNP as the global predictor or potential cause of HTN for the entire population (Figure 1B). When using two SNPs to specifically explain HTN status from two distinct subpopulations as is done by IBI (Figure 1C), the overall marginal values (Ms,r) significantly increase for many SNPs of Vs. Among all the SNPs, 189 Vs SNPs have Ms,r values bigger than the biggest Ms value derived in the population level from the best global predictor, represented as Vg (Figure 3A). The higher score of IBI compared to the population-based Bayesian method also has theoretical support. It has been proved that instance-based (i.e., individualized) causal inference methods, a family of algorithms to which the IBI belongs, are consistent. More specifically, in the large sample limit, the score of the data-generating instance-specific model will be assigned the highest score of any model [21]. These results support that the HTN status in the overall population has been explained better by IBI with any of the top 189 SNPs, Vs, explaining the subpopulation of and with the remaining population predictor, Vr, explaining the remaining subpopulation, in comparison to using the best global predictor Vg itself to explain the entire population of D.
We performed another evaluation from the perspective of information theory. In this setting, GWAS analysis is searching for a variant Vs that has strong information with respect to a trait Ti (HTN), and the amount of information can be measured as information gain (IG) [22]. IG can be calculated by splitting samples according to one variable (Vs) and then measuring the change in the entropy of the other variable (Ti) during partitions. Rather than focusing on just one variant as in a GWAS analysis, the IBI algorithm evaluates how much information we can gain with respect to the trait Ti (HTN) if we consider both the specific variant of interest (Vs) and the remaining-population predictor (Vr), IG (Vs, Vr ; Ti) [23]. The more Vs and Vr complement each other (e.g., they are two distinct predictors for different subgroups) to provide information with respect to Ti, the higher the IG. In other words, IBI searches for variants that not only explain HTN well in the “patients-like-me” subpopulation, but also help enhance the information of Vr with respect to HTN in the remaining subpopulation that do not contain this specific variant of interest. Actually, the ranking of the top 189 SNPs by IBI Ms,r turned out to be highly correlated (Spearman correlation coefficient, r = 0.9) to their ranking by IG values: in general, the higher the Ms,r, the higher the IG. Based on information gain values, top-5 IBI SNPs were rs11574358, rs1794108, rs11000217, rs2292664 and rs9928967 while top-5 GWAS SNPs were rs11574358, rs2292664, rs383306, rs5491 and rs1794108. IG values for the top 189 IBI SNPs of Vs selected by Ms,r are significantly higher than those of the top 189 GWAS SNPs selected by the p-values; IG values for the top 189 GWAS SNPs are also significantly higher than the values for 189 randomly-selected SNPs (Figure 3C) as expected.
We further examined whether there is any overlap between the top 189 IBI and GWAS SNPs. We found that 41 of the top 189 SNPs were identified by both IBI and GWAS (Figure 3D), and 3 of the top 5 IBI and GWAS SNPs are the same (Figure 3A, B). Thus, IBI and GWAS share many of the same top SNPs, suggesting a mutual agreement between these two approaches. The unique SNPs for IBI or GWAS support that the two approaches are also complementary (Fig. 3D). We further examined the MAF distribution for the different subsets in Figure 3D (Figure 3E). Interestingly, the IBI-only SNPs overall had much lower MAF than GWAS-only SNPs (Fig. 3E). This result provides support for the hypothesis that IBI is more capable of identifying lower-frequency significant variants, relative to GWAS, by concentrating on the genomic variants of a given individual in a specific subpopulation.
IBI discovered more individualized and diverse significant SNPs that better explain the HTN patients, compared to GWAS
For a given individual h, IBI derives the posterior probability for each genomic variant , by normalizing Ms,r with a summation of all the Ms,r across all the existing minor allele SNPs (i.e., the SNP value is 1) in this individual. This posterior probability takes into consideration the diverse genomic background or context for different individuals. More specifically, a particular SNP Vs with the same Ms,r may have different posterior probabilities in different individuals due to their distinct genomic background (i.e., different sets of existing minor allele SNPs). For a given HTN patient, IBI ranked all the minor alleles existing in this individual based on their individualized posterior probabilities (this ranking will be the same as the ranking based on Ms,r in a given patient); the SNP with the highest posterior probability was considered as the most probable influence of HTN for this given patient. For comparison, we designated a top SNP for each HTN patient based on the population-level p-values derived by GWAS: among the existing minor alleles in a given HTN patient, the non-protective minor allele with the lowest (most significant) p-value was considered to be the most probable influence for HTN in this particular patient.
Among all the 930 HTN patients in the discovery dataset, we identified 16 unique SNPs according to GWAS ranking (Figure 4A) and 25 unique SNPs based on IBI ranking (Figure 4B); each of these unique SNPs was assigned by GWAS or IBI as a top-1 SNP for at least one HTN patient. The number of HTN patients explained by each of these unique SNP can be derived by the differences of the accumulated number of explained HTN patients showed in Figure 4A, B. The more unique SNPs identified by IBI suggested IBI was able to find a more diverse set of significant SNPs with a more personalized approach. IBI identified 13 SNPs that explain less than 10 HTN patients individually while GWAS found 6 such SNPs. Interestingly, at the same time, IBI assigns the intronic SNP ‘rs13265032’ in the CSMD1 loci as the top-1 SNP with the highest PP or Ms,r for each of the 425 (46%) of HTN patients (Figure 4B).
For the GWAS analysis, if considering 0.05/19276 = 2.59e-6 as the significance level for p-value after the Bonferroni correction, then 120 out of 930 (12.9%) HTN patients can be assigned a significant SNP identified by GWAS; even with a relaxed significance level of 1.09e-5, only 146 out of 930 (15.7%) HTN patients are covered or explained by these significant GWAS SNPs (Figure 4A). This suggests that the significant SNPs of HTN identified at the population level by GWAS do not necessarily exist in a given HTN patient, leaving a significant portion of HTN patients unexplained by these significant SNPs.
On average, there are 7,767 out of 19,276 minor allele SNPs existing in the HTN patients. If assuming all these existing risk alleles have the same prior probability in causing HTN, and only one of them is causing HTN, then the prior probability for each risk allele is 1.0 / 7767 = 1.3e-4. Interestingly, the top one SNP selected by IBI for each HTN patient has much higher posterior probability ranging from 0.08 to 0.99 (Figure 4C). If considering 0.1 as a significant posterior probability threshold, then 922 out of 930 (99.1%) HTN patients can be assigned a significant IBI SNP as the potential cause for their HTN status; with a more restrictive threshold of 0.2, 741 out of 930 (79.7%) HTN patients can be explained. These results suggested that IBI was able to find a top SNP with significant posterior probability (>=0.1), relative to the random chance (1.3e-4), for the majority of HTN patients as a potential genomic cause.
As is shown in Table 1, the intronic SNP ‘rs13265032’ in the CSMD1 loci that was assigned as the top-1 SNP by IBI for 46% (425) of HTN patients, was also ranked high (12) by IBI Ms,r among all the SNPs. By contrast, this SNP was never assigned as a top-1 SNP for any HTN patient by GWAS and this SNP was ranked very low (5361) by GWAS among all the SNPs. Interestingly, other intronic SNPs in the CSMD1 loci have been reported to be associated with hypertension [24] or blood pressure response to hydrochlorothiazide [25, 26], an antihypertensive drug. Among all the 86 SNPs located in the CSMD1 loci in our dataset, three of them were ranked very high by IBI as is shown in Table 1, while all of them were ranked relatively low by GWAS. These three novel SNPs identified by IBI in the CSMD1 loci provide evidence to support the reported role of CSMD1 in HTN, which themselves may warrant further analysis for their potential causal influence on CSMD1 regulation. In addition to SNPs in the CSMD1 loci, IBI also identified a novel missense variant of ‘rs1803274’ in the BCHE loci, a novel intron variant of ‘rs948028’ in the GRIK4 loci and a novel missense variant of ‘rs12779623’ in the MALRD1 loci as top-1 likely cause of HTN in 9, 2 and 1 HTN patient, respectively (Table 1). Interestingly, BCHE [27, 28] loci, GRIK4 [29] loci and MALRD1 loci [4] have been reported to be associated with blood pressure regulation, although GWAS analysis ranked their SNPs relatively low (Table 1). Overall, these results provide support for IBI being able to identify novel and biologically meaningful SNPs or genes associated with HTN that were missed by GWAS analysis.
IBI found well-known significant variants or genes that were missed by the parallel GWAS analysis in the same cohort
We list several missense variants (Table 2) as well as the gene loci (Table 3) that were previously reported for their influence on blood pressure regulation, in addition to the ones discussed in Table 1. In Tables 2 and 3, IBI ranks were determined by Ms,r while GWAS ranks were determined by the p-value.
In Table 2, the missense variant of rs37369 [30] has been shown to be one of the 4 functional SNPs of AGXT2, which has been reported to have strong associations with several cardiorenal traits, such as coronary heart disease [31]. Its significant association with hypertension was very recently reported via multiple regression analysis involving only several targeted SNPs [32]. The missense variant rs11575542 was very recently identified as a functional variant of the DOPA Decarboxylase (DDC) gene during the systematic polymorphism screening across the 15-Exon DCC locus [33]. The SNP was shown to alter the enzyme activity of DCC and result in changes in renal DA excretion that is linked to hypertension [33]. The missense variant of rs723580 was reported to be a top trans-eSNP [34] for the expression level of EPO associated with the red blood cell traits that were strongly linked to hypertension [35]. Intriguingly, with low MAF in our relatively small discovery cohort, these three SNPs were ranked much higher by IBI than by the parallel GWAS analysis (Table 2). This result provides support that IBI can recognize biologically meaningful genomic variants of low MAF, relative to GWAS, particularly when the sample size is small compared to the number of SNPs to be tested.
Table 3 shows a list of genes that have been reported to be associated with BP regulation or HTN where at least one related paper is listed for each gene. IBI identified these genes as candidate genes influencing HTN since these gene loci contain at least one novel SNP that is highly ranked by IBI for its association with HTN. Interestingly, all of these genes were ranked relatively low by GWAS even considering the highest SNP rank by GWAS (‘Top GWAS rank’ in Table 3) within each gene locus; all of these novel SNPs were also ranked relatively low by GWAS (‘GWAS rank’ in Table 3) with non-significant p-values (Table 3). Moreover, among these eight SNPs highly-ranked by IBI and lowly-ranked by GWAS, six have MAF lower than 0.1 and three have MAF as low as 0.01 (Table 3). This together with Table 2 supports that IBI is more capable in identifying significant variants of low MAF, compared to GWAS.
IBI top SNPs identify genetic risk scores that are more predictive for HTN than do the GWAS top SNPs
We further compared the capabilities of significant SNPs, identified by IBI and GWAS, in predicting HTN. After running IBI and GWAS on the training (discovery) dataset, we were able to rank all the SNPs based on Ms,r derived from IBI or p-values obtained from GWAS. For each subject in the test set, based on IBI ranking or GWAS ranking, we identified the top 1 and 3 SNPs that exist in this subject (with a value of ‘1’ denoting a minor allele). We then used these top SNPs to calculate the genetic risk scores (GRS) for each subject by the sum of his or her risk alleles, weighted by odds ratio for GWAS top SNPs or by Ms,r for IBI top SNPs. We further use min-max normalization to normalize both the IBI and GWAS GRS to avoid potential bias from the different scales of the original values. We then directly calculated the area under ROC curve (AUROC or AUC) using the normalized GRS for each patient (Figure 5). We also trained a logistic regression model for HTN prediction using this feature of normalized GRS, which gave very similar results (data not shown).
As expected, using randomly selected SNPs showed poor prediction performance, with an AUROC of 0.50 (Figure 5A, D); the GWAS-selected top one or three SNPs both have an AUROC of 0.55 (Figure 5B, E) suggesting some level of prediction; the IBI-selected top SNP had an AUROC of 0.59 (Figure 5C) and the top 3 SNPs had an AUROC of 0.60 (Figure 5F). The higher AUROC achieved by IBI provides support that the top SNPs it selects predict hypertension better than the top SNPs selected by GWAS.
Discussion
In this study, we developed and applied a novel and individualized method (IBI) to estimate the personalized genomic variants for the complex trait of hypertension. We compared its performance with the population-based GWAS method using a real dataset from the FHS cohort. The significant overlap of the top-ranked SNPs by both IBI and GWAS suggest a degree of agreement of these two approaches. On the other hand, the unique SNPs they found support a complementary role of IBI to current GWAS analyses. Interestingly, by focusing on each individual and its patient-like-me subgroup, IBI was capable of identifying significant SNPs of low MAF in the same cohort, relative to GWAS. IBI was also able to identify more diverse and individualized top SNPs to explain the HTN patients. Moreover, the top SNPs identified by IBI from the discovery cohort were able to predict HTN better than the top ones derived from GWAS when applied to an unseen test cohort. We also identified evidence from the literature to support the biological significance of top SNPs found by IBI, especially the ones highly-ranked by IBI and lowly-ranked by GWAS. In summary, our study provides support that IBI can serve as a complementary approach in discovering novel and personalized genomic variants that may be missed by GWAS.
Contemporary GWAS studies often involve a large sample size (∼1 million) to gain sufficient power, especially for variants of low MAF. Considering the large genomic heterogeneity among different individuals, as well as the nature of complex diseases often being affected by many variants of small effect size, an alternative approach is to focus on the subpopulation containing the specific variant of low MAF under evaluation, as IBI does. In this way, IBI may be able to better evaluate the effect of low-MAF variants in a patient-like-me subpopulation, without the potential noise from a large remaining population not containing such variants; moreover, this large remaining population could be explained better with a remaining population driver. The fact that the top-ranked SNPs by IBI in general have a higher overall marginal likelihood, Ms,r, and higher information gain with respect to the HTN status, provide support that IBI may have found specific drivers that better explain the subpopulations. Our results also support that IBI is not compromised in identifying significant high-MAF SNPs. IBI’s population partition strategy aligns well with the concept of personalized medicine in which different individuals or subpopulations may have different underlying genomic influences on producing complex clinical phenotypes such as HTN.
As a general Bayesian framework, IBI can be applied to any discrete trait. It can also be applied to continuous traits by changing the marginal likelihood function from using the BDeu score for discrete variables to using the Bayesian information criterion (BIC) score for continuous variables [14]. For the current approach presented in this study, one limitation is that it only considers the genomic factors of HTN, while not modeling the effects of other factors such as age, sex, population structure and the family relatedness that may exist in this FHS cohort. To model the effects from non-genomic factors, we plan to incorporate linear mixed models [47-51] into our current framework. Also, due to confounding factors such as population structure, as well as linage disequilibrium (LD), the predictive variants described in this paper are not guaranteed to be causal. Further fine mapping approaches, functional analysis, or Mendelian randomization can be used to further pinpoint the potential causality. Another interesting future direction is to search for more than one genomic variant that might work together to affect and predict the phenotypes of individuals and subpopulations.
Conclusions
In summary, we described a novel Bayesian method for identifying personalized genomic variants that predict complex traits, such as HTN. IBI can serve as a complementary approach to GWAS, especially in detecting significant genomic variants of low frequency. The novel SNPs we identified for HTN warrant further analysis for their possible causal role in blood pressure regulation.
Data Availability
All data produced in the present study are available upon reasonable request to the authors
https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000007.v32.p13
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Availability of data and materials
The genomic and blood pressure data from the FHS cohort (study accession: phs000007.v30.p11) are publicly available via controlled-access through the database of Genotypes and Phenotypes Study (https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000007.v30.p11). The source code for IBI is publicly available on GitHub at https://github.com/asadcfc/IBI.
Competing interests
The authors declared no competing interests.
Authors’ contributions
JL designed and implemented the IBI algorithm in python, conceived and designed the experiments, and wrote the paper; MA.R and JL carried out the data collection, modeling, analyses; XL and GC provided insightful advices about the design of the IBI algorithm and some of the experiments, as well as edited the manuscript; YD provided advice for whole-genome data processing and GWAS analysis; DM provided general advice on hypertension development and its potential genomic causes; CC helped with the information gain experiment; All authors read and approved the final manuscript.
Funding
This research was funded in part by the BioData Catalyst Fellowship award (0065304) from the National Institutes of Health, National Heart, Lung, and Blood Institute through the University of North Carolina at Chapel Hill, grant R01LM012011 from the National Institutes of Health, R01 MD009118 from the National Institute on Minority Health and Health Disparities (NIMHD).
Acknowledgements
The Framingham Heart Study, within the database of Genotypes and Phenotypes Study (accession #: phs000007.v30.p11), is conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with Boston University (Contract No. N01-HC-25195 and HHSN268201500001I). This manuscript was not prepared in collaboration with investigators of the Framingham Heart Study and does not necessarily reflect the opinions or views of the Framingham Heart Study, Boston University, or NHLBI. Funding for SHARe Affymetrix genotyping was provided by NHLBI Contract N02-HL-64278. SHARe Illumina genotyping was provided under an agreement between Illumina and Boston University. Funding for Affymetrix genotyping of the FHS Omni cohorts was provided by Intramural NHLBI funds from Andrew D. Johnson and Christopher J. O’Donnell.
Support for this work was provided by the National Institutes of Health, National Heart, Lung, and Blood Institute, through the BioData Catalyst program (award 1OT3HL142479-01, 1OT3HL142478-01, 1OT3HL142481-01, 1OT3HL142480-01, 1OT3HL147154-01). Any opinions expressed in this document are those of the author(s) and do not necessarily reflect the views of NHLBI, individual BioData Catalyst team members, or affiliated organizations and institutions. The authors wish to acknowledge the contributions of the consortium working on the development of the NHLBI BioData Catalyst ecosystem.
List of abbreviations
- GWAS
- Genome Wide Association Study
- IBI
- Individualized Bayesian inference
- TCI
- tumor-specific causal inference
- HTN
- Hypertension
- BP
- Blood pressure
- SBP
- Systolic blood pressure
- DBP
- Diastolic blood pressure
- FHS
- Framingham Heart Study
- MAF
- Minor allele frequency
- CBN
- Causal Bayesian network
- BDeu
- Bayesian Dirichlet equivalent uniform
- TOPMed
- The Trans-Omics for Precision Medicine
- IG
- Information gain
- GRS
- Genetic risk score
- AUROC or AUC
- area under ROC curve
- BIC
- Bayesian information criterion