ABSTRACT
Integrative methods, like colocalization and transcriptome-wide association studies (TWAS), identify transcriptomic mechanisms at only a small fraction of trait-associated genetic loci from genome-wide association studies (GWAS). Here, we show that a reliance on reference functional genomics panels of only total gene expression greatly contributes to this reduced discovery. This is particularly relevant for neuropsychiatric traits, as the brain expresses extensive, complex, and unique alternative splicing patterns giving rise to multiple genetically-regulated transcript-isoforms per gene.
We introduce isoTWAS, a scalable, multivariate framework to integrate genetics, isoform-level expression, and phenotypic associations in a step-wise testing framework. Multivariate predictive models were trained using isoform-level expression across tissues from the Genotype-Tissue Expression Project and in the developing and adult human brain from PsychENCODE. Across five neuropsychiatric traits, isoTWAS dramatically increased discovery of trait associations within GWAS loci, capturing 92 unique loci compared with 27 using gene-level TWAS. These power gains reflected a ∼2-fold increase in the number of testable genes, an ∼15-35% increase in total gene expression prediction accuracy, and the ability to jointly capture expression and splicing mechanisms. Results from extensive simulations showed no increase in false discovery rate and reinforce isoTWAS’s advantages in prediction and trait mapping power over TWAS, especially when genetic effects on expression vary across isoforms of the same gene. We illustrate multiple biologically-relevant isoTWAS-identified trait associations undetectable by gene-level methods, including isoforms of AKT3, GIGYF2, and KMT5A with schizophrenia risk.
The isoTWAS framework addresses an unmet need to consider the transcriptome on the transcript-isoform level to maximize discovery of trait associations, especially for brain-relevant traits.
INTRODUCTION
Over the past decade, the number of genetic associations with complex traits identified by genome-wide association studies (GWAS) has increased considerably1,2. However, the slow translation of these genetic associations into concrete molecular mechanisms remains a major obstacle. As GWAS associations predominately localize within non-coding regions of the genome and are often tagged within large blocks of linkage disequilibrium (LD), the first major challenge is prioritizing the underlying causal variant(s) within a given locus and identifying their functional impact on nearby target genes. To address this, numerous methods have been developed to integrate transcriptomic reference panels with GWAS to prioritize genes at trait-associated loci3–15. TWAS and related approaches (e.g., PrediXcan) impute the component of gene expression predicted by germline genetics into an association cohort, thereby reducing multiple comparisons and increasing interpretability by identifying a set of genes at a locus that may underlie the genetic association3,4. Despite these methodological advances, a majority of GWAS loci still lack robust functional interpretation16.
Previous integrative analyses have largely focused on aggregated measurements of total gene expression but have not systematically explored the potentially large number of distinct transcript-isoforms that can be generated from a given genetic locus through alternative splicing. Alternative splicing is a fundamental form of tissue-specific gene regulation present in more than 90% of human genes, that vastly expands the coding and regulatory potential of the genome17–20. Genes uniquely expressed in the brain, which are longer and contain far more exons, undergo the greatest degree of splicing compared with other tissues and species—a mechanism contributing to the vast proteomic, phenotypic, and evolutionary complexity of the human brain21–24. Independent of gene expression, splicing dysregulation has been increasingly implicated as a putative disease mechanism, 23,25–29. including for neuropsychiatric disorders10,23,26. Yet, local splicing events can be computationally intensive to measure, are difficult to systematically integrate across multiple distinct large-scale datasets, and often coordinated across a gene, yielding many non-independent features that increases multiple testing burden. In contrast, transcript-isoform abundance can now be rapidly estimated across large-scale RNA-seq datasets using pseudoalignment methods30,31. Furthermore, in the brain, isoform-level expression changes have shown greater enrichment for schizophrenia heritability than gene or local splicing changes21,32.
Here, we present isoform-level TWAS (isoTWAS), a flexible and scalable approach for complex trait mapping by integrating genetic effects on isoform-level expression with GWAS. In extensive simulations and real data applications in the Genotype-Tissue Expression (GTEx) Project33 and the PsychENCODE Consortium21,23, we show that isoTWAS provides several important advantages to trait mapping. First, in the transcriptomic prediction step, the correlation between isoforms provides additional information that is unavailable when only gene-level expression is measured, which can be leveraged to improve prediction34 accuracy of total gene expression by 15-35%. In parallel, our isoform-centric framework uncovers heritable predictive models for nearly 2-fold more genes, doubling the number of testable features in the trait mapping step. Third, divergent patterns of genetic effects across isoforms can be leveraged to provide a more granular hypothesis for a mechanism underlying the SNP-trait relationship. Finally, the isoTWAS framework jointly captures expression and splicing disease mechanisms, while maintaining a well-controlled false discovery rate. Altogether, using data from 5 neuropsychiatric traits, isoTWAS substantially increases the power to detect gene-level trait associations uncovering associations at >3 fold more GWAS loci compared to traditional gene-level TWAS. These results stress the need to shift focus to transcript-isoforms to maximize discovery of transcriptomic mechanisms underlying genetic associations with complex traits.
RESULTS
The isoTWAS pipeline
The isoTWAS framework prioritizes transcript-isoforms of genes in which the cis-genetic component of expression is significantly associated with a complex trait. Expression of different isoform transcripts of a gene can be modelled as a multivariate data object, with, on average, four GENCODE v40-annotated isoforms per gene (mean 4.05, standard deviation 7.28, median 1), with strong pair-wise correlations between these isoforms33,35 Here, we assume that local genetic variants directly modulate expression of an isoform. In addition, we assume that the abundance of a gene (total gene expression) is the sum of the abundance of its isoforms (isoform expression) (Supplemental Figure S1), on the raw count scale. Flexibly integrating isoform-level expression into trait mapping may lead to novel discoveries in disease mapping and prioritize specific isoforms that explain genetic associations. Accordingly, gene-level trait mapping using traditional TWAS methods may not necessarily detect a trait association on the gene-level if a gene has multiple isoforms but only one is associated with the trait (Figure 1a). This scenario may be particularly relevant in the human brain, where some genes may express up to hundreds to thousands of unique isoforms24. By modeling the genetic architectures of isoforms of a gene simultaneously, isoTWAS provides a deeper understanding of potential transcriptomic mechanisms that underlie genetic associations.
The isoTWAS pipeline contains three general steps. First, we build multivariate predictive models of isoform-level expression using well-powered functional genomics training datasets, such as GTEx33 and PsychENCODE23,26. Here, we trained and systematically compared 6 multivariate predictive frameworks: (1) multivariate elastic net penalized regression36, (2) multivariate LASSO penalized regression with simultaneous covariance estimation37, (3) multivariate Sum of Single Effects (SuSiE)38, (4) Breiman and Friedman’s curds-and-whey procedure34, (5) ordinary least squares with clustered standard errors39, and (6) univariate regularized regressions of each individual isoform, as implemented in Gusev et al’s FUSION software4,36,38,40 (see Methods and Supplemental Methods). Models were trained to predict the expression of heritable isoforms using the set of cis-SNPs within 1 Megabase (Mb) of the gene body (Methods, Figure 1b). Model performance was assessed via 5-fold cross-validation, using McNemar’s adjusted R2 between observed and predicted expression.
Next, we use these models to impute isoform expression into an external GWAS cohort and quantify the association between imputed isoform expression and the target GWAS phenotype (Figure 1b). If individual-level genotypes are available, isoform expression can be directly imputed as a linear combination of the SNPs in the models, and these associations can be estimated through appropriate regression analyses. If only GWAS summary statistics are available, imputation and association testing is conducted simultaneously through a weighted burden test4. isoTWAS uses a two-step hypothesis testing procedure to account for multiple comparisons: isoform associations are first aggregated to the gene-level to prioritize a gene using the Aggregated Cauchy Association Test (ACAT)41, where false discovery rates are controlled, and then individual isoforms of prioritized genes are subjected to post-hoc family-wise error control42 (Supplemental Figure S2, Methods). After this step, a set of isoforms are identified whose cis-genetic component of expression are associated with the trait of interest4. For these isoforms, we apply a rigorous permutation test whereby the SNP-to-isoform effects are permuted 10,000 times to generate a null distribution; this permutation test assesses how much signal is added by isoform expression, given the GWAS architecture of the locus, and controls for large LD blocks4. Lastly, to identify a set of isoforms that explains the genetic association with the trait at the locus, probabilistic isoform-level fine-mapping can be applied (Figure 1b, Methods).
By considering isoforms of a gene as a multivariate data structure and aggregating isoform-level tests to the gene-level, isoTWAS boosts both prediction of gene expression and power to detect trait associations, especially when genetic effects on isoforms are different and isoforms have different effects on a phenotype. Not only does isoTWAS provide substantial power gains, as we show through simulations and empiric data, this last step also provides a testable hypothesis for the mechanism underlying a trait association (Figure 1b). isoTWAS is available as an R package through Github (https://github.com/bhattacharya-a-bt/isotwas).
isoTWAS improves prediction of total gene expression in real data
Previous work has demonstrated that isoform-level quantifications, when propagated to the gene-level, can lead to more accurate gene expression estimates and differential expression results43,44. Using GTEx data from 13 brain tissues and 35 other tissues with sufficient sample sizes (N > 100, Supplemental Table S1; Methods), we demonstrate that, by accounting for isoform-level genetic architecture, isoTWAS increases prediction of total gene expression over TWAS by a median of 15-35%, both in- and out-sample (Figure 2). As the trait mapping step in TWAS and isoTWAS only tests models with a baseline prediction cross-validation R2 > 0.01, this increase in accuracy leads to nearly 2-fold more testable gene models under the isoTWAS framework. Multivariate SuSiE provided the models with best prediction of isoform expression in 40-55% of models (Supplemental Figure S3), with multivariate regression with simultaneous covariate estimation showing the best prediction in 20-30% of models, underscoring the utility of modeling the covariance structure between isoforms to improve prediction.
The advantage for isoTWAS starts at the feature selection step, in which imputation models are filtered based on nominally significant positive cis-heritability (P < 0.05, Methods). Across the 13 brain tissues, there are nearly 3 to 4-fold more the number of genes with at least 1 cis-heritable isoform, compared to the number of genes that are cis-heritable themselves (mean of 6331 genes with at least 1 cis-heritable isoform vs. 1831 genes that are cis-heritable across 13 tissues, Figure 2b). When we consider genes that pass the cis-heritability feature selection cutoff, we trained twice as many isoTWAS models as TWAS models with CV R2 ≥ 0.01 (Figure 2b). These trends extend for nearly all other tissues as well, with greater than 2.5-fold more cis-heritable and predictable genes on the isoform-level than total gene expression level across GTEx tissues (Supplemental Figure S4-5, Supplemental Table S2). The median ratio of numbers of heritable isoforms to genes was significantly larger for brain tissues compared to others (3.40 vs 3.08, respectively; P = 0.004), when accounting for sample size (Figure 2e; Supplemental Figure S8). Notably, the least complex isoform architecture was exhibited by whole blood samples, with only a 2:1 ratio of cis-heritable isoforms to genes. These results may reflect the more complex splicing patterns in the brain compared to other tissues and suggest that blood-based gene imputation panels may generalize poorly to the complex gene regulatory mechanisms present in the brain.
Even when restricting to the subset of genes with corresponding models at both gene and isoform levels, 60-75% of isoTWAS models showed better prediction of total gene expression than the corresponding TWAS model (Figure 2c, Supplemental Figure S6-7, Supplemental Table S3-4). Similarly, across all GTEx tissues, isoTWAS shows considerable improvements in prediction, even when considering tissues not derived from the brain (Supplemental Figure S6-7. Supplemental Table S3-4). In a completely independent eQTL reference panel derived from adult frontal cortex brain samples from the PsychENCODE Consortium (N = 850), multivariate isoTWAS models for adult frontal cortex trained in GTEx also outperformed univariate TWAS models in predicting total gene expression (median percent increase in adjusted R2 of 21.9%; Figure 2d, Supplemental Table S5-6).
In sum, as predictive performance is positively related to power to detect trait associations45, both the increased number and accuracy of trainable imputation models using isoTWAS’s multivariate predictive framework will yield substantially improved discovery in trait mapping. Altogether, for 48 tissues in GTEx, we built significant predictive models for 26,000 to 33,000 isoforms across 5,000 to 6,000 unique genes per tissue (Supplemental Figure S9). Our results underscore that multivariate methods that incorporate the correlation between isoforms provide utility in prediction of both total gene and individual isoform expression, with strong implications on increased discovery in trait mapping45.
isoTWAS increases discovery of trait associations across 5 neuropsychiatric disorders
To directly test our central hypothesis that isoform-centric multivariate predictive modeling will yield substantial power gains in complex trait mapping, particularly in brain tissues for relevant neuropsychiatric traits, we next directly compared isoTWAS and TWAS discovery across 5 neuropsychiatric GWAS results using models trained in larger functional reference panels derived from adult26 and fetal23 cortex from the PsychENCODE Consortium (Methods, Supplemental Figure S10). Our results demonstrate that isoTWAS increases discovery of trait associations without clear indication of elevated test statistic inflation, especially at genetic loci that house trait-associated genetic variants detected by GWAS (Figure 3).
Using PsychENCODE data, we trained 7,530/3,402 models using isoTWAS/TWAS in adult cortex and 8,331/2,112 isoTWAS/TWAS models for fetal cortex, passing the heritability and cross-validation R2 cutoffs mentioned previously (Methods; Supplemental Tables S5-6). We next integrated these models with 5 neuropsychiatric GWAS (Methods, Figure 3a): SCZ46, BD47, ALZ48, ASD49, and ADHD50. At a conservative permutation-based significance thresholds, we detected far more trait associations with isoform-level mapping than gene-level (Supplemental Figure S11, Supplemental Data 1-4). Specifically, across all 5 neuropsychiatric GWAS and both adult and fetal tissue, isoTWAS identified significant associations within 92 unique GWAS loci compared with 27 detected by TWAS - nearly a 3.5-fold increase in discovery (Figure 3b; Methods). For SCZ, out of the 284 GWAS loci that have been identified to date46, isoTWAS prioritizes genes within 81 unique loci across fetal and adult reference panels, compared with 22 from gene-level TWAS of which 19 were jointly identified. Over 89% (24/27) of gene-level TWAS associations within these GWAS loci were concordantly identified by isoTWAS. The standardized effect sizes across gene- and isoform-level mapping (for associations at nominal P < 0.05) are also positively correlated (r = 0.90, P < 2.2 × 10−16), with only 7.8% (103 out of 1220) of these effect sizes in opposite directions (Figure 3c). These results emphasize that isoTWAS recovers a large majority TWAS associations and significantly increases the GWAS signal that can be potentially attributed to underlying transcriptomic mechanisms. For example, out of the 284 GWAS loci that have been identified for SCZ46, we find overlapping isoTWAS-identified genes for 46 and 52 loci in adult and fetal brain, respectively. These isoTWAS associations also recover 16 of 17 and 7 of 9 GWAS-overlapping TWAS associations from adult and fetal brain, respectively.
We next sought to ensure that this increase in trait mapping discovery reflected true biological signal rather than test statistic inflation due to the increased number of tests (approximately 7 to 8-fold increase in number of tests). Quantile-quantile plots of trait association gene-level P-values across isoTWAS and TWAS do not show any apparent differences in test statistic inflation (Supplemental Figure S12). Through empirical Bayesian methods51, we estimated inflation in gene-level test statistics across TWAS and isoTWAS with no marked differences in the 95% credible intervals for test statistic inflation across the 5 traits. In fact, the posterior median for estimated test statistic inflation for isoTWAS were generally smaller than for TWAS (Supplemental Figure S12, Supplemental Table S7). Using a heuristic to estimate increases in effective sample size (Methods), we observed an approximate increase in effective sample size of 15-35% when using isoTWAS compared to TWAS (Figure 2d, Supplemental Table S7). These analyses indicate that isoTWAS discovery is well calibrated in real data compared with TWAS
Lastly, in our comparison of methods in real data, we compared discovery using isoTWAS to discovery using splicing-event level trait mapping. Briefly, for the fetal brain cohort, we first calculated intron usage using Leafcutter52 and transformed these usage percentages to M-values53. Then, for all introns mapped to a given gene, we used all SNPs within 1 Mb of a splicing-event to predict its usage and mapped trait associations for these splicing events using isoTWAS’s multivariate framework (Methods). Overall, when aggregated to the gene-level, we found that isoTWAS discovers prioritizes features at far more independent GWAS loci (61 loci) than this splicing-event level trait mapping (31 loci), with 16 loci jointly identified (Figure 3e). This minimal overlap may suggest that these analyses could be used complementarily with careful interpretation of results, bridging annotation-based and -free quantifications of splicing patterns in the transcriptome. Taken together, isoTWAS’s specific focus on modeling isoforms of a gene provides important gains over considering only total gene expression or intron usage in identifying trait associations for genes and their transcripts.
Simulations support improved power and calibrated null of isoTWAS across genetic architectures
To systematically explore the effects of varied genetic architecture on the performance of isoTWAS and TWAS, we conducted extensive simulations using European-ancestry SNP data from the 1000 Genomes Project54 (Methods, Figure 4a, Supplemental Figure S14). In general, simulations reinforce results from real data: isoTWAS increases prediction of total gene expression and power to detect trait associations, especially when genetic effects on expression are different across isoforms of a gene. First, we assessed scenarios in which isoTWAS improves prediction of gene expression over TWAS. Similar to results from real data, the multivariate prediction models (particularly multivariate SuSiE38) provided the largest average adjusted R2 for isoform expression prediction across most simulation settings (Supplemental Figure S15, Supplemental Data 5). At sparser isoQTL architectures, the optimal isoTWAS model greatly outperforms the optimal TWAS model, with the 25% and 75% quantiles of adjusted R2 increasing by 1.3- and 14-fold (Figure 4b-c, Supplemental Figure S16-17, Supplemental Data 6). Performance gains decreased somewhat with denser isoQTL architectures, although real data reflects this 0.1-1% sparsity (i.e., 1-10 causal e- and isoQTLs per gene or isoform)33. Similar performance gains were observed from simulations in a second simulated test locus around XRN2 (Methods, Supplemental Figure S18, Supplemental Data 7).
We next introduced GWAS traits into our simulation framework to benchmark the power and false positive rate (FPR) of isoTWAS (Methods). First, we found that the FPR is controlled at 0.05 for isoform-level mapping using ACAT, similar to that of gene-level mapping using TWAS (Supplemental Figure S19, Supplemental Data 8). For a simulated trait, we modeled causal effect architectures for a genomic locus with two isoforms under three main scenarios (Methods; Figure 5, Supplemental Figure S20): (1) where the true trait effect is from total gene expression, (2) where there is only one isoform with a non-zero effect on the trait, called the ‘effect isoform’; and (3) where both isoforms are ‘effect’ isoforms, with varying magnitudes and directions of association. This first scenario shows clear increases in power for TWAS over isoTWAS, but this advantage of TWAS over isoTWAS decreases as the causal proportion of QTLs increases (Figure 5a, Supplemental Data 9). As effects on the trait varied for isoforms for the same genes (Figure 5b-c, Supplemental Data 10-11), we observed stark increases in power for isoTWAS over TWAS. Here, across most scenarios and causal effect architectures, isoTWAS demonstrated improved power compared with gene-level TWAS, particularly when only one isoform of a gene carries a trait effect or when two effect isoforms of the same gene have different directions of effects. Our rigorous simulations reinforce the prediction and power advantages of isoTWAS over TWAS, suggesting varied isoQTL architecture and isoform effects on traits for isoforms of the same gene as key features that may drive advantages we observed in real data.
isoTWAS identifies biologically-meaningful trait associations undetectable by TWAS
Lastly, we applied probabilistic isoform-level fine-mapping to genetic loci where isoTWAS prioritized multiple isoform-trait associations to estimate sets of isoforms that are most likely to explain the trait association at the genetic locus (Methods). The full number of gene-level and isoform-level trait associations at each step of the testing pipeline are summarized in Figure 6a. In total, forming 90% credible sets explain the association with the trait at the genetic locus, using adult frontal cortex isoform expression, we detected trait associations at 25 genes (26 isoforms) for ADHD, 19 genes (21 isoforms) for ALZ, 17 genes (18 isoforms) for ASD, 28 genes (28 isoforms) for BD, and 136 genes (136 isoforms) for SCZ. Using fetal cortex isoform expression, we detected trait associations at 98 genes (105 isoforms) for ADHD, 20 genes (21 isoforms) for ALZ, 4 genes (4 isoforms) for ASD, 22 genes (22 isoforms) for BD, and 65 genes (66 isoforms) for SCZ (Supplemental Figure S21-S25, Supplemental Tables S8-9). Overall, we find, for isoforms associated with one of the 5 traits at nominal P < 0.05 using both adult and fetal frontal cortex expression, there is only a moderate correlation in isoTWAS effect sizes (r = 0.63, P = 4.4 × 10−11, Figure 6b). This level of correlation may suggest that there is a difference in isoform-specific effects in the developing brain, compared to the adult brain, on these neuropsychiatric traits.
Across all 5 traits, isoTWAS detected 15 trait-associated genes using fetal frontal cortex models and 42 genes using adult cortex models with gnomAD-calculated probability of loss-of-function intolerance (pLI) score of ≥ 0.8 (Supplemental Tables S8-9), suggesting that these genes are particularly intolerant to protein-truncating variation55 and that isoTWAS detects biological-relevant signal due to its isoform-specific focus. Transcripts of many of these genes also house GWAS-significant variants within 10 kilobases, with effect sizes and 95% confidence intervals shown in Figure 6c. For example, we identified an association with SCZ with imputed fetal expression of ENST00000586144, an isoform of CXXC1 (P = 2.56 × 10−11 in isoTWAS, pLI = 0.90, chromosomal location of 18q21.1), with a GWAS-significant variant within 10 kilobases upstream of the gene’s transcription start site. CXXC1 belongs to the SET1-COMPASS complex, implicated in transcriptomic pathways underlying schizophrenia risk56. Another transcript whose imputed fetal brain expression was associated with SCZ risk is ENST00000511382, an isoform of the gene PDE4D (P = 2.4 × 10−8, pLI = 1, chromosomal location of 5q11.2-q12.1), with a GWAS SNP within the gene body (rs7701440). A previous GWAS has implicated PDE4D in associations with SCZ risk57, but our analysis directly ties the genetically-regulated component of its isoform ENST00000511382 to SCZ risk through probabilistic fine-mapping. Imputed adult brain expression of ENST00000492146 (P = 6.9 × 10−12, pLI = 0.94, chromosomal location of 3p21.1), an isoform of SFMBT1, showed a strong association with SCZ risk and contains a GWAS SNP within the gene body (rs2710339). In a recent gene-based analysis of GWAS data for SCZ and BD, decreased expression of SFMBT1 was associated with increased risk of both disorders58, consistent with the effects for the isoform we identify in this isoTWAS analysis. Lastly, overrepresentation analysis of isoTWAS-priortized genes for these 5 traits showed multiple relevant biological processes, cellular components, and pathways, at Bonferroni-adjusted P < 0.10: cation transportation, immune response ontologies, and plasma membrane and organelle organizations (Figure 6d). These results further emphasize that isoTWAS identifies sets of trait-associated genes with biological significance.
As evidenced by our simulation analyses, a main advantage of isoTWAS over TWAS is the identification of trait associations for isoforms of genes, where the gene itself is not associated with the trait. We illustrate a few examples of isoforms prioritized by isoTWAS, all using adult frontal cortex tissues, which show limited evidence of QTLs on the total gene expression level. First, we detected a strong SCZ association with ENST00000492957, an isoform of AKT3 (pLI = 1) at 1q43-q44, which encodes a serine/threonine-protein kinase that regulated many processes like metabolism and cell growth, proliferation, and survival. AKT3 has shown effects on anxiety, spatial and contextual memory, and fear extinction in mice, such that loss-of-function of AKT3 causes learning and memory deficits59,60. At this locus, we found a strong isoQTL signal (P < 10−15) that intersects GWAS associations at P < 5 × 10−8; however, the eQTL signal failed to even reach P < 10−4 (Figure 7a). The lead isoQTL rs2125232 shows a strong positive effect on ENST00000492957 but an insignificant positive association with AKT3, as the number of alternative alleles at the SNP increased. This isoQTL effect was in the opposite direction as the negative associations with SCZ from GWAS.
Similarly, we find a strong isoQTL signal for ENST00000373566 but a weak eQTL signal of GIGYF2 in the 2q37.1 locus (pLI = 1), in another association with SCZ (Figure 7b). GIGYF2 is a translation inhibitor, which, with the protein 4EHP, decreases translation of defective messenger RNAs to assist ribosome-associated quality control; decreased ribosomal quality is a hypothesized mechanism explaining neuropsychiatric disorders, as observed in both mice and humans with decreased GIGYF2 function61. Interestingly, the lead SNP rs2675972 has a positive effect on both isoform and a negative effect on SCZ. Lastly, our analysis implicated ENST00000537270, an isoform of KMT5A (pLI = 0.99, found in 12q24.31), in an association with SCZ risk (Figure 7c). There is a strong isoQTL signal at the locus with no significant eQTL signal; the lead isoQTL rs28577594 has a strong negative effect of the transcript-isoform expression and a protective effect on SCZ risk. KMT5A, a H4K20 methyltransferase62, has been previously implicated in GWAS for SCZ but cis-eQTLs of the gene have not be colocalized with the GWAS signal previously46,63.
Lastly, our isoTWAS analysis detected an association between ALZ and ENST00000252491, an isoform of APOC1 in the 19q13.32 locus, a gene implicated in ALZ risk64,65, with APOE and ACE, associated genes in the locus (Supplemental Figure S26). Interestingly, when we condition on APOE risk variants strongly associated with ALZ risk (rs7412, rs429358)48,66 using analyses highlighted in Gusev et al9,67, we find that the association between ENST00000252491 and ALZ persists (without adjustment for APOE variants: P = 1.9 × 10−7; with adjustment for APOE variants: P = 2.1 × 10−6). Despite limitations in these conditional analyses using GWAS summary statistics68, our results may suggest an independent signal in this locus with an underlying transcriptomic mechanism through ENST00000252491.
These SNP-by-SNP views into the GWAS, eQTL, and isoQTL signal at a locus demonstrate why isoTWAS gives a clearer picture of the mechanism at the locus over TWAS. Integration with total gene expression cannot detect an overlapping genetic, transcriptomic, and phenotypic signal, especially at these sample sizes. When we specifically model the isoform transcripts belonging to the same gene family, we detect the more nuanced isoform-specific signals that may underlie a genetic association.
DISCUSSION
In this work, we introduce isoTWAS, a scalable framework that integrates genetic and isoform-level transcriptomic variation with GWAS to not only identify genes whose expression are associated with a trait but also prioritize a set of isoforms of the gene that best explains the association. We provide all isoform-level predictive models (https://zenodo.org/record/679594769) and software to conduct isoform-level trait mapping with GWAS summary statistics (https://github.com/bhattacharya-a-bt/isoTWAS).
isoTWAS presents a number of advantages over TWAS. First, by modeling transcript expression on the isoform-level, it can detect isoQTL architecture that varies across isoforms and thus are not reflected in the eQTLs of a gene. This isoform-level approach to prediction showed improved predictive power both in simulated and real data. Second, by aggregating isoform-level associations to the gene-level through ACAT, isoTWAS showed substantially increased power to detect trait associations over traditional TWAS. We attribute this increase in power under several key features: first, isoform-level modeling in isoTWAS increases the number of imputable genes by >2-fold; second, isoTWAS models improve accuracy of gene-level prediction by as much as 35%; third, isoTWAS jointly models both expression and splicing regulation, thereby capturing multiple potential underlying complex trait mechanisms; finally, as genetic control of isoform expression and usage are often much more tissue-specific than eQTLs25,33, tissue-specific genetically-controlled transcriptomic effects on traits can be more accurately mapped.
Recent work has highlighted the promise of alternative splicing as a mechanism underlying complex trait biology not fully captured through eQTLs23,25,26,70. Splicing-QTL integration is a promising vehicle to prioritize alternative exons or splice sites that may explain the genetic association with a trait, but a single exon or splice site can correspond to multiple isoforms of the same gene. Mapping genetic regulation of transcriptomics at the exon-rather than gene-level often leads to more detected signal71. However, the overwhelming majority of these analyses have focused on splicing events or exon-level quantification of expression, rather than the transcriptomic consequences of these events: different isoforms of the same gene. Local splicing events can be computationally intensive to measure and are difficult to systematically integrate across multiple distinct large-scale datasets, which is necessary for achieving sample sizes sufficient for interrogation of population-level allelic effects26,27. Furthermore, multiple splicing changes are often coordinated across a gene, yielding a large number of non-independent features that increases multiple testing burden. Our results show that isoform-centric study using isoTWAS increases discovery compared to splice-site-based analysis, but these methods may recover some independent signal. Future work should consider integrating annotation-based and -free quantifications of the transcriptomic and splicing patterns to develop more nuanced mechanistic hypotheses for GWAS loci.
We conclude with some limitations of and future considerations for isoTWAS. First, we note that isoform-level expression quantifications are maximum-likelihood estimates, due to the inherent limitations of short-read RNA-seq. These estimates are generally guided by existing transcriptome annotations. As such, these estimates will be contingent on the completeness and accuracy of the annotations (e.g,. GENCODE). Further, many dataset specific sequencing factors may will affect the accuracy of these estimates, especially sequencing depth, read length and library preparation (mRNA vs total RNA sequencing). The rapid emergence of long-read sequencing technologies, including PacBio IsoSeq and Oxford Nanopore, will be instrumental for improving transcriptome annotations, particularly with tissue specificity. Further, as these methods become cost effective to be performed at population-scale datasets, they will replace the need for short-read sequencing and isoform estimation. Recent analyses have shown that larger sample sizes may outweigh sequencing coverage for eQTL mapping, but this same relationship has not been investigated for isoform-level expression72. Thus, for optimal discovery with isoTWAS, larger sample sizes or higher sequencing depth in the reference expression dataset may be imperative. Additionally, it is important to note that transcript quantification is heavily dependent on accurate genomic and transcriptomic annotations. As these annotations improve with more transcript discovery, the ability of isoTWAS to detect gene- and isoform-trait associations should improve.
Second, while inferential replicates from RNA-seq quantification can provide measures of technical variation, they are not incorporated into the predictive models. These inferential replicates are unlikely to affect the point estimates of SNP effects on isoforms but may aid in estimation of the precision of these SNP effects. A more flexible predictive model that estimates standard errors by model averaging across the replicate datasets may help with trait mapping by providing a prediction interval for isoform- and gene-level GReX. Third, just as TWAS can be cast as a differential gene expression analysis conducted with GReX, isoform-level trait mapping is akin to differential transcript expression analysis. An analogous goal of isoTWAS can be to detect trait associations with genetically-regulated transcript usage. However, it is unclear if the compositional nature of transcript usage data needs to accounted for at the prediction step or the trait mapping steps73. Lastly, this framework can suffer from reduced power and inflated false positives in the presence of SNP horizontal pleiotropy, where the genetic variants in the isoform expression model affect the trait, independent of isoform expression68,74. Probabilistic fine-mapping may reconcile any pleiotropy for SNPs shared across models for multiple isoforms at the same genetic locus. However, for pathways that are not observed or accounted for in the reference expression panel and GWAS, accounting for horizontal pleiotropy may improve trait mapping. However, as shown in previous analyses, summary-statistic based methods that control for horizontal pleiotropy are not yet effective75.
In sum, isoTWAS provides a novel framework to fully and scalably interrogate the transcriptomic mechanisms underlying genetic associations with complex traits and generate biologically-meaningful and testable hypotheses about disease risk mechanisms. Based on our results, we emphasize a shift in focus from quantifications of the transcriptome on the gene-level to the transcript-isoform level to maximize discovery of transcriptome-centric genetic associations with complex traits.
METHODS
isoTWAS consists of three steps: (1) training predictive models of isoform expression, (2) imputing Genetically-Regulated eXpression (GReX) into a separate GWAS panel, and (3) estimating the association between GReX and a phenotype (Figure 1b). isoTWAS contrasts with TWAS as it models correlations between the expression of isoforms of the same gene. We outline each step below, with further mathematical details in Supplemental Methods.
Training predictive models of isoform expression
Model and assumptions
Assume a given gene G has M isoforms with expression levels across N samples, with each sample having R inferential replicates. Let be the N × M matrix of mean isoform expression for the N samples and M isoforms, using the point estimates from the Expectation-Maximization algorithm of a pseudo-mapping quantification algorithm, like salmon or kallisto30,31. We can jointly model isoform expression with a system of N × M × R equations. For sample n ∈ {1, …, N}, isoform m ∈ {1, …, M} of gene G, and replicate r ∈ {1, …, R}, we have: where ynmr is the expression of isoform m for the rth inferential replicate of sample n, xn is the P-vector (vector of length P) of cis-genotypes in a 1 Mb window around gene G, βm is the P-vector of genetic effects of the P genotypes on isoform expression, and ∈nmr is normally distributed random noise with mean 0 and variance . We standardize both the genotypes and the isoform expression to mean 0 and variance 1. As the SNP vector xn does not differ across inferential replicates, we impose the following assumptions on the variance-covariance matrix of ∈ = (∈1,1,1, ∈1,1,2, …, ∈nmr)T, the vector of random errors: we assume that ∈nmr are independent and identically distributed across samples n ∈ {1, …, N} and identically distributed across replicates r ∈ {1, …, R}. Accordingly, the point estimates of the SNP effects on isoform expression are not influenced by differences in expression across replications. Therefore, in matrix form, we consider the following predictive model:
Here, XG is the N × P matrix of genotype dosages, BG is the P × M matrix of SNP effects on isoform expression and EG is a matrix of random errors, such that vec(EG) ∼ NNM(0, ∑ = Ω−1 ⊗ IN). ∑ represents the variance-covariance matrix in the errors (with precision matrix Ω = Σ−1), following the aforementioned independence assumptions.
Estimating SNP effects on isoform expression
We apply 6 different methods to estimate , the matrix of SNP effects on isoform expression. The matrix that gives the largest adjusted R2 in 5-fold cross-validation across the 6 methods is selected as the final model to predict isoform expression for a given gene. We provide an overview of these methods here, with mathematical details in Supplemental Methods:
Univariate FUSION: the simplest implemented method is the univariate predictive modelling used in FUSION2. We disregard the correlation structure between isoforms and train a univariate elastic net36, estimation of the best linear unbiased predictor (BLUP) in a linear mixed model76, and SuSiE38 predictive model for each isoform separately. The model with the largest adjusted R2 out of these three models is outputted.
Curds-and-whey (CW) procedure: Breiman and Friedman’s CW approach takes in univariate estimates of SNP effects and modifies the predicted values from those regressions by shrinking them using the canonical correlations between the isoform expression variables and the SNP dosages34. This process significantly reduces prediction error when there exist large correlations between responses.
Multivariate elastic net (MVEnet) regression: MVEnet is an extension of elastic net, where the response is a matrix of correlated responses36. Here, the absolute penalty is imposed on each single coefficient by a group-lasso penalty on each vector of SNP effects across isoforms (rows of BG); accordingly, a SNP can only have a non-zero effect on an isoform if it has a non-zero effect on all isoforms.
Multivariate LASSO Regression with covariance estimation (MRCE): We adapt Rothman et al’s proposed procedure to simultaneously and iteratively estimate both , the SNP effects matrix, and , the precision matrix37. This procedure accounts for the correlation between isoforms but does not impose the group-lasso penalty as in MVEnet. As such, a single SNP need not have a non-zero effect on all isoforms.
Multivariate SuSiE (mvSuSiE): mvSuSiE extends the SuSiE framework to a multivariate response38, providing SNP effect sizes on expression of each isoform and posterior inclusion probabilities for each SNP. These effect sizes can be incorporated in a predictive model of the isoform expression matrix. Again, mvSuSiE does not impose a group penalty, as in MVEnet.
SuSiE fine-mapping and ordinary least squares with clustered standard errors: Here, we use SuSiE to generate a 90% credible set of causal SNPs affect isoform expression38. We then fit a seemingly unrelated regression77 with a multivariate response using a design matrix that only includes SNPs found in these credible sets. We compute Liang-Zeger clustered standard errors for the SNP effects to account for inferential uncertainty across replicates39. These standard errors can be incorporated to calculate prediction uncertainty when imputing GReX in individual-level genotypes.
Step-wise association testing
We note that the tests of association in isoTWAS are similar to tests in differential transcript expression analyses, as TWAS tests of association are analogous to tests in differential gene expression analyses. isoTWAS and TWAS are distinct, as these methods consider imputed isoform and gene expression, respectively, as predicted by the trained expression models. If individual-level genotypes are available in the external GWAS panel, GReX can be directly imputed by multiplying the SNP weights from the predictive model with the genotype dosages in the GWAS panel. If only summary statistics are available, we adopt the weighted burden test from Gusev et al with an ancestry-matched linkage disequilibrium panel4,68. Compared to TWAS, a main distinction for isoTWAS association testing is the increased number of tests (approximately 4-fold the number of isoforms than genes)35 and the potentially correlation in test statistics for isoforms of the same gene.
Accordingly, we adopt methods from the stageR framework (Supplemental Figure S2)78. In the first step, for every isoform with a significant model, we generate a TWAS test statistic using either linear regression with direct GReX imputation for GWAS with individual-level genotypes or the weighted burden test for GWAS with only summary statistics4. Given the t test statistics T1, …, Tt for a single gene, we used an omnibus test to aggregate the t test statistics into a single P-value for a gene. We benchmark different omnibus tests in simulations, but the default omnibus test in isoTWAS is aggregated Cauchy association test (ACAT)41. We control for false discovery across all genes via the Bonferroni procedure, but the Benjamini-Hochberg procedure can also be applied for less conservative false discovery control. In the second step, for isoforms for genes with FDR-adjusted omnibus P < 0.05, we employ Shaffer’s modified sequentially rejective Bonferroni (MSRB) procedure to control the within-gene family-wide error rate. At the end of these two steps, we identify a set of genes and their isoforms that are associated with the trait.
Control for false positives within GWAS loci
In TWAS and related methods, association statistics have been shown to be well calibrated under the null of no GWAS association. However, within loci harboring significant GWAS signal, false positive associations can result when eQTLs and GWAS coincide within overlapping LD blocks. To address this, we have adopted two conservative approaches to control for type 1 error within GWAS loci, namely (1) permutation testing and (2) probabilistic fine mapping. The permutation testing approach, adopted from Gusev et al4, is a highly conservative test of the signal added by the SNP-transcript effects from the predictive models, conditional on the GWAS architecture of the locus. Briefly, we shuffle the SNP-transcript effects in the predictive models 10,000 times and generate a null distribution for the TWAS test statistic for each isoform. We can use the null distribution to generate a permutation P-value for the original test statistic for each isoform. Additionally, we can aggregate these null distributions using an omnibus test and compare the omnibus P-value to this distribution to generate permutation P-value for each gene. In addition, we employ gene-level probabilistic fine-mapping using methods from FOCUS79 to generate credible set of isoforms that explain the trait association at a locus. We only run isoform-level fine-mapping for significantly-associated isoforms in overlapping 1 Megabase windows.
GTEx processing and model training
We quantified GTEx v833 RNA-Seq samples for all 49 tissues using Salmon v1.5.230 in mapping-based mode. We first built a Salmon index for a decoy-aware transcriptome consisting of GENCODE v38 transcript sequences and the full GRCh38 reference genome as decoy sequences35. We then extracted reads from the available genome-aligned GTEx BAM files into FASTQ files and used the FASTQ files as input to Salmon with mapping validation and corrections for sequencing and GC bias. We estimated 50 inferential bootstraps for isoform expression using Salmon’s Expectation-Maximization algorithm. We then imported Salmon isoform-level quantifications and aggregate to the gene-level using tximeta44. Using edgeR, gene and isoform-level quantifications underwent TMM-normalization, followed by transformation into a log-space using the variance-stabilizing transformation using DESeq280,81. We then residualized isoform-level and gene-level expression (as log-transformed CPM) by all tissue-specific covariates (clinical, demographic, genotype PCs, and expression PEER factors) used in the original QTL analyses in GTEx.
SNP genotype calls were derived from Whole Genome Sequencing data for samples from individuals of European ancestry, filtering out SNPs that have less than 5% minor allele frequency and deviate from Hardy-Weinberg equilibrium at P < 10−5. We further filtered out SNPs with less than 1% minor allele frequency among the European ancestry samples in 1000 Genomes Project54.
Details of the model training pipeline for GTEx are similar to those summarized in Supplemental Figure S10. Before training genetics models of gene-level and isoform-level expression, we estimated the heritability of gene and isoform expression using all SNPs within 1 Mb with GCTA-GREML82,83. We only considered genes and isoforms with positive heritability estimates with nominal P < 0.05. We borrowed techniques from FUSION4 to train gene-level models using the most predictive, based on McNemar’s adjusted 5-fold cross-validation (CV) R2, of models trained using elastic net regression36, linear mixed modeling40, and SuSiE38. We selected only genes with CV R2 ≥ 0.01. We applied multivariate modeling outlined in isoTWAS to train isoform-level predictive models, selecting only those isoform models with CV R2 ≥ 0.01; we only used multivariate elastic net, multivariate SuSiE, univariate modeling across isoforms, and multivariate LASSO regression with covariance estimation, as these four methods showed far better predictive performance across all 6 methods (Supplemental Figure S2). For gene expression prediction in TWAS, we employed univariate elastic net, linear mixed modeling, and SuSiE, as in FUSION4,36,38,76, using all SNPs within 1 Mb of the gene body. All isoTWAS models generated are available at https://zenodo.org/record/679594769.
Adult and fetal PsychENCODE processing and model training
We quantified both adult (N = 850)26,27 and fetal frontal cortex23 (N = 205) RNA-Seq samples using Salmon v1.5.230 in mapping-based mode. We used the same Salmon indexed transcriptome used in the GTEx analysis. We then extracted reads from the available genome-aligned BAM files into FASTQ files and used the FASTQ files as input to Salmon with mapping validation and corrections for sequencing and GC bias. We estimated 50 inferential bootstraps for isoform expression using Salmon’s Expectation-Maximization algorithm. We then imported Salmon isoform-level quantifications and aggregate to the gene-level using tximeta44. Using edgeR, gene and isoform-level quantifications underwent TMM-normalization, followed by transformation into a log-space using the variance-stabilizing transformation using DESeq280,81. We then residualized isoform-level and gene-level expression (as log-transformed CPM) by covariates (age, sex, 10 genotype PCs, 90 and 70 hidden covariates with prior for fetal and adult brain, respectively) used in the original QTL analyses in Gandal et al and Walker et al.
Typed SNPs with non-zero alternative alleles, MAF > 1%, genotyping rate > 95%, HWE P < 10−6 were first imputed to TOPMed Freeze 5 using minimac4 and eagle v2.484,85. genotype calls were derived from Whole Genome We then filtered to SNPs that had imputation accuracy R2 > 0.8, are biallelic, and have rsIDs. Furthermore, we filtered out SNPS with MAF < 0.05 and deviate from Hardy-Weinberg equilibrium at P < 10−6.
Details of the model training pipeline are summarized in Supplemental Figure S10 and are equivalent to those used to train models in GTEx data. All isoTWAS models generated are available at https://zenodo.org/record/679594769.
Gene- and isoform-level trait mapping
We conducted gene- and isoform-level trait mapping for five neuropsychiatric traits: (1) schizophrenia (SCZ; Ncases = 69,369; Ncontrols = 236,642)46, (2) bipolar disorder (BD; Ncases = 41,917; Ncontrols = 371,549)47, (3) attention deficit hyperactivity disorder (ADHD; Ncases = 20,183; Ncontrols = 35,191)50, (4) autism spectrum disorder (ASD; Ncases = 18,381; Ncontrols = 27,969)49, and (5) Alzheimer’s disease (ALZ; Ncases = 90,338; Ncontrols = 1,036,225)48. For gene-level trait mapping, we used the weighted burden test, followed by the permutation test, as outlined by Gusev et al4. For isoform-level trait mapping, we used the stage-wise testing procedure outlined in the isoTWAS method. For isoforms, irrespective of their corresponding genes, passing both stage-wise tests and the permutation test, we employed isoform-level probabilistic fine-mapping using FOCUS with default parameters79.
We estimate the percent increase in effective sample size by employing the following heuristic. We convert gene-level association P-values into χ2 test statistics with 1 degree-of-freedom. For χ2 > 1, we then calculate the percent increase for isoTWAS-based associations versus TWAS-based associations. As the mean of the χ2 distribution is linearly related to power and sample size86, we can use this percent increase in test statistic as a measure of power or effective sample size.
Simulation framework
We adopt techniques from Mancuso et al’s twas_sim protocol79,87 to simulate multivariate isoform expression based on randomly simulated genotypes and environmental random noise. First, for n samples, we generate a matrix of genotype dosages for the SNPs within 1 Megabase of CACNA1E (1,107 total SNPs) using an LD reference panel from European samples from 1000 Genomes Project54. We use this ‘gene for these simulations as it displays interesting patterns of alternative isoform expression88; however, we note that the selection of this particular gene is ultimately arbitrary. We extended our simulations to the locus around XRN2, as well, with similar results.
Next, we generate a matrix of SNP-isoform effects across different causal SNP proportions pc, numbers of isoforms t, and “shared” or “different” SNP-isoform architectures. We define “shared SNP-isoform architecture as a setting where a SNP either has non-zero effects on all isoform or none at all, such as a case where a variant lies within a gene’s promoter and acts as a broad gene-level cis-eQTL. We define “different” SNP-isoform architecture as one in which variant(s) exhibit distinct (zero or non-zero) effects across different isoforms, such as the case where a variant is proximal to an alternative transcription start site shared only by a subset of isoforms. Lastly, we compare “sparse” (all off-diagonal elements are near 0) or “dense” (most or all off-diagonal elements are far from 0 and generally greater than 0.3-0.4) correlation structures for the isoform expression levels within a gene locus, using Joe’s C-vine method89,90. Using matched moments followed by scaling to ensure a set isoform expression heritability of , we can generate the random error matrix that ensures the desired correlation between isoforms. We generate 10,000 simulations for each configuration of the simulation parameters, varying n ∈ {200, 500}, pc ∈ {0.001,0.01,0.10}, , “shared” and different SNP-isoform effects, and “sparse” and “dense” correlation between isoforms. Full mathematical details are provided in Supplemental Methods and summarized in Figure 4a and Supplemental Figure S14.
We also generate traits under three distinct scenarios, each with 2 isoforms for a given gene:
Only gene-level expression has a non-zero effect on trait. Here, we sum the isoform expression to generate a simulated gene expression. We randomly simulate the effect size and scale the error to ensure trait heritability .
Only 1 isoform has a non-zero effect on the trait. Here, we generate a multivariate isoform expression matrix with 2 isoforms and scale the total gene expression value such that one isoform (called the effect isoform) makes up pg ∈ {0.10, 0.30, 0.50, 0.70, 0.90} proportion of total gene expression. We then generate effect size for one of the isoforms and scale the error to ensure trait heritability .
Two isoforms with different effects on traits. Here, we generate a multivariate isoform expression matrix with 2 isoforms that make up equal portions of the total gene expression. We then generate an effect size of α for one isoform and peα for the other isoform, such that pe ∈ {−1, −0.5, −0.2,0.2,0.5,1}. We then scale the error to ensure trait heritability .
Lastly, to estimate the approximate false positive rate (FPR), we followed the same simulation framework to generate eQTL data and GWAS data. In the GWAS data, however, we set the effect of gene- and isoform-level GReX to 0 to generate simulated trait under the null, where the gene and isoforms are not associated with the trait. We then estimate the FPR by calculating the proportion of gene-trait associations at P < 0.05 under this null across 20 sets of 1,000 simulated GWAS panels.
Data Availability
GTEx genetic, transcriptomic, and covariate data were obtained through dbGAP approval at accession number phs000424.v8.p2. Linkage disequilibrium reference data from the 1000 Genomes Project were obtained at this link: https://www.internationalgenome.org/data-portal/sample. GWAS summary statistics were obtained at the following links: ADHD (https://www.med.unc.edu/pgc/download-results/), ALZ (https://ctg.cncr.nl/software/summary_statistics/), ASD (https://www.med.unc.edu/pgc/download-results/), BD (https://www.med.unc.edu/pgc/download-results/), and SCZ (https://www.med.unc.edu/pgc/download-results/). The RNA-seq and genotype dataset from Walker et al is available at dbGAP with accession number phs001900. The subset of data from the CommonMind Consortium from the PsychENCODE Consortium is available at doi.org/10.7303/syn12080241. isoTWAS models for GTEx and PsychENCODE are available at https://zenodo.org/record/679594769.
https://www.internationalgenome.org/data-portal/sample
DATA AVAILABILITY
GTEx genetic, transcriptomic, and covariate data were obtained through dbGAP approval at accession number phs000424.v8.p2. Linkage disequilibrium reference data from the 1000 Genomes Project were obtained at this link: https://www.internationalgenome.org/data-portal/sample. GWAS summary statistics were obtained at the following links: ADHD (https://www.med.unc.edu/pgc/download-results/), ALZ (https://ctg.cncr.nl/software/summary_statistics/), ASD (https://www.med.unc.edu/pgc/download-results/), BD (https://www.med.unc.edu/pgc/download-results/), and SCZ (https://www.med.unc.edu/pgc/download-results/). The RNA-seq and genotype dataset from Walker et al is available at dbGAP with accession number phs001900. The subset of data from the CommonMind Consortium from the PsychENCODE Consortium is available at doi.org/10.7303/syn12080241. isoTWAS models for GTEx and PsychENCODE are available at https://zenodo.org/record/679594769.
CODE AVAILABILITY
isoTWAS is available as an R package at https://github.com/bhattacharya-a-bt/isotwas. Sample scripts for analyses are available at https://github.com/bhattacharya-a-bt/isotwas_manu_scripts.
SUPPLEMENTAL INFORMATION
Supplemental Methods
Supplemental Methods are provided in the attached SupplementalMethods.pdf.
Supplemental Figure Legends
Figure S1: Directed acyclic graph (DAG) illustrating causal assumptions in isoTWAS. We assume that the local genetic variants within 1 Megabase of a gene have direct effects on the expression of a gene G and its isoforms; these genetic effects need not be shared across isoforms and the gene. Furthermore, we assume that the abundance of a gene is the sum of abundances of its isoforms. Lastly, we assume that the isoform and gene need not affect the complex trait through the same pathway. We also acknowledge that genetic variants may have effects on the trait through pathways independent of gene and isoform expression.
Figure S2: Step-wise hypothesis testing in isoTWAS. First, isoform-trait associations are estimated either using appropriate regressions of imputed expression in individual-level GWAS or the weighted burden test in Gusev et al in GWAS summary statistics. Then, associations for isoforms of the same gene are aggregated to the gene-level using the Aggregated Cauchy Association Test (ACAT). These aggregated gene-level associations are adjusted for multiple testing burden to control the false discovery rate (FDR) using either a Bonferroni or Benjamini-Hochberg procedure. Lastly, for isoforms of genes that pass gene-level testing, we control the family-wide error rate (FWER) using Shaffer’s modified sequentially rejective procedure.
Figure S3: Across GTEx tissues (X-axis), proportion (Y-axis) of isoform expression models that show highest R2 using one of four models: MRCE, multivariate elastic net, multivariate SuSiE, univariate modeling.
Figure S4: Comparison of number of cis-heritable genes on at gene and isoform expression level in GTEx. Across all GTEx tissues (Y-axis), number (X-axis) of genes that are cis-heritable on the total gene expression level (red), with at least 1 isoform that is cis-heritable (blue).
Figure S5: Comparison of number of predictable genes on isoform expression and total gene expression level. Across all GTEx tissues (Y-axis), number (X-axis) of genes that are predicted at R2 > 0.01 on the total gene expression level (red) or with at least 1 isoform that is predicted at R2 > 0.01 (blue).
Figure S6: Comparison of distribution of total gene expression predictive R2 using TWAS and isoTWAS across all GTEx tissues. Across all GTEx tissues (Y-axis), boxplots of adjusted R2 of predicted total gene expression using gene-level TWAS models (red) and isoTWAS models (blue).
Figure S7: Difference in total gene expression predictive R2 using TWAS and isoTWAS across all GTEx tissues. Across all GTEx tissues (Y-axis), boxplots of difference in adjusted R2 of predicted total gene expression using gene-level TWAS models vs. isoTWAS models. The label shows the proportion of genes that is better predicted using isoTWAS models.
Figure S8: Boxplot of ratio of number of isoforms predicted at R2 > 0.01 using isoTWAS models over number of genes predicted at R2 > 0.01 using TWAS models across 13 brain tissues (purple) and 35 other tissues (brown) in GTEx.
Figure S9: Number (X-axis) of unique genes with at least one isoform (left) and unique isoforms (right) predicted with McNemar’s adjusted R2 ≥ 0.01 for each tissue (Y-axis) represented in GTEx.
Figure S10: Schematic diagram for analysis using adult and fetal frontal cortex data from PsychENCODE. Data sources for eQTL reference data, GWAS cohorts, and reference LD data are provided on the left (black). The full gene-level TWAS (red) and isoTWAS (blue) are summarized on the right.
Figure S11: Numbers of trait associations detected using TWAS (red), isoTWAS (blue), and both (green) across two testing thresholds. Using adult and fetal frontal cortex data (right margin), the number of trait associations detected at Bonferroni-adjusted P < 0.05 and permutation P < 0.05 (top margin) for 5 neuropsychiatric traits.
Figure S12: QQ-plots of gene-level P-values for gene-level TWAS (red) and isoTWAS (blue), with expected -log10 P-values on X-axis and observed -log10 P-values on Y-axis, across 5 neuropsychiatric traits. Estimates of test statistic inflation and 95% credible intervals using empirical Bayes methods from bacon.
Figure S13: Isoform-level trait mapping increases discovery of genetic association over splicing-event level trait mapping. (a) Number of gene-trait associations using splicing-event level TWAS (purple), isoTWAS (blue), or either (orange) at various testing thresholds, comparing genes with both splicing-event TWAS and isoTWAS models. (b) Number of gene-trait associations using splicing-event level TWAS (purple) and isoTWAS (blue) with a GWAS locus within 0.5 Mb.
Figure S14: Detailed schematic of comparison of prediction between TWAS and isoTWAS in GTEx data.
Figure S15: Comparison of predictive performance of 6 models implemented in isoTWAS. Across 1,000 simulations of 5 isoforms with isoform heritability set to 0.10 (a) and 0.25 (b). On the left, boxplots of adjusted R2 of prediction of isoform expression (Y-axis) across causal isoQTL proportion (X-axis). On the right, plot of proportion of simulations where the model has the maximum adjusted R2 (Y-axis) across causal isoQTLs proportion (X-axis).
Figure S16: IsoTWAS models predict gene expression with more accuracy than TWAS models. Boxplots of difference in adjusted R2 in predicting gene expression between isoTWAS and TWAS models from simulations with sample size 200 (compare with sample size 500 in Figure 2), where isoform and gene expression heritability are set to (a) 0.10 and (b) 0.25. Here, we sum predicted isoform expression from isoTWAS models to form predicted gene expression and compare to predicted gene expression from a TWAS model.
Figure S17: IsoTWAS models predict gene expression with more accuracy than TWAS models at lower expression heritability settings. Boxplots of difference in adjusted R2 in predicting gene expression between isoTWAS and TWAS models from simulations with sample size 200 where isoform and gene expression heritability are set to 0.05. Here, we sum predicted isoform expression from isoTWAS models to form predicted gene expression and compare to predicted gene expression from a TWAS model.
Figure S18: IsoTWAS models predict gene expression with more accuracy than TWAS models at simulated genetic locus around XRN2. Boxplots of difference in adjusted R2 in predicting gene expression between isoTWAS and TWAS models from simulations with sample size 200 where isoform and gene expression heritability are set to 0.05 (a), 0.10 (b), and 0.25 (c), using the 1 Megabase window around XRN2 as a reference. Here, we sum predicted isoform expression from isoTWAS models to form predicted gene expression and compare to predicted gene expression from a TWAS model.
Figure S19: IsoTWAS and gene-level TWAS show relatively similar false positive rates. Across 20 iterations of 1,000 simulations, boxplots of false positive rate to detect a gene-trait association using Cauchy-aggregated P-values of isoform-trait associations (red) and gene-level TWAS (blue). In these simulations, we simulate 200 and 5,000 samples in the eQTL and GWAS panels, 5 isoforms, and isoform and gene expression heritability of 0.10. We vary the causal isoQTLs proportion (pc, shown on X-axis), the QTL architecture (right margin), and correlation between isoforms (top margin). For each iteration, we simulate one eQTL panel and 1,000 GWAS panels where genetically-regulated gene and isoform expression have no effect on the trait. We calculate the false positive rate as the proportion of the 1,000 tests that give P > 0.05.
Figure S20: At low expression heritability settings, isoTWAS improves power to detect gene-trait associations, especially when genetic effects differ across isoforms. (a) Power to detect gene-trait association (proportion of tests with P < 2.5 × 10−6, Y-axis) across causal proportion of isoQTLs (X-axis), facetted by sparse and dense correlations between isoforms (right margin) and shared or different isoQTLs across isoforms (top margin). (b) Power to detect gene-trait association (proportion of tests with P < 2.5 × 10−6, Y-axis) across proportion of gene expression explained by effect isoform (X-axis), facetted by sparse and dense correlations between isoforms and shared or different isoQTLs across isoforms. (c) Power to detect gene-trait association (proportion of tests with P < 2.5 × 10−6, Y-axis) across ratio of effect sizes for 2 effect isoforms, facetted by sparse and dense correlations between isoforms and shared or different isoQTLs across isoforms. Across all plots, isoform and total gene expression heritability is set to 0.05 with 2 simulated isoforms.
Figure S21: Miami plot of isoTWAS isoform-trait associations for SCZ. Z-score of isoTWAS association between trait and isoform (Y-axis) across chromosomal position (X-axis). Points are colored by significance thresholds and labelled with the transcript is within 10 kilobases of a GWAS-significant SNP.
Figure S22: Miami plot of isoTWAS isoform-trait associations for BD. Z-score of isoTWAS association between trait and isoform (Y-axis) across chromosomal position (X-axis). Points are colored by significance thresholds and labelled with the transcript is within 10 kilobases of a GWAS-significant SNP.
Figure S23: Miami plot of isoTWAS isoform-trait associations for ASD. Z-score of isoTWAS association between trait and isoform (Y-axis) across chromosomal position (X-axis). Points are colored by significance thresholds and labelled with the transcript is within 10 kilobases of a GWAS-significant SNP.
Figure S24: Miami plot of isoTWAS isoform-trait associations for ALZ. Z-score of isoTWAS association between trait and isoform (Y-axis) across chromosomal position (X-axis). Points are colored by significance thresholds and labelled with the transcript is within 10 kilobases of a GWAS-significant SNP.
Figure S25: Miami plot of isoTWAS isoform-trait associations for ADHD. Z-score of isoTWAS association between trait and isoform (Y-axis) across chromosomal position (X-axis). Points are colored by significance thresholds and labelled with the transcript is within 10 kilobases of a GWAS-significant SNP.
Figure S26: Colocalization plots for ALZ, APOC1 total gene eQTLs, and ENST00000252491 isoQTLs. (Top) Manhattan plots of ALZ risk GWAS, APOC1 total gene eQTLs, and ENST00000252491 isoQTLs, colored by LD to the lead isoQTL and with the lead isoQTL shown in a triangle and labelled. (Bottom) Boxplots of gene and isoform expression by genotype of the lead isoQTL SNP on the left and forest plot of the lead isoQTL’s effect and adjusted 95% confidence interval on the trait (black), gene (red), and isoform (blue), with P-values.
Supplemental Tables
Table S1: Sample size, source, and tissues for functional genomics reference panels
Table S2: Number of genes/models that pass heritability and cross-validation prediction cutoffs using TWAS and isoTWAS feature selection criteria and prediction methods
Table S3: Distribution of difference in predictive cross-validation R2 of observed total gene expression vs. predicted total gene expression (isoTWAS -TWAS)
Table S4: Distribution of percent difference in predictive cross-validation R2 of observed total gene expression vs. predicted total gene expression (isoTWAS -TWAS)
Table S5: Distribution of difference in external predictive R2 of observed total gene expression vs. predicted total gene expression using completely independent training and test sets (isoTWAS -TWAS)
Table S6: Distribution of percent difference in external predictive R2 of observed total gene expression vs. predicted total gene expression using completely independent training and test sets (isoTWAS -TWAS)
Table S7: Measures of test statistic inflation across isoTWAS and TWAS and increases in effective sample size for isoTWAS over TWAS using PsychENCODE expression data and across 5 neuropsychiatric traits
Table S8: isoTWAS results using fetal frontal cortex expression across 5 neuropsychiatric traits. Only isoform transcripts that are found in 90% credible intervals using FOCUS are shown
Table S9: isoTWAS results using adult frontal cortex expression across 5 neuropsychiatric traits. Only isoform transcripts that are found in 90% credible intervals using FOCUS are shown
Supplemental Data
Data 1: Predictive performance across all 6 predictive methods in isoTWAS using CACNA1E locus and across multiple simulation parameters
Data 2: Difference in predictive performance (isoTWAS - TWAS) using CACNA1E locus and across multiple simulation parameters
Data 3: Difference in predictive performance (isoTWAS - TWAS) using XRN2 locus and across multiple simulation parameters
Data 4: False positive rates across 20 simulations of 1,000 iterations across gene-level TWAS, ACAT, and minimum P aggregation of isoform associations using CACNA1E locus and across multiple simulation parameters
Data 5: Power to detect trait association at P < 2.5 × 10−6 across 1,000 simulations using TWAS and isoTWAS (ACAT) across various simulations. These simulations are under Scenario 1 in Figure 3a (gene has a true effect on the trait, but none of the isoforms have a true effect on the trait)
Data 6: Power to detect trait association at P < 2.5 × 10−6 across 1,000 simulations using TWAS and isoTWAS (ACAT) across various simulations. These simulations are under Scenario 2 in Figure 3b (a gene has multiple isoforms, only one has an effect on the trait, and we vary the usage of this effect isoform)
Data 7: Power to detect trait association at P < 2.5 × 10−6 across 1,000 simulations using TWAS and isoTWAS (ACAT) across various simulations. These simulations are under Scenario 3 in Figure 3c (a gene has two isoforms with differing effects on the trait, and we vary the effect size of one of the isoforms)
Data 8: Raw isoTWAS results across 5 neuropsychiatric traits using adult prefrontal cortex expression models
Data 9: Raw TWAS results across 5 neuropsychiatric traits using adult prefrontal cortex expression models
Data 10: Raw isoTWAS results across 5 neuropsychiatric traits using fetal prefrontal cortex expression models
Data 11: Raw TWAS results across 5 neuropsychiatric traits using fetal prefrontal cortex expression models
AUTHOR INFORMATION
Contributions
Conceptualization: AB, BP, MJG; Methodology: AB, BP, MJG; Software: AB; Validation: AB; Formal analysis: AB, MK, CW, CJ; Investigation: AB, MJG; Resources: BP, MJG; Data curation: AB, MK, CW, CJ, MJG; Writing – original draft: AB, MJG; Writing – reviewing and editing: AB, MK, CW, CJ, BP, MJG; Visualization: AB; Supervision: BP, MJG; Project administration: BP, MJG; Funding acquisition: BP, MJG
Correspondence
Please direct all correspondence to Arjun Bhattacharya (abtbhatt{at}ucla.edu).
ETHICS DECLARATIONS
Competing Interests
The authors declare no competing interests.
ACKNOWLEDGEMENTS
We thank Kangcheng Hou, Tommer Schwarz, Pan Zhang, Leanna Hernandez, Harold Pimentel, Mike Love, and Achal Patel for engaging discussion during the research process. We thank Kanishka Patel for her aesthetic advice for figures. We thank the Psychiatric Genomics Consortium and Complex Trait Genomics Lab for their publicly available GWAS summary statistics.
BP was partially supported by NIH awards R01 HG009120, R01 MH115676, R01 CA251555, R01 AI153827, R01 HG006399, R01 CA244670, and U01 HG011715. MJG was supported by SFARI Bridge to Independence Award, NIMH R01-MH121521, NIMH R01-MH123922, and NICHD-P50-HD103557.
Footnotes
Typo fixed in abstract
REFERENCES
- 1.↵
- 2.↵
- 3.↵
- 4.↵
- 5.
- 6.
- 7.
- 8.
- 9.↵
- 10.↵
- 11.
- 12.
- 13.
- 14.
- 15.↵
- 16.↵
- 17.↵
- 18.
- 19.
- 20.↵
- 21.↵
- 22.
- 23.↵
- 24.↵
- 25.↵
- 26.↵
- 27.↵
- 28.
- 29.↵
- 30.↵
- 31.↵
- 32.↵
- 33.↵
- 34.↵
- 35.↵
- 36.↵
- 37.↵
- 38.↵
- 39.↵
- 40.↵
- 41.↵
- 42.↵
- 43.↵
- 44.↵
- 45.↵
- 46.↵
- 47.↵
- 48.↵
- 49.↵
- 50.↵
- 51.↵
- 52.↵
- 53.↵
- 54.↵
- 55.↵
- 56.↵
- 57.↵
- 58.↵
- 59.↵
- 60.↵
- 61.↵
- 62.↵
- 63.↵
- 64.↵
- 65.↵
- 66.↵
- 67.↵
- 68.↵
- 69.↵
- 70.↵
- 71.↵
- 72.↵
- 73.↵
- 74.↵
- 75.↵
- 76.↵
- 77.↵
- 78.↵
- 79.↵
- 80.↵
- 81.↵
- 82.↵
- 83.↵
- 84.↵
- 85.↵
- 86.↵
- 87.↵
- 88.↵
- 89.↵
- 90.↵