Abstract
Large scale genetic association studies have identified many trait-associated variants and understanding the role of these variants in downstream regulation of gene-expressions can uncover important mediating biological mechanisms. In this study, we propose Aggregative tRans assoCiation to detect pHenotype specIfic gEne-sets (ARCHIE), as a method to establish links between sets of known genetic variants associated with a trait and sets of co-regulated gene-expressions through trans associations. ARCHIE employs sparse canonical correlation analysis based on summary statistics from trans-eQTL mapping and genotype and expression correlation matrices constructed from external data sources. We propose a resampling based procedure to test for significant trait-specific trans-association patterns in the background of highly polygenic regulation of gene-expression. By applying ARCHIE to available trans-eQTL summary statistics reported by the eQTLGen consortium, we identify 71 gene networks which have significant evidence of trans-association with groups of known genetic variants across 29 complex traits. A majority (50.7%) of the genes do not have any strong trans-associations and could not have been detected by standard trans-eQTL mapping. We provide further evidence for causal basis of the target genes through a series of follow-up analyses. These results show ARCHIE is a powerful tool for identifying sets of genes whose trans regulation may be related to specific complex traits.
Introduction
Genome-wide association studies (GWAS) have identified tens of thousands of common variants associated with a variety of complex traits 1 and a majority of these identified trait-related variants are in the non-coding regions of the genome2–4. It has been shown that these GWAS identified variants have a substantial overlap with variants that are associated with the expression levels of genes (eQTL) 5–7. A number of tools 8–11 have been developed to identify potential target genes through which genetic associations may be mediated by investigating the effect of variants on local genes (cis-eQTL), typically within 1Mb region around the variant, underlying causal interpretation remains complicated due to linkage disequilibrium and pleiotropy. A recent study has shown that a modest fraction of trait-heritability can be explained cis-mediated bulk gene-expressions 12, but future studies with more cell-type specific information has the potential to explain further.
Compared to cis-eQTL, studies of trans-eQTL have received much less attention for illuminating causal mechanisms underlying GWAS identified loci. However, they have the potential to illuminate downstream genes and pathways that would shed light on disease mechanism. A major challenge has been the limited statistical power for detection of trans-eQTL effects due to much weaker effects of SNPs on expressions of distal genes compared to those in cis-regions and a very large burden of multiple testing. However, trans-effects, when detected, has been shown to be more likely to have tissue-specific effects13,14 and are more enriched than cis-eQTLs among disease loci15. Trans-eQTLs are, in general, known to act on regulatory circuits governing broader groups of genes16 and thus have the potential to uncover gene networks and pathways consequential to complex traits 17,18. Limited studies of trans-eQTL effect of known GWAS loci have identified complex downstream effects on known consequential genes for diseases15,19. In fact, an “omnigenic” model of complex traits has been hypothesized under which a large majority of genetic associations is mediated by cascading trans-effects on a few “core genes” 20,21.
In this article, we propose a novel method based on a sparse canonical correlation analysis (sCCA)22–24 framework, termed Aggregative tRans assoCiation to detect pHenotype specIfic gEne-sets (ARCHIE), for detecting trans-effects of groups of GWAS SNPs associated with a trait on sets of genetically co-regulated genes. Using summary statistics from standard SNP-gene expression trans-eQTL mapping, estimates of linkage disequilibrium (LD) between the variants and co-expression between genes, we select sets distal target genes (termed gene component) that are associated with a group of the trait-related variants (termed variant component). Together, the selected variants and genes (jointly termed ARCHIE components) reflect significant trait-specific patterns of trans-association (Figure 1A shows an illustration for the functionality of ARCHIE).
Compared to standard trans-eQTL mapping, the proposed method improves power for detection of signals by aggregating multiple trans-association signals across GWAS loci and genes. We propose a resampling-based method to assess the statistical significance of the top components (top sparse canonical correlation values) resulting from sCCA for testing enrichment of trait-specific signals in the background of broader genome-wide trans-associations. If multiple ARCHIE components are significant, they reflect approximately orthogonal patterns of trans-associations for the trait-related variants, with the selected target genes pertaining to distinct downstream mechanisms of trans-regulation. Finally, we propose a novel way of independently supporting the results based on an analysis of trait heritability explained by the identified target genes compared to random genes. We apply the proposed method to analyze large-scale trans-association summary statistics for SNPs associated with 29 traits reported by the eQTLGen consortium19. The results show that ARCHIE can identify trait-specific patterns of trans-associations and relevant sets of variants and co-regulated target genes. The majority (50.7%) of the detected target genes are novel, meaning they would not have been identified by standard trans-eQTL mapping alone. We provide independent evidence supporting our results, using a series of downstream analysis to show that the selected target genes are enriched in known trait-related pathways and define directions of associations for the SNPs that are more enriched for underlying trait heritability than expected by chance. Further we show an example that the identified trans-associations can be explained using biological cis-mediation mechanisms as well.
Results
Trait-specific patterns of trans-associations in Whole Blood
For each of 29 traits, we applied ARCHIE to the trans-eQTL summary statistics for the set of GWAS loci identified for that trait and tested in eQTLGen, across all genes and select the trait-specific target genes via the significant gene components (See Methods and Figure 1B for analysis details). On average, across these traits, we detect 2 (max = 7 for “Height”) significant sets of variant and gene components capturing phenotype-specific trans-association patterns (Figure 1C). Of the target genes selected by ARCHIE in the significant gene-components for each trait, only 49.3% genes displayed a strong association in standard analysis (trans-eQTL p-value < 1 × 10-06 reported in eQTLGen) with any variant associated to that traits. The remaining 50.7% genes (termed “novel genes”) harbors only weaker (0.05 > p-value > 1 × 10-06) associations and hence cannot be detected by standard trans-eQTL mapping alone; these genes display a similar pattern of trans-association with corresponding selected trait-related variants and are detectable via the significant ARCHIE components. We made the list of target genes and variants selected by ARCHIE for each phenotype publicly available through an openly accessible database (See URL). Here, we focus on results for three different phenotypes their corresponding trans- association patterns, the selected target gene-sets and the novel genes detected by ARCHIE.
Schizophrenia
Schizophrenia is a neuropsychiatric disorder that affects perception and cognition. The eQTLGen consortium reports complete (non-missing) trans-association statistics for 218 SNPs, curated from multiple large-scale GWAS, associated with Schizophrenia (SCZ) across 7,756 genes. Of these, 7,047 genes were expressed in whole blood of Genotype-Tissue Expression (GTEx)25 v8 individuals. We identified one significant ARCHIE component capturing trans-association patterns significantly related to SCZ (Figure 2A-B) consisting of 27 variants and 75 genes. Of the selected genes, only 16 (21.4%) had evidence of at least one strong association (p-value < 1 × 10-06) and possibly multiple weaker (0.05 > p-value > 1 × 10-06) association as reported by eQTLGen. The remaining 59 genes (78.6%) only had weaker trans-associations with SCZ-related variants and could not have been identified using traditional trans-eQTL mapping (Table 1). Using an expression imputation approach (See Methods for details), we found the target genes mediate significant trait heritability (p-value < 0.001) than expected by chance (Figure 2D).
Several of the 59 identified novel genes have been previously been reported to be associated with neurological functions. For example, chemokine receptor 4 (CXCR4), a gene that underlies interneuron migration and several neurodegenerative diseases26, was identified by aggregating weaker associations from 20 SCZ-related SNPs in the variant component, but does not have any significant trans-associations. Similarly, caveolin-1 (CAV1), which is a known regulator of a SCZ risk gene (DISC1)27, aggregates 13 weaker association to SCZ-related variants in the variant component. Notably, the target genes identified by ARCHIE include genes such as HSPA5 and AP5S1, which not only harbor multiple trans-associations from SCZ-related variants but have also been reported to have cis-variants associated with psychiatric disorders28,29. We investigated whether in general the genes selected by ARCHIE had have evidence of association with SCZ through cis variants. Aggregating results from several large-scale cis-eQTL studies across tissues9,30, we found that 12 of the 59 of the (enrichment p-value = 2.8×10-05) novel genes have nominally significant (p-value < 1×10-04) evidence of cis-regulatory SNPs to be associated with SCZ or other different neuropsychiatric diseases.
By performing pathway enrichment analysis of the target genes we investigated if the selected genes represented known SCZ-related biological mechanisms (See Supplementary Methods for details). Among the significantly enriched pathways, the majority (51.3%) were immune related. In particular, we identified 42 GO pathways31, 36 canonical pathways32–35 and 4 hallmark pathways36 to be strongly enriched (FDR adjusted p-value < 0.05) for the selected genes with 73 (89.0%) of them containing at least one novel gene (Figure 2C and Supplementary Table 2). Several pathways, previously reported in connection to SCZ, are identified to be enriched (FDR adjusted p-value < 0.05) for the selected genes (Figure 2C). For example, among the enriched gene ontology (GO) terms, GO-0034976: response to endoplasmic reticulum stress37 (FDR adjusted p-value=0.013), GO-055065 metal ion homeostasis38 (adjusted p-value=0.029), GO-0006915: apoptotic process39 (adjusted p-value = 0.029), GO-0043005: neuron projection (adjusted p-value = 0.021) have previously been suggested to be linked to SCZ. Four hallmark gene-sets are also found to be significantly enriched for the selected genes including glycolysis40, hypoxia41, mTORC1 signaling42 and unfolded protein response43, all of which have suggestive evidence of being associated to SCZ. Using numerous TF databases44,45, we found that the selected target genes were enriched (adjusted p-value < 0.05) for targets of 10 TFs (Supplementary Table 3), several of which have been previously reported to be associated with neuropsychiatric disorders46,47.
Protein-protein interaction (PPI) enrichment analysis using STRING (v11.0)48 showed a significant enrichment (p-value = 1.1×10-03) indicating that the corresponding proteins may physically interact. Next, we performed a differential expression enrichment analysis to investigate whether the target genes were differentially expressed in any of the 54 tissues in GTEx v8 dataset. For each tissue, we curated lists of differentially expressed genes across the genome. We defined a gene to be differentially expressed in a tissue if the corresponding gene expression level in that tissue was significantly different from that across the rest of the tissues (See Supplementary Methods for details). Using such pre-computed lists of differentially expressed genes for each tissue, we found that the target genes selected by ARCHIE were enriched within the set of differentially expressed genes in 12 different tissues including 4 brain tissues in GTEx v8 (Supplementary Figure 1). For example, 3 novel genes (PADI2, KCNJ10, MLC1), were highly differentially expressed in several brain tissues (Supplementary Figure 2), in comparison to their expression in rest of the tissues.
Ulcerative Colitis
Ulcerative colitis (UC) is a form of inflammatory bowel disease, affecting the innermost lining of colon and rectum, causing inflammation and sores in the digestive tract and can lead to several colon-related symptoms and complications including colon cancer49–51. The eQTLGen consortium reports complete (non-missing) trans-association summary statistics for 163 SNPs associated with Ulcerative Colitis, curated from multiple large-scale GWAS, across 12,010 genes. Of these, 10,307 genes were expressed in Whole Blood from GTEx v8 individuals. Using ARCHIE, we detected two significant variant-gene components comprising of 74 SNPs and 148 genes in total (Figure 3A and Supplementary Figure 3; Supplementary Table 1) that reflect trans- association patterns specific to UC. Of the selected genes, 68 genes (45.9%) were novel, meaning they did not have any strong trans-association (Table 1, Supplementary Table 1) with the variants related to UC. Further, similar to SCZ, we found the associations of the SNPs with target genes was strongly enriched (p-value < 0.001) for heritability of UC than expected by chance alone (Figure 3D).
Several of the novel target genes detected have been previously linked to intestinal inflammations and diseases. For example, glycoprotein A33 (GPA33) is known to impact intestinal permeability52 and is an established colon cancer antigen53. Recent research using mouse-models have reported a connection between the regulation of GPA33 and the development of colitis and other colon related inflammatory syndromes54. We also identify spermine oxidase (SMOX) through its weaker association with 9 UC-related variants. SMOX is significantly upregulated in individuals with inflammatory bowel diseases55 and has been implicated in gastric and colon inflammations as well as carcinogenesis56.
Using a series of follow-up analyses, we identify several pathways to be enriched (FDR adjusted p-value < 0.05) for the selected target genes (Supplementary table 4-5), majority of them being immune related (59.6%). Among others, the hallmark interleukin-2-STAT5 signaling pathway (FDR adjusted p-value = 1.6 ⨯10-08) has previously been reported to be associated to development of UC via suppression of immune response 57. Various GO pathways related to endocytosis, lymphocyte activation, T-cell activation are found to be overrepresented in the selected target genes as well (Figure 3B and Table 3). Further enrichment analysis using broad TF databases, we found the selected target genes across both gene-components are enriched (adjusted p-value < 0.01) for targets of 18 different TFs, majority of which have been previously reported to be involved in mucosal inflammation, inflammation of the intestine and epithelial cells and in immune-related responses (Supplementary Table 6).
Protein-protein interaction (PPI) enrichment analysis shows that the resultant proteins interact more often than random (p-value= 8.8×10-03 and 1.3×10-03 respectively for two significant ARCHIE components) (Figure 3C). Additionally, the selected genes were found to be enriched for genes significantly differentially expressed in several relevant tissues like colon-sigmoid and small-intestine ileum among others (Supplementary Figure 4). We further investigated if any known mechanism can explain how the selected genes are associated with the selected variants, including mechanisms reflecting cis mediation15. In one example from our analysis, we observe that, among the 41 variants selected by variant-component 1, one UC-related variant rs3774959 is a cis-eQTL of NFKB1 (p-value = 6.2 × 10-41 in eQTLGen and 6.3 × 10-05 in GTEx in whole blood). The Nuclear factor κB (NF-κB) family of transcription factors (TF) including NFKB1, has been extensively reported to be involved in immune58 and inflammatory responses59. In particular, mutations in the promoter region of Nuclear factor κB1 (NFKB1) have been strongly implicated to be associated to UC60, although the downstream target genes of NFKB1 that are associated with UC, are largely unknown. Among 106 target genes selected in the first gene component, there are 6 genes (CD74, CD83, IL1B, Il2RA, PTPN6, FOXP3) that are reported targets for NFKB1 (adjusted enrichment p-value = 7.5 × 10-03) in TRRUST v2.044. Thus, it can be conceptualized that the selected UC-related variant may regulate the expression levels of the 6 selected targets of NFKB1 via cis-regulation of NFKB1 expression levels, influencing UC-status downstream.
Prostate Cancer
Prostate cancer (PC) is one of the most common types of cancers in middle aged and older men, having a high public health burden with more than 3 million new cases in USA per year. The eQTLGen consortium reports complete (non-missing) trans-association summary statistics for 122 SNPs associated with prostate cancer, curated from multiple large-scale GWAS, across 12,951 genes. Of these, 11,385 genes were expressed in Whole Blood from GTEx v8 individuals. Using ARCHIE, we detected two significant variant-gene components comprising 33 SNPs and 53 genes in total (Figure 4A; Supplementary Table 1) that reflect trans- association patterns specific to PC, of which 44 genes (83.1%) were novel (Table 1, Supplementary Table 1). Additionally, similar to SCZ and UC, we found evidence of enrichment of trans-heritability of PC that can be mediated by the target genes (Figure 4D), but the level of significance achieved was relatively weaker (p-value = 0.002 and 0.008; See Methods for details).
Among the novel genes, we identified several key genes that are generally implicated in different types of cancers. For example, TP53 aggregates weaker trans-associations with 9 PC-related variants in ARCHIE component 1 (Figure 4B). The TP53 gene encodes tumor protein p53 which acts as a key tumor suppressor and regulates cell division in general. TP53 is implicated in a large spectrum of cancer phenotypes and has been considered to be one of the most important cancer genes studied61. Further, genes associated with the second gene component included SMAD3 (Figure 4C) which is also a well-known tumor suppressor gene that plays a key role in transforming growth factor β (TGF-β) mediated immune suppression and also in regulating transcriptional responses suitable for metastasis62–64. TP53 and SMAD3 belong to two different ARCHIE components meaning that they might pertain to two relatively distinct biological processes that are independently affected by different sets of PC related variants. Additionally, the second gene-component included EEA1 which is reported to have significantly altered expression levels in prostate cancer patients65.
Using enrichment analyses, we found several pathways, including broadly ubiquitous pathways to be significantly overrepresented in the selected genes for both the gene-components like regulation of intracellular transport (adjusted p-value=0.017) and mRNA 3’-UTR binding (adjusted p-value = 0.008). Notably, we found the selected genes to be enriched for targets of several transcription factors many of which have been associated with different types and subtypes of cancer. For example, we found a TF target enrichment for SPAG9 (adjusted p-value = 0.016) which has been identified to be associated to breast cancer, ovarian cancer, colorectal cancer and others 66. We also found enrichment for targets of SSRP1 (adjusted p-value = 0.016), which is differentially regulated in a wide spectrum of malignant tumors 67. However, we did not find any evidence for significance enrichment of PPI among identified genes.
The downstream analysis suggests that a majority of the pathways (78.1%) enriched for the selected genes are immune related as observed in the previous examples as well. This might have been driven by the fact that eQTLGen reports summary trans-associations in whole blood. In general, whole blood might not be the ideal candidate tissue to identify trans-associations pertaining to PC. It is conceivable that relevant tissue-specific analysis for PC could have illuminated further trans-association patterns and identified key tissue-specific target genes. Despite that, we can identify several genes which have been elaborately reported to be key target genes for various cancers. This underlines the utility of our aggregative approach and that it can illuminate important target genes pertaining to a trait.
Conclusion
In this article, we have proposed ARCHIE, a novel method for identifying groups of trait-associated genetic variants the effects of which may be mediated through trans-effect on groups of coregulated genes. By aggregating association signals across multiple SNP-gene expression pairs, the method improves power for the detection of patterns of weak trans-effects. Further, we develop a resampling-based method to allow testing for the statistical significance of trait-specific enrichment patterns in the background of expected highly polygenic broad trans association signals. By applying the method to summary-level trans-association statistics available from the eQTLGen consortium, we identify novel target gene sets across a wide variety of traits. We explore the results in depth specifically for three complex diseases and provide further validation of and insight to causal basis of the identified target genes. The method can be applied to investigate complex mediatory effects of GWAS variants on traits through networks of other molecular traits, such as proteins and metabolites.
While modern genome-wide association studies have been successful in identifying large number of genetic variants associated with complex traits, the underlying biological mechanisms by which these association arise has remained elusive. While trans genetic regulations, mediated through cis- or otherwise, has been proposed for detecting important target genes for GWAS variants, identification of trans associations using standard univariate SNP vs gene-expression association analysis is notoriously difficult due to weak effect-sizes and large multiple testing burdens. Also due to pleiotropy, there could be abundant associations between genetic variants and distal gene expression, but these associations may not reflect any trait specific patterns. ARCHIE addresses these limitations of traditional trans-mapping of GWAS variants. Application of the method to eQTLGen consortium trans-eQTL statistics not only identified many novel trans-associations for trait-related variants, but also it helped to contextualize the individual associations in terms of broader patterns that were detected by underlying correlation components and were shown to be highly trait-specific.
The set of selected target genes in the gene component is one of the key outputs of ARCHIE. Using a series of follow-up analyses for three different types of traits, we showed that the selected genes are often overrepresented in known disease-relevant pathways, enriched in protein-protein interaction networks, shows co-regulations across tissues and contains targets for known transcription factors implicated to the disease (SCZ and UC) and key tumor suppressor genes (PC). Further, using a trans-expression imputation approach, we demonstrated that the selected genes can significantly mediate heritability associated trait related variants. All of these analyses point out that the trans-association patterns we detect are likely to have trait specific biological basis.
There are several limitations of the proposed method and current analysis. First, in the current version ARCHIE, we begin with a set of genetic variants associated with a trait, but we do not incorporate the underlying association directions and effect sizes in the analysis. This approach allowed us to independently investigate identified target genes through testing for consistency of directions of association of the SNPs with the trait and those with the expressions of the target through the trans-heritability enrichment analysis. However, it is likely that incorporation of the direction of trait association for the SNPs in the sCCA analysis itself will lead to improved power for detection of the trait specific target genes. Further, incorporation of known functional annotation of genetic variants and other prior information regarding the relationship between genes can improve the power of the analysis as well. Currently, the resampling-based testing method we used to test for trait-specific enrichment patterns for trans associations is computationally intensive. In the future, further research is merited to develop analytical approximation techniques to reduce the computational burden of ARCHIE.
In this article we have analyzed summary statistics reported by eQTLGen in the whole blood. This is primarily because of the substantial effective sample size of eQTLGen. While the approach can be applied to eQTL results from other tissues, the underlying sample sizes may be too limited to yield sufficient power. Although blood might not be the most relevant tissue for a number of traits, our analysis did detect trans association patterns that appear to have a broader biological basis in the disease genetics, from multiple independent lines of evidence. Nevertheless, it is likely that our analysis has missed many trans-association patterns that will be present only in specific disease-relevant tissues, cell types or/and dynamic stages 68. In the future, we will seek applications for ARCHIE in various types of emerging eQTL databases to provide a more complete map of networks of genetics variants and trans-regulated gene expressions and relevant contexts.
Overall, in this article we have developed a novel summary-based method, ARCHIE, to detect trait-specific gene-sets by aggregating trans-associations from multiple trait-related variants. ARCHIE is a powerful tool for identifying target gene sets through which the effect of genetic variants on a complex trait may be mediated. Applications of the methods to a variety of existing data on association between genetics variants with high-throughput molecular traits can provide insights to biological mechanisms underlying genetic basis of complex traits.
URL
eQTLGen (trans-eQTL summary statistics): https://www.eqtlgen.org/trans-eqtls.html
GTEx: https://www.gtexportal.org/home/
1000 Genomes: https://www.internationalgenome.org/data/
UKBiobank: https://www.ukbiobank.ac.uk/
FUMA: https://fuma.ctglab.nl/
ShinyGO: http://bioinformatics.sdstate.edu/go/
STRING: https://string-db.org/cgi/
GitHub: https://github.com/diptavo/ARCHIE (initial release)
Data Availability
The results from the analysis has been and will continue to be updated in Github.
Sample Description
eQTLGen
The eQTLGen consortium19 is a large-scale multi-study effort to identify to study the downstream effects of trait-related variants via their effects on gene-expression in whole blood. The consortium consists of 37 individual studies with a collective sample size of 31,684 participants. With this sample size, the study has relatively higher power to detect moderate to weaker effects of variants on gene-expression. 10,317 variants related to complex traits, compiled from several GWAS databases, were tested for trans-associations with the expression levels of 19,964 genes in whole blood. The authors have made summary statistics (Z-score, p-value) for these trans-eQTL mapping analyses freely available to public.
GTEx
The Genotype-Tissue Expression (GTEx) project25 aims to study tissue-specific gene expression and regulation. We used individual level data from GTEx (v8) whole blood to construct the co-expression matrix (ΣEE) and further downstream validation of the gene-sets selected using ARCHIE. In our analysis, we used the latest version (v8) of GTEx having gene-expression and genotype data with samples from 54 different tissues. In particular, 755 individuals had expression data on 20,315 genes for whole blood.
UK Biobank
UK Biobank is a large biobank study with above 500,000 participants. Among several data resources available, the genotype data, hospitalization records and health-records data are available. We used individual level genotype data from UK Biobank to construct LD matrix (ΣGG) and for further downstream analysis of the selected target genes.
The phenotype data constructed from hospitalization and health-data records were used in the quantification and testing of enrichment in trans-heritability explained by the selected target genes (See Methods). We included the individuals with European ancestry in the analysis. For example, in the analysis of schizophrenia (SCZ), we used a sample of 366,326 participants from UK Biobank to construct the imputed gene-expression levels and evaluate the corresponding regression r2 as an estimate of trans-heritability on SCZ as a binary phenotype.
Methods
Estimating trait-specific pattern of trans-associations
Our proposed method, Aggregative tRans assoCiation to detect pHenotype specIfic gEne-sets (ARCHIE), can select target genes trans-associated with trait-related variants using summary statistics in a sparse canonical correlation framework. To apply ARCHIE, we start with the summary statistics from trans-eQTL mapping (Z-value, p-value). Given the trans-association summary statistics across the variants related with the trait and all the corresponding distant genes (variant > 5Mb away from the transcription start site of the gene), we first adjust for the correlation within the variants and genes through appropriate linkage disequilibrium and co-expression matrices respectively as follows: where ΣGG and ΣEE are estimates of LD-matrix and co-expression matrix (see Supplementary Methods), and ΣGE is a matrix of Z-values obtained from the standard trans-eQTL mapping across all pairs of variants and gene-expressions.
Using W, the correlation-adjusted matrix of trans-associations, ARCHIE employs sparse canonical correlation analysis22,24 (sCCA) which produces a sparse linear combination of the variants (u; termed variant-component) that is strongly correlated with a sparse linear combination of genes (v; termed gene-component) by solving the following optimization problem where ‖x‖h is the Lh norm of a vector x; cu (or cv) is the sparsity parameter on the variant (or gene) component for the lasso-type L1 penalty. Sparsity aids in interpretation since each non-zero element of a variant or gene component indicates that the respective variant (or gene) is selected in that component. Thus, (u, v), which are the resultant variant and gene components (jointly termed ARCHIE components) can be interpreted as the sparse latent factors that explain the majority of the aggregated association between all the trait-related variants and all the genes. The corresponding sparse canonical correlation (cc-value) between each pair of variant and gene components, defined as q2=(vTWu)2would ideally represent the cumulative correlation between the selected sets of variants and genes by aggregating multiple (possibly weaker) associations (Figure 1A shows an illustration using P variants and G genes). Multiple such components (u, v) can be extracted to reflect approximately orthogonal latent factors of the aggregative correlation, corresponding to possibly distinct mechanisms of trans-regulation (See Supplementary Methods).
At suitable levels of sparsity (See Supplementary Methods), ARCHIE components produce a much smaller number of selected target genes which harbor multiple moderate to weak trans-association from a selected set of trait-associated variants, thus reflecting a trait-specific pattern of trans-association. A detailed algorithm for the estimation of the ARCHIE components is provided in the Supplementary Section A.
Testing Hypothesis of Enrichment of Trait-Specific trans-Association using a Competitive Null Hypothesis Framework
To test which ARCHIE components significantly capture the phenotype-specific trans-association pattern we evaluated the results from the original analysis against a competitive null hypothesis. Since trait-related variants are expected to be enriched for trans-eQTLs in general, we test whether the cc-values obtained in the original analysis are higher than that obtained using the trans-summary statistics between a random set of GWAS-identified variants and genes of similar size, that do not reflect any trait-specific pattern. For this, we first construct a null matrix by taking a random sample of p variants from the pool of all variants available and extracting the corresponding trans-summary statistics for another set of randomly chosen g genes. Since eQTLGen reports the trans-summary statistics across about 10,000 variants associated with different traits, we can construct the null matrix using the trans-summary statistics from these variants that are associated with different traits and not with the trait of interest. This matrix of trans-associations, by design, should not reflect phenotype-specific patterns. For example, in the analysis for Schizophrenia (SCZ) using summary statistics across 218 variants and 7,047 genes, we construct the null matrix using 1 variant selected at random from 218 randomly chosen traits and extracting their corresponding trans-summary statistics across 7,047 randomly chosen genes.
Then we use ARCHIE with the same sparsity levels as the original analysis, to extract the gene and variant components and calculate corresponding cc-values. We repeat this step multiple (M) times to generate a competitive null distribution of cc-values. We then evaluate the observed cc-values from the original analysis against the corresponding competitive null distributions to calculate the p-value. In particular, the p-value of the kth ARCHIE component is given as where denotes the kth cc-value in the original analysis and denotes the elements of the null distribution of the kth cc-value. We declare that the top L components significantly capture phenotype-specific trans- association patterns if The random set of p variants should be carefully chosen so that none of the variants associated to the phenotype in consideration or any phenotype sharing substantial genetic correlation, are included. Further the set should be such that it does not include a large fraction of the variants from the same phenotype (different from the original phenotype), which may bias the competitive null distribution towards the trans-association cc-values for that phenotype.
We performed an evaluation of the statistical properties including type-I error of the proposed testing procedure using extensive resampling experiments across four different traits. Our results show that the test against competitive null hypothesis can preserve correct type-I error rate at significance threshold of 0.001. Further, it can detect presence of trait-specific trans-associations with high probability (Supplementary Section B).
Analysis of eQTLGen data
To identify phenotype-specific trans-associations, we applied ARCHIE on the trans-association summary statistics for 10,317 trait-related variants across 19,942 genes reported by the eQTLGen consortium19 (See Sample Description for details on the study). In line with the consortium, we defined any gene to be trans to a variant if the variant was located at least 5Mb from the transcription start site of the gene or on another chromosome. The data contains multiple variants associated with the same trait analyzed for trans-eQTL mapping. Our analysis was restricted to phenotypes that had at least 100 associated variants tested for trans-mapping in the consortium, producing 29 phenotypes. Figure 1B shows a graphical representation of the major steps of our workflow. Briefly, for each phenotype, we extracted the summary trans-eQTL association statistics (Z-score, p-value) and removed all genes that were in within 5Mb of any of the trait related variants. In the preprocessing step, we filtered for any missing data and retained the genes that were also expressed in GTEx (v8)25 whole blood. This produced a list of approximately 129 (min: 112; max: 533) variants and 10,219 (min: 3426; max: 13910) genes on an average per phenotype. ARCHIE requires two additional matrices representing the correlation among the variants themselves (a linkage-disequilibrium matrix) and among the gene-expression levels (a co-expression matrix), which can be estimated using reference data. We constructed the LD-matrix for the variants from individual-level genotype information using 5,000 randomly selected, unrelated European samples in UK Biobank69. For the correlation between gene-expressions, we used a penalized co-expression matrix70 of the corresponding genes constructed from the covariate-adjusted gene-expression levels for individuals in GTEx v8 data (See Supplementary Methods). Subsequently, for the given trait, we extracted the selected variants and genes using the significant components and were evaluated for presence of false-positives due to cross-mapping.
Cross-Mappability
Alignment errors due to similarity in sequenced reads can lead a substantial rise in false positives for detecting trans-eQTL associations 71. With the selected ARCHIE components, we extracted the nearby genes expressed in GTEx v8 whole blood for the selected variants (TSS within ±500 kb of the variant) and evaluated the cross-mapping scores for these genes with the selected target genes. Across the 3 traits analyzed in this article, we found that all such gene-pairs were mostly non cross-mappable (SCZ: 99.98%, UC: 99.17%, PC: 99.93%), indicating that the trans-association patterns were less likely to be affected by false positive arising from alignment errors.
Follow-up Analysis
Quantifying and Testing for Enrichment for Trait Heritability Explained by Identified Target Genes
In the following, we propose a method for quantifying trait heritability explained by the GWAS variants that would be mediated by the identified target-genes and develop a corresponding test for enrichment through comparison of such estimates of mediated heritability associated with that from random genes. For or a particular trait of interest, we start with the Z-scores for regression-based trans-eQTL mapping for a set of underlying p variants and g genes. We will assume that, using ARCHIE, we have identified G target genes that capture trait-specific trans-association patterns. To perform the test as proposed above, we require individual level phenotype and genotype data independent of the samples used in the original analysis. Given genotypes (or dosages) at the p variant sites for an individual k, for each target gene, we define the trans-imputed expression scores (TIES) as the predicted expression value for the jth target gene as where Zij is the z-score for the effect of the ith trait-related variant on the jth gene, xik is the genotype or dosage for the kth individual at the ith variant and mi is the minor allele frequency of the ith variant. We construct the TIES under two different schemes:
Using all the trait-related variants with complete trans-association statistics reported in eQTLGen
Using only the trait-related variants selected in the significant components
To evaluate how strongly the TIES for the G target genes are associated with the phenotype levels, we use the following multiple regression model where yk is the phenotype value (e.g. disease status) for the kth individual; g[.] is a canonical link function and can be set to be the identity function for continuous phenotype or the logistic function for binary (disease status) phenotypes. We record the pseudo-r2 from this regression model as a measure of association between the TIES and the phenotype value. The pseudo-r2 would provide an estimate of trans-heritability, meaning it can quantify the heritability explained by the trait-related variants that is expected to be mediated via the selected target genes in context of the trans-associations reported. To test whether the observed r2 is significant in comparison to what is expected at random, we adopt a resampling based approach. We sampled g genes (excluding the originally selected target genes) from the genome, constructed the corresponding TIES for the individuals and recorded the r2 for the regression model. We performed resampling multiple (1000) times to generate a control (null) distribution of r2 to reflect the associations expected from random genes. We then calculated the p-value of the observed r2 using the originally selected g genes from this control distribution to evaluate whether the TIES have any significant association with the phenotype.
Approximately, the observed r2 reflects the proportion of trait-variance explained by the TIES. Thus, a significantly higher r2 would imply that the selected genes harbor several trans-associations and mediate the effects of the trait-related variants more than any random set of genes. As the analysis of association between TIES and trait (for both the selected trans genes and random genes) is performed in an independent dataset, and no information on directions or magnitudes of trait association for the SNPs are used in the original ARCHIE analysis, the test for heritability enrichment provides independent validation of the relevance of selected target-genes in explaining genetic associations for the trait. In our application, we used individual level phenotype and genotype data from UK Biobank participants to estimate association between TIES and traits.
We also performed several other follow up analyses including PPI enrichment, pathway enrichment, differentially expressed genes enrichment. These analyses were carried out using pre-established standard pipelines. For full details on these see Supplementary Section C.