Abstract
Heritable diseases often manifest in a highly tissue-specific manner, with different disease loci mediated by genes in distinct tissues or cell types. We propose Tissue-Gene Fine-Mapping (TGFM), a fine-mapping method that infers the posterior probability (PIP) for each gene-tissue pair to mediate a disease locus by analyzing GWAS summary statistics (and in-sample LD) and leveraging eQTL data from diverse tissues to build cis-predicted expression models; TGFM also assigns PIPs to causal variants that are not mediated by gene expression in assayed genes and tissues. TGFM accounts for both co-regulation across genes and tissues and LD between SNPs (generalizing existing fine-mapping methods), and incorporates genome-wide estimates of each tissue’s contribution to disease as tissue-level priors. TGFM was well-calibrated and moderately well-powered in simulations; unlike previous methods, TGFM was able to attain correct calibration by modeling uncertainty in cis-predicted expression models. We applied TGFM to 45 UK Biobank diseases/traits (average N = 316K) using eQTL data from 38 GTEx tissues. TGFM identified an average of 147 PIP > 0.5 causal genetic elements per disease/trait, of which 11% were gene-tissue pairs. Implicated gene-tissue pairs were concentrated in known disease-critical tissues, and causal genes were strongly enriched in disease-relevant gene sets. Causal gene-tissue pairs identified by TGFM recapitulated known biology (e.g., TPO-thyroid for Hypothyroidism), but also included biologically plausible novel findings (e.g., SLC20A2-artery aorta for Diastolic blood pressure). Further application of TGFM to single-cell eQTL data from 9 cell types in peripheral blood mononuclear cells (PBMC), analyzed jointly with GTEx tissues, identified 30 additional causal gene-PBMC cell type pairs at PIP > 0.5—primarily for autoimmune disease and blood cell traits, including the well-established role of CTLA4 in CD8+ T cells for All autoimmune disease. In conclusion, TGFM is a robust and powerful method for fine-mapping causal tissues and genes at disease-associated loci.
Introduction
Heritable diseases often manifest in a highly tissue-specific manner, motivating intense efforts to elucidate tissue-specific mechanisms of disease1. Previous studies have identified disease-critical tissues/cell-types based on genome-wide patterns2–11, and have deeply dissected a limited number of GWAS loci12–16. However, different GWAS loci may be mediated by different tissues, motivating genome-wide efforts to fine-map causal tissues and genes at individual GWAS loci.
Existing approaches, including colocalization17–19 and transcriptome wide association studies (TWAS)20–22, have implicated disease genes via the integration of GWAS data with expression quantitative trait loci (eQTLs) while considering the effect of each gene-tissue pair on disease in isolation. However, it is likely that most of these disease-implicated genes are not actually causal in the analyzed tissue; analogous to non-causal tagging variants implicated by linkage disequilibrium (LD) between variants23, non-causal gene-tissue pairs can be implicated by correlations with causal gene-tissue pairs (involving a different gene and/or different tissue)11,22,24–27. In addition, false-positive gene-tissue pairs can arise from correlations with non-mediated genetic variants, i.e., variants whose causal effects are not mediated by assayed expression levels22,27,28. Previous fine-mapping approaches such as FOCUS24 have proven valuable in disentangling causal effects across correlated genes in a single tissue (also see ref. 27), but have not considered causal gene-tissue pairs.
Here, we introduce a new method, Tissue-Gene Fine-Mapping (TGFM), that infers the posterior inclusion probability (PIP) for each gene-tissue pair to mediate a disease association at a given locus; TGFM also assigns PIPs to causal genetic variants whose effects are not mediated by gene expression in assayed tissues and genes. TGFM models both gene-tissue pairs (using cis-predicted expression20,21) and non-mediated genetic variants as potential causal genetic elements, and accounts for both correlations in cis-predicted expression across genes and tissues and LD between genetic variants, generalizing existing fine-mapping methods23,24,27,29–31. TGFM incorporates genome-wide estimates of each tissue’s contribution to disease as tissue-level priors and employs a sampling approach to account for uncertainty in cis-predicted gene expression. We validated TGFM using extensive simulations with real genotypes, including comparisons to coloc17 and FOCUS24. We applied TGFM to 45 UK Biobank traits32 using eQTL data from 38 GTEx tissues25 and 9 fine-grained single-cell PBMC cell-types33.
Results
Overview of TGFM
TGFM estimates the posterior inclusion probability (PIP) for each genetic element (gene-tissue pair or genetic variant) to have a non-zero causal effect on disease, in a model that includes mediated causal effects of each gene-tissue pair (via the cis-genetic component of expression of a given gene in a given tissue) and non-mediated causal effects of each genetic variant: where Y denotes phenotypes, g indexes genes, t indexes tissues, X is the matrix of genotypes, δgt is the vector of causal cis-eQTL effect sizes of each variant on gene expression in gene g and tissue t (thus Xδgt is the cis-genetic component of gene expression in gene g and tissue t), αgt denotes the (scalar) effect of cis-genetic expression in gene g and tissue t on the disease or trait, β is the vector of non-mediated causal effects of each genetic variant on the disease or trait, and ϵ denotes environmental noise. We caution that, analogous to previous studies, inference of causal genetic elements relies on the assumption that all causal genetic elements have been assayed, which may not be true in practice (see Discussion); we use the word “causal” for simplicity, with this caveat in mind.
TGFM estimates the PIP of each genetic element by generalizing the Sum of Single Effects (SuSiE)30,31 fine-mapping method to include both gene-tissue pairs and genetic variants; gene-tissue pairs are included via cis-predicted expression20,21 (using an external eQTL data set such as GTEx25 to build prediction models), which is an approximation to true cis-genetic expression (Methods). This approach allows for fine-mapping multiple causal genetic elements in a given locus, inferring causal effects underlying both marginal GWAS34 and marginal TWAS20–22 (i.e., the association between cis-predicted expression of a single gene-tissue pair and disease) associations by accounting for correlations between gene-tissue pairs due to co-regulation across genes/tissues11,22,24–27, correlations between genetic variants due to LD23, and/or correlations between gene-tissue pairs and genetic variants due to the inclusion of a genetic variant in a model of cis-predicted gene expression22,27,28. TGFM employs a sampling approach to account for uncertainty in cis-predicted expression, avoiding false positives that arise from noisy estimation of cis-genetic expression.
In detail, TGFM consists of four steps. In step 1, we apply SuSiE to perform eQTL fine-mapping of each gene-tissue pair in the external gene expression data set (estimating a posterior distribution of the causal cis-eQTL effect sizes for each gene-tissue pair). In step 2, we randomly sample 100 cis-predicted expression models for each gene-tissue pair from the posterior distributions of causal cis-eQTL effect sizes estimated in step 1 (Methods). In step 3, we apply SuSiE to perform disease fine-mapping in the target data set (estimating the PIP of each genetic element) 100 times, iterating over the sampled cis-predicted expression models for each gene-tissue pair from step 2. In step 4, we average the results of step 3 across the 100 disease fine-mapping runs. TGFM utilizes a custom implementation of the SuSiE algorithm that provides efficient estimation of PIPs across 100 parallel SuSiE runs that differ only in their cis-predicted expression models (Methods). TGFM inference requires only summary-level GWAS data31,35 consisting of GWAS z-scores for each variant and in-sample LD between genetic variants, in addition to external eQTL data sets across tissues of interest.
TGFM increases fine-mapping power by specifying tissue-specific prior probabilities for each genetic element in a locus that are informed by genome-wide data, analogous to functionally informed variant-level fine-mapping methods such as PolyFun36; TGFM assigns one prior causal probability πt for each gene-tissue pair from tissue t and one prior causal probability πnm for each non-mediated genetic variant. We estimate πt and πnm in each disease/trait separately by iteratively running a computationally efficient approximation to TGFM (Methods), starting with flat priors and updating πt and πnm at each iteration until convergence. When analyzing a given locus with TGFM, we normalize the prior causal probabilities to sum to 1, analogous to PolyFun36. We account for uncertainty in estimates of πt and πnm by using genomic bootstrapping, randomly sampling 100 sets of values of πt and πnm (one for each of the 100 disease fine-mapping runs in step 3) and averaging TGFM results across the random samples.
We restrict cis-predicted expression models to cis-eQTLs within 500kb of each gene’s transcription start site (TSS). We only assign cis-predicted expression models to gene-tissue pairs that are well-predicted by genetic variants, using the SuSiE “purity filter”30 (see Methods). We apply TGFM to fine-map any of the 2,682 overlapping 3Mb loci spanning the entire genome36 that contain at least 50 genetic variants and at least one genetic variant with marginal GWAS p-value less than 1e-5. Further details, including sampling cis-predicted expression models from SuSiE posterior distributions of causal cis-eQTL effect sizes, the custom implementation of the SuSiE algorithm providing efficient estimation of PIPs across 100 parallel SuSiE runs, and the computationally efficient approximation to TGFM used when estimating tissue-specific prior causal probabilities, are provided in the Methods section. We have released open-source software implementing TGFM (see Code availability), as well as posterior distributions of causal eQTL effect sizes across tissues and genes, and TGFM PIPs from this study (see Data availability).
Simulations
We performed simulations using real genotypes to assess the calibration and power of TGFM to identify causal tissues and genes underlying GWAS associations. We used real genotypes from unrelated UK Biobank (UKBB) British samples32 to simulate both gene expression phenotypes (for each gene-tissue pair) and quantitative trait phenotypes. Default simulation parameters were specified as follows: the gene expression sample size ranged from 300 to 1000; the quantitative trait sample size was set to 100,000 (disjoint from gene expression samples); we analyzed 426,593 SNPs and 1,976 genes on chromosome 1 (following ref. 11); the number of tissues was set to 10, of which 2 were causal for the quantitative trait; the quantitative trait architecture was simulated to have average polygenicity37, consisting of 2,700 causal non-mediated variants and 300 causal gene-tissue pairs (150 for each causal tissue) with the expected heritability per causal genetic element (non-mediated variant or gene-tissue pair) set to 0.0001 (expected quantitative trait heritability of 0.3, 10% of which was mediated through gene expression, consistent with genome-wide estimates from MESC28); causal non-mediated variants were randomly selected with probability proportional to their expected per-variant heritability based on baseline-LD model annotations3,38–40 (estimated using S-LDSC3 applied to the UKBB trait White blood cell count) in order to make the simulations as realistic as possible; the genetic architecture of gene expression across tissues was specified following ref. 11: roughly, each heritable gene-tissue pair was randomly assigned 5 causal cis-eQTLs (expected per-SNP heritability: 0.015), 2 of the 5 causal eQTLs were specific to each tissue, and 3 of the 5 causal eQTLs were shared across tissues with effect size covariance set to mimic that of GTEx tissues25; and causal gene-tissue pairs were randomly selected from all genetically heritable genes (true cis-SNP-heritability > 0) in each of the causal tissues. Non-default simulation parameter values were also explored. We performed 100 independent simulations, and averaged results across simulations. Further details of the simulation framework are provided in the Methods section.
We compared TGFM to two previously published methods, coloc17 and FOCUS24. Briefly, coloc calculates the posterior probability of a shared causal variant between a GWAS disease/trait and a gene expression trait from a single gene-tissue pair without considering correlations between genes or gene-tissue pairs. FOCUS assigns PIPs for the expression of each gene in a given tissue to have non-zero causal effect on disease, while modeling correlations between genes in that tissue but not modeling correlations between different tissues or correlations between genes and non-mediated genetic variants. FOCUS can naturally be extended to model correlations between all gene-tissue pairs (without modeling correlations between genes and non-mediated genetic variants); we refer to the resulting method as FOCUS-TG. (In contrast, coloc does not model correlations between genes, and cannot be extended in this way.)
We first evaluated the calibration of TGFM, coloc, FOCUS and FOCUS-TG to fine-map causal gene-tissue pairs. Calibration was assessed using empirical false discovery rate (FDR), estimated as the proportion of false-positive gene-tissue pairs among all gene-tissue pairs above a given PIP threshold. Following ref. 36, we assessed whether the empirical FDR is less than or equal to (1 – PIP threshold), a more conservative choice than (1 – average PIP) (which has been shown to be slightly mis-calibrated in previous fine-mapping simulations36,41; also see Methods and secondary analyses below). Results are reported in Figure 1a-b and Supplementary Table 1. TGFM produced well-calibrated PIPs at all eQTL sample sizes and PIP thresholds. In contrast, coloc, FOCUS, and FOCUS-TG were poorly calibrated across all PIP thresholds, even at large eQTL sample sizes. We attribute the superior calibration of TGFM over other approaches to its joint modeling of gene-tissue pairs and non-mediated variants, as well as its sampling procedure that accounts for uncertainty in genetically predicted gene expression (see secondary analyses below). Results were similar at other PIP thresholds (Supplementary Figure 1).
We next evaluated the power of TGFM, coloc, FOCUS and FOCUS-TG to fine-map causal gene-tissue pairs. Results are reported in Figure 1c-d and Supplementary Table 2. TGFM was moderately well-powered to detect causal gene-tissue pairs at larger eQTL sample sizes, with power ranging from 0.02-0.17 across eQTL sample sizes at a PIP threshold of 0.5. Other methods (coloc, FOCUS, FOCUS-TG) achieved higher power than TGFM, but this is largely moot due to the poor calibration of those methods (Figure 1a-b). Results were similar at other PIP thresholds (Supplementary Figure 2).
We compared the calibration and power of TGFM for fine-mapping causal gene-tissue pairs, genes, or non-mediated genetic variants. Gene PIPs were computed by aggregating gene-tissue PIPs across all gene-tissue pairs corresponding to the gene (defining a gene as causal if at least one corresponding gene-tissue pair is causal; Methods). Causal eQTL variants for causal gene-tissue pairs were not considered to be false positives for variant-level calibration but were not included as true positives in variant-level power computations (also see secondary analyses below). Calibration results for TGFM (Gene-Tissue), TGFM (Gene) and TGFM (Variant) are reported in Figure 2a-b and Supplementary Table 3. TGFM produced well-calibrated gene-level and variant-level PIPs. In contrast, gene-level coloc, FOCUS, and FOCUS-TG PIPs were poorly calibrated across all PIP thresholds even at large eQTL sample sizes (Supplementary Figure 3, analogous to Figure 1a-b). Results were similar at other PIP thresholds (Supplementary Figure 4).
Power results for TGFM (Gene-Tissue), TGFM (Gene) and TGFM (Variant) are reported in Figure 2c-d and Supplementary Table 4. TGFM attained higher power to fine-map causal genes than causal gene-tissue pairs, which is expected as fine-mapping causal genes is an easier problem. Power for variant-level fine-mapping was invariant to eQTL sample size, such that variant-level fine-mapping was more powerful than gene-tissue or gene-level fine-mapping at smaller eQTL sample sizes—particularly at the stringent PIP>0.9 threshold, at which the latter were severely underpowered. Results were similar at other PIP thresholds (Supplementary Figure 5).
We performed 9 secondary analyses. First, we ran TGFM with a uniform prior (same πnm and πt for all tissues) instead of the default tissue-specific priors inferred from genome-wide data. TGFM with a uniform prior remained well-calibrated (Supplementary Figure 6a-b) but suffered substantially reduced power (Supplementary Figure 6c-d), highlighting the benefit of tissue-specific priors informed by genome-wide data. Second, we ran TGFM with a uniform prior and a single cis-predicted expression model (based on posterior mean causal cis-eQTL effect sizes) instead of averaging results across 100 sampled cis-predicted expression models. TGFM without sampling cis-predicted expression models suffered poor calibration, particularly at smaller eQTL sample sizes (Supplementary Figure 6a-b), highlighting the advantages of the sampling approach to account for uncertainty in cis-predicted expression. However, the calibration of this method was still better than the calibration of FOCUS-TG (Figure 1a-b), perhaps because FOCUS-TG does not account for non-mediated genetic variants. Third, we performed an alternative calibration analysis of TGFM (Variant) where causal eQTL variants for causal gene-tissue pairs were considered false positives for variant-level calibration. TGFM (Variant) was miscalibrated at small eQTL sample sizes and high PIP thresholds in this alternative calibration analysis (Supplementary Figure 7). The calibration worsened at small eQTL sample sizes, likely due to decreased power to detect causal gene-tissue pairs at small eQTL sample sizes, forcing the unmodeled gene-tissue pair effects to be captured by non-mediated variants. Fourth, we modified our calibration analyses to assess whether the empirical FDR is less than or equal to (1 – average PIP), a less conservative choice than (1 – PIP threshold) (ref. 36,41; see Methods). We determined that TGFM (Gene-Tissue) was slightly miscalibrated only at small eQTL sample sizes and low PIP thresholds, TGFM (Gene) was well-calibrated across all eQTL sample sizes and PIP thresholds analyzed, TGFM (Variant) was slightly mis-calibrated at high PIP thresholds regardless of eQTL sample size, and coloc, FOCUS, and FOCUS-TG were severely miscalibrated across all eQTL sample sizes and PIP thresholds analyzed (Supplementary Figures 1, 4). The slight miscalibration of TGFM (Gene-Tissue) and TGFM (Variant) when using (1 – average PIP) is consistent with previous simulations of variant-level fine-mapping methods using polygenic trait architectures36,41. Fifth, we ran TGFM at different simulated GWAS sample sizes ranging from 50,000 to 200,000 (instead of the default sample size of 100,000). TGFM remained well-calibrated regardless of GWAS sample size but attained increased power at larger GWAS sample sizes (Supplementary Figure 8). TGFM fine-mapping power at PIP > 0.5 increased 2.0-fold when doubling the eQTL sample size vs. 1.2-fold when doubling the GWAS sample size (relative to an eQTL sample size of 500 and GWAS sample size of 100,000), suggesting that TGFM attains a greater benefit from increasing the eQTL sample size under our default parameter settings. Sixth, we ran TGFM at different values of heritability of gene expression, ranging from 0.05 to 0.1 (instead of the default value of 0.075). TGFM remained approximately well-calibrated regardless of gene expression heritability but attained increased power at larger values of gene expression heritability (Supplementary Figure 9), analogous to the impact of varying eQTL sample size (Figure 2). Seventh, we investigated TGFM’s ability to provide unbiased estimates of the proportion of causal genetic elements (gene-tissue pairs or non-mediated variants) that are gene-tissue pairs, a parameter closely related to the proportion of disease heritability mediated by gene expression (estimated in ref. 28). We first calculated the proportion of fine-mapped genetic elements at various PIP thresholds that are gene-tissue pairs. This approach yielded either upward or downward biased estimates of the true proportion depending on the eQTL sample size and PIP threshold (Supplementary Figure 10a), reflecting differential discovery power of non-mediated variants and gene-tissue pairs as a function of both eQTL sample size and PIP threshold (Figure 2c-d). We next calculated the expected proportion of fine-mapped genetic elements that are gene-tissue pairs by summing PIPs across genetic elements. This approach yielded conservative estimates of the true proportion, becoming less conservative at larger eQTL sample sizes (Supplementary Figure 10b), suggesting that this statistic can provide a conservative lower bound on the true proportion of causal genetic elements that are gene-tissue pairs. Eighth, we investigated the calibration and power of inference of disease-critical tissues via the TGFM tissue-specific prior, using genomic bootstrap to assess significance (see Methods); although inference of disease-critical tissues is not a primary goal of TGFM (and there exist previous methods for inferring disease-critical tissues using eQTL data5,11), assessing the TGFM tissue-specific prior is of interest. We determined that inference of disease-critical tissues via the TGFM tissue-specific prior was well-calibrated and well-powered (Supplementary Figure 11). Ninth, we compared TGFM (Variant) PIPs with variant-level PIPs inferred by SuSiE. The variant-level PIPs were strongly correlated and consistent in magnitude, particularly after excluding from the analysis any variant that was correlated with a TGFM fine-mapped gene-tissue pair (Supplementary Figure 12).
Tissue-gene fine-mapping of 45 diseases and complex traits
We applied TGFM to fine-map tissues and genes for 45 diseases and complex traits from the UK Biobank (average N = 316K; previously analyzed with functionally informed variant-level fine-mapping36; Methods and Supplementary Table 5) using gene expression data from 47 GTEx tissues25, which were aggregated into 38 meta-tissues11 (average N=259; Supplementary Table 6) to minimize eQTL sample size differences across tissues; below, we refer to these as “tissues” for simplicity. For each disease/trait we applied TGFM to 2,682 overlapping 3-Mb loci36 spanning 119,270 (protein-coding) gene-tissue pairs with cis-predicted expression models (3,139 genes per tissue on average, Supplementary Table 6; 13,700 unique genes) and 10,545,304 genetic variants with MAF ≥ 0.005. We assigned a PIP to each gene-tissue pair, gene, and non-mediated genetic variant using the locus in which the genetic element was most central36; the position of a gene was determined by its TSS. TGFM running times are reported in Supplementary Table 7. We have publicly released PIPs for all gene-tissue pairs, genes, and non-mediated variants for each disease/trait (see Data Availability).
Results are summarized in Figure 3 (16 independent traits36), Supplementary Figure 13 (all 45 traits), and Supplementary Table 8. Across all 45 traits, TGFM identified 711 gene-tissue-trait triplets, 2,800 gene-trait pairs (aggregating gene-tissue PIPs across tissues; see above), and 5,893 non-mediated genetic variant-trait pairs at PIP > 0.5 (43 gene-tissue-trait triplets, 382 gene-trait pairs, and 2,675 non-mediated genetic variant-trait pairs at PIP > 0.9). The number of gene-tissue pairs with PIP > 0.5 ranged from 0 (Number of children) to 56 (White blood cell count) across traits, and ranged from 0 (coronary artery) to 197 (whole blood) across tissues. Of the 711 gene-tissue-trait triplets with PIP > 0.5, 180 (25%) had TWAS p-value > 0.05/119,270 = 4.2 × 10−7 (the Bonferroni significance threshold based on 119,270 gene-tissue pairs with cis-predicted expression models42) and 136 (19%) had no nearby variants in the same fine-mapping region with GWAS p-value ≤ 5 × 10−8. Of the 110,828 gene-tissue-trait triplets with TWAS p-value ≤ 4.2 × 10−7, only 531 (0.5%) had TGFM PIP > 0.5. The proportion of causal genetic elements (variants or gene-tissue pairs) that were gene-tissue pairs was equal to 8.1% when counting PIP > 0.5 genetic elements across 16 independent traits (271 gene-tissue pairs and 3,074 non-mediated genetic variants), or 10.1% when summing PIPs across 16 independent traits (Methods), consistent with previous estimates of the proportion of trait heritability mediated by gene expression28.
For each trait, we identified the most frequently implicated tissues, computing the proportion of fine-mapped gene-tissue pairs in each tissue by counting gene-tissue pairs with PIP > 0.5 (Methods). Results are reported in Figure 4a (14 representative traits), Supplementary Figure 14 (all 45 traits), and Supplementary Table 9. Implicated tissue-trait pairs were consistently concentrated in expected trait-critical tissues, e.g., 50% in spleen and 30% in lymphocytes for All autoimmune disease, 71% in skin (sun-exposed) for Vitamin D level, 60% in liver and 25% in whole blood for Total cholesterol, and 36% in artery tibial and 32% in artery aorta for Diastolic blood pressure. Results were similar at other PIP thresholds (Supplementary Figure 15). Separately, we assessed the statistical significance of implicated tissue-trait pairs by applying genomic bootstrapping to the TGFM tissue-specific prior (Methods). Results are reported in Supplementary Figure 14 and Supplementary Table 9. This approach identified 23 tissue-trait pairs with FDR ≤ 0.05 (64 tissue-trait pairs with FDR ≤ 0.2). Despite limited power, the TGFM tissue-specific prior identified 6 traits with more than one significantly associated tissue (FDR ≤ 0.05; 17 traits at FDR ≤ 0.2); this result motivates the use of TGFM over a two-step approach of separately identifying the causal gene using a gene-level fine-mapping method24 and identifying the causal tissue using a method for identifying trait-critical tissues5,7,11. Significant tissue-trait pairs were consistently concentrated in expected trait-critical tissues, analogous to Figure 4a. Although inference of trait-critical tissues is not a primary goal of TGFM (and there exist previous methods for inferring trait-critical tissues using eQTL data5,11), these results validate the TGFM tissue-specific prior.
We observed instances where TGFM was unable to distinguish the causal tissue within a small set of highly correlated tissues. For example, for waist-hip-ratio adjusted for BMI (WHRadjBMI), TGFM fine-mapped only 1 gene-tissue pair in adipose visceral and 0 gene-tissue pairs in adipose subcutaneous tissue at PIP > 0.5, despite strong prior evidence of the role of adipose tissue in WHRadjBMI7,43; however, TGFM fine-mapped 10 gene-tissue pairs in adipose (defined as adipose subcutaneous U adipose visceral) at PIP > 0.5 when summing PIPs of gene-tissue pairs across the two tissues (Methods). Unsurprisingly, the average correlation in cis-predicted gene expression of adipose subcutaneous vs. adipose visceral across all genes included in TGFM was very large (0.92). Average correlations for all pairs of tissues are reported in Supplementary Figure 16 and Supplementary Table 10; the average correlation ranged from 0.48 (whole blood and brain cerebellum) to 0.96 (brain substantia nigra and brain spinal cord), and the correlation patterns reflected known relationships between tissues.
We sought to validate genes prioritized by TGFM by assessing their overlap with genes prioritized by independent gene sets/scores. First, we assessed overlap with PoPS44, a similarity-based gene score that prioritizes trait-relevant genes from gene-level features such as cell-type-specific expression. Results are reported in Figure 4b and Supplementary Table 11. The average PoPS score increased as a function of TGFM (Gene) PIPs in different bins, from −0.0046 (s.e. 0.0031) for trait-gene pairs with PIP < 0.01 to 0.34 (s.e. 0.047) for trait-gene pairs with PIP ≥ 0.9; this provides an external validation of genes prioritized by TGFM. Second, we assessed overlap with 10 non-disease-specific gene sets (e.g. High-pLI genes45) that are known to be enriched/depleted for disease heritability (from Figure 4 of ref. 26). Results are reported in Supplementary Figure 17a and Supplementary Table 12. Genes with TGFM (Gene) PIP > 0.5 were significantly (FDR < 0.05) overrepresented in Epigenetic modifier genes46 (odds ratio: 1.78), High-pLI genes45 (odds ratio: 1.31) and Mouse Genome Informatics (MGI) essential genes47 (odds ratio: 1.31) and underrepresented in genes with the most SNPs within 100kb (odds ratio: 0.48), consistent with previous findings26. Results were similar at other PIP thresholds (Supplementary Figure 17b).
TGFM pinpoints disease genes and tissues of action
We highlight 6 examples of fine-mapped (PIP > 0.5) gene-tissue-trait triplets that recapitulate known biology or nominate biologically plausible mechanisms. First, TGFM fine-mapped TPO (Thyroid Peroxidase) in thyroid for Hypothyroidism (Figure 5a; gene-tissue PIP: 0.88; gene PIP: 0.88). TPO is an enzyme involved in thyroid hormone biosynthesis and its involvement in Hypothyroidism has been well-studied48,49, and TPO has been linked to Hypothyroidism in genetic association studies50. Thyroid was also identified as a Hypothyroidism-critical tissue genome-wide (proportion of fine-mapped gene-tissue pairs = 0.21, bootstrap p = 0.04 for tissue-specific prior; Supplementary Figure 14).
Second, TGFM fine-mapped OVOL1 in lymphocytes for Eczema (Figure 5b, gene-tissue PIP: 0.75, gene PIP: 0.76). Recent work demonstrated that loss of OVOL1 results in skin inflammation and Eczema via an immune-mediated mechanism in T cells (a lymphocyte cell type)51–53, and OVOL1 has previously been linked to Eczema in genetic association studies54. Lymphocyte was suggestively implicated as an Eczema-critical tissue genome-wide (proportion of fine-mapped gene-tissue pairs = 0.36, bootstrap p = 0.08 for tissue-specific prior; Supplementary Figure 14). There exist 28 other gene-tissue pairs within 1 Mb of the TSS of OVOL1 (4 of which correspond to OVOL1 in a tissue other than lymphocytes) that had significant TWAS p-values (p ≤ 0.05 / 119,270 = 4.2 × 10−7) but were not fine-mapped by TGFM (all with PIP ≤ 0.01), underscoring the benefit of joint fine-mapping of gene-tissue pairs. TGFM also fine-mapped one non-mediated variant (rs56225074; PIP: 0.55) within 1 Mb of the TSS of OVOL1, perhaps due to finite eQTL sample size and/or absence of the causal cell-type or context in GTEx expression data28,55 (see Discussion).
Third, TGFM fine-mapped PADI1 in skin (sun exposed) for Vitamin D level (Figure 5c; gene-tissue PIP: 0.64, gene PIP: 0.65). TGFM assigned PADI1 a PIP of 0.64 in skin (sun exposed) tissue (TWAS p = 1.7 × 10−20) and a PIP of 0.01 in skin (not sun exposed) tissue (TWAS p = 3.8 × 10−19), demonstrating TGFM’s ability to distinguish trait-critical tissues from closely related tissues. PADI1 is known to interact with keratins during epidermal differentation56,57, and has previously been linked to Vitamin D level in genetic association studies58,59. Skin (sun exposed) was also identified as a Vitamin D level-critical tissue genome-wide (proportion of fine-mapped gene-tissue pairs = 0.72, bootstrap p = 0.008 for tissue-specific prior; Supplementary Figure 14).
Fourth, TGFM fine-mapped MC4R in brain cerebellum for Systolic blood pressure (Figure 5d, gene-tissue PIP: 0.66, gene PIP: 0.66), despite a non-significant TWAS p-value (p = 3.3 × 10−5 > 0.05/119,270 = 4.2 × 10−7). Previous work has shown that activation of MC4R in the central nervous system increases sympathetic nervous system activity and blood pressure60–62, and MC4R has previously been linked to hypertension in genetic association studies63. Brain cerebellum was identified as a Systolic blood pressure-critical tissue genome-wide (proportion of fine-mapped gene-tissue pairs = 0.16, bootstrap p = 0.002 for tissue-specific prior; Supplementary Figure 14), consistent with previous studies61,64,65. Strong support of brain cerebellum genes from the tissue-specific prior enabled TGFM to prioritize MC4R-brain cerebellum instead of more significant nearby GWAS associations of non-mediated genetic variants.
Fifth, TGFM fine-mapped SLC20A2 in artery aorta for Systolic blood pressure (Figure 5e, gene-tissue PIP: 0.91, gene PIP: 0.91). Previous work has shown that loss of SLC20A2 results in Human idiopathic basal ganglia calcification66,67 and well as arteriolar calcification68, but SLC20A2 has not previously been linked to Systolic blood pressure to our knowledge. Artery aorta was also identified as a Systolic blood pressure-critical tissue genome-wide (proportion of fine-mapped gene-tissue pairs = 0.36, bootstrap p = 0.0008 for tissue-specific prior; Supplementary Figure 14). We note that we have highlighted two fine-mapped gene-tissue pairs for Systolic blood pressure involving two different tissues (artery aorta and brain cerebellum); this demonstrates the advantages of TGFM over a two-step approach of separately identifying the causal gene using a gene-level fine-mapping method24 and identifying the causal tissue using a method for identifying trait-critical tissues5,7,11.
Sixth, TGFM fine-mapped NMT1 in brain cerebellum for Menarche age (Figure 5f, gene-tissue PIP: 0.53, gene PIP: 0.86); TGFM also assigned NMT1 in brain limbic a PIP of 0.10. Recent work demonstrated that NMT1 cis-predicted expression in brain tissues was marginally associated with image-derived brain phenotypes69,70, but NMT1 has not previously been linked to Menarche age to our knowledge. Brain cerebellum was also identified as a Menarche age-critical tissue genome-wide (proportion of fine-mapped gene-tissue pairs = 0.25, bootstrap p = 0.05 for tissue-specific prior; Supplementary Figure 14), consistent with previous studies3,71. There exist 20 other gene-tissue pairs within 1 Mb of the TSS of NMT1 (11 of which correspond to NMT1 in a tissue other than brain cerebellum) that had significant TWAS p-values (p ≤ 0.05 / 119,270 = 4.2 × 10−7) but were not fine-mapped by TGFM (all with PIP ≤ 0.01), underscoring the benefit of joint fine-mapping of gene-tissue pairs. Additional examples are discussed in the Supplementary Note and Supplementary Figure 18 (including an additional implicated tissue for Vitamin D level).
TGFM pinpoints disease genes and fine-grained cell-types in single-cell eQTL data
It is widely hypothesized that eQTL in fine-grained cell types/contexts may help resolve the limited proportion of disease heritability explained by eQTL in bulk tissues28,55. Accordingly, we applied TGFM to fine-map disease genes and fine-grained cell types in 45 disease and complex traits from the UK Biobank (same diseases/traits as above) using a recently generated single-cell eQTL data set spanning 9 fine-grained peripheral blood mononuclear cell (PBMC) cell types33 (average N=112; Methods and Supplementary Table 14). We converted single-cell expression measurements in each cell type to pseudobulk expression (see Methods), and also included the 38 GTEx tissues analyzed above. For each disease/trait, we applied TGFM to 2,682 overlapping 3-Mb loci spanning 1,851 (protein-coding) gene-PBMC cell type pairs with cis-predicted expression models (206 genes per cell type on average, Supplementary Table 14; 1,066 unique genes), 119,270 (protein-coding) gene-tissue pairs with cis-predicted expression models (GTEx tissues; 3,139 genes per tissue on average, Supplementary Table 6; 13,700 unique genes) and 10,545,304 genetic variants with MAF ≥ 0.005. We assigned a PIP to each gene-PBMC cell type pair, gene-tissue pair, gene, and non-mediated genetic variant, analogous to above. We have publicly released PIPs for all gene-PBMC cell type pairs, gene-tissue pairs, genes, and non-mediated variants for each disease/trait (see Data Availability).
Results are reported in Figure 6a-b (18 representative traits), Supplementary Figure 19 (all 45 traits), and Supplementary Table 15. Across all 45 traits, TGFM identified 30 gene-PBMC cell type-trait triplets at PIP > 0.5; TGFM was not sufficiently powered to detect any gene-PBMC cell type-trait triplets at PIP > 0.9, likely due to the limited single-cell eQTL sample size. Of the 30 gene-PBMC cell type-trait triplets with PIP > 0.5, 25 involved a trait locus that had no confidently fine-mapped gene-tissue pair (no TGFM PIP > 0.5 in GTEx-only analysis corresponding to Figures 3-5). For the 18 representative traits, TGFM identified 12 gene-PBMC cell type pairs at PIP > 0.5 for autoimmune diseases and blood cell traits (Figure 6a) vs. 5 gene-PBMC cell type pairs at PIP > 0.5 for non-blood-related traits (Figure 6b; includes 2 gene-PBMC cell type pairs for Menarche age, which we conservatively labeled as non-blood-related even though it has been reported to be partially immune-mediated72,73), increasing to 23 vs. 7 for all 45 traits; this validates the importance of gene expression in blood cell types for blood-related traits.
For each trait, we identified the most frequently implicated PBMC cell types, computing the proportion of fine-mapped gene-PBMC cell type pairs in each PBMC cell type by counting gene-PBMC cell type pairs with PIP > 0.5. Results are reported in Figure 6c-e (3 representative blood-related traits), Supplementary Figure 20 (all 45 traits), and Supplementary Table 15. Despite low power, the fine-mapped gene-PBMC cell type pairs at PIP > 0.5 were concentrated in expected trait-critical PBMC cell types, e.g. 62.5% in non-classical monocyte (ncM) cells and 37.5% in classical monocyte (cM) cells for Monocyte count, 100% in CD4+ T (CD4) cells for Lymphocyte count, and 100% in CD8+ T (CD8) cells for All autoimmune disease7,74,75. Results were similar at other PIP thresholds (Supplementary Figure 20).
We highlight 4 examples of fine-mapped (PIP > 0.5) gene-PBMC cell type-trait triplets that recapitulate known biology or nominate biologically plausible mechanisms. First, TGFM fine-mapped CTLA4 (Cytotoxic T-lymphocyte associated protein 4) in CD8+ T (CD8) cells for All autoimmune disease (Figure 7a; gene-PBMC cell type PIP: 0.84; gene PIP: 0.85). CTLA4 is a well-studied regulator of immune responses in T cells76,77 and has previously been linked to autoimmune diseases (rheumatoid arthritis, systemic lupus erythematosus, and type 1 diabetes) in genetic association studies78,79. CD8 cells were also identified as an All autoimmune disease-critical cell type genome-wide (bootstrap p = 0.05 for tissue-specific prior; Supplementary Table 17). CTLA4 did not meet the criteria for having a cis-predicted expression model in any GTEx tissue, underscoring the advantages of modeling gene expression in fine-grained PBMC cell types.
Second, TGFM fine-mapped CD52 in classical monocyte (cM) cells for Monocyte count (Figure 7b; gene-PBMC cell type PIP: 0.55; gene PIP: 0.88; gene-tissue PIP ≤ 0.05 in all GTEx tissues); TGFM also assigned non-negligible PIPs to CD52 in multiple correlated PBMC cell types and GTEx tissues including 0.29 in non-classical monocyte (ncM) cells and 0.05 in GTEx whole blood. Previous work demonstrated that CD52 regulates immune homeostasis in monocytes and T cells by inhibiting NF-κB signalling80–82, but CD52 has not previously been linked to Monocyte count to our knowledge. cM cells were also identified as a Monocyte count-critical cell type genome-wide (bootstrap p = 0.04 for tissue-specific prior; Supplementary Table 17). There exist 56 other gene-tissue pairs and 3 other gene-PBMC cell type pairs within 1 Mb of the TSS of CD52 (14 and 2 of which correspond to CD52 in a cell type or tissue other than cM, respectively) with significant TWAS p-values (p ≤ 0.05 / 121,121 = 4.1 × 10−7, where 121,121 = 119,270 + 1,851) but not fine-mapped by TGFM (all with PIP ≤ 0.01), underscoring the benefit of joint fine-mapping of gene-tissue and gene-PBMC cell type pairs. CD52 was not prioritized in any tissue in the GTEx-only analysis of Figure 3 (highest gene-tissue PIP = 0.07 (whole blood)), underscoring the advantages of modeling gene expression in fine-grained PBMC cell types.
Third, TGFM fine-mapped KLF13 in CD4+ T (CD4) cells for Lymphocyte count (Figure 7c; gene PBMC cell type PIP: 0.75; gene PIP: 0.75; gene-tissue PIP ≤ 0.01 in all GTEx tissues) despite a non-significant TWAS p-value (p = 3.5 × 10−6 > 0.05 / 121,121 = 4.1 × 10−7). KLF13 has been shown to regulate lymphocyte development and survival83,84, but KLF13 has not previously been linked to Lymphocyte count in genetic association studies to our knowledge. CD4 cells were suggestively implicated as a Lymphocyte count-critical cell type genome-wide (bootstrap p = 0.12 for tissue-specific prior; Supplementary Table 17). KLF13 was not prioritized in any tissue in the GTEx-only analysis of Figure 3 (highest gene-tissue PIP = 0.002 (brain cerebellum)), again underscoring the advantages of modeling gene expression in fine-grained PBMC cell types.
Fourth, TGFM fine-mapped HMGB1 in B cells for Mean corpuscular hemoglobin (Figure 7d; gene-tissue PIP: 0.75; gene PIP: 0.75). HMGB1 has previously been shown to mediate anemia of inflammation (i.e. anemia resulting from a prolonged immune response) in mice85,86, but HMGB1 has not previously been linked to Mean corpuscular hemoglobin to our knowledge. B cells were also identified as a Mean corpuscular hemoglobin-critical cell type genome-wide (bootstrap p = 0.05 for tissue-specific prior; Supplementary Table 17). HMGB1 did not meet the criteria for having a cis-predicted expression model in any GTEx tissue, again underscoring the advantages of modeling gene expression in fine-grained PBMC cell types. TGFM also fine-mapped one non-mediated variant (rs149180914; PIP: 0.59) within 1 Mb of the TSS of HMGB1 (see Discussion).
Discussion
We developed a new method, TGFM, that jointly fine-maps causal gene-tissue pairs and non-mediated genetic variants at disease-associated loci. We applied TGFM to 45 UK Biobank diseases and traits using 38 GTEx tissues and identified many causal gene-tissue pairs (PIP > 0.5), which were concentrated in known disease-critical tissues2–11 and strongly enriched in known disease-relevant genes44,45. Causal gene-tissue pairs identified by TGFM recapitulated known biology, but also included biologically plausible novel findings. We further applied TGFM to single-cell eQTL data from 9 cell types in PBMC (analyzed jointly with GTEx tissues) and identified additional causal gene-PBMC cell type pairs (PIP > 0.5), primarily for autoimmune disease and blood cell traits.
TGFM is distinct from previous methods for fine-mapping causal genes in three ways. First, TGFM identifies causal gene-tissue pairs, not just causal genes. Second, TGFM jointly models the disease contribution of each gene-tissue pair and non-mediated variant, disentangling causal gene-tissue pairs from both tagging gene-tissue pairs and tagging non-mediated genetic variants. Third, TGFM employs a sampling procedure that accounts for uncertainty in cis-predicted expression models. Our simulations show that TGFM attains accurate calibration, in contrast to previous methods such as coloc17 and FOCUS24 (Figure 1). We attribute the superior calibration to both our joint modeling of gene-tissue pairs and non-mediated variants, and our sampling procedure that accounts for uncertainty cis-predicted expression models (Supplementary Figure 6a-b). We note that a recent preprint developed a method, causal-TWAS27, to jointly fine-map genes (from a single tissue) and non-mediated genetic variants; however causal-TWAS only fine-maps genes, not fine-map gene-tissue pairs, and does not account for uncertainty in cis-predicted expression models. Another method, CAFEH19, was recently developed to identify causal variants underlying GWAS and eQTL in multiple tissues; however, CAFEH does not model co-regulation between gene-tissue pairs and is thus unable to distinguish causal from tagging gene-tissue pairs. Other recent studies have made valuable contributions in nominating causal gene-tissue pairs for disease by linking fine-mapped causal variants to causal genes using tissue-specific SNP-to-gene linking strategies such as Activity-By-Contact or EpiMap enhancer-gene linking12–14. However, this approach, unlike TGFM, is based on heuristic prioritization and does not provide direct evidence that the regulatory variant’s effect is mediated by the nominated gene-tissue pair.
We note several limitations of our work. First, TGFM leverages in-sample LD from the GWAS sample35,87 (analogous to other fine-mapping methods36), but in-sample LD may be unavailable in some applications (e.g. disease consortium meta-analyses). Following ref. 36, our recommendation when in-sample LD is unavailable is as follows: if there exists a LD reference panel from a population closely matching the GWAS sample population spanning at least 10% of the GWAS sample size, run the default version of TGFM with the LD reference panel; otherwise, run TGFM assuming a single causal genetic element per locus (in the latter case, no LD reference panel is needed). Second, TGFM may be susceptible to false positives in the case of unassayed causal genetic elements (analogous to other fine-mapping methods23,30,36,41). Specifically, if the causal gene-tissue pair is not assayed, TGFM may prioritize a correlated assayed gene-tissue pair or a correlated non-mediated genetic variant. We anticipate that this limitation will be mitigated over time as emerging eQTL data sets increasingly capture diverse tissues, cell types, and cellular contexts88. Third, TGFM is only moderately well-powered to detect causal gene-tissue pairs, particularly at lower eQTL sample sizes (Figure 1). In addition, at low eQTL sample sizes, undiscovered causal gene-tissue pairs may be falsely prioritized as non-mediated genetic variants (Supplementary Figure 7). We anticipate this limitation will be mitigated over time as eQTL data sets increase in size89. Fourth, a recent study showed that GWAS and eQTLs studies are well-powered to detect different types of genetic variants, limiting the number of GWAS associations that can be explained by eQTL data at current sample sizes90 Nevertheless, TGFM identified causal gene-tissue pairs at hundreds of GWAS loci using current eQTL data, and we anticipate that TGFM will identify a larger number of causal gene-tissue pairs over time as eQTL data sets increase in size89 and capture increasingly diverse tissues, cell types, and cellular contexts88. Fifth, it is theoretically possible that TGFM could prioritize gene-tissue pairs due to reverse causality, whereby the disease/trait causally impacts the gene-tissue pair. However, we believe that this is very unlikely in practice, as cis-eQTL variants with a detectable effect on gene expression at current eQTL sample sizes explain a substantial proportion of gene expression variation25,91,92 whereas diseases/traits are highly polygenic with each causal variant (outside the HLA locus, which we exclude from our analyses) explaining a small proportion of disease/trait variation37,93–95. Sixth, we have focused here on single-ancestry fine-mapping, but an important future direction is to extend this work to multi-ancestry fine-mapping (incorporating multi-ancestry eQTL analysis96,97), which is likely to further increase fine-mapping power98–100. Seventh, an important future direction is to extend TGFM to incorporate variant-level functional annotations that are enriched for disease heritability3,38–40, which is likely to further increase fine-mapping power36,101. Eighth, another important future direction is to extend TGFM to incorporate gene sets that are enriched for disease heritability explained by cis-predicted expression26, which may also further increase fine-mapping power. Finally, we have focused here on cis-genetic components of gene expression, but TGFM could be extended to genetic components of other molecular traits102–107. Despite these limitations, TGFM is a robust and powerful method for fine-mapping causal tissues and genes at disease-associated loci.
Methods
TGFM model overview
TGFM estimates the posterior inclusion probability (PIP) for each genetic element (gene-tissue pair or genetic variant) to have a non-zero causal effect on disease using a model that includes mediated causal effects of each gene-tissue pair (via the cis-genetic component of expression of a given gene in a given tissue) and non-mediated causal effects of each genetic variant: where Y denotes the phenotype vector across GWAS individuals, g indexes genes, t indexes tissues, X is the matrix of genotypes, Wgt is the vector of the cis-genetic component of gene expression across GWAS individuals in gene g and tissue t, αgt denotes the (scalar) effect of cis-genetic expression in gene g and tissue t on the disease or trait, β denotes the vector of non-mediated causal effects of each genetic variant on the disease or trait, and ϵ denotes environmental noise. We assume that the trait Y, the cis-genetic component of gene expression Wgt in each gene g and tissue t, and the genotype vector of each variant (each column of X) are standardized to have mean 0 and variance 1. We model the cis-genetic component of gene expression as a linear combination of variant-level effects: where δgt denotes the vector of causal cis-eQTL effect sizes of each variant on gene expression in gene g and tissue t. We emphasize that we model the phenotype Y as a linear combination of the unobserved true cis-genetic component of gene expression Wgt (a deterministic function of the unobserved true causal eQTL effect sizes Δgt) in each gene and tissue. The predicted cis-genetic component of gene expression can be estimated (according to predicted causal eQTL effect sizes ), with uncertainty, from finite sample-size eQTL data sets in the specified tissue t, and provide noisy estimates of the true unobserved cis-genetic component of gene expression Wgt. Later (see below), we explain how TGFM models uncertainty in predicted cis-genetic expression.
TGFM places the Sum of Single Effects (SuSiE)30,31 fine-mapping prior distribution on the vector of causal disease (mediated and non-mediated) effect sizes: where α denotes the vector of expression-mediated causal effects of genetic gene expression of each gene-tissue pair on the disease or trait, β denotes the vector of non-mediated causal effects of each genetic variant on the disease or trait, [α, β] denotes the concatenated vector of mediated and non-mediated genetic effects, l indexes fine-mapping components where a single component represents the disease signal from a single genetic element, γl denotes a Categorical random variable indicating which one of the genetic elements disease component l comes from, π denotes the simplex vector of prior probabilities on each genetic element being causal, dl denotes a Gaussian random variable specifying the causal effect size of component l, and denotes the prior variance on dl. This approach assumes the true causal disease effect sizes originate in a small number (l) of genetic elements with non-zero effects.
TGFM will automatically infer posterior distributions on the random variables defining equations 5 and 6 (γl, dl, ; inference details provided below). Posterior inclusion probabilities (PIPs), or the probability that a genetic element has a non-zero effect on disease, can be calculated for each genetic element from these inferred posterior distributions as follows: where j indexes genetic elements, l indexes components, PIPj denotes the PIP for genetic element j, and denotes the expected value of the posterior distribution on γl for genetic element j.
Sum of Single Effects (SuSiE) fine-mapping prior distribution and inference
The SuSiE prior distribution was developed in refs. 30,31 for the purpose of fine-mapping trait-causal variants. We briefly summarize the SuSiE fine-mapping prior distribution here: where β denotes the vector of causal effects of each genetic variant on the disease or trait, l indexes fine-mapping components where a single component represents the disease signal from a single genetic variant, γl denotes a Categorical random variable indicating which one of the genetic variants disease component l comes from, π denotes the simplex vector of prior probabilities on each genetic variant being causal, dl denotes a Gaussian random variable specifying the causal effect size of component l, and denotes the prior variance on dl.
Briefly, the SuSiE prior distribution assumes that only a subset of variants (L total) have non-zero effect (i.e., are causal) on the trait, the effect sizes of each causal variant are independent, and the trait causal effect sizes can be calculated by summing the causal effect size of each of the L causal variants. A “single effect” refers to the effect of one of the L causal variants; hence why the model is called Sum of Single Effects.
Ref. 30 proposed a simple model-fitting approach (to infer posterior distributions on random variables in equations 9 and 10), which the authors referred to as Iterative Bayesian Stepwise Selection (IBSS). Briefly, IBSS iteratively updates each of the L single effects while keeping all other single effects fixed. It is computationally simple to update a single effect given fixed values of all other single effects (details of updating a single effect provided in the supplement of Ref. 30).
Overview of TGFM inference
TGFM inference consists of four steps. In step 1, we apply SuSiE to perform eQTL fine-mapping of each gene-tissue pair in the external gene expression data set (estimating a posterior distribution of the causal cis-eQTL effect sizes for each gene-tissue pair). In step 2, we randomly sample 100 cis-predicted expression models for each gene-tissue pair from the posterior distributions of causal cis-eQTL effect sizes estimated in step 1 (Methods). In step 3, we apply SuSiE to perform disease fine-mapping in the target data set (estimating the PIP of each genetic element) 100 times, iterating over the sampled cis-predicted expression models for each gene-tissue pair from step 2. In step 4, we average the results of step 3 across the 100 disease fine-mapping runs. TGFM utilizes a custom implementation of the SuSiE algorithm that provides efficient estimation of PIPs across 100 parallel SuSiE runs that differ only in their cis-predicted expression models.
TGFM inference Step 1: Estimating causal eQTL effect size distributions from external eQTL data
TGFM inference relies on probability distributions defining the causal eQTL effects for each gene-tissue pair. These causal eQTL effect size distributions are estimated by applying SuSiE30 to eQTL data; SuSiE infers the following posterior distribution on the causal eQTL effect sizes for a given gene-tissue pair from eQTL data: where δgt denotes the vector of causal eQTL effect sizes corresponding to the effects of standardized cis-variants on gene g in tissue t, l indexes fine-mapping components where a single component represents the eQTL signal from a single cis-genetic variant, λl denotes a Categorical random variable indicating which one of the genetic variants component l comes from, denotes the simplex vector of inferred posterior probabilities on each genetic variant being causal for component l, dlk|λlk = 1 denotes a Gaussian random variable specifying the causal effect size of component l for variant k conditioned on variant k being causal for component l, and and define the inferred posterior mean and variance on dlk|λlk. The posterior mean causal eQTL effect sizes of this distribution are .
We restrict TGFM to gene-tissue pairs that are well-predicted by genetic variants, using the SuSiE “purity filter”30. Specifically, this filter removes any gene-tissue pair such that all components defining the gene-tissue pair have a minimum absolute correlation between all variants in the component’s 95% credible set less than 0.5. The 95% credible set for a given component is calculated by selecting the minimum set of variants that contain the causal variant with 95% confidence (according to the inferred posterior distribution ).
Although in the present study we utilized SuSiE for generating distributions of causal eQTL effect sizes due to its computational efficiency, the TGFM inference procedure is generalizable to a variety of methods that generate probabilistic cis-predicted expression models potentially including probabilistic/Bayesian multivariable regression methods (for example, BSLMM108, SBayesR109, or LDpred2110) or bootstrapping111.
TGFM inference Step 2: Randomly sample 100 cis-predicted expression models for each gene-tissue pair
Instantiations of the causal eQTL effect sizes (which determine instantiations of cis-predicted expression models) for each gene-tissue pair can be randomly sampled from that gene-tissue pair’s SuSiE-inferred posterior distribution (equations 11-13). To draw a single random sample of the causal eQTL effects for a given gene-tissue pair: (1) randomly sample the causal variant for each of the l components from (2) randomly sample the causal eQTL effect size of the sampled causal variant k (identified in Step 1) for each of the l components from (3) sum the sampled causal eQTL effects across the l components.
TGFM inference step 3: Fine-mapping inference conditioned on a sampled cis-predicted expression models
We describe here tissue-gene fine-mapping inference conditioned on setting the cis genetic component of gene expression for each gene-tissue pair equal to a sampled cis-predicted expression model (see TGFM inference step 2) for that gene-tissue pair. In this setting: where g indexes genes, t indexes tissues, denotes the predicted cis-genetic component of gene expression in gene g and tissue t (as determined by sampled cis-predicted expression model’s causal eQTL effect sizes ), Y denotes the phenotype vector across GWAS individuals, X is the matrix of genotypes, αgt denotes the (scalar) effect of cis-genetic expression in gene g and tissue t on the disease or trait, β denotes the vector of non-mediated causal effects of each genetic variant on the disease or trait, and ϵ denotes environmental noise. The trait Y, the predicted cis-genetic component of gene expression in each gene g and tissue t, and the genotype vector of each variant (each column of X) are standardized to have mean 0 and variance 1. We place SuSiE fine-mapping prior distributions on the disease/trait causal effect sizes (equations 4-6).
In this setting, we use the existing SuSiE software for inference of posterior distributions on the fine-mapping random variables (γl, dl, and in equations 5-6) and generation of corresponding PIPs for each genetic element (equation 7). We refer to these PIPs as conditional PIPs as they are conditional upon the predicted causal eQTL effect sizes. SuSiE inference applied to this task requires the following input: GWAS summary statistic z-scores for each non-mediated variant, transcriptome-wide association study (TWAS) summary statistic z-scores for each gene-tissue pair corresponding to the marginal association between predicted genetic gene expression and the trait, in-sample correlations between all pairs of genetic elements (variants and predicted genetic gene-tissue pairs) and specified prior probabilities π. We assume the user provides GWAS summary statistic z-scores for each variant, in-sample LD (ie. correlations between all pairs of genetic variants based on the GWAS samples), predicted causal eQTL effect sizes for each gene-tissue pair, and the prior causal probabilities π (we discuss below how π can be inferred). TWAS summary statistic z-scores and in-sample correlations between all genetic elements can be computed from the user-provided input as follows:
TWAS summary statistic z-scores for a particular gene-tissue pair can be easily calculated from GWAS summary statistic z-scores, in-sample variant LD, and predicted causal cis-eQTL effect sizes defining the genetic component of gene expression for the gene-tissue pair following ref. 21: where zgt denotes the TWAS summary statistic z-score for gene g in tissue t, Zgwas denotes the vector GWAS summary statistic z-scores across variants, denotes the vector of predicted causal eQTL effect sizes for gene g in tissue t, and Σ denotes the in-sample variant LD. We note that ref. 21 only requires LD from a reference panel (instead of in-sample LD) for TWAS; TGFM instead requires in-sample LD (see Discussion).
The correlation between the predicted genetic gene expression of two genes can be computed from in-sample variant LD and the predicted causal eQTL effect sizes of both genes: where gt indexes one gene-tissue pair and g′t′ indexes the other gene-tissue pair, denotes the predicted genetic component of gene expression in gene g in tissue t across in-sample GWAS individuals, denotes the vector of predicted causal eQTL effect sizes for gene g in tissue t, and Σ denotes the in-sample variant LD. Similarly, the in-sample correlation between a non-mediated variant and the predicted genetic component of gene-tissue pair is: where k indexes the non-mediated variant, gt indexes the gene tissue pair, Xk is the genotype of variant k across in-sample GWAS individuals, denotes the cis-predicted genetic component of gene expression in gene g and tissue t across in-sample GWAS individuals, Σk denotes row k of the in-sample variant LD matrix, and denotes the vector of causal eQTL effect sizes for gene g in tissue t.
TGFM inference step 4: Marginalize uncertainty in cis-predicted expression on fine-mapping PIPs by averaging fine-mapping results across 100 runs
TGFM PIPs for a given locus are calculated by marginalizing out the uncertainty in the cis-predicted causal eQTL effect sizes on the conditional PIPs: where is the TGFM PIP for genetic element j, δ is the set of causal eQTL effect sizes for all gene-tissue pairs in the fine-mapping region, is a specific instantiation of the set of causal eQTL effect sizes for all gene-tissue pairs in the fine-mapping region, is the conditional PIP of genetic element j conditioned on genetic gene expression determined by , and is the probability of estimating causal eQTL effect sizes given the eQTL data. The conditional PIPs, , can be inferred using SuSiE fine-mapping of both non-mediated variants and gene-tissue pairs where cis-predicted genetic gene expression is determined by (described in previous Methods subsection, TGFM inference step 3). The causal eQTL effect size distribution, p(δ), is approximated by the posterior distribution on causal eQTL effect sizes for each gene-tissue pair estimated by applying SuSiE to eQTL data (described in previous Methods subsection, TGFM inference step 1). In practice, we approximate in equation 18 by drawing 100 random samples of causal eQTL effect sizes across gene-tissue pairs from p(Δ) (described in previous Methods subsection, TGFM inference step 2), calculating the conditional PIP for each of the 100 random samples, and taking the average across 100 conditional PIPs. In the next methods subsection, we discuss how this approach can be extended to integrate the inferred tissue-specific prior.
In practice, we do not run the existing SuSiE package30,31 100 times in series for each fine-mapping region. TGFM utilizes a custom implementation of the SuSiE algorithm that provides efficient estimation of PIPs across 100 parallel runs where each of 100 parallel SuSiE runs differ only in their cis-predicted expression models. The rate-limiting step of each iteration of the SuSiE algorithm involves multiplying the LD matrix with that iteration’s estimate of the trait-causal effect sizes. The custom TGFM implementation exploits the fact that the variant LD matrix (which constitutes the majority of the full correlation matrix of all pairs of genetic elements) is identical across all 100 runs; TGFM uses matrix multiplication to multiply the variant LD matrix with the current causal non-mediated variant effects for each of the 100 runs (corresponding to a single (KXK)X(KX100) matrix multiplication instead of 100 (KXK)X(KX1) multiplications in series where K is the number of variants in the region). In addition, the custom TGFM implementation of SuSiE does not compute the ELBO at each iteration for each of the 100 runs; computing the ELBO is computationally intensive and primarily utilized to assess convergence. Instead, we run each of the 100 runs for a pre-specified number of iterations; we use a default of 5 iterations which performed well in simulations (Figure 2). We set the default number of components l underlying each of the 100 runs to 10.
The iterative algorithm underlying SuSiE is not guaranteed to reach a global optimum and can get stuck in local optima30. We found TGFM was prone to reaching local optima in which a non-causal gene-tissue pair was confidently fine-mapped (i.e., a false positive) when the gene-tissue pair was moderately correlated with multiple independent causal non-mediated variants. This is likely due to the greedy nature of the SuSiE iterative algorithm. We use the following initialization strategy to mitigate convergence on local optima: (1) Run TGFM where the causal effects are initialized to be zero (this is the default initialization used by SuSiE) (2) If the TGFM PIP for any gene-tissue pair in the fine-mapping region is greater than 0.2: (2a) Run SuSiE fine-mapping on only the non-mediated variants (2b) Run TGFM where the non-mediated variant effects are initialized to the converged values from step 2a and the gene-tissue pair effects are initialized to zero (2c) For each of the 100 TGFM runs, select the fitted TGFM model (either (1) or (2b)) with the larger ELBO.
Inference of tissue-specific TGFM prior causal probabilities
TGFM increases fine-mapping power by specifying tissue-specific prior probabilities for each genetic element in a locus that are informed by genome-wide data, similar to PolyFun36. For each trait separately, TGFM assigns one prior causal probability πt for each gene-tissue pair from tissue t and one prior causal probability πnm for each non-mediated genetic variant where πt reflects the prior probability that an arbitrary gene from tissue t at a disease-associated locus has a non-zero causal effect on the disease/trait and πnm reflects the prior probability that an arbitrary non-mediated variant at a disease-associated locus has a non-zero causal effect on the disease/trait. We note that πt is a genome-wide parameter reflecting the probability that an arbitrary gene from tissue t has non-zero effect on disease, which is related but distinct from genome-wide expression-mediated disease heritability parameters previously estimated in refs. 11,26,28. We infer πt and πnm separately for each disease/trait using an iterative algorithm, starting with flat priors, and at each iteration: (1) updating the PIP of each genetic element using a computationally efficient approximation of TGFM (see next paragraph for details) given the current prior causal probabilities, which are normalized to sum to one across genetic elements at each locus, analogous to PolyFun36. (2) updating the prior causal probabilities according to: where c is the genetic element class (for example a specific tissue or non-mediated variant), πc is the prior causal probability corresponding to genetic element class c, k indexes genetic elements from genetic element class c, PIPk is the current PIP of genetic element k, and |c| is the number of genetic elements belonging to genetic element class c.
It is computationally prohibitive to run TGFM genome-wide tens to hundreds of times for each trait while updating the prior probabilities at each iteration. We make two approximations to TGFM to allow for efficient computation of genome-wide PIPs at each iteration (and we emphasize that these approximations are only used for inference of the prior). First, we run TGFM with a single cis-predicted expression model for each gene-tissue pair (based on SuSiE posterior mean causal cis-eQTL effect sizes) instead of averaging results across 100 sampled cis-predicted expression models. We refer to this approach as TGFM (no sampling). Thus, PIPs at each iteration can be inferred by applying SuSiE to fine-map both non-mediated variants and gene-tissue pairs where genetic gene expression of each gene-tissue pair is determined by the posterior mean causal eQTL effect sizes. While we show TGFM (no sampling) generates poorly calibrated PIPs (Supplementary Figure 6), using TGFM (no sampling) to infer prior causal probabilities results in well-calibrated causal prior probabilities in simulations (Supplementary Figure 11a), perhaps because the causal prior probabilities integrate information across the genome (equation 19) and fine-mapping errors resulting from uncertainty in the genetic component of gene expression will be averaged out across all genes in the genome. Second, we only run TGFM (no sampling) inference once, during the first iteration. After running TGFM (no sampling) in the first iteration, we save the resulting Bayes Factors30 for each component-genetic element pair; a Bayes Factor reflects the relative support for including that genetic element in that component of the model, irrespective of the prior probabilities. PIPs can be easily calculated based on the Bayes Factors and the current prior: where l indexes fine-mapping components, j and k index genetic elements in the fine-mapping region, denotes the expected value of the posterior distribution on γl for genetic element j, BFlj denotes the Bayes Factor for genetic element j on component l, θj denotes the normalized prior causal probability of genetic element j, and PIPj is the TGFM (no sampling) PIP for genetic element j. We do not recalculate the Bayes Factors after the first iteration, and simply used the saved Bayes Factors from the first iteration in all subsequent iterations. This approximation is reasonable as while we expect PIPs to change with the evolving prior, we do not expect posterior mean causal effect sizes of each fine-mapping component to drastically change with the evolving prior, ultimately leaving the Bayes Factors stable.
During inference of the causal prior probabilities, TGFM (no sampling) is run genome-wide on overlapping 3Mb windows36. The prior probability updates at each iteration (equation 16) are calculated from PIPs corresponding to genetic elements located in the middle Mb of these 3 Mb windows. To limit to windows with at least one disease causal signal, we remove 3Mb windows from the prior probability inference procedure that do not pass the SuSiE “purity filter” (see above) after running TGFM (no sampling) with a uniform prior.
In addition, TGFM inference can also rely on probability distributions defining the uncertainty in our estimated prior causal probabilities (see below). Empirical distributions, as well as significance testing, on the causal prior probabilities can be calculated using 100 iterations of bootstrapping111 across the genome (we refer to this as “genomic bootstrapping”). Specifically, for each of the 100 bootstraps, we randomly sample T fine-mapping 3Mb windows with replacement (assuming T total 3Mb fine-mapping windows for the disease/trait) and run the iterative algorithm on the bootstrapped regions. This procedure results in 100 empirical samples of the causal prior probabilities which reflect their estimation uncertainty. These 100 empirical samples are input directly into the sampling procedure underlying TGFM inference (see below). Significance testing of whether the prior causal probability is greater than zero for a particular genetic element class can be generated using a z-score computed from the mean and standard error of the bootstrapped distribution. For a single analyzed trait, we assess significance using Benjamini-Hochberg112 FDR correction across all tissues and/or cell types included in the analysis.
Inference of TGFM (no sampling) is performed using the function ‘susie_rss’ from the SuSiE package30,31 with default parameters. We used the same initialization strategy that was used by TGFM (described above). The iterative algorithm for inference of prior causal probabilities was run for 400 iterations.
TGFM inference including uncertainty in tissue-specific prior causal probabilities
The previous subsection described inference of probability distributions defining the tissue-specific prior causal probabilities. Here we described an extension of TGFM inference step 4 that integrates out uncertainty in both cis-predicted causal eQL effect sizes and the prior causal probabilities on the conditional PIPs: where is the TGFM PIP for genetic element j, δ is the set of causal eQTL effect sizes for all gene-tissue pairs in the fine-mapping region, is a specific instantiation of the set of causal eQTL effect sizes for all gene-tissue pairs in the fine-mapping region, π are the causal prior probabilities, are a specific instantiation of the causal prior probabilities, is the conditional PIP of genetic element j conditioned on genetic gene expression determined by and causal prior probabilities equal to is the probability of estimating causal eQTL effect sizes given the eQTL data, and is the probability of estimating prior causal probabilities . In practice, we approximate by drawing 100 random samples of causal eQTL effect sizes across gene-tissue pairs from p(δ) and prior causal probabilities from p(π), calculating the conditional PIP for each of the 100 random samples, and taking the average across 100 conditional PIPs.
Calculating gene-level PIPs with TGFM
We define a gene as causal for a disease/trait if there exists at least one tissue where the gene-tissue pair is causal for the trait. Gene-level PIPs can be computed by aggregating gene-tissue pair fine-mapping results across all gene-tissue pairs corresponding to the gene of interest: where l indexes fine-mapping components, g indexes genes, k ∈ g indexes all gene-tissue pairs corresponding to gene g, denotes the expected value of the posterior distribution on γl for gene g, denotes the expected value of the posterior distribution on γl for gene-tissue pair k, and is the gene-level PIP corresponding to gene g. Similar to TGFM PIPs for variants and gene-tissue pairs, we are describing here the calculation of conditional gene PIPs (conditional upon a given instantiation of predicted causal cis-eQTL effect sizes and prior causal probabilities). These conditional gene PIPs will be averaged across 100 samples of cis-predicted genetic gene expression and predicted causal prior probabilities (equation 22).
This approach can also be used to calculate PIPs for causal genes in a specified subset of tissues, or gene-tissue subset PIPs. For example, identifying gene-tissue subset pairs for the subset of adipose tissues (defined as adipose subcutaneous U adipose visceral). Gene-tissue subset PIPs can be computed by aggregating (as done in equations 23-24) gene-tissue pair fine-mapping results across all gene-tissue pairs corresponding to the gene of interest from tissues in the tissue subset of interest.
Simulation framework
We used real genotypes from unrelated UK Biobank British (UKBB) samples32 to simulate both gene expression phenotypes (for each gene-tissue pair) and quantitative trait phenotypes. Default simulation parameters were specified as follows: the gene expression sample size ranged from 300 to 1000; the quantitative trait sample size was set to 100,000 (disjoint from gene expression samples); we analyzed 426,593 SNPs and 1,976 genes on chromosome 1; the number of tissues was set to 10, of which 2 were causal for the quantitative trait; the quantitative trait architecture was simulated to have average polygenicity37, consisting of 2,700 causal non-mediated variants and 300 causal gene-tissue pairs (150 for each causal tissue allowing for the option of two causal gene-tissue pairs from the same gene) with the expected heritability per causal genetic element (non-mediated variant or gene-tissue pair) set to 0.0001 (expected quantitative trait heritability of 0.3, 10% of which was mediated through gene expression, consistent with genome-wide estimates from MESC28); causal non-mediated variants were randomly selected with probability proportional to their expected per-variant heritability based on baseline-LD model annotations3,38,39 (estimated using S-LDSC3 applied to the UKBB trait White blood cell count) in order to make the simulations as realistic as possible. We simulated the genetic architecture of gene expression across related tissues using an approach similar to the approach implemented in ref. 11: we simulated all 1,976 protein-coding genes on chromosome 1 to be expressed in all tissues, with 50% of these gene-tissue pairs being cis-genetically heritable (only cis-genetically heritable gene-tissue pairs could have a simulated causal effect on the trait, though we considered all expressed gene-tissue pairs for fine-mapping, not just those that were heritable); each heritable gene-tissue pair was randomly assigned 5 causal cis-eQTLs within 100Kb of the gene’s TSS, 3 of the 5 causal cis-eQTLs were shared across tissues and 2 of the 5 causal cis-eQTLs were specific to each tissue; each causal cis-eQTL explains 1.5% of the variance of each gene-tissue pair (resulting in an average gene heritability of .075); effect sizes of shared causal cis-eQTLs covaried across tissues as follows (based on ref. 11): the tissues were split into 3 categories to mimic biological tissue modules in GTEx25 (tissues 1-3, tissues 4-6, and tissues 7-10) and the correlation of shared cis-eQTL effect sizes across tissues was set to 0.8 and 0.74 for tissues in the same tissue category and across tissues in different tissue categories, respectively.
Our simulation indicates that PIPs reported by TGFM are slightly anti-conservative with respect to (1 – average PIP) (Supplementary Figures 1, 4), consistent with previous simulations of variant-level fine-mapping methods using polygenic trait architectures36,41. Following ref. 36, we circumvent this problem by using an alternative FDR estimator given by (1 – PIP threshold); setting all PIPs greater than the specified PIP threshold equal to the PIP threshold. For example, at a PIP threshold of 0.9, we treat all genetic elements with PIP ≥ 0.9 as if they had PIP = 0.9.
UK Biobank GWAS summary statistics and in-sample LD
We applied to TGFM to 45 of the 49 UK Biobank traits analyzed via functionally informed fine-mapping in ref. 36 (average N=316K; Supplementary Table 5); the four excluded traits were Dermatology, Diabetes (any), Endocrine Diabetes, and Childless; these four diseases and traits were excluded due to redundancy and low heritability. We considered the set of 10,545,304 UK Biobank imputed variants with MAF ≥ 0.5% and INFO score ≥ 0.6, similar to previous work36,113. We used GWAS summary statistics that were generated and described in ref. 36. Briefly, the summary statistics were computed in ref. 36 from n=337,426 unrelated British-ancestry individuals in UK Biobank using BOLT-LMM114 adjusting for sex, age and age squared, assessment center, genotyping platform, the top 20 genotyping principal components, and dilution factor for biochemical traits (see ref. 36 for complete details). We used Liftover115 to convert variant positions from hg19 to hg38. Z-scores used by TGFM were computed from the Bolt-LMM output as follows: as the noninfinitesimal version of BOLT-LMM does not calculate effect sizes, we calculated z-scores by taking the square root of the BOLT-LMM chi-squared statistics and multiplying them by the sign of the effect size estimate from the infinitesimal version of BOLT-LMM.
We computed in-sample variant LD matrices using 337,426 unrelated British-ancestry individuals in UK Biobank (same individuals as summary statistics); missing values were imputed by the mean of a variant across individuals.
Overlapping 3Mb loci used for fine-mapping
We applied TGFM to fine-map each of the 2,682 overlapping 3Mb loci spanning the entire genome. Analogous to ref. 36, the overlapping 3Mb loci had 1 Mb spacing between the start points of consecutive loci, were limited to autosomal chromosomes, and did not include 3 long-range LD regions including the MHC region (chromosome 6 positions 25499772-33532223, chromosome 8 positions 8142478-12142491, and chromosome 11 positions 45978449-57232526 in hg38; lifted over from long-range LD regions ignored in ref. 36). Distinct from the windows generated in ref. 36, for each disease/trait, we limited TGFM fine-mapping to loci with at least 50 genetic variants and at least one genetic variant with marginal GWAS p-value less than 1e-5.
GTEx cis-predicted expression models
We analyzed GTEx25 data from 47 GTEx tissues, which were aggregated into 38 meta tissues of similar sample size consisting of European-ancestry individuals (average N=259, range: N=101-320 individuals, 23 meta-tissues with N=320; Supplementary Table 6) to reduce heterogeneity in eQTL sample sizes across tissues. Testis tissue was removed from analysis as it has outlier (cis- and trans-) eQTL discovery power after controlling for sample size (see ref. 25 Figure 2C). Meta-tissues were constructed using the same individuals and tissue aggregation strategy as described in ref. 11. Normalized expression matrices and covariates used in GTEx consortium’s single-tissue cis-eQTL analysis25 were downloaded from the GTEx portal (https://gtexportal.org/home/datasets) for each of the 47 analyzed tissues. In each tissue, we subset individuals to those composing the corresponding meta-tissue, and then re-standardize gene expression of each gene to have mean zero and variance one in each subsetted tissue. Cis-eQTLs were called in each tissue independently while controlling for covariates and limiting to variant-gene pairs such that the variant is within 500Kb of the gene’s TSS. For each meta-tissue, we removed variants with MAF < .05 across samples in the meta-tissue, were strand-ambiguous, or did not overlap the 10,545,304 analyzed UK Biobank variants. Cis-predicted expression models were generated using SuSiE30,31 applied to eQTL summary statistics and eQTL in-sample LD matrices (using the function ‘susie_rss’ from the SuSiE package30,31 with default parameters) for each (gene, meta-tissue) pair, independently. If a meta-tissue is composed of more than 1 constituent tissues, meta-analyzed eQTL summary statistics were generated using a fixed-effect meta-analysis across constituent tissues and meta-analyzed in-sample LD was generated by computing variant-variant correlations across all samples composing the meta-analyzed tissue. After removing gene-tissue pairs that did not pass the SuSiE “purity filter” (see above), we identified 119,270 gene-tissue pairs with a cis-predicted expression model; all 119,270 cis-predicted expression models are publicly available (see Data availability).
PBMC cis-predicted expression models
PBMC single-cell eQTL (described and generated in ref. 33) expression data was downloaded from the Human Cell Atlas Data Coordination Platform and genotype data downloaded from dbGaP (accession number phs002812.v1.p1). We removed individuals that were not European ancestry or had fewer than 2,500 detected cells. We generated pseudobulk expression in the 9 most abundant cell types (cell type assignment determined by ref. 33) for all individuals in each cell type with greater than 5 cells. In each of the 9 cell types separately, we removed genes that were expressed in less than 80% of that cell type’s pseudobulk samples. Pseudobulk expression in each cell type was transformed using EdgeR’s logCPM function116 followed by normalizing each gene to have mean 0 and variance 1. The top 10 expression PCs for each cell type were calculated based on the normalized expression matrix. The number of pseudobulk samples per cell type is described in Supplementary Table 14 (average N=112).
Cis-eQTLs were called in each cell type independently while controlling for covariates (the top 10 expression PCs) and limiting to variant-gene pairs such that the variant is within 500Kb of the gene’s TSS. For each cell type, we removed variants with MAF < 0.05 across samples in the cell type, that were strand-ambiguous, or did not overlap the 10,545,304 analyzed UK Biobank variants. Cis-predicted expression models were generated using SuSiE30,31 applied to eQTL summary statistics and eQTL in-sample LD matrices (using the function ‘susie_rss’ from the SuSiE package30,31 with default parameters) for each (gene, cell type) pair, independently. After removing gene-PBMC cell type pairs that did not pass the SuSiE “purity filter” (see above), we identified 1,851 gene-PBMC cell type pairs with a cis-predicted expression model; all 1,851 cis-predicted expression models are publicly available (see Data availability).
Data Availability
We have made TGFM PIPs for gene tissue pairs, gene PBMC cell type pairs, genes, and non mediated variants across 45 diseases/traits (for both analyses of 38 GTEx tissues + analyses of 38 GTEx tissues and 9 PBMC cell types) publicly available at https://doi.org/10.7910/DVN/S26PFI, GTEx cis predicted expression models for all gene tissue pairs publicly available at https://doi.org/10.7910/DVN/8IPOPK, PBMC cis predicted expression models for all gene PBMC cell type pairs publicly available at https://doi.org/10.7910/DVN/A6K9QW, GWAS summary statistics for all 45 diseases/traits publicly available at https://doi.org/10.7910/DVN/GTEGPE. To limit the use of computational resources, we refer the reader to UK Biobank in sample LD (337K unrelated British ancestry samples) from ref. 36, which is publicly available at https://registry.opendata.aws/ukbb-ld/. The UK Biobank resource is publicly available via application (http://www.ukbiobank.ac.uk/).
Data availability
We have made TGFM PIPs for gene-tissue pairs, gene-PBMC cell type pairs, genes, and non-mediated variants across 45 diseases/traits (for both analyses of 38 GTEx tissues + analyses of 38 GTEx tissues and 9 PBMC cell types) publicly available at https://doi.org/10.7910/DVN/S26PFI, GTEx cis-predicted expression models for all gene-tissue pairs publicly available at https://doi.org/10.7910/DVN/8IPOPK, PBMC cis-predicted expression models for all gene-PBMC cell type pairs publicly available at https://doi.org/10.7910/DVN/A6K9QW, GWAS summary statistics for all 45 diseases/traits publicly available at https://doi.org/10.7910/DVN/GTEGPE. To limit the use of computational resources, we refer the reader to UK Biobank in-sample LD (337K unrelated British-ancestry samples) from ref. 36, which is publicly available at https://registry.opendata.aws/ukbb-ld/. The UK Biobank resource is publicly available via application (http://www.ukbiobank.ac.uk/).
Code availability
Software implementing TGFM will be made publicly available at https://github.com/BennyStrobes prior to publication.
Acknowledgements
We are grateful to Arun Durvasula and Xilin Jiang for helpful discussions. This research was conducted using the UK Biobank resource under application no. 16549 and funded by National Institutes of Health (NIH) grants R01 MH101244, R37 MH107649, R01 HG006399, R01 MH115676, U01 HG012009, and F32 HG012889. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
Footnotes
New supplementary figure (Supplementary Figure 18) and Supplementary Note now added to describe 6 additional examples of fine-mapped disease loci
References
- 1.↵
- 2.↵
- 3.↵
- 4.
- 5.↵
- 6.
- 7.↵
- 8.
- 9.
- 10.
- 11.↵
- 12.↵
- 13.
- 14.↵
- 15.
- 16.↵
- 17.↵
- 18.
- 19.↵
- 20.↵
- 21.↵
- 22.↵
- 23.↵
- 24.↵
- 25.↵
- 26.↵
- 27.↵
- 28.↵
- 29.↵
- 30.↵
- 31.↵
- 32.↵
- 33.↵
- 34.↵
- 35.↵
- 36.↵
- 37.↵
- 38.↵
- 39.↵
- 40.↵
- 41.↵
- 42.↵
- 43.↵
- 44.↵
- 45.↵
- 46.↵
- 47.↵
- 48.↵
- 49.↵
- 50.↵
- 51.↵
- 52.
- 53.↵
- 54.↵
- 55.↵
- 56.↵
- 57.↵
- 58.↵
- 59.↵
- 60.↵
- 61.↵
- 62.↵
- 63.↵
- 64.↵
- 65.↵
- 66.↵
- 67.↵
- 68.↵
- 69.↵
- 70.↵
- 71.↵
- 72.↵
- 73.↵
- 74.↵
- 75.↵
- 76.↵
- 77.↵
- 78.↵
- 79.↵
- 80.↵
- 81.
- 82.↵
- 83.↵
- 84.↵
- 85.↵
- 86.↵
- 87.↵
- 88.↵
- 89.↵
- 90.↵
- 91.↵
- 92.↵
- 93.↵
- 94.
- 95.↵
- 96.↵
- 97.↵
- 98.↵
- 99.
- 100.↵
- 101.↵
- 102.↵
- 103.
- 104.
- 105.
- 106.
- 107.↵
- 108.↵
- 109.↵
- 110.↵
- 111.↵
- 112.↵
- 113.↵
- 114.↵
- 115.↵
- 116.↵