Transcriptome-Wide Root Causal Inference ======================================== * Eric V. Strobl * Eric R. Gamazon ## ABSTRACT Root causal genes correspond to the first gene expression levels perturbed during pathogenesis by genetic or non-genetic factors. Targeting root causal genes has the potential to alleviate disease entirely by eliminating pathology near its onset. No existing algorithm discovers root causal genes from observational data alone. We therefore propose the Transcriptome-Wide Root Causal Inference (TWRCI) algorithm that identifies root causal genes and their causal graph using a combination of genetic variant and unperturbed bulk RNA sequencing data. TWRCI uses a novel competitive regression procedure to annotate cis and trans-genetic variants to the gene expression levels they directly cause. The algorithm simultaneously recovers a causal ordering of the expression levels to pinpoint the underlying causal graph and estimate root causal effects. TWRCI outperforms alternative approaches across a diverse group of metrics by directly targeting root causal genes while accounting for distal relations, linkage disequilibrium, patient heterogeneity and widespread pleiotropy. We demonstrate the algorithm by uncovering the root causal mechanisms of two complex diseases, which we confirm by replication using independent genome-wide summary statistics. ## 1 Introduction Genetic and non-genetic factors can influence gene expression levels to ultimately cause disease. Root causal gene expression levels – or *root causal genes* for short – correspond to the *initial* changes to *gene expression* that ultimately generate disease as a downstream effect1. Root causal genes differ from core genes that directly cause the phenotype and thus lie at the end, rather than at the beginning, of pathogenesis2. Root causal genes also generalize driver genes that only account for the effects of somatic mutations primarily in protein coding sequences in cancer3. Discovering root causal genes is critical towards identifying drug targets that modify disease near its pathogenic onset and thus mitigate downstream pathogenesis in its entirety4. The problem is complicated by the existence of complex disease, where the causal effects of the root causal genes may differ between patients even within the same diagnostic category. However, the recently defined *omnigenic root causal model* posits that only a few root causal genes affect nearly all downstream genes to initiate the vast majority of pathology in each patient1. We thus more specifically seek to identify *personalized* root causal genes specific to any given individual. Only one existing algorithm accurately identifies personalized root causal genes1, but the algorithm requires access to genome-wide Perturb-seq data, or high throughput perturbations with single cell RNA sequencing readout5, 6. Perturb-seq is currently expensive and difficult to obtain in many cell types. We instead seek a method that can uncover personalized root causal genes directly from widespread observational (or non-experimental) datasets. We make the following contributions in this paper: 1. We introduce the conditional root causal effect (CRCE) that measures the causal effect of the genetic and non-genetic factors, which directly affect a gene expression level, on the phenotype. 2. We propose a novel strategy called Competitive Regression that provably annotates both cis and trans-genetic variants to the gene expression level or phenotype they directly cause without conservative significance testing. 3. We create an algorithm called Trascriptome-Wide Root Causal Inference (TWRCI) that uses the annotations to reconstruct a personalized causal graph summarizing the CRCEs of gene expression levels from a combination of genetic variant and bulk RNA sequencing observational data. 4. We show with confirmatory replication that TWRCI identifies only a few root causal genes that accurately distinguish subgroups of patients even in complex diseases – consistent with the omnigenic root causal model. We provide an example of the output of TWRCI in Figure 1. TWRCI annotates both cis and trans genetic variants to the expression level or phenotype they *directly* cause. We prove that the direct causal annotations allow the algorithm to uniquely reconstruct the causal graph between the gene expression levels that cause the phenotype as well as estimate their CRCEs. The algorithm summarizes the CRCEs in the graph by weighing and color-coding each vertex, where vertex size correlates with magnitude, green induces disease and red prevents disease. TWRCI thus provides a succinct summary of root causal genes and their root causal effect sizes specific to a given patient using observational data alone. TWRCI outperforms combinations of existing algorithms across all subtasks: annotation, graph reconstruction and CRCE estimation. No existing algorithm performs all subtasks simultaneously. ![Figure 1.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/09/12/2024.07.22.24310837/F1.medium.gif) [Figure 1.](http://medrxiv.org/content/early/2024/09/12/2024.07.22.24310837/F1) Figure 1. Toy example of a root causal mechanism inferred by TWRCI for a specific patient. Rectangles denote sets of genetic variants, potentially in linkage disequilibrium even between sets. Each set of variants directly causes a gene expression level in ![Graphic][1] or the phenotype *Y*. Larger lettered vertices denote larger CRCE magnitudes and colors refer to their direction – green is a positive CRCE and red is negative. TWRCI simultaneously annotates, reconstructs and estimates the CRCEs. ## 2 Results ### 2.1 Overview of TWRCI #### 2.1.1 Setup We seek to identify not just causal but *root causal* genes. We must therefore carefully define the generative process. We consider a set of variants ***S***, the transcriptome ![Graphic][2] and the phenotype *Y*. We represent the generative causal process using a directed graph like in Figure 2 (a), where the variants cause the transcriptome, and the transcriptome causes the phenotype. Directed edges denote direct causal relations between variables. In practice, the sets ***S*** and ![Graphic][3] contain millions and thousands of variables, respectively. As described in Methods 4.2, we cannot measure the values of ![Graphic][4] exactly using RNA sequencing but instead measure values ***X*** corrupted by Poisson measurement error and batch effects. ![Figure 2.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/09/12/2024.07.22.24310837/F2.medium.gif) [Figure 2.](http://medrxiv.org/content/early/2024/09/12/2024.07.22.24310837/F2) Figure 2. Overview of the TWRCI algorithm. (a) We redraw Figure 1 in more detail. We do not have access to the underlying causal graph in practice. (b) TWRCI first performs variable selection by only keeping variants and gene expression levels correlated with *Y* (and their common cause confounders) as shown in black. (c) The algorithm then uses Competitive Regression to find the variants that directly cause *Y* in orange. (d) TWRCI iteratively repeats Competitive Regression for each gene expression level as well – again shown in orange. (e) The algorithm next performs causal discovery to identify the causal relations between the gene expression levels and the phenotype in blue. (f) Finally, TWRCI weighs each vertex ![Graphic][5] by the magnitude of Γ*i* and color codes the vertex by its direction (green is positive, red is negative). TWRCI thus ultimately recovers a causal graph like the one shown in Figure 1. #### 2.1.2 Variable Selection Simultaneously handling millions of variants and thousands of gene expression levels currently requires expensive computational resources. Moreover, most variants and gene expression levels do not inform the discovery of root causal genes for a particular phenotype *Y*. The Transcriptome-Wide Root Causal Inference (TWRCI) thus first performs variable selection by eliminating variants and gene expression levels unnecessary for root causal inference. TWRCI first identifies variants ***T*** ⊆ ***S*** associated with *Y* using widely available summary statistics at a liberal *α* threshold, such as 5e-5, in order to capture many causal variants. The algorithm then uses individual-level data – where each individual has variant data, bulk gene expression data from the relevant tissue and phenotype data (variant-expression-phenotype) – to learn a regression model predicting ***X*** from ***T***. TWRCI identifies the subset of expression levels ![Graphic][6] that it can predict better than chance. We refer the reader to Methods 4.4.2 for details on the discovery of additional nuisance variables required to mitigate confounding. We prove that ![Graphic][7] retains all of the causes of *Y* in ![Graphic][8]. #### 2.1.3 Annotation by Competitive Regression We want to annotate both cis and trans-variants to the gene expression level that they directly cause in ![Graphic][9]. We also want to annotate variants to the phenotype *Y* in order to account for horizontal pleiotropy, where variants bypass ![Graphic][10] and directly cause *Y*. TWRCI achieves both of these feats through a novel process called *Competitive Regression*. TWRCI accounts for horizontal pleiotropy by applying Competitive Regression to the phenotype. We do not restrict the theoretical results detailed in Methods to linear models, but linear models trained on genotype data currently exhibit competitive performance7, 8. TWRCI therefore trains debiased linear ridge regression models9 predicting *Y* from ![Graphic][11] without requiring Gaussian distributions; let *γ**Yi* refer to the coefficient for *T**i* in the regression model. Similarly, let Δ−*Yi* correspond to the matrix of coefficients for *T**i* in the regression models predicting ![Graphic][12] (but not *Y*) from ***T***; notice that we have not conditioned on the gene expression levels in this case. If *T**i* *directly* causes *Y*, then it will predict *Y* given ***T*** \ *T**i* and given ![Graphic][13] (i.e., Δ*Yi* and *γ**Yi* will both be non-zero), but *T**i* will not predict any gene expression level given ***T*** \ *T**i* (i.e., ![Graphic][14] will be zero). We also prove the converse direction in Methods 4.4.4. TWRCI therefore annotates *T**i* to *Y*, if |Δ*Yi**γ**Yi* | deviates away from zero even after conditioning on gene expression so that ![Graphic][15] i.e., Δ*Yi**γ**Yi* “beats” ![Graphic][16] in a competitive process (Figure 2 (c)). We prove in Methods 4.4.3 that Competitive Regression successfully recovers the direct causes of *Y*, denoted by ***S****Y* ⊆ ***T***, so long as *Y* is a sink vertex that does not cause any other variable. We also require analogues of two standard assumptions used in instrumental variable analysis: relevance and exchangeability10. In this paper, *relevance* means that at least one variant in ***T*** directly causes each gene expression level in ![Graphic][17]; the assumption usually holds because ***T*** contains orders of magnitude more variants than entries in ![Graphic][18]. On the other hand, *exchangeability* assumes that ***T*** and other sets of direct causal variants not in ***T*** share no latent confounders (details in Methods 4.4.2); this assumption holds approximately due to the weak causal relations emanating from variants to gene expression and the phenotype. Exchangeability also weakens as ***T*** grows larger. We further show that Competitive Regression can recover ***S****i* ⊆ ***T***, or the direct causes of ![Graphic][19] in ***T***, when ![Graphic][20] causes *Y* and turns into a sink vertex after removing *Y* from consideration (Methods 4.4.4). As a result, TWRCI removes *Y* and appends it to the empty ordered set ***K*** to ensure that some ![Graphic][21] is now a sink vertex. We introduce a statistical criterion in Methods 4.4.4 that allows TWRCI to find the sink vertex ![Graphic][22] after removing *Y*. TWRCI then annotates ***S****i* to ![Graphic][23] again using Competitive Regression (Figure 2 (d)), removes *R**i* from ***R*** and appends *R**i* to the front of ***K***. The algorithm iterates until it has removed all variables from ***R*** ⋃ *Y* and placed them into the causal order ***K***. #### 2.1.4 Causal Discovery and CRCE Estimation Annotation only elucidates the direct causal relations from variants to gene expression, but it does not recover the causal relations between gene expression or the causal relations from gene expression to the phenotype. We want TWRCI to recover the *entire* biological mechanism from variants all the way to the phenotype. TWRCI thus subsequently runs a causal discovery algorithm with the causal order ***K*** to uniquely identify the causal graph over ![Graphic][24] (Figure 2 (e)). The algorithm also estimates the personalized or *conditional root causal effect* (CRCE) of gene expression levels that cause *Y* : ![Formula][25] where we choose ![Graphic][26] carefully to ensure that the second equality holds (Methods 4.3). The CRCE Γ*i* of ![Graphic][27] thus measures the causal effect of the genetic factors ***S****i* and the non-genetic factors *E**i* on *Y* that perturb ![Graphic][28] first. The CRCE values differ between patients, so TWRCI can recover different causal graphs by weighing each vertex according to the patient-specific CRCE values Γ = *γ* (Figure 2 (f)). The gene ![Graphic][29] is a *personalized root causal gene* if |*γ**i*| > 0. The *omnigenic root causal model* posits that |*γ*| ≫ 0 for only a small subset of genes in each patient even in complex disease. ### 2.2 TWRCI accurately annotates, reconstructs and estimates in silico No existing algorithm recovers personalized root causal genes from observational data alone. However, existing algorithms can annotate variants using different criteria and reconstruct causal graphs from observational data. We therefore compared TWRCI against state of the art algorithms in annotation and causal graph reconstruction using 100 semi-synthetic datasets with real variant data but simulated gene expression and phenotype data (Methods 4.6). Many different annotation methods exist with different objectives. Most methods nevertheless annotate variants by at least considering proximity to the transcription start site (TSS), with the hope that variants near the TSS of a gene will *directly* affect that gene’s expression level; for example, a variant in the exonic region of a gene may compromise its mRNA stability, while a variant in the promoter region may affect its transcription rate. We thus compare a diverse range of methods in *direct causal* annotation, or assigning variants to the gene expression levels they directly cause. This criterion accommodates other annotation objectives from a mathematical perspective as well – solving direct causation automatically solves causation, colocalization and correlation as progressively more relaxed cases. We in particular compare nearest TSS, a one mega-base cis-window1, the causal transcriptome-wide association study (cTWAS)11, the maximally correlated gene within the cis-window (cis-eQTL)12, colocalization with approximate Bayes factors (ABF)13, and colocalization with Sum of SIngle Effects model (SuSIE)14. We then performed causal graph reconstruction using SIGNET15, 16, RCI17, GRCI18 and the PC algorithm19, 20. We evaluated TWRCI against all combinations of annotation and graph reconstruction methods. See Methods 4.5 and 4.8 for a detailed description of comparator algorithms and evaluation metrics, respectively. All statements about empirical results mentioned below hold at a Bonferroni corrected threshold of 0.05 divided by the number of comparator algorithms according to two-sided paired t-tests. We first summarize the accuracy results for annotation of direct causes only. All existing annotation algorithms utilize heuristics such as location, correlation or colocalization to infer causality. Only TWRCI provably identifies the direct causes of each gene expression level (Theorem 1 in Methods 4.4.6). Empirical results corroborate this theoretical conclusion. TWRCI achieved the highest accuracy as assessed by Matthew’s correlation coefficient (MCC) to the true direct causal variants of each gene expression level and phenotype (Figure 3 (a) left). The algorithm also ranked the ground truth direct causal variants the highest by assigning the ground truth causal variants larger regression coefficient magnitudes than non-causal variants (Figure 3 (a) right). Both TWRCI and cTWAS account for horizontal pleiotropy, but TWRCI again outperformed cTWAS even when we only compared the true and inferred variants that directly cause the phenotype using MCC and the normalized rank (Figure 3 (b)). We conclude that TWRCI annotated the genetic variants to their direct effects most accurately. ![Figure 3.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/09/12/2024.07.22.24310837/F3.medium.gif) [Figure 3.](http://medrxiv.org/content/early/2024/09/12/2024.07.22.24310837/F3) Figure 3. Semi-synthetic data results in terms of (a) direct causal annotation, (b) annotation focused on horizontal pleiotropy only, (c) graph reconstruction, (d) combined annotation and graph reconstruction, and (e) CRCE estimation accuracy. Four of the graphs summarize two evaluation metrics. Arrows near the y-axis denote whether a higher (upward arrow) or a lower (downward arrow) score is better. We do not plot the results of cis-eQTL and SuSIE in (d) and (e) when they exhibit much worse performance. The cis-window and cTWAS algorithms have the exact same CRCE estimates in (e) because accounting for horizontal pleiotropy in cTWAS does not change the conditioning set ***D*** in Equation (1); we thus denote cis-Window and cTWAS as Win/cT for short. TWRCI in purple outperformed all algorithms across all nine evaluation metrics. Error bars correspond to 95% confidence intervals. We obtained similar results with causal graph reconstruction. TWRCI obtained the best performance according to the highest MCC and the lowest structural hamming distance (SHD) to the ground truth causal graphs (Figure 3 (c)). We then assessed the performance of combined annotation and graph reconstruction using the mean absolute correlation of the residuals (MACR), or the mean absolute correlation between the *indirect* causes of a gene expression level and the residual gene expression level obtained after partialing out the inferred *direct* causes; if an algorithm annotates and reconstructs accurately, then each gene expression level should not correlate with its indirect causes after partialing out its direct causes, so the MACR should attain a small value. TWRCI accordingly achieved the lowest MACR as compared to all possible combinations of existing algorithms (Figure 3 (d)). The cis-eQTL and SuSIE algorithms obtained MACR values greater than 0.3 because many cis-variants did not correlate or colocalize with the expression level of the gene with the nearest TSS; we thus do not plot the results of these algorithms. We conclude that TWRCI used annotations to reconstruct the causal graph most accurately by provably accounting for both cis and trans-variants. We finally analyzed CRCE estimation accuracy. Computing the CRCE requires access to the inferred annotations and causal graph. We therefore again evaluated TWRCI against all possible combinations of existing algorithms. The CRCE estimates of TWRCI attained the largest correlation to the ground truth CRCE values (Figure 3 (e) left). Further, if an algorithm accurately estimates the components ![Graphic][30] and 𝔼(*Y*|***D***) of the CRCE in Equation (1), then the residual ![Graphic][31] should not correlate with ***S****i* ⋂ ***T***. TWRCI accordingly obtained the lowest mean absolute correlation of these residuals (MACR) against all combinations of algorithms (Figure 3 (e) right). The cis-eQTL and SuSIE algorithms again attained much worse MACR values above 0.4 because they failed to annotate many causal variants to their gene expression levels. We conclude that TWRCI outperformed existing methods in CRCE estimation. TWRCI therefore annotated, reconstructed and estimated the most accurately according to all nine evaluation criteria. The algorithm also completed within about 3 minutes for each dataset (Supplementary Figure 1). ### 2.3 Chronic and exaggerated immunity in COPD We next ran the algorithms using summary statistics of a large GWAS of COPD21 consisting of 13,530 cases and 454,945 controls of European ancestry. We downloaded individual variant-expression-phenotype data of lung tissue from GTEx22 with 96 cases and 415 controls. We also replicated results using an independent GWAS consisting of 4,017 cases and 162,653 controls of East Asian ancestry21. COPD is a chronic inflammatory condition of the airways or the alveoli that leads to persistent airflow obstruction23. Exposure to respiratory infections or environmental pollutants can also trigger acute on chronic inflammation called COPD exacerbations that worsen the obstruction. #### 2.3.1 Accuracy We first compared the accuracy of the algorithms in variant annotation, graph reconstruction and CRCE estimation. We can compute the MACR metrics – representing two of the nine evaluation criteria used in the previous section – with real data. We summarize the MACR for simultaneous variant annotation and graph reconstruction averaged over ten nested cross-validation folds in Figure 4 (a). TWRCI achieved the lowest MACR out of all combinations of algorithms within about 3 minutes (Supplementary Figure 2 (b) and (c)). Performance differed primarily by the annotation method rather than the causal discovery algorithm. Conservative annotation algorithms, such as colocalization by SuSIE, again failed to achieve a low MACR because they frequently failed to annotate at least one variant to every gene expression level. MACR values for CRCE estimation followed a similar pattern (Figure 4 (b)) because accurate annotation and reconstruction enabled accurate downstream CRCE estimation. ![Figure 4.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/09/12/2024.07.22.24310837/F4.medium.gif) [Figure 4.](http://medrxiv.org/content/early/2024/09/12/2024.07.22.24310837/F4) Figure 4. Results for COPD. (a) TWRCI outperformed all other combinations of algorithms in direct causal annotation and graph reconstruction by achieving the lowest MACR; error bars correspond to one standard error of the mean in accordance with the one standard error rule of cross-validation27. (b) TWRCI similarly achieved the lowest MACR for CRCE estimation. (c) Silver standard genes exhibited the smallest correlation with the phenotype after partialing out the root causal genes inferred by TWRCI. (d) More than 13% the causal variants exhibited horizontal pleiotropy. TWRCI annotated the remaining causal variants to eight gene expression levels. (e) TWRCI assigned approximately 78% of the causal variants to genes located on different chromosomes. Most causal variants annotated to a gene on the same chromosome fell within a one megabase distance from the TSS (blue, left). The average magnitude of the regression coefficients remained approximately constant with increasing distance from the TSS (red, right); the dotted line again corresponds to variants on different chromosomes. (f) The COPD-wide causal graph revealed multiple MHC class II genes as root causal. (g) UMAP dimensionality reduction revealed two clusters of COPD patients well-separated from the healthy controls. (h) The directed graphs highlighted different root causal genes within each of the two clusters. We next followed11, 24 and downloaded a set of silver standard genes enriched in genes that cause COPD. The KEGG database does not contain a pathway for COPD, so we downloaded the gene set from the DisGeNet database instead (UMLS C0024117, curated)25, 26. Many silver standard genes are causal but not *root* causal for COPD. If an algorithm truly identifies root causal genes, then partialing out the root causal genes from all of the downstream non-root causal genes and the phenotype should explain away the vast majority of the causal effect between the non-root causal genes and the phenotype according to the omnigenic root causal model. We therefore computed another MACR metric, the mean absolute correlation between the residuals of the silver standard genes and the residuals of the phenotype after partialing out the inferred root causal genes. TWRCI again obtained the lowest MACR value (Figure 4 (c)). We conclude that TWRCI identified the root causal genes most accurately according to known causal genes in COPD. #### 2.3.2 Horizontal pleiotropy and trans-variants We studied the output of TWRCI in detail to gain insight into important issues in computational genomics. Previous studies have implicated the existence of widespread horizontal pleiotropy in many diseases28. TWRCI can annotate variants directly to the phenotype, so we can use TWRCI to assess the existence of widespread pleiotropy. The variable selection step of TWRCI identified fourteen gene expression levels surviving false discovery rate (FDR) correction at a liberal 10% threshold; eight of these levels ultimately caused the phenotype, including two psoriasis susceptibility genes, a complement protein and five MHC class II genes. TWRCI annotated 13.7% of the variants that cause COPD directly to the phenotype, despite competition for variants between the phenotype and the eight gene expression levels (Figure 4 (d)). Many variants thus directly cause COPD by bypassing expression. We conclude that TWRCI successfully identified widespread horizontal pleiotropy in COPD. In contrast, cTWAS failed to identify any variants that bypass gene expression because all variants had very small effects on the phenotype, especially after accounting for gene expression; as a result, no variants ultimately had a posterior inclusion probability greater than 0.8 according to cTWAS. TWRCI annotates both cis and trans-variants, so we examined the locations of the annotated variants relative to the TSS for each of the eight causal genes. Most of the variants lying on the same chromosome as the TSS fell within a one megabase distance from the TSS (Figure 4 (e) blue). However, 78% of the variants were located on different chromosomes. We thus compared the variants annotated to causal genes by TWRCI against a previously published list of trans-eQTLs associated with any phenotype in a large-scale search29 (Methods 4.7.3). Variants annotated by TWRCI were located 1.94 times closer to trans-eQTLs than expected by chance (10,000 permutations, *p* < 0.001, 95% CI [1.93,1.95]). We next examined the effect sizes of the variants that cause the phenotype. We regressed the phenotype on variants inferred to directly or indirectly cause the phenotype using linear ridge regression. We then computed the moving average of the magnitudes of the regression coefficients over different distances from the TSS. The magnitudes remained approximately constant with increasing distance from the TSS (Figure 4 (e) red). Moreover, the magnitudes for variants located on different chromosomes did not converge to zero (dotted line). We thus conclude that trans-variants play a significant role in modulating gene expression to cause COPD. #### 2.3.3 Root Causal Mechanism We next analyzed the output of TWRCI to elucidate the root causal mechanism of COPD. The pathogenesis of COPD starts with inhaled irritants that trigger an exaggerated and persistent activation of inflammatory cells such as macrophages, T cells and B cells23. These cells in turn regulate a variety of inflammatory mediators that promote alveolar wall destruction, abnormal tissue repair and mucous hypersecretion obstructing airflow. The root causal genes of COPD therefore likely involve genes mediating chronic and exaggerated inflammation in the lung. Eight of the fourteen gene expression levels ultimately caused the COPD phenotype in the causal graph reconstructed by TWRCI (Figure 4 (f)). The graph contained five MHC class II genes that present extracellular peptide antigens to CD4+ T cells in the adaptive immune response30. Subsequent activation of T cell receptors regulates a variety of inflammatory mediators and cytokines31. Moreover, the complement fragment C4a32 as well as the psoriasis susceptibility genes PSORS1C1 and PSORS1C233 help initiate and maintain the exaggerated inflammatory response seen in COPD. The recovered causal graph thus implicates chronic exaggerated inflammation as the root causal mechanism of COPD. TWRCI replicated these results by again discovering C4A and the MHC class II genes in an independent GWAS dataset composed of individuals of East Asian ancestry (Supplementary Figure 3 (a)). We finally analyzed the personalized CRCE estimates in more detail. We can decompose the CRCE estimate of each gene into genetic and non-genetic components according to Equation (1). The genetic variants explained only 6.4% of the estimated variance of the CRCE for HLA-DRB5, 1.4% for C4A and <1% for the other six causal genes. We conclude that non-genetic factors account for nearly all of the explained variance in the CRCE estimates. We then performed UMAP dimensionality reduction34 on the causal gene expression levels. Hierarchical clustering with Ward’s method35 yielded three clear clusters of patients with COPD (Figure 4 (g)) according to the elbow method on the sum of squares plot (Supplementary Figure 2 (a)). UMAP differentiated two of the COPD clusters from healthy controls, each with different mean CRCE estimates (Figure 4 (h) directed graphs). For example, HLA-DRB5 had a large positive CRCE in cluster one but a large negative CRCE in cluster two. The CRCE estimates thus differentiated multiple subgroups of patients consistent with the known pathobiology of COPD; we likewise obtained similar results in the second GWAS dataset (Supplementary Figure 3 (b) and (c)). ### 2.4 Oxidative stress in ischemic heart disease We also ran the algorithms on summary statistics of ischemic heart disease (IHD) consisting of 31,640 cases and 187,152 controls from Finland36. We used variant-expression-phenotype data of whole blood from GTEx22 with 113 cases and 547 controls. We used whole blood because IHD arises from narrowing or obstruction of the coronary arteries most commonly secondary to atherosclerosis with transcription products released into the bloodstream37. We replicated the results using an independent set of GWAS summary statistics from 20,857 cases and 340,337 controls from the UK Biobank38. #### 2.4.1 Accuracy We compared the algorithms in variant annotation, graph reconstruction and CRCE estimation accuracy. TWRCI achieved the lowest MACR in both cases (Figure 5 (a) and (b)) within about one hour (Supplementary Figure 4 (b) and (c)). Cis-eQTLs and colocalization with SuSIE failed to annotate many variants because many trans-variants again predicted gene expression. We obtained similar results with a set of silver standard genes downloaded from the KEGG database (hsa05417)39, where TWRCI outperformed all other algorithms (Figure 5 (c)). ![Figure 5.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/09/12/2024.07.22.24310837/F5.medium.gif) [Figure 5.](http://medrxiv.org/content/early/2024/09/12/2024.07.22.24310837/F5) Figure 5. Results for IHD. (a) TWRCI again outperformed all other algorithms in combined annotation and graph reconstruction by achieving the lowest MACR. (b) TWRCI also estimated the CRCEs most accurately relative to all possible combinations of the other algorithms. (c) TWRCI outperformed all other algorithms with a silver standard set of genes causally involved in atherosclerosis. (d) TWRCI annotated variegated numbers of variants to six causal expression levels as well as the phenotype. (e) Nearly all of the annotated variants were located distal to the TSS (blue), and the magnitudes of their causal effects did not consistently increase or decrease on average with greater distance from the TSS (red). (f) TWRCI estimated the largest mean CRCEs for MRPL1, TRBV6-2 and FAM241B. (g) The annotated variants only explained a small proportion (<1.5%) of the variance for all CRCE estimates. (h) UMAP dimensionality reduction identified one cluster of patients clearly separated from healthy controls. (i) The mean CRCEs of MRPL1, TRBV6-2 and FAM241B remained the largest in this cluster. #### 2.4.2 Horizontal pleiotropy and trans-variants The genetic variants predicted 27 gene expression levels at an FDR threshold of 10% with six genes inferred to cause the phenotype. We plot the six genes in the directed graph recovered by TWRCI in Figure 5 (f). TWRCI sorted approximately 8-23% of the causal variants to each of the six genes (Figure 5 (d)). Moreover, TWRCI annotated approximately 17% of the causal variants directly to the phenotype supporting widespread horizontal pleiotropy in IHD. In contrast, cTWAS again did not detect any variants that directly cause the phenotype with a posterior inclusion probability greater than 0.8. We analyzed the inferred causal effects of cis and trans-variants. Only 7.4% of the annotated variants were located on the same chromosome, and those on the same chromosome were often located over 10 megabases from the TSS (Figure 5 (e) blue). Moreover, variants annotated by TWRCI were located 4.46 times closer to a published list of trans-eQTLs than expected by chance (10,000 permutations, *p* = 0.0014, 95% CI [4.39,4.52]). The magnitudes of the regression coefficients remained approximately constant with increasing distance from the TSS and converged to 0.002 – rather than to zero – on different chromosomes (Figure 5 (e) red). We conclude that trans-variants also play a prominent role in IHD. #### 2.4.3 Root Causal Mechanism We next examined the root causal genes of IHD. IHD is usually caused by atherosclerosis, where sites of disturbed laminar flow and altered shear stress trap low-density lipoprotein (LDL)40. Reactive oxygen species then oxidize LDL and stimulate an inflammatory response. T cells in turn stimulate macrophages that ingest the oxidized LDL. The macrophages then develop into lipid-laden foam cells that form the initial fatty streak of an eventual atherosclerotic plaque. We therefore expect the root causal genes of IHD to involve oxidative stress and the inflammatory response. TWRCI identified MRPL1, TRBV6-2 and FAM241B as the top three root causal genes (Figure 5 (f)). MRPL1 encodes a mitochondrial ribosomal protein that helps synthesize complex proteins involved in the respiratory chain41. Deficiency of MRPL1 can lead to increased oxidative stress. TRBV6-2 encodes a T-cell receptor beta variable involved in the inflammatory response and accumulation of T-cells in the atherosclerotic plaque42. Moreover, knocking out FAM241B induces the cytoplasmic buildup of large lysosome-derived vacuoles that generate foam cells43. We conclude that the root causal genes identified by TWRCI correspond to known genes involved in the pathogenesis of IHD. Finally, TWRCI rediscovered MRPL1 in a second independent GWAS dataset (Supplementary Figure 5 (a)). We next dissected the CRCE estimates in detail. The annotated variants explained less than 1.5% of the CRCE variance for MRPL1, TRBV6-2 and FAM241B (Figure 5 (g)). Non-genetic factors therefore account for the vast majority of the CRCE variance. UMAP dimensionality reduction and then hierarchical cluster on the causal genes discovered by TWRCI revealed two clusters of IHD patients (Supplementary Figure 4 (a)). The largest of the two clusters lied distal to the cluster of healthy controls (Figure 5 (h)). Furthermore, the FAM241B, TRBV6-2 and MRPL1 genes retained the largest mean CRCEs in this cluster (Figure 5 (i)). TWRCI likewise replicated the large mean CRCE estimate for MRPL1 in the independent GWAS dataset (Supplementary Figure 5 (a) and (b)). We conclude that the CRCE estimates also identify genes that differentiate patient subgroups in IHD. ## 3 Discussion We introduced the CRCE of a gene, a measure of the causal effect of the genetic and non-genetic factors that directly cause a gene expression level on a phenotype. We then created the TWRCI algorithm that estimates the CRCE of each gene after simultaneously annotating variants and reconstructing the causal graph for improved statistical power. TWRCI annotates, reconstructs and estimates more accurately than alternative algorithms across multiple semi-synthetic and real datasets. Applications of TWRCI to COPD and IHD revealed succinct sets of root causal genes consistent with the known pathogenesis of each disease, which we verified by replication. Furthermore, clustering delineated patient subgroups whose pathogeneses were dictated by different root causal genes. Our experimental results highlight the importance of incorporating trans-variants in statistical analysis. TWRCI annotated many variants distal to the TSS of each gene. These trans-variants improved the ability of the algorithm to learn models of gene regulation consistent with the correlations in the data according to the MACR criteria. Moreover, variants annotated by TWRCI were located closer to the positions of a previously published list of trans-eQTLs than expected by chance29. In contrast, nearest TSS, cis-windows, cTWAS, cis-eQTLs and the colocalization methods all rely on cis-variants that did not overlap with many GWAS hits both in the COPD and IHD datasets. Most GWAS hits likely lie distal to the TSSs in disease due to natural selection against cis-variants with large causal effects on gene expression44. As a result, algorithms that depend solely on cis-variants can fail to detect a large proportion of variants that cause disease in practice. TWRCI detected widespread horizontal pleiotropy accounting for 13-17% of the causal variants in both the COPD and IHD datasets. Previous studies have detected horizontal pleiotropy in around 20% of causal variants even after considering thousands of gene expression levels as well28. Moreover, many of the variants annotated to the phenotype by TWRCI correlated with gene expression (Supplementary Figures 2 (d) and 4 (d)). Accounting for widespread horizontal pleiotropy thus mitigates pervasive confounding between gene expression levels and the phenotype. The cTWAS algorithm did not detect widespread pleiotropy in the real datasets. The algorithm also underperformed TWRCI in the semi-synthetic data, even when we restricted the analyses to variants that directly cause the phenotype. We obtained these results because cTWAS relies on the SuSIE algorithm to identify pleiotropic variants. However, pleiotropic variants usually exhibit weak causal relations to the phenotype, so most of these variants do not achieve a large posterior inclusion probability in practice. Algorithms that depend on *absolute* measures of certainty, such as posterior probabilities or p-values, miss many causal variants with weak causal effects in general. TWRCI therefore instead annotates variants by relying on *relative* certainty via a novel process called Competitive Regression, which we showed leads to more consistent causal models across multiple metrics. We re-emphasize that TWRCI is the only algorithm that accurately recovers *root* causal genes *initiating* pathogenesis. Other methods such as colocalization and cTWAS identify causal genes *involved* in pathogenesis, regardless of whether the genes are root causal or not root causal. As a result, only TWRCI inferred a few genes with large CRCE magnitudes even in complex diseases. Moreover, genes with non-zero CRCE magnitudes explained away most of the causal effects of the non-root causal genes in the silver standards. Both of these results are consistent with the omnigenic root causal model, or the hypothesis that only a few root causal genes initiate the vast majority of pathology in each patient even in complex disease by affecting a very large number of downstream genes1. Recall that the above root causal genes differ from driver genes and core genes. Root causal genes generalize driver genes by accounting for all of the factors that directly influence gene expression levels across all diseases, rather than just somatic mutations in cancer3. Accounting for both genetic and non-genetic factors is especially important when non-genetic factors explain the majority of the variance in the root causal effects, as we saw in COPD and IHD. Finally, root causal genes differ from core genes, or the gene expression levels that directly cause a phenotype, by focusing on the beginning rather than the end of pathogenesis2. Root causal genes may affect the expression levels of downstream genes so that many genes are differentially expressed between patients and healthy controls including many core genes. A few root causal genes can therefore increase the number of core genes. TWRCI provably identifies root causal genes and attains high empirical accuracy, but the algorithm carries several limitations. The algorithm cannot accommodate cycles or directed graphs with different directed edges, even though cycles may exist and direct causal relations may differ between patient populations in practice45. TWRCI also estimates CRCE values at the patient-specific level, but the CRCEs may also vary between different cell types. Finally, the algorithm uses linear rather than non-linear models to quantify the causal effects of the variants on gene expression or the phenotype. Future work should therefore consider relaxing the single DAG constraint and accommodating non-linear relations. Future work will also focus on scaling the method to millions of genetic variants without feature selection. In summary, we introduced an algorithm called TWRCI for accurate estimation and interpretation of the CRCE using personalized causal graphs. TWRCI empirically discovers only a few gene expression levels with large CRCE magnitudes even within different patient subgroups of complex disease in concordance with the omnigenic root causal model46. We conclude that TWRCI is a novel, accurate and disease agnostic procedure that couples variant annotation with graph reconstruction to identify root causal genes using observational data alone. ## Data Availability All datasets analyzed in this study have been previously published and are publicly accessible as described in Methods. [https://github.com/ericstrobl/TWRCI](https://github.com/ericstrobl/TWRCI) ## 4 Methods ### 4.1 Background on Causal Discovery Causal discovery refers to the process of discovering causal relations from data. We let italicized letters such as *Z**i* denote a singleton random variable and bold italicized letters such as ***Z*** denote sets of random variables. Calligraphic letters such as 𝒵 refer to sets of sets. We consider a set of *p endogenous variables* ***Z***. We represent a causal process over ***Z*** using a *structural equation model* (SEM) consisting of a series of deterministic functions: ![Formula][32] where Pa(*Z**i*) ⊆ ***Z*** \ *Z**i* denotes the *parents*, of direct causes, of *Z**i* and *E**i* ∈ ***E*** an *exogenous variable*, also called an *error* or a *noise term*. We assume that the variables in ***E*** are mutually independent. The set Ch(*Z**i*) refers to the *children*, or direct effects, of *Z**i* where *Z* *j* ∈ Ch(*Z**i*) if and only if *Z**i* ∈ Pa(*Z**j*). We can associate an SEM with a *directed graph* 𝔾 by a drawing a directed edge from *Z**j* to *Z**i* when *Z**j* ∈ Pa(*Z**i*). We thus use the words *variable* and *vertex* interchangeably. A *root vertex* in 𝔾 refers to a vertex without any parents, whereas a *sink* or *terminal vertex* refers to a vertex without any children. A *path* between *Z* and *Z**n* corresponds to an ordered sequence of distinct vertices ⟨*Z*, …, *Z**n*⟩ such that *Z**i* and *Z**i*+1 are adjacent for all 0 ≤ *i* ≤ *n* − 1. In contrast, a *directed path* from *Z* to *Z**n* corresponds to an ordered sequence of distinct vertices ⟨*Z*, …, *Z**n*⟩ such that *Z**i* ∈ Pa(*Z**i*+1) for all 0 ≤ *i* ≤ *n* − 1. We say that *Z**j* is an *ancestor* of *Z**i*, and likewise that *Z**i* is a *descendant* of *Z**j*, if there exists a directed path from *Z**j* to *Z**i* (or *Z**j* = *Z**i*). We collect all ancestors of *Z* *j* into the set Anc(*Z**j*), and all its non-descendants into the set Nd(*Z**j*). We write *Z**i* ∈ Anc(***A***) when *Z**i* is an ancestor of any variable in ***A***, and likewise Nd(***A***) for the non-descendants. The variable *Z* *j* *causes Z**i* if *Z**j* is an ancestor of *Z**i* and *Z**j* ≠ *Z**i*. A *root cause* of *Z**i* corresponds to a root vertex that also causes *Z**i*. A *cycle* exists in 𝔾 when *Z**j* causes *Z**i* and vice versa. A *directed acyclic graph* (DAG) corresponds to a directed graph without cycles. A *collider* corresponds to *Z**j* in the triple *Z**i* → *Z**j* ← *Z**k*. Two vertices *Z**i* and *Z**j* are d-connected given ***W*** ⊆ ***Z*** \ {*Z**i*, *Z**j*} if there exists a path between *Z**i* and *Z**j* such that no non-collider is in ***W*** and all colliders are ancestors of ***W***. We denote d-connection by ![Graphic][33] for shorthand. The two vertices are *d-separated* given ***W***, likewise denoted by *Z**i* ╨*d* *Z* *j* |***W***, if they are not d-connected. The *Markov boundary* of *Z**i*, denoted by Mb(*Z**i*), corresponds to the not necessarily unique but smallest set of variables in ***Z*** \ *Z**i* such that *Z**i* ╨*d* (***Z*** \ Mb(*Z**i*)) |Mb(*Z**i*). A path is *blocked* by ***W*** if ***W*** contains at least one non-collider on the path or does not contain an ancestor of a collider (or both). A probability density that obeys an SEM associated with the DAG 𝔾 also factorizes according to the graph: ![Formula][34] Any density that factorizes as above obeys the *global Markov property*, where *Z**i* and *Z**j* are conditionally independent given ***W***, or *Z**i* ╨ *Z* *j* |***W***, if *Z**i* ╨*d* *Z* *j* |***W***47. A density obeys *d-separation faithfulness* when the converse holds: if *Z**i* ╨ *Z* *j* |***W***, then *Z**i* ╨*d* *Z**j* |***W***. The Markov boundary of *Z**i* uniquely corresponds to the parents, children and parents of the children (or *spouses*) of *Z**i* under d-separation faithfulness. ### 4.2 Causal Modeling of Variants, Gene Expression and the Phenotype We divide the set of random variables ***Z*** into disjoint sets ![Graphic][35] corresponding to the phenotype *Y, q* genetic variants ***S***, latent variables ***L*** modeling linkage disequilibrium (LD) and *m* gene expression levels ![Graphic][36]. We model the causal process over ***Z*** using the following SEM associated with a DAG 𝔾: ![Formula][37] where ![Graphic][38] and ![Graphic][39] for any latent variable, any genetic variant, any gene expression level and the phenotype, respectively. In other words, linkage disequilibrium ***L*** generates variants ***S***, and variants and gene expression generate other gene expression levels ![Graphic][40] and the phenotype *Y* (example in Figure 6 (a)). We assume that *Y* is a sink vertex such that gene expression and variants cause *Y* but not vice versa. ![Figure 6.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/09/12/2024.07.22.24310837/F6.medium.gif) [Figure 6.](http://medrxiv.org/content/early/2024/09/12/2024.07.22.24310837/F6) Figure 6. (a) An example of a DAG over ***Z***. In (b), the additional vertices ***X*** denote counts corrupted by batch *B* effects and Poisson measurement error. (c) We can also augment the DAG in (a) with root vertex error terms ***E***. Let ***S****i* denote the direct causes of ![Graphic][41] in ***S***. We require ![Graphic][42] for all ![Graphic][43] so that at least one variant directly causes each gene expression level. We also assume that any single variant can only *directly* cause one gene expression level or the phenotype (but not both). Investigators have reported only a few rare exceptions to this latter assumption in the literature46. A variant may however indirectly cause many gene expression levels. We unfortunately cannot measure the exact values of gene expression using RNA sequencing (RNA-seq) technology. Numerous theoretical and experimental investigations have revealed that RNA-seq suffers from independent Poisson measurement error48, 49: ![Formula][44] where *π**ij* denotes the *mapping efficiency* of ![Graphic][45] in batch *j*. We thus sample *Y* ⋃ ***S*** ⋃ ***L*** ⋃ ***X*** ⋃ *B* from the DAG like the one shown in Figure 6 (b) in practice, where *B* denotes the batch. With slight abuse of terminology, we will still call ![Graphic][46] a *sink vertex* if it has only one child *X**i*. We can perform consistent regression under Poisson measurement error. Let ![Graphic][47] denote the library size and let ![Graphic][48] denote the true unobserved total gene expression level weighted by the mapping efficiencies in batch *j*. Also let ![Graphic][49] and ***V*** ⊆ ***S*** refer to any subset of gene expression levels and variants, respectively. The following result holds: Lemma 1. *Assume Lipschitz continuity of the conditional expectation for all N* ≥ *n*: ![Formula][50] *where C**N* ∈ *O*(1) *is a positive constant, and we have taken an outer expectation on both sides. Then* ![Graphic][51] *almost surely*. We delegate proofs to the Supplementary Materials. Intuitively, ![Graphic][52] approaches ![Graphic][53] as the library size increases, so the above lemma states that accurate estimation of ![Graphic][54] implies accurate estimation of ![Graphic][55]. We can thus consistently estimate any conditional expectation ![Graphic][56] using 𝔼(*Z**i* | ***U, V, B***) when the library size approaches infinity. We only apply the asymptotic argument to bulk RNA-seq, where the library size is on the order of at least tens of millions. We henceforth implicitly assume additional conditioning on *B* whenever regressing to or on bulk RNA-seq data in order to simplify notation. ### 4.3 Conditional Root Causal Effects We define the root causal effect of a gene expression level on the phenotype *Y*. We focus on Equation (3) with the endogenous variables ***Z*** and the exogenous variables ***E***. If the error terms ***E*** are mutually independent, then we can *augment* the associated DAG 𝔾 with ***E*** by drawing a directed edge from each ![Graphic][57] to its direct effect *Z**i* (Figure 6 (c)). We denote the resultant graph by 𝔾′, where we always have ![Graphic][58] and the subscript emphasizes the augmented DAG; if we do not place a subscript, then we refer to the original DAG 𝔾. Only the error terms are root vertices in 𝔾′, so only exogenous variables that cause *Y* can be root causes of *Y*. The *root causal effect* of *Z**i* on *Y* given the exogenous variables ***E*** is the causal effect of its direct causes in ***E*** on *Y*: ![Formula][59] The variable *Z**i* is the first variable in ***Z*** affected by ![Graphic][60], and *Z**i* may in turn causally affect *Y*. The exogenous variable ![Graphic][61] models the effects of environmental, epigenetic and other *non-genetic* factors on *Z**i* because the set of endogenous variables ![Graphic][62] includes the *genetic factors* ***S***. The root causal effect is a special case of the *conditional root causal effect* (CRCE) given the exogenous variables ***E***: ![Formula][63] where ![Graphic][64] and ![Graphic][65]. The first condition ensures that ***D*** does not block any directed path from *Z**i* to *Y*. The second ensures that ***D*** eliminates any confounding between ![Graphic][66] and *Y*. The first condition actually implies the second in this case because ***E*** are root vertices. If we set ![Graphic][67], then we recover the unconditional root causal effect in Equation (4). We are however interested in identifying the causal effects of both genetic *and* non-genetic factors on *Y* through gene expression ![Graphic][68] with potential confounding between members of ***S*** due to LD. We therefore expand the set of exogenous variables to ***E*** ⋃ ***S*** representing the non-genetic and genetic factors, respectively. We define the conditional root causal effect of ![Graphic][69] given the exogenous variables ***E*** ⋃ ***S*** as: ![Formula][70] where we write ![Graphic][71] as *E**i* ⋃ ***S****i* to prevent cluttering of notation. The set *E**i* ⋃ ***S****i* thus refers to the direct causes of ![Graphic][72] in ***E*** ⋃ ***S***. The above conditional root causal effect measures the causal effect of the root vertices ***E*** on *Y* as they pass through *E**i* ⋃ ***S****i* to ![Graphic][73]. We can likewise choose any ***D*** such that ![Graphic][74] and ![Graphic][75]. We choose ***D*** carefully to satisfy these two conditions as well as elicit favorable mathematical properties by setting ![Graphic][76], where ![Graphic][77] and ![Graphic][78]. This particular choice of ***D*** allows us to write: ![Formula][79] so that we do not need to recover *E**i* as an intermediate step. We prove the second equality in Proposition 1 of the Supplementary Materials under *exchangeability*, or no latent confounding by ***L*** between any two entries of ![Graphic][80]; this union corresponds to a set of sets including ***T*** and each entry of ![Graphic][81] in the set. Exchangeability holds approximately in practice due to the weak causal relations emanating from variants to gene expression and the phenotype. Moreover, the assumption weakens with more variants in ***T***. Now the first gene expression level in ![Graphic][82] affected by *E**i* ⋃ ***S****i* is ![Graphic][83]. We thus call ![Graphic][84] a *root causal gene* if ![Graphic][85] also causes *Y* such that Ψ*i* ≠ 0. We finally focus on the expected version of Ψ*i* to enhance computational speed, improve statistical efficiency and overcome Poisson measurement error according to Lemma 1: ![Formula][86] The *omnigenic root causal model* posits that |*γ*| ≫ 0 for only a small subset of gene expression levels in each patient with Γ = *γ*. We thus seek to estimate the values *γ* for each patient. We use the acronym CRCEs to specifically refer to Γ from here on. ### 4.4 Algorithm #### 4.4.1 Strategy Overview We seek to accurately annotate, reconstruct and estimate the CRCEs using (1) summary statistics as well as (2) linked variant-expression-phenotype data. We summarize the proposed Transcriptome-Wide Root Causal Inference (TWRCI) algorithm in Algorithm 1. TWRCI first uses summary statistics to identify variants ***T*** associated with the phenotype at a liberal *α* threshold in Line 1. The algorithm also identifies gene expression levels ***R*** ⊆ ***X*** predictable by ***T*** in Line 1 from the variant-expression-phenotype data. TWRCI then annotates non-overlapping sets of variants to the phenotype in Line 2 and each gene expression level in Line 3 using a novel process called Competitive Regression; we prove that annotated variants include all of the direct causes in ***T***. TWRCI identifies the causal ordering ***K*** among ![Graphic][87] during the annotation process. The algorithm finally recovers the directed graph uniquely given ***K*** in Line 4 and estimates the CRCE of each gene inferred to cause *Y* using the estimated graph ![Graphic][88] and the annotations 𝒫 in Line 5. TWRCI can thus weigh and color-code each node in ![Graphic][89] that causes *Y* by the CRCE estimates for each patient. We will formally prove that TWRCI is sound and complete at the end of this subsection. Algorithm 1 Transcriptome-Wide Root Causal Inference (TWRCI) ![Figure7](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/09/12/2024.07.22.24310837/F7.medium.gif) [Figure7](http://medrxiv.org/content/early/2024/09/12/2024.07.22.24310837/F7) #### 4.4.2 Variable Selection We summarize the variable selection portion of TWRCI in Algorithm 2. TWRCI first reduces the number of variants using summary statistics by only keeping variants with a significant association to the phenotype at a very liberal *α* threshold (Line 1); we use 5e-5, or a three orders of magnitude increase from the usual threshold of 5e-8. We do not employ clumping or other pre-processing methods that may remove more variants from consideration. Let ***T*** denote the variants that survive this screening step so that ![Graphic][90]. The variable selection algorithm then identifies the gene expression levels predictable by ***T*** using the variant-expression-phenotype data in Line 2. We operationalize this step by linearly regressing ***X*** on ***T*** using half of the samples, and then testing whether the predicted level ![Graphic][91] and the true level *X**i* linearly correlate in the second half for each *X**i* ∈ ***X***50. This sample splitting procedure ensures proper control of the Type I error rate51. We keep gene expression levels ***R*** ⊆ ***X*** that achieve a q-value below a liberal FDR threshold of 10%52. We say that ***T*** is *relevant* if it contains at least one variant that directly causes each member of ![Graphic][92]. We finally repeat the above procedure after regressing out ***R*** from ***X*** \ ***R*** and ***T*** in Line 3 in order to identify *Ñ*, or all parents of ![Graphic][93] in ![Graphic][94]. We call ***N*** the set of *nuisance variables*, since we will need to condition on them, but they do not contain the ancestors of *Y*. Algorithm 2 formally identifies the necessary ancestors needed for downstream inference: Lemma 2. *Assume d-separation faithfulness and relevance. Then*, ![Graphic][95] *contains all of the ancestors of Y in* ![Graphic][96], *and* ![Graphic][97] *for any* ![Graphic][98]. Algorithm 2 Variable Selection ![Figure8](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/09/12/2024.07.22.24310837/F8.medium.gif) [Figure8](http://medrxiv.org/content/early/2024/09/12/2024.07.22.24310837/F8) #### 4.4.3 Annotation for Horizontal Pleiotropy TWRCI next annotates the associated variants ***T*** to their direct effects in ![Graphic][99]. The algorithm first annotates a sink vertex and then gradually works its way up the DAG until it annotates the final root vertex. TWRCI assumes that *Y* is a sink vertex, so it first annotates to *Y*. A variant exhibits *horizontal pleiotropy* if it directly causes *Y*. We propose a novel competitive regression scheme to annotate all members of ***T*** ⋂ Pa(*Y*) = ***S****Y* to *Y*. We mildly assume equality in conditional expectation implies equality in conditional distribution and vice versa. Let ![Graphic][100] and likewise ***Q*** = ***R*** ⋃ *Y*. We also mildly assume that the following *contribution scores* exist and are finite: Δ*ij* = 𝔼|*∂*𝔼(*Q**i* |***T***)/*∂T**j* | and ![Graphic][101]. The scores correspond to the variable coefficients in linear regression. We use the contribution scores to annotate any *T**j* ∈ ***T*** such that ![Graphic][102] to *Y*, since this set of variants corresponds to a superset of ***S****Y* by the following result: Corollary 1. *Under d-separation faithfulness, relevance and exchangeability*, ![Graphic][103] *if and only if T**j* ∉ Anc(***R***) *or T**j* ∈ Pa(*Y*) *(or both)*. The proof follows directly from Lemma 3 in the Supplementary Materials. The Competitive Regression (CR) algorithm summarized in Algorithm 3 computes the contribution scores in order to annotate variants to *Y*. Let Δ−*i* denote the removal of the *i*th row from Δ corresponding to *Q**i* = *Y*. We use debiased linear ridge regression9 to compute Δ in Line 1 and *γ**i* in Line 2. CR compares the two quantities and outputs the set ![Graphic][104], or a superset of ***S****i* ⋂***T*** not including any other variants with children in ![Graphic][105] according to Corollary 1, in Line 3. #### 4.4.4 Annotation and Causal Order The CR algorithm requires the user to specify a known sink vertex. We drop this assumption by integrating CR into the Annotation and Causal Order (ACO) algorithm that automatically finds a sink vertex at each iteration. ACO takes ***R, N***, *Y*, ***T, P****Y* as input as summarized in Algorithm 4. The algorithm constructs a causal ordering over ***R*** ⋃ *Y* in ***K*** by iteratively eliminating a sink vertex from ***R*** and appending it to the front of ***K***. ACO also instantiates a list 𝒫 and assigns genetic variants ![Graphic][106] to each gene expression level *R**i* ∈ ***R*** in Lines 8 and 18 using the following generalization of Corollary 1: Algorithm 3 Competitive Regression (CR) ![Figure9](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/09/12/2024.07.22.24310837/F9.medium.gif) [Figure9](http://medrxiv.org/content/early/2024/09/12/2024.07.22.24310837/F9) Algorithm 4 Annotation and Causal Order (ACO) ![Figure10](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/09/12/2024.07.22.24310837/F10.medium.gif) [Figure10](http://medrxiv.org/content/early/2024/09/12/2024.07.22.24310837/F10) Lemma 3. *Assume d-separation faithfulness, relevance and exchangeability. Further assume that* ![Graphic][107] *is a sink vertex. Then*, ![Graphic][108] *if and only if T**j* ∉ Anc(***Q*** \ *Q**i*) *or* ![Graphic][109] *(or both)*. The set ***P****i* is thus again a superset of ***S****i* ⋂***T***, and any additional variants in ***P****i* do not directly cause another gene expression level or the phenotype. ACO determines whether ![Graphic][110] is indeed a sink vertex from data using the following result: Lemma 4. ![Graphic][111] *is a sink vertex if and only if* ![Graphic][112] *in Line 12 of ACO under d-separation faithfulness, relevance and exchangeability*. ACO practically determines whether any ![Graphic][113] is indeed a sink vertex post variable elimination by first computing the residuals *F**i* after regressing *R**i* on ![Graphic][114], the nuisance variables *Ñ* and the identified variants ***U****i*. A sink vertex ![Graphic][115] has residuals *F**i* that are uncorrelated with the variants in ***T*** \ (***O*** ⋃***U****i*) in Line 12 by Lemma 4, so ACO can identify the sink vertex ![Graphic][116] in Line 15 as the variable with the smallest absolute linear correlation. The algorithm then appends *R**i* to the front of ***K*** and eliminates *R**i* from ***R*** in Lines 16 and 17, respectively. ACO finally adds ***U****i* to 𝒫 in Line 18, so ***U****i* can be removed from ***T*** of the next iteration through ***O***. We formally prove the following result: Lemma 5. *Under d-separation faithfulness, relevance and exchangeability, ACO recovers the correct causal order* ***K*** *over* ![Graphic][117] *and* (***S****i* ⋂***T***) ⊆ ***P****i* *for all* ![Graphic][118]. #### 4.4.5 Causal Graph Discovery TWRCI uses the causal order ***K*** and the annotations 𝒫 to perform causal discovery. The algorithm runs the (stabilized) skeleton discovery procedure of the Peter-Clark (PC) algorithm to identify the presence or absence of edges between any two gene expression levels (Algorithm 5)19, 53. We modify the PC algorithm so that it tests whether *R**i* and *R* *j* are conditionally independent given ***P****i* ⋃ ***Ñ*** and subsets of the neighbors of ![Graphic][119] in ![Graphic][120] in Line 12 to ensure that we condition on all parents of ![Graphic][121]. Finally, we orient the edges using the causal order ***K*** in Line 19 to uniquely recover the DAG over ![Graphic][122]: Lemma 6. *Under d-separation faithfulness, relevance and exchangeability, the graph discovery algorithm outputs the true sub-DAG over* ![Graphic][123] *given a conditional independence oracle*, ***K*** *and* 𝒫. We next include the phenotype *Y* into the causal graph. We often only have a weak causal effect from gene expression and variants to the phenotype. We therefore choose to detect any causal relation to *Y* rather than just direct causal relations using Algorithm 6. Algorithm 6 only conditions on ![Graphic][124] in Line 4 to discover both direct and indirect causation in concordance with the following result: Lemma 7. *Under d-separation faithfulness, relevance and exchangeability*, ![Graphic][125] *causes Y – and likewise the vertices* ***S****i* ⋃ *E**i* *cause Y – if and only if* ![Graphic][126]. Algorithm 5 Graph Discovery ![Figure11](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/09/12/2024.07.22.24310837/F11.medium.gif) [Figure11](http://medrxiv.org/content/early/2024/09/12/2024.07.22.24310837/F11) #### 4.4.6 Conditional Root Causal Effect Estimation TWRCI finally estimates the CRCEs of the genes that cause *Y* given the recovered graph ![Graphic][127] and the annotations 𝒫 We estimate the two conditional expectations in Equation (5) using kernel ridge regression54. We embed *X**i* and ![Graphic][128] using a radial basis function kernel but embed ***T*** \ ***P****i* using a normalized linear kernel. We normalize the latter to prevent the linear kernel from dominating the radial basis function kernel, since the variables in ***T*** \ ***P****i* typically far outnumber those in ![Graphic][129]. We now integrate all steps of TWRCI by formally proving that TWRCI is sound and complete: Algorithm 6 CRCE Graph Discovery ![Figure12](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/09/12/2024.07.22.24310837/F12.medium.gif) [Figure12](http://medrxiv.org/content/early/2024/09/12/2024.07.22.24310837/F12) Theorem 1. *(Fisher consistency) Under d-separation faithfulness, relevance and exchangeability, TWRCI identifies all of the direct causal variants of* ![Graphic][130], *the unique causal graph over* ![Graphic][131] *and the CRCEs of* ![Graphic][132] *almost surely as N* → ∞ *with Lipschitz continuous conditional expectations and a conditional independence oracle*. We perform conditional independence testing by correlating the regression residuals of smooth non-linear transformations of the gene expression levels and phenotype55. As a result, Lemma 1 also enables accurate conditional independence testing over subsets of ![Graphic][133], even though we only have access to ***T*** ⋃ ***R*** ⋃ ***N***. #### 4.4.7 Time Complexity We analyze the time complexity of TWRCI in detail. TWRCI can admit different regression procedures, so we will assume that each regression takes **(*c*3) time, where *c* denotes the dimensionality of the conditioning set typically much larger than the sample size *n*. Most regression procedures satisfy the requirement. TWRCI first runs Algorithm 2 which requires *O*(*q*) time in Line 1 with summary statistics, *O*(*q*3*m*) time in Line 2 with at most *m* regressions on ***T***, and *O*(*m* (*m* + *q*)) time for at most *m* + *q* regressions on ![Graphic][134] in Line 3. Algorithm 2 thus takes *O*(*m*4 + *m*3*q*) + *O*(*q*3*m*) time in total. TWRCI next annotates to *Y* using Algorithm 3 which takes *O*(*q*3*m*) + *O*((*m* + *q*)3) time for Lines 1 and 2, respectively. Annotation to *Y* therefore carries a total time complexity of *O*(*m*3*q*3). TWRCI then runs Algorithm 4. Each iteration of the repeat loop in Line 3 of Algorithm 4 takes *O*(*q*3) time for the regression in Line 4 and *O*(*m*(*m* + *q*)3) time for the at most *m* regressions in Line 7. The repeat loop iterates at most *m* times, so Algorithm 4 has a total time complexity of *O*(*m*(*q*3 + *m*(*m* + *q*)3)) = *O*(*m*5*q*3). Algorithm 5 dominates Algorithm 6 in time during the causal graph discovery portion of TWRCI. Algorithm 5 runs in *O*(*m**e* (*m* + *q*)3) = *O*(*m**e*+3*q*3) time, where *e* denotes the maximum neighborhood size19. Finally, CRCE estimation in Line 5 requires *O*(2*m*(*m* + *q*)3) = *O*(*m*4*q*3) time for at most 2*m* regressions on expression levels and variants. Thus TWRCI in total requires *O*(*m*4 + *m*3*q*) + *O*(*q*3*m*) + *O*(*m*3*q*3) + *O*(*m*5*q*3) + *O*(*m**e*+3*q*3) + *O*(*m*4*q*3) = *O*(*m*5*q*3) + *O*(*m**e*+3*q*3) time. We conclude that the ACO and Graph Discovery sub-algorithms dominate the time complexity of TWRCI. ### 4.5 Comparators We compared TWRCI against state of the art algorithms enumerated below. #### Annotation 1. Nearest TSS: annotates each variant to its closest gene according to the TSS. 2. Cis-window: annotates a variant to a gene if the variant lies within a one megabase window of the TSS. If a variant lies in multiple windows, then we assign the variant to the closest TSS. 3. Causal transcriptome-wide association study (cTWAS)11: annotates variants to genes using cis-windows and then accounts for horizontal pleiotropy using the Sum of SIngle Effects (SuSIE) algorithm. 4. Cis-eQTLs12: annotates a variant to a gene if (1) the variant lies in the cis-window of the gene per above, and (2) the variant correlates most strongly with that gene expression level relative to the other levels. 5. Colocalization with approximate Bayes factors13: annotates each variant to the gene expression level with the highest colocalization probability according to approximate Bayes factors. We *could not* differentiate this method from cis-windows using the MACR criteria for the real data (Methods 4.8), since the algorithm always assigns higher approximate Bayes factors to cis-variants. 6. Colocalization with SuSIE13, 14: same as above but with probabilities determined according to SuSIE. We *could* differentiate this method from cis-windows using the MACR criteria for the real data. #### Causal Graph Reconstruction 1. SIGNET15, 16: predicts gene expression levels from variants using ridge regression and then recovers the genetic ancestors of each expression level by running the adaptive LASSO on the predicted expression levels. The method thus assumes linearity. 2. RCI17: assumes a linear non-Gaussian acyclic model56, and recovers the causal order by maximizing independence between gene expression level residuals obtained from linear regression. 3. GRCI18: same as above but assumes an additive noise model57 and uses non-linear regression. 4. PC/CausalCell20: runs the stabilized PC algorithm19, 53 on the gene expression levels using a non-parametric conditional independence test55. ### 4.6 Semi-Synthetic Data The causal graph reconstruction algorithms all require a variable selection step with gene expression data, since they cannot scale to the tens of thousands of genes with the neighborhood sizes seen in practice1, 20. We therefore assessed the performance of the algorithms independent of variable selection by first instantiating a DAG directly over ![Graphic][135] with *p* = 30 variables including 29 gene expression levels and a single phenotype. We generated a linear SEM obeying Equation (3) such that ![Graphic][136] for every ![Graphic][137] with *E**i* ∼ 𝒩 (0, 1/25) to enable detection of weak causal effects from variants. We drew the coefficient matrix *β* from a Bernoulli(2/(*p* − 1)) in the upper triangular portion of the matrix and then randomly permuted the ordering of the variables. The resultant DAG has an expected neighborhood size of 2. We then weighted the coefficient matrix between the gene expression levels and phenotype by sampling uniformly from [−1, −0.25] ∪ [0.25, 1]. We instantiated the variants ***T*** and *θ* as follows. We downloaded summary statistics from a wide variety of IEU datasets listed in Table 1 and filtered variants at a liberal *α* threshold of 5e-5. We selected a variant to be closest to the TSS of each gene uniformly at random and assigned direct causal variants to the 29 gene expression levels with probability proportional to the inverse of the absolute distance from the closest variant plus one. As a result, variants closer to the TSS are more likely to have a direct causal effect on the gene expression level. We assigned the remaining variants to the phenotype. We sampled ***T*** by bootstrap from the GTEx version 822 individual-level genotype data and the weights *θ* uniformly from [−0.15, −0.05] ∪ [0.05, 0.15] because variants usually have weak causal effects. View this table: [Table 1.](http://medrxiv.org/content/early/2024/09/12/2024.07.22.24310837/T1) Table 1. Variant data used during semi-synthetic data generation. We converted the above linear SEM to a non-linear one by setting ![Graphic][138] for each ![Graphic][139]. We obtained each measurement error corrupted surrogate *R**i* by sampling from ![Graphic][140] for each ![Graphic][141]. We drew the mapping efficiencies *π*·1 for a single batch from the uniform distribution between 100 and 10000 for the bulk RNA sequencing data. We repeated the entirety of the above procedure 100 times to generate 100 independent variant-expression-phenotype datasets. We ran TWRCI and all combinations of the comparator algorithms on each dataset. ### 4.7 Real Data #### 4.7.1 Data Availability All real datasets analyzed in this study have been previously published and are publicly accessible. The COPD datasets include: 1. Summary statistics: [ebi-a-GCST90018807](https://gwas.mrcieu.ac.uk/datasets/ebi-a-GCST90018807/) 2. Individual level variant and phenotype data: [GTEx V8 Protected Access Data](https://gtexportal.org/home/protectedDataAccess) 3. Gene expression data: [GTEx V8 Lung](https://storage.googleapis.com/adult-gtex/bulk-gex/v8/rna-seq/counts-by-tissue/gene\_reads_2017-06-05_v8_lung.gct.gz) 4. Replication summary statistics: [ebi-a-GCST90018587](https://gwas.mrcieu.ac.uk/datasets/ebi-a-GCST90018587/) The IHD datasets include: 1. Summary statistics: [finn-b-I9\_ISCHHEART](https://gwas.mrcieu.ac.uk/datasets/finn-b-I9_IHD/) 2. Individual level variant and phenotype data: [GTEx V8 Protected Access Data](https://gtexportal.org/home/protectedDataAccess) 3. Gene expression data: [GTEx V8 Whole Blood](https://gtexportal.org/home/protectedDataAccess) 4. Replication summary statistics: [ukb-d-I9\_IHD](https://gwas.mrcieu.ac.uk/datasets/ukb-d-I9_IHD/) #### 4.7.2 Quality Control We selected variants ***T*** at an *α* threshold of 5e-5 for both the COPD and IHD summary statistics. We harmonized the variant data of the IEU and GTEx datasets by lifting the GTEx variant data from the hg38 to hg19 build using the liftover command in BCFtools version 1.1858. We ensured that the reference and alternative alleles matched in both datasets after lifting for every variant. We removed gene expression levels with a mean count of less than five. We subjected the gene expression data to an inverse hyperbolic sine transformation to mitigate the effects of outliers. We regressed out the first 5 principal components, sequencing platform (Illumina HiSeq 2000 or HiSeq X), sequencing protocol (PCR-based or PCR-free) and sex from all variables in the linked GTEx variant-expression-phenotype data. Then, we either included age as a covariate for algorithms that accept a nuisance covariate, or regressed out age from the expression and phenotype data for algorithms that do not accept a nuisance covariate. #### 4.7.3 Comparison to trans-eQTLs TWRCI annotated many trans-variants in both of the real datasets. Other authors have proposed *trans-eQTLs* as variants that lie distal to the TSS and correlate with at least one reported phenotype in the Catalog of Published GWAS59. TWRCI annotates variants based on direct causality rather than correlation and an overlap with another phenotype. However, we hypothesized that the variants discovered by TWRCI should still lie close to at least a subset of the trans-eQTLs. To test this hypothesis, we downloaded trans-eQTL results from the [eQTLGen database](https://eqtlgen.org/trans-eqtls.html)29. We then standardized the positions of the variants within each chromosome by their standard deviation to account for variable chromosome length and polymorphism density. Next, we computed the nearest neighbor distances between the variants annotated to causal genes by TWRCI and the trans-eQTLs. We used the median of these normalized distances *M* as a robust statistic of central tendency. We used a permutation test to test the null hypothesis that the variants annotated to causal genes by TWRCI are distributed arbitrarily far from the trans-eQTLs. We recomputed the median statistic 10,000 times after permuting the positions of the trans-eQTL variants. The p-value corresponds to the proportion of permuted statistics smaller than *M*. We reject the null hypothesis – and thus conclude that the variants annotated to causal genes by TWRCI lie close to trans-eQTLs – when the p-value falls below 0.05. ### 4.8 Metrics We evaluated the accuracy of the algorithms using the nine metrics listed below for the synthetic data. We evaluated annotation quality using the following two metrics: * 1. Matthew’s Correlation Coefficient (MCC)60 between the estimated annotations and the ground truth direct causal variants. Larger is better. * 2. Rank of the estimated coefficients ![Graphic][142] normalized by the rank of the ground truth coefficients *θ*. Larger is better. We also computed the above two quantities only using the variants that directly cause the phenotype in order to evaluate the ability of the algorithms to account for horizontal pleiotropy (3. and 4.). We evaluated the causal graph reconstruction quality using the following two metrics: * 5. Structural Hamming Distance (SHD)61 between the estimated and the ground truth causal graph. Smaller is better. * 6. MCC between the estimated and the ground truth causal graph. Larger is better. We evaluated combined annotation and graph reconstruction quality using Lemma 4: * 7. Mean absolute correlation of the residuals (MACR) defined as the mean absolute correlation between (a) the variants ***T*** and ancestral gene expression levels, and (b) the gene expression residuals after partialing out the inferred parents. Smaller is better under the global Markov property and exchangeability. If the algorithm infers no direct causal variants in ***T*** and no parents in ![Graphic][143] for some ![Graphic][144], then this situation violates the relevance assumption, where at least one variant in ***T*** directly causes ![Graphic][145]. We thus set the absolute correlation of ![Graphic][146] to one in this case. We assessed the accuracy in CRCE estimation using the following metrics: * 8. Root mean squared error between the estimated CRCE and the ground truth CRCE averaged over all gene expression levels. We do not have access to the ground truth CRCE, so we estimate it to negligible error with kernel ridge regression using the ground truth parents. Smaller is better. * 9. MACR between (a) the residuals ![Graphic][147] and (b) the inferred set ***P****i*, which should be zero under the global Markov property and exchangeability. Smaller is better. We again set the absolute correlation to one for ![Graphic][148] if the algorithm infers no direct causal variants and no parents in ![Graphic][149] under relevance. We can compute the MACR metrics 7. and 9. on real data, so we evaluate the algorithms using these two metrics in the IHD and COPD datasets. We also have access to silver standard sets of genes known to be causally involved in disease from either the DisGeNet26 or KEGG database39. We therefore compute a third MACR metric with the real data: * 10. A causal gene should at least correlate with the phenotype, so we first correlate the silver standard genes with the phenotype and only keep silver standard genes with a signification correlation (*p* < 0.05 uncorrected). We then compute a MACR metric between (a) the kept silver standard genes after partialing out genes with non-zero CRCEs and (b) the phenotype after partialing out genes with non-zero CRCEs. ### 4.9 Code Availability R code needed to replicate all experimental results is available on [GitHub](https://github.com/ericstrobl/TWRCI). ## 5. Supplementary Materials ### 5.1 Additional Semi-Synthetic Data Results ![Supplementary Figure 1.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/09/12/2024.07.22.24310837/F13.medium.gif) [Supplementary Figure 1.](http://medrxiv.org/content/early/2024/09/12/2024.07.22.24310837/F13) Supplementary Figure 1. Timing results for the semi-synthetic datasets split into the variant annotation and graph reconstruction portions because they took the longest by far. (a) TWRCI took the longest time during annotation, but (b) all algorithms spent the majority of the time in causal graph reconstruction over ![Graphic][150] in congruence with the time complexity results of Methods 4.4.7. TWRCI completed within about 3 minutes overall. ### 5.2 Additional COPD Data Results ![Supplementary Figure 2.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/09/12/2024.07.22.24310837/F14.medium.gif) [Supplementary Figure 2.](http://medrxiv.org/content/early/2024/09/12/2024.07.22.24310837/F14) Supplementary Figure 2. Additional results for COPD. (a) Sum of squares plot for hierarchical clustering using Ward’s method revealed three clusters according to the elbow method, or the cluster size with the maximum distance from the imaginary line drawn between the first and last cluster sizes. TWRCI took the second longest time to complete in annotation (b) and the second longest time to complete in graph reconstruction (c). RCI, GRCI and PC all took a much smaller amount of time to reconstruct the causal graph because they ignore the genetic variants. (d) Histogram of Pearson correlation test p-values computed between variants annotated to the phenotype and gene expression levels. The p-values did not follow a uniform distribution according to the Kolomogorov-Smirnov test with statistic *D* indicating the presence of confounding between the variants annotated to the phenotype and gene expression. ![Supplementary Figure 3.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/09/12/2024.07.22.24310837/F15.medium.gif) [Supplementary Figure 3.](http://medrxiv.org/content/early/2024/09/12/2024.07.22.24310837/F15) Supplementary Figure 3. Replication results in an independent set of individuals of East Asian ancestry (dataset ebi-a-GCST90018587). We summarize results for (a) all patients, (b) cluster one in Figure 4 (g) and (c) cluster two. TWRCI again identified C4A and multiple MHC class II genes involving the adaptive immune system. ### 5.3 Additional IHD Data Results ![Supplementary Figure 4.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/09/12/2024.07.22.24310837/F16.medium.gif) [Supplementary Figure 4.](http://medrxiv.org/content/early/2024/09/12/2024.07.22.24310837/F16) Supplementary Figure 4. Additional results for IHD. (a) Sum of squares plot revealed two clusters according to the elbow method. (b) TWRCI took the longest to annotate, but (c) the timing results for graph reconstruction dominated in this case. Methods using SIGNET thus took the longest overall in this dataset. (d) Histogram of Pearson correlation test p-values were again non-uniform, indicating confounding between the variants annotated to the phenotype and gene expression. ![Supplementary Figure 5.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/09/12/2024.07.22.24310837/F17.medium.gif) [Supplementary Figure 5.](http://medrxiv.org/content/early/2024/09/12/2024.07.22.24310837/F17) Supplementary Figure 5. Replication results in an independent set of patients from the UK Biobank (dataset ukb-d-I9_IHD). We summarize results for (a) all patients, and (b) cluster one. TWRCI again identified MRPL1 as a root causal gene with a large positive CRCE. ### 5.4 Proofs Lemma 1. *Assume Lipschitz continuity of the conditional expectation for all N* ≥ *n*: ![Formula][151] *where C**N* ∈ *O*(1) *is a positive constant, and we have taken an outer expectation on both sides. Then* ![Graphic][152] *almost surely*. *Proof*. We can write the following sequence: ![Formula][153] where we have applied the Lipschitz continuity assumption at the first inequality. We have *C**N* ≤ *C* for all *N* ≥ *n* in the second inequality because *C**N* ∈ *O*(1). With the above bound, choose *a* > 0 and invoke the Markov inequality: ![Formula][154] The conclusion follows because we chose *a* arbitrarily. Proposition 1. *We have* ![Graphic][155] *under exchangeability. Proof*. We can write: ![Formula][156] The third equality follows because ![Graphic][157] is a constant given *E* and ![Graphic][158]. For the fourth equality, all paths between ***S****i* and *Y* are blocked by ![Graphic][159] under exchangeability. We thus have ![Graphic][160] and ![Graphic][161] by the global Markov property. Lemma 2. *Assume d-separation faithfulness and relevance. Then*, ![Graphic][162] *contains all of the ancestors of Y in* ![Graphic][163], *and* ![Graphic][164] *for any* ![Graphic][165]. *Proof*. We first prove (1). If *S**i* is an ancestor of *Y*, then ![Graphic][166], so ![Graphic][167] by d-separation faithfulness. It follows that *S* *i* ∈ ***T*** by Line 1 of Algorithm 2. If ![Graphic][168] is an ancestor of *Y*, then so is ***S****i*⊆ ***T***. Hence ![Graphic][169],so ![Graphic][170] and ![Graphic][171] by d-separation faithfulness and Line 2, respectively. We chose *S**i* and ![Graphic][172] arbitrarily, so the set ![Graphic][173] contains all of the ancestors of *Y* in ![Graphic][174]. We now prove (2). We need to show that ![Graphic][175] contains the parents, children and spouses of any ![Graphic][176], provided that these relatives are also in ![Graphic][177]. Note that ![Graphic][178] by Line 2 under the global Markov property. Hence, the parents and children of ![Graphic][179] in ![Graphic][180] are also d-connected to ***T*** and hence dependent on ***T*** under d-separation faithfulness. It follows that ![Graphic][181] contains all of the parents and children of ![Graphic][182] also by Line 2. Next, suppose ![Graphic][183] does not contain a spouse of ![Graphic][184], which we denote by ![Graphic][185]. Then we have ![Graphic][186] and *S*∈ ***T*** under relevance. Hence ![Graphic][187], so ![Graphic][188] by d-separation faithfulness and *X**i*∈***N*** by Line 3. It follows that ![Graphic][189] contains all of the spouses of ![Graphic][190]. We conclude that ![Graphic][191] contains all members of ![Graphic][192] of any ![Graphic][193]. Lemma 8. *Under d-separation faithfulness, relevance and exchangeability, (1) T**j* ∉ Anc(*Q**i*) *if and only if Q**i* ⫫ *T**j* |***T*** \ *T**j* *and* ![Graphic][194] *and* ![Graphic][195] *if and only if* ![Graphic][196]. *Proof*. For the first statement and forward direction, if *T**j* ∉ Anc(*Q**i*), then *Q**i* ⫫*d* *T**j* |***T*** \ *T**j* under exchangeability, so *Q**i* ⫫ *T**j* |***T*** \ *T**j* by the global Markov property. For the backward direction, if *Q**i* ⫫ *T**j* |***T*** \ *T**j*, then *Q**i* ⫫*d* *T**j* |***T*** \ *T**j* by d-separation faithfulness. No directed path can thus exist from *T**j* to *Q**i*, so *T**j* ∉ Anc(*Q**i*). We next address the second statement. The backward direction follows immediately from d-separation faithfulness. For the forward direction, if ![Graphic][197], then *T**j* ∈Anc (*Q**i*)from statement (1). Furthermore, if ![Graphic][198] then ![Graphic][199] under the global Markov property. Note that ![Graphic][200] contains ![Graphic][201] by Lemma 2 under d-separation faithfulness and relevance. Therefore, if *T**j* is not in the Markov boundary of ![Graphic][202], then ![Graphic][203] contains ![Graphic][204]. As a result, all paths between *T**j* and ![Graphic][205] are blocked by ![Graphic][206] under exchangeability. We thus arrive at the contradiction ![Graphic][207]. It follows that *T**j* must be in the Markov boundary of ![Graphic][208] and therefore can only be a parent or a spouse of ![Graphic][209] (or both). If *T**j* is a spouse but not a parent of ![Graphic][210], then we arrive at another contradiction that *T**j* ∉ Anc(*Q**i*). Hence![Graphic][211]. Lemma 3. *Assume d-separation faithfulness, relevance and exchangeability. Further assume that* ![Graphic][212] *is a sink vertex. Then*, ![Graphic][213] *if and only if T**j* ∉ Anc(***Q*** \ *Q**i*) *or* ![Graphic][214]*(or both)*. *Proof*. Assume ![Graphic][215] for the forward direction. We have two cases. If |Δ*i j* *γ**i j* | > 0, then ![Graphic][216], and ![Graphic][217], so ![Graphic][218] by Lemma 8. If |Δ*i j* *γ**i j* | = 0, then ![Graphic][219], so *Q**k* ⫫ *T**j* |***T*** \ *T**j* for all *Q**k* ∈ ***Q*** \ *Q**i*. We conclude that *T**j* ∉ Anc(***Q*** \ *Q**i*) by again invoking Lemma 8. For the backward direction, if *T**j* ∉ Anc(***Q*** \ *Q**i*), then *Q**k* ⫫ *T**j* |***T*** \ *T**j* for all *Q**k* ∈ ***Q*** \ *Q**i* by Lemma 8. Thus ![Graphic][220] so ![Graphic][221]. If ![Graphic][222], then *T**j* ∈ Anc (***Q*** \ *Q**i*) because ![Graphic][223] is a sink vertex. Hence *Q**k* ⫫ *T**j* |***T*** \ *T**j* for all *Q**k* ∈ ***Q*** \ *Q**i* by Lemma 8, so ![Graphic][224]. We conclude that ![Graphic][225]. Lemma 4. ![Graphic][226] *is a sink vertex if and only if* ![Graphic][227] *in Line 12 of ACO under d-separation faithfulness, relevance and exchangeability*. *Proof*. Assume that ![Graphic][228] is a sink vertex for the forward direction. We have two cases: 1. If ![Graphic][229], then ![Graphic][230] by the first statement of Lemma 2 and Lemma 3. Note that *Ñ* Hence, ![Graphic][231] because ![Graphic][232] is a sink vertex, and ![Graphic][233] in Line 12 by the global Markov property. 2. If ![Graphic][234], then ![Graphic][235] contains all of the parents of ![Graphic][236] in ![Graphic][237] and ***T*** by Lemma 2 and Lemma 3, respectively. Moreover, the other direct causal variants of ![Graphic][238], or ***S****i* \ ***T***, share no latent confounders with ***T*** or any other direct causal variant set excluding ***T*** by exchangeability. Hence, we also have ![Graphic][239], and ![Graphic][240] in Line 12 by the global Markov property. We have exhausted all possibilities and thus conclude that ![Graphic][241]. For the backward direction, assume![Graphic][242] so that ![Graphic][243] by d-separation faithfulness. Assume for a contradiction that ![Graphic][244] is not a sink vertex. Then there exists a path ![Graphic][245] for some *T**k* ∈ ***S*** *j* ∩***T*** by relevance. We thus have Δ*ik* = 0 but ![Graphic][246], so *T**k* ∉ ***U****i* and *T**k* ∈ ***T*** \ ***U****i*. We arrive at the contradiction ![Graphic][247] The variable ![Graphic][248] must therefore be a sink vertex. Lemma 5. *Under d-separation faithfulness, relevance and exchangeability, ACO recovers the correct causal order* ***K*** *over* ![Graphic][249] *and* (***S****i* ∩***T***) ⊆ ***P****i* *for all* ![Graphic][250]. *Proof*. We use proof by induction. Base: Suppose ![Graphic][251] contains one variable ![Graphic][252]. Then ***K*** = (*R**i*,*Y*) because *R**i* is trivially the most independent variable in ***R*** according to ***C*** of Line 15. The variable *R**i* is a sink vertex after *Y* is eliminated, so we have (***S****i* ∩***T***) ⊆ ***P****i* under d-separation faithfulness, relevance and exchangeability by Lemma 3. Step: Assume that the conclusion holds when ![Graphic][253] contains *p* − 1 variables. We need to prove the statement when ![Graphic][254] contains *p* variables. Assume for now that ![Graphic][255] is an arbitrary sink vertex in ![Graphic][256]. Lemma 3 then guarantees ![Graphic][257] for each *S* *j* ∈ ***S*** *p* ∩ ***T*** in Line 8 under d-separation faithfulness, relevance and exchangeability. We thus have (***S*** *p* ∩***T***) ⊆ ***P****p* and no variant of any other parent set is in ***P****p*. Finally, the measure of dependence *C**p* in Line 15 identifies *R**p* as a sink vertex by Lemma 4. ACO thus eliminates *R**p* from ***R*** and appends it to the front of ***K***. The conclusion follows by the inductive hypothesis. Lemma 6. *Under d-separation faithfulness, relevance and exchangeability, the graph discovery algorithm outputs the true sub-DAG over* ![Graphic][258] *given a conditional independence oracle*, ***K*** *and* 𝒫. *Proof*. The set ![Graphic][259] contains all of the parents of any ![Graphic][260] in ![Graphic][261] by Lemma 2. Furthermore, ***P****i* contains all of the parents of ![Graphic][262] in ***T*** for any ![Graphic][263] by Lemma 4. The stabilized skeleton discovery procedure of the PC algorithm thus recovers all and only the undirected edges in the true DAG over ![Graphic][264] under d-separation faithfulness and exchangeability53. The conclusion follows because ACO recovers the true causal order over ![Graphic][265] also by Lemma 5, so Algorithm 5 infers the true sub-DAG uniquely over ![Graphic][266] in Line 19. Lemma 7. *Under d-separation faithfulness, relevance and exchangeability*, ![Graphic][267] *causes Y – and likewise the vertices* ***S****i*∪ *E**i* *cause Y – if and only if* ![Graphic][268]. *Proof*. Recall that (***S****i* ∩***T***) ⊆ ***P****i* by Lemma 5 under d-separation faithfulness, relevance and exchangeability. Now if ![Graphic][269] causes *Y*, then there exists a directed path from ![Graphic][270] to *Y* so ![Graphic][271]. We then have ![Graphic][272] by d-separation faithfulness. For the backward direction, assume that ![Graphic][273] does not cause *Y*. All paths between ![Graphic][274] and *Y* are blocked by ![Graphic][275] under exchangeability. Thus ![Graphic][276] and *Y* are d-separated given ![Graphic][277]. We invoke the global Markov property to conclude that ![Graphic][278]. Theorem 1. *(Fisher consistency) Under d-separation faithfulness, relevance and exchangeability, TWRCI identifies all of the direct causal variants of* ![Graphic][279], *the unique causal graph over* ![Graphic][280] *and the CRCEs of* ![Graphic][281] *almost surely as N* → ∞ *with Lipschitz continuous conditional expectations and a conditional independence oracle*. *Proof*. Lemma 2 ensures ![Graphic][282] from Line 1 of Algorithm 1 contains all of the ancestors of *Y* in ![Graphic][283]. Thus ![Graphic][284] and (Anc (*Y*) ∩ ***S***⊆ ***T***). TWRCI identifies ***S****Y*⊆ ***P****Y* in Line 2 by Lemma 3 under d-separation faithfulness, relevance and exchangeability. The algorithm also identifies ***S****i*⊆ ***P****i* for each ![Graphic][285] under d-separation faithfulness, relevance and exchangeability in Line 3 by invoking Lemma 5. Furthermore, TWRCI recovers the causal order over ![Graphic][286] via Lemma 5. TWRCI thus uniquely recovers the sub-DAG over ![Graphic][287] in Line 4 by Lemma 6 and then correctly includes *Y* in the graph by Lemma 7. TWRCI finally identifies the CRCEs of ![Graphic][288] almost surely in Line 5 given the recovered DAG over ![Graphic][289] and 𝒫 by Lemma 1. ## Acknowledgements Research reported in this manuscript was supported by (1) the National Human Genome Research Institute of the National Institutes of Health under award numbers R01HG011138 and R35HG010718, and (2) the National Institute of General Medical Sciences of the National Institutes of Health under award number R01GM140287. ## Footnotes * Improved the TWRCI algorithm and re-ran the experiments with the improved version * 1 If multiple genes were present in the window, then we assigned the variant to the gene with the nearest TSS. * Received July 22, 2024. * Revision received September 12, 2024. * Accepted September 12, 2024. * © 2024, Posted by Cold Spring Harbor Laboratory This pre-print is available under a Creative Commons License (Attribution-NonCommercial-NoDerivs 4.0 International), CC BY-NC-ND 4.0, as described at [http://creativecommons.org/licenses/by-nc-nd/4.0/](http://creativecommons.org/licenses/by-nc-nd/4.0/) ## References 1. 1.Strobl, E. V. & Gamazon, E. R. Discovering root causal genes with high throughput perturbations. eLife (2024, in press). 2. 2.Boyle, E. A., Li, Y. I. & Pritchard, J. K. An expanded view of complex traits: from polygenic to omnigenic. Cell 169, 1177–1186 (2017). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.cell.2017.05.038&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=28622505&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F09%2F12%2F2024.07.22.24310837.atom) 3. 3.Martínez-Jiménez, F. et al. A compendium of mutational cancer driver genes. Nat. Rev. Cancer 20, 555–572 (2020). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/S41568-020-0290-X&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F09%2F12%2F2024.07.22.24310837.atom) 4. 4.Strobl, E. V., Lasko, T. A. & Gamazon, E. R. Mitigating pathogenesis for target discovery and disease subtyping. Comput. Biol. Medicine 108122 (2024). 5. 5.Dixit, A. et al. Perturb-seq: dissecting molecular circuits with scalable single-cell rna profiling of pooled genetic screens. Cell 167, 1853–1866 (2016). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.cell.2016.11.038&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=27984732&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F09%2F12%2F2024.07.22.24310837.atom) 6. 6.Replogle, J. M. et al. Mapping information-rich genotype-phenotype landscapes with genome-scale perturb-seq. Cell 185, 2559–2575 (2022). 7. 7.Huang, C. et al. Personal transcriptome variation is poorly explained by current genomic deep learning models. Nat. Genet. 55, 2056–2059 (2023). 8. 8.Sasse, A. et al. Benchmarking of deep neural networks for predicting personal gene expression from dna sequence highlights shortcomings. Nat. Genet. 55, 2060–2064 (2023). 9. 9.Zhang, Y. & Politis, D. N. Ridge regression revisited: Debiasing, thresholding and bootstrap. The Annals Stat. 50, 1401–1422 (2022). 10. 10.Lousdal, M. L. An introduction to instrumental variable assumptions, validation and estimation. Emerg. Themes Epidemiol. 15, 1 (2018). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1186/s12982-018-0069-7&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=29387137&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F09%2F12%2F2024.07.22.24310837.atom) 11. 11.Zhao, S. et al. Adjusting for genetic confounders in transcriptome-wide association studies improves discovery of risk genes of complex traits. Nat. Genet. 56, 336–347 (2024). 12. 12.Rockman, M. V. & Kruglyak, L. Genetics of global gene expression. Nat. Rev. Genet. 7, 862–872 (2006). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/nrg1964&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=17047685&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F09%2F12%2F2024.07.22.24310837.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000241384100014&link_type=ISI) 13. 13.Giambartolomei, C. et al. Bayesian test for colocalisation between pairs of genetic association studies using summary statistics. PLoS Genet. 10, e1004383 (2014). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.pgen.1004383&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=24830394&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F09%2F12%2F2024.07.22.24310837.atom) 14. 14.Wang, G., Sarkar, A., Carbonetto, P. & Stephens, M. A simple new approach to variable selection in regression, with application to genetic fine mapping. J. Royal Stat. Soc. Ser. B: Stat. Methodol. 82, 1273–1300 (2020). 15. 15.Chen, C., Ren, M., Zhang, M. & Zhang, D. A two-stage penalized least squares method for constructing large systems of structural equations. J. Mach. Learn. Res. 19, 1–34 (2018). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.5555/3291125.3309634&link_type=DOI) 16. 16.Jiang, Z. et al. Signet: transcriptome-wide causal inference for gene regulatory networks. Sci. Reports 13, 19371 (2023). 17. 17.Strobl, E. V. & Lasko, T. A. Identifying patient-specific root causes of disease. In Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, 1–10 (2022). 18. 18.Strobl, E. V. & Lasko, T. A. Identifying patient-specific root causes with the heteroscedastic noise model. J. Comput. Sci. 72, 102099 (2023). 19. 19.Spirtes, P., Glymour, C. & Scheines, R. Causation, Prediction, and Search (MIT press, 2000), 2nd edn. 20. 20.Wen, Y. et al. Applying causal discovery to single-cell analyses using causalcell. Elife 12, e81464 (2023). 21. 21.Sakaue, S. et al. A cross-population atlas of genetic associations for 220 human phenotypes. Nat. Genet. 53, 1415–1424 (2021). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41588-021-00931-x&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F09%2F12%2F2024.07.22.24310837.atom) 22. 22.Consortium, G. The gtex consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020). [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6Mzoic2NpIjtzOjU6InJlc2lkIjtzOjEzOiIzNjkvNjUwOS8xMzE4IjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjQvMDkvMTIvMjAyNC4wNy4yMi4yNDMxMDgzNy5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 23. 23.Barnes, P. J. Inflammatory mechanisms in patients with chronic obstructive pulmonary disease. J. Allergy Clin. Immunol. 138, 16–27 (2016). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.jaci.2016.05.011&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=27373322&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F09%2F12%2F2024.07.22.24310837.atom) 24. 24.Zhou, D. et al. A unified framework for joint-tissue transcriptome-wide association and mendelian randomization analysis. Nat. Genet. 52, 1239–1246 (2020). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41588-020-0706-2&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=33020666&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F09%2F12%2F2024.07.22.24310837.atom) 25. 25.Caramori, G. et al. Copd immunopathology. In Seminars in Immunopathology, vol. 38, 497–515 (Springer, 2016). 26. 26.Piñero, J. et al. The disgenet knowledge platform for disease genomics: 2019 update. Nucleic Acids Res. 48, D845–D855 (2020). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/NAR/GKZ1021&link_type=DOI) 27. 27.Breiman, L. Classification and regression trees (Routledge, 2017). 28. 28.Verbanck, M., Chen, C.-Y., Neale, B. & Do, R. Detection of widespread horizontal pleiotropy in causal relationships inferred from mendelian randomization between complex traits and diseases. Nat. Genet. 50, 693–698 (2018). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41588-018-0099-7&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=29686387&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F09%2F12%2F2024.07.22.24310837.atom) 29. 29.Võsa, U. et al. Large-scale cis-and trans-eqtl analyses identify thousands of genetic loci and polygenic scores that regulate blood gene expression. Nat. Genet. 53, 1300–1310 (2021). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1177/0022146515594631.Marriage&link_type=DOI) 30. 30.Nurwidya, F., Damayanti, T. & Yunus, F. The role of innate and adaptive immune cells in the immunopathogenesis of chronic obstructive pulmonary disease. Tuberc. Respir. Dis. 79, 5 (2016). 31. 31.West, E. E., Kolev, M. & Kemper, C. Complement and the regulation of t cell responses. Annu. Rev. Immunol. 36, 309–338 (2018). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1146/annurev-immunol-042617-053245&link_type=DOI) 32. 32.Detsika, M., Palamaris, K., Dimopoulou, I., Kotanidou, A. & Orfanos, S. The complement cascade in lung injury and disease. Respir. Res. 25, 20 (2024). 33. 33.Li, X. et al. Association between psoriasis and chronic obstructive pulmonary disease: a systematic review and meta-analysis. PloS One 10, e0145221 (2015). 34. 34.McInnes, L., Healy, J. & Melville, J. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018). 35. 35.Ward Jr, J. H. Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 236–244 (1963). 36. 36.Elsworth, B. et al. The mrc ieu opengwas data infrastructure. BioRxiv 2020–08 (2020). 37. 37.Jensen, R. V., Hjortbak, M. V. & Bøtker, H. E. Ischemic heart disease: an update. In Seminars in Nuclear Medicine, vol. 50, 195–207 (Elsevier, 2020). 38. 38.Bycroft, C. et al. The uk biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41586-018-0579-z&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=30305743&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F09%2F12%2F2024.07.22.24310837.atom) 39. 39.Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y. & Morishima, K. Kegg: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 45, D353–D361 (2017). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/nar/gkw1092&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=27899662&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F09%2F12%2F2024.07.22.24310837.atom) 40. 40.Batty, M., Bennett, M. R. & Yu, E. The role of oxidative stress in atherosclerosis. Cells 11, 3843 (2022). 41. 41.Gan, X. et al. Tag-mediated isolation of yeast mitochondrial ribosome and mass spectrometric identification of its new components. Eur. J. Biochem. 269, 5203–5214 (2002). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1046/j.1432-1033.2002.03226.x&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=12392552&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F09%2F12%2F2024.07.22.24310837.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000178811200011&link_type=ISI) 42. 42.Saigusa, R., Winkels, H. & Ley, K. T cell subsets and functions in atherosclerosis. Nat. Rev. Cardiol. 17, 387–401 (2020). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41569-020-0352-5&link_type=DOI) 43. 43.Lenk, G. M. et al. Crispr knockout screen implicates three genes in lysosome function. Sci. Reports 9, 9609 (2019). 44. 44.Mostafavi, H., Spence, J. P., Naqvi, S. & Pritchard, J. K. Systematic differences in discovery of genetic effects on gene expression and complex traits. Nat. Genet. 55, 1866–1875 (2023). 45. 45.Strobl, E. V. Causal discovery with a mixture of dags. Mach. Learn. 1–25 (2022). 46. 46.Yang, W. et al. Promoter-sharing by different genes in human genome–cpne1 and rbm12 gene pair as an example. BMC Genomics 9, 1–16 (2008). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1186/1471-2164-9-1&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=18171476&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F09%2F12%2F2024.07.22.24310837.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000257872300001&link_type=ISI) 47. 47.Lauritzen, S. L., Dawid, A. P., Larsen, B. N. & Leimer, H.-G. Independence properties of directed markov fields. Networks 20, 491–505 (1990). 48. 48.Choudhary, S. & Satija, R. Comparison and evaluation of statistical error models for scrna-seq. Genome Biol. 23, 27 (2022). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1186/s13059-021-02584-9&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=35042561&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F09%2F12%2F2024.07.22.24310837.atom) 49. 49.Sarkar, A. & Stephens, M. Separating measurement and expression models clarifies confusion in single-cell rna sequencing analysis. Nat. Genet. 53, 770–777 (2021). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41588-021-00873-4&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=34031584&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F09%2F12%2F2024.07.22.24310837.atom) 50. 50.Bowley, A. The standard deviation of the correlation coefficient. J. Am. Stat. Assoc. 23, 31–34 (1928). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1080/01621459.1928.10502991&link_type=DOI) 51. 51.Stone, M. Cross-validatory choice and assessment of statistical predictions. J. Royal Stat. Soc. Ser. B (Methodological) 36, 111–133 (1974). 52. 52.Storey, J. D. The positive false discovery rate: a bayesian interpretation and the q-value. The Annals Stat. 31, 2013–2035 (2003). 53. 53.Colombo, D., Maathuis, M. H. et al. Order-independent constraint-based causal structure learning. J. Mach. Learn. Res. 15, 3741–3782 (2014). 54. 54.Cristianini, N. & Shawe-Taylor, J. An introduction to support vector machines and other kernel-based learning methods (Cambridge University Press, 2000). 55. 55.Strobl, E. V., Zhang, K. & Visweswaran, S. Approximate kernel-based conditional independence tests for fast non-parametric causal discovery. J. Causal Inference 7, 20180017 (2019). 56. 56.Shimizu, S., Hoyer, P. O., Hyvärinen, A., Kerminen, A. & Jordan, M. A linear non-gaussian acyclic model for causal discovery. J. Mach. Learn. Res. 7 (2006). 57. 57.Hoyer, P., Janzing, D., Mooij, J. M., Peters, J. & Schölkopf, B. Nonlinear causal discovery with additive noise models. Adv. Neural Inf. Process. Syst. 21 (2008). 58. 58.Danecek, P. et al. Twelve years of samtools and bcftools. Gigascience 10, giab008 (2021). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/gigascience/giab008&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=33590861&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F09%2F12%2F2024.07.22.24310837.atom) 59. 59.Westra, H.-J. et al. Systematic identification of trans eqtls as putative drivers of known disease associations. Nat. Genet. 45, 1238–1243 (2013). [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/ng.2756&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=24013639&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F09%2F12%2F2024.07.22.24310837.atom) 60. 60.Matthews, B. W. Comparison of the predicted and observed secondary structure of t4 phage lysozyme. Biochimica et Biophys. Acta (BBA)-Protein Struct. 405, 442–451 (1975). 61. 61.Acid, S. & de Campos, L. M. Searching for bayesian network structures in the space of restricted acyclic partially directed graphs. J. Artif. Intell. Res. 18, 445–490 (2003). [1]: F1/embed/inline-graphic-1.gif [2]: /embed/inline-graphic-2.gif [3]: /embed/inline-graphic-3.gif [4]: /embed/inline-graphic-4.gif [5]: F2/embed/inline-graphic-5.gif [6]: /embed/inline-graphic-6.gif [7]: /embed/inline-graphic-7.gif [8]: /embed/inline-graphic-8.gif [9]: /embed/inline-graphic-9.gif [10]: /embed/inline-graphic-10.gif [11]: /embed/inline-graphic-11.gif [12]: /embed/inline-graphic-12.gif [13]: /embed/inline-graphic-13.gif [14]: /embed/inline-graphic-14.gif [15]: /embed/inline-graphic-15.gif [16]: /embed/inline-graphic-16.gif [17]: /embed/inline-graphic-17.gif [18]: /embed/inline-graphic-18.gif [19]: /embed/inline-graphic-19.gif [20]: /embed/inline-graphic-20.gif [21]: /embed/inline-graphic-21.gif [22]: /embed/inline-graphic-22.gif [23]: /embed/inline-graphic-23.gif [24]: /embed/inline-graphic-24.gif [25]: /embed/graphic-3.gif [26]: /embed/inline-graphic-25.gif [27]: /embed/inline-graphic-26.gif [28]: /embed/inline-graphic-27.gif [29]: /embed/inline-graphic-28.gif [30]: /embed/inline-graphic-29.gif [31]: /embed/inline-graphic-30.gif [32]: /embed/graphic-7.gif [33]: /embed/inline-graphic-31.gif [34]: /embed/graphic-8.gif [35]: /embed/inline-graphic-32.gif [36]: /embed/inline-graphic-33.gif [37]: /embed/graphic-9.gif [38]: /embed/inline-graphic-34.gif [39]: /embed/inline-graphic-35.gif [40]: /embed/inline-graphic-36.gif [41]: /embed/inline-graphic-37.gif [42]: /embed/inline-graphic-38.gif [43]: /embed/inline-graphic-39.gif [44]: /embed/graphic-11.gif [45]: /embed/inline-graphic-40.gif [46]: /embed/inline-graphic-41.gif [47]: /embed/inline-graphic-42.gif [48]: /embed/inline-graphic-43.gif [49]: /embed/inline-graphic-44.gif [50]: /embed/graphic-12.gif [51]: /embed/inline-graphic-45.gif [52]: /embed/inline-graphic-46.gif [53]: /embed/inline-graphic-47.gif [54]: /embed/inline-graphic-48.gif [55]: /embed/inline-graphic-49.gif [56]: /embed/inline-graphic-50.gif [57]: /embed/inline-graphic-51.gif [58]: /embed/inline-graphic-52.gif [59]: /embed/graphic-13.gif [60]: /embed/inline-graphic-53.gif [61]: /embed/inline-graphic-54.gif [62]: /embed/inline-graphic-55.gif [63]: /embed/graphic-14.gif [64]: /embed/inline-graphic-56.gif [65]: /embed/inline-graphic-57.gif [66]: /embed/inline-graphic-58.gif [67]: /embed/inline-graphic-59.gif [68]: /embed/inline-graphic-60.gif [69]: /embed/inline-graphic-61.gif [70]: /embed/graphic-15.gif [71]: /embed/inline-graphic-62.gif [72]: /embed/inline-graphic-63.gif [73]: /embed/inline-graphic-64.gif [74]: /embed/inline-graphic-65.gif [75]: /embed/inline-graphic-66.gif [76]: /embed/inline-graphic-67.gif [77]: /embed/inline-graphic-68.gif [78]: /embed/inline-graphic-69.gif [79]: /embed/graphic-16.gif [80]: /embed/inline-graphic-70.gif [81]: /embed/inline-graphic-71.gif [82]: /embed/inline-graphic-72.gif [83]: /embed/inline-graphic-73.gif [84]: /embed/inline-graphic-74.gif [85]: /embed/inline-graphic-75.gif [86]: /embed/graphic-17.gif [87]: /embed/inline-graphic-76.gif [88]: /embed/inline-graphic-77.gif [89]: /embed/inline-graphic-78.gif [90]: /embed/inline-graphic-79.gif [91]: /embed/inline-graphic-80.gif [92]: /embed/inline-graphic-81.gif [93]: /embed/inline-graphic-82.gif [94]: /embed/inline-graphic-83.gif [95]: /embed/inline-graphic-84.gif [96]: /embed/inline-graphic-85.gif [97]: /embed/inline-graphic-86.gif [98]: /embed/inline-graphic-87.gif [99]: /embed/inline-graphic-88.gif [100]: /embed/inline-graphic-89.gif [101]: /embed/inline-graphic-90.gif [102]: /embed/inline-graphic-91.gif [103]: /embed/inline-graphic-92.gif [104]: /embed/inline-graphic-93.gif [105]: /embed/inline-graphic-94.gif [106]: /embed/inline-graphic-95.gif [107]: /embed/inline-graphic-96.gif [108]: /embed/inline-graphic-97.gif [109]: /embed/inline-graphic-98.gif [110]: /embed/inline-graphic-99.gif [111]: /embed/inline-graphic-100.gif [112]: /embed/inline-graphic-101.gif [113]: /embed/inline-graphic-102.gif [114]: /embed/inline-graphic-103.gif [115]: /embed/inline-graphic-104.gif [116]: /embed/inline-graphic-105.gif [117]: /embed/inline-graphic-106.gif [118]: /embed/inline-graphic-107.gif [119]: /embed/inline-graphic-108.gif [120]: /embed/inline-graphic-109.gif [121]: /embed/inline-graphic-110.gif [122]: /embed/inline-graphic-111.gif [123]: /embed/inline-graphic-112.gif [124]: /embed/inline-graphic-113.gif [125]: /embed/inline-graphic-114.gif [126]: /embed/inline-graphic-115.gif [127]: /embed/inline-graphic-116.gif [128]: /embed/inline-graphic-117.gif [129]: /embed/inline-graphic-118.gif [130]: /embed/inline-graphic-119.gif [131]: /embed/inline-graphic-120.gif [132]: /embed/inline-graphic-121.gif [133]: /embed/inline-graphic-122.gif [134]: /embed/inline-graphic-123.gif [135]: /embed/inline-graphic-124.gif [136]: /embed/inline-graphic-125.gif [137]: /embed/inline-graphic-126.gif [138]: /embed/inline-graphic-127.gif [139]: /embed/inline-graphic-128.gif [140]: /embed/inline-graphic-129.gif [141]: /embed/inline-graphic-130.gif [142]: /embed/inline-graphic-131.gif [143]: /embed/inline-graphic-132.gif [144]: /embed/inline-graphic-133.gif [145]: /embed/inline-graphic-134.gif [146]: /embed/inline-graphic-135.gif [147]: /embed/inline-graphic-136.gif [148]: /embed/inline-graphic-137.gif [149]: /embed/inline-graphic-138.gif [150]: F13/embed/inline-graphic-139.gif [151]: /embed/graphic-30.gif [152]: /embed/inline-graphic-140.gif [153]: /embed/graphic-31.gif [154]: /embed/graphic-32.gif [155]: /embed/inline-graphic-141.gif [156]: /embed/graphic-33.gif [157]: /embed/inline-graphic-142.gif [158]: /embed/inline-graphic-143.gif [159]: /embed/inline-graphic-144.gif [160]: /embed/inline-graphic-145.gif [161]: /embed/inline-graphic-146.gif [162]: /embed/inline-graphic-147.gif [163]: /embed/inline-graphic-148.gif [164]: /embed/inline-graphic-149.gif [165]: /embed/inline-graphic-150.gif [166]: /embed/inline-graphic-151.gif [167]: /embed/inline-graphic-152.gif [168]: /embed/inline-graphic-153.gif [169]: /embed/inline-graphic-154.gif [170]: /embed/inline-graphic-155.gif [171]: /embed/inline-graphic-156.gif [172]: /embed/inline-graphic-157.gif [173]: /embed/inline-graphic-158.gif [174]: /embed/inline-graphic-159.gif [175]: /embed/inline-graphic-160.gif [176]: /embed/inline-graphic-161.gif [177]: /embed/inline-graphic-162.gif [178]: /embed/inline-graphic-163.gif [179]: /embed/inline-graphic-164.gif [180]: /embed/inline-graphic-165.gif [181]: /embed/inline-graphic-166.gif [182]: /embed/inline-graphic-167.gif [183]: /embed/inline-graphic-168.gif [184]: /embed/inline-graphic-169.gif [185]: /embed/inline-graphic-170.gif [186]: /embed/inline-graphic-171.gif [187]: /embed/inline-graphic-172.gif [188]: /embed/inline-graphic-173.gif [189]: /embed/inline-graphic-174.gif [190]: /embed/inline-graphic-175.gif [191]: /embed/inline-graphic-176.gif [192]: /embed/inline-graphic-177.gif [193]: /embed/inline-graphic-178.gif [194]: /embed/inline-graphic-179.gif [195]: /embed/inline-graphic-180.gif [196]: /embed/inline-graphic-181.gif [197]: /embed/inline-graphic-182.gif [198]: /embed/inline-graphic-183.gif [199]: /embed/inline-graphic-184.gif [200]: /embed/inline-graphic-185.gif [201]: /embed/inline-graphic-186.gif [202]: /embed/inline-graphic-187.gif [203]: /embed/inline-graphic-188.gif [204]: /embed/inline-graphic-189.gif [205]: /embed/inline-graphic-190.gif [206]: /embed/inline-graphic-191.gif [207]: /embed/inline-graphic-192.gif [208]: /embed/inline-graphic-193.gif [209]: /embed/inline-graphic-194.gif [210]: /embed/inline-graphic-195.gif [211]: /embed/inline-graphic-196.gif [212]: /embed/inline-graphic-197.gif [213]: /embed/inline-graphic-198.gif [214]: /embed/inline-graphic-199.gif [215]: /embed/inline-graphic-200.gif [216]: /embed/inline-graphic-201.gif [217]: /embed/inline-graphic-202.gif [218]: /embed/inline-graphic-203.gif [219]: /embed/inline-graphic-204.gif [220]: /embed/inline-graphic-205.gif [221]: /embed/inline-graphic-206.gif [222]: /embed/inline-graphic-207.gif [223]: /embed/inline-graphic-208.gif [224]: /embed/inline-graphic-209.gif [225]: /embed/inline-graphic-210.gif [226]: /embed/inline-graphic-211.gif [227]: /embed/inline-graphic-212.gif [228]: /embed/inline-graphic-213.gif [229]: /embed/inline-graphic-214.gif [230]: /embed/inline-graphic-215.gif [231]: /embed/inline-graphic-216.gif [232]: /embed/inline-graphic-217.gif [233]: /embed/inline-graphic-218.gif [234]: /embed/inline-graphic-219.gif [235]: /embed/inline-graphic-220.gif [236]: /embed/inline-graphic-221.gif [237]: /embed/inline-graphic-222.gif [238]: /embed/inline-graphic-223.gif [239]: /embed/inline-graphic-224.gif [240]: /embed/inline-graphic-225.gif [241]: /embed/inline-graphic-226.gif [242]: /embed/inline-graphic-227.gif [243]: /embed/inline-graphic-228.gif [244]: /embed/inline-graphic-229.gif [245]: /embed/inline-graphic-230.gif [246]: /embed/inline-graphic-231.gif [247]: /embed/inline-graphic-232.gif [248]: /embed/inline-graphic-233.gif [249]: /embed/inline-graphic-234.gif [250]: /embed/inline-graphic-235.gif [251]: /embed/inline-graphic-236.gif [252]: /embed/inline-graphic-237.gif [253]: /embed/inline-graphic-238.gif [254]: /embed/inline-graphic-239.gif [255]: /embed/inline-graphic-240.gif [256]: /embed/inline-graphic-241.gif [257]: /embed/inline-graphic-242.gif [258]: /embed/inline-graphic-243.gif [259]: /embed/inline-graphic-244.gif [260]: /embed/inline-graphic-245.gif [261]: /embed/inline-graphic-246.gif [262]: /embed/inline-graphic-247.gif [263]: /embed/inline-graphic-248.gif [264]: /embed/inline-graphic-249.gif [265]: /embed/inline-graphic-250.gif [266]: /embed/inline-graphic-251.gif [267]: /embed/inline-graphic-252.gif [268]: /embed/inline-graphic-253.gif [269]: /embed/inline-graphic-254.gif [270]: /embed/inline-graphic-255.gif [271]: /embed/inline-graphic-256.gif [272]: /embed/inline-graphic-257.gif [273]: /embed/inline-graphic-258.gif [274]: /embed/inline-graphic-259.gif [275]: /embed/inline-graphic-260.gif [276]: /embed/inline-graphic-261.gif [277]: /embed/inline-graphic-262.gif [278]: /embed/inline-graphic-263.gif [279]: /embed/inline-graphic-264.gif [280]: /embed/inline-graphic-265.gif [281]: /embed/inline-graphic-266.gif [282]: /embed/inline-graphic-267.gif [283]: /embed/inline-graphic-268.gif [284]: /embed/inline-graphic-269.gif [285]: /embed/inline-graphic-270.gif [286]: /embed/inline-graphic-271.gif [287]: /embed/inline-graphic-272.gif [288]: /embed/inline-graphic-273.gif [289]: /embed/inline-graphic-274.gif