Abstract
Identifying risk protein targets and their therapeutic drugs is crucial for effective cancer prevention. Here, we conduct integrative and fine-mapping analyses of large genome-wide association studies data for breast, colorectal, lung, ovarian, pancreatic, and prostate cancers, and characterize 710 lead variants independently associated with cancer risk. Through mapping protein quantitative trait loci (pQTL) for these variants using plasma proteomics data from over 75,000 participants, we identify 365 proteins associated with cancer risk. Subsequent colocalization analysis identifies 101 proteins, including 74 not reported in previous studies. We further characterize 36 potential druggable proteins for cancers or other disease indications. Analyzing >3.5 million electronic health records, we uncover five drugs (Haloperidol, Trazodone, Tranexamic Acid, Haloperidol, and Captopril) associated with increased cancer risk and two drugs (Caffeine and Acetazolamide) linked to reduced colorectal cancer risk. This study offers novel insights into therapeutic drugs targeting risk proteins for cancer prevention and intervention.
Introduction
Human genetic research has not only advanced our understanding of disease mechanisms but has also significantly contributed to drug discovery and development. Drugs supported by genetic evidence exhibit enhanced therapeutic validity compared to those lacking such support, highlighting the importance of incorporating genetic evidence in drug development initiatives1,2. Common risk variants implicated in diseases can dysregulate nearby gene or protein expression, which can mimic the effects of therapeutic drugs on the targetable proteins. These proteins could serve as potential targets for therapeutic intervention3. Thus, concerted efforts for cancer prevention based on proteins influenced by common polymorphisms that modulate cancer risk, are urgently needed4. To date, genome-wide association studies (GWAS) have identified several hundred common genetic risk loci for each of three prevalent cancer types: breast, colorectal, and prostate5–8, and several dozen risk loci have been identified for other cancers, such as cancer of lung, pancreas, and ovarian9–13. Previous research, including our work, has identified hundreds of putative cancer susceptible genes potentially regulated by these risk variants, using methods such as expression quantitative trait loci (eQTL) analysis8–12,14–20 and transcriptome-wide association studies (TWAS)7,19,21–29. However, most dysregulated gene expression has not been thoroughly investigated at the protein level.
To deepen the understanding of causal mechanisms and enhance drug discovery endeavors, it is imperative to explore data from transcriptomic to proteomic studies. Proteins, the ultimate products of mRNA translation, play critical roles in cellular activities and represent promising therapeutic targets, as evidenced by successful drug targeting of enzymes, transporters, ion channels, and receptors30. Recent studies include protein quantitative trait loci (pQTL) mapping and Mendelian randomization (MR) analysis by integrating cancer GWAS and blood proteomics data to identify potential risk proteins. However, only a few dozen of cancer risk proteins have been reported, with a false discovery rate < 0.0531–36. Most reported proteins have not been directly linked to the GWAS-identified risk variants in common cancer types. Furthermore, research is lacking in integrating multiple population-scale proteomic studies like the recent emerging UK Biobank Pharma Proteomics Project (UKB-PPP)37, which offers an unprecedented opportunity to establish extensive pQTL databases, accelerating therapeutic drug discovery for therapeutic prevention and intervention in human cancers.
Traditional drug discovery faces numerous challenges, including escalating costs, lengthy timelines, and high failure rates38. Drug repurposing presents a promising strategy by identifying new applications for existing drugs, leveraging their well-documented characteristics39. With the widespread adoption of modern electronic health record (EHR) systems, vast amounts of real-world patient data are available to augment pre-clinical outcomes and facilitate drug repurposing screening. Recently, drug repurposing using EHRs has successfully discovered repurposing hypotheses for preventing Alzheimer’s Disease40, reducing cancer mortality41,42, treating COVID-1943,44, and coronary artery disease45. However, for therapeutic drugs that have been used for a long term to treat disease indications with evidence of affecting the expression of cancer risk proteins, their potential association with the risk of human cancers remains largely unclear. Some of these drugs may be linked to an increased cancer risk due to long-neglected side effects.
In this work, we integrate large GWAS data for breast, colorectal, lung, ovarian, pancreatic, and prostate cancers and population-scale proteomics data from over 75,000 participants combined from Atherosclerosis Risk in Communities study (ARIC)46, deCODE genetics47, and UKB-PPP to identify risk proteins associated with each cancer. We further characterized therapeutic drugs based on druggable risk proteins targeted by approved drugs or undergoing clinical trials for cancer treatment or other indications. We further evaluate the effect of cancer risk for those drugs approved for the indications, using over 3.5 million EHR database at Vanderbilt University Medical Center (VUMC). Findings from this study offer novel insights into therapeutic drugs targeting risk proteins for cancer prevention and intervention.
Results
Overall analysis workflow
In Figure 1, we outlined several main steps of a comprehensive integrative analysis of GWAS, pQTLs, druggable proteins, and EHR data. First, we examined previously identified risk loci from six cancer types based on the most recent GWAS in breast (N = 247,173), ovary (N = 63,347), prostate (N = 140,306), colorectum (N = 254,791), lung (N = 85,716), and pancreas (N = 21,536). Through additional fine-mapping analysis using SuSiE48, we characterized the most significantly associated variants (the lead variants) with independent association signals at each risk locus for each cancer (Fig. 1a; Online Methods). Second, we analyzed cis-pQTL results for the lead variants using proteomics data from individuals of European descent from ARIC46, deCODE47, and UKB-PPP37. We conducted fixed-effect meta-analyses of summary statistics cis-pQTLs from ARIC46 and deCODE47 through the same SOMAscan® platform (covering > 4,500 proteins). We combined them with the pQTL results from UKB-PPP through the Olink platform to identify potential risk proteins. For proteins that satisfied the significance threshold after multiple testing corrections, we further performed colocalization analyses to determine cancer risk proteins with high confidence through evaluating the likelihood of shared causal variants between pQTLs and GWAS (Fig. 1a). Third, these risk proteins with evidence of colocalization were further annotated based on drug-protein information from four drugs/compounds databases (DrugBank49, ChEMBL50, the Therapeutic Target Database51 (TTD) and OpenTargets52 (Fig. 1b). We next identified druggable proteins that are therapeutic targets of approved drugs or undergoing clinical trials for cancer treatment or other indications. Finally, we focused on drugs approved for other indications. We built emulation of treated-control drug trials under the Inverse Probability of Treatment weighting (IPTW) framework53 through the analysis of over 3.5 million EHRs at VUMC. In these emulations, we used the Cox proportional hazard model for each trial to evaluate the hazard ratio (HR) of the specific cancer risk between the treated focal drug and the control drug. The significance of each focal drug (HR and P value) was derived from a random-effects meta-analysis of results across its balanced trials (Fig. 1c, Online Methods).
Characterizing lead variants for breast, ovarian, prostate, colorectal, lung, and pancreas cancers
To characterize lead variants at each locus for each cancer type, we collected the reported risk variants from previous fine-mapping or GWAS. Using breast cancer as an example, we included 196 lead variants with independent association signals at loci from a previous fine-mapping study based on conditional association analysis54 and additional 32 genetic variants identified from a recent GWAS6. We then performed additional fine-mapping analysis using SuSiE48 based on summary statistics of GWAS (N = 247,173) from the Breast Cancer Association Consortium (BCAC, Supplementary Table 1). After integrating the previous results with new fine-mapping efforts, we identified 227 lead variants with independently associated with cancer risk at each locus through several processing steps, which included lead variant selection (P < 1 × 10-6 in European populations) and evaluating a linkage disequilibrium (LD) (r2 < 0.1) among the identified risk variants (Extended Data Fig. 1; Online Methods). Similarly, we characterized lead variants from previous GWAS and our fine-mapping studies for colorectal55 and other cancers (Extended Data Fig. 1; Supplementary Table 1; Online Methods). In our analysis, we identified 710 lead variants, including 227 for breast cancer, 213 for colorectal cancer, 213 for prostate cancer, 26 for lung cancer, 13 for ovarian cancer and 18 for pancreatic cancer (Fig. 2 and Supplementary Tables 2 and 3; Online Methods).
Identifying cancer risk proteins from pQTLs mapping and colocalization analyses
We mapped the 710 lead variants to cis-pQTLs to identify cancer risk proteins. At a Bonferroni-corrected P < 0.05, we identified a total of 459 pQTL association signals (corresponding 365 proteins after combined proteins unique for each cancer) for 222 lead variants across six cancer types, including 74 for breast, 127 for colorectal, 37 for lung, 5 for ovarian, 9 for pancreatic, and 113 for prostate cancer (Fig. 2; Supplementary Table 4). Notably, 312 of the identified proteins (85.4% of 365) among these cancer types have not been reported in previous proteomics-based MR studies32–36,56,57 (Supplementary Table 5). Furthermore, through analysis of the identified proteins commonly observed in multiple cancers, we found that 60 proteins were commonly observed in at least two of these six cancers. In particular, we observed that several well-known cancer-related proteins, such as HLA-A and HLA-E, were linked to lead variants located in major histocompatibility complex (MHC) in breast, colorectal and lung cancers, highlighting the potential role of these proteins in cancer pleiotropy and shared cancer risk mechanisms (Fig. 2).
A further colocalization analysis identified 101 proteins after combined proteins unique for each cancer that showed strong evidence supported by either colocalization or SMR+HEIDI analysis (Online Methods). Specifically, we identified 23 proteins for breast, 38 proteins for colorectal cancer, 7 proteins for lung, 2 for ovarian, 2 for pancreatic and 29 for prostate cancer, respectively (Fig. 3a-b, Supplementary Table 6). Of these, 74 proteins (73.2% of 101) have not been previously linked to cancer risk (Supplementary Table 7). Of note, 71 proteins were only assayed by either SOMAscan® (n=32) or Olink platform (n=39). For the remaining 22 significant proteins commonly assayed, all showed a pQTL significance signal with a minimal nominal P < 1 × 10-5 in both ARIC+deCODE and UKB-PPP (r = 0.66, P = 2×10-4; Fig. 3c). In particular, seven proteins were highlighted as cancer-driver proteins58,59 and Cancer Gene Census (CGC)60, including ALDH2, HLA-A and SUB1 for breast cancer, ALDH2 and HLA-A for colorectal cancer, NT5C2 for lung cancer, and NT5C2, RNF43, TYRO3 and USP28 for prostate cancer.
Cancer risk proteins supported by functional genomics analyses
Of the identified 101 proteins among the six cancers, we next examined whether they are supported by functional genomics analyses. Specifically, we first evaluated xQTL (i.e., eQTLs, alternative splicing - sQTLs, and alternative polyadenylation - apaQTLs) results in their respective target tissues and whole blood samples (Online Methods). We found 63 proteins that were supported by at least one xQTLs at a nominal P < 0.05, including12 for breast (52% of 23), 22 for colorectal (57% of 38), 5 for lung (71% of 7), 2 for ovarian, 2 for pancreatic, and 20 (68% of 29) for prostate cancer (Supplementary Table 7). Second, we used functional genomic data generated in their cancer-related tissues/cells (i.e., promoter and enhancer) to characterize putative functional variants that are in strong LD (r2 > 0.8 in the European population) with the lead variants (Online Methods) Our results showed that 17 genes were likely regulated by the closest putative regulatory variants with either promoter and/or enhancer activities (Supplementary Table 8). We further investigated the potential distal regulatory effects of putative functional variants on these genes by analyzing chromatin-chromatin interaction data (Online Methods). We found that 39 genes were regulated distally by putative functional variants through long-term promoter-enhancer interactions (Supplementary Table 9). Lastly, we examined differential protein expression between normal and tumor tissues available for breast, colon, lung and pancreatic cancers using data from Clinical Proteomic Tumor Analysis Consortium (CPTAC). We showed evidence of the 18 identified proteins with consistent association directions supported by significantly differential expression at a nominal P < 0.05, including 3 for breast cancer, 10 for colorectal cancer, 4 for lung cancer and 1 for pancreatic cancer. Similarly, we showed evidence of the 40 identified proteins supported by significantly differential mRNA expression using data from The Cancer Genome Atlas Program (TCGA), including 6 for breast cancer, 15 for colorectal cancer, 3 for lung cancer, 1 for pancreatic cancer and 15 for prostate cancer. Taken together, our analysis provided additional evidence that most of the identified proteins partially or wholly supported by functional genomics analyses (Supplementary Table 10).
Identifying druggable proteins
Using data from DrugBank49, ChEMBL50, the Therapeutic Target Database51 (TTD) and OpenTargets52, we comprehensively annotated our proteins as therapeutic targets of approved or clinical-stage drugs (Online Methods). Of the 101 proteins among the six cancers, we identified 36 druggable proteins potentially targeted by 404 approved drugs or undergoing clinical trials for cancer treatment or other indications (Fig. 4, Supplementary Table 11). Specifically, we found 19 proteins targeted by 133 drugs either approved or under clinical trials to treat cancers (Fig. 5, Supplementary Table 12). Our results also provide evidence that the remaining draggable proteins are targeted by 197 drugs used for treating indications other than cancer (Extended Data Fig. 3).
Evaluating associations of drugs approved for indications with cancer risk
We next evaluated the effect on cancer risk of therapeutic drugs that have been used long-term to treat indications based on real-world EHRs from the VUMC Synthetic Derivative (SD) database. Given a focal drug, we first emulated its trials by building control patient groups who were exposed to similar treated drugs under the same ATC-L2 category (Online Methods; Supplementary Table 13). To mimic randomized controlled trials (RCT) to evaluate the focal drug’s effect, we applied the Inverse Probability of Treatment Weighting (IPTW) framework61 to create a pseudo-population wherein confounding variables are evenly distributed between the treated and control groups (Online Methods). After discarding the trials with less than 500 eligible patients in either of the groups, we analyzed 14 treated drugs with 335 balanced trials. Our analysis revealed that five drugs were linked to an increased risk of cancer: Haloperidol (HR = 1.76; P = 1.6 × 10-23; targeting HLA-A protein) and Trazodone (HR = 1.32; P = 2.3 × 10-12; targeting HLA-A protein) for breast cancer; Tranexamic Acid (HR = 1.53; P = 1.1 × 10-3; targeting PLG protein) and Sirolimus (HR = 1.71; P = 1.1 × 10-28; targeting TYRO3 protein) for prostate cancer; and Haloperidol (HR = 2.62, P = 6.6 × 10-20, targeting HLA-A protein) and Captopril (HR = 1.65; P = 2.2 × 10-9; targeting TF protein) for colorectal cancer (Fig. 6). In contrast, we also found that two drugs associated with a decreased risk of colorectal cancer: Caffeine (HR = 0.74, P = 9.3 × 10-5, targeting ALDH2 protein) and Acetazolamide (HR = 0.72; P = 1.1 × 10-20; targeting HLA-A protein) (Fig. 6).
Discussion
In this study, we conducted a comprehensive investigation of cancer risk proteins by integrating lead variants and pQTLs for six common cancer types using large-scale GWAS and population-based proteomics data. Through pQTL mapping and subsequent colocalization analysis, we identified 101 risk proteins across the six cancer types, with over three-quarters of them not previously linked to cancer susceptibility. Moreover, most of the proteins we identified are supported by functional genomics analyses. Our findings not only significantly expand the pool of known cancer risk proteins but also offer new insights into the biology and susceptibility of common cancers.
Through analysis of drug-protein interaction databases, we identified 36 druggable proteins potentially targeted by 404 therapeutic drugs. Among these, 30 drugs have already received approval for cancer treatment, while 73 are currently undergoing clinical trials for cancer treatment. These findings offer genetic evidence supporting the effectiveness of certain drugs and suggest potential opportunities for repurposing them to treat additional cancers that share common risk proteins. However, it’s crucial to acknowledge that while the cancer risk proteins identified in our study hold promise as therapeutic targets for cancer treatment, drugs may also have adverse effects, potentially exacerbating cancer development through these targets (i.e., depending on their inhibitory or promotive effects)62. Additionally, our analysis characterized 197 drugs used for indications other than cancer, which may influence cancer risk due to their interactions with cancer-risk proteins. Overall, our findings have the potential to accelerate therapeutic drug discovery for the prevention and intervention of human cancers.
We uncovered five non-cancer drugs associated with increased cancer risk: Tranexamic Acid (PLG), Sirolimus (TYRO3), Haloperidol (HLA-A), Trazodone (HLA-A), and Captopril (TF). Our genetic evidence or previous pharmacological studies further support these findings. Specifically, Tranexamic Acid, an antifibrinolytic agent used to block the breakdown of blood clots and prevent bleeding, was associated with an increased risk of prostate cancer. Our findings suggest its potential inhibition of the protein expression of human plasminogen (PLG), based on data from the ChEMBL database. Our GWAS and pQTL results indicate that PLG may serve as a potential tumor suppressor, supported by evidence of the risk allele C of rs9347480 being associated with increased prostate cancer risk (P = 1.41 × 10-07) and decreased protein expression (P = 4.57 × 10-33). Additionally, this protein also shows notable evidence of decreased gene expression (P = 0.038) in prostate tumor samples compared to normal samples, as observed in data from TCGA. The drug Sirolimus, primarily used to treat immune system and eye diseases, can potentially affect receptor tyrosine kinases, TYRO3, a known protein important for prostate cancer development63,64. In breast and colorectal cancers, we identified two candidate drugs (Haloperidol and Trazodone) targeting the major histocompatibility complex, HLA-A, an essential protein for the immune system’s defense against cancer development. Interestingly, our analysis suggests that Haloperidol, a type of antipsychotic treatment, is highly likely to increase the risk of both breast and colon cancer. Haloperidol, the first-generation antipsychotics, has been reported to be a carcinogenic compound65 and its exposure of five years or more was associated with an increased risk of breast cancer in a Finland nationwide study62. Consistently, another prior study showed its notably increased risks of colorectal cancer in patients with schizophrenia who take antipsychotic medications66. We also found that Captopril, originally for cardiovascular diseases and promisingly repurposed for cancer treatment in clinical trials and several studies67–70, has the potential to increase the risk of colorectal cancer, aligning with a previous study71.
Conversely, our study also identified two non-cancer drugs (Caffeine and Acetazolamide) associated with a reduced risk of colorectal cancer. Acetazolamide, prioritized by the risk protein named TF, exhibited a notable effect in preventing colorectal cancer development (HR = 0.72, P = 1.1 × 10-20). In line with our findings, prior studies demonstrated its role in inhibiting cell viability, migration, and colony formation ability of colorectal cancer cells72, as well as its ability to suppress the development of intestinal polyps in Min Mice73. In addition, Caffeine, a drug prioritized by the risk protein ALDH2 in colorectal cancer, has been shown to exert a protective effect on colorectal cancer by prior studies74,75. However, such low-risk association may vary by colon subsites76 and specific populations77. Of note, a clinical trial (NCT05692024) is undergoing the recruitment phase to evaluate the effects of instant coffee on the gut microbiome, metabolome, liver fat, and fibrosis in colorectal cancer patients.
Although the larger sample size of the European-ancestry study available for both GWAS and proteomics enabled us to identify a larger number of association signals for risk protein discovery based on colocalization analysis, our study was primarily limited to individuals of European ancestry and further investigations are needed to assess the relevance of these proteins in non-European populations. Millions of EHRs provide an unprecedented opportunity to systematically evaluate non-cancer drugs’ effect on risk of cancer development. Especially these therapeutic drugs have been used for a long time to treat diseases other than cancers, which can provide appropriate statistical power for the analysis. While this approach is limited in only examining the cancer risk of common approved drugs, it serves as an efficient complementary method to the pre-clinical data analysis for cancer prevention and treatment. Despite the supportive evidence of Acetazolamide in vitro and in vivo, it remains necessary to evaluate the effects of our reported candidate drugs through both in vitro and in vivo assays in future investigations.
Online Methods
Data resources
The GWAS summary statistics data of European descendants for breast, prostate, ovarian, and lung cancers were downloaded and compiled from their corresponding consortia, including the Breast Cancer Association Consortium (BCAC)6 (N = 247,173, 133,384 cases and 113,789 controls), the Transdisciplinary Research of Cancer in Lung of the International Lung Cancer Consortium (TRICL-ILCCO) and the Lung Cancer Cohort Consortium (LC3)13 (N = 85,716, 29,266 cases and 56,450 controls), the Ovary Cancer Association Consortium (OCAC)11 (N = 63,347, 22,406 cases and 40,941 controls), and the Pancreatic Cancer Case-Control Consortium (PanC4)10 (N = 21,536, 9,040 cases and 12,496 controls), and the Prostate Cancer Association Group Investigate Cancer Associated Alterations in the Genome (PRACTICAL)78 (N = 140,306, 79,194 cases and 61,112 controls). For colorectal cancer, we included GWAS data of 125,487 subjects from the European population.19,79,80 In addition, the GWAS data (N = 254,791) consisting of 100,204 colorectal cancer cases and 154,587 controls from European and Asian populations7 were also used in our analysis.
The large-scale cis-protein quantitative trait loci (cis-pQTLs) among European-ancestry populations were analyzed based on three proteomics datasets: UKB-PPP37 (N = 34,557, 2,922 plasma proteins), ARIC44 (N = 7,213, 4,657 plasma proteins) and deCODE genetics45 (N = 35,559, 4,907 plasma proteins). Detailed descriptions of sample collection and processes of the cis-pQTL analyses from the above proteomics datasets have been described in previous studies37,46,47.
We utilized the synthetic derivative (SD) database at Vanderbilt University Medical Center (VUMC)81. This VUMC SD database contains de-identified clinical information derived from Vanderbilt’s electronic medical record The SD has longitudinal clinical data for over 3.5 million individuals, including patient demographics, medical history, laboratory results, and medication history.
Characterization of lead variants in six types of cancer
In breast cancer, we included 196 strong independent association signals at P < 1 × 10-6 from a fine-mapping study54 and 32 risk variants from a GWAS6. We first combined the reported lead variants from these two studies after removing those variants in LD (r2 < 0.1 in European populations). We further included additional lead variants from SuSiE fine-mapping analysis on GWAS (N = 247,173)48, with fine-mapping windows of 500 kilobases (kb) and allowed a maximum of five causal variants. LD reference was based on the British-ancestry UK Biobank samples (N = 337,000)82. We identified a credible set of causal variants with a 95% posterior inclusion probability (95% PIP) for each independent risk signal and a lead variant was represented by the variant with the minimum P. We included additional lead variants from our SuSiE analysis with LD r2 < 0.1 in European populations with the above set of lead variants for those with independent risk-associated signals at GWAS P < 5 × 10-8 and located in GWAS loci with independent risk-associated signals at P < 1 × 10-6 in European populations.
For colorectal cancer, we analyzed 238 lead variants from our recent fine-mapping study55 based on the GWAS data from 254,791 participants in both European and Asian populations. We characterized 233 lead variants with independent risk-associated signals at minimal P < 1 × 10-6 in European populations, from the analysis based on GWAS from trans-ancestry and European populations, respectively. For prostate cancer, we first identified lead variants with independent risk-associated signals at GWAS P < 5 × 10-8 from our SuSiE fine-mapping analysis on GWAS summary statistics (N = 140,306). We next included additional GWAS-identified risk variants with P < 1 × 10-6 in European populations from the previous trans-ancestry GWAS8 and r2 < 0.1 with any lead variants from the above set in the fine-mapping analysis. Similarly, we used the above strategy to characterize lead variants from fine-mapping analysis for ovarian cancer (N = 63,347) and pancreatic cancer (N = 21,536). We next included additional risk variants that are missed in the above set of lead variants from previous GWAS for ovarian11 and pancreatic cancer10. For lung cancer, we included 26 risk variants at P < 1 × 10-6 in European populations from the trans-ancestry GWAS9.
Identification of putative target proteins for lead variants
To identify potential cancer risk proteins, we mapped GWAS lead variants to cis-pQTLs (+/- 500Kb region of a gene) results from three studies among European populations: UK Biobank Pharma Proteomics Project37, Atherosclerosis Risk in Communities study44 and deCODE genetics45. To increase the power of pQTLs, we combined cis-pQTLs from the ARIC and the deCODE (both assayed through SOMAscan® platform) via a a fixed-effects meta-analysis using META83. Cis-pQTLs from the UKB-PPP (assayed through Olink platform) were independently analyzed. In few cases where the lead variant did not overlap with any cis-pQTLs, we substituted it with the correlated variant exhibiting the strongest association signal. Putative cancer risk protein was defined based on pQTL significance at a Bonferroni threshold of P < 0.05 (nominal P = 2.3 × 10-5, corresponding to 2,164 variant-protein tests for UKB-PPP; nominal P = 3.8 × 10-5, corresponding to 1,322 variant-protein tests for ARIC+deCODE).
Colocalization analyses between pQTL and GWAS signals
To identify cancer risk proteins, we conducted colocalization analysis using two approaches: Bayesian method coloc 84 and summary data-based Mendelian Randomization (SMR)85. For the SMR approach, a followed HEIDI test is performed on significant SMR results to determine if the colocalized signals can be explained by one single causal variant or by multiple causal variants in the locus. For each protein, SNPs with P < 0.5 from GWAS, MAF > 0.01, and within 50 kb of the lead variant were included. To estimate the posterior probability (PP) of colocalization, we utilized the default priors and coloc.abf function. In our study, we particularly focused on the assumption that one genetic variant is simultaneously associated with both two traits, which was quantified by PP.H4. We considered a protein to host one shared causal variant from GWAS and pQTLs if its coloc PP.H4 > 0.5. Additionally, we also performed SMR+HEIDI analysis for significant cis-pQTL with default parameter settings. Specifically, significant SMR+HEIDI results were defined as a tested locus with Bonferroni-adjusted SMR P < 0.05 and HEIDI P ≥ 0.05 (no obvious evidence of heterogeneity of estimated effects or linkage). The above analyses were only conducted in European populations available for both GWAS and proteomics in European populations.
Functional genomic analyses
For our identified cancer risk proteins, we examined their xQTLs, including eQTLs, sQTLs, and apaQTL using the resource from the GTEx (version 8). We collected eQTLs and sQTLs from six normal tissues and whole blood from GTEx studies, and we collected apaQTLs from Li’s work86. A nominal P value < 0.05 for at least one xQTL in either tissue or blood samples was considered supportive of the pQTL results.
We identified putative regulatory variants in strong linkage disequilibrium (LD) (r2 > 0.8 in European population) for lead variants with significant colocalization between GWAS and cis-pQTL signals. Using the HaploReg tool87, we annotated these variants with a variety of epigenetic annotations, including regulatory chromatin states based on DNAse and histone ChIP-Seq from Roadmap Epigenomics Project, histone marks for promoter and enhancer, binding sites of transcription factors, and gene annotation from the GENCODE and RefSeq. We denoted variants as “Proximal” if they overlapped with these functional annotations near the closest target gene. We analyzed a variety of chromatin-chromatin interaction data, from 4D genome88, FANTOM589, EnhancerAtlas90, and super-enhancer91. We examined the overlap between putative regulatory variants and enhancer elements in corresponding cell lines or tissues of these six cancer types. We further determined enhancer-promoter loops after combining these data with ChIP-seq data of the histone modification H3K27ac (an active enhancer mark). We focused on interacted loops in which a fragment overlapped an H3K27ac peak (enhancer-like elements). In contrast, the other fragment overlapped the promoter of a gene (defined as a region of upstream 2kb and downstream 100bp around transcript start site). We denoted variants as “Distal” if they overlapped with these chromatin-chromatin variants.
For our identified cancer risk proteins, we assessed the statistical significance of their differential protein expression between tumor and normal tissue in breast, colorectal, lung, ovarian, and pancreatic cancer samples using data from CPTAC, accessed through the UALCAN website92,93. Similarly, we analyzed their differential gene expression between tumor and normal tissue using data from TCGA, also through the UALCAN website.
Inclusion of patients for a focal drug and its control drugs
To evaluate the impact of a focal drug on cancer development, we conducted comparisons between its effects and those of its control drugs. To minimize potential confounding factors associated with a drug prescription, we selected control drugs that belong to the same second-level Anatomical Therapeutic Chemical classification category (ATC-L2) as the focal drug. We formulated emulation trials, each containing one treated patient group (taking the focal drug) and one control patient group (taking the control drug). One focal drug may have multiple trials, depending on the number of potential control drugs belonging to the same ATC-L2 category. Next, we enrolled patients for the treated group and the control group from the VUMC SD based on the following criteria: 1) patients aged ≥ 40 at the time of the latest EHRs or the initial diagnosis of cancer; 2) availability of at least one year of EHRs before the first prescription of the treated/control drug (index date); 3) for cancer patients, a minimum of two exposures to the treated or control drug during the follow-up period (from the index date to the three months before cancer diagnosis); 4) for non-cancer individuals, a minimum of two exposures to control drugs during the follow-up period (from the index date to the date of the latest EHRs). Finally, we precluded patients who were prescribed both treated and control drugs and discarded the trials with less than 500 eligible patients in either patient group40.
Emulation of treated-control drugs balanced trials
In the IPTW framework61, individuals are assigned weights based on the inverse of their propensity scores (PS), which represent their probability of being exposed to risk factors or a specific intervention, such as a treated drug, based on their baseline characteristics. In this study, we followed Zang’s work40 and trained a logistic regression propensity score (LR-PS) model with L1 or L2 regularization on patients’ treatment assignments Z and covariates, including age, gender, comorbidities, etc. (Supplementary Table 14). We trained and selected the logistic model (Eq.1) with the highest area under curve (AUC) using a 10-folder cross-validation. We used the selected model to calculate all patient’s stabilized weights (Eq. 2). These weights are used to calculate the standardized mean difference (SMD, Eq.3) of the covariate’s prevalence in treated and control groups. A covariate d is defined as unbalanced if SMD(d) > 0.1 in IPTW framework (Eqs. 3, 4). A trial is balanced if it contains ≤10% unbalanced covariates (Eq. 5).
The logistic regression is defined as follows: where Z refers to treatment assignment (1 for treated patient group and 0 for control patient group) and X(X1, X2, …, Xn) for baseline covariates. The propensity score is defined as P(Z = 1| X) and the stabilized IPTW of each individual is calculated as follows:
Standardized mean difference is calculated as following:
, representing vectors of D number of covariates of treated group and control group respectively; μtreat, μcontrol are their sample means, and are their sample variances. In IPTW framework, the weighted sample mean μw and sample variance are calculated as following:
Number of unbalanced covariates are calculated as following:
where D is the total number of covariates in the model, d is one covariate
Logistic regression propensity score (LR-PS) hyperparameter selection and model training
To select the optimal regulation penalty weight (λ), we applied 10-fold cross-validation on a list of lambda elements λ ∈ [0.005, 0.01, 0.05, 0.1, 0.5]. Specifically, the logistic model was trained on 9 training folders, and learnable parameters (β) were estimated through minimizing the binary cross-entropy loss with L1 (Eq. 6) or L2 penalty (Eq. 7). On the left-one out validation folder (k), we calculated SMDk (Eq. 3) values for D covariates based on individuals’ weights (Eq. 2) as well as the number unbalanced covariates nk (Eq. 5). In addition, we evaluated trained model’s prediction performance using area under curve (AUCk) on the validation dataset. The same processes were repeated 10 times. We defined the optimal hyperparameter value is the value generates the smallest averaged nk. For two hyperparameter values generate approximate nk, the one with larger averaged AUCk is the optimal. Finally, we trained LR-PS model’s learnable parameters (β) on all subjects with the optimal hyperparameter and leveraged this trained model to compute weights and the proportion of imbalanced covariates to pinpoint balanced trials.
Binary cross-entropy loss function with LASSO (L1) penalization:
Binary cross-entropy loss function with ridge (L2) penalization:
Calculation of overall hazard ratio for cancer risk drug on cancer development risk
We evaluate subjects’ hazard of developing/preventing cancer in balanced treated-control trials through survival analyses. We applied weighted Cox proportional hazard model94 using the lifelines 0.28.0 Python package to systematically evaluate the hazard ratio of developing cancer for patients taking treated drug vs patients taking control drugs (time-to-event). The time windows utilized in this study start from the earliest date in EHRs for prescription of the treated/control drug to patients and end at the date of the diagnosis of cancer (event) or the end of EHRs records (censored). We included unbalanced covariates (if exist) into Cox models. For a treated drug, its overall hazard ratio and p-value were obtained by applying a random effect meta-analysis on the hazard ratios from its eligible trials using the meta 7.0 R package (Supplementary Table 15). We reported that a treated drug has a significantly increasing or decreasing risk of cancer development, contrasting to its control drugs, if the overall hazard ratio has a P < 0.05 after Bonferroni correction (nominal P = 3.5 × 10-3 corresponding to 14 tests).
Data availability
Supplementary Table 1 provides the download information for the summary statistics of GWAS data for the six common cancers, including breast, ovary, prostate, colorectum, lung, and pancreas. Metadata and pQTL summary statistics from UKB-PPP can be downloaded from Synapse: Project SynID: syn51364943; pQTL from ARIC46 and deCODE genetics47 can be accessed through previous publications (PMID: 34857953 and PMID: 35501419). Functional genomic data includes: TCGA and CPTAC differential expression results accessible through https://ualcan.path.uab.edu/index.html; 4DGenome: https://4dgenome.research.chop.edu/; Depmap : https://depmap.org/portal/; FANTOM5: http://fantom.gsc.riken.jp/5/. HaploReg: https://pubs.broadinstitute.org/mammals/haploreg/. GTEx: https://gtexportal.org/home/. GENCODE (v26.GRCh38) was downloaded from https://www.gencodegenes.org/human/release_26.html. National Cancer Institute can be accessed through https://www.cancer.gov/about-cancer/treatment/drugs; CGC can be accessed accessed via COSMIC website: https://cancer.sanger.ac.uk/census. Drugs and compounds data can be downloaded from the following URLs: ChEMBL: https://www.ebi.ac.uk/chembl/; Therapeutic Target Database: https://db.idrblab.net/ttd/; Open Targets: https://www.opentargets.org/; DrugBank: https://go.drugbank.com/. The EHR data, containing de-identified clinical information, can be accessed through the VUMC SD database. Data is available through restricted access for approved studies and researchers who agree to specific conditions of use.
Code availability
The developed pipeline and main source R codes that are used in this work are available from the GitHub website of Xingyi Guo’s lab: https://github.com/XingyiGuo/PQTL_EHR/
Declaration of Interests
The authors declare no competing interests.
Acknowledgments
This work was supported by the US National Institutes of Health grant 1R37CA227130-01A1 and R01CA269589-01A1 to X.G. The data analyses were conducted using the Advanced Computing Center for Research and Education (ACCRE) at Vanderbilt University. New Frontiers in Research Fund (NFRFE-2018-00748) and NSERC Discovery Grant (RGPIN-2024-04679) to Q.L. The computational infrastructure was partly supported by a Canada Foundation for Innovation JELF grant (36605) to Q.L.
Footnotes
↵† Author names shared co-first authorship
References
- 1.↵
- 2.↵
- 3.↵
- 4.↵
- 5.↵
- 6.↵
- 7.↵
- 8.↵
- 9.↵
- 10.↵
- 11.↵
- 12.↵
- 13.↵
- 14.↵
- 15.
- 16.
- 17.
- 18.
- 19.↵
- 20.↵
- 21.↵
- 22.
- 23.
- 24.
- 25.
- 26.
- 27.
- 28.
- 29.↵
- 30.↵
- 31.↵
- 32.↵
- 33.
- 34.
- 35.
- 36.↵
- 37.↵
- 38.↵
- 39.↵
- 40.↵
- 41.↵
- 42.↵
- 43.↵
- 44.↵
- 45.↵
- 46.↵
- 47.↵
- 48.↵
- 49.↵
- 50.↵
- 51.↵
- 52.↵
- 53.↵
- 54.↵
- 55.↵
- 56.↵
- 57.↵
- 58.↵
- 59.↵
- 60.↵
- 61.↵
- 62.↵
- 63.↵
- 64.↵
- 65.↵
- 66.↵
- 67.↵
- 68.
- 69.
- 70.↵
- 71.↵
- 72.↵
- 73.↵
- 74.↵
- 75.↵
- 76.↵
- 77.↵
- 78.↵
- 79.↵
- 80.↵
- 81.↵
- 82.↵
- 83.↵
- 84.↵
- 85.↵
- 86.↵
- 87.↵
- 88.↵
- 89.↵
- 90.↵
- 91.↵
- 92.↵
- 93.↵
- 94.↵