Abstract
Incidence and severity of prostate cancer (PrCa) substantially varies across ancestries. American men of African ancestry (AA) are more likely to be diagnosed with and die from PrCa than the those of European ancestry (EA). Published polygenic risk scores for developing prostate cancer, even those based on multi-ancestry genome-wide association studies, do not address population-specific genetic mechanisms underlying PrCa risk in men of African ancestry. Specifically, the role of non-coding regulatory polymorphisms in driving inter-ancestry variation in PrCa has not been sufficiently explored. Here, by employing a sequence-based deep learning model of prostate regulatory enhancers, we identified ∼2,000 SNPs with higher alternate allele frequency in AA men that potentially affect enhancer function associated with PrCa susceptibility, as supported by our experimental validation. The identified enhancer SNPs (eSNPs) may influence PrCa development through two complementary mechanisms: 1) the alternate allele that increase enhancer activity result in immune suppression and telomere elongation, and 2) the alternate alleles that decrease enhancer activity, lead to de-differentiation and inhibition of apoptosis. Notably, the eSNPs tend to disrupt the binding of known prostate transcription factors including FOX, AR and HOX families. Lastly, the identified eSNPs can be combined into a polygenic risk score that adds value to current GWAS-based risk variants in assessing PrCa risk in independent cohorts.
Introduction
PrCa exhibits considerable heritability and notable disparities in incidence rates among different ancestral populations, with men of African ancestry displaying a far greater risk than those of European (1.75-fold) and Asian (3.18-fold) descent1,2. Additionally, AA men experience the highest mortality rate from PrCa3. Studies have demonstrated a role for population-specific genetic factors, even in conjunction with neighborhood deprivation and other social determinants of health in PrCa incidence4–6. Large genome-wide association studies (GWAS) have identified PrCa risk variants, which are aggregated into a polygenic risk score (PRS)7–10. In general, GWAS-derived risk variants do not generalize well to independent discovery cohorts11–13. Specifically, PrCa PRSs, including those based on multi-ancestry cohorts, show variable accuracies across ancestries14. More importantly, variants discovered via GWAS do not directly point to underlying mechanisms.
Recent studies have shown that a rare, AA-specific variant of the non-coding SNP rs72725854 (A>G/T) at the 8q24 locus is associated with a ∼2-fold increased risk of PrCa15,16. Follow-up work showed that the alternate allele ‘T’ introduces a binding site for the transcription factor (TF) SPDEF, which substantially increases the activity of the enhancer harboring the SNP and, in response to androgen stimulation, increases the expression of the enhancer’s target genes PCAT1, PVT1 and c-Myc in prostate tumors; enhancer with the reference ‘A’ allele remain non-responsive to androgen17. This finding underscores the potential roles of germline non-coding regulatory variants in mediating cancer risk, motivating this study.
Here, we aimed to comprehensively assess the role of genome-wide non-coding variants associated with higher PrCa risk in AA men, specifically via affecting regulatory enhancer function in prostate tissue. We propose a method that integrates a sequence-based deep learning enhancer model with a sliding window strategy, referred to as ProSNP-DL. The deep learning model of enhancer activity has single-nucleotide resolution18, i.e., the model can predict and quantify the potential change in the enhancer activity due to a single nucleotide change in the enhancer sequence. Applying ProSNP-DL genome-wide, we identified 2,407 non-coding SNPs (termed eSNPs) with greater frequency of the alternate allele in AA individuals, that exhibited a substantial difference in their predicted enhancer activities between the two alleles. Of the 2,407 eSNPs, for 1,296 (53.8%) eSNPs, the alternate allele was predicted to have a higher activity (we refer to these as “gained” eSNPs), and for the remaining 1,111 eSNPs the alternate allele was predicted to have a lower enhancer activity (we refer to these as “lost” eSNPs). We noted complementary functions affected by the gained and the lost eSNPs: putative targets of gained eSNPs are involved in oncogenic pathways, such as suppression of the adaptive immune system and telomere elongation. In contrast, targets of lost eSNPs are involved in tumor suppression pathways including differentiation. Gained and lost eSNPs respectively establish and inhibit the binding of several families of TFs, in particular those critical in prostate development and homeostasis, as well as PrCA development, including FOX, AR, and HOX. Finally, we assessed the ability of eSNPs to quantify PrCa risk. We found that a polygenic risk score (PRS) based on eSNPs can distinguish PrCa patients from healthy controls with high accuracy, far better than controls. Importantly, eSNPs, which are biologically motivated and derived agnostic to PrCa status, add value to the previously reported gold-standard PRS learnt from a large multi-ancestry cohort of PrCa patients and healthy controls 8. Overall, our biologically informed approach effectively identifies regulatory germline variants linked to elevated PrCa risk, while also suggesting potential mechanisms.
Results
Overview of the study
To identify the regulatory variants associated with divergent PrCa risk in AA and EA men, we first restricted our investigation to 10,171,946 non-coding SNPs that were common (Minor Allele Frequency; MAF >= 5%) in either the EA or the AA population, and further selected the 491K SNPs having the most disparate allele frequencies between AA and EA (top 5% FST score). We next prioritized these candidate SNPs in silico based on their effect on prostate enhancer activity as follows. Leveraging the H3K27ac ChIP-seq data (as a proxy for active enhancers) in a prostate- derived LNCaP cell line, we trained a deep learning model that learns the sequence encoding of prostate enhancers and is capable of quantifying regulatory impact of a SNP. Here, the major and minor alleles in EA are considered the reference and the alternate alleles, respectively. Our overall approach involves a deep learning model of enhancer activity and a sliding window strategy, to account for all potential position of a SNP within an enhancer, aiming to identify SNPs potentially affecting enhancer activity (Fig. 1, Methods); we dubbed our approach as ProSNP-DL (Prostate Cancer Risk SNPs prediction using Deep Learning). Applying ProSNP-DL to the aforementioned 491K SNPs, we identified two sets of AA-dominant SNPs (defined as those with alternate allele frequency higher in AA than EA) whose alternate alleles are predicted to lead to either an enhancer gain or loss, referred to as gained (1,296) and lost (1,111) eSNPs, respectively. Fig. 1 illustrates our approach. The rationale underlying this approach is that the SNPs with large potential to affect enhancer activity (either gain or loss) are more likely to have a phenotypic impact in the corresponding cellular context. Next, we investigated the biological pathways associated with the two sets of eSNPs and identified the potential TFs whose bindings are likely disrupted by the eSNPs. We further conducted a ChIP-seq experiment targeting FOXA1, a primary cognate TF of the eSNPs, in two prostate cancer cell lines (LNCaP19, derived from a donor of European ancestry and MDA PCa 2B20, derived from a donor of African ancestry) to assess the functional impact of eSNPs on TF binding disruption. Finally, we derive a polygenic risk score (PRS) from the eSNPs and assess its potential to quantify PrCa risk, relative to the PRSs based on previously published risk variants derived from GWAS.
ProSNP-DL - a deep learning model of PrCa enhancers with single-nucleotide resolution
To capture the sequence encryption of the PrCa enhancers, we first trained a deep convolution model (Fig. 2A, Methods) on 67,064 LNCaP H3K27ac peaks21 as a proxy for prostate enhancers, and 10 fold pan-cell type DNase Hypersensitive regions not overlapping LNCaP H3K27ac peaks as the negative control to ensure the tissue-specificity of the PrCa enhancer model. The enhancer model has the same architecture as the one in our previous work18. The model is able to accurately discriminate prostate enhancers from accessible regions devoid of prostate enhancers, with the area under the receiver operating characteristic curve (auROC) of 0.91 (Fig. 2B). We have shown that an enhancer model in HepG2 cell line with the same architecture and parameters is able to predict the allele-specific effects of previously profiled raQTLs22 on enhancer activity18. In short, the enhancer model score is not only able to identify active enhancers, is sensitive to single nucleotide changes in enhancer activity.
However, the effects of a SNP on enhancer activity depends on the position of the enhancer (Fig. 2C), as enhancers overlapping the SNP at various locations have different sequence contexts. We therefore evaluate a SNP’s influence by aggregating its effects across all conceivable positions of a hypothetical enhancer within a 1 KB window. The number of windows in which the alternate allele substantially transforms a neutral sequence into an active enhancer or vice versa, as predicted by the model, is termed the ‘essential window number’ (EWN). We utilized the EWN to quantify the effect size of the SNP on enhancer activity (Methods). Specifically, if the alternate allele consistently exhibits positive delta scores (the difference in model scores caused by the alternate allele) and converts a neutral sequence to an active enhancer in at least 5 windows (EWN ≥ 5), the SNP is classified as a gained activity eSNP (gained eSNP for simplicity), suggesting an enhancer activity gain. Conversely, if the alternate allele consistently shows negative delta scores and alters an active enhancer to a neutral sequence in at least 5 windows (EWN ≥ 5), the SNP is considered as a lost activity eSNP (lost eSNP for simplicity; Fig. 1, Methods), suggesting an enhancer activity loss. Fig. 2C illustrates this logic for gained eSNPs using the aforementioned PrCa risk SNP rs72725854 that were previously identified17. The delta scores of alternate allele ‘T’ are larger than the reference allele ‘A’ in all the windows in the (- 200bp, 200bp) region centered by the SNP, and change an inactive enhancer to an active enhancer (with False Positive Rate FPR < 0.01) in 5 windows (EWN = 5) according to our model. In summary, our approach aims to identify SNPs with a significant impact on enhancer activity by integrating a deep learning model trained on context-specific enhancer sequences with a sliding- window strategy, termed ProSNP-DL.
ProSNP-DL identifies enhancer SNPs dominant in an African ancestral population and associated with PrCa risk and patient survival
Starting with 84,802,133 genome-wide SNPs, we first selected the 10,171,946 non-coding SNPs that are common (MAF>=5%) in EA or AA, to ensure the robustness of our analysis. Further, to specifically identify SNPs that may underly the higher PrCa incidence in AA men, we focused on the SNPs with the largest allele frequency difference between AA and EA (Methods), yielding 491K SNPs (Fig. S1A). Applying ProSNP-DL on these SNPs yielded 2,069 gained enhancer activity SNPs, and 2,175 lost enhancer activity SNPs (eSNPs). Overall, these included 2,407 eSNPs with higher alternate allele frequency in AA (AA-dominant), and 1,837 eSNPs with higher alternate allele frequency in EA (EA-dominant) (Fig. S1B). The remainder of the 491K SNPs are referred to as non-enhancer SNPs (non-eSNPs) and used as negative control for several following analyses. To validate the functional impact of eSNPs on enhancer activity, we assessed allelic imbalance of H3K27ac reads at the heterozygous eSNP sites. Compared with non-eSNPs, the eSNPs exhibited a significantly greater extent of allelic imbalance of H3K27ac read coverage; as expected, the alternate alleles at gained eSNPs showed larger H3K27ac read coverage (Fig. 3A), and those at lost eSNP exhibited lower H3K27ac read coverage (Fig. 3B). This result suggests a coherent functional impact of the identified gained and lost activity eSNPs on the experimentally determined gain and loss of enhancer activity, respectively.
Next, we directly compared the AA-dominant and EA-dominant eSNPs in terms of their association with PrCa risk, using two complementary strategies. First, we assessed the association of the eSNPs with PrCa risk, using summary statistics from a PrCa GWAS of 140,306 males10. Notably, the eSNPs with a greater alternate allele frequency in AA are strongly associated with the PrCa GWAS traits (Fig. 3C). Contrastingly, this is not the case for the eSNPs with EA- dominant alternate allele frequency (Fig. 3D, Fig. S1C). As our second strategy, for the AA- dominant gained activity eSNPs, we assessed whether their putative target genes (proximal genes taken as the proxy) are associated with worse patient survival using Cox regression, controlling for age; conversely, for the lost activity eSNPs, we assessed whether their putative target genes are associated with better survival. Consistent with the first strategy, the target genes of the gained activity AA-dominant eSNPs associated with decreased survival as shown by the high hazard ratios (HR), and those of the lost AA-dominant eSNPs with improved survival (Fig. 3E). However, these associations were not observed for the EA-dominant gained and lost activity eSNPs (Fig. S1D). Taken together, our findings are indicative of a distinct association between eSNPs and PrCa risk and survival for AA-dominant eSNPs encoding either gained or lost enhancer actvity. In the following, we focus specifically on the AA-dominant gained (1,296) and lost (1,111) eSNPs and further investigate their role in prostate cancer risk.
The gained and lost activity eSNPs may facilitate carcinogenesis via complementary mechanisms
We next functionally characterize gained and lost activity eSNPs, using the annotations enrichment tool GREAT23,24 (Method). The analysis showed that gained activity eSNPs are primarily associated with innate and adaptive immune system, lipid metabolism, and telomerase activity regulation (Fig. 4A). This suggests that the gained activity eSNPs may facilitate carcinogenesis by modulating the immune system, substantiated by prior clinical research25. To further assess this possibility, we identified the subset of gained activity eSNPs, whose alternate alleles were present in at least ten ICGC PrCa samples. We then compared the T cell dysfunction scores from TIDE (Tumor Immune Dysfunction and Exclusion)26 of the genes proximal to such eSNPs with genes near other gained activity eSNPs and non-eSNPs. We observed that the genes near the frequently gained activity eSNPs exhibit higher T cell dysfunction scores (Fig. 4C, Methods), suggesting potential functional impact of the gained activity eSNPs on immune system dysfunction. In addition, the gained activity eSNPs are associated with regulation of telomerase activity. Several lines of evidence indicate that longer telomere length contributes to cancer27,28. In agreement with these findings, a previous study has shown that relative to cells derived from Europeans, the cells derived from Africans have longer telomere length across all tissues and, even more relevant, the difference was found to be the highest in the prostate29, further supporting a potential causal role of the gained eSNPs in tumor progression.
In contrast, the lost activity eSNPs are largely associated with regulation of differentiation/development (Fig. 4B). We hypothesized that genes regulated by lost activity eSNPs are involved in maintaining differentiated states, and the loss of their enhancers may cause cells to revert to a more undifferentiated state. In other words, the target genes of lost activity eSNPs are expected to be highly expressed in adult normal prostate tissue but repressed in the developmental stage. To validate this hypothesis, we performed a regression analysis to assess the association of different classes of SNPs to the adult-to-fetal prostate differential expression. As shown in Fig. 4D, a positive coefficient for the target genes of lost activity eSNP indicates their inactivation during development, and conversely, the target genes of gained activity eSNPs exhibit higher expression in the fetal prostate compared to the adult normal prostate (GTEx), consistent with our hypothesis (Fig. 4D). Notably, when we stratify the eSNPs based on the number of patients in which the risk allele occurs, the more frequent risk alleles show a greater magnitude of coefficient (Fig. 4D), further supporting their functional impact. Furthermore, the target gene of gained activity eSNPs have higher Cancer Dependency score (DepMap, Methods) compared to those of lost activity eSNPs (Fig. 4E). Taken together, these results strongly suggest that by activating oncogenic processes and disrupting tumor suppression mechanisms, the gained and lost activity eSNPs collaboratively promote carcinogenesis.
eSNPs disrupt binding of key prostate transcription factors including the FOX and the HOX families
The identified eSNPs, by design and supported by further validations, affect enhancer activity, which is likely to be mediated, in substantive part, via disruption (gain or loss) of transcription factor binding17. We therefore sought to identify the cognate TFs whose bindings are potentially disrupted by the eSNPs. Toward this, for both gained and lost activity eSNPs, we focused on the TFs whose binding sites are enriched at the eSNPs sites compared to the non-eSNP sites. This analysis was conducted using sequences containing both the reference and alternate alleles to eliminate allelic bias (Methods). Our findings revealed that for both gained and lost activity eSNPs, TFs belonging to the FOX and HOX families were among the most enriched (Table S1-4). The eSNPs affecting these TFs not only exhibited large effects on enhancer activity (average EWN ≥ 7) but also had the highest number of potential target genes (Fig. 5AB). This finding aligns with previous research indicating that FOXA1 acts as a pioneer factor in PrCa, capable of extensively reprogramming the androgen receptor (AR) cistrome in collaboration with HOXB1330; interestingly, AR is also among the TFs majorly disrupted by gained activity eSNPs (Fig. 5AB, Table S1-2). This provides further mechanistic insights into how eSNPs might contribute to PrCa risk. Although gained and lost activity eSNPs appear to disrupt the binding of the same set of TFs, we speculated that their target genes are involved in complementary pathways. Indeed, the genes proximal to gained activity eSNPs associated with FOX family TFs were enriched in the pro-tumor cancer signatures such as hypoxia, epithelial-to-mesenchymal transition (EMT), metastasis, and stemness signatures in the CancerSEA database31. In contrast, the genes proximal to lost eSNPs associated with FOX were enriched in the anti-tumor cancer signatures, including differentiation and apoptosis (Fig. 5C, Methods).
To experimentally validate the impact of eSNPs on FOX binding, we performed ChIP-seq for FOXA1 in two PrCa cell lines of distinct ancestral origins: LNCaP cells, derived from a lymph node metastasis of a 50-year-old Caucasian male with PrCa19, and MDA PCa 2B cells, isolated from the prostate of a 63-year-old Black male patient with adenocarcinoma20 (Methods). First, consistent with our expectation, MDA PCa 2B has a greater number of alternate alleles at eSNP sites overall (Table S5). Furthermore, we observed that relative to LNCaP, FOXA1 binding intensity was greater in MDA PCa 2B at gained activity eSNPs and lower at lost activity eSNPs (Fig. 5D, Methods).
One lost activity eSNP coincided with a common SNP rs10095018 (T/C) which is spatially proximal to the gene NDRG1 in the 3D genomic space (Fig. S2A), where the alternate allele ‘C’ substantially reduces enhancer activity according to our model (EWN=10). NDRG1 is recognized for its role in suppression of PrCa; decreased membrane expression of NDRG1 has been associated with significantly poorer survival outcomes in PrCa patients, and is correlated with higher Gleason scores, which predict the aggressiveness of PrCa33,34. At rs10095018 SNP site, LNCaP cell line possesses two reference alleles (T/T), while MDA PCa 2B cell line has heterozygous alleles (T/C). Notably, LNCaP exhibited a stronger FOXA1 binding signal compared to MDA PCa 2B (Fig. 5E). Furthermore, the MDA PCa 2B cell line exhibits allelic imbalance at this position, with the C allele harboring far fewer ChIP-seq reads of FOXA1 (Fig. S2C). These observations together suggest that the alternate allele C can indeed disrupt FOXA1 binding. More importantly, NDRG1 expression level is higher in LNCaP compared to MDA PCa 2B (Fig. 5F), suggesting that the alternate allele ‘C’ at lost eSNP rs10095018 reduces the expression of tumor suppressor NDRG1 by disrupting FOXA1 binding in MDA PCa 2B cell line, thus adversely affecting PrCa patient outcome. In summary, these eSNPs may contribute to heightened PrCa risk by disturbing the binding of key TFs.
Polygenic risk score derived from eSNPs can be helpful to differentiate between cases and controls of PrCa
While by design our eSNPs have high allele frequency differential between AA and EA ancestry, given their potential functional role in prostate cancer, here we assessed their ability to broadly capture PrCa susceptibility across various ancestral populations. We derived a Polygenic Risk Score (PRS) based on the number of risk alleles of each eSNP in an individual, trained and tested (10-fold cross-validation) in a large EA cohort in a PrCa GWAS study (PEGASUS) (4,599 cases and 2,841 controls, see Methods). We compared the cross-validation accuracy of eSNP-based PRS with those based on (i) randomly select 5000 high Fst SNPs, (ii) 50,708 chromatin Quantitative Trait Loci (cQTLs) linked to chromatin accessibility in PrCa7 and (iii) 545 PrCa risk variants (PGS-545) identified through a multi-ancestry genome-wide study8. We followed the convention8 to measure the accuracy as follows: we partitioned the range of PRS scores for the controls in five quantiles and assessed the enrichment of PrCa PRS scores in the top and the bottom quantiles. As shown in Fig. 6A (and Supplementary Fig S3), eSNP-based PRS substantially outperformed those based on cQTL and random high FST SNPs. As expected, PGS- 545 has a superior performance, as reported previously8. However, integrating eSNPs with PGS- 545 SNPs resulted in a slightly improved performance (Fig. 6B and S3), suggesting that eSNPs provide complementary information to PGS-545 SNPs. These results observed in the cross- validation within the EA cohort were similar to those when a model trained on the EA cohort was applied to the independent AA cohort (474 cases and 458 controls; see Methods). These results suggest that eSNPs provide valuable insights into PrCa susceptibility across various ancestral populations. Interestingly, we also found that, among the AA PrCa patients, those with higher PRS based on eSNPs and PGS-545 tend to develop PrCa at younger ages compared to those with lower PRS (Fig. 6C), supporting the role of eSNPs and PGS-545 SNPs in PrCa onset (see Methods).
Given that eSNPs are identified agnostic of PrCa cohorts, it is not surprising that they do not perform as well as the gold-standard GWAS-based PGS-545 SNPs in discriminating PrCa from control. However, eSNPs are more likely to be functional in terms of affecting enhancer function, and they are likely to underlie prostate-related disorders other than PrCa. First, we found that compared to PGS-545 SNPs, eSNPs indeed exhibit a greater allelic imbalance for H3K27Ac reads (Fig 6D). Looking at the GWAS catalog, while there are four additional traits besides PrCa related to the prostate, they are all indirectly related to cancer – prostate antigen levels and various adverse responses to PrCa treatment. We assessed whether eSNPs better capture these additional traits than PGS-545 which is based specifically on PrCa. We found that eSNPs indeed show a greater enrichment for these additional traits (Fig 6E). Furthermore, the enrichment of these prostate traits in eSNPs is true only for AA-dominant eSNPs; there is no overlap between EA-dominant SNPs and the QTLs for these additional traits. These results suggest that by virtue of being mechanism-based and agnostic to PrCa, eSNPs may better capture other prostate- related traits.
Overall, these results suggest that our identified eSNPs, derived in mechanism-aware but PrCa- agnostic fashion, can contribute to prioritize PrCa susceptibility across different ancestries, as well as other prostate disorders.
Discussion
In this study, we explored non-coding regulatory polymorphisms potentially underlying the high prevalence of PrCa and associated mortality in AA men. Using a sequence-based deep learning model (ProSNP-DL) that learns the sequence encodings of prostate enhancers and is capable of quantifying regulatory effects of a single nucleotide change, we identified two sets of regulatory genetic variants: one included the SNPs where the alternate allele encoded an increased enhancer activity (gained activity eSNPs) and the other included the SNPs where the alternate allele encoded a reduced enhancer activity (lost activity eSNPs). We particularly focused on the eSNPs with greater alternate allele frequency in AA (AA-dominant SNPs) because: 1) the AA- dominant eSNPs tended to have stronger signal for PrCa GWAS traits than EA-dominant eSNPs, and 2) a higher expression of the target genes of AA-dominant eSNPs were commonly associated with patient survival, which was not observed for EA-dominant eSNPs. Overall, we identify ∼2,000 gained and lost activity eSNPs with AA-dominant alternate alleles that provide a list of regulatory variants potentially associated with the excessive PrCa risk manifest in AA men.
We discovered that the gained and lost activity eSNPs potentially affect PrCa risk via two complementary mechanisms. Specifically, the genes near the gained activity eSNPs exhibit higher T cell Dysfunction score, leading to T cell exhaustion and suppression of the adaptive immune system. Moreover, the gained activity eSNPs are linked to telomere elongation, a process implicated in oncogenesis 27,28. On the other hand, the lost activity eSNPs are associated with de- differentiation (stem-like) and evasion of cell death processes. Thus, the gained and lost activity eSNPs collectively contributing to a PrCa predisposition. The identified eSNPs likely function by disrupting the binding of key transcription factors, predominantly belonging to the FOX family30,35. Although both the gained and the lost activity eSNPs disrupt the binding of the same set of TFs, they regulate different target genes. The target genes of gained activity eSNPs disrupting FOX binding are involved in oncogenic processes, such as EMT36, stemness36, hypoxia37, and metastasis. Conversely, the target genes of lost activity eSNPs disrupting FOX binding are involved in tumor suppression processes, including differentiation and apoptosis (Fig. 5C). Therefore, loss of enhancers caused by lost activity eSNPs would lead to inhibition of cell differentiation38 and evasion of cell death39. This aligns with the enrichment of the overall gained and lost eSNPs in the two complementary mechanisms that collectively contribute to PrCa risk (Fig. 4AB).
While our computational analysis points to FOX family of TFs, given the well-established role of FOXA1 in prostate development and prostate cancer30,35, we experimentally validated the predicted effects of the two sets of SNPs on FOXA1 binding in two PrCa cell lines: LNCaP and MDA PCa 2B cells, respectively derived from individuals of European and African ancestry. The MDA PCa 2B cell line, with a higher number of alternate alleles of eSNPs, exhibited increased binding at the gained activity eSNPs and decreased binding at the lost activity eSNPs (Fig. 5D), thereby validating our predictions. The lost activity eSNPs appear to have a larger impact on disrupting FOXA1 binding compared to the gained activity eSNPs (Fig. 5D). This observation is consistent with the fact that TF binding is more easily lost than gained with single nucleotide changes40.
Unlike traditional GWAS studies that identify risk variants associated with complex human traits or diseases through SNP-trait associations in large cohorts, our approach does not rely on large cohorts with clinical information, in particular, PrCa status, but rather directly identifies potentially functional regulatory SNPs in prostate tissue, which may be relevant to other prostate diseases beyond cancer. Despite not relying on PrCa cohorts, the identified eSNPs add value to the gold standard PGS-545 PrCa variants in discriminating PrCa patients from healthy controls, both within EA and within AA cohort (Fig 6AB), even though our eSNPs were identified based on differential allele frequency between AA and EA individuals. However, compared to PGS-545, eSNPs are more likely to have direct functional role beyond mere association with PrCa (Fig. 6D). Interestingly, the genes near PGS-545 SNPs are also associated with developmental processes (Fig. S4), similar to those associated with the lost eSNPs, further substantiating the connection between development and cancer progression. However, the genes near eSNPs are enriched for several additional biological processes likely to be relevant to PrCa, such as telomere maintenance, interferon signaling, and EMT.
Past successes in deep learning models of enhancers18,41,42 have motivated the current study to focus on non-coding SNPs potentially affecting regulatory enhancer activity. However in principle, such an approach can be applied to other classes of non-coding SNPs, such as those affecting splicing, as long as accurate sequence-based models capturing those functions can be developed. This provides an avenue for future extension of our study. Likewise, this study can be extended to other cancer types with substantial hereditary basis, provided availability of the active enhancer data in appropriate cell or tissue type. Another point worth noting is that our deep learning model is trained on prostate enhancers, which can affect not just malignant transformation, i.e., PrCa, but other maladies of the prostate, which should be explored in a follow up study. Much like the QTLs identified by GWAS, the eSNPs identified in our study must be rigorously tested experimentally to establish their causality. As a strength our study, we have provided initial experimental validation of our predictions, vis a vis FOXA1 binding.
Methods
Data Availability
We downloaded the LNCaP H3K27ac peaks after DHT treatment from GEO dataset GSM124944821. The DHT treated DHS data in the LNCaP cell line were obtained from GSM82238843. For allelic imbalance analysis, we utilized the mapped bam files for H3K27ac data of 5 EA PrCa patients from7. The gene expression profile of fetal prostate was obtained from GSM61454544. The list of cQTL SNPs were obtained from the authors of7. The PGS SNPs were obtained from8. The genotypes of PrCa cases and controls with European ancestry of the GWAS study (National Cancer Institute (NCI) Prostate Cancer Genome-Wide Association Study for Uncommon Susceptibility Loci (PEGASUS)) were downloaded from dbGap (phs000882.v1.p1, https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000882.v1.p1), and the genotypes of PrCa cases and controls with African ancestry from the study (Ghana)9 were downloaded from dbGap (phs000838.v1.p1.c1, https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000838.v1.p1). The Hi-C data of LNCaP cell line was obtained from45. The RNA-seq data for LNCaP and MDA PCa 2b cell lines were obtained from DapMap46 (DepMap, Broad (2023). DepMap 23Q4 Public. Figshare+. Dataset. https://depmap.org/portal). As there is no effects/dependency DapMap score for prostate cancer, we simply took the average DapMap scores across cancer types as an estimate of the DapMap score for each gene. The SNP array- based genotype of LNCaP cell line is obtained from GSM888346, and the SNP array-based genotype of MDA PCa 2b cell line is obtained from GSM888377. The FOXA1 ChIP-seq data was generated in house and is deposited at GSE276748. The human GRCh37 (hg19) reference genome was used throughout the study. Clinical data for TCGA PrCa were downloaded from PanCanAtlas (https://gdc.cancer.gov/about-data/publications/pancanatlas). Gene expression data for TCGA PrCa patients was downloaded from the toil-hub of UCSC-Xena browser.
Data Processing
Fastq files were downloaded from the European Nucleotide Archive (https://www.ebi.ac.uk/ena/browser/home). Reads were trimmed based on the overlap with the adapter sequence, and those with an average base quality of <20 were removed using Trimmomatic47 version 0.39. Trimmed reads were aligned to hg19.p13.plusMT.no_alt_analysis_set genome using bowtie248 version 2.5.1 (https://bowtie-bio.sourceforge.net/bowtie2/index.shtml) with default parameters. Duplicate reads from the alignment files were removed, and alignments with a mapping quality score of <20 were discarded using samtools49 version 1.16.1 (http://www.htslib.org/). Peaks were called using MACS250 version 2.2.9.1. For H3K27Ac data, broad peaks were called. For DHS data, narrow peaks were called without building a model, shifting the reads by 150 bp towards the 5’ end and extending by 300 bp towards the 3’ end.
FOXA1 ChIP-seq experiments in two PrCa cell lines
Approximately 10 million cells were fixed using 2 mM DSG (CovaChem) for 10 minutes followed by 1% formaldehyde for 10 minutes at 37°C and quenched with 2 M glycine for 5 minutes at room temperature. Cross-linked cells were resuspended in cold lysis buffer (50 mM Tris pH 8.0, 10 mM EDTA pH 8.0, 0.5% SDS, 1X protease inhibitor cocktail) and sheared using a Bioruptor Pico (Diagenode) device. Fragmented chromatin was incubated with 3 µg of FOXA1 (Abcam, cat#ab23738) antibody overnight at 4°C. A fraction of each sample prior to addition of antibody was used as an input control. Protein A/G beads (ThermoFisher) were added and incubated for 1 hour at 4°C, washed six times with RIPA buffer (50 mM HEPES pH 7.6, 1 mM EDTA pH 8.0, 500 mM LiCl, 0.7% sodium deoxycholate, 1% NP40), and eluted in elution buffer (100 mM NaHCO3, 1% SDS). Samples, including input DNA, were treated with RNase A (ThermoFisher) for 30 minutes at 37 °C followed by proteinase K (New England Biolabs) overnight at 65 °C to reverse crosslinking. ChIP DNA was purified with Monarch Genomic DNA Purification Kit (New England Biolabs) and concentrations were quantified by Qubit fluorometer (ThermoFisher) and TapeStation (Agilent). Sequencing libraries were prepared by NEBNext UltraII DNA Library Prep Kit (New England Biolabs) per manufacturer’s instructions and sequenced on the Illumina NextSeq2000 platform at the CCR Genomics Core at the National Cancer Institute, NIH, Bethesda, MD.
A deep convolutional neural network model of prostate enhancers
We built a deep convolutional neural network to predict tissue-specific enhancer activity directly from the enhancer DNA sequence. The deep learning model has the same architecture as the one in our previous study18, comprising five convolution layers with 320, 320, 240, 240, and 480 kernels, respectively. Higher-level convolution layers receive input from larger genomic ranges and are able to represent more complex patterns than the lower layers. The convolutional layers are followed by a fully connected layer with 180 neurons, integrating the information from the 1kb sequence. In total, the DLM has 3,631,401 trainable parameters. We used the Python library Keras version 2.1.5 (https://github.com/keras-team/keras) to implement our model. The positive sets contain the 1kb sequence (and their reverse complement) of LNCaP enhancers. We used the DHS peaks overlapping H3K27ac peaks to delimit enhancer regions. DHS peaks located within the same H3K27ac peak were merged into one. The enhancers were then these DHS peaks extended to 1 kb from its original center. Enhancers overlapping promoters (including all alternative promoters) and promoters (intervals [−1000 base pairs (bp), +1000 bp] surrounding the transcription start site) were removed from the enhancer set. Overall, we identified 58,266 LNCaP enhancers. The DHS profiles of non–prostate tissues from Roadmap Epigenomics projects51, which do not overlap the positive sets, were collected as the negative training set of the deep learning model. The reason we used DHS sites not overlapping prostate H3K27ac peaks as negative control regions is that we aim to distinguish prostate-specific enhancers from otherwise accessible and potential enhancers in other tissues. Seventy percent of the data was used for training, 15% is for parameter tuning (on validation set), and the remaining 15% is used as the testing set. Larger DL score of the genomic sequence corresponds to a higher propensity to be an active tissue-specific enhancer18. The genomic sequence with deep learning model score ≥ 0.58 (selected based on FPR ≤ 0.01) is predicted to be active enhancers.
Effect size of a SNP
We evaluate the effect of a SNP by combining its effects across all possible potential positions of a putative enhancer. For a given SNP, using a stride as 20bp, there are 21 2-KB windows overlapping the SNP within a (-200bp, +200bp) region centered on the SNP. The reason we restricted our sliding windows to cover the central (-200bp, +200bp) region is that the alternate alleles tend to have a larger effect on enhancer activity within this narrow region42 and tend to have more consistent directions of deltas (Fig. 2C). The number of windows in which exactly one of the alternate alleles converts a neutral sequence to an active enhancer and vice versa, according to the model, is termed the essential window number (EWN). The EWN, which ranges from 0 to 21, is utilized to assess the effect size of the SNP on enhancer activity. Further, we expect a consistent directional impact of the SNP on enhancer activity across the 21 windows. Specifically, if the alternate allele exhibits positive delta scores in over 85% of the windows and converts a neutral sequence to an active enhancer in at least 5 windows (EWN ≥ 5) according to our model, the SNP is classified as a gained eSNP, suggesting an enhancer gain. Similarly, if the alternate allele has negative delta scores in over 85% of the windows and changes an active enhancer to a neutral sequence in at least 5 windows (EWN ≥ 5), the SNP is considered as a lost eSNP, suggesting an enhancer loss.
Identification of allelic imbalance in H3K27ac data
We reasoned that if an eSNP locus happens to be heterozygous in a sample, we would expect the two alleles to have differential enhancer activity. Specifically, the alternate alleles of the gained eSNPs should have higher enhancer activity than the reference alleles, and the alternate alleles of the lost eSNPs should have weaker enhancer activity than the reference alleles. This should be reflected in the differential representation of the two alleles among the H3K27Ac reads of the locus. The H3K27ac reads were extracted using BaalChIP52. Allelic counts over heterozygous sites of the 5 EA PrCa patients H3K27ac data7 were merged, and variants that had at least six reads were further processed for allele-specific enhancer activity analysis with binomial test. We use the heterozygous sites of the gained and lost non-eSNPs as the background. The gained non-eSNPs are high FST SNPs with positive delta scores, and the lost non-eSNPs are high FST SNPs with negative delta scores. For a heterozygous site of gained eSNPs, if the ratio of read number of the alternate allele to that of the reference allele is over 1.5 and the binomial P value ≤ 0.01, the position is considered to have allelic imbalance. For a heterozygous site of lost eSNPs, if the ratio of read number of the reference allele to that of the alternate allele is over 1.5 and the binomial P value ≤ 0.01, the position is considered to have allelic imbalance. Notably, the essential mutations are identified according to the sequence-based deep learning model, suggesting that the alternate alleles at these positions are likely to create or deactivate enhancers. The alternate alleles at these eSNPs coincidently (and independently) exhibit significantly more or less H3K27ac reads than do the reference alleles (termed allelic imbalance), substantiating that the alternate alleles of gained eSNPs are associated with increased enhancer activity, and alternate alleles of lost eSNPs are associated with decreased enhancer activity.
Functional enrichment analysis using GREAT and cancerSEA
To probe the potential functional roles of gained and lost eSNPs, we tested for functional enrichment among genes near the eSNPs loci using the online Genomic Regions Enrichment of Annotations Tool (GREAT23) version 3.0.0 using two-nearest-genes association rule. The GO terms will be considered as enriched if it has at least 10 gene hits with binomial p-value threshold set as 0.01. The whole genome region was used as the background. To characterize the cancer progression related functions of the gained and lost eSNPs associated with FOX family TFs, we tested the enrichment of the nearby genes of the eSNPs in the 14 functional states across cancer types (including stemness, invasion, metastasis, proliferation, EMT, angiogenesis, apoptosis, cell cycle, differentiation, DNA damage, DNA repair, hypoxia, inflammation and quiescence) portrayed by cancerSEA database31. The enrichment of the targets of gained eSNPs in the genes of a particular functional state relative to those of lost eSNPs was evaluated as log2(fraction_gained/fraction_lost), and vice versa, where fraction_gained is the fraction of the target genes of gained eSNPs overlapping with the genes of a particular functional state.
Survival analysis
We used clinical data from TCGA to model the overall survival of PrCa patients using the gene expression as a predictor variable and age as a covariate in the cox regression to relate gene expression levels to survival outcomes. We used the R library “survival” for this analysis. The Hazard Ratio (HR) represents the change in hazard (risk of death) associated with a unit increase in gene expression. Only the HR with significant p-values (p <= 0.05) were considered in this study.
Logistic regression
We applied the “glm” method from the R package to apply logistic regression to predict gene expression change (adult prostate – fetal prostate) against categories of gained eSNPs, lost eSNPs, random SNPs, and gene expression in fetal prostate. where Δg is the gene expression change in adult prostate compared to fetal prostate (gadult − gfetal), which is binary (1 for positive change, 0 for negative change). We integrated the fetal prostate gene expression profile (GSM61454544) to GTEx and applied quantile normalized adult and fetal embryonic prostate gene expression and to remove differences across experiments and batch effects. Three different glms were applied to the three sets of SNPs (gained eSNPs, lost eSNPs, and random SNPs), separately. SNP category is binary for each SNP set. gfetal is the expression level of the gene in fetal prostate. Positive coefficient of each variable indicates positive correlation between the variable and Δg.
Identification of potential TFBSs associated with the gained and lost eSNPs
To identify potential TF binding sites, we used FIMO53 to scan the profiles of binding sites for vertebrate TF motifs in JASPAR54, CIS-BP55, SwissRegulon56, HOCOMOCO57, and UniPROBE58 databases, along the 100-bp sequences centered at the gained and lost eSNPs. We identified motif-specific thresholds to limit the FDR to no more than five false positives in 10 kb of sequence, by scanning each motif on random genomic sequences using FIMO53. Enrichment of a motif associated with gained/lost eSNPs (foreground) relative to non-eSNPs (background) was ascertained using Fisher’s exact test. The occurrence of a particular Transcription Factor Binding Site (TFBS) in the set of 100 bp genomic sequences centered by the gained/conserved eSNPs was normalized by the total number of eSNPs. Notably, we included the sequences containing both the reference and alternate alleles to avoid allelic bias.
Polygenic Risk Score
Before estimating the PRS, we applied Minimac4 algorithm to impute genotype using Michigan Imputation Server59. We applied multivariate linear regression using 10-fold cross validation in the EA cohort (PEGASUS) to test the predictive accuracy in PrCa predisposition of different sets of risk variants. To estimate the PRS in the independent AA cohort (Ghana) and test the model, we retrain the model using the entire EA cohort and applied it to score the AA individuals using bootstrapping, by randomly subsampling 90% the AA individuals 10 times. The PRS is the weighted sum of the alternative allele number of the risk variants in a given genotype, where the weights are the coefficients of the trained multivariate linear regression model.
Data Availability
All data produced in the present work are contained in the manuscript
Supplemental Information
Acknowledgement
This work used the computational resources of the NIH HPC Biowulf cluster. Next generation sequencing was done at the CCR Genomics Core at the National Cancer Institute, NIH, Bethesda, MD. We would like to thank Dr. Baca, Sylvan C, Dr. Fortunato, Brad J., Dr. Ziwei Zhang and Dr. Mathew Freedman for kindly sharing the supplementary materials of their works with us. We would like to thank Dr. Kun Wang, Dr. Vishaka Gopalan, Annan Timon, Arati Rajeevan for their feedback. This research was funded, in part, by U.S. National Cancer Institute grant 1-ZIA-BC011979-02 (to S.H.) and 1-ZIA-BC011973-04 (to D.T.)