Abstract
Recent exome-wide association studies have explored the role of coding variants in breast cancer risk, highlighting the role of rare variants in multiple genes including BRCA1, BRCA2, CHEK2, ATM and PALB2, as well as new susceptibility genes e.g., MAP3K1. These genes, however, explain a small proportion of the missing heritability of the disease. Much of the missing heritability likely lies in the non-coding genome. We evaluated the role of rare variants in the 5’ and 3’ untranslated regions (UTRs) of 18,676 genes, and 35,201 putative promoter regions, using whole-genome sequencing data from UK Biobank on 8,001 women with breast cancer and 92,534 women without breast cancer. Burden tests and SKAT-O tests were performed in UTR and promoter regions. For UTR regions of 35 putative breast cancer susceptibility genes, we additionally performed a meta-analysis with a large breast cancer case-control dataset. Associations for 8 regions at P<0.0001 were identified, including several with known roles in tumorigenesis. The strongest evidence of association was for variants in the 5’UTR of CDK5R1 (P=8.5×10−7). These results highlight the potential role of non-coding regulatory regions in breast cancer susceptibility.
Introduction
Genetic susceptibility to breast cancer is known to be conferred by common variants, identified through genome-wide association studies (GWAS), together with rare coding variants conferring higher disease risks. Protein-truncating variants (PTVs) and rare missense variants identified through linkage and targeting sequencing studies in some genes have been well established, including ATM, BARD1, BRCA1, BRCA2, CHEK2, RAD51C, RAD51D, PALB2 and TP531. Recently, whole exome sequencing (WES) analysis identified additional associations for PTVs and rare missense variants in MAP3K1, LZTR1, ATRIP and CDKN2A2. However, these common and rare variants, in aggregate, only explain ~50% of the familial aggregation of the disease. Much of the “missing” heritability is likely to be due to variants in the non-coding genome2. Here we explore the association of variants in non-coding untranslated regions (UTRs) and promoter regions with breast cancer risk.
UTRs are regions upstream (5’ UTR) and downstream (3’ UTR) of the coding sequence of a gene that are transcribed but not translated into protein3, while promoters are untranscribed regions upstream of genes where proteins bind to initiate transcription4. Variants in these regions do not affect the protein-coding sequence but may regulate gene expression. We used whole genome sequencing (WGS) data from the UK Biobank (UKB), to assess the role of rare variants in promoter and UTR regions in all genes. After quality control (QC; see methods), this dataset comprised 8,001 women with breast cancer and 92,534 women without breast cancer (Supplementary Table 1). For 35 putative breast cancer susceptibility genes, we incorporated additional data from 51,494 women with breast cancer and 43,884 women without breast cancer from the BRIDGES dataset1.
For promoter regions, we considered all rare variants (minor allele frequency<0.001) in promoter regions defined by Ensembl BioMart5 excluding variants falling in coding sequences. For UTR regions, we considered all rare variants annotated as a UTR variant according to Ensembl Variant Effect Predictor (VEP)6.
We conducted burden tests for all promoter regions or UTR regions in which there was at least one carrier of a variant in either cases or controls (35,201 promoter regions, 16,381 3’ UTRs and 17,185 5’ UTRs). These tests, in which variants are collapsed together, can be more powerful than single-variant tests if variants have similar effect sizes7. We used logistic regression with all rare variants grouped together and further improved power by incorporating data on family history of breast cancer, as described elsewhere2. Since the assumption of similar effect sizes may break down for these regulatory regions8, for burden analyses with P<0.05, we conducted robust SKAT-O tests, which allow for variants to have different effect sizes and directions of association. We denote P-values from the burden test as PB, P-values from the robust SKAT-O test as PS, and P-values from the meta-analysis of burden results for the 35 putative breast cancer susceptibility genes as PBM.
Results
5’ UTRs
23 5’ UTR regions were associated at PB<0.001 (Figures 1 and 2; Supplementary Table 2), slightly more than the number (17) expected by chance. 21 of these corresponded to an increased risk, compared with ~8.5 which would be expected by chance. SKAT-O robust was tested on 861 5’ UTR regions with PB<0.05; 14 of these had PS<0.001 (Figure 3, Supplementary Table 3). Associations with PB <1×10−4 or PS <1×10−4 were observed for: CDK5R1 (PB =8.5×10−7, PS =7.2×10−7), MVB12A (PB =6.6×10−5, PS =1.6×10−4) and SYNE1 (PB =8.0×10−5, 0.014). None of the five most clearly established breast cancer risk genes had PB or PS <0.001 (Supplementary Table 4).
The x-axis is the chromosomal position, and the y-axis is the Z-score from testing H0 of no association. The blue lines correspond to Z=±3.29, P=0.001, and the red lines correspond to Z=±4.71, P=2.5×10−6. All labelled genes are those with P<0.001. P-values are unadjusted for multiple testing.
The x-axis is the expected log10 P-values from the null hypothesis, the y-axis is the observed log10 P-values. Blue dots correspond to regions associated with increased risk and red dots correspond to regions associated with decreased risk. All P-values are unadjusted for multiple testing.
The x-axis is the −ve log10 P-values from the robust SKAT-O test (PS) and the y-axis is the −ve log10 P-values from the burden test (PB). Genes with P-value<0.001 for either the burden or robust SKAT-O test are labelled. Blue dots correspond to (−log10 (PB))2 + (−log10 (PS))2 ≤ 9, orange corresponds to 9< (−log10 (PB))2 + (−log10 (PS))2 ≤ 16 and red correspond to (−log10 (PB))2 + (−log10 (PS))2 ≥ 16. All P-values are unadjusted for multiple testing.
The most significant association, meeting P<10−6 for both the burden and robust-SKAT-O test, was CDK5R1. UTR variants were in two distinct regions approximately 350bp apart, with 69 variants with positions 32486993-32487160 and 36 variants with positions 32487483-32487620 (Figure 4). The effect sizes did not differ significantly between the two regions (Supplementary Table 5). The interval between the UTR regions, 32487172-32487467 was an intron, which had no association with risk (Supplementary Table 5). A table of the UTR variants in CDK5R1, sorted by the number of carriers, is provided in Supplementary Table 6. There were 9 variants at position 32487007 with varying repeats of GCC (combined PB=0.0030, PS=0.0017).
The x-axis is the chromosomal position, and the y-axis is the frequency for cases (red; above y=0) and controls (blue; below y=0). The frequency is calculated for females in the dataset only.
3’ UTRs
18 3’ UTR regions were associated with breast cancer risk at PB<0.001 (Figures 5 and 6; Supplementary Table 7), similar to the number (16) expected by chance. 11 of these corresponded to an increased risk, compared with ~ 8 which would be expected by chance. SKAT-O robust was tested on 821 3’ UTR regions with PB<0.05; 17 of these had associations with P <0.001 (Figure 3, Supplementary Table 8). Associations with PB <1×10−4 or PS <1×10−4 were observed for: KCNN3 (PB =4.6×10−5, PS =5.6×10−4) and ZNF821 (PB =2.3×10−4, PB =2.2×10−5). None of the five most clearly established breast cancer risk genes (ATM, BRCA1, BRCA2, CHEK2, and PALB2) had PB or PS <0.001 (Supplementary Table 9).
The x-axis is the chromosomal position, and the y-axis is the Z-score from testing H0 of no association. The blue lines correspond to Z=±3.29, P=0.001, and the red lines correspond to Z=±4.71, P=2.5×10−6. All labelled genes are those with P<0.001. P-values are unadjusted for multiple testing.
The x-axis is the expected log10 P-values from the null hypothesis, the y-axis is the observed log10 P-values. Blue dots correspond to regions associated with increased risk and red dots correspond to regions associated with decreased risk. All P-values are unadjusted for multiple testing.
UTR UK Biobank and BRIDGES meta-analysis
We obtained additional data on UTR regions for 35 putative breast cancer susceptibility genes that were sequenced in the BRIDGES study1 and performed a meta-analysis of burden associations across the UK Biobank and BRIDGES datasets. No regions were associated with breast cancer risk at PBM<0.001 (Supplementary Table 10). The strongest associations were for the BARD1 5’ UTR (PBM=0.0076) and CHEK2 3’ UTR (PBM=0.017).
Promoters
31 promoter regions were associated at PB<0.001 (Figures 7 and 8; Supplementary Table 11), consistent with the number (35.2) expected by chance. However, 23 of the 31 regions at PB<0.001 correspond to an increased risk, compared with ~17.6 which would be expected by chance. SKAT-O robust was tested on 1,756 promoter regions with PB<0.05; 25 of these had PS<0.001 (Figure 9, Supplementary Table 12). For the most significant associations, we identified possible genes associated with each promoter by searching for those with transcription start site (TSS) within 500bp of the promoter end. For each promoter, 1 gene was identified satisfying this condition.
The x-axis is the chromosomal position, and the y-axis is the Z-score from testing H0 of no association. The blue lines correspond to Z=±3.29, P=0.001, and the red lines correspond to Z=±4.71, P=2.5×10−6. All red dots are regions with P<0.001. P-values are unadjusted for multiple testing.
The x-axis is the expected log10 P-values from the null hypothesis, the y-axis is the observed log10 P-values. Blue dots correspond to regions associated with increased risk and red dots correspond to regions associated with decreased risk. All P-values are unadjusted for multiple testing.
The x-axis is the −ve log10 P-values from the robust SKAT-O test (PS) and the y-axis is the −ve log10 P-values from the burden test (PB). Genes with P-value<0.001 for either the burden or robust SKAT-O test are labelled. Blue dots correspond to (−log10 (PB))2 + (−log10 (PS))2 ≤ 9, orange correspond to 9< (−log10 (PB))2 + (−log10 (PS))2 ≤ 16 and red correspond to (−log10 (PB))2 + (−log10 (PS))2 ≥ 16. All P-values are unadjusted for multiple testing.
Associations with PB <1×10−4 or PS <1×10−4 were observed for 3 promoter regions: ENSR00001081062 in 4p14 for with nearest downstream gene NWD2 (PB =8.9×10−5, PS =0.038), ENSR00000016877 in 1q25.3 with nearest downstream gene ARPC5 (PB =9.9×10−5, PS =2.0×10−3) and ENSR00000189328 in 5q33.2 with nearest downstream gene SAP30L (PB =0.0049, PS =9.7×10−5). None of the five most clearly established breast cancer risk genes had PB or PS <0.001 (Supplementary Table 13).
Discussion
Of the regulatory genomic regions we examined, the strongest association with breast cancer risk was for rare variants in the 5’UTR of CDK5R1, which reached P<10−6 using both the simple burden and robust SKAT-O tests. CDK5R1 shows increased expression in many malignancies and has been associated with poor prognosis, proliferation, and drug resistance in many cancers including breast cancer9. Of the other 5’UTRs associated at P<0.0001, SYNE1 is a nuclear envelope protein critical for cellular structure and signalling, which is downregulated in many malignancies. Loss of function mutations in the gene have been found in ovarian cancer patients, and downregulation of the gene has been associated with increased tumour mutation burden and immune cell infiltration10, 11. Hypermethylation of the SYNE1 promoter has also been associated with tumour aggressiveness in breast cancer12 and poor outcomes in gastric cancer13. MVB12A forms part of the ESCRT-1 complex, which has potential links to cancer14. Two 3’ UTR associations reached PB <1×10−4 or PS <1×10−4: KCNN3 and ZNF821. KCNN3 encodes SK3, also known as KCa2.3, a potassium channel. Increased expression of SK3 is associated with greater breast cancer cell migration15. Furthermore, decreased expression levels of KCNN3 have been associated with drug resistance and poor prognosis for ovarian cancer16. ZNF821 interacts with ATM, encoded by the known breast cancer susceptibility gene ATM involved in DNA damage signalling and repair17.
Three promoter regions reached PB <1×10−4 or PS <1×10−4: on 1q25.3 (closest likely target ARPC5), on 5q23.2 (SAP30L) and on 4p14 (NWD2), though none reached P<10−6. All were high-confidence promoters for these genes, although other genes may be regulated by these promoters. ARPC5 encodes one of 7 subunits of the Arp2/3 protein complex, which is overexpressed in a variety of cancers including breast cancer18. This protein is thought to be involved in the mechanism controlling tumour cell migration, invasion, and metastasis, by mediating actin polymerisation, and therefore to be closely linked to tumour prognosis18. CRISPR studies have additionally shown that loss of ARPC5 can delay the migration of adherent MDA-MB-231 cells (a triple-negative breast cancer cell line)19. SAP30L is a SAP30-like protein that, along with SAP30, is a subunit of the SIN3 protein complex, which has suggested roles in breast cancer progression including gene upregulation affecting cell motility, angiogenesis and lymphangiogenesis20, 21. NWD2 is a paralog of AAMP which plays a role in angiogenesis and cellular migration. High expression of AAMP has been associated with poor prognosis and metastasis of breast cancer22.
We considered two types of burden analyses: a simple burden analysis in which all rare variants are considered equivalent, and robust SKAT-O which allows for variation in effect across variants. The latter may be more powerful if only a subset of variants are risk-associated or there are effects in opposite directions, while the simpler test is likely to be more powerful if the effects are similar in magnitude, and also has the advantage that power can be improved by taking family history of the disease into account. Notably, the CDK5R1 5’UTR association was the strongest using either method.
Previous GWAS have identified more than 300 common susceptibility loci for breast cancer23–25, but few of these have been definitively mapped to UTRs or promoters, with the majority apparently located in more distal regulatory regions. A notable exception is the variant rs78378222 in the 3’UTR of TP53, which is associated (in opposite directions) with both ER-negative and ER-positive breast cancer23. In this analysis, one association reached P<10−6, and the excess of positive associations at P<0.001 suggests that additional breast cancer susceptibility variants may be present in these regions. Many of the putative associations are in regions regulating plausible cancer-related genes, but further, and even larger, replication studies will be required to validate these associations and provide reliable risk estimates.
Methods
UK Biobank
UK Biobank is a prospective cohort study of more than 500,000 individuals. Detailed information is given elsewhere26, 27. WGS data for 200,000 individuals were released in November 2021 and accessed via the UK Biobank DNA Nexus platform. QC metrics were applied to Variant Call Format files as described by Gardener et al, including genotype level filters for depth and genotype quality28. Other filters, including samples with disagreement between genetically determined and self-reported sex and excess relatives, were applied as described elsewhere2. Cases were defined by having invasive breast cancer (International Classification of Diseases (ICD)-10 code C50) or carcinoma in situ (D05), as determined by linkage to the National Cancer Registration and Analysis Service (NCRAS), or self-reported breast cancer. Both prevalent and incident cases were included. Only breast cancers that were an individual’s first or second diagnosed cancer were included as cases. By this definition, 8,001 female and 38 male cases were included.
BRIDGES
For the meta-analysis of UTR regions, we also performed the analysis in the BRIDGES dataset for 35 genes sequenced on a targeted sequencing panel, as described elsewhere1. This dataset included 51,494 women with breast cancer and 43,884 women without breast cancer from 43 studies participating in the Breast Cancer Association Consortium (BCAC). Phenotype data were based on the BCAC database v14. Of these 41,609 women with breast cancer were from cohort or population-based case-control studies and unselected for family history. The remainder were from clinic-based studies with some oversampling for familial cases. Details of the sequencing methodology, variant calling and quality control are described in detail elsewhere1.
Data preparation
Promoter regions were extracted using the Ensembl BioMart5 data mining tool web page. These regions do not directly identify a specific target gene. To identify the likely corresponding gene for each significant region, we identified genes with TSS within 500bp of the promoter end. UTR regions for genes were similarly extracted from Ensembl BioMart using the R package biomaRt.
Ensemble Variant Effect Predictor (VEP)6 v101.0 was used to annotate variants within regions of interest. Annotations included the 1000 genomes phase 3 allele frequency, sequence ontology variant consequences and exon/intron number. For each region, the MANE Select transcript29 was used, if available, otherwise the RefSeq Select transcript was used30. Annotation files were used to identify rare variants in promoter regions and variants in UTR regions. PTVs, missense and other coding variants were excluded.
Rare variant analysis
Association analyses were carried out separately for each promoter or UTR region. The main association analyses were burden tests in which genotypes were collapsed into a 0/1 variable based on whether samples carried a variant in the region. For each region we estimated odds ratios and confidence intervals, and derived Z-scores, and two-sided P-values (PB), using logistic regression. The method used here also incorporates the family history of breast cancer as a surrogate for disease status with weight ½, as explained in detail elsewhere2. This model assumes that all variants are associated with the same effect size and is expected to be most powerful under this assumption. SKAT is a method that allows for different variants within a region to have different effects or no effect and with different effect directions31. SKAT is likely to be more powerful when a smaller proportion of variants are causal, or effects have different directions32. SKAT-O is an optimal test that is a linear combination of the burden and SKAT statistic, which optimises the weighting of the two tests for each region26. In large biobanks with unbalanced case-control ratios, these methods can suffer from inflated type-1 error rates. A robust SKAT-O method was recently developed that accounts for this using Saddle Point Approximation and Efficient Resampling33. This method is more computationally intensive than the simple burden test and we therefore only obtain SKAT-O robust P-values (PS) for promoter or UTR regions with P <0.05. Exome-wide significance is defined as P<2.5×10−6. Here we define a more stringent level at P<10−6 given we consider approximately 35,000 promoter regions and 34,000 UTR regions. However, we additionally consider regions with P<0.001 to be of interest.
For promoter regions, 3’ UTR regions and 5’ UTR regions, separate Manhattan plots and QQ plots were made for the Z-scores and PB values from the simple burden tests. For regions with PB<0.05 scatter plots summarising PS and PB values were made.
For genes in both the UKB and BRIDGES datasets, we combined Z-scores for the UTR regions in a meta-analysis using an inverse variance weighting approach. The combined Z-score was defined as . Here, ZM is the combined z score, Zj the z-score for study j and sej, is the standard error of βj = log(ORj).
Statistics and reproducibility
No statistical method was used to predetermine the sample size. The experiments were not randomized, and we did not use blinding. Some samples were excluded for reasons as described in the methods above, for example, for sex discrepancies, excess relatives, or discrepancies with previous genotyping.
Data Availability
Individual level data for the BRIDGES data are not publicly available due to ethical review board constraints but are available on request through the BCAC Data Access Co-ordinating Committee (BCAC{at}medschl.cam.ac.uk). Requests for access to UK Biobank data should be made to the UK Biobank Access Management Team (access{at}ukbiobank.ac.uk).
Code Availability
Quality Control filtering of vcf files was performed using vcftools v0.1.15, bcftools v1.9, picard v2.22.2 and plink v1.90b, as outlined in the methods. Variants were annotated using Ensembl Variant Effect Predictor v101 with assembly GRCh38. The code for each software is available at the website of each package. Data manipulation and analysis were performed using R-4.3.3 with packages clusterProfiler (4.2.2), data.table (1.14.2), dplyr (1.0.9), dbplyr (2.5.0), gtools (3.9.5), HGNChelper (0.8.9), SKAT (2.2.5), tibble (3.2.1) and tidyr (1.3.1). Plots were created using additional packages ggplot2 (3.5.1) and ggrepel (0.9.5). The code for each of the R packages can be found in their associated vignettes.
Data Availability
Data from UK Biobank are available through application to UK Biobank. Data from the Breast Cancer Association Consortium (BCAC) used in the present study are available upon reasonable request through the BCAC Data Access Coordinating Committee. All data produced in the present study are available upon reasonable request to the authors
Author contributions
D.F.E. supervised this work and directed the overall analysis. N.W. performed the statistical analysis. N.W., J.P.T., L.D., J.D., M.N, E.J.G. and J.R.B.P., developed the bioinformatics and computational pipelines. M.K.B., S.B., R.K. and Q.W. led data management within the BCAC. J.C-C. and M.K.S. led working groups within the BCAC. A.M.D. and P.D. directed the BRIDGES project. M.A., T.U.A., I.L. A., A.C.A., S.E.B., M.K.B., H.B., N.J.C., J.C.-C., K.C., T.D., D.G.E., P.A.F., J.D.F., H.F., A.G.-N., P.G., E.H., P. H., M.H,, M.J.H., A.J., V.N.K., J.L., A.Lindblom, A.Lophatananon, A.M., S.M., R.L. M., N.O., M.I.P., S.K.P., M.U.R., E.S., E.J.S., M.K.S., M.C.S., A.B.S., D.T., Q.W., J.S. and S.H.T. contributed to the design and conduct of the contributing BCAC studies.
N.W. and D.F.E. drafted the manuscript. All authors reviewed and approved the paper.
Competing interests
JRBP and EJG are employees of Insmed Innovation UK and holds stock/stock options in Insmed Inc. JRBP also receives research funding from GSK and engages in paid consultancy for WW International Inc.
Acknowledgments
The research has been conducted using the UK Biobank Resource under Application Number 28126. N.W. was supported by the International Alliance for Cancer Early Detection, an alliance between Cancer Research UK (C14478/A29329), Canary Center at Stanford University, the University of Cambridge, OHSU Knight Cancer Institute, University College London, and the University of Manchester. J.P.T. was supported by Cancer Research UK (PRCPJT-May21\100006 and G110748). Quality control of the UK Biobank sequencing data has been funded by the Medical Research Council (unit programs: MC_UU_12015/2, MC_UU_00006/2). The BRIDGES project was supported by the European Union Horizon 2020 research and innovation programs BRIDGES (grant number, 634935 to P.D A.G.-N., A,M.D. and D.F.E) and B-CAST (633784 to M.K.S. and D.F.E.), the Wellcome Trust (v203477/Z/16/Z to S.H.T ad D.F.E), and Cancer Research UK (C1287/A16563). Details regarding funding of specific BRIDGES studies are provided in the Supplementary Material.