Abstract
Common genetic variation throughout the genome together with rare coding variants identified to date explain about a half of the inherited genetic component of epithelial ovarian cancer risk. It is likely that rare variation in the non-coding genome will explain some of the unexplained heritability, but identifying such variants is challenging. The primary problem is lack of statistical power to identifying individual risk variants by association as power is a function of sample size, effect size and allele frequency. Power can be increased by using burden tests which test for association of carriers of any variant in a specified genomic region. This has the effect of increasing the putative effect allele frequency.
PAX8 is a transcription factor that plays a critical role in tumour progression, migration and invasion. Furthermore, regulatory elements proximal to target genes of PAX8 are enriched for common ovarian cancer risk variants. We hypothesised that rare variation in PAX8 binding sites are also associated with ovarian cancer risk, but unlikely to be associated with risk of breast, colorectal or endometrial cancer.
We have used publicly-available, whole-genome sequencing data from the UK 100,000 Genomes Project to evaluate the burden of rare variation in PAX8 binding sites across the genome. Data were available for 522 ovarian cancers, 2560 breast cancers, 2465 colorectal cancers and 729 endometrial cancers and 2253 non-cancer controls. Active binding sites were defined using data from multiple PAX8 and H3K27 ChIPseq experiments.
We found no association between the burden of rare variation in PAX8 binding sites (defined in several ways) and risk of ovarian, breast or endometrial cancer. An apparent association with colorectal cancer was likely to be a technical artefact as a similar association was also detected for rare variation in random regions of the genome.
Despite the null result this study provides a proof-of -principle for using burden testing to identify rare, non-coding germline genetic variation associated with disease. Larger sample sizes available from large-scale sequencing projects together with improved understanding of the function of the non-coding genome will increase the potential of similar studies in the future.
Introduction
Genome-wide association studies (GWAS) conducted by the Ovarian Cancer Association Consortium (OCAC) identified 42 common epithelial ovarian cancer (EOC) susceptibility alleles1 and rare coding variants in at least 10 genes are known to be involved in susceptibility to epithelial ovarian cancer2. The high-penetrance genes BRCA13 and BRCA24 were identified by linkage studies in the 1990’s; protein truncating variants in these genes confer a substantial lifetime risk of EOC as well as breast cancer and other cancers5,6. Since then, truncating variants in BRIP1, FANCM, PALB2, RAD51C and RAD51D have been shown to confer more moderate risks by using candidate-gene case-control sequencing7. The uncommon and rare, high- and moderate penetrance alleles identified to date explain about one quarter of the inherited component of EOC susceptibility with another 5 percent explained by the known common risk alleles2. Genome-wide heritability analyses have estimated that the set of common variants that are tagged or captured by the standard genome-wide genotyping arrays explains about 40 percent of the familial aggregation – the so-called chip heritability2. The characteristics of the alleles that account for the remaining familial aggregation are not known; analyses of whole-genome data suggest that a substantial portion is explained by rare variants, particularly those in regions of low linkage disequilibrium 8.
The identification of rare genetic variants associated with disease susceptibility presents a considerable challenge. In principle it would be possible to sequence a very large number of cases and controls and to evaluate the disease associations on a variant-by-variant basis. This, in essence, has been the approach to common risk variant discovery in genome-wide association studies. However, the cost of sequencing tens of thousands of cases and controls remains prohibitively expensive. Even then the statistical power will be limited to detect the association on individual rare variants. Candidate gene sequencing was successful in identifying uncommon, loss-of-function variation associated with disease risk because it was been possible to apply gene-based burden tests in which any putative protein truncating variant in a gene is treated as equivalent and so the combined frequency of the variants is sufficient for reasonable sample sizes to have adequate power. The association of loss-of-function variants in BRIP1, PALB2, RAD51C, and RAD51D were identified using this approach.
Burden testing can also be applied to the non-coding genome. However, it is less clear what the unit for genomic analysis ought to be. One possibility is to use functional annotation to define non-contiguous genomic regions for burden testing. Such units could include all the binding sites for transcription factors of interest, regions of active chromatin in tissues of interest, regions of open chromatin in tissues of interest and regions that are frequently somatically mutated in ovarian cancer. The majority of known germline risk alleles associated with heritability of EOC risk is in non-protein coding regions, mostly located in active regulatory elements9. Putative regulatory elements proximal to target genes of critical transcription factors that are implicated in ovarian cancer or precursor cell biology are enriched for ovarian cancer risk SNPs, highlighting a central role for the transcription factor paired box 8 (PAX8)10. PAX8 is highly expressed in high-grade serous ovarian cancers and plays a critical role in tumour progression, migration and invasion11. PAX8 is also expressed in secretory epithelial cells of the fallopian tube, the precursor tissues of high-grade serous ovarian cancer 12. Thus, rare germline genetic variation in PAX8 bindings sites are candidate susceptibility alleles for EOC.
In addition to the PAX8 binding sites, we are also interested in active regions of the genome, where DNA is highly accessible, such as those identified by the histone H3K27 acetylation mark13. It has been shown that this important histone mark can distinguish between active and poised enhancer elements14, which tend to cluster near the genes they regulate. Genes proximal to H3K27ac marks also show higher expression levels13. Transcription factors such as PAX8 bind only to a subset of enhancers in a given tissue and H3K27ac marks can be used to select some of these enhancer locations. Previous analyses have shown that more than 80 percent of cell type specific regulatory elements lie in putative enhancers, which drive the cell type specific gene expression2.
The aim of this study was to use publicly available whole genome sequencing data from the UK 100,000 Genomes Project to evaluate the association between rare variation in active PAX8 binding sites and epithelial ovarian cancer risk, using a burden testing approach and a case-control study design. We also evaluated the association with breast, colorectal and endometrial cancer, with the goal of investigating the relevance of these target regions in other types of cancer.
Methods
Samples
The UK 100,000 Genomes Project { https://www.genomicsengland.co.uk/about-genomics-england/the-100000-genomes-project/} was established to sequence 100,000 genomes from around 85,000 NHS patients affected by a rare disease, or cancer. Recruitment of participants to the 100,000 Genomes Project was completed in 2018, with the 100,000th sequence achieved in December 2018. Germline whole genome sequencing has been carried out on 581 cases of EOC as part of the project (v15), from which 522 were included in the full analysis after filtering 59 outliers (see Methods -Sample quality control).
The majority of EOC was classified as high-grade serous carcinoma (n = 296); the number of cases of other histotypes are described in Table 1Error! Reference source not found.. We also analysed 2,984 cases of breast cancer, 2,696 cases of colorectal cancer and 836 cases of endometrial cancer, respectively.
Unaffected controls have not been sequenced as part of the Project. However, many of the Project samples are from individuals with rare disorders that are extremely unlikely to share heritability with epithelial ovarian cancer and can be used as controls for the association analysis. We selected 2,750 patients as controls, from Genomics England main programme v6, with diseases such as myopathies and epilepsies, whose aetiology is multifactorial and unlikely to be related to cancer (Supplementary Table 1).
PAX8 binding sites were identified using PAX8 ChIPseq data from eight cell lines: three benign immortalized normal fallopian tube cell lines (FT33, FT194 and FT246)15 and five EOC cell lines (KURAMOCHI, OVSAHO and JHOS4 15; GROV1 and HeyA8 16). PAX8 ChIPseq NarrowPeak formatted files are publicly available from the GEO (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE79893) for the KURAMOCHI, OVSAHO and JHOS4 cancer cell lines and the FT33, FT194, and FT246 benign immortalized fallopian tube cell lines. PAX8 ChIPseq peaks for IGROV1 and HeyA8, as well as the signal-enriched and normalized files for all cell lines, were obtained from the Women’s Cancer Program at the Samuel Oschin Comprehensive Cancer Institute16. The number of PAX8 peaks and their genomic coverage for each cell line is shown in Supplementary Table 2.
Not all transcription factor binding sites will be active in any given tissue. Acetylation of histone H3K27 is a marker of active promoters and candidate enhancers and colocalizes with regulatory sequences and gene activation17. H3K27ac ChIPseq data were available for five different high-grade serous ovarian cancers and two normal fallopian tube epithelium cell lines (FT33 and FT246) (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE68104) 18. The number of H3K27ac ChIPseq peaks for each cell line is shown in Supplementary Table 3.
Whole genome sequencing data
Germline whole genome samples were prepared using an Illumina TruSeq DNA Nano and sequenced at Illumina facilities using HiSeq X with 150 bases reads at a target of 30X in Genomics England facilities. Only samples with coverage of at least 95 percent of the genome at 15X or above with well mapped reads (mapping quality > 10) were included in the analyses. The Illumina North Star pipeline (version 2.6.53.23) was used for primary WGS analysis. Read alignment against human reference genome GRCh38-Decoy+EBV was performed with ISAAC (version iSAAC-03.16.02.19)20, 21
Variant calling and quality control
All samples were processed with GATK Haplotype Caller version 4.0, with the standard thresholds for variant calling22. Variants were filtered out according to several quality criteria: read Depth <= 10; Mapping quality <= 20; allele balance <= 0.2 (proportion of alternate alleles) and alternative reads <=5; all variants in regions of low complexity. Low-complexity regions were identified using RepeatMasker23 and blacklisted regions from ENCODE24, resulting in the removal of 289,837 regions overlapping our target regions. Additional criteria were then used for exclusion of specific indels and single nucleotide variants (Supplementary Table 4).
These thresholds were based on an extensive comparison of concordance of whole exome sequence NGS calls with equivalent ChIP genotypes (unpublished). The rare variants were defined as those frequency lower than 0.01 in non-Finnish European samples or variants not reported in the gnomaAD 3.0 database25.
Sample quality control
Sample quality control was carried out using the variants falling within the H3K_tum regions (defined in results section). The number of variants carried by each individual in these sets of regions varied from about 4,000 to over 20,000 with a long tail distribution (Supplementary Figure 1). We excluded 1,318 individuals that carried more than 6,800 rare alleles in the H3K_tum regions (Supplementary Table 5).
The first plot represents the total number of variants found in all individuals; the blue line is the threshold admitted to proceed with further analysis. The second plot provides further details in the distribution of variants in H3K_tum regions, in all individuals.
Statistical methods
The number of variants in a given target region was compared across controls and the four cancers phenotypes using the non-parametric Kruskal-Wallis test. Then a pairwise comparison of each cancer phenotype with controls was carried out. We did not correct for multiple testing as the individual hypothesis tests are correlated, but given a low prior on the alternative hypothesis it is likely that most associations at a nominal P<0.05 would be false positives and so we set a nominal threshold for statistical significance at P<10−3.
Results
Defining the genomic region of interest
Across all the cell lines there were 19,097 peaks covering 9.5MB genomic DNA hereafter referred to as PAX8_all with PAX8_ft and PAX8_tum referring to PAX8 ChIPseq peaks in the fallopian tube and tumour cell lines respectively (Table 2). The number of H3K27ac ChIPseq peaks in the fallopian tube (H3K_ft) and tumours (H3K_tum) are shown in Table 2 with the numbers for each cell line individually in Supplementary Table 3.
We defined target regions based on the degree of overlap between PAX8 ChIPseq peaks and H3K27ac peaks, as can be observed in Figure 1. Medium stringency regions (MS_all) were those where there was overlap between H3K27ac marks in at least one of the seven cell lines (tumour or normal fallopian tubes) and PAX8 marks in at least one of the eight cell lines (tumour or normal fallopian tubes). Medium stringency fallopian tube regions (MS_ft) were those where there was overlap between H3K27ac marks in at least one of the two normal fallopian tube cell lines and PAX8 marks at least one of the three normal fallopian tube cell lines. Similarly, medium stringency tumour regions (MS_tum) were defined as those where there was overlap between H3K27ac marks in at least one of the five tumour cell lines and PAX8 marks in at least one of the five tumour cell lines. High stringency regions (HS_all) were defined as those where there was overlap between H3K27ac marks in at least three of the seven tumour or normal fallopian tube cell lines and PAX8 marks in at least three of the eight tumour or normal fallopian tube cell lines. High stringency fallopian tube regions (HS_ft) were those with at least three overlaps among any of the fallopian tube ChIPseqs. High stringency tumour regions (HS_tum) were those with overlap between H3K27ac marks in at least three of the five tumour cell lines and PAX8 marks in at least three of the five tumour cell lines. Following our prior experience, high stringency regions selected in this manner provide more robust evidence of signal9.
Random control regions
Differential processing of the case and control samples from DNA extraction up to the sequencing and quality filters may result in artefactual differences in variant frequencies between cases and controls. In order to counteract the possible inflation of variants in the target regions, several types of control regions were created from random genomic sequences with similar characteristics to match the functional regions of interest. Variant frequencies in these random regions are unlikely to differ between cases and controls in the absence of functional elements.
Random sequences were selected to match 1,687 HS_all regions where there was overlap between 3 or more H3K27ac ChIPseq marks and 3 or more PAX8 ChIPseq marks. The criteria for a random sequence were that it must: i) be on the same chromosome ii) be the same size iii) similar base content with the ratio of the proportion of each base in the random region compared to the functional region being between 0.7 and 1.3 iv) not collocate with repeatMasker23 or ENCODE blacklisted regions24. Three types of random sequences were then chosen to match the HS_all regions: For the Random2k regions, a genomic window, the size of each of the 1,687 target regions was evaluated, starting at 2,000 bases upstream of the original site and moving 10bp upstream until the matching criteria was achieved. The Random50k regions were similarly selected starting at 50,000 bases upstream of the original site. Finally, we used bedtools shuffle 26 and seven different seeds to generate 8,620 regions from which 1,677 were selected to meet the criteria above (RandomShuffle regions).
A fourth type of random region was also selected to match the H3K_tum regions, using the same criteria as above and the same seven seeds for the random algorithm in bedtools shuffle, but the selection was a bit more stringent to yield a similar number of regions. The proportion was set between 0.75 and 1.25. This way, Random H3K_tum yielded 113,888 regions from the total of 129,293 H3K_tum target regions.
Association of rare variants in PAX8 binding sites with cancer risk
The number of variants per sample by variant type (all variants, indels and substitutions) and phenotype in each of the eight defined functional regions are shown in Figures 2-4. There was nominally significant heterogeneity in the variant frequency across the controls and four case sets for the H3K_tum, H3K_ft, MS_all and MS_tum regions (Table 3). Pairwise comparisons for each set of cases and controls showed that this heterogeneity was being driven by a difference between colorectal cancer cases and controls.
However, the observed differences were small and similar heterogeneity was seen for the Random H3K_tum regions (Supplementary Figure 2). This suggests that the observed differences are due to sequencing batch effects or similar technical artefact rather than any biological difference between cases and controls. When restricting the comparison to single nucleotide variants (Table 4) a similar pattern was seen. There were no nominally significant differences in the frequencies of indels in the target functional regions (Table 5).
Discussion
Rare, germline genetic variation in the non-coding genome is likely to account for a substantial portion of the unexplained heritability of many complex traits including ovarian cancer. Chip genotyping data combined with imputation can be used to evaluate rare variant associations, however, many rare variants are poorly imputed and whole genome sequencing is the only method to reliably capture all the rare variation across the genome. The evaluation of association for rare, non-coding variation on a variant-by-variant basis is unlikely to be an effective approach because of the very large number of such variants and the limited statistical power to detect association at very stringent levels of statistical significance required when the prior probability of any one association is very small. As with analysis of rare variation in the coding sequence, burden testing is an alternative approach to increase statistical power. However, unlike with gene-based burden testing, the best way to define groups of variants in the non-coding genome is not clear. One approach might be to simply combine rare variants within a specific genomic window. However, it seems very unlikely that multiple rare variants in any given contiguous genomic window would be functionally equivalent and associated with similar risks. We used an alternative approach, evaluating the burden variants in non-contiguous genomic regions that are thought to share some common function.
Many of the known common risk variants in the non-coding regions are thought to act by affecting regulatory elements. For example, breast cancer risk variants are enriched for binding sites of ESR1, FOXA1, GATA3, E2F1 and TCF7L227 and six networks centred on homeobox transcription factor genes at 17q21.32 and at 2q31 are enriched for EOC susceptibility28. Thus, binding sites for transcription factors thought to be important in ovarian cancer biology would be good candidates for non-contiguous regions of the genome that harbour rare susceptibility variants for EOC.
We used publicly-available, germline, whole genome sequencing data of cancer and rare diseases patients to evaluate a hypothesised association between rare variation in putative active PAX8 transcription factor binding sites and risk of epithelial ovarian cancer. Our novel approach of using burden testing across non-contiguous regions of the genome has the potential to identify uncommon and rare germline genetic variants that underpin complex disease susceptibility. We found little evidence to support the hypothesis, but several limitations should be considered before considering these results definitive. First, statistical power is likely to be a limiting factor as there were data for only 522 EOC cases. A second limitation is our ability to identify PAX8 binding sites that are truly functional in the relevant tissue. We have used a variety of epigenomic data in an attempt to identify genomic regions that are most likely to be functionally relevant and we have used several different definitions to define putative functional regions. Nevertheless, it is likely that for any given definition of functionally relevant regions, a proportion of the binding sites of interest are not functional. It is also not fully understood how much a given variant would affect PAX8 binding if at all. Finally, redundancy of binding sites in the pathways activated by PAX89,10 may limit the functional effect of a variant that does disrupt PAX8 binding in a relevant tissue. This misclassification could reduce substantially the power to detect a true association.
Despite the null findings and the limitations of this study, the approach based on burden testing across non-contiguous regions of the genome remains promising. Much large-scale case-control data will become available as whole genome sequencing becomes cheaper and this will increase substantially the statistical power of this type of study. Furthermore, improved functional annotation of the non-coding genome will lead to a better definition of the likely regions of the genome that are biologically important together with a better understanding of which variants are most likely to disturb putative functionality.
Data Availability
Germline genetic data used in this project are available through application to the Genomics England 100,000 Genomes Project. CHiPSeq data are publicly available as described in the methods.