ABSTRACT
Objective While population screening programs for colorectal cancer (CRC) have proven benefit, risk-stratified approaches may improve screening outcomes further. To date, genome-wide polygenic risk scores (PRS) for CRC have not been integrated with non-genetic risk factors. We aimed to evaluate several genome-wide approaches, and the benefit of adding PRS to the QCancer-10 (colorectal cancer) non-genetic risk model, to identify those at highest risk of CRC.
Design Using UK Biobank we developed and compared six different PRS for CRC. The top-performing genome-wide and GWAS-significant PRS were then combined with QCancer-10 and performance compared to QCancer-10 alone.
Results PRS derived using LDpred2 software performed best, with an odds-ratio per standard deviation of 1.58, and top age- and sex-adjusted C-statistic of 0.733 in logistic regression and 0.725 in Cox regression models in the Geographic Validation Cohort. Integrated QCancer-10+PRS models out-performed QCancer-10, with C-statistics of 0.730 and 0.693, and explained variation of 28.1% and 21.0% from QCancer-10+LDpred2 and QCancer-10 respectively in men; performance improvements in women were similar. Men in the top 20% of risk accounted for 47.6% of cases, and women 42.4% using QCancer-10+LDpred2 models, with a 3.49-fold increase in risk in men and 2.75-fold increase in women in the top 5% of risk, compared to average risk. Decision curve analysis showed that adding PRS to QCancer-10 improved net-benefit and interventions avoided across most probability thresholds.
Conclusion Integrated QCancer-10+PRS models out-perform existing CRC risk prediction models. Evaluation of risk stratified screening using this approach in a bowel screening population could be warranted.
What is already known about this subject
Risk stratification based on genetic or environmental risk factors may improve cancer screening outcomes
Many polygenic risk scores (PRS) based on a limited number of genome-wide significant SNPs have been assessed in colorectal cancer (CRC), but just two studies have examined the use of genome-wide PRS methodologies
No previously published study has examined integrated models combining genome-wide PRS and non-genetic risk factors beyond age
QCancer-10 (colorectal cancer) is the top-performing non-genetic risk prediction model for CRC
What are the new findings?
PRS derived using LDpred2 software outperform existing models, and other genome-wide and genome-wide significant models evaluated here
Adding either LDpred2 PRS or genome-wide significant PRS improves the performance and clinical benefit of the QCancer-10 model, with greater gain from the LDpred2 model
How might it impact on clinical practice in the foreseeable future?
The performance and clinical benefit of QCancer-10 is improved by adding PRS, to a level that suggests utility in stratifying CRC screening and prevention
INTRODUCTION
Colorectal cancer (CRC) is the fourth most common cancer in the UK, with increasing incidence in younger ages and countries with historically lower rates.1,2 Population screening is effective in reducing CRC incidence and mortality, through detection and removal of pre-malignant adenomas, and earlier detection of cancers.3 Screening modalities vary internationally. While colonoscopy is the gold-standard, it is expensive, invasive and time consuming. Many countries have adopted a staged process, with initial faecal blood testing, followed by colonoscopy for those who test positive.
Risk-stratified approaches to screening, which direct resources to those at highest risk, have the potential to improve screening detection rates, reduce the investigative burden on those at lower risk, and potentially improve cost-effectiveness.4,5 Improved understanding of cancer risk could also improve informed consent and shared decision making around screening participation.
Both genetic and non-genetic factors contribute to an individual’s risk of CRC, some of the latter being modifiable. Genetic variants known to predispose to CRC are mostly single nucleotide polymorphisms (SNPs) identified as significant in genome-wide association studies (GWAS). Genetic risk can be summarised in a polygenic risk score (PRS). Multiple genetic and non-genetic risk models have been developed to predict CRC risk in the general population, and many have been validated in the UK Biobank (UKB).6,7
Most existing PRS have combined GWAS-significant SNP genotypes weighted by their effect sizes. While including a greater number of SNPs generally produces better performance, discrimination has generally remained poor.7 More recently, genome-wide PRS have incorporated many more SNPs than those reaching GWAS-significance. Several genome-wide PRS software tools are now available, with differences in performance across disease types,8 but evaluation in CRC has been limited.9,10 In a recent study, a genome-wide model derived using PRS software LDpred, incorporating 1.2 million SNPs, out-performed both machine learning approaches and a 140 GWAS-significant SNP PRS, with an age- and sex-adjusted area under the receiver operating characteristic curve (AUC) of 0.654.9
The top-performing non-genetic risk model in external validation is QCancer-10 (colorectal cancer), which has an AUC of 0.70 in men and 0.66 in women aged 40-70 in UKB.6,11,12 QCancer-10 is a 15-year CRC prediction model, developed using the QResearch linked primary care database of almost 5 million individuals aged 25-84, registered at QResearch practices across England between 1998 and 2013.11 It is based on age, ethnicity, family history, alcohol and smoking status, a small number of medical conditions, and for men, Townsend deprivation score and body mass index (BMI). As the predictors are derived from electronic health records, it could be embedded at point of care, and linked with screening records to facilitate risk stratification within the bowel screening programme. It thus forms a strong basis for development of an integrated risk prediction model.
Integrated models for CRC, which combine PRS with non-genetic risk factors, generally perform better than PRS alone.7,13 The top-performing integrated model in external validation in UKB had an AUC of 0.71 in men and 0.67 in women, though this was largely attributable to the non-genetic component, as the PRS was based on 11 SNPs.7
We hypothesised that genome-wide PRS methods, and optimised SNP content based on recent European GWAS, would improve PRS performance, and that integrating this with QCancer-10 would provide better risk prediction than that afforded by existing models. We therefore developed PRS based on LDpred2, compared this to other genome-wide and GWAS-significant approaches, and validated findings in Geographic and Minority Ethnic Validation Cohorts. We validated QCancer-10, and then derived integrated QCancer-10+PRS risk models, which we internally validated and compared with QCancer-10 alone.
METHODS
Overview
We conducted a development and validation study of PRS utilising multiple methods and integrated PRS-epidemiological models, to predict risk of CRC in a set of UK individuals of bowel cancer screening age. We followed the PRS-RS and TRIPOD reporting guidelines for PRS and prediction modelling.14,15
We used UKB to derive and validate our risk models.16 In brief, just over 500,000 participants (5.5% of invitees) were recruited to UKB from across the UK (between 2006 and 2010). Baseline demographics, medical, lifestyle and physical data, and blood, urine and saliva samples were collected at recruitment. Follow-up through linked hospital and registry data is ongoing. A detailed description of genetic resources including quality control measures can be found in Bycroft et al.16 and Supplementary Methods.
We calculated age-specific and directly standardised CRC incidence rates in UKB and compared these with Office for National Statistics (ONS) data (Supplementary Methods).17 See Figure 1 for sample exclusions for the modelling cohorts.
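For readers unfamiliar with direct standardisation, the calculation can be sketched as follows. This is an illustrative example only and not the analysis code used in this study: the age bands, case counts, person-years and standard-population weights are invented, and the function names are hypothetical.

```python
def age_specific_rates(cases, person_years, per=100_000):
    """Incidence rate per `per` person-years within each age band."""
    return {band: per * cases[band] / person_years[band] for band in cases}


def directly_standardised_rate(rates, standard_weights):
    """Weighted mean of age-specific rates, weighted by a standard population."""
    total_weight = sum(standard_weights.values())
    return sum(rates[band] * standard_weights[band] for band in rates) / total_weight


# Hypothetical inputs for illustration only (not UKB or ONS data).
cases = {"40-49": 20, "50-59": 90, "60-69": 240}
person_years = {"40-49": 100_000, "50-59": 150_000, "60-69": 200_000}
standard_weights = {"40-49": 0.40, "50-59": 0.35, "60-69": 0.25}

rates = age_specific_rates(cases, person_years)
dsr = directly_standardised_rate(rates, standard_weights)
print(rates, round(dsr, 1))
```

Applying the same standard weights to both cohorts is what makes the resulting rates comparable when their age structures differ.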
Polygenic risk scores
We meta-analysed 14 CRC GWAS cohorts (which did not include UKB) to provide SNP association effect sizes (see Supplementary Methods and Ref.18). The main outcome in all models was CRC diagnosis, identified through self-report at UKB enrolment visit and ICD-9 (153, 154.0, 154.1) and ICD-10 (C18-C20) codes in linked cancer and death registry and hospital data. For PRS development and evaluation in UKB, in logistic regression models we included incident and prevalent cases, with the remaining cohort used as controls. For Cox proportional hazards (Cox) models, prevalent cases with a diagnosis of CRC prior to cohort entry were excluded. Follow-up began at date of enrolment, and was censored at the earliest of date of incident CRC, loss to follow-up, death, or end of available registry follow-up (31st October 2015 for Scottish participants, 13th March 2016 for all other participants).
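The censoring rule for the Cox models can be illustrated with a minimal sketch. The function name and arguments are hypothetical; only the two registry end dates are taken from the text. Prevalent cases are assumed to have been excluded before this step.

```python
from datetime import date

# End of available registry follow-up, as stated in the text.
REGISTRY_END = {"Scotland": date(2015, 10, 31), "other": date(2016, 3, 13)}


def follow_up(enrolment, region, crc_date=None, death_date=None, lost_date=None):
    """Return (years of follow-up, event indicator) for one participant.

    Follow-up ends at the earliest of incident CRC, death, loss to follow-up
    or the end of registry coverage for the participant's region.
    """
    registry_end = REGISTRY_END["Scotland" if region == "Scotland" else "other"]
    candidates = [d for d in (crc_date, death_date, lost_date, registry_end) if d]
    exit_date = min(candidates)
    # A CRC diagnosis only counts as an event if it is what ended follow-up.
    event = crc_date is not None and crc_date == exit_date
    years = (exit_date - enrolment).days / 365.25
    return years, event


# Diagnosis after the Scottish registry end date is treated as censored.
years_a, event_a = follow_up(date(2008, 1, 1), "Scotland", crc_date=date(2016, 1, 1))
# Diagnosis within follow-up is counted as an event.
years_b, event_b = follow_up(date(2008, 1, 1), "England", crc_date=date(2012, 1, 1))
```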
Three broad approaches to PRS development were evaluated (Supplementary Methods). Firstly, we used a ‘standard’ PRS (hereafter ‘GWAS-sig’), which comprised a manually curated list of 50 sentinel SNPs shown in recent European GWAS-meta-analyses18,19 to be independently and reproducibly associated with CRC risk at P<5×10⁻⁸. This PRS was constructed as a log-additive sum of SNP dosages weighted by their betas. Betas were adjusted for winner’s curse using FIQT correction.20
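A log-additive score of this kind reduces to a weighted sum of dosages. The sketch below is intended only to show the arithmetic: the betas and dosages are hypothetical, the function names are our own, and the FIQT adjustment of the betas is omitted.

```python
from statistics import mean, pstdev


def polygenic_risk_score(dosages, betas):
    """Sum over SNPs of beta (log odds ratio) times expected dosage (0 to 2)."""
    if len(dosages) != len(betas):
        raise ValueError("one beta is needed per SNP dosage")
    return sum(b * d for b, d in zip(betas, dosages))


def standardise(scores):
    """Centre and scale scores so effects can be reported per standard deviation."""
    mu, sd = mean(scores), pstdev(scores)
    return [(s - mu) / sd for s in scores]


# Hypothetical betas and imputed dosages: three SNPs, three individuals.
betas = [0.10, -0.05, 0.20]
cohort = [[0.0, 2.0, 1.0], [1.0, 1.0, 1.0], [2.0, 0.0, 1.9]]

raw = [polygenic_risk_score(d, betas) for d in cohort]
z = standardise(raw)
```

Using expected dosages (for example 1.9 rather than 2) carries imputation uncertainty into the score, which hard-called allele counts would discard.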
Secondly, genome-wide clumping and thresholding methodologies were evaluated using ‘standard’ (C+T) and ‘stacked’ (SCT) approaches.21
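As a rough illustration of standard C+T (not the implementation used in this study), SNPs passing a P value threshold are ranked by significance and retained only if they are not in linkage disequilibrium with a SNP already retained. The SNP identifiers, r2 values and thresholds below are hypothetical, and the physical window normally used to limit clumping is omitted for brevity. The stacked approach is not shown.

```python
def clump_and_threshold(pvalues, r2, p_threshold=5e-8, r2_threshold=0.1):
    """Greedy clumping and thresholding.

    pvalues: dict mapping SNP id to GWAS P value
    r2:      dict mapping frozenset({snp_a, snp_b}) to squared correlation;
             pairs that are absent are treated as uncorrelated
    """
    # Thresholding: keep SNPs at or below the P value cut-off, most significant first.
    candidates = sorted(
        (snp for snp in pvalues if pvalues[snp] <= p_threshold),
        key=lambda snp: pvalues[snp],
    )
    kept = []
    for snp in candidates:
        # Clumping: skip a SNP that is in LD with any more significant retained SNP.
        in_ld = any(r2.get(frozenset((snp, k)), 0.0) > r2_threshold for k in kept)
        if not in_ld:
            kept.append(snp)
    return kept


# Hypothetical summary statistics and LD values.
pvalues = {"rsA": 1e-12, "rsB": 1e-9, "rsC": 1e-3, "rsD": 1e-10}
r2 = {frozenset(("rsA", "rsB")): 0.8, frozenset(("rsA", "rsD")): 0.02}

strict = clump_and_threshold(pvalues, r2)                    # genome-wide significant only
relaxed = clump_and_threshold(pvalues, r2, p_threshold=1.0)  # all SNPs considered
```

In practice the P value and r2 thresholds are tuning parameters, which is why they are exposed as arguments here.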
Thirdly, we used LDpred2,8 which takes a Bayesian approach to SNP selection, accounting for linkage disequilibrium between the SNPs. We used three different LDpred2 options – an infinitesimal model (LDpred2-Inf), a non-sparse grid model (LDpred2-grid) and a sparse grid model (LDpred2-grid-sp).
Optimal PRS tuning parameters for genome-wide approaches were selected in the Training Cohort. For each optimal PRS, we then constructed both logistic regression and Cox risk models in the Test Cohort with PRS, age, sex, genotyping array and the first four principal components (PCs) from UKB as predictors. We tested for interactions between age and PRS. Training and Test Cohorts included participants of white-British ancestry (identified through self-reported ethnicity and genetic information)16 from England and Wales (Figure 1). We compared performance to a ‘Null’ (reference) model based only on age, sex, genotyping array and four PCs. We also evaluated performance without age and sex in the model. Each model was internally validated, and shrinkage applied to adjust for optimism.
We reported the distribution of standardised PRS and adjusted odds ratios and hazard ratios per standard deviation (Supplementary Methods). We used the C-statistic and Somers’ Dxy statistic to assess discrimination, in addition to Royston’s D statistic and Kaplan-Meier curves across four risk groups for Cox models.22 Nagelkerke’s R2 was used in logistic regression models and Royston and Sauerbrei’s R2D in Cox models to assess variance explained, and R2 attributable to the PRS was calculated as R2 (full model) - R2 (null model). Scaled Brier scores were used to assess overall model performance.23 Confidence intervals and internal validation for all models used 500 bootstrap samples.
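For a binary outcome, three of these metrics can be sketched in a few lines. The example below is a simplified illustration with invented predictions and hypothetical function names; it does not handle censoring, so it corresponds to the logistic regression setting rather than to the Cox models.

```python
def c_statistic(risks, outcomes):
    """Share of case/control pairs in which the case has the higher predicted
    risk, with tied predictions counted as half."""
    cases = [r for r, y in zip(risks, outcomes) if y == 1]
    controls = [r for r, y in zip(risks, outcomes) if y == 0]
    concordant = sum(
        1.0 if c > k else 0.5 if c == k else 0.0 for c in cases for k in controls
    )
    return concordant / (len(cases) * len(controls))


def somers_dxy(c):
    """Somers' Dxy is a rescaling of the C-statistic to the range -1 to 1."""
    return 2.0 * (c - 0.5)


def scaled_brier(risks, outcomes):
    """1 - Brier / Brier_null, where the null model predicts the prevalence."""
    n = len(outcomes)
    brier = sum((r - y) ** 2 for r, y in zip(risks, outcomes)) / n
    prevalence = sum(outcomes) / n
    brier_null = prevalence * (1.0 - prevalence)
    return 1.0 - brier / brier_null


# Invented predictions for four individuals, two of whom are cases.
risks = [0.1, 0.4, 0.3, 0.8]
outcomes = [0, 0, 1, 1]

c = c_statistic(risks, outcomes)
dxy = somers_dxy(c)
brier_scaled = scaled_brier(risks, outcomes)
```

In this toy example three of the four case/control pairs are correctly ordered, so the C-statistic is 0.75 and Dxy is 0.5.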
Polygenic risk score model validation
PRS models were externally validated in a Geographic Validation Cohort, comprising Scottish participants with European ancestry, and a Minority Ethnic Validation Cohort (from any region).
In addition to the performance metrics described above, calibration was assessed through the calibration slope and visual assessment of calibration plots, with calibration-in-the-large for logistic regression models. For Cox models, calibration plots were created over 5-8 years of follow-up. In pre-specified subgroup analyses, we calculated performance statistics by sex and in those with a first degree family history of CRC. We evaluated calibration by age by plotting predicted and observed risk across 5-year age bands.
Development of QCancer-10+PRS combined models
The Integrated Modelling Cohort used for QCancer-10 validation and integrated model development comprised all individuals with imputed genetic data passing QC, excluding the 30000 individuals used for PRS hyper-parameter selection (Supplementary Methods), and with complete QCancer-10 predictor data (Figure 1).
Coding of QCancer-10 variables is described in Supplementary Methods. Since missingness was <5% for all predictors (Table S3), complete case analysis was used. The QCancer-10 score was calculated for males and females,24 and performance evaluated using the published baseline survival functions. We then recalibrated the models by re-estimating the baseline survival function (recalibration-in-the-large).
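The role of the baseline survival function can be illustrated with a short sketch. In a Cox-type model the absolute risk at a fixed horizon is 1 - S0^exp(LP), where S0 is the baseline survival at that horizon and LP is the (centred) linear predictor. The recalibration shown here is a deliberate simplification and an assumption on our part: it solves for the S0 at which the mean predicted risk matches a supplied observed risk, rather than re-estimating the baseline survival function from the data as was done in this study. All values and function names are hypothetical.

```python
import math


def absolute_risk(baseline_survival, lp):
    """Predicted risk at a fixed horizon from baseline survival and linear predictor."""
    return 1.0 - baseline_survival ** math.exp(lp)


def recalibrate_baseline(lps, observed_risk, tol=1e-10):
    """Bisection for S0 such that the mean predicted risk equals observed_risk.

    Simplified stand-in for recalibration-in-the-large; the linear predictors
    themselves are left unchanged.
    """
    lo, hi = 1e-12, 1.0 - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        mean_risk = sum(absolute_risk(mid, lp) for lp in lps) / len(lps)
        if mean_risk > observed_risk:
            lo = mid  # predicted risk too high, so baseline survival must increase
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical centred linear predictors and an observed risk of 1%.
lps = [-0.5, 0.0, 0.5]
s0 = recalibrate_baseline(lps, observed_risk=0.01)
mean_predicted = sum(absolute_risk(s0, lp) for lp in lps) / len(lps)
```

Because only S0 changes, the ranking of individuals, and therefore discrimination, is unaffected by this kind of recalibration.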
Sample size adequacy for integrated model development was calculated following Riley et al.25 (Supplementary Methods). Integrated models included the risk score from QCancer-10 plus the top-performing PRS (based on the maximum C-statistic and R2 in external validation) using Cox models, developed in men and women separately.
We used the same metrics and time periods to assess the original QCancer-10 model, and QCancer-10+PRS model performance as described for Cox PRS models. In pre-specified subgroup analyses we assessed expected to observed (E/O) ratios of risks and plotted calibration plots in those with a first degree family history of CRC and individuals from minority ethnic backgrounds, and plotted calibration by age.
Model sensitivities were evaluated by calculating the proportion of cases identified at centile thresholds for absolute risk and relative risk. Relative risks were calculated relative to an individual of the same age and sex, mean PRS (by sex), mean PCs, BMI of 25, white ethnicity, mean Townsend score, and no other CRC risk factors. We used decision curve analyses to compare the net benefit and interventions avoided using QCancer-10 and QCancer-10+PRS models.26 For decision curve and subgroup analyses, QCancer-10+PRS models were first adjusted for optimism, and recalibrated QCancer-10 models were used.
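The quantities used in decision curve analysis, and the proportion of cases captured in a top-risk group, can be sketched as follows. Net benefit at a threshold probability pt is TP/n - FP/n × pt/(1 - pt), and interventions avoided are derived from the difference in net benefit relative to intervening in everyone. The data and function names are invented for illustration and do not reproduce the study's analysis.

```python
def net_benefit(risks, outcomes, threshold):
    """Net benefit of intervening in everyone with predicted risk >= threshold."""
    n = len(outcomes)
    tp = sum(1 for r, y in zip(risks, outcomes) if r >= threshold and y == 1)
    fp = sum(1 for r, y in zip(risks, outcomes) if r >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1.0 - threshold)


def net_benefit_treat_all(outcomes, threshold):
    """Net benefit of intervening in everyone regardless of predicted risk."""
    prevalence = sum(outcomes) / len(outcomes)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


def interventions_avoided_per_100(risks, outcomes, threshold):
    """Net reduction in interventions per 100 people, relative to treat-all."""
    odds = threshold / (1.0 - threshold)
    gain = net_benefit(risks, outcomes, threshold) - net_benefit_treat_all(outcomes, threshold)
    return 100.0 * gain / odds


def cases_in_top_fraction(risks, outcomes, fraction):
    """Proportion of all cases found among the top `fraction` of predicted risk."""
    ranked = sorted(zip(risks, outcomes), key=lambda pair: pair[0], reverse=True)
    k = max(1, round(fraction * len(ranked)))
    return sum(y for _, y in ranked[:k]) / sum(outcomes)


# Invented predictions for ten individuals, two of whom are cases.
risks = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10]
outcomes = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

nb = net_benefit(risks, outcomes, threshold=0.05)
avoided = interventions_avoided_per_100(risks, outcomes, threshold=0.05)
top20 = cases_in_top_fraction(risks, outcomes, fraction=0.2)
```

In this toy example, four of the ten individuals fall below the 5% threshold and none of them is a case, so approximately 40 interventions per 100 people are avoided without missing a case.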
Statistical analysis was performed using R/3.6.2.27
Ethics
The UK Biobank study has ethical approval from the North West Multi-centre Research Ethics Committee (16/NW/0274). This study was performed under UK Biobank application number 8508. All contributing GWAS studies were undertaken with ethical review board approval at respective study centres as detailed in Law et al.1
Patients and public were not involved in the design, conduct or reporting of this study.
RESULTS
Demographics for the UKB-derived Integrated Modelling Cohort are shown in Table 1. The characteristics of each PRS cohort are shown in Table S4. Age-standardised CRC incidence from linked cancer-registry data in the whole UKB cohort was 108.3 and 73.9 cases per 100000 person years at risk for men and women respectively, compared to 127.8 and 80.7 cases per 100000 person years at risk in ONS data.17 Incidence in the Integrated Modelling Cohort, with cases identified through all linked data, was 118.0 CRCs per 100,000 years follow-up in men and 79.3 in women. Age-specific incidence rates in UKB (Figure S3) closely followed those from ONS until the age of 70, after which UKB rates were lower.
Polygenic risk score models
Each of the six PRS models assessed (Figure S4) improved performance over the Null model of age, sex, genotyping array and four PCs (Table 2). A weak interaction between age and PRS was noted (Table S5, Figure S5), but was not included in the models. LDpred2-grid and LDpred2-grid-sp performed best in logistic regression models across all metrics, with similar performance (Table 2). LDpred2-grid had the highest odds ratio per SD of PRS (1.584, 95% CI 1.536-1.633) and C-statistic (0.717, 0.711-0.725) (Table 2). Performance without adjustment for age and sex was considerably worse (Table S6). Internal validation showed low bias in all measures (Table 2).
In the Geographic Validation Cohort, discrimination and variation explained improved compared to the Test Cohort for all models. LDpred2-grid-sp performed best (C-statistic 0.733, 95% CI 0.710-0.753). All models under-predicted risk (CITL >0, Table 2) particularly in the highest PRS groups (Figure S6), and genome-wide models were slightly under-fitted (calibration slope >1, i.e. insufficient variation at the extremes of prediction, Table 2, Figure S6).
In subgroup analyses of logistic regression models (Table S7, Figure S7), discrimination and explained variation were better in males; models were better fitted in females but under-predicted risk to a greater extent, particularly in higher risk groups. Discrimination and variation explained were poorer in individuals with a first-degree family history of CRC, with models systematically underpredicting risk across PRS risk groups. All models tended to under-predict risk across age groups (Figure S8).
Performance was poor in the Minority Ethnic Validation Cohort, with little difference between models. Only LDpred2-grid and LDpred2-grid-sp showed improved performance over the Null model (Table 2). Models systematically under-predicted risk and were highly over-fitted (i.e. predictions were too extreme, Table 2), with modest improvement following recalibration (Figure S6).
In general, PRS performance in Cox models supported the logistic regression analysis (Tables 3 and S8, Figures S9-14). Best performance in external validation occurred with the LDpred2-grid-sp model (C-index 0.725, 95% CI 0.696-0.752).
QCancer-10 non-genetic model
QCancer-10 models risk in males and females separately. Comparative demographics of the original QCancer-10 derivation cohort and the Integrated Modelling Cohort are shown in Table S9. Performance of QCancer-10 in UKB was in line with previously published studies (Table 4).6 As expected, the model for females performed less well than the model for males.6 Both models tended to over-predict risk, which was corrected through recalibration, though in women the model continued to over-predict in the top risk decile (Figure S15). In subgroup analysis, models were well calibrated across age groups; they underpredicted risk in individuals from minority ethnic backgrounds; and the model for females tended to over-predict risk in those with a first-degree family history of CRC, particularly in higher risk groups (Table S10, Figures S16-S17).
QCancer-10+PRS models
We selected LDpred2-grid-sp as the top-performing genome-wide PRS for integrated modelling with QCancer-10, favouring sparsity over the non-sparse model (see Supplementary Results for full model specifications and baseline hazards).28 Cox models combining the QCancer-10 risk score with LDpred2-grid-sp (QCancer-10+LDP), and the GWAS-sig PRS (QCancer-10+GWS) both out-performed QCancer-10 (Table 4, Figure 2). Internal validation of the QCancer-10+PRS models showed very little optimism in performance estimates.
Models predicting risk in men had better discrimination, and explained more of the variation in risk than models for women (Table 4). Calibration by age was good (Figure S16), with slight under-prediction of risk in the top age group in women. As with QCancer-10, in those with a first degree family history of CRC, female QCancer-10+PRS models tended to over-predict risk, particularly in higher risk groups; male QCancer-10+PRS models were well calibrated (Table S10, Figure S17). In minority ethnicities, QCancer-10+PRS models underpredicted risk (Table S10) to a greater extent than QCancer-10, subject to the caveat of low CRC case numbers (46 in men, 58 in women) in this subgroup.
QCancer-10+LDP consistently provided the best risk prediction. Individuals predicted to be in the top 20% of absolute risk by QCancer-10+LDP accounted for 47.6% of male cases and 42.4% of female cases (Table 5). Men in the top 5% of risk had a 3.49-fold increased absolute 5-year risk compared to the median, with a comparable 2.75-fold increase in women (see Table S12 for other models). QCancer-10, QCancer-10+GWS, and LDpred2-grid-sp had lower sensitivity in men than QCancer-10+LDP (Tables S13-S17). In women, QCancer-10+LDP and LDpred2-grid-sp models performed equally well, with higher sensitivity than QCancer-10+GWS and QCancer-10 (see Discussion, Tables S13-S17). Decision curve analyses confirmed that, across a wide range of probability thresholds, QCancer-10+LDP gave greater net benefit than QCancer-10+GWS and QCancer-10 for both men and women (Figure 3), and predicted a greater number of interventions avoided across clinically relevant thresholds.
By way of illustration, enhanced screening is frequently offered for those with a single first degree relative with CRC (FDRCRC), corresponding to a ∼2.2-fold increased risk.29 QCancer-10+LDP identified 18.2% of men (34.0% of cases) and 7.2% of women (16.5% of cases) as having a relative risk >2.2, of whom 76% and 70% respectively had no FDRCRC (Table S18).
DISCUSSION
We have undertaken the first study to develop and validate new prediction models for colorectal cancer that combine phenotypic risk with genome-wide PRS. Our findings demonstrate that LDpred2 significantly improves prediction of CRC above existing PRS models,9 with top age- and sex-adjusted C-statistics of 0.733 in logistic regression models and 0.725 in Cox models in the Geographic Validation Cohort. We also show that combining the non-genetic QCancer-10 model with PRS improves model performance and clinical benefit, with greatest improvements seen in the QCancer-10+LDP model. To our knowledge the QCancer-10+LDP models have higher discrimination in UKB than any previously published CRC risk score.7,12,13
Our models could be used to improve or instigate risk-stratified CRC screening. The QCancer-10 model11 has recently been recommended to guide shared decision making around CRC screening.30 However, our study predicts that QCancer-10+PRS models have a greater net-benefit and avoid more interventions than QCancer-10 across a wide range of clinically-relevant risk thresholds, with the greatest benefit from QCancer-10+LDP. The sensitivities achieved using QCancer-10+PRS exceed those of other integrated models recently validated in UK Biobank.7 The genome-wide SNP genotyping required for LDpred2 is reliably performed from saliva samples, and is rapid, inexpensive and straightforward to analyse. The sensitivities and decision curves provided by QCancer-10+LDP could therefore be used to inform clinical decision making.
Of the PRS methods evaluated, LDpred2-grid and LDpred2-grid-sp models had highest discrimination, explained more of the variation in risk, and were well calibrated. The improvement in performance between the derivation and validation cohorts when using the PRS models probably results from lower genetic homogeneity in the latter. The Geographic Validation Cohort was well matched in age to the derivation cohort, but had a higher proportion of women; prevalence of CRC was higher, at 1.79% compared to 1.51% in the derivation cohort. We would expect performance in Northern European individuals in the general population to be similar to that of the Validation Cohort. Validation of the PRS models in a geographically external cohort demonstrates portability of the models.
Strengths of our study include our large GWAS meta-analysis (∼68000 individuals) and non-overlap between this and modelling cohorts, thus reducing overfitting of the PRS and performance optimism.31 We used expected genotype dosages rather than allele counts in each PRS, incorporating uncertainty in genotype imputation, and applied correction for ascertainment bias to effect sizes in the GWAS-significant model.20 Our GWAS-significant PRS used stringent inclusion criteria, including only SNPs which replicated in our base meta-analysis. The evaluation of multiple PRS methodologies, and examination of performance in both cross-sectional and prospective cohorts, with and without adjustment for age and sex,28 facilitates comparison with previous models.
UKB provides a large sample size, extensive phenotyping, completeness of data recording, and linkage to external datasets. Linkage to cancer registry data in UKB ended in 2015/16 at the time of our study, so we have only been able to follow up for a median of 7 years; updating this will improve risk estimates and permit estimation of risks over a longer follow-up. The UKB age range of ∼40-70 is similar to that of bowel cancer screening (50-74 in England and Scotland), but narrower than 25-84 used in the original QCancer-10 study.11 However, model performance in UKB is arguably unlikely to reflect relative performance in the general population, for several reasons. Model performance will vary between populations with different prevalence or risk of a disease – known as the ‘spectrum effect’. As UKB has a lower incidence of disease than the general population of screening age, one might expect sensitivity to increase (which is of benefit in a screening test) when applied to a population with higher risk.32 Furthermore, all of our models appeared to perform less well in females. For PRS models, wide confidence intervals in the Geographic Validation Cohort mean this finding should be interpreted with caution, but for models that include QCancer-10, this difference was not unexpected. The known healthy volunteer bias that exists in UKB is especially marked in women (for example, the reduction in all-cause mortality and overall cancer incidence in UKB relative to the general population is greater for women than men).33 The QCancer-10 model has previously been shown to perform less well when validated in UKB than in the QResearch validation.11 This is likely to be largely due to the differences in age distribution between the general population sample used to develop the original QCancer-10 score and the more restricted UKB sample in this study.
External validation of a separate QCancer (colorectal) score for symptomatic patients (rather than the asymptomatic score evaluated here) in an independent population-based cohort showed comparable performance to the discovery study.34 Overall, risk model performance should be validated in a population representative of the screening population, and we have shown that PRS calibration can be largely corrected in new (ethnically similar) populations by recalibration.
Further limitations of our study may include unknown differences in the demographics of the contributing base GWAS datasets and UKB. In addition, we did not include Mendelian CRC syndromes in the genetic model, and doing so would almost certainly provide improved performance. Another limitation of our study, and PRS generally, is that most models are developed in individuals of European ethnicity. Although most CRC risk SNPs appear to be shared across ethnic groups, quantitative risk estimates cannot readily be transferred across populations,35 and, as anticipated, the PRS performed poorly in the Minority Ethnic Validation Cohort.36 As minority ethnic populations often have higher CRC associated mortality and lower screening uptake,37-39 further work is urgently needed to improve PRS for CRC in these already disadvantaged populations.
Recent commentaries have been sceptical about the utility of PRS in cancer prevention and early diagnosis,40 and implementation of PRS in clinical practice has been limited. However, our risk score predicts that ∼10% of the population aged ∼40-70 have relative risks of CRC high enough to warrant surveillance under current guidelines used in a familial risk context.41 Furthermore, in a national population screening programme, a risk score with moderate predictive value has considerable potential for improving performance through risk stratification. For bowel cancer screening, a quantitative FIT score is used to decide who is investigated further by colonoscopy. A risk score could be applied alongside the FIT score to allocate colonoscopy more effectively, thus maintaining universal access to screening whilst improving performance. The risk models constructed here perform at a level that may well be clinically useful.40 Alongside efforts to improve PRS performance in individuals of diverse ancestry, validation in a cohort representative of the screening population and evaluation in a screening trial are, we believe, warranted to assess the performance, acceptability and cost effectiveness of a mixed genetic and non-genetic risk model in a FIT-based bowel cancer screening programme.
Data Availability
UK Biobank data can be obtained through http://www.ukbiobank.ac.uk/. Genotype data are available in the European Genome-phenome Archive under accession numbers EGAS00001005412, EGAS00001005421, or from the Edinburgh University DataShare Repository (https://datashare.ed.ac.uk/). Finnish cohort samples can be requested from the THL Biobank https://thl.fi/en/web/thl-biobank. PRS SNP inclusion lists and model specifications will be deposited in the PGS catalogue repository (https://www.pgscatalog.org/). Risk scores for UKB participants will be returned to UK Biobank for use by approved researchers.
Declaration of interests
JHC is director of the QResearch database – a not-for-profit collaboration between University of Oxford and EMIS (commercial supplier of NHS computer systems). She is founder and shareholder of ClinRisk Ltd and was its medical director until June 2019. ClinRisk Ltd supplies free open-source software for research purposes. It also licenses other closed source software to implement risk prediction tools into NHS computer systems outside the submitted work. She is also an adviser to the CMO in England on cancer screening.
JEE has served on clinical advisory boards for Lumendi, Boston Scientific, and Paion; has served on the clinical advisory board and owns share options in Satisfai Health; and reports speaker fees from Falk. JEE serves on the ACPGBI / BSG guideline group for implementation of FIT for the detection of CRC in patients with symptoms suspicious of CRC.
Funding
SEB is supported by an MRC Clinical Research Training Fellowship (MR/P001106/1). JEE and SW receive funding from the NIHR Oxford Biomedical Research Centre (BRC). This work of the Houlston Laboratory (PL, RH) is supported by a grant from Cancer Research UK (CR-UK) (C1298/A25514). JHC received funding from the John Fell Oxford University Press Research Fund, grants from CR-UK grant number C5255/A18085, through the Cancer Research UK Oxford Centre, grants from the Oxford Wellcome Institutional Strategic Support Fund (204826/Z/16/Z) and other research councils, during the conduct of the study. MD is funded by CR-UK Programme Grant C348/A12076. IT is funded by CR-UK Programme Grant C6199/A27327. The research was supported by the Wellcome Trust Core Award Grant Number 203141/Z/16/Z with funding from the NIHR Oxford BRC. The views expressed are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health. Funders had no role in the study design; in the collection, analysis and interpretation of the data; in the writing of the report; or in the decision to submit the paper for publication.
Contributors
All authors contributed to study conception and design, with development of PRS and statistical analysis led by SEB, IT and JHC. IT, MD and RH provided data. SB and PL carried out primary data analysis, SEB completed the statistical analysis under supervision of IT and JHC. SB and IT wrote the first draft of the manuscript. All authors contributed to critical revision of the manuscript for important intellectual content, and have read and approved the final version.
Acknowledgements
This research has been conducted using data from UK Biobank, a major biomedical database, http://www.ukbiobank.ac.uk/. We thank all individuals who agreed to participate in the contributing GWAS studies and in UK Biobank, and the investigators, research associates and wider teams involved in these studies. We thank the authors of LDpred2 for their instructive PRS tutorial and code.