Abstract
Background The multifactorial risk prediction model BOADICEA enables identification of women at higher or lower risk of developing breast cancer. BOADICEA models genetic susceptibility in terms of the effects of rare variants in breast cancer susceptibility genes and a polygenic component, decomposed into an unmeasured and a measured component - the polygenic risk score (PRS). The current version was developed using a 313 SNP PRS. Here, we evaluated approaches to incorporating this PRS and alternative PRS in BOADICEA.
Methods The mean, standard deviation (SD), and proportion of the overall polygenic component explained by the PRS (α²) need to be estimated. α was estimated using logistic regression, where the age-specific log-relative risk is constrained to be a function of the age-dependent polygenic relative risk in BOADICEA; and using a retrospective likelihood (RL) approach that models, in addition, the unmeasured polygenic component.
Results Parameters were computed for 11 PRS, including 6 variations of the 313 SNP PRS used in clinical trials and implementation studies. The logistic regression approach underestimates α, as compared with the RL estimates. The RL α estimates were very close to those obtained by assuming proportionality to the odds ratio per 1 SD, with the constant of proportionality estimated using the 313 SNP PRS. Small variations in the SNPs included in the PRS can lead to large differences in the mean.
Conclusions BOADICEA can be readily adapted to different PRS in a manner that maintains consistency of the model.
Impact The methods described enable comprehensive breast cancer risk assessment.
Introduction
BOADICEA1,2 is a risk prediction algorithm for predicting breast and ovarian cancer risk on the basis of genetic and non-genetic factors. The algorithm incorporates the effects of common genetic variants, summarised in a polygenic risk score (PRS), in addition to the effects of pathogenic variants in major breast cancer susceptibility genes, other lifestyle/hormonal risk factors and cancer family history.
The current version (v6)1,2 has been specifically developed to incorporate the 313 SNP PRS of Mavaddat et al.3; this PRS was developed using the very large dataset of the Breast Cancer Association Consortium (BCAC) and extensively validated in prospective studies. However, the algorithm itself is flexible and can incorporate any PRS for which the relevant parameters are known. These parameters are the mean (µ) and standard deviation (σ) of the PRS in the population, and the proportion (α²) of the polygenic variance attributable to the PRS. In practice, the PRS can be normalised and supplied as a Z-score, in which case only the parameter α is required. By modelling the PRS as a proportion of a (fixed) polygenic component, the predicted familial risks remain consistent, irrespective of the PRS used, and importantly there is no double counting of the effect of the PRS and cancer family history.
Here, we discuss the incorporation of alternative PRSs into BOADICEA, and provide the relevant parameters for several PRS that have been developed, including several that are in use in clinical applications.
Materials and Methods
BOADICEA models breast cancer risks such that the incidence of breast cancer at age t for individual i is of the form1,4:

$$\lambda^{(i)}(t) = \lambda_0(t)\,\exp\left(\delta_{g(i)}(t) + \sigma_P(t)\,x_P^{(i)} + \sum_{\rho}\beta_{\rho}(t)\,z_{\rho}^{(i)}\right)$$

Here, λ0(t) is the baseline incidence. The term δg(i)(t) models the major gene component for individual i (δk(t) being the age-specific log-hazard ratio associated with genotype k). The term $\sigma_P(t)\,x_P^{(i)}$ models the polygenic component, σP(t) being the polygenic standard deviation and $x_P^{(i)}$ the normalised polygenic component for individual i. The final term models the effects of other risk factors. The polygenic variance, $\sigma_P^2(t)$, is allowed to be age-dependent and is assumed to be a linear function of age t:

$$\sigma_P^2(t) = \gamma + \theta t$$

The parameters γ and θ have been previously estimated, using complex segregation analysis, as 4.86 and -0.06 respectively4.
Following Lee et al.1, the PRS is incorporated into BOADICEA by partitioning the total polygenic component xP into the sum of a known component xK, measured by the PRS, and an unmeasured residual component xR. The variance due to the known component is of the form3:

$$\sigma_K^2(t) = \alpha^2\,\sigma_P^2(t) \qquad (1)$$

so that the residual component has variance $(1-\alpha^2)\,\sigma_P^2(t)$. σK(t) can also be interpreted as the age-specific log-hazard ratio per unit SD of the PRS, conditional on other risk factors. Note that Mavaddat et al.3 write equation (1) with different symbols. The notation used here is for consistency with Lee et al.1 and the CanRisk platform (www.canrisk.org), where the proportion of the polygenic variance explained by the PRS is denoted as α².
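The partition can be illustrated numerically. The following is a minimal sketch, not the BOADICEA implementation: it assumes only the linear variance function with γ = 4.86 and θ = -0.06 and the relation in equation (1). The function names are illustrative, and the value α = 0.441 is the retrospective likelihood estimate for PRS313 reported in the Results.

```python
import math

GAMMA = 4.86   # intercept of the polygenic variance (from the text)
THETA = -0.06  # change in polygenic variance per year of age (from the text)


def polygenic_sd(age):
    """Total polygenic SD, sigma_P(t) = sqrt(gamma + theta * t)."""
    variance = GAMMA + THETA * age
    if variance <= 0:
        raise ValueError("polygenic variance is not positive at this age")
    return math.sqrt(variance)


def partition(age, alpha):
    """Return (sigma_K, sigma_R), the SDs of the measured and residual parts."""
    sigma_p = polygenic_sd(age)
    sigma_k = alpha * sigma_p                       # equation (1)
    sigma_r = math.sqrt(1.0 - alpha ** 2) * sigma_p
    return sigma_k, sigma_r


sigma_k, sigma_r = partition(age=50, alpha=0.441)
hr_per_sd = math.exp(sigma_k)  # conditional hazard ratio per 1 SD of the PRS
print(f"sigma_K={sigma_k:.3f}, sigma_R={sigma_r:.3f}, HR per SD={hr_per_sd:.2f}")
```

Under these assumptions the two variances always sum to the total polygenic variance at that age, which is why changing the PRS changes α but leaves the predicted familial risks unchanged.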
Estimation of α and incorporating alternative PRS
The key parameter is α. The first approach to estimating this parameter makes the simplifying assumption that the standard deviation of the known polygenic component, σK(t), can be approximated by the marginal age-specific log-hazard ratio per unit SD of the PRS3. α can then be estimated using case-control (or cohort) data, by first transforming the PRS using:

$$S' = \sqrt{\gamma + \theta t}\;x_K \qquad (2)$$

where xK is the standardised (per unit SD) version of the proposed PRS and t is age. S′ is then included as a covariate in a logistic regression model: the parameter (log-relative risk) estimate corresponding to the covariate S′ gives the required α parameter, which we denote αGLM. This method was applied to 22,767 controls and 16,151 women diagnosed with invasive breast cancer from the validation and prospective test sets used in Mavaddat et al.3 (Supplementary Tables 1 and 2). The analysis was restricted to women of European ancestry with age of diagnosis or last observation less than 80 years (after application of inclusion/exclusion criteria, mean age at diagnosis = 59.9 (SD = 10) years for cases and 57.1 (SD = 10.4) years for controls). Analyses were adjusted for the country in which the study was conducted (15 countries) and 10 principal components (PCs).
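The logistic regression step can be sketched on simulated data. This is an illustrative toy example only: the data are generated from a simple logistic model with a hypothetical α of 0.44, the adjustment for country and principal components is omitted, and the small Newton-Raphson fitter stands in for whatever statistical software is used in practice. Because the toy data contain no unmeasured polygenic component, the sketch does not reproduce the underestimation discussed later; it only shows how the coefficient of S′ is read off as αGLM.

```python
import math
import random

GAMMA, THETA = 4.86, -0.06  # polygenic variance parameters from the text


def transform(z_score, age):
    """S' = sqrt(gamma + theta * age) * x_K, with x_K the standardised PRS."""
    return math.sqrt(GAMMA + THETA * age) * z_score


def fit_logistic(xs, ys, iterations=15):
    """Newton-Raphson fit of logit P(y=1) = b0 + b1 * x; returns (b0, b1)."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p            # score for the intercept
            g1 += (y - p) * x      # score for the slope
            h00 += w               # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


random.seed(1)
TRUE_ALPHA = 0.44  # hypothetical value, used only to generate the toy data
xs, ys = [], []
for _ in range(20000):
    age = random.uniform(30, 79)
    s_prime = transform(random.gauss(0.0, 1.0), age)
    p_case = 1.0 / (1.0 + math.exp(-(-0.5 + TRUE_ALPHA * s_prime)))
    xs.append(s_prime)
    ys.append(1 if random.random() < p_case else 0)

_, alpha_glm = fit_logistic(xs, ys)
print(f"alpha_GLM estimate: {alpha_glm:.3f}")
```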
The above analyses make the simplifying assumption that the marginal PRS effect size is a good approximation to the effect size conditional on other risk factors. This is likely to be a reasonable assumption for non-genetic risk factors, which have relatively small effects on risk and appear to be independent of the PRS5. However, it may not be true for other genetic factors, in particular the unmeasured polygenic component. To address this problem, we also estimated α using a retrospective likelihood approach (αRL), applied to the same BCAC dataset. In this analysis, the observed PRS is computed conditional on the phenotypes of the individuals (age of diagnosis and case/control status). Details are given in the Supplementary Methods.
This approach requires overall population age-specific incidence rates to be specified. For this purpose, the rates for England and Wales 2016-2018 were used (https://www.cancerresearchuk.org/health-professional/cancer-statistics/incidence/age).
Since the mean PRS varies by country, we first regressed the PRS on country and principal components, adjusted for case/control status, and performed the analyses on the residual PRS. The likelihood was maximised using the optimize function in R. 95% confidence intervals were obtained by evaluating the log-likelihood over a grid of values for αRL and comparing each value with the maximum log-likelihood.
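A grid-based interval of this kind can be sketched as follows. This is a generic illustration rather than the code used for the analyses: the log-likelihood is a hypothetical quadratic stand-in for the retrospective likelihood, and the cut-off of 1.92 log-likelihood units (half the 95% point of a chi-squared distribution with 1 degree of freedom) is an assumption, since the threshold is not stated in the text.

```python
def grid_confidence_interval(log_lik, grid, drop=1.92):
    """Range of grid values whose log-likelihood is within `drop` of the maximum."""
    values = [(a, log_lik(a)) for a in grid]
    max_ll = max(ll for _, ll in values)
    inside = [a for a, ll in values if max_ll - ll <= drop]
    return min(inside), max(inside)


def toy_log_lik(alpha):
    """Hypothetical quadratic log-likelihood peaking at alpha = 0.44."""
    return -0.5 * ((alpha - 0.44) / 0.004) ** 2


grid = [0.40 + 0.0005 * i for i in range(161)]  # 0.40 to 0.48 in steps of 0.0005
lower, upper = grid_confidence_interval(toy_log_lik, grid)
print(f"95% CI: {lower:.4f} to {upper:.4f}")
```

The resolution of the interval is limited by the grid spacing, so a finer grid is needed when the likelihood is sharply peaked.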
There is also a simple approximate relationship between the PRS effect size and α. The effect size of a PRS is typically expressed in terms of the log-relative risk per unit SD, say η. From equations (1) and (2) it can be seen that, under the rare disease assumption, the marginal hazard ratio associated with the PRS should equal the conditional hazard ratio. If differential age effects can also be ignored, α should therefore be approximately proportional to η. This allows α to be estimated using PRS313 as a standard, with the constant of proportionality estimated in our dataset. Thus:

$$\alpha = \alpha_0\,\frac{\eta}{\eta_0}$$

where η0 and α0 are the corresponding estimates for PRS313. This provides a simple method that could be applied to a PRS developed and validated on a different dataset.
PRS Examples
We computed the relevant parameters for 11 PRS (Supplementary Tables 3-12; SNP positions based on Genome Reference Consortium Human Build 37 (GRCh37)). Six of these are modifications of the PRS313, designed for clinical implementation. The BRIDGES PRS was developed as a next-generation sequencing (NGS) panel test to facilitate clinical translational studies of BOADICEA implemented in the context of genetic testing of women with a family history (https://bridges-research.eu/). Of 313 variants, 295 could be designed and a further 11 were replaced by surrogate markers (r² > 0.9 in Europeans). The PERSPECTIVE I&I PRS was designed to facilitate risk-stratified screening in the context of population-based mammographic screening in Ontario and Quebec6. This PRS was designed as an NGS panel: 287 of 313 markers could be designed and a further 8 were surrogates. The EastGLH PRS was designed by the NHS East Genomic Laboratory Hub for use in a randomised controlled trial of women testing positive for an inherited pathogenic/likely pathogenic gene variant in BRCA1, BRCA2, PALB2, CHEK2 or ATM, using an NGS panel of 303 markers7. The PRISMA PRS, designed as a genotyping array of 268 markers (37 surrogates), was developed to provide multifactorial cancer risks to women attending genetic clinics in Spain. The eMERGE PRS consists of 308 markers and is part of a large US study aiming to communicate PRS-based genome-informed risk assessment across multiple diseases (https://emerge-network.org). DBDS299, using data from the Danish Blood Donor Study (https://bmjopen.bmj.com/content/9/6/e028401), is used in a research study to validate BOADICEA in the Danish population.
In addition, we included the earlier PRS77 developed using BCAC data and comprising genome-wide significant SNPs, PRS3820 developed by Mavaddat et al.3 using Lasso penalised regression, and two PRS (WISDOM75 and WISDOM120) developed for the WISDOM clinical trial8 (Clinical Trials identifier NCT02620852).
The mean and standard deviation of each PRS, and the proportion (α²) of the polygenic variance attributable to each alternative PRS, were derived in the same dataset, namely the validation and prospective sets described by Mavaddat et al.3
PRS313 includes two variants (22_29203724_C_T and 22_29551872_A_G) which are correlated with the protein truncating variant CHEK2*1100delC, and some of the derivative PRS also include these SNPs. This could result in overestimation of risk in CHEK2*1100delC carriers if the PRS is used in conjunction with gene-panel testing, because BOADICEA assumes that the PRS and major gene genotypes are independent in the population. We therefore also considered PRS without these variants. (Note that CHEK2 p.I157T (22_29121087_A_G) is also included in PRS313 but is only weakly correlated with CHEK2*1100delC and does not introduce a bias).
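Excluding these variants amounts to dropping their terms from the score. The sketch below assumes the usual form of a PRS as a weighted sum of effect-allele dosages; the two chromosome 22 identifiers are those named above, while the third variant, all weights and all genotypes are hypothetical placeholders rather than values from the Supplementary Tables.

```python
# Variants correlated with CHEK2*1100delC, as named in the text.
CHEK2_CORRELATED = frozenset({"22_29203724_C_T", "22_29551872_A_G"})


def polygenic_score(dosages, weights, exclude=frozenset()):
    """Weighted sum of effect-allele dosages, skipping any excluded variants."""
    return sum(
        weight * dosages[variant]
        for variant, weight in weights.items()
        if variant not in exclude
    )


# Hypothetical weights (log odds ratios) and genotype dosages for one woman.
weights = {
    "22_29203724_C_T": 0.10,
    "22_29551872_A_G": 0.20,
    "hypothetical_variant": 0.05,
}
dosages = {
    "22_29203724_C_T": 1,
    "22_29551872_A_G": 0,
    "hypothetical_variant": 2,
}

full_score = polygenic_score(dosages, weights)
reduced_score = polygenic_score(dosages, weights, exclude=CHEK2_CORRELATED)
print(f"full={full_score:.2f}, reduced={reduced_score:.2f}")
```

Because the reduced score is a different PRS, its own mean, SD and α must be used when it is normalised and supplied to the model.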
Ethics Declaration
All studies were approved by the relevant local ethical review boards and used appropriate consent procedures. SEARCH was approved by the NRES Committee East of England - Cambridge South.
Data Availability
Data were generated by the authors and are available on request.
Results
Table 1 summarises the estimated parameters for PRS313 and each of the alternative PRS. As expected, the 6 PRS that are variations on PRS313 have very similar effect sizes, expressed as log-OR per 1 SD, reflecting the fact that only a few variants are not accounted for. The αRL parameters are also similar, and only marginally lower than the PRS313 estimate (0.441, 95% CI 0.430-0.445). The effect sizes for PRS77 (both in terms of the log-OR per 1 SD and α) were smaller than for PRS313, while PRS3820 had larger effect sizes. The two WISDOM PRS and the PRISMA PRS also had somewhat smaller effect sizes. Removal of the 2 chromosome 22 SNPs had only a small effect on the estimated log-OR per 1 SD and on α, for example reducing αRL from 0.441 to 0.439 for PRS313. The α values computed using the simpler logistic regression approach (αGLM) were smaller than those generated using the retrospective likelihood approach for all PRS.
Assuming that the α values are proportional to the log-relative risks per 1 SD, and using the PRS313 as the standard, the predicted value is given by α = 0.887η (Figure 1). These predicted values were very similar to the αRL values for all PRS.
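As a worked example of this relationship, the sketch below converts a published odds ratio per 1 SD into an approximate α. The constant 0.887 is the value given above; the odds ratio of 1.5 is a hypothetical input, and the function name is illustrative.

```python
import math

PROPORTIONALITY = 0.887  # alpha / eta, estimated with PRS313 as the standard


def alpha_from_odds_ratio(or_per_sd):
    """Approximate alpha from the odds ratio per 1 SD of a PRS."""
    eta = math.log(or_per_sd)       # log-relative risk per 1 SD
    alpha = PROPORTIONALITY * eta
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    return alpha


alpha = alpha_from_odds_ratio(1.5)  # hypothetical PRS with OR per SD of 1.5
alpha_squared = alpha ** 2          # proportion of polygenic variance explained
print(f"alpha={alpha:.3f}, alpha^2={alpha_squared:.3f}")
```

For this hypothetical PRS, α is approximately 0.36, corresponding to about 13% of the polygenic variance.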
Discussion
We evaluated approaches to incorporating alternative breast cancer PRSs into the risk prediction algorithm BOADICEA. The α values computed using the simpler logistic regression approach (αGLM) were consistently smaller than those generated using the retrospective likelihood approach (αRL), for all PRS. This is consistent with the fact that women with a high polygenic component are more likely to develop the disease at an early age. This results in a negative correlation between the PRS and the residual polygenic component, which leads to an underestimation of the PRS effect size if the latter is not allowed for, a phenomenon related to index event bias9.
We showed further that the α parameters derived from the log-relative risk estimate by assuming proportionality were very close to the αRL estimates. This suggests that this approach is likely to be reasonably accurate for other PRS, at least across the range of effect sizes considered here, providing a very straightforward approach to incorporating a PRS developed on another dataset if a log-relative risk estimate is already available.
A striking observation is the very large difference in the means of the different PRS. This reflects the fact that the removal of a few SNPs with important weights can have a substantial effect on the mean. For example, the means for the PRS excluding the chromosome 22 SNPs are higher. While the mean has no intrinsic significance, this emphasises the importance of correctly normalising the PRS. In particular, because BOADICEA also incorporates the effects of CHEK2 protein truncating variants, we recommend using the PRS without these SNPs when gene-panel testing is performed.
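The effect of using the wrong reference distribution can be seen in a small numerical sketch. All values below are hypothetical and chosen only for illustration; the function simply applies the standard Z-score normalisation to a raw PRS.

```python
def z_score(raw_prs, mean, sd):
    """Standardise a raw PRS against the population mean and SD of that PRS."""
    if sd <= 0:
        raise ValueError("sd must be positive")
    return (raw_prs - mean) / sd


raw = 0.25  # hypothetical raw score computed without the chromosome 22 SNPs

# Hypothetical population parameters: the reduced PRS has the higher mean.
z_matched = z_score(raw, mean=-0.30, sd=0.60)     # parameters of the PRS actually used
z_mismatched = z_score(raw, mean=-0.42, sd=0.60)  # parameters of a different PRS
print(f"matched={z_matched:.2f}, mismatched={z_mismatched:.2f}")
```

In this toy case the mismatched parameters inflate the Z-score by 0.2 SD for every woman, which would bias all predicted risks in the same direction.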
It is important to note that estimates derived from European ancestry populations may not be applicable to individuals of other ancestries. The effect sizes may differ among populations, for example due to differences in linkage disequilibrium structure. This has been shown for PRS313, for which somewhat smaller effect sizes have been estimated in Asian and African-American populations10-13. In addition, the mean PRS can vary significantly by population: PRS313 has a higher mean in both Asian and African-American populations than in Europeans. This again emphasises the importance of calibrating the PRS to the relevant population distribution.
The analyses used here adjusted the PRS for both the country in which the study was conducted and ancestry informative principal components. An adjustment is necessary since the mean PRS varies by country, even among European populations (and this is not reflected in differences in incidence rates). However, it is possible that adjustment for both country and principal components is over-conservative. Further analyses in large population-specific datasets may be able to address this.
The approaches described allow BOADICEA to be adapted for use with any PRS in a consistent manner. However, it should be emphasised that the main validations of BOADICEA14-17 utilised PRS313. For PRS that are substantially different, and particularly as more informative PRS are generated through larger GWAS, further validations will be required. We also note that the current formulation of BOADICEA assumes that the age-specific effects of the PRS and the residual polygenic component (as measured by the log-relative risk per 1 SD) are proportional. This significantly simplifies the algorithm, but it is possible that better predictions may be available by allowing differential age-specific effects.
The BOADICEA algorithm has been extensively validated14-17, particularly when incorporating PRS313 in addition to other risk factors. It is available through the CanRisk (www.canrisk.org) tool18 and is widely used in the context of women with cancer family history or undergoing gene-panel testing, including several ongoing clinical implementation studies. The methods described here allow other PRSs to be used with BOADICEA and hence provide comprehensive breast cancer risk assessment.
Supplementary Table 1. Studies and samples used in these analyses.
Supplementary Table 2. Country in which the studies were conducted.
Supplementary Table 3. List of SNPs used in each of the PRS.
Supplementary Table 4. SNPs and weights for the BCAC PRS313.
Supplementary Table 5. SNPs and weights for the BRIDGES PRS.
Supplementary Table 6. SNPs and weights for the PERSPECTIVE I&I PRS.
Supplementary Table 7. SNPs and weights for the EastGLH PRS.
Supplementary Table 8. SNPs and weights for the PRISMA PRS.
Supplementary Table 9. SNPs and weights for the eMERGE PRS.
Supplementary Table 10. SNPs and weights for the DBDS299 PRS.
Supplementary Table 11. SNPs and weights for the WISDOM75 PRS.
Supplementary Table 12. SNPs and weights for the WISDOM128 PRS.
Footnotes
Funding: This work has been supported by grants from Cancer Research UK (PPRPGM-Nov20\100002); the European Union’s Horizon 2020 research and innovation programme under grant agreement numbers 633784 (B-CAST) and 634935 (BRIDGES); the PERSPECTIVE I&I project which is funded by the Government of Canada through Genome Canada (#13529) and the Canadian Institutes of Health Research (#155865), the Ministère de l’Économie et de l’Innovation du Québec through Genome Québec, the Quebec Breast Cancer Foundation, the CHU de Quebec Foundation and the Ontario Research Fund; and by the NIHR Cambridge Biomedical Research Centre (BRC-1215-20014).
BCAC is funded by the European Union’s Horizon 2020 Research and Innovation Programme (grant numbers 634935 and 633784 for BRIDGES and B-CAST respectively), and the PERSPECTIVE I&I project, funded by the Government of Canada through Genome Canada and the Canadian Institutes of Health Research, the Ministère de l’Économie et de l’Innovation du Québec through Genome Québec, the Quebec Breast Cancer Foundation. The EU Horizon 2020 Research and Innovation Programme funding source had no role in study design, data collection, data analysis, data interpretation or writing of the report. Additional funding for BCAC is provided via the Confluence project which is funded with intramural funds from the National Cancer Institute Intramural Research Program, National Institutes of Health.
Genotyping of the OncoArray was funded by the NIH Grant U19 CA148065, and Cancer Research UK Grant C1287/A16563 and the PERSPECTIVE project supported by the Government of Canada through Genome Canada and the Canadian Institutes of Health Research (grant GPH-129344) and, the Ministère de l’Économie, Science et Innovation du Québec through Genome Québec and the PSRSIIRI-701 grant, and the Quebec Breast Cancer Foundation. MT was supported by the NIHR Cambridge Biomedical Research Centre (BRC-1215-20014) and Cancer Research UK C22770/A31523 (International Alliance for Cancer Early Detection programme).
The PRISMA study was supported by the Carlos III National Health Institute and Ministerio de Educación y Ciencia grant (PI19/01195); and CERCA Programa/Generalitat de Catalunya for institutional support (J. Balmaña).
W. Chung was funded by NIH grant U01 HG008680 (eMERGE grant).
Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/coi_disclosure.pdf and declare that: The BOADICEA model has been licensed to Cambridge Enterprise for commercialization, with the authors D.F.E., A.C.A., A.P.C., A.L. and T.C. listed as its inventors. A.P. has received royalties as a result of this. A.L. is an employee of Illumina Inc. and owns shares in Illumina Inc. A.A., T.C., J.D. and N.M. are funded by the CanRisk CR-UK programme (PPRPGM-Nov20\100002) paid to the institution. N.M. is also funded through the PERSPECTIVE program. S.J. leads the PERSPECTIVE I&I project, which is funded by the Government of Canada through Genome Canada (#13529) and the Canadian Institutes of Health Research (#155865), the Ministère de l’Économie et de l’Innovation du Québec through Genome Québec, the Quebec Breast Cancer Foundation, the CHU de Quebec Foundation and the Ontario Research Fund (payments made to institution Université Laval). S.J. also leads the PERSPECTIVE project supported by the Government of Canada through Genome Canada and the Canadian Institutes of Health Research (grant GPH-129344), the Ministère de l’Économie, Science et Innovation du Québec through Genome Québec and the PSRSIIRI-701 grant, and the Quebec Breast Cancer Foundation (payments made to institution Université Laval). M.T. is funded by the NIHR Cambridge Biomedical Research Centre (BRC-1215-20014) and Cancer Research UK C22770/A31523 (International Alliance for Cancer Early Detection programme). W.C. is funded by NIH grant U01 HG008680 (eMERGE grant). W.C. received consulting fees from and is on the scientific advisory board of Regeneron Genetics Center and is also on the board of directors of Prime Medicine. S.B. is funded by Region H and Boserup Foundation. These funders did not have any influence on the sample collection, lab work, data analyses or their interpretation.
L. Feliubadaló declares honoraria for a lecture on analysis of sequencing data from ovarian cancer presented to Astra-Zeneca. Other authors report no conflict of interest.