ABSTRACT
Background Polygenic risk scores (PRS) have been developed to predict individual cancer risk and their potential clinical utility is receiving a great deal of attention. However, the degree to which the predictive utility of individual cancer-specific PRS may be augmented or refined by the incorporation of other cancer PRS, non-cancer disease PRS, or the protective effects of health and longevity-associated variants, is largely unexplored.
Methods We constructed PRS for different cancers from public domain data as well as genetic scores for longevity (‘Polygenic Longevity Scores’ or ‘PLS’) for individuals in the UK Biobank. We then explored the relationships of these multiple PRS and PLS among those with and without various cancers.
Results We found statistically significant associations between some PLS and individual cancers, even after accounting for cancer-specific PRS. None of the PLS in their current form had an effect pronounced enough to motivate clinical cancer risk stratification based on its combined use with cancer PRS. A few variants at loci used in the PLS had known associations with Alzheimer’s disease and other diseases.
Conclusion Underlying heterogeneity behind cancer susceptibility in the population at large is not captured by PRS derived from analytical models that only consider marginal associations of individual variants with cancer diagnoses. Our results have implications for the derivation and calculation of PRS and their use in clinical and biomedical research settings.
Impact Extensions of analyses like ours could result in a more refined understanding of cancer biology and how to construct PRS for cancer.
INTRODUCTION
While cancer treatments and outcomes have improved significantly over the past 40 years, overall cancer incidence continues to rise.(1) In 2019, an estimated 16.9 million cancer survivors lived in the United States, a seven-fold increase since 1971.(2) Cancers are primarily diseases of aging, with over 80% of cancers occurring in individuals over age 50.(1) Since this proportion of the US population is projected to increase steadily over the next several decades,(3) there is an urgent need to address this impending public health and medical crisis through the development of strategies to decrease rates of cancers in aging adults. Developing appropriate strategies will require identifying individuals at risk for cancer and providing them with specific risk mitigation interventions that go beyond current prevention and screening recommendations. In this light, cancer risk assessment can be greatly enhanced by using genetic profiling. Both monogenic and polygenic influences play an important role in susceptibility to and pathogenesis of cancers, however, cancers in the aged are less likely to be associated with monogenic germline variants(4-6) The degree to which polygenic variants contribute specifically to aging-related cancer risk is the subject of much current research.(7,8)
GWAS have identified thousands of common genetic variants associated with many different types of cancer.(9,10) Although each individual variant has a small effect on risk, especially relative to those contributing to monogenic forms of cancer, their cumulative effects can be pronounced. These polygenic effects can make individual contributions to risk equal to that of the individual rare variants contributing to monogenic forms of hereditary cancer (11) and also modify or compound the risk of these monogenic driver variants(12-14). These common variant, polygenic effects are often summarized at the individual level as a polygenic risk score (PRS) that aggregates effects across multiple variants into a single estimate of the total cancer susceptibility burden in a given genome. Cancer PRS are derived in different ways, but often use simple weighted multiple regression and related prediction modeling techniques.(15-17). Supplementary Table 1 provides a list of recent cancer PRS studies.
Interest in PRS grew from developing models that could discriminate specific cancer cases from controls, but has now expanded into multiple other areas. For example, PRS have been used to explore genetic, as opposed to phenotypic, correlations between cancers to better characterize shared genetic determinants of different cancers;(8,18) correlations between the PRS themselves for different cancer-related phenotypes;(19) the pleiotropic effects of variants across multiple different cancers;(20) and the causal effects of various measurable and potentially modifiable genetically-mediated factors on cancer via, e.g., Mendelian Randomization tests.(21-23) This work suggests that being susceptible to one type of cancer or phenotype can influence susceptibility - positively or negatively - to another type.(21,24) The degree to which this is the case is an open question.
Fundamental processes mediating cancer susceptibility that are shared across cancers are also likely to mediate susceptibility to other diseases and conditions beyond cancer. Such processes are quite likely to be associated with aging and longevity, given that the manifestation of many common chronic diseases is age-related. By bringing together genetic insights into multiple cancers, diseases, aging and longevity, one might be able to develop not only better prediction models through, e.g., risk stratification, but also more insight into the genetic etiology of cancer. Identifying variants associated with health span and longevity is challenging. Different definitions of longevity are used, sample sizes in studies focusing on extreme longevity are small, and the actual heritability of human lifespan is debated, as is the need to account for environmental factors in relevant studies.(25-27) Large-scale analyses of long-lived cohorts have enabled pooled analysis of multiple GWAS, as has been the focus of the National Institute on Aging’s Longevity Consortium (LC), the Integrated Longevity Omics (ILO), and Long Life Family Study (LLFS) projects.(25) The results of these studies have yet to be integrated with studies of various diseases. GWAS focused on genetic ‘surrogates’ for individual lifespan, such as parental lifespan, have also been pursued.(28,29) Although not all studies are consistent, many of these analyses suggest that long lived individuals possess different disease PRS profiles from short lived individuals.(30-33) Some individuals possessing a genetic background consistent with a long-life – i.e., having an elevated Polygenic Longevity Score (PLS) – may have inherited genetic variants that protect them from cancer and other diseases. However, these variants may not have individual protective effects on cancer pronounced enough to be detected as significantly associated with a cancer diagnosis in cancer-specific GWAS.(34)
We explored the relationships among PRS built from published GWAS results and PLS derived from different longevity-focused GWAS and their relationships to specific cancer diagnoses, general cancer risk, and age of cancer onset in the UK Biobank.(35,36) We also considered the development and performance of modified cancer risk prediction models that included longevity associated variants. We find that although many PLS are associated with cancers at some level, these effects are not pronounced enough to radically impact the use of cancer PRS prediction models. We do, however, argue that the relationship between longevity, cancers and disease protection needs greater exploration, especially given that some of the variants used in the cancer-associated PLS are strongly negatively associated with other diseases, like Alzheimer’s disease, raising important biological questions.
METHODS
Data Sources
We calculated polygenic risk scores (PRS) for 16 cancers using publicly available data,(7,37,38) as well as polygenic longevity scores (PLS) using 8 different longevity measures derived from different publicly available longevity GWAS summary statistics and different p-value thresholds (Table 1; SNPs and weights used for each PLS are provided in the supplementary material).
PLS were labeled according to the source reference for the summary statistics (‘dl’, ‘dl90’, ‘tim’, etc. See Table 1). We note that some PLS were based on a subset of SNP markers identified from the various longevity-focused GWAS because there were no reported effect sizes and some were not among those in version-3 of the UK Biobank total SNP list, including imputed SNPs. All PRS and PLS were computed on each of the individuals in the UK Biobank with genotype calls from the UK Biobank version-3 genotype data using the linear scoring function in PLINK v2 software and published PRS effect sizes.(36,39,40) We restricted our analyses to White British individuals (based on UK Biobank Data-Field 21000) as the PRS and PLS we used were derived from individuals with Northern European ancestry and the simultaneous study of multiple ancestries can confound PRS inference.
Logistic Regression Modeling
We used logistic regression to fit a number of prediction models to the UK Biobank data, with a specific cancer diagnosis or cancer PRS as a dependent variable and age, sex, PLS and the PRS for other cancer diagnoses as independent variables. For these analyses we included the first 10 principal components (PC) from the genome wide genetic relationship matrix (GRM) of the participants, although there is debate about the need or meaningfulness of including PCs in relevant analyses when the focus is on polygenic effects.(41-44) We also carried out stepwise logistic regression, using both forward and backward steps in a ‘both-way’ approach to calculate the best model, keeping independent variables in models in each step based on the Akaike information criterion (AIC) until no improvements in the model occurred. More details about the various models are described in corresponding sections in the Results section. All analyses were carried out using R version 3.6.3.
Survival Analyses
We performed survival analyses on each of the 16 cancers and for ‘any-cancer’ diagnosis using age-of-onset information as the dependent variable and an individual’s age at assessment, if that individual was not diagnosed with cancer, as a censoring time variable. The UK Biobank recruited individuals between the ages of 40 and 70. We therefore did not include any cancer age-of-onsets that were reported after age 70 in follow-up studies with UK Biobank participants. Survival analyses were pursued using both the Kaplan-Meier models coupled with log-rank tests of differences in survival curves as a function of PLS, and Cox proportional hazards models with the first 10 PCs from the GRM (see above), cancer PRS and PLS as independent variables as implemented in the ‘survival’ R package.(45)
Prediction Evaluation
We evaluated Receiver-Operator Characteristic (ROC) curves for each logistic regression model fit with cancer diagnosis as a dependent variable. We computed the Area-Under-the-Curve statistic (AUC) from the ROC analyses to gauge the performance of our logistic regression-based prediction models with age, sex, cancer PRS and PLS as independent variables. The R package pROC(46) was used for the relevant analyses.
RESULTS
Evidence for a shared genetic basis for different cancers
We first considered evidence that there might be a shared genetic basis to susceptibility to all cancers. We note that even if there is no evidence for a shared genetic basis for susceptibility across all or even most cancers, this does not preclude the possibility that one or a few cancers may share genetically-mediated determinants with longevity. Since individuals who live a long life did not die from cancer at earlier ages by definition, this suggests some degree of overlap in genetic determinants between cancers – and all diseases – and longevity (i.e., it may be that individuals who live a long life likely do not have genetic variants contributing to cancer susceptibility to the same degree as individuals who die earlier from cancer). We explored evidence for common genetic factors among cancers in three ways: 1. we considered studies in the literature whose results were consistent with a shared genetic basis for cancers; we analyzed data in the UK Biobank to test the hypothesis of a shared genetic basis; and 3. we explored the consistency between genetic correlations and PRS correlations among cancers.
Supplementary Table 2 shows the number of individuals in the UK Biobank diagnosed with one cancer and individuals with at least two different cancers. There were only 45 individuals with 3 or more cancers and there was a maximum of 4 cancers observed for any individual. Supplementary Table 3 depicts the AUC statistics from logistic regression models with specific cancer diagnosis as a dependent variable and age as well as other cancer diagnoses as independent variables, chosen based on stepwise logistic regression (see Methods). We note that some cancers are less frequent than others and as such some of these logistic regression analyses were likely underpowered, despite the large overall sample size of the UK Biobank. Both Supplementary Tables 2 and 3 suggest that many cancer diagnoses are significantly associated with, and predictive of, other cancer diagnoses. Many of the relationships between cancers that we identified are well known: e.g., breast and ovarian cancer; endometrial and colorectal cancer; but some are not: e.g., bladder and breast cancer. We acknowledge and emphasize that the associations between cancers could be, at least in part, attributable to environmental factors that contribute to both cancers.
We explored the consistency between the strength of the genetic correlations and the strength of the PRS correlations, knowing that the different PRS used different sets of SNPs (see Discussion section). Figure 1 provides a scatter plot of the estimates of genetic correlation obtained from Lindstrom et al(47) and corresponding PRS correlations computed on the UK Biobank subjects using the PRS of several cancers generated from the publicly available summary statistics (see Methods).(7)
We note that genetic and PRS correlations can be computed under different assumptions, using different methods with different overall orientations. For example, genetic correlations consider all the variants in a GWAS, whereas PRS correlations only use variants in the individual PRS. Figure 1 suggests there is more variation in the genetic correlations than there is in the PRS correlations, perhaps reflecting the fact that the PRS correlations involve a small number of SNPs, even though many PRS correlations are significant. Figure 2 provides a heatmap of the cancer PRS correlations for the UK Biobank participants and suggests that many cancers exhibit positive and negative PRS correlations.
Note we included PRS correlations for cancers that are sex-specific for both sexes since such correlations may be meaningful biologically in terms of the pathways and underlying processes implicated. Since an obvious potential source for PRS correlations is overlap in the variants considered in the PRS being correlated, we tallied the number of overlapping variants used in the PRS we considered. The Supplementary Material provides a matrix with the number of SNPs in the summary statistics used to generate cancer PRS and number of intersecting SNPs across the cancers. More in-depth studies of the linkage-disequilibrium (LD) relationships between variants used in different PRS is called for to rule out more subtle sets of overlapping variant effects in different PRS.
Associations Between Polygenic Longevity Score and Cancer
We assessed the relationships between the 8 different PLS listed in Table 1 and cancer diagnosis, as well as cancer PRS. We note that one of our PLS, the PLS derived from the Timmers et al study, ‘tim,’(28) was derived from the UK Biobank data and studies involving actual UK Biobank participants, thus it should be kept in mind throughout our discussions that the study of this specific PLS could be confounded by an overfitting effect in our analyses. We took each cancer diagnosis as a dependent variable, and then considered the corresponding PRS for that cancer, PRS for the other cancers, age, and the different PLS as independent variables in stepwise logistic regression models. These analyses were pursued separately for each sex. The AUC statistics derived from these analyses consider: 1. age only; 2. age and cancer PRS; and 3. age, cancer PRS, and the PLS, and are summarized in Table 2 and Supplementary Table 4. (Specific models for each cancer with regression coefficients, etc. are available from the authors).
These tables suggest that PLS and other cancer PRS are associated with some cancers – Bladder, Colon, Pancreas, Prostate and Testicular cancer for men and Bladder and Colon for women – independently of the individual PRS for those cancers. However, the increase in the AUC score for these models when PLS are included are not large, and certainly not indicative of their potential for risk stratifying individuals in their current instantiation. We also computed correlations between the 16 cancer PRS and the 8 PLS. (Supplementary Table 5) We found small correlations, mostly negative as expected, although many were highly statistically significant. Supplementary Table 6 provides a matrix of the correlations of each of the cancers with each of the eight PLS. Excluding the correlations for the ‘tim’ PLS, the largest negative correlation was between the ‘seb10’ PLS and the PRS for lung cancer and the largest positive correlation was between ‘sebp5’ and oral cancer.
Age of Cancer Onset and Polygenic Longevity Score
Since the age-of-onset of cancers can be influenced by a number of factors, including genetic factors, we explored the association of the 8 PLS to cancer age-of-onset. We took information about the age-of-onset of each cancer in the UK Biobank and considered it in Cox proportional hazards models as well as a standard Kaplan-Meier (KM) survival analysis (see Methods). The age of individuals at the time they participated in the UK Biobank studies was used as a censoring age for individuals without a cancer diagnosis. Supplementary Tables 7 and 8 provides the results of the Cox and KM analyses for each cancer and PLS, as well as analyses considering the age-of-onset of any cancer, and suggest that many cancers have an age-of-onset that is highly statistically significant in its association with one or another PLS, particularly the ‘dl’, ‘dl90’ and ‘dl99’ PLS. To depict the relationships between PLS and age-of-onset of cancers, we generated KM-based survival curves for individuals whose PLS were in different PLS distribution percentiles. Figure 3 depicts the KM curves for analyses using age-of-onsets of any cancer diagnosis as well as for female breast cancer using the ‘dl’ PLS, contrasting individuals with ‘dl’ PLS in the upper and lower 10th percentiles of the PLS scores among the UK Biobank participants. (See Supplementary Figure 1 for KM curves for cervical cancer and PLS).
Cancer Prediction Modeling Performance
To determine the clinical utility of cancer prediction modeling that includes PLS in addition to cancer PRS (and age and sex), we computed Receiver Operator Characteristic (ROC) curves and the area-under-the-curve (AUC) statistic from logistic regressions with all the 16 cancer diagnoses combined and treated as a dependent variable (i.e., a dummy dependent variable was constructed whereby 1 = any-cancer diagnosis, and 0 = no cancer diagnosis) and all cancer PRS, age and sex as independent variables. We then added PLS as independent variables to the models and tested to see if the AUC was improved significantly, but did not find any statistically significant improvement (Table 3). We also carried out a similar analysis for each of the cancers; first using only age as an independent variable and then adding a single PLS. We also did not find significant improvements in AUCs (Supplementary Table 9). Thus, although PLS are significantly associated with cancer diagnoses and age at onset, their inclusion in clinical prediction models is not justified at this time.
DISCUSSION
Many cancers share underlying etiological processes and factors (e.g., faulty DNA repair capacity, immune surveillance dysfunctions, etc.),(48) suggesting that the genetic determinants of these processes and factors mediate all or different subsets of cancers. The degree of overlap in the genetic determinants of cancers has been explored through the examination of the genetic correlations derived from individual GWAS results and data, as well as studies of the pleiotropic effects of individual genetic variants on multiple cancers.(7,18,20,49,50) In addition, many cancers have also been shown to be genetically correlated with non-cancer-related diseases.(24,51) These genetic correlations, possibly exacerbated by relevant environmental factors, are also quite likely to lead to cancer patients developing a second primary cancer (as opposed to recurrent cancers) or have comorbid illness. In this light, many other less obvious diseases have been found to be genetically correlated to specific cancers, such as psychiatric illness and lung and breast cancer.(52) Also, and perhaps more germane to our study of PRS and PLS, a recent study by Witte et al.(7) found that many variants associated with one form of cancer were also associated with another (e.g., exhibited pleiotropic effects across different cancer types) and that variants used in the PRS of one cancer are often included in the PRS of other cancers. They also found that PRS values were correlated for many pairs of cancers, as have many others.(8,18-20,53)
Since cancers can compromise lifespan, it is possible that individuals who live a long time: 1. do not have the variants that predispose to cancer to the same degree as individuals who have not lived a long time; 2. possess variants that protect them from the deleterious effects of cancer predisposing variants; 3. have been exposed to environmental factors (e.g., diet, lifestyle, etc.) that protect them from cancer in some way; or 4. had or have a combination of the above. Since many published studies provide evidence that genetic variants exist that predispose to a long life,(30,31,34) a good question concerns the degree to which such variants are absent from cancer PRS, or whether the possession of longevity-enhancing variants protects individuals against the deleterious effects of having an elevated cancer PRS. This is of particular interest given that cancer is primarily a disease of aging, and thus cancer’s genetically-mediated relationship to longevity may be pronounced. In addition, exploring and characterizing the positive genetic correlations between cancers, between cancers and other diseases, and the likely negative genetic correlations between cancers and longevity, can shed light on the pathogenesis of aging-related diseases, and also potential risk stratification strategies for identifying individuals at risk for a specific cancer, multiple cancers, disease sequelae and correlated morbidities, and complex disease trajectories whose interventions may be complicated.
Our study of the relationship between PLS and cancer diagnosis, age-of-onset and PRS in the UK Biobank suggests that although PLS are primarily negatively associated with many cancers, these associations are not strong enough to be used to modify risk assessments in clinical contexts in their current state. Our study used the UK Biobank data to evaluate associations between PLS, cancer PRS and cancer diagnoses and age-of-onsets, but considered PLS and cancer PRS for these analyses that were derived independently of the UK Biobank, with the exception of the ‘tim’ PLS, to avoid overfitting. Despite this, there are several limitations to our study that should motivate further research. First, we did not consider an analysis involving the genetic basis of many modifiable risk factors that could impact susceptibility to, or protection from, cancer, such as diet, stress, exercise, smoking, etc. that themselves might be under genetic control.(54,55) Second, our study, like most GWAS-based studies, is only relevant to individuals of Northern European descent, and it is known that many cancers are more prevalent in individuals of non-European descent.(1) This is a well-documented criticism of PRS studies(56) and highlights the importance of conducting more diverse large-scale genomics research, inclusive of racially and ethnically diverse populations, such as the All-of-Us project, (57) before PRS (and PLS) can be exploited in a clinical setting.
Third, our analyses focused on common variants of the type used in most PRS calculations based on GWAS data and as such ignores rare variants that have been shown to enhance longevity and protect against diseases.(58-60) Fourth, most PRS have been derived from cross-sectional data (i.e., cases and controls) and yet are thought to be useful for both classifying individuals with and without cancer diagnoses, and predicting disease. Future analyses should consider the association of PLS and cancer development with longitudinal data that considers age, aging, long term exposure to environmental factors, and overall health trajectory to get a clearer picture of their interplay. Fifth, for reasons of scope, we did not consider detailed functional assessments of the individual genes harboring the variants, nor the linkage disequilibrium patterns they exhibit, in the various longevity and cancer PRS, which, if pursued in future studies, could shed light on the more basic biology behind the relationship between the longevity and cancer. We do, however, point out that many of the variants used to construct the 8 PLS we studied have been shown to be negatively correlated with other diseases, suggesting there are important biological relationships between PLS and PRS variants. For example, the ‘dl’ PLS we studied is made up of only two SNPs, but those SNPs are known to be associated with Alzheimer’s disease (AD), suggesting a relationship between multilocus influences on AD and cancers.(61,62)[See the supplementary material for SNP information.)
An additional issue concerns why variants associated with longevity used in the PLS we studied are not included in cancer PRS if, in fact, the PLS are associated with cancer diagnosis and cancer PRS. Although there are a number of explanations for this which should motivate further study, we believe this is ultimately due to small marginal effects each variant used to construct a cancer PRS (or PLS) has on cancer susceptibility and the fact that the nearly the same score for a PRS can be obtained with different combinations of genotypes. Small individual effects make it unlikely that all biologically relevant variants are identified in studies used to derive PRS (or PLS) from a single finite data set. Also, there is likely great locus heterogeneity underlying cancer susceptibility and most diseases, as well as lifespan. Finally, cancer PRS are often trained on cases that possess a number of susceptibility variants, against controls that do not, and, as such, the variants used in PRS that are predictive of cancer diagnosis (or longevity) act as surrogates for the presence of others. Despite these issues, our results suggest that genetically-mediated factors that contribute to cancer susceptibility are indeed complex, but can be teased apart to some degree by leveraging insights into the genetic basis of not only different cancers, but also health-positive traits such as longevity. More detailed and focused studies of specific genetically-mediated factors that might contribute to both cancer risk reduction and longevity, including studies involving longitudinal data, are needed.
Data Availability
All data is publicly available from the UK Biobank and/or the authors.
FINANCIAL DISCLOSURES
The authors declare no financial conflicts
Supplementary Material
ACKNOWLEDGEMENTS
This research was conducted with the UK Biobank Resource. The Project ID is 43036. Aspects of this work were funded in part by NIH grants UH2 AG064706, U19 AG023122, U24 AG051129, U24 AG051129-04S1; NSF grant (FAIN number) 2031819; Dell, Inc.; and the Ivy and Ottesen Foundations.