Abstract
Background Health care is experiencing a drive towards digitisation, and many countries are implementing national health data resources. Digital medicine promises to identify individuals at elevated risk of disease who may benefit from screening or interventions. This is particularly needed for cancer, where early detection improves outcomes. While a range of cancer risk models exists, the utility of population-wide electronic health databases for risk stratification across cancer types has not been fully explored.
Methods We use time-dependent Bayesian Cox Hazard models built on modern machine learning frameworks to scale the statistical approach to 6.7 million Danish individuals, covering 193 million life-years from 1978 to 2015. A set of 1,392 covariates from available clinical disease trajectories, text-mined basic health factors and family histories is used to train predictive models of 20 major cancer types. The models are validated on cancer incidence between 2015 and 2018 across Denmark and on 0.35 million individuals in the UK Biobank.
Findings The predictive performance of the models exceeded age-sex-based predictions in all but one cancer type. Models trained on Danish data perform similarly on the UK Biobank in a direct transfer without any additional retraining. Cancer risks are associated, in addition to heritable components, with a broad range of preceding diagnoses and health factors. The best overall performance was seen for cancers of the digestive system, as well as thyroid, kidney and uterine cancers. Risk-adapted cohorts may on average include 25% individuals younger than age-sex-based cohorts with similar incidence.
Interpretation Data available in national electronic health databases can be used to approximate cancer risk factors and enable risk predictions in most cancer types. Model predictions generalise between the Danish and UK health care systems and may help to enable cancer screening in younger age groups.
Funding Novo Nordisk Foundation.
Evidence before this study A number of cancer risk prediction algorithms based on genetics or family history, lifestyle and health factors, as well as diagnostic tests, have been developed to improve cancer screening by targeting individuals at increased risk. Many countries are assembling population-wide registries of electronic health records. Yet these resources do not necessarily encompass all the information required for currently available cancer risk models. It is therefore not yet clear how well national health data resources serve the purpose of population-wide cancer risk prediction and cancer screening, which factors and data types are most informative for cancer-specific and multi-cancer risk prediction, and whether such algorithms would transfer between national health care systems.
Added value of this study We developed risk prediction models for 20 major cancer types based on hospital admission records, family history of cancer cases, and some text-mined basic health factors across the Danish population from 1978 to 2015. The analysis shows that established and novel risk factors of different cancer types can be extracted from the vast amounts of data available in national health registries, facilitating accurate risk predictions. Further, validating the model on all adults residing in Denmark from 2015 to 2018 provides a unique opportunity to examine the potential of national-scale medical records for cancer risk prediction. Additionally, we validate the models in the UK Biobank, showing the transferability of the models across different health care systems. Lastly, we calculate that the information may facilitate earlier screening of individuals compared to an age-sex-based approach.
Implications of all the available evidence Our study shows that national electronic health databases can help to identify individuals at increased risk of cancer across many organ sites. Model parameters approximate important cancer risk factors related to alcohol, smoking, metabolic syndromes and the female reproductive system. The ability to identify subsets of the population earlier than with age-sex-based screening may improve the efficiency of current screening programs. The ability to predict a broad range of cancers may also benefit the implementation of new multi-cancer early detection tests, which are currently being trialled across the world.
Competing Interest Statement
SB reports personal fees from Intomics and Proscion. EB is a paid consultant of Oxford Nanopore. All other authors declare no competing interests.
Funding Statement
This work was supported by grant NNF17OC0027594 from the Novo Nordisk Foundation.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
Permissions for the work were obtained from the Danish Patient Safety Authority (3-3013-1731/1) and the Danish Health Data Authority (FSEID-00003092, FSEID-00003724, FSEID-00005633).
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
Data Availability
Danish registry data are available for use in secure, dedicated environments via application to the Danish Patient Safety Authority and the Danish Health Data Authority. UK Biobank data are available to verified researchers on application at http://www.ukbiobank.ac.uk/using-the-resource/. Code is available on https://github.com/gerstung-lab/CancerRisk