Abstract
Purpose Pancreatic Ductal Adenocarcinoma (PDAC) screening can enable detection of early-stage disease and long-term survival. Current guidelines are based on inherited predisposition; only about 10% of PDAC cases meet screening eligibility criteria. Electronic Health Record (EHR) risk models for the general population promise to identify a high-risk cohort that would expand the currently screened population. Using EHR data from a multi-institutional federated network, we developed and validated a PDAC risk prediction model for the general US population.
Methods We developed Neural Network (NN) and Logistic Regression (LR) models on structured, routinely collected EHR data from 55 US Health Care Organizations (HCOs). Our models used sex, age, frequency of clinical encounters, diagnoses, lab tests, and medications to predict PDAC risk 6-18 months before diagnosis. Model performance was assessed using Receiver Operating Characteristic (ROC) curves and calibration plots. Models were externally validated across location, race, and time, with performance assessed using Area Under the Curve (AUC). We further simulated model deployment, evaluating sensitivity, specificity, Positive Predictive Value (PPV), and Standardized Incidence Ratio (SIR). We calculated SIR relative to SEER data for the general population with matched demographics.
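The deployment metrics named above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `deployment_metrics`, its parameters, and the `expected_cases` input (the number of cases a demographically matched reference population, such as one derived from SEER rates, would be expected to produce in the flagged group over the same follow-up) are assumptions made for illustration. The SIR is taken as observed cases in the high-risk group divided by that expected count.

```python
from dataclasses import dataclass


@dataclass
class DeploymentMetrics:
    sensitivity: float
    specificity: float
    ppv: float
    sir: float


def deployment_metrics(risk_scores, labels, threshold, expected_cases):
    """Summarize a simulated deployment at one risk threshold.

    risk_scores    -- model outputs, one per patient
    labels         -- 1 if the patient developed PDAC during follow-up, else 0
    threshold      -- patients with score >= threshold are flagged high-risk
    expected_cases -- hypothetical input: cases expected in the flagged group
                      under reference-population incidence rates
    """
    tp = fp = tn = fn = 0
    for score, label in zip(risk_scores, labels):
        flagged = score >= threshold
        if flagged and label:
            tp += 1
        elif flagged:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1

    nan = float("nan")
    sensitivity = tp / (tp + fn) if (tp + fn) else nan
    specificity = tn / (tn + fp) if (tn + fp) else nan
    ppv = tp / (tp + fp) if (tp + fp) else nan
    # SIR: observed cases among flagged patients over expected cases.
    sir = tp / expected_cases if expected_cases else nan
    return DeploymentMetrics(sensitivity, specificity, ppv, sir)
```

Sweeping `threshold` over a range of values trades sensitivity against SIR: a stricter threshold flags fewer patients, which typically raises the SIR of the flagged group while lowering sensitivity.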
Results The final dataset included 63,884 PDAC cases and 3,604,863 controls aged 40 to 97.4 years. Our best performing NN model obtained an AUC of 0.829 (95% CI: 0.821 to 0.837) on the test set. Calibration plots showed good agreement between predicted and observed risks. Race-based external validation (trained on four races, tested on the fifth) AUCs of NN were 0.836 (95% CI: 0.797 to 0.874), 0.838 (95% CI: 0.821 to 0.855), 0.824 (95% CI: 0.819 to 0.830), 0.842 (95% CI: 0.750 to 0.934), and 0.774 (95% CI: 0.771 to 0.777) for AIAN, Asian, Black, NHPI, and White, respectively. Location-based external validation (trained on three locations, tested on the fourth) AUCs of NN were 0.751 (95% CI: 0.746 to 0.757), 0.749 (95% CI: 0.745 to 0.753), 0.752 (95% CI: 0.748 to 0.756), and 0.722 (95% CI: 0.713 to 0.732) for Midwest, Northeast, South, and West, respectively. Average temporal external validation (trained on data prior to certain dates, tested on data after a date) AUC of NN was 0.784 (95% CI: 0.763 to 0.805). Simulated deployment on the test set, with a mean follow-up of 2.00 (SD 0.39) years, demonstrated SIRs ranging from 2.42 to 83.5 for NN, depending on the chosen risk threshold. At an SIR of 5.44, which exceeds the current threshold for inclusion into PDAC screening programs, NN sensitivity was 35.5% (specificity 95.6%), which is 3.5 times the sensitivity of current screening criteria based on inherited predisposition to PDAC. At a chosen high-risk threshold with a lower SIR, specificity was about 85%, and both models exhibited sensitivities above 50%.
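The race-based and location-based external validations both hold out one group at a time: train on all groups but one, then test on the held-out group. The following Python sketch shows only that splitting pattern. The helper name `leave_one_group_out` and the example labels are hypothetical, and model training and AUC computation are deliberately left out.

```python
def leave_one_group_out(groups):
    """Yield (held_out, train_idx, test_idx) once per distinct group.

    groups -- one group label per patient (for example a region or race label)
    """
    for held_out in sorted(set(groups)):
        # Train on every patient outside the held-out group ...
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        # ... and test only on patients inside it.
        test_idx = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train_idx, test_idx


if __name__ == "__main__":
    # Hypothetical region label for each of five patients.
    regions = ["Midwest", "South", "West", "South", "Northeast"]
    for held_out, train_idx, test_idx in leave_one_group_out(regions):
        print(held_out, train_idx, test_idx)
```

Each yielded split would be used to fit a fresh model on `train_idx` and report one AUC on `test_idx`, giving one AUC per held-out group as in the results above.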
Conclusions Our models demonstrate good accuracy and generalizability across populations from diverse geographic locations, races, and over time. At comparable risk levels, these models can predict up to three times as many PDAC cases as current screening guidelines. They can therefore be used to identify high-risk individuals, overlooked by current guidelines, who may benefit from PDAC screening or from inclusion in an enriched group for further testing, such as biomarker testing. Our integration with the federated network provided access to data from a large, geographically and racially diverse patient population, as well as a pathway to future clinical deployment.
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
LA acknowledges support from the Prevent Cancer Foundation for this work. MR, LA, KJ acknowledge the contribution of resources by TriNetX, including secured laptop computers, access to the TriNetX EHR database, and clinical, technical, legal, and administrative assistance from the TriNetX team of clinical informaticists, engineers, and technical staff. MR and KJ received funding from DARPA and Boeing. MR also received funding from the NSF, Aarno Labs, and Boeing. During the time the research was performed MR consulted for Comcast, Google, Motorola, and Qualcomm.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
Any data displayed on the TriNetX federated network database platform in aggregate form, or any patient-level data provided in a data set generated by the federated network database platform, contains only de-identified data as per the de-identification standard defined in the Health Insurance Portability and Accountability Act Privacy Rule. The process by which the data is de-identified is attested to through a formal determination by a qualified expert as defined in the Health Insurance Portability and Accountability Act Privacy Rule. This formal determination by a qualified expert supersedes the need for the previous waiver of TriNetX from the Western Institutional Review Board. Because this study used only de-identified patient records and did not involve the collection, use, or transmittal of individually identifiable data, this study was exempted from Institutional Review Board approval.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
Footnotes
jiakai{at}mit.edu
rinard{at}csail.mit.edu
steve.kundrot{at}trinetx.com
matvey.palchuk{at}trinetx.com
jeff.warnick{at}trinetx.com
kathryn.haapala{at}trinetx.com
ikaplan{at}bidmc.harvard.edu
lappelb1{at}bidmc.harvard.edu
⋆ Co-senior authors.
Data Availability
The de-identified data in the TriNetX federated network database can only be accessed by researchers who are either part of the network or have a collaboration agreement with TriNetX. As stated in the manuscript, we accessed data as part of a no-cost collaboration agreement between BIDMC, MIT, and TriNetX.