RT Journal Article
SR Electronic
T1 Selective prediction for extracting unstructured clinical data
JF medRxiv
FD Cold Spring Harbor Laboratory Press
SP 2022.11.15.22282368
DO 10.1101/2022.11.15.22282368
A1 Swaminathan, Akshay
A1 Lopez, Ivan
A1 Wang, William
A1 Srivastava, Ujwal
A1 Tran, Edward
A1 Bhargava-Shah, Aarohi
A1 Wu, Janet Y
A1 Ren, Alexander
A1 Caoili, Kaitlin
A1 Bui, Brandon
A1 Alkhani, Layth
A1 Lee, Susan
A1 Mohit, Nathan
A1 Seo, Noel
A1 Macedo, Nicholas
A1 Cheng, Winson
A1 Liu, Charles
A1 Thomas, Reena
A1 Chen, Jonathan H.
A1 Gevaert, Olivier
YR 2022
UL http://medrxiv.org/content/early/2022/11/18/2022.11.15.22282368.abstract
AB Electronic health records (EHRs) represent a large data source for outcomes research, but the majority of EHR data is unstructured (e.g., free text of clinical notes) and not conducive to computational methods. Existing approaches to handling unstructured data, such as manual abstraction, structured proxy variables, and model-assisted abstraction, are time-consuming and not scalable, and they require clinical domain expertise. This paper aims to determine whether selective prediction, which gives a model the option to abstain from generating a prediction, can improve the accuracy and efficiency of unstructured clinical data abstraction. We trained selective prediction models to identify the presence of four distinct clinical variables in free-text pathology reports: primary cancer diagnosis of glioblastoma (GBM, n = 659), resection of rectal adenocarcinoma (RRA, n = 601), and two procedures for resection of rectal adenocarcinoma: abdominoperineal resection (APR, n = 601) and low anterior resection (LAR, n = 601). Data were manually abstracted from pathology reports and used to train L1-regularized logistic regression models using term frequency-inverse document frequency (TF-IDF) features. Data points that the model was unable to predict with high certainty were manually abstracted. All four selective prediction models achieved a test-set sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) above 0.91. The use of selective prediction led to sizable gains in automation (a 57% to 95% reduction in manual chart abstraction across the four outcomes). For our GBM classifier, the selective prediction model showed improvements in sensitivity (0.94 to 0.96), specificity (0.79 to 0.96), PPV (0.89 to 0.98), and NPV (0.88 to 0.91) when compared to a non-selective classifier.
Selective prediction using utility-based probability thresholds can facilitate unstructured data extraction by giving “easy” charts to a model and “hard” charts to human abstractors, thus increasing efficiency while maintaining or improving accuracy. Competing Interest Statement: AS reports stock ownership in Roche (RHHVF). JHC reports royalties from Reaction Explorer LLC; consulting fees from National Institute of Drug Abuse Clinical Trials Network, Tuolc Inc, Roche Inc; and payment for expert testimony from Younker Hyde MacFarlane PLLC and Sutton Pierce. All other authors declare that they have no competing interests. Funding Statement: This study did not receive any funding. Author Declarations: I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes. The details of the IRB/oversight body that provided approval or exemption for the research described are given below: Ethics approval was granted through Stanford University IRB (#50031). I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes. I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov.
I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes. I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable. Yes. All data needed to evaluate the conclusions are present in the paper and in the Supplementary Materials. The datasets generated and analyzed during the current study are not publicly available due to patient privacy but are available from the corresponding author (OG) on reasonable request.