PT - JOURNAL ARTICLE
AU - Swaminathan, Akshay
AU - Lopez, Ivan
AU - Wang, William
AU - Srivastava, Ujwal
AU - Tran, Edward
AU - Bhargava-Shah, Aarohi
AU - Wu, Janet Y
AU - Ren, Alexander
AU - Caoili, Kaitlin
AU - Bui, Brandon
AU - Alkhani, Layth
AU - Lee, Susan
AU - Mohit, Nathan
AU - Seo, Noel
AU - Macedo, Nicholas
AU - Cheng, Winson
AU - Liu, Charles
AU - Thomas, Reena
AU - Chen, Jonathan H.
AU - Gevaert, Olivier
TI - Selective prediction for extracting unstructured clinical data
AID - 10.1101/2022.11.15.22282368
DP - 2022 Jan 01
TA - medRxiv
PG - 2022.11.15.22282368
4099 - http://medrxiv.org/content/early/2022/11/18/2022.11.15.22282368.short
4100 - http://medrxiv.org/content/early/2022/11/18/2022.11.15.22282368.full
AB - Electronic health records represent a large data source for outcomes research, but the majority of EHR data is unstructured (e.g. free text of clinical notes) and not conducive to computational methods. While there are currently approaches to handle unstructured data, such as manual abstraction, structured proxy variables, and model-assisted abstraction, these methods are time-consuming, not scalable, and require clinical domain expertise. This paper aims to determine whether selective prediction, which gives a model the option to abstain from generating a prediction, can improve the accuracy and efficiency of unstructured clinical data abstraction. We trained selective prediction models to identify the presence of four distinct clinical variables in free-text pathology reports: primary cancer diagnosis of glioblastoma (GBM, n = 659), resection of rectal adenocarcinoma (RRA, n = 601), and two procedures for resection of rectal adenocarcinoma: abdominoperineal resection (APR, n = 601) and low anterior resection (LAR, n = 601). Data were manually abstracted from pathology reports and used to train L1-regularized logistic regression models using term-frequency-inverse-document-frequency features. 
Data points that the model was unable to predict with high certainty were manually abstracted. All four selective prediction models achieved a test-set sensitivity, specificity, positive predictive value, and negative predictive value above 0.91. The use of selective prediction led to sizable gains in automation (anywhere from 57% to 95% reduction in manual abstraction of charts across the four outcomes). For our GBM classifier, the selective prediction model saw improvements to sensitivity (0.94 to 0.96), specificity (0.79 to 0.96), PPV (0.89 to 0.98), and NPV (0.88 to 0.91) when compared to a non-selective classifier. Selective prediction using utility-based probability thresholds can facilitate unstructured data extraction by giving “easy” charts to a model and “hard” charts to human abstractors, thus increasing efficiency while maintaining or improving accuracy.
Competing Interest Statement: AS reports stock ownership in Roche (RHHVF). JHC reports royalties from Reaction Explorer LLC; consulting fees from National Institute of Drug Abuse Clinical Trials Network, Tuolc Inc, Roche Inc; and payment for expert testimony from Younker Hyde MacFarlane PLLC and Sutton Pierce. 
All other authors declare that they have no competing interests.
Funding Statement: This study did not receive any funding.
Author Declarations: I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes. The details of the IRB/oversight body that provided approval or exemption for the research described are given below: Ethics approval was granted through Stanford University IRB (#50031). I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes. I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes. I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable. Yes. All data needed to evaluate the conclusions are present in the paper and in the Supplementary Materials. The datasets generated and analyzed during the current study are not publicly available due to patient privacy but are available from the corresponding author (OG) on reasonable request.
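
The abstract describes selective prediction as giving a model the option to abstain, with uncertain charts sent to manual abstraction. A minimal sketch of that routing step follows, assuming the model outputs a probability per chart and that abstention is decided by a lower and an upper probability threshold. The function names, the threshold values (0.15 and 0.90), and the example probabilities are illustrative assumptions, not values from the study, and the utility-based procedure for choosing thresholds is not shown.

```python
from typing import List, Optional, Sequence


def selective_predict(
    probs: Sequence[float], lower: float, upper: float
) -> List[Optional[int]]:
    """Return 1, 0, or None (abstain) for each predicted probability.

    Hypothetical rule: label positive at or above `upper`, negative at or
    below `lower`, and abstain in between.
    """
    if not 0.0 <= lower <= upper <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= lower <= upper <= 1")
    decisions: List[Optional[int]] = []
    for p in probs:
        if p >= upper:
            decisions.append(1)
        elif p <= lower:
            decisions.append(0)
        else:
            decisions.append(None)  # "hard" chart: leave for a human
    return decisions


def automation_rate(decisions: Sequence[Optional[int]]) -> float:
    """Fraction of charts labeled by the model without manual abstraction."""
    if not decisions:
        return 0.0
    return sum(d is not None for d in decisions) / len(decisions)


def merge_with_manual(
    decisions: Sequence[Optional[int]], manual_labels: Sequence[int]
) -> List[int]:
    """Fill each abstention with the human abstractor's label."""
    return [m if d is None else d for d, m in zip(decisions, manual_labels)]


# Made-up probabilities and manual labels, for illustration only.
probs = [0.02, 0.10, 0.45, 0.60, 0.93, 0.99]
manual = [0, 0, 1, 0, 1, 1]

decisions = selective_predict(probs, lower=0.15, upper=0.90)
final_labels = merge_with_manual(decisions, manual)
```

In this toy example the two mid-range charts are routed to a human abstractor and the other four are labeled automatically. Moving the thresholds closer together would raise the automation rate at the likely cost of more model errors.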