PT - JOURNAL ARTICLE
AU - Andrew McMurry
AU - Amy R Zipursky
AU - Alon Geva
AU - Karen L Olson
AU - James Jones
AU - Vlad Ignatov
AU - Timothy Miller
AU - Kenneth D Mandl
TI - Moving Biosurveillance Beyond Coded Data: AI for Symptom Detection from Physician Notes
AID - 10.1101/2023.09.24.23295960
DP - 2023 Jan 01
TA - medRxiv
PG - 2023.09.24.23295960
4099 - http://medrxiv.org/content/early/2023/09/25/2023.09.24.23295960.short
4100 - http://medrxiv.org/content/early/2023/09/25/2023.09.24.23295960.full
AB - Background: Real-time surveillance of emerging infectious diseases necessitates a dynamically evolving, computable case definition, which frequently incorporates symptom-related criteria. For symptom detection, both population health monitoring platforms and research initiatives primarily depend on structured data extracted from electronic health records.
Objective: To validate and test an artificial intelligence (AI)-based natural language processing (NLP) pipeline for detecting COVID-19 symptoms from physician notes.
Methods: Subjects in this retrospective cohort study are patients 21 years old and younger who presented to a pediatric emergency department (ED) at a large academic children’s hospital between March 1, 2020 and May 31, 2022. ED notes for all patients were processed with an NLP pipeline tuned to detect the mention of 11 COVID-19 symptoms based on CDC criteria. For a gold standard, 3 subject matter experts labeled 226 ED notes and had strong agreement (F1=98.6; PPV=97.2; Recall=100.0). F1, PPV, and recall were used to compare the performance of both NLP and ICD-10 to the gold standard chart review. As a formative use case, variations in symptom patterns were measured across SARS-CoV-2 variant eras.
Results: There were 85,678 ED encounters during the study period, 4.0% of which involved patients with COVID-19. NLP was more accurate at identifying encounters with patients who had any of the COVID-19 symptoms (F1=79.6%) than ICD-10 codes (F1=45.1%).
NLP accuracy was higher for positive symptoms (recall=93%) than ICD-10 (recall=30%). However, ICD-10 accuracy was higher for negative symptoms (specificity=99.4%) than NLP (specificity=91.7%). Congestion or runny nose showed the largest accuracy difference: NLP F1=82.8%, ICD-10 F1=4.2%. Prevalence of NLP-detected symptoms among patients with COVID-19 differed across variant eras. Patients with COVID-19 were more likely to have each symptom than patients without the disease, and effect sizes (odds ratios) varied across pandemic eras.
Conclusions: This study establishes the value of AI-based NLP as a highly effective tool for real-time COVID-19 symptom detection in pediatric patients, outperforming traditional ICD-10 methods. It also reveals the evolving nature of symptom prevalence across different virus variants, underscoring the need for dynamic, technology-driven approaches in infectious disease surveillance.
Competing Interest Statement: The authors have declared no competing interest.
Funding Statement: Centers for Disease Control and Prevention of the U.S. Department of Health and Human Services (HHS) as part of a financial assistance award.
Training Grant from the National Institute of Child Health and Human Development T32HD040128.
Author Declarations: I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below: The Institutional Review Board (IRB) of Boston Children's Hospital has reviewed the protocol and determined that it qualifies as exempt.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable. Yes
All data produced in the present study are available upon reasonable request to the authors.
Abbreviations: AI, artificial intelligence; ED, emergency department; EHR, electronic health record; ICD-10, International Classification of Diseases, 10th Revision; NLP, natural language processing; PPV, positive predictive value.