Abstract
Background Large language models (LLMs) have emerged as transformative technologies, revolutionizing natural language understanding and generation across various domains, including medicine. In this study, we investigated the capabilities, limitations, and generalizability of Generative Pre-trained Transformer (GPT) models in analyzing unstructured patient notes from large healthcare datasets to identify immune-related adverse events (irAEs) associated with the use of immune checkpoint inhibitor (ICI) therapy.
Methods We evaluated the performance of GPT-3.5, GPT-4, and GPT-4o models on manually annotated datasets of patients receiving ICI therapy, sampled from two electronic health record (EHR) systems and seven clinical trials. A zero-shot prompt was designed to exhaustively identify irAEs at the patient level (main analysis) and the note level (secondary analysis). The LLM-based system followed a multi-label classification approach to identify any combination of irAEs associated with individual patients or clinical notes. System evaluation was conducted for each available irAE as well as for broader categories of irAEs classified at the organ level.
Results Our analysis included 442 patients across three institutions. The most common irAEs manually identified in the patient datasets included pneumonitis (N=64), colitis (N=56), rash (N=32), and hepatitis (N=28). Overall, GPT models achieved high sensitivity and specificity but only moderate positive predictive values, reflecting a potential bias towards overpredicting irAE outcomes. GPT-4o achieved the highest F1 and micro-averaged F1 scores for both patient-level and note-level evaluations. Highest performance was observed in the hematological (F1 range=1.0-1.0), gastrointestinal (F1 range=0.81-0.85), and musculoskeletal and rheumatologic (F1 range=0.67-1.0) irAE categories. Error analysis uncovered substantial limitations of GPT models in handling textual causation, where adverse events should not only be accurately identified in clinical text but also causally linked to immune checkpoint inhibitors.
Conclusion The GPT models demonstrated generalizable abilities in identifying irAEs across EHRs and clinical trial reports. Using GPT models to automate adverse event detection in large healthcare datasets will reduce the burden on physicians and healthcare professionals by eliminating the need for manual review. This will strengthen safety monitoring and lead to improved patient care.
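The evaluation metrics named in the abstract (sensitivity, specificity, positive predictive value, F1, and micro-averaged F1 for a multi-label task) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the label list, and the four toy patients below are hypothetical and chosen only to show how per-irAE and micro-averaged scores could be computed from sets of gold and predicted labels.

```python
from typing import Dict, List, Set


def confusion_counts(
    gold: List[Set[str]], pred: List[Set[str]], labels: List[str]
) -> Dict[str, Dict[str, int]]:
    """Tally TP/FP/FN/TN per label across patients (or notes)."""
    counts = {lab: {"tp": 0, "fp": 0, "fn": 0, "tn": 0} for lab in labels}
    for g, p in zip(gold, pred):
        for lab in labels:
            if lab in g and lab in p:
                counts[lab]["tp"] += 1
            elif lab in p:
                counts[lab]["fp"] += 1
            elif lab in g:
                counts[lab]["fn"] += 1
            else:
                counts[lab]["tn"] += 1
    return counts


def safe_div(num: float, den: float) -> float:
    """Return 0.0 when the denominator is zero (a convention, not a rule)."""
    return num / den if den else 0.0


def label_metrics(c: Dict[str, int]) -> Dict[str, float]:
    """Sensitivity, specificity, PPV, and F1 for one label."""
    return {
        "sensitivity": safe_div(c["tp"], c["tp"] + c["fn"]),
        "specificity": safe_div(c["tn"], c["tn"] + c["fp"]),
        "ppv": safe_div(c["tp"], c["tp"] + c["fp"]),
        "f1": safe_div(2 * c["tp"], 2 * c["tp"] + c["fp"] + c["fn"]),
    }


def micro_f1(counts: Dict[str, Dict[str, int]]) -> float:
    """Pool TP/FP/FN over all labels, then compute a single F1."""
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    return safe_div(2 * tp, 2 * tp + fp + fn)


# Hypothetical toy data: one set of irAE labels per patient.
labels = ["colitis", "pneumonitis", "rash", "hepatitis"]
gold = [{"colitis"}, {"pneumonitis", "rash"}, set(), {"hepatitis"}]
pred = [{"colitis", "rash"}, {"pneumonitis"}, {"rash"}, {"hepatitis"}]

counts = confusion_counts(gold, pred, labels)
per_label = {lab: label_metrics(c) for lab, c in counts.items()}
overall = micro_f1(counts)

for lab in labels:
    print(lab, per_label[lab])
print("micro-averaged F1:", round(overall, 3))
```

In this toy example the two spurious "rash" predictions lower that label's positive predictive value and specificity without affecting the other labels, which mirrors the kind of overprediction pattern the Results describe.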
Competing Interest Statement
DBJ has served on advisory boards or as a consultant for AstraZeneca, BMS, The Jackson Laboratory, Merck, Mosaic ImmunoEngineering, Novartis, Pfizer, and Teiko; has received research funding from BMS and Incyte; and has patents pending for use of MHC-II as a biomarker for immune checkpoint inhibitor response and for abatacept as treatment for immune-related adverse events. EJP has served as a consultant for Janssen, Rapt, Servier, Espirion, Verve, Elion, and UpToDate outside of the submitted work, and receives royalties from UpToDate. GSC and RM are employees and shareholders of Roche. They received support for preparation of this manuscript and are co-inventors on patents filed by Roche related to use of atezolizumab. ZQ has consulted for Novartis and Sanofi.
Funding Statement
CAB, DBJ, and JB are supported by R01CA227481. CAB and JB are supported by R01HL156021. CAB is supported in part by R21HD113234. The UCSF team is supported by the Research Allocation Program grant. JS is supported by the NIH T32 fellowship (PCORT UroGynCan T32CA251072). EJP receives funding from R01HG010863, R01AI152183, U01AI154659, the NHMRC of Australia, SJS Research Fund and the Angela Anderson Research fund. ZQ is supported by the NIH NIDDK DiabDocs K12DK133995 and a Larry L Hillblom Foundation Start Up Grant. The Vanderbilt University Medical Center dataset used in this study is supported by numerous funding sources including UL1TR002243, UL1TR000445, and UL1RR024975. The analysis of Roche data included completed clinical trials funded by Roche.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
The institutional review boards at Vanderbilt University Medical Center and University of California San Francisco approved this study, and all internal processes were followed to enable secondary use of data from Roche-sponsored studies for this analysis.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes