PT - JOURNAL ARTICLE
AU - Bejan, Cosmin A.
AU - Reed, Amy M
AU - Mikula, Matthew
AU - Zhang, Siwei
AU - Xu, Yaomin
AU - Fabbri, Daniel
AU - Embí, Peter J.
AU - Hsi, Ryan S.
TI - Large Language Models Improve the Identification of Emergency Department Visits for Symptomatic Kidney Stones
AID - 10.1101/2024.08.12.24311870
DP - 2024 Jan 01
TA - medRxiv
PG - 2024.08.12.24311870
4099 - http://medrxiv.org/content/early/2024/08/13/2024.08.12.24311870.short
4100 - http://medrxiv.org/content/early/2024/08/13/2024.08.12.24311870.full
AB - Background: Recent advancements of large language models (LLMs) like Generative Pre-trained Transformer 4 (GPT-4) have generated significant interest among the scientific community. Yet, the potential of these models to be utilized in clinical settings remains largely unexplored. This study investigated the abilities of multiple LLMs and traditional machine learning models to analyze emergency department (ED) reports and determine if the corresponding visits were caused by symptomatic kidney stones.
Methods: Leveraging a dataset of manually annotated ED reports, we developed strategies to enhance the performance of GPT-4, GPT-3.5, and Llama-2 including prompt optimization, zero- and few-shot prompting, fine-tuning, and prompt augmentation. Further, we implemented fairness assessment and bias mitigation methods to investigate the potential disparities by these LLMs with respect to race and gender. A clinical expert manually assessed the explanations generated by GPT-4 for its predictions to determine if they were sound, factually correct, unrelated to the input prompt, or potentially harmful. 
The evaluation includes a comparison between LLMs, traditional machine learning models (logistic regression, extreme gradient boosting, and light gradient boosting machine), and a baseline system utilizing International Classification of Diseases (ICD) codes for kidney stones.
Results: The best results were achieved by GPT-4 (macro-F1=0.833, 95% confidence interval [CI]=0.826–0.841) and GPT-3.5 (macro-F1=0.796, 95% CI=0.796–0.796), both being statistically significantly better than the ICD-based baseline result (macro-F1=0.71). Ablation studies revealed that the initial pre-trained GPT-3.5 model benefits from fine-tuning when using the same parameter configuration. Adding demographic information and prior disease history to the prompts allows LLMs to make more accurate decisions. The evaluation of bias found that GPT-4 exhibited no racial or gender disparities, in contrast to GPT-3.5, which failed to effectively model racial diversity. The analysis of explanations provided by GPT-4 demonstrates advanced capabilities of this model in understanding clinical text and reasoning with medical knowledge.
Competing Interest Statement: The authors have declared no competing interest.
Funding Statement: This research was supported by the National Institutes of Health (NIH) grants R21DK127075, R21HD113234, and UL1TR002243. The authors are also deeply grateful to Chris and Helga Holland for their generous donation, which played a crucial role in facilitating this study. The NIH were not involved in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; or decision to submit the manuscript for publication. 
The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Author Declarations:
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below: The IRB of Vanderbilt University Medical Center gave ethical approval for this work.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable. Yes
Data contain protected health information and are not publicly available. The summary statistics extracted from the EHR data used in this study are provided in the manuscript and supplementary material.