RT Journal Article SR Electronic T1 Assessing GPT-3.5 and GPT-4 in Generating International Classification of Diseases Billing Codes JF medRxiv FD Cold Spring Harbor Laboratory Press SP 2023.07.07.23292391 DO 10.1101/2023.07.07.23292391 A1 Soroush, Ali A1 Glicksberg, Benjamin S. A1 Zimlichman, Eyal A1 Barash, Yiftach A1 Freeman, Robert A1 Charney, Alexander W. A1 Nadkarni, Girish N A1 Klang, Eyal YR 2023 UL http://medrxiv.org/content/early/2023/07/11/2023.07.07.23292391.abstract AB Background Large Language Models (LLMs) like GPT-3.5 and GPT-4 are increasingly entering the healthcare domain as a proposed means to assist with administrative tasks. To ensure safe and effective use in billing coding tasks, it is crucial to assess these models’ ability to generate the correct International Classification of Diseases (ICD) codes from text descriptions. Objectives We aimed to evaluate GPT-3.5 and GPT-4’s capability to generate correct ICD billing codes, using the ICD-9-CM (2014) and the ICD-10-CM and ICD-10-PCS (2023) systems. Methods We randomly selected 100 unique codes from each of the most recent versions of the ICD-9-CM, ICD-10-CM, and ICD-10-PCS billing code sets published by the Centers for Medicare and Medicaid Services. Using the ChatGPT interface (GPT-3.5 and GPT-4), we prompted for the ICD code corresponding to each provided code description. Outputs were compared with the actual billing codes across several performance measures. Errors were qualitatively and quantitatively assessed for any underlying patterns. Results GPT-4 and GPT-3.5 demonstrated varied performance across the ICD systems. In ICD-9-CM, GPT-4 and GPT-3.5 achieved exact match rates of 22% and 10%, respectively. In ICD-10-CM, 13% (GPT-4) and 10% (GPT-3.5) of generated codes were exact matches. Notably, both models struggled considerably with the procedurally focused ICD-10-PCS, with neither GPT-4 nor GPT-3.5 producing any exactly matched codes.
A substantial number of incorrect codes had semantic similarity with the actual codes for ICD-9-CM (GPT-4: 60.3%, GPT-3.5: 51.1%) and ICD-10-CM (GPT-4: 70.1%, GPT-3.5: 61.1%), in contrast to ICD-10-PCS (GPT-4: 30.0%, GPT-3.5: 16.0%). Conclusion Our evaluation of GPT-3.5 and GPT-4’s proficiency in generating ICD billing codes from ICD-9-CM, ICD-10-CM, and ICD-10-PCS code descriptions reveals an inadequate level of performance. While the models appear to exhibit a general conceptual understanding of the codes and their descriptions, they have a propensity for hallucinating key details, suggesting underlying technological limitations of the base LLMs. These findings indicate a need for more rigorous LLM augmentation strategies and validation prior to their implementation in healthcare contexts, particularly in tasks such as ICD coding, which require significant digit-level precision. Competing Interest Statement BSG: no relevant conflicts of interest but is an employee of Character Biosciences. GN: Consultancy agreements with AstraZeneca, BioVie, GLG Consulting, Pensieve Health, Reata, Renalytix, Siemens Healthineers, and Variant Bio; research funding from Goldfinch Bio and Renalytix; honoraria from AstraZeneca, BioVie, Lexicon, Daiichi Sankyo, Menarini Health, and Reata; patents or royalties with Renalytix; owns equity and stock options in Pensieve Health and Renalytix as a scientific cofounder; owns equity in Verici Dx; has received financial compensation as a scientific board member and advisor to Renalytix; serves on the advisory board of Neurona Health; and serves in an advisory or leadership role for Pensieve Health and Renalytix.
All other authors: no conflicts of interest to declare. Funding Statement This study did not receive any funding. Author Declarations I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes. I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes. I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes. I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable. Yes. All data produced in the present study are available upon reasonable request to the authors.