Abstract
Large language models (LLMs) are increasingly being explored for their potential to simulate clinical reasoning. Here, we demonstrate our initial experience using the GPT-4o LLM along with prompt engineering and knowledge retrieval to develop EndoGPT, a clinical decision support tool for the management of thyroid nodules. In a pilot study of 50 cases, EndoGPT demonstrated an 83% concordance rate with expert surgeons’ assessments and plans. The highest concordance was in diagnosis (93%), followed by the need for an operation (82%) and type of operation (69%). This work suggests that LLM-based assistants may play a useful role in assisting clinicians in the future.
Introduction
Though large language models (LLMs) can answer medical questions, their ability to simulate clinical reasoning remains under active investigation. Recent technical advances allow LLMs to be optimized using prompt engineering and knowledge retrieval from data sources, even without specific fine-tuning.1,2 Here, we describe our implementation of these techniques to prototype an LLM-based clinical decision support tool for the management of thyroid nodules.
Methods
We abstracted deidentified data from clinic notes of patients referred for evaluation of thyroid nodules or thyroid cancer. We built an assistant (EndoGPT) based on the GPT-4o LLM that could ingest these data and output a predicted assessment and plan (A&P). To provide EndoGPT with additional context, we uploaded the 2015 American Thyroid Association Management Guidelines for Thyroid Nodules and Differentiated Thyroid Cancer as a reference.3 EndoGPT could then retrieve relevant components of the guidelines using vector embeddings and similarity search techniques.4 For each patient scenario, we generated five predicted A&Ps and ensembled them into a compound A&P using a second assistant. After pre-testing EndoGPT on 25 patient scenarios, we analyzed errors, wrote instructions to avoid them, and added these instructions to EndoGPT’s prompt before testing it on new scenarios (Figure 1).
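The retrieval step described above can be illustrated with a minimal sketch. This is not EndoGPT's actual implementation: the bag-of-words `embed` function is a toy stand-in for a learned embedding model, the function names are hypothetical, and the passages in `GUIDELINE_CHUNKS` are placeholders rather than quotations from the guidelines. The sketch shows only the general idea of ranking reference passages by cosine similarity to a query.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words counts (assumption: stands in for a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query, most similar first."""
    query_vec = embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, embed(chunk)),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical placeholder passages, NOT text from the guidelines.
GUIDELINE_CHUNKS = [
    "placeholder passage about fine needle aspiration of thyroid nodules",
    "placeholder passage about lymph node dissection in thyroid cancer",
    "placeholder passage about follow-up ultrasound intervals",
]

top = retrieve("thyroid nodule fine needle aspiration", GUIDELINE_CHUNKS, k=1)
print(top)
```

In a deployed system, the retrieved passages would be appended to the prompt alongside the patient scenario so that the model can ground its A&P in the reference text.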
To evaluate EndoGPT, we measured concordance between the expert-generated and the predicted A&Ps across three domains: (1) diagnosis, (2) need for an operation, and (3) type of operation (Figure 1). This study was deemed exempt by the Columbia University Institutional Review Board (Protocol AAAV1151). Our code is available on GitHub.
Results
We tested EndoGPT on 50 patient scenarios and achieved an overall concordance of 83%. EndoGPT agreed with the expert’s diagnosis completely in 44/50 cases and partially in 5/50 cases (93% concordant). Moreover, the assistant agreed with the expert’s need for an operation in 41/50 cases (82% concordant). When the expert recommended surgery (n=36 cases), the assistant agreed with the expert’s choice of operation completely in 24 cases and partially in two cases (69% concordant) (Figure 2). Details on the differences in A&Ps are described in Table S1.
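The reported percentages can be reproduced from the counts above under two assumptions that the text does not state explicitly: partial agreement is given half credit, and overall concordance is pooled across all 136 domain-level assessments (50 + 50 + 36). The following sketch, with hypothetical names, shows that arithmetic.

```python
def concordance(complete, partial, total, partial_weight=0.5):
    """Concordance rate with partial agreement weighted (assumption: half credit)."""
    return (complete + partial_weight * partial) / total


# (complete, partial, total) counts as reported in the Results.
domains = {
    "diagnosis": (44, 5, 50),
    "need_for_operation": (41, 0, 50),
    "type_of_operation": (24, 2, 36),
}

rates = {name: concordance(*counts) for name, counts in domains.items()}

# Assumption: overall concordance pools weighted agreements over all assessments.
weighted_agreements = sum(c + 0.5 * p for c, p, _ in domains.values())
total_assessments = sum(t for _, _, t in domains.values())
overall = weighted_agreements / total_assessments

for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
print(f"overall: {overall:.1%}")
```

Under these assumptions the rates round to 93%, 82%, 69%, and 83% overall, matching the figures reported here.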
Discussion
Our early experience with EndoGPT suggests that surgeons who may not have the technical resources to build their own LLMs can still use general-purpose models like GPT-4o to develop clinical decision support tools. We achieved an 83% concordance with expert A&Ps using knowledge retrieval and prompt engineering.
Our model was most concordant when predicting a diagnosis and least concordant when suggesting a specific operation. Recurring areas of discordance were the type of lymph node dissection (LND) recommended (e.g., EndoGPT did not assign a laterality to central LND) and the recommendation of surgery for benign nodules causing compressive symptoms (rather than performing fine needle aspiration). The latter may have occurred because we gave EndoGPT specific feedback during pretesting to consider surgery for benign, compressive nodules, highlighting the risk of over-prompting the model. Because we tested concordance with a single expert A&P, it is possible that in some discordant cases EndoGPT suggested a safe alternative approach. Thus, we may be underestimating EndoGPT’s overall accuracy. In future experiments, a panel of experts could assess EndoGPT’s responses for accuracy.
Though not intended to replace physician evaluation, tools like EndoGPT may help train surgical residents, assist non-specialist providers with initial workup and management, or make technical documents such as guidelines more accessible to patients. Utility will likely be greatest in areas of medicine where clear guidelines already exist. Further studies will be needed to fully optimize this system for patient care.