ABSTRACT
The use of large language models (LLMs) such as ChatGPT for medical question-answering is becoming increasingly popular. However, there are concerns that these models may generate and amplify medical misinformation. Because cancer patients frequently seek to educate themselves through online resources, some individuals will likely use ChatGPT to obtain cancer treatment information. This study evaluated the performance and robustness of ChatGPT in providing breast, prostate, and lung cancer treatment recommendations that align with National Comprehensive Cancer Network (NCCN) guidelines. Four prompt templates were created to explore how differences in how the query is posed impacts response. ChatGPT output was scored by 3 oncologists and a 4th oncologist adjudicated in cases of disagreement. ChatGPT provided at least one NCCN-concordant recommendation for 102/104 (98%) prompts. However, 35/102 (34.3%) of these also included a recommendation that was at least partially non-concordant with NCCN guidelines. Responses varied based on prompt type. In conclusion, ChatGPT did not perform well at reliably and robustly providing cancer treatment recommendations. Patients and clinicians should be aware of the limitations of ChatGPT and similar technologies for self-education.
INTRODUCTION
Large language models (LLMs) underlying chatbots such as ChatGPT1 have a unique ability to mimic human language and quickly return detailed and coherent-seeming responses. Yet these properties might obscure the fact that they are providing inaccurate information. Because patients often turn to the internet for self-education,2 some will undoubtedly use ChatGPT for cancer-related medical information. This could lead ChatGPT to generate and amplify cancer treatment misinformation. There is thus an immediate need to assess ChatGPT’s performance on these kinds of questions. We evaluated the performance and robustness of ChatGPT to provide breast, prostate, and lung cancer treatment regimen recommendations that are concordant with National Comprehensive Cancer Network (NCCN)3 guidelines.
METHODS
We developed 4 zero-shot prompt templates to query treatment recommendations (Figure). Zero-shot prompts are prompts that do not provide examples of correct responses to guide the model’s output. Templates were used to create 4 prompts for each of 26 unique diagnosis descriptions (cancer types ± extent of disease modifiers relevant for each cancer) for a total of 104 prompts. Prompts were input to the gpt-3.5-turbo-0301 model via the ChatGPT API for inferencing.
We benchmarked against NCCN 2021 because ChatGPT was trained on data up to September 2021. Five scoring criteria were developed to assess guideline concordance (Table). The output did not have to recommend all possible regimens to be considered concordant; instead, the recommended treatment approach needed to be an NCCN option. Four board-certified oncologists scored output. Prompts were scored by 3 oncologists and majority rule was taken as the final score. In cases of complete disagreement, the oncologist who had not previously seen the output adjudicated.
All prompts, ChatGPT output, scores, and scoring guidelines are available at https://github.com/AIM-Harvard/ChatGPT_NCCN.
RESULTS
All 3 annotators agreed on 322/520 (61.9%) scores. The Table shows agreement between prompt templates and the distribution of scores across cancer type and extent of disease. The 4 prompts yielded the same scores for all criteria for 9/26 (34.6%) diagnosis descriptions. ChatGPT provided at least one recommendation for 102/104 (98%) prompts. All outputs with a recommendation included at least one NCCN-concordant treatment, but 35/102 (34.3%) of these outputs also recommended one or more non-concordant treatments.
Modalities were hallucinated (i.e., did not appear as part of any recommended treatment) in 13/104 (12.5%) outputs. These were primarily recommendations for localized treatment for advanced disease, and for targeted therapy or immunotherapy.
DISCUSSION
One-third of ChatGPT treatment recommendations were at least partially non-concordant with NCCN guidelines, and recommendations varied based on how the question was posed. The disagreement among annotators’ scores highlights the ambiguities and challenges of interpreting generative LLM output—another source of treatment confusion. More work is needed before these methods can be considered for medical question-answering, where both reliability and robustness are critical.
LLMs have been found to achieve a passing grade on the USMLE licensing exam,4 encode clinical knowledge5, and provide diagnoses better than laypeople.6 However, ChatGPT did not perform well at providing cancer treatment recommendations. Concerningly, ChatGPT was most likely to provide incorrect recommendations amongst correct recommendations, an insidious error mode difficult even for experts to detect.
Although this study evaluates a single model at a snapshot in time, it provides insight into areas of concern and future research needs. ChatGPT does not purport to be a medical device, and need not be held to such standards. However, patients and their families will likely use such technologies in their self-education, and this will impact shared decision-making and the patient-clinician relationship.2 Developers should have some responsibility to distribute technologies that do not cause harm, and patients and clinicians need to be aware of the limitations of these technologies.
Data Availability
All data produced are available online at https://github.com/AIM-Harvard/ChatGPT_NCCN
Disclosures
DSB: Associate Editor of Radiation Oncology, HemOnc.org (no financial compensation, unrelated to this work); Funding from American Association for Cancer Research (unrelated to this work).
Funding
The authors thank the Woods Foundation for their generous support of this work.