The utility of ChatGPT for cancer treatment information ======================================================= * Shan Chen * Benjamin H. Kann * Michael B. Foote * Hugo JWL Aerts * Guergana K. Savova * Raymond H. Mak * Danielle S. Bitterman ## ABSTRACT The use of large language models (LLMs) such as ChatGPT for medical question-answering is becoming increasingly popular. However, there are concerns that these models may generate and amplify medical misinformation. Because cancer patients frequently seek to educate themselves through online resources, some individuals will likely use ChatGPT to obtain cancer treatment information. This study evaluated the performance and robustness of ChatGPT in providing breast, prostate, and lung cancer treatment recommendations that align with National Comprehensive Cancer Network (NCCN) guidelines. Four prompt templates were created to explore how differences in how the query is posed impacts response. ChatGPT output was scored by 3 oncologists and a 4th oncologist adjudicated in cases of disagreement. ChatGPT provided at least one NCCN-concordant recommendation for 102/104 (98%) prompts. However, 35/102 (34.3%) of these also included a recommendation that was at least partially non-concordant with NCCN guidelines. Responses varied based on prompt type. In conclusion, ChatGPT did not perform well at reliably and robustly providing cancer treatment recommendations. Patients and clinicians should be aware of the limitations of ChatGPT and similar technologies for self-education. ## INTRODUCTION Large language models (LLMs) underlying chatbots such as ChatGPT1 have a unique ability to mimic human language and quickly return detailed and coherent-seeming responses. Yet these properties might obscure the fact that they are providing inaccurate information. Because patients often turn to the internet for self-education,2 some will undoubtedly use ChatGPT for cancer-related medical information. This could lead ChatGPT to generate and amplify cancer treatment misinformation. There is thus an immediate need to assess ChatGPT’s performance on these kinds of questions. We evaluated the performance and robustness of ChatGPT to provide breast, prostate, and lung cancer treatment regimen recommendations that are concordant with National Comprehensive Cancer Network (NCCN)3 guidelines. ## METHODS We developed 4 zero-shot prompt templates to query treatment recommendations (Figure). Zero-shot prompts are prompts that do not provide examples of correct responses to guide the model’s output. Templates were used to create 4 prompts for each of 26 unique diagnosis descriptions (cancer types ± extent of disease modifiers relevant for each cancer) for a total of 104 prompts. Prompts were input to the gpt-3.5-turbo-0301 model via the ChatGPT API for inferencing. ![Figure.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/03/23/2023.03.16.23287316/F1.medium.gif) [Figure.](http://medrxiv.org/content/early/2023/03/23/2023.03.16.23287316/F1) Figure. Experimental design. Underlined text indicates where each diagnosis description was input into prompt template. Diagnosis descriptions consisted of cancer type (breast cancer, non-small cell lung cancer, small-cell lung cancer, and prostate cancer) with and without extents of disease relevant for each cancer type. A total of 26 disease descriptions were input into the prompt templates, for a total of 104 unique prompts. We benchmarked against NCCN 2021 because ChatGPT was trained on data up to September 2021. Five scoring criteria were developed to assess guideline concordance (Table). The output did not have to recommend all possible regimens to be considered concordant; instead, the recommended treatment approach needed to be an NCCN option. Four board-certified oncologists scored output. Prompts were scored by 3 oncologists and majority rule was taken as the final score. In cases of complete disagreement, the oncologist who had not previously seen the output adjudicated. View this table: [Table.](http://medrxiv.org/content/early/2023/03/23/2023.03.16.23287316/T1) Table. Scoring of ChatGPT treatment recommendations. All prompts, ChatGPT output, scores, and scoring guidelines are available at [https://github.com/AIM-Harvard/ChatGPT\_NCCN](https://github.com/AIM-Harvard/ChatGPT_NCCN). ## RESULTS All 3 annotators agreed on 322/520 (61.9%) scores. The Table shows agreement between prompt templates and the distribution of scores across cancer type and extent of disease. The 4 prompts yielded the same scores for all criteria for 9/26 (34.6%) diagnosis descriptions. ChatGPT provided at least one recommendation for 102/104 (98%) prompts. All outputs with a recommendation included at least one NCCN-concordant treatment, but 35/102 (34.3%) of these outputs also recommended one or more non-concordant treatments. Modalities were hallucinated (i.e., did not appear as part of any recommended treatment) in 13/104 (12.5%) outputs. These were primarily recommendations for localized treatment for advanced disease, and for targeted therapy or immunotherapy. ## DISCUSSION One-third of ChatGPT treatment recommendations were at least partially non-concordant with NCCN guidelines, and recommendations varied based on how the question was posed. The disagreement among annotators’ scores highlights the ambiguities and challenges of interpreting generative LLM output—another source of treatment confusion. More work is needed before these methods can be considered for medical question-answering, where both reliability and robustness are critical. LLMs have been found to achieve a passing grade on the USMLE licensing exam,4 encode clinical knowledge5, and provide diagnoses better than laypeople.6 However, ChatGPT did not perform well at providing cancer treatment recommendations. Concerningly, ChatGPT was most likely to provide incorrect recommendations amongst correct recommendations, an insidious error mode difficult even for experts to detect. Although this study evaluates a single model at a snapshot in time, it provides insight into areas of concern and future research needs. ChatGPT does not purport to be a medical device, and need not be held to such standards. However, patients and their families will likely use such technologies in their self-education, and this will impact shared decision-making and the patient-clinician relationship.2 Developers should have some responsibility to distribute technologies that do not cause harm, and patients and clinicians need to be aware of the limitations of these technologies. ## Data Availability All data produced are available online at [https://github.com/AIM-Harvard/ChatGPT\_NCCN](https://github.com/AIM-Harvard/ChatGPT_NCCN) [https://github.com/AIM-Harvard/ChatGPT\_NCCN](https://github.com/AIM-Harvard/ChatGPT_NCCN) ## Disclosures DSB: Associate Editor of Radiation Oncology, [HemOnc.org](http://HemOnc.org) (no financial compensation, unrelated to this work); Funding from American Association for Cancer Research (unrelated to this work). ## Funding The authors thank the Woods Foundation for their generous support of this work. ## Acknowledgements * Received March 16, 2023. * Revision received March 16, 2023. * Accepted March 23, 2023. * © 2023, Posted by Cold Spring Harbor Laboratory This pre-print is available under a Creative Commons License (Attribution-NonCommercial-NoDerivs 4.0 International), CC BY-NC-ND 4.0, as described at [http://creativecommons.org/licenses/by-nc-nd/4.0/](http://creativecommons.org/licenses/by-nc-nd/4.0/) ## References 1. 1.Introducing ChatGPT. Accessed March 10, 2023. [https://openai.com/blog/chatgpt](https://openai.com/blog/chatgpt) 2. 2.Arora VM, Madison S, Simpson L. Addressing Medical Misinformation in the Patient-Clinician Relationship. JAMA. 2020;324(23):2367–2368. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1001/jama.2020.4263&link_type=DOI) 3. 3.National comprehensive cancer network - home. NCCN. Accessed March 10, 2023. [https://www.nccn.org/Home](https://www.nccn.org/Home) 4. 4.Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digital Health. 2023;2(2):e0000198. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.pdig.0000198&link_type=DOI) 5. 5.Singhal K, Azizi S, Tu T, et al. Large Language Models Encode Clinical Knowledge. arXiv [csCL]. Published online December 26, 2022. [http://arxiv.org/abs/2212.13138](http://arxiv.org/abs/2212.13138) 6. 6.Levine DM, Tuwani R, Kompa B, et al. The diagnostic and triage accuracy of the GPT-3 artificial intelligence model. medRxiv. Published online February 1, 2023. doi:10.1101/2023.01.30.23285067 [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NzoibWVkcnhpdiI7czo1OiJyZXNpZCI7czoyMToiMjAyMy4wMS4zMC4yMzI4NTA2N3YxIjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjMvMDMvMjMvMjAyMy4wMy4xNi4yMzI4NzMxNi5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=)