Performance of a large language model (ChatGPT-3.5) for Pooled Cohort Equation estimation of atherosclerotic cardiovascular disease risk | medRxiv

Abstract

Despite demonstrated facility for arithmetic and other quantitative tasks, the performance of ChatGPT and other large language models for clinical risk calculation has yet to be assessed. Using synthetic patient data, this preliminary study aimed to assess the calibration, reproducibility, and potential for sociodemographic bias of ChatGPT-derived Pooled Cohort Equation (PCE) scores of atherosclerotic cardiovascular disease risk as compared to true scores. We found that ChatGPT-derived PCE scores, despite being moderately associated with the true PCE scores, were poorly calibrated against them and were unstable between repeated rounds of prompting, suggesting a lack of reproducibility. Moreover, ChatGPT-derived PCE scores appeared inappropriately sensitive to contextual indicators of the sociodemographic status of the synthetic patients in this study. Further work is needed to confirm these results, and to assess performance on a wider variety of prompts as well as in other settings beyond cardiovascular disease prevention where accurate risk calculation is also vital to appropriate clinical decision-making.

Figure. Underestimation of true PCE risk estimates (x-axis) by ChatGPT (y-axis) on synthetic patient data.

Introduction

Large language models (LLMs), including general-purpose systems such as ChatGPT as well as more specialized models such as Med-PaLM,¹ have shown remarkable facility for qualitative tasks in medicine such as question-answering and general clinical reasoning.

Despite the popular conception of ChatGPT and related publicly-available LLMs as mere chatbots, LLMs are in fact capable of tasks beyond question-answering, including arithmetic and mathematical reasoning, albeit with mixed results.² However, the performance characteristics of LLMs for quantitative clinical tasks, including clinical risk prediction, have yet to be assessed. In this preliminary report, we aim to characterize the calibration, reproducibility, and potential sociodemographic bias of ChatGPT-derived Pooled Cohort Equation³ (PCE) risk estimates of atherosclerotic cardiovascular disease (ASCVD) as compared to actual PCE risk estimates.

Methods

Synthetic individual-level data comprising a complete set of PCE predictor variables were randomly generated in R. These synthetic data were used in prompts to generate bulk ChatGPT estimates of PCE risk scores (ChatGPT-PCE scores) to assess the 3 domains of performance listed above:

Calibration: We prompted ChatGPT to generate PCE scores for n=500 unique synthetic patients (Prompt C) and compared these ChatGPT-PCE scores to actual PCE scores generated using the R package “PooledCohort”.
Reproducibility: We generated a new set of n=100 synthetic patients and produced 5 sets of ChatGPT-PCE scores for the same 100 patients by repeating Prompt C 5 times.
Bias: To assess the potential for bias along sociodemographic lines, we presented ChatGPT with Prompt C to generate ChatGPT-PCE scores for n=50 patients, followed by two prompts (Prompts B.1 and B.2) requesting that these 50 ChatGPT-PCE estimates be updated based on the assumed sociodemographic characteristics of these patients. Overall, this process produced 3 distinct sets of 50 ChatGPT-PCE scores, with each set corresponding to an assumed sociodemographic context.
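The true PCE scores in this study were computed in R with the “PooledCohort” package; the Python sketch below is not that code. It is a minimal illustration of how one synthetic patient might be drawn and scored, using the published PCE coefficients for a single stratum (white men, from Goff et al.³). The function names, sampling ranges, and random seed are hypothetical choices made for illustration only, and the other race and sex strata, which use different coefficients, are omitted.

```python
import math
import random

# Published PCE coefficients for the white-male stratum (Goff et al., 2013).
# Other race/sex strata use different coefficients and are omitted here.
COEF = {
    "ln_age": 12.344,
    "ln_tc": 11.853,
    "ln_age_x_ln_tc": -2.664,
    "ln_hdl": -7.990,
    "ln_age_x_ln_hdl": 1.769,
    "ln_sbp_treated": 1.797,
    "ln_sbp_untreated": 1.764,
    "smoker": 7.837,
    "ln_age_x_smoker": -1.795,
    "diabetes": 0.658,
}
MEAN_LINEAR_PREDICTOR = 61.18  # stratum mean of the coefficient-weighted sum
BASELINE_SURVIVAL = 0.9144     # baseline 10-year survival for this stratum


def pce_white_male(age, total_chol, hdl, sbp, bp_treated, smoker, diabetes):
    """Return 10-year ASCVD risk as a proportion (0-1) for one patient."""
    ln_age = math.log(age)
    ln_tc = math.log(total_chol)
    ln_hdl = math.log(hdl)
    ln_sbp = math.log(sbp)
    sbp_coef = COEF["ln_sbp_treated"] if bp_treated else COEF["ln_sbp_untreated"]
    total = (
        COEF["ln_age"] * ln_age
        + COEF["ln_tc"] * ln_tc
        + COEF["ln_age_x_ln_tc"] * ln_age * ln_tc
        + COEF["ln_hdl"] * ln_hdl
        + COEF["ln_age_x_ln_hdl"] * ln_age * ln_hdl
        + sbp_coef * ln_sbp
        + (COEF["smoker"] + COEF["ln_age_x_smoker"] * ln_age) * smoker
        + COEF["diabetes"] * diabetes
    )
    return 1.0 - BASELINE_SURVIVAL ** math.exp(total - MEAN_LINEAR_PREDICTOR)


def synthetic_patient(rng):
    """Draw one synthetic patient; the sampling ranges are assumptions."""
    return {
        "age": rng.randint(40, 79),           # years
        "total_chol": rng.randint(130, 320),  # mg/dL
        "hdl": rng.randint(20, 100),          # mg/dL
        "sbp": rng.randint(90, 200),          # mmHg
        "bp_treated": rng.random() < 0.5,
        "smoker": rng.random() < 0.5,
        "diabetes": rng.random() < 0.5,
    }


rng = random.Random(2023)  # arbitrary seed, for repeatability of the sketch
patients = [synthetic_patient(rng) for _ in range(500)]
true_scores = [pce_white_male(**p) for p in patients]

# Single worked case: 55-year-old nonsmoking, nondiabetic man with total
# cholesterol 213 mg/dL, HDL 50 mg/dL, and untreated systolic BP 120 mmHg.
example_risk = pce_white_male(55, 213, 50, 120, False, False, False)
print(f"Example 10-year risk: {example_risk:.1%}")
```

The worked case at the end yields a 10-year risk of roughly 5%, which gives a quick sanity check that the coefficients were transcribed correctly before any comparison against LLM output.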

ChatGPT-3.5 (9 May 2023 version) was used for all experiments. Calibration was assessed graphically and via the Pearson correlation coefficient, while analysis of variance (ANOVA) was applied across repeated rounds of prompting to test for changes in scores across rounds in the reproducibility and bias experiments. The text of Prompts C and B.1/B.2, together with representative R code, is provided in the Appendix.
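The two statistics named above can be sketched in a few lines. The following Python is illustrative only and is not the authors' R analysis: the function names and toy scores are hypothetical, and the ANOVA shown treats each prompting round as an independent group, which is a simplifying assumption. It returns only the F statistic and degrees of freedom; a p-value would additionally require the F distribution (for example via `scipy.stats.f_oneway`).

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of score groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical toy scores (proportions), NOT data from the study.
true_pce = [0.04, 0.09, 0.15, 0.22, 0.31]
llm_pce = [0.03, 0.05, 0.06, 0.12, 0.10]
r = pearson_r(true_pce, llm_pce)

# Three hypothetical prompting rounds for the same five patients.
rounds = [
    [0.03, 0.05, 0.06, 0.12, 0.10],
    [0.05, 0.04, 0.09, 0.10, 0.14],
    [0.02, 0.07, 0.05, 0.15, 0.09],
]
f_stat, df_b, df_w = one_way_anova_f(rounds)
print(f"r = {r:.2f}, F({df_b}, {df_w}) = {f_stat:.2f}")
```

Note that a high correlation does not imply good calibration: scores that are uniformly half the true risk would correlate perfectly while under-predicting every patient.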

Results

Compared to the actual PCE estimates for the 500 synthetic patients (Figure 1), calibration of the ChatGPT-PCE scores appeared poor. The ChatGPT-PCE score consistently under-predicted actual PCE scores, although the two sets of scores exhibited modest correlation (Pearson correlation coefficient 0.46, p < 0.001). Moreover, individual ChatGPT-PCE scores, when re-generated 5 times with identical synthetic data, did not appear reproducible, being significantly different across 5 attempts for each individual synthetic patient (one-way ANOVA p = 0.010) (Figure 2). Finally, when prompted to update a set of risk estimates based on whether the data were assumed to derive from a safety-net clinic (Prompt B.1) or from a clinic in an affluent suburb (Prompt B.2), ChatGPT-PCE risk estimates were revised significantly upwards, then downwards, respectively (one-way ANOVA p < 0.001) (Figure 3).

Figure 1.

Comparison of Pooled Cohort Equation (PCE) scores generated by ChatGPT (ChatGPT-PCE scores) to true PCE scores on individual synthetic patients. The blue line depicts the best-fit line, and the statistics in the upper left corner are those associated with that line. The dashed line depicts the 45-degree line associated with perfect calibration.

Figure 2.

Reproducibility of ChatGPT-PCE estimates for a subset (20 shown) of the n=50 synthetic patients. Each panel corresponds to one synthetic patient, with points denoting that patient's ChatGPT-PCE scores generated across 5 rounds of prompting. The dashed line denotes the true PCE risk estimate for that patient, while the red line denotes the 7.5% PCE threshold. Both substantial variability in ChatGPT scores and frequent reclassification with respect to the 7.5% threshold are observed.
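Reclassification with respect to the 7.5% threshold can be made concrete with a short check. The Python below is a hypothetical sketch, not the authors' analysis: the function names and the example scores are invented for illustration, and a score equal to the threshold is assumed to count as at or above it.

```python
THRESHOLD = 0.075  # the 7.5% PCE threshold, expressed as a proportion


def misclassified_rounds(true_score, llm_scores, threshold=THRESHOLD):
    """Flag each round whose score falls on the opposite side of the
    threshold from the true PCE score."""
    truth_high = true_score >= threshold
    return [(s >= threshold) != truth_high for s in llm_scores]


def crosses_threshold(llm_scores, threshold=THRESHOLD):
    """True if repeated rounds disagree with one another about which
    side of the threshold the patient falls on."""
    return len({s >= threshold for s in llm_scores}) > 1


# Hypothetical patient: true risk 9%, five rounds of LLM-derived scores.
true_score = 0.09
llm_scores = [0.04, 0.08, 0.06, 0.10, 0.05]

flags = misclassified_rounds(true_score, llm_scores)
print(f"{sum(flags)} of {len(flags)} rounds misclassified")
print("Rounds disagree with each other:", crosses_threshold(llm_scores))
```

Separating the two checks distinguishes disagreement with the true score (a calibration problem) from disagreement between rounds (a reproducibility problem).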

Figure 3.

Sensitivity of ChatGPT-PCE scores to additional context potentially indicative of sociodemographic status of synthetic patients (x-axis). Each point depicts the ChatGPT-PCE risk estimate for that patient under three different sets of context. The data were initially generated with no such context (“None”), then were updated under the assumption that the patients were treated in a safety-net clinic (“Safety Net”; Prompt B.1) then again updated under the assumption they derived from a clinic located in an affluent suburb of a Midwestern city (“Suburb”; Prompt B.2).

Discussion

This study found that ChatGPT produced poorly calibrated, and individually highly variable, estimates of ASCVD risk compared to those obtained via the true Pooled Cohort Equations. However, despite the demonstrated propensity of LLMs to “hallucinate” (fabricate) output, the ChatGPT-PCE estimates did exhibit significant correlation with the true PCE scores. Moreover, ChatGPT-PCE scores, upon re-prompting, appeared sensitive, and arguably unnecessarily so, to contextual indicators of patient sociodemographic status. Higher ChatGPT-PCE scores were generated for patients assumed to be treated at a safety-net clinic, while the same set of patients, this time assumed to be treated in a clinic in an affluent suburb, received far lower scores. These adjustments appeared to be performed in an idiosyncratic manner with no apparent justification (e.g. an adjustment factor, equation, or re-calibrated model) behind why individual scores were adjusted as observed.

Our study design presented ChatGPT with synthetic patient data examples to generate the ChatGPT-PCE risk estimates. Here, our approach relying on synthetic data forecloses the possibility, however improbable, that ChatGPT had simply memorized these particular data.

Nevertheless, it is not immediately clear why the ChatGPT-PCE scores appeared to carry at least some information about true PCE risk scores, given the moderate level of correlation observed between these two sets of scores. Further work remains to probe ChatGPT and other LLMs to understand the origins of this observation.

Our study is preliminary and has several limitations, which also present avenues for further work. First, it remains to be seen whether our results can be replicated by other LLMs, including ChatGPT-4, Anthropic’s Claude 2, and Google’s Bard, among others. Second, our prompt may not necessarily reflect how an LLM would be used to generate risk estimates in practice. Indeed, it may not be immediately clear why an LLM would be needed to generate risk estimates at all, given that risk calculators already exist and are readily available.

However, insofar as PCE risk estimates remain integral to decision-making for primary prevention of ASCVD,⁴ their accurate calculation is essential for systems interfacing with, and reasoning based on, patient data from encounters for ASCVD prevention. Many other settings beyond ASCVD prevention also depend on accurate risk estimation for appropriate clinical decision-making. Future work could assess LLM performance based on patient vignettes or prompts more reflective of actual practice. Ultimately, LLMs may rely on the ability to hook into an external source, such as a “code interpreter”,⁵ to interface with the appropriate risk calculator and directly compute the desired estimates. However, no such interfaces for clinical risk calculators yet exist, and so such abilities remain untested.

Given current efforts towards LLM-electronic health record integration, our preliminary findings may have broad implications. In particular, our finding that ChatGPT-PCE estimates carried at least some information regarding true estimated risks is surprising. From a safety perspective, this finding may also be concerning insofar as it demonstrates the potential for automation bias⁶ engendered by inappropriately-calibrated trust in quantitative output that ostensibly appears correct.⁷ Altogether, further work remains—not only to build on our preliminary results, but also to characterize the performance of LLMs on a wider variety of clinical risk calculators and to investigate methods with potential to improve their performance on these and related tasks, including chain-of-thought prompting⁸ and other approaches to prompting.

Data Availability

The data produced in the present study are available upon request.

Appendix to

Example Prompts

Prompt C

Prompt B.1

Prompt B.2

Example R code

References

1. Singhal K, Tu T, Gottweis J, et al. Towards Expert-Level Medical Question Answering with Large Language Models. Published online May 16, 2023. Accessed May 24, 2023. http://arxiv.org/abs/2305.09617
2. Frieder S, Pinchetti L, Chevalier A, et al. Mathematical Capabilities of ChatGPT. arXiv [csLG]. Published online January 31, 2023. http://arxiv.org/abs/2301.13867
3. Goff DC Jr., Lloyd-Jones DM, Bennett G, et al. 2013 ACC/AHA guideline on the assessment of cardiovascular risk: a report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines. Circulation. 2014;129(25 Suppl 2):S49–S73.
4. Arnett DK, Blumenthal RS, Albert MA, et al. 2019 ACC/AHA Guideline on the Primary Prevention of Cardiovascular Disease: A Report of the American College of Cardiology/American Heart Association Task Force on Clinical Practice Guidelines. Circulation. 2019;140(11):e596–e646.
5. Lu Y. What to Know About ChatGPT’s New Code Interpreter Feature. The New York Times. https://www.nytimes.com/2023/07/11/technology/what-to-know-chatgpt-code-interpreter.html. Published July 11, 2023. Accessed August 10, 2023.
6. Challen R, Denny J, Pitt M, Gompels L, Edwards T, Tsaneva-Atanasova K. Artificial intelligence, bias and clinical safety. BMJ Qual Saf. 2019;28(3):231–237.
7. Lee JD, See KA. Trust in automation: designing for appropriate reliance. Hum Factors. 2004;46(1). doi:10.1518/hfes.46.1.50_30392
8. Wei J, Wang X, Schuurmans D, et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Published online January 28, 2022. Accessed May 24, 2023. http://arxiv.org/abs/2201.11903

Posted August 16, 2023.

Subject Area

Health Informatics
