Abstract
Recent breakthroughs in large language models (LLMs) have led to their rapid dissemination and widespread use. One early application has been to medicine, where LLMs have been investigated to streamline clinical workflows and facilitate clinical analysis and decision-making. However, a leading barrier to the deployment of Artificial Intelligence (AI), and of LLMs in particular, has been concern about embedded gender and racial biases. Here, we evaluate whether a leading LLM, ChatGPT 3.5, exhibits gender and racial bias in clinical management of acute coronary syndrome (ACS). We find that specifying patients as female, African American, or Hispanic resulted in a decrease in guideline recommended medical management, diagnosis, and symptom management of ACS. Most notably, the largest disparities were seen in the recommendation of coronary angiography or stress testing for the diagnosis and further intervention of ACS and in the recommendation of high intensity statins. These disparities correlate with biases that have been observed clinically and have been implicated in the differential gender and racial morbidity and mortality outcomes of ACS and coronary artery disease. Furthermore, we find that the largest disparities are seen during unstable angina, where fewer explicit clinical guidelines exist. Finally, we find that asking ChatGPT 3.5 to explain its reasoning prior to providing an answer improves clinical accuracy and mitigates instances of gender and racial biases. This is among the first studies to demonstrate that the gender and racial biases that LLMs exhibit do in fact affect clinical management. Additionally, we demonstrate that existing strategies for improving LLM performance not only improve performance in clinical management, but can also be used to mitigate gender and racial biases.
Advances in Artificial Intelligence (AI) have led to the rapid dissemination and widespread use of large language models (LLMs)1–4. The field of medicine has sought to harness new LLMs to enhance clinical workflow and to augment clinical analysis and decision-making5–7. A barrier to the deployment of AI in healthcare is inherent biases that pervade the underlying content used to train language-based models8,9. This vulnerability in LLMs could result in the systematic propagation of gender and racial biases in medical settings, leading to a worsening of existing gender and racial disparities in health management and outcomes1. Here, we investigated whether LLMs exhibit gender and racial bias when applied to clinical decision making in Cardiology. We found that a leading LLM, ChatGPT 3.5, exhibits differential decision making based on race and gender that is not supported by existing evidence-based literature. These gender and racial biases are not dissimilar to those that have previously been observed in clinical practice and that have been shown to have a detrimental effect on health outcomes. Notably, we found that prompting these models to explain and elaborate on their recommendations can mitigate bias. This is among the first examples to show that gender and racial bias within LLMs can affect clinical management10. Our work identifies a critical barrier to the deployment of LLMs and proposes strategies to mitigate bias in language-based artificial intelligence.
Ischemic heart disease is a leading cause of death worldwide, contributing to >9 million deaths in 201611. The management of acute coronary syndrome (ACS) requires timely and appropriate assessment of the need for coronary angiography or stress testing with subsequent intervention and initiation of aspirin, anticoagulation, high dose statins, and symptom management12. LLMs have been proposed as an untapped opportunity to facilitate clinical decision making. We evaluate the capability of an LLM to manage ACS and investigate whether it exhibits gender and racial biases.
We prompted the LLM, ChatGPT 3.5, with a series of cases that span the spectrum of ACS severity, including STEMI (ST-elevation myocardial infarction), NSTEMI (non-ST-elevation myocardial infarction), and unstable angina (Supplement 1). We then permuted race (Caucasian, African American, Hispanic, none) and gender (female, male, none) as patient descriptors in the prompt (n=200). We observed differences in medical management, diagnostic work-up and interventions, and symptom management that can be attributed to specifying the gender or race of a patient (Table 1). The introduction of “female” into a prompt resulted in a decrease in the recommendation of coronary angiography or stress test (530 (88.3%) vs 578 (96.3%), n=600, p value <0.001), nitroglycerin (91 (15.2%) vs 144 (24.0%), n=600, p value <0.001), high intensity statins (307 (51.2%) vs 403 (67.2%), n=600, p value <0.001), and beta blockers (549 (91.6%) vs 586 (97.7%), n=600, p value <0.001) (Table 1). When the results of the female-specified prompts were compared to prompts where patients were specified as male, the disparity widened in the recommendation of both coronary angiography or stress test (530 (88.3%) vs 595 (99.2%), n=600, p value <0.001) and high intensity statins (307 (51.2%) vs 451 (75.2%), n=600, p value <0.001) (Figure 1A, Figure 2A).
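As a rough check on comparisons of this kind, the reported counts can be arranged in a 2×2 table (recommended or not, by prompt condition) and tested with Pearson's chi-squared statistic. The sketch below is not the authors' analysis code: the helper name `pearson_chi2_2x2` is hypothetical, and it assumes no continuity correction, which the text does not specify. It uses the female versus control counts for coronary angiography or stress test (530 vs 578 of 600).

```python
import math


def pearson_chi2_2x2(yes_a, n_a, yes_b, n_b):
    """Pearson chi-squared test (1 degree of freedom, no continuity
    correction) comparing two 'recommended' proportions.
    Hypothetical helper, written for illustration only."""
    a, b = yes_a, n_a - yes_a  # condition A: recommended / not recommended
    c, d = yes_b, n_b - yes_b  # condition B: recommended / not recommended
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Counts reported in the text: female-specified prompts vs control prompts.
chi2, p = pearson_chi2_2x2(530, 600, 578, 600)
print(f"chi2 = {chi2:.2f}, significant at 0.001: {p < 0.001}")
```

Under these assumptions the statistic is about 27.1, well beyond the 0.001 threshold, which agrees with the significance reported for this comparison.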
Figure 1. A. ChatGPT 3.5 exhibits gender biases in the management of ACS. ChatGPT 3.5 was given prompts that described ST-elevation myocardial infarction (STEMI), non-ST-elevation myocardial infarction (NSTEMI), and unstable angina and was asked to select coronary angiography, coronary angiography with fractional flow reserve calculation (FFR), stress testing, or medication management as a diagnostic work-up. Across ACS presentations, ChatGPT 3.5 recommended coronary angiography with or without FFR and stress testing at a lower frequency when patients were specified as “female” as opposed to “male” (530 (88.3%) vs 595 (99.2%), n=600, p value <0.001). B. ChatGPT 3.5 demonstrates racial biases in the clinical management of ACS diagnosis in both African American (495 (82.5%) vs 577 (96.2%), n=600, p value <0.001) and Hispanic patients (544 (90.7%) vs 577 (96.2%), n=600, p value <0.001). Prompts where the patient was specified as African American received the lowest frequency of guideline-recommended diagnostic work-up. C. Females are less likely to be recommended for coronary angiography with FFR as compared to their male counterparts (411 (68.5%) vs 470 (78.3%), n=600, p value <0.001). Notably, the greatest discrepancy exists during unstable angina, where the guidelines to obtain coronary angiography with or without FFR are less explicit as compared to STEMIs or NSTEMIs. D. African American and Hispanic patients similarly are less likely to be recommended for coronary angiography with FFR as compared to their Caucasian counterparts. African American patients during unstable angina events received the lowest frequency of coronary angiography with FFR across all ACS scenarios and gender or race permutations.
Figure 2. A. ChatGPT 3.5 exhibits gender biases in the recommendation of high intensity statin, Atorvastatin 80mg, after ACS (307 (51.2%) vs 403 (67.2%), n=600, p value <0.001). Notably, the discrepancy in statin prescription and usage between females and males has been previously documented. Furthermore, the discrepancy in statin recommendation was the only bias seen in ACS medical management, with aspirin and heparin being recommended equally across genders and the control. B. Prompts where the patient was specified as African American or Hispanic resulted in a decrease in the recommendation of high intensity statin, Atorvastatin 80mg, after ACS. Hispanic patients received the lowest frequency of high intensity statin, Atorvastatin 80mg, across all conditions (281 (46.8%), n=600). C. ChatGPT 3.5 also exhibits bias in the recommendation of primary prevention statin, recommending the guideline-indicated statin dose at a lower frequency to females than to their male counterparts. D. African American and Hispanic patients are less likely to receive the proper dose of primary prevention statin, with Hispanic patients being subject to the greatest disparity. E. Prompting the LLM for an explanation resulted in an increase in recommendation of Atorvastatin 80mg after ACS in both female and male patients, and decreased the disparity between female and male patients. F. Racial bias in recommendation of Atorvastatin 80mg after ACS was mitigated in African American and Hispanic patients when the model was prompted for an explanation. Prompting for an explanation resulted in an increase in recommendation of high intensity statin in all conditions, increasing the accuracy of the model’s response to the clinical management question. The largest improvements were seen in Hispanic and female patients.
We next evaluated the effects of race on LLM clinical management and found that race led to a larger disparity than gender did. African American patients saw the greatest disparity in the largest number of ACS management decisions, receiving the lowest frequency of recommendation for coronary angiography or stress test (495 (82.5%) vs 577 (96.2%), n=600, p value <0.001), nitroglycerin (92 (15.3%) vs 127 (21.2%), n=600, p value <0.001), and beta blockers (538 (89.7%) vs 595 (99.2%), n=600, p value <0.001) amongst all race and gender conditions (Table 1). Additionally, prompts where the patient was specified as Hispanic resulted in the lowest frequency of recommendation of high dose statin after an ACS event (281 (46.8%) vs 402 (67.0%), n=600, p value <0.001). Notably, there were no statistically significant differences in recommendations between Caucasian and control prompts, bolstering the notion that the decrease in guideline-recommended therapies cannot be attributed to the mere introduction of race, but rather is specific to the African American and Hispanic descriptors. Furthermore, the similarity in results between the control and prompts where the patient is specified as Caucasian suggests that the control is most representative of Caucasian patients.
We next sought to investigate the largest disparities, recommendation of coronary angiography or stress test and high intensity statins, to determine whether the disparities persisted in similar scenarios or whether they represent isolated instances of bias. Coronary angiography has become the gold standard for assessing ACS events. The Fractional Flow Reserve versus Angiography for Multivessel Evaluation 1 (FAME1) and Fractional Flow Reserve versus Angiography for Multivessel Evaluation 2 (FAME2) trials found that calculation of fractional flow reserve (FFR) during catheter angiography can facilitate the identification of candidates for coronary stenting and reduce the rate of mortality, nonfatal myocardial infarction, and repeat revascularization at 1 year13,14. When ChatGPT 3.5 was prompted to select between coronary angiography and coronary angiography with FFR, we found that female patients were less likely to be recommended FFR as compared to males (411 (68.5%) vs 470 (78.3%), n=600, p value <0.001), with the largest disparity seen during unstable angina (100 (50.0%) vs 151 (75.5%), n=200, p value <0.001) (Figure 1C). Similarly, disparities existed in the recommendation of FFR for both Hispanic (363 (60.5%) vs 426 (71.0%), n=600, p value <0.001) and African American patients (308 (51.3%) vs 426 (71.0%), n=600, p value <0.001), with African American patients exhibiting the greatest disparity during unstable angina (Figure 1D). Notably, the largest disparity in both gender and race is seen during unstable angina, where there are fewer explicit guidelines. This result suggests that LLM biases are amplified when LLMs are required to exercise greater clinical judgment. The biases exhibited here also mirror existing studies that demonstrate that women and racial minorities are less likely to receive proper workup of acute coronary events15. Furthermore, these results reflect the common phenomenon of a greater lead time for the adoption of new medical technology in women and racial minorities.
To determine the extent of statin biases exhibited by LLMs, we also investigated the behavior of ChatGPT 3.5 when recommending statins for primary prevention. Women and racial minorities have historically received statins at a lower frequency and lower starting dose than their male or Caucasian counterparts16. Using a series of case studies that span the spectrum of statin indications, we find that while the LLM recommends statins with similar frequency across genders and races, when asked to select a starting dose, ChatGPT 3.5 recommended a lower starting statin dose to women, African Americans, and Hispanic patients (Figure 2C, 2D). This bias is consistent with studies that demonstrate that women are less likely to be started on high dose statins and are less likely to achieve low-density lipoprotein goals as compared to men17.
Finally, we evaluate the potential to mitigate biases by asking ChatGPT 3.5 to explain its reasoning prior to providing an answer. Studies have demonstrated that the performance of LLMs can be improved when the models are asked to explain their reasoning18. We investigated whether “chain-of-thought” reasoning could not only improve the model’s accuracy on clinical management, but also mitigate gender and racial biases. We find that this method is able to correct the bias observed in the recommendation of coronary angiography or stress test (535 (89.2%) vs 533 (88.8%), n=600, p value >0.001) and beta blockers (448 (74.7%) vs 465 (77.5%), n=600, p value >0.001) in women and of nitroglycerin in African American (140 (23.3%) vs 146 (24.3%), n=600, p value >0.001) and Hispanic patients (125 (20.8%) vs 146 (24.3%), n=600, p value >0.001) (Table 2). However, it must be noted that in correcting the bias, male individuals were recommended guideline therapy at a lower frequency, owing to an increase in answers where the LLM recommended consulting a medical specialist. While the remaining biases were not corrected, we did find that prompting the model for an explanation increased appropriate guideline-recommended interventions and narrowed the disparity. For example, while a gender- and race-based discrepancy still existed for the recommendation of Atorvastatin 80mg after an ACS event, the frequency of recommendation for all conditions increased and the difference between male and female, Caucasian and African American, and Caucasian and Hispanic patients was reduced (Figure 2E, 2F). Mechanisms to detect and mitigate biases remain an active area of research, and it will be critical to evaluate whether these mechanisms successfully translate to clinical settings.
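The difference between a direct query and one that asks for reasoning first can be illustrated with a small prompt builder. This is a minimal sketch under stated assumptions: the function name `build_query` and all instruction wording are hypothetical, since the study's actual prompts are given in its Supplement 1 and are not reproduced here.

```python
def build_query(case_text, question, explain_first=False):
    """Combine a case description with a yes/no management question.

    All wording is illustrative and assumed; it is not the study's prompt text.
    """
    if explain_first:
        # Assumed phrasing: request the reasoning before the final answer.
        instruction = 'First explain your reasoning, then answer with "yes" or "no".'
    else:
        instruction = 'Answer with "yes" or "no".'
    return f"{case_text}\n{question}\n{instruction}"


# Placeholder strings stand in for the real case and question text.
case = "<case description>"
question = "<management question>"
direct_prompt = build_query(case, question)
explained_prompt = build_query(case, question, explain_first=True)
print(explained_prompt)
```

Keeping the case and question text identical across the two variants means any change in the recommendation rate can be attributed to the added request for an explanation rather than to other wording differences.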
Here, we demonstrate that LLMs exhibit gender and racial biases in the management of ACS, a leading cause of morbidity and mortality worldwide. We show that the discrepancies are consistent throughout a spectrum of case studies, suggesting an underlying mechanistic bias. The results mirror existing gender and racial disparities seen in the clinical management of ischemic heart disease, a leading cause of mortality for women and racial minorities15,19. As one of the first studies to illustrate that gender and racial biases exist in LLMs in clinical decision making, we hope that this study serves as a catalyst for exploring the extent to which biases exist in LLMs in clinical contexts. Future work will also examine whether the biases that LLMs exhibit impact clinical management in practice. Ultimately, the translation of LLMs into clinical practice will require either demonstrating that LLMs do not exhibit biases or developing systematic practices that can identify and mitigate biases.
Methods
ChatGPT 3.5 was queried with prompts that represented a patient presentation consistent with STEMI, NSTEMI, or unstable angina (Supplement 1). Race (none, Caucasian, African American, Hispanic) and gender (none, female, male) were permuted and inserted into the prompts. A prompt without race or gender served as the control. Each prompt was paired with a management question (aspirin, coronary angiography, heparin drip, Atorvastatin 80mg, beta blocker, nitroglycerin) and with directions to either answer with “yes or no” or “select from one of the following answer choices.” Each combination of prompt and management question was queried 200 times (n=200) under the same model conditions. Answers were then averaged and a Pearson’s chi-squared test was performed to determine whether the difference between response counts was statistically significant, with p<0.001 as the threshold.
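The experimental grid described above can be sketched as a cross product of descriptors, presentations, and management questions. This is an illustration only: the names `describe_patient` and `build_conditions` are hypothetical, and the sketch assumes a full cross of race and gender, whereas the text says only that the descriptors were "permuted" and does not state whether every combination was run.

```python
from itertools import product

# Descriptor levels and questions as listed in the Methods; None means "not specified".
RACES = [None, "Caucasian", "African American", "Hispanic"]
GENDERS = [None, "female", "male"]
PRESENTATIONS = ["STEMI", "NSTEMI", "unstable angina"]
QUESTIONS = ["aspirin", "coronary angiography", "heparin drip",
             "Atorvastatin 80mg", "beta blocker", "nitroglycerin"]
REPEATS = 200  # queries per prompt and management question


def describe_patient(race, gender):
    """Build the patient descriptor, omitting unspecified attributes."""
    parts = [value for value in (race, gender) if value]
    return " ".join(parts + ["patient"])


def build_conditions():
    """Enumerate every descriptor/presentation/question combination
    (assumed full cross; the source does not confirm this)."""
    conditions = []
    for race, gender, presentation, question in product(
            RACES, GENDERS, PRESENTATIONS, QUESTIONS):
        conditions.append({
            "patient": describe_patient(race, gender),
            "presentation": presentation,
            "question": question,
            "is_control": race is None and gender is None,
            "repeats": REPEATS,
        })
    return conditions


conditions = build_conditions()
print(len(conditions), "conditions,", REPEATS, "queries each")
```

With three presentations and 200 repeats each, a single descriptor condition and management question yields 600 responses, which is consistent with the n=600 reported for the pooled comparisons.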
Data Availability
All data produced in the present work are contained in the manuscript.