Abstract
Importance. Discriminatory language in clinical documentation affects patient care and reinforces systemic biases. Scalable tools to detect and mitigate it are needed.
Objective. To determine the utility of a frontier large language model (GPT-4) in identifying and categorizing biased language, and to evaluate its suggestions for debiasing.
Design. Cross-sectional study analyzing emergency department (ED) notes from the Mount Sinai Health System (MSHS) and discharge notes from MIMIC-IV.
Setting. MSHS, a large urban healthcare system, and MIMIC-IV, a public dataset.
Participants. We randomly selected 50,000 ED medical and nursing notes from 230,967 adult patients with MSHS ED visits in 2023, and 500 discharge notes from 145,915 patients in the MIMIC-IV database. One note was selected for each unique patient.
Main Outcomes and Measures. The primary measure was the accuracy of bias detection and categorization (discrediting, stigmatizing/labeling, judgmental, and stereotyping) compared with human review. Secondary measures were the proportion of patients with any bias, differences in bias prevalence across demographic and socioeconomic subgroups, and provider ratings of the effectiveness of GPT-4's debiasing language.
Results. Bias was detected in 6.5% of MSHS and 7.4% of MIMIC-IV notes. Compared with manual review, GPT-4 had a sensitivity of 95%, specificity of 86%, positive predictive value of 84%, and negative predictive value of 96% for bias detection. Stigmatizing/labeling (3.4%), judgmental (3.2%), and discrediting (4.0%) biases were most prevalent. Bias prevalence was higher among Black patients (8.3%), transgender individuals (15.7% for trans-female, 16.7% for trans-male), and undomiciled individuals (27%). Patients with non-commercial insurance, particularly Medicaid, also had higher bias prevalence (8.9%). Bias was also more prevalent among patients with health-related characteristics such as frequent healthcare utilization (21% for >100 visits) and substance use disorders (32.2%). Physician-authored notes showed more bias than nursing notes (9.4% vs. 4.2%, p < 0.001). GPT-4's suggested revisions were rated highly effective by physicians, with an average improvement score of 9.6/10 in reducing bias.
Conclusions and Relevance. A frontier LLM effectively identified biased language without further training, showing utility as a scalable fairness tool. The high prevalence of bias linked to certain patient characteristics underscores the need for targeted interventions. Integrating AI to facilitate unbiased documentation could significantly impact clinical practice and health outcomes.
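The four detection metrics reported in the Results (sensitivity, specificity, positive predictive value, negative predictive value) all follow from a 2x2 comparison of model flags against manual review. The following is a minimal sketch of that arithmetic, not the authors' code: the names `DetectionMetrics` and `detection_metrics` are illustrative, and the counts are hypothetical placeholders rather than the study's data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionMetrics:
    sensitivity: float  # share of truly biased notes that the model flagged
    specificity: float  # share of unbiased notes that the model left unflagged
    ppv: float          # share of flagged notes that reviewers confirmed as biased
    npv: float          # share of unflagged notes that reviewers confirmed as unbiased


def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> DetectionMetrics:
    """Compute 2x2 metrics, treating manual review as the reference standard."""
    if min(tp, fp, fn, tn) < 0:
        raise ValueError("counts must be non-negative")
    if 0 in (tp + fn, tn + fp, tp + fp, tn + fn):
        raise ValueError("every row and column of the 2x2 table needs at least one note")
    return DetectionMetrics(
        sensitivity=tp / (tp + fn),
        specificity=tn / (tn + fp),
        ppv=tp / (tp + fp),
        npv=tn / (tn + fn),
    )


# Hypothetical counts for illustration only; these are NOT the study's data.
example = detection_metrics(tp=40, fp=10, fn=5, tn=45)
print(f"sensitivity={example.sensitivity:.3f} specificity={example.specificity:.3f}")
print(f"ppv={example.ppv:.3f} npv={example.npv:.3f}")
```

Unlike sensitivity and specificity, the two predictive values depend on how common biased notes are in the reviewed sample, so they may not carry over to a setting with a different prevalence.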
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
This work was supported in part by Clinical and Translational Science Awards (CTSA) grant UL1TR004419 from the National Center for Advancing Translational Sciences.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
The IRB of the Icahn School of Medicine at Mount Sinai gave ethical approval for this work.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability
All data produced in the present study are available upon reasonable request to the authors.