Abstract
Background racial bias has been shown to be present in clinical data, affecting patients unfairly based on their race, ethnicity and socio-economic status. This problem has the potential to be significantly exacerbated in the light of Artificial Intelligence-aided clinical decision making. We sought to investigate whether bias can be introduced from sources that are considered neutral with respect to ethnicity and race and consequently routinely used in modelling, specifically vital signs.
Methods to perform our analysis, we extracted vital signs from 49,610 admissions from a cohort of adult patients during the first 24 hours after the admission to the Intensive Care Units (ICU), derived from multi-centre eICU-CRD database and single-centre MIMIC-III database, spanning over 208 hospitals and 335 ICUs. Using heart rate, SaO2, respiratory rate, systolic, diastolic, and mean blood pressure, we develop machine learning models based on Logistic Regression and eXtreme Gradient Boosting and investigate their performance in predicting patients’ self-reported race. To balance the dataset between the three ethno-races considered in our study, we use a matching cohort based on age, gender, and admission diagnosis.
Findings standard machine learning models, derived solely on six vital signs can be used to predict patients’ self-reported race with AUC of 75%. Our findings hold under diverse patient populations, derived from multiple hospitals and intensive care units. We also show that oxygen saturation is a highly predictive variable, even when measured through methods other than pulse oximetry, namely arterial blood gas analysis, suggesting that addressing bias in routinely collected clinical variables will be challenging.
Interpretation our finding that machine learning models can predict self-reported race using solely vital signs creates a significant risk in clinical decision making, further exacerbating racial inequalities, with highly challenging mitigation measures.
Funding The funders had no role in the design of this study.
1 Introduction
Machine learning (ML) algorithms are being increasingly used to tackle particularly complex clinical challenges [1],[2],[3]. Looking ahead, clinical practice will benefit from game-changing approaches that can assist healthcare processes via (semi-) autonomous decision making and/or recommendations of care actions. Some of these algorithms may possibly focus on decisions where patients’ lives are at risk. Therefore, the ML algorithms that become part of clinical practice must be robust, reliable, and unbiased.
Bias can be introduced from the data or stem from the approach used in the model development pipeline, and in both cases, it can result in decision making that can be discriminatory and harmful to minority groups. Importantly, the bias stemming from data is especially critical since it propagates and even amplifies inequalities in under-represented groups.
The unequal treatment of patients based on their race has been reported in detailed studies, accompanied by indisputable evidence of bias in healthcare providers’ attitudes, expectations, and behaviour [4],[5],[6]. The problem with the presence of bias in medical practice is further amplified if considered that the bias might be taught to students, as is heavily implied in an opinion piece in [7] defining the process as a “silent curriculum”, stating “among two patients in pain waiting in an emergency department examination room, the white one is more likely to get medications and the black one is more likely to be discharged with a note documenting “narcotic-seeking behavior”. A recent study [8] illustrates racial bias in the descriptions of patients’ electronic health records (EHR), showing that black patients are 2.5 times more likely to have one or more negative descriptors in their EHR compared with white patients.
Biased medical decisions may also result from clinical trials that produce biased datasets, either obtained entirely from a single ethno-racial group or with a dominant representation of one ethno-race over others [9]. The study in [10] shows that, even though ethno-race influences response to cancer treatments and outcomes, no ethno-racial statuses are recorded in the majority of patients, and in cases of recorded ethno-race the highest represented ethno-race in melanoma, breast and lung cancer trials are white people (25.94%), followed by Asians (4.97%), and African Americans (1.08%); resulting in biased datasets with underrepresentation of particular ethnicities [11].
Working with biased datasets negatively influences the development of ML assisted applications. There have been reports of detected ethno-racial bias in medical ML applications. The study in [12] shows patients being assigned a risk score depending on their skin colour. In particular, Black patients which were placed in the same risk category as a subset of White patients, health-wise, had considerably worse symptoms. To add to the severity of the problem, the ML algorithm reduced the number of Black patients who should have been referred for complex care by more than half. Another example is an algorithm for the diagnosis of diabetic retinopathy showing poor performance in populations living outside of the location where it was developed [13].
Analysis of ethno-racial bias in ML applications can also be performed by observing the models’ performances across race-ethnicities [14]. In [15] the authors present their investigation into the performance of three severity scoring systems in four ethnicities, focusing on hospital mortality. The authors conclude that severity scores have a statistical bias since the overestimated mortalities are most notable with Hispanic and Black patients.
Considering these issues, we sought to investigate an unlikely contributor of bias in artificial intelligence (AI) algorithms, namely the vital signs. The development of ML models typically includes the data pre-processing phase where variables that could potentially introduce bias in the models, such as race-ethnicity are excluded. The objective is to prevent the algorithm from using attributes that should not be used for classification, prediction or optimisation. For example, an algorithm that sifts through curriculum vitae should not be using the gender or the race-ethnicity as input so these are routinely excluded as input when the algorithms are trained. Vital signs, such as blood pressure, heart rate, and oxygen saturation, are objective measures and considered neutral with respect to demographics, ethnicity and race and as such considered safe, unbiased candidate features, for modelling. We investigate whether seemingly bias-free data can contain information about sensitive attributes such as race-ethnicity that can be learned during training, specifically focusing on vital signs.
To perform our analysis, we rely on the eICU Collaborative Research Database [16] where we extract the vital signs from 31,849 admissions of adult patients and the MIMIC-III [17] dataset where we extract the vital signs of a cohort of 17,761 admissions of adult patients, considered in the first 24 hours after the admission. Since both cohorts are made up of predominantly Caucasian patients, we used a matching cohort based on age, gender, and admission diagnosis to balance the dataset between the three ethno-races considered in our study, namely Caucasian, African American and Hispanic. Classical ML algorithms (Logistic Regression and XGBoost) are then used to investigate whether the patients’ ethno-race can be predicted based solely on their vital signs. We also perform variable saliency analysis, to identify the extent to which specific vital signs contribute most towards the prediction of the ethno-race.
The rest of the paper is organised as follows. Section II describes the dataset, the data preparation process and the methodology used. Section III presents our results, whereas in section IV we discuss them. Section V concludes this paper and discusses our potential future work in this area.
2 Methods
2.1 Clinical data sources and study population
The eICU Collaborative Research Database (eICU-CRD), used in our study, contains data associated with 200,859 admissions collected from 335 ICUs across 208 hospitals in the US admitted between 2014 and 2015 [16]. We additionally used the MIMIC-III [17] database, which comprises data of over 40,000 patients admitted in critical care units between 2001 and 2012. From both datasets, we selected all adult patients (age 18 and over) that were alive within the first 24 hours after ICU admission that had at least one clinically valid measurement (see Appendix 2, Table I) for all the six vital signs considered for this study: heart rate, SaO2, respiratory rate, systolic, diastolic and mean blood pressure. We additionally extracted several statistical features including mean, minimum, maximum and variance. Patients that were missing admission diagnosis, age, or gender were excluded from the study.
For the eICU dataset, we used the three most dominant ethno-races: Caucasian, African American, and Hispanic. This resulted in a total of 31,849 patients, of which 27,335 were (85%) Caucasian, 3351 (11%) were African American, and 1163 were (4%) Hispanic patients. The data for the other two, Asian and Native American, was significantly lower, with 621 and 247 patients respectively, so these ethno-races weren’t used in this research. For the MIMIC-III dataset, we repeated the same selection, which resulted in 17,761 patients total, of which 15,899 (89.5%) were Caucasian, 1365 (7.7%) were African American, and 497 (2.8%) were Hispanic patients.
2.2 Statistical Analysis
The baseline characteristics of the patients were analysed using medians (IQRs) for continuous variables and frequencies (percentages) for categorical variables. We used the Kruskal–Wallis test (one-way ANOVA) for continuous variables and the chi-square test for categorical variables to compare ethno-racial subgroups.
2.3 Model development and validation
For each of the datasets, we analysed binary comparative tests, i.e., African Americans and Caucasians, Hispanics and Caucasians, Hispanics and African Americans. Additionally, to address the issue of data imbalance between ethno-races we created three matched cohorts by devising a matching process with the minority ethno-race based on three primary features: admission diagnosis (first cohort), gender (second cohort), and age (third match) (see Appendix 1). The matching process resulted in a total of 9 sets of data, for each dataset. We evaluate the performance using the Logistic Regression (LR) - a model which uses a logistic function in order to model a binary output variable given input variables. The classification is performed based on a decision threshold, which is why LR is useful when the outcome variable is binary, but the input variables are continuous; and XGBoost - uses gradient boosted trees, and is an ensemble of models that learn by correcting the errors made by existing models until no further improvements can be made. XGBoost can be prone to overfitting because when the weaker models are trained, the resulting model has high complexity, and therefore, we trained two versions of the XGBoost algorithm: a default version (with the default parameters), and an optimised version (with the parameters selected with random search).
The algorithms were internally evaluated using stratified 5-fold cross-validation, meaning the data was divided into 5 folds in a way each fold maintains the original distribution class-wise. The training of the model was performed on 4 folds, whereas the remaining fold was used to validate the model’s performance. This process was repeated 5 times, for each of the folds, and the final results were averaged over all folds.
The performance of our models was assessed by computing the area under the receiver operator characteristic curve (AUC-ROC) and the area under the precision-recall curve (AU-PRC). The AUC-ROC shows how well the model is capable of distinguishing between the classes and is plotted with the true positive rate (recall) on the y-axis and the false positive rate (FPR) on the x-axis. The recall represents the model’s ability to correctly classify positive samples as positive. The FPR shows the samples incorrectly classified by the model as positives out of all negative samples. The AU-PRC shows the trade-off between precision and recall. The precision represents the model’s ability not to classify a negative sample as positive.
3 Results
The results for all matched cohorts for both datasets showed similar performance for all models across all comparative tests, with the second matched cohort that provided the best results. Therefore, the results presented in this section are from the second matched cohort. The patient baseline characteristics in each of the three comparative tests for the second matched cohort are summarised in Table I. The baseline characteristics for the first and third matched cohort are provided in Appendix 1).
3.1 eICU-CRD Results
Table II summarises the eICU-CRD results from the AUC-ROC and AU-PRC for all comparative tests performed with the second matched cohort and each of the three algorithms: LR, XGBoost (with default parameters), and optimised XGBoost (with optimised parameters). The comparative test between Hispanic and Caucasian patients shows the lowest results across all algorithms. On the other hand, the best results were obtained in the comparative test between African American and Caucasian patients, with the highest AUC performance at 0.75 ± 0.019. The results additionally show that XGBoost performed better than LR across all tests. The optimised XGBoost performed better than the default XGBoost, due to the simplification of the ensemble models used, which removed the model’s overfitting.
The AUC-ROC and AU-PRC are displayed in Table III. Between the three comparative tests, the graphs show the best results in the comparison between African Americans and Caucasians. Additionally, the AUC-ROC clearly displays the difference between the performance of the three algorithms, illustrating that the optimised XGBoost provides the best ethno-race prediction. The AUC-ROC and AU-PRC for the first and third matched cohorts are provided in Appendix 3 and 4.
3.2 MIMIC-III Results
Table IV summarises the MIMIC-III results from the AUC-ROC and AU-PRC for all comparative tests performed with the second matched cohort and each of the three algorithms: LR, XGBoost (with default parameters), and optimised XGBoost (with optimised parameters). The best results were obtained in the comparative test between African American and Caucasian patients, with the highest AUC performance at 0.65 ± 0.021, whereas the comparative test between Hispanic and Caucasian patients, in summary, shows the lowest results. XGBoost performed better than LR in all, except the comparison between Hispanic and Caucasian patients - the better performance of LR here can be due to potential linearity in the comparison, or the number of patients in the cohort.
The AUC-ROC and AU-PRC are displayed in Table V. Between the three comparative tests, the graphs show the best results in the comparison between African Americans and Caucasians. The other two comparative tests show marginally worse results, which was expected considering the low number of Hispanic patients remaining after the selection criteria were applied to the MIMIC-III dataset.
3.3 Variable importance (eICU-CRD and MIMIC-III)
In order to understand our results better, we wanted to analyse which variables contributed most to the prediction outcome. Since the optimised XGBoost algorithms provided the best results in all cases, we inspected the variable importance of the optimised XGBoost model for each comparative test. We used the SHAP (SHapley Additive exPlanations) values to tell us how much each input variable in the model contributed to the prediction. The SHAP beeswarm plots across all comparative tests for the second matched cohort in the eICU-CRD are displayed in Table VI. When comparing African American and Caucasian patients, we see that the beeswarm plot shows the oxygen saturation carries high importance for the XGBoost classifier. This was true also for the MIMIC-III cohort
The mean value of the oxygen saturation is important in the comparison of Hispanics and Caucasians, as well. In this comparison, the maximum respiration is also important to the model. When comparing Hispanic and African American patients the respiration maximum and variance were among the most relevant variables for the model, which was also the case when comparing Hispanic with Caucasian patients.
We performed the variable analysis on the MIMIC-III dataset, and we again focused on the results provided by the optimised XGBoost (in spite of LR performing better in the case of Hispanic and Caucasian patients, the difference in the result is not significant). The SHAP beeswarm plots across all comparative tests for the second matched cohort are displayed in Table VII. The oxygen saturation and diastolic blood pressure are among the topmost important features in the comparative tests between the African American and Caucasian patients, and Hispanic and Caucasian patients. When comparing African American and Hispanic patients the heart rate variables significantly contribute to the decisions of the model.
4 Discussion
From the eICU-CRD results, it can be observed that across all comparative tests the optimised XGBoost performed well with respect to identifying the race-ethnicity from the vital signs after matching for age, gender and diagnosis. When comparing African Americans and Caucasians, the AUC-ROC and AU-PRC showed the best predictive performance of the algorithms compared to the other two comparative tests. Additionally, in comparing African Americans and Caucasians the standard deviations of each algorithm are lower compared to the other two comparative tests. This outcome can be a result of the number of patients representative of each ethno-race in the comparative tests; namely, during the matching process, we can see the resulting data for the African American and Caucasian comparative test has approximately thrice the number of patients in the Hispanics and Caucasians and Hispanics and African Americans patient pools. However, the significantly larger number of patients does not provide a pronounced difference in the results, signalling that it might be that the selected variables do not offer additional insight into the distinction between two ethno-races. We observed lower model performance on the MIMIC-III dataset, in part due to the lower number of patients selected from the MIMIC-III. However, the MIMIC-III results follow the same trend, i.e., the best results were obtained in the second matched cohort (where the gender was the primary match feature), the comparative tests of African American and Caucasian patients showed best results across all matched cohorts, and the oxygen saturation is the most important feature for the African American and Caucasian patients, and the Hispanic and Caucasian patients.
While the results show that a definite division between each ethno-race cannot be obtained, the success of the model in classifying two out of three patients correctly cannot be accidental. The models are using only the information from patients’ vital signs, suggesting that routinely collected information in the first 24 hours of admission, excluding demographic information, can provide information about the race-ethnicity even when such information is removed from the dataset.
These results could be influenced by several factors. Firstly, the bias can be introduced from the socio-economic factors, that is patients from ethno-racial minorities tend to have poorer preventive care, and consequently once admitted into the ICU they can be in significantly worse condition compared to other ethno-races even with the same admission diagnosis that we control for. Additionally, there is potential for underlying issues regarding the accuracy of medical equipment across different populations. For example, it was shown that pulse oximeters were inaccurate for certain races, which came to light during the COVID crisis, resulting in hidden hypoxemia among patients of colour [18]. This was further investigated, and studies showed that oxygen saturation levels had greater variability in patients who identified as African American, followed by Hispanic, Asian, and lastly, Caucasian patients. While our saliency analysis showed oxygen saturation as an important variable, this may not fully explain our finding as we used measures from arterial oxygen saturation measured directly through blood gas analysis, rather than pulse oximetry, where the discrepancies were found.
5 Conclusions
ML applications in medicine show success in performing tasks such as diagnosis and prognostication at the same level as experts. However, various studies have demonstrated how ML applications can be biased and therefore affect patients unfairly, favouring one group of patients over the other depending on race, income, etc. The bias observed in ML applications can be a result of the model development or be introduced by the data used for training algorithms. Therefore, we focused on analysing the presence of ethno-racial bias in clinical data, by investigating if vital signs could give ML models enough information to determine a patient’s ethno-race correctly. We compared the performance of three separate models in three distinct matched cohorts for two different datasets. Due to the similarity of the results between the cohorts, we focused on further analysing the results obtained from one of the cohorts. Our results show that two of three patients in all comparative tests have their ethno-race correctly identified, and the most important variables in the model decisions proved to be oxygen saturation and respiration.
Data Availability
The datasets analyzed in the current study are publicly available in the MIMIC-III repository (https://mimic.physionet.org/) and eICU-CRD repository (https://eicu-crd.mit.edu/).
Declaration of interests
Authors declare no conflict of interests
Ethical approval and consent to participate
The data in MIMIC-III was previously de-identified, and the institutional review boards of the Massachusetts Institute of Technology (No. 0403000206) and Beth Israel Deaconess Medical Center (2001-P-001699/14) both approved the use of the database for research. The analysis using the eICU-CRD is exempt from institutional review board approval due to the retrospective design, lack of direct patient intervention, and the security schema, for which the re-identification risk was certified as meeting safe harbor standards by an independent privacy expert (Privacert, Cambridge, MA) (Health Insurance Portability and Accountability Act Certification no. 1031219-2). All experiments were performed in accordance with relevant guidelines and regulations.
Availability of data and materials
The datasets analyzed in the current study are publicly available in the MIMIC-III repository (https://mimic.physionet.org/) and eICU-CRD repository (https://eicu-crd.mit.edu/).
Acknowledgement
This research was supported by the WideHealth project - EU Horizon 2020, under grant agreement No 952279.
1. Appendix 1 Matching cohort
A matched cohort in our case represents a dataset created of pairs of patients from two different ethno-racial groups, who may differ with respect to their individual vital signs, however, are matched through the same specific baseline characteristics, namely age, gender and admission diagnosis. From the dataset with three ethno-races present, we created three ethno-race-based divisions for matching: African Americans vs. Caucasians, Hispanics vs. Caucasians, Hispanics vs. African Americans. For each of the three baseline characteristics (age, gender, diagnosis), we created three feature-prioritising processes for matching, each taking on one of the features as a priority in the matching process. The matching process sought to match each of the patients from the minority class with a patient from the majority class, initially and if possible, on all three features, and if there were no matches made where all three features corresponded, then for the remaining unmatched patients the matching was based on the prioritised feature.
Match 1: The first match process prioritised the diagnosis, therefore the matching between the two ethno-racial groups followed three stages: the patients were matched on three features, then if no match was made the patients were matched on the combination of diagnosis-gender equality or diagnosis-age equality, and lastly, if no matches were found in the second stage the match was made on diagnosis only.
Match 2: The second match process prioritised the gender feature, therefore the matching between the two ethno-racial groups followed three stages: the patients were matched on three features, then if no match was made the patients were matched on the combination of gender-age equality or gender-diagnosis equality, and lastly, if no matches were found in the second stage the match was made on gender only.
Match 3: The third match process prioritised the age, therefore the matching between the two ethno-racial groups followed three stages: the patients were matched on three features, then if no match was made the patients were matched on the combination of age-diagnosis equality or age-gender equality, and lastly, if no matches were found in the second stage the match was made on age only. In each of the three matching processes, if by the end a patient from the minority class had no matched patient from the majority class, then the patient from the minority class was dropped.
The process provided a total of nine datasets, where each of the three ethno-race-based divisions were combined with each of the three feature-prioritising processes. The details on the patients matched and the resulting number of patients per cohort in each of the nine perfectly-balanced datasets is given in Table I.
2. Appendix 2 Selected variables, measurement units and valid clinical ranges
3. Appendix 3 Full overview of results
4. Appendix 4 AUC-ROC and AU-PRC for the first and third matched cohort across all comparative tests
Footnotes
bojanav{at}feit.ukim.edu.mk, hristijang{at}feit.ukim.edu.mk, danield{at}feit.ukim.edu.mk, marijaka{at}feit.ukim.edu.mk, bmamandipoor{at}fbk.eu, lceli{at}bidmc.harvard.edu, vosmani{at}fbk.eu