Abstract
Background In critically ill patients, complex relationships exist among patient disease factors, medication management, and mortality. Considering the potential for nonlinear relationships and the high dimensionality of medication data, machine learning and advanced regression methods may offer advantages over traditional regression techniques. The purpose of this study was to evaluate the role of different modeling approaches incorporating medication data for mortality prediction.
Methods This was a single-center, observational cohort study of critically ill adults. A random sample of 991 adults admitted ≥ 24 hours to the intensive care unit (ICU) from 10/2015 to 10/2020 were included. Models to predict hospital mortality at discharge were created. Models were externally validated against a temporally separate dataset of 4,878 patients. Potential mortality predictor variables (n=27, together with 14 indicators for missingness) were collected at baseline (age, sex, service, diagnosis) and 24 hours (illness severity, supportive care use, fluid balance, laboratory values, MRC-ICU score, and vasopressor use) and included in all models. The optimal traditional (equipped with linear predictors) logistic regression model and optimal advanced (equipped with nature splines, smoothing splines, and local linearity) logistic regression models were created using stepwise selection by Bayesian information criterion (BIC). Supervised, classification-based ML models [e.g., Random Forest, Support Vector Machine (SVM), and XGBoost] were developed. Area under the receiver operating characteristic (AUROC), positive predictive value (PPV), and negative predictive value (NPV) were compared among different mortality prediction models.
Results A model including MRC-ICU in addition to SOFA and APACHE II demonstrated an AUROC of 0.83 for hospital mortality prediction, compared to AUROCs of 0.72 and 0.81 for APACHE II and SOFA alone. Machine learning models based on Random Forest, SVM, and XGBoost demonstrated AUROCs of 0.83, 0.85, and 0.82, respectively. Accuracy of traditional regression models was similar to that of machine learning models. MRC-ICU demonstrated a moderate level of feature importance in both XGBoost and Random Forest. Across all ten models, performance was lower on the validation set.
Conclusions While medication data were not included as a significant predictor in regression models, addition of MRC-ICU to severity of illness scores (APACHE II and SOFA) improved AUROC for mortality prediction. Machine learning methods did not improve model performance relative to traditional regression methods.
Introduction
Medications are key influencers of mortality but are rarely included in mortality prediction models. Assessing medications as direct determinants of mortality is difficult given complex interactions with patient physiology, disease pathophysiology, and other elements of a medication regimen. Moreover, the therapeutic effect of any medication is balanced against the potential to cause harm.[1] Patients in the intensive care unit (ICU) are particularly susceptible to the effects of medications as they receive on average twice as many medications as patients on a general ward,[2, 3] are frequently exposed to clinically relevant potential drug-drug interactions (DDIs),[4, 5] and are more than twice as likely to experience an adverse drug event (ADE).[2, 3, 6–8] Elucidating the direct effects of medications on mortality of ICU patients and incorporating that information into prediction models may improve model performance and better inform care of ICU patients.
The incorporation of medication data into prediction models for critically ill patients has improved results for prediction of fluid overload and prolonged duration of mechanical ventilation, particularly with the application of supervised machine learning approaches.[9, 10] Additionally, a pilot study of six machine learning methods also showed that incorporation of medication data and the medication regimen complexity-intensive care unit (MRC-ICU) score improved mortality prediction, and adding MRC-ICU to severity of illness improved traditional regression as well.[11] These examples offer credence to the concept that incorporating information on medication regimens is useful in predicting both shot-term and long-term outcomes for ICU patients.
Development of accurate mortality prediction models remains an ongoing area of interest in critically ill patients.[12–14] Severity of illness scores (Acute Physiology and Chronic Health Evaluation [APACHE]-IV, Sequential Organ Failure Assessment [SOFA], and Simplified Acute Physiology Score [SAPS] I-III) are commonly used to predict mortality in the ICU but have potential limitations.[15, 16] While these scores are reflective of syndrome-attributable mortality, they do not consider the impact of patient management and the contribution of clinical decision-making to patient outcomes.
Artificial intelligence (AI) and machine learning have renewed interest in outcome prediction in a variety of patient populations, including the critically ill. The ability of AI algorithms to process vast amounts of information and identify previously unelucidated patterns in data highlight the potential for these methodologies to improve prediction models through the incorporation of a greater number of latent outcome determinants and markers with predictive value. Using machine learning, it is possible to incorporate specific and granular medication data with severity of illness variables to create more comprehensive outcome prediction models. The purpose of this study was to assess the value of medication data in the context of other patient specific factors (e.g., severity of illness) in machine learning-based mortality prediction. We hypothesized that incorporation of medication data would significantly improve model performance and that medications would be highly ranked in feature importance graphs.
Methods
Study Population
This retrospective, observational study was reviewed by the University of Georgia (UGA) Institutional Review Board (IRB) and deemed to be exempt from IRB oversight (Project00001541). All methods were performed in accordance with the ethical standards of the UGA IRB and the Helinski Declaration of 1975. Patient data were obtained via the Carolina Data Warehouse, which houses EPIC® electronic health record (EHR) data from the University of North Carolina Health System (UNCHS), an integrated healthcare delivery system. Given the intensity of the data collection effort, we employed random number generation to identify a sample of 1,000 adults aged ≥ 18 years admitted to the ICU ≥ 24 hours between October 2015 and October 2020. Only data from the first ICU admission for each patient were included. Patients were excluded if they were placed on comfort care within the first 24 hours of their ICU stay. These methods have been previously published.[9, 17] This evaluation followed the STROBE (Strengthening the Reporting of Observational Studies in Epidemiology) and CONSORT-AI (Consolidated Standards of Reporting Trials-Artificial Intelligence) extension reporting frameworks, as applicable.[18, 19]
Data Collection and Outcomes
The primary outcome was hospital mortality. The EHR was queried to obtain: 1) ICU baseline characteristics: age, sex, type of admission ICU (i.e., burn, cardiac, cardiothoracic, medical, neurosciences, surgical, mixed), primary ICU admission diagnosis (i.e., burn, cardiovascular, dermatology, electrolyte abnormalities, endocrine, fever, gastrointestinal, hematologic, hepatic, infection, mental health, neoplasm, neurology, pneumonia, pregnancy, pulmonary, renal, respiratory, sepsis, shock, syncope, toxicology/ingestion, trauma, weakness, or other); 2) Clinical data 24 hours after ICU admission: Acute Physiology and Chronic Health Evaluation (APACHE) II and Sequential Organ Failure Assessment (SOFA) score (using worst values in the 24 hour period), vital signs (i.e., heart rate, systolic blood pressure, temperature), Acute Respiratory Distress Syndrome (ARDS) classification (i.e., mild, moderate, or severe based on ratio of partial pressure of arterial oxygen [PaO2] to fraction of inspire oxygen [FiO2]), use of supportive care devices (i.e., renal replacement therapy, mechanical ventilation), serum laboratory values (i.e., albumin, bicarbonate, creatinine, glucose, lactate, potassium, pH, sodium, hemoglobin, hematocrit, platelets, white blood cell count), fluid balance (mL); 3) Medication data at 24 hour: MRC-ICU score and vasopressor use within 24 hour.
Data Analysis
Missingness: For missing components of SOFA and APACHE II score, patients were considered to have ‘normal’ condition for that corresponding variable (Supplemental Digital Content – Table 1). Additionally, Dummy Variable Adjustment for missingness was conducted for 14 additional predictors to assess how much the missing cases differed from an average individual without missing data for the primary outcome of mortality.
To evaluate the models, the cohort was split into training and test sets, using a ratio of 4:1. Four types of logistic regression models were created for predicting mortality with 27 variables and an additional 14 indicators for missingness (Supplemental Digital Content – Table 2) for both the training and test sets: 1) linear predictors, 2) predictors in nature cubic splines, 3) predictors in smoothing cubic splines, 4) local linear predictors. We then applied stepwise elimination with Bayesian Information Criterion (BIC) to select the final model.[20] Three machine learning models, Random Forest, SVM, and XGBoost, were employed.[21–26] Feature importance graphs were applied to visualize the strength of every predictor. For XGBoost, feature importance was measured as the frequency a feature was used in the trees. For Random Forest, feature importance was measured by mean decrease in node impurity. For SVM, feature importance was measured by the absolute magnitude of the coefficients for each variable with a normalized data set.
To assess both the specific and overall impact of each predictor, we performed two types of regression analyses: 1) individual logistic regressions examining a single variable alongside its relevant missingness indicator (if applicable), and 2) a comprehensive additive logistic regression encompassing all predictors and their respective missingness indicators. During the model training, 5-fold cross validation was applied to choose the hyperparameters for local linear logistic regression and the three machine leaning models.
With the optimal hyperparameters selected, the models were then fitted again on the training set. For local linear logistic regression, the proportion of observations in a neighborhood of every point to be used for fitting the models was tuned. For Random Forest, two hyperparameters were tuned (number of trees and number of variables randomly sampled as candidates at each split). For SVM, linear kernel was used, and cost of constraints violation was tuned. For XGBoost, two hyperparameters were tuned (maximum depth of a tree and maximum number of boosting iterations). Predictions for mortality for all 7 models were made using the test set.
Based on a previous study,[11] the baseline model was a logistic regression model including APACHE II, SOFA, and MRC-ICU as predictors, benchmarked against models including only SOFA or APACHE II as predictor variables. To evaluate the predictive abilities of each model on hospital mortality, area under the receiver operating characteristic curve (AUROC) was computed in addition to sensitivity, specificity, negative predictive value (NPV), positive predictive value (PPV) (or precision), and accuracy in the test set. Results were subsequently compared between AUROCs using DeLong’s test where prediction thresholds were chosen by maximizing Informedness, Matthew’s Correlation Coefficient (MCC) and F1 scores in the training set. A two-sided p-value less than 0.05 was used to determine statistical significance for all variables. All analyses were performed using R (version 4.3.0).
Validation
To verify the models’ performance, we validated them using the UNC 5000 dataset. This involved applying our trained models to the validation data, generating ROC curves, and assessing model metrics.
Results
Of the initial 1,000 patients randomly selected for inclusion, 9 patients were excluded because the ICU admission from which data was drawn did not represent the index ICU admission. The remaining 991 patients were included in the final cohort (Supplemental Digital Content – Figure 1). The population was 56.81% male and an average of 61.24 years old (SD 17.58). Predominantly represented ICUs included cardiac (30.78%) and medical (40.77%), with cardiovascular (24.62%), neurology (12.21%), and pulmonary (8.88%) representing the most common ICU admission diagnosis categories. Of the total cohort, 9.79% (n = 97) experienced the primary outcome of hospital mortality. Demographic characteristics for the entire cohort as well as cohorts stratified by mortality outcome are described in Table 1.
Univariate and multivariate analysis of each predictor variable is provided in Table 2. While several variables had statistically significant associations with mortality on univariate analysis, only sex (as a baseline variable), SOFA score, heart rate, and hemoglobin/hematocrit (as clinical/laboratory variables within the first 24 hours of ICU admission) maintained a statistically significant association in multivariate analysis. After stepwise selection, the four regression models involved similar predictors, which are given in Table 3 together with the corresponding p values. Notably, lowest SOFA score at 24 hours, lowest hematocrit at 24 hours, and an indicator of missingness for temperature at 24 hours were present in all models, with age and lowest albumin at 24 hours included in three of the models and fluid balance included in two of the models based on ANOVA for nonparametric effects. The local linear logistic regression after stepwise selection did not have higher order terms and used the same predictors as the logistic regression with linear predictors.
AUROC curves were plotted for ten models (Figure 1). Adding medication data (MRC-ICU) to a model including SOFA and APACHE II as clinical severity of illness metrics and mortality predictors increased the AUROC compared to benchmarking models based on APACHE II and SOFA alone (0.83 vs 0.72 and 0.81, respectively). A model inclusive of SOFA, APACHE II, Mechanical Ventilation, Albumin, Temperature, Hemoglobin, and Hematocrit increased the AUROC to 0.86. AUROC and their corresponding confidence intervals for each model are provided in Table 4. Detailed model performance metrics including accuracy, sensitivity, specificity, PPV, and NPV values on the test set with thresholds chosen by different criteria are reported in Table 5.
Machine learning models (Random Forest, SVM, XGBoost) provided similar AUROC for hospital mortality prediction (0.83, 0.85, and 0.82, respectively) as traditional regression methods as well as the model based on MRC-ICU + SOFA + APACHE II. Feature importance graphs for XGBoost and Random Forest models reveal similar contributing variables (Figure 2 and Supplemental Digital Content – Figure 2), with SOFA, APACHE II, Age, Heart Rate, and some laboratory values prominently featured.
Medication data in the form of MRC-ICU demonstrated feature importance similar to serum lactate or ARDS classification and ranked higher than presence of mechanical ventilation in the first 24 hours. Feature importance graphs generated for SVM models did not feature MRC-ICU (Supplemental Digital Content – Figure 3).
Validation was performed on a temporally separate dataset which, after excluding patients for whom data was not from the index ICU admission, included a total of 4,878 patients. The population was 51.91% male with an average age of 58.87 years old (SD 16.47). Compared to the cohort from which the training and test sets were derived, the validation cohort featured a greater percentage of medical ICU admissions (94.94%) and had fewer admissions related to a primary cardiovascular diagnosis (7.46%), with a greater proportion of respiratory or respiratory failure (16.73%), sepsis (11.46%), gastrointestinal (11.75%), and infection-related admissions (8.18%). Of the total validation cohort, 19.76% (n = 964) experienced the primary outcome of hospital mortality. Demographic characteristics for the entire validation cohort as well as cohorts stratified by mortality outcome are described in Supplemental Digital Content – Table 3. Missingness characteristics for each of the variables in the validation cohort is provided in Supplemental Digital Content – Table 4.
Each of the trained models was applied to the validation cohort, with AUROC curves generated (Supplemental Digital Content – Figure 4). Model performance on the validation cohort was consistently lower than performance on the test set across all models (Supplemental Digital Content – Tables 5 and 6). Traditional regression methods and models built on severity of illness metrics and medication-related data performed as well as machine learning-based models in the validation cohort. The Random Forest model exhibited the most stable performance from the test set to the validation set.
Discussion
The prediction of ICU outcomes including mortality remains an ongoing area of research, with new models and scores developed frequently.[27–29] However, model performance varies and even commonly used mortality predictors suffer from shortcomings.[30] Proposed methods for model optimization include customization of scores using local data[31] and the use of AI and machine learning methods to identify complex interactions and associations not apparent in traditional modeling.[14, 29, 32–35] It is notable that contemporary mortality prediction scores and metrics (e.g., SOFA, APACHE) do not include medication data (beyond vasoactive agents), despite the fact that medications are known to be causal agents both for positive outcomes via the treatment of disease as well as negative outcomes via both predictable and unpredictable interactions with the patient, the disease, or the medication regimen.[1, 3, 5]
Addition of medication data (as incorporated into the MRC-ICU score, a summary characterization of medication use in the ICU) to severity of illness scores (i.e., SOFA, APACHE II) in traditional modeling improved performance for hospital mortality prediction in this study; however, MRC-ICU was not identified as a significant predictor in logistic regression models. MRC-ICU has previously been shown to be associated with mortality,[36–38] number and intensity of required medication interventions,[37, 39] need for mechanical ventilation,[10] and development of fluid overload.[40] Addition of MRC-ICU to conventional severity of illness scores (i.e., SOFA, APACHE) with traditional modeling techniques improved mortality prediction,[11] although there was a positive association between MRC-ICU and mortality in univariate analysis and a negative association in multivariate analysis inclusive of severity of illness metrics, suggesting a complex modifying relationship.
Machine learning-based models did not out-perform traditional modeling methods using simple and easily retrievable variables, but machine learning approaches may be useful for further elucidating complex relationships between variables or causal relationships behind associations in traditional modeling. Interestingly, MRC-ICU had similar feature importance to clinical variables such as lactate and fluid balance in the prediction of hospital mortality. Machine learning approaches may be particularly useful for the incorporation of medication data, which is problematic for traditional modeling because of the large number of variables involved. Inclusion of MRC-ICU in machine learning models has shown the ability to improve prediction of patient outcomes compared to traditional modeling methods, albeit modestly.[10, 41]
An interesting finding from this study is that machine learning methods identified different predictors of mortality than traditional logistic regression models (and even than models based on different machine learning models). This finding highlights the potential value of machine learning approaches for identifying and characterizing previously unknown associations, but also emphasizes the “black box” nature of such an approach. Despite the potential impact of AI and machine learning approaches, it is important to consider whether performance of these models significantly improves upon traditional models that are simpler and more transparent, considering the resources required to implement these approaches.[9, 42] This study also highlights the need for prediction models to benchmark against existing clinical standards.
When models generated on a training set of nearly 1,000 patients were validated against a set of nearly 5,000 patients, AUROCs were consistently lower than they were on the test set. This discrepancy in performance may stem from structural differences between the two datasets, which could be initially discernible through their characteristics. Determining which dataset better represents the population is challenging. However, the current model results underscore the utility of the models in predicting mortality as well as the need for institution or population specific evaluation of a base model.
This study has notable limitations. While sample sizes of nearly 1,000 and 5,000 patients are large for most critical care studies, these sample sizes are small for the purposes of machine learning. Small sample sizes may lead to model overfitting and a lack of generalizability. While handling of missingness in this study was consistent, the use of normal values to substitute for missing data may have altered model results. Bias may exist as a result of which patients had specific data points available. It is possible that data points not collected could improve model performance for hospital mortality prediction. Limiting model metrics to the admission timepoint and the first 24 hours may not capture all potential contributors to and predictors of hospital mortality and limits assessment of the dynamic nature of critical illness over the course of an ICU stay. Hospital mortality as an outcome also has potential limitations, and using a specific timepoint may have altered outcomes of this study. Finally, all data sets in this study include patients from a single health system and may not be readily generalizable to patients in other health systems or geographic regions.
This study represents the largest and most robust evaluation to date of the MRC-ICU score with a training/test set of nearly 1,000 patients and a validation set of nearly 5,000 patients. It is the largest study to date incorporating medication regimen complexity, severity of illness data, and machine learning methods evaluating a patient-centered outcome that is relevant to potential future applications. Given these results, future studies will focus on the use of larger datasets with more granular medication data, as these features represent modifiable components in a causal model of patient outcomes.
Conclusion
Accurate prediction of mortality is an ongoing challenge in critically ill patient populations. Medications are known to be causal agents for mortality outcomes. Incorporation of medication data has potential promise to improve the performance of mortality prediction models in the ICU. While machine learning approaches may improve prediction performance in some settings, performance should be benchmarked against simpler and more transparent models for outcome prediction.
Data Availability
All data produced in the present study are available upon reasonable request to the authors
Declarations Ethical Approval
This was an observational study that was reviewed by the University of Georgia Institutional Review Board (IRB) and determined to be exempt from IRB oversight (Project00001541).
Author Contributions
BM conducted manuscript drafting, results interpretation, and critical revisions. TZ and XC conducted data pre-processing, data analysis, and methods development. AM, SS, JD, RK, and DM provided manuscript review and revisions and clinical results interpretation. AS provided project oversight, manuscript drafting, revisions, and results interpretation.
Availability of data & materials
Following request and authorship team approval of the appropriateness of the request, datasets can be made available. Please contact the corresponding author (sikora{at}uga.edu).
Acknowledgements
Data acquisition were supported by NC Tracs, funded by Grant Number UL1TR002489 from the National Center for Advancing Translations Sciences at the National Institutes of Health, and Data Analytics at the University of North Carolina Medical Center Department of Pharmacy.
Footnotes
Brian.2.Murray{at}cuanschutz.edu, Tianyi.Zhang{at}uga.edu, Amoreena.most{at}uga.edu, xychen{at}uga.edu, Susan.smith{at}uga.edu, J.Devlin{at}northeastern.edu, David.J.Murphy{at}emory.edu, r.kamaleswaran{at}duke.edu
Conflicts of Interest: The authors have no conflicts of interest.
Funding: Funding through Agency of Healthcare Research and Quality for Drs. Devlin, Murphy, Sikora, Smith, and Kamaleswaran was provided through R21HS028485 and R01HS029009.