Abstract
Background Criteria to identify patients who are ready to be liberated from mechanical ventilation are imprecise, often resulting in prolonged mechanical ventilation or reintubation, both of which are associated with adverse outcomes. Daily protocol-driven assessment of the need for mechanical ventilation leads to earlier extubation but requires dedicated personnel. We sought to determine whether machine learning applied to the electronic health record could predict successful extubation.
Methods We examined 37 clinical features from patients from a single-center prospective cohort study of patients in our quaternary care medical ICU who required mechanical ventilation and underwent a bronchoalveolar lavage for known or suspected pneumonia. We also tested our models on an external test set from a community hospital ICU in our health care system. We curated electronic health record data aggregated from midnight to 8AM and labeled extubation status. We deployed three data encoding/imputation strategies and built XGBoost, LightGBM, logistic regression, LSTM, and RNN models to predict successful next-day extubation. We evaluated each model’s performance using Area Under the Receiver Operating Characteristic (AUROC), Area Under the Precision Recall Curve (AUPRC), Sensitivity (Recall), Specificity, PPV (Precision), Accuracy, and F1-Score.
Results Our internal cohort included 696 patients and 9,828 ICU days, and our external cohort had 333 patients and 2,835 ICU days. The best model (LSTM) predicted successful extubation on a given ICU day with an AUROC 0.87 (95% CI 0.834-0.902) and the internal test set and 0.87 (95% CI 0.848-0.885) on the external test set. A Logistic Regression model performed similarly (AUROC 0.86 internal test, 0.83 external test). Across multiple model types, measures previously demonstrated to be important in determining readiness for extubation were found to be most informative, including plateau pressure and Richmond Agitation Sedation Scale (RASS) score. Our model often predicted patients to be stable for extubation in the days preceding their actual extubation, with 63.8% of predicted extubations occurring within three days of true extubation. We also tested the best model on cases of failed extubations (requiring reintubation within two days) not seen by the model during training. Our best model would have identified 35.4% (17/48) of these cases in the internal test set and 48.1% (13/27) cases in the external test set as unlikely to be successfully extubated.
Conclusions Machine learning models can accurately predict the likelihood of extubation on a given ICU day from data available in the electronic health record. Predictions from these models are driven by clinical features that have been associated with successful extubation in clinical trials.
Introduction
Mechanical ventilation is a life-saving intervention to support patients with respiratory failure. However, mechanical ventilation is also an invasive therapy with substantial risks.1 Hence, ICU physicians seek to extubate patients at the earliest point in their hospital course when they can sustain spontaneous breathing without an artificial airway.
Premature extubation can lead to reintubation, which is associated with a prolonged ICU stay and greater hospital mortality.2 As a result, physicians often delay weaning, prolonging the duration of mechanical ventilation and its associated complications.3 Implementation of protocolized daily screening by nurses or respiratory therapists followed by spontaneous breathing trials (SBTs) in the absence of physician input reduces the duration of mechanical ventilation and ICU stay.4 Despite this, compliance with these protocols is highly variable and SBTs alone have poor operating characteristics in predicting successful extubation.5–8 Furthermore, these protocols distract from other processes of care, increase costs, and are difficult to continue when ICU resources become limited.9
The success of protocol-driven weaning highlights the value of objective and data-driven approaches to assist healthcare providers identify candidates for liberation from mechanical ventilation. Electronic Health Records (EHRs) ingest continuous clinical data that report on patient physiology, including vital signs, laboratory results, medications, and ventilator parameters.10 Machine learning uses computational techniques that identify complex non-linear relationships within disparate data, including those in the EHR to improve the accuracy of clinical decisions.11,12 We sought to determine whether machine learning approaches applied to EHR data could predict successful next-day extubation. We trained our model using data from a relatively small cohort of well-phenotyped patients enrolled in an observational clinical cohort study. After careful annotation and labeling by clinicians, several models were found to perform well, accurately predicting successful extubation. Predictions from our models were driven by clinical features validated in clinical trials of protocol driven weaning. Our best performing model correctly predicted next-day extubation status in 84.6% of cases and could potentially have identified 35.4% (17/48) of failed extubation cases as unlikely to be successfully extubated.
Methods
Study participants and setting
Patients were enrolled in the Successful Clinical Response In Pneumonia Therapy (SCRIPT) Systems Biology Center, a single-site prospective cohort study of patients requiring mechanical ventilation, who underwent bronchoalveolar lavage for known or suspected pneumonia in a quaternary care hospital from 2018 to 2023 (NU IRB # STU00204868). Patients or their legal authorized representative consented to participate in this study. External testing was done using EHR data collected from patients in a mixed medical surgical ICU from a community hospital, Central DuPage Hospital (CDH) from 2018 to 2022 (NU IRB # STU00216678); this was a retrospective data-only protocol and received a waiver of informed consent. We included patients who required mechanical ventilation, with ICD-9 or ICD-10 codes for pneumonia or from the top 10 ICD codes of SCRIPT patient admissions (Supplemental Table 1); we excluded patients admitted from an operating room to minimize surgical ICU patients.
Data
EHR data from both hospitals were extracted from the Northwestern University Enterprise Data Warehouse,13 and were manually reviewed and validated by ICU physicians who focused specifically on ventilation features and markers of intubation and extubation. Over two hundred charts were manually reviewed by physicians to ensure the accuracy of the extracted data. When the reviewing clinician was uncertain, a study adjudication committee meeting comprised of five critical care physicians made a consensus decision that was considered final. Our group is experienced in the annotation of clinical data and has adjudicated over 800 pneumonia episodes in a rigorous, predefined process.14 Each patient chart was further systematically reviewed by the research study team to gather information including transition to comfort measures only (CMO). For patients who had multiple sequences of intubation and extubation, we used only the first intubation/extubation sequence. We excluded patients with tracheostomy and failed extubation (requiring reintubation within two days) and excluded ICU days with palliative extubation and extubation while on extracorporeal membrane oxygenation (ECMO) from our training data.
Features
Our features were inspired by the practice of daily multidisciplinary rounds in the ICU and built off a previously described dataset from our group.15,16 As almost all extubations occurred after 8AM (Supplemental Figure 1), we used only data from midnight to 8AM. This strategy is consistent with the protocol-driven weaning shown to reduce the duration of mechanical ventilation4 and can facilitate future deployment by presenting model predictions to the clinical team during morning rounds. For a given timestep, the model was fed with a dataset of 37 features pertaining to a specific patient, aggregated from the hours of 12AM-8AM from a single day. Features included in the dataset span ventilator parameters, laboratory values, mental status assessments, and organ failure assessments (Supplemental Table 2).
Missing data
Missing data is common in real-world EHR data. Some missing data in the EHR are not random, instead reflecting a change in clinical status that results in different patterns of monitoring and documentation. We report the percent missing for each feature in Supplemental Table 2. Some machine learning models (such as XGBoost and LightGBM, detailed below and in Supplemental Methods) are designed to incorporate missing data. For other models, including RNN and LSTM, we tested several imputation strategies, including mean imputation and a binning strategy inspired by FIDDLE.17 For binning, features are coded into a flag indicating missing/present, and flags indicating the quartile range into which the data point falls. For example, a day with a respiratory rate of 24 breaths per minute would have a 0 for the missing flag (indicating the data are present), a 0 for the first bin that encompasses range <18 breaths per minute, a 0 for the second bin that encompasses range [18,22), a 1 for the third bin that encompasses range [22,27.4), and a 0 for the fourth bin that encompasses the highest range of ≥ 27.4. Details of each strategy are outlined in Supplemental Methods and in our code repository.
Labels
We labeled the patient’s intubation status on each day as intubated, extubated, failed extubation, comfort measures palliative extuation, or pre-intubation. After removing pre-intubation days, and sequences where there was failed extuation or palliative extubation, our dataset contained only intubated days for training. The label of interest that our models predicted was the next day’s intubation/extubation status. Additional details on labeling and filtering are available in Supplemental Methods.
Split
Our training dataset contains patients with hospital admission dates ranging from June 2018 to August 2023 (Figure 1). We split patients into train and test sets based on a cutoff date of August 20, 2021. This split put 80% of patient ecounters before that date into the train set, and the 20% after into the test set. We further split the train data into smaller train and validation subsets based on an 80/20% random split.
Modeling
Both traditional machine learning models and deep learning architectures were employed. We used traditional models, including logistic regression,18 random forest,19 LightGBM,20 and XGBoost,21 to make daily predictions. In contrast, deep learning models, specifically Recurrent Neural Networks (RNNs)22 and Long Short-Term Memory networks (LSTMs),23 used the entire patient sequence to make predictions. Detailed explanations and comparisons of the models are available in Supplemental Methods.
Performance metrics
The primary metric for model evaluation was the Area Under the Receiver Operating Characteristic curve (AUROC). Additional metrics included Area Under the Precision-Recall Curve (AUPRC), Accuracy, F1-Score, Precision (PPV), Recall (Sensitivity), and Specificity.
Feature importance
To understand and interpret the effect of the various features on the predictions made by our deep learning models, we used ablation techniques by masking individual features for 37 iterations (total features) and observing the decrease in AUC without that feature available. We also used XGBoost’s SHAP (Shapley Additive exPlanations) values.24 For our logistic regression model, we plotted the top feature coefficients. More detailed explanations are available in Supplemental Methods.
Predicting failed extubations and missed opportunities for extubation
We defined a correct prediction of failed extubation as a patient who required reintubation within two days and a model prediction of a low probability of successful extubation on the day of extubation. We defined a possible missed opportunity for extubation as a patient with a high probability of successful extubation in the days before successful extubation.
Statistics
Descriptive statistics are reported for the train/validate/test cohorts using % and median [Q1,Q3]. Comparisons between nonparametric data were done using Mann-Whitney U tests.
Code and data availability
A detailed description of data extraction, processing, and modeling are available in our code repository at https://github.com/NUPulmonary/2024_Fenske_Peltekian. Programming was done in Python version 3.10. A deidentified version of the SCRIPT cohort data are available on PhysioNet, and the exact data needed to replicate this study will be posted in a future update.16
Reporting checklist
We followed the TRIPOD Checklist for predictive model development, which is available in Supplemental Materials.
Results
Cohort description
There were 712 enrollments in SCRIPT during the study period, with 940 separate ICU stays totaling 16,402 ICU days. After filtering days and stays for our next-day successful extubation task, we trained and evaluated using 696 unique patients, 781 ICU stays, and 9,828 ICU days. Our CDH dataset consisted of 459 unique patients, 518 ICU stays, and 5,814 ICU days. This was filtered down to 333 unique patients, 349 ICU stays, and 2,835 ICU days. The failed extubation rate, defined as requiring reintubation within two days was 23.1% and 15.8% in the SCRIPT and CDH cohorts, respectively. Patient demographics and outcomes are reported in Table 1 between the train/validate/test and external sets. In the SCRIPT cohort, the median [Q1,Q3] patient age was 63 [51, 72], and 44% of the patients were female. The percent of patients with unfavorable outcomes (death, discharge to hospice, or lung transplantation) for the SCRIPT cohort was 45%. While the demographics and overall outcomes were similar among the different datasets, the features were markedly different (Table 2, Supplemental Table 2), likely reflecting changes in the cohort of patients requiring mechanical ventilation during the COVID-19 pandemic. A summary of the different types of intubation/extubation sequences and how often they occurred are presented in Supplemental Figure 2.
Model results
The performances of the predictive models using different imputation strategies are summarized in Figure 2. The performance of all the models in predicting extubation was similar. An LSTM classifier using binned data achieved the best performance with a an area under the Receiver Operating Characteristic curve (AUROC) of 0.87 (95% CI 0.834-0.902) and an area under the Precision-Recall curve (AUPRC) of 0.382 (95% CI 0.269-0.492) on the test set (baseline rate of extubation/total days of 0.09, which would be the AUPRC for a no-skill model). Full metrics for all models before hyperparameter optimization are available in Supplemental Table 3 and full metrics for optimized models are available in Supplemental Table 4. In addition to the temporally split SCRIPT test set, we evaluated our top-performing model on data from an external hospital and found consistent performance across the two datasets (Figure 3).
Imputation/encoding strategies affect model performance
We compared the performance of different imputation strategies across models as well. The best encoding strategy was binning, which encodes data presence or missingness, and then bins continuous variables into quartiles. The next was using raw data without any imputation, in models that could accommodate this such as XGBoost and LightGBM. Simpler but commonly employed strategies such as mean imputation performed consistently the worst across models all models (Supplemental Table 3).
Feature importance
The top ten features contributing to the predictive performance of the XGBoost model were identified using SHAP (SHapley Additive exPlanations) values (Figure 4). They include markers of the patient’s mental status, as documented by Glascow Coma Scale (GCS), as well as markers of the patient’s lung physiology, as documented by presence of ventilator measurements such as plateau pressure. Similar features were present on LSTM ablation feature importance analysis, with top feature importance including RASS score, respiratory rate, plateau pressure, and GCS motor response. RASS score, plateau pressure, and GCS motor response are also represented by different binned features in the top ten feature coefficients for Logistic Regression.
Missed opportunities and preventing failed extubations
Our model often predicted patients could be considered for extubation in the days preceding their actual extubation, suggesting opportunities to consider earlier extubation (Figure 5A). This is a stringent way of evaluating the model as only the earliest extubation prediction is used for analysis (Supplemental Methods). We also tested on failed extubations (requiring reintubation within two days) not seen by the model during training. Our best model predicted extubation failure in 35.4% (17/48) of these cases.
Discussion
Identifying patients who are ready for extubation remains a challenging clinical problem in the ICU. Although ventilator-weaning protocols have been shown to outperform routine physician-driven care,4,25 the adoption and standardization of these practices varies.5,26,27 Barriers to the adoption of some protocols include their requirement for trained personnel28 and in some cases specialized equipment.29 We built several machine learning models that incorporated features included in the electronic health record in most ICU settings. Several of these models performed well with the best model providing an AUROC above 0.85. These analyses also suggest that HER-based classifiers would have identified some patients who might have been considered for earlier SBTs and extubation. We found 63.8% of predicted extubations in cases of successful extubation occurred within three days of actual extubation, suggesting potential earlier extubation could have been possible if the team were following model-driven care. We also found, when looking at sequences where a patient had an unsuccessful extubation, our model would have recommended against extubation in 35.4% of cases.
The models we used allow us to identify clinical features that drive their predictions. Protocol driven weaning which has been rigorously shown to shorten the duration of mechanical ventilation usually includes an assessment of mental status, oxygenation, hemodynamic stability, an absence of ongoing infection, and physiologic measures of lung mechanics.7,30 Our models were informed by clinical features that reflect these assessments, lending clinical plausibility to their predictions. Furthermore, models that treated each day independently performed similarly or just slightly inferior to sequential models that used information from multiple days, consistent with efficacy of protocols that assess readiness for spontaneous breathing on a daily basis irrespective of clinical context.3
We also examined the effect of different imputation strategies on model performance. Commonly used strategies, such as filling in missing values with the data’s mean value, performed significantly worse than maintaining missingness using either encoding strategies such as binning or working with models such as XGBoost that can inherently handle missing data. This strategy can be particularly useful for complex models like deep learning algorithms where data in a medical context can have such extreme variance.
SCRIPT is a highly curated dataset, with a panel of five critical care physicians meeting weekly to review each admitted patient and their outcomes.14 A full-time research team collects enrollment and end of study data, including if a patient was transitioned to comfort-focused care. Since this is not possible for larger scale datasets, we compiled a set of pragmatic rules based on our experience (Supplemental Methods). We performed significant manual curation of our dataset of over 200 patient charts, confirming manually on the front end EHR unusual cases of intubation, which can be difficult to parse out using only flowsheet data.31 We believe that our careful data cleaning and manual adjudication processes were key in developing a clinically important model using features previously known to be clinically important to inform model development.32 Compared with other machine learning models to predict extubation33 or shift off full ventilatory support,34 our work obtains good predictive performance when applied to a population of critically ill patients in both an internal test dataset and when applied to a distinct medical ICU setting. Ensuring a sufficient time gap between data gathering and the prediction of interest is a crucial step in mitigating the risk of data leakage in a machine learning model. In contrast to studies that make predictions on the next time period without any gap for intervention,35 we specifically use only information from midnight to 8AM rather than the entire day so the model can offer this information to the multidisciplinary team on morning rounds to inform clinical decision-making.
Limitations
First, our model was trained using clinically adjudicated data from a quaternary care referral health system that cares for severely ill patients, reflected in the substantial mortality and long duration of mechanical ventilation, which might limit its relevance in other ICU settings. Nevertheless, the model performed well in a medical ICU from a community-based hospital, suggesting carefully curated data from relatively small populations can be used to train models for generalized use. Importantly, however, the community-based hospital ICU we used for validation uses the same EHR as our academic medical center. We do not know if different health record instances would affect performance and this remains to be evaluated in future work. Second, we limited the information used to train our models to physiologic factors that are measured in most ICUs. This necessarily excluded factors, such as continued intubation for a procedure, that might have influenced clinicians’ decision on readiness for extubation. Our dataset is small compared with larger ICU cohorts such as MIMIC or eICU36,37 and future work validated on these datasets is needed. Our study focused on MICU patients with suspected pneumonia, so generalizing these findings to other ICU settings, for example surgical or cardiac ICUs, will require further work. While our model supports the feasibility of a machine learning model to identify patients who might be considered for extubation, we have not yet deployed it in a clinical setting. Nevertheless, we designed our predictor with an eye towards deployment, only using data from midnight to 8AM. In the future, we envision providing model output to a clinical team on rounds as a Bayesian pre-SBT probability of extubation success. Alternatively, the model could be incorporated into the EHR to automatically trigger a clinical decision support alert to bedside nurses or respiratory therapists suggesting initiation of a SBT to help optimize the timing of extubation, consistent with best evidence.4 We hope that this can help both shorten intubation duration and prevent failed extubations, to improve the care of critically ill patients.
Funding
SCRIPT is supported by NIH/NIAID U19AI135964. Work in the Division of Pulmonary and Critical Care is also supported by Simpson Querrey Lung Institute for Translational Science (SQLIFTS) and the Canning Thoracic Institute.
NSM is supported by AHA 24PRE1196998. GRSB is supported by the NIH (U19AI135964, P01AG049665, R01HL147575, P01HL071643, and R01HL154686); the US Department of Veterans Affairs (I01CX001777); a grant from the Chicago Biomedical Consortium; and a Northwestern University Dixon Translational Science Award. RGW is supported by NIH grants (U19AI135964, U01TR003528, P01HL154998, R01HL14988, and R01LM013337). AVM is supported by NIH grants (U19AI135964, P01AG049665, R21AG075423, R01HL158139, R01HL153312, and P01HL154998). BDS is supported by the NIH (R01HL149883, R01HL153122, P01HL154998, P01AG049665, and U19AI135964). AA is supported by NIH grants (U19AI135964 and R01HL158139). CAG is supported by NIH/NHLBI K23HL169815, a Parker B. Francis Opportunity Award, and an American Thoracic Society Unrestricted Grant.
Conflict of interest
BDS holds US patent 10,905,706, "Compositions and methods to accelerate resolution of acute lung inflammation," and serves on the scientific advisory board of Zoe Biosciences, in which he holds stock options. Other authors have no conflicts within the area of this work.
Role of funding source
The funding sources did not have a role in the design, execution, or prior review of the study or in the data presented in this manuscript. Opinions expressed in this work do not necessarily reflect those of the funding sources.
Human Ethics and Consent to Participate
This study was approved by the Northwestern University Institutional Review Board with study IDs STU00204868 and STU00216678.
Data availability
A significant portion of this data has been already made available through PhysioNet at https://physionet.org/content/script-carpediem-dataset/1.1.0/, a future update will include new patients and updated data since the publication of the original dataset. Code for processing and analysis are available at https://github.com/NUPulmonary/2024_Fenske_Peltekian.
Author contributions
Conceptualization: CAG, AA, AVM, GRSB, RGW
Methodology: SF, AP, AA, CAG
Data acquisition: MK, AP
Programming/analysis/visualization: SF, AP, NSM, AA, CAG
Chart review: MZ, KG, MB, CAG
Drafting: SF, AP, AA, CAG
Editing: all authors
Supplemental methods and materials for Developing and validating machine learning model to predict successful next-day extubation in the ICU
Cohort information
Our study involved patients who had received mechanical ventilation and were admitted to an intensive care unit (ICU) as part of the Successful Clinical Response in Pneumonia Therapy (SCRIPT) study conducted at Northwestern Memorial Hospital (NMH) between June 2018 and April 2023. Patients who had multiple ICU stays during the same hospitalization had their stays numbered consecutively but did not contribute data between their ICU stays. Multiple hospitalizations of the same patient are reported as separate patients, as SCRIPT enrollments are unique to each hospitalization. For patients who had multiple sequences of intubation and extubation in a given ICU stay, we used only the first intubation/extubation sequence.
EHR data
Electronic health record (EHR) data are compiled by the Northwestern Medicine Enterprise Data Warehouse (EDW), the primary data repository for clinical data at Northwestern Medicine. Approximately 150 data sources (including the main EHR system, Epic) are loaded into the EDW on a nightly basis. The data are primarily loaded using Microsoft technologies (Visual Studio, SSIS, etc.) and scheduled to load via the SQL Server Job Agent. Data engineers and architects on the EDW team then combine the data sources using custom SQL scripts, Visual Studio, SSIS, etc. to create datamarts. Analysts on the EDW team then work with the datamarts to create reports, dashboards, and extracts validated with clinician input. Over two hundred charts were manually reviewed by physicians to ensure data pull fidelity. Questions were brought to a committee meeting comprised of five critical care physicians and consensus was obtained before moving forward. Each patient chart was reviewed by the research study team on study close to gather information including transition to comfort measures only.
Clinical parameters
We selected 37 clinical features considered representative of those that physicians would consider during daily ICU rounds. We compiled the status of intubation, extracorporeal membrane oxygenation (ECMO), renal replacement therapy (hemodialysis [HD], and continuous renal replacement therapy [CRRT]), sedation parameters (Glasgow Coma Scale [GCS] subscores of eye opening, motor response, verbal response, and Richmond Agitation Sedation Scale [RASS]), lung injury (arterial pH, PaO2, PaCO2, PEEP, FiO2, plateau pressure, lung compliance, minute ventilation, and oxygen saturation), hemodynamics (norepinephrine rate in mcg/kg/min, mean arterial pressure, systolic and diastolic blood pressure, lactic acid, hemoglobin, and bicarbonate), renal (creatinine, and aforementioned HD and CRRT flags), inflammatory markers (WBC count, neutrophil count, platelets, procalcitonin, C-reactive protein, D-dimer, lactate dehydrogenase, ferritin, bilirubin, albumin, and lymphocytes), vital signs (temperature, heart rate, respiratory rate). When multiple measurements were available for the same time block, they were aggregated by mean. Outliers were removed prior to aggregation by using predefined ranges for each measurement. Full details are available in our code at https://github.com/NUPulmonary/2024_Fenske_Peltekian.
Time blocking
Since the majority of extubations occur after 8AM (Supplemental Figure 1), we aggregated our dataset using only information from midnight to 8AM. This strategy would be optimal for future deployment to allow information and model predictions to be presented to the clinical team during rounds.
Annotating flags
ECMO_flag was labeled as 1 if that day lay on or in between the first and last days of recorded ECMO. Hemodialysis_flag was labeled as 1 if the patient was ever on Hemodialysis between midnight and 8AM that day. CRRT_flag was also labeled as 1 if the patient was ever on CRRT between midnight and 8AM that day.
Continuous features
Features measured on a continuous numerical scale were aggregated by mean from all recordings between midnight and 8AM that day.
Categorical features
RASS and GCS scores were the only ordinal features in the dataset. Because they are still a numerical scale, they were aggregated the same way continuous features were.
Annotating extubation status
To label days as intubated or extubated, we parsed through the EDW reports of oxygen support devices (ranging from nasal cannula to invasive mechanical ventilation) and employ a series of rules to determine the extubation status of each day. Full details are present in our code repository at https://github.com/NUPulmonary/2024_Fenske_Peltekian. In most cases, a patient is maintained entirely as intubated or extubated between midnight to 8AM. Extubation status was labeled according to data from the oxygen device report for that day.
When pulling data from the EDW, days were marked as intubated based on the presence of at least two instances of intubation or if one of the following was documented: PEEP, plateau pressure, static compliance, or minute ventilation. The cutoff of 8AM was purposely chosen to obtain the cleanest data such that transitions on/off the ventilator do not confound that extubation status of a patient day. There are two other broad classes: a patient being intubated/extubated before 8AM. If a patient is intubated before 8AM, this day is excluded from modeling due to a mix of intubated and extubated data, and the following day, assuming they remain intubated from midnight to 8AM, would mark the first ICU day presented to the model. If a patient is extubated before 8AM, the day is marked as extubated. Box 1 provides a summary of our inclusion criteria for training a model on successful next-day extubation. Box 2 outlines further manual checks that were performed to ensure proper labeling of extubation status between 12AM and 8AM on each day.
Preserving only successful extubation cases.
Exclude patient-days with tracheostomy, as ventilator liberation in this population is different and lower stakes than patients with endotracheal tubes.
Excluding cases where reintubation is required within two days of extubation.
Excluding cases of extubation while the patient remains on ECMO.
Excluding cases of palliative extubation. In our dataset, this was gathered by the research team and CAG performed additional manual review. A pragmatic way of labeling this in the absence of chart review would be to exclude cases where a patient died within 48 hours of extubation.
Using O2 device report to assign extubation status.
If documented extubation event on an intubated day, assign as extubated.
If extubated day has at least three or is entirely documentations of intubation, label as intubated.
If patient is re-intubated within two days of an extubated day, label the extubated day(s) as failed extubation.
If within the same day, a patient is intubated, then extubated, then has at least three documentations of intubation following the extubation event or the next day is intubated, label as failed extubation. Here we require a minimum of three documentations of intubation following the extubation event due to one-off ventilator reads that will show up during the extubation documentation. If the next day is intubated, it is safe to assume that the intubation documentation is correct.
If patient was extubated on their last day in the ICU and we have documentation that it was a pallative extubation, then label the preceding extubated days as CMO.
If patient was extubated on their last day in the ICU and die within 48 hours of the day of extubation, label the day(s) as CMO.
Flag all days that have at least three documentations of trach collar. These patients will be withheld from analysis.
Label all extubated days preceding intubation during a given ICU stay as preintubated.
Multiple intubation sequences
Sequences including failed extubation are excluded from training. Cases of extubation while still on ECMO and preintubation days are also excluded from training. Cases with one ICU day are naturally excluded as there is no next-day extubation status to predict. We further consolidated our training dataset by excluding sequences of reintubation.
All-inclusive model
We initially tried an all-inclusive model that included all days. This included pre-intubation, CMO, failed, and successful extubation days. This was easier to assess and presented a better general purpose model design, but the model failed to capture the differences between intubated and extubated days. This was due to the noise introduced by CMO, failed, and pre-intubation days that made it difficult for the model to learn when a patient was ready for extubation. Designing a model with the more focused goal of predicting successful extubation, we consolidated our time series sequences and were able to see the model learn the differences between patient-days on which a successful versus unsuccessful extubation would be predicted to occur.
Imputation Strategies
No imputation, called ‘raw’: as some models such as XGBoost can inherently handle missing values, no imputation was performed in ‘raw’ models.
Mean Imputation: The missing values were initially replaced with the mean of the respective feature derived from the training data. This mean value was then applied universally to any absent data points in both training and test sets.
Building upon the work of Fiddle,17 a modified version was employed as an encoding strategy. Continuous features were binned into quartiles derived from the training set, subsequently encoding each as either 1 or 0 based on its bin membership. All absent features were encoded using a fifth feature, represented by a 1, which signifies a missing value. Thus, for a given feature, only exactly one of the five derived features would be encoded as 1, and the other four encoded as 0. For example, if someone had a temperature of 97.1, the value would be encoded as a 1 in the lowest bin of temperature quartiles, with the rest of the bins being zero (Bin 1: 1, Bin 2: 0, Bin 3: 0, Bin 4: 0, Mask Bin: 0). If their temperature value is missing for that day, it would be encoded in the mask bin. Quartiles, or the boundaries for each bin, were established based on the range of values in the train set, and the test set data is placed into these same bins.
Deep learning and boosting models in sequential data prediction
Deep Learning Architectures
RNN and LSTM
In deep learning, Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs) are valuable for sequential data prediction. Our models consisted of an input layer designed to handle sequences of varying lengths, padded with zeros. We included a masking layer so that these padded values do not interfere with the model’s computations by skipping time steps with zero input. RNNs can handle data in a sequential manner, capturing time-step patterns and remembering historical data. However, they can experience issues with the vanishing gradient problem and can propagate errors across predictions. LSTMs address the vanishing gradient issue, handling long sequences with their advanced memory capabilities. They do also share the problem of error propagation and are computationally more complex than standard RNNs.
Padding Sequences
For the deep learning models, sequences of varying lengths were encountered. To make them consistent in length, they were padded with 0s. This ensures that the model receives input of a uniform shape, a requirement for efficient batch processing.
Boosting Models
Boosting models, such as XGBoost and LightGBM, treat each data point independently. They are calibrated using isotonic regression for better alignment with the true data distribution, which is often used in imbalanced datasets. Boosting models are efficient against missing data and do not let incorrect predictions influence subsequent ones since they handle prediction problems non-sequentially. They often outperform other traditional machine learning algorithms but lack the sequential data understanding and may overfit if not properly tuned. Logistic regression also treats data points independently but cannot handle missing data.
Hyperparameter Optimization
Hyperparameter tuning was performed on the top five best performing models, with details of specific hyperparameters tried in Supplemental Table 5.
Key Metric Observations
AUROC (Area Under the Receiver Operating Characteristic Curve)
Measures the ability of a model to distinguish between classes. It is represented as a curve plotting the true positive rate against the false positive rate at various threshold settings.
AUPRC (Area Under the Precision-Recall Curve)
Plots precision (the ratio of true positives to the sum of true and false positives) against recall (the ratio of true positives to the sum of true positives and false negatives). This curve provides a trade-off between precision and recall across different thresholds, making it useful for evaluating models where one class is significantly underrepresented.
Accuracy
Ratio of correctly predicted observations to the total observations. It represents the proportion of true results (both true positives and true negatives) among the total number of cases examined.
F1 Score
Provides a balance between precision and recall. The F1 Score reaches its best value at 1 (perfect precision and recall) and worst at 0.
Recall
Recall (sensitivity) measures the proportion of actual positives correctly identified by the model. It is calculated as the ratio of true positives to the sum of true positives and false negatives.
Precision
Measures the proportion of positive predictions that are correct. It is calculated as the ratio of true positives to the sum of true positives and false positives.
Confidence intervals
Confidence intervals for general performance metrics were generated by bootstrapping. Bootstrapping was done through 1000 passes through the dataset where each pass takes n days from the dataset with replacement, where n is the number of days in the test set. The inner 95th percentile of each metric is the reported 95% CI.
Threshold for binary predictions
Set using the class imbalance in the train set. Specifically, the ratio of next-day intubated days to all days (0.91).
Feature Importance
SHAP values
SHAP values are a useful tool for explaining the output of any machine learning model. They measure the impact of each feature on a particular prediction in comparison to the model’s baseline output. For our boosting models, SHAP (using the summary_plot method) values give us the marginal contribution of each feature towards the prediction for every individual instance. This explains importance features, and also how a specific feature can influence a particular prediction.
LR coefficients
For Logistic Regression models, each feature contains a coefficient where greater coefficients imply greater effect on model output.
Ablation
For RNN and LSTM models, an ablation procedure was performed to quantify a feature’s impact on model performance. For each feature in the test set, all values were converted to NA, and the change in AUROC from such data from the original data was reported as the feature importance score. For example, when assessing the feature importance of RASS score in a binned LSTM model, each RASS score value would shift from a 1 in whichever bin it fell in to a 1 in the RASS_score_mask column. Based on Figure 4, we know this led to a decrease in AUROC, implying RASS score is an important feature to the model.
Evaluating early model predictions
To investigate the timing of extubation predictions, we investigated where incorrect model predictions occurred in reference to true extubation events. To provide an unbiased quantification of this, we subset on all ICU stays containing a successful extubation, and examined where the model first predicted extubation during these ICU stays. We report that a significant amount of incorrect model predictions are within three days of the true extubation event (Figure 5).
Supplemental Figures and Tables
Supplemental Documents
NU SCRIPT Study Investigators
TRIPOD Checklist [Moons KG, Altman DG, Reitsma JB, Ioannidis JP, Macaskill P, Steyerberg EW, Vickers AJ, Ransohoff DF, Collins GS. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): Explanation and Elaboration. Ann Intern Med. 2015;162(1):W1-W73. PMID: 25560730]
Acknowledgements
BioRender was used for figure generation.