Abstract
Acute hypoxemic respiratory failure (RF) occurs frequently in critically ill patients and is associated with substantial morbidity, mortality and increased resource use. We used machine learning to create a comprehensive monitoring system to assist intensive care unit (ICU) physicians in managing acute RF. The system encompasses early detection and ongoing monitoring of acute hypoxemic RF, assessment of readiness for tracheal extubation and prediction of the risk of extubation failure. In study patients, the model predicted 80% of RF events at a precision of 45%, with 65% of RF events identified more than 10 hours before RF onset. System predictive performance was significantly higher than that of standard clinical monitoring based on the patient’s oxygenation index and was successfully validated in an external cohort of ICU patients. We have demonstrated how the estimated risk of extubation failure (EF) could facilitate prevention of both extubation failure and unnecessarily prolonged mechanical ventilation. Furthermore, we illustrated how machine-learning-based monitoring of RF risk, along with the necessity for mechanical ventilation and extubation readiness on a patient-by-patient basis, can facilitate resource planning for mechanical ventilation in the ICU. Specifically, our model predicted ICU-level ventilator use 8 to 16 hours into the future, with a mean absolute error of 0.4 ventilators per 10 patients of effective ICU capacity.
Introduction
Acute hypoxemic respiratory failure (RF) is a common occurrence in intensive care unit (ICU) patients and is associated with high morbidity, mortality and high resource use1,2. Hypoxemic RF (Type I RF) is the most common type of respiratory failure3 and its severity is defined by the P/F (PaO2/FiO2) ratio, with values below 200 mmHg corresponding to moderate and below 100 mmHg to severe RF. Treating patients with RF involves a sequence of clinical evaluations, including identifying RF and the need for mechanical ventilation, monitoring the recovery of lung function, determining the right time to stop mechanical ventilation, and assessing the risk of complications after tracheal extubation.
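The severity grading described above can be written as a small function. This is a minimal sketch and not part of the published system: the function name, the FiO2 range check and the label for ratios of 200 mmHg or more are assumptions made for illustration.

```python
def classify_rf_severity(pao2_mmhg: float, fio2_fraction: float) -> str:
    """Grade hypoxemic RF from the P/F ratio (PaO2 in mmHg, FiO2 as a fraction).

    Thresholds follow the text: below 200 mmHg is moderate, below 100 mmHg is severe.
    The label returned for ratios of 200 mmHg or more is an illustrative assumption.
    """
    # Assumed plausibility check: FiO2 lies between room air (0.21) and pure oxygen (1.0).
    if not 0.21 <= fio2_fraction <= 1.0:
        raise ValueError("FiO2 must be given as a fraction between 0.21 and 1.0")
    pf_ratio = pao2_mmhg / fio2_fraction
    if pf_ratio < 100:
        return "severe"
    if pf_ratio < 200:
        return "moderate"
    return "not_moderate_or_severe"
```

For example, a PaO2 of 60 mmHg at an FiO2 of 0.5 gives a P/F ratio of 120 mmHg, which falls in the moderate range.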
For optimal clinical decision-making, it is paramount to continuously monitor the patient’s clinical state in an attempt to predict their future clinical course. ICU physicians base their treatment decisions mostly on intermittent clinical assessments and trend evaluation of monitored vital signs stored in electronic patient data management systems. In the increasingly complex ICU environment, clinicians are confronted with large amounts of data from a multitude of monitoring systems of numerous patients. The quantity of data and the possibility of artifacts increase the risk that clinicians will not readily recognize, interpret, and act upon relevant information, potentially contributing to suboptimal patient outcomes and increased ICU resource expenditure4 compared to optimal care. Large datasets involving multiple data points on many patients are ideal for automatic processing by machine learning (ML) algorithms5,6. To facilitate such advancements, we previously published the High time Resolution Intensive care unit (HiRID) Dataset, which encompasses approximately 34,000 ICU admissions12. ML has been used to develop decision support systems for various conditions in the ICU, such as acute respiratory distress syndrome (ARDS)7–11, circulatory failure12, sepsis13–15, and renal failure16.
We aimed to develop a comprehensive, ML-based respiratory monitoring system (RMS), consisting of multiple subsystems to simplify and expedite the management of individual patients with RF and to optimize ICU resource planning. For individual patients, the system predicts the risk of hypoxemic RF (RMS-RF) and the need for mechanical ventilation (RMS-MVStart), continuously monitors changes and improvements of the respiratory state, and predicts the remaining time of required mechanical ventilation (RMS-MVEnd) and probability of successful extubation (RMS-EF). We investigated how using respiratory state predictions on a patient-by-patient basis could enable estimating the future number of patients in need of mechanical ventilation on a shift-to-shift short-term basis (resource planning). In addition, we prepare a new version of the dataset, HiRID-II, which we anticipate will significantly expand both the number of included patients and the range of available clinical variables.
We hypothesized that our ML system could predict the relevant respiratory events throughout the treatment process of individual patients accurately and early, both in the development dataset and when validated in externally sourced data. In addition, we intended to develop a resource management support tool to predict ICU-level future mechanical ventilator use by integrating all RMS scores across ICU patients.
Results
Preparation of an extended HiRID dataset (HiRID-II)
We present the High time Resolution Intensive care unit Dataset II (HiRID-II), a substantial update to HiRID-I17, that will be made available1 to the research community on Physionet.org18,19. This new dataset contains 60% more ICU admissions than its predecessor (Table 1, Extended Data Fig. 1a). Additionally, the number of variables has increased from 209 to 310 (Extended Data Fig. 1b). The dataset was k-anonymized with respect to age, weight, height, and gender, reducing the number of admissions from 60,503 to 55,85820. Admission dates were randomly shifted to further reduce the risk of identification of individual patients. To allow the assessment of model generalization to the future, the dataset was divided into temporal splits while respecting k-anonymization (Extended Data Fig. 1c). A high-resolution external evaluation dataset, extracted from the Amsterdam UMCdb21, was used to test the generalizability of RMS to other health care systems (Extended Data Fig. 1d). Preliminary analysis of the HiRID-II dataset revealed strong correlations of the occurrence of RF and EF with ICU mortality, confirming prior results1 and motivating our proposed RMS (Extended Data Fig. 2).
Development of a continuous monitoring system for respiratory management
We developed a comprehensive ML-based respiratory monitoring system composed of four interrelated predictive models that, together, cover the respiratory trajectory of patients. RMS-RF estimates every five minutes the risk of moderate to severe hypoxemic respiratory failure (P/F ratio < 200 mmHg) in the next 24 hours. RMS-MVStart predicts the need for mechanical ventilation, while RMS-MVEnd determines whether the patient will be ready to be liberated from ventilatory support; both tasks forecast risk within the subsequent 24 hours. Finally, RMS-EF evaluates the likelihood of successful extubation at given time points, when the patient already meets formal criteria.
Patient state annotation and labeling
For each time point we determined if a patient is currently in moderate or severe hypoxemic RF (P/F ratio < 200 mmHg), mechanically ventilated, or ready to be extubated. Readiness to extubate (REXT) status at each time point was defined using a heuristic scoring system determined by gas exchange, respiratory mechanics, hemodynamics and neurological status. A score threshold was manually selected after inspection of the time series by an experienced ICU clinician (Fig. 1b). At these time-points the patient could be extubated according to formal criteria, but extubation failure can still occur. The current ventilation status was derived from the presence of ventilator-specific parameters.
Positive labels for future RF were defined as time points when the patient was not currently in RF but would exhibit RF in the next 24 hours (“impending RF”); while negative labels were assigned to time points when the patient was not currently in RF and would remain stable during the next 24 hours. For every extubation event, we determined whether it failed (reintubation necessary within 48h after extubation) and used it as the label for extubation failure. Labels for ventilation onset and readiness to extubate prediction were positive at time points when the patient was currently not ventilated/ready-to-extubate but would be in the next 24 hours (Fig. 1b). In HiRID-II, 43.7% and 46.2% of all patients had RF events and required mechanical ventilation, respectively. The dataset contained 23,861 extubation events, of which 11.1% extubations failed.
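The labeling rule for impending RF can be illustrated on a five-minute time grid. The following is a minimal sketch and not the released labeling pipeline: the function name, the use of `None` for time points without a label, and the handling of the last 24 hours of a stay (only the remaining observed window is inspected) are assumptions.

```python
from typing import List, Optional


def impending_rf_labels(in_rf: List[bool], horizon_steps: int = 288) -> List[Optional[int]]:
    """Label each grid step for the 'impending RF' task.

    in_rf[t] is True when the patient is in RF at step t.
    horizon_steps=288 corresponds to 24 hours on a five-minute grid.
    Returns 1 (not in RF now, RF within the horizon), 0 (not in RF now,
    no RF within the horizon), or None (currently in RF, so no label).
    """
    labels: List[Optional[int]] = [None] * len(in_rf)
    next_rf: Optional[int] = None  # index of the nearest RF step after t
    # Walk backwards so the nearest future RF step is always known.
    for t in range(len(in_rf) - 1, -1, -1):
        if in_rf[t]:
            next_rf = t
        else:
            within_horizon = next_rf is not None and next_rf - t <= horizon_steps
            labels[t] = 1 if within_horizon else 0
    return labels
```

With a shortened horizon of two steps, the status sequence `[False, False, False, True, True, False]` yields `[0, 1, 1, None, None, 0]`: the first step is too far from the event to be positive, and the steps during the event carry no label.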
Continuous P/F ratio estimation
To measure PaO2, an arterial blood sample is necessary. PaO2 is therefore only periodically available at a low temporal resolution. For continuous high-resolution hypoxemic RF labelling, a continuous estimation of the current PaO2 is required. An underlying physiological principle determines the binding and dissociation of oxygen and hemoglobin, yielding a sigmoid correlation24–26 between arterial oxygen saturation (SpO2) and PaO2. A model based on the continuously available SpO2 can therefore estimate continuous PaO2. We developed an ML algorithm to produce continuous estimates of PaO2 based on SpO2 and other relevant variables determining the hemoglobin-oxygen dissociation curve. The algorithm outperformed the existing non-linear Severinghaus-Ellis baseline27 for estimating PaO2 values from non-invasive SpO2 measurements (Extended Data Fig. 3). By integrating PaO2 estimates with continuously available FiO2 data, we derived P/F ratio estimates on a five-minute time grid.
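For orientation, the Severinghaus-Ellis baseline mentioned above can be sketched in a few lines. This illustrates only the classical closed-form baseline, not the ML estimator developed in this work; the function names and input checks are assumptions. The Severinghaus equation models saturation as S = 1 / (23400 / (PO2^3 + 150 PO2) + 1), and the Ellis inversion solves the resulting cubic for PO2.

```python
import math


def severinghaus_so2(po2_mmhg: float) -> float:
    """Oxygen saturation (fraction) from PO2 (mmHg) via the Severinghaus equation."""
    return 1.0 / (23400.0 / (po2_mmhg ** 3 + 150.0 * po2_mmhg) + 1.0)


def ellis_pao2_from_spo2(spo2_fraction: float) -> float:
    """Estimate PaO2 (mmHg) from SpO2 (fraction) by inverting the Severinghaus equation."""
    if not 0.0 < spo2_fraction < 1.0:
        raise ValueError("SpO2 must be a fraction strictly between 0 and 1")
    # Cardano solution of x^3 + 150 x = 2a, with a = 11700 / (1/S - 1).
    a = 11700.0 / (1.0 / spo2_fraction - 1.0)
    b = math.sqrt(50.0 ** 3 + a * a)
    return (b + a) ** (1.0 / 3.0) - (b - a) ** (1.0 / 3.0)


def pf_ratio_estimate(spo2_fraction: float, fio2_fraction: float) -> float:
    """P/F ratio estimate (mmHg) from SpO2 and FiO2, both given as fractions."""
    return ellis_pao2_from_spo2(spo2_fraction) / fio2_fraction
```

Because the curve flattens at high saturations, small SpO2 errors near 100% translate into large PaO2 errors, which is one reason a learned estimator using additional variables may outperform this closed-form baseline.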
Development of RMS predictors
The RMS produces four individual scores active at different stages of the RF management process. All four subsystems are based on manual feature engineering and LightGBM22 predictors, similar to our previous work12. Prior analyses on HiRID-I for circulatory failure and a related respiratory failure task have shown superior performance of LightGBM compared to other models including deep learning12,23. The predictor for RF (RMS-RF) used 15 clinical variables (Supplemental Table 3). As in Hyland et al.12, the system triggered an alarm if the RF score exceeded a specified threshold. It was silenced for 4 hours afterwards. The alarm system was reset when the patient recovered from an event and could reactivate 30 minutes later. The extubation failure (RMS-EF) predictor used 20 clinical variables (Supplemental Table 3). The RMS-RF & RMS-EF variable sets were identified separately for the two tasks, using greedy forward selection on the validation set of five data splits. The models for ventilator use (RMS-MVStart) and extubation readiness (RMS-MVEnd) used the union of the parameters of the two main tasks, consisting of 26 variables in total (Supplemental Table 3).
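The alarm policy described above (threshold crossing, 4-hour silencing, reset after recovery with reactivation 30 minutes later) can be sketched as follows. This is an illustrative reading of the description on a five-minute grid, not the released implementation; the function name, the parameter names and the choice to raise no alarms while an event is ongoing are assumptions.

```python
from typing import List


def raise_alarms(
    scores: List[float],
    in_event: List[bool],
    threshold: float,
    silence_steps: int = 48,     # 4 hours on a five-minute grid
    reset_delay_steps: int = 6,  # 30 minutes on a five-minute grid
) -> List[bool]:
    """Return, per grid step, whether an alarm is raised."""
    alarms: List[bool] = []
    blocked_until = 0  # first step index at which an alarm may fire again
    for t, (score, event) in enumerate(zip(scores, in_event)):
        if event:
            # Assumption: no alarms during an event. Recovery resets any
            # remaining silencing; alarms may reactivate after the delay.
            blocked_until = t + 1 + reset_delay_steps
            alarms.append(False)
            continue
        fire = t >= blocked_until and score > threshold
        if fire:
            # Silence subsequent alarms for the configured period.
            blocked_until = t + 1 + silence_steps
        alarms.append(fire)
    return alarms
```

Silencing trades a small delay in repeated warnings for fewer alarms per patient, which is the mechanism by which such a policy limits alarm load.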
We utilized the four risk scores to predict short-term mechanical ventilator resource requirements by training a meta-model (Fig. 1c). The resource planning problem was divided into two sub-problems; predicting the future ventilator use for already admitted ICU patients, and predicting the requirement for mechanical ventilation for newly admitted non-elective patients in the near future. We excluded elective patient admissions as their resource use is typically known well in advance (hence, no prediction is needed). The predictor uses date and time information as well as summary statistics regarding ventilator use and patient numbers derived from the ICU dataset. A LightGBM22 regressor was trained to solve both sub-problems. For admitted ICU patients, it predicts the necessity for mechanical ventilation in the short-term future, as well as the total number of ventilators required for all admitted patients as an aggregate of the individual predictions (Fig. 1d).
Open source release
All elements of the developed system (Fig. 1e), including data preprocessing, annotation, prediction task labeling, and both training and prediction pipelines, are made available under an open source license, facilitating the reproducibility and reuse of the methodology and results.
RMS-RF predicted RF early with high precision and reduced false alarms compared to clinical baselines
We developed a model that continuously evaluates the likelihood of a patient developing hypoxemic RF within the next 24 hours, updating its predictions every five minutes throughout the ICU stay. We define RF as a moderate or severe reduction in oxygenation, reflected by a P/F ratio below 200 mmHg. To focus on impending deterioration, the model only generates predictions when the patient is not currently experiencing RF. Conversely, if the patient remains stable and does not meet criteria for RF over the subsequent 24 hours, the model recognizes a low-risk state. The early prediction of hypoxemic RF is crucial for timely intervention and may reduce the number of unfavorable patient outcomes and improve overall healthcare quality. By accurately forecasting these events, RMS-RF may not only improve clinical decision-making but also allow physicians to commence treatment early, thereby mitigating the risk of more severe respiratory complications.
The RMS-RF model achieved an area under the alarm/event precision recall curve12 (AUPRC) of 0.559 with an alarm precision of 45% at an event recall of 80%. The model had an area under the receiver operating characteristic curve (AUROC) of 0.839 (Extended Data Fig. 4a). We observed that RMS-RF significantly outperformed two comparator baselines, a decision tree that uses the current value of respiratory and other parameters (SpO2, FiO2, PaO2, Positive end-expiratory pressure (PEEP), RR, Ventilator presence, HR, GCS) as well as a clinical threshold-based system based on the SpO2/FiO2 ratio (Fig. 2a). RMS-RF was well calibrated, in contrast to the two baselines (Extended Data Fig. 4b). The system detected 65% and 78% of respiratory failure events at least 10 hours before they occurred when set to an event recall of 80% and 90%, respectively (Fig. 2b). Compared to the SpO2/FiO2 threshold, our system generated two-thirds fewer false alarms per day on days without respiratory failure (Fig. 2c). The model performance increased with inclusion of data up to 25% of the total dataset size, while no further improvements were observed when using more data (Extended Data Fig. 4c). Model performance was highest in patients across cardiovascular and respiratory diagnostic groups (alarm precisions 55% and 60% at 80% event recall, respectively). Lower performance was observed in neurologic and trauma patients (Fig. 2d). Performance varied in groups determined by age and gender28 (Extended Data Fig. 4d/e). RMS-RF exhibited physiologically plausible relationships between risk and clinical variables, according to SHapley Additive exPlanations (SHAP)29 values (Fig. 2e, Extended Data Fig. 5).
The proposed RMS-RF model used a small number of physiological parameters and ventilator settings. Slightly diminished performance was observed when the HiRID-II-based model was externally validated in the Amsterdam UMCdb dataset21. There was no significant performance improvement observed through retraining with local data (Fig. 2f; both have 38% alarm precision at 80% event recall). We excluded medication variables to reduce the effect of differences in medication policies in different hospitals. A variant of RMS-RF including medication variables (RMS-RF-p) achieved only minor gains in internal HiRID performance (Fig. 2g) and exhibited poor transfer performance to UMCdb (Extended Data Fig. 6a). Medication policy differences between HiRID-II and UMCdb were analyzed to investigate these drops in transfer performance, indicating that these were generated by differences in the use of loop diuretics, heparin, and propofol (Fig. 2g, Extended Data Fig. 6b/c).
RMS-EF predicted extubation failure with high precision and was well-calibrated
The accurate prediction of extubation failure is a critical aspect of patient management in intensive care, enabling clinicians to make informed decisions about the ideal timing of extubation. By utilizing RMS-EF to predict the risk of extubation failure, physicians could judiciously determine whether to proceed with or delay extubation based on a quantifiable risk threshold, potentially reducing the likelihood of complications associated with both premature extubation and unnecessary prolongation of mechanical ventilation. We compared the developed RMS-EF predictor to a threshold-based scoring system, which counts the number of violations of clinically established criteria for readiness to extubate at the time point when the prediction is made (REXT status score). RMS-EF significantly outperformed the baseline (Fig. 3a) with an AUPRC of 0.535 and an AUROC of 0.865 (Extended Data Fig. 7a). We also analyzed calibration and observed a high concordance between observed risk of extubation failure and RMS-EF with a Brier score of 0.078, in contrast to the baseline (Fig. 3b). The precision for predicting extubation failure was 80% at a recall of 20%, indicating that RMS-EF can effectively identify the patients at highest risk. RMS-EF predicted successful extubation at least 3h prior to the time point when extubation effectively took place in 25% of events (Fig. 3c). As with RMS-RF, no major improvements in model performance were observed when using more than 25% of the training data (Extended Data Fig. 7b). Performance across diagnostic groups was similar, with RMS-EF performing best in respiratory patients (Fig. 3d). We observed that the performance in female patients and older age groups was slightly inferior (Extended Data Fig. 7c/d). As RMS-EF is based almost exclusively on variables that are influenced by clinical policies (Fig. 3f, Extended Data Fig. 8) which likely differ in different hospitals, it transferred poorly to the UMCdb dataset21 (Extended Data Fig. 7e).
However, a variant of our model can be constructed without medication variables, which transferred well to the UMCdb dataset with slightly reduced internal performance (Fig. 3e; AUPRC 53.5% vs. 49% for HiRID). Accordingly, the analysis of medication policies revealed major differences for ready-to-extubate patients between HiRID-II and UMCdb (Extended Data Fig. 7f/g). SHAP value analysis30 showed that the RMS-EF risk score was dependent on several parameters determined by treatment-policies (Fig. 3f, Extended Data Fig. 8). Severe loss of transfer performance resulted from the inclusion of sedatives and vasopressors in the model (Fig. 3g).
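The Brier score used above to summarize calibration is the mean squared difference between predicted probabilities and observed binary outcomes. The following minimal sketch shows the computation together with the score of a constant forecast, which serves as a simple reference point; the function names are assumptions and the code is not taken from the released pipeline.

```python
from typing import Sequence


def brier_score(probabilities: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(probabilities) != len(outcomes) or len(outcomes) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - y) ** 2 for p, y in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)


def constant_forecast_brier(base_rate: float) -> float:
    """Brier score of always predicting the base rate: p * (1 - p)."""
    return base_rate * (1.0 - base_rate)
```

As a rough point of comparison, a forecast that always predicted the 11.1% overall extubation failure rate reported for HiRID-II would score about 0.099, assuming the evaluation set had the same failure rate. Lower values are better, so a score of 0.078 indicates an improvement over this uninformative reference under that assumption.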
Predicting intubation and readiness-to-extubate for individual patients
We also evaluated the prediction of ventilation onset (RMS-MVStart) and readiness to extubate (RMS-MVEnd) within the next 24h at the patient level. We observed high discriminative performance with AUROCs of 0.914 and 0.809 (Extended Data Fig. 9a/b) and event-based AUPRCs of 0.528 and 0.910 (Extended Data Fig. 9c/d) for RMS-MVStart and RMS-MVEnd, respectively, and the models were well calibrated (Extended Data Fig. 9e/f).
Integrating all RMS scores of individual patients for planning ICU-level resource allocation
Using the predictions for the four models focusing on respiratory failure (RMS-RF), extubation failure (RMS-EF), ventilation onset (RMS-MVStart), and readiness to extubate (RMS-MVEnd), we developed a combined model predicting the number of ventilators in use for non-elective patients at a specific future horizon. Preliminary analysis of the HiRID-II dataset demonstrated substantial variation in demand for ventilators each day, underscoring the need for a model to aid resource planning (Fig. 4a).
We trained a meta-model using the four scores to predict the number of ventilators in use in the ICU at future time horizons every hour (4-8h, 4-12h, 8-12h, 8-16h, 16-24h; Fig. 4b). We compared it with a baseline that predicts that the number of non-elective patients requiring mechanical ventilation remains stable. We observed that the proposed model clearly outperforms this baseline in terms of mean absolute error (MAE), with the largest relative gain in longer prediction horizons (Fig. 4b). In 39% of time points the model’s predictions were at least two ventilators closer to ground-truth, for predicted ventilator use in 8-16 hours (Fig. 4c/d). RMS outperformed the baseline for the majority of ICU ventilator utilization scenarios (Fig. 4e) with the largest improvement over the baseline when the respirator use is below the maximum capacity (Fig. 4e) and for predictions of ventilator use during day hours (Fig. 4f).
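The comparison against the stable-count baseline can be illustrated with a short sketch. It assumes an hourly series of ventilators in use and a single-point forecast horizon, whereas the analysis above uses horizon windows and normalizes by effective ICU capacity; the function names are hypothetical and the snippet is not the evaluation code of this study.

```python
from typing import List, Sequence, Tuple


def mean_absolute_error(forecasts: Sequence[float], actuals: Sequence[float]) -> float:
    """Average absolute difference between forecasts and observed values."""
    if len(forecasts) != len(actuals) or len(actuals) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(f - a) for f, a in zip(forecasts, actuals)) / len(actuals)


def persistence_forecasts(
    hourly_counts: Sequence[float], horizon_hours: int
) -> Tuple[List[float], List[float]]:
    """Baseline assuming ventilator use stays at its current level.

    Returns (forecasts, actuals), where the forecast made at hour t for
    hour t + horizon_hours is simply the count observed at hour t.
    """
    n_pairs = len(hourly_counts) - horizon_hours
    forecasts = [hourly_counts[t] for t in range(n_pairs)]
    actuals = [hourly_counts[t + horizon_hours] for t in range(n_pairs)]
    return forecasts, actuals
```

A model's forecasts for the same target hours can be passed to `mean_absolute_error` alongside the baseline's, so that both are scored against identical ground truth.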
Explorative joint analysis of RMS scores throughout the ICU stay
We analyzed the relationship of the four RMS scores produced at each time point of the ICU stay by embedding the most important parameters for respiratory failure and extubation failure prediction (union of the top 10 variables identified for each task, current value feature) using t-distributed stochastic neighbor embedding (t-SNE31) with subsequent discretization into hexes. This approach produces a two-dimensional hex-map that defines subsets of comparable patient states that can be compared across different characteristics, i.e. between the panels for the hex. We observed that the space is divided into two distinct states, corresponding to time points when the patient is ventilated or not ventilated (Fig. 5a). The region of ventilated patients is further subdivided, with patients in the upper part being more likely to be ready-to-extubate (Fig. 5b). As expected, the ventilated and not ready-to-extubate region has the highest observed 24h mortality (Fig. 5c). Patients experiencing respiratory failure were concentrated in a compact region in the area corresponding to non-ventilated patients, as well as scattered throughout the area corresponding to ventilated patients (Fig. 5d). States with high risk of future ventilation need according to RMS-MVStart are close to the boundary of the ventilated region (Fig. 5e). Readiness to extubate scores show a less clear pattern, but scores tend to be higher in the upper part of the ventilated region, which is also enriched in states in which patients are ready-to-extubate (Fig. 5f). For RMS-EF, high scores are concentrated in two distinct regions at the edge of the ventilated region (Fig. 5g). Lastly, RMS-RF scores are high close to the boundary of patients already in respiratory failure (Fig. 5h). The median risk scores of hexes for respiratory failure/ventilation need are strongly positively correlated with an R2 of 0.471 (Fig. 5i). 
Likewise, respiratory failure and extubation failure scores were moderately positively correlated (Fig. 5j). For RMS-EF/RMS-MVEnd scores, no correlation could be observed (Extended Data Fig. 10). For three exemplary hexes with predominantly (1) non-ventilated patients but high RMS-RF score, (2) ready-to-extubate patients but high RMS-EF score, and (3) not-ready-to-extubate patients but high RMS-RF score, the distribution of clinical parameters was analyzed, showing plausible relationships with clinical parameters. For instance, the non-ventilated patients with the highest RF risks had low PaO2, high (supplemental) FiO2 and high respiratory rates (Fig. 5k).
Discussion
We presented a ML-based system for the comprehensive monitoring of the respiratory state of ICU patients (RMS). RMS consists of four highly accurate scoring models that predict the occurrence of respiratory failure, start of mechanical ventilation, readiness to extubate as well as tracheal extubation failure. By combining the prediction scores of all admitted patients at any time point and by accounting for the likelihood of future admissions, RMS facilitated the accurate prediction of the near future cumulative number of patients requiring mechanical ventilation, which may help to optimize resource allocation within ICUs.
The HiRID-II dataset released alongside this work is a rich resource for broad-scale analyses of ICU patient data. It represents an important advance over HiRID-I, both in terms of the number of included patients and the number of clinical parameters that are included. Our initial analysis of the HiRID-II dataset identified clinically significant links; both the presence and duration of respiratory failure, as well as extubation failure, were associated with increased ICU mortality, indicating distinct yet interconnected risk factors. These insights highlight the critical need for advanced alarm systems for clinical settings to reduce the risks associated with respiratory and extubation failure. The release of the HiRID-II dataset on Physionet18,19 will offer numerous opportunities for further research, allowing for more in-depth investigations into various aspects of ICU patient care and outcomes.
RMS-RF predicts respiratory failure throughout the ICU stay, and alarms for impending failure were typically triggered at least 10 hours before the event. This early warning may give clinicians time to optimise medical therapy and potentially prevent the need for intubation. The transparent breakdown of the model’s alarm output into SHAP values of the most relevant parameters may inform clinician understanding and guide their actions. RMS-RF outperformed a baseline representing standard clinical decision-making based on SpO2 and FiO2, and significantly reduced the number of false alarms. It produces RF-specific alarms and silences them for a specified period of time after the model triggers an alarm, reducing alarm fatigue, which is a major issue for ventilator alarms32. Prior to respiratory failure, only 1.5 alarms per patient/day were raised, which is manageable for the clinical personnel and unlikely to cause alarm fatigue. Reassuringly, only variables directly associated with respiratory physiology or ventilator settings were found to be predictive of impending respiratory failure. RMS-RF demonstrated its highest precision in patients admitted with cardiovascular or respiratory diagnoses, while its performance notably declined in neurologic patients. In these patients, ventilatory management is often determined by the need to protect a compromised airway when the level of consciousness is altered, and not by the presence of RF per se33. To date, few externally validated ML models continuously predicting acute respiratory failure in the ICU have been reported. Recent works by Le et al.10, Zeiberg et al.34, and Singhal et al.35 focus on mild respiratory failure (P/F ratio < 300 mmHg). Other models predict respiratory failure at the time of ICU admission or are only valid for specific cohorts36–38.
RMS-EF predicts tracheal extubation failure and significantly outperformed a clinical baseline derived from common clinical criteria for assessment of readiness to extubate status. The model was well calibrated, with almost ideal concordance of the prediction score and observed risk of extubation failure. A potential use case would be to assess the predicted failure risk to determine whether to accelerate or delay the extubation of the patient. At 80% recall, a quarter of correctly predicted extubation successes were recommended more than 3h before the actual extubation. This exceeds the recall of routinely used readiness tests and suggests that our model could help clinicians to extubate patients earlier. However, in our analysis we could not ascertain whether a patient was not extubated for reasons not apparent from the data, such as availability of staff. For clinical use the model could also be operated at 20% recall with very high precision (80%), to identify patients with a high likelihood that extubation will be unsuccessful. This could caution clinicians against prematurely extubating high-risk patients. For the prediction of extubation failure, various models have been proposed39–44. The largest cohorts to date were used in the works by Zhao et al.44, who only validated the model in a cardiac ICU cohort, which limits the generalizability of the results, and Chen et al.45, who restricted the evaluation to ROC-based metrics and did not discuss the clinical implications of the model’s performance.
ML has previously been used to develop support systems for the management of RF patients in the ICU. These included models for detection of ARDS7–11 and COVID-19 pneumonitis patients35,46, prediction of readiness-to-extubate47–49, need for mechanical ventilation50,51, and detection of patient-ventilator asynchrony52. Existing work focused on single aspects of RF management, often in specific patient cohorts only. Our approach aims to comprehensively monitor the respiratory state throughout the RF treatment process, by integrating relevant respiratory-system related tasks and allowing for joint analysis of risk scores and trajectories. We believe a single and universally applicable system is much more likely to be successfully implemented than multiple fragmented models relating to specific disease entities. A further distinguishing feature of RMS is the five-minute time resolution at which predictions are made, enabling longitudinal analysis of risk trajectories. The dynamic prediction, which is a central feature of our model, is more flexible than traditional severity scores, which are evaluated at fixed time points, such as at 24 h after ICU admission53, mainly to predict ICU mortality54.
For successful external validation of RMS-RF, it was key to exclude medication variables from the model, as their inclusion was detrimental to model transferability. We hypothesize that this difficulty is caused by the observed medication policy differences between centers. Interestingly, ventilator settings, while also policy-dependent, did not compromise transfer performance in the same way. Investigating and quantifying the underlying policy differences that made transfer difficult requires additional research. Model transferability is an important topic in robust ML algorithms for ICU settings, where it has been recently studied in risk prediction in sepsis15,55,56 and mortality57. Our results suggest that medication variables require special attention to enable transfer. In contrast to RMS-RF, we suggest that RMS-EF be re-trained and fine-tuned using data from the center where it is to be applied. The policy differences between centers proved more detrimental to its performance than to that of RMS-RF.
Clinical prediction models for individual patients have been extensively studied. Resource planning in the ICU has received little attention in the ML literature, but came into renewed focus due to the COVID-19 crisis58. The first ML-based models to predict ICU occupancy were proposed during the pandemic. Lorenzen et al.58 predicted daily ventilator use as well as, more generally, hospitalization up to 15 days into the future59. The RMS presented here clearly outperformed a baseline method for predicting future ventilator use at the ICU level. With a mean absolute error of 0.39 ventilators per 10 ICU patient beds used during the next shift (8 to 16 hours), the model is sufficiently precise for practical purposes. Since resource allocation in the ICU depends on local policies and procedures, such a system likely needs to be retrained for every clinical facility for reliable predictions. External validation was not feasible as all public ICU datasets have random date offsets and therefore no information on concurrent patients in the ICU60.
In this study, we developed predictors of key aspects of respiratory state management, including RF, extubation failure, the need for mechanical ventilation, and readiness for tracheal extubation. These predictors collectively describe various aspects of a patient’s respiratory status in the ICU which can be used for exploratory analysis. The joint analysis and visualization of risk scores alongside other vital clinical variables yielded discernible clusters that correspond to specific patient states, indicating the potential for risk stratification within a patient population. We observed a separation of patient states into two main clusters that align with ventilated and non-ventilated states, with substructures within these clusters. The patients with highest 24h mortality risk identified on the hex-map often had depressed levels of consciousness, were more likely to require mandatory modes of mechanical ventilation, had higher peak airway pressure and required higher PEEP; all indicators of more severe underlying lung pathology. We also identified a cluster of patients who are clinically ready to be extubated, and have a low risk of RF but a very high extubation failure risk. These patients required relatively higher airway ventilation pressure and had a low respiratory rate, which are all established risk factors for extubation failure.
The hex-map visualization allows individual patient states to be monitored over time as they are updated, akin to methodologies such as T-DPSOM61,62. This dynamic tracking is based on the automated integration of multiple respiratory state dimensions and uses nonlinear dimensionality reduction to provide the position of an individual patient on the map of respiratory health states. We expect that hex-map visualization has the potential to assist clinicians in identifying changes in patients’ clinical states, although the practical implications of this feature require further validation. This approach differs from previous work, which has mainly sought to understand biological phenotypes of ARDS patients63–66 or longitudinal sub-phenotypes of more specific patient groups, such as COVID-19 patients67,68. Overall, while the hex-map visualization provides an interesting perspective for monitoring respiratory state in the ICU and can serve as a tool for more detailed exploration, the presented analysis is exploratory only. Further research is needed to substantiate the clinical relevance of the identified clusters and to explore how this system might integrate into the decision-making processes within the ICU.
Our study tried to avoid certain limitations of retrospective model development studies. Unlike typical single-center studies, our research utilized data from two distinct centers, one for development and another for validation. This approach reduced the risk of overfitting models to a local patient cohort, although it is important to note that external applicability may still vary and retraining on local data will be needed for parts of the proposed RMS. We have incorporated improvements based on our previous work into our models. Unlike earlier systems that were heavily reliant on sporadic clinical measurements, such as serum lactate concentration12, our current model uses continuous SpO2 monitoring and ventilator data. Use of automated continuous data reduces the influence of clinician-driven decisions on our alarm systems, ensuring a more objective assessment of the patient’s condition. However, the retrospective nature of our data collection is still a limitation. Missing data were partially imputed for respiratory failure annotation, and while this aids in model development, it introduces potential biases. This study does not address how integrating the system into everyday clinical practice might influence treatment or monitoring strategies (a phenomenon known as domain shift69). Specifically, if the model relied heavily on clinician-driven interventions (such as changes in PEEP or the administration of diuretics) as predictors of respiratory failure, any future alterations in clinician behavior (possibly driven by the model implementation) could reduce the model’s predictive accuracy. We constructed clinical baselines as best-effort reference points for comparison, derived from the data available in our cohort. As such, they are not established standards and may miss important clinical elements, such as respiratory effort, that are not routinely recorded but are available to physicians in practice.
Lastly, our assessment of the extubation failure risk score was limited to scenarios of actual extubation events. While we think that the accuracy of this score would be similar in patients nearing readiness for extubation, this cannot be definitively concluded from our retrospective data. Future prospective implementation studies are needed to fully understand the implications of our model in a live clinical setting.
In summary, we have developed a comprehensive monitoring system for the entire respiratory failure management process. We have shown that our system has the potential to facilitate early identification and assessment of deteriorating patients, with the aim of enabling rapid treatment, and to simplify resource planning within the ICU environment. The physiological relationships underlying individual risk predictions can be inspected using SHAP values, which may offer valuable insights to clinicians and ultimately increase trust in the system70. The potential benefit of the system in improving patient outcomes needs to be validated in prospective clinical implementation trials.
Data Availability
Data used in this study were obtained from University Hospital Bern, and we aim to make them available on physionet.org in anonymized form, similar to the HiRID-I dataset associated with a previous study (Hyland et al., Nature Medicine, 2020).
Author contributions
M.H., X.L., M.F., G.R., T.M.M. with input from S.L.H., M.Ho., A.P., H.Y., M.B. designed the experiments; M.F., T.M.M., D.B. selected and provided the clinical data and context; A.P. with input from M.F., G.R., T.M.M., D.B., X.L. k-anonymized the dataset. X.L., M.F., A.P. with contributions from T.M.M., G.R., D.B. preprocessed and cleaned the HiRID-II data; X.L., M.F. harmonized the UMCdb dataset with HiRID-II. M.H., M.F. with input from G.R., T.M.M., X.L., S.L.H. defined and developed the respiratory state annotations and labels; M.H., M.F. developed the continuous estimation algorithm for PaO2; M.H., M.F. developed and extracted ML features; M.H. developed the pipeline for supervised learning including variable selection; M.Ho. with input from X.L., G.R., M.F., M.H. performed the fairness analysis in sub-cohorts. A.P., with input from M.H., M.F., T.M.M., G.R., performed analyses of treatment policy differences. X.L. with input from G.R., M.H., M.F., A.P. conceived and developed the model for resource planning; M.H. with input from G.R., M.F., X.L., T.M.M. implemented the joint analysis of RMS scores; T.M.M., G.R., M.F. conceived and directed the project; M.H., M.F., X.L., G.R., T.M.M., A.P., M.Ho. with input from H.Y., M.B., D.B. wrote the manuscript. X.L. with input from all authors created Fig. 1. All authors read the manuscript and provided critical feedback.
Extended data figures
Supplemental Materials
Supplemental Table 1. Details on the clinical parameters extracted in the HiRID-II dataset (downloadable XLSX file).
Supplemental Table 2. Details on the imputation parameters, such as normal value, and imputation models, for the clinical parameters (downloadable XLSX file).
Supplemental Table 3. List of important variables used for computing complex features, as a basis for variable selection, and for building the final models RMS-RF/RMS-EF/RMS-MVStart/RMS-MVEnd (downloadable XLSX file).
Supplemental Table 4. List of severity levels for computing ‘instability history’ features, for a subset of the important variables (downloadable XLSX file).
Supplemental Table 5. Model training parameters and grid used for selection of hyperparameters for the LightGBM library (downloadable XLSX file).
Acknowledgments
This project was supported by Grant No. 205321_176005 of the Swiss National Science Foundation (to T.M.M. and G.R.), grant #2022-278 of the Strategic Focus Area “Personalized Health and Related Technologies (PHRT)” of the ETH Domain (Swiss Federal Institutes of Technology), and ETH core funding (to G.R.). We acknowledge discussions with, and organizational, administrative or technical help from, Carmen Pfortmüller, Jörg Schefold, Daniel Vonder Mühll, Olga Mineeva, Quinten Johnson, Dinara Veshchezerova, David Meyer, Anastasia Escher, Nora Toussaint, Margarita Kuznetsova, Fedor Sergeev, Marc Zimmermann, Catherine Jutzeler, Karsten Borgwardt, Thomas Gumbsch, Bowen Fan, Jörg Goldhahn, Sonia Strangio, Ivo Schauwecker, Martina Baumann, Sergio Maffioletti, Bernd Rinn, Anna Wiegand, Diana Coman Schmidt, Matthew Levin, Robert Freeman, Thomas Fuchs, Emanuela Keller, Michael Krauthammer, Paul Elbers, and Patrick Thoral. Computational analyses were performed at the LeonhardMed Trusted Research Environment at ETH Zurich (https://sis.id.ethz.ch/services/sensitiveresearchdata/). The work by S.L.H. was done while she was working at ETH Zurich. We thank David Sidebotham for proofreading the manuscript.
Footnotes
+ These authors jointly supervised this work: Tobias M. Merz, Gunnar Rätsch; E-mail: tobiasm{at}adhb.govt.nz, gunnar.raetsch{at}inf.ethz.ch.
We received reviews from Nature Medicine, but the manuscript was rejected. In this revision, we have thoroughly addressed the reviewers’ comments to improve the manuscript.
1 We are currently working on the approvals for the release of the newer dataset and expect to have it ready at the time of publication of the manuscript.