ABSTRACT
Background The coronavirus disease 2019 (COVID-19) has become a pandemic, placing significant burdens on the healthcare systems. In this study, we tested the hypothesis that a machine learning approach incorporating hidden nonlinear interactions can improve prediction for Intensive care unit (ICU) admission.
Methods Consecutive patients admitted to public hospitals between 1st January and 24th May 2020 in Hong Kong with COVID-19 diagnosed by RT-PCR were included. The primary endpoint was ICU admission.
Results This study included 1043 patients (median age 35 (IQR: 32-37; 54% male). Nineteen patients were admitted to ICU (median hospital length of stay (LOS): 30 days, median ICU LOS: 16 days). ICU patients were more likely to be prescribed angiotensin converting enzyme inhibitors/angiotensin receptor blockers, anti-retroviral drugs lopinavir/ritonavir and remdesivir, ribavirin, steroids, interferon-beta and hydroxychloroquine. Significant predictors of ICU admission were older age, male sex, prior coronary artery disease, respiratory diseases, diabetes, hypertension and chronic kidney disease, and activated partial thromboplastin time, red cell count, white cell count, albumin and serum sodium. A tree-based machine learning model identified most informative characteristics and hidden interactions that can predict ICU admission. These were: low red cells with 1) male, 2) older age, 3) low albumin, 4) low sodium or 5) prolonged APTT. A five-fold cross validation confirms superior performance of this model over baseline models including XGBoost, LightGBM, random forests, and multivariate logistic regression.
Conclusions A machine learning model including baseline risk factors and their hidden interactions can accurately predict ICU admission in COVID-19.
Introduction
Coronavirus disease 2019 (COVID-19), the third coronavirus epidemic in the recent two decades after severe acute respiratory syndrome (SARS) and Middle East respiratory syndrome (MERS), has become a pandemic, placing significant burdens on healthcare systems worldwide 1. The number of people confirmed with COVID-19 worldwide exceeded 7.4 million on June 11, 2020, including at least 416,000 deaths across 188 countries and territories 2. The coronavirus pandemic remains unresolved, even though countries around the world have moved to lift quarantines, stay-at-home orders and other social restrictions. A particular challenge countries face in the COVID-19 pandemic is the surge in demand for intensive care unit (ICU) care 3, 4. Recent studies have exposed an astonishing case fatality rate of 61.5% for critical cases, increasing sharply with older age and for patients with underlying comorbidities 5. The unfulfilled ICU demand would immediately lead to elevated fatality rate. The critical question on the clinical characteristics and relevant biomarkers for efficient ICU management of COVID-19 patients remains unanswered 6. Identification of prognostic biomarkers to distinguish patients that require immediate medical attention has become an urgent yet challenging necessity. Therefore, the aim of this study is to identify significant risk factors or characteristics as well as hidden interaction effects associated with ICU admission by using an interpretable machine learning approach.
Methods
Study design and population
This study was approved by the Institutional Review Board of the University of Hong Kong/Hospital Authority Hong Kong West Cluster. This was a retrospective, territory-wide cohort study of patients infected with COVID-19, as confirmed by RT-PCR, between 1st January and 24th May 2020. The patients were identified from the Clinical Data Analysis and Reporting System (CDARS), a territory-wide database that centralizes patient information from individual local hospitals to establish comprehensive medical data, including clinical characteristics, disease diagnosis, laboratory results, and drug treatment details. The system has been previously used by both our team and other teams in Hong Kong 7-10. Clinical data include primary diagnoses after admission (1st January 2020 to 24th May 2020) and comorbidities (1st January 1999 to 31st December 2019) in the past decade. The list of conditions identified is detailed in the Supplementary Appendix. Diagnosis of COVID-19 was made by RT-PCR. Other respiratory viruses, including influenza A virus (H1N1, H3N2, H7N9), influenza B virus, respiratory syncytial virus, parainfluenza virus, adenovirus, SARS coronavirus (SARS-CoV), and MERS coronavirus (MERS-CoV) were also examined with RT-PCR.
Outcomes and statistical analysis
The primary outcome was ICU admission. Continuous variables were presented as median (95% confidence interval [CI] or interquartile range [IQR]) and categorical variables were presented count (%). The Mann-Whitney U test was used to compare continuous variables. The χ2 test with Yates’ correction was used for 2×2 contingency data, and Pearson’s χ2 test was used for contingency data for variables with more than two categories. To identify the significant risk factors associated with ICU admission of COVID-19 patients, univariate logistic regression was used to estimate odds ratios (ORs) and 95% CIs, adjusting for age, sex, comorbidities. A two-sided α of less than 0.05 was considered statistically significant. Statistical analyses (including univariate logistic regression) were performed using RStudio software (Version: 1.1.456) and Python (Version: 3.6).
Development of a tree-based interpretable machine learning model
After the identification of significant predictors for ICU admission, we aim to further construct a practically useful ICU use decision-making model by considering both main and interaction effects among those important univariable variables. Here the interaction effects, mainly pairwise interactions, capture the hidden nonlinear dependence between risk characteristics and can provide additional information for ICU outcome identification, besides individual predictors. Significant predictors identified on univariate logistic regression were enter into a state-of-the-art interpretable boosting machine model: Explainable Boosting Machine (EBM) 11.
The EBM model is an explainable supervised predictor developed by using modern machine learning techniques like bagging, gradient boosting, and automatic main and interaction effects detection with high accuracy of state-of-the-art learning models (e.g., random forests 12 and XGBoost 13) with its light memory usage and fast prediction time. EBM is constructed with multiple hierarchically organized simple classifiers consisting of sequences of binary decisions. Unlike these black-box models, EBM produce lossless explanations for outcome predictions due to its great interpretability potential of tree-based decision system, which is desired for clinically operable decision-making. In contrast, internally black-box-like learning models are typically difficult to interpret. Intrinsic interpretability as equipped in EBM aims to intrinsically interpret the model predictions. The contribution of main and interaction effects to identify ICU use can be determined by their accumulated use in each decision tree splitting process, which can be easily sorted and visualized in descending order to identify the more important variables.
Results
Baseline characteristics
The flowchart of patient enrolment in this study is provided in Figure 1. A total of 1043 patients admitted to the hospital between 1st January 2020 and 24th May 2020 were included in this study. The case distributions with respect to the different districts of Hong Kong are shown in Figure 2. There are 373 cases from Hong Kong Island District (36%), 212 cases from Kowloon District (20%), 398 cases from New Territories District (38%), and 60 cases without district indicators (6%). Chinese is the most common nationality (914, 87.6%), followed by Filipino (38, 3.6%), Pakistani (22, 2.1%), British (14, 1.3%), French (9, 0.9%), Nepalese (9, 0.9%), American (6, 0.58%), Indian (4, %), Canadian (2, 0.2%), Australian (2, 0.2%) and Korean, German, New Zealander, Greek, Thai, Indonesian, Japanese, and Netherlander (1 each, 0.1%).
The baseline demographics, comorbidities, medications, and laboratory test findings are shown in Table 1. Of the included patients, 563 were males (54%, median age: 35 [IQR: 32-37], range: 0-93 years old) and 480 were females (46%, median age: 35 [IQR: 32-37], range 1-96 years old) (Figure 3). Most patients (n=776, 74%) were between 18 and 60 years of age. In total, 19 patients (14 males, 73.68%, median ICU length of stay [LOS]: 16 days) were admitted to the ICU and the numbers within age intervals are also shown in Figure 3. Patients admitted to the ICU has median inpatient length of stay (LOS) of 30 days, in comparison to a median of 20 days for patients without ICU admission. The timeline of COVID-19 cases after hospitalization is shown in Figure 4. Amongst the ICU patients, there are 12 Chinese, 1 Filipino, and 6 of unknown race. The 19 ICU patients has median urine output of 1610 ml/24 hours (IQR: 1255-2000, max: 3310). The distributions of other physiological parameters are shown in Table 1. In addition, 9 ICU patients (47.37%) were on ventilators for respiratory support, and three (15.79%) received renal replacement therapy. Four patients died. Two deaths occurred during inpatient hospitalization with single admission, one during ICU hospitalization, and one upon admission.
A total of 535 COVID-19 patients (87.3%) had records of preexisting comorbidities (n=1237) between January 1st, 1999 to December 31st, 2019. Of these, 230 patients (42.99%) had respiratory diseases, 174 patients (32.52%) had gastrointestinal diseases, 108 patients (20.19%) had hypertension, 54 patients (10.09%) had diabetes, 21 patients (3.93%) had chronic kidney diseases, and 10 patients (1.87%) had cardiovascular diseases.
In terms of medications prescribed during the inpatient stay for non-ICU patients, lopinavir/ritonavir (Kaletra) is the most commonly used drug (60.8%), followed by ribavirin (53.2%), interferon-beta (32.5%), angiotensin converting enzyme inhibitors (ACEI) or angiotensin receptor blockers (ARBs) (18.9%), steroids (14.6%), hydroxychloroquine (13.2%), and remdesivir (2.5%). Among the ICU patients, lopinavir/ritonavir was the most frequently prescribed drug (88.9%), followed by ribavirin (77.8%), ACEI/ARB (77.8%), hydroxychloroquine (38.9%), interferon-beta (38.9%), steroids (27.8%) and remdesivir (5.6%). We find that patients admitted to ICU are more likely to be given Kaletra and ribavirin, which may reflect more aggressive treatment towards critically ill patients.
Predictors of ICU admission
Univariate logistic regression was conducted to identify significant predictors of ICU admission (Table 3). The following are significant predictors for ICU admission of COVID-19 patients:
Demographic features: Age (OR: 1.06 [1.03 -1.09], p<0.0001) and male (OR: 2.42 [0.87-6.78], p<0.0001).
Comorbidities: cardiovascular diseases (OR: 3.12 [0.81-10.12], p<0.0001), respiratory diseases (OR: 8.15 [1.85-14.44], p<0.0001), diabetes (OR: 6.17 [2.07-9.36], p<0.0001), hypertension (OR: 3.15 [1.25-5.32], p<0.0001) and chronic kidney diseases (OR: 4.87 [2.66-9.71], p=0.0009). The analyses demonstrate the importance of baseline comorbidities in affecting the prognosis of patients with COVID-19.
Drugs: ACEI or ARB (OR: 1.10 [0.24-2.14], p<0.0001), lopinavir/ritonavir (OR: 1.73 [1.02-3.05], p<0.0001), ribavirin (OR: 1.43 [0.37-2.05], p<0.0001), remdesivir (OR: 1.04 [0.39-2.80], p<0.0001), interferon beta (OR: 1.04 [0.39-2.80], p<0.0001) and hydroxychloroquine (OR: 1.24 [0.90-1.73], p=0.00036).
Biochemical markers: APTT (OR: 1.19 [1.08-1.30], p=0.0003), neutrophil count (OR: 1.54 [1.53-1.55], p<0.0001), red blood cells (OR: 1.47 [1.46-1.48], p<0.0001), white blood cells (OR: 1.47 [1.21-1.79], p<0.0001), albumin (OR: 0.80 [0.74-0.87], p<0.0001), serum sodium (OR: 1.26 [1.08-1.93], p<0.0001), lactate dehydrogenase (OR: 1.01 [0.85-1.12], p<0.0001), total cholesterol (OR: 1.04 [1.02-1.06], p<0.0001), spot urine glucose (OR: 1.32 [1.31-1.32], p<0.0001), hemoglobin A1c (OR: 1.03 [1.03-1.04]<0.0001), random glucose (OR: 1.05 [1.04-1.06], p<0.0001), serum triglycerides (OR: 1.46 [1.43-1.48], p<0.0001).
Main and Hidden Interaction Effects
The EBM model was employed to distinguish patients in need for ICU admission by accurately uncovering the main and hidden interaction effects. This utilized different data modalities such as demographics, comorbidities and multiple laboratory results. Significant variables identified by univariate logistic regression were entered into the EBM model, which will deal with the trade-off between having a minimal number of predictors and the capacity of good model prediction, therefore avoiding overfitting. The cohort is randomly classified into training and validation datasets with an 80:20 split. The obtained importance rankings of significant predictors for ICU admission are shown in Figure 5. Red blood cells, APTT, sex, age and white blood cells are the five most informative parameters in predicting ICU admission, followed by hypertension, serum sodium, serum albumin, serum triglycerides, and respiratory disease. Significant predictors for ICU admission identification are provided in Figure 6. We can observe that the following combination of patient characteristics predicts a higher likelihood for ICU admission: 1) male patients with lower level of red blood cells, 2) older patients with lower level of red blood cells, 3) patients with both lower levels of red blood cells and albumin or sodium, 4) patients with longer APTT and lower level of red blood cells. Important hidden pair-wise interaction effects are shown in Figure 7, where green or yellow zones with larger values indicate higher probability of ICU admission that can be predicted by examining the pair-wise variable interactions. We can observe from the plots of interaction effects that 1) male with lower red blood cells, (2) older age with lower red blood cells, 3) lower albumin level and lower red blood cells, 4) lower sodium level and lower red blood cells, 5) older age and prolonged APTT, 6) lower red bold cells level and higher white blood cells level, 7) lower red blood cells level and prolonged APTT, 8) older age and higher level white predicts higher probability of ICU admission.
EBM can provide predictions on individual cases. For example, a randomly selected patient (male, 69 years old) with ICU admission has the characteristics as shown in Figure 8. He has prior comorbidities of cardiovascular, chronic kidney, hypertension, diabetes and lung and respiratory diseases. EBM predicts that he needs ICU attention with 72% probability, based on his characteristics of prior cardiovascular disease, white blood cells at 12.43 (x10^9/L), lactate dehydrogenase level at 390 (U/L), APTT at 33.70 (sec), prior comorbidities of hypertension and diabetes, and others. But his characteristic of triglycerides at 6.29 provide non-supportive information to the prediction outcome. By contrast, a randomly selected patient (female, 54 years old) who did not require ICU admission is exemplified in Figure 9. EBM accurately predicted that she doesn’t need ICU admission. Local explanations provided by EBM can provide precise ICU admission predictions based on patient’s main characteristics in a user-friendly visualization way for practical clinical use.
The five-fold cross validation performance of EBM was compared with baseline models including XGBoost, LightGBM, random forests, and multivariate logistic regression, as shown in Table 4. EBM outperforms all baseline models according to evaluation metrics of precision, recall, F1 score, and area under the curve (AUC) of the receiving operating characteristics (ROC) curve.
Discussion
The main findings of this territory-wide retrospective cohort study are twofold: (1) Significant predictors of ICU admission were older age, male sex, prior coronary artery disease, respiratory diseases, diabetes, hypertension and chronic kidney disease, and activated partial thromboplastin time, red cell count, white cell count, albumin and sodium; (2) A tree-based interpretable machine learning model identified most informative characteristics and hidden interactions that can predict ICU admission. These interacting factors were low red cells with 1) male, 2) older age, 3) low albumin, 4) low sodium or 5) prolonged APTT around 33 seconds.
Prior studies have reported that patients with pre-existing medical comorbidities have a poorer prognosis in not only COVID-19 but also other infectious diseases such as SARS-CoV and MERS 14, 15. In COVID-19, hypertension, diabetes, coronary heart disease, chronic kidney disease, cerebrovascular disease, hepatitis, and chronic obstructive pulmonary disease (COPD) have been identified as predictors of disease severity and mortality in COVID-19 16, 17. In this study, we confirm that these comorbidities are predictive of ICU utilization and provide a simple clinical approach to quantify the initial risk of ICU admission precisely and quickly. Furthermore, various laboratory markers have been shown to predict adverse outcomes. Our study found that prolonged APTT and raised D-dimer, reflecting coagulopathy, was predictive of ICU admission. Other significant predictors were neutrophil count (inflammation), red cell count (oxygen carrying capacity), albumin (nutritional status), sodium (electrolyte homeostasis) and lactate dehydrogenase (tissue damage). Troponin was borderline significant, reflecting that myocardial damage is an important determinant of ICU use.
We further illustrate the novel findings that interacting factors between low red cell count and basic demographics such as gender and age, or laboratory findings such as albumin, sodium and APTT are also important determinants. Older patients with laboratory examinations of lower red cells, lower albumin, lower sodium and prolonged APTT are subject to high ICU admission risk. Red cells, albumin, sodium and APTT can be easily collected in any hospital. In crowded hospitals with limited medical resources, this simple model can help to quickly prioritize patients for ICU attention.
The optimum medication regimen for COVID-19 is yet to be determined. However, small scale observational studies or trials have suggested the use of antivirals 18, antimalarials 19, interferons 20, anticoagulants 21 and antibodies 22, though not all have been shown to be beneficial in larger clinical trials 23. A better understanding of the pathophysiological mechanisms underlying COVID-19 will enable better treatment strategies to be devised 24. In our study, the anti-viral drug lopinavir/ritonavir (Kaletra) was the commonest prescribed drug, followed by ribavirin, interferon-beta, ACEIs/ARBs, steroids, hydroxychloroquine and the antiviral remdesivir. We found that these medications were more frequently prescribed in patients requiring ICU compared to those without. This may reflect the increased severity of cases in which clinicians were more likely to prescribe a cocktail of drugs.
Conclusion
In summary, this study has identified important univariable and interaction effects informing intensive care admission in patients hospitalized with COVID-19. Significant univariable predictors of ICU admission include older age, male sex, prior coronary artery disease, respiratory diseases, diabetes, hypertension and chronic kidney disease, and activated partial thromboplastin time, red cell count, white cell count, albumin and serum sodium. A tree-based interpretable machine learning model identified most informative characteristics and hidden interactions (i.e., low red cells with male, older age, low albumin, low sodium or prolonged APTT) for COVID-19 prognostic ICU admission prediction. The tree-based machine learning model outperforms several baselines, enabling early detection of ICU admission, efficient healthcare resource utilization, and potentially mortality reduction of hospitalized patients with COVID-19.
Data Availability
Available upon request
Footnotes
↵# joint first authors