Abstract
Background: It is critical to identify COVID-19 patients at higher risk of death at an early stage so that they can receive appropriate hospitalization or intensive care. However, to date, no machine learning model has been shown to perform successfully in an independent cohort. We aimed to develop a machine learning model that accurately predicts the death risk of COVID-19 patients at an early stage and that generalizes to independent cohorts.
Methods: To identify key features, we used a cohort of 4711 patients with clinical features describing physiological condition and laboratory test data related to inflammation, hepatorenal function, cardiovascular function, and other systems. We first developed a novel data preprocessing approach to clean the clinical features and then developed an ensemble machine learning method to identify the key features.
Results: We identified 14 key clinical features whose combination achieved good predictive performance, with an AUC of 0.907. Most importantly, we successfully validated these key features in a large independent cohort of 15,790 patients.
Conclusions: Our study shows that the 14 key features are robust and useful for predicting, at an early stage, the risk of death in patients with confirmed SARS-CoV-2 infection, and that they are potentially useful in clinical settings to support clinical decision-making.
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
Alberta Innovates for Health
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
1. The first dataset is available at https://figshare.com/s/79827c396af7df42b3d7 and is described in: Altschul DJ, Unda SR, Benton J, de la Garza Ramos R, Cezayirli P, Mehler M, et al. A novel severity score to predict inpatient mortality in COVID-19 patients. Scientific Reports. 2020;10(1):16726.
2. The second dataset is from UK Biobank, for which we list two references in our manuscript: Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 2015;12(3):e1001779. Barbour V. UK Biobank: a project in search of a protocol? The Lancet. 2003;361(9370):1734-8.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
Footnotes
# Co-first author
List of Abbreviations
- OsSats: oxygen saturation
- Temp: temperature
- MAP: mean arterial pressure
- Ddimer: D-dimer
- Plts: platelets
- INR: international normalized ratio
- BUN: blood urea nitrogen
- AST: aspartate aminotransferase
- ALT: alanine aminotransferase
- WBC: white blood cells
- Lympho: lymphocytes
- IL-6: interleukin-6
- CrctProtein: C-reactive protein
- KNN: k-nearest neighbor method
- GBDT: Gradient Boosted Decision Tree
- XGBoost: Extreme Gradient Boosting
- RF: Random Forest
- LR: Logistic Regression
- SVM: Support Vector Machine
- EM: Ensemble Model
- ROC: Receiver Operating Characteristic
- AUC: Area Under ROC Curve
- TP: True Positive
- FP: False Positive
- TN: True Negative
- FN: False Negative
- CSS: COVID-19 severity scores
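As an illustration of how the evaluation terms in this list relate to one another, the following minimal Python sketch counts TP, FP, TN and FN at a fixed threshold and computes the AUC as the fraction of (death, survivor) pairs in which the death case receives the higher score. The function names, the 0.5 threshold and the toy labels and scores are assumptions made for illustration only; they are not the study's data, model outputs or code.

```python
from typing import Sequence, Tuple


def confusion_counts(
    y_true: Sequence[int], scores: Sequence[float], threshold: float
) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) when predicting 1 for scores >= threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    positives = [s for label, s in zip(y_true, scores) if label == 1]
    negatives = [s for label, s in zip(y_true, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy example: 1 = died, 0 = survived; scores are made-up risk estimates.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 1]
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]

print(confusion_counts(toy_labels, toy_scores, threshold=0.5))  # (3, 1, 3, 1)
print(roc_auc(toy_labels, toy_scores))  # 0.9375
```

The pairwise formulation is quadratic in cohort size, so it suits small examples; for cohorts of the size described here, a rank-based or library implementation would normally be preferred.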
Paper in collection COVID-19 SARS-CoV-2 preprints from medRxiv and bioRxiv