ABSTRACT
Objective Retrospective study of COVID-19-positive patients treated at NYU Langone Health (NYULH) to identify clinical markers predictive of disease severity, in order to assist clinical triage decisions and provide additional biological insights into disease progression.
Materials and Methods We analyzed the clinical activity of 3740 de-identified patients at NYULH between January and August 2020. Models were trained on clinical data from different parts of the patients' hospital stays to predict three clinical outcomes: deceased, ventilated, or admitted to ICU.
Results An XGBoost model trained on clinical data from the final 24 hours excelled at predicting mortality (AUC=0.92, specificity=86%, sensitivity=85%). Respiration rate was the most important feature, followed by SpO2 and age 75+. This model's ability to predict the deceased outcome extended to 5 days prior, with AUC=0.81, specificity=70%, and sensitivity=75%. When only clinical data from the first 24 hours were used, AUCs of 0.79, 0.80, and 0.77 were obtained for the deceased, ventilated, and ICU-admitted outcomes, respectively. Although respiration rate and SpO2 levels offered the highest feature importance, other canonical markers, including diabetic history, age, and temperature, offered minimal gain. When lab values were incorporated, prediction of mortality benefited the most from blood urea nitrogen (BUN) and lactate dehydrogenase (LDH). Features predictive of morbidity included LDH, calcium, glucose, and C-reactive protein (CRP).
Conclusion Together, this work summarizes efforts to systematically examine the importance of a wide range of features across different endpoint outcomes and at different hospitalization time points.
BACKGROUND AND SIGNIFICANCE
The first cluster of SARS-CoV-2 infections was reported in Wuhan, Hubei Province, on December 31, 2019. Causing symptoms remarkably similar to those of pneumonia, the disease quickly spread around the world and was declared a pandemic by the World Health Organization on March 11, 2020. Although the first wave has since passed for the hardest-hit regions such as New York City (NYC) and most of Asia, a resurgence of cases has already been reported in Europe, and record numbers of new cases have been tallied in the Midwest and rural United States (US). As of November 12th, the US alone logged its highest tally to date with a 317% growth over the preceding 30 days1. The coronavirus disease (COVID-19) pandemic is far from over, and there remains a compelling need to prioritize care and resources for patients at elevated risk of morbidity and mortality.
Previous work building machine learning models used patient data from Tongji Hospital2,3 (Wuhan, China), Zhongnan Hospital4 (Wuhan, China), Mount Sinai Hospital5 (NYC, US), and NYU Family Health Center6 (NYC, US). Surprisingly, the clinical features selected varied widely across studies. For example, while McRae et al.’s 2-tiered model6, trained on 701 NYC patients to predict mortality, was based on actual age, C-reactive protein (CRP), procalcitonin, and D-dimer, Yan et al.’s model2, trained on 485 patients from Wuhan, selected lactate dehydrogenase (LDH), lymphocyte count, and CRP as the most predictive of mortality. Selected features differed greatly even when models were trained to predict similar outcomes on data from patients of the same city. Yao et al.’s model3 was trained on 137 patients from Wuhan and relied on 28 biomarkers in its final form to predict morbidity. Given the differences among prior models, some of which were driven by domain-specific knowledge, we decided to systematically examine the importance of a wide range of features across different endpoint outcomes and at different hospitalization time points.
This study analyzes retrospective PCR-confirmed COVID-19 inpatient data collected at NYU Langone Hospital from 1/1/2020 to 8/7/2020 to predict three sets of clinical outcomes: alive vs deceased, ventilated vs not ventilated, and ICU admitted vs not ICU admitted. The clinical information of 3740 patient encounters included demographic data (age, sex, insurance, past diagnosis of diabetes, presence of cardiovascular comorbidities), vital signs (SpO2, pulse, respiration rate, temperature, blood pressure), and the 50 most frequently ordered lab tests in our dataset. Models were developed using two methods: logistic regression with feature selection using the Least Absolute Shrinkage and Selection Operator7 (LASSO) and gradient tree boosting with XGBoost8. An explainable algorithm, such as logistic regression, provides easy-to-interpret insights into the features of importance. Conversely, the larger model capacity of XGBoost better handles data complexities, which lets us explore the extent to which predictive performance can be optimized. Together, these methods ensure a holistic survey that explores the clinical underpinnings of disease etiology and the prospects of building models that are sufficiently competent to be effective decision support tools.
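As an illustration of the first method, the sketch below fits an L1-penalized (LASSO-style) logistic regression by proximal gradient descent and reports which features keep nonzero weights, along with a held-out AUC. It is not the authors' implementation: the data are synthetic, the feature names, the penalty strength `lam`, the step size `lr`, and all helper functions are assumptions made for this example, and the XGBoost models are not shown.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def soft_threshold(w, t):
    # Proximal operator of the L1 penalty: shrink w toward zero by t.
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

def predict(w, b, x):
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

def fit_l1_logistic(X, y, lam=0.1, lr=0.5, n_iter=300):
    """Proximal gradient descent for L1-penalized logistic regression."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict(w, b, xi) - yi
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        b -= lr * grad_b  # the intercept is not penalized
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
    return w, b

def auc(scores, labels):
    # Probability that a random positive outranks a random negative (ties count 0.5).
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Synthetic, standardized data (hypothetical): two informative features, two noise features.
FEATURES = ["informative_1", "informative_2", "noise_1", "noise_2"]
rng = random.Random(0)
X, y = [], []
for _ in range(400):
    row = [rng.gauss(0.0, 1.0) for _ in FEATURES]
    p = sigmoid(2.0 * row[0] - 2.0 * row[1])
    X.append(row)
    y.append(1 if rng.random() < p else 0)

# Hold out the last 100 rows so the AUC is not measured on training data.
X_train, y_train, X_test, y_test = X[:300], y[:300], X[300:], y[300:]
weights, bias = fit_l1_logistic(X_train, y_train)
test_auc = auc([predict(weights, bias, x) for x in X_test], y_test)

selected = [name for name, w in zip(FEATURES, weights) if w != 0.0]
print("selected features:", selected)
print("held-out AUC:", round(test_auc, 3))
```

The L1 penalty drives the weights of weakly predictive features to exactly zero, which is what makes the surviving coefficients readable as a feature selection. In practice a library implementation and cross-validated choice of the penalty strength would replace this hand-rolled loop.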
Competing Interest Statement
MPM has served as a paid consultant for SensoDx, LLC and has a provisional patent pending. JTM has a provisional patent pending. In addition, he has an ownership position and an equity interest in both SensoDx II, LLC and OraLiva, Inc. and serves on their advisory boards. All other authors declare no competing interests.
Funding Statement
We wish to thank the Medical Center Information Technology and Office of Science & Research at NYU Langone Health for maintaining and de-identifying the clinical database. JMW is supported by the New York University Medical Scientist Training Program (T32GM136573). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. A portion of this work was funded by Renaissance Health Service Corporation and Delta Dental of Michigan.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
Per resubmission request, we received an ethics exemption/waiver from Helen Panageas (Director of Institutional Review Board Operations at NYU Grossman School of Medicine), who confirmed that the use of the COVID-19 De-identified Clinical Database does not require IRB approval. Specifically, "the subsequent research studies would not fall under human subject research and no IRB approval would be required since IRB reviews only HS research." In addition, she advised us to complete and submit a self-certification form. The form is labeled as "INTERNAL REFERENCE ONLY - IRB self-certification form" in the supplements. The COVID-19 De-identified Clinical Database was stripped of all unique identifiers before we received the data. In addition, all dates were shifted by an arbitrary number of days for each patient. These safeguards ensure that patient data cannot be re-identified; the data are therefore not subject to HIPAA restrictions on research use and do not require IRB approval.
All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
Data Availability
Institutional policies prevent public distribution of patient clinical data.