Abstract
Background Sepsis, a life-threatening illness caused by the body’s response to an infection, is a leading cause of death worldwide and has become a global epidemiological burden. Early prediction of sepsis increases the likelihood of survival for septic patients.
Methods The 2019 DII National Data Science Challenge enabled participating teams to develop models for early prediction of sepsis onset using de-identified electronic health records of over 100,000 unique patients. The task was to predict sepsis onset 4 hours before its diagnosis, using basic administrative and demographic information together with time-series vital signs, lab results, and nutrition records as features. We propose an LSTM-based model with event embedding and time encoding for this time-series prediction task. We used an attention mechanism and global max pooling to make the proposed deep learning model interpretable.
Results We evaluated the proposed model on 2 use cases of sepsis onset prediction, where it achieved AUC scores of 0.940 and 0.845, respectively. Our team, BuckeyeAI, achieved an average AUC of 0.892 and was officially ranked #2 out of 30 participants.
Conclusions Our model outperformed collapsed models (i.e., logistic regression, random forest, and LightGBM). The proposed LSTM-based model handles irregular time intervals by incorporating time encoding and is interpretable thanks to the attention mechanism and global max pooling techniques.
Availability The code for this paper is available at: https://github.com/yinchangchang/DII-Challenge.
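The repository above holds the authors' implementation. As an independent illustration of two ideas named in the abstract, the following minimal sketch shows one way a time encoding can represent irregularly spaced events and how global max pooling can be traced back to individual time steps. It is not the authors' code: the sinusoidal form of `time_encoding`, the `max_period` parameter, and the `step_importance` heuristic are all assumptions made for illustration, and plain Python lists stand in for the LSTM hidden states.

```python
import math


def time_encoding(elapsed_hours, dim=8, max_period=10000.0):
    """Map a real-valued timestamp to a fixed-length vector.

    Assumed form (not taken from the paper): Transformer-style sinusoids
    evaluated at the actual elapsed time rather than an integer position,
    so unevenly spaced events get encodings that reflect their spacing.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    encoding = []
    for i in range(0, dim, 2):
        frequency = 1.0 / (max_period ** (i / dim))
        encoding.append(math.sin(elapsed_hours * frequency))
        encoding.append(math.cos(elapsed_hours * frequency))
    return encoding


def global_max_pool(hidden_states):
    """Pool a sequence of hidden vectors by per-dimension maximum.

    Returns the pooled vector and, for each dimension, the index of the
    time step that supplied the maximum value.
    """
    dim = len(hidden_states[0])
    pooled, sources = [], []
    for d in range(dim):
        column = [state[d] for state in hidden_states]
        best_step = max(range(len(column)), key=column.__getitem__)
        pooled.append(column[best_step])
        sources.append(best_step)
    return pooled, sources


def step_importance(sources, num_steps):
    """Hypothetical heuristic: share of pooled dimensions won by each step."""
    counts = [0] * num_steps
    for step in sources:
        counts[step] += 1
    return [count / len(sources) for count in counts]


# Three events recorded at irregular times (hours since admission).
event_times = [0.0, 0.5, 6.0]
encodings = [time_encoding(t, dim=4) for t in event_times]

# Stand-in hidden states, one vector per event (made-up numbers).
hidden = [
    [0.1, 0.9, 0.2, 0.0],
    [0.5, 0.2, 0.1, 0.3],
    [0.3, 0.4, 0.8, 0.7],
]
pooled, sources = global_max_pool(hidden)
importance = step_importance(sources, len(hidden))
```

Because each pooled value comes from exactly one time step, recording which step won each dimension gives a simple per-event attribution. In this toy input the last event supplies two of the four pooled dimensions, so it receives the largest share.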
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
Not applicable.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
Cerner Health Facts is subscribed to by the School of Biomedical Informatics, UTHealth, and a subset of the data was approved for use in the 2019 DII National Data Science Challenge by Cerner and UTHealth. Cerner Health Facts is a database that comprises de-identified EHR data from over 600 participating Cerner client hospitals and clinics in the United States and represents over 106 million unique patients. With this longitudinal, relational database, which reflects data from 2000 to 2016, researchers can analyze detailed sets of de-identified clinical data at the patient level. Types of data available include demographics, encounters, diagnoses, procedures, lab results, medication orders, medication administration, vital signs, microbiology, surgical cases, other clinical observations, and health systems attributes.
All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
Data Availability
Cerner Health Facts is subscribed to by the School of Biomedical Informatics, UTHealth, and a subset of the data was approved for use in the 2019 DII National Data Science Challenge by Cerner and UTHealth. Cerner Health Facts is a database that comprises de-identified EHR data from over 600 participating Cerner client hospitals and clinics in the United States and represents over 106 million unique patients. With this longitudinal, relational database, which reflects data from 2000 to 2016, researchers can analyze detailed sets of de-identified clinical data at the patient level. Types of data available include demographics, encounters, diagnoses, procedures, lab results, medication orders, medication administration, vital signs, microbiology, surgical cases, other clinical observations, and health systems attributes.