PT - JOURNAL ARTICLE
AU - Zhang, Dongdong
AU - Yin, Changchang
AU - Zeng, Jucheng
AU - Yuan, Xiaohui
AU - Zhang, Ping
TI - Combining structured and unstructured data for predictive models: a deep learning approach
AID - 10.1101/2020.08.10.20172122
DP - 2020 Jan 01
TA - medRxiv
PG - 2020.08.10.20172122
4099 - http://medrxiv.org/content/early/2020/08/14/2020.08.10.20172122.short
4100 - http://medrxiv.org/content/early/2020/08/14/2020.08.10.20172122.full
AB - Background: The broad adoption of Electronic Health Records (EHRs) provides great opportunities to conduct health care research and solve various clinical problems in medicine. With recent advances and success, methods based on machine learning and deep learning have become increasingly popular in medical informatics. However, while many research studies use temporal structured data for predictive modeling, they typically neglect potentially valuable information in unstructured clinical notes. Integrating heterogeneous data types across EHRs through deep learning techniques may help improve the performance of prediction models. Methods: In this research, we propose two general-purpose multi-modal neural network architectures to enhance patient representation learning by combining sequential unstructured notes with structured data. The proposed fusion models use document embeddings to represent long clinical note documents, either convolutional neural networks or long short-term memory networks to model the sequential clinical notes and temporal signals, and one-hot encoding to represent static information. The concatenation of these representations forms the final patient representation, which is used to make predictions. Results: We evaluate the performance of the proposed models on three risk prediction tasks (i.e., in-hospital mortality, 30-day hospital readmission, and long length of stay prediction) using data derived from the publicly available Medical Information Mart for Intensive Care III (MIMIC-III) dataset.
Our results show that by combining unstructured clinical notes with structured data, the proposed models outperform other models that use either unstructured notes or structured data only. Conclusions: The proposed fusion models learn better patient representations by combining structured and unstructured data. Integrating heterogeneous data types across EHRs helps improve the performance of prediction models and reduce errors. Availability: The code for this paper is available at https://github.com/onlyzdd/clinical-fusion. Competing Interest Statement: PZ is a member of the editorial board of BMC Medical Informatics and Decision Making. The authors declare that they have no other competing interests. Funding Statement: This project was funded in part under a grant with Lyntek Medical Technologies, Inc. Author Declarations: I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes. The details of the IRB/oversight body that provided approval or exemption for the research described are given below: This study uses the MIMIC-III dataset. We are using the MIMIC IRB. This study was approved by the Institutional Review Boards of Beth Israel Deaconess Medical Center (Boston, MA, USA) and the Massachusetts Institute of Technology (Cambridge, MA, USA). The requirement for individual patient consent was waived because the study did not impact clinical care and all protected health information was de-identified. De-identification was performed in compliance with Health Insurance Portability and Accountability Act (HIPAA) standards in order to facilitate public access to MIMIC-III.
Deletion of protected health information (PHI) from structured data sources (e.g., database fields that provide patient name or date of birth) was straightforward. All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived. Yes. I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes. I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable. Yes. The MIMIC-III database analyzed in the study is available on the PhysioNet repository: https://mimic.physionet.org/about/mimic