RT Journal Article
SR Electronic
T1 Comparing neural language models for medical concept representation and patient trajectory prediction
JF medRxiv
FD Cold Spring Harbor Laboratory Press
SP 2023.06.01.23290824
DO 10.1101/2023.06.01.23290824
A1 Bornet, Alban
A1 Proios, Dimitrios
A1 Yazdani, Anthony
A1 Santero, Fernando Jaume
A1 Haller, Guy
A1 Choi, Edward
A1 Teodoro, Douglas
YR 2024
UL http://medrxiv.org/content/early/2024/10/22/2023.06.01.23290824.abstract
AB Effective representation of medical concepts is crucial for secondary analyses of electronic health records. Neural language models have shown promise in automatically deriving medical concept representations from clinical data. However, the comparative performance of different language models for creating these empirical representations, and the extent to which they encode medical semantics, has not been extensively studied. This study aims to address this gap by evaluating the effectiveness of three popular language models - word2vec, fastText, and GloVe - in creating medical concept embeddings that capture their semantic meaning. By using a large dataset of digital health records, we created patient trajectories and used them to train the language models. We then assessed the ability of the learned embeddings to encode semantics through an explicit comparison with biomedical terminologies, and implicitly by predicting patient outcomes and trajectories with different levels of available information. Our qualitative analysis shows that empirical clusters of embeddings learned by fastText exhibit the highest similarity with theoretical clustering patterns obtained from biomedical terminologies, with a similarity score between empirical and theoretical clusters of 0.88, 0.80, and 0.92 for diagnosis, procedure, and medication codes, respectively.
Conversely, for outcome prediction, word2vec and GloVe tend to outperform fastText, with the former achieving AUROC as high as 0.78, 0.62, and 0.85 for length-of-stay, readmission, and mortality prediction, respectively. In predicting medical codes in patient trajectories, GloVe achieves the highest performance for diagnosis and medication codes (AUPRC of 0.45 and of 0.81, respectively) at the highest level of the semantic hierarchy, while fastText outperforms the other models for procedure codes (AUPRC of 0.66). Our study demonstrates that subword information is crucial for learning medical concept representations, but global embedding vectors are better suited for more high-level downstream tasks, such as trajectory prediction. Thus, these models can be harnessed to learn representations that convey clinical meaning, and our insights highlight the potential of using machine learning techniques to semantically encode medical data.
Competing Interest Statement: The authors have declared no competing interest.
Funding Statement: This work was funded by the Innosuisse - Schweizerische Agentur für Innovationsförderung - project no.: 55441.1 IP ICT.
Author Declarations: I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes.
The details of the IRB/oversight body that provided approval or exemption for the research described are given below: Data used in this study were sourced exclusively from the MIMIC-IV database, a publicly available, de-identified human dataset hosted on the PhysioNet repository (https://physionet.org/content/mimiciv/2.2/). Access to the MIMIC-IV database required completion of mandatory coursework on data privacy and human subjects research protection. The creation and use of the MIMIC-IV database have been approved by the Institutional Review Boards of Beth Israel Deaconess Medical Center (Boston, MA) and the Massachusetts Institute of Technology (Cambridge, MA).
This study, involving secondary analysis of pre-existing, de-identified data, does not qualify as human subjects research and therefore did not necessitate additional ethical approval.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes.
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes.
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable. Yes.
The MIMIC-IV dataset used in this study is publicly available upon request from the PhysioNet repository (https://physionet.org/content/mimiciv/2.1/). The exact queries and scripts used for data extraction and analysis from this dataset are available at https://github.com/ds4dh/medical_concept_representation.