RT Journal Article SR Electronic T1 LCD Benchmark: Long Clinical Document Benchmark on Mortality Prediction JF medRxiv FD Cold Spring Harbor Laboratory Press SP 2024.03.26.24304920 DO 10.1101/2024.03.26.24304920 A1 WonJin Yoon A1 Shan Chen A1 Yanjun Gao A1 Dmitriy Dligach A1 Danielle S. Bitterman A1 Majid Afshar A1 Timothy Miller YR 2024 UL http://medrxiv.org/content/early/2024/03/27/2024.03.26.24304920.abstract AB Natural Language Processing (NLP) is the study of automated processing of text data. Applying NLP in the clinical domain is important because clinical documents contain rich unstructured information that is often unavailable in structured data. Driven by recent advances in language models (LMs), there is growing interest in their application within the clinical domain. When applying NLP methods to a particular domain, benchmark datasets play a crucial role: they not only guide the selection of the best-performing models but also enable assessment of the reliability of the generated outputs. Despite the recent availability of LMs capable of handling longer contexts, benchmark datasets targeting long clinical document classification tasks are absent. To address this issue, we propose the LCD benchmark, a benchmark for the task of predicting 30-day out-of-hospital mortality using discharge notes from MIMIC-IV and statewide death data. Our notes have a median word count of 1687 and an interquartile range of 1308 to 2169. We evaluated this benchmark dataset using baseline models ranging from bag-of-words and CNN to a Hierarchical Transformer and an open-source instruction-tuned large language model. Additionally, we provide a comprehensive analysis of the model outputs, including manual review and visualization of model weights, to offer insights into their predictive capabilities and limitations.
We expect the LCD benchmark to become a resource for the development of advanced supervised models, prompting methods, or the foundation models themselves, tailored for clinical text. The benchmark dataset is available at https://github.com/Machine-Learning-for-Medical-Language/long-clinical-doc. Competing Interest Statement: The authors have declared no competing interest. Funding Statement: Research reported in this publication was supported by the National Library of Medicine of the National Institutes of Health under Award Number R01LM012973, and by the National Institute of Mental Health of the National Institutes of Health under Award Number R01MH126977. Author Declarations: I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes. The details of the IRB/oversight body that provided approval or exemption for the research described are given below: The IRB of Boston Children's Hospital gave ethical approval for this work (IRB number: IRB-P00028617). Under the PhysioNet Credentialed Health Data Use Agreement 1.5.0 (the Data Use Agreement for MIMIC-IV (v2.2) and the Data Use Agreement for MIMIC-IV-Note: Deidentified free-text clinical notes (v2.2)), all authors are granted access to the database. I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes. I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov.
I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes. I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable. Yes. The data underlying this article are available in a GitHub repository at https://github.com/Machine-Learning-for-Medical-Language/long-clinical-doc. The datasets were derived from the following sources: https://physionet.org/content/mimiciv/2.2/ and https://physionet.org/content/mimic-iv-note/2.2/