Abstract
Natural Language Processing (NLP) is the study of automated processing of text data. Applying NLP in the clinical domain is important because clinical documents contain rich unstructured information that is often inaccessible in structured data. Empowered by recent advances in language models (LMs), there is growing interest in their application within the clinical domain. When applying NLP methods to a given domain, benchmark datasets play a crucial role: they not only guide the selection of the best-performing models but also enable assessment of the reliability of the generated outputs. Despite the recent availability of LMs capable of handling longer contexts, benchmark datasets targeting long clinical document classification tasks are absent. To address this issue, we propose the LCD benchmark, a benchmark for the task of predicting 30-day out-of-hospital mortality using discharge notes from MIMIC-IV and statewide death data. Our notes have a median word count of 1687 and an interquartile range of 1308 to 2169. We evaluated this benchmark dataset using baseline models ranging from bag-of-words and CNN to a Hierarchical Transformer and an open-source instruction-tuned large language model. Additionally, we provide a comprehensive analysis of the model outputs, including manual review and visualization of model weights, to offer insights into their predictive capabilities and limitations. We expect the LCD benchmark to become a resource for the development of advanced supervised models, prompting methods, or the foundation models themselves, tailored for clinical text.
The benchmark dataset is available at https://github.com/Machine-Learning-for-Medical-Language/long-clinical-doc
INTRODUCTION
Clinical notes describe and communicate events or interactions about a patient, written by healthcare providers1. These notes are rich sources of information important for clinical decision-making, often containing details that may not be readily available in a structured format. Clinical Natural Language Processing (NLP) can extract information from unstructured data2. This capability can be used to build an end-to-end model for specific tasks or to supplement structured data for further use in ML/AI models3–7. With the recent emergence of transformer-based Language Models (LMs), research on clinical NLP has achieved remarkable improvements. However, due to the architectural characteristics of transformer models, most available LMs have constraints on the maximum length of the input sequence that a model can process at once. In the clinical NLP domain, this can be a major technical hurdle for translational applications, as clinical notes can be longer than what most transformer models can process. For example, BERT and RoBERTa models can handle up to 512 tokens at one time, but the discharge summaries in MIMIC-IV have 1,600 words on average, which in tokens is about six times longer than the 512-token limit. These constraints raise the need for models capable of processing longer documents, as well as long-document benchmark datasets to test the ability of the developed models.
A handful of methods have been introduced to utilize LMs in processing long documents. One simple method is to split input texts into smaller parts and merge the results after the predictions. For example, Devlin et al.8 truncated the input with a given window size and processed the remainder in subsequent processing steps for the Question Answering task. Alsentzer et al.9 broke down input documents into sentence-level inputs for Clinical NLP tasks. Chen et al.10 sectionized long clinical documents and only kept informative sections for downstream NLP tasks. These methods are straightforward and easy to implement, but do not allow the LM to consider the entire context of the original inputs. Another method is to increase the maximum length of the model itself. Longformer11 and Clinical-Longformer12 have expanded the maximum processing window by replacing the full self-attention flow with a combination of local attention and task-oriented global attention. Another approach that falls into this branch is large LMs trained on longer context windows. Pre-training or fine-tuning these models requires significant computational resources that preclude access for many researchers and healthcare institutions. Existing work has therefore focused on applying existing models using prompt-based zero- or few-shot learning13–15.
In this paper, we describe work in developing a benchmark for clinical long document processing models, based on the out-of-hospital mortality prediction task. The source of the dataset is MIMIC-IV v2.2 corpus, specifically discharge notes for patients who were admitted to the ICU and discharged to locations other than hospice facilities. Along with the benchmark dataset, we explore multiple machine learning models for the task, including traditional Support Vector Machine using Bag-of-Words, Convolutional Neural Networks, a hierarchical transformer encoder16, and zero-shot large language models. In the results section, we select three baseline models, the best-performing CNN model, hierarchical transformer, and large language models (Mixtral-8x7B-instruct-v0.1) and analyze the outputs. Based on expert physician review, we discovered that the dataset is challenging and at the same time the models can find meaningful signals. We additionally leverage the architecture of the hierarchical transformer model to analyze its behavior. In these analyses, we visualize and quantify the extent to which they jointly consider information from different sections of the discharge summary.
The contribution of this paper is two-fold. First, from a clinical NLP perspective, we anticipate that the proposed dataset will serve as a solid foundation for model development and, moreover, as a forum for evaluating Large LMs on long clinical document classification tasks1. Second, the utilization of predictive models for 30-day mortality at the time of discharge is anticipated to facilitate timely end-of-life discussions with patients and their families. Such conversations are crucial for enhancing the quality of life for patients nearing the end of life, by ensuring that care decisions align with their values and preferences17–20.
MATERIALS AND METHODS
Medical Information Mart for Intensive Care IV (MIMIC-IV)
The Medical Information Mart for Intensive Care (MIMIC) is a series of publicly available electronic health record (EHR) databases collected from Beth Israel Deaconess Medical Center (BIDMC)21. MIMIC databases contain multi-modal data such as text data, structured data (including laboratory data, admission records, and demographic data), and radiograph images for some versions. All the records and text data are de-identified.
MIMIC-IV22 is the latest release that encompasses admissions between 2009 and 2019, focused on structured data and text data of ICU patients. We used MIMIC-IV v2.2 data2 with discharge summaries and multiple structured records, including out-of-hospital mortality records from Massachusetts State Registry of Vital Records and Statistics22.
Preprocessing
Preprocessing of our benchmark dataset is composed of three steps. First, following the criteria of Harutyunyan et al.23, we collected admission records with an ICU stay. In the second step, we merged date of death data using the admission record identifier (hadm_id). In-hospital deaths were collected directly in the EHR of BIDMC or an affiliated institution, while out-of-hospital deaths were collected following a matching algorithm22 with the Massachusetts State Registry of Vital Records and Statistics to incorporate the deaths in Massachusetts into MIMIC-IV. Dates of death were censored at one year from the patient’s last hospital discharge, and null dates of death represent patients discharged and alive at one year after discharge. The third step filtered out records with task-specific restrictions. For our proposed 30-day out-of-hospital mortality prediction dataset, we excluded admissions with in-hospital deaths and admissions where a patient had a discharge disposition of “hospice” in structured data, because these patients are expected to die shortly after discharge. The training, validation, and testing datasets were partitioned according to patient ID to guarantee that all admissions from the same patient are allocated to the same subset. For the note data, we only utilized discharge notes and not radiology reports. Full details are available in Appendix A, and a Python implementation of the exact algorithm is available on GitHub.
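The filtering, labeling, and patient-level partitioning described above can be sketched roughly as follows. This is a minimal illustration and not the released implementation: the function names, the "HOSPICE" string comparison, the inclusive 30-day threshold, the hash-based split, and the split fractions are all assumptions made for the example.

```python
import hashlib
from datetime import date
from typing import Optional


def keep_admission(died_in_hospital: bool, discharge_location: str) -> bool:
    """Exclude in-hospital deaths and hospice discharges (string value is assumed)."""
    return (not died_in_hospital) and discharge_location.strip().upper() != "HOSPICE"


def label_30_day_mortality(discharge_date: date,
                           date_of_death: Optional[date],
                           window_days: int = 30) -> int:
    """Return 1 if a death is recorded within the window after discharge, else 0.

    A missing date of death is treated as survival, mirroring the null
    convention described in the text.
    """
    if date_of_death is None:
        return 0
    return int((date_of_death - discharge_date).days <= window_days)


def assign_split(patient_id: int, dev_frac: float = 0.1, test_frac: float = 0.1) -> str:
    """Map a patient ID to a split deterministically (hypothetical scheme).

    Hashing the patient ID keeps every admission of one patient in the
    same subset, which is the property the partitioning needs.
    """
    digest = hashlib.sha256(str(patient_id).encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 16 ** 8  # uniform value in [0, 1)
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + dev_frac:
        return "dev"
    return "train"
```

Any deterministic function of the patient ID would serve the same purpose; the hash is used here only so that the example needs no stored lookup table.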
Figure 1 shows the processing flow and Table 1 shows the number of datapoints after processing steps. As shown on the last two rows, our dataset is highly imbalanced; the negative-label notes, which means the patient survived, are about 26 times more abundant than positive-label notes. Of note, the number of admissions in the raw dataset exceeds the number of discharge notes because 99,437 admissions do not have discharge notes.
Figure 2 displays a histogram of the number of tokens in discharge notes. Each note is tokenized with the microsoft/xtremedistil-l6-h256-uncased tokenizer and the Hugging Face Transformers library. As this model employs a word-piece tokenizer, a single word can be broken down into subwords and tokenized into multiple tokens depending on the frequency of the word. The median token lengths were 3978 (interquartile range (IQR) 3085 - 5091) for the train set, 3991 (IQR 3080 - 5103) for the development set, and 3952 (IQR 3072 - 5072) for the test set.
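As a rough sketch of how such length statistics can be computed, the helper below takes any tokenizer callable. The whitespace split in the test data is only a stand-in so that the example runs without downloading a model; with the Hugging Face library one would presumably pass the `tokenize` method of a tokenizer loaded via `AutoTokenizer.from_pretrained`. The function names and the "inclusive" quantile method are assumptions, so the numbers it produces need not match the reported ones exactly.

```python
import statistics
from typing import Callable, Iterable, List, Tuple


def token_lengths(notes: Iterable[str],
                  tokenize: Callable[[str], List[str]]) -> List[int]:
    """Count tokens per note with a caller-supplied tokenizer."""
    return [len(tokenize(note)) for note in notes]


def median_and_iqr(lengths: List[int]) -> Tuple[float, Tuple[float, float]]:
    """Return the median and the (Q1, Q3) interquartile range."""
    q1, q2, q3 = statistics.quantiles(lengths, n=4, method="inclusive")
    return q2, (q1, q3)
```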
Baseline model
Bag-of-words (BoW) model
The BoW model is a widely used baseline model for NLP where a given text sequence is represented with the frequency of words or word chunks in the sequence. BoW models learn vocabulary occurrence information but do not utilize information about the order of word chunks in an input. Hence, they have a very limited ability to use syntactic information. The size of the word chunk, which could be one or a few words depending on the window size, can add the ability to represent local syntactic information, but it can also make vocabularies sparse and very large. Despite these limitations, BoW is a strong baseline for document classification tasks with limited training data24.
Bag-of-words models were implemented using the scikit-learn25 CountVectorizer module with unigrams and bigrams, and SGDClassifier with default settings (hinge loss, max_iter=1000, tol=1e-3).
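To make the unigram and bigram representation concrete, the following standard-library sketch counts the features that such a vectorizer would conceptually produce. It is a simplification: lowercasing plus whitespace splitting is an assumption here, and scikit-learn's CountVectorizer applies its own tokenization rules.

```python
from collections import Counter
from typing import Iterable


def ngram_counts(text: str, n_values: Iterable[int] = (1, 2)) -> Counter:
    """Count word n-grams for each n in n_values (unigrams and bigrams by default)."""
    tokens = text.lower().split()
    counts: Counter = Counter()
    for n in n_values:
        for start in range(len(tokens) - n + 1):
            counts[" ".join(tokens[start:start + n])] += 1
    return counts
```

The bigram features keep a small amount of local word order ("not stable" versus "stable"), at the cost of a much larger and sparser vocabulary.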
Convolutional Neural Networks
Kim et al.26 proposed the Convolutional Neural Network (CNN) as a feature extractor for the sentence classification task. Our CNN model followed the structure of Kim et al. with modifications to the embedding layer and hyperparameter settings: Kim et al. used word vectors pre-trained with the continuous bag-of-words architecture, namely word2vec (Mikolov et al., 2013), whereas our model used a randomly initialized embedding layer of 100 dimensions.
Pretrained transformer models
The transformer is a model architecture that relies on the self-attention mechanism, which is effective at capturing global dependencies within an input sequence. As the structure does not involve recurrence, transformer models can be efficiently trained using parallel computation units, such as GPUs, and this enables the training of much larger LMs. BERT and GPT models are some of the early proposed transformer LMs. These models are pre-trained on large-scale corpora and further finetuned to task-specific datasets for supervised learning. Empowered by the pretrained LMs, models tackling clinical NLP tasks have shown remarkable progress.
The self-attention mechanism of early transformer models is implemented by fully connecting each unit of the sequence to every other unit. This requires memory and computational costs that are quadratic with respect to the length of the input sequence, making it impractical to use such transformer models for longer sequences.
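The quadratic growth can be illustrated with simple arithmetic. The sketch below only counts pairwise attention scores for a single head in a single layer; it ignores constant factors, hidden size, and implementation-specific optimizations, so it is an order-of-magnitude illustration and not a memory estimate for any particular model.

```python
def attention_pairs(seq_len: int) -> int:
    """Number of pairwise attention scores under full self-attention."""
    return seq_len * seq_len


def relative_cost(seq_len: int, baseline_len: int = 512) -> float:
    """How many times more attention scores than a baseline-length input."""
    return attention_pairs(seq_len) / attention_pairs(baseline_len)
```

Under this count, an 8-fold longer input (4096 versus 512 tokens) needs 64 times as many attention scores.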
Longformers
To mitigate this computational limitation for processing long documents, a handful of methods, such as blending local window attention with global attention and sliding window attention11,12,27,28, have been proposed. Longformer11 and Clinical-Longformer12 are examples of such methods. The Clinical-Longformer model was initialized from the pre-trained weights provided by the original authors3 and fine-tuned on our dataset.
Hierarchical Transformers
Su et al.16 introduced a hierarchical transformer, which stacks two transformer encoders: a word-level encoder layer group and a chunk-level encoder layer group. The hierarchical transformer splits input sequences into smaller chunks and first encodes the chunks with the word-level encoder to output chunk representations. The latter part of the structure, the chunk-level encoder, works as a feature extractor over the chunk representations from the former part and predicts classes for an input document (Figure 3). We experimented with two settings for the hierarchical transformer models, using the xtremedistil model and the PubMedBERT model as initial weights for the word-level encoder. The chunk-level encoder of the hierarchical transformer model was randomly initialized. We tested two chunk sizes for the hierarchical transformers, 256 tokens and 512 tokens. In this paper, we refer to the latter setting as the “Bigchunk” setting.
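The chunking step that precedes the word-level encoder can be sketched as below. This is an illustrative assumption about the mechanics only: the padding scheme, the `pad_id` value, and the absence of per-chunk special tokens are choices made for the example and may differ from the actual implementation.

```python
from typing import List


def split_into_chunks(token_ids: List[int],
                      chunk_size: int = 256,
                      max_seq_length: int = 8192,
                      pad_id: int = 0) -> List[List[int]]:
    """Truncate to max_seq_length, cut into fixed-size chunks, pad the last one."""
    token_ids = token_ids[:max_seq_length]
    chunks = [token_ids[start:start + chunk_size]
              for start in range(0, len(token_ids), chunk_size)]
    if chunks and len(chunks[-1]) < chunk_size:
        chunks[-1] = chunks[-1] + [pad_id] * (chunk_size - len(chunks[-1]))
    return chunks
```

Each chunk would then be encoded independently by the word-level encoder, and the resulting chunk representations form the input sequence of the chunk-level encoder. With a chunk size of 512 instead of 256, the same note yields half as many chunks.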
Large LMs
We explored zero-shot mortality prediction using large LMs, Mixtral (8×7B)4 and GPT4-32k (GPT-4). For GPT4-32k, we used the HIPAA-compliant version 0613 that is provided through Mass General Brigham Azure. We experimented with Llama 2 models but were unable to find prompts that elicited results better than random guessing. For zero-shot experiments, we used the Hugging Face library to load and run inference with the Mixtral model and the Azure API for the GPT-4 model. These models were selected as they are able to handle contexts of 32k tokens. For Mixtral, a 4-bit quantized29 version was used. Figure 4 shows our prompting template for large LMs (Mixtral and GPT-4). The model is asked to choose an answer between 0:alive and 1:death, and we used a regular expression that looks for the first occurrence of “0” or “1” in the response to extract the answer.
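A minimal sketch of this answer-extraction step is given below. The exact regular expression used in the experiments is not shown in the text, so the pattern and the function name here are assumptions that simply implement "first occurrence of 0 or 1".

```python
import re
from typing import Optional


def extract_label(response: str) -> Optional[int]:
    """Return the first '0' or '1' found in the model response, or None."""
    match = re.search(r"[01]", response)
    return int(match.group()) if match else None
```

Note that a pattern this permissive also matches digits inside other numbers (for example a dose such as "10 mg" appearing before the answer), so responses that do not follow the requested format can be mislabeled.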
Experiment details
Our primary metric is the F-1 score for the positive label, and we used the Area Under the Receiver Operating Characteristic Curve (ROC AUC) as a supplementary metric. Note that we used hinge loss for BoW models, which does not produce the probability estimates needed to calculate ROC AUC. For BoW, CNN and Hierarchical Transformers, we performed 5 or 10 runs with identical settings except for the random seeds and averaged the performance to minimize the effect of random initialization of the model.
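For reference, the positive-class F-1 score can be computed as in the sketch below (standard definition; the function name is ours). It also shows why accuracy alone is uninformative on a highly imbalanced dataset: a classifier that always predicts the negative class has high accuracy but a positive-class F-1 of zero.

```python
from typing import Sequence


def positive_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F-1 score for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```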
CNN, Hierarchical transformers, and Clinical-Longformer models were trained and tested on the CNLPT library30 (available on GitHub5). Hyperparameter search for the BoW model was only performed on the n-gram window of the vectorizer, and we selected best performing settings based on experiments on the development dataset.
The models were evaluated against the dev set during the training time, and the best performing checkpoints were selected based on the average of Accuracy and the F-1 score.
CNN model and hierarchical transformer models have flexibility in selecting the maximum sequence length (max_seq_length), as unlike most language models, these models can expand the window without pre-training again from scratch. We selected max_seq_length to be 8192 tokens, which can cover 97% of the notes in the train and development set without truncation (based on xtremedistil model tokenizer).
Since the open-sourced Clinical-Longformer only supports maximum sequence length of 4096, we tested both right-truncation and left-truncation settings, i.e. truncating the ending part and the beginning part of the input sequence respectively.
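The two truncation settings can be sketched as follows, using the naming convention of the text (right truncation drops the end of the note, left truncation drops the beginning). The function name and signature are illustrative.

```python
from typing import List


def truncate(token_ids: List[int], max_len: int = 4096, side: str = "right") -> List[int]:
    """Keep at most max_len tokens, dropping from the given side."""
    if len(token_ids) <= max_len:
        return list(token_ids)
    if side == "right":   # drop the ending part, keep the beginning
        return list(token_ids[:max_len])
    if side == "left":    # drop the beginning part, keep the ending
        return list(token_ids[-max_len:])
    raise ValueError("side must be 'right' or 'left'")
```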
Visualization of model attention
One of the benefits of the hierarchical transformer model is that it can provide a window into interpretability by highlighting the saliency of each input segment into the model prediction. This becomes possible because the model splits the input into several chunks, each chunk is encoded through an encoder layer, and each encoded chunk representation works as an input unit of the chunk attention layers. By analyzing attention values and the vector norm31 of each chunk, we can infer the model’s prioritization of information across various chunks.
Early works attempting to understand the decision making process of transformer models have focused on explaining linguistic phenomena with attention weights 32–35. However, multiple works argued that the attention weight based analysis is noisy and sometimes not explainable 31,36,37.
Kobayashi et al.31 proposed vector norm based analysis, noting that the output vector of each attention layer is a weighted sum of vectors. Following the notation of Kobayashi et al., we denote the vector representation of the input unit at the j-th position (a chunk, since we look into the chunk-level encoder) as $x_j$, and the attention weight from the j-th input unit to the i-th output unit as $\alpha_{i,j}$.
Then, the output vector $y_i$ can be expressed as Equation (1), where the function $f(x)$ is a simplified notation for the value transformation of an input unit vector $x$:
\[ y_i = \sum_{j} \alpha_{i,j} f(x_j) \tag{1} \]
As the equation shows, the output is affected not only by the attention weights $\alpha_{i,j}$ but also by the transformed input vectors $f(x_j)$. Norm-based analysis measures the norm of the weighted vector, $\lVert \alpha_{i,j} f(x_j) \rVert$, to determine which input segments are highlighted for a given input sequence.
Unlike the machine translation tasks for which this analysis was first presented, looking into input unit alignment (i.e. finding an input unit that resonates with another word) does not yield meaningful insights here. Rather, we focused on the norm of the output vector of the attention layer, $\lVert y_i \rVert = \lVert \sum_{j} \alpha_{i,j} f(x_j) \rVert$, which will directly show the degree of importance of each input unit in the model’s decision.
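The difference between attention-weight analysis and norm-based analysis can be illustrated with a small numerical sketch. The vectors below are made-up toy values, and the helper names are ours; the code only shows that a chunk with a small attention weight can still dominate once the norm of its transformed vector is taken into account.

```python
import math
from typing import List


def weighted_norms(alpha_i: List[float], transformed: List[List[float]]) -> List[float]:
    """Norm of alpha_{i,j} * f(x_j) for each input unit j, for one output unit i."""
    return [math.sqrt(sum((a * v) ** 2 for v in fx))
            for a, fx in zip(alpha_i, transformed)]


def top_k_units(norms: List[float], k: int = 2) -> List[int]:
    """Indices of the k input units with the largest weighted norms."""
    return sorted(range(len(norms)), key=lambda j: norms[j], reverse=True)[:k]


# Toy example: unit 0 has the larger attention weight, but unit 1 has the
# larger transformed vector and therefore the larger weighted norm.
toy_alpha = [0.9, 0.1]
toy_fx = [[0.1, 0.0], [3.0, 4.0]]
toy_norms = weighted_norms(toy_alpha, toy_fx)
```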
To investigate the importance of aggregating information across a discharge summary, we use the vector norm method to analyze section importance for this task. We do this by aggregating the two chunks with the highest vector norms for each instance in the test dataset. Since all inputs have different lengths, the content in a chunk with a given index can have a different meaning across samples. Hence, instead of using chunk locations alone, we use the section names found in the chunks for the analysis. The section names were extracted using a rule-based approach. If there are multiple sections in the chunks, for example n sections in the first chunk and k sections in the second chunk, all combinations of sections for an instance are included in the analysis, with weights set so that the weights for an instance sum to 1. This adjusts for the effect of short sections having a higher chance of being represented. For example, if one of the two important chunks contains the section headers “Brief Hospital Course” and “Admission Diagnosis” and the second chunk contains the section header “Discharge Disposition”, the section pairs (“Brief Hospital Course”, “Discharge Disposition”) and (“Admission Diagnosis”, “Discharge Disposition”) each receive 0.5 counts for that instance. The partial counts are summed across the test dataset for a population-level analysis. Note that some longer sections such as “brief hospital course” can appear in multiple chunks.
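The pair-weighting scheme can be sketched as below, reproducing the worked example from the text. The function names and data layout are assumptions for illustration; the rule-based section extraction itself is not shown.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, str]


def section_pair_weights(first_chunk_sections: List[str],
                         second_chunk_sections: List[str]) -> Dict[Pair, float]:
    """Spread a total weight of 1 evenly over all section pairs of one instance."""
    pairs = list(product(first_chunk_sections, second_chunk_sections))
    weights: Dict[Pair, float] = defaultdict(float)
    for pair in pairs:
        weights[pair] += 1.0 / len(pairs)
    return dict(weights)


def aggregate(instances: Iterable[Tuple[List[str], List[str]]]) -> Dict[Pair, float]:
    """Sum the partial counts over all instances of a dataset."""
    totals: Dict[Pair, float] = defaultdict(float)
    for first, second in instances:
        for pair, weight in section_pair_weights(first, second).items():
            totals[pair] += weight
    return dict(totals)
```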
Qualitative analysis
For the post-experiment exploratory analysis, we conduct a two-step investigation. The first step is dictionary-based detection (i.e. exact match against a list of synonyms) of mentions of palliative and comfort care measures6 and Do Not Resuscitate and Do Not Intubate (DNR/DNI) status. These mentions can be a strong signal of poor prognosis and can serve as a first filter for data investigation. The second step is to manually review the discharge notes of the remaining samples that do not contain such terms. For the manual review, we provide notes, model predictions, true labels, and three questions. Regarding model predictions, the predicted binary labels and the order of chunk highlights are provided. Labels are hidden by default, and the reviewer needs to click “unhide” to see them. The three questions were “Does this patient label seem valid?”, “Was chunk information useful?”, and “Was this case difficult to predict?”
For comparative analysis, we compare the outputs of three models (CNN, Hierarchical Transformer, and Mixtral) and manually inspect samples of the benchmark dataset. We focus on open-sourced models for this analysis, as we have more control over the prediction process and the results of these models are more likely to be reproducible.
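The dictionary-based detection step can be sketched as a case-insensitive exact match with word boundaries. The term lists below are short hypothetical examples and are not the synonym list used in the study.

```python
import re
from typing import List

# Hypothetical, abbreviated term lists for illustration only.
COMFORT_CARE_TERMS = ["comfort care", "comfort measures only", "palliative care"]
DNR_DNI_TERMS = ["dnr", "dni", "do not resuscitate", "do not intubate"]


def find_terms(note: str, terms: List[str]) -> List[str]:
    """Return the dictionary terms that occur verbatim in the note."""
    text = note.lower()
    return [term for term in terms
            if re.search(r"\b" + re.escape(term) + r"\b", text)]
```

Exact matching is easy to audit but cannot catch paraphrases, which is why the notes without any match go through manual review in the second step.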
RESULTS
Table 2 shows our experimental results for the machine learning models.
BoW and CNN models showed strong performance relative to the transformer models: BoW achieved 28.1% F1 and CNN achieved 28.9% F1. Among transformer-based models, hierarchical transformers showed the best performance, which is close to that of the BoW and CNN models. The Bigchunk setting of the Hierarchical Transformer, which uses a chunk size of 512 tokens as opposed to the normal 256 tokens, showed the best performance among them at 27.8% F1.
Clinical-Longformer showed lower performance than the BoW, CNN and hierarchical transformer models regardless of whether the text was truncated from the bottom (right truncated) or the top (left truncated) of the document. The Mixtral-8x7B-instruct-v0.1 model with the zero-shot method achieved 22.3% F1, which is about 6 percentage points lower than the best-performing supervised fine-tuning approach. GPT-4 showed the best overall performance at 32.4% F1.
Table 3 shows the results of the aggregated chunk highlight pairs analysis. The table lists the section combinations and their frequencies, computed as the sum of section pair weights normalized by the highest weight. Sections like “Brief Hospital Course” and “Pertinent Results” frequently appear among the two most highlighted chunks.
Comparative analysis on model predictions
Figure 5 shows Venn diagrams of the true positive and false positive samples from three models: CNN, Hierarchical Transformer, and Mixtral. The diagrams illustrate that the predictions from the two supervised models have different characteristics when compared to those from zero-shot Mixtral, a large LM. This is unsurprising, as the supervised models are strongly influenced by the dataset they are trained on, whereas the large LMs have presumably never seen the dataset.
Sixteen samples were reviewed by a board-certified critical care physician and clinical informatics expert (MA) to understand the face validity of the labels and the difficulty of the task. The samples were selected across various categories, including 3 common true positives and 7 common false positives, 3 of which were samples without date of death records (for complete information, please see Appendix C). Overall, the physician commented that predicting a specific time window such as 30 or 60 days was difficult. This finding agrees with multiple prior studies showing that prognostication is clinically challenging in patients with serious illness, and even experienced physicians tend to overestimate survival38–41. Incorrect prognostication can hinder end-of-life discussions, lead to more aggressive treatment and potential over-treatment, and lead to interventions that are not in line with patients’ goals of care. In the outpatient oncology setting, machine learning-guided prognostication has been found to improve advanced care planning documentation and serious illness conversations, which could improve end-of-life care. In the inpatient intensive care setting, models such as those developed here could be used to identify patients who may have a lower probability of survival, to improve end-of-life planning and care.
Common predictions
The intersections of all three models among the true positives and among the false positives suggest that those instances are “easy” cases and “difficult” cases (including sudden death or label error), respectively. We examine these cases more closely, both to understand the behavior of our models and to validate the quality of the benchmark dataset.
All three models have true positive predictions on 29 instances, which can be considered easy-to-predict examples. Among the 29 instances, 26 are identified as having comfort care mentions in the note, and a different, overlapping subset of 26 are identified as having DNR/DNI mentions. All of the 29 instances have at least one of those two keyword sets. Note that patients identifiable as discharged to hospice through structured data were excluded from the dataset during the pre-processing steps. Some of the 26 patients with comfort care mentions had a discussion about discharge to a hospice facility but were not actually discharged there according to the structured data (instead, they were discharged home with hospice care or to alternative facilities such as SKILLED NURSING FACILITY or CHRONIC/LONG TERM ACUTE CARE).
The remaining three samples were manually examined. According to the structured data, these patients passed away after 7, 10, and 22 days. The physician commented that the labels for these three patient cases had face validity. Based on our analysis, we did not find any anomalies in the labels of any of the 29 instances.
For false positive predictions, the three models have 18 instances in common. Since all machine-learning models predicted these negative instances as positives, instances in this category can be treated as difficult instances. These false positive predictions can be interpreted in multiple ways: the patient’s condition was severe but the patient survived, or the prediction is correct but the label is erroneous (please see the Limitations section for further discussion).
Our dictionary-based detection found comfort care terms in 13 notes, and we manually reviewed the remaining 5 notes in which it could not find such terms. Three patients survived less than 1 year, and among them, two passed away after 61 and 106 days. One of the other two patients survived about 1 year and 9 months. The last patient had a comfort care mention that our detection mechanism could not catch due to euphemistic language (please see the paragraph entitled Various mentions about comfort care for details). This patient did not have a date of death record, but our reviewing physician commented that this patient had a high possibility of death within a short period (MIMIC-IV censors death dates at one year after last discharge, so the patient may have survived over one year, or may have been lost to follow-up and died in another state). In summary, some of our data instances pose challenges to the models, which we believe is important for the discriminative ability of a benchmark dataset. The model predictions were also reasonable, and such errors are likely to happen even for a well-trained model or domain experts.
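The common and distinct regions of such a Venn diagram correspond to simple set operations over the instance IDs that each model predicted as positive. The sketch below uses made-up IDs and hypothetical model keys purely to illustrate the bookkeeping.

```python
from typing import Dict, Set


def common_instances(predictions: Dict[str, Set[int]]) -> Set[int]:
    """Instances that every model placed in the same category (the intersection)."""
    return set.intersection(*predictions.values())


def distinct_instances(predictions: Dict[str, Set[int]]) -> Dict[str, Set[int]]:
    """Instances that only one model placed in the category."""
    result = {}
    for name, ids in predictions.items():
        others = set().union(*(s for n, s in predictions.items() if n != name))
        result[name] = ids - others
    return result


# Made-up instance IDs for illustration.
toy_true_positives = {
    "cnn": {1, 2, 3},
    "hierarchical": {2, 3, 4},
    "mixtral": {3, 5},
}
```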
Distinct predictions
The hierarchical transformer (xdistill-bigchunk) had 6 distinct true positive predictions that the other models failed to predict correctly (green area in Figure 5(a)). Among these 6 distinct true positives, 3 mentioned comfort care and 3 did not. For the former, the other models predicted negative labels despite the comfort care mentions, which can be interpreted as those models recognizing other signals of survival in the text, even though their predictions were incorrect. The manual analysis showed a similar pattern: the physician predicted that 2 out of 3 cases would survive more than 30 days.
Attention of Hierarchical Transformer Model
We looked into the vector norm values of the hierarchical transformer to see which chunks (the input units of the chunk attention layers) are highlighted during prediction. During the prediction for one of the notes without comfort care mentions, the model highlighted the 5th chunk, which contains the “# icu course” part of the “brief hospital course :” section, and the last chunk, which contains discharge information: part of the “discharge medications :” section and the “discharge disposition :”, “discharge diagnosis :”, “discharge condition :”, and “discharge instructions :” sections (Figure 6). The brief hospital course provides informative background about the clinical findings pertaining to the patient’s brain injury, while the discharge information provides complementary, non-overlapping information indicating the level of severity of the injury and the mental status at the time of discharge.
The following is a part of the 5th chunk:
# icu course on admission, patient was monitored on cveeg with no seizures captured . some left temporal epileptiform discharges were seen in a semirhythmic pattern (pleds), but they were not frequent or concerning for seizure . she was continued on keppra 1500mg bid with no seizures seen . she had a cth, which was suspicious for large left mca stroke . mri was obtained which was concerning for hypoglycemia related damage vs hypoxic ischemic encephalopathy with cortical necrosis vs post - ictal changes . cta did not show vessel abnormalities . repeat mri was performed on, and showed stable changed . etiology of her exam was felt to be a combination of hypoglycemia and hypoxia . she remained intubated and off sedation for her entire stay . during her icu stay, she began to have more spontaneous movement of her lower extremities, and would intermittently open her eyes, and maintained her brainstem reflexes on minimal ventilator settings . she did not regard, track, or follow any commands . an mri was repeated on, which showed persistent cortical slow diffusion within left greater than right cerebral hemispheres with parietal / temporal predominance, and new gyriform contrast enhancement, including a new discrete t2 hyperintense and enhancing focus in the medial left temporal lobe . <omitted>
In this chunk, we note the patient has evidence of hypoxic brain injury and remained in a non-cognitive state that required dependence on breathing and feeding life support.
The following is a part of the last chunk:
discharge medications : 1 . acetaminophen 650 mg … <omitted> discharge diagnosis : hypoglycemic encephalopathy hypoxic ischemic brain injury urinary tract infection discharge condition : mental status : confused - always. level of consciousness : lethargic but arousable. activity status : bedbound. discharge instructions : dear ms ., you were hospitalized after severely low blood sugars and brain injury caused by insulin overdose . you were started on medication to prevent seizures . you will need to go to a nursing facility to help you take care of yourself. it was a pleasure taking care of you, your neurologists
In this chunk, we could again confirm that the patient had hypoxic ischemic brain injury and low blood sugars, while gaining new information about her mental status and clinical condition at the time she was discharged. While it is reasonably clear from the latter section that the patient’s condition has a poor prognosis, the earlier section contains detailed information about her problems that could give the model more fine-grained information to modulate its estimation of the severity of her condition and her neurologic function.
DISCUSSION
Limitations
We first discuss the difference in settings between the baseline models, and the foreseeable impact of those differences. In the latter part of this section, we discuss the limitation of our benchmark dataset regarding the source of information and our methodology in analysis.
Different settings in baseline models
Models inherently have different settings due to the nature of their architectures. One notable difference is the maximum token length of an input instance, which varies across the models.
Maximum token length can strongly affect any model regardless of its structure 42, but it can matter even more for large LMs because they rely on prompting. Prompts for large LMs include a system prompt, the question, and the input sequence, and users must also leave space for the output response tokens. Moreover, large LM performance may improve in few-shot settings, in which the prompt includes one or a few training instances; this leaves even less room for the input and makes it more important to understand the impact of a shortened input sequence.
Post-hoc experiments on Mixtral showed that when the maximum token length was limited to 2048, performance dropped by 11 percentage points in absolute terms, a drop equal to about half of the performance of the full-length model.
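The prompt-budget arithmetic described above can be illustrated with a minimal sketch. The function names and all token counts below are hypothetical and chosen only for illustration; they are not the settings used in our experiments, and head truncation is just one of several possible strategies.

```python
def input_token_budget(context_window: int,
                       system_tokens: int,
                       question_tokens: int,
                       reserved_output_tokens: int,
                       few_shot_tokens: int = 0) -> int:
    """Return how many tokens remain for the clinical note itself.

    Every fixed part of the prompt (system prompt, question, optional
    few-shot examples) and the space reserved for the response are
    subtracted from the model's context window.
    """
    budget = (context_window - system_tokens - question_tokens
              - reserved_output_tokens - few_shot_tokens)
    return max(budget, 0)


def truncate_note(note_tokens: list, budget: int) -> list:
    """Keep only the first `budget` tokens (head truncation; an assumption)."""
    return note_tokens[:budget]


# Illustrative numbers only: a 2048-token window, zero-shot vs. few-shot.
zero_shot_budget = input_token_budget(2048, 60, 40, 16)
few_shot_budget = input_token_budget(2048, 60, 40, 16, few_shot_tokens=1500)
print(zero_shot_budget, few_shot_budget)
```

Under these assumed numbers, a single long few-shot example consumes most of the window, so a discharge note would be cut far more aggressively than in the zero-shot case.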
For the zero-shot setting with large LMs, model performance depends on how the prompt is formulated. Sometimes the model does not produce answers that comply with the requested answer format. For example, our prompt requires the model to answer with only 0:alive or 1:death, but some answers started with “Based on the information provided,” and did not match the requested format. In total, 165 predictions from Mixtral 8x7B included this phrase. To alleviate this problem, Gao et al.43 proposed an alternative prompting method for zero-shot evaluation, named harness. However, this approach can only be applied to models that expose output probabilities, meaning that most cloud-based models like GPT-4 cannot be evaluated using this method.
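As an illustration of the format-compliance problem, the sketch below checks whether a generated response starts with one of the two requested labels. It is a hypothetical post-processing step written for this discussion, not the evaluation code used in our experiments; the function name `parse_label` and the regular expression are assumptions.

```python
import re
from typing import Optional

# A compliant answer is assumed to begin with "0" or "1",
# optionally followed by text such as ":alive" or ":death".
_LABEL_PATTERN = re.compile(r"^\s*([01])\b")


def parse_label(response: str) -> Optional[int]:
    """Return 0 or 1 if the response starts with a valid label, else None."""
    match = _LABEL_PATTERN.match(response)
    return int(match.group(1)) if match else None


responses = [
    "1:death",
    "0: alive",
    "Based on the information provided, the patient is likely to survive.",
]
labels = [parse_label(r) for r in responses]
non_compliant = sum(label is None for label in labels)
print(labels, non_compliant)
```

Responses that return `None` would need a fallback rule (for example, counting them as errors or searching the free text for a label), and that choice can itself change the reported score.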
Source of data of death labels
The MIMIC-IV v2.2 dataset uses the Massachusetts Registry of Vital Records (death certificates are public records in the state of Massachusetts) to enrich the date of death records. According to the MIMIC-IV paper, the state registry was selected instead of the Social Security Death Master File due to data quality concerns44. However, using a state registry cannot fully resolve these concerns, as patients who moved out of the state cannot be traced with this method. For example, among 18 instances of common false positive cases (i.e. union of three models used in the Comparative analysis section), three patients do not have date of death (DoD) records. We requested the physician to review these instances and found that all three patients were severely ill and unlikely to survive long after discharge, meaning that these three labels may be erroneous.
Despite this intrinsic limitation, we believe the state registry is still one of the most viable options when creating a database.
Various mentions about comfort care
During our post-experiment analysis, we learned that palliative care and comfort care terms vary widely and cannot be efficiently extracted using dictionary-based methods. Decisions on comfort care, especially in discharge notes, are often expressed euphemistically, making them challenging to detect. For example, there were mentions such as “aimed at keeping you as comfortable as possible” or more indirect mentions, such as “we all met as a group to discuss the kind of care that will give you the greatest number of happy days.”
In the first example, “comfortable” is the key term for identifying the mention as comfort care. However, filtering on the single term “comfortable” is not viable, as it produces false positives for mentions like “He is tachypneic but otherwise comfortable appearing.” A possible direction for further study involves developing robust models capable of recognizing and normalizing euphemistic terms with clinical significance.
Our rule-based extraction of comfort care mentions found that 220 out of 7568 patients had a discussion of comfort care that is not extractable from structured data. We did not exclude these patients, because this may make supervised models overfit to these terms as they can be a strong signal for the mortality task.
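The limits of dictionary-based matching can be seen in a small sketch that uses the comfort care term list given in the footnotes. The matching logic itself (case-insensitive, whole-phrase regular expression) is an assumption made for illustration and is not necessarily the rule-based extraction used in this work.

```python
import re

# Term list taken from the footnote on comfort care terms.
COMFORT_CARE_TERMS = ("hospice", "comfort measures", "comfort care", "palliative care")

# Assumed matching rule: case-insensitive match on whole phrases.
_TERM_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in COMFORT_CARE_TERMS) + r")\b",
    re.IGNORECASE,
)


def find_comfort_care_terms(text: str) -> list:
    """Return every dictionary term found in the text, lowercased."""
    return [match.group(1).lower() for match in _TERM_PATTERN.finditer(text)]


explicit = "The family elected Comfort Measures only and a hospice referral."
euphemistic = "aimed at keeping you as comfortable as possible"
unrelated = "He is tachypneic but otherwise comfortable appearing."

print(find_comfort_care_terms(explicit))
print(find_comfort_care_terms(euphemistic))
print(find_comfort_care_terms(unrelated))
```

The dictionary finds the explicit mention but misses the euphemistic one, while adding “comfortable” to the list would also flag the unrelated sentence. The first sentence is a made-up example; the other two are the mentions quoted above.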
CONCLUSION
In this paper, we present a benchmark for evaluating long clinical document processing, called the LCD benchmark. We tested our benchmark dataset using baseline methods ranging from bag-of-words to zero-shot prediction with large LMs. Through these experiments and further analysis, we showed that the LCD benchmark presents challenges and potential for improvement in current neural network-based approaches. During our experiments with large LMs, we further explored the importance of their capability to process longer sequences.
Additionally, we raised questions regarding the formulation of prompts. Our benchmark dataset is publicly available to researchers who have obtained access to the MIMIC-IV datasets.
DATA AVAILABILITY STATEMENT
The data underlying this article are available in a GitHub repository at https://github.com/Machine-Learning-for-Medical-Language/long-clinical-doc.
The datasets were derived from the following sources: https://physionet.org/content/mimiciv/2.2/ and https://physionet.org/content/mimic-iv-note/2.2/
FUNDING
Research reported in this publication was supported by the National Library Of Medicine of the National Institutes of Health under Award Number R01LM012973, and by the National Institute Of Mental Health of the National Institutes of Health under Award Number R01MH126977.
AUTHOR NOTE
We do not have control over the training materials of the large language models we used. Our experiments, including those with large LMs, were conducted in HIPAA-protected environments, which block third parties from using our data for reviewing or training purposes.
Footnotes
1 The benchmark dataset is available at https://github.com/Machine-Learning-for-Medical-Language/long-clinical-doc and https://www.codabench.org/competitions/2064/
2 Published: Jan. 6, 2023. https://physionet.org/content/mimiciv/2.2/
4 Mixtral-8x7B-Instruct-v0.1
5 https://github.com/Machine-Learning-for-Medical-Language/cnlp_transformers/tree/dev-v0.7.0
6 Comfort care term list: “hospice”, “comfort measures”, “comfort care”, “palliative care”