Abstract
Background The increasing need for diagnostic echocardiography (echo) tests presents challenges in preserving the quality and promptness of reports. While Large Language Models (LLMs) have proven effective in summarizing clinical texts, their application in echo remains underexplored.
Aims To evaluate open-source LLMs in echo report summarization.
Methods Adult echo studies conducted at the Mayo Clinic from January 1, 2017, to December 31, 2017, were categorized into two groups: development (all Mayo locations except Arizona) and Arizona validation sets. We adapted open-source LLMs (Llama-2, MedAlpaca, Zephyr, and Flan-T5) using In-Context Learning (ICL) and Quantized Low-Rank Adaptation (QLoRA) fine-tuning for echo report summarization from “Findings” to “Impressions.” Against cardiologist-generated Impressions, the models’ performance was assessed both quantitatively with automatic metrics and qualitatively by cardiologists.
Results The development dataset included 97,506 reports from 71,717 unique patients, predominantly male (55.4%), with an average age of 64.3±15.8 years. EchoGPT, a QLoRA fine-tuned Llama-2 model, outperformed other LLMs with win rates ranging from 87% to 99% in various automatic metrics, and produced reports comparable to cardiologists in qualitative review (significantly preferred in conciseness (p< 0.001), with no significant preference in completeness, correctness, and clinical utility). Correlations between automatic and human metrics were fair to modest, with the best being RadGraph F1 scores versus clinical utility (r=0.42) and automatic metrics showed insensitivity (0-5% drop) to changes in measurement numbers.
Conclusions EchoGPT can generate draft reports for human review and approval, helping to streamline the workflow. However, scalable evaluation approaches dedicated to echo reports remains necessary.
Clinical Perspectives 1. What is new?
This study evaluated multiple open-source LLMs and different model adaptation methods in echocardiography report summarization.
The resulting system, EchoGPT, can generate echo reports comparable in quality to cardiologists.
Future metrics for echo report quality should emphasize factual correctness, especially on numerical measurements.
2. What are the clinical implications?
EchoGPT system demonstrated the potential of introducing LLMs into echocardiography practice to generate draft reports for human review and approval.
Introduction
Echocardiography (echo) is the mainstay imaging modality in the current practice of cardiology(1), providing vital, non-invasive assessments of heart anatomy and physiology to guide clinical decisions(2). In the past decade, the rising demand for diagnostic echo tests(3) has posed significant challenges in maintaining the quality and timeliness of diagnostic reports(4–7), underscoring the necessity for automated solutions to enhance both efficiency and report quality(8–10).
With the recent emergence of artificial intelligence (AI), automated echo reporting has been proposed to use deep learning (DL) models to generate diagnostic predictions and measurements to fill a pre-set report template(8,10,11). These frameworks focused on specific image processing tasks(8,11) rather than the report text, and are technically equivalent to generating individual findings. However, these frameworks were not designed to handle the high-level cognitive activity of synthesizing clinically relevant impressions from detailed findings(12). In practice, physicians usually spend a significant amount of time summarizing detailed findings to clinically relevant final impressions(13,14). While this task is crucial, it can be time-consuming and prone to errors(15).
The advance of large language models (LLM) marked an important milestone for the application of AI in healthcare to automate clinical information summarization(13,14,16) and expert-level question-answering(17). A major advantage of LLMs is the flexibility of input and output(18), as well as the capability of handling conversations and interaction with human experts(19). While similar functionality can be achieved through commercially available LLMs (e.g., ChatGPT; OpenAI, San Francisco, CA)(20), only a few healthcare institutions have integrated ChatGPT(21). Furthermore, fine-tuning ChatGPT for specific tasks still requires uploading data to a central server, which also raises privacy concerns(22). In contrast, open-source LLMs are free of charge and can be locally fine-tuned for specific tasks within each healthcare institution’s secure confines(18).
Previous studies predominantly focused on electronic health records(13,16) and chest X-rays (CXR)(13,18) have highlighted the potential of using LLMs to summarize clinical text. In contrast, echo-related studies were mainly on data extraction or classification, rather than report summarization(23–25). Tang et al. used rule-based systems and a fine-tuned BART (Bidirectional and Auto-Regressive Transformer) model, EchoGen (26) for this purpose and demonstrated convincing results. However, EchoGen reports were less favored by human experts more than 50% of the time, perhaps due to the smaller number of parameters than current state-of-the-art LLMs(26). Meanwhile, despite a recent study exploring the use of an LLM for its question-answering capabilities on echo report texts(27), the potential of using billion-parameter LLMs to generate echo reports remains under-explored(28,29).
In this work, we proposed to evaluate LLMs in echocardiography reporting and construct a local, domain-specific LLM (EchoGPT) dedicated to echocardiography report summarization through an instruction fine-tuning approach, which is known to be an effective strategy to adapt LLMs for similar tasks(13,18). We anticipate that the fine-tuning procedure improves LLMs’ performance on the task of echocardiography report summarization. In addition to provide new insights into the use of open-source LLMs within the domain of echocardiography reporting, we will explore the challenges associated with applying current evaluation standards to echo reports generated by these models.
Method
Dataset
Mayo Clinic Reports: All adult (> 18 years old) echocardiography studies performed from 1/1/2017 to 12/31/2017 at Mayo Clinic Enterprise were retrieved. The types of studies include transthoracic echocardiography (TTE), transesophageal echocardiography (TEE), and stress echocardiography (including exercise and pharmacological studies). Text in the “Findings” and the “Final Impression” sections of each report was used for the current study (Figure 1). The study was approved by the Mayo Clinic IRB (protocol#: 22-010944).
MIMIC-III ECHO-NOTE2NUM Dataset (v.1.0.0, referred to as MIMIC-EchoNotes below)(30): This publicly available dataset contains 43,472 valid free-text echocardiography reports from the intensive care unit at the Beth Israel Deaconess Medical Center between 2001 and 2012. A random subset of the dataset was used for external validation.
Data Curation and Preprocessing
The summarization task was defined as creating the Final Impression section based on the Findings section, mirroring the established workflow of clinical echocardiography reporting at Mayo Clinic. The Final Impression text was used as the ground truth report (Figure 1).
Mayo Clinic Reports: Echocardiography reports were excluded according to the following criteria: (1) reports without Findings or Final Impression sections, (2) reports whose Findings or Impression section contained less than 15 words, as these are frequently canceled studies in our practice, and (3) labeled in report metadata as limited report. After this filtering process, the Findings section of each report was further processed as follows: (1) Remove capitalized subheadings (e.g., LEFT VENTRICLE, VALVES, OTHER FINDINGS, etc.), (2) Remove template sentences that make comparisons to prior reports, as no information from previous reports has been provided in Findings, and (3) Remove quality control-related sentences such as “study performed per left ventricular function protocol.”
MIMIC-EchoNotes Reports: Cases in this dataset were excluded based on criteria (1) and (2) above, as the metadata differed from that of the Mayo Reports. We also removed the subheadings and template sentences as described previously. We observed fundamental differences in report structure, including the “General Comments” section, which typically contains comments related to study quality, and the “Conclusions” section, which usually consists of the physician’s interpretation of findings. However, the Impression section often contains only 2-3 sentences summarizing the most pertinent study findings, which was challenging for head-to-head comparison in this study. Given the distinct report structure, the contents under the subheadings “General Comments” and “Conclusions” were integrated into the “Findings” and “Impression” sections, respectively. Common abbreviations in the text were expanded to their full forms.
Data split
Data from Rochester, Florida, and Mayo Clinic Healthcare sites were used as the model development set. Considering variations in practice style among different sites, the data from the Arizona site was designated as the external validation set (referred to as the AZ validation set). Within the development set, 1,000 non-duplicated cases were randomly selected for the test and validation sets, respectively; the rest of the cases were used for fine-tuning (training set). Similarly, from the AZ validation and the MIMIC-EchoNotes datasets, we selected 1,000 non-duplicated random cases from each. For basic dataset statistics, the token length was calculated based on the natural language processing toolkit (NLTK) tokenizer(31), and the lexical variance was defined as the ratio of the number of unique tokens to the number of total tokens in each example(13).
Model Selection
Due to patient privacy policy regulations, proprietary LLMs such as GPT-3.5 and GPT-4 were not considered in this work because versions of those models that were safe for protected health information were not yet available. For our target task, we selected auto-regressive and sequence-to-sequence LLMs with architectures under 7 billion parameters, balancing performance with manageability. Among open-source models, we selected representative auto-regressive models including Llama-2-7b-chat(28), Zephyr-7b(29), and Med-Alpaca(32) models considering their performance and max input context length on general natural language processing (NLP) tasks and radiology report summarization(13). For sequence-to-sequence (seq2seq) models, we used Flan-T5 (base) as the representative model as it is known for accurate text summarization(13,33), given that the EchoGen model(26) was not publicly available.
Model Inference Hyperparameter Search
LLM inference was conducted by using Hugging Face’s (Manhattan, NY) transformer pipeline via the open-source LangChain framework(34). After initial tests, text generation and summarization were used as the task type for auto-regressive models and seq2seq models, respectively. A subset (10%, n=100) of examples were randomly selected from the test set for the hyperparameter search. We specifically tested the following configuration parameters that can significantly affect performance: temperature (0.1, 0.5, and 0.9) and repetition penalty (1.1, 1.2, and 1.3). These two parameters were tested separately, when one parameter was being tested, the other was fixed at the lowest value. The generated contents were evaluated by both automatic metrics and qualitative assessment. We chose the following configuration for model inference: {temperature 0.1, repetition penalty 1.1} after comparing automatic metrics and qualitative assessments; see Supplemental Table 1). We did not complete a dedicated search procedure for the optimal LLM inference configurations, but the configurations used in our study were similar to prior reports, and the generated contents were satisfying on qualitative review. Of note, the configurations were tested in a zero-shot setting, and the best configuration was directly applied to the ICL and QLoRA fine-tuned models(13).
Model Adaptation
Prompt
A prompt template was created with components of the prefix, instruction, and suffix(13) (Table 1). The final prompt was decided after qualitatively evaluating several different variants of each component on a small subset of the data. We also specified that the summarization should be “concise” and use “a minimal amount of text” to avoid LLMs generating lengthy reports(13). Likely due to the difference in the reporting style of the MIMIC-EchoNotes dataset, the final prompt above led to suboptimal responses. Therefore, we adopted a new prompt tailored to match the reporting style by incorporating new instructions below: 1) Write a 10-bullet points clinical summary, and 2) Avoid using numbers other than LVEF.
In-context Learning (ICL)
ICL has been proposed to improve LLM’s performance without changing the base model weights(13,35,36). Also, using relevant in-context examples is shown to have better model performance compared to random examples in ICL. To obtain relevant in-context examples, we adopted the approach to select m (m=1, 2, 4…) nearest neighbors from the training set for each test set case, after embedding both sets by the PubMedBERT model(37).
Instruction tuning with quantized low-rank adaptation (QLoRA)
Due to the size of candidate models, we opted for quantized low-rank adaptation (QLoRA)(38), a type of parameter efficient fine-tuning (PEFT)(39) to optimize our LLMs for echo report summarization tasks. The same prompt template (Table 1) was used, and the Final Impression text from the same report was used as the target output (13,14).
We configured the training process as follows: load model in 4-bit precision, with a LoRA configuration of (alpha = 16, LoRA dropout = 0.1, LoRA r= 64). The batch size and gradient accumulation were adjusted for each model to achieve an effective batch size of 24 that fits on a single NVIDIA RTX A5000 24G GPU setting. A paged-AdamW 32-bit optimizer was used, with an initial learning rate of 1e-3, which decayed to 1e-4 (by a cosine scheduler) after the initial 100 warm-up steps. The above configuration provided the most stable training process after attempting different configurations reported in prior studies(13,38).
Model Performance Evaluation
Automatic NLP evaluation metrics
To evaluate the models’ performance on the information summarization task and compare it to prior works, we utilized four established automatic metrics that have been used in other clinical text summarization studies(40,41): BLEU (Bilingual Evaluation Understudy)(42), METEOR (Metric for Evaluation of Translation with Explicit ORdering)(43), ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation - Longest Common Subsequence), and the BERT (Bidirectional Encoder Representations from Transformers) score(44), which represents the similarity between generated contents and the corresponding reference at words/characters (n-gram), single word (unigram), longest sequence of words, and contextual level, respectively For ROUGE-L, we present the F1 score component(26,45). For factual correctness, the RadGraph-F1 metric (level: all) was reported(46,47) This metric served as the primary evaluation criterion for model performance, considering its significance in ensuring the factual correctness of generated clinical content.
Evaluation of Significance of Measurement Numbers in Automatic Metrics
Considering the importance of measurements in echo studies, we also attempted to evaluate whether the current automatic metrics can detect changes in measurement numbers. For this purpose, we generated synthetic reports by replacing all the measurement numbers with random numbers ranging from 1 to 99 in the reports. We then compared the automatic metric scores with the corresponding reports with the original measurements.
Human expert evaluation metrics
We designed a human expert evaluation process based on previous clinical text summarization studies(13,18,26). The Findings with corresponding ground truth (Final Impression) and LLM-generated summarization of 30 randomly selected cases were presented to four echocardiography-board-certified cardiologists for blinded quality review. We noted that physician-summarized Impressions may contain free-text information beyond the Findings (e.g., documenting events during the study or communication with the ordering provider), and the reviewers were instructed to rate only based on information within the Findings section. Each metric was rated for the preference (5 levels) between the two summarizations (Supplemental Figure 1)(13). The “4C metrics” evaluated by echocardiography experts are completeness, conciseness, correctness, and clinical utility(14), as described in Table 2.
Statistical analysis
Automatic metric performance between each pair of models was compared using a two-tailed paired Student’s t-test or Wilcoxon signed-rank test for normally and non-normally distributed data, respectively. Models were also compared based on win rates, which are defined as the percentage of head-to-head victories in performance between two models for each selected metric(13). For the performance bias analysis, data were grouped based on sex (male versus female) and race (white, black, other). In the case of sex, we employed two-tailed Student’s t-tests or Mann-Whitney U tests for normally and non-normally distributed data, respectively. For race, we utilized one-way ANOVA. In human expert qualitative analysis, the 4C metrics were compared by a one-sample Wilcoxon signed-rank test(13), and the agreement of ratings between experts was assessed by Fleiss’ kappa coefficient(48), and interpreted as recommended by Landis et al(49). We conducted Pearson correlation analyses to explore the relationships between each human expert evaluation metric, assessing their independence. Additionally, we conducted similar analyses to examine the correlation between human and automatic metrics. Statistical analyses were performed using Python 3.8 and SciPy 1.8.0. All the comparisons consider a p-value < 0.05 as significant.
Results
Patient Cohort and Dataset
Our development set contains 97,506 reports from 71,717 unique patients, with a mean age of 64.3±15.8 years, 54,005 (55.4%) were male, 89,466 (91.8%) were white. Randomly selected from Mayo Arizona studies (19,557 reports/15,853 unique patients), the AZ validation set contains 1,000 reports from 1,000 unique patients with a mean age of 63.9 ±16.0 years, 584 (58.4%) were male and 885 (88.5%) were white. Detailed demographic information was not available for the MIMIC-EchoNotes dataset. Other detailed patient characteristics and statistics of text data are summarized in Table 3. Transthoracic echocardiography was the predominant study type in development, AZ validation, and MIMIC-Echo datasets (81.9%,77.6%, and 86.4% respectively).
Zero-shot, ICL, and QLoRA fine-tuned performance
Table 4 is a summary of the performance of zero-shot and fine-tuned LLMs, including MedAlpaca, Llama-2, and Zephyr. QLoRA fine-tuning significantly improved LLMs’ performance from baseline. Note that T5 and MedAlpaca were not fine-tuned so only zero-shot results were provided for reference. Among the candidate models, Llama-2 generally had the best zero-shot performance, which was consistent in ICL. While Flan-T5 had a similar or superior performance to Llama-2 across most metrics, it was particularly worse on the RadGraph F1 score (Table 4; Figure 3). On qualitative review, we noted that T5 provided concise summaries, however important clinical information was missed in this process (Supplemental Table 2).
In ICL, LLMs that allow longer context length (Llama-2 and Zephyr) had the best performance across all metrics when one example was provided (ICL-1). The performance gradually trended down with more examples (ICL-2 and ICL-4). In contrast, the performance of LLMs with shorter max context length started to trend down with one example (Figure 3). Importantly, we observed that Llama-2 can integrate information from multiple ICL examples, such as “Calculated 2-d linear left ventricular ejection fraction 57, 61, 75 (3 reports)”, which occurred in 34 (3.4%) and 69 (6.9%) cases for ICL-2 and ICL-4, respectively (Supplemental Table 3).
Based on the zero-shot and ICL performance of the candidate models, Llama-2 and Zephyr were selected for instruction fine-tuning. Compared to zero-shot, QLoRA significantly improved the performance of selected LLMs across all metrics (Table 4). For the head-to-head comparison in model win rates, fine-tuned Llama-2 was superior to all other models, including Zephyr (base and fine-tuned), MedAlpaca (base), and Flan-T5 (base) across all 5 automatic metrics (Figure 4). Llama-2 maintained similar performance in the AZ validation set (n=1,000) and was consistently superior to fine-tuned Zephyr (Supplemental Table 4). Regarding potential biases, we did not observe significant biases regarding sex and race across the automatic metrics, except for a slightly better RadGraph F1 performance in female patients in the AZ validation set (male vs. female: 0.38 ± 0.14 vs. 0.40 ± 0.15, p=0.04) (Supplemental Figure 2). Because Llama-2 had the best performance on zero-shot, ICL, and QLoRA fine-tuning approaches, fine-tuned Llama-2 was selected as EchoGPT and used for the subsequent expert qualitative review.
Significance of Measurement Numbers in Automatic Metrics
After replacing the measurement numbers with random numbers, we observed relatively minor decreases in BLEU, METEOR, ROUGE-L, BERT, and RadGraph F1 scores, while statistically significant (p<0.0001). One can see that in the provided examples, the report with random numbers doesn’t make clinical sense when compared to the original content (Table 5).
Human Expert Evaluation
We observed slight agreement for correctness, fair agreement for conciseness and clinical utility, and moderate agreement for completeness (Supplemental Table 5). Among the 4C metrics, we observed that EchoGPT significantly outperformed human experts in conciseness (p<0.001). There was no significant preference among the other three categories (Figure 5A). The 4C metrics were not completely independent. There was a high correlation between clinical utility and completeness (Pearson’s r= 0.78), and modest to moderate correlations between other metrics (Figure 5C). We also observed that across all automatic metrics, RadGraph F1 had modest to moderate correlations with all 4 human evaluation metrics (Figure 5C).
External Validation on the MIMIC-EchoNotes dataset
In the MIMIC-EchoNotes dataset (n=1,000), we observed a performance drop in fine-tuned models, while EchoGPT was still superior to fine-tuned Zephyr (Supplemental Table 4). Regarding the expert review, we observed moderate agreement for completeness and conciseness, but slight agreement for correctness and clinical utility (Table 6). The original reports (combined Conclusions and Impression sections) were preferred over EchoGPT in completeness, correctness, and clinical utility (p< 0.001); while EchoGPT was superior in conciseness (p<0.001) (Supplemental Figure 6A). RadGraph F1 still had the strongest correlations with the 4C metrics, although other metrics also had stronger correlations (Supplemental Figure 3). A representative example demonstrated the difference in reporting structure and style, along with reviewers’ feedback on the two datasets (Table 6).
Discussion
This study evaluated the feasibility of multiple open-source LLMs in echocardiography report summarization through different model adaptation methods. Trained on one of the largest echocardiography report datasets in the world, we demonstrated that QLoRA fine-tuning can significantly improve LLMs’ performance for the desired summarization task, with at least comparable qualities to human experts. ICL was overall inferior to QLoRA fine-tuning and faced substantial limitations, such as integrating information from multiple examples. Additionally, we demonstrated that current automatic metrics are not sensitive to the change in measurement numbers in echo reports. The current study offers insights into the development and evaluation of a specialized local LLM tailored for echo report summarization and presenting significant workflow advantages.
Model Adaption Approaches: ICL vs. Fine-tuning
Our results suggested that both ICL and QLoRA fine-tuning improved LLMs’ performance over zero-shot, and fine-tuning performance was consistently above ICL across all automatic metrics (Figure 3). Additionally, we observed substantial limitations of ICL for echo reporting, particularly due to the longer context lengths typical of echo reports. The behavior of integrating results from multiple examples also presents challenges, as discussed below.
Compared to CXR reports, echocardiography reports come with a relatively longer context, which directly affects the available choices of LLMs. In contrast to a prior study that used 32 or more examples(13), we were only able to test up to 4 examples for ICL, therefore not able to assess the models’ behavior with more examples. However, across all metrics, LLMs’ had gradually down-trending performance when more examples were provided (Figure 3), which is consistent with prior studies(13,35). It is also important to consider that the computation time and resources required in ICL can increase with the number of examples used(35). Additionally, we note that LLMs can integrate information from the example ICL cases, which compromises the report quality (Supplemental Table 3). While this behavior was not reported in other studies(13,35), we believe it could be a relatively common condition, that is easier to identify with numerical values (in echo) compared to narrative statements (in CXR). Therefore, even in scenarios where ICL can outperform fine-tuning(13), fine-tuning may be preferable.
The EchoGen study previously demonstrated that Bidirectional Auto-Regressive Transformers (BART) was superior to other rule-based approaches for summarizing echocardiogram reports, with BART achieving ROUGE-based scores between 0.65 and 0.73, however, human summaries were preferred by the majority of the time over those generated by the BART model(26). Although EchoGPT didn’t match the scores in ROUGE-L, it compared favorably to human experts in qualitative assessments. Additionally, in our study, we observed that T5 (as the representative seq2seq model since EchoGen is not publicly available) generated summaries that were overly brief, so important clinical information was missed (Supplemental Table 2). We assume that similar behavior could have occurred with BART, leading to its unfavorable ratings by physicians. However, direct comparisons were not feasible as neither the EchoGen model nor qualitative examples generated by it were available for review.
Evaluation of Echo Report Summarization
Automatic evaluation of LLM in clinical text summarization tasks is an emerging area, and there is no gold standard metric that can evaluate all aspects of a report(13,26). Our study reinforces this conclusion. We noted that MedAlpaca and Zephyr can generate medical-professionally-sounding content that frequently includes hallucinated information. These differences were mainly reflected by the factual correctness metric RadGraph F1 (Table 4).
The practice style at each institution could greatly affect the quantitative performance of a model(18). In the AZ validation set, we observed a 5-10% drop in performance of both fine-tuned Llama-2 and Zephyr (Supplemental Table 4). This is likely secondary to the differences in practice style: the AZ validation set, despite having a similar Finding section length, contained an average of 9.5 additional tokens in the Final Impression section (Table 3). A more significant drop in performance was observed in reports from the MIMIC-EchoNotes datasets (RadGraph F1 from 47.7 to 25.1; Supplemental Table 4), which was anticipated and within a reasonable range(50). According to our observations and the input of expert reviewers, the key factors leading to the performance drop were:
The distinct report structure (Findings/Impressions versus Findings/Conclusions/Impressions).
The use of reporting languages (templated statements at Mayo versus the free-text style of the MIMIC-EchoNotes dataset).
Specifically, the combination of the Conclusion and Impression sections makes the section almost as long as the Findings (184.3 ± 50.0 vs. 163.2 ± 40.4; Table 3), and even longer in some cases. Additionally, the physician’s interpretation often contains information beyond the Findings, or variations of the original sentences, that the model was not able to summarize. Moreover, the sentence templates at Mayo include measurement numbers in relevant statements (e.g., ejection fraction, right ventricular systolic pressure, left atrial size index), while MIMIC-EchoNotes did not (Table 6). These factors contributed to the overall less favored completeness, correctness, and clinical utility of EchoGPT summaries on the MIMIC-EchoNotes dataset (Supplemental Figure 3). It is important to note the low agreement on correctness and clinical utility metrics, which also implies the challenge on comparing reports with distinct styles (Supplemental Table 5). The difference across institutions, as listed above, could limit the direct generalization of a fine-tuned LLM for report summarization. However, while not comprehensively tested in the current study, we noted that adjusting the prompt to fit the reporting style could lead to better summaries without further fine-tuning(50,51).
Regarding the correlations between automatic metrics and human expert preference, our results were similar to the prior studies, showing that most of the metrics were not strongly correlated(13). Notably, the highest correlation was observed between the RadGraph F1 scores and the 4C metrics, particularly in terms of clinical utility (r=0.42) (Figure 5C). This suggests that the quality of echo reports judged by cardiologists may not be well captured in automated metrics that do not capture notions of factual correctness. While stronger correlations were observed in the MIMIC-EchoNotes examples (Supplemental Figure 3), we believe it was reflecting the strong preference secondary to the distinct reporting style.
As a specific subtype of clinical text, echocardiography reports contain unique terminology, including precise measurements. Clinically, 25% and 55% LV ejection fraction values indicate a significant difference, however, our study demonstrates that this distinction is difficult to capture with current automatic metrics (Table 5). While this aspect of reporting can be easily captured in qualitative analyses, such analyses are expensive to conduct at scale because of the limited availability of in-domain experts. A dedicated metric for echocardiography diagnostic quality evaluation, with emphasis on measurement accuracy, is still needed to address this knowledge gap.
Application of EchoGPT
Our study shows the feasibility of introducing LLMs into echocardiography practice. Through the QLoRA fine-tuning process(38), the EchoGPT model was able to learn clinically relevant knowledge to summarize echo report findings at a quality level comparable to echocardiography-trained cardiologists (Figure 6A).
We envision that EchoGPT could be used as a reporting interface or a co-pilot that could generate echo reports with various inputs(52). EchoGPT still inherits the limitations of LLMs, including hallucination(13,18,53). Although the fine-tuning process can potentially reduce hallucinations, additional efforts such as optimization for factual correctness(46) or paired with a retrieval augmented generation system(54) are still required to minimize hallucinations before clinical implementation.
Conclusion
Our study successfully built EchoGPT through QLoRA fine-tuning of open-source LLMs and demonstrated that the model is capable of generating echocardiography reports on par with cardiologists, marking an advancement in integrating LLMs into current echo practice. Our analysis also highlights the challenges of the current LLM evaluation process in echo reporting, particularly dedicated automatic metrics and scalable qualitative evaluation approaches, which necessitate further studies.
Limitations
This study is limited by its retrospective nature and a predominantly white population served by the healthcare system. However, we were able to demonstrate that the algorithm is not biased by sex and race. Our echocardiography reports are based on standardized statements, with an option to add free text. While the lexical variance was high, the corpus could differ from reports composed entirely of free text contents. Due to patient privacy regulations, this work did not assess the performance of GPT-3.5 and GPT-4.
However, we compared the performance of state-of-the-art open-source LLMs, which provided important insights for model selection when data privacy is a critical consideration. Instead of full fine-tuning, QLoRA was used as the fine-tuning approach, however, it has been demonstrated as an effective approach as full fine-tuning is often not feasible for LLMs. Last but not the least, although QLoRA fine-tuning demonstrated improvements in echo report summarization tasks, our current approach does not include optimization for factual correctness and human expert preference.
Data Availability
The data that support the findings of this study are not openly available due to reasons of sensitivity and patient privacy. Data are located in controlled access data storage at the Mayo Clinic. The MIMIC-EchoNotes (ECHO-NOTE2NUM) data is publicly available at https://doi.org/10.13026/xhrz-ht59
Code Availability
We released a checkpoint of the fine-tuned Llama-2 model, along with the QLoRA fine-tuning, inference, and statistical analysis code. The code and checkpoint are available on GitHub: https://github.com/chiehjuchao/EchoGPT.git
Footnotes
↵† Chieh-Ju Chao and Ehsan Adeli are co-corresponding authors for this manuscript.
Disclosure: The authors have no conflict of interest to disclose.
Revised manuscript content to discuss the challenges in evaluating echo reports, especially the limitations on automatic metrics, and the limited scalability of human expert review.
Abbreviations
- 4C metrics
- qualitative review metrics including completeness, correctness, conciseness, and clinical utility.
- AI
- Artificial Intelligence
- DL
- Deep Learning
- Echo
- Echocardiography
- FT
- Fine-Tuning
- GPT
- Generative Pre-trained Transformer
- ICL
- In-Context Learning
- NLP
- Natural Language Processing
- Seq2seq
- Sequence-to-sequence
- TTE
- Transthoracic Echocardiography
- TEE
- Transesophageal Echocardiography
- LLM
- Large Language Model
- NLP
- Natural Language Processing
- QLoRA
- Quantized Low-Rank Adaption