Evaluating Large Language Models for Drafting Emergency Department Discharge Summaries

Christopher Y.K. Williams; Jaskaran Bains; Tianyu Tang; Kishan Patel; Alexa N. Lucas; Fiona Chen; Brenda Y. Miao; Atul J. Butte; Aaron E. Kornblith

doi:10.1101/2024.04.03.24305088

Abstract

Importance Large language models (LLMs) possess a range of capabilities which may be applied to the clinical domain, including text summarization. As ambient artificial intelligence scribes and other LLM-based tools begin to be deployed within healthcare settings, rigorous evaluations of the accuracy of these technologies are urgently needed.

Objective To investigate the performance of GPT-4 and GPT-3.5-turbo in generating Emergency Department (ED) discharge summaries and evaluate the prevalence and type of errors across each section of the discharge summary.

Design Cross-sectional study.

Setting University of California, San Francisco ED.

Participants We identified all adult ED visits from 2012 to 2023 with an ED clinician note. We randomly selected a sample of 100 ED visits for GPT-summarization.

Exposure We investigate the potential of two state-of-the-art LLMs, GPT-4 and GPT-3.5-turbo, to summarize the full ED clinician note into a discharge summary.

Main Outcomes and Measures GPT-3.5-turbo and GPT-4-generated discharge summaries were evaluated by two independent Emergency Medicine physician reviewers across three evaluation criteria: 1) Inaccuracy of GPT-summarized information; 2) Hallucination of information; 3) Omission of relevant clinical information. On identifying each error, reviewers were additionally asked to provide a brief explanation for their reasoning, which was manually classified into subgroups of errors.

Results From 202,059 eligible ED visits, we randomly sampled 100 for GPT-generated summarization and then expert-driven evaluation. In total, 33% of summaries generated by GPT-4 and 10% of those generated by GPT-3.5-turbo were entirely error-free across all evaluated domains. Summaries generated by GPT-4 were mostly accurate, with inaccuracies found in only 10% of cases, however, 42% of the summaries exhibited hallucinations and 47% omitted clinically relevant information. Inaccuracies and hallucinations were most commonly found in the Plan sections of GPT-generated summaries, while clinical omissions were concentrated in text describing patients’ Physical Examination findings or History of Presenting Complaint.

Conclusions and Relevance In this cross-sectional study of 100 ED encounters, we found that LLMs could generate accurate discharge summaries, but were liable to hallucination and omission of clinically relevant information. A comprehensive understanding of the location and type of errors found in GPT-generated clinical text is important to facilitate clinician review of such content and prevent patient harm.

Introduction

Clinical documentation is an essential part of high-quality patient care.^1,2 However, in recent years there has been an increase in the complexity of clinical documentation as a result of the transition from paper-based to electronic health records (EHRs).³ This has had downstream effects on the amount of time physicians spend on the EHR, with recent studies suggesting that every hour of direct clinical time spent with patients is associated with 2 extra hours of EHR documentation.^4,5 This concerning increase in EHR burden is a significant contributing factor to the rising prevalence of physician burnout, which may lead to a reduction in the overall quality of patient care.^6–9

A foundational element of clinical documentation is the patient discharge or encounter summary, created following both Emergency Department (ED) visits and inpatient hospital admissions.

Discharge summaries serve as a critical method of patient information transfer and provide instructions for the ongoing management of patients’ illness.^10–12 However, the process of writing discharge summaries is time-consuming and, consequently, these summaries are often not completed in a timely manner or finished at all.^12,13 This is problematic given that the timeliness and availability of discharge summaries is associated with patients’ readmission rates, with the absence of a discharge summary associated with a 79% increased rate of 7-day readmission and 37% increased rate of readmission within 28 days.¹³ The AHRQ identifies the lack of adequate post-discharge summarization and communication as primary reasons for ED discharge failures.¹⁴

The recent introduction of large language models (LLMs) such as ChatGPT has led to renewed focus on the use of natural language processing (NLP) to improve both quality and efficiency in healthcare.¹⁵ LLMs possess a range of capabilities which may be applied to the clinical domain, one of which is text summarization. Previous reports have evaluated the potential use of LLMs in summarizing scientific literature, radiology reports, patient problem lists and doctor-patient conversations, with varying success.^16,17 However, there has been limited research on the ability of LLMs to summarize information from a patient’s hospital encounter into a discharge summary.¹⁸ As ambient AI scribes and other LLM-based tools begin to be deployed within healthcare settings,¹⁹ rigorous evaluations of the accuracy of these technologies are urgently needed.

In this study, we investigate the performance of two state-of-the-art LLMs, GPT-4 and GPT-3.5-turbo, in generating ED discharge summaries and evaluate the prevalence and type of errors across each section of the discharge summary.

Methods

The UCSF Information Commons contains deidentified structured clinical data as well as deidentified clinical text notes, with externally certified deidentification as previously described.²⁰ The UCSF Institutional Review Board determined that this use of deidentified data within the UCSF Information Commons is not human participants research and, therefore, was exempt from further approval and informed consent. This study was completed according to a prospectively developed protocol (Supplementary File 1).

We identified all adult patients discharged from the University of California, San Francisco (UCSF) ED from 2012 to 2023 with an ED clinician note present within Information Commons (Figure 1). If more than one Emergency Medicine (EM) clinician note was available for a particular ED visit, the earliest note was selected as subsequent notes were often attending attestation notes. In the case of multiple notes with the same chart time, the longest note (by character count) was selected. Clinical notes were minimally preprocessed – only line breaks and extra spaces were removed. Software packages incorporating a series of regular expressions were created and used to examine the structure of notes, confirming the presence/absence of the following note headers: ‘Chief Complaint’ (274,983/278,629 notes); ‘Review of Systems’ (263,219/278,629 notes); ‘Physical Exam’ (276,834/278,629 notes); ‘ED Course’ (245,900/278,629 notes); and ‘Initial Assessment’ (139,838/278,629 notes). Notes which did not contain appropriate history, physical examination and assessment/plan sections were excluded. Each note was tokenized using the OpenAI Tiktoken tokenizer.²¹ Notes containing ≥3500 tokens were excluded to allow sufficient tokens for the GPT-3.5-turbo API response to be completed within the model’s 4096 token context window, which was the shortest context window of the models used. Patients who were admitted to hospital from the ED were identified from the structured electronic health record and excluded so that only patients discharged from the ED were included in our cohort.

Figure 1.

A) Flowchart of included Emergency Department (ED) visits. B) Study workflow.

Next, we randomly selected two n = 100 samples to be used as the development and test sets. All prompt engineering and resident annotator training was conducted on the development set, while evaluation was conducted on the held-out test set. Using the secure, HIPAA-compliant, UCSF Versa Application Programming Interface (API) on Microsoft Azure, we prompted both GPT-3.5-turbo (model = ‘gpt-3.5-turbo-0613’, role = ‘user’, temperature = 0; all other settings at default values) and GPT-4 (model = ‘gpt-4-0613’, role = ‘user’, temperature = 0; all other settings at default values) to summarize the full ED clinician note into a discharge summary. The following prompt was used, followed by the corresponding note for each patient, denoted by triple quotation marks: “You are an Emergency Department physician. Below is the History and Physical Examination note for a patient presenting to the Emergency Department who was subsequently discharged. Write a discharge summary for the patient based on this note. Do not include any additional information not present in the note. \n\n “““ Note text ””” ”

The GPT-3.5-turbo and GPT-4 generated discharge summaries were evaluated by two independent EM resident reviewers (from AL, FC, KB, KP, TT) in accordance with the protocol. Initial rates of inter-reviewer agreement were over 90% (Table S1). Disagreements were resolved by consensus and, if required, by an attending EM physician reviewer (AK). We selected three evaluation criteria for review: 1) Inaccuracy of GPT-summarized information; 2) Hallucination of information; 3) Omission of relevant clinical information. An inaccuracy refers to information that is not factual and/or is contradicted by the original ED clinician note. Hallucination refers to the fabrication of information in the discharge summary that is not present in the original ED clinician note. Omissions refer to information from the ED clinician note that the reviewer deemed relevant for inclusion in the discharge summary but was not included. The following aspects of a patient’s ED visit were evaluated for the presence of inaccuracies, hallucinations, and omissions: Presenting complaint; History of presenting complaint; Past medical history; Allergies/contraindications; Review of systems; Positive examination findings; Laboratory test results; Radiological investigations; Plan; Other notable events during ED stay (if any). On identifying each error, reviewers were additionally asked to provide a brief explanation for their reasoning, which was subsequently manually classified into subgroups of errors within each of the above three evaluation criteria.

Statistical analysis

For both the GPT-3.5-turbo and GPT-4 discharge summaries, counts of each error (Inaccuracy, Hallucination or Omission) across each section (Presenting complaint; History of presenting complaint; Past medical history; Allergies/contraindications; Review of systems; Positive examination findings; Laboratory test results; Radiological investigations; Plan; Other notable events during ED stay [if any]) relating to the ED visit were collated and reported in a descriptive analysis. The median word count with interquartile range (IQR) for the original EM clinician notes, alongside both the GPT-4-generated and GPT-3.5-turbo generated summaries was calculated. To evaluate discharge summary readability, the average Flesch-Kincaid Reading Ease Score (FRES) and Flesch Kincaid Grade Level (FKGL) was calculated for each GPT model. Median word counts and FRES/FKGL values were compared using the Mann-Whitney U test against the null hypothesis that there is no significant difference between GPT-4-generated and GPT-3.5-turbo-generated discharge summaries. Categorical variables were compared using the Chi-squared test. P < 0.05 was significant. Analyses were performed in Python and R.

Results

From 202,059 eligible ED visits with an EM clinician note, we randomly sampled 100 for GPT-generated summarization and then expert-driven evaluation (Figure 1; Table 1). The average length of the original EM clinician notes summarized by the GPT models was 802.5 words (IQR 643.5-1053.25) (Figure S1). GPT-4-generated discharge summaries (median word count = 235 words, IQR 205-264) were shorter than those generated by GPT-3.5-turbo (median word count = 369.5 words, IQR 307.75-445) (Figure S2; Mann-Whitney U, p < 0.001). The average Flesch-Kincaid Grade Level for GPT-4-generated summaries was lower (FKGL = 10.0, IQR 9.5-11.1) than for GPT-3.5-turbo-generated summaries (FKGL = 10.7, IQR 9.7-11.7) (Mann-Whitney U, p = 0.02), indicating greater readability of GPT-4-generated discharge summaries. This was also reflected in the Flesch Reading Ease Scores, with GPT-4 summaries (FRES = 48.6, IQR 41.0-52.0) having a higher score on average than GPT-3.5-turbo summaries (FRES = 46.7, IQR 39.7-49.5), though this did not meet statistical significance (Mann-Whitney U, p = 0.10).

View this table:

Table 1.

Patient demographics in n = 100 sample of Emergency Department encounters randomly selected for GPT-3.5-turbo and GPT-4 discharge summary generation. ED = Emergency department; ESI = Emergency Severity Index; IQR = interquartile range.

Overall, GPT-4-generated discharge summaries contained fewer errors than GPT-3.5-turbo-generated summaries across all three domains (Figure 2). In total, 33% of summaries generated by GPT-4 and 10% of those generated by GPT-3.5-turbo were entirely error-free across all evaluated domains. Summaries generated by GPT-4 were mostly accurate, with inaccuracies found in only 10% of cases. However, 42% of the summaries exhibited hallucinations and 47% omitted clinically relevant information. This compares to 36% of GPT-3.5-turbo summaries containing an inaccuracy, with 64% and 50% of the predecessor model’s summaries containing hallucinations and clinical omissions, respectively. Initial inter-reviewer agreement rates were 95.8%, 93.6% and 91.9% for inaccuracy, hallucination and omission errors, respectively, prior to consensus agreement (Table S1).

Figure 2.

Proportion of discharge summaries with 1 or more error identified by clinical reviewers in each of the three domains evaluated: 1) Inaccuracy, 2) Hallucination and 3) Clinical Omission.

Error rate by domain and discharge summary section is shown in Figure 3. The few inaccuracy errors identified in GPT-4-generated discharge summaries predominantly occurred in the Plan section of the summary (n = 4). When comparing GPT-3.5-turbo and GPT-4 models, there was a notable improvement in the accuracy of reporting patients’ Past Medical History, in which 10% of GPT-3.5-turbo summaries contained an error compared to only 1% of GPT-4 summaries. Most hallucination errors, across both GPT-3.5-turbo and GPT-4 models, occurred in either the Plan or Other sections of the summary, with GPT-4 recording 36% fewer hallucinations in these sections than GPT-3.5-turbo. Omissions were most frequently present in the Physical Examination section for both GPT-4 (20%) and GPT-3.5-turbo (18%) summaries, followed by the History of Presenting Complaint section (10% of GPT-4 summaries vs 17% of GPT-3.5-turbo summaries).

Figure 3.

Breakdown of errors for each domain (Accuracy, Hallucination and Clinical Omission) by section of discharge summary. PC = Presenting Complaint; HPC = History of Presenting Complaint; PMH = Past Medical History; ROS = Review of Systems; PE = Physical Examination.

Finally, we manually categorized free-text reviewer comments detailing the subtype of each error (Table 2 & Figure S3). Among the GPT-4 summaries, inaccuracy errors included inaccurate follow-up details (e.g., reviewer comment: “[the original note states that the] patient had follow-up with GI for colonoscopy.. already scheduled [whereas the GPT summary states the patient was advised to obtain this]”), inaccurately reporting the interim plan as the follow up plan (e.g., reviewer comment: “the final plan is listed [by GPT-4] as ‘follow-up labs/psych recommendations’, but this was the sign-out plan – the final plan was actually: ‘safe for discharge’”) and inaccurate reporting of physical examination findings (e.g., reviewer comment: “[the GPT summary] states HINTS exam was positive, but is in fact negative”). The most commonly identified hallucination error subtype was hallucination of information in the note that had been redacted during the de-identification process (n = 15; e.g., reviewer comment: “redacted portion [of original note] filled in [in GPT summary] as ‘headache’”). The next most common hallucinations related to patients’ follow up, with GPT-4 either providing details of outpatient specialty follow-up that had not been arranged (n = 11; e.g., reviewer comment: “[the GPT summary] hallucinated follow-up with Rheumatology and Neurology, though [there is] no mention of this in [the original] note”), hallucinating ED return precautions (n = 7), and hallucinating follow-up instructions (n = 3; e.g., reviewer comment: “no instructions to continue current meds or avoid morphine were provided in the original note”). Meanwhile, examples of the most common omission errors include GPT-4 omitting certain positive physical examination findings (n = 13; e.g., “[GPT summary] omitted left sided laceration” or “[GPT summary] omitted murmur”), imaging results (n = 8), details of patients’ management in ED (n = 7; mostly relating to specialty consults that had taken place) and symptom(s) reported (n = 7; e.g “[GPT summary] does not mention Tylenol overdose concern”). The manually categorized reviewer comments for the GPT-3.5-turbo-generated summaries are shown in Supplementary File 2 (Table S2 & Figure S4).

View this table:

Table 2.

Manual categorization of expert reviewer comments providing further details for each error subtype among GPT-4-generated discharge summaries compared to the ground-truth, original Emergency Medicine provider note. *Comments reported with minor modifications to syntax for improved readability.

Discussion

In this cross-sectional study of 100 ED encounters, we found that LLMs could generate accurate discharge summaries, but were liable to hallucination and omission of clinically relevant information. Overall, GPT-4-generated summaries contained fewer errors than GPT-3.5-turbo summaries across all three domains, with 10%, 42% and 47% of summaries containing inaccuracies, hallucinations and omissions, respectively. GPT-4-generated summaries were also shorter and more readable than those generated by GPT-3.5-turbo, with an average Flesch-Kincaid Grade Level of 10.

The improved performance of GPT-4 compared to GPT-3.5-turbo aligns with prior literature which has shown superior GPT-4 performance across both medical and non-medical tasks.^22–24 Moreover, the fact that GPT-4 summaries contained a lower number of omissions than GPT-3.5-turbo, whilst summarizing the same information in fewer words, suggests increased summary concision that may be welcomed by primary care physicians and others on the receiving end of the transition of care.²⁵

Although only 33% of summaries generated by GPT-4 were entirely error-free across all domains, a more detailed review of the subtypes of error demonstrated that a majority of hallucinations either related to information redacted in the original note as part of our institution’s de-identification process or resulted from GPT-4 hallucinating follow-up instructions and/or return precautions. In the latter instance, such follow-up instructions were often appropriate for the patient’s care (as if they were derived from a standard set of precautions associated with the patient’s final diagnosis), but because they had not been explicitly mentioned in the original EM provider’s note, they were classified as hallucinations in accordance with our pre-specified protocol. After excluding these specific types of errors post-hoc, the proportion of GPT-4 generated summaries considered error-free across all domains increased by 14%, reaching 47% error-free across the three domains.

Meanwhile, there were notable differences in initial inter-reviewer agreement between error type prior to consensus agreement, with 91.9% agreement on the presence of clinical omissions compared to 95.8% and 93.6% agreement for inaccuracies and hallucinations, respectively. This reflects the subjective nature of classifying clinical omissions, where the inclusion of different clinical details may depend on the preference of the discharging clinician. It is possible that, with either dedicated prompt engineering or the addition of few-shot examples during future prompting, clinician-specific preferences of what information ought to be included in each discharge summary may be incorporated to address this.

There is a paucity of existing literature examining the performance of LLMs when generating discharge summaries, either in the Emergency Department or inpatient hospital setting. This is concerning given reports of the recent deployment of ambient artificial intelligence (AI) scribes at a large healthcare organisation.¹⁹ In that study, 35 example patient transcripts and encounter summaries generated by the AI scribe were rated using a modified version of the Physician Documentation Quality Instrument, with an average score of 48/50 achieved.^19,26 However, a quantitative analysis of the number and type of errors present was not reported. Meanwhile, a separate study of neurology inpatient encounters showed that Bidirectional Encoder Representations from Transformers (BERT) and Bidirectional and Auto-Regressive Transformers (BART) models could be used to generate summaries which met the standard of care in 62% of cases, but acknowledged that future work should count the number and type of hallucinations in automated summaries.¹⁸

Since clinicians will ultimately be responsible for auditing and modifying clinical documentation produced by LLMs, gaining a thorough understanding of potential error sources in this documentation is critically important. Without a thorough understanding of where errors may occur, there’s a risk that errors made by LLMs could be overlooked, potentially harming patient care.²⁷ Additionally, the increased workload on clinicians to meticulously audit the discharge summary could lead to worsening burnout, potentially negating the benefits of using this technology. Our findings suggest that the location of errors within a GPT-generated discharge summary may vary based on the type of error: inaccuracies and hallucinations are most commonly found within the Plan sections of GPT-generated discharge summaries, while the Physical Examination and History of Presenting Complaint sections should be checked closely for clinical omissions. Future studies should evaluate the application of LLMs themselves to identify instances of inaccuracy, hallucination and clinical omission errors within LLM-generated clinical documents when compared to the original source documents, allowing clinicians to audit and amend areas that are subject to discordance.

This study has several limitations. First, in this study only the initial EM clinician note was summarized. While this note typically contains the patient’s clinical history, physical examination findings, results of investigations performed and overall plan, other pertinent information that is found in notes from other providers, such as physical or occupational therapist recommendations and specialty consult advice, may not have been included in the discharge summary. Future work should evaluate the performance of LLMs in the more complex task of multi-document summarization before deployment to EDs can be considered. Second, due to the time and labor-intensive process of manual expert review, we included 100 randomly selected ED encounters in our sample, which may limit generalizability across different types of patient demographics and presenting symptoms. Notably, our randomly selected sample predominantly consisted of White, Asian or Black/African American patients, with limited representation of other minority groups. As LLM performance continues to be evaluated across different medical tasks, racial and gender bias assessments of these tools must be performed prior to their integration into clinical care.²⁸ Third, GPT model performance may improve with further iterations of prompt engineering and/or in-context learning. For instance, in comparing GPT-3.5-turbo to GPT-4, there was an enhancement in summarization capabilities across all domains evaluated, including over ED discharge summary length. Fourth, we did not directly compare the GPT-generated discharge summaries with the actual clinician-generated discharge summaries for these encounters. It is possible that important information might have been missing, or inaccurately reported, in the clinician-generated discharge summaries as well.

Conclusion

In this cross-sectional study of 100 ED encounters, we found that LLMs could generate accurate discharge summaries, but were liable to hallucination and omission of clinically relevant information. Our results suggest that the location of errors within a GPT-generated discharge summary may vary based on the type of error. A comprehensive understanding of where errors are most likely to occur in GPT-generated clinical text is critically important to facilitate clinician review and revision of such content and prevent patient harm.

Conflicts of Interest

AEK is a co-founder and consultant to CaptureDx. AJB is a co-founder and consultant to Personalis and NuMedii; consultant to Mango Tree Corporation, and in the recent past, Samsung, 10x Genomics, Helix, Pathway Genomics, and Verinata (Illumina); has served on paid advisory panels or boards for Geisinger Health, Regenstrief Institute, Gerson Lehman Group, AlphaSights, Covance, Novartis, Genentech, and Merck, and Roche; is a shareholder in Personalis and NuMedii; is a minor shareholder in Apple, Meta (Facebook), Alphabet (Google), Microsoft, Amazon, Snap, 10x Genomics, Illumina, Regeneron, Sanofi, Pfizer, Royalty Pharma, Moderna, Sutro, Doximity, BioNtech, Invitae, Pacific Biosciences, Editas Medicine, Nuna Health, Assay Depot, and Vet24seven, and several other non-health related companies and mutual funds; and has received honoraria and travel reimbursement for invited talks from Johnson and Johnson, Roche, Genentech, Pfizer, Merck, Lilly, Takeda, Varian, Mars, Siemens, Optum, Abbott, Celgene, AstraZeneca, AbbVie, Westat, and many academic institutions, medical or disease specific foundations and associations, and health systems. AJB receives royalty payments through Stanford University, for several patents and other disclosures licensed to NuMedii and Personalis. AJB’s research has been funded by NIH, Peraton (as the prime on an NIH contract), Genentech, Johnson and Johnson, FDA, Robert Wood Johnson Foundation, Leon Lowenstein Foundation, Intervalien Foundation, Priscilla Chan and Mark Zuckerberg, the Barbara and Gerson Bakar Foundation, and in the recent past, the March of Dimes, Juvenile Diabetes Research Foundation, California Governor’s Office of Planning and Research, California Institute for Regenerative Medicine, L’Oreal, and Progenity. None of these entities had any bearing on the design of this study or the writing of the manuscript.

No other authors have conflicts of interest to disclose.

Code availability

The code accompanying this manuscript is available at https://github.com/cykwilliams/GPT-4-Emergency-Department-Discharge-Summary/

Acknowledgements

Dr Aaron E. Kornblith is supported by Eunice Kennedy Shriver National Institute of Child Health and Human Development of the National Institutes of Health under award number K23HD110716.

The authors acknowledge the use of the UCSF Information Commons computational research platform, developed and supported by UCSF Bakar Computational Health Sciences Institute. The authors also thank the UCSF AI Tiger Team, Academic Research Services, Research Information Technology, and the Chancellor’s Task Force for Generative AI for their software development, analytical and technical support related to the use of Versa API gateway (the UCSF secure implementation of large language models and generative AI via API gateway), Versa chat (the chat user interface), and related data asset and services.

Dr Christopher Y.K. Williams had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.

References

1.↵
Ngo E, Patel N, Chandrasekaran K, Tajik AJ, Paterick TE. The Importance of the Medical Record: A Critical Professional Responsibility. J Med Pract Manag MPM. 2016;31(5):305–308.
OpenUrl Google Scholar
2.↵
Ebbers T, Kool RB, Smeele LE, et al. The Impact of Structured and Standardized Documentation on Documentation Quality; a Multicenter, Retrospective Study. J Med Syst. 2022;46(7):46. doi:10.1007/s10916-022-01837-9
OpenUrl CrossRef Google Scholar
3.↵
Gesner E, Gazarian P, Dykes P. The Burden and Burnout in Documenting Patient Care: An Integrative Literature Review. Stud Health Technol Inform. 2019;264:1194–1198. doi:10.3233/SHTI190415
OpenUrl CrossRef Google Scholar
4.↵
Mishra P, Kiang JC, Grant RW. Association of Medical Scribes in Primary Care With Physician Workflow and Patient Experience. JAMA Intern Med. 2018;178(11):1467–1472. doi:10.1001/jamainternmed.2018.3956
OpenUrl CrossRef Google Scholar
5.↵
Sinsky C, Colligan L, Li L, et al. Allocation of Physician Time in Ambulatory Practice: A Time and Motion Study in 4 Specialties. Ann Intern Med. 2016;165(11):753–760. doi:10.7326/M16-0961
OpenUrl CrossRef PubMed Google Scholar
6.↵
Adler-Milstein J, Zhao W, Willard-Grace R, Knox M, Grumbach K. Electronic health records and burnout: Time spent on the electronic health record after hours and message volume associated with exhaustion but not with cynicism among primary care clinicians. J Am Med Inform Assoc. 2020;27(4):531–538. doi:10.1093/jamia/ocz220
OpenUrl CrossRef PubMed Google Scholar
7.
Ortega MV, Hidrue MK, Lehrhoff SR, et al. Patterns in Physician Burnout in a Stable-Linked Cohort. JAMA Netw Open. 2023;6(10):e2336745. doi:10.1001/jamanetworkopen.2023.36745
OpenUrl CrossRef Google Scholar
8.
Tajirian T, Stergiopoulos V, Strudwick G, et al. The Influence of Electronic Health Record Use on Physician Burnout: Cross-Sectional Survey. J Med Internet Res. 2020;22(7):e19274. doi:10.2196/19274
OpenUrl CrossRef PubMed Google Scholar
9.↵
Peccoralo LA, Kaplan CA, Pietrzak RH, Charney DS, Ripp JA. The impact of time spent on the electronic health record after work and of clerical work on burnout among clinical faculty. J Am Med Inform Assoc. 2021;28(5):938–947. doi:10.1093/jamia/ocaa349
OpenUrl CrossRef PubMed Google Scholar
10.↵
Taylor DM, Cameron PA. Discharge instructions for emergency department patients: what should we provide? Emerg Med J. 2000;17(2):86–90. doi:10.1136/emj.17.2.86
OpenUrl Abstract/FREE Full Text Google Scholar
11.
Sorita A, Robelia PM, Kattel SB, et al. The Ideal Hospital Discharge Summary: A Survey of U.S. Physicians. J Patient Saf. 2021;17(7):e637–e644. doi:10.1097/PTS.0000000000000421
OpenUrl CrossRef Google Scholar
12.↵
Robelia PM, Kashiwagi DT, Jenkins SM, Newman JS, Sorita A. Information Transfer and the Hospital Discharge Summary: National Primary Care Provider Perspectives of Challenges and Opportunities. J Am Board Fam Med JABFM. 2017;30(6):758–765. doi:10.3122/jabfm.2017.06.170194
OpenUrl Abstract/FREE Full Text Google Scholar
13.↵
Li JYZ, Yong TY, Hakendorf P, Ben-Tovim D, Thompson CH. Timeliness in discharge summary dissemination is associated with patients’ clinical outcomes. J Eval Clin Pract. 2013;19(1):76–79. doi:10.1111/j.1365-2753.2011.01772.x
OpenUrl CrossRef PubMed Google Scholar
14.↵
Improving the Emergency Department Discharge Process: Environmental Scan Report.
Google Scholar
15.↵
Wachter RM, Brynjolfsson E. Will Generative Artificial Intelligence Deliver on Its Promise in Health Care? JAMA. 2024;331(1):65–69. doi:10.1001/jama.2023.25054
OpenUrl CrossRef Google Scholar
16.↵
Tang L, Sun Z, Idnay B, et al. Evaluating large language models on medical evidence summarization. Npj Digit Med. 2023;6(1):1–8. doi:10.1038/s41746-023-00896-7
OpenUrl CrossRef Google Scholar
17.↵
Van Veen D, Van Uden C, Blankemeier L, et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat Med. Published online February 27, 2024:1–9. doi:10.1038/s41591-024-02855-5
OpenUrl CrossRef Google Scholar
18.↵
Hartman VC, Bapat SS, Weiner MG, Navi BB, Sholle ET, Campion TR. A method to automate the discharge summary hospital course for neurology patients. J Am Med Inform Assoc JAMIA. 2023;30(12):1995–2003. doi:10.1093/jamia/ocad177
OpenUrl CrossRef Google Scholar
19.↵
Tierney AA, Gayre G, Hoberman B, et al. Ambient Artificial Intelligence Scribes to Alleviate the Burden of Clinical Documentation. Catal Non-Issue Content. 2024;5(1):CAT.23.0404. doi:10.1056/CAT.23.0404
OpenUrl CrossRef Google Scholar
20.↵
Radhakrishnan L, Schenk G, Muenzen K, et al. A certified de-identification system for all clinical text documents for information extraction at scale. JAMIA Open. 2023;6(3):ooad045. doi:10.1093/jamiaopen/ooad045
OpenUrl CrossRef Google Scholar
21.↵
openai/tiktoken. Published online March 23, 2024. Accessed March 23, 2024. https://github.com/openai/tiktoken
Google Scholar
22.↵
Williams CYK, Zack T, Miao BY, Sushil M, Wang M, Butte AJ. Assessing clinical acuity in the Emergency Department using the GPT-3.5 Artificial Intelligence Model. Published online August 13, 2023:2023.08.09.23293795. doi:10.1101/2023.08.09.23293795
OpenUrl Abstract/FREE Full Text Google Scholar
23.
OpenAI. GPT-4 Technical Report. Published online March 27, 2023. doi:10.48550/arXiv.2303.08774
OpenUrl CrossRef Google Scholar
24.↵
Fink MA, Bischoff A, Fink CA, et al. Potential of ChatGPT and GPT-4 for Data Mining of Free-Text CT Reports on Lung Cancer. Radiology. 2023;308(3):e231362. doi:10.1148/radiol.231362
OpenUrl CrossRef PubMed Google Scholar
25.↵
Chatterton B, Chen J, Schwarz EB, Karlin J. Primary Care Physicians’ Perspectives on High-Quality Discharge Summaries. J Gen Intern Med. Published online November 27, 2023. doi:10.1007/s11606-023-08541-5
OpenUrl CrossRef Google Scholar
26.↵
Stetson PD, Bakken S, Wrenn JO, Siegler EL. Assessing Electronic Note Quality Using the Physician Documentation Quality Instrument (PDQI-9). Appl Clin Inform. 2012;3(2):164–174. doi:10.4338/ACI-2011-11-RA-0070
OpenUrl CrossRef PubMed Google Scholar
27.↵
Adler-Milstein J, Redelmeier DA, Wachter RM. The Limits of Clinician Vigilance as an AI Safety Bulwark. JAMA. Published online March 14, 2024. doi:10.1001/jama.2024.3620
OpenUrl CrossRef Google Scholar
28.↵
Zack T, Lehman E, Suzgun M, et al. Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care: a model evaluation study. Lancet Digit Health. 2024;6(1):e12–e22. doi:10.1016/S2589-7500(23)00225-X
OpenUrl CrossRef PubMed Google Scholar

Posted April 04, 2024.

Download PDF

Author Declarations

Supplementary Material

Data/Code

Citation Tools

Get QR code

Tweet Widget

Subject Area

Health Informatics

Reviews and Context

Comment

TRIP Peer Reviews

Community Reviews

Automated Services

Blogs/Media

Author Videos

Subject Areas

All Articles

Addiction Medicine (419)
Allergy and Immunology (741)
Anesthesia (217)
Cardiovascular Medicine (3190)
Dentistry and Oral Medicine (355)
Dermatology (269)
Emergency Medicine (472)
Endocrinology (including Diabetes Mellitus and Metabolic Disease) (1135)
Epidemiology (13173)
Forensic Medicine (18)
Gastroenterology (882)
Genetic and Genomic Medicine (5004)
Geriatric Medicine (464)
Health Economics (767)
Health Informatics (3151)
Health Policy (1118)
Health Systems and Quality Improvement (1160)
Hematology (418)
HIV/AIDS (991)
Infectious Diseases (except HIV/AIDS) (14477)
Intensive Care and Critical Care Medicine (900)
Medical Education (466)
Medical Ethics (122)
Nephrology (512)
Neurology (4755)
Nursing (253)
Nutrition (705)
Obstetrics and Gynecology (863)
Occupational and Environmental Health (775)
Oncology (2446)
Ophthalmology (695)
Orthopedics (273)
Otolaryngology (335)
Pain Medicine (317)
Palliative Medicine (89)
Pathology (525)
Pediatrics (1270)
Pharmacology and Therapeutics (537)
Primary Care Research (541)
Psychiatry and Clinical Psychology (4081)
Public and Global Health (7319)
Radiology and Imaging (1643)
Rehabilitation Medicine and Physical Therapy (977)
Respiratory Medicine (959)
Rheumatology (469)
Sexual and Reproductive Health (486)
Sports Medicine (412)
Surgery (528)
Toxicology (68)
Transplantation (227)
Urology (197)

Comments

medRxiv aims to provide a venue for anyone to comment on a medRxiv preprint. Comments are moderated for offensive or irrelevant content (this can take ~24 h). Please avoid duplicate submissions and read our Comment Policy before commenting. The content of a comment is not endorsed by medRxiv.

medRxiv aims to inform readers about online discussion of this preprint occurring elsewhere. The content at the links below is not endorsed by either medRxiv or the preprint's authors.

Community reviews for this article:

There are no community reviews for this paper.

Automated Evaluations

Certain services provide automated analysis of preprints. Analyses invited by the authors are displayed at the top of this tab. Those done independently of authors are shown underneath . None of these analyses is endorsed by medRxiv.

Automated Evaluations:

There are no automated evaluations for this paper.

[1] 1.↵
Ngo E, Patel N, Chandrasekaran K, Tajik AJ, Paterick TE. The Importance of the Medical Record: A Critical Professional Responsibility. J Med Pract Manag MPM. 2016;31(5):305–308.
OpenUrl Google Scholar

[2] 2.↵
Ebbers T, Kool RB, Smeele LE, et al. The Impact of Structured and Standardized Documentation on Documentation Quality; a Multicenter, Retrospective Study. J Med Syst. 2022;46(7):46. doi:10.1007/s10916-022-01837-9
OpenUrl CrossRef Google Scholar

[3] 3.↵
Gesner E, Gazarian P, Dykes P. The Burden and Burnout in Documenting Patient Care: An Integrative Literature Review. Stud Health Technol Inform. 2019;264:1194–1198. doi:10.3233/SHTI190415
OpenUrl CrossRef Google Scholar

[4] 4.↵
Mishra P, Kiang JC, Grant RW. Association of Medical Scribes in Primary Care With Physician Workflow and Patient Experience. JAMA Intern Med. 2018;178(11):1467–1472. doi:10.1001/jamainternmed.2018.3956
OpenUrl CrossRef Google Scholar

[5] 5.↵
Sinsky C, Colligan L, Li L, et al. Allocation of Physician Time in Ambulatory Practice: A Time and Motion Study in 4 Specialties. Ann Intern Med. 2016;165(11):753–760. doi:10.7326/M16-0961
OpenUrl CrossRef PubMed Google Scholar

[6] 6.↵
Adler-Milstein J, Zhao W, Willard-Grace R, Knox M, Grumbach K. Electronic health records and burnout: Time spent on the electronic health record after hours and message volume associated with exhaustion but not with cynicism among primary care clinicians. J Am Med Inform Assoc. 2020;27(4):531–538. doi:10.1093/jamia/ocz220
OpenUrl CrossRef PubMed Google Scholar

[7] 7.
Ortega MV, Hidrue MK, Lehrhoff SR, et al. Patterns in Physician Burnout in a Stable-Linked Cohort. JAMA Netw Open. 2023;6(10):e2336745. doi:10.1001/jamanetworkopen.2023.36745
OpenUrl CrossRef Google Scholar

[8] 8.
Tajirian T, Stergiopoulos V, Strudwick G, et al. The Influence of Electronic Health Record Use on Physician Burnout: Cross-Sectional Survey. J Med Internet Res. 2020;22(7):e19274. doi:10.2196/19274
OpenUrl CrossRef PubMed Google Scholar

[9] 9.↵
Peccoralo LA, Kaplan CA, Pietrzak RH, Charney DS, Ripp JA. The impact of time spent on the electronic health record after work and of clerical work on burnout among clinical faculty. J Am Med Inform Assoc. 2021;28(5):938–947. doi:10.1093/jamia/ocaa349
OpenUrl CrossRef PubMed Google Scholar

[10] 10.↵
Taylor DM, Cameron PA. Discharge instructions for emergency department patients: what should we provide? Emerg Med J. 2000;17(2):86–90. doi:10.1136/emj.17.2.86
OpenUrl Abstract/FREE Full Text Google Scholar

[11] 11.
Sorita A, Robelia PM, Kattel SB, et al. The Ideal Hospital Discharge Summary: A Survey of U.S. Physicians. J Patient Saf. 2021;17(7):e637–e644. doi:10.1097/PTS.0000000000000421
OpenUrl CrossRef Google Scholar

[12] 12.↵
Robelia PM, Kashiwagi DT, Jenkins SM, Newman JS, Sorita A. Information Transfer and the Hospital Discharge Summary: National Primary Care Provider Perspectives of Challenges and Opportunities. J Am Board Fam Med JABFM. 2017;30(6):758–765. doi:10.3122/jabfm.2017.06.170194
OpenUrl Abstract/FREE Full Text Google Scholar

[13] 13.↵
Li JYZ, Yong TY, Hakendorf P, Ben-Tovim D, Thompson CH. Timeliness in discharge summary dissemination is associated with patients’ clinical outcomes. J Eval Clin Pract. 2013;19(1):76–79. doi:10.1111/j.1365-2753.2011.01772.x
OpenUrl CrossRef PubMed Google Scholar

[14] 14.↵
Improving the Emergency Department Discharge Process: Environmental Scan Report.
Google Scholar

[15] 15.↵
Wachter RM, Brynjolfsson E. Will Generative Artificial Intelligence Deliver on Its Promise in Health Care? JAMA. 2024;331(1):65–69. doi:10.1001/jama.2023.25054
OpenUrl CrossRef Google Scholar

[16] 16.↵
Tang L, Sun Z, Idnay B, et al. Evaluating large language models on medical evidence summarization. Npj Digit Med. 2023;6(1):1–8. doi:10.1038/s41746-023-00896-7
OpenUrl CrossRef Google Scholar

[17] 17.↵
Van Veen D, Van Uden C, Blankemeier L, et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat Med. Published online February 27, 2024:1–9. doi:10.1038/s41591-024-02855-5
OpenUrl CrossRef Google Scholar

[18] 18.↵
Hartman VC, Bapat SS, Weiner MG, Navi BB, Sholle ET, Campion TR. A method to automate the discharge summary hospital course for neurology patients. J Am Med Inform Assoc JAMIA. 2023;30(12):1995–2003. doi:10.1093/jamia/ocad177
OpenUrl CrossRef Google Scholar

[19] 19.↵
Tierney AA, Gayre G, Hoberman B, et al. Ambient Artificial Intelligence Scribes to Alleviate the Burden of Clinical Documentation. Catal Non-Issue Content. 2024;5(1):CAT.23.0404. doi:10.1056/CAT.23.0404
OpenUrl CrossRef Google Scholar

[20] 20.↵
Radhakrishnan L, Schenk G, Muenzen K, et al. A certified de-identification system for all clinical text documents for information extraction at scale. JAMIA Open. 2023;6(3):ooad045. doi:10.1093/jamiaopen/ooad045
OpenUrl CrossRef Google Scholar

[21] 21.↵
openai/tiktoken. Published online March 23, 2024. Accessed March 23, 2024. https://github.com/openai/tiktoken
Google Scholar

[22] 22.↵
Williams CYK, Zack T, Miao BY, Sushil M, Wang M, Butte AJ. Assessing clinical acuity in the Emergency Department using the GPT-3.5 Artificial Intelligence Model. Published online August 13, 2023:2023.08.09.23293795. doi:10.1101/2023.08.09.23293795
OpenUrl Abstract/FREE Full Text Google Scholar

[23] 23.
OpenAI. GPT-4 Technical Report. Published online March 27, 2023. doi:10.48550/arXiv.2303.08774
OpenUrl CrossRef Google Scholar

[24] 24.↵
Fink MA, Bischoff A, Fink CA, et al. Potential of ChatGPT and GPT-4 for Data Mining of Free-Text CT Reports on Lung Cancer. Radiology. 2023;308(3):e231362. doi:10.1148/radiol.231362
OpenUrl CrossRef PubMed Google Scholar

[25] 25.↵
Chatterton B, Chen J, Schwarz EB, Karlin J. Primary Care Physicians’ Perspectives on High-Quality Discharge Summaries. J Gen Intern Med. Published online November 27, 2023. doi:10.1007/s11606-023-08541-5
OpenUrl CrossRef Google Scholar

[26] 26.↵
Stetson PD, Bakken S, Wrenn JO, Siegler EL. Assessing Electronic Note Quality Using the Physician Documentation Quality Instrument (PDQI-9). Appl Clin Inform. 2012;3(2):164–174. doi:10.4338/ACI-2011-11-RA-0070
OpenUrl CrossRef PubMed Google Scholar

[27] 27.↵
Adler-Milstein J, Redelmeier DA, Wachter RM. The Limits of Clinician Vigilance as an AI Safety Bulwark. JAMA. Published online March 14, 2024. doi:10.1001/jama.2024.3620
OpenUrl CrossRef Google Scholar

[28] 28.↵
Zack T, Lehman E, Suzgun M, et al. Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care: a model evaluation study. Lancet Digit Health. 2024;6(1):e12–e22. doi:10.1016/S2589-7500(23)00225-X
OpenUrl CrossRef PubMed Google Scholar

Evaluating Large Language Models for Drafting Emergency Department Discharge Summaries

Abstract

Introduction

Methods

Statistical analysis

Results

Discussion

Conclusion

Data Availability

Conflicts of Interest

Code availability

Acknowledgements

References

Subject Area

Citation Manager Formats

Evaluating Large Language Models for Drafting Emergency Department Discharge Summaries

Abstract

Introduction

Methods

Statistical analysis

Results

Discussion

Conclusion

Data Availability

Conflicts of Interest

Code availability

Acknowledgements

References

Subject Area

Follow this preprint