A comparison of large language model versus manual chart review for extraction of data elements from the electronic health record
=================================================================================================================================

* Jin Ge
* Michael Li
* Molly B. Delk
* Jennifer C. Lai

## Structured Abstract

**Importance** Large language models (LLMs) have proven useful for extracting data from publicly available sources, but their uses in clinical settings and with clinical data are unknown.

**Objective** To determine the accuracy of data extraction using “Versa Chat,” a chat implementation of the general-purpose OpenAI gpt-35-turbo LLM, versus manual chart review for hepatocellular carcinoma (HCC) imaging reports.

**Design** We engineered a prompt for the data extraction task of six distinct data elements and input 182 abdominal imaging reports that were also manually tagged. We evaluated performance by calculating accuracy, precision, recall, and F1 scores.

**Setting/Participants** Cross-sectional abdominal imaging reports of patients diagnosed with hepatocellular carcinoma enrolled in the Functional Assessment in Liver Transplantation (FrAILT) study.

## Background

Large language models (LLMs) hold tremendous potential for accelerating clinical research and augmenting clinical care.1 One of the most promising LLM use cases is natural language processing (NLP) and extraction of structured elements from unstructured clinical text, such as imaging reports.2 LI-RADS (Liver Imaging Reporting and Data System) was created by the American College of Radiology and provides standardized and reproducible reporting of hepatocellular carcinoma (HCC) imaging for clinical care and research.3 Due to the LI-RADS reporting system, HCC imaging provides an ideal test case for LLM-enabled NLP extraction of structured data from unstructured clinical text. We sought to assess the performance of a commercially available general-purpose LLM, deployed in an isolated protected environment and permitted to be used with protected health information (PHI), versus human manual chart review in extracting six distinct data elements from abdominal imaging reports.

## Methods

We used “Versa Chat,” the chat user interface of the general-purpose Microsoft Azure OpenAI gpt-35-turbo LLM (“Versa”), for this study. Versa is implemented in a protected environment at the University of California, San Francisco (UCSF) to accommodate the use of PHI and intellectual property.4 “Versa,” like other gpt-35-turbo implementations, has a limit of 4,096 tokens, where a token is the unit that OpenAI generative artificial intelligence (GAI) models use to compute text length. One token corresponds to roughly four characters of English text, or a little less than one word. This 4,096-token limit covers both the user prompt and the completion of the task in each session.5

We manually reviewed 182 CT or MRI abdomen imaging reports without evidence of locoregional treatments from 169 patients diagnosed with HCC enrolled in the Functional Assessment in Liver Transplantation (FrAILT) study at UCSF.6 Because the diagnosis of HCC could have occurred after the date of imaging, a given report may or may not contain evidence of HCC. We manually tagged the imaging reports for six distinct data elements: 1. Maximum LI-RADS score for any HCC lesions (defined as 4 or 5), 2. Number of HCC lesions, 3. Diameter (cm) of the largest lesion, 4. Sum of diameters (cm) of all HCC lesions, 5. Presence or absence of macrovascular invasion, and 6. Presence or absence of extrahepatic metastases.
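As a rough illustration of how the six tagged elements could be held as one structured record per imaging report, the sketch below defines a small Python data class and a per-element comparison between manual review and model output. The class name, field names, and the use of `None` for a report without an HCC lesion are assumptions made for illustration; they are not taken from the study's data dictionary.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class ImagingExtraction:
    """Six tagged elements for one imaging report (field names are hypothetical)."""
    max_lirads: Optional[int]       # assumed: 4 or 5, or None if no HCC lesion is reported
    n_lesions: int                  # number of HCC lesions
    largest_diameter_cm: float      # diameter of the largest lesion
    sum_diameters_cm: float         # sum of diameters of all HCC lesions
    macrovascular_invasion: bool    # presence or absence
    extrahepatic_metastases: bool   # presence or absence


def element_agreement(manual: ImagingExtraction, model: ImagingExtraction) -> dict:
    """Return, for each element, whether the model output equals the manual tag."""
    return {
        f.name: getattr(manual, f.name) == getattr(model, f.name)
        for f in fields(ImagingExtraction)
    }


# Mock records, not study data: the model mis-sums the tumor diameters.
manual = ImagingExtraction(5, 2, 3.1, 4.6, False, False)
model = ImagingExtraction(5, 2, 3.1, 5.0, False, False)
agreement = element_agreement(manual, model)
```

Comparing element by element, rather than record by record, mirrors the per-element accuracy reported in this study. Exact equality on the diameters is a simplification; a tolerance could be substituted if rounding differences are expected.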

All 182 imaging reports were trimmed to include only the findings and impressions sections. Given the limit of 4,096 tokens per session in “Versa Chat,” we iteratively developed a “zero-shot” prompt (defined as a prompt that does not contain training data) with testing on the first five records (Figure 1).7 Because passing too many records in a single session often exceeded the token limit and caused execution failures, we ran 26 sessions of the final “zero-shot” extraction prompt in “Versa Chat,” with approximately seven records per session (see Figure 2 for an example exchange using mock data with “Versa Chat”). If “Versa Chat” produced an output that required minor additional formatting, we made those changes within the chat interface before collecting and aggregating the data. The total time required to process all 182 records was 45 minutes.
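The session-splitting step described above can be sketched as a simple greedy batching routine. This is a minimal sketch under stated assumptions, not the procedure actually used (records were entered through the chat interface): the function names, the `completion_reserve` parameter, and the four-characters-per-token estimate in place of a real tokenizer are all hypothetical.

```python
import math

TOKEN_LIMIT = 4096     # gpt-35-turbo limit cited in the text (prompt + completion)
CHARS_PER_TOKEN = 4    # rough heuristic from the text; a real tokenizer is more precise


def estimate_tokens(text: str) -> int:
    """Approximate the token count from the character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def batch_reports(reports, prompt, completion_reserve=500):
    """Greedily group reports so prompt + reports + reserved completion fit the limit."""
    budget = TOKEN_LIMIT - estimate_tokens(prompt) - completion_reserve
    batches, current, used = [], [], 0
    for report in reports:
        cost = estimate_tokens(report)
        if cost > budget:
            raise ValueError("a single report exceeds the per-session budget")
        if current and used + cost > budget:
            batches.append(current)   # close the session before it overflows
            current, used = [], 0
        current.append(report)
        used += cost
    if current:
        batches.append(current)
    return batches


# Synthetic stand-ins (not study data): 182 reports of about 450 tokens each.
mock_reports = ["x" * 1800 for _ in range(182)]
mock_prompt = "p" * 400
sessions = batch_reports(mock_reports, mock_prompt)
```

With these synthetic lengths each session holds seven reports, so the 182 reports fall into 26 sessions. Real reports vary in length, so the number of records per session would vary as well.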

![Figure 1](https://www.medrxiv.org/content/medrxiv/early/2023/09/04/2023.08.31.23294924/F1.medium.gif)

[Figure 1](http://medrxiv.org/content/early/2023/09/04/2023.08.31.23294924/F1)

Figure 1 Final prompt used for data extraction from “Versa Chat” (gpt-35-turbo)

![Figure 2](https://www.medrxiv.org/content/medrxiv/early/2023/09/04/2023.08.31.23294924/F2.medium.gif)

[Figure 2](http://medrxiv.org/content/early/2023/09/04/2023.08.31.23294924/F2)

Figure 2 Example of an exchange with “Versa Chat” (gpt-35-turbo) using mock data

We evaluated the accuracy of “Versa Chat” data extractions versus manual chart review, treating each imaging report as a separate record. We calculated performance metrics, notably accuracy, precision, recall, and F1 score (the harmonic mean of precision and recall, commonly used to evaluate classification in machine learning), for each of the six data elements whenever possible. For multilevel classifications (maximum LI-RADS score, number of HCC lesions, diameter of the largest lesion, and sum of tumor diameters), we calculated weighted-average precision, recall, and F1 score. For binary classifications (macrovascular invasion and extrahepatic metastases), we defined the presence of these features as a positive case for precision, recall, and F1 score.8 We estimated 95% confidence intervals (CI) for performance metrics whenever possible through bootstrapping with 2,000 iterations. All statistical analyses were conducted in R, version 4.3.1 “Beagle Scouts” (R Core Team, Vienna, Austria),9 with the R packages *boot*, version 1.3-28.1,10 and *caret*, version 6.0-94.11 This study was approved by the UCSF Institutional Review Board in Study #11-07513.
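The metrics described above can be illustrated with a short standard-library Python sketch. It is not the authors' code, which used the R packages *boot* and *caret*. It assumes that “weighted-average” means weighting each class by its number of manually tagged records, that a class the model never predicts gets a precision of zero, and that the confidence interval is a percentile bootstrap over records. All function names are hypothetical.

```python
import random
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of records where the model output equals the manual tag."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_prf(y_true, y_pred):
    """Support-weighted average of per-class precision, recall, and F1."""
    support = Counter(y_true)
    n = len(y_true)
    totals = [0.0, 0.0, 0.0]
    for cls, count in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = count - tp
        precision = tp / (tp + fp) if tp + fp else 0.0   # assumed convention
        recall = tp / (tp + fn)                          # count > 0, so defined
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        for i, value in enumerate((precision, recall, f1)):
            totals[i] += value * count / n
    return tuple(totals)


def bootstrap_ci(y_true, y_pred, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI from resampling records with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lo = stats[max(0, round(alpha / 2 * n_boot))]
    hi = stats[min(n_boot - 1, round((1 - alpha / 2) * n_boot) - 1)]
    return lo, hi


# Mock labels for a three-level element (not study data).
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 0]
w_precision, w_recall, w_f1 = weighted_prf(y_true, y_pred)
ci_low, ci_high = bootstrap_ci(y_true, y_pred, accuracy)
```

Under this weighting, the weighted-average recall is algebraically equal to overall accuracy, because each class's recall is multiplied by that class's share of the records. Records where both reviewers agree a feature is absent therefore raise the weighted averages just as they raise accuracy.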

## Results

The performance metrics for the six data elements extracted by the gpt-35-turbo “Versa Chat” model versus manual chart review are shown in Table 1. The overall accuracy of “Versa Chat” was 0.889 (95% CI 0.869-0.907) versus manual review. Accuracy ranged from 0.725 (95% CI 0.643-0.780) for sum of tumor diameters to 0.989 (95% CI 0.956-0.995) for macrovascular invasion. In general, accuracy was higher for simple classification tasks (maximum LI-RADS score, macrovascular invasion, and extrahepatic metastases) than for those that required comparison (maximum tumor diameter) or summation (number of tumors and sum of tumor diameters). As macrovascular invasion and extrahepatic metastases did not have any true positive cases, the precision was zero for both data elements. Similarly, as there were also no false negative cases for macrovascular invasion, its recall and F1 score could not be calculated. Because the precision, recall, and F1 score for maximum LI-RADS score, number of tumors, maximum tumor diameter, and sum of tumor diameters were calculated as weighted-average values across multiple levels, these values may be biased: accurate predictions of the absence of an imaging feature (e.g., “Versa Chat” noted zero tumors when there were no tumors by manual chart review) were included in the statistics.
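The zero and incalculable values noted above follow directly from the definitions: precision is TP/(TP+FP) and recall is TP/(TP+FN), so precision is zero when every positive call is a false positive, and recall is undefined when the manual review contains no positive cases at all. The sketch below illustrates this with mock labels; the function name and the use of `None` for an undefined ratio are assumptions for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, and F1 for the positive class; None marks an undefined ratio."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    if precision is None or recall is None or precision + recall == 0:
        f1 = None   # harmonic mean is undefined without both components
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Mock labels (not study data): no positive cases on manual review,
# one false positive flagged by the model.
manual_tags = [False] * 10
model_tags = [True] + [False] * 9
precision, recall, f1 = binary_metrics(manual_tags, model_tags)
```

Here precision is 0.0 while recall and F1 are `None`, even though nine of the ten mock records agree. This is why accuracy can be high for an element whose precision is reported as zero.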

[Table 1](http://medrxiv.org/content/early/2023/09/04/2023.08.31.23294924/T1)

Table 1 Performance evaluation statistics of “Versa Chat” versus manual chart review

## Discussion

This is one of the first studies to compare the performance of the chat interface of a general-purpose LLM against manual chart review for extraction of clinical data. We demonstrated high accuracy for simple extraction tasks, which degraded with more complex use cases. Of note, iterative development (“prompt engineering”) of a “zero-shot” prompt to specify the operations to be executed by the LLM was necessary to achieve this level of accuracy.7 Our use of a “zero-shot” prompt and our limit on the amount of data processed per session, however, prevented the gpt-35-turbo model from maintaining a persistent memory to allow in-context “learning” based on previous data.12 These are known limitations of the gpt-35-turbo model, which have been improved upon in gpt-35-turbo-16k (which supports up to 16,384 tokens), gpt-4 (up to 8,192 tokens), and gpt-4-32k (up to 32,768 tokens).5 Despite these limitations, our study demonstrated two important concepts: 1. Feasibility of using general-purpose LLMs to extract structured information from clinical data with *minimal* technical expertise, and 2. Use of an LLM deployed in an isolated protected environment that accommodates PHI (as opposed to ChatGPT, which is often not permitted for use with PHI) for clinical use cases.

## Data Availability

Aggregate data produced in the present study may be available upon reasonable request to and with approval by the authors.

## Data Acknowledgement

The authors thank the UCSF AI Tiger Team, Academic Research Services, Research Information Technology, and the Chancellor’s Task Force for Generative AI for their software development, analytical, and technical support related to the use of Versa API gateway (the UCSF secure implementation of large language models and generative AI via API gateway), Versa Chat (the chat user interface), and related data assets.

## Footnotes

*   **Financial/Grant Support:** The authors of this study were supported in part by the KL2TR001870 (National Center for Advancing Translational Sciences, Ge), P30DK026743 (UCSF Liver Center Grant, Ge, Li, and Lai), ACG Junior Faculty Development Award (American College of Gastroenterology Institute, Li), and R01AG059183/K24AG080021 (National Institute on Aging, Lai). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or any other funding agencies. The funding agencies played no role in the analysis of the data or the preparation of this manuscript.

*   **Disclosures:** The authors of this manuscript have the following potential conflicts of interest to disclose: 
    *   Dr. Jin Ge receives research support from Merck and Co; and consults for Astellas Pharmaceuticals/Iota Biosciences.

    *   Dr. Jennifer C. Lai receives research support from Lipocene and Vir Biotechnologies; receives an education grant from Nestle Nutrition Sciences; serves on an advisory board for Novo Nordisk; and consults for Genfit, Third Rock Ventures, and Boehringer Ingelheim.

*   **Writing Assistance:** None.

## Abbreviations

CI
:   confidence interval
FrAILT
:   Functional Assessment in Liver Transplantation
GAI
:   generative artificial intelligence
HCC
:   hepatocellular carcinoma
LI-RADS
:   Liver Imaging Reporting and Data System
LLM
:   large language model
NLP
:   natural language processing
PHI
:   protected health information
UCSF
:   University of California, San Francisco

*   Received August 31, 2023.
*   Revision received September 2, 2023.
*   Accepted September 4, 2023.


*   © 2023, Posted by Cold Spring Harbor Laboratory

This pre-print is available under a Creative Commons License (Attribution-NonCommercial-NoDerivs 4.0 International), CC BY-NC-ND 4.0, as described at [http://creativecommons.org/licenses/by-nc-nd/4.0/](http://creativecommons.org/licenses/by-nc-nd/4.0/)

## References

1.  Ge J, Lai JC. Artificial intelligence-based text generators in hepatology: ChatGPT is just the beginning. Hepatol Commun. 2023;7(4). doi:10.1097/HC9.0000000000000097

2.  Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. Nature. 2023;620(7972):172–180. doi:10.1038/s41586-023-06291-2

3.  Chernyak V, Fowler KJ, Kamaya A, et al. Liver Imaging Reporting and Data System (LI-RADS) Version 2018: Imaging of Hepatocellular Carcinoma in At-Risk Patients. Radiology. 2018;289(3):816–830. doi:10.1148/radiol.2018181494

4.  Azure OpenAI Service – Large Language Models for Generative AI. [https://azure.microsoft.com/en-us/products/ai-services/openai-service-b](https://azure.microsoft.com/en-us/products/ai-services/openai-service-b). Accessed August 25, 2023.

5.  Azure OpenAI Service models - Azure OpenAI | Microsoft Learn. [https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models). Accessed August 26, 2023.

6.  Lai JC, Covinsky KE, Dodge JL, et al. Development of a novel frailty index to predict mortality in patients with end-stage liver disease. Hepatology. 2017;66(2):564–574. doi:10.1002/hep.29219

7.  Wang J, Shi E, Yu S, et al. Prompt Engineering for Healthcare: Methodologies and Applications. arXiv. 2023. doi:10.48550/arxiv.2304.14670

8.  Sokolova M, Lapalme G. A systematic analysis of performance measures for classification tasks. Information Processing & Management. 2009;45(4):427–437. doi:10.1016/j.ipm.2009.03.002

9.  R Core Team. R: A language and environment for statistical computing. 2013.

10. Bootstrap Functions (Originally by Angelo Canty for S) [R package boot version 1.3-28.1]. [https://cran.r-project.org/web/packages/boot/index.html](https://cran.r-project.org/web/packages/boot/index.html). Published November 22, 2022. Accessed January 6, 2023.

11. Kuhn M. Classification and Regression Training [R package caret version 6.0-94]. March 2023.

12. Liu J, Shen D, Zhang Y, Dolan B, Carin L, Chen W. What Makes Good In-Context Examples for GPT-3? arXiv:2101.06804. January 2021.