Abstract
Background Large language models (LLMs) have significant capabilities in clinical information processing tasks. Commercially available LLMs, however, are not optimized for clinical uses and are prone to generating incorrect or hallucinatory information. Retrieval-augmented generation (RAG) is an enterprise architecture that allows embedding of customized data into LLMs. This approach “specializes” the LLMs and is thought to reduce hallucinations.
Methods We developed “LiVersa,” a liver disease-specific LLM, by using our institution’s protected health information (PHI)-compliant text embedding and LLM platform, “Versa.” We conducted RAG on 30 publicly available American Association for the Study of Liver Diseases (AASLD) guidelines and guidance documents to be incorporated into LiVersa. We evaluated LiVersa’s performance by comparing its responses with those of trainees from a previously published knowledge assessment study regarding hepatitis B (HBV) treatment and hepatocellular carcinoma (HCC) surveillance.
Results LiVersa answered all 10 questions correctly when forced to provide a “yes” or “no” answer. Full detailed responses with justifications and rationales, however, were not completely correct for three of the questions.
Discussions In this study, we demonstrated the ability to build disease-specific and PHI-compliant LLMs using RAG. While our LLM, LiVersa, demonstrated more specificity in answering questions related to clinical hepatology, there were some knowledge deficiencies due to limitations set by the number and types of documents used for RAG. The LiVersa prototype, however, is a proof of concept for utilizing RAG to customize LLMs for clinical uses and a potential strategy to realize personalized medicine in the future.
Introduction
Large language models (LLMs), such as OpenAI’s Generative Pre-trained Transformer (GPT) family of models, have demonstrated significant capabilities in clinical information processing tasks, such as data extraction,1 summarizing literature,2 content generation,3 and predictive modeling.4 Commercially available general-purpose LLMs, such as OpenAI’s ChatGPT, however, are trained on publicly available data and not optimized for clinical uses.5 This means that the outputs of publicly available LLMs when prompted with clinical questions may include incorrect, incomplete, or hallucinatory information.6,7 Despite these limitations, LLMs are thought to have significant potential in biomedical and clinical applications. This is because the practice of modern medicine is a highly complex endeavor with ever-increasing amounts of knowledge generated yearly.8 For instance, it was estimated that two papers were uploaded to PubMed every minute in 2016,9 a figure that has undoubtedly increased in the seven years since. National practice societies have also responded to the ever-expanding body of clinical knowledge by developing comprehensive practice guidelines. For instance, the American Association for the Study of Liver Diseases (AASLD) has issued guidelines and guidance documents for 26 liver diseases and conditions, and two quality measures concerning cirrhosis and hepatocellular carcinoma (HCC).10 Methods to incorporate medical literature, such as clinical practice guidelines and guidance documents, into LLM responses, therefore, represent a growing area of interest to help general-purpose LLMs become “specialized” for clinically focused applications.
Currently, there are three general approaches to LLM “specialization”: 1. Fine-tuning the original LLM, which is computationally expensive;11,12 2. Prompting within the LLM, which only accommodates small amounts of data and requires iterative user input;13–15 and 3. Retrieval-augmented generation (RAG).16–18 RAG is an enterprise architecture that augments LLM abilities by adding an information retrieval system that provides external data, which then supplement and constrain LLM output. In practice, this means that a dataset, such as a compendium of clinical practice guidelines, is vectorized and encoded using embedding models and then incorporated into the LLM by layering it on top of the LLM information retrieval and output processes.16 The theoretical advantages of RAG are two-fold: 1. Handling large numbers of documents to provide “ground knowledge” for the LLM, and 2. Decreasing hallucinations by limiting the potential “solution-space” for LLM outputs. This RAG approach has been proposed and utilized in other clinical specialties, such as general medicine and urology.17,19,20 Existing RAG implementations, however, have largely been built on publicly available LLMs that are not protected health information (PHI)-compliant, thereby limiting permitted use in clinical practice.
In this study, we utilized RAG to create a prototype liver disease-specific LLM, called “LiVersa,” within the University of California, San Francisco’s (UCSF) PHI-compliant implementation of the Microsoft OpenAI GPT family of LLMs, “Versa.”
Methods
This study was not considered human subjects research because no human data or specimens were used. Consequently, approval from the UCSF Institutional Review Board was not sought. To construct our liver disease-specific LLM (LiVersa), we used all available provider-facing clinical practice guidelines, guidance documents, and quality measure documents published by AASLD on its website.10 We excluded introduction and executive summary documents because their content was often duplicated in the corresponding full-length versions (e.g., for Wilson’s Disease). We included both original guidelines/guidance documents and updates if both were publicly available for download on the AASLD website (e.g., for Chronic Hepatitis B [HBV]). Patient-facing and/or outdated guidelines/guidance documents were not included in this study. In total, 30 documents were retrieved to be incorporated into the RAG process (Table 1).
We utilized application programming interfaces (APIs) provided by the Microsoft Azure OpenAI Cognitive Search suite of tools to incorporate the AASLD guidelines and documents for RAG (Figure 1).16 In the pre-processing phase, the 30 AASLD guidelines and documents in PDF format were transformed into embeddings using Microsoft Azure OpenAI’s ADA Text Embedding Version 2 model (text-embedding-ada-002). Text embeddings are numerical representations of text in which words or phrases are represented as multi-dimensional vectors.21,22 These embedding vectors are then stored in a database with the Microsoft Azure Cognitive Search services. During chat interactions with LiVersa, users’ prompts are converted into embeddings in real time using the same text-embedding-ada-002 model. A search is then performed on the previously processed vector database of AASLD guidelines and documents to find matches for the prompt embeddings. The search results are then passed to the gpt-35-turbo or gpt-4-32k LLMs to generate a completion, which is the output from the LLM in the LiVersa chat interface.
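The retrieval steps described above (embed the documents, store the vectors, embed the prompt, search, and pass the matches to the LLM) can be illustrated with a minimal, self-contained sketch. It is not the LiVersa implementation: the `embed` function is a toy hashed bag-of-words stand-in for text-embedding-ada-002, the in-memory list stands in for the Azure Cognitive Search vector database, the passages are placeholders rather than guideline text, and every function name is hypothetical.

```python
import hashlib
import math
import re

DIM = 1024  # toy dimension chosen for this sketch (ada-002 vectors have 1536)


def embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        idx = int(hashlib.sha256(token.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero
    return [v / norm for v in vec]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; inputs are already unit length, so a dot product suffices."""
    return sum(x * y for x, y in zip(a, b))


def build_index(chunks: list[str]) -> list[tuple[str, list[float]]]:
    """Pre-processing phase: embed each passage once and keep it with its vector."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, query: str, k: int = 2) -> list[str]:
    """Chat phase: embed the prompt with the same model and return the top-k matches."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query: str, passages: list[str]) -> str:
    """Assemble the text that would be sent to the LLM to generate a completion."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the sources below.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )


# Placeholder passages (NOT guideline text), used only to exercise the pipeline.
PASSAGES = [
    "placeholder passage on hepatitis B screening",
    "placeholder passage on liver transplantation indications",
    "placeholder passage on hepatocellular carcinoma surveillance",
]

if __name__ == "__main__":
    index = build_index(PASSAGES)
    question = "Who should be screened for chronic hepatitis B?"
    print(build_prompt(question, retrieve(index, question, k=1)))
```

In a production system the toy `embed` would be replaced by a call to the embedding model and the sorted list by a vector search service; the structure of the two phases stays the same.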
We pre-set three sample questions into the “Sample Questions” section of the LiVersa interface:
What are the indications for liver transplantation?
Who should be screened for chronic hepatitis B?
What is the recommended therapy for a patient with BCLC 0-A HCC without portal hypertension?
To evaluate LiVersa’s performance on free-form clinical hepatology questions, we compared its responses to medical trainees’ responses on previously published case-vignette-based knowledge assessment questions on HBV treatment and hepatocellular carcinoma (HCC) surveillance.23 We chose this set of questions because they have also been evaluated by the publicly available ChatGPT.24 Prior to input into LiVersa, case-vignettes were edited for clarity and appropriateness with the goal of retaining the essence of the clinical scenario. We asked LiVersa to give both a full response and a forced “yes” or “no” response; for the latter, we appended “Answer yes or no” to each inputted case-vignette (Figure 2).
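The forced-response step can be sketched as follows. This is an illustration only: the helper names are hypothetical, and the automated parsing and scoring shown here is an assumption, since the text does not describe how responses were graded.

```python
import re
from typing import Optional


def forced_prompt(vignette: str) -> str:
    """Append the forcing instruction to an edited case-vignette."""
    return f"{vignette.rstrip()} Answer yes or no."


def parse_yes_no(response: str) -> Optional[str]:
    """Return 'yes' or 'no' if the response starts with one of them, else None."""
    match = re.match(r"\s*(yes|no)\b", response, flags=re.IGNORECASE)
    return match.group(1).lower() if match else None


def count_correct(responses: list[str], answer_key: list[str]) -> int:
    """Count forced responses that agree with the reference answers."""
    return sum(parse_yes_no(r) == k for r, k in zip(responses, answer_key))
```

Responses that do not begin with “yes” or “no” parse to `None` and are therefore counted as incorrect, which keeps an evasive answer from being scored as a match.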
Results
The user interface for LiVersa is shown in Figure 2. The interface is the same as the one used for “Versa Chat” (the chat interface for the general-purpose PHI-compliant LLM implementation at UCSF), except that the data source is set to the “UCSF Intranet: AASLD Clinical Practice Guidelines (Prototype).” LiVersa’s responses to the three sample questions are featured in Table 2. LiVersa’s performance on case-vignette-based knowledge assessment questions on HBV treatment and HCC surveillance is featured in Table 3, Columns B and C. The correct answers along with percentages of trainees who answered correctly per prior studies are featured in Table 3, Column D. LiVersa answered all 10 questions correctly when forced to provide a “yes” or “no” answer. Rationales and justifications within the full answers, however, were not completely correct for three clinical scenarios in HCC surveillance. These three clinical scenarios were: “A 25-year-old Haitian man with chronic hepatitis B, on treatment with entecavir,” “A 40-year-old Cuban woman who was recently diagnosed with hepatitis B after developing jaundice in the setting of a surgical procedure. There is no evidence of cirrhosis,” and “A 40-year-old woman from Thailand with cirrhosis and chronic inactive hepatitis B.”
Discussion
Two of the most significant barriers to using general-purpose LLMs like ChatGPT in clinical practice are the tendency for these LLMs to “hallucinate,” or generate confident-sounding but false responses, and their lack of compatibility with PHI.6,7 The first issue stems from the nature of the transformer architecture,25 which optimizes the LLM’s objective of predicting the most probable next word (token) from its pre-trained data without consideration for accuracy. Augmenting the knowledge base of general-purpose LLMs, such as through RAG, is therefore one of the key strategies for shaping and constraining LLM outputs to prevent false information from being propagated and disseminated. Our work is one of the first proof-of-concept demonstrations of using RAG to create a liver disease-specific and, more importantly, PHI-compliant LLM chat interface. Our incorporation of hepatology-specific knowledge, such as the AASLD practice guidelines, resulted in the LiVersa chat interface generating answers that were likely more specific than those generated by general-purpose ChatGPT.24
While LiVersa answered all 10 questions regarding HBV treatment and HCC surveillance in our test set correctly, the rationales given for three cases were not completely correct. These incorrect responses reflect limitations of utilizing RAG. For instance, LiVersa’s stated rationale for a “Yes” recommendation for the case-vignette of “A 25-year-old Haitian man with chronic hepatitis B, on treatment with entecavir” was based on the presumption that the patient was from an HBV endemic country (Table 3). The actual rationale given in prior literature is based on the 2018 version of the AASLD HCC practice guidance document, which recommended early surveillance for African HBV patients and/or North American HBV patients of African-American descent.
In this example, we highlight two key issues. First, the 2018 version of the AASLD HCC practice guidance document was not uploaded into LiVersa’s RAG dataset; only the 2023 version was.26,27 LiVersa gave the wrong justification for HCC surveillance in this case simply because the 2018 version was not available to it and “restrictions” placed by the RAG dataset forced it to generate a hallucinatory answer. This limitation could easily be overcome by increasing the number and variety of literature incorporated into LiVersa’s RAG dataset. Additional documents that could be considered for inclusion in the future include clinical practice guidelines/guidance documents from other societies, such as those from the American Gastroenterological Association, American College of Gastroenterology, European Association for the Study of the Liver, and Asian Pacific Association for the Study of the Liver; and from compendium sources, such as the Cochrane Review and UpToDate.
The second key issue is contextual bias. In this instance, the recommendation for HCC surveillance is based on the assumption that the man from Haiti in the clinical scenario is of African descent. This is a problem that affects both the literature data incorporated into the RAG dataset and the pre-training data that underpins the LLM itself. The more “correct” response should have been to ask for additional context to clarify the self-identified national origin or racial/ethnic background of the patient to allow the LLM to give a more comprehensive recommendation/output. This is an illustration of the problems concerning algorithm bias that the generative artificial intelligence (GAI) research community is actively grappling with.28–30 Finally, in addition to the two key considerations noted above, there is a minor limitation with regard to our grading of LiVersa’s responses. We had edited the case-vignette knowledge assessment questions for clarity and appropriateness. While we tried to retain the intent of the clinical cases, the wording of these cases may not be the same as that evaluated by trainees and ChatGPT in prior studies.23,24 This may have affected the comparisons in Table 3.
Despite these limitations, we demonstrated that RAG could be a powerful method to create a “specialized” LLM. Given that our LiVersa prototype was developed and deployed in a PHI-compliant environment (UCSF’s Versa), it could theoretically be used clinically to evaluate actual patient scenarios. By extension, there is a potential to incorporate both literature and patient data from the electronic health record through extraction by the Fast Healthcare Interoperability Resources (FHIR) API in the RAG process.31 This process would allow for the creation of patient-specific clinical LLMs that could be a true realization of GAI-enabled personalized medicine.
Data Availability
The clinical guidelines/guidance documents used in the methods of this manuscript are publicly available.
Financial/Grant Support
The authors of this study were supported in part by the KL2TR001870 (National Center for Advancing Translational Sciences, Ge), P30DK026743 (UCSF Liver Center Grant, Ge and Lai), UL1TR001872 (National Center for Advancing Translational Sciences, Pletcher), and R01AG059183/K24AG080021 (National Institute on Aging, Lai). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or any other funding agencies. The funding agencies played no role in the analysis of the data or the preparation of this manuscript.
Disclosures
The authors of this manuscript have the following potential conflicts of interest to disclose:
- Dr. Jin Ge receives research support from Merck and Co; and consults for Astellas Pharmaceuticals/Iota Biosciences.
- Dr. Jennifer C. Lai receives research support from Lipocene and Vir Biotechnologies; receives an education grant from Nestle Nutrition Sciences; serves on an advisory board for Novo Nordisk; and consults for Genfit, Third Rock Ventures, and Boehringer Ingelheim.
Writing Assistance
None.
Author Contributions
Authorship was determined using ICMJE recommendations.
Ge: Concept and design; data extraction; analysis and interpretation of data; drafting of manuscript; critical revision of the manuscript for important intellectual content; study supervision
Sun: Data extraction; analysis and interpretation of data; critical revision of the manuscript for important intellectual content
Owens: Concept and design; analysis and interpretation of data; critical revision of the manuscript for important intellectual content
Galvez: Analysis and interpretation of data; critical revision of the manuscript for important intellectual content
Gologorskaya: Analysis and interpretation of data; critical revision of the manuscript for important intellectual content
J. Lai: Concept and design; critical revision of the manuscript for important intellectual content; study supervision
Pletcher: Concept and design; critical revision of the manuscript for important intellectual content; study supervision
K. Lai: Concept and design; critical revision of the manuscript for important intellectual content; study supervision
Acknowledgements
- The authors thank the UCSF AI Tiger Team, Academic Research Services, Research Information Technology, and the Chancellor’s Task Force for Generative AI for their software development, analytical, and technical support related to the use of Versa API gateway (the UCSF secure implementation of large language models and generative AI via API gateway), Versa chat (the chat user interface), and related data assets.
Abbreviations
- AASLD: American Association for the Study of Liver Diseases
- APIs: application programming interfaces
- FHIR: Fast Healthcare Interoperability Resources
- HBV: hepatitis B
- HCC: hepatocellular carcinoma
- GAI: generative artificial intelligence
- GPT: generative pre-trained transformer
- LLMs: large language models
- PHI: protected health information
- RAG: retrieval-augmented generation
- UCSF: University of California, San Francisco