Abstract
Background and Aim In the last two years, natural language processing (NLP) has transformed significantly with the introduction of large language models (LLM). This review updates on NLP and LLM applications and challenges in gastroenterology and hepatology.
Methods Registered with PROSPERO (CRD42024542275) and adhering to PRISMA guidelines, we searched six databases for relevant studies published from 2003 to 2024, ultimately including 57 studies.
Results Our review notes an increase in relevant publications in 2023-2024 compared to previous years, reflecting growing interest in newer models such as GPT-3 and GPT-4. The results demonstrate that NLP models have enhanced data extraction from electronic health records and other unstructured medical data sources. Key findings include high precision in identifying disease characteristics from unstructured reports and ongoing improvement in clinical decision-making. Risk of bias assessments using ROBINS-I, QUADAS-2, and PROBAST tools confirmed the methodological robustness of the included studies.
Conclusion NLP and LLMs can enhance diagnosis and treatment in gastroenterology and hepatology. They enable extraction of data from unstructured medical records, such as endoscopy reports and patient notes, and for enhancing clinical decision-making. Despite these advancements, integrating these tools into routine practice is still challenging. Future work should prospectively demonstrate real-world value.
WHAT IS KNOWN
Natural language processing (NLP) models can extract data from unstructured medical records.
Large language models (LLMs) like GPT-3 and GPT-4 are increasingly used in healthcare.
NLP aids in identifying disease characteristics and improving clinical decision-making.
WHAT IS NEW HERE
This is the largest systematic review in the field of NLP and LLM applications in gastroenterology and hepatology, showing a notable increase in related publications during 2023-2024.
NLP models show high precision in extracting data from electronic health records.
Despite advancements, integrating these tools into routine clinical practice remains challenging.
Future studies need to demonstrate real-world value prospectively.
Introduction
Recent advances in Natural Language Processing (NLP) show potential for being integrated in the field of gastroenterology and hepatology (1,2). Since the last review in 2014 by Hou et al., which underscored the potential of NLP to enhance the field’s efficiency (2), there have been significant strides in technology, particularly with the emergence of Large Language Models (LLMs) such as Generative Pre-trained Transformer (GPT) and Bidirectional Encoder Representations from Transformers (BERT) (3). These developments have expanded the scope of NLP applications, from automating routine tasks to enabling complex diagnostic and therapeutic decisions (4).
NLP and LLMs extract and interpret data from patient records, notes, and reports (5–7). In gastroenterology and hepatology, they streamline the review of endoscopy, radiology, and pathology reports. This technology can help create research cohorts for clinical trials, flag complications, and support decision-making systems. Examples include managing complex conditions like IBD and hepatocellular carcinoma (5,7,8).
This review discusses the current applications and challenges of NLP and LLMs in gastroenterology and hepatology.
Methods
Registration and Protocol
This systematic literature review was registered with the International Prospective Register of Systematic Reviews, PROSPERO, under the registration code CRD42024542275 (9). Our methodology adhered to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (10).
Search Strategy
We conducted a rigorous search of six key databases (PubMed, Embase, Web of Science, and Scopus, Cochrane library and IEEE Xplore) for studies published until April 2024. Our focus was on the outcomes of integrating NLP and LLM models in gastroenterology and hepatology. We designed Boolean search strings tailored to each database. To maximize coverage, we supplemented our search with a comprehensive manual reference screening of included studies and targeted searches on Google Scholar. Details of the specific Boolean strings used are provided in the Supplementary Materials.
Study screening and selection
Our review encompasses original research articles, and full conference papers (11). The exclusion criteria were confined to preprints, review papers, case reports, commentaries, protocol studies, editorials, and non-English publications. For the initial screening, we used the Rayyan web application (12). The initial screening and study selection, which were conducted according to predefined criteria, were independently performed by two reviewers (MO and EK).
Discrepancies were resolved through discussion. Fleiss’ kappa was calculated for the agreement between the two independent reviewers.
Data Extraction
Data extraction was conducted by researchers MO and EK using a standardized form to ensure consistent and accurate data capture. This included details such as author, publication year, sample size, data type, task type, specific field, model used, results, numeric metrics, conclusions, and limitations. Any discrepancies in data extraction were resolved through discussion and a third reviewer was consulted when necessary.
Risk of Bias Assessment
To ensure a thorough evaluation of the included studies, we used three tools, each tailored to a specific study design within our review. The Risk Of Bias In Non-randomized Studies of Interventions (ROBINS-I) tool has been employed in interventional studies assessing NLP in applications such as management, prescription guidance, and clinical inquiry responses (13). For diagnostic studies where NLP models were compared with physicians or a reference standard for diagnosing and detection, the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool was used (14). Finally, the Prediction model Risk Of Bias ASsessment Tool (PROBAST) tool was utilized for the remaining studies, which involved NLP models prediction, without direct comparison to reference standards (15). This multitool approach allowed us to appropriately address the diverse methodologies and applications presented in the reviewed studies.
Results
Search Results and Study Selection
A total of 720 articles were identified through initial screening. After the removal of 114 duplicates, 606 articles remained for further evaluation. Title and abstract screening led to the exclusion of 524 articles, leaving 82 articles for full-text review. Of these, the reasons for exclusion and the number of articles excluded for each reason remain the same as described earlier. Ultimately, 55 studies met all inclusion criteria. By employing reference checking and snowballing techniques, two additional studies were identified, resulting in a final tally of 57 studies (16–72). A PRISMA flowchart visually represents the screening process in Figure 1. Fleiss’ kappa for the agreement between screeners was calculated as 0.957, which is considered very high (73).
An overview of the included studies
Our systematic review incorporates a total of 57 studies (16–72). Among these, a substantial majority, 49 studies, are centered on gastroenterology, while hepatology is the focus of 8 studies. These studies span from 2018 to 2024, with a notable increase in publications in the last two years, particularly between 2023 and 2024, which collectively account for 28 of the total included studies. This uptick highlights a growing interest in advanced NLP models like GPT-3 and GPT-4.
The models employed in these studies vary widely, with traditional NLP methods and more recent LLMs like GPT-3 and GPT-4. For instance, Kong et al. (2024) utilized GPT-4 among other versions for medical counseling (38), while Schneider et al. (2023) employed rule-based NLP algorithms for detecting undiagnosed hepatic steatosis (54).
Sample sizes in these studies range from very small datasets to large-scale analyses involving millions of data points, such as in the study by Schneider et al., which analyzed data from over 2.7 million imaging reports (54). The type of data analyzed also varies significantly, encompassing electronic health records (EHRs), pathology reports, and data generated from AI models responding to preset medical queries.
Tasks performed by these models are equally diverse, from diagnostic assistance and disease monitoring to providing patient education and supporting clinical decision-making. Specific examples include the work by Truhn et al. (2024), which focused on extracting structured data from colorectal cancer reports (49), and Lahat et al. (2023), who evaluated the utility of GPT models in answering patient questions related to gastroenterology (47).
Risk of bias
We used ROBINS-I, QUADAS-2, and PROBAST to map potential biases. Notably, most of the included studies were published in Q1 journals, affirming their scholarly impact and supported by strong SCImago Journal Rank (SJR) scores (Figure 2).
PROBAST Results (Table S1): This assessment mostly highlighted low-risk ratings in outcome and analysis domains. However, several studies encountered issues with high participant-related applicability biases, influencing the generalizability of their findings.
QUADAS-2 Results (Table S2): A synthesis of QUADAS-2 results revealed that most studies (20 out of 32) exhibited low risk of bias across all four assessed domains. This underscores their methodological robustness and reliability. However, three studies were identified as having a high risk of bias in one of the four categories. Patient selection applicability concerns were notable, primarily due to the reliance on single-center data with specific documentation styles, which may limit the broader applicability of these findings.
ROBINS-I Results (Table S3): Analysis of ROBINS-I revealed that 14 studies displayed a moderate risk of bias overall, while one study exhibited a high risk. This was largely due to biases in the selection of participants into the study and confounding factors, particularly because many studies utilized specific questions, queries, or fictional vignettes and case scenarios. Despite these concerns, the other assessment categories predominantly showed low risk. Nonetheless, six studies demonstrated low risk across all evaluated domains.
NLP Applications
We categorized the applications of the NLP and LLM models under three main categories for a synthesized analysis of the results: Disease Detection and Diagnosis (n = 30), Patient Care (n = 22), and Education and Research applications (n = 5). Disease Detection and Diagnosis was further divided into Colonoscopy Reports and Other Diagnostic Applications, while Patient Care included Management and Communication and Clinical Decision Support, focusing on patient-oriented and healthcare professionals-oriented applications, respectively (Figure 3).
Disease Detection and Diagnosis
Most of the studies evaluated NLP models in extracting data from unstructured colonoscopy reports (n=17) (Figure 4). Nonetheless, there were many unique applications.
Colonoscopy Reports: This category, which includes 17 studies, primarily explored NLP’s role in enhancing the interpretation of unstructured colonoscopy reports. Various quality and diagnostic measures were evaluated, such as the adenoma detection rate (ADR), a frequent subject of investigation. For instance, Nayor et al. reported that their NLP pipeline achieved high precision and recall in the automated calculation of ADR (57). Other assessments included polyp detection and sizing, with Imler et al. demonstrating accuracies of 98% for pathology level identification and 96% for size estimation (58). Additionally, Raju et al. noted that NLP matched or exceeded manual methods in identifying and categorizing adenomas with a detection rate of 43% (59). Overall, NLP models showed a broad range of accuracies from 84% to 100%, consistently outperforming manual review methods. Despite needing GPUs, these models reduce the time and effort of manual evaluations.
Other Diagnostic Applications: Beyond colonoscopy, NLP was applied to a diverse array of diagnostic contexts in gastroenterology and hepatology.
In gastroenterology, several innovative NLP applications have emerged. For example, Wenker et al. utilized NLP to identify dysplasia in Barrett’s Esophagus from esophagogastroduodenoscopy (EGD) reports with a high accuracy of 98.7% (69). Song et al. developed a model to extract detailed clinical information such as disease presence, location, size, and stage from unstructured EGD reports, achieving high sensitivity, precision, and accuracy scores (61). Denny et al. applied NLP to enhance colorectal cancer screening by identifying references to four CRC tests within electronic clinical documentation, demonstrating superior recall compared to traditional manual and billing record reviews (63). Additionally, Blumenthal et al. and Parthasarathy et al. used NLP for patient monitoring, with the former detecting non-adherence to follow-up colonoscopies with an AUC of 70.2%, and the latter identifying patients meeting WHO criteria for serrated polyposis syndrome with 93% accuracy (18,65).
For IBDs, Stidham et al. utilized NLP to detect and infer the activity status of extraintestinal manifestations from clinical notes, enhancing detection accuracy to 94.1% and specificity to 95% (52). Ananthakrishnan et al. explored improving case definitions for Crohn’s disease and ulcerative colitis by combining codified data with narrative clinical texts, which identified 6-12% more patients than models using codified data alone, with AUCs of 0.95 for Crohn’s disease and 0.94 for ulcerative colitis (53).
In hepatology, NLP has facilitated significant advancements in disease identification and progression monitoring. Sada et al. combined NLP with ICD-9 codes to improve the identification of hepatocellular carcinoma cases from EHR data, significantly enhancing sensitivity and specificity, with an F2 score of 0.92 (71). Van Vleck et al. employed NLP to track disease progression in patients with non-alcoholic fatty liver disease (NAFLD), demonstrating superior sensitivity and F2 scores compared to traditional methods, effectively identifying disease progression from NAFLD to NASH or cirrhosis with sensitivity of 0.93 and an F2 score of 0.92 (30). Furthermore, Sherman et al. developed an NLP model capable of automatically scoring and classifying histological features found in pathology reports related to metabolic associated steatohepatitis (17). The goal was to estimate the risk of progression towards cirrhosis. The model demonstrated high positive and negative predictive values, ranging from 93.5% to 100%, across various histological features (17). Importantly, this NLP model facilitated the creation of a large and quality-controlled cohort of MASLD patients (17).
Patient Care
The patient care section is subdivided into two categories: patient management and communication, which comprises 13 studies, and clinical decision support, encompassing 9 studies.
Management and Communication: This category explores the use of NLP and LLMs in facilitating communication and management.
In gastroenterology, studies like Lahat et al. evaluated ChatGPT’s ability to answer real-life gastroenterology-related patient queries, achieving moderate effectiveness with accuracy scores ranging from 3.4 to 3.9 (47). Choo et al. reported an 86.7% concordance rate between ChatGPT’s recommendations for managing complex colorectal cancer cases and decisions made by multidisciplinary teams (39). Furthermore, Lim et al. demonstrated that a contextualized GPT-4 model provided accurate colonoscopy interval advice, significantly outperforming standard models by adhering closely to established guidelines (33). Imler et al. used the cTAKES system to achieve an 81.7% agreement with guideline-adherent colonoscopy surveillance intervals, substantially surpassing manual review accuracies (36). However, studies like Huo et al. and Atarere et al. indicated variability in ChatGPT’s performance, suggesting the need for enhancements in AI consistency and reliability (25,44). In the area of IBD, Zand et al. developed an NLP model that categorized electronic dialog data, showing a 95% agreement with physician evaluations and underscoring the potential of automated chatbots in patient interaction (23). Sciberras et al. found ChatGPT to provide highly accurate (84.2%) and moderately complete responses to patient inquiries about IBD, with particular strengths in topics like smoking and medication (20).
In hepatology, Yeo et al. tested GPT’s proficiency in delivering emotional support and accurate information on cirrhosis and hepatocellular carcinoma, achieving correct response rates of 79.1% for cirrhosis and 74% for carcinoma (29). Samaan et al. explored GPT’s effectiveness in Arabic, noting a 72.5% accuracy rate, though it was less accurate than its English counterpart, indicating disparities in language performance (34).
Clinical Decision Support: NLP models were tested for their accuracy and effectiveness in decision-making scenarios. For example, Kong et al. evaluated LLMs’ capability to provide counseling on Helicobacter pylori, noting that while accuracy was generally high (90% acceptable responses), completeness needed improvement (38). Li et al.’s integration of NLP with machine learning for predicting liver metastases showed impressive results with accuracy and F1 scores around 80.4% (24). The study by Becker et al. utilized an NLP pipeline tailored for German, achieving high precision and recall in guideline-based treatment extraction from clinical notes (64). Further, Wang et al.’s “DeepCausality” framework accurately assessed causal factors for drug-induced liver injuries, aligning well with clinical guidelines (41). Another significant study, Wagholikar et al., demonstrated that an NLP-powered clinical decision support system could assist in making guideline-adherent recommendations for colonoscopy surveillance, as it made optimal recommendations in 48 out of 53 cases (35).
Education and Research
Five studies focused on this aspect. Generaly, NLP and LLMs have demonstrated a promising capacity to enhance learning and knowledge dissemination. Benedicenti et al. explored the accuracy of ChatGPT in solving clinical vignettes against gastroenterologists, noting an initial 40% accuracy that improved to 65% over time, suggesting a potential for future clinical integration with continued advancements (56). Zhou et al. assessed GPT-3.5 and GPT-4 for their ability to provide consultation recommendations and analyze gastroscopy reports related to gastric cancer, with GPT-4 achieving 91.3% appropriateness and 95.7% consistency (48). Lahat et al. utilized GPT to generate research questions in gastroenterology, finding the questions relevant and clear but lacking in originality (46). Meanwhile, Gravina et al. highlighted the efficacy of ChatGPT 3.5 in medical education, as it outperformed Perplexity AI in residency exam questions with a 94.11% accuracy rate (32). Additionally, Pradhan et al. compared AI-generated patient educational materials on cirrhosis with human-derived content, finding no significant differences in readability or accuracy, though human materials were deemed more actionable (28).
Discussion
Our systematic review assessed the integration of NLP and LLMS in gastroenterology and hepatology, registering significant advancements. We reviewed 57 studies, highlighting a sharp increase in research over the last two years, particularly focusing on newer models like GPT-3 and GPT-4. These studies reflect a shift from traditional tasks, such as report analysis, to more dynamic roles in patient management and research facilitation.
The results show that certain NLP applications seem ready for immediate clinical use. For example, Schneider et al. (2023) identified 42,000 hepatic steatosis cases using an NLP model on 2.15 million pathology reports and 2.7 million imaging reports. This level of precision (PPV 99.7%) exemplifies NLP’s readiness to support diagnostic processes in large-scale healthcare settings. Similarly, Truhn et al. (2024) successfully employed GPT-4 to extract structured data from colorectal cancer reports with a precision of 99% for T-stage identification, suggesting a high reliability of NLP in processing and structuring complex pathological data.
Conversely, the technology’s expansion into more dynamic roles such as comprehensive disease management and holistic patient care is still evolving. For instance, Kong et al. (2024) found that while the accuracy and comprehensibility of GPT-4’s responses to medical inquiries about Helicobacter pylori were high, the completeness of the information was less satisfactory. This indicates ongoing challenges in ensuring that NLP outputs are not only accurate but also fully informative.
Our results suggest that both classic NLP methods and newer models can be effectively integrated to streamline manual tasks such as extracting data and making diagnoses from complex and unstructured reports, with an accuracy that typically surpasses manual screening (16–18,21,22,27,33). This builds upon and adds on a previous systematic review of NLP in gastroenterology and hepatology conducted by Hou et al. (2). While he found promising results, he emphasized the need for careful consideration of the quality of clinical data within EHRs, and also highlighted the importance of understanding variations and deviations from established clinical practice standards (2). Our updated and more comprehensive results indicate that these models consistently demonstrate high accuracies (16–18,21,22,27,33). This trend is observable in other fields utilizing NLP, such as radiology and infectious diseases (74,75). However, our research suggests that applying these methods to more complex tasks like patient management, education, and clinical decision-making is still challenging (20,29,34,37). While newer models show promising results, there are significant limitations and variability that require further development (67). This trend is consistent with data and the current findings from other fields (76,77).
Several limitations of our review must be acknowledged. Many studies utilize single-institution datasets, which could affect the generalizability of the findings. The accuracy of NLP outputs is heavily dependent on the quality of the input data, with errors or inconsistencies in medical records potentially leading to inaccurate results (78). The opaque nature of AI decision-making processes (’black box’) raises concerns about the transparency and trustworthiness of these models in clinical settings (79). Finally, ethical considerations around potential biases in training data and algorithmic outputs underscore the necessity for careful implementation to ensure fairness and equity in healthcare delivery (80).
In conclusion,
our systematic review highlights the impact of NLP and LLMs in gastroenterology and hepatology. On one hand, NLP has already proven its utility in screening and analyzing medical reports, facilitating streamlined screening policies with impressive outcomes. On the other hand, the capabilities of newer LLMs are still unfolding, with their full potential in complex management and research roles yet to be fully realized.
The results demonstrate that while some applications of NLP are well-established and highly effective, newer LLMs offer exciting, emerging applications that promise to further enhance clinical practice. Moving forward, research focus should be on refining these models to ensure prospectively they meet real-world clinical needs.
Data Availability
All data produced in the present study are available upon reasonable request to the authors
Financial disclosure
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Competing interest
None declared.
Ethical approval was not required for this research.
Data Sharing Statement
This manuscript is a systematic review; therefore, all data utilized are available in the published articles that were reviewed. As this study did not generate new data, but rather analyzed existing published data, there are no additional datasets to share.
Contributorship
M.O: Responsible for writing the initial draft and the final version of the manuscript, gathering and extracting data, and visualizing the results.
K.S: Provided oversight and editing of the manuscript.
B.G: Provided oversight and editing of the manuscript.
G.N: Oversaw the entire project.
E.K: Involved in data extraction, editing the final manuscript, and validation of the data.
Specific Booleans used in each database
PubMed
Embase
(’natural language processing’ OR ‘nlp’ OR ‘bert’ OR ‘gpt-3’ OR ‘gpt’ OR ‘chatgpt’ OR ‘gpt-3.5’ OR ‘gpt-4’ OR ‘transformer language models’ OR ‘ai language technology’) AND (’gastroenterology’ OR ‘hepatology’ OR ‘gastrointestinal oncology’ OR ‘endoscopic imaging’ OR ‘colorectal cancer screening’ OR ‘hepatocellular carcinoma’ OR ‘liver fibrosis’ OR ‘inflammatory bowel disease management’ OR ‘ibd’ OR ‘crohns disease’ OR ‘ulcerative colitis treatment’)
AND (2003:py OR 2004:py OR 2005:py OR 2006:py OR 2007:py OR 2008:py OR 2009:py OR 2010:py OR 2011:py OR 2012:py OR 2013:py OR 2014:py OR 2015:py OR 2016:py OR 2017:py OR 2018:py OR 2019:py OR 2020:py OR 2021:py OR 2022:py OR 2023:py OR 2024:py) AND [embase]/lim NOT ([embase]/lim AND [medline]/lim) AND (’article’/it OR ‘conference paper’/it)
Scopus
(TITLE-ABS-KEY((”natural language processing” OR “NLP” OR “large language models” OR “LLM” OR “BERT” OR “GPT-3” OR “GPT” OR “ChatGPT” OR “GPT-3.5” OR “GPT-4” OR “transformer language models”) AND (”Gastroenterology” OR “Hepatology” OR “peptic ulcer disease” OR “PUD” OR “Endoscopy” OR “colorectal cancer” OR “hepatocellular carcinoma treatment” OR “IBD predictive modeling” OR “endoscopic imaging”))) AND PUBYEAR > 2002 AND (NOT DOCTYPE(re) OR DOCTYPE(le) OR DOCTYPE(ed) OR DOCTYPE(cr) OR DOCTYPE(cp))
Web of science
TS=(”natural language processing” OR “NLP” OR “large language models” OR) “LLM” OR “BERT” OR “GPT-3” OR “GPT” OR “ChatGPT” OR “GPT-3.5” OR “GPT-4” OR “transformer language models”) AND TS=(”Gastroenterology” OR “Hepatology” OR “peptic ulcer disease” OR “PUD” OR “Endoscopy” OR “colorectal cancer” OR “hepatocellular carcinoma treatment” OR “IBD predictive modeling” OR “endoscopic imaging”)) AND PY=(2003-2023) NOT DT=(”Review” “(OR “Editorial Material” OR “Letter” OR “Meeting Abstract” OR “News Item”)
Cochrane library
(”natural language processing” OR “NLP” OR “large language models” OR “LLM” OR “BERT” OR “GPT-3” OR “GPT” OR “ChatGPT” OR “GPT-3.5” OR “GPT-4” OR “transformer language models”) AND (”gastroenterology” OR “hepatology” OR “digestive health” OR “gastrointestinal oncology” OR “endoscopic imaging” OR “colorectal cancer screening” OR “hepatocellular carcinoma” OR “liver fibrosis” OR “inflammatory bowel disease management” OR “IBD” OR “Crohn’s disease” OR “ulcerative colitis treatment”)
IEEE Xplore
(”natural language processing” OR “NLP” OR “large language models” OR “LLM” OR “BERT” OR “GPT-3” OR “GPT” OR “ChatGPT” OR “GPT-3.5” OR “GPT-4” OR “transformer language models”) AND (”gastroenterology” OR “hepatology” OR “digestive health” OR “gastrointestinal oncology” OR “endoscopic imaging” OR “colorectal cancer screening” OR “hepatocellular carcinoma” OR “liver fibrosis” OR “inflammatory bowel disease management” OR “IBD” OR “Crohn’s disease” OR “ulcerative colitis treatment”)
Acknowledgment
none.