Large language models for conducting systematic reviews: on the rise, but not yet ready for use – a scoping review

Judith-Lisa Lieberum; Markus Töws; Maria-Inti Metzendorf; Felix Heilmeyer; Waldemar Siemens; Christian Haverkamp; Daniel Böhringer; Joerg J. Meerpohl; Angelika Eisele-Metzger

doi:10.1101/2024.12.19.24319326

ABSTRACT

Background Machine learning (ML) promises versatile help in the creation of systematic reviews (SRs). Recently, further developments in the form of large language models (LLMs) and their application in SR conduct have attracted attention.

Objective To provide an overview of ML and specifically LLM applications in SR conduct in health research.

Study design We systematically searched MEDLINE, Web of Science, IEEEXplore, ACM Digital Library, Europe PMC (preprints), Google Scholar, and conducted an additional hand search (last search: 26 February 2024). We included scientific articles in English or German, published from April 2021 onwards, building upon the results of a mapping review with a related research question. Two reviewers independently screened studies for eligibility; after piloting, one reviewer extracted data, checked by another.

Results Our database search yielded 8054 hits, and we identified 33 articles from our hand search. Of the 196 included reports, 159 described more traditional ML techniques and 37 focused on LLMs. LLM approaches covered 10 of 13 defined SR steps, most frequently literature search (n=15, 41%), study selection (n=14, 38%), and data extraction (n=11, 30%). The most frequently used LLM was GPT (n=33, 89%). Validation studies were predominant (n=21, 57%). In about half of the studies, authors evaluated LLM use as promising (n=20, 54%), in about one quarter as neutral (n=9, 24%), and in about one fifth as non-promising (n=8, 22%).
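The percentages above appear to use the 37 LLM-focused reports as the denominator. The following is a minimal sketch for re-checking that arithmetic, not the authors' analysis code; the dictionary labels and the `percent` helper are illustrative names introduced here, and only the counts and percentages come from the abstract.

```python
# Counts and percentages as reported in the abstract.
# Assumption: every percentage uses the 37 LLM-focused reports as denominator.
LLM_REPORTS = 37
ML_REPORTS = 159
INCLUDED_REPORTS = 196

# label -> (reported n, reported percentage); labels are illustrative only
reported = {
    "literature search": (15, 41),
    "study selection": (14, 38),
    "data extraction": (11, 30),
    "GPT used": (33, 89),
    "validation study": (21, 57),
    "rated promising": (20, 54),
    "rated neutral": (9, 24),
    "rated non-promising": (8, 22),
}


def percent(n, total=LLM_REPORTS):
    """Return n / total as a percentage rounded to the nearest integer."""
    return round(100 * n / total)


# Each reported percentage should match the recomputed value.
mismatches = {
    label: (n, pct, percent(n))
    for label, (n, pct) in reported.items()
    if percent(n) != pct
}

# The two report groups should add up to the number of included reports.
groups_add_up = (ML_REPORTS + LLM_REPORTS) == INCLUDED_REPORTS

# The three author ratings should partition the LLM reports.
ratings_total = sum(
    reported[key][0]
    for key in ("rated promising", "rated neutral", "rated non-promising")
)

print("mismatches:", mismatches)
print("groups add up:", groups_add_up)
print("ratings total:", ratings_total)
```

Note that the three most frequent SR steps alone sum to 40, which exceeds 37, so a single report can evidently address more than one step; the step percentages are therefore not expected to sum to 100.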

Conclusions Although LLMs show promise in supporting SR creation, fully established or validated applications are often lacking. The rapid increase in research on LLMs for evidence synthesis production highlights their growing relevance.

HIGHLIGHTS

Machine learning (ML) offers promising support for systematic review (SR) creation.
GPT was the most commonly used large language model (LLM) to support SR production.
LLM applications covered 10 of 13 defined SR steps, most often literature search.
Validation studies predominated, but fully established LLM applications are rare.
LLM research for SR conduct is surging, highlighting its increasing relevance.

Competing Interest Statement

The authors have declared no competing interest.

Clinical Protocols

https://osf.io/asjm3

Funding Statement

This work was supported by the Research Commission at the Faculty of Medicine, University of Freiburg, Freiburg, Germany (grant no. EIS2244/23).

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability

Our preregistered study protocol and the extracted LLM and ML data (as Excel spreadsheets) can be found on OSF (https://osf.io/vdsgb). The supplementary material includes our final search strategy and deviations from the preregistered study protocol.


ABBREVIATIONS AND GLOSSARY

BART: bidirectional and auto-regressive transformers
BERT: bidirectional encoder representations from transformers
DEST-Eppi-Vis: digital evidence synthesis tool evaluations
EBM: evidence-based medicine
GPT: generative pretrained transformer
GRADE: grading of recommendations, assessment, development and evaluation
JBI: Joanna Briggs Institute
KNN: K-nearest neighbors
LaMDA: language model for dialogue applications
LLaMA: large language model Meta AI
LLM: large language model
ML: machine learning
NLP: natural language processing
OSF: open science framework
PaLM: pathways language model
PCC: population, concept, context
PLS: plain language summary
PRISMA-ScR: preferred reporting items for systematic reviews and meta-analyses extension for scoping reviews
RoB: risk of bias
SR: systematic review

The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.