Abstract
Background The assessment of risk of bias is a critical component of systematic review methods. Assessing risk of bias, however, can be time- and resource-intensive. AI-based solutions may increase efficiency and reduce burden.
Objective To evaluate the reliability of ChatGPT for performing risk of bias assessments of randomized trials.
Methods We sampled recently published Cochrane systematic reviews of medical interventions (up to October 2023) that included randomized controlled trials and assessed risk of bias using the Cochrane-endorsed revised risk of bias tool for randomized trials (RoB 2.0). From each eligible review, we collected data on the risk of bias assessments for the first three reported outcomes. Using ChatGPT-4, we assessed the risk of bias for the same outcomes using three different prompts: a minimal prompt including limited instructions, a maximal prompt with extensive instructions, and an optimized prompt that was designed to yield the best risk of bias judgments. The agreement between ChatGPT’s assessments and those of the systematic reviewers was quantified using weighted kappa statistics.
Results We included 34 systematic reviews with 157 unique trials. We found the agreement between ChatGPT and systematic review authors for assessment of overall risk of bias to be 0.16 (95% CI: 0.01 to 0.30) for the maximal ChatGPT prompt, 0.17 (95% CI: 0.02 to 0.32) for the optimized prompt, and 0.11 (95% CI: -0.04 to 0.27) for the minimal prompt. For the optimized prompt, agreement ranged from 0.11 (95% CI: -0.11 to 0.33) to 0.29 (95% CI: 0.14 to 0.44) across risk of bias domains, with the lowest agreement for the deviations from the intended intervention domain and the highest agreement for the missing outcome data domain.
Conclusion Our results suggest that ChatGPT and systematic reviewers only have “slight” to “fair” agreement in risk of bias judgments for randomized trials. ChatGPT is currently unable to reliably assess risk of bias of randomized trials. We recommend systematic reviewers avoid using ChatGPT to perform risk of bias assessments.
Background
The practice of evidence-based medicine demands knowledge of the best available evidence, which most often comes from rigorous systematic reviews and meta-analyses (1). Systematic reviews, however, are time- and resource-intensive. Empirical evidence suggests they typically require upwards of one year to complete and publish and many are outdated at or shortly following publication (2, 3).
One particular time- and resource-intensive component of systematic reviews is the assessment of risk of bias of primary studies—defined as the propensity for studies to systematically over- or underestimate treatment effects (4). Risk of bias assessments are burdensome and time-consuming and demand specialized training. Moreover, to reduce the opportunity for errors, guidance for conducting rigorous systematic reviews typically suggests authors assess risk of bias independently and in duplicate, adding to the complexity and workload of the process (4).
In 2019, a new risk of bias tool was introduced that built on the successes of the previous Cochrane-endorsed risk of bias tool but also incorporated new advancements (5). This tool, the revised tool for assessing risk of bias of randomized trials (RoB 2.0), has now become the gold standard (4). The RoB 2.0 tool rates risk of bias as either high, some concerns, or low across five domains: randomization, deviations from intended intervention, missing outcome data, measurement of outcome, and selective reporting. The overall rating of risk of bias is determined by the domain rated at highest risk of bias.
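The rule in the last sentence above can be sketched in a few lines of Python. This is an illustrative sketch of the rule as stated in this paragraph only, not an implementation of the full RoB 2.0 algorithm; the function name and the dictionary layout are hypothetical.

```python
# Ordinal ordering of RoB 2.0 judgments, from least to most concerning.
LEVELS = ["low", "some concerns", "high"]


def overall_risk_of_bias(domain_ratings):
    """Return the most severe rating among the domain-level judgments.

    Assumption: the overall rating equals the worst domain rating, as
    described in the text. Unknown labels raise ValueError.
    """
    return max(domain_ratings.values(), key=LEVELS.index)


# Hypothetical trial with one domain rated "some concerns".
example = {
    "randomization": "low",
    "deviations from intended intervention": "some concerns",
    "missing outcome data": "low",
    "measurement of outcome": "low",
    "selective reporting": "low",
}
print(overall_risk_of_bias(example))  # some concerns
```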
While the RoB 2.0 tool builds off a decade’s worth of experience with the original risk of bias tool, recent evidence suggests that reviewers find it more complex and time-consuming (6, 7). Innovations to streamline and simplify risk of bias assessments without compromising their rigor will reduce the time and effort required to perform systematic reviews and aid in maintaining their currency.
Previous efforts to streamline and automate risk of bias assessments have shown promising results (8-13), suggesting that such endeavors may be feasible. For example, RobotReviewer is an automated tool to extract data from and assess the risk of bias of randomized trials (8, 11, 12). RobotReviewer, however, was trained on the original Cochrane risk of bias tool and only offers judgments on four of the seven domains of the original tool.
ChatGPT (OpenAI, San Francisco, California, USA) is a conversational artificial intelligence (AI) large language model with capabilities in natural language processing and realization (14). Unlike specialized tools for risk of bias assessments, ChatGPT is a general purpose tool, has been developed to emulate human language rather than risk of bias assessments, and has been trained on an internet-scale corpus covering many areas of knowledge, rather than a small training set focused on evidence synthesis and evaluation (14).
Nevertheless, ChatGPT has been shown to perform remarkable tasks, many of which are similar to performing risk of bias assessments, including passing the United States Medical Licensing exams (15), performing accurate diagnoses (16), and offering medical advice comparable to physicians (17). Further, ChatGPT has been able to construct reasonable search strategies for systematic reviews (18) and other tasks for which it was not intentionally designed (19), suggesting that it may also be able to assess risk of bias despite not originally being designed for this task.
This study evaluates the performance of ChatGPT, an AI-based language model, for assessing risk of bias of randomized trials using the RoB 2.0 tool. To do this, we sampled Cochrane systematic reviews using the RoB 2.0 tool and had ChatGPT assess the risk of bias of the trials within these reviews. We compared ChatGPT’s assessment with those presented in Cochrane reviews. Consistency in assessments of risk of bias between ChatGPT and Cochrane reviewers will suggest that ChatGPT can provide a reliable assessment of the risk of bias of randomized trials. Conversely, discrepancies in risk of bias assessments between ChatGPT and Cochrane reviewers will suggest that ChatGPT is unreliable for assessing risk of bias.
Methods
We registered our protocol on Open Science Framework (https://osf.io/aq85p) in September 2023. We report our study according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) and Guidelines for Reporting Reliability and Agreement Studies (GRRAS) reporting checklists (20, 21).
This study does not involve human participants and is thus exempt from ethics review.
Figure 1 presents an overview of our methods.
Search strategy and screening
For this study, we intended to include a reasonably representative sample of Cochrane systematic reviews. We did not perform a search of medical research databases. Instead, we used the Cochrane Database of Systematic Reviews (CDSR) that provides a chronological catalogue of published and updated Cochrane systematic reviews to identify eligible reviews.
Reviewers worked independently and in duplicate to screen Cochrane reviews for eligibility, starting with the most recently published (August 2023) and working backwards in time. We preferentially included the most recently published Cochrane systematic reviews since these reviews are most likely to have used the most up-to-date version of the RoB 2.0 tool instead of preliminary pilot versions of the tool (5). Reviewers continued screening until we had identified our target sample size of approximately 160 trials.
Eligibility criteria
Our sampling approach was designed to include randomized trials addressing a diverse range of questions (i.e., selected from different systematic reviews) and both dichotomous and continuous outcomes.
We included newly published or updated Cochrane systematic reviews addressing the benefits and/or harms of health interventions that included one or more parallel randomized trials and reported consensus-based risk of bias judgments using the Cochrane-endorsed RoB 2.0 tool (5). We define consensus-based as two reviewers agreeing on the final risk of bias judgments. This may involve two reviewers independently assessing risk of bias and resolving conflicts by discussion or a reviewer assessing risk of bias and a second reviewer confirming the first reviewer’s judgments.
We excluded systematic reviews that were not published by Cochrane, since such reviews may not involve reviewers with sufficient training to appropriately apply the RoB 2.0 tool. We also excluded Cochrane systematic reviews that investigated prognosis or the performance of diagnostic tests and systematic reviews that only include observational studies since these reviews will necessitate the use of other risk of bias tools.
Cochrane systematic reviews use summary of findings tables to present their results (4, 22). These tables list outcomes in order of importance, the number of trials and patients that contributed data to the meta-analysis for each outcome, the relative and absolute effect estimates based on meta-analyses, and judgments about the certainty of evidence (4, 22). From each eligible review, we selected the first two listed outcomes (suggesting that they are the most important) that were informed by one or more trials. If either of the first two outcomes were continuous, we then selected the third outcome listed in the summary of findings table. If the two reported outcomes were both dichotomous, we then selected the first listed continuous outcome reported in the summary of findings table. When summary of findings tables reported on the same outcome at different timepoints, we selected entirely unique outcomes.
From each review, we included all parallel randomized trials published in English that were included in analyses addressing the outcomes of interest. We excluded crossover and cluster randomized trials since these trial designs require unique considerations in their assessment of risk of bias and different versions of the RoB 2.0 tool.
Cochrane reviews often include unpublished trial data. When reviews reported that information for a particular trial was unpublished or was drawn from a combination of unpublished and published data, we excluded those trials since we did not have access to the same unpublished information as the Cochrane reviewers for risk of bias assessments. For feasibility, we also excluded trials for which data was drawn from multiple publications. Including such trials would have necessitated an exhaustive review of all related publications to identify those containing the outcome data and the comprehensive details required for risk of bias assessment.
ChatGPT prompts
A key component in the use of ChatGPT is the design of the text used to instruct the model (called ‘prompts’) to generate an answer. We anticipated that ChatGPT’s risk of bias judgments may depend on the nature of the prompts that it is provided. To study how different prompts may influence risk of bias judgments, we iteratively designed three different prompts: a minimal prompt including limited instructions for assessing risk of bias, a maximal prompt with extensive instructions, and an optimized prompt that was designed to include sufficient information to yield the best risk of bias judgments.
We piloted the prompts using 15 trials drawn from systematic reviews previously performed by our own team and refined the prompts by iterative discussion and input by the co-authors (23-25). All prompts asked ChatGPT to judge risk of bias for all RoB 2.0 domains (bias due to randomization, deviation from intended intervention, missing outcome data, measurement of outcome, and selective reporting) as low risk of bias, some concerns, or high risk of bias—consistent with RoB 2.0 guidance (5). Supplement 1 presents these three prompts.
The RoB 2.0 tool is accompanied by a document that describes the tool and offers guidance on its implementation. All three prompts included the RoB 2.0 full guidance document (riskofbias.info), which was fed to ChatGPT using the AskYourPDF ChatGPT plugin that allows ChatGPT to read and query PDF documents. All prompts also included a PDF copy of the trial publication, a PDF copy of the trial registration or protocol (if one was available), and specified the outcome of interest for which risk of bias assessment was being performed.
The RoB 2.0 tool offers two options for assessing the risk of bias due to deviations of the intended intervention: one for the effect of assignment to the intervention and the other for the effect of adhering to the intervention. In Cochrane systematic reviews, the subsection on risk of bias typically reports whether Cochrane reviewers assessed risk of bias for the effect of assignment or adherence to the intervention. Our ChatGPT prompts also specified whether to assess risk of bias for the effect of assignment or adherence to the intervention, depending on the option selected by the Cochrane review authors. For systematic reviews that failed to specify whether they assessed risk of bias for the effect of being assigned to the intervention or adherence to the intervention, we assumed they assessed risk of bias for assignment to the intervention.
The ChatGPT prompts do not include any information related to the consensus-based risk of bias judgments presented in the systematic reviews. Hence, ChatGPT is ‘blind’ to the risk of bias judgments that are presented in the review.
Data collection
RoB 2.0 guidance demands that reviewers perform risk of bias judgments for each particular result rather than each trial or outcome, since risk of bias may differ across outcomes in a trial or across different ways of statistically summarizing the results for the same outcome (5). We took this approach in this study. For each eligible trial and outcome, we collected information on the consensus-based risk of bias judgments presented in the Cochrane systematic reviews. Subsequently, for each eligible trial, we used the ChatGPT-4 chatbot to assess the risk of bias of the outcomes of interest, using each of the three ChatGPT prompts. ChatGPT-4 is a more advanced iteration of its predecessor ChatGPT-3. Unlike ChatGPT-3, ChatGPT-4 is only available with a paid subscription to OpenAI. We implemented each of the prompts in unique chats.
We did not collect data in duplicate because the nature of the data did not require any subjective judgments and we anticipated that the only potential source of error is mistakes in copying and pasting prompts to the ChatGPT interface, which we deemed unlikely.
We anticipated that the reliability of ChatGPT may depend on the objectivity of the outcome for which risk of bias is being assessed. We considered outcomes objective if they were based on established laboratory measures or if they were not subject to interpretation by patients or healthcare providers. Conversely, we considered outcomes subjective if they were patient-reported or subject to interpretation by patients or healthcare providers. We classified outcomes as definitely objective (e.g., mortality), probably objective (e.g., unscheduled physician visits), probably subjective (e.g., serious adverse events), or definitely subjective (e.g., quality of life) to facilitate stratified analyses based on the degree of objectivity of the outcome.
Data synthesis and analysis
Sample size estimation
We used the kappaSize package in R (Vienna, Austria, Version 4.1.3) to estimate sample size (26). We aimed to calculate the number of required trials to obtain a sufficiently precise estimate of a value of kappa for which systematic reviewers will feel confident using ChatGPT for risk of bias assessments. We assumed that most reviewers will feel confident using ChatGPT for risk of bias assessments if it yields a kappa of 0.70, indicating substantial agreement, with the lower bound of the confidence interval no less than 0.55. We anticipated the risk of bias distribution to be approximately 30% low, 30% with some concerns, and 40% high.
We inflated the estimated sample size by a design effect to account for correlation between the risk of bias of trials from the same review. We assumed an intra-review correlation of 0.05 and an average of 10 trials per review, yielding a design effect of 1.45. This resulted in a minimum sample size of 120 trials from 12 reviews. We investigated the sensitivity of our estimated sample size to different assumptions about the anticipated distribution of risk of bias judgments across the three categories and the potential correlation between trials from the same review. To account for other potential scenarios (e.g., kappa = 0.6, intrareview correlation of 0.1), we ultimately intended to include approximately 160 trials from 16 reviews.
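The design effect reported above is consistent with the standard formula 1 + (m - 1) × ρ, where m is the average number of trials per review and ρ is the intra-review correlation. The sketch below assumes that formula (the text does not state it explicitly) and reproduces the reported value of 1.45, along with the more conservative scenario of ρ = 0.1. The function name is hypothetical.

```python
def design_effect(mean_cluster_size, intra_cluster_correlation):
    """Design effect for clustered observations.

    Assumption: the standard formula 1 + (m - 1) * rho, which matches
    the value of 1.45 reported in the text.
    """
    return 1 + (mean_cluster_size - 1) * intra_cluster_correlation


# Base case from the text: 10 trials per review, correlation of 0.05.
base_case = design_effect(10, 0.05)
# Sensitivity scenario mentioned in the text: correlation of 0.1.
sensitivity_case = design_effect(10, 0.1)

print(round(base_case, 2))         # 1.45
print(round(sensitivity_case, 2))  # 1.9
```

An unadjusted sample size would then be multiplied by this factor and rounded up to obtain the number of trials required under clustering.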
Agreement between ChatGPT and consensus-based risk of bias assessments
We present the inter-rater agreement, represented by weighted kappa, between each of the three ChatGPT prompts and consensus-based risk of bias judgments from Cochrane authors. Unlike percentage agreement, the weighted kappa accounts for the possibility of agreement due to chance and for the ordinal nature of the response options of the RoB 2.0 tool (low risk of bias, some concerns, high risk of bias) (27).
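To make the statistic concrete, the following self-contained sketch computes Cohen's weighted kappa for two raters on the three ordinal RoB 2.0 categories. The weighting scheme is an assumption, since the text does not state whether linear or quadratic weights were used; the `power` argument (hypothetical) switches between the two. The ratings in the example are invented for illustration and are not study data.

```python
CATEGORIES = ["low", "some concerns", "high"]


def weighted_kappa(ratings_a, ratings_b, categories=CATEGORIES, power=1):
    """Cohen's weighted kappa for two raters on an ordinal scale.

    power=1 gives linear disagreement weights, power=2 quadratic ones.
    Which scheme the study used is not stated; this is an assumption.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(ratings_a)

    # Observed joint proportions and each rater's marginal proportions.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[index[a]][index[b]] += 1 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    # Weighted disagreement: observed versus expected under chance.
    disagree_obs = 0.0
    disagree_exp = 0.0
    for i in range(k):
        for j in range(k):
            w = (abs(i - j) / (k - 1)) ** power
            disagree_obs += w * observed[i][j]
            disagree_exp += w * row[i] * col[j]
    if disagree_exp == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1 - disagree_obs / disagree_exp


# Invented ratings for six hypothetical trials.
reviewers = ["low", "low", "some concerns", "high", "high", "some concerns"]
chatgpt = ["low", "some concerns", "some concerns", "high", "some concerns", "low"]
print(round(weighted_kappa(reviewers, chatgpt), 2))
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate agreement worse than chance.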
We present separate analyses for each RoB 2.0 domain and for the overall rating of risk of bias. Each analysis only includes one outcome from each included trial. Our primary analysis includes the most important outcome, based on the order in which outcomes were listed in Cochrane systematic review summary of findings tables. We adjusted for clustering of trials within each systematic review by inflating the variance of all estimates by the design effect (28).
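One way to apply the clustering adjustment described above is to multiply the variance of kappa by the design effect, which is equivalent to multiplying its standard error by the square root of the design effect, before forming a confidence interval. The sketch below assumes a normal-approximation interval; the function name and the numeric inputs are hypothetical and are not taken from the study.

```python
import math


def clustered_confidence_interval(kappa, standard_error, design_effect, z=1.96):
    """Confidence interval for kappa after inflating its variance.

    Assumption: a normal-approximation interval, with the variance
    multiplied by the design effect (SE multiplied by its square root).
    """
    adjusted_se = standard_error * math.sqrt(design_effect)
    return kappa - z * adjusted_se, kappa + z * adjusted_se


# Hypothetical inputs: kappa of 0.20 with an unadjusted SE of 0.05.
unadjusted = clustered_confidence_interval(0.20, 0.05, design_effect=1.0)
adjusted = clustered_confidence_interval(0.20, 0.05, design_effect=1.45)
print(unadjusted)
print(adjusted)
```

The adjusted interval is wider than the unadjusted one, reflecting that trials from the same review carry less independent information.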
We interpreted Cohen’s kappa statistics using previously established guidelines: values from 0.0 to 0.2 indicating slight agreement, 0.21 to 0.40 indicating fair agreement, 0.41 to 0.60 indicating moderate agreement, 0.61 to 0.80 indicating substantial agreement, and 0.81 to 1.0 indicating almost perfect agreement (29).
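These interpretation bands can be expressed as a small lookup. In this sketch the function name is hypothetical, the label for negative values is an assumption not given in the text, and the top band uses the wording "almost perfect" from the commonly used Landis and Koch scale.

```python
def interpret_kappa(kappa):
    """Map a kappa value to the agreement label used in the text."""
    if kappa < 0:
        # Assumption: the text does not label values below zero.
        return "less than chance"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


print(interpret_kappa(0.17))  # slight
print(interpret_kappa(0.29))  # fair
```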
We hypothesized that ChatGPT may assess risk of bias more reliably when few subjective judgments are required. Therefore, we expected better agreement for: (i) trials addressing pharmacologic interventions because trials of pharmacologic interventions are more likely to blind patients and healthcare providers, thus simplifying judgments related to deviations from intended intervention and measurement of outcomes; (ii) trials for which risk of bias was assessed for the effect of assignment to the intervention because assignment to the intervention does not necessitate making judgments about adherence; (iii) objective outcomes since these outcomes do not need additional judgments about whether failure to blind may have resulted in differential measurement of the outcome; and (iv) dichotomous instead of continuous outcomes since continuous outcomes are more likely to be subjective. To test these hypotheses, we performed secondary analyses stratified by these factors.
We also performed a secondary analysis in which we collapsed ratings of “some concerns” and “high risk of bias” into a single category.
In our primary analysis, we excluded ratings of uncertain risk of bias from analyses. We had planned to perform additional sensitivity analyses treating these ratings as some concerns or high risk of bias but there were too few uncertain ratings to affect estimates of reliability.
We performed all statistical analyses using the psych package in R (Vienna, Austria, Version 4.1.3) (30).
Review of ChatGPT justifications for discrepant risk of bias judgments between Cochrane systematic reviewers and ChatGPT
Our prompts queried ChatGPT to provide a justification for its ratings of risk of bias. To understand reasons why ChatGPT may produce unreliable risk of bias judgments, we also qualitatively reviewed justifications provided by ChatGPT to support its judgments for potential errors or problems.
Deviations from protocol
To account for correlation between trials in the same systematic review, we planned to calculate weighted kappa within each review individually and pool the weighted kappa statistics across systematic reviews using random-effects meta-analysis (31). The sampling distribution of kappa, however, is asymmetric. Although the sampling distribution of kappa is approximately normal when the number of observations is large enough, there were too few trials within each systematic review to assume normality, which precluded our planned meta-analytic approach. Instead, we adjusted the variance of all estimates for the correlation within each systematic review.
Results
Systematic review and trial characteristics
We included 157 trials from 34 systematic reviews. Figure 2 presents the selection of systematic reviews. Supplement 2 presents a list of included reviews and supplement 3 presents a list of excluded reviews.
More than half of reviews were published in 2023 and addressed pharmacologic interventions. Reviews most commonly addressed infectious, ophthalmologic, and respiratory conditions. Reviews either rated the risk of bias for assignment to the intervention or did not report whether they assessed the risk of bias of assignment to or adherence to the intervention. More than half of included outcomes were dichotomous and rated as either definitely or probably objective.
In our analyses, each trial contributed data only for one outcome. Our primary analysis included data from 157 trials. Of these, 45 (28.7%) were rated at low risk of bias by Cochrane systematic reviewers, 75 (47.8%) at some concerns, and 37 (23.6%) at high risk of bias. Fifty-two trials (33.1%) were rated at high risk of bias or some concerns for bias due to randomization, 37 (23.6%) for bias due to deviations from the intended intervention, 23 (14.7%) for missing outcome data, 29 (18.5%) for measurement of the outcome, and 72 (45.9%) for selective reporting.
Agreement between ChatGPT and consensus-based risk of bias judgments from Cochrane review authors
In our primary analysis, when a trial reported data on more than one outcome of interest, we included data for the outcome reported first in the systematic review.
We found overall only slight agreement between ChatGPT risk of bias judgments and consensus-based risk of bias judgments from systematic reviewers. Agreement for overall risk of bias ranged between 0.11 (95% CI: -0.04 to 0.27) and 0.17 (95% CI: 0.02 to 0.32) for the minimal and optimized prompts, respectively. Figure 2 presents a flow diagram representing categorical changes in the overall rating of risk of bias between systematic reviewers and the optimized ChatGPT prompt.
For the optimized prompt, agreement ranged from 0.11 (95% CI: -0.11 to 0.33) to 0.29 (95% CI: 0.14 to 0.44) across risk of bias domains, with the lowest agreement for the deviations from the intended intervention domain and the highest agreement for the missing outcome data domain.
We hypothesized that ChatGPT may assess risk of bias more reliably when few subjective judgments are required: trials addressing pharmacologic interventions, reviews that assessed risk of bias of assignment rather than adherence to the intervention, objective outcomes, and dichotomous outcomes. To test these hypotheses, we performed secondary analyses stratified by these factors. We did not find evidence that ChatGPT had importantly different reliability in these stratified analyses (Supplements 4 to 10). ChatGPT showed “slight” to “fair” agreement for these subgroups.
Likewise, we performed a secondary analysis in which we collapsed ratings of “some concerns” and “high risk of bias” into a single category. This secondary analysis also showed “slight” to “fair” agreement (Supplement 11).
Discrepant risk of bias judgments between Cochrane systematic reviewers and ChatGPT
For all risk of bias judgments, our prompts queried ChatGPT to provide a justification for its rating of risk of bias. To understand reasons why ChatGPT may produce unreliable risk of bias judgments, we also qualitatively reviewed justifications provided by the optimized ChatGPT prompt to support its judgments for potential errors or problems. An analysis of the justifications provided by ChatGPT suggests four major types of problems.
First, it appears that ChatGPT could not distinguish between characteristics of trials that are at low risk of bias and characteristics at high risk of bias. For example, one trial reported randomization by an “interactive web-response system”, which suggests central randomization and allocation concealment (32). The optimized ChatGPT prompt rated the trial at some concerns for randomization because the trial report “does not explicitly mention whether the allocation sequence was concealed”. Another trial reported using “system-generated random numbers” to randomize participants (33). ChatGPT rated risk of bias due to randomization as low with the justification that “an open list of random numbers for concealment” indicates “proper randomization process”. This justification is incorrect, since an open list allows those recruiting participants in a trial to predict the arm to which subsequent participants will be randomized.
Second, ChatGPT was unable to make reasonable assumptions about risk of bias. Cochrane systematic reviewers rated a trial investigating the effects of convalescent plasma on all-cause mortality in COVID-19 patients at low risk of bias for missing outcome data (34). ChatGPT rated the trial at some concerns because it “does not provide explicit details about the availability of outcome data for all participants or if there was significant dropout of participants”. The trial however reports that no patients were lost to follow-up and it is reasonable to assume that all-cause mortality would be one of the outcomes for which there would be no missing outcome data without loss to follow-up since it does not involve active measurement or monitoring by investigators.
Third, ChatGPT made errors that suggested that it was unfamiliar with recommended processes for risk of bias assessments. For example, for the domain bias due to deviations from the intended intervention, an open-label trial of aspirin for COVID-19 was judged at high risk of bias by systematic reviewers and low risk of bias by ChatGPT because the outcome ‘all-cause mortality’ is objective (35). While the outcome is objective, the domain of bias due to deviations from intended intervention is meant to solely assess risk of bias due to imbalances in cointerventions or differences in how the intervention is implemented rather than objectivity of the outcome, which is assessed by the bias due to measurement of the outcome domain. Similarly, in making judgments about risk of bias due to selective reporting, ChatGPT often considered discrepancies in all outcomes and results between the trial publication and the registration or protocol instead of only the result for which risk of bias was being assessed. Although bias due to randomization should be consistent across outcomes from the same trial, we identified instances in which ChatGPT rated the domain inconsistently across outcomes from the same trial.
Finally, ChatGPT made random errors in assessing risk of bias. For example, in another trial described as double-blind and rated at low risk of bias due to deviations from intended intervention by systematic reviewers, ChatGPT rated risk of bias as high because the trial “does not provide information on whether participants and personnel were aware of the intervention” (36).
Discussion
Main findings
We performed a study evaluating ChatGPT for assessing the risk of bias of randomized trials using the Cochrane-endorsed RoB 2.0 tool (5). To do this, we sampled Cochrane systematic reviews that reported RoB 2.0 judgments for randomized trials, assessed the risk of bias of trials using ChatGPT via three variations of prompts, and compared the degree of agreement between RoB 2.0 judgments presented in systematic reviews and those by ChatGPT.
We found only slight to fair agreement between ChatGPT risk of bias judgments and those presented in systematic reviews. Our results suggest that ChatGPT, at least as it stands today, is suboptimal for facilitating risk of bias assessments. We found similar results when we restricted our analysis to subgroups for which we hypothesized that ChatGPT may be more reliable, including trials addressing pharmacologic interventions, reviews assessing the risk of bias associated with assignment to the intervention, objective outcomes, and dichotomous outcomes.
We also reviewed cases in which ChatGPT’s risk of bias judgments differed from those of Cochrane systematic reviewers with the goal of identifying ways in which we can refine future prompts. Our findings indicate that ChatGPT might make more accurate risk of bias judgments if informed about both low and high risk of bias methodological traits. For example, one trial reported randomization by an “interactive web-response system”, which suggests central randomization and allocation concealment (32). ChatGPT, however, rated the trial at some concerns for randomization because the trial report “does not explicitly mention whether the allocation sequence was concealed”. Training ChatGPT to recognize features of trials at low versus high risk of bias may improve the reliability of its risk of bias assessments.
Though our results appear discouraging, they must also be interpreted in light of the generally poor agreement between even experienced reviewers in implementing the RoB 2.0 tool. For example, a previous investigation of the reliability of RoB 2.0 using experienced systematic reviewers reported inter-rater reliability ranging from 0.04 to 0.45, indicating only slight to moderate agreement (7). The original Cochrane risk of bias tool also demonstrated poor inter-rater reliability for select domains (37).
Our results may also be explained by ChatGPT’s limited memory, which may not be sufficient to fully process RoB 2.0’s extensive and lengthy guidance (38, 39). An improvement in ChatGPT’s performance in risk of bias assessment might be achieved by enhancing its memory capabilities, by utilizing other plans from OpenAI that offer expanded memory options such as ChatGPT Enterprise, or by fine-tuning ChatGPT’s base model—a process that involves additional training of the model.
Finally, while we evaluated the degree of agreement between risk of bias judgments reported in systematic reviews and those made by ChatGPT, we did not consider the impact of these discrepancies. For example, discrepancies in risk of bias judgments may not necessarily lead to an overall change in the rating of the certainty (quality) of evidence and the material conclusions of systematic reviews.
Strengths and limitations
The primary strength of our study is its generalizability to diverse research questions, reviews, and research teams. Risk of bias judgments are subjective and different research groups and teams may have different understandings and thresholds for expressing concerns about risk of bias. Similarly, assessing risk of bias involves unique considerations related to the research question being investigated. As our sample included systematic reviews from multiple diverse research teams, ChatGPT’s reliability is not confined to the specific nuances of a single group’s approach to risk of bias assessments or to a single topic.
Our study was limited to parallel randomized trials published in English. We excluded crossover and cluster randomized trials since these trial designs require unique considerations in their assessment of risk of bias and different versions of the RoB 2.0 tool. Thus, the results of our study may lack generalizability beyond English language parallel randomized trials, though these are the most common studies typically included in systematic reviews. Further, it is unlikely for ChatGPT to be able to perform remarkably differently for other types of trials, since assessing the risk of bias of these trials necessitates the same considerations as parallel randomized trials in addition to several additional unique considerations.
Evidence suggests that risk of bias assessments in Cochrane reviews, despite their rigor, are sometimes unreliable and inconsistent with established guidance (7). Hence, differences in risk of bias judgments between ChatGPT and Cochrane systematic reviewers may also represent errors on the part of reviewers. Previous studies suggest that agreement between reviewers in assessing risk of bias may be very poor (40, 41). To minimize the potential for this error, we limited our sample to Cochrane systematic reviews, which are known for their methodological rigor (42, 43).
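Agreement between two sets of ordinal risk of bias judgments, whether ChatGPT versus reviewers or reviewer versus reviewer, can be quantified with a weighted kappa statistic. The following is a minimal sketch rather than the analysis code used in this study. It assumes linear disagreement weights and the commonly used Landis and Koch descriptive bands; the function names and the weighting scheme are illustrative assumptions.

```python
from collections import Counter

# Ordinal judgment levels, ordered from lowest to highest risk of bias.
LEVELS = ("Low", "Some concerns", "High")


def weighted_kappa(rater_a, rater_b, levels=LEVELS):
    """Cohen's weighted kappa with linear disagreement weights (an assumption)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    index = {level: i for i, level in enumerate(levels)}
    n = len(rater_a)
    span = len(levels) - 1  # largest possible distance between two levels

    pair_counts = Counter(zip(rater_a, rater_b))
    marginal_a = Counter(rater_a)
    marginal_b = Counter(rater_b)

    # Weighted disagreement actually observed between the two raters.
    observed = sum(
        abs(index[a] - index[b]) / span * count / n
        for (a, b), count in pair_counts.items()
    )
    # Weighted disagreement expected by chance from the marginal distributions.
    expected = sum(
        abs(index[a] - index[b]) / span * (marginal_a[a] / n) * (marginal_b[b] / n)
        for a in levels
        for b in levels
    )
    if expected == 0:
        raise ValueError("kappa is undefined when both raters use one identical level")
    return 1 - observed / expected


def describe_agreement(kappa):
    """Descriptive label following the Landis and Koch bands (an assumption)."""
    if kappa < 0:
        return "poor"
    for upper, label in ((0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")):
        if kappa <= upper:
            return label
    return "almost perfect"
```

Under these bands, kappa values of 0.11 and 0.29 are described as "slight" and "fair" agreement, respectively.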
The performance of ChatGPT is also not static. The infrastructure, interfaces, and applications built around ChatGPT are continuously updated. Our experiment was performed over a two-week period between September and October 2023. The performance that we observed may not be replicable in the future, though it is more likely that the capabilities of ChatGPT will improve rather than deteriorate. Even with identical prompts, ChatGPT might provide slightly different answers due to the inherent stochasticity in its response generation.
The reliability of ChatGPT risk of bias assessments is likely to depend on the nature of the prompts. We tested three different prompts. Our results suggest that the performance of the three prompts is comparable. It is possible that reviewers may be able to produce more reliable risk of bias assessments using alternative prompts.
Our prompts asked ChatGPT to provide a justification for its ratings of risk of bias. To understand why ChatGPT may produce unreliable risk of bias judgments, we reviewed these justifications for potential errors or problems. While we performed a general review of the justifications in cases where ChatGPT and Cochrane reviewers made discrepant risk of bias judgments, we did not perform a formal qualitative analysis of the justifications.
While we did not record the exact duration our team spent using ChatGPT, we estimate that each trial took no longer than 15 minutes. According to empirical evidence, this is less time than a reviewer requires on average to conduct an individual risk of bias assessment and consensus meeting (6, 7).
Relation to previous findings
Attempts to reduce the time, resources, and expertise needed to perform systematic reviews are not new. For example, RobotReviewer is an automated tool to extract data from and assess the risk of bias of randomized trials (8). RobotReviewer, however, was trained on the original Cochrane risk of bias tool and only offers judgments on four of the seven domains of the original tool. Since then, Cochrane has adopted a revised risk of bias assessment tool that requires more nuanced judgments and is more resource- and time-intensive (6). Given the poor performance of ChatGPT, adapting RobotReviewer to provide risk of bias assessments using the RoB 2.0 tool may be more promising.
Implications
The practice of evidence-based medicine demands knowledge of the best available evidence, which most often comes from rigorous systematic reviews and meta-analyses (1). Systematic reviews are resource- and time-intensive. For example, empirical evidence suggests that systematic reviews typically require upwards of one year to complete and publish (2). Tools that efficiently and reliably conduct risk of bias assessments can conserve time and resources, free reviewers to concentrate on other critical tasks, and potentially enhance the accuracy of risk of bias judgments.
Our results suggest that ChatGPT, in its current form, is not able to reliably assess the risk of bias of randomized trials. Since assessment of the risk of bias of observational studies or diagnostic studies is even more complicated, it is reasonable to expect that ChatGPT might encounter even more challenges with these other types of study designs.
Our study also has implications for future research. While our prompts in their current form could not be used to reliably assess risk of bias, other prompts may be able to provide more reliable assessments. For example, for each domain, RoB 2.0 contains a series of signaling questions designed to help reviewers think systematically about the different aspects of trial conduct that might lead to bias. These signaling questions are answered with “Yes,” “Probably yes,” “Probably no,” “No,” or “No information.” Based on the answers to these questions, a judgment is made about the risk of bias for that domain as “Low,” “Some concerns,” or “High.” Instead of asking ChatGPT to assess the risk of bias of each domain directly, ChatGPT could be prompted to answer the RoB 2.0 signaling questions, from which domain-level judgments could then be derived. Future research may also address the usefulness of having systematic reviewers reconcile their risk of bias assessments with ChatGPT or the role of ChatGPT in training systematic reviewers.
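As an illustration of the signaling-question approach described above, the following sketch builds a prompt for one domain and validates the reply against the five permitted answers. It was not tested in this study. The class and function names, the prompt wording, and the reply format are hypothetical, and the question text must be supplied from the RoB 2.0 guidance. No model is called here, and the step that maps answers to a domain judgment is deliberately left to the algorithm in the RoB 2.0 guidance rather than reproduced.

```python
from dataclasses import dataclass

# Response options and judgment levels as quoted in the text above.
ALLOWED_ANSWERS = ("Yes", "Probably yes", "Probably no", "No", "No information")
JUDGMENTS = ("Low", "Some concerns", "High")


@dataclass(frozen=True)
class SignalingQuestion:
    identifier: str  # hypothetical label such as "Q1"
    text: str        # wording to be copied from the RoB 2.0 guidance


def build_prompt(domain, questions):
    """Assemble a prompt asking for one permitted answer per question."""
    lines = [
        f"Domain: {domain}",
        "Answer each question with exactly one of: "
        + ", ".join(ALLOWED_ANSWERS) + ".",
        "Reply with one line per question in the form '<id>: <answer>'.",
    ]
    lines += [f"{q.identifier}. {q.text}" for q in questions]
    return "\n".join(lines)


def parse_answers(reply, questions):
    """Extract answers from a reply, rejecting anything outside the allowed set."""
    expected = {q.identifier for q in questions}
    answers = {}
    for line in reply.splitlines():
        identifier, separator, answer = line.partition(":")
        identifier, answer = identifier.strip(), answer.strip()
        if not separator or identifier not in expected:
            continue  # ignore lines that are not answers to a known question
        if answer not in ALLOWED_ANSWERS:
            raise ValueError(f"{identifier}: unexpected answer {answer!r}")
        answers[identifier] = answer
    missing = expected - set(answers)
    if missing:
        raise ValueError(f"no answer for: {sorted(missing)}")
    return answers
```

Validating the reply in this way keeps free-text justifications separate from the structured answers, so that the domain judgment can be derived by rule rather than by the model.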
There are also opportunities to use ChatGPT to streamline other aspects of systematic reviews. Early studies suggest that ChatGPT can be used to devise search strategies (18). ChatGPT may also assist with screening search records, extracting data from eligible studies, or performing evaluations of the certainty of evidence. At this time, however, based on the results of the current study, we are not optimistic about ChatGPT’s ability to reliably extract data or evaluate the certainty of evidence. Screening studies is less subjective and perhaps better suited to ChatGPT’s abilities.
If ChatGPT’s performance improves or if other tools emerge that can reliably perform various systematic review tasks, systematic review authors will need to consider whether the time and resource savings afforded by these tools are worth the potentially suboptimal performance. While these tools may not always perform perfectly, they may still be useful in situations in which systematic reviews need to be performed quickly or with limited resources. Similarly, systematic review authors will also need to consider the acceptability of such tools to evidence users. For example, evidence users may be skeptical of systematic reviews that use AI tools.
There are ethical implications around the adoption of large language models, artificial intelligence, and ChatGPT in health research (44). Perhaps the most immediate ethical implication is the replacement of systematic reviewers. Because evidence syntheses are used to make decisions about large numbers of patients, replacing human reviewers with an underperforming tool may have serious negative health consequences. Our results suggest that ChatGPT performs poorly in assessing risk of bias. We therefore caution against the adoption of ChatGPT for assessments of risk of bias, particularly as a replacement for reviewers.
The integration of artificial intelligence and large language models in systematic reviews can also affect trust in health research. We anticipate that due to limited experience, evidence users will be more cautious about the application of studies that use such tools (45, 46).
There are also ethical issues in outsourcing important research functions to software developed and operated by commercial entities located in foreign jurisdictions that may not be incentivized to ensure that health decisions are free of conflicts of interest. Undue influence or attacks on artificial intelligence systems by corporations, interest groups, and even hostile governments represent new threats against which research should be protected (47, 48). Further, there are limited details on how ChatGPT works internally, including its model architecture and training data. The benefits, risks, and costs of outsourcing risk of bias assessments to software operated by commercial entities should be evaluated from the perspective of research resiliency and scientific accountability.
Conclusion
We performed a study evaluating the usefulness of ChatGPT for assessing the risk of bias of parallel randomized trials using the Cochrane-endorsed RoB 2.0 tool. We found only slight to fair agreement between ChatGPT risk of bias judgments and risk of bias judgments presented in systematic reviews. Our results suggest that ChatGPT, at least as it stands today, is suboptimal for performing risk of bias assessments. The practice of evidence-based medicine demands knowledge of the best available evidence, which most often comes from rigorous systematic reviews. Systematic reviews, though, are time- and resource-intensive. Tools to assist with systematic reviews, be it with risk of bias assessments or other tasks, are critically needed.
Tables
Acknowledgements
None.
Footnotes
Disclaimers: None.
Funding: None.
Data: Available on OSF (https://osf.io/aq85p)