Abstract
Introduction Previous studies about the replicability of clinical research based on the published literature have suggested that highly cited articles are often contradicted or found to have inflated effects. Nevertheless, there are no recent updates of such efforts, and this situation may have changed over time.
Methods We searched the Web of Science database for articles studying medical interventions with more than 2000 citations, published between 2004 and 2018 in high-impact medical journals. We then searched for replications of these studies in PubMed using the PICO (Population, Intervention, Comparator and Outcome) framework. Replication success was evaluated by the presence of a statistically significant effect in the same direction and by overlap of the replication’s effect size confidence interval (CI) with that of the original study. Evidence of effect size inflation and potential predictors of replicability were also analyzed.
Results We identified 89 eligible studies, of which 24 had valid replications (17 meta-analyses and 7 primary studies). Of these, 21 (88%) had effect sizes with overlapping CIs. Of 15 highly cited studies with a statistically significant difference in the primary outcome, 13 (87%) had a significant effect in the replication as well. When both criteria were considered together, the replicability rate in our sample was 20 out of 24 (83%). There was no evidence of systematic inflation in these highly cited studies, with a mean effect size ratio of 1.03 (95% CI [0.88, 1.21]) between initial and subsequent effects. Due to the small number of contradicted results, our analysis had low statistical power to detect predictors of replicability.
Conclusion Although most studies did not have eligible replications, the replicability rate of highly cited clinical studies in our sample was higher than in previous estimates, with little evidence of systematic effect size inflation.
Introduction
The replicability of published research has been recently questioned in different scientific fields, with replication rates shown to be variable and often low (1–8). Whether this represents a “reproducibility crisis” is open to debate (9), and defining what constitutes a successful replication is not trivial (10). Systematic replication efforts have mostly focused on restricted samples of the literature, and data on the subject is still lacking in many areas.
The replicability of highly cited clinical research was studied by Ioannidis in 2005, based on available published replications of a sample of articles between 1990 and 2003 (11). It focused on the reproducibility of study conclusions, typically assessed by statistical significance, as well as on effect size comparisons. 44% of highly cited studies had been successfully replicated, 16% had been contradicted, 16% had found effects that were larger than those of subsequent studies, and 24% remained unchallenged.
A similar effort for highly cited psychiatry research between 2000 and 2002 found lower estimates, with 19% of studies replicated, 19% contradicted, 13% initially inflated and 48% unchallenged (12). Another study on critical care research found that 18% of interventions published in high-profile journals between 1946 and 2016 had their results replicated by a subsequent study, whereas 22% were contradicted, 2% had replications in progress and 58% remained unchallenged (13).
Clinical research has changed in some aspects over the last two decades. A priori registration of study protocols has become more common and mandatory for clinical trials in many countries (14). Although publication bias has not been eliminated (15), the likelihood of null results has increased in published studies (16). Reporting guidelines have become more widely used and have undergone reevaluations and updates (17,18). The push for full reporting of results and availability of individual patient data has also gained ground (14,19). Thus, the replicability panorama in high-impact clinical research may have changed during this period (20–22).
In light of this, the goal of this study is to estimate the replicability of highly cited clinical studies published between 2004 and 2018. Our primary outcome is the rate of successful replication in these studies, as measured both by statistical significance in the same direction and by overlap of CIs for the main effect measure in both studies. We also explore effect size inflation and potential predictors of replicability.
Methods
An overview of the project, datasets and analysis code can be found at https://osf.io/a8zug/. The protocol for the study was preregistered at https://osf.io/nh965/, with a step-by-step methodology available at https://osf.io/2qncz/ and updates and amendments described at https://osf.io/26d98/. All statistical analyses were performed in R version 4.1.2 (23). Data and analysis code are available at https://osf.io/5qhdz and https://osf.io/9hmx4, respectively.
Search for highly cited studies
We searched the Web of Science database for articles with more than 2000 citations published between January 1st, 2004, and December 31st, 2018, in medical journals with an impact factor above 14 in the 2020 Journal Citation Reports (list at https://osf.io/2qncz/). General journals were searched on February 7th, 2020, and specialty journals on March 4th, 2020. The cutoffs for citation and impact factors were twice as large as those used by Ioannidis (11), accounting for the growth in the total number of articles in PubMed during the period (calculation at https://osf.io/t9xu7/).
Within this sample, one author (K.N.) screened titles and abstracts for articles that addressed the efficacy of therapeutic or preventive interventions with primary data (i.e., excluding reviews, meta-analyses or articles that combined two or more previous studies). Two evaluators (G.G.C. and K.N.) then selected the primary outcome in each study, or the main conclusion in the abstract if the study had no primary outcome. In the case of co-primary outcomes or equally emphasized conclusions (11,24–26), we chose the outcome that was deemed more clinically relevant (e.g., mortality over progression or neurologic improvement over reperfusion). In the case of trials with more than two arms (27), we selected the most effective drug as the intervention and randomly chose an active comparator. Studies with no control group (e.g., phase 1 trials) were considered eligible if the abstract clearly stated that an intervention was clinically effective. When both benefits and harms or caveats were presented, focus was given to the net conclusion of whether the experimental intervention merited consideration for use in clinical practice. Disagreements in outcome selection were resolved by consensus with the help of a third investigator (O.B.A.).
For each article, the study design, sample size, journal name and category (general or specialty) were extracted. We also extracted the selected outcome measure – i.e., odds ratio (OR), relative risk (RR), hazard ratio (HR), incidence rate ratio (IRR) or objective response rate (ORR) – with its effect size and respective CI. For controlled studies, results were classified as positive or negative according to the authors’ stated statistical significance threshold. Non-inferiority trials were classified as positive only if the intervention was found to be superior (i.e., not merely non-inferior) to the comparator.
For each result, the population, intervention, comparator and outcome (PICO) (28), both in specific (e.g., “myocardial infarction, ischemic stroke, unstable angina, or cardiovascular surgery”) and general forms (e.g., “cardiovascular events”) (14) were extracted. Two evaluators (K.N. and G.G.C.) described PICO components independently and resolved disagreements by consensus. Results of the independent extraction and consensus decisions can be found at https://osf.io/sfdxv.
Search for replications
After agreement was reached on PICO components, two evaluators (G.G.C. and K.N.) performed independent searches for replications of highly cited studies in PubMed. Search terms were defined independently by each evaluator and included the name of the drug or intervention, the general form of the outcome, and the population (i.e., clinical condition) as described in the article’s title, along with corresponding Medical Subject Headings (MeSH) terms. The comparator was included in the search strategy only if it was an active intervention (i.e., not a placebo or sham). Details can be found at https://osf.io/zv65u/.
A study was considered a replication of the highly cited study when it shared the same PICO general components, namely (a) the drug or intervention, without considering dose or regimen (except for studies performing dose or regimen comparisons), (b) the general form of the outcome (as described in (14)), (c) the population/clinical condition as described in the highly cited article’s title and (d) the comparator. When geographical information was included as a descriptor of the population (e.g., “European patients”) (29,30), we did not include this information as part of the population component (31–33).
Replications needed to be (a) a study type with higher strength of evidence (34) (i.e., randomized controlled trials (RCTs) over cohort studies over smaller uncontrolled studies) or (b) a similar study type with a sample size equal to or larger than the original study. Meta-analyses were considered as eligible replications if the highly cited study accounted for less than half of their sample size. For network meta-analyses, only the sample size of the direct comparison between the intervention and comparator counted for this purpose. If a meta-analysis (35,36) included a single additional study beyond the highly cited one (32,37), we considered the effect size of this study as the replication (31,38), rather than that of the entire meta-analysis. If more than one replication was found, the one with the largest sample size for the specific comparison was considered.
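The eligibility rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Study` class, the `DESIGN_RANK` table and both function names are hypothetical, "similar study type" is simplified to "identical design label", and the pooled sample size of a meta-analysis is assumed to include the highly cited study.

```python
from dataclasses import dataclass

# Hypothetical ranking of design strength (higher = stronger), following the
# ordering given in the text: RCTs over cohort studies over uncontrolled studies.
DESIGN_RANK = {"uncontrolled": 0, "cohort": 1, "rct": 2}


@dataclass
class Study:
    design: str  # one of the keys of DESIGN_RANK
    n: int       # sample size


def eligible_primary(original: Study, candidate: Study) -> bool:
    """Rule (a): stronger design, or rule (b): same design with n >= original."""
    stronger = DESIGN_RANK[candidate.design] > DESIGN_RANK[original.design]
    same_and_larger = (candidate.design == original.design
                       and candidate.n >= original.n)
    return stronger or same_and_larger


def eligible_meta_analysis(original_n: int, pooled_n: int) -> bool:
    """The highly cited study must account for less than half of the pooled sample.

    Assumption: pooled_n counts all patients in the relevant comparison,
    including those of the highly cited study.
    """
    return original_n < pooled_n / 2
```

For example, `eligible_primary(Study("uncontrolled", 50), Study("rct", 30))` is true because an RCT outranks an uncontrolled trial regardless of size, whereas an RCT of 800 patients would not qualify as a replication of an RCT of 1000.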
When different replications were selected by each evaluator, both were made available for the two evaluators to choose the best option independently. Disagreements in this step were resolved by consensus with the participation of a third author (O.B.A.). Agreement in the initial selection was 36%, but rose to 91% when selected replications were made available to both evaluators. Agreement data can be found at https://osf.io/qz6u9, with resolution of disagreements detailed at https://osf.io/ma9bn.
After identifying the best available replication, both evaluators independently selected the outcome and effect size from the replication that corresponded most closely to the one in the original study. Disagreements were solved by consensus. For network meta-analyses, direct comparisons were favored over indirect ones when both were available, either in the manuscript or supplementary material. Agreement data for this process can be found at https://osf.io/2c6jx. Changes in the choice of replication and effect size during analysis are documented at https://osf.io/jq7ec.
As the effect estimates of meta-analyses usually included the highly cited study and were thus not fully independent from it, we re-estimated these effects after removing the highly cited study when enough information was provided for this purpose. For this, we used the primary study results as retrieved from the meta-analysis, estimating effect sizes based on numbers of events and patients when these were available, or on the log-transformed point estimate of the RR or HR, with an estimated standard error, when they were not. For data synthesis, a random-effects model was fitted using the Mantel-Haenszel method for effect size estimation in the package meta in the R software for statistical computing (39). Replicability rates using the effect sizes from these fully independent meta-analyses are provided in addition to the main results as a supplementary analysis.
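To make the idea of a "fully independent" meta-analytic estimate concrete, the sketch below pools log effect sizes after dropping one study. It is not the authors' procedure: they report using the Mantel-Haenszel method in the R package meta, whereas this example swaps in a generic inverse-variance random-effects model with the DerSimonian-Laird estimator of between-study variance. The function names are hypothetical, and the standard errors are taken as given inputs.

```python
import math


def dersimonian_laird(log_effects, ses):
    """Random-effects pooling of log effect sizes (inverse-variance, DL tau^2).

    Returns (pooled log effect, its standard error, tau^2).
    """
    w = [1 / s ** 2 for s in ses]                      # fixed-effect weights
    sw = sum(w)
    fixed = sum(wi * y for wi, y in zip(w, log_effects)) / sw
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, log_effects))
    k = len(log_effects)
    c = sw - sum(wi ** 2 for wi in w) / sw
    # tau^2 is truncated at zero; with a single study it is set to zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if k > 1 and c > 0 else 0.0
    w_re = [1 / (s ** 2 + tau2) for s in ses]          # random-effects weights
    pooled = sum(wi * y for wi, y in zip(w_re, log_effects)) / sum(w_re)
    return pooled, math.sqrt(1 / sum(w_re)), tau2


def pooled_without(log_effects, ses, index):
    """Re-estimate the pooled effect after removing the study at `index`."""
    ys = [y for i, y in enumerate(log_effects) if i != index]
    ss = [s for i, s in enumerate(ses) if i != index]
    return dersimonian_laird(ys, ss)
```

Calling `pooled_without(log_effects, ses, index)` with the position of the highly cited study yields an estimate that no longer depends on it; exponentiating the pooled value returns it to the ratio scale.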
Evaluating Replication Success
Pairs of highly cited studies and their replications were analyzed to evaluate whether results were successfully replicated on the basis of two criteria: (a) statistical significance (an effect in the replication with p < 0.05 in the same direction as that observed in the highly cited study) and (b) confidence interval overlap (an overlap of the 95% CIs for the outcome of interest in both studies). When the highly cited study presented a non-significant effect or did not include a statistical comparison (e.g., phase 1 trials), only the second criterion was used. The primary outcome was the rate of successful replication in our sample by both criteria (or by CI overlap alone when statistical significance was not applicable). As additional criteria, we analyzed whether the replication point estimate was contained in the 95% CI of the highly cited study and vice versa. A sensitivity analysis was performed applying the statistical significance criterion to initially non-significant studies as well. In one case where the replication was a Bayesian meta-analysis, the original study’s CI was compared to a credible interval (CrI), in the case of minimally informative priors (40). P-values were calculated from effect sizes (point estimates and confidence intervals) for each replication and highly cited study (details at https://osf.io/jbn83).
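The two criteria can be illustrated with a short Python sketch for ratio measures (OR, RR, HR, IRR), where the null value is 1. Several points are assumptions rather than the authors' documented procedure: the p-value is recovered from the point estimate and 95% CI with a normal approximation on the log scale (a common convention; the exact method is described at the OSF link above), and the function names and example figures are hypothetical. For uncontrolled phase 1 trials reporting ORR, only `cis_overlap` would apply.

```python
import math
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # about 1.96


def p_from_ci(est, lo, hi):
    """Two-sided p-value for a ratio measure, from its estimate and 95% CI.

    Assumes the CI is symmetric on the log scale (normal approximation).
    """
    se = (math.log(hi) - math.log(lo)) / (2 * Z95)
    z = math.log(est) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def cis_overlap(ci_a, ci_b):
    """True when two (lower, upper) intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


def replicated(orig, rep, alpha=0.05):
    """orig and rep are (estimate, lower, upper) tuples for a ratio measure."""
    overlap = cis_overlap(orig[1:], rep[1:])
    if p_from_ci(*orig) >= alpha:
        # Non-significant original: only the CI overlap criterion is used.
        return overlap
    same_direction = (orig[0] < 1) == (rep[0] < 1)
    significant = p_from_ci(*rep) < alpha
    return overlap and same_direction and significant
```

With made-up numbers, an original HR of 0.70 (0.55 to 0.89) and a replication HR of 0.97 (0.85 to 1.11) have overlapping CIs but fail the significance criterion, so `replicated` returns `False` for that pair.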
When outcome measures differed between highly cited studies and replications (e.g., RR in the highly cited study vs. OR in the replication or vice-versa) and the replication was a primary study, the replication measure was converted to the one in the highly cited study using the data available in the article. When the replication was a meta-analysis that included the highly cited study, we chose the risk measure that was used for data synthesis, using the original study’s effect size as included in the meta-analysis (details at https://osf.io/rfqgd). If the highly cited study was not included in the meta-analysis (e.g., when the meta-analysis included an update of the highly-cited study with a longer follow-up), we manually converted the outcome measure of the highly cited study to the one in the meta-analysis using the original data. When the original study was a phase 1 trial measuring ORR, we manually calculated this measure in replications when needed, with CIs based on the Clopper-Pearson exact method. In the meta-analysis by Hamid et al. (41), ORRs for both RCTs that were eligible replications of the highly cited study (a phase 1 trial) were calculated manually based on the combined data. Details can be found at https://osf.io/mfwv2.
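For readers unfamiliar with the Clopper-Pearson exact method mentioned above, the following Python sketch computes it for a response rate of x responders out of n patients. It finds the interval bounds by bisection on the binomial tail probabilities instead of using beta quantiles, purely to stay within the standard library; the function names are hypothetical, and this is not the authors' code.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k + 1))


def _solve_increasing(f, target, iters=100):
    """Find p in [0, 1] with f(p) = target, for f increasing in p (bisection)."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(x, n, alpha=0.05):
    """Exact (Clopper-Pearson) CI for a proportion x / n."""
    # Lower bound: the p at which P(X >= x) equals alpha / 2.
    lower = 0.0 if x == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(x - 1, n, p), alpha / 2)
    # Upper bound: the p at which P(X <= x) equals alpha / 2,
    # i.e. P(X > x) equals 1 - alpha / 2.
    upper = 1.0 if x == n else _solve_increasing(
        lambda p: 1 - binom_cdf(x, n, p), 1 - alpha / 2)
    return lower, upper
```

A call such as `clopper_pearson(5, 10)` returns the lower and upper bounds of the 95% interval for an ORR of 50% in a ten-patient cohort.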
95% CIs were calculated for replicability rates. Replicability rates in the main results use effect sizes from published replications, while supplementary analyses use only fully independent replications, recalculating meta-analytic effect sizes in the absence of the highly cited study (and excluding meta-analyses for which this was not possible).
Effect size inflation
Effect size inflation was estimated on the basis of ratios between the effect sizes of highly cited studies and replications. For unfavorable outcomes (e.g., death, tumor progression), in which effectiveness increases as the outcome measure decreases, the inflation ratio was defined as the point estimate of the replication divided by that of the original study. For favorable outcomes (e.g., neurologic improvement), in which effectiveness increases along with the outcome measure, it was defined as the point estimate of the original study divided by that of the replication. Publication order was considered in this calculation: thus, when the replication was a meta-analysis in which the pooled sample size of studies preceding the highly cited study was larger than that of those that followed it, we inverted the ratios, considering the highly cited study as the replication and vice versa for this purpose. This was also performed in a case where the replication was an RCT (identified within a meta-analysis (37)) that was published before the highly cited one (31). CIs for mean effect size inflation were calculated by the Wilson score interval.
As these two adjustments had not been pre-specified in the protocol, we performed sensitivity analyses using different ways to deal with positive/negative outcomes and study order within meta-analyses when analyzing effect size inflation. For the former, coining of the effects was performed to convert all favorable outcomes to unfavorable outcomes: overall response rates were subtracted from 1 (1 – ORR), odds ratios were inverted (1/OR) and relative risks for the complementary outcome were calculated based on the original data (https://osf.io/5tmus and https://osf.io/7wtr8). For the latter, we analyzed data considering the highly cited study as the original one, independent of study order in the meta-analysis. As done for replication rates, we also provide supplementary effect size inflation analyses based on fully independent replications only, using recalculated meta-analytical estimates in the absence of the highly cited study.
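The conversions from favorable to unfavorable outcomes described above amount to simple arithmetic, sketched here in Python. The function names and the argument layout (events and totals per arm) are hypothetical; the only facts taken from the text are the three transformations themselves.

```python
def complement_orr(orr):
    """Non-response rate corresponding to an objective response rate."""
    return 1 - orr


def complement_or(odds_ratio):
    """Odds ratio for the complementary outcome (e.g. non-response)."""
    return 1 / odds_ratio


def complement_rr(events_trt, n_trt, events_ctl, n_ctl):
    """Relative risk of NOT having the event, from the original 2x2 counts.

    Unlike the odds ratio, the RR of the complementary outcome cannot be
    obtained by inverting the RR, so the raw counts are needed.
    """
    risk_trt = (n_trt - events_trt) / n_trt
    risk_ctl = (n_ctl - events_ctl) / n_ctl
    return risk_trt / risk_ctl
```

As an illustration with invented counts, 30 responders out of 100 treated patients versus 20 out of 100 controls gives an RR of 1.5 for response but an RR of 0.875 for non-response, which shows why ratio-based inflation estimates depend on which side of the outcome is used.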
For analysis of effect size inflation, natural logarithms of the ratios were used for each study pair (including those with initially negative results) to calculate the mean and CI of these ratios, both for the whole sample and for phase 1 trials and RCTs separately. For the whole sample, we performed a one-sample t-test against a theoretical mean of 0, which would indicate absence of systematic inflation. Although these calculations were performed using log-transformed values to correct for the inherent asymmetry of ratios, we transformed means and CIs back to a linear scale for clarity when describing results.
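A minimal Python sketch of this calculation is given below, assuming nothing beyond what the text states: ratios are oriented so that values above 1 mean a stronger effect in the original study, and means are taken on the log scale and back-transformed. The function names are hypothetical, and the critical t value for the CI is passed in by the caller because the Python standard library has no t quantile function.

```python
import math
from statistics import mean, stdev


def inflation_ratio(original, replication, favorable):
    """Ratio > 1 means the original effect was stronger than the replication.

    favorable=False: outcomes such as death (smaller ratio = more effective).
    favorable=True: outcomes such as response (larger value = more effective).
    """
    return original / replication if favorable else replication / original


def summarize_log_ratios(ratios):
    """Mean of log ratios with a one-sample t statistic against 0."""
    logs = [math.log(r) for r in ratios]
    n = len(logs)
    m = mean(logs)
    se = stdev(logs) / math.sqrt(n)
    return {
        "geometric_mean": math.exp(m),  # mean ratio on the linear scale
        "mean_log": m,
        "se_log": se,
        "t": m / se,                    # compare with a t distribution, df = n - 1
        "df": n - 1,
    }


def back_transformed_ci(mean_log, se_log, t_crit):
    """CI for the mean ratio; t_crit is the two-sided critical t value."""
    return (math.exp(mean_log - t_crit * se_log),
            math.exp(mean_log + t_crit * se_log))
```

Working on the log scale makes a doubling (ratio 2.0) and a halving (ratio 0.5) cancel out exactly, so `summarize_log_ratios([2.0, 0.5])` reports a geometric mean of 1, whereas the arithmetic mean of the raw ratios would be 1.25.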
Predictors of Replicability
Finally, we analyzed whether studies with contradicted results – i.e., those failing in one or both replication criteria – differed from successfully replicated ones in the following aspects: (a) study design (RCTs vs. other designs); (b) nature of intervention (pharmacological vs. non-pharmacological); (c) sample size; (d) p-value of the original study; and (e) citations per year. To compare these aspects between replicated and contradicted studies, we used Fisher’s exact test for categorical variables (a and b) and the Mann-Whitney U test for continuous variables (c through e). We had also planned to use the effect size of the highly cited study as a predictor, but due to the heterogeneity in outcome measures, which included both proportions (i.e., overall response rates) and measures of association (i.e., ORs, RRs, IRRs and HRs), converting them to a single effect size measure turned out to be unfeasible.
Results
Results from our systematic search of the literature are shown as a flowchart in Figure 1. A total of 89 highly cited studies met our inclusion criteria. Of these, 24 had an eligible replication according to our criteria.
As shown in Table 1, included studies received a median of 2842 citations, and were mostly RCTs of pharmacological interventions in cancer or heart disease, with some phase 1 cancer trials as well.
Most replications were direct-comparison meta-analyses, followed by RCTs, network meta-analyses and a phase 2 trial (Table 2), with RCTs more commonly representing replications of phase 1 trials. Two meta-analyses (42,43) replicated more than one highly-cited study in the sample (2 each). All phase 1 trials had available replications in the literature, while the only cohort study in our sample had no eligible replication.
A list of claims from highly cited studies is shown in Table 3. With few exceptions, most of them made claims of efficacy in their abstracts. Efficacy in phase 1 trials was measured by ORR (n=7), while differences in outcomes in RCTs were measured by HR (n=8), RR (n=5), IRR (n=1) or OR (n=3). All phase 1 trials made clear claims of efficacy based on tumor regression. Among RCTs, 2 were negative, with a p-value above the standard cutoff of 0.05.
When a meta-analysis was selected as a replication, the highly cited study could come after some (or most) of the studies in the meta-analysis, and thus consist of a replication of previous literature itself. Figure 2 shows the relative sample sizes of the highly cited studies in their replications. On average, highly cited studies corresponded to 18% of the total sample size of the meta-analyses, but this number ranged from 2% to 42% (sample sizes can be found at https://osf.io/ma9bn). Most meta-analyses had a larger number of patients after the highly cited study than before it, with three exceptions (30,33,66), including one (42) in which all other studies in the meta-analysis preceded the highly cited one (33). In another case, a meta-analysis of 2 studies (37) led an RCT published before the highly cited study (31) to be selected as a replication.
Replication rates using different criteria are shown in Table 4. Among the 15 highly cited studies with statistically significant results, only 2 (13%) had a non-significant result in the replication, whereas the 2 highly cited studies with negative results had significant results in their replications (albeit marginally so). We did not consider the latter as replication failures in our main analysis, as lack of significance in null hypothesis tests should not be taken as evidence of equivalence, especially considering that sample size was higher in the replications. However, we did perform a sensitivity analysis using the statistical significance criterion for studies with non-significant results as well.
Phase 1 trials with no control group were also not considered for the statistical significance criterion. That said, among the 7 phase 1 trials, 6 had replications with statistically significant results when comparing the intervention to a control group in an RCT; the remaining one was replicated by an uncontrolled phase 2 trial (75).
Concerning the effect size criterion, 21 out of 24 studies (88%) had overlapping 95% CIs, with 2 phase 1 trials and 1 RCT failing this criterion. In total, 15 out of 24 replications (62%) had point estimates that were contained in the CIs of the highly cited studies; conversely, only 9 (38%) of the original point estimates were included in the replication’s CIs. That said, as CIs get narrower with increasing sample size, the latter criterion is excessively strict and should not be considered as a measure of replicability.
One major limitation of this analysis is that effect sizes from meta-analyses are not fully independent from the highly cited study when it is included in the estimate. To circumvent this, we recalculated meta-analytical effect estimates in the absence of the highly cited studies (Table S1). This was technically feasible without re-extracting data from the original studies for 10 out of 14 meta-analyses (replicating a total of 11 highly cited studies). The remainder were network meta-analyses without direct comparisons (42,50,59,67) or individual patient-data meta-analyses (62).
When using only fully independent replications (i.e., excluding meta-analyses that could not be reanalyzed), replicability rates were 80% for the statistical significance criterion, 84% for the CI overlap and 79% for the aggregated criterion (Table S2). Differences between this and the main analysis were due to the different samples used in each of them, as no meta-analysis changed its replication status in either criterion when reanalyzed without the highly-cited study.
No evidence of effect size inflation was observed in our sample (Figure 3 and Table 5), with the average ratio between the effect sizes of replications and those of highly cited studies approaching 1 when publication order was considered. Inflation increased slightly when effect sizes were coined (i.e., when the percentage of non-responders was used as the outcome) and when publication order was not taken into account (i.e., when highly cited studies were always considered as the reference), but remained low on average and did not reach statistical significance in any of our analyses. This picture did not change when only fully independent replications were used to estimate inflation (Figure S1 and Table S3).
Potential predictors of replicability are shown in Table 6. Although this analysis was planned in the protocol, it has low statistical power due to the low number of contradicted studies in our sample. The same analysis using only fully independent replications is shown in Table S4. In both analyses, low power prevents us from drawing any definite conclusions on predictors of replicability.
Discussion
Replicability of the literature
The replicability of highly cited clinical studies in our sample was high, with an 83% replication rate when considering our primary outcome of overlapping CIs along with statistical significance in the same direction. Using only fully independent replications (i.e., including meta-analyses only when highly-cited studies were removed) led to a slightly lower but still reasonably high estimate of 79%. Moreover, we did not find evidence of systematic effect size inflation, either for phase 1 trials or for RCTs.
These replication success rates are higher than those found in previous studies of highly cited clinical literature, where rates of 59% in general medicine (11) and 37% in psychiatry (12) were described when both statistical significance and effect size were considered, although the criteria for comparing effect sizes differed in each study. For statistical significance alone (a more homogeneous criterion), the successful replication rate for studies with significant results was 87% in our study, as compared to 79% (11) and 63% (12) in the previous two studies, respectively.
Although the differences in these estimates could be due to changes in the replicability of the published literature, methodological discrepancies between studies should be considered. Ioannidis’ study (11) used a broader definition of replication; thus, many article pairs in his sample would not have been considered to have matching PICO components by our criteria. Moreover, among the contradicted or inflated studies in his sample, 4 were cohort studies whose replications were RCTs or meta-analyses of RCTs. The author himself acknowledges that it is not always possible to validate exposures as interventions, and other studies comparing observational research with RCTs have used the term “concordance” instead of replicability (76). If one considers only interventional studies (i.e., RCTs and case series) in Ioannidis’ study (11) – as was the case in our sample – the replication rate is 67% when considering both significance and effect size, or 87% when considering statistical significance alone, exactly the same as ours when including only statistically significant highly cited studies.
Defining replication boundaries
Considering a study as a replication of another inevitably requires establishing the boundary conditions of a claim (10). We opted to define replications as studies that had matching PICO general components (77). This led most highly cited studies to be classified as unchallenged, with replications being found for only 27% of our sample (as opposed to 76% in Ioannidis (11), 52% in Tajika et al. (12), and 42% in Niven et al. (13), in which criteria were less stringent). Many studies that could have been considered replications by looser criteria were thus excluded.
Even though we were more conservative than previous studies in defining replication boundaries, our study pairs were still not perfect replicas of each other. In some cases, definitions for clinical conditions were very broad, such as “cancer” in Brahmer et al. 2012 (45) and Topalian et al. 2012 (47), meaning that replication samples could potentially be quite distinct from the original one. These discrepancies thus remain as possible explanations for contradictions between results. Heterogeneity between study populations and interventions presents challenges to studying replicability in clinical research, and methodological differences between the original study and replications seem to be common in previous studies as well (13).
Contradicted studies
Regarding our primary outcome, two phase 1 trials and two RCTs were classified as contradicted. Both phase 1 trials had CIs that did not overlap with those of the replication – in one case, the effect was larger in the original study, while in the other it was larger in the replication. As phase 1 trials typically have small sample sizes and are likely to be more prone to publication/citation bias (i.e., a negative phase 1 trial is unlikely to become highly cited), their replicability is expected to be lower than that of RCTs. Nevertheless, the majority of phase 1 trials in our sample were successfully replicated, although replication criteria were less stringent for these studies as (a) they were not subject to the statistical significance criterion and (b) CIs for their effect sizes were broader. Still, it is worth noting that all RCTs replicating phase 1 trials in our sample showed a statistically significant benefit of the intervention when compared to a control group.
Concerning RCTs, the two replication failures for initially positive studies were observed for ERSPC (29) (a large prostate cancer screening trial) and KEYNOTE-024 (68) (a trial of the checkpoint inhibitor pembrolizumab in lung cancer). ERSPC (29) was contradicted because it showed a statistically significant effect (p=0.03) while the meta-analysis did not reach significance, although effect sizes do overlap. Some methodological issues have been proposed to account for this discrepancy, such as the large degree of control group contamination in the PLCO trial (78) and the lower screening intensity in the CAP trial (79), two negative studies that account for most of the weight in the replication meta-analysis.
KEYNOTE-024 (68), meanwhile, was considered contradicted by our replication criteria, which specifically addressed the primary outcome in the highly cited study – in this case, progression-free survival. Nevertheless, both the original study and the replication (KEYNOTE-042) (69) showed an overall survival benefit, and the lack of effect on progression in the replication seems to be due to differences in the early stage of the trial, even though the intervention group fared better in the long run. Thus, the result can be regarded as replicated when the more clinically relevant outcome of survival is considered, and the lack of replication for our chosen outcome appears to be a statistical accident.
Of note, we did not use the statistical significance criterion to evaluate studies with non-significant results (as lack of statistical significance should not be taken as evidence of equivalence between treatments). Accordingly, of the two non-significant studies in our sample, one of them – the ACCORD trial (60) – had a replication with a significant result – in this case, a large meta-analysis that reached marginal significance (p=0.04). Nevertheless, other meta-analyses arrived at different conclusions (61,80–83). Effect sizes were similar between studies, suggesting that lack of agreement in this criterion was a consequence of lower statistical power in the highly cited study. PARTNER A (35), meanwhile, was a non-inferiority study that showed similar 1-year outcomes with transcatheter and surgical aortic valve replacement. Its replication (38) found a better outcome in the transcatheter group, with the authors speculating that the contrasting results could be due to the type of prosthesis used or to differences in the patients’ risk profile.
Effect size inflation
Many studies analyzing replications have found evidence that published effects are systematically inflated, a fact that is expected when statistical significance thresholds are used as a criterion for publication (84). Nevertheless, strategies to measure effect size inflation vary widely across studies. Ioannidis found initially stronger effects in 7 out of 27 replicated studies, using the criteria of a decrease in risk reduction of at least 50%, or a benefit of shorter duration or limited generalizability in the replication (11). Tajika et al. reported standardized mean differences of initial studies to be 2.3 times larger than those of replications (12), while Niven et al. (13) reported a mean absolute risk difference of 16% between original studies and replications. Similar evidence of effect size inflation has also been found both in systematic replication initiatives (5,8) and meta-analyses of effect sizes over time (85,86).
Contrary to these studies, we found little evidence of systematic effect size inflation in the highly cited clinical literature between 2004 and 2018. This suggests that publication/citation bias might be more limited in our sample than in other fields of research. That said, our ability to detect it in our primary analysis could have been reduced by the use of meta-analyses as replications, as the replication sample included the highly cited study, as well as some studies that came before it. Nevertheless, our supplementary analysis removing the primary studies from the meta-analyses found very similar inflation estimates, suggesting that this was not a major issue.
Another limitation is that using ratios to measure effect size inflation makes estimates depend on the outcome chosen. When nonresponse rates or response odds were used instead of ORRs for phase 1 trials, for example, estimates of inflation increased to 17% or 18%, respectively – although CIs were still wide and compatible with absence of systematic inflation. One can also make the case that relative differences in effect sizes may be less relevant than absolute ones for clinical practice. Nevertheless, the fact that different outcome measures were used across studies makes absolute differences non-commensurable and prevents us from analyzing the sample as a whole in this manner.
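The dependence of ratio-based inflation estimates on the chosen outcome can be illustrated with a small sketch. The function name and the two response proportions (40% in the initial study, 30% in the subsequent one) are hypothetical and are not taken from our sample; the point is only that the same pair of proportions yields different ratios on different scales.

```python
def effect_ratios(p_initial: float, p_subsequent: float) -> dict:
    """Initial-to-subsequent ratios for one response proportion on three scales.

    Both arguments are response proportions between 0 and 1 (hypothetical inputs).
    """
    def odds(p: float) -> float:
        return p / (1 - p)

    return {
        "response_rate": p_initial / p_subsequent,
        "nonresponse_rate": (1 - p_initial) / (1 - p_subsequent),
        "response_odds": odds(p_initial) / odds(p_subsequent),
    }


# Hypothetical example: 40% response initially, 30% in the later study.
ratios = effect_ratios(0.40, 0.30)
for scale, value in ratios.items():
    print(f"{scale}: {value:.2f}")
```

Note that the nonresponse ratio runs in the opposite direction (a value below 1 corresponds to a larger initial benefit), so its reciprocal is the quantity comparable to the other two.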
Although evidence for systematic inflation was limited, this does not mean that initially stronger effects were not found in some studies, as in the case of KEYNOTE-024 (69), in which a risk reduction in progression disappeared in the replication, and TAXUS-IV (49), in which relative risk in the treated group increased from 0.39 to 0.66. Nevertheless, the fact that increases in effect size were found in other studies – such as Brahmer et al., 2012 (45), in which the ORR doubled from 13% to 27% – suggests that some or most of these discrepancies can be explained by statistical fluctuation, and that systematic bias in favor of positive studies is smaller in this literature than in other fields.
Replication criteria
Different replication criteria complement each other by capturing distinct aspects of replicability (6,8). Statistical significance alone does not distinguish between magnitude and precision, and thus says little about how two effects compare directly (87). Comparing effect sizes, meanwhile, avoids emphasis on statistical thresholds (8), but may lead to studies with different conclusions being considered successful replications of each other. For this reason, our primary outcome was a combination of statistical significance and CI overlap of effect sizes.
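As a minimal sketch of how the combined criterion can be operationalized, the code below assumes ratio measures (such as RR, OR or HR) with a null value of 1, and infers statistical significance from whether the 95% CI excludes that null. The class, function names and all numbers are hypothetical illustrations, not values from our sample. Following the rule described above, the significance criterion is skipped when the original result is non-significant.

```python
from dataclasses import dataclass

NULL_VALUE = 1.0  # assumption: effects are ratio measures (RR, OR, HR)


@dataclass
class Effect:
    estimate: float
    lower: float  # lower bound of the 95% CI
    upper: float  # upper bound of the 95% CI


def is_significant(e: Effect) -> bool:
    # Assumption: significance is read off the CI excluding the null value.
    return e.lower > NULL_VALUE or e.upper < NULL_VALUE


def same_direction(a: Effect, b: Effect) -> bool:
    return (a.estimate - NULL_VALUE) * (b.estimate - NULL_VALUE) > 0


def cis_overlap(a: Effect, b: Effect) -> bool:
    return a.lower <= b.upper and b.lower <= a.upper


def replicated(original: Effect, replication: Effect) -> bool:
    overlap = cis_overlap(original, replication)
    if not is_significant(original):
        # Non-significant originals are judged on CI overlap only.
        return overlap
    return overlap and is_significant(replication) and same_direction(original, replication)


# Hypothetical effects, for illustration only.
original = Effect(0.60, 0.45, 0.80)
confirmed = Effect(0.75, 0.65, 0.87)
not_confirmed = Effect(0.90, 0.75, 1.08)
null_original = Effect(0.90, 0.70, 1.15)

print(replicated(original, confirmed))
print(replicated(original, not_confirmed))
print(replicated(null_original, confirmed))
```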
Highly cited studies were expected to predominantly present statistically significant results for their primary outcomes, as this literature is enriched in studies with high power and high prior probabilities (88). Most replications also yielded significant results, something that would be expected if the primary findings represent true effects. In fact, considering that the replication rate for this criterion was 87% for initially positive studies, even replication failures could represent vibration around statistical thresholds (as might have been the case for the KEYNOTE-024 replication (68), for example), as studies in clinical medicine are often powered around that level.
Replicability based on CI overlap was similarly high (88%), although this is a rather loose threshold for effect size similarity: absence of CI overlap for two identical effects is expected to occur by chance alone in around 0.6% of cases (89), and this very low type 1 error rate comes at the cost of lower statistical power to detect differences in effect sizes. A more stringent criterion of having the replication effect size included in the original CI led to a lower but still reasonable (62%) replication rate, despite not considering the potential variability in replication estimates.
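The figure of around 0.6% can be recovered with a short calculation, sketched here under the assumptions that the two estimates are independent, normally distributed and share the same true effect (the function name is hypothetical). Two 95% CIs fail to overlap when the estimates differ by more than 1.96 times the sum of their standard errors, while the standard error of the difference is only the square root of the sum of their squares.

```python
from math import sqrt
from statistics import NormalDist


def prob_no_overlap(se_a: float = 1.0, se_b: float = 1.0, level: float = 0.95) -> float:
    """Chance that two CIs for the same true effect fail to overlap.

    Assumes independent, normally distributed estimates.
    """
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - (1 - level) / 2)
    # Non-overlap requires |difference| > z * (se_a + se_b);
    # the difference itself has standard error sqrt(se_a**2 + se_b**2).
    threshold = z * (se_a + se_b) / sqrt(se_a**2 + se_b**2)
    return 2 * (1 - std_normal.cdf(threshold))


print(f"{prob_no_overlap():.4f}")          # equal standard errors
print(f"{prob_no_overlap(1.0, 3.0):.4f}")  # unequal standard errors
```

Under these assumptions, equal standard errors give the lowest non-overlap probability; it rises toward 5% as one estimate becomes much more precise than the other. The independence assumption does not hold when a meta-analysis includes the original study.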
Inclusion of the highly cited study’s point estimate in the replication CI was less frequent (38%), but this is an overly strict criterion, especially when replications have sample sizes that are much larger than the original studies.
Calculating prediction intervals for the original effect given the replication sample size would likely represent the fairest way to assess replication of effect sizes (90), but this was not possible in all cases based on the available data.
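One common way to build such a prediction interval is sketched below, as an illustration rather than the exact procedure of the cited reference. It assumes ratio measures analyzed on the log scale, standard errors recovered from reported 95% CIs, and independent studies; the function names and numbers are hypothetical. The interval is centered on the original estimate and widened by the sampling error expected in both studies.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def se_from_ci(lower: float, upper: float, level: float = 0.95) -> float:
    """Standard error on the log scale, recovered from a ratio-measure CI."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return (log(upper) - log(lower)) / (2 * z)


def prediction_interval(orig_est, orig_lower, orig_upper,
                        rep_lower, rep_upper, level: float = 0.95):
    """Range in which a replication estimate is expected to fall.

    Assumes both studies estimate the same true effect independently.
    """
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    se_orig = se_from_ci(orig_lower, orig_upper, level)
    se_rep = se_from_ci(rep_lower, rep_upper, level)
    half_width = z * sqrt(se_orig**2 + se_rep**2)
    center = log(orig_est)
    return exp(center - half_width), exp(center + half_width)


# Hypothetical original (0.60, CI 0.45 to 0.80) and replication CI (0.65 to 0.87).
pi_low, pi_high = prediction_interval(0.60, 0.45, 0.80, 0.65, 0.87)
replication_estimate = 0.75  # hypothetical
print(pi_low < replication_estimate < pi_high)
```

Because the replication's standard error enters the calculation, the interval is always wider than the original CI, which is why it is less strict than requiring the replication estimate to fall inside the original CI.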
General limitations
Defining what constitutes a replication is not trivial: even though we followed a predefined protocol to define PICO components, their abstraction inevitably involves a degree of subjectivity. Moreover, as was the case in previous studies of replications in the published literature (11–13,86), there was no way to develop a systematic search strategy that was applicable for every study. Because of these factors, our independent searches for replications had low agreement, and a second step was needed to reach consensus. Still, it is possible that our searches could have missed some valid replication candidates.
In at least one case – the ACCORD trial on intensive glucose control (61) – there were candidate replications that reached different results (60,61,81–83). Although we used an objective criterion to define the replication selected for analysis (i.e., sample size), a replication with a different result might have been chosen if we used other criteria. Of note, we did not evaluate risk of bias or methodological quality in replications, opening up the possibility that the largest replication available might not be necessarily the most reliable one.
An important caveat in our analysis is the fact that meta-analyses were considered as replications, even though most of them included the highly cited study. This leads to a degree of circularity in the analysis that could have biased our reproducibility estimate upwards. To deal with this problem, we conducted independent meta-analyses excluding the highly cited study in order to turn them into truly independent replications. Interestingly, this did not lead to major changes in our replication rates, and actually led to lower estimates of effect size inflation. This confirms our impression that highly cited clinical literature seems to be generally replicable, and that these studies’ effect sizes are not systematically higher than those of other studies on the same topic.
Even after including meta-analyses, our stringent criteria to consider studies as having matching PICO components left us with a small sample, and the high replicability rate led to an even lower number of contradicted studies. Thus, our analysis had markedly low statistical power to detect predictors of replicability. Even though none of our predictors reached statistical significance, it seems likely that factors such as lower p values or higher sample sizes would be associated with a higher replicability rate, had a larger sample been available.
As a final limitation, we relied on replications that were published in the literature. As the existence and publication of these replications are subject both to the interest of researchers to perform them and to that of editors and reviewers to publish them, the approach in our study is not directly comparable to the systematic replication attempts that have been performed in other areas (1–8). This is particularly important given that the majority of highly cited studies in our sample had no available replications according to our criteria – thus, selectiveness in performing or publishing replications may have biased our replicability rates. It is also possible that successfully replicated studies receive more citations in the long run, biasing the reproducibility rate of highly cited studies upwards.
Conclusions
Despite the high rate of unchallenged studies, we found the replicability rate of the highly cited clinical literature between 2004 and 2018 to be higher than previously estimated, with little evidence of effect size inflation. These numbers are valid for a narrow, very influential subsample of articles, and cannot be generalized to medical research at large. Nevertheless, they run counter to the assertion that there is a widespread reproducibility crisis in science, and suggest that this may not be the case for every scientific field.
The higher replication rate found in our study when compared to earlier samples of the clinical literature could also be taken as a sign of improvement over time; nevertheless, this conclusion is tentative at best, as differences in methodology (such as the definition of effect size inflation) and samples (such as the frequency of different study designs) do not warrant direct comparisons between studies.
Further research is warranted to examine whether the high replicability of highly cited clinical research is related to particular research practices that are not as widely used in other areas of biomedical science, such as randomization, blinding or prospective protocol registration (14,16,91–94). If such links can be reliably established, they could be used to inform attempts to improve replicability in different research fields.
Funding
G.G.C. received funding from CNPq and CAPES. K.N. received funding from the Serrapilheira Institute. O.B.A. received funding from FAPERJ (E-26/200.824/2021), CNPq (308624/2018-1) and the Serrapilheira Institute.
Author Contributions
Conceptualization: All authors
Methodology: All authors
Project Administration: All authors
Investigation (data collection): GGC and KN
Writing – Original Draft: GGC
Writing – Review and Editing: OBA and KN
Supervision: OBA
Conflicts of Interest
None declared.
Data Availability
All data related to the article is available at Open Science Framework.
Acknowledgements
An abstract for this study has been published in BMJ Evidence-Based Medicine (http://dx.doi.org/10.1136/bmjebm-2022-PODabstracts.105) as a product of the EBM Live 2022 Conference.
Footnotes
This version of the manuscript has been revised to include independent replications in the analyses. That is, when replications were meta-analyses, we removed the highly cited study from each and recalculated the result when possible, making the replication independent of the highly cited study. We also recalculated replicability rates, effect size inflation and predictors of replicability under this scenario. These analyses will be submitted as supplementary work, also on Open Science Framework. The main goal was to assess whether it was a problem that most replications in our sample were meta-analyses that included the highly cited study, given that some highly cited studies carried a lot of weight in these replications. Some other changes are documented here: https://osf.io/jq7ec.