Abstract
BACKGROUND The objective is to determine the impact of the Bacillus Calmette-Guérin (BCG) vaccine compared to placebo or no vaccine on COVID-19 infections and hospitalisations in healthcare workers. We are using a living and prospective approach to Individual-Participant-Data (IPD) meta-analysis of ongoing studies based on the Anytime Live and Leading Interim (ALL-IN) meta-analysis statistical methodology.
METHODS Planned and ongoing randomised controlled trials were identified from trial registries and by snowballing (final elicitation: Oct 3 2022). The methodology was specified prospectively – with no trial results available – for trial inclusion as well as statistical analysis. Inclusion decisions were made collaboratively based on a risk-of-bias assessment by an external protocol review committee (Cochrane risk-of-bias tool adjusted for use on protocols), expected homogeneity in treatment effect, and agreement with the predetermined event definitions. The co-primary endpoints were incidence of COVID-19 infection and COVID-19-related hospital admission. Accumulating IPD from included trials was analysed sequentially using the exact e-value logrank test (at level α = 0.5% for infections and level α = 4.5% for hospitalisations) and anytime-valid 95%-confidence intervals (CIs) for the hazard ratio (HR) for a predetermined fixed-effects approach to meta-analysis (no measures of statistical heterogeneity). Infections were included if demonstrated by PCR tests, antigen tests or suggestive lung CTs. Participants were censored at date of first COVID-19-specific vaccination and two-stage analyses were performed in calendar time, with a stratification factor per trial.
RESULTS Six trials were included in the primary analysis with 4 433 participants in total. The e-values showed no evidence of a favourable effect of minimal clinically relevance (HR < 0.8) in comparison to the null (HR = 1) for COVID-19 infections, nor for COVID-19 hospitalisations (HR < 0.7 vs HR = 1). COVID-19 infection was observed in 251 participants receiving BCG and 244 participants not receiving BCG, HR 1.02 (anytime-valid 95%-CI 0.78-1.35). COVID-19 hospitalisations were observed in 13 participants receiving BCG and 7 not receiving BCG, resulting in an uninformative estimate (HR 1.88; anytime-valid 95%-CI 0.26-13.40).
DISCUSSION It is highly unlikely that BCG has a clinically relevant effect on COVID-19 infections in healthcare workers. With only limited observations, no conclusion could be drawn for COVID-19 related hospitalisation. Due to the nature of ALL-IN meta-analysis, emerging data from new trials can be included without violating type-I error rates or interval coverage. We intend to keep this meta-analysis alive and up-to-date, as more trials report. For COVID-19 related hospitalisations, we do not expect enough future observations for a meaningful analysis. For BCG-mediated protection against COVID-19 infections, on the other hand, more observations could lead to a more precise estimate that concludes the meta-analysis for futility, meaning that the current interval excludes the HR of 0.8 predetermined as effect size of minimal clinical relevance.
OTHER No external funding. Preregistered at PROSPERO: CRD42021213069.
Introduction
With the emergence of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) in 2019, leading to the Coronavirus disease 2019 (COVID-19) pandemic in early 2020, a global search for effective prevention and treatment modalities was initiated. Earlier epidemiological studies and clinical trials have suggested that Bacillus Calmette-Guérin (BCG) vaccine induces heterologous protection against respiratory tract infections in both young children and adults (Wardhana, E.A., Sultana, Mandang, & Jim, 2011; Aaby, et al., 2011; de Castro, Pardo-Seco, & Martinón-Torres, 2015; Nemes, Geldenhuys, Rozot, & others, 2018; Giamarellos-Bourboulis, Tsilika, S, & al., 2020; Singh, Netea, & Bishai, 2021). The mechanisms of protection have been suggested to involve induction of heterologous T-cell responses (Benn, Netea, Selin, & Aaby, 2013) and innate immune reprogramming (also termed trained immunity (Netea, et al., 2020)). Supported by ecological studies, suggesting lower COVID-19 incidences in countries with active national BCG vaccination policy (O’Connor, Teh, Kamat, & Lawrentschuk, 2020), researchers across the world initiated randomised controlled trials with BCG as a non-specific protective approach against COVID-19. A substantial number of these trials targeted healthcare workers because of their high exposure rates and fear of a healthcare crisis due to high absenteeism rates. Many of these trials were very similar in their design, especially when they were inspired by one of the first trials (Ten Doesschate, et al., 2022) that had shared their protocol.
Having multiple independent trials in parallel increases the long run precision of effect estimates and improves generalisability of results. However, the situation poses a risk of false-positive or too-late findings. On the one hand, with over 20 ongoing trials and many of these performing interim analyses, the probability that at least one such trial would find superiority early compared to available results in other trials, under the hypothesis of no effect, is much larger than the typically accepted type-I error rate of 5%. Such a scenario might have direct implications for ongoing trials such as a decision to stop follow-up and vaccinate the control group without knowledge of the totality of accumulated evidence. On the other hand, the power of each trial was uncertain due to the pandemic situation with large fluctuations in COVID-19 incidence. Each trial separately could be underpowered, especially for identifying effects on severe disease, and a protective effect might only be observed after trial data were combined. The pandemic situation at the time urged for a collaborative approach with a statistical method that enables continuous synthesis of the data from all ongoing trials.
It can take months or years to collect and analyse the data of all trials and perform an individual-participant-data (IPD) meta-analysis (Tierney, Riley, Smith, Clarke, & Stewart, 2021). By simplifying the structure of the data collection and inviting all ongoing trials, the current study aimed to collect accumulating data and perform an Anytime Live and Leading INterim (ALL-IN) meta-analysis. The results presented here are based on the limited data that was accumulated on an ongoing basis. The analysis is rich in its representation of the history of the evidence and timing of events, yet it does not contain all the (subgroup/covariate-adjusted) analyses that would be possible by combining all full trial datasets.
Our collaboration was named ALL-IN-META-BCG-CORONA. Figure 1 illustrates what the approach was designed to achieve. In this hypothetical case, not only could the world benefit much earlier from the knowledge of an effective existing vaccine in a pandemic, the initiation of hypothetical trial 7 could have been advised against, reducing possible research waste (Chalmers & Glasziou, 2009; Glasziou, Sanders, & Hoffmann, 2020).
E-values for seven hypothetical trials. These are e-values that test hazard ratio 1 versus 0.8 and are explained in the Methods section below, that also addresses the accompanying threshold of 400 in more detail in section E-value analysis design: effect sizes of minimal clinical relevance and thresholds. Note that the y-axis is logarithmic.
Methods
Aims of ALL-IN-META-BCG-CORONA
The aim is to keep track of all ongoing trials by calculating for each calendar date and each trial a notion of evidence – the e-value. As illustrated by Figure 1 for the analysis of events of COVID-19 infection, the sequence of e-values goes up for each observed event in the control group (indicating possible protection in the BCG group) and goes down for each observed event in the BCG group (indicating the opposite). The e-value would therefore accumulate evidence indicating either a protective effect of BCG (increasing e-value), or no or even a harmful effect (decreasing e-value). Each trial contributes e-values from the first observed event until the last, and these e-values are combined in a meta-analysis e-value by multiplication, as described in the paper introducing the ALL-IN approach (Ter Schure & Grünwald, 2022) and the more general literature on combining e-values (Vovk & Wang, 2021).
Since the meta-analysis e-value is based on more events than any individual trial alone, it can reach higher levels of evidence earlier. Figure 1 shows that this hypothetical case with seven trials where the evidence reaches a level of 400 in March 2021. At that time, most of these hypothetical trials still have participants in follow-up with two actively recruiting (hyp trial 3 and 7). The e-values allow for the meta-analysis to be the leading source of information for any decisions to start, stop or expand trials, e.g. advising against the initiation of hypothetical trial 7.
E-values and anytime-valid confidence intervals
ALL-IN-META-BCG-CORONA is based on e-values for hypothesis testing (Figure 1). Their counterpart for estimation is the anytime-valid confidence interval. These methods have recently seen further development – since early work in the ‘40s and ‘60s-’80s (Stein & Wald, 1947; Darling & Robbins, 1967) – and attracted considerable attention in the statistics literature (Wasserman, Ramdas, & Balakrishnan, 2020; Vovk & Wang, 2021; Shafer, 2021; Howard, Ramdas, McAuliffe, & Sekhon, 2021; Henzi & Ziegel, 2022; Grünwald, De Heide, & Koolen, 2022). They distinguish themselves from methods based on p-vales and conventional confidence intervals by the fact that they can be continuously updated without losing statistical validity.
While a small p-value is better, a large e-value is better. Broadly speaking, e=7 indicates seven times more evidence for the hypothesis that BCG has a favourable effect compared to the null. We can set a threshold for a predetermined amount of evidence to keep an eye out for, like e > 400 in Figure 1. A threshold of 400 controls type-I error at a level α= 1/400 = 0.25%, and can be used to inform decisions about the data collection in ongoing and new trials while retaining that type-I error control (Ter Schure & Grünwald, 2022). For more details, see the Appendix section Detailed methods.
IPD, living, interim, bottom-up, prospective and collaborative meta-analysis
E-values and anytime-valid confidence intervals can be used in many different meta-analysis settings, such as those based on IPD or aggregate data, final analysis or living systematic review, complete trials or interim, top-down (like a multi-centre trial) or bottom-up, retrospective or prospective, external or collaborative (Ter Schure & Grünwald, 2022). ALL-IN-META-BCG-CORONA is (a) IPD, (b) living, (c) interim, (d) bottom-up (but also slightly top-down, see below), (e) prospective and (f) collaborative.
(a) Individual Participant Data (IPD); reducing data problems
The data was shared in file formats with a row of data per participant. With full information on the date of randomisation, events and follow-up, the evolution of evidence over time could be retrospectively processed, based on each new data upload. Only information for the time-to-event analysis in the two primary outcomes (COVID-19 infections and hospitalisations) was collected, along with information on the stratifying factor, limiting the amount of work in ongoing data cleaning. The IPD approach encourages close collaboration between the trial data-uploaders and meta-analysis statisticians. This improves problem solving in data cleaning and optimises data quality. The IPD approach also limits the risk of falsified or fabricated data entering the meta-analysis because more than one person inspects all the data.
(b) Living and on (c) Interim data
New trials were contacted as soon as we became aware of their existence. Crucially, we did not only analyse completed trials, but included interim datasets to aim for a live account of the evidence while the trials were ongoing.
(d) Bottom-up (but also slightly top-down); enabling homogeneity
Each trial could specify their own stratification factor (‘hospital’ see Data extraction) in the data uploaded, either following their randomisation strategy for multi-centre trials that randomised stratified by hospital of occupation, or based on other available information that determined common risk of COVID-19. In this sense, part of the data analysis decisions are bottom-up. Also, in contrast to methods based on group-sequential and alpha-spending approaches (Simmonds, et al., 2017), ALL ALL-IN meta-analysis does not require a predetermined stopping boundary. There is a threshold, but rather than being a top-down decision rule, this merely serves for optimal timing to consider stopping, while the analysis stays valid if data collection continues. The specific e-values used, however, do require a predetermined effect size of minimal clinical relevance (Grünwald, De Heide, & Koolen, 2022), which was decided top-down by the meta-analysis steering committee (see further information in section E-value analysis design: effect sizes of minimal clinical relevance and the Appendix section Detailed methods).
(e) Prospective; minimizing bias
All decisions on effect size of minimal clinical relevance (Van Werkhoven, et al., 2021), event definitions and trial inclusion were made before results were known, as prescribed for a prospective meta-analysis (Seidler, et al., 2019). This reduced risks of publication bias and other meta-analysis “significance-chasing biases” (Ioannidis, 2010). In agreement with this recommendation by the Cochrane PMA Methods Group (Seidler, et al., 2019), we constructed a central steering and data analysis committee, and worked in collaboration with representatives (the PI and a data-uploader) from each individual study.
(f) Collaborative; using a dashboard
All trials were represented by a team member in an Advisory committee and members were involved in the meta-analysis while their trials were ongoing, and in some cases, while their trials were still in preparation. This allowed for detailed information sharing about the trials to inform decision making. Moreover, it also facilitated direct information flow from the meta-analysis to the trial members becoming the leading source of information on decisions to start, stop or expand trials. A dashboard, shown in demo-mode in Figure 2, was used to communicate the e-values throughout the course of the participating trials. Dashboard access permissions were regulated based on the stage of blinding each trial was in, with initial permissions only granted to data uploaders unblinded to their own trial results (with other trials not visible), and later all trials were made visible to all data-uploaders and participating investigators. The launch of the dashboard occurred once three trials were ready to be included, such that initially no individual trial contributions could be retrofitted from the meta-analysis e-values and a single other trial’s e-values.
Dashboard used to communicate meta-analysis results to all data-uploaders with a login. The dashboard is in demo mode, showing synthetic (‘fake’) data until June 2021. The option to (de)select trials is for plotting purposes of individual trial e-values; all trials in the dashboard stay included in the meta e-value, following the decision on trial inclusion. The demo dashboard was openly available to easily explain the project to outsiders, using user name ‘demo’, password ‘show’ (Ter Schure J., ALL-IN-META-BCG-CORONA dashboard, 2020). Note that the y-axis is on the log scale.
Structure of the collaboration
Our collaboration consists of a Steering committee (PDG, MGN, MJMB), a representative of Cochrane Netherlands (JAAD) with two collaborators (KDH and JMW), an Advisory committee including a principal investigator and data uploader representing each trial, and an Operational team (JAS, AL, CHW), with the first two acting as meta-analysis statisticians and the last as meta principal investigator.
The Steering committee stayed blinded to the interim results until all trials were concluded and (1) decided on the primary outcome measures, effect size of minimal clinical relevance, thresholds, and event definitions, (2) decided on trial inclusion, advised by the Advisory committee and Cochrane Netherlands, and (3) decided when to make the meta-analysis results public in the dashboard and/or in a scientific publication by consulting the Advisory committee.
Cochrane Netherlands performed an external risk-of-bias assessment based on the protocols and feedback from the participating trials.
The Advisory committee (1) provided Cochrane Netherlands with detailed protocol information to perform a risk-of-bias assessment, (2) advised on trial inclusion criteria, and (3) advised on when to make the meta-analysis results public.
The Operational team (1) identified the trials, (2) coordinated data collection, (3) analysed data and updated the dashboard, (4) wrote news updates, and (5) prepared the publication.
All documents detailing this approach can be found in the supplementary material available at Research Equals (ALL-IN-META-BCG-CORONA Replication Package). These include the Statistical Analysis Plan, a webinar and tutorials explaining the statistical approach and analysis code, newsletters with updates throughout the pandemic, risk-of-bias assessments, summary data and links to data publications.
Identification of trials
To identify trials for inclusion in the meta-analysis, ClinicalTrials.gov was searched for the terms “BCG AND (COVID OR corona OR SARS-CoV-2)”. We also screened a database constructed for the Kaggle hackathon on BCG and COVID-19 clinical trials (Kaggle, 2020). Finally, we used snowballing by regularly asking the trial investigators involved whether they were aware of trials that were not yet included (final elicitation October 3, 2022).
Trials were eligible if they met the following criteria: 1) individual randomisation, 2) comparison of BCG to either placebo or no intervention, 3) population consists of adult healthcare workers, 4) COVID-19 and COVID-19 related hospitalisation are among the primary or secondary outcome measures.
Trial inclusion criteria
Inclusion decisions were made on the study level. The decision to include a trial was based on (a) an external risk-of-bias assessment, (b) a specific type of expected homogeneity in effect sizes, and (c) an agreement on event definitions. These are described in detail below.
(a) Risk-of-bias assessment
Risk of bias was assessed at the study level using a modified version of the Cochrane risk-of-bias tool (Higgins, Altman, Gøtzsche, & al., 2011). The tool was modified to make it fit for assessing trial protocols, rather than publications of completed trials with results. Risk of bias was assessed for the following domains: random sequence generation, allocation concealment, blinding of participants, blinding of outcome assessment, method of outcome assessment and other bias. From the original risk of bias tool, we removed the domain selective reporting as there is no reporting of results in a protocol, and we changed the domain incomplete outcome data to method of outcome assessment. In this domain we assessed whether the method of collecting outcomes was appropriate and whether we expected that any outcomes could be missed by the researchers. Risk of bias was scored independently by two reviewers: one reviewer with a methodological background (AD) and one of two reviewers with a clinical background (KH and JW). All risk of bias domains (agreements and disagreements) were subsequently discussed within the review team.
During advisory board meetings, each trial’s risk-of-bias assessment was discussed, and advisory board members were invited to ask clarifying questions and to share their opinion. If there were concerns about possible biases and the data was already uploaded, the meta-trial statistician arranged a description of the data structure that was blinded to BCG vaccine allocation. The final decision for inclusion of a trial was made by the Steering committee.
(b) Expected homogeneity in effect sizes
The meta-analysis relies on a notion of qualitative trial homogeneity that is sufficient to make decisions during the pandemic. If evidence arises that BCG has a beneficial effect in the prevention of COVID-19, the analysis could serve as the leading source of information to start, stop or expand trials, and for countries to implement BCG vaccination in their population. Such evidence may also inform decisions in other countries and trial settings with slightly different population characteristics, historical BCG vaccination policies, different BCG strains, or different COVID-19 event definitions e.g. a trial based on PCR testing that informs decisions in trials based on antigen testing. As long as not enough data is available to do post-hoc subgroup analyses, a pandemic situation would require acting on what is available. In cases where it is possible to test differences between trials this would of course be recommended before acting on the evidence. However, the possibility of detailed subgroup analyses was not anticipated given the expected small effect sizes and large amounts of data needed to distinguish differences between small effects.
This rationale of decision making requires a specific argument of expected homogeneity in the effect sizes as a key criterium in the decision on trial inclusion, such that the meta-analysis could be the leading source of information. Trials were only included if we expected each of them to have an effect in the same direction (benefit, not harm) – if there would be benefit in one trial, we expect it in all trials. Moreover, trials were included if they could be expected to have an effect of a certain minimal size (if BCG has an effect at all). This connects to the use of fixed-effects (plural) meta-analysis discussed below. We follow Peto (1987) in this regard: “In performing overviews, we are not trying to provide exact quantitative estimates of percentage risk reductions in some precisely defined population of patients. We are simply trying to determine whether or not some type of treatment tested in a wide range of trials produces any effect […]”.
(c) Event definitions
Primary analysis
In the original statistical analysis plan that was shared with the participating trials, an event of COVID-19 infection was defined as “Documented COVID-19 disease is defined as PCR-based detection of SARS-CoV-2 in a respiratory sample”. During an advisory committee meeting on April 23rd, 2021 this was expanded to lung CTs and SARS-CoV-2 antigen positive, rapid point of care testing for current infection. The date of each event was set at the time when the test or scan were performed. For all trials having a positive test at randomisation (either PCR, serology, or otherwise) was an exclusion criterium for the trial itself (see detailed trial characteristics in section Detailed results), or for data-extraction for the meta-analysis. Some trials (e.g. BR) also detected positive COVID-19 cases with serology during the course of the trial. These cases were not included in the COVID-19 event count in the primary analysis, but the participants that had a positive serology were also not deleted from the dataset, to not introduce bias. This means that they stayed in the risk set for COVID-19 infections detected by other means than serology, even though they were at different risk from participants without positive serology.
Secondary analysis
On December 1st, 2022 it was decided to perform a secondary analysis including trials that have the majority of events detected by SARS-CoV-2 serology defined as SARS-CoV-2 antibody positive/detected after infection, with the date of serology as the event time.
Data extraction
Each trial had a designated data-uploader that also attended several Advisory board meetings and would become a meta-analysis co-author. IPD was extracted from each trial and uploaded to a secured cloud server in repeated uploads described in the ‘Working instruction for data-uploaders’, Statistical Analysis Plan and a webinar that explained the statistical methodology (ALL-IN-META-BCG-CORONA Replication Package). Data-uploaders were encouraged to check their data as it appeared in the dashboard (see Figure 2) based on a data processing tutorial (ALL-IN-META-BCG-CORONA Replication Package) and upload new data if available. The following variables were included: intervention randomised to (control or BCG), calendar date of randomisation, the stratification factor ‘hospital’ (e.g. “A”, “B”, “C”, etc) (see Appendix section Detailed results for the meaning of ‘hospital’ in the analysis of each trial), COVID-19 infection (yes/no) and calendar date the positive test or scan for COVID-19, COVID-19 related hospitalisation (yes/no) and calendar date of being hospitalised for COVID-19-related reasons, and calendar date of last follow-up. For patients still in follow-up at the time of data extraction, the last follow-up date was the calendar date of data extraction. The meta-analysis statisticians were in continuous contact with the data-uploaders to correct mistakes (dates before the COVID-19 pandemic, dates of randomisation later in time than dates of COVID-19 infection etc.).
Participants at risk
Participants were considered at risk of COVID-19 infection and hospitalisation from the date of randomisation to the date of either a COVID-19 infection/hospitalisation, the end of follow-up, loss to follow-up or date of first COVID-19 specific vaccination. Therefore, follow-up time was censored at the date of first COVID-19 specific vaccination. Infections occurring after COVID-19 vaccination and reinfections were not included as events. Participants remained at risk of COVID-19 hospitalisation after a COVID-19 infection if this did not result in hospitalisation.
Fixed-effects meta-analysis
The ALL-IN meta-analysis uses the e-values from the exact e-value logrank test (Ter Schure, Pérez-Ortiz, Ly, & Grünwald, 2022) and anytime-valid confidence intervals for the hazard ratio (HR) described in the Statistical Appendix that accompanies this paper.
The e-value analysis tests the global null hypothesis of no effect in all trials. This global null assumes a HR of 1 in all trials throughout their entire course. All events were analysed stratified by hospital within trial, by calculating hospital specific e-values and multiplying those into trial specific e-values, which are multiplied into the ALL-IN META e-values.
The approach to anytime-valid confidence intervals in this meta-analysis is the fixed-effects (plural) model following the logic of (Peto, 1987) describing a typical effect that is a weighted average of all the trials that contribute to the analysis (Rice, Higgins, & Lumley, 2018; Hedges & Vevea, 1998). The analysis assumes that the HR can vary from trial to trial, and is a two-stage analysis. A Cox proportional hazards model (maximum-likelihood) estimate was obtained for each trial, stratified by hospital (i.e. a single HR per trial that follows from evaluating all events with regard to the risk set in their hospital alone instead of the risk set in the full trial). These trial estimates were combined into a meta-analysis HR estimate using inverse-variance weighting. The event times are analysed in calendar time such that all participants within the same hospital at a given date are at the same risk, regardless of their own time since inclusion. More details about the exact scripts used are available in the Appendix section Detailed methods and supplementary material R code (ALL-IN-META-BCG-CORONA Replication Package).
The Kaplan-Meier curves that were promised in our Statistical Analysis Plan appeared difficult to interpret given the left-truncation of analysis in calendar time. The appearance of events is therefore presented as sequences of e-values and confidence intervals over time, with accompanying subplots indicating recruitment and follow-up period for each trial. The plot illustrating the sequence of confidence intervals is related to a single cumulative hazard plot for comparison.
E-value analysis design: effect sizes of minimal clinical relevance and thresholds
An effect size of minimal clinical relevance was specified – arbitrarily – at HR = 0.8 for COVID-19 infections and at HR = 0.7 for COVID-19 hospitalisations. These corresponded well with the Food and Drug Administration recommendation to reject a null hypothesis Vaccine Efficacy of 0-30% (HR 0.7-1) in the COVID-19 pandemic, that was published soon after (FDA, Development and Licensure of Vaccines to Prevent COVID-19., 2020). In contrast to vaccines in development for COVID-19, the BCG vaccine was already widely available at a low price, and producing it at scale was considered possible. Hence the minimal relevant effect on reducing COVID-19 infections was kept at a smaller Vaccine Efficacy of 20% (HR = 0.8), while the effect on reducing hospitalisations was kept at 30% (HR = 0.7).
The main aim was to evaluate if the BCG vaccine was able to reduce severe disease and alleviate the burden on hospitals worldwide. A reduction in infections would rationally result in a reduction of hospitalisations as well, such that the former was considered as important, and that the power of that analysis would be larger with an expected higher event rate. Following this rationale, the two outcomes were set as co-primary outcomes and tested at the level α = 5%, with a Bonferroni correction spending 10% of 5% (0.5%) on infections and 90% of 5% (4.5%) on hospitalisations. This is an approach that loosely agrees with the FDA two-trial rule (two trials at α-level 5% (FDA, 1998, p. 3)) translating into an α = 0.25% meta-analysis (0.05*0.05 = 0.0025), two-sided. This was the α-level of the main analysis of interest (the analysis with the most power), that was a further restriction from two-sided, being the one-sided test for benefit on COVID-19 infections at level α = 0.25%.
Since statistical tests were performed for each side separately, a two-sided α of 0.5% for COVID-19 infections means 0.25% for each side and 4.5% for COVID-19 hospitalisations means 2.25% for each side. In terms of e-values this translates into our threshold for COVID-19 infections comparing the null of HR 1 to a smaller risk (left-sided test for benefit at HR of 0.8 or smaller) lies at 1/0.0025 = 400, and similarly for the e-value comparing the null to a larger risk (right-sided test for harm at HR of 1/0.8 or larger) at 400. For COVID-19 hospitalisations the threshold for the left-sided test (for benefit) and right-sided (for harm) comparing HR 1 to 0.7 and 1/0.7 lie at 1/0.0225 = 44.
Design for early stopping for efficacy, not futility
The design decisions for the e-values – the effect sizes of minimal interest and α-levels stated above – were made by the Steering committee on May 29, 2020. This set a clear rule for what to watch out for in the dashboard, as illustrated by the dotted lines at 400 for COVID-19 infections and 44 for COVID-19 hospitalisations (shown in the demo dashboard in Figure 2 and results in Figure 4).
No such threshold was decided for futility. A very early conclusion on futility was deemed unlikely, and a conclusion in the final follow-up phase of all trials would not prevent much wasted effort on further recruitment. As such we reported the e-values compared to the predetermined threshold for efficacy set at level α of 0.5% (infections) and 4.5% (hospitalisations), but we do not emphasise these same α-levels for the anytime-valid confidence intervals in a judgement of futility. Nevertheless, also without a threshold for futility, we can use confidence intervals to draw conclusions. If in all trials the HR is more extreme than a certain value, then it is very unlikely that such interesting extreme effect will fall outside of the interval. In infinite repeated use the chance that this happens for the anytime-valid 95%-confidence interval is at most 5%, no matter how long we keep updating (see Appendix Detailed methods).
Registration
The meta-analysis design was agreed on by the Steering Committee on May 29, 2020 and time-stamped in a logrank design object within R that is visible in the dashboard (Ter Schure J., ALL-IN-META-BCG-CORONA dashboard, 2020), see Figure 2 on page 5. Working Instructions for data-uploaders and the Statistical Analysis Plan (SAP) were made publicly available on the project website (Ter Schure, Ly, & Grünwald, 2020) on June 17, 2020 and were registered in the International prospective register of systematic reviews (PROSPERO: CRD42021213069) on February 11, 2021 (Van Werkhoven, et al., 2021). The SAP was updated into the version 2 available on the project website and in the Supplementary material (ALL-IN-META-BCG-CORONA Replication Package) accompanying this publication on September 19, 2022 and PROSPERO was updated on December 5, 2022.
Results
Trial inclusion
A total of 20 protocols were identified of trials of BCG for prevention of COVID-19 in healthcare workers. Of these, data from 6 trials were included in the current primary meta-analysis. A secondary analysis was defined that adds COVID-19 infections based on serology, and includes the AF trial as the 7nd trial. This process is described in Figure 3 and trial characteristics and summary data are described in Table 1 and Table 2. More trial details are provided in the Appendix section Detailed results. Three protocols for risk-of-bias assessment were published, for the NL trial (Ten Doesschate, et al., 2020), the DK trial (Madsen, et al., 2020) and the BR trial (Junqueira-Kipnis, et al., 2020).
Flowchart of included trials
Exact logrank e-values for COVID-19 infections and hospitalisations with the Primary ALL-IN meta-analysis that excludes the AF trial. Note that the y-axis is logarithmic.
Trial characteristics of included trials, with AF only included in the Secondary analysis
Trial summary statistics
Exact logrank e-values for COVID-19 infections and hospitalisations
Following the rationale of a certain expected homogeneity in effect sizes (see Methods section Expected homogeneity in effect sizes), it was decided prospectively to exclude trials in the elderly or more vulnerable population from the search for the same meta-analysis, because of the possibility that one of the two populations observed a small beneficial effect while the other observed a small effect of harm.
Trials completed and ongoing
All trials presented here are completed, have locked their databases and shared their final data for the meta-analysis. Table 2 shows the date of last follow-up for each trial. The course of each trial is represented in Figure 4 on page 14, in a subplot indicating the period of recruitment and follow-up in each trial.
Figure 3 shows that three more trials are concluded and in process for their data to be added to this living systematic review. One has been in preparation for the data transfer agreement for 2,5 years, one cannot share IPD but is preparing aggregate data, and one is awaiting their own trial publication before joining the meta-analysis.
Statistical results
The e-values show no evidence in favour of an effect of minimal clinically relevance (HR < 0.8) in comparison to the null (HR = 1) for COVID-19 infections and neither for COVID-19 hospitalisations (HR < 0.7 vs HR = 1). For the meta-analysis as a whole, we find an e-value of 0.023 for benefit for COVID-19 infections and an e-value of 0.241 for COVID-19 hospitalisations, indicating that the data is better supported by the null than the specific alternatives set for minimal clinical relevance. Overall, the results show that the e-values were never close to the thresholds at 400 and 44 for rejecting the null hypothesis, see Figure 4, and that the null hypothesis describes the data quite well, as indicated by the anytime-valid confidence intervals below. All hospital specific e-values are shown in the Appendix section Detailed results.
COVID-19 infections
The primary meta-analysis estimate of the typical hazard ratio for COVID-19 infections was 1.02 with an anytime-valid 95%-confidence interval of (0.78-1.35). See the forest plot Figure 5 and further details in Table 4. A secondary analysis including the AF trial with the majority of events confirmed by serology is shown in the forest plot in Figure 6. The primary analysis as well as the secondary analysis strongly suggest that the planned addition of events from trials not yet included in the meta-analysis – which is allowed in our anytime-valid approach – is highly likely to exclude the hypothesis that there is a minimal effect size of 0.8. Since this was our pre-specified minimum clinically relevant effect size, it would complete the meta-analysis with a futility conclusion, at the α = 5% level.
Primary analysis (excl AF) forest plot for the fixed-effects (plural) meta-analysis model
Estimates for COVID-19 infections
Detailed characteristics NL trial
Detailed characteristics SA trial
Detailed characteristics US trial
Detailed characteristics DK trial
Detailed characteristics HU trial
Detailed characteristics BR trial
Detailed characteristics AF trial
Full summary statistics per hospital within trial
Secondary analysis (incl AF) forest plot for the fixed-effects (plural) meta-analysis model
COVID-19 hospitalisations
For events of COVID-19 hospitalisations only the NL, DK, BR and SA trials contribute events. However, for NL, DK and BR it is not possible to obtain a useful trial-specific confidence interval due to limited data; the estimation procedure did not converge (so we have no maximum likelihood estimate) and the intervals for the HR range from 0 to infinity. For SA, the maximum likelihood estimator in the Cox model stratified by hospital was 2.11 with (0.17-26.73) as its anytime-valid 95%-confidence interval. Because we have no estimate per trial, we cannot inverse-variance weigh these estimates to produce a meta-analysis estimate, as described in the Methods section and performed for COVID-19 infections. We can however, opt for a one-stage non-stratified approach and combine all data together in one dataset and analyse it stratified by trial (but not stratified by hospital). This achieves a maximum likelihood estimator of the Cox model that is still quite uninformative of 1.88 with (0.26, 13.40) as its anytime-valid 95%-confidence interval. Note that in contrast to the meta-analysis on COVID-19 infections, this analysis does assume that all strata (trials in this case) share a single HR, and does not stratify the baseline risk by hospital.
Discussion
In this prospective and living IPD ALL-IN meta-analysis of completed and ongoing trials no effectiveness of BCG in reducing COVID-19 infection was observed. The precision of effect estimates is high and ‘almost’ excludes the minimal effect that was pre-specified to be of interest: the anytime-valid confidence interval lower end is 0.78 and the predetermined minimal relevant effect size was set at 0.8. For COVID-19 related hospitalisation, the limited number of events precluded a firm conclusion. This endpoint is rare in healthcare workers, especially with the less pathogenic SARS-CoV-2 Omicron variants circulating today and the availability of SARS-CoV-2 targeted vaccines.
In vitro and experimental studies demonstrated that BCG vaccination induces non-specific changes in the innate immune system that last for months (Netea, Domínguez-Andrés, Barreiro, & others, 2020). A decreased incidence of respiratory infections in adults after receiving BCG has been demonstrated by several small trials conducted before the SARS-CoV-2 pandemic (Datau, Sultana, Mandang, & others, 2011; Nemes, Geldenhuys, Rozot, & others, 2018; Giamarellos-Bourboulis, Tsilika, S, & al., 2020). The present results indicate that no such clear effect exists for COVID-19. Most of the previously published trials of BCG against COVID-19 in healthcare workers are included in the present analysis, with a trial from France in preparation. A global trial coordinated from Australia and another trial from Poland are not yet included due to restrictions by the funder. Results published from the Polish trial are in line with the meta-analysis (Czajka, et al., 2022). We did not include in the current meta-analysis trials performed in older adults, one small trial from Greece shows a reduction of the COVID-19 incidence in BCG-vaccinated individuals, while two larger trials from the Netherlands do not demonstrate an effect (Tsilika, et al., 2022; Ten Doesschate, et al., 2022; Koekenbier, 2021).. Also, a trial in type 1 diabetics found a strong protective effect of having received multiple doses of BCG within the last years (Faustman, et al., 2022). Why BCG vaccination would be protective against other respiratory tract infections but not SARS-CoV-2 remains a topic for further research. A recent experimental study demonstrating strong protection induced by BCG against influenza, but not COVID-19, suggested that important immunological and pathophysiological differences between the two infections may explain this observation (Kaufmann, et al., 2022). Noteworthy, a recent meta-analysis of all the published BCG-COVID-19 trials that reported deaths within the trials showed that BCG was associated with 39% (1-62%) reduction in all-cause mortality (Aaby, Netea, & Benn, 2022).
Our study was prospectively planned and analyses were designed before any trial data was available. This, together with the use of e-values and anytime-valid confidence intervals, is an important strength of our study, as it controls the type-I error rate even if new data is added in future updates of this analysis. In fact, the use of e-values was employed to allow for the use of interim meta-analysis results at any time for putative strategic decisions such as the initiation of a new trial, early termination, or extension of follow-up. Unfortunately, due to the delayed availability of data for the meta-analysis, the interim meta-analysis results were never used for such strategic decisions. In hindsight, the evidence against the hypothesis of superiority against COVID-19 infection hardly changed after 2020 (see Figure 7). Investigators might have decided to stop recruiting new participants for that reason.
Sequences of anytime-valid confidence intervals for the trials and the meta-analysis from the first date with observed events of COVID-19 infection until the last date of follow-up. There is a confidence interval for every calendar day but these are only shown every 16 days for visibility. For HU and BR, intervals are only visible for the final day of follow-up, since these intervals stay the full plot width over time (larger than (0.2, 5), see Figure 5). AF observes most events at the 6-month end of follow-up serology for each participant. An example cumulative hazard plot is given for the NL trial to show how the incidence of COVID-19 infections in the two groups over time, and the censoring (indicated by +’es in the curve) relates to the HR estimates. The Dutch intervals can be seen in the background, shrinking fast between September 2020 and January 2021. Note that the y-axis is logarithmic.
A limitation of our study is that we were not able to include all trials and some trials were added only after they were concluded. This was mostly due to issues with the transfer of data, either from a legal or funder’s perspective. As a result, the meta-analysis had little chance to affect trial decisions. The use of aggregate data (recently proposed as an approach to prospective collaborative meta-analysis (Tierney, et al., 2021)) instead of individual participant data might have allowed a more timely and complete meta-analysis that could have had more impact during the pandemic. However, this would require each trial to generate and share these statistics and limits the data quality verification possibilities for the meta-analysis statistician. Therefore, it remains difficult to completely recommend against the IPD approach (Ter Schure, Grünwald, & Ly, 2021). One might add that, fortunately, specific COVID-19 vaccinations became available relatively quickly, in what was still an early stage of the ongoing meta-analysis. Had this not been the case, our meta-analysis would have been all the more urgent and data-transfer issues might have been overcome more easily. Since this is a live meta-analysis, efforts will be undertaken to keep this paper up to date as additional trials become available.
To the best of our knowledge, this is the first time a live meta-analysis of ongoing trials is being conducted on a continuous basis. A challenge when analysing data of ongoing trials is that datasets are subject to retrospective changes due to misclassifications and delayed registration of events. Considering misclassifications, we used objective outcome definitions to reduce this risk. Delayed registration cannot be avoided. Both misclassification and delayed registration could be argued to be conservative or result in unbiased relative effects if not associated with the intervention, but they may somewhat increase the type-I error rate. Despite these limitations, we are convinced that a live meta-analysis in an emergency setting with over 20 trials ongoing in parallel offers important potential benefits for society. Putative false-positive findings from one of these trials might have resulted in discontinuation of some of the other trials if they observed a trend in the same direction in their own data. Data from the meta-analysis would have protected trials against incorrect decisions in such circumstances.
Future studies should aim for a better understanding of how BCG-mediated changes in the immune system differentially affect SARS-CoV-2 and other respiratory infections. The role of BCG or other live vaccines in the immunogenicity of SARS-CoV-2 specific vaccines is also a topic of further research. From a methodological perspective, the continued development of anytime-valid meta-analysis techniques are likely to be extremely valuable for increasing the efficiency of global research efforts and reducing avoidable research waste (Ter Schure & Grünwald, 2022).
In conclusion, BCG vaccination has very little to no impact as an intervention for prevention of COVID-19 infections in healthcare workers, though the observation of very low numbers of severe cases (hospitalisations) prevented this study from measuring whether BCG vaccination has any impact on disease severity. Therefore, BCG should not be recommended as preventive intervention against COVID-19 infections in this population.
APPENDIX
Detailed methods
Anytime-validity without specifying a stopping rule
In contrast to methods based on p-values and conventional confidence intervals, analyses based on e-values and anytime-valid confidence intervals can be continuously monitored. Doing so would otherwise inflate type-I error rates for p-values (Armitage, McPherson, & Rowe, 1969), introduce accumulation bias in estimates for hazard ratios (Ter Schure & Grünwald, 2019), and lose coverage in conventional confidence intervals (Howard, Ramdas, McAuliffe, & Sekhon, 2021). With e-value methods, however, type-I error rates and coverage are preserved and any sequential or interim analysis can be validly performed, and trials can be added, without specifying an alpha-spending function or a maximum sample size (Ter Schure, Pérez-Ortiz, Ly, & Grünwald, 2022).
Type-I error control
Suppose that BCG is truly ineffective in preventing infections. Then, if we sample a sequence of events and record the corresponding logrank e-value for benefit after each event, the probability that this e-value sequence will ever increase above the threshold of 1/α = 400, is less than α = 0.025%, controlling the type-I error rate (chance of false-positives) at α = 0.025%, no matter how many new events are added to the analysis. Equivalently, suppose we independently sample (simulate) very many such sequences. In this ‘infinite repeated use’ less than α = 0.025% of the logrank e-value sequences for benefit will ever increase above the threshold of 1/α = 400. Similarly, if BCG is truly ineffective in preventing hospitalisations, in infinite repeated use less than α = 0.225% of logrank e-value sequences for benefit will ever increase above 1/α = 44 (Grünwald, De Heide, & Koolen, 2022; Vovk & Wang, 2021). This property allows us to freely monitor the e-value (updated whenever new data arrive) and stop for benefit as soon as the e-value crosses the threshold of 400 and 44, respectively, while retaining type-I error control.
Type-II error control
If BCG truly reduces COVID-19 infections or hospitalisations, then we would like to detect such an effect as quickly as possible. The fastest detecting e-value logrank test is known as the GROW e-value logrank test, and is tuned to a minimal clinically relevant (relative) reduction of COVID-19 risk. This design ensures that the e-values will grow based on data from any trial that observes that effect of minimal relevance, or an effect that is more extreme.
An e-value analysis can reach it threshold at any sample size, such that the ability of an e-value study design cannot be summarised by a single sample size and power. Each design has a dual sample size calculation: a maximum sample size and an average sample size that accompany a single power – for example 80%. The maximum sample size of an e-value logrank test is the number of events at which, under the minimal clinically relevant effect size, we have reached the threshold with 80% probability. Equivalently, we sample (simulate) a very large number of independent sequences of events with this effect size. Then the maximum sample size is the sample size at which, in this ’infinite repeated use’, 80% of the corresponding e-value sequences has reached a threshold. The average sample size is the average at which those e-value sequences reach that threshold, and is always smaller since many reach the threshold before the maximum sample size.
Such a dual sample size calculation shows that we need a maximum of 1345 events of COVID-19 infections to have 80% power for the e-value logrank test to reach the threshold of 400 for a minimal effect of HR 0.8 (simulated using designSafeLogrank(hrMin = 0.8, beta = 0.2, alpha= 0.0025, alternative = “less”), +-2x bootstrap se of 67) and will reach that threshold at 874 events on average (2x bootstrap se of 22)1. A similar dual sample size calculation shows that we need a maximum of approximately 355 events of COVID-19 hospitalisation to have 80% power for the e-value logrank test to reach the threshold of 44 (simulated using designSafeLogrank(hrMin = 0.7, beta = 0.2, alpha = 0.0225, alternative = “less”), +-2x bootstrap se of 20) and will reach that threshold at 212 events on average (2x bootstrap se of 6). Note that we are more likely to observe higher e-values (like in Figure 1 on page 3, simulated with HR = 0.7) if the effects are more extreme then these minimum effects (0.7 is further away from 1 than hrMin = 0.8 for infections). Also, a standard logrank test would allow only a single analysis and needs 1069 events to detect a HR of 0.8 and 225 events to detect a HR of 0.7, which is less than the maximum sample size for the e-value logrank test, but more than the average sample size.
Code used for the analysis
The event times are considered in calendar time such that all participants within the same hospital at a given date are at risk of any events observed, regardless of their own time since randomisation (in R processed using the Surv() function from the survival package with type = “counting”). Late entries are left-truncated and participants that are lost to follow-up or vaccinated with a COVID-19 specific vaccine are right-censored. Data was analysed for e-values using the R functions designSafeLogrank() and safeLogrankTest() from the R package safestats, maximum-likelihood estimators for the HR were obtained using coxph() from the R package survival, and anytime-valid confidence sequences were obtained using inverse-variance weighted z-scores for log(HR) using computeConfidenceIntervalZ() from the R package safestats (Turner, Ly, Pérez-Ortiz, ter Schure, & and Grünwald, 2022).
Detailed results
Table 13 shows that the exact e-values and the Gaussian approximation to the logrank statistics are very similar such that the e-values can be easily recalculated based on the logrank Z-score (Ter Schure, Pérez-Ortiz, Ly, & Grünwald, 2022). The Gaussian e-values are likelihood ratios of Gaussians comparing the HR of minimal clinical relevance to the null, e.g. the e-value for COVID-19 infections in trial NL can be recalculated as
meaning that the data is hardly better supported by the alternative hypothesis of HR of 0.8 or smaller than by the null of HR 1.
Exact and Gaussian logrank e-values by hospital within trial
The e-values for the secondary analysis can be easily obtained by multiplying the primary analysis e-value of 0.023 by the AF e-value of 1.203 into 0.028.
The typical effect size for the HR is an inverse variance weighted estimate, e.g. (from Table 14) 1.02 is estimated as exp(0.02) following: 0.02 =
Detailed estimates for COVID-19 infections
These estimates can also be approximated by the Peto estimator (Yusuf, Peto, Lewis, Collins, & Sleight, 1985, pp. 366-367, Statistical Appendix) based on the sum of observed minus expected (Sum(Observed - Expected)) and the approximate sum of variances of these observed minus expected, of n*(½)*(1 − ½),
Statistical Appendix
Anytime-valid confidence intervals for location parameters based on Z-scores
In this appendix we elaborate on the anytime-valid confidence intervals for a univariate parameter θ based on Z-scores as implemented in the safestats package (Turner, Ly, Ortiz-Perez, ter Schure, & Grünwald, 2022) and shown in the main text. We assume availability of an asymptotically normal estimator for scalar parameter θ. where we denote the value
takes on a sample Y1,…,Y2 of length n by
and its standard error by SE(N). Thu (approximate or exact, see below) anytime-valid (1 - α)-confidence interval for θ that we employ is given by
where we set
where
and ψmin is a pre-set minimal clinically relevant (mean) difference parameter such as (as in the main text) ψmin = log(0.8) PD -0.223.
If the data Y1,Y2,… are i.i.d. normally distributed with fixed variance and mean θ and is the maximum likelihood estimator (MLE), then the confidence interval Eq. (1) is exact and exactly anytime-valid, a notion explained below. This holds for any fixed value of meta-parameter ψmin picking ψmin to be equal to a pre-specified minimum clinically relevant effect size merely serves to ‘fine-tune’ the intervals so that they are optimized to exclude any pre-specified null hypothesis early on (i.e. at small sample size) if the effect is at least as far away from that null hypothesis (in terms of mean difference) as |ψmin|
is well-approximated by a standard normal, then the interval is still approximately anytime-valid. This is the the case if θ represents the logarithm of the hazard ratio as in the main text and
is either the MLE based on Cox’ partial likelihood with a single covariate treatment/control or, as in the main text, the fixed-effects inverse variance weighted effect size estimator typically used in meta-analysis.
Anytime-Valid Bayes Factors for the Normal Location Family
To derive the intervals above and get an idea of their width, we consider the exact case where the data come from the normal location family. Thus, let with σ known and θ the freely varying location (i.e.. mean). Let θ0 be a postulated value for the true value of the mean θ. Each such θ0 fixes the population mean at, for instance, θ0, = 0 or θ0, = −3.1415, and, for data Y1,…,Yn, yields a relative (to θ0,) z-statistic defined by
where
is the sample mean. and
is the standard error.
For normal location problems Grünwald, de Heide, and Koolen (2019) showed that any Bayes factor for testing a simple null hypothesis provides an e-value. when evaluated at a fixed sample size n. and more generally, provides an e-process (Ramdas, Grünwald, Vovk, & Shafer, 2022) when evaluated at an arbitrary stopping time, leading, as we explain below, to anytime-valid tests. This holds in particular for Bayes factors in which the null represents θ = θ0 and the alternative is formed by a Bayesian prior; in particular we may take any conjugate, i.e. easily computable, prior of the form for any fixed tuning parameter
. The resulting Bayes factor can then, by standard calculations (analogous to these of Example 3 of Grünwald et al.. 2019) be written, at sample size n. as
Anytime-validity relates to the behaviour of
whenever the null hypothesis ℋ0 : θ = θD holds true. More specifically. the fact that
constitutes an e-proccss implies that, for any 0 < α < 1 (e.g. α = 0.05), the test that rejects ℋ0 whenever
, when α = 0.05) has a typo I error guarantee at level α irrespective of n being fixed in advanced, or informed by the observations. Equivalently, provided that θ is truly θ0 the statistic
remains smaller than l/θ (i.e. smaller than 20 if α = 0.05) forever, with a least 1 − α (i.e. 05%. if α = 0.05) chance. This property holds irrespective of our choice of prior for the alternative, in particular for any fixed choice
in (3): although we use Bayesian tools to derive our confidence intervals, their validity holds in a frequentist, non-Bayesian way.
Let us now fix minimum relevant effect size ψmin (we explain this choice for
later). We now construct, for fixed α, the corresponding 1 − α-confidence interval CS1−α(n)(ψmin) at sample size n to consist of any θ0 such that
, i.e. all θ0 that would not have been rejected by the test described above, had they been the null hypothesis. A straightforward computation shows that, with
, this gives the interval Eq. (1).
Any time-validity now moans the following: with probability at least 1−α, the true parameter value θ will be simultaneously contained in all intervals CS1−α(n)(ψmin), i.e. for n = 1. n = 2, n = 3 and so on. In particular, this means that this sequence of confidence intervals (or ‘confidence sequence’ as it is usually called, e.g. Howard, Ramdas, MeAuliffe. & Sekhon. 2021) will cover the true parameter value θ with at least (1 − α)% chance regardless of the specific-time (i.e. sample size n) we look at the data; we may also peek at the data as often as we like. Equivalently, we may imagine sampling (simulating) many independent sequences Y1 = Y1,1,Y1,2…,Y2 = Y2,1,Y2,2,…, Y3 = Y3,1,Y3,2,… and so on, and make a sequence of confidence intervals as above for each of these sequences. If we do this very many times, then in such an ‘infinite repeated use’, at least (1 − α)% of these interval sequences will cover the true parameter value θ at all sample sizes n. As such, we can monitor and act on an anytime-valid confidence interval at any moment in time without over-inflating the risk of detecting a false positive result due to repeated testing across time.
To explain our choice , fix some θ0 representing the null in the test that we aim to accompany with an anytime-valid confidence interval (in the main text, with θ0 representing the log of the hazard ratio, this was just θ= 0 = log 1). We would like. among all anytime-valid confidence intervals constructed from (3) as above, the instance (i.e. the value of
) for which we ran expect the corresponding anytime-valid confidence interval to exclude θ0 as fast is possible (i.e. based on a sample that is as small as possible), whenever |θ − θ0| ≥ ψmin|. According to the theory of GRD e-variables and any time-valid tests as developed by Grünwald et al. (2015). this is given by the value of
maximizing
A straightforward computation shows that the maximum is achieved for
that
which is consistent with (1) above.
As stressed in the literature (Howard et al.. 2021), there is no such thing as an ‘overall optimal’ any time-valid confidence interval: the interval with the minimum widths at sample size n in a fixed sub-region of values for (say. if
is close to θ0 + ψmin) will be suboptimal (overly wide) for other sample sizes and regions of
and may not even shrink to width zero as n increases. On the other hand, the one that shrinks fastest to 0 asymptotically may have overly large width at the smaller n of interest and cannot be easily calculated. By insisting on intervals generated by Bayes factors with mean centered at null, 0 we obtain a compromise: easily calculable anytime-valid confidence intervals that have some sensitivity to minimal effect size ψmin whereas at the same time the guarantee that the width at time n is of order
, being wider than traditional fixed-n intervals by just a
factor this guarantee follows by the same reasoning as in Example 3 of Grünwald et al. (2019)).
Footnotes
↵1 These numbers (1 345 and 874) are much larger than the actual number (495) of COVID-19 infections we have observed so far, at the time of this publication, over all contributing trials. This might perhaps suggest that we should wait for hundreds of additional events and that our analysis, while valid at any number of events, is “not relevant yet”. Crucially though, it already is relevant: already at the current number of events we have a 95%-anytime-valid confidence interval 0.78-1.35, which means that we have almost reached the point where we can in principle stop, not for having reached sufficient power but instead for futility. As already stated in the main text, this happens as soon as the lower end of the confidence interval exceeds minimum relevant effect size 0.8 so that any effect size ≤ 0.8 is ruled out of our always-valid confidence interval (this effectively corresponds to all these hypotheses, rather than the null, being rejected).