Abstract
Biomedical research often utilizes Cox regression for the analysis of time-to-event data. The pervasive use of frequentist inference for these analyses implicates that the evidence for or against the presence (or absence) of an effect cannot be directly compared and that researchers must adhere to a predefined sampling plan. As an alternative, the use of Bayes factors improves upon these limitations, which is especially important for costly and time-consuming biomedical studies. However, Bayes factors involve their own difficulty of specifying priors for the parameters of the statistical model. In this article, we develop data-driven priors for Cox regression tailored to nine subfields in biomedicine. To this end, we extracted hazard ratios and associated x% confidence intervals from the abstracts of large corpora of already existing studies within the nine biomedical subfields. We used these extracted data to inform priors for the nine subfields. All of our suggested priors are Normal distributions with means of 0 and standard deviations closely scattered around 1. We propose that researchers use these priors as reasonable starting points for their analyses.
The collection and analysis of time-to-event data forms a central part of modern biomedical research (see, e.g., Boney et al., 2005; D’Agostino et al., 2008; Diener et al., 2004; Kenchaiah et al., 2002; Singh & Mukhopadhyay, 2011; Stupp et al., 2009). In these types of designs, the outcome variable is the time until an event of interest occurs, which is commonly called the survival time. In the medical context, this event of interest could be, for example, death, relapse towards alcoholism, disease/symptom onset, or recovery. Using survival analysis (we refer the interested reader to other sources, which treat survival analysis more thoroughly; e.g., Bradburn et al., 2003a, 2003b; Clark et al., 2003a, 2003b; Collett, 2015; Harrell, 2015; Hosmer et al., 2008; Klein & Moeschberger, 1997; Therneau & Grambsch, 2000), it is possible to estimate differences in event rates between conditions, which makes it particularly appealing for clinical trials. For instance, evidence about the effectiveness of an oncological treatment of cancer patients could be gathered by comparing the survival times of patients receiving the treatment to the survival times of patients receiving a placebo or an active control treatment (for examples see Kantoff et al., 2010; Rinke et al., 2009; Stupp et al., 2009).
The use of frequentist inference for the analysis of survival data has a long tradition in biomedical research and is still very common today (Brard et al., 2017). Classical frequentist inference, however, is not well suited to quantify evidence in favor of the absence of an effect. In addition, frequentist inference requires fixing the study sample size in advance to avoid an inflation of the Type I and/or Type II error rates (Yu et al., 2014), a downside that can be particularly costly in the realm of resource-intensive biomedical research and clinical trials (Berry, 2006; Brard et al., 2017; Moyé, 2008). As an alternative, Bayesian statistics have gained popularity among researchers (van de Schoot et al., 2017). In particular, Bayes factors (Jeffreys, 1939, 1948, 1961; Kass & Raftery, 1995) do not suffer from the two limitations mentioned before and are therefore a valuable alternative for conducting inference in biomedical research (Goodman, 1999b).
Bayesian modeling requires the specification of a prior distribution for the parameters of the model. The prior expresses one’s beliefs about the plausibility of parameter values before observing the data (Kruschke, 2010, 2015; Kruschke & Liddell, 2018). Oftentimes, it is notoriously difficult to express these beliefs and different priors sometimes lead to qualitatively different Bayes factors (Gallistel, 2009; Kruschke & Liddell, 2018; Liu & Aitkin, 2008; Sinharay & Stern, 2002; Tendeiro & Kiers, 2019; Vanpaemel, 2010). As a result, some researchers lament that the use of Bayesian statistics involves subjectivity (e.g., Efron, 1986) and that proper guidance is missing, possibly resulting in hesitation to use Bayesian inference. Hence, recommendations for well-established default priors in Bayesian survival analysis - in particular Cox proportional hazards regression (henceforth called Cox regression or Cox model; Cox, 1972) - are missing and urgently needed.
In this article, we propose default priors for Bayesian Cox regression tailored to nine subfields within biomedicine. The construction of these priors harnesses historical records consisting of large corpora of hazard ratios and associated x% confidence intervals from existing studies within the respective subfields. We argue that these proposed priors can be used as reasonable defaults or starting points for biomedical researchers wishing to conduct a Bayesian Cox regression.
The remainder of this article is structured as follows: First, we provide an overview of Cox regression and Bayes factors. Second, we briefly review how priors can be defined. Third, we explain our process of generating priors for nine subfields in biomedicine and present the corresponding results. Fourth, we reflect on our findings and implications thereof, and conclude with recommendations.
Bayes Factors in Cox Regression
Survival analysis is a statistical method to analyze time-to-event/survival data. Among the many existing forms of survival analysis - for example, Kaplan-Meier product-limit estimator (Kaplan & Meier, 1958), parametric survival analyses (e.g., Exponential, Gompertz, and Weibull), and Cox regression (Cox, 1972) - the latter is used most frequently within biomedicine (e.g., Bradburn et al., 2003a). Therefore, our treatment of priors for survival analysis is limited to the case of Cox regression.
In Cox regression, the hazard function λ (t) presents the risk of an event happening in a small time period around a specific time t within cases for which the event has not yet happened before time t (see, e.g., Clark et al., 2003a; Harrell, 2015). The specific λ (t) is allowed to have any shape but must be proportional across all values of an independent variable x. Usually, the main goal is not to estimate λ (t) but rather to estimate the β parameter of the Cox model:
Here and throughout, we work with the specific case where x is dichotomous and dummy-coded (i.e., there are two conditions, a common situation in biomedical designs). For this scenario, a hazard ratio can be calculated
which provides information about the relative hazard rates between conditions.
In clinical trials, HR is often the key indicator regarding the effectiveness of a treatment. HR = 1 (or β = 0) means that the two conditions have the same risk of the event happening at any t; HR > 1 (or β > 0) means that the experimental condition has a higher risk of the event happening at any t; HR < 1 (or β < 0) means that the control condition has a higher risk of the event happening at any t. Frequentist inference on HR or β is then conducted either in the form of null hypothesis significance testing (i.e., test statistic and p-value) or in the form of estimation (i.e., a point estimate accompanied with a confidence interval).
The reliance on frequentist inference (Chavalarias et al., 2016; Goodman, 1999a) has some undesirable consequences for biomedical research. Here, we focus on two of these consequences, namely (1) the impossibility to obtain evidence for the null hypothesis and the inability to adjust the sampling plan based on interim results. Concerning (1), it is important to not only determine whether a treatment is working but also whether a treatment is not working over and above a placebo effect. The frequentist approach is not suitable for this because it only allows rejecting the hypothesis that there is no effect, but not accepting it (e.g., Bakan, 1966; Gallistel, 2009; Hoekstra et al., 2018; Ioannidis, 2005; Rouder et al., 2009; Wagenmakers, 2007).
Concerning (2), the use of frequentist inference prescribes the diligent adherence to a predefined sampling plan, prohibiting to continue or prematurely stop data collection based on interim data analyses (e.g., Armitage et al., 1969; Rouder, 2014; Schönbrodt et al., 2017; Tendeiro et al., 2022). Further criticism is described elsewhere (see, e.g., Goodman, 1999a; International Committee of Medical Journal Editors, 1997; Rennie, 1978; van Ravenzwaaij et al., 2019; Wagenmakers, 2007).
Bayesian testing in the form of Bayes factors permits a direct comparison between the evidence for the null hypothesis that there is no effect and an alternative hypothesis that operationalizes that there is some effect (Rouder et al., 2009). For instance, with BF10 = 14, it allows the interpretation that the obtained data is 14 times more likely under the chosen hypothesis that there is some effect compared to the hypothesis that there is no effect; similarly, BF10 = 0.2 indicates that the obtained data is 1/0.2 = 5 times more likely under the hypothesis that there is no effect compared to the chosen hypothesis that there is some effect.
Moreover, using Bayes factors, it is legitimate to monitor the results and stop data collection once a predetermined evidence threshold is reached (Armitage et al., 1969; Rouder, 2014; Schönbrodt & Wagenmakers, 2018; Schönbrodt et al., 2017; Stefan, Schönbrodt, et al., 2022; Tendeiro et al., 2022). Thus, Bayes factors take the evidence for and against both the null and alternative hypotheses into account, yield more substantial interpretations, and empower researchers to sample just the sufficient amount of cases. These characteristics of Bayes factors are critical for biomedical research as studies (especially clinical trials) can be expensive and time-consuming.
In the case where there is a point null hypothesis stating that there is no effect
and an interval alternative hypothesis stating that there is an effect
the Bayes factor is a ratio of a marginal likelihood and a likelihood evaluated at β = 0. Here, f (.) represents any probability density function and ϕ the associated parameters. Let Ω1 be the parameter space under the alternative hypothesis; then the Bayes factor is:
In Equation 5, the integral constitutes a weighted average of the likelihood, with weights supplied by the prior. Depending on the complexity of the underlying statistical model, computing the expression in Equation 5 can be challenging. Through concerted efforts of researchers to develop closed-form solutions and through the explosion of computational power over recent decades that allows applying complex numerical methods (e.g., Monte Carlo sampling and bridge sampling; Betancourt, 2018; Brooks et al., 2011; Gilks et al., 1995; Gronau et al., 2017; van Ravenzwaaij et al., 2018), computing the expression in Equation 5 and variants of it has become feasible. These efforts have led to method developments and software implementations for calculating Bayes factors for survival analyses (Bartoš, 2022; Bartoš et al., 2022; Linde, Tendeiro, et al., 2022; Linde, van Ravenzwaaij, et al., 2022) and many other designs (e.g., Gronau et al., 2020; Gu et al., 2021; Heck et al., 2020; JASP Team, 2023; Morey & Rouder, 2018; Rouder et al., 2012; Rouder et al., 2009; van Lissa et al., 2021; van Ravenzwaaij et al., 2019).
Choosing Priors
Even though it is possible nowadays to calculate Bayes factors for various sorts of designs, it remains difficult to specify an appropriate prior distribution (or prior for short) for the parameters of interest (cf. Equation 5). The prior is a probability distribution that is placed on the statistical model’s parameters of interest and it expresses belief over the plausibility across all possible parameter values before taking into account new data. In the context of null hypothesis Bayes factor calculations, the prior is one important element of the alternative hypothesis. Even among researchers who advocate Bayesian statistics, there is disagreement on how the prior should be specified (see, e.g., Berger, 2006; Goldstein, 2006).
Objective Bayesians strive to define non-informative priors that are as “objective” as possible. Objective Bayesians assert that the results of Bayesian analyses should depend only to a minimal extent on the beliefs of different people. They promote default priors that can be used when no other information is available and often seek to find priors that “behave well” and fulfill certain mathematical properties (see, e.g., Bayarri et al., 2012; Consonni et al., 2018, for more details). Subjective Bayesians, on the other hand, counter that the subjective nature of the prior is an integral part of Bayesian analyses. According to them, the prior allows the incorporation of domain knowledge and results from prior studies into Bayesian analyses and therefore permits tests of theories (Dienes, 2011; Vanpaemel, 2010). Further, they state that it is questionable whether a truly “objective” prior even exists.
Recently, the opportunities that well-defined priors open were increasingly recognized in biomedical research. Prior elicitation procedures, in which informed priors are defined by means of using external sources, gained popularity within biomedical research (e.g., Guo et al., 2019; Johnson et al., 2010; Thall & Cook, 2004; van de Schoot et al., 2018; Zondervan-Zwijnenburg et al., 2017). There are various forms of prior elicitation. For example, through structured interviews, information about prior beliefs can be extracted by one or multiple experts in the respective field, which is subsequently combined into one prior (see, e.g., Johnson et al., 2010; O’Hagan et al., 2006; Stefan, Evans, et al., 2022; van de Schoot et al., 2018). Tools like the MATCH software (Morris et al., 2014) have been developed for this purpose. Alternatively, the results of meta-analyses and prior research in general can be used to create a prior (e.g., Rietbergen et al., 2011; van de Schoot et al., 2018). That is, researchers could use the overall effect size combined with a measure of uncertainty from a meta-analysis to construct an informed prior for their own analysis; or they could conduct their own literature search and extract the relevant statistics and use them for developing priors. This approach of using prior study results can also be combined with an empirical Bayes approach, which utilizes the current data to create a prior instead of predefining it (e.g., Casella, 1985). Such a procedure was proposed by van Zwet and Gelman (2022).
In this article, we follow the approach of using results of prior studies to create priors. For this, we conducted our own literature search instead of relying on meta-analyses. The reason for this is that we aim to suggest priors that are generic, such that they apply to entire medical subfields; most meta-analyses do not offer this generality.
The Current Study
In the present article, we develop priors for Bayesian Cox regression for nine subfields that we believe are representative for different areas of research within biomedicine. For the construction of these priors we make use of reported hazard ratios and associated x% confidence intervals from large corpora of existing studies. These extracted data are then combined through pooling to generate priors.
Methods
We selected the subfields in biomedicine considered for further investigation based on a taxonomy provided by Scimago (available at https://www.scimagojr.com/journalrank.php; SCImago, n.d.). On the Scimago website, we used “Medicine” as a subject area, upon which a list of medical subfields were provided (see Figure 1). Among those, we selected the following nine subfields for further consideration:
Settings for extracting top journals from Scimago (available at https://www.scimagojr.com/journalrank.php; SCImago, n.d.) for one of the nine considered subfields in biomedicine; in this case “Anesthesiology and pain medicine”.
Anesthesiology and pain medicine
Cardiology and cardiovascular medicine
Gastroenterology
Hematology
Immunology and allergy
Neurology
Oncology
Psychiatry and mental health
Pulmonary and respiratory medicine
Our selection was based on three criteria: (1) we aimed to obtain a manageable number of subfields (between eight and twelve); (2) we aimed to obtain subfields with limited overlap; and (3) we aimed to obtain subfields that represent relatively large areas of study within biomedicine. For each of the nine subfields, we obtained a list of the top journals from Scimago. We considered only journals (i.e., neither book series, nor conferences and proceedings, nor trade journals); we considered journals from all regions/countries; and the journal list was based on the year 2022 (see Figure 1 for an example of the settings on the Scimago website for the subfield of “Anesthesiology and pain medicine”). The extraction of the top journals for all nine subfields yielded 2, 469 journals in total (see columns 1 and 2 of Table 1). Some subfields shared a set of journals; for example, the journal “Pain” is a top journal for both the subfields of “Anesthesiology and pain medicine” and “Neurology”. We found that there was not a lot of overlap of journals between the subfields, with 2, 196 of the 2, 469 journals being uniquely assigned to only one subfield.
Number of used journals and studies for each of the nine subfields within biomedicine. The first column indicates the subfield, the second column the number of used (i.e., matched between Scopus data and Scimago data) journals (not necessarily unique) from all Scimago journals, the third column the number of studies allocated, the fourth column the number of studies for which there was a match and data extraction was successful, and the fifth column the number of studies remaining after excluding studies that provide flawed results.
As a separate step, we used Scopus to obtain a list of medical articles. We used the following search query:
ABS((“hazard ratio” OR {hr}) AND {cox}) AND SUBJAREA(medi) AND PUBYEAR > 1999 AND PUBYEAR < 2021 AND (LIMIT-TO(SRCTYPE,”j”)) AND (LIMIT-TO(PUBSTAGE,”final”)) AND (LIMIT-TO(DOCTYPE,”ar”)) AND (LIMIT-TO(LANGUAGE,”English”))Only fully published (line 5) articles (line 6) from a journal (line 4) written in English (line 7) were considered. Furthermore, the results had to belong to the field of medicine (line 2) and be published between the years 2000 and 2020, inclusive (line 3). Lastly, the abstracts of the results needed to contain the keywords “hazard ratio” or “HR” and the keyword “Cox”, ignoring case (line 1). Note, however, that this search query was generic such that it did not restrict the results towards one of the nine subfields. The Scopus query yielded 59, 669 results, of which 23 could not be exported, leaving 59, 646 results in total (see column 3 of Table 1).
The 59, 646 Scopus results were allocated to the nine subfields by matching the journal names indicated by Scopus to the Scimago lists of journal names for the nine subfields. Importantly, a Scopus result could be allocated to multiple subfields as some subfields had journals in common. Before the allocation of Scopus results to the nine subfields, both the Scopus journal names and the Scimago lists of journal names were cleaned and standardized in order to accommodate slight differences in their presentation. This included replacing “&” and “&” with “AND”, removing all characters that are not alphabetic or white space, repositioning the word “the” (e.g., “Lancet Oncology, The” was turned into “The Lancet Oncology”), and transforming all characters to uppercase. The number of matched journal names relative to the total number of Scimago journal names for the nine subfields are shown in column 2 of Table 1 and the number of allocated results for each of the nine subfields can be seen in column 3 of Table 1.
Once the individual Scopus results were allocated to the nine subfields, we extracted hazard ratios and associated x% confidence intervals from the abstracts of the results. This was done in an automatic fashion through the use of regular expressions (see Friedl, 2006, for the standard reference on regular expressions). We extracted the following information from the abstracts:
Hazard ratio (HR)
Confidence level of the confidence interval (CI) for HR (i.e., 100 (1 − α))
Lower boundary of the CI for HR (HRl)
Upper boundary of the CI for HR (HRu)
There are several important details about our implemented text-mining procedure. First, we exclusively considered the abstracts for data extraction. Second, if the regular expression yielded multiple matches for a given abstract, only the first match was considered; any other matches were discarded. The justification for this decision was that we assumed that the main findings are commonly reported first, followed by secondary or exploratory findings.
Third, data extraction was only done when all of the four above-mentioned information were available in the abstract. We disregarded abstracts where results were not complete or were presented in any other form. For instance, presentations of a HR coupled with a p-value and potentially a test statistic were ignored. Although this seems like an overly drastic measure, the number of matches was still very high (see column 4 of Table 1). Fourth, we did not distinguish between variations of Cox regression (e.g., multivariate, stratified, multiple predictors).
Fifth, we allowed various forms of the displayed results. For example, the following variations were all captured by our regular expression: “HR = 2.3” (with or without spaces around =), “hazard ratio = 2.3”, “hazards ratio = 2.3”, “hazard ratios = 2.3”, “hazard ratio (HR) = 2.3”, “hazards ratio [HR] = 2.3”, “HR : 2.3”, “H.R. : 2.3”, and many more. Thus, we attempted to make the regular expression as flexible as possible, so that it could capture the maximum amount of valid text, while still maintaining a healthy level of restrictions. For more details, please consult our code, available at https://osf.io/ua4ys/. In total, we were able to extract data for 21, 768 out of 36, 431 results (see column 4 of Table 1).
We applied additional checks on the extracted data to make sure that both the regular expression worked properly and the information in the abstracts themselves was correct. As a first step, we checked whether the confidence level of the CI was between 0 and 100. Second, we tested whether HR, HRl, and HRu were higher than 0 (because the possible range goes from 0 to ∞). Third, we examined whether the log HR was approximately (because of rounding) in the middle of HRl and HRu. Here, we also excluded results where the log HR and at least one of HRl or HRu had the same value due to rounding. Any extracted data that did not fulfill all of these criteria was discarded. Column 5 of Table 1 shows that for all nine subfields only a small number of extracted data had to be excluded (170 out of 21, 768 in total), leaving a final number of considered results of 21, 598.
With this step completed, the nine subfields and their hazard ratios and associated x% confidence intervals were considered separately. For each study i in i ∈ {1, 2, …, N} within one of the nine subfields (where N is the number of studies within one of the nine subfields), the extracted HRi was log-transformed (bi). Also, the standard error of bi was calculated based on the confidence interval (e.g., Higgins et al., 2019) of HR (i.e., and
):
where
and Q (.) is the quantile function of the standard Normal distribution. The sign of bi, however, is meaningless because it depends on how the independent variable is coded. For instance, commonly the control condition is coded with 0 and the experimental condition with 1; occasionally, the opposite is the case, which would reverse the sign of bi. Therefore, we “mirrored” bi, so that we have both −bi and bi:
and the corresponding standard errors:
The calculated ω and θ within a subfield could then be used for the construction of a prior for each subfield separately. We decided to use the Normal distribution for the prior. To obtain a prior, we combined the mirrored data through pooling (e.g., Higgins et al., 2019). Using this procedure, the individual values in ω and θ were treated as coming from separate samples that were combined into a single pooled sample. One desirable feature of this pooling method is that ω values with higher θ are central to (rather than discarded for) the calculation of the pooled standard error. In other words, the θ around the ω values had a direct influence on the calculation of the pooled standard error: All else being equal, the higher the θ of a sample, the more it would increase the pooled standard error. We deemed this behavior desirable since we wanted the prior to reflect uncertainty. The resulting mean and standard error of a single pooled sample served as the mean and standard deviation of the prior.
We decided against using the inverse-variance weighting procedure that is commonly used in meta-analyses (see, e.g., Borenstein et al., 2021). The reason for this was that the prior would get increasingly narrow as the corpus of considered studies increases. In addition, effect sizes with high θ values have a relatively low influence (they are less diagnostic in determining the mean), which was undesirable for our purposes of trying to estimate the spread of the prior. Also, we decided to not use the partly empirical Bayes procedure described in van Zwet and Gelman (2022) because the nature of it prescribes that the current data (i.e., not only the corpus of prior studies) is used to determine the prior.
Using the pooling method, determining the mean of the prior was redundant because it was always 0 due to the mirrored nature of our data. However, in principle, the mean µp of the prior can be calculated as follows (see Higgins et al., 2019):
upon which the standard deviation of the prior σp can be calculated:
Here, j ∈ {1, 2, …, K} is an index of the mirrored data ω and θ, which both have length K = 2N . In essence, Equation 10 sums up the squared standard errors
and it sums up the squared deviations of ωj from the pooled mean µp. Thus, all else being equal, σp increases as θj increases and as (ωj − µp)2 increases. Note that our extracted information did not provide information on the sample size nj within each study. We used an arbitrary sample size of n = 200 for all nj, but we tested whether the choice of nj had an influence on the calculation of σp. Specifically, we varied nj with nj = n ∈ {10, 11, …, 10000}, so that all studies had the same sample size, and applied Equation 10. In order to verify the robustness of the assumption of equal sample size per study, we randomly varied nj across studies, so that nj in Equation 10 could take on values sampled from U (10, 10000). We repeated this procedure 100, 000 times to accommodate many possible arrangements of nj.
Results
The results for the construction of the Normal priors for the nine subfields through the pooling method (Higgins et al., 2019) can be seen in Figure 2. The panels represent the nine subfields. Histograms show the distributions of ω for the different corpora, independent of θ. The red curves show the priors resulting from the application of the pooling method (Higgins et al., 2019).
Prior distributions for the nine subfields, one subfield per panel. The histograms show the distribution of ω (i.e., ignoring θ). The red curves display the densities of the Normal priors on β, where σp is presented at the top of each panel.
For all nine subfields, the center of the Normal prior was located at µp = 0 as a necessary consequence of our decision to mirror the data. The more interesting parameter of the Normal prior is the standard deviation σp because it reflects both the effect sizes and the uncertainties around them based on past studies within a subfield. From the histograms and the Normal priors it can be seen that the effect sizes, and therefore also the Normal priors, were similar across the nine subfields. σp ranged between 0.915 for “Psychiatry and mental health” and 1.079 for “Gastroenterology”.
Since the calculation of σp through the pooling method (Higgins et al., 2019) was based on an arbitrary choice of nj = n = 200 that was the same for all studies within a corpus of a specific subfield, we also investigated the dependence of σp on nj. Figure 3 shows this dependence, where the panels represent the different subfields. The curves represent the variation of σp as a function of nj = n, which was the same across all studies within a specific corpus of a subfield. The box plots represent the variation of σp as a function of nj, where nj for each study j was sampled from U (10, 10000) (i.e., different sample sizes were possible across studies), over 100, 000 repetitions.
Sensitivity of the pooling method with respect to the choice of nj for the estimation of σp. The nine panels correspond to the nine subfields. The curves display σp of the priors as a function of nj = n (i.e., all studies have the same sample size). The box plots show σp of the priors when each nj is drawn randomly from U (10, 10000) (i.e., studies have different sample sizes); this process was repeated 100, 000 times.
When assuming that nj across studies are equivalent, only small variations in σp were observed, as shown by the curves. For small nj, σp was smaller compared to when nj was large. In the limit of nj, σp seems to reach an asymptote. However, the assumption that sample sizes are equal across studies is overly unrealistic and simplistic. When nj were allowed to vary across studies, a larger variation in σp was observed, as shown by the box plots. Still, the variation in σp was not radical enough to question our heuristic of using nj = 200 for all studies within a corpus of a subfield.
Example Application
To illustrate the application of one of the nine priors (i.e., the prior for “Oncology”), we conducted a Bayesian fixed-effect meta-analysis of clinical trials investigating the effectiveness of novel cancer drugs that are approved by the Food and Drug Administration (FDA; for the CEIT-Cancer project see Ladanie et al., 2018). All of the studies included in the meta-analysis consisted of Cox regressions. We used data provided by and described in Pittelkow et al. (2023, available at https://osf.io/qz7xy/). Our goal was to estimate the evidence for or against the effectiveness of medications for the treatment of breast cancer. Note, however, that our analysis should be interpreted as a mere demonstration or proof of concept only, rather than a thorough meta-analysis with meaningful and robust results.
For our Bayesian fixed-effect meta-analysis, we only considered randomized controlled trials that were double-blinded and that contained a placebo control (i.e., not an active control). Further, we only considered “overall survival” (i.e., not “progression-free survival” or “tumor response”) as an outcome measure and exclusively included trials investigating breast cancer. This yielded a total of five trials. The b coefficients and the corresponding standard errors can be found in Table 2.
Drug names and corresponding b coefficients and SE (b) for the considered trials.
We conducted our Bayesian fixed-effect meta-analysis using the “metaBMA” R package (Heck et al., 2019).
> library(“metaBMA”)We defined the prior on the effect size β, which is one-sided negative because our alternative hypothesis states that the hazard is lower in the treatment compared to the control condition (i.e., ℋ1 : β < 0).
> onc_prior <- prior(family = “norm”, > param = c(mean = 0, > sd = 0.968), > upper = 0)Subsequently, we defined the variables for the b coefficients and corresponding standard errors and conducted the fixed-effect meta-analysis.
> b <- c(-0.443, -0.293, -0.315, -0.158, -0.412) > b_se <- c(0.159, 0.191, 0.212, 0.182, 0.142) > > mod <- meta_fixed(y = b, > SE = b_se, > d = onc_prior)The results indicate that BF10 = 2, 887, suggesting decisive evidence (Jeffreys, 1961; Kass & Raftery, 1995) in favor of the overall effectiveness of drugs for breast cancer.
> mod$bf # (denominator) # (numerator) fixed_H0 fixed_H1 # fixed_H0 1.00 0.0003463491 # fixed_H1 2887.26 1.0000000000Discussion
Survival analysis, and in particular Cox regression (Cox, 1972), is an indispensable statistical tool for biomedical research. The ubiquitous frequentist framework is limited in that it cannot quantify evidence in favor of the null hypothesis that there is no effect (Rouder et al., 2009) and in that it does not allow continuing or stopping data collection based on interim analyses (Armitage et al., 1969; Rouder, 2014; Schönbrodt & Wagenmakers, 2018; Schönbrodt et al., 2017; Stefan, Schönbrodt, et al., 2022; Tendeiro et al., 2022). Bayes factors remedy these shortcomings and permit intuitive interpretations. Nevertheless, the specification of priors that are required for Bayesian analyses can be difficult.
In this paper, we have developed priors for the β parameter in Cox regression for nine subfields in biomedicine (see column 1 in Table 1). These priors were informed by large corpora of already existing studies within the respective subfields and therefore provide reasonable approximations to the to-be-expected effect sizes and uncertainties thereof. For all nine subfields, we decided to use a Normal prior, which is centered on µp = 0. We found very similar standard deviations for the Normal priors across the nine subfields, ranging from σp = 0.915 for “Psychiatry and mental health” to σp = 1.079 for “Gastroenterology”, suggesting considerable similarities across subfields. Since our developed priors differ only slightly across the nine subfields, we believe that it is reasonable to use a standard Normal prior (i.e., N (0, 1)) for all nine subfields, forming a compromise among the nine individual priors, as a starting point. Still, any choice of prior is always arbitrary to some degree. Therefore, we urge researchers to complement their analysis using a specific prior with sensitivity analyses (e.g., Berger et al., 1994; Depaoli & van de Schoot, 2017; Du et al., 2019; Kruschke, 2015), in which parameters of the prior are systematically varied and even entirely different (reasonable) priors are chosen, in order to examine the robustness of the resulting Bayes factors.
We caution the reader to not take our proposed priors to be set in stone. The choice of prior always depends to some extent on the goals of the researcher. For example, it might not always be desirable for the prior to accurately reflect expected effects. Sometimes, the focus might be on ensuring sufficient shrinkage of the parameter estimate (e.g., Park & Casella, 2008; van Erp et al., 2019; van Zwet & Gelman, 2022).
Moreover, our process of arriving at the priors contained decisions, assumptions, and heuristics that might be questioned. First, the allocation of the articles to the nine subfields based alone on the journal is a drastic heuristic. A proportion of the articles is therefore probably classified into the wrong subfield. Second, the regular expression that we created to extract hazard ratios and associated x% confidence intervals from the abstracts of the articles might have been biased to some extent. Some journals have very specific reporting guidelines for the abstracts, which might not have been captured by our regular expression. Thus, it is possible that certain journals were systematically underrepresented in our results. Third, we only matched abstracts where a hazard ratio is combined with a confidence interval but not abstracts where a hazard ratio is combined with a p-value. We made this decision because p-values often do not map directly onto the confidence intervals. Additional information on how the p-value was calculated would be needed, which is rarely available in abstracts. Fourth, for the Cox regression analyses, we did not differentiate between different types of predictors (e.g., categorical, continuous) and types of analyses (e.g., stratified, multivariate), leaving open the possibility that certain nuances are ignored by our calculations.
It is clear that our proposed priors for Bayesian Cox regressions are very generic, such that one individual prior accommodates an entire subfield (or even all nine subfields if we are willing to accept the standard Normal prior). These priors might still be appropriate approximations for smaller specializations within subfields. However, in these cases it might be worthwhile to obtain more informed and precise priors that are tailored to these smaller subfields.
Conclusion
The analysis of time-to-event data with Cox regression (Cox, 1972) is pervasive in biomedical research. Cox regression combined with Bayes factors has much to offer over traditional frequentist inference because it allows researchers to directly contrast the evidence for the null and alternative hypotheses (Rouder et al., 2009) and because it allows monitoring results during data collection and continue or stop at any time (Armitage et al., 1969; Rouder, 2014; Schönbrodt & Wagenmakers, 2018; Schönbrodt et al., 2017; Stefan, Schönbrodt, et al., 2022; Tendeiro et al., 2022). These characteristics of Bayes factors have the potential to reduce the waste of scarce resources in biomedical research and especially clinical trials (Macleod et al., 2014; van Ravenzwaaij et al., 2019). However, the specification of priors for these Bayesian analyses can be challenging and be perceived as overly subjective. We propose default priors in Cox regression that are informed by large corpora of already existing studies for nine subfields. These priors are all Normal distributions centered on 0 with standard deviations that are close to 1. They can be used as a default or starting point for medical researchers and can be augmented with sensitivity analyses.
Acknowledgments
This research was supported by a Dutch scientific organization VIDI fellowship grant (016.Vidi.188.001) awarded to Don van Ravenzwaaij and a Japanese JSPS KAKENHI grant (21K20211) awarded to Jorge N. Tendeiro.
Footnotes
Author Note