Abstract
A novel coronavirus (2019-nCoV) has recently emerged as a global threat. As the epidemic progresses, many disease modelers have prioritized estimating the basic reproductive number ℛ0, defined as the average number of secondary cases caused by a primary case. While these efforts are extremely valuable, their modeling approaches and the resulting estimates vary widely. Here, we present a framework for comparing different estimates of ℛ0 across a wide range of models by decomposing it into three key quantities (the exponential growth rate r, the mean generation interval , and the generation-interval dispersion κ) and apply our framework to early estimates of ℛ0 for the 2019-nCoV outbreak. Our results emphasize the importance of propagating uncertainties in all three quantities, in particular in the shape of the generation-interval distribution. While rapid response during an outbreak is important, careful consideration of methodology is also worthwhile. Modelers should work with field workers to develop better methods for characterizing generation intervals.
1 Introduction
Since December 2019, a novel coronavirus (2019-nCoV) has been spreading in China and other parts of the world (World Health Organization, 2020c). Although the virus is believed to have originated from animal reservoirs (Centers for Disease Control and Prevention, 2020), its ability to directly transmit between humans has posed a greater threat for its spread (Huang et al., 2020; World Health Organization, 2020a). As of January 30th, 2020, the World Health Organization (WHO) has confirmed 7818 cases, including 82 confirmed cases in 18 different countries, outside China (World Health Organization, 2020b). WHO has now declared the outbreak a Public Health Emergency of International Concern (World Health Organization, 2020d).
As the disease continues to spread, many researchers have published analyses of the out-break, focusing in particular on estimates of the basic reproductive number ℛ0 (i.e., the average number of secondary cases generated by a primary case in a fully susceptible population (Anderson and May, 1991; Diekmann et al., 1990)). Estimating the basic reproductive number is of interest during an outbreak because it provides information about the level of intervention required to interrupt transmission (Anderson and May, 1991), and about the final size of the outbreak (Anderson and May, 1991; Ma and Earn, 2006). We commend these researchers for their timely contribution and those who made the data publicly available. However, it can be difficult to assess the estimates of ℛ0 (as well as the associated degrees of uncertainty) when the estimation methods and their underlying assumptions vary widely, especially since these assumptions can affect the estimates.
Here, we show that a wide range of approaches to estimating ℛ0 can be understood and compared in terms of estimates for three quantities: the exponential growth rate r, the mean generation interval , and the generation-interval dispersion κ. The generation interval, which is defined as the time between when an individual becomes infected and when that individual infects another individual (Svensson, 2007), plays a key role in shaping the relationship between r and ℛ0 (Wearing et al., 2005; Roberts and Heesterbeek, 2007; Wallinga and Lipsitch, 2007; Park et al., 2019); therefore, estimates of ℛ0 from different models directly depend on their implicit assumptions about the shape of the generation-interval distribution and the exponential growth rate. Early in an epidemic, information is scarce and, inevitably, there is large uncertainty around case reports (affecting the estimates of the exponential growth rate) and contact tracing (affecting the estimates of the generation-interval distribution). We suggest that disease modelers should make sure their assumptions about these three quantities are clear and reasonable, and that estimates of uncertainty should propagate error from all three sources.
We evaluate six disparate models published online between January 24–26, 2020 that estimated ℛ0 for the 2019-nCoV outbreak (Imai et al., 2020; Liu et al., 2020; Majumder and Mandl, 2020; Read et al., 2020; Riou and Althaus, 2020; Zhao et al., 2020). We use our framework to construct pooled estimates for the three key quantities: r, , and κ. We use these pooled estimates to illustrate the importance of propagating different sources of error, particularly uncertainty in both the growth rate and the generation interval. We also use our framework to unravel which assumptions of these different models led to their different estimates and confidence intervals.
2 Results
We gathered information on estimates of ℛ0 and their assumptions about the underlying generation-interval distributions from 6 articles that were published online between January 24th, 2020 and January 26th, 2020 (Table 1). As most studies do not report their estimates of the exponential growth rate or the associated confidence intervals, we first recalculate the exponential growth rate that correspond their model assumptions. We do so by modeling explicitly or implicitly reported distributions of the reproductive number ℛ0, the mean generation interval , and the generation-interval dispersion parameter κ with appropriate probability distributions; we used Gamma distributions to model values reported with confidence intervals and uniform distributions to model values reported with ranges. For example, Study 2 estimated ℛ0 = 2.92 (95% CI: 2.28–3.67); we model this estimate as a Gamma distribution with a mean of 2.92 and a shape parameter of 67, which has a 95% probability of containing a value between 2.28 and 3.67 (see Table 2 for a complete description). We then constructed a family of parameter sets – which include r, , and κ – for each study and used these in a Bayesian multilevel model to build a distribution of pooled estimates (see Methods).
Fig. 1 compares the reported values of the exponential growth rate r, mean generation interval , and the generation-interval dispersion κ from different studies with the pooled estimates (µr, µG, and µκ) that we calculate from our multilevel model. We find that there is a large uncertainty associated with the underlying parameters; many models rely on stronger assumptions that ignore these uncertainties. Surprisingly, no studies take into account how the variation in generation intervals affects their estimates of compares the reported values ℛ0: all studies assumed fixed values for κ, ranging from 0 to 0.5. Assuming fixed parameter values can lead to overly strong conclusions (Elderd et al., 2006). It is also interesting that none of the six studies explicitly or implicitly assumed an exponentially distributed generation interval (i.e., κ = 1), an assumption which used to be extremely common, particularly implicitly.
Fig. 2 shows how propagating uncertainty (µr, µG, and µκ) in different combinations would affect estimates and CIs for ℛ0. For illustrative purposes, we use our pooled estimates, which may represent a reasonable proxy for the state of knowledge as of 26 January. Comparing the models that include only some sources of uncertainty to the “all” model, we see that propagating error from the growth rate (which all but one of the studies reviewed did) is absolutely crucial: the middle bar (GI mean), which lacks growth-rate uncertainty, is far too narrow. Propagating error from the generation interval also has important effects. Once these two are included, the impact of leaving out uncertainty in the dispersion is small, though noticeable, in this particular example.
We also evaluate the estimates of ℛ0 across different studies by replacing their values of r, , and κ with our pooled estimates (µr, µG, and µκ) one at a time and recalculating the basic reproductive number ℛ0 (Fig. 3). We find that incorporating uncertainties one at a time increases the width of the confidence intervals all but three cases. We estimate slightly narrower confidence intervals for Study 2 and Study 6 when we use our pooled estimate of the generation-interval dispersion µκ to recalculate ℛ0 because they assume a narrow generation-interval distribution (compare base with GI variation in Fig. 3); when higher values of κ are used, their estimates of ℛ0 become less sensitive to the values of r and , giving narrower confidence intervals. We estimate narrower confidence intervals for Study 4 when we use our pooled estimate of the mean generation time µG to recalculate ℛ0 (compare base with GI mean in Fig. 3) because the range of uncertainty in the mean generation time they consider is much wider than the pooled range (Fig. 1).
Consistent with our previous observations (Fig. 2), we find that accounting for uncertainties in the estimate of r has the largest effect on the estimates of ℛ 0 (Fig. 3). For example, recalculating ℛ 0 for Study 6 by using our pooled estimate of r gives ℛ 0 = 3.9 (95% CI: 2.3–9.8), which is much wider than the uncertainty range they reported (2.0–3.1). There are two explanations for this result. First, even though the exponential growth rate r and the mean generation time have identical effects on ℛ0 under the gamma approximation framework (Eq. 2 in Methods), r has a greater overall effect on ℛ0 because it is associated with more uncertainty (Fig. 1). Second, assuming a fixed generation time (κ = 0) makes the estimate of ℛ0 too sensitive to r and as discussed previously.
Finally, we incorporate all uncertainties by using posterior samples for µr, µG, and µκ to recalculate ℛ0 and compare it with the reported ℛ0 estimates. Our estimated ℛ0 from the pooled distribution has a median of 3.1 (95% CI: 2.1–5.7). While the point estimate of 0 is similar to other reported values from this date range, the confidence intervals are wider than those of other studies. This result does not imply that assumptions based on the pooled estimate are too weak; we believe that this confidence interval more accurately reflects the level of uncertainties present in the information that was available when these models were fitted. In fact, because the pooled estimate does not account for overlap in data sources used by the models, we feel that it is more likely to be over-confident than under-confident. Our median estimate averages over the various studies, and therefore particular studies have higher or lower median estimates. Here, our focus is on certainty, not on the reason for these discrepancies.
3 Discussion
Estimating the basic reproductive number ℛ0 is crucial for predicting the course of an outbreak and planning intervention strategies. Here, we used a simple framework (Park et al., 2019) to compare estimates of ℛ0 for the novel coronavirus outbreak. Our results demonstrate the importance of accounting for uncertainties associated with the underlying generation-interval distributions, including with the amount of dispersion in the generation intervals: although our pooled estimates are relatively insensitive to the estimated uncertainty, our analysis of individual studies shows that assuming too narrow a generation-interval distribution can make the estimate of ℛ0 too sensitive to the estimates of the exponential growth rate r.
In this study, we focused on propagating errors arising from implicit or explicit estimates of growth rate and generation intervals. Other key issues underlying early estimates of ℛ 0 include statistical independence and types of noise.
Of the six studies that we reviewed, two of them directly fit their models to cumulative number of confirmed cases. This approach can be appealing because of its simplicity and apparent robustness, but fitting a model to cumulative incidence instead of raw incidence can both bias parameters and give overly narrow confidence intervals, if the result non independent error structures are not taken into account (Ma et al., 2014; King et al., 2015). Naive fits to cumulative incidence data should be avoided.
There are many sources of noise in real-world incidence data, including both dynamical, or “process”, noise (randomness that directly or indirectly affects disease transmission); and observation noise (randomness underlying how many of the true cases are reported). Disease modelers face the choice of incorporating one or both of these in their data-fitting and modeling steps. This is not always a serious problem, particularly if the goal is inferring parameters rather than directly making forecasts (e.g., Ma et al. (2014)). Modelers should be aware of the possibSility that ignoring one kind of error can give overly narrow confidence intervals (e.g., King et al. (2015)).
Here, we focused on the estimates of ℛ0 that were published within a very short frame of time (January 24th–26th). Although our analysis only reflects a snapshot of a fast-moving epidemic, our lessons will hold: confidence intervals must combine different sources of uncertainty. In fact, as epidemics progress and more data becomes available, it is likely that inferences about exponential growth rate will become more precise; thus the risk of over-confidence when uncertainty about the generation-interval distribution is neglected will become greater.
We strongly emphasize the value of attention to accurate characterization of the transmission chains via contact tracing and better statistical framework for inferring generation-interval distributions from such data (Britton and Scalia Tomba, 2019). A combined effort between public-health workers and modelers in this direction will be crucial for predicting the course of an epidemic and controlling it. We also emphasize the value of transparency from modelers. Model estimates during an outbreak, even in pre-prints, should include code links and complete explanations. Ideally, the code should not rely on closed-source programs. We have provided a basis for evaluating and comparing exponential-growth based estimates of ℛ 0 in terms of three simple components: the exponential growth rate, mean generation interval, and generation interval dispersion. We are hopeful that this will provide a guide to understanding and reconciling different estimates early in an epidemic.
4 Methods
4.1 Gamma approximation framework for linking r and ℛ0
Early in an outbreak, ℛ 0 is difficult to estimate directly; instead, ℛ 0 is often inferred from the exponential growth rate r, which can be estimated reliably from incidence data (Mills et al., 2004; Nishiura et al., 2009; Ma et al., 2014). Given an estimate of the exponential growth rate r and an intrinsic generation-interval distribution g(τ) (Champredon and Dushoff, 2015), the basic reproductive number can be estimated via the Euler-Lotka equation (Wallinga and Lipsitch, 2007):
In other words, estimates of ℛ 0 must depend on the assumptions about the exponential growth rate r and the shape of the generation-interval distribution g(τ).
Here, we use the gamma approximation framework (McBryde et al., 2009; Nishiura et al., 2009; Roberts and Nishiura, 2011; Park et al., 2019) to (1) characterize the amount of uncertainty present in the exponential growth rates and the shape of the generation-interval distribution and (2) assess the degree to which these uncertainties affect the estimate of ℛ 0. Assuming that generation intervals follow a gamma distribution with the mean and the squared coefficient of variation κ, we have
This equation demonstrates that a generation-interval distribution that has a larger mean (higher ) or is less variable (lower κ) will give a higher estimate of ℛ 0 for the same value of r.
4.2 Description of the studies
We reviewed 6 modeling studies of the novel coronavirus outbreak that were published online between January 24th, 2020 and January 26th, 2020 (Table 1). Five studies (Liu et al., 2020; Majumder and Mandl, 2020; Read et al., 2020; Riou and Althaus, 2020; Zhao et al., 2020) were uploaded to pre-print servers (bioRxiv, medRxiv, and SSRN), and one report was posted on the website of Imperial College London (Imai et al., 2020). There is a wide variation in their statistical methods and the amount of data they used to infer ℛ 0. Imai et al. (2020) and Riou and Althaus (2020) simulated branching process models and compared the predicted number of cases from their models with the estimated number of total cases by January 18th. Read et al. (2020) fitted a deterministic, metapopulation Susceptible-Exposed-Infected-Recovered (SEIR) model to incidence data between January 1st and January 21st from major cities in China and other countries. Zhao et al. (2020) and Liu et al. (2020) fitted exponential growth models to incidence data up to January 22nd and January 23rd, respectively, and inferred ℛ 0 via the Euler-Lotka equation (Eq. 1). Majumder and Mandl (2020) fitted the Incidence Decay and Exponential Adjustment (IDEA) model (Fisman et al., 2013) to incidence data up to January 26th, which is equivalent to fitting an exponential growth model and assuming a fixed generation-interval distribution.
4.3 Statistical framework
For each study i, we construct a family of parameter sets by drawing 100,000 random samples from the probability distributions (Table 2) that represent the estimates of ℛ 0i and the assumed values of and κi and calculating the exponential growth rate ri via the inverse of Eq. 2:
This allows us to approximate the probability distributions of the estimated exponential growth rates by each study; uncertainties in the probability distributions that we calculate for the estimated exponential growth rates will reflect the methods and assumptions that the studies rely on.
We construct pooled estimates for each parameter (r, , and κ) using a Bayesian multilevel modeling approach, which assumes that the parameters across different studies come from the same gamma distribution:
We account for uncertainties associated with ri, and κi (and their correlations), by drawing a random set from the family of parameter sets for each study at each Metropolis-Hastings step; this approach is analogous to Bayesian methods for analyzing phylogenetic data, which often rely on drawing random samples of phylogenetic trees from a discrete set to account for phylogenetic uncertainty (Pagel et al., 2004; Bedford et al., 2014). Since the gamma distribution does not allow zeros, we use κ = 0.02 instead for Study 6. We note that this approach does not account for non-independence between the parameter estimates made by different modelers. As we add more models, we expect the pooled estimates to become sharper even when the models no longer add more information. Thus, the pooled estimator should be interpreted with care.
Weakly informative priors are assumed on the hyperparameters:
We run 4 parallel Markov Chain Monte Carlo (MCMC) chains that consist of 200,000 burnin steps and 200,000 sampling steps. Posterior samples are thinned every 400 steps. Convergence is assessed by ensuring that the Gelman-Rubin statistic is below 1.01 for all hyperparameters (Gelman et al., 1992). 95% confidence intervals are calculated by taking 2.5% and 97.5% quantiles from the posterior distribution. R code is available in GitHub (https://github.com/parksw3/nCoVframework).
Data Availability
R code is available in GitHub (https://github.com/parksw3/nCoV\_framework)
Acknowledgements
We thank Ben Bolker for providing useful comments on the manuscript.