Global prediction of unreported SARS-CoV2 infection from observed COVID-19 cases

Carson C. Chow; Joshua C. Chang; Richard C. Gerkin; Shashaank Vattikuti

doi:10.1101/2020.04.29.20083485

Summary

Estimation of infectiousness and fatality of the SARS-CoV-2 virus in the COVID-19 global pandemic is complicated by ascertainment bias resulting from incomplete and non-representative samples of infected individuals. We developed a strategy for overcoming this bias to obtain more plausible estimates of the true values of key epidemiological variables. We fit mechanistic Bayesian latent-variable SIR models to confirmed COVID-19 cases, deaths, and recoveries, for all regions (countries and US states) independently. Bayesian averaging over models, we find that the raw infection incidence rate underestimates the true rate by a factor, the case ascertainment ratio CAR_t that depends upon region and time. At the regional onset of COVID-19, the predicted global median was 13 infections unreported for each case confirmed (CAR_t = 0.07 C.I. (0.02, 0.4)). As the infection spread, the median CAR_t rose to 9 unreported cases for every one diagnosed as of April 15, 2020 (CAR_t = 0.1 C.I. (0.02, 0.5)). We also estimate that the median global initial reproduction number R₀ is 3.3 (C.I (1.5, 8.3)) and the total infection fatality rate near the onset is 0.17% (C.I. (0.05%, 0.9%)). However the time-dependent reproduction number R_t and infection fatality rate as of April 15 were 1.2 (C.I. (0.6, 2.5)) and 0.8% (C.I. (0.2%,4%)), respectively. We find that there is great variability between country- and state-level values. Our estimates are consistent with recent serological estimates of cumulative infections for the state of New York, but inconsistent with claims that very large fractions of the population have already been infected in most other regions. For most regions, our estimates imply a great deal of uncertainty about the current state and trajectory of the epidemic.

Introduction

Nearly 3 million confirmed cases of the novel coronavirus SARS-CoV-2 infection have been reported worldwide as of late-April, 2020 (“COVID-19 Map” 19). However, it is presumed that many infections remain unreported due to either mildness of symptoms or inadequate testing. Knowledge of the full extent of the COVID-19 pandemic is required to evaluate the effectiveness of mitigation strategies such as social distancing, which have a major economic and social cost.

In the absence of universal testing, a proportion of the infected either do not cross the diagnostic threshold for COVID-19 testing or are unable to acquire medical attention and thus remain undetected. Consequently, the fraction of infected who are reported as cases may be substantially less than 1. Estimating the total case ascertainment ratio (CAR_t), defined as the ratio of the total number of confirmed cases to the total number of individuals infected with the novel coronavirus SARS-CoV-2 on day t, is thus important for constraining the initial reproduction number R₀ of COVID-19, as well as recovery and fatality rates due to the disease.

Estimating the true magnitude and dynamics of the fractions of the population who are infected, susceptible, or recovered is a difficult and open problem. Serological and molecular diagnostic tests may not have perfect specificity and sensitivity (a substantial problem if the true positive rates are also low), may not be widely available, and may continue to suffer from ascertainment bias. Recent modeling work has attempted to estimate the true infected population, often relying on reported deaths since there is less ambiguity in the definition (Lourenco et al.; Flaxman et al.). Flaxman et. al 2020 focused on death data from the European Centre of Disease Control and suggest that there are orders of magnitude more infected than detected in confirmed cases. This claim was supported by a separate SIR model fitting cases, case recoveries, and case deaths that predicted a CAR of 1/63 for Italy (Calafiore et al.). Recent, preliminary, serological test results have been mixed. A study from Benavid et. al (Bendavid et al.) is consistent with this high number of undiagnosed cases. However a study of women admitted for delivery estimated that 1 in 8 cases are symptomatic (Sutton et al.), which is consistent with a recent press report by New York state based on sampling grocery store customers (New York Times). Debates continue on the adequacy of these tests.

Here, we develop a set of Bayesian, mechanistic, latent-variable, SIR models (Kermack and McKendrick; Wo and Ag, “Contributions to the Mathematical Theory of Epidemics--II. The Problem of Endemicity.1932”; Wo and Ag, “Contributions to the Mathematical Theory of Epidemics--III. Further Studies of the Problem of Endemicity. 1933”) that respect the uncertainty in the underlying data. We explore multiple specifications and find general consensus among the models. Our models try to account for the effects of mitigation (such as social distancing), the possibility that CAR_t (either through changes in definition or in detection, for example via ramp-up of testing) can vary in time, excess variability due to reporting and heterogeneity in the regional population, and incomplete quarantine of reported active cases. A similar model assumed complete quarantine on the basis of debilitating effects of illness or strict adherence to policy (Pedersen and Meneghini), which we relax to quantify the effectiveness of such actions. Using the model, we predict the total number of infected individuals from the confirmed numbers of cases, case recoveries, and case deaths. These three data time-series are sufficient to constrain time-varying estimates for CAR_t as well as for the fraction of unobserved infected individuals from which we can obtain the total reproduction number R_t and total Infection Fatality Ratio (IFR_t) as functions of time.

Methods

Our overall approach uses latent-variable Bayesian models with underlying dynamics given by a family of SIR models (Kermack and McKendrick). We implement our models in the Bayesian inference software language Stan (Carpenter et al.; Hartikainen et al.) to estimate the model statistics within each region independently, using data from The Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE; Dong et al.) and from The COVID Tracking Project (“The COVID Tracking Project”). Collectively, these data consist of daily counts for the reported new cases, deaths, and recoveries (as well as key regional mitigation dates, where available (Flaxman et al.; Mervosh et al.)). We only consider regions that report all three of these quantities, which eliminates several US States and dates beyond April 15 for which the complete data was unavailable.

In the SIR models that we use to quantify the regional population dynamics of COVID-19, interactions between susceptible individuals (S) and infected individuals (I) induce an increase in I and a concomitant decrease in S. The unobserved members of I can either transition into a reported SARS-CoV-2 case, C, or become no longer infectious (either by recovering or dying). Confirmed cases C can then either recover (R_C) or die (D_C) and also infect more S at a rate that can differ from I. Mitigation efforts (including any social or policy changes aimed at reducing transmission rates) are modeled by a time dependence on the rate of infectious transmission. This rate changes from an initial value to a final value on a specified day with a 5 day transition period. We also include a similar time dependence in the rate of transition from I to C to account for changes in testing regimens, clinical classification, and awareness of infected individuals to seek medical attention. We consider six variations of the latent-variable SIR model of varying complexity. Besides the standard nonlinear SIR models, we also fit linear approximations that assume that S is approximately fixed (i.e. the epidemic is small enough that herd immunity is not significantly affecting transmission). The linear models have exponential growth with growth rates that can vary in time including becoming negative.

We tie each underlying SIR model to the available data using the generative probabilistic scheme shown in Figure 1. We assume a Negative Binomial likelihood, where the variance is a fitted parameter, for the observable data given the model predictions. This formulation accounts for the inherent variability of an underlying stochastic birth-death process (which usually obeys a Poisson distribution), disease progression effects, heterogeneous mixing of population clusters, and errors in classification and reporting inherent in the available data. We modeled prior probabilities for recovery and death rates using inverse gamma distributions derived from ref. (Verity et al.). For the other unknown distributions, we used weakly-informative priors for satisfying positivity and providing numerical and inferential stability. Priors for the day of mitigation application, taken from ref. (Flaxman et al.) and (Mervosh et al.), were used when available. We assessed models by balancing the likelihood of observing the data (given each model) against model complexity using a standard Bayesian model comparison metric such as Leave-One-Out cross validation (LOO) and WAIC (Vehtari et al.), to avoid overfitting. On the basis of this and other Bayesian metrics, we also report model-averaged statistics that take into account model specification uncertainty (Yao et al.).

Figure 1: Diagram of generative observation model.

The SIR models consist of differential equations that describe the evolution of the S, I, C, R_C, and D_C variables. The models generate the time dependent rate of appearance of cases (λ_C), case recoveries (λ_R), and case deaths (λ_D). These rates predict the observable data of daily new cases ΔC, new case recoveries ΔR_C, and new case deaths ΔD_C with a Negative Binomial likelihood function with an excess variance (fλ)², compared to a Poisson likelihood. An expanded version of this diagram with additional details is available in Fig. S1.

Computer code written for this project is available at https://github.com/nih-niddk-mbs/covid-sicr.

Results

Global estimates for R_t, CAR_t, and IFR_t, for 25 representative countries (selected based on case counts) and US states with the prerequisite data, are shown in Figures 2–4 (see Supplementary Material Table S1 for every region analyzed). Overall, we found a consensus between the different model predictions, including between linear and nonlinear models (Supplementary Material Figure S2). Results presented are Bayesian LOO weighted model averages. CAR_t can be greater than 1 in rare cases where more cases were introduced by migration at outbreak onset than disappeared from the infected pool without being detected. Credible intervals were broad for most region specific estimates (see Supplementary Material Table S1).

Global average quantiles across all regions and models are shown in Figures 2–5. The models capture the dynamics of R_t (Fig. 2), CAR_t (Fig. 3), and IFR_t (Fig. 4). The global median R₀ across all regions was 3.3 (C.I (1.5,8.3), Fig. 5). In the first week of the onset of COVID-19 appearance in a region (which differs by date for each region), we found that for every 1 case identified there are approximately 13 other infections unreported (CAR_t (week 0) of 0.07 (C.I (0.02,0.4)). The infection fatality ratio during the first week from regional onset has a global median of 0.0017 (C.I. (0.0005,0.009)) or 0.17%. As the infection spreads these estimates evolve over time such that the R_t on the week of April 15th, 2020 was 1.2 (C.I (0.6,2.5)). While R_t fell, the global median CAR_t and IFR_t increased over the same time period to 1 in 10 for CAR_t (April 15th, 2020) = 0.1 (C.I. (0.02,0.5)) (a larger fraction of infections ultimately detected) and 0.008 (C.I. (0.002,0.04)) or 0.8% for IFR_t (April 15th, 2020). IFR_t will be lower than the true mortality rate since it does not account for unreported deaths due to COVID-19. We have also not corrected for the fact that some of the current cases are critically ill and may die in the future. We should also note that since the onset date of COVID-19 varies regionally, the global median as of April 15 is an aggregate over different stages of regional pandemic progression.

We also evaluated the effects of quarantine and excess variability. Across models we observed a substantial decrease in infection rate by active cases. However, the residual infection rate was still above zero with broad uncertainty. The infectiousness of active cases (the known infected) across all regions was much lower than that of the remaining (unreported) infected (Figure 6), with a median global q = 0.03 (C.I. (0.001,0.16)). While the median global quarantine attenuation was large, there was a broad distribution with some areas having smaller degrees than others. Within Europe, Italy had the smallest median case infection rate consistent with strong quarantine measures (Pedersen and Meneghini) (see Supplementary Figure S3 for regional details).

Variability in reported data likely leads to overdispersion, i.e. to counts that are more variable than would be expected from a Poisson distribution. We thus accounted for and analyzed overdispersion by using a negative binomial distribution in which the excess variance is modeled as (fλ)² where λ is the mean rate for that day and f is an overdispersion factor that we estimate. As shown in Supplementary Material Figure S4, f was approximately 1.5 across all regions.

Figure 2: Estimates of the reproduction number R_t across selected regions and time.

Upper panel: R₀ for selected regions. Bars show median, interquartile range, and 95% credible interval for each region. Lower panel: R_t on April 15th in all regions. Color indicates change from R₀ (more red is more reduction).

Figure 3: Estimates of the case ascertainment ratio (CAR_t) across selected regions and time.

Upper panel: Initial CAR_t for selected regions. Bars show median, interquartile range, and 95% credible interval for each region. Lower panel: CAR_t on April 15th in all regions. Color indicates change from initial CAR (more red is more reduction).

Figure 4: Estimates of the infection fatality rate IFR_t across selected regions and time.

Upper panel: Initial IFR_t for selected regions. Bars show median, interquartile range, and 95% credible interval for each region. Lower panel: IFR_t on April 15th in all regions. Color indicates change from initial IFR (more red is more reduction).

Figure 5: Global distributions of (a) R_t, (b) CAR_t, and (c) IFR_t using average across models.

Figure shows quantiles median, interquartile range, and 95% credible interval.

Figure 6: Effectiveness of quarantine.

Shows a median q of 0.03 (C.I. (0.001, 0.16)) in the relative infection rate from active known cases. A value of q=0 would correspond to perfect quarantine of active eases (no new infections result from a case once it is identified). See Figure S2 for regional details.

Discussion

We use our Bayesian mechanistic models to estimate the extent of COVID-19 prevalence, from publicly-available data. Importantly, we quantify the uncertainty of our estimates, finding them to be large in some regions. Our models perform this estimation task under simplified but plausible mechanisms for infection transmission, case detection, recovery, death, and disease spread mitigation.

The advantage of mechanistic models over non-mechanistic curve extrapolation (Murray and IHME COVID-19 health service utilization forecasting team) is that the model parameters have direct mechanistic interpretations that can be validated independently. More mechanisms and details can also be added incrementally.

Estimate of the Case Ascertainment Ratio (CAR)

We show that the unaccounted cases exceed the reported by 10 to 20 fold, which is consistent with two independent groups in New York (one using PCR testing) (Sutton et al.; New York Times) but not with a similar test in Santa Clara (Bendavid et al.). However, factors such as false positive rates of the serological tests and selection bias could affect outcomes. Furthermore, the studies cited were not drawn from random samples of the population. Until wide-spread serological testing is implemented, the true proportion of cumulative infections will remain unknown, but we believe that our results provide a credible range for this proportion.

Our model could provide guidance on the effectiveness of various policies for reducing infectiousness until more complete data is obtained. Similar attempts have been made to model these factors. However, they have relied on priors for R₀ based on confirmed cases (Lourenco et al.; Flaxman et al.; Ferguson et al.), restricted data to reported deaths or cases only (Flaxman et al.; Pedersen and Meneghini; Murray and IHME COVID-19 health service utilization forecasting team), made strong assumptions about quarantine (Pedersen and Meneghini), or ignored the underlying mechanisms that guide the dynamics of epidemics (Murray and IHME COVID-19 health service utilization forecasting team). We are able to improve identifiability of key parameters compared to previous results as in (Lourenco et al.) because we use more information. The motivation to use only case death data is understandable as it may be a more reliable measure than case counts but it is insufficient to constrain the dynamics alone and thus has much less predictive power. However, as shown in the Supplementary Material (Sec. S2), models that includie cases, case recoveries, and case deaths, provide sufficient information for identifiability of undocumented infections although the uncertainty can be large.

Estimate of the time-dependent reproduction number (R_t)

Our model predicts the time dependent reproduction number R_t for each region, which can give an estimate as to how well mitigation policies are working. Our results are consistent with social distancing, or other policies reducing transmission, as shown by the decrease in R_t relative to R₀ at outbreak onset. Our directly inferred region-wide R₀ estimate of 3.3 is larger than the mean of priors chosen for some other models (Lourenco et al.; Flaxman et al.; Ferguson et al.), but it is consistent with estimates from other work (Cao et al.; Sanche et al.) including several reviewed in (Liu et al.). Our estimate is close to the upper estimates for Italy by Pedersen and Menegheni (Pedersen and Meneghini) of 3.5 when removing quarantine on active cases. Using a similar linear SIR model with complete quarantine lowered their R₀ estimate to 2.6. Our estimate of R was robust to assumptions on the effectiveness of quarantine (models without this factor gave similar estimates for R₀ see Supplementary Material Figure S2). Our estimate of parameter q near zero for Italy implies near perfect quarantine. There was more incomplete quarantine in other regions.

Policy implications

We found a decrease in R_t over time with the global estimate currently near 1, though many regions are still well above this as of April 15. However, R_t does not need to be below 1 for herd immunity to stop the spread of COVID-19. When the fraction of infected individuals reaches 1-1/R_t regionally, then in the absence of reinfection from outside, herd immunity will quench the pandemic locally (Fine et al.). Although there are regional variations, if the current estimate of R_t persists then the pandemic will begin to extinguish itself when 17% (C.I. (0%, 60%)) of the susceptible global population has achieved resistance due to either recovery from prior infection or vaccination (0% means we have herd immunity now, under the current mitigation measures). However, if mitigation measures are lifted and R_t returns to R₀, then that fraction rises to 70% (C.I. (33%, 88%)) to guard against further waves of illness upon reinfection.

The time-varying models give other notable results. On average CAR_t has increased, indicating that a larger proportion of infections are being detected. This might be explained by recognition of symptoms increasing during the pandemic coupled with broader availability of testing. Our estimate of IFR_t also increased during the spread of the infection. This is probably due to the fact that early in the pandemic those fated to die would have been less likely to be identified as COVID-19 deaths, or more generally from changes in how deaths are reported over time. Because the IFR that we report uses the conventional definition, it is “right censored” -- at any given time point the number of deaths from the current number of infections will continue to grow for some time -- it could reflect an underestimate of the true mortality rate. Because our models explicitly estimate the rate at which both case and non-case infected either die or recover, one can derive a closed form expression for the true mortality rate.

Assumptions and Limitations

Driven by mechanistic considerations, we make several simplifying modeling assumptions that we believe to be reasonable. Chiefly, we make the standard assumptions of SIR models. For instance, we ignore heterogeneity within populations and the effect of connectivity between populations. We subsume heterogeneity of populations themselves into the model parameters, which are to be interpreted as population-averaged quantities. While mitigation measures have greatly reduced travel, cross-population transmission may still be occurring at low levels. Although our models account for mitigation and for diagnostic changes, our rigid implementation of these factors is unlikely to fully capture the true time course of their effects. Finally, biologically, our model considers an extremely simplified version of disease progression within individuals, ignoring time inhomogeneous mechanisms that may be significant (Bottcher et al.).

Motivated by statistical considerations, we also consider reduced linear approximations to the SIR models, where changes in the size of the susceptible pool are ignored. Empirically, we find these approximations to be more stable for fitting the pandemic in the early stages than the standard nonlinear SIR model. A complication in using the standard nonlinear SIR model lies in the knowledge of the size of the susceptible population. In addition to the fact that it is not known what fraction of the population has innate immunity, the inherent heterogeneity of the spread of disease in the early stages renders the susceptible population estimate problematic if not impossible. (In our nonlinear models we fit it as an adjustable parameter). The linearized model circumvents this difficulty by scaling out the population, and to some degree ameliorates the homogeneous assumption, by averaging over the initial growth of the local clusters. Hence, for the linearized model during the early stages of an epidemic, heterogeneity in infectiousness has less of an impact on the underlying aggregate state variables, and so does not provide a worse fit than a well-mixed population with the same observed data. Only later in the epidemic when the effects of herd immunity are important will the limitations of a linear model with homogeneous mixing be revealed, although the time dependence in the infectiousness we introduce could also be capturing the effects of herd immunity.

Outside of our modeling assumptions, issues of data quality may affect the accuracy of our estimates. Disparate criteria exist across regions for defining cases and case deaths, oftentimes changing within regions themselves as each epidemic progresses.

Additionally, data (including death counts) reported by many regions is likely incomplete or unreliable, as evidenced in large spikes of unexplained deaths not officially recognized as Covid-19 deaths (Covid-19 Data - Tracking Covid-19 Excess Deaths across Countries | Graphic Detail | The Economist); while we believe that systematic under-testing can be overcome up to a point, totally inconsistent or fraudulent reporting cannot be.

Future Extensions

Additional factors need to be explored in future models. For example, a more robust and flexible implementation of mitigation, as well as heterogeneity of disease progression within individuals and in their social interactions, is essential. Rather than fitting regions independently it should be possible to fit a hierarchical global model in which region-specific parameters are partially pooled to achieve additional regularization and to narrow credible intervals. The linear model is robust to uncertainty in the size of the initial susceptible population but if the disease progresses such that we begin to reach population saturation, we must use a nonlinear model that explicitly accounts for population size to make reliable parameter estimates although the time dependence in infectiousness can partially mimic this nonlinear effect. We are optimistic that such a transition will be seamless since we find consistency in model predictions between linear and nonlinear models in the current work.

View this table:

Definitions

Supplementary Material

View this table:

Table S1

Figure S2 R₀ estimates across models showing model consistency.

Figure S3 Median case infectiousness factor q for all regions

Figure S4 Median overdispersion factor for all regions.

Blue line is median of 1.5 across all regions.

S1 Latent variable SIR Model

We consider a latent variable SIR model where the observables are cases C, case recoveries R_C, and case deaths D_C. The latent variables are the numbers of susceptible S and undetected infected I individuals. Assuming probability of contact between people is completely random and uniform and ignoring the possibility of births, migration, or deaths due to other causes, the mass action equations for the model are where N is the population of the region in question, β is the infection rate per contact, σ_C is the transition rate from I to C, σ_U is the disappearance rate of I not transitioning to C, and σ = σ_R + σ_D is the disappearance rate for cases. The parameter q accounts for possible differences in virus transmission rate from non-cases. It can be less than one due to say lower social contact or higher due to a larger viral load. If q = 0 then the model is identical to an SEIR model.

The equations obey the conservation condition where

The population U represents the portion of I that either recover or die but remain undetected.

If initially S = N, I << N, and the parameters are such that the right hand side of (S2) is positive, then I will grow until it reaches a peak where , which is given by the condition where σ = σ_R + σ_D. The fraction of the population remaining susceptible at the peak of the epidemic is the inverse of the reproduction number

(R₀ can also be computed using the next generation method). The pandemic will spread if R₀ > 1. The total number that becomes infected and thus are no longer susceptible is N_I = N − S. The peak of this number is given by

I will decrease after N_I passes . Thus, the pandemic can be mitigated by reducing , which can be achieved by either decreasing R₀ or decreasing N. We model the effects of mitigation with a time dependence in β via β → β_t where where t = 0 is the day of the first case. β_t transitions from its initial value to a new value βm_∞ on day m₅₀ with a transition time of 5 days. We also account for the possibility that the case detection rate can change in time with with σ_C → σt where

For example, the method of classification of cases changed in China on a single day leading to spike in the number of cases and this time dependence accounts for situations such as that.

The mean field SIR model assumes a well mixed population where the probability of interactions are homogeneous that does not reflect actual human interactions. An actual epidemic, rather than growing homogeneously throughout the population, will be seeded locally and spread within clusters that then propagate to other local clusters. Each cluster may locally saturate before spreading to another cluster and thus the average dynamics of the epidemic over a large region is not reflected by the actual spread at the local level. Hence, the population size N in the SIR model is not simply the regional population but a complex aggregation over many interacting local clusters, which may be difficult to estimate. This is a major limitation of mean field models since N is a major factor in determining the eventual rise and fall of the pandemic. However, this limitation can be circumvented by noting that in the initial stages of the epidemic where I is small and less than even the cluster population, we can fix S = N and the system (S1), (S2), and (S3) reduces to the two dimensional linear system

N is scaled out of the dynamics.

The eigenvalues for the linear system (S10) and (S11) assuming that β_t and σ_Ct are constant in time are where Tr = β_t − σ_Ct − σ_U − σ and Det = −σ(β_t − σ_Ct − σ_U) − qβσ_Ct. For q = 0 the eigenvalues simplify to

For most parameter values r₊ will dominate the temporal dynamics. The pandemic will grow when it is positive and begin to extinguish only when β_t falls below σ_Ct + σ_U.

We can use the eigenvalues for a lower bound on the rate for the pandemic to extinguish. In the nonlinear equations, we can consider an effective infectiousness β* = β_tS(t)/N = β_t(1 − n_I), where n_I is the fraction of the population that is immune, either innately or by recovering from the infection.

From (S7) we can set for q = 0 to obtain

Thus the pandemic will extinguish at the rate of infection disappearance moderated by the residual spread.

S2 Parameter identifiability

Here, we investigate whether the parameters in (S10) and (S11) are in principle identifiable from the observable data. We first consider the case without mitigation in which there are six parameters, β, σ_C, σ_R, σ_D, σ_U, and q together with the dynamic variables I(t) and C(t). The observable data consists of the time series of cases, case recoveries, and case deaths, which we denote by λ_C(t), λ_R(t),and λ_D(t), respectively. For this analysis, we assume perfect information with no uncertainty.

From the data, we can immediately construct the following conditions

From these conditions we immediately obtain C(t), σ_R, and σ_D. Dividing (S10) by (S11), we obtain where α₁ = β − σ_U − σ_C. Given that the epidemic is initiated with I > 0 and C = 0, then prior to the first case arising we can assume dI/dC = α₁/σ_C, from.which we can integrate to obtain σ_CI(0) = α₁C(0), where t = 0 marks the day of appearance of the first case. This then gives α₁ = λ_C(0)/C(0).

Multiplying (S10) by σ_C gives and thus σ_Cqβ = L/C. Finally, we can also rewrite (S10) as from which σ_C can be inferred. Given σ_C we can then infer I(t) from λ_C(t) from which we can derive conditions for β − σ_U and qβ, which are only two conditions for three unknowns. Thus, the observable data can at most specify five out of the six unknown parameters.

We can resolve this nonidentifiability in two ways. The first is by specifying one of the parameters. For example, we can specify σ_U given that we have a prior on the rate of recovery or death from the infection. We can also specify q since it is a measure of how effective the isolation of cases are. The second is to utilize the fact that mitigation acts as a time dependent perturbation on β assuming all the other parameters are fixed. The mitigation is set by two parameters, the day it is applied and the effectiveness in diminishing β. Essentially the model can be applied separately before and after mitigation where only β has a different value. If the day of mitigation is known then this results in four conditions for four unknowns, i.e. β_before, β_after, q, and σ_U and the parameters can be identified. Thus in principle, with perfect information, the model can be identified if given some prior information.

S3 Parameter estimation

We estimate the parameters and their uncertainties using Bayesian methods, by putting prior distributions on each of the model parameters. A schematic of our overall Bayesian latent-variable model is shown in Fig. S1

We consider various versions of the model with varying number of parameters and weigh them using model comparison measures such as WAIC and LOO.

Considering the mean field SIR equations as a birth and death process implies a Poisson likelihood. However, in addition to this inherent stochasticity, errors will be introduced due to variability in criteria for measurement and recording. We account for this additional variance with a Negative Binomial distribution cf. where n_C,R,D are the total number of new cases, recovered cases, and case deaths on day d, I(d), C(d) are the total number for day d, and ϕ is a fitted factor quantifying the extra variance where x ~ NegBinom(μ, ϕ) with E[X] = μ and var[X] = μ + μ²/ϕ. We use inverse gamma distributed priors on σ_U, σ_R and σ_D using parameters derived from ref X. We fit from the first day in which the daily case count is greater than one, which we call day 0. We then set C(0) to this count and I(0) = (β − σ_U − σ_C)/σ_CC(0).

Figure S1:

Bayesian Model diagram showing variable dependencies and prior distribution families.

The cumulative population that becomes infected and cumulative cases at time t are given by

The cumulative totals for recovered and dead are R_C and D_C. From these quantities we can estimate the case ascertainment ratio, N_C/N_I and the total infection fatality ratio, D_C/N_I.

S4 Data and Software

All data and code can be found at https://github.com/nih-niddk-mbs/covid-sicr. This repository contains several python scripts and Jupyter notebooks for replicating our findings, as described in the README file there. These can be used to obtain new post-publication estimates with additional data, provided that the data providers listed in the main text continue to provide a consistent API. We used Stan, which provides Bayesian inference using a heavily-optimized No-U-Turn sampler, a variant of Hamiltonian Monte Carlo, and each model described here is available as a .stan file. Python 3 was used for all other reported results, including interaction with Stan (via pystan). A subset of the work implemented in Julia is also available in the repository.

Acknowledgments

The authors would like to thank Christopher Kim for early assistance and Kevin Hall for critical comments. CCC and SV were supported by the Intramural Program of the NIH, NIDDK. RCG was supported by NIDCD, NINDS, and NSF. This work utilized the computational resources of the NIH HPC Biowulf cluster (http://hpc.nih.gov).

References

Bendavid, Eran, et al. COVID-19 Antibody Seroprevalence in Santa Clara County, California. preprint, Epidemiology, 17 Apr. 2020. DOI.org (Crossref), doi:10.1101/2020.04.14.20062463.
OpenUrl Abstract/FREE Full Text Google Scholar
Bottcher, Lucas, et al. Why COVID-19 Case Fatality Rates May Be Misleading: Individual- and Population-Based Mortality Estimates and Factors Influencing Them. 2020.
Google Scholar
Calafiore, Giuseppe C., et al. “A Modified SIR Model for the COVID-19 Contagion in Italy.” ArXiv:2003.14391 [Physics], Mar. 2020. arXiv.org, http://arxiv.org/abs/2003.14391.
Google Scholar
Cao, Zhidong, et al. “Estimating the Effective Reproduction Number of the 2019-NCoV in China.” MedRxiv, Jan. 2020, p. 2020.01.27.20018952. www.medrxiv.org, doi:10.1101/2020.01.27.20018952.
Google Scholar
Carpenter, Bob, et al. “Stan: A Probabilistic Programming Language.” Journal of Statistical Software, vol. 76, no. 1, 2017.
Google Scholar
Covid-19 Data - Tracking Covid-19 Excess Deaths across Countries | Graphic Detail | The Economist. https://www.economist.com/graphic-detail/2020/04/16/tracking-covid-19-excess-deaths-across-countries. Accessed 25 Apr. 2020.
Google Scholar
“COVID-19 Map.” Johns Hopkins Coronavirus Resource Center, https://coronavirus.jhu.edu/map.html. Accessed 15 Apr. 2020.
Google Scholar
Dong, Ensheng, et al. “An Interactive Web-Based Dashboard to Track COVID-19 in Real Time.” The Lancet Infectious Diseases, vol. 0, no. 0, Feb. 2020. www.thelancet.com, doi:10.1016/S1473-3099(20)30120-1.
OpenUrl CrossRef PubMed Google Scholar
Ferguson, Neil M., et al. Impact of Non-Pharmaceutical Interventions (NPIs) to Reduce COVID-19 Mortality and Healthcare Demand. 2020, p. 20.
Google Scholar
Fine, Paul, et al. “‘Herd Immunity’: A Rough Guide.” Clinical Infectious Diseases, vol. 52, no. 7, Oxford Academic, Apr. 2011, pp. 911–16. academic.oup.com, doi:10.1093/cid/cir007.
OpenUrl CrossRef PubMed Web of Science Google Scholar
↵
Flaxman, S., et al. “Report 13: Estimating the Number of Infections and the Impact of Non-Pharmaceutical Interventions on COVID-19 in 11 European Countries.” 35, Report, 30 Mar. 2020. spiral.imperial.ac.uk:8443, doi:https://doi.org/10.25561/77731.
Google Scholar
Hartikainen, Ari, et al. Stan-Dev/Pystan. 2013. Stan, 2020. GitHub, https://github.com/stan-dev/pystan.
Google Scholar
JHU CSSE. CSSEGISandData/COVID-19. 2020. 2020. GitHub, https://github.com/CSSEGISandData/COVID-19.
Google Scholar
Kermack, W. O., and A. G. McKendrick. “Contributions to the Mathematical Theory of Epidemics--I. 1927.” Bulletin of Mathematical Biology, vol. 53, no. 1–2, 1991, pp. 33–55. PubMed, doi: 10.1007/bf02464423.
OpenUrl CrossRef PubMed Web of Science Google Scholar
Liu, Ying, et al. “The Reproductive Number of COVID-19 Is Higher Compared to SARS Coronavirus.” Journal of Travel Medicine, vol. 27, no. 2, Mar. 2020. academic.oup.com, doi:10.1093/jtm/taaa021.
OpenUrl CrossRef PubMed Google Scholar
Lourenco, Jose, et al. “Fundamental Principles of Epidemic Spread Highlight the Immediate Need for Large-Scale Serological Surveys to Assess the Stage of the SARS-CoV-2 Epidemic.” MedRxiv, Cold Spring Harbor Laboratory Press, Mar. 2020, p. 2020.03.24.20042291. www.medrxiv.org, doi:10.1101/2020.03.24.20042291.
OpenUrl Abstract/FREE Full Text Google Scholar
Mervosh, Sarah, et al. “See Which States and Cities Have Told Residents to Stay at Home.” The New York Times, 31 Mar. 2020. NYTimes.com, https://www.nytimes.com/interactive/2020/us/coronavirus-stay-at-home-order.html.
Google Scholar
New York Times. Cuomo Says 21% of Those Tested in N. Y.C. Had Virus Antibodies - The New York Times. https://www.nytimes.com/2020/04/23/nyregion/coronavirus-new-york-update.html. Accessed 24 Apr. 2020.
Google Scholar
Pedersen, Morten, and Matteo Meneghini. Quantifying Undetected COVID-19 Cases and Effects of Containment Measures in Italy. 2020. ResearchGate, doi:10.13140/RG.2.2.11753.85600.
OpenUrl CrossRef Google Scholar
Sanche, Steven, et al. Early Release - High Contagiousness and Rapid Spread of Severe Acute Respiratory Syndrome Coronavirus 2 - Volume 26, Number 7—July 2020 - Emerging Infectious Diseases Journal - CDC. www.nc.cdc.gov, doi:10.3201/eid2607.200282. Accessed 24 Apr. 2020.
OpenUrl CrossRef PubMed Google Scholar
Sutton, Desmond, et al. “Universal Screening for SARS-CoV-2 in Women Admitted for Delivery.” New England Journal of Medicine, vol. 0, no. 0, Massachusetts Medical Society, Apr. 2020, p. null. Taylor and Francis+NEJM, doi:10.1056/NEJMc2009316.
OpenUrl CrossRef PubMed Google Scholar
Murray, Christopher JL, and IHME COVID-19 health service utilization forecasting team. “Forecasting COVID-19 Impact on Hospital Bed-Days, ICU-Days, Ventilator-Days and Deaths by US State in the next 4 Months.” MedRxiv, Cold Spring Harbor Laboratory Press, Mar. 2020, p. 2020.03.27.20043752. www.medrxiv.org, doi:10.1101/2020.03.27.20043752.
OpenUrl Abstract/FREE Full Text Google Scholar
“The COVID Tracking Project.” The COVID Tracking Project, https://covidtracking.com/. Accessed 24 Apr. 2020.
Google Scholar
Vehtari, Aki, et al. “Practical Bayesian Model Evaluation Using Leave-One-out Cross-Validation and WAIC.” Statistics and Computing, vol. 27, no. 5, Sept. 2017, pp. 1413–32. Springer Link, doi: 10.1007/s11222-016-9696-4.
OpenUrl CrossRef Google Scholar
Verity, Robert, et al. “Estimates of the Severity of Coronavirus Disease 2019: A Model-Based Analysis.” The Lancet Infectious Diseases, vol. 0, no. 0, Elsevier, Mar. 2020. www.thelancet.com, doi:10.1016/S1473-3099(20)30243-7.
OpenUrl CrossRef PubMed Google Scholar
Wo, Kermack, and McKendrick Ag. “Contributions to the Mathematical Theory of Epidemics--II. The Problem of Endemicity.1932.” Bulletin of Mathematical Biology, 1991, doi:10.1007/bf02464424.
OpenUrl CrossRef Google Scholar
Wo, Kermack, and McKendrick Ag. “Contributions to the Mathematical Theory of Epidemics--III. Further Studies of the Problem of Endemicity. 1933.” Bulletin of Mathematical Biology, 1991, doi:10.1007/bf02464425.
OpenUrl CrossRef Google Scholar
Yao, Yuling, et al. “Using Stacking to Average Bayesian Predictive Distributions (with Discussion).” Bayesian Analysis, vol. 13, no. 3, International Society for Bayesian Analysis, Sept. 2018, pp. 917–1007. Project Euclid, doi:10.1214/17-BA1091.
OpenUrl CrossRef Google Scholar

Comments

medRxiv aims to provide a venue for anyone to comment on a medRxiv preprint. Comments are moderated for offensive or irrelevant content (this can take ~24 h). Please avoid duplicate submissions and read our Comment Policy before commenting. The content of a comment is not endorsed by medRxiv.

Community Reviews

medRxiv aims to inform readers about online discussion of this preprint occurring elsewhere. The content at the links below is not endorsed by either medRxiv or the preprint's authors.

Community reviews for this article:

There are no community reviews for this paper.

Automated Evaluations

Certain services provide automated analysis of preprints. Analyses invited by the authors are displayed at the top of this tab. Those done independently of authors are shown underneath . None of these analyses is endorsed by medRxiv.

Automated Evaluations:

There are no automated evaluations for this paper.

[1] Bendavid, Eran, et al. COVID-19 Antibody Seroprevalence in Santa Clara County, California. preprint, Epidemiology, 17 Apr. 2020. DOI.org (Crossref), doi:10.1101/2020.04.14.20062463.
OpenUrl Abstract/FREE Full Text Google Scholar

[2] Bottcher, Lucas, et al. Why COVID-19 Case Fatality Rates May Be Misleading: Individual- and Population-Based Mortality Estimates and Factors Influencing Them. 2020.
Google Scholar

[3] Calafiore, Giuseppe C., et al. “A Modified SIR Model for the COVID-19 Contagion in Italy.” ArXiv:2003.14391 [Physics], Mar. 2020. arXiv.org, http://arxiv.org/abs/2003.14391.
Google Scholar

[4] Cao, Zhidong, et al. “Estimating the Effective Reproduction Number of the 2019-NCoV in China.” MedRxiv, Jan. 2020, p. 2020.01.27.20018952. www.medrxiv.org, doi:10.1101/2020.01.27.20018952.
Google Scholar

[5] Carpenter, Bob, et al. “Stan: A Probabilistic Programming Language.” Journal of Statistical Software, vol. 76, no. 1, 2017.
Google Scholar

[6] Covid-19 Data - Tracking Covid-19 Excess Deaths across Countries | Graphic Detail | The Economist. https://www.economist.com/graphic-detail/2020/04/16/tracking-covid-19-excess-deaths-across-countries. Accessed 25 Apr. 2020.
Google Scholar

[7] “COVID-19 Map.” Johns Hopkins Coronavirus Resource Center, https://coronavirus.jhu.edu/map.html. Accessed 15 Apr. 2020.
Google Scholar

[8] Dong, Ensheng, et al. “An Interactive Web-Based Dashboard to Track COVID-19 in Real Time.” The Lancet Infectious Diseases, vol. 0, no. 0, Feb. 2020. www.thelancet.com, doi:10.1016/S1473-3099(20)30120-1.
OpenUrl CrossRef PubMed Google Scholar

[9] Ferguson, Neil M., et al. Impact of Non-Pharmaceutical Interventions (NPIs) to Reduce COVID-19 Mortality and Healthcare Demand. 2020, p. 20.
Google Scholar

[10] Fine, Paul, et al. “‘Herd Immunity’: A Rough Guide.” Clinical Infectious Diseases, vol. 52, no. 7, Oxford Academic, Apr. 2011, pp. 911–16. academic.oup.com, doi:10.1093/cid/cir007.
OpenUrl CrossRef PubMed Web of Science Google Scholar

[11] ↵
Flaxman, S., et al. “Report 13: Estimating the Number of Infections and the Impact of Non-Pharmaceutical Interventions on COVID-19 in 11 European Countries.” 35, Report, 30 Mar. 2020. spiral.imperial.ac.uk:8443, doi:https://doi.org/10.25561/77731.
Google Scholar

[12] Hartikainen, Ari, et al. Stan-Dev/Pystan. 2013. Stan, 2020. GitHub, https://github.com/stan-dev/pystan.
Google Scholar

[13] JHU CSSE. CSSEGISandData/COVID-19. 2020. 2020. GitHub, https://github.com/CSSEGISandData/COVID-19.
Google Scholar

[14] Kermack, W. O., and A. G. McKendrick. “Contributions to the Mathematical Theory of Epidemics--I. 1927.” Bulletin of Mathematical Biology, vol. 53, no. 1–2, 1991, pp. 33–55. PubMed, doi: 10.1007/bf02464423.
OpenUrl CrossRef PubMed Web of Science Google Scholar

[15] Liu, Ying, et al. “The Reproductive Number of COVID-19 Is Higher Compared to SARS Coronavirus.” Journal of Travel Medicine, vol. 27, no. 2, Mar. 2020. academic.oup.com, doi:10.1093/jtm/taaa021.
OpenUrl CrossRef PubMed Google Scholar

[16] Lourenco, Jose, et al. “Fundamental Principles of Epidemic Spread Highlight the Immediate Need for Large-Scale Serological Surveys to Assess the Stage of the SARS-CoV-2 Epidemic.” MedRxiv, Cold Spring Harbor Laboratory Press, Mar. 2020, p. 2020.03.24.20042291. www.medrxiv.org, doi:10.1101/2020.03.24.20042291.
OpenUrl Abstract/FREE Full Text Google Scholar

[17] Mervosh, Sarah, et al. “See Which States and Cities Have Told Residents to Stay at Home.” The New York Times, 31 Mar. 2020. NYTimes.com, https://www.nytimes.com/interactive/2020/us/coronavirus-stay-at-home-order.html.
Google Scholar

[18] New York Times. Cuomo Says 21% of Those Tested in N. Y.C. Had Virus Antibodies - The New York Times. https://www.nytimes.com/2020/04/23/nyregion/coronavirus-new-york-update.html. Accessed 24 Apr. 2020.
Google Scholar

[19] Pedersen, Morten, and Matteo Meneghini. Quantifying Undetected COVID-19 Cases and Effects of Containment Measures in Italy. 2020. ResearchGate, doi:10.13140/RG.2.2.11753.85600.
OpenUrl CrossRef Google Scholar

[20] Sanche, Steven, et al. Early Release - High Contagiousness and Rapid Spread of Severe Acute Respiratory Syndrome Coronavirus 2 - Volume 26, Number 7—July 2020 - Emerging Infectious Diseases Journal - CDC. www.nc.cdc.gov, doi:10.3201/eid2607.200282. Accessed 24 Apr. 2020.
OpenUrl CrossRef PubMed Google Scholar

[21] Sutton, Desmond, et al. “Universal Screening for SARS-CoV-2 in Women Admitted for Delivery.” New England Journal of Medicine, vol. 0, no. 0, Massachusetts Medical Society, Apr. 2020, p. null. Taylor and Francis+NEJM, doi:10.1056/NEJMc2009316.
OpenUrl CrossRef PubMed Google Scholar

[22] Murray, Christopher JL, and IHME COVID-19 health service utilization forecasting team. “Forecasting COVID-19 Impact on Hospital Bed-Days, ICU-Days, Ventilator-Days and Deaths by US State in the next 4 Months.” MedRxiv, Cold Spring Harbor Laboratory Press, Mar. 2020, p. 2020.03.27.20043752. www.medrxiv.org, doi:10.1101/2020.03.27.20043752.
OpenUrl Abstract/FREE Full Text Google Scholar

[23] “The COVID Tracking Project.” The COVID Tracking Project, https://covidtracking.com/. Accessed 24 Apr. 2020.
Google Scholar

[24] Vehtari, Aki, et al. “Practical Bayesian Model Evaluation Using Leave-One-out Cross-Validation and WAIC.” Statistics and Computing, vol. 27, no. 5, Sept. 2017, pp. 1413–32. Springer Link, doi: 10.1007/s11222-016-9696-4.
OpenUrl CrossRef Google Scholar

[25] Verity, Robert, et al. “Estimates of the Severity of Coronavirus Disease 2019: A Model-Based Analysis.” The Lancet Infectious Diseases, vol. 0, no. 0, Elsevier, Mar. 2020. www.thelancet.com, doi:10.1016/S1473-3099(20)30243-7.
OpenUrl CrossRef PubMed Google Scholar

[26] Wo, Kermack, and McKendrick Ag. “Contributions to the Mathematical Theory of Epidemics--II. The Problem of Endemicity.1932.” Bulletin of Mathematical Biology, 1991, doi:10.1007/bf02464424.
OpenUrl CrossRef Google Scholar

[27] Wo, Kermack, and McKendrick Ag. “Contributions to the Mathematical Theory of Epidemics--III. Further Studies of the Problem of Endemicity. 1933.” Bulletin of Mathematical Biology, 1991, doi:10.1007/bf02464425.
OpenUrl CrossRef Google Scholar