Abstract
The first cases of COVID-19 in France were detected on January 24, 2020. The number of screening tests carried out and the methodology used to target the patients tested do not allow for a direct computation of the actual number of cases and the mortality rate. In this note, we develop a ‘mechanistic-statistical’ approach coupling a SIR ODE model describing the unobserved epidemiological dynamics, a probabilistic model describing the data acquisition process and a statistical inference method. The objective of this model is not to make forecasts but to estimate the actual number of people infected with COVID-19 during the observation window in France and to deduce the mortality rate associated with the epidemic.
Main results The actual number of infected cases in France is probably much higher than the observations: we find here a factor ×15 (95%-CI: 4 − 33), which leads to a 5.2/1000 mortality rate (95%-CI: 1.5/1000 − 11.7/1000) at the end of the observation period. We find a R0 of 4.8, a high value which may be linked to the long viral shedding period of 20 days.
Introduction
The COVID-19 epidemic started in December 2019 in Hubei province, China. Since then, the disease has spread around the world reaching the pandemic stage, according to the WHO, on March 11. The first cases were detected in South Korea on January 19, 2020, and in France on January 24. The number of screening tests carried out varies widely from country to country (36, 747 in France vs 268, 212 in South Korea up to March 15, 2020; Sources: Santé Publique France and Korean Center for Disease Control) and does not make possible to know with certainty the actual number of infected people in the population. Thus, even if the number of deaths linked to COVID-19 is known with more certainty, ignoring the total number of patients does not allow a direct calculation of the mortality rate. Using the data available in France and South Korea, our objectives are:
- to estimate the number of people infected with COVID-19 in France;
- to deduce from this number the associated mortality rate;
- to calculate the parameters of a SIR-type model representing the early phase of the outbreak in France;
- to compare the results obtained for France and South Korea.
In this aim, we carried out an analysis grounded on a mechanistic-statistical formalism. This formalism allows the analyst to couple a mechanistic model, here an ordinary differential equation model (ODE) of the SIR type, and uncertain, non-exhaustive and not necessarily commensurable data with the solutions of the ODE. This formalism, which we have popularized by applying it to biological invasions (Roques et al., 2011; Roques and Bonnefon, 2016; Abboud et al., 2019), combines (1) the mechanistic model, (2) a probabilistic model describing the data collection process conditional on the solution of the mechanistic model and (3) a statistical method for estimating the parameters of the mechanistic model.
Data
The COVID-19 screening data in France and South Korea used for preparing this note are those available from January 22, 2020 to March 17, 2020. These data describe the number of positive cases and deaths, day by day (source: Johns Hopkins University Center for Systems Science and Engineering, https://github.com/CSSEGISandData/COVID-19). The number of tests carried out is only available from February 22 (Sources: Santé publique France and Korean Center for Disease Control). As some data (positive cases, deaths) are not fully reliable (example: 0 new cases detected in France on March 12, 2020), we smoothed the data with a moving average over 5 days.
Mechanistic model
SIR models are the most standard ODE-based epidemiological models. These models are compartmental: they divide the population into subclasses: the susceptibles (S), the infected (I) and the recovered (R, immune individuals in our case). The simplest SIR compartmental model does not take into account the demography of the S compartment: with N = S + I + R the total population, which is constant. The impact of the compartment D (number of dead people) on the dynamics of the SIR system is therefore neglected here. The compartment D satisfies:
We use this equation later to compute the death rate due to the disease. The initial conditions are S(t0) = N − 1, I(t0) = 1 and R(t0) = 0, where N corresponds to the population size in France or South Korea (67·106 and 52·106 people). The SIR model is started at t = t0, which should approach the date of introduction of the virus in the considered country (this point is shortly discussed at the end of this note).
Note that I′(t) = β I (R0 S/N − 1), with R0 = α/β the basic reproduction rate (Murray, 2002). When R0 < 1, we observe that I′ < 0, and so the epidemic cannot spread in the population. When R0 > 1, the infected compartment I increases as long as R0 S > N = S + I + R.
The system (1) can be analytically solved, using a change of time variable which requires a numerical integration. Here, we rather solve the ODE system thanks to a standard numerical algorithm, using Matlab® ode23s solver.
Observation model
We denote by the number of cases tested positive on day t. We suppose that these increments follow independent binomial laws, conditionally on the number of tests, on I(t) and on S(t): where nt corresponds to the number of tests carried out on day t and pt the probability of being tested positive in this sample. The tested population consists of a fraction of the infecteds and a fraction of the susceptibles: nt = τ1(t) I(t) + τ2(t) S(t). Thus, with κ := τ2(t)/τ1(t), the relative probability of undergoing a screening test for an individual of type S vs an individual of type I (probability of being tested conditionally on being S / probability of being tested conditionally on being I). We assume that the ratio κ does not depend on t at the beginning of the epidemic (i.e., over the period that we use to estimate the parameters of the model). The daily number of deaths caused by the virus is assumed to be known exactly.
In order to compute the maximum likelihood estimator (the MLE, i.e., the parameters that maximize ℒ), we use the BFGS constrained minimization method, applied to − ln(ℒ), via the Matlab ® function fmincon. In order to find a global maximum of ℒ, we apply this method starting from random initial values for α, t0, κ drawn uniformly in the following intervals:
For each country, the minimization algorithm is applied to 2000 random initial values of the parameters.
The posterior distribution of the parameters (α, t0, κ) is computed with a Bayesian method, using uniform prior distributions in the intervals given by (4). This posterior distribution corresponds to the distribution of the parameters conditionally on the observations: where π(α, t0, κ) corresponds to the prior distribution of the parameters (therefore uniform) and C is a normalization constant independent of the parameters. The numerical computation of the posterior distribution (which is only carried out for French data) is performed with a Metropolis-Hastings (MCMC) algorithm, using 5 independent chains, each of which with 106 iterations, starting from random values close to the MLE.
Unless otherwise mentioned, the data used to compute the MLE and the posterior distribution are those corresponding to the period from February 29 to March 17.
Results
Model fit. We denote the maximum likelihood estimator (MLE), and I*(t), S*(t) the solutions of the system (1) associated with these values. In France, we get . The expectation of the observations associated with this MLE is (expectation of a binomial) with
Fig. 1 compares this expectation with observations. In France, we get a good match between nt and the data. In South Korea, on the other hand, the gap between the data and the model predictionis significant: the SIR model, which leads to an exponential trajectory of I at the beginning of the epidemic, cannot properly render the dynamics. Using data obtained at an earlier stage in South Korea, the model fit is better for the initial outbreak phase (Fig. 2). The corresponding MLE is: .
Parameter distribution
The joint posterior distributions of the three pairs of parameters (α, κ), (t0, α) and (t0, κ) for France are presented in Appendix A (Fig. 6). Note that the distributions are very different from the uniform prior distribution. However, the distributions of t0 and κ are quite dispersed. The joint distribution of (t0, κ), presented in Appendix A shows a correlation between t0 and κ. Based on this distribution, in order to reduce the uncertainty on κ, we assume that t0 is between January 13 and 30.
Fig. 3 depicts the marginal posterior distribution of the basic reproduction rate R0. The value of R0 corresponding to the MLE in France is . A similar computation in South Korea, based on the data used in Fig. 2 gives .
Actual number of infected cases
Using the posterior distribution of the model parameters, with the constraint that t0 is between January 13 and 30, we can compute the daily distribution of the actual number of infected peoples. This distribution is represented across time in Fig. 4. We deduce the following ratios between the actual number of infected people and the observations, (with the sum of the observed infected cases at time t). Thus, in France, the estimated ratio between the actual number of infected and observed cases is 15 (95%-CI: 4-33).
Actual mortality rate
The mortality rate corresponds to the fraction of the infected who die, that is γ(t)/(γ(t) + β). The term γ(t) is computed using the formula (2) and the mortality data. We thus obtain, on March 17, a mortality rate in France of 5.2/1000 (95%-CI: 1.5/1000 − 11.7/1000). The temporal dynamics of the mortality rate are depicted in Fig. 5.
Discussion
On the number of infecteds and the mortality rate
The actual number of infected individuals in France is probably much higher than the observations (we find here a factor × 15), which leads at a lower mortality rate than that calculated on the basis of the observed cases. However, if the virus led to contaminate 80% of the French population (Ferguson et al., 2020), the total number of deaths to deplore in the absence of variation in the mortality rate (increase induced for example by the saturation of hospital structures, or decrease linked to better patient care) would be 277, 000 (95%-CI : 81, 000 − 629, 000)). This estimate could be corroborated or invalidated when 80% of the population will be infected, eventually over several years, assuming that an infected individual is definitively immunized. It has to be noted that measures of confinement or social distancing can decrease both the percentage of infected individuals in the population and the degree of saturation of hospital structures.
On the differences between France and South Korea
The mechanistic-statistical SIR model achieves a satisfactory goodness-of-fit for French data, but does not capture the decline in the number of cases observed in South Korea. Grounding the analysis of the South Korean situation on the first phase of the epidemic improves the goodness-of-fit and yields an estimated R0 twice as low as the French R0, resulting on a slower epidemic dynamic. The difference between the dynamics predicted by the SIR model and the South Korean data is probably linked to a different management of the epidemic in Korea, having a strong impact on the epidemic dynamics (more important screening, tracing, social distancing in South Korea).
On the value of R0
The value of R0 obtained in South Korea is consistent with recent estimates for COVID-19 (2.0,2.6); see Ferguson et al. (2020). The estimated distribution in France is therefore surprisingly high. This difference could be due to a different definition of R0 depending on the type of model used to calculate it. A direct estimate, by a non-mechanistic method, of the parameters (ρ, t0) of a model of the form gives t0 = 30 (January 30) and ρ = 0.19. With the SIR model, I′(t) ≈ I (α − β) for small times (S ≈ N), which leads to a growth rate equal to ρ ≈ α − β, and a value of α ≈ 0.24, that is to say R0 = 4.8, which is consistent with the distribution presented in Fig. 3. Note that β = 1/20 corresponds to the median period of viral shedding of 20 days described by Zhou et al. (2020). A shorter period would lead to a lower value of R0.
On the uncertainty linked to the data
The uncertainty on the actual number of infected and therefore the mortality rate are very high. We must therefore interpret with caution the inferences that can be made based on the data we currently have in France. In addition, we do not draw forecasts here: the future dynamics will be strongly influenced by the containment measures that will be taken and should be modeled accordingly.
On the hypotheses underlying the model
The data used here contain a limited amount of information, especially since the observation period considered is short and corresponds to the initial phase of the epidemic dynamics, which can be strongly influenced by discrete events. This limit led us to use a particularly parsimonious model in order to avoid problems of identifiability for the parameters. The assumptions underlying the model are therefore relatively simple and the results must be interpreted with regard to these assumptions. For instance, the date of the introduction t0 must be seen as an efficient date of introduction for a dynamics where a single introduction would be decisive for the outbreak and the other (anterior and posterior) introductions would have an insignificant effect on the dynamics.
Data Availability
All of the data in this manuscript are available from public sources: Johns Hopkins University Center for Systems Science and Engineering, Santé publique France, and Korean Center for Disease Control
Appendix A joint posterior distributions
The joint posterior distributions of the three pairs of parameters (α, κ), (t0, α) and (t0, κ) are depicted in Fig. 6.
Appendix B maximum likelihood estimators
The computation of the MLEs were based on a BFGS minimization algorithm, applied to ln − (ℒ), using Matlab ® fmincon function, starting from 2000 randomly drawn initial guesses for the parameters. This led to 2000 values of . We only retained the value leading to the highest likelihood. The other values are depicted in Fig. 7.
Footnotes
Contact: lionel.roques{at}inrae.fr