Abstract
The sanitary crisis of the past two years has focused the public’s attention on quantitative indicators of the spread of the COVID-19 pandemic. The daily reproduction number Rt, defined by the average number of new infections caused by a single infected individual at time t, is one of the best metrics for estimating the epidemic trend. In this paper, we give a complete observation model for sampled epidemiological incidence signals obtained through periodic administrative measurements. The model is governed by the classic renewal equation using an empirical reproduction kernel, and subject to two perturbations: a time-varying gain with a weekly period and a white observation noise. We estimate this noise model and its parameters by extending a variational inversion of the model recovering its main driving variable Rt. Using Rt, a restored incidence curve, corrected of the weekly and festive day bias, can be deduced through the renewal equation. We verify experimentally on many countries that, once the weekly and festive days bias have been corrected, the difference between the incidence curve and its expected value is well approximated by an exponential distributed white noise multiplied by a power of the magnitude of the restored incidence curve.
Simple Summary In the past two years, the COVID-19 incidence curves and reproduction number Rt have been the main metrics used by policy makers and journalists to monitor the spread of this global pandemic. However, these metrics are not always reliable in the short term, because of a combination of delay in detection, administrative delays and random noise. In this article, we present a complete model of COVID-19 incidence, faithfully reconstructing the incidence curve and reproduction number from the renewal equation of the disease and precisely estimating the biases associated with periodic weekly bias, festive day bias and residual noise.
1. Introduction
The renewal equation, first formulated for birth-death processes in a 1907 note of Alfred Lotka [1], establishes a model for epidemic propagation based on the individual infectiousness. The infectiousness of individuals at time t is characterized by the reproduction number Rt, defined as the average number of cases generated by an infected person at time t, and by the generation time, [2,3], defined as the probability distribution of the time between infection of a primary case and infections in secondary cases. This probability distribution depends on the incubation time (a permanent biological factor) and on the detection time (which we assume stationary). For these reasons, the distribution of the generation time is supposed to be independent of t. In practice, the generation time is replaced by the observable serial interval Φs which represents the time distribution of the delay of the onset of symptoms between primary and secondary cases. In Fig. 1, we show the serial interval obtained in [4] u0sing 689 observed pairs of primary and secondary cases.
The case renewal equation [5,6] is a classic equation linking Rt, Φ and the incidence it of new daily cases, where tc is the current time. This equation does not account for several strong perturbations of it. Government statistics of the observed incidence curve are indeed affected by changes in testing and polling policies and by weekend reporting delays. These recording delays and subsequent rash corrections result in impulse noise, and in a strong weekly periodic bias observable on the observed incidence curve . In [3] this bias is corrected by a seven days sliding average and in [7] it is corrected by multiplying by a 7-day periodic factor qt. These bias correcting coefficients qt are learned by a variational method that we describe below. Our first purpose in this note is to resolve the festive day problem. We denote by F the set of festive days t, at which the curve is strongly affected by the reduction in the number of registered cases. This reduction is compensated by an increase in the number of registered cases the following days. No model has been proposed so far to address this problem, which creates strong impulse noise in any estimation of it and Rt. We tackle this problem by a variational method computing Rt, where both it and Rt are considered unknown on festive days and in the next few days. To that purpose, we shall denote by F+ the union of festive days and the ones following them affected by the festive day (typically 2 or 3 days after the festive day).
Our second purpose is to give a noise model for the difference between the signal ît corrected of the week-end and festive effects, and its restored version using the renewal equation, defined by
We give strong experimental evidence that the relation between ît and , can be empirically modeled by where a > 0 and εt is a white noise.
This leads us to propose a signal processing version of the renewal equation model taking into account noise and bias and justifying a posteriori the variational method. The proposed observation model linking the observed signal to the ground truth incidence it is where qt is a quasi-periodic gain with period 7, εt is a white noise. The exponent a can be estimated for each country and varies between 0.6 and 0.9. The exceptional set F+ is introduced because festive days provoke perturbations of the observation model (4). Specifically, the 7 days period of qt is broken for these groups of days.
We shall verify experimentally on 38 countries (and detail the results on USA, France and Germany) that the normalized error εt is indeed a white noise with a distribution that is well described by an exponential distribution. This a posteriori noise model contradicts the classic a priori stochastic formulation of the renewal equation where the first member it of equation (1) is assumed to be a Poisson variable, and the second member of this equation is interpreted as the expectation of this Poisson variable. Using this Poisson model leads to maximum likelihood estimation strategies to compute Rt [3,8–10]. As we shall see, the Poisson model is not verified. Indeed, as we mentioned, the empirically observed standard deviation of the noise follows a power law with exponent a significantly larger than 0.5, which is incompatible with the Poisson model.
The proposed observation model (4) of the pandemic’s incidence curve gives a simple framework enabling:
a computation of the reproduction number Rt;
a correction of the weekend and festive days bias on it;
a verification that the difference between the observed incidence curve after bias correction and its expected value using the renewal equation is a white noise, the parameters of which can be estimated.
Plan
In section 2 we describe an anterior variational method [7] and point out its main three limitations: its weekly bias correction is strongly periodic, which does not work on long periods; the festive days cause strong perturbations in the inversion, finally no residual noise model is proposed. We therefore modify its variational formulation. In section 3 we present the results of the statistical analysis of the residual noise on many countries. These examples lead to specify the noise model and to validate a posteriori the proposed inversion model. In section 4 we organize and present all previous Rt and it analysis methods, particularly those using an a priori noise model. Section 5 is a final discussion.
Timely estimates of restored versions of it and Rt are extremely useful to tame a pandemic. The proposed restoration and inversion algorithm can be run through an online demo [11] for every day in every country and U.S. state. The demo plots the objects of this paper, namely the incidence curve , its bias corrected version ît, its fully restored version , finally the main pandemic index, the time-dependent reproduction number Rt. Fig. 2 illustrates the application of the variational method of section 2 to USA on February 1st, 2022, as displayed by the online demo. Fig. 3 compares the results of this inversion method, applied with and without festive day bias correction, obtained for France on January 6, 2022.
2. The proposed variational model
The EpiInvert method proposed in [7] is a deconvolution + denoising procedure to solve the functional equation (1) using the Tikhonov-Arsenin [12], [13] variational approach. EpiInvert estimates both Rt and a restored it corrected for the weekend bias. To remove the weekend effect, it computes a 7-day periodic multiplicative factor q = (q0, q1, q2, q3, q4, q5, q6). From the observed incidence curve and the serial interval, Rt and q are jointly estimated by minimizing where t%7 denotes the remainder of the Euclidean division of t by 7 and median(t−τ,t](i0) is the median of in the interval (t − τ, t] used to normalize the energy with respect to the size of it. The total number of cases is preserved by adding to (5) the constraint on qt : where T is a period of analysis empirically fixed to T = 56 days. The minimization of the above energy yields estimates of Rt, q and a restored incidence curve.
One limitation of using a 7-day periodic formulation to model the weekend effect is that it does not take into account the variation over time of the seasonal profile. To deal with this issue, we consider qt for t = 0,.., tc allowing different correction factors qt for every day but keeping the values qt − qt−7 small which forces qt to be quasi-periodic. A regularity assumption for the seasonality is commonly used in the study of time series as it is the case of the standard Holt-Winters’ seasonal method [14].
In addition to the weekend bias, festive days can introduce a strong bias in the incidence values. On a festive day t ∈ F, a sharp decrease in the number of registered incident cases is generally observed. This is compensated by increased incidence numbers in the next few days. Assuming that each festive day, t ∈ F, mainly affects the value of the incidence curve in the festive day and in the next Mt days (where Mt is an algorithm parameter (by default we fix Mt = min{2, tc − t})), we consider the values of as unknown. We denote by F+ the union of the festive days t ∈ F and the Mt days following them. We set for t ∉ F+ and consider the values as unknowns. Then the new proposed inversion functional is
The values for t ∈ F+ are set free in the minimization. Yet the third term in the functional ensures that the overall number of cases in the affected days remains unchanged. For each t ∈ F, λt ≥ 0 represents the weight we assign to this constraint for each festive day. We fix, experimentally, if tc > t and λt = 0 if tc = t. In other terms, the value of λt is adjusted according to the number of days that have passed since the festive day. To keep a smooth seasonality we add to the energy a regularization term where we penalize high values of qt − qt−7. The parameters wR and wq are regularization weights with default values wR = wq = 5. Their values are proven in [7] to be nearly optimal for Covid-19 incidence curves.
By minimizing this energy we obtain the reproduction number Rt, the seasonality qt and , which corresponds to the original incidence but with the optimized values in the festive days. The bias corrected incidence ît defined in model (3) is given by .
The estimated incidence curve must preserve the number of cases. In the original EpiInvert formulation this constraint is enforced by (6) on its analysis interval (tc − T, T]. In the new formulation, the interval time of analysis is the whole time interval [0, tc]. Extra conditions are required to keep close to ît and . Therefore, to preserve the number of cases we add to the energy (7) the constraints on qt:
The first constraint corresponds to a global preservation of the number of cases in the whole period and the second one corresponds to a local preservation of the number of cases every 2 weeks. In particular, the second constraint ensures a good agreement between the epidemiological indicator given by the accumulated number of cases in the last 14 days of the original incidence curve and the estimated ones using the proposed method. This indicator is currently widely used to evaluate the current epidemic transmission.
The minimization of the energy (7) is obtained by alternating steps computing in turn Rt, qt, and then (for t ∈ F+) until convergence. The above constraints are added to the minimization by the Lagrange multiplier technique.
3. Results
We used the incidence data published in [15] for France, [16] for Germany, [17] for Spain and [18] for the rest of countries. We checked the observation model and its inversion on the 626 daily incidence data from March 24, 2020 to December 9, 2021 for 38 countries and will detail the results for France, Germany, and the USA. In general, for the festive days we fixed Mt = 2, so the method estimated the incidence value of the festive day and of the next 2 days. However, not all festive days disturb the incidence in the same way. Parameter Mt allows us to adapt the number of days affected. To illustrate this option we set Mt = 5 for Thanksgiving in the USA in 2021 because this festive day causes in 2021 a longer perturbation in the number of registered cases. Figs. 5,A1,A2 show the minimization results for the energy (7). They display for each country (i) the original incidence curve , (ii) the incidence curve after bias correction ît, (iii) the restored incidence curve using the renewal equation (1), (iv) the weekly bias correction factors qt, (v) the reproduction number estimation Rt and (vi) the normalized error defined by
The power a was obtained through log-log linear regression. Indeed, if is proportional to , then , and a and b can be estimated by a linear regression between and . Its results are illustrated for 38 countries in fig. 4 and table A2. The Pearson correlation p-values in this table confirm the linear relation. The estimated exponent a varies between 0.7 and 0.9, and the constant coefficient b varies between −0.11 and −2.6. For the world we have a = 0.76 and b = −1.16.
We performed a control test on a Brownian motion simulated by starting from 10000 and sampling it+1 − it ≃ 𝒩(0, 100). The obtained exponent a is negative (a = −1.01) and we have b = 13.4. Both values are far away from the group of coefficients of real incidence curves. The p-value for the control is anyway non significant (0.0844), compared to the extremely small p-values for the real incidence curves. Fig. A3 shows the results of the variational inversion method on the Brownian control. For this control, both Rt and the weekly seasonality correction coefficients stay very close to 1 as should be expected, with means 1.001 and 1.00002, and standard deviations 1.7% and 0.3% respectively.
Next, we looked for a stochastic model of the normalized error εt defined by (9). Figs. 5-A2 visually support a stationarity assumption for εt in France, Germany and USA. In fig. 6 we show the autocorrelation function for these three countries. For most non-zero shifts, its value stays inside the 95% confidence interval for the stationarity assumption. (This interval is indicated by horizontal blue lines in the plot.) Similar results were obtained on 33 more countries, as illustrated in fig. A5. These results support a white noise assumption for εt.
We finally estimated the parameters of the distribution of εt assuming an exponential power distribution with density where μ is the location, α the scale and β the shape. These parameters to approximate εt by an exponential power distribution were estimated by the R-package normalp [19].
In Fig. 7, we plot for these three countries the histogram of the distribution of εt and its approximation by a normal (β = 2) and by the obtained optimal exponential distribution. We display the same result for 33 more countries in fig. A4.
Table A1 gives the results for all countries. Columns 5 to 8 in the table give the parameters of the optimal exponential law: location, scale, shape. In all cases the exponent remains close to 1. Fig. 8 displays a quantile-quantile plot comparing εt with the estimated exponential distribution for three countries: France, Germany, USA. The linear fit is excellent, and this goodness of fit is confirmed for 33 more countries in Figure A6.
4. Review of previous models
4.1. The Fraser renewal equation
In our proposed incidence model, we used the general integral equation (1), which is a functional equation in Rt. Integral equations have been previously used to estimate Rt: in [20], the authors estimate Rt as the direct deconvolution of a simplified integral equation where it is expressed in terms of Rt and it in the past, without using the serial interval. A simpler functional equation than (1) was proposed in Fraser [21] (equation (9)),
This equation is derived from the general case renewal equation (1) by assuming that Rt is constant in the serial interval support. It computes the “instantaneous reproduction number” and represents the number of secondary cases arising from an individual showing symptoms at a particular time, assuming that conditions remain identical after that time, in contrast with the case renewal equation (1). This last equation applied to the incidence curve is coherent if Φs denotes the serial interval between two cases, which can have negative dates, because an infectious may be detected after the infection cases she caused. Using (11) requires that Φs only has positive dates. This explains why [22] proposed to estimate the generation time, namely the (always positive) time between two infections, before using it in (11). The advantage of equation (11) is that Rt is estimated at time t from the past incidence values it−s by a simple division, provided that Φs = 0 for s < 0:
4.2. Deterministic implementations using Fraser’s renewal equation (11) and other models
Many papers estimating Rt use the deterministic causal renewal equation (11). This is the case of [23], [24] [25]. This last paper also involves the Wallinga-Teunis formulation [2], also based on the renewal equation but only allowing a backward estimate of Rt (see the discussion in [7]). Some papers like [26] propose a simplified version of (11). See also [27], who use this equation but estimate the probability distribution Φs by a maximum entropy method. A few papers use another deterministic model, the Wallinga-Teunis formulation, to compute Rt [28], or a SIR model, like in [29], where the time variable parameter β(t) of the three ODE’s of a SIR model is estimated from incidence data in a seven days sliding window.
4.3. Stochastic observation models for it and Rt
The renewal equation (11) is often endowed with an a priori stochastic Poisson model as
In this stochastic formulation, the first member it of Equation (11) is assumed to be a Poisson variable, and the second member of this equation is interpreted as the expectation of this Poisson variable. This leads to a maximum likelihood estimation strategy to compute Rt (see [3,8–10,30]). This form of the renewal equation is proposed and used in [3] and in the EpiEstim software. It is highly recommended in a recent review [31] signed by representatives from ten different epidemiological labs from several continents. Many papers dedicated to the computation of Rt use this model, for example [32], [33] and [34], who also assume that Rt is a Poisson variable, and [35] who also assume that Rt also is a random variable following a Gamma distribution. In [36], the authors use the stochastic form of the renewal equation (13) where they call Φs causal serial interval. Then Rt is estimated jointly on all regions of a country by a variational model containing a spatial total variation regularization to ensure that Rt is piecewise constant, and the L1 norm of its time Laplacian to ensure time regularity. The functional also penalizes outliers, typically Sundays and holidays by assuming a sparse structure of such events. See also [37] for an exposition of the application of this method.
In [38] the method Epifilter is introduced as an extension of EpiEstim and of the Wallinga-Teunis formulation. Epifilter has been applied in practical studies like [39]. The core of Epifilter is again the causal renewal equation in Poisson form (13). Yet, the author proposes a doubly stochastic model, as Rt is assumed to follow a recursive discrete Brownian motion of the sort where ϵ ≃ 𝒩(0, 1) and η is a user parameter, that we can interpret as a regularity control on Rt. Then (Rs)s≤t is computed from the incidence data (it)s≤t by recursive filtering. The method is complemented by Bayesian (backward) recursive smoothing that brings a better estimate on low incidence periods.
Similarly in [40], a parametric model with a stochastic multiplicative term is proposed for Rt where the stochastic term is a Gamma law with prescribed standard deviation. The parameters are estimated in several prefectures in interaction to give the best fit to incidence data linked to Rt through the causal renewal equation (11).
A few papers assume a negative binomial a priori for the incidence [41]. Nevertheless the equations given in the paper indicate the adoption of the renewal equation (11) and put the stochastic process on Rt by assuming Rt ≃ Rt−1GP where GP is a squared exponential kernel. The very same model is used in [42], and is based on the authors’ software EpiNow2. Similarly in [43], incidence it and reproduction number Rt are linked through the classic SIR model; a parametric piecewise linear model for Rt is estimated by fitting the parameters to real incidence data. Here the daily incidence data are modeled as a negative binomial, with mean given by the deterministic solution of the SIR equations and unknown dispersion.
In [44], a direct stochastic model is proposed for Rt, assuming that its log derivative is Brownian, namely where ν is the volatility of the Brownian process to be estimated. Then we have where C is a constant depending on steady transmission characteristics and s(t) is the proportion of the population that is susceptible. The case incidence is then estimated through an SEIR model. We refer to [45] for a still more complex stochastic model for Rt, depending on three stochastic parameters.
5. Discussion
In [7] we have proven extensively by simulations and experiments on live worldwide Covid-19 incidence data that using the simplified causal renewal equation (11) incurs in a five days delay in the estimation of Rt, compared to the Nishiura renewal equation (1). This is why we used here this second model.
All of the stochastic models mentioned in section 4.3 are formulated a priori. To the best of our knowledge, no there has been no a posteriori verification of their noise models on it or Rt. In contrast, we have proposed to learn the noise model from data and to verify a posteriori that the noise model is correct. Our experiments show that the weekly and festive administrative perturbations are more important than the noise. Hence they must be corrected first to enable a proper noise analysis.
These experiments seem to confirm the validity of the observation model (4). As we saw, this model can be inverted by minimizing the energy (7). This minimization yields three signals: a restored incidence on the festive days, the administrative bias correcting coefficients qt that are quasi-periodic with period 7, and the time varying reproduction number Rt, arguably the pandemic’s most useful control parameter. Last but not least, the renewal equation deduces a restored incidence by (2) from the bias compensated incidence ît. The modeling loop was closed by verifying that the normalized error defined by (9) is a white noise. We also found that this noise follows an exponential distribution. This analysis discards the Poisson model for the pandemic’s case count it. A pure case count should be a Poisson noise, but we saw that the main perturbation was an administrative bias which, once compensated, leaves behind a noise with standard deviation proportional to a power larger than 0.5 of the case count it. Under the Poisson model this standard deviation would have been equal to the square root of it.
In summary, based on the renewal equation inversion, this work contributes to a better understanding of the dynamic of the registered administrative observation of the incidence curve, its weekly seasonality, the influence of the festive days and the expected noise model in the observation of the incidence curve.
Data Availability
We use COVID-19 incidence data avaliable at "Our World in Data" webpage.
https://github.com/owid/covid-19-data/tree/master/public/data
Footnotes
jean-david.morel{at}epfl.ch
jean-michel.morel{at}ens-paris-saclay.fr
We change the title and we add a detailed review of related papers.