Abstract
During an epidemic, the daily number of reported (infected/death) cases is often lower than the actual number of cases due to underreporting. Nowcasting aims to estimate the cases that have not yet been reported and combine it with the already reported cases to obtain an estimate of the daily cases. In this paper, we present a fast and flexible Bayesian approach to nowcasting combining P-splines and Laplace approximations. The main benefit of Laplacian-P-splines (LPS) is the flexibility and faster computation time compared to Markov chain Monte Carlo (MCMC) algorithms that are often used for Bayesian inference. In addition, it is natural to quantify the prediction uncertainty with LPS in the Bayesian framework, and hence prediction intervals are easily obtained. Model performance is assessed through simulations, and the method is applied to the Belgian COVID-19 mortality cases for the year 2021. Simulation results show that our model has good predictive performance except when the nowcast date is near the peak date, where it has lower prediction interval coverage.
1 Introduction
Nowcasting is a term used for estimating the occurred-but-not-yet-reported-events (Donker et al., 2011; Van de Kassteele et al., 2019). In epidemiology, real-time updates of new symptomatic/infected individuals are helpful to assess the present situation and provide recommendations for rapid planning and for implementing essential measures to contain an epidemic out-break. The exact number of new daily cases is frequently subject to reporting delays, resulting in underreporting of the real number of infected individuals for that day. Failing to account for the reporting delays will lead to possibly biased predictions that might have an effect on policy making (Gutierrez et al., 2020). The main goal of nowcasting is to estimate the actual number of new cases by combining the (predicted) not-yet-reported-cases with the already reported cases.
Several early references that establish the statistical framework for this type of problem can be found in the paper of Lawless (1994) which demonstrates how nowcasting can be used not only in disease surveillance but also in other contexts such as warranty and insurance claims. More recently, Höhle and an der Heiden (2014) applied nowcasting to the outbreak of Shiga toxin-producing Escherichia coli in Germany and also to the SARS-CoV-2 outbreak (Glöckner et al., 2020; Günther et al., 2021). Their approach is formulated within a hierarchical Bayesian framework that consists of estimating the epidemic curve by using a quadratic spline based on a truncated power basis function and the time-varying reporting delay distribution is approximated by a discrete time survival model. Van de Kassteele et al. (2019) pointed out that a potential drawback of such a method is the long computation time required by the Markov chain Monte Carlo (MCMC) algorithm and therefore proposed an alternative fast and flexible modelling strategy based on bivariate P-splines (Penalized B-splines). P-splines (Eilers and Marx, 1996) provide a flexible smoothing tool used to describe trends in the data. It introduces a penalty parameter that controls the roughness of the fit and counterbalances the flexibility of a rich B-splines basis (Eilers and Marx, 1996; Eilers et al., 2015). Other attractive features of the P-splines smoother are the relatively simple structure of the penalty matrix that is effortlessly computed and a natural extension to the Bayesian framework (Lang and Brezger, 2004). Based on the approach of Van de Kassteele et al. (2019), the number of cases are structured in a two-dimensional table (with calendar time as the first dimension and delay time as the second dimension), yielding the data matrix used as an input in the model. The reporting intensity is assumed to be a smooth surface and is modelled using two-dimensional P-splines.
In this paper, we extend the method of Van de Kassteele et al. (2019) by proposing a new now-casting methodology based on Laplacian-P-splines (LPS) in a fully Bayesian framework. A key advantage of working with the Bayesian approach is the ease to obtain the predictive distribution and quantify the uncertainty associated with the predictions. In addition, the posterior distribution of the penalty parameter can be explored, and hence its uncertainty can be accounted for. The Laplace approximation uses a second-order Taylor expansion to approximate the posterior distribution of the regression parameters by a Gaussian density. It is a sampling-free method with the major advantage of faster computational time as opposed to MCMC approaches that are commonly used in Bayesian inference. Therefore, given the flexibility of previously mentioned P-splines smoothers and the computational benefit of Laplace approximations, it can be a helpful tool in the daily monitoring of new cases during an epidemic period. Laplacian-P-splines already proved to be useful in survival models (Gressani and Lambert, 2018; Gressani et al., 2022), generalized additive models (Gressani and Lambert, 2021) and also in epidemic models for estimating the effective reproduction number (Gressani et al., 2021). We build on the work of Gressani and Lambert (2021) to extend the Laplacian-P-splines methodology to now-casting, thereby providing a fast and flexible (fully) Bayesian alternative to Van de Kassteele et al. (2019). To evaluate the (predictive) performance of our method, a simulation study is implemented and several performance measures are reported such as the mean absolute percentage error (MAPE), symmetric mean absolute percentage error (SMAPE) and prediction interval coverage. Finally, we apply our method to the COVID-19 mortality data in Belgium for the year 2021. The data and R codes used to implement the method are available on GitHub through this link https://github.com/bryansumalinab/Laplacian-P-spline-nowcasting.git.
2 Methodology
2.1 Bayesian model formulation
In this section, we build upon the work of Van de Kassteele et al. (2019) to formulate a fully Bayesian model based on P-splines. Let yt,d denote the number of cases that occurred at time t = 1, 2, …, T (corresponding to the calendar day) and reported with a delay of d = 0, 1, 2…, D days. The information on cases can be summarized in matrix form:
with cases that have not yet been reported (at time T) highlighted in bold. The not-yet-reported cases correspond to (t, d) combinations satisfying t > T − d. The main objective is to predict the total number of cases,
, for t = T − (D − 1), …, T for which the nowcasted and already reported cases can be combined.
Let 𝒟 := y = (y1, y2 …, yn)′ denote the vector of the observed number of cases by stacking the columns of matrix Y for the reported cases, where each entry corresponds to each (t, d) combination of reported cases yt,d. The model assumes that the number of cases are Poisson distributed, i.e., yt,d ∼ Poisson(μt,d) with mean μt,d > 0. Following Van de Kassteele et al. (2019), the (log) mean number of cases is modeled with two dimensional B-splines:
where β0 is the intercept; bj(·) and bk(·) are univariate B-splines basis functions specified in the time and delay dimensions, respectively; and zl(t, d) represents additional covariates with regression coefficients βl. In matrix notation:
where the matrices B and Z correspond to the basis functions and covariates, respectively, and vectors θ and β are the associated parameters to be estimated (Details in Appendix A1).
In the philosophy of P-splines (Eilers and Marx, 1996), we use a rich (cubic) B-splines basis and counterbalance the associated flexibility by imposing a discrete roughness penalty on contiguous B-spline coefficients. For the two dimensional P-splines, the penalty can be obtained based on row-wise (direction of calendar time) and column-wise (direction of reporting delay) differences for matrix Θ = (θj,k) with j = 1, …, KT and k = 1, …, KD (Appendix A1) (see Durbán et al. (2002) and Fahrmeir et al. (2013) (pp. 507-508)). Let and
denote the mth order row-wise and column-wise difference matrix with dimensions (KT − m) × KT and (KD − m) × KD, respectively. In this paper, we use a second order (m = 2) difference penalty. For ease of notation, let
and
. The difference matrix for vector θ can be obtained by expanding the difference matrix into
and
, where ⊗ denotes the Kronecker product. Using this notation, the row-wise and column-wise difference penalty can be written, respectively, as:
Let λt > 0 and λd > 0 denote the row-wise and column-wise penalty parameter, respectively. The penalty for the two dimensional B-splines is then given by:
Let us define the penalty matrices
and
, where δ is a small number (say δ = 10−6), to ensure that the penalty matrices are full rank and thus invertible. Following Lang and Brezger (2004), the penalty can be translated in the Bayesian framework by specifying a Gaussian prior for each column and row vector of Θ. We write Θ in terms of its columns and rows:
The priors are then given by
and
for k = 1, …, KD and j = 1, …, KT, respectively. It follows that the joint prior for θ can be written as:
Gaussian priors are assumed for β and θ, namely
with Vβ = ζIp+1 (small ζ, e.g. ζ = 10−5) and (θ|λ) ∼ 𝒩dim(θ)(0, 𝒫−1(λ)), where λ = (λt, λd)′ is the penalty vector and
the global penalty matrix. Denote by X = (B, Z) the global design matrix, ξ = (β′, θ′)′ the latent parameter vector and
precision matrix for ξ. The full Bayesian model is then summarized as follows:
where 𝒢 (a, b) denotes a Gamma distribution with mean a/b and variance a/b2. This robust prior specification on the penalty parameters follows from Jullion and Lambert (2007). They have shown that when the hyperparameters aδ, bδ are chosen to be equal and small enough (say 10−4), then the resulting fit is robust to the value of ν (e.g. ν = 3 in this paper).
2.2 Laplace approximation to the conditional posterior of ξ
To obtain a Laplace approximation for the set of regression parameters ξ = (β′, θ′)′, we use the posterior of ξ conditional on the vector of penalty parameters λ. Then, the gradient and Hessian of the log-posterior are computed and used in a Newton-Raphson algorithm to obtain the Gaussian approximation to the conditional posterior distribution of ξ.
Note that for a Poisson distributed yi with mean μi, we have f (yi) ∝ exp (yiγi − s(γi)), where γi = log(μi) and s(γi) = exp(γi) = μi. Using Bayes’ rule, the posterior of ξ conditional on the penalty vector λ is:
where ℒ(ξ; 𝒟) denotes the likelihood function. Hence, the log-conditional posterior of ξ can be obtained as:
where ≐ denotes equality up to an additive constant.
It is known for Poisson generalized linear models that the gradient (score vector) and Hessian of the first term on the right hand side of (2) are given, respectively, by (see Agresti (2013), p.136):
where W = diag(w1, …, wn) is a diagonal matrix with wi = (V ar(yi)[g′(μi)]2)−1 = μi and g(·) is the log-link function such that g(μi) = log(μi). Hence, the gradient and Hessian for the log-conditional posterior of ξ are:
Using a Newton-Raphson algorithm, the Laplace approximation for the conditional posterior of ξ is, by abuse of notation,
(Gressani and Lambert, 2021), where
and
correspond to the mode and variance-covariance matrix, respectively, given by:
where
is the vector at convergence resulting from the sequence
, with
, the initial condition
, and W (ξ(0)) = diag(Xξ(0)).
2.3 Hyperparameter optimization
In this section, we derive the (approximate) posterior distribution of the hyperparameters belonging to the penalization part, i.e., λ and δ. We first derive the joint posterior of λ and δ and then proceed with an integration to isolate the marginal posterior distribution of the penalty parameter.
Let η = (λt, λd, δt, δd)′ denotes the vector of hyperparameters. Using Bayes’ theorem, the marginal posterior of η is:
Following Rue et al. (2009), the above posterior can be approximated by replacing p(ξ|η, 𝒟) with
obtained in Section 2.2 and by evaluating the latent vector at
. Note that
,Where
corresponds to the i th row of the design matrix X. Also, the determinant
in p(ξ|λ) is
. Hence, the approximated marginal posterior of η is:
To obtain the marginal posterior of λ, we need to integrate out the hyperparameters δt and δd from
as follows:
Hence,
The posterior mode (obtained via Newton-Raphson) is used as a point estimate for the penalty vector. To ensure numerical stability, we log-transformed the penalty vector v = (vt, vd)′ = (log(λt), log(λd))′. Using the method of transformations, we obtain the posterior of v given by:
The exponent of the last term above reduces to
because of multiplying
with the Jacobian of the transformation J = exp(vd) exp(vt). Moreover, the log-posterior of v is given by:
2.4 Nowcasting with prediction interval
To obtain the mean nowcast estimate with the prediction interval, note that or equivalently
, where
. Thus,
. The mean estimate for the not-yet-reported cases is calculated as
for all (t, d) combinations with t > T − d. Then, the estimate for the total number of cases for each t is obtained by summing the already reported cases and the mean estimate for the not-yet-reported cases, i.e.
.
The prediction interval for the nowcast values is obtained by sampling from the posterior predictive distribution of the log-mean number of cases by following the five steps below:
For each (t, d) combinations with t > T − d (corresponding to the not-yet-reported cases), generate 1000 random samples
from a Gaussian distribution with mean
and variance
Exponentiate the sampled values from the previous step to obtain the average reporting intensities
.
Compute the average prediction for the not-yet-reported cases for each t. That is, compute
for t > T − d
For each t, sample
containing 1000 values from Poisson
.
Finally, compute the quantiles of the sampled values
that correspond to the desired prediction interval.
2.5 Delay distribution
To obtain the smooth estimate of the delay distribution, only the first term on the right hand side of equation (1) is used (excluding the week effects), as explained in the paper of Van de Kassteele et al. (2019). Specifically, the procedure to compute the delay distribution is as follows:
Compute the contribution of the smoothing term in equation (1) to the reporting intensity for all (t, d) combinations: μsmooth = exp(Bθ).
Arrange μsmooth into a T × (D + 1) matrix with entries
.
For each t = 1, …, T, compute the reporting delay distribution given by:
.
3 Simulations
A simulation study is implemented in order to evaluate the predictive performance of the pro-posed method. The procedure to perform the simulations is as follows:
Consider a function f (t) that represents the mean epidemic curve of all cases such that μ(t) = exp(f (t)) for t = 1, …, T.
For each t, generate a random sample yt from a Poisson distribution with mean μ(t).
To account for possible delays d = 0, 1, 2, …, D, generate samples from a multinomial distribution with probabilities (p0, p1, p2, · · ·, pD such that
, i.e.,
This sample represents the reported number of cases for each (t, d) combination.
Repeat steps 1 to 3 for 1000 times to generate 1000 possible realizations and compute the desired accuracy measures.
We consider two functions f (t) inspired from the paper of Noufaily et al. (2016) given by:
for t = 1, …, 365. That is, we assume a one year (365 days) time window in the simulation. In terms of delay probabilities, we consider a maximum delay of D = 7 days with probabilities (p0, p1, p2, · · ·, p7) = (0.0, 0.1, 0.4, 0.2, 0.1, 0.1, 0.05, 0.05). This is illustrated in Figure 1. For example, at time T − 6, only the cases that have a delay of 7 days (with probability p7 = 0.05) are not yet reported, or equivalently, 95% of the cases have already been reported. For time T − 5, only 90% of the cases have been reported, as cases with delays of 6 and 7 days are yet to be reported. On the nowcast day (time T), no case is reported, that is, the delay is 100%.
Two scenarios are considered for the first function f1(t): (i) the first one is having a small number of cases with values of θ1 = 2 and θ2 = 1; (ii) the second scenario has relatively larger cases with θ1 = 3 and θ2 = 1. These functions are denoted by f11(t) and f12(t), respectively. The second function f2(t) has values θ1 = 0, θ2 = 0.4 and θ3 = 0.2. The first function has a symmetric curve with three peaks as shown in Figures 2a and 2b. On the other hand, the second function is not periodic as opposed to the first function (see Figure 2c). The plot for (one realization of) simulated cases based on these functions are shown in Figure 3 with (blue dashed) vertical lines corresponding to the different nowcast dates.
The mean absolute percentage error (MAPE) and symmetric mean absolute percentage error (SMAPE) are chosen to measure the predictive accuracy of our methodology. Moreover, for each simulation, we consider a 95% prediction interval and determine the (true) unreported cases that falls within the interval to obtain the prediction interval coverage. The formulas to compute these measures are given by:
where S = 1000 is the number of generated realizations, yst is the true value (generated from step 2 of the simulation in Section 3 at iteration s) and ŷst is the corresponding predicted value (mean nowcast estimate
in Section 2.4). The MAPE and SMAPE metrics are scale-independent and measure the error of prediction from the actual value in terms of percentage. One disadvantage of MAPE is that it could have undefined values when the actual values are zero, which is evident in the formula where the true value is in the denominator. In our setting, this happened especially with very small delay probabilities where there is a high probability of zero unreported cases, and so we exclude this case in computing the MAPE. Another way to address this issue is to use the SMAPE indicator that has values with lower and upper bounds of 0% and 200%, respectively.
In the simulation, every end of the month from March to November are set as nowcast dates. For each nowcast date, the prediction measures are computed for the dates having unreported cases, that is, for t = T − (D − 1), …, T. As we have a maximum delay of 7 days, there will be seven dates (including the nowcast date) that involves prediction of unreported cases. Tables 1-3 summarize the MAPE results for the different functions being considered. The error increases as it gets closer to the nowcast day, that is, from 6 days prior to the nowcast day to the nowcast day. This is expected, as there is less data available at each delay time as we approach the nowcast day. Indeed, on day T, about 95% of the cases that occurred on day T − 6 have been reported, while on day T no case is reported yet. The errors are relatively higher in March and April, where we have the least available data. For functions f12(t) and f2(t), the MAPE on the nowcast day ranges between 20% - 35% for May to November. This is reasonable considering that we do not have data available on the nowcast day because all cases have a delay of at least one day. Moreover, the MAPE results using LPS are similar to the ones obtained using Van de Kaasteele et al.’s method (values in parentheses in the Table). There is no clear pattern that indicates one approach has higher or lower error than the other.
Mean absolute percentage error (MAPE) for the number of cases at day T− t (t = 0, …, 6) using function f11(t), as well as the average over these days. Values in parentheses are results using Van de Kassteele et al.’s method.
Mean absolute percentage error (MAPE) for the number of cases at day T − t (t = 0, …, 6) using function f12(t), as well as the average over these days. Values in parentheses are results using Van de Kassteele et al.’s method.
Mean absolute percentage error (MAPE) for the number of cases at day T − t (t = 0, …, 6) using function f2(t), as well as the average over these days. Values in parentheses are results using Van de Kassteele et al.’s method.
Results for the prediction interval coverage can be found in Tables 4-6. For function f11(t), almost all of the values are higher than the 95% nominal prediction interval. For function f12(t), the prediction interval coverage is close to 95% for most cases, although the coverage of the nowcast day for June and November are somewhat too low (60.04% and 73.95%, respectively). This could be resulting from reaching the peak (curve change) and the fact that no data is available for the nowcast day, that is, there is no zero delay. Moreover, the function f2(t) has high coverage except on the nowcast day for October and November, having higher simulated cases. There is moderate to strong undercoverage in the prediction interval with Van de Kassteele et al.’s method for 4 to 6 days before the nowcast day, while LPS is more conservative and has coverages that are closer to the nominal value. On the other hand, from 2 days before the now-cast day to the nowcast day, Van de Kaasteele et al.’s method has higher coverage than LPS, reflecting much wider prediction intervals. The results for SMAPE of the three functions are also displayed in Appendix A2 with the percentage error of functions f12(t) and f2(t) ranging between 20% - 40% on the nowcast day. Similar to MAPE results, LPS and Van de Kaasteele et al.’s method have comparable SMAPE values with no systematic difference between the two approaches.
Prediction interval coverage for the number of cases at day T − t (t = 0, …, 6) using function f11(t), as well as the average over these days. Values in parentheses are results using Van de Kassteele et al.’s method.
Prediction interval coverage for the number of cases at day T − t (t = 0, …, 6) using function f12(t), as well as the average over these days. Values in parentheses are results using Van de Kassteele et al.’s method.
Prediction interval coverage for the number of cases at day T − t (t = 0, …, 6) using function f2(t), as well as the average over these days. Values in parentheses are results using Van de Kassteele et al.’s method.
4 Real Data application
We apply our method to COVID-19 mortality data in Belgium, for the year 2021. The raw data is available on the website of the Sciensano research institute (see sciensano.be/covid19data). The data contains the cumulative number of mortality cases, reported up to the day of the file. The file is updated every day, and in this way, the number of cases and reporting delays are obtained. The data is structured in matrix format with the date of death as rows and number of days of reporting delay as columns. Figure 4 shows the total number of cases with (blue dashed) vertical lines corresponding to different nowcast dates used for illustration. As no cases are reported immediately on the day of death in the data, all cases have a delay of at least one day. The proportion of cases reported for each delay is shown in Table 7. Majority of the cases (66%) were reported with a delay of 2 days. Followed by a delay of 1 day and 3 days in 14% and 12% of the cases, respectively. The rest of the delays account for 8% of the cases. The reporting delay is truncated to a maximum of seven days.
Proportion of cases reported for each delay for the mortality data.
In this section, we present the model results at different nowcast dates. In the analysis, we choose KT = 40 and KD = 10 as the number of B-splines for the time and delay dimension, respectively. Moreover, a cubic B-splines basis with a second order penalty is used. In addition, we also consider the day of the week effect as additional covariate in the model.
Figure 5 presents the nowcasting results using the mortality data at different nowcast dates (namely, at every end of the month). The blue color represents the reported cases for the past 14 days on the nowcast date, the gray color are cases that have not yet been reported, and the orange line corresponds to the nowcast prediction with the 95% prediction interval. It can be seen that the nowcast predictions are fairly close to the observed cases (gray). In addition, all of the observed cases fall within the prediction interval. The nowcasts at the day of nowcasting (i.e. last date presented in the plots) for March and April are slightly underestimated. Based on the simulation results, possible reasons for the underestimation are: (1) the number of cases in this period is higher as compared to other periods; (2) the amount of information used in this period is still limited (from January till March or April); and (3) at the nowcast date, no information is available as there is a 100% delay of cases.
The estimated delay density for different nowcast dates is shown in Figure 6. The steps to obtain the time-varying delay distribution is outlined in Section 2.5. Figure 6 shows that the delay distribution is fairly constant through time, with density being highest for a delay of 2 days. This confirms the observed reporting intensity in our data that most cases are reported with a two-day delay. Furthermore, the density is highest between April and June, which relates to a higher number of reported cases (see Figure 4).
Table 8 shows the computational time required for doing the nowcasting analysis using mortality data. We used the microbenchmark() function from the microbenchmark package in R. The analysis was implemented on a device with an Intel(R) Core(TM) i5-1135G7, CPU running at a base frequency of 2.40GHz, and having 4 cores. Twenty function calls were evaluated for each nowcast date. The average real elapsed time is around 12 to 25 seconds, except for the nowcast date on October, for which we used almost all of the 2021 data with an average of 50 seconds.
Computation time (in seconds) for 20 evaluations using mortality data with the LPS methodology.
5 Conclusion
This paper presents a novel Bayesian approach of the nowcasting model proposed by Van de Kassteele et al. (2019). One advantage of using the Bayesian approach is that the prediction intervals can be naturally obtained which is the main goal of nowcasting. Our proposed methodology is based on a combination of Laplace approximations and P-splines, making it both fast and flexible. We perform a simulation study to evaluate the (predictive) performance of our method. Simulation results show that nowcasts and prediction interval coverage are fairly close to the true value and nominal (95%) prediction interval, respectively, except when the nowcast date falls on the day of the peak where it has low coverage. This could be due to the higher values and a shift in the epidemic trend. Moreover, we also apply our method to the Belgium COVID-19 mortality data (2021). The nowcasts are close to the actual observed values, and all of the 95% prediction intervals for the different nowcast dates being considered contain the observed values.
This study has some limitations. First, it assumes that the counts are Poisson distributed. One possible consequence of this is that the variability in the data tends to be underestimated because of overdispersion, which can cause undercoverage in the prediction intervals. This can be addressed by using the negative Binomial distribution. Another issue is that the priors for the B-spline coefficients are based on the Kronecker product of the penalty matrices. Lang and Brezger (2004) pointed out that this could overfit the data, and they suggest using spatial smoothness priors. Moreover, as mentioned by Van de Kassteele et al. (2019) our model specification makes it difficult to model abrupt changes in the reporting process.
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
This project is funded by the European Union Research and Innovation Action under the H2020 work programme EpiPose (grant number 101003688).
6 Appendix
A1
In equation (1), we have a two-dimensional B-spline model:
where
B = BD ⊗ BT is the Kronecker product of B-splines matrices BT and BD with dimension T × KT and D × KD, given by:
As a result, B has dimension TD × KT KD.
Z is the design matrix for additional covariates with dimension TD × (p + 1) and corresponding parameters β. This is given by:
θ = vec(Θ) has dimension KT KD × 1 where vec(Θ) denotes the vectorization of a matrix Θ. It is obtain by vectorizing the matrix:
of B-splines coefficients θj,k by stacking the columns of Θ, that is,
A2. Result for symmetric mean absolute percentage error (SMAPE)
Symmetric mean absolute percentage error (SMAPE) for the number of cases at day T − t (t = 0, …, 6) using function f11(t), as well as the average over these days. Values in parentheses are results using Van de Kassteele et al.’s method.
Symmetric mean absolute percentage error (SMAPE) for the number of cases at day T − t (t = 0, …, 6) using function f12(t), as well as the average over these days. Values in parentheses are results using Van de Kassteele et al.’s method.
Symmetric mean absolute percentage error (SMAPE) for the number of cases at day T − t (t = 0, …, 6) using function f2(t), as well as the average over these days. Values in parentheses are results using Van de Kassteele et al.’s method.