Abstract
We construct a recursive Bayesian smoother, termed EpiFilter, for estimating the effective reproduction number, R, from the incidence of an infectious disease in real time and retrospectively. Our approach borrows from Kalman filtering theory, is quick and easy to compute, generalisable, deterministic and unlike many current methods, requires no change-point or window size assumptions. We model R as a flexible, hidden Markov state process and exactly solve forward-backward algorithms, to derive R estimates that incorporate all available incidence information. This unifies and extends two popular methods, EpiEstim, which considers past incidence, and the Wallinga-Teunis method, which looks forward in time. This combination of maximising information and minimising assumptions, makes EpiFilter more statistically robust in periods of low incidence, where existing methods can struggle. As a result, we find EpiFilter to be particularly suited for assessing the risk of second waves of infection, in real time.
INTRODUCTION
During an unfolding epidemic, one of the most commonly available and useful types of surveillance data is the daily (or weekly) number of new infected cases. This time-series of case counts, also known as the incidence curve, not only measures the epidemic size and burden, but also provides information about trends or changes in its transmissibility [1], [2]. These trends are captured by the time-varying effective reproduction number, denoted Rs at time s, which defines how the number of secondary cases generated per primary case varies across the duration of the outbreak [3]. When Rs > 1 we can expect and prepare for growing incidence, whereas Rs < 1 signifies that the epidemic is being controlled [4].
Inferring changes in Rs given an observed incidence curve is therefore crucial, both to understanding transmissibility and to forecasting upcoming case loads, especially in an ongoing epidemic, where it can help inform policymaking and intervention choices [1], [5]. Real-time and retrospective Rs estimates have been used to characterise rates and patterns of spread in various diseases such as malaria [6] and Ebola virus disease [7] and have already proven valuable across the COVID-19 pandemic, providing updating synopses of global transmission [8] or evidencing the impact of implemented control actions (e.g. lockdowns and social distancing) [9], [10].
Many of the studies that aim to infer Rs either apply the Wallinga-Teunis (WT) method [2] or the Cori et al method, known as EpiEstim [11]. Both methods take complementary viewpoints on how incidence data inform on Rs and hence have diverse use-cases. The WT method reconstructs the average number of new cases caused by infectious individuals at s and so requires incidence data beyond time s for its estimate, which is called the case reproduction number. It is better suited for retrospective analyses [12]. Alternatively, EpiEstim infers how past infections propagate to form the incidence observed at s, only requiring data prior to time s. EpiEstim is preferred for real-time investigations and its Rs estimate is termed the instantaneous reproduction number [3].
While both methods provide useful and important estimators of transmission, they are not perfect. Two main limitations exist. First, each suffers from data censoring or edge-effects [3]. Because the WT method is forward-looking i.e. depends on data later than s, its estimates are right censored when s is close to the last observed time point [12]. In contrast, EpiEstim looks backward in time and suffers edge-effects when s is near the first observed time point [11]. Estimates in the vicinity of the start and end of the incidence time-series are therefore unreliable under EpiEstim and the WT method, respectively.
Techniques have been proposed to limit this unreliability [5], [14], but the problem is intrinsic, and inevitable near the actual start and end times of an epidemic, where there is necessarily no or scarce data. This leads into the second key limitation: inference in periods of small incidence. This presents a fundamental challenge for any Rs estimation approach and effectively creates additional edge-effects. When few or no case counts are available to constrain inference, methods are largely driven by their inherent prior distributions and assumptions, which can result in misleading and unreliable estimates [11], [15]. Understanding how to best mediate the trade-off between prior assumptions and data when incidence is small is of both statistical and epidemiological importance.
Following a period of low incidence, two outcomes are possible: either the epidemic continues to exhibit small or zero case counts until it goes extinct or a resurgence in infections, also termed a second wave, occurs. Deciphering, in real time, which of these conditions is likely is a key challenge for infectious disease epidemiology given the limited information available at low incidence [16]. Better inference of transmission under these conditions is currently considered central to designing data-informed COVID-19 intervention exit or relaxation strategies [17]. With many countries now facing potential second waves in this pandemic, estimating fluctuations in transmission during suspected epidemic troughs could be essential to achieving sustained control [10].
Here we present and develop a novel method, termed EpiFilter, for reliably estimating Rs in real time, which ameliorates the above limitations. We take an engineering inspired approach and construct an exact, recursive and deterministic (i.e. EpiFilter produces the same output for fixed input data and requires no Monte Carlo steps) inference algorithm that is quick and easy to compute both across an unfolding outbreak and in retrospect. Our method solves what is called the smoothing problem in control engineering [13]. This means we compute Rs estimates that formally integrate both forward and backward looking information. This unifies the WT method and EpiEstim, and nullifies their edge-effect issues.
Further, EpiFilter only makes a minimal Markov assumption for Rs, which allows it to avoid the strong prior window size and change-point assumptions that existing methods may apply to infer shifts in transmission [11], [9]. Using simulated and empirical data, we show that EpiFilter accurately tracks changes in Rs and provides reliable one-step-ahead incidence predictions. Moreover, we find that EpiFilter is appreciably more robust than even optimised versions of EpiEstim [14], if performing real-time inference in periods of low incidence and between epidemic waves. We illustrate the practical utility of EpiFilter on the COVID-19 incidence curve of New Zealand, which exhibits a potential second wave.
Our method, which is outlined in Fig. 1, provides a straightforward yet formally optimal (in mean squared error [18]) solution to real-time and retrospective Rs estimation. Because it couples minimal prior assumptions with maximum information extraction, it more gracefully handles periods with scarce data. Hopefully our approach will serve as a useful inference tool for investigating the risk of second waves of infection in COVID-19 and other epidemics. An implementation of EpiFilter is available at https://github.com/kpzoo/EpiFilter and its mechanics are explored and validated in the Appendix.
METHODS
Renewal models and inference problems
We consider an infectious disease epidemic observed over some period of time 1 ≤ s ≤ t. If the incidence or number of newly infected cases at time s is Is then is known as the incidence curve of the epidemic. For convenience, we assume that incidence is available on a daily scale so that is a vector of t daily counts (weeks or months could be used instead). A common problem in infectious disease is the inference of the transmissibility of the epidemic given this incidence curve. The renewal model [1], [19] presents a general and popular framework for investigating this problem.
This model posits that epidemic transmissibility, summarised by the effective reproduction number, Rs, generates the observed incidence as in Eq. (1). There Pois indicates Poisson observation noise, means that both distributions are equal and .
In Eq. (1) Rs describes the branching of the epidemic i.e. it captures the number of secondary cases generated per effective primary one at s, while Λs, which is termed the total infectiousness, defines how many effective past cases are still infectious at s. The generation time distribution of the epidemic, , controls how past incidence influences Λs, with wu as the probability that a primary case takes between u − 1 and u days to generate a secondary case [1].
We make the standard assumption that is known for the epidemic of interest and well approximated by its serial interval distribution and focus on inferring the complete set of Rs values i.e. given with t as the last recorded time of the incidence curve. Estimating is important, not only for learning whether the epidemic is still growing (whenever Rs > 1) or under control (i.e. if Rs < 1) [4], but also for assessing how much existing interventions can be relaxed without risking a second wave of infection [10]. This latter problem can be particularly difficult (as we will show) since, under these conditions, incidence data are scarce.
We define three key inference problems, commonly studied in engineering [18], based on how information in is recruited to construct estimates. We represent these problems in terms of the posterior distribution they induce over . Estimates are then functions of these posteriors. The first is called causal filtering, where we sequentially calculate [20]. Causal means that estimates of Rs only depend on data up to time s. Solving this problem is fundamental to real-time inference [21]. The second problem is non-causal filtering and is the complement of the first. Here we calculate , which is important for retrospective or backward-looking estimates [22].
The last problem is termed smoothing and is our main interest. To solve it we must extract all the information possible to formulate [13]. Note that ps, rs and qs depend on the choice of state space model, which describes the dynamics of Rs across time, and observation model, which explains how changes in Rs lead to trends in the observed incidence data [18]. These models encode our assumptions about the epidemic of interest and determine how estimates tradeoff assumptions against data. We will find this latter point important for determining performance when incidence data are scarce. We next explain how the most popular approaches for estimating , the WT method and EpiEstim fit within this engineering framework.
Inference methods and low incidence
We mostly detail EpiEstim as real-time Rs estimates are the key focus of this work. EpiEstim assumes that all reproduction numbers are the same over a window back in time, of size k. Consequently, only provides information about Rs and the causal filtering distribution follows as in Eq. (2) with a shape-scale gamma prior distribution applied i.e. Gam(a, c−1) [11].
Here and are sums over the window of interest.
The resulting posterior mean estimate is then and the Fisher information that contains about Rs is proportional to λτ(s) [23]. The observation model is Eq. (1) but the state space model of EpiEstim is not explicit. However, if Rτ(s) is the assumed average reproduction number in τ(s) (which is used to estimate Rs) then [14]. As a result, EpiEstim somewhat incorporates a linear moving average state space model, and assumes that ps ≈ pτ(s) by deeming to be uninformative. Since pτ(s) can be computed sequentially across an ongoing epidemic, EpiEstim provides real-time inference.
The WT method takes a complementary approach to EpiEstim and computes an estimate of Rs based on a forward-looking window [2]. It also uses the observation model of Eq. (1) and has an implicit moving average state model [3]. We abuse notation and define in this case as a window into the future of size k. The WT method assumes that and approximately solves the non-causal filtering problem. We illustrate the information windows employed by the WT method and EpiEstim and the complete causal and non-causal filtering windows in Fig. 1. The goodness of pτ(s) and rτ(s) as measures of ps and rs will depend on k and the appropriateness of the state model [14].
While EpiEstim and the WT methods are powerful tools for inferring Rs in real-time and in retrospect, they have two main and related limitations, which necessarily reduce the reliability of their outputs [12]. First, their performance degrades as s gets close to 1 for EpiEstim and t for the WT method [11]. These edge or censoring effects correspond, at the extreme, to p1 = pτ(1) = P(R1 | I1) and rt = rτ(t) = P(Rt | It), which are weakly informed posterior distributions. As a result, at the beginning of the incidence curve EpiEstim can be unreliable (and even unidentifiable). The WT method suffers similarly at the end of that curve [1] (see Fig. 1).
The second limitation occurs in phases of the epidemic where incidence is low and prolonged [23], [17]. In these periods information is scarce and the quality of estimates will depend on how well the method of choice mediates between the little available data and its inherent assumptions [15]. We illustrate this effect using EpiEstim by considering an epidemic with a sequence of n zero incidence days. This sequence is realistic for an epidemic in its tail phase, and can precede either elimination (i.e. the epidemic goes extinct) or resurgence (i.e. a second wave of infection emerges) [24]. As n increases both iτ(s) and λτ(s) shrink, though iτ(s) decays more rapidly. For every n ≥ k, iτ(s) = 0, meaning that the shape of pτ(s) is completely determined by the prior P(Rs).
As n increases further λτ(s) becomes negligible and we exactly recover our prior distribution (see Eq. (2)). This problem is acute if the window size, k, is small. Previous studies, which formally optimise k to maximise estimate reliability, found that small k is needed when inferring sharp changes in transmissibility (e.g. due to lockdowns) [14]. Moreover, as λτ(s) falls, the Fisher information available about Rs from shrinks, leading to an inflation in estimate uncertainty and a loss of statistical identifiability [25], [23]. An analogous effect occurs in the WT method if there is low incidence across its forward-looking window [12].
While some estimate degradation is guaranteed for any Rs inference method when faced with either edge-effects or low incidence, robustness can still be improved. Edge-effects can be largely overcome by constructing qs. Solving the smoothing problem melds the advantages of the opposite looking windows of EpiEstim and the WT method, removing the vulnerability near the ends of . This follows as q1 = r1 and qt = pt. Further, by maximising the information used for inferring every Rs and by minimising our state model assumptions, we can ameliorate the impact of low incidence. We develop a method, termed EpiFilter, to realise these improvements in the next section.
Bayesian (forward) recursive filtering
We reformulate the causal inference problem of estimating Rs from as an optimal Markov state filtering problem. Filtering is the term used in engineering for a general class of estimation problems aimed at optimally (usually in a mean squared error (mse) sense) inferring some hidden state causally and in real time from noisy observations [18], [13]. Given functions fs and gs, which describe the state (Rs in our case) space dynamics and the process of generating noisy observations (the Is here), the filtering problem tries to construct the posterior distribution ps [20], which EpiEstim approximates. The conditional mean estimate leads to the causal minimum mse of [20].
The famed Kalman filter [21] was the genesis of these methods. Here we focus on Bayesian recursive filters for models with noisy count observations. These generalise the Kalman filter [20] and have been successfully applied to similar problems in phylodynamics and computational biology [26], [27], [28], [29]. We reconsider our renewal model inference problem within this engineering state-observation framework, as described in Eq. (3) [13].
Here Rs is the hidden Markov state that we wish to infer. It dynamically depends on the previous state Rs−1 and a noise term ∊s−1, via fs. Observation Is is then elicited due to Rs and a noise term νs, according to gs [30].
We formulate our filter under two very mild assumptions. First, we define some closed space, , over which Rs is valid i.e. for a given resolution m, extrema Rmin and Rmax, and grid size then . This means Rs must take a discrete value in , the ith element of which is . We formalise this in Eq. (4a).
This is not restrictive since we can compute our filter for large m if needed and usually we are only interested in Rs on a coarse scale (e.g. policymakers may only want to know if Rs ≤ 1 or not). Note that other approaches, which depend on MCMC or related sampling methods, all implicitly assume some sort of discretisation [27].
Second, we propose a linear model for fs, as defined in Eq. (4b). There ∊s−1 is a standard white noise term i.e. with Norm signifying a normal distribution and η is a free parameter. We assume that a noisy linear projection of states over consecutive time-points provides a good approximation of the state trajectory. Not only is this assumption standard in engineering [20] and epidemiology [31] but it is also more flexible than the state model inherent to EpiEstim and the WT method. We scale the noise of this projection by a fraction, η < 1 of the magnitude of Rs−1. This allows variation but ensures Rs is a-priori non-negative.
Our observation model, gs, is implicit and leads to the probability law in Eq. (1). As a result, both our observations and state models are discrete and are summarised in Fig. 1. We now define the Bayesian recursive filtering procedure, which is a main contribution of this work, and can be solved exactly, in real-time and with minimal computational effort. We adapt general recursive filtering equations from [30], [13], [22], [18], which are valid for various types of observation and state models, for our renewal model inference problem. The proof of the equations we employ can be found in these works.
Recursive filtering involves two steps: prediction and correction. The first, given in Eq. (5a), constructs a sequential prior predictive distribution, , which is informed by past data and the last state Rs−1 The second step then corrects or updates this prior into the posterior distribution, ps, which constrains ps using the latest observation, Is, as in Eq. (5b).
Here is the state model from Eq. (4b), is the observation model from Eq. (1) and the constant of proportionality for Eq. (5b) is simply a normalising factor.
Solving Eq. (5) iteratively and simultaneously over the grid of leads to our novel real-time estimate of the time varying effective reproduction number. We initialise this process with a uniform prior over for p1 and note that ps and ps are m element vectors that sum to 1, with ith term corresponding to when . Eq. (5) forms the first half of EpiFilter, is flexible and can be adapted to many related problems [13]. A key difference between EpiFilter and the EpiEstim-type methods [11], [14] is that the latter approximate and with and P(Rs), respectively. Estimators based on these approximations can be suboptimal, especially when data (i.e. cases) are scarce.
Bayesian (backward) recursive smoothing
While Eq. (5) provides a complete real-time solution to the causal filtering problem, it is necessarily limited at the starting edge of the incidence curve, where past data are scarce or unavailable. Further, because it does not update past estimates as new data accumulate, it cannot provide optimal retrospective estimates. Here we develop the second half of EpiFilter, which involves solving the optimal smoothing problem and hence computing qs. To our knowledge, smoothing has not yet been explicitly considered in infectious disease epidemiology.
We specialise the general methodology from [13], [22] to obtain the recursive smoother of Eq. (6a).
We can realise Eq. (6a) exactly by taking a forward-backward algorithmic approach with ps and ps+1 first computed from Eq. (5). Remaining terms emerge from the state space model and by noting that qt = pt and iterating backwards. This approach neatly links ps and qs. If we assume that rs is reasonably approximated by then we can also apply the two-filter smoothing solution of [22] to get Eq. (6b). This shows how smoothing connects EpiEstim and the WT methods, and explains why EpiFilter can be used for both real-time and retrospective inference. We summarise the main ingredients of the EpiFilter algorithm in Fig. 1.
The smoothed posterior qs yields the conditional mean estimate , which is known to significantly improve on the mse of the filtered equivalent [20]. While filtering provides the minimum mse given causal knowledge, smoothing provides the minimum given all knowledge. This relationship is formal, with filtered and smoothed mse values mapping to the amount of mutual information that provides about [32]. Extracting the maximum information from should engender estimates that are more robust to periods of low incidence.
While our main interest is on optimised and rigorous real-time and retrospective estimates of transmissibility, which are completely defined by qs, it is also important to predict future incidence, for informing epidemic preparedness plans and for validating past Rs estimates [33], [14]. We compute the filtered posterior predictive distribution as in Eq. (7) (integrals are over [13].
We assume, as in [34]:. Replacing ps with qs yields the smoothed equivalent of Eq. (7). We will use Eq. (7) to compare EpiFilter against APEestim, which is the prediction-optimised version of EpiEstim [14]. Since predictions depend strongly on ps or qs, optimising these distributions can be crucial, for example, to forecasting second waves of infection.
RESULTS
Improved estimation at low incidence
The estimation of time-varying effective reproduction numbers when incidence is low is seen as a key challenge limiting our understanding of transmission [17]. Periods with small counts of new cases contain little information and so present necessary statistical difficulties [23]. Here we compare EpiFilter, which allows exact inference but conditioned on , with EpiEstim and APEestim, which minimises the prediction error of EpiEstim by optimising its window size, at these data-poor settings [11], [14]. We use EpiEstim with a long window, as this is known to improve robustness at low incidence [24]. The assumptions and choices inherent in estimation methods become important and visible when data are scarce and can bias inference or support spurious predictions [15].
We simulate epidemics using Eq. (1)) under the serial interval distribution of Ebola virus disease available from [35]. We examine several diverse incidence trajectories (e.g. outbreaks with small sizes, large peaks or multiple modes) that all eventually decline to periods of few or no cases. We apply APEestim with optimal window , EpiEstim with a long window k and EpiFilter with state noise η (see Eq. (4b)) to estimate Rs and causally predict Is for each epidemic. For the first two methods we compute mean predictions () as in [14] and estimates () from Eq. (2) with τ(s) delimiting the window times used. We obtain smoothed EpiFilter estimates () and causal predictions () from Eq. (6) and Eq. (7).
Our main results are in Fig. 2, which exposes how these three approaches degrade as data diminish. Each quadrant of this figure examines a different epidemic scenario. In all plots the true Rs and Is are dashed black and dotted grey respectively, estimates are in red and predictions are blue. The central lines are conditional means and the shaded regions are their 95% confidence intervals. As our main focus is on real-time performance we do not investigate the WT method. See [11], [12] for comparisons of the WT method and EpiEstim. Our reasons for comparing to EpiEstim with a long window and APEestim follow from the Methods.
There we showed that since EpiEstim-type methods group data over a window into the past, they revert to their prior distribution as the total cases in this window becomes small. During low incidence periods, e.g. when the epidemic is dying or in its early growth phase, using long windows makes sense [24]. However, this reduces predictive accuracy as fluctuations in Rs are underfit.
This is the use-case for APEestim, which optimises for prediction. The top and middle panels of each quadrant of Fig. 2 illustrate the clear consequences of these tradeoffs. APEestim often selects shorter windows to better fit Rs and sequentially predict Is. Each predicted Is is a one-step-ahead prediction depending on the past data up to s − 1 as in Eq. (7) (we have abused notation).
While shorter windows improve estimates and predictions the reversion to the prior distribution (seen as wide confidence intervals in Fig. 2) when Is is low is acute. Long windows resolve this latter difficulty, showing a slower rise to the prior but at the expense of getting other parts of the epidemic trajectories incorrect. Interestingly, in the bottom panels of the quadrants of Fig. 2 we see that EpiFilter has superior performance for these types of epidemics. Not only does it give good estimates and predictions throughout the trajectory of the epidemic but its rise in uncertainty as the epidemic dies is slower and more controlled. It therefore combines the advantages of APEestim and long-window EpiEstim.
Improved estimation between epidemic waves
Maintaining robust Rs estimation at low incidence is not just statistically important. Two possible outcomes may follow periods of small Is: either the epidemic goes extinct, or a second wave of infection (also known as resurgence) occurs e.g. due to imports or unmonitored local transmission. Predicting which outcome is likely, rapidly and in real-time, is of major global concern as countries aim to relax interventions during the ongoing COVID-19 pandemic, while also minimising the risks of second waves [17]. As changes in reproduction numbers predate and signal corresponding variations in incidence, reliably identifying and inferring Rs trends in the trough preceding potential new peaks is crucial for preparedness and both timely and effective epidemic control [10].
Reliable estimation of Rs between epidemic waves depends on the prior assumptions of the inference method used and on how that method relies on those assumptions when data are scarce [36], [37]. Here we examine this dependence and exactly investigate this scenario, where after a low incidence period resurgence occurs. As in the previous section, we test APEestim with optimal window (top panels), EpiEstim with a long window k (middle) and EpiFilter with state noise η (bottom panels) on four diverse epidemics, which in this case all feature multiple waves or early hints of upcoming resurgence. Simulation parameters are unchanged from the last section.
Our main results are in Fig. 3 with each quadrant specifying a different epidemic trajectory. The true Rs and Is are in dashed-black and dotted-grey respectively.
All estimates ( for EpiEstim and APEestim and for Epifilter) are in red with corresponding causal predictions ( and ) in blue. Plots provide conditional means with 95% confidence intervals. For all second and additional wave epidemic examples we observe a similar tradeoff in performance between APEestim and long-window EpiEstim as in our previous analysis of Fig. 2. APEestim is better able to predict upcoming Is terms and follow the overall trend in Rs than EpiEstim.
However, the long window of EpiEstim makes it more stable during the trough between waves, which can be advantageous for understanding transmission there.
EpiFilter once again combines the advantages of the other approaches. For every scenario in Fig. 3 it provides accurate tracking of changes in Rs with stable confidence intervals. Concurrently, its predictive performance rivals that of APEestim. EpiFilter is therefore a powerful tool when dealing with second-wave scenarios. We also find that the η = 0.1 parameter value seems to be an all-purpose heuristic, meaning that usage of EpiFilter can be somewhat simpler than APEestim and EpiEstim. The superior capability of EpiFilter in these scenarios likely results from its minimalistic assumptions (see Eq. (4)) and its increased information extraction. We next test our method on empirical incidence data.
COVID-19 in New Zealand and H1N1 in the USA
The previous sections confirmed EpiFilter as a powerful inference and prediction tool, especially in data-poor conditions, using simulated epidemics. We now confront our method with empirical data from the 1918 H1N1 influenza pandemic in Baltimore (USA) [38] and the ongoing COVID-19 pandemic in New Zealand (up to 17 Aug 2020) [39]. The H1N1 dataset has been well-studied and so we first use this to benchmark EpiFilter. We clean this data by applying a 5-day moving average filter as recommended in [38]. Previous work [11] analysed this data with EpiEstim and found that sensible Rs estimates result when a weekly window (k = 7) is applied.
A recent study, which re-examined this epidemic with APEestim [14], shows that, while EpiEstim with k = 7 provides stable estimates for this epidemic, it is a poor causal predictor of the incidence data. Instead, an optimised window of 2 days () yields good predictions but the resulting Rs estimates are noisy. We reproduce the estimates () and predictions () from both studies in Fig. 4 and compare them against EpiFilter with η = 0.1 ( and ). The cleaned H1N1 incidence is in dotted grey, the critical Rs = 1 line in dashed black and all Rs estimates and Is predictions (with 95% confidence intervals) are in red and blue respectively.
Top and middle rows of Fig. 4 show the mentioned trade-off between estimate stability and prediction accuracy. The bottom row confirms the power of EpiFilter. Our Rs estimates are of comparable stability to that of EpiEstim at k = 7, yet our prediction fidelity matches APEestim. Our improved inference again benefits from using more information (i.e. the backward pass in Fig. 1) and making less restrictive prior assumptions. We see the latter from the Rs confidence intervals over 40 ≤ s ≤ 60.
There EpiEstim seems overconfident, and this results in a rigid overestimation of incidence. However, EpiFilter mediates its confidence to a level similar to APEestim.
We explore COVID-19 transmission patterns in New Zealand using incidence data up to 17 Aug 2020 from [39]. New Zealand presents an insightful case study because officials combined swift lockdowns with intensive testing to achieve and sustain very low incidence levels that some believed could engender local elimination of COVID-19 [40]. However, an upsurge in cases in early Aug inspired concerns about a second wave (which led to new interventions and is why we do not consider data beyond 17 Aug). Here we investigate time-varying transmission in New Zealand to see if this uptick suggests that the epidemic was resurfacing in mid Aug. We believe smoothing can confer important inferential advantages in exactly these types of low incidence scenarios.
We make the common assumptions that case under-reporting is constant and that reporting delays are negligible [11]. While this makes our investigation somewhat naive, we consider this to be reasonable given the intense surveillance that New Zealand employed [41]. We also use the serial interval distribution from [42] and do not explicitly distinguish imported from local cases in our analysis. The latter could bias our study [24] but our focus is on demonstrating differences between filtering and smoothing on Rs trends instead of specifying the precise value of Rs during this period. We plot the results of our exploration in Fig. 5. More involved analyses can be performed by pre-processing the data for known delays or case ascertainment fractions as in [12], [6].
We apply EpiFilter and obtain causally filtered (, grey) and smoothed (, red) conditional mean estimates together with their 95% confidence intervals. These are in the top panel of Fig. 5 and computed from Eq. (5) and Eq. (6) respectively. The times of lockdown and release are the grey vertical dashed lines. Interestingly, we see a notable difference in the quality of inference between and . The former, as expected, is unreliable at the beginning of the incidence curve and features wider uncertainty and unclear trends. The smoothed estimate, by using both forward and backward-looking data largely overcomes these issues and clarifies transmission trends.
Our smoothed analysis suggests that Rs has resurged and supports the reimplementation of measures around 14 Aug. In the bottom panel of Fig. 5 we provide predictions (which are from the smoothed hence the notation ) and their 95% confidence intervals in blue against the reported incidence from [39] in dotted grey. These confirm that reasonably describes the data. In the inset we show that the P(Rs ≤ 1) also supports the resurgence hypothesis. Last, we note that transmission decreased considerably (and rise in P(Rs ≤ 1)) across the lockdown period.
DISCUSSION
Estimating time-varying trends in effective reproduction number Rs reliably and in real-time is an important and popular problem in infectious disease epidemiology
[5]. As the COVID-19 pandemic has continued to unfold, the interest in solving this problem has only elevated [8]. Initially, the focus was on characterising how Rs might respond to interventions such as lockdowns and social distancing [10]. However, as COVID-19 has progressed and countries have entered the controlled phase, concerns have shifted to trying to understand how existing interventions might be relaxed with minimum risk.
The literature on intervention exit strategies is, however, still in development, and several challenges remain to modelling the transmission behaviour of epidemics following prolonged periods of low incidence [17], [24]. While changes in Rs over this period are not the only analytic for assessing this risk, they do provide a key real-time diagnostic since upticks in Rs generally precede corresponding rises in incidence. Unfortunately, current approaches to estimating Rs under these conditions tend to be underpowered or prior-constrained [11], [23].
Here we re-examined existing methodology for inferring Rs from an engineering perspective. We observed that two of the most useful and popular inference approaches, EpiEstim [11] and the WT method [2], only capitalise on a portion of the data available, deeming either upcoming or past incidence informative (see Fig. 1). This informative portion is directly controlled by prior assumptions on the speed of possible Rs changes, which are often characterised by a window of size k. Other methods are also known to apply similarly strong change-point assumptions on Rs [23], [9]. When data are scarce these assumptions can control inference.
In control engineering a common problem, known as filtering, involves optimally (in a mse sense) estimating hidden Markov states, causally in real-time, from noisy and uncertain observations [18]. A related problem termed smoothing provides accompanying and optimal retrospective inference [13]. By reinterpreting Rs as a Markov state (Eq. (4)) observed through a noisy renewal process (Eq. (1)) and defining Rs on a predetermined grid R, we were able to construct exact filtering (Eq. (5)) and smoothing (Eq. (6)) solutions. This led to EpiFilter, which is our central contribution.
Generally, filtering and smoothing can be involved and require sophisticated sequential Monte Carlo techniques [30]. However, because our system is low dimensional and as we make only minimal assumptions about Rs, modelling it as a simple diffusion, we were able to solve these problems exactly and without complex sampling algorithms [27]. Our solutions are computationally simple and deterministic (i.e. precisely reproducible given the same data). While EpiFilter is limited by one free parameter η, our η = 0.1 heuristic seems statistically justified by its good causal one-step-ahead predictive performance (see Appendix and Fig. 2 and Fig. 3) [14].
Importantly, EpiFilter is able to look both forward and backward through the incidence data, and so maximise the information extracted [32]. This property means it combines the advantages of both EpiEstim and the WT method (see Fig. 1) and largely negates their edge-effect issues [12], offering both real-time and retrospective Rs estimations that are robust. We illustrated the advantages of EpiFilter by comparing it to EpiEstim and APEestim on many simulated examples with diverse periods of low incidence and epidemic resurgences (Fig. 2 and Fig. 3). Interestingly, we found EpiFilter able to track arbitrary changes in transmission without requiring any preset change-points or window sizes or any tuning of η.
Moreover, EpiFilter was notably better at negotiating periods of low incidence, offering a graceful degradation to its prior without sacrificing predictive accuracy. When incidence is low, it is usually beneficial to use long windows with EpiEstim [24]. This keeps Rs estimates reasonably stable. However, it often leads to poor predictions [14]. APEestim, which optimises window size for prediction fidelity, showed that in many of the simulated scenarios short windows are necessary for describing transmission patterns. Consequently, we have a tradeoff between estimate robustness and prediction accuracy.
We found that EpiFilter overcomes this tradeoff, concurrently achieving good estimates and predictions. We confirmed these advantages on empirical data from the H1N1 pandemic of 1918 (see Fig. 4). The ability to integrate the benefits of long window EpiEstim and APEestim make EpiFilter particularly useful for investigating resurgence after a period of low incidence, as it is better able to forewarn of increasing case numbers and can more speedily infer upticks in transmission (see Fig. 3). Capitalising on these properties, we performed a naive exploratory analysis of COVID-19 in New Zealand and found evidence of a potential second wave of infection after a prolonged period of control (see Fig. 5).
Balancing the assumptions inherent to a model against the data it is applied on, to produce reliable inference is a non-trivial problem that is still under active investigation in several fields [15], [36], [37]. EpiFilter, by maximising the information extracted from the incidence data and minimising its state space model assumptions, appears to strike this balance. Consequently, it performs strongly on a wide range of problems, including those involving sparse data, where other methods might struggle. Given its demonstrated advantages, straightforward formulation and theoretical underpinning, we hope that EpiFilter will be useful as a diagnostic tool for reliably warning about second waves of infection as COVID-19 unfolds.
Data Availability
Code and data are available in both MATLAB and R at https://github.com/kpzoo/EpiFilter
ACKNOWLEDGMENTS
This work is jointly funded under grant reference MR/R015600/1 by the UK Medical Research Council (MRC) and the UK Department for International Development (DFID) under the MRC/DFID Concordat agreement and is also part of the EDCTP2 programme supported by the European Union.
APPENDIX
Validating the recursive filter
Having constructed EpiFilter in Eq. (3) – Eq. (6) we now investigate its mechanics and performance. First we illustrate how the choice of the grid resolution m allows our approach to achieve estimates at differing levels of accuracy. Our recursive implementation is also known as a grid smoother, since its estimates are optimal (in mse) but conditioned on the grid [27]. Left panels of Fig. A1 show how the smoothed estimator (red together with 95% confidence bounds) becomes more accurate with increased resolution. The true Rs is in black. Generally, we find that no visible improvement occurs above m ∼ 103. Wider grid ranges Rmax − Rmin require larger m for the same resolution.
We also test the behaviour under different choices of the process or state noise parameter η. This parameter is vaguely similar to the window size parameter, k, in EpiEstim approaches, but is easier to select. The right panels of Fig. A1, show how a smaller η might lead to underfitting, while large η (where we would not consider η ≥ 1 given Eq. (4b)) could promote overfitting. How-ever, while the optimal choice of window in EpiEstim varies with the form of the true Rs (e.g. large k is needed for stable Rs time-series and small k for strongly fluctuating ones [14]), we find that a fixed η = 0.1 works well across many and diverse epidemic scenarios.
We illustrate this further in Fig. A2. There we compare Rs estimates from APEestim (left panels, denoted with optimal window size k) [14], [11] to EpiFilter (right panels, denoted ) at η = 0.1 and m = 2000 for several simulated epidemic scenarios. We generate epidemics according to the standard renewal model of Eq. (1) using the serial interval distribution for Ebola virus disease [35]. The true Rs is in black with respective estimates in red (together with 95% confidence intervals). To keep comparisons fair we set the usual gamma prior distribution over Rs in APEestim to Gam(1, 2) so that its domain is essentially bounded by Rmax.
EpiFilter has broadly comparable inference perfor-mance to APEestim, despite our using a single η value across the entire range of simulated scenarios. This is in stark contrast to the discordant optimal k values, which are shown in Fig. A2. Moreover, our implementation is fast (it executes in seconds) and unlike other window-less approaches does not require any choices of the timing or number of change-points i.e. at this single η it easily handles unchanging trajectories, seasonal variations and rapidly or gradually fluctuating transmission.
The bottom panels of Fig. A2 illustrate the central advantage of EpiFilter. There the associated incidence values (not shown) are small. While this causes APEestim inference to destabilise (see Methods for explanation), EpiFilter offers more robust and usable estimates. We explore this in detail in the main text. Note that in none of these scenarios does the true Rs have either the state model of Eq. (4b) or the sliding-window relationship in EpiEstim. Consequently, these simulations also indicate robustness of the different inherent state assumptions in both approaches to moderate model mismatch.