Abstract
Hawkes processes are used in machine learning for event clustering and causal inference, while they also can be viewed as stochastic versions of popular compartmental models used in epidemiology. Here we show how to develop accurate models of COVID-19 transmission using Hawkes processes with spatial-temporal covariates. We model the conditional intensity of new COVID-19 cases and deaths in the U.S. at the county level, estimating the dynamic reproduction number of the virus within an EM algorithm through a regression on Google mobility indices and demographic covariates in the maximization step. We validate the approach on short-term forecasting tasks, showing that the Hawkes process outperforms several benchmark models currently used to track the pandemic, including an ensemble approach and a SEIR-variant. We also investigate which covariates and mobility indices are most important for building forecasts of COVID-19 in the U.S.
1 Introduction
Mathematical modeling and forecasting is playing a pivotal role in the ongoing SARS-CoV-2 (COVID-19) pandemic. In mid-March 2020, a report out of Imperial College London [8] forecasted severe consequences in the US and UK without significant public health interventions. In both nations, governments responded by closing schools, non-essential businesses and releasing general stay-at-home (shelter-in-place) orders. In the U.S., state and local policymakers are using mathematical models and projections to inform decisions about when and how to relax public health measures that have been put in place. By and large, compartmental models that explicitly incorporate transmission characteristics of infectious diseases have been favored over machine learning approaches. High profile Susceptible-Exposed-Infected-Removed (SEIR) models include those out of Columbia University [22], MIT [13], and UCLA [32] (in the case of the UCLA model, a SEIR-variant with an unreported compartment is fit using least-squares to reported infection and recovery data). A major exception is the well-known IHME model [3], that employs Gaussian curve fitting to COVID-19 case and death count time series in locations further along (e.g. China, Europe), to estimate curves in locations where the outbreak is more recent (e.g. the United States). The IHME model has been called into question by epidemiologists because it lacks explicit transmission dynamics in the model [11].
Our goal in this paper is to show that Hawkes processes, widely used in the machine learning community to model contagion patterns in event data, are well suited for modeling and forecasting COVID-19 case and mortality data. They have several advantages that we will highlight, including being highly flexible in accommodating auxiliary spatio-temporal features such as county-level demographics and temporal mobility patterns, yet mathematically they are connected to compartmental models [24] and allow for explicit incorporation of transmission dynamics (which we briefly review in the following section). Furthermore, extensive research has been conducted in the past several years to couple machine learning techniques with the point process framework. Non-parametric Hawkes processes can be constructed where the triggering kernel is learned [31] and more recently fully neural network based point processes have been developed [16, 21]. Sparse linear combinations of Hawkes processes were a winning solution in the 2017 NIJ Crime Forecasting Challenge [19]. In certain cases a mixture of Hawkes processes may be needed to model more complex event contagion with high dimensional marks through dirichlet processes [30, 7]. Hawkes processes can also be used for causal inference on networks [29] and recent efforts have also focused on training point processes through reinforcement learning [27, 14]. We believe all of these methods have potential applications to modeling infectious diseases that are highly complex due to heterogeneity in the population, environment, and disparate public policies across regional and local jurisdictions. Despite these advantages, to our knowledge the only U.S. state where a Hawkes process is being used to inform COVID-19 policy is in New Jersey (a collaboration with Facebook AI Research) [1].
The outline of the paper is as follows. In Section 2 we introduce our Hawkes process model whose productivity (reproduction number) is dynamic and depends on spatio-temporal covariates. Unlike recently introduced models that incorporate covariates into the background rate of a Hawkes process [18, 23], our Hawkes process model may be viewed as a convolution of lagged mobility with an inter-infection time distribution to estimate the intensity of secondary infections in the future. This is important as phased reopening in the U.S. leads to mobility changes, the effects of which are not realized in case and mortality data until days or weeks later. Hence the model can be used to forecast changes in transmission and new cases in real-time as mobility changes (see Figure 1). We estimate the intensity along with the dynamic reproduction number of the virus within an EM algorithm through a regression on Google mobility indices and demographic covariates in the maximization step. In Section 3, we validate the approach on short-term forecasting tasks, showing that the Hawkes process outperforms several benchmark models including the Columbia University SEIR model [22] and an ensemble model from Berkeley that uses combined linear and exponential predictors with spatial covariates [2]. We also investigate which covariates and mobility indices are most important for building forecasts of COVID-19 in the U.S. In Section 4 we discuss directions for future research and how the machine learning community may be able to help improve Hawkes process models of COVID-19 as the pandemic continues to unfold.
2 Hawkes process model of COVID-19 transmission
In this section we introduce a Hawkes process with spatio-temporal covariates for modeling COVID-19 case and death data. We then discuss the connection of the model to compartment models used in epidemiology and develop an expectation-maximization algorithm for inference.
2.1 Incorporating covariates into the Hawkes process
We propose a novel Hawkes process model that simultaneously estimates the intensity of events and tracks the dynamic reproduction number of the virus. Given the timestamps (or dates), 𝒯 = {t 1, t2, tn}, of daily reported positive test cases or deaths, we model the rate of new cases (or deaths) in each country c as, where µc is the background rate modeling imported infections, w(t) is the inter-infection time distribution, are mobility indices on day t, and dc = [d1, d2,…] T are static demographic features. The time-varying reproduction number can be interpreted as the average number of secondary infections caused by a primary infection. Because we are modeling reported infections rather than time of exposure, we introduce the parameter Δ to capture a potential lag between a mobility change and the time tj of a reported primary infection.
We model the dynamic reproduction number through a Poisson regression, where we have combined the spatial and temporal covariates to simplify notation in the rest of the paper. Our approach is related to those in recent preprints that incorporate mobility into compartment models [17], however those approaches typically involve large-scale Monte Carlo simulations when performing inference. As we will show, the Hawkes process likelihood can be maximized without simulation via an efficient expectation-maximization algorithm.
2.2 Mathematical connection between Hawkes processes and compartmental models
Here we briefly review several variations of the Hawkes process in Equation 1 that can be connected to SEIR-type compartment models. The first variant is the SIR-Hawkes process, also referred to as HawkesN. This model captures the long-term evolution of a pandemic by incorporating a pre-factor that accounts for the dynamic decrease in the number of susceptible individuals [24]:
Here Ic(t) is the cumulative number of infections that have occurred up to time t and N is the total population size. The point process governed by Equation 3 is a continuous time analog of a discrete stochastic SIR model when w(t) is specified to be exponential [24]. When w(t) is chosen to be gamma distributed, the Hawkes process also can approximate staged compartment models, like SEIR, if the average waiting time in each compartment is equal [15]. More complex parametric (or non-parametric) inter-infection time distributions w(t) may be employed within the Hawkes process framework in situations where disease dynamics cannot be captured by a SIR or SEIR model. In the early exponential growth stage of an epidemic, before finite population effects play a role (which is the case with current U.S. data), the Hawkes process in Equation 1 without the prefactor can be used to model new infections arising from SIR and SEIR models, as Ic(t)/N will be small (see Section S1 in the supplemental material for more details).
2.3 EM algorithm for parameter inference
We use an expectation–maximization (EM) algorithm to estimate the model in Equation 1. First, we introduce latent random variables, pc(i, j), that represent the event that secondary infection i is caused by primary infection j in county c. We let pc(i, i) represent the event that case i is imported. The complete data log-likelihood is then given by,
Here we use a Weibull distribution with shape α and scale β to model inter-infection times, which is used in other studies of epidemics [20, 4, 10] and we find accurately captures transmission in the present data.
As the branching structure of the process is unobservable, we optimize the complete data log-likelihood in Equation 4 by iteratively alternating between an expectation step where the branching probabilities pc are estimated and a maximization step where model parameters are updated by maximizing Equation 4. The EM-algorithm is equivalent to a projected gradient ascent on the likelihood of the Hawkes process [12].
2.3.1 Expectation step
During the expectation step, we estimate the latent variables pc(i, j) for each county. Given the parameters θ, α, β, an µc estimated from the last iteration, the probabilities that case i was caused by case j or was imported are given by:
2.3.2 Maximization step
We then maximize the complete data log-likelihood with respect to the model parameters, conditioned on the estimated branching structure pc(i, j). During estimation we do not include event pairs (i, j) when j is within Ψ days of the last day of the dataset, as the offspring events i have not yet been realized and the inclusion of these incomplete data biases parameter estimates.
We approximate the integrals in Equation 4 as is done in [25] by noting that (given we are removing the last Ψ days from the estimation).
Maximization of Equation 4 then decouples into three independent optimization problems. The first is a Poisson regression of observations against the covariates :
The second optimization problem is weighted maximum likelihood estimation for the Weibull shape and scale parameters: where pc(i, j) is the weight of each inter-infection time observation ti − tj.
Third, the background rate µc is determined analytically:
Pseudo code for the EM algorithm is provided in Algorithm S1 in the supplementary material. We note that the EM algorithm of the Hawkes process is also connected to the dynamic reproduction number estimator of Wallinga and Teunis [28], as the latter can be viewed as a 1-iteration EM algorithm where a histogram estimator is used for with initial guess (see Section S2 in the supplemental material for a derivation).
2.4 Hawkes process forecasting
We forecast future events using the branching process representation of the Hawkes process. In particular, for each event in the history of the process we simulate a Poisson random variable with mean representing the number of secondary infections caused by event j. For each of these infections we simulate the time of infection by drawing inter-event times from the estimated Weibull distribution. Events falling in the future (past the forecasting date) are then used to update the forecasted intensity through Equation 1. We simulate multiple realizations of this process to estimate a mean intensity forecast along with confidence intervals.
3 Experiments and Results
In this section we first provide details on the datasets and baseline models used in our experiments. We then discuss the experimental results of several COVID-19 retrospective forecasting tasks at the U.S. county level. The source code and dataset are both available online*.
3.1 Datasets
3.1.1 Covid-19 daily cases and deaths reported by The New York Times
The New York Times (NYT) [26] † releases a daily report of the cumulative numbers of COVID-19 cases in the United States at the county level and over time. While NYT data closely tracks data aggregated by a project at Johns Hopkins University [5], NYT county level reporting started earlier and is therefore used in this study. In total, there are 2, 920 counties with cases and/or deaths in the dataset. The time series data are compiled from state and local government health departments. In order to have sufficient data for statistical inference, we select the counties with confirmed cases greater than and equal to 10 (denoted by 𝒟 conf) and the counties with at least 1 death (denoted by 𝒟 death) by 05/20/2020. In total, there are 2, 113 and 1, 236 counties in these two datasets.
Parameter sharing may improve models in counties with less data through variance reduction, but can potentially bias estimates in more populated counties with more cases. We therefore assess model performance over different subsets of counties grouped by case volume. We first rank counties by the number of confirmed cases and deaths by the cut-off date, 05/20/2020, and we then evaluate forecasting accuracy on the top-10% of counties (denoted by ), the top-25% counties , and counties between the top-25% and top-50% quantiles (denoted by ).
3.1.2 Google mobility index reports
We use Google daily mobility index reports at the county level [9] to estimate a dynamic reproduction number that tracks changes in movement patterns due to stay at home orders (and their staged removal). In total, there are 6 mobility types, including grocery & pharmacy, Parks, transit stations, retail & recreation, residential and workplaces. Mobility indices for each category and county are calculated with respect to a baseline value for that day of the week ‡. We drop “workplace” mobility from our analysis due to high collinearity with “residential” mobility. Some mobility data are missing when data is sparse for a given date. To deal with missing values, we adopt multivariate feature imputation §, which estimates each missing mobility entry as a function of other mobility types on the same day in the same county. We show heatmaps of mobility patterns across counties and time in the Figure S4 in the supplemental material, where a major change can be observed coinciding with stay at home orders (the first state-wide stay-at-home order was issued at 03/21/2020).
3.1.3 County-level demographic covariates
We incorporate spatial demographic features that may be predictive of symptomatic cases of COVID-19 (which are more likely to result in testing and mortality). The dataset is available in a curated form [2] and is derived from CDC and census datasets. The data is at the county level and includes population, median age, number of hospitals and ICU beds, percentage of smokers and diabetes, and heart disease mortality.
3.2 Baseline models and experimental setup for retrospective forecasting comparison
We compare the Hawkes process model in Equation 1 with several benchmarks including a SEIR model used in a pandemic tracking dashboard out of Columbia University [22] (denoted by PROJ) and an ensemble model with linear and exponential predictors from University of California, Berkeley [2] (denoted by CLEP). A simplified Hawkes process, denoted by Hawkes, where the reproduction number is held constant is used for comparison to demonstrate the effectiveness of tracking the reproduction number dynamically. We also compare our full Hawkes process model, denoted by , to a Hawkes process, HawkPRm, with only mobility features to determine the marginal improvement of adding demographics.
We backtest the five competing models on the 𝒟 conf and 𝒟 death datasets using the “walk-forward” validation approach. In particular, for 7-day forecasts we first train the models based on cases and deaths before the first cut-off date, 04/15/2020, and then forecast through 04/21/2020. We then slide the forecasting window, training on data before 4/22/2020 and forecasting from 04/22/2020 to 04/28/2020. We repeat this process until the final date of 05/19/2020 (a similar approach is used for 3 and 5 day forecasts). The multivariate imputation models are also trained in the same walk forward fashion to avoid possible data leakage. We evaluate the models according to mean absolute error, MAE, averaged across counties and forecasting windows of the same length, along with percentage error, PE. We also compare the ranking quality of the competing models using Normalized Discounted Cumulative Gain (NDCG).
3.3 Experimental results
In Table 1 and 2, we present the experimental results for all models applied to both confirmed cases (𝒟conf) and deaths (𝒟death) and in Figure 2 and 3, we show example forecasts, along with confidence intervals, for the Top-6 counties in and .¶.
In Table 1 we show the MAE of each method for 3, 5, and 7 day window forecasts and in Table 2 we show NDCG scores for the same experiments. In terms of MAE, both our proposed models, HawkPRm and , outperform the benchmarks, PROJ and CLEP, by a large margin in all three forecasting periods and across quantile subsets of the data. The proposed HawkPRm and models are also the best performing models in terms of PE (see Table S1 in the supplementary material).
We also find that adding mobility indices improves the baseline Hawkes process, Hawkes, where HawkPRm has improvements of up to 27% (5 day forecast) and 23% (7 day forcast) over Hawkes when applied to . By adding demographic features, we can marginally boost the MAE of over HawkPRm. Generally, the proposed models have a better NDCG performance when applied to confirmed cases for most of the quantile subsets. The baseline Hawkes process, Hawkes, performs the best in terms of NDCG on the 𝒟death dataset.
In Table 3, we show the dynamic reproduction number coefficients of estimated from the Poisson regression when applied to 𝒟death.‖ The absolute value of the coefficients indicates the magnitude of the correlation between the reproduction number and the features. With the exception of heart disease mortality and parks, the coefficients of all variables are statistically significant at the 10−4 level or below. The dynamic reproduction number is positively correlated with retail and recreation, meaning that as mobility shifted away from commercial areas towards residences the reproduction number decreased. The reproduction number is negatively correlated with percent of the population with diabetes. Several possible explanations for this observation include high-risk individuals are being more cautious or that they tend to live in areas with less cases, potentially with less population.
In Figure 4, we find that the estimated dynamic reproduction number closely tracks lagged mobility, where the optimal lag parameter is determined as Δ = 14 days. The top-2 counties have estimated reproduction number initially above 3, however after stay-at-home orders mobility decreased and the reproduction number fell to around 1 (which explains why curves are relatively “flat” in many areas in the U.S.). As we observe from the tail of the reproduction number curve, a “reoponing” after 04/20/2020 coincides with a slight uptick in the reproduction number.
4 Conclusion
We showed how Hawkes processes can be combined with spatio-temporal covariates to accurately model COVID-19 transmission and forecast future cases and deaths. The model is competitive with several benchmark models used to forecast the pandemic, achieving improved MAE and NDCG scores on a majority of the experiments we conducted. Our hope is that this work will encourage more research into Hawkes process models of disease spreading that incorporate more advanced features and machine learning principles.
One potential direction for future research is extending the work here to neural network based point process models [16, 21]. These models may be able to capture more complicated relationships between mobility patterns, demographics, and transmission. The challenges of such an approach include the potential for over-fitting with added parameters and determining how best to realistically model transmission in a neural point process (analagous to the SIR-Hawkes process), which will be important if neural point processes are to be used in long-term forecasting.
Many of the preprints and models currently being released on academic archives and websites present a single model without model evaluation, goodness of fit analysis, or comparison to baselines. Here we believe the “common task framework” [6] could be beneficial in model selection and validation. The machine learning community can contribute to pandemic modeling efforts by performing careful benchmarking of methodologies, creating standardized datasets and tasks, and comparing competing models that come from different fields such as epidemiology, statistics, and machine learning.
Broader Impact Statement
We propose a novel Hawkes process model that incorporates spatio-temporal mobility and demographic features to improve forecasts of COVID-19 transmission. These models can help guide health policy decisions during the pandemic to prevent spread in forecasted hotspots and also may serve as a bridge between the machine learning community, who have developed high-dimensional and non-linear versions of these models, and the epidemiological modeling community, who utilize compartment models to which Hawkes processes are mathematically connected. We caution that forecasting COVID-19, especially in the long-term, is challenging due to uncertainties in the data, human behavior, and future policy decisions.
Data Availability
The source code and pruned dataset is available online.
(Supplementary material)
S1 Linear Hawkes processes can approximate SEIR during early stages
While a pre-factor in the Hawkes process involving the cumulative number of infections, Ic(t), is necessary to model long-term disease dynamics [24], in the early stages of transmission a linear Hawkes process can be used (as the prefactor will be close to 1),
To illustrate this, we simulate a SEIR differential equation dS/dt = −βSI/N, dE/dt = βSI/N − µE, dI/dt = µE −γI, dR/dt = γI, β = γR0, where the parameters are chosen similar to those of COVID-19: γ = .1, R0 = 2, µ = 1, and N = 5 108. We then fit the linear Hawkes process model in Equation 2 to new infections, µE, generated by the SEIR model. We use a non-parametric histogram estimator for w(t) and find a close fit between the Hawkes process and the SEIR model in Figure S1.
S2 Connection of EM algorithm for Hawkes Process and dynamic R estimator of Wallinga and Teunis
Here we make the connection between the EM algorithm for the Hawkes process and the popular dynamic reproduction number estimator of Wallinga and Teunis [S1, 28, S2]. The dynamic R estimator of Wallinga and Teunis is constructed as follows. The probability that individual i at time ti was infected by individual j at time tj is defined to be, where the distribution of inter-infection times w(ti − tj) is typically modeled as Weibull, Gamma, or log-normal [S2].
The expected total number of individuals that j infects is then given by:
Wallinga and Teunis then obtain an estimate of the dynamic reproduction number R(t) by averaging Rj over all observed cases j where the time of infection tj occurred on day t: (here Nt is the number of observed infections on day t).
On the other hand, for the Hawkes process the intensity (rate) of infections is modeled as where w(t) and R(t) are the inter-infection time distribution and dynamic reproduction number respectively. Rather than modeling R(t) as dependent on mobility, we can instead model R(t) as a piecewise constant function:
Here the Ik are intervals discretizing time, B is the number of such intervals, and rk is the estimated reproduction rate in interval k.
Given initial guesses for the model parameters and rk, the EM algorithm for the Hawkes process iteratively updates the parameters and branching probabilities by alternating between the E-step update: and M-step update: where T is the total length of the observation period, Nk is the total number of events in interval k, and the w(t) is estimated via weighted MLE (for either a Gamma, Weibull or log-normal) using the inter-event times as observations and branching probabilities as weights.
Finally, we can compare Equation 8 to Equation 3. The dynamic R(t) estimator in Equation 3 is what you obtain with 1 step of the EM algorithm in Equation 8 with initial guess R(t) ≡ 1, µ = 0 and 1 day chosen as the bin width for the histogram estimator.
S3 Hawkes process model of COVID-19 transmission
S3.1 EM algorithm for parameter inference
In the Algorithm S1, we present the pseudo code for the model . The details of the derivation is elaborated in the Section 2.3 in the manuscript. After initialization, we iterate between the expectation and maximization steps to estimate the parameters until the algorithm converges (parameter changes between consecutive iterations are within a specified tolerance).
S4 Experiments
S4.1 Datasets
S4.1.1 Covid-19 report in the USA by The New York Times
As described in Section 3.1.1 in the manuscript, we first rank counties by the number of confirmed cases and deaths by the cut-off date, 05/20/2020 and evaluate models on the top-10% of counties (denoted by , the top-25% counties , and counties between the top-25% and top-50% quantiles (denoted by ). In Figure S2, we show boxplot of the distribution of the cumulative number of confirmed cases/deaths in , and by the cut-off date. Counties with cumulative number of cases more than 1.5 times the inter-quartile range are shown individually (gray circles). The groups and represent the counties hit hardest by COVID-19.
In Figure S3, we show how the cumulative and daily number of confirmed cases and deaths progress through time. As of 05/20/20, in the U.S. about 25% of counties have more than 1,000 confirmed cases and 10% of counties have a death toll higher than 1,000. We observe a significant increase of confirmed cases in both Figure S3a and S3c starting around 03/25/2020. We can also see a jump in Figure S3b and S3d in the reported deaths around 04/07/2020, when the the U.S. reported more than 2,000 deaths in a single day for the first time.
S4.1.2 Google mobility index reports
In Figure S4, we present heat maps of mobility indices over time for counties in 𝒟conf. There are 6 types of mobility indices provided by Google, including grocery and pharmacy, park, residential, retail and recreation, transit stations, and workplace. Starting around 03/19/2020, the mobility patterns change drastically in most counties corresponding to school and non-essential business closings, along with stay at home orders. In Figure S4, we observe a mobility shift at this time away from recreation, transit and workplace towards residential areas. We can also observe a spike in grocery and pharmacy mobility preceding the stay at home orders, potentially caused by stockpiling of food and other essential items. As states across the U.S. begin staged reopening after 04/16/2020, grocery, parks, retail and transit mobility indices have moved towards baseline levels (though are still lower than before). Workplace and residential patterns have not yet returned to baseline as of 05/20/20.
S4.1.3 County-level demographic covariates
In Figure S5, we present four examples of spatial demographic features at the county-level used to model variations in the reproduction number. In Figure S5a we can observe that the east part of USA is more densely populated compared to the west (population and population density are positively correlated with the reproduction number). Smoker and diabetes percentage, on the other hand, are negatively correlated in our regression with the reproduction number of COVID-19.
S4.2 Experimental results
In Table S1, we present the percentage error for all models applied to both confirmed cases (𝒟conf) and deaths (𝒟death) and in Figure S6 and S7, we show example forecasts, along with confidence intervals, for the Top-6 counties in and for confirmed cases. The proposed HawkPRm and models outperform the benchmarks, PROJ and CLEP, across all three forecasting periods and across quantile subsets of the data. Compared to the baseline Hawkes where the reproduction number is held constant, our model can also lower the percentage error through the introduction of the dynamic reproduction number estimation.
In Table S2, we show the dynamic reproduction number coefficients of estimated from the Poisson regression when applied to 𝒟conf. With the exception of “Diabetes percentage,” the coefficients of all variables have significance below the level 10−4. Again, the dynamic reproduction number is most positively correlated with “retail and recreation.” Whereas the coefficient of heart disease mortality was not significant in the regression using death data, the coefficient is statistically significant when using cases. We note that heart disease mortality and diabetes percentage are positively correlated with each other, potentially explaining why they switched importance across regressions on cases and deaths. The sign of the coefficients is consistent, however, across regressions indicating that the reproduction number is lower for cases in counties with higher levels of smoking, heart disease and diabetes.
In Figure S8, we present examples of the estimated reproduction number R along with lagged mobility (optimal lag chosen as Δ = 14 days). Before stay-at-home orders, the top-2 counties have estimated reproduction number above 3. However, as more Americans adopted social distancing and sheltered at home, mobility decreased and the reproduction number fell to around 1. As we observe from the estimated R after staged reopening begins (around 04/20/2020), the increased mobility is associated with a slight increase in the reproduction number.
5 Acknowledgements
This research was supported by NSF grants SCC-1737585 and ATD-1737996.
Footnotes
{chiangwe{at}iupui.edu, xl17{at}iupui.edu, gmohler{at}iupui.edu}
↵‡ The baseline is the median value, for the corresponding day of the week calculated during the 5-week period, 01/03/2020 to 02/06/2020.
↵§ https://scikit-learn.org/stable/modules/impute.html#multivariate-feature-imputation
↵¶ More examples of 𝒟conf forecasts are presented in the Figure S2 and Figure S3 in the supplementary file.
↵‖ The model coefficients for confirmed cases is included in Table S3 of the supplementary file.