Government Responses Matter: Predicting Covid-19 cases in US using an empirical Bayesian time series framework
==============================================================================================================

* Ziyue Liu
* Wensheng Guo

## Abstract

Since the Covid-19 outbreak, researchers have been predicting how the epidemic will evolve, especially the number in each country, through using parametric extrapolations based on the history. In reality, the epidemic progressing in a particular country depends largely on its policy responses and interventions. Since the outbreaks in some countries are earlier than United States, the prediction of US cases can benefit from incorporating the similarity in their trajectories. We propose an empirical Bayesian time series framework to predict US cases using different countries as prior reference. The resultant forecast is based on observed US data and prior information from the reference country while accounting for different population sizes. When Italy is used as prior in the prediction, which the US data resemble the most, the cases in the US will exceed 300,000 by the beginning of April unless strong measures are adopted.

When facing an epidemic, people and government of a country may underestimate its seriousness in the beginning but will eventually step up their responses. Hence the case numbers tend to increase exponentially in the early stage, while the trends will gradually bend and plateau. Therefore, similarities in the case number trajectories can be observed in different countries, though the timing and severity can differ substantially due to different responses. Figure 1 displays the trajectories of total Covid-19 case numbers for China, S. Korea, Italy, France, Iran, Germany, Spain and USA using Johns Hopkins data. These countries have more days from time zero than US, where time zero is defined as first day with 100 or more (100+) cases as a heuristic but widely used choice1. The curve of South Korea increased rapidly early on but quickly bended and plateaued, for which S. Korea’s swift and deterministic policy responses are credited2. China exhibits similar but later flattening pattern, which agrees with its missing early intervention window, but later extreme lockdown policy implementation3. On the other hand, the cases in Italy and France have grown exponentially until recent days, which have partially been attributed to their late and weak policy responses4. The US trajectory is almost linear on the logarithm scale. While the US government is catching up with policies such as work/study from home, social distancing and self-quarantine, the effect has not seen in the trajectory.

![Figure 1.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/03/30/2020.03.28.20044578/F1.medium.gif)

[Figure 1.](http://medrxiv.org/content/early/2020/03/30/2020.03.28.20044578/F1)

Figure 1. 
Cases numbers for China, S. Korea, Italy, France, Spain, Germany, Iran, and US on the natural logarithm scale. For the first seven countries, the raw data are shown as symbols, the smoothed trends as solid lines, and the 95% confidence intervals in dotted lines. For US, only the raw data are displayed.

Existing Covid-19 forecasting are extrapolations into the future time5-11. Their validity relies on the crucial but unrealistic assumption that the future trajectories are completely determined by the history. This by design cannot incorporate government responses yet to come. Not surprisingly, these predictions can be off the target. For example, Fanelli and Piazza7 predicted a maximum number of cases in Italy to be 15,000, where the real cases have already multipled. Batista8 predicted the pandemic should peak around Feb 9th, 2020, but it shows no sign of slowing down into late-March, 2020. Zheng et al9 predicted about 20,000 cases in South Korea, which is unlikely to happen given its current flat trend around 9,000. Models used in these forecasting are mainly the susceptible-infected-removed (SIR) models and its variants5-8. Others include state transition model9, parametric growth curve models such as logistic curves10, and auto regressive integrated moving average (ARIMA) models11.

## Proposed Methods

We propose an empirical Bayesian time series framework to forecast the US trajectory by utilizing the idea of internal time. Since the virus spread to different countries at different time, their trajectories are different in calendar time but comparable in internal time. We define time zero as the first day with 100 or more cases in a given country. We first model the trajectories of the eight countries by a functional mixed effects model12, where different countries shared a similar mean trajectory over time, and each country has its own random deviation curve. An additional scalar fixed effect parameter is incorporated to account for different population sizes on the natural logarithm scale. The estimated coefficient is 0.34, suggesting while the population size has some effects on the cases numbers, it is not fully proportion to the population. Both the population-average curve and random deviation curves are modeled by cubic splines. The model is then casted into state space model for computational efficiency and forecasting. The smoothing parameters and the variances are estimated through maximum likelihood.

Based on the estimated parameters using the eight countries, our next task is forecast the US cases while incorporate one of the countries as the prior information. This is done through constructing conditional state space model from the functional mixed effects model conditional on the observed data of the specified country13. By running the Kalman filter forward on the conditional state space model with the US time series data and into the future, the results are the posterior prediction incorporating both the prior information from the specific country and the observed US data. As the reference country is only specified as the prior, the posterior can be substantially different from the prior, suggesting strong deviation from the reference country.

In addition, the observed US data can be substantially different from posterior prediction, indicating that the US case are following a different trajectory because of different policy responses. More technical details are given in the Supplement.

### Data Analysis

The Johns Hopkins University CSSE data were downloaded from its GitHub repository ([https://github.com/CSSEGISandData/COVID-19](https://github.com/CSSEGISandData/COVID-19)). We modeled the natural logarithms of the case numbers as the outcome. The data were then used for prediction using the proposed method. After the posterior means and variances were calculated and the 95% prediction intervals were constructed, they were taken exponential to transform back to the original scale. The whole data analysis from reading in the data to plotting the results took less than 10 seconds on personal computer with Intel® Core™i76600U CPU @ 2.60GHz, 2801Mhz, 2 Cores, 4 Logical Processors.

## Results

Results based on US data up to March 26th, 2020 are shown in Figure 2∼4. Two important observations can be made from these figures. There is no apparent slowing down yet for US trajectory based on either the observed trend or predicted trend. This indicates that US is still in its exponentially increasing phase in the near future.

![Figure 2.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/03/30/2020.03.28.20044578/F2.medium.gif)

[Figure 2.](http://medrxiv.org/content/early/2020/03/30/2020.03.28.20044578/F2)

Figure 2. 
Prediction of the US trajectory using Italy as prior. The horizontal axis is based on US calendar time. On April 4th, 2020, the mean prediction value exceeds 300,000.

Figure 2 displays the results using Italy as prior. It shows that US and Italy have similar patterns and majority of the observed US data are in the 95% prediction intervals. This suggests that the trajectory in Italy serves as a good prior for the US prediction. Based on this prediction, on the next day as March 27th, 2020, US may have as many as 108,595 cases. In about 10 days, the US case number will exceed 300,000 around April 4th, 2020 shall the US policy responses have similarly effects as Italy.

Figure 3 displays the results using China as prior. It shows that the observed US case numbers are already higher than the predicted values. Even if the US policy responses have similar effect as China, US case numbers will exceed 150,000 around April 11th, 2020. The results using South Korea as prior are displayed in Figure 4. US case numbers are predicted to exceed 200,000 around April 6th, 2020. Since the observed US data are already well above the upper bound of the 95% prediction intervals, the data from China and South Korea are not good priors for the US prediction, suggesting that the situation in the US will be much worse than those in China and South Korea.

![Figure 3.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/03/30/2020.03.28.20044578/F3.medium.gif)

[Figure 3.](http://medrxiv.org/content/early/2020/03/30/2020.03.28.20044578/F3)

Figure 3. 
Prediction of the US trajectory using China as prior. The horizontal axis is based on US calendar time. On April 11th, 2020, the mean prediction value exceeds 150,000.

![Figure 4.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/03/30/2020.03.28.20044578/F4.medium.gif)

[Figure 4.](http://medrxiv.org/content/early/2020/03/30/2020.03.28.20044578/F4)

Figure 4. 
Prediction of the US trajectory using S. Korea as prior. The horizontal axis is based on US calendar time. On April 6th, the mean prediction value exceeds 200,000.

## Conclusion

We have proposed a new prediction method for predicting total COVID-19 cases of US by incorporating the information from other countries. While we demonstrated our method in predicting US cases, our method can be used for predicting state-by-state data as well as hospital-by-hospital data. Our prediction intervals are much smaller than most exiting methods due to the additional information from the reference country. We show that the current trajectory in US is most similar to that in Italy. The stronger response from Italy has led to slowing down of the spread in the last few days, while the effect of social distancing in the US has not shown in the observed data.

It is well-known that there are serious under-reporting or under-detection of cases in various countries and under-reporting rates may be very different across counties. This can contribute to substantial differences in the trajectories. With the advance of testing techniques, more and more people are tested in the US. This may also explain why the reported cases in the US are substantially higher than other countries in the same stages.

## Data Availability

Data and Matlab code for this manuscript are included as a supplement.

## Supplement

### Functional mixed effects model

Let *N**ij* be the number of total COVID-19 cases for the *i**th* country on the *j**th* day, where day 1 is defined as the first day with 100 or more cases. We model the natural logarithm of *N**ij*, *y**ij* = log(*N**ij*), by a functional mixed effects model1 as ![Formula][1]</img> 

Where *p**i* is the population size in the unit of millions, *β* is the fixed effect slope for log(*p**i*), *f*(*t**j*) is the functional fixed effects, *g**i*(*t**j*) is the functional random effects, and ![Graphic][2]</img> is the error term. We model *f*(*t**j*) by a cubic smoothing spline with the state space representation as ![Formula][3]</img>  where *f*′(*t**j*) is the first derivative with respect to time. The state transition matrix ![Graphic][4]</img> with *δt* is the time interval between two points with the overall time range scaled to [0,1]. The state innovation vector ***η****j*∼N(****, Σ*f*) with ![Graphic][5]</img> and *λ* is the smoothing parameter. The state vector ![Graphic][6]</img> is initialized at time zero as ![Graphic][7]</img> with *κ* → ∞ and *I* is the identity matrix. We model *g**i*(*t**j*) similarly but using a sine function in the *W* space with the state space representation2 as ![Formula][8]</img>  with time rescaled to [0, 0.5]. The state transition matrix ![Graphic][9]</img>.

The state innovation vector ***ξ****ij*∼***N***(****, Σ*g*), with ![Formula][10]</img>  and *λ**g* is the smoothing parameter. The state vector ![Graphic][11]</img> is initialized at time zero as ![Graphic][12]</img>.

### Parameter estimation

Data from eight countries (China, S. Korea, Italy, France, Iran, Germany, Spain and USA) were used. Let *n**i* denote the number of observations for subject *i* = 1, …, *k*, ***y****i* the corresponding observed data vector, ***t****i* the time vector, ***f***(***t****i*) the vector of fixed functional effects evaluated at ***t****i*, ***g****i*(***t****i*) the functional random effect, *f*”(*t*) and *g*”*i*(*t*) the second derivatives with respect to time, and ***y*** the overall observed data vector. The following penalized log-likelihood3 was maximized to estimate the parameter vector ![Graphic][13]</img> ![Formula][14]</img> 

The maximum likelihood parameter estimates were ![Graphic][15]</img>. We adopt an empirical Bayes approach such that these parameters are treated as known in the following steps.

### Construction of the conditional SSM

For the *i**th* reference country, the conditional SSM was constructed on the state vectors dimension of six as ![Graphic][16]</img>, where subscript ‘US’ denote US-specific component. The working data are ![Graphic][17]</img>. The observation matrix is *F* = (1 0 1 0 0 0), the state transition matrix is *H* = block diagonal (*H**f*, *H**g*, *H**g*), the state innovation vector ![Graphic][18]</img> distributed as ***ϖ*** ∼N(****, Σ*j*) where Σ*j* = block diagonal (Σ*f*, Σ*g*, Σ*g*). The working SSM is ![Formula][19]</img> 

Let ***m***(**·**) denote the mean and *W*(**·**) the variance of ![Graphic][20]</img>. The conditional SSM was constructed in three steps using the dynamic state space models4.

Step 1. The forward filtering: for *j* = 1, …, *n**i* ![Formula][21]</img> 

Step 2. The backward smoothing: at *j* = *n**i* let ***m***(*j*|*n**i*) = ***m***(*n**i*|*n**i*) and *W*(*j*|*n**i*) = *W*(*n**i*|*n**i*), for *j* = *n**i* − 1, …, 0 ![Formula][22]</img> 

Step 3. Construction step: ![Graphic][23]</img>, for *j* = 1, …, *n**i* ![Formula][24]</img>  where ![Graphic][25]</img> is the state transition matrix, ![Graphic][26]</img> is the mean vector and ![Graphic][27]</img> is the variance matrix of the state innovations.

### Predict US trajectory

The US trajectory was predicted by running the *n**US* observations ![Graphic][28]</img> through the conditional SSM constructed above with an observation matrix *F* = (1 0 0 0 1 0). For *j* = 1, …, *n**US*, the first two equations in (7) generated the one-step ahead prediction, while the last two equations generate the filtered values. Further running the model for *j* = *n**US* + 1, …, *n**i* generated predictions into the future time for US. ![Graphic][29]</img> was then added back to recover the effects of population size.

*   Received March 28, 2020.
*   Revision received March 28, 2020.
*   Accepted March 30, 2020.


*   © 2020, Posted by Cold Spring Harbor Laboratory

The copyright holder for this pre-print is the author. All rights reserved. The material may not be redistributed, re-used or adapted without the author's permission.

## References

1.  1.Rattner, N. & Schoen, J. W. These charts show how fast coronavirus cases are spreading – and what it takes to flatten the curve. CNBC (2020). Available at: [https://www.cnbc.com/2020/03/22/these-charts-show-how-fast-coronavirus-cases-are-spreading.html](https://www.cnbc.com/2020/03/22/these-charts-show-how-fast-coronavirus-cases-are-spreading.html)
    
    
2.  2.Zastrow, M. South Korea is reporting intimate details of COVID-19 cases: has it helped? Nature News (2020). Available at: [https://www.nature.com/articles/d41586-020-00740-y](https://www.nature.com/articles/d41586-020-00740-y)
    
    
3.  3.Cyanoski, D. What China’s coronavirus response can teach the rest of the world? Nature News (2020). Available at: [https://www.nature.com/articles/d41586-020-00741-x](https://www.nature.com/articles/d41586-020-00741-x)
    
    
4.  4.Donadio, R. Italy’s coronavirus response is a warning from the future. The Atlantic, March 8th, 2020. [https://www.theatlantic.com/international/archive/2020/03/italy-coronavirus-covid19-west-europe-future/607660/](https://www.theatlantic.com/international/archive/2020/03/italy-coronavirus-covid19-west-europe-future/607660/)
    
    
5.  5.Zhang, Y. et al. Prediction of the COVID-19 outbreak based on a realistic stochastic model. Preprint at [https://doi.org/10.1101/2020.03.10.20033803](https://doi.org/10.1101/2020.03.10.20033803)
    
    
6.  6.Wang, L. et al. An epidemiological forecast model and software assessing interventions on COVID-19 epidemic in China. Preprint at [https://doi.org/10.1101/2020.02.29.20029421](https://doi.org/10.1101/2020.02.29.20029421)
    
    
7.  7.Fanelli, D. & Piazza, F. Analysis and forecast of COVID-19 spreading in China, Italy and France. Preprint at [https://arxiv.org/abs/2003.06031](https://arxiv.org/abs/2003.06031)
    
    
8.  8.Batista, M. Estimation of the final size of the COVID-19 epidemic. Preprint at [https://doi.org/10.1101/2020.02.16.20023606](https://doi.org/10.1101/2020.02.16.20023606)
    
    
9.  9.Zheng, Z., Wu, K., Yao, Z., Zheng, J. & Chen, J. The prediction for development of COVID-19 in global major epidemic areas through empirical trends in China by utilizing state transition matrix model. Preprint at [https://doi.org/10.1101/2020.03.10.20033670](https://doi.org/10.1101/2020.03.10.20033670)
    
    
10. 10.Buizza, R. Probabilistic prediction of COVID-19 infections for China and Italy, using an ensemble of stochastically-perturbed logistic curves. Preprint at [https://arxiv.org/abs/2003.06418](https://arxiv.org/abs/2003.06418)
    
    
11. 11.Benvenuto, D., Giovanetti, M., Vassalo, L. & Angeletti, Silvia. Application of the ARIMA model on the COVID-2019 epidemic dataset. Preprint at [https://doi.org/10.1016/j.dib.2020.105340](https://doi.org/10.1016/j.dib.2020.105340)
    
    
12. 12.Guo, W. Functional mixed effects models. Biometrics, 58, 121–8 (2002).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1111/j.0006-341X.2002.00121.x&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=11890306&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2020%2F03%2F30%2F2020.03.28.20044578.atom) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000174229300014&link_type=ISI) 

13. 13.Guo, W. Dynamic state space models. Journal of Time Series analysis, 24, 149–158 (2003).
    
    
14. 14.Durbin, J. & Koopman, S.J. Time Series Analysis by State Space Methods (2nd edn).Oxford University Press: Oxford, UK. (2012).
    
    
## References for the supplement

1.  1.Guo, W. Functional mixed effects models. Biometrics, 58, 121–8. (2002).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1111/j.0006-341X.2002.00121.x&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=11890306&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2020%2F03%2F30%2F2020.03.28.20044578.atom) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000174229300014&link_type=ISI) 

2.  2.Qin, L. Functional models using smoothing splines, a state space approach. Dissertation.
    
    
3.  3.Wahba, G. A comparison of GCV and GML for choosing the smoothing parameter in the generalized spline smoothing problem. The Annals of Statistics, 13, 1378–1402. (1985).
    
    
4.  4.Guo, W. Dynamic state space models. Journal of Time Series analysis, 24, 149–158. (2003).

 [1]: /embed/graphic-5.gif
 [2]: /embed/inline-graphic-1.gif
 [3]: /embed/graphic-6.gif
 [4]: /embed/inline-graphic-2.gif
 [5]: /embed/inline-graphic-3.gif
 [6]: /embed/inline-graphic-4.gif
 [7]: /embed/inline-graphic-5.gif
 [8]: /embed/graphic-7.gif
 [9]: /embed/inline-graphic-6.gif
 [10]: /embed/graphic-8.gif
 [11]: /embed/inline-graphic-7.gif
 [12]: /embed/inline-graphic-8.gif
 [13]: /embed/inline-graphic-9.gif
 [14]: /embed/graphic-9.gif
 [15]: /embed/inline-graphic-10.gif
 [16]: /embed/inline-graphic-11.gif
 [17]: /embed/inline-graphic-12.gif
 [18]: /embed/inline-graphic-13.gif
 [19]: /embed/graphic-10.gif
 [20]: /embed/inline-graphic-14.gif
 [21]: /embed/graphic-11.gif
 [22]: /embed/graphic-12.gif
 [23]: /embed/inline-graphic-15.gif
 [24]: /embed/graphic-13.gif
 [25]: /embed/inline-graphic-16.gif
 [26]: /embed/inline-graphic-17.gif
 [27]: /embed/inline-graphic-18.gif
 [28]: /embed/inline-graphic-19.gif
 [29]: /embed/inline-graphic-20.gif