Dynamics and Prediction of the COVID-19 Epidemics in the US – a Compartmental Model with Deep Learning Enhancement

Qi Deng

doi:10.1101/2020.05.31.20118414

Abstract

Compartmental models dominate epidemic modeling. Transmission parameters between compartments are typically estimated through stochastic parameterization processes that depend on detailed statistics on transmission characteristics, which are costly to collect in both money and resources. We apply deep learning techniques as a less data-dependent alternative for estimating the transmission parameters of a customized compartmental model, with the aim of projecting the further development of the US COVID-19 epidemic. The deep learning-enhanced compartmental model predicts that the basic reproduction number (R₀) will fall below one around June 18–19, 2020, and that the US “Infected” population will peak on June 17–18, 2020 at around 1·36 million individual cases. The model also predicts that the cumulative number of confirmed cases will cross the 2 million mark around June 11, 2020. It further projects that the infection transmission parameter will drop to virtually zero as early as July 8, 2020, implying that the total number of confirmed cases will likely stabilize around that time (predicted at 2·23–2·24 million).

Introduction

The COVID-19 pathogen, which has ravaged China, Europe and the US since December 2019, is a member of the coronavirus family, which also includes SARS-CoV and MERS-CoV. In the US, as of May 30, 2020, there had been 1,770,165 confirmed cases and 103,776 deaths.

The COVID-19 pandemic is still in progress, and most of the notable early research is descriptive in nature. It focuses on reported cases to establish baseline demographic parameters for the disease, such as age, gender, health and medical conditions, together with the disease’s clinical manifestations, in a Chinese context. These studies include reports on demographic, epidemiological and clinical characteristics, exposure and travel history to the epicenter, and illness timelines of laboratory-confirmed cases^1–5, as well as epidemiological information on patients gathered from social networks and from local, national and international health authorities⁶. The spread of SARS-CoV-2 outside of China (e.g. Iceland) has also been analyzed⁷, albeit to a limited extent. From a US perspective, prompted by the worsening situation in New York City, researchers have characterized the first 393 consecutive Covid-19 patients admitted to two hospitals in that city⁸.

Some stage-specific studies on COVID-19 patients have also been carried out, including a single-centered, retrospective study on critically ill adult patients in Wuhan, China⁹, and a retrospective, multi-center study on adult laboratory-confirmed inpatients (≥ 18 years old) from two Wuhan hospitals who have been discharged or have died¹⁰.

COVID-19 epidemic modeling

There have been attempts to model the COVID-19 epidemic dynamics. These studies all add a worldwide mobility dimension, reflecting a higher level of mobility and globalization in 2020 than in 2003 (SARS) and even 2013 (MERS). The SEIR model is used to infer the basic reproduction ratio and to simulate the Wuhan epidemic¹¹; that study considers domestic and international air travel between Wuhan and other cities to forecast the national and global spread of the virus. More sophisticated models have also been developed to correlate the risk levels of foreign countries with their travel exposure to China^12–13, including a stochastic dual-SEIR approach covering both the Wuhan population and international travelers to estimate how transmission has varied over time from Wuhan to international destinations¹³. Simulations of the international spread of COVID-19 after the start of the travel ban from Wuhan on Jan 23, 2020 have also been conducted¹⁴; these apply the Global Epidemic and Mobility Model (GLEAM) to a multitude of Chinese and international cities, and a SEIR variant (SLIR) to project the impact of human-to-human transmission. To simulate the transmission mechanism itself, a Bats-Hosts-Reservoir-People (BHRP) network has been developed to simulate potential transmission from the infection source (i.e., bats) to humans¹⁵. The BHRP network is essentially an elaborate collection of SEIR models applied to each state of the transmission network.

Since March 2020, with the COVID-19 outbreak winding down in China, researchers have devoted more effort to analyzing the effectiveness of containment measures. Mobility and travel history data from Wuhan are used to ascertain the impact of the drastic control measures implemented in China¹⁶. Another study investigates the spread and control of COVID-19 among Chinese cities using data on human movements and public health interventions¹⁷. Using contact data for Wuhan and Shanghai and contact tracing information from Hunan Province, a group of researchers builds a transmission model to study the impact of social distancing and school closure¹⁸.

Theoretical Foundation

Compartmental models dominate epidemic modeling of COVID-19 (and of previous coronavirus outbreaks), and they require detailed statistics on transmission characteristics to estimate the stochastic transmission parameters between compartments. Essentially, these models correlate factors such as geographic distances and contact intensities among heterogeneous subpopulations with gradient probability decay. Technically, transmission parameterization applies Bayesian inference methods, such as Markov Chain Monte Carlo (MCMC) or Gillespie algorithm¹⁹ simulations, to form probability density functions (PDFs) on the cross-section, in order to estimate parameters for each timestep of a multivariate time series construct. These detailed statistics on transmission characteristics are costly to collect in both money and resources.

We are particularly interested in extended compartmental models that cover multiple inter-connected and heterogeneous subpopulations^10,15,20. There are also some pure time series analyses on epidemic dynamics outside of the compartmental modeling mainstream, for example, the AutoRegressive Integrated Moving Average (ARIMA) approach²¹ that is typically found in financial applications. They provide another perspective.

We develop a multistep, multivariate deep learning methodology to estimate the transmission parameters. We then feed these estimated transmission parameters to a customized compartmental model to predict the development of the US COVID-19 epidemic.

We establish a SEIR-variant discrete time series on a daily interval as the theoretical foundation for a deep learning-enhanced compartmental model. We start with the construction of a so-called SEIRQJD (SEIR-Quarantined-Isolated-Deceased) model (Fig. 1).

Fig. 1. The SEIRQJD Model

Since we use the US COVID-19 epidemic datasets from the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) Github COVID-19 data depository, which do not include direct Exposed (E) and Quarantined (Q) data, we set all transmission parameters to and from the “E” and “Q” compartments to zero. Furthermore, the datasets assume that all Deaths (D) arise from the Isolated population (J), so we also set the transmission parameter from Infectious (I) to Deceased (D) to zero. We then simplify the abovementioned SEIRQJD model to a SIRJD (SIR-Isolated-Deceased) construct, in which a population is grouped into five compartments:

Susceptible (S): The susceptible population arises at a percentage (–) of a net influx of individuals ( ).
Infectious (I): The infectious individuals are symptomatic, come from the Susceptible compartment, and further progress into the Isolated or Recovered compartments.
Isolated (J): The isolated individuals have developed clinical symptoms and have been isolated by hospitalization or other means of separation. They come from the Infectious compartment and progress into the Recovered or Deceased compartments.
Recovered (R): The recovered individuals come from the Infectious and Isolated compartments and acquire lasting immunity (no evidence contradicting this assumption has emerged so far).
Deceased (D): The deceased cases come from Infectious and Isolated compartments.

The SIRJD model has a multivariate time series structure. The model separates a population (N_t, which can be assumed to be a constant) in the target region (i.e., the US) from a “net influx” population (L_t, which can also be assumed to be a constant at equilibrium) at a given point of time (t). In reality, the number of individuals in all other compartments combined is far smaller than both the total population and the number of individuals in the Susceptible compartment. Thus, in order to construct a linear multivariate time series without material consequence, it is not unreasonable to assume that and . Thus the daily (Δt = 1) multivariate time series is given by the following matrix form: or:

The Greek letters in the time series are transmission parameters defined in the state diagram in Figure 1. Essentially, all these parameters are stochastic.
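For orientation only, a generic discrete-time SIR update with time-varying parameters can be written as below. This is a sketch under the assumption that S_t/N_t ≈ 1, so that new infections are proportional to I_t alone. It is not a reconstruction of Equations (1) and (2): it omits the Isolated and Deceased compartments and the net influx term, and the subscripted symbols β_t and γ^R_t are notation assumed here for the daily values of β and γ^R.

```latex
\begin{aligned}
S_{t+1} &= S_t - \beta_t I_t,\\
I_{t+1} &= I_t + \beta_t I_t - \gamma^{R}_t I_t,\\
R_{t+1} &= R_t + \gamma^{R}_t I_t .
\end{aligned}
```

In this simplified form the Infectious compartment grows while β_t > γ^R_t and shrinks once γ^R_t > β_t.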

Since we need to estimate the transmission parameters, we can rewrite and rearrange Equations (1) and (2) to the following matrix representation: or:
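As an illustration of this kind of rearrangement, the sketch below backs out daily parameter values from observed compartment series. It covers only a simplified SIR case (S_t/N_t taken as 1, no Isolated or Deceased compartments) and should not be read as Equations (3) and (4); the function name `estimate_sir_parameters` and its arguments are hypothetical.

```python
def estimate_sir_parameters(infected, recovered):
    """Back out daily beta_t and gamma_t from observed I and R series.

    Assumes the simplified updates (S_t / N_t taken as 1):
        I[t+1] = I[t] + (beta_t - gamma_t) * I[t]
        R[t+1] = R[t] + gamma_t * I[t]
    """
    if len(infected) != len(recovered):
        raise ValueError("series must have equal length")
    betas, gammas = [], []
    for t in range(len(infected) - 1):
        i_t = infected[t]
        if i_t <= 0:
            raise ValueError("infected count must be positive")
        # Recoveries on day t, as a fraction of the infectious pool.
        gamma_t = (recovered[t + 1] - recovered[t]) / i_t
        # New infections = net change in I plus those who left I by recovering.
        beta_t = (infected[t + 1] - infected[t]) / i_t + gamma_t
        betas.append(beta_t)
        gammas.append(gamma_t)
    return betas, gammas
```

Each pair of consecutive days yields one (β_t, γ_t) estimate, so a series of n days produces n − 1 parameter values.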

Data

We collect the following US COVID-19 datasets from the JHU CSSE data depository^①.

Dataset 1: The JHU CSSE updates daily records (confirmed, active, dead, recovered, hospitalized, etc.) from April 12, 2020. We use these detailed case data to construct the compartmental model.
Dataset 2: The JHU CSSE updates two time series on a daily basis. One tracks confirmed cases and the other deaths, both starting January 22, 2020. We use the confirmed/dead cases to form training data for deep learning.

We notice that the JHU CSSE dataset has an almost exact period of 7 days (±1 day), reflecting that a majority of the reporting agencies in the country update their statistics on a fixed weekly calendar interval. As a countermeasure, we run a 7-day moving average on the dataset to smooth out this “unnatural” data seasonality.
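A minimal sketch of such smoothing follows. It assumes a trailing window (the text does not say whether the average is trailing or centered) and shortens the window for the first few points; the function name `moving_average` is hypothetical.

```python
def moving_average(values, window=7):
    """Trailing moving average; early points use as many values as exist."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running = 0.0
    for idx, value in enumerate(values):
        running += value
        if idx >= window:
            # Drop the value that has just left the trailing window.
            running -= values[idx - window]
        count = min(idx + 1, window)
        smoothed.append(running / count)
    return smoothed
```

With a 7-day window, a series that reports all of its weekly cases on a single weekday is flattened to a constant daily level once a full week of data is available.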

Methodology

We then carry out the following steps to model the US epidemic:

1. We construct an in-sample SIRJD time series starting April 12, 2020 with Dataset 1.
2. We use the in-sample SIRJD time series constructed in Step 1 to derive an in-sample time series for the two most critical daily transmission parameters (β and γ^R).
3. We construct a confirmed/dead-case time series starting from January 22, 2020 (in-sample) with Dataset 2.
4. We apply two deep learning approaches, the standard DNN (Deep Neural Networks) and the more advanced RNN-LSTM (Recurrent Neural Networks – Long Short Term Memory), to fit the confirmed/dead in-sample time series from Step 3, and predict the further development of confirmed/dead cases for 35 and 42 days (out-of-sample).
5. We use the confirmed/dead in-sample time series from Step 3 as training data and the in-sample β and γ^R time series from Step 2 as training labels, and apply the DNN and RNN-LSTM techniques to predict β and γ^R for 35 and 42 days (out-of-sample).
6. Finally, we use the predicted (out-of-sample) transmission parameters (β and γ^R) from Step 5 to simulate 35-day and 42-day progressions (out-of-sample) of the SIRJD model (particularly the SIR portion) in a recursive manner, starting with the data point of the last timestep from the in-sample SIRJD time series from Step 1. Equations (3, 4) and then Equations (1, 2) are used for and calculations, respectively.
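The recursive simulation in the final step can be sketched as follows for the SIR portion only. This is an illustration under the assumption S_t/N_t ≈ 1, not the paper's Equations (1) and (2); the function name `simulate_sir` and its arguments are hypothetical, and `betas`/`gammas` stand for the predicted daily β and γ^R values.

```python
def simulate_sir(s0, i0, rec0, betas, gammas):
    """Roll a simplified SIR recursion forward, one day per (beta, gamma) pair.

    s0, i0, rec0 are the last in-sample Susceptible, Infected and Recovered
    counts; the returned list starts with that point.
    """
    if len(betas) != len(gammas):
        raise ValueError("betas and gammas must have equal length")
    s, i, r = float(s0), float(i0), float(rec0)
    path = [(s, i, r)]
    for beta_t, gamma_t in zip(betas, gammas):
        new_infections = beta_t * i   # assumes S_t / N_t is close to 1
        new_recoveries = gamma_t * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        path.append((s, i, r))
    return path
```

Because each day's state is computed from the previous day's output, the Infected series rises while the predicted β exceeds γ^R and turns down once γ^R overtakes β.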

Fig. 2 is a flowchart illustrating the datasets and methodology.

Fig. 2. The dataset and methodology
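One common way to turn a time series into supervised training pairs for a DNN or RNN-LSTM is a sliding window, sketched below. The paper does not describe its exact windowing, so the lookback length, the alignment of inputs and labels, and the function name `make_windows` are all assumptions made for illustration.

```python
def make_windows(series, labels, lookback):
    """Pair each lookback-long slice of `series` with the label that follows it.

    `series` and `labels` are assumed to be aligned day by day.
    """
    if len(series) != len(labels):
        raise ValueError("series and labels must have equal length")
    if lookback < 1:
        raise ValueError("lookback must be at least 1")
    features, targets = [], []
    for end in range(lookback, len(series)):
        features.append(series[end - lookback:end])  # days end-lookback .. end-1
        targets.append(labels[end])                  # label on day `end`
    return features, targets
```

For a multistep forecast, the same pairing can be repeated with the label taken further ahead, or the one-step prediction can be fed back into the window.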

Results

The results are illustrated in Figs. 3–6 (35-day forecast) and Figs. 7–10 (42-day forecast).

Fig. 3. The results – transmission parameter estimations (DNN) for 35 days

Beta is the “Susceptible-to-Infected” transmission parameter (β) and Gamma_R is the “Infected-to-Recovered” transmission parameter (γ^R) for the in-sample (observed) data. Beta_fo is the forecasted β and Gamma_R_fo is the forecasted γ^R for the out-of-sample (forecasted) data.

Fig. 4. The results – transmission parameter estimations (RNN-LSTM) for 35 days

Fig. 5. The results – SIR model forecasting (DNN) for 35 days

Susceptible, Infected, Recovered and Dead are in-sample compartmental model data; Confirmed_cal is the in-sample number of confirmed cases (‘_cal’ indicates that it is calculated, rather than obtained directly from the dataset). Susceptible_fo, Infected_fo, Recovered_fo, Dead_fo and Confirmed_cal_fo are their out-of-sample (forecasted) counterparts. The left y-axis is for Infected/Infected_fo, Recovered/Recovered_fo, Dead/Dead_fo and Confirmed_cal/Confirmed_cal_fo, while the right y-axis is for Susceptible/Susceptible_fo. The right y-axis is needed for scaling purposes, as Susceptible/Susceptible_fo are derived from the total population (subtracting Confirmed/Confirmed_fo).

Fig. 6. The results – SIR model forecasting (RNN-LSTM) for 35 days

Fig. 7. The results – transmission parameter estimations (DNN) for 42 days

Fig. 8. The results – transmission parameter estimations (RNN-LSTM) for 42 days

Fig. 9. The results – SIR model forecasting (DNN) for 42 days

Fig. 10. The results – SIR model forecasting (RNN-LSTM) for 42 days

In Fig. 3 (35-day forecast), the DNN method predicts that on June 18, 2020 the “Infected-to-Recovered” transmission parameter (γ^R) will rise above, and stay above, the “Susceptible-to-Infected” transmission parameter (β). This means that the basic reproduction number, R₀, will fall below one and that the spread of COVID-19 in the US will effectively start to die out from that day. In Fig. 4 (35-day forecast), the RNN-LSTM method gives a slightly less aggressive prediction that γ^R will overtake β on June 19, 2020. Thus, with the 35-day forecast, we predict that the tide of the US epidemic will turn around June 18 to 19, 2020.
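The turnaround criterion described here, the first day from which γ^R stays above β, can be located in a forecast with a short scan. The sketch below is illustrative only; the function name `first_sustained_crossing` is hypothetical, and it treats "stays above" as holding through the end of the forecast window.

```python
def first_sustained_crossing(betas, gammas):
    """Return the first index from which gamma > beta on every remaining day.

    Returns None if gamma does not exceed beta on the final day.
    """
    if len(betas) != len(gammas):
        raise ValueError("betas and gammas must have equal length")
    crossing = None
    # Walk backwards from the last day while gamma still exceeds beta.
    for t in range(len(betas) - 1, -1, -1):
        if gammas[t] > betas[t]:
            crossing = t
        else:
            break
    return crossing
```

Scanning from the end ignores brief early excursions in which γ^R exceeds β for a day or two and then falls back.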

In Fig. 5 (35-day forecast), the DNN method predicts that the US “Infected” population will peak on June 17, 2020 at 1,357,705 individual cases. In Fig. 6 (35-day forecast), the RNN-LSTM method predicts that the US “Infected” population will peak on June 18, 2020 at 1,363,876 individual cases. Both methods predict that the cumulative number of confirmed cases will cross the 2 million mark on June 11, 2020, at 2,008,638 by DNN and 2,008,832 by RNN-LSTM.

In Fig. 7 (42-day forecast), the DNN method gives a slightly more aggressive prediction that γ^R will overtake β on June 17, 2020, while in Fig. 8 (42-day forecast) the RNN-LSTM method gives a more conservative prediction that R₀ will not fall below one until June 24, 2020. Comparing Figs. 7–8 with Figs. 3–4, we observe that the 42-day forecast gives a wider range (June 17 to 24, 2020) than the 35-day forecast for when the turnaround will occur. The reason for the wider range is that, for the same (in-sample) training data size, a longer forecast horizon produces a wider probability distribution.

Also in Fig. 7 (42-day forecast), the DNN method predicts that β will drop to virtually zero around July 8, 2020, and that the cumulative number of confirmed cases will stabilize at 2·23–2·24 million.

Discussion

We apply DNN and RNN-LSTM techniques to estimate the stochastic transmission parameters for a SIRJD model with a discrete time series construct. We then use the predicted transmission parameters to forecast the further development of the US COVID-19 epidemics.

We make use of two US COVID-19 datasets from the JHU CSSE data depository. The first dataset includes detailed daily records (confirmed, active, dead, recovered, hospitalized, etc.) starting from April 12, 2020, from which we construct the SIRJD model. The second dataset includes time series tracking confirmed cases and deaths starting from January 22, 2020, which we use to construct training data for deep learning. The JHU CSSE data have an almost exact period of 7 days (±1 day) that masks the true epidemic dynamics, so we run a 7-day moving average on the dataset to smooth out this data seasonality.

We then apply DNN and RNN-LSTM deep learning techniques to fit the confirmed/dead series and predict the further development of confirmed/dead cases, as well as to predict the “Susceptible-to-Infected” and “Infected-to-Recovered” transmission parameters (β and γ^R) for 35 and 42 days. Finally, we use the predicted transmission parameters (β and γ^R) to simulate the epidemic progression for 35 and 42 days.

Implementations of the DNN- and RNN-LSTM-enhanced SIRJD model are consistent in their forecasts of the US COVID-19 epidemic dynamics. The DNN and RNN-LSTM implementations predict that the basic reproduction number (R₀) will fall below one around June 18–19, 2020 for the 35-day forecast, and around June 17–24, 2020 for the 42-day forecast, at which point the spread of the coronavirus will effectively start to die out. Both methods predict that the cumulative number of confirmed cases will cross the 2 million mark on June 11, 2020, at 2,008,638 by DNN and 2,008,832 by RNN-LSTM. The DNN and RNN-LSTM implementations for the 35-day forecast predict that the US “Infected” population will peak on June 17–18, 2020 at 1,357,705 and 1,363,876 individual cases, respectively.

The DNN implementation for 42-day forecast projects that the “Susceptible-to-Infected” transmission parameter β will drop to virtually zero around July 8, 2020, which implies that the total number of confirmed cases and deaths will likely become stable (forecasted at 2·23–2·24 million).

With the introduction of the deep learning-enhanced compartmental model, we provide an effective and easy-to-implement alternative to the prevailing stochastic parameterization, which estimates transmission parameters through either likelihood maximization or Markov Chain Monte Carlo simulation. The effectiveness of the prevailing approach depends on detailed statistics on transmission characteristics among heterogeneous subpopulations, and such statistics are costly to collect in both money and resources. Deep learning techniques, by contrast, uncover hidden interconnections among seemingly less related data, reducing the prediction’s dependence on data particularity. Future research on the utility of deep learning in epidemic modeling can further enhance its forecasting power.

Data Availability

Up-to-date time series data are available upon request.

Supplementary Materials

Dataset_1 time series (constructed): covid_us.csv

Dataset_2 time series (constructed): covid_us_ts.csv

Acknowledgements

The authors thank Ms. Liu Chang and Mr. Liu Shuigeng of Cofintelligence Fintech Co, Ltd. (Hong Kong and Shanghai) for data collection and formatting. The authors declare no competing interests.

Footnotes

① The JHU CSSE Github COVID-19 data depository is at https://github.com/CSSEGISandData/COVID-19.

References and Notes

1. Chen N, Zhou M, Dong X, Qu J, et al. Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: a descriptive study. The Lancet 2020; doi:10.1016/S0140-6736(20)30211-7.
2. Guan W, Ni ZY, Hu Y, et al. Clinical Characteristics of Coronavirus Disease 2019 in China. The New England Journal of Medicine 2020; doi:10.1056/NEJMoa2002032.
3. Huang CY, Wang Y, Li X, et al. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. The Lancet 2020; doi:10.1016/S0140-6736(20)30183-5.
4. Li Q, Guan X, Wu P, et al. Early transmission dynamics in Wuhan, China, of novel coronavirus–infected pneumonia. The New England Journal of Medicine 2020; doi:10.1056/NEJMoa2001316.
5. Shi H, Han X, Jiang N, et al. Radiological findings from 81 patients with COVID-19 pneumonia in Wuhan, China: a descriptive study. The Lancet 2020; doi:10.1016/S1473-3099(20)30086-4.
6. Sun K, Chen J, Viboud C. Early epidemiological analysis of the coronavirus disease 2019 outbreak based on crowdsourced data: a population level observational study. The Lancet 2020; doi:10.1016/S2589-7500(20)30026-1.
7. Gudbjartsson DF, Helgason A, Jonsson H, et al. Spread of SARS-CoV-2 in the Icelandic Population. The New England Journal of Medicine 2020; doi:10.1056/NEJMoa2006100.
8. Goyal P, Choi JJ, Pinheiro LC, et al. Clinical Characteristics of Covid-19 in New York City. The New England Journal of Medicine 2020; doi:10.1056/NEJMc2010419.
9. Yang X, Yu Y, Xu J, et al. Clinical course and outcomes of critically ill patients with SARS-CoV-2 pneumonia in Wuhan, China: a single-centered, retrospective, observational study. The Lancet 2020; doi:10.1016/S2213-2600(20)30079-5.
10. Zhang J, Litvinova M, Liang Y, et al. Changes in contact patterns shape the dynamics of the COVID-19 outbreak in China. Science 2020; doi:10.1126/science.abb8001.
11. Wu JT, Leung K, Leung GM. Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study. The Lancet 2020; doi:10.1016/S0140-6736(20)30260-9.
12. Gilbert M, Pullano G, Pinotti F, et al. Preparedness and vulnerability of African countries against importations of COVID-19: a modelling study. The Lancet 2020; doi:10.1016/S0140-6736(20)30411-6.
13. Kucharski AJ, Russell TW, Diamond C, et al. Early dynamics of transmission and control of COVID-19: a mathematical modelling study. The Lancet 2020; doi:10.1016/S1473-3099(20)30144-4.
14. Chinazzi M, Davis JT, Ajelli M, et al. The effect of travel restrictions on the spread of the 2019 novel coronavirus (COVID-19) outbreak. Science 2020; doi:10.1126/science.aba9757.
15. Chen TM, Rui J, Wang QP, et al. A mathematical model for simulating the phase-based transmissibility of a novel coronavirus. Infectious Diseases of Poverty 2020; 9:24.
16. Kraemer MUG, Yang CH, Gutierrez B, et al. The effect of human mobility and control measures on the COVID-19 epidemic in China. Science 2020; 368:493–497.
17. Tian T, Liu Y, Li Y, et al. An investigation of transmission control measures during the first 50 days of the COVID-19 epidemic in China. Science 2020; 368:638–642.
18. Zhang J, Lou J, Ma Z, Wu J. A compartmental model for the analysis of SARS transmission patterns and outbreak control measures in China. Applied Mathematics and Computation 2005; 162(2):909–924.
19. Gillespie DT. Exact Stochastic Simulation of Coupled Chemical Reactions. The Journal of Physical Chemistry 1977; 81(25):2340–2361.
20. Naheed A, Singh M, Lucy D. Numerical study of SARS epidemic model with the inclusion of diffusion in the system. Applied Mathematics and Computation 2014; 229:480–498.
21. Lai D. Monitoring the SARS Epidemic in China: A Time Series Analysis. Journal of Data Science 2005; 3:279–293.