Abstract
Compartmental models dominate epidemic modeling. Transmission parameters between compartments are typically estimated through stochastic parameterization processes that depend on detailed statistics on transmission characteristics, which are costly to collect in both money and resources. We apply deep learning techniques as an alternative with lower data dependency to estimate the transmission parameters of a customized compartmental model, in order to project the further development of the US COVID-19 epidemic. The deep learning-enhanced compartmental model predicts that the basic reproduction number (R0) will fall below one around June 18–19, 2020, and that the US “Infected” population will peak on June 17–18, 2020 at around 1·36 million individual cases. The model also predicts that the number of cumulative confirmed cases will cross the 2 million mark around June 11, 2020. It further projects that the infection transmission parameter will drop to virtually zero as early as July 8, 2020, implying that the total number of confirmed cases will likely stabilize around that time (predicted at 2·23–2·24 million).
Introduction
The COVID-19 pathogen, which has ravaged China, Europe and the US since December 2019, is a member of the coronavirus family, which also includes SARS-CoV and MERS-CoV. In the US, as of May 30, 2020, there have been 1,770,165 confirmed cases and 103,776 deaths.
The COVID-19 pandemic is still in progress, and most of the notable early research is descriptive in nature, focusing on reported cases to establish the baseline demographic parameters for the disease, such as age, gender, health and medical conditions, in addition to the disease’s clinical manifestations, in a Chinese context. These studies include reports on demographic characteristics, epidemiological and clinical characteristics, exposure and travel history to the epicenter, illness timelines of laboratory-confirmed cases1–5, as well as epidemiological information on patients from social networks and from local, national and international health authorities6. The spread of SARS-CoV-2 outside of China (e.g. Iceland) has also been analyzed7, albeit to a limited extent. From a US perspective, prompted by the worsening situation in New York City, researchers have characterized the first 393 consecutive COVID-19 patients admitted to two hospitals in that city8.
Some stage-specific studies on COVID-19 patients have also been carried out, including a single-center, retrospective study on critically ill adult patients in Wuhan, China9, and a retrospective, multi-center study on adult laboratory-confirmed inpatients (≥ 18 years old) from two Wuhan hospitals who had been discharged or had died10.
COVID-19 epidemic modeling
There have been attempts to model the COVID-19 epidemic dynamics. These studies all add a worldwide mobility dimension, reflecting a higher level of mobility and globalization in 2020 than in 2003 (SARS) and even 2013 (MERS). The SEIR model has been used to infer the basic reproduction number and simulate the Wuhan epidemic11; that study considers domestic and international air travel between Wuhan and other cities to forecast the national and global spread of the virus. More sophisticated models have also been developed to correlate the risk levels of foreign countries with their travel exposure to China12–13, including a stochastic dual-SEIR approach applied to both the Wuhan population and international travelers to estimate how transmission has varied over time from Wuhan to international destinations13. Simulations of the international spread of COVID-19 after the start of the travel ban from Wuhan on Jan 23, 2020 have also been conducted14; these apply the Global Epidemic and Mobility Model (GLEAM) to a multitude of Chinese and international cities, and an SEIR variant (SLIR) to project the impact of human-to-human transmission. To simulate the transmission mechanism itself, a Bats-Hosts-Reservoir-People (BHRP) network has been developed to simulate potential transmission from the infection sources (i.e., bats) to humans15. The BHRP network is essentially an elaborate collection of SEIR models that are applied to each stage of the transmission network.
Since March 2020, with the COVID-19 outbreak winding down in China, researchers have devoted more effort to analyzing the effectiveness of containment measures. Mobility and travel history data from Wuhan have been used to ascertain the impact of the drastic control measures implemented in China16. One study investigates the spread and control of COVID-19 among Chinese cities with data on human movements and public health interventions17. Using contact data for Wuhan and Shanghai and contact tracing information from Hunan Province, a group of researchers has built a transmission model to study the impact of social distancing and school closure18.
Theoretical Foundation
Compartmental models dominate the modeling of the COVID-19 epidemic (and of previous coronavirus outbreaks), and they require detailed statistics on transmission characteristics to estimate the stochastic transmission parameters between compartments. Essentially, these models correlate factors such as geographic distances and contact intensities among heterogeneous subpopulations with gradient probability decay. Technically, transmission parameterization applies Bayesian inference methods, such as Markov Chain Monte Carlo (MCMC) or Gillespie algorithm19 simulations, to form cross-sectional probability density functions (PDFs), in order to estimate parameters for each timestep of a multivariate time series construct. These detailed statistics on transmission characteristics are costly to collect in both money and resources.
We are particularly interested in extended compartmental models that cover multiple inter-connected and heterogeneous subpopulations10,15,20. There are also some pure time series analyses on epidemic dynamics outside of the compartmental modeling mainstream, for example, the AutoRegressive Integrated Moving Average (ARIMA) approach21 that is typically found in financial applications. They provide another perspective.
We develop a multistep, multivariate deep learning methodology to estimate the transmission parameters. We then feed these estimated transmission parameters to a customized compartmental model to predict the development of the US COVID-19 epidemic.
We establish an SEIR-type discrete time series on a daily interval as the theoretical foundation for a deep learning-enhanced compartmental model. We start with the construction of a so-called SEIRQJD (SEIR-Quarantined-Isolated-Deceased) model (Fig. 1).
Since we use the US COVID-19 epidemic datasets from the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) GitHub COVID-19 data repository, which do not include direct Exposed (E) and Quarantined (Q) data, we set all transmission parameters to and from the “E” and “Q” compartments to zero. Furthermore, the datasets assume that all Deaths (D) arise from the Isolated population (J); thus we also set the transmission parameter from Infectious (I) to Deceased (D) to zero. We then simplify the abovementioned SEIRQJD model to an SIRJD (SIR-Isolated-Deceased) construct, in which a population is grouped into five compartments:
Susceptible (S): The susceptible population arises as a percentage of a net influx of individuals (Lt).
Infectious (I): The infectious individuals are symptomatic, come from the Susceptible compartment, and further progress into the Isolated or Recovered compartments.
Isolated (J): The isolated individuals have developed clinical symptoms and have been isolated by hospitalization or other means of separation. They come from the Infectious compartment and progress into the Recovered or Deceased compartments.
Recovered (R): The recovered individuals come from the Infectious and Isolated compartments and acquire lasting immunity (no evidence has yet contradicted this assumption).
Deceased (D): The deceased cases come from the Infectious and Isolated compartments; with the JHU CSSE datasets, all deaths are attributed to the Isolated compartment.
The SIRJD model has a multivariate time series structure. The model considers a population (Nt, which can be assumed to be constant) in the target region (i.e., the US) together with a “net influx” population (Lt, which can also be assumed to be constant at equilibrium) at a given point in time (t). In reality, the number of individuals in all other compartments combined is far smaller than both the total population and the number of individuals in the Susceptible compartment. Thus, in order to construct a linear multivariate time series without material consequence, it is not unreasonable to adopt the corresponding simplifying approximations (for example, treating the Susceptible population as approximately equal to the total population). The daily (Δt = 1) multivariate time series is then given in matrix form by Equation (1), or equivalently by Equation (2).
The Greek letters in the time series are transmission parameters defined in the state diagram in Figure 1. Essentially, all these parameters are stochastic.
Since we need to estimate the transmission parameters, we rearrange Equations (1) and (2) into the matrix representation of Equation (3), or equivalently Equation (4).
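The idea of solving the compartment equations for the parameters can be illustrated with a minimal Python sketch. It is not the paper's Equations (3) and (4): it assumes a simplified SIR portion in which the Susceptible fraction is close to one, so that the daily inflow into the Infectious compartment is roughly β·I and the daily recoveries are roughly γR·I. The function name `estimate_parameters` and the sample numbers are hypothetical.

```python
def estimate_parameters(infectious, new_infections, new_recoveries):
    """Back out daily beta and gamma_R from observed flows.

    Assumption (not from the paper's equations): with S_t / N_t ~ 1,
        new_infections[t + 1] ~ beta_t    * infectious[t]
        new_recoveries[t + 1] ~ gamma_R_t * infectious[t]
    """
    betas, gammas = [], []
    for t in range(len(infectious) - 1):
        i_t = infectious[t]
        if i_t <= 0:
            # No infectious individuals: the rates are undefined, report zero.
            betas.append(0.0)
            gammas.append(0.0)
            continue
        betas.append(new_infections[t + 1] / i_t)
        gammas.append(new_recoveries[t + 1] / i_t)
    return betas, gammas


# Hypothetical three-day series (flows at index t + 1 occur between day t and t + 1).
infectious = [1000.0, 1030.0, 1050.6]
new_infections = [0.0, 50.0, 41.2]
new_recoveries = [0.0, 20.0, 20.6]

betas, gammas = estimate_parameters(infectious, new_infections, new_recoveries)
print(betas, gammas)
```

Each estimate is a per-day ratio, which is why the resulting β and γR series can be treated as time series in their own right and used as training labels.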
Data
We collect the following US COVID-19 datasets from the JHU CSSE data repository①.
Dataset 1: The JHU CSSE has published daily records (confirmed, active, dead, recovered, hospitalized, etc.) since April 12, 2020. We use these detailed case data to construct the compartmental model.
Dataset 2: The JHU CSSE updates two time series on a daily basis. One tracks confirmed cases and the other deaths, both starting January 22, 2020. We use the confirmed cases and deaths to form training data for deep learning.
We notice that the JHU CSSE dataset exhibits an almost exact 7-day periodicity (±1 day), indicating that a majority of the reporting agencies in the country update their statistics on a fixed weekly schedule. As a countermeasure, we apply a 7-day moving average to the dataset to smooth out this “unnatural” data seasonality.
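A 7-day moving average of this kind can be sketched in a few lines of Python. The text does not say whether the window is trailing or centered; the sketch below assumes a trailing window and averages over the available points at the start of the series. The function name `moving_average` and the sample counts are hypothetical.

```python
def moving_average(values, window=7):
    """Trailing moving average.

    Assumption: the first (window - 1) points are averaged over the data
    seen so far, so the output has the same length as the input.
    """
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the observation that just left the window.
            running_sum -= values[i - window]
        count = min(i + 1, window)
        smoothed.append(running_sum / count)
    return smoothed


# Hypothetical daily counts with a weekly reporting dip (three identical weeks).
daily_new_cases = [100, 120, 130, 125, 110, 60, 50] * 3
smoothed_cases = moving_average(daily_new_cases, window=7)
print(smoothed_cases)
```

Because every full 7-day window spans exactly one reporting cycle, the weekly pattern cancels out once the window is full, which is the effect the smoothing step relies on.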
Methodology
We then conduct the following step-by-step operations to model the US epidemic:
1. We construct an in-sample SIRJD time series starting April 12, 2020 with Dataset 1.
2. We use the in-sample SIRJD time series constructed in Step 1 to derive an in-sample time series for the two most critical daily transmission parameters (β and γR).
3. We construct a confirmed/dead-case time series starting from January 22, 2020 (in-sample) with Dataset 2.
4. We apply two deep learning approaches, the standard DNN (Deep Neural Networks) and the advanced RNN-LSTM (Recurrent Neural Networks – Long Short-Term Memory), to fit the confirmed/dead in-sample time series from Step 3, and predict the further development of confirmed/dead cases for 35 and 42 days (out-of-sample).
5. We use the confirmed/dead in-sample time series from Step 3 as training data and the in-sample β and γR time series from Step 2 as training labels, and apply the DNN and RNN-LSTM techniques to predict β and γR for 35 and 42 days (out-of-sample).
6. Finally, we use the predicted (out-of-sample) transmission parameters (β and γR) from Step 5 to simulate 35-day and 42-day progressions (out-of-sample) of the SIRJD model (particularly the SIR portion) in a recursive manner, starting with the data point of the last timestep of the in-sample SIRJD time series from Step 1. Equations (3, 4) and then Equations (1, 2) are used, in that order, for the calculations at each step.
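The pairing of training data and training labels described in Step 5 can be sketched as a sliding-window construction. This is only an illustration under stated assumptions: the two series are taken to be aligned over their overlapping dates, each sample is a fixed-length window of (confirmed, dead) values, and its label is the (β, γR) pair on the window's last day. The function name `make_windows`, the `lookback` length, and the numbers are hypothetical, and the network architecture itself is not shown.

```python
def make_windows(features, labels, lookback):
    """Build supervised samples from two aligned daily series.

    features: list of [confirmed, dead] per day
    labels:   list of [beta, gamma_r] per day (same length, same dates)
    Returns (inputs, targets), where inputs[k] holds `lookback` consecutive
    days of features and targets[k] is the label on the window's last day.
    """
    if len(features) != len(labels):
        raise ValueError("features and labels must cover the same days")
    inputs, targets = [], []
    for end in range(lookback, len(features) + 1):
        inputs.append(features[end - lookback:end])
        targets.append(labels[end - 1])
    return inputs, targets


# Hypothetical five-day aligned series.
features = [[100, 1], [150, 2], [210, 4], [260, 5], [300, 7]]
labels = [[0.50, 0.01], [0.40, 0.02], [0.24, 0.02], [0.15, 0.03], [0.10, 0.03]]

inputs, targets = make_windows(features, labels, lookback=3)
print(len(inputs), inputs[0], targets[0])
```

The resulting `inputs` and `targets` lists could then be fed to either a dense network (after flattening each window) or a recurrent one (keeping the time dimension).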
Fig. 2 shows a flowchart of the datasets and methodology.
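The recursive simulation in the final step can be sketched as follows. This is a simplified illustration, not the paper's Equations (1) and (2): it covers only the S, I and R compartments, ignores the Isolated and Deceased compartments and the net influx, and uses a standard discrete SIR update with one (β, γR) pair per day. The function name `simulate_sir` and all numbers are hypothetical.

```python
def simulate_sir(s_init, i_init, r_init, population, betas, gammas):
    """Roll an S-I-R state forward one day per (beta, gamma_R) pair.

    Assumed update (simplified, not the paper's full SIRJD equations):
        new infections = beta  * I * S / N
        new recoveries = gamma * I
    """
    s, i, r = float(s_init), float(i_init), float(r_init)
    path = [(s, i, r)]
    for beta, gamma in zip(betas, gammas):
        new_infections = beta * i * s / population
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        path.append((s, i, r))
    return path


# Hypothetical starting state and predicted parameters for three days.
predicted_betas = [0.3, 0.3, 0.1]
predicted_gammas = [0.1, 0.1, 0.2]
path = simulate_sir(990, 10, 0, 1000, predicted_betas, predicted_gammas)
for day, (s, i, r) in enumerate(path):
    print(day, round(s, 2), round(i, 2), round(r, 2))
```

Each day's output state is the next day's input, which is what makes the forecast recursive: errors in the predicted parameters accumulate over the forecast horizon.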
Results
The results are illustrated in Figs. 3–6 (35-day forecast) and Figs. 7–10 (42-day forecast).
In Fig. 3 (35-day forecast), the DNN method predicts that on June 18, 2020, the “Infected-to-Recovered” transmission parameter (γR) will rise above and stay above the “Susceptible-to-Infected” transmission parameter (β). This means that the value of the basic reproduction number (R0) will become less than one and that the spread of COVID-19 in the US will effectively start to die out from that day. In Fig. 4 (35-day forecast), the RNN-LSTM method gives a slightly less aggressive prediction that γR will overtake β on June 19, 2020. Thus, with the 35-day forecast, we predict that the tide of the US epidemic will turn around June 18–19, 2020.
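The turnaround criterion (γR rising above β and staying there) can be located programmatically. The sketch below assumes that the reproduction number is approximated by the ratio β/γR, which is implied by the comparison in the text but is not stated there as a formula. The function name `first_sustained_crossover` and the sample values are hypothetical.

```python
def first_sustained_crossover(betas, gammas):
    """Return the first index from which gamma > beta on every remaining day.

    Returns None if gamma is not above beta on the final day.
    Assumption: the reproduction number is approximated by beta / gamma,
    so gamma > beta corresponds to a ratio below one.
    """
    crossover = None
    # Scan backwards so that a temporary crossing followed by a reversal
    # is not mistaken for the sustained turnaround.
    for day in range(len(betas) - 1, -1, -1):
        if gammas[day] > betas[day]:
            crossover = day
        else:
            break
    return crossover


# Hypothetical forecast: gamma briefly exceeds beta on day 2, then for good from day 4.
forecast_betas = [0.050, 0.040, 0.030, 0.035, 0.020, 0.010]
forecast_gammas = [0.020, 0.030, 0.031, 0.030, 0.030, 0.030]

turn_day = first_sustained_crossover(forecast_betas, forecast_gammas)
print(turn_day)
```

Requiring the crossing to be sustained matters because a smoothed parameter forecast can cross briefly and then revert, as in the hypothetical day 2 above.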
In Fig. 5 (35-day forecast), the DNN method predicts that the US “Infected” population will peak on June 17, 2020 at 1,357,705 individual cases. In Fig. 6 (35-day forecast), the RNN-LSTM method predicts that the US “Infected” population will peak on June 18, 2020 at 1,363,876 individual cases. Both methods predict that the number of cumulative confirmed cases will cross the 2 million mark on June 11, 2020 (2,008,638 by DNN and 2,008,832 by RNN-LSTM).
In Fig. 7 (42-day forecast), the DNN method gives a slightly more aggressive prediction that γR will overtake β on June 17, 2020. In Fig. 8 (42-day forecast), by contrast, the RNN-LSTM method gives a more conservative prediction that R0 will not become less than one until June 24, 2020. Comparing Figs. 7–8 with Figs. 3–4, we observe that the 42-day forecast gives a wider range (June 17 to 24, 2020) than the 35-day forecast for when the turnaround will occur. The reason for the wider range is that, for the same (in-sample) training data size, a longer forecast horizon produces a wider probability distribution.
Also in Fig. 7 (42-day forecast), the DNN method predicts that β will drop to virtually 0% around July 8, 2020, and that the number of cumulative confirmed cases will stabilize at 2·23–2·24 million.
Discussion
We apply DNN and RNN-LSTM techniques to estimate the stochastic transmission parameters for an SIRJD model with a discrete time series construct. We then use the predicted transmission parameters to forecast the further development of the US COVID-19 epidemic.
We make use of two US COVID-19 datasets from the JHU CSSE data repository. The first dataset includes detailed daily records (confirmed, active, dead, recovered, hospitalized, etc.) starting from April 12, 2020, from which we construct the SIRJD model. The second dataset includes time series tracking confirmed cases and deaths starting from January 22, 2020, which we use to construct training data for deep learning. The JHU CSSE data exhibit an almost exact 7-day periodicity (±1 day) that masks the true epidemic dynamics; thus we apply a 7-day moving average to the dataset to smooth out this data seasonality.
We then apply DNN and RNN-LSTM deep learning techniques to fit the confirmed/dead series in order to predict the further development of confirmed/dead cases, as well as to predict the “Susceptible-to-Infected” and “Infected-to-Recovered” transmission parameters (β and γR) for 35 and 42 days. Finally, we use the predicted transmission parameters (β and γR) to simulate the epidemic progression for 35 and 42 days.
The DNN and RNN-LSTM implementations of the enhanced SIRJD model are consistent in forecasting the US COVID-19 epidemic dynamics. For the 35-day forecast, the DNN and RNN-LSTM implementations predict that the basic reproduction number (R0) will become less than one around June 18–19, 2020, and for the 42-day forecast around June 17–24, 2020, at which point the spread of the coronavirus will effectively start to die out. Both methods predict that the number of cumulative confirmed cases will cross the 2 million mark on June 11, 2020 (2,008,638 by DNN and 2,008,832 by RNN-LSTM). For the 35-day forecast, the DNN and RNN-LSTM implementations predict that the US “Infected” population will peak on June 17–18, 2020 at 1,357,705 and 1,363,876 individual cases, respectively.
The DNN implementation for the 42-day forecast projects that the “Susceptible-to-Infected” transmission parameter β will drop to virtually zero around July 8, 2020, which implies that the total numbers of confirmed cases and deaths will likely stabilize (confirmed cases forecast at 2·23–2·24 million).
With the introduction of the deep learning-enhanced compartmental model, we provide an effective and easy-to-implement alternative to the prevailing stochastic parameterization, which estimates transmission parameters through either likelihood maximization or Markov Chain Monte Carlo simulation. The effectiveness of the prevailing approach depends on detailed statistics on transmission characteristics among heterogeneous subpopulations, and such statistics are costly to collect in both money and resources. Deep learning techniques, on the other hand, uncover hidden interconnections among seemingly unrelated data, reducing the prediction’s dependency on highly specific data. Future research on the use of deep learning in epidemic modeling can further enhance its forecasting power.
Data Availability
Up-to-date time series data are available upon request.
Supplementary Materials
Dataset_1 time series (constructed): covid_us.csv
Dataset_2 time series (constructed): covid_us_ts.csv
Acknowledgements
The authors thank Ms. Liu Chang and Mr. Liu Shuigeng of Cofintelligence Fintech Co, Ltd. (Hong Kong and Shanghai) for data collection and formatting. The authors declare no competing interests.
Footnotes
① The JHU CSSE GitHub COVID-19 data repository is at https://github.com/CSSEGISandData/COVID-19.