Abstract
Confirmed cases of novel coronavirus disease (COVID-19) have been reported in the United States since late January 2020. As of May 19, 2020, there were over 4.8 million confirmed cases and about 320,000 deaths worldwide. We examined the characteristics of the confirmed cases and deaths of COVID-19 in all affected counties of the United States. We proposed a COVID-Net that combines the architectures of Long Short Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, using the trajectories of COVID-19 during different periods until May 19, 2020 as the training data. The COVID-Net was validated by predicting the numbers of confirmed cases and deaths in the subsequent 3-day, 5-day, and 7-day periods. The COVID-Net produced relatively small Mean Relative Errors (MREs) for the 10 counties with the most severe epidemics as of May 19, 2020. On average, MREs were 0.01 for the number of confirmed cases in all validation periods, and 0.01, 0.01, and 0.03 for the number of deaths in the 3-day, 5-day, and 7-day periods, respectively. The COVID-Net incorporated five risk factors of COVID-19 and was used to predict the trajectories of COVID-19 in Hudson County, New Jersey and New York County, New York until June 28, 2020. The risk factors were the percentage of the population with access to exercise opportunities, average daily PM2.5, population size, preventable hospitalization rate, and violent crime rate. The expected numbers of cumulative confirmed cases and deaths depend on the dynamics of these five risk factors.
Significance Statement
A COVID-Net model was built to predict the trajectories of COVID-19 in metropolitan areas, based on the percentage of the population with access to exercise opportunities, average daily PM2.5, population size, preventable hospitalization rate, and violent crime rate. Increasing awareness of how these risk factors affect the pandemic helps policymakers develop plans that mitigate the spread of COVID-19.
1. INTRODUCTION
According to The New York Times (2020), the earliest confirmed cases of COVID-19 in the United States were reported on January 21, 2020. In March, the outbreak of COVID-19 was declared a pandemic by the World Health Organization (2020). Since then, the United States has had the largest number of both confirmed cases and deaths in the world (National Health Commission of the People’s Republic of China, 2020), with 1,536,447 confirmed cases and 91,936 deaths as of May 19, 2020.
A vast majority of states in the United States have issued stay-at-home orders since March 2020 to reduce the transmission of COVID-19 (Governor’s Press Office, 2020). As the states are in the process of reopening, it is important to predict the trajectories of COVID-19 based on the risk factors to provide decision-makers with a quantitative and dynamic assessment.
Besides personal factors, county-level factors are known to affect the spread of infectious diseases. For instance, the hospitalization rate of H1N1 2009 had a disproportionate impact on high-poverty areas in New York City (Balter et al., 2010) and on the small population of racial/ethnic groups in Wisconsin (Truelove et al., 2011). Consequently, we considered the data from county health ranking and roadmaps programs (A Robert Wood Johnson Foundation program, 2020). The details about the database are available from “https://www.countyhealthrankings.org/reports/county-health-rankings-reports.” Specifically, we used five risk factors related to diet and exercise, air quality, population size, quality of care and community safety (See Materials and Methods section). Using deep learning, we included data from all counties in the United States to build a prediction model and selected the 10 most severe counties to validate the model. Then, we predicted future trajectories of COVID-19 in Hudson County, New Jersey and New York County, New York.
2. MATERIALS AND METHODS
2.1 Data Sources
We collected the daily numbers of cumulative confirmed cases and deaths from January 21 to May 19, 2020 for counties in the United States from the New York Times (New York Times, 2020). The daily cumulative confirmed cases and deaths were collected from health departments and the U.S. Centers for Disease Control and Prevention (CDC), where patients were identified as “confirmed” based on positive laboratory tests, clinical symptoms, and exposure (New York Times, 2020). The five risk factors were compiled from the 2020 annual data on the County Health Rankings and Roadmaps program official website (A Robert Wood Johnson Foundation program, 2020). The five risk factors, namely the percentage of the population with access to exercise opportunities, average daily PM2.5, population size, preventable hospitalization rate, and violent crime rate, were listed in Table 1. Data analysis was conducted in Python 3.7 with TensorFlow-GPU 1.14.0 and Keras 2.3.0.
2.2 Statistical Analysis
We combined the architectures of LSTM and GRU (Hochreiter and Schmidhuber, 1997; Chung et al., 2014; Bandyopadhyay and Dutta, 2020), using the five risk factors as auxiliary inputs. The COVID-Net was shown in Fig. 1. During the training process, the observed cumulative confirmed cases and deaths over the preceding seven days in each county of the United States were used to predict the cumulative confirmed cases and deaths on the present day. All risk factors involved in the model were min-max normalized before use.
Our modeling was divided into two parts. The first part was to learn the observed patterns of COVID-19 and then to validate the learned patterns, where the accuracy of the models was evaluated by Mean Relative Errors (MREt) over a t-day validation period.
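The input preparation described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the helper names `make_windows` and `min_max_normalize` are hypothetical, the counts and PM2.5 values are toy numbers, and normalizing each risk factor across counties is an assumption, since the text does not say over which set the minimum and maximum were taken.

```python
from typing import List, Sequence, Tuple

WINDOW = 7  # the text uses the preceding seven days as input


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Rescale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def make_windows(
    series: Sequence[float], window: int = WINDOW
) -> List[Tuple[List[float], float]]:
    """Pair each run of `window` consecutive days with the next day's count."""
    return [
        (list(series[i:i + window]), series[i + window])
        for i in range(len(series) - window)
    ]


# Toy cumulative counts for one hypothetical county (not real data).
toy_cases = [0, 1, 1, 2, 4, 7, 11, 16, 22, 30]
samples = make_windows(toy_cases)

# Toy values of one risk factor across four hypothetical counties
# (assumption: normalization is done across counties).
toy_pm25 = [6.0, 8.0, 10.0, 14.0]
scaled_pm25 = min_max_normalize(toy_pm25)
```

Each element of `samples` is a (seven-day input, next-day target) pair of the kind a recurrent network would consume, and the normalized risk factors would be supplied alongside it as the auxiliary input.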
In the first part, we conducted three experiments:
(a) t=3: the training data were from January 21 to May 16, 2020, and the validation data from May 17 to May 19, 2020;
(b) t=5: the training data were from January 21 to May 14, 2020, and the validation data from May 15 to May 19, 2020;
(c) t=7: the training data were from January 21 to May 12, 2020, and the validation data from May 13 to May 19, 2020.
The second part covered the period after the training data until June 28, 2020. We simulated the epidemic under different levels of the five covariates in the near future.
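The three experiments differ only in where the training period ends. As a small sketch, the date boundaries in (a) to (c) can be derived from t with the standard `datetime` module; the function name `split_dates` is hypothetical and introduced here only for illustration.

```python
from datetime import date, timedelta

FIRST_DAY = date(2020, 1, 21)  # start of the collected data
LAST_DAY = date(2020, 5, 19)   # end of the collected data


def split_dates(t: int):
    """Return (train_start, train_end, val_start, val_end) for a t-day validation window."""
    val_start = LAST_DAY - timedelta(days=t - 1)
    train_end = val_start - timedelta(days=1)
    return FIRST_DAY, train_end, val_start, LAST_DAY


# One split per experiment: t = 3, 5, and 7 validation days.
splits = {t: split_dates(t) for t in (3, 5, 7)}
```

For t=3 this gives a training period ending on May 16 and a validation period from May 17 to May 19, matching experiment (a).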
3. RESULTS
3.1 Model validation
Our COVID-Net was built by using cumulative counts of confirmed cases and deaths in three overlapping periods, from January 21, 2020 to May 16, May 14, and May 12, 2020, as the training data. The mean relative errors (MREs) between the observed and projected counts from the day after the training period to May 19, 2020 were computed to assess the accuracy of the prediction. Depending on the training period, we obtained 3-day, 5-day, and 7-day MREs for the 10 most severely affected counties. Table 2 presented both individual and average MREs for those 10 counties. New York County, New York had the smallest MREs for the 3-day prediction of the confirmed cases and the 5-day prediction of deaths. Also, Westchester County, New York had the smallest MREs for the 5-day and 7-day predictions of the confirmed cases. Nassau County, New York had the smallest MREs in the prediction of deaths for the other periods. Overall, all the averaged MREs were relatively small, supporting the accuracy of our COVID-Net model in predicting future trajectories of COVID-19 for the numbers of both confirmed cases and deaths. Additional data were presented in Figs. S1 and S2 of the Supporting Information (SI) Appendix.
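The text does not give the formula for the MRE. A common definition, assumed here, is the average over the t validation days of |predicted - observed| / observed. The sketch below implements that assumed definition; the function name `mean_relative_error` and the toy counts are hypothetical and are not values from Table 2.

```python
from typing import Sequence


def mean_relative_error(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Average of |predicted - observed| / |observed| over the validation days.

    Assumed definition; the paper does not state its exact formula.
    """
    if not observed or len(observed) != len(predicted):
        raise ValueError("observed and predicted must be non-empty and equal in length")
    if any(o == 0 for o in observed):
        raise ValueError("relative error is undefined when an observed count is zero")
    total = sum(abs(p - o) / abs(o) for o, p in zip(observed, predicted))
    return total / len(observed)


# Toy 3-day example (not real data): every prediction is off by 2%.
toy_observed = [100.0, 110.0, 120.0]
toy_predicted = [102.0, 107.8, 122.4]
mre_3 = mean_relative_error(toy_observed, toy_predicted)
```

Because cumulative counts in the most affected counties are large and change slowly from day to day, a relative error of this kind stays small even when the absolute error is in the hundreds.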
3.2 Prediction of future trajectories of COVID-19
Our COVID-Net model incorporated five risk factors to predict future trajectories of COVID-19 in both cumulative confirmed cases and deaths from May 20 to June 28, 2020. The results for the 10 most severe counties were shown in Fig. S3 in the SI Appendix. The weights of the five risk factors in the construction of the COVID-Net models were presented in Table S1 in the SI Appendix.
To offer insights into the prediction dynamics of the COVID-Net, we varied the levels of the five risk factors; the resulting trajectories of COVID-19 for New York County, New York were shown in Fig. 2. The impact of the five risk factors on both cumulative confirmed cases and deaths was visible and depended on the weights of the risk factors (Table S1 in the SI Appendix). Specifically, increases in the percentage of the population with access to exercise opportunities, average daily PM2.5, population size, preventable hospitalization rate, and violent crime rate would increase both the cumulative confirmed cases and deaths. For example, if the percentage of the population with access to exercise opportunities was set to 4 times versus 0.5 times its level on May 20, 2020, the cumulative confirmed cases on June 28, 2020 would reach 225,000 versus 204,000. If the violent crime rate was set to 4 times versus 0.5 times its level on May 20, 2020, the cumulative confirmed cases would reach 210,000 versus 206,000. Overall, the numbers of cumulative confirmed cases and deaths were projected to increase slowly in the coming month in New York County, New York. The results were similar for Hudson County, New Jersey, as shown in Fig. S4 in the SI Appendix. Nonetheless, the trajectories were expected to show sustained growth in the coming month.
[Figure caption fragment] ... levels of the five risk factors were changed from 0.5 times to 4 times of those on May 20, 2020. The trajectories in the training period were observed.
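The scenario analysis above can be organized as a grid of risk-factor levels that is then fed to the trained model. The sketch below only builds such a grid. It assumes that one factor is varied at a time while the others stay at their May 20 levels, which the text suggests but does not state; the factor keys, the function name `one_at_a_time_scenarios`, and the intermediate multipliers 1.0 and 2.0 are hypothetical. Levels are expressed relative to May 20, 2020, so the baseline is 1.0 for every factor.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical short names for the five risk factors.
FACTORS = [
    "exercise_access_pct",
    "pm25",
    "population",
    "preventable_hosp_rate",
    "violent_crime_rate",
]

# 0.5 and 4.0 are the endpoints given in the text; 1.0 and 2.0 are illustrative.
MULTIPLIERS = [0.5, 1.0, 2.0, 4.0]

Scenario = Tuple[str, float, Dict[str, float]]


def one_at_a_time_scenarios(
    baseline: Dict[str, float], multipliers: Sequence[float]
) -> List[Scenario]:
    """Scale one factor per scenario, leaving the others at baseline."""
    scenarios: List[Scenario] = []
    for name in baseline:
        for m in multipliers:
            levels = dict(baseline)
            levels[name] = baseline[name] * m
            scenarios.append((name, m, levels))
    return scenarios


# Baseline in relative units: every factor at its May 20, 2020 level.
baseline = {name: 1.0 for name in FACTORS}
scenarios = one_at_a_time_scenarios(baseline, MULTIPLIERS)
```

Each scenario's levels would still need to pass through the same min-max normalization used in training before being given to the model as auxiliary inputs.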
4. DISCUSSION
Our COVID-Net was built by deep learning and was shown to effectively model and predict the cumulative confirmed cases and deaths in the counties of the United States. The risk factors used in the COVID-Net provided visible evidence on actionable steps that could influence the trajectories of COVID-19. Thus, COVID-Net combined the advantages of deep learning with the interpretability of risk factors.
In our study, we found that the higher the percentage of the population with access to locations for physical activity (e.g., indoor activities in the gym and outdoor activities in the squares), the higher the risk of COVID-19 spread, likely because such access negated the effect of social distancing (McCloskey et al., 2020; Joscha Weber, 2020), which was identified to reduce transmission (Anderson et al., 2020; Greenstone and Nigam, 2020).
Several studies indicated that air pollution could influence the COVID-19 spread (Contini and Costabile, 2020; Martelletti and Martelletti, 2020; Conticini et al., 2020). We focused on the county-level average daily density (µg/m3) of fine particulate matter (PM2.5) because high levels of average daily PM2.5 were found to be associated with increases in cumulative confirmed cases and deaths (Wu et al., 2020). Exposure to high levels of PM2.5 led to systemic inflammation (Conticini et al., 2020; Yang et al., 2019) and chronic respiratory conditions (Cao et al., 2020). In the end, a weakened immune system may result in acute respiratory distress syndrome (ARDS) and even death.
High population density was a known driver of infectious disease transmission (Rocklov and Sjodin, 2020). Metropolitan areas generally had larger populations, higher densities, and higher numbers of cumulative confirmed cases and deaths. For example, New York County, with an estimated population of 1,628,706 as of July 1, 2019 (United States Census Bureau, 2019) and under 23 square miles of area (GIS Geography, 2020), had 198,710 confirmed cases and 20,376 deaths as of May 19, 2020 (New York Times, 2020).
The more Medicare enrollees who were hospitalized for ambulatory-care sensitive conditions, the more enrollees were likely to be in poor health, leaving them vulnerable to novel diseases (Hatch et al., 2018) and to infection in hospitals. Also, the more enrollees in hospitals, the more healthcare providers were required, increasing the pressure on health care workers and putting more people at risk. The higher violent crime rate may not be directly related to COVID-19, but it could be a surrogate variable for community safety. An unsafe community environment might cause psychological distress (Wang et al., 2020; Ho et al., 2019) and trigger mental illness (Xiang et al., 2020), especially under the “stay-at-home” order. It was reasonable to expect that the change in the violent crime rate would be relatively small in the near future. Therefore, other factors were expected to make greater differences in the COVID-19 trajectories.
There might be other factors that we could consider in building the COVID-Net. We chose to use the five risk factors because of the documented evidence in the literature, the interpretability of the results, and the accuracy of the prediction. Our goal was not to identify risk factors, but to build an accurate and interpretable prediction tool for the outbreak of COVID-19. Because COVID-Net used only five risk factors, the results were easier for decision-makers to interpret and act on, and with its relatively small MREs, the room for further prediction improvement was expected to be too small to make a practical difference.
Data Availability
We collected the number of cumulative confirmed cases and total deaths from January 21 to May 19, 2020 for counties in the United States from the New York Times, based on reports from state and local health agencies. The county health rankings reports from year 2020 were compiled from the County Health Rankings and Roadmaps program official website.
ACKNOWLEDGMENTS
We would like to thank all individuals who collected epidemiological data on the COVID-19 outbreak and those who compiled the data in the County Health Rankings and Roadmaps program.