ABSTRACT
The novel coronavirus disease (COVID-19) pandemic is a global threat presenting health, economic and social challenges that continue to escalate. Meta-population epidemic modeling studies in the susceptible-exposed-infectious-removed (SEIR) style have played important roles in informing public health and shaping policy making to mitigate the spread of COVID-19. These models typically rely on a key assumption on the homogeneity of the population. This assumption certainly cannot be expected to hold true in real situations; various geographic, socioeconomic and cultural environments affect the behaviors that drive the spread of COVID-19 in different communities. What’s more, variation of intra-county environments creates spatial heterogeneity of transmission in different sub-regions. To address this issue, we develop a new human mobility flow-augmented stochastic SEIR-style epidemic modeling framework with the ability to distinguish different regions and their corresponding behavior. This new modeling framework is then combined with data assimilation and machine learning techniques to reconstruct the historical growth trajectories of COVID-19 confirmed cases in two counties in Wisconsin. The associations between the spread of COVID-19 and human mobility, business foot-traffic, race & ethnicity, and age-group are then investigated. The results reveal that in a college town (Dane County) the most important heterogeneity is spatial, while in a large city area (Milwaukee County) ethnic heterogeneity becomes more apparent. Scenario studies further indicate a strong response of the spread rate on various reopening policies, which suggests that policymakers may need to take these heterogeneities into account very carefully when designing policies for mitigating the spread of COVID-19 and reopening.
1 Introduction
The novel coronavirus disease (COVID-19) pandemic is a global threat presenting health, economic and social challenges that continue to escalate. As of September 30, 2020, the CDC had reported 7,168,077 total confirmed cases and 205,372 total deaths in the U.S.1, and community transmission of COVID-19 remains widespread in many parts of the world. In the absence of vaccines or pharmacologic agents to reduce the transmission of the severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) that causes COVID-19, it is essential to understand the effects of non-pharmacological epidemic control and intervention measures. These include, but are not limited to, social (physical) distancing, travel restrictions, closures of schools and nonessential business services, mandated face coverings, testing, isolation, contact tracing and timely quarantine on delaying the COVID-19 spread, all of which have been intensely investigated2–8. In the end, the same combination of interventions will have different effects on the overall progress of the epidemic in different contexts, and the dependence of outcomes on local conditions is typically complex. It is thus essential to develop good models which can make principled predictions about the effect of non-pharmacological interventions (NPIs) on the spread of COVID-19 under various combinations of local circumstances. For example, during the early outbreak in China, scholars developed a model using observed data about human mobility and COVID-19 infection and found that early detection and isolation of confirmed cases could prevent more infections than travel restrictions and contact reductions, but that combined NPIs achieved the strongest and most rapid effect4. Another study using laboratory-confirmed cases demonstrated the effects of various types of public health NPIs (such as centralized quarantine and isolation) on reducing the transmission of the SARS-CoV-2) in Wuhan, China2. In Italy, scholars reconstructed the COVID-19 spatial spread dynamics and investigated the effects of population-wide interventions9,10.
The large body of modeling work on COVID can be grouped into four broad categories: mechanistic models2,4,6,8–13, agent-based models14,15, generalized linear regression models16,17, and machine learning models18,19. Both mechanistic and agent-based models typically rely on susceptible-infectious-removed (SIR) or susceptible-exposed-infectious-removed (SEIR) compartmental-style models in epidemiology, which separates the population into four compartments (susceptible, exposed, infectious, recovered or removed) and models the dynamics of transitions between the compartments. In the U.S., the Institute for Health Metrics and Evaluation (IHME) at the University of Washington also employed a SEIR-style framework to project hospital bed-days, ICU-days, ventilator days and deaths, and to model possible trajectories of SARS-CoV-2 infections and the impact of NPIs at the state level in the U.S11,12. Regarding the time-varying state-specific control measures, a temporal extended susceptible-antibody-infectious-removed (eSAIR) model was developed for the projection of county-level COVID-19 prevalence20. Under the agent-based modeling framework, a recent study modeled the impact of testing, contact tracing and household quarantine on the second waves of COVID-19 infection in the Boston metropolitan area14.
However, a straightforward SEIR analysis, useful as it may be in revealing and informing public health and shaping policy making, relies on a key assumption of homogeneity of the population. Each compartment is treated as an aggregate of indistinguishable individuals. This is not realistic. A recent study has shown evidence of spatial heterogeneity in COVID-19 transmissions within cities of Seattle and Washington, DC21. Socioeconomic and cultural environments, which differ within small geographic communities, affect the behaviors that drive the spread of COVID-19. What’s more, geographic and transportation factors create spatial heterogeneity of infection spread in different sub-regions. Accurate and fine-grained modeling requires a more refined classification of the population.
Examining the spatial heterogeneity of infection spread is particularly important for modeling COVID-19 due to its complex dependence on social factors. COVID-19 presents a highly difficult policy issue because people need to weigh various trade-offs when they make health behavior decisions, trade-offs which may involve political ideologies, socioeconomic status and lifestyles22. Extensive research from social science has demonstrated that people’s assessments of these trade-offs and their resulting health and risk behavior are affected by a variety of factors, such as gender, race, political ideology, economic status, culture, and religion23–27. Despite a scientific consensus on the effectiveness of social distancing, precautionary actions like facial masking and social distancing are not universally adhered to,28, and adherence to these measures is by no means uniform across populations. Thus, modeling the spread of COVID-19 requires scholars to attend to the heterogeneity in people from different backgrounds, with different value systems, and in different locales. The place (i.e., the type of neighborhoods and communities) people live in is a reflection of race, socioeconomic status, and age29,30. Segregation along these lines is not just an aggregate of individual choices, but a consequence of public polices31. Emerging evidence has shown disparities in COVID-19 outcomes between social groups. For instance, scholars found that income, along with race and other socioeconomic factors, has become a major predictor of COVID-19 infections and deaths32,33. A team of international scholars has addressed the importance of using social and behavioral sciences to support COVID-19 pandemic responses34, such as the study of social inequalities and political polarization in response to social distancing and stay-at-home mandates35–37.
In this paper, we aim to capture the effect of social and geographic heterogeneity within a population, and to reveal multiple aspects of the influences from different cultural environments, by building models with finer spatial resolution within a small geographic region (in this case, a single county.) To this end, we have developed a new human mobility flow-augmented stochastic SEIR-style epidemic modeling framework that is able to distinguish different regions and their corresponding behavior. This new modeling framework is then combined with data assimilation and machine learning to reconstruct the historical growth trajectories of COVID-19 confirmed cases. We then investigate the associations between the spread of COVID-19 and human mobility, business foot-traffic, race, and age group. Finally, we perform scenario studies in order to understand the social and policy implications as well as to facilitate the development of guidelines to inform future policy-making in public health crisis. The two geographic regions we consider are Dane County, WI and Milwaukee County, WI, which contain the two largest cities in Wisconsin — respectively, Madison (a college town and home of the state government) and Milwaukee (a large city subject to severe racial and ethnic segregation, by some measures the most extreme in the United States38,39.)
2 Data
The official testing results of COVID-19 confirmed cases between March 11, 2020 and August 12, 2020 were obtained from the local COVID-19 Dashboards, created by the Public Health Offices of City of Madison & Dane County and Milwaukee County40,41. The census-tract level geographic boundaries with demographics and socioeconomic attributes were obtained from the U.S. Census Bureau42,43. We collected over 3.6 million points of interest (POIs) with aggregated and de-identified human travel patterns in the U.S. from SafeGraph, which data was further spatially filtered by our study area. SafeGraph’s data is demographically distributed in a way that matches the proportion of people in various population subgroups in the United States Census populations1. These mobile location data consist of “pings” identifying the coordinates of a smartphone at a moment in time. To enhance privacy, SafeGraph excludes census block group (CBG) information if fewer than five devices visited a place in a month from a given CBG. For each POI, the records of aggregated visitor patterns record the number of unique visitors and the number of total visits to each venue during a specified time window (i.e., hourly, weekly, and monthly); this allows us to estimate the foot-traffic of each venue and the origin-to-destination (O-D) spatial interaction flow patterns during the the study period44. We further aggregate the O-D flow matrices to the census-tract level to match the COVID-19 testing data.
3 Mathematical model and configuration
In this research we develop a new human mobility flow-augmented stochastic SEIR model to simulate and analyze the deep connection between the fast spread of COVID-19 and individual mobility, social activities, government orders and population distribution, using Milwaukee County and Dane County, the two most populated counties in the state of Wisconsin as examples. The classical susceptible-exposed-infectious-removed (SEIR) model divides the population into four compartments, S (susceptible), E (latent), I (reported infections) and R (removed from I) and uses an ordinary differential equation (ODE) system to describe the flow between the compartments. The model assumes the homogeneity within the population, but this assumption is typically not valid. Our new approach is to investigate the compartment model at a finer scale, to reveal the differences and connections between sub-communities within one county, and to recover the dynamics of the population for different sub-regions.
In our new approach, human mobility flow data is incorporated into machine learning clustering algorithms first to divide a county into several regions that present diverse social, cultural, and economical background. We then assume the homogeneity within each smaller region. Human mobility flow-augmented stochastic SEIR models are then built for each region respectively. The models are coupled together through mobility data that adds strong nonlinear interactions between regions into the model. Figure S1 demonstrates the flowchart of this model.
It is worthwhile to highlight the following key features in this new human mobility flow-augmented stochastic SEIR model. We use a simple but effective stochastic (Ornstein-Uhlenbeck) process to describe the dynamical evolution of key parameters in the model, e.g. the transmission rate. This it to contrast the use of a constant transmission rate as is done in the classical SEIR model, and a function of time without any structure as is done in most recent works45. Our approach not only reflects the fact that the parameters are not fixed constants, but rather adjust the values according to the changes of the environments, but also permits a cheap but effective online learning process of reconstruction. This reconstruction procedure will be discussed in section of parameter setting.
We now discuss the clustering process using human mobility data, and the mathematical model respectively.
3.1 Origin-to-Destination (O-D)-flow based spatial clusters
Each county to be studied will be divided into multiple smaller regions. We use the Walktrap network-based community detection method46 that utilizes short random walks to detect communities in a large graph. The travel O-D flow data of the first week of March (right before the widespread of COVID-19 in U.S.) is used as the graph weights for clustering. Denote N the number of census tracts (the smallest area unit) in the whole area. The algorithm starts with an initial division of the whole area into N regions, each containing only one census tract, and iterates till the best clustering is found. During each iteration, the distances between each pair of adjacent vertices are computed, upon which, by minimizing the mean of the squared distances between each vertex and its region, the distance between regions are calculated. The algorithm merges the adjacent regions so that upon merging, the smallest mean-squared distance is achieved. After N − 1 iterations, we acquire N different partitions and the one with smallest modularity is chosen as the product of the algorithm. The distance between vertex i and vertex j used in this algorithm is defined as where is the probability of a random walk to go from i to j in t steps and the degree d(i) is the number of neighbors of vertex i, including itself, and t is a user-chosen parameter. The Walktrap method utilizes the property that random walk tends to get “trapped” in more connected areas.
3.2 A new human mobility flow-augmented stochastic SEIR model
We denote different regions using sub-indices i, the new human mobility flow-augmented stochastic SEIR model on each region writes as: It is a coupled ODE-SDE (stochastic differential equation) system. Pi = Si + Ei + Ri denotes the free population that can move between regions. We assume the reported cases are quarantined, and thus removed from Pi. The parameters used in the model15 are explained below:
ni j: the daily traffic flow from region j to region i.
De: the duration of latent period. We set De = 5.
Dc: the duration of infection for critical cases. We set Dc = 10.
Dl: the duration of infection for non-critical cases. We set Dl = 6
cI: the proportion of critical cases among all reported cases. We assume cI = 0.1.
bi: the transmission rate in region i
: parameters in stochastic process for bi.
We use the following initial data (t = 0 at the start of the study period): where the unit for t is one day and Ni is the total population of region i.
The effective reproduction number is one of the key quantities that reflects the speed of the spreading of the disease. It measures, on average, how many new cases are directly generated by a single infection. In our model, it can be computed by: We note that the transmission rate b and ratio of susceptible S/P are different for each region, and thus each region has its own Re.
The feature that differentiates our approach from the classical SEIR model is our incorporation of population heterogeneity. This feature enables our new human mobility flow-augmented stochastic SEIR model to simulate and analyze the important differentiations between multiple sub-communities within one county, and the connection between spread of COVID-19 and local mobility and social activities within and among these subcommunities.
To be more specific: a traditional SEIR model divides the population into four compartments and uses an ordinary differential equation (ODE) system to describe the flow between the compartments. This treats the individuals within each compartment as indistinguishable, and thus misses important distinctions which are relevant to epidemic dynamics. By contrast, our model begins by using a machine learning algorithm to partition the county into clusters, using the observed human mobility flow data. Once these subcommunities have been generated, we build a local SEIR model for each region, which takes into account region-specific social, cultural, and economic factors; finally, we couple together all these local SEIR models using the data on intra-county mobility flow. The use of multiple non-identical subcommunities has a significant impact of the overall SEIR dynamics and its predictability.
3.3 An efficient data assimilation strategy for reconstructing infection trajectory
The application of data assimilation for the reconstruction of the remaining parameters in the model is another desirable feature of our approach. In particular, we apply the ensemble Kalman filter method47 to estimate the hidden parameters in the model, and to recover the trajectory of the spread of the disease. Such a model calibration procedure is completely data-driven, with no ad-hoc parameter tuning or adjustment. It is shown that the model with parameters estimated in this way can almost perfectly recover the observed time series data, which justifies the application of the data assimilation approach utilized here.
3.3.1 Ensemble Kalman Filter
To estimate parameters and state variables, we use the ensemble Kalman Filter47–49. It was derived from the classical Kalman Filter for the application in atmospheric science, with the analytical covariance matrix replaced by its ensemble version, eliminating the computation of the Kalman gain matrix and the Riccati Equation that are typically expensive when problems are in high dimensional space. In each iteration, it combines the data with prior information to generate a posterior distribution of the state variables and parameters. The idea is to sample a fixed number of particles on the state variable space according to the initial distribution, and move these particles around at every time step, to obtain the adjusted distribution. Let u be the augmented vector of state variables and parameters for each region: where k is the number of regions we get from clustering. The goal is to derive a distribution density function of u. At the beginning of each iteration, let Pn−1|n−1 be the probability density to start with. Then we perform an evolution step to get the probability density Pn|n−1 and a further analysis step to obtain Pn|n. This serves as the initial probability density for the next iteration.
Let 𝒢m,n(u) be the solution to the model (3.2) that runs from time tm to time tn with u being the initial condition. Since the available data is the cumulative confirmation cases, we set as our measuring operator, mathematically it is: where 𝕀k is an identity matrix of size k, and M maps a 5k dimensional vector to a k dimensional vector.
We assume that the infection data has a Gaussian noise from the true measuring operator: and thus the likelihood function is: where ∝ is the proportional sign. The evolution step and analysis step in each iteration can be then calculated:
− Evolution: Pn|n−1(u) = δ (u−𝒢n−1,n(u′))Pn−1|n−1(u′),
− Analysis: Pn|n(u) ∝ Pn|n−1(u)q(d|u),
We need to further normalize to have ∫Pn|n(u)u = 1.
Suppose we sample N particles, and we denote the j-th particle after the evolution step and analysis step at time tn by and respectively. In summary, the algorithm iteratively applies to the following evolution and analysis steps:
− Evolution:
where FE is the Forward-Euler discretization applied to the model (3.2).
− Analysis:
where is a k-length vector with each entry i.i.d. drawn from 𝒩 (0, σ2). dn is the k-length vector collecting the reported infected data on k regions on day n. To compute the covariance matrices, we set: Then are matrices of size 5k × k and k × k respectively.
3.3.2 Online learning of the transmission rate using an effective stochastic parameterization approach
One fundamental difference between our model and the classical SEIR model is that we embed stochasticity in the to-be-reconstructed parameters, the transmission rate, in particular. The last equation in (3.2) exploits a possibility of including the time variation in b. This is crucial in describing the transient behavior of the entire system. Since the exact governing law of the transmission rate is unknown and it is not a directly observed model variable, there is no need to describe the transmission rate in an exact fashion. Thus, an effective way of modeling the transmission rate is to adopt a simple stochastic parameterized form, which allows a cheap online learning process. To incorporate the stochastic parameterization in a simple and effect fashion, an Ornstein–Uhlenbeck (OU) process is adopted, which involves a Gaussian process as the prior ansatz of the transmission rate. The exact time evolution of the transmission rate, however, can be arbitrary. The parameters in the stochastic transmission rate equation is determined systematically using an expectation-maximization (EM) algorithm50, which is completely data-driven with no ad-hoc tuning or adjustment.
One starts from an arbitrary initial guess of the three parameters di, σi, and in the stochastic parameterized equation (the last equation in (3.2). In each EM iteration, one applies Ensemble Kalman Filter, resulting an estimated time series for each state variable and the transmission bi, and these parameters of OU process are then updated to fit the time series of bi by matching the mean, variance, and decorrelation time. The fitting strategy can be viewed as the maximization process, and at the end of iteration, the optimal configuration of the three parameters, up to a preset threshold error, is found. This configuration is finally used in the last stage of Ensemble Kalman Filter to acquire the estimation of state variables and transmission rates used in our study.
4 Results
4.1 Origin-to-destination flow based spatial clusters
The traditional SEIR model assumes homogeneity over its population. This assumption, however, is not valid in either of the two counties studied here. To have a finer scale study, we divide both counties into several smaller intra-county regions using the Walktrap network-based community detection method46. The travel O-D flow data of the first week of March (just before the initial widespread outbreak of COVID-19 in the U.S.) is used to generate the graph weights for clustering.
The spatial clustering results for the two counties under investigation are shown in Figure 1b and Figure 1d. Dane County is partitioned into 7 regions (the two white areas are lakes, Mendota and Monona). Each region has its unique cultural features. In this respect it is worth specificially singling out Region 7, the area containing downtown Madison and the University of Wisconsin-Madison campus. As one might expect, Region 7 has significantly higher population density than the others. Region 3 is the residential region adjacent to Region 7 and it also has relatively high population density. The data is seen in Table 1. Milwaukee County is partitioned into 6 regions. While the population density in each region is relatively uniform, the region is heavily segregated by race and ethnicity. The population density and the distribution of different race and ethnic backgrounds is listed in Table 2. Region 3 has a predominantly Black population, while Region 4 has a high concentration of Hispanic residents.
4.2 Human mobility flow-augmented stochastic SEIR model results
Dane County
We treat the meta-population as relatively homogeneous within each region. We then build a separate SEIR model for each region, and couple these models together using the inter-regional mobility flow traffic information. Each region then has its own effective reproduction number Re, a function of time representing the expected number of new cases directly generated by each existing case. The effective reproduction number is a critical indicator that quantifies the spread of disease, and its dependence on time clearly is related to the county’s own “stay-at-home” order and subsequent reopening procedure. In Dane County, Phase 1 reopening was enacted on May 26. During this period, all business were allowed to open with 25% limit capacity. Phase 2 reopening was enacted on June 15, and the capacity was increased to 50%. On July 2, the county, after seeing an unusual increase of infection, rolled back the Phase 2 reopening and the capacity limit was reduced to 25%. On July 13, the county imposed an indoor face mask requirement.
In Figure 2, we plot the 3-day average Re in each region and its corresponding interquartile range (IQR, the range that shows 25th to 75th percentiles). The dates of the reopening/rolling-back are also indicated in the plots. Region 7 (the downtown and university campus area) has the highest Re, spiking as high as 8 at the end of June, during the brief period of relaxed capacity limits for indoor businesses. By contrast, Region 3 (the adjoining residential area), despite also possessing a high population density, has a substantially lower Re, without a major late-June spike. In Figure 3a we plot the the traffic normalized effective reproduction number in Dane County. This measure controls for the fact that some regions have higher intra-regional traffic and communications than the others and are thus expected to have a higher Re. However, Region 7, even upon this normalization, still has a significantly higher Re than all other regions, suggesting that there are social characteristics beyond population mobility contributing to rapid spread there. In particular, the normalized Re has a peak in late June, approximately seven days after the county started Phase 2 reopening on June 15.
The mobile device visit tracking data available to use allows to investigate some potential explanations for the differences of Re between regions. In Table 3 we document the Pearson’s correlation of Re in each region with different categories of business service visits. Clearly, the increase of Re in Region 7 has a strong cross-correlation with the visits to “Drinking Places (Alcoholic Beverages)” with a time lag of 5 days. Such a data-driven analysis may complement the local official report showing that more than 21% of the newly COVID-19 infected cases reporting recent trips to bars and taverns51.
We also perform scenario studies to evaluate the possible consequences were different policies to be put in place. In particular, we study what would have happened if
(case 1) there was no Phase 2 reopening enacted on June 15;
(case 2) the county did not roll back the Phase 2 reopening on July 2;
(case 3) the county further reopens on August 4 at different levels.
In Figure 4a we plot the predicted infection of selected regions supposing the June Phase 2 reopening had not taken place. Our model predicts clearly that the number of cases would be drastically reduced in Regions 3, 6, and 7 if the county did not enter Phase 2 reopening; in other words, those are the regions most severely affected by that decision. In particular, in Region 7, had there been no reopening, the projected number of cases is only 90 by June 30, as compared to the actual count of 282, more than three times as high. Regions 1, 2, 4, and 5 see small impacts from the reopening (see the results plotted in S9, S10 and S11.) In Figure 4b we plot the predicted infection assuming there was no rolling-back from Phase 2 reopening in the same three regions. It is clear that rolling-back the reopening reduced the number of infection cases significantly. Without rolling-back, the predicted total number of infection in Dane County during July 2-9 would have been 3, 394. This is almost three times of 1, 272, the actual number of confirmed cases in that time interval. In Figure 4c, we plot the predicted number of infections from August 4 into the future by 10 days. If the basic reproduction number tripled, the predicted number of infection would increase from 462 to 1,384 within 10 days.
To sum up: a strong cross-correlation is observed between the number of visits to “Drinking places (Alcoholic Beverages)” and increased infections in Region 7. At the same time, the scenario study suggests that without Phase 2 reopening, the infection in Dane county would have been significantly lower (reduce about two-thirds), and without the subsequent roll-back from Phase 2 reopening, the infection would have continued to increase dramatically. All this together provides strong evidence that the Phase 2 reopening allowing 50% capacity in drinking places played a major role in the drastic increase of the number of cases observed in Dane County during the summer of 2020.
Milwaukee County
We perform similar studies in Milwaukee County. The city of Mulwaukee started its Phase 2 reopening on May 14. During this period, restaurants and bars were allowed to have take-out or delivery service only. On June 5, Phase 3 reopening started, and restaurants and bars opened with 25% limit capacity. The limit capacity was lifted to 50% on June 26, when Phase 4 began. On July 16, Milwaukee instituted a mask ordinance requiring face covering in public space, both indoors and outdoors.
In Figure 5 we plot the effective reproduction number in different regions, and the corresponding IQR. Among the six regions, Regions 3 and 4 see the highest Re, averaging 1.8 and 2 respectively. In particular, in Region 4, in mid-May, Re rose as high as 6. As mentioned above, Regions 3 and 4 are the two regions most subject to racial and ethnic segregation. Such neighborhood segregation patterns have also been identified previously through human mobility-based spatial interaction network analysis using location big data52. The COVID-19 spread and deaths in Milwaukee shows the racial disparities in health.
As we did in the Dane County study, we also consider the effective reproduction number normalized to control for intra-regional traffic. As seen in Figure 3b, even after this normalization, Region 4 still demonstrates the highest transmission. That is, even with high traffic frequency within this region being discounted, the local infection rate is still very high, with the mid-May peak persisting. This suggests that policymakers need to explicitly consider local transmission variation contexts even after restricting intra-regional traffic for preventing more health disparities in future pandemics.
4.3 Assessing Spatial Heterogeneity with Age, Race and People’s Self-Reported Social Distancing
The disparities in COVID-19 spread rate between different regions of the two counties studied reflect important interaction between demographic divisions and reaction to COVID-19. In Dane County, we observed an appreciable difference in the outcomes between people who lived near the campus area and those farther away, i.e., spatial heterogeneity. In Milwaukee County case, we observed stark differences in the burden of disease between different race & ethnicity groups, especially between predominantly white regions and regions with a higher Black and Hispanic population. One author in this manuscript was on the data analysis team of the “COVID-19 and Social Distancing” survey conducted between March 19th and April 1st, 2020 by a group of interdisciplinary scholars and non-governmental organizations in Wisconsin53. The survey received responses from over 30,000 Wisconsin residents. We conducted data analysis on a behavioral question asking people to what extent they practice social distancing and compared different age groups’ answers and different ethnic groups’ answers. We found that among respondents aged 20-29, a substantially higher proportion of participants (4.25%) said they did not perform social distancing at all or performed little social distancing, compared to the other two age groups (30-59: 1.26%, above 60: 1.03%) [Figure 6a]. For Hispanic and the African American ethnic groups, there is still certain amount of respondents who reported that they did not perform social distancing at all (the top red bar in Figure 6b). This self-reported behavioral differences in precautionary behavior might partially explain the heterogeneous patterns we observed in the two counties. More importantly, this disparities among age groups and race & ethnicity groups in the COVID-19 outcome and in the self-reported social distancing behavior flags an urgent policy need. Policymakers and health communicators must investigate the underlying reasons for demographically specific barriers to adoption of social distancing and other mitigation measures54.
5 Discussion
5.1 Sensitivity test of clustering
The results of clustering are subjected to the weekly travel O-D flow data and to the network-based community detection algorithms. In general, the clustering results in Milwaukee County and Dane County are rather stable with respect to the change of clustering methods and data from different weeks. We test the sensitivity of clustering using different clustering methods and different periods of data.
The results shown above utilize the Walktrap method on the travel O-D flow data of the week of March 2. The Walktrap algorithm uses short random walks to detect communities in a large graph. In comparison, we also apply the Louvain method on the same data. The Louvain method is a popular community detection algorithm that aims to maximize the modularity55. Figures S2 and S3 show the clustering results of Milwaukee County and Dane County using the Louvain method. For Milwaukee County, the two clustering results are very similar apart for a small section in the lower-left corner of the map. Both methods result in 6 clustering groups. The modularities of clustering using Walktrap and Louvain are 0.365 and 0.376 respectively, indicating almost identical performance. For Dane County, however, the two methods provide slightly different results. Walktrap divides the county into 7 areas, while the Louvain gives 5. Comparing the subfigures in Figure S3, Louvain method clusters the central area as a single region, but the Walktrap method further divides this area into smaller components. Region 7 is the downtown area where the campus of University of Wisconsin – Madison is located. It plays an important role in our analysis of human mobility flows and infection of COVID-19, and thus it is meaningful to identify it as a single region. We therefore chose to use the Walktrap method. The modularities of clustering using Walktrap and Louvain algorithms are very close: 0.318 and 0.334 respectively.
In addition to comparing different clustering methods, we also compare the clustering results using data from different time intervals. In Figures S2 and S3 we show the results for the clustering of two counties using the data of March 2-8, and 9-15 respectively. For Dane County, only 3 out of 105 census tracts (each small sub-region in the figure represents a census tract) appear in different groups after the change of data. In Milwaukee County, after changing the data, 13 out of 296 census tracts have different group assignment using Louvain method 17 of 296 census tracts were assigned to a different cluster using Walktrap method. This indicates the clustering results are stable enough for the purpose of this project.
5.2 Policy implications
Our results suggest several policy lessons. Consider Dane County, which contains a large “college town” environment. As we see in Figure 3a, the vertical dashed line which corresponds to the policy phases of re-opening in Dane County, the area experienced spikes in the reproduction number when the county entered the Phase 2 reopening in mid June. We also observed that the reproduction number decreased significantly after the county rolled back its reopening policy in early July. The pattern of reproduction rate following the policies is especially substantial for the campus region. These results suggest that policymakers need to design reopening in a more nuanced manner. Instead of implementing an one-size-fits-all reopening policy for all the regions within a county, reopening policies that attend to the diverse nature of regions within a county should be considered. Policymaking needs to be spatially heterogeneous. For instance, we demonstrated that certain types of business and certain spatially distinct regions have a much higher infection rate in Dane County: visits to drinking places have the strongest association with infection rates, and the campus region has a higher infection rate than other parts of the county do. These empirical evidences flag the crucial importance of designing reopening policies which take these particularities into account when choosing what types of business and what regions to reopen first. Public policies also need to adapt to different regions that consist of very different demographic populations in order to control the reproduction rate of COVID-19, since COVID-19 is not only a public health problem, but also a problem that reflects existing disparities among various socially distinct population groups. For instance, for testing supply and future vaccination enforcement, policymakers might prioritize target certain demographic regions with priority.
Data Availability
The official testing results of COVID-19 confirmed cases between March 11, 2020 and August 12, 2020 were obtained from the local COVID-19 Dashboards, created by the Public Health Offices of City of Madison & Dane County (https://publichealthmdc.com/coronavirus) and Milwaukee County (https://county.milwaukee.gov/EN/COVID-19). The census-tract level geographic boundaries with demographics and socioeconomic attributes were obtained from the U.S. Census Bureau. We collected points of interest (POIs) with aggregated foot-traffic information from SafeGraph. For each POI, the records of aggregated visitor patterns record the number of unique visitors and the number of total visits to each venue during a specified time window (i.e., hourly, weekly, and monthly); this allows us to estimate the foot-traffic of each venue and the origin-to-destination (O-D) spatial interaction flow patterns during the the study period (https://github.com/GeoDS/COVID19USFlows). We further aggregate the O-D flow matrices to the census-tract level to match the COVID-19 testing data.
Author contributions statement
Research design and conceptualization: S.G., Q.L., N.C; Data collection and processing: X.H., J.R., Y.K.; Result analysis: X.H, S.G., Q.L., N.C, K.C., J.E.; Visualization: H.X, J.R., Y.K., K.C.; Obtained funding: S.G.,Q.L.,N.C,J.P.,J.E.; Project administration: S.G., J.P.; Writing: all authors.
Acknowledgements
We would like to thank the SafeGraph Inc. for providing the anonymous and aggregated place visits and human mobility flow data. S.G., Q.L., K.C, and J.P. acknowledge the funding support provided by the National Science Foundation (Award No. BCS-2027375). J.E. acknowledge the funding support provided by the National Science Foundation (Award No. DMS-1700884). Q.L., N.C. and X.H. are also supported by Data Science Initiative, provided by the University of Wisconsin - Madison Office of the Chancellor and the Vice Chancellor for Research and Graduate Education with funding from the Wisconsin Alumni Research Foundation. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.