Abstract
The initial contagiousness of a communicable disease within a given population is quantified by the basic reproduction number, denoted R0. The value of R0 gives the expected number of new cases generated by an infectious person in a wholly susceptible population and depends on both pathogen and population properties. On the basis of compartmental models that reproduce Coronavirus Disease 2019 (COVID-19) surveillance data, we estimated region-specific R0 values for 280 of 384 metropolitan statistical areas (MSAs) in the United States (US), which account for 95% of the US population living in urban areas and 82% of the total population. Our estimates range from 1.9 to 7.7 and quantify the relative susceptibilities of regional populations to spread of respiratory diseases.
One-Sentence Summary Initial contagiousness of Coronavirus Disease 2019 varied over a 4-fold range across urban areas of the United States.
LA-UR-22-29514
Main Text
Public health surveillance efforts in the US during the COVID-19 pandemic were sweeping. In 2020 alone, approximately 254 million diagnostic tests for Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) infection were administered (1). The data generated from these surveillance efforts provide an unprecedented opportunity to gain insights into disease transmission dynamics in the US, especially for respiratory diseases similar to COVID-19, which is believed to be aerosol-transmitted (2). To gain insights into the heterogeneity of disease transmission across the US, we attempted to use COVID-19 surveillance data, namely daily county-level reports of new cases, which have been collected in various repositories (3, 4), to estimate the basic reproduction number of COVID-19 for as many distinct geographical regions in the US as possible.
The basic reproduction number, R0, is a dimensionless quantity corresponding to the expected number of secondary cases generated by an index case in a naïve population (5). Although the contagiousness of a communicable disease is commonly characterized by an R0 estimate (6), the value of R0 is in fact both pathogen- and population-specific (7), meaning that it is a function of not only pathogen properties, including virulence factors, but also of population properties, including biological, sociobehavioral, and environmental factors (8). The population properties that influence R0 are generally unknown.
Because many regions in the US were impacted by a common pathogen (i.e., SARS-CoV-2) around roughly the same time at the beginning of the COVID-19 pandemic (9), we reasoned that a comparison of R0 estimates for COVID-19 in different regions would elucidate how the population properties of distinct regions combine to determine differential susceptibility to disease spread for COVID-19 and similar diseases. Thus, we undertook an effort to generate regional COVID-19 R0 estimates.
Estimation of R0 can be pursued in multiple ways (10). Here, we adopted the approach of deriving R0 for a region of interest on the basis of a compartmental model parameterized for consistency with region-specific surveillance data (11). This approach requires a tractable explanatory model (12). An assumption of any compartmental model for an epidemic is that the population being considered is homogeneous (13). With this limitation in mind, the population considered in a model should be carefully chosen: a population that is more uniformly impacted by an epidemic is a better choice for modeling.
We considered developing models for the populations of either states or metropolitan statistical areas (MSAs), the latter of which are delimited by the federal government on the basis of socioeconomic ties after each census (14). In the US, there are currently 384 MSAs (14), each encompassing a city with 50,000 or more inhabitants and surrounding communities. In the US, surveillance data are collected and reported by county-level public health authorities (or authorities of comparable jurisdictions) (15), but county-level data can be aggregated to characterize disease transmission in MSAs and states. We did not develop models for county populations for two reasons. First, county-level surveillance data tend to be noisy, with noise arising from the stochastic nature of case detection combined with a small number of cases and/or because of irregularities in reporting. It is for these reasons that county-level data are commonly aggregated in epidemiological analyses (16). Second, inspection of infection-risk curves suggested that counties within MSAs are similar, as illustrated in Fig. 1, which shows risk curves for four multi-county MSAs. Risk curves for the other multi-county MSAs are shown in Fig. S1.
To ascertain whether disease transmission is more homogeneous across the counties of MSAs or states, we defined three different measures for variability in weekly disease incidence (i.e., infection risk over a 1-week period) across the counties of a given region (see Supplemental Methods). These measures are based on the Fano factor (17), the Gini coefficient (18), and the Wasserstein-1 distance (19). We calculated variability measures for epidemiological weeks 5 through 52, a period starting on 26-January-2020 and ending on 26-December-2020, for all multi-county MSAs and the 50 states by aggregating confirmed COVID-19 daily county-level case-count data available in the GitHub repository maintained by The New York Times newspaper (3).
Representative results of our risk variability analysis are shown in Fig. 2. In Fig. 2A, the Fano factor, Gini coefficient, and Wasserstein-1 distance variability measures are plotted as a function of epidemiological week for each of three selected MSAs and also for overlapping states. As can be seen, each variability measure tends to be less for an MSA than for an overlapping state. Results for other multi-county MSAs and overlapping states are shown in Figs. S2–S4. Histograms of time-averaged Fano factor, Gini coefficient, and Wasserstein-1 distance variability measures obtained for all multi-county MSAs and states are shown in Figs. 2B–2D. For each variability measure, the histogram for MSAs is shifted to the left of the histogram for states. These results indicate that county-level infection risks are more homogeneous for counties within MSAs than for counties within states. For this reason, we focused on developing models for MSA populations instead of state populations.
In earlier work, we developed a compartmental model that is able to reproduce daily COVID-19 case count data for the 15 most populous MSAs in the US and all 50 states (11, 20). Here, we found region-specific parameterizations of this model consistent with surveillance data for MSAs in the US having more than 200 cumulative cases reported before 21-May-2020 and at least 5 new cases on any given day between 21-January-2020 and 21-June-2020. 280 MSAs satisfy these criteria. For each of these MSAs, we applied a Bayesian inference approach described earlier (11, 20), which is enabled by an adaptive Markov chain Monte Carlo (MCMC) sampling procedure. Inference job setup files for PyBioNetFit (21), including files with MSA-specific surveillance data, are provided for each of the 280 MSAs (Data S1). To ensure that MCMC sampling converged, we visually inspected log-likelihood trace plots, parameter trace plots, and pairs plots. In addition, to ensure that regional parameterizations are explanatory, we compared posterior predictive distributions for case detection against daily counts of new cases (https://github.com/lanl/COVID-19-basic-reproduction-numbers). As illustrated in Fig. 3 for four selected MSAs, we were able to find parameterizations that are consistent with regional surveillance data.
For the compartmental model used in our analysis, we previously used the next-generation matrix method (12) to derive a formula that gives R0 in terms of model parameters (20). According to this formula, the value of R0 depends on one inferred region-specific parameter, the contact rate parameter β, and seven fixed parameters describing within host-dynamics, which were estimated earlier and are expected to be universally applicable across regions of interest (11). Using our earlier estimates for fixed parameters (11), region-specific samples of the parametric posterior distribution for β obtained in Bayesian inference, and the formula for R0 (20), for each of 280 MSAs, we found a maximum a posteriori (MAP) estimate for R0, which is equivalent to a maximum likelihood estimate because of the use of a uniform proper prior in Bayesian inference, and a 95% credible interval (Table S1). Similarly, through numerical solution of an equation obtained earlier (20), for the same MSAs, we obtained a MAP estimate for the initial epidemic growth rate λ and a 95% credible interval (Table S1). Importantly, growth rate estimates are consistent with early surveillance data (https://github.com/lanl/COVID-19-basic-reproduction-numbers).
In Fig. 4, the R0 MAP estimates and 95% credible intervals that we obtained for 280 MSAs are presented in a rank order plot. The largest R0 estimate was 7.7 (for the MSA encompassing Detroit, Michigan) and the smallest was 1.9 (for the MSA encompassing Appleton, Wisconsin). These disparate estimates indicate that the population features contributing to initial disease spread are geographically heterogeneous and combine (in an unknown way) to yield a distribution of initial COVID-19 contagiousness that is spread over a 4-fold range. The absolute R0 values reported here are COVID-19-specific; however, the relative strengths or ratios quantify relative susceptibilities to respiratory disease spread arising from MSA-specific population features.
Using a mechanistic compartmental model in concert with COVID-19 surveillance data and Bayesian inference, we have obtained MSA-specific COVID-19 R0 estimates that quantify differences in population properties across urban areas in the US that affect disease contagiousness. This information may help mitigate future respiratory disease outbreaks, as some urban areas are evidently far more susceptible to rapid disease spread than others. Our findings may be helpful in identifying which population properties contribute to contagiousness, which is important because the population properties underlying our R0 estimates may eventually change with time.
Data Availability
All data produced in the present work are contained in the manuscript.
Funding
2020 National Science Foundation Mathematical Sciences Graduate Internship program (AM)
Center for Nonlinear Studies, Los Alamos National Laboratory (AM)
Laboratory Directed Research and Development program at Los Alamos National Laboratory (project 20220268ER) (WSH, YTL)
National Institute of General Medical Sciences of the National Institutes of Health (grant R01GM111510) (WSH)
Author contributions
Conceptualization: AM, WSH
Methodology: AM, YTL,
WSH Investigation: AM, YTL, WSH
Visualization: AM, YTL, WSH
Funding acquisition: YTL, WSH
Project administration: YTL, WSH
Supervision: YTL, WSH
Writing – original draft: AM
Writing – review & editing: AM, YTL, WSH
Competing interests
Authors declare that they have no competing interests.
Data and materials availability
All data are available in the main text or the supplementary materials.
Supplementary Materials for
Materials and Methods
Here, we derive the measures discussed in the Main Text for infection risk variability across counties of a specified region. These variability measures are based on the Gini coefficient, Wasserstein-1 distance, and Fano factor. Infection risk for a time period of interest in a given region is defined as the population fraction of new cases reported.
Gini coefficient risk variability measure
Let us consider a population of n persons. The ith person has infection risk ri. We want to know two quantities. The first quantity is MD, the mean absolute difference in risk. We calculate MD as the absolute difference in risk averaged over all distinct pairs of persons. The sum of absolute differences is , which equals . The number of distinct pairs of persons is , which is equal to n(n − 1)/2. Thus, MD is given by
The second quantity of interest is AM, the arithmetic mean risk. AM is given by
The ratio MD/AM is RMD, the relative mean absolute difference. RMD is given by
RMD is equal to twice the Gini coefficient, G. Thus G ≡ RMD/2 is given by
Let us consider the case where the n persons of interest reside in k ≤ n counties. All persons with the same county have the same risk. Let us denote the county-associated risks as c1, c2, …, ck. Let us use ni to denote the population of the ith county. It follows that . We want to calculate G in terms of county-level risks {c1, …, ck} instead of the individual-level risks {r1, …, rn}. The number of pairs of persons consisting of a person from county I and a person from a different county j ≠ i is given by ninj. Thus, because persons in the same county share the same risk. We also have the following relation:
It follows that
We can take ci to be the COVID-19 incidence proportion for the ith county over a given epidemiological week. If n ≫ 1,
Wasserstein-1 distance risk variability measure
Let X denote a finite, non-empty, and ordered set. The Wasserstein-1 distance between two discrete distributions is given as W1(p, q) = ∑x∈X |Fp (x) − Fq (x)| where Fp(·) ∈ [0,1] and Fq(·) ∈ [0,1] are the cumulative distribution functions of p(·) and q(·), respectively.
Let k denote the number of counties in a given region (i.e., an MSA or state) in a given epidemiological week. Let us consider the case where all persons within the same county have the same risk. Let us denote the county-associated risks as c1, c2, …, ck. Let us denote the population of the ith county as ni. It follows that , where p denotes the population of the aggregated region (i.e., an MSA or state.)
We take ci to be the COVID-19 incidence proportion for the ith county in a given epidemiological week. Let us denote our random variable by C, taking on the values c1, c2, …, ck. After separately sorting the values ci and the populations (or weights) pi in ascending order, it follows that 0 ≤ c1 < c2 < … < ck ≤ 1 and 0 ≤ p1 < p2 < … < pk ≤ p.
The weighted empirical cumulative distribution function is then constructed for a given MSA (or state) in a given epidemiological week, and is comprised of the values ci and the normalized weights pi/p, ordered by the population (from small to large).
The reference empirical cumulative distribution function for a given MSA (or state), with averaged risk in a given epidemiological week, is comprised of the value and the weight 1, with the value on the x-axis and the weight on the y-axis.
The Wasserstein-1 distance is computed between a given MSA (or state) and the corresponding reference distribution.
Fano factor risk variability measure
For a given region (i.e., MSA or state) and a given epidemiological week, the Fano factor is defined as the variance-to-mean ratio. The random variable of interest is given by the set of county-level risks, where the risk is defined as the confirmed number of new cases in a given week normalized by the county population.
Time-averaging
As presented in the Main Text (see Figs. 1B to 1D), we obtained average quantities for the infection risk variability measures above by averaging over the time domain (i.e., over the epidemiological weeks).
Bayesian inference
To infer region-specific values of adjustable model parameters (and R1 estimates), we followed the Bayesian inference approach of Lin et al. (11). In inferences, we used all region-relevant confirmed COVID-19 case-count data available in the GitHub repository maintained by The New York Times newspaper (3) for the period starting on 21-January-2020 and ending on 21-June-2020 (inclusive dates). The first case in the US was reported on 21-January-2020. We focused on early surveillance data (vs. all available surveillance data up to the present time) so as to characterize COVID-19 transmission within populations that are nearly wholly susceptible. Markov Chain Monte Carlo (MCMC) sampling was performed using the Python code of Lin et al. (11) and a new release of PyBioNetFit (21), version 1.1.9, which includes an implementation of the adaptive MCMC method used in the study of Lin et al. (11). Inference job setup files for PyBioNetFit, including data files, are provided for each of 280 MSAs (Data S1). There are 384 MSAs in the US. We excluded 104 of 384 MSAs from analysis. In each of these cases, there was insufficient data to support parameterization; namely, each of these MSAs had fewer than 200 cumulative cases reported before 21-May-2020 or no more than 5 new cases reported on any given day between 21-January-2020 and 21-June-2020.
Fig. S1. County- and MSA-level infection-risk curves for weekly-aggregated data from 26-January-2020 to 26-December-2020 (inclusive dates) in each of 232 multi-county MSAs, except those encompassing Atlanta, GA; Washington, DC; New York City, NY; and Virginia Beach, VA, which are considered in Fig. 1 of the Main Text. MSA-level risk curves are shown in black and county-level risk curves are shown with a viridis color scheme. States are indicated using two-letter US postal service abbreviations.
Fig. S2. Plots of the infection-risk variability measures based on the Fano factor as a function of epidemiological week for each of 232 multi-county MSAs, except those encompassing Atlanta, GA; Washington, DC; and New York City, NY, which are considered in Fig. 2 of the Main Text.
Fig. S3. Plots of the infection-risk variability measures based on the Gini coefficient as a function of epidemiological week for each of 232 multi-county MSAs, except those encompassing Atlanta, GA; Washington, DC; and New York City, NY, which are considered in Fig. 2 of the Main Text.
Fig. S4. Plots of the infection-risk variability measures based on the Wasserstein-1 distance as a function of epidemiological week for each of 232 multi-county MSAs, except those encompassing Atlanta, GA; Washington, DC; and New York City, NY, which are considered in Fig. 2 of the Main Text.
Acknowledgments
Computational resources used in this study included the FARM cluster at the University of California, Davis.