Abstract
As SARS-CoV-2 emerged as a global threat in early 2020, China enacted rapid and strict lock-down orders to prevent introductions and suppress transmission. In contrast, the United States federal government did not enact national orders. State and local authorities were left to make rapid decisions based on limited case data and scientific information to protect their communities. To support local decision making in early 2020, we developed a model for estimating the probability of an unseen COVID-19 epidemic in each US county based on the number of confirmed cases. We found that counties with only a single reported case by April 13th had a 50% chance that SARS-CoV-2 was already spreading widely. By that date, 85% of US counties covering 96% of the population had reported at least one case. Given the low rates of testing and reporting early in the pandemic, taking action upon the detection of just one or a few cases may be prudent.
Author Summary By March 28, 2020, only 3 months after the first US case of COVID-19 was detected, COVID-19 emerged in all 50 US states. Officials were forced to weigh the economic and societal costs of strict social distancing measures against the future risks of COVID-19 hospitalizations and mortality in their communities. To support local decision makers throughout the US, we developed a simple model to determine the chance that COVID-19 was spreading unseen based on scant reported case counts. In mid-April, 85% of US counties containing 96% of the population had reported at least one confirmed case. Our model predicted that each of those counties thus faced at least a 50% chance that the virus was already spreading widely. Aggressive pandemic mitigation measures, even before a threat is fully apparent, are particularly critical when testing resources are limited.
Introduction
The pandemic caused by the 2019 novel coronavirus (COVID-19) has claimed over 242,000 American lives as of early November 2020 and may kill tens of thousands more by the end of the year [1-3]. Early in the pandemic, when confirmed case counts were still relatively low across the US, the federal government left decision making largely to state and local public authorities. Amidst great uncertainty, they faced the unprecedented challenge of balancing the threat of a mostly unseen but deadly virus against the economic and societal costs of shelter-in-place and travel restrictions. At the time, most SARS-CoV-2 (the virus responsible for COVID-19) cases were not reported due to the high proportion of mild and asymptomatic infections, limited laboratory testing capacity and strict requirements for receiving tests (e.g. travel or contact with someone from Wuhan, China) [4,5]. The CDC estimated that only one in ten COVID-19 infections were reported during the early phase of the pandemic [6].
As the first cases of COVID-19 were reported, decision makers urgently needed to determine whether these cases reflected sporadic clusters stemming from recent introductions or sustained community transmission that might evolve into a large epidemic. In the southern US, the 2016 expansion of Zika Virus (ZIKV) across the Americas posed a similar challenge. Cryptic transmission meant that by the time a few cases were reported, a large epidemic could already be underway [7]. Here, we describe a stochastic susceptible-exposed-infected-recovered compartmental model framework for estimating the magnitude of an epidemic threat from scarce case data. The approach was originally developed to support situational awareness for ZIKV then adapted for COVID-19. We apply it to estimating the risk of unseen COVID-19 waves in counties across the US during the emergence phase of the pandemic in 2020.
Results
We modeled the stochastic emergence of COVID-19 accounting for potential superspreading events, asymptomatic infections, and epidemiological characteristics. We assumed all US counties had roughly similar transmission rates. The chance that a county had emerging COVID-19 waves ranged from 9% for zero detected cases to 100% for 25 or more (Fig 1). By March 16, 2020, counties cumulatively reported between 0 and 489 cases totaling 4,009 nationally. Epidemic risk exceeded 50% in roughly 15% of the 3,142 counties covering 63% of the US population. By April 13, 2020, total reported cases in the US climbed to 467,158. Consequently, we estimated that over 85% of US counties comprising 96% of the national population had at least a 50% chance of having an epidemic already underway (Fig 1).
The chance of an unseen epidemic (epidemic risk) in a county reporting only a single reported case is 50%; for a county reporting zero cases, the chance is 9%. The model used in both maps assumes Re=1.5, a 10% case detection rate, and a generation time of six days.
Based on COVID-19 case detection rates [6] for the week of April 13, 2020, we estimated that sustained community transmission was probable as soon as even one case was confirmed (Fig 2). At a moderate transmission rate (i.e. Re=1.5), the first case in a county signals a 50% chance that an epidemic was underway. For a high transmission rate (i.e. Re=3.0), as may be expected before COVID-19 lockdowns, the estimated risk increased to 83%. The projected risks are generally higher for both larger transmission rates and lowercase detection rates. For example, when Re is 1.5, the expected epidemic risk associated with a single case is 50% and increases to 63% when the case detection rate drops from one in ten (10%) to one in twenty (5%). For outbreaks that eventually spread widely, the expected time between the first COVID-19 case report and the epidemic reaching 1,000 cumulative infections was 7.5 (95% CI 3.9-16.3) weeks. The expected time between the tenth reported case and 1,000 cumulative infections shrank by 41% to 4.4 (95% CI 2.1-11.4) weeks (Fig 3).
For a given number of reported cases, the estimated risk of an epidemic increased with Re. By the time a single case is reported, there is a 13%, 50% or 83% chance of an ongoing epidemic for Re of 1.1, 1.5 or 3.0, respectively.
For a given number of cumulative reported cases (x-axis), we assume an epidemic is underway then estimate the median and 95% CI (error bars) number of weeks until the cumulative total cases exceed 1,000. When the first case is reported, we would expect cumulative cases to surpass 1,000 in 7.5 (95% CI 3.9-16.3) weeks; when the 10th case is reported, the expected lead time shrinks to 4.4 (95% CI 2.1-11.4) weeks. The estimates are based on 100,000 stochastic simulations assuming Re=1.5, a 10% case detection rate, and generation time of six days.
As a retrospective validation of our model, we compared our estimates to reported case counts. We cannot know, with certainty, if and when epidemics began spreading in most US counties. As a proxy, we assess whether case counts increased by at least five in the week following our estimate on March 16 (Fig 4, middle line). We find that our estimates for the probability of an ongoing epidemic (epidemic risk) are highly consistent with the fraction of counties that exhibited jumps in reported cases. The cumulative number of reported cases in a county by March 16 was a significant predictor of whether the number of new reported cases in the following week (March 16-23) was at least one, five, or ten cases (logistic regression, p<0.001). A one unit increase in cumulative reported cases increased the odds of a county detecting at least one, five, or ten new cases by March 23 by 7.92 (95% CI 5.98-10.80), 4.90 (95% CI 4.14-5.99), and 3.16 (95% CI 2.80-3.63), respectively.
The light, medium and dark gray lines correspond to increases of at least one, five, or ten new cases within one week, respectively. The red ribbon indicates the model estimates for the probability that an epidemic is underway, given the cumulative reported cases indicated on the x-axis. The bottom and top of the ribbon correspond to scenarios in which Re=1.5 and Re=3.0, respectively. These estimates are calculated based on 100,000 simulations for each reproduction number, assuming a 10% case detection rate and a generation time of six days. The odds of a county detecting at least five new cases increased by 4.90 (95% CI 4.14-5.99) for every one unit increase in cases on March 16. For example, a county with one case on March 16 was roughly five times more likely to have at least six cases a week later than a county with no reported cases.
Discussion
The timing and rate of COVID-19 emergence varied widely across the US. The earliest of the 3,142 US counties to report a case was Snohomish, Washington on January 21, 2020. By the first of March, April and May, 1%, 70% and 90% of all counties had reported at least one case, respectively. We estimate that, by the time a county reported its first case, it had at least a 50% chance of harboring an unseen but growing epidemic. As of April 13, 2020, the risk exceeded 90% in 54% of counties containing 91% of the US population. The New York Times published real-time projections of our model in a national risk map on April 3, 2020, which spread awareness of the growing COVID-19 threat to the nation [8].
Proactive responses to COVID-19 have been estimated to shorten the duration of costly measures [9,10], whereas delays have likely cost lives [11]. If the goal of COVID-19 interventions is to fully contain an emerging outbreak as quickly as possible, our study suggests that the first reported case should trigger action. The risk of an ongoing epidemic may already exceed 50% and delaying until ten cases have been reported, for example, may substantially reduce the window for corrective action and amassing adequate healthcare and other mitigation resources.
Our analyses make several key assumptions. Case detection rates may vary geographically and change through time depending on testing availability and regulations. Our assumption of 10% is based on a CDC seroprevalence study, which reported rates ranging from 4% to 16% across ten sites [6]. We modeled superspreading events based on estimates for SARS-CoV in Singapore in 2003 [12], which are consistent with more recent reports for SARS-CoV-2 [13–15]. Our estimates do not account for repeated importations given the stay at home orders and travel restrictions at the time. Multiple introductions would reduce our estimated levels of epidemic risk since detected cases could reflect independent clusters rather than continuous chains of transmission. Finally, our estimates depend on the effective reproduction number of the pandemic which can vary spatiotemporally depending on local policies, testing efforts, behavior, and population density [16,17]. Our estimates for mid-April, when much of the US was under shelter-in-place orders, assume a relatively low Re of 1.5. Our retrospective validation using data from mid-March, when intervention efforts varied geographically, considers reproduction numbers ranging from 1.5 to 3.0.
This analysis, while simple, provided useful insight during a highly uncertain period of the COVID-19 pandemic and can be easily adapted to provide early situational awareness for future emerging infectious outbreaks. Our results suggest that proactive control measures may be prudent, even before the threat becomes apparent [18].
Methods
We obtained county-level estimates for confirmed and suspected COVID-19 cases from a data repository curated by the New York Times [19] and 2019 estimates of each county’s population from the US Census Bureau [20]. We adapted the framework of another silent spreader–Zika Virus (ZIKV)–which threatened to emerge in southern US states in 2016 [7] to model COVID-19 in US counties. The discrete-time SEIR model assumes a branching process for early transmission in which the number of secondary infections per infected case is distributed according to a negative binomial distribution to capture occasional superspreading events, as estimated for SARS-CoV outbreaks in 2003 [12]. The exposure and infectious periods consist of “boxcars” to enforce the minimum number of days simulated individuals spent in each compartment. We account for imperfect detection and COVID-19 specific epidemiological characteristics (details in Table 1). Our baseline scenario assumes the Re of COVID-19 is 1.5, accounting for ongoing social distancing measures across the US by mid-April, 2020 [21], and 10% detection of all cases. We do not explicitly model asymptomatic or pre-symptomatic transmission and thus maintain a low detection probability for all infectious cases. To assess the impact of these assumptions on our estimates, we conducted a sensitivity analysis that varied Re (1.1 and 3.0) and detection rates (5%-40%).
We ran 100,000 stochastic outbreak simulations per scenario beginning with a single undetected case and ending when cumulative infections reached 2,000 or the outbreak died out (whichever came first). Because we model transmission as a branching process, the susceptible population does not deplete as in other compartmental SEIR models. Following the methodology of [7], simulated outbreaks that reached 2,000 cumulative cases and had a minimum prevalence of 50 cases per day were classified as epidemics. We calculated the probability of an epidemic for a given number of detected cases, x, by looking at all outbreaks that had x reported cases and calculating the proportion of those outbreaks that progressed to epidemics. We then matched county case numbers with the detected case number to obtain epidemic probabilities for each US county based on their reported cumulative number of cases. We use our baseline scenario to compare US maps of epidemic risk from March 16 and April 13, although Recloser to 3.0 may be appropriate for mid-March. For simulations that became epidemics, we also calculated the distribution of lags (in weeks) between the day the xth case was reported and the day the epidemic surpassed 1,000 cumulative cases. Confidence intervals were calculated with the quantile function in R version 3.6.1 [26].
To validate epidemic risk, we fit three logistic regressions to if US counties reported at least one, five, or ten new cases over the week of March 16 to 23, 2020 (y-axis) and how many cumulative cases there were on March 16 (x-axis). First, counties were grouped by the number of reported cases on March 16. Counties with ten or more cases were put into one group due to the low number of counties with more than ten cases on March 16. Second, March 23rd case counts were subtracted from those on March 16 and the difference was classified as an increase of at least one, five, or ten cases (three separate binary classifications). Finally, a logistic regression was fit to each classification to determine if the number of cases on March 16 was a significant predictor of new cases one week later. This week in mid-March was before lockdowns took place in the US and saw only a moderate increase in daily tests nationally (from 20,000 to 60,000) [27]. We compare case counts from Monday to Monday to avoid weekend reporting bias. To readily compare with epidemic risk estimates, we plot the percent of counties with an increase in new cases with our epidemic risk estimates from Re=1.5 to Re=3.0.
Data Availability
Code to generate data available online.
Footnotes
The MS has been updated to include a retrospective analysis and combine brief report with the supplementary.