Abstract
Rapid identification of new SARS-CoV-2 variants is a critical component of the public health response to the COVID-19 pandemic. However, we lack a quantitative framework to assess the expected performance of sampling strategies in varying epidemic contexts. To address this gap, we used a multi-patch stochastic model of SARS-CoV-2 spread in New York City to evaluate the impact of the volume of testing and sequencing, geographic representativeness of sampling, location and timing of variant emergence, and relative variant transmissibility on the time to first detection of a new variant. The strategy of targeted sampling of likely emergence locations offered the most improvement in detection speed. Increasing sequencing capacity reduced detection time more than increasing testing volumes. The relative transmissibility of the new variant and the epidemic context of variant emergence also influenced detection times, showing that individual surveillance strategies can result in a wide range of detection outcomes, depending on the underlying dynamics of the circulating variants. These findings help contextualize the design, interpretation, and trade-offs of genomic surveillance strategies.
Introduction
Genomic surveillance is an important tool for public health response to infectious disease (1,2). For pathogens such as SARS-CoV-2, genomic surveillance has enabled rapid identification and characterization of genetic variants that differ in transmissibility, virulence, and antigenic space, and thus informed disease control policies and the design and clinical use of therapeutics and vaccines (3–7).
Many questions remain about the impact of sampling strategies on the time to first detection of a variant. Guidelines suggest specific sequencing rates or fixed sequencing volumes (7)(8), depending on available resources, logistical considerations, overall SARS-CoV-2 prevalence, and surveillance objectives. They also emphasize the importance of sequencing cases that accurately reflect all SARS-CoV-2 infections, which requires representativeness of both testing and sequencing. In low- and middle-income countries, increasing testing volume as a way of improving the representativeness of sampling is most important to reducing variant detection times (9). However, we lack a general quantitative assessment of sampling strategies to detect new variants—where ‘new’ refers to local emergence of a novel variant or the importation of an emerging, known or an unknown variant—within a target detection time that considers the role of geographic representativeness of sampling, the optimal testing and sequencing volumes, and epidemic context.
To address this gap, we simulated the impact of a set of surveillance strategies on variant detection across a range of contexts. We developed a multi-patch stochastic transmission model for SARS-CoV-2 in New York City (NYC) and incorporated empirical human mobility data for the geographic dispersal of pathogens. We chose NYC as a case study given its experience with genomic surveillance and publicly available data on testing, sequencing, and mobility (10,11). We simulated the introduction of new variants with varying levels of transmissibility in locations across the city and at multiple introduction times relative to the introduction of the previously dominant variant. We then simulated testing and sequencing scenarios, varying both the volume and distribution of sampling, and computed the time to first detection, the overall burden of disease, and the geographic variability of the disease burden under each strategy. By developing this framework, we aimed to contextualize decision-making on genomic surveillance within the diversity of possible disease scenarios.
Methods
Data
Baseline COVID-19 testing rates (609 tests per 100,000 residents per week) and sequencing rates for NYC were obtained from the NYC Department of Health and Mental Hygiene (NYC DOHMH) (11) from December 2020 until November 2021 at the geographic resolution of modified ZIP-code tabulation areas (MODZCTAs). Mobility data were obtained from Meta via the Facebook Data for Good Initiative (12), which reported the physical locations of anonymized app users within 600m-by-600m tiles in 8-hour intervals. These data were aggregated to both MODZCTAs and boroughs and used to construct a mixing matrix estimating the rate of interpersonal encounters among the residents of NYC. We used data from the United States Census Bureau to define mappings between MODZCTAs, tiles, and boroughs. The main analysis was conducted at the geographic scale of boroughs. We conducted a sensitivity analysis at the level of MODZCTAs. Full details are provided in the Supplementary Materials and Methods.
Model structure
To simulate the introduction and subsequent transmission of a novel SARS-CoV-2 variant, we constructed a multi-patch, two-variant stochastic compartmental model that builds on the basic structure of a Susceptible-Exposed-Infectious-Recovered-Susceptible (SEIRS) model. A first variant was seeded in the population and a single index case of a novel variant was introduced at varying times and in varying locations. The new variant was simulated to be more infectious than the previously circulating variant. In sensitivity analyses (see Supplemental Material), we considered variants that also had greater and faster immune evasion. Individuals progressed stochastically through the states defined by variant characteristics (cross-protection, transmission probability, duration of latent and infectious periods) and the number of individuals in each compartment. The probability of transmission between geographic locations (patches representing a borough or MODZCTA) was governed by the contact between boroughs (as observed through human mobility), and the relative quantities of infectious and susceptible residents. We accounted for imperfect test sensitivity and specificity and for infection-induced behavioral changes, such as reducing contacts in response to a positive test. We then assessed the impact on detection outcomes of different characteristics of the surveillance strategy and the epidemiological context, specifically the volume of testing and sequencing, the geographic distribution of testing, the introduction time and location of the second variant relative to the first, the probability of transmission, and the connectivity (inward and outward mobility) of introduction and sampling locations. Full details on the model structure are provided in the Supplementary Materials and Methods. Code is available at github.com/gradlab/detectingsarscov2-variants.
Statistical analysis
The main outcomes in this study were the time to variant detection (the number of days between when the index case becomes infectious and laboratory confirmation of the new variant among sequenced specimens), the cumulative number of infections, and the variation in cumulative infections across locations. We ran 100 simulations per scenario and calculated the arithmetic means, medians, and confidence intervals of the main outcomes across simulations. For sampling schemes, we considered (1) the distribution of test volume in New York City provided to DOHMH, which we termed ‘baseline’ testing, (2) test volumes distributed by population density, and (3) test volumes distributed randomly across locations. We also considered “focused testing scenarios” in which 20-100% of tests were allocated in a single location, with the remaining tests distributed evenly among the other locations according to their population size. For total testing volumes, we considered a range between 5% and 300% of the reported testing volume. For sequencing rates, we considered a range between 1% and 90% of all positive tests to be selected for sequencing. In a sensitivity analysis, we evaluated fixed sequencing quantities instead of sequencing proportions, which distinguished the marginal contribution of sequencing versus testing alone.
When all introduction times produced qualitatively similar results, we reported in the main text the results from the scenario in which the new variant is introduced just before the outbreak peak of the previously dominating variant (at t = 50 days). In this scenario, detection was expected to be at or near its slowest, due to high background prevalence and depletion of susceptible individuals. Full results for all parameter combinations are provided in the Supplementary Data.
Results
Geographic sampling strategy
Relative to the baseline volume and distribution of testing and sequencing in NYC (the “baseline” testing and sequencing strategy), detection times were similarly distributed when test volumes were allocated to be (a) proportional to the population density or (b) uniformly at random across locations (Fig. 1). This similarity across geographic sampling strategies was unaffected by the outcome measure used as well as the timing and location of the new variant’s introduction. However, the geographic sampling strategy affected detection outcomes if the introduction location of the new variant was oversampled. Allocating a greater proportion of tests in a single location reduced detection times and cumulative infections of variants emerging in that location but increased detection times of variants that first appeared elsewhere (Fig. 2). This effect was especially pronounced in Staten Island and Brooklyn, but weaker in Manhattan (Supplementary Fig. S5).
Testing and sequencing volumes
Outcomes varied considerably across testing and sequencing rates, with higher rates leading to faster detection, fewer cases, and less variation in cumulative infections across locations (Fig. 3). In accordance with sampling guidelines for well-resourced settings (7), we assumed that a fixed percentage of tests was sequenced. Thus, increasing the number of tests also increased the number of sequenced samples.
To better understand the individual contributions of testing and sequencing, we fixed the quantities of samples selected for sequencing at varying testing volumes. Fixed sequencing volumes were implemented as a cap on the maximum number of samples that can be sequenced per day. The actual number of sequenced cases depended on the test positivity rate.
The improvement in variant detection with increasing test volumes at a given sequencing proportion was driven by the increase in sequencing volume rather than test volume. At all levels of testing, increasing the number of sequenced samples reduced the detection time, while increasing testing alone had little impact on new variant detection (Fig. 4).
We also considered an alternative interpretation of the sequencing cap, where the sequencing volume depended on both the test volume and the positivity rate (Supplement 3; Supplementary Materials and Methods). The results from this sensitivity analysis fall between the fixed volume and fixed rate analyses (Figs. 3 and 4). Raising testing capacity improved detection times for low levels of testing (up to 50-75 tests per 100k persons). At higher levels of testing, improvements in detection time were driven primarily by increased sequencing capacity (Supplementary Fig. 3).
Emergence context
We compared introduction times of the new variant as an approximation for varying background prevalence of the previously circulating variant and the population susceptibility to infection.
When the second, more transmissible variant was introduced into a fully susceptible population together with the first variant (at t = 0), the second variant was more likely to dominate due to its increased transmissibility. Under this scenario, the extinction probability of the second variant (defined as the likelihood that a variant will cause no more than 10 infections) was only 9.6% under the baseline sampling strategy. Both variants generally persisted through the duration of the simulation, though the second variant caused more infections. Consequently, at a t=0 introduction time, the second variant was detected in under 33 days in 95% of simu-lations. If the second variant was introduced after the peak of the first variant’s outbreak (at t = 80 or t = 100), the second variant had a high probability of extinction (64.6 and 84.2%, respectively), and if it persisted, it was detected later (at least 56 and 37 days after introduction in 95% of simulations, respectively). The greatest range of disease dynamics and consequently detection times was observed when the second variant was introduced just before the peak of the first variant (at t = 50), with detection times ranging from 16 to 145 days (Fig 5).
The introduction location did not significantly impact the detection time or cumulative disease burden across the city but did influence where infections occurred .The number of infections was highest in locations with the highest mobility connectivity to the emergence location, which was either the introduction location itself or other locations, depending on the mobility matrix. Emergence in Staten Island, for example, produced infections primarily within Staten Island, while emergence in Manhattan led to a high number of infections in Brooklyn and Queens (Supplementary Fig. S4).
Variant characteristics
We compared variants with different levels of transmissibility, varying the probability of infection given an infectious contact from β = 0.21 to β = 0.5 (contrasting with the transmissibility of the first variant of β = 0.2). This transmission parameter affected the disease dynamics, with more transmissible variants spreading more quickly, leading to earlier detection. All transmission rates yielded a wide range of cumulative infections at detection time (Fig. 6).
Discussion
This study provides an assessment of testing and sequencing strategies for the detection of new SARS-CoV-2 variants to help inform genomic surveillance policies. We considered varying quantities and distributions of resources within a wide range of potential settings for variant emergence and assessed how they influenced variant detection times and the undetected disease burden.
Our results confirm that variant detection is governed by both the surveillance strategy and the epidemic dynamics in which the new variant arises (13). The relative transmissibility of the new variant as well as the context of variant emergence influenced its speed of spread and extinction probability, which in turn affected detection outcomes (Figs. 5 and 6).
Surveillance guidelines stipulate the need for representative sampling strategies, which require not only that a random subset of positive cases be chosen for sequencing, but that positive cases also accurately reflect infections across the population; non-representative sampling delays the detection of new variants (14). Nonrepresentative testing and sequencing must be considered separately and jointly, along dimensions such as demography, socioeconomic factors, disease outcome, and geography. In this study, we assessed the role of geography in representative sampling. For variation in the geographic distribution of sequencing, other work found that if sampling was not representative, then raising test volume improved variant detection times more than raising sequencing rates (9). Our study considered the geographic distribution of testing, as well as the volume of both testing and sequencing, and found that in the context of random sampling of tests for sequencing, improvements in detection time were driven primarily by increases in sequencing volume rather than testing volume (Fig. 4). The geographic distribution of residents receiving tests impacted detection outcomes only via the proximity to the emergence location of the new variant (Fig. 2), though careful test distribution matters to many related public health and social equity objectives (14–16). Oversampling emergence locations improved detection outcomes, underscoring the importance of targeted sequencing, for example at ports of entry or of patients with prolonged viral replication (8). The connectivity of the introduction location did not impact detection times but did affect where infections occurred before variant detection (Supplement 3). Variants that emerged among residents of boroughs with more inward and outward mobility produced more infections in other boroughs. In our simulations, a variant first appearing in a resident of Manhattan, for example, caused more infections on average in Brooklyn and Queens than in Manhattan itself. Failing to adequately sample locations near emergence or those highly connected to emergence locations will lead to a disproportionate number of infections in those locations.
The number of undetected infections varied widely for a given transmission rate, even at fixed detection times (Fig 6). This result demonstrated the challenge of understanding the epidemiologic scenario on discovery of a new variant and the need for combining pathogen genome sequencing with other forms of surveillance. More work is also needed to understand whether optimal surveillance strategies differ if the primary objective is monitoring or detecting variants and how to position genomic surveillance within the broader landscape of sometimes competing public health objectives.
The model in this study was designed to be simple, while accounting for the most important factors affecting testing and sequencing, and to help attain a qualitative understanding of which parameters influence detection times and undetected infections. Consequently, the simulation results, such as the detection times, should not be interpreted as predictions. Specific simplifications include the modeling of single introductions of a novel variant, rather than accounting for multiple introductions or several variants. We also assumed homogeneous mixing within locations and did not account for age structure, social networks, or social determinants of health. SARS-CoV-2 infection risk varies across socioeconomic and demographic groups, due in part to variability in the average number of contacts, vaccine uptake, longand shortdistance mobility, comorbidities linked to more severe disease outcomes, and other social factors (17–19). While we incorporated neighborhood-level variations in movement, we did not include within-neighborhood heterogeneity or between-neighborhood variation in social determinants of health. A basic first analysis suggests that household income correlates negatively with undetected infections in our simulations (Supplement 4). Future work may explore how these heterogeneities influence emergence locations of new variants, disease dynamics, and consequently detection outcomes.
The model in this study also took a simplified perspective of genomic surveillance processes.
We assumed random sampling of positive tests and did not account for variations in specimen quality across testing sites or in access to testing, which may cloud estimates of the prevalence of circulating variants (20). In this sense, our model takes an idealized view of our capacity to sample randomly from the population in each borough or zip code.
Even though we incorporated human mobility data, the model did not account for human responses to the epidemiological situation, such as reduced contact rates during periods of high or rising prevalence, or changes in public health policies like mask mandates. Contact patterns vary over time, not just across places, and future studies should examine how new variant detection changes when incorporating time-varying human behavior.
Detection of novel variants remains a critical component of the response to the COVID-19 pandemic. Emerging empirical evidence on genomic surveillance of SARS-CoV-2 variants has allowed public health agencies to provide guidance on sampling strategies to detect and monitor variants, though more research is needed to anticipate the impact of these strategies under as yet unseen epidemiologic settings. This modeling study aimed to contribute to these ongoing efforts to assess variant detection strategies, by simulating detection outcomes for varying testing and sequencing rates in NYC. We focused on the role of geography in representative sampling as well as the context of variant emergence. The epidemiologic context, including emergence timing and background prevalence, played an important role in shaping detection times and undetected disease burdens. Targeted sampling of emergence locations was the primary aspect of geographical representativeness examined in our model that improved detection outcomes. Increased testing is an important tool to enable more representative sampling (9), though in well-resourced settings with random sampling, as was assumed in this case study of NYC, increasing sequencing capacity rather than testing had a larger impact on improving detection speeds.
Data Availability
Code and data are available at github.com/gradlab/detecting-sarscov2-variants.
Funding
K. Oliveira Roster gratefully acknowledges the support of the São Paulo Research Foundation (FAPESP) under grant 2021/11608-6. This project has been funded by contract 200-201691779 and 6NU50CK000517-01-07 with the Centers for Disease Control and Prevention. Disclaimer: The findings, conclusions, and views expressed are those of the author(s) and do not necessarily represent the official position of the Centers for Disease Control and Prevention (CDC).
Supplementary Materials and Methods
Data
Mobility
Through its Facebook Data for Good initiative, Meta provides aggregated and anonymized movement data from users who have enabled location sharing (12). It is available in 8-hour intervals and with a maximum geographic resolution of 600-by-600m tile sizes. We leveraged this data to compute the average share of users moving between geographic tiles on a given day, aggregated to both modified zip code areas (MODZCTA) and boroughs. This percentage was rescaled to the population size of each location, under the assumption that Facebook users who activate location sharing are representative of NYC residents, which may not be true in practice and may limit our ability to accurately capture human movement across the city. The movements were captured in the raw mobility matrix M, where an entry M{i−>j} indicates the number of movements from location i to location j. We then computed a contact matrix from the mobility matrix under the assumption of homogeneous mixing. The probability that a resident of location i has contact with a resident of location j in any location k was defined as:
Where μcantacts is the average number of contacts per person, M{i−>k} is the number of individuals moving from location i to location k, and Σl M{l−>k} is the sum of all individuals moving to location k. This contact matrix determined the coupling strength of two locations in the mathematical model, and therefore influenced the likelihood for an infection to spread between the two locations.
Testing and Sequencing Rates
Weekly COVID-19 testing rates and sequencing rates by MODZCTA are published by the NYC Department of Health and Mental Hygiene (16). The baseline test rate was calculated as the daily average from December 2020 to November 2021.
Compartmental Model Structure
We implemented a multi-patch, two-variant stochastic compartmental model which builds upon the basic structure of an SEIRS model. Exposure, testing, sequencing, isolation, recovery, waning (cross-)protection, and reinfection were treated as stochastic events. The model explicitly incorporated human mobility, testing and sequencing rates, as well as test sensitivity and specificity. It also included control measures to account for the fact that testing strategies may impact disease dynamics if infectious individuals reduce their contacts upon receiving a positive test result.
The model structure is presented in Supplemental Figure 1 and described in more detail below. The corresponding parameters are listed in Supplemental Table 1. Susceptible (S) individuals may be exposed (E) to one of two variants, represented by subscripts 1 and 2, or they may isolate if they receive a false positive test result (Sq). After a latent period, an exposed individual moves to one of five infectious compartments depending on their reporting and isolation outcome: an infection may be unreported (IU), for example if it is asymptomatic or detected only via an at-home test. A positive case may be detected at an official test center (IT), leading the person to isolate (ITq) or not (ITnq). Finally, the positive test may also be sequenced (IG), again separating individuals by their isolation status (IGq, IGnq). After recovery, individuals have temporary full protection against reinfection with either variant while they remain in compartments RU, RT, RG. Full protection wanes over time at variant-specific rates, making individuals susceptible to reinfection with one (S1{1}, S2{1}, S1{2}, S2{2}) or both variants (S12{1}, S12{2}). The model incorporates a variant-specific leaky immunity parameter (a11, a12, a21, a22), which defines the reduction in the probability of reinfection after full protection has waned.
Variant 1:
Variant 2: where I1 = IU1 + ITq1 + ITqn1 + IGq1 + IGnq1 and I2 = IU2 + ITq2 + ITnq2 + IGq2 + IGnq2
Where pt is the test rate, pg is the sequencing rate, pq is the control measure compliance rate, p6: is the true positive rate, ax{y} is the leaky immunity parameter, wx{y} is the waning full immunity parameter, b is the probability of infection given contact, L is the average duration of the latent period, D is the average duration of the infectious period, k is the contact matrix, and ϵ is the rate of ending control measures if false positive.
Simulations
Geographic Test Distribution
We implement the following primary strategies for testing allocation. Supplemental Figure 2 shows an example of the test rates under each scenario.
- Baseline: Tests are allocated according to the current NYC strategy, computed as the average allocation and daily test quantity between December 2020 and November 2021.
- Population density-based allocation: The same number of tests are distributed homogeneously across the city, so that each location has the same test rate. Locations with a larger population receive more tests than less densely populated boroughs or MODZCTAs.
- Random allocation: The available tests are distributed randomly across locations. This effectively results in higher per-capita test rates in less densely populated areas.
Test Volume
We varied the quantity of available tests from 5% to 400% of the NYC average of 7,184 tests per day (87 tests per 100,000 persons). Changing the quantity of available tests may be understood as either increased capacity or reduced reporting, for example due to a shift toward home testing. From February 2020 to September 2021, an estimated 75 percent of COVID-19 infections went unreported (21). With the emergence of the Omicron sub-variants BA.2.12 and BA.2.12.1, underreporting may be as high as 95 percent (22).
To further understand the role of the geographic test distribution, we also considered “focused testing scenarios”, where individual locations are over-sampled. In these scenarios, 20-100% of all tests were sampled from a single location and the remaining tests were distributed across the remaining locations proportional to population size. We then compared how detection out-comes differed when the new variant emerged in the over-sampled location versus one of the other locations.
Sequencing volume
The main analysis considered sequencing probabilities of 5-90% of all positive tests. The sequencing volume for each day of the simulation was therefore determined by the test volume and the number of infections. Raising test volume for a given epidemiologic scenario effectively raises sequencing volume at fixed sequencing probabilities.
In two sensitivity analyses, we sought to isolate the effect of testing and sequencing volumes by placing a cap on sequencing resources:
- Sensitivity analysis 1: At any time point, a fixed number of positive tests was sequenced. If there were fewer positive tests than sequencing resources (due to low prevalence), then all positive tests were sequenced. The maximum sequencing volume was reached when there was a sufficient quantity of positive tests, i.e., sufficiently high prevalence and test volume. The effective sequencing volume did not necessarily vary for different sequencing caps, if prevalence and/or test volume were low.
- Sensitivity analysis 2: A fixed number of tests were assigned as potentials for sequencing, akin to placing a stamp on a subset of test kits. If a test with a stamp was positive, the sample was sequenced. The maximum sequencing volume was achieved when all (stamped) tests were positive, which depended on the number of infections. The effective sequencing volume necessarily varied between the different sequencing caps.
Variants
The first variant was modeled to resemble the Delta SARS-CoV-2 variant with an effective transmission probability of 0.2. We then introduced a second variant, which was (i) more transmissible, (ii) had greater and faster immune evasion, or (iii) both. We considered effective transmission probabilities of 0.21-0.5.
Context of variant emergence
We simulated the emergence of the second variant at different locations and different times relative to the introduction of the first variant. The first variant was introduced with a single index case in each location, to simulate even spread across the city. The second variant was introduced with one index case in a single location, representing the location of residence of the index case. We simulated all possible introduction locations. Introduction times of the second variant varied from 0 to 150 days after the introduction of the first variant. The introduction times represent different contexts, because of varying prevalence of the first variant and varying numbers of susceptible individuals. We did not distinguish between variants that emerged within the city and those that were imported. The index case of the novel variant in our model could thus represent the first case of a newly emerging variant introduced to NYC from outside the city or a globally undetected variant that either emerged within NYC or was imported from outside before detection.
Outcome measures
For each simulation, we computed three primary outcome measures. The time to detection was defined as the number of days between the introduction and detection (first sequenced case) of the second variant. Cumulative undetected infections measured the total number of people who were exposed to (i.e., infected by) the second variant in NYC by the time the new variant was detected, including individuals who at detection time were in the latent phase prior to infectiousness, infectious, recovered, or susceptible to reinfection. Finally, we computed the standard deviation of the cumulative undetected infections across locations as an estimate of the geographic variation in disease burden.
Supplement 2: Sensitivity analysis of fixed sampling volumes
Supplement 3: Role of introduction location
Supplement 4: Social equity
We explored the relationship between socioeconomic variables and the average cumulative undetected exposures at the time of new variant detection across all NYC zip codes (Supplementary Fig. S7). We observed a weak negative correlation (Pearson correlation coefficient -0.39), with cumulative exposures decreasing with increasing household income, possibly due to lower population density in wealthier neighborhoods.
Acknowledgments
The authors thank Faten Takai for helpful feedback on the manuscript and the Public Health Lab whole genome sequencing and data units.
Footnotes
↵† Co-senior authors