Abstract
Many months into the SARS-CoV-2 pandemic, basic epidemiologic parameters describing burden of disease are lacking. To reduce selection bias in current burden of disease estimates derived from diagnostic testing data or serologic testing in convenience samples, we are conducting a national probability-based sample SARS-CoV-2 serosurvey. Sampling from a national address-based frame and using mailed recruitment materials and test kits will allow us to estimate national prevalence of SARS-CoV-2 infection and antibodies, overall and by demographic, behavioral, and clinical characteristics. Data will be weighted for unequal selection probabilities and non-response and will be adjusted to population benchmarks. Due to the urgent need for these estimates, expedited interim weighting of serosurvey responses will be undertaken to produce early release estimates, which will be published on the study website, COVIDVu.org. Here, we describe a process for computing interim survey weights and guidelines for release of interim estimates.
Introduction
SARS-CoV-2 is responsible for more than 190,000 deaths in the U.S. to date,[1] yet seven months into the pandemic, much remains unknown about how many individuals have been infected or their demographic characteristics. Diagnostic testing has been fraught with implementation challenges. Reported cases are largely a reflection of testing among those who suspect they may have been infected, so case reports are likely to severely undercount mildly symptomatic or asymptomatic infections, or infections among people unwilling or unable to be tested. Population-based screening strategies are urgently needed to understand population-level prevalence of SARS-CoV-2 infection and antibody response. To that end, SARS-CoV-2 serosurveys, which pair PCR and/or antibody screening with demographic and/or behavioral data collection, have recently been launched in Europe[2-4] and the U.S.[5-10] Serosurveys conducted or launched in the U.S. to date either are designed to produce seroprevalence estimates for individual counties or states[5, 6, 9, 11] or use convenience samples that may not be generalizable to the underlying population.[8, 10] To fill an on-going need for nationally representative estimates of SARS-CoV-2 disease prevalence, incidence, and antibody response, we are conducting a national probability-based sample serosurvey.[12] Using baseline serosurvey data, we will estimate national prevalence of SARS-CoV-2 infection and antibodies, overall and by demographic, behavioral, and clinical characteristics.
Probability-based sample surveys are needed for estimation of population-level prevalence that is robust to selection bias. Participants are selected at random with known and nonzero probabilities of selection to allow computation of extrapolation factors (weights) for inferential purposes. Weighting processes include computation of design weights to reflect selection probabilities of sampled units, as well as a series of adjustments to compensate for differential nonresponse and under-coverage.[13] Generally, weighting of survey data is undertaken after all survey responses have been collected, to allow a full treatment of observed nonresponse patterns. Due to the complexity of these analytic procedures, weighted estimates from such surveys are often released months after data collection is completed (see, for example, [14-16]). Because of the urgent need for population-based estimates of SARS-CoV-2 prevalence and antibody response, here we describe an expedited interim weighting procedure for serosurvey responses that we will use to produce early release estimates, which will be published on the study website, COVIDVu.org.
Methods
For this study, our overall target is a national sample of 4,000 U.S. adults completing study procedures, and an additional 3,584 adults residing in seven states of interest (CA, FL, GA, NY, IL, TX, WA). Sampling procedures have been previously described in detail.[12] Briefly, households will be selected from an address-based sampling frame created by Marketing Systems Group from the latest Delivery Sequence File of the U.S. Postal Service.[17] Recruitment materials and kits for self-collecting SARS-CoV-2 testing specimens will be mailed to households, and adults will be sampled for participation within households based on household enumeration. Generally, one adult per household will self-collect specimens for PCR and antibody testing and complete a survey, but full households will be included in 10% of randomly selected, participating households. Surveys will be completed online, and specimens will be returned through the mail for lab testing. These procedures will be repeated three months later with persons participating at baseline for incidence estimation. Primary study outcomes will be prevalence and incidence of SARS-CoV-2 infection and antibodies.[12]
Computation of survey weights
Weighting processes usually entail four major steps. In the first step, design weights are computed to reflect selection probabilities of households and, in the case of the present survey, subsampling of adults in sample households. In the second step, design weights are adjusted to correct for nonresponse observed during the survey administration. In the third step, nonresponse-adjusted design weights are adjusted against population benchmarks so that the final weights conform to the target population distributions with respect to a set of demographic characteristics. For general population surveys these characteristics include gender, age, race/ethnicity, education, household income, region, and metropolitan status. For adjustment to population benchmarks, an iterative procedure commonly known as raking is used so that respondents’ distributions can be adjusted to multiple benchmarks simultaneously.[18] Finally, weights are examined, and if necessary, trimmed at both ends of the distribution to avoid extreme weights that can result in unstable estimates.
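The raking step described above can be illustrated with a minimal sketch of iterative proportional fitting. This is not the study's implementation: the function name `rake`, the two toy variables, and the benchmark shares are hypothetical, and the sketch assumes every benchmark category has at least one respondent.

```python
from collections import defaultdict

def rake(records, base_weights, targets, max_iter=100, tol=1e-8):
    """Adjust weights so weighted shares match each marginal benchmark.

    records: one dict per respondent, e.g. {"sex": "F", "age": "18-44"}
    base_weights: starting weights (e.g. nonresponse-adjusted design weights)
    targets: {variable: {category: population share}}; shares sum to 1
    Assumes every category in `targets` is observed among respondents.
    """
    weights = list(base_weights)
    for _ in range(max_iter):
        max_change = 0.0
        for var, shares in targets.items():
            total = sum(weights)
            # Current weighted total within each category of this variable.
            cat_sum = defaultdict(float)
            for rec, w in zip(records, weights):
                cat_sum[rec[var]] += w
            # Factor that moves each category to its benchmark share.
            factors = {c: shares[c] * total / cat_sum[c] for c in cat_sum}
            for i, rec in enumerate(records):
                f = factors[rec[var]]
                weights[i] *= f
                max_change = max(max_change, abs(f - 1.0))
        if max_change < tol:  # all adjustment factors are close to 1
            break
    return weights

def weighted_share(records, weights, var, category):
    """Weighted proportion of respondents in one category."""
    hit = sum(w for r, w in zip(records, weights) if r[var] == category)
    return hit / sum(weights)

# Toy data: five respondents, hypothetical benchmarks for two variables.
records = [
    {"sex": "F", "age": "18-44"},
    {"sex": "F", "age": "45+"},
    {"sex": "M", "age": "18-44"},
    {"sex": "M", "age": "45+"},
    {"sex": "F", "age": "18-44"},
]
targets = {
    "sex": {"F": 0.51, "M": 0.49},
    "age": {"18-44": 0.45, "45+": 0.55},
}
raked = rake(records, [1.0] * len(records), targets)
```

Each pass rescales one variable at a time, so the margins of the other variables drift slightly and are corrected on later passes; the loop stops once all adjustment factors are close to 1.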
As is typically done in other weighting processes, missing demographic data for variables used to weight our survey data will be imputed prior to weight computation, although based on our previous work using web-based surveys, we expect minimal missing data for these variables.[19, 20] We will use a hot-deck imputation procedure to replace missing values, using observed values from respondents with non-missing data for a given element (“donors”) who are deemed to be otherwise demographically similar to respondents for whom the data element is missing.[21]
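A minimal sketch of hot-deck imputation within imputation classes is shown here for illustration. The function name `hot_deck_impute`, the choice of class variables, and the random selection of a donor are assumptions; the text does not specify how donor similarity will be defined.

```python
import random

def hot_deck_impute(records, target, class_vars, seed=0):
    """Fill missing `target` values from a random donor in the same class.

    A class is the combination of values of `class_vars`; donors are
    respondents in that class with an observed `target`. Records with
    no available donor are left missing. The input list is not modified.
    """
    rng = random.Random(seed)
    # Pool observed values by imputation class.
    donors = {}
    for rec in records:
        if rec.get(target) is not None:
            key = tuple(rec[v] for v in class_vars)
            donors.setdefault(key, []).append(rec[target])
    completed = []
    for rec in records:
        new = dict(rec)
        if new.get(target) is None:
            pool = donors.get(tuple(new[v] for v in class_vars))
            if pool:
                new[target] = rng.choice(pool)
        completed.append(new)
    return completed

# Toy data with hypothetical categories; None marks a missing value.
records = [
    {"age_group": "18-44", "region": "South", "education": "college"},
    {"age_group": "18-44", "region": "South", "education": "high_school"},
    {"age_group": "18-44", "region": "South", "education": None},
    {"age_group": "45+", "region": "West", "education": "college"},
    {"age_group": "45+", "region": "West", "education": None},
]
completed = hot_deck_impute(records, "education", ["age_group", "region"])
```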
Computation of interim survey weights
Given the urgent need to produce expedited estimates, we will employ an interim weighting methodology that abbreviates the standard steps described above in two notable ways. First, the nonresponse adjustment (Step 2) will be skipped and postponed until the final weighting process, when all respondents and nonrespondents have been identified. Second, due to the smaller sample sizes available for interim weighting, some of the weighting variables may be collapsed into coarser categories. For example, we may use four categories of education level in final weighting but collapse data into two categories of education level for interim weighting. This need for parsimony may also require replacement of multivariate raking benchmarks with their corresponding marginal distributions, or averages.
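The collapsing of weighting categories can be sketched as follows. The four education labels, the two-level mapping, and the respondent counts are hypothetical and serve only to show how a sparse cell disappears after collapsing.

```python
from collections import Counter

# Hypothetical four-level coding mapped to a two-level interim coding.
EDUCATION_COLLAPSE = {
    "less_than_high_school": "high_school_or_less",
    "high_school": "high_school_or_less",
    "some_college": "more_than_high_school",
    "college_or_more": "more_than_high_school",
}

def collapse(values, mapping):
    """Recode each value into its coarser category."""
    return [mapping[v] for v in values]

def smallest_cell(values):
    """Number of respondents in the least populated category."""
    return min(Counter(values).values())

# Toy interim sample of 100 respondents with one sparse category.
education = (
    ["less_than_high_school"] * 3
    + ["high_school"] * 40
    + ["some_college"] * 30
    + ["college_or_more"] * 27
)
education_interim = collapse(education, EDUCATION_COLLAPSE)
```

In this toy sample the smallest cell grows from 3 to 43 respondents after collapsing, so no small group has to carry a large raking adjustment.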
Interim weighting will seek to balance potential bias reduction against variance inflation, which is an inevitable consequence of weighting. To accomplish this, we will ensure that (1) there are enough respondents to “carry” the weights, avoiding an unstable scenario in which a few respondents with extreme weights heavily influence the resulting estimates, and (2) the unequal weighting effect is kept to a minimum to avoid undue loss of precision due to excessive weighting.
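The trade-off can be made concrete with a small sketch that measures the unequal weighting effect before and after trimming. The sketch assumes Kish's approximation (one plus the squared coefficient of variation of the weights) as the measure; the trimming bounds and the toy weights are hypothetical.

```python
def kish_deff(weights):
    """Kish's approximation: n * sum(w^2) / (sum(w))^2, which equals 1 + CV^2."""
    n = len(weights)
    total = sum(weights)
    return n * sum(w * w for w in weights) / (total * total)

def effective_sample_size(weights):
    """Respondent count discounted by the unequal weighting effect."""
    return len(weights) / kish_deff(weights)

def trim_weights(weights, lower, upper):
    """Clip weights to [lower, upper], then rescale to the original total.

    Rescaling can push clipped weights slightly past `upper`; production
    procedures usually iterate, which this sketch does not.
    """
    clipped = [min(max(w, lower), upper) for w in weights]
    scale = sum(weights) / sum(clipped)
    return [w * scale for w in clipped]

# Toy weights: eight typical respondents and two with extreme weights.
weights = [1.0] * 8 + [5.0, 10.0]
trimmed = trim_weights(weights, lower=0.5, upper=3.0)

print(f"Deff before trimming: {kish_deff(weights):.3f}")
print(f"Deff after trimming:  {kish_deff(trimmed):.3f}")
print(f"Effective n before:   {effective_sample_size(weights):.2f}")
```

Trimming lowers the variance inflation at the cost of some of the bias reduction that the untrimmed weights provided, which is the balance the interim procedure aims to strike.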
Following the above guidelines, the first set of interim weights will be produced after 1,000 surveys have been completed and accompanying specimens have been returned. We will require at least 1,000 respondents for interim national estimates and 200 respondents for sub-group estimates (e.g., by age group, race, or state for over-sampled states). As responses accumulate, more robust sub-group estimates will be made possible by increasing the granularity of the weighting adjustments. Having adequate precision for interim estimates using these sample size guidelines assumes, on average, 1% prevalence of SARS-CoV-2 virus and 3% prevalence of SARS-CoV-2 antibodies.[5, 7, 10, 11] If observed prevalence for these outcomes is higher overall or in sub-groups, we may reduce the minimum sample size requirements. After adequate sample size is reached for computation of the first set of interim weights and estimation of prevalence of SARS-CoV-2 infection and antibodies, interim weighting and outcome estimation and dissemination will be conducted periodically until the serosurvey is complete.
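To give a rough sense of the precision implied by these sample sizes, the sketch below computes an approximate standard error and 95% half-width for a prevalence estimate. It assumes a simple normal approximation with the variance inflated by a design effect; the default design effect of 1.0 is a placeholder rather than a study value, and the normal approximation is crude at prevalences this low.

```python
import math

def standard_error(prevalence, n, deff=1.0):
    """Approximate SE of a proportion, inflated by a design effect."""
    return math.sqrt(deff * prevalence * (1.0 - prevalence) / n)

def half_width_95(prevalence, n, deff=1.0):
    """Half-width of a normal-approximation 95% confidence interval."""
    return 1.96 * standard_error(prevalence, n, deff)

# Assumed prevalences from the text: 1% virus, 3% antibodies.
for label, p in [("virus", 0.01), ("antibodies", 0.03)]:
    for n in (1000, 200):
        hw = half_width_95(p, n)
        print(f"{label}: p={p:.2f}, n={n}, 95% half-width ~ {hw:.4f}")
```

Because the standard error scales with the square root of the design effect, a design effect of 4 would double each half-width.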
Presentation of final estimates
When the serosurvey is complete and final weights have been computed, we will use the resulting design effect (Deff) as a surrogate to assess the impact of unequal weighting and to set guidelines for adequate stability of estimates to be presented. The Deff is a commonly used metric of the efficiency of a weighting methodology that captures the impact of unequal weighting across respondents. While application of weights tends to improve the representation of survey respondents, and hence reduces bias in survey estimates, this gain comes at a cost in precision because weighting increases the variance of survey estimates. The inflation due to weighting can be approximated by the following formula, in which Wi represents the final weight of the ith respondent[22, 23]:
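A widely used form of this approximation, usually attributed to Kish, is sketched here. It is offered as an assumption about the intended expression; the symbols n (number of respondents) and \bar{W} (mean final weight) are not defined in the text above, and some presentations divide the sum of squared deviations by n - 1 rather than n.

```latex
\mathrm{Deff} \approx 1 + \frac{\frac{1}{n}\sum_{i=1}^{n}\left(W_i - \bar{W}\right)^2}{\bar{W}^2}
= \frac{n \sum_{i=1}^{n} W_i^2}{\left(\sum_{i=1}^{n} W_i\right)^2},
\qquad \bar{W} = \frac{1}{n}\sum_{i=1}^{n} W_i
```

Under this form, equal weights give a Deff of 1, and the value grows with the relative variability of the weights.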
Results
Recruitment for this study began in July 2020. We anticipate that interim findings on prevalence of SARS-CoV-2 infection and antibodies will be available on COVIDVu.org by November 2020.
Discussion
Population-based estimates of national SARS-CoV-2 infection and antibody prevalence are critical for improving our understanding of burden of COVID-19 disease. Expediting such estimates through the use of interim weighting will allow data from our on-going probability-based sample survey to inform prevention and control measures during a time when they are acutely needed. While interim estimates may diverge somewhat from final estimates due to evolving data availability, the emergency nature of the SARS-CoV-2 pandemic requires that precision of key estimates be balanced against timeliness. Both interim and final estimates will be publicly available on COVIDVu.org, where they will be interpreted and visualized for use by researchers, policy makers, clinicians, and public health program administrators. Estimates of SARS-CoV-2 infection and antibody prevalence, overall and in sub-groups, will provide an empirical foundation that will allow other surveillance data sources to be contextualized and used more robustly.
Data Availability
Data collection is on-going but interim estimates will be available on the study website in the coming months.
Footnotes
* Co-first authors
Funding: NIAID 3R01AI143875-02S1, Woodruff Foundation