Abstract
A population-representative serological study was conducted in all districts of the state of Tamil Nadu (population 72 million), India, in October-November 2020. State-level seroprevalence was 31.6%. However, this masks substantial variation across the state. Seroprevalence ranged from just 11.1% in The Nilgris to 51.0% in Perambalur district. Seroprevalence in urban areas (36.9%) was higher than in rural areas (26.9%). Females (30.8%) had similar seroprevalence to males (30.3%). However, working age populations (age 40-49: 31.6%) have significantly higher seroprevalence than the youth (age 18-29: 30.7%) or elderly (age 70+: 25.8%). Estimated seroprevalence implies that at least 22.6 million persons were infected by the end of November, roughly 36 times the number of confirmed cases. Estimated seroprevalence implies an infection fatality rate of 0.052%.
Introduction
Knowledge of population-level immunity is critical for understanding the epidemiology of SARS-CoV-2 (COVID-19) and formulating effective infection control, including the allocation of scarce vaccines. Tamil Nadu is the 6th most populous state in India, with roughly 72 million persons1. It has reported roughly 820,000 COVID-19 cases and 12,000 deaths, ranked 4th and 2nd highest, respectively, among Indian states2. Reported cases are not, however, gathered from population-representative samples.
The state conducted a population-level seroprevalence survey of 26,640 adults across the 37 districts of the state in October-November 2020. We report seroprevalence estimates from this survey by district, by demographic groups, and by urban status. We also compare the results of the survey to estimates from reported cases to measure the degree to which serological surveys underestimate population immunity.
Methods
This study was approved by the Directorate of Public Health & Preventative Medicine, Government of Tamil Nadu, and the Institutional Ethics Committee of Madras Medical College, Chennai, India.
Outcomes
The first primary endpoint of the study is the rate of positive results on CLIA antibody tests at the district-level. The second primary endpoint is seroprevalence rates at the district-level once seropositivity rates are adjusted for test inaccuracy.
There are multiple secondary endpoints. One set is seroprevalence (a) by age categories and sex, (b) by rural or urban status, and (c) at the state level. A final secondary outcome is the difference between population immunity estimated by serological survey and by reported cases.
Sample and location
Individuals residing in Tamil Nadu and ages 18 years and older were eligible for this study. The exclusion criteria were refusal to consent and contraindication to venipuncture.
Sample size
The state sought to sample roughly 300 persons per 1 million population based on the 2011 Indian Census. The state’s population is organized into districts, districts are organized into health unit districts (HUD), and HUDs are organized into clusters, defined as a street in urban areas and habitations in rural areas. Within each cluster the study aimed to sample 30 individuals. Therefore, the sample rate can be converted into a target number of clusters per HUD. In total 888 clusters were sampled.
Sample selection
The study selected participants within a HUD in three steps. First, within each HUD, the study randomly selected clusters. Second, within each cluster, we selected a random GPS starting point. Third, we sampled one participant from households adjacent to that starting point until 30 persons consented within a cluster. Within each household, the member asked to provide a biosample was selected via the Kish method3.
Data collection and timing
Data was collected between October 19 - November 30, 2020. Each participant was asked to complete a health questionnaire and provide 5ml venous blood collected in EDTA vacutainers. Serum was analyzed for IgG antibodies to the SARS-CoV-2 spike protein using either the iFlash-SARS-CoV-2 IgG (Shenzhen YHLO Biotech; sensitivity of 95.9% and specificity of 95.7% per manufacturer)4 or the Vitros anti-SARS-CoV-2 IgG CLIA kit (Ortho-Clinical Diagnostics; sensitivity of 90% and specificity of 100% per manufacturer)5. Clusters were characterized as rural if they were in Census-defined villages. The government shared data on each reported COVID-19 case and death, including date and demographics.
Statistical analysis
We estimate the proportion of positive CLIA tests by district by estimating a weighted logit regression of test result on district indicators and reporting the inverse logit of the coefficient for each district indicator. Observations are weighted by the inverse of sampling probability for their age and gender groups. Standard errors calculated account for correlations at the cluster level.
We estimate the seroprevalence by district in two steps. First, we calculate the weighted proportion of positive tests at the district level everywhere except Chennai, where we calculate it at the health unit district (HUD), a subset of districts. All samples in a district use the same CLIA kit, except in Chennai, where all samples in a HUD do so. We estimate a weighted logit regressions of test results on district indicators outside Chennai and HUD indicators in Chennai and take the inverse logit of the coefficient for each jurisdiction indicator. Observations are weighted by the inverse of sampling probability for their age and gender groups. Standard errors are clustered at the cluster level. Second, for each jurisdiction, we predict seroprevalence using the Rogan-Gladden formula, test parameters for the kit used in each jurisdiction, and regression estimates of seropositive proportion by jurisdiction. In Chennai district, we calculate seroprevalence at the district level as a weighted average of seroprevalence at the HUD level using as weights the share of clusters in each HUD. (We employ this approach to Chennai in the estimators below.)
We estimate the seroprevalence in the state by aggregating the seroprevalence across districts weighted by 2011 census data on the relative populations of districts.
We estimate seroprevalence by demographic group in three steps. First, we calculate the proportion of positive tests at the jurisdiction-by-demographic group level using logit regressions of test results on jurisdiction-by-demographic group indicators. Demographic groups indicators are sex x age for 6 age bins. Standard errors are clustered at the cluster level. Second, we predict district-by-demographic group level seroprevalence using the Rogan-Gladden formula. Third, we compute the weighted average of seroprevalence at the demographic-group level using as weights the share of demographic-group population in each district using data from the 2011 Indian census.
We estimate seroprevalence by urban status in the same manner we estimate it for demographic groups with two changes. First, we use the urban status of a cluster in lieu of demographic status of an individual at each step. Second, observations in our regression are weighted by inverse of the sampling probability for their urban status.
We estimate the infection fatality rate (IFR) for a population (defined by state, district, demographic group, or urban status) by dividing total SARS-CoV-2-related deaths reported as of two days after the last date of sampling in a district by the estimated size of previously infected persons in that population. We obtain data on deaths by district and demographic group from daily reports from the Tamil Nadu government. The date of the death count reflects that fact that the delay the delay between infection and death is on average two days longer than the delay between infection and seroconversion.6, 7
We estimate the size of a population that was previously infected by multiplying our seroprevalence estimates for that population by the size of those populations as reported in the 2011 census. We estimate the degree of undercounting of cases in a population by dividing the estimated number previously infected in the population by the number of officially reported cases in that population as of 1 week before the median sampling date (median October 23, 2020). We obtain data on officially reported cases from the government of Tamil Nadu. The lag accounts for the delay both between infection and seropositive status and between infection and prevalence testing.
We calculate the Pearson’s correlation coefficient between IFR and age at the individual level, between undercounting rate and testing rate (tests per million as of median date of testing) by district, and between the number of SARS-CoV-2-related deaths and the testing rate by district.
Statistical tests comparing groups are performed using a two-sided Wald test with 95%.
All statistical analyses were conducted with Microsoft Excel 365 (Microsoft, USA) and Stata 16 (StataCorp, USA). All plots were generated in R.
Results
One person aged 16 was incorrectly consented and dropped from the analysis. The study was unable to obtain lab results for 287 additional observations. The study employs 26,135 test results. Because numerous clusters had fewer than 30 persons consenting, the study yielded 292 samples per million population.
Table 1 reports the demographic characteristics of the sample. Age is missing for 8 persons and 6 persons are transgender. The sample has substantially more females and fewer persons age 18-29 than the general population.
Seropositivity varies dramatically across the state, from 12.1% in The Nilgris to 49.3% in Perambalur district (Figure 1). Seroprevalence has a similar pattern to seropositivity (Figure 2). Seroprevalence is significantly greater in urban areas (36.9%; rural, 26.9%; p<0.001) (Table 2). State level seroprevalence is 31.6% (95% CI: 30.4-32.8%).
Seroprevalence is not significantly different across sexes (females, 30.8%; males; 30.3%; p=0.25) (Table 2). Seroprevalence among the elderly (70+: 25.8%) is significantly lower than among the working age populations (age 40-49: 31.6%; p<0.001) or the young (18-29: 30.7%; p<0.001), respectively (Table 2).
The IFR varies substantially across the state, from 0.007% in Perambalur to 0.203% in Chennai (Table 3). The IFR increases with age (ρ=0.8791; p=0.021) and is higher among males (0.11%) than among females (0.04%; p=<0.001) (Table 4).
Ratio of actual cases to confirmed cases ranges from 9 in Chennai to 144 in Perambalur (Table 3). There is a negative and significant (ρ=-0.58; p<0.001) correlation between testing rate and the undercount rate (Figure 3).
Discussion
Overall seroprevalence (31.6%) implies that at least 22.7 million persons were infected by November 30, 2020, the last day of serological sampling. Thus, the actual number of infections is roughly 36 times larger than the number of confirmed cases, which totaled 670.392 by 15 October 2020, 7 days before the date the median biosample is collected.
Seroprevalence is highest among working age populations. The lower seroprevalence among the young is not informative about the value of closing schools because the young in our sample were over age 18. Moreover, the difference in seroprevalence among the young and working ages is not significant. The significantly lower seroprevalence among the old is different than what was found in a recent Karnataka seroprevalence study8 and suggests that the elderly in Tamil Nadu have been somewhat protected even in multigenerational households.
Higher seroprevalence in urban areas is consistent with higher density in urban districts (Figure 4). It does not seem positively correlated with mobility as measured by average decline in non-residential Google mobility measures.
The IFR across the state is 0.052%, nominally lower than that in Karnataka8 and Mumbai9. The lower IFR is not due to lower infection rate amongst the elderly because IFR is nominally lower even among the elderly. Higher IFR among males is consistent with the literature10.
Our study has several limitations. One is that, because antibody concentrations in infected persons decline over time11, our estimate of seroprevalence may underestimate the level of prior infection and perhaps natural immunity.
Second, we may underestimate IFR. The number of deaths per million is positively correlated (ρ=0.96; p<0.001) with testing rate per million at the district level (Table 3). Perhaps increasing the testing rate would show greater deaths from SARS-CoV-2.
SUPPLEMENT
Methods
Sampling
Numerous clusters had less or more than 30 persons sampled. We report those in Table S 1. We retain all samples because it is unclear which samples to drop from clusters with >30 observations.
Sample size
Sampling 300 per million would lead to different sample sizes per district. Because some clusters yielded less than 30 persons per cluster, the study produced just 292 samples per 1 million population. The following table presents the sample size obtained and the resulting minimum detectable effect in each district. These are reported in Table S 2.
Assuming a design effect of 2, the implied minimum detectable effect (MDE) per district varies from 3.3 (Chennai) to 13.6 (Nagapattinam) percentage points.
Sample
Suspected or confirmed current or prior COVID-19 infection was not an exclusion criterion. If a participant was currently receiving medical care for COVID-19, a family member or proxy was used to complete the questionnaire on the participant’s behalf; however, the blood sample was taken from the participant.
Data collection
Blood was collected in EDTA vacutainers. Serum was isolated and stored in Eppendorf tubes. Serum was analyzed using either of two chemiluminescent immunoassay (CLIA) kits.
The first kit was the iFlash-SARS-CoV-2 IgG kit from Shenzhen YHLO Biotech. Per the manufacturer, it has a sensitivity of 95.9% (95% CI: 93.3-97.5%) and specificity of 95.7% (95% CI: 92.5-97.6%)4. Independent analysis estimated a sensitivity of 93% (95% CI: 84.3–97.7%) and specificity of 92.9% (95% CI: 85.3–97.4%)12.
The second kit was the Vitros anti-SARS-CoV-2 IgG CLIA from Ortho-Clinical Diagnostics. Per the manufacturer it has 90% sensitivity (95% CI: 76.3-97.2%) and 100% specificity (95% CI: 99.1–100.0%)5. FDA evaluation suggests it has 100% sensitivity (95% CI: 88.7-100%) and 100% specificity (95% CI: 95.4-100%)13. Independent analysis estimated that it has a sensitivity of 98.8% (95% CI: 92.9-100%) and specificity of 97.3% (95% CI: 85-100%)14.
All the samples in a district are analyzed using the same kit, with the exception of the Chennai. In Chennai 2 HUDs used 1 kit, one used the other. Table S 3 reports the test kit used in each district.
Statistical analysis
Nagapattinam district was split into Nagapattinam and Mayiladuthurai districts in March 2020, after the state started reported data on confirmed cases but before we conducted our serological survey. We aggregate these two districts together in our estimates of seropositivity and seroprevalence.
In Chennai, we do not have the population by HUDs. Since the samples were drawn proportional to population, we divide the district population across the HUDs in proportion to the sample size.
When estimating our district-level seroprevalence, the weights for our regression analysis employ data from the 2011 Census for the population in each age x gender category in each district. We estimate the sampling probability for demographic group (age category x sex) as the number of observations in that group in the sample in a district divided by the census population in that group in a district.
When estimating our urban- and rural-level seroprevalence, the weights for our regression analysis employ data from the 2011 Census for the population in each urban/rural category in each district. We estimate the sampling probability for urban/rural group as the number of observations in that group in the sample in a district divided by the census population in that group in a district.
We calculate the sampling probabilities for each regression observation at the level of 2011-defined districts (of which there are 32) rather than the 2020-defined districts (of which there are 38), HUDs or clusters because the population is available only at the level of the old 32 districts. Likewise, we calculate district weights when we aggregate estimates across districts using the 32, 2001 districts. The 38, 2020 districts are all the same or bifurcations of the 32, 2011 districts. Fortunately, in all bifurcated districts, the same kit was used. Therefore, we can combine all bifurcated districts into older 2011 districts for purposes of calculating sampling probabilities in regression analyses or weights when aggregating estimates.
Data Availability
Aggregated data are reported in the Supplement.
Footnotes
↵# independent
Supplemental files updated.