ABSTRACT
Accurate estimates of the prevalence of asymptomatic SARS-CoV-2 infections, ψ, have been important for understanding and forecasting the trajectory of the COVID-19 pandemic. Two-part population-based surveys, which test for infection and also assess symptoms, have been used to estimate ψ. Here, we identified a widely prevalent confounding effect that compromises these estimates and devised a formalism to adjust for it. The symptoms associated with SARS-CoV-2 infection are not all specific to SARS-CoV-2. They can be triggered by a host of other conditions, such as influenza virus infection. By not accounting for the source of the symptoms, the surveys may misclassify individuals experiencing symptoms from other conditions as symptomatic for SARS-CoV-2, thus underestimating ψ. We developed a rigorous formalism to adjust for this confounding effect and derived a facile formula for the adjusted prevalence, ψadj. We applied it to data from 50 published serosurveys, conducted on general populations in 28 countries. We found that ψadj was significantly higher than the reported prevalence, ψc (P = 3×10⁻⁸). The median ψadj was ∼60%, whereas the median ψc was ∼40%. In several instances, ψadj exceeded ψc by >100%. These findings suggest that asymptomatic infections have been far more prevalent than previously estimated. Our formalism can be readily deployed to obtain more accurate estimates of ψ from standard population-based surveys, without additional data collection. The findings have implications for understanding COVID-19 epidemiology and devising more effective interventions.
INTRODUCTION
Asymptomatic SARS-CoV-2 infections have been a major contributor to the spread of the COVID-19 pandemic, with nearly a quarter of all transmission events attributed to them1. They also represent a key outcome of COVID-19 vaccination; vaccine efficacies have been estimated as the fraction of potentially symptomatic infections rendered asymptomatic by vaccination in clinical trials2-4. Accurate estimation of the prevalence of asymptomatic infections, ψ, is thus important for understanding COVID-19 epidemiology and for designing and assessing public health interventions. A large number of surveys, conducted throughout the pandemic, have offered estimates of ψ5,6. Here, we recognized an important confounding factor that compromises these estimates and devised a formalism to adjust for it.
The surveys contain two parts: 1) a nucleic acid or an antibody test to detect SARS-CoV-2 infection, and 2) a questionnaire to assess the symptoms experienced. Individuals who test positive for the infection but declare no symptoms are deemed asymptomatically infected. ψ is thus estimated as the fraction of test-positive cases that reports no symptoms. The confounding effect arises from the symptoms assessed not being specific to COVID-19. Symptoms such as cough and fever, which are part of nearly all COVID-19 surveys, can be triggered not only by SARS-CoV-2 infection but also by a host of other infections including influenza and circulating coronaviruses. It is possible, therefore, that some individuals who reported symptoms in the surveys may have had them due to the other conditions. Such individuals should be classified as asymptomatic for SARS-CoV-2 but get misclassified as symptomatic, resulting in a systematic underestimation of ψ.
Evidence of this misclassification exists in the data gathered by the surveys: The surveys identify individuals who test negative for SARS-CoV-2 but report symptoms. For instance, a survey from The Netherlands reported that ∼62% of the individuals who tested negative for SARS-CoV-2 displayed symptoms7. The number was as high as 80% in a survey in the US8,9. These individuals must have had their symptoms arise from causes other than SARS-CoV-2 infection. The high prevalence of such individuals in these surveys implies that at least some of the test-positive, symptomatic cases may have had their symptoms arise from non-COVID conditions. Adjusting for this confounding effect is important to obtain accurate estimates of ψ.
The adjustment is challenging because of the two-part survey methodology, with the tests used in the first part, to assess SARS-CoV-2 infection, limited by their own sensitivities and specificities. Thus, the test-negative, symptomatic individuals, discussed above, may not all have been uninfected; some who had the infection may have been classified as test-negative because the antigen (or antibody) levels in them were below assay detection limits. Indeed, the symptoms they experienced may well have arisen from SARS-CoV-2 infection. Thus, the adjustment for the non-specificity of the symptoms must also simultaneously account for the sensitivity and specificity of the SARS-CoV-2 test. Here, we developed a formalism that accomplished that. We applied our formalism to data from 50 published serosurveys, conducted in 28 countries across continents, and found that the adjusted ψ was significantly higher than previously reported. Indeed, in several instances, the previous estimates had to be revised upward by over 100%.
RESULTS
Formalism to adjust for symptom specificity
We developed our formalism for the general scenario where the goal is to estimate the prevalence of asymptomatic infections caused by a pathogen of interest when another pathogen that could trigger similar symptoms is also circulating in the population, confounding the estimates. We assumed that data relating to the pathogen of interest was gathered following the two-part survey methodology described above. The detailed derivation is presented in Methods. Here, we let the pathogen of interest be SARS-CoV-2 and the other pathogen represent the collection of all other conditions with symptoms that overlap with those of SARS-CoV-2 infection. Remarkably, we obtained a closed-form expression for the adjusted prevalence of asymptomatic SARS-CoV-2 infections, ψadj:
$$\psi_{\mathrm{adj}} = 1 - \frac{(1-\phi_c-\psi_c)\,(\alpha+\beta-1)\,\rho_c\,(1-\rho_c)}{(\rho_c+\beta-1)\,\left[\alpha\,(1-\phi_c)\,(1-\rho_c) - (1-\alpha)\,\psi_c\,\rho_c\right]} \tag{1}$$
Here, α and β are the SARS-CoV-2 test sensitivity and specificity, respectively, ψc is the crude (or unadjusted) prevalence of asymptomatic cases among test-positive individuals, ρc is the crude fraction of test-positive cases among the sampled individuals, and ϕc is the crude proportion of symptomatic cases among test-negative individuals. Thus, given the set of quantities S = {α, β, ρc, ϕc, ψc}, all of which are typically reported in surveys, ψadj can be readily calculated.
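As an illustration of how the quantities in S might be assembled and used, the following minimal Python sketch computes the crude fractions from a 2×2 table of counts and then evaluates the adjusted prevalence. The function names, the count layout, and the example numbers are hypothetical and are not taken from any survey; the expression coded here is the one obtained by following the steps in Methods (adjusting the seroprevalence, then solving for ψadj), so it should be read as a sketch under those assumptions rather than as the authors' implementation.

```python
def crude_quantities(pos_asym, pos_sym, neg_asym, neg_sym):
    """Crude survey fractions from a 2x2 table of counts (hypothetical layout).

    pos_asym / pos_sym: test-positive individuals without / with symptoms
    neg_asym / neg_sym: test-negative individuals without / with symptoms
    """
    n_pos = pos_asym + pos_sym
    n_neg = neg_asym + neg_sym
    rho_c = n_pos / (n_pos + n_neg)  # crude fraction testing positive
    phi_c = neg_sym / n_neg          # symptomatic fraction among test-negatives
    psi_c = pos_asym / n_pos         # asymptomatic fraction among test-positives
    return rho_c, phi_c, psi_c


def psi_adjusted(alpha, beta, rho_c, phi_c, psi_c):
    """Adjusted asymptomatic prevalence from S = {alpha, beta, rho_c, phi_c, psi_c}."""
    if alpha + beta <= 1:
        raise ValueError("test must be informative: alpha + beta > 1")
    if not 0 < rho_c < 1:
        raise ValueError("rho_c must lie strictly between 0 and 1")
    # Prevalence adjusted for test sensitivity and specificity
    rho_adj = (rho_c + beta - 1) / (alpha + beta - 1)
    # P[X+|T+] and P[X+|T-] from Bayes' theorem
    p_inf_given_pos = alpha * rho_adj / rho_c
    p_inf_given_neg = (1 - alpha) * rho_adj / (1 - rho_c)
    omega_c = 1 - phi_c  # asymptomatic fraction among test-negatives
    denom = omega_c * p_inf_given_pos - psi_c * p_inf_given_neg
    return 1 - (omega_c - psi_c) / denom


# Hypothetical survey: 1000 sampled, 100 test-positive (40 asymptomatic),
# 900 test-negative (270 symptomatic); assumed sensitivity 0.9, specificity 0.98.
rho_c, phi_c, psi_c = crude_quantities(40, 60, 630, 270)
print(psi_adjusted(0.9, 0.98, rho_c, phi_c, psi_c))
```

With a perfect test (α = β = 1) the same function returns ψc/(1 − ϕc), which offers a quick sanity check on any implementation.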
Adjusted estimates of ψ from serosurveys
To apply our formalism, we collated data from published serosurveys (Table S1)7-56. Although our method applies also to surveys using nucleic acid-based (PCR) testing, serosurveys have been preferred for assessing asymptomatic SARS-CoV-2 infections because nucleic acid-based testing could miss presymptomatic individuals, who do not display symptoms at the time of testing but develop them later5,57. Serosurveys seek symptoms experienced during a longer ‘recall period’, which renders them more susceptible to confounding from other conditions with overlapping symptoms, highlighting the need for the present adjustment.
We considered serosurveys in the early phase of the pandemic, before vaccination programs began, to eliminate any confounding effect of symptoms elicited by vaccines. We restricted our analysis to studies with a sample size of ≥ 500, as smaller datasets could introduce significant uncertainties in our calculations58. We excluded studies on samples biased by symptom status, such as hospitalized patients or long-term care facilities, and focused instead on studies sampling the general population. We, of course, also excluded studies that did not provide all the quantities in S required for the adjustment. With these criteria, we identified 50 serosurveys that were amenable to our analysis. Three of these studies13,39,56 estimated ψ at three different time points, resulting in a total of 56 estimates of ψ (Table S1). The selected studies spanned 28 countries across Asia, the Americas, Europe, and Africa, covering a broad spectrum of epidemiological settings.
To first assess the prevalence and scale of the confounding effect due to the non-specificity of symptoms, we examined the fraction, ϕc, of seronegative individuals who reported symptoms across the surveys. ϕc varied from 0 to 0.8 with a median of 0.31 (Figure 1A), indicating that overlapping symptoms commonly arose from other conditions and could therefore significantly affect estimates of ψ. Furthermore, although most surveys employed antibody tests with high sensitivity and specificity, several reported sensitivities ≤0.85 (Table S1), potentially amplifying the confounding effect.
The crude seroprevalence, ρc, varied from 0.01 to 0.58 across the studies, with a median of 0.11, representing a wide range of the extent of spread of the infection in the populations studied at the time of the surveys (Figure 1B).
The surveys reported widely varying estimates of the crude prevalence of asymptomatic infections, ψc, spanning the range from 0.068 to 1 with a median of 0.40 (Figure 1C, orange). Using equation (1), we calculated the adjusted prevalence, ψadj, for all 56 estimates of ψc. ψadj varied from 0.04 to 1.00 with a median of 0.60 (Figure 1C, red). We found overall that ψadj was significantly larger than ψc (P = 3×10⁻⁸ using the Wilcoxon signed-rank test; Figure 1D). We defined η = 100 × (ψadj − ψc)/ψc as the percentage increase in ψ due to the adjustment. Out of the 56 estimates, 9 had η > 100%, 10 had η in the range 50-100%, 12 in the range 25-50%, 21 between 0% and 25%, and 4 had η < 0% (Figure 1E).
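The percentage increase η and its grouping into the ranges quoted above can be sketched as follows. The function names are hypothetical, and the treatment of the bin edges (which range a value of exactly 25%, 50%, or 100% falls into) is an assumption, since the text does not specify it.

```python
def percent_increase(psi_adj, psi_c):
    """eta = 100 * (psi_adj - psi_c) / psi_c, the percentage change due to adjustment."""
    if psi_c <= 0:
        raise ValueError("psi_c must be positive for eta to be defined")
    return 100.0 * (psi_adj - psi_c) / psi_c


def eta_bin(eta):
    """Assign eta to one of the ranges used in the text (edge handling is assumed)."""
    if eta < 0:
        return "<0%"
    if eta <= 25:
        return "0-25%"
    if eta <= 50:
        return "25-50%"
    if eta <= 100:
        return "50-100%"
    return ">100%"


# Illustrative values only: the reported medians, psi_c = 0.40 and psi_adj = 0.60
eta = percent_increase(0.60, 0.40)
print(round(eta, 1), eta_bin(eta))
```

Note that the medians of ψc and ψadj need not belong to the same survey, so the value printed here is only an illustration of the definition and not a reported result.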
Factors contributing to the adjustment
To identify the quantities in S most responsible for the adjustment in the datasets we considered, we calculated pairwise correlations of η with each quantity in S. We found that ϕc was strongly positively correlated with η (Spearman’s coefficient rs = 0.86, P < 10⁻¹⁶) (Figure 1F). β showed a moderate positive correlation with η (rs = 0.30, P = 0.026) (Figure 1G). The other quantities were not significantly correlated with η (Figure S1). Thus, the non-specificity of the symptoms was the major contributor to the adjustment. Indeed, for the 9 estimates with η > 100%, ϕc was > 50%.
Our expression in equation (1) reduced, when α = β = 1, to
$$\psi_{\mathrm{adj}} = \frac{\psi_c}{1-\phi_c},$$
showing how ϕc would contribute to the adjustment even with a perfect antibody test (in this limit, η = 100 × ϕc/(1 − ϕc)) and explaining the positive correlation between η and ϕc. For imperfect antibody tests, where α < 1 and/or β < 1, ψadj displayed a more complex dependency on the quantities in S (equation (1)). In the absence of symptom overlap (ϕc = 0), equation (1) reduced to
$$\psi_{\mathrm{adj}} = 1 - \frac{(1-\psi_c)\,(\alpha+\beta-1)\,\rho_c\,(1-\rho_c)}{(\rho_c+\beta-1)\,\left[\alpha\,(1-\rho_c) - (1-\alpha)\,\psi_c\,\rho_c\right]},$$
allowing ψadj to be larger or smaller than ψc depending on the specific values of α, β, and ρc. When α = 1, for instance,
$$\psi_{\mathrm{adj}} = 1 - \frac{(1-\psi_c)\,\beta\,\rho_c}{\rho_c+\beta-1} \le \psi_c.$$
(The latter inequality follows because (1 − β)(1 − ρc) ≥ 0 and hence βρc/(ρc + β − 1) ≥ 1.) Indeed, the reduction in ψ due to imperfect test sensitivity and specificity may dominate the increase due to overlapping symptoms, explaining the few instances with η < 0% above. Nonetheless, in all but 4 of the 56 instances we studied, we found ψadj ≥ ψc, highlighting the dominant effect of the adjustment due to symptom overlap.
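The two limiting cases discussed above can be checked numerically. The sketch below is illustrative only: `psi_adjusted` is a hypothetical helper that codes the adjustment as it follows from the Methods derivation, and the parameter values are arbitrary choices, not survey data.

```python
def psi_adjusted(alpha, beta, rho_c, phi_c, psi_c):
    """Adjusted asymptomatic prevalence (sketch following the Methods derivation)."""
    rho_adj = (rho_c + beta - 1) / (alpha + beta - 1)      # adjusted prevalence
    p_inf_given_pos = alpha * rho_adj / rho_c              # P[X+|T+]
    p_inf_given_neg = (1 - alpha) * rho_adj / (1 - rho_c)  # P[X+|T-]
    omega_c = 1 - phi_c
    denom = omega_c * p_inf_given_pos - psi_c * p_inf_given_neg
    return 1 - (omega_c - psi_c) / denom


rho_c, psi_c = 0.10, 0.40  # arbitrary illustrative values

# Case 1: perfect test (alpha = beta = 1); adjustment comes from phi_c alone.
for phi_c in (0.0, 0.2, 0.5):
    perfect = psi_adjusted(1.0, 1.0, rho_c, phi_c, psi_c)
    assert abs(perfect - psi_c / (1 - phi_c)) < 1e-12

# Case 2: no symptom overlap (phi_c = 0), alpha = 1, imperfect specificity.
beta = 0.98
no_overlap = psi_adjusted(1.0, beta, rho_c, 0.0, psi_c)
closed_form = 1 - (1 - psi_c) * beta * rho_c / (rho_c + beta - 1)
assert abs(no_overlap - closed_form) < 1e-12
assert no_overlap <= psi_c  # imperfect specificity alone lowers the estimate

print(no_overlap)
```

In the second case the adjusted value falls below ψc, which is the mechanism the text invokes for the few estimates with η < 0%.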
We conclude therefore that ψ has been substantially underestimated by existing serosurveys, primarily due to the confounding effect of the non-specificity of the symptoms elicited by SARS-CoV-2. Our formalism enables adjusting for this effect and arriving at more accurate estimates of ψ.
DISCUSSION
Our formalism makes important advances in addressing confounding effects in the estimation of ψ. A general formalism to adjust for antibody (or nucleic acid) test sensitivity and specificity was developed earlier60, which has been applied to obtain accurate SARS-CoV-2 prevalence estimates during the pandemic7. The formalism has been extended to estimate ψ, but without accounting for the specificity of the symptoms46. The importance of symptom specificity has been recognized earlier: For instance, an increase in the proportion of asymptomatic cases of influenza virus infection resulted after accounting for overlapping symptoms caused by other infections61,62. The adjustment in the latter studies, which relied on regression techniques, did not account, however, for the infection test sensitivity and specificity. Here, we accounted for the infection test sensitivity and specificity as well as the specificity of the symptoms. Furthermore, we derived a closed-form expression for the adjustment (equation (1)) which enables facile application of our formalism.
We foresee several implications of our study. First, the refined estimates of ψ that our formalism yields would help reassess the contribution of asymptomatic infections to COVID-19 transmission and spread1,63. They would also form inputs to models of COVID-19 epidemiology64-66, enabling more reliable forecasting of disease spread and the design of effective control strategies. Second, the formalism could aid COVID-19 vaccine development efforts67 by enabling more accurate estimation of vaccine efficacies, which are often based on comparing estimates of ψ in the vaccinated and unvaccinated arms of clinical trials2-4. Third, estimates of ψ will inform efforts underway to unravel genetic, immunological, and demographic underpinnings of asymptomatic infections68-72. Finally, we anticipate our formalism to be applicable to settings beyond COVID-19 that involve asymptomatic infections, such as influenza61,62. It would be particularly important to epidemiological studies that employ extended symptom recall periods, which increase the likelihood of contracting other infections during the recall period and, consequently, the confounding effect of symptom overlap.
Our study has limitations. First, we assumed that symptoms caused by SARS-CoV-2 and by other infections are independent. While co-infection can potentially influence the severity of SARS-CoV-2 infection, such instances appear rare73. Further justification of our assumption comes from studies that found influenza vaccination not to offer significant protection against SARS-CoV-2 symptoms74. Second, our selection of serosurveys is not exhaustive. Our aim was to demonstrate the wide applicability and relevance of our formalism and not to provide a global estimate of ψ. Future studies may conduct a more systematic search and meta-analysis using our formalism to obtain such a global estimate of ψ.
METHODS
Formalism to adjust for specificity of symptoms
We consider the scenario where infection by the pathogen of interest, denoted X, can trigger symptoms that may also be triggered by other pathogens (or conditions), the latter collectively denoted Y. Surveys aim to assess the prevalence of asymptomatic infections by X. A test, denoted T, assesses whether an individual undertaking the test is infected by X. Simultaneously, a questionnaire inquires into the symptoms, denoted S, experienced by the individual during a pre-defined recall period. We recognize that the symptoms may also be triggered by Y. We distinguish between these possibilities by letting SX and SY represent events associated with the symptoms being triggered by X and Y, respectively. The aim is to estimate the fraction of individuals infected by X who do not experience symptoms triggered by X. We arrive at this estimate as follows.
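We define P[X+] and P[T+] as the probability that an individual is infected by X and the probability that the infection test yields a positive result, respectively. Clearly, P[T+] = ρc, the crude prevalence estimated by the survey as the fraction of individuals tested who show a positive result. P[X+] = ρadj is the actual prevalence, obtained after adjusting for test sensitivity and specificity. The test sensitivity is α = P[T+|X+], the probability of the test yielding a positive result given infection by X. The test specificity is β = P[T−|X−], the probability that the test yields a negative result given that the tested individual is not infected by X. The total probability of the test yielding a positive result can thus be written as
$$P[T^+] = P[T^+|X^+]\,P[X^+] + P[T^+|X^-]\,P[X^-]. \tag{2}$$
Recognizing that P[T+|X−] = 1 − P[T−|X−] and P[X−] = 1 − P[X+] and substituting the definitions above in equation (2), it follows that
$$\rho_c = \alpha\,\rho_{\mathrm{adj}} + (1-\beta)\,(1-\rho_{\mathrm{adj}}), \tag{3}$$
so that
$$\rho_{\mathrm{adj}} = \frac{\rho_c+\beta-1}{\alpha+\beta-1}. \tag{4}$$
We next consider events related to the occurrence of symptoms. The crude prevalence of asymptomatic individuals, ψc = P[S−|T+], is the probability that an individual who tests positive reports no symptoms. It is thus measured in the surveys as the fraction of test-positive cases who declare no symptoms. Accounting for the test sensitivity and specificity, we again write
$$\psi_c = P[S^-|T^+] = P[S^-|X^+]\,P[X^+|T^+] + P[S^-|X^-]\,P[X^-|T^+], \tag{5}$$
which, upon recognizing that P[X−|T+] = 1 − P[X+|T+] and invoking Bayes’ theorem, P[X+|T+] = P[T+|X+]P[X+]/P[T+] = αρadj/ρc, yields
$$\psi_c = P[S^-|X^+]\,\frac{\alpha\,\rho_{\mathrm{adj}}}{\rho_c} + P[S^-|X^-]\left(1-\frac{\alpha\,\rho_{\mathrm{adj}}}{\rho_c}\right). \tag{6}$$
Given the simultaneous presence of X and Y in circulation, the absence of symptoms implies the absence of symptoms triggered by both X and Y. In other words, P[S−] = P[SX− ∩ SY−]. This yields
$$P[S^-|X^+] = P[S_X^- \cap S_Y^-|X^+] = P[S_X^-|X^+]\,P[S_Y^-|X^+] = \psi_{\mathrm{adj}}\,P[S_Y^-|X^+], \tag{7}$$
where ψadj = P[SX−|X+] is the probability that an individual infected by X does not experience symptoms triggered by X, which is the adjusted prevalence of asymptomatic infections, the key quantity of interest here.

Similarly, in the absence of infection by X, we may write
$$P[S^-|X^-] = P[S_X^- \cap S_Y^-|X^-] = P[S_Y^-|X^-], \tag{8}$$
where the latter equality follows because an individual not infected by X cannot have symptoms triggered by X.

Combining equations (6) − (8) yields
$$\psi_c = \psi_{\mathrm{adj}}\,P[S_Y^-|X^+]\,\frac{\alpha\,\rho_{\mathrm{adj}}}{\rho_c} + P[S_Y^-|X^-]\left(1-\frac{\alpha\,\rho_{\mathrm{adj}}}{\rho_c}\right). \tag{9}$$
We next assume that experiencing symptoms triggered by Y or not is independent of infection by X, so that
$$P[S_Y^-|X^+] = P[S_Y^-|X^-] = P[S_Y^-]. \tag{10}$$
To estimate the latter probabilities, we invoke their relationship with test results as follows. We recognize that ωc = P[S−|T−] is the probability of not experiencing symptoms given test-negative status, which represents the crude proportion of asymptomatic cases among test-negative individuals. Following the arguments above, the symptoms must arise neither from X nor Y, so that
$$\omega_c = P[S_X^- \cap S_Y^-|T^-] = P[S_X^-|T^-]\,P[S_Y^-|T^-]. \tag{11}$$
Invoking test sensitivity and specificity, we write the first term on the right hand side of equation (11) as
$$P[S_X^-|T^-] = P[S_X^-|X^+]\,P[X^+|T^-] + P[S_X^-|X^-]\,P[X^-|T^-] = 1 - (1-\psi_{\mathrm{adj}})\,P[X^+|T^-], \tag{12}$$
where the latter equality follows because P[SX−|X−] = 1 and P[X−|T−] = 1 − P[X+|T−]. Using Bayes’ theorem and the definitions of the quantities above, we obtain
$$P[X^+|T^-] = \frac{P[T^-|X^+]\,P[X^+]}{P[T^-]} = \frac{(1-\alpha)\,\rho_{\mathrm{adj}}}{1-\rho_c}. \tag{13}$$
Combining equations (11) − (13) and rearranging terms yields
$$\omega_c = \left[1 - (1-\psi_{\mathrm{adj}})\,\frac{(1-\alpha)\,\rho_{\mathrm{adj}}}{1-\rho_c}\right] P[S_Y^-|T^-]. \tag{14}$$
Following a similar procedure, we write the second term on the right hand side of equation (11) as
$$P[S_Y^-|T^-] = P[S_Y^-|X^+]\,P[X^+|T^-] + P[S_Y^-|X^-]\,P[X^-|T^-] = P[S_Y^-], \tag{15}$$
where the latter equality follows because P[SY−|X+] = P[SY−|X−] = P[SY−] (equation (10)) and P[X+|T−] = 1 − P[X−|T−]. Combining equations (14) and (15) with equation (11) and rearranging terms, we obtain
$$P[S_Y^-] = \frac{\omega_c}{1 - (1-\psi_{\mathrm{adj}})\,\dfrac{(1-\alpha)\,\rho_{\mathrm{adj}}}{1-\rho_c}}. \tag{16}$$
Finally, combining equations (9), (10), and (16), substituting ρadj from equation (4), and letting ϕc = 1 − ωc, the fraction of symptomatic cases in the test-negative subpopulation, we obtain equation (1):
$$\psi_{\mathrm{adj}} = 1 - \frac{(1-\phi_c-\psi_c)\,(\alpha+\beta-1)\,\rho_c\,(1-\rho_c)}{(\rho_c+\beta-1)\,\left[\alpha\,(1-\phi_c)\,(1-\rho_c) - (1-\alpha)\,\psi_c\,\rho_c\right]} \tag{1}$$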
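One way to check a derivation of this kind is a round trip: choose "true" values for the infection prevalence, the asymptomatic fraction, and the probability of having no symptoms from Y, generate the crude quantities a survey would observe under the model's assumptions, and confirm that the adjustment recovers the true asymptomatic fraction. The Python sketch below does this; all names and parameter values are hypothetical, and it assumes, as the derivation does, that symptoms from Y are independent of infection by X and that symptom status depends on the test result only through infection status.

```python
def forward_crude(alpha, beta, rho_true, psi_true, p_no_y):
    """Crude survey quantities implied by assumed true parameters.

    rho_true: true infection prevalence P[X+]
    psi_true: true asymptomatic fraction P[SX-|X+]
    p_no_y:   probability of no symptoms from other conditions, P[SY-]
    """
    rho_c = alpha * rho_true + (1 - beta) * (1 - rho_true)  # P[T+]
    p_inf_given_pos = alpha * rho_true / rho_c              # P[X+|T+]
    p_inf_given_neg = (1 - alpha) * rho_true / (1 - rho_c)  # P[X+|T-]
    psi_c = p_no_y * (1 - p_inf_given_pos * (1 - psi_true))    # P[S-|T+]
    omega_c = p_no_y * (1 - p_inf_given_neg * (1 - psi_true))  # P[S-|T-]
    return rho_c, 1 - omega_c, psi_c  # rho_c, phi_c, psi_c


def psi_adjusted(alpha, beta, rho_c, phi_c, psi_c):
    """Invert the forward model for the asymptomatic fraction."""
    rho_adj = (rho_c + beta - 1) / (alpha + beta - 1)
    p_inf_given_pos = alpha * rho_adj / rho_c
    p_inf_given_neg = (1 - alpha) * rho_adj / (1 - rho_c)
    omega_c = 1 - phi_c
    denom = omega_c * p_inf_given_pos - psi_c * p_inf_given_neg
    return 1 - (omega_c - psi_c) / denom


# Arbitrary illustrative "truth"
alpha, beta = 0.9, 0.98
rho_c, phi_c, psi_c = forward_crude(alpha, beta, rho_true=0.10,
                                    psi_true=0.60, p_no_y=0.70)
recovered = psi_adjusted(alpha, beta, rho_c, phi_c, psi_c)
print(psi_c, recovered)
```

In this example the crude ψc lies well below the assumed true value, while the adjusted estimate returns it, which is the behaviour the formalism is designed to produce.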
Data Availability
All data produced in the present work are contained in the manuscript.
AUTHOR CONTRIBUTIONS
A.T. and N.M.D. designed the study and developed the mathematical formalism. A.T. collated data from serosurveys, performed the analysis, and wrote the first draft. S.C., A.J., B.C. and N.M.D. contributed to the analysis and edited the draft. A.T. and S.C. had access to all the data. All authors approved the final draft and submission.
COMPETING INTERESTS
The authors declare that no competing interests exist.
ACKNOWLEDGEMENTS
We thank Jeremie Guedj and Shreyas Joshi for helpful discussions. This study did not receive any funding.