Abstract
In this work we estimate the incidence of COVID-19 in China using online indirect surveys (which preserve the privacy of the participants). The indirect surveys collect data on the incidence of COVID-19 by asking participants how many cases, deaths, vaccinated people, and hospitalized people they know among their contacts. The incidence of COVID-19 (cases, deaths, etc.) is then estimated using a modified Network Scale-up Method (NSUM). Survey responses (100, 200, and 1,000, respectively) were collected from Australia, the UK, and China in January 2023. The estimates for Australia and the UK are compared with official data: the official values either fall within our confidence intervals or are close to them. Cronbach’s alpha values also indicate good reliability of the estimates. The estimates obtained for China include that 91% of the population is vaccinated, almost 80% had been infected in the last month, and almost 3% in the last 24 hours.
1 Introduction
Estimates of COVID-19 indicators such as the number of cases, deaths, and hospitalizations, among others, are of great interest to decision-makers. Direct surveys have been used in several countries during COVID-19 to estimate these variables [1, 2]. Unfortunately, these surveys need large numbers of participants to achieve reliable estimates and usually collect sensitive personal data (which may deter respondents due to privacy concerns and requires careful data handling).
As an alternative to these surveys, there are indirect surveys, in which the questions answered by the participants are not about themselves, but about their contacts. Estimates are then obtained from indirect survey responses using the Network Scale-up Method (NSUM) [3, 4]. This approach allows 1) reaching a larger sub-population, 2) reducing data collection cost, 3) estimating the true value with a computationally efficient method, and 4) providing high privacy for participants. Indirect surveys have already been used for COVID-19 [5].
We use indirect online surveys to estimate cases, mortality, vaccination, and hospitalizations due to COVID-19 in China for the period of January 18-26, 2023. We validate our approach using data from Australia and the United Kingdom (UK) collected on January 19, 2023. Our estimates are compared with the official values reported by Our World in Data (OWID) and the Office for National Statistics (ONS) of the UK. In addition, we use Cronbach’s alpha index [6], a reliability coefficient that measures the internal consistency of the questionnaire used in the indirect surveys.
2 Methods
2.1 Sampling Participants
We conducted online indirect surveys using the PollFish platform. Specifically, we conducted an online survey in China between January 18-26, 2023. This survey collected information about various COVID-19 indicators (vaccination, deaths, and number of cases in the last month, the last 7 days, and the past 24 hours) among the 15 closest contacts of 1,000 participants (see Supplementary Information section for the Chinese and English versions of the survey questions). Additionally, for validation, we conducted online surveys in Australia (100 responses) and the UK (200 responses) on January 19, 2023. Table 3 in Supplementary Information shows the characteristics of the survey respondents (the platform provides information on gender, age group, education, and ethnicity). The respondents of each survey are also stratified by region. For instance, Fig. 1 in Supplementary Information shows a map of China where the intensity corresponds to the number of questionnaires completed in each province.
2.2 Data Analysis
In order to obtain a reliable dataset, we performed two subphases of preprocessing: (1) an inconsistency filter, and (2) a univariate outlier detection.
The inconsistency filter removes participants with inconsistent responses: fewer infected contacts than fatalities, fewer infected contacts than hospitalized contacts, fewer infected contacts in the last month than in the last 7 days, and fewer infected contacts in the last month than in the last 24 hours.
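The four consistency rules above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the `Response` class and its field names are hypothetical, and it assumes that "infected contacts" refers to the count reported for the last month and that "last 24 hours" corresponds to the "have COVID-19 today" question.

```python
from dataclasses import dataclass


@dataclass
class Response:
    # Counts among the respondent's 15 closest contacts (hypothetical field names).
    cases_month: int
    cases_7d: int
    cases_today: int
    hospitalized: int
    deaths: int


def is_consistent(r: Response) -> bool:
    """Return True if the response violates none of the four consistency rules."""
    return (
        r.cases_month >= r.deaths            # not fewer infected than fatalities
        and r.cases_month >= r.hospitalized  # not fewer infected than hospitalized
        and r.cases_month >= r.cases_7d      # monthly cases cover the last 7 days
        and r.cases_month >= r.cases_today   # monthly cases cover the last 24 hours
    )


def inconsistency_filter(responses):
    """Keep only the responses that pass every consistency rule."""
    return [r for r in responses if is_consistent(r)]
```

A response reporting, say, one infected contact in the last month but two in the last 7 days would be dropped by this filter.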
Since the collected variables exhibit extremely skewed distributions, the robust outlier detection method reported in [7] is applied. From the data of each variable, this method first estimates the quartiles Q1 and Q3, as well as the interquartile range (IQR = Q3 − Q1). Then, the whiskers Qα and Qβ are set. Finally, the method preserves the samples in the interval limited by
$$[Q_\alpha,\, Q_\beta] = \left[\,Q_1 - 1.5\, e^{a\,MC}\, IQR,\;\; Q_3 + 1.5\, e^{b\,MC}\, IQR\,\right],$$
where MC is the medcouple statistic, which estimates the degree of skewness of the data. Samples outside the interval are marked as outliers and, consequently, are removed. In addition, to estimate the parameters a and b, we consider the system [7]
$$\ln\left(\frac{2}{3}\,\frac{Q_1 - Q_\alpha}{IQR}\right) \approx a\,MC, \qquad \ln\left(\frac{2}{3}\,\frac{Q_\beta - Q_3}{IQR}\right) \approx b\,MC.$$
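As a rough illustration of skewness-adjusted outlier removal, the sketch below assumes that the interval has the adjusted-boxplot form, with fences Q1 − 1.5·exp(a·MC)·IQR and Q3 + 1.5·exp(b·MC)·IQR; this form is an assumption made for the example. The function names are hypothetical, the parameters a and b are passed in by the caller rather than estimated, and the medcouple is a naive quadratic-time version that simply skips pairs tied at the median.

```python
import math
import statistics


def naive_medcouple(values):
    """Naive O(n^2) medcouple.

    Simplification (assumption): pairs in which both points equal the median
    are skipped instead of using the special tie kernel.
    """
    xs = sorted(values)
    m = statistics.median(xs)
    lower = [x for x in xs if x <= m]
    upper = [x for x in xs if x >= m]
    kernel = [
        ((xj - m) - (m - xi)) / (xj - xi)
        for xi in lower
        for xj in upper
        if xj != xi
    ]
    return statistics.median(kernel) if kernel else 0.0


def adjusted_fences(values, a, b):
    """Lower and upper fences of the assumed skewness-adjusted interval."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    mc = naive_medcouple(values)
    lower = q1 - 1.5 * math.exp(a * mc) * iqr
    upper = q3 + 1.5 * math.exp(b * mc) * iqr
    return lower, upper


def remove_outliers(values, a, b):
    """Keep only the samples that lie inside the fences."""
    lower, upper = adjusted_fences(values, a, b)
    return [v for v in values if lower <= v <= upper]
```

With a = b = 0 the fences reduce to the classical 1.5 × IQR boxplot rule, which is a convenient sanity check for the sketch.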
To estimate cumulative incidences, hospitalization rates, and mortality rates, we use a modification of NSUM. Let $c_i$ be the number of contacts of the i-th respondent with a particular characteristic, e.g., persons who are hospitalized. Further, let $r_i$ be the number of close contacts of the i-th respondent (which in this study is fixed at $r_i = 15$, as shown in the questions in the Supplementary Information). The requirement of close contacts is introduced to minimize the effect of the visibility bias [8] with respect to the classical method [3]. With n respondents, we estimate the aggregated rate, p, as $\hat{p} = \sum_i c_i / \sum_i r_i = \sum_i c_i/(15n)$. The estimator’s variance is $p(1-p)/(15n)$, assuming that the $c_i$ are independent binomial random variables with 15 trials and success probability p.
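The estimator can be sketched in a few lines of Python. The function name is hypothetical, the variance p(1 − p)/(15n) follows from the binomial assumption stated above, and the 95% interval uses a normal approximation clipped to [0, 1], which is an assumption of this sketch since the text does not say how the intervals were computed.

```python
import math


def nsum_estimate(counts, contacts_per_respondent=15, z=1.96):
    """Aggregated rate and approximate confidence interval.

    counts: one value c_i per respondent (contacts with the characteristic).
    Returns (p_hat, lower, upper).
    """
    n = len(counts)
    total_contacts = contacts_per_respondent * n   # sum of r_i = 15 n
    p_hat = sum(counts) / total_contacts           # sum of c_i / (15 n)
    variance = p_hat * (1 - p_hat) / total_contacts
    half_width = z * math.sqrt(variance)
    # Clipping to [0, 1] is an assumption of this sketch.
    return p_hat, max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)
```

For example, four respondents reporting 3, 0, 6, and 3 affected contacts give 12/60 = 0.20, with an interval of roughly 0.10 to 0.30.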
To determine the validity of our method, we compared the official values reported in Our World in Data (OWID) with the values estimated by our approach, for Australia and the UK (see Table 1). For both countries, official data were extracted for the period between December 20, 2022, and January 19, 2023. To transform hospital occupancy into the number of persons hospitalized, the length of a hospital stay is estimated to be 4 days [9, 10].
Additionally, for the UK, we use the data provided by the Office for National Statistics (ONS). In particular, for the number of cases we use the daily estimates of the infected population obtained by the Coronavirus (COVID-19) Infection Survey of the ONS. For the 7-day and last-month estimates, in order not to count the same cases multiple times, the sum of the daily percentages is divided by 10 days, an estimated average duration of an infection with Omicron [11]. Hospitalizations are computed as the sum of the weekly admission rates with COVID-19 in England from Dec 19, 2022, to Jan 22, 2023 (5 weeks). Mortality is the rate of registered deaths involving COVID-19 in England from Dec 17, 2022, to Jan 20, 2023.
Finally, we use Cronbach’s alpha coefficient to measure the reliability of the results obtained from the indirect surveys. Specifically, it quantifies the reliability of an unobservable variable constructed from the observed variables. The closer this coefficient is to its maximum value of 1, the greater the reliability of the measure; in general, values greater than 0.7 are considered sufficient for acceptable reliability. There are several ways to calculate it; in this work, we use the one based on the correlations of the observed variables.
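The conversion from daily prevalence to a cumulative rate described above is simple arithmetic, sketched below. The function name and the example numbers are made up for illustration; only the division by a 10-day average infection duration comes from the text.

```python
def cumulative_from_daily_prevalence(daily_percentages, duration_days=10):
    """Approximate the percentage infected over a period.

    A person infected for `duration_days` days appears in that many daily
    prevalence values, so the sum is divided by the duration to avoid
    counting the same case several times.
    """
    return sum(daily_percentages) / duration_days


# Illustrative (made-up) input: a constant 2% daily prevalence over 7 days.
weekly_rate = cumulative_from_daily_prevalence([2.0] * 7)
```

With these made-up inputs, the 7-day rate is 14 / 10 = 1.4%.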
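One common correlation-based form of the coefficient is the standardized alpha, α = k·r̄ / (1 + (k − 1)·r̄), where k is the number of observed variables and r̄ is the mean of their pairwise Pearson correlations. The sketch below assumes this is the variant meant; the text does not give the exact formula, and the function names are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def standardized_alpha(items):
    """Standardized Cronbach's alpha from the mean inter-item correlation.

    items: list of k >= 2 sequences, one per survey question, each holding
    one value per respondent.
    """
    k = len(items)
    pairs = list(combinations(items, 2))
    r_bar = sum(pearson(x, y) for x, y in pairs) / len(pairs)
    return k * r_bar / (1 + (k - 1) * r_bar)
```

Two questions whose answers correlate at 0.5 give α = 2 × 0.5 / 1.5 ≈ 0.67, just below the usual 0.7 threshold.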
3 Results
The comparison of our estimates with the official data in the UK and Australia is presented in Table 1. The vaccination estimates are close to the official values. The vaccination rates in Australia and the UK are estimated as 76.50% (95% confidence interval: 73.70%-79.29%) and 78.86% (77.00%-80.72%), while the official (OWID) values are 84.95% and 79.71%, respectively. For Australia, the official values of mortality and hospitalizations in the last month are within the confidence intervals of our estimates. Specifically, the estimated mortality rate is 0.34% (0.00%-0.22%) and the official value is 0.006%, and the estimated hospitalization rate is 1.02% (0.36%-1.68%) and the official value is 1.327%. Also, in the case of the UK, the official ONS values are within the confidence intervals of our estimates of the number of cases, cases in the last 7 days, and cases in the last 24 hours.
For the rest of the variables, even where the official values and our estimates differ substantially, the gap is never extreme (and is possibly due to underreporting in the official data). Cronbach’s alpha coefficient is 0.83 for Australia and 0.95 for the UK, which indicates that the reliability of the estimates is very good. The estimates and Cronbach’s alpha coefficients together suggest that the indirect survey approach can be used to obtain estimates when official data are not available or not reliable, provided that a possible bias is taken into account when interpreting them.
Table 2 shows the estimated results for China for all the questions of the survey. Although 1,000 indirect survey responses were collected, applying the filters specified in Section 2.2 drastically reduced the sample size to 469. Comparing our results with the OWID data for China, the reported vaccination rate is 91.9% while we estimate 91.03% (90.36%-91.7%), which is almost a perfect match. The OWID value for deaths is approximately 0.073% while we estimate 1.19% (0.94%-1.45%), a much higher value. However, OWID warns that “the number of confirmed deaths may not accurately represent the true number of deaths”. Therefore, our estimate could serve as a first approximation (that may be biased). Our estimate of the number of cases in the last month is 78.57% (77.62%-79.54%), very far from the 0.138% reported by OWID (which warns that “the number of confirmed cases is lower than the true number of infections”). Note that some areas of China may have a high incidence, as noted in the report published at [12]: “nearly 90% of Henan’s population had been infected by 6 January”.
We compute estimates for the provinces and cities with the largest number of samples (see Table 2). The rates of vaccination and of cases in the last month are similar in all of them and similar to the values for China as a whole. Guangdong province shows the lowest estimates of hospitalizations and deaths, while its case estimates are high relative to the other provinces. Among cities, Beijing shows low estimates of monthly cases, but large rates of recent cases and hospitalizations. Unfortunately, the sample size for cities is very small.
Finally, we would like to point out that, in general, the sample is relatively small compared to the size of the country. Additionally, as can be seen in Table 3 in Supplementary Information, the sample is biased by age and education level. These biases are reduced by the use of indirect questions, but further studies are still needed.
Data Availability
The data collected in the indirect surveys is publicly available at https://github.com/GCGImdea/coronasurveys/tree/master/papers/2023-COVID-19-China-January.
Supplementary Information
Questions of the Indirect Survey
Questions in English
Think of your 15 closest contacts in the last month. The rest of the questions below are with respect to this group of people. These contacts can be family, friends, or colleagues whose health status you know.
From the above 15 closest contacts in the last month, how many have had COVID-19 in the last month?
From the above 15 closest contacts in the last month, how many have been hospitalized for COVID-19 in the last month?
From the above 15 closest contacts in the last month, how many died from COVID-19 in the last month?
From the above 15 closest contacts in the last month, how many have COVID-19 today?
From the above 15 closest contacts in the last month, how many started with COVID-19 in the latest 7 days?
From the above 15 closest contacts in the last month, how many have (ever) been vaccinated for COVID-19?
Questions in Chinese
Contact the corresponding author to request access to this information.
Conflict of Interest Disclosures
None reported.
Funding/Support
This work was partially supported by grants COMODIN-CM and PredCov-CM, funded by Comunidad de Madrid and the European Union through the European Regional Development Fund (ERDF), and grants TED2021-131264B-I00 (SocialProbing) and PID2019-104901RB-I00, funded by Ministry of Science and Innovation - State Research Agency, Spain MCIN/AEI/10.13039/501100011033 and the European Union “NextGenerationEU”/PRTR.
Data Sharing Statement
The data collected in the indirect surveys is publicly available at https://github.com/GCGImdea/coronasurveys/tree/master/papers/2023-COVID-19-China-January.
Acknowledgment
We want to thank Lin Wang for his help with the Chinese version of the survey.
Footnotes
We have corrected a mistake in the text where the ratio of deaths is given, and we have changed the numbering of the tables to avoid collisions.