Abstract
The COVID-19 virus has spread worldwide in a matter of a few months. Healthcare systems struggle to monitor and report current cases. Limited testing capabilities make it difficult to guide policies and to mitigate the lack of preparation. Since severe cases, which are more likely to lead to fatal outcomes, are detected at a higher rate than mild cases, the reported death rates are likely inflated in most countries. Such over-estimation can be attributed to under-sampling of infection cases and results in systematic biases in death rate estimation.
The method proposed here utilizes a benchmark country (South Korea) and its reported death rates in combination with population demographics to correct the reported COVID-19 case numbers. By applying a correction, we predict that the number of cases is highly under-reported in most countries. In the case of China, it is estimated that more than 700,000 cases of COVID-19 actually occurred instead of the 80,932 cases confirmed as of March 13, 2020.
Introduction
COVID-19 is a novel severe acute respiratory syndrome-related virus whose initial outbreak most likely occurred in China (1). It was declared a pandemic by the World Health Organization less than four months after the initial reports of the disease. The origin of the virus can be traced back to related strains predominantly found in bats (2). Infected individuals can experience a range of symptoms, including cough, chills, fever, and shortness of breath (3). From the data currently available, the rate of fatal disease progression is higher than that of common influenza strains, and COVID-19 has resulted in more deaths than the recent outbreaks of Severe Acute Respiratory Syndrome (SARS) and Middle East Respiratory Syndrome (MERS) combined (?). The basic reproduction number (R0) of COVID-19 has been estimated to lie between 2 and 6.49 (4), compared to about 1.3 for influenza (5). The severity of infection is highly correlated with the age of the infected individual: younger parts of a population face a much lower risk than older ones. A recent data release from the Center for Disease Control in South Korea shows that while there are no reported fatalities for individuals under 30 years of age, the death rate for individuals older than 80 is over 8% (6). Figure 1 shows eight countries with a significant number of reported COVID-19 cases. China, the origin of the outbreak, registered the most cases with over 80,000. Through severe measures such as curfews, new infections there have slowed significantly. Other countries that have been affected only recently are still in the exponential growth phase. Countries like Italy have only recently taken action to slow the spread of the virus. With a reported incubation time of about five days, it will take several days until the effects of these measures become visible (7).
Another country that is currently experiencing high numbers of reported COVID-19 cases is Iran, with more than 12,000 confirmed cases. Due to the limited information available, most parameters describing the dynamics of the disease spread carry significant uncertainties. Healthcare systems in most countries are not capable of monitoring the exponential growth of a virus of this kind. South Korea, as of writing, has the most extensive testing capabilities, with a capacity of around 20,000 tests a day. Hence, South Korea represents the best benchmark country for correcting the reported COVID-19 cases of other countries. The proposed method uses demographic information to identify the fraction of the population that is vulnerable. Countries such as China have a generally younger population, which reduces the overall risk of fatal outcomes and should thereby result in a lower death rate compared to South Korea. Countries such as Italy, with an older population than South Korea, should have a higher death rate. Estimating the true case count is relevant for identifying the correct measures to stop the disease from spreading.
Methods
A. Data
The case correction relies on two datasets. The first is the data published by the WHO, which is updated every day and contains case, recovery, and death numbers for countries reporting all known COVID-19 cases (8). The second dataset is a global demographic database maintained by the United Nations (9). This database contains the number of individuals per year of age for more than 200 countries. For the analysis, we extracted the data between 2007 and 2019. Where multiple entries exist for a country, we always chose the most recent one. The resulting file is hosted as a Kaggle dataset at: https://www.kaggle.com/lachmann12/world-population-demographics-by-age-2019.
B. Assumptions
This method makes a series of assumptions in order to adjust reported COVID-19 cases compared to the benchmark country (South Korea).
Deaths are confirmed equally. It is assumed that if a death caused by COVID-19 occurs, the case is confirmed. If deaths were under-reported, the reported death rate would be lower than the true death rate.
The population is infected uniformly. We assume that the probability of infection is uniformly distributed across age groups. The probability that an 80-year-old person becomes infected is equal to the probability that a 30-year-old becomes infected.
Treatment has minor influence on outcome. The healthcare provided in different countries is assumed to be comparable. For developed countries such as Italy and South Korea, it is assumed that the population has similar access to treatment. The death rates reported by age group are thus applicable in all countries.
C. Case Adjustment
Figure 2 shows the progression of death rate estimates for the US, Italy, China, and South Korea. It can be noted that South Korea shows the most consistent death rate estimates. Additionally, it also shows a significantly lower death rate compared to other countries, with the exception of Germany (not shown). The change of death rate over time within the same country is potentially caused by changes in the number of false-negative cases, meaning that many infections go unnoticed until they become fatal. In the case of Italy, there might not have been sufficient capacity to confirm infections. With a smaller fraction of potential cases tested, the estimated death rate will increase. In the case of Italy, the estimated rate increased from 2% to more than 6%.
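The link between testing coverage and the estimated death rate can be illustrated with a minimal sketch. It assumes, in line with the first assumption above, that every death is confirmed while only a fraction of infections is confirmed. The function name and the example numbers are hypothetical and are not taken from the reported data.

```python
def apparent_death_rate(true_rate: float, detected_fraction: float) -> float:
    """Death rate estimated from confirmed cases only.

    Assumes all deaths are confirmed but only `detected_fraction`
    of infections are confirmed (an illustrative simplification).
    """
    if not 0.0 < detected_fraction <= 1.0:
        raise ValueError("detected_fraction must be in (0, 1]")
    # deaths / confirmed = (true_rate * N) / (detected_fraction * N)
    return true_rate / detected_fraction


# Hypothetical example: a true death rate of 2% with one third of
# infections confirmed yields an apparent death rate of about 6%.
example_rate = apparent_death_rate(0.02, 1.0 / 3.0)
```

Under this simplification, a falling share of confirmed infections is enough to raise the estimated death rate even if the true rate stays constant.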
This method requires the comparison of two countries with sufficient confirmed cases and reported deaths. One country (the target country) is adjusted given the information from the second country (the benchmark country). To adjust for the difference in population demographics between the target country, $T$, and the benchmark country, $B$, we compute a Vulnerability Factor ($V_{TB}$):
$$V_{TB} = \frac{\sum_i f_{Ti}\, r_i}{\sum_i f_{Bi}\, r_i},$$
where $f_{Ti}$ is the fraction of the population with age $i$ for target country $T$, $f_{Bi}$ is the fraction of the population with age $i$ for benchmark country $B$, and $r_i$ is the death rate for age $i$. The values of $r_i$ are listed in Table 1.
If $V_{TB} > 1$, then the population of $T$ has a higher risk of fatal outcomes due to a larger share of older individuals, which results in a higher death rate compared to $B$. If $V_{TB} < 1$, then $T$ has a younger population, which should result in a lower death rate compared to $B$.
Another correction factor is the fixed average death rate of the benchmark country, $D_B$:
$$D_B = \frac{1}{n} \sum_{i=1}^{n} d_{Bi},$$
where $d_{Bi}$ is the death rate of day $i$ and $n$ is the number of days averaged.
With both normalization factors we can now adjust the expected cases relative to $B$. The method applies the normalization to each time point. The original case number $o_{Ti}$ is adjusted for $T$ and $B$ at time point $i$ with:
$$a_{Ti} = o_{Ti}\, \frac{d_{Ti}}{V_{TB}\, D_B},$$
where $a_{Ti}$ is the adjusted case number and $d_{Ti}$ is the reported death rate of $T$ at time point $i$.
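The adjustment can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the published implementation: the vulnerability factor is taken as the ratio of age-weighted expected death rates of the target and benchmark country, the benchmark death rate as a plain mean over days, and the adjusted case count as the reported count scaled by the ratio of the observed death rate to the expected death rate. All function names, the three age bands, and every number below are hypothetical; the actual age-specific rates are those of Table 1.

```python
from typing import Sequence


def vulnerability_factor(
    f_target: Sequence[float],
    f_benchmark: Sequence[float],
    age_death_rates: Sequence[float],
) -> float:
    """Ratio of age-weighted expected death rates (target / benchmark)."""
    expected_target = sum(f * r for f, r in zip(f_target, age_death_rates))
    expected_benchmark = sum(f * r for f, r in zip(f_benchmark, age_death_rates))
    return expected_target / expected_benchmark


def benchmark_death_rate(daily_death_rates: Sequence[float]) -> float:
    """Average of the benchmark country's daily death rate estimates."""
    return sum(daily_death_rates) / len(daily_death_rates)


def adjust_cases(
    reported_cases: float,
    observed_death_rate: float,
    v_tb: float,
    d_b: float,
) -> float:
    """Scale reported cases by observed / expected death rate.

    The expected death rate of the target country is assumed to be
    v_tb * d_b (benchmark rate corrected for demographics).
    """
    expected_death_rate = v_tb * d_b
    return reported_cases * observed_death_rate / expected_death_rate


# Hypothetical inputs: three age bands (young, middle, old).
rates = [0.0, 0.01, 0.08]        # illustrative death rate per age band
f_target = [0.3, 0.5, 0.2]       # illustrative target age distribution
f_benchmark = [0.4, 0.5, 0.1]    # illustrative benchmark age distribution

v_tb = vulnerability_factor(f_target, f_benchmark, rates)
d_b = benchmark_death_rate([0.006, 0.008, 0.010])
adjusted = adjust_cases(reported_cases=1000, observed_death_rate=0.05,
                        v_tb=v_tb, d_b=d_b)
```

In this toy setting the target population is older than the benchmark population, so the factor exceeds 1 and the expected death rate of the target country is raised before the reported cases are rescaled.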
Results
By applying the proposed correction, the number of adjusted cases is significantly higher for most countries. Figure 3a illustrates population age distributions. Figure 3b shows the expected number of fatal outcomes for a 100% infection rate. The vulnerability factor for the US compared to South Korea is 1.07. This means that the US population is almost equally vulnerable to fatal outcomes of COVID-19 infections. Italy, in contrast, has a vulnerability factor of 1.57. This is due to a larger fraction of the population being at a higher risk of death, and indicates that the expected death rate would be 57% higher in Italy than in South Korea. China, with a younger population relative to South Korea, has a vulnerability factor of 0.63. Based on the population risk, the expected death rate in China should therefore be lower than in South Korea. After applying the case adjustment, we observe a significant increase in the number of COVID-19 infections. The discrepancy in reported death rates, in combination with the favorable vulnerability factor in the case of China, suggests a large number of unreported COVID-19 infections. The adjustment suggests around 702,518 cases compared to 80,932 reported cases. This equates to 868% of the reported case count, or roughly 8.7 times as many cases. The corrections for Italy and the US point in the same direction but are not as extreme. Italy has an adjusted count of 112,182 cases and the US potentially 6,085 cases. Table 2 shows the adjusted number of cases for a selected number of countries. Iran is the country with the most substantial adjustment of 1,363%, reaching 154,853 cases.
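The relation between reported and adjusted counts can be checked with simple arithmetic. The sketch below uses the China figures quoted above and reports the ratio, the adjusted count as a percentage of the reported count, and the relative increase. The function name and the returned keys are illustrative.

```python
def adjustment_summary(reported: int, adjusted: int) -> dict:
    """Compare an adjusted case count with the reported case count."""
    ratio = adjusted / reported
    return {
        "ratio": ratio,                              # adjusted / reported
        "percent_of_reported": 100.0 * ratio,        # adjusted as % of reported
        "percent_increase": 100.0 * (ratio - 1.0),   # increase over reported
    }


# China, using the counts stated in the text.
china = adjustment_summary(reported=80_932, adjusted=702_518)
for key, value in china.items():
    print(f"{key}: {value:.2f}")
```

Keeping the ratio and the relative increase apart avoids mixing up "868% of the reported count" with "868% higher", which differ by 100 percentage points.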
Summary
This study suggests that the current reporting of COVID-19 cases significantly underestimates the true scale of the pandemic. The lack of testing makes the estimation of the true death rate difficult and causes significant misinformation. This study tries to leverage the information derived from a well-tested sub-population (South Korea). With a testing capacity of 20,000 tests daily, South Korea has the largest and most accurate coverage of all countries as of writing. The low false-negative rate in detecting COVID-19 infections leads to the lowest death rate (0.84%) among all countries with a major case count. By applying the parameters estimated from this benchmark country, the proposed method can adjust global COVID-19 case numbers.
This method is limited in its ability to predict the exact number of cases accurately. It relies on the assumption that deaths caused by COVID-19 are detected and reported reliably. False-negative rates can have a distorting effect on the case adjustment. This is especially true if the benchmark country does not adequately report deaths from COVID-19. Germany, as an example, reports only eight deaths among 3,675 reported cases. This could be due to the very recent increase in actual cases, which leaves not enough time for fatal disease progression. Over time, when more data is available, death rates will most likely increase in Germany. Additionally, the assumption of a globally similar death rate is untested. Improvements to this method could look at the case numbers of other viral diseases to see if there are significant differences between countries.
This method explains the observed fluctuations in death rate over time by country. It is unlikely that the death rate in the same country can fluctuate by several percentage points over a period of a few days. This method suggests that due to the fast exponential growth of true case counts, most modern healthcare systems are not able to track the changes adequately.
In addition, the method suggests that computational tools can be used to impute missing information based on regions where testing and tracking are more advanced. It also highlights the importance of publicly accessible real-time data and the relevance of combining global healthcare efforts.
Data Availability
All code and data are available on the Kaggle platform.
https://www.kaggle.com/lachmann12/correcting-under-reported-covid-19-case-numbers
Acknowledgements
I want to thank Dr Avi Ma'ayan and Federico Giorgi for feedback on the original manuscript and Alon Bar Tal for insightful discussion, as well as the Kaggle community. Special thanks go to Johns Hopkins and the World Health Organization for the seamless accessibility of up-to-date COVID-19 case statistics published on GitHub.