Abstract
Previous studies have used Benford’s distribution to assess whether COVID-19 cases and deaths are misreported. Data inaccuracies provide false information to the media, undermine the global response, and hinder the preventive measures taken by countries worldwide. In this study, we analyze daily new cases and deaths from all the countries of the European Union and estimate their conformance to Benford’s distribution. For each country, two statistical tests and two measures of deviation are calculated to determine whether the reported statistics comply with the expected distribution. Four country-level developmental indexes are also included: the GDP per capita, health expenditures, the Universal Health Coverage index, and the full vaccination rate. Regression analysis is used to examine whether the deviation from Benford’s distribution is associated with these indexes. The findings indicate that only three countries were in line with the expected distribution: Bulgaria, Croatia, and Romania. For daily cases, Denmark, Greece, and Ireland showed the greatest deviation from Benford’s distribution, and for deaths, Malta, Cyprus, Greece, Italy, and Luxembourg had the highest deviation from Benford’s law. Furthermore, the vaccination rate was found to be positively associated with deviation from Benford’s distribution. These results suggest that, overall, official data provided by authorities do not conform to Benford’s law, yet this approach acts only as a preliminary tool for data verification. More extensive studies are needed, with a more thorough investigation of the countries that showed the greatest deviation.
Introduction
The COVID-19 pandemic has affected the lives of millions of people worldwide. Owing to the high contagiousness of the virus (Hafeez et al., 2020), nearly every country employed measures against its spread, such as national lockdowns and restrictions on typical activities. The pandemic showed that statistical and machine learning modelling procedures can potentially predict the number of new cases or deaths for a given country (Cássaro & Pires, 2020; Niazkar & Niazkar, 2020; Neto et al., 2020). An accurate forecast of the infection curve can support government measures aimed at suppressing the growth rate. However, to accurately predict or model the spread of COVID-19, reliable and valid data must be collected from authorities. The pandemic raised issues about data collection and handling. Media reports have questioned whether the statistics provided by countries are trustworthy (Kilani, 2021). Several studies have disputed the accuracy of government data and have linked data manipulation with transparency and democracy indexes (Adsera, Boix & Payne, 2003; Magee & Doces, 2015; Rozenas & Stukal, 2019).
Previous studies in different fields have applied Benford’s distribution (or law) to detect fraudulent and manipulated data. Specifically, for COVID-19, it was found that deaths were underreported in the USA (Campolieti, 2021), while in China no manipulation was found (Koch & Okamura, 2020). A study for Japan also showed deviation from Benford’s distribution (Lee, Han & Jeong, 2020). Furthermore, it was found that countries with higher values of the developmental index are less likely to deviate from Benford’s law (Balashov, Yan & Zhu, 2021). This study applies Benford’s law to detect deviations of the first digits of the announced cases and deaths from the expected frequencies in the European Union (EU). We further investigate whether the estimated deviation for each country is associated with four developmental indexes: the GDP per capita, health expenditures (% of GDP), the Universal Health Coverage Index, and the full vaccination rate.
Methods
Sample
The public COVID-19 data of the European Union on daily cases and deaths were exported from the European Centre for Disease Prevention and Control (ECDC) and consisted of observations between the 2nd of March and the 20th of December 2021 (N = 8820). ECDC’s Epidemic Intelligence team collects and refines daily data on new cases and deaths associated with COVID-19, based on reports from health authorities worldwide. Apart from COVID-19 data, we included the gross domestic product per capita (GDPc), the healthcare expenditures of countries as a percentage of GDP (HGDP), and the Universal Health Coverage Index (UHC) from the World Bank (https://data.worldbank.org/). Finally, we included the full COVID-19 vaccination rate as of the 16th of December 2021, obtained from ECDC.
Benford’s distribution
Benford’s law (also known as the first-digit law) is a probability distribution for the first digit in a set of numbers. The pattern was first noted in an early work by the mathematician Simon Newcomb and was formally proposed in 1938 by the physicist Frank Benford, who claimed that in natural and unrestricted data sets, the probability of each digit d appearing as the first digit is given by the formula:

$$P(d) = \log_{10}\left(1 + \frac{1}{d}\right), \quad d = 1, 2, \ldots, 9$$

Based on Benford’s distribution, the probabilities for each number d as the first digit are presented in Table 1.
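As an illustration, the expected first-digit probabilities, log10(1 + 1/d), can be tabulated with a few lines of Python. This is a minimal sketch rather than the code used in the study; the helper names `benford_probabilities` and `first_digit` are hypothetical, and the choice to skip zero or negative daily counts is an assumption, since the text does not state how such values were handled.

```python
import math


def benford_probabilities():
    """Expected first-digit probabilities P(d) = log10(1 + 1/d) for d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(count):
    """Leading digit of a positive integer count.

    Returns None for zero or negative counts (assumption: such days are
    excluded, because they have no leading digit between 1 and 9).
    """
    count = int(count)
    if count < 1:
        return None
    return int(str(count)[0])


if __name__ == "__main__":
    for digit, prob in benford_probabilities().items():
        print(f"{digit}: {prob:.3f}")
```

Under this formula the digit 1 is expected to lead roughly 30% of the time and the digit 9 under 5% of the time.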
The most common application of the law is in economics, where it has been considered as a tool for checking tax validity and detecting fraud (Nigrini, 1996; Durtschi, Hillison & Pacini, 2004; Tam Cho & Gaines, 2007). More recent studies have used Benford’s law to investigate whether COVID-19 data provided by countries are accurate (Kilani, 2021; Silva & Figueiredo Filho, 2021; Campolieti, 2021; Koch & Okamura, 2020) and whether the deviation from Benford’s distribution could be affected by developmental indexes (Balashov, Yan & Zhu, 2021).
Goodness-of-fit
First, to assess the extent to which the observed cases and deaths conform to the frequencies expected under Benford’s law, two goodness-of-fit tests were applied: the chi-squared (χ2) goodness-of-fit test and the Kolmogorov-Smirnov (K-S) test. The chi-squared test statistic is given by:

$$\chi^2 = \sum_{i=1}^{9} \frac{(O_i - E_i)^2}{E_i}$$

where the index i is the digit, and Oi and Ei are the observed and expected frequencies of the i-th digit, respectively. The degrees of freedom for this test are equal to 8, and the critical value is 15.507 for the significance level set at α = 0.05; thus, any value of the statistic greater than the critical value would imply significant deviation from the expected distribution. However, in large samples, the interpretation of significance should be avoided, as the test has enough power to detect even small deviations from the expected distribution (Lin, Lucas & Shmueli, 2013). To accompany the results of the chi-squared test, Cramer’s V was calculated along with a 95% bootstrap confidence interval (CI) as an estimate of the effect size. The Kolmogorov-Smirnov D statistic is commonly used for comparing empirical with theoretical continuous distributions, but it can also be used with integers. The statistic is given by:

$$D = \max_{1 \le i \le 9} \left| F_O(i) - F_E(i) \right|$$

where FO(i) and FE(i) are the observed and expected cumulative proportions of first digits up to and including digit i. Because both the chi-squared and K-S tests are strongly affected by sample size, we also included two measures that are not affected by large sample sizes, namely the Euclidean distance (ED) in the nine-dimensional space (Tam Cho & Gaines, 2007) given by:

$$\mathrm{ED} = \sqrt{\sum_{i=1}^{9} \left( PO_i - PE_i \right)^2}$$

and the Mean Absolute Distance (MAD) given by:

$$\mathrm{MAD} = \frac{1}{9} \sum_{i=1}^{9} \left| PO_i - PE_i \right|$$

where POi and PEi are the observed and expected proportions of the first digit, respectively.
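The quantities described in this section can be computed for one country with the following minimal sketch, which is not the study’s own code. The function name `benford_fit` is hypothetical. The sketch assumes the standard textbook forms of each quantity: the Pearson chi-squared statistic, Cramer’s V for a one-sample test with k = 9 categories, V = sqrt(χ2 / (n(k − 1))), the K-S D as the largest gap between cumulative proportions, ED as the root of the summed squared differences, and MAD as the mean of the absolute differences. The bootstrap CI for Cramer’s V is omitted.

```python
import math
from collections import Counter

# Expected first-digit proportions under Benford's law, for digits 1..9.
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

# Chi-squared critical value for 8 degrees of freedom at alpha = 0.05.
CHI2_CRITICAL_DF8 = 15.507


def benford_fit(first_digits):
    """Goodness-of-fit summary for a list of first digits (integers 1..9)."""
    n = len(first_digits)
    tally = Counter(first_digits)
    observed = [tally.get(d, 0) for d in range(1, 10)]
    expected = [n * p for p in BENFORD]

    # Pearson chi-squared statistic over the nine digits.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Cramer's V for a one-sample goodness-of-fit test (assumed form).
    cramers_v = math.sqrt(chi2 / (n * (9 - 1)))

    p_obs = [o / n for o in observed]

    # K-S D: largest absolute gap between cumulative proportions.
    cum_obs = cum_exp = ks_d = 0.0
    for po, pe in zip(p_obs, BENFORD):
        cum_obs += po
        cum_exp += pe
        ks_d = max(ks_d, abs(cum_obs - cum_exp))

    # Distance measures computed on proportions, not on raw counts.
    ed = math.sqrt(sum((po - pe) ** 2 for po, pe in zip(p_obs, BENFORD)))
    mad = sum(abs(po - pe) for po, pe in zip(p_obs, BENFORD)) / 9

    return {"chi2": chi2, "cramers_v": cramers_v, "ks_d": ks_d,
            "ed": ed, "mad": mad}


if __name__ == "__main__":
    # Illustrative input only: digits spread evenly, which should deviate.
    result = benford_fit(list(range(1, 10)) * 100)
    print(result, result["chi2"] > CHI2_CRITICAL_DF8)
```

Because ED and MAD are computed on proportions, their values do not grow with the number of observations, whereas the chi-squared statistic does.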
Regression analysis
The two measures of deviation (ED and MAD) were used as the dependent variables in two regression models, with the gross domestic product per capita (GDPc), the healthcare expenditures of countries as a percentage of GDP (HGDP), the Universal Health Coverage Index (UHC), and the full COVID-19 vaccination rate (Vac) as independent variables, to examine whether the distance observed from Benford’s distribution could be associated with those predictors. Instead of relying on ordinary least squares (OLS) estimates for the parameters of the model, bootstrap estimates were calculated, because the small number of countries (N = 27) makes them more robust. With the bootstrap, we drew 10000 samples with replacement, each of the same size as the original sample, and each time estimated the OLS coefficients of the parameters, thereby creating the sampling distribution of each coefficient along with 95% bootstrap CIs (Davison & Hinkley, 1997).
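The resampling procedure described above can be sketched as follows. This is an illustrative sketch and not the study’s code: the function name `bootstrap_ols` is hypothetical, it assumes that whole countries (rows) are resampled together, that the point estimate is the mean of the bootstrap coefficients, and that the 95% CI is the percentile interval. The text does not specify which type of bootstrap interval was used.

```python
import numpy as np


def bootstrap_ols(X, y, n_boot=10_000, seed=0):
    """Case-resampling bootstrap of OLS coefficients.

    X: (n, p) predictor matrix, y: (n,) response.
    Returns (estimate, lower, upper), each of length p + 1, with the
    intercept first. The percentile CI is an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    design = np.column_stack([np.ones(n), X])  # add intercept column

    coefs = np.empty((n_boot, design.shape[1]))
    for b in range(n_boot):
        # Draw n row indices with replacement (same size as the original).
        idx = rng.integers(0, n, size=n)
        # Least-squares fit on the resampled rows.
        coefs[b] = np.linalg.lstsq(design[idx], y[idx], rcond=None)[0]

    estimate = coefs.mean(axis=0)
    lower, upper = np.percentile(coefs, [2.5, 97.5], axis=0)
    return estimate, lower, upper
```

In this setting, `X` would hold one row per country with the four predictors (GDPc, HGDP, UHC, Vac) and `y` would hold either ED or MAD; a coefficient whose interval excludes zero would be read as significant.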
Results
The results of the goodness-of-fit tests, along with the two measures of deviation, are presented in Table 2. For all countries except Bulgaria, Croatia, and Romania, significant deviations were found for cases, deaths, or both. For daily cases, Denmark, Ireland, and Greece were associated with the highest chi-squared statistics, and this was also confirmed by the two distance measures (Figures 1 and 2). Regarding deaths, Cyprus, Italy, and Greece had the highest chi-squared statistics and distance measures. The K-S D statistic agreed with the chi-squared test in most cases.
The bootstrap estimates and 95% bootstrap CIs of the regression analysis for the two measures of deviation are presented in Table 3. To avoid very small coefficients, GDP per capita was log-transformed and the other three predictors were divided by 100. Regarding new cases, no predictor was found to significantly affect either MAD or ED. Vaccination rate was positively associated with deviation from Benford’s distribution in deaths as measured by MAD (0.076, 95% CI [0.020, 0.144]) and as measured by ED (3.415, 95% CI [1.175, 6.286]), indicating that countries with a higher full vaccination percentage tend to deviate more from Benford’s law.
Discussion
This study aimed to examine the validity of COVID-19 data from the EU using Benford’s law. Data on daily new cases and deaths were collected from ECDC for the period from the 2nd of March 2021 to the 20th of December 2021. In addition, four country-level indexes were collected: the GDP per capita, the health expenditure as a percentage of GDP, the Universal Health Coverage index, and the full vaccination rate. Two goodness-of-fit tests were applied, the chi-squared test and the Kolmogorov-Smirnov test, and two measures of deviation were estimated, the Euclidean distance and the Mean Absolute Distance. Bulgaria, Croatia, and Romania did not deviate from Benford’s law for either new cases or deaths. Regarding daily cases, Cyprus, Germany, Lithuania, and Slovakia were in line with Benford’s distribution, while Denmark, Greece, and Ireland showed the greatest distance from Benford’s distribution. Regarding deaths, France, Ireland, Latvia, the Netherlands, Slovenia, and Spain matched Benford’s law, whilst Malta, Cyprus, Greece, Italy, and Luxembourg had the highest distance from Benford’s law. The results from the regression analysis suggested that the full vaccination rate was positively associated with non-conformity with Benford’s law, with countries with the highest vaccination percentage exhibiting greater deviation.
The results of this study imply that the deviation from Benford’s law is not associated with a country’s economy, an association that had been suggested by earlier findings (Hollyer, Rosendorff & Vreeland, 2011). However, the effect would possibly be more apparent if developing countries were included alongside developed ones (Judge & Schechter, 2009). Deviations from Benford’s distribution are only a preliminary step towards obtaining evidence of data manipulation; for the specific economies that showed the greatest deviations, further studies are suggested to validate the data reported by authorities. Additional parameters can be included, such as lockdown restrictions, preventive measures, and regional statistics and indicators.
Data Availability
All data produced are available online at: https://data.worldbank.org/
Funding
This study did not receive any funding.
Declaration of Conflicting Interests
The author declares that there is no conflict of interest.
Footnotes
Results have been updated for clarification