Abstract
Previous studies have used Benford’s distribution to assess whether there is misreporting of COVID-19 cases and deaths. Data inaccuracies provide false information to the media, undermine global response, and hinder the preventive measures taken by countries worldwide. In this study, daily new cases and deaths from all the countries of the European Union were analyzed and the conformance to Benford’s distribution was estimated. For each country, two statistical tests and two measures of deviation were calculated to determine whether the reported statistics comply with the expected distribution. Four country-level developmental indexes were also included, the GDP per capita, health expenditures, the Universal Health Coverage index, and the full vaccination rate. Regression analysis was implemented to show whether the deviation from Benford’s distribution is affected by the aforementioned indexes. The findings indicate that four countries were in line with the expected distribution, Bulgaria, Croatia, Lithuania, and Romania. For the daily cases, Denmark, Greece, and Ireland, showed the greatest deviation from Benford’s distribution and for deaths, Malta, Cyprus, Greece, Italy, and Luxemburg exhibited the highest deviation from Benford’s law. Furthermore, it was found that the vaccination rate is positively associated with deviation from Benford’s distribution. These results suggest that overall, official data provided by authorities are not confirming Benford’s law, yet this approach is not conclusive; it acts as a preliminary tool for data verification. More extensive studies should be made with a more thorough investigation of countries that showed the greatest deviation.
Introduction
The pandemic of COVID-19 has affected the life of millions of people worldwide. Due to the rapid contagiousness of the virus (Hafeez et al., 2020), nearly every country employed measures against the virus’ spread, such as national lockdowns and restrictions of outdoor activities. The pandemic showed that statistical and machine learning modelling can potentially predict the number of new cases or deaths for a given country (Cássaro & Pires, 2020; Niazkar & Niazkar, 2020; Neto et al., 2020). The accurate forecast of the infection curve can facilitate government’s measures towards the suppression of the growth rate. However, in order to accurately predict or model COVID-19 spread, reliable and valid data should be collected from authorities. The recent pandemic of COVID-19 raised issues about data collection and handling. Media reports have questioned whether the statistics provided by countries are trustworthy (Kilani, 2021). Several studies have disputed the accuracy of government data and had linked data manipulation with transparency and democracy indexes (Adsera, Boix & Payne, 2003; Magee & Doces, 2015; Rozenas & Stukal, 2019).
Previous studies, in different fields, have applied Benford’s distribution (or law) analysis to detect fraudulent and manipulated data. Specifically, for COVID-19, it was found that deaths were underreported in the USA (Campolieti, 2021), while in China no manipulation was found (Koch & Okamura, 2020). A study for Japan also showed deviation from Benford’s distribution (Lee, Han & Jeong, 2020). Furthermore, it was found that countries with higher values of the developmental indexes are less likely to deviate from Benford’s law (Balashov, Yan & Zhu, 2021).
This study applies Benford’s law to detect the first digit deviations of the announced cases and deaths from the expected frequencies in the European Union (EU). We further investigate whether the estimated deviation for each country, is associated with four developmental indexes, the GDP per capita, health expenditures (% of GDP), the Universal Health Coverage Index and full vaccination rate.
Methods
Sample
The public COVID-19 data of the European Union, regarding daily cases and deaths were exported from the European Centre for Disease Prevention and Control (ECDC) and consisted of observations between 2nd of March to the 20th of December 2021 (N = 8820). ECDC’s Epidemic Intelligence team collects and refines daily data of new cases and deaths associated with COVID-19, based on reports from health authorities worldwide. Apart from COVID-19 data, we included the gross domestic product per capita (GDPc), the healthcare expenditures of countries as percentage of GDP (HGDP), and the Universal Health Coverage Index (UHC) from the World Bank (https://data.worldbank.org/). Finally, we included the full COVID-19 vaccination rate as of the 16th of December 2021, obtain from ECDC.
Benford’s distribution
Benford’s law (or law of prime digits) is a probability distribution for determining the first digit in a set of numbers. It was formally proposed in 1938, after an early work by the mathematician Simon Newcomb, and by the physicist Frank Benford, who claimed that in natural and unrestricted data sets, the probability of each digit appearing is given by the formula:
Based on Benford’s distribution, the probabilities for each number d as the first digit are presented in Table 1.
The most common application of the law is in Economics, where it has already been considered as a tool for checking tax validity and detecting fraud (Nigrini, 1996; Durtschi, Hillison & Pacini, 2004; Tam Cho & Gaines, 2007). More recent studies have used Benford’s law to investigate whether COVID-19 data provided by countries are accurate (Kilani, 2021; Silva & Figueiredo Filho, 2021; Campolieti, 2021; Koch & Okamura, 2020) and if the deviation from Benford’s distribution could be affected by developmental indexes (Balashov, Yan & Zhu, 2021). Conformity with Benford’s distribution is expected when the data under consideration spread over several terms of magnitude, however for countries with limited cases and/or deaths, the distribution of the first digit would generally deviate from Benford’s.
Goodness-of-fit
First, in order to investigate to which extent, the observed cases and deaths conform to Benford’s law’s expected frequencies, two goodness-of-fit tests were applied, the chi-squared (χ2) goodness-of-fit test and Kolmogorov-Smirnov (K-S). The chi-squared test statistic is given by: where the index i is the digit, and Oi and Ei are the observed and expected frequencies of the i-th digit, respectively. The degrees of freedom for this test are equal to 8, and the critical value is for the significance level set at a = 0.05; thus, any value of the statistic greater than the critical value would imply significant deviation from the expected distribution. However, in large samples, the interpretation of significance should be avoided, as the test has enough power to detect even small deviations from the expected distribution (Lin, Lucas & Shmueli, 2013). To accompany the results of the chi-squared test, Cramer’s V was calculated along with 95% bootstrap confidence interval (CI) for an estimate of the effect size. The Kolmogorov-Smirnov D statistic is commonly used for comparing empirical with theoretical continuous distributions, but it can also be used with integers. The statistic is given by:
Both chi-squared and D statistics are greatly affected by sample size, hereby we included two measures that are not affected by large sample sizes, namely the Euclidean distance (ED) in the nine-dimensional space (Tam Cho & Gaines, 2007) given by: and Mean Absolute Distance (MAD) given by: where POi and PEi are the observed and expected proportions of the first digit, respectively. As 27 countries of the EU were tested for conformity with Benford’s law, familywise error rate was also considered and the Holm-Bonferroni correction was applied to adjust the p-values obtained (Aickin & Gensler, 1996).
Regression analysis
The two measures of deviation (ED and MAD) were used as the dependent variables in two regression models, with independent variables the gross domestic product per capita (GDPc), the healthcare expenditures of countries as percentage of GDP (HGDP), the Universal Health Coverage Index (UHC) and the full COVID-19 vaccination rate (Vac), to examine whether the distance observed from Benford’s distribution could be associated with those predictors. Instead of relying in ordinary least squares (OLS) estimates for the parameters of the model, bootstrap estimates have been calculated due to the small sample size of countries (N = 27) leading to more robust results. With bootstrap, 10000 samples were selected, and each time we estimated the OLS coefficients of the parameters; hence, creating the sampling distribution of each coefficient along with 95% bootstrap CIs (Davison & Hinkley, 1997).
Results
The results of the goodness-of-fit tests along with the two measures of deviations are presented in Table 2. For almost all countries, except for Bulgaria, Croatia, Lithuania and Romania, significant deviations from Benford’s distribution were found for both cases and deaths. For daily cases, Denmark, Ireland, and Greece were associated with the highest chi-squared statistics and this was also confirmed by the two distance measures (Figure 1 and 2). Regarding deaths, Cyprus, Italy, and Greece had the highest chi-squared statistics and distance measures. The K-S D statistic in most cases came in agreement with the chi-squared test.
The bootstrap estimates and 95% bootstrap CIs of the regression analysis for the two measures of deviation are presented in Table 3. In order to avoid having small coefficients, GDP per capita has been log-transformed and the other three predictors were divided by 100. Regarding new cases, no predictor was found to significantly affect either MAD or ED. Vaccination rate was positively associated with deviation from Benford’s distribution in deaths as measured by MAD (0.076, 95% CI [0.020, 0.144]) and as measured by ED (3.415, 95% CI [1.175, 6.286]), indicating that countries with a higher full vaccination percentage tend to deviate more from Benford’s law.
Discussion
This study aimed to apply Benford’s law to COVID-19 data from the European Union. Data of daily new cases and deaths were collected by ECDC for the period of the 2nd of March 2021 and 20th of December 2021. Also, four country-level indexes were collected, the GDP per capita, the health expenditure as GDP percentage, the Universal Health Coverage index and the full vaccination rate. Two goodness-of-fit tests were applied, the chi-squared test and the Kolmogorov-Smirnov test, and two measures of deviation were estimated, the Euclidean distance and Mean Absolute distance. Bulgaria, Croatia, Lithuania, and Romania were not deviating from Benford’s law for both new cases and deaths. Regarding daily cases, Cyprus, Estonia, Germany, Italy, Malta, Slovakia, and Slovenia were in line with Benford’s distribution, while Denmark, Greece and Ireland, showed the greatest distance from Benford’s distribution. Regarding deaths, the chi-squared test showed that France, Ireland, Latvia, Netherlands, Slovenia and Spain matched Benford’s law whilst Malta, Cyprus, Greece, and Italy had the highest distance from Benford’s distribution. These results do not necessarily indicate fraudulent data, especially for deaths, as countries with a small number of daily new deaths are not expected to behave according to Benford’s law. The results from the regression analysis suggested that the full vaccination rate was positively associated with non-conformity with Benford’s law, where countries with the highest vaccination percentage exhibited greater deviation. This may imply that countries with a higher vaccination rate were also associated with fewer deaths, thus the data did not confirm Benford’s law.
The results of this study imply that the deviation from Benford’s law is not associated with country’s economy, which was suggested by earlier findings (Hollyer, Rosendorff & Vreeland, 2011). However, the effect would possibly be more apparent by including developing along with developed countries (Judge & Schechter, 2009). Deviations from Benford’s distribution do not imply that data are fraudulent, but this approach may act as a preliminary step for obtaining evidence for data manipulation; it is suggested that for specific economies that showed the greatest deviations, further studies could be made validating data reported by authorities. Additional parameters can be included, such as lockdown restrictions, preventive measures, and regional statistics and indicators.
Data Availability
All data produced are available online at: https://data.worldbank.org/
Funding
This study did not receive any funding.
Declaration of Conflicting Interests
The author declares that there is no conflict of interest.
Footnotes
The title and discussion have been updated for clarification