Abstract
The impacts of COVID-19 outbreak on socio-economic status of countries across the globe cannot be overemphasized as we examine the role it played in various countries. A lot of people were out of jobs, many households were careful of their spending and a greater social fracture of the population in fourteen different countries has emerged. We considered periods of infection spread during the first and second wave in Organization for Economic Co-operation and Development (OECD) countries and countries in Africa, that is developed and developing countries alongside their social-economic data. We established a mathematical and statistical relationship between Theil and Gini index, then we studied the relationship between the data from epidemiology and socio-economic determinants using several machine learning and deep learning methods. High correlations were observed between some of the socio-economic and epidemiologic parameters and we predicted three of the socio-economic variables in order to validate our results. These result shows a sharp difference between the first and second wave of the pandemic confirming the real dynamics of the spread of the outbreak in several countries and ways by which it was mitigated.
1 Introduction
The modelling of COVID-19 by scientists, epidemiologists and health experts has been considered since the beginning of the pandemic as it ravages the world. Socio-economic determinants of this pandemic are important as they project how severe a country is hit and how it is being controlled, this leading to consider these variables alongside daily reproduction rates during the contagiousness period of individuals infected by the pandemic.
Some researchers have worked on socio-economic analysis of COVID-19 pandemic, they are as follows : in [1], the authors examined the geoclimatic, demographic and socio-economic determinants of COVID-19 prevalence and have shown that the influence of these determinants varies by comparing the first and second wave of the pandemic. The socio-economic impact of the COVID-19 pandemic in United State of America (USA) was studied by [2], where the authors investigate the systematic risk posted by sector-level industries within the USA. [3] modeled daily confirmed cases of COVID-19 in different countries across the globe using regression models with predictions for upcoming scenarios. [7] worked on the socio-economic and environmental factors influencing the basic reproduction number of COVID-19 pandemic by fitting a logistic growth curve to report daily cases up to the first peak of the pandemic while [8] studied the impacts of social and economic factors on the transmission of COVID-19 disease with China as a case study using an empirical model, and the authors conclude that these determinants have rich implications for ongoing efforts in containing the pandemic. The work in this present article is an extension of [16] which was based on the analysis of the reproduction number of COVID-19 based on the current health expenditure as Gross Domestic Product Percentage (CHE/GDP) across several countries using some machine learning tools.
Our goal in this article is to establish a relationship between Theil and Gini index, analyze critically some of the socio-economic determinants of the pandemic, correlate them, predict three of the socio-economic variables and perform some regression analysis. We also cluster countries according to these parameters and with the help of the lasso method (least absolute shrinkage and selection operator) we were able to select the best variables for this modelling.
The paper is divided into seven sections, in section two we explain the methodology used in this research. Section three deals with the variables used, in section four we established a mathematical and statistical relationship between Theil and Gini index, section five is dedicated to the visualization of the results obtained, while we finally gave the discussion and conclusion in section six and seven respectively.
2 Methods
The use of machine learning and deep learning methods to analyze data has been helpful over the years to get a proper view on how a model behaves. In this research we used some supervised and unsupervised machine learning methods and we also tried to use one deep learning method to see the visualization.
The supervised machine learning methods we used are polynomial regression, linear regression, lasso regression and ridge regression. We also use some of these methods to make prediction by training the model and testing some percentage of the values. Lasso regression helped us to know the best variables to be used in the modelling. We used unsupervised learning to cluster variables across countries and the methods we used to validate our results are K-means clustering, Hierarchy clustering and principal component analysis.
We also performed correlation among parameters used in this modelling and an optimization method called ordinary least square (OLS) to model the socio-economic determinants of COVID-19. The deep learning method we used are neural network and multilayer perceptron (MLP) regressor which is a class of feedforward artificial neural network (ANN).
Multivariate least square method allows us to test much more complex relations between variables. It can be can be represented as follows: where β1, β2 … are coefficients or weights, ∈ is the residual noise, y is the dependent variable and x1, x2, … are the independent variables.
Ridge and lasso regression are simple methods to reduce the model complexity and prevent over-fitting which may result from linear regression. The cost function for ridge regression is given below: with for some , while α is the penalty term that regularizes the coefficients such that if the coefficients take large values, the optimization function is penalized. Ridge regression puts constraint on the coefficients β. We define the cost function for lasso regression as: with for some .
3 Variables
3.1 Socio-economic Variables
Socio-economic variables are a strong determinant of the spread of the pandemic. We extracted some data from ([4-5] and [9-14]), while we also calculated some of the socio-economic variables.
The observed variables used in this research are immigration rate, average life expectancy (L.E), Tuberculosis incidence (TB), temperature, percentage of gross domestic product devoted to health expenditure (% GDP H. E) collated from [16], ten percentage lowest (L.I) and highest income (H.I), government response stringency index, sustainable development goal (SDG) index, human development index (HDI), environmental performance index (EPI), consumer confidence index (CCI), stringency index (S.I), Theil index (T.I) and Gini index (G.I). We collated the data based on the available countries and most recent years.
The calculated socio-economic variables are as follows:
– Social fracture (S.F) index which is the ratio between the ten percentage highest income and the ten percentage lowest income. In brief it is expressed by the formula below;
– Demographic Index (D.I) is the ratio between the percentage of GDP devoted to health expenditure and Social fracture index. It is expressed by equation below:
We give a precise values of all variables in Tables 3, 4 and 5 in the Appendix.
3.2 Epidemiologic Variables
The variables from epidemiology were chosen during the exponential phase of the first and second wave of the pandemic. Daily new cases observed during the first 100 days was used to calculate the exponential slope for first and second wave, opposite autocorrelation slope were averaged on six days for the first and second wave. The maximum Ro was collated from [16] while observing this value during the first and second waves of countries considered. We also collated from [16] the deterministic Ro for the first and second wave of the pandemic taking six days as length of contagiousness period.
So, in summary we have six variables from epidemiology which are: first wave maximum Ro, second wave maximum Ro, first wave deterministic Ro, second wave deterministic Ro, opposite initial autocorrelation slope averaged on six days for both first and second wave of the daily new cases for developed and developing countries. All epidemiologic variables values were taken from the Appendix in [16].
In this present study, we validated our results by performing cross validation and also training 80% of the data and training 30%.
4 The Relationship Between Theil and Gini Index
4.1 Mathematical Approach
We first show the relationship between Theil and Gini index mathematically: where xk (respectively yk) denotes the kth cumulative part of the population (respectively income). If we choose the population increments equal to 1/n and if E(Δk) represents the expectation for the distribution dk = xk - xk-1 of the increment Δk = yk-yk-1 (see [17]), we have for the Theil index applied to the percentage yk of the total income relative to a percentage xk of the total population (see [18]):
If the first increment of y is close to 1 (which corresponds to a square shaped Lorenz curve, i.e., closed to the down and left border of the income/population square, or to a high Gini index) (cf. Figure 1), then -Log(Δ1)∼-1+Δ1, and we have: the equality being available only if the Lorenz curve presents a perfect top left right-angle shape.
4.2 Statistical Approach
4.2.1 Correlation
We correlated both Theil and Gini Indices with all epidemiologic, demographic and socio-economic variables and as it can be seen in Figure 2, Theil and Gini Indices are highly positively correlated with coefficient 0.7.
4.2.2 Regression Analysis Between Theil and Gini Index
Linear regression models use some historic data of independent and dependent variables and consider a linear relationship between both while polynomial regression models use a similar approach but the dependent variable is modeled as a degree n (n=2 in the present study) polynomial in x.
Linear regression model is given as: where β1 are the weights, β0 is the intercept and ∈ is the random error term. The above equation is the linear equation that needs to be obtained with the minimum error. Polynomial regression of order 2 is given below:
We present the visualization of the results using this approach in Figure 3.
For the linear regression as shown in Figure 3a, the intercept is 31.03, p-value is 0.0181, R-squared is 0.4881, residual standard error is 3.116 and all coefficients are significant with p < 0.05 for both the train and test data for linear and polynomial regression. The median of the residual plot in Figures 3b and 3f are 0.2111 and 0.2566 respectively for both linear and polynomial regression which is close to zero. The normality of the residual was tested using Jarques Bera and Durbin-Watson tests which gave a high p-value and we fail to reject the null hypothesis that the skewness and kurtosis of the residuals are statistically equal to zero. In order to know the performance of the linear regression model we trained 80% of the data and tested 20% of the data and also did cross validation to be sure of the accuracy. The predicted and the observed values are very close to the result presented for the regression models used. For the linear model we present the cross-validation result in Figure 3e whose average mean square error for the 5 portion folds is 11.72794. We observed correlation between the tested and the predicted values has high correlation accuracy (R-squared = 0.97). The test set p-value is 0.02 with residual standard error of 3.528. For polynomial regression of order 2, the train set has the following results: R-squared = 0.6, p-value = 0.002 and residual standard error = 2.935. The test set has the following results: R-squared = 0.99, p-value = 0.008 and residual standard error = 0.5639.
4.2.3 Neural Network for Theil and Gini Index
We used the neuralnet package in R in order to visualize the weights of the network and the bias between Theil and Gini index and as it can be seen in Figure 4, the weights are good with low bias. We also predicted some of the data and the prediction score is 0.98.
4.2.4 Multivariate Analysis for Gini and Theil Index Alongside Other Socio-economic Variables and Epidemiologic Variables
Figure 5 corresponds to the ordinary multivariate least square methods with R-squared = 0.674. Figure 5a shows Paraguay as outliers not fitting the data, Figure 5b normalizes all countries and doesn’t point any country in the plot.
4.2.5 Prediction of Gini Index Using MLP Regressor, Linear, Lasso and Ridge Regression
In this section we used cross validation method to choose the best parameter α for the modelling as shown in Figure 6c. For ridge regression, α = 0.142 with mean square error of 1.36 and α = 0.368 for lasso regression with mean square error = 5.10. For Figure 6e, the training score = 1.000 and the test score = 0.641, for Figure 6f training score = 0.99 2 and the test score = 0.497, for Figure 6g training score = 0.99 and the test score = 0.406 and for Figure 6h training score = 0.984 and test score = -0.077. It is evident from these results that linear regression best predicts Gini index with the highest test score and predicted values are very close to each other as presented in Table 1. Also, we observed the same pattern of prediction in Figures 6e to 6h showing that the all methods used in this section have the same predictive behavior.
4.2.6 Clustering Analysis for Gini and Theil Index Alongside Other Socio-economic Variables and Epidemiologic Variables
In Figure 7c, the first cluster has 14 countries and the second has 3 countries which are Uruguay and El Salvador on same hierarchy while Argentina is on another hierarchy. We only show the clusters dendrogram for the first cluster. In Figure 7f, Gini index has the highest positive correlation of 0.44 and Theil index has the value 0.34 in PC 1.
5 Application of the Methods to OECD Countries, Africa Countries, Developed and Developing countries
5.1 Developed and Developing Countries
5.1.1 Regression and Multivariate Analysis for Socio-economic Variables and Epidemiologic Variables
In Figures 8c and 8d we modeled the dependent variable as a degree n (n=6 in the present study) polynomial in x, an extension of Equation 8. The Figure 8 present regression analyses with the following parameters: Figure 8a: LinregressResult (slope = 0.11463663009107196, intercept = -0.0037118697103040027, rvalue = 0.43871 57758684147, pvalue = 0.0032517683682962654, stderr = 0.03667137141150123, R-squared = 0.192472 and RMSE = 0.1044724946057671.)
Figure 8b: LinregressResult (slope = -0.002547609589041096, intercept = 0.07888755616438356, rvalue = -0.32728 86357381478, pvalue = 0.03672803730354382, stderr = 0.0011777868896598461,
R-squared = 0.107118 and RMSE = 0.03065537183298402)
Figure 8c: LinregressResult (slope = 0.0017309145398433248, intercept = 0.06695128460299407, rvalue = 0.26367 5660748951, pvalue = 0.08754941979369255, stderr = 0.0009889311191763849,
R-squared = 0.069525 and RMSE = 0.0354860744891158), R-squared for order Six Polynomial Regression = 0.3 and RMSE for Polynomial Regression of order six = 0.04060485094256808.
Figure 8d: LinregressResult (slope = 0.0015999465132904799, intercept = 0.0810899892250729, rvalue = 0.286126 6574746827, pvalue = 0.13239511534872409, stderr = 0.001031140187045727, R-squared = 0.081868 and RMSE = 0.05492215494302141), RMSE for Polynomial Regression of order six = 0.07286590609946085 and R-squared for order Six Polynomial Regression = 0.35.
Figures 8e to 8h correspond to the ordinary multivariate least square method with R-squared = 0.76. Figure 5a shows some developing countries as outliers while Belgium is the only developed country which do es not fit the data.
5.1.2 Prediction of Percentage GDP Health Expenditure
In this section we used cross validation method to choose the best parameter α for the modelling as shown in Figure 9c. For ridge regression, α = 0.012 with mean square error of 2.32 and α = 0.029 for lasso regression with mean square error = 2.21. For Figure 9e, the training score = 0.983 and the test score = 0.607, for Figure 9f training score = 0.170 and the test score = 0.021, for Figure 9g training score = 0.854 and the test score = 0.115 and for Figure 9h training score = 0.980 and test score = -2.386.
It is evident from the results that linear regression best predicts percentage of GDP devoted to health expenditure with the highest test score and predicted values are very close.
5.1.3 Principal Component Analysis and Clustering Result
In Figures 10e and 10f, the first cluster has 15 countries, the second cluster 27 countries while the last cluster has 2 countries which are Tanzania and Mauritius. We only show the two clusters dendrograms with many countries. In Figure 10c, Gini index has the highest positive correlation of 0.52 and demographic index has highly negative correlation of -0.53 in PC 1 while first wave maximum Ro has highest positive correlation in PC 2, whose value equals 0.70.
5.2 Africa Countries
5.2.1 Multivariate Analysis for Socio-economic Variables and Epidemiologic Variables
Figures 11a to 11d correspond to the ordinary multivariate least square method with R-squared = 0.60. Figure 11a shows Botswana and Tanzania as outliers not fitting the data.
5.2.2 Prediction of Temperature
In this section we used cross validation method to choose the best parameter α for the modelling as shown in Figure 12c. For ridge regression, α = 1.005 with mean square error of 19.13 and for lasso regression α = 6.018 with mean square error = 16.93. For Figure 12e, the training score = 0.647 and the test score = -2.228, for Figure 12f training score = 0.316 and the test score = 0.154, for Figure 12g training score = 0.573 and the test score = -1.136 and for Figure 12h training score = -6.728 and test score = -4.714. It is evident from these results that lasso regression best predicts temperature with the highest test score and predicted values are close.
5.2.3 Principal Component Analysis and Clustering Result
In Figures 13e and 10f, the first cluster has 40 countries, the second cluster 13 countries while the last cluster has only one country which is Botswana. We only show the two clusters dendrograms with many countries. In Figure 13c, average life expectancy has the highest positive correlation of 0.46 in PC 1 while first wave deterministic Ro has highest positive correlation in PC 2, value equal to 0.47.
5.3 OECD Countries
5.3.1 Multivariate Analysis for Socio-economic Variables and Epidemiologic Variables
Figure 14 corresponds to the ordinary multivariate least square method with R-squared = 0.90. Figure 15a shows Austria and Belgium as outliers not fitting the data.
5.3.2 Prediction of Percentage GDP Health Expenditure
In this section we used cross validation method to choose the best parameter α for the modelling as shown in Figure 15d. For ridge regression, α = 0.005 with mean square error of 1.905 and for Lasso regression, α = 0.027 with mean square error = 1.657. For Figure 15e, the training score = 0.993 and the test score = 0.535, for Figure 15f training score = 0.898 and the test score = 0.629, for Figure 15g training score = 0.983 and the test score = 0.259 and for Figure 15h training score = -0.072 and test score = -0.196. It is evident from these results that lasso regression best predicts percentage of GDP devoted to health expenditure with the highest test score and predicted values are very close.
5.3.3 Principal Component Analysis and Clustering Result
In Figures 16e and 16f, the first cluster has 20 countries, the second has 5 countries which are USA and Bulgaria on same hierarchy, Mexico and Costa Rica on same hierarchy and Chile standing alone. The third cluster has 12 countries. We only show the two highest clusters dendrograms. In Figure 16c, Gini index and social fracture index have the highest positive correlation of 0.45 and 0.46 respectively in PC 1 while Gini index and percentage of GDP devoted to health expenditure has highest positive correlation in PC 2, whose values equal to 0.41 and 0.65 respectively.
Discussion
We have been able to develop new approaches to the socio-economic determinants for the modelling of the Covid-19 pandemic during the exponential phase. Some of these determinants have shown high correlation with parameters from epidemiology and can be seen in the heatmap diagrams in Figures 2, 10a, 13a, and 16a, explaining the role of each variable thanks to these correlations.
For developed and developing countries, lasso regression reduced the correlation between social fracture index and ten percent highest income to zero while for OECD countries Gini index and social fracture index correlation was reduced to zero. Some of our variables were not used in the optimization method-OLS due to multi-colinearity we observed at the summary of the results. For the two sets of countries, consumer confidence index, opposite initial autocorrelation averaged on six days for first and second wave, ten percent lowest income and ten percent highest income were not used in the modelling. The r-squared for OLS results for developed & developing and OECD countries are 0.76 and 0.90 respectively which shows a high significance rate (see figures 10e to 10f, 14).
The principal component analysis shows high correlation for the numbers of new cases we used in this research. Social fracture index has high correlation in PC1 for both cases while in PC2, percentage of GDP devoted to health expenditure was dominant for OECD countries and maximum R+ for the first wave was dominant for both developed and developing countries (see Figures 10c and 16c).
What we could deduce from all these observations is that the socio-economic determinants are a key to the modelling of infectious disease as these parameters give high signals on the trend during the spread of the pandemic for various countries.
7 Conclusion
The systematic study of the correlations between socio-economic variables (Gini and Theil indices, percentage of GDP devoted to heath expenditure, etc.) and epidemiological variables (reproduction rate, opposite of the slope of autocorrelation to origin, etc.) shows a disparity between developed and developing countries, as well as between epidemic waves. Developed countries with high indices of social divide, but high health expenditure, did not, for the first wave, react better to the COVID-19 epidemic than developing countries. On the other hand, the rapid implementation of isolation and vaccination measures enabled them to anticipate and reduce the effects of the second wave. In a subsequent work, we will study the evolution of this disparity between developed and developing countries during subsequent waves of SARS CoV-2.
Data Availability
All data comes from public databases
Conflict of Interest
The authors declare no conflict of interest.
Authors Contributions
Conceptualization, J.D.; K.O. and M.R.; methodology, J.D.; K.O.and M.R.; software, K.O.; validation, J.D.; K.O. and M.R.; formal analysis, K.O.; investigation, J.D. and M.R.; resources, J.D.; data curation, K.O.; writing— original draft preparation, K.O.; writing—review and editing, J.D. and K.O; visualization, K.O.; supervision, J.D. and M.R; project administration, J.D. and M.R. All authors have read and agreed to the final version of the manuscript.
Funding
No specific funding was received for this research
Acknowledgments
The authors wish to acknowledge the Petroleum Technology Development Fund (PTDF) Nigeria doctoral fellowship in collaboration with Campus France Africa Unit.