CLINICAL APPLICATIONS OF MACHINE LEARNING ON COVID-19: THE USE OF A DECISION TREE ALGORITHM FOR THE ASSESSMENT OF PERCEIVED STRESS IN MEXICAN HEALTHCARE PROFESSIONALS

Juan Luis Delgado-Gallegos; Gener Avilés-Rodriguez; Gerardo R. Padilla-Rivas; María De los Ángeles Cosio-León; Héctor Franco-Villareal; Erika Zuñiga-Violante; Gerardo Salvador Romo-Cardenas; Jose Francisco Islas

doi:10.1101/2020.11.18.20233288

Abstract

Stress and anxiety have been shown to be indirect effects of the COVID-19 pandemic; therefore, managing stress becomes essential. Healthcare professionals are one of the populations most affected by the pandemic. Thus, it is paramount to understand and categorize their perceived levels of stress, as it can be a detonating factor leading to mental illness. In our study, we used a machine learning prediction model to help measure perceived stress; a C5.0 decision tree algorithm was used to analyze and classify datasets obtained from healthcare professionals of the northeast region of Mexico. Our analysis showed that 6 out of 102 instances were incorrectly classified: two cases for mild, three for moderate and one for severe stress (accuracy of 94.1%). Statistical correlation analysis was performed to ensure the integrity of the method. In addition, we concluded that severe stress cases can be related mostly to high levels of Xenophobia and Compulsive checking.

Introduction

As the deadly coronavirus disease COVID-19 continues to spread globally, medical and allied healthcare professionals have become one of the sectors most highly affected by this disease^1–3. Particularly in developing democracies, as is the case of Mexico, the public health system has become engulfed by overwhelming levels of stress^4,5. In addition, the situation becomes even more taxing for attending personnel, as they must deal not only with the burdened system⁶ but also with the enemy upfront; it is here where they too can become prey to the disease⁷. Recent reports showed that, for the period spanning from the end of February to the end of August, 97,632 healthcare professionals became infected, with over 1,300 deaths, more than reported in any other country⁸. What is even more burdensome is the fact that Mexico continues atop all Latin American countries in infection-to-death rate (>10%)⁹.

According to the Pan-American Health Organization (PAHO), Mexico has the highest number of healthcare workers with COVID-19 in Latin America¹⁰. According to National Health Ministry reports on healthcare professionals affected by COVID-19 up to October 25th, 2020, medical doctors rank highest in mortality compared with their nursing or other allied healthcare professional counterparts. Recent reports show that both physicians and nurses have similar levels of burnout and emotional fatigue. Physicians tend to work in a more independent manner; this, along with their long shift hours, high sense of duty, work ethic, and the fact that they hold multiple jobs, normally with low wages, becomes a source of additional stress⁸. In contrast, nursing staff work in groups and develop higher social support^11,12. Recent reports have shown that the resilience level in nurses seems to correlate with lower anxiety; this is part of a well-developed social support system in which nursing professionals provide emotional help and support in order to uplift the communal spirit even under overwhelming circumstances¹³.

During the period from August 16th to November 3rd, the spread of the disease amongst physicians grew significantly, reaching 140,196 confirmed cases, 31,870 cases classified as “possible” COVID-19 and 222,372 discarded cases, with 1,884 confirmed COVID-19 deaths, up to 198 suspected COVID-19 deaths, and an active case count of 3362. Interestingly, active cases decreased from the initial count of 4743 in August to 3362; nonetheless, in November the number of suspected and confirmed COVID-19 deaths showed an increase¹⁴ (Figure 1). This is of particular interest when we take stress into account, as it may be a trigger for a potentially severe outcome of the disease.

Figure 1. COVID-19 epidemiological curve for physicians according to the Mexican health ministry for the period from August to November 2020. Increases: confirmed cases (140,196), suspected cases (31,870), negative cases (222,372), confirmed deaths (1,884), suspected deaths (198). Decreases: active cases (3362).

Recent developments in computational modeling have led to the ever-evolving field of artificial intelligence which, when combined with neuroscience and behavioral science, has created the new field of computational psychiatry^15,16. Nowadays, Machine Learning and Artificial Intelligence are promising technologies used by various healthcare providers, as they offer better scalability, faster processing and greater reliability, which translates into a stronger performance of the clinical team¹⁷. Therefore, there has been an expected trend towards using these techniques in order to better understand and fight the pandemic. Computational psychiatry helps to model and understand underlying mental illness; when this is driven forward with “machine learning”, we can potentially predict behavioral patterns, improve classification and assist the physician in providing faster and more personalized medical attention.

Decision trees are well-known machine learning algorithms, commonly used for establishing classification systems based on multiple variables for a target feature¹⁸. Typically, with the use of this method, it is possible to classify a population into branch-like segments that form an inverted tree^18,19. The algorithm can efficiently deal with large, complicated datasets without imposing a complicated parametric structure¹⁹. Researchers have reported the use of these types of algorithms in the study of behavioral and mental health²⁰. Thus, it is possible to use this technique to help define different clinical paths and to classify subgroups of patients for different diagnostic tests, treatment strategies, and assessments of mental health-related conditions^21,22.

Methods

A common strategy used for data analytics is the Cross Industry Standard Process for Data Mining methodology (CRISP-DM), which defines six steps for data-based knowledge projects. This strategy begins with a phase where problems from the scenario are stated and objectives defined (Business Understanding), followed by a stage where data insights are obtained (Data Understanding). Next, a final dataset for study is defined and analyzed (Data Preparation). Results from the data preparation analysis make it possible to define the algorithm with which a model is generated (Modeling). Once the model is generated, a performance evaluation is required in order to confirm whether it fulfills the proposed objectives (Evaluation). If the goal of the modeling is achieved, the model can be implemented for the purpose for which it was proposed (Deployment)²⁹.

This work is based on previously reported adapted COVID-19 stress scales data¹. The analysis method was adapted from CRISP-DM, with the same number of stages and the same sequence, as shown in Figure 4. An initial Data Structure study was done from the data analytics perspective in order to consider the types of variables that compose the stress scales, and to define the goal of the study as a model that contributes to the understanding of mental illness in healthcare workers during the COVID-19 outbreak. Next, a Data Validation analysis applied statistical tests to confirm relations between variables from the scales and classification outcomes from the raw data, and also to confirm internal consistency. This was followed by a Data Distribution analysis to study stress components that could be used for model selection and interpretation. A Decision tree model was selected in order to study the relations and possible classification routes for stress level according to data from its respective scales. Given the clinical component of the study, a performance analysis based on accuracy, sensitivity and specificity was carried out on the results. This allowed the obtained model to be validated, and observations and conclusions about the stress scales were drawn as an application of the model.

Figure 2. Stress level distribution in Healthcare personnel. (Left to right) Absent, Mild, Moderate, Severe.

Figure 3. Decision tree applied to the healthcare personnel stress scale level dataset. The top variables influencing stress are Xenophobia (Xeno) and Compulsive checking (Comp), which lead to severe stress. Traumatic stress (Trauma) and Danger + Contamination (Dan Con) also influenced the perception of stress. The socioeconomic variable did not influence the outcome of the decision tree.

Figure 4. Methods for machine learning based analysis on stress scales of health care workers.

The statistical analysis and computational modeling (algorithm) were done to obtain the behavior patterns and distribution of the data. Both the statistical and the algorithm-based analyses of the dataset were performed in R and RStudio.

Descriptive statistical analysis

In order to understand the structure of the data, validate it and explore its distribution, a descriptive statistical study was carried out. This comprised obtaining measures of central tendency and variability metrics to describe the whole set of data with single values, the center of the distribution and its similitude³⁰. Cronbach’s alpha was estimated considering the numerical values from all participant responses, and is presented as the index of reliability and internal consistency³¹. The Chi-square statistic was applied to the general classification of each participant and each of the questions, separated by the sections of the generalized stress scale reported³².
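As an illustration of the internal-consistency index described above, the following is a minimal Python sketch of the standard Cronbach's alpha formula. It is not the authors' R code; the function name `cronbach_alpha` and the toy response matrix are hypothetical and are used only to show the calculation.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of rows (one row of item scores per participant)."""
    n_items = len(responses[0])
    # Sample variance of each item (column) across participants.
    item_vars = [variance([row[j] for row in responses]) for j in range(n_items)]
    # Sample variance of each participant's total score.
    total_var = variance([sum(row) for row in responses])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical toy data: 4 participants answering 3 items (NOT the study data).
toy_responses = [
    [0, 1, 1],
    [2, 2, 3],
    [3, 4, 4],
    [1, 1, 2],
]
alpha = cronbach_alpha(toy_responses)
print(f"Cronbach's alpha (toy data): {alpha:.3f}")
```

Values close to 1 indicate that the items vary together, which is how the index is commonly read as a measure of internal consistency.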

Mathematical and computational algorithms

Following the statistical analysis that comprised the first three stages of the study, a decision tree model was developed to serve as a computational supportive scaffold for the study of mental illness. These types of algorithms have previously been useful to study and predict mental illness^19,20. The C5.0 algorithm was used to construct the decision tree and to analyze and classify the stress level from the dataset²³. The relation between entropy and information gain yields a model that is trained on the dataset and helps to analyze and visually explain the routes and relations of the studied variables, while excluding features that do not affect the outcome. This makes it an efficient tool for understanding illness and its severity based on measured stress features.
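The relation between entropy and information gain mentioned above can be sketched in a few lines of Python. This is a simplified illustration of how a single threshold split is scored, not the C5.0 implementation itself, which includes further refinements such as pruning. The function names, scores, labels and threshold are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting on `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - weighted


# Hypothetical toy data (NOT the study data): two scale scores for 4 participants.
labels = ["mild", "mild", "severe", "severe"]
informative_scale = [2, 3, 10, 12]   # separates the two classes at threshold 5
uninformative_scale = [1, 9, 2, 8]   # mixes the two classes at threshold 5

gain_informative = information_gain(informative_scale, labels, threshold=5)
gain_uninformative = information_gain(uninformative_scale, labels, threshold=5)
print(gain_informative, gain_uninformative)
```

A feature whose best split yields no information gain, like the second toy scale, gives the tree no reason to branch on it, which is the mechanism by which uninformative features drop out of the model.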

Following both the statistical and computational study of the dataset, the performance of the generated model was studied in terms of sensitivity, specificity and accuracy³³. Conclusions were drawn from the outcome and the routes defined by the branches of the tree model, considering the initial statistical analysis.

Results

An initial preprocessing statistical analysis was applied to the 106 instances dataset. After eliminating missing data instances for statistical and algorithm-based analysis preparation, a group of 102 instances was used for the study.

Frequency counts were applied to the Participant profession and work area variables, as shown in Table 1.

Table 1. Frequency count on Participant profession and work area

The adapted COVID-19 stress scales comprise five areas, each scored as the sum of the responses to the questions in the corresponding section of the survey. Central tendency metrics were calculated for each of these components based on the cumulative result of each participant, as shown in Table 2.

Table 2. Central tendency metrics for the adapted COVID-19 stress scale features.

Given that each feature is based on the sum of the responses from the survey, we considered all the values from each question and participant for the calculation of Cronbach’s alpha. This gave a result of 0.94, which shows a good internal consistency for the whole survey instrument and data. In addition, supplementary Table 1 shows the Chi-square tests applied to each question in order to define the significance of the relationship between the variables. Table 3 shows the result of the test for each scale area and for the cumulative adapted COVID-19 stress scales.

Table 3. Analysis per General Area

Most of the results from the Chi-square test show significance, except for the following: one question from the Traumatic stress scale and four from the Compulsive checking scale. This indicates a dependence between the stress level classification, calculated as a cumulative result of the scales, and the answers given by each participant.
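To make the test used above concrete, the following Python sketch computes the Pearson chi-square statistic and degrees of freedom for a contingency table of stress-level classification against the answers to one question. The function name and the 2 x 2 table are hypothetical; the study applied the test to its own survey data in R.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical counts (NOT the study data): rows = two stress classes,
# columns = low / high answer to a single question.
toy_table = [
    [10, 20],
    [20, 10],
]
stat, dof = chi_square_statistic(toy_table)
print(f"chi-square = {stat:.3f}, dof = {dof}")
```

For one degree of freedom the critical value at p = 0.05 is 3.841, so a statistic above that value would indicate a significant dependence between the answer and the classification.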

The distribution of the stress level classification in healthcare personnel calculated from the stress scales is shown in Figure 2 as a frequency histogram of the four severity levels of the stress scale from the observed dataset. This variable corresponds to the target feature in the decision tree model (Figure 3).

The results from both Cronbach’s alpha and the Chi-square test show the internal consistency of the data and validate the dependence used for the stress level calculation, ensuring the quality of the dataset for algorithm-based analysis.

Following the descriptive statistical analysis, a decision tree model was trained with the preprocessed dataset using the C5.0 algorithm^19,23, with the stress level as the target variable. Participant profession, work area and all five cumulative stress scale areas were used as the predictive variables in order to find any relationship between them that could predict stress level. Figure 3 shows the decision tree obtained from the dataset. At the predictive branch, a set of boxes with all four levels of stress is observed. In each box, the extreme right bar corresponds to the severe level indicator, followed to the left by the moderate, mild and absent levels, respectively. Although the features related to participant profession and work area were declared, these variables were pruned from the model. Table 4 shows the confusion matrix from the obtained decision tree model, in which only 6 out of 102 instances were incorrectly classified: two cases for the mild level, three for moderate and one for severe. All these errors fell only in neighboring levels, giving the model an accuracy of 94.1%.

Table 4. Confusion matrix of obtained decision tree model for stress level classification
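The reported accuracy follows directly from the counts given in the text: of the 102 instances, 2 + 3 + 1 = 6 were misclassified. As a worked check:

```latex
\[
\text{accuracy}
  = \frac{\text{correctly classified instances}}{\text{total instances}}
  = \frac{102 - 6}{102}
  = \frac{96}{102}
  \approx 0.941
\]
```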

In order to analyze model performance, a sensitivity and specificity calculation was carried out. For this, three different scenarios were considered based on the classification outcome from the dataset, dividing instances into healthy and disease groups. The calculation was done with the figures from the confusion matrix. Results are shown in Table 5.

Table 5. Decision tree sensitivity and specificity calculation from stress scales dataset.
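The following Python sketch shows one way such a calculation can be derived from a four-level confusion matrix. It assumes that the three scenarios correspond to the three possible cut points between the ordered stress levels (absent vs. the rest, absent/mild vs. moderate/severe, and the rest vs. severe); the source does not spell out the scenarios, so this is an assumption. The matrix below is a hypothetical toy example and does not reproduce Table 4.

```python
LEVELS = ["absent", "mild", "moderate", "severe"]


def binarize(confusion, cut):
    """Collapse a 4x4 confusion matrix (rows = actual, columns = predicted)
    into TP, FN, FP, TN, treating levels with index >= cut as 'disease'."""
    tp = fn = fp = tn = 0
    for i, row in enumerate(confusion):
        for j, count in enumerate(row):
            actual_pos, pred_pos = i >= cut, j >= cut
            if actual_pos and pred_pos:
                tp += count
            elif actual_pos:
                fn += count
            elif pred_pos:
                fp += count
            else:
                tn += count
    return tp, fn, fp, tn


def sensitivity_specificity(confusion, cut):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp, fn, fp, tn = binarize(confusion, cut)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy matrix (NOT Table 4): 20 instances, errors in neighboring levels.
toy_confusion = [
    [4, 1, 0, 0],  # actual absent
    [0, 5, 0, 0],  # actual mild
    [0, 1, 4, 0],  # actual moderate
    [0, 0, 1, 4],  # actual severe
]

results = {cut: sensitivity_specificity(toy_confusion, cut) for cut in (1, 2, 3)}
for cut, (sens, spec) in results.items():
    print(f"disease = {LEVELS[cut]} or worse: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Moving the cut point changes which misclassifications count as false negatives or false positives, which is why each scenario yields its own sensitivity and specificity pair.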

Discussion

Our purpose was to define a statistical and computational framework algorithm to understand stress levels in healthcare professionals due to the impact of the COVID-19 pandemic and to potentially define a tool which can be further used as a predictor of severity of stress.

A dataset related to the adapted COVID-19 stress scales, as defined by Delgado-Gallegos et al., was studied, with a calculated Cronbach’s alpha of 0.94, which shows a good internal consistency. Stress levels were calculated as a cumulative result of the addition of five scales from the survey, defined as Danger + Fear of contamination, Socioeconomic stress, Xenophobia, Traumatic stress and Compulsive checking. The Chi-square test was done for all questions individually, looking to validate the stress level calculation. Statistical significance (p < 0.05) was found in most of the questions considering the answers of all participants, except for one question in the Traumatic stress scale and four from the Compulsive checking scale (all shown in supplementary Table 1), although all scales showed statistical significance when the test was applied to the accumulated value for each of these scales, as seen in Table 3. This validates the use of the adapted COVID-19 stress scales in a population^1,24. Recently, Mexico’s health ministry has also launched its own questionnaires in an effort to correctly assess the stress levels in healthcare professionals²⁵. Therefore, this model can be re-adapted to help in correct assessment and to provide a faster diagnosis and opportune treatment.

From the central metrics statistical analysis, no relation was observed between participant profession and work area. A similar analysis was done for the stress scales, which showed an exception for the Danger + Fear of contamination joint scale; all other areas had a similar maximum value but different means. Therefore, considering the results from the preprocessing stage, the dataset shows good quality, independence and internal consistency for algorithm analysis. All 102 instances from the dataset were used to train a decision tree model with the C5.0 algorithm, where stress level was defined as the target variable, with participant profession, work area and cumulative stress scales as predictors. The resulting model showed an accuracy of 94.1%. Nonetheless, the algorithm did not find enough information gain from the participant profession, work area, and the socioeconomic scale. The exclusion of these variables from the resulting model suggests that experience and day-to-day work routine are not a factor in how healthcare professionals perceive stress. Resilience could help explain this pattern, as it is an adaptation mechanism by which a person can, over time, handle stress in overwhelming situations^13,26.

Conclusion

Computational psychiatry states the similarity between the brain and a computer and proposes the use of computational terminology for the study of mental illness²⁷. Our results show interesting data denoting hypothetical tendencies based on the purity of the resulting branches of the decision tree, where severe stress cases can be related mostly to high levels of Xenophobia and Compulsive checking. This is because the threshold values along the extreme right route of the decision tree are above the 3rd quartile for both scales. In a similar manner, the absent stress level arises from the scenario of combined thresholds below the 1st quartile of the Xenophobia, Compulsive checking and Traumatic stress scales. It is interesting to note that the Danger + Fear of contamination scale can be used to find both mild and moderate cases, despite being a larger joint scale.

We believe this method could potentially define concepts on how to diagnose and handle severe stress, contributing to a mathematically informed understanding of mental illness and to computational psychiatry, thus forming a diagnostic tool to help in the assessment of patients. In this study, we focused on healthcare professionals, as they are one of the sectors most affected by the pandemic²⁸. In addition, an expansion of this method beyond the pandemic could be used to understand different stress factors and how they can interfere with performance and social dynamics in different populations.

While this is only a first approximation based on recent data from healthcare professionals in the northeast part of Mexico¹, applying machine learning algorithms and computational psychiatry can potentially help reshape the way we run global health.

Data Availability

DATA AVAILABILITY: Dataset may be downloaded from Kaggle (https://www.kaggle.com/chepox/css-mexico). CODE AVAILABILITY: Code may be accessed from github (https://github.com/Bio-Math/COVID-19-related-stress-analysis)

Author contributions

Research and writing, J.L.D.-G., J.F.I., G.A-R, E.Z-V and H.F.-V.; algorithm development, M.D.A.C-L., G.S.R-C and G.A-R.; statistical analysis, J.L.D.-G. and G.R.P.-R.; supervision, J.F.I. All authors have read and agreed to the published version of the manuscript.

Competing interests

Authors declare no competing interests.

Figures and supplemental legend

Supplementary Table 1. Proportional analysis per question and overall area, Chi-square Test, Sig.

REFERENCES

1. Delgado-Gallegos, J. L., Montemayor-Garza, R. J., Padilla-Rivas, G. R., Franco-Villareal, H. & Islas, J. F. Prevalence of stress in healthcare professionals during the covid-19 pandemic in Northeast Mexico: A remote, fast survey evaluation, using an adapted covid-19 stress scales. Int. J. Environ. Res. Public Health 17, (2020).
2. Shah, K., Chaudhari, G., Kamrai, D., Lail, A. & Patel, R. S. How Essential Is to Focus on Physician’s Health and Burnout in Coronavirus (COVID-19) Pandemic? Cureus 12, 10–12 (2020).
3. Petzold, M. B., Plag, J. & Ströhle, A. Dealing with psychological distress by healthcare professionals during the COVID-19 pandemia. Nervenarzt 91, 417–421 (2020).
4. Morales, G. Live Updates: COVID-19 death toll in Mexico. El Universal (2020).
5. Bello-Chavolla, O. et al. Predicting mortality due to SARS-CoV-2: A mechanistic score relating obesity and diabetes to COVID-19 outcomes in Mexico. J. Clin. Endocrinol. Metab. 105, 1–13 (2020).
6. Burki, T. COVID-19 in Latin America. Lancet Infect. Dis. 20, 547–548 (2020).
7. Shah, K. et al. Focus on Mental Health During the Coronavirus (COVID-19) Pandemic: Applying Learnings from the Past Outbreaks. Cureus 12, (2020).
8. Agren, D. Understanding Mexican health worker COVID-19 deaths. Lancet (London, England) vol. 396 807 (2020).
9. CONACYT. COVID-19 Mexico. Gobierno de México https://coronavirus.gob.mx/datos/ (2020).
10. Pan American Health Organization & World Health Organization. Epidemiological Alert: COVID-19 among health workers - 31 August 2020 - PAHO/WHO | Pan American Health Organization. https://www.paho.org/en/documents/epidemiological-alert-covid-19-among-health-workers-31-august-2020 (2020).
11. Blake, H., Bermingham, F., Johnson, G. & Tabner, A. Mitigating the psychological impact of covid-19 on healthcare workers: A digital learning package. Int. J. Environ. Res. Public Health 17, (2020).
12. Hamama, L. et al. Burnout and perceived social support: The mediating role of secondary traumatization in nurses vs. physicians. J. Adv. Nurs. 75, 2742–2752 (2019).
13. Labrague, L. J. & Santos, J. A. A. COVID-19 anxiety among front-line nurses: Predictive role of organisational support, personal resilience and social support. J. Nurs. Manag. 28, 1653–1661 (2020).
14. Secretaría de Salud. PERSONAL DE SALUD 03 DE NOVIEMBRE DE 2020. https://www.gob.mx/cms/uploads/attachment/file/590340/COVID-19_Personal_de_Salud_2020.11.03.pdf (2020).
15. Montague, P. R., Dolan, R. J., Friston, K. J. & Dayan, P. Computational psychiatry. Trends Cogn. Sci. 16, 72–80 (2012).
16. Mujica-Parodi, L. R. & Strey, H. H. Making Sense of Computational Psychiatry. Int. J. Neuropsychopharmacol. 23, 339–347 (2020).
17. Davenport, T. & Kalakota, R. The potential for artificial intelligence in healthcare. Futur. Healthc. J. 6, 94–102 (2019).
18. Kelleher, J. D., Mac Namee, B. & D’Arcy, A. Fundamentals of machine learning for predictive data analytics: algorithms, worked examples, and case studies. (The MIT Press; Illustrated edition (July 24, 2015), 2020).
19. Song, Y. & Ying, L. U. Decision tree methods: applications for classification and prediction. Shanghai Arch. Psychiatry 27, 130–135 (2015).
20. Zhu, T., Ning, Y., Li, A. & Xinguo, X. Using decision tree to predict mental health status based on web behavior. in 3rd Symposium on Web Society. IEEE 27–31 (2011).
21. Li, C. et al. Tree-structured Subgroup Analysis of Receiver Operating Characteristic Curves for Diagnostic Tests. Acad. Radiol. 19, 1529–1536 (2012).
22. Magyary, D. & Brandt, P. A decision tree and clinical paths for the assessment and management of children with ADHD. Issues Ment. Health Nurs. 23, 553–566 (2002).
23. Yao, Q., Liu, P., Lei, L. & Yin, J. R-C4.5 decision tree model and its applications to health care dataset. in Proceedings of ICSSSM ‘05. 2005 International Conference on Services Systems and Services Management 1099–1103 (2005). doi:10.1109/ICSSSM.2005.1500165.
24. Taylor, S. et al. Development and initial validation of the COVID Stress Scales. J. Anxiety Disord. 72, 102232 (2020).
25. Salud, S. de. Cuestionario para la detección de riesgos a la salud mental COVID-19. Gobierno de Mexico (2020).
26. Wang, C. et al. Immediate Psychological Responses and Associated Factors during the Initial Stage of the 2019 Coronavirus Disease (COVID-19) Epidemic among the General Population in China. Int. J. Environ. Res. Public Health 17, 1729 (2020).
27. Huys, Q. J. M., Maia, T. V. & Frank, M. J. Computational psychiatry as a bridge from neuroscience to clinical applications. Nat. Neurosci. 19, 404–413 (2016).
28. Krystal, J. H. & McNeil, R. L. Responding to the hidden pandemic for healthcare workers: stress. Nat. Med. 26, 639 (2020).
29. Wirth, R. & Hipp, J. CRISP-DM: Towards a Standard Process Model for Data Mining. Proc. Fourth Int. Conf. Pract. Appl. Knowl. Discov. Data Min. 29–39 (2000).
30. Huang, J., Li, Y. F. & Xie, M. An empirical analysis of data preprocessing for machine learning-based software cost estimation. Inf. Softw. Technol. 67, 108–127 (2015).
31. Tavakol, M. & Dennick, R. Making sense of Cronbach’s alpha. Int. J. Med. Educ. 2, 53–55 (2011).
32. Sharpe, D. Chi-Square Test is Statistically Significant: Now What? Pract. Assessment, Res. Eval. 20, 8 (2015).
33. Zhu, W., Zeng, N. & Wang, N. Sensitivity, specificity, accuracy, associated confidence interval and ROC analysis with practical SAS® implementations. Northeast SAS Users Gr. 2010 Heal. Care Life Sci. 1–9 (2010).