COVID-19 diagnosis prediction by symptoms of tested individuals: a machine learning approach
============================================================================================

* Yazeed Zoabi
* Noam Shomron

## Abstract

Effective screening of SARS-CoV-2 enables quick and efficient diagnosis of COVID-19 and can mitigate the burden on healthcare systems. Prediction models that combine several features to estimate the risk of infection have been developed in hopes of assisting medical staff worldwide in triaging patients when allocating limited healthcare resources. We established a machine learning approach trained on records from 51,831 tested individuals (of whom 4,769 were confirmed COVID-19 cases); the test set contained data from the following week (47,401 tested individuals, of whom 3,624 were confirmed COVID-19 cases). Our model predicts COVID-19 test results with high accuracy using only 8 features: gender, whether the age is above 60, information about close contact with an infected individual, and 5 initial clinical symptoms. Overall, using nationwide data representing the general population, we developed a model that enables screening suspected COVID-19 patients according to simple features obtained by asking them basic questions. Our model can be used, among other considerations, to prioritize testing for COVID-19 when allocating limited testing resources.

## Main

The novel coronavirus disease 2019 (COVID-19) pandemic caused by the newly emerged SARS-CoV-2 is a critical and urgent threat to global health. The outbreak that began in early December 2019 in the Hubei province of the People’s Republic of China has spread worldwide. As of May 2020, the number of patients confirmed to have the disease has exceeded 3,580,000 in more than 180 countries, although the number of people infected is probably much higher, and more than 250,000 people have died from COVID-19.<sup>1</sup> This pandemic continues to challenge medical systems worldwide in many aspects, including sharp increases in demand for hospital beds and critical shortages of medical equipment, while many healthcare workers have themselves been infected. Thus, the capacity for immediate clinical decisions and effective usage of healthcare resources is crucial. The most validated diagnostic test for COVID-19, using reverse transcriptase polymerase chain reaction (RT-PCR), is currently in short supply in developing countries. This contributes to increased infection rates and delays critical preventive measures.

In Israel, all diagnostic laboratory tests for COVID-19 are performed according to criteria determined by the Israeli Ministry of Health. While subject to change, these currently include the presence and severity of clinical symptoms, possible exposure to confirmed patients, geographical area, the risk of complications if infected, and other factors.<sup>2</sup>

The Israeli Ministry of Health recently publicly released data on individuals who were tested for SARS-CoV-2 via RT-PCR assay of a nasopharyngeal swab<sup>3</sup>. The dataset contains initial records, on a daily basis, for all citizens tested for COVID-19 nationwide. In addition to the test date and result, various information is available, including clinical symptoms, gender and a binary indication as to whether the tested individual is above 60 years of age.

Effective screening enables quick and efficient diagnosis of COVID-19 and can mitigate the burden on healthcare systems. Prediction models that combine several features to estimate the risk of infection have been developed in hopes of assisting medical staff worldwide in triaging patients when allocating limited healthcare resources. These models use features such as computed tomography (CT) scans<sup>4-7</sup>, information available at hospital admission including clinical symptoms<sup>8</sup>, and laboratory tests.<sup>9</sup>

We developed a model that predicts COVID-19 test results with high accuracy using only 8 features: gender, whether the age is above 60, information about close contact with an infected individual, and 5 initial clinical symptoms (Supplementary Table 1). On the prospective test set, the model achieved an auROC (area under the receiver operating characteristic curve) of 0.90 with 95% CI: 0.892-0.905 (Figure 1.a). Possible working points are 87.3% sensitivity with 72% specificity, or 85.7% sensitivity with 79% specificity.

![Figure 1.](https://www.medrxiv.org/content/medrxiv/early/2020/05/11/2020.05.07.20093948/F1.medium.gif)

[Figure 1.](http://medrxiv.org/content/early/2020/05/11/2020.05.07.20093948/F1) **a**. ROC curves of the predictive model. The blue line reflects training and testing via cross-validation. The orange line reflects testing the model on the prospective dataset. **b**. SHapley Additive exPlanations (SHAP) summary plots for COVID-19 diagnosis prediction show the SHAP values for the most important features of the model. Features in the summary plots (y-axis) are organized by their mean absolute SHAP values (x-axis), which represent the importance of that feature in driving the classifier’s prediction. Values of those features for each patient (e.g., fever) are colored by their relative value.

The training set consisted of records from 51,831 tested individuals (of whom 4,769 were confirmed COVID-19 cases, Supplementary Table 1), from the period March 22nd, 2020 through March 31st, 2020. The test set contained data from the following week, April 1st through April 7th (47,401 tested individuals, of whom 3,624 were confirmed COVID-19 cases).

Our framework provides a ranking of the most important features that were used to define the decisions (Figure 1.b). Presenting with fever and cough were key features in predicting contraction of the disease.
As expected, close contact with a confirmed COVID-19 individual was also an important feature, thus corroborating the disease’s high transmissibility<sup>10</sup>. In addition, ‘male’ gender was revealed as a predictor of a positive result by the model, concurring with the observed gender bias<sup>11-13</sup>.

[Supplementary Table 1](http://medrxiv.org/content/early/2020/05/11/2020.05.07.20093948/T1)

Supplementary Table 1. Characteristics of the dataset and the features used by the model in this study.

Predictions were generated using a gradient-boosting machine model built with decision-tree base-learners<sup>14</sup>. Gradient boosting is widely considered state of the art in predicting tabular data<sup>15</sup> and is used by many successful algorithms in the field of machine learning<sup>16</sup>. As suggested by previous studies<sup>17</sup>, missing values were inherently handled by the gradient-boosting predictor<sup>18</sup>. We used the gradient-boosting predictor trained with the LightGBM<sup>19</sup> Python package.

The following list describes each of the features used by the model:

1. Basic information:
   1. Gender (male/female).
   2. Age ≥ 60 (true/false).
2. Symptoms:
   3. Cough (true/false).
   4. Fever (true/false).
   5. Sore throat (true/false).
   6. Shortness of breath (true/false).
   7. Headache (true/false).
3. Other information:
   8. Known contact with a confirmed COVID-19 individual (true/false).

To identify the principal features driving model prediction, SHAP (SHapley Additive exPlanations) values<sup>20</sup> were calculated. These values are suited for complex models such as artificial neural networks and gradient-boosting machines<sup>21</sup>. Originating in game theory, SHAP values partition the prediction result of every sample into the contribution of each constituent feature value. This is done by estimating the difference between models with subsets of the feature space. By averaging across samples, SHAP values estimate the contribution of each feature to overall model predictions.
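
The way SHAP values partition a single prediction can be illustrated with a brute-force calculation of exact Shapley values. The sketch below is a minimal illustration only: the scoring function `toy_risk`, its weights, the three-feature subset, and the all-zero baseline are hypothetical and are not the trained model or data from this study. It also replaces absent features with a fixed baseline value, which is a simplifying assumption; the SHAP package uses faster, tree-specific estimators rather than this enumeration.

```python
from itertools import combinations
from math import factorial

# Hypothetical feature subset, chosen only to keep the enumeration small.
FEATURES = ["fever", "cough", "contact"]


def toy_risk(x):
    """Hypothetical risk score with a fever-cough interaction (not the study's model)."""
    score = 0.05
    score += 0.20 * x["fever"]
    score += 0.10 * x["cough"]
    score += 0.30 * x["contact"]
    score += 0.15 * x["fever"] * x["cough"]
    return score


def value(subset, sample, baseline):
    """Score when only the features in `subset` take the sample's values."""
    mixed = {f: (sample[f] if f in subset else baseline[f]) for f in FEATURES}
    return toy_risk(mixed)


def shapley_values(sample, baseline):
    """Exact Shapley values: weighted average of each feature's marginal
    contribution over all subsets of the remaining features."""
    n = len(FEATURES)
    phi = {}
    for f in FEATURES:
        others = [g for g in FEATURES if g != f]
        total = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                with_f = value(set(subset) | {f}, sample, baseline)
                without_f = value(set(subset), sample, baseline)
                total += weight * (with_f - without_f)
        phi[f] = total
    return phi


sample = {"fever": 1, "cough": 1, "contact": 0}
baseline = {"fever": 0, "cough": 0, "contact": 0}
phi = shapley_values(sample, baseline)
```

In this toy case the contributions add up to the difference between the sample's score and the baseline score, and the fever-cough interaction term is split evenly between the two features. Averaging the absolute contributions over many samples would give a per-feature importance of the kind shown in a SHAP summary plot.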
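
The working points reported for the prospective test set are sensitivity/specificity pairs obtained by thresholding the model's predicted score, and the auROC summarizes performance across all such thresholds. The following is a minimal sketch of both calculations in plain Python. The labels and scores are made-up illustrative values, not data from this study, and the helper names are hypothetical.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Return (sensitivity, specificity) when score >= threshold is called positive."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative
    (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative values only: 1 = positive test result, 0 = negative.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05]

sens, spec = sensitivity_specificity(labels, scores, threshold=0.5)
area = auroc(labels, scores)
```

Lowering the threshold classifies more individuals as positive, which raises sensitivity at the cost of specificity; this is the trade-off between the two working points quoted in the text.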
Overall, using nationwide data representing the general population, we developed a model that enables screening suspected COVID-19 patients according to simple features accessed by asking them eight basic questions. Our model can be used, among other considerations, to prioritize testing for COVID-19 when allocating limited testing resources.

## Data Availability

All data used in this study was retrieved from the Israeli Ministry of Health website: [https://data.gov.il/dataset/covid-19](https://data.gov.il/dataset/covid-19)

* Received May 7, 2020.
* Revision received May 7, 2020.
* Accepted May 11, 2020.
* © 2020, Posted by Cold Spring Harbor Laboratory

This pre-print is available under a Creative Commons License (Attribution-NonCommercial-NoDerivs 4.0 International), CC BY-NC-ND 4.0, as described at [http://creativecommons.org/licenses/by-nc-nd/4.0/](http://creativecommons.org/licenses/by-nc-nd/4.0/)

## References

1. Dong, E., Du, H. & Gardner, L. An interactive web-based dashboard to track COVID-19 in real time. The Lancet Infectious Diseases (2020) doi:10.1016/S1473-3099(20)30120-1.
2. The Novel Coronavirus - Israel Ministry of Health. [https://govextra.gov.il/ministry-of-health/corona/corona-virus-en/](https://govextra.gov.il/ministry-of-health/corona/corona-virus-en/).
3. COVID-19 - Government Data. [https://data.gov.il/dataset/covid-19](https://data.gov.il/dataset/covid-19).
4. Gozes, O. et al. Rapid AI Development Cycle for the Coronavirus (COVID-19) Pandemic: Initial Results for Automated Detection & Patient Monitoring using Deep Learning CT Image Analysis. *arXiv e-prints* 2003, arXiv:2003.05037 (2020).
5. Song, Y. et al. Deep learning Enables Accurate Diagnosis of Novel Coronavirus (COVID-19) with CT images. *medRxiv* 2020.02.23.20026930 (2020) doi:10.1101/2020.02.23.20026930.
6. Wang, S. et al. A deep learning algorithm using CT images to screen for Corona Virus Disease (COVID-19). *medRxiv* 2020.02.14.20023028 (2020) doi:10.1101/2020.02.14.20023028.
7. Jin, C. et al. Development and Evaluation of an AI System for COVID-19 Diagnosis. *medRxiv* 2020.03.20.20039834 (2020) doi:10.1101/2020.03.20.20039834.
8. Tostmann, A. et al. Strong associations and moderate predictive value of early symptoms for SARS-CoV-2 test positivity among healthcare workers, the Netherlands, March 2020. Eurosurveillance 25, 2000508 (2020).
9. Feng, C. et al. A Novel Triage Tool of Artificial Intelligence Assisted Diagnosis Aid System for Suspected COVID-19 pneumonia In Fever Clinics. *medRxiv* 2020.03.19.20039099 (2020) doi:10.1101/2020.03.19.20039099.
10. Liu, Y., Gayle, A. A., Wilder-Smith, A. & Rocklöv, J. The reproductive number of COVID-19 is higher compared to SARS coronavirus. J Travel Med 27 (2020).
11. Jin, J.-M. et al. Gender Differences in Patients With COVID-19: Focus on Severity, and Mortality. Front. Public Health 8 (2020).
12. Blogs, B. G. Sex, gender and COVID-19: Disaggregated data and health disparities. BMJ Global Health blog [https://blogs.bmj.com/bmjgh/2020/03/24/sex-gender-and-covid-19-disaggregated-data-and-health-disparities/](https://blogs.bmj.com/bmjgh/2020/03/24/sex-gender-and-covid-19-disaggregated-data-and-health-disparities/) (2020).
13. Team, T. N. C. P. E. R. E. The Epidemiological Characteristics of an Outbreak of 2019 Novel Coronavirus Diseases (COVID-19) — China, 2020. CCDCW 2, 113–122 (2020).
14. Hastie, T., Tibshirani, R. & Friedman, J. Boosting and Additive Trees. in The Elements of Statistical Learning: Data Mining, Inference, and Prediction (eds. Hastie, T., Tibshirani, R. & Friedman, J.) 337–387 (Springer, 2009). doi:10.1007/978-0-387-84858-7_10.
15. Fernández-Delgado, M., Cernadas, E., Barro, S. & Amorim, D. Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? Journal of Machine Learning Research 15, 3133-3181 (2014).
16. Omar, K. B. A. XGBoost and LGBM for Porto Seguro’s Kaggle challenge: A comparison. Semester Project (2018).
17. Josse, J., Prost, N., Scornet, E. & Varoquaux, G. On the consistency of supervised learning with missing values. *arXiv:1902.06931 [cs, math, stat]* (2019).
18. Chen, T. & Guestrin, C. XGBoost: A Scalable Tree Boosting System. in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794 (Association for Computing Machinery, 2016). doi:10.1145/2939672.2939785.
19. Ke, G. et al. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. in Advances in Neural Information Processing Systems 30 (eds. Guyon, I. et al.) 3146-3154 (Curran Associates, Inc., 2017).
20. Lundberg, S. & Lee, S.-I. A Unified Approach to Interpreting Model Predictions. *arXiv:1705.07874 [cs, stat]* (2017).
21. Lundberg, S. M. et al. Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nature Biomedical Engineering 2, 749–760 (2018).