The external validity of machine learning-based prediction scores from hematological parameters of COVID-19: A study using hospital records from Brazil, Italy, and Western Europe

Ali Akbar Safdari; Chanda Sai Keshav; Deepanshu Mody; Kshitij Verma; Utsav Kaushal; Vaadeendra Kumar Burra; Sibnath Ray; Debashree Bandyopadhyay

doi:10.1101/2023.03.07.23286949

Abstract

Background The COVID-19 pandemic, caused by the SARS-CoV-2 virus, is the deadliest threat to humankind in recent times. The gold standard for its detection, quantitative Real-Time Polymerase Chain Reaction (qRT-PCR), has several limitations regarding experimental handling, expense, and time. While the hematochemical values of routine blood tests have been reported as a faster and cheaper alternative, the external validity of models built on them has yet to be thoroughly investigated on diverse populations. Here we studied the external validity of machine learning (ML)-based prediction scores from hematological parameters recorded in Brazil, Italy, and Western Europe.

Methods and Findings The publicly available hematological records (raw sample size (n) = 195554) from hospitals in three different territories, Brazil, Italy, and Western Europe, were preprocessed to develop the training, testing, and prediction cohorts for ML models. Seven different ML classifiers were trained on a total of eight (sub)datasets. The XGBoost classifier performed consistently better on all the datasets, producing eight different models. The working models include a set of either four or fourteen hematological parameters. The internal performances of the XGBoost models (AUC scores ranging from 84% to 97%) were superior to those of the ML models reported in the literature for a few datasets (AUC scores ranging from 84% to 87%). The external performance (AUC score) was 86% when the model was trained and tested on fourteen hematological parameters obtained from the same country (Brazil) but from independent datasets. However, the external performance dropped when the model was tested across populations: to 69% when trained on datasets from Italy (n=1736) and tested on datasets from Brazil (n=602), and to 65% when trained on datasets from Italy and tested on datasets from Western Europe (n=1587).
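The AUC scores quoted above can be read as the probability that a randomly chosen positive record receives a higher model score than a randomly chosen negative one. The following is a minimal pure-Python sketch of that definition, not the authors' evaluation code; the function name `auc_score` and the example labels and scores are hypothetical.

```python
def auc_score(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    labels: sequence of 0/1 values (1 = COVID-19 positive, an assumed encoding).
    scores: model outputs, where a higher value means "more likely positive".
    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative record")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up toy data: three of the four positive/negative pairs are ordered correctly.
example_auc = auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise loop is quadratic in the number of records, which is acceptable for illustration; for cohorts of the sizes reported here, a rank-based implementation from an established library would be the usual choice.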

Conclusion For the first time, this report shows that models trained and tested on the same population but on separate records produce reasonably accurate results. The study supports confidence in these models when they are trained and tested within the same population, and they have the potential to be extended to other demographic locations. Both four- and fourteen-parameter models are publicly available at https://covipred.bits-hyderabad.ac.in/home

Author Summary COVID-19 has posed the deadliest threat to the human population in the 21st century. Timely detection of the disease could save more lives. The RT-PCR test is considered the gold standard for COVID-19 detection. However, the technique has several limitations, which motivates developing an alternate detection protocol that is efficient, fast, and cheap. Hematology-based machine learning (ML) prediction is one of several such alternate detection techniques. All the hematology-based predictions reported so far in the literature were only internally validated. Considering the need for a rapid, near-accurate, and cheaper COVID-19 detection technique, we aim to externally validate the hematology-based ML prediction. Here, external validation means using two independent datasets for model training and testing, in contrast to internal validation, where a single dataset is split into train and test sets. We have integrated published clinical records from hospitals in Brazil, Italy, and Western Europe. Internal ML model performances are superior to those reported in the literature. The external model performances were equivalent to the internal performances when the models were trained and tested on the same population. However, the external performances were inferior when the train and test sets came from different populations. The results promise the utility of these models within the same population. However, they also caution against training the model on one population and testing it on another. The outcome of this work has the potential to serve as an initial screen for COVID-19 based on hematological parameters before qRT-PCR tests.
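The distinction between internal and external validation drawn above can be sketched in a few lines of Python. This is an illustrative sketch only: the helper names `internal_split` and `external_split`, the 20% test fraction, the seed, and the placeholder record identifiers are all assumptions, and only the cohort sizes (1736 and 602) are taken from the abstract.

```python
import random


def internal_split(records, test_fraction=0.2, seed=0):
    """Internal validation: a single dataset is shuffled and split in two."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def external_split(train_records, test_records):
    """External validation: train and test come from two independent datasets."""
    return list(train_records), list(test_records)


# Placeholder records tagged with their source population (not real data).
italy = [("italy", i) for i in range(1736)]
brazil = [("brazil", i) for i in range(602)]

# Internal: train and test are both drawn from the Italian records.
train_int, test_int = internal_split(italy)

# External: train on the Italian records, test on the independent Brazilian records.
train_ext, test_ext = external_split(italy, brazil)
```

In the internal case the test records share a source with the training records, so the score reflects performance on the same cohort; in the external case no test record comes from the training dataset, which is the setting whose performance the study reports separately.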

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

DB gratefully acknowledges DST-MATRICS (COVID-19 special call), Govt. of India, Grant/Award number: MSC/2020/000498, for funding this project.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Not Applicable

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

Not Applicable

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Not Applicable

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Not Applicable

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Not Applicable

Data Availability

Data sources are mentioned in the manuscript text

https://www.kaggle.com/einsteindata4u/covid19

The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.