ABSTRACT
The infection by SARS-CoV-2, which causes the disease COVID-19, has spread all over the world since the beginning of 2020. On January 30, 2020, the World Health Organization (WHO) declared a global health emergency. Researchers from different disciplines are working with public health officials to understand SARS-CoV-2 pathogenesis and, jointly with policymakers, to urgently develop strategies to control the spread of this new disease. Recent studies have reported characteristic imaging patterns on computed tomography (CT) in patients infected by SARS-CoV-2. In this paper, we build a publicly available SARS-CoV-2 CT scan dataset containing 1252 CT scans that are positive for SARS-CoV-2 infection (COVID-19) and 1230 CT scans from patients not infected by SARS-CoV-2, 2482 CT scans in total. These data were collected from real patients in hospitals in Sao Paulo, Brazil. The aim of this dataset is to encourage the research and development of artificial intelligence methods that can identify whether a person is infected by SARS-CoV-2 through the analysis of his/her CT scans. As a baseline result for this dataset, we used an eXplainable Deep Learning approach (xDNN), with which we achieved an F1 score of 97.31%, a very promising result. The dataset is available at www.kaggle.com/plameneduardo/sarscov2-ctscan-dataset and the xDNN code is available at https://github.com/Plamen-Eduardo/xDNN-SARS-CoV-2-CT-Scan.
1 Introduction
In December 2019, an outbreak of coronavirus (SARS-CoV-2) infection began in Wuhan, the capital of central China’s Hubei province [1-3]. On January 30, 2020, the World Health Organization (WHO) declared a global health emergency [4]. By 22 April 2020, a cumulative 2,564,515 confirmed cases and 177,466 deaths had been documented [5].
Researchers from different disciplines are working with public health officials to understand COVID-19 pathogenesis and, jointly with policymakers, to urgently develop strategies to control the spread of this new disease [6]. Recent studies have reported imaging patterns on computed tomography (CT) in patients diagnosed with COVID-19: a prospective analysis revealed bilateral lung opacities on 40 of 41 (98%) chest CTs in infected patients in Wuhan and described lobular and subsegmental areas of consolidation as the most typical findings [6]. Other investigators found high rates of ground-glass opacities and consolidation, sometimes with a rounded morphology and peripheral lung distribution [7,8]. Thoracic radiology is often key to the evaluation of patients suspected of COVID-19 infection [9]. Prompt detection and diagnosis of the disease is invaluable in ensuring timely treatment. From a public health perspective, rapid patient isolation is crucial for the containment of this communicable disease [4] and for the optimal use of available resources, which quickly become scarce and overwhelmed by the exponentially growing number of patients and prolonged periods of treatment.
In this paper, we make publicly available a large dataset of CT scans for SARS-CoV-2 identification. The dataset is composed of 1252 CT scans of patients infected by the SARS-CoV-2 virus and 1230 CT scans of patients who are not infected by SARS-CoV-2 but have other pulmonary diseases. To test the dataset, we use an eXplainable deep learning method (xDNN) [10]. The eXplainable approach is non-iterative and is based entirely on recursive calculations and the use of prototypes. Therefore, it is computationally very efficient. In this paper, we demonstrate that the proposed dataset can be of great importance for the identification of SARS-CoV-2 via CT scans. Moreover, we demonstrate that the baseline approach, xDNN, can achieve high performance on this challenging task.
2 eXplainable Deep Learning (xDNN)
2.1 Concept and Basic Algorithm
Prototype-based learning is the core of the xDNN method (Fig. (1)). The prototypes are actual training data samples (in this case, images) which are highly representative (local peaks of the density and of the empirically derived probability distributions [11]). They are focal points of locally valid generative models described by a multi-modal Cauchy distribution [11].
The algorithm of the proposed approach is described below. The first observed image (data sample) is converted to a vector of features using transfer learning. In this paper, we use a vector of size 4096 formed from the last fully connected layer of VGG-16 [12].
Let {(x_i, c_i)}, i = 1, …, N, be the training data set, with x_i ∈ ℝ^n denoting the feature vector and c_i ∈ {1, 2} denoting the class (SARS-CoV-2 or Non-SARS-CoV-2) for each i ∈ {1, …, N}. N is the number of training data samples (images) used.
The proposed algorithm works per class; therefore, all the calculations are done for each class separately.
The meta-parameters are initialized with the first observed data sample (Eq. (1)), where μ denotes the mean; V_1 denotes the first cluster; p_1 is the first prototype of the first cluster, V_1; S_1 is the corresponding support (number of members); P is the total number of identified prototypes; and r_1 is the corresponding radius of the area of influence of V_1. In this paper, we use the same value, r*, as in [11]; the rationale is that two vectors for which the angle between them is less than π/6 (30°) point in close/similar directions. That is, we consider two feature vectors to be similar if the angle between them is smaller than 30 degrees. Note that r* is data derived, not a problem- or user-specific parameter. In fact, it can be defined without prior knowledge of the specific problem or data.
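The 30-degree rationale can be made concrete under one assumption that this section does not state explicitly: that the feature vectors are normalized to unit length. In that case the Euclidean distance between two vectors separated by an angle θ is sqrt(2 − 2 cos θ), so a 30° threshold corresponds to a distance of about 0.52. The function name below is illustrative, and the resulting value is only a plausible reading of r*, not a value taken from this text.

```python
import math

def chord_distance(angle_rad: float) -> float:
    """Euclidean distance between two unit vectors separated by angle_rad."""
    # For unit vectors: ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b = 2 - 2 cos(angle)
    return math.sqrt(2.0 - 2.0 * math.cos(angle_rad))

# Assumed reading of r*: unit vectors closer than 30 degrees count as similar.
r_star = chord_distance(math.pi / 6)
print(round(r_star, 4))  # 0.5176
```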
The next step is to calculate the data density at the current data point, D(x_i).
Starting from the mutual distances (Euclidean, cosine, or Minkowski type) between the data points (samples) in the feature space, it can be demonstrated theoretically [11] that the data density takes the form of a Cauchy-type function as in Eq. (2).
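Eq. (2) is not reproduced in this text. As a rough illustration, the sketch below assumes a commonly used Cauchy-type form for the Euclidean case, D(x) = 1 / (1 + ||x − μ||² / σ²), with μ the mean of the samples and σ² = mean(||x||²) − ||μ||². Both the form and the function name are assumptions made for illustration. The paper describes the calculations as recursive; the batch computation here gives the same result for a fixed set of samples and is used only for brevity.

```python
from typing import List, Sequence

def cauchy_density(x: Sequence[float], samples: List[Sequence[float]]) -> float:
    """Cauchy-type density of x relative to `samples` (assumed form, see lead-in)."""
    n = len(samples)
    dim = len(x)
    # Global mean of the samples
    mu = [sum(s[k] for s in samples) / n for k in range(dim)]
    # Mean squared norm of the samples
    mean_sq_norm = sum(sum(v * v for v in s) for s in samples) / n
    # Variance-like scalar: mean(||x||^2) - ||mu||^2
    sigma_sq = mean_sq_norm - sum(m * m for m in mu)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, mu))
    if sigma_sq <= 0.0:
        # Degenerate case (all samples identical): 1 at the mean, 0 elsewhere
        return 1.0 if sq_dist == 0.0 else 0.0
    return 1.0 / (1.0 + sq_dist / sigma_sq)

# Toy 2-D samples (made up for illustration): mean is [1, 1], sigma^2 is 2
toy = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
print(cauchy_density([1.0, 1.0], toy))  # 1.0 (density peaks at the mean)
print(cauchy_density([2.0, 2.0], toy))  # 0.5
```

The density equals 1 at the global mean and decays towards 0 as the sample moves away from it.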
Then the algorithm absorbs the new data samples/images one by one, assigning each of them to the nearest (in the feature space) prototype, p_j (Eq. (3)).
Note that different distance metrics can be used for this type of assignment. Because of this form of assignment, the shape of the data partitioning is of the so-called Voronoi tessellation type [13]. We call the set of data points associated with a prototype a data cloud, because its shape is not regular (i.e., not hyper-spherical, hyper-ellipsoidal, etc.) and the prototype is not necessarily the statistical or geometric mean [11].
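The nearest-prototype assignment can be sketched as follows, assuming the Euclidean distance (the text notes that other metrics are possible). The function name and the toy vectors are illustrative only.

```python
import math
from typing import List, Sequence, Tuple

def nearest_prototype(x: Sequence[float],
                      prototypes: List[Sequence[float]]) -> Tuple[int, float]:
    """Return (index, Euclidean distance) of the prototype closest to x."""
    best_idx, best_dist = -1, math.inf
    for j, p in enumerate(prototypes):
        d = math.dist(x, p)  # Euclidean distance (Python 3.8+)
        if d < best_dist:
            best_idx, best_dist = j, d
    return best_idx, best_dist

# Each point joins the data cloud of its closest prototype, which yields a
# Voronoi-type partition of the feature space.
prototypes = [[0.0, 0.0], [10.0, 0.0]]
points = [[1.0, 1.0], [9.0, 0.0], [4.0, 3.0]]
assignment = [nearest_prototype(pt, prototypes)[0] for pt in points]
print(assignment)  # [0, 1, 0]
```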
Then, using the density and the distance to the nearest prototype, we check the conditions of Eq. (4) [11], which determine whether the current data sample/image is added to the set of prototypes as a new prototype.
When a new data cloud is added, the meta-parameters are updated according to Eq. (5).
Otherwise, the meta-parameters of the nearest data cloud are updated according to Eq. (6) [11].
One of the strongest aspects of the proposed approach is its high level of interpretability which comes from its prototype-based nature. Linguistic IF…THEN expressions that represent human reasoning can be formed around the local generative models:
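The rule expression itself is not reproduced in this text. One plausible shape, assumed here purely for illustration, joins the prototypes of a class with OR and uses "~" to denote similarity between a query image I and a prototype image. The function name and the file names are hypothetical.

```python
from typing import Sequence

def format_rule(prototype_ids: Sequence[str], class_label: str) -> str:
    """Build a linguistic IF...THEN rule from the prototypes of one class."""
    if not prototype_ids:
        raise ValueError("a rule needs at least one prototype")
    # Assumed rule shape: similarity to ANY prototype of the class suffices
    antecedent = " OR ".join(f"(I ~ {pid})" for pid in prototype_ids)
    return f"IF {antecedent} THEN '{class_label}'"

# Hypothetical prototype image names, not taken from the dataset
print(format_rule(["ct_017.png", "ct_203.png"], "SARS-CoV-2"))
# IF (I ~ ct_017.png) OR (I ~ ct_203.png) THEN 'SARS-CoV-2'
```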
The learning procedure of the proposed approach is summarized by the following algorithm.
Learning Procedure
Read the first feature vector sample x_i (i = 1) representing the image I_i of the class c;
Set the meta-parameters according to Eq. (1);
FOR i = 2, …
    Read x_i;
    Calculate D(x_i) and D(p_j) (j = 1, 2, …, P) according to Eq. (2);
    IF Eq. (4) holds
        Create rule according to Eq. (5);
    ELSE
        Search for p_j according to Eq. (3);
        Update rule according to Eq. (6);
    END
END
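The procedure above can be sketched for a single class as follows. Since Eqs. (1) to (6) are not reproduced in this text, several details are assumptions rather than the authors' exact formulation: the Cauchy-type density around a recursively updated global mean, the new-prototype test (density at least as high as, or at most as low as, that of every existing prototype, or distance to the nearest prototype above its radius), the default radius, and the support-weighted prototype update. The radius update is omitted for brevity, and all names are illustrative.

```python
import math
from typing import Dict, List, Sequence

# Assumed default radius: chord length between unit vectors 30 degrees apart
R_STAR = math.sqrt(2.0 - 2.0 * math.cos(math.pi / 6))

def learn_class(features: Sequence[Sequence[float]],
                r_star: float = R_STAR) -> List[Dict]:
    """One-pass prototype learning for the samples of a single class."""
    first = [float(v) for v in features[0]]
    mu = first[:]                        # running global mean
    mean_sq = sum(v * v for v in first)  # running mean of squared norms
    clouds = [{"prototype": first[:], "support": 1, "radius": r_star}]

    def density(v: Sequence[float]) -> float:
        # Assumed Cauchy-type density around the current global mean
        sigma_sq = mean_sq - sum(m * m for m in mu)
        sq_dist = sum((a - b) ** 2 for a, b in zip(v, mu))
        if sigma_sq <= 1e-12:
            return 1.0 if sq_dist <= 1e-12 else 0.0
        return 1.0 / (1.0 + sq_dist / sigma_sq)

    for i, raw in enumerate(features[1:], start=2):
        x = [float(v) for v in raw]
        # Recursive update of the global statistics with the i-th sample
        mu = [((i - 1) * m + v) / i for m, v in zip(mu, x)]
        mean_sq = ((i - 1) * mean_sq + sum(v * v for v in x)) / i

        d_x = density(x)
        d_protos = [density(c["prototype"]) for c in clouds]
        # Nearest prototype in the feature space (Euclidean distance)
        j = min(range(len(clouds)),
                key=lambda k: math.dist(x, clouds[k]["prototype"]))
        dist = math.dist(x, clouds[j]["prototype"])

        extreme = d_x >= max(d_protos) or d_x <= min(d_protos)
        if extreme or dist > clouds[j]["radius"]:
            # Create a new data cloud with x as its prototype
            clouds.append({"prototype": x, "support": 1, "radius": r_star})
        else:
            # Absorb x into the nearest data cloud (support-weighted mean)
            s = clouds[j]["support"]
            clouds[j]["prototype"] = [(s * p + v) / (s + 1)
                                      for p, v in zip(clouds[j]["prototype"], x)]
            clouds[j]["support"] = s + 1
    return clouds

# Toy 2-D "features" (made up); a larger radius lets the third sample be absorbed
clouds = learn_class([[0.0, 0.0], [4.0, 0.0], [4.0, 1.0]], r_star=2.0)
print([(c["prototype"], c["support"]) for c in clouds])
# [([0.0, 0.0], 1), ([4.0, 0.5], 2)]
```

Running the function once per class, as the text describes, gives one set of prototypes per class. Each sample is visited exactly once, which is what makes the procedure non-iterative.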
3 Dataset Description
The proposed dataset is composed of 2482 CT scan images: 1252 from patients infected by SARS-CoV-2 and 1230 from patients who were not infected by SARS-CoV-2 but presented other pulmonary diseases. The data were collected from hospitals in Sao Paulo, Brazil. The number of patients is detailed in Fig. (3). The detailed characteristics of each patient have been omitted by the hospitals due to ethical concerns. Fig. (4) shows examples from the dataset of CT scans of patients infected and not infected by SARS-CoV-2.
4 Results
In this section, we report the results obtained by the proposed eXplainable Deep Learning classification approach, xDNN, when applied to the proposed SARS-CoV-2 CT scan dataset. We divided the dataset into 80% for training and 20% for validation. The results presented in Table 1 compare the xDNN algorithm with other state-of-the-art approaches, including traditional (black-box) deep neural networks, support vector machines, etc. In summary, the advantages of the proposed method include:
– high precision as compared with the top state-of-the-art algorithms.
– high level of explainability.
– no user- or problem-specific algorithmic meta-parameters.
– non-iterative algorithm able to learn continuously.
Using the proposed method, we generated (extracted from the data) linguistic IF…THEN rules which involve actual images of both cases (COVID-19 and NO COVID-19), as illustrated in Figs. (5) and (6). Such transparent rules can be used in the decision-making process for early diagnosis of COVID-19 infection. Rapid detection of viral infection with high sensitivity may allow better control of the viral spread. Early diagnosis of COVID-19 is crucial for the treatment and control of the disease.
Computed tomography is a quick, non-invasive imaging modality with high accuracy. According to [14, 15], almost all patients with COVID-19 had characteristic CT features during the disease, such as different degrees of ground-glass opacities with or without crazy-paving sign, multifocal organizing pneumonia, and architectural distortion in a peripheral distribution. The proposed approach has demonstrated high efficiency in the identification and classification of such characteristics, and it provides highly accurate and interpretable results.
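For readers reproducing the evaluation protocol, the 80/20 split and the F1 score can be sketched as below. The shuffling, the seed, and the function names are assumptions made for illustration; the text does not specify how the split was drawn. The labels in the F1 example are made up and do not represent results from the dataset.

```python
import random
from typing import List, Sequence, Tuple

def split_80_20(items: Sequence, seed: int = 0) -> Tuple[List, List]:
    """Shuffle the items and split them into 80% training and 20% validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Index-level split of 2482 scans (indices only, no image data)
train_idx, val_idx = split_80_20(range(2482))
print(len(train_idx), len(val_idx))  # 1985 497

# Made-up labels: 2 true positives, 1 false positive, 1 false negative
f1 = f1_score([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```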
5 Conclusion
In this paper, we make public a new large dataset for SARS-CoV-2 identification via CT scans. The presented dataset is composed of 2482 CT scans, of which 1252 correspond to 60 patients identified with SARS-CoV-2 and 1230 correspond to 60 patients not identified with SARS-CoV-2. These data have been collected from different hospitals in Sao Paulo, Brazil. Moreover, the proposed dataset has been tested with different methods, among which the xDNN classifier, an eXplainable deep learning approach for COVID-19 detection via CT scan, presented the best results. The proposed approach demonstrates better performance than other state-of-the-art approaches, with an F1 score of 97.31% in the best case. Moreover, it also provides explanations in the form of IF…THEN rules using actual images of CT scans with and without COVID-19. This is of great importance for medical specialists to understand and diagnose COVID-19 at early stages via computed tomography. The proposed dataset is available at www.kaggle.com/plameneduardo/sarscov2-ctscan-dataset and the xDNN code is available at https://github.com/Plamen-Eduardo/xDNN-SARS-CoV-2-CT-Scan.
Data Availability
The dataset is publicly available at https://www.kaggle.com/plameneduardo/sarscov2-ctscan-dataset and the xDNN code is available at https://github.com/Plamen-Eduardo/xDNN-SARS-CoV-2-CT-Scan.
Author contributions statement
P. A. conceived and detailed the idea. E. S. designed and implemented the algorithms, designed and performed the experiments.
P. A. and E. S. wrote the manuscript and interpreted the results. E. S., S. B., M. H. F., and D. K. A. collected the data.
Ethical statement
All the data and experiments available in this research paper have been approved by the Ethical Committee of the Public Hospital of the Government Employees of Sao Paulo (HSPM), Sao Paulo/Brazil.