MosMedData: Chest CT Scans with COVID-19 Related Findings Dataset
=================================================================

* S.P. Morozov
* A.E. Andreychenko
* N.A. Pavlov
* A.V. Vladzymyrskyy
* N.V. Ledikhova
* V.A. Gombolevskiy
* I.A. Blokhin
* P.B. Gelezhe
* A.V. Gonchar
* V.Yu. Chernina

## Abstract

This dataset contains anonymised human lung computed tomography (CT) scans with COVID-19 related findings, as well as without such findings. A small subset of studies has been annotated with binary pixel masks depicting regions of interest (ground-glass opacifications and consolidations). CT scans were obtained between 1st of March, 2020 and 25th of April, 2020, and provided by municipal hospitals in Moscow, Russia. Permanent link: [https://mosmed.ai/datasets/covid19_1110](https://mosmed.ai/datasets/covid19_1110). This dataset is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0) License.

Key words
*   artificial intelligence
*   COVID-19
*   machine learning
*   dataset
*   CT
*   chest
*   imaging

## Background

During the COVID-19 pandemic, most countries faced a tremendous increase in the healthcare burden. This situation calls for more careful use of financial and human resources than ever before. Unfortunately, the preventive measures put in place in medical facilities are not always enough to avoid deaths of medical workers. The loss of trained specialists in emergency care, radiology, etc., is especially concerning. Computed tomography is considered a key tool to diagnose and evaluate the progression of COVID-19. In the outpatient setting, CT studies target patients with acute respiratory symptoms, as well as those with an established diagnosis and mild disease progression who are able to recover at home (under supervision via telemedicine). Inpatient facilities use CT for initial and differential diagnostic assessment, evaluation of disease progression, and determining whether the patient should be admitted to the intensive care unit or discharged [1, 3, 4].

The ever-increasing use of CT translates to an immense burden on the health system. For example, in Moscow, the chain of municipal outpatient CT centers performs about 90 studies per CT scanner per day (the record-holding scanner performed 163 studies in one day).

To standardize and streamline clinical decision-making, experts developed a classification model that grades the severity of lung tissue abnormalities observed on CT images along with other symptoms (see Table).

[Table.](http://medrxiv.org/content/early/2020/05/22/2020.05.20.20100362/T1) Classification of the severity of lung tissue abnormalities with COVID-19 and routing rules

Increased burnout and a high risk of occupational death among healthcare workers call for automation of the reading process, which will improve productivity and minimize errors [8]. Preliminary figures indicate that AI algorithms have sufficient accuracy for the diagnostic evaluation of COVID-19 (sensitivity – 90%, specificity – 96%, AUC – 0.96, overall accuracy – 76.37-98.26%) [6, 9].
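Diagnostic accuracy figures of this kind are derived from a confusion matrix that compares algorithm output with a reference standard. The following is a minimal sketch of that arithmetic; the class name `ConfusionCounts` and the counts in `example` are hypothetical and chosen only for illustration, they are not taken from the cited studies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from comparing algorithm output with a reference standard."""
    tp: int  # cases with findings that the algorithm flagged
    fp: int  # cases without findings that the algorithm flagged
    tn: int  # cases without findings that the algorithm did not flag
    fn: int  # cases with findings that the algorithm missed


def sensitivity(c: ConfusionCounts) -> float:
    """Share of cases with findings that the algorithm detects."""
    return c.tp / (c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    """Share of cases without findings that the algorithm leaves unflagged."""
    return c.tn / (c.tn + c.fp)


def accuracy(c: ConfusionCounts) -> float:
    """Share of all cases that the algorithm classifies correctly."""
    return (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)


# Hypothetical counts, used only to illustrate the arithmetic.
example = ConfusionCounts(tp=90, fp=4, tn=96, fn=10)
print(f"sensitivity = {sensitivity(example):.2f}")  # 90 / 100
print(f"specificity = {specificity(example):.2f}")  # 96 / 100
print(f"accuracy    = {accuracy(example):.2f}")     # 186 / 200
```

Overall accuracy depends on the proportion of positive and negative cases in the test set, which is one reason it can vary widely between studies.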

## Dataset

Data were obtained between 1st of March, 2020 and 25th of April, 2020, and provided by municipal hospitals in Moscow, Russia. This dataset contains anonymised human lung computed tomography (CT) scans with COVID-19 related findings (CT-1 to CT-4), as well as without such findings (CT-0) (see Fig.).

![Fig.](https://www.medrxiv.org/content/medrxiv/early/2020/05/22/2020.05.20.20100362/F1.medium.gif)

[Fig.](http://medrxiv.org/content/early/2020/05/22/2020.05.20.20100362/F1) *Examples of images: a – CT-0, b – CT-1 (overlay: binary mask), c – CT-2, d – CT-3, e – CT-4*

There are 1110 studies in the dataset. Population: 1110 persons; males – 42%, females – 56%, other/unknown – 2%; age from 18 to 97 years, median – 47 years. As a first step, all studies (n=1110) were distributed into 5 categories according to the classification (see Table). Number of cases by category: CT-0 – 254 (22.8%), CT-1 – 684 (61.6%), CT-2 – 125 (11.3%), CT-3 – 45 (4.1%), CT-4 – 2 (0.2%). Secondly, every study was saved in NIfTI format and compressed into a Gzip archive. During this process, only every 10th image (Instance) was preserved in the final study file.
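Because each study is a Gzip-compressed NIfTI file, its dimensions can be inspected without decompressing the whole volume. The sketch below uses only the Python standard library and assumes the files follow the NIfTI-1 layout (a 348-byte header with the `dim` array at byte offset 40). The helper names, the file name `study_demo.nii.gz`, and the 512 × 512 × 40 shape are hypothetical and used for illustration only. In practice, a dedicated NIfTI library would normally be used to read the voxel data.

```python
import gzip
import os
import struct
import tempfile
from typing import Tuple

NIFTI1_HEADER_SIZE = 348  # a NIfTI-1 header is always 348 bytes long


def parse_nifti1_dims(header: bytes) -> Tuple[int, ...]:
    """Return the image dimensions stored in a NIfTI-1 header."""
    if len(header) < NIFTI1_HEADER_SIZE:
        raise ValueError("header is shorter than 348 bytes")
    # The first field (sizeof_hdr) must equal 348; try both byte orders.
    for byte_order in ("<", ">"):
        (sizeof_hdr,) = struct.unpack_from(byte_order + "i", header, 0)
        if sizeof_hdr == NIFTI1_HEADER_SIZE:
            break
    else:
        raise ValueError("not a NIfTI-1 header")
    # dim[0] holds the number of dimensions, dim[1:] the size along each one.
    dim = struct.unpack_from(byte_order + "8h", header, 40)
    ndim = dim[0]
    if not 1 <= ndim <= 7:
        raise ValueError(f"invalid number of dimensions: {ndim}")
    return dim[1 : 1 + ndim]


def read_study_dims(path: str) -> Tuple[int, ...]:
    """Read the dimensions of a Gzip-compressed NIfTI file (.nii.gz)."""
    with gzip.open(path, "rb") as f:
        return parse_nifti1_dims(f.read(NIFTI1_HEADER_SIZE))


def make_demo_header(shape: Tuple[int, ...]) -> bytes:
    """Build a minimal little-endian header for a synthetic volume (demo only)."""
    header = bytearray(NIFTI1_HEADER_SIZE)
    struct.pack_into("<i", header, 0, NIFTI1_HEADER_SIZE)
    dims = [len(shape), *shape] + [1] * (7 - len(shape))
    struct.pack_into("<8h", header, 40, *dims)
    header[344:348] = b"n+1\x00"  # magic string of a single-file NIfTI-1 image
    return bytes(header)


def demo_round_trip(shape: Tuple[int, ...]) -> Tuple[int, ...]:
    """Write a synthetic .nii.gz header to a temporary file and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "study_demo.nii.gz")
        with gzip.open(path, "wb") as f:
            f.write(make_demo_header(shape))
        return read_study_dims(path)


# Hypothetical shape: 512 x 512 pixels, 40 preserved slices.
print(demo_round_trip((512, 512, 40)))
```

Since only every 10th image was preserved, the last dimension of a real study file reflects the subsampled slice count rather than the number of slices in the original acquisition.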

A small subset of studies (n=50) has been annotated by experts of the Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Health Care Department. During the annotation, for every given image, ground-glass opacifications and regions of consolidation were selected as positive (white) pixels on the corresponding binary pixel mask. The resulting masks were saved in NIfTI format and then compressed into Gzip archives. The MedSeg® web-based annotation software was used for creating the binary masks (© 2020 Artificial Intelligence AS).
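Once a binary mask has been loaded, simple per-slice summaries follow directly from the pixel values. The sketch below represents each slice as a nested list and assumes that any non-zero value marks a positive pixel; the actual stored value for "white" (for example 1 or 255) is not stated here, so that convention is an assumption. The function names and `demo_volume` are hypothetical.

```python
from typing import List, Sequence

# One 2D slice of a mask: 0 = background, non-zero = annotated finding (assumed).
MaskSlice = Sequence[Sequence[int]]


def positive_fraction(mask: MaskSlice) -> float:
    """Return the share of pixels in one slice that are marked as positive."""
    total = sum(len(row) for row in mask)
    if total == 0:
        raise ValueError("mask is empty")
    positive = sum(1 for row in mask for value in row if value != 0)
    return positive / total


def slices_with_findings(volume: Sequence[MaskSlice]) -> List[int]:
    """Return the indices of slices that contain at least one positive pixel."""
    return [
        index
        for index, mask in enumerate(volume)
        if any(value != 0 for row in mask for value in row)
    ]


# Tiny synthetic volume (2 slices of 2 x 4 pixels), for illustration only.
demo_volume = [
    [[0, 0, 0, 0],
     [0, 0, 0, 0]],
    [[0, 1, 1, 0],
     [0, 1, 0, 0]],
]

print(slices_with_findings(demo_volume))      # slice 1 is the only annotated one
print(positive_fraction(demo_volume[1]))      # 3 positive pixels out of 8
```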

## Value

The dataset is intended for education, calibration, and independent assessment of the AI (computer vision) algorithms [7].

To help combat COVID-19, AI (computer vision) algorithms will allow:

1.  Triaging patients in outpatient facilities to secure rapid and consistent routing (incl. based on the CT-0 to CT-4 criteria)

2.  Prioritizing studies that contain signs of COVID-19 in the worklist

3.  Performing a rapid and high-quality assessment of abnormal changes by comparing several studies

4.  Minimizing the risks of errors and missed abnormalities.
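Worklist prioritization, as in item 2 of the list above, can be sketched as a simple reordering step. This is an illustrative assumption rather than a description of any deployed system: the `Study` fields, the `covid_score` output of a hypothetical algorithm, and the 0.5 threshold are all invented for the example.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Study:
    study_id: str        # hypothetical identifier
    arrival_order: int   # position in the original worklist
    covid_score: float   # hypothetical algorithm output in [0, 1]


def prioritize(worklist: Sequence[Study], threshold: float = 0.5) -> List[Study]:
    """Move flagged studies to the top; keep the rest in arrival order."""
    flagged = sorted(
        (s for s in worklist if s.covid_score >= threshold),
        key=lambda s: (-s.covid_score, s.arrival_order),  # highest score first
    )
    unflagged = sorted(
        (s for s in worklist if s.covid_score < threshold),
        key=lambda s: s.arrival_order,
    )
    return flagged + unflagged


# Synthetic worklist, for illustration only.
demo_worklist = [
    Study("study_A", 0, 0.10),
    Study("study_B", 1, 0.92),
    Study("study_C", 2, 0.40),
    Study("study_D", 3, 0.75),
]

print([s.study_id for s in prioritize(demo_worklist)])
```

Keeping unflagged studies in arrival order means the algorithm only promotes suspicious cases and never pushes a study further back than the number of flagged studies ahead of it.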

At the moment, there is a wide range of publicly available COVID-19 datasets [2, 5]. However, this should not be viewed as an obstacle to publishing new ones, for the following reasons:

1.  Development of algorithms requires large volumes of high-quality clinical data that is representative of real-world patient populations

2.  The algorithms must be verified with new data, i.e. new datasets that were not used during the learning and calibration phases. The more data are available in open sources, the better for developers

3.  The available datasets are relatively small and rarely present additional information, such as tags and/or binary masks for the regions of interest (ROI).

Permanent link: [https://mosmed.ai/datasets/covid19\_1110](https://mosmed.ai/datasets/covid19_1110). This dataset is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0) License.

## Ethics approval

The study was approved by the Independent Ethics Committee of the Moscow Regional Office of the Russian Society of Radiologists and Radiotherapists.

## Data Availability

[https://mosmed.ai/static/landing/docx/README\_RU.pdf](https://mosmed.ai/static/landing/docx/README_RU.pdf)

*   Received May 20, 2020.
*   Revision received May 20, 2020.
*   Accepted May 22, 2020.


*   © 2020, Posted by Cold Spring Harbor Laboratory

This pre-print is available under a Creative Commons License (Attribution-NonCommercial-NoDerivs 4.0 International), CC BY-NC-ND 4.0, as described at [http://creativecommons.org/licenses/by-nc-nd/4.0/](http://creativecommons.org/licenses/by-nc-nd/4.0/)

## References

1.  Ai T, Yang Z, Hou H, et al. Correlation of Chest CT and RT-PCR Testing in Coronavirus Disease 2019 (COVID-19) in China: A Report of 1014 Cases [published online ahead of print, 2020 Feb 26]. Radiology. 2020;200642. doi:10.1148/radiol.2020200642.

2.  Cohen JP, Morrison P, Dao L. COVID-19 Image Data Collection. arXiv:2003.11597v1.

3.  Handbook of COVID-19 Prevention and Treatment. Ed. by T. Liang. Zhejiang University School of Medicine. 2020. 68 p.

4.  Huang Z, Zhao S, Li Z, et al. The Battle Against Coronavirus Disease 2019 (COVID-19): Emergency Management and Infection Control in a Radiology Department [published online ahead of print, 2020 Mar 24]. J Am Coll Radiol. 2020;S1546-1440(20)30285-4. doi:10.1016/j.jacr.2020.03.011.

5.  Jun M, Cheng G, Yixin W, et al. COVID-19 CT Lung and Infection Segmentation Dataset. 2020. Version 1.0. doi:10.5281/zenodo.3757476.

6.  Li L, Qin L, Xu Z, et al. Artificial Intelligence Distinguishes COVID-19 from Community Acquired Pneumonia on Chest CT [published online ahead of print, 2020 Mar 19]. Radiology. 2020;200905. doi:10.1148/radiol.2020200905.

7.  Morozov S.P., Vladzymyrskyy A.V., Klyashtornyy V.G., Andreychenko A.E., Kulberg N.S., Gombolevsky V.A., Sergunova K.A. Clinical acceptance of software based on artificial intelligence technologies (radiology). In: “Best practices in medical imaging”. Moscow. 2019. Issue 57. 45 p.

8.  Morozov S, Guseva E, Ledikhova N, Vladzymyrskyy A, Safronov D. Telemedicine-based system for quality management and peer review in radiology. Insights Imaging. 2018;9(3):337-341. doi:10.1007/s13244-018-0629-y.

9.  Ucar F, Korkmaz D. COVIDiagnosis-Net: Deep Bayes-SqueezeNet based diagnosis of the coronavirus disease 2019 (COVID-19) from X-ray images [published online ahead of print, 2020 Apr 23]. Med Hypotheses. 2020;140:109761. doi:10.1016/j.mehy.2020.109761.