Abstract
Objective In this large multi-institutional study, we aimed to analyze the prognostic power of computed tomography (CT)-based radiomics models in COVID-19 patients.
Methods CT images of 14,339 COVID-19 patients with overall survival outcome were collected from 19 medical centers. Whole lung segmentations were performed automatically using a previously validated deep learning-based model, and regions of interest were further evaluated and modified by a human observer. All images were resampled to an isotropic voxel size, intensities were discretized into 64-binning size, and 105 radiomics features, including shape, intensity, and texture features were extracted from the lung mask. Radiomics features were normalized using Z-score normalization. High-correlated features using Pearson (R2>0.99) were eliminated. We applied the Synthetic Minority Oversampling Technique (SMOT) algorithm in only the training set for different models to overcome unbalance classes. We used 4 feature selection algorithms, namely Analysis of Variance (ANOVA), Kruskal- Wallis (KW), Recursive Feature Elimination (RFE), and Relief. For the classification task, we used seven classifiers, including Logistic Regression (LR), Least Absolute Shrinkage and Selection Operator (LASSO), Linear Discriminant Analysis (LDA), Random Forest (RF), AdaBoost (AB), Naïve Bayes (NB), and Multilayer Perceptron (MLP). The models were built and evaluated using training and testing sets, respectively. Specifically, we evaluated the models using 10 different splitting and cross-validation strategies, including different types of test datasets (e.g. non-harmonized vs. ComBat-harmonized datasets). The sensitivity, specificity, and area under the receiver operating characteristic (ROC) curve (AUC) were reported for models evaluation.
Results In the test dataset (4301) consisting of CT and/or RT-PCR positive cases, AUC, sensitivity, and specificity of 0.83±0.01 (CI95%: 0.81-0.85), 0.81, and 0.72, respectively, were obtained by ANOVA feature selector + RF classifier. In RT-PCR-only positive test sets (3644), similar results were achieved, and there was no statistically significant difference. In ComBat harmonized dataset, Relief feature selector + RF classifier resulted in highest performance of AUC, reaching 0.83±0.01 (CI95%: 0.81-0.85), with sensitivity and specificity of 0.77 and 0.74, respectively. At the same time, ComBat harmonization did not depict statistically significant improvement relevant to non-harmonized dataset. In leave-one-center-out, the combination of ANOVA feature selector and LR classifier resulted in the highest performance of AUC (0.80±0.084) with sensitivity and specificity of 0.77 ± 0.11 and 0.76 ± 0.075, respectively.
Conclusion Lung CT radiomics features can be used towards robust prognostic modeling of COVID-19 in large heterogeneous datasets gathered from multiple centers. As such, CT radiomics-based model has significant potential for use in prospective clinical settings towards improved management of COVID-19 patients.
INTRODUCTION
The novel coronavirus disease which emerged in 2019 (COVID-19) is now a major cause of death worldwide 1. This highly contagious virus can cause a spectrum of pulmonary, hematological, neurological, and systemic complications, making it a highly lethal pathogen 2. As of August 23st, 2021, there have been >200 million globally confirmed cases of COVID-19, including >4 million deaths and >4 billion vaccinations reported to the world health organization (WHO) [https://covid19.who.int/]. There remains an urgent need for addressing issues such as diagnosis, prognosis, and treatment options 3.
Diagnostic tools for COVID-19, such as reverse transcription polymerase chain reaction (RT-PCR) aid to distinguish between negative and positive cases 4. Prognostic tools, on the other hand, provide clinicians with insights to optimize treatment strategies, manage the hospitalization of patients both in the wards and intensive care units (ICU), and better handle patient follow-up plans 5. Different studies have evaluated clinical and/or non-clinical features for determining the diagnosis and prognosis of patients with COVID-19. Yan et al. 6 used only clinical features to classify patients into different categories, ranging from mild to critical conditions. Zhou et al. 7 also aimed at establishing a prognostic model for outcome prediction of patients with COVID-19, utilizing their clinical data.
Computed Tomography (CT) plays a pivotal role in the management of a wide variety of diseases as a fast and non-invasive imaging modality. In the case of COVID-19, CT is used for both diagnostic (e.g. in case of limited access to RT-PCR) and prognostic purposes 8. The clinical value of CT imaging relies mainly on the early detection of lung infections and high accuracy in quantifying the disease progression and severity 9. Francone et al. 10 assessed the correlation of CT scores with COVID-19 pneumonia severity and outcome. In another study, Zhao et al. 11 specified pulmonary involvement severity by measuring the extent of pneumonia and consolidation that appear on CT images. Li et al. 12 concluded that high scores of CT images are associated with severe COVID-19 pneumonia.
In spite of previously conducted research, there is still a need for studies with more accurate and comprehensive analyses 13. In conventional analyses, CT features are visually and subjectively defined, while machine learning (ML) and/or deep learning (DL) models have the potential to provide more comprehensive and objective assessment of images. Towards modeling of outcomes in COVID-19 patients, several ML and DL algorithms have been utilized to assess severity and to predict outcome of patients using CT imaging 14–17. To evaluate the sensitivity and specificity of CT for COVID-19 diagnostic purposes, Harmon et al. 14 achieved a sensitivity of 84% and specificity of 93% in an independent test set of 1337 patients applying DL on CT images. In another machine learning study, Mei et al. 15 reported a sensitivity of 84.3% and specificity of 82.8% based on combined CT and clinical data. Cai et al. 16 utilized a random forest model to assess the severity of COVID-19 disease in 99 patients and their need for a longer hospital or ICU stay. Another study by Lessman et al. 17 reported the model performance of three DL models for severity assessment in COVID-19. In another multi-center study, Meng et al. 18 differentiated patients with high-risk of mortality versus low-risk ones using a convolutional neural network named De-COVID-Net. Ning et al. 19 used their pre- trained DL model on a dataset consisting of 351 patients that was capable of distinguishing between non-coronavirus pneumonia, mild coronavirus pneumonia, and severe forms of COVID-19 disease.
Various studies also reported remarkable prediction accuracies utilizing radiomics approaches using CT and chest x-ray imaging modalities 20–23. Medical images could be converted into high-dimensional data by means of radiomics, wherein radiomics features are selected from the images and combined using machine learning algorithms to arrive at radiomics signatures as biomarkers of disease24–34. In addition to wide usage in several oncologic 35–37 as well as non-oncologic diseases 27, 38, radiomics studies have indicated that imaging features extracted from CT or chest X-ray images could be used as parameters for outcome prediction of patients with COVID-19 pneumonia. Radiomics analyses have been applied to different aspects of COVID-19, including diagnosis, severity scoring, prognosis, hospital/ICU stay prediction, and survival analysis 20–23. In a retrospective study, Fu et al. 39 constructed a predictive model based on CT radiomics, clinical and laboratory features. This signature could classify COVID-19 patients into stable and unstable (i.e. progressive phenotype). Homayounieh et al. 40 aimed to predict the severity of pneumonia in patients with COVID-19 using a radiomics model that outperformed models consisting of clinical-only features. Another study by Li et al. 41 analyzed a radiomics/DL model that distinguished severe from critical COVID-19 pneumonia patients. Cai et al. 42 developed a model by means of combining CT radiomics features and clinical data to predict RT-PCR negativity during admission. Yue et al. 43 conducted a multicentric radiomics study on 52 patients to differentiate whether an individual needs a short-term or long-term hospital stay. Another study by Bae et al. 44 predicted the mortality of patients with COVID-19 using chest x-ray radiomics. Their model could identify whether a patient needs mechanical ventilation or not. Artificial intelligence (AI) has been widely used in radiology to provide diagnostic and prognostic tools to help clinicians during the pandemic. However, owing to the lack of standardization in AI studies in terms of data collection, methodology, and evaluation, most of these studies were not pragmatic when it comes to clinical adoption 13, 45, 46. In a recent study by Roberts et al. 13, possible sources of bias in more than 2000 AI articles in COVID-19 were evaluated in both deep learning and traditional machine learning-based studies. This review showed that bias do exist in most, if not all, of the studies in different domains, including dataset and methodology. In the dataset domain, several articles used public datasets which can contain duplicates, low-quality images, false demographics, or unknown clinical/lab data of patients. These public datasets can also induce bias in the outcome domain as they may fail to supply sufficient information about how they exactly proved a patient is COVID-19 positive or how imaging data were acquired in terms of image acquisition and reconstruction. They also mentioned using small datasets, Frankenstein datasets 13, and Toy datasets 45 in several articles. In the methodology domain, most of the studies did not provide all methodological information or did not perform a standard AI analysis based on guidelines 47, 48.
Overall, Roberts et al. 13 reviewed 69 traditional machine learning/radiomics studies and reported that 44 were excluded because of Radiomics Quality Score (RQS) 47 of less than six or not describing the datasets appropriately. From the remaining 25, six articles performed model evaluation using external validation sets and only four papers reported the significance of their model along with the statistical parameters (agreement level). They also assessed bias in the prognostication studies in the four areas of prediction model risk of bias assessment tool (PROBAST) guide and reported high bias in participants, predictors, outcomes, and analysis areas. Overall, there are several radiomics studies targeting improved COVID-19 diagnosis or prognosis. However, owing to the limited sample size, single-centered nature of most of the databases, and variability in data acquisition and image reconstruction parameters, the models tend to overfit 13. Providing a generalizable model which is reproducible on unseen datasets of other centers is highly desired. In this context, we designed a large multi-institutional study to build and evaluate a radiomics model based on a large-scale CT imaging dataset aimed at the prediction of survival (alive or deceased) in COVID-19 infected patients. We built and evaluated our model based on different guidelines and tested different machine learning algorithms in different strategies to evaluate model reproducibility and repeatability in a large dataset.
MATERIALS AND METHODS
Figure 1 summarizes the different steps adopted in this study. To provide a standard and reproducible study, we completed different checklists/guidelines concerning predictive modeling, radiomics studies, and artificial intelligence studies. The Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD) [40] checklist is provided in supplemental Table 1. We also reported the Radiomics Quality Score (RQS) based on Lambin et al. 47 and the Checklist for Artificial Intelligence in Medical imaging (CLAIM) 48 in supplemental material. These checklists were filled out by two individuals (with consensus) who are experts in radiomics field and not co-authors in this study.
Patient Population
This study was approved by our local institutional review board (IRB), and written informed consent of patients was waived by the ethics committee as anonymized data were used without any interventional effect on diagnosis, treatment or management of COVID-19 patients.
In the first step, 24,448 patients, from 19 medical centers in Iran, suspected of COVID-19 and with acquired chest CT images were included. Different exclusion criteria were applied to provide a reliable dataset. We excluded: (i) patients without follow-up information or clear evidence of clinical endpoint, or if they were transferred to another medical center (3519 patients), or patients with (ii) negative RT-PCR (1860 patients), (iii) laboratory-confirmed pneumonia of other types (1606 patients), (iv) confirmed lung cancer or metastases from other origins to the lungs (1400 patients), (v) atypical CT findings for other abnormalities (850 patients), (vi) CT images with contrast media administration (58 patients), (vii) severe motion or bulk motion artifacts in CT images were carefully checked (515 patients), (viii) extremely inappropriate positioning which resulted in missing the upper and lower bounds of the lungs (121 patients), or (ix) CT images with extremely low quality or SNR (210 patients).
Considering these criteria, we excluded 10,149 patients from further analysis (Figure 2). Hence, 14,339 chest CT scans (with one scan per patient whose COVID-19 was confirmed either by RT- PCR or CT imaging) were included in this study. Common symptoms of COVID-19, including fever, respiratory symptoms, shortness of breath, dry cough and tiredness were recorded and contact history with COVID-19 patients was also assessed. In each center, CT images were evaluated at least by two radiologists and in case of discrepancy, a third radiologist was involved to settle the disagreement. As defined in the COVID-19 Reporting and Data System (CO-RADS) 49, typical manifestations of COVID-19, such as ground-glass opacity, consolidation, crazy-paving pattern, or dominant peripheral distribution of parenchymal abnormalities were considered diagnostic for COVID-19 in CT images.
Among these studies, 13,741 CT images were collected from 18 centers in Iran (1560 deceased, 12,171 alive; fraction of deceased cases are significantly overrepresented due to our exclusion criteria), and 608 images were gathered from an online open-access databases from China (Center 9: 18 deceased, 590 alive) 50. All patients from Iran received standard treatment regimens according to the interim national COVID-19 treatment guideline [corona.behdasht.gov.ir]. Only one center (Center 10) included outpatient studies and the rest were inpatient-only studies from hospitalized patients. Follow-up was performed 3-4 months after the initial CT scan in the outpatient cases. For admitted patients (inpatient), follow-up was performed until discharge from the hospital, which was considered after careful evaluation of patients by the attending physician based on several criteria, including stable hemodynamics state (BP>90/60, HR<120), absence of fever for > 2 days, absence of any respiratory distress, blood oxygen saturation >93% in ambient oxygen without the need for supplementary oxygen, and no need for hospitalization for any other pathology.
CT Image Acquisition
All chest CT images from the Iranian centers were acquired according to an institutional variation of the national society of radiology COVID-19 imaging guidelines 51. Image acquisition was performed during breath-hold to reduce motion artifacts. Variations in CT imaging protocols among centers were observed which led to considerable variability in image quality and radiation dose. Volumetric CT Dose Index (CTDIvol), as a parameter representing vendor-free information on radiation exposure, was reported to better reflect intra/inter institutional variability of our dataset. Table 1 summarizes the image acquisition characteristics of each center, including the number of images, acquisition parameters (slice thickness, tube current), and CTDIvol.
Image Segmentation and Image Preprocessing
The lungs were automatically segmented using our DL-based algorithm named COLI-NET which we previously proposed and evaluated 52. For efficient radiomics feature extraction (feature extraction time), all images were first cropped to the lung region and then resized to 296×216 to obtain a computationally efficient feature extraction. After reviewing the segmentations, the image voxel was resized to an isotropic voxel size of 1×1×1 mm3, and the intensity discretized to 64- binning size 53.
Radiomics Feature Extraction and Harmonization
After image preprocessing, radiomics feature extraction was performed using the PyRadiomics Python library 54. Radiomics features, including morphological (n=16), intensity (n=17), and texture features including second-order features, such as Gray Level Co-occurrence Matrix (GLCM, n=24), higher-order features namely Gray Level Size Zone Matrix (GLSZM, n=16), Neighboring Gray Tone Difference Matrix (NGTDM, n=5), Gray Level Run Length Matrix (GLRLM, n=16), and Gray Level Dependence Matrix (GLDM, n=14) were extracted in compliance with the Image Biomarker Standardization Initiative (IBSI) guidelines 53.
Feature Preprocessing
For each feature vector, the mean and standard deviation were calculated (in training sets) and then normalized using Z-Score normalization, which consists of subtracting each feature vector from the mean followed by division by the standard deviation. For Z-score normalization, the mean and standard deviation were calculated for the training set and then applied on test set. Features’ correlation was evaluated using Pearson correlation and features with high correlation (R2>0.99) were eliminated. Owing to unbalanced datasets in training and test set, we applied Synthetic Minority Oversampling Technique (SMOT) algorithm to only the training set for the different models.
Feature Selection and Classification
In this study, we used 4 feature selection algorithms, including Analysis of Variance (ANOVA), Kruskal-Wallis (KW), Recursive Feature Elimination (RFE), and Relief. Feature preprocessing and selection were performed on training sets and then applied on test sets. All test and external validation sets were unseen to feature processing and the selection and model building process. For classification task, we used seven classifiers, including Logistic Regression (LR), Least Absolute Shrinkage and Selection Operator (LASSO), Linear Discriminant Analysis (LDA), Random Forest (RF), AdaBoost (AB), Naïve Bayes (NB) and Multilayer Perceptron (MLP). By cross-combination of four feature selectors and seven classifiers, we tested twenty-eight different combinations.
Evaluation
For thorough assessment, we trained and evaluated our models using 10 different strategies as summarized in Figure 3. To evaluate the models on whole datasets without considering data variability in each center, we divided the dataset of each center to 70% training and 30% tests sets resulting in the following two strategies (1 and 2):
- Random Splitting method #1: Non-harmonized datasets were randomly split into 70% (10,038 patients) and 30% (4301 patients) for training and test sets, respectively, without considering centers. The data included patients whose COVID-19 was confirmed using RT-PCR and patients confirmed only by imaging. This test dataset included both populations.
- Random Splitting method #2: Non-harmonized datasets were randomly split into 70% (8503 patients) and 30% (3644 patients) for the training and test sets, respectively, without considering centers. The train and test sets consisted of only patients with positive RT-PCR.
- To evaluate the models on whole datasets considering data variability in each center, we divided the dataset of each center to 70% training and 30% test sets resulting in the following two strategies (3 and 4):
- Random Splitting method #3: Data from each center (non-harmonized) were randomly split into 70% (10,048 patients) and 30% (4291 patients) for the training and test sets, respectively. As our data included patients whose COVID-19 was confirmed using RT-PCR and patients confirmed only by imaging, this test dataset included both populations.
- Random Splitting method #4: Data from each center were randomly split into 70% (10,704 patients) and 30% (3635 patients) for the training and test sets, respectively. The train and test sets consist of only patients with positive RT-PCR.
To evaluate the models in the whole dataset by removing data variability due to acquisition/reconstruction from different centers, ComBat harmonization proposed by Johnson et al. 55 was applied to the extracted features to tackle the effect of center-based imaging variability. The impact of ComBat harmonization on radiomics features was assessed by Kruskal-Wallis test. After applying ComBat harmonization, we divided the datasets of each center to 70/30% train/test sets resulting in the following two strategies (3 and 4):
- Random Splitting method #5: Data from each center (ComBat harmonization) were randomly split into 70% (10,048 patients) and 30% (4291 patients) for the training and test sets, respectively. As our data included patients whose COVID-19 was confirmed using RT-PCR and patients confirmed only by imaging, this test dataset included both populations.
- Random Splitting method #6: Data from each center (ComBat harmonization) were randomly split into 70% (10,704 patients) and 30% (3635 patients) for the training and test sets, respectively. The train and test sets consisted of only patients with positive RT-PCR.
To evaluate model generalizability and sensitivity to datasets, we performed model assessment using the following strategies (7 to 9) on the external validation sets:
- Random Splitting method #7: Data (non-harmonized) were randomly split into 70% (10,655 patients) and 30% (3684 patients) for the training and external validation sets, respectively. The center’s number in the test set appears in the test sets.
- Center-based model evaluation #8: we built models on one center’s dataset (non-harmonized) and then evaluated on 18 remaining centers (external validation set), and then repeated this process for all datasets.
- Leave-one-center-out (LOCO) #9: On each of the 19 iterations, 18 centers were used as the training set, and one as the external validation set (unseen data during training). We repeated this process for all center datasets (non-harmonized).
To evaluate of model sensitivity to each dataset, we trained and tested the models in each center separately on each center dataset using the following strategies:
- Random Splitting method #10: Data from each center (non-harmonized) were randomly split into 70% and 30% for the training and test sets, respectively. The models were built and evaluated on each center separately.
All multivariate steps, including feature preprocessing, feature selection and classification were performed separately for each strategy. Classification algorithms were optimized during training using grid search algorithms. The best models were chosen by one standard deviation rule in 10- fold cross-validation and then evaluated on test or external validation sets. The accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve (AUC) were reported for the test or external validation sets (unseen during training). Statistical comparison of AUCs (by 10000 bootstrapping) between models was performed using the DeLong test 56. The significance level was considered at a level of 0.05. All multivariate analysis steps were performed using Python Scikit-Learn open-source library.
RESULTS
Figures 4 depicts the hierarchical clustering heat map of radiomics features distribution in alive and deceased groups for the whole dataset prior to ComBat harmonization. Supplemental Figure 1 shows the cluster heat map of radiomics features in the non-harmonized data set. Figure 5 shows the correlation of radiomics features in the whole dataset, whereas supplemental Figure 2 represents the same for ComBat harmonized features. The statistical differences calculated using the Kruskal-Wallis test are presented in supplemental Table 1 before and after ComBat harmonization. The results of ComBat harmonization showed that the algorithm properly eliminated the center effect on radiomics features in most features. ComBat harmonization data only were used for strategies 5 and 6. Figures 6-8 provide the classifications power indices of AUC, sensitivity and specificity, respectively, for splitting strategies 1-10. More detailed results were presented in supplemental Tables 2-11 for the different strategies.
In strategy 1 where the data were randomly split into train and test sets (without considering centers), RFE feature selection and RF classifier results in highest performance of AUC 0.84±0.01 (CI95%: 0.82-0.85) with sensitivity and specificity of 0.78 and 0.76, respectively. In strategy 2 where only PCR positive studies were randomly split into train and test sets (without considering centers), KW feature selection and RF classifier combination resulted in the highest performance with an AUC of 0.84±0.01 (CI95%: 0.82-0.86) and sensitivity and specificity of 0.81 and 0.76, respectively. There was no statistical significant difference between Strategies 1 and 2, the main difference being the inclusion of CT and PCR positive studies in strategy 1 and only PCR positive studies in strategy 2.
In strategy 3 where whole data splitting was performed in each center separately for train and test sets, ANOVA feature selector and RF classifier combination resulted in the highest performance with AUC of 0.83±0.01 (CI95%: 0.81-0.85), sensitivity and specificity of 0.81 and 0.72, respectively. Similar results as above were achieved for strategy, 4 where data splitting was performed in each center separately to train and test set for PCR positive dataset. There were no statistically significant difference between strategies 3 and 4 where the main difference was including CT and PCR positive in strategy 4 and only PCR positive in strategy 5.
In strategy 5 where Combat harmonized whole data splitting was performed in each center separately to the train and test sets, Relief feature selector and RF classifier combination resulted in the highest performance with an AUC of 0.83±0.01 (CI95%: 0.81-0.85), sensitivity and specificity of 0.77 and 0.74, respectively. In strategy 6, where Combat harmonized data splitting was performed in each center separately to the train and test sets for PCR positive studies, Relief feature selector and RF classifier combination resulted in the highest performance with an AUC 0.83±0.01 (CI95%: 0.81-0.84), sensitivity and specificity of 0.79 and 0.72, respectively. . There were no statistically significant differences between strategies 5 and 6. The statistical comparison of AUCs between for ComBat harmonization strategies 5 and 6 and to the same splitting in strategies 3 and 4 using DeLong test didn’t reveal any statistically significant differences.
In strategy 7, where the splitting into train and test sets was performed based on centers (centers appear in training and test sets only once), RFE selector and RF classifier combination resulted in the highest performance with an AUC of 0.79±0.01 (CI95%: 0.76-0.81), sensitivity and specificity of 0.73 and 0.71, respectively. In Figure 9, the ROC curves of the test set for strategies 1-7 as well as the comparison of the different strategies are depicted. In strategy 8 where the model is built based on one center dataset (non-harmonized) and then evaluated on the 18 remaining centers (external validation set), ANOVA feature selector and NB classifier combination resulted in the highest performance with an AUC of 0.74±0.034, sensitivity and specificity of 0.71 ± 0.026 and 0.69 ± 0.033, respectively. The results of each center were presented in supplemental Table 12. To evaluate the model on external validation sets, we reported the results of each center in the LOCO strategy 9. ANOVA feature selector and LR classifier combination resulted in the highest performance with an AUC of 0.80±0.084, sensitivity and specificity of 0.77 ± 0.11 and 0.76 ± 0.075, respectively. In strategy 10, the data from each center (non-harmonized) were randomly split into 70% and 30% for training and test sets, respectively, and the models were built and evaluated on each center separately. ANOVA feature selector and LR classifier combination resulted in the highest performance with an AUC of 0.82±0.10, sensitivity and specificity of 0.84 ± 0.12 and 0.77 ± 0.09, respectively. The results of each center were presented in supplemental Tables 13 and 14 for strategies 9 and 10, respectively.
DISCUSSION
In this multi-centric study, we conducted a CT-based radiomics analysis to assess the ability of our model in predicting the overall survival of patients with COVID-19 using a large multi- institutional dataset. We included 14,339 patients along with their CT images, segmented the lungs, and extracted distinct radiomics features. We evaluated different combinations of feature selectors and classifiers in different strategies. Since the dataset was gathered from different centers, we applied the ComBat Harmonization algorithm that has been successfully applied in radiomics studies over the extracted features 57. As our dataset consisted of imbalanced classes, we first used SMOT algorithm in the training sets. Our model was trained and the results of 3 different testing methods were reported.
Prognostic modeling can be regarded as an important framework towards better understanding of disease, its management, monitoring, and identification of the best treatment options. A number of reports have shown the effectiveness of image-based, laboratory-based, or combined models in outcome prediction of COVID-19 infected patients 58, 59. Qiu et al. 60 constructed a radiomics model trained to classify the severity of COVID-19 lesions (mild vs severe) using CT images. Their study included a medium-to-large number of patients (n=1160) and achieved an AUC of 0.87 in the test dataset. They showed that the radiomics signature is potent in aiding physicians to manage patients in a more precise way. Fu et al. 39 conducted a similar experiment with a radiomics-based model using CT images and applied it to data from 64 patients to classify them into progressive and stable groups. Their model could accurately perform the given task (AUC=0.83). While the results were promising, their study did not include a large cohort.
A study by Chao et al. 59 included different types of information, such as CT-based radiomics features, clinical, and demographic data to employ a holistic prognostic model. Their model could predict whether the patients will demand an ICU admission or not with an AUC of 0.88. Tang et al. 61 also assessed a random forest model for classifying patients into categories of severe and non-severe based on CT imaging radiomics features along with laboratory test results. The model performed well (AUC=0.98) on their dataset consisting of 118 patients. In a study by Wu et al. 62, the authors assessed the predictive power of a radiomic signature for showing poor patient outcomes defined as ICU admission, need for mechanical ventilation, or death. Their model could reach an AUC of 0.97 in the prediction of 28-day outcomes after CT images were taken. This highly promising result was achieved with the help of clinical data and harmonization of the features. At the same time, in our study, ComBat harmonization did not appear to impact outcome prediction.
One should note that both clinical-only and radiomics-only survival prediction models have advantages. However, studies have shown that radiomics features yield superior accuracy in most cases. In a study by Homayounieh et al. 63, the authors developed a radiomics-based signature and compared it with a clinical-only signature in terms of mortality prediction. They concluded that radiomics-based model can outperform the clinical-only model with a wide margin (AUC of 0.81 versus 0.68). Their study included 315 adults and was applied to other clinical outcomes as well, such as the prediction of outpatient/inpatient care and ICU admission. In addition, other reports indicated that adding clinical features to the radiomics model only slightly improved the results 64. In a recent study, Shiri et al. 64 performed a radiomics study for prognostication purpose (alive or deceased) of COVID-19 patients using clinical (demographic, laboratory, and radiological scoring), COVID-19 pneumonia lesion radiomics features and whole lung radiomics features, separately and in combination. They trained a machine learning algorithm, Maximum Relevance Minimum Redundancy (MRMR) as the feature selector and XGBoost as the classifier, on 106 patients and evaluated and reported results on 46 test sets. They reported an AUC of 0.87 ± 0.04 for clinical-only, 0.92 ± 0.03 for whole lung radiomics, 0.92 ± 0.03 for lesion radiomics, 0.91 ± 0.04 for lung + lesion radiomics, 0.92 ± 0.03 for lung radiomics + clinical data, 0.94 ± 0.03 for lesion radiomics + clinical data and 0.95 ± 0.03 for lung + lesion radiomics + clinical data. The lung and lesion radiomics-only models showed similar performance, while the integration of features resulted in the highest accuracy.
Lassau et al. 65 combined CT-based DL models, biological and clinical features for severity prediction in 1003 COVID-19 patients, confirmed by either CT or RT-PCR. They showed clinical and biological features correlation with CT markers. Zhang et al. 58 conducted a diagnostic and prognostic study using 3777 COVID-19 patients. They reported a high positive and negative correlation of lung-lesion CT manifestations with a number of clinical and laboratory tests. They also reported that their diagnostic model (COVID-19 from common pneumonia and normal control) can improve radiologist’s performance from junior to senior level (AUC = 0.98) for progression to severe/critical disease in their prognostic model. They reported an AUC of 0.90 with sensitivity and specificity of 0.80 and 0.86, respectively. Feng et al. 66 built a machine learning prognostic model using a multicenter COVID-19 dataset. They reported a high correlation of CT features with clinical findings, also utilizing a multivariable model in the validation set consisting of 106 patients. The AUC was 0.89 (95% CI: 0.81–0.98). Recently, Xu et al. 67 conducted a multicentric study for the prediction of ICU admission, mechanical ventilation, and mortality of hospitalized patients with COVID-19. CT radiomics features were integrated with demographic and laboratory tests. The evaluation was performed in 1362 patients from nine hospitals reporting an AUC of 0.916, 0.919 and 0.853 for ICU admission, mechanical ventilation, and mortality of hospitalized patients, respectively. For the radiomics-only model, they reached an AUC of 0.86, 0.80, and 0.66 for the above three mentioned outcomes, respectively.
Most previous studies suffered from a common limitation of COVID-19 RT-PCR not being available for the entire dataset when using multicentric data. In our study, COVID-19 positivity was confirmed by either RT-PCR or CT images, and different strategies were adopted to evaluate the models, including random splits and leave-one-center-out. We randomly split the data to train and test sets containing both CT positive and RT-PCR positive patients. Furthermore, to ensure the reproducibility of our results on RT-PCR positive patients, we split the dataset in a way that the test set consisted only of RT-PCR positive patients. To maximize the generalizability of the model and avoid overfitting on training sets, owing to variability in acquisition and reconstruction protocols, our model was developed on multicentric datasets with a wide variety of acquisition and reconstruction parameters. To test the generalizability of our model, we repeated the evaluation of our model using leave-one-center-out cross-validation. The results were reported for 10 different strategies of splitting and cross-validation scenarios.
Several studies reported on the use of CT radiomics or DL algorithms for diagnostic and prognostic purposes in patients with COVID-19 58, 59, 68. However, most studies were performed using a small sample size. Overall, establishing evidence that radiomics features can help prioritize patients based on the severity of their disease and/or predicting their survival requires assessment using larger cohorts for a more generalizable model because of the wide variability in COVID-19 manifestations in different patients. In this study, we provided a large multinational multicentric dataset and evaluated our model in different scenarios to ensure model reproducibility, robustness and generalizability.
While attempting to address bias and limitations to create a generalizable model, the results should be interpreted considering some issues. First, motion artifacts were unavoidable in some COVID-19 patient scans which resulted in overlapping pneumonia regions. We removed patients with severe motion artifacts to omit this effect on model generalization. Second, we enrolled patients with common symptoms of COVID-19 whose infection was confirmed by either RT-PCR or CT imaging (typical manifestation of COVID-19 defined by interim guidelines). We handled this issue by testing different scenarios, including training a model using RT-PCR or CT positive patients and held out only RT-PCR patients in the test set and reported reproducible and repeatable results. Third, we did not include comorbidities (increased risk of adverse outcome), clinical or laboratory data during modeling. However, previous studies showed high correlation of lung features with these findings 58, 65, 66. Future studies combining various information to build a holistic model using a large dataset could improve the model’s performance. Forth, we built a prognostic model based on all lung radiomics features. However, COVID-19 can result in imaging manifestations in other organs, such as the heart. Including features from different organs has the potential of improving prognostic performance 69. Fifth, therapeutic regimens for different patients were not considered during modeling although providing this information may help improving the accuracy of the model. Sixth, only binary classification was considered for the prognostic model in this study. Future studies should perform survival analysis using time-to-event models to account for the time of adverse event. Lastly, we did not evaluate the impact of image acquisition or reconstruction parameters on radiomics features at the same time. We applied ComBat harmonization algorithm to eliminate center-specific parameter effects on CT radiomics features.
CONCLUSION
A very large heterogeneous COVID-19 database was gathered from multiple centers and a predictive model of survival outcome derived and extensively tested to evaluate its reproducibility and generalizability. We demonstrated that lung CT radiomics features could be used as biomarkers for prognostic modeling in COVID-19. Through the use of a large imaging dataset, the predictive power of the proposed CT radiomics model is more reliable and may be prospectively used in clinical setting to manage COVID-19 patients.
Data Availability
All data produced in the present work are contained in the manuscript
Data and code availability
Radiomics features and code would be available with request upon publication.
Conflict of Interest statement
The authors declare that they have no conflict of interest.
ACKNOWLEDGMENTS
This work was supported by the Swiss National Science Foundation under grant SNRF 320030_176052.
Footnotes
First Author: Isaac Shiri, MSc Geneva University Hospital, Division of Nuclear Medicine and Molecular Imaging, CH-1211 Geneva, Switzerland Email: Isaac.shirilord{at}unige.ch