Abstract
The SARS-CoV-2 pandemic has put pressure on Intensive Care Units, and made the identification of early predictors of disease severity a priority. We collected clinical and biological data, chest CT scans, and radiology reports from 1,003 coronavirus-infected patients from two French hospitals. Among 58 variables measured at admission, 11 clinical and 3 radiological variables were associated with severity. Next, using 506,341 chest CT images, we trained and evaluated deep learning models to segment the scans and reproduce radiologists’ annotations. We also built CT image-based deep learning models that predicted severity better than models based on the radiologists’ reports. Finally, we showed that adding CT scan information (either through radiologist lesion quantification or through deep learning) to clinical and biological data improves prediction of severity. These findings show that CT scans contain novel and unique prognostic information, which we included in a 6-variable ScanCov severity score.
Introduction
Hospitalized COVID-19 patients are likely to develop severe outcomes requiring mechanical ventilation or high-flow oxygenation. Among hospitalized patients, 14 to 30% will require admission to an ICU, 12 to 33% will require mechanical ventilation, and 20 to 33% will die1–4. Detection at admission of patients at risk of severe outcomes is important to deliver proper care and to optimize use of limited intensive care unit (ICU) resources5.
Identification of hospitalized COVID-19 patients at risk for severe deterioration can be done using risk scores that combine several factors including age, sex, and comorbidities (CALL, COVID-GRAM)6–11. Some risk scores also include additional markers of severity such as the dyspnea symptom, clinical examination variables such as low oxygen saturation and elevated respiratory rate, as well as biological factors reflecting multi-organ failures such as elevated Lactate dehydrogenase (LDH) values8,10,12–14.
Beyond clinical and biological variables, computerized tomography (CT) scans also contain prognostic information, as the degree of pulmonary inflammation is associated with clinical symptoms, and the amount of lung abnormality has been associated with severe evolution15–19. However, the extent to which CT scans at patient admission add prognostic information beyond what can be inferred from clinical and biological data is unresolved.
The objective of this study was to integrate clinical, biological and radiological data to predict the outcome of hospitalized patients. CT-scan information was included in multimodal scores either through deep learning models or using radiologist quantification of lesions.
Results
A total of 1,003 patients from Kremlin-Bicêtre (KB, Paris, France) and Gustave Roussy (IGR, Villejuif, France) were enrolled in the study. Clinical and biological data, CT scan images, and radiology reports were collected at hospital admission. There were 931 patients for whom clinical, biological and CT-scan data were available (Supp Fig 1). A total of 506,341 images were analyzed for the 980 patients with CT-scans (average of 517 slices per scan). Radiologists annotated 17,873 images from 329 CT-scans. Summary statistics for the clinical, biological, and CT scan data are provided in Table 1.
Variables associated with severity
We first evaluated how clinical and biological variables measured at admission were associated with future severe progression, which we defined as an oxygen flow rate of 15 L/min or higher and/or the need for mechanical ventilation and/or patient death20. This definition of severe progression corresponds to a score of 5 or more according to the World Health Organization evaluation of severity on a 1 to 10 scale. We computed the severity odds ratios for each individual variable, and at each hospital center (Table 1 and Supp Fig 2). When combining association results from the two centers, we found 11 variables significantly associated with severity (P <0.05/58 to account for testing 58 variables, Table 1 and Supp Fig 2): age (Odds Ratio [OR] KB 1.66 (1.41-1.96), OR IGR 1.35 (0.92-1.98), PStouffer = 5.25e-10), sex (OR KB 1.95 (1.41-2.69), OR IGR 1.07 (0.51-2.23), PStouffer = 5.75e-05), hypertension (OR KB 1.84 (1.35-2.51), OR IGR 1.11 (0.51-2.42), PStouffer = 1.11e-04), chronic kidney disease (OR KB 2.51 (1.62-3.89), OR IGR 16.59 (1.93-142.84), PStouffer = 6.71e-06), respiratory rate (OR KB 1.34 (1.13-1.59), OR IGR 3.37 (1.28-8.86), PStouffer = 2.14e-04), oxygen saturation (OR KB 0.38 (0.31-0.47), OR IGR 0.35 (0.20-0.63), PStouffer = 2.91e-21), diastolic pressure (OR KB 0.70 (0.59-0.83), OR IGR 0.75 (0.51-1.11), PStouffer = 1.35e-05), CRP (OR KB 1.47 (1.25-1.72), OR IGR 1.48 (1.03-2.14), PStouffer = 4.48e-07), LDH (OR KB 2.05 (1.65-2.54), OR IGR 2.36 (1.32-4.21), PStouffer = 6.05e-12), polynuclear neutrophil (OR KB 1.36 (1.13-1.60), OR IGR 1.15 (0.80-1.64), PStouffer = 1.29e-04), and urea (OR KB 1.70 (1.43-2.01), OR IGR 2.19 (1.36-3.52), PStouffer = 9.17e-11).
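The composite severity outcome defined above can be written as a simple rule. The following is a minimal sketch; the `PatientOutcome` record and its field names are hypothetical and are not taken from the study database.

```python
from dataclasses import dataclass


@dataclass
class PatientOutcome:
    """Hypothetical record of what happened to a patient after admission."""
    max_oxygen_flow_l_min: float   # highest oxygen flow rate received, in L/min
    mechanical_ventilation: bool   # True if mechanical ventilation was needed
    died: bool                     # True if the patient died


def is_severe(outcome: PatientOutcome) -> bool:
    """Composite severity outcome: any one of the three criteria is enough."""
    return (
        outcome.max_oxygen_flow_l_min >= 15.0
        or outcome.mechanical_ventilation
        or outcome.died
    )
```

For example, a patient who only ever received 6 L/min of oxygen and survived without ventilation is not counted as severe, whereas a patient who reached 15 L/min is.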
We then assessed the predictive value of features from admission radiology reports, and found three significant features: (i) extent of disease (OR KB 2.37 (1.97-2.86), OR IGR 1.62 (1.11-2.37), PStouffer = 9.56e-21) and (ii) crazy paving (OR KB 2.50 (1.82-3.44), OR IGR 2.37 (1.10-5.11), PStouffer = 3.10e-09), associated with greater severity, and (iii) peripheral topography, associated with lesser severity (OR KB 0.54 (0.39-0.74), OR IGR 0.55 (0.23-1.31), PStouffer = 8.21e-05).
Segmentation of CT-scans
We next trained the deep neural network AI-segment (Supp Fig 3) to segment radiological patterns and provide automatic quantification21,22 of their volume, expressed as a percentage of the full lung volume. These patterns included the three distinguishable features that appear as disease severity progresses: ground glass opacity (GGO), crazy paving, and finally consolidation. AI-segment was trained on 184 patients from KB hospital (8 fully annotated scans and 176 partially annotated ones) and evaluated on 145 patients from IGR hospital (14 fully annotated scans and 131 partially annotated ones). To evaluate AI-segment, we first compared its performance to that of radiologists’ manual annotation. AI-segment discriminated lung regions from regions outside of the lung with an accuracy of 99.9% when evaluated on the fully annotated scans. Within the lung, the model’s ability to discriminate between lesions and healthy areas had F1 values of 0.85 and 0.98 on partially and fully annotated scans, respectively. In the fully annotated scans, the predicted volumes of each lesion type had relative errors (median [min-max]) of 3.77% [0.054%-14%] for GGO, 0.96% [0.058%-4.4%] for consolidation, and 5.92% [0.41%-13%] for healthy lung (no crazy paving was present in these scans). We next compared AI-segment to the information contained in the radiology reports. The F1 score measuring the ability of AI-segment to detect the presence of a lesion type per patient was 0.88 for GGO, 0.65 for crazy paving, and 0.75 for consolidation (Supp Table 1). The correlation between the proportion of lesions quantified by AI-segment and the radiologist evaluation was 0.56 (Supp Fig 5). AI-segment visual results were also consistent with radiologist observations (see Figure 1 for three representative cases). We lastly evaluated to what extent AI-segment provided biomarkers of future severity.
We found that severity was significantly associated with GGO extent (OR KB 0.64 (0.54-0.76), OR IGR 0.77 (0.54-1.10), PStouffer = 1.94e-07), crazy paving extent (OR KB 1.47 (1.20-1.79), OR IGR 1.31 (0.92-1.87), PStouffer = 6.70e-05), consolidation extent (OR KB 1.46 (1.23-1.73), OR IGR 1.27 (0.89-1.82), PStouffer = 7.61e-06), as well as total disease extent (OR KB 2.11 (1.74-2.55), OR IGR 1.90 (1.30-2.79), PStouffer = 7.66e-16), after accounting for multiple testing. These correlations were observed in the larger KB dataset, but were not found in the IGR dataset (Supp Table 2).
Prognostic models based on CT-scan only
We next evaluated the prognostic value of variables extracted from CT scans through three different models. The first model, called report, combined variables from the radiological report using logistic regression. The second, AI-segment, was based on the lesion volumes computed by the segmentation network, and its variables were again combined with logistic regression. The third, called AI-severity, used a weakly supervised approach with no radiologist-provided annotations (Supp Fig 4)23. All three models were trained on 646 KB patients, and validated on 150 KB patients and on the independent IGR dataset of 135 patients (Figure 2). On the validation set from KB hospital, AI-severity outperformed report (AUCAI-severity = 0.76 (0.66,0.85), AUCAI-segment = 0.67 (0.56,0.77), AUCreport = 0.71 (0.62,0.80)). On the independent IGR validation set, both AI-segment and AI-severity outperformed the model report (AUCAI-severity = 0.75 (0.65,0.84), AUCAI-segment = 0.70 (0.59,0.80), AUCreport = 0.65 (0.54,0.75)). When considering alternative outcomes consisting of either death, or death or admission to ICU, AI-severity and AI-segment were also superior to report in terms of AUC (Supp Table 3).
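The AUC values used to compare the models can be read as the probability that a randomly chosen severe patient receives a higher score than a randomly chosen non-severe patient. The sketch below computes the AUC from that pairwise definition; it is only an illustration (the function name `roc_auc` is ours) and does not reproduce the confidence intervals reported in the paper.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (severe, non-severe) pairs ranked correctly.

    labels: 1 for a severe outcome, 0 otherwise.
    A tie between a severe and a non-severe score counts for one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both outcome classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

The double loop is quadratic in the number of patients, which is acceptable for validation sets of a few hundred patients; a rank-based formula would be preferable for larger cohorts.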
To interpret the weakly supervised AI-severity model, and understand what it detects within the CT scans, we evaluated to what extent the features extracted by AI-severity (internal representation) could predict clinical and radiological variables. To this end, we trained a new logistic regression with AI-severity’s extracted features as input, and some clinical and radiological variables as output. AUC on the KB validation set was 0.93 (C.I. = (0.88,0.97)) for disease extent (threshold >2), 0.78 (C.I. = (0.70,0.85)) for crazy paving, 0.64 (C.I. = (0.53,0.74)) for consolidation and 0.80 (C.I. = (0.65,0.94)) for GGO (Supp Table 4). It was also possible to relate internal representations of the neural networks to clinical variables. We obtained an AUC of 0.88 (C.I. = (0.82,0.94)) for predicting an age strictly greater than 60 years, an AUC of 0.93 (C.I. = (0.89,0.97)) for sex, and of 0.76 (C.I. = (0.68,0.84)) for predicting an oxygen saturation greater than 90%. As a comparison, a logistic regression trained on the variables from the radiology report obtained AUC scores of only 0.70 (C.I. = (0.61,0.78)) for age, 0.57 (C.I. = (0.48,0.67)) for sex and 0.68 (C.I. = (0.58,0.77)) for oxygen saturation. Simply put, this analysis shows that the internal representation of the AI-severity neural network captures clinical features from the lung CTs, such as sex or age, on top of the known COVID-19 radiology features.
Multimodal prognostic models and ScanCov score
Lastly, we evaluated whether CT scans have a prognostic value beyond what can be inferred from clinical and biological characteristics alone. To this end, we compared the performance of trimodal CT scan/clinical/biological models to a bimodal clinical/biological model (C & B). Using a greedy search approach to include optimal variables, we incorporated clinical and biological variables into report and named the resulting trimodal model the ScanCov score. Coefficients and transformations required to compute the 6-variable ScanCov score are available in Supp Table 6. Through the same method, we also made trimodal versions of AI-segment and AI-severity (Supp Fig 6, Supp Table 5). We evaluated the models’ performances on three outcomes: the initial WHO-defined high severity outcome of “oxygen flow rate of 15 L/min or higher, or need for mechanical ventilation, or death”, as well as two other outcomes, “death or ICU admission” and “death”. For each outcome and validation set, both ScanCov and AI-severity performed better than the bimodal biological/clinical C & B model (Figures 2 & 3, Supp Table 3). The gain in performance when compared to the C & B model was larger for the KB hospital (median AUC increase of 4.0% for AI-severity and of 3.6% for ScanCov) than for the IGR hospital (median AUC increase of 1.5% for AI-severity and of 0.4% for ScanCov). For the model AI-segment, the median increase of AUC was 0.5% for the KB hospital and 1.9% for the IGR hospital.
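The text does not detail the greedy search used to add clinical and biological variables. One common form is forward selection, sketched below under that assumption: starting from a base set of variables, the variable that most improves a validation score is added at each step until no candidate improves it. The function name, the `score` callback (for example a cross-validated AUC) and the stopping rule are all illustrative choices, not the authors' procedure.

```python
from typing import Callable, List, Sequence


def greedy_forward_selection(
    candidates: Sequence[str],
    score: Callable[[List[str]], float],
    base: Sequence[str] = (),
    min_gain: float = 0.0,
) -> List[str]:
    """Add one variable at a time, keeping the one with the best score.

    score(variables) must return a number to maximize, e.g. a
    cross-validated AUC of a model fitted on those variables.
    Selection stops when the best candidate gains no more than min_gain.
    """
    selected = list(base)
    best = score(selected)
    remaining = [c for c in candidates if c not in selected]
    while remaining:
        trial_scores = {c: score(selected + [c]) for c in remaining}
        winner = max(remaining, key=lambda c: trial_scores[c])
        if trial_scores[winner] - best <= min_gain:
            break  # no candidate improves the score enough
        selected.append(winner)
        best = trial_scores[winner]
        remaining.remove(winner)
    return selected
```

Greedy selection evaluates only one-variable extensions at each step, so it is fast but may miss combinations of variables that are only useful together.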
The ScanCov and AI-severity models also outperformed other previously published severity or mortality scores (Figure 3, Supp Fig 7, Supp Table 3). The median difference (averaged over outcomes) between the AUC of AI-severity and of other scores ranged between 5% (COVID-GRAM) and 15% (CALL) at KB and between 10% (COVID-GRAM) and 26% (score of ref. 15) at IGR. The median difference (averaged over outcomes) between the AUC of AI-segment and of other scores ranged between 2% (COVID-GRAM) and 12% (CALL) at KB and between 5% (COVID-GRAM) and 24% (score of ref. 15) at IGR. Similarly, the median difference (averaged over outcomes) between the AUC of ScanCov and of other scores ranged between 4% (COVID-GRAM) and 14% (CALL) at KB and between 5% (COVID-GRAM) and 24% (score of ref. 15) at IGR. Among alternative scores, the COVID-GRAM score provided the largest value of median AUC (Figure 3, Supp Table 3). When averaging AUC (median) over outcomes at KB, the AUC increase when comparing COVID-GRAM to other scores ranged between 0.5% (MIT analytics) and 10% (CALL); at IGR, it ranged between 7% (MIT analytics) and 10% (score of ref. 15). The COVID-GRAM score was also the only alternative scoring system we considered that includes CT scan information.
Of all the features within the radiologists’ report, disease extent was the most strongly associated with prognosis (Table 1). We therefore further investigated this feature to confirm that it brings additional prognostic information that is not otherwise captured in any clinical or biological variable. In the KB dataset, the three variables that were the most correlated with disease extent were LDH (r = 0.52, C.I. = (0.45,0.58)), CRP (r = 0.45, C.I. = (0.39,0.51)), and oxygen saturation (r = -0.43, C.I. = (-0.49,-0.37)) (Supp Table 7). We then regressed the severity outcome on disease extent and the three correlated variables and found that significant predictors included oxygen saturation (P = 1.57e-07) and disease extent (P = 0.01), whereas statistical evidence for association was weak for LDH (P = 0.06) and absent for CRP (P = 0.26). Statistical evidence for association between disease extent and severity was also found (P = 9.85e-08) when accounting for the five additional variables of the ScanCov score, which were also significantly related to the outcome (Age P = 1.49e-06, Oxygen saturation P = 2.83e-08, Sex P = 0.035, Platelet P = 0.001, Urea P = 9.77e-05). This confirms that the radiological feature of disease extent brings unique prognostic information.
To further evaluate the ScanCov score, individuals in the top tercile were assigned to a high-risk group. We found that the survival function of the individuals at high risk was significantly different from the survival function of the other individuals (Figure 4, P = 2.90e-07 at KB, P = 5.38e-08 at IGR for a log-rank test). When considering a binary classification consisting of a high-risk group and a medium- or low-risk group, we obtained positive predictive values (or precision) of 54% (KB) and 68% (IGR), negative predictive values of 85% (KB) and 80% (IGR), specificities of 78% (KB) and 91% (IGR), and sensitivities of 65% (KB) and 48% (IGR).
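The four classification measures reported above follow from the confusion matrix between the high-risk flag and the observed severe outcome. The sketch below shows one way to flag the top tercile and compute them; function names are ours, ties in the score are broken arbitrarily, and each denominator is assumed to be non-zero.

```python
from typing import Dict, List, Sequence


def top_tercile_flags(scores: Sequence[float]) -> List[bool]:
    """Flag the third of individuals with the highest scores as high risk."""
    n_high = len(scores) // 3
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    high = set(order[:n_high])
    return [i in high for i in range(len(scores))]


def binary_metrics(high_risk: Sequence[bool], severe: Sequence[bool]) -> Dict[str, float]:
    """PPV, NPV, sensitivity and specificity of the high-risk flag."""
    pairs = list(zip(high_risk, severe))
    tp = sum(1 for h, s in pairs if h and s)          # high risk, severe
    fp = sum(1 for h, s in pairs if h and not s)      # high risk, not severe
    fn = sum(1 for h, s in pairs if not h and s)      # other group, severe
    tn = sum(1 for h, s in pairs if not h and not s)  # other group, not severe
    return {
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }
```

Because only one third of patients are flagged, sensitivity is bounded by how many severe cases fall in that third, which is consistent with sensitivities being lower than specificities in the results above.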
Discussion
Taken together, these results show that unique future disease severity markers are present within routine CT scans performed at admission, and that these scans provide useful and interpretable elements for prognosis.
Looking back on the prognostic clinical and biological variables, we found 11 that were significantly associated with severe evolution, which is consistent with previous studies14,30,37. First, looking at clinical characteristics, we confirmed that male and older persons are more at risk24. Although BMI is a known clinical risk factor for severe COVID24, it was not associated with severity here. Discussion with clinicians however indicated that the data capture may have been biased, with emergency room doctors inputting height and weight more frequently for obese patients. Second, looking at clinical examination variables, we found that respiratory rate, diastolic pressure, and oxygen saturation are associated with severity. These associations may reflect physician decisions taken for ICU triage. Inclusion criteria for critical care triage include (i) requirement for invasive ventilatory support characterized by an oxygen saturation lower than 90%, or by respiratory failure, or (ii) requirement for vasopressors characterized by hypotension and low blood pressure33. Third, looking at comorbidities, we confirmed the results of several meta-analyses28,30–32 that showed that chronic kidney disease and hypertension are linked to severity. However, we did not find significant associations for other comorbidities previously associated with severity, such as diabetes and cardiovascular diseases28,29. While we expected cancer patients to have more severe outcomes because they are generally older, with multiple co-morbidities and often in a treatment-induced immunosuppressive state25–27, we did not find this association. One likely explanation is that neither cohort was balanced enough to conclusively study the association between cancer and severity: IGR admitted mostly cancer patients (80% of the patients) while KB admitted very few cancer patients (7%). Fourth, looking at COVID symptoms, we did not find any significantly associated with severity.
Dyspnea is a prominent symptom that has been repeatedly associated with severity6,34,35. Our results are compatible with a positive association, but our sample size may be too small for it to reach significance. Last, looking at biological measures, we found that the inflammatory biomarkers LDH and CRP are related to severity13,36,37. We also found associations with neutrophils and urea, the latter being explained by the fact that high urea is indicative of kidney dysfunction. Thrombocytopenia (low platelet count) was not significantly associated with severity, possibly because of a lack of statistical power and the stringent correction for multiple testing. However, the association between thrombocytopenia and severity was in the expected direction, and platelet counts are included in the 6-variable ScanCov score and in the trimodal models AI-segment and AI-severity.
Beyond these clinical and biological variables, chest CT-scans provided additional markers of disease severity. Significant features include the total extent of lesions, the presence of crazy paving pattern lesions, and the proportion of consolidation lesions when measured with automatic segmentation. Although disease extent and consolidation are known to be associated with severity15,18,38–43, our study also identified an association between severity and crazy paving, a precursor of consolidation lesions. Initial damage to the alveoli, as well as protein and fibrous exudation, explains the early onset of GGO. As the disease progresses, more and more inflammatory cells infiltrate the alveoli and interstitial space, followed by diffuse alveolar lesions and the formation of a hyaline membrane, which results in a crazy paving appearance, which is then followed by consolidation on the CT examination44,45. Correlation results between the proportion of each lesion type and severity reflect this sequencing, as GGO proportions are negatively related to severity, while crazy paving and consolidation proportions are positively correlated with severity (Supp Table 2).
Compared to a radiologist’s reporting and quantification of lesions, there are several advantages to capturing CT-scan information through deep learning models. Good reproducibility is a key element for imaging biomarkers such as disease extent, and visual inspection of images introduces variability that can hinder its clinical application46. Additionally, the consolidation feature, which has been repeatedly associated with COVID severity34,38,41–43, was not found to be associated with severity with a simple presence/absence radiologist coding, whereas correlation was evident once pulmonary consolidation was quantified with automatic segmentation. Another advantage is that radiologists are faced with the challenge that large numbers of cases must be read, annotated and prioritized in a COVID-19 pandemic. AI analysis of radiological images has the potential to reduce this burden and speed up their reading time. Finally, unimodal prognosis scores obtained with deep learning models trained on CT-scans are more predictive of severity than manually extracted radiological features. We indeed showed that the internal representation of the AI-severity neural network captures clinical information from CT scans, and this can be particularly useful when some clinical or lab measurements are missing.
Our reported prognostic values for CT-scan-based models (AUC range of 0.70 - 0.80) are lower than the 0.85 AUC reported in the previously published Zhang et al study16. We hypothesize that this is due to the use of different outcome definitions, as well as different patient characteristics in the study cohorts (age, severity at admission, etc). Hospital admission criteria vary between countries and hospitals; for instance, the proportion of deaths in our French KB and IGR cohorts was 16-17%, while it was 39% in the Zhang et al study16. When applying other previously published scores to the KB and IGR datasets, we found smaller AUC scores than the values reported in the original papers. This difference can again be explained by differing patient characteristics, and different measures of severity between studies6,7,9,10,15.
Our evaluation of the different trimodal models that included CT scan information in addition to clinical and biological information revealed the added prognostic value of CT scans. Interestingly, CT scan disease extent was correlated with biological and clinical severity biomarkers such as CRP levels, tissue damage (LDH) and oxygenation, highlighting some information redundancy between data modalities34,47–4. Nevertheless, disease extent was still significantly associated with severity even after accounting for these other severity markers, confirming the unique added value of CT-scans. Beyond AI modeling, our study shows that the 6-variable ScanCov score integrating a radiological quantification of lesions with key clinical and biological variables provides accurate severity predictions, and can rapidly become a reference patient scoring approach.
Methods
Description of the retrospective study
Data, including CT-scans, were collected at two French hospitals (Kremlin-Bicêtre Hospital, APHP, Paris, denoted as KB, and Gustave Roussy Hospital, Villejuif, denoted as IGR). CT scans, clinical, and biological data were collected in the first 2 days after hospital admission. This study has received approval from the ethics committees of the two hospitals, and the authors submitted a declaration to the National Commission of Data Processing and Liberties (N° INDS MR5413020420, CNIL) in order to be registered in the medical studies database and to comply with the requirements of the General Regulation on Data Protection (RGPD). An information letter was sent to all patients included in the study. We stopped updating information about patient status on the 5th of May. Among the 1,003 patients of the study, two patients asked to be excluded from the study.
Inclusion criteria were (1) date of admission at hospital (from the 12th of February to the 20th of March at Kremlin-Bicêtre and from the 2nd of March to the 24th of April at Institut Gustave Roussy) and (2) a positive diagnosis of COVID-19. Patients were considered positive either because of a positive RT-PCR (real-time fluorescence polymerase chain reaction) based on nasal or lower respiratory tract specimens or, for negative RT-PCR patients, because of a CT scan with a typical appearance of COVID-19 as defined by the ACR criteria50. Children and pregnant women were excluded from the study.
The clinical and laboratory data were obtained from detailed medical records, cleaned and formatted retrospectively by 10 radiologists with 3 to 20 years of experience (5 radiologists at IGR and 5 at KB). Data include demographic variables (age and sex), variables from the clinical examination (body weight and height, body mass index, heart rate, body temperature, oxygen saturation, blood pressure, respiratory rate), and a list of symptoms including cough, sputum, chest pain, muscle pain, abdominal pain or diarrhoea, and dyspnea. Health and medical history data include presence or absence of comorbidities (systemic hypertension, diabetes mellitus, asthma, heart disease, emphysema, immunodeficiency) and smoker status. Laboratory data include alanine, conjugated bilirubin, total bilirubin, creatine kinase, CRP, ferritin, haemoglobin, LDH, leucocytes, lymphocyte, monocyte, platelet, polynuclear neutrophil, and urea.
CT scan acquisition
CT scan data were available for 980 patients representing a total of 506,341 images (517 slices per patient on average). Summary statistics for the clinical, biological, and CT scan data are provided in Table 1. Three different models of CT scanners were used: two General Electric CT scanners (Discovery CT750 HD and Optima 660, GE Medical Systems, Milwaukee, USA) and a Siemens CT scanner (Somatom Drive; Siemens Medical Solutions, Forchheim). All patients were scanned in a supine position during breath-holding at full inspiration. The acquisition and reconstruction parameters were a 120 kV tube voltage with automatic tube current modulation (100-350 mAs) and a 1 mm slice thickness without interslice gap, using filtered-back-projection (FBP) reconstruction (Somatom Drive) or blended FBP/iterative reconstruction (Discovery or Optima). Axial images with slice thickness of 1 mm were used for coronal and sagittal reconstructions.
Radiology reports
COVID-19 associated CT imaging features were obtained from radiologist reports that follow the guidelines of several scientific societies of radiology (French SFR, STR, ACR, RSNA) regarding the reporting of chest CT findings related to COVID-1950. The template of the radiologist report (https://ebulletin.radiologie.fr/actualites-covid-19/compte-rendu-tdm-thoracique-iv-0) was available on the 17th of March and the reports were completed retrospectively for the patients who were admitted to the hospital before that date. CT imaging characteristics were evaluated to provide the five following variables: (i) ground glass opacity (GGO) (rounded / non rounded / absent), defined as an increase in lung density not sufficient to obscure vessels, with preservation of bronchial and vascular margins, (ii) consolidation (rounded / non rounded / absent), which occurs when parenchymal opacification is dense enough to obscure the vessels’ margins, airway walls and other parenchymal structures, (iii) the crazy-paving pattern (present/absent), defined as ground-glass opacification with associated interlobular septal thickening51, (iv) peripheral topography (present/absent), which corresponds to the spatial distribution of lesions in the one-third external part of the lung, and (v) inferior predominance (present/absent), defined as a predominance of lesions located in the lower segments of the lung. A rounded pattern (for GGO and consolidation) is defined as a lesion presenting a well delineated shape. In addition to the five CT imaging features, radiologists assessed the extent of lung lesions according to the evaluation criteria established by the French Society of Radiology (SFR)52. Disease extent can be: absent / minimal (<10%) / moderate (10-25%) / extensive (25-50%) / severe (>50%) / critical (>75%). The categories absent / minimal / moderate / extensive / severe / critical were coded as a quantitative variable with values of 0 / 1 / 2 / 3 / 4 / 5.
Variables were automatically extracted from the report using optical character recognition.
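The coding of disease extent into the values 0 to 5 can be illustrated as follows. This is a sketch only: radiologists graded extent visually, so the numeric input is hypothetical, and the assignment of boundary values (exactly 10%, 25%, 50% or 75%) is an assumption since the grading criteria above do not specify it.

```python
EXTENT_LABELS = ["absent", "minimal", "moderate", "extensive", "severe", "critical"]


def extent_code(percent_involved: float) -> int:
    """Map a percentage of lung involvement to the 0-5 extent code."""
    if not 0.0 <= percent_involved <= 100.0:
        raise ValueError("percentage must lie between 0 and 100")
    if percent_involved == 0.0:
        return 0  # absent
    if percent_involved < 10.0:
        return 1  # minimal (<10%)
    if percent_involved <= 25.0:
        return 2  # moderate (10-25%); upper bound included by assumption
    if percent_involved <= 50.0:
        return 3  # extensive (25-50%); upper bound included by assumption
    if percent_involved <= 75.0:
        return 4  # severe (>50%)
    return 5      # critical (>75%)
```

For instance, `EXTENT_LABELS[extent_code(30.0)]` gives "extensive". Treating the code as a quantitative variable, as done in the regressions, assumes that the steps between consecutive categories have comparable effects.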
Annotation scenario of CT scans by radiologists
Two radiologists (4 and 9 years of experience) examined and annotated 307 anonymized chest scans independently and without access to the patient’s clinical data or COVID-19 PCR results. All CT images were viewed with lung window parameters (width, 1500 HU; level, -550 HU) using the SPYD software developed by Owkin. Regions of interest were annotated by the radiologists in four distinct classes: healthy pulmonary parenchyma, ground glass opacity, consolidation, and crazy paving. The presence of organomegaly was also recorded as a binary class. When multiple CT images were available for a single patient, the image to analyze was selected using the SPYD software. One AI and imaging PhD student also provided full 3D annotation of the four classes on 22 anonymized chest scans using the 3D Slicer software.
Statistical analysis
When testing for association with the severity outcome, odds ratios and P-values (two-sided tests) were computed separately for each hospital using logistic regression (glm function of the R statistical software). P-values from the two different hospitals were pooled using the Stouffer meta-analysis formula accounting for the two different sample sizes. For association between severity and each variable, we applied a Bonferroni correction accounting for 58 variables, or for 62 variables when also considering imaging markers obtained with AI-segment. To compute confidence intervals for AUC values, we used the DeLong method53. Survival functions were obtained using Kaplan-Meier estimators.
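The pooling of the two hospital-level P-values can be sketched with a weighted Stouffer combination. The exact weights are not given in the text; the sketch below assumes the common choice of weights equal to the square root of each sample size, and assumes that the direction of each effect (sign of the log odds ratio) is used to sign the z-scores. Function names are ours.

```python
from math import erfc, sqrt
from statistics import NormalDist
from typing import Sequence, Tuple

_STD_NORMAL = NormalDist()


def signed_z(p_two_sided: float, effect_sign: int) -> float:
    """Turn a two-sided P-value and an effect direction into a signed z-score."""
    z = -_STD_NORMAL.inv_cdf(p_two_sided / 2.0)  # magnitude, >= 0
    return z if effect_sign >= 0 else -z


def stouffer(
    p_values: Sequence[float],
    effect_signs: Sequence[int],
    sample_sizes: Sequence[int],
) -> Tuple[float, float]:
    """Weighted Stouffer combination; returns (z, two-sided P-value)."""
    weights = [sqrt(n) for n in sample_sizes]  # assumed weighting scheme
    zs = [signed_z(p, s) for p, s in zip(p_values, effect_signs)]
    z = sum(w * zi for w, zi in zip(weights, zs)) / sqrt(sum(w * w for w in weights))
    p = erfc(abs(z) / sqrt(2.0))  # two-sided tail probability of |z|
    return z, p


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold, e.g. 0.05 / 58 in this study."""
    return alpha / n_tests
```

With 58 tested variables the threshold is 0.05/58, about 8.6e-04, so a pooled P-value such as 2.14e-04 (respiratory rate) passes while a nominal P-value of 0.01 would not.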
The AI-segment pipeline for lesion segmentation from CT scans was based on 3 segmentation networks: 3D ResNet5054, 2.5D U-Net, and 2D U-Net55. U-Net consists of convolution, max pooling, ReLU activation, concatenation and up-sampling layers organized in three sections: contraction, bottleneck, and expansion. ResNet contains convolution, max pooling, batch normalization, and ReLU layers that are grouped in multiple bottleneck blocks. All models were trained on CT scans provided by Kremlin-Bicêtre (KB) and evaluated on annotated CT scans from Institut Gustave Roussy (IGR). The dataset was divided into two categories: Fully Annotated Scans (FAS) composed of 22 scans (8 from KB and 14 from IGR) and Partially Annotated Scans (PAS) composed of 307 scans (176 from KB and 131 from IGR). PAS contain a total of 7,374 annotated slices and 24,476,521 annotated pixels, i.e. 24 slices per PAS and 3,319 pixels annotated per slice on average.
2D U-Net was trained for left/right lung segmentation and 3D ResNet50 and 2.5D U-Net were used for lesion segmentation. 3D ResNet50 was trained on 8 KB FAS (i.e. 3,704 slices). Inputs for the 3D ResNet50 have a height and a width of 128, and a depth of 32. We initialized the 3D ResNet50 with pretrained weights56. We then trained the network with Stochastic Gradient Descent for parameter optimization and an initial learning rate of 0.1 with a decay factor of 0.1 every 20 epochs. The network was trained for a total of 100 epochs. For the 2.5D U-Net, we first pretrained the network on a left-right lung segmentation task using the LCTCS dataset57. The network was then trained on the KB dataset using the Adam optimization algorithm with a learning rate, weight decay, gradient clipping and learning rate decay parameters of 1e-3, 1e-8, 1e-1, and 0.1 (applied at epochs 90 and 150) for 300 epochs. While the validation set remained the same as when evaluating the 3D ResNet50 model, 176 KB PAS scans were added to the 8 KB FAS in the training set. PAS were only added to the 2.5D U-Net training set because the annotated volume in these scans is incomplete and would not satisfy the volumetric requirements of the 3D ResNet50 input. Finally, for the left/right lung segmentation, the 2D U-Net was trained on the 8 KB FAS. As for the 2.5D U-Net, the Adam optimization algorithm was used with a learning rate, weight decay, gradient clipping, learning rate decay, and number of epochs of 1e-3, 1e-8, 1e-1, 0.1 (applied at epoch 70), and 104. Both 2.5D U-Net and 2D U-Net used affine transformation and contrast change for data augmentation while 3D ResNet50 used affine transformation, contrast change, thin plate splines, and flipping. 3D ResNet50 and 2.5D U-Net were trained through the minimization of a cross entropy loss and 2D U-Net minimized a binary cross entropy loss. All training was performed on NVIDIA Tesla V100 GPUs using PyTorch as a coding framework.
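The two learning rate schedules described above can be made explicit with a short sketch. It is written in plain Python rather than PyTorch so that it runs without a GPU stack; in PyTorch, the corresponding schedulers would be `torch.optim.lr_scheduler.StepLR` and `torch.optim.lr_scheduler.MultiStepLR`. Epochs are assumed to be counted from 0 and the decay is assumed to take effect at the start of the stated epoch, which the text does not specify.

```python
from typing import Sequence


def step_decay_lr(initial_lr: float, epoch: int, step_size: int, gamma: float = 0.1) -> float:
    """Learning rate multiplied by gamma every step_size epochs (3D ResNet50 setting)."""
    return initial_lr * gamma ** (epoch // step_size)


def milestone_decay_lr(
    initial_lr: float, epoch: int, milestones: Sequence[int], gamma: float = 0.1
) -> float:
    """Learning rate multiplied by gamma at each milestone epoch (U-Net settings)."""
    passed = sum(1 for m in milestones if epoch >= m)
    return initial_lr * gamma ** passed


# Settings taken from the text; epoch indexing is an assumption.
resnet_lrs = [step_decay_lr(0.1, e, step_size=20) for e in range(100)]
unet25d_lrs = [milestone_decay_lr(1e-3, e, milestones=(90, 150)) for e in range(300)]
unet2d_lrs = [milestone_decay_lr(1e-3, e, milestones=(70,)) for e in range(104)]
```

Under these assumptions the 3D ResNet50 learning rate goes from 0.1 down to about 1e-5 over its 100 epochs, while the 2.5D U-Net ends its 300 epochs at about 1e-5 after two decays.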
During the validation phase, ensemble inference58 was performed on all available scans by taking the arithmetic mean of the lesion masks predicted by the 3D ResNet and 2.5D U-Net models.
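As an illustration of this ensembling step, the sketch below averages two voxel-wise lesion probability maps. It assumes that each model outputs one lesion probability per voxel (flattened to a list) and that the averaged map is binarized at 0.5; both the representation and the threshold are assumptions made for the example, and the function names are hypothetical.

```python
def ensemble_mean(prob_maps):
    """Voxel-wise arithmetic mean of several probability maps of equal length."""
    n_models = len(prob_maps)
    n_voxels = len(prob_maps[0])
    if any(len(p) != n_voxels for p in prob_maps):
        raise ValueError("all probability maps must have the same number of voxels")
    return [sum(p[i] for p in prob_maps) / n_models for i in range(n_voxels)]


def binarize(prob_map, threshold=0.5):
    """Turn averaged probabilities into a binary mask (threshold is an assumption)."""
    return [1 if p >= threshold else 0 for p in prob_map]


# Toy probabilities for four voxels from the two lesion-segmentation models.
resnet_probs = [0.9, 0.2, 0.6, 0.1]
unet_probs = [0.7, 0.4, 0.3, 0.1]

mean_probs = ensemble_mean([resnet_probs, unet_probs])
lesion_mask = binarize(mean_probs)
```

In this toy case the third voxel is flagged by one model only (0.6 versus 0.3), and the averaged probability falls below the threshold, so the ensemble does not label it as a lesion.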
We evaluated AI-segment on three distinct aspects. First, we evaluated its ability to perform accurate segmentation. To this aim, we computed F1 scores on the PAS and FAS of the IGR test set when discriminating lesions from healthy areas inside the lung. Micro-averaging was used to limit the effect of class imbalance among the three lesion types. We also reported the accuracy of discriminating background from lung regions on the FAS, in which background regions outside of the lung were annotated. Second, we evaluated its ability to estimate the proportion of each lesion type per scan. To this aim, we computed the median, minimum, and maximum of the absolute difference between the ground-truth percentage of each lesion type, obtained from radiologists’ annotations, and the estimated percentage, on the 14 available FAS of the IGR dataset. Third, we evaluated to what extent AI-segment reproduces the analysis reported by radiologists. To this aim, we first compared the binary decision ‘presence or absence of a lesion type’ made by AI-segment to the radiologist report, considered as ground truth. A lesion type was considered detected by AI-segment when its estimated volumetry, averaged over both lungs, was above a given threshold. Agreement was then evaluated in terms of detection accuracy and F1 score, for two threshold values, using all scans of the IGR dataset (Supp Table 1). Then, we compared disease extent as evaluated by radiologists to the one predicted by AI-segment (Supp Fig 5).
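To make the micro-averaging step concrete, the sketch below pools true positives, false positives, and false negatives over the lesion classes before computing a single F1 score. The integer label encoding (0 for healthy lung, 1 to 3 for the three lesion types) and the function name are assumptions made for the example.

```python
def micro_f1(y_true, y_pred, classes):
    """Micro-averaged F1: pool TP, FP and FN over `classes`, then compute F1 once."""
    tp = fp = fn = 0
    for c in classes:
        for t, p in zip(y_true, y_pred):
            if p == c and t == c:
                tp += 1
            elif p == c and t != c:
                fp += 1
            elif p != c and t == c:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy annotated pixels: 0 = healthy, 1/2/3 = lesion types (hypothetical encoding).
y_true = [0, 1, 1, 2, 3, 3, 0, 2]
y_pred = [0, 1, 2, 2, 3, 0, 1, 2]

score = micro_f1(y_true, y_pred, classes=(1, 2, 3))
```

Because the counts are pooled before the ratio is taken, the score reflects performance over all annotated lesion pixels together rather than an unweighted mean of per-class scores.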
Machine learning models for severity classification based on CT scans (AI-severity)
The AI-severity model was defined as an ensemble of two sub-models, as illustrated in Supp Fig 4. Each sub-model predicted disease severity from CT scans without using any expert annotations at the slice level. Preprocessing of the data consisted of resizing the CT scans to a fixed pixel spacing of (0.7 mm, 0.7 mm, 10 mm) and applying a specific windowing to the HU intensities. Each sub-model was composed of two blocks: a deep neural network, called the feature extractor, and a penalized logistic regression. The feature extractors of the two sub-models were an EfficientNet-B059 pre-trained on the ImageNet public database and a ResNet5060 pre-trained with MoCo v261 on one million CT scan slices from both Deep Lesion62 and LIDC63. Each of these networks provides an embedding of the slices of the input CT scans into a lower-dimensional feature space (1280 dimensions for EfficientNet-B0 and 2048 for ResNet50). For the ResNet50-based sub-model, we reduced the dimension of the feature space using a principal component analysis with 40 components before applying logistic regression. A different windowing was applied to the CT scans before each feature extractor: (-1000 HU, 600 HU) for EfficientNet-B0 and (-1000 HU, 0 HU), (0 HU, 1000 HU), and (-1000 HU, 4000 HU) for ResNet50. Predictions of AI-severity were obtained by averaging the predictions of the sub-models with equal weights. Optimisation of the architecture of the network (preprocessing, feature extraction or model architecture and training, feature engineering, model aggregation) was performed using 5-fold cross-validation on the training set of 646 patients from KB.
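The windowing step can be illustrated with a short sketch. It assumes that windowing means clipping the HU values to the window bounds and rescaling linearly to [0, 1], and that the three ResNet50 windows are used as three input channels; both points are assumptions, since the text only gives the window bounds. The function name is hypothetical.

```python
def apply_window(hu_values, low, high):
    """Clip HU values to [low, high] and rescale them linearly to [0, 1]."""
    if high <= low:
        raise ValueError("`high` must be greater than `low`")
    span = float(high - low)
    return [(min(max(v, low), high) - low) / span for v in hu_values]


# Window bounds quoted in the text, in Hounsfield units.
EFFICIENTNET_WINDOW = (-1000, 600)
RESNET_WINDOWS = ((-1000, 0), (0, 1000), (-1000, 4000))

# Toy HU values for five voxels.
voxels = [-1200, -1000, -200, 600, 900]

# Single-window input for the EfficientNet-B0 sub-model.
efficientnet_input = apply_window(voxels, *EFFICIENTNET_WINDOW)

# One channel per window for the ResNet50 sub-model (assumed layout).
resnet_input = [apply_window(voxels, lo, hi) for lo, hi in RESNET_WINDOWS]
```

Narrow windows such as (-1000 HU, 0 HU) spend the full intensity range on lung parenchyma, whereas the wide (-1000 HU, 4000 HU) window preserves contrast for dense structures.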
CT scans may contain devices such as catheters (EKG monitoring, oxygenation tubing…) that are easily detectable in a CT scan and can bias the prediction of severity (i.e. the model may detect the presence of a technical device associated with severity instead of the radiological features associated with severity). To ensure that medical devices did not affect feature extraction, all voxels outside of the lungs were masked using a pre-trained U-Net lung segmentation algorithm64.
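A minimal sketch of this masking step is given below. It assumes that the lung segmentation is available as a binary mask aligned with the scan and that masked voxels are replaced by -1000 HU (the value of air); the fill value and the function name are assumptions, as the text does not specify them.

```python
def mask_outside_lungs(hu_values, lung_mask, fill_value=-1000):
    """Replace every voxel outside the lung mask by `fill_value` (assumed: air)."""
    if len(hu_values) != len(lung_mask):
        raise ValueError("scan and mask must have the same number of voxels")
    return [v if inside else fill_value for v, inside in zip(hu_values, lung_mask)]


# Toy example: the second and fourth voxels lie outside the lungs,
# for instance on a dense external device.
scan = [-700, 300, -650, 1200]
lungs = [1, 0, 1, 0]

masked_scan = mask_outside_lungs(scan, lungs)
```

After masking, any signal carried by objects outside the lungs is removed before the feature extractor sees the scan.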
Multivariate models to predict severity
The different models that combine multiple features to predict severity were fitted using logistic regression (AI-segment, trimodal AI-segment, report, trimodal report/ScanCov, trimodal AI-severity). Models were trained using 5-fold cross-validation on the training dataset of 646 patients from KB, and folds were stratified by age and severity outcome. Variables that were available for fewer than 300 patients of the training set (conjugated bilirubin and alanine) were not used. For the remaining variables, missing values were imputed by the average over patients of the training set. L2 regularization was applied when fitting logistic regression. The regularization coefficient was determined by maximizing the average AUC over the 5 cross-validation folds, over a range of values from 0.01 to 100. The XGBoost algorithm was also evaluated but did not show superior performance on this dataset. We used pandas and scikit-learn to manipulate data and to train and evaluate machine learning algorithms65.
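The imputation rule and the search grid for the regularization coefficient can be sketched as follows. This is an illustration in plain Python rather than the authors' pandas and scikit-learn code: missing values are encoded as `None`, the grid is assumed to be log-spaced with 9 points (the text only gives the bounds 0.01 and 100), and all function names are hypothetical.

```python
def fit_column_means(train_rows):
    """Per-variable mean over training patients, ignoring missing values (None)."""
    n_cols = len(train_rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in train_rows if row[j] is not None]
        if not observed:
            raise ValueError("variable %d has no observed value" % j)
        means.append(sum(observed) / len(observed))
    return means


def impute(rows, means):
    """Replace missing values by the training-set means."""
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]


def log_grid(low=0.01, high=100.0, n_points=9):
    """Log-spaced candidate regularization coefficients (spacing is an assumption)."""
    ratio = (high / low) ** (1.0 / (n_points - 1))
    return [low * ratio ** k for k in range(n_points)]


# Two toy variables measured on three training patients, with missing entries.
train = [[60.0, None], [70.0, 2.0], [None, 4.0]]
means = fit_column_means(train)

# Means are estimated on the training set only and reused on held-out patients.
test_imputed = impute([[None, None], [80.0, None]], means)

candidates = log_grid()
```

Estimating the means on the training set only, and reusing them on held-out patients, avoids leaking information from the evaluation data. In scikit-learn, the same behaviour is obtained by fitting `SimpleImputer(strategy="mean")` on the training set; note that `LogisticRegression` is parameterized by `C`, the inverse of the regularization strength.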
To select variables in the multivariate models, we used a forward feature selection technique (Supp Fig 6). The first variable included in the model was the one that provided the largest AUC value. Then, we computed AUC values for all two-variable models that included this first variable, and added the variable giving the largest AUC. We continued this procedure until all variables were included. Performance of the models increased quickly when the first variables were included, and AUC values then reached a plateau (Supp Fig 6). We used the elbow method to select the parsimonious set of variables obtained when the AUC plateau is reached.
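The greedy procedure described above can be sketched as follows. The scoring function stands in for the cross-validated AUC of a logistic regression fitted on a given subset of variables; here it is replaced by a toy additive function so that the example is self-contained. The variable names, the gains, and the `min_gain` cut-off used to mimic the elbow criterion are all hypothetical.

```python
def forward_selection(variables, score_fn):
    """Greedy forward selection: at each step, add the variable that maximizes
    `score_fn` on the current selection. Returns [(variable, score), ...]."""
    selected, history = [], []
    remaining = list(variables)
    while remaining:
        best = max(remaining, key=lambda v: score_fn(selected + [v]))
        selected.append(best)
        remaining.remove(best)
        history.append((best, score_fn(selected)))
    return history


def elbow_cut(history, min_gain=0.03):
    """Keep variables until the score gain drops below `min_gain`
    (a simple numeric stand-in for a visual elbow criterion)."""
    kept = [history[0][0]]
    for (_, prev), (var, cur) in zip(history, history[1:]):
        if cur - prev < min_gain:
            break
        kept.append(var)
    return kept


# Toy score: a baseline AUC of 0.5 plus a fixed gain per variable (hypothetical).
GAINS = {"age": 0.15, "LDH": 0.10, "sex": 0.02, "urea": 0.01}


def toy_auc(subset):
    return 0.5 + sum(GAINS[v] for v in subset)


history = forward_selection(["sex", "urea", "LDH", "age"], toy_auc)
order = [var for var, _ in history]
parsimonious = elbow_cut(history)
```

With a real model, each call to the scoring function would refit the logistic regression and average the AUC over the cross-validation folds, so the cost grows quadratically with the number of candidate variables.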
Other scores to predict severity and mortality
We compared our models with several multivariate scores of COVID severity or death. The COVID-GRAM score was the only multimodal score we considered that also includes information from CT scan6. When computing the COVID-GRAM score, we assumed that patients were not unconscious at admission and did not have hemoptysis as a symptom, because these two pieces of information were missing from our dataset. The other scores we considered are based on clinical variables and possibly biological variables. They include the CALL (severity) score with clinical and biological information9 as well as two other scores (the Yan et al. model for mortality prediction and the Colombi et al. model for severity prediction) that include clinical and biological information7,15. In order to compute the Yan et al. score, we considered the same features as the ones used by the authors and reproduced their training of an XGBoost model with a single tree and a maximum depth of 27. We also considered the MIT Covid Analytics calculator as a risk score for mortality (https://www.covidanalytics.io/mortality_calculator) and the CURB65 score developed to predict mortality for community-acquired pneumonia66.
Data Availability
The datasets of patients hospitalized at Kremlin-Bicêtre (KB) and Institut Gustave Roussy (IGR) are stored on a server at Institut Gustave Roussy (IGR). The data are available from the first author upon request, subject to ethical review.
Code Availability
Code to execute all the models presented in this article, including the ScanCov score, AI-segment, and AI-severity, is available online in a public GitHub repository.
Author Contributions
N.L., S.A., E.C., P.H., R.M., N.L., P.T., E.B., M.S., A.S., F.C., S.J., M.S., I.B., J.D., JC.P., H.T., E.P., G.W., T.C., F.B., MF.B., M.B. conceived the idea of this paper.
N.L., S.A., E.C., H.G., P.H., M.D., S.S., O.M., MP.T., JP.L., R.M., N.L., P.T., E.B., G.G., C.B., S.J., F.G., N.T., Y.L., T.D., K.G., A.N., M.T., S.V., M.S., I.B., Y.B., E.P., M.A., J.D., F.B., A.G., J.D., JC.P., H.T., E.P., G.W., T.C., F.B., MF.B., M.B. participated in the acquisition and treatment of data.
N.L., S.A., E.C., P.H., R.M., N.L., P.T., E.B., S.J., M.S., P.J., I.B., J.D., JC.P., H.T., E.P., G.W., T.C., MF.B., M.B. implemented the analysis.
N.L., S.A., E.C., P.H., R.M., N.L., P.T., E.B., S.J., M.S., I.B., J.D., JC.P., H.T., E.P., G.W., T.C., MF.B., M.B. contributed to the writing of the manuscript.
Competing Interests statement
The authors declare the following competing interests:
Employment: Michael Blum, Paul Herent, Rémy Dubois, Nicolas Loiseau, Paul Trichelair, Etienne Bendjebbar, Simon Jégou, Meriem Sefta, Paul Jehanno, Fabien Brulport, Olivier Dehaene, Jean-Baptiste Schiratti, Kathryn Schutte, Elodie Pronier, Jocelyn Dachary, and Adrian Gonzalez are employed by Owkin.
Co-founders of Owkin Inc.: Thomas Clozel, Gilles Wainrib.
Acknowledgements
We would like to thank J.-Y. Berthou, H. Berry, and Ph. Gesnouin from Inria; M. He, R. Patel, G. Rouzaud, B. Schmauch, and J. Du Terrail from Owkin; and F. Lion from Gustave Roussy for their support.