Abstract
Question Addressed High resolution computed tomography (HRCT) of the chest is increasingly used in clinical practice for sarcoidosis. Visual assessment of chest HRCTs in patients with sarcoidosis has high inter- and intra-rater variation. Radiomics offers a reproducible quantitative assessment of HRCT lung parenchyma and could be useful as an additional summary measure of disease. We develop radiomic profiles on HRCT and map them to clinical and patient reported outcomes.
Patients and Methods Three-dimensional radiomic features were calculated on chest HRCT for the left and right lung from sarcoidosis cases enrolled in the Genomic Research in Alpha-1 Antitrypsin Deficiency and Sarcoidosis study (N=321). Sparse K-means was used to group sarcoidosis cases using their radiomic profiles. Linear regression investigated how pulmonary function tests and patient reported outcomes differed between groups. Resampling approaches were used to validate the robustness of the findings.
Results Five groups were identified. The new radiomic-based grouping was associated with Scadding stage (p<0.001), but each radiomic group had patients from all Scadding stages. All pulmonary function testing measures significantly differed between radiomic groups (p<0.001). Radiomic group remained significantly associated with pulmonary function even after adjusting for Scadding stage (p<0.0001). Individual radiomic measures explained 9-18% of the variation in pulmonary function testing. Only two patient reported outcomes (shortness of breath and physical health) differed by radiomic group (p<0.013).
Answer to Question Radiomic quantification of sarcoidosis identifies subgroups associated with pulmonary function and patient reported outcomes. These associations provide additional evidence that radiomics may be useful for quantifying new disease phenotypes.
Introduction
Sarcoidosis is a granulomatous interstitial lung disease which affects ~ 110 thousand individuals in the United States (1). Typical diagnoses is between 30-50 years of life, resulting in a decrease in quality of life and productivity (2). Pulmonary disease occurs in over 90% of those with sarcoidosis (3) with significant morbidity and mortality. Currently, visual assessment of chest radiography (CXR) is used to quantify lung abnormalities standardized via the Scadding staging system (4) which groups CXR findings into five groups from 0 to 4. Substantial variation exists in the radiographic patterns within each Scadding stage. Scadding stage has limited utility in predicting prognosis even in the extremes of the scale (5).
Chest high resolution computed tomography (HRCT) is increasingly used in clinical practice of sarcoidosis to monitor disease as it offers more detailed visualization of parenchymal abnormalities compared to CXR (4, 6–8). As with CXR, visual assessment of chest HRCT is used to evaluate abnormalities although there are limited standardized scoring metrics (4, 9, 10). This is due in part to the diverse and heterogeneous patterns present on chest HRCT in sarcoidosis, often with multiple patterns noted on one CT. These complexities result in high inter- and intra-rater variation (11). More automated systems that quantify sarcoidosis chest HRCT could decrease these sources of variation.
Radiomics is a field of study in which large numbers of quantitative features are extracted from medical images (12). A radiomics panel computes summary measures of the distribution of the Hounsfield units (HU) along with summary measures of the spatial relationships of neighbouring voxels (13). The result is a panel of quantitative measures characterizing the texture of the image.
Radiomic analysis has proved useful for quantifying HRCT in emphysema (7), idiopathic pulmonary fibrosis (14, 15), interstitial lung disease (16, 17), diffuse lung disease (18) and cancer (19). Ryan et al. (20) showed the potential utility of radiomics in sarcoidosis, comparing radiomic and other textural based measures between sarcoidosis patients and controls. It remains unclear whether radiomics also has the potential to differentiate varied phenotypes within sarcoidosis patients. Radiomic profiles within sarcoidosis patients that also correlate with pulmonary function or patient reported outcomes could indicate that radiomics may serve as a useful measure to track change in the lung parenchyma over time.
The goal of this research study is to develop a radiomic profile of sarcoidosis chest HRCT using a large, phenotypically-diverse population of sarcoidosis cases from the Genomic Research in Alpha-1 Antitrypsin Deficiency and Sarcoidosis (GRADS) study (21) using statistical clustering techniques and to investigate the clinical utility of the clusters by quantifying their association with pulmonary function testing (PFT) and patient reported outcomes (PROs). PFTs and PROs were chosen to capture both clinician and patient focused measures of disease (26,,27).
Methods
The study design and participants
The sarcoidosis population was recruited in the multicentre NHLBI-funded GRADS study (N=368) (21). This investigation has GRADS ancillary study approval and all participants provided informed consent (IRB approval HS-2779 and HS-2780;N=365). More details of this cohort can be found in the online supplement.
Pulmonary function testing included pre-bronchodilator (BD) forced expiratory volume at one second (FEV1), forced vital capacity (FVC), the ratio of FEV1 to FVC, and diffusing capacity of the lungs for carbon monoxide (DLCO). The PROs included the gastroesophageal reflux disease questionnaire (GERDQ) (24), the University of California San Diego Shortness of Breath Questionnaire (SOBQ) (25), two measures of fatigue (the Fatigue Assessment Scale [FAS] (26) and Patient-Reported Outcomes Measurement Information System fatigue profile [PROMIS] (27)), the Cognitive Failure Questionnaire (CFQ) (28), and the physical and mental subscales of the SF-12 (33,,34). The final analysis dataset included N=321 patients who each had both an analysable CT and clinical data (online supplement Figure 1).
Radiomic Analysis
Details of the imaging acquisition and lung segmentation can be found in the online supplement. Radiomic features were calculated on the left and right lungs using RIA_lung from the lungct R package. These features include 44 first-order features and 239 GLCM features for each lung, for a total of 566 features. To calculate the GLCM features, the Hounsfield units (HU) from each HRCT scan were first discretized into 16 bins with equal relative frequencies; then, the features were calculated in each of 26 directions, assuming a voxel distance of one; finally, these features from all directions were summarized using the mean statistic. The exact computation for each radiomic measure can be found elsewhere (31).
The ComBat function (https://github.com/Jfortin1/ComBatHarmonization) was used to harmonize the radiomic measures across scanners (32–34), while preserving the biological variability in the following covariates: age, height, BMI, sex, race and clinical phenotype.
As radiomic features are high-dimensional and repetitive (Figure 1), we first used a decorrelation filter (see online supplement), to identify a subset of features for analysis. This reduced the number of variables from 566 to 97. We then used robust k-means (35) to cluster patients (R package, RSKC). For outlier robustness, the proportion of cases trimmed was set at 0.1; the optimal bound on feature weights was found to be 7.5, using the permutation approach observed in sparcl R package (36). We used a standardized Gap statistic to determine the optimal number of clusters using the cluster R package (37). A standardized Gap statistic is the typical gap statistic divided by its standard deviation. The variables with the top five weights from the RSKC clustering algorithm were then selected for further investigation (discriminative features).
Statistical Analysis
Linear regression was used to assess associations between cluster group and Scadding stage and each of the outcomes controlling for age, sex, race, height, and BMI. We fitted a model with only cluster group and only Scadding stage and then put both together in the model to determine if the association with radiomics remained significant in the presence of Scadding stage. Each outcome was modelled separately using a complete case analysis for that outcome. Linear regression was used quantify the associations between the selected discriminative radiomic features and the outcomes controlling for age, sex, race, height, and BMI.
Validation
The statistical analysis is an unsupervised learning problem, which makes traditional training and test validation approaches difficult because the true groupings are unknown. Instead, we investigated the effect of repeatedly applying our analysis pipeline under various conditions and with bootstrapped samples. The details of the validation can be found in the online supplement.
Results
Tables 1 and 2 show the characteristics of the study population, which included N=321 sarcoidosis cases, including 147 (45.8%) males and 233 (72.6%) whites, with an average age of 53 (SD=10) years, average height of 67.0 in (SD=4.2) and average BMI of 30.6 kg/m2 (SD=6.5). Participants were spread across Scadding stages, with 43 (13.5%) in stage 0, 63 (19.8%) in stage 1, 92 (28.9%) in stage 2, 44 (13.8%) in stage 3 and 76 (23.9%) in stage 4. The average FVC was 2.62 L (SD=0.93), FEV1 was 3.57 L (SD=1.14), and DLCO 80.15 (SD=23.37). The population was predominantly of the non-obstructive type (72.7%).
Radiomic Based Clustering of Sarcoidosis
Five groups (clusters) of patients were identified (Table 1). Ordered from highest FVC to lowest, 75 (23.4%) patients were in radiomic group 0, 57 (17.8%) in group 1, 51 (15.9%) in group 2, 58 (18.1%) in group 3 and 80 (24.9%) in group 4. The new radiomic-based grouping was associated with Scadding stage (p<0.001), but not a direct reflection of Scadding stage (Table 1).
PFT differed between radiomic groups (Figure 2; p<0.0001). For FVC, radiomic groups 3 and 4 had between 0.5 and 0.7 L lower average FVC compared to radiomic groups 0, 1, and 2. For FEV1, radiomic groups 2 and 3 and had an average FEV1 0.5 L lower than radiomic groups 0 and 1, which were similar. Radiomic group 4 had the lowest average FEV1 and was approximately 0.8 L lower than groups 0 and 1. DLCO has a similar pattern to FVC. Groups 2 and 4 demonstrated more obstructive disease compared to groups 0, 1, and 3 (57.4% and 38.0% vs. 13.5%-17.2%, respectively; p<0.001).
Some PROs also differed between groups, but not consistently (Figure 3). Average shortness of breath (SOBQ) score differed between radiomic groups (p<0.0001). Radiomic group 4 had higher average SOBQ compared to radiomic groups 0 and 1 with radiomic group 4 being approximately 17.5 units higher than radiomic groups 0 and 1. In addition, average physical health (SF-12) was lowest for radiomic group 4 and linearly increased to radiomic group 0.
The new radiomic groups remained significantly associated with PFT after adjusting for Scadding stage (Figure 2; p<0.0001). Scadding stage also remains significant for all PFT after controlling for radiomic group (p<0.0041; Figure 2). The radiomic group also remained associated with SOBQ score (p=0.032; Figure 3), while Scadding stage was no longer associated with SOBQ (p=0.072; Figure 3). None of the other PROs maintained a significant association with radiomic group after adjustment for Scadding stage (p>0.38).
The five most discriminatory radiomic features included kurtosis, which is a measure of shape of the distribution of the HU from an image, as well as four summary measures from the gray level co-occurrence matrix (GLCM). The GLCM measure spatial correlation and similarity of the HU in image voxels near each other. Image visualization of cases with different values of kurtosis and two GLCM measures are shown in Figure 4 with distribution of kurtosis noted in Figure 5 for maximum, median and minimum kurtosis levels in our population. The images visibly show the increase in observable parenchymal abnormalities (left panel Figure 4) with lower kurtosis.
The discriminatory radiomic measures were jointly associated with FVC, FEV1, FEV1/FVC and DLCO (p<0.001; Table 3). The radiomic measures explained between 9 and 18% more variation in PFT than adjustment for demographics (age, race, sex, height and BMI) only. For comparison, Scadding stage explained between 0 and 11% more variation in PFT. Kurtosis was associated with FVC, FEV1/FVC and DLCO (p<0.0001). A 1-unit increase in kurtosis was associated with a 0.4 L (SE=0.07; p<0.0001) increase in FVC. Kurtosis was also positively associated with increasing DLCO (1.59, SE=0.68, p=0.020). Kurtosis was negatively associated with FEV1/FVC. A 1-unit increase in kurtosis was associated with a 7% (SE=0.01, p<0.0001) decrease in FEV1/FVC ratio.
Validation of Clustering
Pairwise ARI values ranged from 0.3 to 1 and peaking around 0.5 (online supplement Figure 2 left). The maximum ARI of fit clusters with Scadding stage was 0.07. In the linear models, the maximum p-values for the significant of cluster in the demographic adjusted and demographic plus Scadding models were <0.0001 for both models. The corresponding proportions of p-values less than 0.01 were 100% for both models.
Bootstrap samples contained between 184 and 219 unique observations and contained 203 on average. Pairs of bootstrap samples contained between 100 and 159 unique overlapping observations and contained 129 on average or about 63% of the sample is used in each ARI calculation. Pairwise ARI values computed from unique overlapping observations ranged from 0.2 to 1 and had distribution peaking around 0.5 (online supplement Figure 2 right). The maximum ARI of fit clusters with Scadding stage was 0.17. For significance of cluster label in linear models fit to bootstrap samples, maximum p-values for the significant of cluster in the demographic adjusted and demographic plus Scadding models were <0.0001 for both models. The corresponding proportions of p-values less than 0.01 were 100%.
Taken together the validation analysis suggest fairly mild sensitivity of clustering to the analysis pipeline. There is more sensitivity to bootstrapped samples; however cluster groups remain a significant predictor of FVC across all sources of random perturbations in the pipeline.
Discussion
Radiographic manifestations in sarcoidosis are protean. As a result, image characterization is traditionally done visually and not standardized. The utility of visual assessment in sarcoidosis is limited by the intra-observer and inter-observer variability. Radiomics is a more reproducible and computationally efficient approach to characterize HRCT. We used radiomics to quantitatively characterize images from a large, phenotypically diverse cohort of sarcoidosis subjects and demonstrated that radiomics are associated with clinical and patient reported outcomes of disease. Using a common unsupervised learning approach (35), we identified five clusters of cases based on radiomic characterization of their chest CT. These five radiomic-based groups differed significantly by PFTs and several PROs. Notably, each radiomic group included a range of Scadding stages and radiomic group remained significantly associated with PFT after adjusting for Scadding stage. These data suggest that radiomics represents radiographic abnormalities that differ from Scadding stage.
Demographic characteristics of the individual explained a sizable amount of the variation observed in PFT (R2=50-60%). Radiomics also explained a significant amount of additional variation in PFT (8% to 18%). This additional amount of variation explained is consistent with other quantitative imaging approaches such as CALIPER (16) and those used for investigations of other lung conditions such as systemic sclerosis (38) and diffuse interstitial lung diseases (39).
Except in lung cancer research, much of the work on quantitative CT image analysis has focused only on the HU density (17, 38). Our radiomic panel included summarization of both the density and the spatial characterization. We note that more measures of the GLCM appear as discriminative in the cluster analysis than first-order densitometry measures. This implies that the texture (the spatial information) is perhaps more useful for differentiating various parenchymal abnormalities seen in sarcoidosis.
Although it remains speculative to the reasons for the association between decreased kurtosis and decreased FVC and DLCO, the following explanation is plausible. As kurtosis decreases, more severe parenchymal abnormalities are present that impact the function of the lung, as is shown in Figure 4. Severe parenchymal abnormalities in sarcoidosis are observed as higher HU values. In Figure 5 we observe the presence of a higher frequency of moderately high HU values dampening the peak in the HU distribution and leading to a lower kurtosis in this case. We also found a negative association between lung function and Gaussian weighted GLCM. Figure 4 highlights more abnormalities present with a higher Gaussian weighted GLCM value, as might be expected with this association.
The associations with PROs were less consistent, although important patterns emerged. Radiomic group 4 demonstrated decreased quality of life based on physical function in the SF-12 and worse shortness of breath on the SOBQ, findings that are consistent with the worse lung function in this group. Radiomic group explained substantially less variation in PROs (1 to 15%) compared to PFTs. While the tools we used to measure PROs are validated and used widely, they are not generally correlated with objective measures of lung function (40). As an example, fatigue is a known multi-factorial PRO in sarcoidosis and may impact a number of other PRO (22, 40). For example, fatigue can be associated with shortness of breath and the SOBQ but also with other organ involvement; these two measurements can be confounded by physical function and cardiac or even neurological involvement. Our finding that radiomic group is associated with decreased physical function and increased shortness of breath provides encouragement that more detailed characterization of lung abnormalities, like we developed here, could contribute to a better understanding of the current disconnect between objective measures of lung function and patient experiences.
This study had several strengths. To our knowledge, this was the largest group of sarcoidosis patients with research grade HRCT available allowing for a full 3-D based quantitative analysis. In addition to the high-quality and robust clinical data, the GRADS study also provides a comprehensive and consistently collected set of patient reported outcomes, which reflect important aspects of treatment decision-making processes (23). The results were also internally validated. The validation results taken together suggest our approach is not over fitting to these data.
This study is not without limitations. The GRADS study relied on enrolment from academic centres and could be skewed to a population that was referred for worse disease, was near one of the centres, which were primarily localized to the Eastern US. The demographic is predominantly white and of higher SES as a result and thus did not provide full representation of the spectrum of disease. In addition, many subjects had disease for a decade or longer and were treated. While the same protocol was used across study sites for obtaining the CT images we studied, the scanners themselves differed. We performed harmonization of the scans to mitigate the potential for systematic large differences in scans due to scanner type. In addition, the distribution of PFT measures is not dependent on scanner type such that we expect any differences in the radiomic measures due to scanner type are non-differential as they relate to PFTs. Finally, the number of radiomic clusters (groups) we identified may not directly translate to other populations of sarcoidosis patients or even be the optimal solution for this group of patients. The optimization algorithm we used is common, however, there is no single way to choose the optimal number of clusters.
Radiomic and other quantitative imaging approaches have a major strength in that they can be computed in a time efficient and reliable, reproducible automated procedure. The radiomics package in R took only 3 minutes to compute all the 3-D radiomics measures. This means large quantities of scans can have radiomic profiles computed for aiding in decisions on what scans the radiologist might prioritize for further consideration for visual or clinician evaluation.
In summary, this work provides evidence that in sarcoidosis radiomic quantification is useful for classifying abnormality of the lung along with pulmonary function and to a lesser degree patient reported outcomes. Future work should evaluate the potential of radiomics to capture small changes over time not easily detected by radiologic assessment to further evaluate the potential of radiomics to serve as a clinically useful quantitative measure of sarcoidosis presentation and progression.
Data Availability
Data is available via the process laid out by the GRADS consortium.
Footnotes
Sources of Support: This work was supported by the National Institutes of Health (R01 HL114587; R01 HL142049; U01 HL112695). Data from the GRADS study was used, which was funded by the NIH grant U01 HL112707 entitled “Sarcoidosis and A1AT Genomics and Informatics Centre”, as well as others (U01 HL112707, U01 HL112694, U01 HL112695, U01 HL112696, U01 HL112702, U01 HL112708, U01 HL112711, U01 HL112712). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Heart, Lung, and Blood Institute or the National Institutes of Health.
Supplementary Material Statement: This article has an online supplement, which is accessible from this issue’s table of contents online.