ABSTRACT
Almost every lung cancer patient has multiple pulmonary nodules while the significance of nodule multiplicity in locally advanced non-small cell lung cancer (NSCLC) remained unclear. This study explores the relationship between deep learning detected total nodule number (TNN) and survival outcomes in patients with surgical resected stage I-III NSCLC. Patients who underwent surgical resection for stage I-III NSCLC with accessible preoperative chest CT scan from 2005 to 2018 were identified from our database. Deep learning-based AI algorithms using convolutional neural networks (CNN) was applied for pulmonary nodule (PN) detection and classification. Of the 2126 patients, a total number of 33410 PNs were detected by AI. Median TNN detected per person was 12 (IQR 7-20). AI-detected TNN (analyzed as continuous variable) was independent prognostic factor for both RFS (HR 1.012, 95% CI 1.002-1.022, p = 0.021) and OS (HR 1.013, 95% CI 1.002-1.025, p = 0.021) in multivariate analyses of stage III cohort; while it was not significantly associated with survival in stage I and II cohorts. In terms of nodule categories, the numbers of upper-lobe nodule, same-side nodule, other-side nodule, solid nodule, and even solid nodule at small size (≤ 6mm) were independent prognostic factors; while the numbers of middle/lower-lobe nodule, same-lobe nodule, subsolid nodule, calcific nodule and perifissural nodule were not associated with survival. In survival tree analysis, rather than using traditional IIIA and IIIB classification, the model grouped cases by AI-detected TNN (lower vs. higher: log-rank p < 0.001), which showed superior discrimination of survival in stage III cohort. In conclusion, AI-detected TNN was significantly associated with survival in patients with surgical resected stage III NSCLC. Lower TNN detected on preoperative CT scan indicated better prognosis in patients who underwent complete surgical resection.
INTRODUCTION
Lung cancer is a leading cause of cancer-related death worldwide [1]. Since early detection of cancer is an important opportunity for decreasing mortality, multiple randomized trials and guidelines recommend lung cancer screening using low-dose CT (LDCT) for high-risk individuals [2-7]. With the adoption of LDCT for lung cancer screening, the number of chest CT scans increase dramatically each year [8]. To address the repetitive and burdensome work of dealing with mostly normal images, computer aided detection/diagnosis (CAD), which has a computer to perform a given task consistently and tirelessly, becomes extremely appealing [9].
CAD supported by machine-learning techniques has been utilized to detect PN since 2002 [10]. Although standardized CAD systems have been proven to improve diagnostic accuracy, few of them have been implemented in actual clinical practice due to its high dependence on imaging processing and false positive rates [11, 12]. In recent years, deep-learning based AI algorithms using convolutional neural networks (CNN) have attracted considerable attention in the area of machine-learning. The key advantage that CNNs have over conventional CAD techniques is their ability to self-learn previously unknown features, maximizing classification with limited direct supervision [13]. Thus, CNNs bring about a significant false-positive reduction in PN detection, recognition, segmentation, and classification [14-19].
The key issue in the management of incidental PNs detected on CT images is to differentiate benign and malignant nodules. Radiological features, such as larger nodule size, upper lobe location, marginal spiculation, and faster growth rate are generally considered as risk factors for malignancy [20-25]. These principles mainly focus on the assessment of the largest or most suspicious nodule. However, half of the patients detected with PNs have multiple nodules [26]. Nodule multiplicity, which is a potential indicator for malignancy, is commonly overlooked. Only limited data concerning the relationship between TNN and lung cancer probability is available. In the PanCan and BCCA trials, lower TNN was associated with an increased risk of lung cancer [20]. While another study analyzing patients from the NELSON trial showed that the risk of lung cancer increased as TNN rising from 1 to 4, but decreased in patients with 5 or more nodules [26].
The results of above-mentioned screening trials indicated that TNN was either negatively or not significantly associated with lung cancer probability, which might reflect a low incidence of multiple malignancies in screening population [27]. However, for patients with high pretest probability of malignancy, whether TNN plays a role in determining lung cancers probability with multiple pulmonary sites of involvement, distinguishing multiple primary lung cancers (MPLC) from intrapulmonary metastasis (IPM), and thus, predicting prognosis remains unknown. The purpose of this study was to calculate TNN detected on preoperative CT images using CNN-based AI algorithm, and explore in-depth the relationship between AI-detected TNN and survival outcomes in patients with resectable stage I-III NSCLC.
MATERIALS AND METHODS
Patients
We retrospectively reviewed the medical records of patients pathologically diagnosed with stage I-III (according to the 8th edition of American Joint Committee on Cancer [AJCC] prognostic group) NSCLC who underwent surgical resection at the department of thoracic surgery of Peking University People’s Hospital from October 2005 to December 2018. Only patients received preoperative chest CT scan within 90 days prior to the surgery in our institution were included. Patients were excluded: 1) if they received neoadjuvant therapy; 2) if the surgical margin was positive; 3) if perioperative death occurred within 30 days; 4) if the follow-up information was inadequate.
AI-Powered PN Detection
InferRead™ CT Lung, a wildly used deep learning-based AI algorithm developed by InferVision, was applied for PN detection in this study. Only the last chest CT scan before surgery was chosen. Firstly, PNs was detected by AI algorithm and TNN was calculated accordingly. Then, PNs were classified according to their lobar distributions (left lower lobe, left upper lobe, right lower lobe, right middle lobe, and right upper lobe), locations (same lobe as the primary tumor [same-lobe], ipsilateral lobe different from the primary tumor [same-side], and contralateral lobe [other-side]), and types (solid nodule, mixed ground-glass nodule [m-GGN], pure ground-glass nodule [p-GGN], calcific nodule, and perifissural nodule). Moreover, solid and subsolid (m-GGN and p-GGN) nodules were further categorized based on their size.
Statistical Analysis
Continuous variables were presented as median with interquartile range (IQR) and were analyzed by using Wilcoxon rank-sum test and one-way analysis of variance (ANOVA). Categorical variables were presented as frequencies and percentages. Survival curves were compared using Kaplan-Meier method with log-rank test. Univariate Cox proportional hazards models were built first to determine which factors were significantly associated with survival. Then, multivariate Cox models were constructed incorporating factors with p ≤ 0.10 identified in the univariate analyses.
In stage III cohort, maximally selected log-rank statistics were used to determine the optimal cutoff value of nodule number for predicting OS. Patients were then dichotomized into lower- and higher-nodule number groups according to the estimated cut-point. Furthermore, least absolute shrinkage and selection operator (LASSO) Cox regression model with cross-validation was utilized to select the most useful prognostic features among all categories of AI-detected nodule numbers.
Finally, survival tree analysis was conducted to generate a tree-based model for survival data using log-rank test statistics for recursive partitioning. This tree-based model grouped cases according to the best split of OS using AI-detected TNN and the 8th edition of AJCC prognostic group. A candidate grouping scheme was then developed based on this tree analysis.
All the statistical analyses were executed using R version 4.0.0 for Windows (R Foundation for Statistical Computing, Vienna, Austria). All the statistical tests were two-sided and p values of0.05 or less were considered statistically significant.
Ethics
The study involving human participants were reviewed and approved by the Institutional Review Board of Peking University People’s Hospital (2020PHB385-01). Since only de-identified data were used in this study, informed consents for participants were waived by the committee.
RESULTS
Characteristics of Patients and Nodules
A total of 2126 patients who underwent surgical resection for stage I-III NSCLC with accessible preoperative chest CT scan were included in this study. The median follow-up time was 33 months (IQR 21-48). The demographic and clinicopathologic characteristics of the patients are summarized in Table 1.
The framework of deep-learning powered PN detection algorithm and an example of 3D reconstruction of AI-detected nodules are shown in Figure 1A and B. Of the 2126 patients, a total number of 33410 PNs were detected. The features of these AI-detected nodules are given in Table 2. The distributions of AI-detected TNN, solid nodule number, and subsolid nodule number per person were all positively skewed, and the medians of these three above-mentioned nodules were 12 (IQR 7-20), 6 (IQR 3-10), and 3 (IQR 1-6) respectively (Figure 2A to C).
When considering the discrepancy of nodule number among different stages, we found that there was no statistically significant difference between means of TNN (one-way ANOVA p = 0.655). However, means of solid nodules were significantly higher in patients with stage II and III disease, while means of subsolid ones were higher in those with stage I disease (both p < 0.001, Figure 2D to F). Moreover, patients with late-stage disease tended to have more solid nodules with greater size (Supplementary Figure 1).
Survival Analyses
We analyzed the survival of the patients by stage according to the 8th edition of AJCC prognostic group (Figure 3A and B). The differences of both RFS and OS between any two stages were statistically significant (pairwise comparison p < 0.001). Cox proportional hazards models were then built to determine the prognostic factors of the entire cohort (Supplementary Table). In univariate analysis, lower AI-detected TNN (as continuous variable) was associated with improved RFS (HR 1.008, 95% CI 1.001-1.014, p = 0.017), while it was not significantly associated with improved OS (HR 1.006, 95% CI 0.999-1.012, p = 0.099). In multivariate analysis, TNN was neither an independent prognostic factor for RFS (HR 1.006, 95% CI 0.999-1.012, p = 0.080) nor for OS (HR 1.002, 95% CI 0.995-1.009, p = 0.590) after adjusting for age, sex, smoking history, surgical approach, surgical procedure, histologic type, adjuvant therapy status, and pathologic T and N stage.
Subgroup analysis stratified by stage was then performed to assess whether AI-detected TNN was an independent prognostic factor for patients in each stage. We found that TNN was not significantly associated with survival for patients with stage I (RFS: HR 1.010, 95% CI 0.998-1.022, p = 0.102; OS: HR 1.003, 95% CI 0.989-1.017, p = 0.689) and stage II disease (RFS: HR 1.000, 95% CI 0.988-1.013, p = 0.973; OS: HR 1.000, 95% CI 0.989-1.012, p = 0.965). However, in stage III cohort, fewer TNN was independently associated with improved survival in multivariate analyses (RFS: HR 1.012, 95% CI 1.002-1.022, p = 0.021; OS: HR 1.013, 95% CI 1.002-1.025, p = 0.021) (Table 3 and Table 4).
Exploratory Analyses in Stage III Cohort
To further evaluate the prognostic effect of AI-detected TNN, we used maximally selected log-rank statistics to dichotomize patients into lower- and higher-TNN groups. The optimal cutoff value of 8 was selected (Supplementary Figure 2). When compared with patients with higher TNN (> 8), those with lower TNN (≤ 8) had significantly improved OS (log-rank p < 0.001, Figure 4A). Lower TNN was also an independent favorable predictor for OS in multivariate analysis (HR 2.348, 95% CI 1.351-4.082, p = 0.002).
To assess which of the components were associated with survival, we classified AI-detected nodules into different categories. When analyzed as continuous variables, the numbers of upper-lobe nodule (HR 1.028, 95% CI 1.008-1.049, p = 0.006), same-side nodule (HR 1.032, 95% CI 1.001-1.064, p = 0.046), other-side nodule (HR 1.020, 95% CI 1.001-1.039, p = 0.040), solid nodule (HR 1.020, 95% CI 1.004-1.036, p = 0.012), and even solid nodule at small size (≤ 6mm) (HR 1.027, 95% CI 1.007-1.047, p = 0.008) were independently associated with OS in multivariate analyses; however, none of the numbers of middle/lower-lobe nodule (HR 1.016, 95% CI 0.994-1.039, p = 0.153), same-lobe nodule (HR 1.021, 95% CI 0.986-1.056, p = 0.246), m-GGN (HR 1.104, 95% CI 0.885-1.376, p = 0.381), p-GGN (HR 1.015, 95% CI 0.976-1.056, p = 0.462), calcific nodule (HR 1.021, 95% CI 0.975-1.068, p = 0.384), and perifissural nodule (HR 1.007, 95% CI 0.792-1.279, p = 0.957) were significantly associated with survival.
The five above-mentioned nodule numbers, which were independent prognostic factors for OS as continuous variables, were set as binary variables according to their optimal cutoff values. Similarly, compared with patients with higher nodule numbers, those with lower nodule numbers had significantly improved OS (Figure 4B to F). In addition, the numbers of upper-lobe nodule (HR 2.532, 95% CI 1.567-4.091, p < 0.001), other-side nodule (HR 1.957, 95% CI 1.322-2.898, p < 0.001), solid nodule (HR 1.851, 95% CI 1.241-2.761, p = 0.003), and solid nodule at small size (≤ 6mm) (HR 1.862, 95% CI 1.248-2.779, p = 0.002) were still independent prognostic factors for OS in multivariate analyses.
Finally, to evaluate which of the components contributed most to prognosis, LASSO Cox regression model incorporating both clinicopathologic features and all categories of AI-detected nodule numbers (as continuous variables) was built (Supplementary Figure 3). Seven features with nonzero coefficient were as follows: age (0.021), smoking history (0.106), surgical approach (0.669), adjuvant therapy status (−0.389), IIIA/IIIB classification (0.095), upper-lobe nodule number (0.014), and small (≤ 6mm) solid nodule number (0.008). Thus, the number of upper-lobe nodule, followed by solid nodule at small size, were individual features that contributed most to the model and corelated best with OS among all categories of AI-detected nodule numbers.
Survival Tree Analyses
A tree-based model incorporating AI-detected TNN and the 8th edition of AJCC prognostic group was constructed based on the best split of OS in the entire cohort (Figure 5A). We found that discrimination of survival curves of sub-stages was unsatisfactory with the current staging system in our study, especially in sub-stages of IA2 to IB (IA2 vs. IA3: log-rank p = 0.177; IA3 vs. IB: log-rank p = 0.778) and IIA to IIB (log-rank p = 0.236). Moreover, in stage III cohort, rather than using the traditional IIIA and IIIB classification, the model grouped OS by AI-detected TNN (lower vs. higher: log-rank p < 0.001), since it showed superior discrimination of survival. Accordingly, Kaplan-Meier curves of OS by the tree-based grouping scheme were shown in Figure 5B.
Treatment Failure Analyses
To evaluate the potential relationship between AI-detected TNN and the tumor recurrence pattern, we further divided stage III cohort into two groups according to the first disease progression site. Among all 263 patients, 60 had local recurrent disease, 40 had distant metastatic disease, and 19 had progressive disease without specified pattern. Compared with distant metastasis group (median: 17 [IQR 10.75-23.25]), patients with local recurrent disease had lower AI-detected TNN (median: 14 [IQR 7.75-18.25]). However, the difference between these two groups was not statistically significant (Wilcoxon rank-sum p = 0.077, Supplementary Figure 4).
DISCUSSION
The widespread application of artificial intelligence algorithm in PN detection is reshaping our knowledge on this topic. The number of patients with tens or even hundreds of PNs is rapidly growing, while the interpretation of these lesions and its impact on surgical decision making are complicated and yet underrepresented. As the number of nodules grows, accurate diagnosis for every single nodule becomes labor-intensive and statistically challenging. As an alternative, we hypothesized that TNN measured by deep-learning algorithm may serve as a surrogate indicator of the probability of malignancy and metastasis in locally advanced NSCLC. Such hypothesis is proved preliminarily by our result that TNN is an independent prognostic factor in stage III lung cancer.
The accurate measurement of TNN is highly challenging. First, the definition of PN varies across radiologists and surgeons due to their different purposes: some may only report guideline mandated PNs in order not to provoke panic of patients, while others may report as many as one can detect for more accurate surgery planning. Unfortunately, both standards are rather subjective and less repeatable. Second, the accuracy and robustness of a single radiologist or surgeon is limited. The sensitivity of PN detection by a single radiologist is around 77% and can be increased to 90% with a concurrent radiologist’s help [28]. However, such method is time-consuming and still subjected to human error.
The emerging of deep learning-based AI algorithm ensured the objectiveness and robustness of PN detection and thus the measurement of TNN. Mature algorithms have reached a diagnosis sensitivity of 85-100% [29-31]. The best-performing deep learning algorithm in LUNA16 challenge, which based on the LIDC-IDRI dataset, exhibited an excellent sensitivity of over 95% at fewer than 1.0 false positives per scan [32]. The algorithm (InferRead™ CT Lung, InferVision) in this study is trained using over 350 thousand chest CTs labeled by radiologists [33]. In real-world, the performance of this model has reached AUC at 0.89 in PN detection and can significantly improve the performance assisting radiologists [33-35]. Our result showed median TNN at 12 per patient, much higher than the median of 2 per patient reported in the malignant cohort of NELSON study [26]. Such difference may, on one hand, because of the difference in CT radiation dosage, while on the other hand, reflect the difference in diagnosis preference and consistency between AI and human radiologists.
From the clinical aspect, our results suggested that TNN may be a visualized representation of tumor burden in stage III NSCLC patients. Different from the result of NELSON study that shows higher nodule count is in favor of benign diagnosis [26], our study focused on more advanced NSCLC patients instead of high-risk screening population. Past evidences vaguely showed that, with confirmed histology, extensive nodal or systemic metastasis are relative arguments for multiple PNs to be IPM [36], suggesting that high TNN may relate to higher pretest probability of IPM. Our result further supported this speculation by revealing the survival advantage of lower TNN group comparing to higher TNN. Such advantage existed when analyzing TNN as either a continuous variable or a binary variable, which strengthened the argument.
It is worth notice that the major impact on survival was caused by the number of solid nodules but not GGNs. For GGN components, the IASLC guideline suggested that the prognosis of multifocal GGNs is similar to single MIA or AIS [37], while others indicated that there are metastatic GGNs on molecular level [38]. In our study, concurrent multiple GGNs in all three stages did not increase HR, indicating that concurrent multiple GGNs in invasive lung cancer possessed the same biological behavior as in multifocal GGN cases. For solid components, most of the nodules were ≤ 6mm and radiologically benign, with round shape, no speculation, no lobulation. But the growth of a few unresected nodules suggested their malignancy (Supplementary Figure 5). It showed that the diagnosis using traditional radiological characteristics for multiple PNs in stage III NSCLC patients is less reliable. The treatment failure pattern analysis showed that higher TNN was related to distant metastasis (without statistical significance due to small sample size) indicating that TNN was not only an indicator of IPM, but also a visual representation of systematic tumor burden.
From a surgeon’s perspective, the impact of PNs on surgical planning is dramatic. Convincing a patient to accept unresected GGNs after surgery is difficult even if with the guideline’s support. A sublobar resection of a GGN may turn into lobectomy due to multiple GGNs detected by AI, while lobectomy may also be altered into sublobar resection due to bilateral nodules clinically diagnosed as separate primary lung cancer. However, no evidence showed the validity of such approach. Our study provided the first proof of concept that TNN, driven by deep learning algorithm, should be considered as a mandatory test before surgery planning. It would be reasonable for surgeons to be more aggressive in resection of solid nodules instead of GGNs. Moreover, neoadjuvant therapy should be considered for stage III patients with higher TNN for better PN evaluation since empirical diagnosis may not be reliable.
Some may argue that PET-CT is a valid method in differentiating MPLC and IPM before surgery, but the partial-volume effect of PET-CT prevented it from a proper diagnostic performance for solid nodules less than 8mm, which represented 85.6% of the solid nodules in our study [23, 39, 40]. Moreover, PET-CT is relatively expensive for most under-development countries and is not affordable by every patient.
As a retrospective study, our results need validation before clinical application, however, no public databases provide sufficient data, thus a prospective validation is needed and may be time-consuming. The AI algorithm needs optimization to further reduce false positive rate, and the performance in the detection of peri-vascular nodule still needs improvement. Currently, there is a technological barrier in the alignment of pre- and post-operative PNs on chest CT. Given time, we may be able to analyze the growth speed of PN and other information for better prognostic modeling.
To our knowledge, this study is the first to identify that TNN measured by deep-learning algorithm is an independent prognostic factor in stage III lung cancer. Our results suggested a potentially critical clinical application of AI as a mandatory examination for surgery decision. The current cut-off point of TNN is still preliminary but shows great potential and appeals to future validation.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
CONFLICT OF INTEREST
Author DW, JS, and WT were employed by the company Beijing Infervision Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
AUTHOR CONTRIBUTIONS
XC took full responsibility for the content of the manuscript. QQ, and ZS had full access to all the data in the study and took responsibility for the integrity of the data and the accuracy of the data analysis. DW, JS, WT, XL, TL, NH, and FY contributed substantially to the study design, data analysis and interpretation.
FUNDING
The authors received no specific funding for this work.
ACKNOWLEDGEMENTS
We would like to thank Yuqing Huang, Xianjun Min, and Guotian Pei from Beijing Haidian Hospital for sharing their thoughts on this work.
Abbreviations list
- AI
- artificial intelligence
- CAD
- computer aided detection/diagnosis
- CNN
- convolutional neural network
- GGN
- ground-glass nodule
- HR
- hazard ratio
- IPM
- intrapulmonary metastasis
- LDCT
- low-dose computed tomography
- MPLC
- multiple primary lung cancer
- NSCLC
- non-small cell lung cancer
- OS
- overall survival
- PET
- positron emission tomography
- PN
- pulmonary nodule
- RFS
- recurrence-free survival
- TNN
- total nodule number