Abstract
Background Random forest model is a recently developed machine-learning algorithm, and superior to other machine learning and regression models for its classification function and better accuracy. But it is rarely used for predicting causes of death in lung cancer patients. On the other hand, specific causes of death in lung cancer patients are poorly classified or predicted, largely due to its categorical nature (versus binary death/survival).
Methods We therefore tuned and employed a random forest algorithm (Stata, version 15) to classify and predict specific causes of death in lung cancer patients, using the surveillance, epidemiology and end results-18 and several clinicopathological factors. The lung cancer diagnosed during 2004 were included for the completeness in their follow-up and death causes. The patients were randomly divided into training and validation sets (1:1 match). We also compared the accuracies of the final random forest and multinomial regression models.
Results We identified and randomly selected 40,000 lung cancers for the analyses, including 20,000 cases for either set. The causes of death were, in descending ranking order, were lung cancer (72.45 %), other causes or alive (14.38%), non-lung cancer (6.87%), cardiovascular disease (5.35%), and infection (0.95%). We found more 250 iterations and the 10 variables produced the best prediction, whose best accuracy was 69.8% (error-rate 30.2%). The final random forest model with 300 iterations and 10 variables reached an accuracy higher than that of multinomial regression model (69.8% vs 64.6%). The top-10 most important factors in the random-forest model were sex, chemotherapy status, age (65+ vs <65 years), radiotherapy status, nodal status, T category, histology type and laterality, which were also independently associated with 5-category causes of death.
Conclusion We optimized a random forest model of machine learning to predict the specific cause of death in lung cancer patients using a set of clinicopathologic factors. The model also appears more accurate than multinomial regression model.
Introduction
Lung cancer is one of the leading causes of death in the U.S. and the world [1]. Basic research has revealed new insights into development and progression of lung cancer [2]. Clinical works also identified the factors associated with the survival outcomes of lung cancer, particularly non-small cell lung cancer, including histology and oncogenic drivers [3]. However, the survival outcomes of the lung cancers are mostly binary, which were either alive versus death or alive versus progression/death. Few studies on lung cancer patients were focused on the specific deaths due to non-lung cancer[4], cardiovascular diseases[5, 6], infection or other causes[4, 7]. Even fewer works investigated the multicategory causes of death (COD).
Random forest (RF) model is a recently developed machine-learning algorithm, and superior to other machine learning and regression models for its classification function and better accuracy [8]. But it is rarely used for predicting causes of death in lung cancer patients, while we have shown a better prediction accuracy in prostate cancers [8]. On the other hand, specific causes of death in lung cancer patients are poorly classified or predicted, largely due to its categorical nature (versus binary death/survival). Therefore, this study was aimed to understand the factors associated with multicategory COD in lung cancer patients using RF and multinomial regression models.
Material and methods
We extracted clinical, pathologic and socioeconomic data of the patients in the Surveillance, Epidemiology, and End Results-18 (SEER-18) Program (www.seer.cancer.gov) SEER*Stat Database with Treatment Data, who were diagnosed of lung carcinoma in 2004. The follow up was last conducted in Nov. 2016 (the last in the 2019 data release). The inclusion criteria included the survival time > 1 month, aged 20+ years, first primary-cases only and having a known COD. The average follow-up time was 12.5 years.
The informed consent was not obtained for the SEER patients due to de-identified nature of the dataset. Owing to the use of publicly available, de-identified SEER cases, this study was exempt from an institutional review board approval. However, we have received the approval for using the SEER-18 data under the condition of compliance with their preset terms (user ID lzhang). Moreover, all 50 states in the USA have laws requiring newly diagnosed cancers to be reported to a central registry. The state cancer registries in the SEER program would deposit their extracted, de-identified cancer data to the SEER database after meeting quality control standards (www.seer.cancer.gov). Thus, the SEER data collection was authorized by the US state laws, and supervised by respective state public-health officials and ethical review committees.
We simplified the SEER COD, which were extracted from death certificates, into 5 categories, namely cardiovascular system disease (CVS), infection, non-breast cancer, breast cancer and others cause (including alive). The alive patients were a small proportion of the population and were thus included in the other causes. We included 40 dichotomized variables in the analyses since dichotomized variables (i.e., one-hot encoding) produced slightly better accuracy as shown before[8] and would not require normalization.
We tuned and employed a random forest algorithm (Stata, version 15) to classify and predict specific causes of death in lung cancer patients. The lung cancer diagnosed during 2004 were included for the completeness in their follow-up and death causes. The patients were randomly divided into training and validation sets (1:1 match). We also compared the accuracies of the final random forest and multinomial regression models using chi-squared test. A p value <0.05 was considered statistically different.
Results
Among the 44,735 case diagnosed in 2004, we identified and randomly selected 40,000 qualified lung cancers for the analyses, including 20,000 cases for either train or test set (Table 1). The causes of death were, in descending ranking order, were lung cancer (72.45%), other causes or alive (14.38%), non-lung cancer (6.87%), cardiovascular disease (5.35%), and infection (0.95%). The mean follow-up time was 114.8 months (standard deviation 49.1).
We found more 250 iterations and the 10 variables produced the best prediction, whose best accuracy was 69.8% (error-rate 30.2%, Figure 1).
The final random forest model with 300 iterations and 10 variables reached an accuracy higher than that of multinomial regression model (69.8% vs 64.6%, P<0.001, Table 2). The top-10 most important factors in the random-forest model were sex, chemotherapy status, age (65+ vs <65 years), radiotherapy status, nodal status, T category, histology type and laterality (Figure 2).
Interestingly, the pathological confirmation of the diagnosis appeared not very importance in the model. In the multinomial regression model, we also found that sex, chemotherapy status, age (65+ vs <65 years), radiotherapy status, nodal status, T category, histology type and laterality (Table 3), while other factors were not, despite being associated with the multicategory COD in univariable regression analysis (Chi-squared test, Table 1). This finding was consistent with that from RF model.
Discussion
In this population-based study of 40,000 lung cancer patients with more than 12 years of follow-up, we optimized a RF model of machine learning to predict the specific cause of death in lung cancer patients. The RF model also appears more accurate than multinomial regression model (69.8% vs 64.6%, P<0.001). We also identified the factors linked to the 5-category COD in lung cancer patients, which was not reported by previous works [4, 7].
This study is limited by using the cases of a single diagnosis year, while we feel the large number of cases justified for our findings. Future studies may include the cases of recent diagnosis years so that the findings will be more related to the recent cases. Moreover, the dataset appeared unbalanced in the outcome. Such a nature of the dataset is the real-world evidence, but posed a challenge to modelling. This in part explained why the multinomial regression model had many cases missing an assigned COD. Finally, the findings were not validated by an external dataset. Future works should be carried out using other dataset to validate our findings.
In summary, we here identified the factors that were independently associated with 5-category long-term COD in lung cancer patients in the USA. We also tuned the RF model and showed that it was significantly more accurate in predicting 5-category long-term COD.
Data Availability
The SEER data were available upon request to the SEER website (www.seer.cancer.gov). All other data are available upon request.
Data availability
The SEER data were available upon request to the SEER website (www.seer.cancer.gov). All other data are available upon request.