Abstract
Background Stroke-associated pneumonia (SAP) is a frequent complication of stroke, characterized by its high incidence rate, and it can have a severe impact on the prognosis of patients. The limitations of current clinical treatment measures underscore the critical need to identify high-risk factors promptly to decrease the incidence of SAP.
Objective To analyze the risk factors of SAP in stroke patients, construct a predictive model of SAP based on the SHAP interpretable machine learning method, and explain the important variables.
Methods A total of 763 stroke patients admitted to the Second Affiliated Hospital of Anhui University of Traditional Chinese Medicine from July 1, 2023, to May 31, 2024, were selected and randomly divided into the model training set (n=457) and model validation set (n=306) according to the ratio of 6:4. Firstly, the included data were sorted out, and then Lasso regression was used to screen the included characteristic variables. Based on the tidymodels framework, Using decision tree (DT), logistic regression, extreme gradient boosting (XGBoost), support vector machine (SVM), The classification model was constructed by five machine learning methods, including SVM and LightGBM. The grid search and 5-fold cross validation were used to optimize the hyperparameter optimization strategy and the performance index of the model. The predictive performance of the model was evaluated by the area under the receiver operating curve (AUC), calibration curve, and decision curve analysis (DCA), and we used Shapley additive explanation (SHAP) to account for the model predictions and provide interpretable insights.
Results The incidence of SAP in this study was 31.72% (242/763). Six variables were selected by Lasso regression, including nasogastric tube use, age, ADL score, Alb, Hs-CRP, and Hb. The model with the best performance in the validation set was the XGBoost model, with an AUC of 0.926, an accuracy of 0.914, and an F1 score of 0.889. Its calibration curve and DCA showed good performance. SHAP algorithm showed that ADL score ranked first in importance.
Conclusion The model constructed using XGBoost has good prediction performance and clinical applicability, which is expected to support clinical decision-making and improve the prognosis of patients.
1. Introduction
Stroke is a severe cardiovascular disease that significantly impacts the quality of life and survival rate of patients. According to the Global Burden of Disease Study, stroke caused approximately 6.55 million deaths in 2019, ranking as the second leading cause of death worldwide, second only to cardiovascular diseases[1]. In the aftermath of a stroke, patients frequently contend with numerous complications, with stroke-associated pneumonia (SAP) being among the most prevalent[2]. Surveys have indicated that the incidence of SAP ranges from 7% to 38%[3–6], SAP not only prolongs hospitalization and increases economic burdens but can also severely affect patient mortality[4–5]. Currently, the primary treatment for SAP in clinical practice is anti-infective therapy [7]; however, studies have shown that prophylactic antibiotics do not effectively reduce the risk of SAP occurrence[8]. Therefore, it is crucial for clinicians to promptly identify high-risk individuals for SAP and implement appropriate preventive measures, which are essential for improving the prognosis of stroke patients.
The establishment of risk prediction models can assist clinicians in identifying high-risk populations for diseases, allowing for early intervention measures to reduce disease incidence[9]. In recent years, scholars both domestically and internationally have developed multiple models to predict the risk of SAP occurrence, presented in the form of scoring systems and scorecards, such as the Kwon Score[10], A2DS2 Score[11], and AIS-APS Score[12]. These models provide relatively reliable assessment tools for the prevention and treatment of SAP. However, even well-validated models may experience performance degradation over time due to changes in disease risk factors, treatment measures, and treatment contexts. Therefore, models need continuous dynamic updates [13]. In addition, few studies are using interpretable machine learning to construct SAP risk prediction models. Based on this, this study considered combining new predictors with known predictors to construct a model using the machine learning method and using the SHAP algorithm to explain the model, to improve the accuracy and interpretability of SAP risk prediction.
2. Materials and Methods
2.1 Study Design and Subjects
This study is a single-center, retrospective cohort study. We selected stroke patients who were treated at The Second Affiliated Hospital of Anhui University of Chinese Medicine from July 1, 2023, to May 31, 2024, as the study subjects. Inclusion criteria were: (1) patients diagnosed with stroke; (2) age ≥18 years; (3) no mechanical ventilation within 7 days post-stroke. Exclusion criteria were: (1) discharge, transfer, or death within 24 hours of admission; (2) pre-existing pulmonary infection prior to admission; (3) abandonment of treatment or voluntary discharge. (4) the missing rate of collected data exceeded 30%.This study was approved by the Ethics Committee of the Second Affiliated Hospital of Anhui University of Traditional Chinese Medicine (2023-SXXM43). In addition, informed consent was omitted for all participants as the study was retrospective in nature, and the research process was in accordance with the Declaration of Helsinki.
2.2 Identification of Candidate Predictive Factors
Candidate predictive factors for this study were identified through literature review and expert consultation, totaling 27 factors, including: (1) General demographic data: gender, age; (2) Disease-related factors: activity of daily living (ADL) scale[14] score at admission, type of stroke, location of stroke, dysphagia, impaired consciousness, hypertension, diabetes mellitus; (3) History of comorbidities/Personal history: history of stroke, history of underlying pulmonary disease, smoking, drinking; (4) Disease treatment factors: nasogastric tube, acid suppressants, urinary catheters; (5) Laboratory test indicators: albumin (Alb), triglyceride (TG), hypersensitive C-reactive protein (Hs-CRP), white blood cell count (WBC), neutrophil-to-lymphocyte ratio (NLR), monocyte-to-lymphocyte ratio (MLR), hemoglobin (Hb).
2.3 Definition and Diagnosis of SAP
According to the consensus published by the SAP Consensus Group composed of multidisciplinary experts in the UK[2], the outcome indicator SAP is defined as pneumonia that newly occurs within 7 days in non-mechanically ventilated stroke patients. The diagnosis of SAP follows the modified standards of the Centers for Disease Control and Prevention (CDC) [15].
2.4 Sample Size Estimation
This study utilized the sample size estimation method specifically designed for developing risk prediction models, as proposed by Riley et al. in 2020[16]. This method takes into account the effects of multiple categories, interactions, and non-linear relationships, minimizing the risk of model overfitting while precisely estimating key parameters to determine the appropriate sample size required to construct the predictive model. The specific calculations were performed using the “pmsampsize” package in R software. Based on the literature, the average C-statistic of existing prediction models is approximately 0.827[17], and the incidence of SAP is around 7%-38%[3–6]. This study anticipated the inclusion of 27 predictive factors, with the calculated required sample size ranging from 701 to 1272 cases, and an expected number of outcome events ranging from 179 to 253. The detailed calculation process and results are shown in the supplementary material.
2.5 Data Collection and Preprocessing
Data sources were determined by reviewing electronic medical records, including admission records, discharge summaries, nursing records, and laboratory results, with laboratory parameters primarily collected from the first admission data. During data collection, we ensured blinding between the predictive factors and outcome indicators to avoid information bias. To address missing data issues, the random forest method was used for data imputation to maintain data integrity and the accuracy of the predictive model (imputation of missing values was performed using the “mice” package in R software, with five imputations). To prevent information loss, all continuous variables were inputted using their original values; for categorical variables, categories with low proportions were considered for merging. After data processing, the dataset was divided into training and validation sets in a 6:4 ratio. The training set data were used to fit the predictive model, while the validation set data were used to evaluate the model.
2.6 Model construction and evaluation
Taking the occurrence of SAP in stroke patients as the dependent variable and alternative predictors as the independent variable, least absolute shrinkage, and selection operator (Lasso) regression was used to screen the variables, and then a model was constructed based on the tidymodels framework. We selected the following five machine learning algorithms to build the model: decision tree (DT), logistic regression (LR), extreme gradient boosting (XGBoost), support vector machine (SVM), and light gradient boosting machine (LightGBM). A 5-fold cross-test was performed on the training set as a resampling method, and the hyperparameters were optimized by grid search. We used the area under the receiver operating characteristic (ROC) curve (AUC) to evaluate the discrimination of the model and drew calibration curves to evaluate the fit of the model. Decision curve analysis (DCA) was used to evaluate the clinical practicability of the model. Finally, the accuracy, sensitivity, specificity, and other indicators of the model were calculated to evaluate the performance of the model.
2.7 Interpretability of the model
Shapley additive explanation (SHAP) is a popular method of model explanation, which is based on the Shapley value in game theory and is used to explain the predictions of any model. The core idea of SHAP is to decompose the contribution of each feature to the prediction, so that the model prediction can be expressed as a weighted sum of feature contributions, thereby improving the transparency and trust of the model and supporting more reasonable decision making. By applying the SHAP algorithm to the optimal model, we can obtain the importance ranking of features and have an intuitive understanding of the contribution of these features to the prediction model.
2.8 Statistical Methods
Statistical analysis was performed using R 4.3.0 software.Normally distributed continuous data were expressed as mean ± standard deviation (Mean ± SD), and between-group comparisons were performed using the independent sample t-test. Non-normally distributed continuous data were expressed as median and interquartile range [M (Q1, Q3)], and between-group comparisons were conducted using the Mann-Whitney U test. Categorical data were expressed as counts and percentages [n (%)], and comparisons of unordered categorical data between groups were performed using the Pearson x2 test or Fisher’s exact test. All statistical tests were two-sided, with P<0.05 considered statistically significant.
3. Results
3.1 Missing Data
There were 7 variables with missing values in this study, including Alb, TG, LDL, HDL, Stroke location, Hs-CRP, and D-Dimer. The missing values were all imputed by the random forest method (Fig 1).
3.2 Patient Characteristics
Based on the occurrence of SAP, the 763 patients were divided into the non-SAP group (n=521) and the SAP group (n=242). Among the 763 included patients, there were 504 males (66.06%) and 259 females (33.94%), with an average age of 67.35±13.00 years. The incidence of SAP was 31.72%. There were statistically significant differences in age, ADL score, type of stroke, dysphagia, disturbance of consciousness, diabetes, history of lung disease, nasogastric tube therapy, urinary catheter, Alb, Hs-CRP, WBC, NLR, MLR, Hb, D-Dimer and HDL between the two groups (P < 0.05). The detailed results are presented in Table 1.
3.3 Correlation Analysis of Variables and Variable Selection
To explore the relationship between variables, Spearman’s rank correlation coefficient was used to assess the linear relationship between variables (Fig 2).We performed the Lasso regression method to screen the 27 included variables. The optimal λ value corresponding to one standard error of the minimum mean square error was determined through 10-fold cross-validation. The results indicated that when log(λ) reached one standard error of the minimum mean square error, the number of model variables was reduced to 6, with the optimal λ value determined to be 0.039 through cross-validation. The final variables included Nasogastric tube therapy, Age, ADL, Alb, Hs-CRP, Hb. The procedure for selecting variables by Lasso regression is shown in Fig 3A and Fig 3B.
3.4 Model construction and evaluation
Five machine learning models were constructed using the six identified variables. In the training set, 5-fold cross-validation was employed as a method for data resampling to ensure the model’s performance on unseen data. Additionally, grid search was utilized to optimize the key hyperparameters, which included the number of features, the number of trees, the minimum number of samples, the maximum depth of trees, and the learning rate. The validation set was utilized to assess the model’s performance. The ROC curves for the five machine learning models in the validation set revealed that the XGBoost model achieved the highest AUC value of 0.926 (Fig. 4). The calibration curves for each model indicated that the average predicted probability aligned with the actual occurrence probability for all models except for the decision tree (DT) and support vector machine (SVM) models (Fig. 5). Decision curve analysis (DCA) demonstrated that all models maintained good net benefits within a certain threshold probability, indicating high clinical utility value (Fig. 6). The XGBoost model also demonstrated superior predictive performance and practical utility in both the calibration curve and DCA. A comparison of the performance indicators for the five machine learning models in the validation set showed that the XGBoost model continued to excel, with an AUC of 0.926, an accuracy of 0.914, and an F1 score of 0.889. Additional model indicators are presented in Table 2.
3.5 Interpreting the XGBoost model using SHAP
The importance ranking plot of the variables showed that ADL had the highest mean SHAP value and the strongest predictive performance (Fig 7). To analyze the positive and negative correlation between the variable and the target outcome, we drew a beeswarm in which the color depth reflected the value of the variable. Taking the first behavior as an example, the low ADL level had a negative effect on the outcome prediction. High ADL level was a positive predictor of outcome (Fig 8). Finally, the SHAP value is used to show the influence of different variables in a sample on the prediction result of SAP occurrence risk. The value in the Fig represents the contribution degree of each feature to the output result of the model. A positive number indicates that when the feature increases, the predictive value of the model will also increase. The opposite is true for negative numbers (Fig 9).
4. Discussion
In this study, five machine learning methods were used to construct a risk prediction model for SAP in stroke patients based on six predictors, including Nasogastric tube therapy, Age, ADL, Alb, Hs-CRP, Hb. Among them, the XGBoost model validation set showed good discrimination and calibration. The prediction performance of the models was better than those of previous studies [10–12]. Additionally, the model was visualized using a nomogram, making the risk scoring more intuitive and quantifiable.
SHAP algorithm results showed that the lower the ADL score, the higher the risk of SAP. It was found that patients with SAP tended to have longer hospital stays and lower ADL scores compared with non-SAP patients[18]. A low ADL score typically indicates a limited ability to perform self-care, potentially leading to prolonged bed rest, which indirectly increases the risk of pulmonary infections[19]. Furthermore, low ADL scores may correlate with poor nutritional status, compromising the immune system and making patients more prone to infections. Early rehabilitation exercises can enhance the self-care abilities of stroke patients, reducing the risk of lung infections, and clinical practitioners should promptly intervene in patients with limited mobility.
In this study, indwelling nasogastric tube was a risk factor for SAP in stroke patients. This result is consistent with reports in the existing study[20].Brogan found that nasogastric tubes were a stronger predictor of SAP than dysphagia[21]. Research indicates that during nasogastric tube feeding, the gastric volume significantly expands, potentially causing gastric muscle spasms and food accumulation. In the mean time, patients with nasogastric tubes face increased risks of elevated intracranial pressure, vomiting, and food reflux, collectively heightening the probability of SAP[22]. However, a study on acute stroke patients found that placing a nasogastric tube within 48 hours of onset did not increase the incidence of SAP, mortality, or adverse functional outcomes[23], contradicting our findings. Thus, the relationship between the duration of nasogastric tube indwelling and SAP remains to be studied.
This finding suggests that older stroke patients are more susceptible to SAP, consistent with previous studies[11,12]. As individuals age, their physiological functions and immune system capabilities deteriorate, weakening respiratory defenses. Additionally, older stroke patients often exhibit diminished swallowing and coughing reflexes, making them more vulnerable to pulmonary infections post-stroke[24].
Among the laboratory examination indicators, high Hs-CRP level was a factor affecting the occurrence of SAP, which was consistent with the results of previous studies[25]. Hs-CRP represents an acute phase reactive protein produced by the liver, falling under the category of C-reactive protein (CRP). Hs-CRP demonstrates superior sensitivity and early discriminative capabilities, making it detectable with high efficiency during the initial phases of inflammation or at minimal concentrations[26]. Research indicates that alterations in serum Hs-CRP levels among patients following cerebral infarction are strongly correlated with the occurrence of secondary pulmonary infections. As Hs-CRP levels rise, the inflammatory response becomes more pronounced[27]. Another effective predictor of SAP is Hb, and the decrease in Hb level usually means the patient is at risk of anemia, which can significantly increase the mortality and the risk of pneumonia[28]. Anemia is associated with compromised immune system function, thereby compromising the patient’s immune response and rendering stroke patients more vulnerable to bacterial or viral pathogens[29]. Several prior studies have established that stroke patients with reduced serum Alb levels are at an elevated risk of contracting infections or pneumonia while hospitalized[30–32], one possible reason is that Alb has anti-inflammatory, anti-oxidative, anticoagulant effects as well as regulation of microvascular permeability; therefore, low Alb level is a marker of systemic inflammatory response[32]. Furthermore, reduced levels of Alb can precipitate malnutrition in patients, which in turn diminishes the patient’s overall health and impairs their resistance to infection, consequently elevating the risk of pneumonia[33].
Machine learning algorithms demonstrate substantial superiority over traditional regression techniques in handling high-dimensional data, intricate relationships, and feature selection, while also enhancing prediction accuracy. In this study, XGBoost exhibited robust predictive capabilities, currently standing as one of the most widely used machine learning algorithms. It offers high accuracy, incorporates regularization, and boasts both high prediction performance and interpretability, hence holding great promise for application in risk prediction and various other domains[34–36]. In order to achieve the best accuracy on large modern datasets, it is often necessary to rely on complex models. However, complex models are often difficult to balance the contradiction between model accuracy and interpretability. Based on this, the SHAP framework provides a powerful tool for model interpretation that can help us balance model accuracy and interpretability and make better use of complex models in practical applications[37].
5. Limitations
Despite the robust performance of our model in various aspects, there are limitations. First, this study is a single-center, retrospective study, with data collection primarily relying on nursing and physician records, which may result in limited data and potential information bias. Secondly, our model has not yet undergone external validation in other populations, so its generalizability remains to be further examined. Additionally, the model depends on accurate data input, and data integrity in practical applications could affect the predictive performance. Future research should aim to optimize the model further, incorporating more potential predictive factors and exploring advanced machine learning algorithms.
6. Conclusion
The models developed in this study demonstrate strong predictive capabilities and clinical utility in assessing the risk of stroke-associated pneumonia in stroke patients, with the XGBoost model achieving the highest performance. Furthermore, the SHAP framework has been instrumental in elucidating the models, identifying the critical factors influencing the onset of stroke-associated pneumonia, and has furnished crucial insights for clinical decision-making. Our findings provide a new tool for SAP prediction, which may support clinical decision-making and improve patient outcomes.
Data Availability
All relevant data are within the manuscript and its Supporting Information files.
Conflict of Interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Author Contributions
Chunbiao Li: Writing – original draft, Writing – review & editing, Conceptualization, Software;
Formal Analysis, Visualization
Ting Wang: Writing – review & editing, Conceptualization, Methodology, Funding acquisition;
Juan Yuan: Writing – review & editing, Conceptualization, Methodology; Supervision;
Linli Yuan: Writing – review & editing, Funding acquisition, Data curation
Min You: Writing – review & editing, Supervision, Funding acquisition, Data curation.
Funding Information
This study were supported by Anhui Provincial University Research Project (2023AH050784), Anhui Provincial Research Preparation Programme Project (2022AH050463) and Anhui Provincial University Scientific Research Project (2023AH050857)
Supporting information
S1 Fig. Graphical representation of cross validation of the optimal hyperparameters of the XGBoost model
S2 Fig. Confusion matrix of the XGBoost model in the training and validation sets
S3 Fig. Importance ranking of variables in the XGBoost model
S4 Fig. Performance of five machine learning models in the validation set
S5 Fig. Comparison of five machine learning models with five-fold cross-validation (AUC as evaluation index)
S6 Fig. Dependence plot of ADL
S7 Fig. Dependence plot of Hs-CRP
S8 Fig. Dependence plot of Age
S9 Fig. Dependence plot of Nasogastric tube therapy
S10 Fig. Dependence plot of Alb
S11 Fig. Dependence plot of Hb
Acknowledgments
The authors are very grateful to the authority of the Second Affiliated Hospital of Anhui University of Traditional Chinese Medicine for allowing us to use these data for the construction of the model. The authors would like to thank Dajin Li and Yi Liu for their valuable help.