Predictive Models for Secondary Epilepsy in Patients with Acute Ischemic Stroke Within One Year =============================================================================================== * Liu Jinxin * He Haoyue * Wang Yanglingxi * Du Jun * Liang Kaixin * Xue Jun * Liang Yidan * Chen Peng * Tian Shanshan * Deng Yongbing ## Abstract **Objective** Post-stroke epilepsy (PSE) is a major complication that worsens both prognosis and quality of life in patients with ischemic stroke. This study aims to develop an interpretable machine learning model to predict PSE using medical records from four hospitals in Chongqing. **Methods** We collected and analyzed medical records, imaging reports, and laboratory test results from 21,459 patients diagnosed with ischemic stroke. Traditional univariable and multivariable statistical analyses were performed to identify key predictive factors. The dataset was divided into a 70% training set and a 30% testing set. To address class imbalance, the Synthetic Minority Oversampling Technique combined with Edited Nearest Neighbors was used. Nine widely applied machine learning algorithms were evaluated and compared using relevant prediction metrics. SHAP (SHapley Additive exPlanations) was used to interpret the model, assessing the contributions of different features. **Results** Regression analyses showed that complications such as hydrocephalus, cerebral hernia, and deep vein thrombosis, as well as brain regions (frontal, parietal, and temporal lobes), significantly contributed to PSE. Factors like age, gender, NIH Stroke Scale (NIHSS) scores, and laboratory results such as WBC count and D-dimer levels were associated with a higher risk of PSE. Among the machine learning models, tree-based methods such as Random Forest, XGBoost, and LightGBM demonstrated strong predictive performance, achieving an AUC of 0.99. **Conclusion** Our model successfully predicts PSE risk, with tree-based models showing superior performance. The NIHSS score, WBC count, and D-dimer were identified as the most important predictors. ## Introduction Stroke is the second leading cause of death globally, with an annual mortality of approximately 5.5 million, and it is also the leading cause of disability, accounting for 50% of cases worldwide [1]. Ischemic stroke comprises about 80% of all stroke cases [2][3]. Post-stroke epilepsy (PSE) is a common complication, with studies reporting that 3-30% of stroke patients develop epilepsy, which adversely affects their prognosis and quality of life [4]. PSE can worsen cognitive, psychiatric, and physical impairments already caused by cerebrovascular disease and related conditions [5]. The highest incidence of PSE occurs within the first year after an acute stroke, accounting for nearly half of the cases [2]. Thus, early prediction and intervention for PSE, especially in ischemic strokes, are critical. Currently, most studies rely on clinical data to build statistical models using survival analysis, Cox regression [2][6], and multiple linear regression [7] to create basic models for PSE prediction. Last year, Lin et al. developed a radiomics-based model that outperformed conventional clinical models in predicting PSE related to intracerebral hemorrhage (ICH). They suggested that a combined radiomics-clinical model could improve the assessment of individual PSE risk after the first occurrence of ICH, facilitating early diagnosis and treatment [8]. However, subsequent research raised concerns about the use of radiomics, indicating a need for further investigation [9]. Overall, research on PSE prediction remains limited, with most studies focusing on specific risk factors [10][11][8][12] and building simple models, without proposing more comprehensive and scientifically robust prediction models. Machine learning has gained attention as a powerful tool for building medical models due to its ability to process large datasets and complex information. It has been increasingly applied in neuroscience and clinical prediction [13][14][15]. Previous studies have used machine learning to explore post-stroke cognitive impairments [16], predict stroke and myocardial infarction risks in large artery vasculitis patients [14], develop post-stroke depression models based on liver function tests [17], and predict hematoma expansion in traumatic brain injury (TBI) [18]. Machine learning models can automatically manage both linear and complex nonlinear relationships between variables and offer insights into how different factors contribute to the prediction target—something that is difficult for traditional statistical models. However, machine learning requires substantial amounts of data and is prone to overfitting with small sample sizes. The quality and volume of input data are critical for the algorithm to detect underlying patterns and make accurate predictions. This study aims to identify key risk factors from various features extracted from the clinical records and test data of ischemic stroke patients. Using these features, we will develop a machine learning-based prediction model for PSE. By leveraging early admission data, we seek to automatically predict the likelihood of PSE occurrence and provide guidance for clinical decision-making and patient care. ## Result ### Filling of missing data Missing values were filled using a Random Forest (RF) model, handling one feature at a time. The imputed features were: Plt, WBC, RBC, HbA1c, CRP, TG, LDL, HDL, AST, ALT, bilirubin, albumin, urea, creatinine, BUA, PT, APTT, TT, INR, D-dimer, fibrinogen, CK, CK-MB, LDH, HBDH, IMA, lactate, anion gap, TCO2, and NIHSS. ### Characteristics of study participants A total of 21,459 patients were included in the study. The training set consisted of 15,021 patients, with a PSE incidence of 4.3%. The test set contained 6,438 patients, also with a 4.3% incidence of PSE. The external validation cohort included 536 patients from three hospitals. The statistical details of the clinical characteristics are presented in Table 1. View this table: [Table 1.](http://medrxiv.org/content/early/2024/10/24/2024.06.27.24309564/T1) Table 1. Single factor significant analysis results This table presents the results of Chi-Square and Mann-Whitney U tests used to evaluate the association of various features with positive and negative samples. Sample Sizes: Positive samples (N=954) and negative samples (N=20,789). Statistical Methods: The Chi-Square test assesses the relationship between categorical variables, while the Mann-Whitney U test compares differences between independent groups for continuous data. P-values: Indicate the significance of the associations, with lower values suggesting stronger evidence against the null hypothesis. Statistical Values: Include counts and percentages of samples for each feature in both groups, along with the calculated statistics for each test. Statistical analysis indicated that patients with a higher likelihood of developing PSE had complications such as uremia, a history of DVT, atrial fibrillation, hyperuricemia, cerebral hernia, and hydrocephalus. The affected brain regions included the frontal, parietal, occipital, and temporal lobes, as well as the cortex, subcortex, basal ganglia, and hypothalamus. General characteristics included age, gender, and NIHSS score. Laboratory indicators associated with a higher risk of PSE included WBC count, HbA1C, CRP, triglycerides, AST, ALT, bilirubin, urea, uric acid, APTT, PT, D-dimer, CK, CK-MB, LDH, HBDH, IMA, lactate, and anion gap. Additionally, significant p-values were found for fatty liver, coronary heart disease, hyperlipidemia, and HDL, with low or negative values of these indicators linked to a higher risk of secondary complications. The results of the statistical analyses, as well as the univariate and multivariate regression analyses, are detailed in Tables 1, 2, and 3. View this table: [Table 2.](http://medrxiv.org/content/early/2024/10/24/2024.06.27.24309564/T2) Table 2. Single Factor Significant Analysis Results This table presents the results of a single factor significance analysis for various features across two groups of samples: negative samples (0) and positive samples (1). Sample Size: Group 0 (Negative): N = 20,789 Group 1 (Positive): N = 954 Feature Analysis: For each feature, the table includes the mean and standard deviation (±) for both groups, odds ratios (OR) from univariable analysis, coefficients (coef), standard errors (std err), z-scores (z), p-values (P>|z|), and 95% confidence intervals ([0.025, 0.975]). Significance Levels: Features with statistically significant differences are indicated by p-values less than 0.05. An odds ratio greater than 1 suggests an increased risk associated with the feature in the positive group, while an odds ratio less than 1 suggests a decreased risk. Labels: The last two columns provide the proportions of the positive and negative samples for selected features. ### Performance of machine learning models The relevant performance indicators of the machine learning models are presented in Table 4, while the ROC curves, calibration curve, and decision curve analysis (DCA) are shown in Figure 3. Among all models, tree-based models such as Random Forest (RF), XGBoost, and LightGBM had the highest AUC scores, outperforming other models. Notably, Random Forest had the highest positive predictive value (PPV) at 0.864, which was the most significant metric in our models. Complex machine learning algorithms performed better than traditional logistic regression. The Brier score of the calibration curve was 0.006, and the DCA demonstrated good clinical decision-making benefits, indicating strong practical value. In the external validation cohort, we used RF for predictions, achieving a sensitivity of 0.91 and a PPV of 0.95, confirming the model’s strong predictive capability. View this table: [Table 3.](http://medrxiv.org/content/early/2024/10/24/2024.06.27.24309564/T3) Table 3. Multivariable Analysis Results This table summarizes the results of a multivariable analysis for various features across two groups of samples: negative samples (0) and positive samples (1). Sample Size: Group 0 (Negative): N = 20,789 Group 1 (Positive): N = 954 Feature Analysis: For each feature, the table includes the mean and standard deviation (±) for both groups, odds ratios (OR) from multivariable analysis, coefficients (Coef.), standard errors (Std. Err.), z-scores (z), p-values (P>|z|), and 95% confidence intervals ([0.025, 0.975]). Significance Levels: Features with statistically significant differences are indicated by p-values less than 0.05. An odds ratio greater than 1 indicates an increased risk associated with the feature in the positive group, while an odds ratio less than 1 suggests a decreased risk. Labels: The last column presents the proportions of the positive and negative samples for selected features. View this table: [Table 4.](http://medrxiv.org/content/early/2024/10/24/2024.06.27.24309564/T4) Table 4. Model Performance Evaluation Results This table presents the performance evaluation metrics for various machine learning models, including AUC, Accuracy, Sensitivity (Recall), Specificity, F1-score, Positive Predictive Value (PPV/Precision), and Negative Predictive Value (NPV) AUC: Area Under the Curve, indicating the model’s ability to distinguish between positive and negative samples; values closer to 1 indicate better performance. Accuracy: The proportion of correctly classified samples among the total samples. Sensitivity/Recall: The proportion of correctly identified positive samples out of all actual positive samples. Specificity: The proportion of correctly identified negative samples out of all actual negative samples. F1-score: The harmonic mean of precision and recall, considering both the accuracy and completeness of the model. Positive Predictive Value (PPV/Precision): The proportion of correctly identified positive samples among all samples predicted as positive. Negative Predictive Value (NPV): The proportion of correctly identified negative samples among all samples predicted as negative. ### Analysis of SHAP risk factors Figure 4 shows the SHAP (Shapley Additive Explanations) values, individual decision attempts, and overall decision curves. Among general characteristics, females had a higher rate of PSE. A higher NIHSS score was associated with a higher incidence of PSE. Additionally, elevated values of WBC count, D-dimer, CRP, AST, CK-MB, HbA1c, bilirubin, TCO2, and LDH at admission were linked to a greater likelihood of developing PSE. Conversely, lower levels of HBDH, PLT, and APTT were also associated with a higher probability of PSE. The specific brain regions affected did not have a significant individual effect on the overall outcome. Among complications, hypertension was more strongly associated with PSE development, while other conditions, such as coronary heart disease, diabetes, hyperlipidemia, and fatty liver, were less likely to be related to the outcome. We used the force plot of the first patient to illustrate how different features influenced the prediction. In this case, a prolonged APTT time contributed the most to PSE, followed by elevated AST levels, while a low NIHSS score contributed negatively to the final result. The decision plot aggregated model decisions to show how complex models arrived at their predictions. ## Discussion Our study used comprehensive clinical, imaging, and laboratory data from stroke patients to develop a predictive model using machine learning algorithms. This model achieved an AUC score above 0.95, demonstrating more accurate predictions compared to traditional statistical methods. Our research revealed that tree-based ensemble models provided superior predictive performance, especially when handling large datasets with high-dimensional features. During the modeling process, due to the extreme imbalance between negative and positive samples, we applied the SMOTEENN technique to resample the dataset, improving the performance of the machine learning models. Through SHAP analysis, we conducted interpretability assessments of the model and identified the importance of different features. In our study, age and NIHSS scores were treated as continuous variables. We found that female patients, older individuals, and those with higher NIHSS scores were more likely to develop PSE, consistent with recent studies. Higher NIHSS scores, indicating more severe strokes, significantly increased the risk of complications, second only to white blood cell (WBC) count and D-dimer in our model [5][19][10][20]. However, there are differing views on the effect of age. Some studies [5][21] suggest that age below 65 is a high-risk factor, which aligns with our findings, while other studies [22] have found that advanced age is the key factor. Yamada et al. [21] also agreed with our study, indicating that female patients have a higher risk of complications. On the other hand, Waafi et al. [10] reported that male patients are 3.325 times more likely to develop complications, which contradicts our findings. Previous research has shown that patients with diabetes, dyslipidemia, hypertension, depression, or dementia are at higher risk of developing vascular epilepsy [12]. In our study, statistical analysis and multiple machine learning (ML) models examined the relationship between comorbidities and complications. We found that patients with coronary heart disease, diabetes, fatty liver, hyperlipidemia, or large artery stenosis or plaques (CCA and ICA) were less likely to develop epilepsy. According to the TOAST classification, ischemic stroke is divided into five categories: large artery atherosclerosis, cardioembolism, small vessel occlusion, other determined etiology, and undetermined etiology. Patients with multiple comorbidities often fall into the large artery atherosclerosis and cardioembolism categories, which are more clearly defined and easier to treat, resulting in a lower likelihood of epilepsy. In contrast, strokes of undetermined etiology tend to have worse prognoses and are more likely to lead to epilepsy. Among patients with diabetes, higher HbA1c levels indicate poor blood sugar control and a higher risk of complications. Patients with better control of their blood sugar have a lower overall risk of developing complications. Alain et al. found that cortical infarction is more likely to lead to epilepsy in patients hospitalized with anterior circulation ischemic stroke [23]. Lin et al. found that factors such as cortical involvement and intracerebral hemorrhage volume increase the likelihood of PSE, which is consistent with our findings [8]. Al-Sahli et al. also suggested that cortical brain injury and large-area lesions raise the risk of PSE [5][21]. In our study, statistics showed that both cortical and subcortical involvement increased the likelihood of PSE, but these regions had less influence compared to other features and were not selected in the LASSO regression. Previous studies have identified acute infection as a risk factor for ischemic stroke [24]. C-reactive protein (CRP) reflects inflammation levels and is an independent prognostic factor [25]. In our study, both regression and SHAP analysis indicated that WBC count had a significant impact among routine blood test parameters, even surpassing the NIHSS score in SHAP analysis. A high WBC count may indicate severe inflammation or infection, as well as increased blood viscosity, making patients more prone to secondary complications. In general, a high red blood cell count and low platelet count also contributed to an increased risk of complications. A large-scale study on Chinese individuals found a negative correlation between plasma high-density lipoprotein cholesterol (HDL-C) levels and the risk of ischemic stroke, a weak positive correlation between plasma triglyceride (TG) levels and stroke risk, and a strong correlation between plasma low-density lipoprotein cholesterol (LDL-C) and apolipoprotein B levels [26]. High HDL-C levels are linked to better prognosis [27]. Our study aligns with these findings, showing that high LDL-C, low HDL-C, and elevated TG levels are more likely to result in PSE. This can be understood as high cholesterol and triglyceride levels increase blood viscosity and contribute to vascular sclerosis, promoting clot formation [12][28][29]. Higher D-dimer levels indicate more significant brain tissue damage, increasing the likelihood of PSE. In general, lower activated partial thromboplastin time (APTT) and fibrinogen levels are associated with higher PSE risk, while INR, PT, and TT have a smaller impact. Among liver function indicators, aspartate aminotransferase (AST) had the greatest influence on PSE. High AST, low alanine aminotransferase (ALT), and low albumin levels also had some impact. Lingling Ding et al. found that liver enzyme subgroups defined by ALT and AST were linked to higher risks of adverse outcomes [30], which is consistent with our findings. Studies have also shown that renal function biomarkers such as urinary microalbumin, cystatin C, and creatinine are associated with higher stroke recurrence rates and poorer prognosis [30]. In our study, low urea levels and high uric acid levels had a negative impact [31][32][33]. Our research supports these conclusions. Elevated uric acid levels at admission were positively associated with PSE, although patients with a prior diagnosis of hyperuricemia were less likely to develop epilepsy. Since uric acid acts as a strong antioxidant and has neuroprotective properties [34], patients with normal liver and kidney function and mild hyperuricemia may have greater resilience in emergencies [35][36]. However, excessively high uric acid levels suggest metabolic disorders and poor liver and kidney function, which are linked to a poor prognosis. When stroke patients are admitted, cardiac enzyme tests are often conducted to rule out myocardial ischemia. However, studies have shown that elevated CK-MB in stroke patients may not be solely heart-related [37]. Cardiac enzymes are important prognostic indicators [38][39] and have been incorporated into stroke scores [40]. Some studies have reported a higher incidence of abnormal serum cardiac enzyme levels in the acute phase of stroke. While the abnormalities are not related to the stroke type, they are associated with stroke severity, with patients exhibiting consciousness disorders having a significantly higher incidence of abnormal cardiac enzymes than those without such disorders [41]. In our study, CK, CK-MB, and IMA in the cardiac enzyme profile had a significant impact and high predictive value, though further research is required to understand the specific mechanisms involved [34]. Although our study incorporated extensive clinical, imaging, and laboratory data to build more accurate prediction models using machine learning algorithms, surpassing traditional statistical methods, there were still several limitations in the modeling process. While the current study offers valuable insights, the data sample may not be fully representative, and the model’s generalizability requires further evaluation. Although the data was collected from multiple tertiary hospitals and includes over 20,000 cases, earlier data was lost due to hospital system upgrades. The dataset mainly reflects patients diagnosed within the past five years and is predominantly from the Chongqing region, which may limit the model’s applicability to other geographic areas. Additionally, the retrospective nature of the study led to the absence of some important predictive indicators. Many potentially valuable features, such as hemorheology, thromboelastography, and hormone levels, were missing and had to be excluded. Including these features could potentially improve the model’s accuracy. To enhance the predictive power of the model, it would be beneficial to incorporate more data beyond baseline patient characteristics. The current analysis primarily used the results from the first examination upon admission, without fully utilizing information from subsequent exams. In future research, recurrent neural networks could be employed to extract features from the entire sequence of examinations more comprehensively. To strengthen the study further, data standardization should be improved, and the number of cases and key indicators should continue to grow. Additionally, it would be advantageous to explore more advanced scientific methods, such as deep learning, and utilize all available data to improve prediction accuracy. ## Materials and methods ### Research patients This study retrospectively included all stroke patients admitted to the Chongqing Emergency Center between June 2017 and June 2022 for the development of the prediction model. Data from three external validation centers—Qianjiang Central Hospital, Bishan District People’s Hospital, and Yubei District Traditional Chinese Medicine Hospital—were collected between July 2022 and July 2023 to validate and evaluate the model externally. The external validation cohort emphasized collecting positive cases to test the model’s ability to identify these cases accurately. Inclusion criteria: (1) Age between 18 and 90 years at admission; (2) Diagnosed with acute ischemic stroke and hospitalized for treatment. Exclusion criteria: (1) Patients with a history of stroke or transient ischemic attack (TIA); (2) Patients with a history of other conditions such as traumatic brain injury, intracranial tumors, or cerebral vascular malformations that may cause epilepsy; (3) Patients with a history of epilepsy or who have received antiseizure medications for the prevention of seizures or for other diseases (such as migraine or psychiatric disorders); (4) Patients who died within 72 hours after stroke onset. This study collected de-identified data from relevant patients to build a multi-modal stroke patient database. The study protocol was approved by the Ethics Committees of Chongqing University Center Hospital, Chongqing University Qianjiang Central Hospital, Bishan District People’s Hospital, and Yubei District Traditional Chinese Medicine Hospital. The selection process is outlined in Figure 1. A total of 42,079 records were retrieved from the stroke database, and 24,733 patients were diagnosed with ischemic or lacunar stroke with new onset. Hemorrhagic strokes (4,565), a history of stroke (2,154), TIA (3,570), unclear cause strokes (561), and records with missing essential data (6,496) were excluded. Patients whose seizures might have been caused by other factors (such as brain tumors, intracranial vascular malformations, or traumatic brain injury) (865), those with a seizure history (152), and patients who died in the hospital (1,444) were also excluded. Additionally, patients lost to follow-up (those without outpatient records or unreachable by phone) or who died within three months of the stroke incident (813) were excluded. Finally, 21,459 cases were included in the study. ![Figure 1.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/10/24/2024.06.27.24309564/F1.medium.gif) [Figure 1.](http://medrxiv.org/content/early/2024/10/24/2024.06.27.24309564/F1) Figure 1. Selection and Exclusion Procedure of Patients A total of 42,079 records were retrieved from the stroke database, and 24,733 patients were diagnosed with ischemic or lacunar stroke with new onset. Hemorrhagic strokes (4,565), a history of stroke (2,154), TIA (3,570), unclear cause strokes (561), and records with missing essential data (6,496) were excluded. Patients whose seizures might have been caused by other factors (such as brain tumors, intracranial vascular malformations, or traumatic brain injury) (865), those with a seizure history (152), and patients who died in the hospital (1,444) were also excluded. Additionally, patients lost to follow-up (those without outpatient records or unreachable by phone) or who died within three months of the stroke incident (813) were excluded. Finally, 21,459 cases were included in the study. ![Figure 2.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/10/24/2024.06.27.24309564/F2.medium.gif) [Figure 2.](http://medrxiv.org/content/early/2024/10/24/2024.06.27.24309564/F2) Figure 2. LASSO Regression Coefficient Paths The image shows the LASSO regression coefficient paths for various features related to a medical or research study. The x-axis represents the log of the regularization parameter alpha, and the y-axis shows the regression coefficient values. The lines in the plot represent the coefficient paths for different features as the regularization parameter changes. The features are labeled on the right side of the plot, and the most important features selected by the LASSO model are shown at the bottom of the image. ![Figure 3.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/10/24/2024.06.27.24309564/F3.medium.gif) [Figure 3.](http://medrxiv.org/content/early/2024/10/24/2024.06.27.24309564/F3) Figure 3. Model Evaluation Metrics and Curves The figure displays model performance curves across six sections (A1, A2, A3 on the left; B1, B2, B3 on the right) for training and test sets. ROC Curve: Illustrates the trade-off between sensitivity and specificity, with the AUC indicating overall model performance. Calibration Curve: Compares predicted probabilities to actual outcomes, assessing the model’s confidence accuracy. Precision-Recall Curve: Analyzes the balance between precision and recall at various thresholds, particularly useful for imbalanced datasets. ![Figure 4.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/10/24/2024.06.27.24309564/F4.medium.gif) [Figure 4.](http://medrxiv.org/content/early/2024/10/24/2024.06.27.24309564/F4) Figure 4. Description of the SHAP Values and Feature Importance SHAP Value (Left): Displays the impact of each feature on the model’s predictions, with features sorted by importance. The color gradient indicates the range of feature values, from low (blue) to high (red). Force Plot (Upper Right): Illustrates the contribution of individual features of the first sample to the final model output, highlighting how each feature value pushes the prediction away from the baseline value. Decision Plot (Lower Right): Visualizes the cumulative impact of features on the model output for each sample, showing how the feature values combine to produce the final prediction. ### Data collection We extracted all relevant records and data from the hospital databases. Using PostgreSQL, we wrote Structured Query Language (SQL) to manage the data as follows: 1. General Information: This included gender, age, and NIH Stroke Scale (NIHSS) score at admission. 2. Comorbidities and Complications: These included uremia, previous deep vein thrombosis (DVT), diabetes mellitus, hypertension, coronary atherosclerosis, atrial fibrillation, cerebral hernia, hydrocephalus, hypoproteinemia, hyperuricemia, hyperlipidemia, internal carotid stenosis, and common carotid stenosis. 3. Brain Involvement (CT or MRI records): We recorded involvement of the cortical lobes and subcortical areas, including the frontal, parietal, temporal, occipital, and insular lobes, as well as the basal ganglia, internal capsule, brain stem, cerebellum, periventricular area, centrum semiovale, and thalamus. The extent of cortical involvement (frontal, parietal, temporal, occipital, and insular lobes) was scored, with each lobe contributing 1 point. Similarly, subcortical involvement (basal ganglia, internal capsule, brain stem, periventricular area, thalamus, and cerebellum) was scored with each area contributing 1 point. 4. Vascular Involvement (CTA, MRA, or DSA records): We recorded the presence of vascular stenosis or occlusion in the anterior cerebral artery (ACA), middle cerebral artery (MCA), posterior cerebral artery (PCA), vertebral artery (VA), and basilar artery (BA). 5. Key Laboratory Indicators: These included blood lipids such as triglycerides (TG), high-density lipoprotein cholesterol (HDL), and low-density lipoprotein cholesterol (LDL); liver function indicators such as alanine transaminase (ALT), aspartate aminotransferase (AST), bilirubin, and albumin; renal function markers such as urea, blood uric acid (BUA), and creatinine; blood gas parameters such as lactate, anion gap, and total carbon dioxide (TCO2); coagulation markers such as international normalized ratio (INR), prothrombin time (PT), activated partial thromboplastin time (APTT), thrombin time (TT), D-dimer, and fibrinogen; and myocardial enzymes such as creatine kinase (CK), creatine kinase isoenzyme (CK-MB), lactate dehydrogenase (LDH), ischemic modified albumin (IMA), and α-hydroxybutyrate dehydrogenase (HBDH). ### Data processing and model building Processing of Missing Data: We recorded all laboratory indicators from the first set of tests after stroke admission (every stroke patient undergoes routine blood tests, and liver and kidney function assessments). Indicators with more than 10% missing data were excluded. The remaining indicators with missing values were imputed using the random forest algorithm with default parameters. We processed the features in order of missing values, starting with those that had the least missing data (as this requires the least information for imputation). When imputing a feature, missing values in other features were temporarily replaced with 0. After each regression prediction, the predicted value was inserted into the original feature matrix before proceeding to the next feature. Once all features were processed, the dataset was complete. Distribution of Characteristics: We used univariate analysis to compare the distribution of characteristics between the PSE-negative and PSE-positive groups. The data were then divided into a training set and a test set in a 7:3 ratio. Processing of Unbalanced Data: Given the low incidence of PSE and the small proportion of positive cases, we augmented the positive data in the training set using the Synthetic Minority Over-sampling Technique combined with Edited Nearest Neighbors (SMOTEENN). The SMOTEENN method from the imblearn Python package was applied with default parameters, and a random seed of 42 was set to ensure reproducibility. Processing of Categorical Data: For categorical variables, we used the one-hot encoding method for transformation. We then applied the LASSO method to the training set to identify the most important features. Model Building: First, we used LASSO regression to select the 20 most important features. We then employed 9 commonly used machine learning methods, including Naive Bayes, Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, Multi-Layer Perceptron, XGBoost, LightGBM, and K-Nearest Neighbors. Hyperparameters for each model were optimized through grid search to enhance performance. Model evaluation metrics included accuracy, sensitivity, specificity, F1-score, positive predictive value, and negative predictive value. We also generated ROC curves, calibration curves, and decision curves to further assess model performance. An independent external validation dataset was used to evaluate the generalization ability of the selected model. Lastly, we applied the SHAP algorithm to interpret the best-performing model, analyzing the contribution of each feature to the model’s predictions and their clinical relevance. Through this process of model development, optimization, and interpretation, we constructed a machine learning model with strong predictive performance and interpretability, offering valuable support for clinical decision-making. ### Statistical approach PostgreSQL v15 ([http://www.postgresql.org/](http://www.postgresql.org/)) was used to search and extract data from the local database. The open-source statistical package “Scipy.stats” in Python was used for statistical analysis. The details of the univariate significance analysis for each feature are as follows: The Shapiro-Wilk test was applied to assess the normality of each feature’s distribution. For features that did not follow a normal distribution, the Mann-Whitney U test was used to evaluate their significance in relation to the target variable. For features with a normal distribution, the Levene test was performed to evaluate the homogeneity of variances. Features with homogeneous variances were analyzed using the Student’s t-test for significance, while those with heterogeneous variances were analyzed using Welch’s t-test. Confidence intervals for AUC values and Brier scores were calculated using 1,000 bootstrap resampling iterations on the datasets. Binary classification thresholds for the predicted probabilities from all models were established using the maximum Youden index derived from the training cohort. Throughout the study, a two-tailed p-value of less than 0.05 was considered statistically significant. All the code used in this study was uploaded to [https://github.com/conanan/lasso-ml](https://github.com/conanan/lasso-ml). ## Conclusion We developed an interpretable machine learning model to predict the risk of post-stroke epilepsy (PSE) in hospitalized patients with ischemic stroke. Using a large dataset of medical records, our artificial intelligence model demonstrates strong predictive performance for PSE. The key predictors identified by the model include NIHSS score, D-dimer levels, lactate levels, and white blood cell count, along with liver function and cardiac enzyme profile indicators. The model’s transparency and interpretability can build trust among clinicians and support decision-making. While the results are promising, further prospective studies are necessary to validate the clinical utility of this tool before it can be applied in real-world settings. ## Data availability statement The codes,models,analysis results was uploaded at [https://github.com/conanan/lasso-ml](https://github.com/conanan/lasso-ml). The full dataset can be provided for researchers if needed by the corresponding author. ## Ethics approval statement We confirm that we have read the Journal’s position on issues involved in ethical publication and affirm that this report is consistent with those guidelines. ## Funding statement The research is funded by Central University basic research young teachers and students research ability promotion sub-project(2023CDJYGRH-ZD06);by Emergency Medicine Chongqing Key Laboratory Talent Innovation and development joint fund project(2024RCCX10). ## Conflict of interests The authors have no relevant conflicts of interest to disclose. ## Patient consent statement This study was a retrospective study and only deidentified patient data were collected, exempting the need for patient informed consent rights. ## Permission to reproduce material from other sources There are no reproduce material from other sources. ## Clinical trial registration The trail number is RS202406. ## Acknowledgements The authors would like to thank the colleagues in the information and imaging departments for their hard work contributing to the final research results. ## Footnotes * We have enhanced the codes and methods of the machine learning procedure. We have revised the draft to make the descriptions more clear and concise than before. * Received June 27, 2024. * Revision received October 24, 2024. * Accepted October 24, 2024. * © 2024, Posted by Cold Spring Harbor Laboratory The copyright holder for this pre-print is the author. All rights reserved. The material may not be redistributed, re-used or adapted without the author's permission. ## References 1. [1].Feigin V L, Krishnamurthi R V, Theadom A M, et al. Global, Regional, and National Burden of Neurological Disorders during 1990–2015: A Systematic Analysis for the Global Burden of Disease Study 2015[J]. The Lancet Neurology, 2017, 16(11): 877–897. [PubMed](http://medrxiv.org/lookup/external-ref?access_num=28931491&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F10%2F24%2F2024.06.27.24309564.atom) 2. [2].Galovic M, Döhler N, Erdélyi-Canavese B, et al. Prediction of Late Seizures after Ischaemic Stroke with a Novel Prognostic Model (the SeLECT Score): A Multivariable Prediction Model Development and Validation Study[J]. The Lancet Neurology, 2018, 17(2): 143. [PubMed](http://medrxiv.org/lookup/external-ref?access_num=29413315&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F10%2F24%2F2024.06.27.24309564.atom) 3. [3].Krishnamurthi R V, Feigin V L, Forouzanfar M H, et al. Global and Regional Burden of First-Ever Ischaemic and Haemorrhagic Stroke during 1990–2010: Findings from the Global Burden of Disease Study 2010[J]. The Lancet Global Health, 2013, 1(5): e259–e281. 4. [4].Zhao Y, Li X, Zhang K, et al. The Progress of Epilepsy after Stroke[J]. Curr Neuropharmacol, 2018, 16(1): 71–78. [PubMed](http://medrxiv.org/lookup/external-ref?access_num=28606039&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F10%2F24%2F2024.06.27.24309564.atom) 5. [5].Al-Sahli O a M, Tibekina L, Subbotina O P, et al. Post-Stroke Epileptic Seizures: Risk Factors, Clinical Presentation, Principles of Diagnosis and Treatment[J]. Epilepsy and paroxysmal conditions, 2023, 15(2): 148–159. 6. [6].Chen Z, Churilov L, Chen Z, et al. Association between Implementation of a Code Stroke System and Poststroke Epilepsy[J]. Neurology, 2018, 90(13): e1126–e1133. [PubMed](http://medrxiv.org/lookup/external-ref?access_num=29467304&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F10%2F24%2F2024.06.27.24309564.atom) 7. [7].Merkler A E, Gialdini G, Lerario M P, et al. Population-Based Assessment of the Long-Term Risk of Seizures in Survivors of Stroke[J]. Stroke, 2018, 49(6): 1319–1324. [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6OToic3Ryb2tlYWhhIjtzOjU6InJlc2lkIjtzOjk6IjQ5LzYvMTMxOSI7czo0OiJhdG9tIjtzOjUwOiIvbWVkcnhpdi9lYXJseS8yMDI0LzEwLzI0LzIwMjQuMDYuMjcuMjQzMDk1NjQuYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9) 8. [8].Lin R, Lin J, Xu Y, et al. Development and Validation of a Novel Radiomics-Clinical Model for Predicting PSE after First-Ever Intracerebral Haemorrhage[J]. European Radiology, 2023, 33(7): 4526–4536. [PubMed](http://medrxiv.org/lookup/external-ref?access_num=36735039&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F10%2F24%2F2024.06.27.24309564.atom) 9. [9].Pszczolkowski S, Law Z K. Editorial Comment on 《Development and Validation of a Novel Radiomics-Clinical Model for Predicting PSE after First-Ever Intracerebral Haemorrhage》[J]. European Radiology, 2023, 33(7): 4524–4525. [PubMed](http://medrxiv.org/lookup/external-ref?access_num=36977856&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F10%2F24%2F2024.06.27.24309564.atom) 10. [10].Waafi A K, Husna M, Damayanti R, et al. Clinical Risk Factors Related to PSE Patients in Indonesia: A Hospital-Based Study[J]. Egyptian Journal of Neurology, Psychiatry and Neurosurgery, 2023, 59(1). 11. [11].Herzig-Nichtweiß J, Salih F, Berning S, et al. Prognosis and Management of Acute Symptomatic Seizures: A Prospective, Multicenter, Observational Study[J]. Annals of Intensive Care, 2023, 13(1). 12. [12].Pitkänen A, Roivainen R, Lukasiuk K. Development of Epilepsy after Ischaemic Stroke[J]. The Lancet Neurology, 2016, 15(2): 185–197. [PubMed](http://medrxiv.org/lookup/external-ref?access_num=26597090&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F10%2F24%2F2024.06.27.24309564.atom) 13. [13].The Artificial Intelligence Revolution in Stroke Care: A Decade of Scientific Evidence in Review[J]. World Neurosurgery, Elsevier, 2024. 14. [14].Predicting Stroke and Myocardial Infarction Risk in Takayasu Arteritis with Automated Machine Learning Models[J]. iScience, Elsevier, 2023, 26(12): 108421. [PubMed](http://medrxiv.org/lookup/external-ref?access_num=38077132&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F10%2F24%2F2024.06.27.24309564.atom) 15. [15].Daidone M, Ferrantelli S, Tuttolomondo A, et al. Machine Learning Applications in Stroke Medicine: Advancements, Challenges, and Future Prospective[J]. Neural Regeneration Research, 2024, 19(4): 769–773. [PubMed](http://medrxiv.org/lookup/external-ref?access_num=37843210&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F10%2F24%2F2024.06.27.24309564.atom) 16. [16].Lee M, Yeo N-Y, Ahn H-J, et al. Prediction of Post-Stroke Cognitive Impairment after Acute Ischemic Stroke Using Machine Learning[J]. Alzheimer’s Research and Therapy, 2023, 15(1). 17. [17].Gong J, Zhang Y, Zhong X, et al. Liver Function Test Indices-Based Prediction Model for Post-Stroke Depression: A Multicenter, Retrospective Study[J]. BMC Medical Informatics and Decision Making, 2023, 23(1). 18. [18].He H, Liu J, Li C, et al. Predicting Hematoma Expansion and Prognosis in Cerebral Contusions: A Radiomics-Clinical Approach[J]. Journal of Neurotrauma, 2024: neu.2023.0410. 19. [19].Lin R, Yu Y, Wang Y, et al. Risk of PSE Following Stroke-Associated Acute Symptomatic Seizures[J]. Frontiers in Aging Neuroscience, 2021, 13. 20. [20].Zöllner J P, Misselwitz B, Kaps M, et al. National Institutes of Health Stroke Scale (NIHSS) on Admission Predicts Acute Symptomatic Seizure Risk in Ischemic Stroke: A Population-Based Study Involving 135,117 Cases[J]. Scientific Reports, 2020, 10(1). 21. [21].Yamada S, Nakagawa I, Tamura K, et al. Investigation of Poststroke Epilepsy (INPOSE) Study: A Multicenter Prospective Study for Prediction of Poststroke Epilepsy[J]. J Neurol, 2020, 267(11): 3274–3281. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1007/s00415-020-09982-2&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=32561989&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F10%2F24%2F2024.06.27.24309564.atom) 22. [22].Lidetu T, Zewdu D. Incidence and Predictors of Post Stroke Seizure among Adult Stroke Patients Admitted at Felege Hiwot Compressive Specialized Hospital, Bahir Dar, North West Ethiopia, 2021: A Retrospective Follow up Study[J]. BMC Neurology, 2023, 23(1). 23. [23].Lekoubou A, Ssentongo P, Maffie J, et al. Associations of Small Vessel Disease and Acute Symptomatic Seizures in Ischemic Stroke Patients[J]. Epilepsy & Behavior, 2023, 145: 109233. [PubMed](http://medrxiv.org/lookup/external-ref?access_num=37329856&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F10%2F24%2F2024.06.27.24309564.atom) 24. [24].Bova I Y, Bornstein N M, Korczyn. Acute Infection as a Risk Factor for Ischemic Stroke[J]. Stroke, 1996, 27(12): 2204–2206. [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6OToic3Ryb2tlYWhhIjtzOjU6InJlc2lkIjtzOjEwOiIyNy8xMi8yMjA0IjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjQvMTAvMjQvMjAyNC4wNi4yNy4yNDMwOTU2NC5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 25. [25].Di Napoli M, Papa F, Bocola V. C-Reactive Protein in Ischemic Stroke an Independent Prognostic Factor[J]. Stroke, 2001, 32(4): 917–924. [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6OToic3Ryb2tlYWhhIjtzOjU6InJlc2lkIjtzOjg6IjMyLzQvOTE3IjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjQvMTAvMjQvMjAyNC4wNi4yNy4yNDMwOTU2NC5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 26. [26].Sun L, Clarke R, Bennett D, et al. Causal Associations of Blood Lipids with Risk of Ischemic Stroke and Intracerebral Hemorrhage in Chinese Adults[J]. Nat Med, Nature Publishing Group, 2019, 25(4): 569–574. 27. [27].Bandeali S, Farmer J. High-Density Lipoprotein and Atherosclerosis: The Role of Antioxidant Activity[J]. Current Atherosclerosis Reports, 2012, 14(2): 101–107. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1007/s11883-012-0235-2&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=22441969&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F10%2F24%2F2024.06.27.24309564.atom) 28. [28].Gasparini S, Neri S, Brigo F, et al. Late Epileptic Seizures Following Cerebral Venous Thrombosis: A Systematic Review and Meta-Analysis[J]. Neurol Sci, 2022, 43(9): 5229– 5236. [PubMed](http://medrxiv.org/lookup/external-ref?access_num=35639217&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F10%2F24%2F2024.06.27.24309564.atom) 29. [29].Abraira L, Giannini N, Santamarina E, et al. Correlation of Blood Biomarkers with Early-Onset Seizures after an Acute Stroke Event[J]. Epilepsy & Behavior, 2020, 104: 106549. [PubMed](http://medrxiv.org/lookup/external-ref?access_num=31677998&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F10%2F24%2F2024.06.27.24309564.atom) 30. [30].Ding L, Liu Y, Meng X, et al. Biomarker and Genomic Analyses Reveal Molecular Signatures of Non-Cardioembolic Ischemic Stroke[J]. Sig Transduct Target Ther, Nature Publishing Group, 2023, 8(1): 1–16. 31. [31].Zhang W, Cheng Z, Fu F, et al. Serum Uric Acid and Prognosis in Acute Ischemic Stroke: A Dose–Response Meta-Analysis of Cohort Studies[J]. Frontiers in Aging Neuroscience, 2023, 15. 32. [32].Wang D, Hu B, Dai Y, et al. Serum Uric Acid Is Highly Associated with Epilepsy Secondary to Cerebral Infarction[J]. Neurotox Res, 2019, 35(1): 63–70. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1007/s12640-018-9930-2&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=30022372&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F10%2F24%2F2024.06.27.24309564.atom) 33. [33].Wang C, Cui T, Wang L, et al. Prognostic Significance of Uric Acid Change in Acute Ischemic Stroke Patients with Reperfusion Therapy[J]. Eur J Neurol, 2021, 28(4): 1218– 1224. [PubMed](http://medrxiv.org/lookup/external-ref?access_num=33176022&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F10%2F24%2F2024.06.27.24309564.atom) 34. [34].Ng G J L, Quek A M L, Cheung C, et al. Stroke Biomarkers in Clinical Practice: A Critical Appraisal[J]. Neurochemistry International, 2017, 107: 11–22. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.neuint.2017.01.005&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=28088349&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F10%2F24%2F2024.06.27.24309564.atom) 35. [35].Amaro S, Urra X, Gómez-Choco M, et al. Uric Acid Levels Are Relevant in Patients With Stroke Treated With Thrombolysis[J]. Stroke, American Heart Association, 2011, 42(1_suppl_1): S28–S32. 36. [36].Amaro S, Urra X, Gómez-Choco M, et al. Uric Acid Levels Are Relevant in Patients with Stroke Treated with Thrombolysis[J]. Stroke, 2011, 42(SUPPL. 1): S28–S32. [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6OToic3Ryb2tlYWhhIjtzOjU6InJlc2lkIjtzOjE2OiI0Mi8xX3N1cHBsXzEvUzI4IjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjQvMTAvMjQvMjAyNC4wNi4yNy4yNDMwOTU2NC5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 37. [37].Ay H, Arsava E M, Sarba O. Creatine Kinase-MB Elevation after Stroke Is Not Cardiac in Origin Comparison with Troponin T Levels[J]. Stroke, 2002, 33(1): 286–289. [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6OToic3Ryb2tlYWhhIjtzOjU6InJlc2lkIjtzOjg6IjMzLzEvMjg2IjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjQvMTAvMjQvMjAyNC4wNi4yNy4yNDMwOTU2NC5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 38. [38].Liu X, Chen X, Wang H, et al. Prognostic Significance of Admission Levels of Cardiac Indicators in Patients with Acute Ischaemic Stroke: Prospective Observational Study[J]. J Int Med Res, SAGE Publications Ltd, 2014, 42(6): 1301–1310. 39. [39].Zeng Y-Y, Zhang W-B, Cheng L, et al. Cardiac Parameters Affect Prognosis in Patients with Non-Large Atherosclerotic Infarction[J]. Molecular Medicine, 2021, 27(1): 2. [PubMed](http://medrxiv.org/lookup/external-ref?access_num=33407066&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F10%2F24%2F2024.06.27.24309564.atom) 40. [40].Hijazi Z, Lindbäck J, Alexander J H, et al. The ABC (Age, Biomarkers, Clinical History) Stroke Risk Score: A Biomarker-Based Risk Score for Predicting Stroke in Atrial Fibrillation[J]. European Heart Journal, 2016, 37(20): 1582–1590. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/eurheartj/ehw054&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=26920728&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F10%2F24%2F2024.06.27.24309564.atom) 41. [41].Zheng Yuan-Hui, Zheng Jin-Yi, Zhang Jian. Changes of serum myocardial enzyme profile in acute stage of stroke [J]. Chinese Journal of Advanced Medical Doctors, China Medical Journal, 2009, 32(07): 46 -- 47.