Abstract
Background The number of elective surgeries for older individuals is on the rise globally. Machine learning may improve risk assessment with an impact on surgical planning and postoperative care. Preoperative cognitive assessment may facilitate early identification of postoperative delirium (POD). This study aims to estimate the predictive ability of machine learning models for POD using pre- and/or perioperative features, with a specific focus on adding neuropsychological assessments prior to surgery.
Materials and Methods This retrospective cohort study analyzed data from the multicenter PAWEL study and its PAWEL-R substudy, encompassing older patients (≥70 years) undergoing elective surgeries across five medical centers from July 2017 to April 2019. A total of 1624 patients were included, with POD diagnosis made before discharge. Data included demographics, clinical, surgical, and neuropsychological features collected pre- and perioperatively. Machine learning model performance was evaluated using the area under the receiver operating characteristic curve (AUC), with permutation testing for significance and SHapley Additive exPlanations (SHAP) to identify effective neuropsychological assessments.
Results In this cohort of 1624 patients, 52.3% (N=850) were male, with a mean [SD] age of 77.9 [4.9] years. Predicting POD before surgery using demographic, clinical, surgical, and neuropsychological features achieved an AUC of 0.79. Incorporating all pre- and perioperative features into the model yielded a slightly higher AUC of 0.82, with no significant difference observed (P= .19). Notably, cognitive factors alone were not strong predictors (AUC=0.61). However, specific tests within neuropsychological assessments, such as the Montreal Cognitive Assessment memory subdomain and Trail Making Test Part B, were found to be crucial for prediction according to SHAP analysis.
Conclusion and Relevance Preoperative risk prediction for POD can increase risk awareness in presurgical assessment and improve postoperative management in patients with a high risk for delirium.
Highlights
Analyzed 1624 older patients (≥70 years) undergoing elective surgeries across five medical centers from July 2017 to April 2019.
Established a machine learning model to predict postoperative delirium before surgery.
Preoperative cognition enhances predictive performance, comparable to models incorporating all pre- and perioperative features.
Montreal Cognitive Assessment memory subdomain and Trail Making Test Part B drive the cognition-based prediction.
Perioperative surgical features, such as the duration of the surgery, are important predictors.
Introduction
In an aging society, there is a rising demand for elective surgeries due to the changing healthcare needs of older people 1–3. However, this increase in elective surgeries raises concerns about additional adverse outcomes, particularly given the unique challenges posed by aging, such as pre-existing health conditions, disease sequelae, and diminished physiological reserves 4,5. Meeting the growing demand for elective surgeries among the elderly requires a comprehensive strategy, including thorough preoperative assessments, personalized care plans, and continuous post-operative support 6–8. In this study, we systematically assess the potential benefits of utilizing machine learning techniques to predict the risk of postoperative delirium (POD) based on a diverse range of features.
POD, characterized by acute and fluctuating inattention with alterations in thinking or consciousness after surgery, affects 12% to 51% of older patients, with incidence varying by surgical procedure and region 9. POD in older patients is linked to heightened rehospitalization rates, persistent postoperative cognitive dysfunction, increased incidence of dementia, and elevated mortality 10–12. Some studies have shown significant associations between POD and factors collected before or during surgery (pre- or perioperatively) 13,14. These factors could aid in predicting POD 15,16. However, different pre- and perioperative feature categories may vary in their importance for POD risk assessment. Demographic factors, such as age and sex, are critical for assessing POD risk 17–23. Clinical data, including blood samples and chronic disease medication, are also predictive 24–26. The type and length of surgery and anesthesia are indispensable for POD prediction 18,27. Although preoperative neuropsychological assessments, like the Mini-Mental State Examination (MMSE) and others 28–30, help identify at-risk patients for early risk mitigation 31–33, these evaluations are not yet incorporated into clinical routine despite experts’ recommendations 34. The early identification of POD risk factors enables clinicians to proactively mitigate the occurrence of POD 35 and might affect patients’ decisions before non-emergent surgery. Therefore, identifying preoperative risk factors for POD is crucial in facilitating personalized surgical risk assessment before surgery 36,37. Moreover, stratifying POD risks remains challenging due to its multifactorial origins 18,22,38. The precise categories of preoperative and perioperative features with superior predictive capabilities for POD remain unknown and unvalidated 39.
In this study, we (1) predict POD by employing a machine learning approach utilizing diverse pre- and perioperative features from a large multicenter cohort of older patients and (2) conduct a comparative analysis of independent models using various data categories, including demographic, clinical, surgical, and neuropsychological features. This study provides a comprehensive assessment of predictors for POD, thereby enhancing presurgical awareness of risk and postoperative management in older patients.
Materials and Methods
Participants
The study utilized the cohort from the PAWEL study (Patientensicherheit, Wirtschaftlichkeit und Lebensqualität bei elektiven Operationen, English: Patient safety, Efficiency and Life quality in elective surgery) and its PAWEL-R sub-study (R for risk) to predict POD using machine learning models, departing from original statistical methods in the previous studies. Patients were recruited from five major medical institutions in Germany (three university hospitals: Tübingen, Freiburg, and Ulm, and two tertiary medical centers: Stuttgart and Karlsruhe) between July 31, 2017, and April 12, 2019 13,40,41. The study adheres to the guidelines outlined in the STROCSS criteria 42.
Participants included patients aged 70 and older undergoing elective surgery (joint, spine, vessels, heart, lung, abdomen, urogenital system, and other organs) with an expected surgical duration exceeding 60 minutes. Exclusion criteria covered patients unable to communicate effectively in German, those undergoing emergency surgery, those with severe dementia (MMSE < 15 or Montreal Cognitive Assessment (MOCA) < 8), and those with an estimated survival time of less than 15 months. A stepped-wedge cluster randomized design was employed for equitable intervention group allocation 40,41. The study, initially including 1631 patients, conducted thorough postoperative assessments within one week after surgery but before discharge. To ensure accuracy in predicting POD diagnosis, patients without POD diagnosis before discharge (N=7) were excluded, resulting in 1624 patients included in prediction models. The study reported 23.1% of patients diagnosed with POD, with detailed group comparisons available in eTables 1 and 2.
Measures
The PAWEL and PAWEL-R studies extensively evaluated elective surgery patients to identify risk factors and outcomes related to POD. Diagnosis involved the Confusion Assessment Method (I-CAM) algorithm and chart review within the first postoperative week or until discharge. Demographic data, including age, sex, education, alcohol/smoking habits, living arrangement, and hospital location, were collected preoperatively. Neuropsychological assessments such as MOCA, Trail Making Test (TMT) parts A and B, digit span backwards, Subjective Memory Impairment (SMI), and Patient Health Questionnaire-4 (PHQ-4) were conducted at admission. Clinical profiles included blood samples, past medical histories including pre-existing dementia and previous delirium history, and baseline assessments. Surgical information covered types of surgery, anesthesia, and perioperative events. Preoperative and perioperative features were analyzed for predictive capacity at two time points. Models were developed and tested with pre- and perioperative features combined and with preoperative features alone. Additionally, models were compared with and without neuropsychological assessments to assess their contribution to predictive performance. The study aimed to identify effective feature combinations for predicting POD diagnosis by comparing predictive performance across different feature combinations from the two time points.
Preprocessing, imputation, and features
Features with more than 20% missing data, such as albumin level (60.28%) and depth of anesthesia (75.37%), were excluded 16. Information about missing values is available in eTable 3. Data imputation involved using the mean value for continuous variables and random sampling from the original probability distribution for discrete and binary variables within cross-validation folds. Blood samples, including hemoglobin, sodium levels, and C-reactive protein (CRP), were interpreted following clinical guidelines in line with standard practice 43. For instance, hemoglobin levels of less than 12 g/dL were considered indicative of anemia, while sodium levels of less than 135 mmol/L and more than 145 mmol/L were associated with hyponatremia and hypernatremia, respectively. Similarly, CRP levels greater than 3 mg/L indicated an increased CRP level, while values within the normal range were not considered clinically significant. Redundant features with a perfect correlation were excluded. Non-binary categorical features, including location, SMI, types of anesthesia, and surgery, were one-hot encoded because they had no natural ordinal relationship among their categories, and assigning numerical labels to them could introduce bias or incorrect assumptions in the model.
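The preprocessing rules described above can be illustrated with a minimal sketch in plain Python. It is not the study's pipeline: the function names, the toy records, and the category labels are hypothetical, and only the numeric cut-offs (20% missingness, hemoglobin below 12 g/dL, sodium outside 135 to 145 mmol/L, CRP above 3 mg/L) are taken from the text. The sketch also assumes that the imputation mean is estimated on the training fold only, and it covers mean imputation of continuous variables but not the random sampling used for discrete and binary variables.

```python
MISSING_THRESHOLD = 0.20  # drop features with more than 20% missing values


def drop_sparse_features(rows, threshold=MISSING_THRESHOLD):
    """Return names of features whose missing fraction does not exceed threshold."""
    n_rows = len(rows)
    kept = []
    for name in rows[0]:
        n_missing = sum(1 for row in rows if row[name] is None)
        if n_missing / n_rows <= threshold:
            kept.append(name)
    return kept


def mean_impute(train_values, test_values):
    """Fill None with the mean of the observed training values (assumption:
    the mean is estimated on the training fold and reused for the test fold)."""
    observed = [v for v in train_values if v is not None]
    fill = sum(observed) / len(observed)

    def fill_in(values):
        return [fill if v is None else v for v in values]

    return fill_in(train_values), fill_in(test_values)


def lab_flags(hemoglobin_g_dl, sodium_mmol_l, crp_mg_l):
    """Binarize laboratory values with the cut-offs quoted in the text."""
    return {
        "anemia": int(hemoglobin_g_dl < 12),
        "hyponatremia": int(sodium_mmol_l < 135),
        "hypernatremia": int(sodium_mmol_l > 145),
        "crp_elevated": int(crp_mg_l > 3),
    }


def one_hot(name, value, categories):
    """Encode a nominal value as 0/1 indicator columns, one per category."""
    return {f"{name}={category}": int(value == category) for category in categories}


# Toy records (hypothetical values): albumin is missing in 4 of 5 rows.
rows = [
    {"hemoglobin": 11.2, "albumin": None},
    {"hemoglobin": 13.5, "albumin": None},
    {"hemoglobin": None, "albumin": 4.1},
    {"hemoglobin": 12.8, "albumin": None},
    {"hemoglobin": 14.0, "albumin": None},
]
kept = drop_sparse_features(rows)
train, test = mean_impute([11.2, 13.5, None, 12.8], [None, 14.0])
flags = lab_flags(11.2, 140.0, 5.0)
encoded = one_hot("anesthesia", "regional", ["general", "regional", "combined"])
```

In this toy example the sparsely observed albumin column is dropped, the missing hemoglobin values are filled with the training mean, and the anesthesia type becomes three indicator columns without any implied ordering.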
Seventy features were used, divided into preoperative (51) and perioperative (19) categories. Preoperative features encompassed demographic (7), clinical (23), surgical categories (6), and neuropsychological assessments (15). Perioperative features included clinical (4) and surgical (15) categories. Different combinations of features were compared, including preoperative only, preoperative and perioperative, and sub-feature sets of preoperative features. The study evaluated model performance, feature category effectiveness, and the additional benefits of preoperative neuropsychological assessments.
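The feature counts above can be checked with a short bookkeeping sketch. The category counts come from the text; the set labels follow the names used in the Results, and the assumption that the "Pre-Op" and "Pre and Peri-Op" sets exclude the 15 neuropsychological features is our reading of those comparisons rather than something stated here. The helper name is hypothetical.

```python
# Number of features per (time point, category), as reported in the text.
FEATURE_COUNTS = {
    ("pre", "demographic"): 7,
    ("pre", "clinical"): 23,
    ("pre", "surgical"): 6,
    ("pre", "neuropsychological"): 15,
    ("peri", "clinical"): 4,
    ("peri", "surgical"): 15,
}


def count_features(phases, include_neuropsych=True):
    """Sum feature counts over the selected time points."""
    return sum(
        n
        for (phase, category), n in FEATURE_COUNTS.items()
        if phase in phases
        and (include_neuropsych or category != "neuropsychological")
    )


# Assumed composition of the compared feature sets.
feature_sets = {
    "Pre-Op": count_features({"pre"}, include_neuropsych=False),
    "Pre-Op + NeuroPsy": count_features({"pre"}),
    "Pre and Peri-Op": count_features({"pre", "peri"}, include_neuropsych=False),
    "Pre and Peri-Op + NeuroPsy": count_features({"pre", "peri"}),
}
```

Under this reading, the full model uses all 70 features and the preoperative model with neuropsychological assessments uses 51.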
The study aimed to develop a prediction model in a naturalistic setting. Information regarding interventions was included only as a sensitivity analysis to demonstrate its potential impact on the prediction model. The potential effect of class imbalance in the dataset was tested for all models by oversampling with the Synthetic Minority Oversampling Technique (SMOTE) 44. Possible entry errors in clinical assessments were defined as data points exceeding five times the standard deviation on the respective measure and were replaced by mean imputation. Model performance was then estimated and compared to the original predictions, which included those data points. Further sensitivity analyses are described in the supplement.
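The handling of possible entry errors can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the function name is hypothetical, "exceeding five times the standard deviation" is interpreted as lying more than five sample standard deviations from the mean, and the replacement value is taken to be the mean of the remaining, non-extreme values.

```python
import statistics


def replace_extreme_values(values, n_sd=5.0):
    """Replace values more than n_sd standard deviations from the mean
    with the mean of the remaining values (assumed interpretation)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return list(values)
    is_extreme = [abs(v - mean) > n_sd * sd for v in values]
    regular = [v for v, extreme in zip(values, is_extreme) if not extreme]
    fill = statistics.fmean(regular)
    return [fill if extreme else v for v, extreme in zip(values, is_extreme)]


# 99 plausible measurements (hypothetical) plus one likely entry error.
values = [10 + (i % 5) for i in range(99)] + [1000]
cleaned = replace_extreme_values(values)
```

Note that a single extreme value inflates the standard deviation itself, so a five-SD rule can only flag a point when the sample is reasonably large (roughly 28 or more observations for one outlier).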
Machine learning models, performance evaluation and feature importance
Machine learning models were used to predict POD, including logistic regression, support vector machines, random forest, and gradient boosting, without hyperparameter tuning, using the scikit-learn library version 1.2.2 45 and the XGBoost library version 1.7.3 46. The features served as independent variables, and POD diagnosis was the dependent variable, as illustrated in Figure 1-B. Five-fold cross-validation, with balanced labels across folds, measured model performance at testing using the area under the receiver operating characteristic curve (AUC) as the primary metric. Additional metrics included precision, recall, sensitivity, specificity, balanced accuracy, and area under the precision-recall curve presented in eTable 4. Permutation testing assessed AUC values compared to random chance. POD diagnosis labels were shuffled for random chance, and AUC values were measured 1000 times. The p-value compared the original AUC value to the distribution of permuted AUC values. The difference between models was assessed similarly, calculating the difference between the AUC values of two models with permuted labels 47. The SHapley Additive exPlanations (SHAP) values were used to assess feature importance. SHAP measures each feature’s influence on the model’s prediction. Positive SHAP values increased the probability of POD, while negative values decreased it. The average absolute SHAP value of each feature indicated its contribution to the model’s prediction. The SHAP library version 0.41.0 in Python was employed 48. All preprocessing steps and analyses are available on GitHub upon publication from Sharma forked to https://github.com/MHM-lab. The data can be requested by addressing the PAWEL consortium.
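The logic of the permutation test against chance can be illustrated with a small standard-library sketch. It is a simplification and not the study's implementation: the function names and toy data are hypothetical, the labels are shuffled against a fixed set of predicted scores rather than refitting the model in cross-validation for every permutation, and the p-value formula with the added one in numerator and denominator is a common convention that we assume here.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a random positive case receives a higher
    score than a random negative case (ties count as one half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def permutation_p_value(labels, scores, n_permutations=1000, seed=0):
    """Compare the observed AUC with AUCs obtained under shuffled labels."""
    rng = random.Random(seed)
    observed = auc(labels, scores)
    shuffled = list(labels)
    n_at_least = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if auc(shuffled, scores) >= observed:
            n_at_least += 1
    # Assumed convention: count the observed statistic as one permutation.
    return observed, (n_at_least + 1) / (n_permutations + 1)


# Toy data (hypothetical): 12 patients without POD, 4 with POD.
labels = [0] * 12 + [1] * 4
scores = [0.10, 0.15, 0.20, 0.22, 0.25, 0.30, 0.31, 0.35,
          0.40, 0.45, 0.50, 0.62, 0.55, 0.70, 0.80, 0.90]
observed_auc, p_value = permutation_p_value(labels, scores)
```

Here 47 of the 48 positive-negative pairs are ranked correctly, and very few label shuffles reach that AUC, so the p-value is small. In a full analysis each permutation would retrain and re-evaluate the model, and the model-difference test would apply the same idea to the difference between two AUC values.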
Results
Predicting POD with combined, pre-, and perioperative features
The four classifiers performed comparably (eFigure 1). Here, we focus on the findings obtained from random forest due to its marginally better performance across most models. The models incorporating combined and independent pre- and perioperative features exhibited robust performance, as evidenced by AUC values surpassing chance levels (Figure 1-C and eTables 4 and 5). eFigure 2 displays the performance of these models through receiver operating characteristic curves. Notably, the model using only preoperative features (Pre-Op) achieved an AUC of 0.76, comparable to a model incorporating both pre- and perioperative features (Pre and Peri-Op), which had an AUC of 0.80, showing no significant difference (Figure 1-D). The independent model exclusively utilizing perioperative surgical features demonstrated an AUC value of 0.73 (Figure 1-E). It is noteworthy that longer cut-to-suture time (surgical duration) and increased equipment usage duration were associated with a higher likelihood of POD, as shown in eFigure 5. Further details, including pairwise comparisons and p-values for differences in AUC values, can be found in eFigure 3 and eTable 6.
Addition of preoperative neuropsychological assessments
The model using preoperative neuropsychological assessments exclusively exhibited an AUC of 0.61. Integrating neuropsychological assessments into the model utilizing both pre- and perioperative features (Pre and Peri-Op + NeuroPsy) led to a slight improvement in the AUC, reaching 0.82. This enhancement made it the top-performing model (Figure 2-B). Adding neuropsychological assessments to the preoperative model (Pre-Op + NeuroPsy) improved the AUC from 0.76 to 0.79, which did not differ significantly from the model based on all pre- and perioperative features (Pre and Peri-Op) with an AUC of 0.80 (Figure 2-C). Specifically, poorer performance in the MOCA memory subdomain and TMT part B before surgery indicated a higher likelihood of POD, as illustrated in Figure 2-D and eFigure 6. For detailed pairwise comparisons of AUC values and corresponding p-values, please refer to eFigure 4 and eTable 6. Including intervention allocation information had no discernible impact on predictive performance (eFigure 7 and eTables 4 and 7). Lastly, there was no difference in AUC between models with and without oversampling (eFigure 8), and outliers or erroneous data did not affect our models’ performance (eFigure 9).
Discussion
Leveraging machine learning, we forecasted the occurrence of POD after elective surgeries through a combination of preoperative and perioperative features in a large multicenter cohort. Although our models showcased robust preoperative performance, the perioperative surgical data proved to be the most potent single category of predictors of POD. The integration of preoperative neuropsychological tests raised the AUC of the preoperative model to a level comparable to the model with all pre- and perioperative features, notably influenced by the MOCA memory subdomain and TMT part B, as elucidated by model explanations. This integrated analysis improves conventional clinical risk profiling, furnishing superior predictive capacity with promising implications for surgical planning in the era of machine learning-assisted healthcare, and supports the prioritization of pivotal features in future work.
Perioperative surgical features
Perioperative surgical features, including those on the day of the operation, emerged as the best single category of predictors for POD, as illustrated in Figure 1-E. Their SHAP values highlighted key factors such as ventilation duration, overall duration of the surgical procedure, central venous catheter use, anticoagulant use, and duration of oxygen use, as depicted in eFigure 5-B. Prolonged individual surgical durations with longer equipment use were associated with a higher risk of POD. This association was particularly notable in surgeries involving the use of a heart-lung machine (eFigure 5-A): cardiopulmonary bypass was significantly more frequent among patients who developed POD (47.1%) than among those who did not (17.8%) (eTable 1). Consistent with our findings, even a 30-minute increase in surgery duration corresponded to a 6% rise in POD risk 27, and this risk is further elevated with prolonged use of cardiopulmonary bypass during cardiac surgeries 49. This might be explained by potential hypoperfusion or micro-embolism in cardiac surgeries, elevating the risk of POD 50.
In contrast to the significance of the surgery type during the preoperative phase, our analysis revealed that the duration of individual surgery with ventilation emerged as a more influential factor in predicting POD. Overall, our findings highlight the critical importance of cardiovascular risk measures and surgical metrics in POD prediction. Looking ahead, the integration of real-time predictive technology into surgical workflows holds promise. This advancement could potentially facilitate on-the-fly predictions during surgery, enabling timely adjustments to medication or non-pharmacological intervention to mitigate potential adverse outcomes associated with surgical interventions. Our study emphasizes the substantial value of information gleaned from measures taken during surgery, shedding light on their crucial role in enhancing our understanding and prediction of POD.
Enhanced POD prediction prior to surgery through neuropsychological assessments
The preoperative model for predicting POD demonstrated effectiveness, which is in line with previous work 16,36,37. Although incorporating perioperative data slightly improved POD predictions, it is crucial to note that this information is only available during and immediately after surgery. This limitation prevents surgical planning and decision-making beforehand, allowing only for adjustments in real time. Therefore, augmenting the predictive performance of algorithms by incorporating data that can be gathered prior to a surgical procedure is important, as it allows for integrated surgical planning and informed decision-making prior to any invasive or surgical procedure. In our study, preoperative neuropsychological assessments were predictive above chance; however, their added benefit to combined pre- and perioperative features was modest (Figure 2-B). In contrast, the absence of preoperative neuropsychological tests reduced the performance of POD prediction in models relying solely on preoperative features (Figure 2-C), which is critical given that surgical and postoperative management could be optimized. Preoperative models with neuropsychological tests effectively predicted POD before surgery, substantiating previous observations 13,28,51. This could be attributed to preoperative neuropsychological tests revealing subtle cognitive deficits, even though fewer than 2% of patients in this study cohort had a dementia diagnosis before surgery (eTable 1). These deficits may progress into POD. In line with this explanation, timely preoperative cognitive interventions can mitigate the risk of POD and long-term cognitive dysfunction after cardiac surgeries 52,53. Additionally, patients with pre-existing cognitive decline face increased risks of other postoperative complications 54. Consequently, baseline neuropsychological assessments are valuable for improving the prediction of POD, although, as shown in this study, with modest additional effects.
Selecting suitable neuropsychological assessments for clinical use is crucial. Our study identified the MOCA memory subdomain and TMT part B as effective, as indicated by their average absolute SHAP values (Figure 2-D). Low scores on the MOCA memory subdomain and longer test times on the TMT part B indicate poor cognitive performance and executive dysfunction (eFigure 6). These tests are crucial for predicting POD risk, as demonstrated in a prior prediction study 15. Patients with mild cognitive impairment at baseline are more likely to develop POD 25, while good preoperative cognitive performance is protective against POD 18. Previous studies have often used the MMSE 28,31,55, Clock-Drawing Test 20, or MOCA score as preoperative risk factors 56. Critically, we replicated the strong association between baseline MOCA and POD risk in the previous theory-driven PAWEL-R study with a larger cohort 13. These findings offer a thorough understanding of the efficacy of individual preoperative neuropsychological tests in predicting POD. By conducting comprehensive assessments of pre-existing risks, we may unlock new avenues for optimizing surgical planning and postoperative management. Our study underscores the added, albeit moderate, advantage of evaluating cognition, emphasizing its importance and advocating for its inclusion in future developments aimed at refining preoperative risk assessments.
Limitations and recommendation
The study exhibits several limitations that require consideration. First, interpreting SHAP values warrants caution 57, as is generally the case for methods using model explanations in medicine 58,59. Unstable explanations are not uncommon for complex models trained on large datasets 60,61. While the ranking of importance may fluctuate, features with higher mean absolute SHAP values generally maintain consistent attributions. Second, we harnessed cognitive data derived from standardized assessments. However, it is crucial to note that these procedures, while standardized, often remain non-digitalized. This presents significant untapped potential for future advancements in the realm of risk evaluation before surgical procedures. Third, while our study included a wide range of preoperative neuropsychological assessments, the list is not exhaustive. Neuropsychological assessments are time-consuming and require trained assessors to conduct the tests accurately and interpret the results. Therefore, incorporating semi-automatic assessments of cognition 62 may prove advantageous and is a relevant direction for future research.
Conclusion and relevance
We showcase robust predictions of POD in older patients, relying on a combination of pre- and perioperative features. This study further underscores the feasibility of predicting POD prior to surgery and the additional incorporation of preoperative neuropsychological assessments. Our research contributes to a heightened comprehension of predictors for POD prioritizing distinct variable dimensions, thereby guiding a more targeted approach in addressing POD risk predictions in clinical practice.
Data Availability
The data can be requested by addressing the PAWEL consortium.
Acknowledgements
Assistance with the study: We would like to thank the PAWEL and PAWEL-R Study group and the patients for their participation in this study, which made the work possible. Financial support and sponsorship: This work was supported by grant VF1_2016-201 from the Innovationsfonds (fund of the Gemeinsamer Bundesausschuss, GBA) as well as the German Research Foundation (DFG) Emmy Noether with reference 513851350 (TW) and the Cluster of Excellence with reference 390727645 (TW). This work was supported by the BMBF-funded de.NBI Cloud within the German Network for Bioinformatics Infrastructure (de.NBI) (031A532B, 031A533A, 031A533B, 031A534A, 031A535A, 031A537A, 031A537B, 031A537C, 031A537D, 031A538A). Conflicts of interest: CT received honoraria from serving on the scientific advisory board of Roche, a research grant from the gemeinsame Bundesausschuß der Krankenkassen and has received funding for travel and speaker honoraria from several hospitals for scientific education. GWE has nothing to disclose. CAFvA received honoraria from serving on the scientific advisory board of Biogen, Roche, Novo Nordisk, Biontech, MindAhead UG and Dr. Willmar Schwabe GmbH & Co. KG and has received funding for travel and speaker honoraria from Lilly, Biogen, Roche diagnostics AG, Novartis, Medical Tribune Verlagsgesellschaft mbH, Landesvereinigung für Gesundheit und Akademie für Sozialmedizin Niedersachsen e. V., FomF GmbH | Forum für medizinische Fortbildung and Dr. Willmar Schwabe GmbH & Co. KG and has received research support from Roche diagnostics AG. Presentation: none.
Footnotes
Typos in the previous manuscript have been corrected.