Abstract
Aims To create, using a machine learning (ML) approach, a preoperative model based on baseline demographic data and health-related quality of life (HRQOL) scores to predict a good to excellent early clinical outcome.
Patients and Methods A single spine surgery center retrospective review of prospectively collected data from January 2016 to December 2020 from the institutional registry (SpineREG) was performed. The inclusion criteria were age ≥ 18 years, both sexes, lumbar arthrodesis procedure, a complete follow up assessment (ODI, SF-36 and COMI back) and the capability to read and understand the Italian language. A delta of improvement of the ODI higher than 12.7/100 was considered a “good early outcome”. A combined target model of ODI (Δ ≥ 12.7/100), SF-36 PCS (Δ ≥ 6/100) and COMI back (Δ ≥ 2.2/10) was considered an “excellent early outcome”. The performance of the ML models was evaluated in terms of sensitivity, i.e., True Positive Rate (TPR), specificity, i.e., True Negative Rate (TNR), accuracy and area under the receiver operating characteristic curve (AUC ROC).
Results A total of 1243 patients were included in this study. The model for predicting ODI at 6 months follow up showed a good balance between sensitivity (74.3%) and specificity (79.4%), while providing a good accuracy (75.8%) with ROC AUC = 0.842. The combined target model showed a sensitivity of 74.2% and specificity of 71.8%, with an accuracy of 72.8%, and a ROC AUC = 0.808.
Conclusion The results of our study suggest that a machine learning approach showed high performance in predicting early good to excellent clinical results.
Introduction
Degenerative spine disorders represent a complex condition that mainly affects the elderly population, with an incidence of up to 68% in healthy people over 70 years of age [1]. Due to their clinical and socio-economic impact, they play an increasing role in daily medical practice, and their spread goes hand in hand with the aging of the population in developed countries. Spinal disorders have a broad spectrum of clinical manifestations, from minimal or asymptomatic to disabling conditions. The presentation pattern can variably affect segmental, regional, and global alignment. Pain and disability represent the main features, with a burden comparable to that of other self-reported chronic conditions in the general population, such as congestive heart failure, arthritis, chronic lung disease or diabetes [2].
The therapeutic approach to spinal disorders is challenging in terms of decision-making because of the variety of causes and symptoms. Furthermore, the decision process is made even more complicated by the aging of patients eligible for surgery and by their different clinical conditions and comorbidities. In the last decades, the rate of spine surgery has increased by up to 40%, and several randomized trials have demonstrated the positive and significant effects of these procedures [3, 4]. However, safety and effectiveness vary widely among patients. In the worst scenarios, the complication rate can be up to 13%, and up to 25% of cases can fall below the minimal clinically relevant improvement [5]. Indeed, there is always room for improvement from the clinical, surgical, and economic points of view [6].
Scientific research in this field has aimed to evaluate the improvement in quality of life (QoL) after spine surgery in relation to patient age, comorbidity and baseline status. With the aim of improving cost-effectiveness, the use of predictive models (PM) is continuously increasing. In 2015 McGirt et al. [7] presented a PM for clinical practice to help patients, providers, and hospital systems. It is based on demographics, patient-reported outcomes, and clinical data. In particular, baseline patient-specific factors such as symptom duration, smoking status, preoperative comorbidities and mental and physical conditions seem to significantly influence outcomes following lumbar surgery.
Sinikallio et al. [8], in a prospective analysis, demonstrated that the patients with preoperative depression and those who had continuous depression postoperatively experienced poor post-operative surgical outcomes and can benefit from targeted cognitive behavioural therapy [9]. Patient-specific factors beyond medical comorbidities, surgical indications, and surgical approaches can play a significant role in influencing overall patient outcomes [10]. The impact of lumbar spine surgery on patients’ life is commonly evaluated with three Patient-Reported Outcome Measures (PROMs): Oswestry Disability Index (ODI), the Physical Component Score of the Short Form of the Medical Outcomes Study (SF-36 PCS), and pain scales (VAS leg and back). The minimum clinically important difference (MCID) is commonly considered the threshold to measure the effect of the surgery for the single questionnaire.
The use of PROMs and their prediction through machine learning approaches represent a milestone in the development of shared, informed, and individualized decision making, potentially capable of supporting the surgeon in choosing the right intervention, at the right time, for the right patient [7]. Our study aims to develop a preoperative machine learning (ML) model to predict a good to excellent early clinical outcome by using baseline demographic data and health-related quality of life (HRQOL) scores.
Materials and Methods
Clinical and demographic data and outcomes
The study was conducted in a single spine surgery center and is based on a retrospective review of prospectively collected data from the institutional registry, SpineReg [11]. The inclusion criteria were age ≥ 18 years, both genders, lumbar arthrodesis procedure identified using ICD-9 codes (8106, 8107 or 8108), a follow up assessment (ODI, SF-36 and Core Outcome Measures Index [COMI] back) and the capability to read and understand the Italian language. A full set of peri-operative and post-operative data along with clinical outcomes from January 2016 to December 2020 was evaluated. Exclusion criteria were a weak degree of baseline disability or pain (ODI < 20/100 and COMI back < 3/10), an unspecified number of fused levels and subjects not stratified according to the Glassman classification.
The study protocol was conducted in accordance with the Helsinki Declaration of 1975 as revised in 2000. The procedures followed the ethical standards of the responsible committee on human experimentation, and the study was approved by the ethics committee of Ospedale San Raffaele, Milan, Italy (protocol “SPINEREG”, approved on April 13th, 2016). The project was supported with funds from the Italian Ministry of Health (project code CO-2016-02364645). All patients gave their written informed consent for participation in the registry. The following data were collected: baseline demographics, BMI, gender, comorbidities assessed through the Charlson Comorbidity Index (CCI) [12], diagnosis according to the Glassman classification [13], number of spinal levels treated, spinal levels of the index surgery, clinical scores resulting from medical surveys, complications and revision surgeries.
Table 1 shows that the rate of missing data ranges from 0.0% for the baseline PROM scores and the patients’ personal information to 20.4% for the Levels variables. Independent variables with at least one missing value were imputed using predictive mean matching for numerical variables, and binary/multinomial logistic regression or ordered logit models for categorical variables. Outcome variables were not imputed to avoid the introduction of bias into the results. Only patients with observed outcome variables were included in the analysis.
The primary outcome was early (six months post-operative) significant clinical improvement. In particular, improvements greater than 12.7 for ODI [14], 6 for SF-36 PCS and 2.2 for COMI back were considered indicators of significant clinical improvement [15, 16].
To classify surgical operations as having a “good early outcome”, we required a delta of improvement of the ODI higher than 12.7/100. To identify surgical operations with an “excellent early outcome”, we used a combined target that identifies an excellent clinical result when all three underlying targets, ODI (Δ ≥ 12.7/100), SF-36 PCS (Δ ≥ 6/100) and COMI back (Δ ≥ 2.2/10), showed a relevant improvement.
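The two targets can be written as simple rules on the change scores. The following is a minimal illustrative sketch in Python, not the code used in the study (the analysis was run in R). The function name `classify_outcome` and the sign convention are assumptions: improvement is counted as a positive delta, i.e., a decrease in ODI and COMI back and an increase in SF-36 PCS, and the thresholds are applied as Δ ≥ cut-off.

```python
# Thresholds taken from the text (change between baseline and 6 months).
ODI_MCID = 12.7   # points out of 100
PCS_MCID = 6.0    # points out of 100
COMI_MCID = 2.2   # points out of 10


def classify_outcome(odi_pre, odi_post, pcs_pre, pcs_post, comi_pre, comi_post):
    """Return the (good, excellent) flags for one patient.

    Assumption: improvement is a positive delta, so ODI and COMI back
    improve when they decrease and SF-36 PCS improves when it increases.
    """
    d_odi = odi_pre - odi_post
    d_pcs = pcs_post - pcs_pre
    d_comi = comi_pre - comi_post

    good = d_odi >= ODI_MCID
    # "Excellent" requires all three targets to improve at the same time.
    excellent = good and d_pcs >= PCS_MCID and d_comi >= COMI_MCID
    return good, excellent
```

Under this convention every patient with an excellent outcome also has a good outcome, which is consistent with the excellent target being the stricter of the two.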
An exploratory analysis was performed. Patients were classified as having (Risk +) or not having (Risk -) each binary risk factor. Different scenarios were simulated to verify the performance of each method of calculation in three age categories. For each scenario, a 2 × 2 table was built (Good Outcome +/- vs. Risk +/-). The chi-squared test was used to test the statistical association between reaching a good outcome and the presence of the risk factor. The odds ratios with their 95% confidence intervals, point estimates of the sensitivity and specificity of the alignment rules in discriminating patients with a final good or poor clinical outcome, the positive (PPV) and negative (NPV) predictive values and the positive and negative likelihood ratios (LR+, LR-) were calculated. Differences between preoperative and postoperative clinical outcomes were tested with the two-tailed Student’s t test for paired samples. The Mann-Whitney U-test was used in the case of non-normally distributed variables. Normality was verified with the Kolmogorov–Smirnov test. The threshold for statistical significance was set at p < 0.05 in all tests.
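The quantities derived from each 2 × 2 table follow standard formulas. The Python sketch below is illustrative only and makes several assumptions: the helper name `two_by_two_metrics` is hypothetical, the risk factor is treated as the "test" and the good outcome as the reference, and the confidence interval of the odds ratio uses the Woolf (log) method, which the text does not specify.

```python
import math


def two_by_two_metrics(a, b, c, d, z=1.96):
    """Metrics for a 2 x 2 table of a binary risk factor vs. outcome.

    a: Risk+ / Good outcome+    b: Risk+ / Good outcome-
    c: Risk- / Good outcome+    d: Risk- / Good outcome-
    Assumes every cell is greater than zero.
    """
    sens = a / (a + c)   # Risk+ among patients with a good outcome
    spec = d / (b + d)   # Risk- among patients without a good outcome
    odds_ratio = (a * d) / (b * c)
    # Woolf standard error of log(OR); an assumed choice of CI method.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    ci = (math.exp(math.log(odds_ratio) - z * se_log_or),
          math.exp(math.log(odds_ratio) + z * se_log_or))
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": a / (a + b),
        "npv": d / (c + d),
        "lr_pos": sens / (1 - spec),
        "lr_neg": (1 - sens) / spec,
        "odds_ratio": odds_ratio,
        "or_ci95": ci,
    }
```

For example, `two_by_two_metrics(40, 10, 20, 30)` describes a hypothetical table (not study data) of 100 patients.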
Machine learning approach
The available dataset is characterized by a non-negligible presence of missing values. Therefore, data imputation of independent variables is first performed to exploit all the instances and obtain more stable and reliable results [17]. In the dataset, different data types are available and each of them is treated with dedicated techniques. In the case of numerical variables, for each instance with at least one missing value, a small subset of complete instances similar to the instance under investigation is selected. From this set, a randomly sampled instance is used to replace the missing values. Discrete variables are instead imputed with ad hoc models: logistic regression models for binary variables, multinomial logistic regression models for unordered categorical variables and ordered logit models (or proportional odds models) for categorical variables with ordered categories. The data imputation is implemented with the “mice” R package [18].
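The donor-based imputation of numerical variables described above can be illustrated with a simplified, single-predictor version of predictive mean matching. This Python sketch is an assumption-laden toy, not the “mice” implementation: the function name `pmm_impute`, the use of one fully observed predictor, the ordinary least squares fit and the default of five donors are all illustrative choices, and the parameter draws that a full multiple-imputation procedure would add are omitted.

```python
import random


def pmm_impute(x, y, k=5, seed=0):
    """Fill missing entries (None) of y using one fully observed predictor x.

    Assumes at least two complete cases with different x values.
    """
    rng = random.Random(seed)
    obs = [i for i, v in enumerate(y) if v is not None]

    # Ordinary least squares fit of y on x using the complete cases.
    n = len(obs)
    mean_x = sum(x[i] for i in obs) / n
    mean_y = sum(y[i] for i in obs) / n
    sxx = sum((x[i] - mean_x) ** 2 for i in obs)
    sxy = sum((x[i] - mean_x) * (y[i] - mean_y) for i in obs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    def predict(xi):
        return intercept + slope * xi

    imputed = list(y)
    for i, v in enumerate(y):
        if v is None:
            target = predict(x[i])
            # The k complete cases whose predicted value is closest to the target.
            donors = sorted(obs, key=lambda j: abs(predict(x[j]) - target))[:k]
            # Copy the observed value of one randomly chosen donor.
            imputed[i] = y[rng.choice(donors)]
    return imputed
```

Because each imputed value is copied from a real patient, the filled-in values always stay within the range of the observed data.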
A multivariate classification model is used to predict the target variables: 1) single ODI improvement or 2) combined ODI + SF-36 PCS + COMI back. For both targets, we use a Random Forest (RF) classification method to predict the outcome of the surgical operation. RF is an ensemble model composed of multiple decision trees, each of them trained independently on randomly sampled subsets of variables. The single outputs of the multiple decision tree models are then combined with a majority vote to obtain the final decision of RF. This ensemble helps in improving the predictive performance of the individual decision tree models. Indeed, RF has been recognized as one of the best performing classifiers in extensive classification studies [19], and the R implementation provided by the “randomForest” package is empirically more accurate than other implementations [20]. We thus train an RF model using the default settings of the “randomForest” R package for both targets. To evaluate the most important features used by RF to classify instances, we used the Mean Decrease Gini index, which measures the contribution of each variable to the homogeneity of internal and leaf nodes of the tree.
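Two ingredients of this description, the majority vote and the node homogeneity behind the Mean Decrease Gini index, can be made concrete with a short sketch. The Python below is illustrative only and does not reproduce the “randomForest” package; the function names are hypothetical, labels are assumed to be binary (0/1), and ties in the vote are resolved to class 0 as an arbitrary choice.

```python
def gini(labels):
    """Gini impurity of a node with binary labels (0/1); 0 means pure."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def gini_decrease(parent, left, right):
    """Impurity decrease obtained by splitting a parent node into two children.

    Each child impurity is weighted by the share of samples it receives.
    """
    n = len(parent)
    return (gini(parent)
            - len(left) / n * gini(left)
            - len(right) / n * gini(right))


def majority_vote(tree_votes):
    """Combine the 0/1 votes of the individual trees into a single class."""
    return int(sum(tree_votes) * 2 > len(tree_votes))
```

A variable that is repeatedly chosen for splits with a large `gini_decrease` accumulates a high importance; summing these decreases per variable and averaging over the trees gives a quantity in the spirit of the Mean Decrease Gini.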
We train RF in cross-validation (5 folds) and we select the classification threshold for each fold by optimizing the geometric mean of sensitivity and specificity in a nested cross-validation loop. The proposed nested cross-validation allows a robust estimate of the optimal classification threshold and a robust assessment of RF performance, while balancing sensitivity and specificity. This is particularly relevant since both target variables are slightly unbalanced (70.7% of the available data are associated with a good surgical outcome and 43.3% of the available data are associated with an excellent surgical outcome).
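The threshold selection step can be sketched as follows. This Python example is a hedged illustration rather than the study code: the function names are hypothetical, the candidate thresholds are simply the distinct predicted scores, and a case is classified as positive when its score is greater than or equal to the threshold. In a nested scheme, a function like `best_threshold` would be applied to predictions from the inner folds only, and the chosen value would then be used on the held-out outer fold.

```python
import math


def sens_spec(y_true, scores, threshold):
    """Sensitivity and specificity when score >= threshold means 'positive'."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens, spec


def best_threshold(y_true, scores):
    """Return the candidate threshold maximizing sqrt(sensitivity * specificity)."""
    def gmean(t):
        sens, spec = sens_spec(y_true, scores, t)
        return math.sqrt(sens * spec)

    return max(sorted(set(scores)), key=gmean)
```

The geometric mean is zero whenever either sensitivity or specificity is zero, so a threshold that assigns every patient to the majority class is never selected.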
The performance of the model is evaluated in terms of sensitivity, i.e., True Positive Rate (TPR), specificity, i.e., True Negative Rate (TNR), accuracy and area under the receiver operating characteristic curve (AUC ROC). The exploratory and machine learning analyses were performed in R [21].
The entire study was performed according to the TRIPOD guideline for the development of multivariate models for individual prognosis or diagnosis [22].
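Unlike sensitivity, specificity and accuracy, the AUC ROC does not depend on a classification threshold: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The Python sketch below computes it directly from this definition; the function name `roc_auc` is hypothetical, ties are counted as one half, and the pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Assumes binary labels (0/1) with at least one case in each class.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # pair ranked correctly
            elif p == q:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.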
Results
A total of 1243 patients who underwent lumbar arthrodesis surgery were included in this study. The diagnoses were disc pathology in 9.5%, disc collapse in 38.4%, spondylolysis or spondylolisthesis in 32.8%, degenerative scoliosis in 7.6%, facet pathology in 0.1%, non-union in 11.1%, cancer in 0.3% and infection in 0.2%. The rate of early good outcome was 70.7% (n=879), and 43.3% (n=538) of patients reached an “excellent” early outcome. The patients had a median age of 56 (interquartile range: 22) years and 771 (62.0%) were female. The mean baseline disability of the study population was ODI 47.3 ± 17.1, the mean pain score was COMI back 7.7 ± 1.7 and the mean quality of life was SF-36 PCS 32.7 ± 6.9 and SF-36 MCS 45.5 ± 11.8.
Since the univariate exploratory analysis did not yield significant results, multivariate classification models were used to identify surgical operations with good and excellent early outcomes. The results of these analyses are reported in the supplementary materials. The model for predicting ODI (good early outcome) at 6 months follow up (FU) makes use of the following features: classification of the patient’s clinical state (Glassman), operating surgical team (Equipe), age, gender, Body Mass Index (BMI), ASA code (ASA), pre-operative PROMs (ODI, COMI, SF-36 Physical and SF-36 Mental), number of vertebrae stabilized during the operation (Levels), starting and ending points of the stabilized vertebrae (From_Level, To_Level) and Charlson Comorbidity Index (Charlson).
The model showed a good balance between sensitivity (74.3%) and specificity (79.4%), while providing a good accuracy (75.8%) with ROC AUC = 0.842.
The combined target model (excellent early outcome) makes use of the same features used for the good early outcome model. The excellent early outcome model showed a sensitivity of 74.2% and specificity of 71.8%, with an accuracy of 72.8%, and a ROC AUC = 0.808. Both models for predicting good and excellent clinical outcomes therefore show a good balance between sensitivity (74.3% for good and 74.2% for excellent outcome) and specificity (79.4% for good and 71.8% for excellent outcome), while providing a good accuracy (75.8% for good and 72.8% for excellent outcome). Details are reported in Tables 2 and 3.
The models also show a good discriminatory capacity between the two classes (ROC AUC = 0.842 for good and ROC AUC = 0.808 for excellent outcome) (Figure 1).
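The reported accuracies can be cross-checked against the reported sensitivity, specificity and outcome rates, since accuracy = sensitivity × prevalence + specificity × (1 − prevalence). The Python sketch below performs this arithmetic and also derives the predictive values implied by the same figures. It is an illustrative consistency check under the assumption of pooled data; the function name `implied_metrics` is hypothetical, and the derived PPV and NPV are approximations that need not match the cross-validated values in the tables.

```python
def implied_metrics(sensitivity, specificity, prevalence):
    """Accuracy, PPV and NPV implied by sensitivity, specificity and prevalence."""
    tp = sensitivity * prevalence                 # true positive share
    fn = (1.0 - sensitivity) * prevalence         # false negative share
    tn = specificity * (1.0 - prevalence)         # true negative share
    fp = (1.0 - specificity) * (1.0 - prevalence) # false positive share
    return {
        "accuracy": tp + tn,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Figures reported in the text for the two targets.
good = implied_metrics(0.743, 0.794, 0.707)
excellent = implied_metrics(0.742, 0.718, 0.433)
```

With these inputs the implied accuracies agree with the reported 75.8% and 72.8% to within rounding. The same arithmetic shows how class balance shapes the predictive values: the implied NPV is lower for the more frequent "good" target and the implied PPV is lower for the less frequent "excellent" target.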
According to the Mean Decrease Gini index of the Random Forest model, the top five predictors of both good and excellent clinical outcomes were SF-36 PCS, SF-36 MCS, ODI, BMI and age at baseline. The weights of the machine learning model for good clinical outcomes were SFMPre = 73.20, SFPPre = 70.80, ODIPre = 66.77, BMI = 62.97 and Age = 61.12; those for excellent clinical outcomes were SFMPre = 90.34, SFPPre = 87.13, BMI = 78.61, Age = 69.92 and ODIPre = 66.21. The tables and graphs of Mean Decrease Gini for the weights of the machine learning models, as well as the odds ratios of the exploratory analysis, are summarized in the supplementary material.
Discussion
In spine surgery, multiple factors can influence clinical outcomes. According to our results and the exploratory analysis, there is no single risk factor capable of influencing or predicting early clinical outcomes. In our registry, the machine learning (ML) approach predicts the likelihood of good or excellent early clinical results. Our ML model showed good performance in predicting post-operative outcomes when based on patients’ demographic data and pre-operative self-reported degree of disability and quality of life.
In the last decades, the machine learning approach in predictive models has gained interest in clinical practice.
1. Predictive Models of surgical improvement based on clinical data
The surgical treatment for degenerative spine disorders has been shown to improve the quality of life and reduce disability in the most severely affected patients [23]. Nevertheless, the association between demographic baseline factors and overall complication rates is still unclear.
Thanks to anesthesiological and surgical advances, a large population can safely be considered for spinal surgery. The increase in the use of mini-invasive surgery [24] and the improvements in pre-operative planning methods [25] have enlarged the cohort of patients eligible for surgery and capable of obtaining significant results [26, 27].
One of the most relevant demographic indicators is Body Mass Index (BMI). High BMI values are known to be risk factors for many diseases and are directly correlated with the complication rate after spinal surgery. However, the role of BMI in the prediction of functional outcomes is still debated in the literature.
According to Mulvanay et al. [28], an increased BMI is associated with decreased effectiveness of 1- to 3-level elective lumbar fusion, despite the absence of surgical complications. A BMI value higher than 30 is considered a risk factor for surgical complications and poor spine surgery results. According to our data, a low BMI seems to play a relevant role in predicting good clinical outcomes. Although several studies suggest that weight does not have a major impact on the patients’ health-related quality of life after surgery [29], obesity has a relevant impact on intraoperative blood loss, length of surgery and complication rate. BMI should therefore always be kept in mind when planning spinal fusion.
Several clinical indicators of postoperative success are continuously analyzed to improve the surgical outcome in terms of complication rate and patients’ satisfaction. The aging of the population and the relative increase of the comorbidities can challenge the surgical decision.
According to Daniels et al. [30], the upgrading of some surgical methods increased the performances of therapeutic strategies. In particular, in a retrospective analysis of surgical cases enrolled between 2009 and 2016 the complication rates decreased over time, despite an increasingly elderly, medically compromised, and obese patient population. As a critical point, the authors identified the evolution of surgical strategies that resulted in an overall improvement of the treatment quality.
2. Predictive Models of surgical improvement based on PROMs
The proper estimation of the pre-operative degree of disability and quality of life is mandatory when surgery is required. In spine surgery, considerable controversy exists regarding the risk-benefit balance of spinal arthrodesis, since surgery itself creates a permanent fusion of vertebral bodies. Nevertheless, several studies demonstrate a significant improvement after spinal arthrodesis in cases of degenerative spinal disorders [31–33]. A combination of scales is often used in clinical studies to assess multiple aspects of the patient’s condition.
The indicators of quality of life and disability have progressively gained attention, becoming the gold standard to measure the success rate after spine surgery. The post-operative clinical improvement can be evaluated based on patient-reported outcomes such as ODI. Although post-operative improvement may be statistically significant, it is not necessarily clinically relevant. For this reason, several studies have defined the minimum difference that is clinically meaningful to the patient (MCID) [34]. In particular, Monticone et al. defined an MCID cut-off of 12.7 points of ODI improvement as significant [14].
3. Predictive models performances (General)
Predictive models for patient-reported outcomes can improve surgical strategies, both when deciding whether to opt for surgery and, potentially, when adapting the surgical approach.
Despite the significant variability in the population affected by a common clinical condition, lumbar disc herniation, Staartjes et al. [35] proposed a predictive model based on deep learning-based analytics. Out of a population of 422 patients, the deep learning and logistic regression attained AUC values of 0.84 and 0.72, and an accuracy of 75% and 59%, respectively. The greatest discrepancy in performance measures regarded the models predicting back pain improvement. This could reflect the model’s weakness or the inherent difficulty of the outcome to be measured.
Models based on Naïve Bayes machine learning to predict hospitalization days and indications for discharge (for example, admission to rehabilitation facilities or back to home and hospitalization costs) showed high performances. In particular, the system proposed by Karnuta et al. revealed a predictive accuracy of 0.800 for costs of recovery, 0.874 for Length of Stay (LOS), and 0.878 for disposition with AUC for hospitalization costs (0.880), excellent AUC for LOS (0.941), and an excellent AUC for discharge disposition (0.906) [36].
The disease variability, combined with the psychological influencing factors and patients’ expectations related to surgery, challenges the accuracy of clinical predictions. Siccoli et al. [37] evaluated the feasibility of predicting short- and long-term PROMs and reoperation rate with an ML approach in patients affected by lumbar stenosis. According to this study, the models were able to predict the endpoints, providing accurate information.
Despite a progressive increase in the use of prediction models in spine surgery, little is known about their use in 2- to 4-level spinal arthrodesis.
Our results seem to provide comparable or higher predictive performances than other studies on spine surgery. Thanks to the recent advances in technologies, AI can involve the application of mathematical algorithms that continuously learn and make observations from existing data. The aim is to create a more accurate predictive model based on that data [38].
4. Influence of ML predictions on therapeutic strategy
With the widening of modern datasets, the use of ML will progressively become the gold standard and the primary candidate for data analysis. Promising future applications in diagnosis, prognosis and the decision-making process are desirable and will soon become an essential tool for the spine physician. Khan et al. introduced its application in the clinical management of cervical myelopathy and nontraumatic spinal cord injury to predict the risk of neurological impairment at one year [39]. These tools allowed physicians to predict individual patient outcomes after surgery for degenerative cervical myelopathy [40] and to apply preventive strategies such as targeted physiotherapy and timing of psychological counselling.
With the application of ML techniques, several studies have demonstrated the possibility of predicting clinical outcomes. Ames et al. predicted the patients’ responses to the SRS-22R questionnaire item by item with an AUROC of up to 86.9% at 1 and 2 years following surgical treatment for ASD. The main clinical application is to aid surgical decision making during preoperative counselling [41]. In complex surgery, this approach will be capable of implementing already available surgical decision-making [42].
Limitations
The results of our study come with several limitations that must be taken into account. Firstly, the 6-month follow-up indicators can be considered only a preliminary result. External and prospective validations are necessary to further support this methodology and to improve the knowledge acquired. Furthermore, the lower performance in terms of PPV in “excellent outcome” predictions and NPV in “good outcome” predictions can be explained by the low number of positive and negative events, respectively. Moreover, the two models can be combined to classify patients into three categories: “Excellent”, “Good” and “Not good”. This classification of patients can be used to support clinicians in taking personalized and patient-specific decisions.
Conclusion
The results of our study suggest that a machine learning approach showed high performance in predicting early good to excellent clinical results. In particular, our data suggest that patients with worse preoperative scores of disability and quality of life, and younger or healthier patients, should expect a clinically relevant improvement. On the other hand, older patients and those with a higher BMI, more comorbidities (higher ASA and Charlson scores), a higher SF-36 score and a lower ODI score would experience less clinically relevant improvement from lumbar spine surgery. These results must be seen in the light of the study’s limitations, first of all the six-month follow-up. A potential improvement or worsening in the PROMs results could occur later, but this is not the focus of the study.
Data Availability
Data are available from the corresponding author upon reasonable request.