ABSTRACT
Objective To 1) develop and evaluate a machine learning model incorporating gait and physical activity to predict medial tibiofemoral cartilage worsening over two years in individuals without or with early knee osteoarthritis and 2) identify influential predictors in the model and quantify their effect on cartilage worsening.
Design An ensemble machine learning model was developed to predict worsened cartilage MRI Osteoarthritis Knee Score at 2-year follow-up from gait, physical activity, clinical and demographic data from the Multicenter Osteoarthritis Study. Model performance was evaluated with Area Under the Curve (AUC) in repeated cross-validations. The top 10 influential predictors of the outcome across 100 held-out test sets were identified by a variable importance measure statistic, and their marginal effect on the outcome was quantified by g-computation.
Results Of 947 legs included in the analysis, 14% experienced medial cartilage worsening over two years. The median (2.5th-97.5th percentile) AUC across the 100 held-out test sets was 0.73 (0.64-0.80). Presence of baseline cartilage damage, higher Kellgren-Lawrence grade, greater pain during walking, and higher lateral ground reaction force impulse were associated with greater risk of cartilage worsening.
Conclusions An ensemble machine learning approach incorporating gait, physical activity, and clinical/demographic features showed good performance for predicting cartilage worsening over two years. While identifying potential intervention targets from the machine learning model is challenging, these results suggest that addressing high lateral ground reaction force impulse should be investigated further as a potential target to reduce medial tibiofemoral cartilage worsening in persons without or with early knee osteoarthritis.
What are the findings?
Machine learning models predicted cartilage worsening in persons without or with early knee osteoarthritis from gait, physical activity, and clinical and demographic characteristics with a median AUC of 0.73 across 100 held-out test sets.
High lateral ground reaction force impulse (> 1.8 N*s) was associated with 5.5% higher risk of cartilage worsening over two years compared to lower lateral impulse (< 1.1 N*s).
How might it impact on clinical practice in the future?
Gait and physical activity are some of the only modifiable risk factors for knee osteoarthritis; addressing high lateral ground reaction force impulse may be a potential target for interventions to slow early knee osteoarthritis progression.
INTRODUCTION
Knee osteoarthritis (OA) is a progressive, painful joint disease and a leading cause of disability, affecting over 350 million adults worldwide.[1] While some individuals with advanced disease undergo knee replacement, there is no cure for OA and many live with pain and poor quality of life for decades. Additionally, existing structural damage and other risk factors (e.g., obesity, malalignment) can drive further degeneration.[2, 3] Addressing the burden of knee OA will require both early identification of individuals at risk and discovery of potential targets for early interventions that can be implemented prior to the onset of extensive damage or other risk factors.
Mechanical loading on the joint is one of the only modifiable risk factors for knee OA[4] and can be manipulated through gait and/or physical activity interventions. While prior research has identified key gait features associated with medial tibiofemoral knee OA progression,[5] these have typically been examined in isolation, in small samples, and/or without accounting for other known clinical/demographic risk factors. Importantly, little is known about gait and physical activity predictors of progression in individuals early in the disease process. Machine learning models can help identify features among larger complex constructs that are important to prediction without requiring assumptions about underlying relationships among these features, making them useful for exploring gait and physical activity data.[6-8]
The Multicenter Osteoarthritis Study (MOST)[9] is a unique, large, observational cohort of individuals with and without knee OA in which data on gait biomechanics, accelerometer-derived physical activity, and key clinical and demographic measures are available for application of machine learning approaches. Further, MOST includes MRI knee exams at multiple timepoints, providing sensitive measures of early joint structural change, such as worsening cartilage damage.[10] Using the MOST data, our objectives were to (1) build and evaluate a machine learning model to predict medial tibiofemoral cartilage worsening over 2 years from gait, physical activity, and clinical and demographic risk factors in individuals without or with early knee OA, and (2) identify features that contributed most to prediction of cartilage worsening from the machine learning model and quantify their effect on the cartilage worsening outcome.
METHODS
Study sample
At 144 months, surviving participants from the original MOST cohort (age 50-79 years, with or at increased risk for developing knee OA at enrollment) were invited to return for a clinic visit. Concurrently, a new cohort (age 45-69 years, with or without knee pain, and with Kellgren-Lawrence radiographic grades ≤ 2) was enrolled. Participants with inflammatory disease or stroke were not included in either cohort. The MOST study received institutional review board approval from the four core sites (Boston University, University of Alabama at Birmingham [UAB], University of California San Francisco, University of Iowa [UIowa]). All participants provided informed consent prior to participating in the study, in accordance with the Helsinki Declaration.
We used data from both cohorts for our study baseline (original cohort: 144-month visit, new cohort: enrollment visit) and 2-year follow-up (original cohort: 168-month visit, new cohort: 24-month visit). MRIs were read for one knee per participant (herein referred to as the “study knee”) at both baseline and 2 years. We excluded participants with Kellgren-Lawrence grades > 2 in the study knee to focus on early disease (Figure 1). We also excluded participants with a history of knee or hip replacement in either leg, steroid or hyaluronic acid injection during the past 6 months in either knee, or regular use of a walking aid. Finally, we excluded participants who did not undergo MRI assessment and participants with data quality issues related to gait and/or physical activity measures (described later).
Figure 1. Study sample from the Multicenter Osteoarthritis Study
Patient and public involvement
Currently, patients and the public are not involved in the design, conduct, reporting, or dissemination plans for research projects utilizing MOST data.
Exposures
Clinical and demographic features
Clinical and demographic factors that are both independent risk factors for OA and affect gait and/or physical activity independent of OA (i.e., confounders) were included as model inputs.[11-18] Sex, age, body mass index (BMI), race, clinic site, and prior history of knee injury or surgery were recorded at baseline. Given small sample sizes in multiple categories of race, particularly at the UIowa clinic site, race and site were combined into a single feature with 3 levels: UAB non-White, UAB White, and UIowa. Participants also completed the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC)[19] and Center for Epidemiologic Studies Depression Scale (CES-D), and had posterior-anterior and lateral weight-bearing radiographs taken, which were read for Kellgren-Lawrence grade (KLG)[20] at the MOST analysis center. Hip-knee-ankle alignment for the new cohort was read from long-limb radiographs taken at baseline. Long-limb radiographs were not acquired for the original cohort at the 144-month visit (current baseline); thus, we used hip-knee-ankle alignment read from long-limb radiographs taken at the 60-month visit for these participants. WOMAC pain during walking was extracted from the first question of the WOMAC questionnaire, with legs categorized as ‘no,’ ‘mild,’ or ‘moderate or higher’ pain during walking.
Table 1. Baseline demographics and clinical characteristics
Gait features
Three-dimensional (3D) ground reaction force (GRF) data were recorded at 1000 Hz while participants walked at self-selected speed across a portable force platform embedded in a 5.3-meter walkway (AccuGait, AMTI Inc., Watertown, MA, USA). At least five trials of data were acquired per leg, with the first excluded as an acclimatization trial. Legs with at least three remaining trials where the foot landed completely on the force plate were retained for analysis. For each trial, we extracted a selection of gait features: commonly used discrete metrics calculated from the 3D GRF waveforms (Figure 2), “toe-out” angle as defined by Chang et al.,[21] stance time, and walking speed. We time-normalized all timing features to the stance phase of gait. We then averaged each feature across trials for each leg (Table 2). GRFs were not amplitude-normalized given the inclusion of BMI as a predictor in the model and to avoid issues with interpreting ratios.[22]
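As an illustration of how one discrete GRF metric of this kind can be derived, the following minimal Python sketch computes a force impulse as the time integral of a force trace and averages it across trials for a leg. It is a sketch under stated assumptions, not the study's processing code: the function names, the trapezoidal integration rule, and the constant-force example traces are hypothetical, and the exact impulse definition used in the study (for example, which portion of the lateral waveform is integrated) is not specified here.

```python
def grf_impulse(force_n, sample_rate_hz=1000.0):
    """Trapezoidal time integral of a force trace in newtons, giving N*s.

    Assumes force_n holds the stance-phase samples of one GRF component.
    """
    dt = 1.0 / sample_rate_hz
    return sum((a + b) / 2.0 * dt for a, b in zip(force_n, force_n[1:]))


def leg_average_impulse(trials, sample_rate_hz=1000.0, min_trials=3):
    """Average the impulse across trials for one leg.

    Mirrors the stated rule that at least three usable trials are required.
    """
    if len(trials) < min_trials:
        raise ValueError("not enough usable trials for this leg")
    impulses = [grf_impulse(t, sample_rate_hz) for t in trials]
    return sum(impulses) / len(impulses)


# Hypothetical traces: constant force over a 0.5 s stance phase (501 samples).
trials = [[10.0] * 501, [12.0] * 501, [14.0] * 501]
leg_impulse = leg_average_impulse(trials)
```

Because the force is not amplitude-normalized, the result stays in N*s, the unit in which the lateral impulse thresholds are reported.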
Figure 2. Features extracted from ground reaction force (GRF) data
Physical activity features
Participants wore an activity monitor (AX3, Axivity, Newcastle upon Tyne, UK) consisting of a tri-axial accelerometer and temperature sensor on the lower back (centered over the midpoint of L5-S1) for 7 days at baseline, with 3D acceleration data sampled at 100 Hz with a range of ±8 g. Accelerometer data were extracted and processed at a centralized reading center. At least 10 hours of wear time were required for a day to be considered valid.[23] Non-wear was defined as periods of 10 minutes with no movement (standard deviation < threshold) and verified using the temperature sensor in the device (change from average temperature > threshold). Summary metrics were calculated for each day, including time spent walking, time spent lying, and the mean 3D signal vector magnitude, which describes the overall magnitude of acceleration across all three dimensions (Equation 1). Time spent walking and lying were expressed as percent of wear time to account for possible differences in minutes of accelerometer wear among individuals.[24] Metrics were averaged across all valid days per participant (Table 2). We excluded participants with fewer than 3 valid days of data.[25]
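The daily summary logic described above can be sketched in Python as follows. This is an illustrative sketch only: the signal vector magnitude is computed here as the Euclidean norm of the three acceleration axes, which is the usual definition but is an assumption about the form of Equation 1, and the function names and dictionary keys are hypothetical.

```python
import math


def mean_signal_vector_magnitude(samples):
    """Mean Euclidean norm of (ax, ay, az) acceleration samples, in g."""
    magnitudes = [math.sqrt(ax * ax + ay * ay + az * az) for ax, ay, az in samples]
    return sum(magnitudes) / len(magnitudes)


def summarize_participant(days, min_wear_hours=10.0, min_valid_days=3):
    """Average daily metrics over valid days.

    days: list of dicts with hypothetical keys 'wear_hours', 'walk_hours',
    'lie_hours', and 'mean_svm'. Returns None when the participant has
    fewer than min_valid_days valid days (i.e., would be excluded).
    """
    valid = [d for d in days if d["wear_hours"] >= min_wear_hours]
    if len(valid) < min_valid_days:
        return None
    n = len(valid)
    return {
        # Walking and lying time are expressed as percent of wear time.
        "pct_walking": sum(100.0 * d["walk_hours"] / d["wear_hours"] for d in valid) / n,
        "pct_lying": sum(100.0 * d["lie_hours"] / d["wear_hours"] for d in valid) / n,
        "mean_svm": sum(d["mean_svm"] for d in valid) / n,
    }


# Hypothetical week: three valid days and one day with too little wear time.
days = [
    {"wear_hours": 10.0, "walk_hours": 1.0, "lie_hours": 2.0, "mean_svm": 1.0},
    {"wear_hours": 20.0, "walk_hours": 2.0, "lie_hours": 4.0, "mean_svm": 1.0},
    {"wear_hours": 10.0, "walk_hours": 1.0, "lie_hours": 2.0, "mean_svm": 1.0},
    {"wear_hours": 4.0, "walk_hours": 3.0, "lie_hours": 0.0, "mean_svm": 2.0},
]
summary = summarize_participant(days)
```

Expressing walking and lying time relative to wear time keeps the short-wear day from distorting the averages even before it is dropped as invalid.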
Table 2. Baseline gait and physical activity features
Outcome
Two musculoskeletal radiologists (AG, FWR) scored the severity of cartilage damage in 5 medial tibiofemoral subregions of the study knee at each timepoint using the MRI Osteoarthritis Knee Score (MOAKS).[26] We defined medial cartilage worsening as any within-grade or ≥ full grade increase in area and/or depth in at least one of the 5 medial tibiofemoral cartilage subregions over the 2-year period, as has been done previously in MOST.[10, 27]
Machine learning model
Data preparation, model development, and evaluation were performed in R (v4.0.5). We examined Pearson correlations between all continuous features and, in the case of high correlations (r > 0.85), selected one feature to retain for analysis. We used the predictive mean matching algorithm within the multiple imputation by chained equations (MICE) framework (v3.13.0) to impute missing exposure data (<0.1%).[28] Continuous features were scaled and centered. We randomly split the data into 70% training data and 30% test data while preserving the proportion of the outcome in both the training and test sets (i.e., a stratified split).[29]
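A split that keeps the outcome proportion the same in the training and test sets can be sketched in Python as below. This is a minimal illustration rather than the study's R code: the function name and seed handling are hypothetical, and only the sample size (947) and number of worsened knees (133) are taken from the reported results.

```python
import random


def stratified_split(outcomes, train_fraction=0.7, seed=0):
    """Return (train_indices, test_indices), splitting each outcome class separately."""
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(set(outcomes)):
        indices = [i for i, y in enumerate(outcomes) if y == label]
        rng.shuffle(indices)
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# 133 knees with cartilage worsening (1) and 814 without (0).
outcomes = [1] * 133 + [0] * 814
train_idx, test_idx = stratified_split(outcomes, train_fraction=0.7, seed=1)

train_rate = sum(outcomes[i] for i in train_idx) / len(train_idx)
test_rate = sum(outcomes[i] for i in test_idx) / len(test_idx)
```

With an outcome prevalence of about 14%, an unstratified 30% test set could by chance contain noticeably fewer or more events, which would make the AUC on that split less stable.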
Our goal was to predict the binary outcome of cartilage worsening from the GRF, accelerometer, and clinical/demographic data. We used “super learning” (v1.4.2),[30] an ensemble machine learning approach that combines several candidate machine learning algorithms in order to enhance the accuracy of prediction above and beyond individual algorithms (Figure 3). Our candidate learners included Bayesian additive regression trees, generalized linear model, least absolute shrinkage and selection operator, ridge regression, elastic net, random forest, and extreme gradient boosting models. The candidate learners were trained through 5-fold cross-validation, and the corresponding predictions on the out-of-fold samples were used to develop a meta learner that optimized the weight (i.e., contribution) of each individual learner. We then validated this trained model by applying it to the held-out test set and assessing its performance by the area under the receiver operating characteristic curve (AUC).
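The meta-learning step and the AUC evaluation can be illustrated with a small Python sketch. The study used the SuperLearner package in R; the code below is not that implementation. As a labeled simplification, it assumes the out-of-fold predictions from each candidate learner are already available, and it picks convex weights by a coarse grid search that minimizes squared error. The function names and the toy predictions are hypothetical.

```python
from itertools import product


def convex_weight_grid(n_learners, steps=10):
    """Yield weight vectors on a grid that are non-negative and sum to 1."""
    for combo in product(range(steps + 1), repeat=n_learners):
        if sum(combo) == steps:
            yield [c / steps for c in combo]


def fit_meta_weights(oof_predictions, y, steps=10):
    """Choose learner weights minimizing squared error on out-of-fold predictions.

    oof_predictions: one list of predicted probabilities per candidate learner.
    """
    best_weights, best_loss = None, float("inf")
    for weights in convex_weight_grid(len(oof_predictions), steps):
        loss = 0.0
        for i, y_i in enumerate(y):
            blended = sum(w * preds[i] for w, preds in zip(weights, oof_predictions))
            loss += (y_i - blended) ** 2
        if loss < best_loss:
            best_weights, best_loss = weights, loss
    return best_weights


def auc(scores, labels):
    """AUC as the probability that a positive case outranks a negative one (ties count 0.5)."""
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy example: one informative learner and one uninformative learner.
y = [0, 0, 1, 1]
learner_a = [0.1, 0.2, 0.8, 0.9]
learner_b = [0.5, 0.5, 0.5, 0.5]
weights = fit_meta_weights([learner_a, learner_b], y)
ensemble = [weights[0] * a + weights[1] * b for a, b in zip(learner_a, learner_b)]
ensemble_auc = auc(ensemble, y)
```

In this toy case all of the weight goes to the informative learner; in practice the weights are spread across learners according to how much each improves the out-of-fold fit.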
Figure 3. Machine learning model development and evaluation
To test the robustness and reproducibility of the model training and testing, we used repeated cross-validation: we repeated the process of randomly splitting the data into training and test sets, training the model by super learning, and evaluating its performance on the held-out test set 100 times. Here we report the median (i.e., 50th percentile) and the 2.5th and 97.5th percentiles of the AUC across the 100 iterations.
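The summary across repeated splits can be sketched in Python as follows. This is an illustration under assumptions: `evaluate_split` is a hypothetical callable standing in for one full split, train, and test cycle, and the linear-interpolation percentile rule is one common convention that may differ from the one used in the study.

```python
def percentile(values, q):
    """Percentile with linear interpolation between order statistics (q in [0, 100])."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def summarize_repeats(evaluate_split, n_repeats=100):
    """Run evaluate_split(seed) n_repeats times and summarize the resulting AUCs."""
    aucs = [evaluate_split(seed) for seed in range(n_repeats)]
    return {
        "median": percentile(aucs, 50),
        "p2.5": percentile(aucs, 2.5),
        "p97.5": percentile(aucs, 97.5),
    }


# Placeholder evaluation returning made-up AUCs; a real one would split the
# data with the given seed, fit the ensemble, and score the held-out set.
def fake_evaluate_split(seed):
    return 0.60 + 0.002 * seed


summary = summarize_repeats(fake_evaluate_split, n_repeats=100)
```

Reporting the 2.5th and 97.5th percentiles describes the spread of performance across splits; it is not a confidence interval for a single model's AUC.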
Identification of influential predictors and marginal causal risk differences
A Variable Importance Measure (VIM) statistic was calculated for each of the 37 features for each data split, based on the size of the risk difference between the model fit with and without the feature. Thus, for each split, a list of 37 VIMs was produced. The top contributors to prediction for each split were identified as the 10 features with the highest VIMs. We defined “influential predictors” as the 10 features that most frequently appeared as top contributors to prediction across the 100 splits.
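The frequency-based selection described above can be sketched in Python. The sketch takes the per-split VIM values as given and only shows the ranking and counting; the function names and the three-feature toy data are hypothetical, and the example uses k = 2 instead of 10 to stay small.

```python
from collections import Counter


def top_contributors(vims, k=10):
    """Return the k features with the highest VIM for one data split."""
    return sorted(vims, key=vims.get, reverse=True)[:k]


def influential_predictors(vims_per_split, k=10):
    """Count how often each feature is a top contributor and return the k most frequent."""
    counts = Counter()
    for vims in vims_per_split:
        counts.update(top_contributors(vims, k))
    return counts.most_common(k)


# Toy example with three splits and three features.
vims_per_split = [
    {"baseline_damage": 3.0, "klg": 2.0, "gait_speed": 1.0},
    {"baseline_damage": 3.0, "gait_speed": 2.0, "klg": 1.0},
    {"klg": 5.0, "baseline_damage": 1.0, "gait_speed": 0.0},
]
ranking = influential_predictors(vims_per_split, k=2)
```

Counting appearances across splits rewards features that are consistently useful, so a feature with one very large VIM in a single split does not dominate the list.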
To quantify the effect of the influential predictors identified by the super learner model on the cartilage worsening outcome, we utilized the parametric g-computation method.[31] Continuous variables were categorized into tertiles. For each predictor, we calculated the marginal causal risk difference of each category of the predictor on cartilage worsening, compared to the corresponding reference category, using riskCommunicator (v1.0.0) in R.
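The idea behind a marginal risk difference from g-computation can be illustrated with a short Python sketch. The study used parametric g-computation via riskCommunicator in R; the sketch below swaps in a simpler, saturated outcome model (the observed risk within each exposure-by-confounder cell) with a single categorical confounder, so it shows the standardization step but is not the authors' model. All names and counts are hypothetical.

```python
from collections import defaultdict


def standardized_risk(rows, exposure_level):
    """Average predicted risk if every individual had the given exposure level.

    rows: (exposure, confounder, outcome) tuples with outcome coded 0 or 1.
    """
    events = defaultdict(int)
    totals = defaultdict(int)
    for exposure, confounder, outcome in rows:
        events[(exposure, confounder)] += outcome
        totals[(exposure, confounder)] += 1

    risk_sum = 0.0
    for _, confounder, _ in rows:
        cell = (exposure_level, confounder)
        if totals.get(cell, 0) == 0:
            raise ValueError("no data for this exposure/confounder combination")
        # Predict each person's risk from their own confounder level,
        # with the exposure set to exposure_level.
        risk_sum += events[cell] / totals[cell]
    return risk_sum / len(rows)


def risk_difference(rows, level, reference):
    """Marginal risk difference of level versus reference."""
    return standardized_risk(rows, level) - standardized_risk(rows, reference)


def make_rows(exposure, confounder, n, n_events):
    return [(exposure, confounder, 1)] * n_events + [(exposure, confounder, 0)] * (n - n_events)


# Hypothetical data: exposure tertile (low/high) and one binary confounder.
rows = (
    make_rows("low", 0, 10, 1) + make_rows("high", 0, 10, 2)
    + make_rows("low", 1, 10, 2) + make_rows("high", 1, 10, 4)
)
rd = risk_difference(rows, "high", "low")
cases_per_100 = 100 * rd
```

Multiplying the risk difference by 100 gives the number of additional cases per 100 individuals in the exposed category relative to the reference category.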
RESULTS
Model performance
Of 947 participants (KLG ≤ 2) included in the analysis, 133 (14%) experienced medial tibiofemoral cartilage worsening in the study knee over 2 years. Across 100 splits, the median AUC (2.5th-97.5th percentile) on the held-out test sets was 0.73 (0.64-0.80).
Influential predictors and marginal risk differences
The features most frequently appearing as top contributors to prediction across 100 data splits (and frequency of appearance) were baseline medial tibiofemoral cartilage damage (100), KLG (97), lateral GRF impulse (43), WOMAC pain during walking (39), time spent walking (33), vertical GRF impulse (32), gait speed (29), vertical GRF 1st peak (28), time spent lying (27), and timing of early lateral GRF peak (26). Marginal risk differences (Figure 4) can be interpreted as the difference in risk of cartilage worsening per 100 individuals in the given category compared to the referent category. Presence of baseline cartilage damage, higher KLG, greater lateral GRF impulse, and greater pain during walking were associated with increased risk of cartilage worsening over 2 years.
Figure 4. Causal risk differences for influential predictors identified from the machine learning model
DISCUSSION
An ensemble machine learning approach incorporating gait, physical activity, and clinical/demographic features showed good performance (median AUC = 0.73) in predicting medial tibiofemoral cartilage worsening over 2 years in people without or with early radiographic OA. While determining the relationships among predictors and outcomes in machine learning models is challenging, our analysis suggests that addressing high lateral GRF impulse may be a potential target to reduce cartilage worsening.
Model performance
The current model performance is comparable to other machine learning models predicting OA progression from clinical/demographic data. Du et al. reported AUCs of 0.70-0.79 for predicting radiographic worsening (increase in KLG or in medial or lateral joint space narrowing) over 2 years from baseline cartilage damage features on MRI, but included individuals with KLG 0 to 4.[32] Tiulpin et al. reported AUCs of 0.73-0.75 for predicting worsening (increase in KLG or knee joint replacement) over 7 years from baseline age, sex, BMI, injury, surgery, WOMAC, and KLG in individuals with KLG < 2.[33] The current model achieved a similar AUC for predicting cartilage worsening over 2 years in individuals with KLG ≤ 2, with the added benefit of providing information about potentially modifiable gait and physical activity predictors.
Prior longitudinal studies of gait and OA typically examined knee-specific loading (e.g., knee adduction moment) rather than GRFs, often with samples of 15 to 300 knees.[5] Correspondingly, few addressed clinical and demographic confounders, incorporated physical activity predictors, or used cross-validation to examine performance in independent test sets. Further, many of these studies were conducted in samples with established OA (KLG ≥ 2), limiting their potential to identify those at risk of progression early in the disease process or identify targets for early intervention. The current sample included 947 individuals with KLG ≤ 2, who were predominantly pain free or had mild pain during walking, and thus were younger with lower BMI than what has been reported previously for established OA samples.[18]
Predictors of OA progression
The machine learning model identified multiple gait and physical activity features as influential predictors of cartilage worsening in knees without or with early radiographic disease. The only significant result from the g-computation analyses, however, was for lateral GRF impulse: the risk of cartilage worsening was 5.5 percentage points higher in the highest compared to the lowest tertile (i.e., 5.5 additional cases per 100 individuals). In a cross-sectional study in the same cohort, we have previously reported that limbs with radiographic OA, with or without knee pain, have higher lateral GRFs in early stance compared to limbs free from both radiographic OA and pain.[34] The current results suggest lateral GRF may also play a role in progression.
The appearance of structure and symptom features as influential predictors is not surprising given that these are established risk factors for progression. Of note, despite only 10.1% of the sample having established radiographic OA (KLG = 2), 39.2% had baseline cartilage damage, and both KLG and baseline damage appeared as influential predictors in the model. In the g-computation analysis, the risk of cartilage worsening was 15.4 percentage points higher with baseline damage compared to no damage (i.e., 15.4 additional cases per 100 individuals), and 14.3 percentage points higher with KLG 2 versus 0. The lack of risk difference for KLG 1 versus 0 may highlight limitations of the KLG scoring system, which does not reflect tissue-level damage well, particularly in early disease.[35, 36] WOMAC pain during walking also appeared as an influential predictor along with these structural measures. While those with mild pain had a higher risk of cartilage worsening (by 6.6 percentage points) compared to those with no pain, those with moderate or higher pain did not have a significantly higher risk. The wide confidence interval and null result for moderate or higher pain could stem from the small proportion of knees in this group (3% of the sample) and/or heterogeneity within it.
Clinical implications
The utility of this model for risk screening is debatable, as it requires collection of GRFs. While easier and faster to collect than joint moments, collecting GRFs requires specialized equipment (force platform). Future advances in wearable technologies may facilitate capture of gait information during everyday life, including estimates of GRFs,[37, 38] improving the potential of this type of model as a risk screening tool.
This model also identified potential gait and physical activity intervention targets that are of interest for further study. Interestingly, two influential predictors (baseline damage, KLG) appeared as top contributors in at least 97% of the data splits while the others appeared less consistently (<50%), and only one gait or physical activity predictor had a significant risk difference among tertiles. While we removed highly correlated features, this may result in part from predictors that capture similar constructs (e.g., four features that collectively describe an important construct could each appear as top contributors 25% of the time). Similarly, our g-computation approach provides insight into causal pathways but does not account for potential concurrent changes in several risk factors. An important motivation for using machine learning was to address potential interactions among gait, physical activity, and clinical/demographic factors. While it is challenging to identify these underlying relationships from the model, the lack of consistency in top contributors and null g-computation results could speak to a need for simultaneous intervention on several gait and physical activity features rather than a single feature, opening interesting avenues for future study.
Strengths and limitations
Strengths of this study include the large sample, investigation of risk factors in early disease, incorporation of both gait and physical activity, use of machine learning to address potential relationships among predictors, and use of g-computation to quantify the effects of these predictors. These strengths allowed us to expand existing literature by accounting for patient demographics and clinical characteristics, examining multiple gait and physical activity features, and testing our model in independent samples. While this study was performed in the large MOST dataset, our sample was limited to persons with KLG ≤ 2 and was largely White with little to no pain during walking, thus these results may not generalize to more diverse populations or those with severe symptoms. Lateral or patellofemoral worsening could have been present in both outcome groups, resulting in less clear separation between the two. Knee joint specific loading (e.g., knee adduction moment) is not available in MOST, limiting comparisons to prior longitudinal gait studies. Additionally, we are unaware of other large datasets with gait, physical activity, and MR outcome data that could be used for external validation. However, we utilized repeated cross-validation to provide information on reproducibility. Last, while 3D GRFs were well characterized (21 features), physical activity was described by only 3 features. Better characterization of dynamic physical activity patterns may improve model performance and ability to identify relevant intervention targets.
Conclusion
Using an ensemble machine learning approach, we predicted medial tibiofemoral cartilage worsening over 2 years in persons without or with early radiographic osteoarthritis with good performance on independent samples. Additionally, we identified gait and physical activity measures associated with cartilage worsening that may be potential early intervention targets, in particular high lateral ground reaction force impulse.
DATA AVAILABILITY STATEMENT
Data are available in a public, open access repository. Data from the Multicenter Osteoarthritis Study are available through the National Institute on Aging Research Biobank: https://agingresearchbiobank.nia.nih.gov/. Public use datasets from the 144- and 168-month visits of the Multicenter Osteoarthritis Study will be available at this website in 2023.
ETHICS STATEMENT
Patient consent for publication
Not applicable.
Ethics approval
Ethical approval for the Multicenter Osteoarthritis Study was obtained at each of the four core sites (Boston University, University of Alabama at Birmingham, University of California San Francisco, University of Iowa) and all participants provided informed consent prior to participating.
@kecostello, @srjafarz, @ali_guermazi, @FrankRoemer1, @ProfCaraLewis, @vkola_lab, @ProfDeepakKumar
CONTRIBUTORS
Responsibility for the integrity of the work as a whole: K.E. Costello, D. Kumar
Conception and design: K.E. Costello, D. Kumar, D.T. Felson
Analysis and interpretation of the data: All authors
Drafting of the article: K.E. Costello
Critical revision of the article for important intellectual content: All authors
Final approval of the article: All authors
Provision of study materials or patients: D.T. Felson, C.E. Lewis, M.C. Nevitt, N.A. Segal
FUNDING
MOST is comprised of four cooperative grants [D.T. Felson (BU) – AG18820, J.C. Torner (UI) – AG18832, C.E. Lewis (UAB) – AG18947, and M.C. Nevitt (UCSF) – AG19069] funded by the National Institutes of Health (NIH), a branch of the Department of Health and Human Services, and conducted by MOST study investigators. Research reported in this publication was also supported under award numbers F32AR076907 (K.E. Costello), P30AR072571 (D.T. Felson), 1UL1TR001430 (D.T. Felson), R21AR074578 (S.R. Jafarzadeh), R03AG060272 (S.R. Jafarzadeh), K23AR063235 (C.L. Lewis), R01HL159620 (V.B. Kolachalama), R21CA253498 (V.B. Kolachalama), and K01AR069720 (D. Kumar) from the National Institutes of Health and 17SDG33670323 (V.B. Kolachalama) and 20SFRN35460031 (V.B. Kolachalama) from the American Heart Association. This manuscript was prepared using MOST data and does not necessarily represent the official views of MOST investigators or the National Institutes of Health. The National Institutes of Health was not involved in study design, collection, analysis or interpretation of data, or the decision to submit this manuscript for publication.
COMPETING INTERESTS
NAS reports personal fees from Tenex Health and grants from Pacira Bioscience, Inc., outside of the submitted work. AG is shareholder of BICL, LLC and consultant to Pfizer, AstraZeneca, Novartis, TissueGene, Regeneron and MerckSerono. FWR is shareholder of BICL, LLC and consultant to Grünenthal. All other authors have no competing interests to report.
ACKNOWLEDGEMENTS
The authors would like to acknowledge the contributions of the MOST participants and clinic staff.