Abstract
Parkinson’s Disease (PD) is the second most common neurodegenerative disorder globally, and current screening methods often rely on subjective evaluations. We developed deep learning-based classification models using mouse trace data collected via a web application. A total of 315 participants (73 PD, 179 non-PD, 63 suspected PD) completed three hand movement tasks: tracing a straight line, a spiral, and a sine wave. We developed three types of models: (1) engineered feature models, (2) computer vision models, and (3) multimodal models. Feature importance was evaluated using Gradient Shapley Additive Explanations (GradShap). The multimodal Vision Transformer (ViT) model achieved the highest performance, with F1 scores of 0.8413 ± 0.0336 (PD vs. non-PD), 0.8520 ± 0.0014 (suspected PD vs. non-PD), and 0.7034 ± 0.0017 (PD vs. suspected PD). Image data proved most influential in predicting PD outcomes. These findings suggest that models trained on confirmed PD diagnoses hold significant promise for early-stage PD screening at the population level.
Introduction
Parkinson’s Disease (PD) is a neurodegenerative disorder that significantly impacts the central nervous system. Major symptoms include tremors, bradykinesia (slowness of movement), muscle rigidity, and postural instability1,2, which progressively worsen over time, leading to difficulties in performing routine tasks such as typing and using a mouse. The progression of these symptoms significantly affects quality of life, making early and accurate diagnosis crucial to enable early intervention3. PD is the second most common neurodegenerative disease after Alzheimer’s, affecting approximately 10 million people globally. In the United States, around one million individuals are diagnosed with PD, with an annual increase of about 90,000 new cases. This number is projected to rise to 1.2 million by 20303,4. Currently, there is no definitive biomarker for PD, and diagnosis is primarily based on clinical symptoms and neuropsychological tests such as the Mini-Mental State Examination (MMSE) and the Unified Parkinson’s Disease Rating Scale (UPDRS)5,6,7. These tests involve questionnaires and subjective evaluations by clinicians, which can lead to significant biases and potential misdiagnoses6. This is particularly problematic, as PD symptoms often overlap with those of other age-related conditions and drug-induced Parkinsonism (DIP)8,9,10. Additionally, PD is primarily caused by the degeneration of dopamine-producing neurons in the brain, and by the time motor symptoms become apparent, approximately 60% of these neurons have already deteriorated8. Therefore, early and accurate detection of PD is essential for effective management and treatment.
Previous digital health research on PD classification included the analysis of hand and finger movements, keystroke dynamics, speech, handwriting, drawing tests, and sensor data from accelerometers and gyroscopes11–25. The use of sensors such as accelerometers and gyroscopes placed on lower limbs, wearable sensors augmented by video recordings, and sensing coils or paper-based pads have proven effective in classifying PD and detecting tremors16-27. However, these methods often require controlled laboratory settings and specialized devices, limiting their broader applicability. Self-administered methods, such as keyboard interactions, keystroke dynamics, and smartphone screen interactions, have also been explored but may introduce biases, particularly against individuals with slower typing speeds13,14,20. Mobile applications for data collection, symptom monitoring, and treatment management have demonstrated utility in tracking activities like finger tapping speed, gait, and motor performance, though they pose challenges for older adults unfamiliar with smartphone technology26-29. Multimodal approaches combining data from speech, gait, and upper limb movements have demonstrated potential in classifying PD patients but often require controlled environments and specialized equipment, which limits their use in real-world contexts30,31.
We aim to address these challenges and advance the field of digital PD screening by utilizing structured mouse trace data collected through a short 10-minute test delivered on a user-friendly web application. Participants recruited for the study provided demographic information and completed tasks involving the tracing of spiral, straight, and sine wave patterns using their mouse. We performed feature engineering on the collected mouse trace data and created images from the mouse movement patterns to develop computer vision models. We employed a variety of deep learning models, such as TabTransformer, DenseNet 201, ResNet 50, and MobileNet V2, for analyzing engineered features. For the mouse trace images, we utilized state-of-the-art computer vision models including Vision Transformer (ViT), Shifted Window Transformer (SwinT), DenseNet 201, ResNet 50, and MobileNet V2. To enhance classification performance, we developed multimodal models that combined the engineered features with the mouse trace images. We analyzed the performance of the models on three train-test configurations: the first used PD and non-PD data for both training and testing; the second trained on PD and non-PD data and tested on suspected PD and non-PD data; and the third trained on suspected PD and non-PD data and tested on PD and non-PD data.
Methods
We created a website to collect structured mouse movement data by having participants complete a series of mouse tracing tests remotely. We trained a series of machine learning models using various features derived from the mouse traces. Figure 1 outlines the workflow from data collection to model evaluation and interpretability analysis.
Ethics approval
The study was approved by the University of Hawaii Institutional Review Board (IRB, protocol #2023-00948).
Participant Recruitment & Data Collection
We recruited participants for this study through both online methods (email, social media posts) and offline methods (community meetings and conventions in Hawaii). We collaborated with the Hawaii Parkinson Association and Beyond Rehab to post flyers and advertise the study to various PD listservs. Additionally, we established a recruitment booth at the 2023 and 2024 Hawaii Parkinson’s Association Symposiums, where we provided potential participants with flyers describing how to complete the study.
We collected data via a web application that we developed (https://parkinsonsurvey.github.io/), illustrated in Figure 2. Participants provided demographic and disease-related information, including age, sex, and dominant hand. Due to the absence of official diagnostic documents and biomarkers for PD, self-reporting was used, with an option to select “suspected PD”.
Participants used a physical mouse on their desktop or laptop, or their trackpad, to trace a straight line, sine wave, and spiral wave on the website. We visualized their progress and alignment with the lines through highlighted portions and start/end markings. We developed the website using HTML and Bootstrap for the interface and visuals, and JavaScript to track cursor position every 500 milliseconds. The data collected included mouse position (X, Y axis), time (milliseconds), and whether the mouse was inside the line (True or False). The web application also captured screen dimensions and operating system details for contextual information. Upon completing the test, all data were securely transmitted and stored in a Firebase collection.
Feature Engineering & Mouse Trace Image Generation
We collected mouse position, time, and line alignment data, along with screen dimensions. From these data, we calculated the following engineered features: the mean deviation from the line for the straight-line tracing and, for each of the three tasks (straight line, sine wave, and spiral), the time taken to complete the trace, the percentage of points traced inside the line, and the number of points traced inside the line.
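As an illustration only, the engineered features described above could be computed from a list of cursor samples roughly as follows. This is a minimal sketch and not the study's code: the `TracePoint` fields, the `engineered_features` helper, and the assumption that the straight-line target is horizontal at a known `line_y` are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TracePoint:
    """One cursor sample; field names are illustrative, not the study's schema."""
    x: float
    y: float
    t_ms: float
    inside: bool  # whether the cursor was inside the target line


def engineered_features(points, line_y=None):
    """Summarize one tracing task.

    line_y is the assumed vertical position of a horizontal straight-line
    target; pass it only for the straight-line task.
    """
    n_inside = sum(p.inside for p in points)
    features = {
        "time_taken_ms": points[-1].t_ms - points[0].t_ms,
        "n_points_inside": n_inside,
        "pct_points_inside": 100.0 * n_inside / len(points),
    }
    if line_y is not None:
        # Mean absolute vertical distance from the (assumed horizontal) line.
        features["mean_deviation"] = mean(abs(p.y - line_y) for p in points)
    return features


# Hypothetical four-sample trace of the straight-line task.
trace = [
    TracePoint(0, 100, 0, True),
    TracePoint(50, 104, 500, True),
    TracePoint(100, 112, 1000, False),
    TracePoint(150, 100, 1500, True),
]
features = engineered_features(trace, line_y=100)
```

Calling the same helper without `line_y` would give the three shared features for the sine wave and spiral tasks.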
To generate mouse trace images, we created canvases matching the participants’ screen sizes and visualized the trace using the recorded X and Y coordinates over time. We marked traces outside the line in red and those inside the line in green.
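The color coding can be sketched in a few lines. The example below is a simplified stand-in for the rendering step: it plots individual samples onto a plain grid of RGB tuples instead of drawing a connected trace with an imaging library, and the `render_trace` helper, the white background, and the exact color values are assumptions.

```python
GREEN = (0, 128, 0)      # samples inside the target line
RED = (255, 0, 0)        # samples outside the target line
WHITE = (255, 255, 255)  # assumed background color


def render_trace(samples, width, height):
    """Rasterize (x, y, inside) samples onto a width x height RGB grid."""
    canvas = [[WHITE] * width for _ in range(height)]
    for x, y, inside in samples:
        col, row = int(round(x)), int(round(y))
        # Ignore samples that fall outside the canvas.
        if 0 <= col < width and 0 <= row < height:
            canvas[row][col] = GREEN if inside else RED
    return canvas


# Hypothetical samples on a tiny 5 x 4 canvas; the last one is off-canvas.
samples = [(1, 1, True), (2, 1, True), (3, 2, False), (9, 9, True)]
image = render_trace(samples, width=5, height=4)
```

In practice the canvas would match each participant's screen dimensions, as described above, and the result would be resized to the input resolution expected by the vision model.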
Model Development
We developed three sets of models (Figure 3) using different data types. The first set of models processed engineered features through TabTransformer, DenseNet 201, ResNet 50, and MobileNet V2 models. For DenseNet 201, ResNet 50, and MobileNet V2, which are designed for image data, we added sequential layers to convert engineered features into shapes accepted by these CNN models. The second set of models used image data with ViT, SwinT, DenseNet 201, ResNet 50, and MobileNet V2 models. We performed transfer learning by unfreezing the last 45 layers and replacing the classification layers to align with our classes. The third set of models, the multimodal models, passed engineered features through sequential layers and image data through ViT, SwinT, DenseNet 201, ResNet 50, and MobileNet V2 layers. The combined features from these networks were passed through hidden layers to obtain classification results, with the last 45 layers of the models unfrozen. All models were hyperparameter-tuned using Optuna and trained for up to 50 epochs with early stopping set to a patience of 5. All models were configured as binary classifiers. Models with similar structures were selected for all three analyses to enable comparison across modalities.
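The early-stopping rule mentioned above can be illustrated with a short sketch. It assumes that validation loss is the monitored quantity, which is not stated in the text, and the `train_with_early_stopping` helper takes a precomputed list of per-epoch losses in place of a real training loop.

```python
def train_with_early_stopping(val_losses, max_epochs=50, patience=5):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    val_losses stands in for the loss a real training loop would compute
    after each epoch (an assumption made for this illustration).
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch  # stop early
    return min(len(val_losses), max_epochs), best_epoch


# Hypothetical loss curve: the best value occurs at epoch 3.
losses = [0.70, 0.61, 0.55, 0.56, 0.57, 0.58, 0.56, 0.59, 0.60]
stop_epoch, best_epoch = train_with_early_stopping(losses)
```

With this curve, training halts at epoch 8, five epochs after the best loss at epoch 3, and the weights from the best epoch would normally be restored.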
Data Splitting
We used three distinct evaluation approaches. The first approach focused exclusively on data labeled as PD and non-PD, excluding the participants with suspected PD. In this split, we used 5-fold cross validation with 500 bootstrapped samples. This approach enabled us to quantify the utility of mouse trace data for remote PD prediction.
In the second approach, we trained the models using all of the PD and 60% of non-PD data but tested them on data labeled as suspected PD and the rest of the non-PD data. This approach enabled us to evaluate the ability of models trained to detect confirmed PD to identify possibly more subtle and earlier signs of PD or other tremor-related conditions. We applied 500 bootstrapped resamples of the test set to generate standard deviation error bars.
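A minimal sketch of bootstrapping a fixed test set to obtain a mean and standard deviation is shown below. It assumes binary F1 with PD (label 1) as the positive class. The `f1_score` and `bootstrap_f1` helpers and the example labels are illustrative and not taken from the study.

```python
import random
from statistics import mean, stdev


def f1_score(y_true, y_pred):
    """Binary F1 with label 1 as the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1(y_true, y_pred, n_resamples=500, seed=0):
    """Resample test cases with replacement; return (mean, std) of F1."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx],
                               [y_pred[i] for i in idx]))
    return mean(scores), stdev(scores)


# Hypothetical labels and predictions for ten test participants.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
f1_mean, f1_std = bootstrap_f1(y_true, y_pred)
```

The standard deviation across resamples is what would be drawn as the error bar; the model itself is not retrained between resamples.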
In the third approach, we trained the models using all of the suspected PD data and 60% of the non-PD data but tested them on data labeled as PD and the rest of the non-PD data, with 500 bootstrapped samples. This approach enabled us to evaluate the ability of models trained to detect suspected PD cases to identify confirmed PD.
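The second and third approaches share the same splitting logic, which can be sketched as follows. The `cross_group_split` helper, the synthetic participant IDs, the random seed, and the rounding of the 60% cut are assumptions; only the group sizes come from the study.

```python
import random


def cross_group_split(records, train_positive, test_positive,
                      train_fraction=0.6, seed=0):
    """Train on one positive group, test on another, sharing non-PD controls.

    records: list of (participant_id, label) pairs with label in
    {"PD", "suspected PD", "non-PD"}.
    """
    rng = random.Random(seed)
    controls = [r for r in records if r[1] == "non-PD"]
    rng.shuffle(controls)
    cut = round(train_fraction * len(controls))  # rounding rule is assumed
    train = [r for r in records if r[1] == train_positive] + controls[:cut]
    test = [r for r in records if r[1] == test_positive] + controls[cut:]
    return train, test


# Synthetic IDs with the group sizes reported in the study.
records = ([(f"pd{i}", "PD") for i in range(73)]
           + [(f"sus{i}", "suspected PD") for i in range(63)]
           + [(f"ctl{i}", "non-PD") for i in range(179)])

# Approach 2: train on PD, test on suspected PD.
train2, test2 = cross_group_split(records, "PD", "suspected PD")
# Approach 3: train on suspected PD, test on PD.
train3, test3 = cross_group_split(records, "suspected PD", "PD")
```

Because the non-PD controls are partitioned before being assigned, no participant appears in both the training and the test set of a given approach.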
Saliency Map & Feature Importance
To identify important features for the best-performing model architecture, we created a Gradient SHAP-based (GradShap) saliency map and bar plots, as GradShap is optimized for identifying complex feature importances in deep learning models32. This analysis determined the predictive importance of images versus engineered features. For images, GradShap was applied to sine, straight, and spiral images, comparing model outputs with actual versus baseline (zero-filled) images. Attributions were summed to derive final importance values. The negative and positive signs of the attributions were preserved to understand the direction of prediction.
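To make the attribution step concrete, the sketch below applies a simplified gradient-based SHAP estimate to a toy logistic model with a zero-filled baseline, then sums the signed attributions per modality. It is a hand-written illustration under stated assumptions, not the study's pipeline: the model, its weights, the inputs, and the grouping into "image" and "engineered" values are hypothetical, and no noise is added to the inputs.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_shap(weights, x, baseline, n_samples=2000, seed=0):
    """Monte Carlo estimate of (x - baseline) * E[gradient of f], where
    f(v) = sigmoid(weights . v) and the gradient is evaluated at random
    points on the straight path from baseline to x."""
    rng = random.Random(seed)
    grad_sums = [0.0] * len(x)
    for _ in range(n_samples):
        alpha = rng.random()
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        p = sigmoid(sum(w * v for w, v in zip(weights, point)))
        for i, w in enumerate(weights):
            grad_sums[i] += p * (1.0 - p) * w  # d sigmoid(w.v) / d v_i
    return [(xi - b) * g / n_samples
            for xi, b, g in zip(x, baseline, grad_sums)]


# Hypothetical toy input: two "image" values and two "engineered" values.
weights = [1.5, -0.5, 0.8, -1.0]
x = [1.0, 2.0, 0.5, 0.3]
baseline = [0.0] * len(x)  # zero-filled baseline
groups = {"image": [0, 1], "engineered": [2, 3]}

attributions = grad_shap(weights, x, baseline)
# Signed sums per modality; signs are kept to show the direction of effect.
importance = {name: sum(attributions[i] for i in idx)
              for name, idx in groups.items()}
```

The attributions sum approximately to the difference between the model output on the input and on the baseline, which is why summing them per modality gives a meaningful share of the prediction.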
Results
Dataset
A total of 315 participants completed the data collection process between February 27, 2024, and June 3, 2024. Among the 315 participants, 73 self-reported themselves as having PD, 179 as non-PD, and the remaining 63 as suspecting a PD diagnosis. As shown in Figure 4 and Table 1, most participants were aged 50-69 years, predominantly right-handed, and used Windows devices.
5-Fold Cross Validation Predicting PD vs Non-PD
In our first analysis, using PD and non-PD data for both training and testing via 5-fold cross-validation, the multimodal ViT model showed the best performance, with accuracy (0.8706 ± 0.0302), sensitivity (0.8396 ± 0.0311), specificity (0.8396 ± 0.0311), PPV (0.8558 ± 0.0514), NPV (0.8558 ± 0.0514), and F1 score (0.8413 ± 0.0336). All models improved when transitioning from engineered features to image data. However, no consistent pattern of improvement was observed from image to multimodal data, except for the ViT model, which showed substantial gains, highlighting its effectiveness in integrating multiple data sources for strong predictive outcomes. The results from this analysis are presented in Table 2. The F1 score, sensitivity, and specificity are illustrated in Figure 5(a), with the best-performing model highlighted.
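For reference, the reported metrics have the standard binary definitions sketched below, with PD as the positive class. The `binary_metrics` helper and the confusion-matrix counts are hypothetical, and the text does not state whether the reported values were averaged across classes.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (no zero-division handling).
    """
    sensitivity = tp / (tp + fn)  # true positive rate (recall)
    specificity = tn / (tn + fp)  # true negative rate
    ppv = tp / (tp + fp)          # positive predictive value (precision)
    npv = tn / (tn + fn)          # negative predictive value
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": 2 * ppv * sensitivity / (ppv + sensitivity),
    }


# Hypothetical counts for one test fold of 50 participants.
metrics = binary_metrics(tp=12, fp=3, tn=33, fn=2)
```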
Training on PD vs Non-PD, testing on Suspected PD vs Non-PD
We trained models on PD and non-PD data and tested them on suspected PD and non-PD data. Performance improved when transitioning from engineered features to image data, and further to a multimodal approach. The ViT model achieved the highest metrics across all models, with accuracy, sensitivity, specificity, PPV, NPV, and F1 score all around 0.852. MobileNet V2 showed the largest gains, with a 40% improvement from engineered features to image data and a 20% increase from image to multimodal. Most models followed this trend, except for the SwinT model, which decreased in performance from image to multimodal data. Also, models using engineered features in this evaluation approach mostly performed worse than random guessing (less than 50%). The results from this analysis are presented in Table 3. The results for F1, sensitivity, and specificity are illustrated in Figure 5(b), with the best-performing model highlighted.
Training on Suspected PD vs Non-PD, testing on PD vs Non-PD
We trained models on suspected PD and non-PD data and tested them on confirmed PD and non-PD data. The ViT model using multimodal data showed the best performance, with accuracy, sensitivity, specificity, PPV, NPV, and F1 score all around 0.71. All models improved as the data type shifted from engineered features to image and then to multimodal. Most models performed significantly better than random guessing (above 50%), except for the TabTransformer, which underperformed, likely due to its attention mechanism failing to capture key feature interactions and data distribution shifts. Overall, these results suggest that most models can accurately predict confirmed PD when trained on suspected PD cases. The results from this analysis are presented in Table 4. The results for F1, sensitivity, and specificity are illustrated in Figure 5(c), with the best-performing model highlighted.
Multimodal ViT Results & Analysis
We utilized PCA and GradShap feature importance analysis, as depicted in Figure 6.
In Figure 6(a), where the model was tested on PD vs non-PD after being trained on a similar distribution of data, image features such as sine, straight, and spiral patterns emerged as highly influential for both positive and negative predictions. Among engineered features, the time taken to trace the patterns showed moderate importance, particularly in predicting PD, while other engineered features contributed minimally.
Figure 6(b), which shows the model tested on suspected PD vs non-PD (trained on PD and non-PD), reveals a similar trend with image features being the most significant contributors to the model’s predictions. The time taken to trace the patterns was critical for predicting PD, with screen height and width also playing a role, albeit to a lesser extent.
In Figure 6(c), where the model was tested on PD vs non-PD after being trained on suspected PD and non-PD, the influence of image features decreased compared to the previous cases. However, the time taken to trace the patterns remained a key factor in predicting PD, with screen height and width becoming more important than the image features in this scenario.
These analyses illustrate that the ViT model’s performance depends heavily on the image features when testing on PD and non-PD data as well as on suspected PD and non-PD data. However, when identifying PD after training on suspected PD, the engineered features play an important role in the model’s prediction.
Discussion
Principal Results
This study demonstrates that a multimodal ML approach using mouse trace data can contribute predictive power towards remote PD assessments. We employed three distinct evaluation approaches in our study. The first approach consisted of data labeled as PD and non-PD, excluding participants with suspected PD. This approach enabled us to quantify the contribution of mouse trace data towards PD diagnostics. The second approach trained models on PD and non-PD data while testing them on suspected PD and non-PD data. This enabled us to evaluate the ability of models trained on individuals with verified PD diagnoses to potentially serve as an early screening tool for motor conditions more broadly, as individuals who suspect they have PD are likely to be early in the progression of the disease and may have a wide range of possible motor conditions. The third approach trained models on suspected PD and non-PD data and tested them on PD and non-PD data. This enabled us to evaluate the ability of self-reported and suspected PD to predict actual PD diagnoses.
We explored three different model types and combinations of engineered features and images for multimodal PD detection. Across all three evaluation approaches, our models demonstrated improved performance as the data type transitioned from engineered to image to multimodal. However, the SwinT model showed decreased performance when transitioning from image to multimodal in the second approach, where models were trained on PD and non-PD data and tested on suspected PD and non-PD data. This decline is likely due to SwinT’s hierarchical structure, which, while effective for single-modality image processing, may struggle to integrate and balance the diverse features from different data types in a multimodal context, especially when identifying nuanced data labeled as suspected PD. Also, most of the models using engineered features failed to perform better than random guessing while detecting suspected PD due to this nuanced labeling of data. Moreover, when trained on suspected PD cases and tested on confirmed PD cases, TabTransformer performed worse than random guessing. This is likely due to its simplistic attention mechanism failing to adequately capture the data distribution and the interaction between images and engineered features from suspected PD cases. In contrast, the multimodal ViT model consistently outperformed other models across most metrics, likely due to its complex attention mechanisms, which are particularly well-suited for capturing complicated patterns and interactions across multiple data modalities.
GradShap analysis provided insights into feature importance for the ViT models, highlighting that mouse trace images were critical for model predictions. The times taken to trace the patterns were less influential in the first two approaches but had a significant influence in the third approach, where suspected PD data were used for training and PD data for testing. Additionally, screen dimensions contributed to the model’s predictive capability, with lesser significance in the first two approaches but higher significance in the third approach.
These findings suggest that ViT and similar multimodal models could be valuable in developing non-invasive, accurate diagnostic tools for PD, facilitating early detection and improved patient management. The interpretability provided by the analyses highlighted that while image data were most important when the test data consisted of identified PD cases, engineered features played an important role in predicting suspected PD cases. This difference in the importance of image data and engineered features across training and evaluation procedures likely stems from the nature of the data and the specific challenges in each case. For identified PD cases, mouse trace images show clearer patterns with respect to non-PD cases, making image features more useful for prediction. In suspected PD cases, the distinctions are more subtle, due to the earlier stage of PD in individuals without an official diagnosis. In this case, the model relied more on engineered features, like tracing time, to capture less obvious differences. This suggests that suspected PD cases require a broader use of data to make accurate predictions.
Comparison to previous works
Our study enhances previous research by introducing a novel approach to PD detection through the use of multimodal deep learning models, relying exclusively on mouse trace data and images captured during a brief 10-minute online test. Unlike prior methods that have explored hand and finger movements, keyboard typing patterns, keystroke dynamics, speech analysis, handwriting, drawing tests, and sensor data from accelerometers, gyroscopes, and smartphone interactions, our study focuses specifically on the remote collection of mouse tracing data, demonstrating the potential of mouse trace data alone to provide significant predictive power in predicting PD despite differences in mouse types and devices across participants.
Previous studies by Gil-Martin et al.33 and Pereira et al.34 focused on hand movement dynamics from spiral, meander, and other drawing shapes for PD analysis. However, their data collection was not remote, and they did not consider handedness, unlike our study. Their best models achieved accuracies of 97.7% and 83.77%, respectively. Goel et al.35 used pen-and-paper methods to collect spiral pattern data but also lacked remote testing and consideration of handedness, achieving an accuracy of 84.73%. Memedi et al.36 used a remote data collection method involving a touchscreen tablet and web interface, but their study spanned three years and involved only 65 participants, resulting in an accuracy of around 84.73%.
While our study differs significantly from these prior works in terms of data collection methods, the duration of data collection, remote accessibility, and the inclusion of handedness, these studies serve as important foundational works.
Our pilot study37, which explored the feasibility of an earlier version of our web application, achieved an accuracy of 74.29% and an F1 score of 73.11%. Our current model shows marked improvements in performance, reflecting the advancements and refinements made in our approach.
Limitations & Future work
This study has several limitations that should be considered in future work. First, the sample size of 315 participants split across three diagnostic categories may not be sufficient to generalize the findings across a diverse population. Additionally, our focus on mouse tracing data collected through a website does not fully capture all aspects of PD symptoms. Importantly, the study did not account for medication usage, specifically the on phase versus the off phase, which can significantly influence the presence and severity of symptoms like tremors38,39,40. Stress, which is known to exacerbate tremors, was also not accounted for, potentially affecting the results41,42. The type of mouse used by participants, such as whether they used an ergonomic mouse or a computer trackpad, could have influenced the prominence of tremor symptoms, introducing variability in the data. Furthermore, the impact of device type and handedness on the results remains unclear, as PD often affects one side of the body more than the other, and it is not certain that participants’ dominant hands were the ones most affected by the disease. While the ViT model demonstrated relatively strong performance, its computational complexity and resource requirements may limit its practical application in real-world settings. Future research should focus on optimizing these models for use on standard hardware without compromising performance and should incorporate additional data modalities, such as voice recordings and gait analysis, to provide a more comprehensive diagnostic approach.
Data Availability
An anonymized version of the data used in this study may be released upon completion of the ongoing data collection process.
Contributors
Conceptualization: PW, RSZ; Data collection: RSZ, ZNT, LS, SP; Web development: RSZ, SP, LS; Writing - primary writing: RSZ; Writing - editing: PW; Funding acquisition: PW; Data analysis: RSZ; Visualization: RSZ; Ideation: RSZ, LS, ZNT, SP, PW; Supervision: PW. All authors had full access to the data used in the study and had final responsibility for the decision to submit for publication.
Declarations of Interests
All authors declare no financial or non-financial competing interests.
Code Sharing
The code and the training/validation datasets for this study are not publicly available at the moment but may be made available to researchers on reasonable request to the first author.
Acknowledgements
The authors are grateful to the participants who took part in this study. This research was, in part, funded by the National Institutes of Health (NIH) Agreement No. 1OT2OD032581-01. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the NIH. AI tools such as ChatGPT were used only to revise the language and grammar of this manuscript, and their output was further edited by the authors.