Abstract
Depression is a widely prevalent psychiatric illness with variable levels of severity that necessitate different approaches to treatment. To enhance the management of this condition, there is a growing interest in utilizing mobile devices, especially smartphones, for remote monitoring of patients. This study aims to build prediction models for depression severity based on active and passive features collected from patients with major depressive disorder (MDD) and healthy controls to assess the feasibility of remote monitoring of depression severity.
Using data from 142 participants (85 healthy controls, 67 MDD) we extracted features such as GPS-derived mobility markers, ecological momentary assessments (EMA), age, and sex to develop machine learning models of depression severity on the different diagnostic subgroups in this cohort.
Our results indicate that the employed models outperformed baseline estimators in random split scenarios. However, the improvement was marginal in user-split scenarios, highlighting the need for larger and more diverse samples for clinical utility. Among the features, mood EMA emerged as the most influential predictor, followed by GPS-derived mobility features. Models also showed a significant association between depression severity and average reported mood, as well as GPS-derived mobility markers such as number of places visited and percent home.
While predicting composite depression scores is important, future studies could explore predicting individual symptom items or symptom groups for a more comprehensive assessment of depression severity. Challenges for clinical utility include participant dropout, which could be addressed through more engaging app design to promote user adherence. Harmonization of phone-derived measures is also crucial to facilitate model transfer across studies.
In conclusion, this study contributes valuable evidence supporting the potential utility of smartphone data for mood state monitoring and predicting depression severity. Future research should focus on predicting depression further ahead in time and addressing the challenges identified to create more robust and effective depression monitoring solutions using smartphone-based data.
Introduction
Depression is a widespread psychiatric illness that affects millions of individuals worldwide, with a lifetime prevalence of 20% [1], projected to be the lead contributor to global disease burden by the year 2030 [2]. Being an ever-increasing burden on our healthcare systems, there is an increasing need for improved monitoring and predictive tools to enhance the management of this condition. As relying solely on human judgement is costly and subject to potential biases attention in recent years has turned to mobile devices for the quantitative assessment of behaviour [3, 4]. Sometimes referred to as digital phenotyping smartphones and mobile technology have opened up new possibilities for remote monitoring and data collection, providing valuable insights into the daily lives of individuals with depression [5, 6].
Since the inception of the iPhone in 2008 the prevalence of smartphones has rapidly increased with smartphone penetration reaching >80% in the United States and 50% worldwide [7, 8]. The widespread use of smartphones has made them a prime target for digital phenotyping efforts, with multiple uses in a range of different mental health conditions such as Bipolar disorder, OCD and MDD [9–11]. Smartphones are often used to collect both active measures, such as momentary ecological assessments and passive measures, such as GPS-derived mobility markers, device usage or app usage. A key challenge for the application of remote monitoring via smartphones is the reliable prediction of depression severity for in-time intervention and improved treatment monitoring. In previous studies of depressed populations, multiple different smartphone-derived measurements have been used such as social interactions, location, activity measures, EMAs and phone use [11]. For EMAs, it has been shown that depressed individuals report lower average mood ratings [12, 13]. These EMA-reported mood ratings are also significantly associated with composite scores of depression such as PHQ-9 [14, 15]. Similarly, GPS-derived mobility markers such as percent home or the number of locations visited have been shown to be altered in depression [16], while also correlating strongly with depression severity over time [17, 18]. Daily GPS features and mood do not seem to be strongly related however, with substantial day-to-day variability [16, 17].
These promising findings have prompted researchers to harness smartphone-derived sensor data for the prediction of psychiatric diagnoses, utilizing an array of sensor-based features. Multiple studies have effectively classified individuals based on their depression status, achieving accuracies ranging from 0.60 to over 0.85 [19–22]. These studies utilized a range of different features such as GPS, screen status or internet connectivity. Beyond diagnosis prediction, some investigations have attempted to forecast depression severity through smartphone data. While certain studies, such as those by Asare et al. and Ware et al., focused on predicting individual symptoms using classifiers, others, including Saeb et al., Wang et al., Pedrelli et al., and Lewis et al. have predicted depression severity on a continuous scale [17, 23, 24]. For continuous predictions, the PHQ-9 and Hamilton Depression Rating Scale (HAM-D) were commonly used with a subset of studies incorporating alternative metrics like the PHQ-8. Notably, varying model performances were reported, with some studies yielding inconclusive outcomes at the population level [16]. It is noteworthy that the majority of these models relied on passive sensing measures, with only a limited number incorporating active measures such as Ecological Momentary Assessments (EMAs).
Contributing to a growing body of evidence, our study constructs prediction models for depression severity by leveraging both active and passive features obtained from individuals with major depressive disorder (MDD) and healthy controls. Through the utilization of permutation feature importances and statistical modelling techniques, we not only replicate insights from prior research but also introduce supplementary evidence by addressing the integration of missing data within our machine learning frameworks. We are hopeful that the outcomes of our research will help to guide forthcoming investigations in the realm of digital phenotyping.
Methods
The primary goal of the study was to analyze the feasibility of mood prediction from active and passive smartphone data in a clinical population. We also analyzed the statistical relationships of the different predictors to our target to inform potential future studies. In this section we first describe our cohort and then our preprocessing and analysis methods used in this study.
Subjects
N=1550 Participants from the ReMAP study with available GPS data, demographic information, ecological momentary assessments and sleep questionnaires from the Remote Monitoring in Psychiatry (ReMAP) study with data between January 2019 and September 2022 were selected for further processing. All study participants were recruited as part of ongoing longitudinal cohort studies. For more information on these studies please refer to the supplementary material. After pre-processing quality control, processing and post-processing quality control 207 Participants remained, of which 156 were either healthy controls (N=96) or diagnosed with major depressive disorder (N=60). Patients were recruited and assessed at the Institute for Translational Psychiatry, Department of Psychiatry at the University of Muenster. Current and lifetime psychopathological diagnoses were assessed using the Structured Clinical Interview for DSM-IV (SCID-I). Patients included those with mild, moderate, severe and (partially) remitted MDD. All subjects were educated about study aims, data collection, data security and financial compensation. The study was approved by the local institutional review board and Informed consent was obtained from each subject. Further information about study design can be found in the supplement as well as our previous work [25, 26].
Quality Control
Geolocation data underwent quality control before (pre-processing quality control) and after processing (post-processing quality control). For pre-processing quality control, a custom script was used that automatically discarded participants with less than 30 days of data or with a minimum timestamp difference >6 hours, indicating low data density. We also discarded all data points during periods where individual participants were isolated at home due to COVID-19 infections. After this automated quality control step accuracy data over time was plotted and visually inspected. If there were missing periods for >30% of collection time or other anomalies data was discarded. After processing daily location maps from the DPLocate pipeline were plotted and visually inspected. If anomalies were detected data was discarded and did not enter the final statistical analysis.
Geolocation Data Processing
Geolocation was processed using the DPLocate pipeline. Data were converted to Matlab format, signal epochs were extracted using a threshold of 1 datapoint and temporally filtered. Points of interest (POIs) were extracted using the modified Density-Based Clustering Algorithm with Noise (DBSCAN), calculating the 80 most visited locations with the maximum number of epochs in their 50 meters neighbourhood and then pruning overlapping clusters to find the most visited locations [27, 28]. Epochs were assigned to POIs, the home location was estimated as the most commonly visited location and metrics of interest were calculated from these epochs. These metrics were: Average Distance from home, Radius of Mobility, Percent Home and the number of places visited.
The pipeline was run using MatLab 2019a.
Psychometric Questionnaire Data
For the assessment of depressive symptoms, the German version of the Beck Depression Inventory (BDI) was used [29]. The questionnaire was administered every 2 weeks through the ReMAP phone app as specified by the study protocol. For temporal granularity, a daily ecological momentary assessment of mood and sleep consisting of two items with a sliding scale (0-10) was delivered through the study app at random times between 8 am and 10 pm. Users were notified via push notifications. Both the BDI and EMA questionnaires have been externally validated in our previous work [25, 26].
Predictive Data Processing
For our predictive models, the data was averaged over the 2-week period prior to BDI administration. 2-week windows with less than 5 data points were labelled as missing and assigned a missing value of 1 for that time window. This was done to model missingness within the data while preserving an adequate sample size. Predictors were GPS-derived mobility markers (percent home, number of places visited, radius of mobility, distance from home), ecological momentary assessments (mood, sleep quality), age, sex and phone platform.
Predictive Modelling
All machine learning models were developed using the PHOTONAI package [30]. Four different predictive models were used: Random Forest, Automatic Relevance Determination Regression, Huber Regression and a Support Vector Machine. Every model was evaluated using a nested 10×10-Fold cross-validation with the best model being selected on mean absolute error and grid search as an optimizer strategy. Additionally, mean squared error, Spearman rank correlation and Pearson correlation were calculated. We also calculated a baseline estimate (referred to as a “Dummy Estimator”) by predicting the sample mean. After pipeline training and evaluation we calculated the permutation feature importance to assess the importance of different predictors. To evaluate our predictive models under different conditions we trained each pipeline on two different data splits. First, we split the data randomly (random split), leading to data from most participants being available in the training and test set. This was done to model a scenario where some user data is already available for model tuning. Second, we also performed a user split where some users were held out in the test set and not seen during training. This was done to estimate how well the model might generalize to new unseen participants. A potential time split was not performed due to the high variability in the amount of data points per user.
Inferential Modelling
To gain insight into the nature of the statistical relationships within the data we built linear regression models for the different predictors. We modelled the relationship between BDI and different predictors using linear mixed models with a random intercept for each subject with age and sex as covariates. Modelling the group relationship we used the averaged values of the predictors in a linear regression. All statistical analyses were performed using the statsmodels Python package.
Results
Prediction Model
To predict the BDI sum score from mobile data we we trained each estimator model on either patients with major depressive disorder (MDD), healthy controls (HC) only or both of them combined (Comb). 142 subjects went into the analysis with 85 HC and 57 MDD patients. The age range spanned 18 to 67 years with a median age of 35 years. Demographic information on each subgroup can be found in the supplementary materials. BDI was predicted in each of the subgroups using both a random split and user split, leading to 24 different classifier pipelines, excluding the Dummy estimator.
Predicting BDI score in the MDD subgroup was best achieved by the Random Forest in the random split cohort with a mean absolute error of 5.301 and a mean squared error of 44.543. However, the other classifiers showed a highly similar predictive performance, with no classifier being clearly superior. Each classifier performed better than the dummy estimator with a mean absolute error of 7.25 In the user split the classifiers performed even more similarly with no clear best classifier. Here the estimators performed only slightly better than the dummy estimator (Mean absolute error: 7.59). For each pipeline, we also performed a permutation test to quantify whether the correlation between the true and predicted values was due to chance. All models performed significantly above chance with p-values <0.001. Predicting BDI score in the HC subgroup was again best achieved by the Random Forest in the random split subgroup with a mean absolute error of 3.793 and a mean squared error of 27.101. All estimators performed better than the dummy estimator (Mean absolute error: 5.962). In the user split subgroup, the results between the estimators were much more similar with the SVR ranking as the best classifier, however, most classifiers still performed better than the dummy estimator (Mean absolute error: 6.139). The permutation test was again significant for each model (p<0.001).
Predicting BDI score in the combined group was again best achieved by the random forest when using a random split with a mean absolute error of 3.793 and a mean squared error of 27.101. All estimators outperformed the Dummy (Mean absolute error: 5.962). The SVR was the best estimator in the group split with a mean absolute error of 4.259 and a mean squared error of 32.921. In the combined group all estimators performed highly similarly both in the random and the user split condition. In the combined condition all models performed above chance (p<0.001)
Feature importances
For each pipeline, we also calculated the permutation feature importance using 100 permutations. In the MDD subgroup with random splitting, the most important feature was reported mood, with the missingness of the mood EMA as the second most important feature. The number of places visited, percent home, age, GPS missingness, platform and sleep were other notable features. While different in exact importance score, the most important features where highly similar among estimators. With the group split GPS missingness was the most important feature with mood EMA, number of places visited, percent home and missingness of the mood EMA as other important predictors. However, these should be interpreted with caution as the different estimators did not perform much better than the Dummy estimators.
Predicting BDI. Predicting BDI in a random split scenario, all estimators performed better than the dummy estimator that was predicting the average BDI score, with very similar mean absolute errors between the different estimators. The random forest performed best for the MDD subgroup (MAE: 5.316, Pearson’s R: 0.637, Permutation p-value: p<0.001) and Combined subgroup (MAE: 3.791, Pearson’s R: 0.704) while the ARD Regression performed best in the HC subgroup (MAE: 2.280, Pearson’s R: 0.753, Permutation p-value: p<0.001). In the user split scenario, the estimators did not perform much better than the dummy estimator. The best estimator in the group split scenario was the random forest for the MDD (MAE: 7.183, Pearson’s R: 0.475, Permutation p-value: p<0.001), HC (MAE: 2.786, Pearson’s R: 0.383, Permutation p-value: p<0.001) and Combined (MAE: 4.259, Pearson’s R: 0.610, Permutation p-value: p<0.001) subgroups. The performance here should be interpreted with caution however as no estimator strongly outperformed the dummy estimator. *MAE: Mean absolute error.
In the HC subgroup with random splitting, the most important feature was the mood EMA followed by missingness of the mood EMA, sex and age. Again, feature importances were highly similar among estimators. With user splitting the most important feature was number of places visited, followed by mood EMA missingness, percent home, GPS missingness, age and sex. Again these need to be interpreted with caution as the estimators did not perform much better than the Dummy estimator.
In the combined group with random splitting, the most important feature was the mood EMA, followed by GPS missingness, mood EMA missingness, sex and number of places visited. With user splitting the most important feature was mood EMA followed by GPS missingness, mood EMA missingness percent home and age. Within both splitting scenarios, there was a high degree of similarity between the feature importances for the different estimators.
BDI correlates with decreased mobility
With GPS-derived mobility measures as being some of the important features in our predictive models we next looked at the association between different geolocation-derived metrics and BDI scores. This was done in an exploratory manner to provide possible explanations for the observed performance.
We predicted BDI using either percent home or number of places visited with age and sex as covariates in a linear mixed model with a variable intercept for each subject. We also ran each model separately for the depressed (MDD) and healthy (HC) subgroups to account for differences in trends when analyzing the different subgroups compared to analyzing all subjects together, also known as Simpson’s paradox [31]. When predicting BDI using percent home in the MDD subgroup, neither percent home (p=0.796, z=0.259), age (p=0.767, z=0.297) nor sex (p=0.793, z=-0.262) were significant predictors of BDI. In the HC subgroup percent home was a significant predictor (p<0.001, z=3.958), while age (p=0.405, z=0.833) and sex (p=0.293, z=1.051) were not. When predicting BDI using number of places visited in the MDD subgroup, the number of places (p=0.003, z=3.016) was a significant predictor, while age (p=0.758, z=0.308) and sex (p=0.788, z=-0.269) were not. In the HC subgroup neither number of places (p=0.171, z=-1.369), age (p=0.383, z=0.872) nor sex (p=0.253, z=1.142) was significant.
BDI correlates with worsened mood and sleep
As the mood EMA was one of the strongest predictors in all models with good predictive performance, we investigated the utility of self-reported measures of mood and sleep (ecological momentary assessments, EMAs) by using them as predictors of BDI. As before we ran two separate regressions for HC and MDD to account for a potential Simpson’s paradox [31].
Permutation Feature Importance. We calculated the permutation feature importance for every pipeline. Mood EMA emerged as the most important predictor in the random split scenario for every pipeline. Other important features were the missingness of the mood EMA, the missingness of the GPS data, the number of places visited, the percent home, age and sex. For the user split the pipelines did not perform much better than the Dummy estimator for the MDD subgroup therefore interpreting the feature importance is not encouraged. In the combined subgroup where the pipelines outperformed the Dummy estimator mood again emerged as the most important feature with GPS missingness, mood EMA missingness, percent home and age as additional important features. *Mood: Mood EMA, sleep: Sleep EMA, numPlaces: Number of places visited, percentHome: Percent home, percentHomep: normalised percent Home, homDist: Distance from home, radiusMobility: Radius of Mobility, age: age, sex: Sex, platform: phone platform, GPS_missing: missingness of GPS data, mood_miss: Missingness of mood EMA, sleep_miss: Missingness of sleep EMA, percenthome_miss: Missingness of percent home, percenthomep_miss: Missingness of normalised percent home, numplaces_miss: Missingness of number of places, radMobil_miss: Missingness of Radius of Mobility, homDist_miss: Missingness of radius of Mobility
For predicting BDI from self-reported mood we used 3753 data points from 132 subjects (53 MDD, 79 HC). We used a linear mixed model as described above with age and sex as covariates. For the MDD subgroup, we found a significant effect of mood (p<0.001, z=-22.612), but not age (p=0.931, z=0.087) or sex (p=0.943, z=0.071). In the HC subgroup, we again found a significant effect of mood (p<0.001, z=-23.653), but not age (p=0.343, z=0.947) or sex (p=0.345, z=0.730).
For predicting BDI from self-reported sleep we used 3781 data points from 132 subjects (54 MDD, 78 HC). For the MDD subgroup, we found a significant effect of self-reported sleep (p<0.001, z=4.267), but not age (p=0.739, z=0.333) or sex (p=0.782, z=-0.276). For the HC subgroup, we found a significant effect of self-reported sleep (p=0.004, z=-2.912), but not age (p=0.448, z=0.759) or sex (p=0.278, z=1.084).
Diagnosis correlates with mobility markers
After establishing the relation between BDI and geolocation markers of mobility we looked at the association between GPS-derived mobility markers and diagnosis. We extracted the average percent home and average number of places visited for each healthy control and each subject with a diagnosis of major depressive disorder. 156 individuals (60 MDD, 96 HC) were included in the analysis (see Table 16). We predicted one of the two mobility measures, average mood EMA response or average sleep EMA response with diagnosis, age and sex as predictors using a linear regression model.
Predicting percent home we saw a significant effect of sex (p<0.001, t=3.621), diagnosis (p=0.009, t=2.633) and age (p=0.018, t=2.389). A diagnosis of MDD was associated with an increased amount of time spent home (coeff=1.538, CI=0.384 - 2.692). Predicting the number of places visited we saw a significant effect of sex (p=0.016, t=-2.429), diagnosis (p=0.010, t=-2.593) and age (p=0.019, t=-2.376). A diagnosis of MDD was associated with a decreased number of places visited (coeff=-0.552, CI=-0.972 - -0.0131). When predicting average mood we found a strongly significant effect for diagnosis (p<0.001, t=-6.869) where a diagnosis of MDD was negatively correlated with (coeff=-1.537, CI=-1.98 - -1.095), without a significant effect for age (p=0.645, t=-0.461) or sex (p=0.816, t=-0.233). For average reported sleep we found no significant effect of diagnosis (p=0.897, t=-0.130). The effects for diagnosis on number of places, percent home and average mood remained significant even after adjusting for multiple comparisons using the Bonferroni method.
Discussion
This study aimed to explore the feasibility of predicting symptom scores from smartphone data, focusing on an area that had received limited attention in previous research, compared to classification. Unlike earlier models that primarily concentrated on predicting item-level symptoms or overall diagnosis state, our analysis focused on predicting symptom scores using different estimators. Most previous studies used machine learning models to classify patients based on their phone data as depressed or healthy [11, 17, 20]. Other studies had demonstrated success in predicting specific aspects of depression but still used classifiers to predict individual symptom-level items, rather than predicting composite scores that rate depression severity [32, 33]. Notably some studies predicted depression severity using composite scores, using psychometric instruments other than BDI. These studies also did not explicitly model missingness as they mostly relied on passive features. Here our study adds to a growing body of evidence for the utility of monitoring depressive symptomatology.
Predictors of BDI. GPS-derived mobility markers are partially significant predictors in the MDD subgroup (number of places: p=0.003, percent home: p=0.796) as well as the HC subgroup (number of places: p=0.171, percent home: p<0.001)BDI is also significantly associated with mood (p<0.001), sleep (p=0.004) in the MDD subgroup as well as the HC subgroup (mood: p<0.001; sleep: p<0.001).
Our results show that employed models outperformed the Dummy baseline estimator in a random split scenario, albeit with only a marginal improvement in a user split scenario. This is in line with previous findings by Lewis et al, which found that classifiers work better than baseline estimators in random and time-split scenarios, but not in user-split scenarios [24]. First, these models should be evaluated in larger, more heterogeneous samples, as the relatively small dataset used in the study may have impacted the models’ performance. Larger and more diverse samples could yield better results. Secondly, models should be trained in cohorts where individuals show greater variability of symptom ranges enabling models to leverage more meaningful within-person variation. Third, for clinical deployment, it might be necessary to collect a few data points from a user and then adjust the model based on this user’s data. To further improve prediction accuracy, model personalization using mixed effects machine learning models, as proposed by Lewis et al., could be considered [24].
Among the features analyzed using permutation feature importance, the mood ecological momentary assessment (EMA) emerged as the most influential predictor, backed by highly significant effects in our mixed models, followed by the missingness of different signals and GPS-derived mobility features. The assessment of feature importance was only meaningful in the random split scenario as in the user-split scenarios models failed to significantly predict mood above baseline. In this scenario interpreting feature importances should not be attempted. For future studies, it might be important to effectively model missingness as it may provide crucial information for machine learning models to achieve accurate predictions.
Diagnosis predicts lower mood, more time at home and decreased number of places. When using diagnosis as a predictor for GPS-derived mobility features we find a significantly increased percent home (p=0.009) and a significantly decreased number of places visited (p=0.010). A diagnosis of depression also predicts a significantly lower average mood (p<0.001). We do not find evidence for a significant difference in average self-reported sleep quality (p=0.897).
In our investigation using mixed modelling to explore the connection between distinct predictors and BDI scores, we confirmed the robust predictive capability of mood EMAs, while also establishing significant effects for GPS-derived mobility markers. These findings reinforce the strength of EMAs as a key indicator of depression severity. Previous research has consistently demonstrated substantial links between EMA scores and depression scales within psychiatric populations and other study groups [15, 34]. The empirical evidence for GPS-derived mobility markers also aligns with existing studies, which have highlighted the correlation between alterations in these markers and shifts in depressive symptoms [35]. Despite notable levels of missing data in our GPS records, we replicate effects from prior investigations. Enhancing the frequency and reliability of data sampling could potentially yield more robust and meaningful estimates. However, it’s important to note that while EMAs exhibit effectiveness, their reliance on active user participation may limit their widespread adoption in certain contexts [34], thus impacting their clinical utility.
Considering the group-level findings, we show that depression is significantly associated with the average reported mood and GPS-derived mobility markers. Both number of places and percent home have previously been shown to be significantly associated with a diagnosis of depression across multiple studies [11, 17]. The average sleep quality has also been previously associated with depression [36–38], yet we are unable to reproduce this finding. While this might be due to depressed patients experiencing different levels of symptom severity being ignored when examining the average, self-reported sleep did not emerge as a strong predictor of BDI score in the feature importance analysis. Overall our group-level analysis is in line with findings from previous studies.
While predicting composite scores is an important task in remote monitoring, it may not fully capture the variability in patient symptoms. Depression is considered a highly heterogeneous disease, with some authors suggesting it might be multiple diseases disguised under a common symptom. Therefore, to gain a more comprehensive overview of a patient’s condition, predicting individual items or symptom groups such as in latent factor models might provide a more holistic view of depression severity and is potentially easier to predict as these categories are better reflected in measured outcomes than composite scores. These factor models have also been shown to track multiple dimensions of psychiatric illness as well as general psychopathology [39].
Other challenges for clinical utility concern participant dropout, especially for models that rely on active features such as EMAs. To address this future apps will need to be carefully designed to become more engaging and promote user adherence, without strongly altering user behaviour. Models derived exclusively from passive sensing are less prone to these problems as they do not rely on active user participation. Nonetheless, the lack of standardization among sensors across various studies and applications introduces challenges in the transference of models. Sensor harmonization, a pivotal challenge, is influenced by factors such as the diversity in the count and categories of phone sensors captured in different studies. In this context, our study offers valuable insights into the array of phone-derived metrics that hold significance for predicting depression severity, contributing to a deeper understanding of this challenge.
Overall, this study contributes valuable evidence to the potential utility of smartphone data in mood state monitoring. However, future studies should explore the prospect of predicting depression further ahead in time, potentially serving as an early-warning tool for better intervention and management of the condition. Additionally, addressing the challenges and limitations identified in this study could pave the way for more robust and effective depression monitoring solutions using smartphone-based data.
Supplementary Material
A. ReMAP App
The Remote Monitoring Application in Psychiatry (ReMAP) app was developed by researchers from the Department of Psychiatry at the University of Münster. It enables high-resolution monitoring of activity using smartphone sensors, including geolocation, steps, walking distance, and acceleration. The app also incorporates active self-reports on sleep, mood, and voice samples. Every two weeks, participants were given the option to complete the Beck Depression Inventory (BDI) with a random variance of two days. Participation in the self-reporting was voluntary, and it was not a requirement for financial compensation. ReMAP was used as an additional assessment tool for participants in various ongoing longitudinal studies, primarily focusing on neuroimaging studies involving structural and functional MRI assessments, genotyping, and the evaluation of various clinical variables.
B. ReMAP Participants
The sample for the analyses in this paper was drawn from several ongoing longitudinal cohorts. Included samples stem from the Marburg/Münster Affective Disorder Cohort Study (MACS, n=95) (Vogelbacher et al., 2018), the Münster Neuroimage Cohort (MNC, n=27) (Dannlowski et al., 2016; Opel et al., 2019), two subsamples of the SFB-TRR58 cohort (n=23; Z02 Münster cohort and SpiderVR Münster cohort (Schwarzmeier et al., 2020)), the TIP (n=6) cohort and the SEED cohort (n=5). All cohorts comprise healthy control (HC) participants, as well as different patient groups. Both, the MACS and the MNC cohorts include major depressive disorder (MDD) and bipolar disorder (BD) patients under current or former inpatient treatment. The SFB-TRR58 sample includes patients with a spider phobia (SP). The TIP sample includes patients with a social anxiety disorder (SAD), MDD, or comorbid SAD and MDD. For our analyses, we excluded all patients that did not meet the criteria for healthy controls or major depressive disorder.
C. Predictive Modelling
Results for all the different estimator pipelines can be found in table 4. Condition refers to the subgroup the pipeline was trained on, while split refers to the splitting procedure for the data. Tables 2 and 3 provide information on the HC and MDD subgroups used in the predictive modelling.
Results for the different classifiers trained on the MDD cohort. Condition refers to the subgroup, split to the data splitting and estimator to the estimator used in the pipeline. *MAE: Mean absolute error; MSE: Mean squared error; Spearman: Spearman Rank Correlation; Pearson: Pearson correlation; r-pval: p-value of the permutation test for the correlation between predicted and true values of the target variable.
D. Mixed Modelling of BDI
The demographics and results for the different regression models on BDI can be seen in tables 5 to 15. The formula for the linear mixed models can be read as BDI ∼ var + age + sex + (1 | subject) where var stands for the variable of interest such as mood EMA, sleep EMA, percent home or the number of places visited.
E. Linear model for diagnosis effects
Tables 16 to 20 provide information on the linear models used to investigate diagnosis effects. The formula for the linear models can be read as var ∼ diagnosis + age + sex, with var standing for the variable of interest. Note that the the average for the variable of interest is used in these models.
ACKNOWLEDGEMENTS
Funding was in part provided by grants from the German Science Foundation (DFG): Project number 44541416-TRR 58 (CRC-TRR58, Projects C09 and Z02 to Udo Dannlowski). Further DFG funding as part of Forschungsgruppe/Research Unit FOR2107 was received by: Tilo Kircher (speaker FOR2107; DFG grant numbers KI 588/14-1, KI 588/14-2, KI 588/15-1, KI 588/17-1), Udo Dannlowski (co-speaker FOR2107; DA 1151/5-1, DA 1151/5-2, DA 1151/6-1), Axel Krug (KR 3822/5-1, KR 3822/7-2), Igor Nenadic (NE 2254/1-2, NE 2254/2-1, NE 2254/3-1, NE 2254/4-1), Carsten Konrad (KO 4291/3-1), Marcella Rietschel (RI 908/11-1, RI 908/11-2), Markus Noethen (NO 246/10-1, NO 246/10-2), Stephanie Witt (WI 3439/3-1, WI 3439/3-2), Andreas Jansen (JA 1890/7-1, JA 1890/7-2), Tim Hahn (HA 7070/2-2, HA7070/3, HA7070/4), Bertram Mueller-Myhsok (MU1315/8-2), Astrid Dempfle (DE 1614/3-1, DE 1614/3-2), Petra Pfefferle (PF 784/1-1, PF 784/1-2), Harald Renz (RE 737/20-1, 737/20-2), Carsten Konrad (KO 4291/4-1).
Interdisciplinary Center for Clinical Research (IZKF) of the medical faculty of Muenster provided funding to Udo Dannlowski (Grant Dan3/012/17) and Nils Opel (SEED 11/19)
“Innovative Medizinische Forschung” (IMF) of the medical faculty of Münster provided funding to Elisabeth J. Leehr (Grant LE121703 and LE121904), Nils Opel and Tim Hahn (Grant OP121710).