|Data type||Extracted feature||Description|
|Phone call data||total calling frequency||the number of times that a participant answers and makes phone calls during the day.|
|Phone call data||total calling duration|||
|Phone call data||non-working time calling frequency|||
|Phone call data||non-working time calling duration|||
|Phone call data||number of missed calls||the number of calls that are marked as missed during the day.|
|Phone call data||number of contacts||the number of contacts with whom a participant answers and makes phone calls during the day.|
|Phone call data||calling entropy||the variability of the calling durations that a participant spends across contacts during the day.|
|Phone call data||normalised calling entropy||calling entropy divided by the logarithm of the number of contacts during the day.|
|Phone usage data||phone usage frequency [moshe2021predicting]||the number of times that a participant interacts with their phone during the day.|
|Phone usage data||phone usage duration [moshe2021predicting]|||
|User activity data||lock screen duration||the total time in seconds that a participant's mobile phone is locked during the day.|
|User activity data||number of used apps||the number of applications that a participant uses during the day.|
|User activity data||number of midnight used apps||the number of applications that a participant uses between 0 am and 5 am during the day.|
|User activity data||sleep time|||
|GPS data||location variance [moshe2021predicting]|||
|GPS data||location entropy [moshe2021predicting]||the variability of the time that a participant spends in significant places during the day.|
|GPS data||normalised location entropy [moshe2021predicting]||the location entropy divided by the logarithm of the number of significant places.|
|GPS data||time at home [moshe2021predicting]|||
|GPS data||total distance [moshe2021predicting]||the total distance covered by a participant during the day.|
Table I: A total of 19 extracted features and their descriptions.
Depression, as a common mental health disorder, is typically characterised by low mood, overthinking, feelings of hopelessness, and decreased motivation. In extreme cases, people experiencing severe depression may have suicidal thoughts. Depression affects not only individual patients and their families, but also their social circle and overall economic development auerbach2016mental. In Germany, depression is the leading cause of the inability to work or early retirement and is the trigger for about half of all suicides each year. While most people with depression are treated in primary care settings, more than 50 % of people are not identified or effectively treated moshe2021predicting.
The long-standing primary method of clinical depression diagnosis relies on self-assessment questionnaires, such as the Patient Health Questionnaire (PHQ)-2 and PHQ-9. These questionnaires have shown a strong correlation with clinically assessed mental health. However, collecting them is usually time-consuming and happens at fixed time intervals, so they can hardly capture moment-by-moment psychological changes or support timely interventions.
Recently, the rise of wearable devices and mobile phones has made sensor data more readily available. Previous studies have explored the possibility of using sensor data to diagnose human mental health states and have shown its effectiveness [han2021deep, qian2021artificial]. rohani2018correlations provided a systematic survey of the correlations between sensor data and depressive mood symptoms. Compared to self-assessment questionnaires, passive data collection does not require any interaction with the device and can take place at more flexible time intervals. It can therefore reflect immediate changes in psychological state, potentially enabling early diagnosis, prediction of disease progression, and timely adjustment of treatment plans.
However, previous studies mainly focused on depression diagnosis based on mobile phone data, i.e., the prediction of depression state and/or severity for a given time period (e.g., on a daily, weekly, or bi-weekly basis) given concurrent features. In contrast, the forecasting of depression progression, i.e., the prediction of state/severity for a given time period given features from further in its past, has not received sufficient attention. saeb2015mobile showed the effectiveness of using features extracted from mobile phone GPS and usage sensors to detect whether participants have depressive symptoms (PHQ-9 ≥ 5). masud2020unobtrusive extracted 12 features from GPS and acceleration data and classified participants’ weekly PHQ-9 scores into three groups based on that week’s features. lu2018joint used the GPS, activity, sleep, and heart rate data collected from mobile phones and wearable devices to distinguish the participants with depression and diagnose their clinical severity. They also used only the features from the same week to make the weekly diagnosis.
Different from previous work, we design two tasks covering both diagnosis and forecasting: the first task is to diagnose the current week’s PHQ-9 score according to data from the same week, while the second task is to forecast the PHQ-9 score at the end of next week based on data from the current week. We treat the diagnosis and forecasting of PHQ-9 as a regression problem and implement an LSTM model combined with a subject-independent 10-fold cross-validation. We use a portion of the passive data from a newly collected dataset called MAIKI, which includes phone call, phone usage, user activity, and GPS data. We choose the root-mean-square error (RMSE) as the evaluation metric and use two methods of categorising PHQ-9 scores into subgroups. First, we distinguish the participants with major depression (PHQ-9 ≥ 10) from those without. Second, we report a 5-class depression severity. Results show that the forecasting task achieves results comparable to those of the diagnostic task, which indicates the possibility of forecasting depression from mobile phone data. In order to compare different algorithm options and parameter settings, we also compare three different clustering methods for identifying significant places (GPS coordinates that need to be considered the same place and meet certain conditions) from the GPS data.
The rest of the paper is organised as follows. Section II introduces the newly collected MAIKI dataset and our feature extraction methods. Section III describes our task design, experimental setting, and evaluation approaches. Section IV outlines the obtained results for diagnostic and forecasting tasks. Section V concludes the paper with a brief discussion.
II Dataset and feature extraction
II-A MAIKI dataset
The MAIKI dataset was collected in the “Mobile daily living therapy assistant with interaction-focused artificial intelligence for depression” (MAIKI) project. A total of 48 people participated in this project and carried mobile phones with a sensor data acquisition app for 8 weeks. The study procedures were approved by the ethics committee of the Friedrich-Alexander-University Erlangen-Nuremberg (385_20B). The dataset contains both active data from self-assessment questionnaires and passive data from mobile phone sensors. The active questionnaire data includes the weekly PHQ-9 and other questionnaire data such as Generalised Anxiety Disorder (GAD-7) and Perceived Stress Scale (PSS-4). The passive data includes phone call, phone usage, user activity, GPS, battery, phone text, ringtone setting, and step count data, which were collected every day.
In this work, we focus only on (parts of) the passive data, namely phone call, phone usage, user activity, and GPS data, to diagnose and forecast the weekly PHQ-9 scores. Table I gives an overview of all features per data type. In Section II-B, we outline the procedure followed to extract the features of each data type, placing an emphasis on the GPS features, which follow a more involved process.
II-B Feature extraction
II-B1 Phone call, phone usage, and user activity features
We extract a total of 8, 2, and 4 features from phone call, phone usage, and user activity data, respectively. The descriptions of these features can be seen in Table I. These features are all extracted at a daily level for each participant.
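As an illustration of the two entropy-based phone call features in Table I, the following minimal sketch computes a Shannon entropy over the share of daily call time spent per contact and divides it by the logarithm of the number of contacts. The exact formula is not given in the text, so the Shannon form, the function names, and the handling of days with fewer than two contacts are assumptions.

```python
import math


def calling_entropy(durations_by_contact):
    """Shannon entropy of the share of daily call time spent per contact.

    `durations_by_contact` maps a contact identifier to the total call
    duration (in seconds) with that contact during one day.
    """
    total = sum(durations_by_contact.values())
    if total <= 0:
        return 0.0  # no calls on this day (assumed convention)
    entropy = 0.0
    for duration in durations_by_contact.values():
        if duration > 0:
            share = duration / total
            entropy -= share * math.log(share)
    return entropy


def normalised_calling_entropy(durations_by_contact):
    """Calling entropy divided by the logarithm of the number of contacts."""
    n_contacts = sum(1 for d in durations_by_contact.values() if d > 0)
    if n_contacts < 2:
        return 0.0  # log(1) = 0, so the ratio is undefined (assumed convention)
    return calling_entropy(durations_by_contact) / math.log(n_contacts)


# Hypothetical day with three contacts.
day = {"contact_a": 120.0, "contact_b": 120.0, "contact_c": 240.0}
print(calling_entropy(day), normalised_calling_entropy(day))
```

With this normalisation, a day on which call time is spread evenly across all contacts yields a value of 1, independent of how many contacts there are.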
II-B2 GPS features
The extraction of GPS features is composed of three steps. The first step is to preprocess the raw GPS data. We remove GPS coordinates whose positioning accuracy value is above the 80th percentile of all participants’ GPS accuracy values, and additionally remove GPS measurements recorded with a speed of less than 0. Since only the GPS coordinates in the stationary state should be used in the following clustering step to identify significant places, we also remove the GPS coordinates in the transition state, i.e., those with a speed of more than 1.4 m/s. The second step aggregates the GPS coordinates of the same location into a cluster. A cluster that meets certain conditions is considered a significant place. To compare different algorithm options and parameter settings, we implement the three clustering algorithms described below.
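The preprocessing step can be sketched as a simple filter. This is only an illustration under stated assumptions: the record keys `accuracy` and `speed` are hypothetical, the accuracy value is taken to be an error radius in metres (larger is worse), and the percentile uses linear interpolation, none of which is specified in the text.

```python
def percentile(values, q):
    """Percentile (q in [0, 100]) of a non-empty list, linear interpolation."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def preprocess(fixes, accuracy_percentile=80.0, max_stationary_speed=1.4):
    """Keep GPS fixes that are accurate, have a valid speed, and are stationary.

    Each fix is a dict with the hypothetical keys 'accuracy' (metres) and
    'speed' (m/s). The percentile is computed over all fixes passed in,
    i.e. over the pooled data of all participants.
    """
    if not fixes:
        return []
    threshold = percentile([f["accuracy"] for f in fixes], accuracy_percentile)
    kept = []
    for fix in fixes:
        if fix["accuracy"] > threshold:          # assumed: drop least accurate fixes
            continue
        if fix["speed"] < 0:                     # invalid speed reading
            continue
        if fix["speed"] > max_stationary_speed:  # transition state
            continue
        kept.append(fix)
    return kept
```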
Time-based clustering. The basic idea of this algorithm is to cluster the GPS coordinates along the time axis and remove the intermediate coordinates between significant places [kang2005extracting]. The algorithm computes the place clusters incrementally as new GPS coordinates come in. It has two parameters: the distance threshold is the maximum distance at which the next coordinate is still considered to belong to the current place cluster; the time threshold is the minimum duration that the current cluster must span to be considered a significant place. When a cluster is a significant place, the algorithm checks whether it should be merged into one of the existing clusters according to the distance between their centroids (the merge distance threshold equals one third of the distance threshold).
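A simplified reading of the time-based procedure is sketched below. It is not the exact algorithm of [kang2005extracting]: a cluster is closed as soon as one fix falls outside the distance threshold, and the merge rule uses one third of the distance threshold. The function names and the `(timestamp, lat, lon)` tuple layout are assumptions.

```python
import math


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(a))


def centroid(points):
    """Arithmetic mean of (lat, lon) points; adequate for small clusters."""
    return (sum(p[0] for p in points) / len(points),
            sum(p[1] for p in points) / len(points))


def time_based_places(fixes, dist_m=40.0, min_stay_s=900.0):
    """Return centroids of significant places.

    `fixes` is a time-ordered list of (timestamp_s, lat, lon) tuples.
    """
    places = []   # each significant place is a list of (lat, lon) points
    current = []  # fixes of the cluster currently being built

    def close(cluster):
        # Keep the cluster only if it spans at least the time threshold.
        if not cluster or cluster[-1][0] - cluster[0][0] < min_stay_s:
            return
        points = [(lat, lon) for _, lat, lon in cluster]
        centre = centroid(points)
        for place in places:
            # Merge into an existing place whose centroid is close enough.
            if haversine_m(centroid(place), centre) <= dist_m / 3:
                place.extend(points)
                return
        places.append(points)

    for fix in fixes:
        if current:
            centre = centroid([(lat, lon) for _, lat, lon in current])
            if haversine_m(centre, (fix[1], fix[2])) > dist_m:
                close(current)
                current = []
        current.append(fix)
    close(current)
    return [centroid(place) for place in places]
```

With fixes recorded every 5 minutes and a 15-minute time threshold, a stay needs at least four consecutive fixes within the distance threshold to count as a significant place in this sketch.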
K-Means clustering. The typical K-Means clustering algorithm requires a predetermined number of clusters. In our case, however, the number of clusters can vary widely among participants. Following saeb2015mobile, we first set the number of clusters to 1 and increase it until the distance from the farthest point in each cluster to its cluster centre is less than a distance threshold. This threshold determines the maximum radius of a cluster.
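The idea of growing the number of clusters until every cluster fits within a maximum radius can be sketched as follows. The K-Means routine here is a plain Lloyd iteration with a deterministic farthest-point initialisation, which is an assumption made to keep the example reproducible; the text does not specify the initialisation or implementation used.

```python
import math


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(a))


def assign(points, centres):
    """Index of the nearest centre for every point."""
    return [min(range(len(centres)), key=lambda j: haversine_m(p, centres[j]))
            for p in points]


def kmeans(points, k, n_iter=50):
    """Lloyd's algorithm with farthest-point initialisation (an assumption)."""
    centres = [points[0]]
    while len(centres) < k:
        centres.append(max(points,
                           key=lambda p: min(haversine_m(p, c) for c in centres)))
    for _ in range(n_iter):
        labels = assign(points, centres)
        updated = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                updated.append((sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members)))
            else:
                updated.append(centres[j])  # keep an empty cluster's centre
        if updated == centres:
            break
        centres = updated
    return centres, assign(points, centres)


def adaptive_kmeans(points, max_radius_m=500.0):
    """Increase k from 1 until every point is within max_radius_m of its centre."""
    centres, labels = [], []
    for k in range(1, len(points) + 1):
        centres, labels = kmeans(points, k)
        radius = max(haversine_m(p, centres[label])
                     for p, label in zip(points, labels))
        if radius < max_radius_m:
            break
    return centres, labels
```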
DBSCAN clustering. The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm has two parameters: the first is the maximum distance between two coordinates for one to be considered as belonging to the same cluster as the other; the second is the minimum number of data points required to form a cluster. This algorithm is generally regarded as particularly suitable for GPS data, as it can identify clusters of varying shapes and is robust to outliers [muller2021depression].
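For reference, a compact and unoptimised DBSCAN over GPS points might look like the sketch below. The parameter names `eps_m` and `min_samples` are assumptions (the symbols used in the paper are not recoverable from the text), the neighbour search is a brute-force O(n²) scan chosen for brevity, and a point counts as its own neighbour.

```python
import math


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(a))


def dbscan(points, eps_m=30.0, min_samples=3):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    n = len(points)
    # Brute-force neighbourhoods; each point is included in its own neighbourhood.
    neighbours = [[j for j in range(n)
                   if haversine_m(points[i], points[j]) <= eps_m]
                  for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # core point: expand the cluster
    return labels
```

Points labelled -1 are treated as noise, which is what makes the method robust to isolated GPS outliers.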
The third step is to extract the GPS features. The descriptions of these features can be found in Table I. We run each clustering algorithm on all days of data for each participant, and then extract the GPS features for each participant on each day.
III Experimental setup
We design two tasks for diagnosis and forecasting. The first task is to diagnose the current week’s PHQ-9 score according to data from the same week. The second task is to forecast the PHQ-9 score at the end of next week based on data from the current week. For example, the diagnostic task is to predict the PHQ-9 score on day 7 based on the data from day 1 to day 7. In contrast, the forecasting task is to predict the PHQ-9 score on day 14 based on the data from day 1 to day 7. In order to maximise the utilisation of day-level features, we do not average the features over a week but assign the same weekly PHQ-9 score as the label to each day of data in that week. We treat these two tasks as regression problems and implement an LSTM model combined with a subject-independent 10-fold cross-validation to complete them. The model has one LSTM layer, one fully connected layer, and a ReLU activation function. The learning rate and the hidden size of the LSTM are set to 0.001 and 4, respectively. We choose the mean squared error as the loss function. The model is trained with the Adam optimiser, with β1 and β2 set to 0.9 and 0.999.
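The labelling scheme and the subject-independent split can be sketched as follows. This is an illustration under assumptions: the text does not state how days are grouped into LSTM input sequences or how participants are distributed over folds, so the sketch only shows per-day labelling and a simple round-robin assignment of whole participants to folds. All function names are hypothetical.

```python
def build_samples(daily_features, weekly_phq9, task="diagnosis"):
    """Pair every day of features with a weekly PHQ-9 label.

    daily_features: per-day feature vectors of one participant (day 1 first).
    weekly_phq9:    PHQ-9 scores, entry w being the score at the end of week w.
    task:           'diagnosis' labels a day with the score of its own week,
                    'forecasting' with the score of the following week.
    """
    if task not in ("diagnosis", "forecasting"):
        raise ValueError("task must be 'diagnosis' or 'forecasting'")
    offset = 0 if task == "diagnosis" else 1
    samples = []
    for day_index, features in enumerate(daily_features):
        target_week = day_index // 7 + offset
        if target_week >= len(weekly_phq9):
            break  # no label available for this day
        samples.append((features, weekly_phq9[target_week]))
    return samples


def subject_folds(subject_ids, n_folds=10):
    """Assign whole participants to folds (round-robin, an assumption).

    No participant contributes data to both the training and the test
    split of the same fold, which makes the evaluation subject-independent.
    """
    unique = sorted(set(subject_ids))
    return {subject: i % n_folds for i, subject in enumerate(unique)}


# Hypothetical participant with two weeks of one-dimensional features.
features = [[float(day)] for day in range(14)]
scores = [5, 12]
print(len(build_samples(features, scores, "diagnosis")))    # all 14 days labelled
print(len(build_samples(features, scores, "forecasting")))  # only week 1 has a target
```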
For the time-based clustering algorithm, since the GPS data in our dataset is recorded every 5 minutes, we set the time threshold to 15 minutes and the distance threshold to 40 metres [kang2005extracting]. For K-Means clustering, we set the maximum cluster radius to 500 metres [saeb2015mobile]. For DBSCAN, we set the maximum neighbour distance to 30 metres and the minimum number of points to 3 [muller2021depression]. We use the Haversine distance as the distance function.
We choose the RMSE as the evaluation metric for regression and use two methods of categorising PHQ-9 scores into subgroups. First, we set the cutoff value to 10 to distinguish the participants with major depression (PHQ-9 ≥ 10) [kroenke2001phq] from those without and report the 2-class classification accuracy. Second, we evaluate the predicted severity of depression according to Table II and report the 5-class classification accuracy.
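The three evaluation measures can be sketched as below. Two assumptions are made: the 5-class bands are the standard PHQ-9 severity bands (0–4, 5–9, 10–14, 15–19, 20–27), which are taken here to match Table II, and the thresholds are applied directly to the continuous predictions without rounding. The function names are hypothetical.

```python
import math

# Lower bounds of the assumed severity classes 1..4 (class 0 starts at 0).
SEVERITY_LOWER_BOUNDS = [5, 10, 15, 20]


def rmse(y_true, y_pred):
    """Root-mean-square error between actual and predicted PHQ-9 scores."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
                     / len(y_true))


def is_major_depression(score, cutoff=10):
    """Binary label: PHQ-9 at or above the cutoff."""
    return score >= cutoff


def severity_class(score):
    """Severity class index 0..4, counting how many lower bounds are reached."""
    return sum(score >= bound for bound in SEVERITY_LOWER_BOUNDS)


def accuracy(y_true, y_pred, to_class):
    """Share of samples whose predicted class equals the actual class."""
    hits = sum(to_class(t) == to_class(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Hypothetical actual and predicted scores.
y_true = [3, 12, 17, 22]
y_pred = [4.2, 9.1, 15.5, 18.0]
print(rmse(y_true, y_pred))
print(accuracy(y_true, y_pred, is_major_depression))
print(accuracy(y_true, y_pred, severity_class))
```

Deriving both classification accuracies from the same regression output means a single model serves all three reported measures.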
|PHQ-9 Score||Depression Severity|
|15–19||Moderately severe depression|
|Clustering method||Major depression accuracy [%] (diagnosis)||Major depression accuracy [%] (forecasting)||Depression severity accuracy [%] (diagnosis)||Depression severity accuracy [%] (forecasting)||RMSE (diagnosis)||RMSE (forecasting)|
|K-Means||78.4 (3.5)||77.0 (6.7)||54.5 (7.1)||53.7 (6.4)||4.184 (0.569)||4.094 (0.619)|
|DBSCAN||74.4 (7.2)||71.9 (6.6)||52.5 (5.3)||47.4 (7.8)||4.443 (0.431)||4.401 (0.349)|
|Time-based||78.9 (4.6)||75.7 (5.3)||54.5 (4.7)||48.5 (7.3)||4.203 (0.621)||4.556 (0.445)|
Table III: Results of diagnosing/forecasting the PHQ-9 score (range from 0 to 27) at the end of the current/next week based on the data of the current week. Accuracy [%] is reported for the major depression (binary) and depression severity (5-class) tasks, whereas RMSE is reported for PHQ-9 prediction. The baseline is computed using the mean PHQ-9 score of the training set. Mean performance and standard deviations (in brackets) are reported over all 10 folds.
IV-A Results of diagnostic task
Table III shows the results of diagnosing the current week’s PHQ-9 score based on data from the same week. We calculate the baseline using the mean value of PHQ-9. Specifically, we assume that all predictions of the baseline model equal the mean value and then use this constant prediction to calculate the RMSE against the actual labels. The results of all three methods are better than the baseline model. The best RMSE is obtained with K-Means (4.184), which achieves an accuracy of 78.4 % for major depression diagnosis and 54.5 % for depression severity diagnosis. The time-based clustering algorithm performs comparably, with a marginally higher major depression accuracy (78.9 %), the same depression severity accuracy (54.5 %), and a slightly higher RMSE (4.203). It is worth mentioning that, contrary to previous findings that the DBSCAN clustering algorithm may be more suitable for GPS data [muller2021depression], DBSCAN obtained the worst results in our setting.
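The baseline computation described above amounts to a constant predictor. A minimal sketch, assuming (as stated in the caption of Table III) that the mean is taken over the training-set PHQ-9 scores of each fold; the function name is hypothetical:

```python
import math


def baseline_rmse(train_scores, test_scores):
    """RMSE of always predicting the mean PHQ-9 score of the training set."""
    mean_score = sum(train_scores) / len(train_scores)
    squared_errors = [(score - mean_score) ** 2 for score in test_scores]
    return math.sqrt(sum(squared_errors) / len(test_scores))


# Hypothetical fold: the training mean is 5, so both test errors are 3.
print(baseline_rmse([0, 10], [2, 8]))
```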
IV-B Results of forecasting task
Table III additionally shows the results of forecasting the PHQ-9 score at the end of next week based on data from the current week. K-Means again obtains the best results, achieving an accuracy of 77.0 % for major depression forecasting and 53.7 % for depression severity forecasting, together with the best RMSE of 4.094. This RMSE is marginally lower than that of the diagnostic task (4.184), which means that the forecasting task achieves results comparable to those of the diagnostic task. The results indicate that it is possible to forecast depression based on mobile phone data.
IV-C Feature comparison
Finally, in Figure 1 we show a performance comparison for PHQ-9 diagnosis and forecasting using different feature sets. We observe that the user activity features obtain the best individual performance for both tasks. The performance of the phone call features is worse, potentially because this data is sparser, as subjects accept and make calls less frequently than they use the phone for other purposes. Combining all feature sets results in a slight performance boost.
In this work, we investigated the potential of using passively collected mobile phone data for depression diagnosis and forecasting. We have shown that forecasting (predicting PHQ-9 scores, major depression, and depression severity) one week ahead of the collected data is possible, with performance close to that of predicting the current week (diagnosis). These results showcase the potential of such features for timely diagnosis and change-of-state prediction, both valuable targets for future digital health applications. These tasks are best modelled using a combination of features, of which the user activity features are the most important. In addition, our experiments show that K-Means clustering for generating GPS features fares better than DBSCAN and time-based clustering, a finding which contrasts with previous work suggesting that DBSCAN is more suitable for GPS data. This indicates that the performance of such algorithms might be dataset-dependent, and thus a cross-study comparison would be necessary to identify the strengths and weaknesses of each for depression detection.
Data analysed in this publication were collected as part of the MAIKI project, which was funded by the German Federal Ministry of Education and Research (grant No. 13GW0254). The responsibility for the content of this publication lies with the authors.