The increasing use of electronic forms of communication presents new opportunities in the study of mental health, including the ability to investigate the manifestations of psychiatric diseases unobtrusively and in the setting of patients' daily lives. A pilot study to explore the possible connections between bipolar affective disorder and mobile phone usage was conducted. In this study, participants were provided a mobile phone to use as their primary phone. This phone was loaded with a custom keyboard that collected metadata consisting of keypress entry time and accelerometer movement. Individual character data with the exceptions of the backspace key and space bar were not collected due to privacy concerns. We propose an end-to-end deep architecture based on late fusion, named DeepMood, to model the multi-view metadata for the prediction of mood scores. Experimental results show that 90.31% prediction accuracy on the depression score can be achieved based on session-level mobile phone typing dynamics, where a session typically lasts less than one minute. This demonstrates the feasibility of using mobile phone metadata to infer mood disturbance and severity.
Mobile phones, in particular, “smartphones” have become near ubiquitous with 2 billion smartphone users worldwide. This presents new opportunities in the study and treatment of psychiatric illness including the ability to study the manifestations of psychiatric illness in the setting of patients’ daily lives in an unobtrusive manner and at a level of detail that was not previously possible. Continuous real-time monitoring in naturalistic settings and collection of automatically generated smartphone data that reflect illness activity could facilitate early intervention and have a potential use as objective outcome measures in efficacy trials (Ankers and Jones, 2009; Bopp et al., 2010; Faurholt-Jepsen et al., 2016).
While mobile phones are used for a variety of tasks the most widely and frequently used feature is text messaging. To the best of our knowledge, no previous studies (Association et al., 2013; Puiatti et al., 2011; Frost et al., 2013; Gruenerbl et al., 2014; Schleusing et al., 2011; Valenza et al., 2014) have investigated the relationship between mobile phone typing dynamics and mood states. In this work, we aim to determine the feasibility of inferring mood disturbance and severity from such data. In particular we seek to investigate the relationship between the digital footprints and mood in bipolar affective disorder which has been deemed the most expensive behavioral health care diagnosis (Peele et al., 2003), costing more than twice as much as depression per affected individual (Laxman et al., 2008). For every dollar allocated to outpatient care for people with bipolar disorder, $1.80 is spent on inpatient care, suggesting early intervention and improved prevention management could decrease the financial impact of this illness (Peele et al., 2003).
We study the mobile phone typing dynamics metadata at the session level. A session is defined as beginning with a keypress which occurs after 5 or more seconds have elapsed since the last keypress and continuing until 5 or more seconds elapse between keypresses (the 5-second threshold is an arbitrary choice that can easily be changed and tuned). The duration of a session is typically less than one minute. In this manner, each participant contributes many samples, one per phone usage session, which benefits data analysis and model training. Each session is composed of features that are represented in multiple views or modalities (e.g., alphanumeric characters, special characters, accelerometer values), each of which has different timestamps and densities, as shown in Figure 1. Modeling the multi-view time series data at such a fine-grained session level brings up several formidable challenges:
Unaligned views: An intuitive idea for fusing the multi-view time series is to align them with each unique timestamp. However, features defined in one view would be missing for data points collected in another view. For example, a data point in special characters has no acceleration in accelerometer values or distance from last key in alphanumeric characters (this is for privacy concerns, because a malicious person may be able to unscramble and recover the texts using such information).
Dominant views: One may also attempt to do the fusion by concatenating the multi-view time series per session. However, the views usually have different densities in a session, because the metadata are collected from different sources or sensors. For example, character-related metadata collected following a person’s typing behaviours are much sparser than accelerometer values collected in the background which have 16 times more data points in our dataset. Dense views could dominate a concatenated feature space and potentially override the effects of sparse but important views.
View interactions: The multi-view time series from typing dynamics contains complementary information reflecting a person’s mental health. The relationship between the digital footprints and mood states can be highly nonlinear. An effective fusion strategy is needed to explore feature interactions across different views.
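The session definition used throughout this paper can be sketched in a few lines of Python. This is a minimal illustration and not the study's actual preprocessing code: the function name `split_sessions`, the `gap` parameter and the example timestamps are hypothetical, and timestamps are assumed to be sorted and expressed in seconds.

```python
from typing import List


def split_sessions(timestamps: List[float], gap: float = 5.0) -> List[List[float]]:
    """Group sorted keypress timestamps (in seconds) into sessions.

    A new session starts whenever `gap` or more seconds have elapsed
    since the previous keypress; 5 seconds is the threshold used in the text.
    """
    sessions: List[List[float]] = []
    for t in timestamps:
        if sessions and t - sessions[-1][-1] < gap:
            sessions[-1].append(t)   # still inside the current session
        else:
            sessions.append([t])     # first keypress, or the gap was reached
    return sessions


# Hypothetical keypress times: two pauses of at least 5 s yield three sessions.
keypresses = [0.0, 0.4, 1.0, 7.5, 7.9, 20.0]
print(split_sessions(keypresses))
```

Because the threshold is a plain parameter, changing or tuning it only requires passing a different `gap` value.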
In this paper, we propose a deep architecture based on late fusion, named DeepMood, to model mobile phone typing dynamics, as illustrated in Figure 2. The contributions of this work are threefold:
Data analysis (Section 2): We obtain interesting insights related to the digital footprints on mobile phones by analyzing the correlation between patterns of typing dynamics metadata and mood in bipolar affective disorder.
A novel fusion strategy in a deep framework (Section 3): Motivated by the aforementioned challenges that early fusion strategies (i.e., aligning views with timestamps or concatenating views per session) would lead to the problems of unaligned or dominant views, we propose a two-stage late fusion approach for modeling the multi-view time series data. In the first stage, each view of the time series is separately modeled by a Recurrent Neural Network (RNN) (Mikolov et al., 2010; Sutskever et al., 2011). The multi-view metadata are then fused in the second stage by exploring interactions across the output vectors from each view, where three alternative approaches are developed following the idea of Multi-view Machines (Cao et al., 2016), Factorization Machines (Rendle, 2012), or in a fully connected fashion.
Empirical evaluations (Section 4): We conduct experiments showing that 90.31% prediction accuracy on the depression score can be achieved based on session-level typing dynamics which reveals the potential of using mobile phone metadata to predict mood disturbance and severity. Our code is open-sourced at https://www.cs.uic.edu/~bcao1/code/DeepMood.py.
The data used in this work were collected from the BiAffect study (http://www.biaffect.com), which is the winner of the Mood Challenge for ResearchKit (http://www.moodchallenge.com). During a preliminary data collection phase, for a period of 8 weeks, 40 individuals were provided a Galaxy Note 4 mobile phone which they were instructed to use as their primary phone during the study. This phone was loaded with a custom keyboard that replaced the standard Android OS keyboard. The keyboard collected metadata consisting of keypress entry time and accelerometer movement and uploaded them to the study server. In order to protect participants’ privacy, individual character data with the exceptions of the backspace key and space bar were not collected.
In this work, we study the collected metadata for participants including bipolar subjects and normal controls who had provided at least one week of metadata. There are 7 participants with bipolar I disorder that involves periods of severe mood episodes from mania to depression, 5 participants with bipolar II disorder which is a milder form of mood elevation, involving milder episodes of hypomania that alternate with periods of severe depression, and 8 participants with no diagnosis per DSM-IV TR criteria (Kessler et al., 2005).
Participants were administered the Hamilton Depression Rating Scale (HDRS) (Williams, 1988) and the Young Mania Rating Scale (YMRS) (Young et al., 1978) once a week. These scales are used as the gold standard to assess the level of depressive and manic symptoms in bipolar disorder. However, the use of these clinical rating scales requires a face-to-face patient-clinician encounter, and the level of affective symptoms is assessed during a clinical evaluation. Study findings may be unreliable when using rating scales as outcome measures due to methodological issues such as unblinding of raters and patients, differences in rater experience and missing visits for outcome assessments (Demitrack et al., 1998; Psaty and Prentice, 2010; Faurholt-Jepsen et al., 2016). This motivates us to explore more objective methods with real-time data for assessing affective symptoms.
Due to privacy reasons, we only collected metadata for keypresses on alphanumeric characters, including duration of a keypress, time since last keypress, and distance from last key along two axes. First, we aim to assess the correlation between duration of a keypress and mood states. The complementary cumulative distribution functions (CCDFs) of duration of a keypress are displayed in Figure 3. Data points with different scores are colored differently, and the range of mood scores corresponds to the colorbar. In general, the higher the score, the darker the color and the more severe the depressive or manic symptoms. According to the two-sample Kolmogorov-Smirnov test, for all pairs of distributions we can reject the null hypothesis that the two samples are drawn from the same distribution at the chosen significance level. As expected, we are dealing with a heavy-tailed distribution: (1) most keypresses are very fast, with a median of 85ms, (2) but a non-negligible number have a longer duration, with 5% taking more than 155ms. Interestingly, samples with mild depression tend to have shorter durations than normal ones, while those with severe depression lie in the middle. Samples with manic symptoms seem to hold a key longer than normal ones.
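The two quantities used in this analysis, the empirical CCDF and the two-sample Kolmogorov-Smirnov statistic, can be sketched with the standard library alone. This is an illustrative sketch rather than the analysis code behind the figures: the function names and the keypress durations are made up, and only the test statistic is computed (a p-value, as provided for example by `scipy.stats.ks_2samp`, is omitted).

```python
from bisect import bisect_right
from typing import Sequence


def ccdf(sample: Sequence[float], x: float) -> float:
    """Empirical CCDF: fraction of observations strictly greater than x."""
    ordered = sorted(sample)
    return 1.0 - bisect_right(ordered, x) / len(ordered)


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    sa, sb = sorted(a), sorted(b)
    return max(
        abs(bisect_right(sa, x) / len(sa) - bisect_right(sb, x) / len(sb))
        for x in sa + sb
    )


# Hypothetical keypress durations in milliseconds for two groups of sessions.
group_a = [70, 80, 85, 90, 160]
group_b = [60, 65, 75, 80, 120]
print(ccdf(group_a, 85))
print(ks_statistic(group_a, group_b))
```

A large statistic means the two empirical distributions differ somewhere by a wide margin, which is the evidence the test uses to reject the hypothesis of a common distribution.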
Next we ask how the time since last keypress correlates with mood states. We show the CCDFs of time since last keypress in Figure 4. Based on the Kolmogorov-Smirnov test, for 98.06% of the distribution pairs in HDRS and 99.52% in YMRS, we can reject the null hypothesis that the two samples are drawn from the same distribution at the chosen significance level. Not surprisingly, this distribution is heavily skewed, with most time intervals being very short (median 380ms). However, there is a significant fraction of keypresses with much longer intervals: 5% exceed 1.422s. We can observe that the values of time since last keypress from the normal group (light blue/red) approximate a uniform distribution on the log scale in the range from 0.1s to 2.0s. In contrast, this metric from samples with mood disturbance (dark blue/red) shows a more skewed distribution, with few values on the two tails and the majority centered between 0.4s and 0.8s. In other words, healthy people show a good range of reactivity that gets lost in mood disturbance, where the range is more restricted.
Figure 5 shows the CCDFs of distance from last key along two axes, which can be considered a very rough proxy for the semantic content of people’s typing. No distinction can be observed across different mood states, because there are no dramatic differences in the manner in which depressive or manic people type compared to controls.
In the special-characters view, we use one-hot encoding for typing behaviors other than alphanumeric characters, including auto-correct, backspace, space, suggestion, switching-keyboard and other. They are usually sparser than alphanumeric characters. Figure 6 shows the scatter plot between rates of these special characters as well as alphanumeric ones in a session, where the color of a dot/line corresponds to the HDRS score. Although no obvious distinction can be found between mood states, we can observe some interesting patterns: the rate of alphanumeric keys is negatively correlated with the rate of backspace (from the subfigure at the 2nd row, 7th column), while the rate of switching-keyboard is positively correlated with the rate of other keys (from the subfigure at the 5th row, 6th column). On the diagonal there are kernel density estimations. They show that the rate of alphanumeric characters is generally high in a session, followed by auto-correct, space, backspace, etc. Similar patterns can be found in the plot of YMRS, which is omitted here.
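The one-hot encoding of special-character events can be illustrated as follows. This is a small sketch under stated assumptions: the category order simply follows the order in which the categories are listed in the text, and the names `SPECIAL_KEYS`, `one_hot` and `encode_session` are hypothetical rather than taken from the study's code.

```python
from typing import List

# Category order is an assumption; it follows the listing in the text.
SPECIAL_KEYS = ["auto-correct", "backspace", "space",
                "suggestion", "switching-keyboard", "other"]


def one_hot(key: str) -> List[int]:
    """Return a 6-dimensional indicator vector for one special-character event."""
    vec = [0] * len(SPECIAL_KEYS)
    vec[SPECIAL_KEYS.index(key)] = 1
    return vec


def encode_session(events: List[str]) -> List[List[int]]:
    """Encode a session's special-character events as a sequence of one-hot rows."""
    return [one_hot(e) for e in events]


# Hypothetical session with three special-character events.
print(encode_session(["space", "backspace", "space"]))
```

Each session then becomes a variable-length sequence of 6-dimensional vectors, which is the form a recurrent model for this view would consume.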
Accelerometer values are recorded every 60ms in the background during an active session regardless of a person’s typing speed, thereby making them much denser than alphanumeric characters. The CCDFs of absolute accelerometer values along three axes are displayed in Figure 7. Data points with different mood scores are colored differently, and the higher the score, the more severe the depressive or manic symptoms. According to the two-sample Kolmogorov-Smirnov test, for all pairs of distributions we can reject the null hypothesis that the two samples are drawn from the same distribution at the chosen significance level. Note that the vertical axis of the non-zoomed plots is on a log scale. We observe a heavy-tailed distribution for all three axes and for both HDRS and YMRS, with more than 99% of data points being less than 7.45, 9.97 and 10.56 along the X, Y and Z axes, respectively. By zooming into data points at the “head” of the distribution on a regular scale, we can see different patterns in the absolute acceleration along different axes. There is a nearly uniform distribution of absolute acceleration along the Y axis in the range from 0 to 10, while the majority along the X axis lie between 0 and 2, and the majority along the Z axis lie between 6 and 10. An interesting observation is that, compared with normal ones, samples with mood disturbance tend to have larger accelerations along the Z axis and smaller accelerations along the Y axis. Hence, we suspect that people in a normal mood state prefer to hold their phone facing toward themselves, while people with depressive or manic symptoms are more likely to hold their phone at an angle closer to the horizontal, given that data were collected only when the phone was in a portrait position.
See Table 1 for more information about the statistics of the dataset. Note that the length of a sequence is measured in terms of the number of data points in a sample rather than the duration in time.
|Statistics||Alph.||Spec.||Accel.|
|# data points||836,027||538,520||14,237,503|
In this paper, we propose an end-to-end deep architecture, named DeepMood, to model mobile phone typing dynamics. Specifically, DeepMood provides a late fusion framework. It first models each view of the time series data separately using a Gated Recurrent Unit (GRU) (Cho et al., 2014), a simplified version of Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997). It then fuses the output of the GRU from each view. As the GRU extracts a latent feature representation out of each time series, where the notions of sequence length and sampling time points are removed from the latent space, this avoids the problem of dealing directly with the heterogeneity of the time series from each view. Following the idea of Multi-view Machines (Cao et al., 2016), Factorization Machines (Rendle, 2012), or in a conventional fully connected fashion, three alternative fusion layers are designed to integrate the complementary information in the multi-view time series to produce a prediction on the mood score. The architecture is illustrated in Figure 2.
Each view in the metadata is essentially a time series whose length can vary considerably across sessions, depending largely on the duration of a session. In order to model the dynamic sequential correlations in each time series, we adopt the RNN architecture (Mikolov et al., 2010; Sutskever et al., 2011), which keeps hidden states over a sequence of elements and updates the hidden state $\mathbf{h}_k$ by the current input $\mathbf{x}_k$ as well as the previous hidden state $\mathbf{h}_{k-1}$, where $k > 1$, with a recurrent function $f$:
$$\mathbf{h}_k = f(\mathbf{x}_k, \mathbf{h}_{k-1}) \tag{1}$$
The simplest form of an RNN is as follows:
$$\mathbf{h}_k = \sigma(\mathbf{W}\mathbf{x}_k + \mathbf{U}\mathbf{h}_{k-1}) \tag{2}$$
where $\mathbf{W} \in \mathbb{R}^{d_h \times d_x}$ and $\mathbf{U} \in \mathbb{R}^{d_h \times d_h}$ are model parameters that need to be learned, and $d_x$ and $d_h$ are the input dimension and the number of recurrent units, respectively.
$\sigma(\cdot)$ is a nonlinear transformation function such as tanh, sigmoid, or rectified linear unit (ReLU). Since RNNs in such a form would fail to learn long-term dependencies due to the exploding and vanishing gradient problems (Bengio et al., 1994; Hochreiter, 1998), they are not suitable for learning dependencies from a long input sequence in practice.
To make the learning procedure more effective over long sequences, the GRU (Cho et al., 2014) was proposed as a variation of the LSTM unit (Hochreiter and Schmidhuber, 1997). The GRU has attracted great attention since it overcomes the vanishing gradient problem in traditional RNNs and is more efficient than the LSTM in some tasks (Chung et al., 2014). The GRU is designed to learn from previous timestamps with long time lags of unknown size between important timestamps via memory units that enable the network to learn to both update and forget hidden states based on new inputs.
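A typical GRU is formulated as:
$$\begin{aligned}
\mathbf{r}_k &= \mathrm{sigmoid}(\mathbf{W}_r\mathbf{x}_k + \mathbf{U}_r\mathbf{h}_{k-1}) \\
\mathbf{z}_k &= \mathrm{sigmoid}(\mathbf{W}_z\mathbf{x}_k + \mathbf{U}_z\mathbf{h}_{k-1}) \\
\tilde{\mathbf{h}}_k &= \tanh(\mathbf{W}\mathbf{x}_k + \mathbf{U}(\mathbf{r}_k \circ \mathbf{h}_{k-1})) \\
\mathbf{h}_k &= \mathbf{z}_k \circ \mathbf{h}_{k-1} + (1 - \mathbf{z}_k) \circ \tilde{\mathbf{h}}_k
\end{aligned} \tag{3}$$
where $\circ$ is the element-wise multiplication operator, a reset gate $\mathbf{r}_k$ allows the GRU to forget the previously computed state $\mathbf{h}_{k-1}$, and an update gate $\mathbf{z}_k$ balances between the previous state $\mathbf{h}_{k-1}$ and the candidate state $\tilde{\mathbf{h}}_k$. The hidden state $\mathbf{h}_k$ can be considered as a compact representation of the input sequence from $\mathbf{x}_1$ to $\mathbf{x}_k$.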
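To make the role of the reset and update gates concrete, a single GRU update in its standard form can be sketched in NumPy. This is an illustrative sketch, not the released DeepMood code: bias terms are omitted, the convention that the update gate weights the previous state is an assumption, and the parameter names and random weights are hypothetical.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def gru_step(x, h_prev, p):
    """One GRU update; p maps parameter names to weight matrices (no biases)."""
    r = sigmoid(p["W_r"] @ x + p["U_r"] @ h_prev)            # reset gate
    z = sigmoid(p["W_z"] @ x + p["U_z"] @ h_prev)            # update gate
    h_tilde = np.tanh(p["W"] @ x + p["U"] @ (r * h_prev))    # candidate state
    return z * h_prev + (1.0 - z) * h_tilde                  # blend old and new


def encode_sequence(xs, p, d_h):
    """Run the GRU over a sequence and return the final hidden state."""
    h = np.zeros(d_h)
    for x in xs:
        h = gru_step(x, h, p)
    return h


rng = np.random.default_rng(0)
d_x, d_h = 3, 4
names = ["W_r", "U_r", "W_z", "U_z", "W", "U"]
params = {n: rng.normal(scale=0.1, size=(d_h, d_x if n.startswith("W") else d_h))
          for n in names}

short_seq = rng.normal(size=(5, d_x))    # sessions differ in length ...
long_seq = rng.normal(size=(40, d_x))
# ... but both are summarized by a vector of the same size d_h.
print(encode_sequence(short_seq, params, d_h).shape,
      encode_sequence(long_seq, params, d_h).shape)
```

The final hidden state has a fixed size regardless of sequence length, which is what lets views with very different densities be compared in a later fusion step.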
Here we pursue a late fusion strategy to integrate the output vectors of the GRU units on these time series data from different views. This avoids the issues of alignment and diverse frequencies among the time series under different views when performing early fusion directly on the input data.
In the following we study alternative methods for performing late fusion. These include not only the straightforward approach based on adding a fully connected layer to concatenate the features from different views, but also novel approaches to capture interactions among the features across multiple views by exploring the concept of Factorization Machines (Rendle, 2012) to capture the second-order interactions as well as the concept of Multi-view Machines (Cao et al., 2016) to capture higher order interactions as shown in Figure 8.
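We denote the output vector at the end of a sequence from the $p$-th view as $\mathbf{h}^{(p)}$. We can consider $\{\mathbf{h}^{(p)} \in \mathbb{R}^{d_h}\}_{p=1}^{m}$ as multi-view data, where $m$ is the number of views.
Fully connected layer. In order to generate a prediction on the mood score, a straightforward idea is to first concatenate features from multiple views together, i.e., $\mathbf{h} = [\mathbf{h}^{(1)}; \mathbf{h}^{(2)}; \cdots; \mathbf{h}^{(m)}] \in \mathbb{R}^{d}$, where $d$ is the total number of multi-view features, and typically $d = m d_h$ for one-directional RNNs and $d = 2 m d_h$ for bidirectional RNNs. We then feed $\mathbf{h}$ forward into one or several fully connected neural network layers with a nonlinear function $\sigma(\cdot)$ in between:
$$\mathbf{q} = \sigma(\mathbf{W}^{(1)}[\mathbf{h}; 1]), \qquad \hat{\mathbf{y}} = \mathbf{W}^{(2)}\mathbf{q} \tag{4}$$
where $\mathbf{W}^{(1)} \in \mathbb{R}^{k' \times (d+1)}$, $\mathbf{W}^{(2)} \in \mathbb{R}^{c \times k'}$, $k'$ is the number of hidden units, $c$ is the number of classes, and the constant signal “1” is to model the global bias. Note that here we consider only one hidden layer between the input layer and the final output layer as shown in Figure 8(a).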
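The fully connected fusion can be sketched in NumPy as a concatenation followed by one hidden layer. This is a minimal sketch under assumptions: ReLU is chosen as the nonlinearity for illustration, the constant 1 is appended to the concatenated vector to model the bias, and all names, sizes and random weights are hypothetical.

```python
import numpy as np


def fc_fusion(views, W1, W2):
    """Concatenate per-view vectors, apply one hidden layer, return class scores."""
    h = np.concatenate(views)                      # stacked multi-view features
    q = np.maximum(0.0, W1 @ np.append(h, 1.0))    # hidden layer; "1" models the bias
    return W2 @ q                                  # one score per class


rng = np.random.default_rng(0)
m, d_h, n_hidden, n_classes = 3, 4, 8, 2
views = [rng.normal(size=d_h) for _ in range(m)]   # stand-ins for GRU outputs
W1 = rng.normal(size=(n_hidden, m * d_h + 1))
W2 = rng.normal(size=(n_classes, n_hidden))
scores = fc_fusion(views, W1, W2)
print(scores.shape)
```

In this form, any interaction between features of different views has to be learned implicitly through the hidden units.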
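Factorization Machine layer. Rather than capturing nonlinearity through the transformation function, we consider explicitly modeling feature interactions between input units as shown in Figure 8(b).
$$\mathbf{q}_a = \mathbf{U}_a\mathbf{h}, \qquad b_a = \mathbf{w}_a^{\mathrm{T}}[\mathbf{h}; 1], \qquad \hat{y}_a = \mathrm{sum}(\mathbf{q}_a \circ \mathbf{q}_a) + b_a \tag{5}$$
where $\mathbf{U}_a \in \mathbb{R}^{k \times d}$, $\mathbf{w}_a \in \mathbb{R}^{d+1}$, $k$ is the number of factor units, and $a$ denotes the $a$-th class. By denoting $\mathbf{W}_a = \mathbf{U}_a^{\mathrm{T}}\mathbf{U}_a$, we can rewrite the decision function of $\hat{y}_a$ in Eq. (5) as follows:
$$\hat{y}_a = \sum_{i=1}^{d}\sum_{j=1}^{d}\mathbf{W}_a(i,j)\,\mathbf{h}(i)\,\mathbf{h}(j) + \sum_{i=1}^{d}\mathbf{w}_a(i)\,\mathbf{h}(i) + \mathbf{w}_a(d+1) \tag{6}$$
One can easily see that this is similar to the two-way Factorization Machines (Rendle, 2012) except that the subscript $j$ ranges from $i+1$ to $d$ in the original form.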
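A Factorization-Machine-style fusion of this kind can be sketched in NumPy for a single class. The sketch assumes a factor matrix `U` of shape (k, d) and a linear weight vector `w` of length d+1 whose last entry is the global bias; these names and shapes are illustrative assumptions rather than the released implementation. The second function expands the squared factor term into explicit pairwise interactions, so the two forms can be checked against each other.

```python
import numpy as np


def fm_layer(h, U, w):
    """Factored form: squared norm of the factor projection plus a linear term."""
    q = U @ h
    return float(np.sum(q * q) + w[:-1] @ h + w[-1])


def fm_layer_expanded(h, U, w):
    """Expanded form: every pair (i, j) interacts through <U[:, i], U[:, j]>."""
    pairwise = U.T @ U                     # (d, d) matrix of interaction weights
    return float(h @ pairwise @ h + w[:-1] @ h + w[-1])


rng = np.random.default_rng(0)
d, k = 12, 4
h = rng.normal(size=d)                     # stand-in for concatenated GRU outputs
U = rng.normal(size=(k, d))
w = rng.normal(size=d + 1)
print(np.isclose(fm_layer(h, U, w), fm_layer_expanded(h, U, w)))
```

The factored form needs only k·d interaction parameters instead of d·d, which is the usual motivation for factorizing the pairwise weights.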
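Multi-view Machine layer. In contrast to modeling up to the second-order feature interactions between all input units as in the Factorization Machine layer, we could further explore all feature interactions up to the $m$th-order between inputs from $m$ views as shown in Figure 8(c).
$$\mathbf{q}_a^{(p)} = \mathbf{U}_a^{(p)}[\mathbf{h}^{(p)}; 1], \qquad \hat{y}_a = \mathrm{sum}(\mathbf{q}_a^{(1)} \circ \cdots \circ \mathbf{q}_a^{(m)}) \tag{7}$$
where $\mathbf{U}_a^{(p)} \in \mathbb{R}^{k \times (d_h + 1)}$ is the factor matrix of the $p$-th view for the $a$-th class. As shown in Figure 2, the full-order feature interactions across multiple views are modeled in a tensor, and they are factorized in a collective manner.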
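The idea of factorizing a full-order interaction tensor can be sketched in NumPy for three views and a single class. The sketch follows the general Multi-view Machine construction, in which each view (with a constant 1 appended) is projected onto k shared factors and the projections are multiplied element-wise; the function names, shapes and random weights are hypothetical. The explicit tensor contraction is included only to check that the factored form computes the same value.

```python
from functools import reduce

import numpy as np


def mvm_layer(views, factors):
    """Project each view (with a constant 1 appended) and multiply across views."""
    qs = [U @ np.append(h, 1.0) for h, U in zip(views, factors)]
    return float(np.sum(reduce(np.multiply, qs)))


def mvm_layer_tensor(views, factors):
    """Same score via the explicit rank-k weight tensor (three views only)."""
    z1, z2, z3 = (np.append(h, 1.0) for h in views)
    U1, U2, U3 = factors
    return float(np.einsum("fi,fj,fl,i,j,l->", U1, U2, U3, z1, z2, z3))


rng = np.random.default_rng(0)
k, dims = 4, [5, 3, 6]                               # hypothetical view sizes
views = [rng.normal(size=n) for n in dims]           # stand-ins for GRU outputs
factors = [rng.normal(size=(k, n + 1)) for n in dims]
print(np.isclose(mvm_layer(views, factors), mvm_layer_tensor(views, factors)))
```

Appending the constant 1 to every view is what makes lower-order terms (single views, pairs of views and the global bias) appear alongside the full three-way interactions, while the factored form never materializes the tensor.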
Note that a dropout layer (Hinton et al., 2012) is applied before feeding the output from the GRU to the fusion layer. Dropout is a regularization method designed to prevent co-adaptation of feature detectors in deep neural networks. It randomly sets each unit to zero with a certain probability. The dropped units contribute to neither the feed-forward process nor the back-propagation process.
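The effect of dropout on a vector of GRU outputs can be sketched in NumPy. This is a minimal sketch with one assumption stated plainly: surviving units are rescaled by 1/(1 - rate) during training (the "inverted dropout" convention), which the text does not specify. The function name and values are hypothetical.

```python
import numpy as np


def dropout(h, rate, rng, training=True):
    """Zero each unit with probability `rate` during training.

    Surviving units are rescaled by 1 / (1 - rate) so that the expected
    activation is unchanged (inverted dropout, an assumption of this sketch).
    """
    if not training or rate == 0.0:
        return h
    keep = rng.random(h.shape) >= rate     # True for units that survive
    return h * keep / (1.0 - rate)


rng = np.random.default_rng(0)
h = np.ones(8)                             # stand-in for a GRU output vector
print(dropout(h, rate=0.5, rng=rng))
print(dropout(h, rate=0.5, rng=rng, training=False))
```

At prediction time the function is the identity, so the fusion layer always sees the full set of features.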
Following the computational graph, it is straightforward to compute gradients for model parameters in both the Factorization Machine layer and the Multi-view Machine layer, as we do for the conventional fully connected layer. Therefore, the error messages generated from the loss function on the final mood score can be back-propagated through these fusion layers all the way to the very beginning, i.e., $\mathbf{W}_r$, $\mathbf{U}_r$, $\mathbf{W}_z$, $\mathbf{U}_z$, $\mathbf{W}$, $\mathbf{U}$ in the GRU for each input view. In this manner, we can say that DeepMood is an end-to-end learning framework for mood detection.
We investigate a session-level prediction problem. That is to say, we use features of alphanumeric characters, special characters and accelerometer values in a session to predict the mood score of the associated participant.
The implementation is completed using Keras (Chollet, 2015) with TensorFlow (Abadi et al., 2015) as the backend. The code has been made available at the author’s homepage (https://www.cs.uic.edu/~bcao1/code/DeepMood.py). Specifically, a bidirectional GRU is applied on each view of the metadata. RMSProp (Tieleman and Hinton, 2012) is used as the optimizer. We truncate sessions that contain more than 100 keypresses, and we remove sessions if any of their views contain fewer than 10 keypresses. This leaves us with 14,613 total samples, which are then split by time for training and validation. Each user contributes the first 80% of her sessions for training and the rest for validation. We empirically set other parameters, including the number of epochs, batch size, learning rate and dropout fraction. The number of recurrent units and factor units are selected on the validation set. Detailed configurations of the hyper-parameters are summarized in Table 2.
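The filtering, truncation and time-based split described above can be sketched as follows. This is an illustrative sketch, not the released preprocessing: it assumes a session is a dict mapping a view name to a list of data points, that truncation keeps the first 100 points of each view, and that a user's sessions are already in chronological order. All names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def prepare_session(views: Dict[str, list], max_len: int = 100,
                    min_len: int = 10) -> Optional[Dict[str, list]]:
    """Drop a session if any view is too short; otherwise truncate every view."""
    if any(len(v) < min_len for v in views.values()):
        return None
    return {name: v[:max_len] for name, v in views.items()}


def split_by_time(sessions: List, train_fraction: float = 0.8) -> Tuple[List, List]:
    """Split one user's chronologically ordered sessions into train/validation."""
    cut = int(len(sessions) * train_fraction)
    return sessions[:cut], sessions[cut:]


# Hypothetical session with three views of different densities.
session = {"alph": list(range(30)), "spec": list(range(12)), "accel": list(range(400))}
kept = prepare_session(session)
print({name: len(v) for name, v in kept.items()})
print(split_by_time(list(range(10))))
```

Splitting by time rather than at random keeps every validation session later than the training sessions of the same user, so the model is never evaluated on sessions that precede its training data.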
Experiments on the depression score HDRS are conducted as a binary classification task where $c = 2$. We consider sessions with an HDRS score between 0 and 7 (inclusive) as negative samples (normal) and those with HDRS greater than or equal to 8 as positive samples (from mild depression to severe depression). On the other hand, the mania score YMRS is more complicated, without a widely adopted threshold. Therefore, YMRS is directly used as the label for a regression task where $c = 1$. Accuracy and F-score are used to evaluate the classification task, and root-mean-square error (RMSE) is used for the regression task.
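The labeling rule and the three evaluation metrics can be sketched with the standard library. The HDRS threshold follows the text; the F-score is computed here as the F1 score of the positive class, which is an assumption since the exact variant is not stated. Function names and example values are hypothetical.

```python
import math
from typing import Sequence


def hdrs_label(score: float) -> int:
    """Dichotomize HDRS: 0-7 is negative (normal), 8 or more is positive."""
    return 1 if score >= 8 else 0


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 of the positive class (assumed variant): 2TP / (2TP + FP + FN)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


labels = [hdrs_label(s) for s in [3, 7, 8, 15]]      # hypothetical HDRS scores
print(labels)
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]), f_score([1, 0, 1, 1], [1, 0, 0, 1]))
print(rmse([0.0, 0.0], [3.0, 4.0]))
```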
|# recurrent units ($d_h$)||4, 8, 16|
|# factor units ($k$)||4, 8, 16|
|maximum sequence length||100|
|minimum sequence length||10|
The compared methods are summarized as follows:
DMVM: The proposed DeepMood architecture with a Multi-view Machine layer for data fusion.
DFM: The proposed DeepMood architecture with a Factorization Machine layer for data fusion.
DNN: The proposed DeepMood architecture with a conventional fully connected layer for data fusion.
In general, DMVM, DFM and DNN can be categorized as late fusion approaches, while XGB, SVM and LR are early fusion strategies for the sequence prediction problem on multi-view time series. Note that the number of model parameters for fusing multi-view data in DMVM and DFM is and , respectively, thereby leading to approximately the same model complexity due to . For DNN, the number of model parameters for fusion is . For a fair comparison, we need to control the model complexity of the compared methods at the same level. Therefore, in all experiments, we always set .
Experimental results are shown in Table 3. We can see that the late fusion based DeepMood methods are the best at predicting the dichotomized HDRS scores, especially DMVM and DFM with 90.31% and 90.21% accuracy, respectively. This demonstrates the feasibility of using passive typing dynamics from mobile phone metadata to predict the disturbance and severity of mood states. In addition, we find that SVM and LR are not a good fit for this task, or for sequence prediction in general. XGB performs reasonably well as an ensemble method, but DMVM still outperforms it by a significant margin of 5.56%, 5.93% and 10.02% in terms of accuracy, F-score and RMSE, respectively. Among the DeepMood variations, the improvement of DMVM and DFM over DNN reveals the potential of replacing a conventional fully connected layer with a Multi-view Machine layer or Factorization Machine layer for data fusion in a deep framework. This is because DMVM and DFM explicitly capture higher-order interactions among features, while DNN does not model feature interactions explicitly.
In practice, it is important to understand how the model works for each individual when monitoring her mood states. Therefore, we investigate the prediction performance of DMVM on each of the 20 participants in our dataset. Results are shown in Figure 9 where each dot represents a participant with the number of her contributed sessions in the training set and the corresponding prediction accuracy. We can see that the proposed model can steadily produce accurate predictions (87%) of a participant’s mood states when she provides more than 400 valid typing sessions in the training phase. Note that the prediction we make in this work is per session which is typically less than one minute. We can expect more accurate results on the daily level by ensembling sessions occurring during a day.
In this section, we show more details about the learning procedure of the proposed DeepMood architecture with different fusion layers and that of XGB. Figure 10 illustrates how the accuracy on the validation set changes over epochs. We observe that different fusion layers have different convergence performance in the first 300 epochs, and afterwards they steadily outperform XGB. Among the DeepMood methods, DMVM and DFM converge more efficiently than DNN in the first 300 epochs, and they reach a better local minimum of the loss function at the end. This again shows the importance of the fusion layer in a deep framework. It is also interesting to see the convergence process of XGB considering its popularity and success on many tasks in practice. We found that the generalizability of XGB on the sequence prediction task is limited, although its training error could perfectly converge to 0 at an early stage.
To better understand the role that different views play in mood detection by DeepMood, we examine separate models trained with or without each view. Since DMVM is designed for heterogeneous data fusion, i.e., data with at least two views, we train DMVM on every pair of views. Moreover, we train DFM on every single view. Experimental results are shown in Table 4. First, we observe that special characters (Spec.) are poor predictors of mood states. Alphanumeric characters (Alph.) and accelerometer values (Accel.) have significantly better predictive performance, with Alph. being the best individual predictor of mood states. This validates a high correlation between mood disturbance and typing patterns, including duration of a keypress, time interval since the last keypress, and accelerometer values.
|Method||Accuracy||F-score||RMSE|
|DMVM w/o Alph.||0.8125||0.8164||3.9833|
|DMVM w/o Spec.||0.9008||0.9034||3.8166|
|DMVM w/o Accel.||0.8318||0.8253||3.9499|
|DMVM w/ all||0.9031||0.9070||3.5664|
|DFM w/ Alph.||0.8322||0.8224||3.9515|
|DFM w/ Spec.||0.6260||0.5676||4.1040|
|DFM w/ Accel.||0.8015||0.8089||3.9722|
|DFM w/ all||0.9021||0.9011||3.6767|
This work is studied in the context of supervised sequence prediction. Xing et al. provide a brief survey on the sequence prediction problem where sequence data are categorized into five subtypes: simple symbolic sequences, complex symbolic sequences, simple time series, multivariate time series, and complex event sequences (Xing et al., 2010). Sequence classification methods are grouped into three subtypes: feature based methods, sequence distance based methods, and model based methods. Feature based methods first transform a sequence into a feature vector and then apply conventional classification models (Lesh et al., 1999; Aggarwal, 2002; Leslie and Kuang, 2004; Ji et al., 2007; Ye and Keogh, 2009). Distance based methods include the K nearest neighbor classifier (Keogh and Pazzani, 2000; Keogh and Kasetty, 2003; Ratanamahatana and Keogh, 2004; Wei and Keogh, 2006; Xi et al., 2006; Ding et al., 2008) and SVM with local alignment kernel (Lodhi et al., 2002; She et al., 2003; Sonnenburg et al., 2005).
However, most of these works focus on simple symbolic sequences and simple time series, with a few on complex symbolic sequences and multivariate time series. The problem of classifying complex event sequence data (a combination of multiple numerical measurements and categorical fields) still needs further investigation, which motivates this work. Furthermore, most of the methods are devoted to shallow models with feature engineering. Inspired by the great success of deep RNNs in other sequence tasks, including speech recognition (Graves et al., 2013) and machine translation (Bahdanau et al., 2014), in this work we propose a deep architecture to model complex event sequences of mobile phone typing dynamics.
On multi-view learning, Cao et al. propose to fuse multi-view data through the operation of tensor product and assume that the effects of feature interactions across views have a low rank (Cao et al., 2014, 2016). Lu et al. extend it to multi-task learning (Lu et al., 2017). Zhang et al. use Factorization Machines to initialize the bias terms and embedding vectors for multi-field categorical data at the bottom layer of a deep architecture (Zhang et al., 2016b). There is also some work incorporating multiple views into the process of subgraph mining (Cao et al., 2015) and deep learning (Zhang et al., 2016a) to help identify meaningful patterns from data.
It appears that mobile phone metadata could be used to predict the presence of mood disorders. The proposed DeepMood architecture is able to achieve 90.31% prediction accuracy, where late fusion is indeed more effective than early fusion and a more sophisticated fusion layer also helps. The ability to passively collect data that can be used to infer the presence and severity of mood disturbances may enable providers to provide interventions to more patients earlier in their mood episodes. Models such as the one presented here may also lead to a deeper understanding of the effects of mood disturbances on the daily activities of people with mood disorders.
This work is supported in part by NSF through grants IIS-1526499, and CNS-1626432, and NSFC 61672313.
Tensor-based Multi-view Feature Selection with Applications to Brain Diseases. In ICDM.
TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. (2015). http://tensorflow.org/ Software available from tensorflow.org.