The data used in this work were obtained from the BiAffect project (http://www.biaffect.com/) with the team's permission for research use. The data were collected as follows. During a preliminary data collection phase lasting 8 weeks, 40 individuals were each provided with a Galaxy Note 4 mobile phone, which they were instructed to use as their primary phone during the study. The phone was loaded with a custom keyboard that replaced the standard Android OS keyboard. The keyboard collected meta-data consisting of keypress entry times and accelerometer movement and uploaded them to a web server. To protect subjects' privacy, the detailed content of typing was not collected, with the exception of backspace and space bar keypresses.
| ||Bipolar I||Bipolar II||Control|
|Age (years)||45.6 ± 9.9||52.4 ± 9.4||46.1 ± 10.7|
|Gender (% female)||57%||80%||63%|
|Years of education||15.4 ± 1.7||14.8 ± 2.8||15.8 ± 1.4|
|IQ||109.0 ± 3.4||102.2 ± 8.5||107.8 ± 11.7|
As in [cao2017deepmood], we study the collected meta-data of bipolar subjects and normal controls who provided at least one week of meta-data. The selected subset contains 7 subjects with bipolar I disorder, which involves periods of severe mood episodes from mania to depression; 5 subjects with bipolar II disorder, a milder form of mood elevation that involves milder episodes of hypomania alternating with periods of severe depression; and 8 subjects with no diagnosis per DSM-IV TR criteria [kessler2005lifetime]. Subjects were administered the Hamilton Depression Rating Scale (HDRS) [williams1988structured] and the Young Mania Rating Scale (YMRS) [young1978rating] once a week; these scales are used as the gold standard for assessing the level of depressive and manic symptoms in bipolar disorder. Subjects also performed self-evaluations every day. Subject demographics are summarized in the demographics table above. Two groups of features are used in this work: alphanumeric characters and accelerometer values. Although special characters were investigated in [cao2017deepmood], we find that their contribution to our model's performance is negligible and that including them even degrades performance on the depression prediction task by about 2%. For alphanumeric characters, due to privacy reasons, we collected only meta-data for these keypresses: the duration of a keypress, the time since the last keypress, and the distance from the last key along two axes. Accelerometer values along three axes, on the other hand, are recorded every 60 ms in the background during an active session, regardless of a person's typing speed.
Although Cao et al. [cao2017deepmood] analyze the correlation between patterns of typing meta-data and mood in bipolar disorder, they do not study temporal effects on typing dynamics. Here we investigate the relationship between typing dynamics and time in order to justify the need for time-based calibration in our model. One figure presents the distribution of typing hours in a 7-day by 24-hour matrix, and another shows how the mean and standard deviation of selected features change over 24 hours. These statistics are computed from all the samples in the dataset. For alphanumeric characters, we observe that the duration of a keypress and the time since the last keypress are correlated and share the same pattern. In general, typing is fastest at 6:00, typing speed remains stable between 8:00 and 20:00, and it becomes significantly slower around midnight. We suspect that this is primarily due to the circadian rhythm, a biological process that displays an endogenous oscillation of about 24 hours [edgar2012peroxiredoxins]. It runs in the background of the human brain and cycles between sleepiness and alertness at regular intervals. For accelerometer values, we omit the acceleration along the X axis, because data were collected only when the phone was in a portrait position, which makes the X-axis acceleration centered around 0 and less informative. The hourly statistics reveal a complementary relationship between the acceleration along the Y axis and that along the Z axis. The fact that the Z-axis acceleration is usually negative when one is lying down and using the phone may explain the observation that the average Z-axis acceleration is small (with large Y-axis acceleration) during the night, and relatively large (with small Y-axis acceleration) during the day.

Previous studies have shown that people's mental health depends on the day of the week [suhara2017deepmood, golder2011diurnal]. We therefore also illustrate how these typing dynamics features correlate with the day of the week. The duration of a keypress and the time since the last keypress differ significantly across the days of a week, although these differences are not as interpretable as those associated with the circadian rhythm. Accelerometer values along the Y and Z axes also vary across the days of the week, and we suspect that the relatively smaller Z-axis acceleration and larger Y-axis acceleration on Sunday may arise because people spend more time lying in bed or on the couch on Sunday. These observations motivate us to incorporate time effects into the model design.
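The hourly and day-of-week statistics described above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the helper name `group_stats`, the `(timestamp, value)` sample layout, and the sample values are hypothetical and are not taken from the BiAffect data or the authors' code.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev


def group_stats(samples, key):
    """Group (timestamp, value) pairs by key(timestamp).

    Returns {group: (mean, population standard deviation)}.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[key(ts)].append(value)
    return {g: (mean(v), pstdev(v)) for g, v in sorted(buckets.items())}


# Hypothetical keypress durations in seconds (illustrative values only).
samples = [
    (datetime(2017, 3, 6, 6, 15), 0.08),    # Monday, 06:15
    (datetime(2017, 3, 6, 6, 40), 0.10),    # Monday, 06:40
    (datetime(2017, 3, 7, 23, 5), 0.15),    # Tuesday, 23:05
    (datetime(2017, 3, 12, 23, 50), 0.17),  # Sunday, 23:50
]

# Statistics per hour of day (0-23) and per day of week (Monday is 0).
by_hour = group_stats(samples, key=lambda ts: ts.hour)
by_day = group_stats(samples, key=lambda ts: ts.weekday())
```

Grouping by a key function keeps the hour-of-day and day-of-week analyses in one code path; the same helper could be applied to each feature (keypress duration, time since last keypress, Y-axis and Z-axis acceleration) in turn.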
We further explore the personal traits hidden in the typing dynamics by analyzing the feature distributions across subjects. In the corresponding figure, the left column presents boxplots with scattered samples per subject for the duration of a keypress, the time since the last keypress, the acceleration along the Y axis, and the acceleration along the Z axis, and the right column presents the corresponding heatmaps of p-values from subject-level two-sided t-tests. We conclude that two subjects often have significantly different feature distributions regardless of their diagnosis. Therefore, it is critical to consider subject-level baselines. Moreover, it is worth noting that although the connection between bipolar disorder and these typing behavior features is not obvious from these plots, we will show in the experiments that accurate mood prediction can be achieved by capturing local patterns, exploiting the circadian rhythm, and personalizing our proposed model.
There are two key modules used in this work, the temporal convolutional neural network and the Gated Recurrent Unit (GRU), which are briefly reviewed in this section.
Although convolutional neural networks (CNNs) are most widely used in computer vision models, it has recently been demonstrated that CNNs also work well on some natural language processing (NLP) tasks such as sentence classification [kim2014convolutional, zhang2015character]. A temporal convolutional module computes a 1-dimensional convolution over an input sequence. Given an input sequence $f$ with length $l$ and $n$ channels, and a list of discrete kernels $g_j$ with length $k$, the convolution $h_j$ between $f$ and $g_j$ with stride $d$ is calculated by
$$h_j(y) = \sum_{i=1}^{n} \sum_{x=1}^{k} g_j(x) \cdot f_i(y \cdot d - x + c),$$
where $j$ indexes the kernels, $c = k - d + 1$ is an offset constant, and each output feature $h_j$ has length $\lfloor (l - k)/d \rfloor + 1$.
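As a concrete illustration of the strided temporal convolution above, here is a minimal pure-Python sketch. It assumes the offset constant is $c = k - d + 1$ and the output length is $\lfloor (l - k)/d \rfloor + 1$, following the formulation in [zhang2015character]; the function name `temporal_conv` and the toy inputs are hypothetical, and each kernel is shared across channels exactly as the formula is written.

```python
def temporal_conv(f, kernels, d):
    """Strided 1-D convolution following h_j(y) in the text.

    f:       list of n channels, each a list of l values
    kernels: list of kernels g_j, each a list of k values
    d:       stride
    Returns one output list per kernel, each of length (l - k) // d + 1.
    """
    l, k = len(f[0]), len(kernels[0])
    c = k - d + 1                      # offset constant (assumed definition)
    out_len = (l - k) // d + 1
    outputs = []
    for g in kernels:
        h = []
        for y in range(1, out_len + 1):        # 1-based, as in the formula
            total = 0.0
            for channel in f:                  # sum over channels i
                for x in range(1, k + 1):      # sum over kernel positions x
                    total += g[x - 1] * channel[y * d - x + c - 1]
            h.append(total)
        outputs.append(h)
    return outputs


# Toy input: two channels of length 5, one kernel of length 3, stride 2.
f = [[1, 2, 3, 4, 5], [1, 1, 1, 1, 1]]
h = temporal_conv(f, kernels=[[1, 0, -1]], d=2)
print(h)  # [[2.0, 2.0]]
```

Because the index is $y \cdot d - x + c$, the kernel is applied in flipped order (a true convolution rather than a cross-correlation); deep learning libraries typically implement the latter, which differs only in how the learned kernel is stored.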
Gated Recurrent Unit
Recurrent neural networks (RNNs) are widely used to model the dynamics of sequence data, where hidden states are updated to model the changes in the sequence over time. The current hidden state $h_t$ is computed using the current input $x_t$ and the previous hidden state $h_{t-1}$. The simplest way of implementing an RNN is
$$h_t = \phi(W x_t + U h_{t-1} + b),$$
where $W$ and $U$ are weight parameters, $b$ is the bias parameter, and $\phi$ is a nonlinear activation function. Although vanilla RNNs are useful, they suffer from exploding or vanishing gradients, which prevents them from capturing long-term dependencies effectively. The Gated Recurrent Unit (GRU) [cho2014GRU] was proposed as a simplification of the Long Short-Term Memory (LSTM) [hochreiter1997LSTM] module; it overcomes the vanishing gradient problem of the simplest RNNs and has fewer parameters than the LSTM. A typical GRU is formulated as follows:
$$\begin{aligned}
r_t &= \sigma(W_r x_t + U_r h_{t-1}), \\
z_t &= \sigma(W_z x_t + U_z h_{t-1}), \\
\tilde{h}_t &= \tanh(W x_t + U (r_t \odot h_{t-1})), \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t,
\end{aligned}$$
where $\sigma$ is the sigmoid function, $\tanh$ is the hyperbolic tangent function, and $\odot$ is the element-wise multiplication operator.
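To make the gating equations concrete, the following pure-Python sketch performs a single GRU update. It is an illustration rather than the authors' implementation: the function names and the parameter dictionary are hypothetical, biases are omitted as in the equations, and the candidate state is assumed to use its own recurrent weight `U`, as in the standard GRU formulation.

```python
import math


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gru_step(x_t, h_prev, p):
    """One GRU update; p holds the weight matrices W_r, U_r, W_z, U_z, W, U."""
    # Reset gate r_t and update gate z_t.
    r = [sigmoid(a + b)
         for a, b in zip(matvec(p["W_r"], x_t), matvec(p["U_r"], h_prev))]
    z = [sigmoid(a + b)
         for a, b in zip(matvec(p["W_z"], x_t), matvec(p["U_z"], h_prev))]
    # Candidate state: the reset gate scales the previous hidden state.
    rh = [ri * hi for ri, hi in zip(r, h_prev)]
    h_tilde = [math.tanh(a + b)
               for a, b in zip(matvec(p["W"], x_t), matvec(p["U"], rh))]
    # New state: interpolate between the previous state and the candidate.
    return [zi * hp + (1.0 - zi) * ht
            for zi, hp, ht in zip(z, h_prev, h_tilde)]


# Sanity check with all-zero weights: both gates equal 0.5 and the candidate
# is tanh(0) = 0, so the new state is half the previous state.
zeros_in = [[0.0], [0.0]]             # 2 hidden units, 1 input feature
zeros_hh = [[0.0, 0.0], [0.0, 0.0]]   # 2 x 2 recurrent weights
params = {"W_r": zeros_in, "U_r": zeros_hh, "W_z": zeros_in,
          "U_z": zeros_hh, "W": zeros_in, "U": zeros_hh}
h_next = gru_step([1.0], [0.4, -0.2], params)
```

The last equation shows why gradients can survive long sequences: when $z_t$ is close to 1, the previous state is copied forward almost unchanged.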
In this section, we propose dpMood, a model that utilizes early-fused features, combines CNNs with RNNs, and considers each person's circadian rhythm. The architecture of dpMood is illustrated in the architecture figure. In the following subsections, we explain each component of dpMood in detail, i.e., early fusion of features, stacking CNNs and RNNs, and considering each person's circadian rhythm.
Early Fusion of Features
Although DeepMood [cao2017deepmood] works well by using different networks to process features of different views and concatenating them at a later stage of the model (which we refer to as the late fusion approach), we instead propose to use early fusion methods that align the alphanumeric keypresses with the accelerometer values before feeding them into any downstream machine learning model. The motivation behind early fusion is that aligning features of different views can provide extra information about the temporal relations between features.
An intuitive early fusion approach, named EF-dropna, is to find the accelerometer value whose timestamp is closest to each alphanumeric keypress, align the two, and drop the unaligned accelerometer values. However, there are more accelerometer values than alphanumeric keypresses, and discarding the unaligned accelerometer values inevitably results in information loss. Therefore, another early fusion approach, named EF-fillna, is to retain the full accelerometer sequence and fill the unaligned positions in the alphanumeric sequence with zeros. These two early fusion methods are illustrated in the corresponding figure. We try both early fusion methods in our proposed model and name the resulting variants dpMood-fillna and dpMood-dropna, respectively.
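The two alignment strategies can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' preprocessing code: the function names, the `(timestamp, features)` tuples, and the sample values are hypothetical, timestamps are in milliseconds, and both sequences are assumed to be sorted by time.

```python
from bisect import bisect_left


def nearest_index(times, t):
    """Index of the entry in sorted `times` closest to t (ties go earlier)."""
    i = bisect_left(times, t)
    if i == 0:
        return 0
    if i == len(times):
        return len(times) - 1
    return i if times[i] - t < t - times[i - 1] else i - 1


def ef_dropna(keys, accel):
    """EF-dropna: one fused row per keypress; unaligned readings are dropped."""
    acc_times = [t for t, _ in accel]
    return [kf + accel[nearest_index(acc_times, t)][1] for t, kf in keys]


def ef_fillna(keys, accel, key_dim=4):
    """EF-fillna: one fused row per accelerometer reading; zeros where no
    keypress is aligned. If two keypresses map to the same reading, the later
    one wins (a simplification made for this sketch)."""
    acc_times = [t for t, _ in accel]
    aligned = {nearest_index(acc_times, t): kf for t, kf in keys}
    return [aligned.get(i, [0.0] * key_dim) + af
            for i, (_, af) in enumerate(accel)]


# Hypothetical session: 4 accelerometer readings (60 ms apart), 2 keypresses.
accel = [(0, [0.1, 9.7, 0.5]), (60, [0.0, 9.6, 0.6]),
         (120, [0.2, 9.5, 0.7]), (180, [0.1, 9.4, 0.8])]
keys = [(50, [0.09, 0.30, 1.0, 0.0]), (170, [0.11, 0.25, -2.0, 1.0])]

dropna_rows = ef_dropna(keys, accel)   # 2 rows of 7 features
fillna_rows = ef_fillna(keys, accel)   # 4 rows of 7 features
```

Either way, each fused row carries the 4 alphanumeric features followed by the 3 accelerometer features, so a single network pathway can consume the sequence.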
Exploiting Local Typing Dynamics with Convolutional Neural Networks
In order to exploit local typing patterns, we use 1-D convolution to learn features from an input sequence. Specifically, the convolution block used in this paper is a convolution layer followed by batch normalization and ReLU activation. One advantage of using multi-layer CNNs to extract useful features from sequence data, instead of directly applying multi-layer RNNs, is that CNN-based models can be trained much faster than RNNs without sacrificing prediction performance, as will be validated in the experiments.
Combining CNNs and RNNs
Using CNNs alone loses the long-term temporal dynamics, since the kernels in CNNs are designed to capture only local features within a small window. In contrast, although RNNs take more time to train, they can produce features that capture the overall dynamics of the input sequence. Therefore, we propose to use a combination of CNNs and RNNs, so that we can exploit their respective advantages in learning local patterns and temporal dependencies. To do so, we first feed the features into two convolution blocks and then into a bi-directional GRU module. Specifically, we take the output features from the second convolution block, split them along the temporal dimension, and feed the split features into the GRU sequentially. We concatenate the last output features of the GRU in the two directions to form a single feature vector. Also, since we apply early fusion to the features, we have only one pathway and do not need to concatenate results from different views.
Periodic Dynamics and Personalized Mood Prediction
A circadian rhythm is any biological process that displays an endogenous oscillation of about 24 hours [edgar2012peroxiredoxins]. It runs in the background of the human brain and cycles between sleepiness and alertness at regular intervals. It is commonly known that individual depressive moods vary according to the circadian rhythm, as well as the day of the week [suhara2017deepmood]. In order to exploit the circadian rhythm and the other periodic patterns discussed in the data analysis, as a rough approximation, we propose to use the $\sin$ function for time-based calibration of the final prediction. Specifically, suppose the output of the last fully-connected layer is $x$; before producing the final score for regression, we scale $x$ by the value of a $\sin$ function that takes the current time as input.

Furthermore, smart-phone users usually have very distinct typing dynamics. As shown in the subject-level analysis, each subject may have different baselines in terms of mood states, even with similar typing dynamics. Therefore, we should make personalized mood predictions rather than use a subject-unaware model. This could enable us to improve prevention and treatment outcomes by better incorporating individual patient characteristics. In order to provide personalized mood prediction, the final prediction should be further adjusted per subject. Hence, it is more intuitive to use a different set of parameters for each person, which are learned automatically by gradient descent and back-propagation. The calibration for user $u$ is given by
$$s = x \cdot (\alpha_u \sin(\beta_u t_0 + \gamma_u) + \delta_u),$$
where $t_0$ is the starting time of the input sequence (e.g., a phone usage session), represented by the number of hours passed since the start of the earliest session of the subject. There are many choices of periodic functions, but since any periodic function can be approximated by a Fourier series, which is a sum of many sinusoidal functions, here we use only one $\sin$ function and demonstrate its effectiveness in helping mood prediction.
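The calibration step can be illustrated with a short pure-Python sketch. In the model the per-subject parameters are learned by gradient descent; here they are fixed, hypothetical values chosen only so that the arithmetic is easy to follow (in particular, $\beta_u = 2\pi/24$ is an assumed value that yields a 24-hour period).

```python
import math


def calibrate(x, t0, alpha, beta, gamma, delta):
    """Scale the raw network output x by a subject-specific sinusoid of t0,
    the session start time in hours since the subject's earliest session."""
    return x * (alpha * math.sin(beta * t0 + gamma) + delta)


# Hypothetical per-subject parameters (alpha_u, beta_u, gamma_u, delta_u).
subject_params = {
    "subject_a": (0.5, 2 * math.pi / 24, 0.0, 1.0),
    "subject_b": (0.1, 2 * math.pi / 24, math.pi, 0.8),
}

raw_output = 4.0
# For subject_a, sin peaks 6 hours in and bottoms out 18 hours in.
score_peak = calibrate(raw_output, 6.0, *subject_params["subject_a"])     # 4 * 1.5
score_trough = calibrate(raw_output, 18.0, *subject_params["subject_a"])  # 4 * 0.5
```

The offset $\delta_u$ acts as a per-subject baseline scale, while $\alpha_u$, $\beta_u$, and $\gamma_u$ control the amplitude, period, and phase of the time-dependent adjustment.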
In this section, we evaluate the proposed model on the HDRS and YMRS regression tasks, study the convergence efficiency of different methods, and investigate the effect of the training set size on the regression tasks.
|min sequence length||10|
|max sequence length||100|
|GRU hidden dimension||20|
For the prediction of depression score HDRS and mania score YMRS, we treat them as regression tasks where HDRS/YMRS scores are used as labels. Root-mean-square error (RMSE) is used as the metric for both tasks. The features used in all the compared models are defined below:
Alphanumeric sequence: The features in the alphanumeric sequence are represented by a 4-dimensional vector that includes duration of a keypress, time since last keypress, horizontal distance to last keypress, and vertical distance to last keypress.
Accelerometer sequence: The features in the accelerometer sequence are represented by a 3-dimensional vector whose dimensions hold the accelerometer values along the X, Y, and Z axes, respectively.
Each session is composed of these two sequences: an alphanumeric sequence and an accelerometer sequence. We truncate sessions that contain more than 100 alphanumeric characters and remove sessions that contain fewer than 10 alphanumeric characters. This leaves us with 14,613 sessions in total. For each subject, we use the earliest 80% of sessions for training and the rest for testing. The model is implemented in PyTorch and runs on an Ubuntu system with an NVIDIA Titan X Pascal GPU. The hyper-parameters of the proposed model and all baselines are fixed to the same values, and RMSProp [hinton2012rmsprop] is chosen as the optimizer. The two convolutional blocks of dpMood have 10 and 20 kernels, respectively, and each kernel is of size 3 with stride 2. Note that the CNN baseline has a third convolutional block with 30 kernels, each of size 3 with stride 2, while dpMood uses only two convolution blocks in order to preserve enough information along the temporal dimension for the GRUs. Other hyper-parameter values are listed in the hyper-parameter table above. Our code is open-source at https://github.com/stevehuanghe/dpMood.
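The session filtering, the chronological split, and the effect of stacking stride-2 convolutions on sequence length can be sketched as below. This is an illustration under stated assumptions rather than the released code: the function names and the session dictionary layout are hypothetical, only the alphanumeric sequence is truncated, and the convolutions are assumed to use no padding.

```python
def prepare_sessions(sessions, min_len=10, max_len=100):
    """Drop sessions with fewer than min_len keypresses; truncate the rest."""
    kept = []
    for s in sessions:
        if len(s["keys"]) < min_len:
            continue
        kept.append({**s, "keys": s["keys"][:max_len]})
    return kept


def chronological_split(sessions, train_frac=0.8):
    """Per-subject split: the earliest train_frac of sessions are for training."""
    ordered = sorted(sessions, key=lambda s: s["start"])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]


def conv_out_len(length, kernel=3, stride=2):
    """Output length of an unpadded 1-D convolution (assumption: no padding)."""
    return (length - kernel) // stride + 1


# Hypothetical sessions of one subject with 5, 10, 150, 20 and 30 keypresses.
sessions = [{"start": i, "keys": list(range(n))}
            for i, n in enumerate([5, 10, 150, 20, 30])]
kept = prepare_sessions(sessions)
train, test = chronological_split(kept)

# Temporal length of a maximal 100-step sequence after 1, 2 and 3 blocks.
lengths = [100]
for _ in range(3):
    lengths.append(conv_out_len(lengths[-1]))
print(lengths)  # [100, 49, 24, 11]
```

Under these assumptions, two blocks leave 24 time steps for the GRU on a full-length session, while a third block would leave only 11, which is consistent with the stated reason for using two blocks in dpMood.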
Since Cao et al. [cao2017deepmood] have shown that deep architectures work better than traditional methods in mood prediction, we only compare with baselines that use deep neural networks. The compared methods are introduced as follows:
RNN (DeepMood): A model that feeds each sequence to a separate bi-directional GRU, and the concatenated outputs are connected to a fully-connected network for regression. It is the same as the deep architecture proposed in DeepMood [cao2017deepmood] that uses alphanumeric characters, accelerometer values and special characters as features, and we re-implement it in PyTorch.
CNN: A CNN-based model which stacks three convolutional blocks followed by a max-pooling layer that reduces the number of channels to 1. Late fusion is applied to compare with DeepMood, which means there is a separate convolutional neural network for each kind of input features. The resulting features of each network are then concatenated into a single feature which is then put into a fully-connected network to produce a single scalar for regression.
CNNRNN: A model that stacks a CNN and an RNN together as in dpMood, but uses late fusion and no time-based or personalized calibration.
CNNRNN-Cr: The CNNRNN model with a single circadian calibration function shared by all users, i.e., no personalization.
CNNRNN-PsCr: The CNNRNN model that explores each person’s circadian rhythm.
CNNRNN-fillna: The CNNRNN model that uses early fusion method EF-fillna.
CNNRNN-dropna: The CNNRNN model that uses early fusion method EF-dropna.
dpMood-fillna: The proposed model, which stacks a CNN with an RNN, learns personal circadian rhythms, and utilizes the early fusion method EF-fillna.
dpMood-dropna: The proposed model, which stacks a CNN with an RNN, learns personal circadian rhythms, and utilizes the early fusion method EF-dropna.
|Model||HDRS w/ ctrl||YMRS w/ ctrl||HDRS w/o ctrl||YMRS w/o ctrl|
|RNN||5.410 (±0.054)||3.700 (±0.034)||4.765 (±0.048)||4.150 (±0.097)|
|CNN||5.077 (±0.119)||3.600 (±0.050)||4.806 (±0.057)||4.085 (±0.048)|
|CNNRNN||4.671 (±0.097)||3.477 (±0.042)||4.526 (±0.067)||4.129 (±0.110)|
|CNNRNN-Cr||7.023 (±0.043)||4.032 (±0.034)||8.647 (±0.031)||5.056 (±0.049)|
|CNNRNN-PsCr||2.818 (±1.439)||3.202 (±0.243)||5.164 (±1.686)||4.103 (±0.398)|
|CNNRNN-fillna||6.304 (±0.105)||3.798 (±0.027)||5.165 (±0.067)||4.198 (±0.021)|
|CNNRNN-dropna||6.698 (±0.063)||6.698 (±0.063)||5.196 (±0.023)||4.312 (±0.032)|
|dpMood-fillna||2.400 (±1.281)||3.104 (±0.422)||3.064 (±1.381)||4.013 (±0.294)|
|dpMood-dropna||2.376 (±1.065)||3.020 (±0.254)||3.641 (±1.880)||3.921 (±0.175)|
Here we conduct two sets of experiments, one with all 20 subjects and the other with only the 12 bipolar subjects (without control subjects). We do not fix a common random seed across models; instead, each compared model is run 20 times (200 epochs per run) so that we can calculate the mean and standard deviation of its RMSE. The results are shown in the results table above. Overall, our model achieves the lowest average RMSE among all compared methods. Comparing the results of RNN and CNN, we can see that the performance of the fully-convolutional model is comparable to that of the RNN model, even though the CNN fails to capture the long-term dependencies of the sequences. This reveals that the local patterns in the typing dynamics captured by the CNN model are as important as the temporal dependencies learned by the RNN model. When we combine the CNN and RNN models, the CNNRNN model achieves lower regression error than the separate CNN and RNN models, by an average margin of 7% for HDRS (with controls) and 6% for YMRS (with controls). For HDRS regression without controls, the CNNRNN model performs slightly better than CNN and RNN, while for YMRS regression without controls, CNNRNN is about the same as CNN and RNN, but with a larger variance in RMSE. Overall, the performance of the CNNRNN model indicates that preserving both local and global typing dynamics can help mood prediction. As for the CNNRNN-Cr model, adding the same calibration function for all subjects does not improve the performance, and it even leads to a large increase in RMSE compared to the simple CNN and RNN models. This may be because different people have very different personal traits, especially the bipolar subjects, as shown in the subject-level analysis. When learning a different calibration function for each subject, the CNNRNN-PsCr model achieves better performance than all previously mentioned methods in both HDRS and YMRS regression with controls, which shows the potential of considering personal circadian rhythms in mood prediction. However, CNNRNN-PsCr is worse than other methods on tasks without control subjects, which may be because the patterns of normal persons are easier to learn than those of bipolar subjects.
From the performance of CNNRNN-fillna and CNNRNN-dropna, we can see that using early fusion alone does not help predict HDRS or YMRS, and that these models are inferior to the CNNRNN model that uses late fusion. In contrast, dpMood-fillna and dpMood-dropna both achieve lower average RMSEs than all baselines on most tasks, which supports the effectiveness of combining personal circadian rhythms with early fusion. Comparing our proposed models with CNNRNN-PsCr, we can see that early fusion helps the model learn better personal traits, while the results of CNNRNN-fillna and CNNRNN-dropna show that early fusion works better when circadian rhythms are also considered.
In this section, we analyze the convergence of dpMood and the baseline methods. The random seed is fixed to 1234 for all models, and each model is trained for 200 epochs and evaluated on the test set after each epoch. As shown in the convergence figure, although dpMood-fillna and dpMood-dropna do not converge as quickly as some of the baselines, dpMood-fillna achieves its best performance at the 97th epoch, while dpMood-dropna takes 75 epochs, and dpMood-fillna has a slightly lower RMSE than dpMood-dropna. The RNN model converges slowly: after a fast decrease in test RMSE during the first 30 epochs, its convergence slows down as training goes on. In an experiment with 500 epochs, the RNN model achieves its best test RMSE at around the 480th epoch. The CNN model converges very quickly and reaches its lowest RMSE at the 17th epoch, after which it starts to overfit and its performance degrades. Compared to the CNN model, the CNN+RNN model takes more epochs to converge to its best performance (88 epochs), which is expected because it contains an RNN that requires more time to converge. However, because the CNN part of the model significantly reduces the length of the input features to the RNN part, the convergence efficiency of the CNNRNN model is still much better than that of the RNN model. The CNNRNN model takes about 76 epochs to reach its best test RMSE. The CNNRNN-Cr model clearly performs the worst, which indicates that it is inappropriate to treat all users' circadian rhythms as if they were the same. The CNNRNN-PsCr model, with better performance than all previously mentioned baselines, also preserves a fast convergence rate. In the experiments, it reaches its lowest test RMSE at the 195th epoch, which is slower than the CNNRNN model, but it outperforms CNNRNN by 40% on the HDRS regression task with control subjects.
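For reference, the evaluation metric and the relative-improvement arithmetic can be written out in a few lines. The sketch assumes that improvement is measured relative to the baseline RMSE and uses the mean RMSEs reported in the results table for HDRS with controls (CNNRNN 4.671, CNNRNN-PsCr 2.818); the function names and the toy predictions are hypothetical.

```python
import math


def rmse(predictions, labels):
    """Root-mean-square error between predicted and true scores."""
    squared = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return math.sqrt(sum(squared) / len(squared))


def relative_improvement(baseline_rmse, model_rmse):
    """Fraction by which model_rmse is lower than baseline_rmse."""
    return (baseline_rmse - model_rmse) / baseline_rmse


# Toy example of the metric (illustrative values only).
toy_error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])

# Mean RMSEs from the results table, HDRS with controls.
gain = relative_improvement(baseline_rmse=4.671, model_rmse=2.818)
print(f"{gain:.1%}")  # 39.7%
```

Under this definition, the table means give a gain of about 39.7%, which is consistent with the roughly 40% improvement quoted above.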
Sensitivity Regarding the Size of Training Data
In this section, we analyze the performance of our method with respect to the size of the training data. We fix the random seed to 1234 and train each model for 200 epochs. We start by using only 30% of the whole data for training and the rest for testing, and gradually increase the portion of training data up to 70%. This experiment is conducted on the HDRS regression task with all 20 subjects, and the results are shown in the corresponding table. As we can see, all models benefit from an increase in training data, but dpMood-fillna achieves the lowest regression error with only 30% of the whole data used for training, which indicates that our model is able to work well with small training sets and that it is important to consider each user's personal traits. As for dpMood-dropna, although it has a slightly higher RMSE than CNNRNN-PsCr when trained on small training sets, it outperforms CNNRNN-PsCr when we use 70% of the whole data for training.
Visualizing Personalized Calibration Functions
In this section, we visualize the learned calibration function for each subject and analyze the differences between subjects. One figure shows the functions learned for the HDRS regression task, and another shows the functions learned for the YMRS regression task. We can see that most subjects have very distinct functions, which indicates that every subject has their own circadian rhythm, and that some subjects have longer periods than others. The colors in these figures indicate the diagnosis of each subject, where the control group contains subjects who are diagnosed with neither bipolar I nor bipolar II, and we can consider bipolar I to be more severe than bipolar II. An interesting discovery is that the offset terms are effectively clustered with respect to the diagnosis. The corresponding figure shows that the control group has positive offsets, while subjects in the bipolar I and bipolar II groups usually have negative offsets. In the YMRS figure, we can see that the control group has calibration values that are very close to 0, which corresponds to the fact that a large portion of the YMRS scores for subjects in the control group are 0.
Mood prediction has been explored in different ways, such as using smart-phones and wearable devices to record personal traits like sleep [sano2015recognizing], voice acoustics [lu2012stresssense], and social patterns [moturu2011socialsense, ma2012daily]. Recently, Suhara et al. [suhara2017deepmood] designed a method that uses people's self-reported mood histories to predict depression. However, their method relies heavily on the quality of users' self-assessments, which may be an additional burden for users, who may stop reporting after a while. Cao et al. [cao2017deepmood] use keyboard typing as features and professional medical diagnoses as labels for predicting depression and mania, but their model does not capture the local patterns in the typing dynamics or consider each person's circadian rhythm. The task of mood prediction is closely related to supervised sequence prediction. A brief survey by Xing et al. [xing2010brief] categorizes sequence data into five groups: simple symbolic sequences, complex symbolic sequences, simple time series, multivariate time series, and complex event sequences. Models used for sequence classification are also categorized into three groups: feature based methods [lesh1999mining, aggarwal2002effective, leslie2004fast, ji2007mining, ye2009time], sequence distance based methods [keogh2000scaling, keogh2003need, ratanamahatana2004making, wei2006semi, xi2006fast, ding2008querying, lodhi2002text, she2003frequent, sonnenburg2005large], and model based methods [cheng2005protein, yakhnenko2005discriminatively, srivastava2007hmm]. This work is related to the feature based approach, but we use deep learning models to learn higher-level features for regression tasks.

This work is also related to sentence classification in natural language processing [kim2014convolutional, zhang2015character]. Sentence classification and mood prediction have in common that both tasks use sequence data as input. Text sequences and time series data are similar in that the order of elements in a sequence is important to the meaning of the sequence. Both text sequences and typing sequences have local patterns. For text, several words together form an n-gram phrase that represents a certain meaningful concept. In our case, although we do not know the meaning of each keypress, the local patterns can still carry information similar to n-grams, since the keypresses are mostly alphanumeric characters. However, the sentence classification and mood prediction tasks differ in some respects. In sentence classification, each word is represented by an embedding vector that indicates its position in a latent space, while in mood prediction, each feature is obtained from real-world sensors and therefore has a certain physical meaning. Although we also use CNNs as in [kim2014convolutional, zhang2015character, kalchbrenner2014convolutional], we need to deal with data from multiple views, while in sentence classification the data usually come from a single view.
This paper studies the problem of mood prediction using typing dynamics collected from smart-phones, and proposes an end-to-end deep architecture that incorporates both CNNs and RNNs. Moreover, the proposed model considers each person's circadian rhythm and adjusts the predictions accordingly. Extensive experiments demonstrate the power of combining CNNs and RNNs in mood prediction, and show that modeling each person's circadian rhythm is critical for achieving more accurate predictions. In addition, we study the effect of early fusion for multi-view sequence data, compare it with late fusion, and find that early fusion helps improve the performance of our model on the given tasks. The Precision Medicine Initiative (https://allofus.nih.gov) is a recent project that aims to improve prevention and treatment outcomes by better incorporating individual patient characteristics, and mobile technologies including smart-phones and wearable devices are expected to play a significant role in these efforts. This work demonstrates the feasibility and potential of such efforts.