1 Introduction & Related Work
The ubiquity of smartphones in many people’s lives makes them a rich source of information about a person’s mental and cognitive state. Here, we focus solely on app usage patterns and investigate to what extent they are informative about a person’s cognitive health. There has been significant past work on analyzing smartphone app usage patterns in general (e.g., Böhmer et al. (2011); Girardello and Michahelles (2010); Do et al. (2011); Morrison et al. (2018); Farrahi and Gatica-Perez (2008)), including many studies that predict user behaviour and characteristics based on app usage (e.g., Baeza-Yates et al. (2015); Chittaranjan et al. (2013); Bati and Singh (2018); Wang et al. (2015, 2018); Bai et al. (2012); Singh and Ghosh (2017); Zhao et al. (2016); Murnane et al. (2016)). Most closely related to our work are prior studies that aim to predict cognitive health and abilities from smartphone usage Dagum (2018); Chen et al. (2019); Gordon et al. (2019). Gordon et al. (2019) analyzed the relationship between app usage and cognitive function in healthy older adults. Characteristics of app usage such as the number of apps installed, the average app duration, or app usage by hour of the day were shown to be informative of Cogstate Brief Battery Maruff et al. (2009) test scores. Chen et al. (2019) integrate a larger number of data sources, including passively sensed data (e.g., activity data, heart rate, phone usage, sleep data), survey responses (mood, energy), and apps testing specific psycho-motor functions (e.g., typing speed) into one model. Using an ensemble of gradient boosted trees Chen and Guestrin (2016) on 1k hand-engineered features, they achieve an area under the receiver operating characteristic curve (AUROC) of 0.77 when discriminating between healthy and symptomatic subjects.
Inspired by these results, we address the same task as Chen et al. (2019) while using only data sources related to app usage. Our model uses unsupervised learning to find different types of interaction sessions in a user’s app stream. When combining the learned session types with supervised prediction of cognitive health, we obtain an AUROC of 0.79.
Through a number of ablation studies we demonstrate the importance of different model design decisions, including learned app embeddings, segmenting of the app stream into sessions, and clustering the sessions into session types. Finally, the interpretable structure of our model reveals novel insights into what aspects of phone usage have a strong relationship with cognitive health in our dataset. For example, we find that the relation between important apps such as Messages and cognitive health completely changes depending on what other apps are used in the same session. As such, the notion of sessions, and the learned structure of such session content, is critical to our performance; solely examining which apps are commonly used by an individual is not sufficient.
We use a subset of the data collected in a 12-week feasibility study which monitored 31 people with clinically diagnosed cognitive impairment and 82 healthy controls in normal living conditions Chen et al. (2019). The subjects were between 60 and 75 years old, with a median age of 66, and 66% of them were female. In particular, we analyze the app usage event streams, which consist of the timestamps and app identity of all app openings and closings for each user over the course of the study. Furthermore, we use the phone unlock/lock event streams, which consist of the timestamps of all phone unlock and lock events. Overall, this data amounts to more than 800k app launches and 230k phone unlock events. Further details on the study design, data collection and the full dataset can be found in Chen et al. (2019).
3 Model Description
Our model first segments the app event stream of each user into a stream of interaction sessions using the phone unlock/lock event stream (Fig. 1a). Thus, all apps that are opened between a phone unlock event and the following lock event are grouped into the same session. To represent the many different apps in the dataset in a way that encodes similarity between apps, we train a 50-dimensional embedding in the same way as the popular word2vec Mikolov et al. (2013), treating each user’s app stream as a “sentence” and predicting each app from the three apps before and after it in time. To obtain a single vector representation of each session, we average the embeddings of all apps within the session.
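The segmentation, context-window and averaging steps described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the paper: the function names, the `(timestamp, app)` tuple layout, and the assumption that unlock/lock intervals are sorted and non-overlapping are all hypothetical, and the embeddings are assumed to be given as a plain dictionary (the paper trains 50-dimensional ones).

```python
from bisect import bisect_right


def segment_sessions(app_events, unlock_intervals):
    """Group app launches by the unlock/lock interval that contains them.

    app_events: list of (timestamp, app_name) tuples (hypothetical layout).
    unlock_intervals: sorted, non-overlapping (unlock_time, lock_time) tuples.
    Returns one list of app names per interval with at least one launch.
    """
    starts = [start for start, _ in unlock_intervals]
    sessions = [[] for _ in unlock_intervals]
    for timestamp, app in app_events:
        idx = bisect_right(starts, timestamp) - 1  # last unlock at or before the launch
        if idx >= 0 and timestamp <= unlock_intervals[idx][1]:
            sessions[idx].append(app)
    return [session for session in sessions if session]


def context_pairs(app_stream, window=3):
    """Pair each app with up to `window` apps before and after it in the stream."""
    pairs = []
    for i, target in enumerate(app_stream):
        context = app_stream[max(0, i - window):i] + app_stream[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs


def session_vector(session, embeddings):
    """Average the embedding vectors of all apps opened in one session."""
    vectors = [embeddings[app] for app in session]
    return [sum(values) / len(vectors) for values in zip(*vectors)]


events = [(1, "Messages"), (2, "Mail"), (10, "Safari"), (20, "Clock")]
intervals = [(0, 5), (8, 12), (30, 40)]
sessions = segment_sessions(events, intervals)
toy_embeddings = {"Messages": [1.0, 0.0], "Mail": [0.0, 1.0], "Safari": [1.0, 1.0]}
vectors = [session_vector(s, toy_embeddings) for s in sessions]
```

In this sketch, launches that fall outside every unlock/lock interval are simply dropped; how such events are handled in the actual pipeline is not stated in the text.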
Next, we use k-means to cluster all session vectors in the dataset to identify different session types (Fig. 1b). A user’s app usage is then represented by a time series of session types (Fig. 1c).
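As a rough sketch of this clustering step, the following pure-Python version of Lloyd's k-means assigns every session vector to a session type. The initialization (random sample of data points), the iteration cap, and all names are assumptions made for illustration; the paper does not specify which k-means implementation or settings were used.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)), key=lambda k: squared_distance(point, centers[k]))


def kmeans(points, k, iterations=50, seed=0):
    """Lloyd's algorithm; returns (centers, labels) with one label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        labels = [nearest(p, centers) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centers.append([sum(c) / len(members) for c in zip(*members)])
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    labels = [nearest(p, centers) for p in points]
    return centers, labels


# Toy session vectors in temporal order; the labels form the session-type time series.
session_vectors = [[0, 0], [0, 1], [1, 0], [100, 100], [100, 101], [101, 100]]
centers, session_types = kmeans(session_vectors, k=2)
```

If the session vectors of one user are passed in temporal order, the returned labels directly give that user's time series of session types.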
Finally, we summarize the time series of each user by counting the session types (Fig. 1d). For each user, we normalize the absolute session counts by the number of days the user participated in the study. In addition, we rescale all features of all users together such that the overall mean is 1 (to make the size of the features independent of the number of clusters). These features are then used as input to an L1-regularized logistic regressor to classify users as healthy or symptomatic (Fig. 1e).
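The feature construction can be illustrated with the short sketch below. It assumes that "rescaling all features of all users together" means dividing every entry by one global mean; this reading, along with the function and argument names, is an assumption and not confirmed by the text.

```python
from collections import Counter


def session_type_features(user_labels, days_in_study, num_types):
    """Per-day session-type counts, rescaled so the mean over all entries is 1.

    user_labels: {user: [session type ids]} (hypothetical layout).
    days_in_study: {user: number of days the user participated}.
    """
    rates = {}
    for user, labels in user_labels.items():
        counts = Counter(labels)
        rates[user] = [counts.get(k, 0) / days_in_study[user] for k in range(num_types)]
    total = sum(sum(row) for row in rates.values())
    overall_mean = total / (len(rates) * num_types)  # single global scale (assumption)
    return {user: [value / overall_mean for value in row] for user, row in rates.items()}


features = session_type_features(
    user_labels={"u1": [0, 0, 1, 1], "u2": [1, 1]},
    days_in_study={"u1": 4, "u2": 1},
    num_types=2,
)
```

The resulting vectors could then be passed to any L1-regularized logistic regression; for example, scikit-learn's `LogisticRegression(penalty="l1", solver="liblinear", C=...)` would be one option, though the paper does not name the library it used.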
4 Experiments & Results
Since the number of users, N, in our dataset is small, we perform our experiments using N leave-one-out (LOO) train/test splits. For each of the N splits, we select model hyper-parameters via a second LOO cross-validation loop on the N-1 training subjects. The model parameters consist of the logistic regression weights, and the hyper-parameters consist of (i) the number of session types, K, used for the session clustering and (ii) the inverse regularization strength, C, for the logistic regression. We evaluate the final performance by computing the AUROC Bradley (1997) from the predicted probabilities of the N left-out test subjects (Table 1). Our full model achieves a test AUROC of 0.79, which is slightly higher than the 0.77 reported in Chen et al. (2019), who used a much larger set of input features. [Footnote 1: Note, though, that Chen et al. (2019) use random 70/30 train/test splits instead of LOO for evaluation. This may limit their performance given the small size of the dataset.]
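The evaluation protocol can be sketched as pooling one held-out prediction per subject and computing a single AUROC over the pool. In the sketch below, `fit_predict` is a hypothetical callback standing in for the whole training pipeline (including the inner LOO loop that selects K and C); the AUROC is computed with the pairwise-ranking definition, counting ties as one half.

```python
def auroc(labels, scores):
    """Probability that a random positive scores higher than a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


def loo_auroc(subjects, labels, fit_predict):
    """Pool one held-out prediction per subject, then compute a single AUROC.

    fit_predict(train_subjects, train_labels, test_subject) is a hypothetical
    callback that trains on N-1 subjects (selecting hyper-parameters with its
    own inner LOO loop) and returns the predicted probability of 'symptomatic'.
    """
    scores = []
    for i, subject in enumerate(subjects):
        train_subjects = subjects[:i] + subjects[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        scores.append(fit_predict(train_subjects, train_labels, subject))
    return auroc(labels, scores)


# Toy check with a stand-in "model" that returns a precomputed score per subject.
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
result = loo_auroc(toy_scores, toy_labels, lambda train, train_y, test: test)
```

Pooling the N held-out probabilities is necessary here because a single left-out subject cannot yield an AUROC on its own.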
As described in Section 3, our full model entails segmenting the app stream into sessions, embedding apps into a vector space and averaging them to get session vectors, and clustering session vectors into session types. Here, we systematically evaluate the impact of each of these model design choices on our ability to predict cognitive health using ablation studies.
Our first baseline (B1) tests the importance of grouping the app event stream into interaction sessions using the unlock/lock stream. Instead of aggregating and clustering phone usage at the session level, we cluster the individual app embeddings directly. We observe that performance drops from 0.79 to 0.75 when not grouping the app event stream in terms of sessions.
Our next three baselines (B2, B3, B4) aim to isolate the effect of the learned app embeddings. In B2, we randomly permute the assignment between apps and their embeddings and find that performance drops from 0.79 to 0.69. In B3 and B4, we replace the learned app embeddings with one-hot vectors encoding the app identity (B3) or coarser-scale App Store category (B4) of the app. Session vectors are obtained by averaging the one-hot app vectors and a user is represented as the sum over session vectors instead of counts over learned session types. While both B3 (0.75) and B4 (0.61) perform worse than our full model, the much larger drop for B4 indicates that App Store categories do not retain sufficient information to support the down-stream classification.
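To make the B2 and B3/B4 constructions concrete, the sketch below shows one way to shuffle the app-to-embedding assignment and to build averaged one-hot session vectors. The function names and the seeded shuffle are illustrative assumptions; the paper does not describe how the permutation was drawn.

```python
import random


def permute_embeddings(embeddings, seed=0):
    """B2-style control: reassign the same embedding vectors to apps at random."""
    apps = sorted(embeddings)
    vectors = [embeddings[app] for app in apps]
    random.Random(seed).shuffle(vectors)
    return dict(zip(apps, vectors))


def one_hot_session_vector(session, vocabulary):
    """B3/B4-style session vector: the average of one-hot vectors.

    `vocabulary` lists app identities (B3) or App Store categories (B4); each
    entry of the result is the fraction of launches in the session that fall
    on that vocabulary item.
    """
    index = {item: i for i, item in enumerate(vocabulary)}
    vector = [0.0] * len(vocabulary)
    for item in session:
        vector[index[item]] += 1.0 / len(session)
    return vector


toy_embeddings = {"Mail": [0.0, 1.0], "Messages": [1.0, 0.0], "Safari": [1.0, 1.0]}
shuffled = permute_embeddings(toy_embeddings)
b3_vector = one_hot_session_vector(["Mail", "Mail", "Safari"], ["Mail", "Messages", "Safari"])
```

The permutation keeps the geometry of the embedding space intact but destroys the link between an app and its position, which is what isolates the contribution of the learned similarity structure.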
Our final two baselines (B5, B6) have the least structure, using one-hot encodings instead of learned embeddings (like B3 and B4) and no session aggregation (like B1). Each user is represented as a vector of counts of the different apps (B5) or App Store categories (B6). Again, we find that performance decreases (B3 → B5: 0.75 → 0.72, B4 → B6: 0.61 → 0.53) when not grouping the app event stream into sessions.
5 Model Introspection
For our first analysis, we fit the model to all N subjects and analyse the four session types with the highest contribution to the model decision in either direction (Fig. 2). The contribution of a session type is measured by the product of the regression weight and the corresponding feature’s value summed over all subjects (weighted sum of feature). To characterize each session type, we report the app closest to the cluster center [Footnote 2: As described in Section 3, each session type is a cluster in the app2vec embedding space.] and the most common session in the session type. Furthermore, for the 15 most common apps in the dataset, we visualize the difference between the app distribution in each session type and the overall distribution of apps in the dataset (Fig. 2, bar plots). The four session types that are most strongly associated with a high symptomatic score are dominated by Call and Phone, Messages and Mail, Mail and Safari, and Settings (Fig. 2, upper half), followed by Clock and Calendar (not shown). The session types most strongly related to a low symptomatic score are dominated by Messages, by Safari, and by Mail and Facebook, plus one session type that consists of many less frequent apps and is thus difficult to summarize. We observe that the influence of a session type depends strongly on the interplay between multiple apps. The session types dominated by Messages and Mail or by Mail and Safari strongly increase the model’s predicted score for symptomatic, whereas session types dominated by single-app Messages or single-app Safari sessions, or by Mail and Facebook, strongly decrease it.
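The contribution measure described above can be written down in a few lines. The sketch assumes that the positive class is "symptomatic", so that positive contributions push the prediction toward symptomatic; the data layout and names are hypothetical.

```python
def session_type_contributions(weights, features):
    """Regression weight times the feature value summed over all subjects.

    weights: list of K logistic-regression weights, one per session type.
    features: {user: list of K feature values} (hypothetical layout).
    """
    totals = [sum(row[k] for row in features.values()) for k in range(len(weights))]
    return [w * t for w, t in zip(weights, totals)]


def top_types(contributions, n):
    """Session types with the n most negative and n most positive contributions."""
    order = sorted(range(len(contributions)), key=lambda k: contributions[k])
    return {"toward_healthy": order[:n], "toward_symptomatic": order[::-1][:n]}


toy_weights = [2.0, -1.0, 0.25]
toy_features = {"a": [1.0, 2.0, 0.0], "b": [0.0, 1.0, 4.0]}
contributions = session_type_contributions(toy_weights, toy_features)
ranking = top_types(contributions, n=1)
```

Because L1 regularization drives many weights to exactly zero, only a subset of session types would have a non-zero contribution in practice.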
To better understand our model’s prediction for individual subjects, we use the N models resulting from the LOO procedure and analyze which sessions cause them to (mis-)classify the respective test subjects. Inference in our model is linear and thus we know the contribution of each session of a user to the model’s prediction. In Fig. 3 we show twenty different users, five for each combination of healthy (right) or symptomatic (left) and high score (top) or low score (bottom). For each of these users, we list the three sessions with the largest contribution to the respective high or low score. For subjects with a high score (top) the most contributing single app sessions contain Phone, Calendar and Clock. For subjects with a low score (bottom) the most contributing single app sessions contain Messages, Instagram and Camera. As in our first analysis, we see that the impact of apps such as Messages or Mail strongly depends on the surrounding apps in the session. When Messages shares a session with Mail or Safari it strongly increases the predicted score. When Messages is alone or in a session with Facebook or Instagram it strongly decreases the predicted score. Overall we find that very similar sessions cause the model to correctly assign a high score to symptomatic subjects as well as to incorrectly assign a high score to healthy subjects (Fig. 3 upper half) and similarly for (in-)correctly assigning low scores (Fig. 3 lower half).
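The per-session attribution used here follows from the linearity of the model. As a worked sketch under assumed notation (none of these symbols appear in the original): let $n_{u,k}$ be the number of sessions of type $k$ for user $u$, $d_u$ the number of study days, $s$ the global rescaling factor, $w_k$ and $b$ the regression weights and bias, $S_u$ the set of sessions of user $u$, and $z_j$ the session type of session $j$.

```latex
\begin{align}
\operatorname{logit}(p_u) &= b + \sum_{k=1}^{K} w_k\, x_{u,k},
\qquad
x_{u,k} = \frac{s}{d_u}\, n_{u,k} = \frac{s}{d_u} \sum_{j \in S_u} \mathbb{1}[z_j = k] \\
\operatorname{logit}(p_u) &= b + \sum_{j \in S_u} \frac{s\, w_{z_j}}{d_u}
\end{align}
```

Under these assumptions, each individual session $j$ contributes the fixed amount $s\, w_{z_j} / d_u$ to the log-odds of the symptomatic class, so sessions can be ranked by their contribution for any given user.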
6 Discussion & Future Work
The reported results have several potential limitations. For example, the generalization of our results to the general population will be limited by the size of the dataset and the fact that symptomatic subjects were already diagnosed when entering the study.
Nevertheless, it is exciting that app usage alone captures systematic differences between healthy and symptomatic subjects, and we are actively pursuing multiple avenues to extend our model. Several parts of our model can be replaced by more complex building blocks. For example, one could use topic models Blei et al. (2003) to extract session types or replace the logistic regression by a non-linear classifier such as gradient boosted trees or neural networks. Additionally, we aim to incorporate the ordering of the apps in each session as well as user context, such as the time of day or a user’s motion state, into the session representation. Finally, we are exploring methods to learn the extraction of session types jointly with the classification of cognitive health in an end-to-end fashion.
- Predicting the next app that you are going to use. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pp. 285–294. Cited by: §1.
- Will you have a good sleep tonight?: sleep quality prediction with mobile phone. In Proceedings of the 7th International Conference on Body Area Networks, pp. 124–130. Cited by: §1.
- “Trust us”: mobile phone use patterns can predict individual trust propensity. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 330. Cited by: §1.
- Latent Dirichlet allocation. Journal of Machine Learning Research 3 (Jan), pp. 993–1022. Cited by: §6.
- Falling asleep with angry birds, facebook and kindle: a large scale study on mobile application usage. In Proceedings of the 13th international conference on Human computer interaction with mobile devices and services, pp. 47–56. Cited by: §1.
- The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern Recogn. 30 (7), pp. 1145–1159. Cited by: §4.
- Developing measures of cognitive impairment in the real world from consumer-grade multimodal sensor streams. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2145–2155. Cited by: §1, §2, §4, footnote 1.
- Xgboost: a scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pp. 785–794. Cited by: §1.
- Mining large-scale smartphone data for personality studies. Personal and Ubiquitous Computing 17 (3), pp. 433–450. Cited by: §1.
- Digital biomarkers of cognitive function. npj Digital Medicine 1 (1), pp. 10. Cited by: §1.
- Smartphone usage in the wild: a large-scale analysis of applications and context. In Proceedings of the 13th international conference on multimodal interfaces, pp. 353–360. Cited by: §1.
- Daily routine classification from mobile phone data. In International Workshop on Machine Learning for Multimodal Interaction, pp. 173–184. Cited by: §1.
- AppAware: which mobile applications are hot?. In Proceedings of the 12th international conference on Human computer interaction with mobile devices and services, pp. 431–434. Cited by: §1.
- App usage predicts cognitive ability in older adults. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 168. Cited by: §1.
- Validity of the cogstate brief battery: relationship to standardized tests and sensitivity to cognitive impairment in mild traumatic brain injury, schizophrenia, and aids dementia complex. Archives of Clinical Neuropsychology 24 (2), pp. 165–178. Cited by: §1.
- Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Cited by: §3.
- A large-scale study of iphone app launch behaviour. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 344. Cited by: §1.
- Mobile manifestations of alertness: connecting biological rhythms with patterns of smartphone app use. In Proceedings of the 18th International Conference on Human-Computer Interaction with Mobile Devices and Services, pp. 465–477. Cited by: §1.
- Inferring individual social capital automatically via phone logs. Proc. ACM Hum.-Comput. Interact. 1 (CSCW), pp. 95:1–95:12. Cited by: §1.
- SmartGPA: how smartphones can assess and predict academic performance of college students. In Proceedings of the 2015 ACM international joint conference on pervasive and ubiquitous computing, pp. 295–306. Cited by: §1.
- Tracking depression dynamics in college students using mobile phone and wearable sensing. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 2 (1), pp. 43. Cited by: §1.
- Discovering different kinds of smartphone users through their application usage behaviors. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 498–509. Cited by: §1.