The delivery of mental health interventions via ubiquitous devices has shown much promise. A natural conversational interface that allows longitudinal symptom tracking and appropriate just-in-time interventions would be extremely valuable. However, the task of designing emotionally-aware agents is still poorly understood. Furthermore, the feasibility of automating the delivery of just-in-time mHealth interventions via such an agent has not been fully studied. In this paper, we present the design and evaluation of EMMA (EMotion-Aware mHealth Agent) through two human-subject experiments with N=39 participants (one-week, and two-week long respectively). EMMA conducts experience sampling in an empathetic manner and provides emotionally appropriate micro-activities. We show the system can be extended to detect a user’s mood purely from smartphone sensor data. Our results show that extraverts preferred EMMA significantly more than introverts and that our personalized machine learning model worked as well as relying on gold-standard self-reports of emotion from users. Finally, we provide a set of guidelines for the design of bots for mHealth.
We increasingly rely on intelligent agents in our everyday lives. For these systems to be trusted, natural, and engaging, they need emotional intelligence. An assistant that can sense a user’s emotional state and adapt accordingly is considered more valuable, intelligent, and trustworthy [4, 21, 31]. Virtual agents have shown success in multiple contexts, including intelligent tutoring systems, health care decision support, and more recently as virtual therapists.
Advances in affective computing over the past twenty years mean that it is now possible to deploy applications in-situ and longitudinally. Computer sensing platforms can now track a user’s state across time, which presents the opportunity to personalize interactions with individuals based on their affective state. Not only desktop computers, but also smartphones and wearable devices have been studied to conduct “Reality Mining” and to infer the user’s context and mood [29, 28].
A very promising application for intelligent agents is in the delivery of mental health therapies. Prior work has shown that simple micro-interventions, such as deep breathing or talking with a friend  can be effective in increasing positive affect and reducing negative affect. Mobile mental health is of growing interest, as it leverages ubiquitous devices and can be used to reach people, regardless of their location. Furthermore, smartphones and watches are equipped with a wide variety of sensors that can be very useful in affect detection. However, the affective qualities of an agent delivering such an intervention are poorly understood. Is it beneficial if the agent expresses emotion? Can an agent learn to react emotionally appropriately given the context and user? Does an emotionally intelligent agent magnify the impact of an intervention?
In the area of mental health there are still open questions about how to use technology to sense affective states and, more importantly, how to effectively provide interventions should one need help. Might recipients open up to technologies that are more affectively neutral, resulting in the technology being trusted more or considered more objectively intelligent? Or should designers try to resemble a counselor or trusted companion, designing for a more empathic and human experience during a technological intervention?
In this paper, we introduce the design of EMMA (EMotion-Aware mHealth Agent), an emotionally intelligent wellness personal assistant for the general population. EMMA conducts experience sampling in an empathetic manner; this means that EMMA is not only the instrument to capture self-reports, but also responds with emotionally appropriate conversations. It acknowledges the user’s emotional state, similar to what an empathetic companion would do. EMMA also provides relevant micro-activities for mental wellness, and learns to detect mood from smartphone location data. We evaluate different aspects of EMMA through two experiments with N=39 participants.
The first experiment is one week long and focuses on empathetic vs. neutral experience sampling. This experiment is a randomized controlled trial that compares EMMA to a neutral control condition. Our results show that participants recorded a higher percentage of positive emotions using the empathetic bot compared to the neutral bot. Also, extraverts preferred the emotionally expressive bot significantly more than introverts.
The second experiment is two weeks long and focuses on the introduction of emotionally relevant interventions, scalability, and automation of emotion prediction. This experiment is a randomized trial, comparing two groups: EMMA, and a control condition. Though our results did not reach the significance threshold of 0.05, a trend toward more frequent and faster response to interventions from the EMMA group was observed. This experiment also explores the introduction of machine learning models for automating affect detection and its influence on users’ perception of the system. We have developed the models for this study using training data from Experiment I and the first week of Experiment II. We deployed the trained machine learning models during the second week of this experiment. Our results showed that automating mood detection using personalization and location data from the phone worked as well as relying on ground-truth emotion samples from our users.
3 Related Work
Despite multiple attempts by several researchers, classifying subjective metrics related to wellbeing and mood remains a difficult task, with relatively low accuracies, ranging from 55% to 80%. Examples include using smartphone data to model social interactions, to study the relationship between mood and sleep, to detect stress, happiness, and mood [53, 5, 2, 29, 6, 24], and to predict depressive symptoms. Others have also attempted prediction of fine-grained symptoms on a continuous scale using smartphone data and wearable sensors. Though not perfect, personal sensing (“collection and analysis of data from sensors embedded in the context of daily life with the aim of identifying human behaviors, thoughts, feelings, and traits”) has shown potential for monitoring mental health and providing a platform for just-in-time interventions.
Ecological momentary interventions (EMIs) are becoming more popular, especially for the treatment of clinical depression and anxiety. They have been effective at reducing symptoms of depression and anxiety, reducing outcomes of stress, and increasing positive psychological functioning. Automated text-messaging, used as an adjunct to therapy, has helped users stay in therapy for longer and attend more sessions. Synchronous, text-based interventions, either by a human or a chat-bot, have shown significant mental health outcome improvements compared to a wait-list condition.
There are endless subtleties in designing automated text interventions for mental health purposes. Tailoring and diversifying messages have shown potential for improving efficacy and reducing habituation. Sender, stimulus type, delivery medium, heterogeneity, timing of delivery, frequency, intensity, the trigger’s target, structure, narrative, and the linguistic content of messages are among the variables that need to be optimized for the purpose of the intervention. Other researchers have addressed low engagement and high attrition in self-guided web-based interventions by building a peer support platform, Panoply [36, 37], and using a conversational agent, Woebot.
Conversational agents have shown promise in automating the detection of psychological symptoms for both assessment and the evaluation of treatment impact. There is evidence suggesting that the general population can also benefit from such eHealth interventions. Anxiety and depression prevention EMIs are associated with small but positive effects on symptom reduction. The medium to long-term effects of such interventions need further exploration.
In the positive computing  literature, there have been efforts around personalizing interventions toward the users’ preferences (e.g., [25, 43]) and using sensor data to derive the timing of interventions (e.g., [17, 18, 52]). However, targeting relevant micro-activities toward a full range of emotional states, varying the tone of delivery appropriately, and exploring the feasibility of automating the process has not been fully studied.
4 EMMA System Design
We designed EMMA to understand affect by first asking the user explicitly, and later by inferring affect from phone sensor data. EMMA was crafted to respond to the user’s mood appropriately and to suggest micro-activities for improving mood, or at least for practicing positive affect coping skills. In this section we describe the EMMA system design.
4.1 Mobile application
The mobile application administers experience sampling and pulls the machine learning output (predicted affect scores), suggests appropriate wellness activities, and seamlessly puts them all into context with affective surrounding text. The app adjusts its behavior based on the group condition and the temporal phase of the study. Figure 1 depicts the system design.
The mobile app consists of a web-based user interface (UI) (Figure 2). The UI visualizes the conversations between the agent and the user. The content appears within bubbles that are left- or right-aligned based on the speaker. Also, the bubbles are color coded to show if they are coming from the agent (gray), are prompts for the user to respond to (green), or have already been answered and are no longer editable (blue). To make the experience more realistic, the agent appears to type for one second before the text appears (see Figure 2). The content is selected from the pool of scripted texts by a rule-based decision tree according to the group condition, stage of the study, and the user’s most recent affect.
The web-based UI is built upon the StudyPortal platform, which is designed to handle different OS types. In our case, StudyPortal is in charge of delivering notifications to the participants’ phones, continuously monitoring sensor data, and uploading it to a Microsoft Azure database. For this study, we have solely focused on Android phones due to their higher flexibility in capturing continuous sensor data.
4.2 Measuring Affect
To be emotionally intelligent, EMMA needed to reason about the user’s current affective state. To capture ground-truth emotion labels, we administered experience sampling five times a day and explicitly asked the participants to rate their mood. In addition, we continuously captured phone sensor data in the background as a behavioral surrogate of mood. The first experiment was used to collect data for training our machine learning models. During the second experiment, automatic predictions of mood based on phone sensor data substituted self-reports (although we continued to collect self-report data to validate these predictions).
4.2.1 Experience sampling:
We adopted Russell’s two-dimensional model of emotion as our primary “gold-standard” mood measurement technique. This is one of the most prevalent and highly cited models of emotion and considers two dominant dimensions for mood: valence (pleasure - displeasure) and arousal (high energy - low energy). The horizontal and vertical axes correspond to valence and arousal respectively. To make it easier for users to self-report their mood, we included sample icons (visual cues) and emotional states (textual cues) that fall under the corresponding quadrants (see the experience sampling grid in Figure 2). Note that the visual grid captures continuous values between 0 and 1 for both valence and arousal.
4.2.2 Phone sensor monitoring and feature extraction:
In order to test whether passive sensing from phone sensor data could be used to replace self-reports, we captured geolocation and detailed activities within the application to get contextual information from the phone (Footnote 1). For capturing location in a practical way that saved battery power, we set the movement threshold to 10 meters and uploaded the captured location once every minute. We were able to capture at least 50 location data points from 97% of the participants, for a total of 294,279 location data points. The loggers we implemented captured data periodically in the foreground and background. We auto-resumed loggers when they were stopped by the OS.
Footnote 1: Other data streams such as accelerometer data, communications (including calls and messages; no content was captured), and calendar data were also captured. However, due to the high rate of missing data in the accelerometer (82% missing per person), communications (missing for 50% of participants), and calendar data (missing for 52% of participants), we decided to solely focus on location data. The missing data was due to differences in the availability of sensor data on different versions of the Android OS.
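The movement threshold described above can be illustrated with a short sketch. The text does not say how the threshold was enforced, so the approach below (keep a fix only when it lies at least 10 meters from the last kept fix, using the haversine distance) is an assumption, and the names `haversine_m` and `filter_by_movement` are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def filter_by_movement(fixes, threshold_m=10.0):
    """Keep a fix only if it is at least threshold_m from the last kept fix.

    fixes: list of (timestamp, lat, lon) tuples in chronological order.
    The 10 m default mirrors the threshold in the text; the filtering rule
    itself is an assumption, not the authors' logger.
    """
    kept = []
    for fix in fixes:
        if not kept or haversine_m(kept[-1][1], kept[-1][2], fix[1], fix[2]) >= threshold_m:
            kept.append(fix)
    return kept
```

A filter of this kind trades spatial resolution for fewer stored points, which is one plausible way to reduce battery and upload cost.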
We translated the raw data into higher-level features for each hour. Our features included average latitude, average longitude, standard deviation of latitude, and standard deviation of longitude during every hour. We also included average distance from work. Since all participants were internal members of the same institution, the work location was approximated by the building’s latitude and longitude. We also included distance from home, where home was approximated by the median of the location when the user was not at work. We also encoded time of the day and day of the week as contextual information. These types of location features have precedent in prior mHealth studies. Personal measures from pre-study surveys were included as well. These features included user ID, gender, and scores on the Big Five personality test, PANAS, and DASS scales. PANAS quantifies mood and DASS captures depression, anxiety, and stress symptoms. For categorical variables such as user ID and gender, we used their one-hot representation: when a variable has $n$ distinct possible values, the one-hot representation substitutes each observation with $n$ binary values, where the $i$th value indicates the presence (1) or absence (0) of the $i$th possible value (Footnote 2). These features are used by the machine learning models. We explain the prediction engine in the section on Experiment II.
Footnote 2: Note that our current study design guarantees enough training data for each participant. If the system were run on unseen users, a one-hot user ID would not scale. In that context, we could introduce an embedding for users, or drop this feature.
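As a rough illustration of the hourly features described above, the following sketch computes them from raw location fixes using only the Python standard library. It is not the authors' pipeline: the function names, the equirectangular distance approximation, and the 200 m radius used to decide whether a fix is "at work" are all assumptions made for the example.

```python
import math
from collections import defaultdict
from datetime import datetime
from statistics import mean, median, pstdev


def distance_m(p, q):
    """Approximate distance in metres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    x = (lon2 - lon1) * math.cos((lat1 + lat2) / 2)
    y = lat2 - lat1
    return 6_371_000 * math.hypot(x, y)


def estimate_home(fixes, work, work_radius_m=200.0):
    """Median (lat, lon) of fixes taken away from work (radius is assumed)."""
    away = [(lat, lon) for _, lat, lon in fixes
            if distance_m((lat, lon), work) > work_radius_m]
    return (median(lat for lat, _ in away), median(lon for _, lon in away))


def hourly_features(fixes, work, home):
    """Group (datetime, lat, lon) fixes by hour and summarize each hour."""
    by_hour = defaultdict(list)
    for ts, lat, lon in fixes:
        by_hour[ts.replace(minute=0, second=0, microsecond=0)].append((lat, lon))
    rows = {}
    for hour, pts in by_hour.items():
        lats = [p[0] for p in pts]
        lons = [p[1] for p in pts]
        rows[hour] = {
            "lat_mean": mean(lats), "lon_mean": mean(lons),
            "lat_std": pstdev(lats), "lon_std": pstdev(lons),
            "dist_work": mean(distance_m(p, work) for p in pts),
            "dist_home": mean(distance_m(p, home) for p in pts),
            "hour_of_day": hour.hour, "day_of_week": hour.weekday(),
        }
    return rows


def one_hot(value, categories):
    """One binary indicator per possible category value."""
    return [1 if value == c else 0 for c in categories]
```

Survey scores (Big Five, PANAS, DASS) would simply be appended to each hourly row as constant per-participant columns.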
4.3 Wellbeing Interventions
We built upon previous work on micro-interventions for improving wellness [43, 51, 9]. This set of interventions includes short individual or social activities that fall into one of the following psychotherapy categories: positive psychology, cognitive behavioral, meta-cognitive, or somatic. The activities provide a textual prompt and a link to an online tool for executing the activity. This set of interventions has previously been tested and confirmed to reduce depressive symptoms and improve stress coping capabilities over the course of 4 weeks.
We revisited these activities to make them more appropriate for different emotional states. Toward this end, we have assigned each micro-activity to the most relevant quadrant(s) on the 2x2 Russell circumplex model of emotion. The interventions were augmented to have 16 activities per quadrant. Table 1 shows a sample intervention for each quadrant. Note that for categorizing activities, we have relied on the authors’ expertise in psychology and affective computing. In the discussion section, we explain limitations of this approach. See Table 2 for a sample of interventions and Supplementary Materials for a complete list of interventions.
Table 1: Sample intervention for each quadrant.

| Quadrant | Sample intervention |
|---|---|
| TL | Write yourself a note with some issue that could wait for longer. |
| TR | Spread the joy by calling a friend and passing along your positive energy! |
| BL | Affirmations always make us feel better. Check some of these out and share them with some friends. |
| BR | Celebrate with others! Write a positive comment to some friend’s good posting. |

Table 2: Sample of interventions.

| Therapy Group | Therapy Techniques | Micro-intervention Samples |
|---|---|---|
4.4 Agent Dialogue/Communications
For smooth communications between the agent and the user, we scripted dialogue that was emotionally expressive and added emojis (from the set depicted in Figure 3) when appropriate to better communicate emotions. In the emotional condition, each textual interaction had an average of 1.3 emojis, where there was an emoji per 6.5 words. In order to keep the content more realistic and engaging, we have scripted 6 different phrasings for each dialogue interaction and randomly selected one when starting a conversation. For the control condition, we scripted similar texts, but so as to be completely neutral without any expression of affect or use of emojis. Table 3 provides an example of the affective vs. neutral text.
Table 3: Example of affective vs. neutral text.

| Condition / quadrant | Example text |
|---|---|
| Neutral | Ok. Let’s try an intervention then. |
| TL | Oh, seems things are a little tense right now, let’s practice an intervention for this! |
| TR | Awesome! Then let’s do a positive skill to keep things going! |
| BL | Feeling glum? I have a skill that might brighten your day. Let’s practice. |
| BR | It’s a calm period. A great time to practice a skill. Let’s do one. |
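
The way scripted text and interventions are combined can be sketched as follows. This is a minimal illustration, not the deployed decision tree: the pools below contain only the sample texts from the tables above (the study scripted 6 phrasings per interaction and 16 activities per quadrant), and the function name `compose_message` and the condition labels are hypothetical.

```python
import random

# Sample texts copied from the tables above; the real pools are larger.
NEUTRAL_LEADS = ["Ok. Let’s try an intervention then."]
EXPRESSIVE_LEADS = {
    "TL": ["Oh, seems things are a little tense right now, let’s practice an intervention for this!"],
    "TR": ["Awesome! Then let’s do a positive skill to keep things going!"],
    "BL": ["Feeling glum? I have a skill that might brighten your day. Let’s practice."],
    "BR": ["It’s a calm period. A great time to practice a skill. Let’s do one."],
}
INTERVENTIONS = {
    "TL": ["Write yourself a note with some issue that could wait for longer."],
    "TR": ["Spread the joy by calling a friend and passing along your positive energy!"],
    "BL": ["Affirmations always make us feel better. Check some of these out and share them with some friends."],
    "BR": ["Celebrate with others! Write a positive comment to some friend’s good posting."],
}


def compose_message(condition, quadrant, rng=random):
    """Pick a lead-in for the condition and an intervention for the quadrant.

    Both groups receive a quadrant-appropriate intervention; only the
    surrounding text differs (expressive for "EMMA", neutral otherwise).
    """
    leads = EXPRESSIVE_LEADS[quadrant] if condition == "EMMA" else NEUTRAL_LEADS
    return rng.choice(leads) + " " + rng.choice(INTERVENTIONS[quadrant])
```

Keeping the intervention pool shared across conditions means that any difference between groups can be attributed to the tone of the surrounding text rather than to the activity itself.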
5 Human Subjects
To evaluate EMMA and answer our research questions, we designed two human-subject experiments. The study protocol was approved by the institutional review board at [anonymous institution]. Participants signed up for the study online and were randomly assigned to the EMMA condition or the Control group. The same participants joined both experiments, while keeping their group assignments. Forty-one participants were recruited. One participant dropped out early due to the app’s battery usage. Another participant had previous knowledge about the hypotheses of the study and thus was excluded. Among the N=39 participants who completed the protocol successfully, 7 were female and 32 were male. This included 17 full-time employees (FTE), 17 interns, and 5 external members or contractors. Table 4 summarizes the group assignments. The participants’ ages ranged between 16 and 49 (M=29.4, SD=7.9). Participants received $200 gift-cards for successfully completing both experiments. In addition, gift-card raffles were held at the end of each week, for 50, 75, and 100 dollars respectively. Among active users, three were randomly selected as winners of the raffle at the end of each week.
Overall, the participant population was generally healthy in terms of their mental health scores, as measured by the Depression Anxiety Stress Scales (DASS). DASS includes a set of self-report scales designed to measure negative emotional states of depression, anxiety, and stress. We utilized the short version of DASS, which includes 21 items, 7 per scale. Each item is rated on a Likert scale, ranging from 0 (never) to 3 (almost always). Total DASS has possible scores of 0-63, and the depression, anxiety, and stress sub-scales each have possible scores of 0-21. See Table 5 for baseline values and their standard deviation among participants. Values under 4.5 for the depression scale, under 3.5 for the anxiety scale, and under 7 for the stress scale are considered in the normal range.
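The scoring arithmetic described above can be sketched as follows. The item-to-subscale assignment is passed in as a parameter rather than hard-coded, since the text does not list it; the function names are hypothetical, and the normal-range cut-offs are the ones quoted in the paragraph above.

```python
# Upper bounds of the normal range, as quoted in the text.
NORMAL_UPPER = {"depression": 4.5, "anxiety": 3.5, "stress": 7}


def score_dass21(responses, subscale_items):
    """Sum item ratings (0..3) into sub-scale scores and a total.

    responses: {item_number: rating}
    subscale_items: {"depression": [...], "anxiety": [...], "stress": [...]},
        each listing 7 item numbers (supplied by the caller).
    """
    for item, rating in responses.items():
        if not 0 <= rating <= 3:
            raise ValueError(f"item {item}: rating {rating} outside 0..3")
    scores = {name: sum(responses[i] for i in items)
              for name, items in subscale_items.items()}
    scores["total"] = sum(scores.values())
    return scores


def in_normal_range(scores):
    """True per sub-scale when the score falls under the normal cut-off."""
    return {name: scores[name] < limit for name, limit in NORMAL_UPPER.items()}
```

With 7 items per sub-scale and a maximum rating of 3, each sub-scale tops out at 21 and the total at 63, matching the ranges stated above.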
6 Experiment I: The influence of interacting with EMMA
Our first research question concerns the influence of interacting with an emotionally expressive bot compared to a neutral agent. Previous research has shown that interacting with a textual agent that shows minimal support of affect helps relieve strong negative affect. Also, when combined with a system that is designed to be frustrating, i.e., a game with unexpectedly long delays, participants prefer to continue to use such a system for longer if they are interacting with the emotional bot. Subtle emotional expressiveness in agents has also been associated with higher trust and likability. Other researchers have looked into the role of personality (introversion/extraversion dimension) in interacting with virtual agents [46, 42, 7]. Building upon previous research, we explore the following questions: Does interacting with EMMA improve users’ self-reported mood? Do extraverts benefit more from adding emotional expressiveness to bots compared to introverts?
To answer these questions, we designed a one-week, longitudinal experiment. We randomized participants into two groups: EMMA and Control. The EMMA group had access to the mobile app that administered experience sampling. The app would generate 5 probes at random times throughout the day, between 9AM and 9PM, approximately every 2.5 hours, and we made sure that the probes were at least 30 minutes apart. Each experience sampling prompt started with a phone notification from the app, saying “Hi! Have a minute?”. The participants could then click on the notification, or start the app by clicking on the application icon on the home screen. After the app opened, EMMA would randomly select from a set of initial prompts that asked the participant to report his/her emotional state. Then EMMA would provide the experience sampling visual grid (See figure 2). After the participant responded to the prompt by dragging the indicator to express his/her emotional state, EMMA would detect the selected quadrant, and randomly draw from a set of emotionally relevant phrases scripted for the respective quadrant. Note that the Control group had access to a similar interface, with the same methodology in triggering experience sampling probes. However, the responses to the experience sampling would always be selected from a pool of plain neutral texts without any expressive emotions. In summary, the difference between EMMA and Control users was in the responses that the participants received after reporting their mood. In the control group, the app was only an instrument to capture data. Regardless of the user’s selection, it would thank the user politely afterwards with a neutral tone; but in the EMMA group, the app would acknowledge the user’s current status, respond appropriately, and resemble an empathetic companion.
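One way to generate probe times that satisfy the constraints above (5 probes between 9AM and 9PM, roughly every 2.5 hours, at least 30 minutes apart) is sketched below. The text does not describe the actual scheduler, so the strategy here (split the day into five equal windows, draw one time per window, and redraw until the minimum gap holds) is an assumption, and `schedule_probes` is a hypothetical name.

```python
import random


def schedule_probes(rng, n=5, start_min=9 * 60, end_min=21 * 60, min_gap=30):
    """Return n probe times as minutes since midnight.

    The day is split into n equal windows and one time is drawn uniformly
    from each; the draw is repeated until consecutive probes are at least
    min_gap minutes apart.
    """
    window = (end_min - start_min) / n
    while True:
        times = [int(start_min + i * window + rng.uniform(0, window)) for i in range(n)]
        if all(b - a >= min_gap for a, b in zip(times, times[1:])):
            return times
```

For example, `schedule_probes(random.Random())` yields five increasing values between 540 (9AM) and 1260 (9PM). One window per probe keeps the probes spread across the day, while the random offset keeps the exact timing unpredictable for the participant.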
To test our hypotheses regarding the interplay between personality and agent likability, we captured personality traits in the pre-study survey. We used well-validated measures of affect in the pre- and post-study surveys to capture weekly affect. We introduced satisfaction measures to study agent likability and user experience. Also, we analyzed the momentary mood sampled by the bot.
6.1.1 Big Five Personality Traits
The Big Five personality trait scale is a model based on common descriptors of personality that includes five factors: openness to experience, conscientiousness, extraversion, agreeableness, and neuroticism. The scale is composed of 44 items, where each item is rated on a Likert scale, ranging from 1 (strongly disagree) to 5 (strongly agree). Each personality factor is associated with 8-10 questions, so possible factor scores range from 8 to 50.
6.1.2 Positive and Negative Affect Schedule
The Positive and Negative Affect Schedule (PANAS) consists of 20 words that describe different emotions. Half of the items indicate positive affect (PA) and half indicate negative affect (NA). Items are rated on a Likert scale, ranging from 1 (very slightly or not at all) to 5 (extremely). PA and NA are calculated separately and each ranges from 10 to 50. The PA/NA ratio is another commonly used measure derived from PANAS. PANAS has been used to capture affect over different time scales. These include momentary, daily, over the past few days, weekly, for the past few weeks, yearly, and general affect. In our study, we have used PANAS to capture affect over the past week.
6.1.3 User Preference
We assessed satisfaction and efficacy of the system through different questions using a Likert scale, ranging from 1 (strongly disagree) to 7 (strongly agree). These questions asked about the agent’s likability, intelligence, and the appropriateness of its “tone”. Questions were also asked about the user’s preference for continuing to interact with the agent, and their improvement in awareness of daily emotions. They also asked if the notifications from the app were too frequent. Also, we included an open-ended question at the end of the week for general comments. See Supplementary Materials for the complete list of the user preference questions.
6.1.4 Experience Sampling
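Using the visual experience sampling grid, we capture valence ($v$) and arousal ($a$) on a continuous scale, $v, a \in [0, 1]$. Then, we discretize $v$ at the grid midpoint to have positive and negative valence:
$$V = \begin{cases} \text{positive} & \text{if } v \geq 0.5 \\ \text{negative} & \text{otherwise} \end{cases}$$
We also discretize $a$ to have high and low arousal:
$$A = \begin{cases} \text{high} & \text{if } a \geq 0.5 \\ \text{low} & \text{otherwise} \end{cases}$$
The 4 possible combinations of $V$ and $A$ are mapped to the 4 quadrants on the visual grid: Top Left (TL; negative valence, high arousal), Top Right (TR; positive valence, high arousal), Bottom Left (BL; negative valence, low arousal), and Bottom Right (BR; positive valence, low arousal).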
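The mapping from a continuous grid position to a quadrant can be sketched in a few lines. Since the grid spans 0 to 1 on both axes, the sketch assumes the midpoint 0.5 as the cut-off and assigns points exactly on the boundary to the positive/high side; both choices are assumptions, and `to_quadrant` is a hypothetical name.

```python
def to_quadrant(valence, arousal, threshold=0.5):
    """Map continuous valence/arousal in [0, 1] to TL, TR, BL, or BR.

    Valence is the horizontal axis (Left = negative, Right = positive);
    arousal is the vertical axis (Top = high, Bottom = low).
    """
    if not (0.0 <= valence <= 1.0 and 0.0 <= arousal <= 1.0):
        raise ValueError("valence and arousal must lie in [0, 1]")
    vertical = "T" if arousal >= threshold else "B"
    horizontal = "R" if valence >= threshold else "L"
    return vertical + horizontal
```

Under this mapping, a report of high valence and low arousal (for example, feeling calm) lands in BR, while low valence with high arousal (for example, feeling tense) lands in TL.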
6.2.1 How was EMMA perceived?
Given that EMMA is emotionally expressive, we questioned whether different personality types would prefer the agent more or less. Specifically, do extraverts prefer EMMA more than introverts? To answer this question, we discretized the Big Five extraversion scores into binary values: extravert (above median) vs. introvert (below median). Focusing only on the EMMA group, we compared the overall likability of the agent as averaged across all likability questions. An independent-samples t-test showed a significant difference in the overall likability scores for extraverts (M=5.17, SD=.91) and introverts (M=4.43, SD=.55); t(17)=2.08, p=.05 (Figure 4).
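The median split and group comparison can be sketched with the standard library alone. The text does not state which t-test variant was used or how scores exactly at the median were handled; the sketch assumes the pooled-variance (Student's) form and drops participants whose score equals the median. The function names are hypothetical.

```python
import math
from statistics import mean, median, variance


def median_split(scores):
    """Split {participant: extraversion score} into two lists of ids.

    Returns (above-median ids, below-median ids); scores equal to the
    median are left out, which is an assumption of this sketch.
    """
    m = median(scores.values())
    extraverts = [p for p, s in scores.items() if s > m]
    introverts = [p for p, s in scores.items() if s < m]
    return extraverts, introverts


def students_t(a, b):
    """Independent-samples t statistic with pooled variance.

    Returns (t, degrees of freedom); the p-value lookup is omitted.
    """
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, df
```

With pooled variance the degrees of freedom equal the combined group size minus 2, so a reported t(17) corresponds to 19 participants in the comparison.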
6.2.2 Did interacting with EMMA increase positive mood reports?
To answer this question, we compared the daily percentage of positive and negative ESM mood reports across groups. Granular daily self-reported emotion samples revealed significant differences between the EMMA and control groups. Figure 5 shows the average percentage of the positive and negative ESM self-reports per participant. An independent-samples t-test was conducted to compare the percentage of positive emotions reported daily between the EMMA and the control conditions. There was a significant difference in the percentage of positive emotions for EMMA (M=80.55, SD=3.65) and control (M=69.08, SD=4.16) conditions; t(37)=2.74, p=.009. Since the percentage of negative emotions is 100 minus the percentage of positive emotions, there was similarly a significant difference in the percentage of negative emotions for EMMA (M=19.45, SD=3.65) and control (M=30.92, SD=4.16) conditions; t(37)=-2.74, p=.009. The EMMA group reported a higher percentage of positive emotions (Top Right and Bottom Right affective quadrants in Russell’s 2x2 model) and a lower percentage of negative emotions (Top Left and Bottom Left quadrants) compared to the control group (Footnote 3).
Footnote 3: Note that the weekly PANAS survey and the daily ESM capture weekly and instantaneous mood respectively, which differ by definition. Still, our analysis showed that the PA score derived from PANAS and the total number of positive self-reports over the course of the week are correlated (Pearson r=0.217, p=0.020). However, looking more closely at the influence of EMMA on PA from weekly PANAS scores, a 2 (group) x 2 (pre-post PA) RM-ANOVA did not show a significant group x pre-post interaction.
6.2.3 User Feedback
Several participants reported interacting with the app as “an interesting experience” (pa070), “pretty quick” (pa081) and “fun” (pa045).
Some mentioned experience sampling made them more self-aware, or amplified their emotional state; pa050: “notifications from the agent amplified how I was feeling.”; pa064: “It is a good exercise to periodically reflect on my emotions. I really like that aspect.”; pa063 mentioned surveys acted as a feedback loop, too: “answering this survey forces me to define an emotional profile, to which I somehow become committed or identify with, which in turn influences my daily ratings.”.
Originally, we did not fully appreciate the extensive role of the bot in self-awareness, but the overwhelming feedback from participants recognizing how it influenced their behavior highlighted that any behavior change application needs to support self-reflection. This result is in line with previous research findings suggesting that self-reflection is an important part of behavior change and has the potential to improve wellbeing and mood [23, 55]. However, encouraging users to self-reflect should be done in moderation. There are downsides to interrupting users too frequently to self-report. First, the possible consequences should be considered. Some participants mentioned that an extremely high frequency of self-reflection could be harmful in certain circumstances; pa050: “When I was stressed/worried at work and saw that I had to report on my feelings then those feelings felt more intense.”; pa088: “I’m not sure if thinking about my feeling so many times in a day is a good thing. I realized that I’ve been picking happy only infrequently, which made me a little sad.”. Second, there can simply be a high rate of missing data; pa011: “I frequently miss notifications.”. Therefore, if the application relies solely on users’ self-reported data, missed reports will significantly hurt its performance. Partial or full automation could help address these caveats, which we discuss further in the section on Experiment II.
Also, the open responses shed light on what could be improved. Some participants mentioned EMMA’s responses were exaggerated and unable to capture subtle or nuanced emotional states; pa073: “The agent’s responses seemed very narrow responding with just a few generic phrases to my self-assessments[…]. Although the emotion quadrant consists of four squares, the actual coordinates within each square have a wide range of meanings […]. However, the agent did not appear to respond particularly differently [to the intensity of the reported emotion].”; pa028: “The reactions to input could be better. They seem to come from the four basic zones (+- x/y), and the feedback from the bot doesn’t indicate that the input I send is any more granular than that.”; pa067: “The agent seems to respond as if your emotional state is either great or terrible[…]. It would be nice if it could adopt a more neutral tone in some circumstances. It’s just kind of weird when it says something like Bummer when I report that I’m feeling [almost] neutral.”; pa031: “It would be nice if the agent had better design and some kind of persona. The way it is now seems simplistic (though, still useful in the sense of reminding)”.
We emphasize that there was overwhelming feedback from participants indicating that they wanted to enter more nuanced emotional self-reports, yet the bot’s responses were coarse and did not account for the subtleties in their reports. In other words, they wanted to select very particular feelings during self-report, and they wanted an agent whose responses reflected that precision. This mismatch frustrated our users, even though they still valued the reminders.
Some of the responses mentioned difficulty in expressing precise emotion samples. This could be due to the UI; pa078: “It’s difficult to be precise in positioning the dot on the axes.”. It could also be due to difficulty identifying emotions and mapping them to the quadrants; pa061: “Sometimes I found it hard to describe my feelings”, pa052: “[the] subtle changes in my emotions are not being captured by my current way of recording it.”
7 Experiment II: Intervention Effectiveness, Scalability, and Automation
Our next research question concerns intervention engagement and how it is mediated by the emotional intelligence of the bot delivering it. Researchers have previously studied response time to phone notifications and identified perceived disruption as a factor influencing response time. Thus, we measure response latency as a proxy for intervention disruption vs. engagement. We also measure frequency of response to interventions as another way of quantifying engagement. More precisely, our research question is: If interventions are delivered by an emotionally expressive bot, do people respond to them more quickly and more often?
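The two engagement measures named above, response frequency and response latency, can be computed from a simple log of notifications. The log format (pairs of sent and response timestamps, with `None` for unanswered prompts) and the function name `engagement` are assumptions made for this sketch.

```python
from datetime import datetime
from statistics import mean


def engagement(events):
    """Summarize responses to intervention prompts.

    events: list of (sent_at, responded_at) datetime pairs, where
    responded_at is None when the prompt was never answered.
    Returns (response rate in 0..1, mean latency in seconds or None).
    """
    latencies = [(responded - sent).total_seconds()
                 for sent, responded in events if responded is not None]
    rate = len(latencies) / len(events) if events else 0.0
    mean_latency = mean(latencies) if latencies else None
    return rate, mean_latency
```

Computing both measures per participant and then comparing them across groups keeps heavy responders from dominating the group averages.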
Our other research question is regarding the capacity to scale and automate the bot so that it predicts emotion labels only from the user’s phone usage behavior and does not require constant self-report of emotion labels. This question should be first addressed objectively by calculating the accuracy of mood prediction from phone sensor data. However, it is also important to analyze users’ preference to study if substituting ground-truth emotion labels with a machine learning prediction influences the likability of the system.
To answer these questions, we designed a two-week longitudinal experiment. We randomized participants into two groups: EMMA and Control. During the first week, the EMMA group had access to the mobile app that administered experience sampling, detected the user’s selected emotional quadrant, and responded with emotionally relevant phrases, as in Experiment I. In addition, EMMA would randomly select from a set of interventions that were emotionally appropriate for the user’s current state. EMMA would deliver the intervention surrounded by emotionally expressive text scripted for that quadrant. The Control group received a similar experience in terms of triggering experience sampling and providing emotionally relevant interventions; however, the bot itself was not emotionally expressive. Though it understood which quadrant had been selected by the user and provided skills relevant to that quadrant, all the surrounding text was neutral, without any expression of emotion.
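The difference between the two conditions can be illustrated with a minimal sketch. The intervention texts are taken from Appendix A.1; the framing phrases, the function name `compose_message`, and the condition labels are hypothetical and do not reproduce the scripted text used in the study.

```python
import random

# Two interventions per quadrant, copied from the list in Appendix A.1.
INTERVENTIONS = {
    "TL": ["Quick breathing exercise… Breathe deeply and slowly until the time runs out.",
           "Relax and listen to a simple calming tune…"],
    "TR": ["Now would be a great time to go for a short walk and leverage that energy!",
           "Give somebody a pat on the back for any good reason."],
    "BL": ["Smile… fake it till you make it!",
           "Look at cute things!"],
    "BR": ["Go grab a coffee or take a short walk with someone.",
           "Find two different opinions about a topic and share them with a friend."],
}

# Hypothetical framing text; the real phrases were scripted by the authors.
EXPRESSIVE_FRAMES = {
    "TL": "That sounds stressful. Maybe this will help: {activity}",
    "TR": "Great to hear you have so much energy! {activity}",
    "BL": "Sorry you are feeling low. Here is a small idea: {activity}",
    "BR": "Glad you are feeling calm. {activity}",
}
NEUTRAL_FRAME = "Suggested activity: {activity}"


def compose_message(quadrant, condition, rng=random):
    """Pick a random quadrant-appropriate intervention and frame it.

    Both conditions receive an intervention matched to the quadrant;
    only the "EMMA" condition gets emotionally expressive framing.
    """
    activity = rng.choice(INTERVENTIONS[quadrant])
    frame = EXPRESSIVE_FRAMES[quadrant] if condition == "EMMA" else NEUTRAL_FRAME
    return frame.format(activity=activity)
```

In this sketch the quadrant argument could come either from a self-report (first week) or from a model prediction (second week); the selection logic is the same in both cases.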
During the second week, a machine learning model simultaneously predicted the user’s current affect. This prediction was the basis of the suggested intervention in both EMMA and Control conditions. In the EMMA condition, the surrounding affectively expressive text was also driven by the prediction. Note that the self-reported emotion labels were still being stored on the cloud, but were only used later as the ground-truth measure for calculating accuracy of the machine learning model in charge of emotion detection. Below, we explain the machine learning model selection, training, and validation in detail.
7.1 Machine Learning Models
To translate the sensor data into affect, we developed a prediction engine. We used two weeks of data (the week of Experiment I and the first week of Experiment II) and split it into train and test sets (75% and 25% of samples, respectively). We trained multiple models on the training set, used 10-fold cross-validation for parameter optimization within each model category, and used the hold-out test set to select the best model for the final week of Experiment II. Our criteria for best model selection were performance, simplicity, and explainability, in that order.
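The data partitioning described above can be sketched as follows. This is an illustration only: whether the original split was shuffled, stratified, or grouped by participant is not stated, so the random shuffle and the interleaved fold assignment here are assumptions, and the function names are hypothetical.

```python
import random


def train_test_split(samples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the samples and hold out a fraction as the test set."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, validation_indices) for each of k folds.

    Indices are assigned to folds in an interleaved fashion, so fold sizes
    differ by at most one.
    """
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for held_out in range(k):
        validation = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, validation
```

Hyperparameters would be chosen by averaging a validation score over the folds of the training set, and the held-out 25% would be touched only once, to compare the best model from each category.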
7.1.1 Classification models:
7.1.2 Regression models:
Additionally, we tried modeling valence and arousal on a continuous scale using regression models. We normalized the valence and arousal values and experimented with a range of regression models including Linear Regression, several regularized versions of linear regression (Ridge, Lasso, Elastic Net), Bayesian Ridge, Support Vector Regression, Gradient Boosting, AdaBoost, Random Forest, and outlier-robust methods (RANSAC, Theil-Sen, and Huber). We later quantized the predicted values to calculate accuracy measures.
7.1.3 Personalized regression models:
Individuals tend to have different baselines and oscillate around those baseline values. Our regression modeling did not fully utilize these individual differences. We therefore tried another method: we first calculated individual baselines for valence and arousal for each person, and then explicitly modeled the variation of valence and arousal from that baseline on a continuous scale using our regression models.
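A minimal sketch of this baseline-and-deviation transformation is given below. How the baselines were computed is not specified, so using the per-person mean is an assumption, and all function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def personal_baselines(samples):
    """Compute one baseline per user from (user_id, value) pairs.

    Assumption: the baseline is the mean of that user's reported values.
    """
    by_user = defaultdict(list)
    for user_id, value in samples:
        by_user[user_id].append(value)
    return {user_id: mean(values) for user_id, values in by_user.items()}


def to_deviations(samples, baselines):
    """Replace each value with its deviation from the user's baseline.

    These deviations are the regression targets in the personalized setup.
    """
    return [(user_id, value - baselines[user_id]) for user_id, value in samples]


def to_absolute(user_id, predicted_deviation, baselines):
    """Map a predicted deviation back to the original label scale."""
    return baselines[user_id] + predicted_deviation
```

The same transformation would be applied separately to valence and to arousal, with the regressor trained on the deviations and its output shifted back by the user's baseline before quantization.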
In the next section (Section 7.1.4), we show the boost in performance from personalization, especially for arousal detection. Ultimately, we selected the personalized model with Random Forest regression for valence prediction and AdaBoost regression for arousal prediction; this choice is explained in the results section. (Footnote 4: Although the final aim is to perform a classification task, the regression model suits our problem better because it lets us predict the explicit deviation from a personal baseline rather than the absolute value in the label space. A continuous label space easily allows such a transformation, whereas a binary label space does not. We believe that is why the personalized model, although not directly optimizing for classification, works better than the classification models.)
Note that in our current study design, the machine learning models go into effect when we have captured days of data from each user. However, this may not be the case in real-world deployment. To address the cold-start problem in such scenarios, the model could start without personalization or use heuristic baselines such as the average mood of other participants, a random selection, or a neutral value. The user baseline could then be adapted as more data is captured over time.
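One way to express such a fallback is sketched below. The minimum sample count, the neutral value of zero (which assumes a label scale centred at zero), and the function name are all hypothetical choices made for illustration.

```python
from statistics import mean

NEUTRAL_VALUE = 0.0        # assumption: label scale is centred at zero
MIN_PERSONAL_SAMPLES = 10  # hypothetical threshold for trusting personal data


def baseline_for(user_samples, other_users_samples):
    """Choose a baseline for one user, falling back when data is scarce.

    Order of preference: the user's own mean, then the mean of other
    participants, then a neutral value.
    """
    if len(user_samples) >= MIN_PERSONAL_SAMPLES:
        return mean(user_samples)
    if other_users_samples:
        return mean(other_users_samples)
    return NEUTRAL_VALUE
```

Calling the function again as new self-reports arrive gives the gradual adaptation described above: once enough personal samples exist, the population average is no longer used.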
|Model||Valence: best model||Valence: Acc.||Arousal: best model||Arousal: Acc.||Quadrant: Acc.|
|Regression||Random Forest||80.6%||Random Forest||50.4%||40.1%|
|Personalized||Random Forest||82.4%||AdaBoost||67.0%||56.8%|
|Baseline||Most frequent||80.6%||Most frequent||51.9%||42.4%|
Hyperparameters: the number of estimators, criterion, maximum samples, and learning rate.
Table 6 summarizes the performance of classification, regression, and personalized regression models on the hold-out test set from the combined two weeks of data. As expected, the personalized regression model outperformed the classification and non-personalized regression models as well as the baseline; thus, the personalized regression model was selected for the second week of Experiment II deployment. For valence prediction we used the Random Forest regressor and for arousal prediction we used the AdaBoost regressor.
To further confirm the performance of the selected model, personalized regression, we calculated Pearson correlation coefficients between the predicted and actual values for the hold-out test set. There was a significant correlation between predicted and actual arousal (r=.43, p<.0001, n=387), and a significant correlation between predicted and actual valence (r=.57, p<.0001, n=387).
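For reference, the Pearson coefficient between predicted and actual values can be computed as in the following sketch. The function name is hypothetical, and the significance test (the p-value) is not included.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of the centred values.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either sequence is constant.
    return cross / (sqrt(ss_x) * sqrt(ss_y))
```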
7.2.1 Latency in Response to Interventions
To test our hypotheses regarding the interplay between emotional intelligence of the bot and intervention engagement, we captured and analyzed the latency in response to interventions. We define response latency as the time, in minutes, between receiving a notification and responding to it. This measure is extracted from the application logs of user clicks on the app UI.
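The latency definition amounts to a timestamp difference, as in the sketch below. The log schema is not described, so the assumption here is simply that each notification and each response carries a `datetime` timestamp; the function name is hypothetical.

```python
from datetime import datetime


def response_latency_minutes(notified_at, responded_at):
    """Minutes elapsed between a notification and the user's response."""
    return (responded_at - notified_at).total_seconds() / 60.0
```

For example, a notification at 09:00:00 answered at 09:12:30 on the same day yields a latency of 12.5 minutes.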
7.2.2 Frequency of Response to Interventions
We extract the average number of responses to interventions per participant, per week, from the application usage logs. This measure encodes response frequency and is used as a surrogate for intervention engagement.
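A sketch of this aggregation follows, assuming the usage log can be reduced to one `(participant_id, week)` pair per answered intervention. That log format, the participant identifiers in the usage note, and the function names are hypothetical.

```python
from collections import Counter


def responses_per_participant_week(response_log):
    """Count answered interventions for each (participant_id, week) pair."""
    return Counter(response_log)


def mean_responses(counts, participants, week):
    """Average number of responses per participant in a given week.

    Participants with no logged response in that week count as zero.
    """
    return sum(counts[(p, week)] for p in participants) / len(participants)
```

Passing the full participant list to `mean_responses`, rather than only those who appear in the log, keeps non-responders in the denominator.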
7.2.3 User Preference
We assessed satisfaction and efficacy of the system through different questions using a Likert scale, ranging from 1 (strongly disagree) to 7 (strongly agree). These questions asked about the agent’s likability, intelligence, and the appropriateness of its “tone”. They asked about the user’s preference for continuing to interact with the agent, and their improvement in awareness of daily emotions. They also asked if the notifications from the app were too frequent. In addition, we included an open-ended question for general comments. This measure was captured at the end of each week. The questions are provided in the Supplementary Materials section.
7.2.4 Experience Sampling
Using the visual experience sampling grid, we capture valence and arousal on a continuous scale. Besides the continuous values, we discretize valence into a binary label that encodes positive vs. negative valence.
We discretize arousal similarly to derive a binary label that encodes high vs. low arousal. We use the binary valence and arousal labels for calculating the accuracy of our machine learning models on valence and arousal separately.
The 4 possible combinations of the binary valence and arousal labels are mapped to the 4 quadrants on the visual grid: Top Left (TL), Top Right (TR), Bottom Left (BL), and Bottom Right (BR). We also use quadrant prediction accuracy for selecting the best performing machine learning model.
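The discretization and quadrant mapping can be sketched as follows. The threshold of zero assumes a scale centred at zero, and the orientation of the grid (valence on the horizontal axis, arousal on the vertical axis, so that Top Left means negative valence with high arousal) is inferred from the intervention lists in Appendix A.1; both are assumptions, as are the function names.

```python
def to_binary(value, threshold=0.0):
    """1 for values at or above the threshold, 0 otherwise."""
    return 1 if value >= threshold else 0


# (binary valence, binary arousal) -> quadrant on the visual grid.
QUADRANTS = {
    (0, 1): "TL",  # negative valence, high arousal
    (1, 1): "TR",  # positive valence, high arousal
    (0, 0): "BL",  # negative valence, low arousal
    (1, 0): "BR",  # positive valence, low arousal
}


def quadrant(valence, arousal, threshold=0.0):
    """Map continuous valence and arousal to one of the four quadrants."""
    return QUADRANTS[(to_binary(valence, threshold), to_binary(arousal, threshold))]


def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the actual one."""
    pairs = list(zip(predicted, actual))
    return sum(p == a for p, a in pairs) / len(pairs)
```

The same `accuracy` function applies to the binary valence labels, the binary arousal labels, and the quadrant labels; continuous regression outputs are quantized with `to_binary` or `quadrant` first.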
7.3.1 Does EMMA influence intervention engagement?
An independent t-test between the EMMA and Control conditions for response time differences did not reach statistical significance at the .05 level (t(37)=-.99, p=.32). However, we observed a trend suggesting that participants in the EMMA condition tended to respond more quickly to the notifications from the agent, while the latency for the Control group was higher. Similarly, though we did not observe a significant difference in frequency of response to interventions between the two groups (t(37)=1.59, p=.11), participants in the EMMA condition tended to respond to a higher number of the interventions (see Figure 6).
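The reported degrees of freedom (37 for 39 participants) are consistent with a pooled-variance independent t-test, which can be sketched as below. This is an illustration of the statistic only: the function name is hypothetical, the p-value lookup is omitted, and it is an assumption that the pooled form was the one used.

```python
from math import sqrt
from statistics import mean, variance


def independent_t(group_a, group_b):
    """Pooled-variance (Student) t statistic and degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Weighted average of the two sample variances.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    std_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / std_error, df
```

With latencies from the EMMA group passed first, a negative statistic corresponds to faster responses in the EMMA condition.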
7.3.2 What was the model’s performance?
|Model||Valence: best model||Valence: Acc.||Arousal: best model||Arousal: Acc.||Quadrant: Acc.|
|Personalized||Random Forest||82.2%||AdaBoost||65.7%||56.6%|
|Baseline||Most frequent||82.3%||Most frequent||48.0%||41.5%|
After deploying the personalized regression model in the second week of Experiment II, we did similar post-hoc analyses to calculate objective performance of the model. Table 7 summarizes the model performance on the actual test set, the second week of Experiment II.
We also calculated Pearson correlation coefficients between the predicted and actual values for the final week data. There was a significant correlation between predicted and actual arousal (r=.54, p<.0001, n=702), and a significant correlation between predicted and actual valence (r=.43, p<.0001, n=702).
7.3.3 How did the users perceive the automated system?
The objective performance measures show that the model had reasonable accuracy during the automation phase (final week). But did the users agree? Did they find the first week of Experiment II, which used ground-truth emotion samples, as likable as the second week, which used emotion samples predicted by machine learning? Or did the occasional errors in prediction significantly reduce the perceived likability of the agent? To answer this question, we compared the self-reported agent evaluation for when it was driven by machine learning vs. experience sampling.
We employed two one-sided t-tests (TOST) as a test for non-inferiority on the average of all likability measures before and after deploying machine learning. We set a lower and an upper equivalence bound on the difference in average likability and tested the two resulting composite null hypotheses: that the true difference lies at or below the lower bound, and that it lies at or above the upper bound. The results were t(38)=5.31, p<0.0001 and t(38)=-6.33, p<0.0001, respectively. Since both of these one-sided null hypotheses are rejected, we conclude that the likability of the agent is practically equivalent before and after deploying machine learning and that there is no significant decline in overall preference for the agent as measured by the average of all the likability measures. This is a promising result, suggesting that machine learning models could provide a scalable affect-driven agent that does not require constant user effort for providing self-reported emotions, and that users perceive it just as favorably.
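The two test statistics of a paired TOST can be sketched as below. The degrees of freedom reported above (38 for 39 participants) suggest the test was run on per-participant differences, which is what this sketch assumes; the function name is hypothetical and the equivalence bounds in the usage note are illustrative values, not those used in the study.

```python
from math import sqrt
from statistics import mean, stdev


def tost_statistics(differences, lower, upper):
    """t statistics for the two one-sided tests on paired differences.

    t_lower tests H0: mean difference <= lower (rejected for large positive t).
    t_upper tests H0: mean difference >= upper (rejected for large negative t).
    Equivalence is concluded only if both null hypotheses are rejected.
    """
    n = len(differences)
    std_error = stdev(differences) / sqrt(n)
    mean_diff = mean(differences)
    t_lower = (mean_diff - lower) / std_error
    t_upper = (mean_diff - upper) / std_error
    return t_lower, t_upper, n - 1
```

For example, `tost_statistics(week2_minus_week1, -0.5, 0.5)` would return one positive and one negative statistic when the mean difference lies well inside the bounds, matching the sign pattern of the two results reported above. Each statistic is then compared against the one-sided critical value of the t distribution with the returned degrees of freedom.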
7.3.4 User Feedback
Qualitative feedback from users provides great insights into the different study conditions and the application itself. Some of the users mentioned enjoying interacting with the app; pa041: “I love being part of this study. The app is great, the surveys are short, and it’s been fun thinking about my emotions.”; pa070: “Had a fun week interacting with the agent.”; pa052: “I did find it interesting to use the app and become aware of how stable my emotions are. That was the most positive outcome for me in this study.”
Responses showed individual differences among users’ preferences about interventions, however. Most users preferred shorter and simpler activities; pa063: “The most successful activities have involved watching short videos or images.”; pa067: “I preferred the interventions that I could do on the phone without making any noise.”; pa064: “Simple things, like do a stretch or read a joke or think about this kind of fond memory were generally helpful.”
Some participants mentioned that the activities were not always optimized for the context, they did not have time for them, or they did not like them. These points were brought up by users from all groups. For example, pa035: “I’m frequently in the middle of other things when the notification shows up and I don’t have time or it’s inappropriate for me to engage with my phone for 5-10 minutes.”; pa038: “it doesn’t take busyness into account.”; pa040: “It has suggested that I walk over to a colleague’s office; but I was working remotely so that wasn’t possible.”; pa041: “They seem like fantastic suggestions. I’m just not going to stop what I’m doing.”; pa064: “I found it very difficult to engage with many of the skills that agent presented to me, due to time, the local environment I was in, or lack of interest.”; pa057: “Some of the tasks we were asked to do were not applying to me. For example I have not posted anything on Facebook and I was uncomfortable posting some random stuffs after a while.”
Importantly, several participants mentioned they preferred not to be interrupted when feeling positive; pa081: “If someone indicates that they are feeling happy and/or positive, they shouldn’t have to do an activity.”; pa077: “I find it annoying that when I report myself as happy or content, it still has exercises for me, that typically end up making my mood less positive.”; pa080: “I felt that when I reported positive emotional state it shouldn’t then try and improve my mood further with an exercise. I am already feeling positive so an intervention will just distract me and lower my mood.”
Some participants mentioned that the tone of the agent had become expected, and thus not as effective; pa040: “The first couple of times I saw feedback on my ratings it was kind of neat; but now it just feels like it is expected that the app will tell me this, so it doesn’t really have an effect on me.” All of this feedback suggests that personalizing the feedback from the agent based on the context and preferences of the user would be preferable to the rule-based approach we implemented based on self-reports.
It is worth mentioning that the same group of participants was enrolled in Experiment I. Thus, they interacted with the bot for 3 weeks in total. Consequently, EMMA’s responses and the interventions became predictable. As participants started to anthropomorphize EMMA, they expected more richness and variability in their interactions with it. Similar findings have been observed previously in micro-intervention studies .
Some participants mentioned the way the activities were provided sounded prescriptive. For example, pa041 said “I have a hard time giving over control to any kind of app. And I don’t need another thing in my life telling me what to do and when to do it.”; pa064 said: “the agent should frame the skill as something I can do if I want to.”
8.1 Personality and preference for an affective agent
In this paper, we showed that there is value in adding emotional understanding and expression to conversational agents. The emotionally expressive bot was generally liked (on average more than 4 on a 7-point Likert scale). However, extraversion was an important personality factor influencing the likability of the agent: extraverts’ average likability measures were significantly higher than introverts’. This suggests that certain personality types may benefit more from adding emotional intelligence or expressiveness to conversational agents.
8.2 Automating affect detection in an affective bot
We showed that a mobile bot can apply machine learning techniques to phone location data and a two-week history of a person’s mood and be perceived as being as likable as a bot that works with ground-truth emotion labels captured by experience sampling. This is an encouraging result, as it relies only on smartphone location data, a ubiquitous technology that can significantly reduce the users’ burden of self-reporting during intervention applications. It suggests that automatic - albeit error-prone - affect detection can still be as effective as self-report in certain contexts. In other words, imperfect performance metrics in affect detection should not discourage researchers and practitioners from using such techniques in practice, especially when such imperfection will likely not harm the acceptance of the system significantly.
8.3 Empathetic experience sampling and mood
Our results show that providing an emotionally appropriate response when conducting experience sampling, similar to what happens in a successful human-human interaction, resulted in a higher percentage of positive responses being recorded. However, interaction with the agent did not significantly influence positive and negative affect as captured by the weekly PANAS surveys. We have three possible interpretations: 1) The influence of the agent may be subtle and, since it only appeared in granular experience sampling about five times a day, one week was possibly not enough to show its influence. 2) One participant said: “Sometimes, the responses when the mood is marked as negative seem somewhat validating or disheartening, subconsciously making me reluctant to mark my mood as such.” This suggests that the affirmative response from the agent might have affected the ratio of missing self-reports asymmetrically for negative vs. positive samples. 3) Our population was overall quite healthy and happy, so improved positive affect in a clinical sense would probably be unlikely. Further studies are needed to gain a deeper understanding of empathetic experience sampling and to tease these issues apart.
8.4 Tailoring wellness suggestion activities to affective states
We expected positive states to be good times for practicing skills and building resilience. Also, we expected negative states to benefit more from immediate intervention activities as a treatment. However, from qualitative user feedback we learned that suggesting such activities when a user is in a high energy and positive valence state may have an opposite effect. It is worth mentioning that we focused on a general population rather than clinically depressed individuals. As shown in Table 5, our participants had low scores on depression, anxiety, and stress as measured by DASS . It might be that our healthy participants did not feel the need to practice such skills and found them simplistic, and thus were sometimes annoyed by them. This irritation may have undermined the benefits of practicing such activities in the bottom left or top left quadrants of Russell’s circumplex model. This may have diminished the role of emotional vs. non-emotional conditions.
8.5 Design guidelines
We summarize the guidelines we extracted from users’ feedback for the design of affective conversational bots and EMIs. For detailed exploration of user responses, see Sections 6.2.3 and 7.3.4.
Emotional intelligence is sometimes a neutral response. Feedback from participants revealed that providing emotionally expressive responses to subtle emotions decreased the perception of emotional intelligence of the bot. For example, expressing sympathy in response to minor expressions of sadness was received as unnecessary exaggeration. Instead, a neutral or nuanced response was preferred. We learned that low intensity emotions should be responded to with more subtle and neutral interactions.
Do not interrupt a good mood for an EMI. Participants mentioned the high rate of interruption by personal technological devices and not wanting to be controlled by them for unnecessary reasons. Our population expressed that when they were in a high energy and positive valence mood, they were already engaged in rewarding activities and interrupting them for an intervention was annoying to them and sometimes resulted in a less positive mood. However, they found the activities more useful when in a low energy and negative valence mood.
Short, simple, and effortless activities are better received. Participants mentioned that they were more likely to perform shorter and simpler activities. This highlights the fact that success of an activity in a self-guided mHealth setting first depends on how likely it is to be performed. This calls for the design of more effortless interventions such as .
Contextual relevance makes EMIs more respectful. Users’ feedback revealed that making EMIs contextually relevant is one of the most important elements in designing an intelligent system. The simplest way to achieve this is to ask participants upfront what times they would like to receive triggers. Taking into account busyness and time of the day and including sensor data to detect context switching are other ways to optimize timing of triggers. This is in line with previous research findings such as  and .
Diversifying content is required to prevent habituation. Habituation is one of the main reasons interventions are ignored. Starting with a big enough pool of interventions can delay habituation. However, more dynamic methods can sustain the system in the long term. Example solutions include novel ways of combining exploitation and exploration to maximize the efficacy of personalized suggestions , machine learning techniques to automate content creation, and peer support [36, 37].
Providing an opt-out choice is needed for a respectful EMI. Users may prefer to maintain control over receiving interventions, especially in a population with relatively low scores on depression, anxiety, and stress scales that does not qualify for clinical depression or anxiety (Table 5). Providing an opt-out choice may be necessary for the EMI system to be perceived as respectful and intelligent and, ultimately, useful.
Behavior change applications need to support self-reflection. The overwhelming feedback from our participants shed light on the influence of self-reflection on behavior change. We suggest that any behavior change application should consider supporting self-reflection to improve the efficacy of the system. We need to highlight that supporting self-reflection does not necessarily require sole reliance on the user to provide data frequently. It rather means intelligent support systems could provide opportunities for the user to self-reflect at the right pace and frequency, while still being able to function without needing high rates of data from the user.
We relied on the authors’ expertise in psychology and affective computing to assign interventions to their appropriate emotional state. However, the affective assignment has not been evaluated through a user study. In the future, we would like to evaluate the appropriateness of this assignment through a separate user study.
We manually scripted all the textual interactions. Though we created multiple phrases with similar, but slightly different messages, their occurrence soon became “expected” over the course of three weeks. In the future, we would like to use machine learning to automate the intervention text generation and make it emotionally expressive by adding emojis or sentiment that works for an individual according to context.
Due to the high percentage of missing data from several of the sensors we could use, we were not able to fully capture context. For example, because we were missing calendar data, we were not able to detect availability and optimize the timing of interventions. In the future, we would like to explore more sophisticated machine learning models to be able to leverage sparse data.
We present EMMA, the first emotionally-intelligent and expressive mHealth agent, that provides wellness suggestions in the form of micro-interventions. We quantitatively and qualitatively evaluated EMMA in 2 experiments, over the course of 3 weeks in total, with a fairly large population. Our results show that an emotionally expressive agent is likable, particularly to extraverts. Furthermore, it has the potential to improve positive affect and reduce negative affect.
Our longitudinal study allowed us to identify several design guidelines for future work. Specifically, we found that delivering interventions was not effective for those people already in a high activation positive mood, that an emotionally appropriate response is sometimes neutral and a diversity of dialogue and content is necessary to avoid habituation. If interventions are more focused to specific moods and contexts, and less predictable, they have the potential to improve positive affect.
We have shown that our system can be extended to detect a user’s mood from passive smartphone sensor data and that using automatically predicted emotional states to drive emotional dialogue and the choice of interventions did not impact people’s positive opinions of the agent. This result means we could remove the burden on the user to report their emotions and makes EMMA highly scalable.
Appendix A Supplementary Materials
A.1 List of interventions
A.1.1 Top Left quadrant interventions
- Write yourself an email with some issue that could wait for later.
- Replace an unpleasant thought with two pleasant ones. Write the pleasant ones down. http://www.rapidtables.com/tools/notepad.htm
- Make a phone call to a friend and ask for some small advice in some problem you are facing.
- Relax and listen to a simple calming tune… https://www.mixcloud.com/discover/calming/
- Acceptance is blissful. Write down a stressful incident you encountered with another person, and imagine it flies away and disappears, and then destroy it. http://privnote.com
- Share a calm video with family or friends after viewing it yourself. https://www.youtube.com/results?search_query=calming
- Think about what is stressing you out right now. Now consider the aspects of the situation that you have control over, and those that you do not. Think about how you can take what you have control over and how you might be able to lessen the stress. Talk to a friend about it.
- Sometimes we build up stress in our facial muscles, like in our jaw/mouth. Open your mouth widely–is it tight? If so, rub the jawbones just beneath your cheekbones and try to relax.
- Take a moment to look at this video and really try to immerse yourself in it. https://www.moodica.com/
- Quick breathing exercise… Breathe deeply and slowly until the time runs out. http://e.ggtimer.com/m/1+minutes
- Sometimes we build up stress in our neck and shoulders. Try rolling your head slowly in a circle in order to relax your neck. Now try the other direction. Roll your shoulders forward and backward.
- When stressed, it is often helpful to stop what you are doing and take a mini-break. Why not go to Facebook and take a look at your timeline for a quick social media break? Look for something on your timeline that makes you happy. http://www.facebook.com
- Playing classical music (or any kind of music you like that does not have words) can have a calming effect if you are stressed at work. Want to give it a try? https://www.mixcloud.com/discover/classical/
- When we are working hard, it’s easy to forget to drink enough water. Why not get up and get a glass of water?
- Smooth jazz can be one way to stay focused at work. Check out this online station. https://www.jazzradio.com/pariscafe
- Change your posture / sit up straight (look at pics/video) https://www.google.com/search?q=sitting+good+posture&safe=off&tbm=isch&tbo=u&source=univ
A.1.2 Top Right quadrant interventions
- The World is so diverse, learn 3 words in a language you’ve always wanted to learn. https://translate.google.com/
- Let’s take a moment to travel somewhere. http://www.bing.com/images/?q=exotic+places
- Make someone feel good! http://www.facebook.com
- Try to think what new perspective this news brings to your life and share it with someone else: http://www.huffingtonpost.com/good-news/
- Shall we play a short game? http://mobi.online-games-zone.com/
- Give somebody a pat on the back for any good reason.
- Cats are hilarious. Check out a few of these and show ones you like to your friends. http://www.bing.com/images/?q=funny+cats
- Watch a funny video with a friend. https://www.youtube.com/results?search_query=funny
- Go to your Facebook timeline and find something positive that has happened to a friend. Leave a happy comment for them! http://www.facebook.com
- Go to your Facebook timeline and look for 3 things you are thankful for! https://www.facebook.com/
- Listen to some music that matches your mood! https://www.mixcloud.com/discover/
- If you are feeling good, spread the joy by calling a friend and passing along your positive energy!
- Now would be a great time to go for a short walk and leverage that energy!
- If no one is looking, jump for joy! If there are people around, do a happy dance in your mind.
- Memorize one of these jokes and share it with a friend. http://www.laughfactory.com/jokes/clean-jokes
- Read a good news story from around the world and share it with someone. http://www.goodnewsnetwork.org/news/world
A.1.3 Bottom Left quadrant interventions
- Everyone has something they do really well… find an example in your favorite social media timeline that showcases one of your strengths.
- Everyone has something they do really well… find an example on your Facebook timeline that showcases one of your strengths. http://www.facebook.com/me
- Think back to a time when you did something really well in work or in school. Try to remember what you were wearing and who was there. Let that feeling of pride wash over you.
- Look at cute things! https://www.bing.com/images/search?q=cute+things&qpvt=cute+things&qpvt=cute+things&qpvt=cute+things&FORM=IGRE
- Revise or recall your resume and remember how you survived a difficult work moment.
- Watch this beautiful scene and imagine you are there. Really let the visualization wash over you. http://www.moodica.com
- Listen carefully with your eyes closed to a new song in a genre you like. https://www.mixcloud.com/discover
- Look at cute things! http://cuteoverload.com
- Read at least 5 positive affirmations to yourself and think about how they can help right now. https://www.bing.com/images/search?q=positive+affirmation+images&qpvt=positive+affirmation+images&qpvt=positive+affirmation+images&qpvt=positive+affirmation+images&FORM=IGRE
- Watch this funny video for a mini-break. https://www.bing.com/videos/search?q=funny+videos&qpvt=funny+videos
- Affirmations always make us feel better. Check some of these out and share them with some friends. https://www.google.com/search?q=positive+affirmations&tbm=isch
- Share one of these with friends after viewing it. http://www.inspirationalstories.eu
- Scroll through some of these funny baby photos and pick your favorite one. https://www.bing.com/images/search?q=funny+baby+pictures&qpvt=funny+baby+pictures&qpvt=funny+baby+pictures&qpvt=funny+baby+pictures&FORM=IGRE
- Think about taking a vacation. Where would you like to go? Search for the location and find pictures of it. http://www.google.com
- Enjoy one of these funny jokes … http://www.jokesclean.com
- Smile… fake it till you make it!
A.1.4 Bottom Right quadrant interventions
- Use a random generator from 1 to your current age and try to remember a good simple memory when you were that age: http://www.random.org
- Learn about active constructive responding and practice with one person http://www.youtube.com/results?search_query=active+constructive+responding
- Celebrate with others! Write a positive comment on a friend’s good post on Facebook. http://www.facebook.com
- Donate to a cause you like or admire. https://www.indiegogo.com
- Think about a hard situation a friend or family member is going through, and find the strengths that make this person strong… send him/her an email about it.
- If you could achieve one thing in the next month, what would it be? Use your personal notepad to write down a reasonably small step towards that goal. http://www.rapidtables.com/tools/notepad.htm
- Write down a very simple goal you want to accomplish this week on a Post-it note and place it where you can see it every day.
- Write to a friend asking for ideas on how to do something you want or need to accomplish.
- Think about a hard situation a friend or family member is going through, and find some alternative solutions online. Send them to that person. http://www.wikihow.com
- Find two different opinions about a topic and share them with a friend.
- Make the familiar new again. Pick one picture of a mundane object and observe it mindfully for a couple of minutes. https://www.bing.com/images
- Remember a beautiful moment in your recent past, close your eyes and remember each detail.
- Time for a quick stretch! Try some of these for a few minutes… http://www.bing.com/images/?q=office+stretch
- Walk to a friend’s office and have a quick chat until the time runs out. http://e.ggtimer.com/m/3+minutes
- Ask friends to do these… https://www.google.com/search?q=happy+smile&source=lnms&tbm=isch
- Go grab a coffee or take a short walk with someone.
a.2 User preference questionnaire
- Please select how much you agree or disagree with the statements below.
[Strongly agree (7), Agree (6), Somewhat agree (5), Neither agree nor disagree (4), Somewhat disagree (3), Disagree (2), Strongly disagree (1)]
The agent is likable.
The agent is intelligent.
I would like to continue interacting with the agent.
The agent’s “tone” was appropriate.
I have become more aware of my daily emotions.
Notifications from the application were too frequent.
- Do you have any comments or feedback?
- Was there any significant event during the first week of this study that you feel is affecting your mood or stress level?
References
- [1] Adrian Aguilera, Emma Bruehlman-Senecal, Orianna Demasi, and Patricia Avila. Automated text messaging as an adjunct to cognitive behavioral therapy for depression: A clinical trial. Journal of medical Internet research, 19(5), 2017.
- [2] Gerald Bauer and Paul Lukowicz. Can smartphones detect stress-related changes in the behaviour of individuals? In Pervasive Computing and Communications Workshops (PERCOM Workshops), 2012 IEEE International Conference on, pages 423–426. IEEE, 2012.
- [3] T Bickmore and R Picard. Subtle expressivity by relational agents. In Proceedings of the CHI 2003 Workshop on Subtle Expressivity for Characters and Robots, 2003.
- [4] Timothy Bickmore and Justine Cassell. Relational agents: a model and implementation of building user trust. In Proceedings of the SIGCHI conference on Human factors in computing systems, pages 396–403. ACM, 2001.
- [5] Andrey Bogomolov, Bruno Lepri, Michela Ferron, Fabio Pianesi, and Alex Sandy Pentland. Daily stress recognition from mobile phone data, weather conditions and individual traits. In Proceedings of the 22nd ACM international conference on Multimedia, pages 477–486. ACM, 2014.
- [6] Andrey Bogomolov, Bruno Lepri, and Fabio Pianesi. Happiness recognition from mobile phone data. In Social Computing (SocialCom), 2013 International Conference on, pages 790–795. IEEE, 2013.
- [7] Stéphanie Buisine and Jean-Claude Martin. The influence of user’s personality and gender on the processing of virtual agents’ multimodal behavior. Advances in Psychology Research, 65:1–14, 2010.
- [8] Rafael A Calvo and Dorian Peters. Positive computing: technology for wellbeing and human potential. MIT Press, 2014.
- [9] M Deady, I Choi, RA Calvo, N Glozier, H Christensen, and SB Harvey. ehealth interventions for the prevention of depression and anxiety in the general population: a systematic review and meta-analysis. BMC Psychiatry, 17(1):310, 2017.
- [10] David DeVault, Ron Artstein, Grace Benn, Teresa Dey, Ed Fast, Alesia Gainer, Kallirroi Georgila, Jon Gratch, Arno Hartholt, Margaux Lhommet, et al. Simsensei kiosk: A virtual human interviewer for healthcare decision support. In Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems, pages 1061–1068. International Foundation for Autonomous Agents and Multiagent Systems, 2014.
- [11] John M Digman. Personality structure: Emergence of the five-factor model. Annual review of psychology, 41(1):417–440, 1990.
- [12] Sidney D’Mello, Rosalind W Picard, and Arthur Graesser. Toward an affect-sensitive autotutor. IEEE Intelligent Systems, 22(4), 2007.
- [13] Wen Dong, Bruno Lepri, and Alex Sandy Pentland. Modeling the co-evolution of behaviors and social relationships using mobile phone data. In Proceedings of the 10th International Conference on Mobile and Ubiquitous Multimedia, pages 134–143. ACM, 2011.
- [14] Nathan Eagle and Alex Pentland. Reality mining: sensing complex social systems. Personal and ubiquitous computing, 10(4):255–268, 2006.
- [15] Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan, and Sune Lehmann. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. arXiv preprint arXiv:1708.00524, 2017.
- [16] Kathleen Kara Fitzpatrick, Alison Darcy, and Molly Vierhile. Delivering cognitive behavior therapy to young adults with symptoms of depression and anxiety using a fully automated conversational agent (woebot): A randomized controlled trial. JMIR Mental Health, 4(2):e19, 2017.
- [17] Asma Ghandeharioun, Asaph Azaria, Sara Taylor, Pattie Maes, and Rosalind Picard. Promoting kindness and gratitude with a smartphone and triggers. Annals of Behavioral Medicine, 50(Supplement 1):266, 2016.
- [18] Asma Ghandeharioun, Asaph Azaria, Sara Taylor, and Rosalind W Picard. “Kind and grateful”: a context-sensitive smartphone app utilizing inspirational content to promote gratitude. Psychology of well-being, 6(1):1–21, 2016.
- [19] Asma Ghandeharioun, Szymon Fedor, Lisa Sangermano, Dawn Ionescu, Jonathan Alpert, Chelsea Dale, David Sontag, and Rosalind Picard. Objective assessment of depressive symptoms with machine learning and wearable sensors data. In Affective computing and intelligent interaction (ACII), 2017 international conference on. IEEE, 2017.
- [20] Asma Ghandeharioun and Rosalind Picard. Brightbeat: Effortlessly influencing breathing for cultivating calmness and focus. In Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems, pages 1624–1631. ACM, 2017.
- [21] Jonathan Gratch, Ning Wang, Jillian Gerten, Edward Fast, and Robin Duffy. Creating rapport with virtual agents. In International Workshop on Intelligent Virtual Agents, pages 125–138. Springer, 2007.
- [22] Simon Hoermann, Kathryn L McCabe, David N Milne, and Rafael A Calvo. Application of synchronous text-based dialogue systems in mental health interventions: Systematic review. Journal of Medical Internet Research, 19(8):e267, 2017.
- [23] Ellen Isaacs, Artie Konrad, Alan Walendowski, Thomas Lennig, Victoria Hollis, and Steve Whittaker. Echoes from the past: how technology mediated reflection improves well-being. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 1071–1080. ACM, 2013.
- [24] Natasha Jaques, Sara Taylor, Asaph Azaria, Asma Ghandeharioun, Akane Sano, and Rosalind Picard. Predicting students’ happiness from physiology, phone, mobility, and behavioral data. In Affective computing and intelligent interaction (ACII), 2015 international conference on, pages 222–228. IEEE, 2015.
- [25] Sooyeon Jeong and Cynthia Lynn Breazeal. Improving smartphone users’ affect and wellbeing with personalized positive psychology interventions. In Proceedings of the Fourth International Conference on Human Agent Interaction, pages 131–137. ACM, 2016.
- [26] Jonathan Klein, Youngme Moon, and Rosalind W Picard. This computer responds to user frustration: Theory, design, and results. Interacting with computers, 14(2):119–140, 2002.
- [27] Rafal Kocielnik and Gary Hsieh. Send me a different message: Utilizing cognitive space to create engaging message triggers. In CSCW, pages 2193–2207, 2017.
- [28] Robert LiKamWa, Yunxin Liu, Nicholas D Lane, and Lin Zhong. Can your smartphone infer your mood? In PhoneSense workshop, pages 1–5, 2011.
- [29] Robert LiKamWa, Yunxin Liu, Nicholas D Lane, and Lin Zhong. Moodscope: Building a mood sensor from smartphone usage patterns. In Proceeding of the 11th annual international conference on Mobile systems, applications, and services, pages 389–402. ACM, 2013.
- [30] Peter F Lovibond and Sydney H Lovibond. The structure of negative emotional states: Comparison of the depression anxiety stress scales (dass) with the beck depression and anxiety inventories. Behaviour research and therapy, 33(3):335–343, 1995.
- [31] Gale M Lucas, Jonathan Gratch, Aisha King, and Louis-Philippe Morency. It’s only a computer: Virtual humans increase willingness to disclose. Computers in Human Behavior, 37:94–100, 2014.
- [32] Daniel McDuff, Amy Karlson, Ashish Kapoor, Asta Roseway, and Mary Czerwinski. Affectaura: an intelligent system for emotional memory. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 849–858. ACM, 2012.
- [33] Abhinav Mehrotra, Veljko Pejovic, Jo Vermeulen, Robert Hendley, and Mirco Musolesi. My phone and me: understanding people’s receptivity to mobile notifications. In Proceedings of the 2016 CHI conference on human factors in computing systems, pages 1021–1032. ACM, 2016.
- [34] Adam Miner, Amanda Chow, Sarah Adler, Ilia Zaitsev, Paul Tero, Alison Darcy, and Andreas Paepcke. Conversational agents and mental health: Theory-informed assessment of language and affect. In Proceedings of the Fourth International Conference on Human Agent Interaction, pages 123–130. ACM, 2016.
- [35] David C Mohr, Mi Zhang, and Stephen M Schueller. Personal sensing: Understanding mental health using ubiquitous sensors and machine learning. Annual Review of Clinical Psychology, 13:23–47, 2017.
- [36] Robert R Morris, Stephen M Schueller, and Rosalind W Picard. Efficacy of a web-based, crowdsourced peer-to-peer cognitive reappraisal platform for depression: Randomized controlled trial. Journal of medical Internet research, 17(3), 2015.
- [37] Robert Randall Morris. Crowdsourcing mental health and emotional well-being. PhD thesis, Massachusetts Institute of Technology, 2015.
- [38] Sai T Moturu, Inas Khayal, Nadav Aharony, Wei Pan, and Alex Pentland. Using social sensing to understand the links between sleep, mood, and sociability. In Privacy, Security, Risk and Trust (PASSAT) and 2011 IEEE Third International Conference on Social Computing (SocialCom), 2011 IEEE Third International Conference on, pages 208–214. IEEE, 2011.
- [39] Frederick Muench and Amit Baumel. More than a text message: Dismantling digital triggers to curate behavior change in patient-centered health interventions. Journal of Medical Internet Research, 19(5):e147, 2017.
- [40] Frederick Muench, Katherine van Stolk-Cooke, Alexis Kuerbis, Gertraud Stadler, Amit Baumel, Sijing Shao, James R McKay, and Jon Morgenstern. A randomized controlled pilot trial of different mobile messaging interventions for problem drinking compared to weekly drink tracking. PloS one, 12(2):e0167900, 2017.
- [41] Frederick Muench, Katherine van Stolk-Cooke, Jon Morgenstern, Alexis N Kuerbis, and Kendra Markle. Understanding messaging preferences to inform development of mobile goal-directed behavioral interventions. Journal of medical Internet research, 16(2), 2014.
- [42] Clifford Nass and Kwan Min Lee. Does computer-generated speech manifest personality? an experimental test of similarity-attraction. In Proceedings of the SIGCHI conference on Human Factors in Computing Systems, pages 329–336. ACM, 2000.
- [43] Pablo Paredes, Ran Gilad-Bachrach, Mary Czerwinski, Asta Roseway, Kael Rowan, and Javier Hernandez. Poptherapy: Coping with stress through pop-culture. In Proceedings of the 8th International Conference on Pervasive Computing Technologies for Healthcare, pages 109–117. ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), 2014.
- [44] Rosalind W Picard. Affective computing, volume 252. MIT press Cambridge, 1997.
- [45] Mashfiqui Rabbi, Angela Pfammatter, Mi Zhang, Bonnie Spring, and Tanzeem Choudhury. Automated personalized feedback for physical activity and dietary behavior change with mobile phones: a randomized controlled trial on adults. JMIR mHealth and uHealth, 3(2), 2015.
- [46] Byron Reeves and Clifford Ivar Nass. The media equation: How people treat computers, television, and new media like real people and places. Cambridge university press, 1996.
- [47] Lazlo Ring, Timothy Bickmore, and Paola Pedrelli. An affectively aware virtual therapist for depression counseling. In ACM SIGCHI Conference on Human Factors in Computing Systems (CHI) workshop on Computing and Mental Health, 2016.
- [48] Kael Rowan. Studyportal api. http://studyservice.cloudapp.net/docs/, 2013.
- [49] James A Russell. A circumplex model of affect. Journal of Personality and Social Psychology, 39(6):1161–1178, 1980.
- [50] Sohrab Saeb, Mi Zhang, Christopher J Karr, Stephen M Schueller, Marya E Corden, Konrad P Kording, and David C Mohr. Mobile phone sensor correlates of depressive symptom severity in daily-life behavior: an exploratory study. Journal of medical Internet research, 17(7), 2015.
- [51] Akane Sano, Paul Johns, and Mary Czerwinski. Healthaware: An advice system for stress, sleep, diet and exercise. In Affective Computing and Intelligent Interaction (ACII), 2015 International Conference on, pages 546–552. IEEE, 2015.
- [52] Akane Sano, Paul Johns, and Mary Czerwinski. Designing opportune stress intervention delivery timing using multi-modal data. In Affective computing and intelligent interaction (ACII), 2017 international conference on. IEEE, 2017.
- [53] Akane Sano and Rosalind W Picard. Stress recognition using wearable sensors and mobile phones. In Affective Computing and Intelligent Interaction (ACII), 2013 Humaine Association Conference on, pages 671–676. IEEE, 2013.
- [54] Stephen M Schueller, Adrian Aguilera, and David C Mohr. Ecological momentary interventions for depression and anxiety. Depression and anxiety, 34(6):540–545, 2017.
- [55] Lee Taber and Steve Whittaker. Personality depends on the medium: differences in self-perception on snapchat, facebook and offline. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, page 607. ACM, 2018.
- [56] David Watson, Lee A Clark, and Auke Tellegen. Development and validation of brief measures of positive and negative affect: the panas scales. Journal of personality and social psychology, 54(6):1063, 1988.