Having a Bad Day? Detecting the Impact of Atypical Life Events Using Wearable Sensors

by Keith Burghardt, et al.

Life events can dramatically affect our psychological state and work performance. Stress, for example, has been linked to professional dissatisfaction, increased anxiety, and workplace burnout. We explore the impact of positive and negative life events on a number of psychological constructs through a multi-month longitudinal study of hospital and aerospace workers. Through causal inference, we demonstrate that positive life events increase positive affect, while negative events increase stress, anxiety and negative affect. While most events have a transient effect on psychological states, major negative events, like illness or attending a funeral, can reduce positive affect for multiple days. Next, we assess whether these events can be detected through wearable sensors, which can cheaply and unobtrusively monitor health-related factors. We show that these sensors paired with embedding-based learning models can be used “in the wild” to capture atypical life events in hundreds of workers across both datasets. Overall our results suggest that automated interventions based on physiological sensing may be feasible to help workers regulate the negative effects of life events.




1 Introduction

As organizations prepare their workforce for changing job demands, worker wellness has emerged as an important focus. Organizations see worker wellness as being central to their mission to develop a healthy and productive workforce while also maintaining optimal job performance. These goals are especially important in high-stakes jobs, such as healthcare providers working at hospitals, where job-related stress often leads to burnout and poor performance [1, 2, 3], and is one of the most costly modifiable health issues at the workplace [4]. An additional challenge faced by workers is balancing demanding jobs with equally stressful events in their personal life. Adverse events—such as attending a funeral, the death of a pet, or illness of a family member—may amplify worker stress, and potentially harm job performance. On the other hand, positive life events—such as getting a raise, getting engaged, or taking a vacation—may decrease stress and improve well-being. The ability to detect such atypical life events in a workforce can help organizations better balance tasks to reduce stress, burnout, and absenteeism and improve job performance.

Until recently, detecting such life events automatically, in real time and at scale, would have been unthinkable. However, recent advances in sensing technologies have made wearable sensors more accurate and widely available, offering opportunities for unobtrusive and continuous acquisition of diverse physiological states.

Sensor-generated data, such as heart rate and physical activity, allow for real-time, quantitative assessment of an individual’s health [5] and psychological well-being [6, 7, 8, 9]. Sensor data could also provide insights into atypical life events that individual workers experience and that could affect their psychological well-being and job performance. However, the connection between atypical life events, individual well-being, and quantitative measurements from sensor data has not been demonstrated for such dynamic environments, especially in real-world scenarios.

In this paper, we report results of large longitudinal studies of hospital and aerospace industry workers who wore sensors and reported ecological momentary assessments (EMAs) over the course of several months. Workers also reported whether they had experienced an atypical event. The data allows us to use difference-in-difference analysis, a type of causal inference method [10], to measure the effect of atypical events (either positive or negative life events) on individual psychological states and well-being. We find that negative life events increase self-reported stress, anxiety, and negative affect by 10-20% or more, while decreasing positive affect over multiple days. Positive life events, meanwhile, have little effect on stress, anxiety, and negative affect, but boost positive affect on the day of the event. Negative atypical events have a greater impact on workers’ psychological states than positive events, in line with previous findings [11].

In addition to measuring the effects of atypical events, we show that it is possible to detect these events from a non-invasive wristband sensor. We discover that, although changes in individual psychological constructs are difficult to detect, atypical events are amenable to detection because they jointly affect several constructs. We propose a method that learns a representation of multi-modal physiological signals from sensors by embedding them in a lower-dimensional space. The embedding provides features for classifying when atypical events occur. Detection improves on baseline F1 scores by up to a factor of nine and achieves ROC-AUC between 0.60 and 0.66.

Physiological data from wearable sensors allows for studying individual response to atypical life events in the wild, creating opportunities for testing psychological theory about affect and experience. In addition, sensor data opens the possibility of passive monitoring to detect when individuals have stressful or negative experiences. While our initial results leave room for improving the models, the ability to detect such experiences can help organizations improve the health and well-being of their workforce and reduce the detrimental effects of such experiences on vulnerable populations.

2 Related Work

In this paper, we explore the effect of acute positive and negative events on human behavior, and how to detect these events with wearable sensors.

We find that negative events increase stress, anxiety, and negative affect over the course of one to two days. Acute stress, in which stress increases over short periods [12], can increase cardiovascular risk and depression [13], and can negatively impact job performance [1, 2, 3].

Increased anxiety is associated with reduction in fertility [14], while negative affect is associated with higher sensitivity to pain [15]. In this paper, we find that positive events increase positive affect. Higher positive affect is associated with broadened attention and improved creative problem solving [16, 17], and preferring future utility over present [18], although high levels may be associated with aversion to change [17].

There exists extensive research on how sensors can be used to detect patterns and changes in human behavior [19, 8], including psychological constructs such as stress, anxiety, and affect (cf. the literature review of wearable sensors [20]). For example, they can detect if workers [21] or students [22] are stressed, even at a minute-by-minute level (cf. the literature cited in [23]). Recent research has also explored detecting the degree to which a subject is stressed at shorter [9, 6, 23, 24, 25, 26] and longer [27, 28] timescales. Papers on stress typically induce stress externally [6, 21, 29, 30], but there are also papers on detecting natural stresses [23, 9, 27, 28]. While most related works have explored stress detection, there is some literature on detecting bio-markers associated with other psychological constructs. This includes anxiety [31], positive and negative affect [32, 33], and depression [34]. In addition, recent literature has explored predicting (instead of detecting) multiple constructs using multi-task learning [35]. Notably, however, researchers needed access to data on social interactions, exercise, drug use, and sensors of several modalities, which may be unavailable in many situations. Finally, detecting acute positive and negative events is similar to research on using sensors for anomaly detection [20]. In contrast to previous literature, however, we detect events that affect psychological constructs rather than physiological constructs such as heart rate or sleep. In order to detect bio-marker patterns, sensors used in previous research measure a number of modalities including phone usage [22], skin conductance [6, 36, 21, 9], heart rate [6, 30, 24, 21, 9, 7], or breathing rate [6] features.

Past work has suffered from two significant limitations. First, research has either focused on short time intervals (up to two weeks) and very small sample sizes (on the order of tens of subjects) [9, 6, 23, 24, 25, 26], or collected data sporadically (once every several months) [28, 27]. Second, previous literature has typically detected very short-term stresses (e.g., stresses that affect people at the minute level [9, 6, 23, 24, 25, 26]) rather than individual stressful events that impact someone over the longer term, such as funerals. Our work differs from these previous studies through continuous evaluation over several weeks of hundreds of subjects, allowing us to robustly uncover effects in diverse populations. Moreover, we uncover patterns associated with unusually good or bad events that can affect multiple psychological constructs over multiple days.

3 Data

The data used in this paper comes from two studies aimed at understanding the relationship between individual variables, job performance, and wellness [37], which were part of the IARPA MOSAIC program. The study protocol was reviewed by the USC Institutional Review Board (HS-17-00876 - TILES). Although the studies were conducted at different locations and recruited different populations, they had a similar longitudinal design and collected similar data. The hospital workforce data was collected during a 10-week-long study that recruited 212 hospital workers. Participants were enrolled over the course of three “study waves,” each with different start dates (03/05/18, 04/09/18 and 05/05/18 for waves 1, 2 and 3 respectively). The aerospace workforce data was collected from 264 subjects from 01/08/18 to 04/06/18.

In both datasets, subjects’ bio-behavioral data was captured via wearable devices. The studies also administered daily surveys to collect self-assessments of individual participant stress, sleep, job performance, organizational behavior, and other personality constructs. The same survey questions were asked in both studies. We focus on positive affect, negative affect, anxiety and stress, which we discuss in greater detail in the psychological construct section.

In this paper, we use data collected from Fitbit wristbands. Although other sensor data was collected during each study, including location data and audio or environmental features, we focus on this modality since it was common to both studies, and is the only sensor we have access to in the aerospace dataset. The Fitbit wristband captures dynamic heart rate and step count. It also offers a summary report of duration and quality of sleep for each day. Each subject contributed data voluntarily, and it was recorded at sub-minute resolution. It was then uploaded to servers, where we aggregated the data. Table I summarizes the modalities captured by the Fitbit Charge 2 sensor. For the embedding approach, we only used the signals extracted from Fitbit (heart rate and steps), but for the aggregated method we also included the static summary features.

Fitbit modality             Features
Signals (time series)       Heart rate (PPG); number of steps
Summary features (static)   Time in personalized heart rate ranges (“fat burn,” “cardio,” or “out of range”); daily minutes in bed; daily minutes asleep; daily sleep efficiency; sleep start and end time
TABLE I: Extracted features from sensors.

Study participants exhibited varying compliance rates. As a result, collected data varied in the amount (hours per day) and length (number of days) across different participants. Figure 1 shows the distribution of the data collected in these two datasets and the average ratio of atypical events for participants as a function of their compliance rate. We find in the left panel of Fig. 1 that most participants had several days of data, but a minority had only a few days of data over the entire study period. We therefore pre-processed the data as follows. We only used data from participants who had at least two days’ worth of data and one day marked as an atypical day. This brings the hospital data down to 8,155 days for 150 participants and the aerospace data to 10,057 days for 207 participants. We find in the right panel of Fig. 1 that removal of these low-compliance subjects does not appear to significantly bias the data. Instead, the frequency of atypical events is relatively independent of the compliance rate.

The amount of data available from each day also varies and depends on the amount of time the participant wore the wristband. Although most participants (90% in hospital dataset and 89% in aerospace dataset) had the wristband on for the full day (24 hours), there are instances where only five hours of data could be collected in a day.

Fig. 1: Statistics of compliance and frequency of atypical events. Left figure shows the number of days of data we have for each participant. Right figure shows the ratio of days in which there is an atypical event as a function of subject participation. Error bars are 95% confidence intervals of the mean.

3.1 Psychological Constructs

The data used for this study includes daily self-assessments of psychological states provided by subjects over the course of the study. These constructs include self-assessments of job performance (Individual Task Proficiency (ITP) [38], In-Role Behaviors [39]), Big Five personality traits (Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism [40]), alcohol [41] and tobacco use [42], sleep quality [43], stress, anxiety, positive and negative affect. Stress and anxiety were measured by responses to questions that read, “Overall, how would you rate your current level of stress?” and “Please select the response that shows how anxious you feel at the moment” respectively and have a range of 1–5. Positive and negative affect were measured based on 10 questions from [44] (five questions for measuring positive affect and five for measuring negative affect) and have a range of 5–25. We focus on positive and negative affect, stress, and anxiety in this study because these were found to consistently change during an atypical event.

3.2 Atypical event classification

In addition to these constructs, subjects were also asked if they had experienced, or anticipated experiencing, an atypical event: “Have any atypical events happened today or are expected to happen?”. If subjects replied yes, they had the option of adding free-form text describing the atypical event. In the hospital data, there are 8,155 days of data, of which 958 days had atypical events (11.7%). The aerospace data has 10,057 days of data, of which 1,503 were considered atypical (14.9%).

We have access to the free-form text in the hospital data, which was filled out by participants in 87% of all atypical events. Surprisingly, the severity of the event could not be easily gleaned from sentiment analysis tools, such as VADER [45] or LIWC [46], as these tools gave neutral sentiment to text samples that were clearly negative. For example, text similar to “at a funeral” is given zero sentiment in VADER. We therefore applied a protocol, using human annotators, to categorize text as major negative events (such as deaths or injuries of loved ones), minor negative events (such as being stuck in traffic), or positive events (such as promotions). Major negative events were classified as negative life events such as major medical issues and funerals, while minor negative events were daily hassles, sickness, or negative work events. Positive events were awards, promotions, weddings, and other events that were beneficial. Of all categorized atypical events, 210 (24%) were positive, 626 (71%) were minor negative events, and 39 (4.5%) were major negative events.

4 Methods

Fig. 2: Overview of the modeling framework. Sensor data collected from participants A and B (left two panels) is fed into a non-parametric HMM, which outputs a state sequence per participant (middle panel), where states are shared among participants. Output from the HMM is used to learn embeddings for each day of each participant (right panel). The daily embedding (colored circles) and the average embedding for each participant (hashed circles) are used as features to detect an atypical day.

4.1 Causal Inference Method

The text descriptions of many atypical events in the hospital data mention sudden and unexpected events, such as an injured family member or unusually heavy traffic. We can therefore conjecture that atypical events create an as-if random assignment of any given subject over time. This is not always true, as in the case of subjects who report being on vacation multiple days, or are at different stages of burying a loved one. These are, however, relatively rare instances, with sequential events occurring in less than 15% of atypical events in either dataset and exclusion of this data does not significantly affect results. To determine the effect of atypical events on subjects, we use a difference-in-difference approach to causal inference. Specifically, we look at all subjects who report an atypical event and then look at a subset who report stress, anxiety, negative affect, or positive affect the prior day. This is usually the majority of all events (83%). We finally take the difference in their self-reported constructs from the day before the event. If subjects report construct values after the event (which is usually the case) we report the difference between these values and the day prior to an atypical event. We contrast these measurements with a null model, in which we find subjects who did not report an atypical event on the same days that other subjects reported an atypical event, and find the change in their construct values from the prior day. This null model shows very little change in constructs over consecutive days, in agreement with expectation. The difference between construct values associated with the event and the null model is the average treatment effect (ATE).
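The comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the record layout (subject, integer day index, construct value, atypical flag) and the function name `average_treatment_effect` are hypothetical, and only the day-of-event change is computed.

```python
from statistics import mean


def average_treatment_effect(records):
    """Difference-in-difference estimate for a single construct.

    records: iterable of (subject, day, value, atypical) tuples, where
    `day` is an integer day index and `atypical` is a bool.
    (Hypothetical layout chosen for this sketch.)
    Raises statistics.StatisticsError if either group is empty.
    """
    by_key = {(s, d): (v, a) for s, d, v, a in records}
    # Calendar days on which at least one subject reported an event
    event_days = {d for (_, d), (_, a) in by_key.items() if a}

    treated, null = [], []
    for (s, d), (v, a) in by_key.items():
        prior = by_key.get((s, d - 1))
        if prior is None:
            continue  # no self-report on the prior day
        change = v - prior[0]
        if a:
            treated.append(change)
        elif d in event_days:
            # Null model: same calendar day, but this subject had no event
            null.append(change)
    return mean(treated) - mean(null)
```

Restricting the null group to days on which other subjects reported an event is what removes day-specific shifts shared by all subjects, such as a holiday or a study-wide change in routine.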

4.2 Representation Learning

We detect atypical events by embedding individuals’ physiological time series data into a vector space, using the framework proposed in [47]. We then train models to identify where in this space atypical events happen unexpectedly often. Namely, the time series is modeled as a hidden Markov model, where each state corresponds to an automatically inferred activity (e.g., exercising, working, or resting). The model effectively distinguishes activities people do during atypical days from activities during “normal” days.

In more detail, each subject’s day of physiological data is interpreted as a multivariate time series, as described in Fig. 2, left two panels. The time series are transformed into sequences of hidden Markov states using a Beta Process Auto Regressive HMM (BP-AR-HMM) [48] (Fig. 2, center panel). Unlike classical hidden Markov models, BP-AR-HMM is flexible by allowing the number of hidden states to be inferred from the data. Based on these datasets the model found 73 states in the hospital data, and 130 states in the aerospace dataset, i.e., we find 73–130 “activities” that subjects perform, although they may only do a small fraction of these activities in a day. In addition, these states are shared among all subjects, rather than specific to one subject. This makes it feasible to embed data across different subjects and across different days. After the states are learned, we calculate the stationary distribution of time spent in each state to embed the daily data into the activity space (Fig. 2, right panel). This can be easily calculated from the HMM transition matrix by finding the eigenvector corresponding to the largest eigenvalue of the matrix.
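As an illustration of the embedding step, the sketch below computes a stationary distribution from a transition matrix. It assumes a row-stochastic matrix `P` where `P[i][j]` is the probability of moving from state `i` to state `j`, in which case the largest eigenvalue is 1 and the stationary distribution is the corresponding left eigenvector. Power iteration is used here instead of an explicit eigendecomposition to keep the example dependency-free; it converges for irreducible, aperiodic chains. The function name is hypothetical.

```python
def stationary_distribution(P, tol=1e-12, max_iter=10_000):
    """Approximate pi such that pi = pi P for a row-stochastic matrix P."""
    n = len(P)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        # One step of pi <- pi P (vector times matrix)
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi  # best estimate if the tolerance was not reached


# Two hypothetical states, e.g. "resting" and "active"
P = [[0.9, 0.1],
     [0.5, 0.5]]
pi = stationary_distribution(P)
```

Each entry of `pi` is the long-run fraction of time spent in a state, so a day's vector has one dimension per shared HMM state.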

5 Results

How do atypical events affect individual’s psychological states? We apply a difference-in-difference approach to measure the impact of atypical events on self-reported psychological constructs. We first look at the effect of atypical events across all our datasets, as shown in Fig. 3. Atypical events, on average, have a relatively small effect on positive affect the day of the event (difference from null ; p-value, for hospital and aerospace data, respectively). We notice a decrease in positive affect from the day of the event to the day after the event (difference; p-values for aerospace and hospital data, respectively). On the other hand, there is a substantial increase in negative affect, stress, and anxiety (p-values ), although changes are smaller in the aerospace dataset.

5.1 Causal Effect of Atypical Events

Fig. 3: Effect of atypical events among the datasets studied. (a) Positive affect, (b) negative affect, (c) stress, and (d) anxiety. Green squares show the aerospace dataset, red diamonds show the hospital dataset, and gray circles are the null models, in which we collect sequential data from subjects who do not experience an atypical event at day zero.
Fig. 4: Effect of atypical events versus severity of event. (a) Positive affect, (b) negative affect, (c) stress, and (d) anxiety. Green squares are positive events, white triangles are minor negative events, red diamonds are major negative events, and gray circles are the null models. In the null models we collect sequential data from subjects who do not experience an atypical event at day zero.

The free-text descriptions that subjects provided about atypical events they experienced (only available in the hospital data) confirm these results. Most atypical events are negative, such as a fight with a spouse, traffic, or deaths. In a minority of cases, however, subjects report positive events, such as passing a test or a promotion. For the hospital data, we categorized atypical events as positive, minor negative, or major negative events, and determined the relative effect each has on subjects, as shown in Fig. 4. We find that, as expected, positive events increase positive affect (p-value), but have no statistically significant effect on negative affect, stress, or anxiety (p-value ). Minor negative events do not substantially change positive affect on the day of the event (difference from null , p-value), and have a small effect on positive affect the day after the event (difference from null , p-value). On the other hand, they significantly increase negative affect, anxiety, and stress (p-value ). Finally, major negative events decrease positive affect both on the day of the event and on the day after the event (p-value respectively). These results point to the strong diversity in atypical events, and support the idea that “bad is stronger than good” [11]: adverse, or negative, events have a stronger effect on people than positive events, and are reported as atypical events more often.

5.2 Detecting Atypical Events

5.2.1 Classification Task

We evaluate performance of three classification tasks using sensor data: (1) detecting whether an atypical event occurred on that day; (2) detecting whether subjects experienced a good day; or (3) detecting whether subjects experienced a bad day. For (2) and (3) the classification task was “1” if subjects experienced a good or bad day, respectively, and “0” otherwise. Hence we simplify all tasks into a binary detection task. We emphasize that these last two tasks are only available for the hospital data.

We use ten-fold cross validation. We choose to split datapoints at random, but in the Limitations section, we alternatively split users into training and testing sets to approximate a cold-start scenario where, in many cases, researchers train models on one cohort of subjects and classify data from another cohort [49]. The challenge of the latter detection task is that we need to classify if a subject has a good or bad day despite not training on any previous data from that subject. Performance metrics are averaged across all held-out folds.

5.2.2 Performance Metrics

We use three performance metrics for evaluation. First, we use the area under the receiver operating characteristic curve (ROC-AUC), which quantifies how well a model trades off true positives against false positives. Random detection has an ROC-AUC of 0.5. Next, we use the F1 score, which is the harmonic mean of precision and recall. Higher F1 scores correspond to higher recall and precision of our estimates. Finally, we use precision itself as a performance metric because we want to determine the fraction of days we label as atypical (i.e., as a “good” or “bad” day) that are truly atypical. Low precision would indicate many false positives.
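For concreteness, the three metrics can be written out directly. The sketch below is a plain-Python illustration with hypothetical function names, not the evaluation code used in the study; it computes ROC-AUC through its rank interpretation, the probability that a randomly chosen positive day receives a higher score than a randomly chosen negative day, with ties counted as one half.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = atypical day)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def roc_auc(y_true, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Requires at least one positive and one negative label.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise form is quadratic in the number of days, which is fine for a sketch; a sort-based computation is the usual choice for large datasets.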

Dataset              Construct       Model   ROC-AUC  F1    Precision
Hospital workforce   Atypical Event  Random  0.50     0.12  0.12
Hospital workforce   Good Event      Random  0.50     0.03  0.03
Hospital workforce   Bad Event       Random  0.50     0.08  0.08
Aerospace workforce  Atypical Event  Random  0.50     0.15  0.15
TABLE II: Performance of atypical event detection from sensors in the hospital and aerospace workforce datasets with randomly sampled cross-validation. For all datasets, we can classify whether an event is atypical. For hospital workers, we can also classify whether an event is “good” (instead of any other type of event), or “bad.” Percentages are above baseline (e.g., if classification is no better than random, the percentage would be 0%).

5.2.3 Models

We compare detection quality for two types of models: models using features from statistics of aggregated data, and models using features based on time series embeddings.

Aggregated. We create several features based on aggregated statistics of signals and static modalities, listed in Table I. These statistics included the sum, mean, median, variance, kurtosis, and skewness of signals the day before, the day of, and the day after each day. Missing data is substituted with the mean value of that statistic in the training or testing set. Statistics before and after each day were created because some physiological features, such as mean heart rate, might change before an atypical event, and some may change after, such as sleep duration. We use Minimum Redundancy Maximum Relevance on each dataset to select the best features (23 and 26 for the aerospace and hospital data respectively). Alternative feature selection approaches using random forest feature importance produced poorer results. Typical features in the hospital data relate to sleep (for example, the top feature was tomorrow’s minutes in bed). Typical features in the aerospace dataset tend to relate to heart rate (the top feature was the number of minutes in the “fat burn” heart rate zone in the past day).
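A sketch of how such aggregated features could be built is given below. It is illustrative only: the function names, the `daily_signals` layout (day index mapped to a list of readings), and the feature-name prefixes are assumptions, the moments are population moments, and kurtosis is reported as excess kurtosis, a convention the text does not specify. Missing days are simply skipped here rather than imputed with the mean.

```python
from statistics import mean, median, pvariance


def summary_stats(signal):
    """Sum, mean, median, variance, skewness and excess kurtosis of a signal."""
    n = len(signal)
    mu = mean(signal)
    var = pvariance(signal, mu)  # population variance
    sd = var ** 0.5
    if sd == 0:
        skew = kurt = 0.0  # constant signal: higher moments set to zero
    else:
        skew = sum((x - mu) ** 3 for x in signal) / (n * sd ** 3)
        kurt = sum((x - mu) ** 4 for x in signal) / (n * sd ** 4) - 3.0
    return {"sum": sum(signal), "mean": mu, "median": median(signal),
            "variance": var, "skewness": skew, "kurtosis": kurt}


def day_features(daily_signals, day):
    """Statistics for the day before, the day of, and the day after `day`."""
    feats = {}
    for offset, label in ((-1, "prev"), (0, "same"), (1, "next")):
        signal = daily_signals.get(day + offset)
        if signal:  # skip days with no recorded data
            for name, value in summary_stats(signal).items():
                feats[f"{label}_{name}"] = value
    return feats
```

Calling `day_features` once per signal (heart rate, steps) and merging the resulting dictionaries would give one feature row per subject-day, which can then be passed to a feature selection step.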

Embedding. When creating features from the HMM embedding, we used only the signal modalities from Table I; the summary features were not used. Representations from HMMs were learned for the day of, and the day after, each day. We also include the centroid of embeddings for each person in the training data as features, to control for subject-specific differences in behavior. We did not use any additional feature selection because embedding naturally reduces the feature dimensions. Imputation is also not needed because the HMM learns states based on the amount of data available for that day.
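The role of the per-person centroid can be illustrated as follows. This is a simplified sketch with a hypothetical data layout (subject mapped to day mapped to embedding vector) and hypothetical function names; it appends only the subject's centroid to the day-of embedding and omits the day-after embedding for brevity.

```python
def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def embedding_features(train_embeddings):
    """Build one feature row per (subject, day).

    train_embeddings: {subject: {day: [time share per HMM state, ...]}}
    Each row is the daily embedding followed by that subject's centroid,
    computed from training days only.
    """
    rows = {}
    for subject, days in train_embeddings.items():
        c = centroid(list(days.values()))
        for day, vec in days.items():
            rows[(subject, day)] = list(vec) + c
    return rows
```

Including the centroid lets a classifier judge a day relative to what is usual for that person, rather than relative to the population as a whole.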

We use several candidate classification methods to detect whether a subject experiences an atypical event. For aggregate features, we compared logistic regression [51], random forest [52], support vector machines (SVMs) [53], extra trees [54], AdaBoost [55], and multi-layered perceptrons (MLPs) [56]. When training aggregate feature models, we make sure to downsample the majority class (no atypical event) such that the number of datapoints in each class is equal. Raw data, or upsampling the minority class, was found to produce worse results. Using all three performance metrics and ten-fold cross validation, we find atypical events in the hospital dataset are best modeled with random forests, while the aerospace workforce dataset is best modeled with logistic regression. In comparison, positive events are best modeled with random forests but negative events are best modeled with extra trees.
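Downsampling the majority class can be sketched as below. The function name and the fixed seed are assumptions made for reproducibility of the example; in a cross-validation setting this would be applied to the training fold only, so that the test fold keeps the original class ratio.

```python
import random


def downsample_majority(X, y, seed=0):
    """Return a class-balanced subset of (X, y) for binary labels 0/1.

    All minority-class rows are kept; an equally sized random sample of
    majority-class rows is drawn without replacement.
    """
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    keep = sorted(minority + rng.sample(majority, len(minority)))
    return [X[i] for i in keep], [y[i] for i in keep]


# Ten days, two of them atypical
X = list(range(10))
y = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
X_bal, y_bal = downsample_majority(X, y)
```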

Model hyperparameters for these models are chosen as follows. For random forest and extra trees, we used 100 trees and a max depth of 10. For AdaBoost, we let the number of estimators be 100. For all other hyperparameters, we use the defaults in the Python library sklearn 0.21.3 for Python 3.7. For MLPs, we use three dense layers where the number of nodes in each layer equals the number of features in the model. For the model with embedding features, we used SVM, the same classifier used in the original work [47]. In all cases, hyperparameters were chosen as reasonable baselines, so additional improvements in model quality could be obtained with further tuning.

5.2.4 Detection Results

We report our model results in Table II. First, we find that the HMM embedding-based model outperforms alternative models. The ROC-AUC for the HMM-based model is 0.60 for the aerospace workforce and 0.66 for the hospital workforce. Positive and negative events similarly have an ROC-AUC of 0.61-0.63. F1 and precision exceed random baselines by factors of two to nine. The seemingly low F1 and precision are due to the rarity of atypical events, especially positive events, which only happen on 3% of days, and negative events, which only happen on 8% of all days. A detection therefore represents a “warning sign” that a worker may have had a negative event that day. Overall, detecting atypical events shows promise.

6 Discussion

Our results highlight how unusual but impactful events strongly affect workers. Interestingly, however, atypical events are more often negative than positive. For example, 8% of all days among hospital workers contained negative events, while only 3% contained positive events. The relative adversity and frequency of negative events over positive events in our data agrees with previous findings that negative events are often more impactful [11]. Moreover, we find that significant events cannot be viewed as affecting a single psychological construct; they can affect multiple constructs at once. In the same way that multi-task learning can improve predictions in AI [57], we expect that atypical event detection could be useful for detecting anxiety, stress, and other psychological constructs simultaneously.

Our results also point to important future work. First, while the performance of our method does not yet allow it to be used in practice, it can be considered a significant starting point. Other sensor modalities can be added to better infer when or if an atypical event occurs. These include breathing, skin conductance, or phone usage sensors. Next, personalizing our methods to individuals has the potential to substantially improve detection performance [35]. We find, for example, that some subjects experience very few atypical events while others experience atypical events at triple the average rate. Finally, we can extend our results by analyzing how similar good or bad events affect people differently. Some subjects may be able to cope with negative events better than others.

Dataset              Construct       Model      ROC-AUC  F1    Precision
Hospital workforce   Atypical Event  Random     0.50     0.12  0.12
Hospital workforce   Good Event      Random     0.50     0.03  0.03
Hospital workforce   Bad Event       Random     0.50     0.08  0.08
Aerospace workforce  Atypical Event  Random     0.50     0.15  0.15
Aerospace workforce  Atypical Event  Embedding  0.54
TABLE III: Performance of atypical event detection from sensors in the hospital and aerospace workforce datasets with user held-out detection. For all datasets, we can classify whether an event is atypical. For hospital workers, we can also classify whether an event is “good” (instead of any other type of event), or “bad.”

7 Limitations

Our study has a number of limitations, both in the data and in the models more broadly, that offer implications for model design.

First, data were only collected once a day, and we could not determine when during the day an atypical event occurred. This made the detection problem much harder: heart rate or step count can change for many unrelated reasons, and our data do not let us isolate the specific signal that would indicate an atypical event. Next, we were limited in the modalities we had access to, and therefore in the physiological behavior we could measure. For example, stress might be more accurately measured with the help of skin monitors [6, 36, 21, 9].

Finally, our results are based on cross validation, a standard method in which datapoints are divided into training and testing splits. This is similar to previous work on detecting stress, in which training and testing were performed on the same users [6, 23, 25, 26]. It is plausible, however, that a model will be trained on one dataset and tested on another. To approximate this scenario, we instead split users, rather than days of data, into training and testing folds. We show the resulting model performance in Table III. Atypical events can be detected 91–220% above baselines based on F1 score, but results are more modest than in the Results section, with ROC-AUC falling from 0.66 to 0.58 for hospital atypical events. These results are similar to those of other recent papers, which split subjects into training and testing sets and found relatively poor model performance [9, 24]. On one hand, this means that these models will not necessarily work out of the box; they need to be personalized to users. On the other hand, once they are tuned to the cohort, the performance is respectable. Human heterogeneity therefore makes physiologically-based psychological modeling especially difficult.
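The difference between the two evaluation schemes comes down to what is held out. The sketch below is a minimal illustration and not the authors' pipeline: it assumes each day of data carries a user identifier, and the helper names (`user_held_out_folds`, `relative_improvement`) and the round-robin assignment of users to folds are hypothetical choices. It builds folds in which no user appears in both training and testing, and shows how a figure such as "91% above baseline" can be computed as a relative gain in F1.

```python
def user_held_out_folds(user_ids, n_folds=5):
    """Yield (train_idx, test_idx) pairs in which no user is on both sides.

    `user_ids[i]` is the (hypothetical) identifier of the worker who produced
    day i. Users, not days, are assigned to folds in round-robin order.
    """
    users = sorted(set(user_ids))
    fold_of_user = {user: k % n_folds for k, user in enumerate(users)}
    for fold in range(n_folds):
        test_idx = [i for i, u in enumerate(user_ids) if fold_of_user[u] == fold]
        train_idx = [i for i, u in enumerate(user_ids) if fold_of_user[u] != fold]
        yield train_idx, test_idx


def relative_improvement(score, baseline):
    """Percentage by which `score` exceeds `baseline` (100.0 means double)."""
    return 100.0 * (score - baseline) / baseline


if __name__ == "__main__":
    # Toy data: eight days of records from four workers (made up for illustration).
    days = ["w1", "w1", "w2", "w2", "w3", "w3", "w3", "w4"]
    for train_idx, test_idx in user_held_out_folds(days, n_folds=2):
        train_users = {days[i] for i in train_idx}
        test_users = {days[i] for i in test_idx}
        print("train:", sorted(train_users), "test:", sorted(test_users))

    # Toy numbers, not results from the paper: an F1 twice the baseline
    # is a 100% relative improvement.
    print(round(relative_improvement(0.2, 0.1), 6))
```

In a day-level split, by contrast, the indices would be shuffled without regard to `user_ids`, so a model could learn each worker's personal baseline from the training days and reuse it at test time.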

8 Conclusion

We find that atypical events and negative events substantially increase stress, anxiety, and negative affect. Major negative events are found to reduce positive affect over multiple days, while positive events improve positive affect that day. We also demonstrate that wearable sensors can provide important clues about whether someone is experiencing a positive or negative event. We find that atypical events can be predicted with an ROC-AUC of 0.66 with relatively little model hyperparameter tuning. This suggests that further improvements in predicting atypical events are possible. Overall, these results point to the importance and relative detectability of negative events, offering hope for remote sensing and automated interventions in the future.


The authors are grateful to the TILES team for their efforts in study design, data collection, and sharing that enabled this work. This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA Contract No. 2017-17042800005.


  • [1] P. Gray-Toft and J. G. Anderson, “Stress among hospital nursing staff: its causes and effects,” Social Science & Medicine. Part A: Medical Psychology & Medical Sociology, vol. 15, no. 5, pp. 639–647, 1981.
  • [2] U. Bashir and M. I. Ramay, “Impact of stress on employees job performance a study on banking sector of pakistan,” International Journal of Marketing Studies, vol. 2, no. 1, pp. 122–126, 2010.
  • [3] M. Jamal, “Job stress, job performance and organizational commitment in a multinational company: An empirical study in two countries,” International Journal of Business and Social Science, vol. 2, no. 20, 2011.
  • [4] R. Z. Goetzel, X. Pei, M. J. Tabrizi, R. M. Henke, N. Kowlessar, C. F. Nelson, and R. D. Metz, “Ten modifiable health risk factors are linked to more than one-fifth of employer-employee health care spending,” Health Affairs, vol. 31, no. 11, pp. 2474–2484, 2012.
  • [5] S. Aral and C. Nicolaides, “Exercise contagion in a global social network.” Nature communications, vol. 8, p. 14753, 2017.
  • [6] J. A. Healey and R. W. Picard, “Detecting stress during real-world driving tasks using physiological sensors,” IEEE Transactions on Intelligent Transportation Systems, vol. 6, no. 2, pp. 156–166, June 2005.
  • [7] K. Hovsepian, M. al’Absi, E. Ertin, T. Kamarck, M. Nakajima, and S. Kumar, “cstress: Towards a gold standard for continuous stress assessment in the mobile environment,” in Proc ACM Int Conf Ubiquitous Comput (UbiComp), 2015, pp. 493–504.
  • [8] R. Wang, F. Chen, Z. Chen, T. Li, G. Harari, S. Tignor, X. Zhou, D. Ben-Zeev, and A. T. Campbell, “Studentlife: Assessing mental health, academic performance and behavioral trends of college students using smartphones,” in Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing, ser. UbiComp ’14.   New York, NY, USA: Association for Computing Machinery, 2014, p. 3–14. [Online]. Available: https://doi.org/10.1145/2632048.2632054
  • [9] E. Smets, E. Rios Velazquez, G. Schiavone, I. Chakroun, E. D’Hondt, W. De Raedt, J. Cornelis, O. Janssens, S. Van Hoecke, S. Claes, I. Van Diest, and C. Van Hoof, “Large-scale wearable data reveal digital phenotypes for daily-life stress detection,” npj Digital Medicine, vol. 1, no. 67, 2018.
  • [10] H. R. Varian, “Causal inference in economics and marketing,” Proceedings of the National Academy of Sciences, vol. 113, no. 27, pp. 7310–7315, 2016. [Online]. Available: https://www.pnas.org/content/113/27/7310
  • [11] R. Baumeister, E. Bratslavsky, C. Finkenauer, and K. Vohs, “Bad is stronger than good,” Review of General Psychology, vol. 5, pp. 323–370, 2001.
  • [12] J. E. Dimsdale, “Psychological stress and cardiovascular disease,” Journal of the American College of Cardiology, vol. 51, no. 13, pp. 1237 – 1246, 2008. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0735109708002581
  • [13] S. Cohen, D. Janicki-Deverts, and G. E. Miller, “Psychological Stress and Disease,” JAMA, vol. 298, no. 14, pp. 1685–1687, 10 2007. [Online]. Available: https://doi.org/10.1001/jama.298.14.1685
  • [14] J. Smeenk, C. Verhaak, A. Eugster, A. van Minnen, G. Zielhuis, and D. Braat, “The effect of anxiety and depression on the outcome of in-vitro fertilization,” Human Reproduction, vol. 16, no. 7, pp. 1420–1423, 07 2001. [Online]. Available: https://doi.org/10.1093/humrep/16.7.1420
  • [15] D. Ruiz-Aranda, J. M. Salguero, and P. Fernández-Berrocal, “Emotional intelligence and acute pain: The mediating effect of negative affect,” The Journal of Pain, vol. 12, no. 11, pp. 1190 – 1196, 2011. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1526590011006626
  • [16] G. Rowe, J. B. Hirsh, and A. K. Anderson, “Positive affect increases the breadth of attentional selection,” Proceedings of the National Academy of Sciences, vol. 104, no. 1, pp. 383–388, 2007. [Online]. Available: https://www.pnas.org/content/104/1/383
  • [17] C. F. Lam, G. Spreitzer, and C. Fritz, “Too much of a good thing: Curvilinear effect of positive affect on proactive behaviors,” Journal of Organizational Behavior, vol. 35, no. 4, pp. 530–546, 2014. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/job.1906
  • [18] J. Ifcher and H. Zarghamee, “Happiness and time preference: The effect of positive affect in a random-assignment experiment,” American Economic Review, vol. 101, no. 7, pp. 3109–29, December 2011. [Online]. Available: http://www.aeaweb.org/articles?id=10.1257/aer.101.7.3109
  • [19] N. Eagle and A. (Sandy) Pentland, “Reality mining: Sensing complex social systems,” Personal Ubiquitous Comput., vol. 10, no. 4, p. 255–268, Mar. 2006. [Online]. Available: https://doi.org/10.1007/s00779-005-0046-3
  • [20] H. Banaee, M. U. Ahmed, and A. Loutfi, “Data mining for wearable sensors in health monitoring systems: a review of recent trends and challenges,” Sensors, vol. 13, no. 12, pp. 17 472–17 500, 2013.
  • [21] S. Sriramprakash, V. D. Prasanna, and O. R. Murthy, “Stress detection in working people,” Procedia Computer Science, vol. 115, pp. 359 – 366, 2017, 7th International Conference on Advances in Computing & Communications, ICACC-2017, 22-24 August 2017, Cochin, India. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S187705091731904X
  • [22] A. Sano, A. Phillips, A. Yu, A. McHill, S. Taylor, N. Jaques, C. Czeisler, E. Klerman, and R. W. Picard, “Recognizing academic performance, sleep quality, stress level, and mental health using personality traits, wearable sensors and mobile phones,” in Body Sensor Networks, Cambridge, USA, 2015.
  • [23] Y. S. Can, N. Chalabianloo, D. Ekiz, and C. Ersoy, “Continuous stress detection using wearable sensors in real life: Algorithmic programming contest case study,” Sensors, vol. 19, no. 8, p. 1849, 2019.
  • [24] M. Gjoreski, M. Luštrek, M. Gams, and H. Gjoreski, “Monitoring stress with a wrist device using context,” Journal of Biomedical Informatics, vol. 73, pp. 159 – 170, 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1532046417301855
  • [25] V. Sandulescu, S. Andrews, D. Ellis, N. Bellotto, and O. M. Mozos, “Stress detection using wearable physiological sensors,” in Artificial Computation in Biology and Medicine, J. M. Ferrández Vicente, J. R. Álvarez-Sánchez, F. de la Paz López, F. J. Toledo-Moreo, and H. Adeli, Eds.   Cham: Springer International Publishing, 2015, pp. 526–532.
  • [26] O. M. Mozos, V. Sandulescu, S. Andrews, D. Ellis, N. Bellotto, R. Dobrescu, and J. M. Ferrandez, “Stress detection using wearable physiological and sociometric sensors,” International Journal of Neural Systems, vol. 27, no. 02, p. 1650041, 2017.
  • [27] E. Guthrie, D. Black, H. Bagalkote, C. Shaw, M. Campbell, and F. Creed, “Psychological stress and burnout in medical students: a five-year prospective longitudinal study,” Journal of the Royal Society of Medicine, vol. 91, no. 5, pp. 237–243, 1998.
  • [28] D. Edwards, P. Burnard, K. Bennett, and U. Hebden, “A longitudinal study of stress and self-esteem in student nurses,” Nurse Education Today, vol. 30, no. 1, pp. 78 – 84, 2010. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0260691709001208
  • [29] A. Ghaderi, J. Frounchi, and A. Farnam, “Machine learning-based signal processing using physiological signals for stress detection,” in 2015 22nd Iranian Conference on Biomedical Engineering (ICBME), Nov 2015, pp. 93–98.
  • [30] V. Camomilla, M. Salai, I. Vassányi, and I. Kósa, “Stress detection using low cost heart rate sensors,” Journal of Healthcare Engineering, p. 136705, 2016.
  • [31] Y. Huang, J. Gong, M. Rucker, P. Chow, K. C. Fua, M. S. Gerber, B. A. Teachman, and L. E. Barnes, “Discovery of behavioral markers of social anxiety from smartphone sensor data,” in DigitalBiomarkers ’17: Proceedings of the 1st Workshop on Digital Biomarkers, 2017, pp. 9–14.
  • [32] S. Yan, H. Hosseinmardi, H. Kao, S. Narayanan, K. Lerman, and E. Ferrara, “Estimating individualized daily self-reported affect with wearable sensors,” in 2019 IEEE International Conference on Healthcare Informatics (ICHI), June 2019, pp. 1–9.
  • [33] A. Mottelson and K. Hornbæk, “An affect detection technique using mobile commodity sensors in the wild,” in Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing, ser. UbiComp ’16.   New York, NY, USA: Association for Computing Machinery, 2016, p. 781–792. [Online]. Available: https://doi.org/10.1145/2971648.2971654
  • [34] L. Canzian and M. Musolesi, “Trajectories of depression: Unobtrusive monitoring of depressive states by means of smartphone mobility traces analysis,” in Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing, ser. UbiComp ’15.   New York, NY, USA: Association for Computing Machinery, 2015, p. 1293–1304. [Online]. Available: https://doi.org/10.1145/2750858.2805845
  • [35] N. Jaques, O. O. Rudovic, S. Taylor, A. Sano, and R. Picard, “Predicting tomorrow’s mood, health, and stress level using personalized multitask learning and domain adaptation,” in Proceedings of IJCAI 2017 Workshop on Artificial Intelligence in Affective Computing, ser. Proceedings of Machine Learning Research, N. Lawrence and M. Reid, Eds., vol. 66.   PMLR, 20 Aug 2017, pp. 17–33. [Online]. Available: http://proceedings.mlr.press/v66/jaques17a.html
  • [36] M. V. Villarejo, B. G. Zapirain, and A. M. Zorrilla, “A stress sensor based on galvanic skin response (gsr) controlled by zigbee,” Sensors, vol. 12, no. 5, pp. 6075–6101, 2012.
  • [37] K. Mundnich, B. M. Booth, M. l’Hommedieu, T. Feng, B. Girault, J. L’Hommedieu, M. Wildman, S. Skaaden, A. Nadarajan, J. L. Villatte, T. H. Falk, K. Lerman, E. Ferrara, and S. Narayanan, “Tiles-2018: A longitudinal physiologic and behavioral data set of hospital workers,” arXiv preprint arXiv:2003.08474, 2020.
  • [38] M. Griffin, A. Neal, and S. Parker, “A new model of work role performance: positive behavior in uncertain and interdependent contexts,” Academy of Management Journal, vol. 50, no. 2, pp. 327–347, 2007.
  • [39] L. J. Williams and S. E. Anderson, “Job satisfaction and organizational commitment as predictors of organizational citizenship and in-role behaviors,” J. of Management, vol. 17, no. 3, pp. 601–617, 1991.
  • [40] S. D. Gosling, P. J. Rentfrow, and W. B. Swann Jr, “A very brief measure of the big-five personality domains,” Journal of Research in personality, vol. 37, no. 6, pp. 504–528, 2003.
  • [41] J. B. Saunders, O. G. Aasland, T. F. Babor, J. R. D. la Fuente, and M. Grant, “Development of the alcohol use disorders identification test (audit): WHO collaborative project on early detection of persons with harmful alcohol consumption‐ii,” Addiction, vol. 89, no. 6, 1993.
  • [42] G. T. S. S. (GTSS), “Global adult tobacco survey (gats),” Indicator Guidelines: Definition and Syntax, 2009.
  • [43] D. J. Buysse, C. F. Reynolds III, T. H. Monk, S. R. Berman, and D. J. Kupfer, “The pittsburgh sleep quality index: a new instrument for psychiatric practice and research,” Psychiatry research, vol. 28, no. 2, pp. 193–213, 1989.
  • [44] A. Mackinnon, A. F. Jorm, H. Christensen, A. E. Korten, P. A. Jacomb, and B. Rodgers, “A short form of the positive and negative affect schedule: Evaluation of factorial validity and invariance across demographic variables in a community sample,” Personality and Individual differences, vol. 27, no. 3, pp. 405–416, 1999.
  • [45] C. J. Hutto and E. Gilbert, “Vader: A parsimonious rule-based model for sentiment analysis of social media text,” in Eighth international AAAI conference on weblogs and social media, 2014.
  • [46] J. Pennebaker, R. Boyd, K. Jordan, and K. Blackburn, “The development and psychometric properties of liwc2015,” 2015.
  • [47] N. Tavabi, H. Hosseinmardi, J. L. Villatte, A. Abeliuk, S. Narayanan, E. Ferrara, and K. Lerman, “Learning behavioral representations from wearable sensors,” arXiv preprint arXiv:1911.06959, 2019.
  • [48] E. B. Fox, M. C. Hughes, E. B. Sudderth, M. I. Jordan et al., “Joint modeling of multiple time series via the beta process with application to motion capture segmentation,” The Annals of Applied Statistics, vol. 8, no. 3, pp. 1281–1313, 2014.
  • [49] A. Bogomolov, B. Lepri, M. Ferron, F. Pianesi, and A. S. Pentland, “Daily stress recognition from mobile phone data, weather conditions and individual traits,” in Proceedings of the 22nd ACM International Conference on Multimedia, ser. MM ’14.   New York, NY, USA: Association for Computing Machinery, 2014, p. 477–486. [Online]. Available: https://doi.org/10.1145/2647868.2654933
  • [50] A. Torralba and A. Oliva, “Depth estimation from image structure,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 27, no. 09, pp. 1226–1238, sep 2002.
  • [51] D. R. Cox, “The regression analysis of binary sequences,” Journal of the Royal Statistical Society. Series B (Methodological), vol. 20, no. 2, pp. 215–242, 1958. [Online]. Available: http://www.jstor.org/stable/2983890
  • [52] T. K. Ho, “Random decision forests,” in Proceedings of 3rd International Conference on Document Analysis and Recognition, vol. 1, Aug 1995, pp. 278–282.
  • [53] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, Sep 1995. [Online]. Available: https://doi.org/10.1023/A:1022627411411
  • [54] P. Geurts, D. Ernst, and L. Wehenkel, “Extremely randomized trees,” Machine Learning, vol. 63, no. 1, pp. 3–42, Apr 2006. [Online]. Available: https://doi.org/10.1007/s10994-006-6226-1
  • [55] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119 – 139, 1997. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S002200009791504X
  • [56] G. E. Hinton, “Connectionist learning procedures,” Artificial Intelligence, vol. 40, no. 1, pp. 185 – 234, 1989. [Online]. Available: http://www.sciencedirect.com/science/article/pii/0004370289900490
  • [57] S. Ruder, “An overview of multi-task learning in deep neural networks,” arXiv:1706.05098, 2017.