The Segmented iHMM: A Simple, Efficient Hierarchical Infinite HMM

02/20/2016 ∙ by Ardavan Saeedi, et al. ∙ 0

We propose the segmented iHMM (siHMM), a hierarchical infinite hidden Markov model (iHMM) that supports a simple, efficient inference scheme. The siHMM is well suited to segmentation problems, where the goal is to identify points at which a time series transitions from one relatively stable regime to a new regime. Conventional iHMMs often struggle with such problems, since they have no mechanism for distinguishing between high- and low-level dynamics. Hierarchical HMMs (HHMMs) can do better, but they require much more complex and expensive inference algorithms. The siHMM retains the simplicity and efficiency of the iHMM, but outperforms it on a variety of segmentation problems, achieving performance that matches or exceeds that of a more complicated HHMM.




1 Introduction

Figure 1: (a) A sample of a synthetic dataset with true and inferred segmentations. Top three plots: the y-axis shows 1-d observations, colors denote true or inferred hidden state, vertical lines denote true or inferred segment boundaries. Bottom plot: inferred posterior probability of a segment boundary. (b) True (top) and inferred (bottom) transition matrices.

The infinite hidden Markov model (iHMM) (Beal et al., 2001) and its variants (e.g., (Fox et al., 2008; Saeedi & Bouchard-Côté, 2011)) have been among the most successful Bayesian nonparametric models, with applications from speech recognition (Fox et al., 2011) to biology (Beal & Krishnamurthy, 2012). However, despite their success in modeling time series with complicated low-level dynamics, their application to time series with multiple timescales has been limited.

Such hierarchically structured sequences characterized by multiple timescales arise in many domains, such as natural language (Lee et al., 2013), handwriting (Lake et al., 2014), and motion recognition (Heller et al., 2009). For example, it is natural in motion recognition to model the sequence of high-level actions (such as walking to a chair, then sitting down) and steps within the actions (e.g., bending one’s knees then leaning back to sit down) at two different levels.

We will focus on the problem of segmentation, in which the goal is to identify points at which a time series transitions from one relatively stable regime to a new regime. In the motion recognition example, the segmentation problem would be to identify when a subject transitioned from one type of action (e.g., walking) to another (e.g., sitting down), without necessarily identifying what they were doing. This is one of the easiest problems in time-series modeling that involves multiple timescales, but (as we will see) it is quite hard for (i)HMMs, which have no mechanism for distinguishing between high- and low-level dynamics.

The hierarchical HMM (HHMM) is a generalization of the HMM that naturally deals with dynamics at multiple timescales (Fine et al., 1998; Murphy & Paskin, 2002). But this generality comes at a price: these models lack the simple predictive distributions and efficient inference schemes that make (i)HMMs so popular. And the available nonparametric versions of the HHMM (e.g., (Heller et al., 2009; Stepleton et al., 2009)) are complex to implement and not readily amenable to efficient inference.

In this paper, we propose the segmented iHMM (siHMM), a simple extension to the iHMM that does not suffer from the above problems and can discover segment boundaries in time series with dynamics at two timescales. Unlike the HHMM, our model does not explicitly model higher-level state; instead, it assumes dynamics that evolve according to a standard iHMM except for occasional change-point events that kick the model into a new randomly chosen hidden state, disrupting the low-level dynamics of the iHMM. Because it relies on a very simple model of high-level dynamics, inference in the siHMM has time and implementation complexity similar to that of the iHMM, and well below that of a typical HHMM. We show that this simple change-point extension is sufficient to encourage the iHMM to model time-series data characterized by multiple regimes of low-level dynamics. Although our model is limited by the depth of the hierarchy, in many practical applications of HHMMs (e.g., (Olivera et al., 2004; Nguyen et al., 2005; Xie et al., 2003)) a two-level analysis of the dynamics is sufficient.

Below, we describe two versions of the model. The first version, which we call the feature-independent model, enjoys conditional conjugacy and therefore has simple Gibbs and variational inference algorithms. The second version, which we call the feature-based model, can incorporate domain knowledge without requiring complex new machinery. We present a stochastic variational inference (SVI) scheme for the feature-based model; the derivation for the feature-independent model is similar and straightforward.

We apply the model to three different tasks: a novel task of segmenting traces of user behavior in software applications, automatic behavioral segmentation of fruit flies, and sensor data labeling. Segmenting user behavior traces is of significant importance in understanding the behavior of software application users; it can help in identifying and simplifying the complex common patterns among the users (e.g., Adar et al. (2014); Han et al. (2007); Horvitz et al. (1998)). For the fruit fly behavior segmentation, we use a dataset from Kain et al. (2013); the results of this task can be used to better understand how the nervous system generates behavior. Finally, labeling sensor data gathered in everyday life settings can be used not only to understand physical activities (e.g., Ermes et al. (2008); Pärkkä et al. (2006)), but also to detect psychological and emotional states (e.g., Picard et al. (2001); Healey & Picard (2005)). Implementing effective health and wellbeing related interventions, understanding user behavior, and designing affective interfaces are only a few applications of this task.

We empirically compare our model with two main baselines: 1) a two-level Bayesian nonparametric hierarchical HMM (HHMM) introduced by Johnson (2014) that models high-level dynamics as an infinite hidden semi-Markov model (HSMM) and sub-dynamics as an iHMM, and 2) the iHMM. In each of these tasks, we show that our model outperforms the nonparametric HHMM (despite being substantially simpler and faster) and the iHMM.

2 Model

Our model can be viewed as a generalization of an iHMM where the transition probability from each state is a mixture of two distributions: 1) a state-dependent transition probability distribution, as in an iHMM, and 2) a state-independent probability distribution. Which of these two distributions generates a hidden state at a given time depends on the hidden state and observation at the previous time step.

We say that the transitions caused by the state-independent distribution define the boundaries of a segment. The model implicitly assumes that the low-level dynamics within a segment are more structured and predictable than the higher-level dynamics that govern transitions between segments, since it can throw much more modeling power at these low-level dynamics. In motion capture data, for instance, the dynamics of a walk may be highly structured and predictable, whereas the dynamics that govern whether a user transitions from walking to standing, sitting, or running may be much less predictable.

2.1 Feature-independent model

The feature-independent model assumes the following generative process. At the first time step, we initialize the process by sampling a hidden state from a distribution over states. Given a hidden state, we generate an observation from a conditional observation distribution whose parameter corresponds to that hidden state.

Next, we sample a binary variable, which we call the segmentation variable, from a Bernoulli distribution. The parameter of this Bernoulli distribution is state-dependent and has a conjugate beta prior with two hyperparameters. The segmentation variable indicates whether a new segment begins, so the Bernoulli parameter is the probability of creating a new segment at the current time step. If no new segment begins, we sample the next state from a state-dependent distribution (as in the iHMM); otherwise, we ignore the current state and sample the next state from the state-independent distribution.

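The transition matrix has the same generative process as the iHMM: global state weights are drawn from GEM, the stick-breaking distribution, with its own concentration parameter; each row of the transition matrix is then drawn from a Dirichlet process (DP) with a second concentration parameter and the global weights as its base measure; and the observation parameter of each state is drawn from a prior distribution over observation parameters. The graphical model of the feature-independent siHMM is depicted in Fig. 2.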
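The generative process described in this subsection can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the infinite model is truncated to K states, each DP draw is replaced by a finite Dirichlet draw, the first state is assumed to come from the state-independent distribution, a segmentation value of 1 is assumed to mark a new segment, and the observations are assumed to be unit-variance Gaussians. All names (`sample_sihmm`, `pi0`, `rho`, and the default hyperparameter values) are hypothetical.

```python
import random

def stick_breaking(gamma, K, rng):
    """Truncated GEM(gamma) draw: K nonnegative weights that sum to 1."""
    weights, remaining = [], 1.0
    for _ in range(K - 1):
        v = rng.betavariate(1.0, gamma)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)  # leftover mass is assigned to the last state
    return weights

def dirichlet(concentrations, rng):
    """Dirichlet draw obtained by normalizing independent gamma variates."""
    draws = [rng.gammavariate(c, 1.0) for c in concentrations]
    total = sum(draws)
    return [d / total for d in draws]

def sample_sihmm(T, K=5, gamma=2.0, alpha=3.0, a=1.0, b=9.0, seed=0):
    rng = random.Random(seed)
    beta = stick_breaking(gamma, K, rng)
    # Finite stand-in for a DP(alpha, beta) draw (assumption of this sketch).
    conc = [max(alpha * w, 1e-8) for w in beta]
    pi = [dirichlet(conc, rng) for _ in range(K)]   # state-dependent rows
    pi0 = dirichlet(conc, rng)                      # state-independent distribution
    rho = [rng.betavariate(a, b) for _ in range(K)] # per-state boundary probability
    means = [rng.gauss(0.0, 5.0) for _ in range(K)] # hypothetical Gaussian emissions
    states = list(range(K))
    z = rng.choices(states, weights=pi0)[0]         # assumed initial distribution
    zs, ss, xs = [], [], []
    for _ in range(T):
        x = rng.gauss(means[z], 1.0)
        s = 1 if rng.random() < rho[z] else 0       # 1 = a new segment begins
        zs.append(z)
        ss.append(s)
        xs.append(x)
        row = pi0 if s == 1 else pi[z]              # ignore the current state on a boundary
        z = rng.choices(states, weights=row)[0]
    return zs, ss, xs
```

Calling `sample_sihmm(200)` returns the hidden states, segmentation variables, and observations of one synthetic sequence; raising `b` relative to `a` makes boundaries rarer and segments longer.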

Figure 2: Graphical representation of the feature-independent siHMM.

An illustration of the model applied to a synthetic dataset (explained in Sec. 5.1) is provided in Fig. 1. The model is able to approximately recover the block-diagonal structure of the true transition matrix. Even though the model does not explicitly encourage block-diagonal structure, the sparsity induced by the DP prior on the transition matrix is sufficient to encourage the model to push inter-segment dynamics into the state-independent distribution and recover the block-diagonal intra-segment dynamics.

2.2 Feature-based model

In some tasks, such as segmenting software user traces or tagging fruit fly behavior, rich domain knowledge is available for improving the model. For instance, in segmenting user traces, certain observed features may indicate the end of a segment. We modify the model so that features can be added declaratively. Although the lack of conjugacy means that deriving a Gibbs sampler is no longer straightforward, in Section 3 we derive an efficient SVI algorithm for this model.

The difference between this version of the model and the feature-independent version is in the conditional distribution of the segmentation variable (see Fig. 3 for the graphical model). Here, the parameter of the Bernoulli distribution is a sigmoid function of weighted features. It is defined in terms of a weight vector for the data-dependent features, a feature function that consists of all observation-dependent features, and a feature weight for each hidden state. To simplify the notation, we assume that the observation-dependent features only depend on the observation at a single time step.

We do not assume a prior for the feature weights; instead, we use a point estimate for them in our SVI algorithm.
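As a rough illustration of the feature-based segmentation probability, the sketch below passes a per-state weight plus a weighted sum of observation features through a sigmoid. The additive form of the score, the function name `segment_probability`, and the example weights are assumptions made for this sketch; the text only states that the Bernoulli parameter depends on a weight vector for observation-dependent features and a feature weight for each hidden state.

```python
import math

def sigmoid(u):
    """Numerically stable logistic function."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

def segment_probability(state, obs_features, state_weights, obs_weights):
    """Probability that a new segment begins, given the hidden state and
    the features of the observation at the same time step."""
    score = state_weights[state]
    score += sum(w * f for w, f in zip(obs_weights, obs_features))
    return sigmoid(score)
```

For example, with a one-hot encoding of the observed action as `obs_features`, a large positive entry in `obs_weights` lets a single action raise the boundary probability regardless of the hidden state, which is the kind of domain knowledge the feature-based model is meant to absorb.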

Figure 3: Graphical representation of the feature-based siHMM.

3 Stochastic variational inference

To keep the notation uncluttered, we assume that we have a dataset of sequences that all have the same length. For inference, we use the stochastic variational inference (SVI) algorithm (Hoffman et al., 2013) and approximate the posterior with a truncated variational distribution introduced in (Johnson & Willsky, 2014), using a distribution from the mean field family. In the language of SVI, z and s are local variables and the remaining latent variables and parameters are global variables. We maximize the marginal likelihood lower bound by using stochastic natural gradient ascent over the global factors and standard mean field updates for the local factors. At each iteration of SVI, we sample a minibatch of sequences from the dataset and update its local factors; next, given the expectation with respect to the local factors, we update the global factors by taking a step in the approximate natural gradient direction. To further simplify the notation, we assume that the minibatch is a single sequence and drop the sequence superscript. Next, we explain the variational factors for each of the variables.
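For a conjugate global factor, the natural gradient step described above reduces to a convex combination of the current variational parameter and a minibatch-based estimate, as in Hoffman et al. (2013). The sketch below shows that generic update; the function names, the parameter layout as a flat list, and the step-size schedule with hypothetical constants `tau` and `kappa` are assumptions of this example rather than details taken from the paper.

```python
def step_size(iteration, tau=1.0, kappa=0.6):
    """Decreasing step-size schedule (hypothetical constants)."""
    return (iteration + tau) ** (-kappa)

def svi_global_step(lam, prior, expected_stats, scale, rho):
    """One natural gradient step on a conjugate global factor.

    lam            current variational natural parameter
    prior          natural parameter of the prior
    expected_stats expected sufficient statistics from the minibatch
    scale          dataset size divided by minibatch size
    rho            step size in (0, 1]
    """
    return [(1.0 - rho) * l + rho * (p + scale * s)
            for l, p, s in zip(lam, prior, expected_stats)]
```

With the synthetic-data setup reported later (20 sequences and a batch of size 2), `scale` would be 10, so each minibatch statistic is treated as if it had been observed ten times.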

3.1 Variational factors

For the hidden state sequence, the “direct assignment” truncation used in (Johnson & Willsky, 2014) sets the variational probability to zero if, at any time step, the hidden state exceeds the truncation level. Since with this truncation the update to the factor for the global stick-breaking weights, conditioned on the other factors, is no longer conjugate, we use a point estimate for these weights. We adopt the same point estimate approach for the parameters of the sigmoid function.

With this truncation scheme, we can write the prior over each row of the transition matrix as a finite-dimensional distribution. We know from (Hoffman et al., 2013) that, due to conjugacy, the optimal variational factor has the same form as this prior, with its own variational parameter.

We assume that the prior over the observation parameters is in the exponential family and that it is a conjugate prior for the likelihood function. This implies that the optimal variational distribution is also in the same family, with a different natural parameter.

3.2 SVI update equations

For the variational updates, we need to take expectations with respect to each of the variational distributions. For the expectations with respect to the local factors, a modification of the standard HMM forward-backward algorithm can be used, with forward and backward messages defined jointly over the hidden state and the segmentation variable. The augmented transition matrix that we need for these messages is built from the transition matrix and an all-ones vector. Computing a message requires a number of operations that is quadratic in the truncation level for the upper half of the matrix but only linear for the lower half, because all the rows of each block in the lower half are the same. Hence, the total cost of message passing is linear in the sequence length and quadratic in the truncation level, the same order as for a standard HMM. The memory needed for the messages grows linearly with the sequence length and the truncation level.
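The saving from the repeated rows can be checked numerically. The sketch below builds an augmented transition matrix over (state, segmentation) pairs for the feature-independent model and compares a dense message update with one that exploits the structure. The index layout (pairs with segmentation value 0 first, then pairs with value 1), the variable names, and the omission of the likelihood term are assumptions made for this illustration.

```python
def augmented_matrix(pi, pi0, rho):
    """2K x 2K matrix over (z, s) pairs; index of (z, s) is s * K + z.
    Rows with s = 0 follow the state-dependent row pi[z]; rows with s = 1
    all follow the state-independent distribution pi0."""
    K = len(pi0)
    A = [[0.0] * (2 * K) for _ in range(2 * K)]
    for s in (0, 1):
        for z in range(K):
            base = pi[z] if s == 0 else pi0
            for s_next in (0, 1):
                for z_next in range(K):
                    p_s = rho[z_next] if s_next == 1 else 1.0 - rho[z_next]
                    A[s * K + z][s_next * K + z_next] = base[z_next] * p_s
    return A

def forward_step_dense(msg, A):
    """Generic vector-matrix product over all 2K augmented states."""
    n = len(msg)
    return [sum(msg[i] * A[i][j] for i in range(n)) for j in range(n)]

def forward_step_structured(msg, pi, pi0, rho):
    """Same product: quadratic work for the s = 0 rows, linear for s = 1."""
    K = len(pi0)
    boundary_mass = sum(msg[K:])  # all s = 1 rows are identical
    u = [sum(msg[z] * pi[z][z_next] for z in range(K))
         + boundary_mass * pi0[z_next]
         for z_next in range(K)]
    return ([u[z] * (1.0 - rho[z]) for z in range(K)]
            + [u[z] * rho[z] for z in range(K)])
```

In a full forward pass each updated message would also be multiplied by the observation likelihood of the next state and normalized; those steps are the same as in a standard HMM and are left out here.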

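For updating the local factors, we compute the forward-backward messages using the exponentiated expected log transition probabilities and the exponentiated expected log likelihoods under the current global factors, in place of the transition probabilities and likelihoods themselves.

The forward-backward messages yield the expectations of the sufficient statistics with respect to the local factors. Given these expected sufficient statistics and a scaling factor that accounts for the minibatch being a subsample of the dataset, we can write the update equations for the parameters of the conjugate global variational factors as natural gradient steps.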

For the remaining global factors, namely the global stick-breaking weights and the parameters of the sigmoid function, we use point estimates; hence, we only need the gradient of the lower bound with respect to them. For the stick-breaking weights, we follow the derivation in (Johnson & Willsky, 2014) and use the resulting gradient in a truncated gradient step. Note that we need to ensure that the point estimate remains valid after each update. The parameters of the sigmoid function are estimated with analogous gradient steps.

4 Related models

There are a few models similar to ours in terms of extending the iHMM to multiple timescales. The infinite hierarchical HMM (iHHMM), introduced in (Heller et al., 2009), is a nonparametric model that allows the HHMM to have potentially unbounded depth. Hence, the model can infer the number of levels in the hierarchy. In the iHHMM, the bottom level is the observed sequence and each level is a sequence of hidden variables dependent on the level above. As the authors suggested, more efficient inference algorithms are needed in order to make their model useful for practical applications.

The block-diagonal iHMM (Stepleton et al., 2009) is a generalization of iHMM that assumes a nearly block-diagonal structure on the transition matrix of the iHMM. Each block corresponds to a “sub-behavior” and the model can partition the data sequences according to these sub-behaviors. The model first partitions the infinite number of hidden states into an infinite number of blocks by using an additional stick-breaking process. Then, it increases the probability of transition between the states of a block by modifying the Dirichlet process prior over the transitions. Hence, as the block size becomes smaller, the model behavior converges to that of iHMM. For inference, as the authors explained, achieving a fast mixing rate in their proposed inference algorithm requires implementing a nontrivial bookkeeping-intensive method. In contrast, our model is much simpler and easier to implement inference for, but it can also discover transition matrices with approximately block-diagonal structure; the segmentation events provide a mechanism for transitioning from one group of connected states to another.

Another related model is a two-level Bayesian nonparametric HMM introduced in (Johnson, 2014) that models the high-level dynamics, or superstates, as an HDP-HSMM. As a generalization of the iHMM, the HDP-HSMM can model the dwell time in each state by sampling it from a state-specific duration distribution once a state is entered. For a formal definition of the HDP-HSMM, see (Johnson, 2014). Given each superstate, observations are generated according to an iHMM with superstate-specific parameters and its own substates. Compared to our model, this model is much more flexible; however, the computation of forward-backward messages is less efficient. Moreover, it requires more bookkeeping for all superstates and their substates. It also requires setting a truncation level for the superstates and for each of the iHMMs corresponding to them. We can set a large truncation level for all of them, but this means paying a huge computational cost for iHMMs that only require a few states. In contrast, we only need to set a single truncation level for the whole model. We call this model sub-iHMM and use it as a baseline since it is a flexible Bayesian nonparametric model that supports two-level dynamics.

Finally, our model can also be applied to time-series segmentation tasks at a single level. The iHMM or a variant of it with self-transition bias, the sticky HDP-HMM (Fox et al., 2011), can be used for the same purpose. The self-transition bias encourages self-transitions, so consecutive time steps tend to be assigned to the same hidden state. The model can capture segments in tasks such as speaker diarization; however, in contrast to our model, within a given state there are no dynamics. In other words, in the sticky HDP-HMM, given a state, the observations are independent of each other. This makes the model inappropriate for tasks such as user trace segmentation, because there has to be some order on substates within a segment. For instance, in a software application an action like selection needs to come before an action like move selection. To show the importance of this point, we also include a self-transition bias term in the iHMM and compare our model with the sticky HDP-HMM. (We do not include a separate column in Table 2 for the sticky HDP-HMM; instead, we find the best variational lower bound over all settings of the hyperparameters of the iHMM, including a hyperparameter for self-transition bias.)

5 Experiments

We evaluate the performance of our feature-independent and feature-based models on synthetic and real datasets. We use a synthetic dataset to illustrate the advantages of our model compared to baselines.

In order to further demonstrate the capabilities of our model, we apply it to three real tasks from three different domains: human-computer interaction, biology, and sensor data analysis. As mentioned in Section 4, two reasonable baselines for our model are the two-level Bayesian nonparametric HMM and iHMM. We report the labeling error (or normalized Hamming distance in the case of the synthetic dataset) and predictive log-likelihood for our model and these baselines. To choose among different hyperparameter settings, we use the variational lower bound (VLB) as our objective measure. We show that our model, while being simpler and efficient in terms of inference, is competitive with or outperforms these baselines. For all experiments on siHMM we try both the feature-independent and feature-based models; we report the results separately in Table 2 to show the effect of including observation features in the model. In the experiments, we only try the hidden state and the observation as the features for a given time step. However, more sophisticated features can be made from the observation(s).

5.1 Synthetic data

We generate a synthetic dataset with 5000 data points from 3 different transition matrices, each with 3 hidden states. Each row of each transition matrix is sampled from a prior modified to include a self-transition bias of 1. The observations are sampled from normal distributions with separate non-conjugate priors on their mean and variance parameters. The goal is to find the points where we change regimes and also to determine the dynamics within each segment of the sequence. At each time step, we switch the regime with probability 0.05.
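A data generator in the spirit of this setup might look like the sketch below. Only the sizes (5000 points, 3 regimes, 3 states per regime), the self-transition bias of 1, and the switch probability of 0.05 come from the text. The Dirichlet form of the row prior with base concentration `conc`, the emission priors, the uniform choice of the new regime and of the entry state, and every function and variable name are assumptions made for illustration.

```python
import random

def sticky_row(own, n, conc, bias, rng):
    """Dirichlet draw with extra concentration `bias` on the self-transition."""
    g = [rng.gammavariate(conc + (bias if j == own else 0.0), 1.0)
         for j in range(n)]
    total = sum(g)
    return [x / total for x in g]

def make_regime_switching_data(T=5000, n_regimes=3, n_states=3,
                               switch_prob=0.05, conc=1.0, bias=1.0, seed=0):
    rng = random.Random(seed)
    # One small transition matrix per regime.
    trans = [[sticky_row(k, n_states, conc, bias, rng) for k in range(n_states)]
             for _ in range(n_regimes)]
    n_total = n_regimes * n_states
    means = [rng.gauss(0.0, 5.0) for _ in range(n_total)]    # hypothetical prior
    stds = [rng.uniform(0.5, 1.5) for _ in range(n_total)]   # hypothetical prior
    regime, local = 0, 0
    xs, zs, switches = [], [], []
    for t in range(T):
        if t > 0:
            if rng.random() < switch_prob:
                # Regime change: jump to a different regime, uniform entry state.
                regime = rng.choice([r for r in range(n_regimes) if r != regime])
                local = rng.randrange(n_states)
                switches.append(t)
            else:
                local = rng.choices(range(n_states),
                                    weights=trans[regime][local])[0]
        z = regime * n_states + local  # global state index in a block layout
        zs.append(z)
        xs.append(rng.gauss(means[z], stds[z]))
    return xs, zs, switches
```

The global state index places the states of each regime in a contiguous block, so the implied overall transition matrix is block-diagonal apart from the regime switches.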

Fig. 1 shows a sample sequence and the result of running 100 passes of SVI over the whole dataset for that sequence. For running SVI, we split the dataset into 20 sequences and use a batch of size 2. We randomly sample a sequence with length 750 to calculate the predictive log-likelihood. We report the error over the dataset for the hyperparameter setting with the highest VLB.

For the hyperparameters in the feature-independent model, we fix the two hyperparameters of the beta prior; for the feature-based model, we randomly initialize the feature weights and only use the hidden state as a feature. We do a grid search over combinations of values for the remaining hyperparameters. We place a prior on the mean of the observation distributions and a separate prior on their variance, with the parameters of the variance prior also drawn from small sets of candidate values. We run SVI with 10 different seeds for 100 iterations over all these combinations. To compute the normalized Hamming distance between the inferred states and the true states, we use the Munkres algorithm (Munkres, 1957). The algorithm assigns indices to the inferred sequence so that it maximizes the overlap with the true sequence. Table 2 gives the computed distance over the dataset for the hyperparameter setting with the highest VLB. We also report the predictive log-likelihood over the held-out set.
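The label-matching step can be illustrated as follows. To keep the example dependency-free, this sketch searches over all label permutations by brute force instead of running the Munkres algorithm; both find the relabeling with maximal overlap, but the brute-force search is only practical for a handful of labels. The function name and the assumption that both sequences use integer labels in `range(n_labels)` are choices made for this example.

```python
from itertools import permutations

def normalized_hamming(true_states, inferred_states, n_labels):
    """Fraction of time steps that disagree under the best relabeling of
    the inferred states (brute-force substitute for the Munkres algorithm)."""
    T = len(true_states)
    # counts[i][j]: time steps where inferred label i coincides with true label j
    counts = [[0] * n_labels for _ in range(n_labels)]
    for t, i in zip(true_states, inferred_states):
        counts[i][t] += 1
    best_overlap = max(
        sum(counts[i][perm[i]] for i in range(n_labels))
        for perm in permutations(range(n_labels))
    )
    return 1.0 - best_overlap / T
```

For larger label sets, a polynomial-time assignment solver such as `scipy.optimize.linear_sum_assignment` applied to the same `counts` matrix would be the natural replacement for the permutation search.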

In addition to the above hyperparameters, the sub-iHMM also requires truncation levels for the superstates and the substates; we set these to 10 for the superstates and to a fixed value for the substates. For all hyperparameters shared with the siHMM, we use the same set of settings for the baselines. For the self-transition bias of the sticky HDP-HMM, we try several values.

Fig. 4 shows the histogram of the normalized Hamming distance and also the predictive log-likelihood for runs with different hyperparameter settings. In terms of the Hamming distance, the siHMM performs slightly better than the iHMM and outperforms the sub-iHMM for most settings. The same conclusion holds for the predictive log-likelihood. Furthermore, Fig. 1 shows that our model can do reasonably well in finding the regime change points and also the states within each segment of the sequence. We choose a threshold of 0.5 for the posterior segmentation probability to identify a time point as a change point; however, as shown in the last row of Fig. 1, the model is able to provide an estimate for the uncertainty over change points.
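Turning the posterior segmentation probabilities into hard change points is a one-line thresholding step. The sketch below also converts the change points into half-open segments; the strict inequality at the threshold and the convention that a boundary at index t closes the segment containing t are assumptions of this example, and the function names are hypothetical.

```python
def change_points(boundary_posterior, threshold=0.5):
    """Indices whose posterior boundary probability exceeds the threshold."""
    return [t for t, p in enumerate(boundary_posterior) if p > threshold]

def split_into_segments(boundaries, T):
    """Half-open (start, end) segments, assuming a boundary at t ends
    the segment that contains t."""
    segments, start = [], 0
    for t in boundaries:
        if t + 1 < T:
            segments.append((start, t + 1))
            start = t + 1
    segments.append((start, T))
    return segments
```

Keeping the raw probabilities alongside the thresholded change points preserves the uncertainty estimate that the bottom row of Fig. 1 visualizes.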

Table 2 shows that for the dataset which we generated from a block-diagonal transition matrix, siHMM outperforms both iHMM and sub-iHMM. This might be because the synthetic dataset is specifically generated without any state-dependent high-level dynamics. Our model, which does not assume any dynamics for the segments, performs better than the other baselines which implicitly or explicitly assume that.

Figure 4: Histogram of normalized Hamming distance and predictive log-likelihood for different settings of hyperparameters for the synthetic data set.
Table 1: Datasets used for experiments (description in text). Columns: Dataset, # Points, Held-out (%), Max. # States. Rows: Users, Sensors, Drosophila.

Table 2: Labeling error and predictive log-likelihood for various different datasets. Columns: Data Set, iHMM, sub-iHMM, siHMM (WoF), siHMM (WF); each entry gives the normalized Hamming distance or error (%) and the predictive log-likelihood. Key: iHMM = infinite HMM; sub-iHMM = a two level hierarchical infinite HMM; siHMM (WoF) = feature-independent siHMM; siHMM (WF) = feature-based siHMM; Synthetic = synthetic data segmentation and hidden state inference tasks for which we report the normalized Hamming distance instead of error rate; Users = user trace segmentation task; Sensors = labeling sensor data task; Drosophila = segmenting fruit fly behavior task.

5.2 Segmenting user behavior traces

Modeling user behavior traces is of significant importance in the human-computer interaction literature (e.g., (Adar et al., 2014; Bigham et al., 2014)). Having a good understanding of the tasks done by users can potentially help in designing better workflows in software applications. It can also help in providing better guidance to users by predicting user intentions. Log files of software applications contain user actions and their corresponding time-stamps; however, it is not clear from these log files alone how many different tasks have been done by a user in a single work session. A task consists of multiple actions and each work session consists of multiple tasks. Our two-level hierarchical model can be used for detecting the boundaries between tasks (i.e., segments). We believe that our model is a good fit for software applications in which the tasks are less predictable than the actions within each task. For instance, in photo editing software, the high-level tasks such as adding filters or removing a part of a picture usually do not have a clear order, and different users apply different orderings based on their needs. However, the actions required to add a filter (e.g., 1- choose a filter, 2- select a part of the photo, and 3- apply the filter) are typically more structured and follow an order.

We collect log files of users who follow 23 different tutorials in photo editing software. The dataset is in the form of time-stamp and action; it contains 14000 data points and 59 unique actions in total. We randomly choose a sequence of size 1400 and form a held-out set. We split the sequences into subsequences of size 1000 and apply 100 passes of SVI to the dataset with both feature-based and feature-independent models.

We consider a grid of possible hyperparameter settings, including settings for the parameter of the Dirichlet prior that we place on the parameters of the observation likelihood. We run SVI with 10 different seeds for each setting and report the result for the setting with the highest VLB in Table 2.

The labels (i.e., the tutorial numbers that the user followed) for each segment of the dataset are available; hence, we can test the siHMM on predicting the labels for each segment. This task is more involved than segmentation, as we also need to group the substates. We use a simple K-means clustering on the empirical transition matrix which is generated from counting the transitions in each inferred segment. This approach works well in practice; however, more sophisticated methods are possible for grouping the substates. Table 2 provides the prediction error (computed using the Munkres algorithm) for siHMM and the baselines. The performance of our feature-independent model is significantly better than iHMM and comparable with that of the more flexible (and also computationally intensive) sub-iHMM. Adding the observation feature to the model reduces the error to 16%. As mentioned in Section 2.2, in this dataset there are observations that can signal a change-point in the dynamics.
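The segment-grouping step can be sketched as follows: count the transitions inside each inferred segment, flatten the normalized count matrix into a feature vector, and cluster the vectors with K-means. The normalization by the total count, the deterministic initialization from the first k segments, and all names are assumptions of this sketch; it also assumes there are at least k segments.

```python
def segment_transition_features(states, segments, n_states):
    """One flattened, normalized transition-count matrix per segment.
    `segments` holds half-open (start, end) index pairs."""
    features = []
    for start, end in segments:
        counts = [[0.0] * n_states for _ in range(n_states)]
        for prev, nxt in zip(states[start:end - 1], states[start + 1:end]):
            counts[prev][nxt] += 1.0
        total = sum(map(sum, counts))
        features.append([c / total if total else 0.0
                         for row in counts for c in row])
    return features

def kmeans(points, k, iters=20):
    """Plain Lloyd iterations with the first k points as initial centers."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

The resulting cluster labels can then be matched to the tutorial labels with an assignment step before computing the prediction error.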

5.3 Segmenting fruit fly behavior

Automating scientific experiments on live animals has attracted significant attention recently (see, for instance, (Kain et al., 2013; Wiltschko et al., 2015; Crall et al., 2015; Freeman et al., 2014)). With the advent of high-throughput and more accurate devices, the need for automatic analysis of large amounts of collected data is more pressing than before. In neuroscience and biology, a large amount of behavioral data is collected from live animals in order to understand how the brain generates activity and how the underlying mechanisms have evolved (Kain et al., 2013). Typically, the first step in analyzing this data is finding and categorizing different types of behavior; this step can be done manually by experts, but it is time-consuming and sometimes error-prone.

An automatic framework has been proposed in (Kain et al., 2013) for tracking the leg movements and classifying the behavior of fruit flies. The behavior is recorded by tracking each leg of a fruit fly moving upon a track ball. The collected raw data is the x and y coordinates of 6 legs and the three rotational components of the rotating ball (i.e., a 15-dimensional vector in real time). After some post-processing and adding some higher-order features (e.g., derivatives of each of the 15 raw data vectors), they expand the dimensions to 45 and apply a KNN classifier to classify each frame as one of 12 possible behavioral labels. Our goal is to use this dataset and categorize the frames in an unsupervised way. A frequent assumption in the behavioral sciences is that a small set of stereotyped motifs describe most animal activities (Berman et al., 2014). In other words, actions within a behavioral segment (e.g., actions required for grooming) should be more structured than the behavioral segments themselves. Given its capabilities, the siHMM is a reasonable choice for this dataset.

The dataset contains 10000 data points; our held-out set is a randomly chosen subsequence with length 1500, and we apply SVI for 100 passes over both the feature-independent and feature-based models. For the observations, we use a multivariate Gaussian likelihood and a conjugate Normal/inverse-Wishart prior. We use the empirical mean and variance of the dataset as the mean and variance of the Gaussian prior. The parameters of the inverse-Wishart prior are chosen from a grid of possible combinations. We set the truncation level for the states to 20. Finally, we consider several possible settings for the hyperparameters of the transition matrix prior.

As in Section 5.2, we group the inferred substates with K-means and assign labels to the segments. The results, presented in Table 2, show that the feature-based siHMM performs on a par with the iHMM and outperforms the sub-iHMM by a relative error reduction of 17%. This suggests that adding data-driven features to the model can be important. Figure 5 shows a sample of the dataset and its segmentation by different methods.

Figure 5: A sample segmentation from the fruit fly dataset. From top to bottom: True, sub-iHMM, iHMM, siHMM.

5.4 Segmenting and classifying human behavior from sensor data

Through the emergence of pervasive computing and affordable wearable sensors, in-situ measurement of different bio-signals has become possible. This powerful source of data can be utilized for several purposes, including activity recognition and task identification. Toward this goal, an efficient algorithm for analyzing this large amount of data, which is gathered 24/7, is essential. In this section, we use the siHMM to model the data collected via the Empatica E4 wristband (Empatica, 2015), a wearable device that can collect electrodermal activity (EDA) (Boucsein, 2012), blood volume pulse (BVP), acceleration, and body temperature. EDA refers to changes in electrical properties of the skin caused by sudomotor innervation (Boucsein, 2012). EDA is an indication of physiological or psychological arousal and has been utilized to objectively measure affective phenomena and sleep quality (Sano et al., 2015).

Segmenting the sensor data can help psychophysiological activity recognition. For instance, it can help in finding stressful periods objectively in order to detect the roots of stress in a person’s lifestyle. However, manual labeling for large amounts of user data (days or months) is time-consuming and even invalid if not reported in a timely manner.

We use a dataset with 12000 time steps, collected from a single user, and model the (normalized) observations (i.e., EDA, BVP and acceleration in 3 dimensions) by a multivariate Gaussian distribution. The hyperparameter setting is similar to that of Section


Figure 6: Segments and labels for data collected from sensors. From top to bottom: True, sub-iHMM, iHMM, siHMM.

The labeling error in Table 2 shows that the feature-independent siHMM, while performing comparably to the sub-iHMM, outperforms the iHMM by a relative error reduction of 50%. Fig. 6 shows the inferred segments for the siHMM and the baselines. It seems that the single observation feature that we are using for the experiment does not help in this dataset; however, more sophisticated features may help improve the segmentation.

6 Conclusion

We proposed a new Bayesian nonparametric model, the siHMM, for modeling dynamics at two timescales in time series. Our model is a simple extension of the widely used iHMM and has an efficient inference scheme. Although our model is less flexible than other nonparametric models for hierarchical time series, we showed that it can perform reasonably well in practice. One potential application of our model is to use the inferred state-independent transition vector to summarize a sequence. For instance, in user behavior analysis, this vector may serve as a user fingerprint, and users can be grouped based on it. To better understand this feature and the behavior of our model in other applications, a more comprehensive comparison with other models would be useful.


  • Adar et al. (2014) Adar, Eytan, Dontcheva, Mira, and Laput, Gierad. CommandSpace: Modeling the relationships between tasks, descriptions and features. In Proceedings of the 27th annual ACM symposium on User interface software and technology, pp. 167–176. ACM, 2014.
  • Beal & Krishnamurthy (2012) Beal, Matthew and Krishnamurthy, Praveen. Gene expression time course clustering with countably infinite hidden Markov models. arXiv preprint arXiv:1206.6824, 2012.
  • Beal et al. (2001) Beal, Matthew J, Ghahramani, Zoubin, and Rasmussen, Carl E. The infinite hidden Markov model. In Advances in neural information processing systems, pp. 577–584, 2001.
  • Berman et al. (2014) Berman, Gordon J, Choi, Daniel M, Bialek, William, and Shaevitz, Joshua W. Mapping the stereotyped behaviour of freely moving fruit flies. Journal of The Royal Society Interface, 11(99):20140672, 2014.
  • Bigham et al. (2014) Bigham, Jeffrey P, Bernstein, Michael S, and Adar, Eytan. Human-computer interaction and collective intelligence. 2014.
  • Boucsein (2012) Boucsein, Wolfram. Electrodermal activity. Springer Science & Business Media, 2012.
  • Crall et al. (2015) Crall, James D, Gravish, Nick, Mountcastle, Andrew M, and Combes, Stacey A. BEEtag: a low-cost, image-based tracking system for the study of animal behavior and locomotion. PLoS ONE, 10(9):e0136487, 2015.
  • Empatica (2015) Empatica. E4 sensor, 2015. Accessed: 2015-12.
  • Ermes et al. (2008) Ermes, Miikka, Parkka, Juha, Mantyjarvi, Jani, and Korhonen, Ilkka. Detection of daily activities and sports with wearable sensors in controlled and uncontrolled conditions. Information Technology in Biomedicine, IEEE Transactions on, 12(1):20–26, 2008.
  • Fine et al. (1998) Fine, Shai, Singer, Yoram, and Tishby, Naftali. The hierarchical hidden Markov model: Analysis and applications. Machine learning, 32(1):41–62, 1998.
  • Fox et al. (2008) Fox, Emily B, Sudderth, Erik B, Jordan, Michael I, and Willsky, Alan S. An HDP-HMM for systems with state persistence. In Proceedings of the 25th international conference on Machine learning, pp. 312–319. ACM, 2008.
  • Fox et al. (2011) Fox, Emily B, Sudderth, Erik B, Jordan, Michael I, and Willsky, Alan S. A sticky HDP-HMM with application to speaker diarization. The Annals of Applied Statistics, pp. 1020–1056, 2011.
  • Freeman et al. (2014) Freeman, Jeremy, Vladimirov, Nikita, Kawashima, Takashi, Mu, Yu, Sofroniew, Nicholas J, Bennett, Davis V, Rosen, Joshua, Yang, Chao-Tsung, Looger, Loren L, and Ahrens, Misha B. Mapping brain activity at scale with cluster computing. Nature methods, 11(9):941–950, 2014.
  • Han et al. (2007) Han, Jiawei, Cheng, Hong, Xin, Dong, and Yan, Xifeng. Frequent pattern mining: current status and future directions. Data Mining and Knowledge Discovery, 15(1):55–86, 2007.
  • Healey & Picard (2005) Healey, Jennifer A and Picard, Rosalind W. Detecting stress during real-world driving tasks using physiological sensors. Intelligent Transportation Systems, IEEE Transactions on, 6(2):156–166, 2005.
  • Heller et al. (2009) Heller, Katherine A, Teh, Yee W, and Görür, Dilan. Infinite hierarchical hidden Markov models. In International Conference on Artificial Intelligence and Statistics, pp. 224–231, 2009.
  • Hoffman et al. (2013) Hoffman, Matthew D, Blei, David M, Wang, Chong, and Paisley, John. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
  • Horvitz et al. (1998) Horvitz, Eric, Breese, Jack, Heckerman, David, Hovel, David, and Rommelse, Koos. The Lumière project: Bayesian user modeling for inferring the goals and needs of software users. In Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence, pp. 256–265. Morgan Kaufmann Publishers Inc., 1998.
  • Johnson & Willsky (2014) Johnson, Matthew and Willsky, Alan. Stochastic variational inference for Bayesian time series models. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1854–1862, 2014.
  • Johnson (2014) Johnson, Matthew James. Bayesian time series models and scalable inference. PhD thesis, Massachusetts Institute of Technology, 2014.
  • Kain et al. (2013) Kain, Jamey, Stokes, Chris, Gaudry, Quentin, Song, Xiangzhi, Foley, James, Wilson, Rachel, and de Bivort, Benjamin. Leg-tracking and automated behavioural classification in Drosophila. Nature communications, 4:1910, 2013.
  • Lake et al. (2014) Lake, Brenden M, Lee, Chia-ying, Glass, James R, and Tenenbaum, Joshua B. One-shot learning of generative speech concepts. In Proceedings of the 36th Annual Meeting of the Cognitive Science Society, 2014.
  • Lee et al. (2013) Lee, Chia-ying, Zhang, Yu, and Glass, James R. Joint learning of phonetic units and word pronunciations for ASR. In EMNLP, pp. 182–192, 2013.
  • Munkres (1957) Munkres, James. Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics, 5(1):32–38, 1957.
  • Murphy & Paskin (2002) Murphy, Kevin P and Paskin, Mark A. Linear-time inference in hierarchical HMMs. Advances in neural information processing systems, 2:833–840, 2002.
  • Nguyen et al. (2005) Nguyen, Nam T, Phung, Dinh Q, Venkatesh, Svetha, and Bui, Hung. Learning and detecting activities from movement trajectories using the hierarchical hidden Markov model. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 2, pp. 955–960. IEEE, 2005.
  • Oliver et al. (2004) Oliver, Nuria, Garg, Ashutosh, and Horvitz, Eric. Layered representations for learning and inferring office activity from multiple sensory channels. Computer Vision and Image Understanding, 96:163–180, 2004.
  • Pärkkä et al. (2006) Pärkkä, Juha, Ermes, Miikka, Korpipää, Panu, Mäntyjärvi, Jani, Peltola, Johannes, and Korhonen, Ilkka. Activity classification using realistic data from wearable sensors. Information Technology in Biomedicine, IEEE Transactions on, 10(1):119–128, 2006.
  • Picard et al. (2001) Picard, Rosalind W, Vyzas, Elias, and Healey, Jennifer. Toward machine emotional intelligence: Analysis of affective physiological state. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 23(10):1175–1191, 2001.
  • Saeedi & Bouchard-Côté (2011) Saeedi, Ardavan and Bouchard-Côté, Alexandre. Priors over recurrent continuous time processes. In Advances in Neural Information Processing Systems, pp. 2052–2060, 2011.
  • Sano et al. (2015) Sano, A et al. Discriminating high vs low academic performance, self-reported sleep quality, stress level, and mental health using personality traits, wearable sensors and mobile phones. Body Sensor Networks (BSN) (to appear), 2015.
  • Stepleton et al. (2009) Stepleton, Thomas S, Ghahramani, Zoubin, Gordon, Geoffrey J, and Lee, Tai S. The block diagonal infinite hidden Markov model. In International Conference on Artificial Intelligence and Statistics, pp. 552–559, 2009.
  • Wiltschko et al. (2015) Wiltschko, Alexander B, Johnson, Matthew J, Iurilli, Giuliano, Peterson, Ralph E, Katon, Jesse M, Pashkovski, Stan L, Abraira, Victoria E, Adams, Ryan P, and Datta, Sandeep Robert. Mapping sub-second structure in mouse behavior. Neuron, 88(6):1121–1135, 2015.
  • Xie et al. (2003) Xie, Lexing, Chang, Shih-Fu, Divakaran, Ajay, and Sun, Huifang. Unsupervised discovery of multilevel statistical video structures using hierarchical hidden Markov models. In Multimedia and Expo, 2003. ICME’03. Proceedings. 2003 International Conference on, volume 3, pp. III–29. IEEE, 2003.