We propose the segmented iHMM (siHMM), a hierarchical infinite hidden Markov model (iHMM) that supports a simple, efficient inference scheme. The siHMM is well suited to segmentation problems, where the goal is to identify points at which a time series transitions from one relatively stable regime to a new regime. Conventional iHMMs often struggle with such problems, since they have no mechanism for distinguishing between high- and low-level dynamics. Hierarchical HMMs (HHMMs) can do better, but they require much more complex and expensive inference algorithms. The siHMM retains the simplicity and efficiency of the iHMM, but outperforms it on a variety of segmentation problems, achieving performance that matches or exceeds that of a more complicated HHMM.
Figure 1: (a) A sample of a synthetic dataset with true and inferred segmentations. Top three plots: the y-axis shows 1-d observations, colors denote true or inferred hidden state, and vertical lines denote true or inferred segment boundaries. Bottom plot: inferred posterior probability of a segment boundary. (b) Top: true transition matrices. Bottom: inferred transition matrices.
The infinite hidden Markov model (iHMM) (Beal et al., 2001) and its variants (e.g., (Fox et al., 2008; Saeedi & Bouchard-Côté, 2011)) have been among the most successful Bayesian nonparametric models, with applications from speech recognition (Fox et al., 2011) to biology (Beal & Krishnamurthy, 2012). However, despite their success in modeling time series with complicated low-level dynamics, their application to time series with multiple timescales has been limited.
Such hierarchically structured sequences characterized by multiple timescales arise in many domains, such as natural language (Lee et al., 2013), handwriting (Lake et al., 2014), and motion recognition (Heller et al., 2009). For example, it is natural in motion recognition to model the sequence of high-level actions (such as walking to a chair, then sitting down) and steps within the actions (e.g., bending one’s knees then leaning back to sit down) at two different levels.
We will focus on the problem of segmentation, in which the goal is to identify points at which a time series transitions from one relatively stable regime to a new regime. In the motion recognition example, the segmentation problem would be to identify when a subject transitioned from one type of action (e.g., walking) to another (e.g., sitting down), without necessarily identifying what they were doing. This is one of the easiest problems in time-series modeling that involves multiple timescales, but (as we will see) it is quite hard for (i)HMMs, which have no mechanism for distinguishing between high- and low-level dynamics.
The hierarchical HMM (HHMM) is a generalization of the HMM that naturally deals with dynamics at multiple timescales (Fine et al., 1998; Murphy & Paskin, 2002). But this generality comes at a price: these models lack the simple predictive distributions and efficient inference schemes that make (i)HMMs so popular. And the available nonparametric versions of the HHMM (e.g., (Heller et al., 2009; Stepleton et al., 2009)) are complex to implement and not readily amenable to efficient inference.
In this paper, we propose the segmented iHMM (siHMM), a simple extension to the iHMM that does not suffer from the above problems and can discover segment boundaries in time series with dynamics at two timescales. Unlike the HHMM, our model does not explicitly model higher-level state; instead, it assumes dynamics that evolve according to a standard iHMM except for occasional change-point events that kick the model into a new randomly chosen hidden state, disrupting the low-level dynamics of the iHMM. Because it relies on a very simple model of high-level dynamics, inference in the siHMM has time and implementation complexity similar to that of the iHMM, and well below that of a typical HHMM. We show that this simple change-point extension is sufficient to encourage the iHMM to model time-series data characterized by multiple regimes of low-level dynamics. Although our model is limited by the depth of the hierarchy, in many practical applications of HHMMs (e.g., (Olivera et al., 2004; Nguyen et al., 2005; Xie et al., 2003)) a two-level analysis of the dynamics is sufficient.
Below, we describe two versions of the model. The first version, which we call the feature-independent model, enjoys conditional conjugacy and therefore has simple Gibbs and variational inference algorithms. The second version, which we call the feature-based model, can incorporate domain knowledge without requiring complex new machinery. We present a stochastic variational inference (SVI) scheme for the feature-based model; the derivation for the feature-independent model is similar and straightforward.
We apply the model to three different tasks: a novel task of segmenting traces of user behavior in software applications, automatic behavioral segmentation of fruit flies, and sensor data labeling. Segmenting user behavior traces is of significant importance in understanding the behavior of software application users; it can help in identifying and simplifying the complex common patterns among the users (e.g., Adar et al. (2014); Han et al. (2007); Horvitz et al. (1998)). For the fruit fly behavior segmentation, we use a dataset from Kain et al. (2013); the results of this task can be used to better understand how the nervous system generates behavior. Finally, labeling sensor data gathered in everyday life settings can be used not only to understand physical activities (e.g., Ermes et al. (2008); Pärkkä et al. (2006)), but also to detect psychological and emotional states (e.g., Picard et al. (2001); Healey & Picard (2005)). Implementing effective health and wellbeing interventions, understanding user behavior, and designing affective interfaces are only a few applications of this task.
We empirically compare our model with two main baselines: 1) a two-level Bayesian nonparametric hierarchical HMM (HHMM) introduced by Johnson (2014) that models high-level dynamics as an infinite hidden semi-Markov model (HSMM) and sub-dynamics as an iHMM, and 2) the iHMM. In each of these tasks, we show that our model outperforms the nonparametric HHMM (despite being substantially simpler and faster) and the iHMM.
Our model can be viewed as a generalization of an iHMM in which the transition probability from each state is a mixture of two distributions: 1) a state-dependent transition probability distribution, as in an iHMM, and 2) a state-independent probability distribution. Which of these two distributions generates the hidden state at a given time depends on the hidden state and observation at the previous time step.
We say that the transitions caused by the state-independent distribution define the boundaries of a segment. The model implicitly assumes that the low-level dynamics within a segment are more structured and predictable than the higher-level dynamics that govern transitions between segments, since it can throw much more modeling power at these low-level dynamics. In motion capture data, for instance, the dynamics of a walk may be highly structured and predictable, whereas the dynamics that govern whether a user transitions from walking to standing, sitting, or running may be much less predictable.
The feature-independent model assumes the following generative process. At the first time step, we initialize the process by sampling a hidden state from an initial distribution. Given a hidden state, we generate an observation from a conditional observation distribution whose parameter corresponds to that hidden state.
Next, we sample a binary variable, which we call the segmentation variable, from a Bernoulli distribution. The parameter of this Bernoulli distribution is state-dependent and has a conjugate beta prior; it is the probability of creating a new segment at that time step. The segmentation variable indicates whether a new segment begins. If no new segment begins, we sample the next state from a state-dependent distribution (as in the iHMM); otherwise, we ignore the current state and sample the next state from a state-independent distribution.
The transition matrix has the same generative process as the iHMM: a vector of global state weights is drawn from GEM, the stick-breaking distribution, with its own concentration parameter; each transition distribution is then drawn from a Dirichlet process (DP) that has a second concentration parameter and the global weights as its base measure; and each observation parameter is drawn from a prior distribution. The graphical model of the feature-independent siHMM is depicted in Fig. 2.
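As a rough illustration, the generative process can be simulated under a finite truncation. The sketch below is not the authors' implementation: the names `pi0`, `pi`, and `rho`, the truncation level `K`, the hyperparameter values, the Gaussian observation model, the finite Dirichlet stand-in for the GEM draw, and the choice to draw both the first state and the state-independent distribution from the same prior are all assumptions made for illustration.

```python
import numpy as np

def sample_sihmm(T, K=10, gamma=5.0, alpha=5.0, a=1.0, b=10.0, seed=0):
    """Sample one sequence from a truncated, feature-independent siHMM sketch."""
    rng = np.random.default_rng(seed)
    # Finite (weak-limit) stand-in for the stick-breaking GEM draw.
    beta = rng.dirichlet(np.full(K, gamma / K))
    # Small constant keeps every Dirichlet parameter strictly positive.
    conc = alpha * beta + 1e-6
    pi0 = rng.dirichlet(conc)             # state-independent distribution (assumed prior)
    pi = rng.dirichlet(conc, size=K)      # state-dependent rows, as in the iHMM
    rho = rng.beta(a, b, size=K)          # per-state probability of a new segment
    means = rng.normal(0.0, 5.0, size=K)  # assumed Gaussian emission means

    z = np.empty(T, dtype=int)            # hidden states
    s = np.zeros(T, dtype=int)            # segmentation variables
    x = np.empty(T)                       # observations
    z[0] = rng.choice(K, p=pi0)           # assumed initial distribution
    for t in range(T):
        x[t] = rng.normal(means[z[t]], 1.0)
        if t + 1 < T:
            s[t] = rng.random() < rho[z[t]]
            # A new segment ignores the current state and draws from pi0.
            row = pi0 if s[t] else pi[z[t]]
            z[t + 1] = rng.choice(K, p=row)
    return z, s, x
```

Calling `sample_sihmm(500)` returns the state sequence, the segmentation indicators, and the observations; a small `a / (a + b)` keeps segment boundaries rare relative to within-segment transitions.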
An illustration of the model applied to a synthetic dataset (explained in Sec. 5.1) is provided in Fig. 1. The model is able to approximately recover the block-diagonal structure of the true transition matrix. Even though the model does not explicitly encourage block-diagonal structure, the sparsity induced by the DP prior on the transition distributions is sufficient to encourage the model to push inter-segment dynamics into the state-independent distribution and recover the block-diagonal intra-segment dynamics.
In some tasks, such as segmenting software user traces or tagging fruit fly behavior, rich domain knowledge is available for improving the model. For instance, in segmenting user traces, particular observed features may indicate the end of a segment. We modify the model so that such features can be added declaratively. Because of the lack of conjugacy, deriving a Gibbs sampler is no longer straightforward; in Section 3, we instead derive an efficient SVI algorithm for this model.
The difference between this version of the model and the feature-independent version is in the conditional distribution of the segmentation variable (see Fig. 3 for the graphical model). Here, the parameter of the Bernoulli distribution is a sigmoid function of weighted features. It combines a weight vector for the data-dependent features, a feature function that consists of all observation-dependent features, and a feature weight for each hidden state. To simplify the notation, we assume that the observation-dependent features depend only on the observation at a single time step.
We do not assume a prior for the feature weights; instead, we use a point estimate for them in our SVI algorithm.
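A minimal sketch of how such a feature-based segmentation probability could be computed is given here. The additive form inside the sigmoid and the names `w_obs`, `w_state`, and `segment_prob` are assumptions for illustration; the text only states that a sigmoid combines observation-feature weights with a per-state weight.

```python
import numpy as np

def segment_prob(w_obs, w_state, features, z):
    """Probability that a new segment starts after a step spent in state z.

    Assumed form: sigmoid(w_obs . f(x_t) + w_state[z_t]).
    """
    logit = float(np.dot(w_obs, features)) + float(w_state[z])
    return 1.0 / (1.0 + np.exp(-logit))
```

With all weights at zero the probability is 0.5 for every state, so in practice a negative per-state weight would be needed to make boundaries rare.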
To keep the notation uncluttered, we assume that we have a dataset of sequences that all have the same length. For inference, we use the stochastic variational inference (SVI) algorithm (Hoffman et al., 2013) and approximate the posterior with a truncated variational distribution introduced in (Johnson & Willsky, 2014), chosen from the mean-field family. In the language of SVI, the hidden states z and the segmentation variables s are local variables, and the remaining model parameters are global variables. We maximize the marginal likelihood lower bound by using stochastic natural gradient ascent over the global factors and standard mean-field updates for the local factors. At each iteration of SVI, we sample a minibatch of sequences from the dataset and update its local factors; next, given the expectation with respect to the local factors, we update the global factors by taking a step in the approximate natural gradient direction. To further simplify the notation, we assume that the minibatch is a single sequence and drop the sequence superscripts. Next, we explain the variational factors for each of the variables.
For , the “direct assignment” truncation used in (Johnson & Willsky, 2014), sets if for any of to we have and ; here is the truncation level. Since by using this truncation the update to conditioned on the other factors is not conjugate anymore, we use a point estimate for :
. We adopt the same point estimate approach for the parameters of the sigmoid function; hence,and .
With this truncation scheme, we can write the prior over as . Here, and . We know from (Hoffman et al., 2013) that due to conjugacy the optimal is in the form of where is the parameter of the variational distribution.
We assume that the prior over the observation parameters is in the exponential family and that it is a conjugate prior for the likelihood function. This implies that the optimal variational distribution is in the same family, with a different natural parameter.
For the variational updates, we need to take expectations with respect to each of the variational distributions. For the expectations with respect to , a modification of the standard HMM forward-backward algorithm with the following forward and backward messages can be used:
These messages can be computed in . In fact, the augmented transition matrix that we need to compute for these forward-backward messages has the following form:
where is a transition matrix and is an all-ones vector of size . The matrix operation for computing a message requires operations for the upper half of the matrix () and for the lower half (). This is because all the rows of each block in the lower half are the same. Hence, the total number of operations for message passing is . Furthermore, the total memory required to compute these operations is .
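The saving that comes from identical rows can be illustrated with a simplified forward recursion over the hidden state alone, with the segmentation variable summed out. This is a sketch under assumptions, not the augmented-matrix messages described above: the names `pi`, `pi0`, `rho`, and `init` are hypothetical, and `K` denotes an assumed truncation level. The state-dependent part costs on the order of K² operations per step, while the state-independent part costs on the order of K because every row of that block is the same vector.

```python
import numpy as np

def forward_loglik(log_lik, pi, pi0, rho, init):
    """Scaled forward pass for a truncated siHMM with s_t summed out.

    log_lik: (T, K) log-likelihood of each observation under each state.
    pi: (K, K) state-dependent transitions; pi0: (K,) state-independent row.
    rho: (K,) per-state probability of starting a new segment.
    init: (K,) assumed initial state distribution.
    """
    T, K = log_lik.shape
    shift = log_lik[0].max()
    alpha = init * np.exp(log_lik[0] - shift)
    total = shift + np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for t in range(1, T):
        stay = (alpha * (1.0 - rho)) @ pi   # order K^2: ordinary iHMM transitions
        jump = (alpha @ rho) * pi0          # order K: all rows of this block equal pi0
        shift = log_lik[t].max()
        alpha = (stay + jump) * np.exp(log_lik[t] - shift)
        total += shift + np.log(alpha.sum())
        alpha = alpha / alpha.sum()
    return total, alpha
```

The function returns the log marginal likelihood of the sequence and the filtered distribution over the final state; the backward pass can exploit the same rank-one structure.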
For updating the local factors, instead of and in Eq.3.2, we compute the forward-backward messages using: and
The expectations of the sufficient statistics with respect to are:
Given these expected sufficient statistics and a scaling factor , we can write the update equations for the parameters of the global variational factors and :
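For a conjugate exponential-family factor, an SVI update of this kind generally takes the form of a convex combination of the current natural parameter and a minibatch estimate of its optimum (Hoffman et al., 2013). The sketch below shows that generic form only; the function names, argument names, and the step-size schedule are assumptions and are not taken from the paper.

```python
import numpy as np

def svi_global_step(eta, eta_prior, expected_stats, scale, step):
    """One natural-gradient step on a conjugate global variational factor.

    eta: current natural parameter of the variational factor.
    eta_prior: natural parameter of the prior.
    expected_stats: expected sufficient statistics from the minibatch.
    scale: factor that rescales the minibatch to the size of the full dataset.
    """
    # Optimum the factor would have if the whole dataset looked like the minibatch.
    eta_hat = eta_prior + scale * expected_stats
    return (1.0 - step) * eta + step * eta_hat

def step_size(t, tau=1.0, kappa=0.6):
    """Assumed schedule; kappa in (0.5, 1] satisfies the Robbins-Monro conditions."""
    return (t + tau) ** (-kappa)
```

For example, `svi_global_step(eta, eta_prior, stats, scale=20.0, step=step_size(t))` would be applied once per sampled sequence when the dataset holds 20 sequences and each minibatch is a single sequence.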
For the global factors , and , we use a point estimate; hence, we only need the gradient of with respect to , and . For we follow the derivation in (Johnson & Willsky, 2014) and obtain the following gradient to use in a truncated gradient step on :
Note that, we need to ensure that after each update. To estimate , we have:
We have a similar equation for .
Few models resemble ours in extending the iHMM to multiple timescales. The infinite hierarchical HMM (iHHMM), introduced in (Heller et al., 2009), is a nonparametric model that allows the HHMM to have potentially unbounded depth. Hence, the model can infer the number of levels in the hierarchy. In the iHHMM, the bottom level is the observed sequence and each level is a sequence of hidden variables dependent on the level above. As the authors suggested, more efficient inference algorithms are needed in order to make their model useful for practical applications.
The block-diagonal iHMM (Stepleton et al., 2009) is a generalization of iHMM that assumes a nearly block-diagonal structure on the transition matrix of the iHMM. Each block corresponds to a “sub-behavior” and the model can partition the data sequences according to these sub-behaviors. The model first partitions the infinite number of hidden states into an infinite number of blocks by using an additional stick-breaking process. Then, it increases the probability of transition between the states of a block by modifying the Dirichlet process prior over the transitions. Hence, as the block size becomes smaller, the model behavior converges to that of iHMM. For inference, as the authors explained, achieving a fast mixing rate in their proposed inference algorithm requires implementing a nontrivial bookkeeping-intensive method. In contrast, our model is much simpler and easier to implement inference for, but it can also discover transition matrices with approximately block-diagonal structure; the segmentation events provide a mechanism for transitioning from one group of connected states to another.
Another related model is a two-level Bayesian nonparametric HMM introduced in (Johnson, 2014) that models the high-level dynamics, or superstates, as an HDP-HSMM. As a generalization of the iHMM, the HDP-HSMM can model the dwell time in each state by sampling it from a state-specific duration distribution once a state is entered. For a formal definition of the HDP-HSMM, see (Johnson, 2014). Given each superstate, observations are generated according to an iHMM with superstate-specific parameters, whose hidden states are the substates. Compared to our model, this model is much more flexible; however, the computation of forward-backward messages is less efficient. Moreover, it requires more bookkeeping for all superstates and their substates. It also requires setting a truncation level for the superstates and for each of the iHMMs corresponding to them. We can set a large truncation level for all of them, but this means paying a large computational cost for iHMMs that require only a few states. In contrast, we only need to set a single truncation level for the whole model. We call this model the sub-iHMM and use it as a baseline, since it is a flexible Bayesian nonparametric model that supports two-level dynamics.
Finally, our model can also be applied to time-series segmentation tasks at a single level. The iHMM, or a variant of it with a self-transition bias, the sticky HDP-HMM (Fox et al., 2011), can be used for the same purpose. The self-transition bias encourages self-transitions, so consecutive time steps tend to be assigned to the same state. The model can capture segments in tasks such as speaker diarization; however, in contrast to our model, there are no dynamics within a given state. In other words, in the sticky HDP-HMM, given a state, the observations are independent of each other. This makes the model inappropriate for tasks such as user trace segmentation, because there has to be some order on substates within a segment. For instance, in a software application an action like selection needs to come before an action like move selection. To show the importance of this point, we also include a self-transition bias term in the iHMM and compare our model with the sticky HDP-HMM. (Footnote 1: We do not include a separate column in Table 2 for the sticky HDP-HMM; instead, we find the best variational lower bound over all hyperparameter settings of the iHMM, including a hyperparameter for the self-transition bias.)
We evaluate the performance of our feature-independent and feature-based models on synthetic and real datasets. We use a synthetic dataset to illustrate the advantages of our model compared to baselines.
In order to further demonstrate the capabilities of our model, we apply it to three real tasks from three different domains: human-computer interaction, biology, and sensor data analysis. As mentioned in Section 4, two reasonable baselines for our model are the two-level Bayesian nonparametric HMM and iHMM. We report the labeling error (or normalized Hamming distance in the case of the synthetic dataset) and predictive log-likelihood for our model and these baselines. To choose among different hyperparameter settings, we use the variational lower bound (VLB) as our objective measure. We show that our model, while being simpler and efficient in terms of inference, is competitive with or outperforms these baselines. For all experiments on siHMM we try both the feature-independent and feature-based models; we report the results separately in Table 2 to show the effect of including observation features in the model. In the experiments, we only try the hidden state and the observation as the features for a given time step. However, more sophisticated features can be made from the observation(s).
We generate a synthetic dataset with 5000 data points from 3 different transition matrices, each with 3 hidden states. Each row of each transition matrix is sampled from a modified prior distribution with a self-transition bias of 1. The observations are sampled from normal distributions with separate, non-conjugate priors on their mean and variance parameters. The goal is to find the points where the regime changes and also to determine the dynamics within each segment of the sequence. At each time step, we switch the regime with probability 0.05.
Fig. 1 shows a sample sequence and the result of running 100 passes of SVI over the whole dataset for that sequence. For running SVI, we split the dataset into 20 sequences and use a batch of size 2. We randomly sample a sequence with length 750 to calculate the predictive log-likelihood. We report the error over the dataset for the hyperparameter setting with the highest VLB.
For the hyperparameters in the feature-independent model, we set and for the beta prior; for the feature-based model, we randomly initialize the feature weights from and only use the hidden state as a feature. We do a grid search over combinations of values for , , and . We place an prior on the mean of the observation distributions. Their variance prior is where and are coming from and . We run SVI with 10 different seeds for 100 iterations over all these combinations. To compute the normalized Hamming distance between the inferred states and the true states, we use the Munkres algorithm (Munkres, 1957). The algorithm assigns indices to the inferred sequence so that it maximizes the overlap with the true sequence. Table 2 gives the computed distance over the dataset for the hyperparameter setting with the highest VLB. We also report the predictive log-likelihood over the held-out set.
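The label-matching step behind the normalized Hamming distance can be sketched as follows. To stay dependency-free, this version searches over label permutations by brute force rather than running the Munkres algorithm; both reach the same optimal matching, but the brute-force search is only practical for a handful of states. The function name is hypothetical.

```python
from itertools import permutations

def normalized_hamming(true_states, inferred_states):
    """Smallest fraction of mismatched time steps over relabelings of inferred states.

    Brute force over injective label assignments; the Munkres algorithm finds
    the same optimum in polynomial time and is preferable for many states.
    """
    true_labels = sorted(set(true_states))
    inferred_labels = sorted(set(inferred_states))
    # Pad with dummy targets so every inferred label can be mapped somewhere.
    n = max(len(true_labels), len(inferred_labels))
    targets = true_labels + [None] * (n - len(true_labels))
    best = len(true_states)
    for perm in permutations(targets, len(inferred_labels)):
        mapping = dict(zip(inferred_labels, perm))
        errors = sum(mapping[i] != t for t, i in zip(true_states, inferred_states))
        best = min(best, errors)
    return best / len(true_states)
```

A relabeled but otherwise perfect segmentation scores 0, for example `normalized_hamming([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1])`.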
In addition to the above hyperparameters, the sub-iHMM also requires the truncation level of the superstates and the substates, which we set to 10 and , correspondingly. For all hyperparameters, shared with the siHMM, we use the same set of settings for the baselines. For the self-transition bias for sticky HDP-HMM we try .
Fig. 4 shows the histogram of the normalized Hamming distance and also the predictive log-likelihood for runs with different hyperparameter settings. In terms of the Hamming distance, the siHMM performs slightly better than the iHMM and outperforms the sub-iHMM for most settings. The same conclusion holds for the predictive log-likelihood. Furthermore, Fig. 1 shows that our model can do reasonably well in finding the regime change points and also the states within each segment of the sequence. We choose a threshold of 0.5 for the posterior segmentation probability to identify a time point as a change point; however, as shown in the last row of Fig. 1, the model is able to provide an estimate for the uncertainty over change points.
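Turning the posterior boundary probabilities into segments with the 0.5 threshold can be sketched as below. The function name and the indexing convention (entry `t` is the probability that a new segment starts at time `t + 1`) are assumptions for illustration.

```python
def split_segments(boundary_probs, threshold=0.5):
    """Return (start, end) index pairs, end exclusive, from boundary probabilities.

    Assumes boundary_probs[t] is the posterior probability that a new segment
    starts at time t + 1, so the last entry cannot open a segment.
    """
    segments, start = [], 0
    for t, p in enumerate(boundary_probs[:-1]):
        if p > threshold:
            segments.append((start, t + 1))
            start = t + 1
    segments.append((start, len(boundary_probs)))
    return segments
```

Lowering the threshold trades missed boundaries for spurious ones, which is one way to use the uncertainty estimate rather than a single hard segmentation.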
Table 2 shows that for the dataset which we generated from a block-diagonal transition matrix, siHMM outperforms both iHMM and sub-iHMM. This might be because the synthetic dataset is specifically generated without any state-dependent high-level dynamics. Our model, which does not assume any dynamics for the segments, performs better than the other baselines which implicitly or explicitly assume that.
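Table: dataset statistics, with columns Dataset, # Points, Held-out, and Max. # States (the entries are missing from this copy).
Table 2: results for each dataset, reported as normalized Hamming distance or error % together with predictive log-likelihood, with columns Data Set, iHMM, sub-iHMM, siHMM (WoF), and siHMM (WF) (the entries are missing from this copy).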
Modeling user behavior traces is of significant importance in the human-computer interaction literature (e.g., (Adar et al., 2014; Bigham et al., 2014)). Having a good understanding of the tasks done by users can potentially help in designing better workflows in software applications. It can also help in providing better guidance to users by predicting user intentions. Log files of software applications contain user actions and their corresponding time-stamps; however, it is not clear from these log files alone how many different tasks a user has done in a single work session. A task consists of multiple actions and each work session consists of multiple tasks. Our two-level hierarchical model can be used for detecting the boundaries between tasks (i.e., segments). We believe that our model is a good fit for software applications in which the tasks are less predictable than the actions within each task. For instance, in photo editing software, high-level tasks such as adding filters or removing a part of a picture usually do not have a clear order, and different users apply different orderings based on their needs. However, the actions required to add a filter (e.g., 1- choose a filter, 2- select a part of the photo, and 3- apply the filter) are typically more structured and follow an order.
We collect log files of users who follow 23 different tutorials in a photo editing application. The dataset is in the form of time-stamp and action; it contains 14000 data points and 59 unique actions in total. We randomly choose a sequence of size 1400 and form a held-out set. We split the sequences into subsequences of size 1000 and apply 100 passes of SVI to the dataset with both feature-based and feature-independent models.
The possible hyperparameter settings that we consider are , , and finally , the parameter for the Dirichlet prior, which we place on the parameters of the observation likelihood. We run SVI with 10 different seeds for each setting and report the result for the best setting with the highest VLB in Table 2.
The labels (i.e., the tutorial numbers that the user followed) for each segment of the dataset are available; hence, we can test the siHMM on predicting the labels for each segment. This task is more involved than segmentation, as we also need to group the substates. We use a simple K-means clustering on the empirical transition matrix, which is generated by counting the transitions in each inferred segment. This approach works well in practice; however, more sophisticated methods are possible for grouping the substates. Table 2 provides the prediction error (computed using the Munkres algorithm) for the siHMM and the baselines. The performance of our feature-independent model is significantly better than that of the iHMM and comparable with that of the more flexible (and also more computationally intensive) sub-iHMM. Adding the observation feature to the model reduces the error to 16%. As mentioned in Section 2.2, in this dataset there are observations that can signal a change-point in the dynamics.
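The per-segment features fed to the clustering step can be sketched as follows. The row normalization, the flattening into a vector, and the function name are assumptions; the text only states that transitions are counted within each inferred segment and that K-means is run on the resulting empirical transition matrices.

```python
import numpy as np

def segment_features(states, segments, K):
    """Flattened, row-normalized transition counts for each inferred segment.

    states: inferred hidden-state sequence (integers in [0, K)).
    segments: list of (start, end) index pairs, end exclusive.
    """
    feats = []
    for start, end in segments:
        counts = np.zeros((K, K))
        for a, b in zip(states[start:end - 1], states[start + 1:end]):
            counts[a, b] += 1
        rows = counts.sum(axis=1, keepdims=True)
        # Rows with no outgoing transitions stay at zero.
        feats.append((counts / np.maximum(rows, 1)).ravel())
    return np.array(feats)
```

The returned array has one row per segment and can be passed to any K-means implementation, with the number of clusters set to the number of task labels.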
Automating scientific experiments on live animals has attracted significant attention recently (see, for instance, (Kain et al., 2013; Wiltschko et al., 2015; Crall et al., 2015; Freeman et al., 2014)). With the advent of high-throughput and more accurate devices, the need for automatic analysis of large amounts of collected data has grown. In neuroscience and biology, a large amount of behavioral data is collected from live animals in order to understand how the brain generates activity and how the underlying mechanisms have evolved (Kain et al., 2013). Typically, the first step in analyzing this data is finding and categorizing different types of behavior; this step can be done manually by experts, but it is time-consuming and sometimes error-prone.
An automatic framework has been proposed in (Kain et al., 2013) for tracking the leg movements and classifying the behavior of fruit flies. The behavior is recorded by tracking each leg of a fruit fly moving upon a track ball. The collected raw data consists of the x and y coordinates of 6 legs and the three rotational components of the rotating ball (i.e., a 15-dimensional vector in real time). After some post-processing and adding some higher-order features (e.g., derivatives of each of the 15 raw data vectors), they expand the dimensions to 45 and apply a KNN classifier to assign each frame one of 12 possible behavioral labels. Our goal is to use this dataset and categorize the frames in an unsupervised way. A frequent assumption in the behavioral sciences is that a small set of stereotyped motifs describes most animal activities (Berman et al., 2014). In other words, actions within a behavioral segment (e.g., actions required for grooming) should be more structured than the behavioral segments themselves. Given the capabilities of the siHMM, it is a reasonable choice for this dataset.
The dataset contains 10000 data points; our held-out set is a randomly chosen subsequence with length 1500, and we apply SVI for 100 passes over both the feature-independent and feature-based models. For the observations, we use a multivariate Gaussian likelihood and a conjugate Normal/inverse-Wishart prior. We use the empirical mean and variance of the dataset as the mean and variance of the Gaussian prior. The parameters of the inverse-Wishart are chosen from the possible combinations of and . We set the truncation level for the states to 20. Finally, we have the following possible settings for the transition matrix prior and .
As in Section 5.2, we group the inferred substates with K-means and assign labels to the segments. The results, presented in Table 2, show that the feature-based siHMM performs on a par with the iHMM and outperforms the sub-iHMM by a relative error reduction of 17%. This may emphasize the importance of adding data-driven features to the model. Figure 5 shows a sample of the dataset and its segmentation by different methods.
Through the emergence of pervasive computing and affordable wearable sensors, in-situ measurement of different bio-signals has become possible. This powerful source of data can be utilized for several purposes, including activity recognition and task identification. Toward this goal, an efficient algorithm for analyzing this large amount of data, which is gathered 24/7, is essential. In this section, we use the siHMM to model data collected via the Empatica E4 wristband (Empatica, 2015), a wearable device that can collect electrodermal activity (EDA) (Boucsein, 2012), blood volume pulse (BVP), acceleration, and body temperature. EDA refers to changes in electrical properties of the skin caused by sudomotor innervation (Boucsein, 2012). EDA is an indication of physiological or psychological arousal and has been utilized to objectively measure affective phenomena and sleep quality (Sano et al., 2015).
Segmenting the sensor data can help psychophysiological activity recognition. For instance, it can help in finding stressful periods objectively in order to detect the roots of stress in a person’s lifestyle. However, manual labeling for large amounts of user data (days or months) is time-consuming and even invalid if not reported in a timely manner.
We use a dataset with 12000 time steps, collected from a single user, and model the (normalized) observations (i.e., EDA, BVP and acceleration in 3 dimensions) by a multivariate Gaussian distribution. The hyperparameter setting is similar to that of Section 5.3.
The labeling error in Table 2 shows that the feature-independent siHMM, while performing comparably to the sub-iHMM, outperforms the iHMM by a relative error reduction of 50%. Fig. 5.4 demonstrates the inferred segments for the siHMM and the baselines. It seems that the single observation feature that we use in this experiment does not help on this dataset; however, more sophisticated features may help improve the segmentation.
We proposed a new Bayesian nonparametric model, the siHMM, for modeling dynamics at two timescales in time series. Our model is a simple extension to the widely used iHMM and has an efficient inference scheme. Although our model is less flexible than other nonparametric models for hierarchical time series, we showed that it can perform reasonably well in practice. One potential application of our model is using the inferred state-independent transition vector for summarizing a sequence. For instance, in user behavior analysis, this vector may represent a user fingerprint, and users can be grouped based on it. A more comprehensive comparison with other models would help to better understand this feature and the behavior of our model in other applications.