siHMM
We propose the segmented iHMM (siHMM), a hierarchical infinite hidden Markov model (iHMM) that supports a simple, efficient inference scheme. The siHMM is well suited to segmentation problems, where the goal is to identify points at which a time series transitions from one relatively stable regime to a new regime. Conventional iHMMs often struggle with such problems, since they have no mechanism for distinguishing between high- and low-level dynamics. Hierarchical HMMs (HHMMs) can do better, but they require much more complex and expensive inference algorithms. The siHMM retains the simplicity and efficiency of the iHMM, but outperforms it on a variety of segmentation problems, achieving performance that matches or exceeds that of a more complicated HHMM.


Figure 1. (a) A sample of a synthetic dataset with true and inferred segmentations. Top three plots: the y-axis shows 1-d observations, colors denote true or inferred hidden state, and vertical lines denote true or inferred segment boundaries. Bottom plot: inferred posterior probability of a segment boundary. (b) Top: true and bottom: inferred transition matrices.

The infinite hidden Markov model (iHMM) (Beal et al., 2001) and its variants (e.g., Fox et al., 2008; Saeedi & Bouchard-Côté, 2011) have been among the most successful Bayesian nonparametric models, with applications ranging from speech recognition (Fox et al., 2011) to biology (Beal & Krishnamurthy, 2012). However, despite their success in modeling time series with complicated low-level dynamics, their application to time series with multiple timescales has been limited.
Such hierarchically structured sequences characterized by multiple timescales arise in many domains, such as natural language (Lee et al., 2013), handwriting (Lake et al., 2014), and motion recognition (Heller et al., 2009). For example, it is natural in motion recognition to model the sequence of high-level actions (such as walking to a chair, then sitting down) and the steps within those actions (e.g., bending one's knees, then leaning back to sit down) at two different levels.
We will focus on the problem of segmentation, in which the goal is to identify points at which a time series transitions from one relatively stable regime to a new regime. In the motion recognition example, the segmentation problem would be to identify when a subject transitioned from one type of action (e.g., walking) to another (e.g., sitting down), without necessarily identifying what they were doing. This is one of the easiest problems in time-series modeling that involves multiple timescales, but (as we will see) it is quite hard for (i)HMMs, which have no mechanism for distinguishing between high- and low-level dynamics.
The hierarchical HMM (HHMM) is a generalization of the HMM that naturally deals with dynamics at multiple timescales (Fine et al., 1998; Murphy & Paskin, 2002). But this generality comes at a price: these models lack the simple predictive distributions and efficient inference schemes that make (i)HMMs so popular. And the available nonparametric versions of the HHMM (e.g., Heller et al., 2009; Stepleton et al., 2009) are complex to implement and not readily amenable to efficient inference.
In this paper, we propose the segmented iHMM (siHMM), a simple extension to the iHMM that does not suffer from the above problems and can discover segment boundaries in time series with dynamics at two timescales. Unlike the HHMM, our model does not explicitly model higher-level state; instead, it assumes dynamics that evolve according to a standard iHMM except for occasional change-point events that kick the model into a new randomly chosen hidden state, disrupting the low-level dynamics of the iHMM. Because it relies on a very simple model of high-level dynamics, inference in the siHMM has time and implementation complexity similar to that of the iHMM, and well below that of a typical HHMM. We show that this simple change-point extension is sufficient to encourage the iHMM to model time-series data characterized by multiple regimes of low-level dynamics. Although our model is limited by the depth of the hierarchy, in many practical applications of HHMMs (e.g., Olivera et al., 2004; Nguyen et al., 2005; Xie et al., 2003) a two-level analysis of the dynamics is sufficient.
Below, we describe two versions of the model. The first version, which we call the feature-independent model, enjoys conditional conjugacy and therefore has simple Gibbs and variational inference algorithms. The second version, which we call the feature-based model, can incorporate domain knowledge without requiring complex new machinery. We present a stochastic variational inference (SVI) scheme for the feature-based model; the derivation for the feature-independent model is similar and straightforward.
We apply the model to three different tasks: a novel task of segmenting traces of user behavior in software applications, automatic behavioral segmentation of fruit flies, and sensor data labeling. Segmenting user behavior traces is of significant importance in understanding the behavior of software application users; it can help in identifying and simplifying the complex common patterns among users (e.g., Adar et al., 2014; Han et al., 2007; Horvitz et al., 1998). For the fruit fly behavior segmentation, we use a dataset from Kain et al. (2013); the results of this task can be used to better understand how the nervous system generates behavior. Finally, labeling sensor data gathered in everyday life settings can be used not only to understand physical activities (e.g., Ermes et al., 2008; Pärkkä et al., 2006), but also to detect psychological and emotional states (e.g., Picard et al., 2001; Healey & Picard, 2005). Implementing effective health- and well-being-related interventions, understanding user behavior, and designing affective interfaces are only a few applications of this task.
We empirically compare our model with two main baselines: 1) a two-level Bayesian nonparametric hierarchical HMM (HHMM) introduced by Johnson (2014) that models high-level dynamics as an infinite hidden semi-Markov model (HSMM) and subdynamics as an iHMM, and 2) the iHMM. In each of these tasks, we show that our model outperforms the nonparametric HHMM (despite being substantially simpler and faster) and the iHMM.
Our model can be viewed as a generalization of an iHMM in which the transition probability from each state k is a mixture of two distributions: 1) a state-dependent transition distribution π_k, as in an iHMM, and 2) a state-independent distribution λ. Which of these two distributions generates the hidden state at a given time depends on the hidden state and observation at the previous time step. We say that the transitions drawn from λ define the boundaries of a segment. The model implicitly assumes that the low-level dynamics within a segment are more structured and predictable than the higher-level dynamics that govern transitions between segments, since it can devote much more modeling power to these low-level dynamics. In motion capture data, for instance, the dynamics of a walk may be highly structured and predictable, whereas the dynamics that govern whether a subject transitions from walking to standing, sitting, or running may be much less predictable.
The feature-independent model assumes the following generative process. At time step t = 1, we initialize the process by sampling a hidden state z_1 from an initial distribution. Given a hidden state z_t, we generate an observation y_t from a conditional observation distribution whose parameter θ_{z_t} corresponds to the hidden state: y_t ~ F(θ_{z_t}).

Next, we sample a variable s_t, which we call the segmentation variable, from a Bernoulli distribution with parameter ρ_{z_t}. This is a state-dependent variable with a conjugate beta prior with hyperparameters a and b. Here, s_t = 1 denotes the beginning of a new segment, and ρ_{z_t} denotes the probability of creating a new segment at time t. If s_t = 0, we sample the next state from the state-dependent distribution π_{z_t} (as in the iHMM); otherwise, we ignore the current state and sample from the state-independent distribution λ:

    z_{t+1} | z_t, s_t ~ π_{z_t}   if s_t = 0,
    z_{t+1} | z_t, s_t ~ λ         if s_t = 1.
The transition matrix has the same generative process as in the iHMM:

    β | γ ~ GEM(γ),    π_k | α, β ~ DP(α, β),    λ | α, β ~ DP(α, β),

where β is the prior distribution over states, GEM is the stick-breaking distribution with concentration parameter γ, and DP(α, β) denotes sampling from a Dirichlet process with concentration parameter α and base distribution β. The graphical model of the feature-independent siHMM is depicted in Fig. 2.
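To make the generative process concrete, here is a minimal sketch that samples from a truncated version of the feature-independent model. The truncation level K, the unit-variance Gaussian emissions, the uniform-ish initialization, and all hyperparameter values are illustrative assumptions for the sketch, not part of the model definition:

```python
import numpy as np

def sample_sihmm(T, K=10, alpha=5.0, gamma=5.0, a=1.0, b=10.0, seed=0):
    """Sample observations, states, and segmentation variables from a
    truncated feature-independent siHMM (truncation stands in for the
    infinite model; Gaussian emissions are an illustrative choice)."""
    rng = np.random.default_rng(seed)
    # Stick-breaking (GEM) weights beta, truncated at K states.
    sticks = rng.beta(1.0, gamma, size=K)
    beta = sticks * np.concatenate(([1.0], np.cumprod(1.0 - sticks)[:-1]))
    beta /= beta.sum()
    # Finite Dirichlet approximations to DP(alpha, beta) for each row
    # of the transition matrix and for the state-independent lambda.
    pi = rng.dirichlet(alpha * beta, size=K)
    lam = rng.dirichlet(alpha * beta)
    rho = rng.beta(a, b, size=K)           # per-state boundary probabilities
    theta = rng.normal(0.0, 3.0, size=K)   # per-state emission means
    z = np.empty(T, dtype=int)
    s = np.zeros(T, dtype=int)
    y = np.empty(T)
    z[0] = rng.choice(K, p=beta)
    for t in range(T):
        y[t] = rng.normal(theta[z[t]], 1.0)
        if t + 1 < T:
            s[t] = rng.random() < rho[z[t]]     # segment boundary?
            p = lam if s[t] else pi[z[t]]       # lambda on boundaries
            z[t + 1] = rng.choice(K, p=p)
    return y, z, s

y, z, s = sample_sihmm(500)
```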
An illustration of the model applied to a synthetic dataset (explained in Sec. 5.1) is provided in Fig. 1. The model is able to approximately recover the block-diagonal structure of the true transition matrix. Even though the model does not explicitly encourage block-diagonal structure, the sparsity induced by the DP prior on λ is sufficient to encourage the model to push inter-segment dynamics into λ and recover the block-diagonal intra-segment dynamics.
In some tasks, such as segmenting software user traces or tagging fruit fly behavior, there is rich domain knowledge available for improving the model. For instance, in segmenting user traces, certain observed actions may indicate the end of a segment. We modify the model so that such features can be added declaratively. Although the resulting lack of conjugacy means that deriving a Gibbs sampler is no longer straightforward, in Section 3 we derive an efficient SVI algorithm for this model.
The difference between this version of the model and the feature-independent version is in the conditional distribution of the segmentation variable (see Fig. 3 for the graphical model). Here, the parameter of the Bernoulli distribution is

    p(s_t = 1 | z_t, y_t) = σ(w_{z_t} + w_f^T φ(y_t)),

where w_f is the weight vector for the data-dependent features, φ is the feature function, which consists of all observation-dependent features, and w_{z_t} is the feature weight for hidden state z_t. To simplify the notation, we assume that the observation-dependent features depend only on the observation at a single time step. We do not assume a prior for the feature weights; instead, we use a point estimate for them in our SVI algorithm.
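This segmentation probability is a one-line computation; in the sketch below, the feature map and all weight values are placeholders:

```python
import numpy as np

def seg_prob(w_state, w_feat, z_t, phi_y):
    """p(s_t = 1 | z_t, y_t) = sigmoid(w_{z_t} + w_f^T phi(y_t))."""
    return 1.0 / (1.0 + np.exp(-(w_state[z_t] + w_feat @ phi_y)))

# With all weights zero, the boundary probability is exactly 1/2.
p = seg_prob(np.zeros(5), np.zeros(3), 2, np.ones(3))
```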
To keep the notation uncluttered, we assume that we have a dataset of N sequences, all of the same length T, and write y = y_{1:T}, z = z_{1:T}, s = s_{1:T}. For inference, we use the stochastic variational inference (SVI) algorithm (Hoffman et al., 2013) and approximate the posterior with the truncated variational distribution introduced in Johnson & Willsky (2014). We approximate the posterior with a mean-field family distribution q(β) q(λ) q(π) q(θ) q(w) q(z) q(s). In the language of SVI, z and s are local variables, and β, λ, π, θ, and w are global variables. We maximize the marginal likelihood lower bound

    L = E_q[ log p(y, z, s, β, λ, π, θ, w) − log q(z, s, β, λ, π, θ, w) ]

by using stochastic natural-gradient ascent over the global factors and standard mean-field updates for the local factors. At each iteration of SVI, we sample a minibatch of sequences from the dataset and update its local factors; next, given the expectations with respect to the local factors, we update the global factors by taking a step of size ε in the approximate natural-gradient direction. To further simplify the notation, we assume that the minibatch is a single sequence and drop the superscript on y, z, and s. Next, we explain the variational factors for each of the variables.
For q(z), the "direct assignment" truncation used in Johnson & Willsky (2014) sets q(z) = 0 if, for any of t = 1 to T, we have z_t = k with k > K; here K is the truncation level. Since under this truncation the update to q(β) conditioned on the other factors is no longer conjugate, we use a point estimate for β: q(β) = δ_{β*}(β). We adopt the same point-estimate approach for the parameters of the sigmoid function; hence q(w) = δ_{w*}(w) and q(w_f) = δ_{w_f*}(w_f). With this truncation scheme, we can write the prior over π_k as Dir(α β̃). Here, β̃ = (β_1, …, β_K, β_rest) and β_rest = 1 − Σ_{k=1}^{K} β_k. We know from Hoffman et al. (2013) that, due to conjugacy, the optimal q(π_k) is of the form Dir(α̃_k), where α̃_k is the parameter of the variational distribution.
We assume that the prior over θ_k is in the exponential family with natural parameter η, and that it is a conjugate prior for the likelihood function p(y_t | θ_{z_t}). This implies that the optimal variational distribution q(θ_k) is also in the same family with some other natural parameter, denoted η̃_k. More formally, we have p(θ_k) ∝ exp{⟨η, t(θ_k)⟩} and q(θ_k) ∝ exp{⟨η̃_k, t(θ_k)⟩}, where t(·) is the sufficient statistic function of the family. For the variational updates, we need to take expectations with respect to each of the variational distributions. For the expectations with respect to q(z, s), a modification of the standard HMM forward-backward algorithm can be used, with forward and backward messages defined over the augmented state (z_t, s_t).
These messages can be computed in O(T(K² + K)) time. The augmented transition matrix needed for these forward-backward messages is a 2K × 2K matrix with the following block form: the rows corresponding to s_t = 0 are given by the K × K state-dependent transition matrix Π, and the rows corresponding to s_t = 1 are all identical and proportional to λ; schematically,

    P = [ Π ; 1 λ^T ],

where Π is a K × K transition matrix and 1 is an all-ones vector of size K. The matrix operation for computing a message requires O(K²) operations for the upper half of the matrix (Π) and O(K) for the lower half (1 λ^T), because all the rows of each block in the lower half are identical. Hence, the total number of operations for message passing is O(T(K² + K)). Furthermore, the total memory required to compute these operations is O(K² + K).
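The key computational point is that the lower half of the augmented matrix never needs to be materialized: its identical rows collapse into a single O(K) dot product. The sketch below illustrates this for the forward pass of the feature-independent model; the uniform initial distribution and the per-step normalization are simplifying choices of ours:

```python
import numpy as np

def forward_messages(pi, lam, rho, loglik):
    """Forward pass over the augmented state (z_t, s_t) in O(T(K^2 + K)).

    pi:     (K, K) state-dependent transition matrix
    lam:    (K,) state-independent distribution
    rho:    (K,) per-state segment-boundary probabilities
    loglik: (T, K) observation log-likelihoods
    """
    T, K = loglik.shape
    lik = np.exp(loglik - loglik.max(axis=1, keepdims=True))
    alpha = np.zeros((T, K, 2))  # alpha[t, k, s] ∝ p(z_t = k, s_t = s | y_{1:t})
    init = lik[0] / K            # uniform initial state, for simplicity
    alpha[0, :, 1] = init * rho
    alpha[0, :, 0] = init * (1.0 - rho)
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        # Upper block: state-dependent rows, O(K^2).
        pred = alpha[t - 1, :, 0] @ pi
        # Lower block: K identical rows proportional to lam, O(K).
        pred = pred + alpha[t - 1, :, 1].sum() * lam
        pred *= lik[t]
        alpha[t, :, 1] = pred * rho
        alpha[t, :, 0] = pred * (1.0 - rho)
        alpha[t] /= alpha[t].sum()  # normalize for numerical stability
    return alpha
```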
For updating the local factors, instead of π and λ in Eq. 3.2, we compute the forward-backward messages using the expected values exp{E_q[log π]} and exp{E_q[log λ]}. The expectations of the sufficient statistics with respect to q(z, s) are then computed from these messages.

Given these expected sufficient statistics and a scaling factor c (the ratio of the dataset size to the minibatch size), we can write the update equations for the parameters of the global variational factors q(π_k) and q(θ_k):

    α̃_k ← (1 − ε) α̃_k + ε (α β̃ + c E_q[t_k^trans(z, s)]),
    η̃_k ← (1 − ε) η̃_k + ε (η + c E_q[t_k^obs(y, z)]),

where t_k^trans counts the (non-boundary) transitions out of state k and t_k^obs collects the observation statistics for state k.
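This stochastic natural-gradient step for a conjugate global factor is a simple convex combination of the current variational parameter and a minibatch-scaled target. The sketch below assumes the natural parameters are stored as flat arrays:

```python
import numpy as np

def svi_global_step(eta_post, eta_prior, suff_stats, step_size, scale):
    """One SVI natural-gradient step on a conjugate global factor
    (Hoffman et al., 2013).

    eta_post:   current natural parameters of the variational factor
    eta_prior:  natural parameters of the prior
    suff_stats: expected sufficient statistics from the minibatch
    scale:      minibatch scaling factor (dataset size / minibatch size)
    """
    target = eta_prior + scale * suff_stats
    return (1.0 - step_size) * eta_post + step_size * target
```

With step size 1 the update lands exactly on the minibatch-scaled target; smaller steps interpolate, which is what makes the algorithm stochastic-gradient-stable.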
For the global factors β, w, and w_f, we use point estimates; hence, we only need the gradients of L with respect to β*, w*, and w_f*. For β* we follow the derivation in Johnson & Willsky (2014) and obtain a gradient to use in a truncated gradient step on β*. Note that we need to ensure that the entries of β* remain nonnegative and sum to at most one after each update. To estimate w*, we take gradient steps on L with respect to w*; we have a similar update for w_f*.
There are a few models similar to ours in terms of extending the iHMM to multiple timescales. The infinite hierarchical HMM (iHHMM), introduced in Heller et al. (2009), is a nonparametric model that allows the HHMM to have potentially unbounded depth, so the model can infer the number of levels in the hierarchy. In the iHHMM, the bottom level is the observed sequence, and each level is a sequence of hidden variables dependent on the level above. As the authors themselves suggest, more efficient inference algorithms are needed to make their model useful for practical applications.
The block-diagonal iHMM (Stepleton et al., 2009) is a generalization of the iHMM that assumes a nearly block-diagonal structure on the transition matrix. Each block corresponds to a "sub-behavior," and the model can partition the data sequences according to these sub-behaviors. The model first partitions the infinite number of hidden states into an infinite number of blocks by using an additional stick-breaking process. Then, it increases the probability of transition between the states of a block by modifying the Dirichlet process prior over the transitions. Hence, as the block size becomes smaller, the model's behavior converges to that of the iHMM. For inference, as the authors explain, achieving a fast mixing rate in their proposed algorithm requires a nontrivial, bookkeeping-intensive implementation. In contrast, our model is much simpler and inference for it is easier to implement, yet it can also discover transition matrices with approximately block-diagonal structure; the segmentation events provide a mechanism for transitioning from one group of connected states to another.
Another related model is a two-level Bayesian nonparametric HMM introduced in Johnson (2014) that models the high-level dynamics, or super-states, as an HDP-HSMM. As a generalization of the iHMM, the HDP-HSMM can model the dwell time in each state by sampling it from a state-specific duration distribution once a state is entered. For a formal definition of the HDP-HSMM, see Johnson (2014). Given each super-state, observations are generated according to an iHMM with parameters indexed by that super-state, with the sub-state at each time step evolving according to the corresponding iHMM. Compared to our model, this model is much more flexible; however, the computation of the forward-backward messages is less efficient, and the model requires more bookkeeping for the super-states and their sub-states. It also requires setting a truncation level for the super-states and for each of their corresponding iHMMs. We can set a large truncation level for all of them, but then we pay a large computational cost for iHMMs that require only a few states. In contrast, we only need to set a single truncation level for the whole model. We call this model the sub-iHMM and use it as a baseline, since it is a flexible Bayesian nonparametric model that supports two-level dynamics.
Finally, our model can also be applied to time-series segmentation tasks at a single level. The iHMM, or a variant of it with a self-transition bias, the sticky HDP-HMM (Fox et al., 2011), can be used for the same purpose. The self-transition bias encourages self-transitions, so that consecutive hidden states tend to belong to one state. The model can capture segments in tasks such as speaker diarization; however, in contrast to our model, there are no dynamics within a given state. In other words, in the sticky HDP-HMM, given a state, the observations are independent of each other. This makes the model inappropriate for tasks such as user trace segmentation, because there is typically an ordering over the sub-states within a segment. For instance, in a software application, an action like selection needs to come before an action like move selection. To show the importance of this point, we also include a self-transition bias term in the iHMM and compare our model with the sticky HDP-HMM.¹

¹ We do not include a separate column in Table 2 for the sticky HDP-HMM; instead, we find the best variational lower bound over all settings of the iHMM hyperparameters, including a hyperparameter for the self-transition bias.
We evaluate the performance of our feature-independent and feature-based models on synthetic and real datasets. We use a synthetic dataset to illustrate the advantages of our model compared to the baselines.
In order to further demonstrate the capabilities of our model, we apply it to three real tasks from three different domains: human-computer interaction, biology, and sensor data analysis. As mentioned in Section 4, two reasonable baselines for our model are the two-level Bayesian nonparametric HMM and the iHMM. We report the labeling error (or normalized Hamming distance in the case of the synthetic dataset) and predictive log-likelihood for our model and these baselines. To choose among different hyperparameter settings, we use the variational lower bound (VLB) as our objective measure. We show that our model, while being simpler and more efficient in terms of inference, is competitive with or outperforms these baselines. For all experiments on the siHMM, we try both the feature-independent and feature-based models; we report the results separately in Table 2 to show the effect of including observation features in the model. In the experiments, we only try the hidden state and the observation as the features for a given time step. However, more sophisticated features can be built from the observation(s).
We generate a synthetic dataset with 5000 data points from 3 different transition matrices, each with 3 hidden states. Each row of each transition matrix is sampled from a Dirichlet distribution modified with a self-transition bias of 1. The observations are sampled from normal distributions with separate non-conjugate priors on their mean and variance parameters. The goal is to find the points at which the regime changes and also to determine the dynamics within each segment of the sequence. At each time step, with probability 0.05, we switch the regime.
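A sketch of this generating process follows; the Gaussian means and the symmetric Dirichlet used for the rows are illustrative stand-ins for the priors described above, and a regime switch may redraw the same regime:

```python
import numpy as np

def make_synthetic(T=5000, n_regimes=3, K=3, p_switch=0.05, bias=1.0, seed=0):
    """Synthetic sequence in the spirit of Sec. 5.1: each regime is an
    HMM with its own transition matrix; the regime switches with
    probability p_switch at each step."""
    rng = np.random.default_rng(seed)
    # One K x K transition matrix per regime, with self-transition bias.
    trans = rng.dirichlet(np.ones(K), size=(n_regimes, K))
    for r in range(n_regimes):
        trans[r] += bias * np.eye(K)
        trans[r] /= trans[r].sum(axis=1, keepdims=True)
    means = rng.normal(0.0, 5.0, size=(n_regimes, K))
    y = np.empty(T)
    regime = np.empty(T, dtype=int)
    r, k = rng.integers(n_regimes), rng.integers(K)
    for t in range(T):
        regime[t] = r
        y[t] = rng.normal(means[r, k], 1.0)
        if rng.random() < p_switch:        # change-point: redraw the regime
            r = rng.integers(n_regimes)
            k = rng.integers(K)
        else:                              # low-level dynamics within regime
            k = rng.choice(K, p=trans[r, k])
    return y, regime

y, regime = make_synthetic(T=1000)
```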
Fig. 1 shows a sample sequence and the result of running 100 passes of SVI over the whole dataset for that sequence. To run SVI, we split the dataset into 20 sequences and use a minibatch of size 2. We randomly sample a sequence of length 750 to calculate the predictive log-likelihood. We report the error over the dataset for the hyperparameter setting with the highest VLB.
For the feature-independent model, we fix the hyperparameters a and b of the beta prior on the segmentation probabilities; for the feature-based model, we randomly initialize the feature weights and only use the hidden state as a feature. We do a grid search over combinations of values for the remaining hyperparameters. We place priors on the mean and variance of the observation distributions, with hyperparameters chosen by the same grid search. We run SVI with 10 different seeds for 100 iterations over all these combinations. To compute the normalized Hamming distance between the inferred states and the true states, we use the Munkres algorithm (Munkres, 1957). The algorithm assigns indices to the inferred sequence so as to maximize the overlap with the true sequence. Table 2 gives the computed distance over the dataset for the hyperparameter setting with the highest VLB. We also report the predictive log-likelihood over the held-out set.
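The matching step can be implemented with the Hungarian (Munkres) algorithm, e.g., via SciPy's `linear_sum_assignment`; this is a sketch of the metric, not the authors' exact evaluation code:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def normalized_hamming(true_z, pred_z):
    """Relabel pred_z to maximize overlap with true_z (Munkres /
    Hungarian assignment), then return the fraction of mismatches."""
    n_labels = int(max(true_z.max(), pred_z.max())) + 1
    # cost[i, j] = -(number of timesteps where pred == i and true == j)
    cost = np.zeros((n_labels, n_labels))
    for p, t in zip(pred_z, true_z):
        cost[p, t] -= 1
    rows, cols = linear_sum_assignment(cost)   # minimizes total cost
    mapping = dict(zip(rows, cols))
    remapped = np.array([mapping[p] for p in pred_z])
    return float(np.mean(remapped != true_z))
```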
In addition to the above hyperparameters, the sub-iHMM also requires truncation levels for the super-states and the sub-states; we set the super-state truncation level to 10. For all hyperparameters shared with the siHMM, we use the same set of settings for the baselines. For the self-transition bias of the sticky HDP-HMM, we try a range of values.
Fig. 4 shows the histogram of the normalized Hamming distance and the predictive log-likelihood for runs with different hyperparameter settings. In terms of the Hamming distance, the siHMM performs slightly better than the iHMM and outperforms the sub-iHMM for most settings. The same conclusion holds for the predictive log-likelihood. Furthermore, Fig. 1 shows that our model can do reasonably well in finding the regime change points and also the states within each segment of the sequence. We choose a threshold of 0.5 on the posterior segmentation probability to identify a time point as a change point; however, as shown in the last row of Fig. 1, the model also provides an estimate of the uncertainty over change points.
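Extracting change points from the inferred posterior boundary probabilities is then a one-liner:

```python
import numpy as np

def change_points(boundary_prob, threshold=0.5):
    """Time indices whose posterior segment-boundary probability
    exceeds the threshold (0.5 in our experiments)."""
    return np.flatnonzero(np.asarray(boundary_prob) > threshold)

cps = change_points([0.1, 0.7, 0.2, 0.9])
```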
Table 2 shows that, for this dataset generated from block-diagonal transition matrices, the siHMM outperforms both the iHMM and the sub-iHMM. This might be because the synthetic dataset is specifically generated without any state-dependent high-level dynamics. Our model, which does not assume any dynamics for the segments, performs better than the baselines, which implicitly or explicitly assume such dynamics.
Table 1. Dataset statistics.

Dataset      # Points   Held-out      Max. # States
Synthetic    5,000      750 (15%)     -
Users        14,000     1,400 (10%)   -
Sensors      12,000     -             -
Drosophila   10,000     1,500 (15%)   -
Table 2. Normalized Hamming distance / error % and predictive log-likelihood.

Data Set     iHMM   sub-iHMM   siHMM (WoF)   siHMM (WF)
Synthetic    -      -          -             -
Users        -      -          -             -
Sensors      -      -          -             -
Drosophila   -      -          -             -
Modeling user behavior traces is of significant importance in the human-computer interaction literature (e.g., Adar et al., 2014; Bigham et al., 2014). A good understanding of the tasks performed by users can potentially help in designing better workflows in software applications. It can also help in providing better guidance to users by predicting user intentions. Log files of software applications contain user actions and their corresponding timestamps; however, it is not clear from these log files alone how many different tasks a user performed in a single work session. A task consists of multiple actions, and each work session consists of multiple tasks. Our two-level hierarchical model can be used to detect the boundaries between tasks (i.e., segments). We believe that our model is a good fit for software applications in which the tasks are less predictable than the actions within each task. For instance, in a photo editing application, high-level tasks such as adding filters or removing part of a picture usually do not have a clear order, and different users, based on their needs, apply different orderings. However, the actions required to add a filter (e.g., (1) choose a filter, (2) select a part of the photo, and (3) apply the filter) are typically more structured and follow an order.
We collect log files of users who follow 23 different tutorials in a photo editing application. The dataset is in the form of (timestamp, action) pairs; it contains 14,000 data points and 59 unique actions in total. We randomly choose a sequence of length 1400 to form a held-out set. We split the remaining sequences into subsequences of length 1000 and apply 100 passes of SVI with both the feature-based and feature-independent models.
We consider a grid of hyperparameter settings, including values for the parameter of the Dirichlet prior that we place on the parameters of the observation likelihood. We run SVI with 10 different seeds for each setting and report the result for the setting with the highest VLB in Table 2.
The labels (i.e., the tutorial numbers that the users followed) for each segment of the dataset are available; hence, we can test the siHMM on predicting the label of each segment. This task is more involved than segmentation, as we also need to group the sub-states. We use simple K-means clustering on the empirical transition matrix generated by counting the transitions in each inferred segment. This approach works well in practice; however, more sophisticated methods for grouping the sub-states are possible. Table 2 provides the prediction error (computed using the Munkres algorithm) for the siHMM and the baselines. The performance of our feature-independent model is significantly better than that of the iHMM and comparable with that of the more flexible (and more computationally intensive) sub-iHMM. Adding the observation feature to the model reduces the error to 16%. As mentioned in Section 2.2, in this dataset there are observations that can signal a change point in the dynamics.

Automating scientific experiments on live animals has attracted significant attention recently (see, for instance, Kain et al., 2013; Wiltschko et al., 2015; Crall et al., 2015; Freeman et al., 2014). With the advent of high-throughput and more accurate devices, the need for automatic analysis of large amounts of collected data is greater than ever. In neuroscience and biology, large amounts of behavioral data are collected from live animals in order to understand how the brain generates activity and how the underlying mechanisms have evolved (Kain et al., 2013). Typically, the first step in analyzing these data is finding and categorizing different types of behavior; this step can be done manually by experts, but it is time-consuming and sometimes error-prone.
An automatic framework has been proposed in Kain et al. (2013) for tracking leg movements and classifying the behavior of fruit flies. The behavior is recorded by tracking each leg of a fruit fly moving on a track ball. The collected raw data are the x- and y-coordinates of the 6 legs and the three rotational components of the rotating ball (i.e., a 15-dimensional vector in real time). After some post-processing and the addition of higher-order features (e.g., derivatives of each of the 15 raw data channels), they expand the dimensionality to 45 and apply a k-NN classifier to assign each frame to one of 12 possible behavioral labels. Our goal is to use this dataset and categorize the frames in an unsupervised way. A frequent assumption in the behavioral sciences is that a small set of stereotyped motifs describes most animal activities (Berman et al., 2014). In other words, actions within a behavioral segment (e.g., the actions required for grooming) should be more structured than the behavioral segments themselves. Given the capabilities of the siHMM, it is a reasonable choice for this dataset.

The dataset contains 10,000 data points; our held-out set is a randomly chosen subsequence of length 1500, and we apply SVI for 100 passes with both the feature-independent and feature-based models. For the observations, we use a multivariate Gaussian likelihood and a conjugate normal/inverse-Wishart prior. We use the empirical mean and variance of the dataset as the mean and variance of the Gaussian prior. The parameters of the inverse-Wishart are chosen from a grid of candidate values. We set the truncation level for the states to 20. Finally, we search over a set of values for the transition matrix prior.
As in Section 5.2, we group the inferred sub-states with K-means and assign labels to the segments. The results, presented in Table 2, show that the feature-based siHMM performs on a par with the iHMM and outperforms the sub-iHMM by a relative error reduction of 17%. This may emphasize the importance of adding data-driven features to the model. Figure 5 shows a sample of the dataset and its segmentation by the different methods.
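The segment-grouping step used here and in Section 5.2 can be sketched as follows; the tiny Lloyd's-algorithm K-means and the row-normalization choice are our own illustrative assumptions:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal K-means (Lloyd's algorithm) for grouping segments."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return assign

def label_segments(z, boundaries, K, n_tasks):
    """Represent each inferred segment by its flattened empirical
    transition matrix (from sub-state transition counts), then cluster
    the segments into n_tasks groups."""
    edges = list(boundaries) + [len(z)]
    feats = []
    for a, b in zip(edges[:-1], edges[1:]):
        counts = np.zeros((K, K))
        for t in range(a, b - 1):
            counts[z[t], z[t + 1]] += 1
        total = counts.sum()
        feats.append((counts / total if total else counts).ravel())
    return kmeans(np.array(feats), n_tasks)
```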
Through the emergence of pervasive computing and affordable wearable sensors, in-situ measurement of different biosignals has become possible. This powerful source of data can be utilized for several purposes, including activity recognition and task identification. Toward this goal, an efficient algorithm for analyzing this large amount of data, which is gathered 24/7, is essential. In this section, we use the siHMM to model data collected via the Empatica E4 wristband (Empatica, 2015), a wearable device that can collect electrodermal activity (EDA) (Boucsein, 2012), blood volume pulse (BVP), acceleration, and body temperature. EDA refers to changes in the electrical properties of the skin caused by sudomotor innervation (Boucsein, 2012). EDA is an indication of physiological or psychological arousal and has been utilized to objectively measure affective phenomena and sleep quality (Sano et al., 2015).
Segmenting the sensor data can help psychophysiological activity recognition. For instance, it can help in finding stressful periods objectively in order to detect the roots of stress in a person's lifestyle. However, manually labeling large amounts of user data (days or months) is time-consuming, and labels may be unreliable if not reported promptly.
We use a dataset with 12,000 time steps, collected from a single user, and model the (normalized) observations (i.e., EDA, BVP, and acceleration in 3 dimensions) with a multivariate Gaussian distribution. The hyperparameter setting is similar to that of Section 5.3.

The labeling error in Table 2 shows that the feature-independent siHMM, while performing comparably to the sub-iHMM, outperforms the iHMM by a relative error reduction of 50%. Fig. 5.4 shows the inferred segments for the siHMM and the baselines. It seems that the single observation feature we use does not help on this dataset; however, more sophisticated features may help improve the segmentation.
We proposed a new Bayesian nonparametric model, the siHMM, for modeling dynamics at two timescales in time series. Our model is a simple extension of the widely used iHMM and has an efficient inference scheme. Although our model is less flexible than other nonparametric models for hierarchical time series, we showed that it can perform reasonably well in practice. One potential application of our model is using the inferred state-independent transition vector λ to summarize a sequence. For instance, in user behavior analysis, this vector may represent a user fingerprint, and users can be grouped based on it. For a better understanding of this feature and of the behavior of our model in other applications, a more comprehensive comparison with other models would be useful.