# An Infinite Hidden Markov Model With Similarity-Biased Transitions

We describe a generalization of the Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM) which is able to encode prior information that state transitions are more likely between "nearby" states. This is accomplished by defining a similarity function on the state space and scaling transition probabilities by pair-wise similarities, thereby inducing correlations among the transition distributions. We present an augmented data representation of the model as a Markov Jump Process in which: (1) some jump attempts fail, and (2) the probability of success is proportional to the similarity between the source and destination states. This augmentation restores conditional conjugacy and admits a simple Gibbs sampler. We evaluate the model and inference method on a speaker diarization task and a "harmonic parsing" task using four-part chorale data, as well as on several synthetic datasets, achieving favorable comparisons to existing models.

## Authors

• 1 publication
• 4 publications
• 10 publications
05/03/2015

### A Linear-Time Particle Gibbs Sampler for Infinite Hidden Markov Models

Infinite Hidden Markov Models (iHMM's) are an attractive, nonparametric ...
04/06/2020

### Disentangled sticky hierarchical Dirichlet process hidden Markov model

The Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM) has bee...
06/27/2012

### Gene Expression Time Course Clustering with Countably Infinite Hidden Markov Models

Most existing approaches to clustering gene expression time course data ...
02/20/2016

### The Segmented iHMM: A Simple, Efficient Hierarchical Infinite HMM

We propose the segmented iHMM (siHMM), a hierarchical infinite hidden Ma...
06/02/2011

### Restricted Collapsed Draw: Accurate Sampling for Hierarchical Chinese Restaurant Process Hidden Markov Models

We propose a restricted collapsed draw (RCD) sampler, a general Markov c...
11/06/2007

### Infinite Viterbi alignments in the two state hidden Markov models

Since the early days of digital communication, Hidden Markov Models (HMM...
11/11/2017

### Parkinson's Disease Digital Biomarker Discovery with Optimized Transitions and Inferred Markov Emissions

We search for digital biomarkers from Parkinson's Disease by observing a...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction and Background

The hierarchical Dirichlet process hidden Markov model (HDP-HMM) (Beal et al., 2001; Teh et al., 2006) is a Bayesian model for time series data that generalizes the conventional hidden Markov Model to allow a countably infinite state space. The hierarchical structure ensures that, despite the infinite state space, a common set of destination states will be reachable with positive probability from each source state. The HDP-HMM can be characterized by the following generative process.

Each state, indexed by , has parameters, , drawn from a base measure, . A top-level sequence of state weights, , is drawn by iteratively breaking a “stick” off of the remaining weight according to a distribution. The parameter is known as the concentration parameter and governs how quickly the weights tend to decay, with large corresponding to slow decay, and hence more weights needed before a given cumulative weight is reached. This stick-breaking process is denoted by (Ewens, 1990; Sethuraman, 1994) for Griffiths, Engen and McCloskey. We thus have a discrete probability measure, , with weights at locations , , defined by

 θji.i.d.∼Hβ∼GEM(γ). (1)

drawn in this way is a Dirichlet Process (DP) random measure with concentration and base measure .

The actual transition distribution, , from state , is drawn from another DP with concentration and base measure :

 πji.i.d.∼DP(αG0)j=0,1,2,… (2)

where represents the initial distribution. The hidden state sequence, is then generated according to , and

 zt|zt−1,πzt−1∼Cat(πzt−1)t=1,2,…,T (3)

Finally, the emission distribution for state is a function of , so that observation is drawn according to

 yt|zt,θzt∼F(θzt) (4)

A shortcoming of the HDP prior on the transition matrix is that it does not use the fact that the source and destination states are the same set: that is, each has a special element which corresponds to a self-transition. In the HDP-HMM, however, self-transitions are no more likely a priori than transitions to any other state. The Sticky HDP-HMM (Fox et al., 2008) addresses this issue by adding an extra mass at location to the base measure of the DP that generates . That is, (2) is replaced by

 πj∼DP(αG0+κδθj). (5)

An alternative approach that treats self-transitions as special is the HDP Hidden Semi-Markov Model (HDP-HSMM; Johnson & Willsky (2013)

), wherein state duration distributions are modeled separately, and ordinary self-transitions are ruled out. However, while both of these models have the ability to privilege self-transitions, they contain no notion of similarity for pairs of states that are not identical: in both cases, when the transition matrix is integrated out, the prior probability of transitioning to state

depends only on the top-level stick weight associated with state , and not on the identity or parameters of the previous state .

The two main contributions of this paper are (1) a generalization of the HDP-HMM, which we call the HDP-HMM with local transitions (HDP-HMM-LT) that allows for a geometric structure to be defined on the latent state space, so that “nearby” states are a priori more likely to have transitions between them, and (2) a simple Gibbs sampling algorithm for this model. The “LT” property is introduced by elementwise rescaling and then renormalizing of the HDP transition matrix. Two versions of the similarity structure are illustrated: in one case, two states are similar to the extent that their emission distributions are similar. In another, the similarity structure is inferred separately. In both cases, we give augmented data representations that restore conditional conjugacy and thus allow a simple Gibbs sampling algorithm to be used for inference.

A rescaling and renormalization approach similar to the one used in the HDP-HMM-LT is used by Paisley et al. (2012) to define their Discrete Infinite Logistic Normal (DILN) model, an instance of a correlated random measure (Ranganath & Blei, 2016), in the setting of topic modeling. There, however, the contexts and the mixture components (topics) are distinct sets, and there is no notion of temporal dependence. Zhu et al. (2016) developed an HMM based directly on the DILN model111We thank an anonymous ICML reviewer for bringing this paper to our attention.. Both Paisley et al. and Zhu et al. employ variational approximations, whereas we present a Gibbs sampler, which converges asymptotically to the true posterior. We discuss additional differences between our model and the DILN-HMM in Sec. 2.2.

One class of application in which it is useful to incorporate a notion of locality occurs when the latent state sequence consists of several parallel chains, so that the global state changes incrementally, but where these increments are not independent across chains. Factorial HMMs (Ghahramani et al., 1997) are commonly used in this setting, but this ignores dependence among chains, and hence may do poorly when some combinations of states are much more probable than suggested by the chain-wise dynamics.

Another setting where the LT property is useful is when there is a notion of state geometry that licenses syllogisms: e.g., if A frequently leads to B and C and B frequently leads to D and E, then it may be sensible to infer that A and C may lead to D and E as well. This property is arguably present in musical harmony, where consecutive chords are often (near-)neighbors in the “circle of fifths”, and small steps along the circle are more common than large ones.

The paper is structured as follows: In section 2 we define the model. In section 3, we develop a Gibbs sampling algorithm based on an augmented data representation, which we call the Markov Jump Process with Failed Transitions (MJP-FT). In section 4 we test two versions of the model: one on a speaker diarization task in which the speakers are inter-dependent, and another on a four-part chorale corpus, demonstrating performance improvements over state-of-the-art models when “local transitions” are more common in the data. Using sythetic data from an HDP-HMM, we show that the LT variant can learn not to use its similarity bias when the data does not support it. Finally, in section 5, we conclude and discuss the relationships between the HDP-HMM-LT and existing HMM variants. Code and additional details are available at http://colindawson.net/hdp-hmm-lt/

## 2 An HDP-HMM With Local Transitions

We wish to add to the transition model the concept of a transition to a “nearby” state, where transitions between states and are more likely a priori to the extent that they are “nearby” in some similarity space. In order to accomplish this, we first consider an alternative construction of the transition distributions, based on the Normalized Gamma Process representation of the DP (Ishwaran & Zarepour, 2002; Ferguson, 1973).

### 2.1 A Normalized Gamma Process representation of the HDP-HMM

The Dirichlet Process is an instance of a normalized completely random measure (Kingman, 1967; Ferguson, 1973), that can be defined as , where

 πkind.∼Gamma(αβk,1)T=∞∑k=1πk~πk=πkT, (6)

is a measure assigning 1 to sets if they contain and 0 otherwise, and subject to the constraint that and . It has been shown (Ferguson, 1973; Paisley et al., 2012; Favaro et al., 2013) that the normalization constant is positive and finite almost surely, and that is distributed as a DP with base measure . If we draw from the stick-breaking process, draw an i.i.d. sequence of from a base measure , and then draw an i.i.d. sequence of random measures, , from the above process, this defines a Hierarchical Dirichlet Process (HDP). If each is associated with the hidden states of an HMM, is the infinite matrix where entry is the th mass associated with the th random measure, and is the sum of row , then we obtain the prior for the HDP-HMM, where

 p(zt|zt−1,π)=~πzt−1zt=πjj′/Tj (7)

### 2.2 Promoting “Local” Transitions

In the HDP prior, the rows of the transition matrix are conditionally independent. We wish to relax this assumption, to incorporate possible prior knowledge that certain pairs of states are “nearby” in some sense and thus more likely than others to produce large transition weights between them (in both directions); that is, transitions are likely to be “local”. We accomplish this by associating each latent state with a location in some space , introducing a “similarity function” , and scaling each element by . For example, we might wish to define a (possibly asymmetric) divergence function and set so that transitions are less likely the farther apart two states are. By setting , we obtain the standard HDP-HMM. The DILN-HMM (Zhu et al., 2016), employs a similar rescaling of transition probabilities via an exponentiated Gaussian Process, following (Paisley et al., 2012), but the scaling function must be positive semi-definite, and in particular symmetric, whereas in the HDP-HMM-LT, need only take values in . Moreover, the DILN-HMM does not allow the scales to be tied to other state parameters, and hence encode an independent notion of similarity.

Letting , we can replace (6) for by

 πjj′|β,ℓ∼Gamma(αβj′,1),Tj=∞∑j′=1πjj′ϕjj′~πjj′=πjj′ϕjj′/Tj,p(zt|zt−1,π,ℓ)=~πzt−1zt. (8)

Since the are positive and bounded above by 1,

 0<πj1ϕj1≤Tj≤∑j′πjj′<∞ (9)

almost surely, where the last inequality carries over from the original HDP. The prior means of the unnormalized transition distributions, are then proportional (for each ) to where .

The distribution of the latent state sequence given and is now

 p(z|π,ℓ)=T∏t=1πzt−1ztϕzt−1ztT−nzt−1⋅zt−1=∞∏j=1T−1j∞∏j′=1πnjj′jj′ϕnjj′jj′ (10)

where is the number of transitions from state to state in the sequence and is the total number of visits to state . Since is a sum over products of and terms, the posterior for is no longer a DP. However, conditional conjugacy can be restored by a data-augmentation process with a natural interpretation, which is described next.

### 2.3 The HDP-HMM-LT as the Marginalization of a Markov Jump Process with “Failed” Transitions

In this section, we define a stochastic process that we call the Markov Jump Process with Failed Transitions (MJP-FT), from which we obtain the HDP-HMM-LT by marginalizing over some of the variables. By reinstating these auxiliary variables, we obtain a simple Gibbs sampling algorithm over the full MJP-FT, which can be used to sample from the marginal posterior of the variables used by the HDP-HMM-LT.

Let , , and be defined as in the last section. Consider a continuous-time Markov Process over the states , and suppose that if the process makes a jump to state at time , the next jump, which is to state , occurs at time , where , and , independent of . Note that in this formulation, unlike in standard formulations of Markov Jump Processes, we are assuming that self-jumps are possible.

If we only observe the jump sequence and not the holding times

, this is an ordinary Markov chain with transition matrix row-proportional to

. If we do not observe the jumps directly, but instead an observation is generated once per jump from a distribution that depends on the state being jumped to, then we have an ordinary HMM whose transition matrix is obtained by normalizing ; that is, we have the HDP-HMM.

We modify this process as follows. Suppose each jump attempt from state to state has probability of failing, in which case no transition occurs and no observation is generated. Assuming independent failures, the rates of successful and failed jumps from to are and , respectively. The probability that the first successful jump is to state (that is, that ) is proportional to the rate of successful jump attempts to , which is . Conditioned on , the holding time, , is independent of and is distributed as . We denote the total time spent in state by , where, as the sum of i.i.d. Exponentials,

 uj|z,π,θind.∼Gamma(nj⋅,Tj) (11)

During this period there will be failed attempts to jump to state , where

are independent. This data augmentation bears some conceptual similarity to the Geometrically distributed

auxiliary variables introduced to the HDP-HSMM (Johnson & Willsky, 2013) to restore conditional conjugacy. However, there are key differences: first, measure how many steps the chain would have remained in state j under Markovian dynamics, whereas our represents putative continuous holding times between each transition, and second allows for the restoration of a zeroed out entry in each row, whereas allows us to work with unnormalized entries, avoiding the need to restore zeroed out entries in the HSMM-LT

Incorporating and as augmented data simplifies the likelihood for , yielding

 p(z, u,Q|π)=p(z|π)p(u|z,π)p(Q|u,π) (12)

where dependence on has been omitted for conciseness. After grouping terms and omitting terms that do not depend on , this proportional (as a function of ) to

 ∏j∏j′πnjj′+qjj′jj′ϕnjj′jj′(1−ϕjj′)qjj′e−πjj′uj (13)

Conveniently, the have canceled, and the exponential terms involving and

in the Gamma and Poisson distributions of

and combine to cause to vanish.

Additional details and derivations for this data augmentation are in Appendix A.

### 2.4 Sticky and Semi-Markov Generalizations

We note that the local transition property of the HDP-HMM-LT can be combined with the Sticky property of the Sticky HDP-HMM (Fox et al., 2008), or the non-geometric duration distributions of the HDP-HSMM (Johnson & Willsky, 2013), to add additional prior weight on self-transitions. In the former case, no changes to inference are needed; one can simply add the the extra mass to the shape parameter of the Gamma prior on the , and employ the same auxiliary variable method used by Fox et al. to distinguish “Sticky” from “regular” self-transitions. For the semi-Markov case, we can fix the diagonal elements of to zero, and allow observations to be emitted according to a state-specific duration distribution, and sample the latent state sequence using a suitable semi-Markov message passing algorithm (Johnson & Willsky, 2013). Inference for the matrix is not affected, since the diagonal elements are assumed to be 1. Unlike in the original representation of the HDP-HSMM, no further data-augmentation is needed, as the (continuous) durations already account for the normalization of the .

### 2.5 Obtaining the Factorial HMM as a Limiting Case

One setting in which a local transition property is desirable is the case where the latent states encode multiple hidden features at time

as a vector of categories. Such problems are often modeled using factorial HMMs

(Ghahramani et al., 1997). In fact, the HDP-HMM-LT yields the factorial HMM in the limit as , fixing each row of to be uniform with probability 1, so the dynamics are controlled entirely by . If is the transition matrix for chain , then setting with asymmetric “divergences” yields the factorial transition model.

### 2.6 An Infinite Factorial HDP-HMM-LT

Nonparametric extensions of the factorial HMM, such as the infinite factorial hidden Markov Model (Gael et al., 2009) and the infinite factorial dynamic model (Valera et al., 2015), have been developed in recent years by making use of the Indian Buffet Process (Ghahramani & Griffiths, 2005) as a state prior. It would be conceptually straightforward to combine the IBP state prior with the similarity bias of the LT model, provided the chosen similarity function is uniformly bounded above on the space of infinite length binary vectors (for example, take to be the exponentiated negative Hamming distance between and ). Since the number of differences between two draws from the IBP is finite with probability 1, this yields a reasonable similarity metric.

## 3 Inference

We develop a Gibbs sampling algorithm based on the MJP-FT representation described in Sec. A, augmenting the data with the duration variables , the failed jump attempt count matrix, , as well as additional auxiliary variables which we will define below. In this representation the transition matrix is not represented directly, but is a deterministic function of the unscaled transition “rate” matrix, , and the similarity matrix, . The full set of variables is partitioned into blocks: , , , and , where represents a set of auxiliary variables that will be introduced below, represents the emission parameters (which may be further blocked depending on the specific choice of model), and represents additional parameters such as any free parameters of the similarity function,

, and any hyperparameters of the emission distribution.

### 3.1 Sampling Transition Parameters and Hyperparameters

The joint posterior over , , and given the augmented data will factor as

 p(γ,α,β,π|D)=p(γ|D)p(α|D)p(β|γ,D)p(π|α,β,D) (14)

We describe these four factors in reverse order. For additional details, see Appendix B.

##### Sampling π

Having used data augmentation to simplify the likelihood for to the factored conjugate form in (40), the individual are a posteriori independent distributed.

##### Sampling β

To enable joint sampling of , we employ a weak limit approximation to the HDP (Johnson & Willsky, 2013), approximating the stick-breaking process for using a finite Dirichlet distribution with a components, where is larger than we expect to need. Due to the product-of-Gammas form, we can integrate out analytically to obtain the marginal likelihood:

 p(β|γ)=Γ(γ/J)JΓ(γ)∏jβγJ−1j (15) p(D|β,α)∝J∏j=1(1+uj)−α∏j′Γ(αβj′+njj′+qjj′)Γ(αβj′)

where we have used the fact that the sum to 1 to pull out terms of the form from the inner product in the likelihood. Following Teh et al. (2006), we can introduce auxiliary variables , with

 p(mjj′|βj′,α,D)ind∝snjj′+qjj′,mjj′αmjj′βmjj′j′ (16)

for integer ranging between and , where is an unsigned Stirling number of the first kind. The normalizing constant in this distribution cancels the ratio of Gamma functions in the likelihood, so, letting and , the posterior for (the truncated) is a Dirichlet whose th mass parameter is .

##### Sampling Concentration Parameters

Incorporating into , we can integrate out to obtain

 p(D|α,γ)∝αm⋅⋅e−∑j′′log(1+uj′′)αΓ(γ)Γ(γ+m⋅⋅)×∏jΓ(γJ+m⋅j)Γ(γJ) (17)

Assuming that and have Gamma priors with shape and rate parameters and , then

 α|D∼Gamma(aα+m⋅⋅,bα+∑jlog(1+uj)). (18)

To simplify the likelihood for , we can introduce a final set of auxiliary variables, , and with the following distributions:

 p(rj′=r|m⋅j′,γ)∝s(m⋅j′,r)(γJ)r (19) p(w|m⋅⋅γ)∝wγ−1(1−w)m⋅⋅−1 (20)

The normalizing constants are ratios of Gamma functions, which cancel those in (17), so that

 γ|D,r,w∼Gamma(aγ+r⋅,bγ−log(w)) (21)

### 3.2 Sampling z and the auxiliary variables

We sample the hidden state sequence, , jointly with the auxiliary variables, which consist of , , , and . The joint conditional distribution of these variables is defined directly by the generative model:

 p(D)=p(z)p(u|z)p(Q|u)p(M|z,Q)p(r|M)p(w|M)

Since we are conditioning on the transition matrix, we can sample the entire sequence jointly with the forward-backward algorithm, as in an ordinary HMM. Since we are sampling the labels jointly, this step requires computation per iteration, which is the bottleneck of the inference algorithm for reasonably large or (other updates are constant in or in ). Having done this, we can sample , , , and from their forward distributions. It is also possible to employ a variant on beam sampling (Van Gael et al., 2008) to speed up each iteration, at the cost of slower mixing, but we did not use this variant here.

### 3.3 Sampling state and emission parameters

Depending on the application, the locations may or may not depend on the emission parameters, . If not, sampling conditional on is unchanged from the HDP-HMM. There is no general-purpose method for sampling , or for sampling in the dependent case, due to the dependence on the form of and on the emission model, but specific instances are illustrated in the experiments below.

## 4 Experiments

The parameter space for the hidden states, the associated prior on , and the similarity function , is application-specific; we consider here two cases. The first is a speaker-diarization task, where each state consists of a finite -dimensional binary vector whose entries indicate which speakers are currently speaking. In this experiment, the state vectors both determine the pairwise similarities and partially determine the emission distributions via a linear-Gaussian model. In the second experiment, the data consists of Bach chorales, and the latent states can be thought of as harmonic contexts. There, the components of the states that govern similarities are modeled as independent of the emission distributions, which are categorical distributions over four-voice chords.

### 4.1 Cocktail Party

##### The Data

The data was constructed using audio signals collected from the PASCAL 1st Speech Separation Challenge. The underlying signal consisted of speaker channels recorded at each of time steps, with the resulting signal matrix, denoted by , mapped to microphone channels via a weight matrix, . The 16 speakers were grouped into 4 conversational groups of 4, where speakers within a conversation took turns speaking (see Fig. 2). In such a task, there are naively possible states (here, ). However, due to the conversational grouping, if at most one speaker in a conversation is speaking at any given time, the state space is constrained, with only states possible, where is the number of speakers in conversation (in this case , for a total of 625 possible states).

Each “turn” within a conversation consisted of a single sentence (average duration s) and turn orders within a conversation were randomly generated, with random pauses distributed as

inserted between sentences. Every time a speaker has a turn, the sentence is drawn randomly from the 500 sentences uttered by that speaker in the data. The conversations continued for 40s, and the signal was down-sampled to length 2000. The ’on’ portions of each speaker’s signal were normalized to have amplitudes with mean 1 and standard deviation 0.5. An additional column of 1s was added to the speaker signal matrix,

, representing background noise. The resulting signal matrix, denoted , was thus and the weight matrix, , was . Following Gael et al. (2009) and Valera et al. (2015), the weights were drawn independently from a distribution, and independent noise was added to each entry of the observation matrix.

##### The Model

The latent states, , are the -dimensional binary vectors whose th entry indicates whether or not speaker is speaking. The locations are identified with the binary vectors, . We use a Laplacian similarity function on Hamming distance, , so that . The emission model is linear-Gaussian as in the data, with weight matrix , and signal matrix whose row is , so that . For the experiments discussed here, we assume that is independent of , but this assumption is easily relaxed if appropriate.

For finite-length binary vector states, the set of possible states is finite, and so it may seem that a nonparametric model is unnecessary. However, if is reasonably large, likely most of the possible states are vanishingly unlikely (and the number of observations may well be less than ), and so we would like to encourage the selection of a sparse set of states. Moreover, there could be more than one state with the same emission parameters, but with different transition dynamics. Next we describe the additional inference steps needed for this version of the model.

##### Sampling θ / ℓ

Since and are identified, influencing both the transition matrix and the emission distributions, both the state sequence and the observation matrix are used in the update. We put independent Beta-Bernoulli priors on each coordinate of , and Gibbs sample each coordinate conditioned on all the others and the coordinate-wise prior means, , which we sample in turn conditioned on . Details are in Appendix C.

##### Sampling λ

The parameter of the similarity function governs the connection between and . Substituting the definition of into (40) yields

 p(z,Q|ℓ,λ)∝∏j∏j′e−λdjj′njj′(1−e−λdjj′)qjj′ (22)

We put an prior on , which yields a posterior density

 p(λ|z,Q,ℓ) ∝e−(bλ+∑j∑j′djj′njj′)λ (23) ×∏j∏j′(1−e−λdjj′)qjj′

This density is log-concave, and so we use Adaptive Rejection Sampling (Gilks & Wild, 1992) to sample from it.

##### Sampling W and Σ

Conditioned on and , and

can be sampled as in Bayesian linear regression. If each column of

has a multivariate Normal prior, then the columns are a posteriori independent multivariate Normals. For the experiments reported here, we fix to its ground truth value so that can be compared directly with the ground truth signal matrix, and we constrain

to be diagonal, with Inverse Gamma priors on the variances, resulting in conjugate updates.

##### Results

We attempted to infer the binary speaker matrices using five models: (1) a binary-state Factorial HMM (Ghahramani et al., 1997), where individual binary speaker sequences are modeled as independent, (2) an ordinary HDP-HMM without local transitions (Teh et al., 2006), where the latent states are binary vectors, (3) a Sticky HDP-HMM (Fox et al., 2008), (4) our HDP-HMM-LT model, and (5) a model that combines the Sticky and LT properties333We attempted to add a comparison to the DILN-HMM (Zhu et al., 2016) as well, but code could not be obtained, and the paper did not provide enough detail to reproduce their inference algorithm.. For all models, all concentration and noise precision parameters are given priors. For the Sticky models, the ratio is given a prior. We evaluated the models at each iteration using both the Hamming distance between inferred and ground truth state matrices and F1 score. We also plot the inferred decay rate , and the number of states used by the LT and Sticky-LT models. The results for the five models are in Fig. 1. In Fig. 2, we plot the ground truth state matrix against the average state matrix, , averaged over runs and post-burn-in iterations.

The LT and Sticky-LT models outperform the others, while the regular Sticky model exhibits only a small advantage over the vanilla HDP-HMM. Both converge on a non-negligible value of about 1.6 (see Fig. 1), suggesting that the local transition structure explains the data well. The LT models also use more states than the non-LT models, perhaps owing to the fact that the weaker transition prior of the non-LT model is more likely to explain nearby similar observations as a single persisting state, whereas the LT model places a higher probability on transitioning to a new state with a similar latent vector.

### 4.2 Synthetic Data Without Local Transitions

We generated data directly from the ordinary HDP-HMM used in the cocktail experiment as a sanity check, to examine the performance of the LT model in the absence of a similarity bias. The results are in Fig. 3. When the parameter is large, the LT model has worse performance than the non-LT model on this data; however, the parameter settles near zero as the model learns that local transitions are not more probable. When , the HDP-HMM-LT is an ordinary HDP-HMM. The LT model does not make entirely the same inferences as the non-LT model, however; in particular, the concentration parameter is larger. To some extent, and trade off: sparsity of the transition matrix can be achieved either by beginning with a sparse rate matrix prior to rescaling ( small), or by beginning with a less sparse rate matrix which becomes sparser through rescaling (larger and non-zero ).

### 4.3 Bach Chorales

To test a version of the HDP-HMM-LT model in which the components of the latent state governing similarity are unrelated to the emission distributions, we used our model to do unsupervised “grammar” learning from a corpus of Bach chorales. The data was a corpus of 217 four-voice major key chorales by J.S. Bach from music21, 200 of which were randomly selected as a training set, with the other 17 used as a test set to evaluate surprisal (marginal log likelihood per observation) by the trained models. All chorales were transposed to C-major, and each distinct four-voice chord (with voices ordered) was encoded as a single integer. In total there were 3307 distinct chord types and 20401 chord tokens in the 217 chorales, with 3165 types and 18818 tokens in the 200 training chorales, and 143 chord types that were unique to the test set.

##### Modifications to Model and Inference

Since the chords were encoded as integers, the emission distribution for each state is . We use a symmetric Dirichlet prior for each , resulting in conjugate updates to conditioned on the latent state sequence, .

In this experiment, the locations, , are independent of the , with priors. We use a Gaussian similarity function, where is Euclidean distance. Since the latent states are continuous, we use a Hamiltonian Monte Carlo (HMC) update (Duane et al., 1987; Neal et al., 2011) to update the simultaneously, conditioned on and (see Appendix D for details).

##### Results

We ran 5 Gibbs chains for 10,000 iterations each using the HDP-HMM-LT, Sticky-HDP-HMM-LT, HDP-HMM and Sticky-HDP-HMM models on the 200 training chorales, which were modeled as conditionally independent of one another. We evaluated the marginal log likelihood on the 17 test chorales (integrating out ) at every 50th iteration. The training and test log likelihoods are in Fig. 4. Although the LT model does not achieve as close a fit to the training data, its generalization performance is better, suggesting that the vanilla HDP-HMM is overfitting. This is perhaps counterintuitive, since the LT model is more flexible, and might be expected to be more prone to overfitting. However, the similarity bias induces greater information sharing across parameters, as in a hierarchical model: instead of each entry of the transition matrix being informed mainly by transitions directly involving the corresponding states, it is informed to some extent by all transitions, as they all inform the similarity structure.

## 5 Discussion

We have defined a new probabilistic model, the Hierarchical Dirichlet Process Hidden Markov Model with Local Transitions (HDP-HMM-LT), which generalizes the HDP-HMM by allowing state space geometry to be represented via a similarity kernel, making transitions between “nearby” pairs of states (“local” transitions), more likely a priori. By introducing an augmented data representation, which we call the Markov Jump Process with Failed Transitions (MJP-FT), we obtain a Gibbs sampling algorithm that simplifies inference in both the LT and ordinary HDP-HMM. When multiple latent chains are interdependent, as in speaker diarization, the HDP-HMM-LT model combines the HDP-HMM’s capacity to discover a small set of joint states with the Factorial HMM’s ability to encode the property that most transitions involve a small number of chains. The HDP-HMM-LT outperforms both, as well as outperforming the Sticky-HDP-HMM, on a speaker diarization task in which speakers form conversational groups. Despite the addition of the similarity kernel, the HDP-HMM-LT is able to suppress its local transition prior when the data does not support it, achieving identical performance to the HDP-HMM on data generated directly from the latter.

The local transition property is particularly clear when transitions occur at different times for different latent features, as with binary vector-valued states in the cocktail party setting, but the model can be used with any state space equipped with a suitable similarity kernel. Similarities need not be defined in terms of emission parameters; state “locations” can be represented and inferred separately, which we demonstrate using Bach chorale data. There, the LT model achieves better predictive performance on a held-out test set, while the ordinary HDP-HMM overfits the training set: the LT property here acts to encourage a concise harmonic representation where chord contexts are arranged in bidirectional functional relationships.

We focused on fixed-dimension binary vectors for the cocktail party and synthetic data experiments, but it would be straightforward to add the LT property to a model with nonparametric latent states, such as the iFHMM (Gael et al., 2009) and the infinite factorial dynamic model (Valera et al., 2015), both of which use the Indian Buffet Process (IBP) (Ghahramani & Griffiths, 2005) as a state prior. The similarity function used here could be employed without changes: since only finitely many coordinates are non-zero in the IBP, the distance between any two states is finite.

#### Acknowledgments

This work was funded in part by DARPA grant W911NF-14-1-0395 under the Big Mechanism Program and DARPA grant W911NF-16-1-0567 under the Communicating with Computers Program.

## References

• Beal et al. (2001) Beal, Matthew J, Ghahramani, Zoubin, and Rasmussen, Carl E. The infinite hidden Markov model. In Advances in neural information processing systems, pp. 577–584, 2001.
• Duane et al. (1987) Duane, Simon, Kennedy, Anthony D, Pendleton, Brian J, and Roweth, Duncan. Hybrid monte carlo. Physics letters B, 195(2):216–222, 1987.
• Ewens (1990) Ewens, Warren John. Population genetics theory – the past and the future. In Mathematical and Statistical Developments of Evolutionary Theory, pp. 177–227. Springer, 1990.
• Favaro et al. (2013) Favaro, Stefano, Teh, Yee Whye, et al. MCMC for normalized random measure mixture models. Statistical Science, 28(3):335–359, 2013.
• Ferguson (1973) Ferguson, Thomas S. A Bayesian analysis of some nonparametric problems. The annals of statistics, pp. 209–230, 1973.
• Fox et al. (2008) Fox, Emily B, Sudderth, Erik B, Jordan, Michael I, and Willsky, Alan S. An HDP-HMM for systems with state persistence. In

Proceedings of the 25th international conference on Machine learning

, pp. 312–319. ACM, 2008.
• Gael et al. (2009) Gael, Jurgen V, Teh, Yee W, and Ghahramani, Zoubin. The infinite factorial hidden Markov model. In Advances in Neural Information Processing Systems, pp. 1697–1704, 2009.
• Ghahramani & Griffiths (2005) Ghahramani, Zoubin and Griffiths, Thomas L. Infinite latent feature models and the Indian buffet process. In Advances in neural information processing systems, pp. 475–482, 2005.
• Ghahramani et al. (1997) Ghahramani, Zoubin, Jordan, Michael I, and Smyth, Padhraic. Factorial hidden Markov models. Machine learning, 29(2-3):245–273, 1997.
• Gilks & Wild (1992) Gilks, Walter R and Wild, Pascal. Adaptive rejection sampling for Gibbs sampling. Applied Statistics, pp. 337–348, 1992.
• Ishwaran & Zarepour (2002) Ishwaran, Hemant and Zarepour, Mahmoud. Exact and approximate sum representations for the Dirichlet process. Canadian Journal of Statistics, 30(2):269–283, 2002.
• Johnson & Willsky (2013) Johnson, Matthew J and Willsky, Alan S. Bayesian nonparametric hidden semi-Markov models. The Journal of Machine Learning Research, 14(1):673–701, 2013.
• Kingman (1967) Kingman, John. Completely random measures. Pacific Journal of Mathematics, 21(1):59–78, 1967.
• Neal et al. (2011) Neal, Radford M et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2:113–162, 2011.
• Paisley et al. (2012) Paisley, John, Wang, Chong, and Blei, David M.

The discrete infinite logistic normal distribution.

Bayesian Analysis, 7(4):997–1034, 2012.
• Ranganath & Blei (2016) Ranganath, Rajesh and Blei, David M. Correlated random measures. Journal of the American Statistical Association, 2016.
• Sethuraman (1994) Sethuraman, Jayaram. A constructive definition of Dirichlet processes. Statistica Sinica, 4:639–650, 1994.
• Teh et al. (2006) Teh, Yee Whye, Jordan, Michael I, Beal, Matthew J, and Blei, David M. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 2006.
• Valera et al. (2015) Valera, Isabel, Ruiz, Francisco, Svensson, Lennart, and Perez-Cruz, Fernando. Infinite factorial dynamical model. In Advances in Neural Information Processing Systems, pp. 1657–1665, 2015.
• Van Gael et al. (2008) Van Gael, Jurgen, Saatci, Yunus, Teh, Yee Whye, and Ghahramani, Zoubin. Beam sampling for the infinite hidden Markov model. In Proceedings of the 25th International Conference on Machine Learning, pp. 1088–1095. ACM, 2008.
• Zhu et al. (2016) Zhu, Hao, Hu, Jinsong, and Leung, Henry. Hidden Markov models with discrete infinite logistic normal distribution priors. In Information Fusion (FUSION), 2016 19th International Conference on, pp. 429–433. IEEE, 2016.

## Appendices

Appendix A concerns the derivation of the augmented data representation referred to as the “Markov Jump Process with Failed Transitions” (MJP-FT). Appendix B fills in details for the Gibbs sampling steps to sample the rescaled HDP used by the HDP-HMM-LT. Appendix C gives a derivation for the updates to the binary state vectors, , in the version of the HDP-HMM-LT used in the cocktail party experiment. Finally, appendix D gives the details for the Hamiltonian Monte Carlo update for in the version of the model used in the Bach chorale experiment.

## Appendix A Details of the Markov Jump Process with Failed Transitions Representation

We can gain stronger intuition, as well as simplify posterior inference, by re-casting the HDP-HMM-LT as a continuous time Markov Jump Process where some of the attempts to jump from one state to another fail, and where the failure probability increases as a function of the “distance” between the states.

Let be defined as in the last section, and let , and be defined as in the Normalized Gamma Process representation of the ordinary HDP-HMM. That is,

 β ∼GEM(γ) (24) θj i.i.d∼H (25) πjj′|β,θ ∼Gamma(αβj′,1) (26)

Now suppose that when the process is in state , jumps to state are made at rate . This defines a continuous-time Markov Process where the off-diagonal elements of the transition rate matrix are the off diagonal elements of . In addition, self-jumps are allowed, and occur with rate . If we only observe the jumps and not the durations between jumps, this is an ordinary Markov chain, whose transition matrix is obtained by appropriately normalizing . If we do not observe the jumps themselves, but instead an observation is generated once per jump from a distribution that depends on the state being jumped to, then we have an ordinary HMM.

We modify this process as follows. Suppose that each jump attempt from state to state has a chance of failing, which is an increasing function of the “distance” between the states. In particular, let the success probability be (recall that we assumed above that for all ). Then, the rate of successful jumps from to is , and the corresponding rate of unsuccessful jump attempts is . To see this, denote by the total number of jump attempts to in a unit interval of time spent in state . Since we are assuming the process is Markovian, the total number of attempts is distributed. Conditioned on , will be successful, where

 njj′|Njj′∼Binom(Njj′,ϕjj′) (27)

It is easy to show (and well known) that the marginal distribution of is , and the marginal distribution of is . The rate of successful jumps from state overall is then .

Let index jumps, so that indicates the th state visited by the process (couting self-jumps as a new time step). Given that the process is in state at discretized time (that is, ), it is a standard property of Markov Processes that the probability that the first successful jump is to state (that is, ) is proportional to the rate of successful attempts to , which is .

Let indicate the time elapsed between the th and and th successful jump (where we assume that the first observation occurs when the first successful jump from a distinguished initial state is made). We have

 ~ut|zt−1∼Exp(Tzt−1) (28)

where is independent of .

During this period, there will be unsuccessful attempts to jump to state , where

 ~qj′t|zt−1∼Poisson(~utπzt−1j′(1−ϕzt−1j′)) (29)

 Tj ={t|zt−1=j} (30) qjj′ =∑t∈Tj~qj′t (31) uj =∑t∈Tj~ut (32)

and let be the matrix of unsuccessful jump attempt counts, and be the vector of the total times spent in each state.

Since each of the with are i.i.d. , we get the marginal distribution

 uj|z,π,ϕind∼Gamma(nj⋅,Tj) (33)

by the standard property that sums of i.i.d. Exponential distributions has a Gamma distribution with shape equal to the number of variates in the sum, and rate equal to the rate of the individual exponentials. Moreover, since the

with are Poisson distributed, the total number of failed attempts in the total duration is

 qjj′ind∼Poisson(ujπjj′(1−ϕjj′)). (34)

Thus if we marginalize out the individual and

, we have a joint distribution over

, , and , conditioned on the transition rate matrix and the success probability matrix , which is

 p(z,u,Q|π,ϕ) =(T∏t=1p(zt|zt−1))∏jp(uj|z,π,ϕ)∏j′p(qjj′|ujπjj′,ϕjj′) (35) =(∏tπzt−1ztϕzt−1ztTzt−1)∏jTnj⋅jΓ(nj⋅)unj⋅−1je−Tjuj (36) ×∏j′e−ujπjj′(1−ϕjj′)uqjj′jπqjj′jj′(1−ϕjj′)qjj′(qjj′!)−1 (37) =∏jΓ(nj⋅)−1unj⋅+qj⋅−1j (38) ×∏j′πnjj′+qjj′jj′ϕnjj′jj′(1−ϕjj′)qjj′e−πjj′ϕjj′uje−πjj′(1−ϕjj′)uj(qjj′!)−1 (39) =∏jΓ(nj⋅)−1unj⋅+qj⋅−1j∏j′πnjj′+qjj′jj′ϕnjj′jj′(1−ϕjj′)qjj′e−πjj′uj(qjj′!)−1 (40)

Setting aside terms that do not depend on , we get the conditional likelihood function used in sampling :

 p(z,u,Q|π,ϕ) ∝∏j∏j′πnjj′+qjj′jj′e−πjj′uj (41)

which, combined with the independent Gamma priors on yields conditionally independent Gamma posteriors:

 πjj′|z,u,Q,β,αind.∼Gamma(αβj′+njj′+qjj′,1+uj) (42)

## Appendix B Inference details for hyperparameters of the rescaled HDP

### b.1 Sampling π, β, α and γ

The joint conditional over , , and given the augmented data factors as

 p(γ,α,β,π|D)=p(γ|D)p(α|D)p(β|γ,D)p(π|α,β,D) (43)

We will derive these four factors in reverse order.

##### Sampling π

The entries in are conditionally independent given and , so we have the prior

 p(π|β,α)=∏j∏j′Γ(αβj′)−1παβj′−1jj′exp(−πjj′), (44)

and the likelihood given given by (40). Combining these, we have

 p(π,z,u,Q|β,α,ϕ) =∏junj⋅+qj⋅−1j∏j′Γ(αβj′)−1παβj′+njj′+qjj′−1jj′ (45) ×e−(1+uj)πjj′ϕnjj′jj′(1−ϕjj′)qjj′(qjj′!)−1 (46)

Conditioning on everything except , we get

 p(π|Q,u,z,β,α) ∝∏j∏j′παβj′+njj′+qjj′−1jj′exp(−(1+uj)πjj′) (47)

and thus we see that the are conditionally independent given , and , and distributed according to

 πjj′|njj′,qjj′,βj′,αind∼Gamma(αβj′+njj′+qjj′,1+uj) (48)
##### Sampling β

Consider the conditional distribution of having integrated out . The prior density of is

 p(β|γ)=Γ(γ)Γ(γJ)J∏jβγJ−1j (49)

After integrating out in (45), we have

 p(z,u,Q|β,α,γ,ϕ) =J∏j=1u−1jJ∏j′=1unj