1 Introduction and Background
The hierarchical Dirichlet process hidden Markov model (HDP-HMM) (Beal et al., 2001; Teh et al., 2006) is a Bayesian model for time series data that generalizes the conventional hidden Markov model to allow a countably infinite state space. The hierarchical structure ensures that, despite the infinite state space, a common set of destination states is reachable with positive probability from each source state. The HDP-HMM can be characterized by the following generative process.
Each state, indexed by $j \in \{1, 2, \dots\}$, has parameters, $\theta_j$, drawn from a base measure, $H$. A top-level sequence of state weights, $\beta = (\beta_1, \beta_2, \dots)$, is drawn by iteratively breaking a “stick” off of the remaining weight according to a $\mathrm{Beta}(1, \gamma)$ distribution. The parameter $\gamma$ is known as the concentration parameter and governs how quickly the weights tend to decay, with large $\gamma$ corresponding to slow decay, and hence more weights needed before a given cumulative weight is reached. This stick-breaking process is denoted by $\beta \sim \mathrm{GEM}(\gamma)$ (Ewens, 1990; Sethuraman, 1994), for Griffiths, Engen and McCloskey. We thus have a discrete probability measure, $G_0$, with weights $\beta_j$ at locations $\theta_j$, $j = 1, 2, \dots$, defined by
$$G_0 = \sum_{j=1}^{\infty} \beta_j \delta_{\theta_j}, \qquad \beta \sim \mathrm{GEM}(\gamma), \quad \theta_j \overset{\text{iid}}{\sim} H. \qquad (1)$$
$G_0$ drawn in this way is a Dirichlet Process (DP) random measure with concentration $\gamma$ and base measure $H$.
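To make the stick-breaking construction concrete, the following minimal sketch (ours, not part of the paper) draws a truncated approximation to $\beta \sim \mathrm{GEM}(\gamma)$; the truncation level is an assumption needed to return a finite vector:

```python
import numpy as np

def sample_gem(gamma, truncation, seed=None):
    """Truncated stick-breaking draw: beta approximately ~ GEM(gamma).

    Each weight is a Beta(1, gamma) fraction of the stick remaining
    after the previous breaks; larger gamma means slower weight decay.
    """
    rng = np.random.default_rng(seed)
    fracs = rng.beta(1.0, gamma, size=truncation)  # stick-break proportions
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - fracs)[:-1]))
    return fracs * remaining  # weight j = proportion j times leftover stick

beta = sample_gem(gamma=2.0, truncation=20, seed=0)
```

With a larger `gamma`, more sticks are needed before the cumulative weight approaches 1, matching the decay behavior described above.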
The actual transition distribution, $\pi_j = (\pi_{j1}, \pi_{j2}, \dots)$, from state $j$, is drawn from another DP with concentration $\alpha$ and base measure $G_0$:
$$\pi_j \overset{\text{iid}}{\sim} \mathrm{DP}(\alpha, \beta), \qquad j = 0, 1, 2, \dots \qquad (2)$$
where $\pi_0$ represents the initial distribution. The hidden state sequence, $z = (z_1, \dots, z_T)$, is then generated according to $z_1 \sim \pi_0$, and
$$z_t \mid z_{t-1} \sim \pi_{z_{t-1}}, \qquad t = 2, \dots, T. \qquad (3)$$
Finally, the emission distribution for state $j$ is a function of $\theta_j$, so that observation $y_t$ is drawn according to
$$y_t \mid z_t \sim F(\theta_{z_t}). \qquad (4)$$
A shortcoming of the HDP prior on the transition matrix is that it does not use the fact that the source and destination states are the same set: that is, each $\pi_j$ has a special element, $\pi_{jj}$, which corresponds to a self-transition. In the HDP-HMM, however, self-transitions are no more likely a priori than transitions to any other state. The Sticky HDP-HMM (Fox et al., 2008) addresses this issue by adding an extra mass $\kappa$ at location $\theta_j$ to the base measure of the DP that generates $\pi_j$. That is, (2) is replaced by
$$\pi_j \sim \mathrm{DP}\!\left(\alpha + \kappa,\ \frac{\alpha\beta + \kappa\delta_j}{\alpha + \kappa}\right). \qquad (5)$$
An alternative approach that treats self-transitions as special is the HDP Hidden Semi-Markov Model (HDP-HSMM; Johnson & Willsky (2013)), wherein state duration distributions are modeled separately, and ordinary self-transitions are ruled out. However, while both of these models have the ability to privilege self-transitions, they contain no notion of similarity for pairs of states that are not identical: in both cases, when the transition matrix is integrated out, the prior probability of transitioning to state $k$ depends only on the top-level stick weight $\beta_k$ associated with state $k$, and not on the identity or parameters of the previous state $j$.

The two main contributions of this paper are (1) a generalization of the HDP-HMM, which we call the HDP-HMM with local transitions (HDP-HMM-LT), that allows a geometric structure to be defined on the latent state space, so that “nearby” states are a priori more likely to have transitions between them, and (2) a simple Gibbs sampling algorithm for this model. The “LT” property is introduced by element-wise rescaling and then renormalization of the HDP transition matrix. Two versions of the similarity structure are illustrated: in one case, two states are similar to the extent that their emission distributions are similar. In another, the similarity structure is inferred separately. In both cases, we give augmented data representations that restore conditional conjugacy and thus allow a simple Gibbs sampling algorithm to be used for inference.
A rescaling and renormalization approach similar to the one used in the HDP-HMM-LT is used by Paisley et al. (2012) to define their Discrete Infinite Logistic Normal (DILN) model, an instance of a correlated random measure (Ranganath & Blei, 2016), in the setting of topic modeling. There, however, the contexts and the mixture components (topics) are distinct sets, and there is no notion of temporal dependence. Zhu et al. (2016) developed an HMM based directly on the DILN model.¹ Both Paisley et al. and Zhu et al. employ variational approximations, whereas we present a Gibbs sampler, which converges asymptotically to the true posterior. We discuss additional differences between our model and the DILN-HMM in Sec. 2.2.

¹We thank an anonymous ICML reviewer for bringing this paper to our attention.
One class of application in which it is useful to incorporate a notion of locality occurs when the latent state sequence consists of several parallel chains, so that the global state changes incrementally, but where these increments are not independent across chains. Factorial HMMs (Ghahramani et al., 1997) are commonly used in this setting, but they ignore dependence among chains, and hence may do poorly when some combinations of states are much more probable than suggested by the chain-wise dynamics.
Another setting where the LT property is useful is when there is a notion of state geometry that licenses syllogisms: e.g., if A frequently leads to B and C, and B frequently leads to D and E, then it may be sensible to infer that A and C may lead to D and E as well. This property is arguably present in musical harmony, where consecutive chords are often (near-)neighbors in the “circle of fifths”, and small steps along the circle are more common than large ones.
The paper is structured as follows: In section 2 we define the model. In section 3, we develop a Gibbs sampling algorithm based on an augmented data representation, which we call the Markov Jump Process with Failed Transitions (MJP-FT). In section 4 we test two versions of the model: one on a speaker diarization task in which the speakers are interdependent, and another on a four-part chorale corpus, demonstrating performance improvements over state-of-the-art models when “local transitions” are more common in the data. Using synthetic data from an HDP-HMM, we show that the LT variant can learn not to use its similarity bias when the data does not support it. Finally, in section 5, we conclude and discuss the relationships between the HDP-HMM-LT and existing HMM variants. Code and additional details are available at http://colindawson.net/hdphmmlt/
2 An HDP-HMM With Local Transitions
We wish to add to the transition model the concept of a transition to a “nearby” state, where transitions between states $j$ and $k$ are more likely a priori to the extent that $j$ and $k$ are “nearby” in some similarity space. In order to accomplish this, we first consider an alternative construction of the transition distributions, based on the Normalized Gamma Process representation of the DP (Ishwaran & Zarepour, 2002; Ferguson, 1973).
2.1 A Normalized Gamma Process representation of the HDPHMM
The Dirichlet Process is an instance of a normalized completely random measure (Kingman, 1967; Ferguson, 1973), and can be defined as $G = \tilde{G} / \tilde{G}(\Theta)$, where
$$\tilde{G} = \sum_{j=1}^{\infty} \tilde{\pi}_j \delta_{\theta_j}, \qquad \tilde{\pi}_j \sim \mathrm{Gamma}(\alpha \beta_j, 1), \quad \theta_j \overset{\text{iid}}{\sim} H, \qquad (6)$$
$\delta_{\theta}$ is a measure assigning 1 to sets if they contain $\theta$ and 0 otherwise, and the weights are subject to the constraint that $\beta_j \geq 0$ and $\sum_j \beta_j = 1$. It has been shown (Ferguson, 1973; Paisley et al., 2012; Favaro et al., 2013) that the normalization constant $\tilde{G}(\Theta) = \sum_j \tilde{\pi}_j$ is positive and finite almost surely, and that $G$ is distributed as a DP with concentration $\alpha$ and base measure $G_0 = \sum_j \beta_j \delta_{\theta_j}$. If we draw $\beta$ from the stick-breaking process, draw an i.i.d. sequence of atoms $\theta_j$ from a base measure $H$, and then draw an i.i.d. sequence of random measures, $G_1, G_2, \dots$, from the above process, this defines a Hierarchical Dirichlet Process (HDP). If each $\theta_j$ is associated with one of the hidden states of an HMM, $\tilde{\Pi}$ is the infinite matrix whose entry $\tilde{\pi}_{jk}$ is the $k$th mass associated with the $j$th random measure, and $\tilde{\pi}_{j\cdot} = \sum_k \tilde{\pi}_{jk}$ is the sum of row $j$, then we obtain the prior for the HDP-HMM, where
$$\tilde{\pi}_{jk} \sim \mathrm{Gamma}(\alpha \beta_k, 1), \qquad \pi_{jk} = \tilde{\pi}_{jk} / \tilde{\pi}_{j\cdot}. \qquad (7)$$
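A weak-limit sketch of this construction (our illustration; the finite truncation $K$ stands in for the infinite matrix) draws each row of $\tilde{\Pi}$ as independent Gamma masses and normalizes:

```python
import numpy as np

def hdp_transition_rows(beta, alpha, seed=None):
    """Weak-limit (finite-K) draw of the HDP-HMM transition prior.

    Entry (j, k) of pi_tilde is an independent Gamma(alpha * beta_k, 1)
    mass; normalizing row j by its sum gives pi_j ~ Dirichlet(alpha * beta),
    the finite-K analogue of a DP with base weights beta.
    """
    rng = np.random.default_rng(seed)
    beta = np.asarray(beta)
    K = len(beta)
    pi_tilde = rng.gamma(shape=alpha * beta, scale=1.0, size=(K, K))
    pi = pi_tilde / pi_tilde.sum(axis=1, keepdims=True)
    return pi_tilde, pi

beta = np.full(5, 0.2)  # uniform top-level weights, for illustration only
pi_tilde, pi = hdp_transition_rows(beta, alpha=3.0, seed=1)
```

Because the Gamma shape parameters are shared across rows through $\beta$, all rows concentrate mass on the same set of destination states, which is the hierarchical sharing described above.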
2.2 Promoting “Local” Transitions
In the HDP prior, the rows of the transition matrix are conditionally independent. We wish to relax this assumption, to incorporate possible prior knowledge that certain pairs of states are “nearby” in some sense and thus more likely than others to produce large transition weights between them (in both directions); that is, transitions are likely to be “local”. We accomplish this by associating each latent state $j$ with a location $\ell_j$ in some space $\mathcal{L}$, introducing a “similarity function” $\phi : \mathcal{L} \times \mathcal{L} \to (0, 1]$, and scaling each element $\tilde{\pi}_{jk}$ by $\phi_{jk} := \phi(\ell_j, \ell_k)$. For example, we might wish to define a (possibly asymmetric) divergence function $d$ and set $\phi(\ell_j, \ell_k) = e^{-d(\ell_j, \ell_k)}$ so that transitions are less likely the farther apart two states are. By setting $\phi \equiv 1$, we obtain the standard HDP-HMM. The DILN-HMM (Zhu et al., 2016) employs a similar rescaling of transition probabilities via an exponentiated Gaussian Process, following Paisley et al. (2012), but there the scaling function must be positive semi-definite, and in particular symmetric, whereas in the HDP-HMM-LT, $\phi$ need only take values in $(0, 1]$. Moreover, the DILN-HMM does not allow the scales to be tied to other state parameters; its scales hence encode an independent notion of similarity.
Letting $\phi_{jk} = \phi(\ell_j, \ell_k)$, we can replace (6) for $\tilde{G}_j$ by
$$\tilde{G}_j = \sum_{k=1}^{\infty} \phi_{jk} \tilde{\pi}_{jk} \delta_{\theta_k}, \qquad \tilde{\pi}_{jk} \sim \mathrm{Gamma}(\alpha \beta_k, 1). \qquad (8)$$
Since the $\phi_{jk}$ are positive and bounded above by 1,
$$\tilde{G}_j(\Theta) = \sum_{k=1}^{\infty} \phi_{jk} \tilde{\pi}_{jk} \leq \sum_{k=1}^{\infty} \tilde{\pi}_{jk} < \infty \qquad (9)$$
almost surely, where the last inequality carries over from the original HDP. The prior means of the unnormalized transition weights, $\phi_{jk} \tilde{\pi}_{jk}$, are then proportional (for each $j$) to $\phi_{jk} \beta_k$, where $\beta_k$ is the top-level weight of state $k$.
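The rescaling step itself is a one-liner; the sketch below (ours) uses the Laplacian-on-Hamming kernel that appears later in the cocktail-party experiment as a stand-in for a generic $\phi$:

```python
import numpy as np

def lt_transition_matrix(pi_tilde, locations, lam):
    """Element-wise rescaling of unnormalized HDP weights by the kernel
    phi(l_j, l_k) = exp(-lam * hamming(l_j, l_k)), then row renormalization.
    lam = 0 gives phi = 1 everywhere, recovering the plain HDP-HMM.
    """
    locations = np.asarray(locations)
    hamming = (locations[:, None, :] != locations[None, :, :]).sum(axis=-1)
    phi = np.exp(-lam * hamming)
    scaled = phi * pi_tilde
    return scaled / scaled.sum(axis=1, keepdims=True), phi

rng = np.random.default_rng(2)
locs = rng.integers(0, 2, size=(4, 3))   # binary state locations (illustrative)
pi_tilde = rng.gamma(1.0, size=(4, 4))   # unscaled HDP masses
pi_lt, phi = lt_transition_matrix(pi_tilde, locs, lam=1.5)
pi_plain, _ = lt_transition_matrix(pi_tilde, locs, lam=0.0)
```

Note that the kernel is 1 on the diagonal, so self-transitions are never penalized by the rescaling.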
The distribution of the latent state sequence $z$ given $\tilde{\Pi}$ and $\Phi$ is now
$$p(z \mid \tilde{\Pi}, \Phi) = \prod_{j=1}^{\infty} \frac{\prod_{k=1}^{\infty} (\phi_{jk} \tilde{\pi}_{jk})^{n_{jk}}}{\big(\sum_{k'=1}^{\infty} \phi_{jk'} \tilde{\pi}_{jk'}\big)^{n_{j\cdot}}} \qquad (10)$$
where $n_{jk}$ is the number of transitions from state $j$ to state $k$ in the sequence, and $n_{j\cdot} = \sum_k n_{jk}$ is the total number of visits to state $j$. Since the denominator is a sum over products of $\phi_{jk'}$ and $\tilde{\pi}_{jk'}$ terms, the posterior for $\tilde{\Pi}$ is no longer a DP. However, conditional conjugacy can be restored by a data-augmentation process with a natural interpretation, which is described next.
2.3 The HDP-HMM-LT as the Marginalization of a Markov Jump Process with “Failed” Transitions
In this section, we define a stochastic process that we call the Markov Jump Process with Failed Transitions (MJP-FT), from which we obtain the HDP-HMM-LT by marginalizing over some of the variables. By reinstating these auxiliary variables, we obtain a simple Gibbs sampling algorithm over the full MJP-FT, which can be used to sample from the marginal posterior of the variables used by the HDP-HMM-LT.
Let $\tilde{\Pi}$, $\Phi$, and $\{\theta_j\}$ be defined as in the last section. Consider a continuous-time Markov process over the states $\{1, 2, \dots\}$, and suppose that if the process makes a jump to state $j$ at time $\tau$, the next jump, which is to state $z'$, occurs at time $\tau + d$, where $d \sim \mathrm{Exp}(\tilde{\pi}_{j\cdot})$, and $P(z' = k) \propto \tilde{\pi}_{jk}$, independent of $d$. Note that in this formulation, unlike in standard formulations of Markov Jump Processes, we are assuming that self-jumps are possible.
If we only observe the jump sequence and not the holding times $d$, this is an ordinary Markov chain with transition matrix row-proportional to $\tilde{\Pi}$. If we do not observe the jumps directly, but instead an observation is generated once per jump from a distribution that depends on the state being jumped to, then we have an ordinary HMM whose transition matrix is obtained by normalizing $\tilde{\Pi}$; that is, we have the HDP-HMM.

We modify this process as follows. Suppose each jump attempt from state $j$ to state $k$ has probability $1 - \phi_{jk}$ of failing, in which case no transition occurs and no observation is generated. Assuming independent failures, the rates of successful and failed jumps from $j$ to $k$ are $\phi_{jk} \tilde{\pi}_{jk}$ and $(1 - \phi_{jk}) \tilde{\pi}_{jk}$, respectively. The probability that the first successful jump is to state $k$ (that is, that $z' = k$) is proportional to the rate of successful jump attempts to $k$, which is $\phi_{jk} \tilde{\pi}_{jk}$. Conditioned on $z'$, the holding time, $d$, is independent of $z'$ and is distributed as $\mathrm{Exp}\big(\sum_k \phi_{jk} \tilde{\pi}_{jk}\big)$. We denote the total time spent in state $j$ by $D_j$, where, as the sum of i.i.d. Exponentials,
$$D_j \sim \mathrm{Gamma}\Big(n_{j\cdot},\ \sum_{k} \phi_{jk} \tilde{\pi}_{jk}\Big). \qquad (11)$$
During this period there will be $q_{jk}$ failed attempts to jump to state $k$, where the
$$q_{jk} \mid D_j \sim \mathrm{Poisson}\big((1 - \phi_{jk}) \tilde{\pi}_{jk} D_j\big)$$
are independent. This data augmentation bears some conceptual similarity to the Geometrically distributed auxiliary variables introduced to the HDP-HSMM (Johnson & Willsky, 2013) to restore conditional conjugacy. However, there are key differences: first, those variables measure how many steps the chain would have remained in state $j$ under Markovian dynamics, whereas our $d$ variables represent putative continuous holding times between each transition; and second, the former allow for the restoration of a zeroed-out entry in each row, whereas ours allow us to work with unnormalized entries, avoiding the need to restore zeroed-out entries in the HSMM-LT. Incorporating $D$ and $Q$ as augmented data simplifies the likelihood for $\tilde{\Pi}$ and $\Phi$, yielding
$$p(z, D, Q \mid \tilde{\Pi}, \Phi) = \left[\prod_{j} \frac{\prod_{k} (\phi_{jk} \tilde{\pi}_{jk})^{n_{jk}}}{\big(\sum_{k'} \phi_{jk'} \tilde{\pi}_{jk'}\big)^{n_{j\cdot}}}\right] \left[\prod_{j} \frac{\big(\sum_{k'} \phi_{jk'} \tilde{\pi}_{jk'}\big)^{n_{j\cdot}}}{\Gamma(n_{j\cdot})}\, D_j^{n_{j\cdot} - 1} e^{-\sum_{k'} \phi_{jk'} \tilde{\pi}_{jk'} D_j}\right] \left[\prod_{j,k} \frac{\big((1 - \phi_{jk}) \tilde{\pi}_{jk} D_j\big)^{q_{jk}}}{q_{jk}!}\, e^{-(1 - \phi_{jk}) \tilde{\pi}_{jk} D_j}\right] \qquad (12)$$
where dependence on the remaining parameters has been omitted for conciseness. After grouping terms and omitting factors that do not depend on $\tilde{\Pi}$ or $\Phi$, this is proportional (as a function of $\tilde{\Pi}$ and $\Phi$) to
$$\prod_{j} \prod_{k} \tilde{\pi}_{jk}^{\,n_{jk} + q_{jk}}\, \phi_{jk}^{\,n_{jk}}\, (1 - \phi_{jk})^{\,q_{jk}}\, e^{-\tilde{\pi}_{jk} D_j}. \qquad (13)$$
Conveniently, the normalizing constants $\big(\sum_{k'} \phi_{jk'} \tilde{\pi}_{jk'}\big)^{n_{j\cdot}}$ have canceled, and the exponential terms involving $\phi_{jk}$ and $(1 - \phi_{jk})$ in the Gamma and Poisson distributions of $D_j$ and $q_{jk}$ combine to cause $\phi_{jk}$ to vanish from the exponent. Additional details and derivations for this data augmentation are in Appendix A.
2.4 Sticky and Semi-Markov Generalizations
We note that the local transition property of the HDP-HMM-LT can be combined with the Sticky property of the Sticky HDP-HMM (Fox et al., 2008), or the non-geometric duration distributions of the HDP-HSMM (Johnson & Willsky, 2013), to add additional prior weight on self-transitions. In the former case, no changes to inference are needed; one can simply add the extra mass $\kappa$ to the shape parameter of the Gamma prior on the diagonal entries $\tilde{\pi}_{jj}$, and employ the same auxiliary variable method used by Fox et al. to distinguish “Sticky” from “regular” self-transitions. For the semi-Markov case, we can fix the diagonal elements of $\tilde{\Pi}$ to zero, allow observations to be emitted according to a state-specific duration distribution, and sample the latent state sequence using a suitable semi-Markov message passing algorithm (Johnson & Willsky, 2013). Inference for the $\Phi$ matrix is not affected, since its diagonal elements are assumed to be 1. Unlike in the original representation of the HDP-HSMM, no further data-augmentation is needed, as the (continuous) durations $D_j$ already account for the normalization of the $\tilde{\pi}_{jk}$.
2.5 Obtaining the Factorial HMM as a Limiting Case
One setting in which a local transition property is desirable is the case where the latent states encode multiple hidden features at time $t$ as a vector of categories. Such problems are often modeled using factorial HMMs (Ghahramani et al., 1997). In fact, the HDP-HMM-LT yields the factorial HMM in the limit as the concentration parameters tend to infinity, fixing each row of $\tilde{\Pi}$ to be uniform with probability 1, so that the dynamics are controlled entirely by $\Phi$. If $A^{(c)}$ is the transition matrix for chain $c$, then setting $\phi_{jk} = e^{-d(\ell_j, \ell_k)}$ with asymmetric “divergences” $d(\ell_j, \ell_k) = -\sum_c \log A^{(c)}_{j_c k_c}$ yields the factorial transition model.

2.6 An Infinite Factorial HDP-HMM-LT
Nonparametric extensions of the factorial HMM, such as the infinite factorial hidden Markov model (Gael et al., 2009) and the infinite factorial dynamic model (Valera et al., 2015), have been developed in recent years by making use of the Indian Buffet Process (Ghahramani & Griffiths, 2005) as a state prior. It would be conceptually straightforward to combine the IBP state prior with the similarity bias of the LT model, provided the chosen similarity function is uniformly bounded above on the space of infinite-length binary vectors (for example, take $\phi(\ell_j, \ell_k)$ to be the exponentiated negative Hamming distance between $\ell_j$ and $\ell_k$). Since the number of differences between two draws from the IBP is finite with probability 1, this yields a reasonable similarity metric.
3 Inference
We develop a Gibbs sampling algorithm based on the MJP-FT representation described in Sec. 2.3, augmenting the data with the duration variables $D$, the failed jump attempt count matrix, $Q$, as well as additional auxiliary variables which we will define below. In this representation the transition matrix $\Pi$ is not represented directly, but is a deterministic function of the unscaled transition “rate” matrix, $\tilde{\Pi}$, and the similarity matrix, $\Phi$. The full set of variables is partitioned into blocks: $(z, D, Q, U)$, $(\tilde{\Pi}, \beta, \alpha, \gamma)$, $\theta$, and $\eta$, where $U$ represents a set of auxiliary variables that will be introduced below, $\theta$ represents the emission parameters (which may be further blocked depending on the specific choice of model), and $\eta$ represents additional parameters such as any free parameters of the similarity function, $\phi$, and any hyperparameters of the emission distribution.
3.1 Sampling Transition Parameters and Hyperparameters
The joint posterior over $\tilde{\Pi}$, $\beta$, and the concentration parameters $\alpha$ and $\gamma$ given the augmented data will factor as
$$p(\tilde{\Pi}, \beta, \alpha, \gamma \mid z, D, Q) = p(\gamma \mid z, D, Q)\, p(\alpha \mid z, D, Q)\, p(\beta \mid \alpha, z, D, Q)\, p(\tilde{\Pi} \mid \beta, \alpha, z, D, Q). \qquad (14)$$
We describe these four factors in reverse order. For additional details, see Appendix B.
Sampling $\tilde{\Pi}$
Having used data augmentation to simplify the likelihood for $\tilde{\Pi}$ to the factored conjugate form in (40), the individual $\tilde{\pi}_{jk}$ are a posteriori independent, $\mathrm{Gamma}(\alpha \beta_k + n_{jk} + q_{jk},\ 1 + D_j)$-distributed.
Sampling $\beta$
To enable joint sampling of $\beta$, we employ a weak limit approximation to the HDP (Johnson & Willsky, 2013), approximating the stick-breaking process for $\beta$ using a finite Dirichlet distribution with $K$ components, $\beta \sim \mathrm{Dirichlet}(\gamma/K, \dots, \gamma/K)$, where $K$ is larger than we expect to need. Due to the product-of-Gammas form, we can integrate out $\tilde{\Pi}$ analytically to obtain the marginal likelihood:
$$p(z, D, Q \mid \beta, \alpha, \Phi) \propto \prod_{j=1}^{K} (1 + D_j)^{-\alpha} \prod_{k=1}^{K} \frac{\Gamma(\alpha \beta_k + n_{jk} + q_{jk})}{\Gamma(\alpha \beta_k)}\, (1 + D_j)^{-(n_{jk} + q_{jk})} \qquad (15)$$
where we have used the fact that the $\beta_k$ sum to 1 to pull out terms of the form $(1 + D_j)^{-\alpha}$ from the inner product in the likelihood. Following Teh et al. (2006), we can introduce auxiliary variables $m_{jk}$, with
$$p(m_{jk} = m \mid z, Q, \beta, \alpha) \propto s(n_{jk} + q_{jk}, m)\, (\alpha \beta_k)^m \qquad (16)$$
for integer $m$ ranging between 0 and $n_{jk} + q_{jk}$, where $s(n, m)$ is an unsigned Stirling number of the first kind. The normalizing constant in this distribution cancels the ratio of Gamma functions in the likelihood, so, letting $m_{\cdot k} = \sum_j m_{jk}$, the posterior for (the truncated) $\beta$ is a Dirichlet whose $k$th mass parameter is $\gamma/K + m_{\cdot k}$.
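Rather than tabulating Stirling numbers, each $m_{jk}$ can be drawn by simulating the equivalent Chinese-restaurant scheme, as in Teh et al. (2006). A sketch (ours, with illustrative counts; `counts` plays the role of $n_{jk} + q_{jk}$):

```python
import numpy as np

def sample_table_counts(counts, alpha, beta, seed=None):
    """Sample m_jk from p(m) proportional to s(n_jk, m) * (alpha*beta_k)^m
    via the equivalent Chinese-restaurant scheme: the i-th of n_jk
    customers (i = 0, 1, ...) opens a new table with probability
    a / (a + i), where a = alpha * beta_k.
    """
    rng = np.random.default_rng(seed)
    m = np.zeros_like(counts)
    for j in range(counts.shape[0]):
        for k in range(counts.shape[1]):
            a = alpha * beta[k]
            for i in range(counts[j, k]):
                m[j, k] += rng.random() < a / (a + i)
    return m

beta = np.full(3, 1.0 / 3)
n_plus_q = np.array([[4, 0, 2], [1, 3, 0], [0, 0, 5]])  # toy n_jk + q_jk
m = sample_table_counts(n_plus_q, alpha=2.0, beta=beta, seed=5)
```

The column sums of `m` are then the Dirichlet pseudo-counts $m_{\cdot k}$ used in the $\beta$ update.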
Sampling Concentration Parameters
Incorporating $m$ into the augmented data, we can integrate out $\beta$ to obtain
$$p(z, D, Q, m \mid \alpha, \gamma) \propto \alpha^{M} \prod_{j=1}^{K} (1 + D_j)^{-\alpha} \cdot \frac{\Gamma(\gamma)}{\Gamma(\gamma + M)} \prod_{k=1}^{K} \frac{\Gamma(\gamma/K + m_{\cdot k})}{\Gamma(\gamma/K)} \qquad (17)$$
where $M = \sum_k m_{\cdot k}$.
Assuming that $\alpha$ and $\gamma$ have Gamma priors with shape and rate parameters $(a_\alpha, b_\alpha)$ and $(a_\gamma, b_\gamma)$, then
$$\alpha \mid z, D, Q, m \sim \mathrm{Gamma}\Big(a_\alpha + M,\ b_\alpha + \sum_{j=1}^{K} \log(1 + D_j)\Big). \qquad (18)$$
To simplify the likelihood for $\gamma$, we can introduce a final set of auxiliary variables, $\eta$ and $\bar{m}_k$, with the following distributions:
$$\eta \mid \gamma \sim \mathrm{Beta}(\gamma + 1, M) \qquad (19)$$
$$p(\bar{m}_k = m \mid \gamma, m_{\cdot k}) \propto s(m_{\cdot k}, m)\, (\gamma/K)^m. \qquad (20)$$
The normalizing constants are ratios of Gamma functions, which cancel those in (17), so that
$$\gamma \mid \eta, \bar{m} \sim \mathrm{Gamma}\Big(a_\gamma + \sum_{k} \bar{m}_k,\ b_\gamma - \log \eta\Big). \qquad (21)$$
3.2 Sampling $z$ and the auxiliary variables
We sample the hidden state sequence, $z$, jointly with the auxiliary variables, which consist of $D$, $Q$, $m$, and the concentration-parameter auxiliary variables. The joint conditional distribution of these variables is defined directly by the generative model. Since we are conditioning on the transition matrix, we can sample the entire sequence $z$ jointly with the forward-backward algorithm, as in an ordinary HMM. Since we are sampling the labels jointly, this step requires $O(TK^2)$ computation per iteration, which is the bottleneck of the inference algorithm for reasonably large $T$ or $K$ (other updates are constant in $T$ or in $K$). Having done this, we can sample $D$, $Q$, $m$, and the remaining auxiliary variables from their forward distributions. It is also possible to employ a variant on beam sampling (Van Gael et al., 2008) to speed up each iteration, at the cost of slower mixing, but we did not use this variant here.
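A minimal forward-filter, backward-sample step (our sketch; per-step emission log-likelihoods are assumed precomputed and the transition matrix already normalized) looks like:

```python
import numpy as np

def ffbs(log_lik, pi, pi0, seed=None):
    """Forward-filter, backward-sample a state path given the normalized
    transition matrix pi, initial distribution pi0, and a (T, K) matrix
    of per-step log emission likelihoods. Cost is O(T K^2).
    """
    rng = np.random.default_rng(seed)
    T, K = log_lik.shape
    lik = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))  # stabilized
    fwd = np.empty((T, K))
    f = pi0 * lik[0]
    fwd[0] = f / f.sum()
    for t in range(1, T):                       # forward filtering
        f = (fwd[t - 1] @ pi) * lik[t]
        fwd[t] = f / f.sum()
    z = np.empty(T, dtype=int)
    z[T - 1] = rng.choice(K, p=fwd[T - 1])
    for t in range(T - 2, -1, -1):              # backward sampling
        b = fwd[t] * pi[:, z[t + 1]]
        z[t] = rng.choice(K, p=b / b.sum())
    return z

rng = np.random.default_rng(6)
pi = rng.dirichlet(np.ones(3), size=3)          # toy 3-state transition matrix
pi0 = np.full(3, 1.0 / 3)
log_lik = rng.normal(size=(20, 3))              # toy emission log-likelihoods
z = ffbs(log_lik, pi, pi0, seed=7)
```

This is the standard joint update referenced above; the LT structure enters only through how `pi` is computed from $\tilde{\Pi}$ and $\Phi$.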
3.3 Sampling state and emission parameters
Depending on the application, the locations $\ell$ may or may not depend on the emission parameters, $\theta$. If not, sampling $\theta$ conditional on $z$ is unchanged from the HDP-HMM. There is no general-purpose method for sampling $\ell$, or for sampling $\theta$ in the dependent case, due to the dependence on the form of $\phi$ and on the emission model, but specific instances are illustrated in the experiments below.
4 Experiments
The parameter space for the hidden states, the associated prior on $\theta$, and the similarity function $\phi$, are application-specific; we consider here two cases. The first is a speaker-diarization task, where each state consists of a finite-dimensional binary vector whose entries indicate which speakers are currently speaking. In this experiment, the state vectors both determine the pairwise similarities and partially determine the emission distributions via a linear-Gaussian model. In the second experiment, the data consists of Bach chorales, and the latent states can be thought of as harmonic contexts. There, the components of the states that govern similarities are modeled as independent of the emission distributions, which are categorical distributions over four-voice chords.
4.1 Cocktail Party
The Data
The data was constructed using audio signals collected from the PASCAL 1st Speech Separation Challenge (http://laslab.org/SpeechSeparationChallenge/). The underlying signal consisted of $S = 16$ speaker channels recorded at each of $T$ time steps, with the resulting signal matrix, denoted by $\mathbf{S}$, mapped to microphone channels via a weight matrix, $W$. The 16 speakers were grouped into 4 conversational groups of 4, where speakers within a conversation took turns speaking (see Fig. 2). In such a task, there are naively $2^S$ possible states (here, $2^{16} = 65{,}536$). However, due to the conversational grouping, if at most one speaker in a conversation is speaking at any given time, the state space is constrained, with only $\prod_c (S_c + 1)$ states possible, where $S_c$ is the number of speakers in conversation $c$ (in this case $(4 + 1)^4$, for a total of 625 possible states).
Each “turn” within a conversation consisted of a single sentence, and turn orders within a conversation were randomly generated, with random pauses inserted between sentences. Each time a speaker had a turn, the sentence was drawn randomly from the 500 sentences uttered by that speaker in the data. The conversations continued for 40s, and the signal was downsampled to length 2000. The ‘on’ portions of each speaker’s signal were normalized to have amplitudes with mean 1 and standard deviation 0.5. An additional column of 1s was added to the speaker signal matrix, $\mathbf{S}$, representing background noise, so that the resulting signal matrix, denoted $\bar{\mathbf{S}}$, was $2000 \times 17$, with a conformable weight matrix, $W$. Following Gael et al. (2009) and Valera et al. (2015), the weights were drawn independently from a uniform distribution, and independent Gaussian noise was added to each entry of the observation matrix.

The Model
The latent states, $\theta_j$, are the $S$-dimensional binary vectors whose $s$th entry indicates whether or not speaker $s$ is speaking. The locations are identified with the binary vectors: $\ell_j = \theta_j$. We use a Laplacian similarity function on Hamming distance, $h$, so that $\phi(\ell_j, \ell_k) = e^{-\lambda h(\ell_j, \ell_k)}$. The emission model is linear-Gaussian as in the data, with weight matrix $W$ and signal matrix $\bar{\mathbf{S}}$ whose $t$th row is determined by $\theta_{z_t}$, so that $\mathbf{Y} = \bar{\mathbf{S}} W + \boldsymbol{\varepsilon}$. For the experiments discussed here, we assume that $W$ is independent of $\theta$, but this assumption is easily relaxed if appropriate.

For finite-length binary vector states, the set of possible states is finite, and so it may seem that a nonparametric model is unnecessary. However, if $S$ is reasonably large, most of the $2^S$ possible states are likely vanishingly unlikely (and the number of observations may well be less than $2^S$), and so we would like to encourage the selection of a sparse set of states. Moreover, there could be more than one state with the same emission parameters, but with different transition dynamics. Next we describe the additional inference steps needed for this version of the model.
Sampling $\theta$ / $\ell$
Since $\theta$ and $\ell$ are identified, influencing both the transition matrix and the emission distributions, both the state sequence and the observation matrix are used in the update. We put independent Beta-Bernoulli priors on each coordinate of the $\theta_j$, and Gibbs sample each coordinate conditioned on all the others and the coordinate-wise prior means, which we sample in turn conditioned on $\theta$. Details are in Appendix C.
Sampling $\lambda$
The parameter $\lambda$ of the similarity function governs the connection between $\ell$ and $\Phi$. Substituting the definition of $\phi$ into (40) yields
$$p(z, D, Q \mid \tilde{\Pi}, \ell, \lambda) \propto \prod_{j} \prod_{k} e^{-\lambda h_{jk} n_{jk}}\, \big(1 - e^{-\lambda h_{jk}}\big)^{q_{jk}}\, e^{-\tilde{\pi}_{jk} D_j e^{-\lambda h_{jk}}} \qquad (22)$$
where $h_{jk} = h(\ell_j, \ell_k)$.
We put an Exponential($b_\lambda$) prior on $\lambda$, which yields a posterior density
$$p(\lambda \mid z, D, Q, \tilde{\Pi}, \ell) \propto e^{-b_\lambda \lambda} \prod_{j} \prod_{k} e^{-\lambda h_{jk} n_{jk}}\, \big(1 - e^{-\lambda h_{jk}}\big)^{q_{jk}}\, e^{-\tilde{\pi}_{jk} D_j e^{-\lambda h_{jk}}}. \qquad (23)$$
This density is log-concave, and so we use Adaptive Rejection Sampling (Gilks & Wild, 1992) to sample from it.
Sampling $W$ and the noise covariance
Conditioned on $\theta$ and $z$, $W$ and the noise covariance can be sampled as in Bayesian linear regression. If each column of $W$ has a multivariate Normal prior, then the columns are a posteriori independent multivariate Normals. For the experiments reported here, we fix $W$ to its ground truth value so that the inferred states can be compared directly with the ground truth signal matrix, and we constrain the noise covariance to be diagonal, with Inverse-Gamma priors on the variances, resulting in conjugate updates.
Results
We attempted to infer the binary speaker matrices using five models: (1) a binary-state Factorial HMM (Ghahramani et al., 1997), where individual binary speaker sequences are modeled as independent, (2) an ordinary HDP-HMM without local transitions (Teh et al., 2006), where the latent states are binary vectors, (3) a Sticky HDP-HMM (Fox et al., 2008), (4) our HDP-HMM-LT model, and (5) a model that combines the Sticky and LT properties.³ For all models, all concentration and noise precision parameters are given vague Gamma priors. For the Sticky models, the ratio $\kappa / (\alpha + \kappa)$ is given a Beta prior. We evaluated the models at each iteration using both the Hamming distance between inferred and ground truth state matrices and F1 score. We also plot the inferred decay rate $\lambda$, and the number of states used by the LT and Sticky-LT models. The results for the five models are in Fig. 1. In Fig. 2, we plot the ground truth state matrix against the average state matrix, averaged over runs and post-burn-in iterations.

³We attempted to add a comparison to the DILN-HMM (Zhu et al., 2016) as well, but code could not be obtained, and the paper did not provide enough detail to reproduce their inference algorithm.
The LT and Sticky-LT models outperform the others, while the regular Sticky model exhibits only a small advantage over the vanilla HDP-HMM. Both LT models converge on a non-negligible value of $\lambda$, about 1.6 (see Fig. 1), suggesting that the local transition structure explains the data well. The LT models also use more states than the non-LT models, perhaps owing to the fact that the weaker transition prior of the non-LT model is more likely to explain nearby similar observations as a single persisting state, whereas the LT model places a higher probability on transitioning to a new state with a similar latent vector.
4.2 Synthetic Data Without Local Transitions
We generated data directly from the ordinary HDP-HMM used in the cocktail experiment as a sanity check, to examine the performance of the LT model in the absence of a similarity bias. The results are in Fig. 3. When the parameter $\lambda$ is large, the LT model has worse performance than the non-LT model on this data; however, $\lambda$ settles near zero as the model learns that local transitions are not more probable. When $\lambda = 0$, the HDP-HMM-LT is an ordinary HDP-HMM. The LT model does not make entirely the same inferences as the non-LT model, however; in particular, the concentration parameter $\alpha$ is larger. To some extent, $\alpha$ and $\lambda$ trade off: sparsity of the transition matrix can be achieved either by beginning with a sparse rate matrix prior to rescaling (small $\alpha$), or by beginning with a less sparse rate matrix which becomes sparser through rescaling (larger $\alpha$ and nonzero $\lambda$).
4.3 Bach Chorales
To test a version of the HDP-HMM-LT model in which the components of the latent state governing similarity are unrelated to the emission distributions, we used our model to do unsupervised “grammar” learning from a corpus of Bach chorales. The data was a corpus of 217 four-voice major-key chorales by J.S. Bach from music21 (http://web.mit.edu/music21), 200 of which were randomly selected as a training set, with the other 17 used as a test set to evaluate surprisal (marginal log likelihood per observation) by the trained models. All chorales were transposed to C major, and each distinct four-voice chord (with voices ordered) was encoded as a single integer. In total there were 3307 distinct chord types and 20401 chord tokens in the 217 chorales, with 3165 types and 18818 tokens in the 200 training chorales, and 143 chord types that were unique to the test set.
Modifications to Model and Inference
Since the chords were encoded as integers, the emission distribution for each state is a categorical distribution over chord types. We use a symmetric Dirichlet prior for each state’s emission distribution $\theta_j$, resulting in conjugate updates to $\theta_j$ conditioned on the latent state sequence, $z$.
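This conjugate update is standard; a sketch (ours, with toy chord indices and an assumed symmetric concentration `a0`) of the Dirichlet-categorical step:

```python
import numpy as np

def sample_emission_dists(z, y, K, V, a0=0.1, seed=None):
    """Conjugate Gibbs update for categorical emissions: with a symmetric
    Dirichlet(a0) prior on each state's distribution over the V chord
    types, the posterior given the state path z and observed chord
    indices y is Dirichlet(a0 + per-state counts).
    """
    rng = np.random.default_rng(seed)
    counts = np.zeros((K, V))
    np.add.at(counts, (z, y), 1.0)   # count chord v emitted from state j
    return np.stack([rng.dirichlet(a0 + counts[j]) for j in range(K)])

z = np.array([0, 0, 1, 2, 1])        # toy state path
y = np.array([3, 3, 0, 2, 1])        # toy chord tokens
theta = sample_emission_dists(z, y, K=3, V=4, seed=8)
```

A small `a0` keeps each state's emission distribution concentrated on the chord types it has actually emitted, which suits the long-tailed chord-type distribution described above.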
In this experiment, the locations, $\ell_j$, are independent of the emission parameters, with multivariate Normal priors. We use a Gaussian similarity function of the Euclidean distance between locations. Since the latent states are continuous, we use a Hamiltonian Monte Carlo (HMC) update (Duane et al., 1987; Neal et al., 2011) to update the $\ell_j$ simultaneously, conditioned on $z$ and the auxiliary variables (see Appendix D for details).
Results
We ran 5 Gibbs chains for 10,000 iterations each using the HDP-HMM-LT, Sticky HDP-HMM-LT, HDP-HMM and Sticky HDP-HMM models on the 200 training chorales, which were modeled as conditionally independent of one another. We evaluated the marginal log likelihood on the 17 test chorales (integrating out the latent states) at every 50th iteration. The training and test log likelihoods are in Fig. 4. Although the LT model does not achieve as close a fit to the training data, its generalization performance is better, suggesting that the vanilla HDP-HMM is overfitting. This is perhaps counterintuitive, since the LT model is more flexible, and might be expected to be more prone to overfitting. However, the similarity bias induces greater information sharing across parameters, as in a hierarchical model: instead of each entry of the transition matrix being informed mainly by transitions directly involving the corresponding states, it is informed to some extent by all transitions, as they all inform the similarity structure.
5 Discussion
We have defined a new probabilistic model, the Hierarchical Dirichlet Process Hidden Markov Model with Local Transitions (HDP-HMM-LT), which generalizes the HDP-HMM by allowing state space geometry to be represented via a similarity kernel, making transitions between “nearby” pairs of states (“local” transitions) more likely a priori. By introducing an augmented data representation, which we call the Markov Jump Process with Failed Transitions (MJP-FT), we obtain a Gibbs sampling algorithm that simplifies inference in both the LT and ordinary HDP-HMM. When multiple latent chains are interdependent, as in speaker diarization, the HDP-HMM-LT model combines the HDP-HMM’s capacity to discover a small set of joint states with the Factorial HMM’s ability to encode the property that most transitions involve a small number of chains. The HDP-HMM-LT outperforms both, as well as outperforming the Sticky HDP-HMM, on a speaker diarization task in which speakers form conversational groups. Despite the addition of the similarity kernel, the HDP-HMM-LT is able to suppress its local transition prior when the data does not support it, achieving identical performance to the HDP-HMM on data generated directly from the latter.

The local transition property is particularly clear when transitions occur at different times for different latent features, as with binary vector-valued states in the cocktail party setting, but the model can be used with any state space equipped with a suitable similarity kernel. Similarities need not be defined in terms of emission parameters; state “locations” can be represented and inferred separately, which we demonstrate using Bach chorale data. There, the LT model achieves better predictive performance on a held-out test set, while the ordinary HDP-HMM overfits the training set: the LT property here acts to encourage a concise harmonic representation where chord contexts are arranged in bidirectional functional relationships.
We focused on fixed-dimension binary vectors for the cocktail party and synthetic data experiments, but it would be straightforward to add the LT property to a model with nonparametric latent states, such as the iFHMM (Gael et al., 2009) and the infinite factorial dynamic model (Valera et al., 2015), both of which use the Indian Buffet Process (IBP) (Ghahramani & Griffiths, 2005) as a state prior. The similarity function used here could be employed without changes: since only finitely many coordinates are nonzero in the IBP, the distance between any two states is finite.
Acknowledgments
This work was funded in part by DARPA grant W911NF-14-1-0395 under the Big Mechanism Program and DARPA grant W911NF-16-1-0567 under the Communicating with Computers Program.
References
 Beal et al. (2001) Beal, Matthew J, Ghahramani, Zoubin, and Rasmussen, Carl E. The infinite hidden Markov model. In Advances in neural information processing systems, pp. 577–584, 2001.
 Duane et al. (1987) Duane, Simon, Kennedy, Anthony D, Pendleton, Brian J, and Roweth, Duncan. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.
 Ewens (1990) Ewens, Warren John. Population genetics theory – the past and the future. In Mathematical and Statistical Developments of Evolutionary Theory, pp. 177–227. Springer, 1990.
 Favaro et al. (2013) Favaro, Stefano, Teh, Yee Whye, et al. MCMC for normalized random measure mixture models. Statistical Science, 28(3):335–359, 2013.
 Ferguson (1973) Ferguson, Thomas S. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, pp. 209–230, 1973.

 Fox et al. (2008) Fox, Emily B, Sudderth, Erik B, Jordan, Michael I, and Willsky, Alan S. An HDP-HMM for systems with state persistence. In Proceedings of the 25th International Conference on Machine Learning, pp. 312–319. ACM, 2008.
 Gael et al. (2009) Gael, Jurgen V, Teh, Yee W, and Ghahramani, Zoubin. The infinite factorial hidden Markov model. In Advances in Neural Information Processing Systems, pp. 1697–1704, 2009.
 Ghahramani & Griffiths (2005) Ghahramani, Zoubin and Griffiths, Thomas L. Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, pp. 475–482, 2005.
 Ghahramani et al. (1997) Ghahramani, Zoubin, Jordan, Michael I, and Smyth, Padhraic. Factorial hidden Markov models. Machine Learning, 29(2–3):245–273, 1997.
 Gilks & Wild (1992) Gilks, Walter R and Wild, Pascal. Adaptive rejection sampling for Gibbs sampling. Applied Statistics, pp. 337–348, 1992.
 Ishwaran & Zarepour (2002) Ishwaran, Hemant and Zarepour, Mahmoud. Exact and approximate sum representations for the Dirichlet process. Canadian Journal of Statistics, 30(2):269–283, 2002.
 Johnson & Willsky (2013) Johnson, Matthew J and Willsky, Alan S. Bayesian nonparametric hidden semi-Markov models. The Journal of Machine Learning Research, 14(1):673–701, 2013.
 Kingman (1967) Kingman, John. Completely random measures. Pacific Journal of Mathematics, 21(1):59–78, 1967.
 Neal et al. (2011) Neal, Radford M et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2:113–162, 2011.

 Paisley et al. (2012) Paisley, John, Wang, Chong, and Blei, David M. The discrete infinite logistic normal distribution. Bayesian Analysis, 7(4):997–1034, 2012.
 Ranganath & Blei (2016) Ranganath, Rajesh and Blei, David M. Correlated random measures. Journal of the American Statistical Association, 2016.
 Sethuraman (1994) Sethuraman, Jayaram. A constructive definition of Dirichlet processes. Statistica Sinica, 4:639–650, 1994.
 Teh et al. (2006) Teh, Yee Whye, Jordan, Michael I, Beal, Matthew J, and Blei, David M. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 2006.
 Valera et al. (2015) Valera, Isabel, Ruiz, Francisco, Svensson, Lennart, and Perez-Cruz, Fernando. Infinite factorial dynamical model. In Advances in Neural Information Processing Systems, pp. 1657–1665, 2015.
 Van Gael et al. (2008) Van Gael, Jurgen, Saatci, Yunus, Teh, Yee Whye, and Ghahramani, Zoubin. Beam sampling for the infinite hidden Markov model. In Proceedings of the 25th International Conference on Machine Learning, pp. 1088–1095. ACM, 2008.
 Zhu et al. (2016) Zhu, Hao, Hu, Jinsong, and Leung, Henry. Hidden Markov models with discrete infinite logistic normal distribution priors. In 2016 19th International Conference on Information Fusion (FUSION), pp. 429–433. IEEE, 2016.
Appendices
Appendix A concerns the derivation of the augmented data representation referred to as the “Markov Jump Process with Failed Transitions” (MJP-FT). Appendix B fills in details of the Gibbs sampling steps used to sample the rescaled HDP in the HDP-HMM-LT. Appendix C derives the updates to the binary state vectors in the version of the HDP-HMM-LT used in the cocktail party experiment. Finally, Appendix D details the Hamiltonian Monte Carlo update in the version of the model used in the Bach chorale experiment.
Appendix A Details of the Markov Jump Process with Failed Transitions Representation
We can gain stronger intuition, as well as simplify posterior inference, by recasting the HDP-HMM-LT as a continuous-time Markov jump process in which some attempts to jump from one state to another fail, and where the failure probability increases as a function of the “distance” between the states.
Let be defined as in the last section, and let , and be defined as in the Normalized Gamma Process representation of the ordinary HDP-HMM. That is,
(24)  
(25)  
(26) 
Now suppose that when the process is in state , jumps to state are made at rate . This defines a continuous-time Markov process where the off-diagonal elements of the transition rate matrix are the off-diagonal elements of . In addition, self-jumps are allowed, and occur with rate . If we observe only the jumps and not the durations between them, this is an ordinary Markov chain, whose transition matrix is obtained by appropriately normalizing . If we do not observe the jumps themselves, but instead an observation is generated once per jump from a distribution that depends on the state being jumped to, then we have an ordinary HMM.
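As a toy illustration of the embedded chain described above, the following sketch normalizes a made-up rate matrix (the rate values are arbitrary stand-ins for the paper's stripped notation) and simulates the resulting jump sequence:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state jump-rate matrix: entry R[j, k] is the rate of
# jumps from state j to state k; diagonal entries are self-jump rates.
R = np.array([[1.0, 2.0, 1.0],
              [0.5, 0.5, 3.0],
              [2.0, 1.0, 1.0]])

# If only the jumps are observed (not the holding times), the process
# is an ordinary Markov chain whose transition matrix is obtained by
# normalizing each row of the rate matrix.
P = R / R.sum(axis=1, keepdims=True)

def simulate_jumps(P, s0, n, rng):
    """Simulate n jumps of the embedded Markov chain from state s0."""
    states = [s0]
    for _ in range(n):
        states.append(int(rng.choice(len(P), p=P[states[-1]])))
    return states

path = simulate_jumps(P, 0, 10, rng)
```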
We modify this process as follows. Suppose that each jump attempt from state to state has a chance of failing, which is an increasing function of the “distance” between the states. In particular, let the success probability be (recall that we assumed above that for all ). Then, the rate of successful jumps from to is , and the corresponding rate of unsuccessful jump attempts is . To see this, denote by the total number of jump attempts to in a unit interval of time spent in state . Since we are assuming the process is Markovian, the total number of attempts is distributed. Conditioned on , will be successful, where
(27) 
It is easy to show (and well known) that the marginal distribution of is , and the marginal distribution of is . The rate of successful jumps from state overall is then .
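The success/failure split used here is Poisson thinning, which is easy to check empirically; the attempt rate and success probability below are arbitrary stand-ins for the stripped symbols:

```python
import numpy as np

rng = np.random.default_rng(1)
lam, p, n = 5.0, 0.3, 200_000  # hypothetical attempt rate and success prob.

attempts = rng.poisson(lam, size=n)    # total jump attempts per unit time
successes = rng.binomial(attempts, p)  # each attempt succeeds w.p. p
failures = attempts - successes

# Marginally, successes ~ Poisson(lam * p) and
# failures ~ Poisson(lam * (1 - p)).
```

The empirical means come out near lam * p = 1.5 successes and lam * (1 - p) = 3.5 failures per unit time.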
Let index jumps, so that indicates the th state visited by the process (counting self-jumps as a new time step). Given that the process is in state at discretized time (that is, ), it is a standard property of Markov processes that the probability that the first successful jump is to state (that is, ) is proportional to the rate of successful attempts to , which is .
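The competing-exponentials property invoked here can also be verified numerically (the rates below are arbitrary): the fastest of several independent Exponential clocks determines the next jump, and clock k wins with probability proportional to its rate.

```python
import numpy as np

rng = np.random.default_rng(4)
rates = np.array([1.0, 2.0, 3.0])  # hypothetical successful-jump rates
n = 100_000

# Draw all three competing Exponential clocks for n trials; the
# minimum in each row is the clock that fires first.
clocks = rng.exponential(1.0 / rates, size=(n, 3))
winners = clocks.argmin(axis=1)
freqs = np.bincount(winners, minlength=3) / n
```

The winning frequencies approach rates / rates.sum(), i.e. roughly (1/6, 2/6, 3/6).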
Let indicate the time elapsed between successive successful jumps (where we assume that the first observation occurs when the first successful jump from a distinguished initial state is made). We have
(28) 
where is independent of .
During this period, there will be unsuccessful attempts to jump to state , where
(29) 
Define the following additional variables
(30)  
(31)  
(32) 
and let be the matrix of unsuccessful jump attempt counts, and be the vector of the total times spent in each state.
Since each of the with are i.i.d. , we get the marginal distribution
(33) 
by the standard property that a sum of i.i.d. Exponential random variables has a Gamma distribution with shape equal to the number of variates in the sum and rate equal to the rate of the individual exponentials. Moreover, since the with are Poisson distributed, the total number of failed attempts in the total duration is
(34)
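The Exponential-to-Gamma summation property used in (33) can be verified by simulation; the rate and number of summands below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
rate, k, n = 2.0, 5, 100_000  # hypothetical rate and number of summands

# Sum of k i.i.d. Exponential(rate) variables ~ Gamma(shape=k, rate=rate).
# NumPy parameterizes both distributions by scale = 1 / rate.
sums = rng.exponential(scale=1.0 / rate, size=(n, k)).sum(axis=1)

# Compare against direct Gamma draws with the same parameters.
gammas = rng.gamma(shape=k, scale=1.0 / rate, size=n)
```

Both samples have mean k / rate = 2.5 and variance k / rate**2 = 1.25, matching the Gamma(5, 2) distribution.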
Thus if we marginalize out the individual and , we have a joint distribution over , , and , conditioned on the transition rate matrix and the success probability matrix , which is
(35)  
(36)  
(37)  
(38)  
(39)  
(40) 
Setting aside terms that do not depend on , we get the conditional likelihood function used in sampling :
(41) 
which, combined with the independent Gamma priors on , yields conditionally independent Gamma posteriors:
(42) 
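The Gamma-conjugate update in (42) can be sketched as a single Gibbs step; the variable names below are hypothetical, standing in for the notation stripped above:

```python
import numpy as np

def sample_rate_posterior(a0, b0, n_jumps, total_time, rng):
    """Conjugate Gibbs step for a single transition rate.

    A Gamma(a0, b0) prior times a likelihood proportional to
    rate**n_jumps * exp(-rate * total_time) gives a
    Gamma(a0 + n_jumps, b0 + total_time) posterior.
    (NumPy parameterizes Gamma by shape and scale = 1 / rate.)
    """
    return rng.gamma(shape=a0 + n_jumps, scale=1.0 / (b0 + total_time))

rng = np.random.default_rng(3)
draw = sample_rate_posterior(1.0, 1.0, n_jumps=12, total_time=4.0, rng=rng)
```

The posterior mean is (a0 + n_jumps) / (b0 + total_time), so with the values above repeated draws average to 13 / 5 = 2.6.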
Appendix B Inference details for hyperparameters of the rescaled HDP
B.1 Sampling , , and
The joint conditional over , , and given the augmented data factors as
(43) 
We will derive these four factors in reverse order.
Sampling
The entries in are conditionally independent given and , so we have the prior
(44) 
and the likelihood given by (40). Combining these, we have
(45)  
(46) 
Conditioning on everything except , we get
(47) 
and thus we see that the are conditionally independent given , and , and distributed according to
(48) 
Sampling
Consider the conditional distribution of having integrated out . The prior density of is
(49) 
After integrating out in (45), we have