Existing sampling algorithms for infinite hidden Markov models (iHMMs, also known as hierarchical Dirichlet process HMMs) (beal2002infinite; teh2006hierarchical) do not use a hierarchical Chinese restaurant process (HCRP) (teh2006hierarchical), which represents the predictive distribution of a hierarchical Dirichlet process (HDP) by collapsing, i.e. integrating out, the underlying distributions of the Dirichlet processes (DPs). While the HCRP representation provides efficient sampling for many other HDP-based models (teh2006bayesian; mochihashi2008infinite) by reducing the dimension of the sampling space, it has been considered rather "awkward" (teh2006hierarchical)
to use an HCRP for iHMMs, due to the difficulty in handling coupling between random variables. In the simplest case, consider step-wise Gibbs sampling from an iHMM defined as $z_t \sim \pi_{z_{t-1}}$ and $x_t \sim F(\theta_{z_t})$. Given the other states, resampling hidden state $z_t$ at time step $t$ actually consists of two draws (Figure 1), $z_t \sim \pi_{z_{t-1}}$ and $z_{t+1} \sim \pi_{z_t}$, under the restriction that these draws are consistent with the following sequence, i.e., the second draw must reproduce the current value of $z_{t+1}$. Under the HCRP, the two draws are coupled even if they come from different restaurants, because the transition distributions $\pi_k$, as well as their base measure, are integrated out in an HCRP, and this coupling complicates sampling from the restricted distribution.
To generalize, the main difficulty is to obtain a sample from a restricted joint distribution of simultaneous draws from collapsed distributions, which we call a restricted collapsed draw (RCD). Consider resampling $n$ draws simultaneously, $s_1, \dots, s_n$, from the respective restaurants $r_1, \dots, r_n$, when we have a restriction on the admissible combinations of draws. Step-wise Gibbs sampling from an iHMM can be fitted into an RCD with $n = 2$ by allowing a restaurant index $r_i$ to depend on the preceding draw $s_{i-1}$.
In this paper, we point out that it is not enough to consider the distribution of the draws. Since the HCRP introduces an additional set of latent variables that accounts for the seating arrangements of the restaurants, we have to compute an exact distribution of the seating arrangements as well, under the restriction. We want to sample from the following conditional distribution,
where $Z$ is a normalization constant and
the indicator function takes the value 1 if the condition is true and 0 otherwise. Although the non-restricted probability can easily be calculated for a given set of draws and seating arrangements, calculating the normalization constant leads to a combinatorial explosion.
To solve this issue, we propose the restricted collapsed draw (RCD) sampler, which provides accurate distributions of simultaneous draws and seating arrangements from an HCRP. The RCD sampler constructs a proposal of seating arrangements using a given proposal of draws, and the pair of proposals is stochastically accepted by the Metropolis-Hastings algorithm (hastings1970monte). Since the RCD sampler can handle any combination of restricted collapsed draws simultaneously, we were able to develop a series of sampling methods for the HCRP-HMM, including a blocked collapsed Gibbs sampler, a collapsed beam sampler, and a split-merge sampler. Through experiments we found that our collapsed samplers outperformed their non-collapsed counterparts.
2 HCRP representation for iHMM
2.1 Infinite HMM
An infinite hidden Markov model (iHMM) (beal2002infinite; teh2006hierarchical) is defined over the following HDP:
To see the relation of this HDP to the transition matrix, consider the explicit representation of the parameters:
where the transition probability from state $j$ to state $k$ is given as $\pi_{jk}$, $\mathrm{GEM}$ is the stick-breaking construction of DPs (sethuraman1994constructive), and $H$ is the prior over emission parameters.
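For reference, $\mathrm{GEM}(\gamma)$ denotes the standard stick-breaking construction, which generates an infinite sequence of weights as:

```latex
\beta'_k \sim \mathrm{Beta}(1, \gamma), \qquad
\beta_k = \beta'_k \prod_{l=1}^{k-1} \left(1 - \beta'_l\right), \qquad k = 1, 2, \ldots
```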
A formal definition for the HDP based on this representation is:
Given an HDP and an initial state $z_0$, we can construct an infinite HMM by extracting a sequence of draws $z_t \sim \pi_{z_{t-1}}$ and corresponding observations $x_t \sim F(\theta_{z_t})$. Figure 2 shows a graphical representation of the iHMM.
As another way of representing the HDP in the iHMM (Eq. 2), we introduce the hierarchical Chinese restaurant process (HCRP, also known as the Chinese restaurant franchise), which does not need to sample the transition distributions and their base measure in Eq. (4):
Using the Chinese restaurant metaphor, we say that customer $i$ of restaurant $j$ sits at a table, and each table serves a dish with a state index.
To understand the connection between the HDP and the HCRP, consider a finite model of grouped observations, in which each group chooses a subset of mixture components from a model-wide set of mixture components:
As $K \to \infty$ and $M \to \infty$, the limit of this model is the HCRP; hence the infinite limit of this model is also an HDP. Equation (6) is derived by taking the infinite limit of $K$ and $M$ after integrating out the mixing weights in Eqs. (9) and (10). The distribution in Eq. (4) can then be derived as follows:
To consider sampling using the HCRP (Eqs. 7 and 8), we use count notation: the number of customers in restaurant $j$ at table $t$ served the dish of the $k$-th entry, and the number of tables in restaurant $j$ serving the dish of the $k$-th entry. We also use dots for marginal counts. Then, we sample a table index from the following distribution:
When the customer sits at a new table, we need to sample its dish, whose distribution is:
These variables determine the new sample. Since a draw alone does not uniquely determine the state of the HCRP model, we need to keep the table and dish assignments for subsequent sampling. We will denote the seating arrangement of each restaurant, the seating arrangement of the root restaurant, and
the collection of all seating arrangements, corresponding to the sampled model state. In Bayesian inference based on sampling, we need a procedure that samples these latent variables given the value of a new draw and the seating arrangements for the other draws; this procedure is called addCustomer.
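To make the metaphor concrete, the following is an illustrative sketch of one seating update in the spirit of addCustomer. The data structures and names are ours, not the paper's: a restaurant is a dict from dish to per-table customer counts, and the root restaurant is summarized only by its per-dish table counts.

```python
import math
import random

def add_customer(tables, root_tables, dish, alpha, root_prob):
    # tables: dict dish -> list of customer counts, one entry per table
    # root_tables: dict dish -> number of tables serving `dish` (each table
    #   sends one "customer" to the root restaurant in the franchise metaphor)
    # root_prob: callable giving the root predictive probability of `dish`
    # Returns the log-probability of the seating decision made here, which
    # is exactly the kind of factor accumulated for the seating proposal.
    counts = tables.setdefault(dish, [])
    weights = counts + [alpha * root_prob(dish)]   # existing tables + new table
    total = sum(weights)
    r = random.random() * total
    for t, w in enumerate(weights):
        r -= w
        if r < 0.0:
            break
    if t == len(counts):                           # customer opens a new table
        counts.append(1)
        root_tables[dish] = root_tables.get(dish, 0) + 1
    else:                                          # joins an existing table
        counts[t] += 1
    return math.log(weights[t] / total)
```

A removeCustomer counterpart would decrement a table (deleting it, and its root customer, when it empties); the same per-decision probabilities give the reverse proposal discussed in Section 3.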
The construction of the HCRP-HMM is the same as that of the iHMM: we extract a sequence of draws, each from the restaurant indexed by the previous state, together with the corresponding observations.
3 Restricted Collapsed Draw Sampler
What we want is a sampling algorithm for the HCRP-HMM. As described in the Introduction, the problem reduces to sampling from the distribution of a restricted collapsed draw together with its seating arrangements (Eq. 1).
Our idea is to apply the Metropolis-Hastings algorithm (hastings1970monte) to the seating arrangements, stochastically accepting a proposed seating arrangement. Although it is hard to specify a proposal distribution over seating arrangements directly, our method constructs one by combining a proposal of the draws with a proposal of the seating arrangements given the proposed draws, the latter based on the procedure that is standardly used in Gibbs sampling of the HCRP.
3.1 Overall sampling
The Metropolis-Hastings algorithm provides a way of constructing an MCMC sampler using an unnormalized probability value. After sampling a candidate from the proposal distribution, the algorithm computes the acceptance probability:
Then the candidate is accepted with that probability, and the previous state is kept otherwise. Repeating this process constitutes an MCMC sampler for the required distribution.
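As a generic illustration (not the paper's algorithm itself), one Metropolis-Hastings step with an unnormalized log target and an arbitrary proposal can be written as:

```python
import math
import random

def mh_step(x, log_pi, propose, log_q):
    # log_pi: unnormalized log target density
    # propose(x) -> candidate x'
    # log_q(x_to, x_from): log proposal density of moving x_from -> x_to
    x_new = propose(x)
    log_a = (log_pi(x_new) + log_q(x, x_new)) \
          - (log_pi(x) + log_q(x_new, x))
    if math.log(random.random()) < min(0.0, log_a):
        return x_new          # accept the candidate
    return x                  # reject: keep the previous state
```

Repeated application leaves the normalized target invariant, which is all the RCD sampler needs from this building block.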
Within the context of the HCRP, the sample space consists of the draws and the seating arrangements. From Eq. (1), we can use the non-restricted probability of the draws as the unnormalized probability value, but it is not easy to provide a proposal for the joint distribution of draws and seating arrangements.
Our idea is to factorize the proposal distribution as:
The first factor is the proposal distribution of the draws. The second factor is the proposal distribution of the seating arrangements given the proposed draws. We use the outcome of the addCustomer procedure, which stochastically updates the seating arrangements, as the proposal distribution of the seating arrangements.
3.2 Computing Factors
The following describes each factor and its computation.
The unnormalized probability in Eq. (1) is the joint probability of all the draws:
In practice, we only need to calculate the probabilities that account for the customers whose seating changes, because the probability of the unchanged customers cancels out in the acceptance ratio. Fixing the seating arrangement of the unchanged customers, we have
In fact, this probability is easily calculated along with the addCustomer operations:
Here, each factor is the probability of obtaining the resulting seating arrangement from drawing a sample from the corresponding restaurant:
The same applies to the calculation of the reverse probability, which can be done along with the removeCustomer operations.
Proposal Distribution of Draws
The proposal distribution of the draws can be anything as long as it is ergodic within the restriction. To increase the acceptance probability, however, it is preferable for the proposal distribution to be close to the true distribution. We suggest that a good starting point is a joint distribution composed of the predictive distributions of each draw, as is done in the approximated Gibbs sampler (beal2002infinite):
We will again discuss the proposal distribution of draws for the HCRP-HMM case in Section 4.
Proposal Distribution of Seating Arrangements
The proposal probability of the seating arrangements is the product of the probabilities of the individual add-customer operations:
Here, each factor is the probability of obtaining the resulting seating arrangement from one such operation. We can sum over the table indices to simplify the calculation as:
In other words, the calculation of the acceptance ratio does not use the table index of each customer, despite the fact that table indices are being proposed; the table indices indirectly affect the acceptance ratio by changing subsequent draw probabilities through the numbers of customers and tables.
It is now clear that, as in some previous work (teh2006bayesian), we can save storage space by using an alternative representation of the seating arrangements, in which the table index of each customer is forgotten and only the numbers of customers and tables are retained. The only remaining reference to table indices in the procedure can be safely replaced by sampling.
However, it should be noted that we have to revert to the original seating arrangement whenever the proposal is rejected. Putting the old draws back by running the addCustomer procedure again would lead the sampler to an incorrect distribution of seating arrangements and, consequently, an incorrect distribution of draws.
Algorithm 1 is our proposed procedure that obtains new samples drawn simultaneously from the indexed restaurants, together with the associated seating arrangements, given the previous samples.
The first half of this sampler is similar to a sampler for a single draw; it consists of removing the old customers (line 3), choosing a new sample (line 7), and adding the customers again (line 10). The main differences are that the remove and add operations are iterated once per draw, and that the proposal probabilities are accumulated for later use in the acceptance probability.
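A rough Python sketch of this procedure follows. The helper names and state layout are ours, not the paper's: `remove_customers` returns the reverse seating proposal probability (computed while removing, as noted above), `propose_draws` returns the proposed draws with forward and reverse draw-proposal probabilities, and `add_customers` seats the new customers and returns the forward seating proposal probability. A deep copy stands in for the exact revert on rejection.

```python
import copy
import math
import random

def rcd_step(state, propose_draws, remove_customers, add_customers, log_p_draws):
    # Snapshot the full state so the exact original seating can be restored
    # on rejection; re-running addCustomer instead would bias the seating
    # distribution, as cautioned above.
    backup = copy.deepcopy(state)
    old_draws = state.draws
    log_q_seat_rev = remove_customers(state)          # reverse seating proposal
    new_draws, log_q_fwd, log_q_rev = propose_draws(state)
    log_q_seat_fwd = add_customers(state, new_draws)  # forward seating proposal
    log_a = (log_p_draws(new_draws) + log_q_rev + log_q_seat_rev) \
          - (log_p_draws(old_draws) + log_q_fwd + log_q_seat_fwd)
    if math.log(random.random()) < min(0.0, log_a):
        return state      # accept proposed draws and seating
    return backup         # reject: restore the original seating exactly
```

With trivial stubs for the seating operations this reduces to plain Metropolis-Hastings on the draws, which is a convenient sanity check.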
4 Gibbs sampler for HCRP-HMM
This section describes a series of samplers for HCRP-HMM. First, we present the step-wise Gibbs sampler as the simplest example. After that, we describe a blocked Gibbs sampler using a forward-backward algorithm. We also explain the HCRP version of the beam sampler (vangael2008beam) as well as the split-merge sampler (jain2000split) for iHMM.
4.1 Step-wise Gibbs sampler
A step-wise Gibbs sampler for the HCRP-HMM is easily constructed using the RCD sampler (Algorithm 5 in the Appendix describes one Gibbs sweep). We slightly modified the proposal distribution from that suggested in Section 3.2, in order to ensure that every state is proposed with non-zero probability even when no table serves the corresponding dish:
4.2 Blocked Gibbs sampler
We can construct an alternative sampling scheme under the RCD framework that resamples a block of hidden states simultaneously, based on the forward-backward sampler (scott2002bayesian). The idea is to run the forward-backward sampler with the predictive transition distribution of the HCRP-HMM, and use the result as the proposal of the restricted collapsed draw.
For the iHMM, the forward-backward sampling algorithm (scott2002bayesian) cannot be used directly, because forward probabilities for an infinite number of states would have to be stored at each time step (vangael2008beam). This is not the case for the HCRP-HMM, because the predictive transition distribution given the seating arrangements, given in Eqs. (31) and (32), contains transition probabilities for only a finite number of states plus one aggregate probability for all unseen states. Thus we only need to store finitely many forward probabilities at each time step.
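As a concrete reference, forward filtering with backward sampling for a finite transition matrix (the situation above, once the HCRP predictive probabilities are collected into a finite matrix) can be sketched as follows; the data layout is our own illustration.

```python
import random

def forward_backward_sample(trans, emit, obs, init):
    # trans[i][j]: transition probability i -> j (here: the finite-support
    #   HCRP predictive probabilities, plus one aggregate "unseen" entry)
    # emit[j][o]: emission probability; obs: observation indices
    # init[j]: initial state probability
    K, T = len(trans), len(obs)
    # Forward pass: alpha[t][j] proportional to p(o_1..o_t, z_t = j).
    alpha = [[0.0] * K for _ in range(T)]
    for j in range(K):
        alpha[0][j] = init[j] * emit[j][obs[0]]
    for t in range(1, T):
        for j in range(K):
            alpha[t][j] = emit[j][obs[t]] * sum(
                alpha[t - 1][i] * trans[i][j] for i in range(K))

    def draw(weights):                       # sample an index prop. to weights
        r = random.random() * sum(weights)
        for k, w in enumerate(weights):
            r -= w
            if r < 0.0:
                return k
        return len(weights) - 1

    # Backward pass: draw z_T, then z_t given z_{t+1}.
    z = [0] * T
    z[T - 1] = draw(alpha[T - 1])
    for t in range(T - 2, -1, -1):
        z[t] = draw([alpha[t][i] * trans[i][z[t + 1]] for i in range(K)])
    return z
```

The per-step sampling probabilities, multiplied together, give the block proposal probability used in the acceptance ratio.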
The result of the forward-backward sampler, however, cannot be used directly as the proposal: the $t$-th state of the proposal equals the sampled state when it is a known state, but we need to assign new state indices whenever the unseen-state outcome is drawn. In particular, when the unseen-state outcome appears several times, its appearances may all refer to the same new state, or to distinct new states, or to anything in between, in which some appearances share a new state.
To achieve this, we prepare a special CRP that accounts for the previously unseen states marked in the result of the forward-backward sampler. Specifically, each table in this CRP has a dish with an unused state index, and each appearance of the unseen-state marker is replaced with a draw from this CRP. This construction ensures that every state sequence is proposed with non-zero probability, and allows the proposal probability to be calculated easily. The concentration parameter of this CRP is set equal to that of the root restaurant. To handle the case where some of the new states are equal to the state that succeeds the resampled block, we add to this CRP an extra customer that corresponds to that state when it does not appear elsewhere in the sequence,
The resulting proposal probability is:
where the first factor accounts for the forward probability of the sequence, and the second factor accounts for the probability of the new state assignment.
Note also that, to make a sensible proposal distribution, we cannot resample the whole state sequence at once. We need to divide the state sequence into several blocks, and resample each block given the other blocks. The block size affects efficiency: blocks that are too large have low acceptance probability, while with blocks that are too small the algorithm has little advantage over step-wise Gibbs sampling.
Algorithm 8 in the Appendix describes one sweep of a blocked Gibbs sampler for an HCRP-HMM.
4.3 Beam sampling
Beam sampling for the HDP-HMM (vangael2008beam) uses slice sampling (neal2003slice) on the transition probabilities to extract a finite subset of the state space. Although the set of possible states is already finite in the HCRP-HMM, the same technique may still benefit HCRP-HMM sampling by reducing the number of states considered during one sampling step.
To use beam sampling with the HCRP-HMM, we need only replace the call to the forward-backward proposal in Algorithm 8 with a call to the beam-sampling proposal. A brief overview of beam sampling is:
Sample an auxiliary slice variable for each time step,
calculate the forward probabilities using only the transitions whose probability exceeds the slice variable,
and sample the states backward, as in the forward-backward sampler.
For details, refer to the original paper (vangael2008beam).
Some remarks are needed on the calculation of the proposal probability for the state sequence. Although beam sampling has a different proposal distribution from forward-backward sampling, we can use the same calculation of the proposal probability in the acceptance probability as for forward-backward sampling. This is because beam sampling satisfies the detailed balance condition, which ensures that the ratio of proposal probabilities under beam sampling is always equal to the corresponding ratio obtained by forward-backward sampling.
4.4 Split-Merge Sampling
We can integrate split-merge sampling, another sampling approach for Dirichlet process mixture models (jain2000split), into the HCRP-HMM using the RCD sampler. A split-merge sampler proposes a move that merges two mixture components into one, or splits one component into two; the sampler then uses a Metropolis-Hastings step to stochastically accept the proposal. Based on the RCD framework, we can extend the split-merge sampler to the HCRP, and hence apply it to the HCRP-HMM. In the context of an HMM, the sampler merges two state indices into one, or splits one state index into two.
Our implementation is based on an improved version of the split-merge sampler, called the sequentially-allocated merge-split sampler (dahl2005sequentially-allocated), which produces a split proposal by sequentially allocating components in random order. To deal with the temporal dependency in an HMM, we identify the fragments of the state sequence to be resampled, and perform blocked Gibbs sampling for each fragment in random order.
We added one important optimization to the split-merge sampling algorithm. Since a merge move is proposed much more frequently than a split move, and has a relatively low acceptance probability, it is beneficial to determine early whether a merge move will be rejected. We point out that, as the proposal probability for a merge move is calculated factor by factor, the running acceptance probability is monotonically decreasing. Consequently, we sample the acceptance threshold at the beginning of the algorithm and stop further calculation as soon as the running acceptance probability falls below it. Algorithm 9 in the Appendix is the split-merge sampling algorithm for the HCRP-HMM.
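The early-stopping trick can be sketched generically as follows; this is our illustration, and it assumes the acceptance ratio factorizes into per-step log terms that are each non-positive (the monotonicity property noted above).

```python
import math
import random

def merge_accept_with_early_stop(log_ratio_terms):
    # log_ratio_terms: iterable yielding the log factors of the acceptance
    # ratio one at a time; for a merge move each factor is assumed <= 0 in
    # log space, so the running acceptance probability only decreases.
    log_u = math.log(random.random())    # acceptance threshold, drawn up front
    log_a = 0.0
    for term in log_ratio_terms:
        log_a += term
        if log_a < log_u:
            return False                 # early reject: skip remaining terms
    return True                          # survived all factors: accept
```

Drawing the uniform threshold first is what makes the early exit exact rather than approximate: the final decision is identical to comparing the fully computed ratio against the same threshold.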
Split-merge sampling allows faster mixing when it is interleaved with other sampling strategies. We examine split-merge sampling with each of the samplers we have presented in this paper.
5 Experiments and Discussion
This section presents two series of experiments, the first with small artificial sequences and the second with a sequence of natural language words.
We put gamma priors on the concentration parameters and resampled them between sweeps using an auxiliary variable method (teh2006hierarchical)
in all the experiments. We also introduced an HCRP as the prior of the emission distributions, and its hyperparameters were sampled in the same way.
The initial state sequence given to the sampler is the result of a particle filter with 100 particles.
We measured the autocorrelation time (ACT) to evaluate mixing. Given a sequence of values, its mean
and variance are defined as follows:
Since the autocorrelation is expected to converge to zero at larger lags, we truncated the summation at a fixed maximum lag.
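A minimal sketch of this ACT estimate in its truncated-sum form; the truncation lag is left as a free parameter here, since the paper's specific choice is not reproduced.

```python
def autocorrelation_time(xs, max_lag):
    # ACT estimated as 1 + 2 * sum_{k=1}^{max_lag} rho_k, where rho_k is
    # the lag-k autocorrelation; the sum is truncated because rho_k is
    # expected to decay toward zero for large k.
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    def rho(k):
        cov = sum((xs[t] - mean) * (xs[t + k] - mean) for t in range(n - k)) / n
        return cov / var
    return 1.0 + 2.0 * sum(rho(k) for k in range(1, max_lag + 1))
```

For nearly independent samples the estimate is close to 1; strong positive correlation inflates it, indicating slower mixing.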
For the artificial sequences, we evaluated the mutual information between the hidden states used in sequence generation and the inferred states, as follows:
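The mutual information used here is the standard quantity, estimated from the empirical joint distribution of the two labelings $Z$ (generating states) and $\hat{Z}$ (inferred states):

```latex
I(Z; \hat{Z}) = \sum_{z, \hat{z}} p(z, \hat{z}) \log \frac{p(z, \hat{z})}{p(z)\, p(\hat{z})}
```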
For natural language text, the inferred model is evaluated by multiple runs of a particle filter on a given test sequence. Specifically, we construct a particle filter for each sampled model state and evaluate the likelihood
of each emission. Finally, we calculate the perplexity (the reciprocal geometric mean of the emission probabilities) of the test sequence:
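A small helper illustrating this perplexity calculation, done in log space for numerical stability:

```python
import math

def perplexity(emission_probs):
    # Reciprocal geometric mean of the per-emission probabilities:
    # exp(-(1/n) * sum_t log p_t). Lower is better.
    n = len(emission_probs)
    return math.exp(-sum(math.log(p) for p in emission_probs) / n)
```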
The samplers we chose for comparison are: the step-wise Gibbs sampler with the direct assignment representation (teh2006hierarchical), which uses stick-breaking for the root DP and the CRP for the other DPs; the step-wise Gibbs sampler with the stick-breaking construction; and the beam sampler with the stick-breaking construction (vangael2008beam). For a fair comparison between algorithms, we collected the samples used to evaluate autocorrelation time and perplexity on a CPU-time basis (excluding the time used for evaluation). All algorithms were implemented in C++ and tested on machines with an Intel Xeon E5450 at 3 GHz.
5.2 Artificial data
The first series of experiments is performed with two small artificial sequences. Sequence 1 consists of the repeating symbol pattern A-B-C-D-B-C-D-E-…; we run the sampler for 30 s of burn-in, after which a model state is sampled every 2 s until a total of 300 s is reached. Sequence 2 is generated from the simple finite-state automaton in Figure 3; for it we use 60 s of burn-in and a total of 600 s. We evaluated the mutual information between the inferred hidden states and the true hidden states.
Figure 4 shows the distribution of mutual information over 100 trials after 300 s. Some of the samplers based on the proposed method achieve better mutual information than the existing samplers; the improvement depends on the type of sequence and the sampler.
For Sequence 1, split-merge sampling yields better results than the other samplers. Although an HMM with eight hidden states can perfectly predict the sequence, the samplers tend to be trapped in a local optimum with five states in the initial phase, because our selected prior places larger probability on a smaller number of hidden states. Detailed investigations (Figure 5) confirmed this analysis.
For Sequence 2, on the other hand, the blocked samplers worked very efficiently. Step-wise samplers generally worked poorly on this sequence, because the strong dependency between temporally adjacent states impedes mixing. Still, the step-wise Gibbs sampler for the HCRP-HMM outperformed the beam sampler with the stick-breaking construction. The blocked Gibbs sampler performed worse owing to its heavy computation for a large number of states, but the beam sampler for the HCRP-HMM was efficient and performed well. Combining it with a small number of split-merge moves increases performance (more split-merge sampling lowers performance by taking computational resources away from the beam sampler). From the averaged statistics of the samplers (Table 1), we can see that (1) the increase in mutual information cannot be explained only by the increase in the number of states; (2) the Gibbs proposals have a very high acceptance rate; and (3) the split-merge proposals have a very low acceptance rate, but still improve mutual information.
[Figure 4: panels (a) Sequence 1, (b) Sequence 2]
[Figure 5: panels (a) HDP-HMM (SB) SGibbs, (b) HDP-HMM (DA) SGibbs, (c) HDP-HMM (SB) Beam, (d)-(f) HCRP-HMM Beam with Split-Merge 3/sweep and 25/sweep]
| name | MI | ACT | #states | secs/sweep | Gibbs accept rate | SM accept rate |
|---|---|---|---|---|---|---|
| HDP-HMM (DA) SGibbs | 2.92 | 0.527 | 14.910 | 0.044 | — | — |
| HDP-HMM (SB) Beam | 3.04 | 0.640 | 14.720 | 0.032 | — | — |
| HCRP-HMM SGibbs +SM=2 | 3.28 | 0.615 | 17.190 | 0.030 | 0.999632 | 0.000631 |
| HCRP-HMM SGibbs +SM=13 | 3.19 | 0.493 | 16.830 | 0.044 | 0.999619 | 0.000650 |
| HCRP-HMM SSlice +SM=2 | 2.86 | 0.705 | 17.330 | 0.013 | 0.999822 | 0.000827 |
| HCRP-HMM SSlice +SM=13 | 2.59 | 0.604 | 16.400 | 0.030 | 0.999830 | 0.000910 |
| HCRP-HMM BGibbs +SM=2 | 3.12 | 0.513 | 16.270 | 0.237 | 0.995135 | 0.000985 |
| HCRP-HMM BGibbs +SM=13 | 3.18 | 0.637 | 16.530 | 0.265 | 0.994715 | 0.000715 |
| HCRP-HMM Beam +SM=2 | 3.37 | 0.898 | 16.910 | 0.019 | 0.996369 | 0.000497 |
| HCRP-HMM Beam +SM=13 | 3.28 | 0.875 | 17.070 | 0.034 | 0.996316 | 0.000532 |

DA: direct assignment; SB: stick-breaking construction. MI: mutual information. ACT: autocorrelation time (samples collected every 0.1 s). #states: number of states. SGibbs: step-wise Gibbs. SSlice: step-wise Gibbs with slice sampling (beam sampling with block size 1). BGibbs: blocked Gibbs (block size 8). Beam: beam sampling (block size 8 for HCRP-HMM). SM: split-merge sampler (+SM=n denotes n SM trials per Gibbs sweep).
5.3 Natural language text
We also tested the samplers using a sequence of natural language words from Alice's Adventures in Wonderland. We converted the text to lower case, removed punctuation, and placed a special word EOS after every sentence; we kept the last 1,000 words as the test corpus and trained on the remaining sequence. We introduced a special word UNK (unknown) to replace every word that occurred only once, reducing the number of unique words in the text. We used 10,000 s for burn-in, and sampled a model state every 120 s, up to a total of 172,800 s. Table 2 summarizes the averaged statistics over 18 trials.
| Sampler | Perplexity | #states | sec/sweep | Gibbs accept rate | SM accept rate |
|---|---|---|---|---|---|
| HDP-HMM (DA) SGibbs | 134.22 | 313.056 | 10.017 | — | — |
| HDP-HMM (SB) SGibbs | 151.10 | 242.389 | 38.045 | — | — |
| HDP-HMM (SB) Beam | 178.59 | 68.444 | 16.126 | — | — |

For the HCRP-HMM samplers, a fixed block size was used.
We found that step-wise sampling outperformed blocked sampling (including beam sampling). The reason for this may be the nature of the sequence, which has lower temporal dependency. Blocked Gibbs sampling, in particular, consumes too much time per sweep to be of practical use. We also found that split-merge sampling had a very low acceptance rate and thus made little contribution to the results.
Still, we can see the advantage of the HCRP representation over the stick-breaking representation. The direct assignment (DA) algorithm showed competitively good perplexity, reflecting the fact that DA uses stick-breaking only for the root DP and the CRP representation for the other DPs. Although step-wise Gibbs sampling and its slice-sampling variant seem to outperform DA slightly, we would need to collect more data to show that the difference is significant. At the least, however, many sampling algorithms are now available for inference, and we can choose a suitable one depending on the nature of the sequence.
6 Conclusion and Future Work
We have proposed a method of sampling directly from constrained distributions of simultaneous draws from a hierarchical Chinese restaurant process (HCRP). We pointed out that, to obtain a correct sample distribution, the seating arrangements (partitioning) must be correctly sampled for restricted collapsed draw, and we thus proposed applying the Metropolis-Hastings algorithm to the seating arrangements. Our algorithm, called the Restricted Collapsed Draw (RCD) sampler, uses a naïve sampler to provide a proposal distribution for seating arrangements. Based on the sampler, we developed various sampling algorithms for HDP-HMM based on HCRP representation, including blocked Gibbs sampling, beam sampling, and split-merge sampling.
The applications of the RCD sampler, which is at the heart of our algorithms, are not limited to HCRP-HMM. The experimental results revealed that some of the proposed algorithms outperform existing sampling methods, indicating that the benefits of using a collapsed representation exceed the cost of rejecting proposals.
The main contribution of this study is that it opens a way of developing more complex Bayesian models based on CRPs. Since the RCD sampler is simple, flexible, and independent of the particular structure of a hierarchy, it can be applied to any combination or hierarchical structure of CRPs. Our future work includes using this algorithm to construct new Bayesian models based on hierarchical CRPs, which are hard to implement using a non-collapsed representation. Planned work includes extending HDP-IOHMM (doshivelez2009infinite) with a three-level hierarchical DP (e.g., the second level could correspond to actions, and the third level, to input symbols).
Appendix A Miscellaneous Algorithms
Appendix B Step-wise Gibbs sampler
To manipulate the emission probabilities
with a conjugate prior, we introduced notation similar to that of the HCRP, which can be understood intuitively.