1 Introduction
Existing sampling algorithms for infinite hidden Markov models (iHMMs, also known as hierarchical Dirichlet process HMMs) (beal2002infinite; teh2006hierarchical) do not use a hierarchical Chinese restaurant process (HCRP) (teh2006hierarchical), which represents the predictive distribution of a hierarchical Dirichlet process (HDP) by collapsing, i.e., integrating out, the underlying distributions of the Dirichlet process (DP). While the HCRP representation provides efficient sampling for many other HDP-based models (teh2006bayesian; mochihashi2008infinite) by reducing the dimension of the sampling space, it has been considered rather “awkward” (teh2006hierarchical)
to use an HCRP for iHMMs, due to the difficulty of handling coupling between random variables. In the simplest case, consider stepwise Gibbs sampling from an iHMM defined as
$z_t \mid z_{t-1} \sim \pi_{z_{t-1}}$ and $x_t \mid z_t \sim F(\phi_{z_t})$. Given $z_{t-1}$ and $z_{t+1}$, resampling the hidden state $z_t$ at time step $t$ actually consists of two draws (Figure 1), $z_t \sim \pi_{z_{t-1}}$ and $z_{t+1} \sim \pi_{z_t}$, under the restriction that these draws are consistent with the following sequence, i.e., the second draw must equal the existing $z_{t+1}$. Under the HCRP, the two draws are coupled even if $z_{t-1} \neq z_t$, because the distributions $\pi_k$, as well as the base measure, are integrated out, and this coupling complicates sampling from the restricted distribution. To generalize, the main part of the difficulty is to obtain a sample from a restricted joint distribution of simultaneous draws from collapsed distributions, which we call a
restricted collapsed draw (RCD). Consider resampling $n$ draws simultaneously, $s_1, \dots, s_n$, from the respective restaurants $r_1, \dots, r_n$, when we have a restriction such that $(s_1, \dots, s_n) \in \mathcal{C}$. Stepwise Gibbs sampling from an iHMM can be fitted into an RCD with $n = 2$ by allowing the restaurant index $r_i$ to depend on the preceding draw $s_{i-1}$. In this paper, we point out that it is not enough to consider the distribution of the draws alone. Since the HCRP introduces an additional set of latent variables that accounts for the seating arrangements of the restaurants, we have to compute an exact distribution of the seating arrangements as well, under the restriction. We want to perform sampling from the following conditional distribution,
$$p(s_1, \dots, s_n, \boldsymbol{\rho} \mid \mathcal{C}) = \frac{1}{Z}\, \mathbb{1}\!\left[(s_1, \dots, s_n) \in \mathcal{C}\right]\, p(s_1, \dots, s_n, \boldsymbol{\rho}) \qquad (1)$$
where $Z$ is a normalization constant and $\mathbb{1}[\cdot]$
is the indicator function, whose value is 1 if the condition is true and 0 otherwise. Although the nonrestricted probability $p(s_1, \dots, s_n, \boldsymbol{\rho})$
can easily be calculated for given draws and seating arrangements, calculating the normalization constant $Z$ leads to a combinatorial explosion in terms of $n$. To solve this issue, we propose the restricted collapsed draw (RCD) sampler, which provides accurate distributions of simultaneous draws and seating arrangements from an HCRP. The RCD sampler constructs a proposal of seating arrangements from a given proposal of draws, and the pair of proposals is stochastically accepted by the Metropolis-Hastings algorithm (hastings1970monte). Since the RCD sampler can handle any combination of restricted collapsed draws simultaneously, we were able to develop a series of sampling methods for the HCRP-HMM, including a blocked collapsed Gibbs sampler, a collapsed beam sampler, and a split-merge sampler. Through experiments, we found that our collapsed samplers outperformed their non-collapsed counterparts.
2 HCRP representation for iHMM
2.1 Infinite HMM
An infinite hidden Markov model (iHMM) (beal2002infinite; teh2006hierarchical) is defined over the following HDP:
$$G_0 \sim \mathrm{DP}(\gamma, H), \qquad G_k \sim \mathrm{DP}(\alpha_0, G_0) \quad (k = 1, 2, \dots) \qquad (2)$$
To see the relation of this HDP to the transition matrix $\boldsymbol{\pi}$, consider the explicit representation of the parameters:
$$\beta \sim \mathrm{GEM}(\gamma), \qquad \pi_k \mid \beta \sim \mathrm{DP}(\alpha_0, \beta), \qquad \phi_k \sim H \qquad (3)$$
where the transition probability is given as $p(z_t = k' \mid z_{t-1} = k) = \pi_{kk'}$, $\mathrm{GEM}(\gamma)$ is the stick-breaking construction of DPs (sethuraman1994constructive), and $\phi_k$ is the emission parameter of state $k$.
A formal definition of the HDP based on this representation is:
$$G_0 = \sum_{k=1}^{\infty} \beta_k\, \delta_{\phi_k} \qquad (4)$$
$$G_k = \sum_{k'=1}^{\infty} \pi_{kk'}\, \delta_{\phi_{k'}} \qquad (5)$$
Given an HDP and an initial state $z_0$, we can construct an infinite HMM by extracting a sequence of draws as $z_t \sim \pi_{z_{t-1}}$, with corresponding observations $x_t \sim F(\phi_{z_t})$. Figure 2 shows a graphical representation of the iHMM.
2.2 HCRP-HMM
As another way of representing the HDP in the iHMM (Eq. 2), we introduce a hierarchical Chinese restaurant process (HCRP, also known as the Chinese restaurant franchise), which avoids sampling the transition distributions and their base measure in Eq. (4):
(6)  
(7)  
(8) 
Using the Chinese restaurant metaphor, we say that customer $i$ of restaurant $j$ sits at table $t$, which serves a dish with index $k$.
To understand the connection between the HDP and the HCRP, consider a finite model of grouped observations, in which each group chooses a subset of $M$ mixture components from a model-wide set of $K$ mixture components:
(9)  
(10) 
As $K \to \infty$ and $M \to \infty$, the limit of this model is the HCRP; hence, the infinite limit of this model is also the HDP. Equation (6) is derived by taking the infinite limit of $K$ and $M$ after integrating out the mixture weights in Eqs. (9) and (10). The distribution in Eq. (4) can be derived as follows:
(11) 
To consider sampling using the HCRP (Eqs. 7 and 8), we write $n_{jtk}$ for the number of customers in restaurant $j$ at table $t$ served the dish of the $k$-th entry, and $m_{jk}$ for the number of tables in restaurant $j$ serving the dish of the $k$-th entry. We also use dots for marginal counts (e.g., $n_{jt\cdot} = \sum_k n_{jtk}$). Then, we sample the table index $t$ from the following distribution:
$$p(t_i = t \mid \cdot) \propto n_{jt\cdot} \quad \text{for an existing table } t, \qquad (12)$$
$$p(t_i = t^{\mathrm{new}} \mid \cdot) \propto \alpha_0. \qquad (13)$$
When $t_i = t^{\mathrm{new}}$ (i.e., the customer sits at a new table), we need to sample the dish index $k$ for the new table, whose distribution is:
$$p(k \mid \cdot) \propto m_{\cdot k} \quad \text{for an existing dish } k, \qquad (14)$$
$$p(k = k^{\mathrm{new}} \mid \cdot) \propto \gamma. \qquad (15)$$
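These predictive rules can be sketched concretely. The following is a minimal illustration of a two-level franchise draw, assuming concentration parameters alpha (restaurant level) and gamma (root level); the data structures and names are illustrative, not the paper's implementation:

```python
import random

def add_customer(restaurant, root, alpha, gamma):
    """Seat one customer: pick a table by Eqs. (12)-(13); if it is a new
    table, pick its dish at the root by Eqs. (14)-(15). Returns the dish."""
    # restaurant: list of [dish, customer_count]; root: dict dish -> table_count
    weights = [n for _, n in restaurant] + [alpha]   # existing tables + new table
    i = random.choices(range(len(weights)), weights)[0]
    if i < len(restaurant):                          # existing table
        restaurant[i][1] += 1
        return restaurant[i][0]
    # new table: pick its dish proportional to root table counts, or gamma for a new dish
    dishes = list(root) + [max(root, default=-1) + 1]
    w = [root[k] for k in root] + [gamma]
    dish = random.choices(dishes, w)[0]
    root[dish] = root.get(dish, 0) + 1
    restaurant.append([dish, 1])
    return dish

random.seed(0)
restaurant, root = [], {}
draws = [add_customer(restaurant, root, alpha=1.0, gamma=1.0) for _ in range(20)]
print(sum(n for _, n in restaurant))   # 20 customers seated in total
```

Note that the number of tables in the root restaurant always equals the number of tables across the lower restaurants, which is what ties the groups together.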
These variables determine the new sample $s$. Since $s$ does not uniquely determine the state of the HCRP model, we need to keep the latent table and dish assignments for subsequent sampling. We will denote by $\rho_j$ the seating arrangement in restaurant $j$, by $\rho_0$ the seating arrangement in the root restaurant, and
by $\boldsymbol{\rho}$ the collection of all seating arrangements, corresponding to the sampled model state. In Bayesian inference based on sampling, we need a procedure that samples these latent variables given the value of a new draw
and the seating arrangements for the other draws, which is called addCustomer. Construction of the HCRP-HMM is the same as that of the iHMM, i.e., extracting a sequence of draws $z_t$ given $z_{t-1}$, with corresponding observations $x_t$.
3 Restricted Collapsed Draw Sampler
What we want is a sampling algorithm for the HCRP-HMM. As described in the Introduction, the problem can be reduced to sampling from Eq. (1), i.e., the distribution of a restricted collapsed draw together with its seating arrangements.
Our idea is to apply the Metropolis-Hastings algorithm (hastings1970monte) to the seating arrangements, stochastically accepting a proposal of seating arrangements. Although it is hard to specify such a proposal distribution directly, our method constructs it by combining a proposal of the draws with a proposal of the seating arrangements given those draws, the latter based on the procedure that is standardly used in Gibbs sampling of the HCRP.
3.1 Overall sampling
The Metropolis-Hastings algorithm provides a way of constructing an MCMC sampler using an unnormalized probability value $\tilde{p}(x)$. After sampling $x^{*}$ from a proposal distribution $q(x^{*} \mid x)$, the algorithm computes the acceptance probability $a$:
$$a = \min\!\left(1,\ \frac{\tilde{p}(x^{*})\, q(x \mid x^{*})}{\tilde{p}(x)\, q(x^{*} \mid x)}\right) \qquad (16)$$
Then the result is $x^{*}$ with probability $a$, and $x$ otherwise. Repeating this process constitutes an MCMC sampler from the required distribution $p(x) \propto \tilde{p}(x)$.
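The generic Metropolis-Hastings loop described above can be sketched as follows; the target and proposal here are toy stand-ins, not the HCRP quantities:

```python
import math, random

def metropolis_hastings(p_tilde, propose, q, x0, n_steps, rng):
    """Generic MH: p_tilde is the unnormalized target, propose(x) draws a
    candidate, and q(a, b) is the proposal density of a given b (Eq. 16)."""
    x = x0
    samples = []
    for _ in range(n_steps):
        x_new = propose(x)
        a = (p_tilde(x_new) * q(x, x_new)) / (p_tilde(x) * q(x_new, x))
        if rng.random() < min(1.0, a):
            x = x_new              # accept the proposal
        samples.append(x)          # on rejection, the old state is repeated
    return samples

# toy example: sample from an unnormalized N(0, 1) with a symmetric random walk
rng = random.Random(0)
p = lambda x: math.exp(-0.5 * x * x)
samples = metropolis_hastings(
    p,
    propose=lambda x: x + rng.gauss(0.0, 1.0),
    q=lambda a, b: 1.0,            # symmetric proposal: cancels in the ratio
    x0=0.0, n_steps=5000, rng=rng)
print(sum(samples) / len(samples))  # close to 0
```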
Within the context of the HCRP, the sample space consists of the draws $\boldsymbol{s}$ and the seating arrangements $\boldsymbol{\rho}$. From Eq. (1), we can use the nonrestricted probability of the draws and seating arrangements as the unnormalized probability value $\tilde{p}$, but it is not easy to directly provide a proposal for the joint distribution of $(\boldsymbol{s}, \boldsymbol{\rho})$.
Our idea is to factorize the proposal distribution as:
$$q(\boldsymbol{s}^{*}, \boldsymbol{\rho}^{*}) = q(\boldsymbol{s}^{*})\, q(\boldsymbol{\rho}^{*} \mid \boldsymbol{s}^{*}) \qquad (17)$$
The first factor is the proposal distribution of the draws. The second factor is the proposal distribution of the seating arrangements given the proposed draws. We use the result of the addCustomer procedure, which stochastically updates the seating arrangements, as the proposal distribution of the seating arrangements.
3.2 Computing Factors
The following describes each factor in Eq. (16) and its computation.
True Probability
The true probability in Eq. (1) is the joint probability of all draws and seating arrangements:
(18)  
where  
(19)  
(20) 
and $\Gamma(\cdot)$ is the Gamma function. This is the product of the seating probabilities (Eqs. 12 to 15) over all customers.
In practice, we only need to calculate the probabilities that account for the changed seats, because the probability for unchanged customers cancels out in the acceptance ratio. Let the seating arrangement for the unchanged customers be held fixed; then
(21) 
In fact, this quantity is easily calculated along with the addCustomer operations:
(22) 
Here, each factor is the probability of obtaining the resulting seating arrangement from drawing a sample in the corresponding restaurant:
(23) 
The same applies to the calculation for the removed customers, which can be done along with the removeCustomer operations.
Proposal Distribution of Draws
The proposal distribution of draws can be anything as long as it is ergodic within the restriction. To increase the acceptance probability, however, it is preferable for the proposal distribution to be close to the true distribution. We suggest that a good starting point is a joint distribution composed of the predictive distributions of each draw, as has been done in the approximate Gibbs sampler (beal2002infinite):
(24) 
We will again discuss the proposal distribution of draws for the HCRP-HMM case in Section 4.
Proposal Distribution of Seating Arrangements
The proposal distribution of seating arrangements is the product of the probabilities of each operation of adding a customer:
(25) 
Here, each factor is the probability of obtaining the resulting seating arrangement from the operation.
(26)  
(27) 
3.3 Simplification
Noting that both Eqs. (23) and (27) are calculated along the same series of addCustomer calls, we introduce factors
(28) 
to simplify the calculation of the acceptance probability as:
(29) 
where
(30) 
Surprisingly, substituting Eqs. (23) and (27) into Eq. (28) reveals that each factor is equal to the predictive probability of a new customer at restaurant $j$ eating dish $k$:
(31)  
(32) 
In other words, the calculation of the acceptance ratio does not use the table index of each customer, despite the fact that table assignments are being proposed; the table assignments indirectly affect the acceptance ratio by changing subsequent draw probabilities through the customer and table counts $n$ and $m$.
It is now clear that, as done in some previous work (teh2006bayesian), we can save storage space by using an alternative representation of the seating arrangements, in which the table index of each customer is forgotten and only the numbers of customers and tables, $n$ and $m$, are retained. The only remaining reference to table indices, in the removeCustomer procedure, can be safely replaced by sampling a table in proportion to its customer count.
However, it should be noted that we have to revert to the original seating assignment whenever the proposal is rejected. Putting the old draws back by invoking the addCustomer procedure again would lead sampling to an incorrect distribution of seating assignments, and consequently, an incorrect distribution of draws.
Algorithm 1 is our proposed procedure that obtains new samples drawn simultaneously from the specified restaurants, together with the associated seating arrangements, given the previous samples and seating arrangements.
The first half of this sampler is similar to a sampler for a single draw; it consists of removing the old customers (line 3), choosing a new sample (line 7), and adding the customers again (line 10). The main differences are that each removeCustomer/addCustomer call is iterated over all the draws, and that the factors later used for the acceptance probability are accumulated along the way.
4 Gibbs sampler for HCRP-HMM
This section describes a series of samplers for the HCRP-HMM. First, we present the stepwise Gibbs sampler as the simplest example. After that, we describe a blocked Gibbs sampler using a forward-backward algorithm. We also explain the HCRP version of the beam sampler (vangael2008beam) as well as the split-merge sampler (jain2000split) for the iHMM.
4.1 Stepwise Gibbs sampler
A stepwise Gibbs sampler for the HCRP-HMM is easily constructed using an RCD sampler (Algorithm 5 in the Appendix describes one Gibbs sweep). We slightly modified the proposal distribution from that suggested in Section 3.2 in order to ensure that every state is proposed with nonzero probability even when no table in the restaurant serves the corresponding dish:
(33) 
4.2 Blocked Gibbs sampler
We can construct an alternative sampling scheme under the RCD-sampler framework that resamples a block of hidden states simultaneously, based on the forward-backward sampler (scott2002bayesian). The idea is to run the forward-backward sampler with the predictive transition distribution of the HCRP-HMM, and use the result as the proposal of a restricted collapsed draw.
For the iHMM, the forward-backward sampling algorithm (scott2002bayesian) cannot be used directly, because forward probability values for an infinite number of states would have to be stored for each time step (vangael2008beam). This is not the case for the HCRP-HMM: the predictive transition probability given the seating assignments, as given by Eqs. (31) and (32), contains transition probabilities for only a finite number of states plus one for a new state. Thus we only need to store finitely many forward probabilities for each time step.
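The forward-filtering, backward-sampling step over a finite state set can be sketched as follows; the fixed transition matrix here is an illustrative stand-in for the predictive probabilities of Eqs. (31) and (32):

```python
import random

def forward_backward_sample(pi, emit, obs, init, rng):
    """Forward filtering, backward sampling: draw a state sequence from
    p(z_1..T | x_1..T) for a finite-state HMM with transition matrix pi
    and emission matrix emit."""
    K, T = len(pi), len(obs)
    # forward pass: f[t][j] is proportional to p(z_t = j, x_1..t)
    f = [[0.0] * K for _ in range(T)]
    for j in range(K):
        f[0][j] = init[j] * emit[j][obs[0]]
    for t in range(1, T):
        for j in range(K):
            f[t][j] = emit[j][obs[t]] * sum(f[t - 1][i] * pi[i][j] for i in range(K))
    # backward pass: sample z_T, then each z_t given the sampled z_{t+1}
    z = [0] * T
    z[T - 1] = rng.choices(range(K), weights=f[T - 1])[0]
    for t in range(T - 2, -1, -1):
        w = [f[t][i] * pi[i][z[t + 1]] for i in range(K)]
        z[t] = rng.choices(range(K), weights=w)[0]
    return z

rng = random.Random(1)
pi = [[0.9, 0.1], [0.1, 0.9]]     # sticky two-state chain
emit = [[0.8, 0.2], [0.2, 0.8]]   # state i mostly emits symbol i
obs = [0, 0, 0, 1, 1, 1]
z = forward_backward_sample(pi, emit, obs, init=[0.5, 0.5], rng=rng)
print(z)
```

In the HCRP-HMM proposal, one extra column for the new-state symbol would be appended to each row of the transition matrix.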
The result of the forward-backward sampler, however, cannot be used directly as the proposal: a state in the proposal equals the sampled state when it is an existing state, but we need to assign new state indices whenever the new-state symbol is sampled. In particular, when the new-state symbol has appeared several times, all of its appearances may refer either to the same new state, or to different states, or to anything in between, in which some appearances share a new state.
To achieve this, we prepare a special CRP that accounts for the previously unseen states marked by the new-state symbol in the result of the forward-backward sampler. Specifically, each table in this CRP serves a dish with an unused state index, and each appearance of the new-state symbol is replaced with a draw from this CRP. This construction ensures that every state sequence is proposed with nonzero probability, and allows the proposal probability to be easily calculated. The concentration parameter of this CRP is set equal to that of the root restaurant. To handle the case where some of the new states are equal to the state that follows the resampled block, we add to this CRP an extra customer corresponding to that state when it does not appear elsewhere.
The resulting proposal probability is:
(34) 
where the first factor accounts for the forward probability of the sequence, and the second factor accounts for probability of the new state assignment.
Note also that, to make a sensible proposal distribution, we cannot resample the whole state sequence simultaneously. We need to divide the state sequence into several blocks and resample each block given the others. The block size affects efficiency: blocks that are too large have a low acceptance probability, while with blocks that are too small the algorithm has little advantage over stepwise Gibbs sampling.
Algorithm 8 in the Appendix describes one sweep of a blocked Gibbs sampler for an HCRP-HMM.
4.3 Beam sampling
Beam sampling for the HDP-HMM (vangael2008beam) uses slice sampling (neal2003slice) on the transition probabilities to extract a finite subset of the state space. Although the possible states are already finite in the HCRP-HMM, the same technique may still benefit HCRP-HMM sampling by reducing the number of states considered during one sampling step.
To use beam sampling with the HCRP-HMM, we need only replace the forward-backward call in Algorithm 8 with its beam-sampling counterpart. A brief overview of beam sampling is:

1. Sample auxiliary variables $u_t \sim \mathrm{Uniform}(0, \pi_{z_{t-1} z_t})$ for each time step;
2. For $t = 1, \dots, T$, calculate the forward probabilities using only the slice of transitions whose probability exceeds $u_t$;
3. For $t = T, \dots, 1$, sample the states backward.
For details, refer to the original paper (vangael2008beam).
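The slicing step at the core of beam sampling can be illustrated as follows: given the auxiliary variables $u_t$, only transitions with probability above $u_t$ survive, so the forward pass ranges over a finite "slice" of states. This is a minimal sketch with an illustrative transition matrix:

```python
import random

def slice_transitions(pi, z, rng):
    """For each transition z_{t-1} -> z_t, sample u_t ~ Uniform(0, pi[z_{t-1}][z_t])
    and return the set of destination states whose transition probability
    exceeds u_t -- the finite slice the forward pass is restricted to."""
    slices = []
    for t in range(1, len(z)):
        u = rng.uniform(0.0, pi[z[t - 1]][z[t]])
        slices.append({j for j in range(len(pi)) if pi[z[t - 1]][j] > u})
    return slices

rng = random.Random(0)
pi = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
z = [0, 0, 1, 1, 2]
slices = slice_transitions(pi, z, rng)
print(slices)  # each slice contains the current transition's destination state
```

By construction, the current state sequence always lies inside its own slice, so the sampler can never slice itself into an empty state set.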
Some remarks are needed on the calculation of the proposal probability for the state sequence. Although beam sampling has a different proposal distribution from forward-backward sampling, we can use the same proposal-probability calculation in the acceptance probability as for forward-backward sampling. This is because beam sampling satisfies the detailed balance condition, which ensures that the ratio of proposal probabilities under beam sampling is always equal to the corresponding ratio obtained by forward-backward sampling.
4.4 Split-Merge Sampling
We can integrate the split-merge sampling algorithm, another sampling approach for Dirichlet process mixture models (jain2000split), into the HCRP-HMM using the RCD sampler. A split-merge sampler makes a proposal move that tries to merge two mixture components into one, or to split one mixture component into two; the sampler then uses a Metropolis-Hastings step to stochastically accept the proposal. Based on the RCD framework, we can extend the split-merge sampler to the HCRP, and hence apply it to the HCRP-HMM. In the context of an HMM, the sampler either merges two state indices into one or splits one state index into two.
Our implementation is based on an improved version of the split-merge sampler, called the sequentially-allocated merge-split sampler (dahl2005sequentiallyallocated), which produces a split proposal by sequentially allocating components in random order. To deal with the temporal dependency in an HMM, we identify fragments of the state sequence to be resampled, and perform blocked Gibbs sampling for each fragment in random order.
We added one important optimization to the split-merge sampling algorithm. Since a merge move is proposed much more frequently than a split move, and the move has a relatively low acceptance probability, it is beneficial to determine early whether a merge move will be rejected. Note that, as the proposal probability for a merge move is calculated, the running acceptance probability is monotonically decreasing. Consequently, we sample the acceptance threshold $u$ at the beginning of the algorithm and stop further calculation as soon as the running acceptance probability becomes less than $u$. Algorithm 9 in the Appendix is the split-merge sampling algorithm for the HCRP-HMM.
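The early-rejection idea can be sketched as follows; the per-step factors are illustrative placeholders for the terms accumulated while computing the merge acceptance probability, assumed (as in the argument above) to only decrease the running product:

```python
import random

def merge_with_early_stop(step_factors, rng):
    """Accumulate the acceptance probability factor by factor; since each
    factor is assumed <= 1, the product only decreases, so we can stop as
    soon as it drops below the pre-sampled threshold u."""
    u = rng.random()                 # acceptance threshold, sampled up front
    a = 1.0
    for i, f in enumerate(step_factors):
        a *= f
        if a < u:                    # the move can never be accepted any more
            return False, i + 1      # rejected after evaluating i+1 factors
    return True, len(step_factors)   # survived every factor: accepted

rng = random.Random(42)
factors = [0.9] * 50                 # illustrative per-step factors, all <= 1
accepted, steps = merge_with_early_stop(factors, rng)
print(accepted, steps)
```

The saving comes from `steps` often being far smaller than the full factor count when the move would be rejected anyway.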
Split-merge sampling allows faster mixing when it is interleaved with other sampling strategies. We examine split-merge sampling combined with each of the samplers presented in this paper.
5 Experiments and Discussion
This section presents two series of experiments, the first with small artificial sequences and the second with a sequence of natural language words.
5.1 Settings
We put gamma priors on the concentration parameters and resampled them between every sweep using the auxiliary-variable method (teh2006hierarchical)
in all the experiments. We introduced an HCRP as the prior of the emission distributions as well, and its hyperparameters were also sampled in the same way.
The initial state sequence given to the sampler is the result of a particle filter with 100 particles.
We measured the autocorrelation time (ACT) to evaluate mixing. Given a sequence of values $x_1, \dots, x_N$ with mean $\bar{x}$ and variance $\sigma^2$, the lag-$k$ autocorrelation $\rho_k$ and the autocorrelation time $\tau$ are defined as follows:
$$\rho_k = \frac{1}{(N - k)\, \sigma^2} \sum_{i=1}^{N-k} (x_i - \bar{x})(x_{i+k} - \bar{x}) \qquad (35)$$
$$\tau = 1 + 2 \sum_{k=1}^{K} \rho_k \qquad (36)$$
Since $\rho_k$ is expected to converge to zero for larger $k$, we truncated the sum at a finite maximum lag $K$.
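A direct implementation of this estimator might look like the following sketch (the truncation lag and test sequence are illustrative):

```python
import random

def autocorrelation_time(xs, max_lag):
    """ACT estimate: tau = 1 + 2 * sum of lag-k autocorrelations,
    truncated at max_lag (Eqs. 35-36 in standard form)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    rho = []
    for k in range(1, max_lag + 1):
        cov = sum((xs[i] - mean) * (xs[i + k] - mean) for i in range(n - k)) / (n - k)
        rho.append(cov / var)
    return 1.0 + 2.0 * sum(rho)

# an i.i.d. sequence mixes instantly, so its ACT is close to 1
rng = random.Random(3)
iid = [rng.random() for _ in range(2000)]
tau = autocorrelation_time(iid, 50)
print(tau)
```

A slowly mixing chain (e.g., one that stays in long constant runs) yields a much larger value, which is what makes ACT a useful mixing diagnostic.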
For the artificial sequences, we evaluated the mutual information between the hidden states used in sequence generation and the inferred states:
$$I(Z; \hat{Z}) = \sum_{k, k'} p(k, k') \log \frac{p(k, k')}{p(k)\, p(k')} \qquad (37)$$
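The empirical mutual information of this kind can be computed from the two aligned label sequences as in this sketch:

```python
import math
from collections import Counter

def mutual_information(true_states, inferred_states):
    """Empirical mutual information (Eq. 37) between two aligned label
    sequences, in nats."""
    n = len(true_states)
    joint = Counter(zip(true_states, inferred_states))
    p_true = Counter(true_states)
    p_inf = Counter(inferred_states)
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) log( p(a,b) / (p(a) p(b)) ) with counts c, p_true[a], p_inf[b]
        mi += (c / n) * math.log(c * n / (p_true[a] * p_inf[b]))
    return mi

true = [0, 0, 1, 1, 0, 1, 0, 1]
v1 = mutual_information(true, true)       # identical labelings: MI = H(Z) = log 2
v2 = mutual_information(true, [0] * 8)    # constant labeling: MI = 0
print(v1, v2)
```

Because MI is invariant to permutations of the inferred state indices, it is well suited to comparing samplers whose state labels are arbitrary.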
For the natural language text, the inferred model is evaluated by multiple runs of a particle filter on a given test sequence of length $N$. We specifically construct a particle filter with $P$ particles for each sampled model state, and evaluate the likelihood of each emission. Finally, we calculate the perplexity (the reciprocal geometric mean of the emission probabilities) of the test sequence:
$$\mathrm{PPL} = \exp\!\left(-\frac{1}{N} \sum_{t=1}^{N} \log p(x_t \mid x_{1:t-1})\right) \qquad (38)$$
where
$$p(x_t \mid x_{1:t-1}) = \frac{1}{P} \sum_{p=1}^{P} p\!\left(x_t \mid z_t^{(p)}\right) \qquad (39)$$
and $z_t^{(p)}$ is the state of the $p$-th particle at time $t$.
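The perplexity itself is straightforward to compute once the per-emission predictive probabilities are available; a sketch:

```python
import math

def perplexity(emission_probs):
    """Perplexity as the reciprocal geometric mean of the per-step
    predictive probabilities (Eq. 38)."""
    n = len(emission_probs)
    return math.exp(-sum(math.log(p) for p in emission_probs) / n)

# sanity check: uniform predictions over a 50-word vocabulary give perplexity 50
ppl = perplexity([1 / 50] * 1000)
print(ppl)  # 50.0 (up to floating-point rounding)
```

Lower perplexity means the model assigns higher probability to the held-out emissions.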
The samplers we chose for comparison are the stepwise Gibbs sampler with the direct-assignment representation (teh2006hierarchical), which uses stick-breaking for the root DP and the CRP for the other DPs; the stepwise Gibbs sampler with the stick-breaking construction; and the beam sampler with the stick-breaking construction (vangael2008beam). For a fair comparison between different algorithms, we collected samples to evaluate the autocorrelation time and perplexity on a CPU-time basis (excluding the time used for evaluation). All algorithms were implemented in C++ and tested on machines with an Intel Xeon E5450 at 3 GHz.
5.2 Artificial data
The first series of experiments was performed with two small artificial sequences. Sequence 1 consists of the repeating sequence of symbols ABCDBCDE…; we ran the sampler for 30 s of burn-in, after which a model state was sampled every 2 s until a total of 300 s was reached. Sequence 2 was generated from the simple finite state automaton in Figure 3; we used 60 s for burn-in and a total of 600 s. We evaluated the mutual information between the inferred hidden states and the true hidden states.
Figure 4 shows the distribution of mutual information over 100 trials after 300 s. Some of the samplers based on the proposed method achieved better mutual information than the existing samplers; the improvement depends on the type of sequence and the sampler.
For Sequence 1, split-merge sampling yields better results than the other samplers. Although an HMM with eight hidden states can completely predict the sequence, the samplers tend to be trapped in a local optimum with five states in the initial phase, because our selected prior places larger probability on a smaller number of hidden states. Detailed investigations (Figure 5) confirmed this analysis.
For Sequence 2, on the other hand, the blocked samplers worked very efficiently. Stepwise samplers generally worked poorly on this sequence, because the strong dependency between temporally adjacent states impedes mixing. Still, the stepwise Gibbs sampler for the HCRP-HMM outperformed the beam sampler with the stick-breaking process. The blocked Gibbs sampler performed worse due to its heavy computation for a large number of states, but the beam sampler for the HCRP-HMM was efficient and performed well. Combining it with a small number of split-merge moves increases performance (more split-merge sampling lowers performance by taking computational resources away from the beam sampler). From the average statistics of the samplers (Table 1), we can see that (1) the increase in mutual information cannot be explained only by the increase in the number of states; (2) the Gibbs trials have a very high acceptance rate; and (3) the split-merge moves have a very low acceptance rate but still improve the mutual information.
[Figure 4: (a) Sequence 1; (b) Sequence 2]
[Figure 5: (a) HDP-HMM (SB) S-Gibbs; (b) HDP-HMM (DA) S-Gibbs; (c) HDP-HMM (SB) Beam; (d) HCRP-HMM Beam; (e) HCRP-HMM Beam + Split-Merge 3/sweep; (f) HCRP-HMM Beam + Split-Merge 25/sweep]
Table 1: Average statistics of the samplers.

name | MI | ACT | #states | secs/sweep | Gibbs accept rate | SM accept rate
HDP-HMM (DA) S-Gibbs | 2.92 | 0.527 | 14.910 | 0.044 | — | —
HDP-HMM (SB) Beam | 3.04 | 0.640 | 14.720 | 0.032 | — | —
HCRP-HMM S-Gibbs | 3.18 | 0.719 | 16.210 | 0.026 | 0.999666 | —
HCRP-HMM S-Gibbs +SM=2 | 3.28 | 0.615 | 17.190 | 0.030 | 0.999632 | 0.000631
HCRP-HMM S-Gibbs +SM=13 | 3.19 | 0.493 | 16.830 | 0.044 | 0.999619 | 0.000650
HCRP-HMM S-Slice | 2.82 | 0.820 | 15.950 | 0.009 | 0.999847 | —
HCRP-HMM S-Slice +SM=2 | 2.86 | 0.705 | 17.330 | 0.013 | 0.999822 | 0.000827
HCRP-HMM S-Slice +SM=13 | 2.59 | 0.604 | 16.400 | 0.030 | 0.999830 | 0.000910
HCRP-HMM B-Gibbs | 3.01 | 0.317 | 14.900 | 0.206 | 0.995525 | —
HCRP-HMM B-Gibbs +SM=2 | 3.12 | 0.513 | 16.270 | 0.237 | 0.995135 | 0.000985
HCRP-HMM B-Gibbs +SM=13 | 3.18 | 0.637 | 16.530 | 0.265 | 0.994715 | 0.000715
HCRP-HMM Beam | 3.21 | 0.866 | 15.180 | 0.016 | 0.997233 | —
HCRP-HMM Beam +SM=2 | 3.37 | 0.898 | 16.910 | 0.019 | 0.996369 | 0.000497
HCRP-HMM Beam +SM=13 | 3.28 | 0.875 | 17.070 | 0.034 | 0.996316 | 0.000532

DA: direct assignment; SB: stick-breaking construction; MI: mutual information; ACT: autocorrelation time (samples collected every 0.1 s); #states: number of states; S-Gibbs: stepwise Gibbs; S-Slice: stepwise Gibbs with slice sampling (beam sampling with block size = 1); B-Gibbs: blocked Gibbs (block size 8); Beam: beam sampling (block size 8 for HCRP-HMM); SM: split-merge sampler (+SM=n denotes n SM trials per Gibbs sweep).
5.3 Natural language text
We also tested the samplers on a sequence of natural language words from Alice's Adventures in Wonderland. We converted the text to lower case, removed punctuation, and placed a special word EOS after every sentence; we kept the last 1,000 words as the test corpus and learned on the remaining sequence. We introduced a special word UNK (unknown) to replace every word that occurred only once in the text. We took 10,000 s for burn-in and sampled a model state every 120 s, until a total of 172,800 s. Table 2 summarizes the averaged statistics over 18 trials.
Table 2: Averaged statistics on the natural language text.

Sampler | Perplexity | #states | secs/sweep | Gibbs accept rate | SM accept rate
HDP-HMM (DA) S-Gibbs | 134.22 | 313.056 | 10.017 | — | —
HDP-HMM (SB) S-Gibbs | 151.10 | 242.389 | 38.045 | — | —
HDP-HMM (SB) Beam | 178.59 | 68.444 | 16.126 | — | —
HCRP-HMM S-Gibbs | 133.31 | 379.833 | 7.027 | 0.999861 | —
HCRP-HMM S-Gibbs +SM=130 | 131.66 | 386.833 | 7.664 | 0.999857 | 0.000052
HCRP-HMM S-Gibbs +SM=5400 | 135.94 | 336.278 | 31.751 | 0.999880 | 0.000050
HCRP-HMM S-Slice | 131.17 | 422.000 | 0.469 | 0.999986 | —
HCRP-HMM S-Slice +SM=130 | 131.67 | 409.833 | 0.976 | 0.999993 | 0.000052
HCRP-HMM S-Slice +SM=5400 | 152.76 | 254.722 | 36.617 | 0.999993 | 0.000056
HCRP-HMM B-Gibbs | 199.14 | 1840.833 | 29380.261 | 0.992748 | —
HCRP-HMM Beam | 141.77 | 603.333 | 80.681 | 0.995627 | —
HCRP-HMM Beam +SM=130 | 142.69 | 567.278 | 72.217 | 0.995612 | 0.000124
HCRP-HMM Beam +SM=5400 | 141.07 | 495.667 | 84.925 | 0.995554 | 0.000101

For HCRP-HMM, the block size .
We found that stepwise sampling outperformed blocked sampling (including beam sampling). The reason may be the nature of the sequence, which has a lower temporal dependency. Blocked Gibbs sampling, in particular, consumes too much time per sweep to be of any practical use. We also found that split-merge sampling had a very low acceptance rate and thus contributed little to the result.
Still, we can see the advantage of the HCRP representation over the stick-breaking representation. The direct-assignment (DA) algorithm showed competitively good perplexity, reflecting the fact that DA uses stick-breaking only for the root DP and the CRP representation for the other DPs. Although stepwise Gibbs sampling and its slice-sampling variant seem to outperform DA slightly, we would need to collect more data to show that the difference is significant. At the least, however, many sampling algorithms are now available for inference, and we can choose a suitable one depending on the nature of the sequence.
6 Conclusion and Future Work
We have proposed a method of sampling directly from constrained distributions of simultaneous draws from a hierarchical Chinese restaurant process (HCRP). We pointed out that, to obtain a correct sample distribution, the seating arrangements (partitioning) must be correctly sampled for a restricted collapsed draw, and we therefore proposed applying the Metropolis-Hastings algorithm to the seating arrangements. Our algorithm, called the restricted collapsed draw (RCD) sampler, uses a naïve sampler to provide a proposal distribution for the seating arrangements. Based on this sampler, we developed various sampling algorithms for the HDP-HMM under the HCRP representation, including blocked Gibbs sampling, beam sampling, and split-merge sampling.
The applications of the RCD sampler, which is at the heart of our algorithms, are not limited to the HCRP-HMM. The experimental results revealed that some of the proposed algorithms outperform existing sampling methods, indicating that the benefits of using a collapsed representation exceed the cost of rejecting proposals.
The main contribution of this study is that it opens a way to develop more complex Bayesian models based on CRPs. Since the RCD sampler is simple, flexible, and independent of the particular structure of a hierarchy, it can be applied to any combination or hierarchical structure of CRPs. Our future work includes using this algorithm to construct new Bayesian models based on hierarchical CRPs, which are hard to implement with a non-collapsed representation. Planned work includes extending the HDP-IOHMM (doshivelez2009infinite) with a three-level hierarchical DP (e.g., the second level could correspond to actions, and the third level to input symbols).
References
Appendix A Miscellaneous Algorithms
Appendix B Stepwise Gibbs sampler
To manipulate the emission probabilities
with a conjugate prior, we introduced a notation similar to that of the HCRP, which can be intuitively understood.