1. Introduction
Topic models are popular tools in the machine learning toolbox. They extract a set of latent topics from an input text corpus. Each topic is a unigram distribution over words, and the high-probability words often present strong semantic correlation. Topic models have been widely used in information retrieval (Wei and Croft, 2006), text analysis (Boyd-Graber et al., 2007; Zhu et al., 2012), information visualization (Wang et al., 2016), and many other application areas for feature extraction and dimensionality reduction.
However, traditional topic models, such as latent Dirichlet allocation (LDA) (Blei et al., 2003), are flat: they assume that all topics are observed independently and therefore learn no relationships between topics. In reality, topics are naturally organized in a hierarchy (Paisley et al., 2015). For example, when a topic on "unsupervised learning" is observed in a document, the more general topics that contain it, such as "machine learning" and "computer science", are likely to be observed in the same document as well. By capturing such relationships, hierarchical topic models can achieve a deeper understanding and better generalization
(Ahmed et al., 2013; Paisley et al., 2015) of the corpus than flat models. There are many different approaches to learning the topic hierarchy. For example, Google's Rephil (Murphy, 2012) puts a hierarchical noisy-or network on the documents; the super-topic approach learns topics of topics (Li and McCallum, 2006; Pujara and Skomoroch, 2012); and the nested Chinese restaurant process (nCRP) approach (Blei et al., 2010; Ahmed et al., 2013; Paisley et al., 2015) utilizes the nCRP as a prior on topic hierarchies. Hierarchical topic models have been successfully applied to document modeling (Paisley et al., 2015), online advertising (Murphy, 2012) and microblog location prediction (Ahmed et al., 2013), outperforming flat models. Among these approaches, the nCRP method has a nonparametric prior on the topic hierarchy structure, which leads to a natural structure learning algorithm with Gibbs sampling, avoiding slow-mixing Metropolis-Hastings proposals or neural rules (Murphy, 2012). The hierarchical latent Dirichlet allocation (hLDA) model is a popular instance of nCRP topic models (Blei et al., 2010). In hLDA, topics form a tree with an nCRP prior; each document is assigned a path from the root topic to a leaf topic, and the words in the document are modeled with an admixture of the topics on that path.
However, due to the lack of scalable algorithms and implementations, hLDA has only been evaluated at a small scale, e.g., with thousands of documents and tens of topics (Blei et al., 2010; Wang and Blei, 2009), which limits its wider adoption in real-life applications. The training of topic models can be accelerated via distributed computing, which has been successfully applied to LDA to handle hundreds of billions of tokens and millions of topics (Ahmed et al., 2012; Chen et al., 2016). Unfortunately, the previous algorithms for hLDA are unsuitable for distributed computing. Specifically, the collapsed Gibbs sampler (Blei et al., 2010) is difficult to parallelize because collapsing the topic distributions breaks the conditional independence between document-wise latent variables; on the other hand, the instantiated weight variational inference algorithm (Wang and Blei, 2009) has inferior model quality because of label switching and local optima, as we will analyze in Sec. 3.2. Moreover, the data structures used by hLDA, such as the dynamically growing count matrices and trees, are much more sophisticated than those for LDA, and their efficient distributed implementation is challenging.
In this paper, we propose a novel partially collapsed Gibbs sampling (PCGS) algorithm, which combines the advantages of the collapsed Gibbs sampler (Blei et al., 2010) and the instantiated weight variational inference method (Wang and Blei, 2009) to achieve a good tradeoff between scalability and the quality of inference. We keep most topic distributions instantiated to maximize the degree of parallelism, while we integrate out some rapid-changing topic distributions to preserve the quality of inference. To further improve the model quality, we propose an initialization strategy. Finally, we present an efficient distributed implementation of PCGS through vectorization, preprocessing, and a careful design of the concurrent data structures and the communication strategy.
We design a set of experiments to extensively examine the model quality of our PCGS algorithm as well as its efficiency and scalability. The experimental results show that our single-thread PCGS is 111 times faster than the previous state-of-the-art implementation, hlda-c (Blei, 2009); and our distributed PCGS can extract 1,722 topics from a 131-million-document corpus with 28 billion tokens, which is 4-5 orders of magnitude larger than the previous largest corpus, with 50 machines in 7 hours. To the best of our knowledge, this is the first time hLDA has been scaled up to datasets of this size.
2. Hierarchical LDA
We first review nCRP and hLDA for learning a topic hierarchy.
2.1. Nested Chinese Restaurant Process
The nested Chinese restaurant process (nCRP) (Blei et al., 2010) is a powerful nonparametric Bayesian method to learn a tree structure whose width and depth are unbounded. Suppose there is a truncated tree with L levels, where each node except the leaves has an infinite number of children. A unique ID is assigned to each node, where the root node has ID 1. nCRP defines a probability distribution over a sequence of paths c_1, c_2, ... on the tree, where each path consists of the node IDs from the root to a certain leaf. Given the previous paths c_{<m}, we mark a node as visited if any of those paths passes through it, and the next path c_m is generated as follows: (1) let c_{m,1} = 1; (2) for each level l = 2, ..., L, denote p as a shortcut for c_{m,l-1}. Assume that there are already T visited nodes, where the children of p are denoted as C(p). The next node of the path can be generated as

p(c_{m,l} = k | c_{m,l-1} = p) ∝ m_k if k ∈ C(p), and ∝ γ if k = T + 1,

where m_k is the number of times that node k is visited, γ is a hyperparameter, and |·| denotes the cardinality of a set. If c_{m,l} = T + 1, the path goes through a child node of p that has not been visited before, and we assign it the ID T + 1. We refer to this operation as the generation of a new child, although in fact it is just visiting a node that has never been visited before. The above procedure is denoted as c_m ~ nCRP(c_{<m}; γ), where the subscript <m stands for all possible indices smaller than m, i.e., c_{<m} = {c_1, ..., c_{m-1}}.
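The path-generation rule above can be sketched in a few lines of Python. This is an illustrative sketch with a data layout of our own choosing, not the paper's implementation: an existing child is selected with probability proportional to its visit count, and a brand-new child with probability proportional to γ.

```python
import random

def sample_ncrp_path(counts, children, gamma, depth):
    """Sample one root-to-leaf path under the nCRP.

    counts[k]   -- number of previous paths that visited node k
    children[p] -- IDs of the already-visited children of node p
    gamma       -- nCRP hyperparameter; depth -- number of tree levels.
    (Hypothetical bookkeeping: new nodes get the next unused ID.)
    """
    path = [1]                      # the root always has ID 1
    for _ in range(depth - 1):
        parent = path[-1]
        kids = children.setdefault(parent, [])
        weights = [counts[k] for k in kids] + [gamma]
        r = random.uniform(0.0, sum(weights))
        acc, chosen = 0.0, None
        for k, w in zip(kids, weights):
            acc += w
            if r < acc:
                chosen = k
                break
        if chosen is None:          # fell into the gamma slot: new child
            chosen = max(counts) + 1
            kids.append(chosen)
            counts[chosen] = 0
        counts[chosen] += 1
        path.append(chosen)
    return path
```

Repeated calls with shared `counts` and `children` reproduce the rich-get-richer behavior of the process.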
Intuitively, nCRP puts a CRP (Teh, 2011) on each parent node, where the probability of visiting each child is proportional to its previous number of visits. Due to this fact, we can easily extend the stick-breaking formulation for CRP (Sethuraman, 1994) to nCRP (Wang and Blei, 2009). The generative procedure is:

- For each node p on each level l, draw π_p ~ GEM(γ), where π_p is a distribution over the children of p, and GEM(γ) is the stick-breaking distribution (Sethuraman, 1994). A sample π ~ GEM(γ) can be obtained as follows: for i = 1, 2, ..., draw V_i ~ Beta(1, γ), and let π_i = V_i ∏_{j<i} (1 - V_j).

- For each path c_m, let c_{m,1} = 1, and for l = 2, ..., L, select a child i ~ π_{c_{m,l-1}}, and let c_{m,l} be the corresponding ID.
The original nCRP is recovered by integrating out π. In the stick-breaking formulation, the probability of visiting each child is explicitly instantiated, and the paths are conditionally independent given the probabilities π.
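A truncated version of the stick-breaking construction is straightforward to implement. The sketch below is illustrative (the truncation level is our assumption, mirroring the truncation discussed later in Sec. 3.2); it lumps the leftover stick mass into the last atom so the result is a proper distribution.

```python
import random

def stick_breaking(gamma, truncation):
    """Truncated GEM(gamma) sample: V_i ~ Beta(1, gamma) and
    pi_i = V_i * prod_{j<i} (1 - V_j). The remaining stick mass is
    assigned to the last atom so the weights sum to one."""
    remaining = 1.0
    pi = []
    for _ in range(truncation - 1):
        v = random.betavariate(1.0, gamma)
        pi.append(remaining * v)
        remaining *= 1.0 - v
    pi.append(remaining)
    return pi
```

Larger γ spreads the mass over more children; smaller γ concentrates it on the first few.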
2.2. Hierarchical Latent Dirichlet Allocation
Consider a corpus of bag-of-words documents W = {w_1, ..., w_D}, where each document w_d has N_d tokens, and each token w_{dn} is represented by its word ID in a vocabulary of V unique words. hLDA is an nCRP-based topic model for learning a topic hierarchy (Blei et al., 2010). In hLDA, topics form an L-level tree, i.e., each tree node is a topic, and is associated with a distribution over words φ_k ∈ Δ^V, where Δ^V is the simplex. Since nodes and topics have a one-to-one correspondence, we do not distinguish them in the sequel.
In hLDA, each document d is assigned a path c_d, and its words are modeled with a mixture of the topics on c_d, with the document-specific mixing proportion θ_d. The generative process for the corpus is:

- For each node k, draw φ_k ~ Dir(β_{l_k} 1), where l_k is the level of node k and 1 is an all-one vector;

- For each document d:

  - Draw c_d ~ nCRP(c_{<d}; γ);

  - Draw θ_d ~ Dir(α);

  - For each position n, draw z_{dn} ~ Mult(θ_d), and draw w_{dn} ~ Mult(φ_{c_{d,z_{dn}}}),

where Dir(·) is the Dirichlet distribution, and α and β = (β_1, ..., β_L) are Dirichlet hyperparameters. There are two special cases of hLDA. When the tree degenerates to a chain, hLDA recovers the vanilla LDA with L topics; and when the tree has two levels and the probability of assigning to the first level is close to zero, hLDA recovers the Dirichlet process mixture model (DPMM) (Neal, 2000).
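For concreteness, the emission part of this generative process can be sketched as follows, assuming the path has already been drawn from the nCRP. The list-based φ layout and helper names are ours, for illustration only.

```python
import random

def _categorical(probs):
    """Draw an index with the given probabilities (inverse-CDF method)."""
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1          # guard against floating-point leftovers

def generate_document(path, phi, alpha, n_tokens):
    """hLDA emission for one document given an already-sampled path
    (one topic ID per level): draw theta ~ Dir(alpha), then for each
    token draw a level z ~ Mult(theta) and a word w ~ Mult(phi[path[z]])."""
    g = [random.gammavariate(a, 1.0) for a in alpha]   # Dirichlet via Gammas
    theta = [x / sum(g) for x in g]
    words = []
    for _ in range(n_tokens):
        z = _categorical(theta)                        # pick a level
        words.append(_categorical(phi[path[z]]))       # emit a word
    return words
```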
3. Inference for HLDA
There are two classes of algorithms for posterior inference in hLDA: collapsed weight algorithms, which integrate out the mixing weights θ and topic distributions φ, and instantiated weight algorithms, which explicitly infer these parameters. These algorithms present a tradeoff between scalability and the quality of inference.
In this section, we first introduce the collapsed Gibbs sampler (Blei et al., 2010), which is not scalable. To address this problem, we present an instantiated weight block Gibbs sampler, which is based on the same stick-breaking formulation as the variational inference algorithm (Wang and Blei, 2009) and has excellent scalability. However, it suffers from local optima. To tackle this issue, we propose a partially collapsed Gibbs sampler that has good scalability as well as high-quality inference. We also present an initialization strategy to find better local optima.
3.1. Collapsed Gibbs Sampling (CGS)
CGS is a collapsed weight algorithm that is based on the nCRP formulation. The generative process of hLDA (Sec. 2.2) defines a joint distribution p(θ, φ, z, c, W). Based on the conjugacy between the Dirichlet and the multinomial, θ and φ are integrated out to obtain the collapsed distribution:

p(z, c, W) = p(c | γ) ∏_d [B(C_d + α) / B(α)] ∏_k [B(C_k + β_{l_k} 1) / B(β_{l_k} 1)],  (1)

where C_d = (C_{d,1}, ..., C_{d,L}) are the document-level counts, i.e., C_{d,l} = Σ_n I(z_{dn} = l); C_k = (C_{k,1}, ..., C_{k,V}) are the topic-word counts, i.e., C_{k,v} = Σ_{d,n} I(w_{dn} = v, c_{d,z_{dn}} = k); and B(·) is the multivariate beta function, B(α) = ∏_i Γ(α_i) / Γ(Σ_i α_i).
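The multivariate beta function in Eq. (1) is evaluated through log-gamma functions, which is exactly the cost discussed later in Sec. 3.1. A minimal helper, in log space to avoid underflow:

```python
import math

def log_multivariate_beta(alpha):
    """log B(alpha) = sum_i log Gamma(alpha_i) - log Gamma(sum_i alpha_i),
    the quantity appearing in the collapsed distribution (1)."""
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))
```

For example, B(1, 1) = 1 and B(2, 2) = 1/6, so the helper returns 0 and log(1/6) respectively.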
CGS alternately samples z and c from their conditional distributions p(z_d | c, z_{-d}, W) and p(c_d | c_{-d}, z, W), where the subscript -d represents excluding document d, and -dn means excluding the token (d, n), e.g., z_{-dn} = z \ {z_{dn}}. Furthermore, the superscript d denotes only considering document d, e.g., C_k^d are the topic-word counts restricted to document d. Finally,

p(z_{dn} = l | c, z_{-dn}, W) ∝ (C_{d,l}^{-dn} + α_l) (C_{c_{d,l}, w_{dn}}^{-dn} + β_l) / (Σ_v C_{c_{d,l}, v}^{-dn} + V β_l),

and

p(c_d | c_{-d}, z, W) ∝ p(c_d | c_{-d}) ∏_{l=1}^{L} B(C_{c_{d,l}}^{-d} + C_{c_{d,l}}^{d} + β_l 1) / B(C_{c_{d,l}}^{-d} + β_l 1).  (2)
A straightforward approach to parallelizing CGS is to let each thread work on a disjoint set of documents and synchronize the counts between threads and machines. One possible solution for the synchronization is the parameter server (Ahmed et al., 2012), which maintains a local copy of the counts on each worker machine as well as a master copy on the parameter server. Each machine periodically synchronizes its local copy with the parameter server by pushing its changes to the server and fetching the latest parameters. Multiple worker threads read and write the local copy concurrently, as illustrated in Fig. 1(a).
However, this approach has several disadvantages that limit its scalability and efficiency. Firstly, due to the limited network bandwidth, the period of synchronization can be relatively long, e.g., minutes. While this is acceptable for LDA, the stale state can potentially harm the quality of inference for hLDA, which is much more sensitive to local optima, as we will analyze in Sec. 3.2. Secondly, the local copy of the counts needs to support concurrent reads, updates, and resizes from the worker threads, which is notoriously difficult and much more expensive than a serial version (Dechev et al., 2006). Finally, even in a serial setting, the computational cost of CGS is high because sampling c_d involves the multivariate beta function, which is computed with gamma functions (see the appendix) that are much more expensive than simple arithmetic.
3.2. Block Gibbs Sampling
To address the scalability and efficiency issues of CGS, we begin with a block Gibbs sampler (BGS), an instantiated weight algorithm based on the same model formulation as the variational inference algorithm (Wang and Blei, 2009), but with a lower per-iteration time complexity obtained by replacing expectation with sampling.
BGS is based on the stick-breaking formulation of nCRP (defined in Sec. 2.1), which defines a joint distribution p(π, φ, θ, z, c, W). Integrating out θ, BGS samples from the posterior distribution p(π, φ, z, c | W) by alternately sampling z, c, π and φ given the others. The resultant updates are as follows:
Sample z: for each token, draw a level assignment z_{dn} from the distribution p(z_{dn} = l | c, z_{-dn}, φ, W) ∝ (C_{d,l}^{-dn} + α_l) φ_{c_{d,l}, w_{dn}}.
Sample c: for each document, sample a path from the conditional distribution p(c_d | π, z, W) ∝ p(c_d | π) f(c_d), where p(c_d | π) = ∏_{l=2}^{L} π_{c_{d,l-1}}(c_{d,l}) is the probability of going from each node on the path to its child, and

f(c_d) = ∏_{l=1}^{L} ∏_{n: z_{dn} = l} φ_{c_{d,l}, w_{dn}}.  (3)
Sample π: draw the stick-breaking weights π_p for each node p, where V_i ~ Beta(1 + m_i, γ + Σ_{j>i} m_j), m_i is the number of times that a path goes through p and its i-th child, and π_i = V_i ∏_{j<i} (1 - V_j).
Sample φ: draw the topic distribution φ_k ~ Dir(C_k + β_{l_k} 1) for each topic k.
A subtlety here concerns sampling π and φ: π is infinite-dimensional, and there are infinitely many φ's. We approximate the sampling by truncating π to be finite-dimensional, i.e., there are finitely many children for each node, so that the whole tree has a finite number of nodes. This approximation can be avoided with slice sampling (Ge et al., 2015), but truncation is not the main factor affecting the model quality, as there are more severe issues, as we shall see soon.
Due to conditional independence, the document-specific random variables z_d and c_d can be sampled in parallel for each document. This fits a bulk-synchronous parallel (BSP) pattern: in each iteration, π and φ are sampled and broadcast to each worker, and then the workers sample z and c without any communication, as illustrated in Fig. 1(b). BSP has been successfully adopted for LDA (Zhai et al., 2012; Zaheer et al., 2016; Chen et al., 2016) and achieves higher throughput than asynchronous versions (e.g., the parameter server) because it is lock-free (Zaheer et al., 2016). Moreover, the BGS update for c is also cheaper than that of CGS because the computation of f(c_d) only involves the logarithm of φ, which remains invariant during the sampling of z and c and can therefore be preprocessed. Unfortunately, while BSP works well for LDA, its quality of inference is unsatisfactory for hLDA in practice. We provide a number of explanations for this phenomenon:
φ is not slow-changing. BSP works well for LDA because the topic-word counts are slow-changing; a stale φ is therefore close to the fresh version, and the result is not much affected. However, this assumption does not hold for hLDA, because a topic can have very few documents assigned to it. For instance, if the sampler assigns a document to a topic that does not have any assigned documents, the topic-word count for that topic suddenly changes from a zero vector to a nonzero one, which differs significantly from its stale version.
Label switching. If two different workers generate two new topics, it is not clear whether they are the same one. For example, in a certain iteration, documents d_1 and d_2 should be assigned to two different new topics t_1 and t_2. But in BGS, two different workers may decide to assign d_1 and d_2 to the same topic t, because neither worker knows the changes made by the other, and each just regards t as a topic with no documents assigned to it. As a result, instead of learning two different topics t_1 and t_2, BGS learns one topic that is a combination of the two. For flat models such as DPMM or the hierarchical Dirichlet process (HDP) (Teh et al., 2004), label switching is sometimes (approximately) resolved by running an algorithm that matches the new topics from different workers (Campbell et al., 2015). However, it is not clear how to match topics on trees.
Local optima. In flat models, even when label switching happens, e.g., two topics are mixed as one, the algorithm may gradually separate them by generating a new topic and assigning the documents that belong to one of them to the new topic (Yang et al., 2016). However, these moves are more difficult for hLDA because it is more sensitive to local optima. For instance, suppose two topics t_1 and t_2 are incorrectly mixed as one topic t, and t has a subtree. To correctly separate t_1 and t_2, the sampler needs to create a new sibling of t and move some descendants of t to the sibling. Such operations can hardly be achieved with local moves. Wang and Blei (2009) attempted to make this kind of move with split-and-merge operations, whose time complexity is typically quadratic in the number of topics and does not scale to a large number of topics.
3.3. Partially Collapsed Gibbs Sampling
It can be seen from the aforementioned discussion that there is a tradeoff between scalability and the quality of inference. CGS learns good models but is not scalable, while BGS is very scalable but sacrifices the quality of inference. To combine their advantages, we propose a partially collapsed Gibbs sampler (PCGS).
Intuitively, if a topic has many assigned documents, its count changes slowly. Based on this insight, we categorize the topics into slow-changing topics (SCTs) and rapid-changing topics (RCTs), such that the number of SCTs dominates. Then, we perform CGS for the RCTs and BGS for the SCTs. The quality of inference is not greatly affected because we perform CGS for the RCTs, and the scalability and efficiency are good because for most topics we perform the scalable and efficient BGS. We define a topic to be rapid-changing if it is assigned to fewer than a user-defined constant number of documents, and slow-changing otherwise.
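The RCT/SCT split is a simple thresholding of the per-topic document counts; a minimal sketch (the dictionary layout is our assumption, not the paper's data structure):

```python
def split_topics(doc_counts, threshold):
    """Partition topic IDs by the PCGS rule: a topic assigned to fewer
    than `threshold` documents is rapid-changing (collapsed, CGS-style);
    the rest are slow-changing (instantiated, BGS-style)."""
    rapid = {k for k, n in doc_counts.items() if n < threshold}
    slow = {k for k, n in doc_counts.items() if n >= threshold}
    return rapid, slow
```

The split is recomputed as counts evolve, so a topic that accumulates documents migrates from the collapsed set to the instantiated set.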
Formally, we collapse the topic distributions of the RCTs and keep those of the SCTs instantiated, and derive the following Gibbs sampling updates, where the details can be found in the appendix:
Sample z: draw the level assignment z_{dn} for each token from its conditional distribution.
Sample c: sample the path c_d from

p(c_d | c_{-d}, z, φ, W) ∝ p(c_d | c_{-d}) ∏_{l: c_{d,l} ∈ RCTs} B(C_{c_{d,l}}^{-d} + C_{c_{d,l}}^{d} + β_l 1) / B(C_{c_{d,l}}^{-d} + β_l 1) × ∏_{l: c_{d,l} ∈ SCTs} ∏_{n: z_{dn} = l} φ_{c_{d,l}, w_{dn}}.  (4)
Sample φ: for each slow-changing topic k, draw φ_k ~ Dir(C_k + β_{l_k} 1).
These PCGS updates simply combine the update rules of CGS and BGS, utilizing the CGS rule for the rapid-changing topics and the BGS rule for the slow-changing topics.
Since the document visit counts (defined in Sec. 2.1) require only space linear in the number of topics, they are cheap to synchronize. We keep the entire tree weight π collapsed out, and periodically synchronize the counts across machines. PCGS creates new topics in the same way as CGS. Therefore, PCGS does not require truncation and has the correct stationary distribution.
For distributed computing, PCGS performs asynchronous updates for the rapid-changing counts and BSP updates for the slow-changing counts, as illustrated in Fig. 1(c). Since there are few rapid-changing topics, the amount of asynchronous updates in PCGS is much smaller than in CGS, which needs to update all the counts asynchronously. Thanks to the small size of the asynchronous updates, network bandwidth is not a bottleneck for PCGS, and PCGS can update the counts more frequently than CGS. As a result, the PCGS counts are fresher than the CGS counts in the distributed setting. Because the number of slow-changing topics dominates, PCGS enjoys similar scalability and efficiency to BGS.
3.4. Initialization Strategy
hLDA is sensitive to local optima, so a proper initialization is crucial for obtaining good results. We adopt the progressive online initialization strategy (Blei et al., 2010; Wang and Blei, 2009), which begins with an empty collection of documents and gradually adds documents by inferring the posterior of the document-specific variables given all the previously observed documents. The documents are organized into mini-batches, and the document-specific variables are sampled per mini-batch.
To further improve the model quality, we notice that all the aforementioned algorithms update z and c while keeping the other fixed, which can severely trap the sampler in local optima. For example, after a document is assigned to a certain path c_d, its words are assigned to the levels of that path. In the next iteration, even if there is another path c' such that p(w_d | c') is larger than p(w_d | c_d), the conditional probability of c' given z_d is not likely to be larger than that of c_d, because z_d is already optimized for c_d. In this case, the path assignments will quickly be trapped in a local optimum even if there are better path assignments. We also notice that, similarly to multinomial mixture models (Rigouste et al., 2007), the sampling of c_d is almost deterministic, because its log-probability is a sum of log-likelihoods over all the words and can differ by hundreds for different paths. Therefore, it is difficult for the sampler to jump out of the local trap simply by its randomness.
We propose a remedy for this problem: for the first several iterations, we sample c_d from p(c_d | c_{-d}, W) directly, instead of from p(c_d | c_{-d}, z, W) (Eq. 4); in other words, we integrate z_d out. In these initial iterations, the sampler focuses on finding the optimal path assignment for each document; afterwards, the algorithm samples z and c to refine the model. Unfortunately, p(c_d | c_{-d}, W) has no closed-form representation. We approximate it with Monte-Carlo integration,

p(c_d | c_{-d}, W) ≈ (1/I) Σ_{i=1}^{I} p(c_d | c_{-d}, z_d^{(i)}, W),  z_d^{(i)} ~ q(z_d),

where I is the number of samples, and q(z_d) is a Polya distribution, which is approximated with a uniform discrete distribution over levels.
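In log space, a Monte-Carlo average of the form (1/I) Σ_i exp(l_i) should be computed with the standard log-mean-exp trick, since the per-sample log-likelihoods can differ by hundreds, as noted above. A minimal helper:

```python
import math

def log_mean_exp(log_values):
    """Numerically stable log of the average of exponentials: the form a
    Monte-Carlo estimate log((1/I) * sum_i exp(l_i)) takes in log space.
    Shifting by the maximum avoids underflow for very negative inputs."""
    m = max(log_values)
    return m + math.log(sum(math.exp(v - m) for v in log_values) / len(log_values))
```

A naive `math.log(sum(math.exp(v) ...))` would underflow to log(0) for values around -1000, whereas the shifted version returns the exact answer.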
4. System Implementation
Our distributed training system for hLDA combines machine-level and thread-level parallelism, as shown in Fig. 2. On the machine level, we use MPI to synchronize the tree structure and the counts across machines; on the thread level, a number of threads concurrently read and update the local counts.
To implement the system efficiently, several challenges must be addressed. Firstly, for each worker thread, the data layout and algorithm should be organized in a vectorization-friendly way for efficient memory access and computation; moreover, expensive computations such as logarithms should be avoided as much as possible. Secondly, for efficient multi-thread parallelism, the shared data structures should be lock-free. Finally, the communication strategy needs to be chosen carefully to minimize the communication overhead and maximize the freshness of the counts. We now present solutions to these challenges.
4.1. Vectorization and Preprocessing
We first discuss how to organize the computation for vectorization and preprocessing. The most time-consuming part of PCGS is the sampling of c according to Eq. (4), or more concretely, computing the collapsed term for each rapid-changing topic and the instantiated term for each slow-changing topic. Because both terms are very close to zero, we compute their logarithms. Rewriting Eq. (3),

log f(c_d) = Σ_{l=1}^{L} Σ_{n ∈ C_{dl}} log φ_{c_{d,l}, w_{dn}},  (5)

where C_{dl} is the set of all tokens in document d that are assigned to level l, which can be computed by bucket-sorting the level assignments z_{dn} along with their positions n.
Eq. (5) can be vectorized: for each level l, we sum the rows of a precomputed matrix over the tokens in C_{dl}, restricted to the columns K_l, where K_l is the subset of topic IDs on level l. The matrix is the transpose of the element-wise logarithm of φ, which is preprocessed to avoid computing logarithms during sampling. We store the matrix in row-major order, so that accessing the topics of a certain word is contiguous as long as the IDs on each level are contiguous, which is easy to achieve by sorting the topic IDs according to their levels, since φ does not change while sampling c. Therefore, computing log f(c_d) is fully vectorized by adding the slices of the matrix indexed by the tokens in C_{dl} and the topics in K_l.
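The access pattern of this level-wise summation can be sketched with NumPy (a sketch of the pattern only, not the paper's code; the argument shapes are our assumptions):

```python
import numpy as np

def level_logliks(log_phi_T, tokens_by_level, topics_by_level):
    """For each level l, sum the rows of the precomputed (V x K) matrix
    log(phi)^T over the tokens assigned to that level, restricted to the
    candidate topic columns of the level -- the vectorized form of Eq. (5).

    tokens_by_level[l] -- word IDs of tokens assigned to level l;
    topics_by_level[l] -- contiguous topic IDs on level l."""
    return [
        log_phi_T[np.ix_(tokens_by_level[l], topics_by_level[l])].sum(axis=0)
        for l in range(len(tokens_by_level))
    ]
```

Each returned vector holds, for every candidate topic on the level, the log-likelihood contribution of that level's tokens.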
Similarly, Eq. (2) can be rewritten as

log [B(C_k^{-d} + C_k^d + β_l 1) / B(C_k^{-d} + β_l 1)] = Σ_{(v, j) ∈ D_{dl}} log(C_{k,v}^{-d} + β_l + j) - Σ_{j=0}^{C_k^d - 1} log(Σ_v C_{k,v}^{-d} + V β_l + j),  (6)

where we convert the logarithm of the multivariate beta function into a sum of logarithms (the derivation details can be found in the appendix). In D_{dl}, we assign each token an offset indicating which occurrence of the word it is; e.g., if a word v appears three times on level l of document d, we put (v, 0), (v, 1) and (v, 2) into D_{dl}.
Again, we vectorize the computation of Eq. (6) by computing the per-topic terms for all candidate topics on a level at once. The set of candidate topic IDs on a level can change during the sampling of c due to the birth and death of topics. For efficient vectorization, the counts need to be stored so that access across the topics of a level is contiguous. To achieve this, we store a separate count matrix for each level: for level l, we store the matrix made by concatenating the columns C_k of all topics k on that level. When a new topic on level l is created, we append it as the rightmost column. The removal of columns is deferred until the sampling of c finishes, and the result is not affected, since dead topics correspond to zero columns.
Unlike the instantiated term, the logarithms in Eq. (6) cannot be preprocessed, since the counts change during the sampling of c. Therefore, the collapsed term is much more expensive to compute than the instantiated one, supporting our argument on the inefficiency of CGS in Sec. 3.1. Fortunately, PCGS avoids the computation of logarithms as much as possible by keeping the set of rapid-changing topics small. To further accelerate the computation, we use the SIMD-enabled Intel VML library for logarithms.
4.2. Concurrent Data Structures
In our system, the collapsed count matrices are concurrently read and updated by the worker threads, and the number of columns (topics) can grow over time. Since reads greatly outnumber writes, the matrix must be read efficiently, i.e., without locks. Meanwhile, the consistency can be relaxed, since a small deviation of the counts will not affect the result much. Therefore, we only require the matrices to be eventually consistent, i.e., their values should eventually be correct if no new updates arrive. We use atomic writes to preserve eventual consistency, while reads are relaxed to non-atomic operations to maximize reading performance.
The dynamic number of columns makes the implementation challenging. The straightforward implementation for growing the matrix involves allocating a new memory region, copying the original content to the new memory, and deallocating the original memory. However, this implementation cannot achieve eventual consistency because the updates during copying will not be incorporated.
Inspired by the lock-free design of the concurrent vector (Dechev et al., 2006), which is a one-dimensional version of our matrix, we provide an efficient implementation of the concurrent matrix. Internally, it holds a list of matrix blocks, where each matrix block has twice as many columns as the previous one, starting from a constant number of columns. The first matrix block represents the first columns of the original matrix, the second matrix block represents the following columns, and so on. If a growing request exceeds the current capacity, we allocate the next matrix block on the list. For every read and update request, the requested coordinate is converted to a (block, offset) coordinate, where block is the index of the matrix block on the list and offset is the column index within that block. The coordinate conversion can be achieved with a BSR instruction on modern x86 systems (Dechev et al., 2006). Finally, to improve locality, we defragment after each PCGS iteration, i.e., we deallocate all the matrix blocks and concatenate their content into a single larger matrix block.
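The (block, offset) conversion is a bit-scan. In the Python sketch below, `int.bit_length` plays the role of the BSR instruction; the doubling block sizes follow the concurrent-vector design, and the constant first-block size is our assumption:

```python
def locate(col, first_block):
    """Map a flat column index to (block, offset) when block i holds
    first_block * 2**i columns. Block b covers flat columns
    [first_block * (2**b - 1), first_block * (2**(b+1) - 1) - 1]."""
    q = col // first_block + 1
    block = q.bit_length() - 1                  # position of the highest set bit
    offset = col - first_block * ((1 << block) - 1)
    return block, offset
```

For example, with a first block of 4 columns, columns 0-3 live in block 0, columns 4-11 in block 1 (8 columns), and column 12 starts block 2.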
4.3. Communication
For PCGS, we need to synchronize the instantiated counts across machines once per PCGS iteration, and the collapsed counts as frequently as possible. We now present an implementation of the synchronization with MPI.
Firstly, we synchronize the instantiated counts with the MPI_Allreduce operation. There are many approaches to synchronizing the collapsed counts. One possible solution is the parameter server shown in Fig. 1(a). However, while the parameter server typically assumes that the amount of communication is high and the network bandwidth is the bottleneck, the amount of PCGS communication is low, and our main concern is latency, which determines how fresh the counts are. The parameter server first merges the changes from individual worker machines at the server and then pushes the new state to the workers. While the merging decreases the amount of communication, it increases the latency by routing the changes through the server.
To optimize the latency, we design a decentralized communication strategy in which all the worker nodes directly send their changes to all the other nodes, as illustrated in Fig. 1(c). There is a synchronization thread on each worker machine with a to_send buffer, a sending buffer and a receiving buffer. The worker threads write their changes to the to_send buffer, and the synchronization threads periodically exchange the content of the to_send buffers across machines, as follows: (1) atomically exchange the to_send buffer and the sending buffer, and clear the new to_send buffer; (2) gather the content of all sending buffers into the receiving buffer with an MPI_Allgatherv operation; (3) merge all the changes in the receiving buffer into the local copy of the collapsed counts.
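The three-step exchange can be simulated serially to check the merge logic. In the sketch below, plain list concatenation stands in for MPI_Allgatherv, and the buffers are hypothetical (key, delta) pairs rather than the system's binary format:

```python
def allgather_merge(deltas_per_worker, local_counts_per_worker):
    """One round of the decentralized exchange, simulated serially: every
    worker's to_send deltas are gathered to all workers (as MPI_Allgatherv
    would do) and merged into each worker's local copy of the counts."""
    gathered = [d for worker in deltas_per_worker for d in worker]
    for local in local_counts_per_worker:
        for key, delta in gathered:
            local[key] = local.get(key, 0) + delta
    return local_counts_per_worker
```

After a round, every worker has applied the same set of deltas, so the local copies agree (eventual consistency, in the absence of new updates).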
Table 1. Statistics of the datasets.

Dataset            | Vocabulary size
NYTimes (subset)   | 101,635
NIPS               | 12,375
NYTimes            | 101,635
PubMed             | 141,043
ClueWeb12 (small)  | 100,000
ClueWeb12 (large)  | 100,000
5. Experiments
We evaluate the quality, efficiency and scalability of our algorithm and system on several datasets, including NIPS, NYTimes and PubMed from the UCI machine learning repository (Asuncion and Newman, 2007), and two subsets of the ClueWeb12 dataset (Project, 2013) (Table 1). The experiments are conducted on the Tianhe-2 supercomputer, which has two 12-core Xeon E5-2692 v2 CPUs per node and an InfiniBand network. Our quantitative and qualitative results demonstrate the promise of our approach.
We quantitatively compare the quality of the inferred models by predictive log-likelihood using the document completion approach (Wallach et al., 2009). The corpus is divided into a training corpus and a testing corpus; the testing corpus is further divided into an observed corpus W_obs, which contains a random half of the tokens of each testing document, and a held-out corpus W_ho with the other half of the tokens. The predictive log-likelihood is defined as log p(W_ho | W_obs, M), where the model M is inferred from the training corpus, and the likelihood is approximated with Monte-Carlo integration over samples (z^(i), c^(i)) drawn from the posterior distribution p(z, c | W_obs, M), which can be obtained with Gibbs sampling, with I being the number of samples. Finally, we convert the predictive log-likelihood to the predictive perplexity,

perplexity = exp(-log p(W_ho | W_obs, M) / |W_ho|),

where a lower perplexity score indicates a better model.
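The conversion from held-out log-likelihood to perplexity is a one-liner:

```python
import math

def predictive_perplexity(heldout_loglik, n_heldout_tokens):
    """perplexity = exp(-loglik / #held-out tokens); lower is better."""
    return math.exp(-heldout_loglik / n_heldout_tokens)
```

As a sanity check, a model that assigns probability 1/2 to every held-out token has perplexity exactly 2.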
5.1. Quality of Inference
We first compare the models inferred by CGS and our proposed BGS and PCGS, and examine the effect of the initialization strategy (Sec. 3.4). We also include a comparison with the open-source implementation of hLDA, hlda-c, which is a CGS algorithm but with a stick-breaking prior on θ instead of a Dirichlet prior (Blei et al., 2010).
Unlike parametric models, where the number of topics is fixed, nonparametric models such as hLDA produce different numbers of topics for different runs and different inference algorithms, even with the same hyperparameter setting. It is not fair to directly compare the perplexity of two models with different numbers of topics. For a fair comparison, we choose a rich set of hyperparameter configurations, run the algorithms for all of them, and plot the perplexity against the number of topics, as in Fig. 3. In this experiment, we train a 4-layer model on the NYTimes (subset) dataset and the NIPS dataset, varying the hyperparameters α, β and γ over a grid. Comparing the perplexity produced by the different algorithms, we make a number of observations:

- CGS and PCGS have similar quality, while BGS has worse results. This agrees with our previous analysis (Sec. 3.2) that BGS suffers from label switching and local optima.

- Our initialization strategy helps obtain better results for both CGS and PCGS.

- Our results are no worse (actually better) than those of hlda-c. The discrepancy is attributed to the different choice of prior on θ.
5.2. Efficiency
We compare the efficiency of our algorithms against hlda-c. We run the serial version of all the algorithms for 70 iterations on the NYTimes (subset) dataset, with the hyperparameters tuned to keep the number of topics around 300. The timing results are shown in Table 2. Our CGS implementation is 48 times faster than hlda-c. The significant efficiency gain is attributed to our vectorization and to converting the logarithm of the gamma function to a sum of logarithms, as described in Sec. 4.1 and the appendix. PCGS is 2.3 times faster than CGS, and BGS is 1.3 times faster than PCGS. These results match our analysis in Sec. 4.1 that BGS and PCGS are more efficient than CGS. Overall, our PCGS implementation is 111 times faster than hlda-c.
Combining the results on inference quality and efficiency, we find that PCGS offers a good tradeoff between the two, providing CGS-level quality within BGS-level time consumption.
5.3. Sensitivity of Parameters
We now examine the impact of the hyperparameters that control the behavior of PCGS.
Impact of the rapid-changing threshold: the threshold is the number of document visits below which we consider a topic's distribution rapid-changing rather than slow-changing. To investigate its effect, we run PCGS on the NYTimes dataset, varying the threshold while tuning the other hyperparameters to keep the number of topics around 500. PCGS becomes CGS when the threshold is infinite, and approaches BGS when it is zero. As shown in Fig. 6, the perplexity goes down and the time consumption goes up as the threshold grows. We find that a moderate threshold provides a good tradeoff between efficiency and quality: at our chosen setting, there are 427 slow-changing topics covering 99.7% of the documents, so the change of the rapid-changing topic counts (the amount of communication) is kept small.
Impact of the number of initializing iterations: this is how many iterations are devoted to the initialization sampling. We run PCGS on the NYTimes (subset) dataset and vary this number. It can be seen from Fig. 6 that the perplexity steadily decreases as it grows, which again shows that our initialization strategy is helpful. We select a moderate value for all the experiments.
Impact of the number of Monte-Carlo samples: this hyperparameter sets how many Monte-Carlo samples are used in the approximation; in the degenerate case, we sample from the distribution directly in the first iterations. We run PCGS on the NYTimes (subset) dataset and vary the number of samples from 1 to 128. As shown in Fig. 6, it has little impact on either the number of topics or the perplexity, implying that a small number of samples is adequate.
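A minimal sketch of such a Monte-Carlo approximation, assuming the approximated quantity is an expectation of a product of topic-word probabilities under sampled Dirichlet weights; the function names and the posterior parameterization are our assumptions, not the paper's exact estimator:

```python
import math
import random

def dirichlet_sample(alpha, rng):
    """Draw one sample from Dirichlet(alpha) via normalized Gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

def mc_doc_likelihood(word_ids, posterior_alpha, num_samples, seed=0):
    """Monte-Carlo estimate of E_phi[prod_n phi[w_n]], where
    phi ~ Dirichlet(posterior_alpha), averaged over num_samples draws."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(num_samples):
        phi = dirichlet_sample(posterior_alpha, rng)
        acc += math.prod(phi[w] for w in word_ids)
    return acc / num_samples
```

In this form the estimator's variance shrinks with more samples, but as the experiment above indicates, a handful of samples already suffices in practice.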



5.4. Scalability
Our experiments on scalability are twofold: whether the quality of inference is affected by parallelization, and how good the speedup is. We first study the multi-thread setting, where the number of threads varies from 1 to 12 on the NYTimes corpus. The result is shown in Fig. 7(a), where we observe no apparent increase of perplexity as the number of threads grows. The speedup with 12 threads is 8.56. Probable reasons for the imperfect speedup include serial regions, contention for atomic variables, and limited memory bandwidth.
For the multi-machine setting, there are two CPUs per machine. We run our implementation on the PubMed corpus on 1 to 10 CPUs, as shown in Fig. 7(b), and on the larger ClueWeb12 (small) corpus on 10 to 100 CPUs, as shown in Fig. 7(c). The speedup is 8.5 from 1 to 10 CPUs, and 7.15 from 10 to 100 CPUs. The perplexity is slightly affected by parallelization once the number of CPUs exceeds 7 and 80 on the two datasets, respectively, indicating that the datasets are not large enough to utilize that many CPUs.
Finally, to demonstrate scalability, we learn a model with 1,722 topics on the 131-million-document ClueWeb12 (large) corpus with 50 machines, and the inference finishes in 7 hours. The results are presented for qualitative evaluation in the next section.
5.5. Qualitative Analysis
We now demonstrate the topic hierarchy obtained from the ClueWeb12 (large) corpus, which is a crawl of web pages. The corpus is obtained by tokenizing the original ClueWeb12 dataset, randomly selecting about 30% of the documents, truncating the vocabulary to 100,000 words, and keeping only the documents whose length falls within a given range. We show selected parts of the obtained tree in Fig. 8, where topics whose number of occurrences does not pass a particular threshold are filtered out, and the font size of each word is proportional to the 4th root of its frequency in the topic. The tree has 5 levels in total. (A visualization demo is available online at http://ml.cs.tsinghua.edu.cn/~jianfei/scalablehlda.html.)
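The preprocessing steps just described (subsampling, vocabulary truncation, and length filtering) can be sketched as below; the function and parameter names are ours, the length bounds are placeholders, and the paper's actual pipeline may differ:

```python
import random
from collections import Counter

def preprocess(docs, vocab_size, keep_fraction, min_len, max_len, seed=0):
    """Subsample a fraction of documents, truncate the vocabulary to the
    most frequent words, and keep only documents whose filtered length
    lies within [min_len, max_len]. Each doc is a list of word tokens."""
    rng = random.Random(seed)
    sampled = [d for d in docs if rng.random() < keep_fraction]
    # Build the truncated vocabulary from the sampled documents.
    freq = Counter(w for d in sampled for w in d)
    vocab = {w for w, _ in freq.most_common(vocab_size)}
    kept = []
    for d in sampled:
        filtered = [w for w in d if w in vocab]
        if min_len <= len(filtered) <= max_len:
            kept.append(filtered)
    return kept
```

Filtering out very short and very long documents is a common practice that stabilizes topic estimates and bounds per-document work during sampling.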
Fig. 8(a) shows some selected topics on the first 3 levels. The root node contains the most commonly used words shared by all the documents. The second level contains a variety of general topics, such as "software", "travel" and "city", and the third level has more detailed concepts; e.g., the "city" topic on the second level splits into "shopping", "city names", and "locations". We further show the topic subtrees of all the layers rooted at the highlighted nodes to examine the fine-grained concepts. For example, in Fig. 8(b) the "travel" topic is divided into "islands", "India" and "vacations", and the leaf level contains specific concepts, such as "ferry" and "diving", which are correctly placed under the "islands" topic. In Fig. 8(c), the "computer" topic is divided into "website", "windows", "vps", "programming", "linux" and "forum". To our knowledge, this is the first time that hLDA has been applied to large-scale web data, and the results demonstrate its ability to automatically learn a topic hierarchy from web data.
6. Conclusions and Discussions
We present a partially collapsed Gibbs sampling (PCGS) algorithm for the hierarchical latent Dirichlet allocation model, which combines the collapsed weight algorithm and the instantiated weight algorithm. The major feature of PCGS is that it is scalable while delivering high-quality inference. We also present an initialization strategy to further improve the model quality. To make PCGS scalable and efficient, we propose vectorization and preprocessing techniques, concurrent data structures, and an efficient communication strategy. The proposed algorithm and system scale to hundreds of millions of documents, thousands of topics, and thousands of CPU cores.
Appendix A Derivation Details
A.1. Derivation of PCGS updates
Rewrite the joint distribution in Sec. 3.1 as
p(w, z, c, \theta, \phi) = p(c | \gamma) \prod_d p(\theta_d | \alpha) \prod_k p(\phi_k | \beta) \prod_{d,n} \theta_{d, z_{dn}} \phi_{c_{d, z_{dn}}, w_{dn}}.
Integrating out \theta and \phi, we have the marginal distribution
p(w, z, c) = p(c | \gamma) \prod_d \frac{B(\alpha + C_d)}{B(\alpha)} \prod_k \frac{B(\beta + C_k)}{B(\beta)},
where C_d is the vector of per-level counts in document d, C_k is the vector of per-word counts in topic k, and B(\cdot) is the multivariate Beta function.
Utilizing the identity B(\alpha + e_i) / B(\alpha) = \alpha_i / \sum_j \alpha_j, where e_i is a coordinate vector, we can derive the Gibbs sampling updates:
Sample z: Keeping only the terms relevant with z_{dn}, we have
p(z_{dn} = l | rest) \propto (\alpha_l + C_{dl}^{\neg dn}) \cdot \frac{\beta_{w_{dn}} + C_{c_{dl}, w_{dn}}^{\neg dn}}{\sum_w \beta_w + C_{c_{dl}, \cdot}^{\neg dn}} for collapsed topics, and p(z_{dn} = l | rest) \propto (\alpha_l + C_{dl}^{\neg dn}) \cdot \phi_{c_{dl}, w_{dn}} for instantiated topics.
Sample c: p(c_d | rest) \propto p(c_d | c_{\neg d}) \prod_{k \in c_d \text{ collapsed}} \frac{B(\beta + C_k^{\neg d} + C_{k,d})}{B(\beta + C_k^{\neg d})} \prod_{k \in c_d \text{ instantiated}} \prod_w \phi_{k,w}^{C_{k,d,w}}.
Sample \phi: For each instantiated topic k, draw \phi_k \sim Dir(\beta + C_k).
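The Beta-function identity used in the derivation above can be checked numerically; `log_beta` is our helper name, computed from log-gamma functions:

```python
import math

def log_beta(alpha):
    """log of the multivariate Beta function:
    B(alpha) = prod_j Gamma(alpha_j) / Gamma(sum_j alpha_j)."""
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))

alpha = [0.5, 1.5, 2.0]
i = 1
# Add the coordinate vector e_i to alpha.
alpha_plus = [a + (1.0 if j == i else 0.0) for j, a in enumerate(alpha)]
ratio = math.exp(log_beta(alpha_plus) - log_beta(alpha))
# The identity: B(alpha + e_i) / B(alpha) = alpha_i / sum_j alpha_j.
expected = alpha[i] / sum(alpha)
```

The identity follows from Gamma(x + 1) = x * Gamma(x) applied once in the numerator and once in the denominator, which is what turns the Beta-function ratios in the marginal into the simple count-based Gibbs updates.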
A.2. Derivation of computing p(w_d | c_d, w_{\neg d}, z)
We have p(w_d | c_d, w_{\neg d}, z) = \int p(w_d | c_d, z, \phi) p(\phi | w_{\neg d}, z) d\phi = E_{\phi \sim p(\phi | w_{\neg d}, z)} \left[ \prod_n \phi_{c_{d, z_{dn}}, w_{dn}} \right] \approx \frac{1}{I} \sum_{i=1}^{I} \prod_n \phi^{(i)}_{c_{d, z_{dn}}, w_{dn}}, where \phi^{(1)}, \dots, \phi^{(I)} are Monte-Carlo samples drawn from the posterior p(\phi | w_{\neg d}, z).
Acknowledgements.
The work was supported by the National Basic Research Program (973 Program) of China (No. 2013CB329403), the National NSF of China (Nos. 61620106010, 61322308, 61332007), the Youth Top-notch Talent Support Program, the Tsinghua Tiangong Institute for Intelligent Computing, and the Special Program for Applied Research on Super Computation of the NSFC-Guangdong Joint Fund (the second phase). J. Lu and S. Liu are supported by the National NSF of China (No. 61672308).
References
 Ahmed et al. (2012) Amr Ahmed, Mohamed Aly, Joseph Gonzalez, Shravan Narayanamurthy, and Alexander Smola. 2012. Scalable inference in latent variable models. In WSDM.
 Ahmed et al. (2013) Amr Ahmed, Liangjie Hong, and Alexander J Smola. 2013. Nested Chinese Restaurant Franchise Process: Applications to User Tracking and Document Modeling. In ICML.
 Asuncion and Newman (2007) Arthur Asuncion and David Newman. 2007. UCI machine learning repository. (2007).
 Blei (2009) David Blei. 2009. hlda-c. http://www.cs.columbia.edu/~blei/downloads/hldac.tgz. (2009).
 Blei et al. (2010) David M Blei, Thomas L Griffiths, and Michael I Jordan. 2010. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM (JACM) 57, 2 (2010), 7.
 Blei et al. (2003) David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet allocation. JMLR 3 (2003), 993–1022.
 Boyd-Graber et al. (2007) Jordan L Boyd-Graber, David M Blei, and Xiaojin Zhu. 2007. A Topic Model for Word Sense Disambiguation. In EMNLP-CoNLL.
 Campbell et al. (2015) Trevor Campbell, Julian Straub, John W Fisher III, and Jonathan P How. 2015. Streaming, Distributed Variational Inference for Bayesian Nonparametrics. In NIPS.
 Chen et al. (2016) Jianfei Chen, Kaiwei Li, Jun Zhu, and Wenguang Chen. 2016. WarpLDA: a Cache Efficient O(1) Algorithm for Latent Dirichlet Allocation. In VLDB.
 Dechev et al. (2006) Damian Dechev, Peter Pirkelbauer, and Bjarne Stroustrup. 2006. Lock-free dynamically resizable arrays. In International Conference On Principles Of Distributed Systems.
 Ge et al. (2015) Hong Ge, Yutian Chen, Moquan Wan, and Zoubin Ghahramani. 2015. Distributed inference for Dirichlet process mixture models. In ICML.
 Li and McCallum (2006) Wei Li and Andrew McCallum. 2006. Pachinko allocation: DAGstructured mixture models of topic correlations. In ICML.
 Murphy (2012) Kevin P Murphy. 2012. Machine learning: a probabilistic perspective. MIT press.
 Neal (2000) Radford M Neal. 2000. Markov chain sampling methods for Dirichlet process mixture models. Journal of computational and graphical statistics 9, 2 (2000), 249–265.
 Paisley et al. (2015) John Paisley, Chong Wang, David M Blei, and Michael I Jordan. 2015. Nested hierarchical Dirichlet processes. IEEE Transactions on Pattern Analysis and Machine Intelligence 37, 2 (2015), 256–270.
 Project (2013) The Lemur Project. 2013. The ClueWeb12 Dataset. http://lemurproject.org/clueweb12/. (2013).
 Pujara and Skomoroch (2012) Jay Pujara and Peter Skomoroch. 2012. Large-scale hierarchical topic models. In NIPS Workshop on Big Learning.
 Rigouste et al. (2007) Loïs Rigouste, Olivier Cappé, and François Yvon. 2007. Inference and evaluation of the multinomial mixture model for text clustering. Information processing & management 43, 5 (2007), 1260–1280.
 Sethuraman (1994) Jayaram Sethuraman. 1994. A constructive definition of Dirichlet priors. Statistica sinica (1994), 639–650.
 Teh (2011) Yee Whye Teh. 2011. Dirichlet process. In Encyclopedia of machine learning. Springer, 280–287.
 Teh et al. (2004) Yee Whye Teh, Michael I Jordan, Matthew J Beal, and David M Blei. 2004. Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes. In NIPS.
 Wallach et al. (2009) Hanna M Wallach, Iain Murray, Ruslan Salakhutdinov, and David Mimno. 2009. Evaluation methods for topic models. In ICML.
 Wang and Blei (2009) Chong Wang and David M Blei. 2009. Variational inference for the nested Chinese restaurant process. In NIPS.
 Wang et al. (2016) X. Wang, S. Liu, J. Liu, J. Chen, J. J. H. Zhu, and B. Guo. 2016. TopicPanorama: A Full Picture of Relevant Topics. TVCG 22, 12 (2016), 2508–2521. DOI:http://dx.doi.org/10.1109/TVCG.2016.2515592
 Wei and Croft (2006) Xing Wei and W Bruce Croft. 2006. LDAbased document models for adhoc retrieval. In SIGIR.
 Yang et al. (2016) Yuan Yang, Jianfei Chen, and Jun Zhu. 2016. Distributing the Stochastic Gradient Sampler for LargeScale LDA. In KDD.
 Zaheer et al. (2016) Manzil Zaheer, Michael Wick, JeanBaptiste Tristan, Alex Smola, and Guy L Steele Jr. 2016. Exponential stochastic cellular automata for massively parallel inference. In AISTATS.
 Zhai et al. (2012) Ke Zhai, Jordan Boyd-Graber, Nima Asadi, and Mohamad L Alkhouja. 2012. Mr. LDA: A flexible large scale topic modeling package using variational inference in MapReduce. In WWW.
 Zhu et al. (2012) J. Zhu, A. Ahmed, and E. P. Xing. 2012. MedLDA: Maximum Margin Supervised Topic Models. JMLR 13 (2012), 2237–2278.