Scalable Inference for Nested Chinese Restaurant Process Topic Models

by Jianfei Chen, et al.

Nested Chinese Restaurant Process (nCRP) topic models are powerful nonparametric Bayesian methods for extracting a topic hierarchy from a text corpus, where the hierarchical structure is determined automatically by the data. Hierarchical Latent Dirichlet Allocation (hLDA) is a popular instance of nCRP topic models. However, hLDA has only been evaluated at small scale, because the existing collapsed Gibbs sampling and instantiated weight variational inference algorithms are either not scalable or sacrifice inference quality through mean-field assumptions. Moreover, an efficient distributed implementation of the data structures, such as dynamically growing count matrices and trees, is challenging. In this paper, we propose a novel partially collapsed Gibbs sampling (PCGS) algorithm, which combines the advantages of collapsed and instantiated weight algorithms to achieve good scalability as well as high model quality. We also present an initialization strategy that further improves model quality. Finally, we propose an efficient distributed implementation of PCGS through vectorization, pre-processing, and a careful design of the concurrent data structures and communication strategy. Empirical studies show that our algorithm is 111 times more efficient than the previous open-source implementation for hLDA, with comparable or even better model quality. Our distributed implementation can extract 1,722 topics from a 131-million-document corpus with 28 billion tokens using 50 machines in 7 hours; this corpus is 4-5 orders of magnitude larger than the previous largest corpus.
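To make concrete what it means for the hierarchy to be "determined by the data", the following is a minimal sketch of drawing document paths from the nCRP prior alone. It is not the PCGS algorithm proposed in the paper and involves no words or topics. The names `Node`, `sample_path`, `count_nodes`, and the parameters `depth` and `gamma` (the concentration parameter) are illustrative assumptions, and the fixed path depth follows the usual hLDA setup.

```python
from __future__ import annotations

import random
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of the topic tree (hypothetical helper type).

    `count` is the number of documents whose path passes through the node.
    """
    count: int = 0
    children: list[Node] = field(default_factory=list)


def sample_path(root: Node, depth: int, gamma: float, rng: random.Random) -> list[Node]:
    """Draw one root-to-leaf path from the nCRP prior and update the counts."""
    root.count += 1
    path = [root]
    node = root
    for _ in range(depth - 1):
        # Chinese restaurant process at this node: an existing child is chosen
        # with probability proportional to its count, a new child with
        # probability proportional to gamma.
        weights = [child.count for child in node.children] + [gamma]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(node.children):
            node.children.append(Node())  # the tree grows here
        node = node.children[k]
        node.count += 1
        path.append(node)
    return path


def count_nodes(node: Node) -> int:
    """Total number of nodes in the subtree rooted at `node`."""
    return 1 + sum(count_nodes(child) for child in node.children)


# Assumed toy setting: 1000 documents, depth-3 paths, gamma = 1.0.
rng = random.Random(0)
root = Node()
for _ in range(1000):
    sample_path(root, depth=3, gamma=1.0, rng=rng)
print("nodes in tree:", count_nodes(root))
```

In this sketch the number of nodes is not fixed in advance: it grows as more documents are added, and a larger `gamma` tends to produce a bushier tree. A full hLDA sampler would additionally weight each candidate path by the likelihood of the document's words, which is the part the inference algorithms discussed above address.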




