LightLDA: Big Topic Models on Modest Compute Clusters

by Jinhui Yuan et al.

When building large-scale machine learning (ML) programs, such as big topic models or deep neural nets, one usually assumes such tasks can only be attempted with industrial-sized clusters of thousands of nodes, which are out of reach for most practitioners and academic researchers. We consider this challenge in the context of topic modeling on web-scale corpora, and show that with a modest cluster of as few as 8 machines, we can train a topic model with 1 million topics and a 1-million-word vocabulary (for a total of 1 trillion parameters) on a document collection with 200 billion tokens, a scale not yet reported even with thousands of machines.

Our major contributions include:

1) a new, highly efficient O(1) Metropolis-Hastings sampling algorithm, whose running cost is (surprisingly) agnostic of model size, and which empirically converges nearly an order of magnitude faster than current state-of-the-art Gibbs samplers;

2) a structure-aware model-parallel scheme, which leverages dependencies within the topic model, yielding a sampling strategy that is frugal on machine memory and network communication;

3) a differential data structure for model storage, which uses separate data structures for high- and low-frequency words, allowing extremely large models to fit in memory while maintaining high inference speed;

4) a bounded asynchronous data-parallel scheme, which allows efficient distributed processing of massive data via a parameter server.

Our distribution strategy is an instance of the model-and-data-parallel programming model underlying the Petuum framework for general distributed ML, and was implemented on top of the Petuum open-source system. We provide experimental evidence showing how this development puts massive models within reach on a small cluster, while still enjoying proportional time-cost reductions with increasing cluster size, in comparison with alternative options.
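The abstract does not spell out how a Metropolis-Hastings proposal can be drawn in O(1) time regardless of the number of topics. A standard primitive for this is Walker's alias method, which preprocesses a discrete distribution in O(n) time so that every subsequent draw costs O(1); samplers in this family typically amortize the table construction over many token updates. The sketch below is an illustration of that primitive only (the function names are my own), not the paper's full sampler:

```python
import random

def build_alias_table(probs):
    """Walker's alias method: O(n) preprocessing so that each
    subsequent draw from the discrete distribution costs O(1).
    `probs` must be nonnegative and sum to 1."""
    n = len(probs)
    scaled = [p * n for p in probs]          # rescale so the average bucket mass is 1
    prob = [0.0] * n                         # per-bucket coin bias
    alias = [0] * n                          # per-bucket alternative outcome
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]                  # bucket s keeps its own mass...
        alias[s] = l                         # ...and borrows the rest from l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    for i in large + small:                  # leftovers are exactly full buckets
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias, rng=random):
    """O(1) sample: pick a bucket uniformly, then flip one biased coin."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```

Because a draw touches only one bucket, the cost per proposal stays constant even when the topic count reaches 1 million; a Metropolis-Hastings accept/reject step then corrects for the table being slightly stale.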

