1 Introduction
Topic modeling provides a suite of statistical tools to discover latent semantic structures from complex corpora, with latent Dirichlet allocation (LDA) [7] as the most popular one. LDA has found many applications in text analysis [8, 35]
[18, 22], recommendation systems [14], information retrieval [29] and network analysis [11, 13]. LDA represents each document as an admixture of topics, each of which is a unigram distribution of words. Since exact inference is intractable, both variational Bayes (VB) and Markov Chain Monte Carlo (MCMC) methods have been developed for approximate inference, including meanfield variational Bayes
[7], collapsed variational Bayes [26], collapsed Gibbs sampling (CGS) [16] and expectation propagation [23]. Among these methods, CGS is most popular due to its simplicity and availability for fast sparsityaware algorithms [30, 20, 32].Entering the Big Data era, applications often require largescale topic modeling to boost their performance. For example, Wang et al. [28, 32] show that learning 1 million topics can lead to significant performance gain on various tasks such as advertisement and recommendation. Other recent endeavours on learning large topic models often contain billions of documents, millions of topics, and millions of unique tokens [28, 32, 21]. Various fast sampling algorithms have been proposed for LDA, reducing the time complexity of sampling the token assignment pertoken from to , where is the number of topics [30, 21, 32].
Although many efforts have been spent on improving the pertoken sampling complexity, little attention has been paid to examine the cache locality, another important dimension to improve the overall efficiency. For all the aforementioned algorithms, the cache locality is not getting better; in fact some are even getting worse (See Table 2 for details). The running time of LDA is often dominated by random memory accesses, where the time consumption is roughly proportional to the latency of each access. As shown in Table 1, the latency of accessing different levels of the memory hierarchy varies greatly, and accessing a higherlevel cache can be orders of magnitude faster than accessing a lowerlevel cache or the main memory. As we will show in Sec. 3, when processing a single document, the random accesses of previous LDA algorithms spread across either an matrix or an matrix, where is the number of documents and is the vocabulary size. As , and can all exceed one million in largescale applications, the matrix can be tens of gigabytes in size, which is too large to fit in any cache, resulting in unsatisfactory memory efficiency. Moreover, this size is difficult to reduce for CGS because both the documentwise counts and the wordwise counts need to be accessed for sampling a single token.
L1D  L2  L3  Main memory  

Latency (cycles)  5  12  30  180+ 
Size  32KB  256KB  30MB  10GB+ 
In this paper, we propose to reduce the latency of random accesses of LDA by reducing the size of the randomly accessed memory. Based on a careful analysis of existing algorithms, we develop WarpLDA^{1}^{1}1The name comes after Warp Drive, the hypothetical fasterthanlight propulsion system in Star Trek.
, a novel sampling algorithm based on MonteCarlo Expectation Maximization (MCEM) that preserves the best
time complexity pertoken and has some carefully designed reordering strategy to achieve an size of randomly accessed memory perdocument, which is small enough to fit in the L3 cache. As the L3 cache is at least six times faster than the main memory (See Table 1), this could result in a significant performance gain. Another nice property of WarpLDA is that it simplifies the system design. We present an implementation of WarpLDA in a distributed and memoryefficient way, with a carefully designed computational framework based on distributed sparse matrices.Extensive empirical studies in a wide range of settings demonstrate that WarpLDA consistently converges 515x faster than the stateoftheart MetropolisHastings based LightLDA [32] and in most settings, faster than the stateoftheart sparsity aware F+LDA [31] for learning LDA models with thousands to hundreds of thousands topics. WarpLDA scales to corpora of billionscale documents, and achieves an unprecedentedly 11G tokens per second throughput with 256 machines.
Outline: Section 2 introduces some basics of LDA, Section 3 provides the memory efficiency analysis of existing algorithms. Section 4 introduces the WarpLDA algorithm, and Section 5 provides system design and implementation details. Experiments are presented in Section 6. Section 7 concludes.
Algorithm  Type  Amount of sequential accesses  Number of random accesses  Size of randomly accessed memory  Order 

(pertoken)  (pertoken)  perdocument  
CGS [16]        doc  
SparseLDA [30]  SA  doc  
AliasLDA [20]  SA&MH  doc  
F+LDA [31]  SA  word  
LightLDA [32]  MH    doc  
WarpLDA  MH    1  doc&word 
2 Backgrounds
2.1 Basics of LDA
Let denote the collection of words in document and be a collection of documents. Let denote the vocabulary size. Latent Dirichlet Allocation [7] is a hierarchical Bayesian model, which models the distribution of a word as a mixture of topic distributions, with a shared mixing proportion for words within the same document. Formally, each topic is a dim word distribution , which follows a Dirichlet prior with parameter ; and the generative process of LDA for each document is:

[label=]

Draw a dim topic mixing proportion: ,

For each position :

[label=]

Draw a topic assignment: ,

Draw a word: ,

where is a multinomial distribution (or categorical distribution), and are Dirichlet parameters. Let be the topic matrix. We further denote and , where . Let and . In this paper, we define token as an occurrence of a word, e.g., “apple” is a word, and each of its occurrence is a token. When we say “all tokens of word ” we mean all the occurrences of , which may have different topic assignments and we use to denote the topic assignments of all tokens of word .
To train LDA, one must infer the posterior distribution (or its marginal version) of latent variables given . Unfortunately, exact posterior inference is intractable. Thus approximate techniques including variational Bayes and Markov Chain Monte Carlo (MCMC) methods are adopted. As mentioned before, Collapsed Gibbs Sampling (CGS) [16] is most popular because of its simplicity and the availability of fast sampling algorithms. Given , CGS integrates out by conjugacy and iteratively samples from the local conditional distribution:
(1) 
where is the number of tokens that are assigned to topic in document ; is the number of times that word has topic ; . ^{2}^{2}2We distinguish different counts by their subscripts. The superscript or subscript stands for excluding () from the corresponding count or collection. We further define to be the matrix formed by , and to be the matrix formed by , with and being their particular rows indexed by and , and ,
being the number of nonzero entries of the corresponding row. Let the global topic count vector
.By sampling in a collapsed space, CGS often converges faster than a standard Gibbs sampler in a space with all the variables. A straightforward implementation of Eq. (1) is of complexity pertoken by naively enumerating all the possible topic assignments. This can be too expensive for largescale applications where can be in the order of . Various fast sampling algorithms [30, 20, 32, 31] exist to speed this up, as we shall see in Sec. 3.
2.2 Sample from Probability Distributions
Before introducing the algorithm, we first describe some useful tools to sample from probability distributions, which are used by WarpLDA and other algorithms
[32, 20].MetropolisHastings (MH): Let be an (unnormalized) target distribution. We consider the nontrivial case that it is hard to directly draw samples from . MH methods construct a Markov chain with an easytosample proposal distribution at each step . Starting with an arbitrary state , MH repeatedly generates samples from the proposal distribution , and updates the current state with the new sample with an acceptance rate . Under some mild conditions, it is guaranteed that converges to as , regardless of and (See Alg. 1). In LDA, is the distribution of topic assignment in Eq. (1), whose sampling complexity is , and is often a cheap approximation of , as will be clear soon.
Mixture of multinomials: If a multinomial distribution has the form
it can be represented by a mixture of two distributions,
where and are the normalizing coefficients, and
are the normalized mixture distributions. By introducing an extra binary variable
, which follows a Bernoulli distribution
, and defining , one can confirm that is a marginal distribution of . Therefore, a sample from can be drawn via an ancestral sampler, which first draws and then samples from if and from if . This principle is useful when both and are easy to sample from.Alias Sampling: Alias sampling [27] is a technique to draw samples from a dim multinomial distribution in after construction of an auxiliary structure called alias table. The alias table has bins with equal probability, with at most two outcomes in each bin. The samples can be obtained by randomly selecting a bin and then randomly selecting an outcome from that bin. Alias sampling has been used in previous LDA works [20, 32].
3 Analysis of existing algorithms
We now describe our methodology to analyze the memory access efficiency, and analyze of existing fast sampling algorithms of LDA [30, 21, 32, 31] to motivate the development of WarpLDA.
3.1 Methodology
The running time of LDA is often dominated by random memory accesses, whose efficiency depends on the average latency of each access. As accessing main memory is very slow, modern computers exploit locality to reduce the latency. If many memory accesses are concentrated in a small area which fits in the cache, the latency will reduce greatly, as shown in Table 1. We focus on the L3 cache in this paper, which is about 30 megabytes, and is at least six times faster than the main memory.
The size of randomly accessed memory is an important factor to the latency, because a smaller memory region can fit in a higherlevel cache, which has smaller latency. So we analyze the memory access efficiency by the size of randomly accessed memory, more precisely, the size of possibly accessed memory region when sampling the topic assignments for a document or a word , depending on whether the sampling algorithm visits the tokens documentbydocument or wordbyword, which will be defined soon.
As we shall see soon, there are two main types of random access encountered for LDA: (1) randomly accessing a matrix of size or ; and (2) randomly accessing a vector of size . As , and can all exceed one million, the matrix is typically tens of gigabytes in size even when it is sparse; but the vector is megabytes or smaller in size, and fits in the L3 cache. Therefore, it is reasonable to assume that random accesses to matrices are not efficient, while random accesses to vectors are efficient.
Existing fast sampling algorithms for LDA follow CGS that we introduced in Sec. 2.1, which iteratively visits every token of the corpus, and samples the topic assignments based on the count matrices and . The tokens can be visited in any order. Two commonly used ordering are documentbydocument which firstly visits all tokens for document 1, and then visits all tokens for document 2, and so on; and wordbyword which firstly visits all tokens for word 1, and then visits all tokens for word 2, and so on. These orderings determine which one of the two count matrices and can be accessed efficiently, as we will see in Sec. 3.3.
3.2 Existing Fast Sampling Algorithm
We now summarize existing fast sampling algorithms [30, 20, 32, 31]. These algorithms can be categorized as being either sparsityaware or MHbased, and they have different factorizations to the basic CGS sampling formula Eq. (1). For clarity all the superscripts of CGS based algorithms are omitted.
Sparsityaware algorithms utilize the sparse structure of the count matrices and . For example, SparseLDA [30] has the factorization and it enumerates all nonzero entries of and to calculate the normalizing constant of each term, which are , and respectively. AliasLDA [20] has the factorization , where and in the latter term are approximated with their stale versions. AliasLDA enumerates all nonzero entries of to calculate the normalizing constant of the first term, and an alias table is used to draw samples from the latter term in amortized time. Then, an MH step is used to correct the bias of stale topic counts. F+LDA [31] has the same factorization as AliasLDA but visits the tokens wordbyword, and use a F+ tree for the exact sampling of the latter term.
MHbased algorithms rely on some easytosample proposal distributions to explore the state space . The proposal distributions are not necessarily sparse, and hence MHbased algorithms can be applied to models whose ’s do not have sparsity structures, e.g., MedLDA [34] or dynamic topic models [4]. For example, LightLDA [32] alternatively draws samples from two simple proposal distributions and , and accepts the proposals by the corresponding acceptance rate. The time complexity of sampling for a token is . The complexity of LightLDA has already reached the theoretical lower bound, however its practical throughput is only roughly 4M tokens per second per machine [32], due to its slow random access.
3.3 Analysis
All the aforementioned algorithms are not memory efficient because they have random accesses to large matrices. Specifically, the main random accesses of these algorithms are to the count matrices and . Ignoring all details of computing, we only focus on the reading and writing to the count matrices and . When sampling for each token , the memory access pattern to the count matrices can be summarized as follows:

[label=]

read and , for ;

write and ,
where is a set. The existing algorithms differ in two aspects: (1) Ordering of visiting tokens: SparseLDA, AliasLDA and LightLDA visit tokens documentbydocument, while F+LDA visits tokens wordbyword; and (2) Set : The set depends on the sparsity structure of the proposal distribution. For SparseLDA, is the set of nonzero topics of and ; for AliasLDA and F+LDA, is the set of nonzero topics of ; and for LightLDA, is the set of some samples from the proposal distribution.
We have two important observations, where the accesses can be made efficient if only or is accessed, while the accesses are inefficient if both and are accessed, as detailed below:
Accessing either or : Without loss of generality, assume only is accessed. There will be a lot of accesses to random entries of the matrix . However, these random accesses are sorted by , if the tokens are visited documentbydocument. For sampling the tokens in document , only one row is accessed. Therefore, the size of randomly accessed memory perdocument is only , which is the size of , and fits in the L3 cache. Moreover, the rows except need not to be stored in the memory and can be computed onthefly upon request, reducing the storage overhead. Symmetrically, if only is accessed, the accesses can be restricted in a vector by visiting the tokens wordbyword. The counts can be computed onthefly as well.
Accessing both and : Unfortunately, the accesses to and are coupled in all the aforementioned algorithms, i.e., they need to access both and when sampling a token. If the tokens are visited documentbydocument, the accesses to are not sorted by , and spread in the large matrix whose size is . If the tokens are visited wordbyword, the accesses to are not sorted by , and spread in the large matrix whose size is . Thus, all the existing algorithms have or size of randomly accessed memory, which is not efficient.
Table 2 summarizes the existing algorithms. Note that the amount of sequential accesses is smaller than or equals to the amount of random accesses for all the fast sampling algorithms. This justifies our previous argument that the running time of LDA is dominated by random accesses based on the wellknown fact that sequential accesses are faster than random accesses.
WarpLDA addresses the inefficient memory access problem by decoupling the accesses to and . Particularly, WarpLDA first visits all tokens but only accesses , and then visits all tokens again but only accesses . After that, we can choose the corresponding ordering of visiting the tokens so that the accesses to both matrices can be restricted in the current row and , without affecting the correctness of the algorithm.
4 WarpLDA
We now present WarpLDA, a novel MHbased algorithm that finds a maximum a posteriori
(MAP) estimate of LDA with an
pertoken sampling complexity and an size of randomly accessed memory perdocument (or word), by a designed reordering strategy to decouple the accesses to and .4.1 Notations
We first define some notations to make presentation clear. Specifically, we define a topic assignment matrix , where the topic assignment is put in the cell (See Fig. 1 for an illustration). Note that there might be multiple topic assignments in a single cell because a word can appear more than once in a document. Let be the collection of the topic assignments in the th row of , with being its th element. The ordering of the elements in can be arbitrary because LDA is a bagofwords model and ignores word ordering. This definition of is consistent with that in Sec. 2.1, where is the topic assignment of the th token of document . Again, let . Similarly, let be the collection of the topic assignments in the th column of , with size and being its th element. Note that is the term frequency of word , i.e., the number of times that appears in the entire corpus. Let .
Besides a topic assignment, WarpLDA has topic proposals for each token, denoted as (or ), where . Using the above definition, we have (or ) for a document and (or ) for the whole corpus.
It is worth noting that most of these notations are for the ease of presentation, and in the actual implementation we only store and the global count vector (Sec. 5). The other variables, including are either views of , or can be computed onthefly.
4.2 MCEM Algorithm of LDA
As analysed above, in the original CGS Eq. (1) and existing fast algorithms, it is difficult to decouple the access to and , because both counts need to be updated instantly after the sampling of every token. We develop our efficient WarpLDA based on a new MonteCarlo Expectation Maximization (MCEM) algorithm which is similar with CGS, but both counts are fixed until the sampling of all tokens are finished. This scheme allows us to develop a reordering strategy to decouple the accesses to and , and minimize the size of randomly accessed memory.
Specifically, WarpLDA seeks an MAP solution of the latent variables and , with the latent topic assignments integrated out:
where and are the Dirichlet hyperparameters. Asuncion et al. [3] have shown that this MAP solution is almost identical with the solution of CGS, with proper hyperparameters. Our empirical results in Sec. 6.3 also support this conclusion.
Computing directly is expensive because it needs to enumerate all the possible topic assignments for each token. We therefore optimize its lower bound as a surrogate. Let be a variational distribution. Then, by Jensen’s inequality, we get the lower bound :
(2) 
We then develop an Expectation Maximization (EM) algorithm [15] to find a local maximum of the posterior , where the Estep maximizes with respect to the variational distribution and the Mstep maximizes with respect the to model parameters , while keeping fixed. One can prove that the optimal solution at Estep is without further assumption on . We apply MonteCarlo approximation on the expectation in Eq. (2),
where We set the sample size and use as an abbreviation of .
Sampling : Each dimension of can be sampled independently,
(3) 
Optimizing : With the MonteCarlo approximation,
with the optimal solutions
(4) 
Instead of computing and storing and , we compute and store and to save memory because the latter are sparse. Plug Eq. (4) to Eq. (3), and let , we get the full MCEM algorithm, which iteratively performs the following two steps until a given iteration number is reached:

Estep: Sample , where
(5) 
Mstep: Compute and by .
Note the resemblance between Eq. (5) and Eq. (1) intuitively justifies why MCEM leads to similar results with CGS. The difference between MCEM and CGS is that MCEM updates the counts and after sampling all s, while CGS updates the counts instantly after sampling each . The strategy that MCEM updates the counts after sampling all s is called delayed count update, or simply delayed update. MCEM can be viewed as a CGS with delayed update, which has been widely used in existing algorithms [24, 1]. While previous work uses delayed update as a trick, we hereby present a theoretical guarantee to converge to a MAP solution. Delayed update is important for us to decouple the accesses of and to improve cache locality, without affecting the correctness, as will be explained in Sec. 4.4.
4.3 Sampling the Topic Assignment
A naive application of Eq. (5) is pertoken. We now develop an MH algorithm for faster sampling.
Specifically, starting from an initial state , we draw samples alternatively from one of the two proposal distributions:
(6) 
and update the current state with the acceptance rates:
(7) 
Similar as in Yuan et al.’s work [32], we can prove that this scheme converges to the correct stationary distribution in Eq. (5).
Computing the acceptance rates only involves constant number of arithmetic operations, so it is . The proposal distributions in Eq. (6) are mixture of multinomials mentioned in Section 2.2, with the mixing coefficient .^{3}^{3}3Due to the symmetry we only consider . There are two possible methods to draw samples from in time: (1) Alias sampling: Build a dim alias table for all the nonzero entries of ; and (2) Random positioning: Noticing that is the count of , randomly select a position , and return . Alias sampling is also used to draw samples from in . Because both sampling from proposal distributions and computing acceptance rates can be done in , the algorithm is pertoken.
4.4 Reordering the Computation
As discussed above, the size of randomly accessed memory perdocument or word is an important factor that influences the efficiency, but it is difficult to reduce for existing algorithms due to the coupling of the counts and . Thanks to the delayed update strategy in MCEM, we are able to decouple the access to and , and minimize the size of randomly accessed memory via a reordering strategy. Below, we explain how to do the reordering in the Estep and the Mstep in turn.
Estep: The Estep samples the topic assignment for each token, while keeping the counts and fixed. Consider the sampling of a single topic assignment with an MH algorithm. For simplicity, we only consider the document proposal . According to the MH Alg. 1, starting with the initial state , we do the following in the th step ():

Draw the topic proposals according to Eq. (6), where ;

Update the current state by the proposal , with the probability , where , according to Eq. (7).
Both and need to be accessed to sample in the above procedure. Following our analysis in Sec. 3.3, there are inevitable random accesses to matrices no matter whether we visit the tokens documentbydocument or wordbyword. Thanks to the delayed update strategy in MCEM, we can address this problem by a reordering strategy.
Delayed update makes , and fixed during the entire Estep. An important corollary is that the proposal distribution , which depends solely on and , is fixed during the Estep. Therefore, we can draw the topic proposals at any time within the Estep, without affecting the correctness. Particularly, we choose to draw the proposals for all tokens before computing any acceptance rate. With this particular ordering, the sampling of all the topic assignments can be done in two separate steps:

Draw the topic proposals for all tokens. This only accesses and .

Compute the acceptance rates and update the topic assignments for all tokens. This only accesses , and .
Since each step only accesses or , following the analysis in Sec. 3.3, we can make both steps memory efficient by carefully choosing the ordering of visiting tokens. In the first step, tokens are visited documentbydocument. Therefore, when processing the th document, only a small vector for the current document is randomly accessed. In the second step, the tokens are visited wordbyword, and when processing the th word, only a small vector is randomly accessed. Because the vectors are small enough to fit in the L3 cache, WarpLDA is memory efficient.
Word proposal: The word proposal can be treated similarly as the doc proposal , by drawing the topic proposals for all tokens before computing any acceptance rate. There are also two separate steps for the sampling of all topic assignments: (1) Draw the topic proposals for all tokens (accesses and ); and (2) Compute the acceptance rates and update the topic assignments for all tokens (accesses , , and ). These steps can be done efficiently by doing the first step wordbyword and doing the second step documentbydocument.
An WarpLDA iteration first samples the topic assignments for each token using the document proposal, and then samples again using the word proposal, which involves four passes of the tokens (two passes for each proposal). To improve the efficiency, these four passes can be compressed to two passes. The document phase visits tokens documentbydocument, and do the operations that require , which are computing the acceptance rates for the word proposal followed by drawing samples from the document proposal. Symmetrically, the word phase visits tokens wordbyword, computes the acceptance rates for the document proposal, and then draws samples from the word proposal.
Mstep: Up to now, we have talked about how to do the sampling of the topic assignments, i.e., the Estep of the MCEM algorithm. The Mstep, which updates the counts , and , need not to be conducted explicitly, because the counts can be computed onthefly. The only usage of the vector is when processing document in the document phase. Hence, it can be computed by when processing document , and discarded after the document is finished. Similarly, the only usage of the row vector is when processing word in the word phase, and it can be computed onthefly as well. Noticing , the count vector can be accumulated when computing for each document.
The above facts also imply that we even need not to store and , which simplifies the system design as we shall see in Sec. 5, and again, justifies there are no random accesses to matrices — we do not even store any of the matrices and .
5 System Implementation
WarpLDA not only improves the cache locality but also simplifies the distributed system design for training LDA on hundreds of machines. In previous systems for distributed LDA, including parameter servers (PS) [1, 21] and model parallel systems [28, 32], all workers collaboratively refine a globally shared count matrix . This adds additional complications to the system, such as read/write locks or delta threads. WarpLDA is arguably simpler to implement because its only globally shared object is a small vector which can be updated and broadcasted to each machine in every iteration, and all the other data are local so that they can be processed independently by each worker.
In this section, we discuss the design of the distributed sparse matrix framework for WarpLDA. We then present techniques to implement the framework in a memory efficient fashion. Some applicationlevel optimizations are also discussed.
5.1 Programming Model
We start the presentation of our framework for WarpLDA by abstracting its data structure. In WarpLDA, the only data to manipulate are: 1) local pertoken data, where each token is associated with integers, which are the topic assignment and the topic proposals ; and 2) global topic count vector . Noticeably, the count matrix is computed onthefly and not stored, as we have analyzed in Sec. 4.4.
The local pertoken data are stored in a matrix , where each token corresponds to an entry at the position with as its data. The matrix has the same structure with in Fig. 1, but is augmented in the sense that each entry stores both a topic assignment and proposals.
We now introduce the framework that we design for WarpLDA. The main data structure of our framework is the distributed sparse matrix of size . There are some entries in the sparse matrix, where each entry stores some data. To help readers relate our presentation with the actual implementation, we use C++ notations for array indexing, e.g., is the cell at the th row and th column of . Let as the collection of all entries in the th row, with being its th element. Similarly, define the columns , and . Let be the size of the collection , such as and . There might be multiple entries in a same cell . The structure of the matrix, i.e., the positions of the entries, is fixed, and only the data of the entries are iteratively refined to the solution.
There are several methods to manipulate the matrix (See Fig. 2):

AddEntry: add an entry to a given cell with the data data. This is called only at initialization time.

VisitByRow: For each row , update the data for each entry, given the data of all the entries in the current row , i.e., , where is a user defined function.

VisitByColumn: For each column , update the data for each entry, given the data of all the entries in the current column , i.e., , where is another user defined function.
Users initialize the matrix via the AddEntry method, and then call VisitByRow and VisitByColumn with user defined functions for a number of times. The data of the matrix will be refined towards the result.
For WarpLDA, the matrix stores the local pertoken data , where each row is a document, and each column is a word. One can verify that the topic assignments of the th document can be derived from the th row , and . Similarly, can be derived from . To use the framework, we firstly initialize the matrix by adding an entry to the cell for each token. The document phase can be implemented with a single VisitByRow, where for each row (document), is calculated with the topic assignments in ; then the topic assignments in are updated given the topic proposals and ; and finally new proposals are created given new topic assignments. ^{4}^{4}4Refer to Alg. 2 in the Appendix for details. Similarly, the word phase can be implemented by VisitByColumn. The globally shared vector can be updated by a simple “reduce” operation, which aggregates the wordtopic count vectors , in the user defined function . Since updating can be implemented by the user, our framework does not handle the global count vector but only .
A basic implementation of this framework is MapReduce. Each entry is represented by a pair, and VisitByRow is implemented by two steps: 1) take the entries , aggregate them by row, and emit the rows ; 2) take the rows (documents) , do the update and emit the individual entries . VisitByColumn is implemented by two similar steps, but aggregate the entries by column. This implementation is useful for industrial users who want to build a simple distributed LDA on top of the existing MapReduce framework. However, the shuffling overhead of MapReduce is often too large and the performance might be unsatisfactory. Hence, we present a dedicated implementation.
5.2 Data Layout
The first problem we need to address is how the matrix is represented internally. VisitByRow (or VisitByColumn) requires to provide (or ) to the user defined functions, while both and should be accessed efficiently, and the change of either should reflect to the underlying matrix.
There are a number of formats for a sparse matrix. Two most wellknown examples are the Compressed Sparse Row (CSR) format and the Compressed Sparse Column (CSC) format. In the CSR representation , the rows are stored continuously, and the accesses to each row is sequential. Similarly, in the CSC representation , the accesses to each column is sequential.
One possible data layout is storing both and . After each VisitByRow or VisitByColumn operation, perform a “transpose” operation to synchronize the change of one representation to the other representation. By storing both and , the user defined operation always has sequential accesses to the data and hence is memory efficient. However, the transpose operation requires an extra pass of data which is expensive.
In WarpLDA, we avoid explicitly transposing the matrix by only storing . For accessing the rows of the matrix, we store , which are pointers to the entries in . Since only one copy of the data is stored, there is no need for explicitly transposing the matrix. However, the rows are accessed by indirect memory accesses, which are not sequential. We now show these indirect memory accesses are still memory efficient because the cache lines are fully utilized.
For each column , we sort the entries by their row id. Then, while in VisitByRow, the entries of a particular column are always accessed sequentially, i.e., is always accessed before for all . When an entry is accessed, the cache line containing it is fetched to the cache, which also contains the next few entries . As long as the cache is large enough to store one cache line for each column, the cache lines can stay in the cache until all the entries on it are accessed. Thus, the cache lines are fully utilized. Moreover, the size of the columns follows the powerlaw, because they are termfrequencies of words in natural corpora [19]. Therefore, the required cache size can be even smaller, comparable to the number of columns which have most entries of the sparse matrix. For example, in the ClueWeb12 corpus where the vocabulary size (number of columns) is 1,000,000, the first 10,000 words (columns) attributes to 80% of the entries, and storing a cache line for these words (columns) requires only .
5.3 Scaling out
WarpLDA can be scaled out to hundreds of machines to meet the requirements of learning large models on massivescale corpora. We now present the key components for scaling out WarpLDA, including task partitioning, data placement and communication.
5.3.1 Multithreading and NUMA
We address the data race and NUMA issues which are often encountered in multi threaded environments.
WarpLDA is embarrassingly parallel because the workers operate on disjoint sets of data. To parallelize WarpLDA we only need to invoke the user defined functions in parallel for different rows and columns. In contrast, traditional frameworks such as Yahoo!LDA [1] and LightLDA [32] need to update the count matrix in parallel, and require extra treatments such as read/write locks and delta threads.
Modern computers have nonuniform memory access (NUMA), where each main memory DIMM belongs to a specific CPU socket. If one CPU needs to access data in the memory belongs to another CPU socket, the data flows through another CPU socket, resulting in degraded performance. For better performance, WarpLDA partitions the data by column and the points by row, and bind each slice to a different CPU socket. Both the VisitByColumn and the visiting of the pointers in access only the local data, but the indirect memory accesses in VisitByRow may flow through other CPUs.
5.3.2 Finegrained Distributed Computing
Distributed computation typically involves partitioning the task and addressing the communications between workers. We take a twolevel partitioning strategy to overlap the computation and communication, and propose a greedy approach for balanced partitioning on the challenging case where the length of columns are distributed as powerlaw. The communications are implemented with MPI_Ialltoall in Message Passing Interface (MPI).
Each VisitByRow and VisitByColumn involves a pass of the matrix . To assign the task to different workers, we split the data matrix as partitions, where is the number of MPI workers, and is the th partition. In VisitByRow, the data is partitioned by row, i.e., worker has partitions in its memory; in VisitByColumn, the data is partitioned by column, i.e., worker has partitions in its memory. Exchange of data happens only when the adjacent two operations are different, i.e., one is VisitByRow and the other is VisitByColumn, in which case the partitions are sent to their corresponding workers of the next operation after the current operation is finished. Because MPI requires both sending buffer and receiving buffer, we maintain a computing copy and receiving copy of the data. For example, in Fig. 3, the current operation is VisitByRow, which can be conducted independently on each worker without any communications, due to our partitioning strategy. After the current VisitByRow is finished, we need to partition the matrix by column for the next VisitByColumn, so we send partition from worker to worker , with MPI_Ialltoall. If the next operation is VisitByRow instead, no communication is needed.
We can overlap the communication and computation by further divide each partition as blocks, where is a small constant. During training, each of the blocks may be in one of the four states: 1) not started; 2) computing; 3) sending; 4) finished. In each VisitByRow / VisitByColumn, workers scan their blocks by row or column, and each block can be immediately sent to the receiving copy after it is finished.
To minimize the wait time, the number of tokens within each worker should be roughly the same, which implies that the sum of the number of tokens of all partitions within each row or column should be roughly the same. Note that the rows / columns can be treated independently: we partition the rows as disjoint sets and the columns as disjoint sets , so that each set has roughly the same number of tokens. Given the partitioning of rows and columns, the th partition .
Balanced partitioning of the columns is challenging because the term frequencies of words from a natural corpus, i.e., , can be very imbalanced because of the powerlaw [19]. For example, the most frequent word in the ClueWeb12 corpus occupies 0.257% of all occurrences, after the removal of stop words. Considering each slice may only have 1% of the tokens if there are 100 slices, this is a very large proportion, and random partitioning can be highly imbalanced in this case. We propose a greedy algorithm for balanced partitioning. First, all the words are sorted by their frequency in a decreasing order. Then from the most frequent word, we put each word (i.e., column) in the partition with the least number of total tokens. Because there are many low frequency words (the long tail), this algorithm can produce very balanced results. We compared the imbalance index of our greedy algorithm with two randomized algorithms: static first random shuffle the words, then partition so that each partition has equal number of words; dynamic allows each partition to have different number of words, but each slice is continuous. Imbalance index is defined as
In the ideal case the imbalance index should be zero. Fig. 4 shows the experimental results on the ClueWeb12 corpus, where we can see that the greedy algorithm is much better than both randomized algorithms. The imbalance index of the greedy algorithm grows dramatically when the number of machines reach a few hundreds, because the size of the largest column is so large that it is impossible to partition the columns in balance.
5.4 Application Level Optimizations
Besides system level optimizations, there are also some application level optimizations (i.e., optimization for the user defined function) for even better performance of WarpLDA. We describe them in this subsection.
Sparse vectors as hash tables: When is large, it is likely to have and . It is more effective to use hash tables rather than dense arrays for the counts and , because the size of a hash table is much smaller than that of a dense array, thus the cost of clearing the counts is smaller, and the size of randomly accessed memory is smaller. We choose an open addressing hash table with linear probing for hash collisions. The hash function is a simple and function, and the capacity is set to the minimum power of 2 that is larger than or . We find that the hash table is almost as fast as a dense array even when . Although LightLDA [32] also uses hash tables to store the counts, it is mainly for reducing the storage overhead instead of improving cache locality, because its size of randomly accessed memory is too large for the cache anyway.
Intraword parallelism: For the most frequent words in the corpus, , i.e., the term frequency can be extremely large (e.g., tens of millions), so . It is desirable to exploit intraword parallelism in this case. Making all the threads working at the same column is beneficial for better cache locality, because only the for one word needs to be stored in the cache. Moreover, it helps balancing the load between workers in the case of is too large. We choose to exploit intraword parallelism for words which . As the user defined function is itself parallel in this case, we let the framework only invoke one user defined function at a time.
To parallelize the user defined function, we compute the count on each thread, aggregate the results, and construct the alias table by a concurrent vector. Updating the topic assignments is embarrassingly parallel.
6 Experiments
Dataset  

NYTimes  300K  100M  102K  332 
PubMed  8.2M  738M  141K  90 
ClueWeb12 (subset)  38M  14B  1M  367 
ClueWeb12  639M  236B  1M  378 
We now present empirical studies of WarpLDA, by comparing it with two strong baselines LightLDA [32] and F+LDA [31]. LightLDA is the fastest MH based algorithm, and F+LDA is the fastest sparsityaware algorithm. We compare with them in both time efficiency and the quality of convergence.
). Each column corresponds to an evaluation metric (Please see text for details). The xaxis of column 3 and 4 are distorted for better resolution.
6.1 Datasets and Setups
Table 3
summarizes the datasets. NYTimes and PubMed are standard datasets from the UCI machine learning repository
[2]; they consist of news articles and biomedical literature abstracts, respectively. ClueWeb12 is a large crawl of web pages.^{5}^{5}5http://www.lemurproject.org/clueweb12.php/ While NYTimes and PubMed are already tokenized, we extract text from ClueWeb12 using JSoup, remove everything except alphabets and digits, convert letters to lower case, tokenize the text by space and remove stop words. ClueWeb12 (subset) is a random subset of the full ClueWeb12.The experiments are conducted on the Tianhe2 supercomputer. Each node is equipped with two Xeon E52692v2 CPUs ( 2.2GHz cores), and 64GB memory. Nodes are connected with InfiniBand, and single machine experiments are done with one node.
6.2 Speed of Convergence
We first analyze the convergence behaviors. Fig. 5 presents the singlemachine results on the moderatesized corpora, including NYTimes (first two rows) and PubMed (last two rows). We compare WarpLDA with a fixed to both LightLDA and F+LDA. Each algorithm is run for a fixed number of iterations. LightLDA is sensitive with because it affects the locality; We therefore pick up the that leads to fastest convergence over time for LightLDA, which is for the NYTimes dataset when , for NYTimes when , for PubMed when , and for PubMed when . A larger leads to longer running time per iteration, but the total number of iterations decreases in order to converge to a particular loglikelihood.
To have a full understanding, a diverse range of evaluation metrics are considered, including loglikelihood w.r.t the number of iterations (1st column), loglikelihood w.r.t running time (2nd column), the ratio of the iteration number of LightLDA (or F+LDA) over that of WarpLDA to get a particular loglikelihood (3rd column), the ratio of running time of LightLDA (or F+LDA) over that of WarpLDA to get a particular loglikelihood (4th column), and finally the throughput w.r.t the number of iterations (5th column). From the results, we have the following observations:

WarpLDA converges to the same loglikelihood as other baselines (1st and 2nd columns), demonstrating the good quality;

WarpLDA converges faster than F+LDA and LightLDA in terms of running time (2nd column), despite that it needs more iterations than the competitorsm to reach the same likelihood (1st column). Overall, WarpLDA is consistently 515x faster than LightLDA in all evaluations, and is faster than F+LDA when ;
Note that the convergence speed of F+LDA surpasses WarpLDA in later stages for the PubMed dataset when . This is mainly due to the difference between sparsity aware algorithms and MHbased algorithms. As and are increasing, the vector becomes less concentrated so that MHbased algorithms require more samples from to explore the state space, while the time complexity of the (exact) sparsity aware algorithms depends only on which is upper bounded by .
Setting  LightLDA  F+LDA  WarpLDA 

NYTimes,  33%  77%  17% 
NYTimes,  35%  53%  13% 
PubMed,  38%  57%  13% 
PubMed,  37%  17%  5% 
To show that WarpLDA is memory efficient, we compare the L3 cache miss rate of WarpLDA with that of LightLDA and F+LDA on the NYTimes and PubMed corpora. The cache miss rate is measured by PAPI. is set to 1 for both WarpLDA and LightLDA. From Table 4, we can see that the L3 cache miss rate of WarpLDA is much lower than that of the competitors. This is reasonable because the size of randomly accessed memory perdocument for WarpLDA fits in the L3 cache, while the competitors require to randomly access a large matrix with tensofgigabytes in size.
For the distributed setting, WarpLDA () and LightLDA () are compared on a 38milliondocument ClueWeb12 (subset). Both algorithms are run on 32 machines. Fig. 8 presents the results. We can see that WarpLDA is about 10x faster than LightLDA to reach the same loglikelihood.
6.3 Quality of Solutions
Now we carefully analyze the quality of the solutions by WarpLDA, and show that the MCEM solution of WarpLDA is very similar with the CGS solution of LightLDA. Comparing the updates of LightLDA [32] and WarpLDA, we conclude that the differences that may affect the quality of the solutions are: (1) delayed count update: the counts and are updated instantly in LightLDA but delayed in WarpLDA, (2) : For LightLDA and for WarpLDA . Therefore, we vary these factors to gain insight on how each factor influences the convergence rate. We use the NYTimes corpus as an example and set and . The algorithms in comparison are

LightLDA: LightLDA with , in which is updated instantly, and is updated every 300 documents (almost instantly).

LightLDA+DW: LightLDA, in which is updated instantly, and is updated per iteration.

LightLDA+DW+DD: LightLDA, in which both and are updated per iteration.

LightLDA+DW+DD+SP: LightLDA, in which both and are updated per iteration, using WarpLDA’s .

WarpLDA: WarpLDA, in which both and are updated per iteration, using WarpLDA’s .
Fig. 8 shows the results, where all the algorithms require roughly the same number of iterations to converge to a particular loglikelihood, with the same . This result shows that the delayed update and simple proposal distributions of WarpLDA do not affect the convergence much. Notice that the previous results in Fig. 5, where WarpLDA requires more iterations than LightLDA to converge to a particular loglikelihood, may look different from this result. This is because LightLDA uses a larger in Fig. 5.
Our observation that delayed update does not affect convergence much seems to contradict with the previous observations for exact sampling algorithms [1]. A possible reason is that for MHbased algorithms (e.g., WarpLDA) the bottleneck of convergence is the efficiency of exploring instead of the freshness of the counts; and thereby delayed update on the counts does not matter a lot.
We also analyze the impact of on the solution quality in Fig. 8. We can see that as gets larger WarpLDA converges faster. This is probably because of the bias induced by the finitelength MH chain. To keep the storage overhead small, we stick to a small such as 1, 2 or 4, which leads to a sufficiently fast convergence.
6.4 Scalability Results
Finally, we demonstrate that WarpLDA can scale up to handle billionscale documents on hundreds of machines.
Fig. 9(a) shows the multithreading speedup result for WarpLDA with and on the NYTimes dataset. The throughput for a single core, a single CPU (12 cores), and 2 CPUs (24 cores) are 6M, 53M, and 104M tokens per second, respectively. The speedup of the 24core version against the singlecore version is 17x, which is good for such a memory intensive task. The 2CPU (24 cores) version is faster than the single CPU (12 cores) version by 1.96x, indicating that our NUMA strategy is successful.
Fig. 9(b) shows the multimachine speedup result for WarpLDA with and on the PubMed corpus. The throughput for 16 machines is 13.5x faster than the single machine version, demonstrating the good scalability.
To show our capacity of learning largescale topic models, we learned topics on the 639milliondocument ClueWeb12 corpus, on 256 machines. The hyperparameter is set to 0.001 for finer grained topics, is set to 1, and the number of iterations is set to 900. The convergence results are shown in Fig. 9(c). We can see that the run produces meaningful results in 5 hours.^{6}^{6}6The learned 1 million topics are available at http://ml.cs.tsinghua.edu.cn/~jianfei/warplda.html. Fig. 9(d) shows the throughput is an unprecedentedly 11G tokens/s with 256 machines.
7 Conclusions and Future Work
We first analyze the memory efficiency of previous fast algorithms for LDA by the size of randomly accessed memory perdocument, and conclude they are inefficient because of frequent random accesses to large matrices. We then propose WarpLDA, an efficient algorithm for LDA which only requires to randomly access vectors that fit in the L3 cache, while maintaining the time complexity. WarpLDA builds on an MCEM algorithm that enables delayed updates to decouple the accesses to and , by some carefully designed ordering of visiting tokens. To implement WarpLDA in a memory efficient and scalable way, we design and implement a framework that supports manipulating the rows and columns of a distributed sparse matrix.
Extensive empirical studies in a wide range of testing conditions demonstrate that WarpLDA is consistently 515x faster than the stateoftheart MHbased algorithms, and is faster than stateoftheart sparsityaware algorithms in most settings. WarpLDA achieves an unprecedentedly 11G token per second throughput which allows to train billionscale corpora in hours.
In the future, we plan to combine WarpLDA with other promising directions of scaling up LDA and apply it to more sophisticated topic models to learn various types of topic structures.
Stochastic learning:
Stochastic algorithms explore the statistical redundancy of a given corpus, and estimate the statistics (e.g., gradient) of the whole corpus by a random subset. When the estimation has low variance, faster convergence is expected. Examples include stochastic variational inference (SVI)
[17], streaming variational Bayes [9], and stochastic gradient Riemann Langevin dynamics (SGRLD) [25]. These methods can be combined with fast sampling algorithms, e.g. WarpLDA, to create fast stochastic algorithms, e.g., Bhadury et al. combined SGRLD with LightLDA to scale up dynamic topic models [4].GPU accelerations: There are some works on GPU acceleration for LDA [10, 33]. However, the algorithm they accelerate is . WarpLDA is a promising option for GPU acceleration due to its single instruction multiple data (SIMD) nature. ^{7}^{7}7We do not mention this in the main text, but readers can verify that the tokens in one document / word can be processed simultaneously without data race.
Nonconjugate topic models: Compared to the vanilla LDA, nonconjugate topic models capture richer types of statistical structures, e.g., correlation of topics [5, 12], temporal dynamics of topics [6, 4], or relationship to labels [37, 36]. These models typically have less sparsity structures to exploit, making sparsityaware algorithms difficult to apply. We can still apply the MHbased WarpLDA to these models as a fast sampler for topic assignments.
8 Acknowledgments
We thank Jinghui Yuan for the help on the experiments. The work was supported by the National Basic Research Program (973 Program) of China (Nos. 2013CB329403, 2012CB316301), National NSF of China (Nos. 61322308, 61332007), Tsinghua TNList Lab Big Data Initiative, and Tsinghua Initiative Scientific Research Program (No. 20141080934).
References
 [1] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J. Smola. Scalable inference in latent variable models. In WSDM, 2012.
 [2] A. Asuncion and D. Newman. Uci machine learning repository, 2007.
 [3] A. Asuncion, M. Welling, P. Smyth, and Y. W. Teh. On smoothing and inference for topic models. In UAI, 2009.
 [4] A. Bhadury, J. Chen, J. Zhu, and S. Liu. Scaling up dynamic topic models. In WWW, 2016.
 [5] D. Blei and J. Lafferty. Correlated topic models. In NIPS, 2006.
 [6] D. M. Blei and J. D. Lafferty. Dynamic topic models. In ICML, 2006.
 [7] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. JMLR, 3:993–1022, 2003.
 [8] J. L. BoydGraber, D. M. Blei, and X. Zhu. A topic model for word sense disambiguation. In EMNLPCoNLL, 2007.
 [9] T. Broderick, N. Boyd, A. Wibisono, A. C. Wilson, and M. I. Jordan. Streaming variational bayes. In NIPS, 2013.
 [10] J. Canny and H. Zhao. Bidmach: Largescale learning with zero memory allocation. In BigLearn Workshop, NIPS, 2013.
 [11] J. Chang and D. Blei. Relational topic models for document networks. In AISTATS, 2009.
 [12] J. Chen, J. Zhu, Z. Wang, X. Zheng, and B. Zhang. Scalable inference for logisticnormal topic models. In NIPS, 2013.
 [13] N. Chen, J. Zhu, F. Xia, and B. Zhang. Discriminative relational topic models. IEEE Trans. on PAMI, 37(5):973–986, 2015.
 [14] W.Y. Chen, J.C. Chu, J. Luan, H. Bai, Y. Wang, and E. Y. Chang. Collaborative filtering for orkut communities: discovery of user latent behavior. In WWW, 2009.
 [15] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society. Series B (methodological), pages 1–38, 1977.
 [16] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235, 2004.
 [17] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. JMLR, 14(1):1303–1347, 2013.
 [18] T. Iwata, T. Yamada, and N. Ueda. Probabilistic latent semantic visualization: topic model for visualizing documents. In KDD, 2008.
 [19] Z. G. Kingsley. Selective studies and the principle of relative frequency in language, 1932.
 [20] A. Q. Li, A. Ahmed, S. Ravi, and A. J. Smola. Reducing the sampling complexity of topic models. In KDD, 2014.
 [21] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.Y. Su. Scaling distributed machine learning with the parameter server. In OSDI, 2014.
 [22] S. Liu, X. Wang, J. Chen, J. Zhu, and B. Guo. Topicpanorama: a full picture of relevant topics. In Proceedings of IEEE VAST, 2014.
 [23] T. Minka and J. Lafferty. Expectationpropagation for the generative aspect model. In UAI, 2002.
 [24] D. Newman, P. Smyth, M. Welling, and A. U. Asuncion. Distributed inference for latent dirichlet allocation. In NIPS, 2007.
 [25] S. Patterson and Y. W. Teh. Stochastic gradient riemannian langevin dynamics on the probability simplex. In NIPS, 2013.

[26]
Y. W. Teh, D. Newman, and M. Welling.
A collapsed variational bayesian inference algorithm for latent dirichlet allocation.
In NIPS, 2006. 
[27]
A. J. Walker.
An efficient method for generating discrete random variables with general distributions.
ACM Transactions on Mathematical Software (TOMS), 3(3):253–256, 1977.  [28] Y. Wang, X. Zhao, Z. Sun, H. Yan, L. Wang, Z. Jin, L. Wang, Y. Gao, J. Zeng, Q. Yang, et al. Towards topic modeling for big data. ACM Transactions on Intelligent Systems and Technology, 2014.
 [29] X. Wei and W. B. Croft. Ldabased document models for adhoc retrieval. In SIGIR. ACM, 2006.
 [30] L. Yao, D. Mimno, and A. McCallum. Efficient methods for topic model inference on streaming document collections. In KDD, 2009.
 [31] H.F. Yu, C.J. Hsieh, H. Yun, S. Vishwanathan, and I. S. Dhillon. A scalable asynchronous distributed algorithm for topic modeling. In WWW, 2015.
 [32] J. Yuan, F. Gao, Q. Ho, W. Dai, J. Wei, X. Zheng, E. P. Xing, T.Y. Liu, and W.Y. Ma. Lightlda: Big topic models on modest compute clusters. In WWW, 2015.
 [33] H. Zhao, B. Jiang, and J. Canny. Same but different: Fast and highquality gibbs parameter estimation. arXiv:1409.5402, 2014.
 [34] X. Zheng, Y. Yu, and E. P. Xing. Linear time samplers for supervised topic models using compositional proposals. In KDD, 2015.
 [35] J. Zhu, A. Ahmed, and E. P. Xing. MedLDA: Maximum margin supervised topic models. JMLR, 13:2237–2278, 2012.
 [36] J. Zhu, X. Zheng, L. Zhou, and B. Zhang. Bayesian logistic supervised topic models with data augmentation. In ACL, 2013.
 [37] J. Zhu, X. Zheng, L. Zhou, and B. Zhang. Scalable inference in maxmargin topic models. In KDD, 2013.
Appendix A Full code of WarpLDA
Alg. 2 is the full pseudocode of WarpLDA, with random positioning for sampling the document proposal and alias sampling for sampling the word proposal.