1 Introduction
Topic models provide a suite of widely adopted statistical tools for feature extraction and dimensionality reduction for bagofwords (
i.e., discrete count) data, such as text documents and images in a bagofwords format cao2007spatially . Given an input corpus, topic models automatically extract a number of latent topics, which are unigram distributions over the words in a given vocabulary. The highprobability words in each topic are semantically correlated. Latent Dirichlet Allocation
blei2003latent(LDA) is the most popular of topic models due to its simplicity, and has been deployed as a key component in data visualization
iwata2008probabilistic , text analysis boyd2007topic ; zhu2009medlda cao2007spatially , network analysis Chang:RTM09 , and recommendation systems chen2009collaborative .In practice, it is not uncommon to encounter largescale datasets, e.g., text analysis typically consists of hundreds of millions of documents yuan2014lightlda , and recommendation systems need to tackle hundreds of millions of users ahmed2012scalable . Furthermore, as the scale of the datasets increases, the model size needs to be increased as well — we need a larger number of topics in order to exploit the richer semantic structure underlying the data. A reasonable design goal for modern topic modeling systems is thousands of topics, to have a good coverage for both industry scale applications wang2014peacock and researching boyd2007topic ; iwata2008probabilistic ; cao2007spatially .
However, it is highly challenging to efficiently train large LDA models. The time complexity of training LDA is high because it involves iteratively scanning the input corpus for many times (e.g., 100), and the time complexity of processing each token is not constant but related to the number of topics.
Implementation  

Yan et al. yan2009parallel  300K  128  100K  100M 
BIDMach zhao2015same  300K  256  100K  100M 
Steele and Tristan tristan2015efficient  50K  20  40K  3M 
SaberLDA  19.4M  10K  100K  7.1B 
To train LDA in acceptable time, CPU clusters are often used. However, due to the limited memory bandwidth and low computational power of CPUs, large clusters are typically required to learn large topic models ahmed2012scalable ; yuan2014lightlda ; yu2015scalable . For example, a 32machine cluster is used to learn 1,000 topics from a 1.5billiontoken corpus.
A promising alternative is to train LDA with graphics processing units (GPUs), leveraging their high computational power and memory bandwidth. Along this line, there have been a number of previous attempts. For example, Yan et al. yan2009parallel implement the collapsed Gibbs sampling algorithm, BIDMach zhao2015same implements the variational Bayes algorithm as well as a modified Gibbs sampling algorithm, and Steele and Tristan tristan2015efficient
propose a expectationmaximization algorithm. These GPUbased systems are reported to achieve superior performance than CPUbased systems
yan2009parallel ; zhao2015same ; tristan2015efficient .Unfortunately, current GPUbased LDA systems can only learn a few hundred topics (See Table 1), which may not be sufficient to capture the rich semantic structure underlying the large datasets in industry scale applications wang2014peacock . It is fundamentally difficult for these systems to learn more topics because they use algorithms on dense data structures whose time and space complexity is linear to the number of topics.
To address this problem, we propose SaberLDA, a novel GPUbased system that adopts a sparsityaware algorithm for LDA. Sparsity aware algorithms are based on the insight that a single document is not likely to have many topics, and are able to achieve sublinear (or even amortized constant) time complexity with respect to the number of topics. Representative examples include AliasLDA li2014reducing , F+LDA yu2015scalable , LightLDA yuan2014lightlda , WarpLDA chen2016warplda and ESCA zaheer2015exponential , which are implemented in general purpose CPU systems. Therefore, the running time is not sensitive with the number of topics.
However, it is considerably more challenging to design and implement sparsityaware algorithms on GPUs than on CPUs. Comparing with CPUs, GPUs have much larger number of threads, much smaller per thread cache size and longer cache lines, which makes it much more difficult to use caches to mitigate the random access problems introduced by sparsity. The branch divergence issues of GPUs suggest we should fully vectorize the code. But for sparse data structures, loop length is no longer fixed and the data are not aligned, which indicates that straightforward vectorization is not feasible. Finally, the limited GPU memory capacity requires streaming input data and model data, which add another dimension of complexity for data partition and layout, to enable parallelism, good locality and efficient sparse matrix updates all together.
SaberLDA addresses all these challenges for supporting sparsityaware algorithms on GPU, our key technical contribution includes:

A novel hybrid data layout “partitionbydocument and orderbyword” (PDOW) that simultaneously maximizes the locality and reduces the GPU memory consumption;

A warpbased sampling kernel that is fully vectorized, and is equipped with a arysampling tree that supports both efficient construction and sampling;

An efficient “shuffle and segmented count” (SSC) algorithm for updating sparse count matrices.
Our experimental results demonstrate that SaberLDA is able to train LDA models with up to 10,000 topics, which is more than an order of magnitude larger than previous GPUbased systems yan2009parallel ; zhao2015same ; tristan2015efficient , where the throughput only decreases by 17% when the number of topics increases from 1,000 to 10,000. SaberLDA is also highly efficient, under various topic settings, SaberLDA converges 5 times faster than previous GPUbased systems and 4 times faster than CPUbased systems. With a single card, SaberLDA is able to learn 10,000 topics from a dataset of billions of tokens, which is only achievable with clusters with tens of machines before yu2015scalable .
2 Latent Dirichlet Allocation
In this section, we introduce the latent Dirichlet allocation (LDA) model and its sampling algorithm.
2.1 Definition
LDA is a hierarchical Bayesian model that learns latent topics from text corpora blei2003latent . For an input text corpus, the scale of the learning task is determined by the following four numbers:

: the number of documents in the given corpus;

: the number of tokens, i.e., words, appearing in all the documents;

: the number of unique words in the corpus, also known as vocabulary size;

: the number of topics,
where , and are determined by the corpus, and is a parameter that users can specify.
The text corpus is represented as a token list , where each occurrence of word in document is called a token, and represented by a triplet . While and are specified by the corpus, training LDA involves assigning a topic assignment for each token .
After the token list is given, we construct two count matrices as below:

The documenttopic count matrix is , where is the number of tokens with .

The wordtopic count matrix is , where is the number of tokens with ,
where the matrix is often sparse, i.e., has many zero elements (See below for an example).
We omit the mathematical details of LDA and refer the interested readers to standard LDA literature blei2003latent ; griffiths2004finding . Informally, the LDA model is designed in a way that maximizes some objective function (the likelihood) related to the topic assignments . A document can be characterized by the th row of the documenttopic matrix , and a topic can be characterized by the th column of the wordtopic matrix . Figure 1 is a concrete example, where there are documents, tokens, words in the vocabulary, and topics. Each token is assigned with a topic assignment from 1 to 3, indicated as the superscript in the figure. is the documenttopic matrix, e.g., because both tokens in document 1 is assigned to topic 3. is the wordtopic matrix, and its each column characterizes what the corresponding topic is about, e.g., the first topic has the word “apple” and “iPhone”, and is about the device iPhone; the second topic is about fruits, and the third topic is about mobile OS. Likewise, the rows of characterizes what documents are about, e.g., the first document is about mobile OS, the second document is about iPhone and mobile OS, and the last document is about fruits. Note that the matrix is sparse, because a document is not likely to be relevant with all the topics at the same time.
2.2 Inference
Given the token list , our goal is to infer the topic assignment for each token . Many algorithms exist for this purpose, such as variational inference blei2003latent
, Markov chain MonteCarlo
griffiths2004finding , and expectation maximization zaheer2015exponential ; chen2016warplda . We choose the ESCA algorithm zaheer2015exponential to implement on GPU for its following advantages
It is sparsityaware, so the time complexity is sublinear with respect to the number of topics. This property is critical to support the efficient training of large models;

It enjoys the best degree of parallelism because the count matrices and only need to be updated once per iteration. This matches with the massively parallel nature of GPU to achieve high performance.
The ESCA algorithm alternatively performs the Estep and the Mstep for a given number of iterations (e.g., 100 iterations), where the Estep updates given the counts , and the M step updates given :

Estep: for each token , sample the topic assignment :
(1) where means “proportional to” and the wordtopic probability matrix is a normalized version of the count matrix in the sense that is roughly proportional to , but each of its column sums up to 1. is computed as follows:
(2) 
Mstep: update the counts and , and calculate .
Here, and are two user specified parameters that control the granularity of topics. Large and values mean that we want to discover a few general topics, while small and values mean that we want to discover many specific topics.
The updates Eq. (1, 2) can be understood intuitively. A token is likely to have the topic assignment if both and are large, i.e., a lot of tokens in document are topic and a lot of tokens of word are topic . For example, if we wish to update the topic assignment of the “apple” in document 3, it is more likely to be topic 2 rather than topic 1, because the other token “orange” in the same document is assigned to topic 2.
2.3 Sampling from a multinomial distribution
The core of the above algorithm is sampling according to Eq. (1). To help readers understand the sampling procedure, we begin with a vanilla sampling algorithm, and then proceed to a more complicated sparsityaware algorithm. Sampling according to Eq. (1) can be viewed as throwing a needle onto the ground, and report the number of the region where the needle falls in, where the area of region is the probability (Figure 2).This can be implemented with three steps:

For , compute the probabilities and their sum ;

Generate a random number ;

Compute the prefix sum , where , and return the first such that , with a binary search.
We refer the above procedure as the vanilla algorithm, whose time complexity is limited by step 1 and step 3, which are . Step 3 is a very important routine which we will use over and over again, and we refer its result as “the position of in the prefix sum array of ” for brief.
While the vanilla algorithm is , the sparsityaware algorithms li2014reducing ; zaheer2015exponential utilize the sparsity of , and improve the time complexity to , where is the average number of nonzero entries per row of . The algorithm decomposes the original sampling problem as two easier sampling subproblems. For sampling each token, it returns the result of a random subproblem, where the probability of choosing the first subproblem is , where and . The two subproblems are defined as follows:
Problem 1
Sample .
This can be sampled with the vanilla algorithm we described before. But sparsity can be utilized: if , then as well. Therefore, we only need to consider the indexes where . There are only such indexes on average, therefore, the time complexity is only instead of . Similarly, can be computed in .
Problem 2
Sample .
This problem is only relevant with but not . We can preprocess for each . There are various approaches for preprocessing which we will cover in detail in Sec. 3.2. In brief, we construct a tree for each , and then each sample can be obtained with the tree in . The preprocessing step is not the bottleneck because it is done only once per iteration.
2.4 PseudoCode
To make the presentation more concrete, we present the pseudocode of the ESCA algorithm as Alg. 1, which is a Bulk Synchronous Parallel (BSP) programming model. In the Estep of each iteration, all the tokens in the token list are updated independently, by calling Sample for each token. All arguments of the function Sample are readonly, and the return value is the new topic assignment . In the Mstep, the matrices and are calculated from the token list , by functions CountByDZ and CountByVZ. Then, is updated by the function Preprocess following Eq. (2). Finally, generate the sampling trees and sums for each word .
As shown in Alg. 2, the function Sample samples the topic assignment given the rows , , the sum , and the tree , by implementing the sparsityaware algorithm described in Sec. 2.3. First, we compute the probability for Problem 1, as well as , where is the elementwise product. Next, we flip a coin with the probability of head being . If the coin is head, perform sampling from , which involves finding the location of a random number in the prefixsum array of the sparse vector as discussed in Sec. 2.3. Otherwise, sample from with the tree , which is preprocessed.
3 Design and Implementation
In this section we present SaberLDA, a high performance GPUbased system for training LDA. The design goals of SaberLDA are:

supporting large models up to 10,000 topics;

supporting large dataset of billions of tokens;

providing comparable performance with a moderate size CPU cluster using a single GPU card.
It is difficult for previous GPUbased systems yan2009parallel ; zhao2015same ; tristan2015efficient to satisfy all these goals, because all of them are based on the vanilla algorithm mentioned in Sec. 2.3 which have time complexity, i.e., the training time will be 100 times longer if the number of topics increase from hundreds (current) to 10,000 (goal), which is not acceptable.
To address this problem, SaberLDA adopts the sparsityaware sampling algorithm whose time complexity is , which is not very sensitive to the number of topics. However, exploiting sparsity is highly challenging on GPUs. The memory access is no longer sequential as that for the dense case, and the small cache size and long cache line of GPU aggravate this problem. Therefore, the memory accesses need to be organized very carefully for good locality. Moreover, vectorization is not as straightforward as the dense case since the loop length is no longer fixed and the data are not aligned, so waiting and unconcealed memory accesses can happen. Finally, updating the sparse count matrix is more complicated than dense count matrices. We now present our design of SaberLDA which addresses all the challenges.
3.1 Streaming Workflow
We first consider how to implement Alg. 1, deferring the details of the functions Sample, CountByDZ, CountByVZ, Preprocess and BuildTree to subsequent sections. The implementation involves designing the storage format, data placement (host / device), and partitioning strategies for all the data items, as well as designing the order of sampling tokens (Line 5 of Alg. 1). The data items include the token list , the documenttopic count matrix , the wordtopic count matrix , and the wordtopic distribution matrix .
The aforementioned design should maximize the locality, and meanwhile, keep the size of the data items within the budget of GPU memory even when both and the data size are large. This is challenging because locality emerges as a new design goal to support sparsityaware algorithms, while previous dense matrix based GPU systems yan2009parallel ; zhao2015same ; tristan2015efficient only have one goal (i.e., fitting into the memory budget) instead of the two which need to be obtained simultaneously in our method. As we will see soon, the simple data layout in previous systems such as sorting all tokens by documentid yan2009parallel ; zhao2015same ; tristan2015efficient ; zaheer2015exponential or by wordid yu2015scalable cannot satisfy both requirements, and we will propose a hybrid data layout to address this problem.
3.1.1 Storage Format
We analyze the access pattern of these data items described in Alg. 1 and Alg. 2. The token list is accessed sequentially, and we store it with an array. The documenttopic count matrix is accessed by row, and the Sample function iterates over its nonzero entries, so we store it by the compressed sparse rows (CSR) format to avoid enumerating over zero entries. Besides efficiency, the CSR format also reduces the memory consumption and host/device data transferring time comparing with the dense matrix format in previous implementations yan2009parallel ; zhao2015same ; tristan2015efficient . The wordtopic matrices and are randomly accessed, and we store them as dense matrices. Table 2 lists the memory consumption of each data item in the PubMed dataset (See Sec. 4 for the details), where we can observe that representing as sparse matrix indeed saves a lot of memory when , and the GPU memory is large enough to hold and for the desired number of topics.
Data 

Token List  DocTopic Matrix  

Dense  Sparse  
K  V=141k  T=738M  D=8.2M  
100  0.108 GB  8.65 GB  3.2 GB  5.8 GB  
1k  1.08 GB  8.65 GB  32 GB  5.8 GB  
10k  10.8 GB  8.65 GB  320 GB  5.8 GB 
3.1.2 Data Placement and Partitioning Strategy
However, the token list and the documenttopic matrix cannot fit in the GPU memory, because the grows linearly with the number of tokens, and the size of grows linearly with the number of documents. Therefore, the sizes of both data items grow rapidly with the size of the dataset. Since one of our major design goals is to support large datasets with billions of tokens, and cannot be held in GPU memory because they can be arbitarily large as the size of the dataset grows.
To address this problem, we treat and as streams: we store them in the main memory, partition them into chunks, and process a few chunks with the GPU at a time. We partition and by document, i.e., each chunk has all tokens from a certain set of documents along with the corresponding rows of . Several workers are responsible of performing the sampling (Alg. 2) for the tokens in each chunk, as illustrated in Fig. 3. Each worker is a cudaStream that fetches a chunk from the main memory, samples the topic assignments for all the tokens in that chunk, updates the documenttopic count maxis , and sends the updated back to the main memory. The computation and communication are overlapped by having multiple workers. The wordtopic matrices and are in the GPU memory, while in the sampling stage, all the workers read and updates . When the sampling of each iteration is finished, is updated with .
3.1.3 Order of Sampling Tokens
We now discuss the order of sampling tokens, i.e., executing Sample for each token in Alg. 1. Although theoretically these tokens can be sampled in any order, the ordering greatly impacts the locality, as we shall see soon; therefore calling for a careful treatment.
As illustrated as Fig. 4(a), sampling for a token requires evaluating a elementwise product of two rows (Line 47 of Alg. 2), i.e., the th row of the documenttopic count matrix and the th row of the wordtopic probability matrix . The elementwise product involves accessing all nonzero entries of (sequential), and accessing the elements of indexed by the nonzero entries of (random).
There are two particular visiting orders which reuse the result of previous memory accesses for better locality. The docmajor order sorts the tokens by their documentid’s, so that the tokens belonging to the same document are adjacent in the token list. Before processing the tokens in the th document, the row can be fetched into the shared memory and reused for sampling all the subsequent tokens in that document. On the other hand, the access of cannot be reused, because each token requires to access random elements (indexed by the nonzero entries of ) in random row of the wordtopic count matrix , as shown in Fig. 4 (b). The bottleneck of this approach is accessing , where both the row index and column index are random.
On the contrary, the wordmajor order sorts the tokens by their wordid’s, so that the tokens belong to same word are adjacent in the token list. Before processing the tokens of the th word, the row can be fetched into the shared memory and reused for sampling all the subsequent tokens of that word. Each token needs to access all the elements of a random row (Fig. 4 (c)). The bottleneck of this approach is accessing , where only the row index is random.
The memory hierarchies are quite different on CPUs and GPUs. CPUs have larger cache size (30MB) and shorter cache line (64B), while GPUs have smaller cache size (2MB) and longer cache line. On CPUs, when and are small, the documentmajor order can have better cache locality than the wordmajor order because can fit in cache zaheer2015exponential . But for GPUs, the wordmajor order has clear advantage that it efficiently utilizes of the cache line by accessing whole rows of instead of random elements. Therefore, we choose the wordmajor order for SaberLDA.
3.1.4 PartitionbyDocument and OrderbyWord
Putting all the above analysis together, SaberLDA adopts a hybrid data layout called partitionbydocument and orderbyword (PDOW), which means to firstly partition the token list by documentid, and then sort the tokens within each chunk by wordid. Unlike the simple layouts such as sorting all tokens by documentid or by wordid in previous systems yan2009parallel ; zhao2015same ; tristan2015efficient ; zaheer2015exponential ; yuan2014lightlda ; yu2015scalable , PDOW combines the advantages of both bydocument partitioning and wordmajor ordering, and simultaneously improves cache locality with the wordmajor order, and keeps the GPU memory consumption small with the bydocument partitioning.
The number of chunks presents a tradeoff between memory consumption and locality. The memory consumption is inversely proportional to the number of chunks, but the number of times to load each row of into shared memory is proportional to the number of chunks. The chunk size should not be too small to ensure good locality, i.e., the preloaded should be reused for a suffciently large number of times. In SaberLDA, we minimize the number of chunks as long as the memory is sufficient.
3.2 Warpbased Sampling
We now turn to the Sample function (Alg. 2), which is the most time consuming part of LDA. To understand the challenges and efficiencyrelated considerations, it is helpful to have a brief review of GPU architecture.
GPUs follow a single instruction multiple data (SIMD) pattern, where the basic SIMD unit is warp, which has 32 data lanes. Each lane has its own ALU and registers, and all the lanes in a warp execute the same instruction. In CUDA, each thread is executed on a lane, and every adjacent 32 threads share the same instruction. Readers can make an analogy between GPU warp instruction and CPU vector instruction.
The most straightforward implementation of sampling on GPU is threadbased sampling, which samples each token with a GPU thread. Therefore, 32 tokens are processed in parallel with a warp. Threadbased sampling is acceptable when is dense because the loop length (Line 4 of Alg. 2) is always , and are no branches (Line 8 of Alg. 2). However, this approach has several disadvantages when becomes sparse. Firstly, because each token corresponds to different rows of , the loop length of each thread are different (Line 4 of Alg. 2). In this case, all the threads need to wait until the longest loop is finished, introducing long waiting time. Secondly, there are branches in the sampling procedure, for example, the branch choosing which distribution to sample (Line 8 of Alg. 2). The branches cause the thread divergence problem, i.e., if some of the threads go to one branch and other threads go to another branch, the warp need to perform the instructions of both branches, which again increases the waiting time. Finally, the access to is unconcealed (Line 5 of Alg. 2) because the threads are accessing different and discontinuous addresses of the global memory.
To overcome the disadvantages of threadbased sampling, SaberLDA adopts warpbased sampling, where all the threads in a warp collaboratively sample a single token. However, there are also challenges for warpbased sampling — all the operations need to be vectorized to maximize the performance.
We now present our vectorized implementation. As mentioned in Sec. 2.4, the sampling involves computing the elementwise product (Line 47 of Alg. 2), randomly decide which branch to sample (Line 8 of Alg. 2), and depending on which branch is chosen, sample from (Line 9 of Alg. 2), or sample with the preprocessed tree (Line 11 of Alg. 2). Before starting the presentation, we also remind the readers that is in global memory, and all the other data items such as and are preloaded into shared memory taking advantage of PDOW as discussed in Sec. 3.1.3.
3.2.1 Elementwise Product
The elementwise product step is vectorizable by simply letting each thread process an index and a warp compute the elementwise product for 32 indices at a time. All the threads are efficiently utilized except for the last 32 indices if the number of nonzero entries of Ad is not a multiple of 32. The waste is very small since the number of nonzero entries is typically much larger than 32, e.g., 100.
3.2.2 Choosing the Branch
This step only consists a random number generation and a comparison, whose costs are negligible. Note that thread divergence will not occur as for the threadbased sampling since the whole warp goes to one branch or the other.
3.2.3 Sample from
This step consists of the three steps mentioned in Sec. 2.3, where the elementwise product is already computed. We need to generate a random number, and find its position in the prefixsum array of .
Firstly, we need to vectorize the computation of the prefix sum, i.e., computing the prefix sum of 32 numbers using 32 threads. Because of the data dependency, the prefix sum cannot be computed with a single instruction. Instead, __shuf_down operations are required, and the prefix sum can be computed in instructions prefixsum . We refer this routine as warp_prefix_sum. Given the prefix sum, we need to figure out the index of the first element which is greater than or equal to the value. This can be achieved in two steps:

The warpvote intrinsic __ballot forms a 32bit integer, where the th bit is one if the th prefix sum is greater than or equal to the given value ballot ,

The __ffs intrinsic returns the index of the first bit 1 of a 32bit integer,
where we refer these two steps as warp_vote, which returns an index that is greater than or equal to the given value, or 1 if there is no such index.
Deferring the discussion for the preprocessed sampling for a while, Fig. 5 is an example code of our vectorized warpbased sampling, where we omit some details such as whether the vector length is a multiple of the warp size, and the warp_copy(a, id) function returns on the thread . There is no waiting or thread divergence issue as discussed above.
Besides the eliminated waiting time and thread divergence, the memory access behavior of our implementation is good as well. For the elementwise product, the access to is continuous. More specifically, the warp accesses two 128byte cache lines from the global memory, and each thread consumes two 32bit numbers (an integer index and a float32 value). The accesses to are random, but they are still efficient since the current row is in shared memory. All the rest operations only access data items in the shared memory, e.g., .
3.2.4 ary Sampling Tree
We now present the deferred details of the preprocessed sampling (Line 11 of Alg. 2). The preprocessed sampling is new in sparsityaware algorithms, since the vanilla algorithm does not break the original sampling problem into subproblems. Therefore, previous GPUbased systems do not have this problem. As discussed in Sec. 2.3, the preprocessed sampling problem is essentially the same problem as sampling from , but there are only different sampling problems, so we can preprocess for each problem to accelerate this process. In previous literature, there are two main data structures for that problem, and we briefly review them.

An alias table walker1977efficient can be built in time, and each sample can be obtained in time afterwards. However, building the alias table is sequential;

A Fenwick tree yu2015scalable can be built in time, and each sample can be obtained in time afterwards. However, the branching factor of the Fenwick tree is only two, so the 32thread GPU warp cannot be fully utilized.
Both approaches are designed for CPU, and are slow to construct on GPU because they cannot be vectorized. Vectorization is critical because using only one thread for preprocessing can be much slower than using a full warp for preprocessing. To allow vectorized preprocessing, we propose a arytree, which can be constructed in time with full utilization of GPU warp. Subsequent samples can be obtained in time, where is the number of threads in a GPU warp, i.e., 32.
We emphasize that our main focus is on the efficient construction of the tree instead of efficient sampling using the tree, because the cost of sampling using the tree is negligible comparing with sampling from . Moreover, the sampling using our arytree is efficient anyway because is in the order of 10,000, and , so is at the same level of the alias table algorithm.
Our arytree is designed for efficiently finding the location of a given number in the prefixsum array. Each node of the tree stores a number, where the bottommost level nodes store the prefixsums of the given array. The length of an upper level is equal to the length of the lower level divided by , and the th node in an upper level is equal to the node in the lower level. Constructing the tree is illustrated in Fig. 7. This procedure is efficient because all the nodes in one layer can be constructed in parallel. Therefore, the GPU warp can be efficiently utilized with the warp_prefix_sum function we mentioned before, which uses threads to compute the prefixsum of numbers in parallel.
To find the position of a given value in the prefixsum array, we recursively find the position at each level, from top to bottom (Fig. 7). Based on the particular construction of our tree, if the position on the th level is , then the position at the th level is between and . Therefore, only nodes need to be checked on each level. This checking can be done efficiently using the warp_vote function we mentioned before, and the memory access is efficient because the tree is stored in the shared memory (Sec. 3.1.3), and it only needs to read continuous floating point numbers, i.e., a 128byte cache line for each level.
The amount of memory accesses can be further reduced. We use a fourlevel tree for SaberLDA, which supports up to topics. The first and second layers contain only 1 and 32 nodes, respectively, and they can be stored in the thread registers. In this way, only two shared memory cache lines for level 3 and 4 are accessed per query. Fig. 6 is the code for the arytree.
3.3 Efficient Count Matrix Update
Finally, we discuss how to efficiently update the count matrices and . When the sampling of all tokens with the same word is finished, the corresponding row of the wordtopic count matrix is ready to be updated. The atomicAdd function must be used because there may be multiple workers updating the same row, but the overhead is very low since the time complexity of updating is lower than the time complexity of sampling. The wordtopic probability matrix can be easily generated according to Eq. (2) from after all the updates of the latter are finished. Maximal parallel performance can be achieved since both matrices are dense.
However, the update of the documenttopic matrix is challenging since is sparse. To update an entry of a sparse matrix, one must find the location of that entry, which is difficult to vectorize. Therefore, instead of updating, we rebuild the count after the sampling of each partition is finished.
A naïve approach of rebuilding the count matrix is to sort all the tokens by first documentid then topic assignment , and perform a linear scan afterwards. However, the sorting is expensive since it requires to access the global memory frequently. Moreover, the efficiency of sorting decreases as the chunk size increases.
We propose a procedure called shuffle and segmented count (SSC) to address this problem. To rebuild the count matrix, we first perform a shuffle to organize the token list by the documentid’s, i.e., segment the token list as smaller lists where the tokens in each list share the same documentid. Since the documentid’s of the tokens are fixed, the shuffle can be accelerated with a preprocessed pointer array, which points each token to its position in the shuffled list. Therefore, to perform shuffling, we only need to place each token according to the pointers. Furthermore, we can reduce the accesses to global memory by creating the counts for each smaller token list individually, where these token lists are small enough to fit in the shared memory.
Creating the counts is a common problem as known as segmented count, which means for segmenting the tokens by and counting in each segment. Unlike similar problems such as segmented sort segsort and segmented reduce segreduce which have fully optimized algorithms for GPUs, efficient GPU solution of segmented count is not well studied yet.
We propose a solution of segmented count which is sufficiently efficient for SaberLDA. Our procedure consists of three steps, as illustrated in Fig. 8:

Perform a radix sort by in shared memory;

Calculate the prefix sum of the adjacent difference, in order to get the number of different topics and the order number of each topic;

Assign the topic number at corresponding order number, and increase the counter of the same topic.
3.4 Implementation Details and Load Balancing
In SaberLDA, a word and a token are processed with a a block and a warp, respectively. Each block has its own shared memory to hold the rows and , where the former is fetched from the global memory before sampling any tokens in the current word and the latter is written back to global memory after the sampling. To minimize the number of memory transactions, we also align the memory address of each row of by 128 bytes.
The workload for each block is determined by the size of its chunk, the workload for each block is determined by the number of tokens in the word, and the workload for each warp is determined by the number of nonzero entries in the document. To minimize the imbalance of the workload, we apply dynamic scheduling at all the three levels, i.e., a block fetches a new word to process when it is idle, and similarly, a warp fetches a new token to process when it is idle.
Because the term frequency of a natural corpus often follows the power law kingsley1932selective , there are a few highfrequency words which have much more tokens than other lowfrequency words. Therefore, the block level workload is more imbalanced than the other two levels. To address this problem, we further sort the words decreasingly by the number of tokens in it, so that the words with most number of tokens will be executed first, and then the words with small number of tokens are executed to fill the imbalance gaps.
4 Evaluation
Dataset  

NYTimes asuncion2007uci  300K  100M  102k  332 
PubMed asuncion2007uci  8.2M  738M  141k  90 
ClueWeb12 subset  19.4M  7.1B  100k  365 
In this section, we provide extensive experiments to analyze the performance of SaberLDA, and compare SaberLDA with other cuttingedge open source implementations on various widely adopted datasets listed in Table 3.
The code of SaberLDA is about 3,000 lines, written in CUDA and C++, and compiled with NVIDIA nvcc 8.0 and Intel ICC 2016. Our testing machine has two Intel E52670v3 CPUs, with 12 cores each CPU, 128 GB main memory and a NVIDIA GTX 1080 GPU.
The hyperparameters and are set according to previous works yao2009efficient ; li2014reducing ; yu2015scalable ; chen2016warplda .
The training time of LDA depends on both the number of iterations to converge and the time for each iteration. The former depends on the algorithm, e.g., variational Bayes algorithm blei2003latent typically requires fewer iterations than ESCA zaheer2015exponential to converge. The latter depends on the time complexity of sampling each token as well as the implementation, e.g., the sparsityaware algorithm has time complexity and performs faster than the vanilla algorithms. Therefore, we use various metrics to compare LDA implementations:

We use time per iteration or throughput to compare different implementations of the same algorithm, e.g., compare SaberLDA, which is a GPU implementation of the ESCA algorithm, with a CPU implementation of the same algorithm, because they require the same number of iterations to converge. The throughput is defined as the number of processed tokens divided by the running time, and the unit is million tokens per second (Mtoken/s).

Since different algorithms require different numbers of iterations to converge, the time per iteration metric is no longer informative. Therefore, we compare different algorithms by the required time to converge to a certain model quality. The model quality is assessed by holdout loglikelihood per token, using the partiallyobserved document approach wallach2009evaluation . Higher loglikelihood indicates better model quality.
4.1 Impact of Optimizations
We first investigate the impact of each optimization technique we proposed in Sec. 3, by training LDA on the NYTimes dataset with 1,000 topics for 100 iterations. The results are shown in Fig. 9, where the total elapsed time is decomposed as the Sample function, rebuilding documenttopic matrix , constructing preprocessed data structures for sampling (preprocessing), and data transferring between CPU and GPU.
G0 is the most straightforward sparsityaware implementation on GPU which sorts all tokens by documents, performs the preprocessed sampling with the alias table, and builds count matrices by naïve sorting of all tokens. G1 adopts the PDOW strategy proposed in Sec. 3.1.4, and the time of sampling is reduced by almost 40% because of the improved locality for sampling. Note rebuilding the doctopic matrix A takes more time in G1 than in G0, because the tokens are ordered by word, and the sorting becomes slower.
The bottleneck of G1 is the construction of the alias table, which is hard to vectorize. In G2, we replace the alias table with the arytree, which fully utilizes warps to greatly reduce the construction time by 98%.
Then, we optimize the rebuilding of the documenttopic matrix with SSC (Sec. 3.3), which reduces the rebuilding time by 89%. Up to now, the time for updating and preprocessing is neglectable.
Finally, in G4, we enable multiple workers running asynchronously to hide the data transfer between CPU and GPU. It reduce 12.3% of total running time in this case. Overall, all these optimizations achieve 2.9x speedup comparing with the baseline version G0. We emphasize that even G0 is already highly optimized, and should handle more topics than previous GPU implementations because it still adopts the sparsityaware algorithm whose time complexity is.
4.2 Performance Tuning
Throughput (GB/s)  Utilization  

Global memory  144  50% 
L2 cache  203  30% 
L1 unified cache  894  20% 
Shared memory  458  20% 
Tuning parameters, such as the number of workers, the number of chunks and the number of threads in a CUDA kernel can largely affect the performance. In order to fully understand the impact of tuning parameters, we analyze the performance of SaberLDA under different parameter settings. We evaluate the total running time of the first 100 iterations on the NYTimes dataset with the number of topics varying from 1,000, to 3,000, and to 5,000.
4.2.1 Number of Chunks
Firstly, we analyze the single worker performance with various numbers of partitions, as shown in Figure 10 (a). We can see that the performance degrades with more partitions because of the degraded locality, but partitioning is necessary when the dataset is too large to be held in GPU. We keep the number of partitions as small as possible to maximize the performance.
4.2.2 Number of Workers
Figure 10 (b) presents the performance with different numbers of workers. We fix the number of chunks to 10. Using multiple workers can hide the data transfer time between GPU and CPU, and reduce the overall time consumption. We observe a 10% to 15% speedup from single worker to 4 workers, where the speedup is quite close to the proportion of data transferring shown in Fig. 9.
4.2.3 Number of Threads
Tuning the number of threads in each block maximizes the system performance. For the kernel function, more threads imply fewer active blocks, which reduces the total shared memory usage in a multiprocessor, but increases the inblock synchronization overhead of warps. Figure 10 (c) shows that setting 256 threads in a block always achieves the best performance for various numbers of topics.
4.3 Profiling Analysis
Next, we analyze the utilization of hardware resources with NVIDIA visual profiler. We focus on memory bandwidth because LDA is a memory intensive task chen2016warplda .
Table 4 is the memory bandwidth utilization of the first 10 iterations on NYTimes with . Statistics show that the throughput accessing global memory reaches more than 140 GB/s, which is about 50% of the bandwidth. This shows clear advantage over CPUs, given that the bandwidth between main memory and CPU is only 40 to 80 GB/s. The throughput of the L2 cache, unified L1 cache and shared memory is 203GB/s, 894GB/s and 458GB/s respectively, while the utilization is lower than 25%. Therefore, these are not the bottleneck of the overall system performance.
We further use performance counters to examine of kernel function. The memory dependency is the main reason of instruction stall (47%), and the second reason is execution dependency (27%). The hotspot is computing the elementwise product between sparse vector and dense vector , which is expected because it accesses global memory.
4.4 Comparing with Other Implementations
We compare SaberLDA with a stateoftheart GPUbased implementation BIDMach zhao2015same as well as three stateoftheart CPUbased methods, including ESCA (CPU), DMLC dmlclda and WarpLDA chen2016warplda . BIDMach zhao2015same is the only opensourced GPUbased method. BIDMach reports better performance than Yan et. al’s mehtod zhao2015same ; yan2009parallel , and Steele and Tristan’s method only reports tens of topics in their paper. Therefore, BIDMach is a strong competitor. ESCA (CPU) is a carefully optimized CPU version of the ESCA algorithm which SaberLDA also adopts. DMLC has various multithread LDA algorithms on CPU, and we choose its best performing FTreeLDA. WarpLDA is a stateofart distributed LDA implementation on CPU based on the MetropolisHastings algorithm, and uses a cacheefficient sampling algorithm to obtain high periteration throughput.
We compare the time to converge of these implementations on NYTimes and PubMed datasets, with the number of topics . Figure 12 shows the convergence over time. We compare the time to converge to the pertoken loglikelihood of and , for NYTimes and PubMed, respectively. SaberLDA is about 5.6 times faster than BIDMach. We also attempt to perform the comparison with 3,000 and 5,000 topics, and find that BIDMach is more than 10 times slower than SaberLDA with 3,000 topics, and reports an outofmemory error with 5,000 topics. This is as expected because the time consumption of BIDMach grows linearly with respect to the number of topics, and its dense matrix format is much more memory consuming than SaberLDA.
Finally, SaberLDA is about 4 times faster than ESCA (CPU) and 5.4 times faster than DMLC on the two datasets with , where WarpLDA converges to a worse local optimum possibly because of its inexact MetropolisHastings step and the different metric with its paperr chen2016warplda which we use to assess model quality. This shows that SaberLDA is more efficient than CPUbased implementations.
4.5 A Large Dataset
Finally, to demonstrate the ability of SaberLDA to process large datasets, we test the performance of SaberLDA on a large subset of the ClueWeb dataset, which is a crawl of webpages. ^{1}^{1}1http://www.lemurproject.org/clueweb12.php/ We filter out the stop words and keep the remaining 100,000 most frequent words as our vocabulary, which is comparable to the vocabulary size of NYTimes. We use the entire CPU memory to hold as many documents as possible, which is 19.4 million, and the total number of tokens is 7.1 billion, which is about 10 times larger than the PubMed dataset.
In this experiment, we also use GTX Titan X (Maxwell), which has 12GB global memory besides GTX 1080, which has 8GB global memory. We compare the performance of GTX 1080 and Titan X with 5,000 topics.
The algorithm convergences in 5 hours on both cards, where the throughput of GTX 1080 is 135 Mtoken/s and the throughput of Titan X is 116 Mtoken/s. With 10,000 tokens, it also convergences in 5 hours with a throughput of 92 Mtoken/s.
5 Conclusions
We present SaberLDA, a high performance sparsityaware LDA system on GPU. Adopting sparsityaware algorithms, SaberLDA overcomes the problem of previous GPUbased systems, which support only a small number of topics. We propose novel data layout, warpbased sampling kernel, and efficient sparse count matrix updating algorithm to address the challenges induced by sparsity, and demonstrate the power of SaberLDA with extensive experiments. It can efficiently handle largescale datasets with up to 7 billion tokens and learn large LDA models with up to 10,000 topics, which are out of reach for the existing GPUbased LDA systems.
In the future, we plan to extend SaberLDA to multiple GPUs and machines. Developing algorithms that converge faster and enjoy better locality is also our future work.
References
 [1] Amr Ahmed, Moahmed Aly, Joseph Gonzalez, Shravan Narayanamurthy, and Alexander J Smola. Scalable inference in latent variable models. In Proceedings of the fifth ACM international conference on Web search and data mining, pages 123–132. ACM, 2012.

[2]
Arthur Asuncion and David Newman.
Uci machine learning repository, 2007.
 [3] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. JMLR, 3:993–1022, 2003.
 [4] Jordan L BoydGraber, David M Blei, and Xiaojin Zhu. A topic model for word sense disambiguation. In EMNLPCoNLL, 2007.
 [5] Liangliang Cao and Li FeiFei. Spatially coherent latent topic model for concurrent segmentation and classification of objects and scenes. In ICCV, 2007.
 [6] J. Chang and D. Blei. Relational topic models for document networks. In AISTATS, 2009.
 [7] Jianfei Chen, Kaiwei Li, Jun Zhu, and Wenguang Chen. Warplda: a cache efficient o (1) algorithm for latent dirichlet allocation. In VLDB, 2016.
 [8] WenYen Chen, JonChyuan Chu, Junyi Luan, Hongjie Bai, Yi Wang, and Edward Y Chang. Collaborative filtering for orkut communities: discovery of user latent behavior. In WWW, 2009.
 [9] Thomas L Griffiths and Mark Steyvers. Finding scientific topics. Proceedings of the National academy of Sciences, 101(suppl 1):5228–5235, 2004.
 [10] Tomoharu Iwata, Takeshi Yamada, and Naonori Ueda. Probabilistic latent semantic visualization: topic model for visualizing documents. In KDD, 2008.
 [11] Zipf George Kingsley. Selective studies and the principle of relative frequency in language, 1932.
 [12] Aaron Q Li, Amr Ahmed, Sujith Ravi, and Alexander J Smola. Reducing the sampling complexity of topic models. In KDD, 2014.
 [13] John D. Owens Mark Harris, Shubhabrata Sengupta. Parallel prefix sum (scan) with cuda. http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html, 2007.
 [14] NVIDIA. Segmented reduction. https://nvlabs.github.io/moderngpu/segreduce.html, 2013.
 [15] NVIDIA. Segmented sort and locality sort. https://nvlabs.github.io/moderngpu/segsort.html, 2013.
 [16] NVIDIA. Cuda c programming guide. http://docs.nvidia.com/cuda/cudacprogrammingguide/index.html#warpvotefunctions, 2015.

[17]
JeanBaptiste Tristan, Joseph Tassarotti, and Guy Steele.
Efficient training of lda on a gpu by meanformode estimation.
In Proceedings of the 32nd International Conference on Machine Learning (ICML15), pages 59–68, 2015. 
[18]
Alastair J Walker.
An efficient method for generating discrete random variables with general distributions.
ACM Transactions on Mathematical Software (TOMS), 3(3):253–256, 1977.  [19] Hanna M Wallach, Iain Murray, Ruslan Salakhutdinov, and David Mimno. Evaluation methods for topic models. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1105–1112. ACM, 2009.
 [20] Yi Wang, Xuemin Zhao, Zhenlong Sun, Hao Yan, Lifeng Wang, Zhihui Jin, Liubin Wang, Yang Gao, Jia Zeng, Qiang Yang, et al. Towards topic modeling for big data. ACM Transactions on Intelligent Systems and Technology, 2014.
 [21] Feng Yan, Ningyi Xu, and Yuan Qi. Parallel inference for latent dirichlet allocation on graphics processing units. In Advances in Neural Information Processing Systems, pages 2134–2142, 2009.
 [22] Limin Yao, David Mimno, and Andrew McCallum. Efficient methods for topic model inference on streaming document collections. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 937–946. ACM, 2009.
 [23] HsiangFu Yu, ChoJui Hsieh, Hyokun Yun, SVN Vishwanathan, and Inderjit S Dhillon. A scalable asynchronous distributed algorithm for topic modeling. In Proceedings of the 24th International Conference on World Wide Web, pages 1340–1350. International World Wide Web Conferences Steering Committee, 2015.
 [24] Jinhui Yuan, Fei Gao, Qirong Ho, Wei Dai, Jinliang Wei, Xun Zheng, Eric P Xing, TieYan Liu, and WeiYing Ma. Lightlda: Big topic models on modest compute clusters. In WWW, 2015.
 [25] Manzil Zaheer. Dmlc experimentallda. https://github.com/dmlc/experimentallda, 2016.
 [26] Manzil Zaheer, Michael Wick, JeanBaptiste Tristan, Alex Smola, and Guy L Steele Jr. Exponential stochastic cellular automata for massively parallel inference. In AISTATS, 2015.
 [27] Huasha Zhao, Biye Jiang, John F Canny, and Bobby Jaros. Same but different: Fast and high quality gibbs parameter estimation. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1495–1502. ACM, 2015.
 [28] Jun Zhu, Amr Ahmed, and Eric Xing. Medlda: maximum margin supervised topic models for regression and classification. In International Conference on Machine Learning, 2009.