1 Introduction
Recent advances in storage and network technology have brought about a new era of "big data", where efficient processing and distillation of massive datasets has become one of the major pursuits in the field of machine learning (ML). Numerous algorithms and systems have been proposed to scale up ML for various tasks. However, many existing systems positioned for Big ML, such as MapReduce [5] or Spark [21], resort to data-parallelism, based on the assumption that the tasks associated with partitions of the data are independent and/or pose only mild reliance on synchronization. Such assumptions are indeed valid for a majority of "data processing" tasks, such as keyword extraction from huge log files or conventional database operations that typically sweep the data only once. Unlike traditional data processing, many machine learning algorithms are not well suited for data-parallelism, due to the coupling of distributed elements through "shared states" such as model parameters, latent variables, or other intermediate states; for simplicity, we refer to such entities as the "model" underlying the data. A clear dichotomy between data (which is conditionally independent and persistent throughout the process of training) and model (which is internally coupled, and is transient before converging to an optimum), together with the need for an iterative-convergent procedure for learning the model from the data, is the hallmark of machine learning programs. For example, in the LDA topic model [3], the model to be extracted from the data consists of a large collection of subspace bases, i.e., latent topic vectors, which are shared among all the documents; for each such basis, its elements are coupled by normality and nonnegativity constraints; and estimators of all such bases admit no closed form and must be approximated through iterative procedures. All these render any trivial parallel treatment of model and data elements impossible.
To achieve efficient distributed topic modeling under dependency constraints via a data-parallel scheme, the following approaches have been commonly considered. 1) Exploiting approximate independencies among subtasks: for example, in [15, 22] the variational inference algorithm is decomposed into independent subtasks and parallelized. This strategy in fact amounts to a Bulk Synchronous Parallel (BSP) computation model, whose drawback is evident: the many synchronization barriers needed to ensure logical correctness can result in a large number of idle cycles, making high scalability hard to achieve. 2) Fine-grained locking: one can employ locking mechanisms on shared variables to prevent the readers-writers problem. However, this is only viable in shared-memory settings. An early version of the GraphLab [14] engine implemented a sophisticated fine-grained locking mechanism in distributed settings, though at the expense of encoding the model into a graph. 3) Brute-force parallelization: here no specific action is taken to prevent errors generated during asynchronous updates. Some early attempts [16, 2], the current state-of-the-art distributed LDA inference method Yahoo!LDA [1], and recent advances in parameter servers [9, 12] can be viewed as instantiations of this mechanism to some extent. The major problem with this approach is that there is little guarantee on the correctness of the inference procedure. Although recent studies [17, 10] have provided some justification for this approach, for now the theory only supports simple models (e.g., Gaussians [10]) or requires certain assumptions to hold (e.g., updates do not overlap too much [17]). Empirically, as we show later, the convergence speed of such error-prone parallelization can be improved significantly if one eliminates the parallelization error.
Apart from the dependency issue, large-scale topic modeling also poses the challenge of accommodating and handling gigantic model sizes, which has received less attention in the literature. Unlike in academic convention, industry-scale applications of topic modeling, for instance online advertising, typically go beyond extracting topics for human interpretation or visualization, and require ultra-high vocabulary sizes and topic dimensions. However, most data-parallel schemes implicitly assume that an image of all shared model states is readily available in each worker process, since it can be extremely expensive to fetch them from remote processes during iterative training steps. This assumption of a local model copy breaks down when facing big-model problems. For example, modeling a corpus with a vocabulary of $V$ terms in a $K$-dimensional latent space requires $V \times K$ model variables to be estimated. In real-world applications huge $V$ and $K$ are not uncommon, considering feature augmentation (e.g., taking word combinations) and the large conceptual space behind the text. Since the raw model may already take terabytes of storage, an unstructured data-parallel approach is unlikely to be applicable.
To address these issues, we take advantage of a different type of parallelization mechanism, namely model-parallelism, to complement data-parallelism. Originating from a machine learning perspective, model-parallelism addresses the above problems by carefully scheduling updates based on the dependencies among model states induced by the inference algorithm. Specifically, we exploit the fact that in Gibbs sampling for LDA, shared-state access is limited to a small subset of the entire model during the computation of an update from a data sample. In other words, if the subsets are small enough, it is possible to find a class of disjoint subsets whose updates are completely independent of each other. Under the i.i.d. assumption, parallelizing over the disjoint blocks produces exactly the same result as serial execution. Thus model-parallel inference not only preserves inference quality, but also reduces memory requirements by partitioning both the data and the model space. In fact, we demonstrate the ability to handle topic modeling with 200 billion model variables on a cluster of 64 low-end machines. We also note that model-parallelism is suitable not only for LDA but also for many more machine learning programs; primitives for more general model-parallelism can be found in [11].
Related work: Various methods have been proposed to enable large-scale inference for topic models. In [22], a MapReduce-based parallelization is presented, making use of the independent tasks in the variational inference algorithm for LDA. The current state-of-the-art distributed inference for LDA [1] resorts to fast background synchronization of the model. A rotation-scheduling idea has been studied in [19]; however, we target distributed settings where things become more challenging due to network latency and a smaller degree of parallelism (compared to a GPU). The GraphLab [14, 7] LDA application can be seen as a special case of model-parallelism, where by construction of the graph only non-overlapping subgraphs (i.e., documents and words) are processed simultaneously. A recent study on streaming variational Bayes [4] also proposed a distributed inference algorithm, though it is specialized to the single-pass scenario.
Here is an outline of the rest of the paper: we begin with a brief introduction of the LDA model and the collapsed Gibbs sampling algorithm in section 2. Then in section 3 we present the big picture and motivation of model-parallel inference for LDA, whose technical details are given in section 4. Distributed experiments are conducted in section 5, and finally section 6 concludes.
2 Latent Dirichlet Allocation
Latent Dirichlet allocation (LDA) [3] is a hierarchical Bayesian topic model that learns a low-dimensional representation for a high-dimensional corpus. Because of its ability to capture latent semantics underlying text, it is widely applied to various real-world tasks such as online advertising and personalized recommendation. In recent years, with increasing amounts of data, the need for a larger conceptual space is also emphasized, posing a challenge in large-scale inference for LDA. In this section, we briefly overview the LDA model definition and its inference using collapsed Gibbs sampling.
2.1 The Model
LDA considers each document as an admixture of topics, where each topic $\phi_k$ is a multinomial distribution over a vocabulary of $V$ words. For each document $d$, a topic proportion vector $\theta_d$ is drawn from $\mathrm{Dirichlet}(\alpha)$. Then for each token $n$ in the document, a topic assignment $z_{dn}$ is drawn from $\mathrm{Multinomial}(\theta_d)$ and a word $w_{dn}$ is drawn from $\mathrm{Multinomial}(\phi_{z_{dn}})$. In fully Bayesian LDA, topics are random samples drawn from a Dirichlet prior, $\phi_k \sim \mathrm{Dirichlet}(\beta)$.
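For reference, the generative process just described can be written compactly; the index bounds ($K$ topics, $D$ documents, $N_d$ tokens in document $d$) follow standard LDA notation:

```latex
\begin{align*}
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1,\dots,K, \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D, \\
z_{dn} &\sim \mathrm{Multinomial}(\theta_d), & n &= 1,\dots,N_d, \\
w_{dn} &\sim \mathrm{Multinomial}(\phi_{z_{dn}}).
\end{align*}
```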
Given the corpus $\mathbf{W} = \{\mathbf{w}_d\}_{d=1}^{D}$, LDA infers the posterior $p(\boldsymbol{\theta}, \boldsymbol{\phi}, \mathbf{z} \mid \mathbf{W})$. However, exact inference is intractable due to the normalization term, hence approximation methods such as Gibbs sampling come into play. The mixing rate can be further accelerated by analytically integrating out the intermediate Dirichlet variables, yielding the collapsed Gibbs sampling algorithm [8]:

$$p(z_{dn} = k \mid \mathbf{z}^{\neg dn}, \mathbf{w}) \;\propto\; \frac{(n_{dk}^{\neg dn} + \alpha)(n_{wk}^{\neg dn} + \beta)}{n_{k}^{\neg dn} + V\beta} \qquad (1)$$

where $w$ is the word that the current token maps to; $n_{dk}$ is the number of tokens assigned to topic $k$ in document $d$; $n_{wk}$ is the number of times term $w$ has been assigned to topic $k$; $n_k$ is the total number of tokens assigned to topic $k$; and finally the superscript $\neg dn$ denotes the count excluding the $n$-th token of document $d$.
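To make the update concrete, the following is a minimal single-token sketch of the collapsed Gibbs step in Eq. (1). It is not the paper's implementation; the function name, count layout (`ndk`, `nwk`, `nk` as dense arrays), and parameters are our own illustrative assumptions.

```cpp
#include <vector>
#include <random>

// One collapsed Gibbs update for a token of word w in document d, per Eq. (1).
// ndk[d][k], nwk[w][k], nk[k] are the sufficient statistics; alpha, beta are
// the Dirichlet hyperparameters; K is the number of topics, V the vocabulary
// size. Illustrative sketch only: counts are stored densely.
int sample_topic(int d, int w, int old_k,
                 std::vector<std::vector<int>>& ndk,
                 std::vector<std::vector<int>>& nwk,
                 std::vector<int>& nk,
                 double alpha, double beta, int K, int V,
                 std::mt19937& rng) {
    // Remove the token's current assignment from all counts (the "neg dn" counts).
    --ndk[d][old_k]; --nwk[w][old_k]; --nk[old_k];
    // Build the unnormalized conditional p(z = k | rest) for every topic.
    std::vector<double> p(K);
    double sum = 0.0;
    for (int k = 0; k < K; ++k) {
        p[k] = (ndk[d][k] + alpha) * (nwk[w][k] + beta) / (nk[k] + V * beta);
        sum += p[k];
    }
    // Draw the new topic by inverting the cumulative distribution.
    std::uniform_real_distribution<double> unif(0.0, sum);
    double u = unif(rng), acc = 0.0;
    int new_k = K - 1;
    for (int k = 0; k < K; ++k) {
        acc += p[k];
        if (u < acc) { new_k = k; break; }
    }
    // Add the new assignment back into the counts.
    ++ndk[d][new_k]; ++nwk[w][new_k]; ++nk[new_k];
    return new_k;
}
```

This naive version costs $O(K)$ per token; section 2.2 shows how sparsity reduces it.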
2.2 Sparse Sampling
The $O(K)$ time complexity for each topic assignment in collapsed Gibbs sampling still leaves room for improvement. In [20], a sampling algorithm with sublinear complexity that makes use of sparsity is introduced. The motivation comes from the observation that the counts $n_{d\cdot}$ and $n_{w\cdot}$ are sparse: only a few of the $K$ entries are nonzero in any given document or term. As we shall discuss later, a similar idea is also applicable to model-parallel inference.
The fast sampling algorithm starts by decomposing the conditional (1) in the following way:
$$p(z_{dn} = k \mid \mathbf{z}^{\neg dn}, \mathbf{w}) \;\propto\; s_k + r_k + q_k \qquad (2)$$

where

$$s_k = \frac{\alpha\beta}{n_k + V\beta}, \qquad r_k = \frac{n_{dk}\,\beta}{n_k + V\beta}, \qquad q_k = \frac{n_{wk}\,(n_{dk} + \alpha)}{n_k + V\beta}.$$
Since $s$ is dense, $\sum_k s_k$ can be precomputed in $O(K)$ and maintained in $O(1)$ time; $\sum_k r_k$ can be precomputed in $O(K_d)$ time, where $K_d$ is the average number of nonzero entries in $n_{d\cdot}$; the fractional term in $q_k$ can be precomputed in $O(K_d)$ time, and $\sum_k q_k$ can be constructed in $O(K_w)$ time by taking advantage of sparsity in $n_{w\cdot}$, whose expected number of nonzero entries is $K_w$. Note that not only does the construction of the conditional distribution take sublinear time in $K$; the sampling procedure also benefits significantly from the sparsity of $r$ and $q$, since the $r$ and $q$ buckets contain most of the probability mass. So the overall time complexity for sampling one topic assignment is $O(K_d + K_w)$.

3 Model-Parallel Inference
Data-parallel inference for LDA typically distributes different sets of documents to the workers to perform Gibbs sampling, while sharing a central model across all of them via some synchronization scheme. Albeit powerful in handling large amounts of data, this introduces two potential issues that are less recognized in the literature. 1) It fails to handle huge models. In data-parallel inference, it is natural to assume that an entire copy of the model is available in each worker throughout the inference procedure. However, as mentioned earlier, the need for big models breaks this assumption: a model with billions of variables can easily exceed reasonable RAM sizes these days. 2) It cannot control inconsistency in the shared model. Most data-parallel inference trades correctness for performance. For example, in [1] the shared model is updated by a separate thread cycling over the local model, in the hope that the inconsistency does not affect the algorithm much. However, this strategy relies heavily on network conditions, as we show in Section 5: on low-bandwidth networks the effect of inconsistency becomes evident, since the algorithm proceeds without noticing the slow synchronization in the background.
3.1 Dynamic Model Partitioning
A monolithic treatment of the shared model in data-parallel inference often fails to address the "big model" problem, which can arise when 1) a massive number of model variables or parameters is introduced by the statistical model; or 2) a huge additional shared data structure is required to assist the inference algorithm. In either case, keeping a complete copy of the model in every worker is dangerous, not only because the model may fail to load in the first place, but also because adding computing nodes will not reduce the memory consumption of individual workers.
Our solution to this issue is to partition the shared model into disjoint blocks. This is motivated by the fact that each step of Gibbs sampling only requires changes to a small subset of the statistics $n_{wk}$, hence a certain degree of parallelization can be achieved on the model side, in addition to the data side. Specifically, since Gibbs sampling steps for two distinct words are nearly independent (we discuss the dependency on $n_k$ later), the word-topic count matrix can be effectively partitioned by words. The outcome of model partitioning is straightforward: it reduces the model size on each worker, and achieves scalability on the model side by allowing more nodes to share the burden. Note that dynamic model partitioning complements data partitioning rather than replacing it. Unlike static placement, it provides more flexibility to the algorithm and ensures that each worker works on the complete model during inference, rather than only a subset of it.
In model-parallel LDA, dynamic partitioning of the model is realized by a scheduler component, as described in Algorithm 1. Specifically, it first divides the words into $P$ disjoint blocks $\{\mathcal{V}_1, \dots, \mathcal{V}_P\}$. Each block $\mathcal{V}_p$ is assigned to the corresponding worker $p$ as its initial set of tasks, so worker $p$ only samples tokens $z_{dn}$ such that $w_{dn} \in \mathcal{V}_p$. Once all workers have finished sampling their own blocks, the scheduler rotates the blocks to different workers for another round (sub-iteration) of sampling: in round $t$, worker $p$ acquires block $\mathcal{V}_q$ where $q = (p + t) \bmod P$. After $P$ rounds of sampling, all topic assignments have been sampled exactly once. This amounts to one iteration over the data, and we repeat the process until convergence.
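The rotation above can be sketched as a tiny scheduling routine. This is an illustrative stand-in for the scheduler's block assignment, not the paper's Algorithm 1; the function name and the exact rotation direction `(p + t) mod P` are our assumptions.

```cpp
#include <vector>

// Round-robin rotation of P model blocks across P workers: in sub-iteration
// t, worker p holds block (p + t) mod P, so after P rounds every worker has
// processed every block exactly once, and within a round the assignments
// are a permutation (no two workers share a block).
std::vector<std::vector<int>> rotation_schedule(int P) {
    // schedule[t][p] = index of the model block held by worker p in round t.
    std::vector<std::vector<int>> schedule(P, std::vector<int>(P));
    for (int t = 0; t < P; ++t)
        for (int p = 0; p < P; ++p)
            schedule[t][p] = (p + t) % P;
    return schedule;
}
```

Because each row is a permutation of the blocks, workers never contend for the same word-topic rows within a round.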
3.2 On-demand Communication
Synchronization of the shared model is another major issue in data-parallel inference. Existing methods such as [1] have mainly focused on efficient maintenance of the word-topic count matrix, for example using an asynchronous key-value store to frequently incorporate and distribute updates committed by the workers. However, best-effort synchronization only guarantees eventual consistency, hence the workers may construct incorrect distributions from stale statistics. The effect of staleness becomes even more evident with low network bandwidth, which is common in low-end clusters and budget cloud services. If the shared states cannot be synchronized in time even when the network is saturated, then parallelization error only grows as the inference algorithm proceeds.
Based on dynamic model partitioning, we can avoid such issues by carefully managing communication between workers. To achieve this, we introduce a key-value store that holds the global model. Unlike a "parameter server" [1], the purpose of this component is mainly distributed in-memory storage: thanks to dynamic model partitioning, frequent asynchronous background communication is no longer required. In practice a simple distributed hash table implementation suffices. Given the dynamic model partitioning strategy, on-demand communication between workers and the key-value store follows the procedure described in Algorithm 2. At the beginning of each round, after receiving its task list, each worker requests its model blocks from the key-value store. Similarly, after finishing its tasks, each worker commits the changes in its local model blocks, thereby updating the global model. This process can be further accelerated by overlapping sampling and communication, i.e., sending/receiving model blocks asynchronously.
Again, since the model blocks are non-overlapping, there is no synchronization issue on the key-value store. Moreover, the amount of communication is reduced significantly compared to the frequent-synchronization approach. By combining dynamic model partitioning and on-demand communication via the key-value store, variable dependency between workers is also reduced. This not only eliminates the need for frequently synchronized shared states as in data-parallel inference, but also results in faster sampler convergence per token processed. In fact, as we show later, our method requires far fewer iterations to converge than others, while having similar per-iteration time complexity.
3.3 Non-separable Dependency
So far we have deliberately omitted another source of dependency: the global topic count vector $n_k$. It is impossible to divide $n_k$ into disjoint blocks, since the term is required when sampling every token. However, note that $n_k$ is relatively large, as it aggregates counts over the entire corpus, and it appears only in the denominator; changes of small magnitude therefore barely affect the final distribution. This admits a much more relaxed level of consistency. We thus synchronize $n_k$ across workers at the beginning of each round through the key-value store. This is highly efficient, since every worker only needs to send/receive a vector of size $K$ to/from the key-value store. During the round, workers are not aware of changes in $n_k$ made by other workers, which causes some error in the distribution to sample from.
This is similar in spirit to the idea used in [16], where the entire model is allowed to go out of sync during an iteration. However, we only relax the consistency requirement on $n_k$, while the major part of the model, $n_{wk}$, is maintained without any error. As we show in Section 5, due to the small amount of change relative to the actual value, the resulting error is empirically negligible.
To sum up, combining dynamic model partitioning and on-demand communication not only reduces the memory load of each worker, but also avoids most of the parallelization error. A special protocol addresses the non-separable dependency on $n_k$ without sacrificing inference quality. As we show later, compared to a data-parallel method [1], model-parallel inference takes an order of magnitude less time to converge.
4 Implementation Details
In this section we provide technical details on implementing model-parallel inference for LDA.
4.1 Overall Architecture
The complete design of the system components is illustrated in Figure 1. We partition both data and model so that each partition fits in a single machine's memory. The scheduler communicates directly with the workers to 1) generate and assign tasks and 2) coordinate model partitions between workers. It also maintains a special communication channel with the key-value store to handle the non-separable dependency on $n_k$. As mentioned, model blocks are communicated via a distributed key-value store in a managed fashion, rather than by busy synchronization. This significantly reduces the amount of communication and hence lowers the requirement on network bandwidth.
4.2 Fast Sampling on Inverted Index
Upon receiving the task list and the necessary model blocks, the main job left for the worker is to perform Gibbs sampling on its local data. Because of the scheduling constraint, only tokens mapped to words in the current task list can be sampled in a given round. The traditional bag-of-words representation of documents turns out to be rather inefficient here: determining the set of tokens to be sampled in a round requires sequential iteration over the dataset as well as repeated comparisons between the task list and each token.
In fact, this is a classic problem in search engines, where the typical solution is to represent the documents in an inverted index instead of a forward index (i.e., bag-of-words). With the inverted index created, for each worker $p$, the record indexed by word $w$ holds all topic assignments $z_{dn}$ such that $w_{dn} = w$ and document $d$ resides in worker $p$'s partition. This completely eliminates the repeated comparisons between the two sets.
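Building such an index is a single pass over the local partition. The sketch below is illustrative; the `Posting` layout (document id, token position) and function name are our assumptions, not the paper's data structures.

```cpp
#include <unordered_map>
#include <utility>
#include <vector>

// Invert a worker's local partition: map each word id to the list of token
// positions (doc id, position in doc) where it occurs. A worker holding a
// model block can then enumerate exactly the tokens it may sample this
// round by looking up the words in its task list.
using Posting = std::pair<int, int>;  // (doc id, token position)

std::unordered_map<int, std::vector<Posting>>
invert(const std::vector<std::vector<int>>& docs) {
    std::unordered_map<int, std::vector<Posting>> index;
    for (int d = 0; d < (int)docs.size(); ++d)
        for (int n = 0; n < (int)docs[d].size(); ++n)
            index[docs[d][n]].push_back({d, n});
    return index;
}
```

Lookup is then $O(1)$ per task word, instead of a scan of the whole partition.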
In addition, similar to the idea in [20], we can take advantage of sparsity as well. We first note that the decomposition (2) is not optimal for sampling on an inverted index. To see why, note that a great proportion of the efficiency of the sparse sampling algorithm [20] comes from precomputing the document-dependent buckets for each document. Once cached, further changes to $n_{dk}$ within a document only require $O(1)$ incremental updates. The caching effect is maximized when the tokens of each document are sampled sequentially in a single pass. For sampling on an inverted index, however, this is no longer the case: the cache is frequently recomputed, since typically only a few tokens in a document represent a specific word. Instead, a different decomposition maximizes the caching effect:
$$p(z_{dn} = k \mid \mathbf{z}^{\neg dn}, \mathbf{w}) \;\propto\; u_k + v_k \qquad (3)$$

where

$$u_k = \frac{\alpha\,(n_{wk} + \beta)}{n_k + V\beta}, \qquad v_k = \frac{n_{dk}\,(n_{wk} + \beta)}{n_k + V\beta}.$$
The first probability bucket $u$ can be precomputed for every word (i.e., task) in $O(K)$ time, with $O(1)$ maintenance cost for future updates. It is cached for every token associated with that word in the local partition. Also note that the fractional terms in $u_k$ and $v_k$ are identical, so the coefficients of $v$ can be precomputed along with $u$ at no additional cost. To obtain $v$, we exploit sparsity in $n_{d\cdot}$, which requires $O(K_d)$ time. Note that due to the dense fractional term and the less concentrated mass partition, this algorithm is not as efficient as the sparse sampler in [20]. However, as stated above, it makes full use of the inverted index structure required by model-parallel inference. In Section 5, we show that the disadvantage of a non-optimal sampling algorithm is outweighed as the benefit of model-parallelism becomes salient.
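A minimal sketch of the two-bucket draw in Eq. (3), assuming the per-word dense bucket `u` and the shared coefficients $(n_{wk}+\beta)/(n_k+V\beta)$ have already been precomputed; the function name, argument layout, and the order in which the buckets are probed are our own illustrative choices, not the paper's code.

```cpp
#include <random>
#include <utility>
#include <vector>

// Two-bucket sampling per the word-first decomposition (3): the dense word
// bucket u (precomputed once per word, total mass u_sum) and the sparse
// document bucket v, rebuilt per token from the nonzero entries of n_{dk}.
// coeff[k] = (n_{wk} + beta) / (n_k + V*beta), shared by both buckets.
int sample_word_first(const std::vector<double>& u, double u_sum,
                      const std::vector<double>& coeff,
                      const std::vector<std::pair<int, int>>& ndk_sparse,
                      std::mt19937& rng) {
    // Build the sparse bucket: only topics with n_{dk} > 0 contribute mass.
    double v_sum = 0.0;
    std::vector<std::pair<int, double>> v;
    for (size_t i = 0; i < ndk_sparse.size(); ++i) {
        double m = ndk_sparse[i].second * coeff[ndk_sparse[i].first];
        v.push_back(std::make_pair(ndk_sparse[i].first, m));
        v_sum += m;
    }
    std::uniform_real_distribution<double> unif(0.0, u_sum + v_sum);
    double x = unif(rng);
    if (x < v_sum) {                 // probe the sparse bucket first
        for (size_t i = 0; i < v.size(); ++i) {
            if (x < v[i].second) return v[i].first;
            x -= v[i].second;
        }
        return v.back().first;
    }
    x -= v_sum;                      // fall through to the dense word bucket
    for (size_t k = 0; k < u.size(); ++k) {
        if (x < u[k]) return (int)k;
        x -= u[k];
    }
    return (int)u.size() - 1;
}
```

Probing the sparse bucket first pays off whenever $v$ carries most of the mass, which is typical once documents concentrate on a few topics.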
5 Experiments
In this section we quantitatively evaluate the proposed model-parallel inference for LDA. Our chosen baseline is Yahoo!LDA [1], a popular, publicly-available distributed implementation of the sparse Gibbs sampler [20]. Another notable baseline is Google's PLDA+ [13], which has similar token sampling throughput to Yahoo!LDA: roughly, both process 20K tokens per compute core per second on a medium-sized cluster with 10-100 machines. Since the sampling throughputs of Yahoo!LDA and Google PLDA+ are similar, we only compare to Yahoo!LDA. In our experiments, we show that our method, while having similar sampling throughput to Yahoo!LDA (and PLDA+), converges significantly faster per iteration, because our careful, word-partitioned model-parallel design significantly reduces synchronization errors in the word-topic table (as in Figure 3). We also attempted to compare with the topic modeling toolkit in GraphLab [7]; however, in all of the experiments it failed to initialize due to excessive memory consumption. This is understandable, since it is an application built on top of a general-purpose system rather than a performance-driven implementation of the algorithm; hence we omit its results hereinafter.
Experiment settings: To demonstrate effectiveness across different hardware, we conduct experiments on two disparate clusters [6]: a high-end cluster with 64-core machines, and a low-end cluster with 2-core machines. Specifically, the high-end cluster contains 10 machines connected via a 40Gbps Ethernet network interface, each equipped with quad-socket 16-core AMD Opteron 6272 (2.1GHz) processors and 128GB RAM. The low-end cluster consists of 128 machines connected via 1Gbps Ethernet, with dual-socket AMD Opteron 252 (2.6GHz) processors and 8GB RAM in each machine. The model-parallel inference is fully implemented in C++11. Note that although the high-end machines are NUMA nodes, for fair comparison we do not include any NUMA-specific optimization.
Dataset: We use Pubmed (https://archive.ics.uci.edu/ml/datasets/Bag+of+Words) and the 3.9M-document English Wikipedia abstracts (http://wiki.dbpedia.org/Downloads39#extendedabstracts) as our datasets. We further construct an augmented corpus by extracting bigrams (2 consecutive tokens) from the Wikipedia corpus. Pubmed contains 8.2M documents and about 737.9M tokens. The original Wikipedia dataset contains 2.5M unique words and 179M tokens, while the bigram corpus contains 21.8M unique phrases. We note that this bigram vocabulary of 21.8M is almost an order of magnitude larger than that of [1], and clearly demonstrates our scalability to very large model sizes. Our experiments used from 1000 up to 10000 topics, which results in extremely large word-topic tables: 25B elements in the unigram case, and 218B elements in the bigram case. The model size is about 60 times larger than the recent result of [1].
Evaluation: We choose training log-likelihood as our surrogate measure of convergence because 1) LDA Gibbs samplers tend to converge to one of many possible local optima in the space of possible word-topic and doc-topic tables, and 2) the progress of the sampler toward a local optimum correlates well with the rise, and then plateauing, of the training log-likelihood measured on the latest sample. Since the LDA Gibbs sampler is unlikely to leave a local optimum once it has reached one, the algorithm can be safely terminated once the log-likelihood plateaus. One might ask why we did not employ test-data perplexity as our surrogate, as many practitioners do. We caution that this metric is improper for evaluating competing inference systems (on the same model), though it is suitable for evaluating the goodness of different model designs (using the same inference system). For instance, it can be used to evaluate different flavors of LDA, or alternative models, in terms of how well they capture training data characteristics and generalize to new data. However, our focus is on inference quality and efficiency on the same model, not the goodness of competing models; all systems/algorithms we tested perform inference for the standard LDA model, and therefore (the difference in) model generalization is not the issue under investigation. Moreover, because an inference algorithm learns model parameters and variables only from the training data, it is only appropriate to track its convergence as a function of the training data. Using test-data perplexity introduces additional confounding factors, particularly how well each training-data local optimum generalizes to the test data; this is a confounding factor because sampler algorithms are not designed to control which training optimum they eventually reach. The point is that training-data log-likelihood controls for external factors better than test-data perplexity, in the context of measuring inference speed and accuracy.
5.1 Convergence
We first compare the convergence speed of the different methods on the high-end cluster, using the Pubmed dataset with 1000 and 5000 topics. Figure 2(a) shows the log-likelihood at each iteration. We observe that model-parallel inference achieves greater per-iteration progress than the data-parallel approach; in other words, our method requires far fewer iterations to reach a given likelihood. Figure 2(b) shows the log-likelihood as a function of elapsed time, with similar trends to the per-iteration plot. This again shows the effect of sampling from correct distributions: dynamic model partitioning seamlessly handles the dependency within the model, whereas the data-parallel approach suffers slow convergence, especially at the beginning, due to drastic changes in the model copies across worker nodes.
We also examine the effect of lazy synchronization of $n_k$, which breaks the independence between workers. As mentioned in Section 3, $n_k$ is only synchronized at the beginning of each round, and is therefore free to go out of sync for the rest of the round. We relax the consistency requirement based on the intuition that minor changes in huge counts will not affect the overall result much. We now show that the induced error is almost negligible in practice.
As a proxy for the error made in each round, we measure the difference between the true $\mathbf{n} = (n_1, \dots, n_K)$ and its local copy $\hat{\mathbf{n}}^{(p)}$ on worker $p$ at the end of each round. Specifically, we define the error at each round as $\epsilon = \frac{1}{P}\sum_{p=1}^{P} \|\hat{\mathbf{n}}^{(p)} - \mathbf{n}\|_1 / (2N)$, where $N$ is the total number of tokens in the corpus. In other words, we compute the normalized distance between each worker's local copy and the true value, and average over all workers. As a result, $\epsilon$ lies in $[0, 1]$, where $0$ denotes no error. Figure 3 shows the error collected on the high-end cluster using the Pubmed dataset. We observe that the error immediately drops toward 0 and stays close to it for the rest of the inference procedure, demonstrating that our method exhibits very small parallelization error and hence faster convergence.
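One worker's contribution to this metric is a normalized L1 distance, sketched below. The $1/(2N)$ normalization is our assumption (both vectors sum to $N$, so the L1 distance is at most $2N$, bounding the result in $[0, 1]$ as the text requires); the function name is also ours.

```cpp
#include <cstdlib>
#include <vector>

// Normalized L1 distance between a worker's stale copy of the global topic
// count vector and the true one. Both vectors sum to N (the corpus token
// count), so dividing the L1 distance by 2N keeps the value in [0, 1],
// with 0 meaning the copies agree exactly.
double topic_count_error(const std::vector<long>& local,
                         const std::vector<long>& truth, long N) {
    long diff = 0;
    for (size_t k = 0; k < truth.size(); ++k)
        diff += std::labs(local[k] - truth[k]);
    return diff / (2.0 * N);
}
```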
5.2 Model Size
We demonstrate our ability to handle big models in Table 1. Yahoo!LDA starts to fail at the problem size of 2.5M vocabulary and 10000 topics, because the local copy of the model no longer fits into memory, even though it only stores keys that appear in the local subset of the data. In contrast, by sharding the model into blocks, our method effectively handles bigger models. As shown in the table, the model-parallel approach is able to perform inference on all configurations of model size, including the biggest one, the bigram dataset with 10000 topics, indicating our ability to handle a model with over 200 billion variables on nothing more than a low-end cluster. In addition, we observe faster convergence in the small-model setting compared to Yahoo!LDA, indicating that model-parallelism is effective not only for big models but also for moderate-sized models. All of this clearly demonstrates the effectiveness of the dynamic model partitioning strategy.
Corpus          |     Wiki-unigram     |      Wiki-bigram
Model-Parallel  |  2.3 hr  |   5.0 hr  |  8.9 hr  |  12 hr*
Yahoo!LDA [1]   |  11.8 hr |   N/A     |  N/A     |  N/A

(*terminated by the cluster)
5.3 Scalability
In Figure 4(a), we show the total memory footprint of each worker as the number of computing nodes increases, using the unigram dataset. In the ideal case, as the number of machines doubles, the per-machine memory consumption should be halved. We observe that model-parallel inference scales nearly ideally with the number of machines. Although our method starts with a higher memory footprint, it closely follows the ideal trend and drops to a much lower level, indicating that the dynamic model partitioning scheme effectively makes use of additional memory without unnecessary duplication. Yahoo!LDA's per-machine memory usage, by contrast, is almost constant, because its data-parallel strategy requires most of the word-topic table to be stored on each machine, indicating that adding machines will not solve big-model problems.
We also show convergence speedup as a function of the number of machines. In Figure 4(b), we show the speedup in terms of convergence time on different numbers of machines for a fixed model size (unigram dataset with 5000 topics). Interestingly, we observe that Yahoo!LDA performs worse with 32 machines. This can be explained by network congestion in the low-end cluster: since the models are frequently synchronized between all nodes, network traffic grows with the number of machines. Parameters are thus more likely to be out of date as the number of nodes increases under low bandwidth, introducing more error into the overall procedure. By contrast, the curve for model-parallel inference follows the ideal speedup trend closely, showing that it effectively utilizes additional computational resources without significant overhead. Unlike full-mesh synchronization, the on-demand communication strategy in model-parallel inference greatly reduces traffic through managed synchronization, while guaranteeing model correctness. This demonstrates the ability of model-parallel inference to handle large-scale inference problems on low-end clusters.
6 Conclusion
In this paper, we presented a modelparallel inference for LDA, motivated by the pitfalls of dataparallelism in distributed inference. We proposed a system that implements modelparallelism on top of dataparallelism, and show empirical results on improved time and memory efficiency over other approaches. In a word, modelparallelism not only eliminates dependency between inference processes but also brings the capability of handling big models. Therefore without drastic change in the algorithm itself, e.g., using crafted MetropolisHasting to speed up the sampler, we can already improve the algorithm significantly just by careful arrangement of model blocks.
We expect this idea to apply to a broader class of models as well, enabling scale-up without sophisticated algorithmic tweaks. A first attempt at generalized model-parallelism can be found in [11], which nonetheless deserves further investigation. We are also interested in employing model-parallelism in more challenging tasks, for example Bayesian nonparametric models such as the Hierarchical Dirichlet Process (HDP) [18] and regularized Bayesian models such as MedLDA [23].
References
 [1] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. Smola. Scalable Inference in Latent Variable Models. In International Conference on Web Search and Data Mining (WSDM), 2013.
 [2] A. Asuncion, P. Smyth, and M. Welling. Asynchronous distributed learning of topic models. In NIPS, 2008.
 [3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
 [4] T. Broderick, N. Boyd, A. Wibisono, A. C. Wilson, and M. I. Jordan. Streaming Variational Bayes. In NIPS, 2013.
 [5] J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
 [6] G. Gibson, G. Grider, A. Jacobson, and W. Lloyd. Probe: A thousand-node experimental cluster for computer systems research. USENIX ;login:, 38, 2013.
 [7] J. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2012.
 [8] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences (PNAS), 5228–5235, 2004.
 [9] Q. Ho, J. Cipar, H. Cui, J. K. Kim, S. Lee, P. B. Gibbons, G. Gibson, G. R. Ganger, and E. P. Xing. More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server. In NIPS, 2013.
 [10] M. Johnson, J. Saunderson, and A. Willsky. Analyzing Hogwild Parallel Gaussian Gibbs Sampling. In NIPS, 2013.
 [11] S. Lee, J. K. Kim, X. Zheng, Q. Ho, G. Gibson, and E. P. Xing. Primitives for Dynamic Big Model Parallelism. In NIPS, 2014.
 [12] M. Li, L. Zhou, Z. Yang, A. Li, F. Xia, D. G. Andersen, and A. J. Smola. Parameter server for distributed machine learning. In Workshop on Big Learning, NIPS, 2013.
 [13] Z. Liu, Y. Zhang, E. Y. Chang, and M. Sun. PLDA+: Parallel Latent Dirichlet Allocation with Data Placement and Pipeline Processing. ACM Transactions on Intelligent Systems and Technology, special issue on Large Scale Machine Learning, 2011.
 [14] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin and J. M. Hellerstein. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. PVLDB, 2012.
 [15] R. Nallapati, W. Cohen, and J. Lafferty. Parallelized Variational EM for Latent Dirichlet Allocation: An Experimental Evaluation of Speed and Scalability. In Proceedings of the Seventh IEEE International Conference on Data Mining Workshops, 2007.
 [16] D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed inference for latent Dirichlet allocation. In NIPS, 2007.
 [17] B. Recht, C. Ré, S. J. Wright, and F. Niu. Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. In NIPS, 2011.
 [18] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
 [19] F. Yan, N. Xu, and Y. Qi. Parallel Inference for Latent Dirichlet Allocation on Graphics Processing Units. In NIPS, 2009.
 [20] L. Yao, D. Mimno, and A. McCallum. Efficient methods for topic model inference on streaming document collections. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2009.
 [21] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster Computing with Working Sets. In HotCloud, 2010.
 [22] K. Zhai, J. Boyd-Graber, N. Asadi, and M. Alkhouja. Mr. LDA: A Flexible Large Scale Topic Modeling Package using Variational Inference in MapReduce. In Proceedings of the 21st International World Wide Web Conference (WWW), 2012.
 [23] J. Zhu, A. Ahmed, and E. Xing. MedLDA: maximum margin supervised topic models. Journal of Machine Learning Research, 13:2237–2278, 2012.