1 Introduction
Mixture models (Everitt, 1981) and topic models (Blei et al., 2003)
are fundamental to Bayesian unsupervised learning. These models find a set of
clusters or topics useful for exploring an input dataset. Mixture models assume the input data is fully exchangeable, while topic models extend mixtures to handle datasets organized by groups of observations, such as documents or images.Mixture and topic models have two kinds of latent variables. Global parameters define each cluster, including its frequency and the statistics of associated data. Local
, discrete assignments then determine which cluster explains a specific data observation. For both global and local variables, Bayesian analysts wish to estimate a posterior distribution. For these models, full posterior inference via Markov chain Monte Carlo (MCMC,
Neal (1992)) averages over sampled cluster assignments, producing asymptotically exact estimates at great computational cost. Optimization algorithms like expectation maximization (EM, Dempster et al. (1977)) or (mean field) variational Bayes (Ghahramani and Beal, 2001, Winn and Bishop, 2005) provide faster, deterministic estimates of cluster assignment probabilities. However, at each observation these methods give positive probability to every cluster, requiring dense storage and limiting scalability.This paper develops new posterior approximations for local assignment variables which allow optimizationbased inference to scale to hundreds or thousands of clusters. We show that adding an additional sparsity constraint to the standard variational optimization objective for local cluster assignments leads to big gains in processing speed. Unlike approaches restricted to hard, winnertakeall assignments, our approach offers a tunable parameter that determines how many clusters have nonzero mass in the posterior for each observation. Our approach fits into any variational algorithm, regardless of whether global parameters are inferred by point estimates (as in EM) or given full approximate posteriors. Furthermore, our approach integrates into existing frameworks for largescale data analysis (Hoffman et al., 2013, Broderick et al., 2013) and is easy to parallelize. Our open source Python code^{1}^{1}1http://bitbucket.org/michaelchughes/bnpydev/ exploits an efficient C++ implementation of selection algorithms (Blum et al., 1973, Musser, 1997) for scalability.
a: Overall Timings,  b: RespFromWeights step  c: Summary step  d: Distance from dense 

on the speed and accuracy of estimating a zeromean Gaussian mixture model for 8x8 pixel patches from natural images.
a: Comparison of algorithm substep costs for 2 complete passes through entire 3.6 million patch dataset from Sec. 4 with clusters. b: Speed comparison of our sparse approach when computing responsibilities given fixed weights for 36,000 patches. c: Speed comparison of our sparse approach when computing the percluster statistics defined in Eq. (7) for 36,000 patches. d: Cumulative density function of variational distance between dense and sparse responsibilities across 36,000 patches, using the pretrained mixture model published online by Zoran and Weiss (2012).2 Variational Inference for Mixture Models
Given
observed data vectors
, a mixture model assumes each observation belongs to one of clusters. Let hidden variable denote the specific cluster assigned to . The mixture model has two sets of global parameters, the cluster frequencies and cluster shapes . Let for scalar , where is the probability of observing data from cluster : . We generate observation according to likelihood(1) 
The exponential family density has sufficient statistics and natural parameter . The normalization function ensures that integrates to one. We let , where is a density conjugate to with parameter . Conjugacy is convenient but not necessary: we only require that the expectation can be evaluated in closedform.
Meanfield variational inference (Wainwright and Jordan, 2008) seeks a factorized posterior . Each posterior factor has free parameters (denoted with hats) that are optimized to minimize the KL divergence between the simplified approximate density and the true, intractable posterior. The separate factors for local and global parameters have specially chosen forms:
(2) 
Our focus is on the free parameter which defines the local assignment posterior . This vector is nonnegative and sums to one. We interpret value
as the posterior probability of assigning observation
to cluster . This is sometimes called cluster ’s responsibility for observation .The goal of variational inference is to find the optimal free parameters under a specific objective function . Using full approximate posteriors of global parameters yields the evidence lowerbound objective function in Eq. (3) which is equivalent to minimizing KL divergence (Wainwright and Jordan, 2008). Point estimation of global parameters instead yields a maximumlikelihood (ML) objective in Eq. (4).
(3)  
(4) 
Closedform expressions for both objectives are in Appendix A. Given an objective function, optimization typically proceeds via coordinate ascent (Neal and Hinton, 1998). We call the update of the dataspecific responsiblities the local step, which is alternated with the global update of or .
2.1 Computing dense responsiblities during local step
The local step computes a responsibility vector for each observation that maximizes given fixed global parameters. Under either the approximate posterior treatment of global parameters in Eq. (3) or ML objective of Eq. (4), the optimal update (dropping terms independent of ) maximizes the following objective function:
(5) 
We interpret as the log posterior weight that cluster has for observation . Larger values imply that cluster is more likely to be assigned to observation . For ML learning, the expectations defining are replaced with point estimates.
Our goal is to find the responsibility vector that optimizes in Eq. (5), subject to the constraint that is nonnegative and sums to one so is a valid density:
(6) 
The optimal solution is simple: exponentiate each weight and then normalize the resulting vector. The function DenseRespFromWeights in Alg. 1 details the required steps. The runtime cost is , and is dominated by the required evaluations of the exp function.
2.2 Computing sufficient statistics needed for global step
Given fixed assignments , the global step computes the optimal values of the global free parameters under . Whether doing point estimation or approximate posterior inference, this update requires only two finitedimensional sufficient statistics of , rather than all values. For each cluster , we must compute the expected count of its assigned observations and the expected data statistic vector :
(7) 
The required work is for the count vector and for the data vector.
1: : log posterior weights. 2: : responsibility values 3:def DenseRespFromWeights() 4: for do 5: 6: 7: for do 8: 9: return 
1: : log posterior weights. 2: : resp. and indices 3:def TopLRespFromWeights(, ) 4: 5: for do 6: 7: 8: for do 9: 10: return 
3 Fast Local Step for Mixtures via Sparse Responsibilities
Our key contribution is a new variational objective and algorithm that scales better to large numbers of clusters . Much of the runtime cost for standard variational inference algorithms comes from representing as a dense vector. Although there are total clusters, for any observation only a few entries in will have appreciable mass while the vast majority are close to zero. We thus further constrain the objective of Eq. (6) to allow at most nonzero entries:
(8) 
The function TopLRespFromWeights in Alg. 1 solves this constrained optimization problem. First, we identify the indices of the top values of the weight vector in descending order. Let denote these topranked cluster indices, each one a distinct value in . Given this active set of clusters, we simply exponentiate and normalize only at these indices. We can represent this solution as an sparse vector, with real values and integer indices . Solutions are not unique if the posterior weights contain duplicate values. We handle these ties arbitrarily, since swapping duplicate indices leaves the objective unchanged.
3.1 Proof of optimality.
We offer a proof by contradiction that TopLRespFromWeights solves the optimization problem in Eq. (8). Suppose that is optimal, but there exists a pair of clusters such has larger weight but is not included in the active set while is. This means , but and . Consider the alternative which is equal to vector but with entries and swapped. After substituting into Eq. (5) and simplifying, we find the objective function value increases under our alternative: . Thus, the optimal solution must include the largest clusters by weight in its active set.
3.2 Runtime cost.
Alg. 1 compares DenseRespFromWeights and our new algorithm TopLRespFromWeights sidebyside. The former requires exponentiations, additions, and divisions to turn weights into responsibilities. In contrast, given the indices our procedure requires only of each operation. Finding the active indices via SelectTopL requires runtime.
Selection algorithms (Blum et al., 1973, Musser, 1997) are designed to find the top values in descending order within an array of size . These methods use divideandconquer strategies to recursively partition the input array into two blocks, one with values above a pivot and the other below. Musser (1997) introduced a selection procedure which uses introspection to smartly choose pivot values and thus guarantee worstcase runtime. This procedure is implemented within the C++ standard library as nth_element, which we use for SelectTopL in practice. This function operates inplace on the provided array, rearranging its values so that the first entries are all bigger than the remainder. Importantly, there is no internal sorting within either partition. Example code is in found in Appendix F.
Choosing sparsitylevel naturally trades off execution speed and training accuracy. When , we recover the original dense responsibilities, while
assigns each point to exactly one cluster, as in kmeans. Our focus is on modest values of
. Fig. 1b shows that for large values TopLRespFromWeights is faster than DenseRespFromWeights for or . The dense method’s required exponentiations dominates the introspective selection procedure.With sparse responsibilities, computing the statistics in Eq. (7) scales linearly with rather than . This gain is useful when applying Gaussian mixture models with unknown covariances to image patches, where each 8x8 patch requires an expensive 4096dimensional data statistic . Fig. 1c shows the cost of the summary step virtually disappears when rather than . This savings makes the overall algorithm over twice as fast (Fig. 1a), with the remaining bottleneck the dense calculation of weights , which might be sped up for some likelihoods using fast data structures for finding nearestneighbors. Fig. 1d shows that captures nearly identical responsibility values as , indicating that modest values may bring speed gains without noticeable sacrifice of model quality.
3.3 Related work.
Hard assignments.
One widespread practice used for decades is to consider “hard” assignments, where each observation is assigned to a single cluster, instead of a dense vector of responsibilities. This is equivalent to setting in our sparse formulation. The kmeans algorithm (Lloyd, 1982) and its nonparametric extension DPmeans (Kulis and Jordan, 2012) justify
sparsity via smallvariance asymptotics. Socalled “hard EM” Viterbi training
(Juang and Rabiner, 1990), or maximizationexpectation algorithms (Kurihara and Welling, 2009) both use hard assignments. However, we expect to be too coarse for many applications while moderate values like offer better approximations, as shown in Fig. 1d.Sparse EM.
A prominent early method to exploit sparsity in responsibilities is the Sparse EM algorithm proposed by Neal and Hinton (1998). Sparse EM maintains a dense vector for each observation , but only edits a subset of this vector during each local step. The edited subset may consist of the largest entries or all entries above some threshold. Any inactive entries are “frozen” to current nonzero values and newly edited entries are normalized such that the length vector preserves its sumtoone constraint.
Sparse EM can be effective for small datasets with a few thousand examples and has found applications such as MRI medical imaging (Ng and McLachlan, 2004). However, our sparse approach has three primary advantages relative to Sparse EM: (1) Our sparse method requires less perobservation memory for responsibilities. While Sparse EM must store floatingpoint values to represent a responsibility vector, we need to store only . (2) Our sparse method easily scales to minibatchbased training algorithms in Sec. 3.4, but Sparse EM’s required storage is prohibitive. Our approach can safely discard responsiblity vectors after required sufficient statistics are computed. Sparse EM must explicitly store responsibilities for every observation in the dataset at cost if future sparse updates are desired. This prohibits scaling to millions of examples by processing small minibatches, unless each minibatch has its full responsibility array written to and from disk when needed. (3) We proved in Sec. 3.1 that topL selection is the optimal way to compute sparse responsibilities and monotonically improve our training objective function. Neal and Hinton (1998)
suggest this selection method only as a heuristic without justification.
Expectation Truncation.
When we undertook most of this research, we were unaware of a related method by Lücke and Eggert (2010) called Expectation Truncation
which constrains the approximate posterior probabilities of discrete or multivariate binary variables to be
sparse. Lücke and Eggert (2010) considered nonnegative matrix factorization and sparse coding problems. Later extensions applied this core algorithm to mixturelike sprite models for cleaning images of text documents (Dai and Lücke, 2014) and spikeandslab sparse coding (Sheikh et al., 2014). Our work is the first to apply sparse ideas to mixture models and topic models.The original Expectation Truncation algorithm (Lücke and Eggert, 2010, Alg. 1) expects a userdefined selection function to identify the entries with nonzero responsibility for a specific observation. In practice, the selection functions they suggest are chosen heuristically, such as the upper bound in Eq. 28 of (Lücke and Eggert, 2010). The original authors freely admit these selection functions are not optimal and may not monotonically improve the objective function (Lücke and Eggert, 2010, p. 2869). In contrast, we proved in Sec. 3.1 that top selection will optimally improve our objective function.
One other advantage of our work over previous Expectation Truncation efforts are our thorough experiments exploring how different values impact training speed and predictive power. Comparisons over a range of possible values on real datasets are lacking in Lücke and Eggert (2010) and other papers. Our key empirical insight is that modest values like are frequently better than , especially for topic models.
3.4 Scalabilty via minibatches
Stochastic variational inference (SVI).
Introduced by Hoffman et al. (2010), SVI scales standard coordinate ascent to large datasets by processing subsets of data at a time. Our proposed sparse local step fits easily into SVI. At each iteration , SVI performs the following steps: (1) sample a batch from the full dataset, uniformly at random; (2) for each observation in the batch, do a local step to update responsibilities given fixed global parameters; (3) update the global parameters by stepping from their current values in the direction of the natural gradient of the rescaled batch objective . This procedure is guaranteed to reach a local optima of if the step size of the gradient update decays appropriately as increases (Hoffman et al., 2013).
Incremental algorithms (MVI).
Inspired by incremental EM (Neal and Hinton, 1998), Hughes and Sudderth (2013) introduced memoized variational inference (MVI). The data is divided into a fixed set of batches before iterations begin. Each iteration completes four steps: (1) select a single batch to visit; (2) for each observation in this batch, compute optimal local responsibilities given fixed global parameters and summarize these into sufficient statistics for batch ; (3) incrementally update s wholedataset statistics given the new statistics for batch ; (4) compute optimal global parameters given the wholedataset statistics. The incremental update in step (3) requires caching (or “memoizing”) the summary statistics in Eq. (7) at each batch. This algorithm has the same periteration runtime as stochastic inference, but guarantees the monotonic increase of the objective when the local step has a closedform solution like the mixture model. Its first pass through the entire dataset is equivalent to streaming variational Bayes (Broderick et al., 2013).
4 Mixture Model Experiments
We evaluate dense and sparse mixture models for natural images, inspired by Zoran and Weiss (2012)
. We train a model for 8x8 image patches taken from overlapping regular grids of stride 4 pixels. Each observation is a vector
, preprocessed to remove its mean. We then apply a mixture model with zeromean, fullcovariance Gaussian likelihood function . We set concentration . To evaluate, we track the loglikelihood score of heldout observations under our trained model, defined as . Here, and are point estimates computed from our trained global parameters using standard formulas. The functionis the probability density function of a multivariate normal.
Fig. 2 compares sparse implementations of SVI and MVI on 3.6 million patches from 400 images. The algorithms process minibatches each with patches. We see the sparse methods consistently reach good predictive scores 24 times faster than dense runs do (note the logscale of the time axis). Finally, modestly sparse runs often reach higher values of heldout likelihood than hard runs, especially in the and plots for SVI (red).
a: Overall Timings, K=800  b: Local Step + Restarts  c: Summary Step  d: Distance from 

5 Fast Local Step for Topic Models via Sparse Responsibilities
We now develop a sparse local step for topic models. Topic models (Blei, 2012) are hierarchical mixtures applied to discrete data from documents, . Let each document consist of observed word tokens from a fixed vocabulary of word types, though we could easily build a topic model for observations of any type (real, discrete, etc.). Each document contains observed word tokens , where token identifies the type of the th word.
The latent Dirichlet allocation (LDA) topic model (Blei et al., 2003) generates a document’s observations from a mixture model with common topics but documentspecific frequencies . Each topic , where is the probability of type under topic . The documentspecific frequencies are drawn from a symmetric Dirichlet , where is a scalar. Assignments are drawn , and then the observed words are drawn .
The goal of posterior inference is to estimate the common topics as well as the frequencies and assignments in any document. The standard meanfield approximate posterior (Blei et al., 2003) is:
(9) 
Under this factorization, we again set up a standard optimization objective as in Eq. (3). Complete expressions are in Appendix C. We optimize this objective via coordinate ascent, alternating between local and global steps. Our focus is the local step, which requires updating both the assignment factor and the frequencies factor for each document . Next, we derive an interative update algorithm for estimating the assignment factor and the frequencies factor for a document . Alg. 2 lists the conventional algorithm and our new sparse version.
Documenttopic update.
Following (Blei et al., 2003), we have a closedform update for each topic : . This assumes that responsibilities have been summarized into counts of the number of tokens assigned to topic in document : .
Responsibility update.
As in (Blei et al., 2003), the optimal update for the dense responsibilities for token has a closed form like the mixture model, but with documentspecific weights:
(10)  
We can incorporate our sparse constraint from Eq. (8) to obtain sparse rather than dense responsibilties. The procedure TopLRespFromWeights from Alg. 1 still provides the optimal solution.
Iterative joint update for dense case.
Following standard practice for dense assignments (Blei et al., 2003), we use a blockcoordinate ascent algorithm that iteratively updates and using the closedform steps above. To initialize the update cycle, we recommend setting the initial weights as if the documenttopic frequencies are uniform: . This lets the topicword likelihoods drive the initial assignments. We then alternate updates until either a maximum number of iterations is reached (typically 100) or the maximum change in documenttopic counts falls below a threshold (typically 0.05). Appendix D provides a detailed algorithm. Fig. 3a compares the runtime cost of the local, summary, and global steps of the topic model, showing that the local iterations dominate the overall cost.
Iterative joint update with sparsity.
Our new sparse constraint on responsibilities leads to a fast local step algorithm for topic models. This procedure has two primary advantages over the dense baseline. First, we use TopLRespFromWeights to update the pertoken responsibilities , resulting in faster updates. Second, we further assume that once a topic’s mass decays near zero, it will never rise again. With this assumption, at every iteration we identify the set of active topics (those with nonneglible mass) in the document: . Only these topics will have weight large enough to be chosen in the top for any token. Thus, throughout local iterations we consider only the active set of topics, reducing all steps from cost to cost .
Discarding topics within a document when mass becomes very small is justified by previous empirical observations of the “digamma problem” described in Mimno et al. (2012): for topics with negligible mass, the expected log prior weight becomes vanishingly small. For example, for and , and gets smaller as increases. In practice, after the first few iterations the active set stabilizes and each token’s top topics rarely change while the relative responsibilities continue to improve. In this regime, we can reduce runtime cost by avoiding selection altogether, instead just reweighting each token’s current set of top topics. We perform selection for the first 5 iterations and then only every 10 iterations, which yields large speedups without loss in quality.
Fig. 3 compares the runtime of our sparse local step across values of sparsitylevel against a comparable implementation of the standard dense algorithm. Fig. 3b shows that our sparse local step can be at least 3 times faster when . Larger values lead to even larger gains. Fig. 3c shows that sparsity improves the speed of the summary step, though this step is less costly than the local step for topic models. Finally, Fig. 3d shows that modest sparsity yields documenttopic distributions very close to those found by the dense local step, while is much coarser.
Restart proposals.
In scalable applications, we assume that we cannot afford to store any documentspecific information between iterations. Thus, each time we visit a document we must infer both and from scratch. This joint update is nonconvex, and thus our recommended coldstart initialization for is not guaranteed to monotonically improve across repeat visits to a document. However, even if we could store document counts across iterations we find warmstarting often gets stuck in poor local optima (see Fig. 5 of Appendix D). Instead, we combine coldstarting with restart proposals. Hughes et al. (2015) introduced restarts as a postprocessing step for the single document local iterations that results in solutions with better objective function scores. Given some fixed point , the restart proposal constructs a candidate by forcing all responsibility mass on some active topic to zero and then running a few iterations forward. We accept the new proposal if it improves the objective . These proposals escape local optima by finding nearby solutions which favor the prior’s bias toward sparse documenttopic probabilities. They are frequently accepted in practice (4080% in a typical Wikipedia run), so we always include them in our sparse and dense local steps.
Related work.
MCMC methods specialized to topic models of text data can exploit sparsity for huge speed gains. SparseLDA (Yao et al., 2009) is a clever decomposition of the Gibbs conditional distribution to make each pertoken assignment step cost less than . AliasLDA (Li et al., 2014) and LightLDA (Yuan et al., 2015) both further improve this to amortized . These methods are still limited to hard assignments and are only applicable to discrete data. In contrast, our approach allows expressive intermediate sparsity and can apply to a broader family of mixtures and topic models for realvalued data.
More recently, several efforts have used MCMC samplers to approximate the local step within a larger variational algorithm (Mimno et al., 2012, Wang and Blei, 2012). They estimate an approximate posterior by averaging over many samples, where each sample is an hard assignment. The number of finite samples needs to be chosen to balance accuracy and speed. In contrast, our sparsitylevel provides more intuitive control over approximation accuracy and optimizes exactly, not just in expectation.
1: 2: : documenttopic smoothing scalar 3: : log prob. of word in topic 4: 5: : word type/count pairs for doc. 6: 7: 8: : dense responsibilities for doc 9: : topic pseudocounts for doc 10:def DenseStepForDoc() 11: for do 12: DenseRespFromWeights 13: while not converged do 14: for do 15: 16: Implicit 17: for do 18: for do 19: 20: DenseRespFromWeights 21: for do 22: 23: 24: return 
1: 2: : documenttopic smoothing scalar 3: : log prob. of word in topic 4: 5: : word type/count pairs for doc. 6: : integer sparsity level 7: 8: : sparse responsibilities and indices 9: : topic pseudocounts for doc 10:def LSparseStepForDoc() 11: for do 12: TopLRespFromW 13: for do 14: 15: 16: while not converged do 17: for do 18: 19: for do 20: for do 21: 22: TopLRespFromW 23: for do 24: 25: 26: for do 27: 28: return 
6 Topic Model Experiments
We compare our sparse implementations of MVI and SVI to external baselines: SparseLDA (Yao et al., 2009), a fast implementation of standard Gibbs sampling (Griffiths and Steyvers, 2004); and SVIGibbs (Mimno et al., 2012), a stochastic variational method that uses Gibbs sampling to approximate local gradients. These algorithms use Java code from Mallet (McCallum, 2002). We also compare to the public C++ implementation of LightLDA (Yuan et al., 2015). External methods use their default initialization, while we sample diverse documents using the Bregman divergence extension (Ackermann and Blömer, 2009) of kmeans++ (Arthur and Vassilvitskii, 2007) to initialize our approximate topicword posterior .
For our methods, we explore several values of sparsitylevel . LightLDA and SparseLDA have no tunable sparsity parameters. SVIGibbs allows specifying the number of samples used to approximate . We consider , always discarding half of these samples as burnin. For all methods, we set documenttopic smoothing and topicword smoothing . We set the stochastic learning rate at iteration to . We use grid search to find the best heldout score on validation data, considering delay and decay .
Fig. 4 compares these methods on 3 datasets: 1392 NIPS articles, 7961 Wikipedia articles, and 1.8 million New York Times articles. Each curve represents the best of many random initializations. Following Wang et al. (2011), we evaluate via heldout likelihoods via a document completion task. Given a test document , we divide its words at random by type into two pieces: 80% in and 20% in . We use set A to estimate documenttopic probabilities , and then evaluate this estimate on set B by computing . See supplement for details. Across all datasets, our conclusions are:
Moderate sparsity tends to be best.
Throughout Fig. 4, we see that runs with sparsitylevel under both memoized and stochastic algorithms converge several times faster than , but yield indistinguishable predictions. For example, on Wikipedia with both MVI and SVI plateau after 200 seconds with , but require over 1000 seconds for best performance with .
Hard assignments can fail catastrophically.
We suspect that is too coarse to accurately capture multiple senses of vocabulary words, instead favoring poor local optima where each word is attracted to a single best topic without regard for other words in the document. In practice, may either plateau early at noticeably worse performance (e.g., NIPS) or fall into progressively worse local optima (e.g., Wiki). This failure mode can occur because MVI and SVI for topic models both reestimate and from scratch each time we visit a document.
Baselines converge slowly.
Throughout Fig. 4, few runs of SparseLDA or SVIGibbs reaches competitive predictions in the allowed time limit (3 hours for NIPS and Wiki, 2 days for NYTimes). SVIGibbs benefits from using instead of samples only on NYTimes. More than 10 samples did not improve performance further. As expected, LightLDA has higher raw throughput than our sparse MVI or SVI methods, and for small datasets eventually makes slightly better predictions when . However, across all values we find our sparse methods reach competitive values faster, especially on the large NYTimes dataset. For large we find LightLDA never catches up in the allotted time. Note that LightLDA’s speed comes from a MetropolisHastings proposal that is highly specialized to topic models of discrete data, while other methods (including our own) are broadly applicable to clusterbased models with nonmultinomial likelihoods.
7 Conclusion
We have introduced a simple sparsity constraint for approximate posteriors which enjoys faster training times, equal or better heldout predictions, and intuitive interpretation. Our algorithms can be droppedin to any ML, MAP, or fullposterior variational clustering objective and are easy to parallelize across minibatches. Unlike previous efforts encouraging sparsity such as Sparse EM (Neal and Hinton, 1998) or Expectation Truncation (Lücke and Eggert, 2010), we have procedures that easily scale to millions of examples without prohibitive storage costs, we present proof that our chosen top selection procedure is optimal, and we have done rigorous experiments demonstrating that often modest values of or are much better than .
We have released Python code with fast C++ subroutines to encourage reuse by practioners. We anticipate further research in adapting sparsity to sequential models like HMMs, to structured variational approximations, to Bayesian nonparametric models with adaptive truncations (Hughes and Sudderth, 2013), and to fast methods like KDtrees for computing cluster weights (Moore, 1999).
References
 Ackermann and Blömer (2009) M. R. Ackermann and J. Blömer. Coresets and approximate clustering for Bregman divergences. In Proceedings of the 20th Annual ACMSIAM Symposium on Discrete Algorithms (SODA ’09), 2009.
 Arthur and Vassilvitskii (2007) D. Arthur and S. Vassilvitskii. kmeans++: The advantages of careful seeding. In ACMSIAM Symposium on Discrete Algorithms, 2007.
 Blei (2012) D. M. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.

Blei et al. (2003)
D. M. Blei, A. Y. Ng, and M. I. Jordan.
Latent Dirichlet allocation.
Journal of Machine Learning Research
, 3:993–1022, 2003.  Blum et al. (1973) M. Blum, R. W. Floyd, V. Pratt, R. L. Rivest, and R. E. Tarjan. Time bounds for selection. Journal of Computer and System Sciences, 7(4):448 – 461, 1973.
 Broderick et al. (2013) T. Broderick, N. Boyd, A. Wibisono, A. C. Wilson, and M. I. Jordan. Streaming variational Bayes. In Neural Information Processing Systems, 2013.
 Dai and Lücke (2014) Z. Dai and J. Lücke. Autonomous document cleaning—a generative approach to reconstruct strongly corrupted scanned texts. IEEE transactions on pattern analysis and machine intelligence, 36(10):1950–1962, 2014.
 Dempster et al. (1977) A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, pages 1–38, 1977.
 Everitt (1981) B. S. Everitt. Finite mixture distributions. Wiley Online Library, 1981.
 Ghahramani and Beal (2001) Z. Ghahramani and M. J. Beal. Propagation algorithms for variational Bayesian learning. In Neural Information Processing Systems, 2001.
 Griffiths and Steyvers (2004) T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 2004.
 Hoffman et al. (2013) M. Hoffman, D. Blei, C. Wang, and J. Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1), 2013.
 Hoffman et al. (2010) M. D. Hoffman, D. M. Blei, and F. R. Bach. Online learning for latent Dirichlet allocation. In Neural Information Processing Systems, 2010.
 Hughes and Sudderth (2013) M. C. Hughes and E. B. Sudderth. Memoized online variational inference for Dirichlet process mixture models. In Neural Information Processing Systems, 2013.
 Hughes et al. (2015) M. C. Hughes, D. I. Kim, and E. B. Sudderth. Reliable and scalable variational inference for the hierarchical Dirichlet process. In Artificial Intelligence and Statistics, 2015.

Juang and Rabiner (1990)
B.H. Juang and L. R. Rabiner.
The segmental kmeans algorithm for estimating parameters of hidden Markov models.
IEEE Transactions on Acoustics, Speech and Signal Processing, 38(9):1639–1641, 1990.  Kulis and Jordan (2012) B. Kulis and M. I. Jordan. Revisiting kmeans: New algorithms via Bayesian nonparametrics. In International Conference on Machine Learning, 2012.
 Kurihara and Welling (2009) K. Kurihara and M. Welling. Bayesian kmeans as a “maximizationexpectation” algorithm. Neural computation, 21(4):1145–1172, 2009.
 Li et al. (2014) A. Li, A. Ahmed, S. Ravi, and A. J. Smola. Reducing the sampling complexity of topic models. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014.
 Lloyd (1982) S. P. Lloyd. Least squares quantization in pcm. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
 Lücke and Eggert (2010) J. Lücke and J. Eggert. Expectation truncation and the benefits of preselection in training generative models. Journal of Machine Learning Research, 11(Oct):2855–2900, 2010.
 McCallum (2002) A. K. McCallum. MALLET: Machine learning for language toolkit. mallet.cs.umass.edu, 2002.
 Mimno et al. (2012) D. Mimno, M. Hoffman, and D. Blei. Sparse stochastic inference for latent Dirichlet allocation. In International Conference on Machine Learning, 2012.
 Moore (1999) A. W. Moore. Very fast EMbased mixture model clustering using multiresolution kdtrees. Advances in Neural information processing systems, pages 543–549, 1999.
 Musser (1997) D. R. Musser. Introspective sorting and selection algorithms. Softw., Pract. Exper., 27(8):983–993, 1997.
 Neal (1992) R. M. Neal. Bayesian mixture modeling. In Maximum Entropy and Bayesian Methods, pages 197–211. Springer, 1992.
 Neal and Hinton (1998) R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in graphical models, pages 355–368. Springer, 1998.
 Ng and McLachlan (2004) S.K. Ng and G. J. McLachlan. Speeding up the EM algorithm for mixture modelbased segmentation of magnetic resonance images. Pattern Recognition, 37(8):1573–1589, 2004.
 Sheikh et al. (2014) A.S. Sheikh, J. A. Shelton, and J. Lücke. A truncated em approach for spikeandslab sparse coding. Journal of Machine Learning Research, 15(1):2653–2687, 2014.
 Wainwright and Jordan (2008) M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(12):1–305, 2008.
 Wang and Blei (2012) C. Wang and D. Blei. Truncationfree online variational inference for Bayesian nonparametric models. In Neural Information Processing Systems, 2012.
 Wang et al. (2011) C. Wang, J. Paisley, and D. Blei. Online variational inference for the hierarchical Dirichlet process. In Artificial Intelligence and Statistics, 2011.
 Winn and Bishop (2005) J. Winn and C. M. Bishop. Variational message passing. Journal of Machine Learning Research, 6:661–694, 2005.
 Yao et al. (2009) L. Yao, D. Mimno, and A. McCallum. Efficient methods for topic model inference on streaming document collections. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009.
 Yuan et al. (2015) J. Yuan, F. Gao, Q. Ho, W. Dai, J. Wei, X. Zheng, E. P. Xing, T.Y. Liu, and W.Y. Ma. LightLDA: Big topic models on modest computer clusters. In Proceedings of the 24th International Conference on World Wide Web, 2015.
 Zoran and Weiss (2012) D. Zoran and Y. Weiss. Natural images, Gaussian mixtures and dead leaves. In Neural Information Processing Systems, 2012.
Appendix A Meanfield variational for the mixture model
a.1 Generative model
Local assignments and observed data :
(13)  
(14) 
where is any likelihood density in the exponential family, with conjugate prior .
a.2 Assumed meanfield approximate posterior
Approximate posteriors for global parameters:
(15)  
(16) 
Approximate posterior for local assignment:
(17) 
a.3 Evidence lowerbound objective function
(18)  
where we have defined several iterpretable terms which separate the influence of the different free variational parameters.
(19)  
(20)  
(21) 
Mixture allocation term.
For the mixture model, we can expand the expectation defining and simplify for the following closedform function:
(22)  
where is the digamma function and is the log cumulant function, also called the log normalization constant, of the Dirichlet distribution:
(23) 
Entropy term.
The entropy of the approximate posterior for cluster assignments is:
(24) 
Data term.
Evaluating the data term requires a particular choice for the likelihood F and prior density P. We discuss several cases in Sec. B
Appendix B Variational methods for data generated by the exponential family
b.1 Zeromean Gaussian likelihood and Wishart prior
Zeromean Gaussian likelihood.
Each observed data vector is a real vector of size . We assume each cluster has a precision matrix parameter which is symmetric and positive definite. The log likelihood of each observation is then:
(25)  
(26) 
Wishart prior.
The Wishart prior is defined by a positive real
, which can be interpreted as a pseudocount of prior strength or degreesoffreedom, and
, a symmetric positive matrix. The log density of the Wishart prior is given by:(27) 
where the cumulant function is
(28) 
where is the multivariate Gamma function, defined as .
Approximate variational posterior
(29) 
Evaluating the data objective function.
First, we define sufficient statistic functions for each cluster :
(30) 
Then, we can write the data objective as
(31)  
b.2 Multinomial likelihood and Dirichlet prior
Multinomial likelihood.
Each observation indicates a single word in a vocabulary of size .
Comments
There are no comments yet.