Fast Learning of Clusters and Topics via Sparse Posteriors

09/23/2016 ∙ by Michael C. Hughes, et al. ∙ Brown University

Mixture models and topic models generate each observation from a single cluster, but standard variational posteriors for each observation assign positive probability to all possible clusters. This requires dense storage and runtime costs that scale with the total number of clusters, even though typically only a few clusters have significant posterior mass for any data point. We propose a constrained family of sparse variational distributions that allow at most L non-zero entries, where the tunable threshold L trades off speed for accuracy. Previous sparse approximations have used hard assignments (L=1), but we find that moderate values of L>1 provide superior performance. Our approach easily integrates with stochastic or incremental optimization algorithms to scale to millions of examples. Experiments training mixture models of image patches and topic models for news articles show that our approach produces better-quality models in far less time than baseline methods.


1 Introduction

Mixture models (Everitt, 1981) and topic models (Blei et al., 2003) are fundamental to Bayesian unsupervised learning. These models find a set of clusters or topics useful for exploring an input dataset. Mixture models assume the input data is fully exchangeable, while topic models extend mixtures to handle datasets organized by groups of observations, such as documents or images.

Mixture and topic models have two kinds of latent variables. Global parameters define each cluster, including its frequency and the statistics of associated data. Local, discrete assignments then determine which cluster explains a specific data observation. For both global and local variables, Bayesian analysts wish to estimate a posterior distribution. For these models, full posterior inference via Markov chain Monte Carlo (MCMC, Neal (1992)) averages over sampled cluster assignments, producing asymptotically exact estimates at great computational cost. Optimization algorithms like expectation maximization (EM, Dempster et al. (1977)) or (mean-field) variational Bayes (Ghahramani and Beal, 2001, Winn and Bishop, 2005) provide faster, deterministic estimates of cluster assignment probabilities. However, at each observation these methods give positive probability to every cluster, requiring dense storage and limiting scalability.

This paper develops new posterior approximations for local assignment variables which allow optimization-based inference to scale to hundreds or thousands of clusters. We show that adding a sparsity constraint to the standard variational optimization objective for local cluster assignments yields substantial gains in processing speed. Unlike approaches restricted to hard, winner-take-all assignments, our approach offers a tunable parameter L that determines how many clusters have non-zero mass in the posterior for each observation. Our approach fits into any variational algorithm, regardless of whether global parameters are inferred by point estimates (as in EM) or given full approximate posteriors. Furthermore, our approach integrates into existing frameworks for large-scale data analysis (Hoffman et al., 2013, Broderick et al., 2013) and is easy to parallelize. Our open-source Python code (http://bitbucket.org/michaelchughes/bnpy-dev/) exploits an efficient C++ implementation of selection algorithms (Blum et al., 1973, Musser, 1997) for scalability.

Fig. 1: Impact of sparsity-level L on the speed and accuracy of estimating a zero-mean Gaussian mixture model for 8x8 pixel patches from natural images. a: Overall timings: comparison of algorithm substep costs for 2 complete passes through the entire 3.6 million patch dataset from Sec. 4. b: RespFromWeights step: speed of our L-sparse approach when computing responsibilities given fixed weights for 36,000 patches. c: Summary step: speed of our L-sparse approach when computing the per-cluster statistics defined in Eq. (7) for 36,000 patches. d: Distance from dense: cumulative density function of the variational distance between dense and L-sparse responsibilities across 36,000 patches, using the pretrained mixture model published online by Zoran and Weiss (2012).

2 Variational Inference for Mixture Models

Given N observed data vectors x_1, ..., x_N, a mixture model assumes each observation belongs to one of K clusters. Let the hidden variable z_n ∈ {1, ..., K} denote the specific cluster assigned to x_n. The mixture model has two sets of global parameters: the cluster frequencies π and the cluster shapes φ. Let π = [π_1 ... π_K], where each scalar π_k ≥ 0 is the probability of observing data from cluster k and Σ_k π_k = 1. We generate observation x_n according to the likelihood

(1)   x_n | z_n = k ~ F(φ_k),   log F(x_n | φ_k) = s(x_n)^T φ_k − c(φ_k).

The exponential-family density F has sufficient statistics s(x_n) and natural parameter φ_k. The normalization function c(φ_k) ensures that F integrates to one. We let φ_k ~ P(τ̄), where P is a density conjugate to F with parameter τ̄. Conjugacy is convenient but not necessary: we only require that the expectation E_q[log F(x_n | φ_k)] can be evaluated in closed form.

Mean-field variational inference (Wainwright and Jordan, 2008) seeks a factorized posterior q(z, π, φ) that approximates p(z, π, φ | x). Each posterior factor has free parameters (denoted with hats) that are optimized to minimize the KL divergence between the simplified approximate density and the true, intractable posterior. The separate factors for local and global parameters have specially chosen forms:

(2)   q(z_n) = Categorical(z_n | r̂_n),   q(π) = Dirichlet(π | θ̂),   q(φ_k) = P(φ_k | τ̂_k).

Our focus is on the free parameter r̂_n, which defines the local assignment posterior q(z_n). This vector is non-negative and sums to one. We interpret the value r̂_nk as the posterior probability of assigning observation n to cluster k. This is sometimes called cluster k's responsibility for observation n.

The goal of variational inference is to find the optimal free parameters under a specific objective function. Using full approximate posteriors of the global parameters yields the evidence lower-bound objective in Eq. (3), whose maximization is equivalent to minimizing KL divergence (Wainwright and Jordan, 2008). Point estimation of the global parameters instead yields a maximum-likelihood (ML) objective in Eq. (4):

(3)   L(r̂, θ̂, τ̂) = E_q[ log p(x, z, π, φ) − log q(z, π, φ) ],
(4)   L_ML(r̂, π*, φ*) = E_q(z)[ log p(x, z | π*, φ*) ].

Closed-form expressions for both objectives are in Appendix A. Given an objective function, optimization typically proceeds via coordinate ascent (Neal and Hinton, 1998). We call the update of the data-specific responsibilities r̂ the local step, which is alternated with the global update of θ̂, τ̂ (or of the point estimates π*, φ*).

2.1 Computing dense responsibilities during the local step

The local step computes, for each observation n, a responsibility vector r̂_n that maximizes the objective given fixed global parameters. Under either the approximate posterior treatment of global parameters in Eq. (3) or the ML objective of Eq. (4), the optimal update (dropping terms independent of r̂_n) maximizes the following objective function:

(5)   L_n(r̂_n) = Σ_{k=1}^K ( r̂_nk W_nk − r̂_nk log r̂_nk ),   where   W_nk = E_q[log π_k] + E_q[log F(x_n | φ_k)].

We interpret W_nk as the log posterior weight that cluster k has for observation n. Larger values of W_nk imply that cluster k is more likely to be assigned to observation n. For ML learning, the expectations defining W_nk are replaced with point estimates.

Our goal is to find the responsibility vector r̂_n that optimizes Eq. (5), subject to the constraint that r̂_n is non-negative and sums to one, so that it is a valid density:

(6)   max_{r̂_n} L_n(r̂_n)   subject to   r̂_nk ≥ 0   and   Σ_{k=1}^K r̂_nk = 1.

The optimal solution is simple: exponentiate each weight and then normalize the resulting vector. The function DenseRespFromWeights in Alg. 1 details the required steps. The runtime cost is O(K), dominated by the K required evaluations of the exp function.
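For concreteness, here is a minimal NumPy sketch of this exponentiate-and-normalize update. The max-subtraction is a standard numerical-stability trick, not part of the pseudocode in Alg. 1, and it does not change the normalized result.

    import numpy as np

    def dense_resp_from_weights(log_weights):
        """Dense responsibilities r_n from log posterior weights W_n (a sketch)."""
        # Shift by the max before exponentiating for numerical stability.
        shifted = log_weights - log_weights.max()
        resp = np.exp(shifted)        # K evaluations of exp
        return resp / resp.sum()      # K additions, K divisions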

2.2 Computing sufficient statistics needed for the global step

Given fixed assignments r̂, the global step computes the optimal values of the global free parameters. Whether doing point estimation or approximate posterior inference, this update requires only two finite-dimensional sufficient statistics of r̂, rather than all N·K values. For each cluster k, we must compute the expected count N_k of its assigned observations and the expected data statistic vector s_k(x):

(7)   N_k(r̂) = Σ_{n=1}^N r̂_nk,   s_k(x, r̂) = Σ_{n=1}^N r̂_nk s(x_n).

The required work is O(NK) for the count vector and O(NKD) for the data vector, where D is the dimension of the statistic s(x_n).
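As a sketch of Eq. (7) in the dense case, assuming the per-observation statistics s(x_n) have been stacked into an N x D array (a hypothetical layout, not the authors' exact data structure), both statistics are simple matrix operations:

    import numpy as np

    def dense_summary_stats(resp, stats):
        """Sufficient statistics for the global step (dense case, a sketch).

        resp  : (N, K) responsibilities r_nk.
        stats : (N, D) array whose n-th row is s(x_n).
        """
        counts = resp.sum(axis=0)       # N_k = sum_n r_nk
        data_stats = resp.T.dot(stats)  # s_k = sum_n r_nk s(x_n)
        return counts, data_stats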

Input: W_n, the log posterior weights. Output: r̂_n, the responsibility values.

def DenseRespFromWeights(W_n):
    for k = 1, ..., K:
        r̂_nk ← exp(W_nk)
    s ← Σ_{k=1}^K r̂_nk
    for k = 1, ..., K:
        r̂_nk ← r̂_nk / s
    return r̂_n

Input: W_n, the log posterior weights; L, the sparsity level. Output: r̂_n, the L non-zero responsibilities, and i_n, their cluster indices.

def TopLRespFromWeights(W_n, L):
    i_n ← SelectTopL(W_n, L)        # indices of the L largest weights
    for ℓ = 1, ..., L:
        r̂_nℓ ← exp(W_n,i_nℓ)
    s ← Σ_{ℓ=1}^L r̂_nℓ
    for ℓ = 1, ..., L:
        r̂_nℓ ← r̂_nℓ / s
    return r̂_n, i_n

Alg. 1: Updates for the responsibilities of observation n given weights W_n defined in Eq. (5). Left: DenseRespFromWeights is the standard solution to the optimization problem in Eq. (6). It requires K evaluations of the exp function, K summations, and K divisions. Right: Our proposed method TopLRespFromWeights optimizes the same objective subject to the additional constraint that at most L clusters have non-zero posterior probability. First, an introspective selection algorithm (Musser, 1997) finds the indices of the L largest weights. Given these, we find the optimum with L evaluations of the exp function, L summations, and L divisions.

3 Fast Local Step for Mixtures via Sparse Responsibilities

Our key contribution is a new variational objective and algorithm that scales better to large numbers of clusters K. Much of the runtime cost of standard variational inference comes from representing r̂_n as a dense vector. Although there are K total clusters, for any observation only a few entries of r̂_n have appreciable mass while the vast majority are close to zero. We thus further constrain the objective of Eq. (6) to allow at most L non-zero entries:

(8)   max_{r̂_n} L_n(r̂_n)   subject to   r̂_nk ≥ 0,   Σ_{k=1}^K r̂_nk = 1,   and at most L entries of r̂_n are non-zero.

The function TopLRespFromWeights in Alg. 1 solves this constrained optimization problem. First, we identify the indices of the top L values of the weight vector W_n in descending order. Let i_n denote these top-ranked cluster indices, each one a distinct value in {1, ..., K}. Given this active set of clusters, we simply exponentiate and normalize only at these indices. We can represent this solution as an L-sparse vector, with L real values and L integer indices. Solutions are not unique if the posterior weights contain duplicate values; we handle such ties arbitrarily, since swapping duplicate indices leaves the objective unchanged.
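A minimal NumPy sketch of TopLRespFromWeights follows, using np.argpartition as a stand-in for the C++ nth_element selection routine discussed below; both find the indices of the L largest weights in O(K) time without sorting either partition. The function and variable names here are illustrative, not the authors' API.

    import numpy as np

    def top_l_resp_from_weights(log_weights, L):
        """L-sparse responsibilities from log posterior weights (a sketch)."""
        K = log_weights.shape[0]
        if L >= K:
            top_ids = np.arange(K)
        else:
            top_ids = np.argpartition(log_weights, K - L)[K - L:]
        top_w = log_weights[top_ids]
        resp = np.exp(top_w - top_w.max())   # L evaluations of exp
        resp /= resp.sum()                   # L additions, L divisions
        return resp, top_ids                 # L values plus their cluster indices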

3.1 Proof of optimality.

We offer a proof by contradiction that TopLRespFromWeights solves the optimization problem in Eq. (8). Suppose that r̂_n is optimal, but there exists a pair of clusters j, k such that cluster j has larger weight yet is excluded from the active set while cluster k is included. That is, W_nj > W_nk, but r̂_nj = 0 and r̂_nk > 0. Consider the alternative r̂'_n, equal to r̂_n but with entries j and k swapped. Substituting r̂'_n into Eq. (5) and simplifying, the objective increases under the alternative: L_n(r̂'_n) − L_n(r̂_n) = r̂_nk (W_nj − W_nk) > 0. This contradicts the optimality of r̂_n, so the optimal solution must include the L largest clusters by weight in its active set.

3.2 Runtime cost.

Alg. 1 compares DenseRespFromWeights and our new algorithm TopLRespFromWeights side-by-side. The former requires K exponentiations, K additions, and K divisions to turn weights into responsibilities. In contrast, given the indices i_n, our procedure requires only L of each operation. Finding the active indices via SelectTopL requires O(K) runtime.

Selection algorithms (Blum et al., 1973, Musser, 1997) are designed to find the top L values within an array of size K. These methods use divide-and-conquer strategies to recursively partition the input array into two blocks, one with values above a pivot and the other below. Musser (1997) introduced a selection procedure which uses introspection to smartly choose pivot values and thus guarantee worst-case O(K) runtime. This procedure is implemented within the C++ standard library as nth_element, which we use for SelectTopL in practice. This function operates in-place on the provided array, rearranging its values so that the first L entries are all larger than the remainder. Importantly, there is no internal sorting within either partition. Example code is found in Appendix F.

Choosing the sparsity-level L naturally trades off execution speed and training accuracy. When L = K, we recover the original dense responsibilities, while L = 1 assigns each point to exactly one cluster, as in k-means. Our focus is on modest values 1 < L ≪ K. Fig. 1b shows that for large K, TopLRespFromWeights is faster than DenseRespFromWeights even at these modest sparsity levels: the dense method's K required exponentiations dominate the cost of the introspective selection procedure.

With L-sparse responsibilities, computing the statistics in Eq. (7) scales linearly with L rather than K. This gain is especially useful when applying Gaussian mixture models with unknown covariances to image patches, where each 8x8 patch requires an expensive 4096-dimensional data statistic s(x_n) = x_n x_n^T. Fig. 1c shows that the cost of the summary step virtually disappears when L ≪ K. This savings makes the overall algorithm over twice as fast (Fig. 1a), with the remaining bottleneck the dense calculation of the weights W_n, which might be sped up for some likelihoods using fast data structures for finding nearest neighbors. Fig. 1d shows that modest values of L capture nearly identical responsibility values to the dense solution, indicating that such values may bring speed gains without a noticeable sacrifice in model quality.
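To illustrate the savings, here is a minimal (unvectorized) sketch of the L-sparse summary step for the zero-mean Gaussian mixture; a production implementation such as the authors' C++ code would batch these updates, but the key point is that each observation touches only its L active clusters.

    import numpy as np

    def sparse_gmm_summary_stats(X, resp_vals, resp_ids, K):
        """Per-cluster statistics for a zero-mean Gaussian mixture (a sketch).

        X         : (N, D) data, e.g. D = 64 for 8x8 patches.
        resp_vals : (N, L) non-zero responsibility values per observation.
        resp_ids  : (N, L) cluster indices matching resp_vals.
        """
        N, D = X.shape
        counts = np.zeros(K)
        scatter = np.zeros((K, D, D))
        for n in range(N):
            outer = np.outer(X[n], X[n])        # s(x_n) = x_n x_n^T
            for val, k in zip(resp_vals[n], resp_ids[n]):
                counts[k] += val
                scatter[k] += val * outer
        return counts, scatter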

3.3 Related work.

Hard assignments.

One widespread practice, used for decades, is to consider "hard" assignments, where each observation is assigned to a single cluster instead of a dense vector of responsibilities. This is equivalent to setting L = 1 in our sparse formulation. The k-means algorithm (Lloyd, 1982) and its nonparametric extension DP-means (Kulis and Jordan, 2012) justify L = 1 sparsity via small-variance asymptotics. So-called "hard EM" Viterbi training (Juang and Rabiner, 1990) and maximization-expectation algorithms (Kurihara and Welling, 2009) both use hard assignments. However, we expect L = 1 to be too coarse for many applications, while moderate values of L offer better approximations, as shown in Fig. 1d.

Sparse EM.

A prominent early method to exploit sparsity in responsibilities is the Sparse EM algorithm proposed by Neal and Hinton (1998). Sparse EM maintains a dense vector r̂_n for each observation n, but only edits a subset of this vector during each local step. The edited subset may consist of the largest entries or all entries above some threshold. Any inactive entries are "frozen" at their current non-zero values, and the newly edited entries are normalized so that the length-K vector preserves its sum-to-one constraint.

Sparse EM can be effective for small datasets with a few thousand examples and has found applications such as MRI medical imaging (Ng and McLachlan, 2004). However, our L-sparse approach has three primary advantages relative to Sparse EM. (1) Our approach requires less per-observation memory for responsibilities: while Sparse EM must store K floating-point values to represent a responsibility vector, we need to store only L values and L integer indices. (2) Our approach easily scales to the minibatch-based training algorithms in Sec. 3.4, but Sparse EM's required storage is prohibitive. We can safely discard responsibility vectors after the required sufficient statistics are computed, whereas Sparse EM must explicitly store responsibilities for every observation in the dataset, at O(NK) cost, if future sparse updates are desired. This prohibits scaling to millions of examples by processing small minibatches, unless each minibatch has its full responsibility array written to and read from disk when needed. (3) We proved in Sec. 3.1 that top-L selection is the optimal way to compute L-sparse responsibilities and monotonically improve our training objective function. Neal and Hinton (1998) suggest this selection method only as a heuristic, without justification.

Expectation Truncation.

When we undertook most of this research, we were unaware of a related method by Lücke and Eggert (2010) called Expectation Truncation, which constrains the approximate posterior probabilities of discrete or multivariate binary variables to be sparse. Lücke and Eggert (2010) considered non-negative matrix factorization and sparse coding problems. Later extensions applied this core algorithm to mixture-like sprite models for cleaning images of text documents (Dai and Lücke, 2014) and to spike-and-slab sparse coding (Sheikh et al., 2014). Our work is the first to apply sparse ideas to mixture models and topic models.

The original Expectation Truncation algorithm (Lücke and Eggert, 2010, Alg. 1) expects a user-defined selection function to identify the entries with non-zero responsibility for a specific observation. In practice, the selection functions they suggest are chosen heuristically, such as the upper bound in Eq. 28 of Lücke and Eggert (2010). The original authors freely admit these selection functions are not optimal and may not monotonically improve the objective function (Lücke and Eggert, 2010, p. 2869). In contrast, we proved in Sec. 3.1 that top-L selection optimally improves our objective function.

One other advantage of our work over previous Expectation Truncation efforts is our thorough experiments exploring how different values of L impact training speed and predictive power. Comparisons over a range of possible sparsity levels on real datasets are lacking in Lücke and Eggert (2010) and follow-up papers. Our key empirical insight is that modest values of L > 1 are frequently better than L = 1, especially for topic models.

3.4 Scalability via minibatches

Stochastic variational inference (SVI).

Introduced by Hoffman et al. (2010), SVI scales standard coordinate ascent to large datasets by processing subsets of data at a time. Our proposed sparse local step fits easily into SVI. At each iteration t, SVI performs the following steps: (1) sample a batch from the full dataset, uniformly at random; (2) for each observation in the batch, do a local step to update responsibilities given fixed global parameters; (3) update the global parameters by stepping from their current values in the direction of the natural gradient of the rescaled batch objective. This procedure is guaranteed to reach a local optimum of the objective if the step size of the gradient update decays appropriately as t increases (Hoffman et al., 2013).
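In conjugate exponential-family models, the natural-gradient step in (3) above reduces to an interpolation on expected sufficient statistics. The following sketch shows that idea under this assumption (names are illustrative); step_size would follow a decaying schedule such as (t + delay)^(-decay).

    def svi_global_update(global_stats, batch_stats, n_total, n_batch, step_size):
        """One stochastic natural-gradient step on sufficient statistics (a sketch)."""
        # Rescale the minibatch as if the whole dataset looked like it,
        # then interpolate toward that noisy whole-dataset estimate.
        rescaled = (float(n_total) / n_batch) * batch_stats
        return (1.0 - step_size) * global_stats + step_size * rescaled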

Incremental algorithms (MVI).

Inspired by incremental EM (Neal and Hinton, 1998), Hughes and Sudderth (2013) introduced memoized variational inference (MVI). The data is divided into a fixed set of batches before iterations begin. Each iteration completes four steps: (1) select a single batch b to visit; (2) for each observation in this batch, compute optimal local responsibilities given fixed global parameters and summarize these into sufficient statistics for batch b; (3) incrementally update the whole-dataset statistics given the new statistics for batch b; (4) compute optimal global parameters given the whole-dataset statistics. The incremental update in step (3) requires caching (or "memoizing") the summary statistics in Eq. (7) for each batch. This algorithm has the same per-iteration runtime as stochastic inference, but guarantees a monotonic increase of the objective when the local step has a closed-form solution, as in the mixture model. Its first pass through the entire dataset is equivalent to streaming variational Bayes (Broderick et al., 2013).
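A sketch of the memoized incremental update in step (3), assuming the sufficient statistics support addition and subtraction (e.g., NumPy arrays); caching each batch's last contribution is what keeps the whole-dataset totals exact.

    def memoized_stats_update(whole_stats, cached_stats, batch_id, new_batch_stats):
        """Replace one batch's cached contribution with its new statistics (a sketch)."""
        old = cached_stats.get(batch_id, 0.0)
        cached_stats[batch_id] = new_batch_stats
        return whole_stats - old + new_batch_stats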

4 Mixture Model Experiments

We evaluate dense and L-sparse mixture models for natural images, inspired by Zoran and Weiss (2012). We train a model for 8x8 image patches taken from overlapping regular grids with a stride of 4 pixels. Each observation is a 64-dimensional vector, preprocessed to remove its mean. We then apply a mixture model with a zero-mean, full-covariance Gaussian likelihood and place a symmetric Dirichlet prior with fixed concentration on the cluster frequencies. To evaluate, we track the log-likelihood score of heldout observations under the trained model, averaging log Σ_k π̂_k N(x_n | 0, Σ̂_k) over heldout observations x_n. Here, π̂ and Σ̂_k are point estimates computed from the trained global parameters using standard formulas, and N(· | 0, Σ) is the probability density function of a multivariate normal.
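A sketch of how such a heldout score could be computed from point estimates, using SciPy; the exact scaling (a per-observation average) and estimator details are assumptions rather than the authors' precise protocol.

    import numpy as np
    from scipy.stats import multivariate_normal
    from scipy.special import logsumexp

    def heldout_log_likelihood(X_test, pi_hat, cov_hats):
        """Average heldout log-likelihood under a zero-mean GMM (a sketch).

        pi_hat   : (K,) point-estimated cluster frequencies.
        cov_hats : list of K point-estimated covariance matrices.
        """
        D = X_test.shape[1]
        log_probs = np.stack([
            np.log(pi_hat[k]) + multivariate_normal.logpdf(
                X_test, mean=np.zeros(D), cov=cov_hats[k])
            for k in range(len(pi_hat))], axis=1)
        # log p(x_n) = logsumexp_k [ log pi_k + log N(x_n | 0, Sigma_k) ]
        return logsumexp(log_probs, axis=1).mean()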

Fig. 2 compares sparse implementations of SVI and MVI on 3.6 million patches from 400 images. The algorithms process the data in small minibatches of patches. The sparse methods consistently reach good predictive scores 2-4 times faster than dense runs (note the log scale of the time axis). Finally, modestly sparse runs often reach higher heldout likelihood than hard L = 1 runs, especially at larger numbers of clusters K for SVI (red).

Fig. 2: Analysis of 3.6 million 8x8 pixel image patches using a zero-mean Gaussian mixture model trained by L-sparse stochastic (SVI) and memoized (MVI) variational algorithms. We train on 400 total images, processed 4 images at a time. Each panel shows the heldout log-likelihood score over time for training runs at various sparsity levels L and a fixed number of clusters K. Training time is plotted on a log scale.
Fig. 3: Impact of sparsity-level L on the speed and accuracy of training topic models. a: Overall timings: comparison of the runtime costs for 10 complete passes through 7981 Wikipedia documents with K = 800 topics; our sparse methods further show the breakdown by algorithm substep, and the text above each bar indicates the number of documents processed per second. b: Local step with restarts: timings for the L-sparse local step with restart proposals on 1000 NYTimes articles, using 100 iterations per document. c: Summary step: timings for the L-sparse summary step on 1000 NYTimes articles. d: Distance from dense: cumulative density function of the variational distance between dense and L-sparse document-topic distributions across 1000 NYTimes documents, where the empirical topic distribution of a document is obtained by normalizing its count vector N_d.

5 Fast Local Step for Topic Models via Sparse Responsibilities

We now develop a sparse local step for topic models. Topic models (Blei, 2012) are hierarchical mixtures applied to discrete data organized into D documents. Each document consists of observed word tokens drawn from a fixed vocabulary of V word types, though we could easily build a topic model for observations of any type (real, discrete, etc.). Document d contains N_d observed word tokens x_d, where token x_dn identifies the type of the n-th word.

The latent Dirichlet allocation (LDA) topic model (Blei et al., 2003) generates a document's observations from a mixture model with K common topics λ but document-specific frequencies π_d. Each topic λ_k is a distribution over the vocabulary, where λ_kv is the probability of type v under topic k. The document-specific frequencies are drawn from a symmetric Dirichlet, π_d ~ Dir(α, ..., α), where α is a scalar. Assignments are drawn z_dn ~ Cat(π_d), and the observed words are drawn x_dn ~ Cat(λ_{z_dn}).

The goal of posterior inference is to estimate the common topics as well as the frequencies and assignments in any document. The standard mean-field approximate posterior (Blei et al., 2003) is:

(9)   q(z, π, λ) = Π_{d=1}^D [ q(π_d | θ̂_d) Π_{n=1}^{N_d} q(z_dn | r̂_dn) ] Π_{k=1}^K q(λ_k | τ̂_k).

Under this factorization, we again set up a standard optimization objective as in Eq. (3); complete expressions are in Appendix C. We optimize this objective via coordinate ascent, alternating between local and global steps. Our focus is the local step, which requires jointly updating the assignment factor q(z_d) and the frequencies factor q(π_d) for each document d. We next derive an iterative update algorithm for these two factors; Alg. 2 lists the conventional algorithm and our new L-sparse version.

Document-topic update.

Following Blei et al. (2003), the optimal q(π_d) is a Dirichlet whose parameter has a closed-form update for each topic k: θ̂_dk = α + N_dk. This assumes the responsibilities have been summarized into counts N_dk = Σ_n r̂_dnk, the expected number of tokens in document d assigned to topic k.

Responsibility update.

As in Blei et al. (2003), the optimal update for the dense responsibilities r̂_dn of token n has a closed form like the mixture model, but with document-specific weights:

(10)   r̂_dnk ∝ exp(W_dnk),   where   W_dnk = E_q[log π_dk] + E_q[log λ_{k, x_dn}].

We can incorporate our L-sparse constraint from Eq. (8) to obtain sparse rather than dense responsibilities. The procedure TopLRespFromWeights from Alg. 1 still provides the optimal solution.

Iterative joint update for dense case.

Following standard practice for dense assignments (Blei et al., 2003), we use a block-coordinate ascent algorithm that alternates the closed-form updates for r̂_d and θ̂_d above. To initialize the update cycle, we recommend setting the initial weights as if the document-topic frequencies were uniform: W_dnk = E_q[log λ_{k, x_dn}]. This lets the topic-word likelihoods drive the initial assignments. We then alternate updates until either a maximum number of iterations is reached (typically 100) or the maximum change in the document-topic counts N_d falls below a threshold (typically 0.05). Appendix D provides a detailed algorithm. Fig. 3a compares the runtime cost of the local, summary, and global steps of the topic model, showing that the local iterations dominate the overall cost.
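A compact sketch of this per-document coordinate ascent, assuming the expected log topic-word probabilities E[log λ] are precomputed as a (K, V) array; variable names are illustrative and not the authors' API.

    import numpy as np
    from scipy.special import digamma

    def dense_local_step_for_doc(word_ids, word_counts, Elog_lik, alpha,
                                 max_iters=100, conv_tol=0.05):
        """Dense local step for one document (a sketch of the standard recipe)."""
        K = Elog_lik.shape[0]
        log_lik_d = Elog_lik[:, word_ids].T                 # (U, K) per unique token
        # Cold start: act as if document-topic frequencies were uniform.
        resp = np.exp(log_lik_d - log_lik_d.max(axis=1, keepdims=True))
        resp /= resp.sum(axis=1, keepdims=True)
        N_d = np.zeros(K)
        for _ in range(max_iters):
            N_d_new = word_counts.dot(resp)                 # N_dk = sum_t c_t r_tk
            if np.abs(N_d_new - N_d).max() < conv_tol:
                N_d = N_d_new
                break
            N_d = N_d_new
            # Document-specific weights; the -digamma(sum) term is constant
            # across topics and cancels after per-token normalization.
            weights = log_lik_d + digamma(alpha + N_d)
            resp = np.exp(weights - weights.max(axis=1, keepdims=True))
            resp /= resp.sum(axis=1, keepdims=True)
        return resp, alpha + N_d                            # responsibilities, theta_d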

Iterative joint update with sparsity.

Our new L-sparse constraint on responsibilities leads to a fast local step algorithm for topic models. This procedure has two primary advantages over the dense baseline. First, we use TopLRespFromWeights to update the per-token responsibilities r̂_dn, resulting in faster updates. Second, we further assume that once a topic's mass within a document decays near zero, it will never rise again. Under this assumption, at every iteration we identify the set of active topics (those with non-negligible mass) in the document. Only these topics can have weight large enough to be chosen in the top L for any token. Thus, throughout the local iterations we consider only the active set of topics, reducing the cost of every step from scaling with the total number of topics K to scaling with the (typically much smaller) number of active topics.

Discarding topics within a document when their mass becomes very small is justified by previous empirical observations of the "digamma problem" described in Mimno et al. (2012): for topics with negligible mass, the expected log prior weight E_q[log π_dk] is a large negative number, so such topics carry vanishingly small posterior weight, and the effect only grows as K increases. In practice, after the first few iterations the active set stabilizes and each token's top L topics rarely change, even as the relative responsibilities continue to improve. In this regime, we can reduce runtime further by avoiding selection altogether and simply reweighting each token's current set of top topics. We perform selection during the first 5 iterations and then only every 10 iterations afterward, which yields large speedups without loss in quality.
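Building on the top_l_resp_from_weights sketch given earlier, one inner iteration restricted to the active set might look like the following; log_lik_active and N_d_active are hypothetical names for the per-token expected log likelihoods and document-topic counts restricted to the A currently active topics, so every operation scales with A and L rather than K.

    import numpy as np
    from scipy.special import digamma

    def sparse_local_iteration(log_lik_active, N_d_active, alpha, L):
        """One L-sparse local iteration over the active topic set (a sketch).

        log_lik_active : (U, A) expected log likelihoods for the A active topics.
        N_d_active     : (A,) current document-topic counts for those topics.
        Returns per-token responsibility values and indices into the active set.
        """
        weights = log_lik_active + digamma(alpha + N_d_active)   # (U, A)
        pairs = [top_l_resp_from_weights(w, L) for w in weights]
        resp_vals = np.array([vals for vals, _ in pairs])        # (U, min(L, A))
        resp_ids = np.array([ids for _, ids in pairs])
        return resp_vals, resp_ids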

Fig. 3 compares the runtime of our sparse local step across values of the sparsity-level L against a comparable implementation of the standard dense algorithm. Fig. 3b shows that our sparse local step can be at least 3 times faster than the dense baseline, and larger problem sizes lead to even larger gains. Fig. 3c shows that sparsity also improves the speed of the summary step, though this step is far less costly than the local step for topic models. Finally, Fig. 3d shows that modest sparsity yields document-topic distributions very close to those found by the dense local step, while L = 1 is much coarser.

Restart proposals.

In scalable applications, we assume we cannot afford to store any document-specific information between iterations. Thus, each time we visit a document we must infer both r̂_d and θ̂_d from scratch. This joint update is non-convex, so our recommended cold-start initialization is not guaranteed to monotonically improve the objective across repeat visits to a document. However, even when we could store document counts across iterations, we find that warm-starting often gets stuck in poor local optima (see Fig. 5 of Appendix D). Instead, we combine cold-starting with restart proposals. Hughes et al. (2015) introduced restarts as a post-processing step for the single-document local iterations that yields solutions with better objective scores. Given some fixed point, the restart proposal constructs a candidate by forcing all responsibility mass on some active topic to zero and then running a few iterations forward. We accept the new proposal if it improves the objective. These proposals escape local optima by finding nearby solutions which favor the prior's bias toward sparse document-topic probabilities. They are frequently accepted in practice (40-80% in a typical Wikipedia run), so we always include them in both our sparse and dense local steps.

Related work.

MCMC methods specialized to topic models of text data can exploit sparsity for huge speed gains. SparseLDA (Yao et al., 2009) is a clever decomposition of the Gibbs conditional distribution that makes each per-token assignment step cost much less than O(K). AliasLDA (Li et al., 2014) and LightLDA (Yuan et al., 2015) further improve this to amortized O(1) per token. These methods are still limited to hard assignments and are only applicable to discrete data. In contrast, our approach allows expressive intermediate sparsity and applies to a broader family of mixture and topic models, including those for real-valued data.

More recently, several efforts have used MCMC samplers to approximate the local step within a larger variational algorithm (Mimno et al., 2012, Wang and Blei, 2012). They estimate an approximate posterior by averaging over many samples, where each sample is a hard assignment. The number of samples must be chosen to balance accuracy and speed. In contrast, our sparsity-level L provides more intuitive control over approximation accuracy and optimizes the objective exactly, not just in expectation.

Input: α, the document-topic smoothing scalar; E[log λ_kv], the expected log probability of word v in topic k; {(v_t, c_t)}, the word type/count pairs for document d.
Output: r̂_d, dense responsibilities for each unique token in document d; θ̂_d, topic pseudo-counts for document d.

def DenseStepForDoc(α, E[log λ], {(v_t, c_t)}):
    for each unique token t:
        W_tk ← E[log λ_k,v_t] for all k
        r̂_dt ← DenseRespFromWeights(W_t)
    while not converged:
        for k = 1, ..., K:
            N_dk ← Σ_t c_t r̂_dtk          # θ̂_dk = α + N_dk is implicit
        for each unique token t:
            for k = 1, ..., K:
                W_tk ← E[log λ_k,v_t] + ψ(α + N_dk)
            r̂_dt ← DenseRespFromWeights(W_t)
    for k = 1, ..., K:
        θ̂_dk ← α + N_dk
    return r̂_d, θ̂_d

Input: α, the document-topic smoothing scalar; E[log λ_kv], the expected log probability of word v in topic k; {(v_t, c_t)}, the word type/count pairs for document d; L, the integer sparsity level.
Output: r̂_d, i_d, the L-sparse responsibilities and topic indices for each unique token; θ̂_d, topic pseudo-counts for document d.

def LSparseStepForDoc(α, E[log λ], {(v_t, c_t)}, L):
    for each unique token t:
        W_tk ← E[log λ_k,v_t] for all k
        r̂_dt, i_dt ← TopLRespFromWeights(W_t, L)
    for k = 1, ..., K:
        N_dk ← Σ_t c_t r̂_dtk [i_dt contains k]
    A ← {k : N_dk > ε}                          # active topic set
    while not converged:
        for k ∈ A:
            E[log π_dk] ← ψ(α + N_dk)
        for each unique token t:
            for k ∈ A:
                W_tk ← E[log λ_k,v_t] + E[log π_dk]
            r̂_dt, i_dt ← TopLRespFromWeights(W_t restricted to A, L)
        for k ∈ A:
            N_dk ← Σ_t c_t r̂_dtk [i_dt contains k]
        A ← {k ∈ A : N_dk > ε}
    for k = 1, ..., K:
        θ̂_dk ← α + N_dk
    return r̂_d, i_d, θ̂_d

Alg. 2: Algorithms for computing the per-unique-token responsibilities and topic pseudo-counts for a single document, given a fixed set of topics. Left: In the standard dense algorithm, each step scales linearly with the total number of topics K, regardless of how many topics are actually used in the document. Right: In our L-sparse algorithm, forcing each observation to use at most L topics and tracking the set of active topics in the document leads to update steps that scale linearly with the number of active topics, which can be far smaller than K.

6 Topic Model Experiments

We compare our L-sparse implementations of MVI and SVI to external baselines: SparseLDA (Yao et al., 2009), a fast implementation of standard Gibbs sampling (Griffiths and Steyvers, 2004), and SVIGibbs (Mimno et al., 2012), a stochastic variational method that uses Gibbs sampling to approximate local gradients. These baselines use Java code from Mallet (McCallum, 2002). We also compare to the public C++ implementation of LightLDA (Yuan et al., 2015). External methods use their default initialization, while we initialize our approximate topic-word posterior by sampling diverse documents using the Bregman divergence extension (Ackermann and Blömer, 2009) of k-means++ (Arthur and Vassilvitskii, 2007).

For our methods, we explore several values of the sparsity-level L. LightLDA and SparseLDA have no tunable sparsity parameters. SVIGibbs allows specifying the number of samples used to approximate the local posterior; we consider several sample counts, always discarding the first half of each chain as burn-in. For all methods, we fix the document-topic smoothing α and the topic-word smoothing to common values across runs. We set the stochastic learning rate at iteration t to a decaying schedule governed by delay and decay parameters, and use grid search over these parameters to find the best heldout score on validation data.

Fig. 4 compares these methods on 3 datasets: 1392 NIPS articles, 7961 Wikipedia articles, and 1.8 million New York Times articles. Each curve represents the best of many random initializations. Following Wang et al. (2011), we evaluate heldout likelihood via a document-completion task. Given a test document, we divide its words at random by type into two pieces: 80% in set A and 20% in set B. We use set A to estimate the document-topic probabilities, and then evaluate this estimate by computing the per-token log likelihood of set B. See the supplement for details. Across all datasets, our conclusions are:

Moderate sparsity tends to be best.

Throughout Fig. 4, we see that runs with a moderate sparsity-level L, under both the memoized and stochastic algorithms, converge several times faster than dense runs yet yield indistinguishable predictions. For example, on Wikipedia both MVI and SVI with moderate L plateau after roughly 200 seconds, but require over 1000 seconds to reach their best performance with dense responsibilities.

Hard assignments can fail catastrophically.

We suspect that L = 1 is too coarse to accurately capture multiple senses of vocabulary words, instead favoring poor local optima where each word is attracted to a single best topic without regard for the other words in the document. In practice, L = 1 runs may either plateau early at noticeably worse performance (e.g., NIPS) or fall into progressively worse local optima (e.g., Wikipedia). This failure mode can occur because MVI and SVI for topic models both re-estimate r̂_d and θ̂_d from scratch each time we visit a document.

Baselines converge slowly.

Throughout Fig. 4, few runs of SparseLDA or SVIGibbs reach competitive predictions in the allowed time limit (3 hours for NIPS and Wikipedia, 2 days for NYTimes). SVIGibbs benefits from using more samples only on NYTimes, and using more than 10 samples did not improve performance further. As expected, LightLDA has higher raw throughput than our sparse MVI or SVI methods, and on small datasets it eventually makes slightly better predictions at some values of K. However, across all K values our sparse methods reach competitive heldout scores faster, especially on the large NYTimes dataset, and for large K LightLDA never catches up in the allotted time. Note that LightLDA's speed comes from a Metropolis-Hastings proposal that is highly specialized to topic models of discrete data, while the other methods (including our own) are broadly applicable to cluster-based models with non-multinomial likelihoods.


Fig. 4: Analysis of 1392 NIPS articles (top row), 7961 Wikipedia articles (middle), and 1.8 million New York Times articles (bottom). We use 200 batches for NYTimes and 5 batches otherwise. Each panel shows, for a single value of K, how heldout likelihood (higher is better) changes over time for sparse and dense versions of our algorithms and for the external baselines. Training time is plotted on a log scale.

7 Conclusion

We have introduced a simple sparsity constraint for approximate posteriors which enjoys faster training times, equal or better heldout predictions, and an intuitive interpretation. Our algorithms can be dropped into any ML, MAP, or full-posterior variational clustering objective and are easy to parallelize across minibatches. Unlike previous efforts encouraging sparsity, such as Sparse EM (Neal and Hinton, 1998) or Expectation Truncation (Lücke and Eggert, 2010), our procedures easily scale to millions of examples without prohibitive storage costs, we prove that our top-L selection procedure is optimal, and our experiments demonstrate that modest values of L are often much better than hard L = 1 assignments.

We have released Python code with fast C++ subroutines to encourage reuse by practitioners. We anticipate further research in adapting sparsity to sequential models like HMMs, to structured variational approximations, to Bayesian nonparametric models with adaptive truncations (Hughes and Sudderth, 2013), and to fast methods like KD-trees for computing cluster weights (Moore, 1999).

References

  • Ackermann and Blömer (2009) M. R. Ackermann and J. Blömer. Coresets and approximate clustering for Bregman divergences. In Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA ’09), 2009.
  • Arthur and Vassilvitskii (2007) D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In ACM-SIAM Symposium on Discrete Algorithms, 2007.
  • Blei (2012) D. M. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.
  • Blei et al. (2003) D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation.

    Journal of Machine Learning Research

    , 3:993–1022, 2003.
  • Blum et al. (1973) M. Blum, R. W. Floyd, V. Pratt, R. L. Rivest, and R. E. Tarjan. Time bounds for selection. Journal of Computer and System Sciences, 7(4):448 – 461, 1973.
  • Broderick et al. (2013) T. Broderick, N. Boyd, A. Wibisono, A. C. Wilson, and M. I. Jordan. Streaming variational Bayes. In Neural Information Processing Systems, 2013.
  • Dai and Lücke (2014) Z. Dai and J. Lücke. Autonomous document cleaning—a generative approach to reconstruct strongly corrupted scanned texts. IEEE transactions on pattern analysis and machine intelligence, 36(10):1950–1962, 2014.
  • Dempster et al. (1977) A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, pages 1–38, 1977.
  • Everitt (1981) B. S. Everitt. Finite mixture distributions. Wiley Online Library, 1981.
  • Ghahramani and Beal (2001) Z. Ghahramani and M. J. Beal. Propagation algorithms for variational Bayesian learning. In Neural Information Processing Systems, 2001.
  • Griffiths and Steyvers (2004) T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 2004.
  • Hoffman et al. (2013) M. Hoffman, D. Blei, C. Wang, and J. Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1), 2013.
  • Hoffman et al. (2010) M. D. Hoffman, D. M. Blei, and F. R. Bach. Online learning for latent Dirichlet allocation. In Neural Information Processing Systems, 2010.
  • Hughes and Sudderth (2013) M. C. Hughes and E. B. Sudderth. Memoized online variational inference for Dirichlet process mixture models. In Neural Information Processing Systems, 2013.
  • Hughes et al. (2015) M. C. Hughes, D. I. Kim, and E. B. Sudderth. Reliable and scalable variational inference for the hierarchical Dirichlet process. In Artificial Intelligence and Statistics, 2015.
  • Juang and Rabiner (1990) B.-H. Juang and L. R. Rabiner.

    The segmental k-means algorithm for estimating parameters of hidden Markov models.

    IEEE Transactions on Acoustics, Speech and Signal Processing, 38(9):1639–1641, 1990.
  • Kulis and Jordan (2012) B. Kulis and M. I. Jordan. Revisiting k-means: New algorithms via Bayesian nonparametrics. In International Conference on Machine Learning, 2012.
  • Kurihara and Welling (2009) K. Kurihara and M. Welling. Bayesian k-means as a “maximization-expectation” algorithm. Neural computation, 21(4):1145–1172, 2009.
  • Li et al. (2014) A. Li, A. Ahmed, S. Ravi, and A. J. Smola. Reducing the sampling complexity of topic models. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014.
  • Lloyd (1982) S. P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
  • Lücke and Eggert (2010) J. Lücke and J. Eggert. Expectation truncation and the benefits of preselection in training generative models. Journal of Machine Learning Research, 11(Oct):2855–2900, 2010.
  • McCallum (2002) A. K. McCallum. MALLET: Machine learning for language toolkit. mallet.cs.umass.edu, 2002.
  • Mimno et al. (2012) D. Mimno, M. Hoffman, and D. Blei. Sparse stochastic inference for latent Dirichlet allocation. In International Conference on Machine Learning, 2012.
  • Moore (1999) A. W. Moore. Very fast EM-based mixture model clustering using multiresolution kd-trees. Advances in Neural information processing systems, pages 543–549, 1999.
  • Musser (1997) D. R. Musser. Introspective sorting and selection algorithms. Softw., Pract. Exper., 27(8):983–993, 1997.
  • Neal (1992) R. M. Neal. Bayesian mixture modeling. In Maximum Entropy and Bayesian Methods, pages 197–211. Springer, 1992.
  • Neal and Hinton (1998) R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in graphical models, pages 355–368. Springer, 1998.
  • Ng and McLachlan (2004) S.-K. Ng and G. J. McLachlan. Speeding up the EM algorithm for mixture model-based segmentation of magnetic resonance images. Pattern Recognition, 37(8):1573–1589, 2004.
  • Sheikh et al. (2014) A.-S. Sheikh, J. A. Shelton, and J. Lücke. A truncated EM approach for spike-and-slab sparse coding. Journal of Machine Learning Research, 15(1):2653–2687, 2014.
  • Wainwright and Jordan (2008) M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1-2):1–305, 2008.
  • Wang and Blei (2012) C. Wang and D. Blei. Truncation-free online variational inference for Bayesian nonparametric models. In Neural Information Processing Systems, 2012.
  • Wang et al. (2011) C. Wang, J. Paisley, and D. Blei. Online variational inference for the hierarchical Dirichlet process. In Artificial Intelligence and Statistics, 2011.
  • Winn and Bishop (2005) J. Winn and C. M. Bishop. Variational message passing. Journal of Machine Learning Research, 6:661–694, 2005.
  • Yao et al. (2009) L. Yao, D. Mimno, and A. McCallum. Efficient methods for topic model inference on streaming document collections. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009.
  • Yuan et al. (2015) J. Yuan, F. Gao, Q. Ho, W. Dai, J. Wei, X. Zheng, E. P. Xing, T.-Y. Liu, and W.-Y. Ma. LightLDA: Big topic models on modest computer clusters. In Proceedings of the 24th International Conference on World Wide Web, 2015.
  • Zoran and Weiss (2012) D. Zoran and Y. Weiss. Natural images, Gaussian mixtures and dead leaves. In Neural Information Processing Systems, 2012.

Appendix A Mean-field variational inference for the mixture model

A.1 Generative model

Global parameters:

(11)   π ~ Dirichlet(α, ..., α), a symmetric Dirichlet with scalar concentration α,
(12)   φ_k ~ P(τ̄),   for k = 1, ..., K,

where P is a conjugate prior density in the exponential family.

Local assignments z_n and observed data x_n:

(13)   z_n ~ Categorical(π),
(14)   x_n | z_n = k ~ F(φ_k),

where F is any likelihood density in the exponential family, with conjugate prior P.

A.2 Assumed mean-field approximate posterior

Approximate posteriors for the global parameters:

(15)   q(π) = Dirichlet(π | θ̂_1, ..., θ̂_K),
(16)   q(φ_k) = P(φ_k | τ̂_k).

Approximate posterior for the local assignment:

(17)   q(z_n) = Categorical(z_n | r̂_n1, ..., r̂_nK).

A.3 Evidence lower-bound objective function

(18)   L(r̂, θ̂, τ̂) = L_data + L_alloc + L_entropy,

where we have defined several interpretable terms which separate the influence of the different free variational parameters:

(19)   L_data(r̂, τ̂) = Σ_n Σ_k r̂_nk E_q[log F(x_n | φ_k)] + Σ_k E_q[ log P(φ_k | τ̄) − log q(φ_k | τ̂_k) ],
(20)   L_alloc(r̂, θ̂) = Σ_n Σ_k r̂_nk E_q[log π_k] + E_q[ log Dir(π | α) − log q(π | θ̂) ],
(21)   L_entropy(r̂) = − Σ_n Σ_k r̂_nk log r̂_nk.

Mixture allocation term.

For the mixture model, we can expand the expectation defining L_alloc and simplify to the following closed-form function:

(22)   L_alloc(r̂, θ̂) = c_Dir(θ̂_1, ..., θ̂_K) − c_Dir(α, ..., α) + Σ_k ( N_k(r̂) + α − θ̂_k ) [ ψ(θ̂_k) − ψ(Σ_ℓ θ̂_ℓ) ],

where ψ is the digamma function and c_Dir is the log cumulant function, also called the log normalization constant, of the Dirichlet distribution:

(23)   c_Dir(θ_1, ..., θ_K) = Σ_{k=1}^K log Γ(θ_k) − log Γ( Σ_{k=1}^K θ_k ).

Entropy term.

The entropy of the approximate posterior over cluster assignments is:

(24)   L_entropy(r̂) = − Σ_{n=1}^N Σ_{k=1}^K r̂_nk log r̂_nk.

Data term.

Evaluating the data term requires a particular choice for the likelihood F and the prior density P. We discuss several cases in Appendix B.

Appendix B Variational methods for data generated by the exponential family

B.1 Zero-mean Gaussian likelihood and Wishart prior

Zero-mean Gaussian likelihood.

Each observed data vector x_n is a real vector of size D. We assume each cluster k has a precision matrix parameter Λ_k, which is symmetric and positive definite. The log likelihood of each observation is then:

(25)   log F(x_n | Λ_k) = log N(x_n | 0, Λ_k^{-1})
(26)                    = −(D/2) log(2π) + (1/2) log |Λ_k| − (1/2) x_n^T Λ_k x_n.

Wishart prior.

The Wishart prior is defined by a positive real ν, which can be interpreted as a pseudo-count of prior strength or degrees of freedom, and W, a symmetric positive definite matrix. The log density of the Wishart prior is given by:

(27)   log Wishart(Λ_k | ν, W) = −c_W(ν, W) + ((ν − D − 1)/2) log |Λ_k| − (1/2) tr(W^{-1} Λ_k),

where the cumulant function is

(28)   c_W(ν, W) = (ν D / 2) log 2 + (ν / 2) log |W| + log Γ_D(ν / 2),

where Γ_D is the multivariate Gamma function, defined as Γ_D(a) = π^{D(D−1)/4} Π_{d=1}^D Γ(a + (1 − d)/2).

Approximate variational posterior.

(29)   q(Λ_k) = Wishart(Λ_k | ν̂_k, Ŵ_k).

Evaluating the data objective function.

First, we define sufficient statistic functions for each cluster k:

(30)   N_k(r̂) = Σ_{n=1}^N r̂_nk,   S_k(x, r̂) = Σ_{n=1}^N r̂_nk x_n x_n^T.

Then, we can write the data objective as

(31)   L_data(r̂, ν̂, Ŵ) = Σ_{k=1}^K [ −(N_k(r̂) D / 2) log(2π) + (N_k(r̂) / 2) E_q[log |Λ_k|] − (1/2) tr( E_q[Λ_k] S_k(x, r̂) ) + E_q[ log Wishart(Λ_k | ν, W) − log q(Λ_k) ] ],

where E_q[Λ_k] = ν̂_k Ŵ_k and E_q[log |Λ_k|] = Σ_{d=1}^D ψ( (ν̂_k + 1 − d)/2 ) + D log 2 + log |Ŵ_k|.

B.2 Multinomial likelihood and Dirichlet prior

Multinomial likelihood.

Each observation x_n indicates a single word in a vocabulary of size V.