Stagewise Learning for Sparse Clustering of Discretely-Valued Data

06/09/2015 ∙ by Vincent Zhao, et al. ∙ Yale University

The performance of EM in learning mixtures of product distributions often depends on the initialization. This can be problematic in crowdsourcing and other applications, e.g. when a small number of 'experts' are diluted by a large number of noisy, unreliable participants. We develop a new EM algorithm that is driven by these experts. In a manner that differs from other approaches, we start from a single mixture class. The algorithm then develops the set of 'experts' in a stagewise fashion based on a mutual information criterion. At each stage EM operates on this subset of the players, effectively regularizing the E rather than the M step. Experiments show that stagewise EM outperforms other initialization techniques for crowdsourcing and neurosciences applications, and can guide a full EM to results comparable to those obtained knowing the exact distribution.




1 Introduction

We study the model-based sparse clustering problem for discrete data using a mixture model of product distributions [9, 7]. This model has applications in many fields, including computational neurosciences, crowdsourcing and bioinformatics, and is interesting because it differs technically from the problem for continuous data, where the well-known Gaussian mixture model has been applied successfully.

A fundamental difficulty is that, in high-dimensional datasets, some features can be noisy, redundant or generally uninformative for clustering, and these can push clustering algorithms toward inappropriate or uninteresting results. If these uninformative or noise data points could be eliminated then, we argue, the results should be much more satisfying. This is precisely our goal: to find an informative set of data points and to use these to drive the clustering.

We illustrate our goal with a motivating example from neurosciences (Figure 1). Some neurons in mouse visual cortex respond well to certain grating orientations, while others do not respond systematically to gratings. This is called orientation selectivity [11], and we seek to organize neurons into orientation classes according to this activity automatically. The dataset consists of multiple neural spike trains obtained while the mouse viewed gratings with different orientations. The spike trains are converted into binary data indicating the presence (or absence) of an action potential during a short temporal interval. The problem is: given only the spike train data, are we able to group neurons into clusters that correspond to each orientation-tuned class? If we train a mixture model with all neurons using EM, the answer is negative; many of the neurons that are not well-tuned to orientation pollute the results. However, using only the informative neurons (cartooned as those with well-defined tuning curves) we are able to recover the clusters successfully, which underscores our goal to automatically identify those informative and relevant features for discrete data. The algorithms used for this example are developed in this paper, and applied at the end to crowdsourcing data as well.
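The binarization step described above, converting spike trains into presence/absence indicators per time bin, can be sketched as follows. The function name, bin width, and toy data are illustrative assumptions, not code from the paper.

```python
import numpy as np

def bin_spikes(spike_times, n_bins, bin_width):
    """Convert a list of spike-time arrays (seconds) into a binary
    neurons x bins matrix: entry is 1 if the neuron fired at least
    once during that temporal interval."""
    X = np.zeros((len(spike_times), n_bins), dtype=int)
    for i, times in enumerate(spike_times):
        idx = np.floor(np.asarray(times) / bin_width).astype(int)
        idx = idx[(idx >= 0) & (idx < n_bins)]  # drop out-of-range spikes
        X[i, idx] = 1
    return X

# toy example: two "neurons", 100 ms bins
X = bin_spikes([[0.01, 0.27], [0.15]], n_bins=3, bin_width=0.1)
```

Each row of `X` then serves as one coordinate-wise observation for the mixture model.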

Figure 1: Bernoulli mixture model learned from multiple neuron spike train data in mouse visual cortex illustrates our goal. Mice view gratings at one of 12 different orientations and an electrode records from multiple units. The gray box illustrates the given data. Some units are selective to the gratings, and some are not (uninformative neurons). If all neurons are used to learn a mixture model, the classes are ill-defined and the orientation tuning curves for each class are uniform (D). (A) shows a mixture model learned by the algorithm developed in this paper, which works by identifying the informative neurons and regularizing EM. In the cartoon these neurons correspond to those that are well tuned, as indicated by the solid box in (C). We emphasize that these tuning curves were not used by our algorithm but were derived from the classes it computed. Data courtesy M. Stryker, University of California at San Francisco.

A similar problem for continuous data – the Gaussian mixture model with sparse means – is better studied. In particular, [12] proposed an algorithm based on a penalized likelihood function that leads to an EM variant with a regularized M-step, and [1] analyzes learning for a mixture of two isotropic Gaussians in high dimensions under sparse mean separation. As in these papers, we also consider a penalized likelihood function but for a mixture of discrete product distributions. However, we differ from them in that we directly regularize the E-step rather than the M-step.

To regularize the E-step, we define an information-theoretic quantity and use it in a novel, stagewise fashion. Our measure is the sum of pairwise conditional mutual information of a certain hybrid distribution, defined below, which turns out to be closely related to maximum likelihood estimation and the EM algorithm. A similar idea appears in [17], who use the Euclidean distance between pairs of data points to regularize a K-means algorithm for sparse clustering. However, in our approach we select which variables to place into the informative set in a stagewise fashion. This stagewise technique is important because many researchers have pointed out the drawbacks of using dimensionality reduction before clustering [3, 2, 4]. Importantly, as is also explained below, this involves starting with a single class and splitting it into multiple classes. We stress that our informative set is conceptually different from maximally informative dimensions [16].

The paper is organized as follows. After briefly introducing the problem setting for learning mixtures of discrete product distributions with sparse structure, our specific algorithm is presented in Section 3. In Section 4, we apply our algorithm to crowdsourcing data, show its range of applicability beyond neurosciences, illustrate our information-theoretic measure, and compare it with other state-of-the-art algorithms.

2 Background

Throughout the paper, we use $[n]$ to denote the integer set $\{1, \dots, n\}$. In a mixture of discrete product distributions (MDPD), it is assumed that each observation $x = (x_1, \dots, x_d)$ is drawn from a finite mixture distribution $p_\theta(x) = \sum_{k=1}^{K} \pi_k \, p_\theta(x \mid y = k)$. Here $y \in [K]$ is the latent (non-observable) variable and the $\pi_k$ denote the mixing weights; they satisfy $\sum_k \pi_k = 1$. We assume each $x_i$ takes values in $[m]$ and that $p_\theta(x \mid y = k)$ is a $d$-dimensional discrete product distribution that can be factorized as $p_\theta(x \mid y = k) = \prod_{i=1}^{d} p_\theta(x_i \mid y = k)$. Each conditional distribution $p_\theta(x_i \mid y = k)$ lives on the probability simplex. The set of all parameters is denoted by $\theta$. Given observations $x^{(1)}, \dots, x^{(n)}$, the goal of mixture model learning is to maximize the marginal log-likelihood:

$$\ell(\theta) \;=\; \sum_{t=1}^{n} \log \sum_{k=1}^{K} \pi_k \prod_{i=1}^{d} p_\theta\big(x_i^{(t)} \mid y = k\big). \qquad (1)$$


Since there are latent variables, the marginal log-likelihood is not convex, and EM has been used widely for learning mixture models. EM iteratively updates and optimizes a lower bound of the marginal likelihood function [10]. The lower bound is obtained by applying Jensen's inequality to the log-likelihood function:

$$\ell(\theta) \;\ge\; \sum_{t=1}^{n} \sum_{k=1}^{K} q\big(y = k \mid x^{(t)}\big) \log \frac{\pi_k \, p_\theta\big(x^{(t)} \mid y = k\big)}{q\big(y = k \mid x^{(t)}\big)}, \qquad (2)$$

where $q$ is a distribution over $y$ that may depend on $x$ or $\theta$. (We shall work on upper bounds shortly.) Let the current model be parametrized by $\theta^{(s)}$. Then

E-step: Calculate $p_{\theta^{(s)}}(y \mid x^{(t)})$ for each observation and set $q = p_{\theta^{(s)}}(y \mid \cdot)$.

M-step: Maximize (2) with regard to $\theta$.
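As a concrete reference point, a minimal unregularized EM loop for a mixture of Bernoulli product distributions (the binary special case of MDPD) might look like the sketch below. All names and implementation details are our own, not code from the paper.

```python
import numpy as np

def em_bernoulli_mixture(X, K, n_iter=50, seed=0):
    """Plain EM for a mixture of Bernoulli product distributions.
    X: (n, d) binary array.  Returns mixing weights pi (K,) and
    per-component Bernoulli parameters mu (K, d)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(K, 1.0 / K)
    mu = rng.uniform(0.25, 0.75, size=(K, d))   # random start
    for _ in range(n_iter):
        # E-step: posterior responsibilities via log-likelihoods
        log_p = (X @ np.log(mu).T + (1 - X) @ np.log(1 - mu).T
                 + np.log(pi))
        log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood updates
        Nk = r.sum(axis=0)
        pi = Nk / n
        mu = np.clip((r.T @ X) / Nk[:, None], 1e-6, 1 - 1e-6)
    return pi, mu
```

The stagewise algorithm of Section 3 modifies the E-step of exactly this kind of loop.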


We study MDPD in a high-dimensional, sparse setting. (The analogous problem for Gaussian mixture models has been studied by [12, 1, 14].) In this setting, the number of informative variables is much smaller than the dimension $d$. Let $S \subset [d]$ denote the set of informative variables, $|S| \ll d$, and let $S^c$ be the complementary set. It is intuitive that the uninformative random variables should not be distinguishable across the different mixture components, i.e. $p_\theta(x_i \mid y = k) = p_\theta(x_i)$ for all $i \in S^c$ and all $k$.

Inspired by [12], we consider the following penalized maximum likelihood problem to encourage sparse structure in the model:

$$\max_\theta \;\; \ell(\theta) \;-\; \lambda \sum_{i=1}^{d} \big\| \big( D_{\mathrm{KL}}\big(p_\theta(x_i \mid y = k) \,\|\, p_\theta(x_i)\big) \big)_{k=1}^{K} \big\|_1, \qquad (4)$$

where $\lambda \ge 0$, $\|\cdot\|_1$ is the $\ell_1$ norm, and $D_{\mathrm{KL}}$ denotes KL-divergence. Each KL term is non-negative, and the equality $D_{\mathrm{KL}} = 0$ holds if and only if the two distributions coincide. Our penalty encourages sparse structure in the model because the penalty on coordinate $i$ vanishes if and only if $p_\theta(x_i \mid y = k) = p_\theta(x_i)$ for all $k$, which indicates that the conditional probability of the random variable $x_i$ is identical in the different mixture components.
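The per-feature KL penalty can be evaluated directly from the model's parameters, as in the illustrative sketch below. The exact weighting of the terms in the paper's penalty (4) may differ; treat this as an assumption-laden approximation with our own function names.

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence, with the 0 * log 0 = 0 convention."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def sparsity_penalty(pi, P):
    """Per-feature penalty: sum over components k of
    pi_k * KL(p(x_i | y=k) || p(x_i)), where p(x_i) is the marginal.
    Zero exactly when feature i is uninformative (identical
    conditionals across components).
    P: (K, d, m) conditional probabilities p(x_i = c | y = k)."""
    marginal = np.einsum('k,kic->ic', pi, P)       # p(x_i = c)
    return np.array([
        sum(pi[k] * kl(P[k, i], marginal[i]) for k in range(len(pi)))
        for i in range(P.shape[1])
    ])
```

A feature whose conditionals agree across all components contributes exactly zero, matching the sparsity characterization above.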

3 Methods

For convenience, assume that we can draw an infinite number of samples. To begin, let $p_\theta$ be an MDPD with parameter $\theta$. Coordinate $i$ is informative if and only if the conditionals $p_\theta(x_i \mid y = k)$ are not identical across components. It can be verified that if $x_i$ is uninformative, the marginal distribution can be factorized as

$$p_\theta(x) \;=\; p_\theta(x_S) \prod_{i \in S^c} p_\theta(x_i), \qquad (5)$$

where $S$ denotes the index set of informative variables. Due to the factorization (5),

$$p_\theta(y \mid x) \;=\; p_\theta(y \mid x_S). \qquad (6)$$
Intuitively, uninformative variables will not affect the posterior distribution; they provide no information about the underlying latent variable. Now, we analyze the penalized likelihood function (4). The normal E-step leads to the penalized M-step (7). The penalized M-step encourages a sparse update for the model and provides a way to determine $S$. By (6), it makes the E-step in the next iteration depend only on $x_S$. However, solving (7) is hard, so we seek another way to determine $S$ and, in the process, bypass the penalized M-step by using a regularized E-step, i.e. calculating $p_\theta(y \mid x_S)$. The following theorem motivates how we select $S$.

(See proofs in the supplement.)

Let $p^*$ be an MDPD from which data are sampled and let $S^*$ be the informative set of $p^*$. If $p_\theta(x_i \mid y = k) = p^*(x_i)$ for all $i \notin S^*$ and all $k$, then

$$D_{\mathrm{KL}}(p^* \,\|\, p_\theta) \;=\; D_{\mathrm{KL}}\big(p^*(x_{S^*}) \,\|\, p_\theta(x_{S^*})\big).$$

Maximizing the likelihood function is equivalent to minimizing the KL-divergence loss $D_{\mathrm{KL}}(p^* \,\|\, p_\theta)$, since $\mathbb{E}_{p^*}[\log p_\theta(x)] = -H(p^*) - D_{\mathrm{KL}}(p^* \,\|\, p_\theta)$, where $H(p^*)$ is the entropy of $p^*$. The KL-divergence loss can be viewed as a measure of how well the model estimates $p^*$. Theorem 3 suggests that, if we have an appropriate model for uninformative features, $S^*$ could be recovered by solving the following dimensionality reduction problem: find the smallest $S$ such that

$$D_{\mathrm{KL}}(p^* \,\|\, p_\theta) \;=\; D_{\mathrm{KL}}\big(p^*(x_S) \,\|\, p_\theta(x_S)\big).$$
Although in practice $p^*$ is unknown, it is easy to find a model that satisfies the condition in the theorem. Simply pretend that all features are uninformative. Then $p_\theta$ is just a one-component MDPD satisfying $p_\theta(x_i) = p^*(x_i)$ for all $i$. We use this idea to initialize our algorithm, which is distinct from the common practice of initializing mixture models with multiple components and random parameters. But two problems arise. First, the one-component model might not be a good one, since it does not capture any higher-order interactions. It will have to be split, and the procedure to do this is in Section 3.2. Second, the KL-divergence loss is computationally intractable. Our approach is to find a proper approximation to it, based on which we can place variables into $S$. The details are as follows.
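With finite data, the one-component initialization just described amounts to setting the single component's conditionals to the empirical category frequencies. A minimal sketch (our own interface, not the paper's code):

```python
import numpy as np

def init_one_component(X, m):
    """Initialize a one-component MDPD: the component's conditional
    for each feature is the empirical frequency of each of the m
    category values in the data X (n x d, values in {0,...,m-1})."""
    n, d = X.shape
    P = np.zeros((1, d, m))
    for c in range(m):
        P[0, :, c] = (X == c).mean(axis=0)   # empirical p(x_i = c)
    pi = np.array([1.0])                     # single mixing weight
    return pi, P
```

For uninformative features this already matches their true marginals, which is exactly the condition the theorem requires.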

3.1 Conditional Mutual Information Approximates KL-Divergence Loss


Figure 2: The upper bound on the KL-divergence loss induced by one EM iteration; the global maximum of the log-likelihood is marked for reference.

From now on, $D_{\mathrm{KL}}(p^* \,\|\, p_\theta)$ is referred to as the KL-divergence loss. We first define a hybrid distribution.

The hybrid distribution is defined as

$$\tilde{p}_\theta(x, y) \;=\; p^*(x)\, p_\theta(y \mid x). \qquad (8)$$

The hybrid distribution is a valid probability distribution, as it is non-negative and sums to one. Using the hybrid distribution, the following theorem gives an upper bound on the KL-divergence loss.



Let $\tilde{p}_\theta$ be the hybrid distribution (8), $p_{\theta^{(s)}}$ be the model distribution at time $s$, and $p_{\theta^{(s+1)}}$ be the model distribution after one iteration of EM. Then,

The geometric interpretation of the theorem is provided in Figure 2. This theorem is a direct result of Jensen's inequality and the EM algorithm. By information theory,

The first term in (9) involves the singleton marginal conditional entropies, which are computationally tractable. However, because $p^*$ and $\tilde{p}_\theta$ cannot be factorized in most cases, the second term is computationally intractable. To tackle the intractability, we further approximate it with the Bethe entropy approximation.

Recall that, in graphical models, random variables $x_1, \dots, x_d$ are associated with the vertices $V$ of a graph $G = (V, E)$, and the joint distribution is associated with the graph. The Bethe entropy approximation [13] is defined as

$$H_{\mathrm{Bethe}}(x) \;=\; \sum_{i \in V} H(x_i) \;-\; \sum_{(i,j) \in E} I(x_i; x_j),$$

where $I(x_i; x_j)$ is pairwise mutual information. The Bethe entropy approximation is exact for acyclic Markov random fields.
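The pairwise mutual information terms in the Bethe approximation can be estimated from data by brute force over value pairs; the sketch below is illustrative (our own function, quadratic in both the dimension and the alphabet size).

```python
import numpy as np

def pairwise_mi(X, m):
    """Empirical mutual information I(x_i; x_j) for every feature
    pair, from categorical data X (n x d, values in {0,...,m-1})."""
    n, d = X.shape
    MI = np.zeros((d, d))
    eps = 1e-12                      # guards log(0); joint==0 terms vanish
    for i in range(d):
        for j in range(i + 1, d):
            joint = np.zeros((m, m))
            for a in range(m):
                for b in range(m):
                    joint[a, b] = np.mean((X[:, i] == a) & (X[:, j] == b))
            pa = joint.sum(axis=1)   # marginal of x_i
            pb = joint.sum(axis=0)   # marginal of x_j
            MI[i, j] = MI[j, i] = np.sum(
                joint * np.log((joint + eps) / (np.outer(pa, pb) + eps)))
    return MI
```

Independent features give values near zero; perfectly coupled binary features give roughly log 2.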

Applying the Bethe entropy approximation to the second term in (9) yields an approximation to the conditional entropy:


Now, combine (9) and (10) to approximate an upper bound for the KL-divergence loss:


The approximation consists of pairwise conditional mutual information terms. It breaks the curse of dimensionality for the KL-divergence loss, and its computational cost is quadratic in the number of features. It leads to an operational version of Theorem 3:

Under the same conditions as in Theorem 3, we have


Thus we can recover the informative set in a similar way to that suggested by Theorem 3. Moreover, if the model fits the data perfectly, (11) would be zero.

If $x_i$ is uninformative under the model, then $I_{\tilde{p}_\theta}(x_i; x_j \mid y) = 0$ for all $j \ne i$.

In effect, by Proposition 3.1, if $I_{\tilde{p}_\theta}(x_i; x_j \mid y)$ is large for some feature pair $(i, j)$, we can conclude that both $x_i$ and $x_j$ are informative. On the other hand, the model does not fit the data well in those dimensions. Therefore, $x_i$ and $x_j$ are significant for model learning and should be used to regularize the E-step. This is the key idea that underlies our algorithm.
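A direct way to estimate the pairwise conditional mutual information under the hybrid distribution is to reweight the empirical pair counts by the current model's posterior responsibilities. The sketch below is our own reading of that construction, with hypothetical names, not the paper's code.

```python
import numpy as np

def conditional_mi(X, R, m):
    """Pairwise conditional mutual information I(x_i; x_j | y) under
    the hybrid distribution: empirical pair counts reweighted by the
    current model's posterior responsibilities R (shape n x K)."""
    n, d = X.shape
    K = R.shape[1]
    pi = R.mean(axis=0)                      # p(y = k) under the hybrid
    CMI = np.zeros((d, d))
    eps = 1e-12
    for k in range(K):
        w = R[:, k] / R[:, k].sum()          # within-class sample weights
        for i in range(d):
            for j in range(i + 1, d):
                joint = np.zeros((m, m))
                for a in range(m):
                    for b in range(m):
                        joint[a, b] = w[(X[:, i] == a) & (X[:, j] == b)].sum()
                pa = joint.sum(axis=1)
                pb = joint.sum(axis=0)
                CMI[i, j] += pi[k] * np.sum(
                    joint * np.log((joint + eps) / (np.outer(pa, pb) + eps)))
    return CMI + CMI.T
```

Large entries of this matrix flag feature pairs that the current model has not yet explained, which is what drives the stagewise selection.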

3.2 Algorithm: Stagewise EM

  INPUT: observations and a stopping criterion
  Initialization: one-component model (empirical frequencies), informative set S = ∅
  while not converged do
     Calculate the pairwise conditional mutual information for the current model.
     Find the biggest entry (i, j; k) in it.
     if i ∉ S or j ∉ S then
        Add i and j into S
        if a new component is needed then
           Duplicate the k-th component and perturb it in coordinates i and j (explained in context).
        end if
     end if
     Perform the regularized E-step and M-step in Theorem 3.2
  end while
Algorithm 1 Stagewise EM

Our main algorithm, stagewise EM, is now developed. Following convention (one-hot encoding), let $x_i^{(t)}$ be an observation of coordinate $i$. The model is initialized as a one-component MDPD such that the conditional distribution of each feature equals the corresponding frequency in the observations. For uninformative features, this initialization is already a good estimate.


For finite observations, redefine the hybrid distribution with the empirical distribution in place of $p^*$. The regularized E-step is to calculate $p_\theta(y \mid x_S)$ based on the current model and $S$. The corresponding M-step is given by

Thus stagewise EM iteratively performs a regularized E-step followed by a corresponding M-step. But, to regularize the E-step, the informative set $S$ has to be obtained explicitly, which we do in an interlaced fashion together with the EM iterations. Specifically, since at least one EM iteration is needed for each update of $S$, the algorithm works conservatively and attempts to update $S$ after each iteration of EM.
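The regularized E-step restricts the posterior computation to the features currently in $S$; coordinates outside $S$ cannot influence the class assignment. A minimal sketch with an assumed parameter layout (mixing weights `pi`, conditionals `P` of shape K x d x m), not the paper's implementation:

```python
import numpy as np

def regularized_e_step(X, pi, P, S):
    """Posterior responsibilities computed from the features in the
    informative set S only.  X: (n, d) categorical data; pi: (K,)
    mixing weights; P: (K, d, m) conditionals p(x_i = c | y = k)."""
    n = X.shape[0]
    log_r = np.tile(np.log(pi), (n, 1))
    for i in S:
        # add log p(x_i | y = k) for the observed value of x_i
        log_r += np.log(P[:, i, :].T[X[:, i]] + 1e-12)
    log_r -= log_r.max(axis=1, keepdims=True)   # numerical stability
    R = np.exp(log_r)
    R /= R.sum(axis=1, keepdims=True)
    return R
```

With $S$ empty, the posterior reduces to the prior mixing weights, matching the one-component starting point of the algorithm.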

We now develop the update for the informative set $S$. By a standard result in information theory, the conditional mutual information decomposes across the mixture components. $S$ is updated by picking the biggest triplet $(i, j; k)$ in this decomposition and adding the related indices, $i$ and $j$, into $S$ (if they are not already in $S$). The stagewise update is a strong regularization on EM, as it focuses EM on the features that are informative and have not been fitted well by the current model. We use the word "stagewise" because a similar idea has been applied to regression [6, 8].

An important detail remains. Since the algorithm is initialized as a one-component model, we may need to increase the number of components, either at the start or during the iterations. We do this by splitting one, and again act conservatively: first find the largest triplet $(i, j; k)$, duplicate the $k$-th component, and add it into the mixture model, splitting the mixing weight of component $k$ evenly between the original and the copy.


Let $\theta$ be the model before duplication and $\theta'$ the one after duplication. It can be shown that the KL-divergence loss is unchanged. Intuitively, since the duplication does not alter the marginal distribution $p_\theta(x)$, the KL-divergence loss remains unchanged. Theorem 3.2 indicates that the conditional mutual information also remains the same. To break the symmetry between the $k$-th component and the new component, we freeze all parameters in $\theta'$ except those of coordinates $i$ and $j$ in the two copies, and perturb the model with regard to the free parameters. Due to the symmetry, it can be shown that $\theta'$ (the parameters after duplication) is a saddle point of the restricted function. Therefore, we calculate the Hessian of the restricted log-likelihood with regard to the free parameters and perturb in the direction of the eigenvector with the most negative eigenvalue of the Hessian.
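Component duplication can be sketched as below. Note that the paper perturbs along the most negative Hessian eigenvector; for brevity, this sketch breaks symmetry with a small random zero-mean perturbation instead, which is a deliberate simplification, and all names are ours.

```python
import numpy as np

def split_component(pi, P, k, i, scale=0.05, seed=0):
    """Duplicate component k (splitting its weight evenly) and break
    symmetry by a small opposite-signed perturbation of feature i's
    conditional in the two copies."""
    rng = np.random.default_rng(seed)
    pi2 = np.append(pi, pi[k] / 2.0)
    pi2[k] /= 2.0                          # split the weight evenly
    P2 = np.concatenate([P, P[k:k + 1]], axis=0)
    delta = rng.normal(scale=scale, size=P.shape[2])
    delta -= delta.mean()                  # keep the rows near the simplex
    P2[k, i] = np.clip(P2[k, i] + delta, 1e-6, None)
    P2[-1, i] = np.clip(P2[-1, i] - delta, 1e-6, None)
    P2[k, i] /= P2[k, i].sum()             # renormalize both copies
    P2[-1, i] /= P2[-1, i].sum()
    return pi2, P2
```

Because the perturbations in the two copies are opposite in sign, the mixture marginal stays approximately unchanged while the symmetry that stalls EM is removed.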

4 Empirical Studies

Following [5], model-based learning in crowdsourcing can be viewed as a special case of MDPD. Here $x_i$ is the label given by the $i$-th worker to an item whose true label plays the role of the latent variable $y$; this requires the number of label values to equal the number of classes. The goal is to estimate the true label for each item and to assess the individual workers' performances. All the examples below are ε-sparse crowdsourcing problems, in which only a fraction ε of the workers give the true label with some probability; the other workers give random labels with unknown probability. As we show, stagewise EM performs well against the state-of-the-art crowdsourcing algorithms.

We first study the behavior of the informative set using a simulation of a 0.3-sparse MDPD with 3 components. 100 workers provide labels for 1000 items, with the 30 informative ("expert") workers having decreasing capabilities: the first worker provides the true label with probability 0.7 and the 30th worker with probability 0.45. The remaining workers are random.
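The simulated setting just described can be reproduced along the following lines; the function and parameter names are ours, not the paper's.

```python
import numpy as np

def simulate_sparse_crowd(n_items=1000, n_workers=100, n_informative=30,
                          K=3, seed=0):
    """Simulate the sparse crowdsourcing setting described above:
    informative workers give the true label with linearly decaying
    probability (0.7 down to 0.45 here); the rest answer uniformly
    at random."""
    rng = np.random.default_rng(seed)
    truth = rng.integers(K, size=n_items)
    acc = np.linspace(0.7, 0.45, n_informative)
    labels = rng.integers(K, size=(n_items, n_workers))   # random default
    for w in range(n_informative):
        correct = rng.random(n_items) < acc[w]
        # a wrong label is any of the other K-1 classes
        wrong = (truth + rng.integers(1, K, size=n_items)) % K
        labels[:, w] = np.where(correct, truth, wrong)
    return labels, truth
```

Running stagewise EM on `labels` should then surface (a prefix of) the first 30 columns as the informative set.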

Figure 3: The performance of stagewise EM on an ε-sparse MDPD: from left to right, the log-likelihood, the max norm of the conditional mutual information, and the size of the informative set, plotted against the number of iterations. Dashed lines are benchmarks obtained from the underlying true distribution.

Figure 4: Illustration of how the informative set evolves and regularizes EM. The diagonal entries are of no interest and are therefore eliminated. The leftmost panel shows the conditional mutual information at initialization. The next four top panels show it for the first 30 workers at later iterations, while the bottom panels present the mutual information among the workers in the informative set. At each iteration, the informative set regularizes the E-step.

We perform 20 iterations of stagewise EM (Figure 3). The benchmark log-likelihood is given by the true model, and the benchmark conditional mutual information (middle panel) is also obtained with the true distribution and training data. According to Proposition 3.1, the max norm of the conditional mutual information evolves toward 0. The algorithm converges within 10 iterations, as does the size of the informative set. A more detailed look at the mutual information criterion (11) for dimensionality reduction is illustrated in Figure 4. By construction, the workers' capabilities decay as the index increases. Note how stagewise EM rapidly identifies the top 5 most informative workers. The first 8 workers placed in the informative set are all among the top 15, and, from Figure 3, the algorithm has almost converged at that point; the later members of the set are not important. This further suggests that the algorithm is able to "decide" how much information is needed to learn the model: although 30 informative workers exist, in practice fewer than 10 are needed for a good estimate.

We now study prediction performance by comparing our algorithm (Stage-EM) against the spectral algorithm (Spec-EM) [18] and the majority-vote-initialized EM (MV-EM) commonly used in practice. We start with synthetic data for the ε-sparse crowdsourcing problem with 100 workers, 1000 samples, and 3 labels. The informative workers give correct labels with probability 0.6; the rest are uninformative workers giving labels at random with unknown probability. We vary the sparsity ε and, for each value, repeat the experiment 10 times. Figure 5 shows the prediction performance achieved by the different algorithms. The benchmark score is the prediction error of the true model. Spec-EM does not work in this sparse setting; MV-EM is able to keep up with our algorithm until ε becomes small, while Stage-EM stays close to the benchmark for all ε. Our algorithm consistently outperforms the other methods in this sparse setting.

Figure 5: Prediction performance on the 3-label ε-sparse crowdsourcing model: for each sparsity level, experiments are repeated 10 times. The shaded error bar shows the best and the worst performance.
Algorithm            Bird (2 labels)   Dog (4 labels)
StageEM-refine       10.19             16.73
StageEM              12.04             20.69
  (|S| / d)          (11/39)           (14/52)
Spec-EM (Low)        11.11             16.98
Spec-EM (Average)    11.57             22.19
Spec-EM (High)       12.04             31.85
MV-EM*               11.11             16.66

* Refer to the results reported in [18].

Figure 6: Prediction error on real datasets. |S| is the size of the informative set, while d is the total number of workers.

Finally, we turn to real datasets (Table 6), even though they are not sparse. The bluebird dataset [15] is a binary labeling task containing 108 items, 39 workers and 4,212 observed labels. The dog dataset [19] contains 4 different dog-breed labels from ImageNet. Since these datasets are incomplete, we add a new "missing" label which indicates that the worker did not label the item. The probability of not giving a label is assumed to be independent of the true label; it is estimated from the data for each worker and then frozen (not trainable) during model fitting. Since these datasets are not sparse, we run regular EM on the complete dataset after model fitting to leverage all the information (StageEM-refine). As shown in Table 6, it is still comparable with MV-EM and surpasses Spec-EM. Importantly, stagewise EM has decent prediction performance using only about 1/3 of the workers available in both datasets.

5 Conclusion

We developed a stagewise EM algorithm for sparse clustering of discretely-valued data. The key insight is that uninformative features should have uniform probability of belonging to any mixture class. This led to an informative set of features via a mutual information criterion, and to a practical algorithm by approximating that criterion with the Bethe entropy. The resulting algorithm performed well on neuroscience and crowdsourcing datasets.

6 Acknowledgements

We thank M. Stryker and M. Dadarat for providing the neural data, and Harrison Huibin Zhou and Sahand N. Negahban for suggestions. We also thank Junjiajia Long for discussions. Research supported by the Simons Foundation, the Paul Allen Family Foundation, and the China Scholarship Council.


  • [1] Martin Azizyan, Aarti Singh, and Larry Wasserman. Minimax theory for high-dimensional Gaussian mixtures with sparse mean separation. arXiv preprint, pages 1–9, 2013.
  • [2] C. Bouveyron, S. Girard, and C. Schmid. High-dimensional data clustering. Computational Statistics and Data Analysis, 52(1):502–519, 2007.
  • [3] Charles Bouveyron and Camille Brunet-Saumard. Model-based clustering of high-dimensional data: A review. Computational Statistics and Data Analysis, 71:52–78, 2014.
  • [4] Wei-Chien Chang. On Using Principal Components Before Separating a Mixture of Two Multivariate Normal Distributions. Applied Statistics, 32(3):267–275, 1983.
  • [5] Alexander Philip Dawid and Allan M Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied statistics, 28(1):20–28, 1979.
  • [6] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least Angle Regression. Annals of Statistics, 32(2):407–499, 2004.
  • [7] Jon Feldman, Ryan O’Donnell, and Rocco A. Servedio. Learning mixtures of product distributions over discrete domains. In Proceedings - Annual IEEE Symposium on Foundations of Computer Science, FOCS, volume 2005, pages 501–510, 2005.
  • [8] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer, 2009.
  • [9] Prateek Jain and Sewoong Oh. Learning Mixtures of Discrete Product Distributions using Spectral Decompositions. arXiv preprint, 2013.
  • [10] Thomas P. Minka. Expectation-Maximization as lower bound maximization. Technical report, 1998.
  • [11] C. M. Niell and M. P. Stryker. Modulation of visual responses by behavioral state in mouse visual cortex. Neuron, 65(4):472–479, 2010.
  • [12] Wei Pan and Xiaotong Shen. Penalized Model-Based Clustering with Application to Variable Selection. Journal of Machine Learning Research, 8:1145–1164, 2007.
  • [13] Martin J. Wainwright and M I Jordan. Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.
  • [14] Sijian Wang and Ji Zhu. Variable selection for model-based high-dimensional clustering and its application to microarray data. Biometrics, 64(2):440–448, 2008.
  • [15] Peter Welinder, Steve Branson, Serge Belongie, and Pietro Perona. The Multidimensional Wisdom of Crowds. Advances in Neural Information Processing Systems 23, 2010.
  • [16] Ross S. Williamson, Maneesh Sahani, and Jonathan W. Pillow. The Equivalence of Information-Theoretic and Likelihood-Based Methods for Neural Dimensionality Reduction. PLoS Computational Biology, 11(4), 2015.
  • [17] Daniela M. Witten and Robert Tibshirani. A Framework for Feature Selection in Clustering. Journal of the American Statistical Association, 105(490):713–726, 2010.
  • [18] Yuchen Zhang, Xi Chen, Dengyong Zhou, and Michael I. Jordan. Spectral Methods meet EM: A Provably Optimal Algorithm for Crowdsourcing. Advances in Neural Information Processing Systems 27, 2014.
  • [19] Dengyong Zhou, John Platt, Sumit Basu, and Yi Mao. Learning from the wisdom of crowds by minimax entropy. Advances in Neural Information Processing Systems 25, pages 2204–2212, 2012.


Proof: Theorem 3

Since is the informative set, by (5),

In the theorem, we set $p_\theta(x_i \mid y = k) = p^*(x_i)$ for $i \notin S^*$. Hence, we also have

For KL-divergence,

Proof: Theorem 3.1

By the standard result from EM, is lower bounded

where . Let be defined as (8) and by Theorem 3.2,

Using the lower bound of to upper bound gives the desired result.

Proof: Theorem 3.2

In the finite-observation case, after a regularized E-step, the M-step is to

By the factorization of MDPD, it can be written as

By the parameterization of MDPD, the discrete distribution is represented by its mixing weights and conditional probabilities, which satisfy


Therefore, we can maximize over the different blocks of parameters separately. This constrained optimization problem is solved by applying Lagrange multipliers, which gives

where $\mathbb{1}[\cdot]$ denotes the indicator function. With this definition, the probability for any instance is

And we can reformulate the results as

Proof: Theorem 3.2

It is enough to prove that, for every pair $(i, j)$,


The conditional mutual information can be decomposed as

Let the new component be the duplicate of component $k$. We notice that

Moreover, duplication does not change the conditional distribution within each component, so we have

Therefore, we know that (14) holds.

Proof: Proposition 3.1

The proposition follows from the fact that if , then

The only term containing $x_i$ is its own singleton factor. Therefore, $x_i$ is independent of the other features in the hybrid distribution, which leads to the proposition.

Proof: Proposition 3.1

If , then . And the hybrid distribution becomes

By Proposition 3.1, it is enough to show that the relevant conditional mutual information vanishes for the remaining pairs. This is true because the hybrid distribution factorizes and the features are independent conditionally on the latent variable.