Feature Selection Facilitates Learning Mixtures of Discrete Product Distributions

11/25/2017 ∙ by Vincent Zhao, et al. ∙ Yale University

Feature selection can facilitate the learning of mixtures of discrete random variables as they arise, e.g. in crowdsourcing tasks. Intuitively, not all workers are equally reliable but, if the less reliable ones could be eliminated, then learning should be more robust. By analogy with Gaussian mixture models, we seek a low-order statistical approach, and here introduce an algorithm based on the (pairwise) mutual information. This induces an order over workers that is well structured for the 'one-coin' model. More generally, it is justified by a goodness-of-fit measure and is validated empirically. Improvement on real datasets can be substantial.




1 Introduction

Mixtures of discrete product distributions (MDPD) have been applied widely to large problems, from computational neuroscience to bioinformatics and recommendation systems. Here we concentrate on another popular application, crowdsourcing [3], to introduce a feature selection algorithm that works in the discrete variable setting. In effect, our algorithm enhances the learning process by identifying those workers who are likely to be performing well. We show experimentally that this feature selection leads to better performance than other state-of-the-art algorithms, and we provide a theoretical framework that suggests why this is to be expected.

Learning MDPD is NP-hard. Although many authors have proposed algorithms and heuristics to learn MDPD under different circumstances [3, 17, 9, 6], there is almost no literature concerning the feature selection problem as we formulate it. An exception is [2], who sought features that sharply separate mixture components. Their algorithm is based on correlations of the input data, but is restricted to mixtures of binary product distributions; our algorithm is applicable to general MDPD and is based on pairwise mutual information. Another group sought to identify reliable workers directly [7, 16], but this led to algorithms that are specific to crowdsourcing and hard to generalize to MDPD.

Dimensionality reduction for Gaussian mixture models is better studied. [12] proposed an algorithm based on a penalized likelihood function that leads to an EM variant with a regularized M-step. [1] analyze learning for a mixture of two isotropic Gaussians in high dimensions under sparse mean separation. More recently, [10] proposed an algorithm to discover influential features for high dimensional clustering. The dimensionality reduction methods in [1, 10] are based on Principal Component Analysis, which constructs features that are linear combinations of the input variables. This underlines the fundamental difference between the continuous- and the discrete-valued problems: linear combination is not a valid operator for discrete random variables. To see this, let $X_1$ denote a random variable which takes values from a finite set of unordered category labels, while $X_2$ is another discrete variable. A linear combination such as $a X_1 + b X_2$ is obviously not well-defined.

Even though a direct generalization of the techniques for Gaussian mixture models to MDPD is not proper, the continuous variable case has been a source of inspiration in the following sense. PCA performs an eigen-decomposition on the sample covariance matrix which relies, in turn, on second-order statistics of the data. The second-order statistics for discrete random variables are basically co-occurrence. Thus we ask: can dimensionality reduction be based on co-occurrence for MDPD, which forms an analogue to the use of PCA for Gaussian mixture models? We give a positive answer in this paper and propose a novel feature selection technique for MDPD that is based on pairwise mutual information. The utilization of pairwise mutual information is justified by its connection to a goodness-of-fit measure and is validated by empirical studies on real crowdsourcing datasets. We show that, in effect, the algorithm filters out noise and makes the learning more robust; in many cases we significantly reduce the error rates.

2 Background

We study mixtures of discrete product distributions (MDPD). Throughout the paper, we use the uppercase letters $X$ and $Y$ for random variables and the lowercase letters $x$ and $y$ for their instances (realizations). Let $X_i$ be an observable discrete variable and $Y$ the latent variable. $Y$ takes discrete values and indicates the mixture component $y \in \{1, \dots, k\}$, where $k$ is the total number of components. Let $X = (X_1, \dots, X_n)$, where $n$ is the dimension of the model and $X_i \in \{1, \dots, d\}$. MDPD is a generative model with joint probability distribution:

$$P(X = x, Y = y) = P(Y = y)\, P(X = x \mid Y = y).$$

$P(X \mid Y)$ is a product distribution, i.e.

$$P(X = x \mid Y = y) = \prod_{i=1}^{n} P(X_i = x_i \mid Y = y).$$

Given the observations $x^{(1)}, \dots, x^{(N)}$, the goal is to estimate the model parameters, i.e. $P(Y = y)$ and $P(X_i = x_i \mid Y = y)$, for all $i$ and $y$. There are many papers addressing this learning problem [3, 17, 9, 2]. In general, those algorithms can be classified into two groups: (1) maximum likelihood estimation and (2) method of moments. The EM algorithm and its variants have been widely used to maximize the log-likelihood. However, since the log-likelihood function is non-convex, these algorithms can get stuck in a bad local maximum. Recently, several authors [17, 9] proposed algorithms for learning MDPD based on the method of moments, which rely on third-order moments. The performance of these algorithms is statistically provable under certain conditions.
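The generative process above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `sample_mdpd`, the array layout `cond[i, x, y]`, and the use of 0-based labels (instead of labels starting at 1) are choices made here, not part of the paper.

```python
import numpy as np

def sample_mdpd(prior, cond, n_samples, seed=0):
    """Draw samples from a mixture of discrete product distributions.

    prior: shape (k,), prior[y] = P(Y = y).
    cond:  shape (n, d, k), cond[i, x, y] = P(X_i = x | Y = y).
    Returns X of shape (n_samples, n) and the latent labels y of shape (n_samples,).
    Labels are 0-based here (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    n, d, k = cond.shape
    # First draw the latent component for every sample.
    y = rng.choice(k, size=n_samples, p=prior)
    X = np.empty((n_samples, n), dtype=int)
    # Given Y = c, the coordinates X_1..X_n are drawn independently.
    for i in range(n):
        for c in range(k):
            idx = np.flatnonzero(y == c)
            if idx.size > 0:
                X[idx, i] = rng.choice(d, size=idx.size, p=cond[i, :, c])
    return X, y
```

In a crowdsourcing reading, each row of `X` is one item, each column is one worker, and `y` holds the unobserved true labels.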

3 Feature Selection for MDPD

The problem of feature selection is to reduce the model dimension by identifying a useful and relevant feature subset. It is used to simplify the model for easier interpretation, to reduce training time, to overcome the curse of dimensionality, and to avoid over-fitting, thereby making the model more robust. Most literature on feature selection focuses on supervised learning, where the usefulness and relevance of features are generally defined by their predictive power. Feature selection and dimensionality reduction for unsupervised learning are more challenging problems, due to the lack of labeled data. Refer to [8, 5] for reviews on this topic.

It is well-known that the EM algorithm is sensitive to initialization, while the method of moments [17] is sensitive to some global properties of the model. The performance of both algorithms can be dramatically impaired by noisy, irrelevant and redundant data. This makes feature selection relevant to learning MDPD, and critical in practice. In this section, we introduce our feature selection technique based on pairwise mutual information and illustrate the underlying ideas.

Intuitively, we want to identify those features that are discriminative of the latent variable $Y$, despite the lack of any direct access to that latent variable. Nothing can be said about $Y$ if only one observable variable is revealed, because the one-dimensional MDPD is not identifiable. Therefore, the learning algorithm has to rely on the interaction among different observable variables. For MDPD, if $X_i$ is known to be independent of $Y$, it can be shown that $X_i$ must also be independent of $X_j$ ($j \neq i$). On the other hand, if a strong dependence between $X_i$ and $X_j$ is observed, it can be concluded that $X_i$ and $X_j$ are discriminative of $Y$ and should be identified as useful features.

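Our feature selection technique is motivated by the argument above. We use mutual information to measure the dependence between two variables,

$$I(X_i; X_j) = \sum_{x_i, x_j} P(x_i, x_j) \log \frac{P(x_i, x_j)}{P(x_i)\, P(x_j)}.$$

Algorithm 1 Feature Selection for MDPD
  Input: the number of features to be selected $m$, observed data $x_i^{(s)}$ for $i = 1, \dots, n$ and $s = 1, \dots, N$.
  Estimate $I(X_i; X_j)$ from the data.
  Use either of the two heuristics:
  (a) Find the feature subset $S$ of size $m$ so that $\sum_{i, j \in S,\, i \neq j} I(X_i; X_j)$ is maximized.
  (b) For each $i$, calculate the mutual information score $\sum_{j \neq i} I(X_i; X_j)$ and select the top $m$ features according to their scores.

The feature selection technique is shown in algorithm 1. First, we estimate the joint probability with $\hat{P}(x_i, x_j) = C(x_i, x_j) / N$, where $C(x_i, x_j)$ is the co-occurrence count between $X_i = x_i$ and $X_j = x_j$ and $N$ is the sample size. Then, we estimate the pairwise mutual information $\hat{I}(X_i; X_j)$ by plugging $\hat{P}$ into the definition above, for all $i$ and $j$. After getting the pairwise mutual information matrix, two feature selection heuristics are proposed. The first one is to maximize the sum of the entries of sub-matrices of the pairwise mutual information matrix. The other one is based on feature ranking according to the mutual information score, i.e.

$$\mathrm{score}(i) = \sum_{j \neq i} \hat{I}(X_i; X_j). \tag{1}$$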
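Heuristic (b) of algorithm 1 can be sketched as follows. This is a minimal illustration, assuming complete data (no missing labels), 0-based integer labels, and a common alphabet size `d` for all features; the function names `pairwise_mi` and `select_top_features` are hypothetical and not taken from the paper.

```python
import numpy as np

def pairwise_mi(X, d):
    """Plug-in estimate of I(X_i; X_j) for all feature pairs.

    X: integer array of shape (N, n) with values in {0, ..., d-1}.
    Returns an (n, n) matrix with zeros on the diagonal.
    """
    N, n = X.shape
    onehot = np.eye(d)[X]                                   # (N, n, d)
    # joint[i, j, a, b] = co-occurrence count of (X_i = a, X_j = b) divided by N
    joint = np.einsum("sia,sjb->ijab", onehot, onehot) / N
    marg = onehot.mean(axis=0)                              # (n, d)
    indep = marg[:, None, :, None] * marg[None, :, None, :]
    with np.errstate(divide="ignore", invalid="ignore"):
        # Cells with zero joint probability contribute nothing to the sum.
        terms = np.where(joint > 0, joint * np.log(joint / indep), 0.0)
    mi = terms.sum(axis=(2, 3))
    np.fill_diagonal(mi, 0.0)                               # exclude j == i from the score
    return mi

def select_top_features(mi, m):
    """Rank features by their mutual information score and keep the top m."""
    scores = mi.sum(axis=1)
    order = np.argsort(-scores)
    return order[:m], scores
```

Heuristic (a) is a combinatorial search over subsets and is not sketched here; the ranking in heuristic (b) only needs one pass over the mutual information matrix.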


In practice, the mutual information score can be used to decide the number of features to be used in the model. In section 5, we plot the mutual information score for the features in the real datasets. It is observed that the score drops quickly after the top few features and has a relatively flat tail. The curve “resembles” the plot of eigenvalues of PCA. Therefore, we may set a cut-off according to the gradient of the curve.
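One simple way to turn "a cut-off according to the gradient of the curve" into code is to cut at the steepest drop of the sorted scores. This rule is an assumption made for illustration (the paper does not specify a particular rule), and `cutoff_by_largest_drop` is a hypothetical name.

```python
import numpy as np

def cutoff_by_largest_drop(scores):
    """Return how many top features to keep, cutting at the steepest drop
    of the score curve sorted in decreasing order (an illustrative rule only)."""
    s = np.sort(np.asarray(scores, dtype=float))[::-1]
    drops = s[:-1] - s[1:]          # decrease between consecutive ranks
    return int(np.argmax(drops)) + 1
```

Other rules, such as thresholding the drop relative to the top score, would fit the same description; the choice matters mostly when the curve has no single clear elbow.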

3.1 One-Coin Model

To demonstrate the feature selection technique, we consider a simple mixture of discrete product distributions that is usually referred to as the “one-coin model” in crowdsourcing. For the one-coin model, the number of components $k$ is identical to the number of labels $d$. We assume that $Y$ is uniformly distributed, i.e. $P(Y = y) = 1/k$ for $y = 1, \dots, k$. The conditional probability of $X_i$ is parameterized by a single parameter $p_i$. More concretely, it is defined as

$$P(X_i = x \mid Y = y) = \begin{cases} p_i & \text{if } x = y, \\ \dfrac{1 - p_i}{k - 1} & \text{if } x \neq y. \end{cases}$$

In other words, worker $i$ uses a single coin flip to decide the label. With probability $p_i$, the worker gives the correct label, whatever the true label is. With probability $1 - p_i$, the worker gives an incorrect label chosen uniformly at random. In this case, it is intuitive to define the capabilities of workers. A worker with larger $p_i$ is more capable than a worker with smaller $p_i$. A worker with $p_i = 1$ is the best, because that worker always gives the correct label. Given a group of workers with different capabilities, the goal of feature selection is to find the most capable ones.

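The mutual information $I(X_i; X_j)$ depends on the joint probability distribution $P(X_i, X_j)$. Since $P(Y = y) = 1/k$, the marginal distribution of $X_i$ is uniform for all $i$. The joint distribution $P(X_i, X_j)$ can be represented by a $k$-by-$k$ symmetric matrix whose diagonal elements and off-diagonal elements are (respectively) identical. Let the diagonal elements be denoted $a$; because each row sums to the uniform marginal $1/k$, the off-diagonal elements become $b = \frac{1/k - a}{k - 1}$. The mutual information is then

$$I(X_i; X_j) = k\, a \log(k^2 a) + k (k - 1)\, b \log(k^2 b).$$

It equals zero when $a = 1/k^2$ and is monotonically increasing in $a$ when $a > 1/k^2$. In addition, $a$ can be expressed as a function of the one-coin parameters $p_i$ and $p_j$, i.e.

$$a = \frac{1}{k} \left[ p_i p_j + \frac{(1 - p_i)(1 - p_j)}{k - 1} \right].$$

This function describes a hyperbolic paraboloid as shown in Figure 1. When $p_i = p_j = 1/k$, the function is at its saddle point, where $a = 1/k^2$. In the region where both $p_i$ and $p_j$ are larger than $1/k$, $a$ is monotonically increasing with regard to $p_i$ when $p_j$ is fixed, and vice versa.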

Figure 1: The figure shows the diagonal joint probability $a$ as a function of the worker accuracies $p_i$ and $p_j$ for a fixed number of labels $k$. The red dot is the saddle point of the hyperbolic paraboloid, where $p_i = p_j = 1/k$ and $a = 1/k^2$. In the area where $p_i > 1/k$ and $p_j > 1/k$, $a$ is monotonically increasing with regard to either $p_i$ or $p_j$ when the other is fixed.
Figure 2: This figure shows the relationship between the proposed goodness-of-fit measure and the KL-divergence. The two curves in the figure are the marginal log-likelihood and its lower bound derived from Jensen’s inequality.

Thus, we conclude that when $p_i > 1/k$ and $p_j > 1/k$ (i.e. when workers are better than random guessing), the mutual information $I(X_i; X_j)$ is monotonically increasing with regard to $p_i$ and $p_j$. In other words, if worker $i$ is more capable than worker $j$ (i.e. $p_i > p_j$), we have $I(X_i; X_l) > I(X_j; X_l)$ for any other worker $l$. This is enough to guarantee that our feature selection techniques (either (a) or (b) in algorithm 1) will always select $X_i$ over $X_j$.
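The monotonicity claim can be checked numerically by building the exact joint distribution of two one-coin workers. The sketch below assumes a uniform prior over $k$ labels and 0-based labels; the helper names `one_coin_joint` and `mutual_information` are hypothetical.

```python
import numpy as np

def one_coin_joint(p_i, p_j, k):
    """Exact joint P(X_i, X_j) for two one-coin workers under a uniform prior on Y."""
    def cond(p):
        c = np.full((k, k), (1.0 - p) / (k - 1))   # wrong labels share 1 - p equally
        np.fill_diagonal(c, p)                     # correct label with probability p
        return c                                   # c[x, y] = P(X = x | Y = y)
    # Sum over y of P(x_i | y) P(x_j | y) P(y), with P(y) = 1/k.
    return cond(p_i) @ cond(p_j).T / k

def mutual_information(joint):
    """Mutual information of a two-dimensional joint probability table."""
    pi = joint.sum(axis=1, keepdims=True)
    pj = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (pi * pj)[mask])))
```

For example, with `k = 3` and a fixed partner accuracy of 0.8, `mutual_information(one_coin_joint(p, 0.8, 3))` grows as `p` moves from 0.5 to 0.7 to 0.9, and it vanishes at `p = 1/3`, where the worker is guessing at random.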

4 Pairwise Mutual Information and Maximum Likelihood Estimation

In this section, the use of pairwise mutual information is justified with theoretical analysis that reveals its relation to maximum likelihood estimation and a goodness-of-fit measure. We start by introducing the maximum likelihood objective function of MDPD and the well-known expectation-maximization (EM) algorithm. Provided $N$ data points, MLE seeks to maximize the marginal log-likelihood

$$\sum_{s=1}^{N} \log P(x^{(s)}; \theta)$$

where $P(x; \theta) = \sum_{y} P(x, y; \theta)$. $\theta$ denotes all the parameters of the model, i.e. $P(Y)$ and $P(X_i \mid Y)$. This is the standard definition of log-likelihood with finite samples. However, for convenience, we conduct our analysis at the population level (infinite sample size). Let $P^*(x)$ denote the underlying distribution from which samples are drawn. The marginal likelihood can be defined as

$$L(\theta) = \sum_{x} P^*(x) \log P(x; \theta).$$

Direct optimization on the marginal likelihood is hard. A common workaround uses Jensen’s inequality to relax the problem. Let $q(y \mid x)$ denote a probability distribution over $Y$. By applying Jensen’s inequality, we have

$$L(\theta) = \sum_{x} P^*(x) \log \sum_{y} q(y \mid x) \frac{P(x, y; \theta)}{q(y \mid x)} \;\geq\; \sum_{x} P^*(x) \sum_{y} q(y \mid x) \log \frac{P(x, y; \theta)}{q(y \mid x)} = F(q, \theta). \tag{4}$$

Instead of maximizing the marginal log-likelihood, we are going to maximize the function $F(q, \theta)$ on the right-hand side. This leads to the EM algorithm, an iterative algorithm consisting of two steps. Let $\theta^{(t)}$ be the model parameters at time $t$.

E-step: Calculate the posterior distribution $P(y \mid x; \theta^{(t)})$ for all the configurations of $x$ and $y$ and let $q(y \mid x)$ be $P(y \mid x; \theta^{(t)})$.

M-step: Update the parameters by calculating

$$\theta^{(t+1)} = \arg\max_{\theta} \sum_{x, y} \tilde{P}(x, y) \log P(x, y; \theta)$$

where $\tilde{P}(x, y) = P^*(x)\, q(y \mid x)$.
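The two steps can be sketched for finite samples as below. This is a textbook-style EM for a mixture of discrete product distributions, not the authors' implementation: it assumes complete data with 0-based labels, a random Dirichlet initialization, and a small smoothing constant in the M-step, all of which are choices made for this illustration.

```python
import numpy as np

def em_mdpd(X, d, k, n_iter=50, seed=0):
    """EM for a mixture of discrete product distributions (illustrative sketch).

    X: integer array of shape (N, n) with values in {0, ..., d-1}.
    Returns prior (k,), cond (n, k, d) with cond[i, y, x] = P(X_i = x | Y = y),
    and the final posteriors q of shape (N, k).
    """
    rng = np.random.default_rng(seed)
    N, n = X.shape
    onehot = np.eye(d)[X]                                   # (N, n, d)
    prior = np.full(k, 1.0 / k)
    cond = rng.dirichlet(np.ones(d), size=(n, k))           # random initialization
    for _ in range(n_iter):
        # E-step: q(y | x) is proportional to P(y) * prod_i P(x_i | y).
        log_post = np.log(np.maximum(prior, 1e-12))
        log_post = log_post + np.einsum("sia,iya->sy", onehot, np.log(cond))
        log_post -= log_post.max(axis=1, keepdims=True)     # numerical stability
        q = np.exp(log_post)
        q /= q.sum(axis=1, keepdims=True)
        # M-step: posterior-weighted frequency counts (with tiny smoothing).
        prior = q.mean(axis=0)
        counts = np.einsum("sy,sia->iya", q, onehot) + 1e-9
        cond = counts / counts.sum(axis=2, keepdims=True)
    return prior, cond, q
```

Because the objective is non-convex, the result depends on the initialization; in practice one would run several seeds or start from another estimate such as majority voting.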

On the other hand, we want to get an upper bound of the log-likelihood. The KL-divergence between the data distribution $P^*(x)$ and the model marginal $P(x; \theta)$ is defined as

$$D(P^* \,\|\, P_\theta) = \sum_{x} P^*(x) \log \frac{P^*(x)}{P(x; \theta)} = -H(P^*) - \sum_{x} P^*(x) \log P(x; \theta).$$

It equals the difference between the negative entropy of the data and the marginal log-likelihood. Due to the non-negativity of KL-divergence, the marginal log-likelihood is upper bounded by the negative entropy $-H(P^*)$. Moreover, the KL-divergence equals zero when $P^*$ and $P_\theta$ are identical almost everywhere. Thus, the KL-divergence can be considered as a goodness-of-fit measure for mixture models. However, we usually don’t have access to the probability distribution $P^*$, and estimating the negative entropy from the data is computationally intractable. To overcome the difficulty, we consider using a surrogate $\tilde{D}$ to approximate $D$.
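The decomposition of the KL-divergence into negative entropy minus expected log-likelihood is easy to verify on a small example. The sketch below works with explicit probability vectors over a finite set of configurations; the function names are hypothetical, and the model is assumed to put positive mass wherever the data distribution does.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for probability vectors; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def entropy(p):
    """Shannon entropy H(p) in nats."""
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(p[mask])))

def expected_log_likelihood(p, q):
    """Population log-likelihood of model q under data distribution p."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(q[mask])))
```

With `p` as the data distribution and `q` as the model, `kl_divergence(p, q)` equals `-entropy(p) - expected_log_likelihood(p, q)`, so the expected log-likelihood can never exceed `-entropy(p)`.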


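$\tilde{D}$ is defined to be the difference between the negative entropy $-H(P^*)$ and the lower bound $F(q, \theta)$ on the right-hand side of equation 4:

$$\tilde{D}(\theta; q) = -H(P^*) - F(q, \theta). \tag{5}$$

$\tilde{D}$ is a function of $\theta$ with parameter $q$. Equation 4 leads to the fact that $\tilde{D} \geq D$. Later on, we will focus on $\tilde{D}$. It can be shown that $F$ underestimates the marginal log-likelihood but $\tilde{D}$ overestimates $D$. The relation between $\tilde{D}$ and $D$ is illustrated in figure 2. Moreover, we have the following lemma.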

Let be defined as equation 5 and . For MDPD, it can be shown that


where .


By the definition of , to minimize is equivalent to maximize , which is basically the M-step in EM. Since , it is straightforward from equation 4 that



In information theory, the multi-information of a multivariate probability distribution $P(X_1, \dots, X_n)$ is defined as

$$I(X_1, \dots, X_n) = \sum_{x} P(x) \log \frac{P(x)}{\prod_{i=1}^{n} P(x_i)}.$$

It is the KL-divergence between $P(X)$ and the product distribution $\prod_{i} P(X_i)$. Multi-information is zero when the random variables are mutually independent. According to lemma 4, the measure defined in equation 5 captures the dependency among variables left in the data which is not explained by the current mixture model. This seems promising; however, it is still computationally intractable. As a work-around, we apply the Bethe entropy approximation [14] to approximate multi-information with the sum of pairwise mutual information. This leads to an approximated goodness-of-fit measure (equation 10) for MDPD which only relies on the second-order statistics of the data; it can be calculated efficiently.
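For a small number of variables, multi-information can be computed directly from the full joint table, which makes the definition concrete (and shows why it is impractical in general: the table has exponentially many cells). The sketch below is only an illustration; `multi_information` is a hypothetical helper that takes the joint as an $n$-dimensional array.

```python
import numpy as np

def multi_information(joint):
    """KL divergence between a joint table P(x_1, ..., x_n) and the product
    of its one-dimensional marginals. joint is an n-dimensional array summing to 1."""
    joint = np.asarray(joint, dtype=float)
    n = joint.ndim
    prod = np.ones_like(joint)
    for i in range(n):
        other_axes = tuple(a for a in range(n) if a != i)
        marg = joint.sum(axis=other_axes)          # marginal of variable i
        shape = [1] * n
        shape[i] = joint.shape[i]
        prod = prod * marg.reshape(shape)          # broadcast into the product table
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / prod[mask])))
```

For two variables this reduces to the usual mutual information; for three perfectly correlated fair binary variables it evaluates to $2 \log 2$.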


where the conditional mutual information

To summarize, we have derived a goodness-of-fit measure (equation 10) for MDPD based on maximum likelihood estimation and information theory. The question is how it is related to the feature selection algorithm we have proposed earlier.

Proposition 1.

Let be the underlying probability distribution of the data and be an one-component mixture model satisfying . Therefore, the proposed goodness-of-fit measure (equation 10) becomes


The proof follows the fact that since there is only one mixture component, we always have and it leads to . ∎

This proposition indicates that the sum of pairwise mutual information (equation 11), which can be estimated from data, is actually a goodness-of-fit measure of the one-component mixture model. If the features are mutually independent, a one-component mixture model will be enough to model the data perfectly and the sum of pairwise mutual information will be close to zero. Our feature selection algorithm (algorithm 1) selects the feature subset that maximizes the sum of mutual information with regard to the feature set. In other words, the selected features are the dimensions where the one-component mixture model doesn’t explain the data well.

5 Empirical Studies

In this section, we demonstrate our feature selection algorithm for crowdsourcing. Crowdsourcing has been a popular way to collect labels for large datasets in many application domains, including computer vision and natural language processing. Web services such as Amazon Mechanical Turk provide platforms where human intelligence tasks are posted and large quantities of labels from hundreds of online workers are collected. The problem is to infer the true labels for datasets from the collected labels.

We compare the performance of different algorithms under our feature selection method on five real datasets. We show that the algorithms are able to achieve a low mis-clustering rate with fairly small feature (worker) subsets, which reveals the redundancy inherent in the real datasets. In some cases, feature selection can even significantly boost the performance.

We also compare our feature selection algorithm to a supervised feature selection method. The supervised feature selection is done by ranking features according to their individual mis-clustering rate and selecting top features accordingly. As the real problem is essentially unsupervised, using a supervised feature selection is ‘cheating’, as it leaks true labels to the algorithm. Nevertheless, it provides a benchmark of how useful feature selection could possibly be.

5.1 Spectral Method and Majority Voting

According to [17] and related papers, the spectral method (opt-D&S) and majority voting (with EM) outperform other algorithms on these datasets. Therefore, we implement these two algorithms in our study.

The spectral method is a two-stage algorithm proposed in [17]. The first stage uses the method of moments and tensor decomposition to estimate the mixture model parameters, while the second stage runs regular EM iterations taking the results of the first stage as initialization. The first stage of the algorithm randomly partitions all the workers into three disjoint groups. Therefore, the performance of the algorithm may fluctuate. To properly evaluate the performance, we repeat the spectral method multiple times and report the median, the first, and the third quartile.

Majority voting is a simple and popular algorithm for crowdsourcing. It gives the prediction by counting the worker labels for each item and picking the label with the most votes. When there are ties in the votes, it randomly picks one of the tied labels and we report the expected mis-clustering error. For example, if the votes for three labels are tied and the true label is among them, the expected error will be $2/3$. When we evaluate the mis-clustering rate, due to missing values, it is possible that some items receive no votes from the selected workers. In those cases, we treat them as ties.
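The tie-handling rule can be made explicit with a short sketch. It assumes 0-based labels, encodes a missing vote as `-1`, and treats an item with no votes as a tie among all `d` labels; the function name and the encoding are choices made here for illustration.

```python
import numpy as np

def majority_vote_expected_error(votes, truth, d):
    """Expected mis-clustering rate of majority voting with random tie-breaking.

    votes: integer array (N, n) with labels in {0, ..., d-1}, or -1 for a missing vote.
    truth: integer array (N,) of true labels.
    """
    votes = np.asarray(votes)
    truth = np.asarray(truth)
    errors = np.empty(len(truth))
    for s, (row, t) in enumerate(zip(votes, truth)):
        counts = np.bincount(row[row >= 0], minlength=d)
        # With no votes, every label has count 0, so all d labels are tied.
        winners = np.flatnonzero(counts == counts.max())
        # A uniformly random winner is correct with probability 1 / (number of winners).
        p_correct = 1.0 / len(winners) if t in winners else 0.0
        errors[s] = 1.0 - p_correct
    return float(errors.mean())
```

Restricting majority voting to a selected worker subset then amounts to passing only the corresponding columns of `votes`.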

5.2 Real Datasets and Dealing with Missing Values

Dataset   # classes   # items   # workers   # worker labels
Bird      2           108       39          4,212
RTE       2           800       164         8,000
TREC      2           19,033    762         88,385
Dog       4           807       109         8,070
Web       5           2,665     177         15,567
Table 1: The summary of datasets used in the empirical study.

Five real crowdsourcing datasets are used in this study: (1) the bird dataset [15] is a binary labeling task, (2) the recognizing textual entailment (RTE) dataset [13] contains pairs of sentences and is a binary task to determine if the second sentence can be inferred from the first, (3) TREC is a binary task from the TREC 2011 crowdsourcing track [11] assessing the quality of information retrieval, (4) the dog dataset contains a set of pictures from ImageNet [4] and the task is to label one of four breeds of dogs, (5) the web dataset [18] is a set of query-URL pairs for workers to label with a relevance score from 1 to 5.

Except for the bird dataset, the datasets contain many missing values. This is common for real datasets, as workers do not assign labels to all the items. To accommodate missing values in our feature selection technique, a natural way is to add a virtual label $0$ for each variable $X_i$, i.e. $X_i \in \{0, 1, \dots, d\}$. If we assume that being missing is not discriminative of the latent variable $Y$, we can adjust the algorithm by calculating $\sum_{x_i, x_j} \hat{P}(x_i, x_j) \log \frac{\hat{P}(x_i, x_j)}{\hat{P}(x_i)\, \hat{P}(x_j)}$ for $x_i, x_j \in \{1, \dots, d\}$ only, to eliminate the contribution of the virtual label to mutual information.
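One possible reading of this adjustment is sketched below: probabilities are estimated over all labels including the virtual one, and the mutual information sum simply skips every cell that involves the virtual label. The encoding of "missing" as `0` with real labels `1..d`, and the name `mi_ignoring_missing`, are assumptions of this sketch rather than details confirmed by the paper.

```python
import numpy as np

def mi_ignoring_missing(xi, xj, d):
    """Mutual information between two workers, skipping the virtual 'missing' label.

    xi, xj: integer arrays (N,) with values in {0, ..., d}; 0 means 'missing'.
    """
    xi = np.asarray(xi)
    xj = np.asarray(xj)
    joint = np.zeros((d + 1, d + 1))
    np.add.at(joint, (xi, xj), 1.0)        # co-occurrence counts, virtual label included
    joint /= len(xi)
    pi = joint.sum(axis=1)
    pj = joint.sum(axis=0)
    sub = joint[1:, 1:]                    # drop row and column of the virtual label
    indep = np.outer(pi[1:], pj[1:])
    mask = sub > 0
    return float(np.sum(sub[mask] * np.log(sub[mask] / indep[mask])))
```

Because part of the sum is dropped, the result is no longer guaranteed to be non-negative; it is best read as a score for ranking workers rather than as a true mutual information.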

5.3 Results

Complete datasets
Dataset   Opt-D&S        MV             MV+EM
Bird      11.11          24.07          10.18
RTE       7.12           10.31          7.25
TREC      32.33          34.86          29.76
Dog       15.75          17.78          15.74
Web       29.22          27.09          17.52

After feature selection (number of features in parentheses)
Dataset   Opt-D&S        MV             MV+EM
Bird      8.33 (15)      10.18 (5)      8.33 (15)
RTE       7.12 (163)     8.00 (162)     7.25 (159)
TREC      30.11 (425)    34.81 (378)    29.47 (459)
Dog       15.46 (76)     17.35 (64)     15.49 (75)
Web       11.41 (17)     12.03 (8)      11.20 (9)

Table 2: Mis-clustering rates (%) of the algorithms. For opt-D&S [17], we repeated the algorithm 20 times and report the median error rate. The top rows show the results of the algorithms on the complete datasets and the bottom rows show the results after feature selection. The numbers in parentheses are the numbers of features used when the algorithms achieve their best accuracy.

We report the mis-clustering rate of different algorithms and their performance under feature selection in table 2. Majority voting (alone) is probably regarded as the simplest algorithm for crowdsourcing. As is known in the crowdsourcing community, using EM to refine the majority voting algorithm can improve the error rate (see the top half of the table). This is probably due to the noise of the worker labels. We show that with proper feature (worker) selection, the noise can be reduced. For example, for the bird and web datasets, majority voting did not work well on the complete datasets, compared to opt-D&S and MV+EM. However, after feature selection, the performance of majority voting becomes on a par with or even better than the performance of the more sophisticated algorithms (without feature selection). Also, both opt-D&S and MV+EM benefit from feature selection in terms of the mis-clustering rate. Moreover, the results shed light on the redundant nature of crowdsourcing datasets.

To better understand the influence of our feature selection technique, figure 3 shows the mis-clustering rates of the algorithms at different levels of feature selection.

Figure 3: The top figure shows the mis-clustering rate of different algorithms under feature selection. For opt-D&S, the algorithm was repeated 20 times and the median is plotted while the first and the third quartiles are displayed as the shaded error bar. The dashed lines are benchmarks obtained with the supervised feature selection described in the text. The bottom figure shows the mutual information score (equation 1).

The real datasets are redundant. From all the figures, it is clear that there is a big drop in the mis-clustering rate when the top few features are utilized. As the curves flatten quickly, the marginal utility of additional features diminishes fast.

In most cases, the proposed feature selection technique (the solid lines) remains competitive, compared to the supervised feature selection (the dashed lines). For example, for bird and dog datasets, the performance of our feature selection technique stays close to that of the supervised feature selection, especially when the number of features is small.

Feature selection makes algorithms more robust and can potentially improve the outcomes. We noticed that in some cases (e.g. the TREC dataset and the web dataset) the mis-clustering rate of opt-D&S fluctuates a lot. This is possibly because of the noise in the data. Feature selection helps filter out noisy data and makes the algorithm more robust. For the web dataset, feature selection significantly improves the error rates for all the algorithms.

6 Discussion

In this paper, we proposed a novel feature selection technique for learning MDPD which is based on pairwise mutual information. The utilization of mutual information was justified by a goodness-of-fit measure of the mixture model. Empirical studies of feature selection applied to crowdsourcing are also reported. Our feature selection algorithm is able to identify relevant, useful and informative features for MDPD, filter out the noise in the data, and make the learning more robust. We argue that this feature selection technique is generic. It is not ad hoc for crowdsourcing, as it does not require any additional assumptions. Since it is based on mutual information, it is invariant to label swapping.


  • [1] Martin Azizyan, Aarti Singh, and Larry Wasserman. Minimax theory for high-dimensional gaussian mixtures with sparse mean separation. In Advances in Neural Information Processing Systems, pages 2139–2147, 2013.
  • [2] Kamalika Chaudhuri and Satish Rao. Learning mixtures of product distributions using correlations and independence. In COLT, volume 4, pages 9–1, 2008.
  • [3] Alexander Philip Dawid and Allan M Skene. Maximum likelihood estimation of observer error-rates using the em algorithm. Applied statistics, pages 20–28, 1979.
  • [4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
  • [5] Jennifer G Dy and Carla E Brodley. Feature selection for unsupervised learning. Journal of machine learning research, 5(Aug):845–889, 2004.
  • [6] Jon Feldman, Ryan O’Donnell, and Rocco A Servedio. Learning mixtures of product distributions over discrete domains. SIAM Journal on Computing, 37(5):1536–1564, 2008.
  • [7] Arpita Ghosh, Satyen Kale, and Preston McAfee. Who moderates the moderators?: crowdsourcing abuse detection in user-generated content. In Proceedings of the 12th ACM conference on Electronic commerce, pages 167–176. ACM, 2011.
  • [8] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. Journal of machine learning research, 3(Mar):1157–1182, 2003.
  • [9] Prateek Jain and Sewoong Oh. Learning mixtures of discrete product distributions using spectral decompositions. In COLT, pages 824–856, 2014.
  • [10] Jiashun Jin, Wanjie Wang, et al. Influential features pca for high dimensional clustering. The Annals of Statistics, 44(6):2323–2359, 2016.
  • [11] Matthew Lease and Gabriella Kazai. Overview of the trec 2011 crowdsourcing track. In Proceedings of the text retrieval conference (TREC), 2011.
  • [12] Wei Pan and Xiaotong Shen. Penalized model-based clustering with application to variable selection. Journal of Machine Learning Research, 8(May):1145–1164, 2007.
  • [13] Rion Snow, Brendan O’Connor, Daniel Jurafsky, and Andrew Y Ng. Cheap and fast—but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of the conference on empirical methods in natural language processing, pages 254–263. Association for Computational Linguistics, 2008.
  • [14] Martin J Wainwright, Michael I Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.
  • [15] Peter Welinder, Steve Branson, Serge J Belongie, and Pietro Perona. The multidimensional wisdom of crowds. In NIPS, volume 23, pages 2424–2432, 2010.
  • [16] Jacob Whitehill, Ting-fan Wu, Jacob Bergsma, Javier R Movellan, and Paul L Ruvolo. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in neural information processing systems, pages 2035–2043, 2009.
  • [17] Yuchen Zhang, Xi Chen, Denny Zhou, and Michael I Jordan. Spectral methods meet em: A provably optimal algorithm for crowdsourcing. In Advances in neural information processing systems, pages 1260–1268, 2014.
  • [18] Denny Zhou, Sumit Basu, Yi Mao, and John C Platt. Learning from the wisdom of crowds by minimax entropy. In Advances in Neural Information Processing Systems, pages 2195–2203, 2012.