1 Introduction
Mixtures of discrete product distributions (MDPD) have been applied widely to large problems, from computational neuroscience to bioinformatics and recommendation systems. Here we concentrate on another popular one, crowdsourcing [3], to introduce a feature selection algorithm that works in the discrete variable setting. In effect, our algorithm enhances the learning process by identifying those workers who are likely to be performing well. We show experimentally that this feature selection leads to better performance than other state-of-the-art algorithms, and we provide a theoretical framework that suggests why this is to be expected.
Learning MDPD is NP-hard. Although many authors have proposed algorithms and heuristics to learn MDPD under different circumstances [3, 17, 9, 6], there is almost no literature concerning the feature selection problem as we formulate it. An exception is [2], who sought features that sharply separate mixture components. Their algorithm is based on correlations of the input data, but is restricted to mixtures of binary product distributions; our algorithm is applicable to general MDPD and is based on pairwise mutual information. Another group sought to identify reliable workers directly [7, 16], but this led to algorithms that are specific to crowdsourcing and hard to generalize to MDPD.
Dimensionality reduction for Gaussian mixture models is better studied. [12] proposed an algorithm based on a penalized likelihood function that leads to an EM variant with a regularized M-step. [1] analyze learning for a mixture of two isotropic Gaussians in high dimensions under sparse mean separation. More recently, [10] proposed an algorithm to discover influential features for high dimensional clustering. The dimensionality reduction methods in [1, 10] are based on Principal Component Analysis, which constructs features that are linear combinations of the input variables. This underlines the fundamental difference between the continuous and the discrete-valued problems: linear combination is not a valid operator for discrete random variables. To see this, consider a random variable that takes values from a finite set of categorical labels and a second discrete variable; a linear combination of the two is obviously not well-defined.
Even though a direct generalization of the techniques for Gaussian mixture models to MDPD is not proper, the continuous variable case has been a source of inspiration in the following sense. PCA performs an eigendecomposition on the sample covariance matrix which relies, in turn, on second-order statistics of the data. The second-order statistics for discrete random variables are basically co-occurrences. Thus we ask: can dimensionality reduction for MDPD be based on co-occurrence, as an analogue to the use of PCA for Gaussian mixture models? We give a positive answer in this paper and propose a novel feature selection technique for MDPD that is based on pairwise mutual information. The use of pairwise mutual information is justified by its connection to a goodness-of-fit measure and is validated by empirical studies on real crowdsourcing datasets. We show that, in effect, the algorithm filters out noise and makes the learning more robust; in many cases we significantly reduce the error rates.
2 Background
We study mixtures of discrete product distributions (MDPD). Throughout the paper, we use the uppercase letters $X$ and $Y$ for random variables and the lowercase letters $x$ and $y$ for their instances (realizations). Let $X$ be an observable discrete variable and $Y$ the latent variable. $Y$ takes discrete values and indicates the mixture component $y \in \{1, \dots, k\}$, where $k$ is the total number of components. Let $X = (X_1, \dots, X_d)$, where $d$ is the dimension of the model and each $X_i$ takes values in a finite set. MDPD is a generative model with joint probability distribution:
$$P(X, Y) = P(Y)\, P(X \mid Y),$$
where $P(X \mid Y)$ is a product distribution, i.e.
$$P(X \mid Y) = \prod_{i=1}^{d} P(X_i \mid Y).$$
Given the observations $x^{(1)}, \dots, x^{(n)}$, the goal is to estimate the model parameters, i.e. $P(Y = y)$ and $P(X_i = x_i \mid Y = y)$, for all $i$, $x_i$ and $y$. There are many papers addressing this learning problem [3, 17, 9, 2]. In general, those algorithms can be classified into two groups: (1) maximum likelihood estimation and (2) the method of moments. The EM algorithm and its variants have been widely used to maximize the log-likelihood. However, since the log-likelihood function is non-convex, these algorithms can get stuck in a bad local maximum. Recently, several authors [17, 9] proposed algorithms for learning MDPD based on the method of moments, which rely on third-order moments. The performance of these algorithms is statistically provable under certain conditions.
3 Feature Selection for MDPD
The problem of feature selection is to reduce the model dimension by identifying a useful and relevant feature subset. It is used to simplify the model for easier interpretation, to reduce training time, to overcome the curse of dimensionality, and to avoid overfitting, thereby making the model more robust. Most literature on feature selection focuses on supervised learning, where the usefulness and relevance of features are generally defined by their predictive power. Feature selection and dimensionality reduction for unsupervised learning are more challenging problems, due to the lack of labeled data. Refer to [8, 5] for reviews on this topic.
It is well-known that the EM algorithm is sensitive to initialization, while the method of moments [17] is sensitive to some global properties of the model. The performance of both algorithms can be dramatically impaired by noisy, irrelevant and redundant data. This makes feature selection relevant to learning MDPD, and critical in practice. In this section, we introduce our feature selection technique based on pairwise mutual information and illustrate the underlying ideas.
Intuitively, we want to identify those features that are discriminative of the latent variable $Y$, despite the lack of any direct access to that latent variable. Nothing can be said about $Y$ if only one observable variable $X_i$ is revealed, because the one-dimensional MDPD is not identifiable. Therefore, the learning algorithm has to rely on the interaction among different observable variables. For MDPD, if $X_i$ is known to be independent of $Y$, it can be shown that $X_i$ must also be independent of $X_j$ ($j \neq i$). On the other hand, if a strong dependence between $X_i$ and $X_j$ is observed, it can be concluded that $X_i$ and $X_j$ are discriminative of $Y$ and should be identified as useful features.
Our feature selection technique is motivated by the argument above. We use mutual information to measure the dependence between two variables,
$$I(X_i; X_j) = \sum_{x_i, x_j} P(x_i, x_j) \log \frac{P(x_i, x_j)}{P(x_i)\, P(x_j)}.$$
The feature selection technique is shown in algorithm 1. First, we estimate the joint probability $P(x_i, x_j)$ with $\hat{P}(x_i, x_j) = C_{ij}(x_i, x_j) / n$, where $C_{ij}(x_i, x_j)$ is the co-occurrence count between $X_i = x_i$ and $X_j = x_j$ and $n$ is the sample size. Then, we estimate the pairwise mutual information by plugging $\hat{P}$ into the definition above, for all $i$ and $j$. After obtaining the pairwise mutual information matrix, two feature selection heuristics are proposed. The first one is to maximize the sum of the entries of submatrices of the pairwise mutual information matrix. The other one is based on feature ranking according to the mutual information score, i.e.
(1) 
In practice, the mutual information score can be used to decide the number of features to be used in the model. In section 5, we plot the mutual information score for the features in the real datasets. It is observed that the score drops quickly after the top few features and has a relatively flat tail. The curve resembles the plot of eigenvalues in PCA. Therefore, we may set a cutoff according to the gradient of the curve.
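The procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the encoding of labels as integers `0..n_labels-1`, the synthetic data, and in particular the choice of score (the row sum of the pairwise mutual information matrix) are assumptions made for the example.

```python
import numpy as np

def pairwise_mutual_information(data, n_labels):
    """data: (n, d) integer array with entries in {0, ..., n_labels - 1}."""
    n, d = data.shape
    mi = np.zeros((d, d))
    for i in range(d):
        for j in range(i + 1, d):
            # Co-occurrence counts between features i and j.
            counts = np.zeros((n_labels, n_labels))
            np.add.at(counts, (data[:, i], data[:, j]), 1)
            joint = counts / n
            # Product of the two empirical marginals.
            outer = joint.sum(axis=1, keepdims=True) * joint.sum(axis=0, keepdims=True)
            nz = joint > 0
            mi[i, j] = mi[j, i] = np.sum(joint[nz] * np.log(joint[nz] / outer[nz]))
    return mi

def rank_features(mi):
    # Assumed score: sum of mutual information with all other features.
    scores = mi.sum(axis=1)
    return np.argsort(-scores), scores

# Synthetic example: three informative features and three pure-noise features.
rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, size=n)
good = [np.where(rng.random(n) < 0.9, y, 1 - y) for _ in range(3)]
noise = [rng.integers(0, 2, size=n) for _ in range(3)]
data = np.column_stack(good + noise)

mi = pairwise_mutual_information(data, n_labels=2)
order, scores = rank_features(mi)
print(order, np.round(scores, 3))
```

In this toy setting the informative features receive much larger scores than the noise features, which mirrors the sharp drop followed by a flat tail described above.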
3.1 One-Coin Model
To demonstrate the feature selection technique, we consider a simple mixture of discrete product distributions that is usually referred to as the “one-coin model” in crowdsourcing. For the one-coin model, the number of components $k$ is identical to the number of labels each $X_i$ can take. We assume that $Y$ is uniformly distributed, i.e. $P(Y = y) = 1/k$ for $y \in \{1, \dots, k\}$. The conditional probability of $X_i$ is parameterized by a single parameter $w_i$. More concretely, it is defined as
$$P(X_i = x \mid Y = y) = \begin{cases} w_i & \text{if } x = y, \\ \dfrac{1 - w_i}{k - 1} & \text{otherwise.} \end{cases} \tag{2}$$
In other words, the worker uses a single coin flip to decide the label. With probability $w_i$, the worker gives the correct label, whatever the true label is. With probability $1 - w_i$, the worker gives an incorrect label chosen at random. In this case, it is intuitive to define the capabilities of workers: a worker with larger $w_i$ is more capable than a worker with smaller $w_i$. A worker with $w_i = 1$ is the best, because that worker always gives the correct label. Given a group of workers with different capabilities, the goal of feature selection is to find the most capable ones.
The mutual information $I(X_i; X_j)$ depends on the joint probability distribution $P(X_i, X_j)$. Since $P(Y = y) = 1/k$, the marginal distribution of $X_i$ is uniform for all $i$. The joint distribution $P(X_i, X_j)$ can be represented by a $k$ by $k$ symmetric matrix whose diagonal elements and off-diagonal elements are (respectively) identical. Let the diagonal elements be denoted $a$; the off-diagonal elements then become $(1 - ka) / (k(k-1))$. The mutual information is then
$$I(X_i; X_j) = ka \log(k^2 a) + (1 - ka) \log \frac{k(1 - ka)}{k - 1}. \tag{3}$$
It equals zero when $a = 1/k^2$ and is monotonically increasing when $a > 1/k^2$. In addition, $a$ can be expressed as a function of $w_i$ and $w_j$, i.e.
$$a = \frac{1}{k} \left[ w_i w_j + \frac{(1 - w_i)(1 - w_j)}{k - 1} \right].$$
This function describes a hyperbolic paraboloid as shown in Figure 2. When $w_i = w_j = 1/k$, the function is at its saddle point, where $a = 1/k^2$. In the region where both $w_i$ and $w_j$ are larger than $1/k$, $a$ is monotonically increasing with regard to $w_i$ when $w_j$ is fixed, and vice versa.
Thus, we conclude that when $w_i > 1/k$ and $w_j > 1/k$ (i.e. when workers are better than random guessing), the mutual information $I(X_i; X_j)$ is monotonically increasing with regard to $w_i$ and $w_j$. In other words, if worker $i$ is more capable than worker $j$ (i.e. $w_i > w_j$), we have $I(X_i; X_l) > I(X_j; X_l)$ for any other worker $l$ with $w_l > 1/k$. This is enough to guarantee that our feature selection techniques (either (a) or (b) in algorithm 1) will always select $X_i$ over $X_j$.
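The monotonicity argument for the one-coin model can be checked numerically. The sketch below is an illustration under stated assumptions: the symbol `w` for a worker's probability of giving the correct label, the function names, and the particular values of `k` and `w` are all chosen for the example and are not taken from the original text.

```python
import numpy as np

def one_coin_joint(w_i, w_j, k):
    """Joint distribution of two one-coin workers under a uniform prior on Y."""
    def cond(w):
        # c[x, y] = P(X = x | Y = y): correct with probability w,
        # otherwise uniform over the k - 1 incorrect labels.
        c = np.full((k, k), (1 - w) / (k - 1))
        np.fill_diagonal(c, w)
        return c
    prior = np.full(k, 1.0 / k)
    # joint[x, x'] = sum_y P(x | y) P(y) P(x' | y)
    return cond(w_i) @ np.diag(prior) @ cond(w_j).T

def mutual_information(joint):
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px * py)[nz])))

k = 4
grid = np.linspace(1.0 / k, 1.0, 16)
# Mutual information with a fixed partner (w = 0.7) as the other worker improves.
curve = [mutual_information(one_coin_joint(w, 0.7, k)) for w in grid]
print(np.round(curve, 4))
```

For a fixed partner who is better than random guessing, the printed values start at (numerically) zero when the first worker guesses at random and grow as that worker becomes more capable.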
4 Pairwise Mutual Information and Maximum Likelihood Estimation
In this section, the use of pairwise mutual information is justified with a theoretical analysis that reveals its relation to maximum likelihood estimation and a goodness-of-fit measure. We start by introducing the maximum likelihood objective function of MDPD and the well-known expectation-maximization (EM) algorithm. Provided
data points, MLE seeks to maximize the marginal log-likelihood
where . denotes all the parameters of the model, i.e. and . This is the standard definition of log-likelihood with finite samples. However, for convenience, we conduct our analysis at the population level (infinite sample size). Let denote the underlying distribution from which samples are drawn. The marginal likelihood can be defined as
Direct optimization on the marginal likelihood is hard. A common workaround uses Jensen’s inequality to relax the problem. Let denote a probability distribution over . By applying Jensen’s inequality, we have
Instead of maximizing the marginal log-likelihood, we maximize the function on the right-hand side. This leads to the EM algorithm, an iterative algorithm consisting of two steps. Let be the model parameters at time .
E-step: Calculate the posterior distribution for all the configurations of and let be .
M-step: Update the parameters by calculating
(4) 
where .
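The two steps above can be sketched for an MDPD as follows. This is a minimal finite-sample illustration rather than the population-level analysis in the text, and it is not the authors' code: the function name `em_mdpd`, the array layout `cond[i, x, y]`, the random initialisation, and the small smoothing constants are assumptions made for the example.

```python
import numpy as np

def em_mdpd(data, n_components, n_labels, n_iter=100, seed=0):
    """EM for a mixture of discrete product distributions.

    data: (n, d) integer array with entries in {0, ..., n_labels - 1}.
    Returns the mixing weights, the conditional tables and the posteriors.
    """
    rng = np.random.default_rng(seed)
    n, d = data.shape
    prior = np.full(n_components, 1.0 / n_components)
    # cond[i, x, y] approximates P(X_i = x | Y = y); random, normalised start.
    cond = rng.random((d, n_labels, n_components)) + 1.0
    cond /= cond.sum(axis=1, keepdims=True)
    one_hot = np.eye(n_labels)[data]  # shape (n, d, n_labels)
    for _ in range(n_iter):
        # E-step: posterior over Y is proportional to P(y) * prod_i P(x_i | y),
        # computed in log space for numerical stability.
        log_post = np.log(prior + 1e-12) + np.einsum(
            "ndl,dlk->nk", one_hot, np.log(cond))
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from expected counts.
        prior = post.mean(axis=0)
        cond = np.einsum("ndl,nk->dlk", one_hot, post) + 1e-9
        cond /= cond.sum(axis=1, keepdims=True)
    return prior, cond, post
```

Because the objective is non-convex, the result depends on the initialisation; in practice one would run the sketch from several seeds or start it from another estimator.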
On the other hand, we want to get an upper bound of the log-likelihood. The KL-divergence is defined as
It equals the difference between the negative entropy of the data and the marginal log-likelihood. Due to the non-negativity of the KL-divergence, the marginal log-likelihood is upper bounded by the negative entropy . Moreover, the KL-divergence equals zero when and are identical almost everywhere. Thus, the KL-divergence can be considered as a goodness-of-fit measure for mixture models. However, we usually do not have access to the probability distribution and estimating the negative entropy from the data is computationally intractable. To overcome the difficulty, we consider using to approximate .
(5) 
is defined to be the difference between and . is a function of with parameter . Equation 4 leads to the fact that . Later on, we will focus on . It can be shown that underestimates but overestimates . The relation between and is illustrated in figure 2. Moreover, we have the following lemma.
Proof.
By the definition of , minimizing is equivalent to maximizing , which is basically the M-step in EM. Since , it is straightforward from equation 4 that
Therefore,
(7)  
(8)  
(9) 
∎
In information theory, the multi-information of a multivariate probability distribution is defined as
It is the KL-divergence between and the product distribution . Multi-information is zero when the random variables are mutually independent. According to lemma 4, measures the dependency among variables left in the data which is not explained by the current mixture model. This seems promising; however, it is still computationally intractable. As a workaround, we apply the Bethe entropy approximation [14] to approximate multi-information with the sum of pairwise mutual information. This leads to an approximate goodness-of-fit measure (equation 10) for MDPD which relies only on the second-order statistics of the data; it can be calculated efficiently.
(10) 
where the conditional mutual information
To summarize, we have derived a goodness-of-fit measure (equation 10) for MDPD based on maximum likelihood estimation and information theory. The question is how it is related to the feature selection algorithm we proposed earlier.
Proposition 1.
Let be the underlying probability distribution of the data and be a one-component mixture model satisfying . Then, the proposed goodness-of-fit measure (equation 10) becomes
(11) 
Proof.
The proof follows from the fact that, since there is only one mixture component, we always have and it leads to . ∎
This proposition indicates that the sum of pairwise mutual information (equation 11), which can be estimated from data, is actually a goodness-of-fit measure of the one-component mixture model. If the features are mutually independent, a one-component mixture model will be enough to model the data perfectly and the sum of pairwise mutual information will be close to zero. Our feature selection algorithm (algorithm 1) selects the feature subset that maximizes the sum of mutual information with regard to the feature set. In other words, the selected features are the dimensions where the one-component mixture model does not explain the data well.
5 Empirical Studies
In this section, we demonstrate our feature selection algorithm for crowdsourcing. Crowdsourcing has been a popular way to collect labels for large datasets in many application domains, including computer vision and natural language processing. Web services such as Amazon Mechanical Turk provide platforms where human intelligence tasks are posted and large quantities of labels from hundreds of online workers are collected. The problem is to infer the true labels for datasets from the collected labels.
We compare the performance of different algorithms under our feature selection method on five real datasets. We show that the algorithms are able to achieve a low misclustering rate with fairly small feature (worker) subsets, which reveals the redundancy inherent in the real datasets. In some cases, feature selection can even significantly boost the performance.
We also compare our feature selection algorithm to a supervised feature selection method. The supervised feature selection is done by ranking features according to their individual misclustering rate and selecting top features accordingly. As the real problem is essentially unsupervised, using a supervised feature selection is ‘cheating’, as it leaks true labels to the algorithm. Nevertheless, it provides a benchmark of how useful feature selection could possibly be.
5.1 Spectral Method and Majority Voting
According to [17] and related papers, the spectral method (optD&S) and majority voting (with EM) outperform other algorithms on these datasets. Therefore, we implement these two algorithms in our study.
The spectral method is a two-stage algorithm proposed in [17]. The first stage uses the method of moments and tensor decomposition to estimate the mixture model parameters, while the second stage runs regular EM iterations taking the results of the first stage as initialization. The first stage of the algorithm randomly partitions all the workers into three disjoint groups. Therefore, the performance of the algorithm may fluctuate. To properly evaluate the performance, we repeat the spectral method multiple times and report the median, the first, and the third quartile.
Majority voting is a simple and popular algorithm for crowdsourcing. It gives the prediction by summing up all worker labels and picks the one with the highest votes. When there are ties in the votes, it randomly picks one and we report the expected misclustering error. For example, if the votes for three labels are tied, the expected error will be . When we evaluate misclustering rate, due to missing values, it is possible that some items receive no votes from the selected workers. In those cases, we treat them as ties.
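The tie-handling rule described above can be made concrete with a short sketch. The function name, the use of `-1` to mark a missing vote, and the per-item interface are assumptions made for this illustration. When the true label is among the tied labels, a uniformly random pick is wrong with probability one minus the reciprocal of the number of tied labels; an item with no votes is treated as a tie among all labels.

```python
import numpy as np

def mv_expected_error(votes, true_label, n_labels):
    """Expected majority-voting error for a single item.

    votes: 1-D integer array of worker labels, with -1 marking a missing vote.
    """
    counts = np.bincount(votes[votes >= 0], minlength=n_labels)
    # Labels sharing the highest vote count; all labels if there are no votes.
    tied = np.flatnonzero(counts == counts.max())
    # A uniformly random pick among the tied labels is correct with
    # probability 1 / len(tied) when the true label is one of them.
    p_correct = 1.0 / len(tied) if true_label in tied else 0.0
    return 1.0 - p_correct

print(mv_expected_error(np.array([0, 1, 2]), true_label=0, n_labels=3))
print(mv_expected_error(np.array([-1, -1]), true_label=2, n_labels=4))
```

Averaging this quantity over all items gives the expected misclustering rate of majority voting on a dataset.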
5.2 Real Datasets and Dealing with Missing Values
Datasets  # classes  # items  # workers  # worker labels 

Bird  2  108  39  4,212 
RTE  2  800  164  8,000 
TREC  2  19,033  762  88,385 
Dog  4  807  109  8,070 
Web  5  2,665  177  15,567 
Five real crowdsourcing datasets are used in this study: (1) the bird dataset [15] is a binary labeling task, (2) the recognizing textual entailment (RTE) dataset [13] contains pairs of sentences and is a binary task to determine if the second sentence can be inferred from the first, (3) TREC is a binary task from the TREC 2011 crowdsourcing track [11] assessing the quality of information retrieval, (4) the dog dataset contains a set of pictures from ImageNet [4] and the task is to label the four breeds of dogs, (5) the web dataset [18] is a set of query-URL pairs for workers to label with a relevance score from 1 to 5.
Except for the bird dataset, the datasets contain many missing values. This is common for real datasets, as workers do not assign labels to all the items. To accommodate our feature selection technique to missing values, a natural way is to add a virtual label, representing a missing value, to each variable $X_i$. If we assume that being missing is not discriminative of the latent variable $Y$, we can adjust the algorithm by calculating the mutual information over the real labels only, to eliminate the contribution of the virtual label to mutual information.
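One possible reading of this adjustment is sketched below. It is an assumption-laden illustration, not the authors' exact formula: the joint distribution is estimated over all items with the missing value treated as an extra label, and the mutual information sum is then restricted to pairs of real labels. The function name and the use of `-1` as the missing marker are likewise hypothetical.

```python
import numpy as np

def mi_ignoring_missing(col_i, col_j, n_labels, missing=-1):
    """Mutual information between two label columns, skipping the virtual label."""
    # Map the missing marker to index 0 and real labels to 1..n_labels.
    a = np.where(col_i == missing, 0, col_i + 1)
    b = np.where(col_j == missing, 0, col_j + 1)
    counts = np.zeros((n_labels + 1, n_labels + 1))
    np.add.at(counts, (a, b), 1)
    joint = counts / len(col_i)
    outer = joint.sum(axis=1, keepdims=True) * joint.sum(axis=0, keepdims=True)
    # Sum only over real-label pairs; the virtual-label row and column
    # (index 0) do not contribute.
    real_joint, real_outer = joint[1:, 1:], outer[1:, 1:]
    nz = real_joint > 0
    return float(np.sum(real_joint[nz] * np.log(real_joint[nz] / real_outer[nz])))

labels = np.array([0, 1, 0, 1])
print(mi_ignoring_missing(labels, labels, n_labels=2))
```

With no missing values the function reduces to the ordinary plug-in estimate of mutual information.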
5.3 Results
Dataset  OptD&S  MV  MV+EM 
Bird  11.11  24.07  10.18 
RTE  7.12  10.31  7.25 
TREC  32.33  34.86  29.76 
Dog  15.75  17.78  15.74 
Web  29.22  27.09  17.52 
Dataset  FS+OptD&S  FS+MV  FS+MV+EM 
Bird  8.33 (15)  10.18 (5)  8.33 (15) 
RTE  7.12 (163)  8.00 (162)  7.25 (159) 
TREC  30.11 (425)  34.81 (378)  29.47 (459) 
Dog  15.46 (76)  17.35 (64)  15.49 (75) 
Web  11.41 (17)  12.03 (8)  11.20 (9) 

We report the misclustering rate of different algorithms and their performance under feature selection in table 2. Majority voting (alone) is probably regarded as the simplest algorithm for crowdsourcing. As is known in the crowdsourcing community, using EM to refine majority voting can improve the error rate (see the top half of the table). This is probably due to the noise in the worker labels. We show that with proper feature (worker) selection, the noise can be reduced. For example, for the bird and web datasets, majority voting did not work well on the complete datasets, compared to optD&S and MV+EM. However, after feature selection, the performance of majority voting becomes on a par with or even better than the performance of the more sophisticated algorithms (without feature selection). Also, both optD&S and MV+EM benefit from feature selection in terms of the misclustering rate. Moreover, the results shed light on the redundant nature of crowdsourcing datasets.
To better understand the influence of our feature selection technique, figure 3 shows the misclustering rates of the algorithms at different levels of feature selection.
The real datasets are redundant. From all the figures, it is clear that there is a big drop in the misclustering rate when the top few features are utilized. As the curve gets flattened quickly, the marginal utility is diminishing fast.
In most cases, the proposed feature selection technique (the solid lines) remains competitive, compared to the supervised feature selection (the dashed lines). For example, for bird and dog datasets, the performance of our feature selection technique stays close to that of the supervised feature selection, especially when the number of features is small.
Feature selection makes algorithms more robust and can potentially improve the outcomes. We noticed that in some cases (e.g. the TREC dataset and the web dataset) the misclustering rate fluctuates a lot. This is possibly because of the noise in the data. Feature selection helps filter out noisy data and makes the algorithms more robust. For the web dataset, feature selection significantly improves the error rates for all the algorithms.
6 Discussion
In this paper, we proposed a novel feature selection technique for learning MDPD, based on pairwise mutual information. The use of mutual information was justified by a goodness-of-fit measure of the mixture model. Empirical studies of feature selection in the application of crowdsourcing were also reported. Our feature selection algorithm is able to identify relevant, useful and informative features for MDPD, filter out the noise in the data, and make the learning more robust. We argue that this feature selection technique is generic. It is not ad hoc for crowdsourcing, as it does not require any additional assumptions. Since it is based on mutual information, it is invariant to label swapping.
References
 [1] Martin Azizyan, Aarti Singh, and Larry Wasserman. Minimax theory for high-dimensional Gaussian mixtures with sparse mean separation. In Advances in Neural Information Processing Systems, pages 2139–2147, 2013.
 [2] Kamalika Chaudhuri and Satish Rao. Learning mixtures of product distributions using correlations and independence. In COLT, volume 4, pages 9–1, 2008.
 [3] Alexander Philip Dawid and Allan M Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, pages 20–28, 1979.
 [4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
 [5] Jennifer G Dy and Carla E Brodley. Feature selection for unsupervised learning. Journal of Machine Learning Research, 5(Aug):845–889, 2004.
 [6] Jon Feldman, Ryan O’Donnell, and Rocco A Servedio. Learning mixtures of product distributions over discrete domains. SIAM Journal on Computing, 37(5):1536–1564, 2008.
 [7] Arpita Ghosh, Satyen Kale, and Preston McAfee. Who moderates the moderators?: crowdsourcing abuse detection in user-generated content. In Proceedings of the 12th ACM conference on Electronic commerce, pages 167–176. ACM, 2011.
 [8] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. Journal of machine learning research, 3(Mar):1157–1182, 2003.
 [9] Prateek Jain and Sewoong Oh. Learning mixtures of discrete product distributions using spectral decompositions. In COLT, pages 824–856, 2014.
 [10] Jiashun Jin, Wanjie Wang, et al. Influential features pca for high dimensional clustering. The Annals of Statistics, 44(6):2323–2359, 2016.
 [11] Matthew Lease and Gabriella Kazai. Overview of the trec 2011 crowdsourcing track. In Proceedings of the text retrieval conference (TREC), 2011.
 [12] Wei Pan and Xiaotong Shen. Penalized model-based clustering with application to variable selection. Journal of Machine Learning Research, 8(May):1145–1164, 2007.
 [13] Rion Snow, Brendan O’Connor, Daniel Jurafsky, and Andrew Y Ng. Cheap and fast—but is it good?: evaluating nonexpert annotations for natural language tasks. In Proceedings of the conference on empirical methods in natural language processing, pages 254–263. Association for Computational Linguistics, 2008.
 [14] Martin J Wainwright, Michael I Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.
 [15] Peter Welinder, Steve Branson, Serge J Belongie, and Pietro Perona. The multidimensional wisdom of crowds. In NIPS, volume 23, pages 2424–2432, 2010.
 [16] Jacob Whitehill, Tingfan Wu, Jacob Bergsma, Javier R Movellan, and Paul L Ruvolo. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in neural information processing systems, pages 2035–2043, 2009.
 [17] Yuchen Zhang, Xi Chen, Denny Zhou, and Michael I Jordan. Spectral methods meet em: A provably optimal algorithm for crowdsourcing. In Advances in neural information processing systems, pages 1260–1268, 2014.
 [18] Denny Zhou, Sumit Basu, Yi Mao, and John C Platt. Learning from the wisdom of crowds by minimax entropy. In Advances in Neural Information Processing Systems, pages 2195–2203, 2012.