Crowdsourcing via Pairwise Co-occurrences: Identifiability and Algorithms

The data deluge comes with high demands for data labeling. Crowdsourcing (or, more generally, ensemble learning) techniques aim to produce accurate labels by integrating noisy, non-expert labeling from annotators. The classic Dawid-Skene estimator and its accompanying expectation maximization (EM) algorithm have been widely used, but their theoretical properties are not fully understood. Tensor methods were proposed to guarantee identification of the Dawid-Skene model, but their sample complexity is a hurdle in practice, since tensor methods hinge on the availability of third-order statistics that are hard to estimate reliably from limited data. In this paper, we propose a framework using pairwise co-occurrences of the annotator responses, which naturally admits lower sample complexity. We show that the approach can identify the Dawid-Skene model under realistic conditions. We propose an algebraic algorithm reminiscent of convex geometry-based structured matrix factorization to solve the model identification problem efficiently, and an identifiability-enhanced algorithm for handling more challenging and critical scenarios. Experiments show that the proposed algorithms outperform the state-of-the-art algorithms under a variety of scenarios.




1 Introduction


The drastically increasing availability of data has successfully enabled many timely applications in machine learning and artificial intelligence. At the same time, most supervised learning tasks, e.g., the core tasks in computer vision, natural language processing, and speech processing, heavily rely on labeled data. However, labeling data is not a trivial task: it requires educated and knowledgeable annotators (which could be human workers or machine classifiers) to work in a reliable way. More importantly, it needs an effective mechanism to integrate the possibly different labeling from multiple annotators. Techniques addressing this problem in machine learning are called crowdsourcing Kittur et al. (2008) or, more generally, ensemble learning Dietterich (2000).

Crowdsourcing has a long history in machine learning, which can be traced back to the 1970s Dawid & Skene (1979). Many models and methods have appeared since then Karger et al. (2013, 2014, 2011); Snow et al. (2008); Welinder et al. (2010); Liu et al. (2012); Traganitis et al. (2018). Intuitively, if a number of reliable annotators label the same data samples, then majority voting among the annotators is expected to work well. However, in practice, not all the annotators are equally reliable—e.g., different annotators could be specialized for recognizing different classes. In addition, not all the annotators are labeling all the data samples, since data samples are often dispatched to different groups of annotators in a certain way. Under such circumstances, majority voting is not very promising.

A more sophisticated way is to treat the crowdsourcing problem as a model identification problem. The arguably most popular generative model in crowdsourcing is the Dawid-Skene model Dawid & Skene (1979), where every annotator is assigned a ‘confusion matrix’ that specifies the probability of the annotator giving each class label conditioned on each ground-truth label. If such confusion matrices and the probability mass function (PMF) of the ground-truth label can be identified, then a maximum likelihood (ML) or a maximum a posteriori (MAP) estimator for the true label of any given sample can be constructed. The Dawid-Skene model is quite simple and succinct, and some of the model assumptions (e.g., the conditional independence of the annotator responses) are actually debatable. Nonetheless, this model has proven very useful in practice Raykar et al. (2010); Traganitis et al. (2018); Ghosh et al. (2011); Karger et al. (2014); Liu et al. (2012); Zhang et al. (2014).

Theoretical aspects of the Dawid-Skene model, however, are less well understood. In particular, it had been unclear whether the model could be identified via the accompanying expectation maximization (EM) algorithm proposed in the same paper Dawid & Skene (1979), until some recent works addressed certain special cases Karger et al. (2014). The works in Traganitis et al. (2018); Zhang et al. (2014) put forth tensor methods for learning the Dawid-Skene model. These methods admit model identifiability, and can also be used to initialize the classic EM algorithm with provable guarantees Zhang et al. (2014). The challenge is that tensor methods utilize third-order statistics of the data samples, which are rather hard to estimate reliably in practice given limited data Huang et al. (2018).

Contributions. In this work, we propose an alternative for identifying the Dawid-Skene model, without using third-order statistics. Our approach is based on utilizing the pairwise co-occurrences of annotators’ responses to data samples—which are second-order statistics and thus are naturally much easier to estimate compared to the third-order ones. We show that, by judiciously combining the co-occurrences between different annotator pairs, the confusion matrices and the ground-truth label’s prior PMF can be provably identified, under realistic conditions (e.g., when there exists a relatively well-trained annotator among all annotators). This is reminiscent of nonnegative matrix theory and convex geometry Fu et al. (2018b); Gillis (2014). Our approach is also naturally robust to spammers as well as scenarios where every annotator only labels partial data. We offer two algorithms under the same framework. The first algorithm is algebraic, and thus is efficient and suitable for handling very large-scale crowdsourcing problems. The second algorithm offers enhanced identifiability guarantees, and is able to deal with more critical cases (e.g., when no highly reliable annotators exist), with the price of using a computationally more involved iterative optimization algorithm. Experiments show that both approaches outperform a number of competitive baselines.

2 Background

The Dawid-Skene Model. Let us consider a dataset $\{f_n\}_{n=1}^{N}$, where $f_n$ is a data sample (or, feature vector) and $N$ is the number of samples. Each $f_n$ belongs to one of $K$ classes. Let $y_n$ be the ground-truth label of the data sample $f_n$. Suppose that there are $M$ annotators who work on the dataset and provide labels. Let $X_m(f_n)$ represent the response of annotator $m$ to $f_n$. Hence, $X_m$ can be understood as a discrete random variable whose alphabet is $\{1, \ldots, K\}$. In crowdsourcing or ensemble learning, our goal is to estimate the true label of each item from the annotator responses. Note that in a realistic scenario, an annotator will likely work on only part of the dataset, since having all annotators work on all the samples is much more costly.

In 1979, Dawid and Skene proposed an intuitively pleasing model for estimating the ‘true response’ of patients from recorded answers Dawid & Skene (1979), which is essentially a crowdsourcing/ensemble learning problem. This model has sparked a lot of interest in the machine learning community Raykar et al. (2010); Traganitis et al. (2018); Ghosh et al. (2011); Karger et al. (2014); Liu et al. (2012); Zhang et al. (2014). The Dawid-Skene model in essence is a naive Bayesian model Robert (2014). In this model, the ground-truth label of a data sample is a latent discrete random variable, $Y$, whose values are different class indices. The ambient variables are the responses given by different annotators, denoted as $X_1, \ldots, X_M$, where $M$ is the number of annotators. The key assumption in the Dawid-Skene model is that, given the ground-truth label, the responses of the annotators are conditionally independent. Of course, the Dawid-Skene model is a simplified version of reality, but it has proven very useful and has been a workhorse for crowdsourcing since its proposal.

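Under the Dawid-Skene model, one can see that

$$\Pr(X_1 = k_1, \ldots, X_M = k_M) = \sum_{k=1}^{K} \Pr(Y = k) \prod_{m=1}^{M} \Pr(X_m = k_m \mid Y = k), \qquad (1)$$

where $k \in \{1, \ldots, K\}$ denotes the index of a given class, and $k_m$ denotes the response of the $m$-th annotator. If one defines a series of matrices $A_m \in \mathbb{R}^{K \times K}$ and lets

$$A_m(k_m, k) := \Pr(X_m = k_m \mid Y = k), \qquad (2)$$

then $A_m$ can be understood as the ‘confusion matrix’ of annotator $m$: it contains all the conditional probabilities of annotator $m$ labeling a given data sample as from class $k_m$ while the ground-truth label is $k$. Also define a vector $d \in \mathbb{R}^{K}$ such that $d(k) := \Pr(Y = k)$, i.e., the prior PMF of the ground-truth label $Y$. Then the crowdsourcing problem boils down to estimating $A_m$ for $m = 1, \ldots, M$ and $d$.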
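As an illustration of the generative model, here is a minimal sampling sketch. It assumes labels are encoded as `0..K-1` and that the confusion matrices are stored in an array `A` with `A[m, j, k]` standing for the probability that annotator `m` answers `j` when the true label is `k`; the function name and the example matrices are hypothetical and not taken from the paper.

```python
import numpy as np

def sample_dawid_skene(A, d, N, rng):
    """Draw N true labels and M annotator responses under the Dawid-Skene model.

    A: array of shape (M, K, K); A[m, j, k] = Pr(X_m = j | Y = k), so each
       column of A[m] sums to one.
    d: array of shape (K,), the prior Pr(Y = k).
    Returns (y, X) with y of shape (N,) and X of shape (N, M), labels in 0..K-1.
    """
    M, K, _ = A.shape
    y = rng.choice(K, size=N, p=d)
    X = np.empty((N, M), dtype=int)
    for m in range(M):
        for k in range(K):
            idx = np.flatnonzero(y == k)
            # Given Y = k, annotator m answers according to column k of A[m],
            # independently of the other annotators.
            X[idx, m] = rng.choice(K, size=idx.size, p=A[m][:, k])
    return y, X

rng = np.random.default_rng(0)
K, M, N = 3, 4, 20000
# Hypothetical confusion matrices: diagonally dominant, columns sum to one.
A = np.stack([0.7 * np.eye(K) + 0.1 * np.ones((K, K)) for _ in range(M)])
d = np.array([0.5, 0.3, 0.2])
y, X = sample_dawid_skene(A, d, N, rng)
```

With these example matrices each annotator is correct with probability 0.8, which the empirical agreement between `X[:, m]` and `y` should approximate.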

Prior Art. In the seminal paper Dawid & Skene (1979), Dawid and Skene proposed an EM-based algorithm to estimate the confusion matrices $A_m$ and the prior $d$. Their formulation is well-motivated from an ML viewpoint, but also has some challenges. First, it is unknown if the model is identifiable, especially when there is a large number of unrecorded responses (i.e., missing values), yet model identification plays an essential role in such estimation problems Fu et al. (2018b). Second, since the ML estimator is a nonconvex optimization criterion, the solution quality of the EM algorithm is not easy to characterize in general. More recently, tensor methods were proposed to identify the Dawid-Skene model Zhang et al. (2014); Traganitis et al. (2018). Take the most recent work in Traganitis et al. (2018) as an example. The approach considers estimating the joint probability $\Pr(X_m = k_m, X_\ell = k_\ell, X_j = k_j)$ for different triples $m, \ell, j$. Such joint PMFs can be regarded as third-order tensors, and the confusion matrices and the prior are latent factors of these tensors. The upshot is that identifiability of $A_m$ and $d$ can be elegantly established leveraging tensor algebra Sidiropoulos et al. (2017); Kolda & Bader (2009). The challenge, however, is that reliably estimating $\Pr(X_m = k_m, X_\ell = k_\ell, X_j = k_j)$ is quite hard, since it normally needs a large number of annotator responses. Another tensor method in Zhang et al. (2014) judiciously partitions the data and works with group statistics between three groups, which is reminiscent of the graph statistics proposed in Anandkumar et al. (2014). The method is computationally more tractable, leveraging orthogonal tensor decomposition. Nevertheless, the challenge again lies in sample complexity: the group/graph statistics are still third-order statistics.

3 Proposed Approach

In this section, we propose a model identification approach that only uses second-order statistics, in particular, pairwise co-occurrences $\Pr(X_m = k_m, X_\ell = k_\ell)$.

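Problem Formulation. Let us consider the following pairwise joint PMF:

$$\Pr(X_m = k_m, X_\ell = k_\ell) = \sum_{k=1}^{K} \Pr(Y = k)\, \Pr(X_m = k_m \mid Y = k)\, \Pr(X_\ell = k_\ell \mid Y = k).$$

Letting $R_{m,\ell}(k_m, k_\ell) := \Pr(X_m = k_m, X_\ell = k_\ell)$ and using the matrix notations that we defined, we have $R_{m,\ell}(k_m, k_\ell) = \sum_{k=1}^{K} d(k) A_m(k_m, k) A_\ell(k_\ell, k)$ or, in a more compact form:

$$R_{m,\ell} = A_m D A_\ell^\top, \qquad (3)$$

where we have $D = \mathrm{Diag}(d)$, which is a diagonal matrix. Note that $A_m$ is a confusion matrix, i.e., its columns are valid probability measures. In addition, $d$ is a prior PMF. Hence, we have

$$\mathbf{1}^\top A_m = \mathbf{1}^\top, \quad A_m \geq 0, \quad \mathbf{1}^\top d = 1, \quad d \geq 0.$$

In practice, the $R_{m,\ell}$'s are not available but can be estimated via sample averaging. Specifically, if we are given the annotator responses $X_m(f_n)$, then

$$\hat{R}_{m,\ell}(k_m, k_\ell) = \frac{1}{|\mathcal{S}_{m,\ell}|} \sum_{n \in \mathcal{S}_{m,\ell}} \mathbb{I}\left[X_m(f_n) = k_m,\ X_\ell(f_n) = k_\ell\right],$$

where $\mathcal{S}_{m,\ell}$ is the index set of samples which both annotators $m$ and $\ell$ have worked on. Here, $\mathbb{I}[\cdot]$ is an indicator function: if the event $E$ happens, then $\mathbb{I}[E] = 1$, and $\mathbb{I}[E] = 0$ otherwise. It is readily seen that

$$\mathbb{E}\left[\hat{R}_{m,\ell}(k_m, k_\ell)\right] = R_{m,\ell}(k_m, k_\ell),$$

where the expectation is taken over data samples. Note that the sample complexity for reliably estimating $R_{m,\ell}$ is much lower than that of estimating the third-order joint PMF of three annotators' responses Zhang et al. (2014); Anandkumar et al. (2014), and the latter is needed in tensor-based methods, e.g., Traganitis et al. (2018). To be specific, attaining a given accuracy with a given probability requires fewer joint responses from annotators $m$ and $\ell$ than the number of joint responses from three annotators required to attain the same accuracy for the third-order statistics, and the latter requirement depends on the number of classes $K$ (also see supplementary materials Sec. J for a short discussion).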
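The sample-averaging step can be sketched as follows. This is a minimal illustration, assuming the responses are stored in an integer array with labels `0..K-1` and a sentinel value `-1` for samples an annotator did not label; the array layout, the sentinel, and the function name are assumptions of this sketch.

```python
import numpy as np

def cooccurrence(X, m, l, K, missing=-1):
    """Sample-average estimate of the K x K pairwise joint PMF of annotators m and l.

    X: integer array of shape (N, M); X[n, m] in 0..K-1, or `missing` if
       annotator m did not label sample n (an encoding assumed here).
    Returns (R_hat, S), where S is the number of co-labeled samples.
    """
    both = (X[:, m] != missing) & (X[:, l] != missing)
    S = int(both.sum())
    R_hat = np.zeros((K, K))
    if S == 0:
        return R_hat, 0
    # Count each observed response pair (k_m, k_l), then divide by S.
    np.add.at(R_hat, (X[both, m], X[both, l]), 1.0)
    return R_hat / S, S

# Tiny hand-checkable example: 2 annotators, K = 2,
# one sample left unlabeled by the second annotator.
X = np.array([[0, 0],
              [0, 1],
              [1, 1],
              [1, -1],
              [0, 0]])
R_hat, S = cooccurrence(X, 0, 1, K=2)
```

Only the four co-labeled samples enter the average, so the entries of `R_hat` sum to one.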

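An Algebraic Algorithm. Assume that we have obtained $\hat{R}_{m,\ell}$'s for different pairs of $m, \ell$. We now show how to identify the $A_m$'s and $d$ from such second-order statistics. Let us take the estimation of $A_m$ as an illustrative example. First, we construct a matrix $Z_m$ as follows:

$$Z_m = \left[R_{m,m_1}, R_{m,m_2}, \ldots, R_{m,m_{T(m)}}\right], \qquad (5)$$

where $m_t$ for $t = 1, \ldots, T(m)$ denote the indices of annotators who have co-labeled data samples with annotator $m$, and the integer $T(m)$ denotes the number of such annotators. Due to the underlying model of $R_{m,\ell}$ in (3), we have

$$Z_m = A_m D \left[A_{m_1}^\top, \ldots, A_{m_{T(m)}}^\top\right].$$

Let us define $H_m = [A_{m_1}^\top, \ldots, A_{m_{T(m)}}^\top]^\top \in \mathbb{R}^{K T(m) \times K}$. This leads to the model $Z_m = A_m D H_m^\top$. We propose to identify $A_m$ from $Z_m$. The key enabling postulate is that, among all annotators, some $A_m$'s should be diagonally dominant, which is the case if there exist annotators who are reasonably trained. In other words, for a reasonable annotator $m$, $A_m(k,k)$ should be greater than $A_m(k,j)$ and $A_m(j,k)$ for $j \neq k$. To see the intuition of the algorithm, consider an ideal case where for each class $k$, there exists an annotator $m_{t(k)} \in \{m_1, \ldots, m_{T(m)}\}$ such that

$$\Pr(X_{m_{t(k)}} = k \mid Y = k) \in (0, 1], \qquad \Pr(X_{m_{t(k)}} = k \mid Y = j) = 0, \quad j \neq k. \qquad (6)$$

This physically means that annotator $m_{t(k)}$ is very good at recognizing class $k$ and never confuses other classes with class $k$. Under such circumstances, one can use the following procedure to identify $A_m$. First, let us normalize the columns of $Z_m$ via $\bar{Z}_m(:,q) = Z_m(:,q) / \|Z_m(:,q)\|_1$ for $q \in \{1, \ldots, K T(m)\}$. This way, we have a normalized model $\bar{Z}_m = A_m \bar{H}_m^\top$, where

$$\bar{H}_m(q,:) = \frac{H_m(q,:) D}{\|Z_m(:,q)\|_1} = \frac{H_m(q,:) D}{\|H_m(q,:) D\|_1}, \qquad (7)$$

where the second equality above is because $\|Z_m(:,q)\|_1 = \mathbf{1}^\top A_m D H_m(q,:)^\top = \mathbf{1}^\top D H_m(q,:)^\top$, which follows from $\mathbf{1}^\top A_m = \mathbf{1}^\top$ [cf. Eq. (3)]. After normalization, it can be verified that

$$\bar{H}_m \mathbf{1} = \mathbf{1}, \qquad \bar{H}_m \geq 0, \qquad (8)$$

i.e., all the rows of $\bar{H}_m$ reside in the $(K-1)$-probability simplex. In addition, by the assumption in (6), it is readily seen that there exists $\Lambda_q = \{q_1, \ldots, q_K\}$ where $q_k \in \{1, \ldots, K T(m)\}$ such that

$$\bar{H}_m(\Lambda_q, :) = I_K, \qquad (9)$$

i.e., an identity matrix is a submatrix of $\bar{H}_m$ (after proper row permutations). Consequently, we have $A_m = \bar{Z}_m(:, \Lambda_q)$, i.e., $A_m$ can be identified from $\bar{Z}_m$ up to column permutations. The task thus boils down to identifying $\Lambda_q$. This turns out to be a well-studied task in the context of separable nonnegative matrix factorization Gillis & Vavasis (2014); Gillis (2014); Fu et al. (2018b), and an algebraic algorithm exists:

$$\hat{q}_1 = \arg\max_{q} \left\|\bar{Z}_m(:,q)\right\|_2^2, \qquad \hat{q}_k = \arg\max_{q} \left\|P^{\perp}_{\hat{A}_m(:,1:k-1)} \bar{Z}_m(:,q)\right\|_2^2, \quad k > 1, \qquad (10)$$

where $\hat{A}_m(:,1:k-1) = [\bar{Z}_m(:,\hat{q}_1), \ldots, \bar{Z}_m(:,\hat{q}_{k-1})]$, $P^{\perp}_{\hat{A}_m(:,1:k-1)}$ is a projector onto the orthogonal complement of $\mathrm{range}(\hat{A}_m(:,1:k-1))$, and we let $\hat{A}_m = \bar{Z}_m(:, \hat{\Lambda}_q)$ with $\hat{\Lambda}_q = \{\hat{q}_1, \ldots, \hat{q}_K\}$.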

It has been shown in Gillis & Vavasis (2014); Arora et al. (2013) that the so-called successive projection algorithm (SPA) in Eq. (10) identifies $\Lambda_q$ in $K$ steps. This is a very appealing result, since the procedure consists of lightweight Gram-Schmidt-like steps and thus is quite scalable. See more details in Sec. F.1.
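The greedy selection can be sketched as follows. This is a minimal illustration of the successive projection idea on a generic matrix whose columns are already normalized, not the authors' implementation; the function name and the synthetic test matrix (with the vertices planted at hypothetical positions 2, 5 and 9) are assumptions of this sketch.

```python
import numpy as np

def spa(Z, K):
    """Successive projection: greedily pick K column indices of Z.

    At each step, select the column with the largest Euclidean norm in the
    current residual, then project every column onto the orthogonal
    complement of the selected one (a Gram-Schmidt-like deflation).
    """
    R = np.array(Z, dtype=float)
    selected = []
    for _ in range(K):
        q = int(np.argmax(np.sum(R ** 2, axis=0)))
        selected.append(q)
        u = R[:, q] / np.linalg.norm(R[:, q])
        R = R - np.outer(u, u @ R)
    return selected

rng = np.random.default_rng(0)
K = 3
# Hypothetical full-rank confusion matrix with columns summing to one.
A = 0.6 * np.eye(K) + 0.4 * rng.dirichlet(np.ones(K), size=K).T
H = rng.dirichlet(np.ones(K), size=12)  # rows in the probability simplex
H[[2, 5, 9]] = np.eye(K)                # plant the K vertices (ideal case)
Z = A @ H.T
idx = spa(Z, K)
```

In this noiseless ideal case the selected indices are the planted vertex positions, so `Z[:, sorted(idx)]` reproduces `A`; in general the columns are recovered only up to a permutation.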

Each of the $A_m$'s can be estimated from the corresponding $Z_m$ by repeatedly applying SPA, and we call this simple procedure multiple SPA (MultiSPA), as we elaborate in Algorithm 1.

  Input: Annotator responses $\{X_m(f_n)\}$.
  Output: $\hat{A}_m$ for $m = 1, \ldots, M$, and $\hat{d}$.
  estimate the second-order statistics $\hat{R}_{m,\ell}$;
  for $m = 1$ to $M$ do
     construct $\hat{Z}_m$ and normalize its columns to unit $\ell_1$-norm;
     estimate $\hat{A}_m$ using Eq. (10);
  end for
  fix the permutation mismatch between $\hat{A}_m$ and $\hat{A}_\ell$ for all $m \neq \ell$;
  estimate $\hat{D}$ (and take the average over all pairs if needed);
  extract the prior $\hat{d}$.
Algorithm 1 MultiSPA

Of course, assuming that (6) or (9) holds perfectly may be too ideal. It is more likely that there exist some annotators who are good at recognizing certain classes, but still have some probability of being confused. It is of interest to analyze how SPA performs under such conditions. Another challenge is that one may not have $Z_m$ perfectly estimated, since only a limited number of samples are available. It is desirable to understand the sample complexity of applying SPA to Dawid-Skene identification. We answer these two key technical questions in the following theorem:

Theorem 1.

Assume that annotators and co-label at least samples , and that is constructed using ’s according to Eq. (5). Also assume that the constructed satisfies , where . Suppose that for , and that for every class index , there exists an annotator such that


where . Then, if , with probability greater than , the SPA algorithm in (10) can estimate an such that


where is a permutation matrix, ,

is the largest singular value of

, and is the condition number of .

In the above Theorem, the assumption means that the proposed algorithm favors cases where more co-occurrences are observed, since ’s elements are averaged number of co-occurrences—which makes a lot of sense. In addition, Eq. (11) relaxes the ideal assumption in (6), allowing the ‘good annotator’ to confuse class with class up to a certain probability, thereby being more realistic. The proof of Theorem 1 is reminiscent of the noise robustness of the SPA algorithm Gillis & Vavasis (2014); Arora et al. (2013); see the supplementary materials (Sec. F.1). A direct corollary is as follows:

Corollary 1.

Assume that the conditions in Theorem 1 hold for and , . Then, the estimation error bound in (12) holds for every MultiSPA-output , .

Theorem 1 and Corollary 1 are not entirely surprising due to the extensive research on SPA-like algorithms Arora et al. (2013); Gillis & Vavasis (2014); Fu et al. (2015); Nascimento & Bioucas-Dias (2005); Chan et al. (2011). The implication for crowdsourcing, however, is quite intriguing. First, one can see that if an annotator $m$ does not label all the data samples, it does not necessarily hurt the model identifiability: as long as annotator $m$ has co-labeled some samples with a number of other annotators, identification of $A_m$ is possible. Second, assume that there exists a well-trained annotator $m^\star$ whose confusion matrix is diagonally dominant. Then, for every annotator $m \neq m^\star$ who has co-labeled samples with annotator $m^\star$, the matrix $H_m$ can easily satisfy (11) by letting $m_{t(k)} = m^\star$ for all $k$. In practice, one would not know who $m^\star$ is; otherwise the crowdsourcing problem would be trivial. However, one can design a dispatch strategy such that every pair of annotators $m$ and $\ell$ co-label a certain amount of data. This guarantees that $A_{m^\star}$ appears in everyone else's $H_m$ and thus ensures identifiability of all $A_m$'s for $m \neq m^\star$. This insight may shed some light on how to effectively dispatch data to annotators.

Another interesting question to ask is: does having more annotators help? Intuitively, it should: if one has more rows in $H_m$, then it is more likely that some rows approach the vertices of the probability simplex, which can then enable SPA. We use the following simplified generative model and theorem to formalize the intuition:

Theorem 2.

Let , and assume that the rows of are generated within the -probability simplex uniformly at random. If the number of annotators satisfies then, with probability greater than or equal to , there exist rows of indexed by such that

Note that Theorem 2 implies (11) under proper and —and thus having more annotators indeed helps identify the model. The above can be shown by utilizing the Chernoff-Hoeffding inequality, and the detailed proof can be found in the supplementary materials (Sec. G).
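The intuition behind Theorem 2 can be checked numerically. The sketch below is a small Monte Carlo illustration, not part of the proof: it draws rows uniformly from the probability simplex (a Dirichlet distribution with all parameters equal to one) and measures how far each vertex is from its nearest row as the number of rows grows. Measuring closeness with the Euclidean distance and using nested samples are assumptions of this sketch.

```python
import numpy as np

def max_vertex_gap(H):
    """Largest, over the K vertices e_k, distance from e_k to its nearest row of H."""
    K = H.shape[1]
    E = np.eye(K)
    # dist[t, k] = Euclidean distance between row t of H and vertex e_k.
    dist = np.linalg.norm(H[:, None, :] - E[None, :, :], axis=2)
    return float(dist.min(axis=0).max())

rng = np.random.default_rng(0)
K = 3
# Rows drawn uniformly from the probability simplex, i.e. Dirichlet(1, ..., 1).
H_all = rng.dirichlet(np.ones(K), size=5000)
# Nested prefixes, so the gap can only shrink as rows are added.
gaps = {T: max_vertex_gap(H_all[:T]) for T in (10, 100, 1000, 5000)}
```

Because the samples are nested, the gap is non-increasing in the number of rows, which mirrors the claim that more annotators make near-vertex rows more likely.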

After obtaining the $A_m$'s, $d$ can be estimated in various ways; see the supplementary materials in Sec. D. Using $d$ and the $A_m$'s together, ML and MAP estimators for the true labels can be built Traganitis et al. (2018).
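Once the confusion matrices and the prior are available, a MAP predictor follows directly from the conditional independence assumption. The sketch below is one straightforward way to write it, assuming labels in `0..K-1`, a sentinel `-1` for missing responses, and an array `A` with `A[m, j, k]` standing for the probability that annotator `m` answers `j` given true label `k`; these conventions and the function name are assumptions, not the paper's code.

```python
import numpy as np

def map_labels(X, A, d, missing=-1):
    """MAP estimate of the true label of each sample under the Dawid-Skene model.

    X: (N, M) integer responses in 0..K-1, `missing` where unlabeled.
    A: (M, K, K) with A[m, j, k] = Pr(X_m = j | Y = k).  d: (K,) prior.
    """
    N, M = X.shape
    eps = 1e-12  # guards against log(0) for zero entries
    log_post = np.tile(np.log(d + eps), (N, 1))
    for m in range(M):
        seen = X[:, m] != missing
        # Conditional independence: log-likelihoods of observed responses add up.
        log_post[seen] += np.log(A[m][X[seen, m], :] + eps)
    return log_post.argmax(axis=1)

K = 3
A = np.stack([0.7 * np.eye(K) + 0.1 * np.ones((K, K)) for _ in range(3)])
d = np.array([0.2, 0.5, 0.3])
X = np.array([[0, 0, 1],      # two votes for class 0, one for class 1
              [2, -1, 2],     # two votes for class 2, one missing
              [-1, -1, -1]])  # no responses: falls back to the prior
labels = map_labels(X, A, d)
```

Dropping the prior term `np.log(d + eps)` would turn this into the corresponding ML predictor.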

4 Identifiability-enhanced Algorithm

The MultiSPA algorithm is intuitive and lightweight, and is effective as we will show in the experiments. One concern is that perhaps the assumption in (11) may be violated in some cases. In this section, we propose another model identification algorithm that is potentially more robust to critical scenarios. Specifically, we consider the following feasibility problem:

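$$\text{find} \quad \{A_m\}_{m=1}^{M},\ d \qquad (13a)$$
$$\text{subject to} \quad R_{m,\ell} = A_m D A_\ell^\top \ \text{ for all available pairs } (m, \ell), \qquad (13b)$$
$$\mathbf{1}^\top A_m = \mathbf{1}^\top, \quad A_m \geq 0, \quad \mathbf{1}^\top d = 1, \quad d \geq 0. \qquad (13c)$$

The criterion in (13) seeks confusion matrices $A_m$ and a prior PMF $d$ (with $D = \mathrm{Diag}(d)$) that fit the available second-order statistics $R_{m,\ell}$. The constraints in (13c) reflect the fact that the columns of the $A_m$'s are conditional PMFs and the prior $d$ is also a PMF.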

To proceed, let us first introduce the following notion from convex geometry Fu et al. (2018b); Lin et al. (2015):

Definition 1.

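(Sufficiently Scattered) A nonnegative matrix $H \in \mathbb{R}^{L \times K}$ is sufficiently scattered if 1) $\mathrm{cone}\{H^\top\} \supseteq \mathcal{C}$, and 2) $\mathrm{cone}\{H^\top\}^{*} \cap \mathrm{bd}\,\mathcal{C}^{*} = \{\lambda e_k \mid \lambda \geq 0,\ k = 1, \ldots, K\}$. Here, $\mathcal{C} = \{x \mid x^\top \mathbf{1} \geq \sqrt{K-1}\,\|x\|_2\}$, $\mathcal{C}^{*} = \{x \mid x^\top \mathbf{1} \geq \|x\|_2\}$. In addition, $\mathrm{cone}\{H^\top\}$ and $\mathrm{cone}\{H^\top\}^{*}$ are the conic hull of $H^\top$ and its dual cone, respectively, and $\mathrm{bd}$ is the boundary of a closed set.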

The sufficiently scattered condition has recently emerged in convex geometry-based matrix factorization Lin et al. (2015); Fu et al. (2018a). This condition models how the rows of $H$ are spread in the nonnegative orthant. In principle, the sufficiently scattered condition is much easier to satisfy than the condition in (9), i.e., the so-called separability condition in the context of nonnegative matrix factorization Donoho & Stodden (2003); Gillis & Vavasis (2014). $H$ satisfying the separability condition is the extreme case, meaning that $\mathrm{cone}\{H^\top\} = \mathbb{R}_{+}^{K}$. However, the sufficiently scattered condition only requires $\mathcal{C} \subseteq \mathrm{cone}\{H^\top\}$, which is naturally much more relaxed; also see Fu et al. (2018b) and the supplementary materials for detailed illustrations (Sec. E).

Regarding identifiability of and , we have the following result:

Theorem 3.

Assume that for all , and that there exist two subsets of the annotators, indexed by and , where and . Suppose that from and the following two matrices can be constructed: , where and . Furthermore, assume that i) both and are sufficiently scattered; ii) all ’s for and are available; and iii) for every there exists a available, where . Then, solving Problem (13) recovers for and up to identical column permutation.

The proof of Theorem 3 is relegated to the supplementary materials (Sec. H). Note that the theorem holds under the existence of the two annotator subsets, but there is no need to know the sets a priori. Generally speaking, a ‘taller’ matrix would have a better chance of having its rows sufficiently spread in the nonnegative orthant, under the same intuition as Theorem 2. Thus, having more annotators also helps to attain the sufficiently scattered condition. Nevertheless, formally showing the relationship between the number of annotators and the constructed matrices being sufficiently scattered is more challenging than the case in Theorem 2, since the sufficiently scattered condition is a bit more abstract than the separability condition: the latter specifically assumes that unit vectors exist as rows of the matrix, while the former depends on the ‘shape’ of the conic hull of its rows, which contains an infinite number of cases. Towards this end, let us first define the following notion:

Definition 2.

Assume that there exist such that is sufficiently scattered. Also assume is the row index set of such that collects the extreme rays of . If there exist row indices for all , such that , then is called -sufficiently scattered.

One can see that an -sufficiently scattered matrix is sufficiently scattered when . With this definition, we show the following theorem:

Theorem 4.

Let , and assume that the rows of and are generated from uniformly at random. If the number of annotators satisfies , where for , for and for , then with probability greater than or equal to , and are -sufficiently scattered.

The proof of Theorem 4 is relegated to the supplementary materials (Sec. I). One can see that to satisfy -sufficiently scattered condition, is smaller than that in Theorem 2. Conditions i)-iii) in Theorem 3 and Theorem 4 together imply that if we have enough annotators, and if many pairs co-label a certain number of data, then it is quite possible that one can identify the Dawid-Skene model via simply finding a feasible solution to (13). This feasibility problem is nonconvex, but can be effectively approximated; see the supplementary materials (Sec. C). In a nutshell, we reformulate the problem as a Kullback-Leibler (KL) divergence-based constrained fitting problem and handle it using alternating optimization. Since nonconvex optimization relies on initialization heavily, we use MultiSPA to initialize the fitting stage—which we will refer to as the MultiSPA-KL algorithm.
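The fitting stage is only summarized here (its details are in the supplementary materials), so the following is not the authors' algorithm. It merely evaluates one plausible KL-divergence fitting objective for candidate parameters, assuming the observed pairwise statistics are stored in a dictionary keyed by annotator pairs; the data layout, the function name, and the example matrices are all hypothetical.

```python
import numpy as np

def kl_fit_objective(R_hat, A, d, eps=1e-12):
    """Sum of KL(R_hat[m, l] || A[m] diag(d) A[l].T) over the available pairs.

    R_hat: dict mapping a pair (m, l) to its K x K co-occurrence estimate
           (only observed pairs are present).
    A: sequence of K x K confusion matrices.  d: (K,) prior.
    """
    total = 0.0
    for (m, l), R in R_hat.items():
        model = A[m] @ np.diag(d) @ A[l].T
        mask = R > 0  # terms with R = 0 contribute zero by convention
        total += float(np.sum(R[mask] * np.log(R[mask] / (model[mask] + eps))))
    return total

A = [np.array([[0.9, 0.2], [0.1, 0.8]]),
     np.array([[0.7, 0.1], [0.3, 0.9]])]
d = np.array([0.6, 0.4])
R_hat = {(0, 1): A[0] @ np.diag(d) @ A[1].T}  # exact statistics for one pair
obj_true = kl_fit_objective(R_hat, A, d)
obj_off = kl_fit_objective(R_hat, A, np.array([0.3, 0.7]))
```

The objective is essentially zero at parameters that reproduce the statistics and strictly positive otherwise, which is what an alternating optimization scheme would drive down while keeping the constraints in (13c) satisfied.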

5 Experiments

Baselines. The performance of the proposed approach is compared with a number of competitive baselines, namely, Spectral-D&S Zhang et al. (2014), TensorADMM Traganitis et al. (2018), and KOS Karger et al. (2013), EigRatio Dalvi et al. (2013), GhoshSVD Ghosh et al. (2011) and MinmaxEntropy Zhou et al. (2014). The performance of the Majority Voting scheme and the Majority Voting initialized Dawid-Skene (MV-D&S) estimator Dawid & Skene (1979) are also presented. We also use MultiSPA to initialize EM algorithm (named as MultiSPA-D&S). Note that KOS, EigRatio and MinmaxEntropy work with more complex models relative to the Dawid-Skene model, but are considered as good baselines for the crowdsourcing/ensemble learning tasks. After identifying the model parameters, we construct a MAP predictor following Traganitis et al. (2018) and observe the result. The algorithms are coded in Matlab.

Synthetic-data Simulations. Due to page limitations, synthetic data experiments demonstrating model identifiability of the proposed algorithms are presented in the supplementary materials (Sec. A).

Integrating Machine Classifiers. We employ different UCI datasets (details in Sec. B). For each of the datasets under test, we use a collection of different classification algorithms to annotate the data samples. Different classification algorithms from the MATLAB machine learning toolbox, such as various $k$-nearest neighbour classifiers, support vector machine classifiers, and decision tree classifiers, are employed to serve as our machine annotators. In order to train the annotators, we use a portion of the samples as training data. After the annotators are trained, we use them to label the unseen data samples. In practice, not all samples are labeled by an annotator due to several factors such as annotator capacity, difficulty of the task, economic issues and so on. To simulate such a scenario, each of the trained algorithms is allowed to label a data sample with probability $p$. We test the performance of all the algorithms under different $p$'s, and a smaller $p$ means a more challenging scenario. All the results are averaged from 10 random trials.
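The partial-labeling simulation can be sketched as follows, assuming a full response matrix with labels `0..K-1` and a sentinel `-1` marking responses that are withheld; the function name and the sentinel are assumptions of this sketch, and the experiments in the paper were run in Matlab rather than Python.

```python
import numpy as np

def mask_responses(X_full, p, rng, missing=-1):
    """Keep each (sample, annotator) response independently with probability p."""
    keep = rng.random(X_full.shape) < p
    return np.where(keep, X_full, missing)

rng = np.random.default_rng(0)
X_full = rng.integers(0, 3, size=(10000, 5))  # placeholder responses, K = 3
X_obs = mask_responses(X_full, 0.3, rng)
frac_observed = float(np.mean(X_obs != -1))
```

With `p = 1` every response is kept, and smaller values of `p` leave fewer co-labeled samples per annotator pair.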

Table 1 shows the classification error of the algorithms under test. Since GhoshSVD and EigenRatio work only on binary tasks, they are not evaluated for the Nursery dataset, which is a multi-class task. The ‘single best’ and ‘single worst’ rows correspond to the results of using the classifiers individually when $p = 1$, as references. The best and second-best performing algorithms are highlighted in the table. One can see that the proposed methods are quite promising for this experiment. Both algorithms largely outperform the tensor-based methods TensorADMM and Spectral-D&S in this case, perhaps because the limited number of available samples makes the third-order statistics hard to estimate. It is also observed that the proposed algorithms enjoy favorable runtime; see supplementary materials (cf. Table 8 in Sec. B). Using MultiSPA to initialize EM (i.e., MultiSPA-D&S) also works well, which offers another viable option that strikes a good balance between runtime and accuracy.

Algorithms        Nursery                   Mushroom                  Adult
MultiSPA          2.83    4.54    17.96     0.02    0.293   6.35      15.71   16.05   17.66
MultiSPA-KL       2.72    4.26    13.06     0.00    0.152   5.89      15.66   15.98   17.63
MultiSPA-D&S      2.82    4.44    13.39     0.00    0.194   6.17      15.74   16.29   23.88
Spectral-D&S      3.14    37.2    44.29     0.00    0.198   6.17      15.72   16.31   23.97
TensorADMM        17.97   7.26    19.78     0.06    0.237   6.18      15.72   16.05   25.08
MV-D&S            2.92    66.48   66.61     0.00    47.99   48.63     15.76   75.21   75.13
Minmax-entropy    3.63    26.31   11.09     0.00    0.163   8.14      16.11   16.92   15.64
EigenRatio        N/A     N/A     N/A       0.06    0.329   5.97      15.84   16.28   17.69
KOS               4.21    6.07    13.48     0.06    0.576   6.42      17.19   24.97   38.29
Ghosh-SVD         N/A     N/A     N/A       0.06    0.329   5.97      15.84   16.28   17.71
Majority Voting   2.94    4.83    19.75     0.14    0.566   6.57      15.75   16.21   20.57
Single Best       3.94    N/A     N/A       0.00    N/A     N/A       16.23   N/A     N/A
Single Worst      15.65   N/A     N/A       7.22    N/A     N/A       19.27   N/A     N/A
Table 1: Classification Error (%) on UCI Datasets; see runtime tabulated in Sec. B.

Amazon Mechanical Turk Crowdsourcing Data. In this section, the performance of the proposed algorithms is evaluated using the Amazon Mechanical Turk (AMT) data, in which human annotators label various classification tasks. Data description is given in the supplementary materials Sec. B. Table 2 shows the classification error and the runtime performance of the algorithms under test. One can see that MultiSPA has a very favorable execution time, because it is a Gram-Schmidt-like algorithm. MultiSPA-KL uses more time because it is an iterative optimization method, and the extra time pays off in better accuracy. Since the TensorADMM algorithm does not scale well, the results are not reported for very large datasets (i.e., TREC and RTE). As before, since Web and Dog are multi-class datasets, EigenRatio and GhoshSVD are not applicable. From the results, it can be seen that the proposed algorithms outperform many existing crowdsourcing algorithms in both classification accuracy and runtime. In particular, one can see that the algebraic algorithm MultiSPA gives results very similar to those of the computationally much more involved algorithms. This shows its potential for big data crowdsourcing.

Algorithms        TREC                Bluebird            RTE                 Web                 Dog
                  Error(%)  Time(s)   Error(%)  Time(s)   Error(%)  Time(s)   Error(%)  Time(s)   Error(%)  Time(s)
MultiSPA          31.47     50.68     13.88     0.07      8.75      0.28      15.22     0.54      17.09     0.07
MultiSPA-KL       29.23     536.89    11.11     1.94      7.12      17.06     14.58     12.34     15.48     15.88
MultiSPA-D&S      29.84     53.14     12.03     0.09      7.12      0.32      15.11     0.84      16.11     0.12
Spectral-D&S      29.58     919.98    12.03     1.97      7.12      6.40      16.88     179.92    17.84     51.16
TensorADMM        N/A       N/A       12.03     2.74      N/A       N/A       N/A       N/A       17.96     603.93
MV-D&S            30.02     3.20      12.03     0.02      7.25      0.07      16.02     0.28      15.86     0.04
Minmax-entropy    91.61     352.36    8.33      3.43      7.50      9.10      11.51     26.61     16.23     7.22
EigenRatio        43.95     1.48      27.77     0.02      9.01      0.03      N/A       N/A       N/A       N/A
KOS               51.95     9.98      11.11     0.01      39.75     0.03      42.93     0.31      31.84     0.13
GhoshSVD          43.03     11.62     27.77     0.01      49.12     0.03      N/A       N/A       N/A       N/A
Majority Voting   34.85     N/A       21.29     N/A       10.31     N/A       26.93     N/A       17.91     N/A
Table 2: Classification Error (%) and Run-time (sec): AMT Datasets

6 Conclusion

In this work, we have revisited the classic Dawid-Skene model for multi-class crowdsourcing. We have proposed a second-order statistics-based approach that guarantees identifiability of the model parameters, i.e., the confusion matrices of the annotators and the label prior. The proposed method naturally admits lower sample complexity relative to existing methods that utilize tensor algebra to ensure model identifiability. The proposed approach also has an array of favorable features. In particular, our framework enables a lightweight algebraic algorithm, which is reminiscent of the Gram-Schmidt-like SPA algorithm for nonnegative matrix factorization. We have also proposed a coupled and constrained matrix factorization criterion that enjoys enhanced identifiability, as well as an alternating optimization algorithm for handling the identification problem. Real-data experiments show that our proposed algorithms are quite promising for integrating crowdsourced labeling.


  • Anandkumar et al. (2014) Anandkumar, A., Ge, R., Hsu, D., and Kakade, S. M. A tensor approach to learning mixed membership community models. The Journal of Machine Learning Research, 15(1):2239–2312, 2014.
  • Arora et al. (2013) Arora, S., Ge, R., Halpern, Y., Mimno, D., Moitra, A., Sontag, D., Wu, Y., and Zhu, M. A practical algorithm for topic modeling with provable guarantees. In Proceedings of ICML, 2013.
  • Baraniuk et al. (2008) Baraniuk, R., Davenport, M., DeVore, R., and Wakin, M. A simple proof of the restricted isometry property for random matrices. Constructive Approximation, 28(3):253–263, 2008.
  • Bertsekas (1999) Bertsekas, D. P. Nonlinear programming. Athena Scientific, 1999.
  • Chan et al. (2011) Chan, T.-H., Ma, W.-K., Ambikapathi, A., and Chi, C.-Y. A simplex volume maximization framework for hyperspectral endmember extraction. IEEE Trans. Geosci. Remote Sens., 49(11):4177–4193, Nov. 2011.
  • Dalvi et al. (2013) Dalvi, N., Dasgupta, A., Kumar, R., and Rastogi, V. Aggregating crowdsourced binary ratings. In Proceedings of the 22nd International Conference on World Wide Web, pp. 285–294, New York, NY, USA, 2013. ACM.
  • Dawid & Skene (1979) Dawid, A. P. and Skene, A. M. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, pp. 20–28, 1979.
  • Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, June 2009.
  • Dietterich (2000) Dietterich, T. G. Ensemble methods in machine learning. In International workshop on multiple classifier systems, pp. 1–15. Springer, 2000.
  • Donoho & Stodden (2003) Donoho, D. and Stodden, V. When does non-negative matrix factorization give a correct decomposition into parts? In Advances in neural information processing systems, volume 16, 2003.
  • Fu et al. (2015) Fu, X., Ma, W.-K., Chan, T.-H., and Bioucas-Dias, J. M. Self-dictionary sparse regression for hyperspectral unmixing: Greedy pursuit and pure pixel search are related. IEEE J. Sel. Topics Signal Process., 9(6):1128–1141, 2015.
  • Fu et al. (2016) Fu, X., Huang, K., Yang, B., Ma, W.-K., and Sidiropoulos, N. D. Robust volume minimization-based matrix factorization for remote sensing and document clustering. IEEE Trans. Signal Process., 64(23):6254–6268, 2016.
  • Fu et al. (2018a) Fu, X., Huang, K., and Sidiropoulos, N. D. On identifiability of nonnegative matrix factorization. IEEE Signal Process. Lett., 25(3):328–332, 2018a.
  • Fu et al. (2018b) Fu, X., Huang, K., Sidiropoulos, N. D., and Ma, W.-K. Nonnegative matrix factorization for signal and data analytics: Identifiability, algorithms, and applications. arXiv preprint arXiv:1803.01257, 2018b.
  • Ghosh et al. (2011) Ghosh, A., Kale, S., and McAfee, P. Who moderates the moderators?: crowdsourcing abuse detection in user-generated content. In Proceedings of the 12th ACM conference on Electronic commerce, pp. 167–176. ACM, 2011.
  • Gillis (2014) Gillis, N. The why and how of nonnegative matrix factorization. Regularization, Optimization, Kernels, and Support Vector Machines, 12:257, 2014.
  • Gillis & Vavasis (2014) Gillis, N. and Vavasis, S. Fast and robust recursive algorithms for separable nonnegative matrix factorization. IEEE Trans. Pattern Anal. Mach. Intell., 36(4):698–714, April 2014.
  • Huang et al. (2014) Huang, K., Sidiropoulos, N., and Swami, A. Non-negative matrix factorization revisited: Uniqueness and algorithm for symmetric decomposition. IEEE Trans. Signal Process., 62(1):211–224, 2014.
  • Huang et al. (2016) Huang, K., Sidiropoulos, N. D., and Liavas, A. P. A flexible and efficient algorithmic framework for constrained matrix and tensor factorization. IEEE Trans. Signal Process., 64(19):5052–5065, 2016.
  • Huang et al. (2018) Huang, K., Fu, X., and Sidiropoulos, N. D. Learning hidden Markov models from pairwise co-occurrences with applications to topic modeling. In Proceedings of ICML 2018, 2018.
  • Jonker & Volgenant (1986) Jonker, R. and Volgenant, T. Improving the Hungarian assignment algorithm. Operations Research Letters, 5(4):171–175, 1986.
  • Karger et al. (2011) Karger, D. R., Oh, S., and Shah, D. Budget-optimal crowdsourcing using low-rank matrix approximations. 2011.
  • Karger et al. (2013) Karger, D. R., Oh, S., and Shah, D. Efficient crowdsourcing for multi-class labeling. ACM SIGMETRICS Performance Evaluation Review, 41(1):81–92, 2013.
  • Karger et al. (2014) Karger, D. R., Oh, S., and Shah, D. Budget-optimal task allocation for reliable crowdsourcing systems. Operations Research, 62(1):1–24, 2014.
  • Kittur et al. (2008) Kittur, A., Chi, E. H., and Suh, B. Crowdsourcing user studies with Mechanical Turk. In Proceedings of the SIGCHI conference on human factors in computing systems, pp. 453–456. ACM, 2008.
  • Kolda & Bader (2009) Kolda, T. G. and Bader, B. W. Tensor decompositions and applications. SIAM review, 51(3):455–500, 2009.
  • Lease & Kazai (2011) Lease, M. and Kazai, G. Overview of the TREC 2011 crowdsourcing track. 2011.
  • Lin et al. (2015) Lin, C.-H., Ma, W.-K., Li, W.-C., Chi, C.-Y., and Ambikapathi, A. Identifiability of the simplex volume minimization criterion for blind hyperspectral unmixing: The no-pure-pixel case. IEEE Trans. Geosci. Remote Sens., 53(10):5530–5546, Oct 2015.
  • Liu et al. (2012) Liu, Q., Peng, J., and Ihler, A. T. Variational inference for crowdsourcing. In Advances in neural information processing systems, pp. 692–700, 2012.
  • Nascimento & Bioucas-Dias (2005) Nascimento, J. and Bioucas-Dias, J. Vertex component analysis: A fast algorithm to unmix hyperspectral data. IEEE Trans. Geosci. Remote Sens., 43(4):898–910, 2005.
  • Raykar et al. (2010) Raykar, V. C., Yu, S., Zhao, L. H., Valadez, G. H., Florin, C., Bogoni, L., and Moy, L. Learning from crowds. Journal of Machine Learning Research, 11(Apr):1297–1322, 2010.
  • Razaviyayn et al. (2013) Razaviyayn, M., Hong, M., and Luo, Z.-Q. A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM Journal on Optimization, 23(2):1126–1153, 2013.
  • Robert (2014) Robert, C. Machine learning, a probabilistic perspective, 2014.
  • Sidiropoulos et al. (2017) Sidiropoulos, N. D., De Lathauwer, L., Fu, X., Huang, K., Papalexakis, E. E., and Faloutsos, C. Tensor decomposition for signal processing and machine learning. IEEE Trans. Signal Process., 65(13):3551–3582, 2017.
  • Snow et al. (2008) Snow, R., O’Connor, B., Jurafsky, D., and Ng, A. Y. Cheap and fast—but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of the conference on empirical methods in natural language processing, pp. 254–263. Association for Computational Linguistics, 2008.
  • Stein (1966) Stein, P. A Note on the Volume of a Simplex. The American Mathematical Monthly, 73(3), 1966. doi: 10.2307/2315353.
  • Stephane Boucheron (2004) Stephane Boucheron, Gabor Lugosi, O. B. Concentration Inequalities, 2004. URL:
  • Traganitis et al. (2018) Traganitis, P. A., Pages-Zamora, A., and Giannakis, G. B. Blind multiclass ensemble classification. IEEE Trans. Signal Process., 66(18):4737–4752, 2018.
  • Welinder et al. (2010) Welinder, P., Branson, S., Perona, P., and Belongie, S. J. The multidimensional wisdom of crowds. In Advances in neural information processing systems, pp. 2424–2432, 2010.
  • Zhang et al. (2014) Zhang, Y., Chen, X., Zhou, D., and Jordan, M. I. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. In Advances in neural information processing systems, pp. 1260–1268, 2014.
  • Zhou et al. (2014) Zhou, D., Liu, Q., Platt, J., and Meek, C. Aggregating ordinal labels from crowds by minimax conditional entropy. In Proceedings of ICML, volume 32, pp. 262–270, 2014.

Appendix A Synthetic Data Experiments

In the first experiment, we consider that annotators are available to annotate items, each belonging to one of classes. The true label for each item is sampled uniformly from the set of class labels, i.e., the prior probability vector is fixed to be uniform. For generating the confusion matrices, two different cases are considered:

  • Case 1: an annotator is chosen uniformly at random and is assigned an ideal confusion matrix, i.e., an identity matrix. This ensures that the assumption given by Eq. (9) (or Eq. (6)) holds.

  • Case 2: an annotator is chosen uniformly at random and its confusion matrix is made diagonally dominant such that . To achieve this, the elements of each column of its confusion matrix are drawn from a uniform distribution between 0 and 1. The columns are then normalized using their respective -norms. After that, for each column, the elements are re-organized such that the corresponding diagonal entry is dominant in that column and then normalized with respect to the -norm. In this way, Eq. (11) in Theorem 1 may be (approximately) satisfied.

In both cases, the confusion matrices of the remaining annotators are randomly generated: the elements are first drawn from the uniform distribution between 0 and 1, and then the columns are normalized with respect to the -norm. Once the confusion matrices are generated, the response of each annotator to an item is randomly drawn from the set of class labels according to the column of that annotator's confusion matrix that corresponds to the item's true label. An annotator response for each item is retained for the estimation with a given probability; in other words, with the complementary probability, each response is set to 0. In this way, our simulated scenario is expected to mimic realistic situations where we have a combination of reliable and unreliable annotators, each labeling parts of the items. Using the generated responses, we construct the second-order statistics and then follow the proposed approach to identify the confusion matrices and the prior.
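The generation procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the sum-to-one column normalization (chosen so that each column is a valid PMF), the swap used to place the largest entry on the diagonal, and all sizes in the usage lines are hypothetical.

```python
import random

def random_confusion(K, rng):
    """K x K matrix with A[i][k] = Pr(response i | true label k); columns sum to 1."""
    cols = []
    for _ in range(K):
        col = [rng.random() for _ in range(K)]  # entries uniform in [0, 1)
        s = sum(col)
        cols.append([v / s for v in col])       # assumption: sum-to-one normalization
    return [[cols[k][i] for k in range(K)] for i in range(K)]

def diag_dominant_confusion(K, rng):
    """Random confusion matrix whose diagonal entry is the largest in each column."""
    A = random_confusion(K, rng)
    for k in range(K):
        i_max = max(range(K), key=lambda i: A[i][k])
        # Swapping within a column keeps the column sum unchanged.
        A[k][k], A[i_max][k] = A[i_max][k], A[k][k]
    return A

def simulate_responses(confusions, labels, p, rng):
    """responses[m][n] is in {1..K}, or 0 when the response is dropped (prob. 1 - p)."""
    K = len(confusions[0])
    responses = []
    for A in confusions:
        row = []
        for y in labels:
            if rng.random() < p:
                weights = [A[i][y] for i in range(K)]  # column of the true label
                row.append(rng.choices(range(1, K + 1), weights=weights)[0])
            else:
                row.append(0)
        responses.append(row)
    return responses

rng = random.Random(0)
K, M, N, p = 3, 5, 200, 0.5                      # hypothetical sizes
labels = [rng.randrange(K) for _ in range(N)]    # uniform prior, 0-based labels
confusions = [random_confusion(K, rng) for _ in range(M)]
confusions[rng.randrange(M)] = diag_dominant_confusion(K, rng)  # one reliable annotator
responses = simulate_responses(confusions, labels, p, rng)
```

Encoding a missing response as 0 follows the convention stated in the text; class labels are therefore 1-based in `responses` and 0-based in `labels`.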

The accuracy of the estimation is measured using mean squared error (MSE) defined as


where is the estimate of and ’s are used to fix the column permutation.
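One common way to compute such a permutation-aligned MSE is sketched below. It is an assumption here that the squared error is averaged over the columns and minimized over all column permutations; the function name is hypothetical, and the exhaustive search is only practical for a small number of classes.

```python
from itertools import permutations

def permuted_mse(A_true, A_est):
    """Minimum over column permutations of the mean squared column error (K x K inputs)."""
    K = len(A_true)
    best = float("inf")
    for perm in permutations(range(K)):
        err = 0.0
        for k in range(K):
            # Compare column perm[k] of the truth with column k of the estimate.
            err += sum((A_true[i][perm[k]] - A_est[i][k]) ** 2 for i in range(K))
        best = min(best, err / K)
    return best

A_true = [[0.8, 0.1], [0.2, 0.9]]
A_swapped = [[0.1, 0.8], [0.9, 0.2]]  # same matrix with its two columns exchanged
print(permuted_mse(A_true, A_swapped))  # the column swap is undone by the search
```

For a larger number of classes, an assignment solver such as the Hungarian algorithm could replace the exhaustive search over permutations.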

The average MSEs of the confusion matrices for various values of under the above-mentioned cases are shown in Table 3 and Table 4, where the proposed methods, MultiSPA and MultiSPA-KL, are compared with the baselines Spectral-D&S, TensorADMM, and MV-D&S, since these methods are also Dawid-Skene model identification approaches. As MV-D&S becomes numerically unstable for smaller values of , those results are not reported in the tables. All the results are averaged over 10 trials.

From the two tables, one can see that MultiSPA works reasonably well for both cases. As expected, it exhibits lower MSEs for case 1, since the condition in (6) is perfectly enforced. Nevertheless, in both cases, using MultiSPA to initialize the KL algorithm identifies the confusion matrices to a very high accuracy. It is observed that MultiSPA-KL outperforms the baselines in terms of the estimation accuracy, which may be a result of using second-order statistics.

MultiSPA 0.0184 0.0083 0.0063 0.0034
MultiSPA-KL 0.0019 0.0009 0.0004 1.73E-04
Spectral D&S 0.0320 0.0112 0.0448 1.74E-04
TensorADMM 0.0026 0.0011 0.0005 1.88E-04
MV-D&S N/A N/A 0.0173 1.84E-04
Table 3: Average MSE of the confusion matrices for case 1.
MultiSPA 0.0229 0.0188 0.0115 0.0102
MultiSPA-KL 0.0029 0.0014 0.0005 1.67E-04
Spectral D&S 0.0348 0.0265 0.0391 1.67E-04
TensorADMM 0.0031 0.0016 0.0006 1.93E-04
MV-D&S N/A N/A 0.0028 5.88E-04
Table 4: Average MSE of the confusion matrices for case 2.

Under the same settings as in case 2, the true labels are estimated using the MAP/ML predictor as in Traganitis et al. (2018) (in this case, ML and MAP are the same since the prior PMF is a uniform distribution). The classification error and the runtime of the crowdsourcing algorithms are computed and shown in Table 5.
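A MAP predictor under the Dawid-Skene model can be sketched as follows, assuming the annotators respond independently given the true label. The function name, the encoding of a missing response as 0, and the small constant that guards against taking the logarithm of zero are assumptions made for illustration.

```python
import math

def map_label(item_responses, confusions, prior):
    """MAP estimate (0-based class index) of one item's label.

    item_responses[m] in {1..K} is annotator m's response; 0 means missing.
    confusions[m][i][k] = Pr(annotator m responds i+1 | true label k).
    """
    K = len(prior)
    eps = 1e-12  # assumption: guards log(0) for zero-valued entries
    scores = []
    for k in range(K):
        s = math.log(prior[k] + eps)
        for A, r in zip(confusions, item_responses):
            if r != 0:  # only observed responses contribute to the posterior
                s += math.log(A[r - 1][k] + eps)
        scores.append(s)
    return max(range(K), key=lambda k: scores[k])

A = [[0.9, 0.2], [0.1, 0.8]]
print(map_label([1, 0], [A, A], [0.5, 0.5]))
```

With a uniform prior the prior term is the same for every class, so the MAP and ML decisions coincide, as noted above.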

Algorithms Run-time(sec)
MultiSPA 37.24 26.39 19.21 0.049
MultiSPA-KL 31.71 21.10 12.79 18.07
MultiSPA-D&S 31.95 21.11 12.80 0.069
Spectral-D&S 46.37 23.92 12.89 27.17
TensorADMM 32.16 21.34 12.91 56.09
MV-D&S 66.91 57.92 13.09 0.096
Minmax-entropy 62.83 65.50 67.31 200.91
KOS 71.47 61.05 13.12 5.653
Majority Voting 67.57 68.37 71.39 N/A
Table 5: Classification Error (%) & Average run-time when

In the next experiment with case 2, the true labels are sampled with unequal probability. Specifically, is set to be with all other parameters and conditions the same as in the first experiment. Using the MAP predictor, the true labels are estimated for the proposed algorithms for various values of and the results are shown in Table 6. The results suggest that both proposed algorithms, MultiSPA and MultiSPA-KL, achieve better classification accuracy than most of the baselines when the true label distribution of the items is not balanced; MultiSPA-KL attains the lowest error in every column of Table 6.

Algorithms Run-time(sec)
MultiSPA 30.75 21.29 13.67 0.105
MultiSPA-KL 23.19 16.62 10.13 18.93
MultiSPA-D&S 40.12 32.1 21.46 0.122
Spectral-D&S 56.17 49.41 39.17 28.01
TensorADMM 34.17 25.53 11.97 152.76
MV-D&S 83.14 83.15 32.98 0.090
Minmax-entropy 83.04 63.08 74.29 232.82
KOS 70.79 67.55 78.00 6.19
Majority Voting 65.37 65.57 66.06 N/A
Table 6: Classification Error (%) & Average run-time when

In the next experiment, the effect of the number of annotators () on the estimation accuracy of the confusion matrices is investigated. According to Theorems 2 and 4, the proposed methods will benefit from the availability of more annotators (i.e., a larger ). For , , , and the true confusion matrices being generated as in case 2, the MSEs under various values of are plotted in Figure 1. One can see that MultiSPA-KL achieves better accuracy relative to MultiSPA under the same ’s, which corroborates our results in Theorem 4.

Figure 1: MSE of the confusion matrices for various values of

Appendix B More Details on UCI and AMT Dataset Experiments

UCI data. The details of the UCI datasets employed in the real-data experiments are given in Table 7. To be more specific, the task in the Adult dataset is to classify the income of a person into 2 classes based on 14 attributes. The Mushroom dataset has 22 attributes of certain variations of mushrooms, and the task is to predict either ‘edible’ or ‘poisonous’. The task in the Nursery dataset is to assign applications to one of 4 categories based on 8 attributes of the financial and social status of the parents.

UCI dataset name # classes # items # annotators
Adult 2 7017 10
Mushroom 2 6358 10
Nursery 4 3575 10
Table 7: Details of UCI Datasets.

The proposed methods and the baselines are compared in terms of runtime on the UCI datasets, and the results are reported in Table 8. All the results are averaged over 10 different trials.

Algorithms Nursery Mushroom Adult
MultiSPA 0.021 0.012 0.018
MultiSPA-KL 1.112 0.663 0.948
MultiSPA-D&S 0.035 0.027 0.027
Spectral-D&S 10.09 0.496 0.512
TensorADMM 5.811 0.743 4.234
MV-D&S 0.009 0.007 0.008
Minmax-entropy 19.94 2.304 6.959
EigenRatio N/A 0.005 0.007
KOS 0.768 0.085 0.118
Ghosh-SVD N/A 0.081 0.115
Table 8: Average runtime (sec) for UCI dataset experiments.
Dataset # classes # items # annotators # annotator labels
Bird 2 108 30 3,240
RTE 2 800 164 8,000
TREC 2 19,033 762 88,385
Dog 4 807 52 7,354
Web 5 2,665 177 15,567
Table 9: AMT Dataset description.

AMT data. The Amazon Mechanical Turk (AMT) datasets used in our crowdsourcing data experiments are described in Table 9. Specifically, the tasks involving the Bird dataset Welinder et al. (2010), the RTE dataset Snow et al. (2008), and the TREC dataset Lease & Kazai (2011) are binary classification tasks. The tasks associated with the Dog dataset Deng et al. (2009) and the Web dataset Zhou et al. (2014) are multi-class tasks (i.e., 4 and 5 classes, respectively).

We would like to add one remark regarding the two-stage approaches that involve an initial stage and a refinement stage (e.g., Spectral-D&S, MV-D&S, and MultiSPA-KL). Due to the very high sparsity of the annotator responses in most of the AMT data, the estimated confusion matrices from the first stage may contain many zero entries, which may sometimes lead to numerical issues in the second stage, as observed in Zhang et al. (2014). In our experiments, we follow an empirical thresholding strategy proposed in Zhang et al. (2014). Specifically, the confusion matrix entries that are smaller than a threshold are reset to and the columns are normalized before initialization. In our experiments, we use for most of the cases except the extremely large dataset TREC, for which all methods perform better using .
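The thresholding step can be sketched as follows. This is a minimal illustration that assumes small entries are reset to the threshold itself and that columns are rescaled to sum to one; the function name and the threshold value in the usage line are placeholders, not the values used in the experiments.

```python
def threshold_and_normalize(A, delta):
    """Raise entries below delta to delta, then rescale every column to sum to one."""
    n_rows, n_cols = len(A), len(A[0])
    # Assumption: entries smaller than the threshold are reset to the threshold.
    B = [[max(A[i][k], delta) for k in range(n_cols)] for i in range(n_rows)]
    for k in range(n_cols):
        s = sum(B[i][k] for i in range(n_rows))
        for i in range(n_rows):
            B[i][k] /= s
    return B

# Hypothetical first-stage estimate with exact zeros, and a placeholder threshold.
A_init = threshold_and_normalize([[1.0, 0.0], [0.0, 1.0]], 0.01)
```

After this step every entry is strictly positive, which avoids taking logarithms of zero in a likelihood-based or KL-based refinement stage.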

Appendix C Algorithm for Criterion (13)

In this section, the MultiSPA-KL algorithm is discussed in detail. To implement the identification criterion in (13), we lift the constraint (13b) and employ the following coupled matrix factorization criterion:


where and the Kullback-Leibler (KL) divergence is employed as the distance measure. The reason is that is a joint PMF of two random variables, and the KL-divergence is the most natural distance measure under such circumstances. Problem (15) is a nonconvex optimization problem, but can be handled by a simple alternating optimization procedure.
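The fitting cost for a single annotator pair can be sketched as follows. The sketch assumes that, under the Dawid-Skene model with conditionally independent annotators, the joint PMF of two annotators' responses equals A_m diag(d) A_j^T, where A_m and A_j are confusion matrices and d is the label prior. The names `cooccurrence`, `model_joint`, and `kl_divergence`, and the smoothing constant `eps`, are hypothetical.

```python
import math

def cooccurrence(resp_m, resp_j, K):
    """Empirical joint PMF of two annotators' responses over co-labeled items (0 = missing)."""
    counts = [[0.0] * K for _ in range(K)]
    total = 0
    for a, b in zip(resp_m, resp_j):
        if a != 0 and b != 0:
            counts[a - 1][b - 1] += 1.0
            total += 1
    if total == 0:
        return None  # this pair never labeled the same item
    return [[c / total for c in row] for row in counts]

def model_joint(A_m, A_j, d):
    """Model joint PMF A_m diag(d) A_j^T for confusion matrices A_m, A_j and prior d."""
    K = len(d)
    return [[sum(A_m[a][k] * d[k] * A_j[b][k] for k in range(K)) for b in range(K)]
            for a in range(K)]

def kl_divergence(R, Q, eps=1e-12):
    """KL(R || Q) between two joint PMFs; eps (an assumption) avoids division by zero."""
    total = 0.0
    for r_row, q_row in zip(R, Q):
        for r, q in zip(r_row, q_row):
            if r > 0:
                total += r * math.log(r / (q + eps))
    return total

A = [[0.9, 0.2], [0.1, 0.8]]
d = [0.5, 0.5]
Q = model_joint(A, A, d)
R = cooccurrence([1, 2, 0, 1], [1, 2, 2, 0], 2)
cost = kl_divergence(R, Q)
```

A coupled criterion of the kind described here would sum this cost over all annotator pairs whose co-occurrence estimate is available, subject to the columns of each confusion matrix and the prior being valid PMFs.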

Specifically, we propose to solve the following subproblems cyclically:


where denotes the index set of ’s such that is available. Both of the above problems are convex optimization problems, and thus can be effectively solved via a number of off-the-shelf optimization algorithms, e.g., ADMM Huang et al. (2016) and mirror descent Arora et al. (2013). The algorithm is summarized in Algorithm 2. The alternating optimization algorithm is also guaranteed to converge to a stationary point under mild conditions Bertsekas (1999); Razaviyayn et al. (2013).

  Input: Annotator Responses .
  Output: for , .
  Estimate second order statistics ;
  get initial estimates of using MultiSPA
  for  to MaxIter do
     for  to </