1 Introduction
Background.
The drastically increasing availability of data has enabled many timely applications in machine learning and artificial intelligence. At the same time, most supervised learning tasks, e.g., the core tasks in computer vision, natural language processing, and speech processing, heavily rely on labeled data. However, labeling data is not a trivial task—it requires educated and knowledgeable annotators (which could be human workers or machine classifiers) to work in a reliable way. More importantly, it needs an effective mechanism to integrate the possibly different labelings from multiple annotators. Techniques addressing this problem in machine learning are called
crowdsourcing Kittur et al. (2008) or, more generally, ensemble learning Dietterich (2000). Crowdsourcing has a long history in machine learning, which can be traced back to the 1970s Dawid & Skene (1979). Many models and methods have appeared since then Karger et al. (2011, 2013, 2014); Snow et al. (2008); Welinder et al. (2010); Liu et al. (2012); Traganitis et al. (2018). Intuitively, if a number of reliable annotators label the same data samples, then majority voting among the annotators is expected to work well. However, in practice, not all annotators are equally reliable—e.g., different annotators could be specialized in recognizing different classes. In addition, not all annotators label all the data samples, since data samples are often dispatched to different groups of annotators. Under such circumstances, majority voting is not very promising.
A more sophisticated way is to treat the crowdsourcing problem as a model identification problem. Arguably the most popular generative model in crowdsourcing is the Dawid-Skene model Dawid & Skene (1979), where every annotator is assigned a 'confusion matrix' that specifies the probability of the annotator giving class label $k$ when the ground-truth label is $k'$. If such confusion matrices and the probability mass function (PMF) of the ground-truth label can be identified, then a maximum likelihood (ML) or a maximum a posteriori (MAP) estimator for the true label of any given sample can be constructed. The Dawid-Skene model is quite simple and succinct, and some of its assumptions (e.g., the conditional independence of the annotator responses) are actually debatable. Nonetheless, this model has proven very useful in practice Raykar et al. (2010); Traganitis et al. (2018); Ghosh et al. (2011); Karger et al. (2014); Liu et al. (2012); Zhang et al. (2014).

Theoretical aspects of the Dawid-Skene model, however, are less well understood. In particular, it had been unclear if the model could be identified via the accompanying expectation maximization (EM) algorithm proposed in the same paper Dawid & Skene (1979), until some recent works addressed certain special cases Karger et al. (2014). The works in Traganitis et al. (2018); Zhang et al. (2014) put forth tensor methods for learning the Dawid-Skene model. These methods admit model identifiability, and can also be used to provably and effectively initialize the classic EM algorithm Zhang et al. (2014). The challenge is that tensor methods utilize third-order statistics of the data samples, which are rather hard to estimate reliably given limited data Huang et al. (2018).
Contributions. In this work, we propose an alternative for identifying the Dawid-Skene model without using third-order statistics. Our approach is based on the pairwise co-occurrences of annotators' responses to data samples—which are second-order statistics and thus naturally much easier to estimate than third-order ones. We show that, by judiciously combining the co-occurrences between different annotator pairs, the confusion matrices and the ground-truth label's prior PMF can be provably identified under realistic conditions (e.g., when there exists a relatively well-trained annotator among all annotators). This is reminiscent of nonnegative matrix theory and convex geometry Fu et al. (2018b); Gillis (2014). Our approach is also naturally robust to spammers, as well as to scenarios where every annotator labels only part of the data. We offer two algorithms under the same framework. The first algorithm is algebraic, and thus is efficient and suitable for handling very large-scale crowdsourcing problems. The second algorithm offers enhanced identifiability guarantees and is able to deal with more critical cases (e.g., when no highly reliable annotators exist), at the price of a computationally more involved iterative optimization. Experiments show that both approaches outperform a number of competitive baselines.
2 Background
The Dawid-Skene Model. Let us consider a dataset $\{f_n\}_{n=1}^{N}$, where $f_n$ is a data sample (or feature vector) and $N$ is the number of samples. Each $f_n$ belongs to one of $K$ classes. Let $y(f_n) \in \{1,\ldots,K\}$ be the ground-truth label of the $n$th data sample. Suppose that there are $M$ annotators who work on the dataset and provide labels. Let $X_m(f_n)$ represent the response of the $m$th annotator to $f_n$. Hence, $X_m(f_n)$ can be understood as a discrete random variable whose alphabet is $\{1,\ldots,K\}$. In crowdsourcing or ensemble learning, our goal is to estimate the true label corresponding to each item from the annotator responses. Note that in a realistic scenario, an annotator will likely only work on part of the dataset, since having all annotators work on all the samples is much more costly.

In 1979, Dawid and Skene proposed an intuitively pleasing model for estimating the 'true response' of patients from recorded answers Dawid & Skene (1979), which is essentially a crowdsourcing/ensemble learning problem. This model has sparked a lot of interest in the machine learning community Raykar et al. (2010); Traganitis et al. (2018); Ghosh et al. (2011); Karger et al. (2014); Liu et al. (2012); Zhang et al. (2014). The Dawid-Skene model is in essence a naive Bayes model Robert (2014). In this model, the ground-truth label of a data sample is a latent discrete random variable $y$ whose values are the class indices. The ambient variables are the responses given by the different annotators, denoted $X_1,\ldots,X_M$, where $M$ is the number of annotators. The key assumption in the Dawid-Skene model is that, given the ground-truth label, the responses of the annotators are conditionally independent. Of course, the Dawid-Skene model is a simplified version of reality, but it has proven very useful—and it has been a workhorse for crowdsourcing since its proposal.
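To make the generative assumptions concrete, the following is a minimal simulation sketch of the Dawid-Skene model; all sizes and the Dirichlet parameters are hypothetical choices of ours, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, N = 3, 5, 1000          # classes, annotators, samples (hypothetical)

# Prior PMF of the ground-truth label, and one K x K confusion matrix per
# annotator: A[m][k, kp] = Pr(annotator m answers k | true label is kp),
# so every column of A[m] sums to one.
d = rng.dirichlet(np.ones(K))
A = [rng.dirichlet(0.3 * np.ones(K), size=K).T for _ in range(M)]

y = rng.choice(K, size=N, p=d)                       # latent true labels
# Key assumption: given y, annotator responses are drawn independently.
X = np.array([[rng.choice(K, p=A[m][:, y[n]]) for n in range(N)]
              for m in range(M)])                    # X[m, n] = response
```

The conditional-independence assumption is what makes the joint statistics of the responses factor into the confusion matrices and the prior, which the rest of the section exploits.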
Under the Dawid-Skene model, one can see that

$\Pr(X_1 = k_1, \ldots, X_M = k_M) = \sum_{k=1}^{K} \Pr(y = k) \prod_{m=1}^{M} \Pr(X_m = k_m \mid y = k),$  (1)

where $k$ denotes the index of a given class, and $X_m$ denotes the response of the $m$th annotator. If one defines a series of matrices $A_m \in \mathbb{R}^{K \times K}$, $m = 1, \ldots, M$, and lets

$[A_m]_{k,k'} = \Pr(X_m = k \mid y = k'),$  (2)

then $A_m$ can be understood as the 'confusion matrix' of annotator $m$: it contains all the conditional probabilities of annotator $m$ labeling a given data sample as being from class $k$ while the ground-truth label is $k'$. Also define a vector $d \in \mathbb{R}^{K}$ such that $d_k = \Pr(y = k)$, i.e., $d$ is the prior PMF of the ground-truth label $y$. Then the crowdsourcing problem boils down to estimating $A_m$ for $m = 1, \ldots, M$ and $d$.
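Once the $A_m$'s and $d$ are identified, the MAP label estimate mentioned above can be sketched in a few lines; the dict-based interface for possibly-missing responses is a hypothetical choice of ours:

```python
import numpy as np

def map_label(responses, A, d):
    """MAP estimate of the true label under the Dawid-Skene model.

    responses : dict mapping annotator index m -> observed class index
                (annotators who skipped the sample are simply absent);
    A         : list of K x K confusion matrices, A[m][k, kp];
    d         : length-K prior PMF of the ground-truth label.
    """
    # log-posterior up to a constant: log d[kp] + sum_m log A[m][x_m, kp],
    # using the conditional independence of responses given the label
    logpost = np.log(d)
    for m, x_m in responses.items():
        logpost = logpost + np.log(A[m][x_m, :])
    return int(np.argmax(logpost))
```

With a near-diagonal confusion matrix, a single response essentially pins down the label; with noisier annotators, the prior $d$ carries more weight.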
Prior Art. In the seminal paper Dawid & Skene (1979), Dawid and Skene proposed an EM-based algorithm to estimate the $A_m$'s and $d$. Their formulation is well motivated from an ML viewpoint, but also poses some challenges. First, it is unknown if the model is identifiable, especially when there is a large number of unrecorded responses (i.e., missing values)—but model identification plays an essential role in such estimation problems Fu et al. (2018b). Second, since the ML estimator is a nonconvex optimization criterion, the solution quality of the EM algorithm is not easy to characterize in general. More recently, tensor methods were proposed to identify the Dawid-Skene model Zhang et al. (2014); Traganitis et al. (2018). Take the most recent work in Traganitis et al. (2018) as an example. The approach considers estimating the joint probabilities $\Pr(X_m, X_n, X_l)$ for different triples $(m, n, l)$. Such joint PMFs can be regarded as third-order tensors, and the confusion matrices and the prior are latent factors of these tensors. The upshot is that identifiability of the $A_m$'s and $d$ can be elegantly established leveraging tensor algebra Sidiropoulos et al. (2017); Kolda & Bader (2009). The challenge, however, is that reliably estimating $\Pr(X_m, X_n, X_l)$ is quite hard, since it normally requires a large number of annotator responses. Another tensor method in Zhang et al. (2014) judiciously partitions the data and works with group statistics between three groups, which is reminiscent of the graph statistics proposed in Anandkumar et al. (2014). The method is computationally more tractable, leveraging orthogonal tensor decomposition. Nevertheless, the challenge again lies in sample complexity: the group/graph statistics are still third-order statistics.
3 Proposed Approach
In this section, we propose a model identification approach that only uses second-order statistics, in particular, the pairwise co-occurrences $\Pr(X_m = k, X_n = k')$.
Problem Formulation. Let us consider the following pairwise joint PMF:

$\Pr(X_m = k, X_n = k') = \sum_{k''=1}^{K} \Pr(y = k'')\,\Pr(X_m = k \mid y = k'')\,\Pr(X_n = k' \mid y = k'').$

Letting $[R_{m,n}]_{k,k'} \triangleq \Pr(X_m = k, X_n = k')$ and using the matrix notations that we defined, we have $[R_{m,n}]_{k,k'} = \sum_{k''=1}^{K} [A_m]_{k,k''}\, d_{k''}\, [A_n]_{k',k''}$—or, in a more compact form:

$R_{m,n} = A_m D A_n^\top,$

where we have $D = \mathrm{Diag}(d)$, which is a diagonal matrix. Note that $A_m$ is a confusion matrix, i.e., its columns are probability measures. In addition, $d$ is a prior PMF. Hence, we have

$R_{m,n} = A_m D A_n^\top,\quad \mathbf{1}^\top A_m = \mathbf{1}^\top,\ A_m \geq 0,\quad \mathbf{1}^\top d = 1,\ d \geq 0.$  (3)
In practice, the $R_{m,n}$'s are not available but can be estimated via sample averaging. Specifically, given the annotator responses, we compute

$[\widehat R_{m,n}]_{k,k'} = \frac{1}{|S_{m,n}|} \sum_{f \in S_{m,n}} \mathbb{1}\!\left[X_m(f) = k\right]\mathbb{1}\!\left[X_n(f) = k'\right],$

where $S_{m,n}$ is the index set of samples on which both annotators $m$ and $n$ have worked. Here, $\mathbb{1}[\cdot]$ is an indicator function: if the event inside the brackets happens, it evaluates to one, and otherwise to zero. It is readily seen that

$\mathbb{E}\big[\widehat R_{m,n}\big] = R_{m,n},$  (4)

where the expectation is taken over data samples. Note that the sample complexity for reliably estimating $\Pr(X_m, X_n)$ is much lower than that of estimating $\Pr(X_m, X_n, X_l)$ Zhang et al. (2014); Anandkumar et al. (2014), and the latter is needed in tensor-based methods, e.g., Traganitis et al. (2018). To be specific, achieving a given estimation accuracy for $\widehat R_{m,n}$ with high probability requires a certain number of joint responses from annotators $m$ and $n$; attaining the same accuracy for the third-order statistics requires a number of joint responses from annotators $m$, $n$, and $l$ that is larger by a factor depending on $K$, the number of classes (also see the supplementary materials, Sec. J, for a short discussion).
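The sample-averaging estimator above takes only a few lines; the use of -1 to mark samples an annotator did not label is a convention assumed here, not the paper's notation:

```python
import numpy as np

def cooccurrence(Xm, Xn, K):
    """Empirical second-order statistic R_hat_{m,n} for annotators m, n.

    Xm, Xn : length-N integer arrays of responses in {0, ..., K-1},
             with -1 marking samples the annotator did not label.
    Returns the K x K matrix whose (k, kp) entry estimates
    Pr(X_m = k, X_n = kp), averaging over samples labeled by both.
    """
    both = (Xm >= 0) & (Xn >= 0)            # the index set S_{m,n}
    R = np.zeros((K, K))
    for a, b in zip(Xm[both], Xn[both]):
        R[a, b] += 1.0
    return R / max(both.sum(), 1)
```

By Eq. (4), averaging over more co-labeled samples drives this estimate toward $A_m D A_n^\top$.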
An Algebraic Algorithm. Assume that we have obtained $\widehat R_{m,n}$'s for a number of annotator pairs $(m, n)$. We now show how to identify the $A_m$'s and $d$ from such second-order statistics. Let us take the estimation of $A_m$ as an illustrative example. First, we construct a matrix $Z_m$ as follows:

$Z_m \triangleq \big[R_{n_1,m}^\top,\ \ldots,\ R_{n_P,m}^\top\big]^\top,$  (5)

where $n_1, \ldots, n_P$ (with $n_p \neq m$) denote the indices of annotators who have co-labeled data samples with annotator $m$, and the integer $P$ denotes the number of such annotators. Due to the underlying model of $R_{n_p,m}$ in (3), we have

$Z_m = \big[(A_{n_1} D)^\top,\ \ldots,\ (A_{n_P} D)^\top\big]^\top A_m^\top.$

Let us define $H \triangleq [(A_{n_1} D)^\top, \ldots, (A_{n_P} D)^\top]^\top$. This leads to the model $Z_m = H A_m^\top$. We propose to identify $A_m$ from $Z_m$. The key enabling postulate is that, among all annotators, some $A_{n_p}$'s should be diagonally dominant—if there exist annotators who are reasonably trained. In other words, for a reasonable annotator $n_p$, $\Pr(X_{n_p} = k \mid y = k)$ should be greater than $\Pr(X_{n_p} = k' \mid y = k)$ and $\Pr(X_{n_p} = k \mid y = k')$ for $k' \neq k$. To see the intuition of the algorithm, consider an ideal case where for each class $k$ there exists an annotator $n_{p_k}$ such that

$\Pr(X_{n_{p_k}} = k \mid y = k') = 0,\quad \forall k' \neq k.$  (6)

This physically means that annotator $n_{p_k}$ is very good at recognizing class $k$ and never confuses other classes with class $k$. Under such circumstances, one can use the following procedure to identify $A_m$. First, let us normalize the rows of $Z_m$ via $Z_m(q,:) \leftarrow Z_m(q,:)/\|Z_m(q,:)\|_1$ for $q = 1, \ldots, PK$. This way, we have a normalized model $\tilde Z_m = \tilde H A_m^\top$, where

$\tilde H = \mathrm{Diag}(Z_m \mathbf{1})^{-1} H = \mathrm{Diag}(H \mathbf{1})^{-1} H,$  (7)

where the second equality above is because $A_m^\top \mathbf{1} = \mathbf{1}$ [cf. Eq. (3)]. After normalization, it can be verified that

$\tilde Z_m \mathbf{1} = \mathbf{1},\quad \tilde Z_m \geq 0,$  (8)

i.e., all the rows of $\tilde Z_m$ reside in the probability simplex. In addition, by the assumption in (6), it is readily seen that there exists $\Lambda = \{q_1, \ldots, q_K\}$ with $|\Lambda| = K$ such that

$\tilde H(\Lambda, :) = I_K,$  (9)

i.e., an identity matrix is a submatrix of $\tilde H$ (after proper row permutations). Consequently, we have $\tilde Z_m(\Lambda, :) = A_m^\top$—i.e., $A_m$ can be identified from $\tilde Z_m$ up to column permutations. The task thus boils down to identifying $\Lambda$. This turns out to be a well-studied task in the context of separable nonnegative matrix factorization Gillis & Vavasis (2014); Gillis (2014); Fu et al. (2018b), and an algebraic algorithm exists:

$\hat q_k = \arg\max_{q}\ \big\|P^{\perp}_{\tilde Z_m(\hat\Lambda_{k-1},:)}\, \tilde Z_m(q,:)^\top\big\|_2,\qquad \hat\Lambda_k = \hat\Lambda_{k-1} \cup \{\hat q_k\},$  (10)

where $\hat\Lambda_0 = \emptyset$ and $P^{\perp}_{X}$ denotes the orthogonal projector onto the orthogonal complement of the row space of $X$.
It has been shown in Gillis & Vavasis (2014); Arora et al. (2013) that the so-called successive projection algorithm (SPA) in Eq. (10) identifies $\Lambda$ in $K$ steps. This is a very appealing result, since the procedure admits Gram-Schmidt-like lightweight steps and thus is quite scalable. See more details in Sec. F.1. Each of the $A_m$'s can be estimated from the corresponding $\tilde Z_m$ by repeatedly applying SPA, and we call this simple procedure multiple SPA (MultiSPA), as elaborated in Algorithm 1.
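A minimal sketch of the SPA step used inside MultiSPA (greedy row selection with orthogonal projections; the variable names are ours):

```python
import numpy as np

def spa(Ztil, K):
    """Successive projection algorithm: greedily pick K rows of the
    (row-normalized) matrix Ztil that approximate the vertices of the
    simplex enclosing its rows.  Returns the selected index set Lambda;
    Ztil[Lambda, :] then serves as the estimate of A_m^T (up to row
    permutation), per the separable-NMF view described above."""
    R = np.array(Ztil, dtype=float)
    Lam = []
    for _ in range(K):
        q = int(np.argmax(np.linalg.norm(R, axis=1)))   # farthest row
        Lam.append(q)
        u = R[q] / np.linalg.norm(R[q])
        R = R - np.outer(R @ u, u)    # project out the chosen direction
    return Lam
```

On an exactly separable input, i.e., when the unit vectors are present among the rows as in (9), the selected rows are precisely the simplex vertices.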
Of course, assuming that (6) or (9) holds perfectly may be too ideal. It is more likely that there exist annotators who are good at recognizing certain classes but still have some probability of confusion. It is of interest to analyze how SPA performs under such conditions. Another challenge is that the $R_{m,n}$'s may not be perfectly estimated, since only a limited number of samples is available. It is desirable to understand the sample complexity of applying SPA to Dawid-Skene identification. We answer these two key technical questions in the following theorem:
Theorem 1.
Assume that annotators $m$ and $n_p$ co-label at least $S$ samples, and that $Z_m$ is constructed using the $\widehat R_{n_p,m}$'s according to Eq. (5). Also assume that the constructed $H$ satisfies the regularity conditions specified in the supplementary materials (Sec. F.1). Suppose that for every class index $k$ there exists an annotator $n_{p_k}$ such that
(11) 
where $\epsilon \geq 0$ quantifies the deviation from the ideal case in (6). Then, for a sufficiently large number of co-labeled samples, with high probability, the SPA algorithm in (10) can estimate an $\widehat A_m$ such that
(12) 
where $\Pi$ is a permutation matrix, $\sigma_{\max}$ is the largest singular value of $H$, and $\kappa(H)$ is the condition number of $H$.

In the above theorem, the assumption means that the proposed algorithm favors cases where more co-occurrences are observed, since the elements of $Z_m$ are averaged co-occurrence counts—which makes a lot of sense. In addition, Eq. (11) relaxes the ideal assumption in (6), allowing the 'good annotator' to confuse class $k$ with class $k'$ up to a certain probability, thereby being more realistic. The proof of Theorem 1 is reminiscent of the noise-robustness analysis of the SPA algorithm Gillis & Vavasis (2014); Arora et al. (2013); see the supplementary materials (Sec. F.1). A direct corollary is as follows:
Corollary 1.
Theorem 1 and Corollary 1 are not entirely surprising given the extensive research on SPA-like algorithms Arora et al. (2013); Gillis & Vavasis (2014); Fu et al. (2015); Nascimento & Bioucas-Dias (2005); Chan et al. (2011). The implication for crowdsourcing, however, is quite intriguing. First, one can see that if an annotator does not label all the data samples, model identifiability is not necessarily hurt—as long as annotator $m$ has co-labeled some samples with a number of other annotators, identification of $A_m$ is possible. Second, assume that there exists a well-trained annotator whose confusion matrix is diagonally dominant; then, for every annotator $m$ who has co-labeled samples with this annotator, the matrix $\tilde H$ can easily satisfy (11). In practice, one would not know in advance who the well-trained annotator is—otherwise the crowdsourcing problem would be trivial. However, one can design a dispatch strategy such that every pair of annotators co-labels a certain amount of data. This way, the well-trained annotator appears in every other annotator's $Z_m$, which ensures identifiability of all the $A_m$'s. This insight may shed some light on how to effectively dispatch data to annotators.
Another interesting question is whether having more annotators helps. Intuitively, it should: if one has more rows in $\tilde H$, then it is more likely that some rows approach the vertices of the probability simplex—which in turn enables SPA. We use the following simplified generative model and theorem to formalize this intuition:
Theorem 2.
Let $\epsilon > 0$, and assume that the rows of $\tilde H$ are generated uniformly at random within the probability simplex. If the number of annotators is sufficiently large, then, with high probability, there exist $K$ rows of $\tilde H$, indexed by $\Lambda$, each of which is within distance $\epsilon$ of a distinct vertex of the probability simplex.
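The intuition behind Theorem 2, though not its exact bound, can be checked with a quick Monte Carlo sketch of our own (sizes are hypothetical): with rows drawn uniformly from the probability simplex, the distance from each vertex to its nearest row shrinks as more rows (annotators) are added.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 3

def vertex_gap(num_rows):
    """Largest, over the K simplex vertices, of the distance from that
    vertex to its nearest row, with rows drawn uniformly at random from
    the probability simplex (symmetric Dirichlet with unit parameters)."""
    H = rng.dirichlet(np.ones(K), size=num_rows)   # uniform on simplex
    return max(np.min(np.linalg.norm(H - e, axis=1)) for e in np.eye(K))
```

Averaged over a few trials, the gap with thousands of rows is far smaller than with a handful of rows—which is exactly what makes the near-separability condition (and hence SPA) plausible when many annotators are present.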
4 Identifiability-enhanced Algorithm
The MultiSPA algorithm is intuitive and lightweight, and is effective as we will show in the experiments. One concern is that the assumption in (11) may be violated in some cases. In this section, we propose another model identification algorithm that is potentially more robust in such critical scenarios. Specifically, we consider the following feasibility problem:

$\text{find}\quad \{A_m\}_{m=1}^{M},\ d$  (13a)
$\text{s.t.}\quad R_{m,n} = A_m D A_n^\top,\ \ D = \mathrm{Diag}(d),\quad \forall m \neq n,$  (13b)
$\qquad\ \ \mathbf{1}^\top A_m = \mathbf{1}^\top,\ A_m \geq 0,\quad \mathbf{1}^\top d = 1,\ d \geq 0.$  (13c)

The criterion in (13) seeks confusion matrices and a prior PMF that fit the available second-order statistics. The constraints in (13c) reflect the fact that the columns of the $A_m$'s are conditional PMFs and that $d$ itself is a PMF.
To proceed, let us first introduce the following notion from convex geometry Fu et al. (2018b); Lin et al. (2015):
Definition 1.
(Sufficiently Scattered) A nonnegative matrix $H \in \mathbb{R}_{+}^{N \times K}$ is sufficiently scattered if 1) $\mathcal{C} \subseteq \mathrm{cone}(H^\top)$, and 2) $\mathrm{cone}(H^\top)^{*} \cap \mathrm{bd}(\mathcal{C}^{*}) = \{\lambda e_k \mid \lambda \geq 0,\ k = 1, \ldots, K\}$. Here, $\mathcal{C} = \{x \in \mathbb{R}^{K} \mid \mathbf{1}^\top x \geq \sqrt{K-1}\,\|x\|_2\}$ and $\mathcal{C}^{*} = \{x \in \mathbb{R}^{K} \mid \mathbf{1}^\top x \geq \|x\|_2\}$. In addition, $\mathrm{cone}(H^\top)$ and $\mathrm{cone}(H^\top)^{*} = \{y \mid H y \geq 0\}$ are the conic hull of the rows of $H$ and its dual cone, respectively, and $\mathrm{bd}$ denotes the boundary of a closed set.
The sufficiently scattered condition has recently emerged in convex geometry-based matrix factorization Lin et al. (2015); Fu et al. (2018a). This condition models how the rows of $H$ are spread in the nonnegative orthant. In principle, the sufficiently scattered condition is much easier to satisfy than the condition in (9), i.e., the so-called separability condition in the context of nonnegative matrix factorization Donoho & Stodden (2003); Gillis & Vavasis (2014). An $H$ satisfying the separability condition is the extreme case, meaning that $\mathrm{cone}(H^\top) = \mathbb{R}_{+}^{K}$. However, the sufficiently scattered condition only requires $\mathcal{C} \subseteq \mathrm{cone}(H^\top)$—which is naturally much more relaxed; also see Fu et al. (2018b) and the supplementary materials for detailed illustrations (Sec. E).
Regarding the identifiability of the $A_m$'s and $d$, we have the following result:
Theorem 3.
Assume that $\mathrm{rank}(A_m) = K$ for all $m$, and that there exist two disjoint subsets of the annotators, indexed by $\mathcal{P}_1$ and $\mathcal{P}_2$. Suppose that from $\mathcal{P}_1$ and $\mathcal{P}_2$ two matrices $H_1$ and $H_2$ can be constructed by stacking the blocks $A_m D$ for $m \in \mathcal{P}_1$ and $m \in \mathcal{P}_2$, respectively. Furthermore, assume that i) both $H_1$ and $H_2$ are sufficiently scattered; ii) all the $R_{m,n}$'s for $m \in \mathcal{P}_1$ and $n \in \mathcal{P}_2$ are available; and iii) for every annotator $l$ there exists an available $R_{m,l}$ or $R_{l,n}$, where $m \in \mathcal{P}_1$ and $n \in \mathcal{P}_2$. Then, solving Problem (13) recovers $A_m$ for $m = 1, \ldots, M$ and $d$ up to a common column permutation.
The proof of Theorem 3 is relegated to the supplementary materials (Sec. H). Note that the theorem holds under the existence of $\mathcal{P}_1$ and $\mathcal{P}_2$, but there is no need to know these sets a priori. Generally speaking, a 'taller' matrix has a better chance of having its rows sufficiently spread in the nonnegative orthant, by the same intuition as in Theorem 2. Thus, having more annotators also helps to attain the sufficiently scattered condition. Nevertheless, formally relating the number of annotators to $H_1$ and $H_2$ being sufficiently scattered is more challenging than the case in Theorem 2, since the sufficiently scattered condition is more abstract than the separability condition—the latter specifically assumes that unit vectors exist among the rows of $\tilde H$, while the former depends on the 'shape' of the conic hull of the rows, which admits an infinite number of cases. To this end, let us first define the following notion:
Definition 2.
Assume that there exists $H^{\star}$ such that $H^{\star}$ is sufficiently scattered. Also assume that $\Lambda$ is the row index set of $H^{\star}$ such that $H^{\star}(\Lambda, :)$ collects the extreme rays of $\mathrm{cone}((H^{\star})^\top)$. If, for every $j \in \Lambda$, there exists a row index $q_j$ of $H$ such that $\|H(q_j, :) - H^{\star}(j, :)\|_2 \leq \epsilon$, then $H$ is called $\epsilon$-sufficiently scattered.
One can see that an $\epsilon$-sufficiently scattered matrix is sufficiently scattered when $\epsilon = 0$. With this definition, we show the following theorem:
Theorem 4.
Let $\epsilon > 0$, and assume that the rows of $H_1$ and $H_2$ are generated from the probability simplex uniformly at random. If the number of annotators is sufficiently large, then, with high probability, $H_1$ and $H_2$ are $\epsilon$-sufficiently scattered.
The proof of Theorem 4 is relegated to the supplementary materials (Sec. I). One can see that the number of annotators needed to satisfy the $\epsilon$-sufficiently scattered condition is smaller than that in Theorem 2. Conditions i)-iii) in Theorem 3, together with Theorem 4, imply that if we have enough annotators, and if many annotator pairs co-label a certain amount of data, then it is quite possible to identify the Dawid-Skene model by simply finding a feasible solution to (13). This feasibility problem is nonconvex, but can be effectively approximated; see the supplementary materials (Sec. C). In a nutshell, we reformulate the problem as a Kullback-Leibler (KL) divergence-based constrained fitting problem and handle it using alternating optimization. Since nonconvex optimization relies heavily on initialization, we use MultiSPA to initialize the fitting stage—we refer to the resulting procedure as the MultiSPAKL algorithm.
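As a rough illustration of the alternating idea only—using a plain least-squares fit and a crude projection in place of the paper's KL-based criterion, with an interface of our own—one pass over the confusion matrices can look like this:

```python
import numpy as np

def update_confusions(R, A, d):
    """One alternating pass: with d and the other factors fixed, each
    A_m solves the stacked linear system R_{m,n} ~= A_m (D A_n^T) in the
    least-squares sense, then is pushed back to column-stochastic form.
    R : dict {(m, n): K x K co-occurrence matrix, m != n};
    A : list of current K x K confusion-matrix estimates (updated);
    d : length-K prior PMF estimate (held fixed in this sketch)."""
    M, K = len(A), len(d)
    D = np.diag(d)
    for m in range(M):
        Rs = np.hstack([R[(m, n)] for n in range(M) if n != m])
        Xs = np.hstack([D @ A[n].T for n in range(M) if n != m])
        Am = Rs @ np.linalg.pinv(Xs)          # least-squares solve
        Am = np.clip(Am, 0.0, None)           # crude feasibility step
        A[m] = Am / Am.sum(axis=0, keepdims=True)
    return A
```

On noiseless second-order statistics, the ground-truth factors are a fixed point of this pass; the actual MultiSPAKL algorithm instead minimizes a KL-divergence criterion under the constraints of (13), with a MultiSPA initialization.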
5 Experiments
Baselines. The performance of the proposed approach is compared with a number of competitive baselines, namely, SpectralD&S Zhang et al. (2014), TensorADMM Traganitis et al. (2018), KOS Karger et al. (2013), EigenRatio Dalvi et al. (2013), GhoshSVD Ghosh et al. (2011), and MinmaxEntropy Zhou et al. (2014). The performance of the Majority Voting scheme and of the Majority Voting-initialized Dawid-Skene (MVD&S) estimator Dawid & Skene (1979) is also presented. We also use MultiSPA to initialize the EM algorithm (referred to as MultiSPAD&S). Note that KOS, EigenRatio, and MinmaxEntropy work with more complex models than the Dawid-Skene model, but are considered good baselines for crowdsourcing/ensemble learning tasks. After identifying the model parameters, we construct a MAP predictor following Traganitis et al. (2018) and report its results. The algorithms are coded in MATLAB.
Synthetic-data Simulations. Due to page limitations, synthetic-data experiments demonstrating the model identifiability of the proposed algorithms are presented in the supplementary materials (Sec. A).
Integrating Machine Classifiers. We employ different UCI datasets (https://archive.ics.uci.edu/ml/datasets.html; details in Sec. B). For each of the datasets under test, we use a collection of different classification algorithms to annotate the data samples. Different classifiers from the MATLAB machine learning toolbox (https://www.mathworks.com/products/statistics.html), such as various nearest-neighbour classifiers, support vector machine classifiers, and decision tree classifiers, are employed to serve as our machine annotators. To train the annotators, we use a portion of the samples as training data. After the annotators are trained, we use them to label the unseen data samples. In practice, not all samples are labeled by every annotator due to factors such as annotator capacity, difficulty of the task, economic issues, and so on. To simulate such a scenario, each of the trained algorithms is allowed to label a data sample with probability $p$. We test the performance of all the algorithms under different $p$'s—a smaller $p$ means a more challenging scenario. All results are averaged over 10 random trials.

Table 1 shows the classification error of the algorithms under test. Since GhoshSVD and EigenRatio work only on binary tasks, they are not evaluated on the Nursery dataset, which has more than two classes. The 'single best' and 'single worst' rows correspond to the results of using the classifiers individually, as references. The best and second-best performing algorithms are highlighted in the table. One can see that the proposed methods are quite promising in this experiment. Both algorithms largely outperform the tensor-based methods TensorADMM and SpectralD&S in this case, perhaps because the limited number of available samples makes the third-order statistics hard to estimate. It is also observed that the proposed algorithms enjoy favorable runtime; see the supplementary materials (cf. Table 8 in Sec. B). Using MultiSPA to initialize EM (i.e., MultiSPAD&S) also works well, offering another viable option that strikes a good balance between runtime and accuracy.
Table 1: Classification error (%) on the UCI datasets; the three columns under each dataset correspond to different labeling probabilities $p$.

Algorithms  Nursery  Mushroom  Adult
MultiSPA  2.83  4.54  17.96  0.02  0.293  6.35  15.71  16.05  17.66 
MultiSPAKL  2.72  4.26  13.06  0.00  0.152  5.89  15.66  15.98  17.63 
MultiSPAD&S  2.82  4.44  13.39  0.00  0.194  6.17  15.74  16.29  23.88 
SpectralD&S  3.14  37.2  44.29  0.00  0.198  6.17  15.72  16.31  23.97 
TensorADMM  17.97  7.26  19.78  0.06  0.237  6.18  15.72  16.05  25.08 
MVD&S  2.92  66.48  66.61  0.00  47.99  48.63  15.76  75.21  75.13 
Minmaxentropy  3.63  26.31  11.09  0.00  0.163  8.14  16.11  16.92  15.64 
EigenRatio  N/A  N/A  N/A  0.06  0.329  5.97  15.84  16.28  17.69 
KOS  4.21  6.07  13.48  0.06  0.576  6.42  17.19  24.97  38.29 
GhoshSVD  N/A  N/A  N/A  0.06  0.329  5.97  15.84  16.28  17.71 
Majority Voting  2.94  4.83  19.75  0.14  0.566  6.57  15.75  16.21  20.57 
Single Best  3.94  N/A  N/A  0.00  N/A  N/A  16.23  N/A  N/A 
Single Worst  15.65  N/A  N/A  7.22  N/A  N/A  19.27  N/A  N/A 
Amazon Mechanical Turk Crowdsourcing Data. In this section, the performance of the proposed algorithms is evaluated using Amazon Mechanical Turk (AMT) data (https://www.mturk.com), in which human annotators label various classification tasks. The data description is given in the supplementary materials (Sec. B). Table 2 shows the classification error and the runtime of the algorithms under test. One can see that MultiSPA has a very favorable execution time, because it is a Gram-Schmidt-like algorithm. MultiSPAKL uses more time, because it is an iterative optimization method—with better accuracy as the payoff. Since the TensorADMM algorithm does not scale well, its results are not reported for the very large datasets (i.e., TREC and RTE). As before, since Web and Dog are multiclass datasets, EigenRatio and GhoshSVD are not applicable. From the results, it can be seen that the proposed algorithms outperform many existing crowdsourcing algorithms in both classification accuracy and runtime. In particular, the algebraic algorithm MultiSPA gives results very similar to those of the computationally much more involved algorithms. This shows its potential for big data crowdsourcing.
Table 2: Classification error (%) and runtime (seconds) on the AMT datasets.

Algorithms  TREC  Bluebird  RTE  Web  Dog
  Error (%)  Time (s)  Error (%)  Time (s)  Error (%)  Time (s)  Error (%)  Time (s)  Error (%)  Time (s)
MultiSPA  31.47  50.68  13.88  0.07  8.75  0.28  15.22  0.54  17.09  0.07 
MultiSPAKL  29.23  536.89  11.11  1.94  7.12  17.06  14.58  12.34  15.48  15.88 
MultiSPAD&S  29.84  53.14  12.03  0.09  7.12  0.32  15.11  0.84  16.11  0.12 
SpectralD&S  29.58  919.98  12.03  1.97  7.12  6.40  16.88  179.92  17.84  51.16 
TensorADMM  N/A  N/A  12.03  2.74  N/A  N/A  N/A  N/A  17.96  603.93 
MVD&S  30.02  3.20  12.03  0.02  7.25  0.07  16.02  0.28  15.86  0.04 
Minmaxentropy  91.61  352.36  8.33  3.43  7.50  9.10  11.51  26.61  16.23  7.22 
EigenRatio  43.95  1.48  27.77  0.02  9.01  0.03  N/A  N/A  N/A  N/A 
KOS  51.95  9.98  11.11  0.01  39.75  0.03  42.93  0.31  31.84  0.13 
GhoshSVD  43.03  11.62  27.77  0.01  49.12  0.03  N/A  N/A  N/A  N/A 
Majority Voting  34.85  N/A  21.29  N/A  10.31  N/A  26.93  N/A  17.91  N/A 
6 Conclusion
In this work, we have revisited the classic Dawid-Skene model for multi-class crowdsourcing. We have proposed a second-order statistics-based approach that guarantees identifiability of the model parameters, i.e., the confusion matrices of the annotators and the label prior. The proposed method naturally admits lower sample complexity relative to existing methods that utilize tensor algebra to ensure model identifiability. The proposed approach also has an array of favorable features. In particular, our framework enables a lightweight algebraic algorithm, reminiscent of the Gram-Schmidt-like SPA algorithm for nonnegative matrix factorization. We have also proposed a coupled and constrained matrix factorization criterion that enjoys enhanced identifiability, as well as an alternating optimization algorithm for handling the identification problem. Real-data experiments show that our proposed algorithms are quite promising for integrating crowdsourced labeling.
References
 Anandkumar et al. (2014) Anandkumar, A., Ge, R., Hsu, D., and Kakade, S. M. A tensor approach to learning mixed membership community models. The Journal of Machine Learning Research, 15(1):2239–2312, 2014.
 Arora et al. (2013) Arora, S., Ge, R., Halpern, Y., Mimno, D., Moitra, A., Sontag, D., Wu, Y., and Zhu, M. A practical algorithm for topic modeling with provable guarantees. In Proceedings of ICML, 2013.
 Baraniuk et al. (2008) Baraniuk, R., Davenport, M., DeVore, R., and Wakin, M. A simple proof of the restricted isometry property for random matrices. Constructive Approximation, 28(3):253–263, 2008.
 Bertsekas (1999) Bertsekas, D. P. Nonlinear programming. Athena Scientific, 1999.
 Chan et al. (2011) Chan, T.-H., Ma, W.-K., Ambikapathi, A., and Chi, C.-Y. A simplex volume maximization framework for hyperspectral endmember extraction. IEEE Trans. Geosci. Remote Sens., 49(11):4177–4193, Nov. 2011.
 Dalvi et al. (2013) Dalvi, N., Dasgupta, A., Kumar, R., and Rastogi, V. Aggregating crowdsourced binary ratings. In Proceedings of the 22nd International Conference on World Wide Web, pp. 285–294, New York, NY, USA, 2013. ACM.
 Dawid & Skene (1979) Dawid, A. P. and Skene, A. M. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, pp. 20–28, 1979.

 Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, June 2009.
 Dietterich (2000) Dietterich, T. G. Ensemble methods in machine learning. In International workshop on multiple classifier systems, pp. 1–15. Springer, 2000.
 Donoho & Stodden (2003) Donoho, D. and Stodden, V. When does nonnegative matrix factorization give a correct decomposition into parts? In Advances in neural information processing systems, volume 16, 2003.
 Fu et al. (2015) Fu, X., Ma, W.-K., Chan, T.-H., and Bioucas-Dias, J. M. Self-dictionary sparse regression for hyperspectral unmixing: Greedy pursuit and pure pixel search are related. IEEE J. Sel. Topics Signal Process., 9(6):1128–1141, 2015.
 Fu et al. (2016) Fu, X., Huang, K., Yang, B., Ma, W.-K., and Sidiropoulos, N. D. Robust volume minimization-based matrix factorization for remote sensing and document clustering. IEEE Trans. Signal Process., 64(23):6254–6268, 2016.
 Fu et al. (2018a) Fu, X., Huang, K., and Sidiropoulos, N. D. On identifiability of nonnegative matrix factorization. IEEE Signal Process. Lett., 25(3):328–332, 2018a.
 Fu et al. (2018b) Fu, X., Huang, K., Sidiropoulos, N. D., and Ma, W.-K. Nonnegative matrix factorization for signal and data analytics: Identifiability, algorithms, and applications. arXiv preprint arXiv:1803.01257, 2018b.
 Ghosh et al. (2011) Ghosh, A., Kale, S., and McAfee, P. Who moderates the moderators? Crowdsourcing abuse detection in user-generated content. In Proceedings of the 12th ACM conference on Electronic commerce, pp. 167–176. ACM, 2011.
 Gillis (2014) Gillis, N. The why and how of nonnegative matrix factorization. Regularization, Optimization, Kernels, and Support Vector Machines, 12:257, 2014.
 Gillis & Vavasis (2014) Gillis, N. and Vavasis, S. Fast and robust recursive algorithms for separable nonnegative matrix factorization. IEEE Trans. Pattern Anal. Mach. Intell., 36(4):698–714, April 2014.
 Huang et al. (2014) Huang, K., Sidiropoulos, N., and Swami, A. Nonnegative matrix factorization revisited: Uniqueness and algorithm for symmetric decomposition. IEEE Trans. Signal Process., 62(1):211–224, 2014.
 Huang et al. (2016) Huang, K., Sidiropoulos, N. D., and Liavas, A. P. A flexible and efficient algorithmic framework for constrained matrix and tensor factorization. IEEE Trans. Signal Process., 64(19):5052–5065, 2016.

 Huang et al. (2018) Huang, K., Fu, X., and Sidiropoulos, N. D. Learning hidden Markov models from pairwise co-occurrences with applications to topic modeling. In Proceedings of ICML 2018, 2018.
 Jonker & Volgenant (1986) Jonker, R. and Volgenant, T. Improving the Hungarian assignment algorithm. Operations Research Letters, 5(4):171–175, 1986.
Karger et al. (2011) Karger, D. R., Oh, S., and Shah, D. Budget-optimal crowdsourcing using low-rank matrix approximations. 2011.
 Karger et al. (2013) Karger, D. R., Oh, S., and Shah, D. Efficient crowdsourcing for multiclass labeling. ACM SIGMETRICS Performance Evaluation Review, 41(1):81–92, 2013.
Karger et al. (2014) Karger, D. R., Oh, S., and Shah, D. Budget-optimal task allocation for reliable crowdsourcing systems. Operations Research, 62(1):1–24, 2014.
 Kittur et al. (2008) Kittur, A., Chi, E. H., and Suh, B. Crowdsourcing user studies with mechanical turk. In Proceedings of the SIGCHI conference on human factors in computing systems, pp. 453–456. ACM, 2008.
 Kolda & Bader (2009) Kolda, T. G. and Bader, B. W. Tensor decompositions and applications. SIAM review, 51(3):455–500, 2009.
Lease & Kazai (2011) Lease, M. and Kazai, G. Overview of the TREC 2011 crowdsourcing track. 2011.
Lin et al. (2015) Lin, C.-H., Ma, W.-K., Li, W.-C., Chi, C.-Y., and Ambikapathi, A. Identifiability of the simplex volume minimization criterion for blind hyperspectral unmixing: The no-pure-pixel case. IEEE Trans. Geosci. Remote Sens., 53(10):5530–5546, Oct 2015.
 Liu et al. (2012) Liu, Q., Peng, J., and Ihler, A. T. Variational inference for crowdsourcing. In Advances in neural information processing systems, pp. 692–700, 2012.
Nascimento & Bioucas-Dias (2005) Nascimento, J. and Bioucas-Dias, J. Vertex component analysis: A fast algorithm to unmix hyperspectral data. IEEE Trans. Geosci. Remote Sens., 43(4):898–910, 2005.
 Raykar et al. (2010) Raykar, V. C., Yu, S., Zhao, L. H., Valadez, G. H., Florin, C., Bogoni, L., and Moy, L. Learning from crowds. Journal of Machine Learning Research, 11(Apr):1297–1322, 2010.
Razaviyayn et al. (2013) Razaviyayn, M., Hong, M., and Luo, Z.-Q. A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM Journal on Optimization, 23(2):1126–1153, 2013.
Robert (2014) Robert, C. Machine Learning: A Probabilistic Perspective, 2014.
 Sidiropoulos et al. (2017) Sidiropoulos, N. D., De Lathauwer, L., Fu, X., Huang, K., Papalexakis, E. E., and Faloutsos, C. Tensor decomposition for signal processing and machine learning. IEEE Trans. Signal Process., 65(13):3551–3582, 2017.
Snow et al. (2008) Snow, R., O’Connor, B., Jurafsky, D., and Ng, A. Y. Cheap and fast—but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of the conference on empirical methods in natural language processing, pp. 254–263. Association for Computational Linguistics, 2008.
 Stein (1966) Stein, P. A Note on the Volume of a Simplex. The American Mathematical Monthly, 73(3), 1966. doi: 10.2307/2315353.
Boucheron et al. (2004) Boucheron, S., Lugosi, G., and Bousquet, O. Concentration Inequalities, 2004. URL: http://www.econ.upf.edu/~lugosi/mlss_conc.pdf.
Traganitis et al. (2018) Traganitis, P. A., Pages-Zamora, A., and Giannakis, G. B. Blind multiclass ensemble classification. IEEE Trans. Signal Process., 66(18):4737–4752, 2018.
 Welinder et al. (2010) Welinder, P., Branson, S., Perona, P., and Belongie, S. J. The multidimensional wisdom of crowds. In Advances in neural information processing systems, pp. 2424–2432, 2010.
Zhang et al. (2014) Zhang, Y., Chen, X., Zhou, D., and Jordan, M. I. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. In Advances in neural information processing systems, pp. 1260–1268, 2014.
 Zhou et al. (2014) Zhou, D., Liu, Q., Platt, J., and Meek, C. Aggregating ordinal labels from crowds by minimax conditional entropy. In Proceedings of ICML, volume 32, pp. 262–270, 2014.
Appendix A Synthetic Data Experiments
In the first experiment, we consider M annotators that are available to annotate N items, each belonging to one of K classes. The true label of each item is sampled uniformly from {1, …, K}; i.e., the prior probability vector d is fixed to be the uniform PMF (every entry equal to 1/K). For generating the confusion matrices, two different cases are considered.
Case 2: an annotator is chosen uniformly at random and its confusion matrix is made diagonally dominant, i.e., every diagonal entry is the largest element in its column. To achieve this, the elements of each column of the confusion matrix are drawn from a uniform distribution between 0 and 1. The columns are then normalized by their respective ℓ1 norms. After that, the elements of each column are reorganized such that the corresponding diagonal entry is dominant in that column, and the column is again normalized with respect to the ℓ1 norm. In this way, Eq. (11) in Theorem 1 may be (approximately) satisfied.
In both cases, the confusion matrices of the remaining annotators are randomly generated: the elements are first drawn from the uniform distribution between 0 and 1, and then the columns are normalized with respect to the ℓ1 norm. Once the confusion matrices are generated, the response of each annotator to an item with a given true label is randomly drawn from {1, …, K} using the probability distribution specified by the corresponding column of that annotator's confusion matrix. Each annotator response is retained for the estimation with probability p; in other words, with probability 1 − p, each response is set to zero (i.e., marked as unobserved). In this way, our simulated scenario is expected to mimic realistic situations where we have a combination of reliable and unreliable annotators, each labeling only part of the items. Using the generated responses, we construct the second-order statistics R_{m,l} and then follow the proposed approach to identify the confusion matrices and the prior d. The accuracy of the estimation is measured using the mean squared error (MSE), defined as
MSE = (1/M) Σ_{m=1}^{M} min_{Π_m} ‖ Â_m Π_m − A_m ‖_F²,   (14)

where Â_m is the estimate of A_m and the permutation matrices Π_m are used to fix the intrinsic column permutation ambiguity.
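To make the data generation procedure above concrete, a minimal Python sketch is given below (the function names, the fixed random seed, and the convention of marking a dropped response with 0 are our own illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_confusion(K, diag_dominant=False):
    """Draw a K-by-K column-stochastic confusion matrix; optionally move
    the largest entry of each column onto the diagonal (Case 2 style)."""
    A = rng.uniform(size=(K, K))
    if diag_dominant:
        for k in range(K):
            j = int(np.argmax(A[:, k]))
            A[[k, j], k] = A[[j, k], k]        # swap so the diagonal dominates
    return A / A.sum(axis=0, keepdims=True)    # normalize columns (l1)

def simulate_responses(A_list, d, N, p):
    """Sample annotator responses under the Dawid-Skene model.

    Returns X (M-by-N, labels in {1,...,K}; 0 marks a response dropped
    with probability 1 - p) and the ground-truth labels y (0-based)."""
    K, M = len(d), len(A_list)
    y = rng.choice(K, size=N, p=d)             # true labels from the prior d
    X = np.zeros((M, N), dtype=int)
    for m, A in enumerate(A_list):
        for n in range(N):
            if rng.uniform() < p:              # response is observed
                X[m, n] = rng.choice(K, p=A[:, y[n]]) + 1
    return X, y
```

From X, the pairwise second-order statistics can then be estimated by counting joint responses of annotator pairs.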
The average MSE of the confusion matrix estimates for various problem sizes under the two above-mentioned cases is shown in Table 3 and Table 4, where the proposed methods, MultiSPA and MultiSPA-KL, are compared with the baselines Spectral-D&S, Tensor-ADMM, and MV-D&S, since these methods are also Dawid-Skene model identification approaches. As MV-D&S becomes numerically unstable in the smaller settings, those results are not reported in the tables. All the results are averaged over 10 trials.
From the two tables, one can see that MultiSPA works reasonably well in both cases. As expected, it exhibits lower MSEs in Case 1, since the condition in (6) is perfectly enforced there. Nevertheless, in both cases, using MultiSPA to initialize the KL-based algorithm identifies the confusion matrices to a very high accuracy. It is also observed that MultiSPA-KL outperforms the baselines in terms of estimation accuracy, which may be a result of using second-order statistics.
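The permutation fixing in the MSE metric of Eq. (14) can be carried out with the Hungarian algorithm; the sketch below (our own implementation via scipy.optimize.linear_sum_assignment) assumes a single column permutation shared by all annotators, which is one common convention:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def confusion_mse(A_true, A_est):
    """Average squared error between true and estimated confusion
    matrices, after resolving the column permutation ambiguity."""
    M, K = len(A_true), A_true[0].shape[0]
    # cost[k, j]: squared error of matching true column k to estimated
    # column j, accumulated over all annotators.
    cost = np.zeros((K, K))
    for k in range(K):
        for j in range(K):
            cost[k, j] = sum(np.sum((A_true[m][:, k] - A_est[m][:, j]) ** 2)
                             for m in range(M))
    row, col = linear_sum_assignment(cost)    # optimal column matching
    return cost[row, col].sum() / (M * K)
```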
Table 3: Average MSE of the confusion matrix estimates (Case 1).
Algorithms
MultiSPA  0.0184  0.0083  0.0063  0.0034
MultiSPA-KL  0.0019  0.0009  0.0004  1.73E-04
Spectral-D&S  0.0320  0.0112  0.0448  1.74E-04
Tensor-ADMM  0.0026  0.0011  0.0005  1.88E-04
MV-D&S  –  –  0.0173  1.84E-04
Table 4: Average MSE of the confusion matrix estimates (Case 2).
Algorithms
MultiSPA  0.0229  0.0188  0.0115  0.0102
MultiSPA-KL  0.0029  0.0014  0.0005  1.67E-04
Spectral-D&S  0.0348  0.0265  0.0391  1.67E-04
Tensor-ADMM  0.0031  0.0016  0.0006  1.93E-04
MV-D&S  –  –  0.0028  5.88E-04
Under the same settings as in Case 2, the true labels are estimated using the MAP/ML predictor as in Traganitis et al. (2018) (in this case, ML and MAP coincide since the prior PMF is uniform). The classification error and runtime of the crowdsourcing algorithms are computed and shown in Table 5.
Table 5: Classification error (%) and runtime under Case 2 (uniform prior).
Algorithms  Classification error (%)  Runtime (sec)
MultiSPA  37.24  26.39  19.21  0.049
MultiSPA-KL  31.71  21.10  12.79  18.07
MultiSPA-D&S  31.95  21.11  12.80  0.069
Spectral-D&S  46.37  23.92  12.89  27.17
Tensor-ADMM  32.16  21.34  12.91  56.09
MV-D&S  66.91  57.92  13.09  0.096
Minmax-entropy  62.83  65.50  67.31  200.91
KOS  71.47  61.05  13.12  5.653
Majority Voting  67.57  68.37  71.39  –
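For completeness, the MAP/ML prediction rule used above can be sketched as follows (a vectorized sketch under our own conventions: entry 0 in X means "no response", and rows of each confusion matrix are indexed by the response while columns are indexed by the true class):

```python
import numpy as np

def map_predict(X, A_list, d, eps=1e-12):
    """MAP label prediction under the Dawid-Skene model:
    for item n, pick argmax_k of log d[k] plus the sum, over annotators m
    with an observed response x_mn, of log A_m[x_mn - 1, k].
    With a uniform prior d this coincides with the ML predictor."""
    M, N = X.shape
    score = np.tile(np.log(d + eps), (N, 1))          # N-by-K prior scores
    for m, A in enumerate(A_list):
        obs = X[m] > 0                                # observed responses only
        score[obs] += np.log(A[X[m, obs] - 1, :] + eps)
    return score.argmax(axis=1) + 1                   # labels in {1,...,K}
```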
In the next experiment with Case 2, the true labels are sampled with unequal probabilities. Specifically, the prior d is set to a nonuniform PMF, with all other parameters and conditions the same as in the first experiment. Using the MAP predictor, the true labels are estimated by the proposed algorithms for various problem sizes, and the results are shown in Table 6. It can be inferred from the results that both of the proposed algorithms, MultiSPA and MultiSPA-KL, attain better classification accuracy when the true label distribution of the items is not balanced.
Table 6: Classification error (%) and runtime under Case 2 (nonuniform prior).
Algorithms  Classification error (%)  Runtime (sec)
MultiSPA  30.75  21.29  13.67  0.105
MultiSPA-KL  23.19  16.62  10.13  18.93
MultiSPA-D&S  40.12  32.1  21.46  0.122
Spectral-D&S  56.17  49.41  39.17  28.01
Tensor-ADMM  34.17  25.53  11.97  152.76
MV-D&S  83.14  83.15  32.98  0.090
Minmax-entropy  83.04  63.08  74.29  232.82
KOS  70.79  67.55  78.00  6.19
Majority Voting  65.37  65.57  66.06  –
In the next experiment, the effect of the number of annotators M on the estimation accuracy of the confusion matrices is investigated. According to Theorems 2 and 4, the proposed methods benefit from the availability of more annotators (i.e., a larger M). With the remaining parameters fixed and the true confusion matrices generated as in Case 2, the MSEs under various values of M are plotted in Figure 1. One can see that MultiSPA-KL achieves better accuracy relative to MultiSPA under the same M, which corroborates our results in Theorem 4.
Appendix B More Details on UCI and AMT Dataset Experiments
UCI data. The details of the UCI datasets employed in the real-data experiments are given in Table 7. To be more specific, the task on the Adult dataset is to predict the income bracket of a person (two classes) based on 14 attributes. The Mushroom dataset contains 22 attributes of certain variations of mushrooms, and the task is to predict whether a mushroom is ‘edible’ or ‘poisonous’. The task on the Nursery dataset is to assign applications to one of 4 categories based on 8 attributes describing the financial and social status of the parents.
Table 7: UCI datasets.
UCI dataset name  # classes  # items  # annotators
Adult  2  7017  10
Mushroom  2  6358  10
Nursery  4  3575  10
The proposed methods and the baselines are compared in terms of runtime on the various datasets, and the results are reported in Table 8. All the results are averaged over 10 different trials.
Table 8: Runtime (sec) on the UCI datasets.
Algorithms  Nursery  Mushroom  Adult
MultiSPA  0.021  0.012  0.018
MultiSPA-KL  1.112  0.663  0.948
MultiSPA-D&S  0.035  0.027  0.027
Spectral-D&S  10.09  0.496  0.512
Tensor-ADMM  5.811  0.743  4.234
MV-D&S  0.009  0.007  0.008
Minmax-entropy  19.94  2.304  6.959
EigenRatio  –  0.005  0.007
KOS  0.768  0.085  0.118
Ghosh-SVD  –  0.081  0.115
Table 9: AMT datasets.
Dataset  # classes  # items  # annotators  # annotator labels
Bird  2  108  30  3,240
RTE  2  800  164  8,000
TREC  2  19,033  762  88,385
Dog  4  807  52  7,354
Web  5  2,665  177  15,567
AMT data. The Amazon Mechanical Turk (AMT) datasets used in our crowdsourcing experiments are summarized in Table 9. Specifically, the tasks involving the Bird dataset Welinder et al. (2010), the RTE dataset Snow et al. (2008), and the TREC dataset Lease & Kazai (2011) are binary classification tasks. The tasks associated with the Dog dataset Deng et al. (2009) and the Web dataset Zhou et al. (2014) are multiclass tasks (with 4 and 5 classes, respectively).
We would like to add one remark regarding the two-stage approaches that involve an initial stage and a refinement stage (e.g., Spectral-D&S, MV-D&S, and MultiSPA-KL). Due to the very high sparsity of the annotator responses in most of the AMT data, the estimated confusion matrices from the first stage may contain many zero entries, which can lead to numerical issues in the second stage, as observed in Zhang et al. (2014). In our experiments, we follow the empirical thresholding strategy proposed in Zhang et al. (2014): the confusion matrix entries that are smaller than a given threshold are reset to the threshold value, and the columns are renormalized before initialization. The same threshold is used for most of the cases; the exception is the extremely large dataset TREC, on which all methods enjoy better performance under a different threshold.
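A minimal sketch of this thresholding step might look as follows (the function name and interface are our own illustrative choices; the floor-at-threshold behavior follows the description above):

```python
import numpy as np

def threshold_init(A_est, tau):
    """Floor confusion-matrix entries at tau and renormalize the columns,
    so that the refinement stage never takes the log of a zero probability."""
    A = np.maximum(A_est, tau)                 # entries below tau -> tau
    return A / A.sum(axis=0, keepdims=True)    # restore column-stochasticity
```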
Appendix C Algorithm for Criterion (13)
In this section, the MultiSPA-KL algorithm is discussed in detail. To implement the identification criterion in (13), we lift the constraint (13b) and employ the following coupled matrix factorization criterion:

minimize_{{A_m}, d}  Σ_{m=1}^{M} Σ_{l≠m} KL( R_{m,l} ‖ A_m D A_l^T )   (15a)
subject to  1^T A_m = 1^T,  A_m ≥ 0,  m = 1, …, M;   1^T d = 1,  d ≥ 0,   (15b)

where D = diag(d) and the Kullback-Leibler (KL) divergence is employed as the distance measure. The reason is that R_{m,l} is a joint PMF of two random variables (the responses of annotators m and l), and the KL divergence is the most natural distance measure under such circumstances. Problem (15) is a nonconvex optimization problem, but it can be handled by a simple alternating optimization procedure.
Specifically, we propose to solve the following subproblems cyclically:

A_m ← arg min_{A_m ≥ 0, 1^T A_m = 1^T}  Σ_{l∈Ω_m} KL( R_{m,l} ‖ A_m D A_l^T ),  m = 1, …, M,   (16a)
d ← arg min_{d ≥ 0, 1^T d = 1}  Σ_{m=1}^{M} Σ_{l∈Ω_m} KL( R_{m,l} ‖ A_m diag(d) A_l^T ),   (16b)

where Ω_m denotes the index set of l’s such that R_{m,l} is available. Both of the above problems are convex optimization problems, and thus can be effectively solved via a number of off-the-shelf optimization algorithms, e.g., ADMM Huang et al. (2016) and mirror descent Arora et al. (2013). The detailed algorithm is summarized in Algorithm 2. The alternating optimization algorithm is also guaranteed to converge to a stationary point under mild conditions Bertsekas (1999); Razaviyayn et al. (2013).
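As an illustration of the alternating scheme, one inexpensive way to (approximately) handle the A_m-subproblem in (16a) is a multiplicative KL update followed by column renormalization. This is only a sketch under our own notational assumptions (R[m][l] stores the pairwise statistic for the annotator pair (m, l), or None if unavailable); it is not the exact ADMM/mirror-descent solver referenced above:

```python
import numpy as np

def kl_div(R, Rhat, eps=1e-12):
    """Generalized KL divergence between two nonnegative matrices."""
    return float(np.sum(R * (np.log(R + eps) - np.log(Rhat + eps)) - R + Rhat))

def update_A(m, R, A, d, eps=1e-12):
    """One multiplicative KL update of A[m], holding the other confusion
    matrices and the prior d fixed, then renormalize the columns so that
    A[m] stays column-stochastic."""
    num = np.zeros_like(A[m])
    den = np.zeros_like(A[m])
    for l in range(len(A)):
        if l == m or R[m][l] is None:
            continue
        H = np.diag(d) @ A[l].T                 # fixed factor: D A_l^T
        Rhat = A[m] @ H + eps                   # current model of R[m][l]
        num += (R[m][l] / Rhat) @ H.T
        den += np.ones_like(R[m][l]) @ H.T
    Am = A[m] * num / (den + eps)               # multiplicative KL step
    return Am / (Am.sum(axis=0, keepdims=True) + eps)
```

The d-subproblem (16b) can be handled analogously, and kl_div can be used to monitor the objective across the cyclic updates.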