1 Introduction
Crowdsourcing marketplaces, such as Amazon Mechanical Turk (https://www.mturk.com), make it possible to label large data sets in less time and at lower cost than would be required from a limited number of experts. However, as workers at these marketplaces are non-professional and vary in their levels of expertise, such labels are much noisier than those obtained from experts. To reduce this noise, each object is typically labelled by several workers, and these labels are then aggregated in a certain way to infer a more reliable consensus label for the object. Most advanced consensus models [e.g., 5, 15, 18] address different aspects of uncertainty in the process of generating noisy labels.
A traditional setting used in those and other previous studies is based on the latent label assumption (L-assumption), which implies that each object has a unique latent true label and that, when a worker observes the object, this latent true label is corrupted according to a chosen stochastic model into an observed noisy label. As a consequence, consensus models designed under this assumption explain any disagreement among the observed noisy labels of an object by mistakes made by some of the labellers. However, this cannot explain a certain kind of disagreement among labels produced by experts, which is typical for some object domains. For example, when assessing the relevance of documents to search queries, even well-trained experts may disagree about the true label for certain objects [see, e.g., 14]. This is equivalent to saying that a unique true label of an object does not exist; rather, each object has its own distribution over possible subjective labels, induced by a distribution of personal preferences over different aspects of the task. This type of uncertainty lies beyond the traditional L-assumption, and this paper introduces a novel approach based on the latent distribution assumption (D-assumption) to deal with this problem.
The novel D-assumption suggests the following generative process. Each object has its own (latent) distribution over subjective label values. Each time a worker observes the object, a latent subjective label is sampled from the object's distribution. Then this subjective label is corrupted according to a stochastic model, and an observed label is revealed. It is crucial that the object's subjective label introduced in this process is generated afresh each time a worker observes the object. This is what distinguishes the latent distribution approach from the previous consensus models.
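As a minimal sketch of this process (a hypothetical binary setting with a symmetric noise model; the numbers are illustrative, not part of the model definitions below):

```python
import random

def sample_noisy_label(p_subjective, accuracy):
    """One observation under the D-assumption: a subjective label is drawn
    afresh from the object's latent Bernoulli distribution, then corrupted
    by a symmetric noise model with the given worker accuracy."""
    subjective = 1 if random.random() < p_subjective else 0
    if random.random() < accuracy:
        return subjective      # the worker reports the subjective label
    return 1 - subjective      # the worker flips it

random.seed(0)
draws = [sample_noisy_label(0.7, 0.9) for _ in range(10_000)]
# The expected fraction of 1s is 0.7 * 0.9 + 0.3 * 0.1 = 0.66.
```

Note that repeated observations of the same object yield different subjective labels, which is exactly what the L-assumption forbids.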
In addition, to the best of our knowledge, this is the first paper that looks at the output probabilities of consensus labels from a statistical point of view. In particular, we notice that posterior probabilities of labels obtained under the traditional L-assumption are poor estimates of the underlying true probabilities. In contrast, we show, both analytically and experimentally, that the probabilities obtained under the D-assumption may serve as accurate statistical estimates of the underlying true probabilities. Thus, the latent distribution over subjective labels of an object estimated via our framework can be further used for the problems of label distribution learning considered in [6].

Background. A line of previous studies on consensus modelling [e.g., 5, 15] explicitly assumed that each object has a single "true" label. Other works [e.g., 16, 17, 18], including Bayesian approaches [e.g., 1, 8, 13], made the L-assumption implicitly: although each object is assumed to be associated with a probabilistic label, a latent label of an object is not allowed to take multiple different values. In other words, the posterior probabilities for latent labels inferred under the traditional L-assumption are a measure of confidence in each value being the unique latent true label. Our work is orthogonal to those studies, as it considers a different assumption about the process of generating noisy labels and leads to essentially different and better-calibrated output probabilities. While [10] suggested a consensus model for subjective tasks, our work is more fundamental in the sense that we describe a general framework for consensus modelling in such tasks and theoretically study the properties of its outputs.
2 Latent distribution assumption
Let Z be a random variable whose value is the latent label for an object o, and let Y be a random variable whose value is the observed noisy label assigned by a worker w to the object o. To the best of our knowledge, all existing consensus models fit within the general generative model shown in Figure 0(a). This class of models, which includes [5, 15, 16] among others, and which we call the LA-model, represents the following generative process:

1. A unique latent true label z for an object o is drawn from a probability distribution P(Z | β) with a parameter β (note that β is the same for all objects). For multiclass labels, P(Z | β) is usually a multinomial distribution with β being a vector of prior probabilities.

2. Given the latent true label z for an object o, an observed noisy label y from a worker w for this object is sampled from the conditional probability distribution P(Y | Z = z, e_w, d_o) with two potential parameters e_w and d_o. (As usual, capital letters are used for random variables and the corresponding lower-case letters for particular values of these variables; "potential parameters" means that some of the parameters may be absent.) Intuitively, e_w represents the level of expertise of the worker w, and d_o corresponds to the level of difficulty of the object o.
Note that most previous studies focus on the second step. They proposed different ways to model the corruption of the true label and, for example, additionally employed latent communities of workers [13] or features of the object and the worker [11]. However, this does not take these models beyond the general two-step scheme described above.
In this paper, using the novel D-assumption, we change the first step and introduce a novel class of models, which we call the DA-model. To this end, for each object o we define a random variable whose value μ_o is a parameter of the distribution of subjective labels for this object. The graphical model is shown in Figure 0(b) and represents the following generative process:

1. For each object o, a parameter μ_o is drawn from a probability distribution P(μ | γ) parameterised by γ (again, note that γ is the same for all objects).

2. Each time a worker w observes the object o:

(a) a subjective label z is sampled from the object's distribution P(Z | μ_o) with the parameter μ_o;

(b) a noisy label y is generated from the conditional probability distribution P(Y | Z = z, e_w, d_o) with two potential parameters e_w and d_o.

Observe that the key difference between Figures 0(a) and 0(b) is in the workers' plate: for the DA-model, the workers' plate is extended over the latent subjective labels. Next we show that this difference constitutes a conceptual change for the inference of consensus labels.
Consider a set of noisy labels {y_{ow}} for objects o = 1, …, N from workers w = 1, …, W. Traditionally, consensus labels are inferred by maximising the log-likelihood of the observed noisy labels. Under the L-assumption, the consensus output for an object o is a probability for each possible label value z, reflecting our confidence that the value z is the unique true label for the object o. In this setting, the log-likelihood of the observed data is

\mathcal{L}_{LA} = \sum_{o=1}^{N} \log \sum_{z} P(z \mid \beta) \prod_{w \in W_o} P(y_{ow} \mid z),    (1)

where the inner sum is taken over the set of possible label values, W_o is the set of indices of the workers who labelled the o-th object, notation of the form P(z | β) is used instead of P(Z = z | β) for short, and for the same reason we omit the possible parameters e_w and d_o in P(y_{ow} | z). Under the D-assumption, in contrast, the consensus output for an object o is the distribution of subjective labels P(Z | μ_o), parameterised by the object's parameter μ_o, and, since a subjective label is re-sampled for every observation, the log-likelihood is defined as

\mathcal{L}_{DA} = \sum_{o=1}^{N} \log \int P(\mu \mid \gamma) \prod_{w \in W_o} \Big( \sum_{z} P(z \mid \mu)\, P(y_{ow} \mid z) \Big)\, d\mu.
To simplify the discussion of DA-models, in the rest of this paper we assume that the distribution P(μ | γ) of the DA-model is deterministic, assigning to each object a fixed beforehand (but unknown) distribution of the object's subjective labels. Then we can remove the parameter γ in Figure 0(b), treat μ_1, …, μ_N as additional parameters of the model, and the expression for the log-likelihood of the observed labels becomes the following:

\mathcal{L}_{DA} = \sum_{o=1}^{N} \sum_{w \in W_o} \log \sum_{z} P(z \mid \mu_o)\, P(y_{ow} \mid z).    (2)
Remark. The two described approaches resemble, to some extent, two approaches to topic modelling: the mixture-of-unigrams model and the latent Dirichlet allocation model [see, e.g., 3]. The first model assumes that each document is associated with a single topic, which specifies the distribution of words in the document; in the second model, each document may be associated with multiple topics, and the words in the document are generated from a mixture of the topics' distributions. By substituting latent labels for topics and noisy labels for words, we obtain the two approaches discussed in this section. Note, however, that in our setting it is necessary that the domains of latent and observed labels coincide, so that the latent "topics" are in one-to-one correspondence with the label values; otherwise, we could not interpret the latent distribution.
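The contrast between the two assumptions can be checked with a small computation. In the numerical example that follows, the workers' accuracy is not pinned down, so we assume p = 0.9 purely for illustration:

```python
# 210 workers, 110 labels equal to 1 and 100 equal to 0; every worker is
# assumed correct with probability p = 0.9 (an illustrative value).
p = 0.9
n1, n0 = 110, 100

# L-assumption: posterior probability of the true label being 1,
# with a uniform prior over the two labels.
ratio = (p / (1 - p)) ** (n1 - n0)
posterior_one = ratio / (ratio + 1)     # practically 1

# D-assumption: maximum-likelihood estimate of q from
# P(noisy label = 1) = q * p + (1 - q) * (1 - p).
frac_ones = n1 / (n1 + n0)
q_hat = (frac_ones - (1 - p)) / (2 * p - 1)   # roughly 0.53
```

Under these assumed numbers, the L-assumption posterior collapses to a near-deterministic label, while the D-assumption estimate stays close to the observed fraction of ones.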
Finally, let us motivate the novel approach with the following numerical example. Consider 210 workers with highly confident estimates of their expertise, inferred from their labels for the many tasks they have completed. Let each worker assign the correct label with some high probability (independent of the value of that correct label). Consider a binary classification task with an uninformative prior over the two labels. Let each of the workers provide one noisy label for a new object, and suppose it turns out that 110 of the labels are 1s and 100 are 0s. Under the L-assumption, the posterior probability of the true label being 1 is practically 1: we conclude that the correct label is 1 and accept the very unlikely event that 100 out of 210 workers made a mistake. In contrast, under the D-assumption, the probability of label 1 is chosen to maximise (2), and we infer that most of the workers provided true (though subjective) labels and that for this object the latent Bernoulli distribution has a parameter close to the observed fraction 110/210, which seems much more realistic.

3 Consensus models under L- and D-assumptions
A distinguishing feature of the DA-model, shown in Figure 0(b), is that latent subjective labels for an object are sampled each time the object is observed. Thus, given a certain LA-model, the corresponding DA-model is defined as follows. Firstly, the distribution P(μ | γ) in the DA-model is defined over the domain of the parameter β in the LA-model; e.g., for the example models below, μ_o is the parameter of a multinomial distribution, and P(μ | γ) is defined to be a Dirichlet distribution. Secondly, the conditional distribution of noisy labels in the DA-model is the conditional distribution defined in the LA-model, with the subjective label taken as the value of the random variable Z.
The rest of this section describes three established models (for all of them, labels take values in the same finite set of size K) as special cases of the traditional LA-model, together with their novel counterparts as special cases of the novel DA-model. Later, in Section 6, we empirically compare the performance of all these models.
3.1 Dawid and Skene model
Consider the special case of the LA-model with parameters β and e_w defined as follows: β is the vector of prior probabilities of the label values, and the parameter e_w is the confusion matrix of the worker w, of size K × K, whose (z, y) entry is the probability that the worker labels an object with true label z by the label y. This model was proposed in [5], and we will refer to it as LA-DS.

The corresponding special case of the DA-model is the following: (1) for each object o, a vector μ_o is drawn from a Dirichlet distribution; this vector is the parameter of the multinomial distribution of subjective labels for the object; (2) when a worker w observes the object o, first a subjective label z is drawn from the multinomial distribution with parameter μ_o, and then a noisy label is drawn from the multinomial distribution whose parameter is the z-th row of the confusion matrix e_w. This model will be denoted DA-DS.
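A hedged sketch of one draw from this generative process (the prior, the number of classes, and the confusion matrix below are arbitrary illustrative choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 2                                  # number of label values (illustrative)
alpha = np.ones(K)                     # symmetric Dirichlet prior (assumed)
confusion = np.array([[0.8, 0.2],      # hypothetical worker confusion matrix:
                      [0.3, 0.7]])     # row z gives P(noisy label | subjective z)

mu = rng.dirichlet(alpha)              # (1) the object's label distribution
z = rng.choice(K, p=mu)                # (2a) a fresh subjective label per view
y = rng.choice(K, p=confusion[z])      # (2b) the corrupted observed label
```

Repeating the last two lines for every worker who views the object reproduces the key property of the DA-model: the subjective label z is re-drawn for every observation.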
3.2 GLAD
Consider the following special case of the LA-model with parameters β, e_w and d_o: the parameter β is absent, meaning that true labels are assumed to be deterministic; the parameter e_w is a scalar representing the level of expertise of the worker w; and the parameter d_o is a scalar representing the level of difficulty of the object o. The probability of a correct answer is modelled with the logistic function of the product of the two parameters,

P(Y = z \mid Z = z, e_w, d_o) = \frac{1}{1 + e^{-e_w d_o}},

with the remaining probability mass spread uniformly over the incorrect label values. This model is called GLAD (a shortcut for Generative model of Labels, Abilities, and Difficulties) and was described in [15]; it will be denoted LA-GLAD.
The corresponding DA-model is the following: (1) each object o has a deterministic parameter μ_o; (2) for a worker w and an object o, a latent subjective label z is sampled from the multinomial distribution with parameter μ_o, and then a noisy label is generated from the GLAD conditional distribution. We will refer to this model as DA-GLAD.
3.3 Minimax entropy principle
Consider the model described in [18, Section 2] for multiclass labels; this model will be referred to as LA-MME. The model was derived using a minimax entropy principle, and it is a special case of the LA-model with e_w and d_o defined as matrices of size K × K. Using these parameters, for each worker w and each object o, the conditional distribution of noisy labels is defined as

P(y \mid z, e_w, d_o) \propto \exp\big(e_w[z, y] + d_o[z, y]\big).
The DA-counterpart is the following: (1) each object o has a deterministic parameter μ_o; (2) for a worker w and an object o, a latent subjective label z is sampled from the multinomial distribution with parameter μ_o, and then a noisy label is generated from the MME conditional distribution. This model will be denoted DA-MME.
4 Theoretical analysis
In this section, we analyse the ability of the two approaches to consensus modelling (based on the L- and D-assumptions) to output "calibrated" probabilities of labels.
Consider one object and assume binary labels {0, 1}. Consider the generative process of the DA-model as described in Section 2, where the subjective label is Bernoulli with parameter q, and q is an unknown object-specific parameter. Assume that n workers provide labels for the object independently and that each of them makes a mistake with a known probability 1 − p, which is the same for all workers.

L-assumption approach. According to the traditional approach, the unknown parameter q is estimated by the posterior probability of label 1 for the object,

\hat{q}_{LA} = \frac{\pi \, p^{n_1} (1-p)^{n-n_1}}{\pi \, p^{n_1} (1-p)^{n-n_1} + (1-\pi)\,(1-p)^{n_1} p^{n-n_1}},

where π is the prior probability of label 1 (common for all objects) and n_1 is the number of ones among the n noisy labels.
Proposition 1. The value of \hat{q}_{LA} is neither an unbiased nor a consistent estimate of q.
Proof. The generative process based on the D-assumption defines the following distribution of the number of ones N_1:

N_1 \sim \mathrm{Binomial}(n, \mu), \quad \mu = q p + (1-q)(1-p).

Considering the expectation of \hat{q}_{LA} with respect to this distribution, one can see that \hat{q}_{LA} is a biased estimate of q (see the details in Appendix A). Now we check the convergence of \hat{q}_{LA} to q as n tends to infinity. By the law of large numbers, the fraction of ones among the noisy labels converges: N_1 / n \to \mu. Assume that workers are not malicious, i.e. p > 1/2. Then μ > 1/2 if and only if q > 1/2, and \hat{q}_{LA} converges to different values depending on μ: to 1 when μ > 1/2 and to 0 when μ < 1/2 (see the details in Appendix A). Thus, \hat{q}_{LA} is not a consistent estimate of q unless the distribution of subjective labels is degenerate. ∎

D-assumption approach. According to the novel approach based on the D-assumption, N_1 is a binomial random variable with parameters (n, μ), where μ = qp + (1 − q)(1 − p) is the probability that an observed noisy label equals 1. The estimate \hat{q}_{DA} of the parameter q is the value providing the maximum of the log-likelihood of the observed noisy labels:

\hat{q}_{DA} = \frac{N_1/n - (1-p)}{2p - 1}.
Proposition 2. The value of \hat{q}_{DA} is an unbiased and consistent estimate of q.

Proof. Note that E[N_1/n] = μ, where μ = qp + (1 − q)(1 − p). Therefore, we have

E[\hat{q}_{DA}] = \frac{E[N_1/n] - (1-p)}{2p - 1} = \frac{\mu - (1-p)}{2p - 1} = q,

where the first equality uses the linearity of expectation. Moreover, N_1/n is a consistent estimate of the success probability μ of the binomial distribution; therefore, \hat{q}_{DA} is a consistent estimate of q. ∎
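A quick Monte Carlo sanity check of this proposition (the values of q, p and n below are arbitrary simulation choices):

```python
import numpy as np

rng = np.random.default_rng(1)
q_true, p, n = 0.3, 0.8, 200_000          # assumed simulation parameters
mu = q_true * p + (1 - q_true) * (1 - p)  # P(noisy label = 1) = 0.38

ones = rng.binomial(n, mu)                # number of 1s among n noisy labels
q_hat = (ones / n - (1 - p)) / (2 * p - 1)
# For large n, q_hat concentrates around q_true, illustrating consistency.
```

Re-running with smaller n shows the estimate fluctuating around q_true with shrinking variance, consistent with unbiasedness.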
5 Details of implementation
In this section, we describe our implementation of the six consensus models described in Sections 3.1–3.3. Our implementation allows us to directly maximise the log-likelihood of the observed noisy labels instead of using EM, which optimises a lower bound on the log-likelihood.
For LA-models, the log-likelihood (1) is a function of the parameters β, e and d. To maximise the log-likelihood, we use a standard conjugate gradient algorithm, and to compute gradients at each iteration we use a proprietary library for automatic differentiation, which is an implementation of the operator-overloading approach [see, e.g., 2]. Finally, using the maximum-likelihood estimates of the parameters, we compute the consensus output for each object o as the posterior probabilities

P(z \mid \{y_{ow}\}_{w \in W_o}) \propto P(z \mid \hat{\beta}) \prod_{w \in W_o} P(y_{ow} \mid z) \quad \text{for all } z.
For DA-models, the log-likelihood (2) is a function of the parameters μ_1, …, μ_N, e and d, and the same approach is used to maximise it. The consensus output for each object o is the estimated distribution of the object's subjective labels, i.e. P(z | \hat{μ}_o) for all z.
The output of any consensus model depends on the initial values of the parameters. However, this does not affect our comparisons of the LA- and DA-versions of the models: the same initial values are used for the common parameters in each pair of models. For the pair of DS models, confusion matrices are initialised as follows: for each object, we first compute the vector of relative frequencies of the noisy labels for the object; then, for each worker, we compute a matrix of counts by considering each label given by the worker and adding the object's frequency vector to the corresponding column of the worker's matrix of counts; finally, we normalise each row of the worker's matrix so that it sums to 1. The parameter β for LA-DS is initialised by the frequency of each label value among all noisy labels. For the pair of GLAD models, workers' parameters are initialised by 1 and objects' parameters are initialised as suggested in the original paper [15]. For the pair of MME models, workers' confusion matrices are initialised by computing the matrices of counts for each worker (in the same way as for the DS models) and then taking the logarithm of each element of the matrices (to avoid zeros, we add 1 to each element of the matrices of counts for workers and objects); objects' parameters are initialised in a similar way: for each worker, we compute the vector of relative frequencies of the noisy labels from this worker; then, for each object, we compute its matrix of counts by considering each noisy label for the object and adding the corresponding worker's frequency vector to the row of the matrix of counts; finally, we take the logarithm of each element of the matrix. (This non-trivial initialisation of the confusion matrices in the DS and MME models leads to better results than a trivial uniform initialisation.) The parameters μ_o of the objects' latent distributions in all DA-models are initialised by the frequencies of the noisy labels for each object o.
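The soft-count initialisation of the DS confusion matrices can be sketched as follows (the triple-list input format and the function name are our own; this illustrates the description above, not the proprietary implementation):

```python
import numpy as np

def init_confusions(labels, K):
    """labels: list of (object_id, worker_id, noisy_label) triples.
    For each object, compute the relative frequencies of its noisy labels;
    for each worker, add the object's frequency vector to the column of the
    worker's count matrix indexed by the worker's label; normalise the rows."""
    obj_freq = {}
    for o, _, y in labels:
        obj_freq.setdefault(o, np.zeros(K))[y] += 1
    for o in obj_freq:
        obj_freq[o] /= obj_freq[o].sum()

    confusions = {}
    for o, w, y in labels:
        M = confusions.setdefault(w, np.zeros((K, K)))
        M[:, y] += obj_freq[o]
    for M in confusions.values():
        # guard against empty rows before normalising each row to sum to 1
        M /= np.maximum(M.sum(axis=1, keepdims=True), 1e-12)
    return confusions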
Table 1: Accuracy on the test sets (the number of classes is given in parentheses). The RFE baseline accuracy is shown in the second column; for every other model, the difference from the RFE result is shown.

Data set (classes)       RFE   | Dawid & Skene |     GLAD      | Minimax entropy
                               |   LA     DA   |   LA     DA   |    LA      DA
Duchenne smiles (2)     72.08  | +4.65  +4.03  | +3.40  +5.28  | -31.82   +0.88
Web search (5)          73.03  |+12.69  +6.66  | +8.20  +7.30  |  +3.04  +10.88
TREC (3)                45.48  | +6.43  +3.47  | -0.78  -0.10  |  +0.72   +1.51
Textual entailment (2)  89.92  | +2.70  +2.33  | +3.20  +2.08  |  -5.55   -2.42
Temporal ordering (2)   93.55  | +0.61  +0.82  | +0.39  +0.82  |  -2.21   -1.56
Adult content (4)       76.04  | -0.36  +1.44  | -0.36  -0.96  |  -1.56   +2.34
Price (7)               32.50  | -1.25  +1.25  |  0.00   0.00  |  -2.50   +2.50
Table 2: Log loss on the test sets (lower is better).

Data set (classes)        RFE   | Dawid & Skene |     GLAD      | Minimax entropy
                                |   LA     DA   |   LA     DA   |   LA     DA
Duchenne smiles (2)        ∞    | 2.467  0.821  | 2.074  0.753  | 3.464  0.817
Web search (5)             ∞    | 0.247  0.364  | 0.408  0.468  | 0.747  0.331
TREC (3)                   ∞    | 1.403  1.143  | 1.280  1.116  | 1.779  1.196
Textual entailment (2)   0.509  | 0.554  0.363  | 0.308  0.377  | 0.900  0.454
Temporal ordering (2)      ∞    | 0.841  0.260  | 0.331  0.252  | 0.817  0.286
Adult content (4)          ∞    | 1.345  0.507  | 0.875  0.535  | 1.385  0.501
Price (7)                  ∞    | 80.365 3.128  |16.357  0.917  |42.651  1.675
6 Experiments
In this section, we empirically show that: (1) the novel D-assumption is more realistic than the L-one, and, as a result, DA-models produce more accurate consensus labels; (2) models designed under the D-assumption are capable of recovering latent distributions of labels. A straightforward way to check the first proposition is to compare the performance of an LA-model against its DA-counterpart. However, as the distributions of labels for real objects are unknown, we use synthetic data sets to check the second statement.
Consider multiclass labels taking K possible values. For each object o, a model M produces a consensus label q_M(o), which is a vector of length K whose k-th element is the probability of the label value k. Note that all models, LA and DA, produce probabilistic labels; but, as we explained in the previous sections, the intuition behind the probabilistic output of an LA-model is "confidence" in each label value, whereas for a DA-model it is the estimated probability of each label.
Relative frequency estimator (RFE). Given a set of noisy labels for an object o, a standard baseline for consensus modelling is to estimate the probability of each label value k as the fraction of the workers in W_o who assigned the label k to the object, where W_o is the set of indices of the workers that labelled the object o.
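A sketch of this baseline (the function name and input format are illustrative):

```python
from collections import Counter

def rfe(noisy_labels, K):
    """Relative frequency estimator: the probability of label value k is
    the fraction of noisy labels for the object that equal k."""
    counts = Counter(noisy_labels)
    n = len(noisy_labels)
    return [counts.get(k, 0) / n for k in range(K)]

rfe([1, 1, 0, 2, 1], 3)  # → [0.2, 0.6, 0.2]
```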
For each data set, the training set is a set of objects with multiple noisy labels for each object, and the test set is a subset of those objects with a known ground-truth label for each of them. The quality of the consensus labels produced from the training set is evaluated over the test set. Let T be the set of indices of the test-set objects, and, for an object o ∈ T, let g_o be its ground-truth label. We use the following two metrics to evaluate the performance of a model on the test set.
Accuracy. For each object in the training set, a model M produces a probabilistic label q_M(o), and the estimated label for this object is the label value with the maximum probability. (If the maximum is attained at several label values, we randomly choose one of them. In our experiments, the RFE accuracy is averaged over 10 runs to reduce the effect of random tie-breaking; there was no need to average the results of the non-trivial models, as no ties were observed.) Accuracy over the test set is defined as the fraction of objects o ∈ T whose estimated label equals the ground-truth label g_o.
Log loss. Given consensus labels based on a model M, the log loss over the test set is the mean negative log-likelihood of the test-set labels, computed as

-\frac{1}{|T|} \sum_{o \in T} \log q_M(o)[g_o],

where q_M(o)[g_o] is the probability assigned by the model to the ground-truth label. (For this definition, the log loss of non-informative probabilistic labels, consisting of uniform probabilities 1/K, is log K.)
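The two metrics can be sketched as follows (argmax ties are broken deterministically here for brevity):

```python
import math

def accuracy(probs, truths):
    """Fraction of test objects whose most probable label matches the truth."""
    hits = sum(1 for q, g in zip(probs, truths)
               if max(range(len(q)), key=q.__getitem__) == g)
    return hits / len(truths)

def log_loss(probs, truths):
    """Mean negative log-likelihood of the ground-truth labels."""
    return -sum(math.log(q[g]) for q, g in zip(probs, truths)) / len(truths)
```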
Real-world data. First, we compare the seven models across the following data sets: Duchenne smiles [15], Web search [18], TREC [4], Textual entailment [12], Temporal ordering [12], Adult content [7], and Price [9]. For each data set, ground-truth labels for a subset of the objects are provided by domain experts. (See Appendix B for the task descriptions and summary statistics of the data sets.) Table 1 shows the accuracy results. For each data set, the accuracy of each non-trivial model is shown as the difference between the model's result and that of the RFE baseline (given in the second column). In summary, among the 21 comparisons of LA- and DA-models, the DA-counterpart is better 13 out of 21 times. Besides this, the novel approach produces the best model for 4 out of 7 data sets. Table 2 shows the log loss of the seven models across the data sets (to avoid infinite losses for RFE, we applied add-one smoothing to its label counts). Remarkably, for the log loss, among the 21 comparisons of LA- and DA-models, the DA-counterpart is better 18 out of 21 times, and for 5 out of 7 data sets the novel approach produces the best model. Let us highlight some interesting observations from Tables 1 and 2. First, according to the log loss, the performance of the RFE baseline is infinitely bad (this happens when none of the noisy labels for an object equals the ground truth) for all the data sets except Textual entailment; this may indicate that the Textual entailment task was easy, in the sense that the traditional L-assumption is likely to hold there, and the DA-models suffer due to their greater flexibility. Secondly, consider the TREC data set, where, according to the log loss, the quality of the probabilities obtained by any model, LA or DA, is worse than that of non-informative uniform probabilities. We checked that the mean number of labels per object for this data set is 4.6, which may not be enough to recover realistic distributions over three classes. The results for the Web search data set, with a mean of 5.8 labels per object, can be explained similarly.
The number of labels per object. To analyse how the number of labels per object affects the performance of the LA- and DA-models, we conducted the following experiment. As an example, we take the Duchenne smiles data set and two models, LA-GLAD and DA-GLAD. Figure 2 shows the accuracy and the log loss on the Duchenne smiles data as the number of labels per object grows. For each integer m, we sampled a part of the data set such that the number of labels per object is not greater than m; this was done as follows: given the noisy labels for an object, we randomly drew m of them (if the number of noisy labels for the object was at most m, all of its noisy labels were taken). For each such sample of the data set, we obtained consensus labels using the two models and evaluated their performance as before. The results shown in Figure 2 are averaged over the simulations. Indeed, when the number of noisy labels is small, the DA-model is worse than the LA-one according to both performance metrics. However, as the number of labels per object increases, the performance of the DA-model notably exceeds that of the LA-model at almost all overlap values. In particular, for the log loss, we can see that for the DA-model, the larger the overlap of noisy labels per object, the better the probabilistic labels; whereas for the LA-model, as the number of labels per object grows, the aggregated probabilistic labels converge to deterministic values and the log loss on the test set becomes worse.
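The subsampling step can be sketched as follows (the dictionary format and function name are illustrative):

```python
import random

def subsample(labels_per_object, m, rng):
    """Keep at most m randomly drawn noisy labels per object; objects with
    m labels or fewer keep all of their labels."""
    out = {}
    for obj, labels in labels_per_object.items():
        out[obj] = list(labels) if len(labels) <= m else rng.sample(labels, m)
    return out
```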
Synthetic data. Inspired by the analysis in Section 4, we next empirically examine the ability of the two approaches to consensus modelling to recover the underlying probabilities of labels. Data for this experiment are generated similarly to [15]: we use a set of objects and 20 workers, and each worker labels each object; for each object o we randomly generate its latent parameter q_o, which is the probability of label 1; to produce noisy labels, we generate workers' and objects' parameters for GLAD. Using these parameters, a data set is generated according to the DA-GLAD model. Then, for this data set, we use our implementations of LA-GLAD and DA-GLAD to obtain estimates of q_o. Table 3 shows the mean squared error of the estimates, averaged across 10 simulations. It confirms that the DA-model is able to accurately estimate the underlying true probabilities. Figure 3 shows calibration plots for the estimated probabilities obtained using the two approaches to consensus modelling. It demonstrates that the probabilities estimated by the DA-model are empirically well calibrated, whereas the LA-model is biased towards degenerate distributions.
7 Conclusion
We have proposed the novel D-assumption for consensus modelling and a way of constructing novel DA-models under this assumption. It has been shown that the DA-models are able to obtain proper estimates of the true probabilities, in contrast to the established LA-models. Our experiments confirm that the DA-models often perform better on real data sets, implying that the D-assumption is more realistic than the L-one. However, there is no universal model that is best for every data set (even among the state-of-the-art models). Thus, it always makes sense to produce golden labels (by professional judges) for at least a small part of any data set in order to evaluate the existing models and check which one serves best for the data at hand. In that sense, our novel D-assumption approach is complementary to the family of existing consensus models: it doubles their number by providing an alternative noisy-label generation process for each of them.
References
 [1] Y. Bachrach, T. Graepel, T. Minka, and J. Guiver. How to grade a test without knowing the answers—a bayesian graphical model for adaptive crowdsourcing and aptitude testing. arXiv preprint arXiv:1206.6386, 2012.
 [2] M. BartholomewBiggs, S. Brown, B. Christianson, and L. Dixon. Automatic differentiation of algorithms. Journal of Computational and Applied Mathematics, 124:171 – 190, 2000.

 [3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.
 [4] C. Buckley, M. Lease, M. D. Smucker, H. J. Jung, and C. Grady. Overview of the trec 2010 relevance feedback track (notebook). In The Nineteenth Text Retrieval Conference (TREC) Notebook, 2010.
 [5] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, pages 20–28, 1979.
 [6] Xin Geng. Label distribution learning. IEEE Transactions on Knowledge and Data Engineering, 28(7):1734–1748, 2016.
 [7] P. G Ipeirotis, F. Provost, and J. Wang. Quality management on amazon mechanical turk. In Proceedings of the ACM SIGKDD workshop on human computation, pages 64–67, 2010.

 [8] H. Kim and Z. Ghahramani. Bayesian classifier combination. In International Conference on Artificial Intelligence and Statistics, pages 619–627, 2012.
 [9] Q. Liu, A. T. Ihler, and M. Steyvers. Scoring workers in crowdsourcing: How many control questions are enough? In Advances in Neural Information Processing Systems, pages 1914–1922, 2013.
 [10] An Thanh Nguyen, Matthew Halpern, Byron C Wallace, and Matthew Lease. Probabilistic modeling for crowdsourcing partiallysubjective ratings. In Fourth AAAI Conference on Human Computation and Crowdsourcing, 2016.
 [11] P. Ruvolo, J. Whitehill, and J. R Movellan. Exploiting commonality and interaction effects in crowdsourcing tasks using latent factor models. 2013.

 [12] R. Snow, B. O'Connor, D. Jurafsky, and A. Y. Ng. Cheap and fast – but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 254–263, 2008.
 [13] M. Venanzi, J. Guiver, G. Kazai, P. Kohli, and M. Shokouhi. Community-based bayesian aggregation models for crowdsourcing. In Proceedings of the 23rd International Conference on World Wide Web, pages 155–164, 2014.
 [14] E. M Voorhees. Variations in relevance judgments and the measurement of retrieval effectiveness. Information processing & management, 36:697–716, 2000.
 [15] J. Whitehill, T. Wu, J. Bergsma, J. R Movellan, and P. L Ruvolo. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in neural information processing systems, pages 2035–2043, 2009.
 [16] D. Zhou, S. Basu, Y. Mao, and J. C. Platt. Learning from the wisdom of crowds by minimax entropy. In F. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2195–2203. 2012.
 [17] D. Zhou, Q. Liu, J. Platt, and C. Meek. Aggregating ordinal labels from crowds by minimax conditional entropy. In Proceedings of the 31st International Conference on Machine Learning (ICML14), pages 262–270, 2014.
 [18] D. Zhou, Q. Liu, J. C Platt, C. Meek, and N. B Shah. Regularized minimax conditional entropy for crowdsourcing. arXiv preprint arXiv:1503.07240, 2015.
Appendix A Theoretical analysis for the latent label assumption
Recall that we consider one object whose "true" label is Bernoulli with parameter q, where q is an unknown object-specific parameter. Given n noisy labels for this object, such that each noisy label is correct with a known probability p, the goal is to estimate the parameter q.
According to the LA-model, the unknown parameter q is estimated by the posterior probability of label 1 for the object,

\hat{q}_{LA} = \frac{\pi \, p^{n_1} (1-p)^{n-n_1}}{\pi \, p^{n_1} (1-p)^{n-n_1} + (1-\pi)\,(1-p)^{n_1} p^{n-n_1}},

where π is the prior probability of label 1 (common for all objects) and n_1 is the value of the random variable N_1, the number of ones among the n noisy labels.
Proposition 1. The posterior probability of label 1 obtained under the L-assumption is neither an unbiased nor a consistent estimate of the true probability q underlying the generative process of the DA-model.
Proof.
The posterior probability of label 1 for the object is given by the expression for \hat{q}_{LA} above. The generative process based on the D-assumption defines the following distribution of the number of ones N_1:

P(N_1 = n_1) = \binom{n}{n_1} \mu^{n_1} (1-\mu)^{n-n_1}, \quad \mu = q p + (1-q)(1-p).

Consider the expectation of \hat{q}_{LA} with respect to this distribution:

E[\hat{q}_{LA}] = \sum_{n_1 = 0}^{n} \binom{n}{n_1} \mu^{n_1} (1-\mu)^{n-n_1} \cdot \frac{\pi \, p^{n_1} (1-p)^{n-n_1}}{\pi \, p^{n_1} (1-p)^{n-n_1} + (1-\pi)\,(1-p)^{n_1} p^{n-n_1}}.    (3)
To demonstrate that this is a biased estimate of the true parameter q, we evaluate expression (3) as a function of q for different values of n and the uniform prior π = 1/2. Figure 4, showing the dependence of the expected value of \hat{q}_{LA} on q, confirms that \hat{q}_{LA} is a biased estimate of q.
Now we check whether \hat{q}_{LA} converges to q as the number of noisy labels n tends to infinity. By the law of large numbers, the fraction of ones among the noisy labels converges:

N_1 / n \to \mu = q p + (1-q)(1-p) \quad \text{almost surely}.

Assume that workers are not malicious, i.e. p > 1/2. Note that

\hat{q}_{LA} = \left( 1 + \frac{1-\pi}{\pi} \left( \frac{1-p}{p} \right)^{2 n_1 - n} \right)^{-1},

and therefore \hat{q}_{LA} converges to different values depending on μ:

\hat{q}_{LA} \to 1 \ \text{if } \mu > 1/2, \qquad \hat{q}_{LA} \to 0 \ \text{if } \mu < 1/2.    (4)

Thus, \hat{q}_{LA} is not a consistent estimate of q unless the distribution of subjective labels is degenerate. ∎
Appendix B Data sets details
We use the following public data sets for our empirical studies:

Duchenne smiles [15]. The task is to classify images into two categories – a Duchenne smile (“enjoyment” smile) and a nonDuchenne (“social” smile).

Web search [18] and TREC [4]. The task is to rate query–URL pairs. For a given query–URL pair, a worker is asked to rate how relevant the URL is to the search query. For the Web search data, the rating scale has 5 levels: perfect, excellent, good, fair, or bad. For the TREC data, the rating scale is ternary: highly relevant, relevant, and non-relevant.

Recognising Textual Entailment [12]. For this task, each object contains two statements, and a worker judges whether one statement implies the other.

Temporal Ordering [12]. Each object describes two events and the task is to judge whether one event follows another.

Adult content [7]. The task is to classify web pages into four categories depending on the presence of adult content on them.

Price [9]. The task is to estimate prices of household items choosing one out of seven adjacent bins corresponding to different ranges of price.
Data set             # Cl.   # Obj.   # Workers   # Samples   Lab. per obj. (mean)   Lab. per worker (mean / med.)   Gr. truth objects
Duchenne smiles        2       2134       64         30319           14.2                  473.7 / 109                     159
Web search             5       2665      177         15567            5.8                   87.9 / 19                     2653
TREC                   3      20026      762         91783            4.6                  120.5 / 18                     3275
Textual entailment     2        800      164          8000           10.0                   48.8 / 20                      800
Temporal ordering      2        462       76          4620           10.0                   60.8 / 20                      462
Adult content          4      11040      825         92721            8.4                  112.4 / 18                     1517
Price                  7         80      155         12400          155.0                   80.0 / 80                       80
Table 4 provides summary statistics for all the data sets. The table includes the following columns: the size of the label set; the number of objects; the number of crowdsourcing workers; the total number of noisy labels from those workers (the size of the training set); the mean number of labels per object; the mean and the median number of labels per worker; and the number of objects with known ground-truth labels (the size of the test set).