Latent Distribution Assumption for Unbiased and Consistent Consensus Modelling

06/20/2019 · Valentina Fedorova et al., Yandex

We study the problem of aggregating noisy labels. Usually, it is solved by proposing a stochastic model for the process of generating noisy labels and then estimating the model parameters using the observed noisy labels. A traditional assumption underlying previously introduced generative models is that each object has one latent true label. In contrast, we introduce a novel latent distribution assumption, implying that a unique true label for an object might not exist, but rather that each object has a specific distribution generating a latent subjective label each time the object is observed. Our experiments show that the novel assumption is more suitable for difficult tasks, where there is ambiguity in choosing a "true" label for certain objects.


1 Introduction

Crowdsourcing marketplaces, such as Amazon Mechanical Turk (https://www.mturk.com), make it possible to label large data sets in less time and at lower cost than relying on a limited number of experts. However, as workers at these marketplaces are non-professional and vary in levels of expertise, such labels are much noisier than those obtained from experts. To reduce this noise, each object is typically labelled by several workers, and these labels are then aggregated to infer a more reliable consensus label for the object. Most advanced consensus models [e.g., 5, 15, 18] address different aspects of uncertainty in the process of generating noisy labels.

A traditional setting used in those and other previous studies is based on the latent label assumption (L-assumption), implying that each object has a unique latent true label and that, when a worker observes the object, this latent true label is corrupted according to a chosen stochastic model into an observed noisy label. As a consequence, consensus models designed under this assumption explain any disagreement among the observed noisy labels of an object by mistakes made by some of the labellers. However, this cannot explain a certain kind of disagreement among labels produced by experts, which is typical for some object domains. E.g., when assessing the relevance of documents to search queries, even well-trained experts may disagree about the true label for certain objects [see, e.g., 14]. This is equivalent to saying that a unique true label of an object does not exist; rather, each object has its own distribution over possible subjective labels, induced by a distribution of personal preferences over different aspects of the task. This type of uncertainty lies beyond the traditional L-assumption, and this paper introduces a novel approach based on the latent distribution assumption (D-assumption) to deal with it.

The novel D-assumption suggests the following generative process: each object has its own (latent) distribution over subjective label values. Each time a worker observes the object, a latent subjective label is sampled from the object's distribution. Then this subjective label is corrupted according to a stochastic model, and an observed label is revealed. It is crucial that the object's subjective label is generated anew each time a worker observes the object. This distinguishes the latent distribution approach from previous consensus models.

Besides, to the best of our knowledge, this is the first paper that looks at the output probabilities of consensus labels from a statistical point of view. In particular, we notice that posterior probabilities of labels obtained under the traditional L-assumption are poor estimates of the underlying true probabilities. In contrast, we show, both analytically and experimentally, that the probabilities obtained under the D-assumption may serve as accurate statistical estimates of the underlying true probabilities. Thus, the latent distribution over subjective labels of an object estimated via our framework can further be used for the problems of label distribution learning considered in [6].

Background. A line of previous studies on consensus modelling [e.g., 5, 15] explicitly assumed that each object has a single "true" label. Other works [e.g., 16, 17, 18], including Bayesian approaches [e.g., 1, 8, 13], made the L-assumption implicitly: although each object is assumed to be associated with a probabilistic label, a latent label of an object is not allowed to take multiple different values. In other words, posterior probabilities for latent labels inferred under the traditional L-assumption are a measure of confidence in each value being the unique latent true label. Our work is orthogonal to those studies, as it considers a different assumption for the process of generating noisy labels and leads to essentially different and "better calibrated" output probabilities. While [10] suggested a consensus model for subjective tasks, our work is more fundamental in the sense that we describe a general framework for consensus modelling in such tasks and theoretically study the properties of its outputs.

2 Latent distribution assumption

Let $Z_i$ be a random variable whose value is the latent label for an object $i$, and let $X_{ij}$ be a random variable whose value is the observed noisy label assigned by a worker $j$ to an object $i$.

Figure 1: Graphical structures for two generative models based on different assumptions: (a) the latent label assumption; (b) the latent distribution assumption. (Plate diagrams over workers and objects.)

To the best of our knowledge, no existing consensus model goes beyond the general generative model shown in Figure 1(a). This class of models, including [5, 15, 16] and others, which we call the LA-model, represents the following generative process:

  1. A unique latent true label $z_i$ for object $i$ is drawn from a probability distribution $p(Z_i \mid \theta)$ with a parameter $\theta$ (note that $\theta$ is the same for all objects). For multiclass labels, $p(Z_i \mid \theta)$ is usually a multinomial distribution $\mathrm{Mult}(\theta)$, where $\theta$ is a vector of prior probabilities.

  2. Given the latent true label $z_i$ for an object $i$, an observed noisy label $x_{ij}$ from a worker $j$ for this object is sampled from the conditional probability distribution $p(X_{ij} \mid z_i; w_j, d_i)$ (as usual, capital letters are used for random variables and the corresponding lower-case letters for their particular values) with two potential parameters $w_j$ and $d_i$ ("potential" meaning that some of the parameters may be absent). Intuitively, $w_j$ represents the level of expertise of worker $j$, and $d_i$ corresponds to the level of difficulty of object $i$.

Note that most previous studies focus on the second step. They propose different ways to model the corruption of the true label and, for example, additionally employ latent communities of workers [13] or features of the object and the worker [11]. However, this does not put these models beyond the general two-step scheme described above.

In this paper, by using the novel D-assumption, we change the first step and introduce a novel class of models, which we call the DA-model. To this end, for each object $i$, we define a random variable $M_i$ whose value is a parameter $\mu_i$ of the distribution of subjective labels for this object. The graphical model is shown in Figure 1(b) and represents the following generative process (a simulation sketch contrasting the two processes is given after the list):

  1. For each object $i$, a parameter $\mu_i$ is drawn from a probability distribution $p(M_i \mid \theta)$ parameterised by $\theta$ (again, note that $\theta$ is the same for all objects).

  2. When a worker $j$ observes an object $i$:

    1. A subjective label $z_{ij}$ is sampled from the object's distribution $p(Z_{ij} \mid \mu_i)$ with the parameter $\mu_i$.

    2. A noisy label $x_{ij}$ is generated from the conditional probability distribution $p(X_{ij} \mid z_{ij}; w_j, d_i)$ with two potential parameters $w_j$ and $d_i$.
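
To make the contrast concrete, here is a minimal simulation sketch of the two generative processes; the symmetric-noise corruption model and all names are illustrative assumptions, not the paper's exact parameterisation:

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(z, q, K):
    """Return z with probability q; otherwise a uniformly random other label."""
    if rng.random() < q:
        return z
    return rng.choice([k for k in range(K) if k != z])

def sample_la(theta, q, n_workers, K):
    """LA-model: one latent true label per object, corrupted by every worker."""
    z = rng.choice(K, p=theta)               # drawn once per object
    return [corrupt(z, q, K) for _ in range(n_workers)]

def sample_da(mu, q, n_workers, K):
    """DA-model: a fresh subjective label is drawn each time a worker looks."""
    labels = []
    for _ in range(n_workers):
        z = rng.choice(K, p=mu)              # drawn once per observation
        labels.append(corrupt(z, q, K))
    return labels

print(sample_la(theta=np.array([0.5, 0.5]), q=0.9, n_workers=5, K=2))
print(sample_da(mu=np.array([0.5, 0.5]), q=0.9, n_workers=5, K=2))
```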

Observe that the key difference between Figures 1(a) and 1(b) is in the workers' plate: for the DA-model, the workers' plate extends over the latent subjective labels $Z_{ij}$. Next we show that this difference constitutes a conceptual change for the inference of consensus labels.

Consider a set of noisy labels $\{x_{ij}\}$ for objects $i = 1, \dots, n$ from workers $j = 1, \dots, m$. Traditionally, consensus labels are inferred by maximising the log-likelihood of the observed noisy labels. For the L-assumption, the consensus output for object $i$ is a probability $p(Z_i = z \mid \{x_{ij}\})$ for each possible value $z$, reflecting our confidence that the value $z$ is the unique true label for object $i$. In this setting, the log-likelihood of the observed data is

$$\sum_{i=1}^{n} \log \sum_{z} p(z \mid \theta) \prod_{j \in W_i} p(x_{ij} \mid z), \qquad (1)$$

where the second sum is taken over the set of possible label values, $W_i$ is the set of indices of the workers who labelled the $i$-th object, notation of the form $p(z \mid \theta)$ is used instead of $p(Z_i = z \mid \theta)$ for short, and for the same reason we omit the possible parameters $w_j$ and $d_i$ in $p(x_{ij} \mid z)$. In contrast, under the D-assumption, the consensus output for object $i$ is the distribution of subjective labels $p(Z_{ij} \mid \mu_i)$, parameterised by the object's parameter $\mu_i$, and the log-likelihood is defined as

$$\sum_{i=1}^{n} \log \int p(\mu \mid \theta) \prod_{j \in W_i} \sum_{z} p(z \mid \mu)\, p(x_{ij} \mid z)\, d\mu.$$

To make the discussion of DA-models easier, in the rest of this paper we assume that the parameter $\theta$ of the DA-model is a deterministic distribution assigning to each object a distribution of the object's subjective labels chosen beforehand (but unknown). Then we can remove the parameter $\theta$ in Figure 1(b) and treat the $\mu_i$ as additional parameters of the model, and the expression for the log-likelihood of the observed labels becomes the following:

$$\sum_{i=1}^{n} \sum_{j \in W_i} \log \sum_{z} p(z \mid \mu_i)\, p(x_{ij} \mid z). \qquad (2)$$

Remark. The two described approaches resemble, to some extent, two approaches to topic modelling: the mixture of unigrams model and the latent Dirichlet allocation model [see, e.g., 3]. The first model assumes that each document is associated with a single topic, which specifies the distribution of words in this document; in the second model, each document may be associated with multiple topics, and words in this document are generated from a mixture of the distributions for the topics. By replacing topics with latent labels and words with noisy labels, we obtain the two approaches discussed in this section. Note, however, that in our setting it is necessary that the domains of latent and observed labels are the same, so that latent "topics" are in one-to-one correspondence with the label values; otherwise, we could not interpret the latent distribution.

Finally, let us motivate the novel approach with the following numerical example. Consider 210 workers with highly confident estimates of their expertise, inferred from their labels for many completed tasks. Let each worker assign the correct label with a high probability $q$ (independent of the value of that correct label). Consider a binary classification task with an uninformative prior over the two labels $\{0, 1\}$. Let each of the workers provide one noisy label for a new object, and suppose that 110 of the labels are 1s and 100 are 0s. Under the L-assumption, the posterior probability of the true label being 1 is $\big(1 + ((1-q)/q)^{10}\big)^{-1}$, which for $q$ well above $1/2$ is practically 1; that is, the correct label is deemed to be 1 with near certainty, and we observe a very unlikely event in which 100 out of 210 workers made a mistake. In contrast, under the D-assumption, according to (2), the probability $\mu$ of subjective label 1 should maximise $110 \log\big(\mu q + (1-\mu)(1-q)\big) + 100 \log\big((1-\mu) q + \mu (1-q)\big)$, and we infer that most of the workers provided true (though subjective) labels: for this object the latent Bernoulli distribution has the parameter $\hat\mu = \frac{110/210 - (1-q)}{2q - 1} \approx 110/210$ for $q$ close to 1, which seems much more realistic.
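
A short numerical check of this example; the worker accuracy q = 0.9 is an assumed illustrative value (the argument above holds for any q well above 1/2):

```python
import numpy as np

q, n1, n0 = 0.9, 110, 100          # assumed worker accuracy; observed 1s and 0s
n = n1 + n0

# L-assumption: posterior of the true label being 1 under a uniform prior.
log_odds = (n1 - n0) * (np.log(q) - np.log(1 - q))
posterior_1 = 1.0 / (1.0 + np.exp(-log_odds))

# D-assumption: MLE of mu from p = mu*q + (1-mu)*(1-q), with p_hat = n1/n.
p_hat = n1 / n
mu_hat = (p_hat - (1 - q)) / (2 * q - 1)

print(f"L-assumption posterior P(Z=1): {posterior_1:.6f}")   # ~1.0
print(f"D-assumption estimate of mu : {mu_hat:.3f}")         # ~0.53
```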

3 Consensus models under L- and D-assumptions

A distinguishing feature of the DA-model, shown in Figure 1(b), is that latent subjective labels for an object are sampled each time the object is observed. Thus, given a certain LA-model, the corresponding DA-model is defined as follows. Firstly, the distribution $p(M_i \mid \theta)$ in the DA-model is defined over the domain of the parameter $\theta$ in the LA-model. E.g., for the example models below, $\mu_i$ is the parameter of the multinomial distribution $\mathrm{Mult}(\mu_i)$, and $p(M_i \mid \theta)$ is defined to be a Dirichlet distribution. Secondly, the conditional distribution $p(X_{ij} \mid z_{ij}; w_j, d_i)$ in the DA-model is the conditional distribution of noisy labels defined in the LA-model, with the subjective label $z_{ij}$ substituted as the value of the latent random variable.

The rest of this section describes three established models (for all of them, labels take values in $\{1, \dots, K\}$) as special cases of the traditional LA-model, and their novel counterparts as special cases of the novel DA-model. Later, in Section 6, we empirically compare the performance of all these models.

3.1 Dawid and Skene model
Consider a special case of the LA-model with parameters $\theta$ and $w_j$ defined as follows: $\theta$ is the vector of prior probabilities for label values; the parameter $w_j$ is the confusion matrix of size $K \times K$ for worker $j$, and $p(X_{ij} = x \mid Z_i = z; w_j) = w_j[z, x]$. The model was proposed in [5], and we will refer to it as LA DS.

The corresponding special case of the DA-model is the following: (1) for each object $i$, a vector $\mu_i$ is drawn from a Dirichlet distribution $\mathrm{Dir}(\theta)$; this vector is the parameter of the multinomial distribution of subjective labels for this object; (2) when a worker $j$ observes an object $i$, first a subjective label $z_{ij}$ is drawn from the multinomial distribution $\mathrm{Mult}(\mu_i)$, and then a noisy label is drawn from the multinomial distribution $\mathrm{Mult}(w_j[z_{ij}])$, where $w_j[z_{ij}]$ stands for the $z_{ij}$-th row of the confusion matrix $w_j$. This model will be denoted DA DS.
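
As an illustration, the per-object contribution to the log-likelihood (2) under DA DS marginalises the latent subjective label for each observation separately; a minimal sketch with hypothetical array names (`mu_i`, `W`):

```python
import numpy as np

def da_ds_loglik_object(mu_i, labels, W):
    """Log-likelihood contribution of one object under DA DS.

    mu_i   : (K,) distribution of the object's subjective labels
    labels : list of (worker_index, observed_label) pairs for this object
    W      : (m, K, K) confusion matrices, W[j][z, x] = p(x | z) for worker j
    """
    ll = 0.0
    for j, x in labels:
        # Marginalise the latent subjective label z for THIS observation:
        # p(x) = sum_z mu_i[z] * W[j][z, x]
        ll += np.log(mu_i @ W[j][:, x])
    return ll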

3.2 GLAD
Consider the following special case of the LA-model with parameters $\theta$, $w_j$, and $d_i$: the parameter $\theta$ is absent, meaning that true labels are assumed to be deterministic; the parameter $w_j$ is a scalar value representing the level of expertise of worker $j$; and the parameter $d_i$ is a scalar value representing the level of difficulty of object $i$. Let $\sigma(t) = 1/(1 + e^{-t})$ and define

$$p(X_{ij} = x \mid Z_i = z; w_j, d_i) = \begin{cases} \sigma(w_j d_i), & x = z, \\ \dfrac{1 - \sigma(w_j d_i)}{K - 1}, & x \neq z. \end{cases}$$

This model is called GLAD (a shortcut for Generative model of Labels, Abilities, and Difficulties) and was described in [15]; it will be denoted LA GLAD.

The corresponding DA-model is the following: (1) each object $i$ has a deterministic parameter $\mu_i$; (2) for a worker $j$ and an object $i$, a latent subjective label $z_{ij}$ is sampled from the multinomial distribution $\mathrm{Mult}(\mu_i)$, and then a noisy label is generated from $p(X_{ij} \mid z_{ij}; w_j, d_i)$ defined above. We will refer to it as DA GLAD.
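
In code, the GLAD corruption probability sketched above (the uniform spread of the residual mass over the other K − 1 labels is our reading of the multiclass extension) looks like this:

```python
import numpy as np

def glad_prob(x, z, w_j, d_i, K):
    """p(X = x | Z = z) under GLAD: correct with probability sigmoid(w_j * d_i),
    the remaining mass spread uniformly over the other K - 1 labels."""
    p_correct = 1.0 / (1.0 + np.exp(-w_j * d_i))
    return p_correct if x == z else (1.0 - p_correct) / (K - 1)
```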

3.3 Minimax entropy principle
Consider the model described in [18, Section 2] for multiclass labels; it will be referred to as LA MME. The model was derived using a minimax entropy principle, and it is a special case of the LA-model with $w_j$ and $d_i$ defined as matrices of size $K \times K$. Using these parameters, for each worker $j$ and each object $i$, define $p(X_{ij} = x \mid Z_i = z; w_j, d_i) \propto \exp\big(w_j[z, x] + d_i[z, x]\big)$.

The DA-counterpart is the following: (1) each object $i$ has a deterministic parameter $\mu_i$; (2) for a worker $j$ and an object $i$, a latent subjective label $z_{ij}$ is sampled from the multinomial distribution $\mathrm{Mult}(\mu_i)$, and then a noisy label is generated from $p(X_{ij} \mid z_{ij}; w_j, d_i)$. This model will be denoted DA MME.

4 Theoretical analysis

In this section, we analyse the ability of the two approaches to consensus modelling (based on the L- and D-assumptions) to output "calibrated" probabilities of labels.

Consider one object and assume binary labels $\{0, 1\}$. Consider the generative process for the DA-model as described in Section 2, where $p(Z \mid \mu) = \mathrm{Bernoulli}(\mu)$ and $\mu$ is an unknown object-specific parameter. Assume $n$ workers provide labels for the object independently, and each of them makes a mistake with a known probability $1 - q$, which is the same for all workers.

L-assumption approach. According to the traditional approach, the unknown parameter $\mu$ is estimated by the posterior probability of label 1 for the object:

$$\hat\mu_L = p(Z = 1 \mid k) = \frac{\pi\, q^{k} (1-q)^{\,n-k}}{\pi\, q^{k} (1-q)^{\,n-k} + (1-\pi)\, (1-q)^{k}\, q^{\,n-k}},$$

where $\pi$ is the prior probability of label 1 (common for all objects) and $k$ is the number of ones among the $n$ noisy labels.

Proposition 1. The value of $\hat\mu_L$ is neither an unbiased nor a consistent estimate of $\mu$.

Proof. The generative process based on the D-assumption defines the following distribution of the number of ones $K$: $K \sim \mathrm{Binomial}(n, p)$ with $p = \mu q + (1-\mu)(1-q)$. Consider the expectation of $\hat\mu_L$ with respect to this distribution: $\mathbb{E}[\hat\mu_L] = \sum_{k=0}^{n} \binom{n}{k} p^{k} (1-p)^{n-k}\, \hat\mu_L(k)$. One can see that this is a biased estimate of $\mu$ (see the details in Appendix A).

Now we check the convergence of $\hat\mu_L$ to $\mu$ as $n$ tends to infinity. By the law of large numbers, the fraction of ones among the noisy labels converges: $K/n \to p = \mu q + (1-\mu)(1-q)$ almost surely. Assume that workers are not malicious, i.e. $q > 1/2$. Note that $\hat\mu_L$ is a monotone function of $2k - n$; therefore $\hat\mu_L$ converges to different values depending on $p$: to 1 when $p > 1/2$ and to 0 when $p < 1/2$ (see the details in Appendix A). Thus, $\hat\mu_L$ is not a consistent estimate of $\mu$ unless the latent distribution is degenerate, i.e. $\mu \in \{0, 1\}$.

D-assumption approach. According to the novel approach based on the D-assumption, $K$ is a binomial random variable with parameters $(n, p)$, where $p = \mu q + (1-\mu)(1-q)$ is the probability that an observed noisy label equals 1. The estimate $\hat\mu_D$ of the parameter $\mu$ is the value providing the maximum of the log-likelihood of the observed noisy labels:

$$\hat\mu_D = \arg\max_{\mu}\; \Big[ k \log\big(\mu q + (1-\mu)(1-q)\big) + (n-k) \log\big(1 - \mu q - (1-\mu)(1-q)\big) \Big] = \frac{k/n - (1-q)}{2q - 1}.$$

Proposition 2. The value of $\hat\mu_D$ is an unbiased and consistent estimate of $\mu$.

Proof. Note that $\hat\mu_D = \frac{K/n - (1-q)}{2q - 1}$, where $K \sim \mathrm{Binomial}(n, p)$. Therefore, we have $\mathbb{E}[\hat\mu_D] = \frac{\mathbb{E}[K]/n - (1-q)}{2q - 1} = \frac{p - (1-q)}{2q - 1} = \mu$. The second equality uses the linearity of expectation. Note that $K/n$ is a consistent estimate of the success probability $p$ of the binomial $K$. Therefore, $\hat\mu_D$ is a consistent estimate of $\mu$.
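
Both propositions are easy to verify by Monte Carlo simulation; a sketch with illustrative values of mu, q, and n (none of which come from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, q, n, runs = 0.7, 0.8, 50, 20000   # illustrative values

est_L, est_D = [], []
for _ in range(runs):
    # DA generative process: a fresh subjective label per worker, then noise.
    subj = rng.random(n) < mu           # subjective labels
    correct = rng.random(n) < q         # whether each worker reports faithfully
    k = int(np.sum(subj == correct))    # number of observed ones
    # L-assumption: posterior of label 1 under a uniform prior (pi = 1/2).
    log_odds = (2 * k - n) * (np.log(q) - np.log(1 - q))
    est_L.append(1 / (1 + np.exp(-log_odds)))
    # D-assumption: closed-form maximum-likelihood estimate.
    est_D.append((k / n - (1 - q)) / (2 * q - 1))

print(f"true mu                   : {mu}")
print(f"mean L-assumption estimate: {np.mean(est_L):.3f}")  # noticeably biased
print(f"mean D-assumption estimate: {np.mean(est_D):.3f}")  # close to mu
```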

5 Details of implementation

In this section we describe our implementation of the six consensus models described in Sections 3.1–3.3. Our implementation allows us to directly maximise the log-likelihood of the observed noisy labels instead of using EM, which optimises a lower bound on the log-likelihood.

For LA-models, the log-likelihood (1) is a function of the parameters $\theta$, $\{w_j\}$, and $\{d_i\}$. To maximise the log-likelihood we use a standard conjugate gradient descent algorithm, and to compute gradients at each iteration we use a proprietary library for automatic differentiation, which is an implementation of the operator overloading approach [see, e.g., 2]. Finally, using the maximum likelihood estimates of the parameters, we compute the consensus output for each object $i$ as the posterior probabilities, for all $z$:

$$p(Z_i = z \mid \{x_{ij}\}) \propto p(z \mid \hat\theta) \prod_{j \in W_i} p(x_{ij} \mid z; \hat w_j, \hat d_i).$$

For DA-models, the log-likelihood (2) is a function of the parameters $\{\mu_i\}$, $\{w_j\}$, and $\{d_i\}$, and the same approach is used to maximise it. The consensus output for each object $i$ is the estimated distribution of the object's subjective labels, i.e. $p(Z_{ij} = z \mid \hat\mu_i)$ for all $z$.
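
Our automatic-differentiation library is proprietary; as a rough open substitute, the DA DS log-likelihood (2) can be maximised with SciPy's conjugate-gradient routine and numerical gradients. The sketch below uses a softmax reparameterisation to keep the probabilities valid; all names and the tiny data set are illustrative:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import softmax

def neg_loglik(params, labels, n_obj, m, K):
    """Negative log-likelihood (2) for DA DS; parameters packed in one vector."""
    mus = softmax(params[: n_obj * K].reshape(n_obj, K), axis=1)
    W = softmax(params[n_obj * K :].reshape(m, K, K), axis=2)  # confusion rows
    ll = 0.0
    for i, j, x in labels:               # (object, worker, observed label)
        ll += np.log(mus[i] @ W[j][:, x] + 1e-12)
    return -ll

# Tiny illustrative data set: 3 objects, 2 workers, binary labels.
labels = [(0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 0), (2, 0, 1), (2, 1, 0)]
n_obj, m, K = 3, 2, 2
# Small random init breaks the symmetry of the uniform starting point.
x0 = 0.01 * np.random.default_rng(0).standard_normal(n_obj * K + m * K * K)
res = minimize(neg_loglik, x0, args=(labels, n_obj, m, K), method="CG")
mus_hat = softmax(res.x[: n_obj * K].reshape(n_obj, K), axis=1)
print(np.round(mus_hat, 3))              # estimated subjective distributions
```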

The output of any consensus model depends on the initial values of the parameters. However, this does not affect our comparisons of the LA- and DA-versions of models: the same initial values are used for the common parameters in each pair of models. For the pair of DS models, confusion matrices are initialised as follows: for each object we first compute the vector of relative frequencies of noisy labels for the object; then for each worker we compute a matrix of counts by considering each label of the worker and adding the corresponding object's vector to the column of the worker's matrix of counts; finally, we normalise each row of the worker's matrix of counts to make it sum to 1 (a code sketch of this procedure follows below). The parameter $\theta$ for LA DS is initialised by the frequency of each label value among all noisy labels. For the pair of GLAD models, workers' parameters are initialised by 1 and objects' parameters are initialised as suggested in the original paper [15]. For the pair of MME models, workers' confusion matrices are initialised by computing the matrices of counts for each worker (in the same way as for the DS models) and then taking the logarithm of each element of the matrices (to avoid zeros, we add 1 to each element of the matrices of counts for workers and objects); objects' parameters are initialised in a similar way: for each worker we compute the vector of relative frequencies of noisy labels from this worker, then for each object we compute its matrix of counts by considering each noisy label for this object and adding the corresponding worker's vector to the row of the matrix of counts, and finally we take the logarithm of each element of the matrix. (Such nontrivial initialisation of the confusion matrices in the DS and MME models leads to better results compared to a trivial uniform initialisation of all matrix elements.) Parameters of the objects' latent distributions, $\mu_i$, in all DA-models are initialised by the frequencies of the noisy labels for each object $i$.
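
A sketch of the counts-based initialisation of the DS confusion matrices described above (array layout and names are our own):

```python
import numpy as np

def init_confusion_matrices(labels, n_obj, m, K):
    """Initialise a K x K confusion matrix per worker, as described above.

    labels: list of (object, worker, observed label) triples.
    """
    # Step 1: relative frequencies of noisy labels for each object.
    freq = np.zeros((n_obj, K))
    for i, _, x in labels:
        freq[i, x] += 1
    freq /= freq.sum(axis=1, keepdims=True)

    # Step 2: for each worker's label x on object i, add the object's
    # frequency vector to column x of the worker's matrix of counts.
    counts = np.zeros((m, K, K))
    for i, j, x in labels:
        counts[j, :, x] += freq[i]

    # Step 3: normalise each row to sum to 1 (epsilon guards empty rows).
    counts += 1e-12
    return counts / counts.sum(axis=2, keepdims=True)
```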

Data set (# classes)   | RFE   | DS LA  | DS DA | GLAD LA | GLAD DA | MME LA | MME DA
Duchenne smiles (2)    | 72.08 | +4.65  | +4.03 | +3.4    | +5.28   | -31.82 | +0.88
Web search (5)         | 73.03 | +12.69 | +6.66 | +8.2    | +7.3    | +3.04  | +10.88
TREC (3)               | 45.48 | +6.43  | +3.47 | -0.78   | -0.1    | +0.72  | +1.51
Textual entailment (2) | 89.92 | +2.7   | +2.33 | +3.2    | +2.08   | -5.55  | -2.42
Temporal ordering (2)  | 93.55 | +0.61  | +0.82 | +0.39   | +0.82   | -2.21  | -1.56
Adult content (4)      | 76.04 | -0.36  | +1.44 | -0.36   | -0.96   | -1.56  | +2.34
Price (7)              | 32.5  | -1.25  | +1.25 | 0.0     | 0.0     | -2.5   | +2.5
Table 1: Accuracy (in %) of the models across real data sets; DS = Dawid & Skene, MME = minimax entropy. Results for the six nontrivial models are shown relative to the RFE result.
Data set (# classes)   | RFE   | DS LA  | DS DA | GLAD LA | GLAD DA | MME LA | MME DA
Duchenne smiles (2)    | ∞     | 2.467  | 0.821 | 2.074   | 0.753   | 3.464  | 0.817
Web search (5)         | ∞     | 0.247  | 0.364 | 0.408   | 0.468   | 0.747  | 0.331
TREC (3)               | ∞     | 1.403  | 1.143 | 1.28    | 1.116   | 1.779  | 1.196
Textual entailment (2) | 0.509 | 0.554  | 0.363 | 0.308   | 0.377   | 0.9    | 0.454
Temporal ordering (2)  | ∞     | 0.841  | 0.26  | 0.331   | 0.252   | 0.817  | 0.286
Adult content (4)      | ∞     | 1.345  | 0.507 | 0.875   | 0.535   | 1.385  | 0.501
Price (7)              | ∞     | 80.365 | 3.128 | 16.357  | 0.917   | 42.651 | 1.675
Table 2: Log loss results across real data sets; column labels as in Table 1. Entries marked ∞ correspond to data sets on which the unsmoothed RFE baseline assigns zero probability to some ground-truth label (see Section 6).

6 Experiments

In this section we empirically show that: (1) the novel D-assumption is more realistic than the L-assumption, and, as a result, DA-models produce more accurate consensus labels; (2) models designed under the D-assumption are capable of recovering latent distributions of labels. A straightforward way to check the first proposition is to compare the performance of an LA-model against its DA-counterpart on real data. However, as the distributions of labels for real objects are unknown, we use synthetic data sets to check the second statement.

Consider multiclass labels $\{1, \dots, K\}$. For each object $i$, a model M produces a consensus label $\hat p_i$, which is a vector of length $K$ whose $z$-th element is the probability of the label value $z$. Note that all models, LA- and DA-, produce probabilistic labels, but, as explained in the previous sections, the intuition behind the probabilistic output of an LA-model is "confidence" in each label value, whereas for a DA-model it is the estimated probability of each label.

Relative frequency estimator (RFE). Given a set of noisy labels for an object $i$, a standard baseline for consensus modelling is to estimate the probabilities as $\hat p_i[z] = \frac{1}{|W_i|} \sum_{j \in W_i} \mathbb{1}[x_{ij} = z]$ for all $z$, where $W_i$ is the set of indices of the workers that labelled object $i$.
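
The baseline in code form, a direct transcription of the formula above (the data layout is our own choice):

```python
import numpy as np

def rfe(noisy_labels, K):
    """Relative frequency estimator: empirical label distribution per object.

    noisy_labels: dict mapping object id -> list of observed labels in {0..K-1}.
    """
    return {
        i: np.bincount(labels, minlength=K) / len(labels)
        for i, labels in noisy_labels.items()
    }

print(rfe({"obj1": [1, 1, 0, 1]}, K=2))   # {'obj1': array([0.25, 0.75])}
```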

For each data set, the training set is a set of objects with multiple noisy labels per object, and the test set is the subset of those objects with a known ground-truth label. The quality of the consensus labels produced from the training set is evaluated over the test set. Let $T$ be the set of indices of the test-set objects and, for an object $i \in T$, let $g_i$ be its ground-truth label. We use the following two metrics to evaluate the performance of a model on the test set:

Accuracy. For each object $i$ in the training set, a model M produces a probabilistic label $\hat p_i$, and the estimated label for this object is $\hat z_i = \arg\max_z \hat p_i[z]$. (If the maximum is attained at several label values, we randomly choose one of them. In our experiments, RFE accuracy is averaged over 10 runs to reduce the effect of random tie-breaking; there was no need to average the results of the nontrivial models, as no ties were observed.) Accuracy over the test set is defined as $\frac{1}{|T|} \sum_{i \in T} \mathbb{1}[\hat z_i = g_i]$, where $\mathbb{1}[\cdot]$ is the indicator function.

Log loss. Given consensus labels based on a model M, the log loss over the test set is the mean negative log-likelihood of the test-set labels, computed as $-\frac{1}{|T|} \sum_{i \in T} \log \hat p_i[g_i]$. (For this definition, the log loss of non-informative probabilistic labels, consisting of probabilities $1/K$, is $\log K$.)
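
Both metrics in a minimal sketch, with the random tie-breaking for accuracy mentioned above:

```python
import numpy as np

rng = np.random.default_rng(0)

def accuracy(probs, truth):
    """probs: dict object -> probability vector; truth: dict object -> label."""
    hits = 0
    for i, g in truth.items():
        p = probs[i]
        best = np.flatnonzero(p == p.max())     # all argmax candidates
        hits += int(rng.choice(best) == g)      # break ties uniformly at random
    return hits / len(truth)

def log_loss(probs, truth):
    """Mean negative log-likelihood of the ground-truth labels."""
    return -np.mean([np.log(probs[i][g]) for i, g in truth.items()])

probs = {0: np.array([0.25, 0.75]), 1: np.array([0.5, 0.5])}
truth = {0: 1, 1: 0}
print(accuracy(probs, truth), log_loss(probs, truth))
```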

Figure 2: Performance of the LA GLAD (red line with squares) and DA GLAD (green line with dots) on the Duchenne smiles data set. The horizontal axis is for the mean number of labels per object.

Real-world data. First, we compare the seven models across the following data sets: Duchenne smiles [15], Web search [18], TREC [4], Textual entailment [12], Temporal ordering [12], Adult content [7], Price [9]. For each data set, ground-truth labels for a subset of objects are provided by domain experts. (See Appendix B for task descriptions and summary statistics of the data sets.) Table 1 shows the accuracy results. For each data set, the accuracy of each nontrivial model is shown as the difference between the model result and the result for the RFE baseline (given in the second column). As a summary, among the 21 comparisons of LA- and DA-models, the DA-counterpart is better in 13 out of 21 cases. Besides this, the novel approach yields the best model for 4 out of 7 data sets. Table 2 shows the log loss of the seven models across the seven data sets (to avoid infinite losses for RFE on the test set, we applied add-one smoothing to its probability estimates). Remarkably, for the log loss, among the 21 comparisons of LA- and DA-models, the DA-counterpart is better in 18 out of 21 cases, and for 5 out of 7 data sets the novel approach yields the best model. Let us highlight some interesting observations from Tables 1 and 2. First, consider the Textual entailment data set: according to the log loss, the performance of the RFE baseline is infinitely bad (this happens when none of the noisy labels for an object equals the ground truth) for all the data sets except Textual entailment, which may indicate that this task was easy, in the sense that the traditional L-assumption is likely to hold, and the DA-models suffer due to their greater flexibility. Secondly, consider the TREC data set, where, according to the log loss, the quality of the probabilities obtained by any model, LA or DA, is worse than that of non-informative uniform probabilities. We checked that the mean number of labels per object for this data set is 4.6, which may not be enough to obtain realistic distributions over three classes. We explain the results for the Web search data set similarly: its mean number of labels per object is 5.8.

The number of labels per object. To analyse how the number of labels per object affects the performance of the LA- and DA-models, we conducted the following experiment. As an example, we take the Duchenne smiles data set and two models, LA GLAD and DA GLAD. Figure 2 shows the accuracy and the log loss on the Duchenne smiles data as the mean number of labels per object grows. For each integer $v$, we sampled a part of the data set such that the number of labels per object is not greater than $v$. This was done as follows: given the noisy labels for an object, we randomly drew $v$ of them (if the number of noisy labels for the object was at most $v$, all of its noisy labels were taken). For each sample of the data set, we obtained consensus labels using the two models and evaluated their performance as before. The results shown in Figure 2 are averaged over repeated simulations. When the number of noisy labels per object is small, the DA-model is worse than the LA-model according to both performance metrics. However, as the number of labels per object increases, the performance of the DA-model notably exceeds that of the LA-model at almost all overlap values. In particular, for the log loss, we can see that for the DA-model, the bigger the overlap of noisy labels per object, the better the probabilistic labels, whereas for the LA-model, as the number of labels per object grows, the aggregated probabilistic labels converge to deterministic values and the log loss on the test set becomes worse.

Table 3: Mean squared error between true and estimated probabilities; rows correspond to the LA- and DA-approaches to consensus modelling. The data generating process is based on the D-assumption.

Figure 3: Calibration plots for the two approaches to consensus modelling: (a) LA-model; (b) DA-model. The horizontal axis shows the true probabilities and the vertical axis the estimated probabilities.

Synthetic data. Inspired by the analysis in Section 4, we next empirically examine the ability of the two approaches to consensus modelling to recover the underlying probabilities of labels. Data for this experiment are generated similarly to [15]: we used a set of objects and 20 workers, with each worker labelling each object; for each object $i$ we randomly generated its latent parameter $\mu_i$, which is the probability of label 1; to produce noisy labels we generated GLAD parameters $w_j$ and $d_i$. Using these parameters, a data set was generated according to DA GLAD. Then, for this data set, we used our implementations of LA GLAD and DA GLAD to obtain estimates $\hat\mu_i$. Table 3 shows the mean squared error, $\frac{1}{n} \sum_i (\hat\mu_i - \mu_i)^2$, averaged across 10 simulations. This confirms that the DA-model is able to accurately estimate the underlying true probabilities. Figure 3 shows calibration plots for the estimated probabilities obtained using the two approaches to consensus modelling. It demonstrates that the probabilities estimated by the DA-model are empirically well calibrated, whereas the LA-model is biased towards degenerate distributions.
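
A simplified version of this synthetic check can be run with the closed-form estimator from Section 4 standing in for the full DA GLAD fit, so it illustrates only the calibration methodology; all constants are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n_obj, n_workers, q = 1000, 20, 0.8

mu = rng.random(n_obj)                       # latent per-object parameters
subj = rng.random((n_obj, n_workers)) < mu[:, None]
observed = np.where(rng.random((n_obj, n_workers)) < q, subj, ~subj)

p_hat = observed.mean(axis=1)                # fraction of observed ones
mu_hat = np.clip((p_hat - (1 - q)) / (2 * q - 1), 0, 1)   # D-style estimate

print("MSE:", np.mean((mu_hat - mu) ** 2))
# Calibration: bin the true mu and compare with the mean estimate per bin.
bins = np.digitize(mu, np.linspace(0, 1, 11)) - 1
for b in range(10):
    mask = bins == b
    if mask.any():
        print(f"true~{mu[mask].mean():.2f}  est~{mu_hat[mask].mean():.2f}")
```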

7 Conclusion

We have proposed the novel D-assumption for consensus modelling and a way of constructing novel DA-models under this assumption. It has been shown that the DA-models are able to obtain proper estimates of true probabilities, in contrast to the established LA-models. Our experiments confirm that the DA-models often perform better on real data sets, implying that the D-assumption is more realistic than the L-assumption. However, there is no universal model that is the best for every data set (even among the state-of-the-art models). Thus, it always makes sense to produce golden labels (by professional judges) for at least a small part of any data set, in order to evaluate all existing models and check which one serves the data at hand best. In that sense, our novel D-assumption approach is complementary to the family of existing consensus models and allows their number to be doubled, by implementing our alternative noisy-label generation process for each of them.

References

  • [1] Y. Bachrach, T. Graepel, T. Minka, and J. Guiver. How to grade a test without knowing the answers—a Bayesian graphical model for adaptive crowdsourcing and aptitude testing. arXiv preprint arXiv:1206.6386, 2012.
  • [2] M. Bartholomew-Biggs, S. Brown, B. Christianson, and L. Dixon. Automatic differentiation of algorithms. Journal of Computational and Applied Mathematics, 124:171–190, 2000.
  • [3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
  • [4] C. Buckley, M. Lease, M. D. Smucker, H. J. Jung, and C. Grady. Overview of the TREC 2010 relevance feedback track (notebook). In The Nineteenth Text Retrieval Conference (TREC) Notebook, 2010.
  • [5] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, pages 20–28, 1979.
  • [6] X. Geng. Label distribution learning. IEEE Transactions on Knowledge and Data Engineering, 28(7):1734–1748, 2016.
  • [7] P. G. Ipeirotis, F. Provost, and J. Wang. Quality management on Amazon Mechanical Turk. In Proceedings of the ACM SIGKDD Workshop on Human Computation, pages 64–67, 2010.
  • [8] H. Kim and Z. Ghahramani. Bayesian classifier combination. In International Conference on Artificial Intelligence and Statistics, pages 619–627, 2012.
  • [9] Q. Liu, A. T. Ihler, and M. Steyvers. Scoring workers in crowdsourcing: How many control questions are enough? In Advances in Neural Information Processing Systems, pages 1914–1922, 2013.
  • [10] A. T. Nguyen, M. Halpern, B. C. Wallace, and M. Lease. Probabilistic modeling for crowdsourcing partially-subjective ratings. In Fourth AAAI Conference on Human Computation and Crowdsourcing, 2016.
  • [11] P. Ruvolo, J. Whitehill, and J. R. Movellan. Exploiting commonality and interaction effects in crowdsourcing tasks using latent factor models. 2013.
  • [12] R. Snow, B. O'Connor, D. Jurafsky, and A. Y. Ng. Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 254–263, 2008.
  • [13] M. Venanzi, J. Guiver, G. Kazai, P. Kohli, and M. Shokouhi. Community-based Bayesian aggregation models for crowdsourcing. In Proceedings of the 23rd International Conference on World Wide Web, pages 155–164, 2014.
  • [14] E. M. Voorhees. Variations in relevance judgments and the measurement of retrieval effectiveness. Information Processing & Management, 36:697–716, 2000.
  • [15] J. Whitehill, T. Wu, J. Bergsma, J. R. Movellan, and P. L. Ruvolo. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in Neural Information Processing Systems, pages 2035–2043, 2009.
  • [16] D. Zhou, S. Basu, Y. Mao, and J. C. Platt. Learning from the wisdom of crowds by minimax entropy. In Advances in Neural Information Processing Systems 25, pages 2195–2203, 2012.
  • [17] D. Zhou, Q. Liu, J. Platt, and C. Meek. Aggregating ordinal labels from crowds by minimax conditional entropy. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 262–270, 2014.
  • [18] D. Zhou, Q. Liu, J. C. Platt, C. Meek, and N. B. Shah. Regularized minimax conditional entropy for crowdsourcing. arXiv preprint arXiv:1503.07240, 2015.

Appendix A Theoretical analysis for the latent label assumption

Recall that we consider one object whose "true" label is $Z \sim \mathrm{Bernoulli}(\mu)$, where $\mu$ is an unknown object-specific parameter. Given $n$ noisy labels for this object, such that each noisy label is correct with a known probability $q$, the goal is to estimate the parameter $\mu$.

According to the LA-model, the unknown parameter $\mu$ is estimated by the posterior probability of label 1 for the object:

$$\hat\mu_L = \frac{\pi\, q^{k} (1-q)^{\,n-k}}{\pi\, q^{k} (1-q)^{\,n-k} + (1-\pi)\, (1-q)^{k}\, q^{\,n-k}},$$

where $\pi$ is the prior probability of label 1 (common for all objects) and $k$ is the value of the random variable $K$, the number of ones among the noisy labels.

Proposition 1.

The posterior probability of label 1 obtained under the L-assumption is neither an unbiased nor a consistent estimate of the true probability $\mu$ underlying the generative process of the DA-model.

Proof.

The posterior probability of label 1 for the object is

$$\hat\mu_L = \frac{\pi\, q^{k} (1-q)^{\,n-k}}{\pi\, q^{k} (1-q)^{\,n-k} + (1-\pi)\, (1-q)^{k}\, q^{\,n-k}},$$

where $\pi$ is the prior probability of label 1 (common for all objects) and $k$ is the value of the random variable $K$, the number of ones among the noisy labels.

The generative process based on the D-assumption defines the following distribution over the number of ones $K$:

$$p(K = k) = \binom{n}{k}\, p^{k} (1-p)^{\,n-k}, \qquad p = \mu q + (1-\mu)(1-q).$$

Consider the expectation of $\hat\mu_L$ with respect to this distribution:

$$\mathbb{E}[\hat\mu_L] = \sum_{k=0}^{n} \binom{n}{k}\, p^{k} (1-p)^{\,n-k}\, \hat\mu_L(k). \qquad (3)$$

To demonstrate that this is a biased estimate of the true parameter $\mu$, we evaluate this expression as a function of $\mu$ for different values of $n$ and $q$ and the uniform prior $\pi = 1/2$. Figure 4, showing the dependence of the expected value of $\hat\mu_L$ on $\mu$, confirms that $\hat\mu_L$ is a biased estimate of $\mu$.

Figure 4: The expected values of $\hat\mu_L$ given by (3) as a function of $\mu$ for different numbers of noisy labels $n$, different values of $q$, and the uniform prior $\pi = 1/2$. The three plots correspond to three increasing values of $n$; different values of $q$ are shown by colours.

Now we check whether $\hat\mu_L$ converges to $\mu$ as the number of noisy labels $n$ tends to infinity. By the law of large numbers, the fraction of ones among the noisy labels converges:

$$K/n \to p = \mu q + (1-\mu)(1-q) \quad \text{almost surely}.$$

Assume that workers are not malicious, i.e. $q > 1/2$. Note that

$$\hat\mu_L = \left(1 + \frac{1-\pi}{\pi} \left(\frac{1-q}{q}\right)^{2k - n}\right)^{-1},$$

therefore $\hat\mu_L$ converges to different values depending on $p$:

$$\hat\mu_L \to \begin{cases} 1, & p > 1/2, \\ 0, & p < 1/2. \end{cases} \qquad (4)$$

Thus, $\hat\mu_L$ is not a consistent estimate of $\mu$ unless the latent distribution is degenerate, i.e. $\mu \in \{0, 1\}$. ∎

Appendix B Data sets details

We use the following public data sets for our empirical studies:

  • Duchenne smiles [15]. The task is to classify images into two categories – a Duchenne smile (“enjoyment” smile) and a non-Duchenne (“social” smile).

  • Web search [18] and TREC [4]. The task is to rate query-URL pairs. For a given query-URL pair, a worker is asked to rate how relevant is the URL to the search query. For the Web search data, the rating scale has 5 levels: perfect, excellent, good, fair, or bad. For the TREC data, the rating scale is ternary: highly relevant, relevant, and non-relevant.

  • Recognising Textual Entailment [12]. For this task each object contains two statements, and a worker judges whether one statement implies the other.

  • Temporal Ordering [12]. Each object describes two events and the task is to judge whether one event follows another.

  • Adult content [7]. The task is to classify web pages into four categories depending on the presence of adult content on them.

  • Price [9]. The task is to estimate prices of household items choosing one out of seven adjacent bins corresponding to different ranges of price.

Data set           | # Cl. | # Obj. | # Workers | # Samples | Lab. per obj. (mean) | Lab. per worker (mean) | Lab. per worker (med.) | Gr. truth objects
Duchenne smiles    | 2     | 2134   | 64        | 30319     | 14.2                 | 473.7                  | 109                    | 159
Web search         | 5     | 2665   | 177       | 15567     | 5.8                  | 87.9                   | 19                     | 2653
TREC               | 3     | 20026  | 762       | 91783     | 4.6                  | 120.5                  | 18                     | 3275
Textual entailment | 2     | 800    | 164       | 8000      | 10.0                 | 48.8                   | 20                     | 800
Temporal ordering  | 2     | 462    | 76        | 4620      | 10.0                 | 60.8                   | 20                     | 462
Adult content      | 4     | 11040  | 825       | 92721     | 8.4                  | 112.4                  | 18                     | 1517
Price              | 7     | 80     | 155       | 12400     | 155.0                | 80.0                   | 80                     | 80
Table 4: Summary statistics for the real data sets.

Table 4 provides summary statistics for all the data sets. The table includes the following columns: the size of the label set, the number of objects, the number of crowdsourcing workers, the total number of noisy labels from those workers (the size of the training set), the mean number of labels per object, the mean and median number of labels per worker, and the number of objects with known ground-truth labels (the size of the test set).