
GPM: A Generic Probabilistic Model to Recover Annotator's Behavior and Ground Truth Labeling

In the big data era, data labels can be obtained through crowdsourcing. Nevertheless, the obtained labels are generally noisy, unreliable or even adversarial. In this paper, we propose a probabilistic graphical annotation model to infer the underlying ground truth and the annotators' behavior. To accommodate both discrete and continuous application scenarios (e.g., classifying scenes vs. rating videos on a Likert scale), the underlying ground truth is modeled as a distribution rather than a single value. In this way, the reliable but potentially divergent opinions from "good" annotators can be recovered. The proposed model is able to identify whether an annotator has worked diligently on the task during the labeling procedure, which can be used for the further selection of qualified annotators. Our model has been tested on both simulated and real-world data, where it always shows superior performance compared to other state-of-the-art models in terms of accuracy and robustness.



I Introduction

Nowadays, the amount of digital data generated every day is enormous, and the pace of data generation is still accelerating. To deal with such an amount of information, plenty of automatic solutions have been proposed and applied by various research communities such as the database, data mining and computer vision communities. Meanwhile, crowdsourcing has been adopted as a key problem-solving approach to information collection for problems that are difficult for computers, particularly the collection of ground truth labels/scores for training supervised machine learning models. Many crowdsourcing platforms, such as Amazon Mechanical Turk (MTurk) and CrowdFlower, have been widely used [11].

Crowdsourcing tasks are mostly performed in an uncontrolled environment. The lack of control over many factors such as the people (i.e., the workers/annotators), the procedure, and the environment introduces a considerable amount of noise and unreliability, leading to less trustworthy results than those produced in a better-controlled test environment. Such low-quality answers further make the inference of correct answers (i.e., the so-called ground truth) a challenging task. Some existing crowdsourcing paradigms rely on redundancy-based methods to discover the annotators’ quality, such as using a “golden task” before the test or a hidden task during the test [28]. However, the inability to cope with the uncertainty of the annotators’ reliability and the difficulty of the crowdsourcing task largely limits the efficiency of such approaches.

Generally, we believe that the noise in a labeling task mainly comes from two aspects, i.e., the annotator’s reliability and the task’s difficulty. We use reliability to measure how likely an annotator is to respond to a question seriously [27]. In one labeling task, the reliable annotators answer the question seriously, while the unreliable annotators answer the question either by picking a random answer or always the same answer (arbitrarily, for the payment), or even by maliciously giving false answers to trick the system. An annotator’s reliability may vary during the labeling procedure, i.e., some annotators always give wrong answers, but some annotators may give wrong answers only a few times. In addition, we should notice that for some tasks, different annotators may have different opinions. Generally, most of the reliable annotators agree with the population’s consensus, whereas some annotators do have their own different opinions, which we should respect. Their answers should thus be considered divergent instead of unreliable or untrustworthy. The other aspect to be considered is the task difficulty, which is a major source of noise as well. In a more difficult task, it is more likely that the annotators give a wrong answer. The level of task difficulty determines the probability of obtaining noisy data. In conclusion, to infer the ground truth label, the noise from the annotators and the task difficulty should be considered, modeled and removed.

The traditional methods, such as Majority Voting, Mean or Median, fail to resolve this issue as they treat every annotator as equally reliable and every task as equally difficult. In the database and data mining research communities, various models have been proposed [28][13][16]. Nevertheless, most of the models are designed for a particular application, either for the discrete labeling case or for the continuous labeling case. In addition, their modeling of the annotator’s behavior cannot handle different unreliable behaviors such as random/repeated/malicious labeling. Furthermore, when a single value (discrete or continuous) is considered as the ground truth, the model cannot correctly deal with reliable but “different” opinions. Thus, in our study, we propose to use a categorical distribution as the ground truth of the label rather than one value, which makes our model applicable to different applications (class label, continuous label, decision label) and able to capture reliable but different answers. In addition, the annotator’s behavior is modeled by a latent variable (probability) that switches between reliable and unreliable behavior. The underlying ground truth distribution and the annotator’s behavior are estimated through Maximum Likelihood Estimation (MLE) using the Expectation-Maximization (EM) algorithm. The proposed model is tested on both simulated data and real-world data, where it always shows superior performance compared to other state-of-the-art methods.

In conclusion, the contribution of our work is four-fold:

  • A simple generic probabilistic model is proposed and validated for different labeling applications by considering the categorical distribution as the ground truth.

  • Different opinions from reliable annotators are considered in our model, which are integrated into the ground truth distribution.

  • Task difficulty can be obtained from the entropy of the ground truth distribution, which can be used for further active labeling (assigning more annotators to difficult tasks than to easy ones).

  • The annotator’s different unreliable behaviors (random, repeated, malicious etc.) can be captured and removed from the ground truth recovery. In addition, the estimated reliability level can be further used for the selection of annotators in crowdsourcing.

The rest of this paper is organized as follows. The related work is introduced in Section II. Section III describes our proposed model in detail. To validate our model, Section IV introduces all the experiments that have been conducted. Finally, Section V concludes this work. The code of this work can be found on GitHub. This work was done when Jing Li was with the University of Nantes.

II Related work

Data labeling tasks can be classified into class labeling, decision (binary) labeling and continuous labeling. Generally, an annotation model (or truth inference model) is designed for one particular type of labeling, either discrete or continuous. These models can be further classified based on how they model the task and how they model the annotator’s quality. Zheng et al. [28] provide a thorough review of the state-of-the-art annotation models. In this section, we select the most representative ones for the readers’ reference.

II-A Class labeling

A class labeling task asks annotators to select a single class or multiple classes (or categories) out of the candidate classes (or categories). For example, in Computer Vision (CV) image classification tasks, the annotators are asked to label the object (cat, dog, bird, etc.) in an image, where the truth is unique. In Natural Language Processing (NLP) text classification tasks, the annotators are asked to label the topics of a document, where there can be multiple answers. The observed labels and the ground truth labels are treated as (discrete) class labels; no ordering is involved.

The Dawid-Skene model [4], proposed in 1979, is a classic, efficient and effective class labeling model that has been validated by many studies [10][28]. This model uses a confusion matrix to model an annotator’s quality in answering single-choice tasks. In each confusion matrix, the row index represents the ground truth label, and the value in each column represents the probability that the annotator gives the column index as the answer.

The models in [9], [24] and [19] adopt a similar idea.

Another typical way to model the annotator’s behavior is by a single quality value, which represents the ability of this annotator to answer a task correctly [5][8][15][1][12]. A typical representative among these models is GLAD, proposed in [25]. In this model, the annotator’s quality $\alpha$ lies in a wide range, $\alpha \in (-\infty, +\infty)$, where a quality value $\alpha < 0$ implies an adversarial annotator (who always gives adversarial answers). In addition, this model considers task difficulty while the Dawid-Skene model does not. When the task difficulty increases ($1/\beta \to \infty$), the probability of obtaining the correct answer decreases towards 0.5 rather than $1/K$ ($K$ is the total number of classes), which is not the case in real applications and has been challenged by some researchers in the community [10].
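To make the 0.5 limit concrete, the labeling probability in GLAD is usually written as a logistic function of annotator quality times inverse difficulty. The expression below is a sketch in the common presentation of GLAD; the symbols $z_j$ (true label of task $j$), $\alpha_i$ and $\beta_j$ are assumptions of this illustration and are not part of the notation in Table I.

```latex
p(y_{ij} = z_j \mid \alpha_i, \beta_j) = \frac{1}{1 + e^{-\alpha_i \beta_j}},
\qquad \alpha_i \in (-\infty, +\infty), \quad 1/\beta_j \in [0, \infty)

\lim_{1/\beta_j \to \infty} p(y_{ij} = z_j \mid \alpha_i, \beta_j)
  = \frac{1}{1 + e^{0}} = \frac{1}{2}
```

With $K > 2$ classes, a pure guess would be correct with probability $1/K$, which is why a limit of $1/2$ is questioned for multi-class tasks.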

Several studies have compared the performance of the Dawid-Skene model and the GLAD model on different crowdsourcing databases [10][28]. Generally, Dawid-Skene is more reliable and robust than GLAD.

II-B Decision labeling

A decision labeling task requires the annotators to provide a “decision”, True or False, as the answer. A class labeling task can easily be converted to several decision labeling tasks by asking, for example, “Is this a dog in the image?” “Is this a cat in the image?” etc. Thus, most of the class labeling models can be easily extended to decision labeling tasks, and vice versa [5][25][4][29][9][24][19].

Similar to the class labeling models, the annotator’s behavior is generally modeled either by a quality value (for example, quality values range between 0 and 1, where 1 represents experts, 0.5 represents spammers, and less than 0.5 represents adversaries [8][15]) or by a confusion matrix [15][9]. The annotation procedure is generally simulated by a Bernoulli distribution.

Different from the models mentioned above, [26] is a supervised learning model. The annotation procedure is modeled by a Gaussian distribution, where the mean is determined by the ground truth and the variance is a function of the annotator’s expertise and the task difficulty, represented by the features of the input signal. As this model relies on supervised learning, the features of the input signal need to be trained before the test. Nevertheless, in real applications, this kind of supervised learning procedure is generally not applicable.

II-C Continuous labeling

The continuous labeling task requires the annotators to provide a numerical value as the answer. This value can be ordinal or continuous. For example, on a movie review website, the users are asked to provide their opinion of a movie on a Likert scale from 1 to 5, where 1 represents bad and 5 represents excellent. In a video quality assessment experiment, the observers are asked to rate the quality of the video on a scale of [0, 100], and the obtained score can be any continuous value in this range. Generally, in continuous labeling, the underlying ground truth is considered to be a continuous real value [19][21][10][14]. If the observed value is an ordinal discrete value, there is a latent threshold for clipping the underlying continuous ground truth to the observed ordinal value. This threshold is determined either by the object characteristics [21], or by the annotators [20], or simply set artificially [10].

An early continuous annotation model was proposed in [19], where the observed label is generated from a Gaussian distribution, with the unknown ground truth as the mean and the accuracy of the annotator as the precision (inverse of the variance). This model was further developed by [10] with a similar recipe, except that the task difficulty is included in the model, together with the annotator’s expertise, as a multiplier term of the variance. In addition, Gamma priors are imposed on both terms of the variance. Furthermore, a latent variable that describes the probability that an annotator switches between normal behavior and abnormal (spammer) behavior is used in the model, which was pioneering work at that time.
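The Gaussian assumption described for [19] can be written compactly. This is a sketch only: the symbols $x_j$ (unknown ground truth of object $j$) and $\lambda_i$ (precision of annotator $i$) are chosen here for illustration and may differ from those used in [19].

```latex
y_{ij} \sim \mathcal{N}\!\left(x_j,\ \lambda_i^{-1}\right),
\qquad \lambda_i > 0
```

A larger $\lambda_i$ means the labels of annotator $i$ concentrate more tightly around the ground truth $x_j$.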

Different from the above, Li [14] proposes to model the annotator’s behavior by the annotator’s bias and the annotator’s inconsistency. The bias captures the effect that an annotator always underestimates or overestimates the truth of the task, while the inconsistency captures the variance of the annotator’s attentiveness during labeling. In addition, the task difficulty is also considered in the model. Overall, the annotation procedure is modeled by a Gaussian distribution, with the ground truth continuous score plus the annotator’s bias as the mean, and the task difficulty plus the annotator’s inconsistency as the variance term. However, this model may fail to model a spammer’s behavior, as it cannot be expressed by a Gaussian.

II-D Other labeling

In addition to the types of labeling mentioned above, there are also some other types. For example, in a questionnaire for evaluating video quality [23], the annotators are asked to use their own vocabulary to describe their perceptual experience of the video. In [3], the annotators are asked to translate 10 sentences (from their own language, for example, French, German, Spanish…) to English. Another task in [3] asks the annotators to answer some reading comprehension questions. These labeling tasks are generally more open, and thus more difficult to handle. In our work, we only focus on class labeling, decision labeling and continuous ordinal labeling.

III Proposed annotation model

In this section, we first describe the scope of the problems to which our model can be applied. Then, the proposed model is introduced in detail. The parameter updates for the ground truth distribution as well as the latent annotator’s behavior are provided, which are based on MLE using the EM algorithm. Finally, the application of our model to different task scenarios is introduced. For simplicity of explanation, all the notations are summarized in Table I.

| Notation | Description |
|---|---|
| $I$ | the total number of annotators |
| $J$ | the total number of test objects |
| $K$ | the total number of candidate labels (categories) |
| $l_k$ | the $k$-th label, $k$ = 1,2,…,$K$ |
| $k$ | the $k$-th label (the abbreviation of $l_k$) |
| $y^{t}_{ij}$ | the label given by annotator $i$ for object $j$ according to the underlying ground truth distribution |
| $y^{b}_{ij}$ | the label given by annotator $i$ for object $j$ according to annotator’s irregular behaviors |
| $y_{ij}$ | the label provided by annotator $i$ to object $j$ |
| $z_{ij}$ | latent variable which follows a Bernoulli distribution determined by annotator $i$ |
| $\theta_j$ | the ground truth categorical distribution for object $j$ |
| $\theta_{jk}$ | the probability of obtaining label $k$ in one trial for object $j$ |
| $r_i$ | the probability that annotator $i$ gives the label seriously |
| $b_i$ | the irregular behavior of annotator $i$ |
| $\Theta$ | estimated parameters, $\Theta = \{\theta, r, b\}$ |
| $\mathcal{D}$ | the set of all labeled objects and all annotators who have labeled |
| $\mathcal{A}_{jk}$ | the set of annotators who labeled the object $j$ with label $k$ |
| $\mathcal{O}_i$ | the set of objects labeled by annotator $i$ |
| $\mathcal{A}_j$ | the set of annotators who labeled object $j$ |
| $\gamma_{ij}$ | for annotator $i$, object $j$, the probability that the observed label comes from the ground truth categorical distribution |
| $\lambda$, $\mu$ | two Lagrange multipliers in the EM algorithm |
| $Q$ | log likelihood in the EM algorithm |
| $\epsilon$ | convergence threshold for the EM algorithm |
| $x$ | the ground truth value (can be continuous or discrete) |

TABLE I: Notation

III-A Problem setup

Assume that there are $I$ annotators and $J$ objects (e.g., products, images, movies, websites) to be labeled. The labels that can be chosen are $l_k$, $k = 1, 2, \dots, K$. $l_k$ can be an ordinal number (e.g., $l_k \in \{1, \dots, 5\}$ on a Likert scale) or a category {cat, dog, bird,…}. The total number of labels is $K$. For ease of later mathematical expression, we use $k$ interchangeably to represent label $l_k$. Let $y_{ij}$ denote the label provided by annotator $i$ to object $j$. The underlying ground truth label for object $j$ is a categorical distribution $\theta_j = (\theta_{j1}, \dots, \theta_{jK})$, where $\theta_{jk}$ is the probability of obtaining label $k$ in one trial for object $j$ and $\sum_{k=1}^{K} \theta_{jk} = 1$. The indicator $\mathbb{1}(y_{ij} = k)$ equals 1 if $y_{ij} = k$. This assumption allows us to adapt this model to different applications such as class labeling, decision labeling or continuous ordinal labeling. It should be noted that, unlike the NLP document classification problem (where each document may belong to different classes), in our model each annotator can only choose one label from the candidates. In addition, our model cannot be applied to the case where the obtained label is a continuous numeric value.

The motivation for using a categorical distribution is that, for example, in a product review application, different reviewers’ opinions are not necessarily the same, as they are subject to the reviewers’ feelings and expectations; thus, the ground truth of the judgment of a product should be described by a distribution rather than a score. To obtain a general consensus from the population, the expectation can then be calculated and used. Another typical example is the computer vision object classification problem, where the ground truth label is a fixed class. For object $j$, if the ground truth label is $k$, we have $\theta_{jk} = 1$, and for the others, $\theta_{jk'} = 0$, $k' \neq k$. In this case, the ground truth is a special case of the categorical distribution. Furthermore, the categorical distribution can also be applied to decision labeling tasks where the ground truth follows a Bernoulli distribution, another special case of the categorical distribution.

In most of the existing studies, it is assumed that the label provided by an annotator is affected by the task difficulty and the annotator’s quality. Any opinion that differs from the ground truth (a single value) is considered an error. This is not true in some applications where there is no single-value ground truth and where we respect everyone’s serious but different opinion. In our model, the task difficulty and the observers’ different opinions are jointly described by the categorical distribution. Naturally, if not specified otherwise, we consider the annotators in a task to be representative of the population; thus, the obtained distribution reflects the opinion of the general population, in which different opinions exist. Still, this model can also be applied to annotator groups with similar expertise, e.g., experts, people with a specific profession, or a specific gender, to infer the underlying behavior of that group.

III-B Distribution-Behavior model

As we assumed in the previous section, the underlying ground truth for an object is a categorical distribution. In an observation, the label given by an annotator is determined by the underlying ground truth distribution as well as the annotator’s behavior. Similar to [10], in our model, we consider that each annotator has a probability of providing a wrong answer, which we call an “irregular” answer. In [7], the authors classify annotators’ behavior into eight categories, i.e., competent, spammers, adversaries, positively biased, negatively biased, unary annotators, binary annotators, and ternary annotators. In our study, we reduce the number of irregular behaviors to four categories, which can still cover the ones described in [7]. They are:

  • Random Label: The annotator always randomly selects a label from $1$ to $K$, which follows a uniform distribution.

  • Repeated Label: This is also called “position bias” [2], and models the behavior of an annotator who always selects the same label no matter what objects are provided.

  • Inverted Label: The annotator may misunderstand the task, or intentionally give a label that is the inverse of the true label he/she should provide, i.e., he/she is an adversarial annotator.

  • Mixed Label: This is used to model the other irregular behaviors, which can be considered as a random combination of all previously mentioned ones.

The probability that an annotator $i$ gives an irregular answer is $1 - r_i$, where $r_i$ represents the reliability of this annotator. Over the whole labeling procedure, we consider an annotator with $r_i < 0.5$ a “spammer”.

The graphical model of our proposed distribution-behavior annotation model is shown in Figure 1. In one trial, the provided label $y_{ij}$ is drawn from a mixture of two models: one is the ground truth categorical distribution $\theta_j$, the other is the annotator’s irregular behavior modeled by a discrete distribution $b_i = (b_{i1}, \dots, b_{iK})$, where $\sum_{k=1}^{K} b_{ik} = 1$. The switch between the two models is determined by a latent variable $z_{ij}$, which follows a Bernoulli distribution, i.e., $z_{ij} \sim \mathrm{Bernoulli}(r_i)$. When the latent variable $z_{ij} = 1$, the annotator labels the object according to the underlying ground truth; otherwise, the annotator labels it “irregularly” based on his/her own irregular behavior $b_i$.

Fig. 1: Graphical model for our proposed distribution-behavior model. $\theta_j$, $r_i$ and $b_i$ are parameters, $y^{t}_{ij}$, $y^{b}_{ij}$ and $z_{ij}$ are latent variables, and $y_{ij}$ is the label provided by annotator $i$.
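The generative process described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' released code: the function name `sample_label` and its argument names are hypothetical, and labels are assumed to be the integers 1..K.

```python
import random


def sample_label(theta_j, r_i, b_i, rng=random):
    """Draw one label in 1..K for an object from the two-component mixture.

    theta_j: ground truth categorical distribution of the object (length K).
    r_i:     reliability of the annotator, i.e. P(z = 1).
    b_i:     discrete distribution of the annotator's irregular behavior.
    Returns (label, z), where z is the latent switch variable.
    """
    labels = list(range(1, len(theta_j) + 1))
    # z = 1 with probability r_i: the annotator answers seriously.
    z = 1 if rng.random() < r_i else 0
    # Serious answers follow theta_j, irregular ones follow b_i.
    weights = theta_j if z == 1 else b_i
    label = rng.choices(labels, weights=weights, k=1)[0]
    return label, z


if __name__ == "__main__":
    rng = random.Random(0)
    theta = [0.05, 0.10, 0.50, 0.25, 0.10]   # opinions centered on label 3
    uniform = [0.2] * 5                       # uniform irregular behavior
    print([sample_label(theta, 0.8, uniform, rng)[0] for _ in range(10)])
```

Passing a seeded `random.Random` instance keeps simulated experiments reproducible.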

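The complete conditional density is given below:

$$P(Y \mid \theta, r, b) = \prod_{(i,j) \in \mathcal{D}} P(y_{ij} \mid \theta_j, r_i, b_i) \tag{1}$$

where $\mathcal{D}$ represents the set of all labeled objects and all annotators. Thus, we have:

$$P(Y \mid \theta, r, b) = \prod_{(i,j) \in \mathcal{D}} \left[ r_i \, \theta_{j y_{ij}} + (1 - r_i) \, b_{i y_{ij}} \right] \tag{2}$$

subject to:

$$\sum_{k=1}^{K} \theta_{jk} = 1, \qquad \sum_{k=1}^{K} b_{ik} = 1 \tag{3}$$

The objective of our model is to infer the underlying distribution, i.e., the parameters $\theta_j$ and the annotator’s reliability $r_i$. The discrete distribution $b_i$ used to capture the annotator’s irregular behavior will be discussed in Section III-D.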

III-C Parameter Estimation using EM algorithm

The likelihood function can be calculated according to Equation (2); thus, the parameters can be estimated by MLE, i.e.,

$$\hat{\Theta} = \arg\max_{\Theta} \log P(Y \mid \Theta) \tag{4}$$

where $\Theta = \{\theta, r, b\}$ are the parameters.

Since there is a latent variable in our model, there is no analytic solution for the parameters. In this paper, the EM algorithm is utilized. $\Theta^{(0)}$ denotes the initialized parameters, and $\Theta^{(t)} = (\theta^{(t)}, r^{(t)}, b^{(t)})$ denotes the parameters in the $t$-th iteration. The whole EM procedure is as follows (the source code will be available on GitHub).

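Initialization: The EM algorithm is very sensitive to initialization values. In our model, we use the following strategy. $r_i^{(0)} = 0.5$, supposing that all the annotators are on the threshold of being a spammer. $\theta_{jk}^{(0)} = |\mathcal{A}_{jk}| / \sum_{k'=1}^{K} |\mathcal{A}_{jk'}|$, $j = 1, \dots, J$, $k = 1, \dots, K$. $j$ is the index of the tested object. $\mathcal{A}_{jk}$ denotes the set of annotators who labeled the object $j$ with label $k$.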

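E-Step: Supposing the current estimate of the parameters is $\Theta^{(t)}$, for the $(t+1)$-th iteration of the E-step, calculate:

$$Q(\Theta, \Theta^{(t)}) = \sum_{(i,j) \in \mathcal{D}} \left[ \gamma_{ij} \log\left( r_i \, \theta_{j y_{ij}} \right) + (1 - \gamma_{ij}) \log\left( (1 - r_i) \, b_{i y_{ij}} \right) \right] \tag{5}$$

where $\gamma_{ij}$ is the probability that the provided label $y_{ij}$ comes from the ground truth categorical distribution under the current parameters $\Theta^{(t)}$:

$$\gamma_{ij} = \frac{r_i^{(t)} \, \theta^{(t)}_{j y_{ij}}}{r_i^{(t)} \, \theta^{(t)}_{j y_{ij}} + (1 - r_i^{(t)}) \, b^{(t)}_{i y_{ij}}} \tag{6}$$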
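M-Step: Find the $\Theta$ that maximizes $Q(\Theta, \Theta^{(t)})$ as the estimate of the $(t+1)$-th iteration, i.e.,

$$\Theta^{(t+1)} = \arg\max_{\Theta} Q(\Theta, \Theta^{(t)}) \tag{7}$$

Lagrange multipliers $\lambda$ and $\mu$ are thus introduced for $\theta$ and $b$, independently. For completeness, we provide the whole parameter updates here.

$$r_i^{(t+1)} = \frac{1}{|\mathcal{O}_i|} \sum_{j \in \mathcal{O}_i} \gamma_{ij}, \qquad \theta_{jk}^{(t+1)} = \frac{\sum_{i \in \mathcal{A}_{jk}} \gamma_{ij}}{\sum_{i \in \mathcal{A}_j} \gamma_{ij}}, \qquad b_{ik}^{(t+1)} = \frac{\sum_{j \in \mathcal{O}_{ik}} (1 - \gamma_{ij})}{\sum_{j \in \mathcal{O}_i} (1 - \gamma_{ij})} \tag{8}$$

where $\mathcal{O}_i$ denotes the set of objects labeled by annotator $i$, $\mathcal{A}_j$ denotes the set of annotators who labeled object $j$, $\mathcal{A}_{jk}$ denotes the set of annotators who labeled object $j$ with $y_{ij} = k$, and $\mathcal{O}_{ik}$ denotes the set of objects labeled by annotator $i$ with $y_{ij} = k$.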

Convergence criterion: Evaluate the log likelihood (i.e., the Q function defined in Equation 5), and repeat the E-step and M-step until the convergence criterion is satisfied:

$$\left| Q^{(t+1)} - Q^{(t)} \right| < \epsilon \tag{9}$$
In our model, we set .
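The EM procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `em_gpm`, the threshold `eps=1e-6` and the iteration cap are hypothetical; the irregular behavior is fixed to a uniform distribution; and convergence is monitored on the observed-data log likelihood of the mixture rather than on the Q function.

```python
import math
from collections import defaultdict


def em_gpm(triples, K, eps=1e-6, max_iter=500):
    """Estimate theta_j (ground truth distributions) and r_i (reliabilities).

    triples: list of (annotator i, object j, label y) with y in 1..K.
    Assumption: irregular behavior b_i is fixed to uniform, b_ik = 1/K.
    """
    tiny = 1e-9                       # numerical guard, not part of the model
    b = 1.0 / K
    annotators = {i for i, _, _ in triples}
    # Initialization: r_i = 0.5, theta_j = observed label frequencies.
    r = {i: 0.5 for i in annotators}
    counts = defaultdict(lambda: [0.0] * K)
    for _, j, y in triples:
        counts[j][y - 1] += 1.0
    theta = {j: [c / sum(cs) for c in cs] for j, cs in counts.items()}

    prev_ll = None
    for _ in range(max_iter):
        # E-step: probability that each label came from the ground truth.
        gamma = []
        for i, j, y in triples:
            good = r[i] * theta[j][y - 1]
            bad = (1.0 - r[i]) * b
            gamma.append(good / (good + bad))

        # M-step: closed-form updates for r_i and theta_jk.
        r_sum, r_cnt = defaultdict(float), defaultdict(int)
        th_sum = {j: [0.0] * K for j in theta}
        for g, (i, j, y) in zip(gamma, triples):
            r_sum[i] += g
            r_cnt[i] += 1
            th_sum[j][y - 1] += g
        r = {i: min(max(r_sum[i] / r_cnt[i], tiny), 1.0 - tiny)
             for i in annotators}
        theta = {j: [v / max(sum(vs), tiny) for v in vs]
                 for j, vs in th_sum.items()}

        # Convergence check on the observed-data log likelihood.
        ll = sum(math.log(r[i] * theta[j][y - 1] + (1.0 - r[i]) * b)
                 for i, j, y in triples)
        if prev_ll is not None and abs(ll - prev_ll) < eps:
            break
        prev_ll = ll
    return theta, r
```

A usage example would pass a list such as `[(0, 0, 1), (1, 0, 1), (2, 0, 3)]` with `K=3` and read the estimated reliability of each annotator from the returned `r` dictionary.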

III-D Discovering the annotator’s irregular behavior

An ideal discrete distribution $b_i$ should be able to capture the annotator’s diverse irregular behaviors, including random labels, repeated labels, inverted labels and other more complicated conditions, as described at the beginning of Section III-B. However, in reality, it is hard to find this ideal candidate. Alternatively, we consider the uniform distribution as a loose prior assumption, which also makes the updates much easier in this model, i.e., by setting all $b_{ik} = 1/K$. The feasibility of using the uniform distribution to capture the irregular behaviors is validated in Section IV-B1. We keep the general updates for all parameters as shown in Equation (8) to allow the readers to investigate $b_i$ further.

III-E Prediction of ground truth

In our model, the ground truth is a categorical distribution with parameters $\theta_j$. This model can be applied directly to the case where people may have different opinions on the labels of the object. In addition, our model can be extended to other applications. We provide more details below to show how to apply our model to them.

Continuous case: In the case where the underlying ground truth is by nature a continuous value, whereas the required label is an ordinal value, the ground truth can be obtained by calculating the expectation of the estimated ordinal categorical distribution, i.e., $\hat{x}_j = \sum_{k=1}^{K} l_k \, \theta_{jk}$.

Discrete case: Both class labeling and decision labeling belong to this discrete case. A typical class labeling problem can be considered a special case of the categorical distribution, where the correct class has a probability of 1 and the others 0. A decision labeling problem can be considered a Bernoulli distribution. Thus, in discrete labeling, our model can estimate the ground truth by calculating the mode of the predicted distribution, i.e., $\hat{k}_j$ such that $\theta_{j\hat{k}_j} = \max_{k} \theta_{jk}$.
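The two prediction rules, together with the entropy-based task difficulty mentioned among the contributions, can be sketched as below. The function names are hypothetical, labels are assumed to be the integers 1..K unless explicit label values are passed, and the natural logarithm is an arbitrary choice for the entropy.

```python
import math


def expectation(theta_j, label_values=None):
    """Continuous case: expected value of the ordinal categorical distribution."""
    if label_values is None:
        label_values = range(1, len(theta_j) + 1)
    return sum(l * p for l, p in zip(label_values, theta_j))


def mode(theta_j):
    """Discrete case: the label (1..K) with the highest probability."""
    return max(range(len(theta_j)), key=lambda k: theta_j[k]) + 1


def entropy(theta_j):
    """Task difficulty proxy: entropy of the ground truth distribution."""
    return -sum(p * math.log(p) for p in theta_j if p > 0)


if __name__ == "__main__":
    theta = [0.1, 0.2, 0.4, 0.2, 0.1]
    print(expectation(theta), mode(theta), entropy(theta))
```

A one-hot distribution has zero entropy (an easy task), while a uniform distribution reaches the maximum of log K.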

IV Experiment

We tested our model on two types of data sets. One is simulated data, the other is real world data. Details are shown below.

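IV-A Evaluation metrics

Firstly, we summarize the different evaluation metrics for the performance of annotation models in this part.

IV-A1 Classification accuracy

The ratio of correctly classified objects to the total number of objects:

$$\mathrm{Accuracy} = \frac{\text{number of correctly classified objects}}{\text{total number of objects}}$$

IV-A2 $F_1$ score

A measure of a classifier’s accuracy, which is the harmonic mean of precision and recall:

$$\mathrm{precision} = \frac{TP}{TP + FP}, \qquad \mathrm{recall} = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

where $TP$, $FP$ and $FN$ are the numbers of true positives, false positives and false negatives.

IV-A3 Pearson Linear Correlation Coefficient (PLCC)

A measure of the linear correlation between two variables [17]:

$$\mathrm{PLCC} = \frac{\mathrm{cov}(x, \hat{x})}{\sigma_x \, \sigma_{\hat{x}}}$$

$\sigma_x$ is the standard deviation of the true values $x$; the same applies to the predicted values $\hat{x}$.

IV-A4 Spearman Rank Order Correlation Coefficient (SROCC)

A nonparametric measure of rank correlation. Let $R(x_i)$ be the rank of $x_i$, and $R(\hat{x}_i)$ be the rank of $\hat{x}_i$. SROCC is calculated by:

$$\mathrm{SROCC} = 1 - \frac{6 \sum_{i=1}^{n} \left( R(x_i) - R(\hat{x}_i) \right)^2}{n (n^2 - 1)}$$

IV-A5 Root Mean Square Error (RMSE)

A measure of how spread out the prediction errors are:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{x}_i)^2}$$

IV-A6 Hellinger distance

A measure of the distance between two probability distributions $P = (p_1, \dots, p_K)$ and $Q = (q_1, \dots, q_K)$:

$$H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{k=1}^{K} \left( \sqrt{p_k} - \sqrt{q_k} \right)^2}$$

Higher values of accuracy, $F_1$, PLCC and SROCC, and lower values of RMSE and Hellinger distance, indicate better performance.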
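For reference, the metrics above can be computed with the standard library alone. This is a sketch with hypothetical function names; the binary $F_1$ assumes the positive class is labeled 1, and SROCC is computed as the Pearson correlation of average ranks, which coincides with the rank-difference formula when there are no ties.

```python
import math


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def plcc(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for t in range(pos, end + 1):
            result[order[t]] = (pos + end) / 2 + 1
        pos = end + 1
    return result


def srocc(x, y):
    return plcc(ranks(x), ranks(y))


def rmse(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def hellinger(p, q):
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(s) / math.sqrt(2)
```

The Hellinger distance is bounded in [0, 1]: it is 0 for identical distributions and 1 for distributions with disjoint support.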

IV-B Simulated data

IV-B1 Exp1-a: Detection of irregular annotations

The objective of this experiment is to see whether or not the uniform distribution can capture different types of irregular behaviors. In this experiment, we simulate the ground truth as a categorical distribution with $K$ = 5. The categorical distribution is generated based on a Beta distribution $\mathrm{Beta}(\alpha, \beta)$, where $\alpha$ and $\beta$ are randomly selected from 1 to 10. The obtained Beta distribution is then re-scaled and clipped to form a categorical distribution.

We simulate in total 150 objects, which are labeled by 25 annotators on average. The reliability $r_i$ of each annotator is randomly selected from 0 to 1. Meanwhile, we set the $r_i$ values of 20% of the annotators lower than 0.5; we call them “spammers”. This 20% is referred to as the spamminess ratio.

Four types of irregular behaviors are considered. They are “random”, “repeated”, “inverted”, and “mixed” labeling, which are generated in the following way:

  • Random: the annotator’s label is randomly sampled from a uniform distribution $U(1, K)$. The observed label is a discrete label.

  • Repeated: each annotator is assigned a fixed position bias in the simulation, which is randomly sampled from a uniform distribution $U(1, K)$. Then, for a particular annotator, his/her repeated label is always the one assigned to him/her.

  • Inverted: the label provided by an annotator is $K + 1 - y^{t}_{ij}$, where $y^{t}_{ij}$ is the label observed according to the ground truth distribution. For example, an adversarial annotator observes “Excellent (5)” but provides “Very Bad (1)” in the task. This is particularly applicable to decision labeling and continuous (ordinal) labeling.

  • Mixed: a combination of the behaviors above in a random way.
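The simulation setup can be sketched as follows. Two parts are assumptions of this illustration rather than the paper's exact procedure: the categorical ground truth is obtained by binning Beta samples into K equal-width bins (one possible reading of "re-scaled and clipped"), and the "mixed" behavior picks one of the three basic behaviors uniformly at random for each label. All function names are hypothetical.

```python
import random


def beta_categorical(K, rng, n_samples=10000):
    """Build a K-class categorical distribution from a random Beta(a, b)."""
    a = rng.randint(1, 10)           # shape parameters drawn from 1..10
    b = rng.randint(1, 10)
    counts = [0] * K
    for _ in range(n_samples):
        u = rng.betavariate(a, b)    # sample in [0, 1]
        k = min(int(u * K), K - 1)   # re-scale to a bin index, clip the top edge
        counts[k] += 1
    return [c / n_samples for c in counts]


def irregular_label(behavior, K, y_true, fixed_label, rng):
    """Return an irregular label in 1..K for the given behavior type.

    y_true:      label the annotator would give from the ground truth.
    fixed_label: position bias assigned to this annotator ("repeated").
    """
    if behavior == "mixed":
        behavior = rng.choice(["random", "repeated", "inverted"])
    if behavior == "random":
        return rng.randint(1, K)
    if behavior == "repeated":
        return fixed_label
    if behavior == "inverted":
        return K + 1 - y_true
    raise ValueError("unknown behavior: " + behavior)


if __name__ == "__main__":
    rng = random.Random(42)
    print(beta_categorical(5, rng))
    print(irregular_label("inverted", 5, 5, 3, rng))
```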

To obtain statistical results, each type of behavior is simulated 100 times. The evaluation metrics for the detection of irregular annotations are the $F_1$ score for the classification of spammers (whose $r_i < 0.5$), and PLCC, SROCC and RMSE between the ground truth reliability levels $r_i$ and the estimated values $\hat{r}_i$. Results are shown in Table II.

| Behavior type | $F_1$ ↑ | PLCC ↑ | SROCC ↑ | RMSE ↓ |
|---|---|---|---|---|
| Random | 0.9949 | 0.9228 | 0.9011 | 0.2071 |
| Repeated | 0.9121 | 0.5835 | 0.5398 | 0.3388 |
| Inverted | 0.9458 | 0.7297 | 0.7567 | 0.3296 |
| Mixed | 0.9335 | 0.6922 | 0.7305 | 0.3539 |

TABLE II: Exp1-a: Performance on detection of different irregular behaviors. The values are the mean of 100 test results. ↑ means the higher the value, the better the performance. ↓ means the opposite.

The results indicate that the uniform distribution can capture the “random” behavior effectively, in terms of classification accuracy ($F_1$) as well as scale prediction accuracy (PLCC, SROCC and RMSE). However, for the other types of behaviors, the accuracy of predicting the absolute values of $r_i$ is not satisfactory. Nevertheless, the classification performance is promising, with $F_1$ scores all above 0.9. Thus, in practice, the estimated $\hat{r}_i$ value can be used to classify the annotators as spammers or not.

IV-B2 Exp1-b: Influence of the proportion of irregular annotations

The objective of this experiment is to test the influence of the proportion of irregular behaviors on the prediction accuracy. In this experiment, we simulate the ground truth as a categorical distribution as we did in Section IV-B1, with 150 objects and 25 annotators. The $r_i$ values are randomly selected from 0 to 1. In addition, we set 5 levels of spamminess ratio: 5%, 10%, 15%, 20% and 25%. The irregular behavior is fixed as “mixed”. Each test is repeated 100 times for statistical reliability.

The evaluation metrics are the RMSE and the Hellinger distance between the ground truth distribution and the estimated distribution. For comparison, the observed distribution is also evaluated. Results are shown in Figure 2.

Figure 2 shows that the prediction error generally increases with the spamminess ratio, which is reasonable. An interesting finding is that when the spamminess ratio increases from 10% to 15%, the prediction error of our proposed model is even smaller, which is not the case for the distribution estimated from the observed labels. In addition, the prediction accuracy of the observed data under a spamminess ratio of 5% is the same as that of our proposed model under a spamminess ratio of 16%, which indicates an increase of 11% in spamminess tolerance for our model. In conclusion, our proposed model is more robust to irregular annotations than directly using the observed labels, which is particularly applicable in crowdsourcing labeling scenarios.

Fig. 2: Exp1-b: Influence of the proportion of irregular annotations on prediction. Reported values are the mean of the 100 test results.

IV-B3 Exp1-c: Prediction accuracy

The objective of this experiment is to study the prediction accuracy under different numbers of annotations. The experimental simulation setup is similar to Exp1-b, except that the spamminess ratio is fixed to 20% and we set 7 levels of annotation numbers, which are 10, 15, 20, …, 40. Again, for statistical reliability, each test is repeated 100 times. The evaluation metrics are the same as in Exp1-b. Results are shown in Figure 3.

(a) (b)
Fig. 3: Exp1-c: Prediction accuracy in terms of the number of annotations. Reported values are the mean of the 100 test results.

The experimental results indicate that our model significantly reduces the number of annotations required to match the accuracy obtained from the observed data. For example, to reach the accuracy that the directly observed labels achieve with 40 annotations, our proposed model needs only 20 annotations or even fewer.

IV-B4 Exp1-d: Universality study

This experiment tests the universality of our model on data types other than a categorical distribution. We assume that the ground truth is a continuous score for each object, sampled uniformly at random from [1, 5]. The observed score follows a Gaussian distribution whose mean is the ground truth score and whose precision (inverse of the variance) is randomly sampled from a Gamma distribution. The observed score is then rounded and clipped so that it is an integer in the range [1, 5]. There are 150 test objects in total. We set the annotator’s irregular behavior as “mixed”. The number of annotations is 25, and the spamminess ratio is 25%. Each test is repeated 100 times.
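The simulation of regular annotators described above can be sketched as follows. This is only an illustration under stated assumptions: the Gamma parameters (`gamma_shape`, `gamma_scale`) are placeholders because their values are not given here, drawing one precision per annotator is an assumption, and the irregular (“mixed”) annotators are not simulated.

```python
import numpy as np

def simulate_scores(n_objects=150, n_annotators=25,
                    gamma_shape=2.0, gamma_scale=1.0, seed=0):
    """Simulate integer ratings in [1, 5] from continuous ground truth scores.

    gamma_shape and gamma_scale are hypothetical placeholder values.
    """
    rng = np.random.default_rng(seed)
    # Ground truth: one continuous score per object, uniform on [1, 5].
    truth = rng.uniform(1.0, 5.0, size=n_objects)
    # Assumption: one precision (inverse variance) per annotator.
    precision = rng.gamma(gamma_shape, gamma_scale, size=n_annotators)
    sigma = 1.0 / np.sqrt(precision)
    # Gaussian observation centered on the ground truth score.
    raw = rng.normal(truth[:, None], sigma[None, :])
    # Round to the nearest integer, then clip to the rating scale.
    observed = np.clip(np.rint(raw), 1, 5).astype(int)
    return truth, observed

truth, observed = simulate_scores()
print(truth.shape, observed.shape)
```

Rounding before clipping keeps every observed score on the 5-level scale, at the cost of piling up mass at 1 and 5 for noisy annotators.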

We compare our proposed model with traditional methods (Majority Vote and Mean) and with state-of-the-art models based on a Gaussian assumption, namely the ordinal-discrete-mixture model [10] and Li’s MLE model [14]. The evaluation metrics are PLCC, SROCC and RMSE between the predicted label and the ground truth. For our model, we use the expectation of the estimated distribution as the prediction. Results are shown in Table III.
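The prediction and evaluation steps can be illustrated with a short NumPy sketch. It takes the expectation of each estimated distribution over the rating levels and then computes PLCC (Pearson correlation), SROCC (Pearson correlation of ranks, with tied values sharing their average rank) and RMSE. All function names and toy values are hypothetical.

```python
import numpy as np

def expected_score(dist, levels):
    """Expectation of each row-wise distribution over the given rating levels."""
    return np.asarray(dist, dtype=float) @ np.asarray(levels, dtype=float)

def average_ranks(x):
    """Ranks starting at 1; tied values share their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        tied = x == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def plcc(a, b):
    """Pearson linear correlation coefficient."""
    return float(np.corrcoef(a, b)[0, 1])

def srocc(a, b):
    """Spearman rank-order correlation coefficient."""
    return plcc(average_ranks(a), average_ranks(b))

def rmse(a, b):
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Toy example: three objects rated on a 5-level scale (values are made up).
levels = [1, 2, 3, 4, 5]
dist = [[0, 0, 0, 0, 1], [1, 0, 0, 0, 0], [0, 0.5, 0.5, 0, 0]]
truth = [4.8, 1.2, 2.0]
pred = expected_score(dist, levels)
print(plcc(pred, truth), srocc(pred, truth), rmse(pred, truth))
```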

Model            PLCC ↑   SROCC ↑  RMSE ↓
Mean             0.8808   0.8997   0.6021
Majority         0.8918   0.8805   0.7053
Li [14]          0.9138   0.9176   0.5792
Ord-dis-mix [10] 0.9699   0.9680   0.3050
Proposed         0.9432   0.9453   0.4657
TABLE III: Exp1-d: Universality study of the proposed model. The values are the mean of 100 test results. ↑ means the higher the value, the better the performance. ↓ means the opposite.

Because the Ord-dis-mix model is built on a Gaussian assumption and also accounts for the annotator’s irregular behavior, it is reasonable that it performs best. Notably, our proposed model performs second best, ahead of the other Gaussian-based model, Li’s MLE model. This result supports the generality of our model under different data assumptions. The Mean and Majority Vote methods are not robust when there is a large number of irregular labels.

IV-C Real-world data

In this section, we compare our model with the state-of-the-art ground truth prediction models, i.e., Dawid-Skene (D&S) [4], GLAD [25], Ord-bin [18], Ord-dis-mix [10] and Li’s MLE model [14] on real-world data sets. To verify the generality of our model in different applications, three different datasets are used. Details are shown in the following parts.

IV-C1 Crowd Dog Classification

In this experiment, the data from [6] are used. Annotators are asked to recognize the breed of the dog in a given image. There are 250 test images in total and 4 dog breeds, i.e., 4 categories. Each image is labeled by 17 annotators on average. The dataset also provides ground truth labels.

As this data is obtained from a class labeling task, discrete models, i.e., D&S [4], GLAD [25] and Ord-bin [18], are used for comparison. For our model, we select the category with the highest probability in the predicted distribution (i.e., the mode) as the predicted label. The models are evaluated by classification accuracy and score. Results are shown in Table IV.
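The mode-based prediction and the accuracy measure can be sketched as below. The function names and the toy distributions are hypothetical, and only classification accuracy is shown.

```python
import numpy as np

def predict_labels(dist):
    """Pick the most probable category (the mode) of each row-wise distribution."""
    return np.argmax(np.asarray(dist, dtype=float), axis=1)

def accuracy(pred, truth):
    """Fraction of objects whose predicted label equals the ground truth label."""
    return float(np.mean(np.asarray(pred) == np.asarray(truth)))

# Toy example: three images, four categories (values are made up).
dist = [[0.1, 0.6, 0.2, 0.1],
        [0.4, 0.3, 0.2, 0.1],
        [0.2, 0.2, 0.1, 0.5]]
truth_labels = [1, 2, 3]
pred_labels = predict_labels(dist)
print(pred_labels, accuracy(pred_labels, truth_labels))
```

Note that `np.argmax` returns the first maximum when several categories tie, so ties are broken in favor of the lower category index in this sketch.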

Model       Ord-bin [18]  D&S [4]  GLAD [25]  Proposed
Accuracy ↑  0.6280        0.6320   0.6480     0.6480
Score ↑     0.4862        0.5000   0.5368     0.5217
TABLE IV: Classification performance of different models on Crowd Dog dataset [6]. ↑ means the higher the value, the better the performance. ↓ means the opposite.

According to the results, Ord-bin performs the worst and GLAD performs the best. Our proposed model performs second best: it achieves the same classification accuracy as GLAD, but its score is slightly lower.

We also use our model to estimate the spamminess ratio of this dataset, which is 26.21%. As there is no ground truth for the spammer check, this result is provided for reference only.

IV-C2 Face Emotion Identification

In this experiment, we use the Face Emotion Identification dataset, which comprises 5242 labels assigned to 584 face images by 27 annotators. Annotators are asked to identify the sentiment of the face in the image with one of the labels “neutral(0)”, “happy(1)”, “sad(2)” and “angry(3)”. Ground truth labels are also provided. The compared models and the evaluation methods are exactly the same as in Section IV-C1. Results are shown in Table V.

Interestingly, the ranking of the models on this dataset differs from that on the Crowd Dog dataset. Here, Ord-bin performs the best while GLAD performs the worst, which is the reverse of the Crowd Dog results. Our proposed model again performs second best and D&S third, as on the Crowd Dog dataset. The results support, to some extent, the conclusion from [28] that there is generally no “perfect” model that always outperforms the others, although D&S is generally a robust one. On these two datasets, our proposed model outperforms D&S, which suggests that it is more robust.

In addition, according to our model, the detected spamminess ratio for this dataset is 0, which means that this data is quite reliable.

Model       GLAD [25]  D&S [4]  Ord-bin [18]  Proposed
Accuracy ↑  0.5976     0.6318   0.6490        0.6370
Score ↑     0.5155     0.5356   0.5514        0.5431
TABLE V: Classification performance of different models on Face emotion dataset. ↑ means the higher the value, the better the performance. ↓ means the opposite.

IV-C3 Movie Review

In this experiment, we use the MovieLens 20M dataset, which comprises 20 million ratings applied to 27,000 movies by 138,000 users on a 5-level scale. As the task concerns users’ opinions of the movies, there is no real ground truth. In our study, we select a subset of this dataset containing 5662 movies labeled by 15147 annotators as the validation dataset, ensuring that each movie is labeled by at least 30 users and each annotator has labeled at least 30 movies. The mean of all annotators’ ratings of a movie is taken as its ground truth. To test our model, we further sample a small dataset from the validation dataset, containing 2833 ratings of 174 movies by 69 users. The ground truth ratings of the 174 movies are obtained from 1452 annotations per movie on average, while in the test data each movie is annotated by 16 users on average.
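One way to obtain a subset in which every movie and every annotator meets a minimum rating count is to filter repeatedly until nothing more is removed, because dropping a sparse movie can push a user below the threshold and vice versa. The sketch below is an assumption about how such a subset could be built, not a description of the exact procedure used here; the function names and the toy ratings are hypothetical.

```python
from collections import Counter, defaultdict

def filter_ratings(ratings, min_per_movie=30, min_per_user=30):
    """Keep (user, movie, rating) tuples until both count constraints hold."""
    kept = list(ratings)
    while True:
        per_user = Counter(user for user, _, _ in kept)
        per_movie = Counter(movie for _, movie, _ in kept)
        filtered = [(u, m, r) for u, m, r in kept
                    if per_user[u] >= min_per_user
                    and per_movie[m] >= min_per_movie]
        if len(filtered) == len(kept):  # fixed point reached
            return filtered
        kept = filtered

def mean_rating_per_movie(ratings):
    """Mean rating of each movie, used as a proxy for the ground truth."""
    totals, counts = defaultdict(float), Counter()
    for _, movie, rating in ratings:
        totals[movie] += rating
        counts[movie] += 1
    return {movie: totals[movie] / counts[movie] for movie in counts}

# Toy example with thresholds of 2 instead of 30 (values are made up).
toy = [("u1", "m1", 4), ("u1", "m2", 2), ("u2", "m1", 5),
       ("u2", "m2", 3), ("u3", "m1", 3), ("u3", "m3", 1)]
subset = filter_ratings(toy, min_per_movie=2, min_per_user=2)
print(mean_rating_per_movie(subset))
```

In the toy data, removing movie `m3` leaves user `u3` with a single rating, so `u3` is removed in the second pass; a single-pass filter would miss this.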

As this experiment is a continuous case, for our model, we use the expectation of the predicted distribution as the predicted score. The compared models in this experiment are D&S [4], GLAD [25], Ord-bin [18], Ord-dis-mix [10] and Li’s MLE model [14]. The evaluation methods are PLCC, SROCC and RMSE between the ground truth and the predicted score. Results are shown in Table VI. The predicted spamminess ratio in this data is only 1%.

According to the results, because D&S, GLAD and Ord-bin were proposed for discrete labeling, their performance is generally worse than that of the continuous models Ord-dis-mix and Li’s MLE, even though they can also be applied to ordinal data. Our proposed model performs the best. In addition, since the spamminess ratio in this dataset is quite low, we may conclude that in this experiment it is the complicated underlying data distribution that determines the performance of the different models. This further suggests that our proposed model is a generic model applicable to different data patterns and different applications.

Model            PLCC ↑   SROCC ↑  RMSE ↓
D&S [4]          0.4073   0.4957   1.2287
GLAD [25]        0.5347   0.6029   0.6361
Ord-bin [18]     0.7122   0.7166   0.5849
Ord-dis-mix [10] 0.7066   0.7282   0.4292
Li [14]          0.8369   0.8219   0.2483
Proposed         0.8620   0.8420   0.2228
TABLE VI: Performances of different models on MovieLens dataset. ↑ means the higher the value, the better the performance. ↓ means the opposite.

V Conclusion

In this paper, we propose to use a categorical distribution to represent the underlying ground truth rather than a single value, as other works did. Using a distribution allows us to 1) apply the model to any labeling task, whether class labeling, decision labeling, or continuous (ordinal) labeling; and 2) model the serious but divergent opinions within the population, which should be respected. In addition, a latent variable is introduced to model the probability that an annotator switches between reliable and unreliable behavior at any time during a labeling task, which often happens in real life, especially in crowdsourcing scenarios. Furthermore, different types of irregular behavior, such as random labeling, repeated labeling, inverted labeling and others, can be captured by our model simply by using a uniform distribution. The proposed model has been validated on both simulated and real-world data, where it performs competitively with or better than the other models in terms of prediction accuracy and robustness to irregular behaviors. In the future, we may extend this work to pair comparison experiments to identify irregular behavior, discover personal preferences, and infer the consensus opinion of a population, which can be applied to recommendation systems, A/B tests, player matching systems, etc.
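The idea summarized above, a categorical ground truth combined with a uniform distribution for irregular labels, can be illustrated with a two-component mixture. This is a simplified sketch and not the full model of this paper: the names `theta` (ground truth distribution of one object) and `reliability` (probability that a given label is produced reliably) are hypothetical symbols chosen for the example.

```python
import numpy as np

def label_probability(theta, reliability):
    """Marginal probability of each label under the illustrative mixture.

    With probability `reliability` the label follows the ground truth
    distribution `theta`; otherwise it is uniform over the categories.
    """
    theta = np.asarray(theta, dtype=float)
    return reliability * theta + (1.0 - reliability) / len(theta)

def sample_label(theta, reliability, rng):
    """Draw one label from the same mixture."""
    k = len(theta)
    if rng.random() < reliability:
        return int(rng.choice(k, p=theta))  # reliable: follow the ground truth
    return int(rng.integers(k))             # irregular: uniform over categories

rng = np.random.default_rng(0)
theta = [0.7, 0.2, 0.1]
print(label_probability(theta, reliability=0.8))
print([sample_label(theta, 0.8, rng) for _ in range(10)])
```

Because the switch is drawn per label rather than per annotator, the same annotator can be reliable on one object and irregular on the next, which matches the switching behavior described above.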


  • [1] B. I. Aydin, Y. S. Yilmaz, Y. Li, Q. Li, J. Gao, and M. Demirbas (2014) Crowdsourcing for multiple-choice question answering. In AAAI, pp. 2946–2953.
  • [2] N. J. Blunch (1984) Position bias in multiple-choice questions. Journal of Marketing Research, pp. 216–220.
  • [3] C. Callison-Burch (2009) Fast, cheap, and creative: evaluating translation quality using Amazon’s Mechanical Turk. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1, pp. 286–295.
  • [4] A. P. Dawid and A. M. Skene (1979) Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, pp. 20–28.
  • [5] G. Demartini, D. E. Difallah, and P. Cudré-Mauroux (2012) ZenCrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. In Proceedings of the 21st international conference on World Wide Web, pp. 469–478.
  • [6] Y. Fang, H. Sun, P. Chen, and T. Deng (2017) Improving the quality of crowdsourced image labeling via label similarity. Journal of Computer Science and Technology 32 (5), pp. 877–889.
  • [7] Y. E. Kara, G. Genc, O. Aran, and L. Akarun (2015) Modeling annotator behaviors for crowd labeling. Neurocomputing 160, pp. 141–156.
  • [8] D. R. Karger, S. Oh, and D. Shah (2011) Iterative learning for reliable crowdsourcing systems. In Advances in neural information processing systems, pp. 1953–1961.
  • [9] H. Kim and Z. Ghahramani (2012) Bayesian classifier combination. In Artificial Intelligence and Statistics, pp. 619–627.
  • [10] B. Lakshminarayanan and Y. W. Teh (2013) Inferring ground truth from multi-annotator ordinal data: a probabilistic approach. arXiv preprint arXiv:1305.0015.
  • [11] G. Li, J. Wang, Y. Zheng, and M. J. Franklin (2016) Crowdsourced data management: a survey. IEEE Transactions on Knowledge and Data Engineering 28 (9), pp. 2296–2319.
  • [12] Q. Li, Y. Li, J. Gao, B. Zhao, W. Fan, and J. Han (2014) Resolving conflicts in heterogeneous data by truth discovery and source reliability estimation. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data, pp. 1187–1198.
  • [13] Y. Li, J. Gao, C. Meng, Q. Li, L. Su, B. Zhao, W. Fan, and J. Han (2016) A survey on truth discovery. ACM SIGKDD Explorations Newsletter 17 (2), pp. 1–16.
  • [14] Z. Li and C. G. Bampis (2017) Recover subjective quality scores from noisy measurements. In Data Compression Conference (DCC), 2017, pp. 52–61.
  • [15] Q. Liu, J. Peng, and A. T. Ihler (2012) Variational inference for crowdsourcing. In Advances in neural information processing systems, pp. 692–700.
  • [16] F. Ma, Y. Li, Q. Li, M. Qiu, J. Gao, S. Zhi, L. Su, B. Zhao, H. Ji, and J. Han (2015) Faitcrowd: fine grained truth discovery for crowdsourced data aggregation. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 745–754.
  • [17] K. Pearson (1895) Note on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London 58, pp. 240–242.
  • [18] V. C. Raykar, S. Yu, L. H. Zhao, A. Jerebko, C. Florin, G. H. Valadez, L. Bogoni, and L. Moy (2009) Supervised learning from multiple experts: whom to trust when everyone lies a bit. In Proceedings of the 26th Annual international conference on machine learning, pp. 889–896.
  • [19] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy (2010) Learning from crowds. Journal of Machine Learning Research 11 (Apr), pp. 1297–1322.
  • [20] V. C. Raykar and S. Yu (2012) Annotation models for crowdsourced ordinal data. Journal of Machine Learning Research 13.
  • [21] S. Rogers, M. Girolami, and T. Polajnar (2010) Semi-parametric analysis of multi-rater data. Statistics and Computing 20 (3), pp. 317–334.
  • [22] T. Sørensen (1948) A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons. Biol. Skr. 5, pp. 1–34.
  • [23] D. Strohmeier, S. Jumisko-Pyykkö, and K. Kunze (2010) Open profiling of quality: a mixed method approach to understanding multimodal quality perception. Advances in multimedia 2010.
  • [24] M. Venanzi, J. Guiver, G. Kazai, P. Kohli, and M. Shokouhi (2014) Community-based Bayesian aggregation models for crowdsourcing. In Proceedings of the 23rd international conference on World wide web, pp. 155–164.
  • [25] J. Whitehill, T. Wu, J. Bergsma, J. R. Movellan, and P. L. Ruvolo (2009) Whose vote should count more: optimal integration of labels from labelers of unknown expertise. In Advances in neural information processing systems, pp. 2035–2043.
  • [26] Y. Yan, R. Rosales, G. Fung, M. Schmidt, G. Hermosillo, L. Bogoni, L. Moy, and J. Dy (2010) Modeling annotator expertise: learning when everybody knows a bit of something. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 932–939.
  • [27] J. Ye, J. Li, M. G. Newman, R. B. Adams, and J. Z. Wang (2017) Probabilistic multigraph modeling for improving the quality of crowdsourced affective data. IEEE Transactions on Affective Computing.
  • [28] Y. Zheng, G. Li, Y. Li, C. Shan, and R. Cheng (2017) Truth inference in crowdsourcing: is the problem solved?. Proceedings of the VLDB Endowment 10 (5), pp. 541–552.
  • [29] D. Zhou, S. Basu, Y. Mao, and J. C. Platt (2012) Learning from the wisdom of crowds by minimax entropy. In Advances in neural information processing systems, pp. 2195–2203.

Appendix A Parameter Estimation

In our model, function is defined in Equation 5. For M-step, we need to find subject to: . Thus, we have the updated function:


are Lagrange multipliers. To find the maxima, we need to calculate:


Let , , , , , we have: