1 Introduction
Supervised learning tasks usually require a large amount of labeled samples to train their models. Although several kinds of data sources, such as games and e-commerce platforms, can automatically label their samples with clear rules, there is still a large amount of data that needs to be labeled manually. However, asking domain experts to label large datasets is very expensive and time-consuming. An alternative is to collect large amounts of labels from non-experts.
Recently, many works [17, 21] have been interested in using crowdsourcing to build datasets. Many online platforms, such as Amazon Mechanical Turk (www.amt.com) and CrowdFlower (www.crowdflower.com), provide crowdsourcing services. These platforms split a large unlabeled dataset into small parts and distribute them to a group of registered ordinary workers [22]. Collecting labels from a crowdsourcing platform is efficient and cheap; however, the non-professional workers usually have low labeling accuracy. To improve the accuracy, the usual practice is to assign each item to several different workers and then aggregate the collected redundant labels into predicted true labels.
Previous works have proposed many label aggregation methods to infer the true labels from the observed noisy labels provided by crowds. The most straightforward way to predict the true label is majority voting, which considers each item independently and takes the most frequent class as the true label. This method does not take the reliability of each worker into consideration and implicitly assumes that all workers are equally good. In reality, however, workers have different degrees of reliability depending on their state of mind, expertise, and motivation. A variety of advanced methods have been proposed to overcome this problem. These methods make assumptions about the behavior of workers and design statistical models that generate the observed noisy labels. The assumptions are encoded in the model parameters and their relations, which reflect the reliability of workers. Whitehill et al. [22] propose the Generative model of Labels, Abilities, and Difficulties (GLAD) for binary label aggregation, which can simultaneously infer the true labels, the expertise of workers, and the difficulty of items. Dawid&SkeneEM [2] is a label aggregation method based on Expectation Maximization (EM). Many recent works [25, 26, 7, 14] extend this method and improve its performance. We briefly review the existing label aggregation methods in Section 2.
In this paper, we propose a novel label aggregation algorithm built around a neural network. The neural network is a probabilistic classifier: given an instance, i.e., the redundant noisy labels collected for an item, it predicts the unknown true label. The challenge is that the true label of each instance is unknown, so the neural network must be trained in an unsupervised way; we need to define a loss function (optimization goal) without any ground truth labels. To solve this problem, we introduce a guiding model. Training then amounts to finding a consensus between the predictions of the neural network and those of the guiding model. The loss function is differentiable, so our algorithm can be trained with mini-batch stochastic optimization methods such as SGD, Adam [8], and RMSProp [18]. The parameters of the neural network and the guiding model are updated simultaneously.
Compared with existing label aggregation methods, our algorithm is easy to extend, because there are few restrictions on the design of the neural network and the guiding model. There are many choices for the network architecture, e.g., an MLP or a CNN. The only constraint on the guiding model is that it must be differentiable. Based on this algorithm, we propose two models: a binary label aggregation model built on explicit assumptions about worker behavior, and a model designed to aggregate multi-class noisy labels. Because it is trained with mini-batch stochastic optimization, our algorithm can also be applied in online settings; however, in order to compare fairly with state-of-the-art methods, our experiments are conducted on fixed datasets. Experiments on four real-world datasets demonstrate that our models achieve superior performance over state-of-the-art methods.
2 Related Work
Dawid and Skene [2] proposed the concept of the confusion matrix to describe the expertise and bias of a worker, and designed an EM algorithm for label aggregation based on it. In the field of crowdsourcing, the confusion matrix is effective and well known. Raykar et al. [14] used noisy labels to train their classification model; their two-coin model is a variation of the confusion matrix. BCC [7] is the probabilistic graphical model version of Dawid&SkeneEM. It also evaluates workers with confusion matrices, and performs parameter estimation with Gibbs sampling. CommunityBCC [19] is an extension of BCC that divides the workers into communities; workers in the same community have similar confusion matrices. CommunityBCC performs better than BCC on sparse datasets. BCCWords [15] also extends BCC. Recently, Zhou et al. [25, 26] proposed the minimax entropy estimator and its extensions, which set a separate probability distribution for each worker-item pair. In most cases, minimax entropy and its extensions outperform Dawid&SkeneEM.
Several other models for label aggregation are not based on the confusion matrix. GLAD [22] can infer the true labels, the expertise of workers, and the difficulty of items at the same time, but it can only be used for binary labeling tasks. Liu et al. [11] proposed a model that uses variational inference to approximate the posterior. Zhou and He [24] designed a label aggregation approach based on tensor augmentation and completion. Li et al. [12] proposed a general crowd targeting framework that can target good workers for crowdsourcing. DeepAgg [3] is a model based on a deep neural network. It is trained on a seed dataset that contains noisy labels together with the corresponding ground truth labels, so it is not an unsupervised approach; moreover, it cannot aggregate incomplete data, where many annotators label only a few items. Recently, Yin et al. [23] applied the Variational Auto-Encoder (VAE) [9] to label aggregation. Their LAA model contains a classifier and a reconstructor, both of which are neural networks. LAA is an unsupervised model and works well in most cases. We use it as a baseline in our experiments.
3 Methods
3.1 Notation
Let us first introduce the notation used throughout the paper. Suppose there are $N$ items labeled by $M$ workers, where each item has $C$ possible classes. $x_{ij}$ denotes the label of item $i$ given by worker $j$, where $x_{ij} \in \{1, \dots, C\}$. The instance $\mathbf{x}_i$ denotes the redundant labels collected for item $i$, and $X = \{\mathbf{x}_i\}_{i=1}^{N}$ is the collection of all observed labels. Note that each worker may label only a part of the dataset; if item $i$ is not labeled by worker $j$, we set $x_{ij} = 0$. $y_i$ is the unknown true label of item $i$, and $Y = \{y_i\}_{i=1}^{N}$ is the collection of all unknown true labels. Given the observed values of $X$, the goal of label aggregation is to predict the values of $Y$.
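As a concrete illustration of this data layout, the observed labels can be stored as an $N \times M$ integer matrix with 0 marking missing labels; the majority-voting baseline from Section 1 then reduces to a per-row count. The toy values and the helper name `majority_vote` below are our own illustration, not part of the paper:

```python
import numpy as np

# Hypothetical toy dataset: N=4 items, M=3 workers, C=2 classes.
# X[i, j] holds worker j's label for item i (1..C), or 0 if unlabeled.
X = np.array([
    [1, 1, 0],   # item 0: workers 0 and 1 say class 1
    [2, 2, 1],   # item 1: majority says class 2
    [0, 1, 1],   # item 2: workers 1 and 2 say class 1
    [2, 0, 2],   # item 3: workers 0 and 2 say class 2
])

def majority_vote(X, C):
    """Baseline aggregation: most frequent observed class per item."""
    # counts[i, c-1] = number of workers who labeled item i as class c
    counts = np.stack([(X == c).sum(axis=1) for c in range(1, C + 1)], axis=1)
    return counts.argmax(axis=1) + 1  # back to 1-based class labels

print(majority_vote(X, C=2))  # -> [1 2 1 2]
```

Missing labels (the zeros) are simply never counted, which is why the same matrix representation carries over to the sparse real datasets used later.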
3.2 Algorithm Framework
In this section, we introduce our novel label aggregation algorithm. The algorithm has two components: a neural network $N$ and a guiding model $G$. The choice of $N$ and $G$ is very flexible: $N$ can be a multi-layer perceptron (MLP), a convolutional neural network (CNN), or any other neural network. In order to apply stochastic optimization, the loss function should be differentiable with respect to the model parameters of $G$; this is the only constraint on $G$. Our algorithm is easy to train using any stochastic optimization method.
3.2.1 Definition
In our algorithm, $N$ is a label aggregation neural network: given an instance $\mathbf{x}_i$, it predicts the corresponding unknown true label. $N$ is represented as a probability distribution $q(y_i \mid \mathbf{x}_i; \phi)$, where $\mathbf{x}_i$ is the input instance (the collected redundant labels for an item) and $\phi$ denotes the network parameters (e.g., weights and biases). The network's output is a $C$-dimensional vector $\mathbf{q}(\mathbf{x}_i)$ whose $c$-th element $q_c(\mathbf{x}_i)$ is the probability that the true label of the input instance is class $c$.
The noisy-label dataset only contains the observed instances $X$; the corresponding true labels $Y$ are unknown. Label aggregation is to predict these unknown true labels, so this is an unsupervised learning task. In order to train $N$, we define a guiding model $G$. $G$ assumes that an instance is generated from a conditional distribution $p(\mathbf{x}_i \mid y_i; \theta)$, where $y_i$ is the unknown true label and $\theta$ denotes the model parameters. It also assumes that the true label is generated from a prior distribution $p(y_i)$. The guiding model implicitly defines a posterior distribution:

$$p(y_i \mid \mathbf{x}_i; \theta) = \frac{p(y_i)\, p(\mathbf{x}_i \mid y_i; \theta)}{\sum_{c=1}^{C} p(y_i = c)\, p(\mathbf{x}_i \mid y_i = c; \theta)} \quad (1)$$
We do not make any simplifying assumptions about $p(\mathbf{x}_i \mid y_i; \theta)$ and $p(y_i)$ except that the loss function is differentiable with respect to $\theta$.
The optimization goal of this algorithm is to find the consensus between the predictions of $N$ and $G$. That is, the neural network distribution $q(Y \mid X; \phi)$ should be as similar as possible to the posterior distribution $p(Y \mid X; \theta)$. It is therefore natural to use the Kullback-Leibler divergence, a widely used measure of the dissimilarity between two probability distributions, as the loss function:

$$\mathcal{L}(\phi, \theta) = D_{KL}\big(q(Y \mid X; \phi)\, \|\, p(Y \mid X; \theta)\big) \quad (2)$$

where $D_{KL}$ is the Kullback-Leibler divergence. We minimize this loss function during the training process. This kind of optimization goal is commonly used in approximate inference and variational inference [1, 20], and a large body of published work has shown it to be effective. In fact, the data log likelihood can be bounded by
$$\log p(X; \theta) = \mathcal{L}_{ELBO} + D_{KL}\big(q(Y \mid X; \phi)\, \|\, p(Y \mid X; \theta)\big) \ge \mathcal{L}_{ELBO}$$

where $\mathcal{L}_{ELBO}$ is the evidence lower bound (ELBO). Minimizing the KL divergence between $q(Y \mid X; \phi)$ and $p(Y \mid X; \theta)$ is equivalent to maximizing the ELBO. Neural networks are good at fitting probability distributions, so we can efficiently minimize the divergence and tightly bound the data log likelihood $\log p(X; \theta)$. That is why our algorithm framework can work well.
We assume that each collected label is generated independently, so the instances in $X$ are independent of each other. Plugging the factorizations $q(Y \mid X; \phi) = \prod_{i} q(y_i \mid \mathbf{x}_i; \phi)$ and $p(Y \mid X; \theta) = \prod_{i} p(y_i \mid \mathbf{x}_i; \theta)$ into (2), we have:

$$\mathcal{L}(\phi, \theta) = \sum_{i=1}^{N} \mathbb{E}_{q(y_i \mid \mathbf{x}_i; \phi)}\big[\log q(y_i \mid \mathbf{x}_i; \phi) - \log p(y_i \mid \mathbf{x}_i; \theta)\big] \quad (3)$$
Equation (3) cannot be used directly to train $N$ and $G$, because the expression of $p(y_i \mid \mathbf{x}_i; \theta)$ is unknown and the exact posterior may be intractable. Fortunately, our algorithm does not need to compute it explicitly; we rewrite the loss function further. Taking the logarithm of (1), we have

$$\log p(y_i \mid \mathbf{x}_i; \theta) = \log p(y_i) + \log p(\mathbf{x}_i \mid y_i; \theta) - \log p(\mathbf{x}_i; \theta) \quad (4)$$
According to (3) and (4), the loss function is rewritten as

$$\mathcal{L}(\phi, \theta) = \frac{1}{N} \sum_{i=1}^{N} \Big( D_{KL}\big(q(y_i \mid \mathbf{x}_i; \phi)\, \|\, p(y_i)\big) - \mathbb{E}_{q(y_i \mid \mathbf{x}_i; \phi)}\big[\log p(\mathbf{x}_i \mid y_i; \theta)\big] \Big) \quad (5)$$

where the term $\log p(\mathbf{x}_i; \theta)$ is dropped and the loss function is rescaled by $1/N$; as discussed above, this leaves us maximizing the ELBO and does not change the optimization goal. The above completes the definition of our label aggregation algorithm. The structure of the algorithm is shown in Figure 1.
3.2.2 Training
We are going to solve the following optimization problem

$$\min_{\phi, \theta}\; \mathcal{L}(\phi, \theta)$$
The optimization can be performed with stochastic optimization methods such as SGD, Adam [8], and RMSProp [18]. In our algorithm, we apply the mini-batch training that is standard in deep learning; the neural network parameters $\phi$ and the guiding model parameters $\theta$ are trained simultaneously. In mini-batch training,

$$\mathcal{L}_B(\phi, \theta) = \frac{1}{|B|} \sum_{\mathbf{x}_i \in B} \Big( D_{KL}\big(q(y_i \mid \mathbf{x}_i; \phi)\, \|\, p(y_i)\big) - \mathbb{E}_{q(y_i \mid \mathbf{x}_i; \phi)}\big[\log p(\mathbf{x}_i \mid y_i; \theta)\big] \Big) \quad (6)$$

where $B$ is a mini-batch sampled from $X$ and $|B|$ denotes the mini-batch size. The gradient of $\mathcal{L}_B$ is required to update the model parameters. The unobserved variables $y_i$ are discrete and take values from $1$ to $C$. Therefore, we have

$$\mathbb{E}_{q(y_i \mid \mathbf{x}_i; \phi)}\big[\log p(\mathbf{x}_i \mid y_i; \theta)\big] = \sum_{c=1}^{C} q_c(\mathbf{x}_i) \log p(\mathbf{x}_i \mid y_i = c; \theta) \quad (7)$$

$$D_{KL}\big(q(y_i \mid \mathbf{x}_i; \phi)\, \|\, p(y_i)\big) = \sum_{c=1}^{C} q_c(\mathbf{x}_i) \log \frac{q_c(\mathbf{x}_i)}{p(y_i = c)} \quad (8)$$

where $q_c(\mathbf{x}_i)$ is the $c$-th element of the neural network output. According to (7) and (8), the value of $\mathcal{L}_B$ and the corresponding stochastic gradient can be computed easily. The advantage of mini-batch training is that it allows our label aggregation algorithm to be used on large datasets and in online settings.
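With the expectations expanded via (7) and (8), the mini-batch loss (6) is only a few lines of array code. The following numpy sketch (the function name is ours) assumes the network outputs and the guiding-model log-likelihoods have already been computed:

```python
import numpy as np

def minibatch_loss(q, log_px_given_y, prior):
    """Mini-batch loss of Eq. (6), computed via Eqs. (7) and (8).

    q:              (B, C) network outputs q_c(x_i); each row sums to 1
    log_px_given_y: (B, C) log p(x_i | y_i = c; theta) from the guiding model
    prior:          (C,)   prior distribution p(y = c)
    """
    kl = (q * (np.log(q) - np.log(prior))).sum(axis=1)   # Eq. (8)
    exp_ll = (q * log_px_given_y).sum(axis=1)            # Eq. (7)
    return (kl - exp_ll).mean()                          # Eq. (6)

# Toy check: with q equal to the prior, the KL term vanishes and the loss
# reduces to the negative expected log-likelihood.
q = np.array([[0.5, 0.5]])
prior = np.array([0.5, 0.5])
log_px = np.log(np.array([[0.2, 0.8]]))
loss = minibatch_loss(q, log_px, prior)
```

In an actual implementation this scalar would be handed to an automatic-differentiation framework (e.g., TensorFlow), which supplies the stochastic gradient with respect to both $\phi$ and $\theta$.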
3.3 Label Aggregation Models
In this section, we introduce two novel label aggregation models based on the algorithm framework above. For each model, we need to define $N$ and $G$.
3.3.1 A Binary Model Based on Worker Ability Assumptions
We take the ability of each worker into consideration and propose a label aggregation model. This model is called NNWA (NN means neural network and WA means worker ability). In NNWA, $N$ is an MLP: it takes an instance $\mathbf{x}_i$ as input and outputs a distribution $q(y_i \mid \mathbf{x}_i; \phi)$, where $\phi$ denotes the network parameters.
Next, we define a guiding model $G$. As shown in (7) and (8), in order to compute the loss function and its gradient, we need to define $p(\mathbf{x}_i \mid y_i; \theta)$ and $p(y_i)$. In NNWA, for simplicity, we only consider binary labeling tasks (the number of classes $C = 2$). For each class $c$, the ability of worker $j$ is represented by a single parameter $a_{cj}$. We assume that worker $j$ labels an item of true class $c$ correctly with the probability

$$p_{cj} = \frac{1}{1 + e^{-a_{cj}}} \quad (9)$$

According to this assumption, we have:

$$p(x_{ij} \mid y_i = c; \theta) = \begin{cases} p_{cj} & x_{ij} = c \\ 1 - p_{cj} & x_{ij} \neq c \end{cases}$$

We can see that the higher the ability $a_{cj}$ of worker $j$ is, the higher the likelihood that he or she labels the item correctly; when $a_{cj} = 0$, he or she just randomly chooses one of the two classes. Therefore our assumption is reasonable. According to (9), the conditional distribution that generates an instance is defined as
$$p(\mathbf{x}_i \mid y_i = c; \theta) = \prod_{j \in W_i} p(x_{ij} \mid y_i = c; \theta) \quad (10)$$

where $W_i$ is the set of workers who have labeled item $i$. In this model, the prior distribution $p(y_i)$ is fixed during the training process; it has no parameters to be trained. $p(y_i)$ is a multinomial distribution and is estimated by

$$p(y_i = c) = \frac{\sum_{i=1}^{N} \sum_{j=1}^{M} \mathbb{1}(x_{ij} = c)}{\sum_{i=1}^{N} \sum_{j=1}^{M} \mathbb{1}(x_{ij} \neq 0)} \quad (11)$$
where the values of the estimators are obtained by counting the observed labels. Since $p(y_i)$ is fixed, we introduce a hyperparameter $\lambda$ to weight the Kullback-Leibler divergence term in the loss function (6); we regard this weighted term as a regularizer. Then, using (7) and (8), the mini-batch loss function used in practice is

$$\mathcal{L}_B(\phi, \theta) = \frac{1}{|B|} \sum_{\mathbf{x}_i \in B} \Big( \lambda \sum_{c=1}^{C} q_c(\mathbf{x}_i) \log \frac{q_c(\mathbf{x}_i)}{p(y_i = c)} - \sum_{c=1}^{C} q_c(\mathbf{x}_i) \log p(\mathbf{x}_i \mid y_i = c; \theta) \Big) \quad (12)$$
NNWA is formally shown in Algorithm 1.
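A minimal numpy sketch may make (9)-(11) concrete. It assumes the sigmoid parameterization of (9) and the label encoding of Section 3.1 (0 = unlabeled); the helper names are our own illustration:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def empirical_prior(X, C):
    """Eq. (11): estimate p(y = c) by counting the observed labels."""
    counts = np.array([(X == c).sum() for c in range(1, C + 1)], dtype=float)
    return counts / counts.sum()

def nnwa_log_lik(x_i, a, C=2):
    """log p(x_i | y_i = c; theta) for c = 1..C under Eqs. (9)-(10):
    worker j labels an item of true class c correctly with probability
    sigmoid(a[c-1, j]); ability 0 means a 50/50 random guess.

    x_i: (M,) labels for one item, 0 = unlabeled
    a:   (C, M) per-class worker ability parameters
    """
    out = np.zeros(C)
    for c in range(1, C + 1):
        p_correct = sigmoid(a[c - 1])
        observed = x_i != 0
        correct = observed & (x_i == c)
        wrong = observed & (x_i != c)
        # Product over the workers W_i who labeled this item, Eq. (10)
        out[c - 1] = (np.log(p_correct[correct]).sum()
                      + np.log(1.0 - p_correct[wrong]).sum())
    return out
```

Working in log space keeps the product in (10) numerically stable, and the masks restrict the product to $W_i$ exactly as the equation requires.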
3.3.2 A Multiclass Label Aggregation Model
In this section we design a novel label aggregation model that can aggregate multi-class noisy labels. We call it NNMC (MC means multi-class). In NNMC, $N$ is also an MLP: a fully connected neural network with a softmax activation function on the last layer.
Now we design a new guiding model $G$ for NNMC. Unlike NNWA, we do not make explicit assumptions about the ability or behavior of a worker; we design $G$ from another perspective. Every element of an instance is an independently collected label, so we assume that in $G$, the $j$-th element $x_{ij}$ of an instance is generated by an independent distribution $p_j(x_{ij} \mid y_i; \theta)$. This distribution is defined as

$$p_j(x_{ij} = k \mid y_i = c; \theta) = \frac{e^{v_{cjk}}}{\sum_{k'=1}^{C} e^{v_{cjk'}}} \quad (13)$$

where $\mathbf{v}_{cj} = (v_{cj1}, \dots, v_{cjC})$ is a $C$-dimensional parameter vector. Then $p(\mathbf{x}_i \mid y_i; \theta)$ can be defined as

$$p(\mathbf{x}_i \mid y_i = c; \theta) = \prod_{j \in W_i} p_j(x_{ij} \mid y_i = c; \theta) \quad (14)$$

where $x_{ij}$ is the $j$-th element of $\mathbf{x}_i$ and $W_i$ is the set of workers who have labeled item $i$. Since the softmax function is differentiable, $\theta = \{\mathbf{v}_{cj}\}$ can be updated by stochastic optimization methods. Just like NNWA, the prior $p(y_i)$ in NNMC is fixed and estimated by (11). The loss function of this model is also defined as in (12). NNMC is illustrated in Algorithm 2.
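The NNMC conditional distribution (13)-(14) can likewise be sketched in a few lines of numpy; the parameter layout `V[c, j]` (one $C$-dimensional logit vector per class-worker pair) and the helper name are our illustration:

```python
import numpy as np

def nnmc_log_lik(x_i, V):
    """log p(x_i | y_i = c; theta) under Eqs. (13)-(14): for each true class
    c and worker j, the C-dimensional logit vector V[c, j] is pushed through
    a softmax to give worker j's label distribution.

    x_i: (M,) observed labels for one item, 0 = unlabeled
    V:   (C, M, C) per-(class, worker) logit vectors
    """
    C, M, _ = V.shape
    out = np.zeros(C)
    for c in range(C):
        for j in range(M):
            if x_i[j] == 0:
                continue                         # product runs over W_i only
            z = V[c, j] - V[c, j].max()          # Eq. (13), stable softmax
            p = np.exp(z) / np.exp(z).sum()
            out[c] += np.log(p[x_i[j] - 1])      # Eq. (14), in log space
    return out
```

Each softmaxed vector plays the role of one row of a confusion matrix, which is why the diagonal entries examined in Section 4.6.2 are interpretable as per-class labeling accuracies.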
4 Experiments
4.1 Baselines
We compare our models with majority voting and six state-of-the-art baseline methods: Dawid&SkeneEM [2], Minimax Entropy [25], BCC [7], GLAD [22], MMCE [26], and LAA [23]. Dawid&SkeneEM is a classic generative model for label aggregation. Minimax Entropy is an extension of Dawid&SkeneEM; it assumes that the observed labels are generated by a distribution over workers, items, and labels. Bayesian Classifier Combination (BCC) is a Bayesian network using confusion matrices [7]. The Generative model of Labels, Abilities and Difficulties (GLAD) is a binary label aggregation model that can simultaneously infer the true labels, the expertise of workers, and the difficulty of items. Minimax Conditional Entropy (MMCE) uses a minimax entropy principle to aggregate noisy labels. Label-Aware Autoencoders (LAA) applies variational autoencoders to label aggregation.
4.2 Datasets
We use four realworld datasets in our experiments. The detailed information of them is shown in Table 1.
Dataset  Workers  Items  Labels  Classes 

Adult  17  263  1370  4 
RTE  164  800  8000  2 
Heart  12  237  952  2 
Age  165  1002  10020  7 
Adult [13] is a dataset in which Mechanical Turk workers label websites; the web pages are classified into four classes according to the amount of adult content on each page. RTE [16] is a dataset about recognizing textual entailment, in which 164 workers assign 800 items to two classes. The Heart dataset is labeled by 12 medical students, who judge whether patients have heart disease based on physical examination results; the physical examination samples and the corresponding diagnostic results are downloaded from the UC Irvine machine learning repository [5]. In order to use Age [4] in our experiments, its labels have been discretized into 7 bins.
4.3 True Label Prediction
After training NNWA and NNMC, we use maximum likelihood estimation (MLE) to predict the true label of each item. For NNWA,

$$\hat{y}_i = \operatorname*{arg\,max}_{c}\; p(\mathbf{x}_i \mid y_i = c; \theta)$$

where $p(\mathbf{x}_i \mid y_i = c; \theta)$ is given by (10). For NNMC,

$$\hat{y}_i = \operatorname*{arg\,max}_{c}\; p(\mathbf{x}_i \mid y_i = c; \theta)$$

where the values of $p_j(x_{ij} \mid y_i = c; \theta)$ are computed by (13). The prediction error rates are computed by comparing the predicted true labels with the ground truth labels. Note that the ground truth labels are used only for evaluation; we do not use them in the training stage.
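The prediction step can be sketched as follows; the helper names are ours, and the $(N, C)$ log-likelihood matrix is assumed to come from (10) or (14):

```python
import numpy as np

def predict_mle(log_px_given_y):
    """MLE prediction: for each item, pick the class c maximizing the
    trained guiding-model likelihood p(x_i | y_i = c; theta).

    log_px_given_y: (N, C) log-likelihoods; returns 1-based labels.
    """
    return log_px_given_y.argmax(axis=1) + 1

def error_rate(pred, truth):
    """Fraction of items whose predicted label differs from ground truth."""
    return float((np.asarray(pred) != np.asarray(truth)).mean())
```

Because the log is monotone, taking the argmax of the log-likelihood is equivalent to taking the argmax of the likelihood itself.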
4.4 Setups
The open-source implementations of majority voting, Dawid&SkeneEM, Minimax Entropy, and MMCE are provided by Zhou [25, 26]. GLAD, LAA, and our two label aggregation models are implemented in TensorFlow (www.tensorflow.org), which provides GPU acceleration. We apply an MLP as the label aggregation neural network and use the hyperbolic tangent (tanh) [6] as the activation function. A softmax function is used as the output of each network. We use RMSProp [18] as the stochastic optimizer to minimize the loss; in our algorithm, the loss finally converges to a stable value. As shown in Algorithm 1 and Algorithm 2, we implement shallow MLPs for NNWA and NNMC. Deeper neural networks are also easy to apply in our algorithm; however, limited by the data size, they do not improve the prediction accuracy. We use the likelihood as the criterion to select the hyperparameter $\lambda$. After training the model and predicting the true labels, this likelihood is easy to compute, so we test multiple values of $\lambda$ and select the one that yields the maximal likelihood.
4.5 ErrorRates of Methods
The prediction error rates of our models and the baselines are shown in Table 2; note that GLAD and NNWA can only aggregate binary noisy labels. The best results are highlighted in bold. NNWA has the best performance on the Heart dataset. NNMC outperforms the baselines on the Adult, Heart, and Age datasets. On the RTE dataset, our models achieve accuracy similar to GLAD's. The results show that our label aggregation algorithm is effective and that our models are more accurate than the baseline methods.
Method  Adult  RTE  Heart  Age 
Majority Voting  26.43  10.31  22.36  34.88 
Dawid&SkeneEM  29.84  7.25  18.14  39.62 
BCC  22.81  7.15  18.82  33.53 
GLAD  —  7.00  20.59  — 
Minimax Entropy  24.33  7.25  16.03  32.63 
MMCE  24.33  7.50  16.03  31.14 
LAA  25.86  13.00  15.19  33.55 
NNWA  —  7.28  12.24  — 
NNMC  21.60  7.13  12.66  30.18 
4.6 Further Investigation
4.6.1 Effectiveness of Worker Ability Detection
After training NNWA, the ability $a_{cj}$ of each worker $j$ for each class $c$ can be read off. In order to evaluate the effectiveness of worker ability detection, we take the parameters trained on the Heart dataset. However, these raw parameters are not intuitive and not easy to understand. Therefore, for each class $c$, we further use equation (9) to compute the predicted accuracy of each worker. The real accuracy of each worker is easy to compute by comparing the collected noisy labels with the ground truth labels. Given class $c$ and worker $j$, the corresponding real accuracy is computed by

$$\text{acc}_{cj} = \frac{\sum_{i=1}^{N} \mathbb{1}(x_{ij} = c \wedge y_i^{*} = c)}{\sum_{i=1}^{N} \mathbb{1}(x_{ij} \neq 0 \wedge y_i^{*} = c)}$$

where $y_i^{*}$ denotes the ground truth label of item $i$. The results are illustrated in Figure 2. The prediction is consistent with reality: in most cases, the predicted accuracy and the real accuracy are quite similar, which means NNWA can effectively detect the ability of workers.
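The real accuracy above can be computed directly from the label matrix and the ground truth; this small numpy helper (its name and the array shapes are our illustration) does the counting:

```python
import numpy as np

def real_accuracy(X, y_true, c, j):
    """Empirical accuracy of worker j on class c: among the items of ground
    truth class c that worker j labeled, the fraction labeled as c.

    X:      (N, M) noisy labels, 0 = unlabeled
    y_true: (N,)   ground truth labels
    """
    mask = (X[:, j] != 0) & (y_true == c)
    if mask.sum() == 0:
        return float("nan")  # worker j saw no items of class c
    return float((X[mask, j] == c).mean())
```

Restricting the denominator to the items worker $j$ actually labeled matters on sparse datasets, where each worker sees only a fraction of the items.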
4.6.2 Evaluation of the Trained Parameters in NNMC
The trained parameters $\mathbf{v}_{cj}$ in NNMC reflect the reliability of each worker. Since the raw values of $\mathbf{v}_{cj}$ are hard to interpret, for each class $c$ and worker $j$ we further use equation (13) to compute the distribution $p_j(x_{ij} \mid y_i = c; \theta)$. Table 3 shows the results on the Adult dataset. Due to limited space, we only illustrate the results of two representative workers, and we only show the diagonal values, omitting the others for clarity. The labeling accuracy of each worker is computed by comparing his or her labels with the ground truth labels. We can see that the worker with the higher labeling accuracy also has larger diagonal values. This is reasonable: according to the definition of $p_j(x_{ij} \mid y_i = c; \theta)$, a diagonal value is the probability that the collected label equals the true label $c$. The results show that, by learning the model parameters, NNMC can capture knowledge of the reliability of workers.


5 Conclusions
We present a novel algorithm which aggregates noisy labels by finding consensus between a neural network and a guiding model. According to the algorithm framework, we design two label aggregation models called NNWA and NNMC. In our algorithm, there are very few limitations on the choices of the label aggregation neural network and the guiding model. Therefore, our algorithm is very flexible and easy to extend. The experimental results on four realworld datasets show that our models outperform stateoftheart label aggregation methods.
References

[1] Nasrabadi, Nasser M. ”Pattern recognition and machine learning.” Journal of Electronic Imaging 16, no. 4 (2007): 049901.
[2] Dawid, Alexander Philip, and Allan M. Skene. ”Maximum likelihood estimation of observer error-rates using the EM algorithm.” Applied Statistics (1979): 20-28.
[3] Gaunt, Alex, Diana Borsa, and Yoram Bachrach. ”Training deep neural nets to aggregate crowdsourced responses.” In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, pp. 242-251. AUAI Press, 2016.
[4] Han, Hu, Charles Otto, Xiaoming Liu, and Anil K. Jain. ”Demographic estimation from face images: Human vs. machine performance.” IEEE Transactions on Pattern Analysis & Machine Intelligence 6 (2015): 1148-1161.
[5] Janosi, A., W. Steinbrunn, M. Pfisterer, and R. Detrano. Heart disease dataset. https://archive.ics.uci.edu/ml/datasets/Heart+Disease, 1988.
[6] Karlik, Bekir, and A. Vehbi Olgac. ”Performance analysis of various activation functions in generalized MLP architectures of neural networks.” International Journal of Artificial Intelligence and Expert Systems 1, no. 4 (2011): 111-122.
[7] Kim, Hyun-Chul, and Zoubin Ghahramani. ”Bayesian classifier combination.” In Artificial Intelligence and Statistics, pp. 619-627. 2012.
[8] Kingma, Diederik P., and Jimmy Ba. ”Adam: A method for stochastic optimization.” arXiv preprint arXiv:1412.6980 (2014).
[9] Kingma, Diederik P., and Max Welling. ”Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013).
[11] Liu, Qiang, Jian Peng, and Alexander T. Ihler. ”Variational inference for crowdsourcing.” In Advances in Neural Information Processing Systems, pp. 692-700. 2012.
[12] Li, Hongwei, Bo Zhao, and Ariel Fuxman. ”The wisdom of minority: Discovering and targeting the right group of workers for crowdsourcing.” In Proceedings of the 23rd International Conference on World Wide Web, pp. 165-176. ACM, 2014.
[13] Ipeirotis, Panagiotis G., Foster Provost, and Jing Wang. ”Quality management on Amazon Mechanical Turk.” In Proceedings of the ACM SIGKDD Workshop on Human Computation, pp. 64-67. ACM, 2010.
[14] Raykar, Vikas C., Shipeng Yu, Linda H. Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. ”Learning from crowds.” Journal of Machine Learning Research 11, no. Apr (2010): 1297-1322.
[15] Simpson, Edwin D., Matteo Venanzi, Steven Reece, Pushmeet Kohli, John Guiver, Stephen J. Roberts, and Nicholas R. Jennings. ”Language understanding in the wild: Combining crowdsourcing and machine learning.” In Proceedings of the 24th International Conference on World Wide Web, pp. 992-1002. International World Wide Web Conferences Steering Committee, 2015.

[16] Snow, Rion, Brendan O’Connor, Daniel Jurafsky, and Andrew Y. Ng. ”Cheap and fast—but is it good?: evaluating non-expert annotations for natural language tasks.” In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 254-263. Association for Computational Linguistics, 2008.
[17] Tian, Tian, and Jun Zhu. ”Max-margin majority voting for learning from crowds.” In Advances in Neural Information Processing Systems, pp. 1621-1629. 2015.
[18] Tieleman, Tijmen, and Geoffrey Hinton. ”Lecture 6.5-RMSProp, COURSERA: Neural networks for machine learning.” University of Toronto, Technical Report (2012).
[19] Venanzi, Matteo, John Guiver, Gabriella Kazai, Pushmeet Kohli, and Milad Shokouhi. ”Community-based Bayesian aggregation models for crowdsourcing.” In Proceedings of the 23rd International Conference on World Wide Web, pp. 155-164. ACM, 2014.
[20] Wainwright, Martin J., and Michael I. Jordan. ”Graphical models, exponential families, and variational inference.” Foundations and Trends® in Machine Learning 1, no. 1-2 (2008): 1-305.
[21] Welinder, Peter, Steve Branson, Pietro Perona, and Serge J. Belongie. ”The multidimensional wisdom of crowds.” In Advances in Neural Information Processing Systems, pp. 2424-2432. 2010.
[22] Whitehill, Jacob, Tingfan Wu, Jacob Bergsma, Javier R. Movellan, and Paul L. Ruvolo. ”Whose vote should count more: Optimal integration of labels from labelers of unknown expertise.” In Advances in Neural Information Processing Systems, pp. 2035-2043. 2009.
[23] Yin, Li’ang, Jianhua Han, Weinan Zhang, and Yong Yu. ”Aggregating crowd wisdoms with label-aware autoencoders.” In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 1325-1331. AAAI Press, 2017.
[24] Zhou, Yao, and Jingrui He. ”Crowdsourcing via tensor augmentation and completion.” In IJCAI, pp. 2435-2441. 2016.
[25] Zhou, Denny, Sumit Basu, Yi Mao, and John C. Platt. ”Learning from the wisdom of crowds by minimax entropy.” In Advances in Neural Information Processing Systems, pp. 2195-2203. 2012.
[26] Zhou, Dengyong, Qiang Liu, John Platt, and Christopher Meek. ”Aggregating ordinal labels from crowds by minimax conditional entropy.” In International Conference on Machine Learning, pp. 262-270. 2014.