
Improve Learning from Crowds via Generative Augmentation

Crowdsourcing provides an efficient label collection schema for supervised machine learning. However, to control annotation cost, each instance in the crowdsourced data is typically annotated by a small number of annotators. This creates a sparsity issue and limits the quality of machine learning models trained on such data. In this paper, we study how to handle sparsity in crowdsourced data using data augmentation. Specifically, we propose to directly learn a classifier by augmenting the raw sparse annotations. We implement two principles of high-quality augmentation using Generative Adversarial Networks: 1) the generated annotations should follow the distribution of authentic ones, which is measured by a discriminator; 2) the generated annotations should have high mutual information with the ground-truth labels, which is measured by an auxiliary network. Extensive experiments and comparisons against an array of state-of-the-art learning from crowds methods on three real-world datasets proved the effectiveness of our data augmentation framework. It shows the potential of our algorithm for low-budget crowdsourcing in general.



1. Introduction

Modern machine learning systems are data hungry, especially for labeled data, which unfortunately is expensive to acquire at scale. Crowdsourcing provides a label collection schema that is both cost- and time-efficient (Buecheler et al., 2010). It spurs the growing research efforts in directly learning a classifier with only crowdsourced annotations, aka the learning from crowds problem.

In practice, to minimize annotation cost, the instances in crowdsourced data are typically labeled by a small number of annotators; and each annotator will only be assigned to a few instances. This introduces serious sparsity in crowdsourced data. We looked into two widely-used public crowdsourced datasets for multi-class classification, one for image labeling (referred to as LabelMe (Russell et al., 2008; Rodrigues and Pereira, 2018)) and one for music genre classification (referred to as Music (Rodrigues et al., 2014)). On the LabelMe dataset, each instance is only labeled by 2.5 annotators on average (out of 59 annotators), while 88% of the annotators provide fewer than 100 annotations (out of 1,000 instances). On the Music dataset, each instance is labeled by 4.2 annotators on average (out of 44 annotators), while 87.5% of the annotators provide fewer than 100 annotations (out of 700 instances). Such severe sparsity hinders the utility of crowdsourced labels. On the instance side, annotations provided by non-experts are noisy, and their quality is expected to be improved by redundant annotations. But subject to the budget constraint, redundancy is also to be minimized. This conflict directly limits the quality of crowdsourced labels. On the annotator side, most existing crowdsourcing algorithms model annotator-specific confusions, which are used for label aggregation (Dawid and Skene, 1979), task assignment (Deng et al., 2013; Li et al., 2016) and annotator education (Singla et al., 2014). But due to the limited observations per annotator, such modeling can hardly be accurate, and thus various approximations (e.g., strong independence assumptions (Dawid and Skene, 1979)) have to be devised.

A straightforward solution to address annotation sparsity is to recruit more annotators or increase their assignments, at the cost of an increasing budget. This however is against the goal of crowdsourcing, i.e., to collect labeled data at a low cost. We approach the problem from a different perspective: we perform data augmentation using generative models to fill in the missing annotations. Instead of collecting more real annotations, we generate annotations by modeling the annotation distribution on instances and annotators. Given our end goal is to obtain an accurate classifier, the key is to figure out what annotations best help the classifier’s training. We propose two important criteria. First, the generated annotations should follow the distribution of authentic ones, such that they will be consistent with the label confusion patterns observed in the original annotations. Second, the generated annotations should well align with the ground-truth labels, e.g., with high mutual information (Xu et al., 2019; Harutyunyan et al., 2020), so that they will be informative about ground-truth labels to the classifier.

We realize our criteria for annotation augmentation in crowdsourced data using Generative Adversarial Networks (GAN) (Goodfellow et al., 2014). The end product of our solution is a classifier, which predicts the label of a given instance. We set a discriminative model to judge whether an annotation is authentic or generated. Meanwhile, a generative model aims to generate annotations following the distribution of authentic annotations under the guidance of the discriminative model. On a given instance, the generator takes the classifier’s output and the annotator and instance features as input to generate the corresponding annotation. To ensure the informativeness of generated annotations, we maximize the mutual information between the classifier’s predicted label and the generated annotation on each instance (Chen et al., 2016). A two-step training strategy is proposed to avoid model collapse. We name our framework as CrowdInG - learning with Crowdsourced data through Informative Generative augmentation. Extensive experiments on three real-world datasets demonstrated the feasibility of data augmentation for the problem of learning from crowds. Our solution outperformed a set of state-of-the-art crowdsourcing algorithms; and its advantage becomes especially evident with extremely sparse annotations. It provides a new opportunity for low-budget crowdsourcing in general.

2. Related works

Our work studies the learning from crowds problem. Raykar et al. (2010) employed an EM algorithm to jointly estimate the expertise of different annotators and a logistic regression classifier on crowdsourced data. They followed the well-known Dawid and Skene (DS) model (Dawid and Skene, 1979) to model the observed annotations. Albarqouni et al. (2016) extended this solution by replacing the logistic classifier with a deep neural network classifier. Rodrigues and Pereira (2018) further extended the solution by replacing the confusion matrix in the DS model with a neural network to model annotators’ expertise, and trained the model in an end-to-end manner. Guan et al. (2018) used a neural classifier to model each annotator, and aggregated the predictions from the classifiers by a weighted majority vote. Cao et al. (2019) proposed an information-theoretical deep learning solution to handle the correlated mistakes across annotators. However, all the mentioned solutions only use the observed annotations, such that their practical performance is limited by the sparsity of annotations.

Another research line focuses on modeling the annotators. Whitehill et al. (2009) proposed a probabilistic model which considers both annotator accuracy and instance difficulty. Rodrigues et al. (2014) modeled the annotation process by a Gaussian process. Imamura et al. (2018) and Venanzi et al. (2014) extended the DS model by sharing the confusion matrices among similar annotators to improve annotator modeling with limited observations. The confusions of annotators with few annotations are hard to model accurately, and Kamar et al. (2015) proposed to address the issue with a shared global confusion matrix. Chu et al. (2020) also set a global confusion matrix, which is used to capture the common confusions beyond individual ones. However, the success of the aforementioned models relies on the assumed structures among annotators or annotations. Such strong assumptions are needed, because the sparsity in the annotations does not support more complicated models. But they also restrict the modeling of crowdsourced data, e.g., introducing bias in the learnt model. We lift such restrictions by directly generating annotations, such that our modeling of crowdsourced data does not even make any class- or annotator-dependent assumptions.

Benefiting from their powerful modeling capabilities, deep generative models have been popularly used for data augmentation purposes. Most efforts have been spent on problems in a continuous space, such as image and video generations. Semi-supervised GAN (Odena, 2016; Springenberg, 2015; Antoniou et al., 2017) augments training data by generating new instances from labeled ones. Chae et al. (2018) employed GAN to address the data sparsity in content recommendation, with their proposed real-value, vector-wise recommendation model training. Recently, GAN has also been adopted in data augmentation for discrete problems. Wang et al. (2019) designed a two-step solution to perform GAN training for collaborative filtering. Wang et al. (2018) unified generative and discriminative graph neural networks in a GAN framework to enhance the graph representation learning. Irissappane et al. (2020) reduced the needed labeled data to fine-tune the BERT-like text classification models via GAN-generated examples.

Figure 1. Overview of the CrowdInG framework. We first sample annotations from the annotation distributions provided by the generator. The discriminator and the auxiliary network are trained on the selected annotations. Then, the classifier is fixed and the generator is updated according to the discriminator’s signal and the information loss. Finally, the generator is fixed and the classifier is updated according to the discriminator’s signal.

3. Methodology

In this section, we begin our discussion with the background techniques of our augmentation framework, including GAN and InfoGAN. Then we present our CrowdInG solution and discuss its detailed design of each component. Finally, we describe our two-step training strategy for the proposed framework.

3.1. Background

3.1.1. Generative Adversarial Networks

Goodfellow et al. (2014) introduced the GAN framework for training deep generative models as a minimax game, whose goal is to learn a generative distribution $p_G(x)$ that aligns with the real data distribution $p_{data}(x)$. The generative distribution is imposed by a generative model $G$, which transforms a noise variable $z \sim p_{noise}(z)$ into a sample $G(z)$. A discriminative model $D$ is set to distinguish between the authentic and generated samples. The generator $G$ is trained by playing against the discriminator $D$. Formally, $G$ and $D$ play the following two-player minimax game with value function $V(D, G)$:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\big[\phi(D(x))\big] + \mathbb{E}_{z \sim p_{noise}(z)}\big[\phi(1 - D(G(z)))\big]$$

where $\phi$ is a function of choice and $\log$ is typically the choice. The optimal parameters of the generator and the discriminator can be learned by alternately maximizing and minimizing the value function $V(D, G)$. In this paper, we adopt this idea to model the annotation distribution: a generator is used to generate annotations on specific instances and annotators; and a discriminator is set to distinguish the authentic annotations from the generated ones.
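As a small numerical illustration of the GAN value function, the sketch below estimates it from batches of discriminator outputs, assuming the common choice of the logarithm for the function applied to those outputs. The function name `gan_value` and the toy scores are illustrative and do not come from the paper.

```python
import math

def gan_value(d_real, d_fake, phi=math.log):
    """Monte Carlo estimate of V(D, G) = E[phi(D(x))] + E[phi(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on authentic samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1).
    """
    real_term = sum(phi(d) for d in d_real) / len(d_real)
    fake_term = sum(phi(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that cannot tell the two apart outputs 0.5 everywhere,
# which gives V = 2 * log(0.5) = -log(4).
confused = gan_value([0.5, 0.5], [0.5, 0.5])

# A sharper discriminator (high on authentic, low on generated) attains a larger value.
sharp = gan_value([0.9, 0.8], [0.1, 0.2])
```

The discriminator tries to push this value up, while the generator tries to pull it down by making its samples harder to separate from authentic ones.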

3.1.2. Information Maximizing Generative Adversarial Networks

Chen et al. (2016) extended GAN with an information-theoretic loss to learn disentangled representations for improved data generation. Aside from the value function $V(D, G)$, InfoGAN also maximizes the mutual information between a small subset of latent variables (referred to as latent code $c$) and the generated data. The generator takes both random noise $z$ and latent code $c$ as input, where the latent code is expected to capture the salient structure in the data distribution. The minimax game then turns into an information-regularized form,

$$\min_G \max_D V_I(D, G) = V(D, G) - \lambda I(c; G(z, c))$$

where $I(c; G(z, c))$ is the mutual information between random variables $c$ and $G(z, c)$, and $\lambda$ is the regularization coefficient.

3.2. The CrowdInG framework

Let $\mathcal{D} = \{x_i, \mathbf{y}_i\}_{i=1}^{N}$ denote a set of $N$ instances labeled by $R$ annotators out of $K$ possible classes. We define $x_i$ as the feature vector of the $i$-th instance and $y_i^r$ as its annotation provided by the $r$-th annotator. $\mathbf{y}_i$ is thus the annotation vector (with missing values) from the $R$ annotators for the $i$-th instance. When available, the feature vector of the $r$-th annotator is denoted as $e_r$; otherwise, we use a one-hot vector to represent an annotator. Each instance is associated with an unobserved ground-truth label $z_i$. The goal of learning from crowds is to obtain a classifier $P(z \mid x)$ that is directly estimated from $\mathcal{D}$.

The framework of CrowdInG is depicted in Figure 1. It consists of two main components: 1) a generative module, including a classifier and a generator; and 2) a discriminative module, including a discriminator and an auxiliary network. In the generative module, the classifier first takes an instance $x_i$ as input and outputs its predicted label distribution $P(z \mid x_i)$. For simplicity, we collectively denote the classifier’s output for an instance as $\hat{z}_i$. Then the generator takes the instance $x_i$, the annotator $e_r$, the classifier’s output $\hat{z}_i$, together with a random noise vector $\epsilon$, to generate the corresponding annotation distribution $P(y \mid x_i, e_r, \hat{z}_i, \epsilon)$. The discriminative module is designed based on our criteria of high-quality annotations to evaluate the generations. On one hand, the discriminative module uses a discriminator to differentiate whether the annotation triplet ($x_i$, $e_r$, $y_i^r$) is authentic or generated. On the other hand, the discriminative module penalizes the generation based on the mutual information between the generated annotation and the classifier’s output, measured by an auxiliary network. Following the idea of InfoGAN, we treat the classifier’s output as the latent code in our annotation generation, and the auxiliary network measures the mutual information between $\hat{z}_i$ and the generated annotation. The two modules play a minimax game in CrowdInG. A better classifier is expected as the discriminative module faces more difficulties in recognizing the generated annotations during training.

3.2.1. Generative module

The output of the generative module is an annotation distribution for a given annotator-instance pair $(e_r, x_i)$. Sampling is applied to obtain the final annotations. As shown in Figure 1, this is a two-step procedure. First, the classifier predicts the label of a given instance $x_i$ by

$$P(z \mid x_i) = \mathrm{softmax}\big(f_C(x_i)\big)$$

where $f_C$ is a learnable scoring function chosen according to the specific classification tasks. Then the generator takes the classifier’s output as input to predict the underlying annotation distribution for the given annotator-instance pair. Moving beyond the classical class-dependent annotation confusion assumption (Dawid and Skene, 1979; Rodrigues and Pereira, 2018), we impose a much more relaxed generative process about the annotations. We consider that the confusions can be caused by instance difficulty, or annotator expertise, or true labels of the instances (e.g., different annotation difficulty in different label categories), or even some random noise. To realize the idea, we provide the feature vector of the instance, the annotator and the classifier’s output to the generator as input, and the corresponding annotation distribution is modeled as,

$$P(y \mid x_i, e_r, \hat{z}_i, \epsilon) = \mathrm{softmax}\big(f_G(x_i, e_r, \hat{z}_i, \epsilon)\big)$$

where $\epsilon$ is a random noise vector and $f_G$ is a learnable scoring function implemented via a neural network. The generated annotations are sampled from the resulting distribution. To simplify our notations, we use $P_G(y \mid x_i, e_r)$ to represent the predicted annotation distribution; and when no ambiguity is invoked, we denote $P_G(y_i^r)$ as its $k$-th entry when $y_i^r = k$. Thanks to our data augmentation framework, we can afford a more flexible modeling of the annotation noise, e.g., dropping the hard independence assumptions made in previous works (Rodrigues and Pereira, 2018; Dawid and Skene, 1979). This in turn helps us boost the quality of generated annotations.
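The two-step generative procedure can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: single linear maps (`W_clf`, `W_gen`) stand in for the neural scoring functions, and all dimensions and names are hypothetical.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
num_classes, feat_dim, num_annotators, noise_dim = 3, 4, 5, 2

# Hypothetical linear scoring functions standing in for the neural networks.
W_clf = rng.normal(size=(feat_dim, num_classes))
gen_in_dim = feat_dim + num_annotators + num_classes + noise_dim
W_gen = rng.normal(size=(gen_in_dim, num_classes))

x = rng.normal(size=feat_dim)        # instance features
e = np.eye(num_annotators)[2]        # one-hot vector for annotator 2
eps = rng.normal(size=noise_dim)     # random noise vector

# Step 1: the classifier predicts a label distribution for the instance.
z_hat = softmax(x @ W_clf)

# Step 2: the generator sees the instance, annotator, classifier output and noise.
gen_input = np.concatenate([x, e, z_hat, eps])
annotation_dist = softmax(gen_input @ W_gen)

# Sampling turns the distribution into a concrete generated annotation.
generated_annotation = rng.choice(num_classes, p=annotation_dist)
```

Because the annotator vector and the noise enter the generator alongside the classifier output, the same instance can yield different annotation distributions for different annotators, which is the flexibility the text describes.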

3.2.2. Discriminative module

We realize our principles of high-quality annotations in the discriminative module. First, the discriminator aims to differentiate whether an annotation $y_i^r$ from annotator $e_r$ on instance $x_i$ is authentic, i.e., $D(x_i, e_r, y_i^r)$ predicts the probability of annotation $y_i^r$ being authentic. In a crowdsourcing task, an annotator might confuse a ground-truth label with several classes, such that all of the confused classes could be independently authentic. For example, if an annotator always confuses “birds” with “airplanes” in low resolution images, his/her annotations might be random between these two categories. And thus both types of annotations should be considered as valid, as there is no way to tell which annotation is “correct” only based on the observations of his/her annotations. As a result, we realize the discriminator as a multi-label classifier, which takes an annotation triplet ($x_i$, $e_r$, $y_i^r$) as input and calculates the discriminative score by a bilinear model,

$$D(x_i, e_r, y_i^r = k) = \sigma\big((W_e e_r + b_e)^\top M_k (W_x x_i + b_x)\big) \quad (2)$$

where $\sigma$ is the sigmoid function, $M_k$ is the weight matrix for class $k$, and ($W_e$, $b_e$) and ($W_x$, $b_x$) are the weight matrices and bias terms for the annotator and instance embedding layers. For simplicity, we denote $D(y_i^r)$ as the discriminator’s output on annotation $y_i^r$.
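A multi-label bilinear discriminator of this kind can be sketched as below. The exact arrangement of the embeddings around the class weight matrix is an assumption made for illustration, as are the function name `discriminator_scores` and the toy dimensions.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def discriminator_scores(x, e, W_x, b_x, W_e, b_e, M):
    """Per-class authenticity scores from a bilinear model.

    M has shape (num_classes, d, d): one weight matrix per class.
    Returns a vector of length num_classes; each entry is scored
    independently by a sigmoid, so entries need not sum to one.
    """
    v = W_e @ e + b_e          # annotator embedding, shape (d,)
    u = W_x @ x + b_x          # instance embedding, shape (d,)
    # Bilinear form v^T M_k u for every class k at once.
    logits = np.einsum("i,kij,j->k", v, M, u)
    return sigmoid(logits)

rng = np.random.default_rng(0)
num_classes, feat_dim, num_annotators, emb_dim = 3, 4, 5, 2
W_x, b_x = rng.normal(size=(emb_dim, feat_dim)), np.zeros(emb_dim)
W_e, b_e = rng.normal(size=(emb_dim, num_annotators)), np.zeros(emb_dim)
M = rng.normal(size=(num_classes, emb_dim, emb_dim))

x = rng.normal(size=feat_dim)      # instance features
e = np.eye(num_annotators)[1]      # one-hot vector for annotator 1
scores = discriminator_scores(x, e, W_x, b_x, W_e, b_e, M)
```

Scoring each class with its own sigmoid is what allows two confused classes to be judged authentic at the same time for the same annotator-instance pair.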

However, Eq (2) does not consider the correlation among different classes in the annotations, as it still evaluates each possible label independently. The situation becomes even worse with sparse observations in individual annotators. For example, when an annotator confuses “bird” with “airplanes”, the discriminator might decide the label of “bird” is more authentic for this annotator, simply because this category appears more often in the annotator’s observed annotations. To capture such “equally plausible” annotations, we equip the discriminator with additional label correlation information (Lanchantin et al., 2019). Specifically, we use a graph convolution network (GCN) (Kipf and Welling, 2016) to model label correlation. Two labels are more likely to be correlated if they are provided to the same instance (by different annotators) in the authentic annotations. We calculate the frequency of label co-occurrence in the observed annotations to construct the adjacency matrix $A$. Then we extend the weight matrix in Eq (2) by propagating it over the label graph with the normalized adjacency matrix $\hat{A} = \tilde{D}^{-1/2}(A + I)\tilde{D}^{-1/2}$, where $I$ is the identity matrix and $\tilde{D}$ is the diagonal node degree matrix of $A + I$. We name this component the label correlation aggregation (LCA) decoder. We also enforce sparsity on the discriminator by applying L2 norm to its outputs.
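The construction of the label graph can be sketched as follows. The sketch assumes raw co-occurrence counts as edge weights and the symmetric normalization of Kipf and Welling (2016); the paper's exact weighting may differ, and the function names are illustrative.

```python
import numpy as np
from itertools import combinations

def label_cooccurrence(annotations, num_classes):
    """Count how often two different labels are given to the same instance.

    annotations: list of label lists, one list per instance (observed labels only).
    """
    A = np.zeros((num_classes, num_classes))
    for labels in annotations:
        for a, b in combinations(sorted(set(labels)), 2):
            A[a, b] += 1
            A[b, a] += 1
    return A

def normalize_adjacency(A):
    """Symmetric GCN normalization: D^{-1/2} (A + I) D^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])        # add self-loops
    deg = A_tilde.sum(axis=1)               # node degrees of A + I
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# Toy data: labels 0 and 1 are confused on two instances, label 2 never is.
observed = [[0, 1], [0, 0, 1], [2, 2], [2]]
A = label_cooccurrence(observed, num_classes=3)
A_hat = normalize_adjacency(A)
```

In this toy graph the two confused labels share a strong edge, so any weights propagated through `A_hat` are mixed between them, while the isolated label keeps its own weights unchanged.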

To realize our second criterion, an auxiliary network $Q$ is used to measure the mutual information between the classifier’s prediction $\hat{z}_i$ and the generated annotation on instance $x_i$. To simplify our notations in the subsequent discussions, we use $\hat{y}_i^r$ to denote an annotation drawn from the annotation distribution predicted on pair $(e_r, x_i)$. Although our generator design is flexible enough to model complex confusions, it becomes useless for classifier training if the learnt confusions are independent from the classifier’s outputs. For example, if the generator learnt to generate a particular annotation only from the annotator’s features (e.g., the most frequently observed label for this annotator), such a generation contributes no information to classifier training. We propose to penalize such generations by maximizing the mutual information between the classifier’s output and the generated annotations for an instance, i.e., $I(\hat{z}_i; \hat{y}_i^r)$.

In practice, mutual information is generally difficult to optimize, because it requires the knowledge of the posterior $P(\hat{z}_i \mid \hat{y}_i^r)$. We follow the design in (Chen et al., 2016) to maximize the variational lower bound of $I(\hat{z}_i; \hat{y}_i^r)$ by utilizing an auxiliary distribution $Q(\hat{z}_i \mid \hat{y}_i^r)$ to approximate $P(\hat{z}_i \mid \hat{y}_i^r)$:

$$L_I(G, Q) = \mathbb{E}_{\hat{z}_i, \hat{y}_i^r}\big[\log Q(\hat{z}_i \mid \hat{y}_i^r)\big] + H(\hat{z}_i) \le I(\hat{z}_i; \hat{y}_i^r) \quad (3)$$

We refer to $L_I(G, Q)$ as the information loss, which can be viewed as an information-theoretical regularization to the original minimax game. The auxiliary distribution is parameterized by the auxiliary network $Q$. In our implementation, we devise a two-step training strategy for the entire pipeline (details in Section 3.3.3), where we fix the classifier when updating the generator. As a result, the entropy $H(\hat{z}_i)$ becomes a constant when updating the generator by Eq (3). Since the posterior can be different when the annotations are given by different annotators on different instances, we also provide the instance and annotator features to the auxiliary network,

$$Q(\hat{z}_i \mid \hat{y}_i^r, x_i, e_r) = \mathrm{softmax}\big(f_Q(x_i, e_r, \hat{y}_i^r)\big)$$

where $f_Q$ is a learnable scoring function. To reduce model complexity, we reuse the annotator and instance encoding layers from the discriminator here. The class-related weight matrix $M_k$ is flattened and transformed to a low-dimension embedding by an embedding layer for each annotation type $k$.
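The variational lower bound of Chen et al. (2016) can be checked numerically on a small discrete example. The sketch below is a toy setting with assumptions that go beyond the text: the latent code (the classifier's predicted label) and the generated annotation each take two values, and the conditioning on instance and annotator features is dropped. It shows that the bound is tight when the auxiliary distribution equals the true posterior and loose otherwise.

```python
import numpy as np

def mutual_information(p_c, p_y_given_c):
    """Exact I(c; y) for strictly positive discrete distributions."""
    joint = p_c[:, None] * p_y_given_c
    p_y = joint.sum(axis=0)
    return float((joint * np.log(joint / (p_c[:, None] * p_y[None, :]))).sum())

def variational_lower_bound(p_c, p_y_given_c, q_c_given_y):
    """E_{c, y}[log Q(c | y)] + H(c), a lower bound on I(c; y).

    q_c_given_y[y, c] is the auxiliary estimate of the posterior of c given y.
    """
    joint = p_c[:, None] * p_y_given_c                  # joint[c, y]
    expected_log_q = float((joint * np.log(q_c_given_y.T)).sum())
    entropy_c = float(-(p_c * np.log(p_c)).sum())
    return expected_log_q + entropy_c

p_c = np.array([0.5, 0.5])                     # classifier's label distribution
p_y_given_c = np.array([[0.9, 0.1],            # generator: annotation given label
                        [0.2, 0.8]])

joint = p_c[:, None] * p_y_given_c
exact_q = (joint / joint.sum(axis=0)).T        # true posterior P(c | y)
uniform_q = np.full((2, 2), 0.5)               # uninformative auxiliary network

mi = mutual_information(p_c, p_y_given_c)
bound_exact = variational_lower_bound(p_c, p_y_given_c, exact_q)
bound_uniform = variational_lower_bound(p_c, p_y_given_c, uniform_q)
```

With the classifier fixed, the entropy term is constant, so improving the bound amounts to raising the expected log-likelihood that the auxiliary network assigns to the classifier's output.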

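Putting the generative and discriminative modules together, we formalize the value function of our minimax game for learning from crowds in CrowdInG as,

$$\min_{C, G, Q} \max_{D} V(D, C, G, Q) = \mathbb{E}_{y_i^r}\big[\phi(D(y_i^r))\big] + \mathbb{E}_{\hat{y}_i^r}\big[\phi(1 - D(\hat{y}_i^r))\big] - \lambda L_I(G, Q)$$

where $C$ denotes the classifier, $y_i^r$ and $\hat{y}_i^r$ range over the authentic and generated annotations respectively, and $\lambda$ is a hyper-parameter to control the regularization. The value function is maximized by updating the discriminator to improve its ability in differentiating the authentic annotations from the generated ones, and minimized by updating the classifier, the generator and the auxiliary network to generate more high-quality annotations.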

3.3. Model optimization

In this section, we introduce the training strategy for CrowdInG, which cannot be simply performed via vanilla end-to-end training. First, the number of unobserved annotator-instance pairs is much larger than the observed ones. Blindly using all the generated annotations overwhelms the training of our discriminative module, and simply leads to trivial solutions (e.g., classifying all annotations as generated). As our solution, we present an entropy-based annotation selection strategy to select informative annotations for discriminative module update. Second, due to the required sampling procedure when generating the annotations, there are non-differentiable steps in our generative module. We resort to an effective counterfactual risk minimization (CRM) method to address the difficulty. Finally, the classifier and the generator in the generative module might change dramatically to fit the complex training signals, which can easily cause model collapse. We propose a two-step training strategy to prevent it in practice.

3.3.1. Entropy-based annotation selection

We borrow the idea from active learning (Settles, 2009): the discriminator should learn to distinguish the most difficult annotations. A generated annotation is more difficult for the discriminator if the generator is more confident about it. Formally, the selection strategy favors generated annotations with low $H(\cdot)$, where $H(\cdot)$ is the entropy of the annotation distribution. To reduce training bias caused by annotation sparsity in individual annotators, we sample the same number of generated annotations as the authentic ones for each annotator. As a by-product, our selection also greatly reduces the size of training data for the discriminative module, which makes discriminator training a lot more efficient. To fully utilize the power of the discriminative module, we use all generated annotations when updating the generator.
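One simple way to realize entropy-based selection is a deterministic top-k rule per annotator, sketched below. This is an assumption made for illustration: the text only states that low-entropy generations are preferred and that the number selected per annotator matches the number of authentic annotations. The function and variable names are hypothetical.

```python
import math
from collections import defaultdict

def entropy(dist):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def select_generated(generated, authentic_counts):
    """Keep, per annotator, the lowest-entropy generated annotations.

    generated: list of (annotator_id, instance_id, annotation_distribution).
    authentic_counts: dict annotator_id -> number of authentic annotations.
    """
    by_annotator = defaultdict(list)
    for annotator, instance, dist in generated:
        by_annotator[annotator].append((entropy(dist), instance))
    selected = {}
    for annotator, scored in by_annotator.items():
        scored.sort()  # most confident (lowest entropy) first
        k = authentic_counts.get(annotator, 0)
        selected[annotator] = [instance for _, instance in scored[:k]]
    return selected

generated = [
    ("a", 1, [0.9, 0.1]),
    ("a", 2, [0.5, 0.5]),
    ("a", 3, [0.99, 0.01]),
    ("b", 1, [0.6, 0.4]),
    ("b", 4, [0.8, 0.2]),
]
authentic_counts = {"a": 2, "b": 1}
selected = select_generated(generated, authentic_counts)
```

Matching the per-annotator count of authentic annotations keeps the discriminator's training set balanced between authentic and generated examples for every annotator.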

3.3.2. Gradient-based optimization

The gradient for the discriminator and the auxiliary network is easy to compute by calculating the derivative on trainable parameters. However, due to the required sampling steps for generating specific annotations, there are non-differentiable steps in the generative module. Previous works (Wang et al., 2018, 2019) use the Gumbel-softmax trick or policy gradient to handle the non-differentiable functions. However, once the generator is updated, we need to re-sample the annotations and evaluate them again using the discriminative module, which is time-consuming. To accelerate our model training, we perform batch learning from logged bandit feedback (Joachims et al., 2018; Swaminathan and Joachims, 2015). In each epoch, we treat the generative module from the last epoch as the logging policy, and sample annotations from it. Because the discriminator only evaluates the sampled annotations from the (last) generative module, rather than the entire distribution of annotations predicted by the module, training signals received on the generative module side are in the form of logged bandit feedback.

When updating the generator, the training signals are from both the discriminator and the information loss, which we collectively treat as the generator’s loss. In each epoch, we update the generator by Eq (5), which reweights this loss on the logged samples by the ratio between the current policy and the logging policy, and includes a Lagrange multiplier introduced to avoid overfitting to the logging policy (Joachims et al., 2018). The optimization of Eq (5) can be easily solved by gradient descent. When updating the classifier, we only use the discriminator’s signals. Intuitively, even though annotations should contain the information about the true labels, the inverse is not necessary. The classifier is updated in a similar fashion by Eq (6). We follow the suggestions in (Joachims et al., 2018) to search for the best Lagrange multiplier in practice.
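For intuition, the objective used in the cited work on learning from logged bandit feedback (Joachims et al., 2018) can be sketched as an importance-weighted loss with the Lagrange multiplier subtracted from each loss. Whether the paper's Eq (5) and (6) take exactly this form is an assumption; the function name `crm_objective` and the toy numbers are illustrative.

```python
def crm_objective(losses, new_probs, logging_probs, lam):
    """Lambda-translated importance-weighted loss over logged samples.

    losses: loss of each sampled annotation (e.g. from the discriminative module).
    new_probs: probability of each sample under the policy being updated.
    logging_probs: probability of each sample under the logging policy.
    lam: the Lagrange multiplier (a scalar hyper-parameter).
    """
    total = 0.0
    for loss, p_new, p_log in zip(losses, new_probs, logging_probs):
        total += (loss - lam) * p_new / p_log
    return total / len(losses)

losses = [0.2, 0.8]          # the first sampled annotation fooled the discriminator more
logging_probs = [0.5, 0.5]   # probabilities under last epoch's generative module

# Staying at the logging policy gives the mean loss minus the multiplier.
same_policy = crm_objective(losses, [0.5, 0.5], logging_probs, lam=0.3)

# Shifting probability toward the low-loss annotation lowers the objective.
shifted_policy = crm_objective(losses, [0.8, 0.2], logging_probs, lam=0.3)
```

Because the samples and their losses are logged once per epoch, the objective can be minimized by gradient descent on the new probabilities without re-sampling after every step.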

Figure 2. Performance of two-step strategy. (a) Mean accuracy of accumulated instances with ascending order of entropy on three real-world datasets. (b) Comparison between one-step and two-step strategy on LabelMe dataset.

3.3.3. Two-step update for the generative module

The generative process is controlled by the generator and the classifier. However, the coupling between the two components introduces difficulties in the estimation of them. For example, one component might overfit a particular pattern in the discriminative signal, and cause model collapse in the entire pipeline. In our empirical studies reported in Figure 2(b), we observed test accuracy fluctuated a lot when we simply used gradients calculated by Eq (5) and (6) to update these two components together. Details of our experiment setup can be found in Section 4.

Based on this finding, we adopt a two-step strategy to update the generator and the classifier alternately. First, we found that the principle behind our annotation selection also applied to our classifier: the entropy of the classifier’s output strongly correlates with its accuracy. According to Figure 2(a), the classifier obtains higher accuracy on instances with lower prediction entropy. Therefore, we decided to use the instances with low classification entropy to update the generator by Eq (5), as there the classifier’s predictions are more likely to be accurate. Then, we use the updated generator on the rest of the instances to update the classifier by Eq (6), where the classifier still has a high uncertainty to handle them.

A threshold is pre-selected to separate the instances; and we will discuss its influence on model training in Section 4.5. Besides, to make the entire training process stable, we pre-train the classifier with the observed annotations using the neural crowdsourcing algorithm proposed in (Rodrigues and Pereira, 2018), which is included as one of our baselines. With the initialized classifier, we also pre-train the generator and discriminator to provide good initialization of these components.
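The instance split behind the two-step update can be sketched as a partition by classifier entropy. The threshold value and the use of "less than or equal" at the boundary are assumptions for illustration, as are the function names.

```python
import math

def entropy(dist):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def split_by_entropy(class_dists, threshold):
    """Partition instance indices by the entropy of the classifier's output.

    Low-entropy instances are used to update the generator (classifier fixed);
    the remaining instances are used to update the classifier (generator fixed).
    """
    confident, uncertain = [], []
    for idx, dist in enumerate(class_dists):
        if entropy(dist) <= threshold:
            confident.append(idx)
        else:
            uncertain.append(idx)
    return confident, uncertain

# Toy classifier outputs over three classes for four instances.
preds = [
    [0.98, 0.01, 0.01],
    [0.40, 0.30, 0.30],
    [0.90, 0.05, 0.05],
    [0.34, 0.33, 0.33],
]
for_generator, for_classifier = split_by_entropy(preds, threshold=0.5)
```

Keeping the two index sets disjoint means each component is updated on instances where the other component's output is comparatively trustworthy.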

Figure 3. Results on three real-world datasets: (a) LabelMe, (b) Music, (c) CIFAR-10H. Full CrowdInG training is applied after the dashed line.

4. Experiments

In this section, we evaluate our proposed solution framework on three real-world datasets. The annotations were originally collected from Amazon Mechanical Turk (AMT) by the dataset creators. We compared with a rich set of state-of-the-art crowdsourcing algorithms that estimate the classifiers only with observed annotations. We are particularly interested in investigating how much human labor can be saved by our data augmentation solution. To this end, we gradually removed an increasing number of annotations and compared with the baselines. The result suggests that significant annotation cost can be saved with our generated annotations, while still maintaining the quality of the learnt classifier. Besides, since our model is the first effort to augment crowdsourced data for classifier training, we compared with models trained with annotations from other generative models for crowdsourced data. Finally, we performed extensive ablation analysis about our proposed model components and hyper-parameters to better understand the model’s behavior.

4.1. Datasets & Implementation details

We employed three real-world datasets for evaluations. LabelMe (Russell et al., 2008; Rodrigues and Pereira, 2018) is an image classification dataset, which consists of 2,688 images from 8 classes, e.g., inside city, street, forest, etc. 1,000 of them are labeled by 59 AMT annotators and the rest are used for validation and testing. Each image is labeled by 2.5 annotators on average. To enrich the training set, standard data augmentation techniques are applied on the training set, including horizontal flips, rescaling and shearing, following the setting in (Rodrigues and Pereira, 2018). We eventually created 10,000 images for training. Music (Rodrigues et al., 2014) is a music genre classification dataset, which consists of 1,000 samples of songs, each 30 seconds in length, from 10 classes, e.g., classical, country, jazz, etc. 700 of them are labeled by 44 AMT annotators and the rest are left for testing. Each sample is labeled by 4.2 annotators on average. Figure 4 shows several important statistics of these two datasets. Specifically, we report the annotation accuracy and the number of annotations among the annotators. Both statistics vary considerably across annotators in these two datasets, which causes serious difficulties for classical crowdsourcing algorithms. CIFAR-10H (Peterson et al., 2019) is another image classification dataset, which consists of 10,000 images from 10 classes, e.g., airplane, bird, cat, etc., collected from the CIFAR-10 image dataset (Krizhevsky et al., 2009). There were 2,571 annotators recruited and each annotator was asked to label 200 images. However, such large-scale annotations are typically expensive and rare in practice. To make this dataset closer to a realistic and challenging setting, we only selected a subset of low-quality annotators. The modified dataset has 8,687 images annotated by 103 AMT annotators. Each annotator still has 200 annotations with an average accuracy of 78.2%; and each image has 2.37 annotations on average. The original 10,000-image validation set of CIFAR-10 is used as our testing set.
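Sparsity statistics such as the average number of annotations per instance and the share of annotators with few annotations can be computed from a list of annotation triples. The sketch below uses toy data and a hypothetical function name `sparsity_stats`; it does not reproduce the datasets' actual numbers.

```python
from collections import Counter

def sparsity_stats(triples, few_threshold):
    """Summarize sparsity of crowdsourced annotations.

    triples: list of (instance_id, annotator_id, label).
    Returns (average annotations per instance,
             fraction of annotators with fewer than few_threshold annotations).
    """
    per_instance = Counter(inst for inst, _, _ in triples)
    per_annotator = Counter(ann for _, ann, _ in triples)
    avg_per_instance = sum(per_instance.values()) / len(per_instance)
    few = sum(1 for n in per_annotator.values() if n < few_threshold)
    return avg_per_instance, few / len(per_annotator)

# Toy data: 3 instances, 3 annotators, 5 annotations in total.
toy = [(0, "a", 1), (0, "b", 1), (1, "a", 0), (2, "a", 2), (2, "c", 2)]
avg, frac_few = sparsity_stats(toy, few_threshold=2)
```

On real data the same two numbers correspond to statistics like "2.5 annotators per image on average" and the proportion of annotators below a chosen annotation count.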

Figure 4. Boxplots for the number of annotations and the accuracy of the AMT annotators for two real-world crowdsourcing datasets: (a) LabelMe, (b) Music.

Figure 5. Test accuracy with various proportions of removed annotations.

To make the comparisons fair, all evaluated methods used the same classifier design (in both CrowdInG and baselines). On the LabelMe dataset, we adopted the same setting as in (Rodrigues and Pereira, 2018): we applied a pre-trained VGG-16 network followed by a fully connected (FC) layer with 128 units and ReLU activations, and a softmax output layer, using 50% dropout. On the Music dataset, we also used a 128-unit FC layer and a softmax output layer. Batch normalization was performed in each layer. We disabled LCA on Music since there are no meaningful label correlation patterns. On the CIFAR-10H dataset, we used a VGG-16 network for the classifier. We used the Adam optimizer, with learning rates selected by search, for both generative and discriminative modules. The scoring functions of the generator and the auxiliary network are implemented by two-layer neural networks with 64 and 128 hidden units. In each epoch, we update the generative and discriminative modules 5 times. With pre-training, we execute the training procedures for CrowdInG in the last 40 epochs. All experiments are repeated 5 times with different random seeds, and the mean accuracy and standard deviation are reported.

LabelMe Music CIFAR-10H
Doctor Net 82.12 75.41 67.23
85.24 82.56 64.94
82.74 81.42 65.02
85.16 83.17 65.34
85.42 83.38 66.17
CrowdInG 87.03 83.73 68.85
Table 1. Test accuracy of different augmentation methods.

4.2. Classification performance

4.2.1. Baselines

We compared with a rich set of state-of-the-art baselines, which we briefly introduce here. DL-MV: annotations are first aggregated by majority vote, and a classifier is then trained on the aggregated labels. DL-CL (Rodrigues and Pereira, 2018): a set of designated layers that capture annotators’ confusions (the so-called Crowd Layer) is connected to the classifier, aiming to transform the classifier’s predicted outputs into annotation distributions. Anno-Reg (Tanno et al., 2019): trace regularization on confusion matrices is applied to improve the confusion estimation. Max-MIG (Cao et al., 2019): a neural classifier and a label aggregation network are jointly trained using an information-theoretic loss function, so that correlated confusions among annotators are captured. AggNet (Albarqouni et al., 2016): an EM-based deep model that considers annotator sensitivity and specificity.
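To make the first two baselines concrete, the sketch below shows majority-vote aggregation (the first step of DL-MV) and a simplified view of how a per-annotator confusion matrix maps classifier outputs to an annotation distribution (the idea behind the Crowd Layer in DL-CL). Both functions are illustrative assumptions: ties are broken toward the smallest label, and the confusion matrix is taken to be row-stochastic, whereas the actual Crowd Layer learns unconstrained annotator-specific weights.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label; ties go to the smallest label (assumption)."""
    counts = Counter(labels)
    best = max(counts.values())
    return min(label for label, c in counts.items() if c == best)

def annotation_distribution(class_probs, confusion):
    """Map classifier output p(y | x) to one annotator's label distribution.

    confusion[k][j] is the assumed probability that this annotator reports
    label j when the true class is k (each row sums to 1).
    """
    n_labels = len(confusion[0])
    return [
        sum(class_probs[k] * confusion[k][j] for k in range(len(class_probs)))
        for j in range(n_labels)
    ]

aggregated = majority_vote([2, 0, 2])
dist = annotation_distribution([0.8, 0.2], [[0.9, 0.1], [0.3, 0.7]])
```

With a row-stochastic confusion matrix, the output of `annotation_distribution` is itself a valid probability distribution, so it can be compared directly against an annotator's observed labels with a cross-entropy loss.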

4.2.2. Results & analysis

The classification accuracy of the classifiers learnt by different models on the three datasets is reported in Figure 3. We highlight two things: 1) as all models are learnt from crowdsourced data, the ground-truth labels of instances are not revealed to them in training; therefore, a classifier’s accuracy on the training set is still a meaningful performance metric. 2) CrowdInG starts with the same classifier as obtained in DL-CL (as we used DL-CL to pre-train our classifier). On all datasets, we observe that even though DL-CL did not outperform the other baselines, the classifier’s performance improved significantly once the training in CrowdInG started. This demonstrates the utility of our generated annotations for classifier training. We also looked into the accuracy on individual classes and found that, by generating more annotations, CrowdInG improved more than the baselines on the easily confused classes. For example, for the class open country on LabelMe, the original annotation accuracy was only 51.5%. DL-CL achieved 49.6% (i.e., the starting point of CrowdInG), which improved to 58.9% after CrowdInG training. CrowdInG also outperformed models designed for complex confusions, such as Max-MIG and AggNet, by a large margin. This indicates that our generator has an advantage in capturing complex confusions.

4.3. Utility of augmented annotations

4.3.1. Experiment setup

We study the utility of augmented annotations from CrowdInG. On each dataset, we gradually removed an increasing number of observed annotations to investigate how different models’ performance changes. We ensured that each instance kept at least one annotation, so that only annotations, rather than instances, were removed from classifier training. We compared with two representative baselines, 1) DL-MV, a typical majority-vote-based method, and 2) DL-CL, a typical DS-model-based method, to study their sensitivity to the sparsity of annotations.
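One way to realize this removal protocol is sketched below: one annotation per instance is protected, and the requested fraction is then removed uniformly at random from the remainder. This is an assumed procedure for illustration; the function name `remove_annotations`, the triple format, and the rule for choosing the protected annotation are not specified in the text.

```python
import random
from collections import defaultdict

def remove_annotations(annotations, fraction, seed=0):
    """Drop about `fraction` of all annotations while keeping >= 1 per instance.

    annotations: list of (instance_id, annotator_id, label) triples (assumed format).
    """
    rng = random.Random(seed)
    by_instance = defaultdict(list)
    for idx, (inst, _, _) in enumerate(annotations):
        by_instance[inst].append(idx)
    # Protect one randomly chosen annotation per instance so no instance is lost.
    protected = {rng.choice(idxs) for idxs in by_instance.values()}
    removable = [i for i in range(len(annotations)) if i not in protected]
    rng.shuffle(removable)
    # The target is capped by how many annotations can be removed safely.
    n_remove = min(int(fraction * len(annotations)), len(removable))
    removed = set(removable[:n_remove])
    return [a for i, a in enumerate(annotations) if i not in removed]

toy = [(0, "a", 1), (0, "b", 0), (1, "a", 2), (1, "c", 2), (2, "a", 0), (2, "b", 0)]
kept = remove_annotations(toy, fraction=0.5)
```

Because of the cap, very large removal fractions leave exactly one annotation per instance rather than dropping instances.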

4.3.2. Results & analysis

We present the results in Figure 5. All models suffered from extreme sparsity when we removed a large portion of annotations (e.g., 60%), but CrowdInG still enjoyed a consistent improvement over all baselines. DL-MV performed the worst because, with fewer redundant annotations, the quality of its aggregated labels deteriorated seriously. When we looked into the detailed model update trace of CrowdInG, we found that the performance gain became larger after CrowdInG training. Again, because we used the classifier obtained from DL-CL as our starting point for CrowdInG, low-quality annotations were generated at the beginning of the CrowdInG update. However, CrowdInG quickly improved once its discriminative module started to penalize those low-quality annotations. The results suggest that a great deal of human labor can be saved. On LabelMe and CIFAR-10H, CrowdInG performed close to the baselines’ best performance even with 60% fewer annotations. Even on the most difficult dataset, Music, CrowdInG can save about 10% of the annotations while achieving performance similar to DL-CL.

4.4. Comparison with other augmentations

4.4.1. Baselines

As no existing method explicitly performs data augmentation for crowdsourced data, we consider several alternative data augmentation methods using various self-training or generative modeling techniques. Arguably, any generative model for crowdsourced data can be used for this purpose.

In particular, we chose the following baselines. Doctor Net (Guan et al., 2018): each annotator is modeled by an individual neural network; at test time, annotations are predicted by the annotator networks and then aggregated by weighted majority vote. The second baseline completes the missing annotations using a pre-trained DL-CL model, and then trains another DL-CL model on the completed annotations. The third constructs an annotator-instance bipartite graph from the observed annotations and fills in the missing links using a Graph Convolution Network (GCN) (Berg et al., 2017; Kipf and Welling, 2016); a DL-CL model is then trained on the expanded annotations. The fourth follows the design in (Wang et al., 2017), which unifies generative and discriminative models in a GAN framework, with DL-CL as the generative model. The fifth directly trains a DL-CL model on the expanded dataset provided by CrowdInG.
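The test-time aggregation step described for Doctor Net can be illustrated with a plain weighted majority vote. In this minimal sketch the annotator weights are hypothetical fixed numbers; how Doctor Net actually learns its weights is not covered here, and the function name is an assumption.

```python
def weighted_majority_vote(predictions, weights, n_classes):
    """Aggregate per-annotator predicted labels using per-annotator weights.

    predictions: dict annotator -> predicted label (int in [0, n_classes)).
    weights: dict annotator -> non-negative weight (hypothetical values here).
    """
    scores = [0.0] * n_classes
    for annotator, label in predictions.items():
        scores[label] += weights[annotator]
    # Return the class with the largest accumulated weight.
    return max(range(n_classes), key=lambda c: scores[c])

# Two low-weight annotators vote for class 1, one high-weight annotator for class 0.
label = weighted_majority_vote(
    {"a": 0, "b": 1, "c": 1},
    {"a": 0.7, "b": 0.2, "c": 0.3},
    n_classes=2,
)
```

In this toy case the weighted vote picks class 0, whereas an unweighted majority vote would pick class 1, which is the kind of difference annotator weighting is meant to produce.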

4.4.2. Results & analysis

We present the test accuracy on all three datasets in Table 1. Doctor Net trains an individual classifier for each annotator, so on datasets where each annotator provides sufficient annotations, such as CIFAR-10H, it obtained satisfactory performance with the generated annotations. But on the other datasets, where the annotations from each annotator are sparse, its performance dropped considerably. The DL-CL-based methods generally improved the performance. However, due to the simple class-dependent confusion assumption, such models’ capacity to capture complex confusions is limited. As a result, even though the GCN could capture more complex annotator-instance interactions, DL-CL still failed to benefit from it in the GCN-based baseline. The added discriminator in the GAN-based baseline improved the performance; however, DL-CL still could not fully utilize the complex discriminative signals and failed to improve further. The DL-CL model trained on the CrowdInG-expanded dataset performed better than the other baselines by directly using the annotations generated by CrowdInG, which suggests that the annotations generated under our criteria are generically helpful for other crowdsourcing algorithms.

LabelMe Music CIFAR-10H
CrowdG 85.89 83.14 66.15
83.12 81.28 67.12
84.34 82.24 66.90
86.17 82.74 67.88
CrowdInG 87.03 83.73 68.85
Table 2. Test accuracy of different variants of CrowdInG

4.5. Ablation study

4.5.1. Analysis of different components in CrowdInG

To show the contributions of different components in CrowdInG, we varied the setting of our solution. We already showed the one-step training variant in Figure 2, which suffered from serious model collapse. To investigate the other components, we created the following variants. CrowdG: the information loss defined in Eq (3) is removed. In the second variant, the generator only considers the classifier’s outputs, annotator features and random noise, but not the instance features. In the third, the generator only considers the classifier’s outputs, instance features and random noise, but not the annotator features. In the fourth, the annotation selection is kept, but instead we randomly select an equal number of generated annotations as the authentic ones for the discriminator update.

We report the test accuracy on the three datasets in Table 2. By maximizing the mutual information, CrowdInG outperformed CrowdG by a considerable margin. We further investigated the generated annotations and found that those generated by CrowdG were more random and could not be easily linked to the classifier’s output. The variant without instance features performed poorly when the number of annotations per annotator was limited, such as on the LabelMe and Music datasets, but worked better when the annotations per annotator were adequate, such as on CIFAR-10H. This again suggests that more annotations are needed to better model annotators’ confusions. The variant without annotator features performed better because, by taking instance features, the generator can model more complicated confusions with respect to instance features. The random-selection variant bypassed the data imbalance issue; but without focusing on difficult annotations, it still could not fully unleash the potential of the generated annotations.
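The contrast between informative and "more random" annotations can be made tangible with a plug-in estimate of mutual information between annotations and class labels. This sketch is only an illustration of the quantity involved: it computes empirical mutual information from counts on toy data, whereas the paper measures mutual information with an auxiliary network, which is not reproduced here.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in nats from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # Contribution p(x, y) * log( p(x, y) / (p(x) p(y)) ).
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

labels = [0, 0, 1, 1]
informative = [0, 0, 1, 1]   # annotations fully determined by the label
random_like = [0, 1, 0, 1]   # annotations independent of the label
mi_informative = mutual_information(informative, labels)
mi_random = mutual_information(random_like, labels)
```

Annotations that track the labels attain the maximum value for a balanced binary label (log 2 nats), while annotations independent of the labels score zero.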

Figure 6. Performance under different hyper-parameter settings on LabelMe dataset.

4.5.2. Hyper-parameter analysis

We studied the sensitivity of CrowdInG to its two hyper-parameters. The first controls the degree of the information regularization in Eq (4); we varied it from 0.1 to 1. The second controls the grouping of instances used for the classifier update; we varied it from 0.2 to 0.8. Due to the space limit, we only report the results on LabelMe; similar observations were obtained on the other two datasets.

The model’s performance under different hyper-parameter settings is illustrated in Figure 6. We can clearly observe that the performance is boosted when appropriate hyper-parameters are chosen. A small information regularization weight imposes weak regularization on the generator, and thus the generated annotations are less informative for classifier training. A large weight slightly hurts the performance because strong regularization weakens the ability of the generator to capture complex confusions related to instance and annotator features. We observe a similar trend for the grouping hyper-parameter. To avoid model collapse, a moderate value is needed to restrict the classifier training, but a large value hurts the performance more, because very few instances are then selected for classifier training, so that the classifier can hardly be updated.

5. Conclusions & Future works

Data sparsity poses a serious challenge to current learning from crowds solutions. We presented a data augmentation solution using generative adversarial networks to handle the issue. We proposed two important principles for generating high-quality annotations: 1) the generated annotations should follow the distribution of authentic ones; and 2) the generated annotations should have high mutual information with the ground-truth labels. We implemented these principles in our discriminative model design. Extensive experimental results demonstrated the effectiveness of our data augmentation solution in improving the performance of the classifier learned from crowds, and they shed light on our solution’s potential in low-budget crowdsourcing in general.

Our exploration also opens up a series of interesting future directions. As our generative module captures annotator- and instance-specific confusions, it can be used for annotator education (Singla et al., 2014), e.g., informing individual annotators about their potential confusions. Our solution can also be used for interactive labeling with annotators (Yan et al., 2011), e.g., only acquiring annotations on which our generative module currently has low confidence. Also, the instance-level confusion modeling can better support fine-grained task assignment (Deng et al., 2013), e.g., gathering senior annotators for specific tasks.


This work was partially supported by the National Science Foundation under awards NSF IIS-1718216 and NSF IIS-1553568, and the Department of Energy under award DoE-EE0008227.


  • S. Albarqouni, C. Baur, F. Achilles, V. Belagiannis, S. Demirci, and N. Navab (2016) Aggnet: deep learning from crowds for mitosis detection in breast cancer histology images. IEEE transactions on medical imaging 35 (5), pp. 1313–1321. Cited by: §2, §4.2.1.
  • A. Antoniou, A. Storkey, and H. Edwards (2017) Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340. Cited by: §2.
  • R. v. d. Berg, T. N. Kipf, and M. Welling (2017) Graph convolutional matrix completion. arXiv preprint arXiv:1706.02263. Cited by: §4.4.1.
  • T. Buecheler, J. H. Sieg, R. M. Füchslin, and R. Pfeifer (2010) Crowdsourcing, open innovation and collective intelligence in the scientific method: a research agenda and operational framework. In The 12th International Conference on the Synthesis and Simulation of Living Systems, Odense, Denmark, 19-23 August 2010, pp. 679–686. Cited by: §1.
  • P. Cao, Y. Xu, Y. Kong, and Y. Wang (2019) Max-MIG: an information theoretic approach for joint learning from crowds. In ICLR. Cited by: §2, §4.2.1.
  • D. Chae, J. Kang, S. Kim, and J. Lee (2018) Cfgan: a generic collaborative filtering framework based on generative adversarial networks. In Proceedings of the 27th ACM CIKM, pp. 137–146. Cited by: §2.
  • X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016) Infogan: interpretable representation learning by information maximizing generative adversarial nets. arXiv preprint arXiv:1606.03657. Cited by: §1, §3.1.2, §3.2.2.
  • Z. Chu, J. Ma, and H. Wang (2020) Learning from crowds by modeling common confusions. arXiv preprint arXiv:2012.13052. Cited by: §2.
  • A. P. Dawid and A. M. Skene (1979) Maximum likelihood estimation of observer error-rates using the em algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics) 28 (1), pp. 20–28. Cited by: §1, §2, §3.2.1.
  • J. Deng, J. Krause, and L. Fei-Fei (2013) Fine-grained crowdsourcing for fine-grained recognition. In CVPR, pp. 580–587. Cited by: §1, §5.
  • I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial networks. arXiv preprint arXiv:1406.2661. Cited by: §1, §3.1.1.
  • M. Guan, V. Gulshan, A. Dai, and G. Hinton (2018) Who said what: modeling individual labelers improves classification. In AAAI, Vol. 32. Cited by: §2, §4.4.1.
  • H. Harutyunyan, K. Reing, G. Ver Steeg, and A. Galstyan (2020) Improving generalization by controlling label-noise information in neural network weights. In International Conference on Machine Learning, pp. 4071–4081. Cited by: §1.
  • H. Imamura, I. Sato, and M. Sugiyama (2018) Analysis of minimax error rate for crowdsourcing and its application to worker clustering model. In International Conference on Machine Learning, pp. 2147–2156. Cited by: §2.
  • A. A. Irissappane, H. Yu, Y. Shen, A. Agrawal, and G. Stanton (2020) Leveraging gpt-2 for classifying spam reviews with limited labeled data via adversarial training. arXiv preprint arXiv:2012.13400. Cited by: §2.
  • T. Joachims, A. Swaminathan, and M. de Rijke (2018) Deep learning with logged bandit feedback. In ICLR, Cited by: §3.3.2, §3.3.2.
  • E. Kamar, A. Kapoor, and E. Horvitz (2015) Identifying and accounting for task-dependent bias in crowdsourcing. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, Vol. 3. Cited by: §2.
  • T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §3.2.2, §4.4.1.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §4.1.
  • J. Lanchantin, A. Sekhon, and Y. Qi (2019) Neural message passing for multi-label classification. In ECML-PKDD, pp. 138–163. Cited by: §3.2.2.
  • G. Li, J. Wang, Y. Zheng, and M. J. Franklin (2016) Crowdsourced data management: a survey. IEEE TKDE 28 (9), pp. 2296–2319. Cited by: §1.
  • A. Odena (2016) Semi-supervised learning with generative adversarial networks. arXiv preprint arXiv:1606.01583. Cited by: §2.
  • J. C. Peterson, R. M. Battleday, T. L. Griffiths, and O. Russakovsky (2019) Human uncertainty makes classification more robust. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9617–9626. Cited by: §4.1.
  • V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy (2010) Learning from crowds. Journal of Machine Learning Research 11 (4). Cited by: §2.
  • F. Rodrigues, F. Pereira, and B. Ribeiro (2014) Gaussian process classification and active learning with multiple annotators. In International conference on machine learning, pp. 433–441. Cited by: §1, §2, §4.1.
  • F. Rodrigues and F. Pereira (2018) Deep learning from crowds. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: §1, §2, §3.2.1, §3.3.3, §4.1, §4.1, §4.2.1.
  • B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman (2008) LabelMe: a database and web-based tool for image annotation. International journal of computer vision 77 (1-3), pp. 157–173. Cited by: §1, §4.1.
  • B. Settles (2009) Active learning literature survey. Cited by: §3.3.1.
  • A. Singla, I. Bogunovic, G. Bartók, A. Karbasi, and A. Krause (2014) Near-optimally teaching the crowd to classify. In ICML, pp. 154–162. Cited by: §1, §5.
  • J. T. Springenberg (2015) Unsupervised and semi-supervised learning with categorical generative adversarial networks. arXiv preprint arXiv:1511.06390. Cited by: §2.
  • A. Swaminathan and T. Joachims (2015) Batch learning from logged bandit feedback through counterfactual risk minimization. The Journal of Machine Learning Research 16 (1), pp. 1731–1755. Cited by: §3.3.2.
  • R. Tanno, A. Saeedi, S. Sankaranarayanan, D. C. Alexander, and N. Silberman (2019) Learning from noisy labels by regularized estimation of annotator confusion. In CVPR, pp. 11244–11253. Cited by: §4.2.1.
  • M. Venanzi, J. Guiver, G. Kazai, P. Kohli, and M. Shokouhi (2014) Community-based bayesian aggregation models for crowdsourcing. In Proceedings of the 23rd international conference on World wide web, pp. 155–164. Cited by: §2.
  • H. Wang, J. Wang, J. Wang, M. Zhao, W. Zhang, F. Zhang, X. Xie, and M. Guo (2018) Graphgan: graph representation learning with generative adversarial nets. In AAAI, Vol. 32. Cited by: §2, §3.3.2.
  • J. Wang, L. Yu, W. Zhang, Y. Gong, Y. Xu, B. Wang, P. Zhang, and D. Zhang (2017) Irgan: a minimax game for unifying generative and discriminative information retrieval models. In Proceedings of the 40th International ACM SIGIR conference, pp. 515–524. Cited by: §4.4.1.
  • Q. Wang, H. Yin, H. Wang, Q. V. H. Nguyen, Z. Huang, and L. Cui (2019) Enhancing collaborative filtering with generative augmentation. In Proceedings of the 25th ACM SIGKDD conference, pp. 548–556. Cited by: §2, §3.3.2.
  • J. Whitehill, T. Wu, J. Bergsma, J. Movellan, and P. Ruvolo (2009) Whose vote should count more: optimal integration of labels from labelers of unknown expertise. NeurIPS 22, pp. 2035–2043. Cited by: §2.
  • Y. Xu, P. Cao, Y. Kong, and Y. Wang (2019) L_DMI: a novel information-theoretic loss function for training deep nets robust to label noise. In NeurIPS, pp. 6222–6233. Cited by: §1.
  • Y. Yan, R. Rosales, G. Fung, and J. G. Dy (2011) Active learning from crowds. In International Conference on Machine Learning, Cited by: §5.