1 Introduction
Crowdsourcing originally referred to the practice of an organization outsourcing tasks, otherwise performed by employees, to a large crowd of volunteers [9]. Recently, crowdsourcing has become the most viable way to annotate massive datasets, by hiring large numbers of inexpensive Internet workers [23]. Although a variety of tasks can be crowdsourced, the most common one is the annotation of images (e.g., ImageNet) for data-driven machine learning algorithms such as deep learning. However, owing to factors such as task difficulty, poor task descriptions, and the diverse capabilities of workers [14, 3], we often need to invite multiple workers to annotate the same data to improve label quality [22]. This restricts the use of crowdsourcing when the available budget is tight. As such, we face the challenge of obtaining quality data on a tight budget. Proposed solutions focus on modeling crowdsourced tasks [46, 1], workers [35], or crowdsourcing processes [48, 16, 28] to better understand them, thereby reducing the impact of incompetent workers and the number of repeated annotations while improving quality.

We argue that meta learning can offer a solution to this challenge [30]. Meta learning imitates the human learning process: with only a few examples from the target domain, the learner can quickly adapt to recognize a new class of objects. For example, state-of-the-art meta learning algorithms achieve an accuracy of nearly 60% on five-class classification tasks on the Mini-ImageNet dataset with only one training instance per class (the 5-way 1-shot setting) [25]. This is comparable to the capability of the majority of human workers on real-world crowdsourcing platforms [12]. As such, we can model crowdsourcing employees as meta learners: with a small amount of guidance, they can quickly learn new skills to accomplish crowdsourced tasks.

To make this idea concrete, we introduce the notion of meta-worker, a virtual worker trained via a meta learning algorithm, which can quickly generalize to new tasks. Specifically, our crowdsourcing process is formulated as follows. Given a crowdsourcing project, we first partition the tasks into different clusters. We then collect a batch of tasks close to each cluster center and ask the crowd workers to annotate them until an N-way K-shot meta-test support set is obtained. We also build our meta-training datasets by collecting data from the Internet. Different meta learning algorithms are used to generate a group of diverse meta-workers. We then employ the meta-workers to annotate the remaining tasks; we measure the disagreement among their annotations using the Jensen-Shannon divergence, and consequently decide whether or not to further invite crowd workers to provide additional annotations. Finally, we model the meta-workers' preferences and use weighted majority voting (WMV) to compute the consensus labels, and iteratively optimize the latter until convergence is reached.
The main contributions of our work are as follows:
(i) This work is the first effort in the literature to directly supplement human workers with machine classifiers for crowdsourcing. The results indicate that machine intelligence can reduce the reliance on crowd workers while still achieving quality control.
(ii) We use meta learning to train meta-workers, and employ ensemble learning to boost the meta-workers' ability to produce reliable labels. Most simple tasks then no longer require the participation of human workers, which enables budget savings.
(iii) Experiments on real datasets show that our method achieves the highest quality while using a comparable or far smaller budget than state-of-the-art methods, and the budget savings grow as the number of tasks increases.
2 Related Work
Our work is mainly related to two research areas: crowdsourcing and meta learning. Meta learning (or learning to learn) is inspired by the ability of humans to use previous experience to quickly learn new skills [30, 17]. The meta learning paradigm trains a model using a large amount of data from different source domains where data are available, and then fine-tunes the model using a small number of samples from the target domain.

In recent years, a variety of meta learning approaches have been introduced. Few-shot meta learning methods can be roughly grouped into three categories: optimization-based, model-based (or memory-based, black-box), and metric-based (or non-parametric) methods. Optimization-based methods treat meta learning as an optimization problem and extract the meta-knowledge required to improve optimization performance. For example, Model-Agnostic Meta-Learning [6] looks for a set of initialization values of the model parameters that leads to strong generalization; the idea is to enable the model to quickly adapt to new tasks using few training instances. Model-based methods train a neural network to predict the output based on the model's current state (determined by the training set) and the input data; the meta-knowledge extraction and meta learning process are wrapped inside the model training process. Memory-Augmented Neural Networks [21] and Meta Networks [20] are representative model-based methods. Metric-based methods use clustering ideas for classification: they perform non-parametric learning at the inner (task) level by simply comparing validation points with training points, and assign a validation point to the category of the closest training point. Representative methods include Siamese networks [2], prototypical networks [24], and relation networks [25]. Here the meta-knowledge is given by the distance or similarity metric.

Few-shot meta learning can leverage multi-source data to better mine the information of the target domain, which is consistent with the aim of budget saving in crowdsourcing. Considerable effort has been devoted to budget saving in crowdsourcing [15]. One way is to reduce the number of tasks to be performed, such as task pruning (pruning the tasks that machines can do well) [33], answer deduction (pruning the tasks whose answers can be deduced from already crowdsourced tasks) [34], and task selection (selecting the most beneficial tasks to crowdsource) [19, 39, 29, 40]. Another approach is to reduce the cost of individual tasks, or to dynamically determine the price of a single task, mainly through better task (flow) design [18, 45, 27, 26].
Our work is also related to semi-supervised learning. Semi-supervised self-training [38, 37] gradually augments the labeled data with new instances whose labels have been inferred with high confidence, until the unlabeled data pool is empty or the learner no longer improves. The effectiveness of this approach depends on the added value of the augmented labeled data. Furthermore, the model needs to be updated every time an instance is added to the labeled set, which is not feasible in a large crowdsourcing project. Some active learning based crowdsourcing approaches [5, 13, 40] also suffer from these issues. The approach in [42] trains a group of classifiers on cleaned crowdsourced data and then uses them to correct potentially noisy labels. Unlike these solutions, our approach is feasible because we directly use meta-workers, trained by meta learning, to annotate unlabeled data. Meta-workers can quickly generalize to new tasks and achieve good performance with the support of only a few instances. In contrast, existing methods typically depend on sufficient training instances for each category to enable machine-intelligence-assisted crowdsourcing [13, 37, 42, 43]. In addition, our model learns meta-knowledge from external free data, which saves the budget considerably. Thanks to the cooperation among diverse meta-workers and to ensemble learning, we can further boost the performance of a group of meta-workers without the need for frequent model updates.

3 Proposed Methodology
3.1 Definitions
In this section, we formalize the Crowdsourcing with Meta-Workers problem setup in detail. A meta model usually consists of two learners: the upper one, called the 'meta learner', extracts meta-knowledge to guide the optimization of the lower one, called the 'base learner', which carries out the actual classification. To achieve this, the model is first trained on a group of different machine learning tasks (as in multi-task learning, MTL [44]), named the 'meta-training set', so that it becomes capable across different tasks; the model is then moved to its target domain, called the 'meta-test set'. More precisely, the meta learner takes one classical machine learning task as a meta-training instance and a group of such tasks as the meta-training set, extracts meta-knowledge from them, and then uses this meta-knowledge to guide the training and generalization of the base learner on the target domain. To eliminate ambiguity, and following the general naming rules of meta learning, we use 'train/test' to distinguish the instances (classical machine learning tasks) used by the meta learner, and 'support/query' to distinguish the instances (instances in classical machine learning) used by the base learner. The composition of the datasets required for meta learning is shown in Fig. 1.
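To make the nested data layout concrete, the sketch below builds toy meta-training episodes in this train/test and support/query structure; the class names, sizes, and helper function are illustrative assumptions rather than the paper's code.

```python
import random

def sample_episode(pool, n_way=5, k_shot=5, q_query=15):
    """pool: dict mapping class name -> list of instances.
    Returns one episode: a support set (adapts the base learner)
    and a query set (evaluates the adaptation)."""
    classes = random.sample(sorted(pool), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        items = random.sample(pool[cls], k_shot + q_query)
        support += [(x, label) for x in items[:k_shot]]
        query += [(x, label) for x in items[k_shot:]]
    return support, query

# Toy pool: 20 classes with 30 (integer) instances each.
pool = {f"class_{c}": list(range(c * 100, c * 100 + 30)) for c in range(20)}
# Each episode is one meta-training instance; the collection forms the meta-training set.
meta_training_set = [sample_episode(pool) for _ in range(1000)]
```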
Consider a crowdsourcing project with T tasks, where each task belongs to one of N classes. We cluster the tasks into N categories and select K instances from each cluster to be annotated by crowd workers. The resulting annotated tasks form the support set D^s and are also used to estimate the crowd workers' capacity; the remaining tasks form the query set D^q, and together D^s and D^q constitute our meta-test dataset. We also collect auxiliary datasets related to the tasks at hand to build our meta-training dataset, where each auxiliary dataset is an independent machine learning task; their diversity guarantees the generalization ability. In the few-shot learning paradigm, this setup is called an 'N-way K-shot' problem. Table I summarizes the notation used in this paper.

Item | Symbol | Remarks
---|---|---
number of crowd tasks | T | total number of tasks
task / instance | x_i | i = 1, ..., T
vector of task labels | y | size T, values in {1, 2, ..., N}
label space size | N | called N-way
support instances per class | K | called K-shot
meta-test subset | D^s / D^q | support / query
worker (set) | w (W) | M meta-workers, C crowd workers in total
worker type | m / c | meta / crowd
confusion matrix | π_w | size N × N, models a meta-worker
accuracy / capacity | q_w | decimal, models a crowd worker
worker w's annotations | A_w | size T × N
single annotation | a_iw | size N
meta algorithms (set) | G | M algorithms in total
divergence threshold | ε | difficulty criterion
3.2 Workflow
The workflow of our approach is shown in Fig. 2. After obtaining the meta-training set (completely labeled) and the meta-test set (partially labeled), the problem becomes a standard N-way K-shot meta learning problem. We apply a meta learning algorithm to the meta-training set to extract meta-knowledge, and then combine this knowledge with the meta-test support set D^s to adapt to the target task domain, obtaining a meta-worker (i.e., a classifier). By using the different meta learning algorithms in G, we obtain a group of meta-workers with different preferences. The meta-workers are then used to produce annotation matrices for the remaining tasks in D^q. If the meta-workers disagree with one another on a task, we invite crowd workers to provide further annotations for that task. Finally, we use the confusion matrices and the accuracy values to separately model the preferences of meta-workers and of crowd workers, and compute the consensus labels by weighted majority voting in an iterative manner.
3.3 Building the Meta-test Set
The first step of our method transforms a crowdsourcing project of T tasks into an N-way K-shot meta-test set. To build the N-way K-shot support set D^s from unlabeled instances, we use k-means to cluster the instances into N clusters (other clustering algorithms can be used as well). We then select the K instances closest to each cluster center to be annotated. Since clustering is not perfect, some selected instances may not actually belong to their assigned cluster; therefore, building D^s requires slightly more than N × K instances.
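A minimal sketch of this selection step is given below, assuming tasks are available as feature vectors; the over-selection margin of 1.2 is an illustrative assumption that mirrors the "slightly more than N × K instances" remark above.

```python
import numpy as np
from sklearn.cluster import KMeans

def candidates_for_annotation(features, n_way, k_shot, margin=1.2):
    """Return indices of the instances closest to each cluster center."""
    km = KMeans(n_clusters=n_way, n_init=10, random_state=0).fit(features)
    per_cluster = int(np.ceil(k_shot * margin))  # over-select to absorb clustering errors
    picked = []
    for c in range(n_way):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(features[members] - km.cluster_centers_[c], axis=1)
        picked.extend(members[np.argsort(dists)][:per_cluster].tolist())
    return picked

features = np.random.randn(3000, 64)      # stand-in for image embeddings
to_label = candidates_for_annotation(features, n_way=5, k_shot=5)  # sent to crowd workers
```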
The label quality of the meta-test support set is crucial. As such, we ask as many crowd workers as possible (C of them) to provide annotations. A basic assumption in crowdsourcing is that the aggregated annotations given by a large number of workers are reliable. For example, if the average accuracy of crowd workers is 0.6 on a 5-class task, then even with the simplest majority voting, the expected accuracy of aggregating 10 repeated annotations from 10 crowd workers is about 95%, and that of 30 repeated annotations is above 99.95%. Once D^s is attained, the remaining tasks form the query set D^q; combined with D^s, it constitutes the meta-test set.
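The reliability claim above can be checked with a quick Monte Carlo simulation, under the assumption that each worker answers correctly with probability 0.6 and otherwise picks uniformly among the four wrong classes, with ties broken at random; the trial count is arbitrary.

```python
import numpy as np

def mv_accuracy(n_workers, acc=0.6, n_classes=5, trials=100_000, seed=0):
    rng = np.random.default_rng(seed)
    correct = 0
    for _ in range(trials):
        # 0 encodes the true label; wrong answers are spread over classes 1..n_classes-1.
        votes = np.where(rng.random(n_workers) < acc,
                         0,
                         rng.integers(1, n_classes, n_workers))
        counts = np.bincount(votes, minlength=n_classes)
        winners = np.flatnonzero(counts == counts.max())   # plurality vote, ties at random
        correct += rng.choice(winners) == 0
    return correct / trials

print(mv_accuracy(10), mv_accuracy(30))   # roughly 0.95 and >0.999 under these assumptions
```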
Although the ground truth of each task in crowdsourcing is unknown and it is hard to estimate a worker's quality, we can still approximate the ground truth for a small portion of tasks (called golden tasks) and use them to estimate workers' quality [47, 29]. In this way, we can pre-identify low-quality workers based on D^s and prevent them from participating in the subsequent crowdsourcing process. The modeling of crowd workers' quality is discussed in detail in Section 3.6.
3.4 Training Meta-workers
We build meta-workers using meta learning algorithms. We choose one representative method from each of the three meta learning categories to form our pool of meta-workers, namely Model-Agnostic Meta-Learning (MAML) [6], Meta Networks (MN) [20], and Relation Networks (RN) [25], all under an N-way K-shot setting. Each induces a different learning bias, so together they can form an effective ensemble.
Meta learning trains a model on a variety of learning tasks. The model is then fine-tuned to solve new learning tasks using only a small number of training samples of the target task domain [30]. The general meta-learning algorithm can be formalized as follows:
$\omega^{*} = \arg\max_{\omega} \; \log p(\omega \mid \mathcal{D}_{\mathrm{train}})$   (1)

$\theta^{*} = \arg\max_{\theta} \; \log p(\theta \mid \omega^{*}, \mathcal{D}^{s})$   (2)
where $\theta$ represents the model parameters we want to learn and $\omega$ is the meta-knowledge extracted from the meta-training set $\mathcal{D}_{\mathrm{train}}$. Eq. (1) corresponds to the meta-knowledge learning phase, and Eq. (2) to the meta-knowledge adaptation phase on the support set $\mathcal{D}^{s}$.
For the above purpose, MAML optimizes the initial values of the parameters to enable the model to adapt quickly to new tasks during gradient descent; the meta-knowledge is given by the gradient descent direction. MN has the ability to remember old data and to assimilate new data quickly; it learns meta-level knowledge across tasks and shifts its inductive biases along the direction of error reduction. RN learns a deep distance metric to compare a small number of instances within episodes.
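As a toy illustration of the bi-level structure in Eqs. (1)-(2), the first-order MAML sketch below meta-learns a shared scalar initialization for a family of one-dimensional regression tasks; it is an assumption-level example, not the paper's network-based implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 0.1, 0.01     # inner (adaptation) and outer (meta) step sizes
theta = 0.0                 # meta-learned initialization: the meta-knowledge of Eq. (1)

def sample_task():
    mu = rng.uniform(-5, 5)                      # task-specific optimum
    support = mu + rng.normal(0, 0.5, size=5)    # K-shot support samples
    query = mu + rng.normal(0, 0.5, size=15)     # query samples
    return support, query

for step in range(2000):
    meta_grad = 0.0
    for _ in range(8):                           # a batch of meta-training tasks
        support, query = sample_task()
        grad_inner = 2 * (theta - support.mean())        # gradient of the squared loss
        theta_adapted = theta - alpha * grad_inner       # task adaptation, as in Eq. (2)
        meta_grad += 2 * (theta_adapted - query.mean())  # first-order outer gradient
    theta -= beta * meta_grad / 8

print("meta-learned initialization:", round(theta, 3))   # drifts toward the mean task optimum (~0 here)
```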
3.5 Obtaining Annotations
After the meta-workers have been adapted to the target task domain, they can replace, or work together with, the crowd workers to provide annotations for the tasks in the query set D^q, thereby saving budget. Although we employ multiple meta-workers to improve the quality of crowdsourcing, there may still exist difficult tasks that meta-workers cannot perform well. Therefore, we model the difficulty of all the tasks and select the difficult ones to be annotated by crowd workers.
There are many criteria to quantify the difficulty of a task [15, 29, 40]. Here, we adopt a simple and intuitive criterion: the more difficult a task is, the harder it is for meta-workers to reach an agreement on it, and the larger the divergence between the task's annotations. As such, we can approximate the difficulty of a task by measuring the annotation divergence. Since the annotation given by a meta-worker is a label probability distribution, the KL divergence (Eq. (3)) can be used to measure the difference between two distributions:

$D_{KL}(\mathbf{p} \parallel \mathbf{q}) = \sum_{l=1}^{N} p(l) \log_2 \frac{p(l)}{q(l)}$   (3)

However, the direct use of the KL divergence has two disadvantages: it is asymmetric, and it can only compare two annotations at a time. Asymmetry makes it necessary to consider the order of annotations, which makes measuring the divergence of multiple annotations more complicated and tedious. For these reasons, we use the symmetric Jensen-Shannon divergence (Eq. (4)), an extension of the KL divergence, to measure the divergence of all possible annotation pairs, and take their average value as the final divergence (Eq. (5)):
$D_{JS}(\mathbf{p} \parallel \mathbf{q}) = \tfrac{1}{2} D_{KL}(\mathbf{p} \parallel \mathbf{m}) + \tfrac{1}{2} D_{KL}(\mathbf{q} \parallel \mathbf{m}), \quad \mathbf{m} = \tfrac{1}{2}(\mathbf{p} + \mathbf{q})$   (4)

$Div(A_i) = \frac{2}{M(M-1)} \sum_{u < v} D_{JS}(\mathbf{a}_{iu} \parallel \mathbf{a}_{iv})$   (5)
where $A_i$ denotes the set of meta annotations for task $i$ and $M$ is the number of meta-workers.
Once we have collected the meta annotations, we compute the JS divergence of each task, pick out the tasks whose divergence exceeds the threshold ε, and submit them to crowd workers for further annotation. We assign a fixed number of additional crowd workers with fair quality to each difficult task and obtain the crowd annotations. Finally, all annotations are gathered to compute the consensus labels.
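The sketch below shows this difficulty test, assuming the meta annotations of a task are probability vectors over the label space and using base-2 logarithms so the divergence lies in [0, 1]; the annotation values and the threshold of 0.3 are illustrative.

```python
import itertools
import numpy as np

def kl(p, q, eps=1e-12):
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log2(p / q)))

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def task_divergence(annotations):
    """Average pairwise JS divergence of one task's meta annotations (Eq. (5))."""
    pairs = list(itertools.combinations(annotations, 2))
    return sum(js(p, q) for p, q in pairs) / len(pairs)

meta_annotations = np.array([[0.70, 0.10, 0.10, 0.05, 0.05],   # three meta-workers, one task
                             [0.60, 0.20, 0.10, 0.05, 0.05],
                             [0.05, 0.05, 0.05, 0.80, 0.05]])
epsilon = 0.3                                                  # illustrative threshold
needs_crowd = task_divergence(meta_annotations) > epsilon      # escalate to crowd workers if True
print(round(task_divergence(meta_annotations), 3), needs_crowd)
```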
3.6 Aggregating the Annotations
3.6.1 Correcting Annotations
In the last step we compute the consensus labels of the tasks. Meta annotations and crowd annotations are inherently different: the former are discrete probability distributions over the label space, while the latter are one-hot codings over the label space. Therefore, we use different strategies to model meta-workers and crowd workers. A meta-worker's probability-distribution annotation gives the probability that an instance belongs to each class, which makes it suitable for a D&S model [4, 36, 10]. The one-hot crowd annotations, instead, simply indicate the single most likely label for an instance. Furthermore, the number of crowd annotations is much smaller than that of meta annotations. As such, we cannot build a complex model for crowd workers from their annotations, so we choose the simple but effective Worker Probability model (a single capacity or accuracy value) [8, 11, 41] for them.
Since not all tasks are annotated by crowd workers, the crowd annotation matrix is incomplete; we fill it up with negative 'dummy' annotations. To eliminate the difference between meta annotations and crowd ones, and to model crowd workers, we introduce an accuracy value $q_w$ for each crowd worker $w$ and transform the worker's one-hot annotation as follows:
$\hat{a}_{iw}(l) = \begin{cases} q_w, & a_{iw}(l) = 1 \\ \frac{1 - q_w}{N - 1}, & a_{iw}(l) = 0 \end{cases}$   (6)
We perform the above transformation on each bit of the annotation vector $\mathbf{a}_{iw}$; each crowd worker's accuracy $q_w$ is initialized when the support set D^s is attained.
The D&S model focuses on single-label tasks (with N fixed choices) and models the bias of each meta-worker $w$ as an $N \times N$ confusion matrix $\pi_w$, where $\pi_w(j, l)$ is the probability that worker $w$ wrongly assigns label $l$ to an instance whose true label is $j$. We use the confusion matrix of a meta-worker to correct its annotations via Eq. (7), where $\hat{\mathbf{a}}_{iw}$ is the corrected annotation. We initialize each $\pi_w$ with the $N \times N$ identity matrix in the first iteration.

$\hat{\mathbf{a}}_{iw} = \frac{\pi_w \, \mathbf{a}_{iw}}{\lVert \pi_w \, \mathbf{a}_{iw} \rVert_1}$   (7)
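The snippet below sketches both corrections, using the forms of Eqs. (6) and (7) as reconstructed above (an assumption on the exact normalization); crowd one-hot votes are softened by the worker's accuracy, and meta annotations are re-weighted through the worker's confusion matrix, which starts as the identity.

```python
import numpy as np

def soften_crowd_vote(one_hot, accuracy):
    """Eq. (6): keep `accuracy` mass on the chosen label, spread the rest uniformly."""
    n_labels = one_hot.size
    return np.where(one_hot == 1, accuracy, (1 - accuracy) / (n_labels - 1))

def correct_meta_annotation(prob_vector, confusion):
    """Eq. (7): re-weight by the confusion matrix and renormalize to a distribution."""
    corrected = confusion @ prob_vector
    return corrected / corrected.sum()

n_labels = 5
vote = np.eye(n_labels)[2]                      # a crowd worker picked class 2
print(soften_crowd_vote(vote, accuracy=0.8))    # -> [0.05, 0.05, 0.8, 0.05, 0.05]

pi = np.eye(n_labels)                           # first-iteration confusion matrix
a = np.array([0.6, 0.2, 0.1, 0.05, 0.05])       # a meta-worker's annotation
print(correct_meta_annotation(a, pi))           # the identity matrix leaves it unchanged
```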
3.6.2 Inferring Labels
Once the above corrections of Eqs. (6) and (7) have been applied, we compute the consensus labels using weighted majority voting on the corrected annotations. We then use the inferred labels to update the confusion matrices of the meta-workers and the accuracy values of the crowd workers. Let $\Theta$ denote the set of worker parameters (the confusion matrices $\pi_w$ and the accuracies $q_w$). We use the EM algorithm [4, 10, 33] to alternately optimize $\Theta$ and the consensus labels until convergence. The detailed process is as follows.
E-step: We use Eqs. (6) and (7), together with the current $\Theta$, to correct the crowd and meta annotations. We then combine the corrected crowd and meta annotations at the task level to obtain a larger annotation tensor of size $T \times (M + C) \times N$, where the first dimension is the number of tasks, the second is the number of workers, and each entry is an annotation vector of size $N$. We sum the annotations task by task and select the label with the highest total probability as the estimated ground truth, as described in Eq. (8):

$\hat{y}_i = \arg\max_{l \in \{1, \dots, N\}} \sum_{w} \hat{a}_{iw}(l)$   (8)
M-step: We use the annotations and the estimated labels $\hat{y}$ to update $\Theta$. Eq. (9) shows how to count the number of tasks correctly answered by worker $w$, where $\mathbb{1}(\cdot)$ is an indicator function that equals 1 if its argument is true and 0 otherwise, $\mathcal{T}_w$ is the set of tasks actually annotated by worker $w$, and $\mathbf{A}_w$ is worker $w$'s annotation matrix. A formal description of the confusion matrix update is given in Eq. (10).
$c_w = \sum_{i \in \mathcal{T}_w} \mathbb{1}\!\left(\arg\max_{l} a_{iw}(l) = \hat{y}_i\right)$   (9)
To update the accuracy $q_w$ of crowd worker $w$, we first count the number of tasks the worker has correctly annotated via Eq. (9), and then normalize this count by the total number of annotations the worker has provided. Note that, in general, a crowd worker does not annotate all tasks (the dummy annotations are skipped).
We update the confusion matrix $\pi_w$ of meta-worker $w$ row by row. Here $\pi_w(j, l)$ represents the probability of mistaking a task of true label $j$ for label $l$; accordingly, the denominator in Eq. (10) is the number of tasks with (estimated) label $j$, and the numerator is the number of tasks whose label is $j$ but which worker $w$ labels as $l$. We count a total of $N \times N$ label confusion cases and update each entry of the confusion matrix.
$\pi_w(j, l) = \frac{\sum_{i \in \mathcal{T}_w} \mathbb{1}(\hat{y}_i = j)\, \mathbb{1}\!\left(\arg\max_{l'} a_{iw}(l') = l\right)}{\sum_{i \in \mathcal{T}_w} \mathbb{1}(\hat{y}_i = j)}$   (10)
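A compact sketch of this iterative aggregation is given below. It assumes every annotation is stored as a length-N probability vector in a single tensor (all-zero rows mark tasks a crowd worker did not annotate), and it weights crowd votes by the current accuracy estimate rather than applying the exact Eq. (6) softening, so it is a simplified stand-in for the scheme of Eqs. (8)-(10), not the paper's implementation.

```python
import numpy as np

def aggregate(A, is_meta, n_iter=20):
    """A: annotations of shape (tasks, workers, labels); returns consensus labels."""
    n_tasks, n_workers, n_labels = A.shape
    pi = np.tile(np.eye(n_labels), (n_workers, 1, 1))   # confusion matrices (meta-workers)
    q = np.full(n_workers, 0.7)                         # accuracies (crowd workers)
    answered = A.sum(axis=2) > 0                        # False where a dummy annotation sits
    hard = A.argmax(axis=2)                             # each worker's most likely label per task
    labels = A.sum(axis=1).argmax(axis=1)               # plain majority vote to start
    for _ in range(n_iter):
        # E-step: correct meta annotations via the confusion matrices, then take a weighted vote.
        corrected = np.einsum('wjk,twk->twj', pi, A)
        corrected[~answered] = 0.0
        weights = np.where(is_meta, 1.0, q)
        labels = (corrected * weights[None, :, None]).sum(axis=1).argmax(axis=1)
        # M-step: refit each worker model against the current consensus.
        for w in range(n_workers):
            mask = answered[:, w]
            if is_meta[w]:
                for j in range(n_labels):
                    rows = mask & (labels == j)
                    if rows.any():
                        pi[w, j] = np.bincount(hard[rows, w], minlength=n_labels) / rows.sum()
            elif mask.any():
                q[w] = (hard[mask, w] == labels[mask]).mean()
    return labels

# Example: 3 meta-workers followed by 3 crowd workers on random soft annotations.
# labels = aggregate(np.random.dirichlet(np.ones(5), size=(100, 6)), np.array([True]*3 + [False]*3))
```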
Our approach (MetaCrowd) is summarized in Algorithm 1. We first use clustering and crowdsourcing to transform the crowdsourcing project into an N-way K-shot few-shot learning problem (lines 1-4). Meta learning algorithms are then used to train meta-workers (lines 5-10), which annotate all the remaining crowdsourcing tasks. After that, the JS divergence is employed to measure task difficulty, and the difficult tasks are further annotated by crowd workers (lines 11-16). Finally, we gather all the annotations, correct them, and aggregate them to compute the consensus labels (lines 17-24).
4 Experiments
4.1 Experimental Setup
Datasets: We verify the effectiveness of our proposed method MetaCrowd on three real image datasets: Mini-Imagenet [31], 256_Object_Categories [7], and CUB_200_2011 [32]. Each dataset has multiple subclasses, and we treat each subclass as a separate task (or a small dataset). Following the dataset division principle recommended for Mini-Imagenet, we divide it into three parts (train, val, and test) with a class ratio of 64:16:20; the other two datasets are processed in a similar way. The statistics of these benchmark datasets are given in Table II. We deem all the data in the 'train' portion as the meta-training set and the data in the 'val' portion as the validation set, and we randomly select N categories from the 'test' portion to form an N-way meta-test set (a small sketch of this class-level split follows Table II).
Dataset | Test classes | Train + val classes | Images (per class × classes)
---|---|---|---
Mini-Imagenet | 20 | 64 + 16 | 600 × 100
256_Object | 40 | 128 + 32 | 90 × 200
CUB_200_2011 | 40 | 128 + 32 | 60 × 200
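A minimal sketch of this class-level split and meta-test sampling is shown below for the Mini-Imagenet case (64/16/20 classes); the class names are placeholders.

```python
import random

random.seed(0)
classes = [f"cls_{i:03d}" for i in range(100)]        # Mini-Imagenet has 100 classes in total
random.shuffle(classes)
train_cls, val_cls, test_cls = classes[:64], classes[64:80], classes[80:]

n_way = 5
meta_test_classes = random.sample(test_cls, n_way)    # these become the crowdsourcing label space
print(meta_test_classes)
```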
Crowd Workers: Following the canonical worker setting [12], we simulate four types of workers (spammer, random, normal, and expert) with different capacity (accuracy) ranges and proportions, as shown in Table III. We set up three different worker compositions to study the influence of low-, normal-, and high-reliability crowds; their weighted average capacities are 0.535, 0.600, and 0.650, respectively. Following the setup in Table III (a sampling sketch is given after the table), we generate 30 crowd workers for Mini-Imagenet and 10 crowd workers for 256_Object_Categories and CUB_200_2011 to initialize our N-way K-shot support set and to provide additional annotations for the tasks on which the meta-workers disagree.
Worker type | Accuracy floor | Accuracy ceiling | Proportion (a) | Proportion (b) | Proportion (c)
---|---|---|---|---|---
spammer | 0.10 | 0.25 | 10% | 10% | 10%
random | 0.25 | 0.50 | 20% | 10% | 10%
normal | 0.50 | 0.80 | 60% | 70% | 50%
expert | 0.80 | 1.00 | 10% | 10% | 30%
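The sketch below simulates such a crowd, using the proportions of composition (b) in Table III; drawing each worker's accuracy uniformly inside its band is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
bands = {"spammer": (0.10, 0.25), "random": (0.25, 0.50),
         "normal": (0.50, 0.80), "expert": (0.80, 1.00)}
proportions = {"spammer": 0.1, "random": 0.1, "normal": 0.7, "expert": 0.1}

def simulate_workers(n):
    types = rng.choice(list(bands), size=n, p=[proportions[t] for t in bands])
    return [(t, rng.uniform(*bands[t])) for t in types]

workers = simulate_workers(30)                           # e.g., the 30 workers for Mini-Imagenet
print(sum(acc for _, acc in workers) / len(workers))     # hovers near the 0.600 average capacity
```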
We compare our MetaCrowd against five related and representative methods.
(i) Reqall (Request and allocate) [16] is a typical solution for budget saving. Reqall dynamically determines the number of annotations required for a given task: it stops collecting annotations for a task once the weighted ratio between the vote counts of the two leading classes reaches a preset threshold, or once the maximum number of annotations is reached. Reqall assumes the workers' abilities are known and mainly focuses on binary tasks; we follow the advice in the paper to convert multi-class problems into binary ones. We fix its quality requirement at 3, consistent with MetaCrowd.
(ii) QASCA [48] is a classical task assignment solution. It estimates the quality improvement obtained if a worker is assigned a set of tasks (from a pool of tasks), and then selects the optimal set that yields the highest expected quality improvement. We set the budget to an average of 3 annotations per task (similar to Reqall).
(iii) Active (Active crowdsourcing) [5] takes both budget saving and worker/task selection into account. Active combines target-domain (meta-test) and source-domain (meta-training) data using sparse coding to obtain a more concise high-level representation of tasks, uses a probabilistic graphical model to model both workers and tasks, and then applies active learning to select the right worker for the right task.
(iv) AVNC (Adaptive voting noise correction) [42] tries to identify and eliminate potentially noisy annotations, and then uses the remaining clean instances to train a set of classifiers that predict and correct the noisy annotations before truth inference. We use MV and WMV as its consensus algorithms and set the budget to an average of 3 annotations per task (similar to Reqall and QASCA).
(v) ST (Self-training method) [37] first trains an annotator on the labeled instances in the pool. The annotator then labels the remaining tasks, picks out the instance with the highest-confidence label, and merges it into the labeled pool. These steps are repeated until all tasks are labeled.
(vi) MetaCrowd and its variants adopt three meta-workers trained by three types of meta algorithms (MAML, MN, and RN) under the N-way K-shot setting. In our experiments, we consider two variants for the ablation study. MetaCrowd-OC follows the canonical crowdsourcing principle and uses only crowd workers: all the tasks are annotated by three crowd workers each, and we deem its accuracy and budget the baseline performance. MetaCrowd-OM uses only meta-workers to annotate the tasks, even when they disagree. The remaining settings of the variants are the same as for MetaCrowd.
For the other parameters of the compared methods not mentioned above, we adopt the recommended settings from their original papers.
4.2 Analysis of the Results
Table IV gives the accuracy of the methods under comparison, grouped into four categories: dynamic task allocation (Reqall and QASCA), active learning (Active), machine self-correction (AVNC and ST), and meta learning (MetaCrowd-OC, MetaCrowd-OM, and MetaCrowd). In particular, AVNC and MetaCrowd adopt majority voting and weighted majority voting to compute consensus labels, while the other methods adopt their own consensus solutions. We have several important observations.
(i) MetaCrowd vs. Self-training: ST uses supervised self-training to gradually annotate the tasks and has the lowest accuracy among all the compared methods. This is because the lack of meta-knowledge (labeled training data) makes traditional supervised methods infeasible in the few-shot setting. When the number of labeled instances is small, ST cannot train an effective model, so the quality of its pseudo-labels is low, and the errors keep accumulating, eventually leading to the failure of the self-trained classifier. The other crowdsourcing solutions obtain a much higher accuracy by modeling tasks and workers, and MetaCrowd achieves the best performance through both meta learning and ensemble learning. This shows that the extraction of meta-knowledge from relevant domains is crucial for the few-shot learning process.
(ii) MetaCrowd vs. Active: Both MetaCrowd and Active try to reduce the number of annotations to save budget. By modeling workers and tasks, Active assigns only the single most appropriate worker to each task to save budget and improve quality. In contrast, MetaCrowd leverages meta-knowledge and initial labels from crowd workers to automatically annotate the large portion of simple tasks, and invites crowd workers to annotate the small number of difficult tasks; thus both quality and budget saving can be achieved. MetaCrowd-OM and MetaCrowd both achieve a significantly higher accuracy than Active, because crowd workers have diverse preferences and the active learning strategy adopted by Active cannot reliably model workers from the limited data. By obtaining additional annotations for the most uncertain tasks and meta annotations for all the tasks, MetaCrowd achieves an accuracy improvement of 8%. Even with a large number of annotations from plain crowd workers, MetaCrowd-OC still loses to MetaCrowd, which confirms the rationale of empowering crowdsourcing with meta learning.
(iii) MetaCrowd vs. Reqall: Both MetaCrowd and Reqall dynamically determine the number of annotations required for a task based on the annotation results; thus they can trade off budget against quality. Reqall assumes that workers' abilities are known and mainly focuses on binary tasks (we follow the recommended approach to adapt it to the multi-class case). With no more than three annotations per task on average (the MetaCrowd-OC baseline setting), MetaCrowd beats Reqall in both quality and budget, because MetaCrowd can leverage the classifiers to handle most of the simple tasks and reserve the budget for the difficult ones. These results show the effectiveness of our human-machine hybrid approach.
(iv) MetaCrowd vs. QASCA: QASCA and Reqall both decide the next task assignment based on the current annotation results. QASCA does not assume that the workers' abilities are known, but estimates them with the EM algorithm, so its actual performance is much better than Reqall's. However, QASCA is still beaten by MetaCrowd, because QASCA suffers from a cold-start problem: at the beginning it can only treat all workers as equally reliable, and is therefore more affected by low-quality workers. In contrast, MetaCrowd has a relatively accurate understanding of the workers from the start, owing to the meta-test set construction process and the exclusion of low-quality workers from the difficult tasks.
(v) Modeling vs. non-modeling of workers: In crowdsourcing, we often need to model tasks and/or workers to better perform the tasks and compute their consensus labels. Comparing the second column (weighted majority voting) of AVNC, MetaCrowd-OC, MetaCrowd-OM and MetaCrowd with their respective non-modeling versions (first column, majority voting), we can conclude that, by modeling workers, we can better compute the consensus labels of tasks from their annotations. This advantage is due to two factors: we introduce a worker model that separately accounts for crowd workers and meta-workers, and we model the difficulty of tasks using the annotation divergence and pay more attention to the difficult ones.
(vi) Robustness to different situations: We treat the results in Table V(b) as the baseline; in Table V(a) there are more noisy workers, while in Table V(c) there are more experts. Comparing the results in Table V(a) and Table V(b), we see that although the average quality of crowd workers drops noticeably, the accuracy of MetaCrowd decreases only slightly, while the other methods suffer a larger reduction. This can be explained by two reasons. First, MetaCrowd uses sufficient data to model workers while building the meta-test set, so potential low-quality crowd workers can be identified; second, most of the simple tasks are annotated by ordinary but reliable meta-workers, and we consult only crowd workers with fair quality for the difficult tasks. Therefore, MetaCrowd reduces the impact of low-quality workers and obtains more robust results. In the less common situation with many experts (Table V(c)), MetaCrowd also achieves the best performance. This confirms that MetaCrowd is suitable for a variety of worker compositions, especially when the capacity of the workers is not so good.
4.3 Budget Saving
We use quantitative analysis and simulation results to illustrate the advantage of MetaCrowd for budget saving, measured by the number of annotations used. For the quantitative analysis, we adopt the typical assumption that the expense of a single annotation is uniformly fixed. We separately estimate the budget of MetaCrowd-OC, Reqall, QASCA, Active and MetaCrowd. AVNC adopts the same workflow as MetaCrowd-OC except for the noise correction operation, so its budget is the same as MetaCrowd-OC's. ST and MetaCrowd-OM involve essentially no crowd workers beyond the initial labels, so they are not considered here.
4.3.1 Quantitative Analysis
The number of annotations used by some of the methods can be estimated in advance; we first derive these counts theoretically.
MetaCrowd-OC asks three crowd workers to annotate each task and thus consumes a total of 3T annotations.
MetaCrowd needs some initial annotations to kick off the training of the meta-workers. Its total number of annotations is the sum of two parts: the annotations used to build the support set (its size N × K, multiplied by a small margin amplification factor that ensures the N-way K-shot set can be formed, and by the number of crowd workers providing repeated annotations), plus the additional annotations for the fraction of remaining tasks (determined by the divergence threshold ε) that are escalated to crowd workers. The first part is small and fixed, so when T is large enough the total is dominated by the escalation term (a back-of-the-envelope sketch is given at the end of this subsection).
Active needs about one third of the tasks to be annotated in order to build a stable model of workers and tasks before applying active learning; each of the remaining tasks then only needs the single most suitable crowd worker. We let three crowd workers annotate the initial tasks, so the total number of annotations for Active is about T + 2T/3 = 5T/3.
QASCA has its total budget set to an average of 3 annotations per task, so it also consumes a total of 3T annotations.
Reqall depends on the quality requirement, the task difficulty, and the workers' abilities to determine the number of annotations, so its budget cannot be quantified explicitly in advance. In the multi-class case, Reqall only considers the two classes with the most votes and thus wastes part of the budget.
In summary, the total numbers of annotations needed by MetaCrowd-OC and QASCA are about 3T, Active needs about 5T/3, and MetaCrowd needs the small support-set cost plus repeated annotations only for the fraction of tasks flagged as difficult.
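The hedged sketch below puts these counts side by side; the decomposition of the MetaCrowd budget, the margin factor, the escalation ratio, and the number of repeated annotations are assumptions for illustration, not the paper's exact formula.

```python
import math

def metacrowd_budget(T, n_way, k_shot, n_crowd, margin=1.2, hard_ratio=0.3, repeats=3):
    initial = math.ceil(margin * n_way * k_shot) * n_crowd          # building the support set
    escalated = round(hard_ratio * (T - n_way * k_shot)) * repeats  # difficult tasks only
    return initial + escalated

def crowd_only_budget(T, repeats=3):      # MetaCrowd-OC / QASCA style: every task gets 3 annotations
    return T * repeats

def active_budget(T):                     # about T/3 tasks with 3 annotations, the rest with 1
    return T + 2 * T // 3

print(metacrowd_budget(T=3000, n_way=5, k_shot=5, n_crowd=30),
      crowd_only_budget(3000), active_budget(3000))
```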
4.3.2 Simulation Results
Following the experimental settings of the previous subsection, we count the number of crowd annotations consumed on Mini-Imagenet (T = 3000 tasks, 5-way 5-shot) as an example.
Both MetaCrowd-OC and QASCA require 9000 annotations.
MetaCrowd costs 3971 annotations.
Active asks for about 3000 annotations to build its model, and the remaining tasks consume about 2000 annotations, for a total of about 5000 annotations.
Reqall, with the requirement that the budget be no more than three annotations per task on average, achieves a competitive quality and consumes about 2.8 annotations per task, amounting to a total of 8424 annotations.
Data | Tasks | Reqall Budget | Reqall Quality | QASCA Budget | QASCA Quality | Active Budget | Active Quality | MetaCrowd-OC Budget | MetaCrowd-OC Quality | MetaCrowd Budget | MetaCrowd Quality
---|---|---|---|---|---|---|---|---|---|---|---
Mini-Imagenet | 3000 | 8424 | 0.821 | 9000 | 0.824 | 5000 | 0.775 | 9000 | 0.806 | 3971 | 0.840
256_Object | 450 | 1256 | 0.828 | 1350 | 0.827 | 750 | 0.777 | 1350 | 0.792 | 654 | 0.855
CUB_200_2011 | 300 | 837 | 0.812 | 900 | 0.819 | 500 | 0.766 | 900 | 0.811 | 530 | 0.836
The total number of annotations and the accuracy of the methods on the three datasets are listed in Table V. MetaCrowd always outperforms Active, Reqall, and MetaCrowd-OC in terms of both budget and quality, and its budget-saving advantage becomes more prominent as the size T of the crowdsourcing project increases. MetaCrowd loses to Active in terms of budget on CUB_200_2011 due to the small size of this dataset. In fact, MetaCrowd is not suitable for crowdsourcing projects with a relatively small size or an extremely large label space, in which the repeated annotations needed to build the support set for meta learning consume a large portion of the budget.
4.4 Parameter Analysis
4.4.1 Parameters in Meta Learning
We study the impact of two preset parameters of MetaCrowd, namely N and K of the meta learning algorithms. Generally, the values of N and K are determined by the given problem. Here we study their influence using Mini-Imagenet with the MAML algorithm; other dataset and algorithm combinations lead to similar conclusions. We vary N or K from 1 to 10 while keeping the other fixed as in the previous experiments.
We can see from Fig. 3 that the accuracy increases as the number of support instances K increases, but the increase gradually slows down, which is consistent with the intuition that more annotated instances facilitate the training of more credible meta-workers and hence improve quality. On the other hand, as the number of classes N increases, the accuracy gradually decreases, but it always remains much higher than random guessing, which suggests the effectiveness of the meta learning algorithm for few-shot learning tasks. In crowdsourcing, N is determined by the project itself, and the only parameter we can choose is K. A larger K gives a better meta-worker, but it is expensive to have a large K-shot support set annotated; as such, K = 5 is adopted as a trade-off in this paper.
The heat-maps of the confusion matrices of our meta-workers under a 5-way 5-shot setting are shown in Fig. 4. We find that the accuracy of all three meta-workers is about 0.6 (the values on the diagonal of each confusion matrix), which is in accordance with the average capacity of our normal crowd workers in Table III. In addition, the meta-workers also manifest different preferences and are capable of doing different tasks.
4.4.2 Parameters in Crowdsourcing
Here we study the impact of the divergence threshold ε and of the number of additional annotations per difficult task on the quality-budget trade-off. Generally speaking, a lower threshold and more additional annotations lead to a better crowdsourcing quality while also consuming more budget, so setting appropriate values for these two parameters, to meet the quality and budget requirements as far as possible, is critical.
The divergence metric (see Eq. (5)) lies within [0, 1], so we vary ε from 0 to 1 with an interval of 0.05. For each value, we assign additional crowd workers to the tasks whose divergence exceeds ε and count the number of instances that receive further crowd annotations. Fig. 5 gives the results under different values of ε. We can see that as ε decreases from 1 to 0, the number of instances that need additional annotations gradually increases, and the aggregation accuracy increases as well. The overall trend can be roughly divided into three stages according to the value of ε. In the first stage (large ε), although ε varies over a wide range, the number of tasks that need to be manually annotated and the quality of the crowdsourcing project are almost unchanged; this is because the divergence of most tasks falls below this range, so the influence of ε in the first stage is very limited. In the second stage, however, the budget and quality of the crowdsourcing project are very sensitive to ε: as ε decreases, both the budget and the quality increase significantly. In our experiments, setting ε within [0.3, 0.6] is a reasonable choice. In the third stage (small ε), the budget of the crowdsourcing project still increases rapidly as ε changes, but the quality stays relatively stable with little improvement. This is because tasks with a small divergence are usually relatively simple tasks on which the meta-workers already agree, so further manual annotations do not significantly improve the quality. Based on these results, we adopt a threshold within [0.3, 0.6] in the experiments as a trade-off between quality and budget.
We also study the impact of the number of additional annotations collected for each difficult task on the quality and budget of crowdsourcing. The experimental results in Fig. 6 are in line with our intuition: as this number increases, the quality of crowdsourcing gradually increases and then becomes relatively stable, while the accompanying budget increases linearly. If no additional crowd annotations are collected, MetaCrowd degenerates into MetaCrowd-OM. Given that, we fix the number of additional annotations per difficult task at three, the same as the number of meta-workers in our experiments.
5 Conclusion
In this paper, we study how to leverage meta learning in crowdsourcing for budget saving and quality improvement, and propose an approach called MetaCrowd (Crowdsourcing with Meta-Workers) that implements this idea. MetaCrowd uses meta learning to train capable meta-workers for crowdsourcing tasks and thus saves budget. Meanwhile, it quantifies the divergence between the meta-workers' annotations to model the difficulty of tasks, and collects additional annotations for difficult tasks from crowd workers to improve quality. Experiments on benchmark datasets show that MetaCrowd is superior to representative methods in terms of both budget saving and crowdsourcing quality.
Our method is tolerant of crowd workers of varying quality, but it places certain requirements on the types and characteristics of the crowdsourcing tasks. A possible improvement of our work lies in obtaining the initial labeled dataset required for meta learning with an even smaller budget.
References
- [1] (2012) Asking the right questions in crowd data sourcing. In 28th IEEE International Conference on Data Engineering, pp. 1261–1264.
- [2] (1993) Signature verification using a “siamese” time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence 7 (04), pp. 669–688.
- [3] (2016) A survey of general-purpose crowdsourcing techniques. IEEE Transactions on Knowledge and Data Engineering 28 (9), pp. 2246–2266.
- [4] (1979) Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics) 28 (1), pp. 20–28.
- [5] (2014) Active learning for crowdsourcing using knowledge transfer. In AAAI Conference on Artificial Intelligence, pp. 1809–1815.
- [6] (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pp. 1126–1135.
- [7] (2007) Caltech-256 object category dataset.
- [8] (2012) So who won? Dynamic max discovery with the crowd. In ACM SIGMOD International Conference on Management of Data, pp. 385–396.
- [9] (2006) The rise of crowdsourcing. Wired Magazine 14 (6), pp. 1–4.
- [10] (2010) Quality management on Amazon Mechanical Turk. In Proceedings of the ACM SIGKDD Workshop on Human Computation, pp. 64–67.
- [11] (2011) Iterative learning for reliable crowdsourcing systems. In Advances in Neural Information Processing Systems, pp. 1953–1961.
- [12] (2011) Worker types and personality traits in crowdsourcing relevance labels. In ACM International Conference on Information and Knowledge Management, pp. 1941–1944.
- [13] (2017) Combining active learning and self-labeling for data stream mining. In International Conference on Computer Recognition Systems, pp. 481–490.
- [14] (2016) Crowdsourced data management: a survey. IEEE Transactions on Knowledge and Data Engineering 28 (9), pp. 2296–2319.
- [15] (2017) Crowdsourced data management: overview and challenges. In ACM International Conference on Management of Data, pp. 1711–1716.
- [16] (2016) Crowdsourcing high quality labels with a tight budget. In ACM International Conference on Web Search and Data Mining, pp. 237–246.
- [17] (2020) Many-class few-shot learning on multi-granularity class hierarchy. IEEE Transactions on Knowledge and Data Engineering 99 (1), pp. 1–14.
- [18] (2012) Counting with the crowd. VLDB Endowment 6 (2), pp. 109–120.
- [19] (2014) Scaling up crowd-sourcing to very large datasets: a case for active learning. VLDB Endowment 8 (2), pp. 125–136.
- [20] (2017) Meta networks. In International Conference on Machine Learning, pp. 2554–2563.
- [21] (2016) Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pp. 1842–1850.
- [22] (2008) Get another label? Improving data quality and data mining using multiple, noisy labelers. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 614–622.
- [23] (2019) Machine learning with crowdsourcing: a brief summary of the past research and future directions. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 9837–9843.
- [24] (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087.
- [25] (2018) Learning to compare: relation network for few-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208.
- [26] (2018) SLADE: a smart large-scale task decomposer in crowdsourcing. IEEE Transactions on Knowledge and Data Engineering 30 (8), pp. 1588–1601.
- [27] (2018) Dynamic pricing in spatial crowdsourcing: a matching-based approach. In ACM SIGMOD International Conference on Management of Data, pp. 773–788.
- [28] (2020) Attention-aware answers of the crowd. In SIAM International Conference on Data Mining, pp. 451–459.
- [29] (2020) CrowdWT: crowdsourcing via joint modeling of workers and tasks. ACM Transactions on Knowledge Discovery from Data 99 (1), pp. 1–24.
- [30] (2018) Meta-learning: a survey. arXiv preprint arXiv:1810.03548.
- [31] (2016) Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638.
- [32] (2011) The Caltech-UCSD Birds-200-2011 dataset.
- [33] (2012) CrowdER: crowdsourcing entity resolution. VLDB Endowment 5 (11), pp. 1483–1494.
- [34] (2013) Leveraging transitive relations for crowdsourced joins. In ACM SIGMOD International Conference on Management of Data, pp. 229–240.
- [35] (2010) The multidimensional wisdom of crowds. Advances in Neural Information Processing Systems 23, pp. 2424–2432.
- [36] (2009) Whose vote should count more: optimal integration of labels from labelers of unknown expertise. In Advances in Neural Information Processing Systems, pp. 2035–2043.
- [37] (2020) Self-training with noisy student improves ImageNet classification. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 10687–10698.
- [38] (1995) Unsupervised word sense disambiguation rivaling supervised methods. In Annual Meeting of the Association for Computational Linguistics, pp. 189–196.
- [39] (2019) Learning loss for active learning. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 93–102.
- [40] (2020) Active multilabel crowd consensus. IEEE Transactions on Neural Networks and Learning Systems 99 (1).
- [41] (2013) Reducing uncertainty of schema matching via crowdsourcing. VLDB Endowment 6 (9), pp. 757–768.
- [42] (2017) Improving crowdsourced label quality using noise correction. IEEE Transactions on Neural Networks and Learning Systems 29 (5), pp. 1675–1688.
- [43] (2018) Ensemble learning from crowds. IEEE Transactions on Knowledge and Data Engineering 31 (8), pp. 1506–1519.
- [44] (2018) An overview of multi-task learning. National Science Review 5 (1), pp. 30–43.
- [45] (2018) DLTA: a framework for dynamic crowdsourcing classification tasks. IEEE Transactions on Knowledge and Data Engineering 31 (5), pp. 867–879.
- [46] (2016) DOCS: a domain-aware crowdsourcing system using knowledge bases. VLDB Endowment 10 (4), pp. 361–372.
- [47] (2017) Truth inference in crowdsourcing: is the problem solved? Proceedings of the VLDB Endowment 10 (5), pp. 541–552.
- [48] (2015) QASCA: a quality-aware task assignment system for crowdsourcing applications. In ACM SIGMOD International Conference on Management of Data, pp. 1031–1046.