A General-Purpose Crowdsourcing Computational Quality Control Toolkit for Python

09/17/2021 · by Dmitry Ustalov, et al. · Moscow Institute of Physics and Technology, Yandex

Quality control is a crux of crowdsourcing. While most means for quality control are organizational and imply worker selection, golden tasks, and post-acceptance, computational quality control techniques allow modeling the whole crowdsourcing process of workers, tasks, and labels, and inferring and revealing the relationships between them. In this paper, we demonstrate Crowd-Kit, a general-purpose computational quality control toolkit for crowdsourcing. It provides efficient Python implementations of computational quality control algorithms, including uncertainty measures and crowd consensus methods. We focus on aggregation methods for all the major annotation tasks, from categorical annotation, in which the latent label assumption is met, to more complex tasks like image and sequence aggregation. We perform an extensive evaluation of our toolkit on several datasets of diverse nature, enabling benchmarking of computational quality control methods in a uniform, systematic, and reproducible way using the same codebase. We release our code and data under an open-source license at https://github.com/Toloka/crowd-kit.





Introduction

Means for quality control in crowdsourcing include organizational approaches, such as task design, decomposition, and golden task preparation, which are difficult to automate reliably, and computational approaches that exploit the relationships and statistical properties of workers, tasks, and labels. Many studies of complex crowdsourcing pipelines aim to reduce their tasks to multi-classification or to combine multi-classification with post-acceptance, e.g., the seminal work by Bernstein et al. (2010). At the same time, researchers in fields such as natural language processing and computer vision develop discipline-specific methods. To be conveniently employed, these methods need to be integrated with popular data science libraries and frameworks. However, such toolkits as SQUARE (Sheshadri and Lease, 2013), CEKA (Zhang et al., 2015), Truth Inference (Zheng et al., 2017), and spark-crowd (Rodrigo et al., 2019) require additional effort to be embedded in applications. We address this issue by developing Crowd-Kit, an open-source, production-ready Python toolkit for computational quality control in crowdsourcing. It implements popular quality control methods, providing a common ground for reliable experimentation and application. We perform an extensive evaluation of the Crowd-Kit library to provide a common ground for comparisons. In all the experiments in this paper, we used our own implementations of the corresponding methods.

Crowd-Kit Design and Maintenance

Our fundamental design principle of Crowd-Kit is to bridge the gap between crowd science and the vibrant data science ecosystem of NumPy, SciPy, Pandas, and scikit-learn (Pedregosa et al., 2011). We implemented Crowd-Kit in Python, employing the highly optimized data structures and algorithms available in these libraries and ensuring compatibility with the scikit-learn application programming interface (API) and Pandas data frames.
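To illustrate this data-frame-centred workflow, even the simplest aggregation, a majority vote over (task, worker, label) triples, reduces to a Pandas group-by. The column names and toy data below are illustrative sketches of the idea, not necessarily Crowd-Kit's exact API:

```python
import pandas as pd

# Toy annotation data: one row per (task, worker, label) response.
df = pd.DataFrame(
    [
        ("t1", "w1", "cat"), ("t1", "w2", "cat"), ("t1", "w3", "dog"),
        ("t2", "w1", "dog"), ("t2", "w2", "dog"), ("t2", "w3", "dog"),
    ],
    columns=["task", "worker", "label"],
)

# Majority vote: the most frequent label per task.
mv = df.groupby("task")["label"].agg(lambda s: s.mode().iloc[0])
print(mv.to_dict())  # {'t1': 'cat', 't2': 'dog'}
```

Keeping responses in this flat tabular form is what lets the heavier models below reuse Pandas indexing and NumPy vectorization throughout.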

We implemented all the methods in Crowd-Kit from scratch in Python. Although, unlike spark-crowd (Rodrigo et al., 2019), our library does not provide means for running on a distributed computational cluster, it leverages efficient implementations of numerical algorithms in the underlying libraries widely used in the research community. Besides aggregation methods, Crowd-Kit offers annotation quality measures, such as uncertainty (Malinin, 2019) and agreement with aggregate (Appen Limited, 2021).
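Both kinds of measures admit compact sketches on the same (task, worker, label) representation: per-task uncertainty as the entropy of the empirical label distribution, and per-worker agreement with aggregate as the share of a worker's answers that match the majority vote. These are simplified illustrations, not Crowd-Kit's exact implementations:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ("t1", "w1", "cat"), ("t1", "w2", "cat"), ("t1", "w3", "dog"),
        ("t2", "w1", "dog"), ("t2", "w2", "dog"), ("t2", "w3", "dog"),
    ],
    columns=["task", "worker", "label"],
)

def task_entropy(labels: pd.Series) -> float:
    """Shannon entropy of the empirical label distribution of one task."""
    p = labels.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

# Per-task uncertainty: high when workers disagree, zero when unanimous.
uncertainty = df.groupby("task")["label"].apply(task_entropy)

# Agreement with aggregate: share of a worker's answers that match
# the per-task majority vote.
mv = df.groupby("task")["label"].agg(lambda s: s.mode().iloc[0])
agreement = (
    df.assign(hit=df["label"] == df["task"].map(mv))
    .groupby("worker")["hit"].mean()
)
```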

Crowd-Kit is platform-agnostic, allowing the analysis of data from any crowdsourcing marketplace (as long as one can download the data). It is an open-source library available under the Apache license on both GitHub and PyPI: https://github.com/Toloka/crowd-kit and https://pypi.org/project/crowd-kit/, respectively.

Categorical Aggregation

Crowd-Kit includes aggregation methods for categorical data, in which the latent label assumption is met. We implement the most traditional methods for categorical answer aggregation, including models such as Dawid-Skene (DS, 1979), GLAD (Whitehill et al., 2009), and M-MSR (Ma and Olshevsky, 2020). We also offer an implementation of Majority Vote (MV) as well as such weighted variants as Worker Agreement with Aggregate (Wawa), as described in Appen Limited (2021).
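To illustrate this model family, the Dawid-Skene model alternates between estimating per-task label posteriors and per-worker confusion matrices. A compact EM sketch, simplified and without the numerical safeguards a production implementation needs, might look as follows:

```python
import numpy as np
import pandas as pd

def dawid_skene(df: pd.DataFrame, n_iter: int = 10) -> pd.Series:
    """Simplified Dawid-Skene EM over (task, worker, label) rows."""
    tasks, workers, labels = df["task"].unique(), df["worker"].unique(), df["label"].unique()
    t_idx = {t: i for i, t in enumerate(tasks)}
    w_idx = {w: i for i, w in enumerate(workers)}
    l_idx = {l: i for i, l in enumerate(labels)}

    # counts[t, w, l] = 1 iff worker w answered label l on task t.
    counts = np.zeros((len(tasks), len(workers), len(labels)))
    for t, w, l in df.itertuples(index=False):
        counts[t_idx[t], w_idx[w], l_idx[l]] = 1

    # Initialize label posteriors with a soft majority vote.
    post = counts.sum(axis=1)
    post /= post.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class priors and per-worker confusion matrices
        # conf[w, k, l] = P(worker w answers l | true label is k).
        prior = post.mean(axis=0)
        conf = np.einsum("tk,twl->wkl", post, counts)
        conf /= conf.sum(axis=2, keepdims=True).clip(min=1e-9)
        # E-step: posterior over the true label of each task.
        log_post = np.log(prior.clip(min=1e-9)) + np.einsum(
            "twl,wkl->tk", counts, np.log(conf.clip(min=1e-9))
        )
        post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)

    return pd.Series(labels[post.argmax(axis=1)], index=tasks)

df = pd.DataFrame(
    [
        ("t1", "w1", "cat"), ("t1", "w2", "cat"), ("t1", "w3", "dog"),
        ("t2", "w1", "dog"), ("t2", "w2", "dog"), ("t2", "w3", "dog"),
    ],
    columns=["task", "worker", "label"],
)
agg = dawid_skene(df)
```

Unlike a plain majority vote, the confusion matrices let consistently accurate workers outvote unreliable ones on contested tasks.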

Method | D_Product | D_PosSent | S_Rel | S_Adult | binary1 | binary2
Table 1: Comparison of the implemented categorical aggregation methods (accuracy is used).


Evaluation. To ensure the correctness of our implementations, we compared the observed aggregation quality with that of the already available implementations by Zheng et al. (2017) and Rodrigo et al. (2019) on the same datasets. Table 1 shows the evaluation results, indicating a level of quality comparable to these implementations.

Pairwise Aggregation

Pairwise comparisons are essential for tasks such as information retrieval evaluation and subjective opinion gathering, where the latent label assumption is not met. We implemented the Bradley-Terry probabilistic transitivity model (BT, 1952) for pairwise comparisons.
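The Bradley-Terry model assigns each item a latent score p_i such that item i beats item j with probability p_i / (p_i + p_j). A minimal sketch of the classic minorization-maximization fit (an illustration of the model, not Crowd-Kit's exact implementation):

```python
import numpy as np

def bradley_terry(pairs, n_iter=100):
    """Latent item scores from (winner, loser) pairs via the classic
    minorization-maximization update for the Bradley-Terry model."""
    items = sorted({x for pair in pairs for x in pair})
    idx = {x: i for i, x in enumerate(items)}
    n = len(items)
    wins = np.zeros((n, n))  # wins[i, j] = number of times i beat j
    for winner, loser in pairs:
        wins[idx[winner], idx[loser]] += 1
    games = wins + wins.T    # total comparisons between each pair
    p = np.ones(n)
    for _ in range(n_iter):
        # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j), then normalize.
        p = wins.sum(axis=1) / (games / (p[:, None] + p[None, :])).sum(axis=1)
        p /= p.sum()
    return dict(zip(items, p))

# Toy comparisons: "a" mostly beats "b", "b" beats "c".
pairs = [("a", "b")] * 3 + [("b", "c")] * 3 + [("a", "c")] * 2 + [("c", "a")]
scores = bradley_terry(pairs)
```

The update converges whenever the comparison graph is strongly connected; the resulting scores induce the ranking evaluated with NDCG below.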


Evaluation. Table 2 shows the comparison of the Bradley-Terry method implemented in Crowd-Kit with a random baseline on the graded readability dataset by Chen et al. (2013). Since it contains only 491 items, we additionally annotated on Toloka a sample of 2,497 images from the IMDB-WIKI dataset (Rothe et al., 2018), which contains images of people with a reliable ground-truth age assigned to every image. The annotation allowed us to obtain 84,543 comparisons by 2,085 workers.

Method | Chen et al. (2013) | IMDB-WIKI
Table 2: Comparison of the implemented pairwise aggregation methods (NDCG@10 is used for Chen et al. (2013) and NDCG@100 for IMDB-WIKI).

Sequence Aggregation

Crowd-Kit implements the Recognizer Output Voting Error Reduction (ROVER) dynamic programming algorithm by Fiscus (1997), known for its successful application to crowdsourced sequence aggregation (Marge et al., 2010). We also offer implementations of the Reliability-Aware Sequence Aggregation algorithms (RASA and HRRASA) by Li and Fukumoto (2019) and Li (2020), which encode responses using Transformer-based representations and then iteratively estimate the aggregated response embedding.
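The reliability-aware idea can be sketched without the Transformer encoder: given any fixed vector representation of each response, one iteratively re-estimates an aggregate embedding and per-response weights, then returns the response closest to the aggregate. The cosine-similarity weighting below is a deliberate simplification of the reliability estimates used in RASA and HRRASA:

```python
import numpy as np

def reliability_aware_select(embeddings, n_iter=10):
    """Return the index of the response closest to a reliability-weighted
    aggregate embedding; weights are re-estimated from cosine similarity."""
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    weights = np.full(len(emb), 1.0 / len(emb))
    for _ in range(n_iter):
        aggregate = weights @ emb                 # weighted mean direction
        aggregate /= np.linalg.norm(aggregate)
        sims = emb @ aggregate                    # cosine similarity to it
        weights = np.clip(sims, 1e-9, None)
        weights /= weights.sum()
    return int(np.argmax(weights))

# Three near-duplicate responses and one outlier: across iterations the
# outlier loses weight and is never selected as the aggregate answer.
emb = [[1.0, 0.1], [1.0, 0.0], [0.9, 0.2], [0.0, 1.0]]
best = reliability_aware_select(emb)
```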


Evaluation. We used two datasets, CrowdWSA (Li and Fukumoto, 2019) and CrowdSpeech (Pavlichenko et al., 2021). As the typical application of sequence aggregation in crowdsourcing is audio transcription, we used the word error rate (Fiscus, 1997) as the quality criterion in Table 3.

CrowdWSA: J1
CrowdSpeech: dev-clean
Table 3: Comparison of the implemented sequence aggregation methods (average word error rate is used).
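Word error rate itself is the word-level Levenshtein distance between the aggregated transcription and the reference, normalized by the reference length; a minimal implementation:

```python
import numpy as np

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)  # all deletions
    d[0, :] = np.arange(len(hyp) + 1)  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i, j] = min(substitution, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[-1, -1] / len(ref)

# One substitution ("quick" -> "quack") plus one insertion ("jumps")
# against a 4-word reference gives 2 / 4 = 0.5.
print(word_error_rate("the quick brown fox", "the quack brown fox jumps"))
```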

Image Aggregation

Crowd-Kit offers three image segmentation aggregation methods. First, it provides a trivial pixel-wise MV. Second, it implements a method similar to the one described by Jung-Lin Lee et al. (2018), which runs an EM algorithm that estimates the probability of a correct answer as the proportion of correctly classified pixels to the number of all pixels chosen by at least one worker. Third, we implement a variation of RASA that computes Jaccard distances between segments and weights the segments proportionally to these distances.
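The pixel-wise majority vote and the intersection-over-union criterion are straightforward to sketch on boolean masks (toy data for illustration, not Crowd-Kit's exact API):

```python
import numpy as np

def majority_vote_mask(masks: np.ndarray) -> np.ndarray:
    """Pixel-wise majority vote over an (n_workers, H, W) boolean stack."""
    return masks.sum(axis=0) * 2 > masks.shape[0]

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 1.0

# Three workers segment a 4x4 image: two draw the same 2x2 square,
# one draws a larger 3x3 box; the vote keeps only the agreed 2x2 area.
masks = np.zeros((3, 4, 4), dtype=bool)
masks[0, 1:3, 1:3] = True
masks[1, 1:3, 1:3] = True
masks[2, 0:3, 0:3] = True
aggregate = majority_vote_mask(masks)
```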


Evaluation. We annotated on Toloka a sample of 2,000 images from the MS COCO dataset (Lin et al., 2014), covering four object labels. For each image, nine workers submitted segmentations; in total, we received 18,000 responses. Table 4 shows the comparison of the methods on this dataset using the intersection-over-union (IoU) criterion.

Dataset | MV | EM | RASA
Table 4: Comparison of the implemented image aggregation algorithms (IoU is used).


Conclusion

Our experience in running Crowd-Kit on crowdsourced data shows that it successfully handles industry-scale datasets without the need for a large computational cluster. We currently focus on providing a consistent API for benchmarking existing methods and on implementing additional domain-specific aggregation techniques, such as sequence label aggregation (Nguyen et al., 2017) and continuous answer aggregation. We believe that making computational quality control techniques available in a standardized way will open new avenues for reliably improving crowdsourcing quality beyond the traditional, well-known methods and pipelines.


  • Appen Limited (2021) Calculating worker agreement with aggregate (wawa). (english). External Links: Link Cited by: Crowd-Kit Design and Maintenance, Categorical Aggregation.
  • M. S. Bernstein, G. Little, R. C. Miller, B. Hartmann, M. S. Ackerman, D. R. Karger, D. Crowell, and K. Panovich (2010) Soylent: A Word Processor with a Crowd Inside. In Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology, UIST ’10, New York, NY, USA, pp. 313–322 (english). External Links: Document, ISBN 978-1-4503-0271-5 Cited by: Introduction.
  • R. A. Bradley and M. E. Terry (1952) Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons. Biometrika 39 (3/4), pp. 324–345 (english). External Links: Document, ISSN 0006-3444 Cited by: Pairwise Aggregation.
  • X. Chen, P. N. Bennett, K. Collins-Thompson, and E. Horvitz (2013) Pairwise Ranking Aggregation in a Crowdsourced Setting. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, WSDM ’13, Rome, Italy, pp. 193–202 (english). External Links: Document, ISBN 9781450318693 Cited by: Evaluation., Table 2.
  • A. P. Dawid and A. M. Skene (1979) Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics) 28 (1), pp. 20–28 (english). External Links: Document, ISSN 0035-9254 Cited by: Categorical Aggregation.
  • J. G. Fiscus (1997) A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER). In 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, Santa Barbara, CA, USA, pp. 347–354 (english). External Links: Document Cited by: Evaluation., Sequence Aggregation.
  • D. Jung-Lin Lee, A. Das Sarma, and A. Parameswaran (2018) Quality Evaluation Methods for Crowdsourced Image Segmentation. Technical Report Stanford InfoLab, Stanford University (english). External Links: Link Cited by: Image Aggregation.
  • J. Li and F. Fukumoto (2019) A Dataset of Crowdsourced Word Sequences: Collections and Answer Aggregation for Ground Truth Creation. In Proceedings of the First Workshop on Aggregating and Analysing Crowdsourced Annotations for NLP, AnnoNLP ’19, Hong Kong, pp. 24–28 (english). External Links: Document Cited by: Evaluation., Sequence Aggregation.
  • J. Li (2020) Crowdsourced Text Sequence Aggregation Based on Hybrid Reliability and Representation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’20, Virtual Event, China, pp. 1761–1764 (english). External Links: Document, ISBN 9781450380164 Cited by: Sequence Aggregation.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: Common Objects in Context. In Computer Vision – ECCV 2014, Zurich, Switzerland, pp. 740–755 (english). External Links: Document, ISBN 978-3-319-10602-1 Cited by: Evaluation..
  • Q. Ma and A. Olshevsky (2020) Adversarial Crowdsourcing Through Robust Rank-One Matrix Completion. In Advances in Neural Information Processing Systems 33, pp. 21841–21852 (english). External Links: Link Cited by: Categorical Aggregation.
  • A. Malinin (2019) Uncertainty Estimation in Deep Learning with application to Spoken Language Assessment. Ph.D. Thesis, University of Cambridge, Cambridge, England, UK (english). External Links: Document Cited by: Crowd-Kit Design and Maintenance.
  • M. Marge, S. Banerjee, and A. I. Rudnicky (2010) Using the Amazon Mechanical Turk for transcription of spoken language. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2010, Dallas, TX, USA, pp. 5270–5273 (english). External Links: Document Cited by: Sequence Aggregation.
  • A. T. Nguyen, B. Wallace, J. J. Li, A. Nenkova, and M. Lease (2017) Aggregating and Predicting Sequence Labels from Crowd Annotations. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, pp. 299–309 (english). External Links: Document Cited by: Conclusion.
  • N. Pavlichenko, I. Stelmakh, and D. Ustalov (2021) CrowdSpeech and Vox DIY: Benchmark Dataset for Crowdsourced Audio Transcription. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, (english). Note: arXiv:2107.01091 [cs.SD] External Links: Link Cited by: Evaluation..
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and É. Duchesnay (2011) Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (85), pp. 2825–2830 (english). External Links: ISSN 1532-4435, Link Cited by: Crowd-Kit Design and Maintenance.
  • E. G. Rodrigo, J. A. Aledo, and J. A. Gámez (2019) spark-crowd: A Spark Package for Learning from Crowdsourced Big Data. Journal of Machine Learning Research 20, pp. 1–5 (english). External Links: Link, ISSN 1532-4435 Cited by: Introduction, Crowd-Kit Design and Maintenance, Evaluation..
  • R. Rothe, R. Timofte, and L. Van Gool (2018) Deep Expectation of Real and Apparent Age from a Single Image Without Facial Landmarks. International Journal of Computer Vision 126 (2), pp. 144–157 (english). External Links: Document, ISSN 1573-1405 Cited by: Evaluation..
  • A. Sheshadri and M. Lease (2013) SQUARE: A Benchmark for Research on Computing Crowd Consensus. In First AAAI Conference on Human Computation and Crowdsourcing, HCOMP 2013, pp. 156–164 (english). External Links: Link Cited by: Introduction.
  • J. Whitehill, T. Wu, J. Bergsma, J. R. Movellan, and P. L. Ruvolo (2009) Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise. In Advances in Neural Information Processing Systems 22, NIPS 2009, pp. 2035–2043 (english). External Links: ISBN 978-1-61567-911-9, Link Cited by: Categorical Aggregation.
  • J. Zhang, V. S. Sheng, B. A. Nicholson, and X. Wu (2015) CEKA: A Tool for Mining the Wisdom of Crowds. Journal of Machine Learning Research 16 (88), pp. 2853–2858 (english). External Links: Link, ISSN 1532-4435 Cited by: Introduction.
  • Y. Zheng, G. Li, Y. Li, C. Shan, and R. Cheng (2017) Truth Inference in Crowdsourcing: Is the Problem Solved?. Proceedings of the VLDB Endowment 10 (5), pp. 541–552 (english). External Links: Document, ISSN 2150-8097 Cited by: Introduction, Evaluation..