Rethinking Crowd Sourcing for Semantic Similarity

by   Shaul Solomon, et al.

Estimation of semantic similarity is crucial for a variety of natural language processing (NLP) tasks. In the absence of a general theory of semantic information, many papers rely on human annotators as the source of ground truth for semantic similarity estimation. This paper investigates the ambiguities inherent in crowd-sourced semantic labeling. It shows that annotators that treat semantic similarity as a binary category (two sentences are either similar or not similar and there is no middle ground) play the most important role in the labeling. The paper offers heuristics to filter out unreliable annotators and stimulates further discussions on human perception of semantic similarity.




1 Introduction

Human-labeled datasets are routinely used as golden datasets for benchmarking NLP algorithms. For some NLP tasks like Part-of-Speech Tagging or Named Entity Recognition the labeling criteria are formulated rigorously, for others rigorous formulation is lacking. A lot of baselines in modern NLP rely on the idea that certain aspects of natural language are understood intuitively by human annotators. This implicit assumption is often the only argument for some form of ontological consistency of obtained evaluations on a given dataset - i.e. that semantics are definitive and unambiguous. This paper demonstrates that this assumption does not hold for semantic similarity measures. It also finds that using domain-specific features of the labeling process, one could detect unreliable annotators and thus significantly affect the results of the labeling. The contributions of this paper are as follows:

  • it highlights that intuitive understanding of semantic similarity varies across language speakers. In the absence of a universal unsupervised semantic similarity measure, these differences lead to the implicit noise in the research outcomes;

  • using human assessment on thirty-five thousand labeled pairs of sentences the paper explores various inconsistencies present in the labeling;

  • it proposes five possible heuristics to filter unreliable annotators and evaluates the impact of such filtering on various unsupervised semantic similarity measures.

2 Related Work

Large-scale annotated datasets (Treebank, Imagenet, and many others) have been shown to dramatically increase success in many sub-fields of Machine Learning. However, they are extremely expensive and take a long time to develop. One well-established method for obtaining large-scale labeled datasets is crowd-sourcing. In recent years we have seen the rise of various crowd-sourcing services, which provide non-expert annotated labels. By outsourcing the labeling process to external services it is possible to scale horizontally, but researchers face significant challenges both in ensuring the quality of the labels and in validating that the data is labeled according to the criteria of the task. When dealing with non-expert annotators, the issue of quality assurance arises. Most requesters rely on redundancy (Majority Vote) or use some form of Golden Dataset to filter out unreliable annotators. Beyond the classical methods, many statistical methods have been proposed to address the issue.
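The redundancy-based approach mentioned above can be illustrated with a minimal sketch of majority voting. The pair identifiers, the scores, and the choice to return no label on a tie are assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label, or None when the top count is tied."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no majority, e.g. three annotators gave 1, 3 and 5
    return counts[0][0]

# Hypothetical pair ids mapped to the scores given by three annotators.
pair_labels = {"pair_a": [4, 4, 5], "pair_b": [1, 3, 5], "pair_c": [2, 2, 2]}
aggregated = {pair: majority_vote(scores) for pair, scores in pair_labels.items()}
```

With three annotators and a five-point scale, complete disagreement is possible, which is one reason redundancy alone may not be enough.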

Dawid and Skene (1979) initially proposed an Expectation-Maximization algorithm to predict the error rate for each annotator. Many other probabilistic models have been proposed (Whitehill et al., 2009; Raykar et al., 2010) to approximate both annotator error and bias (Ipeirotis et al., 2010), and the difficulty of particular labels and models (Sheng et al., 2008). While most of these models have been intended to be generalizable, this paper makes the argument that progress in Natural Language Understanding (NLU) requires attention to the domain-specific attributes of semantic data.
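To make the Expectation-Maximization idea concrete, here is a simplified sketch in the spirit of Dawid and Skene (1979). It is not the procedure used in this paper. The data layout, the additive smoothing constant, the fixed iteration count, and the toy votes are all assumptions made for illustration.

```python
def dawid_skene(votes, n_classes, n_iter=20, smooth=0.01):
    """votes: {item: {annotator: label}} with labels in range(n_classes).

    Returns (posteriors, confusion): posteriors[item][k] estimates the
    probability that the true class of the item is k; confusion[a][k][l]
    estimates the probability that annotator a answers l when the truth is k.
    """
    annotators = sorted({a for item_votes in votes.values() for a in item_votes})
    # Initialise the posteriors with per-item vote shares (soft majority vote).
    post = {}
    for item, item_votes in votes.items():
        shares = [0.0] * n_classes
        for label in item_votes.values():
            shares[label] += 1.0 / len(item_votes)
        post[item] = shares

    confusion = {}
    for _ in range(n_iter):
        # M-step: class priors and smoothed per-annotator confusion matrices.
        prior = [sum(p[k] for p in post.values()) / len(post) for k in range(n_classes)]
        confusion = {a: [[smooth] * n_classes for _ in range(n_classes)] for a in annotators}
        for item, item_votes in votes.items():
            for a, label in item_votes.items():
                for k in range(n_classes):
                    confusion[a][k][label] += post[item][k]
        for a in annotators:
            for k in range(n_classes):
                row_total = sum(confusion[a][k])
                confusion[a][k] = [c / row_total for c in confusion[a][k]]
        # E-step: posterior over the true class of every item.
        for item, item_votes in votes.items():
            weights = list(prior)
            for a, label in item_votes.items():
                for k in range(n_classes):
                    weights[k] *= confusion[a][k][label]
            total = sum(weights)
            post[item] = [w / total for w in weights]
    return post, confusion

# Toy binary task: annotator "c" always contradicts "a" and "b".
example_votes = {
    "p1": {"a": 1, "b": 1, "c": 0},
    "p2": {"a": 0, "b": 0, "c": 1},
    "p3": {"a": 1, "b": 1, "c": 0},
    "p4": {"a": 0, "b": 0, "c": 1},
}
posteriors, confusion = dawid_skene(example_votes, n_classes=2)
```

In this toy run the estimated confusion matrix of annotator "c" puts most of its mass off the diagonal, which is how such models expose a systematically contrary labeler.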

3 Domain-specific annotator attributes for Natural Language

In any form of communication or usage of language, there are always two necessary elements: the form and the meaning. When the meaning is tightly bound to the form, one can take the form context-free and be able to extract the meaning directly. However, in natural language communication, it would be impossible to parse the intended meaning without some external knowledge base. The so-called symbol grounding problem (Harnad, 1999) states that one cannot derive the meaning of a sentence from the syntax alone. Meaning is derived from many sources: the context, the tone of voice, the relationship between interlocutors, etc.

While the NLP community has made great strides developing a better ability to understand the syntactical distribution of a language, we have yet to make any clear headway in modeling meaning (Bender and Koller, 2020). Although annotators may internally feel that they have an intuitive sense of semantic preservation, there does not seem to be a consistent agreement between people (and even for the same person in varying circumstances).

There are several basic challenges that cause such inconsistencies. First of all, the overlap between the form and semantics is very fuzzy (Tikhonov and Yamshchikov, 2018; Tikhonov et al., 2019). For example, given a pair of sentences in which the only distinction is the sentiment (e.g. "I love pizza." vs. "I hate pizza."), human annotators agree that semantic similarity is low, while various NLP researchers treat sentiment as a style attribute and evaluate these two sentences as semantically similar. Second, there are many possible axes upon which to calculate semantic similarity (communicative intent (Westera and Boleda, 2019), topic identification (Peinelt et al., 2020), emotion recognition (Franzoni et al., 2017)); it is not clear how these axes rank when we are after a general measure of semantic similarity. Finally, personal characteristics of the annotator, such as implicit understanding of the context or varying background experience, systematically affect the judgment of the annotators.

4 The Data

To see differences in the semantic tendencies of human annotation, we used several standard paraphrase and style transfer datasets alongside a random selection of sentence pairs from each dataset. Similar to Yamshchikov et al. (2021), the random pairs of sentences are used as the baseline of sentences that have no semantic overlap whatsoever. The paraphrase datasets include different versions of English Bibles (Carlson et al., 2017), the English Paralex dataset, and the English Paraphrase dataset. The style transfer datasets are the formality dataset introduced in Rao and Tetreault (2018), referred to further as GYAFC, and Yelp! Reviews enhanced with human-written reviews with opposite sentiment provided by Tian et al. (2018).

Every pair of sentences was labeled by three independent annotators with a score from 1 to 5 (1 being dissimilar and 5 being identical). To facilitate further research of human-labeling inconsistencies for the tasks of semantic similarity, we make all collected information on the labeling process available.

There are three major sources of noise affecting this labeling procedure. The first source of noise is unreliable annotators. These are people who do not give thoughtful responses and fill in their answers randomly. Such annotators are present in all crowdsourcing tasks, and there are many methods to filter them out, for example, (Oleson et al., 2011; Lofi, 2013). The second source of noise is the 1-5 labeling scheme itself. On the one hand, the scale from one to five presents the possibility for an annotator to mark a pair with 3, implying that two sentences are neither similar nor dissimilar. On the other hand, one could argue that by definition, the lack of similarity inherently equates to dissimilarity. The final source of noise could be certain personal qualities of the annotators. For example, certain users could be more radical in their judgment and prefer to give extreme ratings (1,5) while others might be more moderate and give more centrist ratings (2,4), see (Panda et al., 2020).

5 Experiments

We conducted experiments to estimate the impact of these sources of noise on evaluations of unsupervised semantic measures. Initially, we labored under the assumption of a minimal consistency requirement for a measure of semantic similarity, i.e. that random sentences be ranked less similar than non-random sentences on average. However, when trying to validate that assumption by analyzing the similarity score distributions relative to the labels for different measures, we discovered numerous examples of low-quality labels. As a result, we strove to formalize the patterns of noise into clear heuristics that can be applied to any dataset using the metadata available on a publicly available crowdsourcing platform.
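The minimal consistency requirement described above can be checked with a few lines of code. This is a sketch under stated assumptions: the input layout (a list of score and random-flag tuples) is hypothetical, and a higher score is taken to mean more similar, so distance-like measures such as WMD or L2 would need to be negated first.

```python
from statistics import mean

def passes_minimal_consistency(scored_pairs):
    """scored_pairs: list of (similarity_score, is_random) tuples.

    True when random pairs score lower on average than non-random pairs.
    """
    random_scores = [score for score, is_random in scored_pairs if is_random]
    paired_scores = [score for score, is_random in scored_pairs if not is_random]
    return mean(random_scores) < mean(paired_scores)

# Toy scores: two paraphrase pairs followed by two random pairs.
toy_scores = [(0.9, False), (0.7, False), (0.2, True), (0.4, True)]
consistent = passes_minimal_consistency(toy_scores)
```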

It can be argued that the heuristics proposed below are generalizable to any form of human judgement, due to the inherent ambiguity within written language discussed above. This is a legitimate claim, and as such gaining a clearer picture of the biases and noise in the data becomes even more crucial for NLP tasks that require any form of human quantitative estimates. The heuristics below are by no means an exhaustive list, but should rather be viewed as a sample of the myriad of factors that need to be addressed as we strive towards a more comprehensive formulation of semantic similarity.

5.1 Filtering Heuristics for Unreliable Annotators

We experimented with five different heuristics:

  1. Slow Annotators: those whose mean labeling time is much greater than the average labeling time. (We denoted an annotator whose mean labeling time was greater than 300 seconds as a slow annotator. This places them in the 98th percentile in terms of average labeling duration in our dataset.)

  2. Low Variance: annotators for whom the variance of all the labels they gave is lower than 1. (This means that the vast majority of the pairs are annotated with the same label by this annotator.)

  3. High Random: remove labelers whose mean semantic similarity score of all random pairs is higher than their mean semantic similarity score for non-random pairs. Among reliable annotators, the random pairs have to score lower than the ones that are semantically similar.

  4. Disagreeable Annotators: using reduced labeling (scores below 3 collapse into -1, 3 becomes 0, and anything above 3 collapses into 1) we filter any annotator who disagrees with a unanimous decision from the other two annotators in more than half of the cases.

  5. Sentimentally Dis-aligned Annotators: as discussed earlier, the relationship between sentiment and semantics is ambiguous, so we wanted to filter out annotators who used the sentiment to determine semantics in an inconsistent way. (Taking pairs which have a very high word overlap (BLEU score over 0.8), indicating nearly identical syntactical content, but with sentiment score differences above 1.9 (the sentiment score from huggingface's sentiment-analysis pipeline is bound by [-1,1]), we filter out annotators whose labeling variance on those pairs was greater than 1.)

If an annotator falls into one of these categories, we consider this annotator to be unreliable. To make our experiments clear and reproducible we publish the source code, with specification of all dependencies, including external libraries.
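The first four heuristics can be sketched as follows. This is not the published source code. The record layout, the function names, the use of population variance, and the reading of "half of the cases" as half of all pairs an annotator labeled are assumptions. The fifth heuristic is omitted because it depends on external BLEU and sentiment scores.

```python
from collections import defaultdict
from statistics import mean, pvariance

def reduce_label(score):
    """Collapse a 1-5 score into -1 (dissimilar), 0 (neutral) or 1 (similar)."""
    return (score > 3) - (score < 3)

def unreliable_annotators(records, slow_seconds=300, min_variance=1.0):
    """records: list of dicts with keys annotator, pair, score, seconds, is_random.

    Returns {annotator: set of names of the heuristics that flagged them}.
    """
    by_annotator = defaultdict(list)
    by_pair = defaultdict(dict)
    for r in records:
        by_annotator[r["annotator"]].append(r)
        by_pair[r["pair"]][r["annotator"]] = reduce_label(r["score"])

    # Count how often each annotator opposes two peers who agree with each other.
    disagreements = defaultdict(int)
    for reduced in by_pair.values():
        for annotator, own in reduced.items():
            others = [v for a, v in reduced.items() if a != annotator]
            if len(others) == 2 and others[0] == others[1] != own:
                disagreements[annotator] += 1

    flags = defaultdict(set)
    for annotator, rows in by_annotator.items():
        scores = [r["score"] for r in rows]
        random_scores = [r["score"] for r in rows if r["is_random"]]
        other_scores = [r["score"] for r in rows if not r["is_random"]]
        if mean(r["seconds"] for r in rows) > slow_seconds:
            flags[annotator].add("slow")
        if pvariance(scores) < min_variance:
            flags[annotator].add("low_variance")
        if random_scores and other_scores and mean(random_scores) > mean(other_scores):
            flags[annotator].add("high_random")
        if disagreements[annotator] > len(rows) / 2:
            flags[annotator].add("disagreeable")
    return dict(flags)

# Toy data: seconds per label, then scores for pairs p1, p2 and random pairs p3, p4.
raw = {
    "ann_1": (20, [5, 4, 1, 1]),
    "ann_2": (30, [5, 5, 2, 1]),
    "ann_3": (400, [3, 3, 3, 4]),
}
records = [
    {"annotator": a, "pair": f"p{i + 1}", "score": s, "seconds": secs, "is_random": i >= 2}
    for a, (secs, scores) in raw.items()
    for i, s in enumerate(scores)
]
flagged = unreliable_annotators(records)
```

In the toy data only the hypothetical annotator ann_3 is flagged, and by all four heuristics at once; on real data the heuristics would typically overlap only partially.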

5.2 Correlation with Automated Semantic Similarity Metrics

To estimate how the labeling noise can interfere with the NLP benchmarks that use semantic similarity measurements, we took ten of the most used metrics for content preservation and semantic similarity. Word overlap is calculated as the percentage of words that occur in both texts. chrF (Popović, 2015) is a character n-gram F-score that measures the number of n-grams that coincide in input and output. Cosine similarity is calculated in line with Fu et al. (2018) either with pre-trained GloVe (Pennington et al., 2014) or FastText word embeddings (Joulin et al., 2016). POS-distance looks for nouns in the input and output and is calculated as a pairwise distance between the embeddings of the found nouns. L2 distance is calculated between the ELMo (Peters et al., 2018) embeddings of two sentences. WMD (Kusner et al., 2015) defines the distance between two documents as an optimal transport problem between the embedded words. BLEU (Papineni et al., 2002) is one of the most common semantic similarity measures. ROUGE (Lin and Hovy, 2000) compares any text to any other (typically human-generated) summary using a recall-oriented approach based on unigrams, bigrams, and (Lin and Och, 2004) the longest co-occurring n-grams in sequence. The Meteor metric (Banerjee and Lavie, 2005) is based on a harmonic mean of unigram precision and recall, with recall weighted higher than precision, and some additional features, such as stemming and synonymy matching. Finally, BERT score, proposed in Zhang et al. (2019), is a BERT-based estimator of semantic similarity between two pieces of text.
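Two of the simplest measures above, word overlap and embedding cosine similarity, can be sketched in a few lines. Several details are assumptions: word overlap is read here as the Jaccard share of distinct lowercase tokens, sentence vectors are plain averages of word vectors (the pooling in Fu et al. (2018) may differ), and the two-dimensional embedding table is a made-up toy, not GloVe or FastText.

```python
import math

def word_overlap(text_a, text_b):
    """Share of distinct lowercase tokens that occur in both texts (Jaccard)."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def sentence_vector(text, embeddings):
    """Average the vectors of the tokens found in the embedding table."""
    vectors = [embeddings[t] for t in text.lower().split() if t in embeddings]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

# Hypothetical toy embeddings, for illustration only.
toy_embeddings = {"i": [1, 0], "love": [0, 1], "hate": [0, -1], "pizza": [1, 1]}

overlap = word_overlap("I love pizza", "I hate pizza")
similarity = cosine_similarity(
    sentence_vector("I love pizza", toy_embeddings),
    sentence_vector("I hate pizza", toy_embeddings),
)
```

Both toy scores stay fairly high for the "love" versus "hate" pair, which mirrors the tension between surface form and sentiment discussed in Section 3.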

Table 1 shows how the automated semantic similarity metrics correlate with human labels and how they correlate after we filter out unreliable annotators defined according to the heuristics above. (See the Appendix for the experiments with all five heuristics and the relative changes in correlation between human labeling and unsupervised semantic similarity metrics depending on the filtering procedure.) It clearly demonstrates that relatively straightforward filtering of human labels could raise the correlation with an automated metric by up to nine percent in relative terms. This in itself is disturbing, since changes of this size are often regarded as an improvement in some NLP tasks.
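The comparison reported in Table 1 can be sketched as follows. The paper does not state which correlation coefficient is used or how labels are aggregated after filtering, so Pearson correlation, per-pair averaging of the remaining labels, and the toy data are all assumptions.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def correlation_with_humans(metric_scores, labels, excluded=frozenset()):
    """metric_scores: {pair: score}; labels: {pair: {annotator: 1-5 score}}."""
    xs, ys = [], []
    for pair, by_annotator in labels.items():
        kept = [s for a, s in by_annotator.items() if a not in excluded]
        if kept:  # skip pairs whose annotators were all filtered out
            xs.append(metric_scores[pair])
            ys.append(mean(kept))
    return pearson(xs, ys)

def relative_increase(baseline, filtered):
    """Relative change of the correlation, in percent."""
    return 100.0 * (filtered - baseline) / baseline

# Toy data: annotator "b" labels noisily, annotator "a" tracks the metric.
toy_metric = {"p1": 0.9, "p2": 0.5, "p3": 0.1}
toy_labels = {
    "p1": {"a": 5, "b": 2},
    "p2": {"a": 3, "b": 5},
    "p3": {"a": 1, "b": 3},
}
baseline_corr = correlation_with_humans(toy_metric, toy_labels)
filtered_corr = correlation_with_humans(toy_metric, toy_labels, excluded={"b"})
gain = relative_increase(baseline_corr, filtered_corr)
```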

Metric            Baseline   Filtered heuristics   Percentage increase
ROUGE-1           0.61       0.65                  7.6 %
bleu1             0.60       0.65                  7.9 %
ROUGE-l           0.60       0.65                  8.0 %
BertScore         0.59       0.64                  8.5 %
1-gram_overlap    0.59       0.64                  8.0 %
chrfScore         0.58       0.63                  7.3 %
L2_score          0.56       0.6                   6.6 %
ROUGE-2           0.53       0.58                  8.4 %
fasttext_cosine   0.51       0.52                  2.2 %
WMD               0.5        0.52                  4.4 %
glove_cosine      0.45       0.48                  4.6 %
bleu              0.41       0.45                  9.0 %
POS Dist score    0.35       0.38                  7.2 %

Table 1: The correlation between automated semantic similarity metrics and the human labels over all datasets, and the same correlation when unreliable annotators are filtered out. The correlations of the automated metrics improve by 2 to 9 percent in relative terms, depending on the metric.

Metric            Baseline Radicals   Radicals after filter   Baseline Centrists   Centrists after filter
ROUGE-1           0.70                0.70                    0.28                 0.45
bleu1             0.69                0.69                    0.28                 0.46
ROUGE-l           0.69                0.69                    0.28                 0.46
BertScore         0.69                0.69                    0.28                 0.46
1-gram_overlap    0.68                0.68                    0.28                 0.45
chrfScore         0.67                0.67                    0.28                 0.44
L2_score          0.64                0.65                    0.25                 0.39
ROUGE-2           0.61                0.61                    0.26                 0.41
fasttext_cosine   0.57                0.57                    0.21                 0.32
WMD               0.57                0.56                    0.21                 0.33
glove_cosine      0.51                0.51                    0.19                 0.23
bleu              0.47                0.48                    0.21                 0.32
POS Dist score    0.41                0.40                    0.16                 0.28

Table 2: The Baseline correlation without filtering and the improvement after filtering unreliable annotators for Radical and Centrist Annotators independently.

In all our experiments we see that the combination of the Low Variance and High Random filtering heuristics has the strongest impact on the correlation with automated evaluation methods. Under certain circumstances, heuristics based on Disagreeable Annotators and Sentimentally Dis-aligned Annotators also increase the correlation. On the other hand, filtering out Slow Annotators only hurts the performance.

Table 2 shows a more nuanced picture of correlations between the automated metrics and the labels of different annotators. We denote those annotators who selected {1,5} over 50% of the time as Radical and those who selected {2,4} over 50% of the time as Centrist; the label 3 was ignored in calculations of these criteria, and only annotators with variance above 1 were included. Comparing the results in Table 1 and Table 2, one can see that radical annotators play a major part in the resulting overall correlations between human labels and automated semantic similarity metrics. Moreover, filtering unreliable annotators only affects correlations of the labels given by Centrists. This either shows that treating semantic similarity as a binary value when using crowd-sourced human labels might be beneficial for less ambiguous results or hints that current unsupervised metrics of semantic similarity have a hard time capturing nuance that some human annotators see.
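The Radical and Centrist split can be sketched as a small classifier over one annotator's scores. The function name, the use of population variance, and the handling of an exact 50/50 split (left unclassified) are assumptions; the paper only states the 50% criterion, the exclusion of label 3, and the variance threshold.

```python
from statistics import pvariance

def annotator_style(scores, threshold=0.5, min_variance=1.0):
    """Classify one annotator's 1-5 scores as 'radical', 'centrist' or None."""
    if pvariance(scores) <= min_variance:
        return None  # only annotators with variance above 1 are considered
    decided = [s for s in scores if s != 3]  # label 3 is ignored
    if not decided:
        return None
    extreme_share = sum(s in (1, 5) for s in decided) / len(decided)
    if extreme_share > threshold:
        return "radical"
    if 1 - extreme_share > threshold:
        return "centrist"
    return None  # exactly balanced between extreme and moderate labels

# Toy score lists for three hypothetical annotators.
styles = {
    "ann_1": annotator_style([1, 5, 5, 1, 2, 3]),
    "ann_2": annotator_style([2, 4, 4, 2, 2, 4, 5, 1]),
    "ann_3": annotator_style([3, 3, 3, 4]),
}
```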

6 Conclusion

This paper is an attempt to quantify the inherent ambiguity prevalent in any NLP task that relies on human judgment as a measure of semantic similarity. It demonstrates that a simple heuristic curation of human annotation could change the model performance estimated with some unsupervised semantic similarity measure by up to 9 percent in relative terms.

The series of experiments conducted in the paper provides several rules of thumb to reduce the ambiguity of human labels: (1) when labeling, treat semantic similarity as a binary feature, asking if two texts are similar or not; (2) add sentence pairs where there is no semantic similarity whatsoever and filter out unreliable annotators that make mistakes on these pairs; (3) majority vote improves the consistency of your data, but it is not as good as filtering out annotators with low label variance and annotators that systematically make mistakes with obviously dissimilar sentence pairs.


  • Banerjee and Lavie (2005) Satanjeev Banerjee and Alon Lavie. 2005. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72.
  • Bender and Koller (2020) Emily M. Bender and Alexander Koller. 2020. Climbing towards nlu: On meaning, form, and understanding in the age of data. In ACL.
  • Carlson et al. (2017) Keith Carlson, Allen Riddell, and Daniel Rockmore. 2017. Zero-shot style transfer in text using recurrent neural networks. arXiv preprint arXiv:1711.04731.
  • Dawid and Skene (1979) A. P. Dawid and A. Skene. 1979. Maximum likelihood estimation of observer error‐rates using the em algorithm. Journal of The Royal Statistical Society Series C-applied Statistics, 28:20–28.
  • Franzoni et al. (2017) Valentina Franzoni, Alfredo Milani, and Giulio Biondi. 2017. Semo: a semantic model for emotion recognition in web objects. In Proceedings of the International Conference on Web Intelligence, pages 953–958.
  • Fu et al. (2018) Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, and Rui Yan. 2018. Style transfer in text: Exploration and evaluation. AAAI.
  • Harnad (1999) Stevan Harnad. 1999. The symbol grounding problem. CoRR, cs.AI/9906002.
  • Ipeirotis et al. (2010) Panagiotis G. Ipeirotis, F. Provost, and J. Wang. 2010. Quality management on amazon mechanical turk. In HCOMP ’10.
  • Joulin et al. (2016) Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2016. Fasttext.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.
  • Kusner et al. (2015) Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. 2015. From word embeddings to document distances. In International conference on machine learning, pages 957–966.
  • Lin and Hovy (2000) Chin-Yew Lin and Eduard Hovy. 2000. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th conference on Computational linguistics-Volume 1, pages 495–501. Association for Computational Linguistics.
  • Lin and Och (2004) Chin-Yew Lin and Franz Josef Och. 2004. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, page 605. Association for Computational Linguistics.
  • Lofi (2013) C. Lofi. 2013. Just ask a human? - controlling quality in relational similarity and analogy processing using the crowd. In BTW Workshops.
  • Oleson et al. (2011) David Oleson, Alexander Sorokin, Greg Laughlin, Vaughn Hester, John Le, and Lukas Biewald. 2011. Programmatic gold: Targeted and scalable quality assurance in crowdsourcing. In Proceedings of the 11th AAAI Conference on Human Computation, AAAIWS’11-11, page 43–48. AAAI Press.
  • Panda et al. (2020) S. K. Panda, S. Bhoi, and M. Singh. 2020. A collaborative filtering recommendation algorithm based on normalization approach. Journal of Ambient Intelligence and Humanized Computing, pages 1–23.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics.
  • Peinelt et al. (2020) Nicole Peinelt, Dong Nguyen, and Maria Liakata. 2020. tbert: Topic models and bert joining forces for semantic similarity detection. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7047–7055.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.
  • Peters et al. (2018) Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
  • Popović (2015) Maja Popović. 2015. chrf: character n-gram f-score for automatic mt evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395.
  • Rao and Tetreault (2018) Sudha Rao and Joel Tetreault. 2018. Dear sir or madam, may i introduce the gyafc dataset: Corpus, benchmarks and metrics for formality style transfer. arXiv preprint arXiv:1803.06535.
  • Raykar et al. (2010) Vikas C. Raykar, S. Yu, L. Zhao, G. Hermosillo, Charles Florin, L. Bogoni, and L. Moy. 2010. Learning from crowds. J. Mach. Learn. Res., 11:1297–1322.
  • Sheng et al. (2008) V. Sheng, F. Provost, and Panagiotis G. Ipeirotis. 2008. Get another label? improving data quality and data mining using multiple, noisy labelers. Econometrics: Data Collection & Data Estimation Methodology eJournal.
  • Tian et al. (2018) Youzhi Tian, Zhiting Hu, and Zhou Yu. 2018. Structured content preservation for unsupervised text style transfer. In arXiv preprint.
  • Tikhonov et al. (2019) Alexey Tikhonov, Viacheslav Shibaev, Aleksander Nagaev, Aigul Nugmanova, and Ivan P Yamshchikov. 2019. Style transfer for texts: Retrain, report errors, compare with rewrites. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3927–3936.
  • Tikhonov and Yamshchikov (2018) Alexey Tikhonov and Ivan P Yamshchikov. 2018. What is wrong with style transfer for texts? arXiv preprint arXiv:1808.04365.
  • Westera and Boleda (2019) Matthijs Westera and Gemma Boleda. 2019. Don’t blame distributional semantics if it can’t do entailment. In Proceedings of the 13th International Conference on Computational Semantics-Long Papers, pages 120–133.
  • Whitehill et al. (2009) Jacob Whitehill, P. Ruvolo, Tingfan Wu, J. Bergsma, and J. Movellan. 2009. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In NIPS.
  • Yamshchikov et al. (2021) Ivan P Yamshchikov, Viacheslav Shibaev, Nikolay Khlebnikov, and Alexey Tikhonov. 2021. Style-transfer and paraphrase: Looking for a sensible semantic similarity metric. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 14213–14220.
  • Zhang et al. (2019) Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675.