
'Tis but Thy Name: Semantic Question Answering Evaluation with 11M Names for 1M Entities

by Albert Huang, et al.

Classic lexical-matching-based QA metrics are slowly being phased out because they punish succinct or informative outputs just because those answers were not provided as ground truth. Recently proposed neural metrics can evaluate semantic similarity but were trained on small textual similarity datasets grafted from foreign domains. We introduce the Wiki Entity Similarity (WES) dataset, an 11M-example, domain-targeted, semantic entity similarity dataset that is generated from link texts in Wikipedia. WES is tailored to QA evaluation: the examples are entities and phrases, grouped into semantic clusters to simulate multiple ground-truth labels. Human annotators consistently agree with WES labels, and a basic cross-encoder metric is better than four classic metrics at predicting human judgments of correctness.



1 Introduction

Information retrieval tools have already revolutionized how we interact with knowledge: what would have taken a half-hour library trip can be learned in minutes through the internet. Question answering (QA) models can take this one step further because they can respond to queries with summative insights instead of just listing relevant documents. Because they dictate the optimization goal of QA research, QA evaluation metrics are key to directing the future of QA model development.

Classic question-answering evaluation metrics only consider the token overlap between a model’s output and human-annotated ground-truth answers. However, these evaluation metrics fail to consider the plurality of possible correct answers for every question. For instance, a “when” question can be answered using either a year or an event name (e.g., “During Super Bowl XXXI”). Furthermore, correct answers can have wildly different lexical signatures if they contain different words, phrasing, or levels of detail. Existing QA metrics also cannot evaluate against multiple ground-truth annotations simultaneously, leading to underutilization of dataset resources. Therefore, metrics that rely on the tokens in QA model outputs unfairly punish non-conforming models, including those that provide creatively succinct or informative answers. These metrics force models to imitate the threadbare, minimal ground-truth answers that crowd workers tend to write, exacerbating the problem. Current QA metrics are limiting the usefulness of QA models. With the rise of generative QA, in which models have more freedom to use different words or phrasing than in extractive QA, the need for semantic answer evaluation has only grown.

Previous semantic techniques used latent word or phrase embeddings to calculate the similarity between answers. More recently, end-to-end neural text similarity models have been applied to QA evaluation. However, they are trained on small datasets grafted from other tasks. In particular, QA models often answer questions about specific entities or phrases; there are currently no entity or short phrase semantic similarity datasets.

Figure 1: Graphical comparison of the most similar existing QA metrics.
Figure 2: The WES generation pipeline. (article title, link text) pairs are extracted from the wikitext, then negative examples are generated by pairing article titles with link texts from other articles. Groups of link texts that are associated with the same article are all synonyms. These "synonym clusters" can be used to train evaluation metrics that consider multiple ground-truth answers simultaneously.

In this paper, we introduce the Wiki Entity Similarity dataset: a large, task-specific semantic similarity dataset to train question-answering metrics. WES has over 11 million high-quality examples of synonymous and non-synonymous entities and phrases. In addition, WES is systematically mined from a democratic data corpus (Wikipedia), reducing bias from individual authors or annotators. Furthermore, WES generates synonym clusters of multiple phrases, which can be broken down pairwise into the classic semantic similarity formulation or grouped to simulate multiple ground-truth answers. Despite being auto-generated, WES labels are consistently confirmed by human annotators. We show that a basic cross-encoder trained on WES is better at predicting human judgments of correctness than classic metrics.

For the remainder of this paper, “outputs” are produced by QA models and “answers” are the ground-truth annotations that the model is trained on. In section 2, we take a closer look at current QA evaluation metrics and analyze their weaknesses. In section 3, we describe the generation of the novel WES dataset. In section 4, we characterize WES and present a baseline model trained on it. Finally, we conclude in section 5 with directions for future work.

2 Related Work

Although there are numerous desirable qualities for question answering models, we focus on evaluating the factual or conversational correctness of answers. Other metrics exist for testing information content, answer coverage, and other desirable qualities; however, this work is orthogonal to such metrics and should be used in conjunction with them. This section details existing metrics for evaluating the correctness of a model-generated output based on human-generated ground-truth answers.

Exact Match (EM) is a standard binary metric for evaluating extractive QA. Before comparison, outputs and answers are often normalized by removing punctuation and stop words and converting to lowercase, as in (Rajpurkar et al., 2016). This metric relies on exact lexical matching: outputs that contain synonyms or informational prepositional phrases are considered incorrect by this metric.
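The normalize-then-compare procedure can be sketched as follows. This is a minimal illustration, not the official evaluation script of any dataset: the function names are ours, and the set of English articles is an assumed stand-in for whatever stop-word list a given benchmark uses.

```python
import string

# Assumed stand-in for a stop-word list; real benchmarks may use a different set.
ARTICLES = {"a", "an", "the"}

def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, drop articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(tok for tok in text.split() if tok not in ARTICLES)

def exact_match(output: str, answer: str) -> int:
    """Return 1 if the normalized output equals the normalized answer, else 0."""
    return int(normalize(output) == normalize(answer))

# A correct but more detailed output is scored as wrong.
print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(exact_match("Eiffel Tower in Paris", "Eiffel Tower"))
```

The second call shows the weakness discussed above: one added prepositional phrase flips the score from 1 to 0.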

F1 is a standard floating-point metric that measures the token overlap between the output and answer. It punishes longer, more informative outputs because such answers are less likely to exactly match the annotated ground truth. Finally, although F1 is more forgiving than EM, it fails to differentiate lexical and semantic dissimilarities. As a result, there is no threshold at which the F1 score correlates well with correctness.
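A minimal sketch of token-level F1 makes the length penalty concrete. It assumes whitespace tokenization and lowercasing only; the function name `token_f1` is illustrative and not taken from any particular evaluation script.

```python
from collections import Counter

def token_f1(output: str, answer: str) -> float:
    """Harmonic mean of token precision and recall between output and answer."""
    out_tokens = output.lower().split()
    ans_tokens = answer.lower().split()
    if not out_tokens or not ans_tokens:
        # Two empty strings match; one empty string never matches.
        return float(out_tokens == ans_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears on both sides.
    overlap = sum((Counter(out_tokens) & Counter(ans_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(out_tokens)
    recall = overlap / len(ans_tokens)
    return 2 * precision * recall / (precision + recall)

answer = "super bowl xxxi"
print(token_f1("super bowl xxxi", answer))                 # identical strings
print(token_f1("during super bowl xxxi in 1997", answer))  # more detail, lower score
print(token_f1("1997", answer))                            # no shared tokens
```

In the second call the output contains the whole answer plus extra context, yet precision drops to 3/6 and the score falls to 2/3. The third output shares no tokens with the answer and scores 0 regardless of whether it is correct.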

EM and F1 are the most prevalent question answering metrics today, serving as the leaderboard metrics for many of the most popular datasets (Rajpurkar et al., 2016, Kwiatkowski et al., 2019, Yang et al., 2018, Choi et al., 2018). However, numerous works have found that lexical metrics can diverge from human judgements of model performance (Min et al., 2021, Chen et al., 2019, Risch et al., 2021).

BLEU, ROUGE-N, ROUGE-L, METEOR (Papineni et al., 2002, Lin and Hovy, 2003, Lin, 2004, Banerjee and Lavie, 2005) are common Natural Language Generation metrics based on measuring the n-gram overlap between output and answer. For instance, ROUGE-L measures the longest common subsequence and METEOR weights tokens by importance before applying an F1 analog. Like F1, these metrics punish longer, more detailed answers and unduly punish optional but informative prepositional phrases in model outputs.
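To illustrate the longest-common-subsequence idea behind ROUGE-L, here is a simplified sketch. It assumes whitespace tokenization and an unweighted harmonic mean of LCS precision and recall; the ROUGE package itself exposes a recall-weighting parameter that is omitted here, and the function names are ours.

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(output: str, answer: str) -> float:
    """Unweighted F-measure over LCS precision and recall (a simplification)."""
    out_tokens = output.lower().split()
    ans_tokens = answer.lower().split()
    lcs = lcs_length(out_tokens, ans_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(out_tokens)
    recall = lcs / len(ans_tokens)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("the battle of hastings", "battle of hastings"))
```

Even though the output contains the full answer in order, the single extra token lowers precision to 3/4 and the score to 6/7.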

In summary, each of the above metrics relies on lexical matching to perform similarity checks. As a result, these metrics unfairly punish longer, more informative outputs that have unnecessary but useful prepositional phrases, and are particularly sensitive to the exact wording of the ground-truth annotation. Furthermore, the above metrics fail to model clause order and logical dependency in model outputs, which advantages illogical outputs that happen to contain words from the ground-truth answer. In recent years, newer metrics have been proposed that use contextual or semantic embeddings to compare sequences.

BERTScore (Zhang et al., 2020) compares BERT contextual embeddings instead of raw tokens, which may more accurately capture underlying meanings and long-term dependencies. However, BERTScore still unfairly punishes optional but useful prepositional phrases as it is fundamentally a token-wise comparison. Finally, Chen et al. (2019) have shown METEOR to correlate better with human judgment than BERTScore.
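The token-wise nature of this comparison can be seen in a small sketch of greedy embedding matching. It is not the BERTScore implementation: the two-dimensional vectors below are hypothetical stand-ins for contextual embeddings, importance weighting is omitted, and the function names are ours.

```python
import math

def cosine(u: list, v: list) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def greedy_match_f1(output_vecs: list, answer_vecs: list) -> float:
    """Match each token to its most similar counterpart, then combine P and R."""
    precision = sum(
        max(cosine(o, a) for a in answer_vecs) for o in output_vecs
    ) / len(output_vecs)
    recall = sum(
        max(cosine(a, o) for o in output_vecs) for a in answer_vecs
    ) / len(answer_vecs)
    if precision + recall <= 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical token embeddings: the output repeats the answer's two tokens
# and adds one extra token that only partially resembles them.
answer_vecs = [[1.0, 0.0], [0.0, 1.0]]
output_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(greedy_match_f1(answer_vecs, answer_vecs))
print(greedy_match_f1(output_vecs, answer_vecs))
```

Recall stays at 1 in the second call because every answer token is matched, but the extra output token lowers precision and therefore the overall score.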

Semantic Answer Similarity (SAS) (Risch et al., 2021) takes an end-to-end approach to evaluating generative QA. They compare bi-encoder, cross-encoder, and BERTScore approaches to QA evaluation, and find that a cross-encoder trained on answer similarity correlates most strongly with human judgment. However, SAS is trained on the Semantic Textual Similarity Benchmark (Cer et al., 2017), which has only 7k foreign-domain training examples. The creation of such semantic similarity datasets is tedious and susceptible to bias, and the small dataset size means models are prone to overfitting. We aim to improve on this cross-encoder approach by pretraining on a new, large, domain-specific, auto-generated dataset.

In summary, neural metrics are an emerging standard for more accurate question-answering evaluation. We aim to improve neural QA evaluation by training a new metric on a novel, large, QA-specific dataset that encodes entity similarity rather than general sentence similarity.

3 Wiki Entity Similarity Dataset

The WES dataset is generated by pairing link texts with target article titles for links in Wikipedia. We generate our dataset in two stages: filtered link collection and negative example generation.

3.1 Link Filtering

Following (Karpukhin et al., 2020), we mine links from the English Wikipedia XML dump from Dec. 20, 2018. The wikitext in the dump encodes links in the form ‘[[target article title|displayed link text]]’, where the link text is what is displayed on the source page. We search directly for links that match the following criteria:
  • Article title contains no hash sign (#), as hash signs denote links to subsections of Wikipedia articles.
  • Article title contains no colon, as colons denote pages under internal categories, such as User, User talk, Category, etc. (Wikipedia contributors, 2022)
  • Article title contains no parentheses, as most (90% of 500 randomly sampled) links to these articles simply remove the parenthetical instead of containing a synonym of the article title.

See Table 1 for examples of the motivation behind filtering out links with the above characters.

Rule | Sample Title | Sample Link Text
# | VAX#Name | VAX trademark
( | Styx (Band) | Styx
Table 1: Motivation for excluding links to article titles that contain certain characters.
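The character-level filter can be sketched with a regular expression over raw wikitext. This sketch assumes links appear in the standard piped wikitext form `[[title|text]]`; the pattern, constant, and function names are ours, and a production extractor would also need to handle nested templates and file links.

```python
import re

# Matches piped internal links of the form [[title|text]].
PIPED_LINK = re.compile(r"\[\[([^\[\]|]+)\|([^\[\]|]+)\]\]")

# Characters that disqualify an article title under the three rules above.
EXCLUDED_CHARS = "#:()"

def extract_links(wikitext: str) -> list:
    """Return (article title, link text) pairs whose titles pass the filter."""
    pairs = []
    for title, text in PIPED_LINK.findall(wikitext):
        title, text = title.strip(), text.strip()
        if title and text and not any(ch in title for ch in EXCLUDED_CHARS):
            pairs.append((title, text))
    return pairs

sample = (
    "The [[United States|US]] band [[Styx (band)|Styx]] and the "
    "[[VAX#Name|VAX trademark]]; see [[Category:Rock|rock]] and [[Paris]]."
)
print(extract_links(sample))
```

Only the first link survives: the others are dropped for parentheses, a hash sign, and a colon in the title, and the unpiped `[[Paris]]` link carries no separate link text.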

After character-level filtering, we use additional heuristics to improve the quality of the link corpus.

  • Deduplication: we remove duplicate (article title, link text) pairs to improve dataset balance. Link texts with the same content but different capitalization are considered different to preserve the semantic meaning of capitalization in proper nouns.
  • Incoming link threshold: we require that each link in the corpus point to an article that has at least n incoming links (citations), where n is a tunable parameter, to filter out rarely used stub articles. Higher values of n ensure higher quality articles and a larger, more representative sample of synonyms for each article title.
  • Dictionary word link text: we remove links that have a dictionary word for the link text but a named entity or phrase for the target article title, as these dictionary words are rarely fully qualified synonyms of the linked article title. Words are considered to be dictionary words if their WordNet (Fellbaum, 2000) lemmatization is contained in the words corpus of the Python NLTK library (Bird et al., 2009).

We experiment with three values of the link threshold n, obtaining link corpora of 5,787,081, 4,407,409, and 3,195,545 distinct synonym pairs, respectively.
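The three heuristics can be combined in a single pass, as in the following sketch. Several details are assumptions rather than the paper's implementation: incoming links are counted over raw pairs before deduplication, the dictionary test is a plain set lookup instead of WordNet lemmatization against the NLTK words corpus, and the names `filter_links`, `min_incoming`, and `dictionary_words` are hypothetical.

```python
from collections import Counter

def filter_links(pairs: list, min_incoming: int, dictionary_words: set) -> list:
    """Apply deduplication, an incoming-link threshold, and a dictionary filter."""
    # Assumption: every raw link occurrence counts as one incoming link.
    incoming = Counter(title for title, _ in pairs)
    seen = set()
    kept = []
    for title, text in pairs:
        # Deduplication is case-sensitive, so "US" and "us" stay distinct.
        if (title, text) in seen:
            continue
        seen.add((title, text))
        # Drop links to rarely cited articles.
        if incoming[title] < min_incoming:
            continue
        # Drop dictionary-word link texts that point to a non-dictionary title.
        if text.lower() in dictionary_words and title.lower() not in dictionary_words:
            continue
        kept.append((title, text))
    return kept

raw_pairs = [
    ("United States", "US"),
    ("United States", "US"),
    ("United States", "America"),
    ("United States", "country"),
    ("Obscure Stub", "stub thing"),
]
print(filter_links(raw_pairs, min_incoming=2, dictionary_words={"country"}))
```

Here the duplicate pair, the link to the rarely cited article, and the generic link text "country" are all removed.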

3.2 Negative Example Generation

After harvesting synonyms from the Wikipedia links, we generate negative (non-synonymous) title-text pairs by matching article titles with link texts from other articles. The negative examples for an article title with k distinct link synonyms are generated by sampling k other articles from the link corpus, then sampling one associated link text from each article to be paired with the article title (see how a negative example in the pink synonym cluster is created by copying a blue link text from a different article in Fig. 2). This results in each article title being associated with the same number of non-synonyms as distinct synonyms, preserving dataset balance. Choosing non-synonyms from different articles ensures a representative covering of the sample space and precludes duplicate negative examples. After negative example generation, the largest version of WES (with the lowest incoming link threshold) has 11M+ examples, and the highest quality version (with the highest threshold) contains 6.4M examples.
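The sampling procedure above can be sketched as follows. The function names and the fixed random seed are our assumptions, and the sketch does not check whether a sampled link text happens to coincide with a true synonym of the title, which a full pipeline may want to guard against.

```python
import random
from collections import defaultdict

def build_clusters(pairs: list) -> dict:
    """Group distinct link texts by the article title they point to."""
    clusters = defaultdict(list)
    for title, text in pairs:
        if text not in clusters[title]:
            clusters[title].append(text)
    return dict(clusters)

def generate_negatives(clusters: dict, seed: int = 0) -> list:
    """Pair each title with one link text from each of several other articles."""
    rng = random.Random(seed)
    titles = list(clusters)
    negatives = []
    for title, synonyms in clusters.items():
        others = [t for t in titles if t != title]
        # One negative per distinct synonym, capped by the number of other articles.
        count = min(len(synonyms), len(others))
        for other in rng.sample(others, count):
            negatives.append((title, rng.choice(clusters[other])))
    return negatives

clusters = build_clusters([
    ("United States", "US"),
    ("United States", "America"),
    ("Styx", "the Styx"),
    ("Paris", "City of Light"),
])
for title, text in generate_negatives(clusters):
    print(title, "!=", text)
```

Sampling distinct articles without replacement is what keeps the number of negatives equal to the number of distinct synonyms for each title.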

Future iterations of the WES dataset may be made more difficult through adversarial generation. Such methods condition negative example generation on the article title to create negative examples which are more similar to but not synonymous with the paired phrase. Possible future generation techniques include:
  • F1 ranking: negative examples generated from the most lexically similar alternatives will create textually similar but semantically different examples, increasing difficulty for token-based models.
  • Substring selection: negative examples generated from short substrings of the article title may be synonymous in context, but will not be specific enough to globally qualify a concept. Such examples force models to watch for appropriate specificity.
  • Word co-occurrence: negative examples generated from non-synonymous words with high co-occurrence will create pairs of grammatically feasible phrases, increasing difficulty for part-of-speech-based models.
  • Article co-occurrence: the titles of frequently co-occurring but semantically distinct articles can be treated as very difficult non-synonymous examples, teaching models to differentiate between parts of speech and improving sensitivity.
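As one hedged illustration of the first proposal, F1 ranking, the sketch below picks the most lexically similar link texts from other articles as hard negatives. This is a possible realization of a future direction, not something the dataset currently does; the function names, the whitespace tokenization, and the toy clusters are all assumptions.

```python
from collections import Counter

def token_f1(a: str, b: str) -> float:
    """Token-overlap F1 between two lowercased, whitespace-tokenized strings."""
    a_tokens, b_tokens = a.lower().split(), b.lower().split()
    overlap = sum((Counter(a_tokens) & Counter(b_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(a_tokens)
    recall = overlap / len(b_tokens)
    return 2 * precision * recall / (precision + recall)

def hardest_negatives(title: str, clusters: dict, k: int = 1) -> list:
    """Return the k link texts from other articles most similar to the title."""
    own_synonyms = set(clusters.get(title, []))
    candidates = [
        (token_f1(title, text), text)
        for other, texts in clusters.items()
        if other != title
        for text in texts
        if text not in own_synonyms
    ]
    # Highest lexical overlap first; the sort is stable for ties.
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in candidates[:k]]

clusters = {
    "New York City": ["NYC", "New York"],
    "New York Yankees": ["Yankees", "New York Yankees"],
    "Paris": ["City of Light"],
}
print(hardest_negatives("New York City", clusters, k=1))
```

The selected negative shares two of three tokens with the title while naming a different entity, which is the kind of example a token-overlap metric would score as a near match.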

4 Analysis

We use human evaluation to check the quality of the collected WES dataset. For each task, we randomly sample 25 positive and 25 negative training pairs from a link-thresholded version of the dataset. Two annotators rate the synonymy of the pairs following the ranking scheme used in (Risch et al., 2021). We treat annotated scores of 2 or 3 as “synonymous” and 1 as “non-synonymous,” and find that only 2% of labels are incorrect. Dataset label accuracies and inter-annotator agreement from different stages in the filtering process are listed in Table 2.

Dataset (link threshold n) | Acc | Pearson's r
Full Dataset | 98% | 0.932
Without Dictionary Filtering | 91% | 0.908
Table 2: Average accuracy and Pearson’s correlation between annotators in the full dataset and ablations.
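The two quantities in Table 2 can be computed as in the sketch below. The annotation scores here are made up purely for illustration and are not the study's data; we also assume that Pearson's correlation is taken over the raw 1 to 3 scores and that accuracy is averaged over the two annotators.

```python
import math

def pearson(x: list, y: list) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def label_accuracy(scores: list, wes_labels: list) -> float:
    """Fraction of pairs where the annotator's binarized score matches WES."""
    # Scores of 2 or 3 count as synonymous, 1 as non-synonymous.
    predicted = [score >= 2 for score in scores]
    matches = sum(p == label for p, label in zip(predicted, wes_labels))
    return matches / len(wes_labels)

# Hypothetical annotations for six pairs (True = WES labels the pair synonymous).
wes_labels = [True, True, True, False, False, False]
annotator_a = [3, 2, 3, 1, 1, 2]
annotator_b = [3, 3, 2, 1, 1, 1]

accuracy = (label_accuracy(annotator_a, wes_labels)
            + label_accuracy(annotator_b, wes_labels)) / 2
print(f"average accuracy: {accuracy:.3f}")
print(f"inter-annotator Pearson r: {pearson(annotator_a, annotator_b):.3f}")
```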

5 Conclusion

We introduce WES, an 11M-example semantic entity similarity dataset for training question answering evaluation models. WES is generated by treating Wikipedia link texts and target article titles as synonyms, then filtering for quality. WES is targeted to question answering evaluation, independent of human annotators, and consistent with human judgment. We hope that future question-answering datasets will implement semantic evaluation metrics in their leaderboards to encourage the development of more free-form models. In future work, link-mining similarity datasets like WES can be made more challenging by generating negative examples adversarially as described at the end of section 3, more consistent by unioning semantic clusters according to Wikipedia’s internal redirect pages, and more comprehensive by leveraging link-to-link pairwise synonymy within semantic clusters.


Acknowledgements

We would like to sincerely thank Di Jin for his guidance during the literature review and method design process, Lufan Wang and Houjun Liu for their feedback on data collection methods and help proofreading this paper, and Yan Liu and Qiong Huang for their help with dataset analysis and compute. We would also like to thank (Si et al., 2021) for reminding us to have some fun in our literature.

References


  • S. Banerjee and A. Lavie (2005) METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, Michigan, pp. 65–72.
  • S. Bird, E. Klein, and E. Loper (2009) Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media, Inc.
  • D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia (2017) SemEval-2017 task 1: semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, Canada, pp. 1–14.
  • A. Chen, G. Stanovsky, S. Singh, and M. Gardner (2019) Evaluating question answering evaluation. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, Hong Kong, China, pp. 119–124.
  • E. Choi, H. He, M. Iyyer, M. Yatskar, W. Yih, Y. Choi, P. Liang, and L. Zettlemoyer (2018) QuAC: question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
  • C. D. Fellbaum (2000) WordNet: an electronic lexical database. Language 76, pp. 706.
  • V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W. Yih (2020) Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, M. Kelcey, J. Devlin, K. Lee, K. N. Toutanova, L. Jones, M. Chang, A. Dai, J. Uszkoreit, Q. Le, and S. Petrov (2019) Natural questions: a benchmark for question answering research. Transactions of the Association of Computational Linguistics.
  • C. Lin and E. Hovy (2003) Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp. 150–157.
  • C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, Barcelona, Spain, pp. 74–81.
  • S. Min, J. Boyd-Graber, C. Alberti, D. Chen, E. Choi, M. Collins, K. Guu, H. Hajishirzi, K. Lee, J. Palomaki, C. Raffel, A. Roberts, T. Kwiatkowski, P. Lewis, Y. Wu, H. Küttler, L. Liu, P. Minervini, P. Stenetorp, S. Riedel, S. Yang, M. Seo, G. Izacard, F. Petroni, L. Hosseini, N. D. Cao, E. Grave, I. Yamada, S. Shimaoka, M. Suzuki, S. Miyawaki, S. Sato, R. Takahashi, J. Suzuki, M. Fajcik, M. Docekal, K. Ondrej, P. Smrz, H. Cheng, Y. Shen, X. Liu, P. He, W. Chen, J. Gao, B. Oguz, X. Chen, V. Karpukhin, S. Peshterliev, D. Okhonko, M. Schlichtkrull, S. Gupta, Y. Mehdad, and W. Yih (2021) NeurIPS 2020 EfficientQA competition: systems, analyses and lessons learned. arXiv:2101.00133.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, USA, pp. 311–318.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.
  • J. Risch, T. Möller, J. Gutsch, and M. Pietsch (2021) Semantic answer similarity for evaluating question answering models. In Proceedings of the 3rd Workshop on Machine Reading for Question Answering.
  • C. Si, C. Zhao, and J. Boyd-Graber (2021) What’s in a name? Answer equivalence for open-domain question answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.
  • Wikipedia contributors (2022) Wikipedia:Manual of Style. [Online; accessed 21-February-2022]
  • Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, and C. D. Manning (2018) HotpotQA: a dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
  • T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi (2020) BERTScore: evaluating text generation with BERT. arXiv:1904.09675.