What's in a Name? Answer Equivalence For Open-Domain Question Answering

09/11/2021 ∙ by Chenglei Si, et al. ∙ University of Maryland

A flaw in QA evaluation is that annotations often only provide one gold answer. Thus, model predictions semantically equivalent to the answer but superficially different are considered incorrect. This work explores mining alias entities from knowledge bases and using them as additional gold answers (i.e., equivalent answers). We incorporate answers for two settings: evaluation with additional answers and model training with equivalent answers. We analyse three QA benchmarks: Natural Questions, TriviaQA, and SQuAD. Answer expansion increases the exact match score on all datasets for evaluation, while incorporating it helps model training over real-world datasets. We ensure the additional answers are valid through a human post hoc evaluation.




1 Introduction: A Name that is the Enemy of Accuracy

In question answering (qa), computers, given a question, provide the correct answer. However, the modern formulation of qa usually assumes that each question has only one answer, e.g., squad (SQuAD), hotpotqa (HotpotQA), drop (DROP). This is often a byproduct of the prevailing framework for modern qa (DrQA; DPR): a retriever finds passages that may contain the answer, and then a machine reader identifies the (as in only) answer span.

In a recent position paper, QA-TriviaNerds argue that this is at odds with the best practices for human qa. This is also a problem for computer qa.

Figure 1: An example from the nq dataset. The correct model (bert) prediction does not match the gold answer but matches the equivalent answer we mine from a knowledge base.

A bert model (BERT) trained on Natural Questions (nq; NQ) answers Tim Cook to the question “Who is the Chief Executive Officer of Apple?” (Figure 1), while the gold answer is only Timothy Donald Cook, rendering Tim Cook as wrong as Tim Apple. In the 2020 NeurIPS Efficient qa competition (EfficientQA), human annotators rate nearly a third of the predictions that do not match the gold annotation as “definitely correct” or “possibly correct”.

Despite the near-universal acknowledgement of this problem, there is neither a clear measurement of its magnitude nor a consistent best practice solution. While some datasets provide comprehensive answer sets (e.g., TriviaQA), subsequent datasets such as nq have not…and we do not know whether this is a problem. We fill that lacuna.

Section 2 mines knowledge bases for alternative answers to named entities. Even this straightforward approach finds high-precision answers not included in official answer sets. We then incorporate these answers in both training and evaluation of qa models to accept alternate answers. We focus on three popular open-domain qa datasets: nq, Triviaqa and squad. Evaluating models with a more permissive evaluation improves exact match (em) by 4.8 points on Triviaqa, 1.5 points on nq, and 0.7 points on squad (Section 3). By augmenting training data with answer sets, state-of-the-art models improve on nq and Triviaqa, but not on squad (Section 4), which was created with a single evidence passage in mind. In contrast, augmenting the answer allows diverse evidence sources to provide an answer. After reviewing other approaches for incorporating ambiguity in answers (Section 5), we discuss how to further make qa more robust.

2 Method: An Entity by any Other Name

This section reviews the open-domain qa (odqa) pipeline and introduces how we expand gold answer sets for both training and evaluation.

2.1 odqa with Single Gold Answer

We follow the state-of-the-art retriever–reader pipeline for odqa, where a retriever finds a handful of passages from a large corpus (usually Wikipedia), then a reader, often multi-tasked with passage reranking, selects a span as the prediction.

We adopt a dense passage retriever (dpr; DPR) to find passages. dpr encodes questions and passages into dense vectors. dpr searches for passages in this dense space: given an encoded query, it finds the nearest passage vectors in the dense space. We do not train a new retriever but instead use the released dpr checkpoint to query the top-$k$ (in this paper, $k = 100$) most relevant passages.
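The nearest-neighbour lookup described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the dpr implementation: the random vectors stand in for encoder outputs, the helper names `dot` and `top_k_passages` are hypothetical, and similarity is assumed to be the inner product between question and passage vectors.

```python
import heapq
import random

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def top_k_passages(query_vec, passage_vecs, k):
    """Return the indices of the k highest-scoring passages, best first."""
    scored = ((dot(query_vec, p), i) for i, p in enumerate(passage_vecs))
    return [i for _, i in heapq.nlargest(k, scored)]

# Toy stand-ins for encoded passages and an encoded question (assumption).
rng = random.Random(0)
passage_vecs = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(1000)]
query_vec = [rng.gauss(0, 1) for _ in range(8)]

retrieved = top_k_passages(query_vec, passage_vecs, k=100)
```

A real system would replace the exhaustive scan with an approximate or indexed search over millions of passage vectors; the exhaustive version is kept here only because it makes the scoring rule explicit.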

Given a question and retrieved passages, a neural reader reranks the top-$k$ passages and extracts an answer span. Specifically, bert encodes each passage $p_i$ concatenated with the question $q$ as $P_i \in \mathbb{R}^{L \times h}$, where $L$ is the maximum sequence length and $h$ is the hidden size of bert. Three probabilities use this representation to reveal where we can find the answer. The first probability $P_r(p_i)$ encodes whether passage $p_i$ contains the answer. Because the answer is a subset of the longer span, we must provide the index where the answer starts $s_j$ and where it ends $e_t$. Given the encoding of passage $p_i$, there are three parameter matrices w that produce these probabilities:

$$P_r(p_i) = \mathrm{softmax}\left(w_r \, P_i[0,:]\right) \quad (1)$$
$$P_s(s_j) = \mathrm{softmax}\left(w_s \, P_i[j,:]\right) \quad (2)$$
$$P_e(e_t) = \mathrm{softmax}\left(w_e \, P_i[t,:]\right) \quad (3)$$

where $P_i[0,:]$ represents the [CLS] token, and $w_r$, $w_s$ and $w_e$ are learnable weights for passage selection, start span and end span. Training updates weights with one positive and $m-1$ negative passages among the top-100 retrieved passages for each question (we use $m = 24$) with log-likelihood of the positive passage for passage selection (Equation 1) and maximum marginal likelihood over all spans in the positive passage for span extraction (Equations 2 and 3).
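The three probabilities described above can be sketched as follows. This is a minimal illustration, not the authors' code: the weight names `w_sel`, `w_start` and `w_end`, the toy dimensions, and the random encodings are all hypothetical, and the sketch assumes that passage selection is normalised across passages while start and end scores are normalised across the tokens of one passage.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def reader_probabilities(passages, w_sel, w_start, w_end):
    """passages: list of L x h token encodings; row 0 is the [CLS] token."""
    # Passage selection: score each passage by its [CLS] vector.
    p_selected = softmax([dot(w_sel, P[0]) for P in passages])
    # Span start and end: score every token inside each passage.
    p_start = [softmax([dot(w_start, tok) for tok in P]) for P in passages]
    p_end = [softmax([dot(w_end, tok) for tok in P]) for P in passages]
    return p_selected, p_start, p_end

# Toy setup (assumption): 3 passages, L = 5 tokens, hidden size h = 4.
rng = random.Random(0)
passages = [[[rng.gauss(0, 1) for _ in range(4)] for _ in range(5)] for _ in range(3)]
w_sel, w_start, w_end = ([rng.gauss(0, 1) for _ in range(4)] for _ in range(3))

p_selected, p_start, p_end = reader_probabilities(passages, w_sel, w_start, w_end)
```

At prediction time one would typically pick the passage with the highest selection probability and then the best-scoring start and end pair within it.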

To study the effect of equivalent answers in reader training, we focus on the distant supervision setting where we know what the answer is but not where it is (in contrast to full supervision where we know both). To use the answer to discover positive passages, we use string matching: any of the top-$k$ retrieved passages that contains an answer is considered correct. We discard questions without any positive passages. This framework is consistent with modern state-of-the-art odqa pipelines (NQ-baseline; DPR; Transformer-XH; GraphRetriever; zhao2021beamdr, inter alia).
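A rough sketch of this distant-supervision step is given below. The paper only says "string matching", so the case-insensitive substring test is an assumption, as are the function names and the toy questions and passages.

```python
def contains_answer(passage, answers):
    """True if any answer string occurs in the passage (case-insensitive)."""
    text = passage.lower()
    return any(ans.lower() in text for ans in answers)

def build_distant_examples(questions):
    """Label retrieved passages and drop questions with no positive passage."""
    kept = []
    for q in questions:
        labels = [contains_answer(p, q["answers"]) for p in q["passages"]]
        if any(labels):
            kept.append({**q, "labels": labels})
    return kept

# Toy data (assumption): two questions with their retrieved passages.
questions = [
    {
        "question": "Who is the Chief Executive Officer of Apple?",
        "answers": ["Timothy Donald Cook"],
        "passages": ["Tim Cook became chief executive of Apple in 2011."],
    },
    {
        "question": "Where do the Miami Dolphins play?",
        "answers": ["Sun Life Stadium"],
        "passages": [
            "The Dolphins play their home games at Sun Life Stadium.",
            "The team was founded in 1966.",
        ],
    },
]

kept = build_distant_examples(questions)
```

In this toy run the first question is discarded because no passage contains the single gold string, which is exactly the situation that alias expansion is meant to reduce.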

2.2 Extracting Alias Entities

We expand the original gold answer set by extracting aliases from Freebase (Freebase), a large-scale knowledge base (kb). Specifically, for each answer in the original dataset (e.g., Sun Life Stadium), if we can find this entry in the kb, we then use the “common.topic.alias” relation to extract all aliases of the entity (e.g., [Joe Robbie Stadium, Pro Player Park, Pro Player Stadium, Dolphins Stadium, Land Shark Stadium]). We expand the answer set by adding all aliases. We next describe how this changes evaluation and training.
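The expansion step can be illustrated with a toy lookup table. The dictionary below is a hypothetical in-memory stand-in for the kb (it is not a Freebase API), the function name `expand_answers` is made up for this sketch, and matching an answer to a kb entry by exact name is an assumption.

```python
# Hypothetical stand-in for the kb: entity name -> aliases from the alias relation.
TOY_KB = {
    "Sun Life Stadium": [
        "Joe Robbie Stadium",
        "Pro Player Park",
        "Pro Player Stadium",
        "Dolphins Stadium",
        "Land Shark Stadium",
    ],
}

def expand_answers(gold_answers, kb):
    """Return the gold answers followed by any aliases found in the kb."""
    expanded = list(gold_answers)
    seen = {a.lower() for a in expanded}
    for answer in gold_answers:
        for alias in kb.get(answer, []):   # answers without a kb entry add nothing
            if alias.lower() not in seen:
                expanded.append(alias)
                seen.add(alias.lower())
    return expanded

augmented = expand_answers(["Sun Life Stadium"], TOY_KB)
unchanged = expand_answers(["the people's club"], TOY_KB)
```

Answers that have no kb entry keep their original answer set.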

2.3 Augmented Evaluation

For evaluation, we report the exact match (em) score, where a predicted span is correct only if the (normalized) span text matches with a gold answer exactly. This is the adopted metric for span-extraction datasets in most qa papers (DPR; ORQA; DiscreteEM, inter alia). When we incorporate the alias entities in evaluation, we get an expanded answer set $A$. For a given span $s$ predicted by the model, we compute an em score of 1 if the span matches any correct answer $a$ in the set $A$:

$$\mathrm{EM}(s, A) = \max_{a \in A} \mathrm{EM}(s, a)$$
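Exact match against an answer set can be sketched as below. The paper only states that the span is normalized; the recipe used here (lowercasing, stripping punctuation and English articles, collapsing whitespace) is the commonly used SQuAD-style normalisation and is an assumption, as are the function names.

```python
import re
import string

_PUNCTUATION = set(string.punctuation)

def normalize(text):
    """SQuAD-style normalisation (assumed): lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, answer_set):
    """Return 1 if the prediction matches any answer in the set, else 0."""
    pred = normalize(prediction)
    return int(any(pred == normalize(answer) for answer in answer_set))

# The Figure 1 example: the prediction only counts once the alias is accepted.
single = exact_match("Tim Cook", ["Timothy Donald Cook"])
expanded = exact_match("Tim Cook", ["Timothy Donald Cook", "Tim Cook"])
```

With a single gold answer the function reduces to the usual em score, so the same code covers both evaluation settings.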


2.4 Augmented Training

When we incorporate the alias entities in training, we treat each retrieved passage as positive if it contains either the original answer or the extracted alias entities. As a result, some originally negative passages become positive since they may contain the aliases, and we augment the original training set. Then, we train on this augmented training set in the same way as in Equations 1 to 3.
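To make the effect on passage labels concrete, the sketch below lists the passages that flip from negative to positive once aliases are accepted. The passages are invented toy text, the helper names are hypothetical, and case-insensitive substring matching is assumed.

```python
def contains_any(passage, answers):
    """True if any of the answer strings occurs in the passage (case-insensitive)."""
    text = passage.lower()
    return any(a.lower() in text for a in answers)

def newly_positive(passages, original_answers, aliases):
    """Indices of passages that are positive only once aliases are allowed."""
    return [
        i for i, p in enumerate(passages)
        if not contains_any(p, original_answers) and contains_any(p, aliases)
    ]

# Toy retrieved passages (assumption) for the gold answer "Sun Life Stadium".
passages = [
    "The game was played at Sun Life Stadium in Miami Gardens.",
    "Joe Robbie Stadium hosted the game that year.",
    "The team moved to a new training facility.",
]
flipped = newly_positive(
    passages,
    original_answers=["Sun Life Stadium"],
    aliases=["Joe Robbie Stadium", "Dolphins Stadium"],
)
```

Here only the second passage changes label; the first was already positive and the third stays negative.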

3 Experiment: Just as Sweet

                         nq      squad   Triviaqa
Avg. Original Answers    1.74    1.00    1.00
Matched Answers (%)      71.63   32.16   88.04
Avg. Augmented Answers   13.04   5.60    14.03
#Original Positives      69205   48135   62707
#Augmented Positives     69989   48615   67526
Table 1: Avg. Original Answers denotes the average number of answers per question in the official test sets. Matched Answers denotes the percentage of original answers that have aliases in the kb. Avg. Augmented Answers denotes the average number of answers in our augmented answer sets. Last two rows: number of positive questions (questions with matched positive passages) in the original / augmented training set for each dataset. nq and Triviaqa have more augmented answers than squad.

Data       Model             Single Ans   Ans Set
nq         Baseline          34.9         36.4
           + Augment Train   35.8         37.2
TriviaQA   Baseline          49.9         54.7
           + Augment Train   50.0         55.9
SQuAD      Baseline          18.9         19.6
           + Augment Train   18.3         18.9
Table 2: Evaluation results on qa datasets. Compared to the original “Single Ans” evaluation under the original answer set, using the augmented answer sets (“Ans Set”) improves evaluation. Retraining the reader with augmented answer sets (“Augment Train”) is even better for most datasets, even when evaluated on the datasets’ original answer sets. Results are the average of three random seeds.

               Baseline   +Wiki Train     +FB Train
Single Ans.    49.31      49.42 (+0.11)   49.53 (+0.22)
+Wiki Eval     54.13      55.27 (+1.14)   54.57 (+0.44)
+FB Eval       51.75      52.23 (+0.48)   52.52 (+0.77)
Table 3: Results on TriviaQA. Numbers in brackets indicate the improvement compared to the first column. Each column indicates a different training setup and each row indicates a different evaluation setup. Augmented training with Wikipedia aliases (2nd column) and Freebase aliases (3rd column) improves EM over baseline (1st column).

We present results on three qa datasets (nq, Triviaqa and squad) on how including aliases as alternative answers impacts evaluation and training. Since the official test sets are not released, we use the original dev sets as the test sets, and randomly split off 10% of the training data as the held-out dev sets. All of these datasets are extractive qa datasets where answers are spans in Wikipedia articles.

Statistics of Augmentation.

Both squad and Triviaqa have one single answer in the test set (Table 1). While nq also has answer sets, these represent annotator ambiguity given a passage, not the full range of possible answers. For example, different annotators might variously highlight Lenin or Chairman Lenin, but there is no expectation to exhaustively enumerate all of his names (e.g., Vladimir Ilyich Ulyanov or Vladimir Lenin). Although the default test set of Triviaqa uses one single gold answer, the authors released answer aliases mined from Wikipedia. Thus, we directly use those aliases for our experiments in Table 2. Overall, a systematic approach to expand gold answers substantially increases the number of gold answers.

Triviaqa has the most answers that have equivalent answers, while squad has the least. Augmenting the gold answer set increases the positive passages and thus increases the training examples, since questions with no positive passages are discarded (Table 1), particularly for Triviaqa’s entity-centric questions.

Implementation Details.

For all experiments, we use the multiset.bert-base-encoder checkpoint of DPR as the retriever and use bert-base-uncased for our reader model. During training, we sample one positive passage and 23 negative passages for each question. During evaluation, we consider the top-10 retrieved passages for answer span extraction. We use a batch size of 16 and a learning rate of 3e-5 for training on all datasets.
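The passage sampling described above might look like the following sketch. The function name is hypothetical, uniform sampling is an assumption, and so is the fallback of using all available negatives when a question has fewer than 23.

```python
import random

def sample_training_passages(labels, num_negatives=23, rng=None):
    """Pick one positive and up to num_negatives negative passage indices."""
    rng = rng or random.Random(0)
    positives = [i for i, is_pos in enumerate(labels) if is_pos]
    negatives = [i for i, is_pos in enumerate(labels) if not is_pos]
    if not positives:
        return None  # questions without a positive passage are discarded
    positive = rng.choice(positives)
    sampled_negatives = rng.sample(negatives, min(num_negatives, len(negatives)))
    return positive, sampled_negatives

# Toy labels (assumption) for 100 retrieved passages, three of them positive.
labels = [False] * 100
for idx in (4, 17, 60):
    labels[idx] = True

positive, negatives = sample_training_passages(labels, rng=random.Random(13))
```

Each training instance then consists of 24 passages, which matches the one-positive-plus-23-negatives setup stated in the text.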

Augmented Evaluation.

We train models with the original gold answer set and evaluate under two settings: 1) on the original gold answer test set; 2) on the answer test set augmented with alias entities. On all three datasets, em score improves (Table 2). Triviaqa shows the largest improvement, as most answers in Triviaqa are entities.
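The size of the evaluation gain can be read directly off the Baseline rows of Table 2. The short computation below only restates those numbers; the variable names are illustrative.

```python
# Baseline rows of Table 2: (Single Ans, Ans Set) exact match scores.
table2_baseline = {
    "nq": (34.9, 36.4),
    "TriviaQA": (49.9, 54.7),
    "SQuAD": (18.9, 19.6),
}

# Gain from switching to the augmented answer set, rounded to one decimal.
gains = {
    name: round(ans_set - single, 1)
    for name, (single, ans_set) in table2_baseline.items()
}

for name, gain in gains.items():
    print(f"{name}: +{gain} em")
```

This gives +1.5 on nq, +4.8 on TriviaQA and +0.7 on SQuAD, the same gains quoted in the introduction.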

Augmented Training.

We incorporate the alias answers in training and compare the results with single-answer training (Table 2). One check that this makes the models more robust, rather than merely benefiting from a more permissive evaluation, is that augmented training improves em on nq by about a point (34.9 to 35.8) even under the original single-answer evaluation. However, Triviaqa improves less (49.9 to 50.0), and em decreases on squad (18.9 to 18.3) with augmented training. The next section inspects examples to understand why augmented training accuracy differs on these datasets.

Freebase vs Wikipedia Aliases.

We compare Wikipedia entities and Freebase entities for augmented evaluation and training on TriviaQA, and show the results in Table 3. Using Wikipedia entities yields a larger increase in EM score under augmented evaluation (e.g., the baseline model scores 54.13 under Wiki-expanded augmented evaluation, as compared to 51.75 under Freebase-expanded augmented evaluation). This is mainly because TriviaQA answers have more matches in Wikipedia titles than in Freebase entities. On the other hand, the difference between the two alias sources is rather small for augmented training. For example, using Wikipedia for answer expansion improves the baseline from 49.31 to 49.42 under single-answer evaluation, while using Freebase improves it to 49.53.

4 Analysis: Does qa Retain that Dear Perfection with another Name?

A sceptical reader would rightly suspect that accuracy is only going up because we have added more correct answers. Clearly this can go too far…if we enumerate all finite length strings we could get perfect accuracy. This section addresses this criticism by examining whether the new answers found with augmented training and evaluation would still satisfy user information-seeking needs (voorhees2019evolution) for both the training and test sets.

                  nq   squad   Triviaqa
Correct           48   31      41
Debatable         0    2       3
Wrong             1    16      6
Invalid           1    1       0

Non-equivalent    1    5       2
Wrong context     0    1       1
Wrong alias       0    10      3
Table 4: Annotation of fifty sampled augmented training examples from each dataset. Most training examples are still correct except for squad, where additional answers are incorrect a third of the time. How the new answers are wrong is broken down in the bottom half of the table.

                  nq   squad   Triviaqa
Correct           48   47      50
Wrong             1    1       0
Debatable         1    1       0
Invalid           0    1       0
Table 5: Annotation of fifty test questions that went from incorrect to correct under augmented evaluation. Most changes of correctness are deemed valid by human annotators across all three datasets.

Q: What city in France did the torch relay start at?
P: Title: 1948 summer olympics. The torch relay then run through Switzerland and France
A: Paris
Alias: France
Error Type: Non-equivalent Entity

Q: How many previously-separate phyla did the 2007 study reclassify?
P: Title: celastrales. In the APG III system, the celastraceae family was expanded to consist of these five groups …
A: 3
Alias: III
Error Type: Wrong Context

Q: What is Everton football club’s semi-official club nickname?
P: Title: history of Everton F. C. Everton football club have a long and detailed history …
A: the people’s club
Alias: Everton F. C.
Error Type: Wrong Alias

Table 6: How adding equivalent answers can go wrong. While errors are rare (Tables 4 and 5), these examples are representative of the mistakes. The examples are taken from squad.

Accuracy of Augmented Training Set.

We annotate fifty passages that originally lack an answer but do have an answer from the augmented answer set (Table 4). We classify them into four categories: correct, debatable, and wrong answers, as well as invalid questions that are ill-formed or unanswerable due to annotation error. The augmented examples are mostly correct for nq, consistent with its em jump with augmented training. However, augmentation often surfaces wrong augmented answers for squad, which explains why the em score drops with augmented training.

We further categorize why the augmentation is wrong into three categories (Table 6): (1) Non-equivalent entities, where the underlying knowledge base has a mistake, which is rare in high quality kbs; (2) Wrong context, where the corresponding context is not answering the question; (3) Wrong alias, where the question asks about specific alternate forms of an entity but the prediction is another alias of the entity. This is relatively common in squad. We speculate this is a side-effect of its creation: users write questions given a Wikipedia paragraph, and the first paragraph often contains an entity’s aliases (e.g., “Vladimir Ilyich Ulyanov, better known by his alias Lenin, was a Russian revolutionary, politician, and political theorist”), which are easy questions to write.
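The error breakdown in the bottom half of Table 4 can be summarised per dataset as below. The counts are copied from the table; the variable names and the summary layout are illustrative.

```python
SAMPLE_SIZE = 50  # annotated training examples per dataset (Table 4)

# Bottom half of Table 4: why the augmented answer was wrong.
wrong_breakdown = {
    "nq": {"Non-equivalent": 1, "Wrong context": 0, "Wrong alias": 0},
    "squad": {"Non-equivalent": 5, "Wrong context": 1, "Wrong alias": 10},
    "Triviaqa": {"Non-equivalent": 2, "Wrong context": 1, "Wrong alias": 3},
}

summary = {}
for name, errors in wrong_breakdown.items():
    total_wrong = sum(errors.values())
    top_type = max(errors, key=errors.get)       # most frequent error type
    summary[name] = (
        total_wrong / SAMPLE_SIZE,               # share of sampled examples that are wrong
        top_type,
        errors[top_type] / total_wrong,          # share of errors with that type
    )
```

For squad this gives a wrong rate of 0.32, with wrong aliases accounting for 10 of the 16 errors, which is the pattern the paragraph above attributes to how squad questions were written.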

Accuracy of Expanded Answer Set.

Next, we sample fifty test examples that models get wrong under the original evaluation but that are correct under augmented evaluation. We classify them into four categories: correct, debatable, wrong answers, and the rare cases of invalid questions. Almost all of the examples are indeed correct (Table 5), demonstrating the high precision of our answer expansion for augmented evaluation. In rare cases, for example, for the question “Who sang the song Tell Me Something Good?”, the model prediction Rufus is an alias entity, but the reference answer is Rufus and Chaka Khan. The authors disagree whether that would meet a user’s information-seeking need because Chaka Khan, the vocalist, was part of the band Rufus. Hence, it was labeled as debatable.
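The precision implied by Table 5 follows from the annotation counts. The computation below treats only the "Correct" label as a valid change, which is a deliberately strict reading and an assumption of this sketch.

```python
# Table 5: annotation of fifty test questions per dataset.
table5 = {
    "nq": {"Correct": 48, "Wrong": 1, "Debatable": 1, "Invalid": 0},
    "squad": {"Correct": 47, "Wrong": 1, "Debatable": 1, "Invalid": 1},
    "Triviaqa": {"Correct": 50, "Wrong": 0, "Debatable": 0, "Invalid": 0},
}

# Strict precision: only annotations labeled "Correct" count as valid.
precision = {
    name: counts["Correct"] / sum(counts.values())
    for name, counts in table5.items()
}
```

Under this strict reading the precision is 0.96 on nq, 0.94 on squad and 1.0 on Triviaqa; counting debatable cases as correct would raise the first two slightly.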

5 Related Work: Refuse thy Name

Answer Annotation in qa Datasets.

Some qa datasets such as nq and TyDi (TyDiQA) multi-way annotate dev and test sets, where they ask different annotators to annotate the dev and test sets. However, such annotation is costly and the coverage is still largely lacking (e.g., our alias expansion obtains many more answers than nq’s original multi-way annotation). Ambigqa (AmbigQA) aims to address the problem of ambiguous questions, where there are multiple interpretations of the same question and therefore multiple correct answer classes (which could in turn have many valid aliases for each class). We provide an orthogonal view as we are trying to expand equivalent answers to any given gold answer while Ambigqa aims to cover semantically different but valid answers.

Query Expansion Techniques.

Automatic query expansion has been used to improve information retrieval (QEsurvey). Recently, query expansion has been used in nlp applications such as document re-ranking (BERT-QE) and passage retrieval in odqa (Qi2019AnsweringCO; GAR), with the goal of increasing accuracy or recall. Unlike this line of work, our answer expansion aims to improve evaluation of qa models.

Evaluation of QA Models.

There are other attempts to improve qa evaluation. EvalQA find that current automatic metrics do not correlate well with human judgements, which motivated MOCHA to construct a dataset with human annotated scores of candidate answers and use it to train a BERT-based regression model as the scorer. feng-19 argue that instead of evaluating qa systems directly, we should evaluate downstream human accuracy when using qa output. Alternatively, Risch2021SemanticAS use a cross-encoder to measure the semantic similarity between predictions and gold answers. For the visual qa task, VQA-alias incorporate alias answers in visual qa evaluation. In this work, instead of proposing new evaluation metrics, we improve the evaluation of odqa models by augmenting gold answers with aliases from knowledge bases.

6 Conclusion: Wherefore art thou Single Answer?

Matching entities in a kb is a simple approach to improve qa accuracy. We expect future improvements; e.g., entity linking source passages would likely improve precision at the cost of recall. Future work should also investigate the role of context in deciding the correctness of predicted answers. Beyond entities, future work should also consider other types of answers such as non-entity phrases and free-form expressions.

As the qa community moves to odqa and multilingual qa, robust approaches will need to holistically account for unexpected but valid answers. This will better help users, use training data more efficiently, and fairly compare models.


We thank members of the UMD CLIP lab, the anonymous reviewers and meta-reviewer for their suggestions and comments. Zhao is supported by the Office of the Director of National Intelligence (odni), Intelligence Advanced Research Projects Activity (iarpa), via the better Program contract 2019-19051600005. Boyd-Graber is supported by nsf Grant iis-1822494. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsors.