
Meta Answering for Machine Reading

We investigate a framework for machine reading, inspired by real world information-seeking problems, where a meta question answering system interacts with a black box environment. The environment encapsulates a competitive machine reader based on BERT, providing candidate answers to questions, and possibly some context. To validate the realism of our formulation, we ask humans to play the role of a meta-answerer. With just a small snippet of text around an answer, humans can outperform the machine reader, improving recall. Similarly, a simple machine meta-answerer outperforms the environment, improving both precision and recall on the Natural Questions dataset. The system relies on joint training of answer scoring and the selection of conditioning information.



1 Introduction

Question Answering (QA) is a benchmark task in Natural Language Understanding, driving significant effort and progress (e.g., Devlin et al. (2018)). In the most popular setting, QA is framed as a machine reading task (Rajpurkar et al., 2016; Kwiatkowski et al., 2019). The problem requires answering a question either by extracting a span of text identifying the answer from a textual context, or abstaining in the absence of adequate evidence. Formulated this way, the problem can be solved by humans with sufficient agreement, thus allowing the task to be encapsulated in the form of self-contained static observations. This closed-world setting has catalyzed considerable progress in question answering, as the automatic and instant evaluation of whether a candidate answer is right or not has allowed the application of sophisticated machine learning techniques (Buck et al., 2018).

The price of this setting is artifice. As highlighted by recent work, systems are not pressured to develop language understanding, are prone to adversarial attacks, and take systematic advantage of artifacts in the data (Jia and Liang, 2017; Mudrakarta et al., 2018; Niven and Kao, 2019). In real world applications, information-seeking tasks are mediated by machines which provide likely relevant, but also incomplete and noisy, information in heterogeneous formats. One example is web search. When a user submits a query, the results may include document links and corresponding snippets, related queries, featured snippets, structured information from knowledge graphs, images, maps, ads, etc. By design, these results will satisfy some notion of relevance. However, there is no guarantee that they will satisfy the user's intent, nor should one conclude that the query is impossible to answer; longer sessions and diverse strategies might be necessary (Xie, 2002; Marchionini, 2006; Russell, 2019). Such tasks are characterized by imperfect and asymmetric information: users and systems have access to different sources and have different skills, while popular QA tasks assume perfect information.

We study the strategies both humans and computers can deploy to answer questions in imperfect information environments, in what we call a meta-answering task. We perform an extensive analysis of humans' quantitative and qualitative performance. For a machine, we call an analogous system a machine meta-answerer, generalizing existing work on reranking, confidence estimation, and determining whether a question is answerable (Rajpurkar et al., 2018). We simplify this problem by placing a supervised meta-answerer on top of an existing strong machine reading QA system. We experiment with an existing machine reading/QA dataset, Natural Questions (Kwiatkowski et al., 2019), that embeds real users' information-seeking needs in a realistic information retrieval context. The following are our main findings. First, even restricted to the limited contexts provided by the environment, a human meta-answerer can improve the accuracy of a strong QA system: they are able to discern whether answer candidates are responsive to the question, can resolve ambiguous references, and can spot irrelevant distractors that can vex brittle QA systems. Second, it is possible to design a simple supervised meta-answerer with a built-in heuristic inference policy that outperforms the QA system on the Natural Questions task, producing the best single system on the NQ leaderboard for short answers.

With respect to the machine meta-answerer, it is important to structure the system in such a way that the problems of composing the conditioning information and of scoring answers are decoupled. Furthermore, joint training of these factors, including additional auxiliary tasks, is key to obtaining good performance.

2 From Question Answering to Meta-answering

An extractive QA system maps a question-context pair (q, c) to a set of answer candidates and their scores, {(a_1, s_1), ..., (a_N, s_N)}, where each a_i is a subspan of c and s_i is a score assigned to the candidate by the QA system. In the case of Natural Questions (Kwiatkowski et al., 2019), the dataset we use to explore the meta-answering task in this paper, each q is a web search query, each c is the highest-ranking Wikipedia page returned for this query by Google, and each answer candidate a_i is a short, contiguous span from this page.
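The environment's interface can be pictured as a small data structure: a ranked list of scored spans, with the underlying page hidden. This is an illustrative sketch (the class and function names are ours, not the paper's):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AnswerCandidate:
    span: str     # contiguous subspan a_i extracted from the hidden context c
    score: float  # score s_i assigned by the base QA system

def n_best(candidates: List[AnswerCandidate], n: int) -> List[AnswerCandidate]:
    # The meta-answerer only ever sees this ranked list, never c itself.
    return sorted(candidates, key=lambda a: a.score, reverse=True)[:n]
```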

In contrast, a meta-answerer lacks direct access to c and instead takes the N-best list of an existing QA system as its starting point to predict the single best answer â. Conceptually, this is similar to a human being confronted with a search results page and having to identify, from the very limited information provided, which of the different results (generated by some system that had access to the full document c) is worth exploring.

This differs from vanilla N-best list re-ranking in two ways. First, considering only the answer candidates and the question q is an ill-defined (and, for humans, frustrating) task, as validated by our empirical experiments. Second, providing the original document c to the re-ranker turns the problem back into extractive QA, assuming a hard prior on what spans ought to be considered. In contrast, a meta-answerer has to assemble a sufficiently informative yet compact 'history' by interacting with the original full context only through the QA system.

For this, we allow the QA system to return 'observations' of c that are centered on the answer candidates. In the simplest case, an observation o_i = (l_i, a_i, r_i) is an 'in-context' view of an answer span, with l_i/r_i being the tokens to the left/right of a_i in c. Theoretically, we can add arbitrary information about the answer candidate to an observation, possibly from other systems that provide scores or additional context; this is particularly natural in a web search setting, where many ranking signals could provide valuable information for making decisions without requiring access to the full document.
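Constructing such an in-context view amounts to slicing a small window around the answer span. A minimal sketch, assuming whitespace tokenization and a window of a few tokens:

```python
def observation(context_tokens, start, end, window=5):
    """In-context view (l, a, r): the answer span [start, end) plus up to
    `window` tokens of left and right context from the otherwise hidden page."""
    left = context_tokens[max(0, start - window):start]
    answer = context_tokens[start:end]
    right = context_tokens[end:end + window]
    return left, answer, right
```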

To avoid degeneration into the original task by simply reassembling c through concatenating a large number of observations, we limit the length of the left/right context to a small number of tokens (e.g., 5 works well), and additionally restrict the meta-answerer to select only k observations on the basis of which it will score the answer candidates. Phrased like this, it is natural to decompose the meta-answering problem into two sub-tasks: history selection, which computes a history H by picking k of the observations; and answer selection, which (re-)scores the answer candidates returned by the QA system on the basis of H to identify the best candidate.

As in other domains involving multi-party interaction, from macroeconomics (Akerlof, 1970) to game theory (Osborne and Rubinstein, 1994), the meta-answering problem is characterized by imperfect information. Information is imperfect because the observations returned by the environment contain noise and errors; it is also asymmetric because the environment has access to full documents while the meta-answerer only sees snippets of the context. We believe that this is an important aspect to account for in information-seeking tasks such as question answering. In the next section we show that humans can successfully deal with the meta-answering task.

                    Baseline             Context              RewriteQues
                    Pr.   Rec.  F1      Pr.   Rec.  F1      Pr.   Rec.  F1
NQ annotator        57.9  46.4  51.5    64.2  51.6  57.2    57.7  45.5  50.9
Baseline QA         56.4  45.7  50.5    60.1  47.2  52.9    67.2  56.4  61.4
avg. Human          40.7  41.1  40.6    48.9  50.9  49.8    54.5  56.7  55.5
best Human          39.4  48.8  43.6    52.7  59.1  55.7    60.1  66.8  63.3
  vs. NQ annotator -18.5   2.4  -7.9   -11.5   7.5  -1.5     2.4  21.3  12.4
  vs. Baseline QA  -17.1   3.1  -6.9    -7.5  11.9   2.8    -7.1  10.4   1.9
Table 1: Results of the human evaluation, using bootstrap sampling and exact string matches. The two "vs." rows give the difference between the best human and the NQ annotator / baseline QA system. As discussed in Section 3.2, these numbers are not comparable to the official eval metric but allow for comparison between the original NQ annotators, the baseline QA system, and humans who have to operate on partial information.

3 Humans as meta-answerer

To better understand the task and provide a benchmark, we place humans in the role of the meta-answerer. They see exactly the same information available to a machine-learned system and select actions to answer the question. They can request candidate answers one at a time, ranked from the QA system's N-best list. On the one hand, humans have world knowledge that the computers do not. On the other, they are restricted to the same context as computers; they may be burdened by their innate knowledge rather than aided.

This section examines how well human meta-answerers can find correct answers compared to both the annotators of the NQ dataset (who operated under far less elaborate restrictions) and the machine QA system. We evaluate three frameworks for humans to interact with the underlying QA system, with increasing complexity and information; a subset of these settings also corresponds to those of our machine meta-answerer.

Baseline shows only the question and candidate answers without context; the meta-answerer needs to decide whether Central Germany is a reasonable answer to “what culture region is Germany a part of”. Context adds surrounding context: is “Charles Osgood as the Narrator Jesse McCartney as JoJo, the Mayor’s son” a good answer to the question “who is JoJo in Horton Hears a Who”. This is identical to the information the machine meta-answerer uses in the next section. RewriteQues goes beyond the machine meta-answerer and allows the user to ask any question to two other QA systems: Lee et al. (2019) over all of Wikipedia or Alberti et al. (2019a) over the source page from NQ. For example, the user can ask “Who did Jesse McCartney play in Horton Hears a Who” to verify that (the system believes) he plays JoJo.

3.1 Human Answering Framework

A human meta-answerer interacts with the underlying QA system through a text-based interface. They first see a prompt; they can then request an answer from the underlying QA system. After requesting up to 20 candidates, the user can either abstain or propose one of the answer candidates. For each condition, the same five human meta-answerers (results are averages with error bars) play episodes for random samples of 100 questions from the dev set of the NQ dataset. In addition, in RewriteQues, the human meta-answerer can ask a different question as an action.

3.2 Comparing Human meta-answerers to Original Annotators and the QA system

There are a few differences between how the NQ systems are evaluated and how we compare our human evaluations to a baseline. Most importantly, our human subjects select a single continuous string as the answer (they do not know where the string came from in the document); however, the official NQ evaluation calls for matching the exact span in the document. Thus, selecting the correct answer at the wrong position is still counted as a miss (e.g., Kevin Kline at token 13 or 47 is wrong but Kevin Kline at token 30 is correct). In addition, a small number of questions have a short answer that consists of multiple unconnected spans, e.g., several names from a cast; because the underlying QA system cannot produce such spans, neither can our meta-answerer, and any disjoint gold span will always be wrong. Finally, our setting cannot answer binary yes/no questions because the setup is strictly extractive. To compare the accuracy of the QA system and the original NQ annotators (both see the whole source document) with the humans operating on partial knowledge, we compare exact and partial string matches via a "surface F1" measure based on token overlap.
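A token-overlap "surface F1" of this kind can be computed as follows. This is a sketch of the general idea; the official NQ scorer differs, notably in its handling of span positions:

```python
from collections import Counter

def surface_f1(pred: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer string."""
    p, g = pred.lower().split(), gold.lower().split()
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```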

The first comparison is between our humans and the original NQ annotators. This is difficult because the NQ annotators define the gold answer. To hold out one NQ annotator, we create a new answer via bootstrapping against the five annotations given for the dev set: we pick a random annotation and sample (with replacement) five annotations from the remaining four. A guess is counted as correct if at least two out of the five annotations have a short answer and at least one is an exact string match. We consider yes/no annotations as having no short (extractive) answer. This allows us to fairly compare the accuracy of the NQ annotator, our human annotators, and the baseline QA system. Human meta-answerers, on average, have lower F1 than both the original annotators and the baseline QA system (Table 1), in Context and especially Baseline. However, a clear improvement in performance is visible when moving from Baseline to Context. There is also an improvement from Context to RewriteQues, but formulating new questions introduces different information to each of the human meta-answerers, and there is also more variance. This variance is in part a function of skill, and the best human meta-answerer was able to improve over the baseline system. The difference in F1 between the best human meta-answerer and the NQ annotator that had access to the full Wikipedia page improves from -7.9 (Baseline) to -1.5 (Context). Compared to the baseline QA system on Context, the average human is 3.1 F1 points behind; however, the best human is slightly more successful than the baseline, being 2.8 points ahead. On RewriteQues the best human is further ahead, and the average human recall is higher than the baseline system's.
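The bootstrap correctness check from Section 3.2 can be sketched as follows (treating a yes/no or missing short answer as `None` is our convention, not the paper's):

```python
import random

def bootstrap_correct(guess, annotations, rng=random):
    """One bootstrap draw over the five NQ dev annotations: hold one out,
    resample five (with replacement) from the remaining four, and count the
    guess correct if >= 2 sampled annotations have a short answer and >= 1
    of them is an exact string match. `None` marks a non-extractive answer."""
    held_out = rng.randrange(len(annotations))
    rest = [a for i, a in enumerate(annotations) if i != held_out]
    sample = [rng.choice(rest) for _ in range(5)]
    with_short = [a for a in sample if a is not None]
    return len(with_short) >= 2 and guess in with_short
```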

Apart from specific spans/answers, we can also investigate agreement at the action level. Measuring the categorical distribution over final decisions per question, human meta-answerers typically agree with each other: the highest agreement between raters is near 0.7 chance-adjusted agreement, while agreement between the users and the system is typically between 0.5 and 0.6.
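The text does not name the agreement statistic; one common chance-adjusted measure for categorical decisions is Cohen's kappa, sketched here for two raters:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-adjusted agreement between two raters' categorical decisions:
    (observed - expected) / (1 - expected), where 'expected' is the agreement
    rate of two independent raters with the same marginal label frequencies."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return 1.0 if expected == 1.0 else (observed - expected) / (1 - expected)
```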

3.3 A Taxonomy of Outcomes

Figure 1: Humans improve over the underlying QA system as they see more information. They abstain less, but this is often balanced out by being "fooled" (tricked into answering questions where they should abstain).

To see how human meta-answerers improve on the baseline, it is helpful to consider different outcomes of the game. As the meta-answerer interacts with the environment, it can either improve or worsen the answers. To build a vocabulary to discuss the different ways the meta-answerer can answer questions, we name the possible outcomes a meta-answerer can have when answering a question with the help of an underlying QA system (Figure 1). As we discuss the types of outcomes, we also discuss strategies that lead human meta-answerers to each outcome.

Right and Neg

The most straightforward result is that the meta-answerer selects a right answer from the QA system and provides it. This can either be confirming the answer that the QA system would have presented anyway, answering when the QA system would have abstained, or selecting a different answer. The most common way for a human meta-answerer to improve is to answer right when the model abstained. For example, answering 1967 to the question "when did colour tv come out in uk", where the answer was at the top of the baseline QA's N-best list, but below the threshold.

Sometimes humans select a wrong answer when an answer is available. We call this a negative selection, or “neg” for short. For example, the question “when did the crucifix become the symbol of christianity” has gold answers the 4th century, in the 2nd century, and 4th century. A human meta-answerer selected the 2nd century, which was not an acceptable answer.


Abstain

Many NQ instances cannot be answered: "universal social services are provided to members of society based on their income or means" is not a question, the IR system returns an article about the Paralympics for a question about the Olympics, or the page does not contain a span that could answer the question "where am i on the steelers waiting list". For such instances, the next most common outcome is for a meta-answerer to correctly recognize that it should not provide a response. The QA system abstains more than humans, which leads to human meta-answerers' biggest failing.


Fooled

The flip side of abstaining is being "fooled" into answering when no answer was possible; for example, answering At the end of the episode 'The Downward Spiral' to the question "when does stefan turn his humanity back on in season 8": humans are enticed by the context "humanity to save her". Human meta-answerers, lacking the full context, do not realize the distinction between turning humanity on and off in this vampire-based TV show. While humans have a much higher rate of being fooled than computers, some of this is attributable to annotation problems with NQ. For example, the NQ dataset associates the page for Barry Humphries (the correct answer) with the question "who plays the goblin king in the hobbit" but has no annotated gold span. In addition to vetting the validity of our meta-answering formulation, the human game can also unearth issues in the NQ dataset.


Dead

Sometimes a question is answerable, but the meta-answerer falsely abstains; we call this result a "dead" question. Sometimes the meta-answerer cannot find the answer at all. For example, for the question "when were the winnie the pooh books written", the gold answer is "1926 ) , and this was followed by The House at Pooh Corner ( 1928" (two disjoint spans). However, the underlying QA system cannot produce disjoint spans, so the human cannot find the matching answer span.

Humans improve as we move to the RewriteQues setting, adding the ability to ask new questions. Human meta-answerers can improve recall by converting baseline abstentions into right answers (and a smaller number of baseline negs). However, humans are more often fooled, resulting in lower average precision (Figure 1). Because the human meta-answerer is at the mercy of the baseline QA system, if the baseline system does not surface the answer, the question will go dead without the ability of the human to find the answer.

3.4 Human Strategies

Without contexts in the Baseline setting, the human meta-answerer was limited to examining the question-answer combination (e.g., knowing that "Germany" is not a part of Central Germany or that Jennifer is not the "meaning of the name Sinead"). However, these cases are rare enough that human meta-answerers do not improve the system overall.

With Context, human meta-answerers use context to select better answers than the system. For example, seeing near Arenosa Creek and Matagorda Bay next to "was settled by explorer Robert Cavelier de La Salle" in the Wikipedia page context allows them to convert a dead model question into a correct one. However, humans are often fooled as well. For example, for the question "when did the united kingdom entered world war 2", most humans answer 3 September after seeing the context "Two days later, on 3 September, after a British ultimatum to Germany to cease military operations was ignored, Britain and France declared war", which is not in the NQ answer set (presumably the NQ annotators felt that there was ambiguity about when war was declared); the model correctly abstained.

With RewriteQues, humans can more thoroughly probe the source document to establish whether an answer is correct. E.g., while the baseline system answers the question “who plays eddie’s father on blue bloods” with Eric Laneuville, humans can explore outside the source document to find the correct answer (William Sadler, who plays Armin Janko, only appears in the sixth season, while the NQ source document is about an earlier season) or within the page to establish that Eric Laneuville directed several episodes of Blue Bloods.

Segment:      Question/Title   Answer          History
Sub-segment:                   cl   a    cr    cl   a    cr
Feature:                       0.0  0.0  0.0   0.0  0.0  0.0
Table 2: We represent all relevant input to the meta-answerer by generalizing the idea of segment embeddings to multiple independently varying layers, allowing us to capture the internal structure of observations, and by adding a layer of scalar feature strengths to capture the original QA model's prediction. See text for discussion.

4 Machines as meta-answerer

Human meta-answerers follow roughly the following strategy: they inspect as many candidates as needed to form an opinion about the candidate answers, based on the available context, until they decide whether to answer and how. Given that the information provided by the QA system is noisy and incomplete, we argue that a suitable architecture needs to model two related tasks. One task is, necessarily, to score candidate answers; we call this model p_A. The second task involves evaluating incoming observations, to select those that provide the most reliable information to pick the right answer or abstain. We call this model p_H, as its goal is to select the most relevant history of the episode, to be retained to make the final decision. We note that the selection task is also necessary because the capacity of the encoder is limited to a fixed number of tokens.

4.1 System architecture

We model the tasks by training two binary classifiers: p_A, for answer scoring, and p_H, for history selection. Both are implemented as output layers on top of a BERT encoder (Devlin et al., 2018) which translates a semi-structured input into a single dense vector.

The intended meaning of p_A(1 | a, q, t, H) is the probability that a is a correct answer to question q, assuming that the (unobserved) Wikipedia page where a occurs has title t, and considering some history H = (o_1, ..., o_k), where each o_j was picked as 'relevant' for answering. The intended meaning of p_H(1 | q, t, H) is the probability that H is a useful sequence of observations to evaluate answer candidates for q using p_A. Abbreviating the respective conditioning information as inputA and inputH (inputH is identical to inputA except for the tokens corresponding to the candidate answer being masked out):

p_A(1 | a, q, t, H) = sigma(w_A . enc(inputA)),    p_H(1 | q, t, H) = sigma(w_H . enc(inputH)),

where enc(.) denotes the [CLS] vector computed by the BERT encoder.
4.2 Semi-structured embeddings

The setup above follows the standard way of training (binary) classifiers on top of the [CLS] vector computed by BERT, with the sole difference that our input comprises not a single sequence of tokens and segment types, but several segment types and an additional feature vector, structured as follows (cf. Table 2).

W_tok is the embedding matrix for textual tokens. W_seg is the matrix for the top-level segment types used to differentiate the answer (A), the question (Q) and observations (O). W_sub is the matrix for sub-segment types used to distinguish the answer span (a) and its surrounding left (cl) and right (cr) context, within each primary segment type. w_feat is a single embedding vector for the QA system's answer score. Finally, W_pos is the embedding matrix for the absolute position (required by BERT to model the sequential nature of text in the absence of a recurrence mechanism). All embeddings have the same number of dimensions, so we essentially compute the i-th input vector x_i by summing the corresponding embeddings across the rows of the i-th column of Table 2 and adding the feature embedding, compressing the four annotation layers in Table 2, plus the position embeddings, into a single sequence of dense vectors, one for each input position, as follows:

x_i = W_tok[tok_i] + W_seg[seg_i] + W_sub[sub_i] + W_pos[i] + s_i * w_feat,

where s_i is the scalar feature strength (the QA score) at position i.
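The per-position sum described above can be sketched as follows. Whether the score scales a learned feature vector or indexes a bucketed embedding is an implementation detail; this sketch assumes scaling:

```python
def embed_position(tok, seg, sub, pos, score, E):
    """One input position: sum the token, segment, sub-segment, and position
    embeddings, then add the score-scaled feature vector (assumed scaling).
    E maps each layer name to an embedding table (lists of equal dimension)."""
    rows = [E["tok"][tok], E["seg"][seg], E["sub"][sub], E["pos"][pos]]
    feat = [score * f for f in E["feat"]]
    return [sum(vals) + feat[d] for d, vals in enumerate(zip(*rows))]
```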
4.3 Answer candidate selector

The answer selector uses p_A(1 | a, q, t, H) - the probability that a is a correct answer to question q, given the available information - to score all answer candidates and pick the one with the highest probability. p_A is a single feed-forward layer with parameters w_A, taking the [CLS] vector computed for the entire conditioning information as input. It is trained with a standard binary cross-entropy loss, assuming examples of the form (q, t, o, H, y), where q is a question, t a page title, o a candidate answer in context, H some additional history of observations, and y a binary label indicating whether the answer span contained in o is, in fact, among the correct answers for q.

One can easily generate a dataset like this from the original NQ data, providing the questions and labels, and using a strong baseline system (e.g., Alberti et al. (2019b)) to generate an N-best list of answers. As we expect most candidates to be incorrect, we aggressively downsample negative examples (where y = 0, i.e., o contains an incorrect answer) at runtime to get roughly balanced proportions of positive and negative examples. We do this 'dynamically'; that is, for each epoch different negatives will be picked so that, in expectation, we still expose the model to the full N-best list.
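Dynamic negative downsampling of this kind can be sketched as follows (a per-epoch resampling helper of our own design, not the authors' code):

```python
import random

def downsample(examples, rng):
    """Keep all positives; keep each negative with a probability chosen so
    that positives and negatives are balanced in expectation. Called once
    per epoch so that, over training, different negatives are seen."""
    pos = [e for e in examples if e["label"] == 1]
    neg = [e for e in examples if e["label"] == 0]
    keep = min(1.0, len(pos) / max(1, len(neg)))
    return pos + [e for e in neg if rng.random() < keep]
```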

The final question concerns the generation of H, the observations used as 'history'. We found that simply sampling (without replacement) a random set of observations from each question's N-best list works well in practice. Assuming a dataset D of examples generated as described, the loss is

L_A = - sum over (q,t,o,H,y) in D of [ y log p_A(1 | q,t,o,H) + (1 - y) log(1 - p_A(1 | q,t,o,H)) ].
4.4 History selector

The history selector uses p_H(1 | q, t, H) - the probability that H is a useful sequence of observations to evaluate answers for q using p_A - to pick, among a set of candidate histories, the highest scoring one. Like p_A, we model p_H as a single feed-forward layer with parameters w_H on top of the [CLS] vector computed for the conditioning information. Note that the input to p_H does not contain any answer candidate a, as p_H corresponds to the probability that H provides good evidence for scoring different answer candidates for q and t. We achieve this by masking the A slot of the input when computing the input to p_H.

As we do not have access to examples of good/bad histories directly, we induce a pseudo-label via a history discriminator which, given two candidate histories H and H', identifies which one is better in the following sense:

d(H, H' | q, t, o, y) = 1 if l(H, y) < l(H', y), and 0 otherwise, where l(H, y) = -[ y log p_A(1 | q,t,o,H) + (1 - y) log(1 - p_A(1 | q,t,o,H)) ].

For readability, we abbreviate the expression above as d(H, H'). An intuitive way of thinking about the history discriminator is that, for a given pair of candidate histories H and H' and one of the training examples used to train p_A, it assigns 1 to the history whose empirical cross-entropy loss on this one example is smallest; or, phrased positively, to the history that provides the most information about answering this question.
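The discriminator reduces to comparing per-example cross-entropy losses under the two histories. A sketch, taking the answer model's predicted probabilities as inputs:

```python
import math

def example_loss(p, y):
    """Binary cross-entropy of predicted probability p against label y."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)

def discriminate(p_given_H, p_given_Hprime, y):
    """Pseudo-label: 1 if history H incurs lower loss on this example
    than the alternate history H', else 0."""
    return 1 if example_loss(p_given_H, y) < example_loss(p_given_Hprime, y) else 0
```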

This induces a loss function to train p_H, which discovers good histories, reusing the dataset D from above (with H' an alternate candidate history, generated as described below):

L_H = - sum over (q,t,o,H,y) in D of [ d(H, H') log p_H(1 | q,t,H) + (1 - d(H, H')) log(1 - p_H(1 | q,t,H)) ].
We found that generating the alternate history H' by randomly replacing exactly one of the observations in H is effective. (Choosing pairs with an observation Hamming distance of one also matches how we use p_H in our decoding algorithm.)

One can further motivate the history loss by noting that it maximizes the expected reduction in entropy for p_A produced by substituting one history for the other. While the full expectation would require summing over all possible answers for q, training on a single answer defines an unbiased (although noisy) estimate of this expectation. The loss and, consequently, the training signal for p_H implicitly depend on a p_A. We found that one can effectively co-train p_A and p_H from scratch.

4.5 Auxiliary impossibility loss

51% of the examples in the NQ dataset are 'unanswerable' questions, for which there is no gold answer to be found in the entire context. Learning when to abstain is thus an important part of doing well on NQ.

Instead of modeling the abstain decision explicitly, Alberti et al. (2019b) and Alberti et al. (2019a) demonstrated that good performance can be achieved by always predicting some answer with a score, and picking an optimal threshold on this score on the development data. While we follow this practice, having our model always predict the answer scored highest by p_A and then abstaining in cases where the highest score is below the threshold, we found it important for training to co-learn an impossibility model p_I jointly with p_A and p_H, again implemented as a single feed-forward layer on top of the [CLS] vector. Thus, while we do not use p_I at test time, we still add the following loss to training: L_I = - sum over (q,t,o,H,z) in D_I of [ z log p_I(1 | q,t,H) + (1 - z) log(1 - p_I(1 | q,t,H)) ]. Here, D_I is a dataset trivially derived from the original data, with q, t, o, H being defined as above, and z is a binary label that is 1 if the example is considered unanswerable, and 0 else. Note that p_I uses the same conditioning information as p_H, making it cheap to co-train as, unlike p_A, we can reuse the same [CLS] vector as input.
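Threshold selection on dev can be sketched as a simple sweep over candidate thresholds, maximizing F1 over the answered questions (our helper, not the authors' code):

```python
def best_threshold(scored_dev):
    """Pick the score threshold maximizing F1 on dev; below it we abstain.
    scored_dev: list of (score, is_correct, has_gold_answer) triples."""
    candidates = sorted({s for s, _, _ in scored_dev})
    best_t, best_f1 = 0.0, -1.0
    total_gold = sum(1 for _, _, g in scored_dev if g)
    for t in candidates:
        answered = [(c, g) for s, c, g in scored_dev if s >= t]
        tp = sum(1 for c, _ in answered if c)
        prec = tp / len(answered) if answered else 0.0
        rec = tp / total_gold if total_gold else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```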

4.6 Training

We initialize BERT from the public checkpoint. The power of encoders like BERT comes to a large degree from the masked language model pre-training task. To ensure our encoder is accustomed to the kind of input described in Table 2, we perform additional language model pre-training on the public large BERT checkpoint, using examples in D to compute input sequences and randomly masking 30 tokens in each. We run the pre-training for 200,000 steps, using a batch size of 32.

We found that doing our own pre-training produces better models, most likely because the kind of inputs in Table 2 differ substantially from the natural language text that the available public checkpoints were trained on. Additionally, we found it useful to use the masked language modeling task (masking one single random token from every input) as an auxiliary task, co-trained with p_A, p_H, and p_I.

We combine the four losses into a single weighted loss and treat the per-loss weights as hyper-parameters:

L = lambda_A L_A + lambda_H L_H + lambda_I L_I + lambda_LM L_LM.
Making predictions

Parameters: question q, title t, answer candidates a_1, ..., a_N with observations o_1, ..., o_N (ranked by QA score), history size k, trained models p_A and p_H
H <- (o_1, ..., o_k)
for i <- k + 1 to N do
       for j <- 1 to k do
              if p_H(1 | q, t, H[j -> o_i]) > p_H(1 | q, t, H) then
                     H <- H[j -> o_i]
              end if
       end for
end for
return the highest-scoring answer under p_A(. | a, q, t, H) and its score
Algorithm 1 Prediction algorithm for the meta-answerer paired with machine reading system QA.

Having introduced the components of our meta-answerer and the losses with which we train them, we now describe how, at test time, we can generate answer predictions.

We use Algorithm 1, which implements the following heuristic for using p_A and p_H, once trained, to perform as a meta-answerer on top of an existing QA system.

The algorithm first builds a size-k history composed of the observations for the top-k answer candidates proposed by the base QA system. It then iterates over the remaining answer candidates produced by the base QA system and decides, using p_H, whether or not to replace any of the existing observations in the history with the observation for the new answer candidate. We denote with H[j -> o] the history obtained by replacing the j-th element of H with observation o. In simpler terms, the algorithm greedily selects those observations that maximize p_H at every step and, after a single pass over the answer candidates, produces the 'best' history H.

Once all answer candidates have been processed and a k-size history has been selected, we score all answer candidates using p_A and return the highest-scoring answer and its score as our prediction.
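The greedy substitution pass can be sketched with pluggable scoring functions standing in for the history and answer models (a toy rendition of the heuristic, not the authors' implementation):

```python
def greedy_history(observations, k, score_history, score_answer):
    """Seed the history with the top-k observations, then for each remaining
    observation try substituting it into each slot, keeping the swap only if
    the history score improves; finally score all candidates against H."""
    H = list(observations[:k])
    for o in observations[k:]:
        best_score, best_H = score_history(H), None
        for j in range(len(H)):
            trial = H[:j] + [o] + H[j + 1:]
            s = score_history(trial)
            if s > best_score:
                best_score, best_H = s, trial
        if best_H is not None:
            H = best_H
    best_a, best_s = max(((a, score_answer(a, H)) for a in observations),
                         key=lambda x: x[1])
    return H, best_a, best_s
```

With numbers standing in for observations, `sum` as the history scorer, and the candidate's own value as its answer score, the pass keeps the largest observations and returns the largest candidate.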

System Short Answer Dev Short Answer Test
P R F1 P R F1
BERT (Alberti et al., 2019b) 59.5 47.3 52.7 63.8 44.0 52.1
BERTWWM 59.5 51.9 55.4 63.1 51.4 56.6
Single annotator (Kwiatkowski et al., 2019) 63.4 52.6 57.5 - - -
MMA 65.4 51.7 57.7 64.8 52.8 58.2
Table 3: Results on the Short Answer task of the Natural Questions dataset.

5 Machine meta-answerer Experiments

We take as starting point Alberti et al. (2019b)'s QA model trained on NQ, and refer to it as BERT. Adding whole-word masking (Liu et al., 2019) adds another 4.5 F1 points, and we use this BERTWWM as the environment QA system. We performed an extensive hyper-parameter search, summarized in Figure 2; the N-best list size, the history size k, and the loss weights used to compute the weighted training loss were chosen for best results on dev. We trained all losses jointly for 100,000 steps, initializing the encoder with our custom pre-trained BERT large model and randomly initializing the answer, impossibility, and history heads.

The full results on the NQ data for the short answer task are reported in Table 3, for BERT, BERTWWM, and our machine meta-answerer, MMA. MMA adds another absolute 2.3 F1 points on dev, outperforming single-rater human performance; on test, it improves over BERTWWM by 1.6 F1 points, reaching 58.2 F1, which, as of 2019/09/25, was the best-performing non-ensemble model on the leaderboard.

System       Abstain                                Answer
             Correct  Incorrect  Total   Accuracy   Correct  Incorrect  Total   Accuracy
MMA          3876     1117       4993    77.63%     1807     1030       2837    63.69%
BERTWWM      3765     1056       4821    78.10%     1793     1216       3009    59.59%
Difference   +111     +61        +172               +14      -186       -172
Table 4: Overall accuracy for the original model and the agent, split by action type.

Table 4 breaks down the results of MMA and BERTWWM by decision type (abstain, answer) on the dev partition of NQ. One can see that MMA answers significantly less often than the QA system (2837 vs. 3009 answers). In particular, it is more accurate in avoiding incorrect answers (-186), while also producing a slight positive margin in terms of correct answers (+14). With respect to abstaining, the meta-answerer's win-to-loss ratio is almost 2:1 (111/61). Overall, MMA implements a cautious strategy. From looking at the human meta-answering tasks, our impression is that the space of discriminative questions in NQ (the questions that are neither too simple nor too hard) is rather narrow, which rewards risk-averse strategies.
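The per-decision accuracies in Table 4 follow directly from the raw counts; a quick sanity check:

```python
# Recomputing the per-decision accuracies in Table 4 from the raw counts.

def accuracy(correct, incorrect):
    """Accuracy in percent for a set of decisions."""
    return 100 * correct / (correct + incorrect)

# MMA: abstain decisions (3876 correct, 1117 incorrect),
#      answer decisions (1807 correct, 1030 incorrect).
print(round(accuracy(3876, 1117), 2))  # abstain: 77.63
print(round(accuracy(1807, 1030), 2))  # answer:  63.69
```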

We identify three main types of patterns characterizing different answer predictions (non-abstaining) between BERTWWM and MMA. Sometimes both choose a span that contains the right answer, but one misses part of the gold span; e.g., for "2017 Hurricane Ophelia During the autumn of 2017" the systems disagree with respect to the second date. Other times they choose different entities; e.g., for "what is the highest peak in the Ozarks?" one picks "Buffalo Lookout", the other "Turner Ward Knob". Both patterns above are roughly balanced in terms of win/loss ratio. The third pattern is where the same answer is chosen ("The Beatles") but from different spans; here MMA seems to have an advantage, with a win/loss ratio close to 2:1.

5.1 Comparing Computer and Human meta-answerers

In both cases, the meta-answerer takes an initial list of answer candidates and improves them. The computer meta-answerer slightly improves both precision and recall; qualitatively, the changes are minor: tweaking a span, adding a word, favoring an earlier span, or improving the abstention threshold.

Human meta-answerers, however, were bolder. This boldness allows them to greatly improve recall, digging deep into the n-best list to find the answers they believe best answer the question. However, it comes at the cost of lower precision; they are often fooled by plausible-sounding answers that the NQ annotators did not agree with.

The human meta-answerers were frustrated with some of the NQ annotations. For example, "when did the crucifix become the symbol of christianity" can be answered with "the 4th century", "4th century", or "in the 2nd century", but not "the 2nd century" or "2nd century". While the computer meta-answerer could not express its frustration, the dev and test scores were not always well correlated.

Figure 2: Results of the hyper-parameters search. We report the max value over several runs.

6 Related Work

The work of Nogueira and Cho (2017) and Buck et al. (2018) is related to meta-answering. There, agents are trained with reinforcement learning (RL) to find the best answer while interacting with a black box QA system. Agents learn to reformulate the original question in single-step episodes. Here, we do not consider reformulations, although including adaptive language generation actions is a promising direction for future work, as underlined by the human experiments. Also, meta-answering naturally takes place over multi-step episodes. In addition, in our work the history (the state) is not simply the original question; instead it summarizes multiple observations, and the history composition becomes a key sub-problem.

Another connection with RL is the state representation problem. Recent work has highlighted the role of auxiliary tasks for effective representation learning (Jaderberg et al., 2017; Such et al., 2019; Bellemare et al., 2019). Such work deals mostly with navigation or arcade environments and, in general, not much research has focused on complex language tasks. We confirm that modeling auxiliary tasks is useful. Given that we rely on a simple heuristic decoding policy, performance must be attributed primarily to the state representation, which BERT is able to exploit.

The problem of optimizing the input representation for machine reading is also gaining importance in the language processing community, although it is addressed in different terms and driven primarily by resource scarcity. Deep BERT models (Devlin et al., 2018) provide the basis for virtually all state-of-the-art QA systems. However, they can only encode a limited amount of text; e.g., 512 tokens for the system of Alberti et al. (2019c). In the Natural Questions task, and others, documents are much longer. A simple workaround is to split the input into overlapping windows (Devlin et al., 2018; Alberti et al., 2019c). The highest scoring span is then computed considering all positions in the document, which is not necessarily optimal. A common theme among proposed solutions is to generate a summary of the full document, then run the machine reader on the compressed context. Han et al. (2019) propose an episodic memory reader that uses RL to build the summary, and show that this approach works better than baselines such as rule-based memory scheduling and other RL variants. Other work combines answering and summarization with the purpose of providing justification for the answer (Nishida et al., 2019). While this is a good approach for tasks where explanations are annotated (Yang et al., 2018; Thorne et al., 2018) and constitute part of the evaluation, there is no evidence so far, to the best of our knowledge, that summarization-based approaches are competitive with the state of the art in machine reading.
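The overlapping-window workaround can be sketched as follows; the window size and stride values are illustrative, not the exact configuration of the cited systems.

```python
# Split a long token sequence into fixed-size overlapping windows so that
# every token appears in at least one window; the reader then scores spans
# in each window and the highest-scoring span overall is returned.

def sliding_windows(tokens, window_size=512, stride=256):
    """Yield overlapping windows covering the full token sequence."""
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window_size])
        if start + window_size >= len(tokens):
            break  # the last window reached the end of the document
        start += stride
    return windows
```

A 1000-token document with these settings yields three windows starting at positions 0, 256, and 512, so every token is seen by the reader in context at least once.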

Other relevant work combines retrieval and machine reading as a joint learning problem (Lee et al., 2019; Nie et al., 2019). ORQA (Lee et al., 2019), for example, shows that it is possible to model these tasks jointly. Its architecture can outperform strong IR baselines, but its performance is far from supervised machine-reading levels, as there is no labeled data for training the retrieval step. A meta-answerer outsources the IR steps to the environment; however, it would be interesting to propagate training signals between the agent and the environment as in Lee et al. (2019).

7 Conclusion

In this paper we investigated a meta-answering framework for question answering. A meta-answerer interacts with a QA system to evaluate candidate answers in context and eventually decides whether to answer, and how, or to abstain. This setup attempts to simulate real-world (imperfect) information-seeking tasks, where a human seeks information in a setting mediated by a machine, using natural language. We found that humans can play the role of a meta-answerer and can compete with a BERT-based system trained for this task while looking only at a 5-token window on each side of a candidate answer. We also implemented a supervised system that outperforms the environment's QA system using only the same context as the humans. We find that factoring the system to model separately the selection of the episode history and the answer scoring task is beneficial. The framework lends itself naturally to experimenting with modular systems including multiple sources of information, e.g., IR-based retrieval, active question reformulation, multimodal observations, and reinforcement learning.

8 Acknowledgements

We would like to thank Chris Alberti, Tom Kwiatkowski and Kenton Lee for feedback and technical support.


  • Akerlof (1970) George Akerlof. 1970. The market for ”lemons”: Quality uncertainty and the market mechanism. The Quarterly Journal of Economics, 84(3):488–500.
  • Alberti et al. (2019a) Chris Alberti, Daniel Andor, Emily Pitler, Jacob Devlin, and Michael Collins. 2019a. Synthetic QA corpora generation with roundtrip consistency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6168–6173.
  • Alberti et al. (2019b) Chris Alberti, Kenton Lee, and Michael Collins. 2019b. A BERT baseline for the Natural Questions.
  • Alberti et al. (2019c) Chris Alberti, Kenton Lee, and Michael Collins. 2019c. A BERT baseline for the Natural Questions.
  • Bellemare et al. (2019) Marc G. Bellemare, Will Dabney, Robert Dadashi, Adrien Ali Taiga, Pablo Samuel Castro, Nicolas Le Roux, Dale Schuurmans, Tor Lattimore, and Clare Lyle. 2019. A geometric perspective on optimal representations for reinforcement learning. NeurIPS.
  • Buck et al. (2018) Christian Buck, Jannis Bulian, Massimiliano Ciaramita, Andrea Gesmundo, Neil Houlsby, Wojciech Gajewski, and Wei Wang. 2018. Ask the right questions: Active question reformulation with reinforcement learning. In Proceedings of ICLR.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Han et al. (2019) Moonsu Han, Minki Kang, Hyunwoo Jung, and Sung Ju Hwang. 2019. Episodic memory reader: Learning what to remember for question answering from streaming data. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4407–4417.
  • Jaderberg et al. (2017) Max Jaderberg, Volodymyr Mnih, Wojciech Czarnecki, Tom Schaul, Joel Z. Leibo, David Silver, and Koray Kavukcuoglu. 2017. Reinforcement learning with unsupervised auxiliary tasks. In Proceedings of ICLR.
  • Jia and Liang (2017) Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031.
  • Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: a benchmark for question answering research. Transactions of the Association of Computational Linguistics.
  • Lee et al. (2019) Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6086–6096.
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
  • Marchionini (2006) Gary Marchionini. 2006. Exploratory search: From finding to understanding. Commun. ACM, 49(4).
  • Mudrakarta et al. (2018) Pramod Kaushik Mudrakarta, Ankur Taly, Mukund Sundararajan, and Kedar Dhamdhere. 2018. Did the model understand the question? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1896–1906.
  • Nie et al. (2019) Yixin Nie, Songhe Wang, and Mohit Bansal. 2019. Revealing the importance of semantic retrieval for machine reading at scale.
  • Nishida et al. (2019) Kosuke Nishida, Kyosuke Nishida, Masaaki Nagata, Atsushi Otsuka, Itsumi Saito, Hisako Asano, and Junji Tomita. 2019. Answering while summarizing: Multi-task learning for multi-hop QA with evidence extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2335–2345.
  • Niven and Kao (2019) Timothy Niven and Hung-Yu Kao. 2019. Probing neural network comprehension of natural language arguments. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4658–4664.
  • Nogueira and Cho (2017) Rodrigo Nogueira and Kyunghyun Cho. 2017. Task-oriented query reformulation with reinforcement learning. In Proceedings of EMNLP.
  • Osborne and Rubinstein (1994) Martin J. Osborne and Ariel Rubinstein. 1994. A Course in Game Theory. MIT Press Books. The MIT Press.
  • Rajpurkar et al. (2018) Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789.
  • Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.
  • Russell (2019) Daniel M. Russell. 2019. The Joy of Search: A Google Insider’s Guide to Going Beyond the Basics. The MIT Press.
  • Such et al. (2019) Felipe Such, Vashish Madhavan, Rosanne Liu, Rui Wang, Pablo Castro, Yulun Li, Jiale Zhi, Ludwig Schubert, Marc Bellemare, Jeff Clune, and Joel Lehman. 2019. An atari model zoo for analyzing, visualizing, and comparing deep reinforcement learning agents. In Proceedings of IJCAI 2019.
  • Thorne et al. (2018) James Thorne, Andreas Vlachos, Oana Cocarascu, Christos Christodoulopoulos, and Arpit Mittal. 2018. The fact extraction and VERification (FEVER) shared task. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER).
  • Xie (2002) Hong (Iris) Xie. 2002. Patterns between interactive intentions and information-seeking strategies. Inf. Process. Manage., 38(1):55–77.
  • Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.