e-QRAQ: A Multi-turn Reasoning Dataset and Simulator with Explanations

08/05/2017 ∙ by Clemens Rosenbaum, et al. ∙ University of Massachusetts Amherst

In this paper we present a new dataset and user simulator e-QRAQ (explainable Query, Reason, and Answer Question) which tests an Agent's ability to read an ambiguous text; ask questions until it can answer a challenge question; and explain the reasoning behind its questions and answer. The User simulator provides the Agent with a short, ambiguous story and a challenge question about the story. The story is ambiguous because some of the entities have been replaced by variables. At each turn the Agent may ask for the value of a variable or try to answer the challenge question. In response the User simulator provides a natural language explanation of why the Agent's query or answer was useful in narrowing down the set of possible answers, or not. To demonstrate one potential application of the e-QRAQ dataset, we train a new neural architecture based on End-to-End Memory Networks to successfully generate both predictions and partial explanations of its current understanding of the problem. We observe a strong correlation between the quality of the prediction and explanation.




1 Introduction

In recent years deep neural network models have been successfully applied in a variety of applications such as machine translation Cho et al. (2014), object recognition Krizhevsky et al. (2012); He et al. (2016), game playing Mnih et al. (2015), dialog Weston (2016) and more. However, their lack of interpretability makes them a less attractive choice when stakeholders must be able to understand and validate the inference process. Examples include medical diagnosis, business decision-making and reasoning, legal and safety compliance, etc. This opacity also presents a challenge simply for debugging and improving model performance.

For neural systems to move into realms where more transparent, symbolic models are currently employed, we must find mechanisms to ground neural computation in meaningful human concepts, inferences, and explanations. One approach to this problem is to treat the explanation problem itself as a learning problem and train a network to explain the results of a neural computation. This can be done either with a single network learning jointly to explain its own predictions or with separate networks for prediction and explanation. Regardless, the availability of sufficient labelled training data is a key impediment.

In previous work Guo et al. (2017) we developed a synthetic conversational reasoning dataset in which the User presents the Agent with a simple, ambiguous story and a challenge question about that story. Ambiguities arise because some of the entities in the story have been replaced by variables, some of which may need to be known to answer the challenge question. A successful Agent must reason about what the answers might be, given the ambiguity, and, if there is more than one possible answer, ask for the value of a relevant variable to reduce the possible answer set. In this paper we present a new dataset e-QRAQ constructed by augmenting the QRAQ simulator with the ability to provide detailed explanations about whether the Agent’s response was correct and why.
Using this dataset we perform some preliminary experiments, training an extended End-to-End Memory Network architecture Sukhbaatar et al. (2015) to jointly predict a response and a partial explanation of its reasoning. We consider two types of partial explanation in these experiments: the set of relevant variables, which the Agent must know to ask a relevant, reasoned question; and the set of possible answers, which the Agent must know to answer correctly. We demonstrate a strong correlation between the qualities of the prediction and explanation.

2 Related Work

Current interpretable machine learning algorithms for deep learning can be divided into two approaches: one approach aims to explain black box models in a model-agnostic fashion Ribeiro et al. (June 2016); Turner (June 2016); another studies learning models, in particular deep neural networks, by visualizing for example the activations or gradients inside the networks Zahavy et al. (2016); Shrikumar et al. (2016); Selvaraju et al. (2016). Other work has studied the interpretability of traditional machine learning algorithms, such as decision trees Hara & Hayashi (June 2016), graphical models Kim et al. (2015), and learned rule-based systems Malioutov & Varshney (2013). Notably, none of these algorithms produces natural language explanations, although the rule-based system is close to a human-understandable form if the features are interpretable. We believe one of the major impediments to getting NL explanations is the lack of datasets containing supervised explanations.

Datasets have often accelerated the advance of machine learning in their respective areas Ferraro et al. (2015), including computer vision LeCun (1998); Krizhevsky & Hinton (2009); Russakovsky et al. (2015); Lin et al. (2014); Krishna et al. (2016), natural language Lowe et al. (2015); Hermann et al. (2015); Dodge et al. (2015), reasoning Weston et al. (2015); Bowman et al. (2015); Guo et al. (2017), etc. Recently, natural language explanations were added to complement existing visual datasets via crowd-sourced labeling Reed et al. (2016). However, we know of no question answering or reasoning datasets which offer NL explanations. Labeling a large number of examples with explanations is a difficult and tedious task, and not one which is easily delegated to an unskilled worker. To make progress until such a dataset is available or other techniques obviate its need, we follow the approach of existing work such as Weston et al. (2015); Weston (2016), and generate synthetic natural language explanations from a simulator.

3 The QRAQ Dataset

A QRAQ domain, as introduced in Guo et al. (2017), has two actors, the User and the Agent. The User provides a short story set in a domain similar to the HomeWorld domain of Weston et al. (2015); Narasimhan et al. (2015) given as an initial context followed by a sequence of events, in temporal order, and a challenge question. The stories are semantically coherent but may contain hidden, sometimes ambiguous, entity references, which the Agent must potentially resolve to answer the question.

  1. Hannah and Emma are in the office.

  2. John is in the park.

  3. Bob and George are in the square.

  4. Hannah picks up the gift.

  5. $v goes from the office to the park.

  6. $w goes from the park to the bank.

  7. $x goes from the office to the square.

  8. Emma goes from the square to the bank.

  9. $y goes from the square to the bank.

  10. Where is the gift?

Example 1 A QRAQ Problem

To do so, the Agent can query the User for the value of variables which hide the identity of entities in the story. At each point in the interaction, the Agent must determine whether it knows the answer, and if so, provide it; otherwise it must determine a variable to query which will reduce the potential answer set (a “relevant” variable).
In Example 1 the actors $v, $w, $x and $y are treated as variables whose value is unknown to the Agent. In the first event, for example, $v refers to either Hannah or Emma, but the Agent cannot tell which. In a realistic text this entity obfuscation might occur due to spelling or transcription errors, unknown descriptive references such as “Emma’s sibling”, or indefinite pronouns such as “somebody”. Several datasets with 100k problems each and of varying difficulty have been released to the research community and are available for download qra.

4 Explainable QRAQ: e-QRAQ

4.1 The Dataset

This paper’s main contribution is an extension to the original QRAQ simulator that provides extensive explanations of the reasoning process required to solve a QRAQ problem. These explanations are created dynamically at runtime, in response to the Agent’s actions. The following two examples illustrate these explanations, for several different scenarios:

  • The problem:

  • Silvia is in the porch.

  • Charles is in the cellar.

  • Maria is in the porch.

  • Charles goes from the cellar to the attic.

  • Charles goes from the attic to the terrace.

  • $V0 goes from the porch to the boudoir.

  • Where is Maria?

  • The interaction:

  • Possible Answers: Porch, Boudoir; Relevant Variables: $V0

  • Who is $V0?

  • $V0 is Silvia.

  • This query was helpful, since it allowed the following inference:
    We now know that $V0 is Silvia, and not Maria. Maria can therefore not be in the boudoir.

  • Possible Answers: Porch; Relevant Variables:

  • Maria is in the porch.

  • This answer is correct.

Example 2 A successfully solved e-QRAQ Problem (with explanations)

  • The problem:

  • Paul is in the attic.

  • Maria is in the cellar.

  • Charles is in the attic.

  • Maria goes from the cellar to the terrace.

  • $V4 goes from the attic to the porch.

  • Maria goes from the terrace to the boudoir.

  • Where is Charles?

  • The interaction:

  • Possible Answers: Attic, Porch; Relevant Variables: $V4

  • Who is $V1?

  • This query was not helpful, since $V1 does not even occur in the problem.

  • Possible Answers: Attic, Porch; Relevant Variables: $V4

  • Charles is in the porch.

  • This was a guess, since Charles could still have been $V4, and thereby in the Porch or in the Attic.
    This guess was correct.

Example 3 An unsuccessfully solved e-QRAQ Problem (with explanations)

The context (C), events (E), and question (Q) parts of the problem are identical to those in a QRAQ problem. In addition there is a trace of the interaction of a trained Agent (A) model with the User (U) simulator. The simulator provides two kinds of explanations in response to the Agent’s query or answer. The first kind, a natural language explanation, indicates whether the Agent’s response is correct or not and why. The second kind, an internal state explanation, provides a full description of what can be inferred in the current state of the interaction. In this case the relevant information is the set of possible answers at different points in the interaction (Porch, Boudoir / Porch for Example 2) and the set of relevant variables ($V0 / none for Example 2).

In Example 2, illustrating a successful interaction, the Agent asks for the value of $V0 and the User responds with the answer (Silvia) as well as an explanation indicating that it was correct (helpful) and why. Specifically, in this instance it was helpful because it enabled an inference which reduced the possible answer set (and reduced the set of relevant variables). On the other hand, in Example 3, we see an example of a bad query and corresponding critical explanation.
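As a rough illustration of the possible-answer and relevant-variable sets discussed for Example 2, the following Python sketch recomputes them by brute force. It is not the simulator's implementation: the function names `possible_answers` and `relevant_variables`, the tuple encoding of events, and the simplified notion of relevance (a variable counts as relevant if learning some consistent value for it shrinks the answer set) are assumptions made for this example only.

```python
from itertools import product


def possible_answers(initial, events, target, known=None):
    """Collect the target's final location over all consistent variable bindings."""
    known = dict(known or {})
    entities = sorted(initial)
    variables = sorted({a for a, _, _ in events if a.startswith("$")} - set(known))
    answers = set()
    for values in product(entities, repeat=len(variables)):
        binding = {**known, **dict(zip(variables, values))}
        location = dict(initial)
        consistent = True
        for actor, src, dst in events:
            who = binding.get(actor, actor)
            if location[who] != src:  # the actor must be where the event starts
                consistent = False
                break
            location[who] = dst
        if consistent:
            answers.add(location[target])
    return answers


def relevant_variables(initial, events, target, known=None):
    """Variables for which some consistent value strictly narrows the answer set."""
    known = dict(known or {})
    current = possible_answers(initial, events, target, known)
    unknown = sorted({a for a, _, _ in events if a.startswith("$")} - set(known))
    relevant = []
    for var in unknown:
        for entity in initial:
            narrowed = possible_answers(initial, events, target, {**known, var: entity})
            if narrowed and narrowed < current:
                relevant.append(var)
                break
    return relevant


# Example 2 from the text, encoded as (actor, from, to) events.
initial = {"Silvia": "porch", "Charles": "cellar", "Maria": "porch"}
events = [
    ("Charles", "cellar", "attic"),
    ("Charles", "attic", "terrace"),
    ("$V0", "porch", "boudoir"),
]
before = possible_answers(initial, events, "Maria")
after = possible_answers(initial, events, "Maria", {"$V0": "Silvia"})
```

Under these assumptions, `before` contains the porch and the boudoir, and fixing `$V0` to Silvia leaves only the porch, matching the interaction trace in Example 2.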

In general, the e-QRAQ simulator offers the following explanations to the Agent:


When answering, the User will provide feedback depending on whether or not the Agent has enough information to answer; that is, on whether the set of possible answers contains only one answer. If the Agent has enough information, the User will only provide feedback on whether or not the answer was correct, and on the correct answer if the answer was wrong. If the Agent does not have enough information, and is hence guessing, the User will say so and list all still relevant variables and the resulting possible answers.


When querying, the User will provide several kinds of feedback, depending on how useful the query was. A query on a variable not even occurring in the problem will trigger an explanation that says that the variable is not in the problem. A query on an irrelevant variable will result in an explanation showing that the story’s protagonist cannot be the entity hidden by that variable. Finally, a useful (i.e. relevant) query will result in feedback showing the inference that is possible by knowing that variable’s reference. This set of inferences can also serve as the detailed explanation of how to obtain the correct answer.
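The feedback rules for answering and querying can be summarized as a small dispatch. The sketch below is only illustrative: the function names and the wording of the returned strings are hypothetical and do not reproduce the simulator's actual natural language output.

```python
def query_feedback(variable, story_variables, relevant):
    """Pick the kind of feedback for a query (message wording is hypothetical)."""
    if variable not in story_variables:
        return "not helpful: variable does not occur in the problem"
    if variable not in relevant:
        return "not helpful: variable is irrelevant to the challenge question"
    return "helpful: knowing the variable allows a new inference"


def answer_feedback(answer, possible, relevant, truth):
    """Pick the kind of feedback for an answer (message wording is hypothetical)."""
    verdict = "correct" if answer == truth else f"wrong, the correct answer is {truth}"
    if len(possible) > 1:
        # More than one answer is still possible, so the Agent is guessing.
        return (
            f"guess, relevant variables: {', '.join(sorted(relevant))}; "
            f"possible answers: {', '.join(sorted(possible))}; {verdict}"
        )
    return verdict


# The two Agent moves from Example 3.
print(query_feedback("$V1", {"$V4"}, {"$V4"}))
print(answer_feedback("porch", {"attic", "porch"}, {"$V4"}, "porch"))
```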

The e-QRAQ simulator will be available upon publication of this paper at the same location as QRAQ qra for researchers to test their interpretable learning algorithms.

4.2 The “interaction flow”

  User: provides the initial problem state
  while Episode has not terminated do
     Agent: chooses an action and an explanation for the current state
     if in training mode then
        if in supervised mode then
           User: provides feedback on the action and the explanation, and provides the learning targets (i.e. ground truth actions and explanations) for the current state
           Agent: trains model on the learning targets
        else if in reinforcement learning mode then
           User: provides feedback on the action and the explanation, and provides rewards for them
           Agent: trains model on the rewards
        end if
     end if
     User: computes performance (interaction and explanation accuracies) on the action and the explanation.
     The current state becomes the next state, i.e. the state entered from the current state upon choosing the action
     if the action was an answer then
        terminate episode
     end if
  end while
Figure 1: The User-Agent Interaction

The normal interaction flow between the User and the Agent during runtime of the simulator is shown in Figure 1 and is, with the exception of the additional explanations, identical to the interaction flow for the original QRAQ problems Guo et al. (2017). This means that the User acts as a scripted counterpart to the Agent in the simulated e-QRAQ environment. We show interaction flows for both supervised and reinforcement learning modes. Additionally, we want to point out that the explanation in Figure 1 can be both kinds of User explanation, i.e. both the natural language explanation and the internal state explanation. Performance and accuracy are measured by the User, which compares the Agent’s suggested actions and the Agent’s suggested explanations with the ground truth known by the User.
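The supervised branch of this interaction flow could be sketched in Python roughly as follows. Everything here is a toy stand-in: `ScriptedUser` is hard-coded to Example 2, `TableAgent` merely memorizes targets instead of training a network, and all class and method names are hypothetical. The reinforcement learning branch would replace the targets with rewards and is omitted.

```python
def state_key(state):
    """Hashable view of a state so it can be used as a dictionary key."""
    return (frozenset(state["answers"]), frozenset(state["relevant"]))


class ScriptedUser:
    """Toy stand-in for the simulator, hard-coded to Example 2."""

    def initial_state(self):
        return {"answers": {"porch", "boudoir"}, "relevant": {"$V0"}}

    def targets(self, state):
        # Ground truth: query a relevant variable if any remain, otherwise answer.
        if state["relevant"]:
            action = ("query", sorted(state["relevant"])[0])
        else:
            action = ("answer", sorted(state["answers"])[0])
        return action, frozenset(state["answers"])

    def evaluate(self, state, action, explanation):
        true_action, true_explanation = self.targets(state)
        return action == true_action, explanation == true_explanation

    def step(self, state, action):
        if action == ("query", "$V0"):
            return {"answers": {"porch"}, "relevant": set()}
        return state


class TableAgent:
    """Memorizes supervised targets per state; a placeholder for a learned model."""

    def __init__(self):
        self.memory = {}

    def act(self, state):
        default = (("answer", "porch"), frozenset())
        return self.memory.get(state_key(state), default)

    def train(self, state, targets):
        self.memory[state_key(state)] = targets


def run_episode(user, agent, supervised=True):
    state = user.initial_state()
    scores = []
    while True:
        action, explanation = agent.act(state)
        if supervised:
            agent.train(state, user.targets(state))
        # (interaction correct?, explanation correct?) for this turn
        scores.append(user.evaluate(state, action, explanation))
        state = user.step(state, action)
        if action[0] == "answer":
            return scores


user, agent = ScriptedUser(), TableAgent()
history = [run_episode(user, agent) for _ in range(3)]
```

In this toy setting the agent answers prematurely in the first episode and, having seen the targets for both states, reproduces the query-then-answer trace of Example 2 by the third.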

5 Experimental Setup

For the experiments, we use the User simulator explanations to train an extended memory network. As shown in Figure 2, our network architecture extends the End-to-End Memory architecture of Sukhbaatar et al. (2015), adding a two-layer Multi-Layer Perceptron to a concatenation of all “hops” of the network. The explanation and response prediction are trained jointly. In these preliminary experiments we do not train directly with the natural language explanation from the User, just the explanation of what can be inferred in the current state. In future experiments we will work with the natural language explanations directly.
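To make the architectural change concrete, here is a minimal pure-Python sketch of a forward pass with two heads: an interaction head on the final hop and a two-layer MLP on the concatenation of all hops. It is a simplification under stated assumptions, not the authors' model: sentence and question embeddings are taken as given vectors, the same memories are reused at every hop, and the ReLU hidden layer, the sigmoid explanation outputs, all dimensions, and all names are choices made for this example.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def memory_hops(query, mem_in, mem_out, hops):
    """Return the controller state after every hop."""
    u, states = list(query), []
    for _ in range(hops):
        # Attention over memories, then a weighted read added to the state.
        attention = softmax([sum(a * b for a, b in zip(u, m)) for m in mem_in])
        read = [sum(p * c[d] for p, c in zip(attention, mem_out)) for d in range(len(u))]
        u = [a + b for a, b in zip(u, read)]
        states.append(u)
    return states


def forward(query, mem_in, mem_out, params, hops=4):
    states = memory_hops(query, mem_in, mem_out, hops)
    # Interaction output: distribution over actions from the final hop.
    action_probs = softmax(matvec(params["W_action"], states[-1]))
    # Explanation output: two-layer MLP on the concatenation of all hops.
    concat = [x for s in states for x in s]
    hidden = [max(0.0, h) for h in matvec(params["W1"], concat)]
    explanation = [1.0 / (1.0 + math.exp(-z)) for z in matvec(params["W2"], hidden)]
    return action_probs, explanation


rng = random.Random(0)
dim, n_sentences, n_actions, vocab_size, hidden_size, hops = 8, 5, 6, 10, 16, 4
params = {
    "W_action": random_matrix(rng, n_actions, dim),
    "W1": random_matrix(rng, hidden_size, dim * hops),
    "W2": random_matrix(rng, vocab_size, hidden_size),
}
mem_in = random_matrix(rng, n_sentences, dim)
mem_out = random_matrix(rng, n_sentences, dim)
query = [rng.uniform(-0.5, 0.5) for _ in range(dim)]
action_probs, explanation = forward(query, mem_in, mem_out, params, hops)
```

The explanation head produces one score per vocabulary entry, which is the form needed for a set-valued explanation such as the possible answers or the relevant variables.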

[Figure 2 diagram omitted. Inputs: sentences (context + events, e.g. “Joe is in the garden.”) and the challenge question (e.g. “Where is Joe?”).]

Figure 2: The modified E2E-Memory Network architecture simultaneously generating answers to the challenge question and explanations of its internal belief state, shown with four internal “hops”.

Specifically, for our experiments, we provide a classification label $a$ for the prediction output generating the Agent’s actions, and a vector $e$ of the following form to the explanation output (where $\mathrm{OH}(w)$ is a one-hot encoding of dimensionality (or vocabulary size) $n$ of word $w$, and $E$ is the explanation set):

$$e = \sum_{w \in E} \mathrm{OH}(w)$$

We then train the network, using Adam Kingma & Ba (2014), on the combined loss (where $H(y, \hat{y})$ is the cross-entropy between the true labels $y$ and the estimated labels $\hat{y}$, $\hat{a}$ is the network’s interaction output and $\hat{e}$ is the network’s explanation output):

$$\mathcal{L} = H(a, \hat{a}) + H(e, \hat{e})$$
Figure 3: The Interaction Accuracy (over 50 epochs with 1000 problems each)

For testing, we consider the network to predict an entity in the explanation if the output vector surpasses a threshold for the index corresponding to that entity. We tried several thresholds, some adaptive (such as the average of the output vector’s values), but found that a fixed threshold of .5 works best.
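The explanation target, the combined loss, and the test-time thresholding can be sketched together in a few lines of Python. This is an illustration under assumptions rather than the training code: the explanation set is encoded as a sum of one-hot vectors, the explanation output is assumed to be a per-index probability scored with binary cross-entropy, the two loss terms are assumed to be summed without weights, and all function names are hypothetical.

```python
import math


def multi_hot(explanation_set, vocab):
    """Sum of one-hot encodings of the words in the explanation set."""
    return [1.0 if word in explanation_set else 0.0 for word in vocab]


def softmax_cross_entropy(label_index, logits):
    """Cross-entropy of a single class label against softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label_index]


def binary_cross_entropy(target, probs, eps=1e-9):
    """Per-index cross-entropy between a multi-hot target and probabilities."""
    return -sum(
        t * math.log(p + eps) + (1.0 - t) * math.log(1.0 - p + eps)
        for t, p in zip(target, probs)
    )


def combined_loss(action_label, action_logits, explanation_target, explanation_probs):
    # Assumption: the interaction and explanation losses are added unweighted.
    return softmax_cross_entropy(action_label, action_logits) + binary_cross_entropy(
        explanation_target, explanation_probs
    )


def decode_explanation(probs, vocab, threshold=0.5):
    """Predict every entity whose output surpasses the fixed threshold."""
    return {word for word, p in zip(vocab, probs) if p > threshold}


vocab = ["attic", "boudoir", "porch"]
target = multi_hot({"porch", "boudoir"}, vocab)
loss = combined_loss(1, [0.2, 1.5, -0.3], target, [0.1, 0.8, 0.6])
predicted = decode_explanation([0.1, 0.8, 0.6], vocab)
```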

6 Results

To evaluate the model’s ability to jointly learn to predict and explain its predictions we performed two experiments. First, we investigate how the prediction accuracy is affected by jointly training the network to produce explanations. Second, we evaluate how well the model learns to generate explanations. To understand the role of the explanation content in the learning process we perform both of these experiments for each of the two types of explanation: relevant variables and possible answers. We do not perform hyperparameter optimization on the E2E Memory Network, since we are more interested in relative performance. While we only show a single experimental run in our Figures, results were nearly identical for over five experimental runs.

Figure 4: The Explanation Accuracies (over 50 epochs with 1000 problems each)

The experimental results differ widely for the two kinds of explanation considered, where an explanation based on possible answers provides better scores for both experiments. As illustrated in Figure 3, simultaneously learning possible-answer explanations does not affect prediction, while learning relevant-variable explanations severely impairs prediction performance, slowing the learning by roughly a factor of four. We can observe the same outcome for the quality of the explanations learned, shown in Figure 4. Here again the performance on possible-answer explanations is significantly higher than for relevant-variable explanations. Possible-answer explanations reach an F-Score of .9, while relevant-variable explanations reach only .09, with precision and recall only slightly deviating from the F-Score in all experiments.
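For reference, precision, recall, and F-Score for a set-valued explanation can be computed as in the sketch below. How the scores are aggregated over problems and turns is not stated in the text, so this per-prediction, set-based version and the function name are assumptions.

```python
def precision_recall_f(predicted, truth):
    """Set-based precision, recall and F-Score for one predicted explanation."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f_score


# One correct entity out of two predicted, one of two true entities recovered.
scores = precision_recall_f({"porch", "attic"}, {"porch", "boudoir"})
```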

We would expect explanation performance to correlate with prediction performance. Since possible-answer knowledge is primarily needed to decide if the net has enough information to answer the challenge question without guessing, and relevant-variable knowledge is needed for the net to know what to query, we analyzed the network’s performance on querying and answering separately. The memory network has particular difficulty learning to query relevant variables, reaching only about .5 accuracy when querying. At the same time, it learns to answer very well, reaching over .9 accuracy there. Since these two parts of the interaction are what we ask it to explain in the two modes, we find that the quality of the explanations strongly correlates with the quality of the algorithm executed by the network.

7 Conclusion and Future Work

We have constructed a new dataset and simulator, e-QRAQ, designed to test a network’s ability to explain its predictions in a set of multi-turn, challenging reasoning problems. In addition to providing supervision on the correct response at each turn, the simulator provides two types of explanation to the Agent: a natural language assessment of the Agent’s prediction, which includes language about whether the prediction was correct or not, and a description of what can be inferred in the current state, both about the possible answers and the relevant variables. We used the relevant-variable and possible-answer explanations to jointly train a modified E2E memory network to both predict and explain its predictions. Our experiments show that the quality of the explanations strongly correlates with the quality of the predictions. Moreover, when the network has trouble predicting, as it does with queries, requiring it to generate good explanations slows its learning. For future work, we would like to investigate whether we can train the net to generate natural language explanations and how this might affect prediction performance.