Interpretation of Natural Language Rules in Conversational Machine Reading

08/28/2018 ∙ by Marzieh Saeidi, et al.

Most work in machine reading focuses on question answering problems where the answer is directly expressed in the text to read. However, many real-world question answering problems require the reading of text not because it contains the literal answer, but because it contains a recipe to derive an answer together with the reader's background knowledge. One example is the task of interpreting regulations to answer "Can I...?" or "Do I have to...?" questions such as "I am working in Canada. Do I have to carry on paying UK National Insurance?" after reading a UK government website about this topic. This task requires both the interpretation of rules and the application of background knowledge. It is further complicated due to the fact that, in practice, most questions are underspecified, and a human assistant will regularly have to ask clarification questions such as "How long have you been working abroad?" when the answer cannot be directly derived from the question and text. In this paper, we formalise this task and develop a crowd-sourcing strategy to collect 32k task instances based on real-world rules and crowd-generated questions and scenarios. We analyse the challenges of this task and assess its difficulty by evaluating the performance of rule-based and machine-learning baselines. We observe promising results when no background knowledge is necessary, and substantial room for improvement whenever background knowledge is needed.




1 Introduction

Figure 1: An example of two utterances for rule interpretation. In the first utterance, a follow-up question is generated. In the second, the scenario, history and background knowledge (Canada is not in the EEA) is used to arrive at the answer “Yes”.

There has been significant progress in teaching machines to read text and answer questions when the answer is directly expressed in the text Rajpurkar et al. (2016); Joshi et al. (2017); Welbl et al. (2018); Hermann et al. (2015). However, in many settings, the text contains rules expressed in natural language that can be used to infer the answer when combined with background knowledge, rather than the literal answer. For example, to answer someone’s question “I am working for an employer in Canada. Do I need to carry on paying National Insurance?” with “Yes”, one needs to read that “You’ll carry on paying National Insurance if you’re working for an employer outside the EEA” and understand how the rule and question determine the answer.

Answering questions that require rule interpretation is often further complicated due to missing information in the question. For example, as illustrated in Figure 1 (Utterance 1), the actual rule also mentions that National Insurance only needs to be paid for the first 52 weeks when abroad. This means that we cannot answer the original question without knowing how long the user has already been working abroad. Hence, the correct response in this conversational context is to issue another query such as “Have you been working abroad 52 weeks or less?”

To capture the fact that question answering in the above scenario requires a dialog, we consider the following conversational machine reading (CMR) problem, as displayed in Figure 1: Given an input question, a context scenario of the question, a snippet of supporting rule text containing a rule, and a history of previous follow-up questions and answers, predict the answer to the question (“Yes” or “No”) or, if needed, generate a follow-up question whose answer is necessary to answer the original question. Our goal in this paper is to create a corpus for this task, understand its challenges, and develop initial models that can address it.

To collect a dataset for this task, we could give a textual rule to an annotator and ask them to provide an input question, scenario, and dialog in one go. This poses two problems. First, this setup would give us very little control. For example, users would decide which follow-up questions become part of the scenario and which are answered with “Yes” or “No”. Ultimately, this can lead to bias because annotators might tend to answer “Yes”, or focus on the first condition. Second, the more complex the task, the more likely crowd annotators are to make mistakes. To mitigate these effects, we aim to break up the utterance annotation as much as possible.

We hence develop an annotation protocol in which annotators collaborate with virtual users (agents that give system-produced answers to follow-up questions) to incrementally construct a dialog based on a snippet of rule text and a simple underspecified initial question (e.g., “Do I need to …?”), and then produce a more elaborate question based on this dialog (e.g., “I am … Do I need to…?”). By controlling the answers of the virtual user, we control the ratio of “Yes” and “No” answers. And by showing only subsets of the dialog to the annotator that produces the scenario, we can control what the scenario is capturing. The question, rule text and dialogs are then used to produce utterances of the kind we see in Figure 1. Annotators show substantial agreement when constructing dialogs with a three-way annotator agreement at a Fleiss’ Kappa level of . (This is well within the range of what is considered substantial agreement; Artstein and Poesio (2008).) Likewise, we find that our crowd-annotators produce questions that are coherent with the given dialogs with high accuracy.

In theory, the task could be addressed by an end-to-end neural network that encodes the question, history and previous dialog, and then decodes a Yes/No answer or question. In practice, we test this hypothesis using a seq2seq model Sutskever et al. (2014); Cho et al. (2014), with and without copy mechanisms Gu et al. (2016) to reflect how follow-up questions often use lexical content from the rule text. We find that despite a training set of 21,890 utterances, successful models for this task need a stronger inductive bias due to the inherent challenges of the task: interpreting natural language rules, generating questions, and reasoning with background knowledge. We develop heuristics that can work better in terms of identifying what questions to ask, but they still fail to interpret scenarios correctly. To further motivate the task, we also show in oracle experiments that a CMR system can help humans to answer questions faster and more accurately.

This paper makes the following contributions:

  1. We introduce the task of conversational machine reading and provide evaluation metrics.

  2. We develop an annotation protocol to collect annotations for conversational machine reading, suitable for use in crowd-sourcing platforms such as Amazon Mechanical Turk.

  3. We provide a corpus of over 32k conversational machine reading utterances, from domains such as grant descriptions, traffic laws and benefit programs, and include an analysis of the challenges the corpus poses.

  4. We develop and compare several baseline models for the task and subtasks.

2 Task Definition

Figure 1 shows an example of a conversational machine reading problem. A user has a question that relates to a specific rule or part of a regulation, such as “Do I need to carry on paying National Insurance?”. In addition, a natural language description of the context or scenario, such as “I am working for an employer in Canada”, is provided. The question will need to be answered using a small snippet of supporting rule text. Akin to machine reading problems in previous work Rajpurkar et al. (2016); Hermann et al. (2015), we assume that this snippet is pre-identified. We generally assume that the question is underspecified, in the sense that the question often does not provide enough information to be answered directly. However, an agent can use the supporting rule text to infer what needs to be asked in order to determine the final answer. In Figure 1, for example, a reasonable follow-up question is “Have you been working abroad 52 weeks or less?”.

We formalise the above task on a per-utterance basis. A given dialog corresponds to a sequence of prediction problems, one for each utterance the system needs to produce. Let $W$ be a vocabulary. Let $q = w_1 \ldots w_{n_q}$ be an input question and $r = w_1 \ldots w_{n_r}$ an input support rule text, where each $w_i \in W$ is a word from the vocabulary. Furthermore, let $h = (f_1, a_1) \ldots (f_{n_h}, a_{n_h})$ be a dialog history where each $f_i \in W^*$ is a follow-up question, and each $a_i \in \{\text{Yes}, \text{No}\}$ is a follow-up answer. Let $s$ be a scenario describing the context of the question. We will refer to $x = (q, r, h, s)$ as the input. Given an input $x$, our task is to predict an answer $y \in \{\text{Yes}, \text{No}, \text{Irrelevant}\} \cup W^*$ that specifies whether the answer to the input question, in the context of the rule text and the previous follow-up question dialog, is either Yes, No, Irrelevant or another follow-up question in $W^*$. Here Irrelevant is the target answer whenever a rule text is not related to the question $q$.
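
The input and output of a single prediction problem can be sketched as a small data structure. This is only an illustration of the definition above: the class name `Utterance`, its field names and the helper `is_follow_up` are hypothetical and do not correspond to any released ShARC format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Closed-class answers named in the task definition.
CLOSED_ANSWERS = {"Yes", "No", "Irrelevant"}


@dataclass
class Utterance:
    """One prediction problem: question, rule text, history and scenario."""
    question: str                       # underspecified input question
    rule_text: str                      # pre-identified supporting rule text
    # Each history entry is (follow-up question, "Yes" or "No").
    history: List[Tuple[str, str]] = field(default_factory=list)
    scenario: str = ""                  # free-text context; empty if none


def is_follow_up(answer: str) -> bool:
    """Anything outside the closed answer set is read as a follow-up question."""
    return answer not in CLOSED_ANSWERS


example = Utterance(
    question="Do I need to carry on paying National Insurance?",
    rule_text=("You'll carry on paying National Insurance if you're "
               "working for an employer outside the EEA"),
    scenario="I am working for an employer in Canada.",
)
```

Treating every string outside the closed set as a follow-up question mirrors the definition of the answer space as a union of three fixed labels and free text.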

3 Annotation Protocol

Figure 2: The different stages of the annotation process (excluding the rule text extraction stage). First a human annotator generates an underspecified input question (question generation). Then, a virtual user and a human annotator collaborate to produce a dialog of follow-up questions and answers (dialog generation). Finally, a scenario is generated from parts of the dialog, and these parts are omitted in the final result.

Our annotation protocol is depicted in Figure 2 and has four high-level stages: Rule Text Extraction, Question Generation, Dialog Generation and Scenario Annotation. We present these stages below, together with discussion of our quality-assurance mechanisms and method to generate negative data. For more details, such as annotation interfaces, we refer the reader to Appendix A.

3.1 Rule Text Extraction Stage

First, we identify the source documents that contain the rules we would like to annotate. Source documents can be found in Appendix C. We then convert each document to a set of rule texts using a heuristic which identifies and groups paragraphs and bulleted lists. To preserve readability during the annotation, we also split by a maximum rule text length and a maximum number of bullets.

3.2 Question Generation Stage

For each rule text we ask annotators to come up with an input question. Annotators are instructed to ask questions that cannot be answered directly but instead require follow-up questions. This means that the question should a) match the topic of the support rule text, and b) be underspecified. At present, this part of the annotation is done by expert annotators, but in future work we plan to crowd-source this step as well.

3.3 Dialog Generation Stage

In this stage, we view human annotators as assistants that help users reach the answer to the input question. Because the question was designed to be broad and to omit important information, human annotators will have to ask for this information using the rule text to figure out which question to ask. The follow-up question is then sent to a virtual user, i.e., a program that simply generates a random Yes or No answer. If the input question can be answered with this new information, the annotator should enter the respective answer. If not, the annotator should provide the next follow-up question and the process is repeated.

Figure 3: We use different annotators (indicated by different colors) to create the complete dialog tree.

When the virtual user is providing random Yes and No answers in the dialog generation stage, we are traversing a specific branch of a decision tree. We want the corpus to reflect all possible dialogs for each question and rule text. Hence, we ask annotators to label additional branches. For example, if the first annotator received a Yes as the answer to the second follow-up question in Figure 3, the second annotator (orange) receives a No.
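
To make the relation between a dialog tree and its utterances concrete, the following minimal sketch enumerates every node of a tree together with the history that leads to it. The `Node` class, the function name and the toy tree (including its outcomes) are assumptions made for illustration, not the authors' data format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

History = List[Tuple[str, str]]


@dataclass
class Node:
    """A follow-up question with Yes/No children, or a final answer at a leaf."""
    text: str
    yes: Optional["Node"] = None
    no: Optional["Node"] = None


def enumerate_utterances(node: Node,
                         history: Optional[History] = None) -> List[Tuple[History, str]]:
    """Return one (history, target) pair for every node in the tree."""
    history = history or []
    out = [(history, node.text)]
    for answer, child in (("Yes", node.yes), ("No", node.no)):
        if child is not None:
            # Extend the history with the question just asked and its answer.
            out += enumerate_utterances(child, history + [(node.text, answer)])
    return out


# Hypothetical toy tree; the outcomes are illustrative only.
tree = Node(
    "Are you working for an employer outside the EEA?",
    yes=Node("Have you been working abroad 52 weeks or less?",
             yes=Node("Yes"), no=Node("No")),
    no=Node("No"),
)
all_utterances = enumerate_utterances(tree)
```

In this toy tree there are five nodes, so five utterances are produced: two whose target is a follow-up question and three whose target is a final answer.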

3.4 Scenario Annotation Stage

In the final stage, we choose parts of the dialogs created in the previous stage and present this to an annotator. For example, the annotator sees “Are you working or preparing for work?” and No. They are then asked to write a scenario that is consistent with this dialog such as “I am currently out of work after being laid off from my last job, but am not able to look for any yet.”. The number of questions and answers that the annotator is presented with for generating a scenario can vary from one to the full length of a dialog. Users are encouraged to paraphrase the questions and not to use many words from the dialog.

In an attempt to make these scenarios closer to the real-world situations where a user may provide a lot of unnecessary information to an operator, not only do we present users with one or more questions and answers from a specific dialog but also with one question from a random dialog. The annotators are asked to come up with a scenario that fits all the questions and answers.

Finally, a dialog is produced by combining the scenario with the input question and rule text from the previous stages. In addition, all dialog utterances that were not shown to the final annotator are included as well as they complement the information in the scenario. Given a dialog of this form, we can create utterances that are described in Section 2.

As a result of this stage of annotation, we create a corpus of scenarios and questions where the correct answers (Yes, No or Irrelevant) to questions can be derived from the related scenarios. This corpus and its challenges will be discussed in Section 4.2.2.

3.5 Negative Examples

To facilitate the future application of the models to large-scale rule-based documents instead of rule text, we deem it to be imperative for the data to contain negative examples of both questions and scenarios.

We define a negative question as a question that is not relevant to the rule text. In this case, we expect models to produce the answer Irrelevant. For a given rule text and question pair, a negative example is generated by sampling a random question from the set of all possible questions, excluding the question itself and questions sourced from the same document, using a methodology similar to the work of Levy et al. (2017).
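
A minimal sketch of this kind of negative question sampling is given below. The `Example` record, its `doc_id` field and the function name are hypothetical; the only behaviour taken from the text is the exclusion of the question itself and of questions from the same source document.

```python
import random
from typing import List, NamedTuple


class Example(NamedTuple):
    doc_id: str      # identifier of the source document (assumed field)
    rule_text: str
    question: str


def sample_negative_question(target: Example,
                             pool: List[Example],
                             rng: random.Random) -> str:
    """Pick a random question that is not the target's own question and
    does not come from the target's source document."""
    candidates = [e.question for e in pool
                  if e.doc_id != target.doc_id and e.question != target.question]
    if not candidates:
        raise ValueError("no question from a different document is available")
    return rng.choice(candidates)


# Hypothetical toy pool.
pool = [
    Example("doc_a", "rule text A1", "Do I need to pay National Insurance?"),
    Example("doc_a", "rule text A2", "Can I stop paying after a year?"),
    Example("doc_b", "rule text B1", "Do I qualify for Statutory Maternity Leave?"),
]
negative = sample_negative_question(pool[0], pool, random.Random(0))
```

The sampled question is then paired with the target rule text and labelled Irrelevant.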

The data created so far is biased in the sense that when a scenario is given, at least one of the follow-up questions in a dialog can be answered. In practice, we expect users to also provide background scenarios that are completely irrelevant to the input question. Therefore, we sample a negative scenario for each input question and rule text pair $(q, r)$ in our data. We uniformly sample from the scenarios created in Section 3.4 for all question and rule text pairs $(q', r')$ unequal to $(q, r)$. For more details, we point the reader to Appendix D.

3.6 Quality Control

We employ a range of quality control measures throughout the process. In particular, we:

  1. Re-annotate pre-terminal nodes in the dialog trees if they have identical Yes and No branches.

  2. Ask annotators to validate the previous dialog in case previous utterances were created by different annotators.

  3. Assess a sample of annotations for each annotator and keep only those annotators with quality scores higher than a certain threshold.

  4. Require annotators to pass a qualification test before selecting them for our tasks. We also require high approval rates and restrict location to the UK, US, or Canada.

Further details are provided in Appendix B.

3.7 Cost, Duration and Scalability

The cost of different stages of annotation is as follows. An annotator was paid $0.15 for an initial question (948 questions), $0.11 for a dialog part (3,000 dialog parts) and $0.20 for a scenario (6,600 scenarios). The annotation process took 2 weeks in total. Since all annotation stages can be done through crowdsourcing in a relatively short time and at a reasonable cost using established validation procedures, the dataset can be scaled up without major bottlenecks or an impact on quality.
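
As a rough check on scale, the per-item payments above can be totalled. The sketch below uses only the unit prices and counts quoted in this section; it assumes each item is paid for exactly once and ignores platform fees, re-annotation and validation passes, so it is a lower bound on annotator payments and not a reported budget.

```python
# (unit price in US cents, number of items), as quoted in the text.
STAGES = {
    "initial question": (15, 948),
    "dialog part": (11, 3000),
    "scenario": (20, 6600),
}


def stage_costs_cents(stages):
    """Cost per stage in cents; integer arithmetic avoids rounding issues."""
    return {name: price * count for name, (price, count) in stages.items()}


costs = stage_costs_cents(STAGES)
total_cents = sum(costs.values())

for name, cents in costs.items():
    print(f"{name}: ${cents / 100:.2f}")
print(f"total: ${total_cents / 100:.2f}")  # total: $1792.20
```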


4 ShARC

In this section, we present the Shaping Answers with Rules through Conversation (ShARC) dataset. (The dataset and its Codalab challenge can be found at …)

4.1 Dataset Size and Quality

The dataset is built up from 948 distinct snippets of rule text. Each has an input question and a “dialog tree”. At each step in the dialog, a follow-up question is posed and the tree branches depending on the answer to the follow-up question (yes/no). The ShARC dataset comprises all individual “utterances” from every tree, i.e. every possible point/node in any dialog tree. There are 6,058 of these utterances. In addition, there are 6,637 scenarios that provide more information, allowing some questions in the dialog tree to be “skipped” as the answers can be inferred from the scenario. Scenarios therefore modify the dialog trees, which creates new trees. When combined with scenarios and negative sampled scenarios, the total number of distinct utterances became 37,087. As a final step, utterances were removed where the scenario referred to a portion of the dialog tree that was unreachable for that utterance, leaving a final dataset size of 32,436 utterances. (One may argue that the size of the dataset is not sufficient for training end-to-end neural models. While we believe that the availability of large datasets such as SNLI or SQuAD has helped drive the state-of-the-art forward on related tasks, relying solely on large datasets to push the boundaries of AI cannot be as practical as developing better models for incorporating common sense and external knowledge, for which we believe ShARC is a good test-bed. Furthermore, the proposed annotation protocol and evaluation procedure can be used to reliably extend the dataset or create datasets for new domains.)

We break these into train, development and test sets such that each dataset contains approximately the same proportion of sources from each domain, targeting a 70%/10%/20% split.

To evaluate the quality of dialog generation HITs, we sample a subset of 200 rule texts and questions and allow each HIT to be annotated by three distinct workers. In terms of deciding whether the answer is a Yes, No or some follow-up question, the three annotators reach an answer agreement of . We also calculate Cohen’s Kappa, a measure designed for situations with two annotators. We randomly select two out of the three annotations and compute the unweighted kappa value, repeating this 100 times and averaging to give a value of .
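
The repeated two-annotator sampling can be sketched as follows. This is one possible reading of the procedure, in which two of the three labels are drawn independently for every item on each repeat; the function names and the fixed seed are assumptions.

```python
import random
from collections import Counter
from typing import List, Sequence


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Unweighted Cohen's kappa for two equally long label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from the two annotators' marginal label frequencies.
    expected = sum(count_a[l] * count_b[l]
                   for l in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # convention chosen here: a single shared label counts as perfect
    return (observed - expected) / (1 - expected)


def averaged_pairwise_kappa(annotations: List[List[str]],
                            repeats: int = 100, seed: int = 0) -> float:
    """annotations[i] holds the three labels given to item i."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(repeats):
        pairs = [rng.sample(labels, 2) for labels in annotations]
        total += cohen_kappa([p[0] for p in pairs], [p[1] for p in pairs])
    return total / repeats
```

Averaging over repeats reduces the dependence of the estimate on which two annotations happen to be drawn.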

The above metrics measure whether annotators agree in terms of deciding between Yes, No or some follow-up question, but not whether the follow-up questions are equivalent. To approximate this, we calculate BLEU scores between pairs of annotators when they both predict follow-up questions. Generally, we find high agreement: Annotators reach average BLEU scores of , , and for maximum orders of , , and respectively.
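
For readers unfamiliar with the metric, a simplified sentence-level BLEU with a single reference is sketched below. It is unsmoothed, uses whitespace tokenisation and lower-casing, and returns zero as soon as any n-gram order has no overlap; these are simplifying assumptions, and the scores in this paper may have been computed with a different implementation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams of the given order."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(reference: str, hypothesis: str, max_order: int = 4) -> float:
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_order + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clip each hypothesis n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(sum(log_precisions) / max_order)
```

Setting `max_order` to 1, 2, 3 or 4 gives the score for each maximum n-gram order.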

To get an indication of human performance on the sub-task of classifying whether a response should be a Yes, No or Follow-up Question, we use a similar methodology to Rajpurkar et al. (2016) by considering the second answer to each question as the human prediction and taking the majority vote as ground truth. The resulting human accuracy is .
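
The estimate described above can be sketched in a few lines. The function name and the tie-breaking behaviour (the label seen first wins when all three annotators disagree) are assumptions, as the text does not say how ties are resolved.

```python
from collections import Counter
from typing import List


def human_accuracy(annotations: List[List[str]]) -> float:
    """annotations[i] holds the labels for item i, in annotator order.
    The second label is the 'human prediction'; the majority label over all
    annotators is treated as ground truth."""
    correct = 0
    for labels in annotations:
        # most_common keeps first-seen order among equally frequent labels.
        majority = Counter(labels).most_common(1)[0][0]
        correct += labels[1] == majority
    return correct / len(annotations)


# Hypothetical toy annotations.
toy = [
    ["Yes", "Yes", "No"],
    ["No", "Yes", "Yes"],
    ["Yes", "No", "No"],
    ["No", "Yes", "No"],   # the second annotator disagrees with the majority
]
accuracy = human_accuracy(toy)
```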

To evaluate the quality of the scenarios, we sample 100 scenarios randomly and ask two expert annotators to validate them. We perform validation for two cases: 1) scenarios generated by turkers who did not attempt the qualification test and were not filtered by our validation process, and 2) scenarios generated by turkers who passed the qualification test and validation process. In the second case, annotators approved an average of 89 of the scenarios, whereas in the first case they approved an average of only 38. The qualification test and the validation process thus more than doubled the approval rate of the generated scenarios. In both cases, the annotators agreed on the validity of 91-92 of the scenarios. For further details on dataset quality, the reader is referred to Appendix B.

4.2 Challenges

We analyse the challenges involved in solving conversational machine reading in ShARC. We divide these into two parts: challenges that arise when interpreting rules, and challenges that arise when interpreting scenarios.

4.2.1 Interpreting Rules

When no scenarios are available, the task reduces to a) identifying the follow-up questions within the rule text, b) understanding whether a follow-up question has already been answered in the history, and c) determining the logical structure of the rule (e.g. disjunction vs. conjunction vs. conjunction of disjunctions).

To illustrate the challenges that these sub-tasks involve, we manually categorise a random sample of 100 question and rule text pairs. We identify 9 phenomena of interest, and estimate their frequency within the corpus. Here we briefly highlight some categories of interest, but full details, including examples, can be found in the appendix.


A large fraction of problems involve the identification of at least two conditions, and approximately 41% and 27% of the cases involve logical disjunctions and conjunctions respectively. These can appear in linguistic coordination structures as well as bullet points. Often, differentiating between conjunctions and disjunctions is easy when considering bullets—key phrases such as “if all of the following hold” can give this away. However, in 13% of the cases, no such cues are given and we have to rely on language understanding to differentiate. For example:

Q: Do I qualify for Statutory Maternity Leave?

R: You qualify for Statutory Maternity Leave if
  - you’re an employee not a “worker”
  - you give your employer the correct notice
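
Why the conjunction/disjunction distinction matters can be seen in a small three-valued decision sketch: the same answered conditions lead to different next steps depending on the rule's logical structure. The function below is an illustrative assumption (it covers flat conjunctions and disjunctions only, and assumes the rule grants the outcome when its conditions hold); it is not one of the baselines evaluated in this paper.

```python
from typing import List, Optional


def decide(kind: str, answers: List[Optional[bool]]) -> str:
    """kind is "and" or "or"; answers[i] is True/False for an answered
    condition and None for a condition that has not been asked yet."""
    if kind == "and":
        if any(a is False for a in answers):
            return "No"        # one failed condition settles a conjunction
        if all(a is True for a in answers):
            return "Yes"
    elif kind == "or":
        if any(a is True for a in answers):
            return "Yes"       # one satisfied condition settles a disjunction
        if all(a is False for a in answers):
            return "No"
    else:
        raise ValueError(f"unknown rule kind: {kind}")
    return "More"              # a follow-up question is still needed
```

For the maternity leave rule above, read as a conjunction, `decide("and", [True, None])` yields `"More"`: knowing that the user is an employee is not enough, and the notice condition still has to be asked. Read as a disjunction, the same history would already yield `"Yes"`.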

4.2.2 Interpreting Scenarios

Scenario interpretation can be considered as a multi-sentence entailment task. Given a scenario (premise) of (usually) several sentences, and a question (hypothesis), a system should output Yes (Entailment), No (Contradiction) or Irrelevant (Neutral). In this context, Irrelevant indicates that the answer to the question cannot be inferred from the scenario.

Different types of reasoning are required to interpret the scenarios. Examples include numerical reasoning, temporal reasoning and implication (common sense and external knowledge). We manually label 100 scenarios with the type of reasoning required to answer their questions. Table 1 shows examples of different types of reasoning and their percentages. Note that these percentages do not add up to 100% as interpreting a scenario may require more than one type of reasoning.

Category | Question | Answer | Scenario | %
Explicit | Has your wife reached state pension age? | Yes | My wife just recently reached the age for state pension | 25%
Temporal | Did you own it before April 1982? | Yes | I purchased the property on June 5, 1980. | 10%
Geographic | Do you normally live in the UK? | No | I’m a resident of Germany. | 7%
Numeric | Do you work less than 24 hours a week between you? | No | My wife and I work long hours and get between 90 - 110 hours per week between the two of us. | 12%
Paraphrase | Are you working or preparing for work? | No | I am currently out of work after being laid off from my last job, but am not able to look for any yet. | 19%
Implication | Are you the baby’s father? | No | My girlfriend is having a baby by her ex. | 51%
Table 1: Types of reasoning and their proportions in the dataset based on 100 samples. Implication includes reasoning beyond what is explicitly stated in the text, including common sense reasoning and external knowledge.

5 Experiments

To assess the difficulty of ShARC as a machine learning problem, we investigate a set of baseline approaches on the end-to-end task as well as the important sub-tasks we identified. The baselines are chosen to assess and demonstrate both feasibility and difficulty of the tasks.


For all following classification tasks, we use micro- and macro-averaged accuracies. For the follow-up generation task, we compute the BLEU scores at orders 1, 2, 3 and 4 between the gold follow-up questions and the predicted follow-up questions for all utterances in the evaluation dataset.
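
The difference between the two accuracy averages can be sketched as follows. The sketch assumes that macro-averaged accuracy means the unweighted mean of per-class accuracies, with classes defined by the gold label; the function name is hypothetical.

```python
from collections import defaultdict
from typing import Sequence, Tuple


def micro_macro_accuracy(gold: Sequence[str],
                         pred: Sequence[str]) -> Tuple[float, float]:
    """Micro: fraction of all utterances classified correctly.
    Macro: mean over gold classes of the per-class accuracy."""
    per_class = defaultdict(lambda: [0, 0])  # gold label -> [correct, total]
    for g, p in zip(gold, pred):
        per_class[g][0] += g == p
        per_class[g][1] += 1
    micro = sum(c for c, _ in per_class.values()) / len(gold)
    macro = sum(c / t for c, t in per_class.values()) / len(per_class)
    return micro, macro


# A model that always answers "Yes" on a skewed toy sample.
micro, macro = micro_macro_accuracy(["Yes", "Yes", "Yes", "No"],
                                    ["Yes", "Yes", "Yes", "Yes"])
```

On this toy sample the micro accuracy is 0.75 while the macro accuracy is 0.5, which shows why reporting both is informative when class frequencies are unbalanced.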

5.1 Classification (excluding Scenarios)

On each turn, a CMR system needs to decide, either explicitly or implicitly, whether the answer is Yes or No, whether the question is not relevant to the rule text (Irrelevant), or whether a follow-up question is necessary—an outcome we label as More. In the following experiments, we will test whether one can learn to make this decision using the ShARC training data.

When a non-empty scenario is given, this task also requires an understanding of how scenarios answer follow-up questions. In order to focus on the challenges of rule interpretation, here we only consider empty scenarios.

Formally, for an utterance $x = (q, r, h, s)$, we require models to predict an answer $y$ where $y \in \{\text{Yes}, \text{No}, \text{Irrelevant}, \text{More}\}$. Since we consider only the classification task without scenario influence, we consider the subset of utterances such that the scenario $s$ is empty. This data subset consists of train, dev and test utterances.


We evaluate various baselines: a random baseline; a surface logistic regression applied to a TFIDF representation of the rule text, question and history; a rule-based heuristic that makes predictions depending on the number of overlapping words between the rule text and question, on whether the rule is conjunctive or disjunctive, on negative mismatch between the rule text and the question, and on what the answer to the last follow-up question in the history was; a feature-engineered Random Forest; and a Convolutional Neural Network applied to the tokenised inputs of the concatenated rule text, question and history.


We find that, for this classification sub-task, Random Forest slightly outperforms the heuristic. All learnt models considerably outperform the random and majority baselines.

Model Micro Acc. Macro Acc.
Surface LR
Random Forest
Table 2: Selected Results of the baseline models on the classification sub-task.

5.2 Follow-up Question Generation without Scenarios

When the target utterance is a follow-up question, we still have to determine what that follow-up question is. For an utterance $x = (q, r, h, s)$, we require models to predict an answer $y$ where $y$ is the next follow-up question $f_{m+1}$, if $x$ has a history of length $m$. We therefore consider the subset of utterances such that the scenario $s$ is empty and $y \notin \{\text{Yes}, \text{No}, \text{Irrelevant}\}$. This data subset consists of 1071 train, 112 dev and 424 test utterances.


We first consider several simple baselines to explore the relationship between our evaluation metric and the task. As annotators are encouraged to re-use the words from rule text when generating follow-up questions, a baseline that simply returns the final sentence of the rule text performs surprisingly well. We also implement a rule-based model that uses several heuristics.
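
A baseline of this kind takes only a few lines. The sketch below assumes a naive sentence splitter on terminal punctuation followed by whitespace, and the example rule text is hypothetical; the authors' exact sentence segmentation is not described here.

```python
import re


def last_sentence_baseline(rule_text: str) -> str:
    """Return the final sentence of the rule text as the predicted follow-up."""
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", rule_text.strip())
                 if s.strip()]
    return sentences[-1] if sentences else ""


# Hypothetical rule text.
prediction = last_sentence_baseline(
    "You can apply if you live in the UK. You must be over 18.")
```

Because such a prediction is copied verbatim from the rule text, it can share many n-grams with a gold follow-up question without being phrased as a question.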

If framed as a seq2seq task, a modified CopyNet Gu et al. (2016) is most promising. We also experiment with span extraction and sequence-tagging approaches to identify relevant spans from the rule text that correspond to the next follow-up questions. We find that Bidirectional Attention Flow (BiDAF) Seo et al. (2017) performed well. We use the AllenNLP implementations of BiDAF and DAM; further implementation details can be found in Appendix H.


Our results, shown in Table 3, indicate that systems that return contiguous spans from the rule text perform better according to our BLEU metric. We speculate that the logical forms in the data are challenging for existing models to extract and manipulate, which may explain why the explicit rule-based system performed best. We further note that only the rule-based and NMT-Copy models are capable of generating genuine questions rather than spans or sentences.

First Sent.
Table 3: Selected Results of the baseline models on follow-up question generation.

5.3 Scenario Interpretation

Many utterances require the interpretation of the scenario associated with a question. If the scenario is understood, certain follow-up questions can be skipped because they are answered within the scenario. In this section, we investigate how difficult scenario interpretation is by training models to answer follow-up questions based on scenarios.


We use a random baseline and also implement a surface logistic regression applied to a TF-IDF representation of the combined scenario and question. For neural models, we use the Decomposed Attention Model (DAM) (Parikh et al., 2016), trained on each of the SNLI and ShARC corpora using ELMo embeddings Peters et al. (2018).


Table 4 shows the results of our baseline models on the entailment corpus of the ShARC test set. Both the simple baselines and the neural state-of-the-art entailment models perform poorly, especially on the macro accuracy metric. This performance highlights the challenges that the scenario interpretation task of ShARC presents, many of which are discussed in Section 4.2.2.

Model Micro Acc. Macro Acc.
Surface LR
Table 4: Results of entailment models on ShARC.

5.4 Conversational Machine Reading

The CMR task requires all of the above abilities. To understand its core challenges, we compare baselines that are trained end-to-end vs. baselines that reuse solutions for the above subtasks.


We present a Combined Model (CM), which is a pipeline of the best-performing Random Forest classification model, the rule-based follow-up question generation model and the Surface LR entailment model. We first run the classification model to predict Yes, No, More or Irrelevant. If More is predicted, the Follow-up Question Generation model is used to produce a follow-up question. The rule text and the produced follow-up question are then passed as inputs to the Scenario Interpretation model. If the output of this model is Irrelevant, the CM predicts the generated follow-up question; otherwise, these steps are repeated recursively until the classification model no longer predicts More or the entailment model predicts Irrelevant, in which case the model produces a final answer.

We also investigate an extension of the NMT-Copy model on the end-to-end task. Input sequences are encoded as a concatenation of the rule text, question, scenario and history. The model consists of a shared encoder LSTM, a 4-class classification head with attention, and a decoder GRU to generate follow-up questions. The model was trained by alternating between training the classifier via a standard softmax cross-entropy loss and training the follow-up generator via seq2seq. At test time, the input is first classified, and if the predicted class is More, the follow-up generator is used to generate a follow-up question. A simpler model without the separate classification head failed to produce predictive results.
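The control flow of the Combined Model can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the three component models are passed in as plain callables with hypothetical signatures, the scenario interpreter is assumed to take the scenario and the follow-up question, and the `max_steps` guard and its fallback value are additions made so that the loop always terminates.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (follow-up question, answer) pairs


def combined_model(
    rule_text: str,
    question: str,
    scenario: str,
    history: History,
    classify: Callable[[str, str, History], str],
    generate_followup: Callable[[str, str, History], str],
    interpret_scenario: Callable[[str, str], str],
    max_steps: int = 10,  # assumed safety limit, not from the paper
) -> str:
    """Return Yes/No/Irrelevant, or the next follow-up question to ask."""
    history = list(history)  # do not mutate the caller's history
    for _ in range(max_steps):
        label = classify(rule_text, question, history)
        if label != "More":
            return label  # final answer
        followup = generate_followup(rule_text, question, history)
        answer = interpret_scenario(scenario, followup)
        if answer == "Irrelevant":
            return followup  # the scenario does not answer it: ask the user
        history.append((followup, answer))
    return "Irrelevant"  # assumed fallback when the step limit is reached


# Toy stand-ins for the three component models (purely illustrative).
def toy_classify(rule, q, hist):
    return "More" if not hist else hist[-1][1]


def toy_generate(rule, q, hist):
    return "Are you over 18?"


def toy_interpret(scenario, followup):
    return "Yes" if "I am 30" in scenario else "Irrelevant"


print(combined_model("You must be over 18.", "Can I apply?",
                     "I am 30 years old.", [],
                     toy_classify, toy_generate, toy_interpret))
```

With an informative scenario the toy pipeline resolves the follow-up question itself and returns a final answer; with an empty scenario it returns the follow-up question so that it can be put to the user.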

We find that the combined model outperforms the neural end-to-end model on the CMR task (Table 5). However, the fact that the neural model has learned to classify better than random and also to predict follow-up questions is encouraging for designing more sophisticated neural models for this task.

Model Micro Acc Macro Acc BLEU-1 BLEU-4
Table 5: Results of the models on the CMR task.
User Study

In order to evaluate the utility of conversational machine reading, we run a user study that compares CMR to a setting in which no such agent is available, i.e. the user has to read the rule text and determine the answer to the question themselves. With the agent, in contrast, the user does not read the rule text and instead only responds to follow-up questions. Our results show that users of the conversational agent reach conclusions almost twice as fast as those without it and, more importantly, are also much more accurate. Details of the experiments and the results are included in Appendix I.

6 Related Work

This work relates to several areas of active research.

Machine Reading

In our task, systems answer questions about units of text. In this sense, it is most related to work in Machine Reading Rajpurkar et al. (2016); Seo et al. (2017); Weissenborn et al. (2017). The core difference lies in the conversational nature of our task: in traditional Machine Reading the questions can be answered right away; in our setting, clarification questions are often needed. The domain of text we consider is also different (regulatory vs Wikipedia, books, newswire).


The task we propose is, at its heart, about conducting a dialog Weizenbaum (1966); Serban et al. (2018); Bordes and Weston (2016). Within this scope, our work is closest to work in dialog-based QA where complex information needs are addressed using a series of questions. In this space, previous approaches have looked primarily at QA dialogs about images Das et al. (2017) and knowledge graphs Saha et al. (2018); Iyyer et al. (2017). In parallel to our work, both Choi et al. (2018) and Reddy et al. (2018) have begun to investigate QA dialogs with background text. Our work differs not only in the domain covered (regulatory text vs Wikipedia), but also in the fact that our task requires the interpretation of complex rules, application of background knowledge, and the formulation of free-form clarification questions. Rao and Daumé III (2018) investigate how to generate clarification questions, but this does not require the understanding of explicit natural language rules.

Rule Extraction From Text

There is a long line of work in the automatic extraction of rules from text Silvestro (1988); Moulin and Rousseau (1992); Delisle et al. (1994); Hassanpour et al. (2011). This work tackles a similar problem (the interpretation of rules and regulatory text) but frames it as a text-to-structure task as opposed to end-to-end question answering. For example, Delisle et al. (1994) map text to Horn clauses. This can be very effective, and good results are reported, but it suffers from the general problem of such approaches: they require careful ontology building and layers of error-prone linguistic preprocessing, and they are difficult for non-experts to create annotations for.

Question Generation

Our task involves the automatic generation of natural language questions. Previous work in question generation has focussed on producing questions for a given text, such that the questions can be answered using this text Vanderwende (2008); Olney et al. (2012); Rus et al. (2011). In our case, the questions to generate are derived from the background text but cannot be answered by it. Mostafazadeh et al. (2016) investigate how to generate natural follow-up questions based on the content of an image. Besides not working in a visual context, our task is also different because we see question generation as a sub-task of question answering.

7 Conclusion

In this paper we present a new task as well as an annotation protocol, a dataset, and a set of baselines. The task is challenging and requires models to generate language, copy tokens, and make logical inferences. Through the use of an interactive and dialog-based annotation interface, we achieve good agreement rates at a low cost. Initial baseline results suggest that substantial improvements are possible and require sophisticated integration of entailment-like reasoning and question generation.


This work was supported in part by an Allen Distinguished Investigator Award and in part by an Allen Institute for Artificial Intelligence (AI2) award to UCI.


Appendix A Annotation Interfaces

Figure 4 shows the Mechanical-Turk interface we developed for the dialog generation stage. Note that the interface also contains a mechanism to validate previous utterances in case they have been generated by different annotators.

Figure 4: The dialog-style web interface encourages workers to extract all the rule text-relevant evidence required to answer the initial question in the form of Yes/No follow-up questions.

Figure 5 shows the annotation interface for the scenario generation task, where the first question is relevant and the second question is not relevant.

Figure 5: Annotators are asked to write a scenario that fits the given information, i.e. questions and answers.

Appendix B Quality Control

In this section, we present several measures that we take in order to create a high-quality dataset.

Irregularity Detection

A convenient property of formulating the reasoning process as a binary decision tree is class exclusivity at the final partitioning of the utterance space. That is, if the two leaf nodes stemming from the same Follow-up Question node have identical Yes or No values, this is an indication of either a mis-annotation or a redundant question. We automatically identify these irregularities, trim the subtree at the Follow-up Question node and re-annotate. This also means that our protocol effectively guarantees a minimum of two annotations per leaf node, further enhancing data quality.
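The irregularity check can be illustrated with a small sketch. The tree classes and function name below are hypothetical and chosen only for illustration: a Follow-up Question node is flagged when both of its children are leaves carrying the same answer.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Leaf:
    answer: str  # "Yes" or "No"


@dataclass
class FollowUp:
    question: str
    if_yes: Union["FollowUp", Leaf]
    if_no: Union["FollowUp", Leaf]


def find_irregular(node: Union[FollowUp, Leaf]) -> List[str]:
    """Collect follow-up questions whose two leaf children agree."""
    if isinstance(node, Leaf):
        return []
    found: List[str] = []
    both_leaves = isinstance(node.if_yes, Leaf) and isinstance(node.if_no, Leaf)
    if both_leaves and node.if_yes.answer == node.if_no.answer:
        found.append(node.question)  # mis-annotation or redundant question
    return found + find_irregular(node.if_yes) + find_irregular(node.if_no)


tree = FollowUp(
    "Are you over 18?",
    FollowUp("Do you live in the UK?", Leaf("Yes"), Leaf("Yes")),
    Leaf("No"),
)
print(find_irregular(tree))
```

Subtrees rooted at the flagged nodes would then be trimmed and sent back for annotation.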


We implement back-validation by providing the workers with two options: Yes and proceed with the task, or No and provide an invalidation reason to de-incentivize unnecessary rejections. We found this approach to be valuable both as a validation mechanism and as a means of collecting direct feedback about the task and the types of incorrect annotations encountered. We then trim any invalidated subtrees and re-annotate.

Contradiction Detection

We can introduce contradictory information by adding random questions and answers to a dialog part when generating HITs for scenario generation. Therefore, we first ask each annotator to identify whether the provided dialog parts are contradictory. If they are, the annotator will invalidate the HIT.

Validation Sampling

We sample a proportion of each worker's annotations to validate. Through this process, each worker is assigned a quality score. We only allow workers with a score higher than a certain value to participate in our HITs Snow et al. (2008). We also restrict participation to workers who meet a minimum approval rate, have a minimum number of previously completed HITs and are located in the UK, US or Canada.

Qualification Test

Amazon Mechanical Turk allows the creation of qualification tests through the API, which need to be passed by each turker before attempting any HIT from a specific task. A qualification can contain several questions with each having a value. The qualification requirement for a HIT can specify that the total value must be over a specific threshold for the turker to obtain that qualification. We set this threshold to 100%.

Possible Sources of Noise

Here we detail possible sources of noise, estimate their effects and outline the steps taken to mitigate these sources:

a) Noise arising from annotation errors: This has been discussed in detail above.

b) Noise arising from negative question generation: Some noise could be introduced due to the automatic sampling of the negative questions. To obtain an estimate, 100 negative questions were assessed by an expert annotator. It was found that only 8% of negatively sampled questions were erroneous.

c) Noise arising from the negative scenario sampling: A further 100 utterances with negatively sampled scenarios were curated by an expert annotator, and it was found that 5% of the utterances were erroneous.

d) Errors arising from the application of scenarios to dialog trees: We assume that a scenario is relevant only to the follow-up questions it was generated from and is independent of all other follow-up questions posed in that dialog tree. This assumption is not necessarily true and could result in noisy dialog utterances. 100 utterances from the subset of the data where this type of error was possible were assessed by expert annotators, and 12% of these utterances were found to be erroneous. This type of error can only affect 80% of utterances, so the estimated total effect of this type of noise is approximately 10% (0.12 × 0.80 ≈ 0.10).

Despite the relatively low levels of noise, we asked expert annotators to manually inspect and curate (if necessary) all the instances in the development and the test set that are prone to potential errors. This leads to an even higher quality of data in our dataset.

Appendix C Further Details on Corpus

We use 264 unique sources (see Table 6) drawn from a range of unique domains. For transparency and reproducibility, the source URLs are included in the corpus for each dialog utterance.

Further, the ShARC dataset composition can be seen in Table 6.

Set # Utterances # Trees # Scenarios # Sources
All 32436 948 6637 264
Train 21890 628 4611 181
Development 2270 69 547 24
Test 8276 251 1910 59
Table 6: Dataset composition.

Appendix D Negative Data

In this section, we provide further details regarding the generation of the negative examples.

D.1 Negative Questions

Formally, for each unique positive (question, rule text) pair, we take the source document of that pair and construct the set of questions that are not sourced from that document. We take a uniform random sample from this set and combine it with the rule text to generate a negative utterance whose answer is Irrelevant and whose history is an empty sequence. An example of a negative question is shown below.


Question: Can I get Working Tax Credit?

Rule text: You must also wear protective headgear if you are using a learner’s permit or are within 1 year of obtaining a motorcycle license.
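A minimal sketch of this negative sampling step is given below. The function name, the dictionary layout of an utterance and the representation of the corpus as (question, source) pairs are assumptions made for illustration only.

```python
import random
from typing import Dict, List, Tuple


def make_negative_utterance(
    rule_text: str,
    source: str,
    corpus: List[Tuple[str, str]],  # (question, source document) pairs
    rng: random.Random,
) -> Dict[str, object]:
    """Pair a rule text with a question drawn from a different source."""
    # Candidates are the questions that are not sourced from this document.
    candidates = [q for q, src in corpus if src != source]
    if not candidates:
        raise ValueError("no question from a different source is available")
    return {
        "question": rng.choice(candidates),  # uniform random sample
        "rule_text": rule_text,
        "history": [],           # empty history sequence
        "answer": "Irrelevant",  # label of every negative utterance
    }


corpus = [
    ("Can I get Working Tax Credit?", "tax"),
    ("Do I need to wear a helmet?", "motorcycle"),
]
negative = make_negative_utterance(
    "You must also wear protective headgear if you are using a learner's permit.",
    "motorcycle",
    corpus,
    random.Random(0),
)
print(negative["question"], "->", negative["answer"])
```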

D.2 Negative Scenarios

We also negatively sample scenarios so that models can learn to ignore distracting scenario information that is not relevant to the task. We define a negative scenario as a scenario that provides no information to assist answering a given question and as such, good models should ignore all details within these scenarios.

A scenario is associated with the (one or more) dialog question and answer pairs that it was generated from.

For a given unique (question, rule text) pair associated with a set of positive scenarios, we uniformly randomly sample a candidate negative scenario from the set of all possible scenarios. We then build TF-IDF representations for the set of all dialog questions associated with the pair, and for the set of dialog questions associated with the candidate scenario.

If the cosine similarity of every pair of dialog questions drawn from these two sets is less than a certain threshold, the candidate is accepted as a negative; otherwise a new candidate is sampled and the process is repeated. We then iterate over all utterances that contain the pair and use the negative scenario to create one more utterance whenever the original utterance has an empty scenario. The threshold value was validated using manual verification. An example is shown below:


Rule text: You are allowed to make emergency calls to 911, and bluetooth devices can still be used while driving.

Scenario: The person I’m referring to can no longer take care of their own affairs.
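The acceptance test for a candidate negative scenario can be sketched with a small, dependency-free TF-IDF implementation. The whitespace tokenisation, the smoothed IDF weighting, the function names and the threshold value used in the example are all assumptions; the text states only that TF-IDF representations and a cosine-similarity threshold are used.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Build sparse TF-IDF vectors (whitespace tokens, smoothed IDF)."""
    tokenised = [d.lower().split() for d in docs]
    n = len(tokenised)
    df = Counter(tok for toks in tokenised for tok in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()}
            for toks in tokenised]


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def is_valid_negative(pair_questions: List[str],
                      candidate_questions: List[str],
                      threshold: float) -> bool:
    """Accept the candidate only if every cross pair is below the threshold."""
    vecs = tfidf_vectors(pair_questions + candidate_questions)
    ours = vecs[:len(pair_questions)]
    theirs = vecs[len(pair_questions):]
    return all(cosine(u, v) < threshold for u in ours for v in theirs)


print(is_valid_negative(["Are you driving a car?"],
                        ["Can the person manage their own affairs?"],
                        threshold=0.5))  # illustrative threshold value
```

If the check fails, a new candidate scenario would be sampled and the test repeated.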

Appendix E Challenges

In this section we present a few interesting examples we encountered in order to provide a better understanding of the requirements and challenges of the proposed task.

E.1 Dialog Generation

Figure 6: Example of a complex and hard-to-interpret rule relationship.
Figure 7: Example of a hard-to-interpret rule due to complex negations. In this particular example, majority vote was inaccurate.
Figure 8: Example of a conjunctive rule relationship derived from a bulleted list, determined by the presence of “, and” in the third bullet.
Figure 9: Example of a dialog-tree for a typical disjunctive bulleted list.

Table 8 shows the breakdown of the types of challenges that exist in our dataset for dialog generation and their proportion.

Appendix F Entailment Corpus

Using the scenarios and their associated questions and answers we create an entailment corpus for each of the train, development and test sets of ShARC. For every dialog utterance that includes a scenario, we create a number of data points as follows:

For every utterance in ShARC whose input includes a scenario, we create an entailment instance that pairs the scenario with a follow-up question and assigns one of the following labels:

  • Entailment if the answer to the follow-up question is Yes and can be derived from the scenario.

  • Contradiction if the answer to the follow-up question is No and can be derived from the scenario.

  • Neutral if the answer to the follow-up question cannot be derived from the scenario.
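The labelling rule above can be written down directly. In this sketch the function names and the tuple layout are hypothetical, and `None` is used as an assumed marker for a follow-up question that the scenario does not answer.

```python
from typing import Optional, Tuple


def entailment_label(answer_from_scenario: Optional[str]) -> str:
    """Map the scenario-derived answer to an entailment label."""
    if answer_from_scenario == "Yes":
        return "Entailment"
    if answer_from_scenario == "No":
        return "Contradiction"
    return "Neutral"  # the scenario does not answer the follow-up question


def make_instance(scenario: str, followup: str,
                  answer_from_scenario: Optional[str]) -> Tuple[str, str, str]:
    """Build one (scenario, follow-up question, label) instance."""
    return scenario, followup, entailment_label(answer_from_scenario)


print(make_instance("I have worked in Canada for two years.",
                    "Are you working abroad?", "Yes"))
```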

Table 7 shows the statistics for the entailment corpus.

Set Entailment Contradiction Neutral
Table 7: Statistics of the entailment corpus created from the ShARC dataset.

Appendix G Further details on Interpreting rules

Category Example Question Example Rule Text Percentage
Simple Can I claim extra MBS items? If you’re providing a bulk billed service to a patient you may claim extra MBS items. 31%
Bullet Points Do I qualify for assistance? To qualify for assistance, applicants must meet all loan eligibility requirements including:
  • Be unable to obtain credit elsewhere at reasonable rates and terms to meet actual needs;

  • Possess legal capacity to incur loan obligations;

In-line Conditions Do these benefits apply to me? These are benefits that apply to individuals who have earned enough Social Security credits and are at least age 62. 39%
Conjunctions Could I qualify for Letting Relief? If you qualify for Private Residence Relief and have a chargeable gain, you may also qualify for Letting Relief. This means you’ll pay less or no tax. 18%
Disjunctions Can I get deported? The United States may deport foreign nationals who participate in criminal acts, are a threat to public safety, or violate their visa. 41%
Understanding Questioner Role Am I eligible? The borrower must qualify for the portion of the loan used to purchase or refinance a home. Borrowers are not required to qualify on the portion of the loan used for making energy-efficient upgrades. 10%
Negations Will I get the National Minimum Wage? You won’t get the National Minimum Wage or National Living Wage if you’re work shadowing 15%
Conjunction Disjunction Combination Can my partner and I claim working tax credit? You can claim if you work less than 24 hours a week between you and one of the following applies:
  • you work at least 16 hours a week and you’re disabled or aged 60 or above

  • you work at least 16 hours a week and your partner is incapacitated

World Knowledge Required to Resolve Ambiguity Do I qualify for Statutory Maternity Leave? You qualify for Statutory Maternity Leave if:
  • you’re an employee not a ‘worker’

  • you give your employer the correct notice

Table 8: Types of features present for question, rule text pairs and their proportions in the dataset based on 100 samples. World Knowledge Required to resolve ambiguity refers to where the rule itself doesn’t syntactically indicate whether to apply a conjunction or disjunction, and world knowledge is required to infer the rule.

Appendix H Further details on Follow-up Question Generation Modelling

Table 9 details the results for all the models considered for follow-up question generation.

First Sent.

Return the first sentence of the rule text

Random Sent.

Return a random sentence from the rule text


Surface LR

A simple binary logistic model, which was trained to predict whether or not a given sentence in a rule text had the highest trigram overlap with the target follow-up question, using a bag-of-words feature set augmented with 3 very simple engineered features (the number of sentences in the rule text, the number of tokens in the sentence and the position of the sentence in the rule text).

Sequence Tag

A simple neural model consisting of a learnt word embedding followed by an LSTM. Each word in the rule text is classified as either in or out of the subsequence to return using an I/O sequence tagging scheme.

Random Sent.
First Sent.
Last Sent.
Surface LR
Sequence Tag
Table 9: All results of the baseline models on follow-up question generation.

H.1 Further details on neural models for question generation

Table 10 details what the inputs and outputs of the neural models should be.

Model Input Output
NMT-Copy ? ?
Sequence Tag ? ? Span corresponding to follow-up question.
BiDAF Question: ? ?
Context  :  
Span corresponding to follow-up question.
Table 10: Inputs and outputs of neural models for question generation.

The NMT-Copy model follows an encoder-decoder architecture. The encoder is an LSTM. The decoder is a GRU equipped with a copy mechanism, with an attention mechanism over the encoder outputs and an additional attention over the encoder outputs with respect to the previously copied token. We achieved the best results by limiting the model's generator vocabulary to only very common interrogative words. We train with a 50:50 teacher-forcing / greedy decoding ratio. At test time we greedily select the next word to generate, but prevent repeated tokens from being generated by choosing the second-highest-scoring token if the highest would result in a repeat.
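The test-time repeat-avoidance rule can be sketched in isolation. This is an illustration only: it assumes that a "repeat" means emitting the same token twice in a row, and it takes precomputed per-step scores as input, whereas a real decoder would compute each step's scores from the tokens generated so far.

```python
from typing import Dict, List


def greedy_decode_no_repeat(step_scores: List[Dict[str, float]]) -> List[str]:
    """Greedy decoding that falls back to the runner-up token on a repeat."""
    output: List[str] = []
    for scores in step_scores:
        ranked = sorted(scores, key=scores.get, reverse=True)
        token = ranked[0]
        # If the best token would repeat the previous one, take the second best.
        if output and token == output[-1] and len(ranked) > 1:
            token = ranked[1]
        output.append(token)
    return output


steps = [
    {"are": 0.9, "you": 0.1},
    {"are": 0.6, "you": 0.4},       # "are" would repeat, so "you" is chosen
    {"employed": 0.7, "you": 0.3},
]
print(greedy_decode_no_repeat(steps))
```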

In order to frame the task as a span extraction task, a simple method of mapping a follow-up question onto a span in the rule text was employed. The longest common subsequence of tokens between the rule text and follow-up question was found, and if the subsequence length was greater than a certain threshold, the target span was generated by increasing the length of the subsequence so that it matched the length of the follow-up question. These spans were then used to supervise the training of the BiDAF and sequence tagger models.
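The span-mapping heuristic can be approximated as follows. Several simplifications are assumed here and are not taken from the paper: the longest contiguous matching block found by Python's `difflib.SequenceMatcher` stands in for the longest common subsequence, the span is grown to the right first and then to the left, and the `min_overlap` threshold is a placeholder value.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


def map_question_to_span(rule_tokens: List[str],
                         question_tokens: List[str],
                         min_overlap: int = 2) -> Optional[Tuple[int, int]]:
    """Return a [start, end) span of the rule text, or None if overlap is low."""
    matcher = SequenceMatcher(None, rule_tokens, question_tokens, autojunk=False)
    match = matcher.find_longest_match(0, len(rule_tokens),
                                       0, len(question_tokens))
    if match.size <= min_overlap:
        return None
    start, end = match.a, match.a + match.size
    target_len = min(len(question_tokens), len(rule_tokens))
    # Grow the span until it is as long as the follow-up question.
    while end - start < target_len:
        if end < len(rule_tokens):
            end += 1      # extend to the right while possible
        else:
            start -= 1    # otherwise extend to the left
    return start, end


rule = "You can claim if you work at least 16 hours a week".split()
question = "Do you work at least 16 hours a week".split()
span = map_question_to_span(rule, question)
print(span, " ".join(rule[span[0]:span[1]]))
```

Spans produced in this way could then serve as targets for span-extraction or I/O tagging models.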

Appendix I Evaluating Utility of CMR

In order to evaluate the utility of conversational machine reading, we run a user study that compares CMR with the scenario when such an agent is not available, i.e. the user has to read the rule text, the question, and the scenario, and determine for themselves whether the answer to the question is “Yes” or “No”. On the other hand, with the agent, the user does not read the rule text, instead only responds to follow-up questions with a “Yes” or “No”, based on the scenario text and world knowledge.

We carry out a user study with 100 randomly selected scenarios and questions, and elicit annotation from workers for each. As these instances are from the CMR dataset, the quality is fairly high, and thus we have access to the gold answers and follow-up questions for all possible responses by the users. This allows us to evaluate the accuracy of the users in answering the question, the primary objective of any QA system. We also track a number of other metrics, such as the time taken by the users to reach the conclusion.

In Figure 10(a), we see that users who have access to the conversational agent are almost twice as fast as users who need to read the rule text. This demonstrates that even though users with the conversational agent have to answer more questions (as many as there are follow-up questions), they are able to understand and apply the knowledge more quickly. Further, in Figure 10(b), we see that users with access to the conversational agent are much more accurate than those without, demonstrating that an accurate conversational agent can have a considerable impact on efficiency.

(a) Time taken to reach conclusion
(b) Accuracy of the conclusion reached
Figure 10: Evaluation of the utility of CMR via a user study, demonstrating that users with an accurate conversational agent not only reach conclusions much faster than those who have to read the rule text, but also reach correct conclusions much more often.