Conversational machine reading systems help users answer high-level questions (e.g. determine if they qualify for particular government benefits) when they do not know the exact rules by which the determination is made (e.g. whether they need certain income levels or veteran status). The key challenge is that these rules are only provided in the form of a procedural text (e.g. guidelines from a government website) which the system must read to figure out what to ask the user. We present a new conversational machine reading model that jointly extracts a set of decision rules from the procedural text while reasoning about which are entailed by the conversational history and which still need to be edited to create questions for the user. On the recently introduced ShARC conversational machine reading dataset, our Entailment-driven Extract and Edit network (E3) achieves a new state-of-the-art, outperforming existing systems as well as a new BERT-based baseline. In addition, by explicitly highlighting which information still needs to be gathered, E3 provides a more explainable alternative to prior work. We release source code for our models and experiments at https://github.com/vzhong/e3.
In conversational machine reading (CMR), a system must help users answer high-level questions by participating in an information gathering dialog. For example, in Figure 1 the system asks a series of questions to help the user decide if they need to pay tax on their pension. A key challenge in CMR is that the rules by which the decision is made are only provided in natural language (e.g. the rule text in Figure 1). At every step of the conversation, the system must read the rule text and reason about what has already been said in order to formulate the best next question.
We present a new model that jointly reasons about what rules are present in the text and which are already entailed by the conversational history to improve question generation. More specifically, we propose the Entailment-driven Extract and Edit network (E3). E3 learns to extract implicit rules in the document, identify which rules are entailed by the conversation history, and edit rules that are not entailed to create follow-up questions to the user. During each turn, E3 parses the rule text to extract spans that correspond to implicit rules (underlined in Figure 1). Next, the model scores the degree to which each extracted rule is entailed by the initial user scenario (red in Figure 1) and by previous interactions with the user (green in Figure 1). Finally, the model decides on a response by directly answering the question (yes/no), stating that the rule text does not contain sufficient information to answer the question (irrelevant), or asking a follow-up question about an extracted rule that is not entailed but is needed to determine the answer (blue in Figure 1). In the case of inquiry, the model edits an extracted rule into a follow-up question. To our knowledge,
E3 is the first extract-and-edit method for conversational dialogue, as well as the first method that jointly infers implicit rules in text, estimates entailment, inquires about missing information, and answers the question.
We compare E3 to the previous-best systems as well as a new, strong BERT-based extractive question answering model (BERTQA) on the recently proposed ShARC CMR dataset (Saeidi et al., 2018). Our results show that E3 is more accurate in its decisions and generates more relevant inquiries. In particular, E3 outperforms the previous-best model by 5.7% in micro-averaged decision accuracy and by 4.3 points in inquiry BLEU4. Similarly, E3 outperforms the BERTQA baseline by 4.0% in micro-averaged decision accuracy and by 2.4 points in inquiry BLEU4. In addition to outperforming previous methods, E3 is explainable in the sense that one can visualize which rules the model extracted and how previous interactions and inquiries ground to the extracted rules. We release source code for E3 and the BERTQA model at https://github.com/vzhong/e3.
Recently, there has been growing interest in question answering (QA) in a dialogue setting (Choi et al., 2018; Reddy et al., 2019). CMR (Saeidi et al., 2018) differs from dialogue QA in the domain covered (regulatory text vs Wikipedia). A consequence of this is that CMR requires the interpretation of complex decision rules in order to answer high-level questions, whereas dialogue QA typically contains questions whose answers are directly extractable from the text. In addition, CMR requires the formulation of free-form follow-up questions in order to identify whether the user satisfies decision rules, whereas dialogue QA does not. There has also been significant work on task-oriented dialogue, where the system must inquire about missing information in order to help the user achieve a goal (Williams et al., 2013; Henderson et al., 2014; Mrkšić et al., 2017; Young et al., 2013). However, these tasks are typically constrained to a fixed ontology (e.g. restaurant reservation), instead of a latent ontology specified via natural language documents.
One traditional approach for designing dialogue systems divides the task into language understanding/state-tracking (Mrkšić et al., 2017; Zhong et al., 2018), reasoning/policy learning (Su et al., 2016), and response generation (Wen et al., 2015). The models for each of these subtasks are then combined to form a full dialogue system (Young et al., 2013; Wen et al., 2017). The previous best system for ShARC (Saeidi et al., 2018) similarly breaks the CMR task into subtasks and combines hand-designed sub-models for decision classification, entailment, and follow-up generation. In contrast, the core reasoning (e.g. non-editor) components of E3 are jointly trained, and E3 does not require complex hand-designed features.
There is a long history of work on extracting knowledge automatically from text (Moulin and Rousseau, 1992). Relation extraction typically assumes that there is a fixed ontology onto which extracted knowledge falls (Mintz et al., 2009; Riedel et al., 2013). Other works forgo the ontology by using, for example, natural language (Angeli and Manning, 2014; Angeli et al., 2015). These extractions from text are subsequently used for inference over a knowledge base (Bordes et al., 2013; Dettmers et al., 2018; Lin et al., 2018) and rationalizing model predictions (Lei et al., 2016). Our work is more similar to the latter type, in which the extracted knowledge is not confined to a fixed ontology and instead differs on a per-document basis. In addition, the rules extracted by our model are used for inference over natural language documents. Finally, these rules provide rationalization for the model's decision making, in the sense that the user can visualize which rules the model extracted and which rules are entailed by previous turns.
In conversational machine reading, a system reads a document that contains a set of implicit decision rules. The user presents a scenario describing their situation, and asks the system an underspecified question. In order to answer the user’s question, the system must ask the user a series of follow-up questions to determine whether the user satisfies the set of decision rules.
The key challenges in CMR are to identify implicit rules present in the document, understand which rules are necessary to answer the question, and inquire about necessary rules that are not entailed by the conversation history by asking follow-up questions. The three core modules of E3, the extraction, entailment, and decision modules, combine to address these challenges. Figure 2 illustrates the components of E3.
For ease of exposition, we describe E3 for a single turn in the conversation. To make the references concrete in the following sections, we use as an example the inputs and outputs from Figure 1. This example describes a turn in a conversation in which the system helps the user determine whether they need to pay UK taxes on their pension.
The extraction module extracts spans from the document that correspond to latent rules. Let x_R, x_Q, x_S, and x_{H,k} denote the words in the rule text, question, scenario, and the inquiry and user response during the kth previous turn of the dialogue after N_H turns have passed. We concatenate these inputs into a single sequence x joined by sentinel tokens that mark the boundaries of each input. To encode the input for the extraction module, we use BERT, a transformer-based model (Vaswani et al., 2017) that achieves consistent gains on a variety of NLP tasks (Devlin et al., 2019). We encode x using the BERT encoder, which first converts words into word piece tokens (Wu et al., 2016)
, then embeds these tokens along with their positional embeddings and segmentation embeddings. These embeddings are subsequently encoded via a transformer network, which allows for inter-token attention at each layer. Let T be the number of tokens in the concatenated input and d the output dimension of the BERT encoder. For brevity, we denote the output of the BERT encoder as U = BERT(x), a T × d matrix, and refer readers to Devlin et al. (2019) for the detailed architecture.
In order to extract the implicit decision rules from the document, we compute a start score α_t and an end score β_t for each tth token as

α_t = σ(w_start · U_t + b_start)
β_t = σ(w_end · U_t + b_end)

where w_start, w_end ∈ R^d, b_start, b_end ∈ R, and σ is the sigmoid function.
For each position e where β_e is larger than some threshold τ, we find the closest preceding position s ≤ e where α_s > τ. Each pair (s, e) then forms an extracted span corresponding to a rule expressed in the rule text. In the example in Figure 1, the correct extracted spans are “UK resident” and “UK civil service pensions”.
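The thresholded pairing of start and end scores can be sketched as follows. The function name, list-based inputs, and the default threshold value are illustrative assumptions, not the paper's implementation.

```python
def extract_spans(start_scores, end_scores, tau=0.5):
    """Pair each above-threshold end position with the closest
    preceding above-threshold start position (tau is an assumed
    threshold value for illustration)."""
    spans = []
    for e, beta in enumerate(end_scores):
        if beta <= tau:
            continue
        # scan backwards for the closest preceding start above threshold
        for s in range(e, -1, -1):
            if start_scores[s] > tau:
                spans.append((s, e))
                break
    return spans
```

Each returned pair indexes one candidate rule span in the concatenated input.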
For the ith rule, spanning positions s_i through e_i, we use self-attention to build a representation R_i over the span U_{s_i}, ..., U_{e_i}:

a_k = w_attn · U_k + b_attn, for s_i ≤ k ≤ e_i
a' = softmax(a)
R_i = Σ_k a'_k U_k

where w_attn ∈ R^d and b_attn ∈ R. Here, a and a' are respectively the unnormalized and normalized scores for the self-attention layer.
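The self-attention pooling over a span can be sketched in plain Python. The function name and the arguments `w` and `b`, standing in for the learned attention parameters, are assumptions of this sketch.

```python
import math

def attend_pool(span_vectors, w, b=0.0):
    """Self-attentive pooling: score each token vector, softmax the
    scores, and return the weighted sum of the vectors."""
    # unnormalized scores: dot product of each token vector with w, plus b
    scores = [sum(wi * ui for wi, ui in zip(w, u)) + b for u in span_vectors]
    # normalized scores via a numerically stable softmax
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [x / total for x in exps]
    # weighted sum of the span's token vectors
    dim = len(span_vectors[0])
    return [sum(alphas[k] * span_vectors[k][d] for k in range(len(span_vectors)))
            for d in range(dim)]
```

With identical token vectors the pooled representation equals any one of them, regardless of the attention weights.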
Let n_rule denote the number of spans in the rule text, each of which corresponds to a ground truth rule. The rule extraction loss is computed as the sum of the binary cross entropy losses for each rule i:

L_re = Σ_{i=1}^{n_rule} ( L_start,i + L_end,i )
Let T_R denote the number of tokens in the rule text, s_i and e_i the ground truth start and end positions for the ith rule, and 1[c] the indicator function that returns 1 if and only if the condition c holds. Recall that α_t and β_t denote the probabilities that token t is the start and end of a rule. The start and end binary cross entropy losses for the ith rule are computed as

L_start,i = − Σ_{t=1}^{T_R} ( 1[t = s_i] log α_t + 1[t ≠ s_i] log(1 − α_t) )
L_end,i = − Σ_{t=1}^{T_R} ( 1[t = e_i] log β_t + 1[t ≠ e_i] log(1 − β_t) )
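The per-rule binary cross entropy over token positions is simple enough to sketch directly. The function name is illustrative; it computes the loss for one rule's start (or, symmetrically, end) position.

```python
import math

def span_bce_loss(probs, gold_position):
    """Binary cross entropy over token positions for one rule:
    the gold position contributes -log(p), every other position
    contributes -log(1 - p)."""
    return -sum(math.log(p) if t == gold_position else math.log(1.0 - p)
                for t, p in enumerate(probs))
```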
Given the extracted rules , the entailment module estimates whether each rule is entailed by the conversation history, so that the model can subsequently inquire about rules that are not entailed. For the example in Figure 1, the rule “UK resident” is entailed by the previous inquiry “Are you a UK resident”. In contrast, the rule “UK civil service pensions” is not entailed by either the scenario or the conversation history, so the model needs to inquire about it. In this particular case the scenario does not entail any rule.
For each extracted rule, we compute a score that indicates the extent to which this particular rule has already been discussed in the initial scenario x_S and in the previous turns x_{H,k}. In particular, let n_shared denote the number of tokens shared by the ith extracted rule r_i and the scenario x_S, n_rule the number of tokens in r_i, and n_scen the number of tokens in x_S. We compute the scenario entailment score g_{s,i} as

p = n_shared / n_scen
r = n_shared / n_rule
g_{s,i} = f1(r_i, x_S) = 2pr / (p + r)

where p, r, and f1 respectively denote the precision, recall, and F1 scores. We compute a similar score to represent the extent to which the rule has been discussed in previous inquiries. Let q_k denote the tokens in the kth previous inquiry. We compute the history entailment score g_{h,i} between the extracted rule and all previous inquiries in the conversation history as

g_{h,i} = max over 1 ≤ k ≤ N_H of f1(r_i, q_k)
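The token-overlap F1 can be sketched with multiset intersection. The function name, the multiset treatment of repeated tokens, and the direction of precision versus recall are assumptions of this sketch.

```python
from collections import Counter

def token_f1(rule_tokens, history_tokens):
    """Token-overlap F1 between an extracted rule and prior text
    (scenario or a previous inquiry)."""
    rule, hist = Counter(rule_tokens), Counter(history_tokens)
    shared = sum((rule & hist).values())  # multiset intersection size
    if shared == 0:
        return 0.0
    precision = shared / sum(hist.values())  # shared tokens over history length
    recall = shared / sum(rule.values())     # shared tokens over rule length
    return 2 * precision * recall / (precision + recall)
```

For the Figure 1 example, the rule "UK resident" overlaps heavily with the previous inquiry "Are you a UK resident", yielding a high entailment score.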
The final representation of the ith rule, R'_i, is then the concatenation of the span self-attention representation and the entailment scores:

R'_i = [R_i; g_{s,i}; g_{h,i}]

where [a; b] denotes the concatenation of a and b. We also experiment with embedding- and encoding-similarity based approaches to compute entailment, but find that this F1 approach performs best. Because the encoder utilizes cross attention between the different components of the input, the token representations are already able to capture some notion of entailment. However, we find that explicitly scoring entailment via the entailment module further discourages the model from making redundant inquiries.
Given the extracted rules and the entailment-enriched representations for each rule , the decision module decides on a response to the user. These include answering yes/no to the user’s original question, determining that the rule text is irrelevant to the question, or inquiring about a rule that is not entailed but required to answer the question. For the example in Figure 1, the rule “UK civil service pensions” is not entailed, hence the correct decision is to ask a follow-up question about whether the user receives this pension.
We start by computing a summary h of the input using self-attention:

c_t = w_sum · U_t + b_sum, for 1 ≤ t ≤ T
c' = softmax(c)
h = Σ_t c'_t U_t

where w_sum ∈ R^d and b_sum ∈ R, and c and c' are respectively the unnormalized and normalized self-attention weights. Next, we score the choices yes, no, irrelevant, and inquire:

z = W_dec h + b_dec

where W_dec is a 4 × d matrix, b_dec ∈ R^4, and z ∈ R^4 is a vector containing a class score for each of the yes, no, irrelevant, and inquire decisions.
For inquiries, we compute an inquiry score z_inq,i for each extracted rule r_i:

z_inq,i = w_inq · R'_i + b_inq

where w_inq ∈ R^{d+2} and b_inq ∈ R. Let d* indicate the correct decision and, if the model is supposed to make an inquiry, let i* indicate the correct inquiry. The decision loss is

L_dec = − log softmax(z)_{d*} − 1[d* = inquire] log softmax(z_inq)_{i*}
During inference, the model first determines the decision d = argmax_k z_k. If the decision is inquire, the model asks a follow-up question about the ith rule, where i = argmax_j z_inq,j. Otherwise, the model concludes the dialogue with the decision d.
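The inference rule above reduces to two argmax operations, sketched below. The function name and the tuple return format are illustrative assumptions.

```python
def decide(class_scores, inquiry_scores, rules):
    """Inference sketch: take the argmax decision; if it is
    'inquire', also return the highest-scoring extracted rule."""
    decisions = ["yes", "no", "irrelevant", "inquire"]
    d = decisions[max(range(4), key=lambda k: class_scores[k])]
    if d == "inquire":
        i = max(range(len(rules)), key=lambda j: inquiry_scores[j])
        return d, rules[i]
    return d, None
```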
In the event that the model chooses to make an inquiry about an extracted rule, the rule is given to a subsequent editor to rephrase into a follow-up question. For the example in Figure 1, the editor edits the span “UK civil service pensions” into the follow-up question “Are you receiving UK civil service pensions?” Figure 3 illustrates the editor.
The editor takes as input the concatenation of the extracted rule to rephrase and the rule text. As before, we encode this input using a BERT encoder. The encoder is followed by two decoders that respectively generate the pre-span edit and the post-span edit. For the example in Figure 1, given the span “UK civil service pensions”, the pre-span and post-span edits that form the question “Are you receiving UK civil service pensions?” are respectively “Are you receiving” and “?”.
To perform each edit, we employ an attentive decoder (Bahdanau et al., 2015) with Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997). Let h_t denote the decoder state at time t. At each decoding step we compute attention over the encoded input and use the attended context to update the decoder state. Let E denote the embedding matrix corresponding to tokens in the vocabulary. To generate the tth token, we use weight tying between the output layer and the embedding matrix E (Press and Wolf, 2017).
We use a separate attentive decoder to generate the pre-span edit and the post-span edit. The decoders share the embedding matrix and BERT encoder but do not share other parameters. The output of the editor is the concatenation of the pre-span edit tokens, the extracted rule span, and the post-span edit tokens.
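Assembling the editor's output from its three parts can be sketched as follows; the function name and the simplified detokenization are assumptions of this sketch.

```python
def build_question(pre_edit, span, post_edit):
    """Concatenate the generated pre-span edit, the extracted rule
    span, and the generated post-span edit into one question."""
    text = " ".join(pre_edit + span + post_edit)
    # attach sentence-final punctuation to the preceding token
    return text.replace(" ?", "?").replace(" .", ".")
```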
The editing loss consists of the sequential cross entropy losses from generating the pre-span edit and the post-span edit. Let m denote the number of tokens and v_t the tth token in the ground truth pre-span edit. The pre-span loss is

L_pre = − Σ_{t=1}^{m} log p(v_t | v_1, ..., v_{t−1})
The editing loss is then the sum of the pre-span and post-span losses, the latter of which is computed analogously.
We train and evaluate the Entailment-driven Extract and Edit network on the ShARC CMR dataset. In particular, we compare our method to three other models. Two of these models are proposed by Saeidi et al. (2018). They are an attentive sequence-to-sequence model that attends to the concatenated input and generates the response token-by-token (Seq2Seq), and a strong hand-engineered pipeline model with sub-models for entailment, classification, and generation (Pipeline). For the latter, Saeidi et al. (2018) show that these sub-models outperform neural models such as the entailment model by Parikh et al. (2016), and that the combined pipeline outperforms the attentive sequence-to-sequence model. In addition, we propose an extractive QA baseline based on BERT (BERTQA). Similar models achieved state-of-the-art results on a variety of QA tasks (Rajpurkar et al., 2016; Reddy et al., 2019). We refer readers to Section A.1 of the appendices for implementation details of BERTQA.
We tokenize using revtok (https://github.com/jekbradbury/revtok) and part-of-speech tag (for the editor) using Stanford CoreNLP (Manning et al., 2014). We fine-tune the smaller, uncased pretrained BERT model by Devlin et al. (2019) (i.e. bert-base-uncased), using the BERT implementation from https://github.com/huggingface/pytorch-pretrained-BERT. We optimize using ADAM (Kingma and Ba, 2015) with an initial learning rate of 5e-5 and a warm-up rate of 0.1. We regularize using Dropout (Srivastava et al., 2014) after the BERT encoder with a rate of 0.4.
To supervise rule extraction, we reconstruct full dialogue trees from the ShARC training set and extract all follow-up questions as well as bullet points from each rule text and its corresponding dialogue tree. We then match these extracted clauses to spans in the rule text, and take these noisy matched spans as supervision for rule extraction. During inference, we use heuristic bullet point extraction (we extract spans from the text that start with the “*” character and end with another “*” character or a new line) in conjunction with spans extracted by the rule extraction module. This results in minor improvements in micro and macro accuracy over relying only on the rule extraction module. In cases where one rule fully covers another, we discard the covered, shorter rule. Section A.2 details how clause matching is used to obtain noisy supervision for rule extraction.
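The bullet point heuristic from the footnote can be sketched with a single regular expression; the function name is an assumption of this sketch.

```python
import re

def extract_bullets(rule_text):
    """Heuristic bullet extraction: take spans that start at a '*'
    character and run until the next '*' or newline."""
    return [m.strip() for m in re.findall(r"\*([^*\n]+)", rule_text) if m.strip()]
```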
We train the editor separately, as jointly training with a shared encoder worsens performance. The editor is trained by optimizing the editing loss, while the rest of the model is trained by optimizing the weighted sum of the rule extraction and decision losses, with a fixed rule extraction threshold and rule extraction loss weight. We perform early stopping using the product of the macro-averaged accuracy and the BLEU4 score.
For the editor, we use fixed, pretrained embeddings from GloVe (Pennington et al., 2014), and use dropout after input attention with a rate of 0.4. Before editing retrieved rules, we remove prefix and suffix adpositions, auxiliary verbs, conjunctions, determiners, and punctuation. We find that doing so allows the editor to convert some extracted rules (e.g. “or sustain damage”) into sensible questions (e.g. “did you sustain damage?”).
Our performance on the development and the blind, held-out test set of ShARC is shown in Table 1. Compared to previous results, E3 achieves a new state-of-the-art, obtaining the best performance on micro and macro-averaged decision classification accuracy and BLEU4 scores while maintaining similar BLEU1 scores. These results show that E3 both answers the user's original question more accurately and generates more coherent and relevant follow-up questions. In addition, Figure 6 shows that because E3 explicitly extracts implicit rules from the document, the model's predictions are explainable in the sense that the user can verify the correctness of the extracted rules and observe how the scenario and previous interactions ground to the extracted rules.
| Model | Micro Acc. | Macro Acc. | BLEU1 | BLEU4 | Comb. |
| --- | --- | --- | --- | --- | --- |
| -edit, entail, extract (BERTQA) | 63.4 | 70.6 | 47.4 | 37.4 | 23.7 |
Table 2 shows an ablation study of E3 on the development set of ShARC.
BERTQA (“-edit, entail, extract”), which is E3 with the editor, entailment, and extraction modules removed, presents a strong baseline that exceeds previous results on all metrics except BLEU1. This variant inquires about spans extracted from the text, which, while more relevant as indicated by the higher BLEU4 score, do not have the natural quality of questions, hence the lower BLEU1. Nonetheless, the large gains of BERTQA over the attentive Seq2Seq model show that retrieval is a more promising technique for asking follow-up questions than word-by-word generation. Similar findings were reported for question answering by Yatskar (2019).
Adding explicit extraction of rules in the document (“-edit, entail”) forces the model to interpret all rules in the document rather than focusing only on extracting the next inquiry. This results in better performance in both decision classification and inquiry relevance compared to the variant that is not forced to interpret all rules.
The “-edit” model explicitly models whether an extracted rule is entailed by the user scenario and previous turns. Modeling entailment allows the model to better predict whether a rule is entailed, and thus more often inquire about rules that are not entailed. Figure 5a illustrates one such example in which both extracted rules have high entailment scores, and the model chooses to conclude the dialogue by answering no instead of making further inquiries. Adding entailment especially improves the BLEU4 score, as the inquiries made by the model are more relevant and appropriate.
While E3 without the editor is able to retrieve rules that are relevant, these spans are not fluent questions that can be presented to the user. The editor is able to edit the extracted rules into more fluid and coherent questions, which results in further gains, particularly in BLEU1.
In addition to ablation studies, we analyze errors E3 makes on the development set of ShARC.
Figure 7 shows the confusion matrix of decisions. We specifically examine examples in which E3 produces an incorrect decision. On the ShARC development set there are 726 such cases, which correspond to a 32.0% error rate. We manually analyze 100 such examples to identify common types of errors. Within these, in 23% of examples, the model attempts to answer the user's initial question without resolving a necessary rule despite successfully extracting the rule. In 19% of examples, the model identifies and inquires about all necessary rules but comes to the wrong conclusion. In 18% of examples, the model makes a redundant inquiry about a rule that is entailed. In 17% of examples, the rule text contains ambiguous rules. Figure 5b contains one such example, in which the annotator identified the rule “a female Vietnam Veteran”, while the model extracted an alternative longer rule “a female Vietnam Veteran with a child who has a birth defect”. Finally, in 13% of examples, the model fails to extract some rule from the document. Other, less common forms of errors include failures by the entailment module to perform numerical comparison, complex rule procedures that are difficult to deduce, and implications that require world knowledge. These results suggest that improving the decision process after rule extraction is an important area for future work.
On 340 examples (15%) in the ShARC development set, E3 generates an inquiry when it is supposed to. We manually analyze 100 such examples to gauge the quality of generated inquiries. On 63% of examples, the model generates an inquiry that matches the ground truth. On 14% of examples, the model makes inquiries in a different order than the annotator. On 12% of examples, the inquiry refers to an incorrect subject (e.g. “are you born early” vs. “is your baby born early”). This usually results from editing an entity-less bullet point (“* born early”). On 6% of examples, the inquiry is lexically similar to the ground truth but has incorrect semantics (e.g. “do you need savings” vs. “is this information about your savings”). Again, this tends to result from editing short bullet points (e.g. “* savings”). These results indicate that when the model correctly chooses to inquire, it largely inquires about the correct rule. They also highlight a difficulty in evaluating CMR: there can be several correct orderings of inquiries for a document.
We proposed the Entailment-driven Extract and Edit network (E3), a conversational machine reading model that extracts implicit decision rules from text, computes whether each rule is entailed by the conversation history, inquires about rules that are not entailed, and answers the user's question. E3 achieved a new state-of-the-art result on the ShARC CMR dataset, outperforming existing systems as well as a new extractive QA baseline based on BERT. In addition to achieving strong performance, we showed that E3 provides a more explainable alternative to prior work, which does not model document structure.
This research was supported in part by the ARO (W911NF-16-1-0121) and the NSF (IIS-1252835, IIS-1562364). We thank Terra Blevins, Sewon Min, and our anonymous reviewers for helpful feedback.
Dettmers et al. 2018. Convolutional 2D knowledge graph embeddings. In AAAI.
Manning et al. 2014. The Stanford CoreNLP natural language processing toolkit. In ACL.
Parikh et al. 2016. A decomposable attention model for natural language inference. In EMNLP.
Srivastava et al. 2014. Dropout: A simple way to prevent neural networks from overfitting. JMLR.
Wen et al. 2015. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In EMNLP.
Our BERTQA baseline follows that proposed by Devlin et al. (2019) for the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016). Due to the differences in context between ShARC and SQuAD, we augment the input to the BERTQA model in a manner similar to Section 3.1. The distinction here is that we additionally add the decision types “yes”, “no”, and “irrelevant” as parts of the input, such that the problem is fully solvable via span extraction. Similar to Section 3.1, let U denote the BERT encoding of the length-T input sequence. The BERTQA model predicts a start score for each position s and an end score for each position e from U.
We take the answer as the span (s, e) that gives the highest combined start and end score such that s ≤ e. Because we augment the input with decision labels, the model can be fully supervised via extraction endpoints.
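The constrained span search can be sketched as an exhaustive scan; the function name and list-based scores are illustrative assumptions.

```python
def best_span(start_scores, end_scores):
    """Choose the span (s, e) maximizing the combined start and
    end scores, subject to s <= e."""
    best, best_score = (0, 0), float("-inf")
    for s in range(len(start_scores)):
        for e in range(s, len(end_scores)):
            score = start_scores[s] + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best
```

In practice this quadratic scan is usually restricted to a maximum span length.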
The ShARC dataset is constructed from full dialogue trees in which annotators exhaustively annotate yes/no branches of follow-up questions. Consequently, each rule required to answer the initial user question forms a follow-up question in the full dialogue tree. In order to identify rule spans in the document, we first reconstruct the dialogue trees for all training examples in ShARC. For each document, we trim each follow-up question in its corresponding dialogue tree by removing punctuation and stop words. For each trimmed question, we find the shortest best-match span in the document that has the least edit distance from the trimmed question, which we take as the corresponding rule span. In addition, we extract similarly trimmed bullet points from the document as rule spans. Finally, we deduplicate the rule spans by removing those that are fully covered by a longer rule span. Our resulting set of rule spans are used as noisy supervision for the rule extraction module. This preprocessing code is included with our code release.
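The best-match search over document spans can be sketched as follows. This sketch scores candidates with `difflib` similarity rather than raw edit distance, and the function name, token-level matching, and span-length cap are assumptions.

```python
import difflib

def best_match_span(question_tokens, doc_tokens, max_len=10):
    """Find the document span most similar to a trimmed follow-up
    question; ties keep the earlier (and hence shorter) span."""
    best, best_score = (0, 1), -1.0
    target = " ".join(question_tokens)
    for i in range(len(doc_tokens)):
        for j in range(i + 1, min(i + 1 + max_len, len(doc_tokens) + 1)):
            cand = " ".join(doc_tokens[i:j])
            score = difflib.SequenceMatcher(None, cand, target).ratio()
            if score > best_score:
                best, best_score = (i, j), score
    return doc_tokens[best[0]:best[1]]
```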