1 Introduction
Recent years have witnessed a surge of interest in conversational machine reading comprehension (MRC). Unlike the traditional MRC setting, which requires answering a single question given a passage (also called the context), the conversational MRC task is to answer the current question in a conversation given a passage and the previous questions and answers. The goal of this task is to mimic real-world situations where humans seek information in a conversational manner.
Despite the success of existing work on traditional MRC (e.g., SQuAD Rajpurkar et al. (2016)), MRC becomes significantly more challenging once conversations are incorporated into the task. It has been observed that in human-to-human conversations Reddy et al. (2018); Choi et al. (2018), the starting turns tend to focus on the beginning chunks of the passage, and as the conversation progresses the focus shifts to later chunks. Moreover, the turn transitions are usually smooth, with the focus often remaining in the same chunk or moving to a neighboring chunk. Lastly, many questions refer back to the conversation history via either coreference or ellipsis. A model for this task should therefore capture these shifts of focus during a conversation. We model conversation flow as a sequence of latent states in the dialog and learn the latent states associated with these shifts of focus.
To cope with the above challenges, many methods have been proposed to effectively utilize conversation history, including previous questions and/or previous answers. Most existing approaches, however, simply prepend the conversation history to the current question Reddy et al. (2018); Zhu et al. (2018) or add previous answer locations to the passage Choi et al. (2018); Yatskar (2018), and treat the task as single-turn MRC while ignoring the important information in the conversation flow. Huang et al. (2018) assumed that the hidden representations generated during the previous reasoning processes potentially capture important information for answering the previous questions, and thus provide additional clues for answering the current question. They proposed an Integration-Flow (IF) mechanism that first processes the passage words sequentially, in parallel across question turns, and then processes the question turns sequentially, in parallel across passage words. Their FlowQA model achieves strong empirical results on two benchmarks (i.e., CoQA and QuAC) Reddy et al. (2018); Choi et al. (2018).
However, the IF mechanism is not quite natural since it does not mimic how humans perform reasoning. When humans carry out such a task, they typically do not first reason in parallel for each question and then refine the reasoning results across different turns. This may partially explain why the strategy is inefficient: the results of previous reasoning processes are not incorporated into the current reasoning process. As a result, the reasoning performance at each turn is only slightly improved by the hidden states of the previous reasoning process, even though stacked IF layers are used to try to address this problem.
To address the aforementioned issues, we propose GraphFlow, a Graph Neural Network (GNN) based model for conversational MRC. As shown in Fig. 1, GraphFlow consists of three components: an Encoding layer, a Reasoning layer, and a Prediction layer. The Encoding layer encodes the conversation history and the context, with the context aligned to the question embeddings. The Reasoning layer dynamically constructs a question-aware context graph at each turn, and then applies GNNs to process the sequence of context graphs. In particular, the graph node embedding outputs of the reasoning process at the previous turn are used as the starting state when reasoning at the current turn, which is closer to how humans reason in a conversational setting than existing approaches. The Prediction layer predicts the answers based on the matching score between the question embedding and the learned graph node embeddings for the context at each turn.
We highlight our contributions as follows:
- We propose a novel graph neural network based model, namely GraphFlow, for conversational MRC, which captures the conversational flow in a dialog.
- We dynamically construct a context graph in which each passage word is a node; the graph encodes not only the passage itself, but also the question and the conversation history.
- We propose a novel flow mechanism to process a sequence of context graphs, which models the temporal dependencies between consecutive context graphs.
- On two public benchmarks (i.e., CoQA and QuAC), the proposed model shows superior performance compared to existing state-of-the-art methods. For instance, on CoQA, GraphFlow outperforms two recently proposed models, FlowQA and SDNet, by 2.3% F1 and 0.7% F1, respectively.
- Our interpretability analysis shows that our model can better mimic the human reasoning process on this task.
2 Graph-Flow Approach
2.1 Encoding Layer
The task of conversational MRC is to answer a question given the context and conversation history. We denote the context as $C$, which consists of a sequence of words $\{c_1, c_2, \dots, c_m\}$, and the question at the $i$-th turn as $Q^{(i)}$, which consists of a sequence of words $\{q_1^{(i)}, q_2^{(i)}, \dots, q_n^{(i)}\}$. The details of encoding the question and context are given next.
Pretrained word embeddings. We use 300-dim GloVe Pennington et al. (2014) embeddings as well as 1024-dim BERT Devlin et al. (2018) embeddings to embed each word in the context and the question. Following Zhu et al. (2018), we pre-compute BERT embeddings for each word using a weighted sum of BERT layer outputs.
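Aligned question embeddings. Following Lee et al. (2016) and recent work, for each context word $c_j$ at the $i$-th turn, we incorporate an aligned question embedding

$$f_{align}(c_j^{(i)}) = \sum_k \beta_{j,k}^{(i)} \, g_k^{Q^{(i)}}$$

where $g_k^{Q^{(i)}}$ is the GloVe embedding of question word $q_k^{(i)}$ and $\beta_{j,k}^{(i)}$ is an attention score between context word $c_j$ and question word $q_k^{(i)}$. Here we define the attention score as

$$\beta_{j,k}^{(i)} \propto \exp\left(\mathrm{ReLU}(W g_j^{C})^{\top} \, \mathrm{ReLU}(W g_k^{Q^{(i)}})\right)$$

where $W \in \mathbb{R}^{d \times 300}$ is a trainable model parameter, $d$ is the hidden state size, and $g_j^{C}$ is the GloVe embedding of context word $c_j$. To simplify notation, we denote the above attention mechanism as $\mathrm{Align}(X, Y, Z)$, meaning that an attention matrix is computed between two sets of vectors $X$ and $Y$, which is later used to get a linear combination of vector set $Z$. Hence we can reformulate the above alignment as $f_{align}(C^{(i)}) = \mathrm{Align}(g^{C}, g^{Q^{(i)}}, g^{Q^{(i)}})$.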
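The alignment step can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the scoring function (a dot product of ReLU-projected vectors followed by a softmax over the question words), the function names, and the toy 2-dim embeddings are all assumptions made for illustration.

```python
import math

def relu_project(W, x):
    """Return ReLU(W x) for a matrix W (list of rows) and a vector x."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def align(W, X, Y, Z):
    """For each x in X, attend over Y and return a weighted sum of Z.

    Scores are ReLU(W x) . ReLU(W y), normalized with a softmax over Y.
    Y and Z must have the same length (Z = Y for question alignment).
    """
    proj_Y = [relu_project(W, y) for y in Y]
    dim = len(Z[0])
    out = []
    for x in X:
        px = relu_project(W, x)
        scores = [sum(a * b for a, b in zip(px, py)) for py in proj_Y]
        beta = softmax(scores)
        out.append([sum(b * z[d] for b, z in zip(beta, Z)) for d in range(dim)])
    return out

# Toy example (hypothetical values): 2 context words, 3 question words.
W = [[1.0, 0.0], [0.0, 1.0]]           # identity projection
context = [[1.0, 0.0], [0.0, 1.0]]     # stand-ins for context GloVe vectors
question = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
aligned = align(W, context, question, question)
```

Each row of `aligned` is a convex combination of the question vectors, weighted toward the question words most similar to that context word.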
Manual features. For each context word, we also encode manual features into a vector $f_{man}(c_j^{(i)})$, concatenating a 12-dim POS (part-of-speech) embedding, an 8-dim NER (named entity recognition) embedding and a 3-dim exact matching vector indicating whether the context word appears in the question $Q^{(i)}$.
Conversation history. Following Choi et al. (2018), we utilize conversation history by concatenating a feature vector $f_{ans}(c_j^{(i)})$ encoding previous answer locations to the context word embeddings. In addition, we also prepend previous question-answer pairs to the current question. When prepending conversation history, previous methods usually separate the current question from the conversation history using special tokens, which is not a good strategy in practice as observed by Choi et al. (2018). We instead concatenate a 3-dim relative turn marker embedding $f_{turn}(q_k^{(i)})$ to each word vector in the augmented question to indicate which turn it belongs to (e.g., a marker of $i$ indicates the previous $i$-th turn). We find this strategy works better in practice.
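In summary, at the $i$-th turn in a conversation, each context word $c_j$ is encoded by a vector $w_{c_j}^{(i)}$ which is a concatenation of its GloVe embedding $g_j^{C}$, its BERT embedding $\mathrm{BERT}_j^{C}$, the aligned question embedding $f_{align}(c_j^{(i)})$, the manual feature vector $f_{man}(c_j^{(i)})$ and the answer location vector $f_{ans}(c_j^{(i)})$. Each question word $q_k^{(i)}$ is encoded by a vector $w_{q_k}^{(i)}$ which is a concatenation of its GloVe embedding $g_k^{Q^{(i)}}$, its BERT embedding $\mathrm{BERT}_k^{Q^{(i)}}$ and the turn marker embedding $f_{turn}(q_k^{(i)})$. We denote by $W_C^{(i)}$ and $W_Q^{(i)}$ the sequences of context word vectors $w_{c_j}^{(i)}$ and question word vectors $w_{q_k}^{(i)}$, respectively, at the $i$-th turn.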
2.2 Reasoning Layer
2.2.1 Question Understanding
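For each question $Q^{(i)}$, we apply a BiLSTM Hochreiter and Schmidhuber (1997) to the raw question embeddings $W_Q^{(i)}$ to obtain contextualized embeddings $H_Q^{(i)} \in \mathbb{R}^{d \times n}$.

$$H_Q^{(i)} = [h_1^{(i)}, \dots, h_n^{(i)}] = \mathrm{BiLSTM}(W_Q^{(i)})$$

Each question is then represented as a weighted sum of word vectors in the question via a simple self attention mechanism.

$$\tilde{q}^{(i)} = \sum_k a_k^{(i)} h_k^{(i)}, \quad a_k^{(i)} \propto \exp\left(w^{\top} h_k^{(i)}\right)$$

where $w$ is a $d$-dim trainable weight.
Finally, we encode question history sequentially in question turns with an LSTM to generate history-aware question vectors.

$$p^{(1)}, \dots, p^{(T)} = \mathrm{LSTM}(\tilde{q}^{(1)}, \dots, \tilde{q}^{(T)})$$

The output hidden states of the LSTM network, $p^{(1)}, \dots, p^{(T)}$, will be used for predicting answers.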
2.2.2 Graph Learner
We dynamically build a weighted graph to model semantic relationships among context words at each turn in a conversation. We argue that building such a context graph should depend not only on the semantic meanings of context words, but also on the question being asked and on the conversation history, so as to better answer the question.
To this end, we first apply an attention mechanism to the context representations $W_C^{(i)}$ (which additionally incorporate both question information and conversation history as described in Section 2.1) at the $i$-th turn to compute an attention matrix $A^{(i)}$, serving as a weighted adjacency matrix for the context graph, defined as

$$A^{(i)} = \mathrm{ReLU}(U W_C^{(i)})^{\top} \, \mathrm{ReLU}(U W_C^{(i)})$$

where $U \in \mathbb{R}^{d \times d_c}$ is a trainable weight and $d_c$ is the embedding size of $w_{c_j}^{(i)}$.
Considering that a fully connected context graph is not only computationally expensive but also makes little sense for reasoning, we then extract a sparse graph from $A^{(i)}$ via a KNN-style strategy: we keep only the $K$ nearest neighbors (including itself) for each context node and apply a softmax function to the selected adjacency matrix elements to get a sparse and normalized adjacency matrix $\tilde{A}^{(i)}$.
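The KNN-style sparsification can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name and the toy matrix are hypothetical, and the sketch simply keeps the largest entries in each row without forcing the diagonal (self) entry to be selected.

```python
import math

def sparsify_adjacency(A, k):
    """Keep the k largest entries per row of A and softmax-normalize them.

    Entries that are not selected are set to 0, so every row sums to 1.
    """
    sparse = []
    for row in A:
        keep = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        top = max(row[j] for j in keep)
        exps = {j: math.exp(row[j] - top) for j in keep}
        total = sum(exps.values())
        sparse.append([exps[j] / total if j in exps else 0.0
                       for j in range(len(row))])
    return sparse

# Toy attention matrix over 4 context words (hypothetical values).
A = [[3.0, 1.0, 0.5, 0.1],
     [1.0, 2.0, 0.2, 1.5],
     [0.5, 0.2, 4.0, 0.3],
     [0.1, 1.5, 0.3, 2.5]]
A_sparse = sparsify_adjacency(A, k=2)
```

In this toy matrix the diagonal happens to be the largest entry in every row, so each node keeps itself plus one neighbor.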
2.2.3 Graph-Flow
Given the context graphs constructed by the Graph Learner, we propose a novel Graph-Flow (GF) mechanism to sequentially process a sequence of context graphs. It is analogous to an RNN-style structure, the main difference being that each element in the sequence is not a data point but a graph. As we advance through the sequence of graphs, we process each graph using a GNN, and the output is used when processing the next graph.
The details of the GF mechanism are as follows. At the $i$-th turn, before we apply a GNN to the context graph $\mathcal{G}^{(i)}$, we initialize the context node embeddings by fusing the original context information $C_{l-1}^{(i)}$ and the updated context information at the previous turn $\hat{C}_{l}^{(i-1)}$ via a fusion function,

$$\bar{C}_{l-1}^{(i)} = \mathrm{Fuse}\left(C_{l-1}^{(i)}, \hat{C}_{l}^{(i-1)}\right)$$

where $l$ is the GF layer index.
As a result, the graph node embedding outputs of the reasoning process at the previous turn are used as a starting state when reasoning at the current turn. Note that at the first turn we set $\bar{C}_{l-1}^{(i)} = C_{l-1}^{(i)}$, as we will not incorporate any historical information at the first turn. Notably, even though there is no direct link between $c_j$ at the $(i-1)$-th turn and $c_k$ at the $i$-th turn, information can still propagate between them as the GNN progresses, which means information can be exchanged among all the context words both spatially and temporally.
We use Gated Graph Neural Networks (GGNN) Li et al. (2015) as our GNN module. When running GGNN, the aggregated neighborhood information for each node is computed as a weighted sum of its neighboring node embeddings, where the weights come from the normalized adjacency matrix $\tilde{A}^{(i)}$. The fusion function is designed as a gated sum of two information sources,

$$\mathrm{Fuse}(a, b) = z \odot a + (1 - z) \odot b, \quad z = \sigma\left(W_z [a; b; a \odot b; a - b] + b_z\right)$$

where $\sigma$ is a sigmoid function and $z$ is a gating vector.
To simplify notation, we denote the above GF mechanism as $\hat{C}_{l}^{(i)} = \mathrm{GF}(C_{l-1}^{(i)}, \tilde{A}^{(i)})$, which takes as input the old graph node embeddings and the normalized adjacency matrix $\tilde{A}^{(i)}$, and updates the graph node embeddings.
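The flow of information across turns can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the GGNN-based mechanism itself: the gate below is a fixed toy function rather than a learned one, and each hop is a plain weighted sum over neighbors without the gated recurrent update that GGNN applies. All function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fuse(a, b, gate_weight=1.0, gate_bias=0.0):
    """Gated sum z * a + (1 - z) * b, computed per dimension.

    The gate z = sigmoid(gate_weight * (a - b) + gate_bias) is a toy
    stand-in for a learned gating vector.
    """
    out = []
    for x, y in zip(a, b):
        z = sigmoid(gate_weight * (x - y) + gate_bias)
        out.append(z * x + (1.0 - z) * y)
    return out

def aggregate(adj, nodes):
    """One hop: each node becomes a weighted sum of its neighbors."""
    dim = len(nodes[0])
    return [[sum(adj[v][u] * nodes[u][d] for u in range(len(nodes)))
             for d in range(dim)]
            for v in range(len(nodes))]

def graph_flow(node_inputs, adjacencies, hops=2):
    """Process one graph per turn, seeding each turn with the previous output."""
    previous = None
    outputs = []
    for nodes, adj in zip(node_inputs, adjacencies):
        if previous is not None:  # no history is fused at the first turn
            nodes = [fuse(cur, prev) for cur, prev in zip(nodes, previous)]
        for _ in range(hops):
            nodes = aggregate(adj, nodes)
        outputs.append(nodes)
        previous = nodes
    return outputs

# Two turns over a 2-node graph with 1-dim embeddings (hypothetical values).
identity = [[1.0, 0.0], [0.0, 1.0]]
turn_inputs = [[[1.0], [0.0]], [[0.0], [0.0]]]
outputs = graph_flow(turn_inputs, [identity, identity], hops=1)
```

Even though the second turn starts from all-zero inputs, the first node ends up non-zero because the first turn's output is fused into it.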
2.2.4 Multi-level Graph Reasoning
While a GNN is responsible for modeling the global interactions among context words, modeling local interactions among consecutive context words is also important for the task. Therefore, before feeding the context word representations to a GNN, we first apply a BiLSTM to the context word vectors $W_C^{(i)}$, that is, $C_0^{(i)} = \mathrm{BiLSTM}(W_C^{(i)})$, and we then use the output $C_0^{(i)}$ as the initial context node embedding. Inspired by recent work Wang et al. (2018) on modeling the context with different levels of granularity, we choose to apply one GF layer on low level representations of the context and another GF layer on high level representations of the context, as formulated in the following.
2.3 Prediction Layer
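We predict answer spans by computing the start and end probabilities $P_{i,j}^{S}$ and $P_{i,j}^{E}$ of the $j$-th context word for the $i$-th question, formulated as

$$P_{i,j}^{S} \propto \exp\left({c_j^{(i)}}^{\top} W_S \, p^{(i)}\right), \quad \tilde{p}^{(i)} = \mathrm{GRU}\Big(p^{(i)}, \sum_j P_{i,j}^{S} \, c_j^{(i)}\Big), \quad P_{i,j}^{E} \propto \exp\left({c_j^{(i)}}^{\top} W_E \, \tilde{p}^{(i)}\right)$$

where $c_j^{(i)}$ is the final graph node embedding of the $j$-th context word at the $i$-th turn, $p^{(i)}$ is the history-aware question vector, and $W_S$ and $W_E$ are trainable weights.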
We additionally train a classifier to handle unanswerable questions or questions whose answers are not text spans in the context. We design different classifiers for the two benchmarks CoQA and QuAC, as CoQA contains questions with abstractive answers but QuAC does not. For the CoQA benchmark, we train a multi-class classifier which classifies a question into one of four categories: “unknown”, “yes”, “no” and “other”. We do text span prediction only if the question type is “other”. For the QuAC benchmark, we train three separate classifiers to handle three question classification tasks, including a binary classification task (i.e., “unknown”) and two multi-class classification tasks (i.e., “yes/no” and “followup”). The classifier is defined as

$$P_i^{C} = \sigma\left(f_c(p^{(i)}) \, [f_{mean}(C^{(i)}); f_{max}(C^{(i)})]\right)$$

where $f_c$ is a linear layer for binary classification and a dense layer for multi-class classification, which maps a $d$-dim vector to a $(\text{num\_class} \times 2d)$-dim vector. Further, $\sigma$ is a sigmoid function for binary classification and a softmax function for multi-class classification. We use $[f_{mean}(C^{(i)}); f_{max}(C^{(i)})]$ to represent the whole context at the $i$-th turn, which is a concatenation of the average pooling and max pooling outputs of the context node embeddings $C^{(i)}$.
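For training, the goal is to minimize the cross entropy loss of both text span prediction (if the question requires it) and question type prediction. The cross entropy loss of text span prediction is defined as

$$\mathcal{L}_{S} = -\sum_i I_i^{S} \left(\log P_{i,s_i}^{S} + \log P_{i,e_i}^{E}\right)$$

where $I_i^{S}$ indicates whether the $i$-th question requires text span prediction, and $s_i$ and $e_i$ are the ground-truth start and end positions of the answer span for the $i$-th question.
As mentioned above, we train a single classifier for question type prediction on CoQA and three separate classifiers for question type prediction on QuAC. Therefore, the loss of question type prediction is defined differently for the two datasets:

$$\mathcal{L}_{C}^{CoQA} = -\sum_i \log P_{i,t_i}^{C}$$

where $t_i$ indicates the question type for the $i$-th question, and

$$\mathcal{L}_{C}^{QuAC} = -\sum_i \left[ y_i^{C_1} \log P_i^{C_1} + (1 - y_i^{C_1}) \log(1 - P_i^{C_1}) + \log P_{i, y_i^{C_2}}^{C_2} + \log P_{i, y_i^{C_3}}^{C_3} \right]$$

where $y_i^{C_1}$, $y_i^{C_2}$ and $y_i^{C_3}$ indicate the ground-truth labels of the “unknown”, “yes/no” and “followup” prediction tasks for the $i$-th question, and $P_i^{C_1}$, $P_i^{C_2}$ and $P_i^{C_3}$ are the corresponding probability predictions.
Thus, the training losses for CoQA and QuAC are $\mathcal{L}_{S} + \mathcal{L}_{C}^{CoQA}$ and $\mathcal{L}_{S} + \mathcal{L}_{C}^{QuAC}$, respectively.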
During inference, for CoQA, we do text span prediction only if the span (“other”) probability is the largest among the four question types; otherwise, the answer is “unknown”, “yes” or “no”, depending on which one has the largest probability. For QuAC, we do text span prediction only if the predicted “unknown” probability is no larger than a certain threshold (we use 0.3 in our experiments, as this maximizes the F1 score on the development set); otherwise, the question is treated as unanswerable.
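The two inference rules can be written down as a short sketch. The function names, the dictionary layout of the type probabilities, and the `"CANNOTANSWER"` string returned for unanswerable QuAC questions are illustrative assumptions, not part of the described model.

```python
def coqa_answer(type_probs, span_text):
    """Pick the CoQA answer from question-type probabilities.

    type_probs maps "unknown", "yes", "no" and "other" to probabilities.
    The predicted span is used only when "other" is the most likely type.
    """
    best = max(type_probs, key=type_probs.get)
    return span_text if best == "other" else best

def quac_answer(unknown_prob, span_text, threshold=0.3):
    """Return the span unless the "unknown" probability exceeds the threshold."""
    return span_text if unknown_prob <= threshold else "CANNOTANSWER"

# Hypothetical classifier outputs.
print(coqa_answer({"unknown": 0.1, "yes": 0.2, "no": 0.1, "other": 0.6}, "in the garden"))
print(coqa_answer({"unknown": 0.1, "yes": 0.7, "no": 0.1, "other": 0.1}, "in the garden"))
print(quac_answer(0.25, "in 1998"))
print(quac_answer(0.80, "in 1998"))
```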
3 Experiments
In this section, we conduct an extensive evaluation of our proposed model against state-of-the-art conversational MRC models. We use two popular benchmarks, described below. The implementation of the model will be publicly available soon.
3.1 Data and Metrics
CoQA. CoQA contains 127k questions with answers, obtained from 8k conversations. In CoQA, answers are free-form and hence are not necessarily text spans from the context (33.2% of the questions have abstractive answers). Although this calls for a generation approach, following previous work Yatskar (2018); Huang et al. (2018); Zhu et al. (2018), as detailed in Section 2.3, we adopt an extractive approach with additional question type classifiers to handle non-extractive questions. The average length of questions is only 5.5 words, and 70% of the questions involve coreference or ellipsis, which means conversation history is important for understanding those questions. The average number of turns per dialog is 15.2. Notably, the test set includes two out-of-domain datasets which are reserved for testing only.
QuAC. QuAC contains 98k questions with answers, obtained from 13k conversations. Unlike CoQA, all the answers in QuAC are text spans from the context. The average length of questions is 6.5 and there are on average 7.2 questions per dialog. The average context length in QuAC is 401, which is longer than that of CoQA (271). The average answer length in QuAC is 14.6, which is also longer than that of CoQA (2.7).
The main evaluation metric is the F1 score, which is the harmonic mean of precision and recall at the word level between the predicted answer and the ground truth. In addition, for QuAC the Human Equivalence Score (HEQ) is used to judge whether a system performs as well as an average human. HEQ-Q and HEQ-D are model accuracies at the question level and dialog level, respectively. Please refer to Reddy et al. (2018); Choi et al. (2018) for details of these metrics.
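For concreteness, a word-level F1 score between one prediction and one reference can be computed as in the sketch below. It is a simplified illustration: the official evaluation scripts of the two benchmarks apply additional answer normalization and handle multiple references, which are omitted here, and the function name is hypothetical.

```python
from collections import Counter

def word_f1(prediction, reference):
    """Harmonic mean of word-level precision and recall (whitespace tokens)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared word at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = word_f1("the white cat", "white cat")  # precision 2/3, recall 1
```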
3.2 Model Settings
We keep and fix the GloVe vectors for words that appear more than 5 times in the training set. The size of all hidden layers is set to 300. When constructing context graphs, the neighborhood size is set to 10. The number of GNN hops is set to 5 for CoQA and 3 for QuAC. During training, we apply dropout after the embedding layers (0.3 for GloVe and 0.4 for BERT). A dropout rate of 0.3 is also applied to the output of all RNN layers. We use Adamax Kingma and Ba (2014) as the optimizer with a learning rate of 0.001. We reduce the learning rate by a factor of 0.5 whenever the validation F1 score fails to improve after an epoch, and we stop training when no improvement is seen for 10 consecutive epochs. We clip the gradient norm at 10. We batch over dialogs and the batch size is set to 1. When augmenting the current turn with conversation history, we only consider the previous two turns. When doing text span prediction, the span is constrained to have a maximum length of 12 for CoQA and 35 for QuAC. All these hyper-parameters are tuned on the development set.
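The learning-rate and stopping rules can be simulated without any training code. The sketch below is one reading of the schedule described above (halve the rate after every epoch without improvement, stop after 10 such epochs in a row); the function name and the example F1 sequence are hypothetical.

```python
def run_schedule(val_f1_per_epoch, lr=0.001, factor=0.5, patience=10):
    """Return (final_lr, epochs_run) for a sequence of validation F1 scores."""
    best = float("-inf")
    stale = 0        # consecutive epochs without improvement
    epochs_run = 0
    for f1 in val_f1_per_epoch:
        epochs_run += 1
        if f1 > best:
            best, stale = f1, 0
        else:
            stale += 1
            lr *= factor          # decay after every non-improving epoch
            if stale >= patience:
                break             # early stopping
    return lr, epochs_run

# One non-improving epoch (the third) halves the rate once.
final_lr, epochs = run_schedule([70.0, 72.0, 71.0, 73.0])
```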
3.3 Model Comparison
Baseline methods. We compare our approach with the following baseline methods: PGNet See et al. (2017), DrQA Chen et al. (2017), DrQA+PGNet Reddy et al. (2018), BiDAF++ Yatskar (2018), FlowQA Huang et al. (2018) and SDNet Zhu et al. (2018). We discuss the details of these methods in Section 4.1.
Results. As shown in Table 1 and Table 2, our model consistently outperforms these state-of-the-art baselines in terms of F1 score. In particular, Graph-Flow improves over all existing models on both datasets, by at least +0.7% F1 on CoQA and +0.8% F1 on QuAC. Compared with FlowQA, which is also based on the flow idea, our model improves F1 by 2.3% on CoQA and 0.8% on QuAC, which demonstrates the superiority of our Graph-Flow mechanism over the Integration-Flow mechanism. Compared with SDNet, which relies on sophisticated inter-attention and self-attention mechanisms, our model improves F1 by 0.7% on CoQA (SDNet did not report results on QuAC).
3.4 Ablation Study
We conduct an extensive ablation study to further investigate the performance impact of different components in our model. Here we briefly describe the ablated systems: – TempConn removes temporal connections between consecutive context graphs, – GF removes the GF layer, – PreQues does not prepend previous questions to the current turn, – PreAns does not prepend previous answers to the current turn, – PreAnsLoc does not mark previous answer locations in the context, and – BERT removes pretrained BERT embeddings. We also show the model performance with no conversation history, GraphFlow (0-His), and with one previous turn of the conversation history, GraphFlow (1-His).
Table 3 shows the contributions of the above components on the CoQA development set. We find that the pretrained BERT embedding has the most impact on the performance, which again demonstrates the power of large-scale pretrained language models. Our proposed GF mechanism also contributes significantly to the model performance (i.e., improves F1 score by 1.4%). In addition, within the GF layer, both the GNN part (i.e., 1.1% F1) and the temporal connection part (i.e., 0.3% F1) contribute to the results. We also notice that explicitly adding conversation history to the current turn helps the model performance by comparing GraphFlow (2-His), GraphFlow (1-His) and GraphFlow (0-His). We can see that the previous answer information is more crucial than the previous question information. And among many ways to use the previous answer information, directly marking previous answer locations seems to be the most effective. We conjecture this is partially because the turn transitions in a conversation are usually smooth and marking the previous answer locations helps the model better identify relevant context chunks for the current question.
3.5 Interpretability Analysis
Here we visualize the memory bank (i.e., an $m$ by $d$ matrix, where $m$ is the number of context words and $d$ is the hidden state size) which stores the hidden representations (and thus the reasoning output) of the context throughout a conversation. While directly visualizing the hidden representations is difficult, thanks to the flow-based mechanism introduced into our model, we instead visualize the changes of hidden representations of context words between consecutive turns. We expect that the parts of the context that change the most are those relevant to the questions being asked, and that they can therefore indicate shifts of focus in a conversation.
Following Huang et al. (2018), we visualize this by computing the cosine similarity of the hidden representations of the same context words at consecutive turns, and then highlight the words that have small cosine similarity scores (i.e., change more significantly). For better visualization, we apply an attention threshold of 0.3 to highlight only the dramatically changing context words. Fig. 2 highlights the most changing context words between consecutive turns in a conversation from the CoQA development set. As we can see, the hidden representations of context words which are relevant to the consecutive questions change the most and are thus highlighted the most. We suspect this is in part because when the focus shifts, the model finds that the context chunks relevant to the previous turn become less important while those relevant to the current turn become more important. Therefore, the memory updates in these regions are the most active. This makes it easier for the model to answer follow-up questions. As we observe in our visualization experiments, in conversations extensively involving coreference or ellipsis, our model can still perform reasonably well. We refer interested readers to the supplementary material for complete visualization examples.
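The change detection behind this visualization can be sketched as follows. How exactly the 0.3 threshold is applied is not specified above, so the sketch assumes a change score of one minus the cosine similarity and highlights words whose score exceeds the threshold; the function names and the toy hidden states are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def most_changed(words, prev_states, curr_states, threshold=0.3):
    """Return words whose change score (1 - cosine similarity) exceeds the threshold."""
    changed = []
    for word, prev, curr in zip(words, prev_states, curr_states):
        if 1.0 - cosine(prev, curr) > threshold:
            changed.append(word)
    return changed

# Hidden states of three context words at two consecutive turns (toy values).
words = ["Jessica", "went", "garden"]
prev_states = [[1.0, 0.0], [1.0, 1.0], [1.0, 0.0]]
curr_states = [[1.0, 0.1], [1.0, 1.0], [0.0, 1.0]]
highlighted = most_changed(words, prev_states, curr_states)
```

Only "garden" is highlighted here, since its representation rotates by 90 degrees between the two turns while the other two words barely move.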
4 Related Work
4.1 Conversational MRC
Many methods have been proposed to utilize conversation history in the literature of conversational MRC. Reddy et al. (2018); Zhu et al. (2018) concatenate previous questions and answers to the current question. Choi et al. (2018) concatenate a feature vector encoding the turn number to the question word embedding and a feature vector encoding previous N answer locations to the context embeddings. However, as claimed by Huang et al. (2018), these methods ignore previous reasoning processes performed by the model when reasoning at the current turn. Instead, they propose the idea of Integration-Flow to allow rich information in the reasoning process to flow through a conversation.
Another challenge of this task is how to handle abstractive answers. Reddy et al. (2018) propose a hybrid method DrQA+PGNet, which augments a traditional extractive RC model with a text generator. Yatskar (2018) propose to first make a Yes/No decision, and output an answer span only if Yes/No was not selected.
4.2 Graph Neural Networks
Graph neural networks (GNNs) have drawn increasing attention since they extend traditional deep learning approaches to non-Euclidean data such as graph-structured data. Recently, GNNs have been applied to various question answering tasks including visual question answering (VQA) Teney et al. (2017); Norcliffe-Brown et al. (2018), knowledge base question answering (KBQA) Sun et al. (2018) and MRC De Cao et al. (2018); Song et al. (2018); Xu et al. (2018c), and have shown advantages over traditional approaches. For tasks where the graph structure is unknown, De Cao et al. (2018); Song et al. (2018); Xu et al. (2018b) construct a static graph where entity mentions in a passage are the nodes and the edges are determined by coreferences between entity mentions. Norcliffe-Brown et al. (2018) dynamically construct a graph which contains all the visual objects appearing in an image. In parallel to our work, Liu et al. (2018) also dynamically construct a graph of all words from free text.
5 Conclusion
We proposed a novel GNN-based model, namely GraphFlow, for conversational MRC, which carries over the reasoning output throughout a conversation. On two recently released conversational MRC benchmarks, our proposed model achieves superior results over previous approaches. Interpretability analysis shows that the model can better mimic the human reasoning process for conversational MRC compared to existing models.
In the future, we would like to investigate more effective ways of automatically learning graph structures from free text and modeling temporal connections between sequential graphs.
References
- Reading wikipedia to answer open-domain questions. arXiv preprint arXiv:1704.00051. Cited by: §3.3.
- Learning phrase representations using rnn encoder–decoder for statistical machine translation. In EMNLP, pp. 1724–1734. Cited by: §2.3.
- QuAC: question answering in context. arXiv preprint arXiv:1808.07036. Cited by: §1, §1, §2.1, §3.1, §4.1.
- Question answering by reasoning across documents with graph convolutional networks. arXiv preprint arXiv:1808.09920. Cited by: §4.2.
- Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §2.1.
- Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1263–1272. Cited by: §4.2.
- Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034. Cited by: §4.2.
- Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §2.2.1.
- Flowqa: grasping flow in history for conversational machine comprehension. arXiv preprint arXiv:1810.06683. Cited by: §1, §2.3, §3.1, §3.3, §3.5, §4.1.
- Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.2.
- Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §4.2.
- Learning recurrent span representations for extractive question answering. arXiv preprint arXiv:1611.01436. Cited by: §2.1.
- Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493. Cited by: §2.2.3, §4.2.
- Contextualized non-local neural networks for sequence learning. arXiv preprint arXiv:1811.08600. Cited by: §4.2.
- Learning conditioned graph structures for interpretable visual question answering. In Advances in Neural Information Processing Systems, pp. 8344–8353. Cited by: §4.2.
- Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §2.1.
- Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250. Cited by: §1.
- Coqa: a conversational question answering challenge. arXiv preprint arXiv:1808.07042. Cited by: §1, §1, §3.1, §3.3, §4.1, §4.1.
- Get to the point: summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368. Cited by: §3.3.
- Exploring graph-structured passage representation for multi-hop reading comprehension with graph neural networks. arXiv preprint arXiv:1809.02040. Cited by: §4.2.
- Open domain question answering using early fusion of knowledge bases and text. arXiv preprint arXiv:1809.00782. Cited by: §4.2.
- Graph-structured representations for visual question answering. In , pp. 1–9. Cited by: §4.2.
- Multi-granularity hierarchical attention fusion networks for reading comprehension and question answering. arXiv preprint arXiv:1811.11934. Cited by: §2.2.4.
- Graph2seq: graph to sequence learning with attention-based neural networks. arXiv preprint arXiv:1804.00823. Cited by: §4.2.
- Exploiting rich syntactic information for semantic parsing with graph-to-sequence model. arXiv preprint arXiv:1808.07624. Cited by: §4.2.
- SQL-to-text generation with graph-to-sequence model. arXiv preprint arXiv:1809.05255. Cited by: §4.2.
- A qualitative comparison of coqa, squad 2.0 and quac. arXiv preprint arXiv:1809.10735. Cited by: §1, §3.1, §3.3, §4.1.
- SDNet: contextualized attention-based deep network for conversational question answering. arXiv preprint arXiv:1812.03593. Cited by: §1, §2.1, §2.3, §3.1, §3.3, §4.1.
Appendix A Supplemental Material
A.1 Visualization of the Flow Operation
More visualization examples on detecting the most changing parts in the context.