An Abstractive approach to Question Answering

11/16/2017 ∙ by Rajarshee Mitra, et al. ∙ Microsoft

Question Answering has come a long way from answer sentence selection and relational QA to reading comprehension. We turn our attention to abstractive question answering, in which the machine reads passages and answers questions by generating the answers. We frame the problem as sequence-to-sequence learning, where the encoder is a network that models the relation between question and passage, so that the model relies solely on the passage and question content to form an abstraction of the answer. Failing to retain facts and making repetitions are common mistakes that affect the overall legibility of answers. To counter these issues, we employ a copying mechanism and maintain a coverage vector in our model, respectively. Our results on MS-MARCO demonstrate its superiority over baselines, and we also show qualitative examples where we improve in terms of correctness and readability.


1 Introduction

Question Answering is a crucial problem in language understanding and a major milestone towards human-level machine intelligence. Datasets like SQuAD, MS-MARCO and others have led to a plethora of contributions in machine reading and comprehension. Next-generation QA systems can be envisioned as ones that read passages and write long answers to questions. We formulate generative question answering as a form of QA where we expect the machine to produce an abstractive answer A that encompasses all the information from passage P required to answer a question Q. Eventually, such systems, if capable of understanding what information is necessary and what is not, will enable us to acquire information from multiple sources and present it in the form of a summarized answer.

The assumption, prevalent in most existing approaches, that the answer must always be a particular sub-span of the passage is a very strict one.

Generating answers requires carefully incorporating the facts and entities necessary to answer the question while simultaneously discarding irrelevant information from P. This requires building complex relations between the question and the passage. What makes it further challenging is that we need not only a good generative model but also readable generations. Even if one achieves good results in terms of lexical similarity metrics like ROUGE-L, determining how much correctness and readability is preserved is a significant concern in this task.

Our generative model is inspired by the Seq2Seq model (Sutskever et al. (2014)), which is the basis of various NLG tasks like translation and summarization. In this paper, we propose a model that learns an alignment between question and passage words to produce a rich query-aware passage representation, and uses this same representation to directly decode the answer while attending to all the states in the representation. Learning end-to-end, our approach has no dependency on any external extractive labels. We also propose an approach where we make the decoder RNN state computation attention-aware by modifying the computed state of the previous step with the attended encoder context.

While building such a model, we noticed that it often replaces correct entities with similar but incorrect ones (e.g., a correct year with an incorrect year). This hinders the overall correctness of the answer being generated. To tackle this, we incorporate a copying mechanism (Gu et al. (2016), See et al. (2017)) that learns when to copy an important entity directly from the passage instead of generating from the vocabulary. This makes our approach abstractive-cum-extractive. Furthermore, a common error in generative models is repetitiveness in the generated text, which also degrades the generation that follows. A common remedy in similar tasks, such as machine translation and summarization, is maintaining a coverage vector (Tu et al. (2016)) that keeps track of which encoder states have been attended to, and to what extent, in the past.

Our main novelty lies in demonstrating that it is possible to generate answers directly from the modeled relationship between question and passage without the need to build an extraction model or provide any positional labels (like start/end). We show that the decoder, with the help of pointer-generator networks, can itself choose to copy or generate answer words.

2 Related Work

Traditional QA models like Kumar et al. (2016) and Seo et al. (2016a) have shown fascinating results, whether in relation-based QA (Weston et al. (2015)) or in machine reading and comprehension (Wang and Jiang (2016); Seo et al. (2016b); Wang et al. (2017)). Their methods demonstrated how successful the pointer networks of Vinyals et al. (2015) can be. A considerable amount of work has also been done on passage ranking (Tang et al., 2017). Tan et al. (2017) take a generative approach where they add a decoder on top of their extractive model, thus leveraging the extracted evidence to synthesize the answer. However, the model still relies strongly on the extraction to perform the generation, which means start and end labels (a span) are essential for every example. Moreover, when there are multiple regions in P that contribute to A, the model needs to predict multiple spans.

Our approach differs in that it relies only on the content of Q and P to generate the answer, and the extraction-abstraction soft switch is learned as part of the training procedure without any dependency on extraction.

3 Model Details

Our model consists of an encoder that computes representations using context and attentive layers, and an attentive (Bahdanau et al. (2014)) decoder with pointer-generator networks to decide when to copy or generate. We also keep track of the attention at each time step to maintain a coverage vector that improves readability.

3.1 Representation

We use the standard GRU (Cho et al. (2014)) as the building block of our recurrence for computing representations for both questions and passages.

Initially, both P and Q are expressed by their respective word embeddings as $Q = \{q_1, \dots, q_m\}$ and $P = \{p_1, \dots, p_n\}$. The representations are then built by a multi-layered bi-directional GRU, with weights shared between P and Q:

$u^Q_t = \mathrm{BiGRU}(u^Q_{t-1}, q_t), \quad u^P_t = \mathrm{BiGRU}(u^P_{t-1}, p_t)$ (1)

where $u^Q_t$ and $u^P_t$ are the hidden states of the GRU at time step $t$.

3.2 Attentive Layer

Figure 1: Block diagram of our model. Both Attention1 and Attention2 are the scaled multiplicative attention we discuss in (2)-(3).

We use multiplicative attention with a scaling factor (Vaswani et al., 2017) to compute an alignment between question words $u^Q_j$ and passage words $u^P_t$. Specifically, for each $u^P_t$, we take the weighted sum of all $u^Q_j$, which is then concatenated to $u^P_t$ to form its final representation. We also use a gating mechanism (Tan et al., 2017) to give varying importance to passage words (4).

Figure 2: An alignment matrix between Q and P shows how the model associates each word in Q with each word in P. As an example, when the model reads the word long in the question, it focuses most of its attention on 6-7 months in the passage. All columns are softmax normalized.

Mathematically, we can express the attentive layer as:

$a^t_j = \mathrm{softmax}_j\!\left(\frac{\bar{u}^{Q\top}_j \bar{u}^P_t}{\sqrt{d}}\right), \quad \bar{u}^Q_j = f(W^Q u^Q_j), \; \bar{u}^P_t = f(W^P u^P_t)$ (2)

$c_t = \sum_j a^t_j\, u^Q_j$ (3)

$g_t = \sigma\big(W_g [u^P_t ; c_t]\big), \quad v^P_t = g_t \odot [u^P_t ; c_t]$ (4)

$h^P_t = \mathrm{BiGRU}\big(h^P_{t-1}, v^P_t\big)$ (5)

$\bar{u}^Q_j$ and $\bar{u}^P_t$ are non-linear transformations $f$ of the question and passage representations $u^Q_j$ and $u^P_t$ respectively. (2) computes dot-product attention between the transformed question and passage representations, followed by a scaling factor $1/\sqrt{d}$ and a softmax normalization to produce the weights: for each passage word, we find the weighted importance of all question words. With the weights we thus obtain, we take the weighted sum of all question words for that particular passage word (3); this is the new question-attended passage representation $c_t$. We further concatenate it with the original passage representation, followed by a sigmoid gating (4), to represent how important this passage time step is for the encoding.

Question: when was the death penalty abolished?
Passage: The last executions UK took place in 1964 , and the death penalty was abolished in 1998 . In 2004 the UK became a party to the 13th Protocol to the European Convention on Human Rights and prohibited the restoration of the death penalty . Death penalty , capital punishment , or execution is the legal process of putting a person to death as a punishment for a crime . Modern History of Death Penalty . 1608 - Earliest death penalty in the British American Colonies handed out for UNK
Model with copying: 1998
Model without copying: 1964
Table 1: Replacement of a correct entity by an incorrect but similar one. We show two results from two models – with and without pointer-generator.

The penultimate stage of the attentive layer contains a bi-directional GRU (5) that acts as a smoothing layer over the concatenation of the original and question-attended passage representations.

We concatenate the final question representation $u^Q$ and the smoothed passage representation $h^P$ and perform a non-linear transformation before passing the result to the decoder as its initial state:

$s_0 = f\big(W_s [u^Q ; h^P]\big)$ (6)

Figure 2 shows a particular example of how the relationship between question and passage is modeled.
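To make the attentive layer concrete, the following is a minimal PyTorch-style sketch of equations (2)-(5), assuming tanh for the non-linear transformation $f$; the module name, variable names and dimensions are our illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveLayer(nn.Module):
    """Question-aware passage encoder sketch: scaled multiplicative
    attention (2)-(3), sigmoid gating (4) and a BiGRU smoothing layer (5).
    Module/variable names and tanh for f are our assumptions."""

    def __init__(self, hidden: int = 256):
        super().__init__()
        self.w_q = nn.Linear(hidden, hidden)   # transforms question states
        self.w_p = nn.Linear(hidden, hidden)   # transforms passage states
        self.gate = nn.Linear(2 * hidden, 2 * hidden)
        self.smooth = nn.GRU(2 * hidden, hidden, bidirectional=True,
                             batch_first=True)
        self.scale = hidden ** 0.5

    def forward(self, u_q, u_p):
        # u_q: (batch, m, hidden) question states; u_p: (batch, n, hidden) passage states
        q_bar = torch.tanh(self.w_q(u_q))
        p_bar = torch.tanh(self.w_p(u_p))
        scores = torch.bmm(p_bar, q_bar.transpose(1, 2)) / self.scale  # (batch, n, m)
        a = F.softmax(scores, dim=-1)        # attention over question words (2)
        c = torch.bmm(a, u_q)                # question-attended passage repr. (3)
        v = torch.cat([u_p, c], dim=-1)
        v = torch.sigmoid(self.gate(v)) * v  # sigmoid gating (4)
        h, _ = self.smooth(v)                # smoothing BiGRU (5)
        return h                             # (batch, n, 2 * hidden)
```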

3.3 Decoding

Q: what is a urethra
P: in anatomy , the urethra is a tube that connects the urinary bladder to the urinary meatus for the removal of fluids from the body . 1 infection of the urethra is urethritis , said to be more common in females than males . 2 urethritis is a common cause of dysuria ( pain when urinating ) . 3 related to urethritis is so called urethral syndrome . 4 passage of kidney stones through the urethra can be painful , which can lead to urethral strictures .
Model with coverage: is a tube that connects urinary bladder to the urinary meatus for the removal of fluids from the body ……
Model without coverage: the urethra a tube that connects urinary urinary bladder the the tube that connects the urinary bladder fluids the urinary body
Table 2: Here are the two types of repetition we notice: consecutive identical words ("urinary" appears twice in a row) and duplicate phrases ("tube that connects the urinary bladder"). We show two results from two models – with and without coverage.

We use the GRU cell to compute our decoder states, with the same scaled attention as in (2)-(3).
The decoder decodes the information encoded by the query-aware passage encoder, aggregating information from various parts of the passage to produce the final answer. As seen in Figure 2, different regions of the passage can be weighted more heavily if they are relevant to the query. In such cases, we have often seen from question-passage heatmaps that several areas of the passage are marked important. For example, when the query asks for a year, all the words in the passage that denote a year stand out, more or less, from the other passage words (this issue is also highlighted in Table 1). It is then the decoder's job to pick the correct year. The decoder is initialized with a combination of the question representation and the encoded passage representation (from (6)).

Making the RNN attention-aware: We use the attended context concatenated with the previous decoder state as the input to the RNN to compute the current state. This makes the RNN aware of previous attentions. This differs from (Luong et al., 2015), where attention is applied after the RNN state computation. We have observed that this difference yields a gain of at least 1 ROUGE-L point. The decoder state computation can be expressed as:

$\alpha^t = \mathrm{softmax}\big(\mathrm{score}(s_{t-1}, h^P)\big), \quad c_t = \sum_j \alpha^t_j h^P_j, \quad s_t = \mathrm{GRU}\big(s_{t-1}, [s_{t-1} ; c_t]\big)$ (7)

where $\alpha^t$ is the attention distribution over the encoder states at decoder time step $t$ and $s_t$ is the decoder hidden state. The attention methodology used here is identical to the one used in the attentive layer. Based on $\alpha^t$, the decoder decides which encoder states to focus on more. $c_t$ is the attended context vector of the encoder at decoding time step $t$. We use this context vector, in concatenation with the decoder GRU's previous state, to compute the current state $s_t$. Finally, we compute the decoder output probability as

$P_{\mathrm{vocab}} = \mathrm{softmax}\big(W_v [s_t ; c_t] + b_v\big)$ (8)

This is a fully abstractive approach.
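Below is a hedged sketch of one attention-aware decoding step, following equations (7)-(8) as reconstructed above; `gru_cell`, `score_fn` and `out_proj` are placeholder components we introduce for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def decoder_step(gru_cell, score_fn, out_proj, enc_states, s_prev):
    """One attention-aware decoding step, following (7)-(8). `gru_cell`
    (nn.GRUCell over [previous state; context]), `score_fn` (scaled
    attention scores) and `out_proj` (vocabulary projection) are
    illustrative placeholders."""
    # attention over encoder states, computed from the previous decoder state
    scores = score_fn(s_prev, enc_states)                     # (batch, n)
    alpha = F.softmax(scores, dim=-1)
    c_t = torch.bmm(alpha.unsqueeze(1), enc_states).squeeze(1)

    # (7): the previous state, concatenated with the attended context,
    # is fed as input to the GRU to compute the current state
    s_t = gru_cell(torch.cat([s_prev, c_t], dim=-1), s_prev)

    # (8): distribution over the output vocabulary
    p_vocab = F.softmax(out_proj(torch.cat([s_t, c_t], dim=-1)), dim=-1)
    return s_t, c_t, alpha, p_vocab
```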

3.3.1 Copying from source

However, to overcome the limitation of missing entities from P (Table 1), or when dealing with mostly extractive data, we apply pointer-generator networks to make the approach abstractive-cum-extractive. With the pointer-generator model, at each decoder time step, we make a probabilistic switch between copying and simply choosing from the output distribution $P_{\mathrm{vocab}}$. This is governed by $p_{\mathrm{gen}}$, which is computed as (See et al. (2017)):

$p_{\mathrm{gen}} = \sigma\big(w_c^\top c_t + w_s^\top s_t + b_{\mathrm{gen}}\big)$ (9)

At each decoder time step, we use the decoder state $s_t$ and the context vector $c_t$ to decide whether to copy one of the encoder words or to generate the most probable word from the decoder output distribution.

$P(w) = p_{\mathrm{gen}}\, P_{\mathrm{vocab}}(w) + (1 - p_{\mathrm{gen}}) \sum_{j : p_j = w} \alpha^t_j$ (10)

From a high-level point of view, the pointer-generator model, at each decoder step $t$, adds the attention probability of each encoder word from $\alpha^t$ to that word's probability in the decoder output distribution at that time step. A sigmoid over the affine transformation in (9) produces the switch probability. A higher $p_{\mathrm{gen}}$ means the model will mostly choose a word from the vocabulary, while a lower $p_{\mathrm{gen}}$ means a higher chance of copying a source word from the passage.

$P(w)$ is the final probability distribution, which is used for generating the answer word and for computing the training loss.
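A minimal sketch of the copy/generate switch in (9)-(10) follows, assuming PyTorch tensors; `src_ids`, `w_s`, `w_c` and `b` are illustrative names for the passage word ids and the switch parameters, not identifiers from the paper.

```python
import torch

def pointer_generator(p_vocab, alpha, s_t, c_t, src_ids, w_s, w_c, b):
    """Sketch of the copy/generate switch (9)-(10). `src_ids` holds the
    vocabulary id of every passage word so that attention mass can be
    added to the matching vocabulary entries; parameter names are ours."""
    # (9): scalar switch computed from the decoder state and context vector
    p_gen = torch.sigmoid(s_t @ w_s + c_t @ w_c + b)           # (batch, 1)

    # (10): mix the generation distribution with the copy distribution
    p_final = p_gen * p_vocab                                   # (batch, vocab)
    p_final = p_final.scatter_add(1, src_ids, (1.0 - p_gen) * alpha)
    return p_final
```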

3.3.2 Mitigating repetitions through coverage

Readability is a major concern in almost all NLG tasks (Table 2). The copying mechanism helps to preserve information from the encoder side, but repetition of words remains a problem that we have observed in our experiments. We realized that alleviating this problem would lead to better readability of answers. Hence, we incorporated a coverage mechanism, a popular approach in machine translation and summarization, into our model. In MT it is used to avoid over-translation by keeping track of which source words have received attention too many times. Specifically, we use the attention distributions computed at each decoder time step to maintain a coverage vector, which is the cumulative sum of the attention probabilities.

Hence, at each time step, the decoder has information about how much each encoder state has been attended to up to the previous step. We add an extra coverage term to the attention computation in (7). This gives the standard attention mechanism knowledge of which states have already been attended to enough in the past, and thus curbs repetitions.
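The sketch below shows one way the coverage vector can be maintained and fed back into the attention of (7), under the assumption that the extra term enters the scores before the softmax; `score_fn` and `w_cov` are hypothetical placeholders for the scaled attention and a learned coverage weight.

```python
import torch.nn.functional as F

def attend_with_coverage(score_fn, s_prev, enc_states, coverage, w_cov):
    """Coverage-aware attention sketch: the cumulative attention so far is
    added to the attention scores of (7) before the softmax. `score_fn`
    and `w_cov` are hypothetical placeholders."""
    scores = score_fn(s_prev, enc_states) + w_cov * coverage  # extra coverage term
    alpha = F.softmax(scores, dim=-1)
    coverage = coverage + alpha        # cumulative sum of attention probabilities
    return alpha, coverage
```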

3.4 Loss

At each decoder time step, we take the negative log-likelihood of the correct word and minimize the total loss over all decoder steps.

$L = -\frac{1}{T} \sum_{t=1}^{T} \log P(y^*_t)$ (11)

where $y^*_t$ is the reference answer word at decoder step $t$.
Model | ROUGE-L (train/dev) | Perplexity (train/dev)
Seq2Seq baseline | -/37.70 | -/-
gQA (+p-gen, +cov.) | 74/59.5 | 4.35/4.62
gQA (+p-gen) | 70/57.5 | 3.36/4.05
gQA (w/o p-gen) | 45/42. | 4.3/4.48
Table 3: The best performance on the MS-MARCO development data, given the correct passage. We report perplexity alongside the ROUGE-L score. For each numeric column, scores follow the train/dev format.
Figure 3: Example result: core information retained

4 Data and Experiments

We conducted our experiments on the MS-MARCO dataset (Nguyen et al. (2016)); the data is available for download at http://www.msmarco.org/dataset.aspx. In our experiments, to form the passage-answer pairs from the training data, we select the correct input passage for an answer by determining which passage has a sub-span with the highest ROUGE-L score against the reference answer (Tan et al. (2017)). We also verify that this ROUGE-L score is not less than 0.7. The filtered training data consists of 75000 examples. For evaluation, we take the correct passage for each query and generate the answer from it. We relied on Stanford CoreNLP (Manning et al. (2014)) to tokenize all texts. We restricted the vocabulary to 30000 words and the lengths of P and A to 200 and 50 respectively. We use a batch of 50 examples for updating our model while training for roughly 15000 iterations. We use GloVe (Pennington et al. (2014)) to represent words and keep the embeddings fixed. We use a hidden state dimension of 256 throughout the network. We use the Adadelta optimizer (Zeiler (2012)), with epsilon=1e-6, rho=0.95 and an initial learning rate of 1, to minimize our loss.
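As a rough illustration of this filtering step, the sketch below picks, for each training answer, the passage with the highest ROUGE-L score against the reference answer and drops pairs below the 0.7 threshold; `rouge_l` is an assumed helper, not part of any specific library.

```python
def select_passage(passages, reference_answer, rouge_l, threshold=0.7):
    """Training-pair construction sketch: keep the passage whose best
    sub-span has the highest ROUGE-L overlap with the reference answer,
    and drop the example if that score is below the threshold. `rouge_l`
    is an assumed helper returning the best sub-span's ROUGE-L score."""
    best_passage, best_score = None, 0.0
    for passage in passages:
        score = rouge_l(passage, reference_answer)
        if score > best_score:
            best_passage, best_score = passage, score
    return best_passage if best_score >= threshold else None
```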

5 Results and Analysis

We list our results on the MS-MARCO data in Table 3. We do not consider negative passages and test the model only on the correct passage for each query. We also include the result reported in Table 6 of Tan et al. (2017), obtained by experimenting with a basic sequence-to-sequence model with the selected passage as input. The results show that our model outperforms the generative baseline by a large margin. The learnt mean value of $p_{\mathrm{gen}}$ is mostly around 0.7, which tells us that the data is largely extractive in nature and forces the model to copy most of the time. We also tried self-attention, but it did not improve performance, most probably because the decoder attention is already aggregating information from different parts of the passage. Besides improving the score, coverage also reduced repetitions in multiple instances. We provide some examples in Fig. 3 and Fig. 4.

Figure 4: Example result: almost same answer

6 Conclusion

We successfully built a generative model that not only efficiently models the relationship between question and passage but also generates answers from the encoded relationship. We let the model decide whether to be in abstractive or extractive mode based on the nature of the data, thus removing any dependency on external extractive features. Moreover, we apply coverage in this QA task to mitigate frequent repetitions.

References