An Attention Mechanism for Answer Selection Using a Combined Global and Local View

07/05/2017 ∙ by Yoram Bachrach, et al. ∙ DigitalGenius

We propose a new attention mechanism for neural based question answering, which depends on varying granularities of the input. Previous work focused on augmenting recurrent neural networks with simple attention mechanisms which are a function of the similarity between a question embedding and an answer embedding across time. We extend this by making the attention mechanism dependent on a global embedding of the answer attained using a separate network. We evaluate our system on InsuranceQA, a large question answering dataset. Our model outperforms current state-of-the-art results on InsuranceQA. Further, we visualize which sections of text our attention mechanism focuses on, and explore its performance across different parameter settings.




I Introduction

Question answering (QA) relates to the building of systems capable of automatically answering questions posed by humans in natural language. Various frameworks have been proposed for question answering, ranging from simple information-retrieval techniques for finding relevant knowledge articles or webpages, through methods for identifying the most relevant sentence in a text regarding a posed question, to methods for querying structured knowledge-bases or databases to produce an answer [1, 2, 3, 4, 5].

A popular QA task is answer selection, where, given a question, the system must pick correct answers from a pool of candidate answers [6, 7, 8, 9, 10].

Answer selection has many commercial applications. Virtual assistants such as Amazon Alexa and Google Assistant are designed to respond to natural language questions posed by users. In some cases such systems simply use a search engine to find relevant webpages; however, for many kinds of queries, such systems are capable of providing a concise specific answer to the posed question.

Similarly, various AI companies are attempting to improve customer service by automatically replying to customer queries. One way to design such a system is to curate a dataset of historical questions posed by customers and the responses given to these queries by human customer service agents. Given a previously unobserved query, the system can then locate the best matching answer in the curated dataset.

Answer selection is a difficult task, as typically there is a large number of possible answers which need to be examined. Furthermore, although in many cases the correct answer is lexically similar to the question, in other cases semantic similarities between words must be learned in order to find the correct answer [11, 12]. Additionally, many of the words in the answer may not be relevant to the question.

Consider, for example, the following question answer pair:

How do I freeze my account?
Hello, hope you are having a great day. You can freeze your account by logging into our site and pressing the freeze account button. Let me know if you have any further questions regarding the management of your account with us.

Intuitively, the key section which identifies the above answer as correct is “[…] you can freeze your account by […]”, which represents a small fraction of the entire answer.

Earlier work on answer selection used various techniques, ranging from information retrieval methods [13] and machine learning methods relying on hand-crafted features [14, 15]. Deep learning methods, which have recently shown great success in many domains including image classification and annotation [16, 17, 18], multi-annotator data fusion [19, 20], NLP and conversational models [21, 22, 23, 24, 25] and speech recognition [21, 19], have also been successfully applied to question answering [26]. Current state-of-the-art methods use recurrent neural network (RNN) architectures which incorporate attention mechanisms [27]. These allow such models to better focus on relevant sections of the input [22].

Our contribution: We propose a new architecture for question answering. Our high-level approach is similar to recently proposed QA systems [26, 27], but we augment this design with a more sophisticated attention mechanism, combining the local information in a specific part of the answer with a global representation of the entire question and answer.

We evaluate the performance of our model using the recently released InsuranceQA dataset [26], a large open dataset for answer selection comprised of insurance related questions such as: “what can you claim on Medicare?”. (As opposed to other QA tasks such as answer extraction or machine text comprehension and reasoning [28, 29], the InsuranceQA dataset questions do not generally require logical reasoning.)

We outperform state-of-the-art approaches [26, 27], and achieve good performance even when using a relatively small network.

II Previous Work

Answer selection systems can be evaluated using various datasets consisting of questions and answers. Early answer selection models were commonly evaluated against the QASent dataset [15]; however, this dataset is very small and thus less similar to real-world applications. Further, its candidate answer pools are created by finding sentences with at least one similar (non-stopword) word as compared to the question, which may create a bias in the dataset.

Wiki-QA [30] is a dataset that contains several orders of magnitude more examples than QASent, where the candidate answer pools were created from the sentences in the relevant Wikipedia page for a question, reducing the amount of keyword bias in the dataset compared to QASent.

Our analysis is based on the InsuranceQA [26] dataset, which is much larger, and similar to real-world QA applications. The answers in InsuranceQA are relatively long (see details in Section V-A), so the candidate answers are likely to contain content that does not relate directly to the question; thus, a good QA model for InsuranceQA must be capable of identifying the most important words in a candidate answer.

Early work on answer selection was based on finding the semantic similarity between question and answer parse trees using hand-crafted features [14, 15]. Often, lexical databases such as WordNet were used to augment such models [31]. Not only did these models suffer from using hand-crafted features, those using lexical databases were also often language-dependent.

Recent attempts at answer selection aim to map questions and candidate answers into n-dimensional vectors, and use a vector similarity measure such as cosine similarity to judge a candidate answer’s affinity to a question. In other words, the similarity between a question and a candidate is high if the candidate answers the question well, low if the candidate is not a good match for the question.

Such models are similar to Siamese models, a good review of which can be found in Muller et al.'s paper [32]. Feng et al. [26] propose using convolutional neural networks to vectorize both questions and answers before comparing them using cosine similarity. Similarly, Tan et al. [27] use a recurrent neural network to vectorize questions and answers. Attention mechanisms have proven to greatly improve the performance of recurrent networks in many tasks [22, 27, 33, 34, 35], and indeed Tan et al. [27] incorporate a simple attention mechanism in their system.

III Preliminaries

Fig. 1: Model architecture using answer-localized attention [27]. The left hand side is used for the question. The right side of the architecture is used for both the answer and distractor.
Fig. 2: Our proposed architecture with augmented attention. As in Figure 1, the right side of the model is used to embed answers and distractors.

Our approach is similar to the Answer Selection Framework of Tan et al. [27], but we propose a different network architecture and a new attention mechanism. We first provide a high level description of this framework (see the original paper for a more detailed discussion), then discuss our proposed attention mechanism.

The framework is based on a neural network with parameters $\theta$ which can embed either a question $q$ or a candidate answer $a$ into low dimensional vectors. The network can embed a question with no attention, which we denote as $f_\theta(q)$, and embed a candidate answer with attention to the question, denoted as $g_\theta(a, q)$. We denote the similarity function used as $s(\cdot, \cdot)$ ($s$ may be the dot product function, the cosine similarity function or some other similarity function).

Given a trained network, we compute the similarity between question and answer embeddings:

$$s_i = s\big(f_\theta(q),\, g_\theta(a_i, q)\big)$$

for any $i \in \{1, \dots, k\}$, with $a_i$ being the $i$'th candidate answer in the pool and $k$ the size of the pool. We then select the answer yielding the highest similarity, $\arg\max_i s_i$.
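The selection step can be illustrated with a minimal sketch in plain Python. It assumes the question and candidate embeddings have already been computed by some network; the function names `cosine_similarity` and `select_answer` are hypothetical, and cosine similarity is only one of the similarity functions the text allows.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)

def select_answer(question_vec, candidate_vecs, similarity=cosine_similarity):
    """Return (index, score) of the candidate most similar to the question."""
    scores = [similarity(question_vec, c) for c in candidate_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Toy embeddings (illustrative values only, not produced by a trained model).
question = [1.0, 0.0]
pool = [[0.0, 1.0], [2.0, 0.1], [-1.0, 0.0]]
best_index, best_score = select_answer(question, pool)
```

Here the second candidate points in almost the same direction as the question vector, so it is the one selected.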

The embedding functions, $f_\theta$ (for the question) and $g_\theta$ (for the answer), depend on the architecture used and the parameters $\theta$. The network is trained by choosing a loss function $L$, and using stochastic gradient descent to tune the parameters given the training data. Each training item consists of a question $q$, the correct answer $a^+$ and a distractor $a^-$ (an incorrect answer). A prominent choice is using a shifted hinge loss, designating that the correct answer must have a higher score than the distractor by at least a certain margin $M$, where the score is based on the similarity $s$ to the question:

$$L = \max\big\{0,\; M - s\big(f_\theta(q), g_\theta(a^+, q)\big) + s\big(f_\theta(q), g_\theta(a^-, q)\big)\big\}$$

The above expression has a zero loss if the correct answer has a score higher than the distractor by at least a margin $M$; otherwise, the loss increases linearly as the score of the correct answer drops relative to the score of the distractor.
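As a small worked illustration of the shifted hinge loss, the sketch below takes the two similarity scores as plain numbers. The function name and the default margin of 0.2 are assumptions made for the example, not values taken from the paper.

```python
def shifted_hinge_loss(score_correct, score_distractor, margin=0.2):
    """Zero when the correct answer beats the distractor by at least `margin`,
    otherwise grows linearly with the shortfall."""
    return max(0.0, margin - score_correct + score_distractor)

# The correct answer wins by 0.8, well above the margin: no loss.
loss_easy = shifted_hinge_loss(0.9, 0.1)
# The correct answer wins by only 0.1, so the loss is the missing 0.1.
loss_hard = shifted_hinge_loss(0.5, 0.4)
```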

Any reasonable neural network design for the question and answer embedding functions can be used to build a working answer-selection system using the above approach; however, the network design can have a big impact on the system’s accuracy.

III-A Embedding Questions and Answers

Earlier work examined multiple approaches for embedding questions and answers, including convolutional neural networks, recurrent neural networks (RNNs) (sometimes augmented with an attention mechanism) and hybrid designs [26, 27].

An RNN design “digests” the input sequence, one element at a time, changing its internal state at every timestep. The RNN is based on a cell, a parametrized function mapping a current state and an input element to the new state [36]. A popular choice for the RNN’s cell is the Long Short Term Memory (LSTM) cell.


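Given a question $q$ comprised of words $q_1, \dots, q_m$, we denote the $i$'th output of an LSTM RNN digesting the question as $h^q_i$; similarly, given an answer $a$ comprised of words $a_1, \dots, a_n$, we denote the $i$'th output of an LSTM RNN digesting the answer as $h^a_i$.

One simple approach is to have the embeddings of the question and answer, $f_\theta(q)$ and $g_\theta(a, q)$, be the last LSTM output, i.e. $f_\theta(q) = h^q_m$ and $g_\theta(a, q) = h^a_n$. Note that $h^q_i$ and $h^a_i$ are vectors whose dimensionality depends on the dimensionality of the LSTM cell; we denote by $h_{i,j}$ the $j$'th coordinate of the LSTM output at timestep $i$.

Another alternative is to aggregate the LSTM outputs across the different timesteps by taking their coordinate-wise mean (mean-pooling):

$$f_\theta(q)_j = \frac{1}{m} \sum_{i=1}^{m} h^q_{i,j}, \qquad g_\theta(a, q)_j = \frac{1}{n} \sum_{i=1}^{n} h^a_{i,j}$$

Alternatively, one may aggregate by taking the coordinate-wise max (max-pooling):

$$f_\theta(q)_j = \max_{1 \le i \le m} h^q_{i,j}, \qquad g_\theta(a, q)_j = \max_{1 \le i \le n} h^a_{i,j}$$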
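The three aggregation options (last output, mean-pooling, max-pooling) can be sketched as follows. The sketch assumes the LSTM outputs are given as a list of equal-length lists, one per timestep; the function names are hypothetical.

```python
def last_output(outputs):
    """Use the final timestep's output as the embedding."""
    return list(outputs[-1])

def mean_pool(outputs):
    """Coordinate-wise mean over timesteps."""
    n = len(outputs)
    return [sum(column) / n for column in zip(*outputs)]

def max_pool(outputs):
    """Coordinate-wise max over timesteps."""
    return [max(column) for column in zip(*outputs)]

# Three timesteps of a 2-dimensional output (illustrative values only).
lstm_outputs = [[1.0, -2.0], [3.0, 0.0], [2.0, 4.0]]
embedding_last = last_output(lstm_outputs)
embedding_mean = mean_pool(lstm_outputs)
embedding_max = max_pool(lstm_outputs)
```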

We use another simple way of embedding the question and answer, which is based on term-frequency (TF) features. Given a vocabulary of words $V = (w_1, \dots, w_v)$ and a text $x$, we denote the TF representation of $x$ as $x^{tf} = (x^{tf}_1, \dots, x^{tf}_v)$, where $x^{tf}_i = 1$ if the word $w_i$ occurs in $x$ and otherwise $x^{tf}_i = 0$. (Another alternative is setting $x^{tf}_i$ to the number of times the word $w_i$ appears in $x$. A slightly more complex option is using TF-IDF features [38] or an alternative hand-crafted feature scheme; however, we opt for the simpler TF representation, letting the neural network learn how to use the raw information.)

A simple overall embedding of a text $x$ is $W x^{tf}$, where $W$ is a $d \times v$ matrix and $d$ determines the final embedding’s dimensionality; the weights of $W$ are typically part of the neural network parameters, to be learned during the training of the network. Instead of a single matrix multiplication, one may use the slightly more elaborate alternative of applying a feedforward network, in order to allow for non-linear embeddings.

We note that a TF representation loses information regarding the order of the words in the text, but can provide a good global view of key topics discussed in the text.
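A minimal sketch of the binary TF representation and the single-matrix embedding is given below. The vocabulary, the whitespace tokenization and the weight values are assumptions chosen for illustration; in the actual model the matrix weights would be learned.

```python
def tf_vector(tokens, vocabulary):
    """Binary term-frequency vector: 1.0 if the vocabulary word occurs, else 0.0."""
    present = set(tokens)
    return [1.0 if word in present else 0.0 for word in vocabulary]

def linear_embed(weights, features):
    """Multiply a (d x v) weight matrix, given as a list of rows, by a feature vector."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

vocabulary = ["freeze", "account", "claim"]            # hypothetical vocabulary
tokens = "you can freeze your account".split()         # naive tokenization
features = tf_vector(tokens, vocabulary)               # word order is discarded here
weights = [[0.5, -1.0, 2.0],                           # hypothetical 2 x 3 matrix
           [1.0, 1.0, 1.0]]
embedding = linear_embed(weights, features)
```

Because the features only record which words are present, any reordering of the tokens yields the same embedding, which is the loss of order information noted above.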

Our main contribution is a new design for the neural network that ranks candidate answers for a given question. Our design uses a TF-based representation of the question and answer, and includes a new attention mechanism which uses this global representation when computing the attention weights (in addition to the local information used in existing approaches). We describe existing attention designs (based on local information) in Section III-B, before proceeding to describe our approach in Section IV.

III-B Local Attention

Early RNN designs were based on applying a deep feedforward network at every timestep, but struggled to cope with longer sequences due to exploding and vanishing gradients [39]. Other recurrent cells such as the LSTM and GRU cells [39, 40] have been proposed as they alleviate this issue; however, even with such cells, tackling long sequences remains hard [41]. Consider using an LSTM to digest a sequence, and taking the final LSTM state to represent the entire sequence; such a design forces the system to represent the entire sequence using a single LSTM state, which is a very narrow channel, making it difficult for the network to represent all the intricacies of a long sequence [22].

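Attention mechanisms allow placing varying amounts of emphasis across the entire sequence [22], making it easier to process long sequences; in QA, we can give different weights to different parts of the answer while aggregating the LSTM outputs along the different timesteps:

$$g_\theta(a, q)_j = \sum_{i=1}^{n} \alpha_i \, h^a_{i,j}$$

where $g_\theta(a, q)_j$ is the $j$'th coordinate of the answer embedding, $\alpha_i$ denotes the weight (importance) placed on timestep $i$ and $h^a_{i,j}$ is the $j$'th value of the $i$'th embedding vector (the LSTM output for the answer at timestep $i$).

Tan et al. [27] proposed a very simple attention mechanism for QA, shown in Figure 1:

$$m_i = \tanh\big(W_{am} h^a_i + W_{qm} f_\theta(q)\big)$$

$$\alpha_i \propto \exp\big(w_{ms}^{\top} m_i\big)$$

$$\tilde{h}^a_i = \alpha_i \, h^a_i$$

where $f_\theta(q)$ is the question embedding, $\tilde{h}^a_i$ is the weighted hidden layer, $W_{am}$ and $W_{qm}$ are matrix parameters to be learned, and $w_{ms}$ is a vector parameter to be learned.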
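The following is a rough sketch of a local attention step of this kind, under stated assumptions: each answer timestep is combined with the question embedding through two linear projections and a tanh, scored with a learned vector, and the scores are normalized over timesteps with a softmax. The exact form used by Tan et al. [27] should be taken from their paper; all names and the toy parameter values here are hypothetical.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def local_attention(answer_outputs, question_vec, w_am, w_qm, w_ms):
    """Return (attention weights, weighted LSTM outputs) for one answer."""
    q_proj = matvec(w_qm, question_vec)  # identical for every timestep
    logits = []
    for h in answer_outputs:
        a_proj = matvec(w_am, h)
        hidden = [math.tanh(x + y) for x, y in zip(a_proj, q_proj)]
        logits.append(sum(w * x for w, x in zip(w_ms, hidden)))
    weights = softmax(logits)
    weighted = [[w * x for x in h] for w, h in zip(weights, answer_outputs)]
    return weights, weighted

# Toy parameters (identity projections), not trained values.
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, weighted = local_attention(
    [[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0], identity, identity, [1.0, 1.0])
```

Note that each weight depends only on the question vector and a single timestep's output, which is the limitation the global-local design addresses.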

IV Global-Local Attention

A limitation of the attention mechanism of Tan et al. [27] is that it only looks at the embedded question vector and one candidate answer word embedding at a time. Our proposed attention mechanism adds a global view of the candidate, incorporating information from all words in the answer.

IV-A Creating Global Representations

One possibility for constructing a global embedding is an RNN design. However, RNN cells tend to focus on the more recent parts of an examined sequence [41]. We thus opted for using a term-frequency vector representing the entire answer, as shown in Figure 2. We denote this representation as:

$$a^{tf} = (a^{tf}_1, \dots, a^{tf}_v)$$

where $a^{tf}_i$ relates to the $i$'th word in our chosen vocabulary, with $a^{tf}_i = 1$ if this word appears in the candidate answer, and $a^{tf}_i = 0$ otherwise.

Consider a candidate answer $a$, and let $h^a_1, \dots, h^a_n$ denote its sequence of LSTM outputs, i.e. $h^a_i$ denotes the $i$'th output of an LSTM RNN processing this sequence (so $h^a_i$ is a vector whose dimensionality is the hidden size of the LSTM cell). We refer to $h^a_i$ as the local embedding at time $i$. (Note that although we call $h^a_i$ a local embedding, the $i$'th LSTM state does of course take into account other words in the sequence, and not only the $i$'th word. By referring to it as “local” we simply mean that it is more heavily influenced by the $i$'th word or words close to it in the sequence.)

IV-B Combining Local and Global Representations to Determine Attention Weights

The goal of an attention mechanism is to construct an overall representation of the candidate answer $a$, which is later compared to the question representation to determine how well the candidate answers the question; this is achieved by obtaining a set of weights $\alpha_1, \dots, \alpha_n$ (where $\sum_{i=1}^{n} \alpha_i = 1$), and constructing the final answer representation as a weighted average of the LSTM outputs, with these weights.

Given a candidate answer $a$, we compute the attention coefficient $\alpha_i$ for timestep $i$ as follows.

First, we combine the local view (the LSTM output $h^a_i$, more heavily influenced by the words around timestep $i$) with the global view (based on TF features $a^{tf}$ of all the words in the answer). We begin by taking linear combinations of the TF features, then passing them through a $\tanh$ nonlinearity (so that the range of each dimension is bounded in $[-1, 1]$):

$$a^{gf} = \tanh(W_1 a^{tf})$$

The weights of the matrix $W_1$ are model parameters to be learned, and its dimensions are set so as to map the sparse TF vector to a dense low dimensional vector (in our implementation $a^{gf}$ is a 50 dimensional vector).

Similarly, we take a linear combination of the different dimensions of the local representation (in this case there is no need for the $\tanh$ operation, as the LSTM output is already bounded):

$$a^{lf}_i = W_2 h^a_i$$

where the weights of the matrix $W_2$ are model parameters to be learned (with dimensions set so that $a^{lf}_i$ is a 140 dimensional vector).

Given a TF representation of a text, $x^{tf}$, whose dimensionality is the size of the vocabulary, and an RNN representation of the text, $x^{rnn}$, with a certain dimensionality $d$, we may wish to construct a normalized representation of the text. As the norms of these two parts may differ, simply concatenating them may result in a vector dominated by one side. We thus define a joint representation as follows.

We normalize each part so as to have a desired ratio of norms between the RNN and TF representations; this ratio reflects the relative importance of the RNN and TF embeddings in the combined representation (for instance, when setting both target norms, $\lambda_{tf}$ and $\lambda_{rnn}$, to 1, both parts would have a unit norm, giving them equal importance):

$$\hat{x}^{tf} = \lambda_{tf} \, \frac{x^{tf}}{\lVert x^{tf} \rVert}, \qquad \hat{x}^{rnn} = \lambda_{rnn} \, \frac{x^{rnn}}{\lVert x^{rnn} \rVert}$$

We then concatenate the normalized TF and RNN representations to generate the joint representation:

$$J(x^{tf}, x^{rnn}) = \hat{x}^{tf} \,\Vert\, \hat{x}^{rnn}$$

where $\Vert$ represents vector concatenation.
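The normalize-then-concatenate step can be sketched as below. The function names, the parameter names `tf_norm` and `rnn_norm`, and the handling of an all-zero vector are assumptions made for this example.

```python
import math

def scale_to_norm(vector, target_norm):
    """Rescale a vector so that its Euclidean norm equals target_norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)  # assumption: leave an all-zero vector unchanged
    return [target_norm * x / norm for x in vector]

def joint_representation(tf_part, rnn_part, tf_norm=1.0, rnn_norm=1.0):
    """Concatenate the two parts after scaling each to its target norm."""
    return scale_to_norm(tf_part, tf_norm) + scale_to_norm(rnn_part, rnn_norm)

# The RNN part has a much larger norm; after scaling, both halves have unit norm.
joint = joint_representation([3.0, 4.0], [0.0, 10.0])
```

Without the rescaling, the second half of the concatenated vector in this toy case would have twice the norm of the first and would dominate any cosine similarity computed on it.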

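We construct the local attention representation at the $i$'th word of the answer as the joint representation (normalized concatenation) of the global features $a^{gf}$ and the local features $a^{lf}_i$:

$$a^{lg}_i = J(a^{gf}, a^{lf}_i)$$

using chosen values of the two target norms.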

The raw attention coefficient of the $i$'th word in the answer is computed by measuring the similarity of a vector representing the question, $r^q$, and a local-global representation of the answer at word $i$. We build these representations, of matching dimensions, by taking the same number of linear combinations from $r^q$ and from $a^{lg}_i$ (the raw global-local representation of the answer at word $i$). Thus the raw attention coefficient for the $i$'th word is:

$$\alpha'_i = \mathrm{sim}\big(W_3 r^q,\; W_4 a^{lg}_i\big)$$

where $W_3$, $W_4$ are matrices whose weights are parameters to be learned (and whose dimensions are set so that $W_3 r^q$ and $W_4 a^{lg}_i$ would be vectors of identical dimensionality, 140 in our implementation), and where $\mathrm{sim}$ denotes the cosine similarity between vectors:

$$\mathrm{sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$$

with the symbol $\cdot$ in the numerator denoting the dot product between two vectors.

Finally, we obtain the final attention weights by applying the softmax operator on the raw attention coefficients, i.e. normalizing them with respect to their exponent. Given the raw attention coefficients $\alpha' = (\alpha'_1, \dots, \alpha'_n)$, we define the final attention weights $\alpha = (\alpha_1, \dots, \alpha_n)$, where $\alpha_i$ is the $i$'th coordinate of the softmax operator applied on $\alpha'$:

$$\alpha_i = \frac{\exp(\alpha'_i)}{\sum_{j=1}^{n} \exp(\alpha'_j)}$$
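The computation of the attention weights (cosine similarity between a projected question vector and a projected local-global vector per timestep, followed by a softmax) can be sketched as follows. The function and parameter names are hypothetical, and the identity projections in the usage lines stand in for learned matrices.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros (sketch convention)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(question_vec, local_global_vecs, w_question, w_answer):
    """Raw coefficients are cosine similarities of projected vectors; softmax normalizes them."""
    q_proj = matvec(w_question, question_vec)
    raw = [cosine(q_proj, matvec(w_answer, v)) for v in local_global_vecs]
    return softmax(raw)

identity = [[1.0, 0.0], [0.0, 1.0]]  # toy projections, not trained values
alphas = attention_weights([1.0, 0.0],
                           [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
                           identity, identity)
```

Since the raw coefficients are cosine similarities in $[-1, 1]$, the softmax output can never be very peaked: in this toy case the most and least favoured timesteps differ by a factor of at most $e^2$.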

IV-C Building the Final Attention Based Representation

The role of the attention weights is building a final representation of a candidate answer; different answers are ranked based on the similarity of their final representation and a final question representation. Similarly to the TF representation of the answer, we denote the TF representation of the question as $q^{tf} = (q^{tf}_1, \dots, q^{tf}_v)$, where $q^{tf}_i$ relates to the $i$'th word in our chosen vocabulary, with $q^{tf}_i = 1$ if this word appears in the question, and $q^{tf}_i = 0$ otherwise. Our final representation of the question is a joining of the TF representation of the question and the mean pooled RNN question representation $\bar{h}^q$ (somewhat similarly to how we join the TF and RNN representation when determining the attention weights):

$$f_\theta(q) = J(q^{tf}, \bar{h}^q)$$

where $J(\cdot, \cdot)$ denotes the joining of a TF part and an RNN part.

Our final representation of the answer is also a joining of two parts, a TF part $a^{tf}$ (as defined earlier) and an attention weighted RNN part $a^{att}$. We construct $a^{att}$ as the weighted average of the LSTM outputs $h^a_i$, where the weights are the attention weights $\alpha_i$ defined above:

$$a^{att} = \sum_{i=1}^{n} \alpha_i \, h^a_i$$

The final representation of the answer is thus:

$$g_\theta(a, q) = J(a^{tf}, a^{att})$$
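The attention weighted RNN part is simply a weighted average of the per-timestep outputs. A minimal sketch, with a hypothetical function name and toy values, assuming the weights already sum to one:

```python
def attention_pool(outputs, weights):
    """Weighted average of per-timestep vectors, using the given attention weights."""
    dim = len(outputs[0])
    return [sum(w * h[j] for w, h in zip(weights, outputs)) for j in range(dim)]

# Two timesteps; three quarters of the attention goes to the second one.
pooled = attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.25, 0.75])
```

With uniform weights this reduces to mean-pooling, so the attention weights can be read as a learned departure from the plain average.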

Figure 2 describes the final architecture of our model, showing how we use a TF-based global embedding both in determining the attention weights and in the overall representation of the questions and answers. The dotted lines in the figures indicate that our model’s attention weights depend not only on the local embedding but also on the global embedding.

IV-D Tuning Parameters to Minimize the Loss

The loss function we use is the shifted hinge loss defined in Section III. We compute the score of an answer candidate $a$ as the similarity between its final representation $g_\theta(a, q)$ and the final representation of the question $f_\theta(q)$ (we use the cosine similarity as our similarity function for the loss, though other similarity functions can also be used):

$$s(q, a) = \mathrm{sim}\big(f_\theta(q),\, g_\theta(a, q)\big)$$

Given the score of the correct answer candidate, $s(q, a^+)$, and the score of a distractor (incorrect) candidate, $s(q, a^-)$, our loss is $L = \max\{0,\; M - s(q, a^+) + s(q, a^-)\}$, where $M$ is the margin.

The above loss relates to a single training item (consisting of a single question, its correct answer and an incorrect candidate answer). Training the neural network parameters involves iteratively examining items in a dataset consisting of many training items (each containing a question, its correct answer and a distractor) and modifying the current network parameters. We train our system using a variant of stochastic gradient descent (SGD), the Adam optimizer [42].

V Empirical Evaluation

We evaluate our proposed neural network design in a similar manner to earlier evaluations of Siamese neural network designs [30, 43], where a neural network is trained to embed both questions and candidate answers as low dimensional vectors.

V-A Experiment Setup

Fig. 3: A visualization of the attention weights for each word in a correct answer to a question. These examples show how the attention mechanism is focusing on relevant parts of the correct answer (although the attention is still quite noisy).
Fig. 4: Performance of our system on InsuranceQA for various model sizes (both the LSTM hidden layer size and embedding size)

We use the InsuranceQA dataset and its evaluation framework [26]. The InsuranceQA dataset contains question and answer pairs from the insurance domain, with roughly 25,000 unique answers, and is already partitioned into a training set and two test sets, called test 1 and test 2.

The InsuranceQA dataset has relatively short questions (mean length of 7). However, the answers are typically very long (mean length of 94).

At test time the system takes as input a question and a pool of candidate answers and is asked to select the best matching answer to the question from the pool. The InsuranceQA dataset comes with fixed-size answer pools, each consisting of the correct answers and random distractors chosen from the set of answers to other questions.

State-of-the-art results for InsuranceQA were achieved by Tan et al. [27], who also provide a comparison with several baselines: Bag-of-words (with IDF weighted sum of word vectors and cosine similarity based ranking), the Metzler-Bendersky IR model [44], and the CNN based Architecture-II of Feng et al. [26], both in its basic form and with the Geometric mean of Euclidean and Sigmoid Dot product (GESD) similarity.

We implemented our model in TensorFlow [45] and conducted experiments on our GPU cluster.

We use the same bidirectional LSTM hidden layer size and the same embedding size as Tan et al. [27]; this allows us to investigate the impact of our proposed attention mechanism. (As is the case with many neural networks, increasing the hidden layer size or embedding size can improve the performance of our InsuranceQA models; we compare our performance to the work of Tan et al. [27] with the same hidden and embedding sizes. Similarly to them, we use embeddings pre-trained using Word2Vec [46] and avoid overfitting by applying early stopping; we also apply Dropout [47, 48].)

Model                      Test1   Test2
Bag-of-words               32.1    32.2
Metzler-Bendersky          55.1    50.8
Arch-II [26]               62.8    59.2
Arch-II GESD [26]          65.3    61.0
Attention LSTM [27]        69.0    64.8
TF-LSTM Concatenation      62.1    61.5
Local-Global Attention     70.1    67.4
TABLE I: Performance of various models on InsuranceQA

V-B Results

Table I presents the results of our model and the various baselines for InsuranceQA. The performance metric used here is P@1, the proportion of instances where a correct answer was ranked higher than all distractors in the pool. The table shows that our model outperforms the previous baselines on both test sets.

We have also examined the performance of our model as a function of its size (determining the system’s runtime and memory consumption). We used different values for both the LSTM’s hidden layer size and the embedding size, and examined the performance of the resulting QA system on InsuranceQA. Our results are given in Figure 4, which shows both the P@1 metric and the mean reciprocal rank (MRR) [49, 50]. (The MRR metric assigns the model partial credit even in cases where the highest ranking candidate is an incorrect answer, with the score depending on the highest rank of a correct answer.)
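The two metrics can be made concrete with a short sketch. It assumes each question's candidate pool has already been sorted by descending model score and is represented as a list of booleans marking which candidates are correct; the function names and the toy data are hypothetical.

```python
def precision_at_1(ranked_pools):
    """Fraction of questions whose top-ranked candidate is a correct answer."""
    hits = sum(1 for labels in ranked_pools if labels and labels[0])
    return hits / len(ranked_pools)

def mean_reciprocal_rank(ranked_pools):
    """Average of 1/rank of the highest-ranked correct answer (0 if none is present)."""
    total = 0.0
    for labels in ranked_pools:
        for rank, is_correct in enumerate(labels, start=1):
            if is_correct:
                total += 1.0 / rank
                break
    return total / len(ranked_pools)

# Three toy questions: the correct answer is ranked 1st, 2nd and 4th respectively.
pools = [
    [True, False, False],
    [False, True, False],
    [False, False, False, True],
]
p_at_1 = precision_at_1(pools)
mrr = mean_reciprocal_rank(pools)
```

In this toy case only one of three questions counts toward P@1, while MRR also gives partial credit of 1/2 and 1/4 for the other two.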

Figure 4 shows that performance improves as the model gets larger, but the returns on extending the model size quickly diminish. Interestingly, even relatively small models achieve a reasonable question answering performance.

To show our attention mechanism is necessary to achieve good performance, we also construct a model that simply concatenates the output of the feedforward network (on TF features) and the output of the bidirectional LSTM, called TF-LSTM concatenation. While this model does make use of TF-based features in addition to the LSTM state of the RNN, it does not use an attention mechanism to allow it to focus on the more relevant parts of the text. As the table shows, the performance of the TF-LSTM model is significantly lower than that of our model with the global-local attention mechanism. This indicates that the improved performance stems from the model’s improved ability to focus on the relevant parts of the answer (and not simply from having a larger capacity and including TF-features).

Finally, we examine the attention model’s weights to evaluate it qualitatively. Figure 3 visualizes the weights for two question-answer pairs, where the color intensity reflects the relative weight placed on the word (the coefficients discussed earlier). The figure shows that our attention model can focus on the parts of the candidate answer that are most relevant for the given question.

VI Conclusion

We proposed a new neural design for answer selection, using an augmented attention mechanism, which combines both local and global information when determining the attention weight to place at a given timestep. Our analysis shows that our design outperforms earlier designs based on a simpler attention mechanism which only considers the local view.

Several questions remain open for future research. First, the TF-based global view of our design was extremely simple; could a more elaborate design, possibly using convolutional neural networks, achieve better performance?

Second, our attention mechanism joins the local and global information in a very simple manner, by normalizing each vector and concatenating the normalized vectors. Could a more sophisticated joining of this information, perhaps allowing for more interaction between the parts, help further improve the performance of our mechanism?

Finally, can the underlying principles of our global-local attention design improve the performance of other systems, such as machine translation or image processing systems?