In this work, we propose Attentive Pooling (AP), a two-way attention mechanism for discriminative model training. In the context of pair-wise ranking or classification with neural networks, AP enables the pooling layer to be aware of the current input pair, in a way that information from the two input items can directly influence the computation of each other's representations. Along with such representations of the paired inputs, AP jointly learns a similarity measure over projected segments (e.g. trigrams) of the pair, and subsequently, derives the corresponding attention vector for each input to guide the pooling. Our two-way attention mechanism is a general framework independent of the underlying representation learning, and it has been applied to both convolutional neural networks (CNNs) and recurrent neural networks (RNNs) in our studies. The empirical results, from three very different benchmark tasks of question answering/answer selection, demonstrate that our proposed models outperform a variety of strong baselines and achieve state-of-the-art performance in all the benchmarks.READ FULL TEXT VIEW PDF
In NLP, convolution neural networks (CNNs) have benefited less than recu...
We propose a new attention mechanism for neural based question answering...
We describe an attentive encoder that combines tree-structured recursive...
One of the main challenges in ranking is embedding the query and documen...
Attention is typically used to select informative sub-phrases that are u...
With the rapid growth of knowledge bases (KBs), question answering over
The dominant neural architectures in question answer retrieval are based...
Keras implementation of https://arxiv.org/pdf/1602.03609.pdf
. However, most recent work on neural attention models have focused on one-way attention mechanisms based on recurrent neural networks designed for generation tasks.
Another important family of machine learning tasks are centered around pair-wise ranking or classification, which have a broad set of applications, including but not limited to, question answering, entailment, paraphrasing and any other pair-wise matching problems. The current state-of-the-art models usually include NN-based representation for the input pair, followed by a discriminative ranking or classification models. For example, a convolution (or a RNN) and a max-pooling is used to independently construct distributed vector representations of the input pair, followed by a large-margin training(Hu et al., 2014; Weston et al., 2014; Shen et al., 2014; dos Santos et al., 2015).
The key contribution of this work is that we propose Attentive Pooling (AP), a two-way attention mechanism, that significantly improves such discriminative models’ performance on pair-wise ranking or classification, by enabling a joint learning of the representations of both inputs as well as their similarity measurement.
Specifically, AP enables the pooling layer to be aware of the current input pair, in a way that information from the two input items can directly influence the computation of each other’s representations. The main idea in AP consists of learning a similarity measure over projected segments (e.g. trigrams) of the two items in the input pair, and using the similarity scores between the segments to compute attention vectors in both directions. Next, the attention vectors are used to perform pooling.
There are a few key benefits of our model.
Thanks to the two-way attention, our model projects the paired inputs, even though they may not be always semantically comparable for some applications (e.g., questions and answers in question answering), into a common representation space that they can be compared in a more plausible way.
Our model is effective in matching pairs of inputs with significant length variations.
The two-way attention mechanism is independent of the underlying representation learning. For example, AP can be applied to both CNNs and RNNs, which is in contrast to the one-way attention used in the generation models mostly based on recurrent nets.
In this work, we perform an extensive number of experiments on applying attentive pooling CNNs (AP-CNN) and biLSTMs (AP-biLSTM) for the answer selection task. In this task, given a question and an candidate answer pool for this question, the goal is to search for and select the candidate answer that correctly answers . We perform experiments with three publicly available benchmark datasets, which vary in data scale, complexity and length ratios between question and answers: InsuranceQA, TREC-QA and WikiQA. For the three datasets, AP-CNN and AP-biLSTM respectively outperform the CNN and the biLSTM that do not use attention. Additionally, AP-CNN achieves state-of-the-art results for the three datasets.
Our experimental results also demonstrate that attentive pooling makes the CNN more robust to large input texts. This is an important finding, since recent work have demonstrated that, in the context of semantically equivalent question retrieval, CNN based representations do not scale well with the size of the input text (dos Santos et al., 2015). Additionally, as AP-CNN does not rely only on the final vector representation to capture interactions between the input question and answer, it requires much less convolutional filters than the regular CNN. It means that AP-CNN-based representations are more compact, which can help to speed up the training process.
Although we demonstrate experimental results for NLP tasks only, AP is a general method that can be also applied to different types of NNs that perform matching of two inputs. Therefore, we believe that AP can be useful for different applications, such as computer vision and bioinformatics.
This paper is organized as follows. In Section 2, we describe two NN architectures for answer selection that have been recently proposed in the literature. In Section 3, we detail the attentive pooling approach. In Section 4, we discuss some related work. Sections 5 and 6 detail our experimental setup and results, respectively. In Section 7 we present our final remarks.
Different neural network architectures have been recently proposed to perform matching of semantically related text segments (Yu et al., 2014; Hu et al., 2014; dos Santos et al., 2015; Wang & Nyberg, 2015; Severyn & Moschitti, 2015; Tan et al., 2015). In this section we briefly review two NN architectures that have previously been applied to the answer selection task: QA-CNN (Feng et al., 2015) and QA-biLSTM (Tan et al., 2015). Given a pair (, ) consisting of a question and a candidate answer , both networks score the pair by first computing fixed-length independent continuous vector representations and
, and then computing the cosine similarity between these two vectors.
In Figure 1 we present a joint illustration of these two neural networks. The first layer in both QA-CNN and QA-biLSTM transforms each input word into a fixed-size real-valued word embedding . Word embeddings (WEs) are encoded by column vectors in an embedding matrix , where is a fixed-sized vocabulary and is the dimention of the word embeddings. Given the input pair (, ), where the question contains tokens and the candidate answer contains tokens, the output of the first layer consists of two sequences of word embeddings and . Next, QA-CNN and QA-biLSTM use different approaches to process these sequences. While QA-CNN process both and
using a convolution, QA-biLSTM uses a Bidirectional Long Short-Term Memory RNN(Hochreiter & Schmidhuber, 1997) to process these sequences.
Given the sequence , let us define the matrix as a matrix where each column contains a vector that is the concatenation of a sequence of word embeddings centralized in the -th word of the question. The output of the convolution with filters over the question is computed as follows:
where each column in
contains features extracted in a context window around the-th word of . The matrix and the vector are parameters to be learned. The number of convolutional filters , and the size of the word context window are hyper-parameters to be chosen by the user.
In a similar manner, and using the same NN parameters and , we compute , the output of the convolution over the candidate answer .
Our LSTM implementation is similar to the one in (Graves et al., 2013) with minor modification. Given the sequence , the hidden vector (with size ) at the time step is updated as follows:
In the LSTM architecture, there are three gates (input , forget and output ), and a cell memory vector . is the function. The input gate can determine how incoming vectors alter the state of the memory cell. The output gate can allow the memory cell to have an effect on the outputs. Finally, the forget gate allows the cell to remember or forget its previous state. , and are the network parameters.
Single direction LSTMs suffer a weakness of not utilizing the contextual information from the future tokens. Bidirectional LSTM utilizes both the previous and future context by processing the sequence on two directions, and generate two independent sequences of LSTM output vectors. One processes the input sequence in the forward direction, while the other processes the input in the reverse direction. The output at each time step is the concatenation of the two output vectors from both directions, ie. = . We define for the notation consistency with the previous subsection. After computing the hidden state for each time step , we generate the matrices and , where the -th column in () corresponds to -th hidden state that is computed by the biLSTM when processing (). The same network parameters are used to process both questions and candidate answers.
Given the matrices and , we compute the vector representations and by applying a column-wise max-pooling over and , followed by a non-linearity. Formally, the -th elements of the vectors and are compute as follows:
The last layer in QA-CNN and QA-biLSTM scores the input pair (,) by computing the cosine similarity between the two representations:
Both networks are trained by minimizing a pairwise ranking loss function over the training set. The input in each round is two pairs (, ) and (, ), where is a ground truth answer for , and is an incorrect answer. As in (Weston et al., 2014; Hu et al., 2014), we define the training objective as a hinge loss:
where is constant margin, and are scores generated by the network with parameter set . During training, for each question we randomly sample 50 negative answers from the entire answer set, but only use the one with the highest score to update the model.
Attentive pooling is an approach that enables the pooling layer to be aware of the current input pair, in a way that information from the question can directly influence the computation of the answer representation , and vice versa. The main idea consists of learning a similarity measure over the projected segments in the input pairs, and uses the similarity scores between the segments to compute attention vectors. When AP is applied to CNN, which we call AP-CNN, the network learns the similarity measure over the convolved input sequences. When AP is applied to biLSTM, which we call AP-biLSTM, the network learns the similarity measure over the hidden states produced by the biLSTM when processing the two input sequences. We use a similarity measure that has a bilinear form but followed by a non-linearity.
In Fig. 2, we illustrate the application of AP over the output of the convolution or the biLSTM to construct the representations and . Consider the input pair (, ) where the question has size and the answer has size 111In Fig. 2, has a size of five and has a size of seven.. After we compute the matrices and , either by convolution or biLSTM, we compute the matrix as follows:
where is a matrix of parameters to be learned by the NN. When the convolution is used to compute and , the matrix contains the scores of a soft alignment between the convolved -size context windows of and . When the biLSTM is used to compute and , the matrix contains the scores of a soft alignment between the hidden vectors of each token in and .
Next, we apply column-wise and row-wise max-poolings over to generate the vectors and , respectively. Formally, the -th elements of the vectors and are computed as follows:
We can interpret each element of the vector as an importance score for the context around the -th word in the candidate answer with regard to the question . Likewise, each element of the vector can be interpreted as the importance score for the context around the -th word in the question with regard to the candidate answer .
Next, we apply the softmax function to the vectors and to create attention vectors and . For instance, the -th element of the vector is computed as follows:
Finally, the representations and are computed as the dot product between the attention vectors and and the output of the convolution (or biLSTM) over and , respectively:
Like in QA-CNN and QA-biLSTM, the final score is also computed using the cosine similarity between and . We use SGD to train AP-CNN and AP-biLSTM by minimizing the same pairwise loss function used in QA-CNN and QA-biLSTM.
. Recently, deep learning (DL) approaches have been exploited for this task and achieved significant out-performance compared to traditional non-DL methods. For example, in(Yu et al., 2014; Feng et al., 2015; Severyn & Moschitti, 2015), the authors generate the representations of questions and answers separately, and score a QA pair using a similarity metric on top of these representations. In Wang & Nyberg (2015), first a joint feature vectors is learned from a joint long short-term memory (LSTM) model connecting questions and answers, and then the task is converted into a learning-to-rank problem.
At the same time, attention-based systems have shown very promising results on a variety of NLP tasks, such as machine translation (Bahdanau et al., 2015; Sutskever et al., 2014), caption generation (Xu et al., 2015) and factoid question answering (Hermann et al., 2015). Such models learn to focus their “attention” to specific parts of their input.
Some recently-proposed approaches introduce attention mechanisms in the answer selection task. Tan et al. (2015) developed an attentive reader based on bidirectional long short-term memory, which emphasizes certain part of the answer according to the question embedding. Unlike (Tan et al., 2015), in which attention is imposed only on answer embedding generation, AP-CNN and AP-biLSTM consider the interdependence between questions and answers.
In the context of two-way attention, two very recent work are related to ours. Rocktäschel et al. (2015), propose a two-way attention method that is inspired by bidirectional LSTMs that read a sequence and its reverse for improved encoding. Their approach, which is designed for RNNs only, differs in many aspects from the approach described in this work, which can be easily applied for CNNs and RNNs. Yin et al. (2015) present a two-way attention mechanism that is tailored to CNNs. Some of the main differences between their approach and this work are: (1) they use a simple Euclidean distance to compute the interdependence between the two input texts, while in this work we apply similarity metric learning, which has the potential to learn better ways to measure the interaction between segments of the input items; (2) the models in (Yin et al., 2015) compute the attention vector using sum-pooling over the alignment matrix and use the convolutional outputs updated by the attention as the input for another level of convolutional layer. In this work we use max-pooling over the alignment matrix plus softmax, in order to explicitly create an attention vector that is used to perform the pooling. Experimental results show that such difference yields substantial improvement of performance on WikiQA dataset.
We apply AP-CNN, AP-biLSTM, QA-CNN and QA-biLSTM to three different answer selection datasets: InsuranceQA, TREC-QA and WikiQA. These datasets contain text of different domains and have different caracteristics. Table 1 presents some statistics about the datasets, including the number of questions in each set, average length of questions (M) and answers (L), average number of candidate answers in the dev/test sets and the average ratio between the lengths of questions and their ground-truth answers.
|Dataset||Train||Dev||Test||Avg. M||Avg. L||Avg. # Cand. Ans.||Avg. L/M|
InsuranceQA222git clone https://github.com/shuzi/insuranceQA.git is a recently released large-scale non-factoid QA dataset from the insurance domain. This dataset provides a training set, a validation set, and two test sets. We do not see obvious categorical differentiation between questions of the two test sets. For each question in dev/test sets, there is a set of 500 candidate answers, which include the ground-truth answers and randomly selected negative answers. More details can be found in (Feng et al., 2015).
TREC-QA333The data is obtained from (Yao et al., 2013) http://cs.jhu.edu/~xuchen/packages/jacana-qa-naacl2013-data-results.tar.bz2 was created by Wang et al. (2007) based on Text REtrieval Conference (TREC) QA track (8-13) data. We follow the exact approach of train/dev/test questions selection in (Wang & Nyberg, 2015), in which all questions with only positive or negative answers are removed. Finally, we have 1162 training questions, 65 development questions and 68 test questions.
WikiQA444The data is obtained from (Yang et al., 2015) is an open domain question-answering dataset. We use the subtask that assumes that there is at least one correct answer for a question. The corresponding dataset consists of 20,360 question/candidate pairs in train, 1,130 pairs in dev and 2,352 pairs in test. We adopt the standard setup of only considering questions that have correct answers for evaluation.
In order to fairly compare our results with the ones in previous work, we use two different sets of pre-trained word embeddings. For the InsuranceQA dataset, we use the 100-dimensional vectors that were trained by Feng et al. (2015) using word2vec (Mikolov et al., 2013). Following Wang & Nyberg (2015), Tan et al. (2015) and Yin et al.(2015), for the TREC-QA and the WikiQA datasets we use the 300-dimensional vectors that were trained using word2vec and are publicly available on the website of this tool555https://code.google.com/p/word2vec/.
In Table 2
, we show the selected hyperparameter values, which were tuned using the validation sets. We try to use as much as possible the same hyperparameters for all the three datasets. The size of the word embeddings is different due to the different pre-trained versions that we used for InsuranceQA and the other two datasets. We use a context window of size 3 for InsuranceQA, while we set this parameter to 4 for TREC-QA and WikiQA. Using the selected hyperparameters, the best results are normally achieved using between 15 and 25 training epochs. For AP-CNN, AP-biLSTM and QA-LSTM, we also use a learning rate schedule that decreases the learning rateaccording to the training epoch . Following dos Santos & Zadrozny (2014), we set the learning rate for epoch , , using the equation: .
|Word Emb. size||100/300||100/300||100/300||100/300|
|Conv. Filters / Hid.Vec. Size||400||4000||141x2||141x2|
|Context Window size||3/4||2||1||1|
|Init. Learning Rate||1.1||0.05||1.1||1.1|
In Table 3, we present the experimental results of the four NNs for the InsuranceQA dataset. The results are in terms of accuracy, which is equivalent to precision at top one. On the bottom part of this table, we can see that AP-CNN outperforms QA-CNN by a large margin in both test sets, as well as in the dev set. AP-biLSTM also outperforms the QA-biLSTM in all the three sets. AP-CNN and AP-biLSTM have similar performance.
On the top part of Table 3 we present the results of two state-of-the-art systems for this dataset. In (Feng et al., 2015), the authors present a CNN architecture that is similar to QA-CNN, but that uses a different similarity metric instead of cosine similarity. In (Tan et al., 2015), the authors use a biLTSM architecture that employs unidirectional attention. Both AP-CNN and AP-biLSTM outperform the state-of-the-art systems.
|(Feng et al., 2015)||65.4||65.3||61.0|
|(Tan et al., 2015)||68.4||68.1||62.2|
One important characteristic of AP-CNN is that it requires less convolutional filters than QA-CNN. For the InsuranceQA dataset, AP-CNN uses 10x less filters (400) than QA-CNN (4000). Using 800 filters in AP-CNN produces very similar results as using 400. On the other hand, as also found in (Feng et al., 2015), QA-CNN requires at least 2000 filters to achieve more than 60% accuracy on InsuranceQA. AP-CNN needs less filters because it does not rely only on the final vector representation to capture interactions between the input question and answer. As a result, although AP-CNN has a more complex architecture, its training time is two times faster than QA-CNN. Using a Tesla K20Xm, our Theano implementation of AP-CNN takes about 16 minutes to complete one epoch (training + inference over validation set) for InsuranceQA, which consists on processing 1.5 million text segments.
In figures 3 and 4, we plot the aggregated accuracy of AP-CNN and QA-CNN for answers up to a certain length for the Test1 and Test2 sets, respectively. We can see in both plots that the performance of both system is better for shorter answers. However, while the performance of QA-CNN continues to drop as larger answers are considered, the performance of AP-CNN seems to be stable after reaching a length of 90 tokens. These results give support to our hypothesis that attentive pooling helps the CNN to become robust to larger input texts.
In Table 4, we present the experimental results of the four NNs for the TREC-QA dataset. The results are in terms of mean average precision (MAP) and mean reciprocal rank (MRR), which are the metric normally used in previous work with the same dataset. We use the official trec_eval scorer to compute MAP and MRR. We can see in Table 4 that AP-CNN outperforms QA-CNN by a large margin in both metrics. AP-biLSTM outperforms the QA-biLSTM, but its performance is not as good as the of AP-CNN.
On the top part of Table 4 we present the results of three recent work that use TREC-QA as a benchmark. In (Wang & Nyberg, 2015), the authors present an LTSM architecture for answer selection. Their best result consists of a combination of LSTM and the BM25 algorithm. In (Severyn & Moschitti, 2015), the authors propose an NN architecture where the representations created by a convolutional layer are the input to similarity measure learning. Wang & Ittycheriah (2015) propose a word-alignment-based method that is suitable for the FAQ-based QA task. AP-CNN outperforms the state-of-the-art systems in both metrics, MAP and MRR.
Table 5 shows the experimental results of the four NNs for the WikiQA dataset. Like in the other two datasets, AP-CNN outperforms QA-CNN, and AP-biLSTM outperforms the QA-biLSTM. The difference of performance between AP-CNN and QA-CNN is smaller than the one for the InsuranceQA dataset. We believe it is because the average size of the answers in WikiQA (25) is much smaller than in InsuranceQA (95). It is expected that attentive pooling bring more impact to the datasets that have larger answer/question lengths.
In Table 5 we also present the results of two recent work that use WikiQA as a benchmark. Yang et al. (2015), present a bigram CNN model with average pooling. In (Yin et al., 2015), the authors propose an attention-based CNN. In order to make a fair comparison, in Table 5 we include Yin et al.’s result that use word embeddings only666Yin et al. (Yin et al., 2015) report 0.6921(MAP) and 0.7108(MRR) when using handcrafted features in addition to word embeddings.. AP-CNN outperforms these two systems in both metrics.
Figures 5 and 6 depict two heat maps of two test questions from InsuranceQA that were correctly answered by AP-CNN and whose answers are more than 100 words long. The stronger the color of a word in the question (answer), the larger the attention weight in () of the trigram centered at that word. As we can see in the pictures, the attentive pooling mechanism is indeed putting more focus on the segments of the answer that have some interaction with the question, and vice-verse.
We present attentive pooling, a two-way attention mechanism for discriminative model training. The main contributions of the paper are: (1) AP is more general than recently proposed two-way attention mechanism because: (a) it learns how to compute interactions between the items in the input pair; and (b) it can be applied to both CNNs and RNNs; (2) we demonstrate that AP can be effectively used with CNNs and biLSTM in the context of the answer selection task, using three different benchmark datasets; (3) our experimental results demonstrate that AP helps the CNN to cope with large input texts; (4) we present new state-of-the-art results for InsuranceQA and TREC-QA datasets. (5) for the WikiQA dataset our results are the best reported so far for methods that do not use handcrafted features.
The authors would like to thank Piero Molino for creating the script used to produce the text heat maps presented in this work.