M^2S-Net: Multi-Modal Similarity Metric Learning based Deep Convolutional Network for Answer Selection

04/19/2016 ∙ by Lingxun Meng, et al. ∙ Institute of Computing Technology, Chinese Academy of Sciences

Recent work using artificial neural networks built on distributed word representations has greatly boosted performance on various natural language processing tasks, especially the answer selection problem. Nevertheless, most previous work used deep learning methods (such as LSTM-RNN and CNN) only to capture the semantic representation of each sentence separately, without considering the interdependence between the two sentences. In this paper, we propose a novel end-to-end learning framework that builds a deep convolutional neural network on pairwise token matching measured by multi-modal similarity metric learning (M^2S-Net). The proposed model surpasses previous state-of-the-art systems on the answer selection benchmark, i.e., the TREC-QA dataset, in both MAP and MRR.




1 Introduction

Inspired by the achievements of convolutional networks (a.k.a. ConvNets) in computer vision, more and more researchers apply ConvNets to a variety of natural language processing tasks, e.g., text classification Kim (2014), text regression Bitvai and Cohn (2015), short text pair re-ranking Severyn and Moschitti (2015), and semantic matching Hu et al. (2014).

The answer selection task is defined as follows: given a question and a set of candidate sentences, choose the correct sentence that contains the exact answer and sufficiently supports it. Most previous methods build Siamese-like deep architectures (such as LSTM-RNN and CNN) to learn a semantic representation for each sentence, and then use cosine similarity or a weight matrix to compute the similarity of the pairwise representations Wang and Nyberg (2015). At the same time, these works mostly adopted shallow architectures for sentence modeling, since deeper nets did not bring better performance. In contrast, we are firmly convinced that one can benefit much more from a deep learning strategy.
Following the success of the RNN-based attention mechanism designed for machine translation Bahdanau et al. (2014), some recent works attempted two-way attention mechanisms for sentence pair matching problems Tan et al. (2015); Santos et al. (2016); Yin et al. (2015). Such soft attention mechanisms demonstrate the effectiveness of interaction between sentence pairs from the lexical level to the semantic level, yet they add considerable computation and model complexity.

Q: When did Amtrak begin operations?
A: Amtrak has not turned a profit since it was founded in 1971.
Table 1: An example of QA-pair in TREC-QA.
Figure 1: Our M^2S-Net for sentence pair matching.

The previous works mentioned above motivate us to construct a network based on pairwise token matching for exhaustive matching learning. However, a vital issue in this design is how to measure word similarity. Take Q and A in Table 1 as an example: it is important to distinguish the similarity between "begin" and "found" in the sense of "set up" from the similarity between "begin" and "found" in the sense of "discovered". To address this, we build a deep convolutional neural network on pairwise token matching measured with multi-modal similarity metric learning, which we name M^2S-Net, where the learnable multi-modal similarity metric provides a comprehensive, multi-granularity measurement. Experimental results on the benchmark dataset for answer selection indicate that the proposed model benefits greatly from both the deep network structure and multi-modal similarity metric learning, and that M^2S-Net outperforms a variety of strong baselines and achieves state-of-the-art results.

2 M^2S-Net

In this paper, we propose a novel learning framework for sentence pair matching. First, the pairwise token similarity matrix is computed. Then, a deep convolutional network is constructed to learn the matching representation exhaustively. Finally, the learned pairwise matching representation is concatenated with additional simple word-level overlap features and fed into a pointwise rank loss for end-to-end fine-tuning (see Fig. 1).

2.1 Multi-Modal Similarity Metric

As a fundamental component of M^2S-Net, the similarity measurement for pairwise token matching must be designed with care. Given a sentence pair $S_X = (x_1, \dots, x_m)$ and $S_Y = (y_1, \dots, y_n)$, where $m$ and $n$ are the word counts of $S_X$ and $S_Y$, respectively, and $x_i, y_j \in \mathbb{R}^d$ are taken from $d$-dimensional word embeddings pre-trained over a vocabulary $V$, the similarity matrix is formulated as follows:

$$S^{(r)}_{ij} = x_i^{\top} M^{(r)} y_j + b^{(r)}, \quad r = 1, \dots, k$$

where $k$ represents the number of modalities, which can be tuned, $M^{(r)} \in \mathbb{R}^{d \times d}$ represents the matrix of the $r$-th learnable similarity metric, and $b^{(r)}$ is the corresponding bias term. Since the number of parameters in each metric grows quadratically with the word embedding dimension ($d^2$ entries per modality), regularization methods have been proposed Shalit et al. (2010); Cao et al. (2013) to limit model complexity and prevent overfitting. For simplicity, the Frobenius norm is adopted here. For comparison, we also design cosine and Euclidean similarity measurements, computed for each token pair $(x_i, y_j)$ in place of the learned metric.
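The pairwise token similarity described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming each modality scores a token pair with a bilinear form (token vector, metric matrix, token vector) plus a bias; the function names and the toy two-dimensional vectors are hypothetical, and only the cosine baseline is shown for the fixed measurements.

```python
import math

def bilinear_similarity(X, Y, metrics, biases):
    """Return S with S[r][i][j] = x_i^T M_r y_j + b_r for every modality r."""
    S = []
    for M, b in zip(metrics, biases):
        # Project each answer token once: projected[j] = M @ y_j.
        projected = [[sum(m_row[c] * y[c] for c in range(len(y))) for m_row in M]
                     for y in Y]
        S.append([[sum(xv * pv for xv, pv in zip(x, p)) + b for p in projected]
                  for x in X])
    return S

def cosine_similarity(X, Y):
    """Fixed (non-learned) baseline: S[i][j] = cos(x_i, y_j)."""
    def norm(v):
        return math.sqrt(sum(a * a for a in v))
    return [[sum(a * b for a, b in zip(x, y)) / (norm(x) * norm(y)) for y in Y]
            for x in X]

# Toy 2-dimensional "embeddings" (hypothetical values, not real word vectors).
X = [[1.0, 0.0], [0.0, 1.0]]               # question tokens, m = 2
Y = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]   # answer tokens, n = 3
metrics = [[[1.0, 0.0], [0.0, 1.0]],       # modality 1: identity metric
           [[0.0, 1.0], [1.0, 0.0]]]       # modality 2: swaps the two dimensions
biases = [0.0, 0.5]
S = bilinear_similarity(X, Y, metrics, biases)  # k x m x n = 2 x 2 x 3
```

Each modality yields its own m-by-n matrix, so stacking them gives a k-channel input, much like the colour channels of an image, which is what lets a convolutional network consume it directly.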


2.2 Convolution and Pooling

The convolution layer in this work consists of a filter bank $F \in \mathbb{R}^{n_f \times c \times w \times h}$, along with filter biases $B \in \mathbb{R}^{n_f}$, where $n_f$, $w$ and $h$ refer to the number, width and height of the filters, respectively, and $c$ denotes the number of channels of the data from the lower layer. More specifically, for the first convolution layer, $c$ equals the multi-modal parameter $k$, which means convolving across all the similarity modalities to learn the pattern. Given the output $Z^{(l-1)} \in \mathbb{R}^{c \times p \times q}$ from the lower layer ($Z^{(0)}$ represents the similarity matrix $S$), the output of the convolution with filter bank $F$ is computed as follows:

$$Z^{(l)}_{i} = \sum_{u=1}^{c} F_{i,u} * Z^{(l-1)}_{u} + B_i$$

where $*$ denotes the convolution operation, $i = 1, \dots, n_f$ indexes the filters, and each filter slides along the width and height axes with a step size of one, taking a dot product at every position.
Typically, there exist two types of convolution: wide and narrow. Even though previous work Kalchbrenner et al. (2014) has pointed out that the wide type of convolution performs better, we use the narrow type for convenience. Finally, we get the output of layer $l$ as $Z^{(l)} \in \mathbb{R}^{n_f \times (p - w + 1) \times (q - h + 1)}$.
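To make the narrow convolution concrete, here is a minimal pure-Python sketch. It assumes the sliding dot product used by common deep learning frameworks (no filter flipping), stride one and no padding; the function name and the toy input are hypothetical.

```python
def narrow_conv2d(inputs, filters, biases):
    """Narrow (no padding, stride one) multi-channel convolution.

    inputs:  c channels, each a p x q grid      -> inputs[ch][row][col]
    filters: n_f filters, each c x w x h        -> filters[i][ch][dr][dc]
    biases:  one bias per filter
    Returns n_f feature maps of size (p - w + 1) x (q - h + 1).
    """
    c, p, q = len(inputs), len(inputs[0]), len(inputs[0][0])
    output = []
    for filt, bias in zip(filters, biases):
        w, h = len(filt[0]), len(filt[0][0])
        feature_map = []
        for row in range(p - w + 1):
            out_row = []
            for col in range(q - h + 1):
                # Dot product between the filter and the window, summed over channels.
                total = bias
                for ch in range(c):
                    for dr in range(w):
                        for dc in range(h):
                            total += filt[ch][dr][dc] * inputs[ch][row + dr][col + dc]
                out_row.append(total)
            feature_map.append(out_row)
        output.append(feature_map)
    return output

# Two similarity "modalities" over a 3 x 4 token grid (toy values).
similarity = [
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
    [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]],
]
ones_filter = [[[1, 1], [1, 1]], [[1, 1], [1, 1]]]   # one 2 x 2 filter over 2 channels
feature_maps = narrow_conv2d(similarity, [ones_filter], [0])
```

With a 3 x 4 input and a 2 x 2 filter, the narrow type shrinks the map to 2 x 3; a wide convolution would instead pad the input so that border tokens are covered as often as central ones.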

The outputs from the convolutional layer (passed through the activation function) are then fed into the pooling layer, whose goal is to aggregate the information and reduce the size of the representation. There exist two common pooling strategies, i.e., average pooling and max pooling, and both are widely used. However, max pooling can lead to strong over-fitting on the training data and, hence, poor generalization on the test data, as shown in Zeiler and Fergus (2013). For stability and reproducibility, we adopt the average pooling strategy in our work.
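Average pooling over a single feature map can be sketched as follows. The window size and stride are assumptions made for illustration, since the text does not state the values used, and the function name is hypothetical.

```python
def average_pool2d(feature_map, size, stride):
    """Average every size x size window, moving `stride` cells at a time."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, stride):
        pooled.append([
            sum(feature_map[r + dr][c + dc]
                for dr in range(size) for dc in range(size)) / (size * size)
            for c in range(0, cols - size + 1, stride)
        ])
    return pooled

grid = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
pooled = average_pool2d(grid, size=2, stride=2)
```

Unlike max pooling, every cell in a window contributes to the output, so no single strongly matching token pair dominates the pooled value.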

2.3 Pointwise Learning to Rank with Metric Regularization

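We adopt a simple pointwise method to model the answer selection task, though pairwise and listwise approaches are claimed to yield better performance. The cross-entropy cost used to train our framework discriminatively over $N$ training samples is as follows:

$$L(\theta) = -\sum_{i=1}^{N} \left[ t_i \log p_i + (1 - t_i) \log (1 - p_i) \right] + \lambda \sum_{r=1}^{k} \left\| M^{(r)} \right\|_F^2$$

where $p_i$ is the output probability of the $i$-th sample through our network, $t_i$ is the corresponding ground truth, and $\theta$ contains all the parameters optimized by the network, including the similarity metrics $M^{(r)}$, $r = 1, \dots, k$. The Frobenius norm, weighted by $\lambda$, is used to regularize the parameters of the metrics to prevent over-fitting.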
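The regularized pointwise objective can be sketched as below. This assumes a summed binary cross-entropy and a squared Frobenius penalty with a weight `lam`; the weight, the clipping constant and the function name are illustrative assumptions, not values from the paper.

```python
import math

def regularized_cross_entropy(probs, labels, metrics, lam):
    """Binary cross-entropy plus a Frobenius-norm penalty on the metric matrices."""
    eps = 1e-12  # clip probabilities to avoid log(0)
    cross_entropy = 0.0
    for p, t in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        cross_entropy -= t * math.log(p) + (1 - t) * math.log(1 - p)
    # Squared Frobenius norm: sum of squared entries over all metric matrices.
    frobenius = sum(v * v for M in metrics for row in M for v in row)
    return cross_entropy + lam * frobenius

loss = regularized_cross_entropy(
    probs=[0.5, 0.5], labels=[1, 0],
    metrics=[[[1.0, 2.0], [3.0, 4.0]]], lam=0.1)
```

Only the metric matrices enter the penalty here, mirroring the statement that the Frobenius norm regularizes the metric parameters rather than the whole network.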

We use Stochastic Gradient Descent (SGD) to optimize our network, and AdaDelta Zeiler (2012) is used to automatically adapt the learning rate during training. For higher performance, hyper-parameter selection is conducted on the development set, and a Batch Normalization (BN) layer Ioffe and Szegedy (2015) is added after each convolution layer to speed up network optimization. In addition, dropout is applied after the first hidden layer for regularization, and early stopping with a patience of 5 epochs is used to prevent over-fitting.
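Early stopping with a patience of 5 epochs can be sketched as follows. The sketch assumes that a development-set score (higher is better) is monitored after every epoch; which quantity is actually monitored is not stated in the text, and the function name is hypothetical.

```python
def select_best_epoch(dev_scores, patience=5):
    """Stop once `patience` consecutive epochs fail to beat the best dev score.

    Returns (best_epoch, best_score) using zero-based epoch indices.
    """
    best_score, best_epoch, waited = float("-inf"), -1, 0
    for epoch, score in enumerate(dev_scores):
        if score > best_score:
            best_score, best_epoch, waited = score, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # later epochs are never evaluated
    return best_epoch, best_score

# Toy dev-set scores: the peak at epoch 1 is followed by five epochs without improvement.
scores = [0.70, 0.72, 0.71, 0.71, 0.70, 0.69, 0.71, 0.80]
best = select_best_epoch(scores, patience=5)
```

In this toy run training halts after epoch 6, so the later 0.80 is never reached; that is the trade-off a finite patience accepts in exchange for shorter training.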

Set        #Question  #QA pairs  %Correct  Judge
Train-All  1,229      53,417     12.0%     auto
Train      94         4,718      7.4%      man
Dev        65         1,117      18.4%     man
Test       68         1,442      17.2%     man

Table 2: Statistics of the answer sentence selection dataset. Judge denotes whether correctness was determined automatically (auto) or by human annotators (man).

3 Experiments

3.1 Dataset

In this section, we evaluate the proposed model on the TREC-QA dataset, one of the most widely used benchmarks for answer sentence selection. This dataset was created by Wang et al. (2007) from Text REtrieval Conference (TREC) QA track (8-13) data (http://cs.stanford.edu/people/mengqiu/data/qg-emnlp07-data.tgz). Candidate answers were automatically retrieved for each factoid question. Two training sets are provided: a small training set containing 94 questions labeled through manual judgement, and the full training set, i.e., Train-All, which contains 1,229 questions from the entire TREC 8-12 collection, with ground truth labeled automatically by matching regular expressions of the answer keys (http://cs.jhu.edu/~xuchen/packages/jacana-qa-naacl2013-data-results.tar.bz2). Table 2 summarizes the answer selection dataset in detail. In the following experiments, we use the full training set due to its relatively large scale, even though it contains noisy labels caused by automatic pattern matching.

The original development and test datasets have 82 and 100 questions, respectively. Following Wang and Nyberg (2015); Santos et al. (2016); Tan et al. (2015), all questions with only positive or negative answers are removed. Finally, we have 65 development questions with 1,117 question-answer pairs, and 68 test questions with 1,442 question-answer pairs.
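The filtering step above can be sketched as follows, assuming each example is a `(question_id, answer_text, label)` tuple with label 1 for a correct answer and 0 otherwise. This data layout and the function name are hypothetical and are not the format of the released files.

```python
def filter_questions(qa_pairs):
    """Keep only questions that have at least one positive and one negative answer."""
    labels_by_question = {}
    for question_id, _answer, label in qa_pairs:
        labels_by_question.setdefault(question_id, set()).add(label)
    keep = {qid for qid, labels in labels_by_question.items() if labels == {0, 1}}
    return [pair for pair in qa_pairs if pair[0] in keep]

pairs = [
    ("q1", "a", 1), ("q1", "b", 0),   # mixed labels: kept
    ("q2", "c", 0), ("q2", "d", 0),   # only negatives: removed
    ("q3", "e", 1),                   # only positives: removed
]
filtered = filter_questions(pairs)
```

Questions with a single label class carry no ranking signal: any ordering of their candidates yields the same score, so they are dropped before evaluation.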

3.2 Token Representation

We use pre-trained 50-dimensional GloVe word embeddings (http://nlp.stanford.edu/data/glove.6B.zip) Pennington et al. (2014) as our initial word look-up table. These word embeddings are trained on Wikipedia data and the fifth edition of English Gigaword, 6 billion tokens in total. Note that, trading off between model complexity and performance, we do not use the 300-dimensional embeddings, which are trained on much more data and are more widely adopted by previous works Santos et al. (2016); Tan et al. (2015).

Figure 2: Comparison of M^2S-Nets with different measurements and network structures.

3.3 Experimental Setting

Following previous works, we use two metrics to evaluate the proposed model, i.e., Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR). The official trec_eval scorer tool (http://trec.nist.gov/trec_eval/) is used to compute these metrics.
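For intuition, MAP and MRR can be computed from ranked relevance labels as in the sketch below. It is an illustration of the two definitions only and is not a replacement for trec_eval; the function names and the toy rankings are hypothetical.

```python
def average_precision(ranked_labels):
    """Mean of precision@rank taken at the rank of every correct answer."""
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def reciprocal_rank(ranked_labels):
    """1 / rank of the first correct answer, or 0 if there is none."""
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            return 1.0 / rank
    return 0.0

def map_and_mrr(rankings):
    """rankings: one list of 0/1 labels per question, sorted by model score."""
    aps = [average_precision(labels) for labels in rankings]
    rrs = [reciprocal_rank(labels) for labels in rankings]
    return sum(aps) / len(aps), sum(rrs) / len(rrs)

# Two toy questions: labels are listed in the order the model ranked the candidates.
mean_ap, mean_rr = map_and_mrr([[0, 1, 1], [1, 0, 0]])
```

MRR looks only at the first correct answer, while MAP rewards placing every correct answer high, which is why the two metrics can favour different models.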

The simplest word overlap features between each question-answer pair are computed, and we concatenate them with our learned matching representation for the final rank learning. This feature vector contains only two features, i.e., word overlap and IDF-weighted word overlap.
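The two overlap features can be sketched as follows. The sketch assumes raw counts of shared word types and a plain log(N / df) IDF; the exact normalization and IDF variant are not specified in the text, so these choices and the function names are assumptions.

```python
import math

def idf_table(documents):
    """IDF per word over tokenized sentences: log(N / document frequency)."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for word in set(doc):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / count) for word, count in doc_freq.items()}

def overlap_features(question, answer, idf):
    """Return (word overlap count, IDF-weighted word overlap)."""
    shared = set(question) & set(answer)
    return len(shared), sum(idf.get(word, 0.0) for word in shared)

sentences = [
    ["when", "did", "amtrak", "begin"],
    ["amtrak", "was", "founded", "in", "1971"],
    ["when", "was", "it", "founded"],
]
idf = idf_table(sentences)
count, weighted = overlap_features(sentences[0], sentences[1], idf)
```

The IDF weighting down-weights words that occur in many sentences, so sharing a rare content word counts for more than sharing a function word.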

Experiments with our M^2S-Net on the three pre-defined similarity measurements are denoted as M^2S-Net-Euc, M^2S-Net-Cos, and M^2S-Net-Metric, respectively. All of these models share the same network configuration. To demonstrate that the proposed network benefits from a deeper structure, we compare M^2S-Net-Metric with a network with a single convolutional layer, namely M^2S-Net-Shallow. (We found that much deeper constructions might introduce randomness that harms the reproducibility of the results, so we use two convolutional layers for a strict experimental comparison.) Furthermore, we list results of M^2S-Net-Metric with $k = 1, 2, 4$, denoted as M^2S-Net-Metric, M^2S-Net-Metric-2 and M^2S-Net-Metric-4, respectively, to verify the effectiveness of the proposed multi-modal similarity metric. All the networks mentioned here are implemented using Caffe Jia et al. (2014), and the code is publicly available (https://github.com/lxmeng/mms_answer_selection).

4 Results and Discussion

We are motivated to use a multi-modal similarity metric to handle word polysemy, and to construct a thorough matching network between sentence pairs for end-to-end question answering. From Fig. 2, we can see that the one-modality metric is slightly better than the Euclidean and cosine similarity measurements. Increasing the number of modalities boosts performance by 7%. The comparison between the shallow and deep network structures indicates that the proposed M^2S-Net benefits considerably from the deeper construction. The rank of the answer in Table 1 improves from the top 35 and top 26 under the Euclidean and cosine similarity measurements, respectively, to the top 3 under ours.
For a comprehensive comparison, we also list the results of prior state-of-the-art methods on this task in Table 3. The proposed method outperforms the most recently published attention-based methods by about 1% in both MAP and MRR.
The proposed method could be further improved by upgrading the regularization term to limit the rank of the metric, an approach shown to be effective by Law et al. (2014); Cao et al. (2013). In addition, combining the dissimilarity modeled by distance metric learning with the similarity used here is left for future work.

Reference               MAP    MRR
wang2007jeopardy        .6029  .6852
heilman2010tree         .6091  .6917
wang2010probabilistic   .5951  .6951
yao2013answer           .6307  .7477
yih2013question         .7092  .7700
yu2014deep              .7113  .7846
wang2015long            .7134  .7913
tan2015lstm             .7106  .7998
severyn2015learning     .7459  .8078
santos2016attentive     .7530  .8511
wang2016sentence        .7714  .8447
M^2S-Net-Metric-2       .7698  .8640
M^2S-Net-Metric-4       .7793  .8487

Table 3: Results of our models and other methods from the literature.

5 Conclusion

We propose a novel end-to-end learning framework (M^2S-Net) for the answer sentence selection task. The interdependence between a sentence pair at the lexical level is explored more thoroughly by building a deep convolutional neural network directly on pairwise token matching. To enrich the lexical similarity measurement, we adopt multi-modal similarity metric learning. The proposed architecture proves effective and surpasses previous state-of-the-art systems on the answer selection benchmark, i.e., the TREC-QA dataset, in both MAP and MRR.