Inspired by the achievements of convolutional networks (a.k.a. ConvNets) in computer vision, a growing number of researchers have applied ConvNets to a variety of natural language processing tasks, e.g., text classification Kim (2014), text regression Bitvai and Cohn (2015), short text pair re-ranking Severyn and Moschitti (2015), and semantic matching Hu et al. (2014).
The answer selection task is defined as follows: given a question and a set of candidate sentences, choose the correct sentence that contains the exact answer and sufficiently supports the answer choice. Most previous methods build Siamese-like deep architectures (e.g., LSTM-RNN or CNN) to learn a semantic representation for each sentence, and then use cosine similarity or a weight matrix to compute the similarity between the pair of representations Wang and Nyberg (2015). At the same time, these works mostly adopted shallow architectures for sentence modeling, since deeper nets did not bring better performance. In contrast, we are firmly convinced that this task can benefit much more from deep architectures.
Following the success of the RNN-based attention mechanism designed for machine translation Bahdanau et al. (2014), some recent works have applied two-way attention mechanisms to sentence pair matching problems Tan et al. (2015); Santos et al. (2016); Yin et al. (2015). Such soft attention mechanisms demonstrate the effectiveness of interaction between sentence pairs from the lexical level to the semantic level, yet they add considerable computation and model complexity.
Table 1:
Q: When did Amtrak begin operations?
A: Amtrak has not turned a profit since it was founded in 1971.
All of the previous works mentioned above motivate us to construct a network based on pairwise token matching for exhaustive matching learning. However, a vital issue in this construction is the word similarity measurement. Take Q and A in Table 1 for example: it is important to distinguish the similarity between "begin" and "found" in the sense of "set up" from that between "begin" and "found" in the sense of "discovered". To address this, we build a deep convolutional neural network based on pairwise token matching measured with multi-modal similarity metric learning, which we name MS-Net, where the learnable multi-modal similarity metric provides a comprehensive, multi-granularity measurement. Experimental results on the benchmark dataset for the answer selection task indicate that the proposed model benefits greatly from both the deep network structure and multi-modal similarity metric learning, and that MS-Net outperforms a variety of strong baselines and achieves state-of-the-art performance.
In this paper, we propose a novel learning framework for sentence pair matching. First, the pairwise token similarity matrix is computed; then a deep convolutional network is constructed to learn the matching representation exhaustively; finally, the learned pairwise matching representation is concatenated with additional simple word-level overlap features and fed into a pointwise ranking loss for end-to-end fine-tuning (see Fig. 1).
2.1 Multi-Modal Similarity Metric
As a fundamental component of MS-Net, the similarity measurement for pairwise token matching must be designed appropriately. Given a pair of sentences, each with its own word count, whose tokens are drawn from fixed-dimensional word embeddings pre-trained over a vocabulary, the similarity matrix is formulated as follows:
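A minimal sketch of this formulation, assuming a bilinear form and illustrative notation that is introduced here rather than taken from the text ($q_i$ and $a_j$ for the embeddings of the $i$-th and $j$-th tokens of the two sentences, $M_t$ and $b_t$ for the metric matrix and bias of the $t$-th modality):

```latex
S^{(t)}_{ij} = q_i^{\top} M_t \, a_j + b_t , \qquad t = 1, \dots, k
```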
where k is the tunable number of modalities, and each modality has its own learnable similarity metric matrix and corresponding bias term. Since the number of metric parameters grows quadratically with the word embedding dimension, regularization methods have been proposed Shalit et al. (2010); Cao et al. (2013) to limit model complexity and prevent overfitting. The Frobenius norm is adopted here for simplicity. For comparison, we also design cosine and Euclidean similarity measures, which are formulated as follows:
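The three similarity measures can be sketched in NumPy as below. This is an illustration under stated assumptions, not the released Caffe implementation: the function names and array layouts are hypothetical, the learnable metric is assumed to be bilinear, and the Euclidean "similarity" is assumed to be 1 / (1 + distance), since the exact conversion from distance to similarity is not given here.

```python
import numpy as np


def metric_similarity(Q, A, M, b):
    """Multi-modal bilinear similarity (assumed form).

    Q: (m, d) token embeddings of the first sentence
    A: (n, d) token embeddings of the second sentence
    M: (k, d, d) one learnable metric matrix per modality
    b: (k,) one bias per modality
    Returns a (k, m, n) stack of similarity matrices.
    """
    return np.einsum("id,kde,je->kij", Q, M, A) + b[:, None, None]


def cosine_similarity(Q, A, eps=1e-8):
    """Pairwise cosine similarity, shape (m, n)."""
    Qn = Q / (np.linalg.norm(Q, axis=1, keepdims=True) + eps)
    An = A / (np.linalg.norm(A, axis=1, keepdims=True) + eps)
    return Qn @ An.T


def euclidean_similarity(Q, A):
    """Pairwise similarity derived from Euclidean distance, shape (m, n).

    Assumption: distance is mapped to similarity as 1 / (1 + distance).
    """
    diff = Q[:, None, :] - A[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    return 1.0 / (1.0 + dist)


def frobenius_penalty(M):
    """Squared Frobenius norm summed over all metric matrices."""
    return float(np.sum(M ** 2))
```

With d = 50 and k = 4, the metric holds 4 x 50 x 50 = 10,000 weights, which illustrates why some form of regularization on the metric is useful.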
2.2 Convolution and Pooling
The convolution layer in this work consists of a filter bank along with filter biases, where the filter bank is specified by the number, width, and height of its filters, and by the number of channels of the data from the lower layer. More specifically, for the first convolution layer, the number of channels equals the multi-modal parameter k, which means that the filters convolve across all similarity modalities to learn matching patterns. Given the output from the lower layer (the similarity matrix, for the first layer), the output of the convolution with the filter bank is computed as follows:
where * denotes the convolution operation, computed for each filter in the bank by sliding it along the width and height axes with a step size of one and taking a dot product at each position.
Typically, there are two types of convolution: wide and narrow. Even though previous work Kalchbrenner et al. (2014) has pointed out that wide convolution gives better performance, we use the narrow type for convenience. Finally, this yields the output of the convolution layer.
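The narrow convolution described above can be sketched as follows. This is a deliberately simple loop-based illustration, assuming a channels-first layout and hypothetical argument names; like most deep learning frameworks, it computes a sliding dot product without flipping the kernel.

```python
import numpy as np


def narrow_conv2d(x, filters, biases):
    """Narrow (no padding) 2-D convolution with step size one.

    x:       (c, H, W) input, e.g. c = k similarity modalities
    filters: (n, c, h, w) filter bank
    biases:  (n,) one bias per filter
    Returns an (n, H - h + 1, W - w + 1) output.
    """
    c, H, W = x.shape
    n, fc, h, w = filters.shape
    assert fc == c, "filter channels must match input channels"
    out = np.empty((n, H - h + 1, W - w + 1))
    for f in range(n):
        for i in range(H - h + 1):
            for j in range(W - w + 1):
                # Dot product between the filter and the patch it covers,
                # summed across all channels.
                patch = x[:, i:i + h, j:j + w]
                out[f, i, j] = np.sum(patch * filters[f]) + biases[f]
    return out
```

A wide convolution would instead zero-pad the input by h - 1 and w - 1 on each side, so the output would be larger than the input rather than smaller.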
The outputs of the convolutional layer (passed through the activation function) are then fed into the pooling layer, whose goal is to aggregate information and reduce the size of the representation. There are two widely used pooling strategies: average pooling and max pooling. However, max pooling can lead to strong over-fitting on the training data and, hence, poor generalization on the test data, as shown in Zeiler and Fergus (2013). For stability and reproducibility, we adopt average pooling in our work.
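Average pooling can be sketched as below. The window size and stride are hypothetical parameters (their values are not stated here), and the output size is computed with floor division, which is one common convention and an assumption of this sketch.

```python
import numpy as np


def average_pool2d(x, size, stride):
    """Average pooling over square windows.

    x: (n, H, W) feature maps. Returns (n, out_h, out_w).
    """
    n, H, W = x.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.empty((n, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            window = x[:, r:r + size, c:c + size]
            # Each output cell is the mean of the window it covers.
            out[:, i, j] = window.mean(axis=(1, 2))
    return out
```

Replacing `mean` with `max` in the inner loop gives max pooling, which keeps only the strongest response in each window.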
2.3 Pointwise Learning to Rank with Metric Regularization
We adopt a simple pointwise method to model the answer selection task, though pairwise and listwise approaches are claimed to yield better performance. The cross-entropy cost used to discriminatively train our framework is as follows:
Here, the cost is defined over the output probability of each sample through our network and the corresponding ground-truth label, and the parameter set contains all the parameters optimized by the network. The Frobenius norm is used to regularize the parameters of the metrics to prevent over-fitting.
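A minimal sketch of such a cost, assuming binary cross-entropy averaged over the samples plus a squared Frobenius penalty on the metric matrices. The function name, the averaging, and the regularization weight `lam` are assumptions made for illustration.

```python
import numpy as np


def pointwise_loss(probs, labels, metrics, lam=1e-4, eps=1e-12):
    """Cross-entropy over (question, answer) samples plus metric regularization.

    probs:   predicted probability that each candidate is correct
    labels:  ground-truth labels in {0, 1}
    metrics: iterable of metric matrices to regularize
    lam:     regularization weight (hypothetical value)
    """
    p = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    y = np.asarray(labels, dtype=float)
    cross_entropy = -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    # Squared Frobenius norm of every metric matrix.
    penalty = sum(float(np.sum(M ** 2)) for M in metrics)
    return cross_entropy + lam * penalty
```

Because each candidate is scored independently, ranking at test time reduces to sorting the candidates of a question by their predicted probability.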
We use Stochastic Gradient Descent (SGD) to optimize our network, and AdaDelta Zeiler (2012) is used to automatically adapt the learning rate during training. For higher performance, hyper-parameter selection is conducted on the development set, and a Batch Normalization (BN) layer Ioffe and Szegedy (2015) is added after each convolution layer to speed up network optimization. In addition, dropout is applied after the first hidden layer for regularization, and early stopping with a patience of 5 epochs is used to prevent over-fitting.
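The early-stopping rule with a patience of 5 epochs can be sketched as follows. The monitored quantity is assumed to be a development-set score where higher is better (for example MAP), and the function name and return values are hypothetical.

```python
def early_stopping(dev_scores, patience=5):
    """Scan per-epoch development scores and stop when they stop improving.

    Returns (best_epoch, best_score, last_epoch_run), with 0-based epochs.
    Training stops once `patience` epochs have passed without a new best.
    """
    best_score = float("-inf")
    best_epoch = -1
    last_epoch = -1
    for epoch, score in enumerate(dev_scores):
        last_epoch = epoch
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_score, last_epoch
```

In a real training loop, the model parameters from `best_epoch` would be the ones kept for evaluation on the test set.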
In this section, we use the TREC-QA dataset to evaluate the proposed model; it is one of the most widely used benchmarks for answer sentence selection. This dataset was created by Wang et al. (2007) based on Text REtrieval Conference (TREC) QA track (8-13) data (http://cs.stanford.edu/people/mengqiu/data/qg-emnlp07-data.tgz). Candidate answers were automatically retrieved for each factoid question. Two training sets are provided: a small training set containing 94 questions collected through manual judgement, and the full training set, Train-All, which contains 1,229 questions from the entire TREC 8-12 collection, with ground truth labeled automatically by matching regular expressions for the answer keys (http://cs.jhu.edu/~xuchen/packages/jacana-qa-naacl2013-data-results.tar.bz2). Table 2 summarizes the answer selection dataset in detail. In the following experiments, we use the full training set because of its relatively large scale, even though it contains noisy labels caused by automatic pattern matching.
The original development and test datasets have 82 and 100 questions, respectively. Following Wang and Nyberg (2015); Santos et al. (2016); Tan et al. (2015), all questions with only positive or only negative answers are removed. This leaves 65 development questions with 1,117 question-answer pairs, and 68 test questions with 1,442 question-answer pairs.
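The filtering step can be sketched as below, assuming the data is available as (question id, answer text, label) triples with labels in {0, 1}. That record layout and the function name are assumptions for illustration and do not reflect the actual file format of the dataset.

```python
from collections import defaultdict


def filter_questions(pairs):
    """Keep only questions that have both positive and negative candidates.

    pairs: list of (question_id, answer_text, label) with label in {0, 1}.
    """
    labels_by_question = defaultdict(set)
    for question_id, _, label in pairs:
        labels_by_question[question_id].add(label)
    keep = {q for q, labels in labels_by_question.items() if labels == {0, 1}}
    return [p for p in pairs if p[0] in keep]
```

Questions with a single label type are dropped because ranking metrics are uninformative for them: any ordering of the candidates gives the same score.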
3.2 Token Representation
We use pre-trained 50-dimensional word embeddings (http://nlp.stanford.edu/data/glove.6B.zip) Pennington et al. (2014) as our initial word look-up table. These word embeddings are trained on Wikipedia data and the fifth edition of English Gigaword, totaling 6 billion tokens. Note that, trading off model complexity against performance, we do not use the 300-dimensional embeddings, which are trained on much more data and are more widely adopted by previous works Santos et al. (2016); Tan et al. (2015).
3.3 Experimental Setting
Following previous works, we use two metrics to evaluate the proposed model: Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR). The official trec_eval scorer (http://trec.nist.gov/trec_eval/) is used to compute these metrics.
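For reference, the two metrics can be sketched in a few lines. This is not the trec_eval tool, only an illustration of the definitions, and it assumes that each question is given as a list of 0/1 relevance labels already sorted by descending model score, with every candidate included in the ranking.

```python
def average_precision(ranked_labels):
    """Mean of precision@rank taken at the rank of each correct answer."""
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def reciprocal_rank(ranked_labels):
    """1 / rank of the first correct answer, or 0 if there is none."""
    for rank, label in enumerate(ranked_labels, start=1):
        if label == 1:
            return 1.0 / rank
    return 0.0


def map_mrr(rankings):
    """Average both metrics over all questions."""
    aps = [average_precision(r) for r in rankings]
    rrs = [reciprocal_rank(r) for r in rankings]
    return sum(aps) / len(aps), sum(rrs) / len(rrs)
```

MRR only rewards the position of the first correct answer, while MAP takes the positions of all correct answers into account.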
The simplest word overlap features between each question-answer pair are computed and concatenated with our learned matching representation for the final rank learning. This feature vector contains only two features: word overlap and IDF-weighted word overlap.
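One way to compute these two features is sketched below. The exact definitions are not given here, so this sketch assumes a Jaccard-style normalization over word types and an IDF of log(N / df) estimated from a list of tokenized sentences; the function names and the absence of stop-word handling are likewise assumptions.

```python
import math
from collections import Counter


def idf_table(documents):
    """IDF = log(N / df) over a list of tokenized sentences."""
    n_docs = len(documents)
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))
    return {word: math.log(n_docs / count) for word, count in df.items()}


def overlap_features(question, answer, idf):
    """Return (word overlap, IDF-weighted word overlap) for one pair.

    Both features are normalized by the union of the two word sets
    (an assumption of this sketch).
    """
    q_set, a_set = set(question), set(answer)
    shared, union = q_set & a_set, q_set | a_set
    overlap = len(shared) / len(union) if union else 0.0
    denom = sum(idf.get(w, 0.0) for w in union)
    weighted = sum(idf.get(w, 0.0) for w in shared) / denom if denom else 0.0
    return overlap, weighted
```

The IDF weighting reduces the contribution of words that occur in almost every sentence, so a match on a rare word counts for more than a match on a frequent one.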
Experiments with MS-Net using the three similarity measurements defined above are denoted MS-Net-Euc, MS-Net-Cos, and MS-Net-Metric, respectively. All of these models share the same network configuration. To demonstrate that the proposed network benefits from a deeper structure, we compare MS-Net-Metric with a network that has a single convolutional layer, namely MS-Net-Shallow. (We found that much deeper constructions might introduce randomness that harms the reproducibility of the results, so we use two convolutional layers for a strict experimental comparison.) Furthermore, we list the results of MS-Net-Metric with k = 1, 2, and 4 modalities, denoted MS-Net-Metric, MS-Net-Metric-2, and MS-Net-Metric-4 respectively, to verify the effectiveness of the proposed multi-modal similarity metric. All the networks mentioned here are implemented using Caffe Jia et al. (2014), and the code is publicly available (https://github.com/lxmeng/mms_answer_selection).
4 Results and Discussion
We are motivated to use a multi-modal similarity metric to handle word polysemy, and to construct a thorough matching network between sentence pairs for end-to-end question answering. From Fig. 2, we can see that the one-modality metric is slightly better than the Euclidean and cosine similarity measurements. Increasing the number of modalities greatly boosts performance, by 7%. The comparison between the shallow and deep network structures indicates that the proposed MS-Net benefits substantially from the deeper construction. The rank of the answer in Table 1 is promoted from the top 35 and top 26, using the Euclidean and cosine similarity measurements respectively, to the top 3 using ours.
For a comprehensive comparison, we also list the results of prior state-of-the-art methods on this task in Table 3. The proposed method outperforms the most recently published attention-based methods by 1% in both MAP and MRR.
The proposed method could be further improved by upgrading the regularization term to limit the rank of the metric, an approach shown to be effective by Law et al. (2014); Cao et al. (2013). In addition, combining the dissimilarity modeled by distance metric learning with the similarity used here is left for future work.
A novel end-to-end learning framework (MS-Net) is proposed for the answer sentence selection task. The interdependence between a sentence pair at the lexical level is explored more thoroughly by building a deep convolutional neural network directly on pairwise token matching. To enrich the lexical modality measurement, we adopt multi-modal similarity metric learning. The proposed architecture proves effective and surpasses previous state-of-the-art systems on the answer selection benchmark, the TREC-QA dataset, in both MAP and MRR.
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 .
- Bitvai and Cohn (2015) Zsolt Bitvai and Trevor Cohn. 2015. Non-linear text regression with a deep convolutional neural network. Volume 2: Short Papers 1(x1):180.
- Cao et al. (2013) Qiong Cao, Yiming Ying, and Peng Li. 2013. Similarity metric learning for face recognition. In Proceedings of the IEEE International Conference on Computer Vision. pages 2408–2415.
- Heilman and Smith (2010) Michael Heilman and Noah A Smith. 2010. Tree edit models for recognizing textual entailments, paraphrases, and answers to questions. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, pages 1011–1019.
- Hu et al. (2014) Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional neural network architectures for matching natural language sentences. In Advances in Neural Information Processing Systems. pages 2042–2050.
- Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 .
- Jia et al. (2014) Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 .
- Kalchbrenner et al. (2014) Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, Volume 1: Long Papers. pages 655–665. http://aclweb.org/anthology/P/P14/P14-1062.pdf.
- Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. EMNLP .
- Law et al. (2014) Marc T Law, Nicolas Thome, and Matthieu Cord. 2014. Fantope regularization in metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pages 1051–1058.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP). pages 1532–1543. http://www.aclweb.org/anthology/D14-1162.
- Santos et al. (2016) Cicero dos Santos, Ming Tan, Bing Xiang, and Bowen Zhou. 2016. Attentive pooling networks. arXiv preprint arXiv:1602.03609 .
- Severyn and Moschitti (2015) Aliaksei Severyn and Alessandro Moschitti. 2015. Learning to rank short text pairs with convolutional deep neural networks. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, pages 373–382.
- Shalit et al. (2010) Uri Shalit, Daphna Weinshall, and Gal Chechik. 2010. Online learning in the manifold of low-rank matrices. In Advances in neural information processing systems. pages 2128–2136.
- Tan et al. (2015) Ming Tan, Bing Xiang, and Bowen Zhou. 2015. Lstm-based deep learning models for non-factoid answer selection. arXiv preprint arXiv:1511.04108 .
- Wang and Nyberg (2015) Di Wang and Eric Nyberg. 2015. A long short-term memory model for answer sentence selection in question answering. ACL, July .
- Wang and Manning (2010) Mengqiu Wang and Christopher D Manning. 2010. Probabilistic tree-edit models with structured latent variables for textual entailment and question answering. In Proceedings of the 23rd International Conference on Computational Linguistics. Association for Computational Linguistics, pages 1164–1172.
- Wang et al. (2007) Mengqiu Wang, Noah A Smith, and Teruko Mitamura. 2007. What is the jeopardy model? a quasi-synchronous grammar for qa. In EMNLP-CoNLL. volume 7, pages 22–32.
- Wang et al. (2016) Zhiguo Wang, Haitao Mi, and Abraham Ittycheriah. 2016. Sentence similarity learning by lexical decomposition and composition. arXiv preprint arXiv:1602.07019 .
- Yao et al. (2013) Xuchen Yao, Benjamin Van Durme, Chris Callison-Burch, and Peter Clark. 2013. Answer extraction as sequence tagging with tree edit distance. In HLT-NAACL. Citeseer, pages 858–867.
- Yih et al. (2013) Wen-tau Yih, Ming-Wei Chang, Christopher Meek, and Andrzej Pastusiak. 2013. Question answering using enhanced lexical semantic models. ACL .
- Yin et al. (2015) Wenpeng Yin, Hinrich Schütze, Bing Xiang, and Bowen Zhou. 2015. Abcnn: Attention-based convolutional neural network for modeling sentence pairs. arXiv preprint arXiv:1512.05193 .
- Yu et al. (2014) Lei Yu, Karl Moritz Hermann, Phil Blunsom, and Stephen Pulman. 2014. Deep learning for answer sentence selection. NIPS Deep Learning and Representation Learning Workshop .
- Zeiler (2012) Matthew D Zeiler. 2012. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701 .
- Zeiler and Fergus (2013) Matthew D. Zeiler and Rob Fergus. 2013. Stochastic pooling for regularization of deep convolutional neural networks. CoRR abs/1301.3557. http://dblp.uni-trier.de/db/journals/corr/corr1301.html/abs-1301-3557.