1 Introduction
Inspired by the achievements of convolutional networks (ConvNets) in computer vision, a growing number of researchers build ConvNets for a variety of natural language processing tasks, e.g., text classification Kim (2014), text regression Bitvai and Cohn (2015), short text pair reranking Severyn and Moschitti (2015), and semantic matching Hu et al. (2014).
The answer selection task is defined as follows: given a question and a set of candidate sentences, choose the correct sentence that contains the exact answer and sufficiently supports it. Most previous methods build Siamese-like deep architectures (LSTM-RNN, CNN, etc.) to learn a semantic representation for each sentence, and then use cosine similarity or a weight matrix to compute the similarity between the two representations Wang and Nyberg (2015). These works mostly adopted shallow architectures for sentence modeling, since deeper nets did not bring better performance. In contrast, we are firmly convinced that answer selection can benefit much more from deep learning strategies.
Following the success of the RNN-based attention mechanism designed for machine translation Bahdanau et al. (2014), some recent works have applied two-way attention mechanisms to sentence pair matching problems Tan et al. (2015); Santos et al. (2016); Yin et al. (2015). Such soft attention mechanisms demonstrate the effectiveness of interaction between sentence pairs from the lexical level to the semantic level, yet they add considerable computation and model complexity.
Table 1: Example question (Q) and candidate answer (A).
Q: When did Amtrak begin operations?
A: Amtrak has not turned a profit since it was founded in 1971.
The works mentioned above motivate us to construct a network based on pairwise token matching in order to learn matching patterns exhaustively. A vital issue in this design, however, is the measurement of word similarity. Take Q and A in Table 1 as an example: it matters whether "begin" is compared with "found" in the sense of "set up" or with "found" in the sense of "discovered". To address this, we build a deep convolutional neural network on pairwise token matching measured with multi-modal similarity metric learning, which we call MSNet; the learnable multi-modal similarity metric provides a comprehensive, multi-granularity measurement. Experimental results on the benchmark dataset for answer selection indicate that the proposed model benefits greatly from both the deep network structure and multi-modal similarity metric learning, and that MSNet outperforms a variety of strong baselines and achieves state-of-the-art performance.
2 MSNet
In this paper, we propose a novel learning framework for sentence pair matching. The pairwise token similarity matrix is computed first; a deep convolutional network is then constructed to learn the matching representation exhaustively; finally, the learned pairwise matching representation is concatenated with additional simple word-level overlap features and fed into a pointwise ranking loss for end-to-end fine-tuning (see Fig. 1).
2.1 Multi-Modal Similarity Metric
As a fundamental component of MSNet, the similarity measurement for pairwise token matching must be designed carefully. Given a sentence pair $Q = (q_1, \dots, q_m)$ and $A = (a_1, \dots, a_n)$, where $m$ and $n$ are the word counts of $Q$ and $A$, respectively, and $q_i, a_j \in \mathbb{R}^{d}$ are taken from $d$-dimensional word embeddings pretrained over a vocabulary $V$, the similarity matrix is formulated as follows:
$$S^{(r)}_{ij} = q_i^{\top} M^{(r)} a_j + b^{(r)}, \qquad r = 1, \dots, k \qquad (1)$$
where $k$ is the number of modalities, which can be tuned, $M^{(r)} \in \mathbb{R}^{d \times d}$ is the matrix of the $r$-th learnable similarity metric, and $b^{(r)}$ is the corresponding bias term. Since the number of metric parameters grows quadratically with the word embedding dimension, regularization methods have been proposed Shalit et al. (2010); Cao et al. (2013) to limit model complexity and prevent overfitting. The Frobenius norm is adopted here for simplicity. For comparison, we also define cosine and Euclidean similarity, formulated as follows:
(2) 
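As an illustration, the three similarity measurements can be sketched in NumPy. The sketch assumes a bilinear form for each modality of the learnable metric (a d-by-d matrix plus a scalar bias) and takes the negated Euclidean distance as the Euclidean similarity. Both forms, the function names, and the array shapes are assumptions made for this example, and the sketch is not the Caffe implementation used in the experiments.

```python
import numpy as np

def bilinear_similarity(Q, A, M, b):
    """Multi-modal similarity: one m-by-n matrix per modality.

    Assumed shapes: Q is (m, d), A is (n, d), M is (k, d, d), b is (k,).
    """
    # S[r, i, j] = Q[i] @ M[r] @ A[j] + b[r]
    return np.einsum("id,rde,je->rij", Q, M, A) + b[:, None, None]

def cosine_similarity(Q, A, eps=1e-8):
    """Cosine similarity between every question token and answer token."""
    Qn = Q / (np.linalg.norm(Q, axis=1, keepdims=True) + eps)
    An = A / (np.linalg.norm(A, axis=1, keepdims=True) + eps)
    return Qn @ An.T

def euclidean_similarity(Q, A):
    """Assumption: similarity is the negated Euclidean distance."""
    diff = Q[:, None, :] - A[None, :, :]
    return -np.linalg.norm(diff, axis=2)

# Toy example with random vectors standing in for word embeddings.
rng = np.random.default_rng(0)
m, n, d, k = 6, 12, 50, 2
Q = rng.normal(size=(m, d))
A = rng.normal(size=(n, d))
M = np.stack([np.eye(d)] * k)  # identity metric reduces to a dot product
b = np.zeros(k)
S = bilinear_similarity(Q, A, M, b)  # shape (k, m, n)
```

With identity metrics and zero biases, each modality equals the plain dot-product matrix; training would move the k metrics apart so that each captures a different notion of word similarity.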
2.2 Convolution and Pooling
The convolution layer in this work consists of a filter bank $F \in \mathbb{R}^{n \times c \times w \times h}$ along with filter biases $B \in \mathbb{R}^{n}$, where $n$, $w$ and $h$ refer to the number, width and height of the filters, respectively, and $c$ denotes the number of channels of the data from the lower layer. For the first convolution layer, $c$ equals the multi-modal parameter $k$, which means that the filters convolve across all similarity modalities to learn matching patterns. Given the output $X^{l-1}$ of the lower layer ($X^{0}$ is the similarity matrix $S$), the output of the convolution with filter bank $F$ is computed as follows:
$$X^{l}_{i} = F_{i} * X^{l-1} + B_{i}, \qquad i = 1, \dots, n \qquad (3)$$
where $*$ denotes the convolution operation, $i$ indexes the filters, and each filter slides along the width and height axes with a step size of one, taking a dot product at every position.
Typically, there exist two types of convolution: wide and narrow. Although Kalchbrenner et al. (2014) pointed out that wide convolution gives better performance, we use the narrow type for convenience. Finally, we obtain the output of layer $l$ as $X^{l}$.
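The difference between narrow and wide convolution is easiest to see in the output shapes. The following is a minimal NumPy sketch of a stride-one convolution over a multi-channel input; the function name `conv2d`, the (filters, channels, height, width) layout, and the toy sizes are assumptions for illustration, and the operation is implemented as cross-correlation, as is usual in deep learning toolkits.

```python
import numpy as np

def conv2d(x, filters, biases, mode="narrow"):
    """Stride-1 convolution over a multi-channel input.

    Assumed shapes: x is (c, H, W), filters is (n, c, h, w), biases is (n,).
    """
    n, c, h, w = filters.shape
    if mode == "wide":
        # Zero-pad so that every partial overlap of filter and input is kept.
        x = np.pad(x, ((0, 0), (h - 1, h - 1), (w - 1, w - 1)))
    _, H, W = x.shape
    out = np.empty((n, H - h + 1, W - w + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            patch = x[:, i:i + h, j:j + w]  # (c, h, w) window
            # Dot product of every filter with the window, plus bias.
            out[:, i, j] = np.tensordot(filters, patch, axes=3) + biases
    return out

# Two similarity modalities for a 6-token question and a 12-token answer.
x = np.ones((2, 6, 12))
filters = np.ones((4, 2, 3, 3))
biases = np.zeros(4)
narrow = conv2d(x, filters, biases, mode="narrow")
wide = conv2d(x, filters, biases, mode="wide")
print(narrow.shape, wide.shape)  # (4, 4, 10) (4, 8, 14)
```

Narrow convolution shrinks each spatial axis by the filter size minus one, while wide convolution grows it by the same amount, so border tokens are seen as often as central ones at the cost of padded positions.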
The outputs of the convolutional layer, after passing through the activation function, are fed into the pooling layer, whose goal is to aggregate information and reduce the size of the representation. Two pooling strategies are widely used: average pooling and max pooling. However, max pooling can lead to strong overfitting on the training data and hence poor generalization on the test data, as shown in Zeiler and Fergus (2013). For stability and reproducibility, we adopt average pooling in our work.
2.3 Pointwise Learning to Rank with Metric Regularization
We adopt a simple pointwise method to model the answer selection task, although pairwise and listwise approaches are claimed to yield better performance. The cross-entropy cost used to train our framework discriminatively is as follows:
$$\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right] + \frac{\lambda}{2} \lVert M \rVert_F^2 \qquad (4)$$
where $p_i$ is the output probability of the $i$-th sample through our network, $y_i$ is the corresponding ground truth, $N$ is the number of samples, $\lambda$ weights the regularization term, and $\theta$ contains all the parameters optimized by the network, including the similarity metrics and the convolutional filters. The Frobenius norm is used to regularize the metric parameters $M$ to prevent overfitting.
We use Stochastic Gradient Descent (SGD) to optimize our network, and AdaDelta Zeiler (2012) is used to automatically adapt the learning rate during training. For higher performance, hyperparameters are selected on the development set, and a Batch Normalization (BN) layer Ioffe and Szegedy (2015) is added after each convolution layer to speed up network optimization. In addition, dropout is applied after the first hidden layer for regularization, and early stopping with a patience of 5 epochs is used to prevent overfitting.
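A minimal sketch of the pointwise objective, assuming a binary cross-entropy averaged over the batch plus a squared Frobenius penalty on the metric matrices. The averaging, the factor of one half, the default weight `lam`, and the function name are assumptions made for this example rather than details confirmed by the text.

```python
import numpy as np

def pointwise_loss(p, y, M, lam=1e-4, eps=1e-12):
    """Cross-entropy over question-answer pairs with a Frobenius penalty.

    p: predicted probabilities that each pair is correct, shape (N,)
    y: ground-truth labels in {0, 1}, shape (N,)
    M: metric matrices of any shape, e.g. (k, d, d)
    lam: regularization weight (hypothetical default)
    """
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    cross_entropy = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    frobenius = 0.5 * lam * np.sum(M ** 2)  # squared Frobenius norm
    return cross_entropy + frobenius

p = np.array([0.9, 0.2, 0.6])
y = np.array([1.0, 0.0, 1.0])
M = np.stack([np.eye(3)] * 2)
loss = pointwise_loss(p, y, M)
```

Only the metric matrices enter the penalty, which matches the stated intent of regularizing the metric while leaving the convolutional filters to dropout and early stopping.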
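Table 2: Statistics of the TRECQA answer selection dataset.

| Set | # Questions | # QA pairs | % Correct | Judgement |
|---|---|---|---|---|
| TrainAll | 1,229 | 53,417 | 12.0% | automatic |
| Train | 94 | 4,718 | 7.4% | manual |
| Dev | 65 | 1,117 | 18.4% | manual |
| Test | 68 | 1,442 | 17.2% | manual |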
3 Experiments
3.1 Dataset
In this section, we evaluate the proposed model on the TRECQA dataset, one of the most widely used benchmarks for answer sentence selection. This dataset was created by Wang et al. (2007) from the Text REtrieval Conference (TREC) QA track (8-13) data (http://cs.stanford.edu/people/mengqiu/data/qgemnlp07data.tgz). Candidate answers were automatically retrieved for each factoid question. Two training sets are provided: a small training set containing 94 questions labeled through manual judgement, and a full training set, TrainAll, which contains 1,229 questions from the entire TREC 8-12 collection, with ground truth labeled automatically by matching regular expressions of the answer keys (http://cs.jhu.edu/~xuchen/packages/jacanaqanaacl2013dataresults.tar.bz2). Table 2 summarizes the answer selection dataset in detail. In the following experiments, we use the full training set because of its relatively large scale, even though it contains noisy labels caused by automatic pattern matching.
The original development and test sets have 82 and 100 questions, respectively. Following Wang and Nyberg (2015); Santos et al. (2016); Tan et al. (2015), all questions with only positive or only negative answers are removed. This leaves 65 development questions with 1,117 question-answer pairs, and 68 test questions with 1,442 question-answer pairs.
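The filtering step can be sketched as follows, assuming the data is available as (question id, answer text, label) triples. The triple layout and the function name are hypothetical and do not reflect the actual file format of the dataset.

```python
from collections import defaultdict

def filter_questions(pairs):
    """Keep only questions that have both correct and incorrect candidates.

    pairs: iterable of (question_id, answer_text, label), label in {0, 1}.
    Returns a dict mapping question_id to its list of (answer_text, label).
    """
    by_question = defaultdict(list)
    for qid, answer, label in pairs:
        by_question[qid].append((answer, label))
    kept = {}
    for qid, candidates in by_question.items():
        labels = {label for _, label in candidates}
        if labels == {0, 1}:  # drop all-positive and all-negative questions
            kept[qid] = candidates
    return kept

pairs = [
    ("q1", "founded in 1971", 1),
    ("q1", "runs passenger trains", 0),
    ("q2", "no relevant sentence", 0),
    ("q3", "only a correct sentence", 1),
]
kept = filter_questions(pairs)
print(sorted(kept))  # ['q1']
```

Questions without at least one correct and one incorrect candidate contribute nothing to ranking metrics such as MAP and MRR, which is why they are excluded from evaluation.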
3.2 Token Representation
We use pretrained 50-dimensional word embeddings (http://nlp.stanford.edu/data/glove.6B.zip) Pennington et al. (2014) as our initial word lookup table. These word embeddings were trained on Wikipedia data and the fifth edition of English Gigaword, 6 billion tokens in total. Note that, as a trade-off between model complexity and performance, we do not use the 300-dimensional embeddings, which are trained on much more data and are more widely adopted in previous works Santos et al. (2016); Tan et al. (2015).
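For reference, a lookup table can be built from a GloVe-style text file, in which each line holds a word followed by its space-separated vector components. The sketch below is a generic loader written for this illustration (the function name and the in-memory toy input are assumptions); it is not the loading code used in the experiments.

```python
import io
import numpy as np

def load_embeddings(lines, dim=50):
    """Parse 'word v1 v2 ... v_dim' lines into a word -> vector dict."""
    table = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        table[parts[0]] = np.array([float(v) for v in parts[1:]],
                                   dtype=np.float32)
    return table

# Toy 3-dimensional input standing in for a real embedding file.
toy_file = io.StringIO("begin 0.1 0.2 0.3\nfound 0.4 0.5 0.6\nbroken 1.0\n")
table = load_embeddings(toy_file, dim=3)
print(sorted(table))  # ['begin', 'found']
```

With a real file, the same function would be called on an open file handle with `dim=50`.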
3.3 Experimental Setting
Following previous works, we use two metrics to evaluate the proposed model: Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR). The official trec_eval scorer (http://trec.nist.gov/trec_eval/) is used to compute both metrics.
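To make the two metrics concrete, the following sketch computes them from per-question relevance lists that are already sorted by descending model score. It is a simplified illustration with hypothetical function names, assuming every relevant candidate appears in the ranked list; trec_eval remains the reference scorer and handles details such as tie-breaking that are ignored here.

```python
def average_precision(labels):
    """labels: 0/1 relevance of candidates, sorted by descending score."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(labels, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision at this relevant position
    return total / hits if hits else 0.0

def reciprocal_rank(labels):
    """Inverse rank of the first relevant candidate, or 0 if none."""
    for rank, relevant in enumerate(labels, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0

def map_mrr(ranked_lists):
    """Mean of both metrics over all questions."""
    aps = [average_precision(labels) for labels in ranked_lists]
    rrs = [reciprocal_rank(labels) for labels in ranked_lists]
    return sum(aps) / len(aps), sum(rrs) / len(rrs)

# Question 1 ranks its correct answers 2nd and 3rd; question 2 ranks its one 1st.
ranked = [[0, 1, 1], [1, 0, 0]]
mean_ap, mean_rr = map_mrr(ranked)
print(round(mean_ap, 4), round(mean_rr, 4))  # 0.7917 0.75
```

MAP rewards placing all correct answers high in the list, whereas MRR depends only on the position of the first correct answer.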
Simple word overlap features are computed for each question-answer pair and concatenated with the learned matching representation for the final rank learning. This feature vector contains only two features: word overlap and IDF-weighted word overlap.
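One way to compute such a two-dimensional feature vector is sketched below. The exact normalization is not specified in the text, so the sketch assumes raw counts: the number of shared word types and the sum of their IDF values, with IDF estimated as log(N / df) over a hypothetical collection of tokenized sentences.

```python
import math
from collections import Counter

def idf_table(sentences):
    """IDF = log(N / document frequency) over tokenized sentences."""
    n = len(sentences)
    df = Counter(tok for sent in sentences for tok in set(sent))
    return {tok: math.log(n / count) for tok, count in df.items()}

def overlap_features(question, answer, idf):
    """Return [word overlap, IDF-weighted word overlap] for one pair."""
    shared = set(question) & set(answer)
    overlap = float(len(shared))
    weighted = sum(idf.get(tok, 0.0) for tok in shared)
    return [overlap, weighted]

question = ["when", "did", "amtrak", "begin", "operations"]
answer = ["amtrak", "was", "founded", "in", "1971"]
other = ["the", "train", "left"]
idf = idf_table([question, answer, other])
features = overlap_features(question, answer, idf)
```

In this toy pair the only shared word is "amtrak", so the first feature is 1 and the second is the IDF of that word; stop-word removal or length normalization could be added, but neither is stated in the text.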
Experiments with MSNet using the three predefined similarity measurements are denoted as MSNetEuc, MSNetCos, and MSNetMetric, respectively. All of these models share the same network configuration. To demonstrate that the proposed network benefits from a deeper structure, we compare MSNetMetric with a network that has one convolutional layer, namely MSNetShallow. (We found that much deeper constructions might introduce randomness that harms the reproducibility of the results, so we use two convolutional layers for a strict experimental comparison.) Furthermore, we list the results of MSNetMetric with $k = 1, 2, 4$, denoted as MSNetMetric, MSNetMetric2 and MSNetMetric4, respectively, to verify the effectiveness of the proposed multi-modal similarity metric. All the networks mentioned here are implemented using Caffe Jia et al. (2014), and the code is publicly available (https://github.com/lxmeng/mms_answer_selection).
4 Results and Discussion
We are motivated to use a multi-modal similarity metric to handle word polysemy, and to construct a thorough matching network between sentence pairs for end-to-end question answering. From Fig. 2, we can see that the one-modality metric is slightly better than the Euclidean and cosine similarity measurements. Increasing the number of modalities boosts performance substantially, by 7%. The comparison between the shallow and deep network structures indicates that the proposed MSNet benefits considerably from the deeper construction. The rank of the answer in Table 1 is promoted from top 35 and top 26 with the Euclidean and cosine similarity measurements, respectively, to top 3 with ours.
For a comprehensive comparison, we also list the results of prior state-of-the-art methods on this task in Table 3. The proposed method outperforms the most recently published attention-based methods by about 1% in both MAP and MRR.
The proposed method could be further improved by upgrading the regularization term to limit the rank of the metric, which has been shown to be effective by Law et al. (2014); Cao et al. (2013). In addition, combining the dissimilarity modeled by distance metric learning with the similarity used here is left for future work.
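Table 3: Results of prior methods and of MSNet on the TRECQA answer selection task.

| Reference | MAP | MRR |
|---|---|---|
| Wang et al. (2007) | .6029 | .6852 |
| Heilman and Smith (2010) | .6091 | .6917 |
| Wang and Manning (2010) | .5951 | .6951 |
| Yao et al. (2013) | .6307 | .7477 |
| Yih et al. (2013) | .7092 | .7700 |
| Yu et al. (2014) | .7113 | .7846 |
| Wang and Nyberg (2015) | .7134 | .7913 |
| Tan et al. (2015) | .7106 | .7998 |
| Severyn and Moschitti (2015) | .7459 | .8078 |
| Santos et al. (2016) | .7530 | .8511 |
| Wang et al. (2016) | .7714 | .8447 |
| MSNetMetric2 | .7698 | .8640 |
| MSNetMetric4 | .7793 | .8487 |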
5 Conclusion
A novel end-to-end learning framework, MSNet, is proposed for the answer sentence selection task. The interdependence between the sentences of a pair at the lexical level is explored more thoroughly by building a deep convolutional neural network directly on pairwise token matching. To enrich the lexical measurement, we adopt multi-modal similarity metric learning. The proposed architecture proves effective and surpasses previous state-of-the-art systems on the answer selection benchmark, i.e., the TRECQA dataset, in both MAP and MRR.
References
 Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 .
 Bitvai and Cohn (2015) Zsolt Bitvai and Trevor Cohn. 2015. Nonlinear text regression with a deep convolutional neural network. Volume 2: Short Papers 1(x1):180.

 Cao et al. (2013) Qiong Cao, Yiming Ying, and Peng Li. 2013. Similarity metric learning for face recognition. In Proceedings of the IEEE International Conference on Computer Vision. pages 2408–2415.
 Heilman and Smith (2010) Michael Heilman and Noah A Smith. 2010. Tree edit models for recognizing textual entailments, paraphrases, and answers to questions. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, pages 1011–1019.
 Hu et al. (2014) Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional neural network architectures for matching natural language sentences. In Advances in Neural Information Processing Systems. pages 2042–2050.
 Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 .
 Jia et al. (2014) Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 .
 Kalchbrenner et al. (2014) Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, Volume 1: Long Papers. pages 655–665. http://aclweb.org/anthology/P/P14/P141062.pdf.
 Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. EMNLP .

 Law et al. (2014) Marc T Law, Nicolas Thome, and Matthieu Cord. 2014. Fantope regularization in metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pages 1051–1058.
 Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP). pages 1532–1543. http://www.aclweb.org/anthology/D141162.
 Santos et al. (2016) Cicero dos Santos, Ming Tan, Bing Xiang, and Bowen Zhou. 2016. Attentive pooling networks. arXiv preprint arXiv:1602.03609 .
 Severyn and Moschitti (2015) Aliaksei Severyn and Alessandro Moschitti. 2015. Learning to rank short text pairs with convolutional deep neural networks. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, pages 373–382.
 Shalit et al. (2010) Uri Shalit, Daphna Weinshall, and Gal Chechik. 2010. Online learning in the manifold of low-rank matrices. In Advances in Neural Information Processing Systems. pages 2128–2136.
 Tan et al. (2015) Ming Tan, Bing Xiang, and Bowen Zhou. 2015. LSTM-based deep learning models for non-factoid answer selection. arXiv preprint arXiv:1511.04108 .
 Wang and Nyberg (2015) Di Wang and Eric Nyberg. 2015. A long short-term memory model for answer sentence selection in question answering. ACL, July .
 Wang and Manning (2010) Mengqiu Wang and Christopher D Manning. 2010. Probabilistic tree-edit models with structured latent variables for textual entailment and question answering. In Proceedings of the 23rd International Conference on Computational Linguistics. Association for Computational Linguistics, pages 1164–1172.
 Wang et al. (2007) Mengqiu Wang, Noah A Smith, and Teruko Mitamura. 2007. What is the Jeopardy model? A quasi-synchronous grammar for QA. In EMNLP-CoNLL. volume 7, pages 22–32.
 Wang et al. (2016) Zhiguo Wang, Haitao Mi, and Abraham Ittycheriah. 2016. Sentence similarity learning by lexical decomposition and composition. arXiv preprint arXiv:1602.07019 .
 Yao et al. (2013) Xuchen Yao, Benjamin Van Durme, Chris Callison-Burch, and Peter Clark. 2013. Answer extraction as sequence tagging with tree edit distance. In HLT-NAACL. Citeseer, pages 858–867.
 Yih et al. (2013) Wen-tau Yih, Ming-Wei Chang, Christopher Meek, and Andrzej Pastusiak. 2013. Question answering using enhanced lexical semantic models. ACL .
 Yin et al. (2015) Wenpeng Yin, Hinrich Schütze, Bing Xiang, and Bowen Zhou. 2015. ABCNN: Attention-based convolutional neural network for modeling sentence pairs. arXiv preprint arXiv:1512.05193 .
 Yu et al. (2014) Lei Yu, Karl Moritz Hermann, Phil Blunsom, and Stephen Pulman. 2014. Deep learning for answer sentence selection. NIPS Deep Learning and Representation Learning Workshop .
 Zeiler (2012) Matthew D Zeiler. 2012. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701 .
 Zeiler and Fergus (2013) Matthew D. Zeiler and Rob Fergus. 2013. Stochastic pooling for regularization of deep convolutional neural networks. CoRR abs/1301.3557. http://dblp.unitrier.de/db/journals/corr/corr1301.html/abs13013557.