Recently, the neural attention mechanism has proven to be an extremely successful technique in a wide range of natural language processing tasks, including machine translation [Bahdanau et al.2014], sentence summarization [Rush et al.2015], question answering [Hermann et al.2015] and text entailment [Rocktäschel et al.2015, Wang and Jiang2015, Cheng et al.2016]. The basic idea is to learn and attend to the most relevant parts of a (potentially preprocessed) sequence while analysing or generating another sequence.
Taking the following two sentences as an example, we highlight the helpful partial alignments between them that attention can provide.
Sentence 1: A toddler sits on a rock chair with fallen leaves.
Sentence 2: A little child sits quietly on a stone bench in autumn.
The sequence-based attention is illustrated in Figure 1(a). The representation is a flat sequence, and the attention distribution is applied to this simple topology. Although the idea is to soft-align words and phrases in the two sentences, one can observe that: 1) the hidden state of each position incorporates its context information implicitly and sequentially, so alignment at the phrase level is challenging (e.g. “autumn” to “fallen leaves”); 2) as we will discuss shortly, the attention is implemented as a weighted sum over the sequence, and thus lacks a linguistic interpretation of semantic composition.
Any well-formed sentence has an underlying syntactic structure: a tree topology that encodes the sentence’s important composing subcomponents. Evidently, this is in stark contrast with the flat, sequential topology that existing models assume.
In this paper we extend the attentional mechanism from a sequence to a tree, allowing syntactic information to be integrated. As shown in Figure 1(b), syntax-based attention allows neural models to capture phrase-level alignment more explicitly. In addition, it clearly reaches a higher level of interpretability. While this observation is general, in this paper we demonstrate its effectiveness in natural language inference. We believe other tasks such as neural machine translation [Bahdanau et al.2014, Luong et al.2015] can similarly benefit from this idea.
The contributions of this paper can be summarized as follows.
We extend sequence-based attention to syntax-based, therefore incorporating richer linguistic properties.
We design and validate our algorithm that makes such topological attentional mechanism possible.
Beyond quantitative measurement, we carefully perform qualitative analysis, and demonstrate why and how the idea works.
Our work can be regarded as an attempt to boost the generalization ability of the attention matching mechanism by encoding prior knowledge (syntax). As an example, our results show that the syntactic structure of sentences and phrases is crucial for semantic matching of text.
2 Neural Attention Model for Natural Language Inference
Natural language inference, also called text entailment, is a task to determine the semantic relationship (entailment, contradiction, or neutral) between two sentences (a premise and a hypothesis). This task is important and is involved in many natural language processing (NLP) problems, such as information extraction, relation extraction, text summarization and machine translation.
To better understand this task, we give an example from the dataset as follows:
Premise: These girls are having a great time looking for seashells.
Hypothesis: The girls are happy.
More precisely, NLI can be framed as a simple three-way classification task, which requires the model to be able to represent and reason with the core phenomena of natural language semantics [Bowman et al.2016].
2.1 Long Short-Term Memory Network
The Long Short-Term Memory network (LSTM) [Hochreiter and Schmidhuber1997] is a type of recurrent neural network (RNN) [Elman1990] that specifically addresses the issue of learning long-term dependencies. The LSTM maintains a memory cell that updates and exposes its content only when deemed necessary.
We define the LSTM units at each time step $t$ to be a collection of vectors in $\mathbb{R}^d$: an input gate $\mathbf{i}_t$, a forget gate $\mathbf{f}_t$, an output gate $\mathbf{o}_t$, a memory cell $\mathbf{c}_t$ and a hidden state $\mathbf{h}_t$, where $d$ is the number of the LSTM units. The elements of the gating vectors $\mathbf{i}_t$, $\mathbf{f}_t$ and $\mathbf{o}_t$ are in $[0,1]$.
The LSTM is precisely specified as follows:
$$
\begin{bmatrix} \tilde{\mathbf{c}}_t \\ \mathbf{o}_t \\ \mathbf{i}_t \\ \mathbf{f}_t \end{bmatrix}
=
\begin{bmatrix} \tanh \\ \sigma \\ \sigma \\ \sigma \end{bmatrix}
T_{A,b} \begin{bmatrix} \mathbf{x}_t \\ \mathbf{h}_{t-1} \end{bmatrix},
$$
where $\mathbf{x}_t$ is the input at the current time step; $T_{A,b}$ is an affine transformation which depends on the parameters of the network, $A$ and $b$; $\sigma$ denotes the logistic sigmoid function and $\odot$ denotes elementwise multiplication.
The update of each LSTM unit can be written precisely as
$$\mathbf{c}_t = \tilde{\mathbf{c}}_t \odot \mathbf{i}_t + \mathbf{c}_{t-1} \odot \mathbf{f}_t, \qquad \mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t).$$
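The LSTM step can be sketched in NumPy as follows; the single stacked weight matrix `W` and bias `b` stand in for the affine transformation, and this particular parameter layout is an assumption for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step: a single affine map produces all four pre-activations."""
    d = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b   # affine transformation T_{A,b}
    i = sigmoid(z[0*d:1*d])                     # input gate, elements in (0, 1)
    f = sigmoid(z[1*d:2*d])                     # forget gate
    o = sigmoid(z[2*d:3*d])                     # output gate
    c_tilde = np.tanh(z[3*d:4*d])               # candidate memory content
    c = c_tilde * i + c_prev * f                # memory cell update
    h = o * np.tanh(c)                          # hidden state
    return h, c
```

The slicing mirrors the stacked form of the equation above: one matrix multiplication yields the pre-activations of all gates at once.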
2.2 Neural Attention Model
Given two sequences $X = x_1, x_2, \dots, x_m$ and $Y = y_1, y_2, \dots, y_n$, we let $\mathbf{x}_i$ denote the embedded representation of the word $x_i$. The standard LSTM has one temporal dimension: at position $i$ of sentence $X$, the output $\mathbf{h}_i$ reflects the meaning of the subsequence $x_{1:i} = x_1, \dots, x_i$.
The main idea of the attention model [Hermann et al.2015] is that the representation of sentence $X$ is obtained dynamically, based on the degree of alignment between the words in sentences $X$ and $Y$. More formally, for sentences $X$ and $Y$, we first compute the hidden states of each sentence with two LSTMs (the model used by [Rocktäschel et al.2015] differs slightly from this for better performance, in that the encoding of one sentence is conditioned on the other):
$$\mathbf{h}^x_i = \mathrm{LSTM}(\mathbf{h}^x_{i-1}, \mathbf{x}_i), \qquad \mathbf{h}^y_t = \mathrm{LSTM}(\mathbf{h}^y_{t-1}, \mathbf{y}_t).$$
While processing sentence $Y$ at time $t$, the model emits an attention vector $\boldsymbol{\alpha}_t$ to weight $\mathbf{h}^x_1, \dots, \mathbf{h}^x_m$, the hidden states of $X$, thereby obtaining a fine-grained representation of sentence $X$ as follows:
$$\mathbf{r}_t = \sum_{i=1}^{m} \alpha_{t,i}\, \mathbf{h}^x_i,$$
where $\alpha_{t,i}$ can be computed as
$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k=1}^{m} \exp(e_{t,k})},$$
where $e_{t,i}$ is an alignment score, obtained by
$$e_{t,i} = \mathbf{w}^\top \tanh\left(\mathbf{W}^x \mathbf{h}^x_i + \mathbf{W}^y \mathbf{h}^y_t\right),$$
where $\mathbf{W}^x$, $\mathbf{W}^y$ and $\mathbf{w}$ are learned parameters.
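The attention step can be sketched in NumPy; the names `W_x`, `W_y`, `w` mirror the learned parameters, and the row-wise stacking of the hidden states is an assumption:

```python
import numpy as np

def attend(H_x, h_y_t, W_x, W_y, w):
    """Score every position of X against the current state of Y, then
    return the attention-weighted summary of X."""
    e = np.tanh(H_x @ W_x.T + h_y_t @ W_y.T) @ w   # alignment scores e_{t,i}
    alpha = np.exp(e - e.max())
    alpha = alpha / alpha.sum()                    # softmax-normalised weights
    r = alpha @ H_x                                # weighted sum of X's states
    return r, alpha
```

Subtracting `e.max()` before exponentiating is the usual numerically stable softmax and does not change the weights.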
Finally, the representation of the sentence pair is constructed from the last attention-weighted representation $\mathbf{r}_n$ and the last output vector $\mathbf{h}^y_n$ as
$$\mathbf{h}^* = \tanh\left(\mathbf{W}^p \mathbf{r}_n + \mathbf{W}^h \mathbf{h}^y_n\right).$$
For the entailment task, the final representation $\mathbf{h}^*$ of the sentence pair is fed into the output layer, generating the probabilities over all pre-defined classes (entailment, contradiction, or neutral):
$$\mathbf{p} = \mathrm{softmax}\left(\mathbf{W}^s \mathbf{h}^* + \mathbf{b}^s\right),$$
where $\mathbf{W}^s$ and $\mathbf{b}^s$ are parameters of the model.
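A minimal sketch of the final combination and output layer (the weight names `W_p`, `W_h`, `W_s`, `b_s` are assumptions mirroring the learned parameters):

```python
import numpy as np

def pair_representation(r_n, h_n, W_p, W_h, W_s, b_s):
    """Combine the last attention-weighted representation with the last
    output vector, then map to three class probabilities."""
    h_star = np.tanh(W_p @ r_n + W_h @ h_n)   # final pair representation
    logits = W_s @ h_star + b_s
    z = np.exp(logits - logits.max())
    return z / z.sum()                        # (entailment, contradiction, neutral)
```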
3 Syntax-Based Attention Matching Model
The building block of this work is a syntax-based, rather than sequence-based, compositional model. There are several such candidates, such as the recursive neural network [Socher et al.2013] and the tree-structured LSTM [Tai et al.2015]. In this paper, we use the latter model for its superior performance in representing sentence meaning.
3.1 Tree-structured LSTM
Unlike the standard LSTM, the tree-structured LSTM composes its state from an input vector and the hidden states of its children units. More formally, the model takes as input a syntactic tree (constituency tree or dependency tree); a composition function is then applied to combine the children nodes according to the syntactic structure, obtaining a new compositional vector for their parent node.
Here we investigate two types of composition functions for constituency and dependency tree respectively.
Composition Function for Constituency Tree
Given a constituency tree induced by a sentence, there are at most $N$ child nodes for each parent node. We refer to $\mathbf{h}_{jk}$ and $\mathbf{c}_{jk}$ as the hidden state and memory cell of the $k$-th child of node $j$. The transition equations of each node $j$ are as follows:
$$
\begin{aligned}
\mathbf{i}_j &= \sigma\Big(\mathbf{W}^{(i)}\mathbf{x}_j + \textstyle\sum_{\ell=1}^{N}\mathbf{U}^{(i)}_{\ell}\mathbf{h}_{j\ell} + \mathbf{b}^{(i)}\Big),\\
\mathbf{f}_{jk} &= \sigma\Big(\mathbf{W}^{(f)}\mathbf{x}_j + \textstyle\sum_{\ell=1}^{N}\mathbf{U}^{(f)}_{k\ell}\mathbf{h}_{j\ell} + \mathbf{b}^{(f)}\Big),\\
\mathbf{o}_j &= \sigma\Big(\mathbf{W}^{(o)}\mathbf{x}_j + \textstyle\sum_{\ell=1}^{N}\mathbf{U}^{(o)}_{\ell}\mathbf{h}_{j\ell} + \mathbf{b}^{(o)}\Big),\\
\mathbf{u}_j &= \tanh\Big(\mathbf{W}^{(u)}\mathbf{x}_j + \textstyle\sum_{\ell=1}^{N}\mathbf{U}^{(u)}_{\ell}\mathbf{h}_{j\ell} + \mathbf{b}^{(u)}\Big),\\
\mathbf{c}_j &= \mathbf{i}_j \odot \mathbf{u}_j + \textstyle\sum_{\ell=1}^{N} \mathbf{f}_{j\ell} \odot \mathbf{c}_{j\ell},\\
\mathbf{h}_j &= \mathbf{o}_j \odot \tanh(\mathbf{c}_j),
\end{aligned}
$$
where $\mathbf{x}_j$ denotes the input vector, which is non-zero if and only if node $j$ is a leaf node; $\sigma$ represents the logistic sigmoid function and $\odot$ denotes element-wise multiplication; $\mathbf{W}^{(i)}$, $\mathbf{W}^{(f)}$, $\mathbf{W}^{(o)}$, $\mathbf{W}^{(u)}$ and $\mathbf{U}^{(\cdot)}_{\ell}$ are the weight matrices which depend on the parameters of the network.
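A simplified binary-branching sketch of this composition in NumPy. The parameter layout is an assumption: one stacked matrix per child (`Ul`, `Ur`), whose per-gate slices play the role of the per-child matrices $\mathbf{U}^{(\cdot)}_{\ell}$; the ordered, position-specific child weights are what distinguish the constituency case:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def constituency_node(x, left, right, P):
    """Binary constituency composition: children are ordered, so each
    child position gets its own weight matrix."""
    hl, cl = left
    hr, cr = right
    d = hl.shape[0]
    s = P["Ul"] @ hl + P["Ur"] @ hr + P["W"] @ x   # stacked pre-activation, size 5d
    i  = sigmoid(s[0*d:1*d])        # input gate
    fl = sigmoid(s[1*d:2*d])        # forget gate for the left child
    fr = sigmoid(s[2*d:3*d])        # forget gate for the right child
    o  = sigmoid(s[3*d:4*d])        # output gate
    u  = np.tanh(s[4*d:5*d])        # candidate composition
    c = i * u + fl * cl + fr * cr   # keep or forget each child's memory
    h = o * np.tanh(c)
    return h, c
```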
Composition Function for Dependency Tree
For the dependency tree, we refer to $C(j)$ as the set of children of node $j$. Then the transition equations of each node $j$ are formulated as:
$$
\begin{aligned}
\tilde{\mathbf{h}}_j &= \textstyle\sum_{k \in C(j)} \mathbf{h}_k,\\
\mathbf{i}_j &= \sigma\big(\mathbf{W}^{(i)}\mathbf{x}_j + \mathbf{U}^{(i)}\tilde{\mathbf{h}}_j + \mathbf{b}^{(i)}\big),\\
\mathbf{f}_{jk} &= \sigma\big(\mathbf{W}^{(f)}\mathbf{x}_j + \mathbf{U}^{(f)}\mathbf{h}_k + \mathbf{b}^{(f)}\big),\\
\mathbf{o}_j &= \sigma\big(\mathbf{W}^{(o)}\mathbf{x}_j + \mathbf{U}^{(o)}\tilde{\mathbf{h}}_j + \mathbf{b}^{(o)}\big),\\
\mathbf{u}_j &= \tanh\big(\mathbf{W}^{(u)}\mathbf{x}_j + \mathbf{U}^{(u)}\tilde{\mathbf{h}}_j + \mathbf{b}^{(u)}\big),
\end{aligned}
$$
where $\mathbf{W}^{(i)}$, $\mathbf{W}^{(f)}$, $\mathbf{W}^{(o)}$, $\mathbf{W}^{(u)}$ and $\mathbf{U}^{(i)}$, $\mathbf{U}^{(f)}$, $\mathbf{U}^{(o)}$, $\mathbf{U}^{(u)}$ are the weight matrices which depend on the parameters of the network.
The update of each unit can be written precisely as
$$\mathbf{c}_j = \mathbf{i}_j \odot \mathbf{u}_j + \textstyle\sum_{k \in C(j)} \mathbf{f}_{jk} \odot \mathbf{c}_k, \qquad \mathbf{h}_j = \mathbf{o}_j \odot \tanh(\mathbf{c}_j).$$
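The child-sum composition can be sketched likewise, following the formulation of [Tai et al.2015]; the parameter names in `P` are assumptions. The unordered child set is summed before gating, while each child still gets its own forget gate computed from that child's own state:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dependency_node(x, children, P):
    """Child-sum composition for a dependency node; `children` is a list
    of (h_k, c_k) pairs, treated as an unordered set."""
    d = P["Wi"].shape[0]
    h_tilde = sum((h for h, _ in children), np.zeros(d))   # sum of child states
    i = sigmoid(P["Wi"] @ x + P["Ui"] @ h_tilde)
    o = sigmoid(P["Wo"] @ x + P["Uo"] @ h_tilde)
    u = np.tanh(P["Wu"] @ x + P["Uu"] @ h_tilde)
    c = i * u
    for h_k, c_k in children:
        f_k = sigmoid(P["Wf"] @ x + P["Uf"] @ h_k)   # per-child forget gate
        c = c + f_k * c_k
    h = o * np.tanh(c)
    return h, c
```

Because the children are summed, the composition is invariant to their order, matching the unordered nature of a dependency node's modifiers.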
3.2 Syntax-Based Attention Matching Model
The second stage of the design is to apply attention to the tree topology. For two trees $T^x$ and $T^y$ induced by sentences $X$ and $Y$, the representations of their subtrees, $\mathbf{h}^x_i$ and $\mathbf{h}^y_t$, can be obtained with the tree-structured LSTMs described above.
At node $t$ of tree $T^y$, we reread over tree $T^x$ and compute a weighted tree representation $\mathbf{r}_t$ of tree $T^x$, which also recursively accumulates information from the children of node $t$:
$$\mathbf{r}_t = \sum_{i=1}^{M} \alpha_{t,i}\,\mathbf{h}^x_i + \tilde{\mathbf{r}}_t,$$
where $M$ denotes the number of nodes of tree $T^x$; $\alpha_{t,i}$ measures the alignment degree between the two subtrees; and $\tilde{\mathbf{r}}_t$ recursively accumulates information from the children of node $t$, computed with the corresponding composition function (position-specific weights over the ordered children for the constituency tree; a sum over the child set for the dependency tree).
The attention $\alpha_{t,i}$ between two subtrees $i$ and $t$ can be computed analogously to the sequence case as
$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k=1}^{M}\exp(e_{t,k})}, \qquad e_{t,i} = \mathbf{w}^\top \tanh\left(\mathbf{W}^x \mathbf{h}^x_i + \mathbf{W}^y \mathbf{h}^y_t\right).$$
The final representation of the two trees $T^x$ and $T^y$ can be obtained from the root node as
$$\mathbf{h}^* = \tanh\left(\mathbf{W}^p \mathbf{r}_N + \mathbf{W}^h \mathbf{h}^y_N\right),$$
where $N$ denotes the number of nodes of tree $T^y$, with the root numbered last.
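A sketch of the tree-over-tree alignment in NumPy, ignoring the recursive child-accumulation term and assuming the node states of each tree are stacked row-wise. Because each row of `H_x` and `H_y` is produced by a tree-LSTM, it represents a whole subtree (phrase), so the resulting matrix encodes phrase-to-phrase alignment:

```python
import numpy as np

def subtree_alignment(H_x, H_y, W_x, W_y, w):
    """Align every node (subtree) of tree Y with every node of tree X;
    returns the |Y| x |X| attention matrix and the per-node weighted
    representation of X."""
    ny, nx = H_y.shape[0], H_x.shape[0]
    A = np.empty((ny, nx))
    for t in range(ny):                                 # reread tree X at each node of Y
        e = np.tanh(H_x @ W_x.T + H_y[t] @ W_y.T) @ w   # subtree alignment scores
        a = np.exp(e - e.max())
        A[t] = a / a.sum()                              # softmax over X's nodes
    R = A @ H_x                                         # weighted X-representation per Y node
    return A, R
```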
To facilitate the description later, we refer to our proposed syntax-based attention model as SAT-LSTMs; dLSTM and cLSTM denote LSTMs built over a dependency tree and a constituency tree respectively.
Given a sentence pair and its label, the output of the neural network is the probabilities of the different classes. The parameters of the network are trained to minimise the cross-entropy of the predicted and true label distributions:
$$J(\theta) = -\sum_{j=1}^{C} l_j \log(p_j),$$
where $\mathbf{l}$ is the one-hot representation of the ground-truth label; $\mathbf{p}$ is the vector of predicted label probabilities; $C$ is the number of classes.
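The objective for one example can be written directly (here with $C = 3$ classes, matching the three-way NLI task):

```python
import numpy as np

def cross_entropy(p, label, C=3):
    """Cross-entropy between a one-hot ground-truth label and the
    predicted class probabilities p."""
    l = np.zeros(C)
    l[label] = 1.0
    return -np.sum(l * np.log(p))
```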
To minimize the objective, we use stochastic gradient descent with the diagonal variant of AdaGrad [Duchi et al.2011]. To prevent exploding gradients, we perform gradient clipping by scaling the gradient when its norm exceeds a threshold [Graves2013].
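One optimizer step combining both ideas can be sketched as follows; the default learning rate matches the value reported in Table 1, the clip threshold is one of the grid-searched values `[5, 10, 50]`, and `eps` is a standard numerical-stability constant (an assumption):

```python
import numpy as np

def adagrad_clip_step(theta, grad, G, lr=0.005, clip=5.0, eps=1e-8):
    """One diagonal-AdaGrad update with gradient-norm clipping."""
    norm = np.linalg.norm(grad)
    if norm > clip:
        grad = grad * (clip / norm)          # rescale, preserving direction
    G = G + grad ** 2                        # per-coordinate squared-gradient history
    theta = theta - lr * grad / (np.sqrt(G) + eps)
    return theta, G
```

The accumulator `G` makes the effective step size shrink per coordinate as that coordinate keeps receiving large gradients, which is the diagonal-AdaGrad behaviour.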
4.1 Initialization and Hyperparameters
For each task, we take the hyperparameters which achieve the best performance on the development set via a small grid search over combinations of the initial learning rate, the regularization strength and the threshold value of the gradient norm [5, 10, 50]. The final hyperparameters are reported in Table 1.
| Hyperparameter | Value |
| --- | --- |
| Hidden layer size | 100 |
| Initial learning rate | 0.005 |
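The grid search described above can be sketched generically; the candidate value lists other than the reported gradient-norm thresholds `[5, 10, 50]` are hypothetical, and `evaluate` stands for training a model and returning its development-set accuracy:

```python
import itertools

# Hypothetical grid mirroring the paper's search dimensions: initial
# learning rate, regularization strength, gradient-norm clip threshold.
grid = {
    "lr": [0.001, 0.005, 0.01],
    "l2": [0.0, 1e-5, 1e-4],
    "clip": [5, 10, 50],
}

def grid_search(evaluate):
    """Return the hyperparameter combination with the best dev accuracy."""
    best, best_acc = None, float("-inf")
    for combo in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        acc = evaluate(params)
        if acc > best_acc:
            best, best_acc = params, acc
    return best, best_acc
```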
We use the Stanford Natural Language Inference (SNLI) corpus [Bowman et al.2015]. This corpus contains 570K sentence pairs, and all of the sentences and labels stem from human annotators. SNLI is two orders of magnitude larger than all other existing RTE corpora, and its massive scale allows us to train powerful neural networks such as the architecture proposed in this paper.
5.1 Data Preparation
5.2 Competitor Methods
Neural bag-of-words (NBOW): each sentence is represented as the sum of the embeddings of the words it contains; the two representations are then concatenated and fed to a multi-layer perceptron (MLP).
LSTM encoders: the two sentences are encoded by separate LSTMs.
Attention LSTM encoders (AT-LSTMs): the sentence pair is encoded with consideration of the word alignment between the two sentences [Rocktäschel et al.2015].
Tree-based CNN encoders: the two sentences are encoded by separate tree-based CNNs [Mou et al.2015].
Tree-based LSTM encoders: the two sentences are encoded by separate tree-based LSTMs.
SPINN-PI encoders: the two sentences are encoded by the stack-augmented parser-interpreter neural network with parsed input, proposed by [Bowman et al.2016].
| Model | Hidden | Train acc. (%) | Dev. acc. (%) | Test acc. (%) |
| --- | --- | --- | --- | --- |
| *Previous non-NN results* | | | | |
| Lexicalized classifier [Bowman et al.2015] | | | | |
| *Previous sentence encoder-based NN results* | | | | |
| LSTM encoders [Bowman et al.2015] | 100 | 84.8 | — | 77.6 |
| Tree-based CNN encoders [Mou et al.2015] | 300 | 83.4 | 82.4 | 82.1 |
| SPINN-PI encoders [Bowman et al.2016] | 300 | 89.2 | — | 83.2 |
| AT-LSTMs encoders [Rocktäschel et al.2015] | 100 | 85.3 | 83.7 | 83.5 |
Table 2 provides a comparison of results on the SNLI dataset. From the table, we can observe the following:
Of the two kinds of syntax-based LSTM encoders, cLSTM achieves better performance than dLSTM, which is consistent with the experimental results of [Gildea2004] on tree-based alignment. We believe the reason is that the constituency-based model can better learn semantic compositionality, and it takes the order of child nodes into consideration.
Irrespective of the attention mechanism, both syntax-based LSTM encoders are superior to the sequence-based LSTM encoder, which indicates the effectiveness of syntax-based composition.
SAT-cLSTMs surpass all the competitor methods and achieve the best performance. More precisely, SAT-cLSTMs outperform AT-LSTMs by 2.1%, and are superior to Tree-LSTM encoders by 3.8%, which suggests the importance of incorporating syntactic information into attention models.
5.4 Experiment Analysis
5.4.1 Analysis of Compositionality and Attention Mechanism
Can our model select useful composition information using the attention mechanism? To answer this question, we sample from the test set several subtree pairs that achieve the best alignment within a sentence pair.
As shown in Figure 3, we can observe the following.
The alignments in these cases are consistent with human understanding. For example, the alignment degree between semantically corresponding subtrees is much higher than between unrelated ones, which is crucial for the final prediction of the two sentences’ relation and indicates the effectiveness of this syntax-based composition.
Our model has learned the alignment between subtrees, meaning that matching patterns at the word-phrase or phrase-phrase level can be captured effectively, not merely at the word-word level.
| person ’s | holding his cup up | wearing a pink dress | having a great time |
| --- | --- | --- | --- |
| people ’s | holding up a white plastic cup | in a pink dress | having a good time |
| belong to the lady | with a cup in his hand | dressed in pink | enjoy time together |
| of a person | with a beer in his hand | wearing a pink dress | is very happy |
| of humans | holds up a playing card | in pink | enjoying a night |
Table 3: Nearest-neighbor phrases drawn from the SNLI test set, based on the cosine similarity of the different representations produced by SAT-LSTMs.
| Model | the boys are bare chested | a golden retriever nurses puppies |
| --- | --- | --- |
| NBOW | the men are naked | a cat nurses puppies |
| | the boys are stretching | a puppy barks at a girl |
| | the boys are sleeping | the dog is a labrador retriever |
| | the boys are sitting down | a golden retriever nurses some other dogs puppies |
| | the man has nothing on his face | a girl is sitting on a park bench holding a puppy |
| AT-LSTMs | a man is outside with no bag on his back | a big dog watching over a smaller dog |
| | his bald head is exposed | the big dog is checking out the smaller dog |
| | a man in summer clothing skiing on thin snow | a gal is holding a stuffed dog |
| | the man is not wearing a shirt | a golden retriever nurses some other dogs puppies |
| SAT-LSTMs | two men are shirtless | three puppies are snuggling with their mother by the fire |
| | the man is completely nude | puppies next to their mother |
| | a man without a shirt is on the water | a mother dog checking up on her baby puppy |
5.4.2 Analysis of Phrases Representations
We compute the representation of each subtree and show some examples sampled from the test set with their most related neighbors in Table 3.
Phrasal paraphrases, such as “having a great time/enjoy time together”, obtain close representations, which helps the identification of the entailment relation between two sentences. Besides, we can see the ability of the model to learn a variety of general paraphrastic transformations, such as the possessive rule “person ’s/of a person” and verb particle shift “holding his cup up/holding up a white plastic cup”.
Some other examples, such as “wearing a pink dress/in a pink dress/dressed in pink”, indicate that our SAT-LSTMs model is more robust to syntactic variations, which is crucial for boosting the generalization ability when encoding a sentence or sentence pair.
5.4.3 Analysis of Learned Sentence Representations
We explore the sentence representations learned by the three different models on SNLI. Table 4 illustrates the nearest neighbors of the sentence representations learned by NBOW, AT-LSTMs, and SAT-LSTMs.
As shown in Table 4, NBOW finds a sentence’s neighbors based heavily on lexical overlap, while the neighbors returned by SAT-LSTMs are mostly meaning-preserving syntactic variations. For example, for the first sentence “the boys are bare chested”, NBOW gives “the men are naked”, most likely based on the word pair “bare/naked”, thereby ignoring the information carried by “chested”. However, the sentences given by SAT-LSTMs convey the same meaning with ample variety of expression, such as “the man is not wearing a shirt” and “a man without a shirt is on the water”, which accurately reflect the meaning of “bare chested”.
Compared with AT-LSTMs, SAT-LSTMs can provide more flexible syntactic expressions. For example, for the sentence “a golden retriever nurses puppies”, SAT-LSTMs capture the syntactic paraphrase “A nurses B/B is snuggling with A”, which is difficult for the NBOW and AT-LSTMs models.
6 Related Work
There has been recent work proposing to incorporate syntactic priors into neural networks. [Socher et al.2012] use a recursive neural network model that learns compositional vector representations for phrases and sentences of arbitrary syntactic type and length. [Tai et al.2015] introduce a generalization of the standard LSTM architecture to tree-structured networks. [Bowman et al.2016] propose a stack-augmented parser-interpreter neural network for sentence encoding, which combines parsing and interpretation within a single tree-sequence hybrid model. These models are designed for representing a sentence in a more plausible way, while we aim to model the strong interaction of two sentences over tree structures.
More recently, several works have tried to incorporate priors into attention-based models. [Cohn et al.2016] extend the attentional neural translation model to include structural biases from word-based alignment models. [Gu et al.2016] incorporate a copying mechanism into the attention-based model to address the OOV problem in a more systemic way for machine translation. Unlike these models, we augment the attention model with syntactic priors for semantic matching.
Another thread of work is sequential attention models for natural language inference. [Rocktäschel et al.2015] propose to use an attention model for sentence-pair encoding. [Wang and Jiang2015] extend this model by paying more attention to important word-level matching results. Compared with these models, we integrate syntactic structure into the attention matching model, which can match two trees in a plausible way.
Natural language has an underlying syntactic structure, which makes it feasible to assign attention to tree-structured topologies instead of a flat sequence. Although we use the idea only in the context of natural language inference, the syntax-based attention model can easily be transferred to other tasks that require phrase-level alignment, such as neural machine translation. While preparing this submission, we found [Eriguchi et al.2016], which proposed a tree-to-sequence attention-based model for neural machine translation, likewise showing the effectiveness of the syntax-based attention mechanism. The major difference is that their model is based on word-to-word and word-to-phrase attention (sequence conditioned on tree), whereas our proposed model focuses on phrase-to-phrase attention (tree over tree).
In this paper, we integrate syntactic structure into the attention model. Compared with sequence-based attention models, our model can easily capture phrase-level alignment. Experiments on the Stanford Natural Language Inference corpus demonstrate the efficacy of our proposed model and its superiority to competitor models. Furthermore, we have carried out an elaborate experimental design and case analysis to evaluate the effectiveness of our syntax-based matching model and to explain why attention over trees is a good idea.
In future work, we wish to use our SAT-LSTMs matching model to learn representations of phrasal [Wieting et al.2015] or syntactic paraphrases from massive paraphrase datasets such as PPDB [Ganitkevitch et al.2013]. We expect that the learned subtree representations, with their rich prior knowledge, will be useful for downstream tasks in a pre-trained manner.
- [Bahdanau et al.2014] D. Bahdanau, K. Cho, and Y. Bengio. 2014. Neural machine translation by jointly learning to align and translate. ArXiv e-prints, September.
- [Bowman et al.2015] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.
- [Bowman et al.2016] Samuel R Bowman, Jon Gauthier, Abhinav Rastogi, Raghav Gupta, Christopher D Manning, and Christopher Potts. 2016. A fast unified model for parsing and sentence understanding. arXiv preprint arXiv:1603.06021.
- [Chen and Manning2014] Danqi Chen and Christopher D Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750.
- [Cheng et al.2016] Jianpeng Cheng, Li Dong, and Mirella Lapata. 2016. Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733.
- [Cohn et al.2016] Trevor Cohn, Cong Duy Vu Hoang, Ekaterina Vymolova, Kaisheng Yao, Chris Dyer, and Gholamreza Haffari. 2016. Incorporating structural alignment biases into an attentional neural translation model. arXiv preprint arXiv:1601.01085.
- [Duchi et al.2011] John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159.
- [Elman1990] Jeffrey L Elman. 1990. Finding structure in time. Cognitive science, 14(2):179–211.
- [Eriguchi et al.2016] Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2016. Tree-to-sequence attentional neural machine translation. CoRR, abs/1603.06075.
- [Ganitkevitch et al.2013] Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The paraphrase database. In HLT-NAACL, pages 758–764.
- [Gildea2004] Daniel Gildea. 2004. Dependencies vs. constituents for tree-based alignment. In EMNLP, pages 214–221. Citeseer.
- [Graves2013] Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
- [Gu et al.2016] Jiatao Gu, Zhengdong Lu, Hang Li, and Victor OK Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393.
- [Hermann et al.2015] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1684–1692.
- [Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
- [Jozefowicz et al.2015] Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. 2015. An empirical exploration of recurrent network architectures. In Proceedings of The 32nd International Conference on Machine Learning.
- [Klein and Manning2003] Dan Klein and Christopher D Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1, pages 423–430.
- [Luong et al.2015] Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal, September. Association for Computational Linguistics.
- [Mou et al.2015] Lili Mou, Men Rui, Ge Li, Yan Xu, Lu Zhang, Rui Yan, and Zhi Jin. 2015. Recognizing entailment and contradiction by tree-based convolution. arXiv preprint arXiv:1512.08422.
- [Pennington et al.2014] Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. Proceedings of the Empiricial Methods in Natural Language Processing (EMNLP 2014), 12:1532–1543.
- [Rocktäschel et al.2015] Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiskỳ, and Phil Blunsom. 2015. Reasoning about entailment with neural attention. arXiv preprint arXiv:1509.06664.
- [Rush et al.2015] Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389, Lisbon, Portugal, September.
- [Saxe et al.2013] Andrew M Saxe, James L McClelland, and Surya Ganguli. 2013. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120.
- [Socher et al.2012] Richard Socher, Brody Huval, Christopher D Manning, and Andrew Y Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of EMNLP, pages 1201–1211.
- [Socher et al.2013] Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP.
- [Sutskever et al.2011] Ilya Sutskever, James Martens, and Geoffrey E Hinton. 2011. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024.
- [Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc VV Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.
- [Tai et al.2015] Kai Sheng Tai, Richard Socher, and Christopher D Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075.
- [Wang and Jiang2015] Shuohang Wang and Jing Jiang. 2015. Learning natural language inference with lstm. arXiv preprint arXiv:1512.08849.
- [Wieting et al.2015] John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2015. Towards universal paraphrastic sentence embeddings. arXiv preprint arXiv:1511.08198.