Syntax-based Attention Model for Natural Language Inference

07/22/2016 ∙ by Pengfei Liu, et al. ∙ Fudan University

Introducing attention mechanisms into neural networks is a powerful concept that has achieved impressive results in many natural language processing tasks. However, most existing models impose the attention distribution on a flat topology, namely the entire input representation sequence. Any well-formed sentence has an accompanying syntactic tree structure, which is a much richer topology. Applying attention to such a topology not only exploits the underlying syntax, but also makes attention more interpretable. In this paper, we explore this direction in the context of natural language inference. The results demonstrate its efficacy. We also perform extensive qualitative analysis, deriving insights and intuitions about why and how our model works.




1 Introduction

Recently, neural attention mechanisms have proven to be an extremely successful technique in a wide range of natural language processing tasks, including machine translation [Bahdanau et al.2014], sentence summarization [Rush et al.2015], question answering [Hermann et al.2015] and text entailment [Rocktäschel et al.2015, Wang and Jiang2015, Cheng et al.2016]. The basic idea is to learn to attend to the most relevant parts of a (potentially preprocessed) sequence while analysing or generating another sequence.

Figure 1: A motivating example illustrating the sequence-based and syntax-based attention models for the target word “autumn”. The square boxes represent hidden states of the words (or phrases); darker boxes indicate higher alignment.

Take the following two sentences as an example, in which attention highlights the parts of the first sentence that are helpful for aligning with the second.
Sentence 1: A toddler sits on a rock chair with fallen leaves.
Sentence 2: A little child sits quietly on a stone bench in autumn.

The sequence-based attention is illustrated in Figure 1(a). The representation is a flat sequence, and the attention distribution is applied to this simple topology. Although the idea is to soft-align words and phrases in the two sentences, one can observe that: 1) the hidden state at each position incorporates its context only implicitly and sequentially, so alignment at the phrase level is challenging (e.g. “autumn” to “fallen leaves”); 2) as we will discuss shortly, the attention is implemented as a weighted sum over the sequence, and thus lacks a linguistic interpretation of its semantic composition.

Every well-formed sentence has an underlying syntactic structure: a tree topology that encodes the sentence’s important subcomponents. This is in stark contrast with the flat, sequential topology that existing models assume.

In this paper we extend the attention mechanism from a sequence to a tree, allowing syntactic information to be integrated. As shown in Figure 1(b), syntax-based attention allows neural models to capture phrase-level alignment more explicitly. In addition, it reaches a higher level of interpretability. While this observation is general, in this paper we demonstrate its effectiveness in natural language inference. We believe other tasks, such as neural machine translation [Bahdanau et al.2014, Luong et al.2015], can similarly benefit from this idea.

The contributions of this paper can be summarized as follows.

  1. We extend sequence-based attention to syntax-based attention, thereby incorporating richer linguistic properties.

  2. We design and validate an algorithm that makes such a topological attention mechanism possible.

  3. Beyond quantitative measurement, we carefully perform qualitative analysis, and demonstrate why and how the idea works.

  4. Our work can be regarded as an attempt to boost the generalization ability of the attention matching mechanism by encoding prior knowledge (syntax). As an example, our results show that the syntactic structure of a sentence or phrase is crucial for text semantic matching.

2 Neural Attention Model for Natural Language Inference

Natural language inference (NLI), also called text entailment, is the task of determining the semantic relationship (entailment, contradiction, or neutral) between two sentences (a premise and a hypothesis). This task is important and is involved in many natural language processing (NLP) problems, such as information extraction, relation extraction, text summarization and machine translation.

To better understand this task, we give an example from the dataset:

Premise: These girls are having a great time looking for seashells.

Hypothesis: The girls are happy.

Label: entailment

More precisely, NLI can be framed as a simple three-way classification task, which requires the model to be able to represent and reason with the core phenomena of natural language semantics [Bowman et al.2016].

2.1 Long Short-Term Memory Network

The long short-term memory neural network (LSTM) [Hochreiter and Schmidhuber1997] is a type of recurrent neural network (RNN) [Elman1990] that specifically addresses the issue of learning long-term dependencies. The LSTM maintains a memory cell that updates and exposes its content only when deemed necessary.

While there are numerous LSTM variants, here we use the LSTM architecture used by [Jozefowicz et al.2015], which is similar to the one in [Graves2013] but without peep-hole connections.

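We define the LSTM units at each time step $t$ to be a collection of vectors in $\mathbb{R}^d$: an input gate $\mathbf{i}_t$, a forget gate $\mathbf{f}_t$, an output gate $\mathbf{o}_t$, a memory cell $\mathbf{c}_t$ and a hidden state $\mathbf{h}_t$. $d$ is the number of LSTM units. The elements of the gating vectors $\mathbf{i}_t$, $\mathbf{f}_t$ and $\mathbf{o}_t$ are in $[0, 1]$.

The LSTM is precisely specified as follows.

$$\begin{bmatrix} \tilde{\mathbf{c}}_t \\ \mathbf{o}_t \\ \mathbf{i}_t \\ \mathbf{f}_t \end{bmatrix} = \begin{bmatrix} \tanh \\ \sigma \\ \sigma \\ \sigma \end{bmatrix} T_{\mathbf{A},\mathbf{b}} \begin{bmatrix} \mathbf{x}_t \\ \mathbf{h}_{t-1} \end{bmatrix}, \quad (1)$$

$$\mathbf{c}_t = \tilde{\mathbf{c}}_t \odot \mathbf{i}_t + \mathbf{c}_{t-1} \odot \mathbf{f}_t, \quad (2)$$

$$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t), \quad (3)$$

where $\mathbf{x}_t$ is the input at the current time step; $T_{\mathbf{A},\mathbf{b}}$ is an affine transformation which depends on the parameters $\mathbf{A}$ and $\mathbf{b}$ of the network; $\sigma$ denotes the logistic sigmoid function and $\odot$ denotes elementwise multiplication.

The update of each LSTM unit can be written precisely as

$$\mathbf{h}_t = \mathrm{LSTM}(\mathbf{h}_{t-1}, \mathbf{x}_t, \theta). \quad (4)$$

Here, the function $\mathrm{LSTM}(\cdot, \cdot, \cdot)$ is a shorthand for Eq. (1-3), and $\theta$ represents all the parameters of the LSTM.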
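To make the update concrete, the following is a minimal plain-Python sketch of one LSTM step without peephole connections. It is an illustration under stated assumptions, not the authors' implementation: the function names (`lstm_step`, `encode`, `affine`) and the row ordering of the affine map (candidate, output gate, input gate, forget gate) are choices made here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(A, b, v):
    """Return A v + b, with A given as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) + bias for row, bias in zip(A, b)]

def lstm_step(x_t, h_prev, c_prev, A, b):
    """One LSTM update (no peephole connections).

    A has 4*d rows and len(x_t) + d columns. Assumed row order:
    candidate cell, output gate, input gate, forget gate.
    """
    d = len(h_prev)
    z = affine(A, b, x_t + h_prev)              # affine map of [x_t; h_{t-1}]
    cand = [math.tanh(v) for v in z[0:d]]       # candidate memory content
    o = [sigmoid(v) for v in z[d:2 * d]]        # output gate
    i = [sigmoid(v) for v in z[2 * d:3 * d]]    # input gate
    f = [sigmoid(v) for v in z[3 * d:4 * d]]    # forget gate
    c = [ct * it + cp * ft for ct, it, cp, ft in zip(cand, i, c_prev, f)]
    h = [ot * math.tanh(cv) for ot, cv in zip(o, c)]
    return h, c

def encode(xs, A, b, d):
    """Run the LSTM over a sequence of input vectors; return all hidden states."""
    h, c = [0.0] * d, [0.0] * d
    states = []
    for x in xs:
        h, c = lstm_step(x, h, c, A, b)
        states.append(h)
    return states
```

The last element returned by `encode` plays the role of the fixed-size sentence vector; the earlier elements are the per-position hidden states that an attention model would later weight.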

The LSTM can map an input sequence of arbitrary length to a fixed-size vector, and has been successfully applied to a wide range of NLP tasks, such as machine translation [Sutskever et al.2014], language modelling [Sutskever et al.2011] and natural language inference [Rocktäschel et al.2015].

2.2 Neural Attention Model

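Given two sequences $X = x_1, x_2, \dots, x_n$ and $Y = y_1, y_2, \dots, y_m$, we let $\mathbf{x}_i \in \mathbb{R}^d$ denote the embedded representation of the word $x_i$. The standard LSTM has one temporal dimension: at position $i$ of sentence $X$, the output $\mathbf{h}_i$ reflects the meaning of the subsequence $x_1, \dots, x_i$.

The main idea of the attention model [Hermann et al.2015] is that the representation of a sentence is obtained dynamically, based on the degree of alignment between the words of the two sentences $X$ and $Y$. More formally, we first compute the hidden states of each sentence with two LSTMs:

$$\mathbf{h}^X_i = \mathrm{LSTM}(\mathbf{h}^X_{i-1}, \mathbf{x}_i, \theta_X), \quad (5)$$

$$\mathbf{h}^Y_j = \mathrm{LSTM}(\mathbf{h}^Y_{j-1}, \mathbf{y}_j, \theta_Y). \quad (6)$$

(The model used by [Rocktäschel et al.2015] differs slightly from this for better performance: there, the encoding of one sentence is conditioned on the other.)

While processing sentence $Y$ at time $j$, the model emits an attention vector $\alpha_j$ that weights $\mathbf{h}^X_1, \dots, \mathbf{h}^X_n$, the hidden states of $X$, thereby obtaining a fine-grained representation $\mathbf{r}_j$ of sentence $X$. Each attention weight $\alpha_{ij}$ is computed from an alignment score $e_{ij}$ between $\mathbf{h}^X_i$ and $\mathbf{h}^Y_j$, and the alignment score is in turn obtained from the two hidden states with learned parameters.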
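The attention step can be sketched in a few lines of plain Python. This is a hedged illustration rather than the paper's exact formulation: it assumes an additive alignment score of the form u · tanh(W_x h_x + W_y h_y), normalises the scores with a softmax, and returns the plain weighted sum of the hidden states. The names `alignment_score`, `attend`, `W_x`, `W_y` and `u` are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def alignment_score(h_x, h_y, W_x, W_y, u):
    """Assumed additive score: u . tanh(W_x h_x + W_y h_y)."""
    a = matvec(W_x, h_x)
    b = matvec(W_y, h_y)
    return sum(ui * math.tanh(ai + bi) for ui, ai, bi in zip(u, a, b))

def attend(H_x, h_y, W_x, W_y, u):
    """Weight the hidden states H_x of one sentence by their alignment with h_y.

    Returns the attention weights and the attention-weighted representation.
    """
    scores = [alignment_score(h, h_y, W_x, W_y, u) for h in H_x]
    alpha = softmax(scores)
    dim = len(H_x[0])
    r = [sum(a * h[k] for a, h in zip(alpha, H_x)) for k in range(dim)]
    return alpha, r
```

Because the weights come from a softmax, they are positive and sum to one, so the returned representation is a convex combination of the hidden states it attends over.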

Finally, the representation of the sentence pair is constructed from the last attention-weighted representation and the last output vector.

Figure 2: Two matching frameworks: the sequence-based attention model and the syntax-based attention model. Each box represents the hidden state of a node, and the bold yellow box marks the node of the sentence that is currently being processed. A darker blue box indicates a higher alignment score between the corresponding node and the current node.

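For the entailment task, the final representation $\mathbf{h}^*$ of the sentence pair is fed into the output layer, which generates the probabilities $\hat{\mathbf{l}}$ over all pre-defined classes (entailment, contradiction, or neutral):

$$\hat{\mathbf{l}} = \mathrm{softmax}(\mathbf{W}_o \mathbf{h}^* + \mathbf{b}_o), \quad (11)$$

where $\mathbf{W}_o$ and $\mathbf{b}_o$ are parameters of the model.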

3 Syntax-Based Attention Matching Model

The building block of this work is a syntax-based, rather than sequence-based, compositional model. There are several candidates, such as the recursive neural network [Socher et al.2013] and the tree-structured LSTM [Tai et al.2015]. In this paper, we use the latter for its superior performance in representing sentence meaning.

3.1 Tree-structured LSTM

Unlike the standard LSTM, the tree-structured LSTM composes its state from an input vector and the hidden states of its child units. More formally, the model takes a syntactic tree (constituency tree or dependency tree) as input, and a composition function combines the child nodes according to the syntactic structure to obtain a new compositional vector for their parent node.

Here we investigate two types of composition functions, for constituency and dependency trees respectively.

Composition Function for Constituency Tree

Given a constituency tree induced by a sentence, each parent node has at most $N$ child nodes. We refer to $\mathbf{h}_{jk}$ and $\mathbf{c}_{jk}$ as the hidden state and memory cell of the $k$-th child of node $j$. The transition equations of each node $j$ are given by Eq. (12-16), where $\mathbf{x}_j$ denotes the input vector, which is non-zero if and only if node $j$ is a leaf; $\sigma$ represents the logistic sigmoid function and $\odot$ denotes element-wise multiplication. The weight matrices depend on the parameters of the network.

Composition Function for Dependency Tree

For the dependency tree, we refer to $C(j)$ as the set of children of node $j$. The transition equations of each node are then given by Eq. (17-20), whose weight matrices depend on the parameters of the network.

The update of each unit can be written compactly as a single function of the node's input vector and the states of its children. This function is a shorthand for Eq. (12-16) for the constituency tree or Eq. (17-20) for the dependency tree.
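As an illustration of tree-structured composition, the sketch below implements the Child-Sum Tree-LSTM unit in the form given by [Tai et al.2015], which suits dependency trees because it handles any number of unordered children. It is not necessarily identical to the equations used in this paper. The parameter layout (`params[name] = (W, U, b)` for the gates `i`, `f`, `o` and the candidate `u`) and the tuple encoding of trees are assumptions made here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def add(*vectors):
    return [sum(parts) for parts in zip(*vectors)]

def gate(params, name, x, h, activation):
    """activation(W x + U h + b) for the named gate."""
    W, U, b = params[name]
    return [activation(v) for v in add(matvec(W, x), matvec(U, h), b)]

def tree_lstm_node(x, children, params):
    """Compose a parent state from its input x and a list of child (h, c) pairs."""
    d = len(params["i"][2])
    h_sum = add(*[h for h, _ in children]) if children else [0.0] * d
    i = gate(params, "i", x, h_sum, sigmoid)
    o = gate(params, "o", x, h_sum, sigmoid)
    u = gate(params, "u", x, h_sum, math.tanh)
    c = [iv * uv for iv, uv in zip(i, u)]
    for h_k, c_k in children:
        # One forget gate per child, conditioned on that child's hidden state.
        f_k = gate(params, "f", x, h_k, sigmoid)
        c = [cv + fv * ck for cv, fv, ck in zip(c, f_k, c_k)]
    h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
    return h, c

def encode_tree(node, params):
    """node is (x, [child nodes]); returns (h, c) of the subtree root."""
    x, kids = node
    return tree_lstm_node(x, [encode_tree(k, params) for k in kids], params)
```

Calling `encode_tree` on the root visits the tree bottom-up, so every subtree receives its own hidden state. These per-subtree states are what a syntax-based attention model can align.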

3.2 Syntax-Based Attention Matching Model

The second stage of the design is to apply attention to the tree topology. For two trees $T_X$ and $T_Y$ induced by sentences $X$ and $Y$, the representations of their subtrees are obtained with the tree-structured LSTM described above.

At each node $j$ of tree $T_Y$, we reread tree $T_X$ and compute a weighted representation of $T_X$, which also recursively accumulates information from the children of node $j$. The weighting ranges over all nodes of $T_X$; each attention weight measures the degree of alignment between two subtrees, one from each tree; and the accumulated term is defined separately for the constituency tree and for the dependency tree, in line with the two composition functions.

The attention between two subtrees is computed from their representations. The final representation of the two trees $T_X$ and $T_Y$ is then built from these attention-weighted representations.
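The subtree-to-subtree weighting can be illustrated with a small plain-Python sketch. It is deliberately simplified and rests on several assumptions: subtree representations are passed in as ready-made vectors, a dot product stands in for the learned alignment score, and the recursive accumulation over children is omitted. The toy vectors in the usage note are hand-made for illustration and are not model outputs.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def subtree_alignments(nodes_x, nodes_y, score=dot):
    """Align every subtree of one tree with every subtree of the other.

    nodes_x, nodes_y: dicts mapping a subtree label to its vector.
    Returns {label_y: {label_x: weight}}, each inner dict summing to 1.
    """
    labels_x = list(nodes_x)
    table = {}
    for label_y, h_y in nodes_y.items():
        alpha = softmax([score(nodes_x[lx], h_y) for lx in labels_x])
        table[label_y] = dict(zip(labels_x, alpha))
    return table

def best_alignment(table, label_y):
    """Label of the subtree that receives the highest attention weight."""
    row = table[label_y]
    return max(row, key=row.get)
```

With the hand-made vectors `{"fallen leaves": [1.0, 0.0], "a rock chair": [0.0, 1.0]}` for one tree and `{"autumn": [2.0, 0.0], "a stone bench": [0.0, 2.0]}` for the other, `best_alignment` pairs "autumn" with "fallen leaves", which is the kind of word-to-phrase match the motivating example describes.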

For ease of description, we refer to our proposed syntax-based attention model as SAT-LSTMs. dLSTM and cLSTM denote LSTMs built over dependency and constituency trees, respectively.

4 Training

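Given a sentence pair $(X, Y)$ and its label $l$, the output $\hat{\mathbf{l}}$ of the neural network is the vector of probabilities of the different classes. The parameters of the network are trained to minimise the cross-entropy between the predicted and true label distributions:

$$L(X, Y; \mathbf{l}, \hat{\mathbf{l}}) = -\sum_{j=1}^{C} \mathbf{l}_j \log(\hat{\mathbf{l}}_j),$$

where $\mathbf{l}$ is the one-hot representation of the ground-truth label $l$; $\hat{\mathbf{l}}$ is the vector of predicted label probabilities; and $C$ is the number of classes.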
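The training objective can be sketched as follows: a softmax output layer over the three classes, followed by cross-entropy against a one-hot label. The helper names (`output_probs`, `one_hot`, `cross_entropy`) and the label ordering are assumptions made for this illustration.

```python
import math

LABELS = ("entailment", "contradiction", "neutral")

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def output_probs(h, W, b):
    """Softmax output layer: class probabilities from a pair representation h."""
    logits = [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]
    return softmax(logits)

def one_hot(label):
    return [1.0 if name == label else 0.0 for name in LABELS]

def cross_entropy(gold, probs):
    """-sum_j gold_j * log(probs_j); terms with gold_j == 0 contribute nothing."""
    return -sum(g * math.log(p) for g, p in zip(gold, probs) if g > 0.0)
```

With all-zero weights the layer predicts a uniform distribution, so the loss equals log 3, the value to expect from an untrained three-way classifier.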

To minimize the objective, we use stochastic gradient descent with the diagonal variant of AdaGrad [Duchi et al.2011]. To prevent exploding gradients, we perform gradient clipping by scaling the gradient when the norm exceeds a threshold.
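A minimal sketch of these two optimisation steps, written for a flat list of parameters: the gradient is rescaled when its L2 norm exceeds the threshold, and the diagonal AdaGrad rule divides each coordinate's step by the root of its accumulated squared gradients. The function names and the small constant `eps` are assumptions of this sketch.

```python
import math

def clip_by_norm(grad, threshold):
    """Rescale grad so that its L2 norm does not exceed threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grad]
    return list(grad)

def adagrad_step(params, grad, accum, lr, eps=1e-8):
    """In-place diagonal AdaGrad update.

    accum holds the running sum of squared gradients, one entry per parameter.
    """
    for k, g in enumerate(grad):
        accum[k] += g * g
        params[k] -= lr * g / (math.sqrt(accum[k]) + eps)
```

In a training loop the clipped gradient would be passed to `adagrad_step`, for example `adagrad_step(params, clip_by_norm(grad, 5.0), accum, 0.005)`, using a threshold from the searched range and the initial learning rate listed in Table 1.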


4.1 Initialization and Hyperparameters

Orthogonal Initialization

We use orthogonal initialization for our LSTMs, which allows neurons to react to diverse patterns and is helpful for training multi-layer networks [Saxe et al.2013].
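One simple way to obtain an orthogonal weight matrix is to orthonormalise random Gaussian rows with Gram-Schmidt, as sketched below. This illustrates the idea only. The paper does not state which procedure was used, and practical implementations usually rely on a QR or SVD routine from a numerical library.

```python
import math
import random

def orthogonal_matrix(n, rng):
    """Return an n x n matrix (list of rows) with orthonormal rows."""
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Remove the components along the rows accepted so far.
        for q in rows:
            proj = sum(a * b for a, b in zip(v, q))
            v = [a - proj * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:  # skip the rare, nearly dependent draw
            rows.append([a / norm for a in v])
    return rows
```

For example, `orthogonal_matrix(100, random.Random(0))` yields a matrix matching the hidden size of 100 in Table 1; multiplying a vector by it preserves the vector's norm, which is the property that motivates this initialization.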

Unsupervised Initialization

The word embeddings for all of the models are initialized with the 100d GloVe vectors (840B token version, [Pennington et al.2014]). The other parameters are initialized by randomly sampling from a uniform distribution.



For each task, we take the hyperparameters that achieve the best performance on the development set via a small grid search over combinations of the initial learning rate, the regularization strength, and the gradient-norm threshold [5, 10, 50]. The final hyper-parameters are reported in Table 1.

Hyper-parameter | SNLI
Embedding size | 100
Hidden layer size | 100
Initial learning rate | 0.005

Table 1: Hyper-parameters for our model on SNLI.

5 Experiment

We use the Stanford Natural Language Inference Corpus (SNLI) [Bowman et al.2015]. This corpus contains 570K sentence pairs, and all of the sentences and labels stem from human annotators. SNLI is two orders of magnitude larger than all other existing RTE corpora. Its massive scale therefore allows us to train powerful neural networks such as the architecture proposed in this paper.

5.1 Data Preparation

We parse the sentences in the dataset for our tree-structured LSTMs. More specifically, for the dependency Tree-LSTMs, we produce dependency parses [Chen and Manning2014] of each sentence; for the constituency Tree-LSTMs, the trees are produced by a binarized constituency parser [Klein and Manning2003].

5.2 Competitor Methods

  • Neural bag-of-words (NBOW): Each sequence is represented as the sum of the embeddings of the words it contains; the two sentence vectors are then concatenated and fed to a multi-layer perceptron (MLP).

  • LSTM encoders: The two sentences are encoded separately by LSTMs.

  • Attention LSTM encoders (AT-LSTMs): The two sentences are encoded taking into account the alignment of words between them [Rocktäschel et al.2015].

  • Tree-based CNN encoders: The two sentences are encoded separately by tree-based CNNs [Mou et al.2015].

  • Tree-based LSTM encoders: The two sentences are encoded separately by tree-based LSTMs.

  • SPINN-PI encoder: The two sentences are encoded separately by the stack-augmented parser-interpreter neural network with parsed input, proposed by [Bowman et al.2016].
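As a point of reference for the simplest baseline in the list above, the NBOW input features can be sketched as below. The MLP classifier on top is omitted, and treating out-of-vocabulary words as zero vectors is an assumption of this sketch. The names `nbow` and `pair_features` are hypothetical.

```python
def nbow(tokens, embeddings, dim):
    """Sum of word embeddings; unknown words contribute a zero vector (assumption)."""
    vec = [0.0] * dim
    for tok in tokens:
        for k, value in enumerate(embeddings.get(tok, [0.0] * dim)):
            vec[k] += value
    return vec

def pair_features(premise, hypothesis, embeddings, dim):
    """Concatenate the two sentence vectors; this is the input to the MLP."""
    return nbow(premise, embeddings, dim) + nbow(hypothesis, embeddings, dim)
```

Because summation ignores word order and syntax, this baseline can only match sentences through their vocabulary, which is the behaviour the nearest-neighbor analysis later attributes to NBOW.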

Model | Hidden | Train acc. (%) | Dev. acc. (%) | Test acc. (%)
Previous non-NN results
Lexicalized classifier [Bowman et al.2015] | - | 99.7 | - | 78.2
Previous sentence encoder-based NN results
LSTM encoders [Bowman et al.2015] | 100 | 84.8 | - | 77.6
Tree-based CNN encoders [Mou et al.2015] | 300 | 83.4 | 82.4 | 82.1
SPINN-PI encoders [Bowman et al.2016] | 300 | 89.2 | - | 83.2
AT-LSTMs encoders [Rocktäschel et al.2015] | 100 | 85.3 | 83.7 | 83.5
Our results
Tree-dLSTM encoders | 100 | 83.5 | 77.1 | 78.7
Tree-cLSTM encoders | 100 | 82.2 | 79.8 | 80.3
AT-LSTMs encoders | 100 | 84.2 | 82.7 | 82.0
SAT-dLSTMs | 100 | 86.6 | 83.8 | 83.4
SAT-cLSTMs | 100 | 87.9 | 85.0 | 84.1

Table 2: Results of our proposed models against other neural models on the SNLI corpus. Hidden is the number of neurons in the hidden state. Train, Dev. and Test denote the classification accuracy on the respective split. SAT-LSTMs denote our proposed syntax-based attention model. dLSTM and cLSTM denote LSTMs built over dependency and constituency trees, respectively.

5.3 Results

Table 2 provides a comparison of results on SNLI dataset. From the table, we can observe that:

  • Of the two syntax-based LSTM encoders, cLSTM performs better than dLSTM, which is consistent with the experimental results of [Gildea2004] on tree-based alignment. We think the reason is that the constituency-based model can better learn semantic compositionality and takes the order of child nodes into consideration.

  • Irrespective of the attention mechanism, both syntax-based LSTM encoders (78.7% and 80.3% test accuracy) are superior to the sequence-based LSTM encoder (77.6%), which indicates the effectiveness of syntax-based composition.

  • SAT-cLSTMs surpass all the competitor methods and achieve the best performance. More precisely, SAT-cLSTMs outperform our AT-LSTMs by 2.1 percentage points of test accuracy (84.1% vs. 82.0%) and the Tree-cLSTM encoders by 3.8 points (84.1% vs. 80.3%), which suggests the importance of incorporating syntactic information into attention models.

Figure 3: Visualization of syntax-based alignments over two subtrees. The numbers along the dotted lines represent the alignment scores.

5.4 Experiment Analysis

5.4.1 Analysis of Compositionality and Attention Mechanism

Can our model select useful composition information using the attention mechanism? To answer this question, we sample several subtree pairs from the test set that achieve the best alignment for a sentence pair.

As shown in Figure 3, we can observe that:

  • The alignments in these cases are consistent with human understanding. The best-matching subtree pairs receive much higher alignment scores than the other pairs, which is crucial for the final prediction of the two sentences’ relation and indicates the effectiveness of this syntax-based composition.

  • Our model has learned the alignment between subtrees, meaning that matching patterns can be captured effectively at the word-phrase or phrase-phrase level, not merely at the word-word level.

person ’s | holding his cup up | wearing a pink dress | having a great time
people ’s | holding up a white plastic cup | in a pink dress | having a good time
belong to the lady | with a cup in his hand | dressed in pink | enjoy time together
of a person | with a beer in his hand | wearing a pink dress | is very happy
of humans | holds up a playing card | in pink | enjoying a night

Table 3: Nearest neighbor phrases drawn from the SNLI test set, based on cosine similarity of the representations produced by SAT-LSTMs.

Model | the boys are bare chested | a golden retriever nurses puppies
NBOW | the men are naked | a cat nurses puppies
NBOW | the boys are stretching | a puppy barks at a girl
NBOW | the boys are sleeping | the dog is a labrador retriever
NBOW | the boys are sitting down | a golden retriever nurses some other dogs puppies
AT-LSTMs | the man has nothing on his face | a girl is sitting on a park bench holding a puppy
AT-LSTMs | a man is outside with no bag on his back | a big dog watching over a smaller dog
AT-LSTMs | his bald head is exposed | the big dog is checking out the smaller dog
AT-LSTMs | a man in summer clothing skiing on thin snow | a gal is holding a stuffed dog
SAT-LSTMs | the man is not wearing a shirt | a golden retriever nurses some other dogs puppies
SAT-LSTMs | two men are shirtless | three puppies are snuggling with their mother by the fire
SAT-LSTMs | the man is completely nude | puppies next to their mother
SAT-LSTMs | a man without a shirt is on the water | a mother dog checking up on her baby puppy

Table 4: Nearest neighbor sentences drawn from the SNLI test set, based on cosine similarity of the representations produced by NBOW, AT-LSTMs and SAT-LSTMs. The header row gives the two query sentences.

5.4.2 Analysis of Phrases Representations

We compute the representations of each subtree and show some examples sampled from test dataset with their most related neighbors in Table 3.
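The neighbor lookup behind this kind of analysis can be sketched as a cosine-similarity ranking over stored representations. The function names and the dictionary layout are assumptions of this sketch, and any vectors passed in would come from the trained model rather than from this code.

```python
import math

def cosine(a, b):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def nearest_neighbors(query_vec, candidates, k=3):
    """Return the k candidate phrases whose vectors are most similar to query_vec.

    candidates: dict mapping a phrase to its representation vector.
    """
    ranked = sorted(candidates,
                    key=lambda phrase: cosine(query_vec, candidates[phrase]),
                    reverse=True)
    return ranked[:k]
```

Cosine similarity compares directions only, so two phrases can be close neighbors even when their representation vectors differ greatly in norm.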

Phrasal paraphrases, such as “having a great time/enjoy time together”, obtain close representations, which helps identify the entailment relation between two sentences. In addition, the model is able to learn a variety of general paraphrastic transformations, such as the possessive rule “person ’s/of a person” and the verb particle shift “holding his cup up/holding up a white plastic cup”.

Other examples, such as “wearing a pink dress/in a pink dress/dressed in pink”, indicate that our SAT-LSTMs model is robust to syntactic variations, which is crucial for boosting generalization when encoding a sentence or sentence pair.

5.4.3 Analysis of Learned Sentence Representations

We explore the sentence representations learned by three different models on SNLI. Table 4 lists the nearest neighbors of sentence representations learned by NBOW, AT-LSTMs and SAT-LSTMs.

As shown in Table 4, NBOW finds a sentence’s neighbors mainly through lexical paraphrase, whereas the neighbors returned by SAT-LSTMs are mostly meaning-preserving syntactic variations. For example, for the first sentence “the boys are bare chested”, NBOW returns “the men are naked”, most likely based on the word pair “bare/naked”, thereby ignoring the information in “chested”. In contrast, the sentences given by SAT-LSTMs express the same meaning in various ways, such as “the man is not wearing a shirt” and “a man without a shirt is on the water”, which accurately reflect the meaning of “bare chested”.

Compared with AT-LSTMs, SAT-LSTMs capture more flexible syntactic expressions. For example, for the sentence “a golden retriever nurses puppies”, SAT-LSTMs capture the syntactic paraphrase “A nurses B/B is snuggling with A”, which is difficult for the NBOW and AT-LSTMs models.

6 Related Work

There has been recent work on incorporating syntactic priors into neural networks. [Socher et al.2012] use a recursive neural network model that learns compositional vector representations for phrases and sentences of arbitrary syntactic type and length. [Tai et al.2015] introduce a generalization of the standard LSTM architecture to tree-structured networks. [Bowman et al.2016] propose a stack-augmented Parser-Interpreter Neural Network for sentence encoding, which combines parsing and interpretation within a single tree-sequence hybrid model. These models are designed to represent a single sentence in a more plausible way, whereas we want to model the strong interaction between two sentences over tree structures.

More recently, several works have tried to incorporate priors into attention-based models. [Cohn et al.2016] extend the attentional neural translation model to include structural biases from word-based alignment models. [Gu et al.2016] incorporate a copying mechanism into an attention-based model to address the OOV problem in a more systematic way for machine translation. Unlike these models, we augment the attention model with a syntactic prior for semantic matching.

Another thread of work is sequential attention models for natural language inference. [Rocktäschel et al.2015] propose to use an attention model for sentence pair encoding. [Wang and Jiang2015] extend this model by paying more attention to important word-level matching results. Compared with these models, we integrate syntactic structure into the attention matching model, which can match two trees in a plausible way.

7 Outlook

Natural language has an underlying syntactic structure, which makes it feasible to assign attention to tree-structured topologies instead of a flat sequence. Although we only use it in the context of natural language inference, the idea of a syntax-based attention model can be easily transferred to other tasks that need phrase-level alignment, such as neural machine translation. While submitting this paper, we became aware of [Eriguchi et al.2016], which proposes a tree-to-sequence attention-based model for neural machine translation, thereby showing the effectiveness of the syntax-based attention mechanism. The major difference is that their model is based on word-to-word and word-to-phrase attention (sequence conditioned on tree), whereas our proposed model focuses on phrase-to-phrase attention (tree over tree).

8 Conclusion

In this paper, we integrate syntactic structure into the attention model. Compared with the sequence-based attention model, our model can easily capture phrase-level alignment. Experiments on the Stanford Natural Language Inference Corpus demonstrate the efficacy of our proposed model and its superiority to competitor models. Furthermore, we have carried out a detailed experimental design and case analysis to evaluate the effectiveness of our syntax-based matching model and to explain why attention over trees is a good idea.

In future work, we wish to use our SAT-LSTMs matching model to learn the representations of phrasal [Wieting et al.2015] or syntactic paraphrases from massive paraphrase datasets, such as PPDB [Ganitkevitch et al.2013]. We expect that the learned subtree representations, with their rich prior knowledge, will be useful for downstream tasks when used as pre-trained components.