Decomposable Attention Model for Sentence Pair Classification (from https://arxiv.org/abs/1606.01933)
We propose a simple neural architecture for natural language inference. Our approach uses attention to decompose the problem into subproblems that can be solved separately, thus making it trivially parallelizable. On the Stanford Natural Language Inference (SNLI) dataset, we obtain state-of-the-art results with almost an order of magnitude fewer parameters than previous work and without relying on any word-order information. Adding intra-sentence attention that takes a minimum amount of order into account yields further improvements.
Decomposable Attention Model for Sentence Pair Classification (Pytorch Version) from paper 'A Decomposable Attention Model for Natural Language Inference' https://arxiv.org/abs/1606.01933
Natural language inference (NLI) refers to the problem of determining entailment and contradiction relationships between a premise and a hypothesis. NLI is a central problem in language understanding [Katz1972, Bos and Markert2005, van Benthem2008, MacCartney and Manning2009] and recently the large SNLI corpus of 570K sentence pairs was created for this task [Bowman et al.2015]. We present a new model for NLI and leverage this corpus for comparison with prior work.
A large body of work based on neural networks for text similarity tasks including NLI has been published in recent years [Hu et al.2014, Rocktäschel et al.2016, Wang and Jiang2016, Yin et al.2016, inter alia]. The dominating trend in these models is to build complex, deep text representation models, for example, with convolutional networks [LeCun et al.1990, CNNs henceforth] or long short-term memory networks [Hochreiter and Schmidhuber1997, LSTMs henceforth] with the goal of deeper sentence comprehension. While these approaches have yielded impressive results, they are often computationally very expensive, and result in models having millions of parameters (excluding embeddings).
Here, we take a different approach, arguing that for natural language inference it can often suffice to simply align bits of local text substructure and then aggregate this information. For example, consider the following sentences:
Bob is in his room, but because of the thunder and lightning outside, he cannot sleep.
Bob is awake.
It is sunny outside.
The first sentence is complex in structure and it is challenging to construct a compact representation that expresses its entire meaning. However, it is fairly easy to conclude that the second sentence follows from the first one, by simply aligning “Bob” with “Bob” and “cannot sleep” with “awake” and recognizing that these are synonyms. Similarly, one can conclude that “It is sunny outside” contradicts the first sentence, by aligning “thunder and lightning” with “sunny” and recognizing that these are most likely incompatible.
We leverage this intuition to build a simpler and more lightweight approach to NLI within a neural framework; with considerably fewer parameters, our model outperforms more complex existing neural architectures. In contrast to existing approaches, our approach only relies on alignment and is fully computationally decomposable with respect to the input text. An overview of our approach is given in Figure 1. Given two sentences, where each word is represented by an embedding vector, we first create a soft alignment matrix using neural attention [Bahdanau et al.2015]. We then use the (soft) alignment to decompose the task into subproblems that are solved separately. Finally, the results of these subproblems are merged to produce the final classification. In addition, we optionally apply intra-sentence attention [Cheng et al.2016] to endow the model with a richer encoding of substructures prior to the alignment step.
Asymptotically our approach does the same total work as a vanilla LSTM encoder, while being trivially parallelizable across sentence length, which can allow for considerable speedups in low-latency settings. Empirical results on the SNLI corpus show that our approach achieves state-of-the-art results, while using almost an order of magnitude fewer parameters compared to complex LSTM-based approaches.
Our method is motivated by the central role played by alignment in machine translation [Koehn2009] and previous approaches to sentence similarity modeling [Haghighi et al.2005, Das and Smith2009, Chang et al.2010, Fader et al.2013], natural language inference [Marsi and Krahmer2005, MacCartney et al.2006, Hickl and Bensley2007, MacCartney et al.2008], and semantic parsing [Andreas et al.2013]. The neural counterpart to alignment, attention [Bahdanau et al.2015], which is a key part of our approach, was originally proposed and has been predominantly used in conjunction with LSTMs [Rocktäschel et al.2016, Wang and Jiang2016] and to a lesser extent with CNNs [Yin et al.2016]. In contrast, our use of attention is purely based on word embeddings and our method essentially consists of feed-forward networks that operate largely independently of word order.
Let $a = (a_1, \ldots, a_{\ell_a})$ and $b = (b_1, \ldots, b_{\ell_b})$ be the two input sentences of length $\ell_a$ and $\ell_b$, respectively. We assume that each $a_i, b_j \in \mathbb{R}^d$ is a word embedding vector of dimension $d$ and that each sentence is prepended with a “NULL” token. Our training data comes in the form of labeled pairs $\{a^{(n)}, b^{(n)}, y^{(n)}\}_{n=1}^{N}$, where $y^{(n)} = (y^{(n)}_1, \ldots, y^{(n)}_C)$ is an indicator vector encoding the label and $C$ is the number of output classes. At test time, we receive a pair of sentences $(a, b)$ and our goal is to predict the correct label $y$.
Let $\bar{a} = (\bar{a}_1, \ldots, \bar{a}_{\ell_a})$ and $\bar{b} = (\bar{b}_1, \ldots, \bar{b}_{\ell_b})$ denote the input representation of each fragment that is fed to subsequent steps of the algorithm. The vanilla version of our model simply defines $\bar{a} := a$ and $\bar{b} := b$. With this input representation, our model does not make use of word order. However, we discuss an extension using intra-sentence attention in Section 3.4 that uses a minimal amount of sequence information.
The core model consists of the following three components (see Figure 1), which are trained jointly:
First, soft-align the elements of $\bar{a}$ and $\bar{b}$ using a variant of neural attention [Bahdanau et al.2015] and decompose the problem into the comparison of aligned subphrases.
Second, separately compare each aligned subphrase to produce a set of vectors $\{v_{1,i}\}_{i=1}^{\ell_a}$ for $a$ and $\{v_{2,j}\}_{j=1}^{\ell_b}$ for $b$. Each $v_{1,i}$ is a nonlinear combination of $a_i$ and its (softly) aligned subphrase in $b$ (and analogously for $v_{2,j}$).
Finally, aggregate the sets $\{v_{1,i}\}_{i=1}^{\ell_a}$ and $\{v_{2,j}\}_{j=1}^{\ell_b}$ from the previous step and use the result to predict the label $\hat{y}$.
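We first obtain unnormalized attention weights $e_{ij}$, computed by a function $F'$, which decomposes as:
$$e_{ij} := F'(\bar{a}_i, \bar{b}_j) := F(\bar{a}_i)^{\top} F(\bar{b}_j). \tag{1}$$
This decomposition avoids the quadratic complexity that would be associated with separately applying $F'$ $\ell_a \times \ell_b$ times. Instead, only $\ell_a + \ell_b$ applications of $F$ are needed. We take $F$ to be a feed-forward neural network with ReLU activations [Glorot et al.2011].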
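These attention weights are normalized as follows:
$$\beta_i := \sum_{j=1}^{\ell_b} \frac{\exp(e_{ij})}{\sum_{k=1}^{\ell_b} \exp(e_{ik})} \bar{b}_j, \qquad \alpha_j := \sum_{i=1}^{\ell_a} \frac{\exp(e_{ij})}{\sum_{k=1}^{\ell_a} \exp(e_{kj})} \bar{a}_i. \tag{2}$$
Here $\beta_i$ is the subphrase in $\bar{b}$ that is (softly) aligned to $\bar{a}_i$ and vice versa for $\alpha_j$.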
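The Attend step can be illustrated with a minimal NumPy sketch. It assumes that the scores are dot products of the two sentences passed through a shared two-layer ReLU network F, and that they are softmax-normalized once per row and once per column. The helper names (`feed_forward`, `softmax`, `attend`), the toy dimensions, and the random weights are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def feed_forward(x, weights):
    """Apply a ReLU feed-forward network row-wise; weights is a list of (W, b)."""
    for W, b in weights:
        x = np.maximum(x @ W + b, 0.0)
    return x

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    exp_x = np.exp(x)
    return exp_x / exp_x.sum(axis=axis, keepdims=True)

def attend(a_bar, b_bar, F_weights):
    """Soft-align a_bar (la, d) with b_bar (lb, d)."""
    Fa = feed_forward(a_bar, F_weights)    # F is applied la times
    Fb = feed_forward(b_bar, F_weights)    # and lb times, not la * lb times
    e = Fa @ Fb.T                          # (la, lb) unnormalized weights
    beta = softmax(e, axis=1) @ b_bar      # (la, d): part of b aligned to each a_i
    alpha = softmax(e, axis=0).T @ a_bar   # (lb, d): part of a aligned to each b_j
    return beta, alpha, e

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
d, h, la, lb = 8, 6, 4, 5
F_weights = [(rng.normal(0, 0.1, (d, h)), np.zeros(h)),
             (rng.normal(0, 0.1, (h, h)), np.zeros(h))]
a_bar = rng.normal(size=(la, d))
b_bar = rng.normal(size=(lb, d))
beta, alpha, e = attend(a_bar, b_bar, F_weights)
```

Because the score matrix is a single matrix product of the two transformed sentences, the expensive network evaluations grow with the sum of the sentence lengths rather than with their product.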
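Next, we separately compare the aligned phrases $\{(\bar{a}_i, \beta_i)\}_{i=1}^{\ell_a}$ and $\{(\bar{b}_j, \alpha_j)\}_{j=1}^{\ell_b}$ using a function $G$, which in this work is again a feed-forward network:
$$v_{1,i} := G([\bar{a}_i, \beta_i]) \quad \forall i \in [1, \ldots, \ell_a], \qquad v_{2,j} := G([\bar{b}_j, \alpha_j]) \quad \forall j \in [1, \ldots, \ell_b],$$
where the brackets $[\cdot, \cdot]$ denote concatenation. Note that since there are only a linear number of terms in this case, we do not need to apply a decomposition as was done in the previous step. Thus $G$ can jointly take into account both $\bar{a}_i$ and $\beta_i$.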
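As a rough sketch of the Compare step, the snippet below concatenates each word representation with its aligned subphrase and passes the result through one shared network G. The aligned subphrases are random stand-ins here, and the function names, sizes, and weights are assumptions made for illustration.

```python
import numpy as np

def feed_forward(x, weights):
    """Apply a ReLU feed-forward network row-wise; weights is a list of (W, b)."""
    for W, b in weights:
        x = np.maximum(x @ W + b, 0.0)
    return x

def compare(a_bar, beta, b_bar, alpha, G_weights):
    """Compare every word with its aligned subphrase using the same network G."""
    v1 = feed_forward(np.concatenate([a_bar, beta], axis=1), G_weights)   # (la, h)
    v2 = feed_forward(np.concatenate([b_bar, alpha], axis=1), G_weights)  # (lb, h)
    return v1, v2

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
d, h, la, lb = 8, 6, 4, 5
G_weights = [(rng.normal(0, 0.1, (2 * d, h)), np.zeros(h)),
             (rng.normal(0, 0.1, (h, h)), np.zeros(h))]
# Random stand-ins for the inputs and for the aligned subphrases from the Attend step.
a_bar, beta = rng.normal(size=(la, d)), rng.normal(size=(la, d))
b_bar, alpha = rng.normal(size=(lb, d)), rng.normal(size=(lb, d))
v1, v2 = compare(a_bar, beta, b_bar, alpha, G_weights)
```

Each row is processed independently of the others, which is what makes this step parallelizable over sentence length.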
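We now have two sets of comparison vectors $\{v_{1,i}\}_{i=1}^{\ell_a}$ and $\{v_{2,j}\}_{j=1}^{\ell_b}$. We first aggregate over each set by summation:
$$v_1 = \sum_{i=1}^{\ell_a} v_{1,i}, \qquad v_2 = \sum_{j=1}^{\ell_b} v_{2,j},$$
and feed the result through a final classifier $H$, that is a feed-forward network followed by a linear layer:
$$\hat{y} = H([v_1, v_2]),$$
where $\hat{y} \in \mathbb{R}^{C}$ represents the predicted (unnormalized) scores for each class and consequently the predicted class is given by $\arg\max_i \hat{y}_i$.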
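The Aggregate step might look as follows in NumPy: sum each set of comparison vectors, concatenate the two sums, and apply a ReLU network followed by a linear layer. All names, sizes, and weights are illustrative assumptions. The last line reverses the row order of one input to show that summation makes the output independent of word order.

```python
import numpy as np

def feed_forward(x, weights):
    """Apply a ReLU feed-forward network row-wise; weights is a list of (W, b)."""
    for W, b in weights:
        x = np.maximum(x @ W + b, 0.0)
    return x

def aggregate(v1, v2, H_weights, W_out, b_out):
    """Sum each set of comparison vectors, then classify the concatenation."""
    pooled = np.concatenate([v1.sum(axis=0), v2.sum(axis=0)])  # (2h,)
    hidden = feed_forward(pooled[None, :], H_weights)          # (1, h)
    return (hidden @ W_out + b_out)[0]                         # (C,) unnormalized scores

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
h, la, lb, num_classes = 6, 4, 5, 3
H_weights = [(rng.normal(0, 0.1, (2 * h, h)), np.zeros(h)),
             (rng.normal(0, 0.1, (h, h)), np.zeros(h))]
W_out, b_out = rng.normal(0, 0.1, (h, num_classes)), np.zeros(num_classes)
v1, v2 = rng.normal(size=(la, h)), rng.normal(size=(lb, h))

y_hat = aggregate(v1, v2, H_weights, W_out, b_out)
predicted_class = int(np.argmax(y_hat))
# Reversing the rows of v1 leaves the scores unchanged (up to rounding).
y_hat_reversed = aggregate(v1[::-1], v2, H_weights, W_out, b_out)
```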
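For training, we use multi-class cross-entropy loss with dropout regularization [Srivastava et al.2014]:
$$L(\theta_F, \theta_G, \theta_H) = -\frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C} y^{(n)}_c \log \frac{\exp(\hat{y}_c)}{\sum_{c'=1}^{C} \exp(\hat{y}_{c'})}.$$
Here $\theta_F, \theta_G, \theta_H$ denote the learnable parameters of the functions F, G and H, respectively.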
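For concreteness, a multi-class cross-entropy over unnormalized class scores can be computed as in the sketch below. It uses the standard definition (mean negative log-probability of the gold class), leaves out dropout, and the function name `cross_entropy` is an illustrative choice.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy; logits is (N, C), labels is one-hot (N, C)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1).mean()

# With all-zero scores every class gets probability 1/3, so the loss is log(3).
logits = np.zeros((2, 3))
labels = np.array([[1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0]])
loss = cross_entropy(logits, labels)
```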
In the above model, the input representations are simple word embeddings. However, we can augment this input representation with intra-sentence attention to encode compositional relationships between words within each sentence, as proposed by Cheng et al. (2016). Similar to Eqs. 1 and 2, we define
$$f_{ij} := F_{\text{intra}}(a_i)^{\top} F_{\text{intra}}(a_j),$$
where $F_{\text{intra}}$ is a feed-forward network. We then create the self-aligned phrases
$$a'_i := \sum_{j=1}^{\ell_a} \frac{\exp(f_{ij} + d_{i-j})}{\sum_{k=1}^{\ell_a} \exp(f_{ik} + d_{i-k})} a_j.$$
The distance-sensitive bias terms $d_{i-j} \in \mathbb{R}$ provide the model with a minimal amount of sequence information, while remaining parallelizable. These terms are bucketed such that all distances greater than 10 words share the same bias. The input representation for subsequent steps is then defined as $\bar{a}_i := [a_i, a'_i]$ and analogously $\bar{b}_i := [b_i, b'_i]$.
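One possible reading of the bucketed distance bias is sketched below. It assumes that the bias depends on the absolute distance between two positions, with one learnable scalar for each distance from 0 to 10 and one shared scalar for anything larger; whether the original biases are signed is not stated in the text. All names, sizes, and weights are hypothetical.

```python
import numpy as np

def feed_forward(x, weights):
    """Apply a ReLU feed-forward network row-wise; weights is a list of (W, b)."""
    for W, b in weights:
        x = np.maximum(x @ W + b, 0.0)
    return x

def softmax_rows(x):
    """Numerically stable softmax over each row."""
    x = x - x.max(axis=1, keepdims=True)
    exp_x = np.exp(x)
    return exp_x / exp_x.sum(axis=1, keepdims=True)

def distance_buckets(length, max_dist=10):
    """Bucket index for every position pair; distances above max_dist share one bucket."""
    idx = np.arange(length)
    dist = np.abs(idx[:, None] - idx[None, :])  # assumption: unsigned distance
    return np.minimum(dist, max_dist + 1)       # values in 0 .. max_dist + 1

def intra_attention(a, F_intra_weights, dist_bias, max_dist=10):
    """Return [a_i, a'_i] for every position of a single sentence a (l, d)."""
    Fa = feed_forward(a, F_intra_weights)
    scores = Fa @ Fa.T + dist_bias[distance_buckets(len(a), max_dist)]
    a_prime = softmax_rows(scores) @ a          # self-aligned phrases
    return np.concatenate([a, a_prime], axis=1)

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
d, h, length = 8, 6, 15
F_intra_weights = [(rng.normal(0, 0.1, (d, h)), np.zeros(h)),
                   (rng.normal(0, 0.1, (h, h)), np.zeros(h))]
dist_bias = rng.normal(0, 0.01, 12)             # buckets 0..10 plus one for "> 10"
a = rng.normal(size=(length, d))
a_bar = intra_attention(a, F_intra_weights, dist_bias)
buckets = distance_buckets(length)
```

The augmented representation has twice the embedding dimension, so the networks that consume it need correspondingly wider input layers.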
Table 1: Train and test accuracy (%) on SNLI and number of parameters for each approach.

| Method | Train Acc | Test Acc | #Parameters |
|---|---|---|---|
| Lexicalized Classifier [Bowman et al.2015] | 99.7 | 78.2 | – |
| 300D LSTM RNN encoders [Bowman et al.2016] | 83.9 | 80.6 | 3.0M |
| 1024D pretrained GRU encoders [Vendrov et al.2015] | 98.8 | 81.4 | 15.0M |
| 300D tree-based CNN encoders [Mou et al.2015] | 83.3 | 82.1 | 3.5M |
| 300D SPINN-PI encoders [Bowman et al.2016] | 89.2 | 83.2 | 3.7M |
| 100D LSTM with attention [Rocktäschel et al.2016] | 85.3 | 83.5 | 252K |
| 300D mLSTM [Wang and Jiang2016] | 92.0 | 86.1 | 1.9M |
| 450D LSTMN with deep attention fusion [Cheng et al.2016] | 88.5 | 86.3 | 3.4M |
| Our approach (vanilla) | 89.5 | 86.3 | 382K |
| Our approach with intra-sentence attention | 90.5 | 86.8 | 582K |
We now discuss the asymptotic complexity of our approach and how it offers a higher degree of parallelism than LSTM-based approaches. Recall that $d$ denotes embedding dimension and $\ell$ means sentence length. For simplicity we assume that all hidden dimensions are $d$ and that the complexity of matrix ($d \times d$)-vector ($d \times 1$) multiplication is $O(d^2)$.
A key assumption of our analysis is that $\ell < d$, which we believe is reasonable and is true of the SNLI dataset [Bowman et al.2015] where $\ell < 80$, whereas recent LSTM-based approaches have used $d \geq 300$. This assumption allows us to bound the complexity of computing the attention weights.
The complexity of an LSTM cell is $O(d^2)$, resulting in a complexity of $O(\ell d^2)$ to encode the sentence. Adding attention as in Rocktäschel et al. (2016) increases this complexity to $O(\ell d^2 + \ell^2 d)$.
Application of a feed-forward network requires $O(d^2)$ steps. Thus, the Compare and Aggregate steps have complexity $O(\ell d^2)$ and $O(d^2)$ respectively. For the Attend step, $F$ is evaluated $O(\ell)$ times, giving a complexity of $O(\ell d^2)$. Each attention weight $e_{ij}$ requires one dot product, resulting in a complexity of $O(\ell^2 d)$.
Thus the total complexity of the model is $O(\ell d^2 + \ell^2 d)$, which is equal to that of an LSTM with attention. However, note that with the assumption that $\ell < d$, this becomes $O(\ell d^2)$, which is the same complexity as a regular LSTM. Moreover, unlike the LSTM, our approach has the advantage of being parallelizable over $\ell$, which can be useful at test time.
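To make the two cost terms concrete, the short calculation below plugs in illustrative values: a sentence length of 50 and a dimension of 200. These numbers are assumptions picked for the example; constant factors are ignored, as in the analysis.

```python
def attention_model_cost(length, dim):
    """Return the two dominant operation counts, ignoring constant factors."""
    ff_cost = length * dim ** 2    # one feed-forward application per position
    dot_cost = length ** 2 * dim   # one dot product of size dim per position pair
    return ff_cost, dot_cost

ff_cost, dot_cost = attention_model_cost(length=50, dim=200)
print(ff_cost, dot_cost, dot_cost / ff_cost)  # prints: 2000000 500000 0.25
```

The ratio of the two terms is the sentence length divided by the dimension, so the pairwise dot products stay the smaller term whenever sentences are shorter than the hidden dimension.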
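Table 2: Breakdown of accuracy (%) by class on the SNLI development set. N = neutral, E = entailment, C = contradiction.

| Method | N | E | C |
|---|---|---|---|
| Our approach (vanilla) | 83.6 | 91.3 | 85.8 |
| Our approach w/ intra att. | 83.7 | 92.1 | 86.7 |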
Table 3: Example wins and losses of our approach (DA) against SPINN-PI and the mLSTM. N = neutral, E = entailment, C = contradiction.

| ID | Sentence 1 | Sentence 2 | DA (vanilla) | DA (intra att.) | SPINN-PI | mLSTM | Gold |
|---|---|---|---|---|---|---|---|
| A | Two kids are standing in the ocean hugging each other. | Two kids enjoy their day at the beach. | N | N | E | E | N |
| B | A dancer in costumer performs on stage while a man watches. | the man is captivated | N | N | E | E | N |
| C | They are sitting on the edge of a fountain | The fountain is splashing the persons seated. | N | N | C | C | N |
| D | Two dogs play with tennis ball in field. | Dogs are watching a tennis match. | N | C | C | C | C |
| E | Two kids begin to make a snowman on a sunny winter day. | Two penguins making a snowman. | N | C | C | C | C |
| F | The horses pull the carriage, holding people and a dog, through the rain. | Horses ride in a carriage pulled by a dog. | E | E | C | C | C |
| G | A woman closes her eyes as she plays her cello. | The woman has her eyes open. | E | E | E | E | C |
| H | Two women having drinks and smoking cigarettes at the bar. | Three women are at a bar. | E | E | E | E | C |
| I | A band playing with fans watching. | A band watches the fans play | E | E | E | E | C |
We evaluate our approach on the Stanford Natural Language Inference (SNLI) dataset [Bowman et al.2015]. Given a sentence pair $(a, b)$, the task is to predict whether $b$ is entailed by $a$, $b$ contradicts $a$, or whether their relationship is neutral.
Following Bowman et al. (2015), we remove examples labeled “–” (no gold label) from the dataset, which leaves 549,367 pairs for training, 9,842 for development, and 9,824 for testing. We use the tokenized sentences from the non-binary parse provided in the dataset and prepend each sentence with a “NULL” token. During training, each sentence was padded up to the maximum length of the batch for efficient training (the padding was explicitly masked out so as not to affect the objective/gradients). For efficient batching in TensorFlow, we semi-sorted the training data to first contain examples where both sentences had length less than 20, followed by those with length less than 50, and then the rest. This ensured that most training batches contained examples of similar length.
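The semi-sorting and padding described above could be implemented along the following lines. This is a plain-Python sketch: the function names, the shuffle within each length group, and the padding id 0 are assumptions, and only the length thresholds of 20 and 50 come from the text.

```python
import random

def length_bucket(pair, limits=(20, 50)):
    """Rank a sentence pair by the first limit that both sentences fall below."""
    longest = max(len(pair[0]), len(pair[1]))
    for rank, limit in enumerate(limits):
        if longest < limit:
            return rank
    return len(limits)

def semi_sort(pairs, seed=0):
    """Shuffle, then stable-sort by bucket so order stays random within a bucket."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    return sorted(shuffled, key=length_bucket)

def pad_batch(sentences, pad_id=0):
    """Pad token-id lists to the batch maximum and return a 0/1 mask."""
    max_len = max(len(s) for s in sentences)
    ids = [s + [pad_id] * (max_len - len(s)) for s in sentences]
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sentences]
    return ids, mask

# Toy sentence pairs given as token lists of various lengths.
lengths = [(60, 5), (10, 12), (30, 8), (19, 19), (5, 49), (70, 70)]
pairs = [(["w"] * n, ["w"] * m) for n, m in lengths]
ordered = semi_sort(pairs)
ranks = [length_bucket(p) for p in ordered]
ids, mask = pad_batch([[1, 2, 3], [4]])
```

The mask marks real tokens with 1 and padding with 0, so padded positions can be excluded from the attention normalization and from the summation in the aggregation step.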
Embeddings: We use 300-dimensional GloVe embeddings [Pennington et al.2014] to represent words. Each embedding vector was normalized to have $\ell_2$ norm of 1 and projected down to 200 dimensions, a number determined via hyperparameter tuning. Out-of-vocabulary (OOV) words are hashed to one of 100 random embeddings, each initialized to mean 0 and standard deviation 1. All embeddings remain fixed during training, but the projection matrix is trained. All other parameter weights (hidden layers etc.) were initialized from random Gaussians with mean 0 and standard deviation 0.01.
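A small sketch of this embedding pipeline is given below. The CRC32 hash, the toy vocabulary, the choice not to normalize the OOV vectors, and the random projection matrix are all assumptions; in the described setup the projection would be trained while the embeddings stay fixed.

```python
import zlib
import numpy as np

EMBED_DIM, PROJ_DIM, NUM_OOV = 300, 200, 100
rng = np.random.default_rng(0)

# Fixed random OOV embeddings (mean 0, std 1) and a projection matrix.
oov_table = rng.normal(0.0, 1.0, (NUM_OOV, EMBED_DIM))
projection = rng.normal(0.0, 0.01, (EMBED_DIM, PROJ_DIM))

def lookup(word, pretrained):
    """Return the unit-norm pretrained vector, or a hashed OOV vector."""
    if word in pretrained:
        vec = pretrained[word]
        return vec / np.linalg.norm(vec)
    # Assumption: any deterministic hash works; CRC32 is used here.
    return oov_table[zlib.crc32(word.encode("utf-8")) % NUM_OOV]

def embed(tokens, pretrained):
    """Map tokens to projected 200-dimensional vectors."""
    return np.stack([lookup(t, pretrained) for t in tokens]) @ projection

# Toy stand-in for the pretrained embedding table.
pretrained = {"bob": rng.normal(size=EMBED_DIM)}
vectors = embed(["NULL", "bob", "zzzunknown"], pretrained)
```

Hashing keeps the mapping from an unseen word to its vector stable across runs, so the same OOV word always receives the same embedding.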
Each hyperparameter setting was run on a single machine with 10 asynchronous gradient-update threads, using Adagrad [Duchi et al.2011] for optimization with the default initial accumulator value of 0.1. Dropout regularization [Srivastava et al.2014] was used for all ReLU layers, but not for the final linear layer. We additionally tuned the following hyperparameters and present their chosen values in parentheses: network size (2 layers, each with 200 neurons), batch size (4; batch sizes of 16 or 32 also work well and are a bit more stable), dropout ratio (0.2), and learning rate (0.05 for vanilla, 0.025 for intra-attention). All settings were run for 50 million steps (each step indicates one batch), but model parameters were saved frequently as training progressed and we chose the model that did best on the development set.
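For reference, a single Adagrad update with an initial accumulator of 0.1 and a learning rate of 0.05 can be written as below. This is the textbook per-parameter rule without an epsilon term, run on made-up parameter and gradient values; it is a sketch, not the optimizer code used in the experiments.

```python
import numpy as np

def adagrad_step(param, grad, accum, lr=0.05):
    """One Adagrad update: accumulate squared gradients, then scale the step."""
    accum = accum + grad ** 2
    param = param - lr * grad / np.sqrt(accum)
    return param, accum

# Made-up parameter and gradient values, used only to exercise the update rule.
param = np.array([1.0, -2.0])
grad = np.array([0.3, -0.4])
accum = np.full_like(param, 0.1)   # initial accumulator value from the text

param, accum = adagrad_step(param, grad, accum)
```

Starting the accumulator above zero keeps the first step finite even when a gradient is very small.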
Results in terms of 3-class accuracy are shown in Table 1. Our vanilla approach achieves state-of-the-art results with almost an order of magnitude fewer parameters than the LSTMN of Cheng et al. (2016). Adding intra-sentence attention gives a considerable improvement of 0.5 percentage points over the existing state of the art. Table 2 gives a breakdown of accuracy on the development set showing that most of our gains stem from neutral, while most losses come from contradiction pairs.
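The parameter counts reported for our approach (382K and 582K) can be roughly reproduced with a back-of-the-envelope calculation. The layer layout used here is an assumption rather than something stated in the text: a bias-free 300-to-200 projection, two-layer networks with 200 hidden units and biases for each function, and a final linear layer over 3 classes. The few scalar distance biases of the intra-attention variant are left out.

```python
def dense(n_in, n_out, bias=True):
    """Parameters of one fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)

def mlp(n_in, hidden=200, layers=2):
    """Parameters of a feed-forward network with equally sized hidden layers."""
    return dense(n_in, hidden) + (layers - 1) * dense(hidden, hidden)

def count_parameters(intra_attention, d=200, glove_dim=300, classes=3):
    total = dense(glove_dim, d, bias=False)  # trained projection matrix
    rep = d                                  # size of each input representation
    if intra_attention:
        total += mlp(d)                      # intra-sentence attention network
        rep = 2 * d                          # [a_i, a'_i] doubles the input size
    total += mlp(rep)                        # F: attend
    total += mlp(2 * rep)                    # G: compare, on concatenated inputs
    total += mlp(2 * d)                      # H: aggregate, on the two summed vectors
    total += dense(d, classes)               # final linear layer
    return total

print(count_parameters(False), count_parameters(True))  # prints: 381803 582203
```

Under these assumptions the totals round to 382K and 582K, which matches the figures in Table 1.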
Table 3 shows some wins and losses. Examples A-C are cases where both variants of our approach are correct while both SPINN-PI [Bowman et al.2016] and the mLSTM [Wang and Jiang2016] are incorrect. In the first two cases, both sentences contain phrases that are either identical or highly lexically related (e.g. “Two kids” and “ocean / beach”) and our approach correctly favors neutral in these cases. In Example C, it is possible that relying on word-order may confuse SPINN-PI and the mLSTM due to how “fountain” is the object of a preposition in the first sentence but the subject of the second.
The second set of examples (D-F) are cases where our vanilla approach is incorrect but mLSTM and SPINN-PI are correct. Example F requires sequential information and neither variant of our approach can predict the correct class. Examples D-E are interesting however, since they don’t require word order information, yet intra-attention seems to help. We suspect this may be because the word embeddings are not fine-grained enough for the algorithm to conclude that “play/watch” is a contradiction, but intra-attention, by adding an extra layer of composition/nonlinearity to incorporate context, compensates for this.
Finally, Examples G-I are cases that all methods get wrong. The first is actually representative of many examples in this category where there is one critical word that separates the two sentences (close vs. open in this case) and goes unnoticed by the algorithms. Example H requires inference about numbers and Example I needs sequence information.
We presented a simple attention-based approach to natural language inference that is trivially parallelizable. The approach outperforms considerably more complex neural methods aiming for text understanding. Our results suggest that, at least for this task, pairwise comparisons are relatively more important than global sentence-level representations.
We thank Slav Petrov, Tom Kwiatkowski, Yoon Kim, Erick Fonseca, and Mark Neumann for useful discussion, and Sam Bowman and Shuohang Wang for providing us their model outputs for error analysis.
TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
Natural language inference by tree-based convolution and heuristic matching. In Proceedings of ACL (short papers).