A Neural Architecture Mimicking Humans End-to-End for Natural Language Inference

11/15/2016 ∙ by Biswajit Paria, et al. ∙ Accenture, ERNET India, IIT Kharagpur

In this work we use recent advances in representation learning to propose a neural architecture for the problem of natural language inference. Our approach is designed to mimic how a human performs natural language inference given two statements. The model uses variants of the Long Short Term Memory (LSTM), an attention mechanism and composable neural networks to carry out the task. Each part of our model can be mapped to a clear step humans perform when carrying out the overall task of natural language inference. The model is end-to-end differentiable, enabling training by stochastic gradient descent. On the Stanford Natural Language Inference (SNLI) dataset, the proposed model achieves higher accuracy than all previously published models.


I Introduction and Motivation

The problem of Natural Language Inference (NLI) is to identify whether a statement (the hypothesis) in natural language can be inferred or contradicted in the context of another statement (the premise) in natural language. If it can neither be inferred nor contradicted, we say the hypothesis is ‘neutral’ to the premise. NLI is one of the most important components of natural language understanding systems [Benthem, 2008; MacCartney and Manning, 2009]. NLI has a multitude of applications in natural language question answering [Harabagiu and Hickl, 2006], semantic search, text summarization [Lacatusu et al., 2006], etc.

Consider the three statements A: The couple is walking on the sea shore. B: The man and woman are wide awake. C: The man and woman are shopping on the busy street. Here statement A is the premise, and B and C are both hypotheses. B can be inferred from A, whereas it is reasonably clear that C cannot be true if A is. Strictly speaking, A and C could be true together, by arguing that there was a busy shopping street by the sea shore, but this is not generally the case. The problem of NLI thus falls more in the “common sense reasoning” segment than in strict logical inferencing, and is subtly different from deduction in a formal logical setting [MacCartney, 2009].

Unsupervised feature learning and deep learning [Bengio, 2009; LeCun et al., 2015] based on neural networks have gained prominence in the last few years. State-of-the-art neural network models, and appropriate algorithms to train them, have been proposed for a multitude of tasks in computer vision, natural language processing, speech recognition, etc., and these models hold benchmark results for most problems. In the area of natural language processing, recent deep learning models have proven superior to conventional rule-based or machine learning approaches in many tasks such as part-of-speech tagging, question answering, sentiment analysis and document classification [Kumar et al., 2015]. Not only do deep learning models hold the state-of-the-art results for these problems, but many of the model constructs used, such as the attention mechanism, align closely with the human thought process.

Motivated by the same, we dissect the problem of NLI into various sub tasks, similar to how a human carries out NLI. We then realize each sub task using a deep learning construct and weave them together to create a complete end-to-end model for NLI. Let us first see how we can dissect the problem of NLI as humans do it. When seeing the two statements A and B as in the example above, humans first align information snippets between the sentences, such as (the couple, the man and woman) and (walking, wide awake). We notice that the first pair is equivalent. From the second pair we conclude that walking is possible only in the state of being awake. From the results of these two different kinds of processing we conclude that sentence B can be inferred from sentence A. Suppose in A it were a dog instead of a couple: the first pair would not have been equivalent, and we could not have inferred B even though the result for the second pair stays the same. The result of each pair is important; in some cases the pairs are independent, but in most cases they are dependent, as humans make use of a lot of contextual information. We analyze that shopping on a street is not possible at the sea shore and conclude that C is contradicted by A. Note that for inferring B we never paid attention to where the couple were walking, but to contradict C we paid attention to the place. Humans first align the needed information according to the context, compare each pair differently by making use of the contextual information, and then deduce the final answer by making use of all the comparison results.

The main contributions of this paper are as follows.

  1. A neural architecture using variants of long short term memory, composable neural networks and the attention mechanism is proposed for the problem of natural language inference.

  2. The model is inspired by how humans carry out the task of natural language inference and is hence very intuitive. Each step humans perform in NLI is mimicked by an appropriate deep learning construct in the model.

  3. We present detailed experimental results on the Stanford Natural Language Inference (SNLI) dataset [Bowman et al., 2015], and show that the proposed model outperforms all other models.

II Preliminaries and Background

In a deep learning framework, the natural language sentences are first converted into numerical representations by word embeddings. These numerical representations are then encoded using a bi-directional LSTM or a binary tree LSTM, to capture the various information snippets along with the context in which they appear. An attention mechanism is used to learn which parts of the information need to be aligned and processed together according to the context. The pairs generated by the attention mechanism are then processed separately using a set of different operators selected by soft gating. The outputs of the different processed pairs are then aggregated, or composed, together for the final prediction task. Below we briefly describe the concepts of word embeddings and LSTMs. The attention mechanism and composition, and their motivations, are introduced along with the model.

II-A Word Embeddings

The first challenge encountered in applying deep learning models to NLP is to find a suitable numerical representation for words. “You shall know a word by the company it keeps” (Firth, J. R. 1957:11) is one of the most influential ideas in natural language processing. Multiple models for representing a word as a numerical vector, based on the context it appears in, stem from this idea. Many vector representations for words have been proposed, including the well known latent semantic indexing [Dumais, 2004]. Vector representations for words in the context of neural networks were proposed in [Bengio et al., 2003]. In that model, each word in the vocabulary is assigned a distributed word feature vector $w \in \mathbb{R}^d$. The probability distribution of word sequences, $P(w_1, \ldots, w_T)$, is then expressed in terms of these word feature vectors. The word feature vectors and the parameters of the probability function (a neural network) are learned together by training a suitable feed-forward neural network to maximize the log-likelihood of the text corpora, considering each text snippet of a fixed window size as a training sample. [Mikolov et al., 2013a] adapted this model and proposed two new models, continuous bag-of-words and skip-gram, popularly known as ‘Word2Vec’ models. The continuous bag-of-words model tries to predict the current word given the previous and next surrounding words in a fixed context window, discarding word order. The skip-gram model tries to predict the surrounding words given the current word. These models have better training complexity and thus can be used for training on large corpora. The vectors generated by these models on large corpora have been shown to capture subtle semantic relationships between words through simple vector operations on them [Mikolov et al., 2013b]. A drawback of these models is that they mostly use local information (words in a contextual window). To effectively utilize the aggregated global information from the corpus without incurring high computational cost, ‘GloVe’ word vectors were proposed by [Pennington et al., 2014]. This model tries to create word vectors such that the dot product of two vectors closely resembles the co-occurrence statistics of the corresponding words in the full corpus. It has been shown to be more effective than Word2Vec models at capturing semantic regularities on smaller corpora.
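As a minimal illustration of the kind of semantic regularity such vectors capture, the sketch below uses toy 3-dimensional embeddings (stand-ins for real 300-dimensional GloVe or Word2Vec vectors; the words and values are made up) and recovers an analogy by simple vector arithmetic and cosine similarity.

```python
import numpy as np

# Toy 3-d embeddings standing in for real 300-d GloVe/Word2Vec vectors.
emb = {
    "king":  np.array([0.8, 0.6, 0.1]),
    "queen": np.array([0.8, 0.1, 0.6]),
    "man":   np.array([0.2, 0.7, 0.1]),
    "woman": np.array([0.2, 0.1, 0.7]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Semantic regularity via vector arithmetic: king - man + woman ~ queen.
target = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(emb[w], target))
print(best)  # 'queen' for these toy vectors
```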

II-B Recurrent Neural Network and Long Short Term Memory (LSTM)

The basic idea behind Recurrent Neural Networks (RNN) is to capture and encode the information present in a sequence such as text. Given a sequence of words, a numerical representation (GloVe or Word2Vec vector) of each word is fed to a neural network and an output is computed. While computing the output for the next word, the output from the previous word (or time step) is also considered. RNNs are called recurrent because they perform the same computation for every element of a sequence, using the output from previous computations. At any time step $t$ the RNN performs the following computation,

$$h_t = f(W x_t + U h_{t-1}),$$

where $W$ and $U$ are the trainable parameters of the model and $f$ is a nonlinear function. The bias terms are left out here and have to be added appropriately. $h_t$ is the output at time step $t$, which can either be utilized as is, or fed to a further parameterized construct such as a softmax layer [Bishop, 2006], depending on the task at hand. Training is done by formulating a loss objective based on the outputs at all time steps and minimizing that loss. The vanilla RNNs explained above have difficulty in learning long-term dependencies in the sequence via gradient descent training [Bengio et al., 1994]. Training vanilla RNNs is also shown to be difficult because of the vanishing and exploding gradient problems [Pascanu et al., 2013]. Long short term memory (LSTM) [Hochreiter and Schmidhuber, 1997], a variant of the RNN, is shown to be effective in capturing long-term dependencies and easier to train than vanilla RNNs. Multiple variants of LSTMs have been proposed in the literature; one can refer to [Greff et al., 2015] for a comprehensive survey.

An LSTM module has three parameterized gates: an input gate ($i_t$), a forget gate ($f_t$) and an output gate ($o_t$). A gate $g$ operates as

$$g_t = \sigma(W_g x_t + U_g h_{t-1}),$$

where $W_g$ and $U_g$ are the parameters of the gate $g$, $h_{t-1}$ is the hidden state at the previous time step and $\sigma$ stands for the sigmoid function. All three gates have the same equation form and inputs, but different sets of parameters. Along with the hidden state, the LSTM module also maintains a cell state. The updates of the cell state and hidden state at any time step are controlled by the gates as follows,

$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1}) \quad\text{and}\quad h_t = o_t \odot \tanh(c_t), \tag{1}$$

where $W_c$ and $U_c$ are again parameters of the model. The key component of the LSTM is the cell state. The LSTM has the capability to modify and retain the content of the cell state as required by the task, using the gates and hidden states. While a forward LSTM takes the input sequence as it is, a backward LSTM takes the input in the reverse order. A backward LSTM is used to capture the dependencies of a word on future words in the original sequence. The concatenation of a forward LSTM and a backward LSTM is known as a bi-directional LSTM (bi-LSTM) [Greff et al., 2015].
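The gate and state updates above can be condensed into a single step function. The following is a minimal NumPy sketch, not the paper's implementation (which used TensorFlow); the parameter names and dimensions are illustrative, and biases are included for completeness.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step implementing the gate and state updates described above."""
    def gate(name, act):
        W, U, b = params["W_" + name], params["U_" + name], params["b_" + name]
        return act(W @ x_t + U @ h_prev + b)

    i = gate("i", sigmoid)            # input gate
    f = gate("f", sigmoid)            # forget gate
    o = gate("o", sigmoid)            # output gate
    c_tilde = gate("c", np.tanh)      # candidate cell content
    c = f * c_prev + i * c_tilde      # new cell state
    h = o * np.tanh(c)                # new hidden state, as in (1)
    return h, c

# Tiny usage example with random parameters over a sequence of 5 word vectors.
d_in, d_h = 4, 3
rng = np.random.default_rng(0)
params = {}
for name in ["i", "f", "o", "c"]:
    params["W_" + name] = rng.normal(size=(d_h, d_in)) * 0.1
    params["U_" + name] = rng.normal(size=(d_h, d_h)) * 0.1
    params["b_" + name] = np.zeros(d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):
    h, c = lstm_step(x, h, c, params)
print(h)
```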

II-B1 Binary Tree Long Short Term Memory

The LSTM or bi-LSTM models process information in a sequential manner, as a linear chain. But a natural language sentence has more syntactic structure to it, and its information is represented better as a tree than as a linear chain. To incorporate this way of processing information, tree structured LSTMs were introduced by Tai et al. 2015. In a tree structured LSTM (Tree-LSTM), each node has multiple previous time steps, one corresponding to each child of the node in the tree, compared to the single previous time step of a linear chain. A different set of parameters for each child is included in the input and output gates, to learn how each child's information has to be processed. Using these separate parameters, the child information is summed up to form the input and output gate values of every node, as follows:

$$i_t = \sigma\Big(W^{(i)} x_t + \sum_{l=1}^{N} U^{(i)}_l h_{tl}\Big), \qquad o_t = \sigma\Big(W^{(o)} x_t + \sum_{l=1}^{N} U^{(o)}_l h_{tl}\Big),$$

where $h_{tl}$ is the hidden state of the $l$-th child and $N$ is the number of children (two for a binary tree). Multiple forget gates (one for each child) are included to learn the information from each child that needs to be remembered. The forget gate update for each child $k$ is

$$f_{tk} = \sigma\Big(W^{(f)} x_t + \sum_{l=1}^{N} U^{(f)}_{kl} h_{tl}\Big).$$

The cell state is then updated based on the forget gate values and the cell states $c_{tl}$ of the children as below,

$$c_t = i_t \odot \tanh\Big(W^{(c)} x_t + \sum_{l=1}^{N} U^{(c)}_l h_{tl}\Big) + \sum_{l=1}^{N} f_{tl} \odot c_{tl}.$$

Hidden states are then computed similarly to the normal LSTM, as given in (1). The bias terms are left out in all the equations and have to be added appropriately wherever needed. All $W$s and $U$s in the above equations are model parameters to be learned.

The tree structure can be formed by considering the syntactic parse of the sentence, leading to different variations of the Tree-LSTM [Tai et al., 2015]. If we consider the syntactic structure, each sample in the training data creates a different tree structure, making it difficult to train the model efficiently. To work around this we consider complete binary tree structures, formed by pairing adjacent words recursively. We call this btree-LSTM in the subsequent discussion.
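A minimal sketch of building such a binary tree by pairing adjacent encodings recursively is given below. This is not the paper's code: in the actual model the `combine` function would be a binary Tree-LSTM cell, and how an unpaired node at an odd-sized level is handled (here it is promoted unchanged) is an assumption. The construction yields 2n-1 node encodings for an n-word sentence.

```python
def btree_encode(word_encodings, combine):
    """Pair adjacent encodings recursively to form a binary tree over the
    sentence and return the encodings of all tree nodes (leaves + internal)."""
    nodes = list(word_encodings)          # leaf encodings
    level = list(word_encodings)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            parent = combine(level[i], level[i + 1])
            nodes.append(parent)
            nxt.append(parent)
        if len(level) % 2 == 1:           # assumption: an unpaired node is promoted as-is
            nxt.append(level[-1])
        level = nxt
    return nodes

# Usage with string "encodings" and bracketing as a stand-in combine function.
words = ["the", "couple", "is", "walking"]
print(len(btree_encode(words, lambda a, b: f"({a} {b})")))  # 7 = 2*4 - 1
```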

III The Proposed Model

The model first encodes the sentences using a normal bi-LSTM or a btree-LSTM. This is to consider the different segments of the sentence along with the context, which is an essential part of human processing as explained earlier. In the case of bi-LSTM, the encodings are augmented with the corresponding word vectors to create enhanced encodings. In the case of btree-LSTM encodings this enhancement is not done, since there is no one-to-one correspondence with the words of the sentence after the encoding. If bi-LSTM encodings are used there will be $n$ encodings for an $n$-word sentence, whereas if btree-LSTM encodings are used there will be $2n-1$ encodings (one per tree node). The btree-LSTM thus considers more possible phrasal structures (along with the context) of the input sentence compared to a bi-LSTM, as shown below,

$$E_{\text{bi}}(s) = \big\{[\overrightarrow{h}_i; \overleftarrow{h}_i; w_i]\big\}_{i=1}^{n} \quad\text{and}\quad E_{\text{btree}}(s) = \{h_v \mid v \text{ is a node of the binary tree over } s\}. \tag{2}$$

The phrase encodings in (2) ($h_i$ for the hypothesis and $p_j$ for the premise) represent the various information snippets in a sentence along with the context in which they appear. We compute these encodings for both sentences, the hypothesis and the premise. The next phase is to align the information snippets between the hypothesis and the premise, as humans do, for which one can incorporate neural attention.

III-A Attention Mechanism

The attention mechanism was introduced recently in the context of machine translation [Luong et al., 2015; Bahdanau et al., 2014], wherein words or phrases from one language have to be mapped or aligned to words or phrases in another language for the purpose of translating. We use a similar concept to learn this alignment for our purpose of NLI. Given two sets of vectors, $A = \{a_1, \ldots, a_m\}$ and $B = \{b_1, \ldots, b_n\}$, an attention value $\alpha_{ij}$ (a numerical quantity) is associated with each element $a_i$ of the first set and each element $b_j$ of the second set $B$. For all $i, j$,

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})},$$

where $e_{ij}$ is a learned compatibility score between $a_i$ and $b_j$. One can see that for all $i$, $\sum_{j} \alpha_{ij} = 1$. After learning, attention values will be high for elements that are mapped and low for other elements. For example, with bi-LSTM or btree-LSTM encodings, the encoding corresponding to the man and woman will have a high attention value for the encoding corresponding to the couple, and low attention values for other snippets, in the context of the given sentences. Given an element we can generate the attention values and sum up the elements of the second set, using the attention values as weights, to create a representation of the information that the element is interested in or aligned with in the second set. As the attention values are high only for aligned encodings, the summed-up vector from the second set will be dominated by the aligned information.

The phrase encodings of the hypothesis are aligned with the phrase encodings of the premise using an attention model as given in (3). The result of the alignment, $\tilde{p}_i$, is computed as a weighted sum of the phrase encodings of the premise, using the attention values as weights,

$$\tilde{p}_i = \sum_{j} \alpha_{ij}\, p_j, \quad\text{where}\quad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k} \exp(e_{ik})} \quad\text{and}\quad e_{ij} = \mathrm{score}(h_i, p_j). \tag{3}$$
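A minimal NumPy sketch of this alignment step follows; the dot-product score is only one plausible choice for $e_{ij}$ (the exact scoring function is not recoverable from this rendering), and the encodings used are random stand-ins.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def align(h_enc, p_enc, score):
    """For every hypothesis encoding h_i, compute attention weights over the
    premise encodings p_j and return the attended premise vectors, as in (3)."""
    aligned = []
    for h_i in h_enc:
        e = np.array([score(h_i, p_j) for p_j in p_enc])   # e_ij
        alpha = softmax(e)                                  # alpha_ij, sums to 1
        aligned.append(alpha @ np.stack(p_enc))             # weighted sum of p_j
    return np.stack(aligned)

# Usage with random encodings and a dot-product score (an assumed choice).
rng = np.random.default_rng(0)
H = [rng.normal(size=6) for _ in range(3)]   # 3 hypothesis phrase encodings
P = [rng.normal(size=6) for _ in range(4)]   # 4 premise phrase encodings
print(align(H, P, lambda a, b: float(a @ b)).shape)  # (3, 6)
```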

Now that the information snippets are aligned into pairs $(h_i, \tilde{p}_i)$, they need to be processed. Different operators have to be applied depending on the pairs and the context. All the individual results then have to be aggregated to make the final decision. We use neural network composition for this purpose.

III-B Task Composition

Often a large task can be solved by composing the results of various different sub tasks, each computed separately. Such an approach for question answering was introduced by [Andreas et al., 2016]. We adapt this approach for our purpose here. After learning the alignment of encodings, we need to perform different functions or comparisons, depending on the kind of inputs and the sentence context, to see whether they contribute positively or negatively towards the final prediction. In our example, after aligning the encoding corresponding to the man and woman with the encoding for the couple, the model has to work out whether they are equivalent. Similarly, after aligning walking to wide awake, the model has to do a different kind of processing to verify that wide awake follows from walking. Again, if in an example all birds is aligned with canary, the model might have to check for a type-of or subset-of relationship. Depending on the type of the inputs and the context of the sentence, different functions (operators or tasks) have to be applied. The operators also have to learn what they are supposed to do. Towards this purpose we introduce $K$ operators, each a two-layer feed-forward neural network with a different set of parameters. If $h_i$ and $\tilde{p}_i$ are the aligned encodings corresponding to two different text snippets, they are passed through the $K$ different two-layer feed-forward networks, and the outputs of each are weighed according to a soft gating function

$$g(h_i, \tilde{p}_i) = \mathrm{softmax}\big(W_g\, [h_i; \tilde{p}_i]\big),$$

where the $W$s are model parameters. The soft gating function helps to choose which operator has to be applied, based on the types of the snippets and the context of the sentence in which they appear. Recall from the example that different operators have to be applied to compare (the couple, the man and woman) and (walking, wide awake). This is realized by the soft gating function.

The expression for the output of this module, for the pairs from (3), is given in (4). A schematic diagram of the module is given in Figure 1.

Fig. 1: Attention and Single Task Module used after bi-LSTM and btree-LSTM encodings

$$o_i = \sum_{k=1}^{K} g_k(h_i, \tilde{p}_i)\, f_k(h_i, \tilde{p}_i), \tag{4}$$

where $f_k$ denotes the $k$-th operator network.

Each $o_i$ denotes the output for an input encoding pair; different pairs yield different $o_i$s. All the $o_i$s have to be aggregated (composed) towards the output for the final prediction. In our example, after understanding that the man and woman and the couple are equivalent and that wide awake follows from walking, the model will have two vectors, one for each pair. Both have to be considered in making the final judgement. How to aggregate the various $o_i$s has to be learned by the model. There are two parts to this. One is the order in which they have to be aggregated, if there are more than two; each ordering gives a different tree-structured computation. The second is what exactly aggregation means: in our example the aggregation is an ‘and’ operator, as both conditions have to be satisfied, while in another example it could be ‘or’, etc. Ideally, we should bring in a reinforcement learning mechanism similar to the one used in [Andreas et al., 2016] to learn the order of aggregation, and a neural network [Socher et al., 2011] for learning the aggregation operator. In our current model (for which results are discussed), we aggregate the operator outputs using a normal LSTM. Learning the aggregation order, which maps to a tree-structured computation, is envisioned as part of future work.
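The sketch below illustrates the operator-and-gating step for a single aligned pair: $K$ two-layer feed-forward "operators" are applied to the pair and blended by a softmax gate, yielding one $o_i$ per pair, as in (4). The parameter shapes, the tanh nonlinearity and the gate acting on the concatenated pair are assumptions; in the full model the resulting $o_i$s are then fed, in order, to a normal LSTM for aggregation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def two_layer_mlp(params, x):
    return np.tanh(params["W2"] @ np.tanh(params["W1"] @ x))

def gated_operator_output(h_i, p_tilde_i, ops, W_gate):
    """Apply K two-layer feed-forward 'operators' to an aligned pair and blend
    their outputs with a soft gate computed from the pair itself."""
    pair = np.concatenate([h_i, p_tilde_i])
    g = softmax(W_gate @ pair)                       # soft selection over operators
    outs = np.stack([two_layer_mlp(op, pair) for op in ops])
    return g @ outs                                  # o_i, as in (4)

# Usage with K=3 random operators on a random aligned pair.
rng = np.random.default_rng(0)
d, d_hid, d_out, K = 6, 8, 5, 3
ops = [{"W1": rng.normal(size=(d_hid, 2 * d)) * 0.1,
        "W2": rng.normal(size=(d_out, d_hid)) * 0.1} for _ in range(K)]
W_gate = rng.normal(size=(K, 2 * d)) * 0.1
o_i = gated_operator_output(rng.normal(size=d), rng.normal(size=d), ops, W_gate)
print(o_i.shape)   # (5,) -- one such vector per aligned pair, later fed to an LSTM
```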

The aggregated result $r$ is then passed through a comparison layer to do the final prediction, as shown below,

$$\hat{y} = \mathrm{softmax}(W_c\, r) \quad\text{and}\quad L = \mathrm{CE}(\hat{y}, y), \tag{5}$$

where $W_c$ is a model parameter and $\mathrm{CE}(\hat{y}, y)$ denotes the cross-entropy between the predicted distribution $\hat{y}$ and the true label $y$. We minimize this loss, averaged across the training samples, to learn the various model parameters using stochastic gradient descent [Bottou, 2012]. A schematic diagram of the complete model is given in Figure 2.
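A minimal sketch of the comparison layer and loss in (5) follows, using a single-layer softmax classifier over the aggregated vector and a one-hot cross-entropy; the label ordering and dimensions are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict_and_loss(r, y_true, W_c):
    """Comparison layer: project the aggregated representation r onto the three
    NLI classes and compute the cross-entropy loss, as in (5)."""
    y_hat = softmax(W_c @ r)
    loss = -np.log(y_hat[y_true])          # cross-entropy with a one-hot target
    return y_hat, loss

# Usage: random aggregated vector; label 0 stands for 'entailment' (assumed order).
rng = np.random.default_rng(0)
r = rng.normal(size=10)                    # e.g. the final state of the aggregating LSTM
W_c = rng.normal(size=(3, 10)) * 0.1
y_hat, loss = predict_and_loss(r, 0, W_c)
print(y_hat, loss)
```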

Fig. 2: The Complete Model: the upper tree learning (top of the figure) is envisioned as future work; in the current model a simple LSTM is used there instead

III-C Relevant Previous Work

NLI is a well studied problem with a rich literature using classical machine learning techniques. With the advent of deep learning, many models including LSTMs have been used for NLI. Recently, the Stanford Natural Language Inference (SNLI) dataset was created [Bowman et al., 2015] using crowd sourcing, and many deep learning models have been benchmarked on it for NLI. A detailed list is available at http://nlp.stanford.edu/projects/snli/. The recent thesis of Bowman [2016] covers deep learning based works in detail.

Many of the deep learning based works relied on creating encodings of the sentences using LSTMs, convolutional neural networks, gated recurrent units or variants of recursive neural networks, and then using these encodings for the final prediction task; Bowman et al. [2016], Vendrov et al. [2015] and Mou et al. [2016] are all works of this kind. Bowman et al. [2016] also introduced an efficient mechanism to learn the binary parse of the sentence along with creating encodings for the prediction task. The works of Rocktäschel et al. [2015] and Wang and Jiang [2015] used a neural attention mechanism along with LSTMs for the problem of NLI.

There are three main works in this space which claim state-of-the-art results. 1. To address the problem of compressing a lot of information into a single LSTM cell, Cheng et al. [2016] introduced Long Short Term Memory Networks (LSTMN) for natural language inference. 2. Munkhdalai and Yu [2016] introduced Neural Tree Indexers (NTI), bringing in attention over tree structures of the sentences. 3. Parikh et al. [2016] is another very recent work, which applies the attention mechanism over words, compares them and then aggregates the results. As explained earlier, we consider attention over possible sentence segment encodings (taking context into account), subtask division, operator selection and learning, and aggregation learning. Our model is aligned with the human thought process, hence very intuitive, and achieves state-of-the-art results.

IV Experiments and Evaluation

The model was implemented in TensorFlow [Abadi et al., 2015], an open-source library for numerical computation for deep learning. All experiments were carried out on a Dell Precision Tower 7910 server with an Nvidia Titan X GPU. The models were trained using the Adam optimizer [Kingma and Ba, 2014] in a stochastic gradient descent [Bottou, 2012] fashion. We used batch normalization [Ioffe and Szegedy, 2015] while training. The various model parameters used are listed in Table I.

We experimented with both GloVe vectors trained on the Common Crawl dataset (http://nlp.stanford.edu/data/glove.840B.300d.zip) and Word2Vec vectors trained on the Google News dataset (https://code.google.com/archive/p/word2vec/). We used the Google News trained Word2Vec word embeddings for the final reported results. Before matching a word in the dataset with a word in the Word2Vec collection, we converted all characters to lower case. The word embeddings are not trained along with the model; however, before using them in our model, we transform the embeddings using a learnable single-layer neural network ($Ww$, where $W$ is a model parameter and $w$ is the word embedding). For out-of-vocabulary words, we assigned random word vectors, with each element sampled from a normal distribution. This decision was taken after observing that the Word2Vec vector elements are approximately normally distributed.
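The out-of-vocabulary handling can be sketched as below; the mean and standard deviation are placeholders (the specific values used were not recoverable from this rendering), and caching one fixed random vector per unseen word is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
MU, SIGMA, DIM = 0.0, 0.1, 300   # placeholder parameters of the assumed normal law

def embed(word, vocab_vectors, _oov_cache={}):
    """Look up a lower-cased word; out-of-vocabulary words get a fixed random
    vector with each element sampled from N(MU, SIGMA^2)."""
    word = word.lower()
    if word in vocab_vectors:
        return vocab_vectors[word]
    if word not in _oov_cache:
        _oov_cache[word] = rng.normal(MU, SIGMA, size=DIM)
    return _oov_cache[word]

# Usage with a toy vocabulary.
vocab = {"couple": np.ones(DIM)}
print(embed("Couple", vocab)[:3], embed("seashore", vocab)[:3])
```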

Parameter Name Value
Word Vector Dimension 300
Sequence Length 64
bi-LSTM Hidden State Dimension 300
btree-LSTM Hidden State Dimension 300
Operator Count 11
Batch Size 40
Batch Norm init. value 0.001
TABLE I: Model & Training Parameters

There are two main datasets available in the public domain for NLI. The Sentences Involving Compositional Knowledge (SICK) dataset [Marelli et al., 2014b], from the SemEval-2014 task [Marelli et al., 2014a], involves predicting the degree of relatedness between two sentences and detecting the entailment relation holding between them. SICK consists of 10,000 sentence pairs manually labelled for relatedness and entailment. We experimented with this dataset and obtained very good results; however, with the dataset being small and the model having a large number of parameters, overfitting could have occurred. As benchmarks are not available for the other state-of-the-art models on SICK, we do not include our results on this dataset. The Stanford Natural Language Inference (SNLI) dataset [Bowman et al., 2015] contains 570k human-written English sentence pairs manually labelled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference. We present our results on this dataset in comparison with other state-of-the-art models.
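For reference, a small loader for SNLI in its distributed JSONL format is sketched below; the field names follow the public release (pairs without annotator consensus carry the gold label "-"), and the path is illustrative.

```python
import json

def load_snli(path):
    """Yield (premise, hypothesis, gold_label) triples from an SNLI JSONL split,
    skipping pairs for which annotators reached no consensus."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            ex = json.loads(line)
            if ex["gold_label"] == "-":
                continue
            yield ex["sentence1"], ex["sentence2"], ex["gold_label"]

# Usage (path is illustrative):
# train_pairs = list(load_snli("snli_1.0/snli_1.0_train.jsonl"))
```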

Model Train Accuracy Test Accuracy #Parameters
Classifier (hand-crafted features) [Bowman et al., 2015] 99.7 78.2 -
GRU encoders [Vendrov et al., 2015] 98.8 81.4 15.0M
Tree-based CNN encoders [Mou et al., 2016] 83.3 82.1 3.5M
SPINN-NP encoders [Bowman et al., 2016] 89.2 83.2 3.7M
LSTM with attention [Rocktäschel et al., 2015] 85.3 83.5 252K
mLSTM [Wang and Jiang, 2015] 92.0 86.1 1.9M
LSTM Networks [Cheng et al., 2016] 88.5 86.3 3.4M
word-word attention and aggregation [Parikh et al., 2016] 90.5 86.8 582K
NTI with global attention [Munkhdalai and Yu, 2016] 88.5 87.3 -
Our model with bi-LSTM encoders 89.8 86.4 6M
Our model with btree-LSTM encoders 88.6 87.6 2M
TABLE II: Comparison Results on SNLI Dataset

The comparison results of various models on the SNLI dataset are given in Table II. One can see that our model with bi-LSTM encodings has better accuracy than all published results, and falls only slightly short of the results reported in the not-yet-published works [Parikh et al., 2016; Munkhdalai and Yu, 2016]. The model with btree-LSTM encodings has better accuracy than all of these models. The class-level accuracy results of various models on the SNLI dataset are given in Table III.

Method N E C
SPINN-NP encoders [Bowman et al., 2016] 80.6 88.2 85.5
mLSTM [Wang and Jiang, 2015] 81.6 91.6 87.4
word-word attention and aggregation [Parikh et al., 2016] 83.7 92.1 86.7
Our model with bi-LSTM encoders 84.3 90.6 86.9
Our model with btree-LSTM encoders 84.8 93.2 87.4
TABLE III: Class Level Accuracy. N: Neutral Class, E: Entailment Class, C: Contradiction

V Conclusion & Future Work

We presented a complete deep learning model for the problem of natural language inference. The model uses deep learning constructs such as LSTM variants, the attention mechanism and composable neural networks to mimic how humans perform natural language inference. The model is end-to-end differentiable, enabling training by simple stochastic gradient descent. From the initial experiments, the model achieves better accuracy than all the published models. The model is interpretable and in close alignment with the human process of performing NLI, unlike other, more opaque deep learning models. We expect further experiments and hyper-parameter tuning to improve these results further.

There are several possible enhancements to the model and potential future directions. The btree-LSTM currently uses a complete binary tree structure formed by pairing neighbouring encodings recursively. A binary tree learning scheme similar to [Bowman et al., 2016] could be incorporated into the model. Tree construction based on the ordering of attention values would lead to heap-like structures; we are currently working on this model, which we have named Heap-LSTM. Currently the model uses soft gating for operator selection; hard selection with an appropriate learning mechanism is something that has to be explored. The model aggregates the operator outputs using a simple LSTM; learning the aggregation tree structure with an appropriate mechanism similar to [Andreas et al., 2016] is another major stream of work.

The alignment of the model with the human thought process, and the fact that it already achieves better accuracy than all published models from initial experiments alone, both advocate exploring model enhancements in these directions.

References