Natural language inference (NLI) is an important task in natural language processing (NLP). It concerns whether a hypothesis can be inferred from a premise, requiring an understanding of the semantic similarity between the hypothesis and the premise to discriminate their relation (lan2018toolkit, ). Table 1 shows several samples of natural language inference from the SNLI (Stanford Natural Language Inference) corpus (bowman2015large, ).
In the literature, the task of NLI is usually viewed as relation classification: a model learns the relation between a premise and a hypothesis on a large training set, then predicts the relation of a new premise-hypothesis pair. Existing NLI methods can be roughly divided into two categories: feature-based models (bowman2015large, ) and neural network-based models (yang2016hierarchical, ; chen2016enhanced, ). Feature-based models represent a premise and a hypothesis by their unlexicalized and lexicalized features, such as $n$-gram length and the real-valued feature of length difference, then train a classifier to perform relation classification. Recently, end-to-end neural network-based models have drawn worldwide attention since they have demonstrated excellent performance on quite a few NLP tasks including machine translation and natural language inference.
| Premise | Hypothesis | Label |
| --- | --- | --- |
| Wet brown dog swims towards camera. | A dog is playing fetch in a pond. | neutral |
| Wet brown dog swims towards camera. | A dog is in the water. | entailment |
| Wet brown dog swims towards camera. | The dog is sleeping in his bed. | contradiction |
On the basis of their model structures, neural network-based models for NLI can be divided into two classes (lan2018toolkit, ): sentence encoding models and sentence interaction-aggregation models. The architectures of the two types of models are shown in Figure 1.
Sentence encoding models (their main architecture is shown in Figure 1.a) independently encode a pair of sentences, a premise and a hypothesis, using pre-trained word embedding vectors, then learn the semantic relation between the two sentences with a multi-layer perceptron (MLP). In these models, LSTM (Long Short-Term Memory networks) (cheng2016long, ) and its variants GRU (Gated Recurrent Units) (chung2014empirical, ) and Bi-LSTM are usually utilized to encode the sentences, since they are capable of learning long-term dependencies inside sentences. For example, Conneau et al.
proposed a generic NLI training scheme and compared several sentence encoding architectures: LSTM or GRU, Bi-LSTM with mean/max pooling, self-attention networks and hierarchical convolutional networks (conneau2017supervised, ). The experimental results demonstrated that the Bi-LSTM with max pooling achieved the best performance. Talman et al. designed a hierarchical Bi-LSTM max pooling (HBMP) model to encode sentences (talman2018natural, ). This model applies the parameters of one Bi-LSTM to initialize the next Bi-LSTM to convey information, which showed better results than a model with a single Bi-LSTM. Besides LSTM, attention mechanisms can also be used to boost the effectiveness of sentence encoding. The model developed by Ghaeini et al. added self-attention to an LSTM model and achieved better performance (ghaeini2018dr, ).
Sentence interaction-aggregation models (parikh2016decomposable, ; wang2017bilateral, ; kim2018semantic, ; lan2018toolkit, ) (their main architecture is shown in Figure 1.b) learn vector representations of pairs of sentences in a way similar to sentence encoding models, calculate a pairwise word interaction matrix between the two sentences using the newly updated word vectors, and then aggregate the matching results into a vector to make the final decision. Compared with sentence encoding models, sentence interaction-aggregation models aggregate word similarities between a pair of sentences and are thus capable of capturing the relevant information between the two sentences, a premise and a hypothesis. Bahdanau et al. translated and aligned text simultaneously in a machine translation task (bahdanau2014neural, ), innovatively introducing the attention mechanism to natural language processing (NLP). He et al. designed a pairwise word interaction model (PWIM) (he2016pairwise, ), which made full use of word-level fine-grained information. Wang et al. put forward a bilateral multi-perspective matching (BiMPM) model (wang2017bilateral, ), focusing on various matching strategies that can be seen as different types of attention. The empirical studies of Lan et al. (lan2018toolkit, ) and Chen et al. (chen2016enhanced, ) concluded that sentence interaction-aggregation models, especially ESIM (Enhanced Sequential Inference Model), a carefully designed sequential inference model based on chain LSTMs, outperformed all previous sentence encoding models.
Although ESIM has achieved excellent results, it does not consider attention over the words in a sentence in its Bi-LSTM layers. Word attention can characterize the different contribution of each word, so it is beneficial to add word attention to the Bi-LSTM layer. Moreover, the orientation of the words, which represents the direction of the information flow (either forward or backward), should not be ignored. In the traditional Bi-LSTM model, the forward and backward vectors learned by the Bi-LSTM are simply concatenated. It is necessary to consider whether each orientation (forward or backward) has a different importance for word encoding, and thus to adaptively join the two orientation vectors with different weights. Therefore, in this study, using the ESIM model as the baseline, we add an attention layer behind each Bi-LSTM layer, then use an adaptive orientation embedding layer to jointly represent the forward and backward vectors. We name this attention-boosted Bi-LSTM Bi-aLSTM, and denote the modified ESIM as aESIM. Experimental results on the SNLI, MultiNLI (williams2017broad, ) and Quora (wang2017bilateral, ) benchmarks demonstrate the better performance of the aESIM model over the baseline ESIM and other state-of-the-art models. We believe that the architecture of Bi-aLSTM has the potential to be used in other NLP tasks such as text classification, machine translation and so on.
This paper is organized as follows. We introduce the general frameworks of ESIM and aESIM in Section 2. We describe the datasets and the experiment settings, and analyze our experimental results in Section 3. We then draw conclusions in Section 4.
2. Attention Boosted Sequential Inference Model
Suppose that we have two sentences $a = (a_1, \ldots, a_{\ell_a})$ and $b = (b_1, \ldots, b_{\ell_b})$, where $a$ represents the premise and $b$ represents the hypothesis. The goal is to predict a label $y$ denoting their relation.
2.1. ESIM model
The Enhanced Sequential Inference Model (ESIM) (chen2016enhanced, ) is composed of four main components: an input encoding layer, a local inference modeling layer, an inference composition layer and a classification layer.
In the input encoding layer, ESIM first uses a Bi-LSTM layer to encode the input sentence pair (Equations 1-2), which can be initialized using pre-trained word embeddings (e.g. Glove 840B vectors (pennington2014glove, )):

$$\bar{a}_i = \mathrm{BiLSTM}(a, i), \quad i \in [1, \ldots, \ell_a] \tag{1}$$

$$\bar{b}_j = \mathrm{BiLSTM}(b, j), \quad j \in [1, \ldots, \ell_b] \tag{2}$$

where $a_i$ is the word embedding vector of the $i$-th word in $a$, and $b_j$ is that of the $j$-th word in $b$.
Secondly, ESIM implements the local inference layer to enhance the sentence information. First it calculates a similarity matrix $e_{ij} = \bar{a}_i^{\top}\bar{b}_j$ based on $\bar{a}$ and $\bar{b}$.
It then gets the new expressions for $\bar{a}$ and $\bar{b}$ with the equations below:

$$\tilde{a}_i = \sum_{j=1}^{\ell_b}\frac{\exp(e_{ij})}{\sum_{k=1}^{\ell_b}\exp(e_{ik})}\bar{b}_j, \qquad \tilde{b}_j = \sum_{i=1}^{\ell_a}\frac{\exp(e_{ij})}{\sum_{k=1}^{\ell_a}\exp(e_{kj})}\bar{a}_i$$

where $\tilde{a}_i$ and $\tilde{b}_j$ represent the weighted summations of $\bar{b}$ and $\bar{a}$, respectively. It further enhances the collected local inference information as below:

$$m_a = [\bar{a}; \tilde{a}; \bar{a}-\tilde{a}; \bar{a}\odot\tilde{a}], \qquad m_b = [\bar{b}; \tilde{b}; \bar{b}-\tilde{b}; \bar{b}\odot\tilde{b}]$$
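The soft alignment and enhancement steps can be sketched with NumPy (a minimal illustration of the computation, not the trained model; function names are ours):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_inference(a_bar, b_bar):
    """Soft-align two encoded sentences and build the enhanced features.

    a_bar: (la, d) Bi-LSTM states of the premise
    b_bar: (lb, d) Bi-LSTM states of the hypothesis
    """
    e = a_bar @ b_bar.T                       # (la, lb) similarity matrix
    a_tilde = softmax(e, axis=1) @ b_bar      # premise attends over hypothesis
    b_tilde = softmax(e, axis=0).T @ a_bar    # hypothesis attends over premise
    # Enhancement: concatenate states, attended states, difference, product.
    m_a = np.concatenate([a_bar, a_tilde, a_bar - a_tilde, a_bar * a_tilde], axis=1)
    m_b = np.concatenate([b_bar, b_tilde, b_bar - b_tilde, b_bar * b_tilde], axis=1)
    return m_a, m_b
```

Note that the enhanced vectors are four times the width of the encoder states, matching the concatenation of the four feature groups.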
After the enhancement of local inference, another Bi-LSTM layer is used to capture local inference information and their context for inference composition.
Instead of the summation adopted by Parikh et al. (parikh2016decomposable, ), ESIM computes both max and average pooling and feeds the concatenated fixed-length vector to the final classifier: a fully connected multi-layer perceptron.
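The pooling step can be sketched as follows (a simplified sketch; the inputs stand for the composed sequences produced by the second Bi-LSTM, and the function name is ours):

```python
import numpy as np

def compose_pool(v_a, v_b):
    """Max- and average-pool the composed states of both sentences and
    concatenate them into the fixed-length vector fed to the final MLP."""
    feats = [v_a.mean(axis=0), v_a.max(axis=0),
             v_b.mean(axis=0), v_b.max(axis=0)]
    return np.concatenate(feats)  # shape: (4 * d,)
```

Pooling makes the classifier input independent of sentence length, which is why a plain MLP can follow it.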
2.2. aESIM model
The overall architecture of our newly proposed attention boosted sequential inference model (named aESIM) is similar to that of ESIM. In detail, aESIM also consists of four main parts: an encoding layer, a local inference modeling layer, a decoding layer and a classification layer. The only difference between ESIM and aESIM is that we substitute two Bi-aLSTM layers for the two Bi-LSTM layers (LSTM1 and LSTM2) in ESIM. Therefore, as illustrated in Figure 2, the layers marked with red dotted circles in ESIM are replaced by the Bi-aLSTM layers shown in the upper right corner of Figure 2; the details of Bi-aLSTM are given in Figure 3.
Given the word vector $x_i$ of the $i$-th word in a sentence, which can be obtained from pre-trained word embeddings such as Glove 840B vectors (pennington2014glove, ) in the first Bi-aLSTM layer, or from the local inference modeling layer in the second Bi-aLSTM layer, we utilize a forward LSTM layer and a backward LSTM layer to collect the information of both directions, $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$.
As described in the introduction, in the newly proposed Bi-aLSTM we add word attention and an adaptive direction operation on both orientations of the traditional Bi-LSTM layer.
Word attention layer
It is obvious that not all words contribute equally to the representation of a sentence. The attention mechanism introduced in (yang2016hierarchical, ) is extremely effective at extracting vital words from a sentence and is particularly beneficial for generating the sentence vector. Therefore, we use the following attention mechanism after we obtain $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$.
Suppose $h_i \in \{\overrightarrow{h}_i, \overleftarrow{h}_i\}$; we then have

$$u_i = \tanh(W_w h_i + b_w)$$

$$\alpha_i = \frac{\exp(u_i^{\top} u_w)}{\sum_{j}\exp(u_j^{\top} u_w)}$$

$$h'_i = \alpha_i h_i$$

where $u_i$ is obtained by a one-layer MLP for the input $h_i$; $\alpha_i$, the importance of word $i$, is calculated by the SoftMax unit on the context vector $u_w$ of the sentence, which is randomly initialized and updated during training; and $h'_i$ is the attention-enhanced vector obtained by multiplying the weight $\alpha_i$ and the original vector $h_i$, where $h_i$ corresponds to the forward vector $\overrightarrow{h}_i$ or the backward vector $\overleftarrow{h}_i$, respectively.
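A minimal NumPy sketch of this word attention layer (the parameters stand for the MLP weight, bias and sentence context vector; in practice they are trainable, here they are plain arrays):

```python
import numpy as np

def word_attention(h, W, b, u_w):
    """Hierarchical-attention-style word weighting over one direction.

    h:    (l, d) hidden states of one direction (forward or backward)
    W, b: parameters of the one-layer MLP
    u_w:  (d,) sentence context vector, trainable in the real model
    Returns the attention-enhanced vectors alpha_i * h_i.
    """
    u = np.tanh(h @ W + b)            # (l, d) hidden representation of each word
    scores = u @ u_w                  # (l,) relevance of each word to the context
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()       # softmax word importance weights
    return alpha[:, None] * h         # re-weighted word vectors
```

When the context vector is zero, every word gets the uniform weight $1/\ell$, so the layer degrades gracefully to plain averaging of importance.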
Adaptive word direction layer
In the traditional Bi-LSTM model, the forward and backward vectors of a word are considered to have equal importance for the word representation: the model simply concatenates the forward and backward vectors without weighing their importance. For a word read in different directions, the preceding and following words are reversed, so the two direction vectors of a word make different contributions to its representation, especially for words in a long sentence. Therefore, we propose a new adaptive direction layer to learn the contribution of each direction for a single word.
Formally, given the two direction word vectors $\overrightarrow{h}'_i$ and $\overleftarrow{h}'_i$, the whole word vector $h_i$ can be expressed as:

$$h_i = f(W_d[\overrightarrow{h}'_i; \overleftarrow{h}'_i] + b_d)$$

where $W_d$ and $b_d$ denote the weight matrix and the bias, $f$ denotes the nonlinear function, and $[\cdot;\cdot]$ denotes the concatenation. All the parameters are learned during training. Then we get the whole sentence vector as below:

$$h = (h_1, h_2, \ldots, h_\ell)$$
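The adaptive fusion of the two directions can be sketched as follows (an illustrative sketch: the nonlinearity is unspecified above, so tanh is used as an example, and argument names are ours):

```python
import numpy as np

def adaptive_direction(h_fwd, h_bwd, W, b):
    """Adaptively fuse forward and backward word vectors.

    Instead of plain concatenation, a learned affine map followed by a
    nonlinearity lets the model weigh the two orientations per dimension.
    h_fwd, h_bwd: (l, d) attention-enhanced direction vectors
    W: (2d, d) weight matrix, b: (d,) bias
    """
    concat = np.concatenate([h_fwd, h_bwd], axis=1)  # (l, 2d) joined vectors
    return np.tanh(concat @ W + b)                   # (l, d) fused word vectors
```

Note the output keeps the per-word dimension $d$ rather than $2d$, since the learned map mixes the two orientations into a single vector.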
This word- and orientation-enhanced Bi-LSTM is called Bi-aLSTM. Its whole architecture is shown in Figure 3, and it is applied in the ESIM model to replace the two Bi-LSTM layers for the task of natural language inference. Besides, Bi-aLSTM can be used in other natural language processing tasks; our preliminary experiments have demonstrated that Bi-aLSTM is capable of improving the performance of Bi-LSTM models on a sentiment classification task (for space limitations, these results are not shown in this paper).
3. Experiment Setup
We evaluated our model on three datasets: the Stanford Natural Language Inference (SNLI) corpus, the Multi-Genre Natural Language Inference (MultiNLI) corpus, and the Quora duplicate question dataset. We selected these three relatively large corpora out of the eight corpora in (lan2018toolkit, ), since deep learning models usually show better generalization ability and produce more convincing results on large training sets than on small ones.
SNLI The Stanford Natural Language Inference (SNLI) corpus contains 570,152 sentence pairs, including 549K training pairs, 10K validation pairs and 10K testing pairs. Each pair has one of four relation classes (entailment, neutral, contradiction and ‘-’). The ‘-’ class indicates that no conclusion can be drawn between the two sentences; consequently, we remove all pairs with relation ‘-’ during training, validation and testing.
MultiNLI This corpus is a crowd-sourced collection of 433K sentence pairs annotated with textual entailment information. The corpus is modeled on the SNLI corpus, but differs in that it covers a range of genres of spoken and written text, and supports a distinctive cross-genre generalization evaluation.
Quora The Quora dataset contains 400,000 question pairs. The task on this corpus is to judge whether the two questions have the same meaning.
We use the validation set to select models for testing. The hyper-parameters of the aESIM model are listed as follows. We use the Adam method (kingma2014adam, ) for optimization, with the first momentum set to 0.9 and the second to 0.999. The initial learning rate is 0.0005 and the batch size is 128. The dimensions of all hidden states of Bi-aLSTM and of the word embeddings are 300. We employ the SELU non-linearity (klambauer2017self, ) in place of the rectified linear unit (ReLU) on account of its faster convergence rate. The dropout rate is set to 0.2 during training. We use pre-trained 300-D Glove 840B vectors (pennington2014glove, ) to initialize the word embeddings. Out-of-vocabulary (OOV) words are initialized randomly with Gaussian samples. All vectors are updated during training.
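For convenience, the settings above can be collected into one configuration sketch (key names are ours; values are those reported in this section):

```python
# Hyper-parameters reported in this section (key names are illustrative).
AESIM_CONFIG = {
    "optimizer": "adam",
    "beta_1": 0.9,             # first momentum
    "beta_2": 0.999,           # second momentum
    "learning_rate": 0.0005,   # initial learning rate
    "batch_size": 128,
    "hidden_dim": 300,         # Bi-aLSTM hidden state dimension
    "embedding_dim": 300,      # Glove 840B vectors
    "activation": "selu",      # replaces ReLU for faster convergence
    "dropout": 0.2,
    "train_embeddings": True,  # all vectors updated during training
}
```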
3.3. Experiment results
Table 2 compares our model with previous models on the SNLI corpus. The method in the first block is a traditional feature engineering method, those in the second block are sentence vector-based models, those in the third block are attention-based models, and ESIM and our aESIM are shown in the fourth block. The results of ESIM and aESIM are from our own implementations on Keras; the results of the other models are taken from their original publications. We then compare the baseline models CBOW and Bi-LSTM with ESIM and our aESIM on the MultiNLI corpus in Table 3, where the results of the baselines are taken from (williams2017broad, ). Finally, we compare several types of CNN and RNN models on the Quora corpus in Table 4, where the results of these CNN and RNN models are taken from (wang2017bilateral, ). The accuracy (ACC) of each method is measured by the commonly used precision score (https://nlp.stanford.edu/projects/snli/), and the methods with the best accuracy are marked in bold.
| Model | Accuracy (%) |
| --- | --- |
| Unlexicalized + Unigram and bigram features (bowman2015large, ) | 78.2 |
| 300D LSTM encoders (bowman2015large, ) | 80.6 |
| 300D NTI-SLSTM-LSTM encoders (munkhdalai2017neural, ) | 83.4 |
| 4096D Bi-LSTM with max-pooling (conneau2017supervised, ) | 84.5 |
| 300D Gumbel TreeLSTM encoders (choi2018learning, ) | 85.6 |
| 512D Dynamic Meta-Embeddings (kiela2018dynamic, ) | 86.7 |
| 100D DF-LSTM (liu2016deep, ) | 84.6 |
| 300D LSTMN with deep attention fusion (cheng2016long, ) | 85.7 |
| BiMPM (wang2017bilateral, ) | 87.5 |
According to the results in Tables 2-4, the aESIM model achieved 88.1% on the SNLI corpus, 0.8 percentage points higher than the ESIM model. It improved accuracy by almost 0.5 percentage points and outperformed the baselines on MultiNLI, and it achieved 88.01% on Quora. Therefore, we conclude that aESIM, with its additional word attention and word orientation operations, is superior to the ESIM model.
3.4. Attention visualization
We selected three types of sentence pairs, a premise and its three hypothesis sentences, from the test set of the SNLI corpus, as shown in Figure 4. The premise is ‘A woman with a green headscarf, blue shirt and a very big grin’, and the three hypotheses are ‘the woman has been shot’, ‘the woman is very happy’ and ‘the woman is young’, with relation labels ‘contradiction’, ‘entailment’ and ‘neutral’, respectively. Each pair of sentences has its key word pair: grin-shot, grin-happy and grin-young, which determines whether the premise can entail the hypothesis. Figures 4.a-4.c visualize the attention layer between sentence pairs after the Bi-LSTM layer in the ESIM model and after the Bi-aLSTM layer in the aESIM model, contrasting ESIM and aESIM. By doing so, we can understand how the models judge the relation between two sentences.
In each figure, the brighter the color, the higher the weight. We can conclude that our aESIM model assigned higher weights than the ESIM model to each key word pair; this is especially visible in Figure 4.b, where the similarity of ‘happy’ and ‘grin’ in the aESIM model is much higher than that in the ESIM model. Therefore, our aESIM model is able to capture the most important word pair in each pair of sentences.
4. Conclusions
In this study, we propose an improved version of ESIM named aESIM for NLI. It modifies the Bi-LSTM layer to collect more information. We evaluate our aESIM model on three NLI corpora. Experimental results show that the aESIM model achieves better performance than the ESIM model. In the future, we will evaluate how attention mechanisms can be applied to other tasks and explore ways to reduce time and space costs with guaranteed accuracy.
This work is supported in part by the National Natural Science Foundation of China (Nos. 61876016 and 61632004) and the Fundamental Research Funds for the Central Universities (No. 2018JBZ006).
- (1) W. Lan and W. Xu, “Neural network models for paraphrase identification, semantic textual similarity, natural language inference, and question answering,” in Proceedings of COLING 2018, 2018.
- (2) S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning, “A large annotated corpus for learning natural language inference,” arXiv preprint arXiv:1508.05326, 2015.
- (3) Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, “Hierarchical attention networks for document classification,” in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480–1489, 2016.
- (4) Q. Chen, X. Zhu, Z. Ling, S. Wei, H. Jiang, and D. Inkpen, “Enhanced lstm for natural language inference,” arXiv preprint arXiv:1609.06038, 2016.
- (5) A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes, “Supervised learning of universal sentence representations from natural language inference data,” arXiv preprint arXiv:1705.02364, 2017.
- (6) A. Talman, A. Yli-Jyrä, and J. Tiedemann, “Natural language inference with hierarchical bilstm max pooling architecture,” arXiv preprint arXiv:1808.08762, 2018.
- (7) J. Im and S. Cho, “Distance-based self-attention network for natural language inference,” arXiv preprint arXiv:1712.02047, 2017.
- (8) T. Shen, T. Zhou, G. Long, J. Jiang, S. Wang, and C. Zhang, “Reinforced self-attention network: a hybrid of hard and soft attention for sequence modeling,” arXiv preprint arXiv:1801.10296, 2018.
- (9) J. Cheng, L. Dong, and M. Lapata, “Long short-term memory-networks for machine reading,” arXiv preprint arXiv:1601.06733, 2016.
- (10) J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
- (11) R. Ghaeini, S. A. Hasan, V. Datla, J. Liu, K. Lee, A. Qadir, Y. Ling, A. Prakash, X. Z. Fern, and O. Farri, “Dr-bilstm: Dependent reading bidirectional lstm for natural language inference,” arXiv preprint arXiv:1802.05577, 2018.
- (12) A. P. Parikh, O. Täckström, D. Das, and J. Uszkoreit, “A decomposable attention model for natural language inference,” arXiv preprint arXiv:1606.01933, 2016.
- (13) Z. Wang, W. Hamza, and R. Florian, “Bilateral multi-perspective matching for natural language sentences,” arXiv preprint arXiv:1702.03814, 2017.
- (14) S. Kim, J.-H. Hong, I. Kang, and N. Kwak, “Semantic sentence matching with densely-connected recurrent and co-attentive information,” arXiv preprint arXiv:1805.11360, 2018.
- (15) D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
- (16) H. He and J. Lin, “Pairwise word interaction modeling with deep neural networks for semantic similarity measurement,” in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 937–948, 2016.
- (17) A. Williams, N. Nangia, and S. R. Bowman, “A broad-coverage challenge corpus for sentence understanding through inference,” arXiv preprint arXiv:1704.05426, 2017.
- (18) J. Pennington, R. Socher, and C. Manning, “Glove: Global vectors for word representation,” in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543, 2014.
- (19) D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- (20) G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, “Self-normalizing neural networks,” in Advances in Neural Information Processing Systems, pp. 971–980, 2017.
- (21) T. Munkhdalai and H. Yu, “Neural tree indexers for text understanding,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, vol. 1, p. 11, NIH Public Access, 2017.
- (22) J. Choi, K. M. Yoo, and S.-g. Lee, “Learning to compose task-specific tree structures,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
- (23) D. Kiela, C. Wang, and K. Cho, “Dynamic meta-embeddings for improved sentence representations,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1466–1477, 2018.
- (24) P. Liu, X. Qiu, J. Chen, and X. Huang, “Deep fusion lstms for text semantic matching,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, vol. 1, pp. 1034–1043, 2016.