Neural Network Models for Natural Language Inference Fail to Capture the Semantics of Inference

Neural network models have been very successful for natural language inference, with the best models reaching 90% accuracy in some of the benchmarks. However, the success of these models turns out to be largely task specific. We show that models trained on one inference task fail to perform well in others, even if the notion of inference assumed in these tasks is the same or similar. We train five state-of-the-art neural network models on different datasets and show that each of these fails to generalize outside of the respective task. In light of these results we conclude that the current neural network models are not able to generalize in capturing the semantics of natural language inference, but seem to be overfitting to the specific dataset.




1 Introduction

Natural Language Inference (NLI) has attracted considerable interest in the NLP community and, recently, a large number of neural network-based systems have been proposed to deal with the task. These approaches can usually be categorized into: a) sentence encoding models, and b) other neural network models. Both have been very successful, with the state of the art on the SNLI and MultiNLI datasets being 90.1% Kim et al. (2018) and 86.7% Devlin et al. (2018) respectively. However, a big question with respect to these systems is their ability to generalize outside the specific datasets they are trained and tested on. Recently, Glockner et al. (2018) have shown that state-of-the-art NLI systems break considerably easily when, instead of being tested on the original SNLI test set, they are tested on a test set constructed by taking premises from the training set and creating several hypotheses from them by changing at most one word within the premise. The results show a very significant drop in accuracy for three of the four systems. The system that was most difficult to break, with the smallest loss in accuracy, was the one by Chen et al. (2018), which utilizes external knowledge taken from WordNet Miller (1995).

In this paper we show that NLI systems that have been very successful on specific NLI benchmarks fail to generalize when trained on one NLI dataset and then tested across different NLI benchmarks. Our results are in line with Glockner et al. (2018), showing that the generalization capability of individual NLI systems is very limited. Moreover, they show that the only system that was less prone to breaking in Glockner et al. (2018) breaks in our experiments as well.

2 Related Work

Skepticism about the ability of NLI systems to generalize has been raised in a recent paper by Glockner et al. (2018). There, the authors show that the generalization capability of state-of-the-art NLI systems, in cases where some kind of external lexical knowledge is needed, drops dramatically when the SNLI test set is replaced by a test set where the premise and the hypothesis are otherwise identical except for at most one word. The results show a very significant drop in accuracy. Kang et al. (2018) recognize the generalization problem that comes with training on datasets like SNLI, which tend to be homogeneous in terms of linguistic variation. In this context, they propose to better train NLI models by making use of adversarial examples. Gururangan et al. (2018) show that datasets like SNLI and MultiNLI contain unintentional annotation artifacts which help neural network models in classification. On a theoretical and methodological level, there is discussion on the nature of various NLI datasets, as well as the definition of what counts as NLI and what does not. For example, Chatzikyriakidis et al. (2017) present an overview of the most standard datasets for NLI and show that the definitions of inference in each of them are actually quite different.

3 Experimental Setup

3.1 Data

We chose three different datasets for the experiments: SNLI, MultiNLI and SICK. All of them have been designed for NLI involving three-way classification, and they use the same three labels: entailment, neutral and contradiction. We did not include any datasets with two-way classification, e.g. SciTail Khot et al. (2018). As SICK is a relatively small dataset, with only approximately 10k sentence pairs, we did not use it as training data in any experiment. We also trained our models with a combined SNLI + MultiNLI training set.

All the experimental combinations are listed in Table 1. Examples from the selected dataset are provided in Table 2. We describe the three datasets in more detail below.

Train data Dev data Test data
SNLI SNLI SNLI
SNLI SNLI MultiNLI-m
SNLI SNLI SICK
MultiNLI MultiNLI-m MultiNLI-m
MultiNLI MultiNLI-m SNLI
MultiNLI MultiNLI-m SICK
SNLI + MultiNLI SNLI SICK
Table 1: List of all the combinations of data used in the experiments. The rows highlighted in bold are baseline experiments, where the test data comes from the same corpus as the training and development data.


The Stanford Natural Language Inference (SNLI) corpus Bowman et al. (2015) is a dataset of 570k human-written sentence pairs manually labeled as entailment, contradiction or neutral. The dataset is divided into training (550,152 pairs), development (10,000 pairs) and test (10,000 pairs) sets. The premise sentences in SNLI were drawn from image captions taken from the Flickr30k corpus Young et al. (2014).


The Multi-Genre Natural Language Inference (MultiNLI) corpus Williams et al. (2018) is a broad-coverage corpus for NLI, consisting of 433k human-written sentence pairs labeled with entailment, contradiction and neutral. Unlike the SNLI corpus, which draws the premise sentence from image captions, MultiNLI consists of sentence pairs from ten distinct genres of both written and spoken English. The dataset is divided into training (392,702 pairs), development (20,000 pairs) and test sets (20,000 pairs).

Only five genres are included in the training set. The development and test sets have been divided into matched and mismatched, where the former includes only sentences from the same genres as the training data, and the latter includes sentences from the remaining genres not present in the training data. We used the matched development set (MultiNLI-m) for the experiments.

entailment SICK
  Premise: A person, who is riding a bike, is wearing gear which is black
  Hypothesis: A biker is wearing gear which is black
entailment SNLI
  Premise: A young family enjoys feeling ocean waves lap at their feet.
  Hypothesis: A family is at the beach.
entailment MultiNLI
  Premise: Kal tangled both of Adrin’s arms, keeping the blades far away.
  Hypothesis: Adrin’s arms were tangled, keeping his blades away from Kal.
contradiction SICK
  Premise: There is no man wearing a black helmet and pushing a bicycle
  Hypothesis: One man is wearing a black helmet and pushing a bicycle
contradiction SNLI
  Premise: A man with a tattoo on his arm staring to the side with vehicles and buildings behind him.
  Hypothesis: A man with no tattoos is getting a massage.
contradiction MultiNLI
  Premise: Also in Eustace Street is an information office and a cultural center for children, The Ark.
  Hypothesis: The Ark, a cultural center for kids, is located in Joyce Street.
neutral SICK
  Premise: A little girl in a green coat and a boy holding a red sled are walking in the snow
  Hypothesis: A child is wearing a coat and is carrying a red sled near a child in a green and black coat
neutral SNLI
  Premise: An old man with a package poses in front of an advertisement.
  Hypothesis: A man poses in front of an ad for beer.
neutral MultiNLI
  Premise: Enthusiasm for Disney’s Broadway production of The Lion King dwindles.
  Hypothesis: The broadway production of The Lion King was amazing, but audiences are getting bored.
Table 2: Example sentence pairs from the three datasets.


SICK (Marelli et al., 2014) is a dataset that was originally constructed to test compositional distributional semantics (DS) models. The dataset contains 9,840 examples pertaining to logical inference (negation, conjunction, disjunction, apposition, relative clauses, etc.). However, its focus is on distributional semantic approaches, and it therefore normalizes several cases that DS models are not expected to account for. The dataset consists of approximately 10k sentence pairs annotated for inference (three-way) and relatedness, and was constructed by taking pairs of sentences from a random subset of the 8K ImageFlickr dataset and the SemEval 2012 STS MSRVideo Description dataset.

3.2 Model and Training Details

We perform experiments with five models: two sentence encoding models and three models from cross-sentence approaches. For the sentence encoding models, we have chosen a simple one-layer bidirectional LSTM with max pooling (BiLSTM-max) with a hidden size of 600D per direction, used e.g. in InferSent (Conneau et al., 2017), and HBMP Talman et al. (2018). For the cross-sentence models, we have chosen ESIM (Chen et al., 2017), which includes cross-sentence attention, and KIM Chen et al. (2018), which has cross-sentence attention and utilizes external knowledge. We also selected one model involving a pretrained language model, namely ESIM + ELMo (Peters et al., 2018). All of the models perform well on the SNLI dataset, reaching near state-of-the-art accuracy in the sentence encoding and cross-sentence categories respectively. KIM is particularly interesting in this context, as it performed significantly better than the other models in the Breaking NLI experiment conducted by Glockner et al. (2018).
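To make the sentence encoding step concrete, the max pooling used by BiLSTM-max (and InferSent) can be sketched in plain Python. This is a minimal illustration, not the actual PyTorch implementation: the LSTM itself is omitted, and the toy vectors stand in for the concatenated forward/backward hidden states of each token.

```python
def max_pool(hidden_states):
    """Element-wise max over time steps: given one hidden vector per token,
    return a single fixed-size sentence vector whose i-th component is the
    maximum of the i-th components across all tokens."""
    dim = len(hidden_states[0])
    return [max(h[i] for h in hidden_states) for i in range(dim)]

# Toy example: a 3-token sentence with 4-dimensional hidden states.
states = [
    [0.1, -0.5, 0.3, 0.0],
    [0.4,  0.2, -0.1, 0.7],
    [-0.2, 0.6, 0.5, 0.1],
]
print(max_pool(states))  # [0.4, 0.6, 0.5, 0.7]
```

In the actual models the pooled premise and hypothesis vectors (600D per direction, so 1200D each) are combined and fed to a multi-layer perceptron classifier over the three labels.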

For BiLSTM-max we used the Adam optimizer Kingma and Ba (2014) and a learning rate of 5e-4. The learning rate was decreased by a factor of 0.2 after each epoch if the model did not improve. We used a batch size of 64. Dropout of 0.1 was applied between the layers of the multi-layer perceptron classifier, except before the last layer. The models were evaluated on the development data after each epoch, and training was stopped if the development loss had increased for more than 3 epochs. The model with the highest development accuracy was selected for testing. The BiLSTM-max models were initialized with pre-trained 300-dimensional GloVe 840B word embeddings Pennington et al. (2014), which were fine-tuned during training. Our implementation of BiLSTM-max was done in PyTorch.
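The schedule above can be replayed as a small sketch. This is a hypothetical helper, not the training code itself, and it assumes "decreased by a factor of 0.2" means multiplying the rate by 0.2 and that improvement is measured on the development loss:

```python
def simulate_schedule(dev_losses, lr=5e-4, decay=0.2, patience=3):
    """Replay the training schedule over a sequence of per-epoch dev losses.

    After any epoch whose dev loss does not improve on the best so far, the
    learning rate is multiplied by `decay`; training stops once the loss has
    failed to improve for more than `patience` consecutive epochs.
    Returns the learning rate used at each epoch and the number of epochs run.
    """
    best = float("inf")
    bad_epochs = 0
    lrs = []
    epochs_run = 0
    for loss in dev_losses:
        lrs.append(lr)
        epochs_run += 1
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            lr *= decay
            if bad_epochs > patience:
                break
    return lrs, epochs_run

# A dev loss that improves twice and then stagnates: the rate decays on each
# stagnant epoch, and training stops after 4 epochs without improvement.
lrs, n = simulate_schedule([1.00, 0.90, 0.95, 0.96, 0.97, 0.98, 0.99])
print(n)  # 6
print(lrs)
```

In PyTorch this behavior corresponds roughly to combining a plateau-based learning rate scheduler with early stopping on the development loss.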

For HBMP, ESIM and KIM we used the original implementations as well as the default settings and hyperparameter values as described in Talman et al. (2018), Chen et al. (2017) and Chen et al. (2018) respectively, adjusting only the vocabulary size based on the dataset used. For ESIM + ELMo we used the AllenNLP (Gardner et al., 2018) PyTorch implementation with the default settings and hyperparameter values.

Train data Dev data Test data Test accuracy Δ Model
SNLI SNLI SNLI 86.1 BiLSTM-max
SNLI SNLI SNLI 86.6 HBMP Talman et al. (2018)
SNLI SNLI SNLI 88.0 ESIM Chen et al. (2017)
SNLI SNLI SNLI 88.6 KIM Chen et al. (2018)
SNLI SNLI SNLI 88.6 ESIM + ELMo Peters et al. (2018)
SNLI SNLI MultiNLI-m 55.7* -30.4 BiLSTM-max
SNLI SNLI MultiNLI-m 56.3* -30.3 HBMP
SNLI SNLI MultiNLI-m 59.2* -28.8 ESIM
SNLI SNLI MultiNLI-m 61.7* -26.9 KIM
SNLI SNLI MultiNLI-m 64.2* -24.4 ESIM + ELMo
SNLI SNLI SICK 54.5 -31.6 BiLSTM-max
MultiNLI MultiNLI-m MultiNLI-m 73.1* BiLSTM-max
MultiNLI MultiNLI-m MultiNLI-m 73.2* HBMP
MultiNLI MultiNLI-m MultiNLI-m 76.8* ESIM
MultiNLI MultiNLI-m MultiNLI-m 77.3* KIM
MultiNLI MultiNLI-m MultiNLI-m 80.2* ESIM + ELMo
MultiNLI MultiNLI-m SNLI 63.8 -9.3 BiLSTM-max
MultiNLI MultiNLI-m SNLI 65.3 -7.9 HBMP
MultiNLI MultiNLI-m SNLI 66.4 -10.4 ESIM
MultiNLI MultiNLI-m SNLI 68.5 -8.8 KIM
MultiNLI MultiNLI-m SNLI 69.1 -11.1 ESIM + ELMo
MultiNLI MultiNLI-m SICK 54.1 -19.0 BiLSTM-max
MultiNLI MultiNLI-m SICK 54.1 -19.1 HBMP
MultiNLI MultiNLI-m SICK 47.9 -28.9 ESIM
MultiNLI MultiNLI-m SICK 50.9 -26.4 KIM
MultiNLI MultiNLI-m SICK 51.4 -28.8 ESIM + ELMo
SNLI + MultiNLI SNLI SICK 54.5 -31.6 BiLSTM-max
SNLI + MultiNLI SNLI SICK 55.0 -31.1 HBMP
SNLI + MultiNLI SNLI SICK 54.5 -33.0 ESIM
SNLI + MultiNLI SNLI SICK 54.6 -31.6 KIM
SNLI + MultiNLI SNLI SICK 57.1 -31.7 ESIM + ELMo
Table 3: Test accuracies (%). For the baseline results highlighted in bold the training data includes examples from the same corpus as the test data. For the other models the training and test data are taken from separate corpora. Δ is the difference between the test accuracy and the baseline accuracy for the same training set. Results marked with * are for the development set, as no annotated test set is openly available.

4 Experimental Results

Table 3 contains all the experimental results.

Our experiments show that, while all five models perform well when the test set is drawn from the same corpus as the training and development sets, accuracy drops significantly when the test data comes from a separate corpus; the average drop across all experiments is 25.4 points.

The accuracy drops most when a model is tested on SICK. The drop in this case is between 19.0 and 28.9 points when trained on MultiNLI, between 31.6 and 33.7 points when trained on SNLI, and between 31.1 and 33.0 points when trained on SNLI + MultiNLI. This result was somewhat expected, as the sentence pairs in SICK were constructed with a different method, so the training and test sets differ too much in the kinds of sentence pairs they contain for the models to transfer what they have learned. However, the drop in accuracy was not expected to be this dramatic.

The most surprising result is that the accuracy of all models drops significantly even in the set-up where the models are trained on MultiNLI and tested on SNLI (7.9-11.1 points). This is surprising because both datasets were constructed with a similar data collection method using the same definition of inference (i.e. the same definitions of entailment, contradiction and neutral). The sentences in SNLI are also much simpler than those in MultiNLI, which might explain why the drop in accuracy for all five models is lowest when they are trained on MultiNLI and tested on SNLI. It is also very surprising that the model with the biggest drop in accuracy in this setting is ESIM + ELMo, which includes a pretrained ELMo language model. ESIM + ELMo did, however, achieve the highest accuracy, 69.1%, in this experiment.

All the models perform almost equally poorly across the experiments. Both BiLSTM-max and HBMP have an average drop in accuracy of 24.4 points, while the average for KIM is 25.5 and for ESIM + ELMo 25.6. ESIM has the highest average drop, 27.0 points. In contrast to the findings of Glockner et al. (2018), utilizing external knowledge did not improve the model’s generalization capability, as KIM performed equally poorly across all dataset combinations. Including a pretrained language model did not significantly improve the results either.
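The per-model averages can be checked directly against the Δ column of Table 3. For example, for BiLSTM-max (a small sanity-check script over the five out-of-domain Δ values shown in the table):

```python
# Accuracy drops (in points) for BiLSTM-max, read off the Δ column of
# Table 3: SNLI→MultiNLI-m, SNLI→SICK, MultiNLI→SNLI, MultiNLI→SICK,
# and SNLI + MultiNLI→SICK.
bilstm_max_drops = [30.4, 31.6, 9.3, 19.0, 31.6]

average_drop = sum(bilstm_max_drops) / len(bilstm_max_drops)
print(round(average_drop, 1))  # 24.4
```

This reproduces the 24.4-point average reported for BiLSTM-max above.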

5 Conclusion

In this paper we have shown that neural network models for NLI fail to generalize across different NLI benchmarks. We experimented with five state-of-the-art models covering both sentence encoding approaches and cross-sentence attention models. For all the systems, accuracy drops by between 7.9 and 33.7 points (25.4 points on average) when the test set is drawn from a different corpus than the training data, compared to when the test and training data are splits of the same corpus.

Our findings, together with previous negative findings, e.g. by Glockner et al. (2018) and Gururangan et al. (2018), indicate that the current state-of-the-art neural network models fail to capture the semantics of NLI in a way that would enable them to generalize across different NLI situations. The results point to two issues. a) Models trained on a dataset covering only a fraction of what NLI is will fail when tested on a dataset assuming a slightly different definition of inference; this is evident when we move from SNLI to SICK. b) NLI is to some extent genre/context dependent: training on SNLI and testing on MultiNLI gives worse results than vice versa, which can be seen as an indication that training on multiple genres helps. However, this is still not enough, given that accuracy drops significantly even when training on MultiNLI and testing on SNLI. Further work is required on better data resources as well as on better neural network models to tackle these issues.


  • Bowman et al. (2015) Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.
  • Chatzikyriakidis et al. (2017) Stergios Chatzikyriakidis, Robin Cooper, Simon Dobnik, and Staffan Larsson. 2017. An overview of natural language inference data collection: The way forward? In Proceedings of the Computing Natural Language Inference Workshop.
  • Chen et al. (2018) Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Diana Inkpen, and Si Wei. 2018. Neural natural language inference models enhanced with external knowledge. In The 56th Annual Meeting of the Association for Computational Linguistics (ACL), Melbourne, Australia.
  • Chen et al. (2017) Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1657–1668. Association for Computational Linguistics.
  • Conneau et al. (2017) Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, Copenhagen, Denmark. Association for Computational Linguistics.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.
  • Gardner et al. (2018) Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. In ACL workshop for NLP Open Source Software.
  • Glockner et al. (2018) Max Glockner, Vered Shwartz, and Yoav Goldberg. 2018. Breaking NLI systems with sentences that require simple lexical inferences. In The 56th Annual Meeting of the Association for Computational Linguistics (ACL), Melbourne, Australia.
  • Gururangan et al. (2018) Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 107–112. Association for Computational Linguistics.
  • Kang et al. (2018) Dongyeop Kang, Tushar Khot, Ashish Sabharwal, and Eduard Hovy. 2018. Adversarial training for textual entailment with knowledge-guided examples. In The 56th Annual Meeting of the Association for Computational Linguistics (ACL), Melbourne, Australia.
  • Khot et al. (2018) Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. Scitail: A textual entailment dataset from science question answering. In AAAI.
  • Kim et al. (2018) Seonhoon Kim, Jin-Hyuk Hong, Inho Kang, and Nojun Kwak. 2018. Semantic sentence matching with densely-connected recurrent and co-attentive information. arXiv preprint arXiv:1805.11360.
  • Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Marelli et al. (2014) Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014. A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of LREC 2014.
  • Miller (1995) George A. Miller. 1995. WordNet: A lexical database for English. Commun. ACM, 38(11):39–41.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
  • Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL.
  • Talman et al. (2018) Aarne Talman, Anssi Yli-Jyrä, and Jörg Tiedemann. 2018. Natural language inference with hierarchical BiLSTM max pooling architecture. arXiv preprint arXiv:1808.08762.
  • Williams et al. (2018) Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL). Association for Computational Linguistics.
  • Young et al. (2014) Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.