Behavior Analysis of NLI Models: Uncovering the Influence of Three Factors on Robustness

Natural Language Inference is a challenging task that has received substantial attention, and state-of-the-art models now achieve impressive accuracy scores on standard test sets. Here, we go beyond this single evaluation metric to examine robustness to semantically valid alterations of the input data. We identify three factors (insensitivity, polarity and unseen pairs) and compare their impact on three SNLI models under a variety of conditions. Our results demonstrate a number of strengths and weaknesses in the models' ability to generalise to new in-domain instances. In particular, while strong performance is possible on unseen hypernyms, unseen antonyms are more challenging for all the models. More generally, the models suffer from an insensitivity to certain small but semantically significant alterations, and are also often influenced by simple statistical correlations between individual words and training labels. Overall, we show that evaluations of NLI models can benefit from studying the influence of factors intrinsic to the models or found in the dataset used.
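
To make the evaluation idea concrete, the sketch below shows one hedged way such semantically valid alterations could be generated and checked: substituting a word in the hypothesis with a WordNet hypernym or antonym and testing whether the model's predicted label shifts. This is an illustrative reconstruction, not the paper's released code; the predict_label function, the target_word argument, and the use of NLTK's WordNet interface are all assumptions introduced here.

    # Hypothetical sketch: probing an NLI model's robustness to semantically
    # valid word substitutions (hypernyms / antonyms) in the hypothesis.
    # `predict_label(premise, hypothesis)` stands in for any SNLI classifier
    # and is an assumption, not part of the paper's method.
    from nltk.corpus import wordnet as wn

    def hypernym_substitutions(word):
        # Yield hypernym lemmas of `word` drawn from WordNet.
        for synset in wn.synsets(word):
            for hyper in synset.hypernyms():
                for lemma in hyper.lemmas():
                    yield lemma.name().replace("_", " ")

    def antonym_substitutions(word):
        # Yield antonym lemmas of `word` drawn from WordNet.
        for synset in wn.synsets(word):
            for lemma in synset.lemmas():
                for ant in lemma.antonyms():
                    yield ant.name().replace("_", " ")

    def probe_robustness(predict_label, premise, hypothesis, target_word):
        # Compare the prediction on the original pair against predictions
        # on hypotheses where `target_word` is replaced by a hypernym or antonym.
        original = predict_label(premise, hypothesis)
        results = []
        for kind, generator in (("hypernym", hypernym_substitutions),
                                ("antonym", antonym_substitutions)):
            for sub in set(generator(target_word)):
                altered = hypothesis.replace(target_word, sub)
                results.append((kind, sub, predict_label(premise, altered)))
        return original, results

A robust model would be expected to keep an entailment label under most hypernym substitutions, while flipping appropriately for antonyms; systematic failures of either kind correspond to the insensitivity and unseen-pair factors discussed in the abstract.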
