Reliable Evaluations for Natural Language Inference based on a Unified Cross-dataset Benchmark

10/15/2020
by Guanhua Zhang, et al.

Recent studies show that crowd-sourced Natural Language Inference (NLI) datasets may suffer from significant biases such as annotation artifacts. Models that exploit these superficial cues gain illusory advantages on the in-domain test set, so evaluation results are overestimated. The lack of trustworthy evaluation settings and benchmarks stalls the progress of NLI research. In this paper, we propose to assess a model's trustworthy generalization performance with cross-dataset evaluation. We present a new unified cross-dataset benchmark with 14 NLI datasets, and re-evaluate 9 widely used neural network-based NLI models as well as 5 recently proposed debiasing methods for annotation artifacts. Our proposed evaluation scheme and experimental baselines could provide a basis for future reliable NLI research.
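
To make the idea of cross-dataset evaluation concrete, here is a minimal Python sketch. It is not the paper's benchmark code: the function names (`accuracy`, `cross_dataset_report`), the toy examples, and the `artifact_predictor` that looks only at the hypothesis are all hypothetical illustrations. It shows how a predictor that relies on a surface cue (here, the word "not") can look perfect on in-domain data and fail on data from another source.

```python
from typing import Callable, Dict, List, Tuple

# (premise, hypothesis, gold label); hypothetical toy format, not the paper's.
Example = Tuple[str, str, str]
Predictor = Callable[[str, str], str]


def accuracy(predict: Predictor, examples: List[Example]) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    if not examples:
        raise ValueError("cannot score an empty dataset")
    correct = sum(predict(p, h) == y for p, h, y in examples)
    return correct / len(examples)


def cross_dataset_report(
    predict: Predictor,
    in_domain: List[Example],
    out_of_domain: Dict[str, List[Example]],
) -> Dict[str, float]:
    """Score one predictor on its in-domain test set and on other datasets."""
    report = {"in_domain": accuracy(predict, in_domain)}
    for name, examples in out_of_domain.items():
        report[name] = accuracy(predict, examples)
    return report


def artifact_predictor(premise: str, hypothesis: str) -> str:
    """Ignores the premise and keys on a negation word in the hypothesis."""
    return "contradiction" if "not" in hypothesis.lower().split() else "entailment"


# Toy data (assumed): the negation cue matches the label in-domain only.
in_domain = [
    ("A man is running.", "The man is not moving.", "contradiction"),
    ("A dog barks.", "An animal makes noise.", "entailment"),
]
toy_ood = [
    ("A man is sleeping.", "The man is not awake.", "entailment"),
    ("A cat sits.", "A cat is flying.", "contradiction"),
]

report = cross_dataset_report(artifact_predictor, in_domain, {"toy_ood": toy_ood})
print(report)
```

The gap between the in-domain score and the score on the other dataset is the quantity a cross-dataset evaluation is meant to expose; in a real setting each entry would be a full test set from a different NLI dataset.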


