GLUE-X: Evaluating Natural Language Understanding Models from an Out-of-distribution Generalization Perspective

11/15/2022
by   Linyi Yang, et al.

Pre-trained language models (PLMs) are known to improve the generalization performance of natural language understanding models by leveraging large amounts of data during the pre-training phase. However, out-of-distribution (OOD) generalization remains a challenge in many NLP tasks, limiting the real-world deployment of these methods. This paper presents the first attempt at a unified benchmark, named GLUE-X, for evaluating OOD robustness in NLP models, highlighting the importance of OOD robustness and providing insights into how to measure and improve it. The benchmark includes 13 publicly available datasets for OOD testing, and evaluations are conducted on 8 classic NLP tasks over 21 widely used PLMs, including GPT-3 and GPT-3.5. Our findings confirm the need for improved OOD accuracy in NLP tasks, as significant performance degradation relative to in-distribution (ID) accuracy was observed in all settings.
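The central measurement behind the abstract's claim is the gap between ID and OOD accuracy for the same task: a model is fine-tuned or prompted as usual, then scored on the original test set and on a test set drawn from a different distribution. A minimal sketch of that comparison is below; the dataset tuples and the `predict` callable are illustrative placeholders, not the actual GLUE-X interface.

from typing import Callable, Dict, Sequence, Tuple

def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions matching the gold labels."""
    assert len(preds) == len(labels)
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def ood_gap(
    predict: Callable[[Sequence[str]], Sequence[int]],
    id_data: Tuple[Sequence[str], Sequence[int]],
    ood_data: Tuple[Sequence[str], Sequence[int]],
) -> Dict[str, float]:
    """Score one model on an in-distribution test set and an
    out-of-distribution test set for the same task, and report the drop.
    Positive 'drop' means the model degrades out of distribution."""
    id_texts, id_labels = id_data
    ood_texts, ood_labels = ood_data
    id_acc = accuracy(predict(id_texts), id_labels)
    ood_acc = accuracy(predict(ood_texts), ood_labels)
    return {"id_acc": id_acc, "ood_acc": ood_acc, "drop": id_acc - ood_acc}

Averaging this drop over tasks and OOD test sets yields the kind of aggregate degradation figure the paper reports across its 8 tasks and 13 OOD datasets.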


Related research

03/01/2023 · How Robust is GPT-3.5 to Predecessors? A Comprehensive Study on Language Understanding Tasks
The GPT-3.5 models have demonstrated impressive performance in various N...

10/13/2021 · Towards Efficient NLP: A Standard Evaluation and A Strong Baseline
Supersized pre-trained language models have pushed the accuracy of vario...

08/27/2021 · Evaluating the Robustness of Neural Language Models to Input Perturbations
High-performance neural language models have obtained state-of-the-art r...

07/11/2023 · BLUEX: A benchmark based on Brazilian Leading Universities Entrance eXams
One common trend in recent studies of language models (LMs) is the use o...

09/19/2021 · Training Dynamic based data filtering may not work for NLP datasets
The recent increase in dataset size has brought about significant advanc...

03/21/2021 · TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing
Various robustness evaluation methodologies from different perspectives ...

02/11/2023 · Evaluating the Robustness of Discrete Prompts
Discrete prompts have been used for fine-tuning Pre-trained Language Mod...
