Robustness Gym: Unifying the NLP Evaluation Landscape

01/13/2021
by   Karan Goel, et al.
2

Despite impressive performance on standard benchmarks, deep neural networks are often brittle when deployed in real-world systems. Consequently, recent research has focused on testing the robustness of such models, resulting in a diverse set of evaluation methodologies ranging from adversarial attacks to rule-based data transformations. In this work, we identify challenges with evaluating NLP systems and propose a solution in the form of Robustness Gym (RG), a simple and extensible evaluation toolkit that unifies 4 standard evaluation paradigms: subpopulations, transformations, evaluation sets, and adversarial attacks. By providing a common platform for evaluation, Robustness Gym enables practitioners to compare results from all 4 evaluation paradigms with just a few clicks, and to easily develop and share novel evaluation methods using a built-in set of abstractions. To validate Robustness Gym's utility to practitioners, we conducted a real-world case study with a sentiment-modeling team, revealing performance degradations of 18 that Robustness Gym can aid novel research analyses, we perform the first study of state-of-the-art commercial and academic named entity linking (NEL) systems, as well as a fine-grained analysis of state-of-the-art summarization models. For NEL, commercial systems struggle to link rare entities and lag their academic counterparts by 10 struggle on examples that require abstraction and distillation, degrading by 9

READ FULL TEXT

page 2

page 14

page 16

page 18

page 20

page 34

research
03/21/2021

TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing

Various robustness evaluation methodologies from different perspectives ...
research
02/01/2019

Robustness of Generalized Learning Vector Quantization Models against Adversarial Attacks

Adversarial attacks and the development of (deep) neural networks robust...
research
06/18/2021

Evaluating the Robustness of Trigger Set-Based Watermarks Embedded in Deep Neural Networks

Trigger set-based watermarking schemes have gained emerging attention as...
research
06/09/2023

Overcoming Adversarial Attacks for Human-in-the-Loop Applications

Including human analysis has the potential to positively affect the robu...
research
10/29/2018

A Pragmatic Guide to Geoparsing Evaluation

Empirical methods in geoparsing have thus far lacked a standard evaluati...
research
12/15/2021

Measure and Improve Robustness in NLP Models: A Survey

As NLP models achieved state-of-the-art performances over benchmarks and...
research
08/12/2016

Measuring the State of the Art of Automated Pathway Curation Using Graph Algorithms - A Case Study of the mTOR Pathway

This paper evaluates the difference between human pathway curation and c...

Please sign up or login with your details

Forgot password? Click here to reset