Beyond Accuracy: Behavioral Testing of NLP models with CheckList

05/08/2020
by Marco Tulio Ribeiro, et al.

Although measuring held-out accuracy has been the primary approach to evaluating generalization, it often overestimates the performance of NLP models, while alternative approaches to evaluating models focus either on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitates comprehensive test ideation, as well as a software tool for quickly generating a large and diverse set of test cases. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-the-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests and found almost three times as many bugs as users without it.
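To make the idea of generating many test cases for a single capability concrete, here is a minimal sketch in plain Python. It does not use the CheckList tool or its API; `expand_template`, `toy_sentiment_model`, and `run_test` are hypothetical names introduced only for illustration, and the toy model is a deliberately naive keyword classifier standing in for a real system. The sketch expands one template into several sentences that probe negation and reports how many the model gets wrong.

```python
from itertools import product


def expand_template(template, **slots):
    """Fill every combination of slot values into the template string."""
    keys = list(slots)
    return [
        template.format(**dict(zip(keys, combo)))
        for combo in product(*(slots[k] for k in keys))
    ]


def toy_sentiment_model(text):
    """Hypothetical stand-in model: keyword lookup that ignores negation."""
    positive = {"good", "great", "excellent"}
    words = text.lower().replace(".", "").split()
    return "positive" if any(w in positive for w in words) else "negative"


def run_test(model, cases, expected):
    """Return the cases whose predicted label differs from the expected one."""
    return [case for case in cases if model(case) != expected]


# Capability under test: negation. Each generated sentence negates a
# positive adjective, so the expected label is "negative".
cases = expand_template(
    "The {noun} was not {adj}.",
    noun=["food", "service", "flight"],
    adj=["good", "great", "excellent"],
)
failures = run_test(toy_sentiment_model, cases, expected="negative")
failure_rate = len(failures) / len(cases)
print(f"{len(failures)}/{len(cases)} cases failed ({failure_rate:.0%})")
```

A held-out accuracy number would not necessarily reveal this weakness, whereas a per-capability failure rate over templated cases isolates it directly. The slot values and the template here are assumptions chosen for brevity, not examples taken from the paper.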


