Cross-functional Analysis of Generalisation in Behavioural Learning

In behavioural testing, system functionalities underrepresented in the standard evaluation setting (with a held-out test set) are validated through controlled input-output pairs. Optimising performance on the behavioural tests during training (behavioural learning) would improve coverage of phenomena not sufficiently represented in the i.i.d. data and could lead to seemingly more robust models. However, there is the risk that the model narrowly captures spurious correlations from the behavioural test suite, leading to overestimation and misrepresentation of model performance – one of the original pitfalls of traditional evaluation. In this work, we introduce BeLUGA, an analysis method for evaluating behavioural learning that considers generalisation across dimensions at different levels of granularity. We optimise behaviour-specific loss functions and evaluate models on several partitions of the behavioural test suite, each controlled to leave out specific phenomena. An aggregate score measures generalisation to unseen functionalities (or overfitting). We use BeLUGA to examine three representative NLP tasks (sentiment analysis, paraphrase identification and reading comprehension) and compare the impact of a diverse set of regularisation and domain generalisation methods on generalisation performance.
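To make the idea of partitions that leave out specific phenomena more concrete, the following is a minimal sketch and not the BeLUGA implementation. It assumes a test suite of `(functionality, input_text, expected_label)` triples, holds out one functionality at a time, and averages held-out accuracy. The function names, the toy suite, the majority-label stand-in model and the plain mean used as the aggregate are all illustrative assumptions; the abstract does not specify how the aggregate score is computed.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical toy suite: (functionality, input_text, expected_label).
SUITE = [
    ("negation", "not good", "neg"),
    ("negation", "not bad", "pos"),
    ("intensifier", "very good", "pos"),
    ("intensifier", "really great", "pos"),
    ("sarcasm", "oh great, another delay", "neg"),
]


def leave_one_functionality_out(suite):
    """Yield (held_out, train_cases, test_cases), one split per functionality."""
    by_func = defaultdict(list)
    for case in suite:
        by_func[case[0]].append(case)
    for held_out in sorted(by_func):
        train = [c for name, cases in by_func.items()
                 if name != held_out for c in cases]
        yield held_out, train, by_func[held_out]


def fit_majority(train):
    """Stand-in for behavioural learning: always predict the majority label."""
    majority = Counter(label for _, _, label in train).most_common(1)[0][0]
    return lambda text: majority


def accuracy(predict, cases):
    """Fraction of cases whose predicted label matches the expected label."""
    return mean(1.0 if predict(text) == label else 0.0
                for _, text, label in cases)


def unseen_functionality_score(suite, fit):
    """Train on all but one functionality, test on the held-out one.

    The aggregate here is a plain mean over held-out accuracies, which is
    an assumption made for illustration only.
    """
    per_functionality = {}
    for held_out, train, test in leave_one_functionality_out(suite):
        predict = fit(train)
        per_functionality[held_out] = accuracy(predict, test)
    return mean(per_functionality.values()), per_functionality


aggregate, per_functionality = unseen_functionality_score(SUITE, fit_majority)
```

A model that merely memorises the test cases it was trained on would score well on seen functionalities and poorly here, which is the kind of overfitting to the test suite that the abstract warns about.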

research
04/08/2022

Checking HateCheck: a cross-functional analysis of behaviour-aware learning for hate speech detection

Behavioural testing – verifying system capabilities by validating human-...
research
12/29/2019

ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine Reading Comprehension

Reading comprehension is one of the crucial tasks for furthering researc...
research
04/06/2020

Evaluating NLP Models via Contrast Sets

Standard test sets for supervised learning evaluate in-distribution gene...
research
05/25/2022

ER-TEST: Evaluating Explanation Regularization Methods for NLP Models

Neural language models' (NLMs') reasoning processes are notoriously hard...
research
09/02/2021

How Suitable Are Subword Segmentation Strategies for Translating Non-Concatenative Morphology?

Data-driven subword segmentation has become the default strategy for ope...
research
04/01/2020

Can We Use SE-specific Sentiment Analysis Tools in a Cross-Platform Setting?

In this paper, we address the problem of using sentiment analysis tools ...
research
05/23/2022

A Fine-grained Interpretability Evaluation Benchmark for Neural NLP

While there is increasing concern about the interpretability of neural m...
