Measuring Spurious Correlation in Classification: 'Clever Hans' in Translationese

08/25/2023
by Angana Borah, et al.

Recent work has shown evidence of 'Clever Hans' behavior in high-performance neural translationese classifiers: BERT-based classifiers capitalize on spurious correlations between the data and the target classification labels, in particular topic information, rather than on genuine translationese signals. Translationese signals are subtle (especially in professional translation) and compete with many other signals in the data, such as genre, style, author, and, in particular, topic. This raises the general question of how much of a classifier's performance is really due to spurious correlations in the data rather than to the signals the classifier actually targets, especially for subtle target signals and in challenging (low-resource) data settings. We focus on topic-based spurious correlation and approach the question from two directions: (i) where we have no knowledge about spurious topic information and its distribution in the data, and (ii) where we have some indication of the nature of the spurious topic correlations. For (i), we develop a measure from first principles that captures the alignment of unsupervised topics with the target classification labels as an indication of spurious topic information in the data. We show that our measure is the same as purity in clustering and propose a 'topic floor' (as in a 'noise floor') for classification. For (ii), we investigate masking known spurious topic carriers in classification. Both (i) and (ii) contribute to quantifying spurious correlations, and (ii) also contributes to mitigating them.
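The abstract states that the alignment measure coincides with clustering purity. As a minimal sketch of that standard quantity (not the authors' implementation), the function below takes one unsupervised topic assignment and one class label per document and returns the fraction of documents that carry the majority label of their topic. The function name `topic_label_purity` and the toy topic IDs and labels are assumptions made for illustration only; how the topics are obtained and how the 'topic floor' is derived from the score are described in the full text, not here.

```python
from collections import Counter, defaultdict


def topic_label_purity(topic_ids, labels):
    """Standard clustering purity of topics with respect to class labels.

    purity = (1 / N) * sum over topics of the count of the most
    frequent label inside that topic.
    """
    if len(topic_ids) != len(labels):
        raise ValueError("topic_ids and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("at least one document is required")

    # Count how often each label occurs inside each topic.
    label_counts_per_topic = defaultdict(Counter)
    for topic, label in zip(topic_ids, labels):
        label_counts_per_topic[topic][label] += 1

    # For every topic, keep only the size of its majority label.
    majority_total = sum(
        max(counts.values()) for counts in label_counts_per_topic.values()
    )
    return majority_total / len(labels)


# Hypothetical toy data: 8 documents, 3 unsupervised topics,
# labels "T" (translated) and "O" (original).
toy_topics = [0, 0, 0, 1, 1, 1, 2, 2]
toy_labels = ["T", "T", "O", "O", "O", "O", "T", "O"]
print(topic_label_purity(toy_topics, toy_labels))  # 0.75
```

Purity equals the accuracy of a rule that predicts, for each document, the majority label of its topic. It is never lower than the share of the most frequent class overall, so a score is only informative when compared against that baseline.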


