Probing Classifiers are Unreliable for Concept Removal and Detection

07/08/2022
by Chenhao Tan, et al.

Neural network models trained on text data have been found to encode undesired linguistic or sensitive attributes in their representations. Removing such attributes is non-trivial because of the complex relationship between the attribute, the text input, and the learnt representation. Recent work has proposed post-hoc and adversarial methods to remove such unwanted attributes from a model's representation. Through an extensive theoretical and empirical analysis, we show that these methods can be counter-productive: they are unable to remove the attributes entirely, and in the worst case may end up destroying all task-relevant features. The reason is the methods' reliance on a probing classifier as a proxy for the attribute. Even under the most favorable conditions, when an attribute's features in representation space can alone provide 100% accuracy, post-hoc or adversarial methods will fail to remove the attribute correctly. These theoretical implications are confirmed by empirical experiments on models trained on synthetic, Multi-NLI, and Twitter datasets. For sensitive applications of attribute removal such as fairness, we recommend caution against using these methods and propose a spuriousness metric to gauge the quality of the final classifier.
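
To make the critiqued mechanism concrete, here is a minimal, hypothetical Python sketch (not the paper's code) of the probing-classifier pipeline the abstract describes: a linear probe is trained to locate a sensitive attribute, and a null-space projection in the style of post-hoc removal methods (e.g., INLP) then deletes the probe's direction. The synthetic data, variable names, and correlation level are illustrative assumptions chosen so that the attribute is spuriously correlated with the task.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical synthetic data: dimension 0 encodes the task label,
# dimension 1 encodes a sensitive attribute that is spuriously
# correlated with the task (they agree 80% of the time).
n = 2000
task = rng.integers(0, 2, n)
attr = np.where(rng.random(n) < 0.8, task, 1 - task)
reps = np.column_stack([
    task + 0.3 * rng.normal(size=n),  # task-relevant feature
    attr + 0.3 * rng.normal(size=n),  # attribute feature
])

# Probing classifier: a linear model trained to predict the attribute.
# Removal methods treat its weight vector as a proxy for where the
# attribute lives in representation space.
probe = LogisticRegression().fit(reps, attr)
w = probe.coef_ / np.linalg.norm(probe.coef_)  # unit probe direction

# Post-hoc removal via null-space projection (an INLP-style step):
# delete the component of every representation along the probe direction.
reps_cleaned = reps - (reps @ w.T) @ w

# Because the attribute correlates with the task, the probe direction
# mixes both features: the attribute often remains partly recoverable,
# and task accuracy degrades, the failure mode the abstract warns about.
attr_after = LogisticRegression().fit(reps_cleaned, attr)
task_after = LogisticRegression().fit(reps_cleaned, task)
print("attribute recoverable after removal:", attr_after.score(reps_cleaned, attr))
print("task accuracy after removal:", task_after.score(reps_cleaned, task))
```

In this toy setup the probe is only a proxy for the attribute, so projecting out its direction neither fully erases the attribute nor leaves task features untouched; the paper's analysis formalizes why this happens even under conditions far more favorable than these.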


Related research

Adversarial Removal of Demographic Attributes from Text Data (08/20/2018)
Recent advances in Representation Learning and Adversarial Training seem...

Gold Doesn't Always Glitter: Spectral Removal of Linear and Nonlinear Guarded Attribute Information (03/15/2022)
We describe a simple and effective method (Spectral Attribute removaL; S...

Supervised Attribute Information Removal and Reconstruction for Image Manipulation (07/13/2022)
The goal of attribute manipulation is to control specified attribute(s)...

Honest-but-Curious Nets: Sensitive Attributes of Private Inputs can be Secretly Coded into the Entropy of Classifiers' Outputs (05/25/2021)
It is known that deep neural networks, trained for the classification of...

Fairness via Adversarial Attribute Neighbourhood Robust Learning (10/12/2022)
Improving fairness between privileged and less-privileged sensitive attr...

A^4NT: Author Attribute Anonymity by Adversarial Training of Neural Machine Translation (11/06/2017)
Text-based analysis methods allow to reveal privacy relevant author attr...

Evaluating Fairness of Machine Learning Models Under Uncertain and Incomplete Information (02/16/2021)
Training and evaluation of fair classifiers is a challenging problem. Th...
