Adversarial Concept Erasure in Kernel Space

01/28/2022
by Shauli Ravfogel, et al.

The representation space of neural models for textual data emerges in an unsupervised manner during training. Understanding how human-interpretable concepts, such as gender, are encoded in these representations would improve the ability of users to control the content of these representations and to analyze the workings of the models that rely on them. One prominent approach to the control problem is the identification and removal of linear concept subspaces – subspaces in the representation space that correspond to a given concept. While those are tractable and interpretable, neural networks do not necessarily represent concepts in linear subspaces. We propose a kernelization of the linear concept-removal objective of [Ravfogel et al. 2022], and show that it is effective in guarding against the ability of certain nonlinear adversaries to recover the concept. Interestingly, our findings suggest that the division between linear and nonlinear models is overly simplistic: when considering the concept of binary gender and its neutralization, we do not find a single kernel space that exclusively contains all the concept-related information. It is therefore challenging to protect against all nonlinear adversaries at once.
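
The gap between linear and nonlinear adversaries described above can be illustrated with a toy sketch. This is not the paper's objective: it swaps in a much simpler single-direction mean-difference projection for the concept-removal step, uses synthetic data, and uses an explicit degree-2 feature map as a stand-in for a kernel. All names (`make_data`, `erase_mean_direction`, `probe_accuracy`, `quad_features`) are hypothetical and introduced only for illustration.

```python
import numpy as np


def make_data(n=4000, seed=0):
    """Synthetic representations with a binary concept z (an assumption, not real data)."""
    rng = np.random.default_rng(seed)
    z = rng.integers(0, 2, size=n)        # binary concept label
    x = rng.normal(size=(n, 3))
    x[:, 0] += 2.0 * (2 * z - 1)          # linear encoding: mean shift along axis 0
    x[:, 1:] *= (1 + 2 * z)[:, None]      # nonlinear encoding: class-dependent spread
    return x, z


def erase_mean_direction(feats, z):
    """Project out the unit direction that separates the two class means."""
    u = feats[z == 1].mean(axis=0) - feats[z == 0].mean(axis=0)
    u /= np.linalg.norm(u)
    return feats - np.outer(feats @ u, u)


def probe_accuracy(feats, z):
    """In-sample accuracy of a least-squares linear probe with a bias term."""
    a = np.hstack([feats, np.ones((len(feats), 1))])
    w, *_ = np.linalg.lstsq(a, 2.0 * z - 1.0, rcond=None)
    return float(np.mean((a @ w > 0) == (z == 1)))


def quad_features(x):
    """Explicit degree-2 feature map without cross terms (stand-in for a kernel)."""
    return np.hstack([x, x ** 2])


x, z = make_data()

# 1. Before any erasure, the linear probe reads the concept off the mean shift.
acc_linear_before = probe_accuracy(x, z)

# 2. Erasure in input space: this linear probe drops to roughly chance.
x_erased = erase_mean_direction(x, z)
acc_linear_after = probe_accuracy(x_erased, z)

# 3. A probe on quadratic features of the erased data still recovers the concept,
#    because the class-dependent spread survives the linear projection.
acc_quad_after = probe_accuracy(quad_features(x_erased), z)

# 4. Erasure carried out in the quadratic feature space: the probe that is
#    linear in those same features drops to roughly chance as well.
phi_erased = erase_mean_direction(quad_features(x), z)
acc_quad_feature_erased = probe_accuracy(phi_erased, z)

print("linear probe, original data:        ", acc_linear_before)
print("linear probe, input-space erasure:  ", acc_linear_after)
print("quadratic probe, input-space erasure:", acc_quad_after)
print("quadratic probe, feature-space erasure:", acc_quad_feature_erased)
```

The sketch only covers the least-squares probe; other linear classifiers can still do somewhat better than chance on the erased data by exploiting the class-dependent spread. It also does not test adversaries built on a different feature map than the one used for erasure, which is the harder case the abstract points to.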

research
10/18/2022

Linear Guardedness and its Implications

Previous work on concept identification in neural representations has fo...
research
01/28/2022

Linear Adversarial Concept Erasure

Modern neural models trained on textual data rely on pre-trained represe...
research
04/20/2022

Analyzing Gender Representation in Multilingual Models

Multilingual language models were shown to allow for nontrivial transfer...
research
10/11/2019

Finding Interpretable Concept Spaces in Node Embeddings using Knowledge Bases

In this paper we propose and study the novel problem of explaining node ...
research
07/27/2023

A Geometric Notion of Causal Probing

Large language models rely on real-valued representations of text to mak...
research
11/20/2018

Adversarial Removal of Gender from Deep Image Representations

In this work we analyze visual recognition tasks such as object and acti...
research
05/24/2023

A Neural Space-Time Representation for Text-to-Image Personalization

A key aspect of text-to-image personalization methods is the manner in w...