Simple Text Detoxification by Identifying a Linear Toxic Subspace in Language Model Embeddings

12/15/2021
by Andrew Wang, et al.

Large pre-trained language models are often trained on vast amounts of internet data, some of which may contain toxic or abusive language. Consequently, language models encode toxic information, which limits their real-world use. Current methods aim to prevent toxic features from appearing in generated text. We hypothesize the existence of a low-dimensional toxic subspace in the latent space of pre-trained language models; its existence suggests that toxic features follow some underlying pattern and are therefore removable. To construct this toxic subspace, we propose a method that generalizes toxic directions in the latent space. We also provide a methodology for constructing parallel datasets using a context-based word masking system. Through our experiments, we show that when the toxic subspace is removed from a set of sentence representations, almost no toxic representations remain. We demonstrate empirically that the subspace found using our method generalizes to multiple toxicity corpora, supporting the existence of a low-dimensional toxic subspace.
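The sketch below illustrates one plausible reading of the idea, not the authors' exact procedure: given a parallel corpus of toxic sentences and their detoxified counterparts (e.g. produced by masking toxic words in context), the difference vectors between paired embeddings are treated as toxic directions, their dominant directions span a low-dimensional toxic subspace, and that subspace is projected out of new sentence representations. The embedding function and the choice of SVD over the raw differences are assumptions for illustration.

```python
# Minimal sketch of toxic-subspace estimation and removal, assuming
# paired (toxic, detoxified) sentence embeddings are already available
# from some pre-trained language model.
import numpy as np


def toxic_subspace(toxic_embs: np.ndarray,
                   neutral_embs: np.ndarray,
                   k: int) -> np.ndarray:
    """Estimate a k-dimensional toxic subspace from paired embeddings.

    toxic_embs, neutral_embs: (n, d) arrays for parallel toxic /
    detoxified sentences. Difference vectors act as toxic directions;
    their top-k right singular vectors span the subspace.
    """
    diffs = toxic_embs - neutral_embs                    # (n, d) toxic directions
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)  # dominant directions
    return vt[:k]                                        # (k, d) orthonormal basis


def remove_subspace(embs: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Project embeddings onto the orthogonal complement of the subspace."""
    # Subtract each embedding's component that lies inside the toxic subspace.
    return embs - (embs @ basis.T) @ basis
```

Under this reading, a toxicity probe trained on the projected embeddings should perform near chance, which is roughly the kind of evidence the abstract reports.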

