Toy Models of Superposition

09/21/2022
by   Nelson Elhage, et al.
12

Neural networks often pack many unrelated concepts into a single neuron - a puzzling phenomenon known as 'polysemanticity' which makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition." We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability.

READ FULL TEXT

page 11

page 15

page 16

page 19

page 27

page 31

page 37

research
01/25/2019

Towards Interpretable Deep Neural Networks by Leveraging Adversarial Examples

Sometimes it is not enough for a DNN to produce an outcome. For example,...
research
09/16/2019

Interpreting and Improving Adversarial Robustness with Neuron Sensitivity

Deep neural networks (DNNs) are vulnerable to adversarial examples where...
research
01/30/2019

A Simple Explanation for the Existence of Adversarial Examples with Small Hamming Distance

The existence of adversarial examples in which an imperceptible change i...
research
07/22/2021

Towards Explaining Adversarial Examples Phenomenon in Artificial Neural Networks

In this paper, we study the adversarial examples existence and adversari...
research
12/07/2019

Does Interpretability of Neural Networks Imply Adversarial Robustness?

The success of deep neural networks is clouded by two issues that largel...
research
07/03/2023

Interpretability and Transparency-Driven Detection and Transformation of Textual Adversarial Examples (IT-DT)

Transformer-based text classifiers like BERT, Roberta, T5, and GPT-3 hav...
research
02/24/2021

It was "all" for "nothing": sharp phase transitions for noiseless discrete channels

We establish a phase transition known as the "all-or-nothing" phenomenon...

Please sign up or login with your details

Forgot password? Click here to reset