Sparse Autoencoders Find Highly Interpretable Features in Language Models

09/15/2023
by Hoagy Cunningham, et al.

One of the roadblocks to a better understanding of neural networks' internals is polysemanticity, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally. One hypothesised cause of polysemanticity is superposition, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons. Here, we attempt to identify those directions, using sparse autoencoders to reconstruct the internal activations of a language model. These autoencoders learn sets of sparsely activating features that are more interpretable and monosemantic than directions identified by alternative approaches, where interpretability is measured by automated methods. Ablating these features enables precise model editing, for example, by removing capabilities such as pronoun prediction, while disrupting model behaviour less than prior techniques. This work indicates that it is possible to resolve superposition in language models using a scalable, unsupervised method. Our method may serve as a foundation for future mechanistic interpretability work, which we hope will enable greater model transparency and steerability.
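For readers who want a concrete picture of the setup, the sketch below shows one way to implement the core idea in PyTorch: an overcomplete linear encoder with a ReLU nonlinearity and an L1 sparsity penalty, trained to reconstruct activation vectors taken from a chosen layer of the language model. The dictionary ratio, L1 coefficient, and 512-dimensional activations are illustrative assumptions, not the paper's reported settings.

```python
# Minimal sketch of a sparse autoencoder for reconstructing model activations.
# Hyperparameters (dict_ratio, l1_coeff) are illustrative placeholders, not the
# paper's exact configuration.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, dict_ratio: int = 8):
        super().__init__()
        d_hidden = d_model * dict_ratio          # overcomplete feature dictionary
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))          # sparse feature activations
        x_hat = self.decoder(f)                  # reconstruction of the input activation
        return x_hat, f

def loss_fn(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse features.
    recon = (x - x_hat).pow(2).mean()
    sparsity = f.abs().mean()
    return recon + l1_coeff * sparsity

# Usage: `acts` stands in for a batch of activations collected from a chosen
# layer of the language model (shape [batch, d_model]).
sae = SparseAutoencoder(d_model=512)
acts = torch.randn(64, 512)
x_hat, f = sae(acts)
loss = loss_fn(acts, x_hat, f)
loss.backward()
```

Each row of the learned decoder then serves as a candidate feature direction in activation space, which can be inspected or ablated independently of the model's neurons.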

research 04/22/2023
N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models
Understanding the function of individual neurons within language models ...

research 04/19/2023
Disentangling Neuron Representations with Concept Vectors
Mechanistic interpretability aims to understand how models store represe...

research 10/04/2022
Polysemanticity and Capacity in Neural Networks
Individual neurons in neural networks often represent a mixture of unrel...

research 05/02/2023
Finding Neurons in a Haystack: Case Studies with Sparse Probing
Despite rapid adoption and deployment of large language models (LLMs), t...

research 04/14/2021
An Interpretability Illusion for BERT
We describe an "interpretability illusion" that arises when analyzing th...

research 11/22/2022
Interpreting Neural Networks through the Polytope Lens
Mechanistic interpretability aims to explain what a neural network has l...

research 12/21/2018
NeuroX: A Toolkit for Analyzing Individual Neurons in Neural Networks
We present a toolkit to facilitate the interpretation and understanding ...
