Finding Neurons in a Haystack: Case Studies with Sparse Probing

05/02/2023
by Wes Gurnee, et al.

Despite rapid adoption and deployment of large language models (LLMs), the internal computations of these models remain opaque and poorly understood. In this work, we seek to understand how high-level human-interpretable features are represented within the internal neuron activations of LLMs. We train k-sparse linear classifiers (probes) on these internal activations to predict the presence of features in the input; by varying the value of k we study the sparsity of learned representations and how this varies with model scale. With k=1, we localize individual neurons which are highly relevant for a particular feature, and perform a number of case studies to illustrate general properties of LLMs. In particular, we show that early layers make use of sparse combinations of neurons to represent many features in superposition, that middle layers have seemingly dedicated neurons to represent higher-level contextual features, and that increasing scale causes representational sparsity to increase on average, but there are multiple types of scaling dynamics. In all, we probe for over 100 unique features comprising 10 different categories in 7 different models spanning 70 million to 6.9 billion parameters.
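The k-sparse probing idea described above can be illustrated with a minimal sketch: select k neurons by a simple univariate relevance score, then fit a linear classifier on only those neurons. This is an illustrative toy, not the paper's implementation (the authors use optimal sparse probing methods); the scoring heuristic, hyperparameters, and synthetic data here are all assumptions for demonstration.

```python
import numpy as np

def sparse_probe(X, y, k, lr=0.5, steps=500):
    """Toy k-sparse linear probe: pick k neurons by a univariate score,
    then fit logistic regression restricted to those neurons.

    X: (n_samples, n_neurons) activation matrix, y: binary labels.
    """
    # Univariate score: standardized class-mean difference per neuron.
    mu1, mu0 = X[y == 1].mean(0), X[y == 0].mean(0)
    score = np.abs(mu1 - mu0) / (X.std(0) + 1e-8)
    idx = np.argsort(score)[-k:]  # indices of the k selected neurons
    Xk = X[:, idx]

    # Plain logistic regression via gradient descent on the k features.
    w, b = np.zeros(k), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Xk @ w + b)))  # predicted probabilities
        g = p - y                                # gradient of log-loss
        w -= lr * (Xk.T @ g) / len(y)
        b -= lr * g.mean()
    return idx, w, b

# Synthetic demo: the "feature" is carried mainly by neurons 3 and 7.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 400)
X = rng.normal(size=(400, 50))
X[:, 3] += 2.0 * y
X[:, 7] += 2.0 * y

idx, w, b = sparse_probe(X, y, k=2)
pred = (X[:, idx] @ w + b) > 0
acc = (pred == y).mean()
```

With k=1 this procedure localizes the single most predictive neuron, mirroring the case-study setting in the abstract; sweeping k traces out how distributed (or superposed) a feature's representation is.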


Related research

- 06/03/2021: Rich dynamics caused by known biological brain network features resulting in stateful networks
- 09/14/2023: Padding Aware Neurons
- 09/15/2023: Sparse Autoencoders Find Highly Interpretable Features in Language Models
- 12/03/2014: Deeply learned face representations are sparse, selective, and robust
- 10/04/2022: Polysemanticity and Capacity in Neural Networks
- 09/17/2020: Dissecting Lottery Ticket Transformers: Structural and Behavioral Study of Sparse Neural Machine Translation
- 11/09/2020: Language Through a Prism: A Spectral Approach for Multiscale Language Representations
