N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models

04/22/2023
by Alex Foote, et al.

Understanding the function of individual neurons within language models is essential for mechanistic interpretability research. We propose Neuron to Graph (N2G), a tool that takes a neuron and its dataset examples and automatically distills the neuron's behaviour on those examples into an interpretable graph. This offers a less labour-intensive approach to interpreting neurons than current manual methods, and will better scale these methods to Large Language Models (LLMs). We use truncation and saliency methods to present only the important tokens, and augment the dataset examples with more diverse samples to better capture the extent of neuron behaviour. These graphs can be visualised to aid manual interpretation by researchers, and they can also output token activations on text, which can be compared to the neuron's ground-truth activations for automatic validation. N2G represents a step towards scalable interpretability methods by allowing us to convert neurons in an LLM into interpretable representations of measurable quality.
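To make the idea concrete, here is a minimal toy sketch of the two steps the abstract describes: truncating each dataset example to the activating token plus a short preceding context, and storing those contexts in a graph (here, a trie) that can then emit predicted activations on new text for comparison with the neuron's ground truth. All names and details below are hypothetical simplifications — in particular, the paper uses saliency to select important context tokens, whereas this sketch just keeps a fixed-size window.

```python
def truncate_context(tokens, activations, window=3):
    """Keep the max-activating token plus a short preceding context.

    Hypothetical simplification of N2G's truncation step: the real
    method uses saliency to pick which preceding tokens matter; here
    we keep a fixed window of `window` tokens before the peak.
    """
    peak = max(range(len(tokens)), key=lambda i: activations[i])
    start = max(0, peak - window)
    return tokens[start : peak + 1]


class NeuronGraph:
    """Toy trie over truncated contexts, read backwards from the
    activating token (a hypothetical stand-in for N2G's graph)."""

    def __init__(self):
        self.trie = {}

    def add_example(self, tokens, activations, window=3):
        context = truncate_context(tokens, activations, window)
        node = self.trie
        for tok in reversed(context):  # activating token first
            node = node.setdefault(tok, {})
        node["__activating__"] = True  # full context path ends here

    def predict(self, tokens):
        """Return 1.0 at positions whose preceding context matches a
        stored path, else 0.0 — a binary signal comparable against
        the neuron's ground-truth activations."""
        preds = []
        for i in range(len(tokens)):
            node, j, fired = self.trie, i, False
            while j >= 0 and tokens[j] in node:
                node = node[tokens[j]]
                if node.get("__activating__"):
                    fired = True
                j -= 1
            preds.append(1.0 if fired else 0.0)
        return preds
```

For example, a neuron that fires on "dog" only after "the big" would store that context, and `predict` would then fire on "the big dog" but stay silent on "a small dog" — illustrating how the graph captures context-dependent behaviour rather than a bare token match.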


Related research:

- Neuron to Graph: Interpreting Language Model Neurons at Scale (05/31/2023)
- Sparse Autoencoders Find Highly Interpretable Features in Language Models (09/15/2023)
- Neurons in Large Language Models: Dead, N-gram, Positional (09/09/2023)
- Interpretable Textual Neuron Representations for NLP (09/19/2018)
- Language Through a Prism: A Spectral Approach for Multiscale Language Representations (11/09/2020)
- CLIP-Dissect: Automatic Description of Neuron Representations in Deep Vision Networks (04/23/2022)
- Prototype-based interpretation of the functionality of neurons in winner-take-all neural networks (08/20/2020)
