Neuron to Graph: Interpreting Language Model Neurons at Scale

05/31/2023
by Alex Foote, et al.

Advances in Large Language Models (LLMs) have led to remarkable capabilities, yet their inner mechanisms remain largely unknown. To understand these models, we need to unravel the functions of individual neurons and their contributions to the network. This paper introduces a novel automated approach designed to scale interpretability techniques across a vast number of neurons within LLMs, to make them more interpretable and ultimately safer. Conventional methods require examining examples that strongly activate a neuron and manually identifying patterns to decipher the concepts it responds to. We propose Neuron to Graph (N2G), a tool that automatically extracts a neuron's behaviour from the dataset the model was trained on and translates it into an interpretable graph. N2G uses truncation and saliency methods to emphasise only the tokens most pertinent to a neuron, and enriches the dataset examples with diverse samples to better capture the full spectrum of the neuron's behaviour. The resulting graphs can be visualised to aid researchers' manual interpretation, and they can also predict token-level activations on unseen text, enabling automatic validation against the neuron's ground-truth activations; we use this to show that N2G predicts neuron activations more accurately than two baseline methods. We also demonstrate how the generated graph representations can be flexibly used to further automate interpretability research, for example by searching for neurons with particular properties or programmatically comparing neurons to identify similar ones. Our method easily scales to build graph representations for every neuron in a 6-layer Transformer model on a single Tesla T4 GPU, making it widely usable. We release the code and usage instructions at https://github.com/alexjfoote/Neuron2Graph.
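To make the pipeline concrete, the following is a minimal, self-contained Python sketch of an N2G-style procedure: prune each activating example to the shortest suffix that still fires the neuron, ablate tokens to find the salient ones, fold the pruned contexts into a trie-like graph, and walk that graph to predict activations on new text. This is an illustrative sketch, not the released implementation; `act_fn`, the `<unk>` ablation placeholder, and the threshold fractions are assumptions made for the example.

```python
from typing import Dict, List

WILDCARD = "*"          # graph node that matches any token at prediction time
PLACEHOLDER = "<unk>"   # stand-in token used when ablating a position

def truncate(tokens: List[str], act_fn, frac: float = 0.5) -> List[str]:
    """Keep the shortest suffix that still reaches `frac` of the full activation."""
    full = act_fn(tokens)
    for start in range(len(tokens) - 1, -1, -1):
        suffix = tokens[start:]
        if act_fn(suffix) >= frac * full:
            return suffix
    return tokens

def saliency_mask(tokens: List[str], act_fn, frac: float = 0.5) -> List[bool]:
    """Mark tokens whose ablation drops the activation below `frac` of baseline."""
    base = act_fn(tokens)
    mask = []
    for i in range(len(tokens)):
        ablated = tokens[:i] + [PLACEHOLDER] + tokens[i + 1:]
        mask.append(act_fn(ablated) < frac * base)
    return mask

class NeuronGraph:
    """Trie over pruned contexts, read from the activating token backwards."""
    def __init__(self) -> None:
        self.children: Dict[str, "NeuronGraph"] = {}
        self.activation = 0.0  # > 0 only where a recorded context ends

    def add(self, tokens: List[str], mask: List[bool], activation: float) -> None:
        node = self
        for tok, salient in zip(reversed(tokens), reversed(mask)):
            key = tok if salient else WILDCARD  # non-salient tokens generalise
            node = node.children.setdefault(key, NeuronGraph())
        node.activation = max(node.activation, activation)

    def predict(self, tokens: List[str]) -> float:
        """Predicted activation on the final token of `tokens`."""
        node, best = self, 0.0
        for tok in reversed(tokens):
            node = node.children.get(tok) or node.children.get(WILDCARD)
            if node is None:
                break
            best = max(best, node.activation)
        return best

# Toy stand-in for a real neuron probe: "fires" on the bigram "not bad".
def act_fn(tokens: List[str]) -> float:
    return float(any(a == "not" and b == "bad" for a, b in zip(tokens, tokens[1:])))

graph = NeuronGraph()
example = ["the", "film", "was", "not", "bad"]
pruned = truncate(example, act_fn)                # -> ["not", "bad"]
graph.add(pruned, saliency_mask(pruned, act_fn), act_fn(pruned))
print(graph.predict(["honestly", "not", "bad"]))  # -> 1.0
```

A real neuron probe returns graded activations rather than a binary signal, and the paper's graphs are built from many augmented examples; but the trie-with-wildcards shape above is what makes downstream uses like property search and neuron-to-neuron comparison straightforward to implement over the stored paths.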

Related research

04/22/2023 - N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models
Understanding the function of individual neurons within language models ...

04/23/2022 - CLIP-Dissect: Automatic Description of Neuron Representations in Deep Vision Networks
In this paper, we propose CLIP-Dissect, a new technique to automatically...

09/09/2023 - Neurons in Large Language Models: Dead, N-gram, Positional
We analyze a family of large language models in such a lightweight manne...

09/19/2018 - Interpretable Textual Neuron Representations for NLP
Input optimization methods, such as Google Deep Dream, create interpreta...

10/31/2015 - Why Neurons Have Thousands of Synapses, A Theory of Sequence Memory in Neocortex
Neocortical neurons have thousands of excitatory synapses. It is a myste...

04/21/2019 - GAN-based Generation and Automatic Selection of Explanations for Neural Networks
One way to interpret trained deep neural networks (DNNs) is by inspectin...

01/30/2023 - Evaluating Neuron Interpretation Methods of NLP Models
Neuron Interpretation has gained traction in the field of interpretabili...
