Neurons in Large Language Models: Dead, N-gram, Positional

by Elena Voita, et al.

We analyze a family of large language models in a manner lightweight enough to be done on a single GPU. Specifically, we focus on the OPT family of models ranging from 125m to 66b parameters and rely only on whether an FFN neuron is activated or not. First, we find that the early part of the network is sparse and represents many discrete features. Here, many neurons (more than 70% in some layers of the 66b model) are "dead", i.e., they never activate on a large collection of diverse data. At the same time, many of the alive neurons are reserved for discrete features and act as token and n-gram detectors. Interestingly, their corresponding FFN updates not only promote next-token candidates, as could be expected, but also explicitly focus on removing information about the tokens that trigger them, i.e., the current input. To the best of our knowledge, this is the first example of mechanisms specialized in removing (rather than adding) information from the residual stream. With scale, models become more sparse in the sense that they have more dead neurons and token detectors. Finally, some neurons are positional: whether they are activated depends largely (or solely) on position and less so (or not at all) on textual data. We find that smaller models have sets of neurons acting as position range indicators, while larger models operate in a less explicit manner.
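The activated-or-not criterion described above can be illustrated with a minimal sketch. It assumes a ReLU-style FFN, so that a neuron counts as active on a token exactly when its pre-activation is positive, and it assumes the pre-activations of one layer have already been collected as a tokens-by-neurons table. The function names and the toy data are hypothetical and are not taken from the paper's code.

```python
from typing import List, Sequence


def activation_frequency(pre_activations: Sequence[Sequence[float]]) -> List[float]:
    """Return, for each neuron, the fraction of tokens on which it is active.

    Assumption: rows are tokens, columns are FFN neurons of one layer, and
    "active" means the pre-activation is strictly positive (ReLU output > 0).
    """
    n_tokens = len(pre_activations)
    n_neurons = len(pre_activations[0])
    counts = [0] * n_neurons
    for row in pre_activations:
        for j, value in enumerate(row):
            if value > 0.0:
                counts[j] += 1
    return [c / n_tokens for c in counts]


def dead_neurons(pre_activations: Sequence[Sequence[float]]) -> List[int]:
    """Indices of neurons that never activate on the given data."""
    freqs = activation_frequency(pre_activations)
    return [j for j, f in enumerate(freqs) if f == 0.0]


# Toy example (made-up numbers): 4 tokens, 3 neurons.
toy = [
    [0.5, -1.0, -0.2],
    [1.2, -0.3, -0.1],
    [-0.4, -2.0, 0.9],
    [0.1, -0.7, -0.5],
]
freqs = activation_frequency(toy)
dead = dead_neurons(toy)
print(freqs)  # per-neuron activation frequency
print(dead)   # neurons that are never active on this data
```

A neuron is only "dead" relative to the data it was measured on, so in practice the table would need to cover a large and diverse collection of text, as the abstract stresses.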


N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models

Understanding the function of individual neurons within language models ...

Neuron to Graph: Interpreting Language Model Neurons at Scale

Advances in Large Language Models (LLMs) have led to remarkable capabili...

Sparse Interventions in Language Models with Differentiable Masking

There has been a lot of interest in understanding what information is ca...

Exposing the Functionalities of Neurons for Gated Recurrent Unit Based Sequence-to-Sequence Model

The goal of this paper is to report certain scientific discoveries about...

Text vectorization via transformer-based language models and n-gram perplexities

As the probability (and thus perplexity) of a text is calculated based o...

LLMCad: Fast and Scalable On-device Large Language Model Inference

Generative tasks, such as text generation and question answering, hold a...

BatchPrompt: Accomplish more with less

As the ever-increasing token limits of large language models (LLMs) have...
