Disentangling Neuron Representations with Concept Vectors

04/19/2023
by Laura O'Mahony et al.

Mechanistic interpretability aims to understand how models store representations by breaking neural networks down into interpretable units. However, polysemantic neurons, which respond to multiple unrelated features, make individual neurons challenging to interpret. This has motivated the search for meaningful vectors in activation space, known as concept vectors, instead of analyses of individual neurons. The main contribution of this paper is a method for disentangling polysemantic neurons into concept vectors that each encapsulate a distinct feature. The method can search for fine-grained concepts according to the user's desired level of concept separation. Our analysis shows that polysemantic neurons can be disentangled into directions consisting of linear combinations of neurons, and our evaluations show that the concept vectors found encode coherent, human-understandable features.
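To make the idea of a concept vector concrete, the sketch below illustrates one generic way such directions can be recovered. It is not the authors' algorithm: the synthetic activations, the 64-neuron layer, and the use of k-means clustering are all illustrative assumptions. Activation vectors of inputs that strongly activate a single polysemantic neuron are clustered, and each cluster's normalized mean serves as a candidate concept vector; the cluster count stands in for the user's desired level of concept separation.

```python
# Minimal sketch, not the paper's method: recover candidate concept
# vectors for one polysemantic neuron by clustering the activation
# vectors of its top-activating inputs. Data, layer size, and the
# k-means step are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in for a layer's activations: 200 inputs x 64 neurons.
# Two unrelated "features" both drive neuron 0, making it polysemantic:
# feature A also excites neuron 3, feature B also suppresses neuron 7.
e = np.eye(64)
feature_a = rng.normal(size=(100, 64)) + 5 * e[0] + 4 * e[3]
feature_b = rng.normal(size=(100, 64)) + 5 * e[0] - 4 * e[7]
acts = np.vstack([feature_a, feature_b])

# Keep the inputs that most strongly activate the neuron under study.
neuron = 0
top = acts[np.argsort(acts[:, neuron])[-100:]]

# Cluster the top activation vectors. The cluster count is the knob for
# how finely concepts are separated (more clusters, finer concepts).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(top)

# Each normalized cluster mean is a candidate concept vector: a direction
# in activation space, i.e. a linear combination of neurons.
for i, center in enumerate(kmeans.cluster_centers_):
    v = center / np.linalg.norm(center)
    print(f"concept {i}: strongest neurons -> {np.argsort(-np.abs(v))[:3]}")
```

In this toy data, both feature groups drive neuron 0, yet the two recovered directions separate them through their differing weights on neurons 3 and 7, showing how one neuron's mixed responses can be split into distinct linear combinations of neurons.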
