Towards Automated Circuit Discovery for Mechanistic Interpretability

04/28/2023
by Arthur Conmy, et al.

Recent work in mechanistic interpretability has reverse-engineered nontrivial behaviors of transformer models. These contributions required considerable effort and researcher intuition, which makes it difficult to apply the same methods to understand the complex behavior that current models display. At their core, however, these discoveries follow a surprisingly similar workflow. Researchers create a data set and metric that elicit the desired model behavior, subdivide the network into appropriate abstract units, replace activations of those units to identify which are involved in the behavior, and then interpret the functions that these units implement. By varying the data set, metric, and units under investigation, researchers can understand the functionality of each neural network region and the circuits they compose. This work proposes a novel algorithm, Automatic Circuit DisCovery (ACDC), to automate the identification of the important units in the network. Given a model's computational graph, ACDC finds subgraphs that explain a behavior of the model. ACDC reproduced a previously identified circuit for Python docstrings in a small transformer, identifying 6 of the 7 important attention heads, which compose up to 3 layers deep, while including 91 connections.
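
The workflow described in the abstract (choose a data set and metric, split the network into units, replace activations, keep what matters) can be illustrated with a toy example. The abstract does not spell out ACDC's procedure, so the following is only a minimal sketch of one plausible greedy edge-pruning loop, not the authors' implementation. The graph, the weights, the names `forward` and `discover_circuit`, and the threshold `tau` are all hypothetical and chosen for illustration.

```python
# Toy sketch only: graph, weights, and threshold are invented for illustration.
NODES = ["x0", "x1", "a", "b", "out"]   # listed in topological order
INPUTS = {"x0", "x1"}
EDGES = {                               # (parent, child) -> weight
    ("x0", "a"): 2.0, ("x1", "a"): 0.01,
    ("x0", "b"): 0.01, ("x1", "b"): 0.02,
    ("a", "out"): 1.0, ("b", "out"): 0.05,
}


def forward(clean, corrupt, kept):
    """Run the toy linear network, patching every edge that is not in `kept`.

    A kept edge carries the parent's activation from the current run;
    a removed edge carries the parent's activation from the corrupted run.
    """
    # First pass: activations of every node on the corrupted input.
    corrupt_act = dict(corrupt)
    for node in NODES:
        if node not in INPUTS:
            corrupt_act[node] = sum(
                w * corrupt_act[p] for (p, c), w in EDGES.items() if c == node
            )
    # Second pass: clean input, with removed edges replaced by corrupted values.
    act = dict(clean)
    for node in NODES:
        if node not in INPUTS:
            act[node] = sum(
                w * (act[p] if (p, c) in kept else corrupt_act[p])
                for (p, c), w in EDGES.items()
                if c == node
            )
    return act["out"]


def discover_circuit(clean, corrupt, tau):
    """Greedily drop edges whose removal worsens the metric by less than tau."""
    full_out = forward(clean, corrupt, set(EDGES))

    def metric(kept):
        # Distance between the subgraph's output and the full model's output.
        return abs(forward(clean, corrupt, kept) - full_out)

    kept = set(EDGES)
    score = metric(kept)
    # Visit nodes from the output backwards, testing each incoming edge.
    for node in reversed(NODES):
        for edge in sorted(e for e in EDGES if e[1] == node):
            trial = kept - {edge}
            trial_score = metric(trial)
            if trial_score - score < tau:
                kept, score = trial, trial_score
    return kept


circuit = discover_circuit(
    clean={"x0": 1.0, "x1": 1.0},
    corrupt={"x0": 0.0, "x1": 0.0},
    tau=0.05,
)
print(sorted(circuit))
```

In this toy setting only the path from `x0` through `a` to `out` survives pruning, because every other edge changes the output by less than `tau` when patched. A real model would replace the hand-written graph with attention heads and MLPs, and the metric with a task-specific measure on a chosen data set.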

