Progress measures for grokking via mechanistic interpretability

01/12/2023
by Neel Nanda, et al.

Neural networks often exhibit emergent behavior, where qualitatively new capabilities arise from scaling up the number of parameters, training data, or training steps. One approach to understanding emergence is to find continuous progress measures that underlie the seemingly discontinuous qualitative changes. We argue that progress measures can be found via mechanistic interpretability: reverse-engineering learned behaviors into their individual components. As a case study, we investigate the recently discovered phenomenon of “grokking” exhibited by small transformers trained on modular addition tasks. We fully reverse engineer the algorithm learned by these networks, which uses discrete Fourier transforms and trigonometric identities to convert addition to rotation about a circle. We confirm the algorithm by analyzing the activations and weights and by performing ablations in Fourier space. Based on this understanding, we define progress measures that allow us to study the dynamics of training and split training into three continuous phases: memorization, circuit formation, and cleanup. Our results show that grokking, rather than being a sudden shift, arises from the gradual amplification of structured mechanisms encoded in the weights, followed by the later removal of memorizing components.
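The addition-to-rotation idea from the abstract can be illustrated directly. The sketch below is a minimal, hypothetical reconstruction using a single frequency: embed each input as an angle on the unit circle, compose the two rotations with the angle-addition identities, then score each candidate residue by how well it undoes the combined rotation. The modulus `p = 113` matches the paper's setup; the choice of frequency `k` is arbitrary here, whereas the trained networks combine several frequencies.

```python
import numpy as np

p = 113                 # modulus for the modular addition task
k = 5                   # one frequency; any k with gcd(k, p) == 1 works
w = 2 * np.pi * k / p   # angular step per unit input

def mod_add_via_rotation(a, b):
    # Compose the two rotations using the angle-addition identities:
    #   cos(w(a+b)) = cos(wa)cos(wb) - sin(wa)sin(wb)
    #   sin(w(a+b)) = sin(wa)cos(wb) + cos(wa)sin(wb)
    cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
    sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)
    # Score each candidate residue c by cos(w(a + b - c)); this peaks
    # exactly at c == (a + b) % p, since k is coprime to p.
    c = np.arange(p)
    logits = cos_ab * np.cos(w * c) + sin_ab * np.sin(w * c)
    return int(np.argmax(logits))

print(mod_add_via_rotation(100, 50))  # → 37, i.e. (100 + 50) % 113
```

The `logits` line plays the role of the network's unembedding: it is a dot product against the cosine and sine of each possible answer, so the argmax recovers the sum modulo p without any explicit modular arithmetic.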


