Learning Transformer Programs

06/01/2023
by Dan Friedman et al.

Recent research in mechanistic interpretability has attempted to reverse-engineer Transformer models by carefully inspecting network weights and activations. However, these approaches require considerable manual effort and still fall short of providing complete, faithful descriptions of the underlying algorithms. In this work, we introduce a procedure for training Transformers that are mechanistically interpretable by design. We build on RASP [Weiss et al., 2021], a programming language that can be compiled into Transformer weights. Instead of compiling human-written programs into Transformers, we design a modified Transformer that can be trained using gradient-based optimization and then be automatically converted into a discrete, human-readable program. We refer to these models as Transformer Programs. To validate our approach, we learn Transformer Programs for a variety of problems, including an in-context learning task, a suite of algorithmic problems (e.g. sorting, recognizing Dyck-languages), and NLP tasks including named entity recognition and text classification. The Transformer Programs can automatically find reasonable solutions, performing on par with standard Transformers of comparable size; and, more importantly, they are easy to interpret. To demonstrate these advantages, we convert Transformers into Python programs and use off-the-shelf code analysis tools to debug model errors and identify the “circuits” used to solve different sub-problems. We hope that Transformer Programs open a new path toward the goal of intrinsically interpretable machine learning.
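
To make the idea concrete, here is a minimal, hypothetical sketch of what a decompiled Transformer Program could look like in Python. The select/aggregate helpers mirror RASP-style operations; the specific predicate, toy task, and output format below are invented for illustration and are not taken from the paper.

```python
# Hypothetical sketch of a Transformer Program decompiled to Python.
# select/aggregate follow RASP-style primitives; the task is invented.

def select(keys, queries, predicate):
    # Hard attention: out[q][k] is True iff query position q attends to key position k.
    return [[predicate(k, q) for k in keys] for q in queries]

def aggregate(attention, values, default=None):
    # For each query position, copy the value at the first selected key position.
    out = []
    for row in attention:
        selected = [v for chosen, v in zip(row, values) if chosen]
        out.append(selected[0] if selected else default)
    return out

def run(tokens):
    positions = list(range(len(tokens)))
    # One attention head whose (assumed) learned predicate attends to the previous position.
    attn = select(positions, positions, lambda k, q: k == q - 1)
    prev_token = aggregate(attn, tokens, default="<s>")
    # A feed-forward block reduced to a discrete lookup over (token, prev_token) features.
    return ["MATCH" if tok == prev else "NO_MATCH"
            for tok, prev in zip(tokens, prev_token)]

print(run(list("aabba")))  # ['NO_MATCH', 'MATCH', 'NO_MATCH', 'MATCH', 'NO_MATCH']
```

Because every attention head and feed-forward block is reduced to discrete operations of this kind, the resulting program can be read, stepped through, and analyzed with off-the-shelf Python tooling, which is how the authors debug model errors and identify the circuits used for different sub-problems.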

Related Research

01/12/2023 · Tracr: Compiled Transformers as a Laboratory for Interpretability
Interpretability research aims to build tools for understanding machine ...

06/13/2021 · Thinking Like Transformers
What is the computational model behind a Transformer? Where recurrent ne...

10/02/2022 · Systematic Generalization and Emergent Structures in Transformers Trained on Structured Tasks
Transformer networks have seen great success in natural language process...

12/21/2020 · RealFormer: Transformer Likes Residual Attention
Transformer is the backbone of modern NLP models. In this paper, we prop...

05/24/2023 · Can Transformers Learn to Solve Problems Recursively?
Neural networks have in recent years shown promise for helping software ...

01/30/2023 · Looped Transformers as Programmable Computers
We present a framework for using transformer networks as universal compu...

05/23/2023 · Physics of Language Models: Part 1, Context-Free Grammar
We design experiments to study how generative language models, like GPT,...