Interpretability at Scale: Identifying Causal Mechanisms in Alpaca

05/15/2023
by Zhengxuan Wu, et al.

Obtaining human-interpretable explanations of large, general-purpose language models is an urgent goal for AI safety. However, it is just as important that our interpretability methods are faithful to the causal dynamics underlying model behavior and able to robustly generalize to unseen inputs. Distributed Alignment Search (DAS) is a powerful gradient descent method grounded in a theory of causal abstraction that has uncovered perfect alignments between interpretable symbolic algorithms and small deep learning models fine-tuned for specific tasks. In the present paper, we scale DAS significantly by replacing the remaining brute-force search steps with learned parameters, an approach we call Boundless DAS. This enables us to efficiently search for interpretable causal structure in large language models while they follow instructions. We apply Boundless DAS to the Alpaca model (7B parameters), which, off the shelf, solves a simple numerical reasoning problem. With Boundless DAS, we discover that Alpaca does this by implementing a causal model with two interpretable boolean variables. Furthermore, we find that the alignment of neural representations with these variables is robust to changes in inputs and instructions. These findings mark a first step toward deeply understanding the inner workings of our largest and most widely deployed language models.
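The key operation behind DAS-style methods is an interchange intervention performed in a learned, rotated subspace of a hidden representation; Boundless DAS additionally learns how wide that subspace is rather than brute-force searching over dimensions. The PyTorch sketch below illustrates that idea only. It is a minimal illustration of the description above, not the authors' released implementation, and the names used here (RotatedSubspaceIntervention, boundary, temperature) are assumptions made for the example.

import torch
import torch.nn as nn

class RotatedSubspaceIntervention(nn.Module):
    """Swap a learned linear subspace of a hidden vector from a source run
    into a base run. The subspace is defined by a learned orthogonal rotation
    and a soft boundary mask whose width is also learned, so no brute-force
    search over dimensions is needed."""

    def __init__(self, hidden_size: int, temperature: float = 10.0):
        super().__init__()
        # Orthogonally parameterized rotation of the hidden space.
        self.rotation = nn.utils.parametrizations.orthogonal(
            nn.Linear(hidden_size, hidden_size, bias=False)
        )
        # Learned boundary: rotated dimensions below this index get swapped.
        self.boundary = nn.Parameter(torch.tensor(hidden_size / 2.0))
        self.register_buffer("positions", torch.arange(hidden_size).float())
        # Controls how sharp the soft mask is; in practice it would be
        # annealed during training so the mask approaches a hard boundary.
        self.temperature = temperature

    def boundary_mask(self) -> torch.Tensor:
        # Sigmoid relaxation of the indicator "position < boundary".
        return torch.sigmoid((self.boundary - self.positions) * self.temperature)

    def forward(self, base_h: torch.Tensor, source_h: torch.Tensor) -> torch.Tensor:
        R = self.rotation.weight                   # (d, d), orthogonal
        base_rot = base_h @ R.T                    # rotate into the aligned basis
        source_rot = source_h @ R.T
        mask = self.boundary_mask()                # (d,), values in (0, 1)
        mixed = mask * source_rot + (1.0 - mask) * base_rot
        return mixed @ R                           # rotate back to model space

# Toy usage: intervene on 768-dimensional hidden states from two hypothetical runs.
if __name__ == "__main__":
    torch.manual_seed(0)
    intervention = RotatedSubspaceIntervention(hidden_size=768)
    base_h = torch.randn(2, 768)     # hidden states on the base input
    source_h = torch.randn(2, 768)   # hidden states on the counterfactual input
    patched = intervention(base_h, source_h)
    print(patched.shape)             # torch.Size([2, 768])

In an actual alignment search, the rotation and boundary would be trained so that patching this subspace from a counterfactual run makes the model produce the output predicted by the hypothesized causal model, for instance one of the two boolean variables mentioned in the abstract.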
