Uncovering mesa-optimization algorithms in Transformers

09/11/2023
by Johannes von Oswald, et al.

Transformers have become the dominant model in deep learning, but the reason for their superior performance is poorly understood. Here, we hypothesize that the strong performance of Transformers stems from an architectural bias towards mesa-optimization, a learned process running within the forward pass of a model consisting of the following two steps: (i) the construction of an internal learning objective, and (ii) its corresponding solution found through optimization. To test this hypothesis, we reverse-engineer a series of autoregressive Transformers trained on simple sequence modeling tasks, uncovering underlying gradient-based mesa-optimization algorithms driving the generation of predictions. Moreover, we show that the learned forward-pass optimization algorithm can be immediately repurposed to solve supervised few-shot tasks, suggesting that mesa-optimization might underlie the in-context learning capabilities of large language models. Finally, we propose a novel self-attention layer, the mesa-layer, that explicitly and efficiently solves optimization problems specified in context. We find that this layer can lead to improved performance in synthetic and preliminary language modeling experiments, adding weight to our hypothesis that mesa-optimization is an important operation hidden within the weights of trained Transformers.
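
To make step (i) and (ii) concrete, the following is a minimal numerical sketch (our own illustration, not code released with the paper) of the kind of equivalence this line of work builds on: when context tokens carry input-target pairs (x_i, y_i), a single linear self-attention operation over the context can reproduce the prediction obtained after one explicit gradient-descent step on an in-context least-squares objective. The dimensions, the zero initialization, and the learning rate below are assumptions chosen for the toy example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: in-context linear regression with y_i = W_true @ x_i.
d_in, d_out, n_ctx = 4, 3, 16
W_true = rng.normal(size=(d_out, d_in))
X = rng.normal(size=(n_ctx, d_in))          # context inputs x_i (rows)
Y = X @ W_true.T                            # context targets y_i (rows)

lr = 0.1                                    # assumed step size

# (a) One explicit gradient-descent step on the in-context loss
#     L(W) = 0.5 * sum_i ||y_i - W x_i||^2, starting from W_0 = 0.
W0 = np.zeros((d_out, d_in))
grad = -(Y - X @ W0.T).T @ X                # dL/dW
W1 = W0 - lr * grad                         # mesa-updated weights

x_query = rng.normal(size=d_in)
pred_gd = W1 @ x_query                      # prediction after one GD step

# (b) The same prediction from a single linear self-attention operation:
#     the query token attends to the context with keys x_i and values equal
#     to the residuals (y_i - W_0 x_i), so its output is
#     lr * sum_i (y_i - W_0 x_i) * (x_i . x_query).
values = Y - X @ W0.T                       # residuals act as attention values
attn_scores = X @ x_query                   # un-normalized dot-product attention
pred_attn = lr * values.T @ attn_scores

print(np.allclose(pred_gd, pred_attn))      # True: the two computations coincide
```

In this sketch the attention weights are fixed by hand; the paper's reverse-engineering results concern trained Transformers whose learned weights implement such gradient-based updates implicitly.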


Related research

Transformers learn in-context by gradient descent (12/15/2022)
Hungry Hungry Hippos: Towards Language Modeling with State Space Models (12/28/2022)
Pay Attention to MLPs (05/17/2021)
Language Modeling using LMUs: 10x Better Data Efficiency or Improved Scaling Compared to Transformers (10/05/2021)
Transformers from an Optimization Perspective (05/27/2022)
On the Expressivity Role of LayerNorm in Transformers' Attention (05/04/2023)
Opening the Black Box: Analyzing Attention Weights and Hidden States in Pre-trained Language Models for Non-language Tasks (06/21/2023)
