Gated recurrent neural networks discover attention

09/04/2023
by Nicolas Zucchet, et al.

Recent architectural developments have enabled recurrent neural networks (RNNs) to reach and even surpass the performance of Transformers on certain sequence modeling tasks. These modern RNNs feature a prominent design pattern: linear recurrent layers interconnected by feedforward paths with multiplicative gating. Here, we show how RNNs equipped with these two design elements can exactly implement (linear) self-attention, the main building block of Transformers. By reverse-engineering a set of trained RNNs, we find that gradient descent in practice discovers our construction. In particular, we examine RNNs trained to solve simple in-context learning tasks on which Transformers are known to excel and find that gradient descent instills in our RNNs the same attention-based in-context learning algorithm used by Transformers. Our findings highlight the importance of multiplicative interactions in neural networks and suggest that certain RNNs might be unexpectedly implementing attention under the hood.
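To make the claimed equivalence concrete, below is a minimal NumPy sketch (not the authors' code; shapes, names, and the unit-decay recurrence are illustrative assumptions) of the standard identity the construction builds on: causal linear self-attention, y_t = (sum over s <= t of v_s k_s^T) q_t, can be computed by a linear recurrence over an outer-product state, where both the key-value input and the query readout enter through multiplicative interactions.

```python
import numpy as np

def linear_self_attention(Q, K, V):
    """Causal linear self-attention: y_t = (sum_{s<=t} v_s k_s^T) q_t."""
    T = Q.shape[0]
    Y = np.zeros_like(V)
    for t in range(T):
        A = V[: t + 1].T @ K[: t + 1]    # accumulated outer products, shape (d_v, d_k)
        Y[t] = A @ Q[t]
    return Y

def gated_linear_recurrence(Q, K, V):
    """Same computation as a linear recurrence with multiplicative gating:
    the state S is updated by a linear (unit-decay) recurrence, its input
    v_t k_t^T and the readout S_t q_t are multiplicative interactions."""
    T, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_v, d_k))             # recurrent state (a matrix of RNN units)
    Y = np.zeros((T, d_v))
    for t in range(T):
        S = S + np.outer(V[t], K[t])     # linear recurrence; input formed by multiplicative gating
        Y[t] = S @ Q[t]                  # multiplicative output gating by the query
    return Y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d_k, d_v = 6, 4, 3
    Q = rng.normal(size=(T, d_k))
    K = rng.normal(size=(T, d_k))
    V = rng.normal(size=(T, d_v))
    assert np.allclose(linear_self_attention(Q, K, V),
                       gated_linear_recurrence(Q, K, V))
```

The recurrence keeps a matrix-valued state, which an RNN with linear recurrent units and gated feedforward paths can represent by flattening it across units; this is the sense in which the two design elements suffice to implement linear attention exactly.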


Related research

12/15/2022 · Transformers learn in-context by gradient descent
06/01/2023 · Transformers learn to implement preconditioned gradient descent for in-context learning
12/05/2020 · A Review of Designs and Applications of Echo State Networks
02/26/2020 · ResNets, NeuralODEs and CT-RNNs are Particular Neural Regulatory Networks
02/18/2018 · Memorize or generalize? Searching for a compositional RNN in a haystack
06/11/2021 · Going Beyond Linear Transformers with Recurrent Fast Weight Programmers
10/13/2021 · How Does Momentum Benefit Deep Neural Networks Architecture Design? A Few Case Studies
