Going Beyond Linear Transformers with Recurrent Fast Weight Programmers

06/11/2021
by Kazuki Irie, et al.

Transformers with linearised attention ("linear Transformers") have demonstrated the practical scalability and effectiveness of outer product-based Fast Weight Programmers (FWPs) from the '90s. However, the original FWP formulation is more general than that of linear Transformers: a slow neural network (NN) continually reprograms the weights of a fast NN, and both nets may have arbitrary architectures. In existing linear Transformers, both NNs are feedforward and consist of a single layer. Here we explore new variations by adding recurrence to the slow and fast nets. We evaluate our novel recurrent FWPs (RFWPs) on two synthetic algorithmic tasks (code execution and sequential ListOps), on Wikitext-103 language modelling, and on the Atari 2600 2D game environment. Our models exhibit properties of both Transformers and RNNs. In the reinforcement learning setting, we report large improvements over LSTM in several Atari games. Our code is public.

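For intuition, below is a minimal NumPy sketch of the outer-product fast weight update that underlies linear Transformers: a "slow" net produces keys, values and queries from the input, and at each step a rank-1 update v⊗k is added to the "fast" weight matrix, which is then queried. The class and function names (OuterProductFWP, phi) and the ELU+1 feature map are illustrative assumptions, not the paper's exact model; in particular, the recurrent feedback that the paper adds to the slow and fast nets is not shown here.

```python
import numpy as np

def phi(x):
    """Positive feature map (ELU + 1), a common choice for linearised attention."""
    return np.where(x > 0, x + 1.0, np.exp(x))

class OuterProductFWP:
    """Sketch of an outer product-based Fast Weight Programmer
    (the single-layer, feedforward case equivalent to a linear Transformer layer)."""

    def __init__(self, d_in, d_key, d_val, seed=0):
        rng = np.random.default_rng(seed)
        # Slow weights: trained by gradient descent, fixed during the sequence.
        self.W_k = rng.standard_normal((d_key, d_in)) * 0.1
        self.W_v = rng.standard_normal((d_val, d_in)) * 0.1
        self.W_q = rng.standard_normal((d_key, d_in)) * 0.1
        # Fast weight matrix: rewritten at every time step.
        self.W_fast = np.zeros((d_val, d_key))

    def step(self, x):
        k = phi(self.W_k @ x)            # key generated by the slow net
        v = self.W_v @ x                 # value generated by the slow net
        q = phi(self.W_q @ x)            # query
        self.W_fast += np.outer(v, k)    # program the fast net: rank-1 update
        return self.W_fast @ q           # output of the (re)programmed fast net

# Toy usage: process a short sequence step by step.
fwp = OuterProductFWP(d_in=8, d_key=16, d_val=8)
for t in range(5):
    x_t = np.random.default_rng(t).standard_normal(8)
    y_t = fwp.step(x_t)
    print(t, y_t.shape)
```

The paper's RFWPs generalise this template by making the slow and/or fast nets recurrent (e.g. feeding previous outputs back as inputs), which is what gives the models properties of both Transformers and RNNs.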