Stabilizing Transformers for Reinforcement Learning

by Emilio Parisotto, et al.

Owing to their ability to both effectively integrate information over long time horizons and scale to massive amounts of data, self-attention architectures have recently shown breakthrough success in natural language processing (NLP), achieving state-of-the-art results in domains such as language modeling and machine translation. Harnessing the transformer's ability to process long time horizons of information could provide a similar performance boost in partially observable reinforcement learning (RL) domains, but the large-scale transformers used in NLP have yet to be successfully applied to the RL setting. In this work we demonstrate that the standard transformer architecture is difficult to optimize, which was previously observed in the supervised learning setting but becomes especially pronounced with RL objectives. We propose architectural modifications that substantially improve the stability and learning speed of the original Transformer and XL variant. The proposed architecture, the Gated Transformer-XL (GTrXL), surpasses LSTMs on challenging memory environments and achieves state-of-the-art results on the multi-task DMLab-30 benchmark suite, exceeding the performance of an external memory architecture. We show that the GTrXL, trained using the same losses, has stability and performance that consistently matches or exceeds a competitive LSTM baseline, including on more reactive tasks where memory is less critical. GTrXL offers an easy-to-train, simple-to-implement but substantially more expressive architectural alternative to the standard multi-layer LSTM ubiquitously used for RL agents in partially observable environments.
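The abstract names the architecture but does not spell out the modifications. As a rough illustration only, the sketch below assumes the GRU-type gate described in the full paper, in which a learned gate replaces the residual connection around each submodule and a positive gate bias keeps the layer close to an identity map at initialization. The names `gru_gate`, `zero_params`, and `matvec`, and the bias value of 2.0, are illustrative choices, not the authors' code.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(W, v):
    # Plain matrix-vector product over nested lists.
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def gru_gate(x, y, params, bias=2.0):
    """GRU-style gate combining the skip-stream input x with the
    submodule output y (assumed form, used in place of x + y)."""
    Wr, Ur, Wz, Uz, Wg, Ug = params
    # Reset gate: how much of x feeds the candidate state.
    r = [sigmoid(a + b) for a, b in zip(matvec(Wr, y), matvec(Ur, x))]
    # Update gate: the positive bias pushes z toward 0, so the output starts near x.
    z = [sigmoid(a + b - bias) for a, b in zip(matvec(Wz, y), matvec(Uz, x))]
    rx = [ri * xi for ri, xi in zip(r, x)]
    # Candidate state built from y and the reset-scaled x.
    h = [math.tanh(a + b) for a, b in zip(matvec(Wg, y), matvec(Ug, rx))]
    return [(1.0 - zi) * xi + zi * hi for zi, xi, hi in zip(z, x, h)]

def zero_params(d):
    # Hypothetical helper: six d-by-d zero matrices, for demonstration only.
    return tuple([[0.0] * d for _ in range(d)] for _ in range(6))

if __name__ == "__main__":
    x = [1.0, -2.0]   # skip-stream input
    y = [5.0, 5.0]    # submodule output (e.g. attention or MLP)
    print(gru_gate(x, y, zero_params(2), bias=20.0))
```

With a large bias the update gate is almost closed, so the layer passes `x` through nearly unchanged regardless of `y`. That near-identity behavior at initialization is the property this sketch is meant to show.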

