Stabilizing Transformer-Based Action Sequence Generation For Q-Learning

10/23/2020 ∙ by Gideon Stein, et al. ∙ ITMO University

Since the publication of the original Transformer architecture (Vaswani et al. 2017), Transformers have revolutionized the field of Natural Language Processing, mainly due to their ability to model temporal dependencies better than competing RNN-based architectures. Surprisingly, this architectural shift has not yet affected the field of Reinforcement Learning (RL), even though RNNs are quite popular in RL and temporal dependencies are very common there. Recently, Parisotto et al. (2019) conducted the first promising research on Transformers in RL. To support the findings of that work, this paper seeks to provide an additional example of a Transformer-based RL method. Specifically, the goal is a simple Transformer-based Deep Q-Learning method that is stable over several environments. Due to the unstable nature of Transformers and RL, an extensive method search was conducted to arrive at a final method that leverages developments around Transformers as well as Q-learning. The proposed method can match the performance of classic Q-learning on control environments while showing potential on some selected Atari benchmarks. Furthermore, it was critically evaluated to give additional insights into the relation between Transformers and RL.





Deep Q-learning

The goal of Q-learning is to find a function that correctly maps state-action pairs (s, a) to their corresponding value for an agent that interacts with an environment. The simplest form of Q-learning is defined as updating a Q-function in the following manner:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] \qquad (1)$$

where $r$ is a reward, $\gamma$ is a discount factor, $\alpha$ is a step size, and $Q(s, a)$ represents the value of any pair (s, a). By updating Q-values after rewards, the greedy policy as well as the value function change frequently. Under the condition that all states are explored sufficiently, these updates are guaranteed to converge to a Q-function that correctly represents the environment. Based on this Q-function, a policy is formed by greedily selecting the action with the highest Q-value at every step. Traditionally, the Q-function was implemented as a table. However, it is possible to approximate it with a neural network. This method is known as Deep Q-learning. When using a function approximator for the Q-function, the definition of an update changes, since it is only possible to update the weights of the network and not specific Q-values directly. An update for a Deep Q-network (DQN) is therefore defined as:

$$\theta_{t+1} = \theta_t + \alpha \left[ r + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; \theta_t) - Q(s_t, a_t; \theta_t) \right] \nabla_{\theta_t} Q(s_t, a_t; \theta_t) \qquad (2)$$

where $\theta$ represents the network weights and $s_{t+1}$ and $a_{t+1}$ are the state and action at timestep t+1. When a network is updated according to Formula 2, an issue arises: data that is acquired by an agent interacting with an environment is quite different from the fixed dataset that is normally used to train neural networks. Today, Replay Buffers, a method to save experience and reuse it during model training, and target networks, a method to make the target of the update more stable, are used to counter these issues. By applying these two methods to Deep Q-learning, Mnih et al. (2013) opened the field of Deep Reinforcement Learning. Their method will be the baseline RL method for the course of this work.
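The tabular update rule described above can be sketched in a few lines of dependency-free Python. This is only an illustration: the function name `q_update`, the step size `alpha`, and the toy states are assumptions made for this example and are not taken from the code base of this work.

```python
from collections import defaultdict

def q_update(q, s, a, r, s_next, actions, alpha=0.5, gamma=0.99, done=False):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Bootstrap from the greedy value of the next state unless the episode ended.
    best_next = 0.0 if done else max(q[(s_next, b)] for b in actions)
    td_error = r + gamma * best_next - q[(s, a)]
    q[(s, a)] += alpha * td_error
    return td_error

q = defaultdict(float)      # unseen (state, action) pairs start at 0
actions = [0, 1]
q[("s1", 1)] = 2.0          # pretend the next state already has a known value
err = q_update(q, "s0", 0, r=1.0, s_next="s1", actions=actions)
```

A Deep Q-network replaces the table lookup `q[(s, a)]` with a network evaluation and applies the same temporal-difference error through a gradient step on the weights.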

The Transformer Architecture

The Transformer is a sequence-to-sequence (seq2seq) architecture that was initially developed to perform translation tasks in NLP and that relies heavily on the Attention mechanism. A seq2seq structure is defined as a model that takes in a sequence of signals and returns a sequence of outputs. The Transformer is also an Encoder-Decoder structure that splits into two distinct submodels: an Encoder that transforms an input sequence into an encoded representation, and a Decoder that generates a new output sequence based on that encoded representation. Since the proposed architecture is based on the Encoder of the classic Transformer (Vaswani et al. 2017), its structure will be discussed further. The Transformer Encoder takes in a sequence of word tokens and transforms them into the same number of encoded representations. To do this efficiently, the Transformer stacks several identical blocks, called Encoder layers, on top of each other. Additionally, the Encoder features an Embedding layer and a positional encoding of the input sequence, both of which are applied before the first layer. A single Encoder layer is constructed out of two main components: an Attention block and a feed-forward network. Additionally, residual connections and normalization are added. Fig. 1 shows the structure of the Encoder layer. Note that the input and the output of an Encoder layer have the same dimension, which makes layer stacking possible. The computation that takes place in a single Encoder layer is defined as:

$$X' = \mathrm{LayerNorm}(X + \mathrm{MultiHeadAttention}(X, X, X))$$
$$\mathrm{Output} = \mathrm{LayerNorm}(X' + \mathrm{FFN}(X')) \qquad (3)$$

where X represents an input tensor with the shape (batch size, input sequence length, model dimension). Typically, Dropout is deployed after the Attention block and after the feed-forward block.

Figure 1: The standard Transformer Encoder layer

The Attention mechanism

Attention is a mechanism that estimates the importance of specific inputs for other inputs and combines them into a new vector that includes this information. This mechanism does not rely on a hidden representation that includes all past information but attends directly to the full inputs. This helps Attention to perform better than RNN-based approaches in many cases, especially when long-term dependencies are present and relevant. Based on Attention, Multi-Head Attention performs multiple Attention operations in parallel on sub-parts of the inputs. This allows attending to multiple sub-areas of the inputs at once. The Transformer uses Scaled Dot Product Attention as well as Multi-Head Attention to model dependencies. Scaled Dot Product Attention is defined as an operation on three inputs, Keys (K), Queries (Q), and Values (V):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V \qquad (4)$$

where Q, K, V are input matrices, and $d_k$ is the last dimension of K. Intuitively, this can be understood as a way to scale and add the content of V by a factor that is a combination of Q and K. Through this channel, V attends to the information that is included in Q and K and is altered accordingly. To perform Multi-Head Attention, the initial input vector is simply split. When performing Multi-Head Attention in the Encoder, the embedded input sequence represents K, Q, and V. This specific form of Attention is called Self-Attention, since the input sequence attends to itself. It is a key component that allows the Encoder to encode the input sequence efficiently.
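A minimal, dependency-free sketch of Scaled Dot Product Attention and of the head split used for Multi-Head Self-Attention may help to make the operation concrete. It works on plain lists of row vectors and, as a simplifying assumption, omits the learned query, key, value, and output projections that a full Transformer layer applies around each head. All function names are illustrative.

```python
import math

def softmax(row):
    m = max(row)                                   # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)                  # one weight per key/value row
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

def multi_head_self_attention(X, num_heads):
    """Split the model dimension into heads, attend per head, concatenate."""
    d_model = len(X[0])
    assert d_model % num_heads == 0, "model dimension must be divisible by the head count"
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):
        part = [row[h * d_head:(h + 1) * d_head] for row in X]
        heads.append(attention(part, part, part))  # self-attention: Q = K = V
    # Concatenate the head outputs position by position.
    return [sum((head[t] for head in heads), []) for t in range(len(X))]

X = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
encoded = multi_head_self_attention(X, num_heads=2)
```

The output has the same shape as the input (here three positions with four features each), which is the property that allows Encoder layers to be stacked.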

Transformers for Q-learning

Transformer-based Q-Networks

This paper proposes to use an altered version of the Transformer Encoder as the Q-network of a Q-learning agent. Two changes to the original structure are required to make it usable. First, to obtain Q-values at the end of the model, a fully connected layer is added after the last Encoder layer that maps the output of the Encoder to the Q-value dimension. Second, the embedding layer of the classic Transformer is replaced by a fully connected layer that maps from the state dimension to the model dimension. After these two steps, a Transformer-based Q-network (TBQN) that can map from states to Q-values is obtained. It can be examined in Figure 2.
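The two structural changes can be illustrated with a small, dependency-free sketch of the data flow through a TBQN. Several details are assumptions made only for this example: the Encoder layers are passed in as shape-preserving callables (identity placeholders here), the positional encoding is omitted, the Q-values are read from the last position of the state history, and all names and dimensions are hypothetical.

```python
import random

def linear(x, W, b):
    """Fully connected layer; W has shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]

def init_linear(in_dim, out_dim, rng):
    W = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return W, [0.0] * out_dim

def tbqn_forward(history, params, encoder_layers):
    W_in, b_in, W_out, b_out = params
    # Replaces the embedding layer: state dimension -> model dimension.
    x = [linear(state, W_in, b_in) for state in history]
    for layer in encoder_layers:       # each layer keeps the sequence shape
        x = layer(x)
    # Added after the last Encoder layer: model dimension -> one Q-value per action.
    # Reading from the last position is an assumption of this sketch.
    return linear(x[-1], W_out, b_out)

rng = random.Random(0)
state_dim, d_model, n_actions, horizon = 4, 8, 2, 5
params = (*init_linear(state_dim, d_model, rng), *init_linear(d_model, n_actions, rng))
history = [[rng.uniform(-1, 1) for _ in range(state_dim)] for _ in range(horizon)]
q_values = tbqn_forward(history, params, encoder_layers=[lambda seq: seq] * 3)
greedy_action = max(range(n_actions), key=lambda a: q_values[a])
```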

The literature (Parisotto et al. 2019; Upadhyay et al. 2019; Mishra et al. 2017) suggests that a Q-learning agent using the proposed TBQN would be very hard to optimize and most likely unstable. To preemptively counter this, a method variation search space was constructed which includes three categories. Firstly, changes to the model structure itself. Secondly, the application of additional methods for DQNs and Transformers. Both of these categories represent small model or method variations that are proposed in the literature and might be able to improve the performance of TBQNs. Thirdly, a selection of potentially impactful Hyperparameters is included. This search space was then filtered to find a method variation that is easier to optimize and more stable than a base Q-learning agent featuring the base TBQN.

Figure 2: The proposed Transformer-based Q-network

Transformer layer variations

Since the original publication of the Transformer (Vaswani et al. 2017), many Transformer layer variations have been introduced in the literature. These structural changes aim to make the Transformer more stable during training. From this literature, several Transformer layer variations were selected to be tested as the core layer for TBQNs.

Dropout free models (layer type 2)

Since Transformers were initially developed for NLP, they feature the usage of Dropout layers. Dropout is typically used to counter overfitting, and its usage in RL is not popular. Due to this, a layer without Dropout was tested. Additionally, all layer variations are tested with and without Dropout after the final layer. This layer variation is displayed in Figure 3a.

Identity Map Reordering (IMR) (layer type 3)

A layer variation that was described in Parisotto et al. (2019). It moves the normalization layer to the start of each sub-layer. Furthermore, an additional ReLU activation is added after every sub-layer to prevent two linear layers in a row. Its implementation can be observed in Figure 3b.

Pre layer Normalization (layer type 4)

Very similar to IMR, this variation described in Xiong et al. (2020) moves the layer normalization to the beginning of each sub-layer. Unlike IMR, it does not feature an additional ReLU activation. Its implementation can be observed in Figure 3c.
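The difference between the standard layer, Pre layer Normalization, and IMR comes down to where the normalization and the extra ReLU sit around each sub-layer. The following dependency-free sketch shows the three orderings for a single feature vector and a generic `sublayer` callable (standing in for the Attention block or the feed-forward network). It reflects one reading of the descriptions above, omits the learned scale and shift of layer normalization, and uses illustrative names.

```python
import math

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def relu(x):
    return [max(0.0, v) for v in x]

def add(x, y):
    return [a + b for a, b in zip(x, y)]

def post_ln(x, sublayer):
    # Standard layer: normalize after the residual sum.
    return layer_norm(add(x, sublayer(x)))

def pre_ln(x, sublayer):
    # Layer type 4: normalize the sub-layer input, keep the residual path untouched.
    return add(x, sublayer(layer_norm(x)))

def imr(x, sublayer):
    # Layer type 3: like pre_ln, plus a ReLU on the sub-layer output.
    return add(x, relu(sublayer(layer_norm(x))))

x = [1.0, 2.0, 3.0]
negate = lambda v: [-a for a in v]       # toy stand-in for a sub-layer
outputs = {f.__name__: f(x, negate) for f in (post_ln, pre_ln, imr)}
```

With normalization moved to the sub-layer input, a sub-layer that outputs zeros leaves `x` unchanged, so the layer starts out close to an identity map.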

Output gate connections (layer type 5)

Also described in Parisotto et al. (2019), this variation based on IMR additionally replaces the residual connection with a gated layer. While residual connections were initially implemented to improve the training of deep neural networks, they seem to make training Transformers more unstable. They are replaced with the following gate formulation, where x is the input of the sub-layer, y is its output, and W and b are trainable parameters:

$$g(x, y) = x + \sigma(W x - b) \odot y \qquad (5)$$

This variation has already been tested for Transformer-based methods in RL and will be used as proposed in Parisotto et al. (2019).
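As a rough illustration, the output gate of Parisotto et al. (2019), x + sigmoid(W x - b) * y with x the sub-layer input and y the sub-layer output, can be written for a single feature vector as follows. The helper names and the toy parameters are assumptions for this sketch only.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def output_gate(x, y, W, b):
    """Return x + sigmoid(W x - b) * y, computed element-wise."""
    gate = [sigmoid(p - bi) for p, bi in zip(matvec(W, x), b)]
    return [xi + g * yi for xi, g, yi in zip(x, gate, y)]

W = [[0.0, 0.0], [0.0, 0.0]]                              # toy parameters
half_open = output_gate([1.0, 2.0], [4.0, 6.0], W, [0.0, 0.0])
nearly_closed = output_gate([1.0, 2.0], [4.0, 6.0], W, [50.0, 50.0])
```

With a large bias the gate is almost closed and the layer passes x through nearly unchanged, while a gate value of 0.5 mixes in half of the sub-layer output.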

GRU gate connections (layer type 6)

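Finally, another variation will be tested which uses a different gating mechanism based on a GRU unit. Again, this variation was introduced in Parisotto et al. (2019) and is based on IMR. Notably, this model variation combined with Maximum a Posteriori Policy Optimization (Song et al. 2019) achieved SOTA results on DMLab-30. It remains to be seen if this is also the case for Q-learning. The mechanism is defined by Formula (6), where x is the input of the sub-layer, y is its output, and the W and U matrices and the bias $b_g$ are trainable parameters.

$$r = \sigma(W_r y + U_r x)$$
$$z = \sigma(W_z y + U_z x - b_g)$$
$$\hat{h} = \tanh(W_g y + U_g (r \odot x))$$
$$g(x, y) = (1 - z) \odot x + z \odot \hat{h} \qquad (6)$$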
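A compact sketch of the GRU-style gate from Parisotto et al. (2019) for a single feature vector is given below, with x the sub-layer input and y the sub-layer output. The parameter dictionary layout and the helper names are assumptions made for this illustration.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def gru_gate(x, y, p):
    """GRU-style gate: (1 - z) * x + z * h_hat, computed element-wise."""
    r = [sigmoid(a + b) for a, b in zip(matvec(p["W_r"], y), matvec(p["U_r"], x))]
    z = [sigmoid(a + b - bg)
         for a, b, bg in zip(matvec(p["W_z"], y), matvec(p["U_z"], x), p["b_g"])]
    rx = [ri * xi for ri, xi in zip(r, x)]                 # reset gate applied to x
    h_hat = [math.tanh(a + b)
             for a, b in zip(matvec(p["W_g"], y), matvec(p["U_g"], rx))]
    return [(1.0 - zi) * xi + zi * hi for zi, xi, hi in zip(z, x, h_hat)]

Z = [[0.0, 0.0], [0.0, 0.0]]                               # toy all-zero weights
params = {"W_r": Z, "U_r": Z, "W_z": Z, "U_z": Z, "W_g": Z, "U_g": Z, "b_g": [0.0, 0.0]}
mixed = gru_gate([2.0, 4.0], [1.0, 1.0], params)
passthrough = gru_gate([2.0, 4.0], [1.0, 1.0], {**params, "b_g": [50.0, 50.0]})
```

A large positive `b_g` drives the update gate z towards zero, so the gate initially behaves like an identity map on x.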

(a) Layer type 2 (no dropout)
(b) Layer type 3 (IMR)
(c) Layer type 4 (norm first)
(d) Layer type 5/6 (gated)
Figure 3: The Transformer Encoder layer variations

Additional methods and Hyperparameters

In addition to these layer variations, the following methods and Hyperparameters were included categorically in the search space to test their effect on the performance of a TBQN:

  • Double Q-learning

  • Target update period

  • Target update ($\tau$) (Lillicrap et al. 2015)

  • Learning rate schedules

  • Depth-Scaled Initialization (Zhang et al. 2019)

  • Depth-Scaled Initialization of the last Layer (Zhang et al. 2019)

  • Number of Attention Heads

  • Initial collection steps

  • Environment normalization

  • Epsilon Greedy

  • Replay Buffer size

  • Future reward discount ($\gamma$)

  • Batch size

  • Learning rate

  • Encoder type (whether or not dropout is used outside of the Encoder layers)
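Two of the listed options, Double Q-learning and the soft target update of Lillicrap et al., only change how the learning target and the target network are computed. A minimal sketch under illustrative names follows; the default values for `gamma` and `tau` are placeholders and not the values selected in this work.

```python
def dqn_target(r, q_next_target, gamma=0.99, done=False):
    """Classic target: the target network both selects and evaluates the next action."""
    return r if done else r + gamma * max(q_next_target)

def double_q_target(r, q_next_online, q_next_target, gamma=0.99, done=False):
    """Double Q-learning: the online network selects, the target network evaluates."""
    if done:
        return r
    a_star = max(range(len(q_next_online)), key=lambda a: q_next_online[a])
    return r + gamma * q_next_target[a_star]

def soft_update(target_weights, online_weights, tau=0.005):
    """Move every target weight a fraction tau towards the online weight."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_weights, online_weights)]

q_online, q_target = [1.0, 3.0], [5.0, 2.0]
classic = dqn_target(1.0, q_target, gamma=0.5)                # 1 + 0.5 * 5.0
double = double_q_target(1.0, q_online, q_target, gamma=0.5)  # 1 + 0.5 * 2.0
new_target = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.1)
```

In the toy example, the classic target picks up the overestimated value 5.0, while the Double Q-learning target uses the action preferred by the online network. A target update period, in contrast, copies the online weights into the target network every fixed number of steps.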


Baseline performance

To motivate the method variation search and to set a base performance of TBQNs, a Q-learning agent with the proposed base TBQN and with no special additions (except a Replay Buffer and a Target Network) was evaluated. The agent was trained on four environments (MountainCar-v0, Acrobot-v1, CartPole-v1, and LunarLander-v2) for 150k steps. All these environments are implemented by OpenAI Gym (Brockman et al. 2016). The average episode return over 10 episodes can be examined in Figure 4. Two training runs per environment were executed. The agent was not able to solve any environment sufficiently, fluctuated strongly, and even diverged on some occasions (denoted by a graph ending before 150k steps). For these experiments, the following Hyperparameters were used: initial collect steps: 1000, loss: mean squared error, Attention Heads: 4, epsilon greedy: 0.1, Replay Buffer length: 100000, batch size: 32, learning rate: 1e-5. The rest of the parameters were not used. These results show quite clearly that TBQNs need additional help to perform.

(a) LunarLander-v2
(b) Acrobot-v1
(c) MountainCar-v0
(d) CartPole-v1
Figure 4: Average return during training of a base Q-learning agent with the proposed TBQN base in four different environments.

Selecting the optimal method variation

While it would be ideal to test every possible method variation, this is infeasible due to the computational cost. Therefore, a two-step method based on two distinct studies was constructed to find a well-performing method variation.

Study one - Parameter importance

The first study focused on narrowing down the method search space significantly. This was achieved by estimating the Mean Decrease Impurity Importance Score for all parameters in the method search space. Based on these scores, parameters with low importance were excluded entirely. Furthermore, parameters with high importance were further evaluated to select the best performing values and exclude the rest from the search space. To estimate these scores, the method search space had to be sampled and evaluated. Since grid search was infeasible, a Tree-Structured Parzen Estimator, which was first described in Bergstra et al. (2011), was used to sample from the search space. Every method search space sample was trained for 15k steps. As a final performance score, the average return of the last 10 episodes was used. To guarantee generality, the study was conducted independently in three different environments (CartPole-v1, Acrobot-v1, and LunarLander-v2) and the final importance score for every parameter was averaged between these environments. All studies were performed on a single GPU (Nvidia 1080 Ti). Further information can be found in Appendix B.

Study two - Final selection

After having narrowed down the method search space significantly, the remaining search space samples were evaluated further to find the model variation with the best performance. Two methods were used to determine the effect of certain parameter values on the performance of TBQNs and to select a final method variation. First, the mean reward of the last ten episodes was calculated over all samples in which a certain parameter value was present. Second, the search space samples with the highest rewards for every environment were extracted. This was done to determine whether combinations of specific parameter values performed especially well. Again, the study was conducted in three different environments (CartPole-v1, Acrobot-v1, and LunarLander-v2) to guarantee generality. All search space samples were trained for 75k steps in the environments CartPole-v1 and Acrobot-v1 and for 150k steps in the environment LunarLander-v2. All experiments were conducted on a single GPU (Nvidia 1080 Ti). Further information can be found in Appendix B.
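The two evaluation methods of the second study amount to a group-by average and a top-k selection over the evaluated samples. The sketch below illustrates both on made-up numbers. The sample records, their field names, and the returns are purely hypothetical and are not results of this work.

```python
from collections import defaultdict
from statistics import mean

def mean_return_by_value(samples, parameter):
    """Average the final return over all samples that share a parameter value."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["params"][parameter]].append(sample["final_return"])
    return {value: mean(returns) for value, returns in groups.items()}

def best_samples(samples, k=1):
    """Return the k samples with the highest final return."""
    return sorted(samples, key=lambda s: s["final_return"], reverse=True)[:k]

# Hypothetical search space samples for one environment.
samples = [
    {"params": {"layer_type": 3, "lr": 1e-4}, "final_return": 200.0},
    {"params": {"layer_type": 3, "lr": 1e-5}, "final_return": 100.0},
    {"params": {"layer_type": 6, "lr": 1e-4}, "final_return": 20.0},
]
per_layer_type = mean_return_by_value(samples, "layer_type")
top = best_samples(samples, k=1)
```

The per-value averages show the marginal effect of a single parameter, while the top-k list exposes combinations of values that work well together.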


Based on the two studies, the method variation represented by Table 1 was selected, as it performed well in all environments and proved to be stable during training. The parameters initial collect steps, Environment normalization, Replay Buffer size, $\tau$, double Q-learning, and the Encoder type had low importance for control environments and are not specified. The following comments should be made to accompany this selection:

Parameter                                 Value             Category
Gradient Clipping                         True              Fixed
Batch size                                32                Fixed
Learning rate                             1e-4              Fixed
Layer type                                3                 Fixed
Custom lr schedule                        No                Fixed
Depth-Scaled Initialization               1                 Fixed
Target update period                      10+               Semi-fixed
Num Heads                                 4/2               Semi-fixed
Epsilon Greedy                            (0. - 1.)         Environment dependent
Depth-Scaled Initialization (last layer)  (T/F)             Environment dependent
Loss function                             (Huber, Squared)  Environment dependent
$\gamma$                                  (.99, .95)        Environment dependent
Table 1: Final method variation

  • The optimal values for several parameters are environment-dependent. The performance of a Q-learning agent using a TBQN therefore relies strongly on selecting the right values for each environment.

  • Surprisingly, IMR layers (layer type 3) perform the best while GRU-gated layers (layer type 6) were excluded early due to frequent divergence.

  • While very important in NLP, learning rate schedules are not required for TBQNs. It is presumed that the layer variations stabilize training enough to make learning rate schedules obsolete.

  • Depth-Scaled Initialization zhang2019improving is beneficial. Models that were initialized with it tended to diverge less and achieved higher average rewards at the end of training.

  • Gradient Clipping is very important for TBQNs. Since the Transformer has problems with divergence in the RL setting, Gradient Clipping helps to mitigate destructive updates.

  • Several parameters are not important for model performance (assuming no extreme values). During the parameter search, they showed no significant impact on the performance of TBQNs.
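Gradient Clipping and the choice between the Huber and the squared loss both limit the size of destructive updates. The sketch below shows one common form of each, clipping by global norm and the Huber loss with threshold `delta`. Which clipping variant and which thresholds were used in this work is not stated above, so both are assumptions of this example.

```python
import math

def squared_loss(td_error):
    return 0.5 * td_error * td_error

def huber_loss(td_error, delta=1.0):
    """Quadratic for small errors, linear for large ones."""
    a = abs(td_error)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)

def clip_by_global_norm(grads, max_norm):
    """Rescale the whole gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

large_error = 3.0
losses = (squared_loss(large_error), huber_loss(large_error))   # (4.5, 2.5)
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)         # direction is kept
```

For a large temporal-difference error the Huber loss grows only linearly, and clipping keeps the direction of the gradient while bounding its length.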

Performance in control environments

To evaluate the performance of the method variation specified in Table 1, its average return during training was compared to the average return during training of an optimized classic Q-learning agent in four environments (CartPole-v1, Acrobot-v1, MountainCar-v0, and LunarLander-v2). The classic agent was taken from the RL Baselines Zoo (rl-zoo), a collection of Hyperparameter-optimized methods. Additionally, the final method variation was tested with different values for history length, model dimensions, and the number of layers (Appendix C) to ensure that it is stable and performs consistently when scaled up or down. Figure 5 shows that the performance of the proposed model is consistent over different model sizes. Furthermore, a comparison of Figure 5 with Figure 6 shows that the average return during training of these different approaches is comparable.

(a) LunarLander-v2 (150k steps)
(b) Acrobot-v1 (150k steps)
(c) MountainCar-v0 (150k steps)
(d) CartPole-v1 (150k steps)
Figure 5: Average return during training of the final model variation with different model dimensions on control environments
(a) LunarLander-v2 (200k steps)
(b) Acrobot-v1 (100k steps)
(c) MountainCar-v0 (100k steps)
(d) CartPole-v1 (100k steps)
Figure 6: Average return during training of a HP optimized classic Q-learning agent

Performance in ATARI environments

(a) Atari Asteroids
(b) Atari MsPacman 1
(c) Atari MsPacman 2
Figure 7: Average return during training on Atari

In addition to the control environments, the proposed method variation was trained in two environments ("MsPacman" and "Asteroids") of the popular ATARI benchmark. Two parameters that are not included in Table 1 were scaled up from their initial values to match the complexity of the new environments and to keep them in reasonable ranges. The initial collect steps were increased from 1000 to 5000. Additionally, the Replay Buffer size was increased from 100k to 200k. For both environments, the RAM state, which consists of 128 bytes, was used as the state vector for training. The proposed method was trained on "MsPacman" for 4 and 5 million timesteps and on "Asteroids" for 5 million timesteps. The best average return during training was compared to the reported results from Mnih et al. (2015) (classic DQN performance) and Hausknecht and Stone (2015) (RQN performance). Both studies trained their methods for 10 million timesteps before reporting their final average returns.

When tracking the method state with the best average return during training for the "Asteroids" environment, the proposed method variation performs quite well. After only 5 million timesteps, the Q-learning agent achieved a higher average return than the reported RQN-based and DQN-based methods. However, the model performance fluctuates strongly during training, which makes the final Q-learning agent perform quite badly. For the "MsPacman" environment, no superior performance can be reported. Additionally, the training behavior varies significantly. The same TBQN-based Q-learning agent was trained twice. The first run (7b) shows consistent learning over the whole training period. The second run (7c) shows no increase in performance over the whole training period. While TBQN-based methods can perform in ATARI environments, it is still a challenging task. More experiments must be conducted to form a final conclusion. It is also suspected that conducting the parameter search in control environments might have had a negative effect on the performance of TBQNs in ATARI environments. We are confident, however, that this challenge can be overcome by committing more computational resources in the future.

Method    Asteroids      MsPacman
DQN (2)   1629 +/- 542   2311 +/- 525
DQN (1)   1070 +/- 345   2363 +/- 735
RQN (1)   1020 +/- 312   2048 +/- 653
TBQN      1813 +/- 396   1555 +/- 696
Table 2: Reported average returns of different methods on Atari. (1) = Hausknecht and Stone (2015), (2) = Mnih et al. (2015)


During this work, the interaction of Transformer architectures and Deep Q-learning was evaluated. The goal of this work was to craft a new RL method based on the combination of Deep Q-learning and Transformer-based models, and this goal was achieved. Through an extensive method variation search, a Transformer-based Deep Q-Learning method was constructed which leverages developments around Transformers as well as Q-learning. The proposed model can match the performance of an optimized classic Q-learning agent on control environments while showing potential on selected Atari environments. Despite these successes, testing the proposed final method variation on more environments, especially on environments that require a deep understanding of past states, is still essential to form a final conclusion. The results of this work are complementary to Parisotto et al. (2019) and are another step towards a better understanding of Transformer architectures in RL. This work challenges past results that dismissed Transformer architectures in RL and shows that they can perform when handled carefully. While the proposed method is connected to the one used in Parisotto et al. (2019), it represents a different version of a Transformer-based RL method that can be deployed, tuned, and tested more easily. To further encourage this, the code base of this research can be accessed under GS. It is hoped that this work can help to support new studies on the topic of Transformers in RL and bring them into the RL mainstream.


A. Model specifications

Throughout this work, the TBQN dimensions specified in Table 3 were used.

Specification        Control   Atari
History horizon      5 steps   4 steps
Encoding Dimension   64        64
Number of Layers     3         2
Dff Dimension        256       256
Table 3: TBQN dimensions throughout this work

B. Study specifications

Table 4 and Table 5 hold additional information concerning the studies that were conducted to arrive at a final method variation.

Specification                              Value
Number of evaluated search space samples   30
Number of environments                     3
Runs per sample                            2
Training steps                             15k
Table 4: Additional information for study 1

Specification                    Value
Remaining search space samples   24
Number of environments           3
Runs per sample                  2
Training steps                   150k / 75k
Table 5: Additional information for study 2

C. Model dimension variants

During the evaluation of the final TBQN variation, the model dimensions were altered to test for stability when scaling TBQNs up or down. The following variations were tested:

  • History horizon: 5, Dimensions: 64/256, Layers: 3

  • History horizon: 5, Dimensions: 64/256, Layers: 6

  • History horizon: 3, Dimensions: 64/256, Layers: 3

  • History horizon: 7, Dimensions: 64/256, Layers: 3

  • History horizon: 5, Dimensions: 128/512, Layers: 3

D. Additional comments

All experiments and studies were conducted on a single GPU (Nvidia 1080 Ti). Specific parameters that are not explicitly defined are set to the default values of TensorFlow (Abadi et al. 2015) or are defined in the experiment scripts available at GS.