Multi-Game Decision Transformers

by Kuang-Huei Lee, et al.

A longstanding goal of the field of AI is a strategy for compiling diverse experience into a highly capable, generalist agent. In the subfields of vision and language, this was largely achieved by scaling up transformer-based models and training them on large, diverse datasets. Motivated by this progress, we investigate whether the same strategy can be used to produce generalist reinforcement learning agents. Specifically, we show that a single transformer-based model - with a single set of weights - trained purely offline can play a suite of up to 46 Atari games simultaneously at close-to-human performance. When trained and evaluated appropriately, we find that the same trends observed in language and vision hold, including scaling of performance with model size and rapid adaptation to new games via fine-tuning. We compare several approaches in this multi-game setting, such as online and offline RL methods and behavioral cloning, and find that our Multi-Game Decision Transformer models offer the best scalability and performance. We release the pre-trained models and code to encourage further research in this direction. Additional information, videos and code can be seen at:





1 Introduction

Building large-scale generalist models that solve many tasks by training on massive task-agnostic datasets has emerged as a dominant approach in natural language processing (Devlin et al., 2018; Brown et al., 2020), computer vision (Dosovitskiy et al., 2020; Arnab et al., 2021), and their intersection (Radford et al., 2021; Alayrac et al., 2022). These models can adapt to new tasks (such as translation (Raffel et al., 2019; Xue et al., 2021)), make use of unrelated data (such as using high-resource language to improve translations of low-resource languages (Dabral et al., 2021)), or even incorporate new modalities by projecting images into language space (Lu et al., 2021; Tsimpoukelli et al., 2021). The success of these methods largely derives from a combination of scalable model architectures (Vaswani et al., 2017), an abundance of unlabeled task-agnostic data, and continuous improvements in high performance computing infrastructure. Crucially, scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022) indicate that performance gains due to scale have not yet reached a saturation point.

In this work, we argue that a similar progression is possible in the field of reinforcement learning, and take initial steps toward scalable methods that produce highly capable generalist agents. In contrast to the vision and language domains, reinforcement learning has seen advocacy for the use of smaller models (Cuccu et al., 2018; Mania et al., 2018; Bastani et al., 2018) and is usually applied either to a single task or to multiple tasks within the same environment. Importantly, training across multiple environments – with very different dynamics, rewards, visuals, and agent embodiments – has received far less study.

Specifically, we investigate whether a single model – with a single set of parameters – can be trained to act in multiple environments from large amounts of expert and non-expert experience. We consider training on a suite of 41 Atari games (Bellemare et al., 2013; Gulcehre et al., 2020) for their diversity, informally asking “Can models learn something universal from playing many video games?”. To train this model, we use only the previously-collected trajectories from Agarwal et al. (2020), but we evaluate our agent interactively. We are not striving for the mastery or efficiency that game-specific agents can offer, as we believe we are still in the early stages of this research agenda. Rather, we investigate whether the same trends observed in language and vision hold for large-scale generalist reinforcement learning agents.

Figure 1: Aggregates of human-normalized scores (inter-quartile mean) across 41 Atari games. Grey bars are single-game specialist models; blue bars are generalists. Single-game BCQ (Fujimoto et al., 2019) results are from Gulcehre et al. (2020). Multi-game models are all trained on a dataset (Agarwal et al., 2020) with an inter-quartile mean human-normalized score of 101%, which Multi-Game DT notably exceeds.

We find that we can train a single agent that achieves 126% of human-level performance simultaneously across all games after training on offline expert and non-expert datasets (see Figure 1). Furthermore, we see similar trends that mirror those observed in language and vision: rapid fine-tuning to never-before-seen games with very little data (Section 4.5), a power-law relationship between performance and model size (Section 4.4), and faster training progress for larger models.

Notably, not all existing approaches to multi-environment training work well. We investigate several approaches, including treating the problem as offline decision transformer-based sequence modeling (Chen et al., 2021; Janner et al., 2021), online RL (Mnih et al., 2015), offline temporal difference methods (Kumar et al., 2020), contrastive representations (Oord et al., 2018), and behavior cloning (Pomerleau, 1991). We find that decision transformer based models offer the best performance and scaling properties in the multi-environment regime. However, to permit training on both expert and non-expert trajectories, we find it is necessary to use a guided generation technique from language modeling to generate expert-level actions, which is an important departure from standard decision transformers.

Our contributions are threefold: First, we show that it is possible to train a single high-performing generalist agent to act across multiple environments from offline data alone. Second, we show that scaling trends observed in language and vision hold. And third, we compare multiple approaches for achieving this goal, finding that decision transformers combined with guided generation perform the best. It is our hope this study can inspire further research in generalist agents. To aid this, we make our pre-trained models and code publicly available.

Figure 2: An overview of the training and evaluation setup. We observe expert-level game-play in the interactive setting after offline learning from trajectories ranging from beginner to expert.

2 Related Work

A generalist agent for solving a variety of environments has been a goal for artificial intelligence (AI) researchers since the inception of AI as a field of study (McCarthy et al., 2006). This same reason motivated the introduction of the Atari suite (the Arcade Learning Environment, or ALE) as a testbed for learning algorithms (Bellemare et al., 2013); in their own words, the ALE is for “empirically assessing agents designed for general competency.” While the celebrated deep Q-learning (Mnih et al., 2013) and actor-critic (Mnih et al., 2016) agents were among the first to use a single algorithm for all games, they nevertheless required separate training and hyperparameters for each game. Later works have demonstrated the ability to learn a single neural network agent on multiple Atari games simultaneously, either online (Espeholt et al., 2018) or via policy distillation (Parisotto et al., 2015; Rusu et al., 2015). The aim of our work is similar – to learn a single agent for playing multiple Atari games – with a focus on offline learning. We demonstrate results with human-level competency on up to 46 games, a breadth previously unseen in the literature.

A closely related setting is learning to solve multiple tasks within the same or similar environments. For example, in robotics, existing works propose language-conditioned tasks (Lynch and Sermanet, 2020; Ahn et al., 2022; Jang et al., 2022), while others posit goal-reaching as a way to learn general skills (Mendonca et al., 2021), among other proposals (Kalashnikov et al., 2021; Yu et al., 2020). In this work, we tackle the problem of learning to act in a large collection of environments with distinctively different dynamics, rewards, and agent embodiments. This complicated but important setting requires a different type of generalization that has received significantly less study.

A concurrent work (Reed et al., 2022) also aims to train a transformer-based generalist agent based on offline data including for the ALE. This work differs from ours in that the offline training data is exclusively near-optimal and it requires prompting by expert trajectories at inference time. In contrast, we extend decision transformers (Chen et al., 2021) from the Upside-Down RL family (Srivastava et al., 2019; Schmidhuber, 2019) to learn from a diverse dataset (expert and non-expert data), predict returns, and pick optimality-conditioned returns. Furthermore, we provide comparisons against existing behavioral cloning, online and offline RL methods, and contrastive representations (Yang and Nachum, 2021; Oord et al., 2018). Other works that also consider LLM-like sequence modeling for a variety of single control tasks include (Reid et al., 2022; Zheng et al., 2022; Janner et al., 2021; Furuta et al., 2021; Ortega et al., 2021).

3 Method

We consider a decision-making agent that at every time step $t$ receives an observation $o_t$ of the world, chooses an action $a_t$, and receives a scalar reward $r_t$. Our goal is to learn a single optimal policy distribution $P_\theta$ with parameters $\theta$ that maximizes the agent’s total future return on all the environments we consider.

3.1 Reinforcement Learning as Sequence Modeling

Following Chen et al. (2021), we pose the problem of offline reinforcement learning as a sequence modeling problem, where we model the probability of the next sequence token $x_t$ conditioned on all tokens prior to it, $P_\theta(x_t \mid x_{<t})$, similar to contemporary decoder-only sequence models (Brown et al., 2020; Chowdhery et al., 2022; Rae et al., 2021). The sequences we consider have the form:

$$x = \langle \ldots, o_t^1, \ldots, o_t^M, \hat{R}_t, a_t, r_t, \ldots \rangle, \quad t = 1, \ldots, T,$$

where $t$ represents a time-step, $M$ is the number of image patches per observation (which we further discuss in Section 3.2), and $\hat{R}_t$ is the agent’s target return for the rest of the sequence. Such a sequence order respects the causal structure of the environment decision process. Figure 3 presents an overview of our model architecture.

Figure 3: An overview of our decision transformer architecture.

Returns, actions, and rewards are tokenized (see Section 3.2 for details), and we train the model to predict the next return, action, and reward token in a sequence via a standard cross-entropy loss. The sequence order we consider differs from Chen et al. (2021), which uses $\langle \hat{R}_t, o_t, a_t \rangle$ and does not model rewards. Our design allows predicting the return distribution and sampling from it, instead of relying on a user to manually select an expert-level return at inference time (see Section 3.4).
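As a concrete sketch of the interleaving described above (token values and shapes are illustrative; the real model operates on learned embeddings rather than raw token ids):

```python
import numpy as np

def build_sequence(obs_patch_tokens, return_tokens, action_tokens, reward_tokens):
    """Interleave tokens as <o_t^1..o_t^M, R_t, a_t, r_t> for each timestep t.

    obs_patch_tokens: (T, M) array of observation patch token ids
    return_tokens, action_tokens, reward_tokens: (T,) arrays of token ids
    """
    T, M = obs_patch_tokens.shape
    seq = []
    for t in range(T):
        seq.extend(obs_patch_tokens[t].tolist())  # M observation patch tokens
        seq.append(int(return_tokens[t]))         # target-return token
        seq.append(int(action_tokens[t]))         # action token
        seq.append(int(reward_tokens[t]))         # reward token
    return seq

# 4 timesteps x (36 patches + 3 scalar tokens) = 156 tokens, matching Section 4.1
obs = np.zeros((4, 36), dtype=int)
seq = build_sequence(obs, 5 * np.ones(4, int), 6 * np.ones(4, int), 7 * np.ones(4, int))
```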

Predicting future values and rewards has been shown to be a useful objective for learning better representations in artificial reinforcement learning agents (Lyle et al., 2021; Schrittwieser et al., 2020; Lee et al., 2020) and an important signal for representation learning in humans (Alexander and Gershman, 2021). Thus, while we may not directly use all of the predicted quantities, the task of predicting them encourages structure and representation learning of our environments. In this work, we do not attempt to predict future observations due to their non-discrete nature and the additional model capacity that would be required to generate images. However, building image-based forward prediction models of the environment has been shown to be a useful representation objective for RL (Hafner et al., 2019b, a, 2020). We leave this for future investigation.

3.2 Tokenization

To generate returns, actions, and rewards via multinomial distributions, similarly to language generation, we convert these quantities to discrete tokens. Actions are already discrete quantities in the environments we consider. We convert scalar rewards to ternary quantities $\{-1, 0, +1\}$, and uniformly quantize returns into a discrete range shared by all our environments. (The training datasets we use (Section 3.3) contain scalar reward values clipped to $[-1, +1]$. For return quantization, we use the range $\{-20, \ldots, 100\}$ with bin size 1 in all our experiments, as we find it covers most of the returns we observe in the datasets.)
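A minimal sketch of this tokenization, assuming the sign-based ternary reward scheme and the $\{-20, \ldots, 100\}$ return range stated above (the shift of tokens to non-negative ids is our own illustrative choice):

```python
import numpy as np

def tokenize_reward(r):
    """Map a clipped scalar reward to a ternary token: {-1, 0, +1} -> {0, 1, 2}."""
    return int(np.sign(r)) + 1

def tokenize_return(R, low=-20, high=100):
    """Uniformly quantize a return into integer bins shared across all games."""
    R = int(np.clip(round(R), low, high))
    return R - low  # token id in [0, high - low]
```

Because the same bins are shared by every game, the return vocabulary stays fixed regardless of how many environments are added.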

Inspired by the simplicity and effectiveness of the transformer architecture for processing images (Dosovitskiy et al., 2020), we divide each observation image into a collection of patches (we use a 6x6 grid of patches, where each patch corresponds to 14x14 pixels, in all our experiments; see Figure 3). Each patch is additively combined with a trainable position encoding and linearly projected into the input token embedding space. We experimented with image tokenizations coming from a convolutional network, but did not find a significant benefit and omitted it for simplicity.
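The patch extraction can be sketched as follows (the 84x84 single-channel observation shape is the standard Atari preprocessing; the linear projection and position encodings are omitted here):

```python
import numpy as np

def patchify(image, patch=14):
    """Split an 84x84 observation into a 6x6 grid of 14x14 patches.

    image: (H, W, C) array. Returns (36, patch*patch*C) flattened patches,
    ready for a linear projection into the token embedding space.
    """
    H, W, C = image.shape
    gh, gw = H // patch, W // patch
    # (gh, patch, gw, patch, C) -> (gh, gw, patch, patch, C)
    patches = image.reshape(gh, patch, gw, patch, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch * patch * C)

obs = np.random.rand(84, 84, 1)
tokens = patchify(obs)  # shape (36, 196)
```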

We chose our tokenization scheme with simplicity in mind, but many other schemes are possible. While all our environments use a shared action space, varying action spaces when controlling different agent morphologies can still be tokenized using methods of (Huang et al., 2020; Kurin et al., 2020; Gupta et al., 2021). And while we used uniform quantization to discretize continuous quantities, more sophisticated methods such as VQ-VAE (van den Oord et al., 2017) can be used to learn more effective discretizations.

3.3 Training Dataset

To train the model, we use an existing dataset of Atari trajectories (with quantized returns) introduced in (Agarwal et al., 2020). The dataset contains trajectories collected from the training progress of a DQN agent (Mnih et al., 2015). Following (Gulcehre et al., 2020), we select 46 games where DQN performance significantly exceeded that of a random agent. 41 games are used for training and 5 games are held out for out-of-distribution generalization experiments.

We chose 5 held-out games representing different game categories including Alien and MsPacman (maze based), Pong (ball tracking), SpaceInvaders (shoot vertically), and StarGunner (shoot horizontally), to ensure out-of-distribution generalization can be evaluated on different types of games.

For each of 41 games, we use data from 2 training runs, each containing roll-outs from 50 policy checkpoints, in turn each containing 1 million environment steps. This totals 4.1 billion steps. Using the tokenization scheme in previous sections, the dataset contains almost 160 billion tokens.

As the dataset contains the agent’s behavior at all stages of learning, it includes both expert and non-expert behaviors. We do not perform any special filtering, curation, or balancing of the dataset. The motivation to train on such data instead of expert-only behaviors is twofold: first, sub-optimal behaviors are more diverse than optimal behaviors and may still be useful for learning representations of the environment and the consequences of poor decisions; second, it may be difficult to define a single binary criterion for optimality, as it is typically a graded quantity. Thus, instead of assuming only task-relevant expert behaviors, we train our model on all available behaviors, yet generate expert behavior at inference time as described in the next section.

3.4 Expert Action Inference

As described above, our training datasets contain a mix of expert and non-expert behaviors, so directly generating actions from a model imitating the data is unlikely to consistently produce expert behavior (as we confirm in Section 4.7). Instead, we want to control action generation to consistently produce actions of highly-rewarding behavior. This mirrors the problem of discriminator-guided generation in language models, for which a variety of methods have been proposed (Krause et al., 2020; Yang and Klein, 2021; Ouyang et al., 2022).

We propose an inference-time method inspired by Krause et al. (2020) and assume a binary classifier $P(\text{expert}_t \mid \ldots)$ that identifies whether or not the behavior is expert-level before taking an action at time $t$. Following Bayes’ rule, the distribution of expert-level returns at time $t$ is then:

$$P_\theta(\hat{R}_t \mid \text{expert}_t, \ldots) \propto P(\text{expert}_t \mid \hat{R}_t, \ldots) \, P_\theta(\hat{R}_t \mid \ldots).$$

Similarly to (Shachter, 1988; Todorov, 2006; Toussaint, 2009; Kappen et al., 2012), we define the binary classifier to be proportional to future return with inverse temperature parameter $\kappa$:

$$P(\text{expert}_t \mid \hat{R}_t, \ldots) \propto \exp(\kappa \hat{R}_t).$$

This results in a simple auto-regressive procedure where we first sample high-but-plausible target returns according to the log-probability $\log P_\theta(\hat{R}_t \mid \ldots) + \kappa \hat{R}_t$, and then sample actions according to $P_\theta(a_t \mid \hat{R}_t, \ldots)$. See Figure 4 for an illustration of this procedure and Section B.3 for implementation details. It can be seen as a variation of return-conditioned policies (Kumar et al., 2019; Srivastava et al., 2019; Chen et al., 2021) that automatically generates expert-level (but likely) returns at every timestep, instead of manually fixing them for the duration of the episode.
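The return-sampling step can be sketched as follows. This is a hedged sketch: `kappa` and the seeded generator are illustrative defaults, the return values are assumed pre-scaled to [0, 1], and the subsequent action-sampling step from $P_\theta(a_t \mid \hat{R}_t, \ldots)$ is omitted:

```python
import numpy as np

def sample_expert_return(log_p_return, returns, kappa=50.0, rng=None):
    """Sample a high-but-plausible target return bin.

    log_p_return: model log-probabilities log P(R_t | ...) over return bins
    returns: scalar return value of each bin, scaled to [0, 1]
    Reweights the model prior by the expert classifier P(expert | R) ∝ exp(kappa * R),
    i.e. samples in proportion to exp(log P(R | ...) + kappa * R), renormalized.
    """
    rng = rng or np.random.default_rng(0)
    logits = log_p_return + kappa * returns
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs /= probs.sum()
    return rng.choice(len(returns), p=probs)
```

Given the sampled return bin, the action is then drawn from the model's action head conditioned on that return token.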

Figure 4: An illustration of our expert-level return and action sampling procedure. $P_\theta(\hat{R}_t \mid \ldots)$ and $P_\theta(a_t \mid \hat{R}_t, \ldots)$ are the distributions learned by the sequence model.

Importantly, this formulation only affects the inference procedure of the model – training is entirely unaffected and can rely on standard next-token prediction frameworks and infrastructure. While we chose this formulation for its simplicity, controllable generation is an active area of study and we expect other more effective methods to be introduced in the future. As such, our contribution is to point out a connection between problems of controllable generation in language modeling and optimality conditioning in control.

4 Experiments

We formulate our experiments to answer a number of questions that are addressed in following sections:

  • How do different online and offline methods perform in the multi-game regime?

  • How do different methods scale with model size?

  • How effective are different methods at transfer to novel games?

  • Does multi-game decision transformer improve upon training data?

  • Does expert action inference (Section 3.4) improve upon behavioral cloning?

  • Does training on expert and non-expert data bring benefits over expert-only training?

  • Are there benefits to specifically using transformer architecture?

We also qualitatively explore the attention behavior of these models in Appendix H.

4.1 Setup

Model Variants and Scaling.

We base our decision transformer (DT) configuration on the GPT-2 architecture (Radford et al., 2019) as summarized in Section B.1. We report results for DT-200M (a Multi-Game DT with 200M parameters) if not specified otherwise; other, smaller variants are DT-40M and DT-10M. We set the sequence length to 4 game frames for all experiments, which results in sequences of 156 tokens (4 timesteps x (36 observation patch tokens + 1 return + 1 action + 1 reward token) = 156).

Training and Fine-tuning.

We train all Multi-Game DT models on TPUv4 hardware using the Jaxline (Babuschkin et al., 2020) framework for 10M steps with the LAMB optimizer (You et al., 2019), linear warm-up over 4000 steps, no weight decay, gradient clipping at 1.0, and batch size 2048. For fine-tuning on novel games, we instead train for 100k steps with weight decay and a batch size of 256. Both regimes use image augmentations as described in Section B.5.


Evaluation.

We measure performance on individual Atari games by human-normalized score (HNS) (Mnih et al., 2015), i.e. $(\text{score} - \text{score}_{\text{random}}) / (\text{score}_{\text{human}} - \text{score}_{\text{random}})$, or by DQN-normalized score, i.e. normalizing by the best DQN score seen in the training dataset instead of the human score. To create an aggregate comparison metric across all games, we use the inter-quartile mean (IQM) of human-normalized scores across all games, following the evaluation best practices proposed in Agarwal et al. (2021). Due to prohibitively long training times, we only evaluated one training seed. We additionally report the median aggregate metric in Appendix D.
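For concreteness, HNS and a simple IQM can be computed as below; note that the rliable library used by Agarwal et al. (2021) computes IQM with stratified bootstrapping across runs, which this sketch omits:

```python
import numpy as np

def human_normalized_score(score, random_score, human_score):
    """HNS: 0 = random play, 1 = human-level play."""
    return (score - random_score) / (human_score - random_score)

def iqm(scores):
    """Inter-quartile mean: the mean of the middle 50% of the sorted values,
    more robust to outlier games than a plain mean."""
    s = np.sort(np.asarray(scores, dtype=float))
    n = len(s)
    return s[n // 4 : n - n // 4].mean()
```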

4.2 Baseline Methods


Behavioral Cloning.

Our Decision Transformer (Sec. 3.1) can be reduced to a transformer-based behavioral cloning (BC) (Pomerleau, 1991) agent by removing the target-return conditioning and return-token prediction. As we do for Decision Transformer, we also train BC models at different scales (10M, 40M, 200M parameters) while keeping other configurations unchanged.

C51 DQN.

As a point of comparison for online performance, we use the C51 algorithm (Bellemare et al., 2017), a variant of deep Q-learning (DQN) with a categorical loss for minimizing the temporal difference (TD) errors. Following improvements suggested in Hessel et al. (2018) as well as our own empirical observations, we use multi-step learning. For the single-game experiments, we use the standard convolutional neural network (CNN) from the reference C51 implementation (Castro et al., 2018). For the multi-game experiments, we modify the C51 implementation based on a hyperparameter search to use an Impala neural network architecture (Espeholt et al., 2018) with three blocks, with the number of channels per block, the batch size, and the target-network update period chosen by the search.


CQL.

For an offline TD-based learning algorithm we use conservative Q-learning (CQL) (Kumar et al., 2020). Namely, we augment the categorical loss of C51 with a behavioral cloning loss minimizing $-\log \pi(a \mid s)$, where $(s, a)$ is a state-action pair sampled from the offline dataset and $\pi$ is the policy derived from the learned $Q$-values. Following the recommendations in Kumar et al. (2020), we weight the contribution of the BC loss differently when using 100% of the offline data (multi-game training) versus 1% (single-game fine-tuning). For scaling experiments, we vary the number of blocks and the number of channels in each block of the Impala architecture.
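Schematically, the BC-regularized objective looks like the following sketch; the categorical C51 TD loss itself is abstracted into a scalar, and `alpha` is a hypothetical name for the BC weight:

```python
import numpy as np

def bc_regularized_loss(td_loss, action_logits, dataset_actions, alpha=1.0):
    """TD loss plus a behavioral-cloning term -log pi(a|s) on dataset actions.

    td_loss: scalar categorical TD loss (computed elsewhere)
    action_logits: (B, A) policy logits; dataset_actions: (B,) actions taken
    alpha: weight on the BC contribution (set per data regime).
    """
    # stable log-softmax over the action dimension
    shifted = action_logits - action_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # negative log-likelihood of the dataset actions
    bc = -log_probs[np.arange(len(dataset_actions)), dataset_actions].mean()
    return td_loss + alpha * bc
```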


CPC, BERT, and ACL.

For rapid adaptation to new games via fine-tuning, we consider representation learning baselines including contrastive predictive coding (CPC) (Oord et al., 2018), BERT pretraining (Devlin et al., 2018), and attentive contrastive learning (ACL) (Yang and Nachum, 2021). All state representation networks are implemented as additional multi-layer perceptrons (MLPs) or transformer layers on top of the Impala CNN used in the C51 and CQL baselines. CPC uses two additional MLP layers interleaved with ReLU activations to produce state representations, optimized by maximizing a score for true transitions and minimizing it for negative states randomly sampled from the batch (including states from other games). For BERT pretraining, we use self-attention layers trained with BERT’s masked self-prediction loss on trajectory sequences. ACL shares the same model parametrization as BERT, with the inclusion of action prediction in the pretraining objective.
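The contrastive objective used by the CPC-style baseline can be sketched as an InfoNCE-style loss; the embedding shapes and the dot-product score are illustrative choices, not the exact parametrization used here:

```python
import numpy as np

def infonce_loss(anchor, positive, negatives):
    """Score the true transition above negatives sampled from the batch.

    anchor, positive: (D,) embeddings of a state and its true successor
    negatives: (K, D) embeddings of randomly sampled states
    Returns the cross-entropy of picking the positive among K+1 candidates.
    """
    pos_score = anchor @ positive          # score of the true transition
    neg_scores = negatives @ anchor        # scores of the negatives
    logits = np.concatenate([[pos_score], neg_scores])
    logits = logits - logits.max()         # numerical stability
    return -(logits[0] - np.log(np.exp(logits).sum()))
```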

4.3 How do different online and offline methods perform in the multi-game regime?

We compare different online and offline algorithms in the multi-game regime, and their single-game counterparts, in Figure 1. We find that single-game specialists are still the most performant. Among multi-game generalist models, our Multi-Game Decision Transformer comes closest to specialist performance. Multi-game online RL with non-transformer models comes second, while we struggled to get good performance with offline non-transformer models. We note that our multi-game online C51 DQN achieves a median score of 68% (see Appendix D), which is comparable to the multi-game median Impala score of 70% that we calculated from results reported by Espeholt et al. (2018) for our suite of games.

4.4 How do different methods scale with model size?

In large language and vision models, the lowest-achievable training loss typically decreases predictably with increasing model size. Kaplan et al. (2020) demonstrated an empirical power-law relationship between the capacity of a language model (NLP terminology for a next-token autoregressive generative model) and its performance (negative log-likelihood on held-out data). These trends were verified over many orders of magnitude of model size, ranging from few-million-parameter models to hundred-billion-parameter models.

(a) Scaling of IQM scores for all training games with different model sizes and architectures.
(b) Scaling of IQM scores for all novel games after fine-tuning DT and CQL.
Figure 5: How model performance scales with model size, on training set games and novel games. (Impala) indicates using the Impala CNN architecture.

We investigate whether similar trends hold for interactive in-game performance – not just training loss – and show a similar power-law performance trend in Figure 5(a). Multi-Game Decision Transformer performance reliably increases over two orders of magnitude, whereas the other methods either saturate or show much slower performance growth.

We also find that larger models train faster, in the sense of reaching higher in-game performance after observing the same number of tokens. We discuss these results in Appendix G.

4.5 How effective are different methods at transfer to novel games?

Pretraining for rapid adaptation to new games has not been widely explored on Atari, despite being a natural and well-motivated task given its relevance to how humans transfer knowledge to new games. Nachum and Yang (2021) employed pretraining on large offline data and fine-tuning on small expert data for Atari, comparing against a set of state representation learning objectives based on bisimulation (Gelada et al., 2019; Zhang et al., 2020), but their pretraining and fine-tuning use the same game. We are instead interested in the transfer ability of pretrained agents to new games.

We hence devise our own evaluation setup by pretraining DT, CQL, CPC, BERT, and ACL on the full datasets of the 41 training games with 50M steps each, and fine-tuning one model per held-out game using 1% (500k steps) from each game. The 1% fine-tuning data is uniformly sampled from the 50M step dataset without quality filtering. DT and CQL use the same objective for pretraining and fine-tuning, whereas CPC, BERT, and ACL each use their own pretraining objective and are fine-tuned using the BC objective. All methods are fine-tuned for 100,000 steps, which is much shorter than training any agent from scratch. We additionally include training CQL from scratch on the 1% held-out data to highlight the benefit of rapid fine-tuning.

Fine-tuning performance on the held-out games is shown in Figure 6. Pretraining with the DT objective performs the best across all games. All methods with pretraining outperform training CQL from scratch, which verifies our hypothesis that pretraining on other games should indeed help with rapid learning of a new game. CPC and BERT underperform DT, suggesting that learning state representations alone is not sufficient for desirable transfer performance. While ACL adds an action-prediction auxiliary loss to BERT, it shows little effect, suggesting that modeling the actions in the right way on the offline data is important for good transfer performance. Furthermore, we find that fine-tuning performance improves as the DT model becomes larger, while CQL fine-tuning performance is inconsistent across model sizes (see Figure 5(b)).

Figure 6: Fine-tuning performance on 1% of 5 held-out games’ data after pretraining on other 41 games using DT, CQL, CPC, BERT, and ACL. All pretraining methods outperform training CQL from scratch on the 1% held-out data, highlighting the transfer benefit of pretraining on other games. DT performs the best among all methods considered.

4.6 Does multi-game decision transformer improve upon training data?

We want to evaluate whether decision transformer with expert action inference is capable of acting better than the best demonstrations seen during training. To do this, we look at the top 3 performing decision transformer rollouts. We use the top 3 rollouts instead of the mean across all rollouts to compare more fairly against the best demonstration, rather than an average expert demonstration. We show the percentage improvement over the best demonstration score for individual games in Figure 7, and see significant improvement over the training data in a number of games.
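The top-3 metric can be computed as below (a hedged sketch; normalizing the improvement by the best demonstration score is our illustrative choice):

```python
import numpy as np

def top3_improvement(rollout_scores, best_demo_score):
    """Percent improvement of the mean of the top-3 rollouts over the best demo.

    rollout_scores: per-episode scores of the evaluated model
    best_demo_score: best episodic score in the training dataset for this game
    """
    top3 = np.sort(np.asarray(rollout_scores, dtype=float))[-3:].mean()
    return 100.0 * (top3 - best_demo_score) / abs(best_demo_score)
```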

Figure 7: Percent of improvement of top 3 decision transformer rollouts over the best score in the training dataset. 0% indicates no improvement. Top-3 metric (instead of mean) is used to more fairly compare to the best – rather than expert average – demonstration score.

4.7 Does optimal action inference improve upon behavior cloning?

Figure 8: Comparison of per-game scores for decision transformer to behavioral cloning. Bars indicate standard deviation around the mean across 16 trials. We show DQN-normalized scores in this figure for better presentation.

In Figure 1 we see that IQM performance across all games is indeed significantly improved by generating optimality-conditioned actions. Figure 8 shows the mean and standard deviation of scores across all games. While behavior cloning may sometimes produce highly-rewarding episodes, it is less likely to do so. We find decision transformer outperforms behavioral cloning in 31 out of 41 games.

4.8 Does training on expert and non-expert data bring benefits over expert-only training?

We believe that, compared to learning from expert demonstrations alone, learning from large, diverse datasets that include some expert data but primarily non-expert data helps learning and improves performance. To verify this hypothesis, we filter our training data (Agarwal et al., 2020) for each game by episodic return, preserving only the top 10% of trajectories to produce an expert dataset (see Appendix E for details). We use this expert dataset to train our multi-game decision transformer (DT-40M) and the transformer-based behavioral cloning model (BC-40M). Figure 9 compares these models trained on expert data against our DT-40M trained on all data.
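The expert-filtering step can be sketched as follows (a minimal sketch; Appendix E describes the actual procedure):

```python
import numpy as np

def filter_expert(trajectories, episodic_returns, top_fraction=0.10):
    """Keep the top-`top_fraction` of trajectories by episodic return."""
    order = np.argsort(episodic_returns)[::-1]   # highest return first
    k = max(1, int(len(trajectories) * top_fraction))
    return [trajectories[i] for i in order[:k]]
```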

We observe that (1) Training only on expert data improves behavioral cloning; (2) Training on full data, including expert and non-expert data, improves Decision Transformer; (3) Decision Transformer with full data outperforms behavioral cloning trained on expert data.

Figure 9: Comparison of 40M transformer models trained on full data and only expert data.

4.9 Are there benefits to specifically using transformer architecture?

Figure 10: Performance scaling with model size for UDRL and CQL (Impala architecture) compared to Decision Transformer.

Decision Transformer is an Upside-Down RL (UDRL) (Schmidhuber, 2019; Srivastava et al., 2019) implementation that uses the transformer architecture and considers RL as a sequence modeling problem. To understand the benefit of the transformer architecture, we compare to an UDRL implementation that uses feed-forward, convolutional Impala networks (Espeholt et al., 2018). See Appendix F for more details on the architecture.

Figure 10 shows clear advantages of Decision Transformer over UDRL with the Impala architecture. Comparing UDRL (Impala) against CQL with the same Impala network at each model size we evaluated, we observe that UDRL (Impala) outperforms CQL. These results show that the benefits of our method come not only from the network architecture, but also from the UDRL formulation. Although it is not feasible to compare the transformer with all possible convolutional architectures due to the broad design space, we believe these empirical results still show a clear trend favoring both UDRL and the transformer architecture.

5 Conclusion

In the quest to develop highly capable and generalist agents, we have made important and measurable progress. Namely, our results exhibit a clear benefit of using large transformer-based models in multi-game domains, and the general trends in these results – performance improvements with larger models and the ability to rapidly fine-tune to new tasks – mirror the successes observed for large-scale vision and language models. Our results also highlight the difficulties of online RL algorithms in handling the complexity of multi-game training on Atari. It is interesting to note that our best results are achieved by decision transformers, which essentially learn via supervised learning on sequence data, compared to alternative approaches such as temporal difference learning (more typical in reinforcement learning), policy gradients, and contrastive representation learning. This raises the question of whether online learning algorithms can be modified to be as "data-absorbent" as DT-like methods. While even our best generalist agents at times fall short of the performance achieved by agents trained on a single task, this is broadly consistent with related works that have trained single models on many tasks (Kaiser et al., 2017; Reed et al., 2022). However, our best generalist agents are already capable of outperforming the data they are trained on. We believe the trends suggest clear paths for future work: with larger models and larger suites of tasks, performance is likely to scale up commensurately.


We acknowledge reasons for caution in over-generalizing our conclusions. Our results are based largely on performance in the Atari suite, where action and observation spaces are aligned across different games. It is unclear whether offline RL datasets such as Atari are of sufficient scale and diversity that we would see similar performance scaling as observed in NLP and vision benchmarks. Whether we can observe other forms of generalization, such as zero-shot adaptation, as well as whether our conclusions hold for other settings, remains unclear.

Societal Impacts.

In the current setting, we do not foresee significant societal impact as the models are limited to playing simple video games. We emphasize that our current agents are not intended to interact with humans or be used outside of self-contained game-playing domains. One should exercise increased caution if extending our algorithms and methods to such situations in order to ensure any safety and ethical concerns are appropriately addressed. At the same time, the capability of decision making based on reward feedback – rather than purely imitation of the data – has the potential to be easier to align with human values and goals.


We would like to thank Oscar Ramirez, Roopali Vij, Sabela Ramos, Rishabh Agarwal, Shixiang (Shane) Gu, Aleksandra Faust, Noah Fiedel, Chelsea Finn, Sergey Levine, John Canny, Kimin Lee, Hao Liu, Ed Chi, and Luke Metz for their valuable contributions and support for this work.


Appendix A Contribution Statement

Kuang-Huei Lee Proposed project direction. Contributed to the JAX and TF Decision Transformer (DT) and BC code. Ran BC and DT experiments. Contributed to paper writing.

Ofir Nachum Proposed project direction. Contributed to TF code for DQN, CQL, and representation learning algorithms. Ran experiments for DQN and CQL. Contributed to paper writing.

Mengjiao Yang Contributed code in representation learning algorithms. Ran experiments for CPC, BERT, ACL pretraining + OOD finetuning. Contributed to paper writing.

Lisa Lee Contributed to TF codebase. Ran alternative environment experiments. Helped with paper editing.

Daniel Freeman Contributed to JAX training code and dataset generation pipelines. Generated datasets. Helped with paper writing.

Winnie Xu Contributed to JAX training infrastructure, beginner / expert dataset generation, augmentation and metrics pipelines. Ran experiments for alternative DT variants. Helped with paper writing. Work done while a Research Intern.

Sergio Guadarrama Helped with project direction and experiment discussions. Worked on paper editing.

Ian Fischer Helped with project direction and experiment discussions. Worked on paper editing.

Eric Jang Helped with project direction, contributed to building JAX infrastructure. Helped with paper writing.

Henryk Michalewski Contributed to building jaxline training and data processing infrastructure. Ran fine-tuning experiments. Helped with project direction and experiment discussions. Contributed to paper writing.

Igor Mordatch Proposed project direction. Contributed to building JAX training and data processing infrastructure. Ran DT experiments. Contributed to paper writing and visualizations.

Appendix B Implementation Details

b.1 Transformer network architecture

The input consists of a sequence of observations, returns, actions, and rewards. Observations are grayscale images. Similar to ViT (Dosovitskiy et al., 2020), we extract non-overlapping image patches, perform a linear projection, and then rasterize them into 1D tokens. We define each patch to be a fixed-size square of pixels. A learned positional embedding is added to each of the patch tokens to retain positional information, as in ViT. As described in Section 3.2, returns are discretized into 120 buckets, and rewards are converted to ternary quantities {-1, 0, +1}.
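The ViT-style patch extraction can be sketched as follows (the linear projection and learned positional embeddings are omitted; the helper name is our own):

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W) grayscale image into non-overlapping square patches
    and flatten each patch into a 1D vector, ViT-style.

    Returns an array of shape (num_patches, patch_size * patch_size); a
    learned linear projection would then map each row to a token embedding.
    """
    h, w = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size)
    # Reorder so each row is one patch, scanned in raster order.
    patches = patches.transpose(0, 2, 1, 3).reshape(-1, patch_size * patch_size)
    return patches
```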

For the whole input sequence, we learn another positional embedding at each position and add it to each token embedding. We experimented with rotary position embeddings (Su et al., 2021), but did not find a significant benefit from them in our setting. On top of the token embeddings, our models use a standard transformer decoder architecture.

A standard transformer implementation for sequence modeling would employ sequential causal attention masking to prevent positions from attending to subsequent positions (Vaswani et al., 2017). However, for the sequences we consider, we do not want to prevent an observation token from attending to the other observation tokens within the same timestep, since there is no clear sequential causal relation between image patches. We therefore relax the sequential causal mask to allow observation tokens within the same timestep to attend to each other, while still preventing attention to any subsequent positions.
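A minimal sketch of this relaxed mask, assuming the observation tokens come first within each timestep's token layout (that layout detail is our assumption):

```python
import numpy as np

def block_causal_mask(num_timesteps, tokens_per_step, obs_len):
    """Attention mask that is causal everywhere except that the `obs_len`
    observation tokens within each timestep may attend to each other.

    mask[q, k] == 1 means query position q may attend to key position k.
    """
    n = num_timesteps * tokens_per_step
    # Standard causal mask: attend to self and all earlier positions.
    mask = np.tril(np.ones((n, n), dtype=np.int32))
    for t in range(num_timesteps):
        start = t * tokens_per_step
        # Let observation tokens within the same timestep see each other.
        mask[start:start + obs_len, start:start + obs_len] = 1
    return mask
```

Non-observation tokens (returns, actions, rewards) remain strictly causal; only the intra-timestep observation block is opened up.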

Table 1 summarizes the transformer configurations we use for each model size. We train these models on an internal cluster, each with 64 TPUv4. Due to prohibitively long training times, we only evaluated one training seed.

Model     Layers   Hidden size   Heads   Params   Training Time on 64 TPUv4
DT-10M    4        512           8       10M      1 day
DT-40M    6        768           12      40M      2 days
DT-200M   10       1280          20      200M     8 days
Table 1: Multi-Game Decision Transformer Variants

b.2 Fine-tuning protocol for Atari games

In the fine-tuning experiments, we reserved five games (Alien, MsPacman, Pong, Space Invaders, and Star Gunner) to be used only for fine-tuning. These games were selected for their varied gameplay characteristics. Each game was fine-tuned separately to measure the model's transfer performance on a fixed game. We use 1% of the original dataset to specifically test fine-tuning in low-data regimes.

b.3 Action and return sampling during in-game evaluation

We sample actions from the model with a temperature of 1. Inspired by nucleus sampling (Holtzman et al., 2019), we only sample from the top 85th percentile of action logits for all Decision Transformer and Behavioral Cloning models (this parameter was selected to give the highest performance for both models). While we train the model to predict actions for all timesteps in the sequence, during in-game evaluation we execute the last predicted action in the sequence (conditioned on all past observations and past generated actions, rewards, and target returns).
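Our reading of this top-percentile scheme can be sketched as follows; the exact cutoff convention (a percentile taken over the logit values, with the rest masked out) is our assumption:

```python
import numpy as np

def sample_top_percentile(logits, percentile=85.0, temperature=1.0, rng=None):
    """Sample a token id, restricting candidates to logits at or above the
    given percentile of the logit values."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    cutoff = np.percentile(logits, percentile)
    # Mask out everything below the percentile cutoff.
    masked = np.where(logits >= cutoff, logits, -np.inf)
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))
```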

To generate target returns as discussed in Section 3.4, we sample them from the model with a temperature of 1 and the top 85th percentile of logits, using the same expert-level hyperparameter setting in all our experiments. To avoid storing the history of previously generated target returns (which may be difficult to incorporate into some RL frameworks), we experimented with autoregressively regenerating all target returns in the sequence, and found this to work well without requiring any special recurrent state maintenance outside of the model.

As an alternative way to generate expert-but-likely returns, we also experimented with simply drawing several return samples from the model, scoring each by its log-probability, and picking the highest one. We then generate the action conditioned on this picked return as before. This avoids the need for the expert-level hyperparameter. In this setting, we found an inverse temperature of 0.75 for return sampling, no percentile cutoff for return sampling, and sampling from the top 50th percentile of action logits with a temperature of 1 to work similarly well.
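A minimal sketch of this best-of-k return selection; the interface is hypothetical (in practice the model would supply the return-token log-probabilities):

```python
import numpy as np

def pick_likely_return(return_log_probs, num_samples, rng=None):
    """Draw `num_samples` return tokens from the model's return
    distribution and keep the most probable one drawn."""
    rng = rng or np.random.default_rng(0)
    log_probs = np.asarray(return_log_probs, dtype=np.float64)
    probs = np.exp(log_probs - log_probs.max())
    probs /= probs.sum()
    draws = rng.choice(len(probs), size=num_samples, p=probs)
    # Among the sampled return tokens, keep the highest log-probability one.
    return int(max(draws, key=lambda i: log_probs[i]))
```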

b.4 Evaluation protocol and Atari environment details

Our environment is the Atari 2600 Gym environment with pre-processing performed as in Agarwal et al. (2020). Our Atari observations are grayscale images. We compress observation images to JPEG in the dataset (to keep the dataset size small) and during in-game evaluation. All games use the same shared set of 18 discrete actions. For all methods, each game score is calculated by averaging over 16 model rollout episodes. To reduce inter-trial variability, we do not use sticky actions during evaluation.

b.5 Image augmentation

All models were trained with image augmentations. We investigated the following augmentation methods: random cropping, random channel permutation, random pixel permutation, horizontal flip, vertical flip, and random rotations. We found random cropping and random rotations to work best. (In our random cropping implementation, images are padded on each side with 4 zero-value pixels and then randomly cropped back to their original size.) In general, we aim to expand the domain of problems seen during training to similar kinds that we hope to generalize to, by encoding useful inductive biases. We use the same random augmentation parameters for each window sequence. We apply data augmentation during both pre-training and fine-tuning.
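The padded random-crop described above can be sketched as follows (the interface is our own):

```python
import numpy as np

def random_crop(image, pad=4, rng=None):
    """Pad an (H, W) image with `pad` zero-value pixels on each side,
    then randomly crop back to the original (H, W)."""
    rng = rng or np.random.default_rng()
    h, w = image.shape
    padded = np.pad(image, pad, mode="constant")
    # Any offset in [0, 2*pad] keeps the crop inside the padded image.
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w]
```

To keep augmentations consistent across a window sequence, the same (top, left) offsets would be reused for every frame in that window.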

Appendix C Baseline Implementation Details


Our BC model is effectively the same as our DT model, but with the return tokens removed from the training sequence. Instead of predicting a return token (distribution) given observation tokens and the previous part of the sequence, we directly predict an action token (distribution); this also means that we remove return conditioning for the BC model. During evaluation, we sample actions with a temperature of 1 from the top 85th percentile of logits (as discussed in Section B.3). All other implementation details and configurations are identical to DT.

C51 DQN

For single-game experiments, our implementation and training followed the details in Bellemare et al. (2017), except for using multi-step learning. For multi-game experiments, we trained using the details provided in the main text; we ran the algorithm for 15M gradient steps.


For CQL we use the same optimizer and learning rate as for C51 DQN. We use a per-replica batch size of 32 and run for 1M gradient steps on a TPU pod with 32 cores, yielding a global batch size of 256. During fine-tuning for each game, we copy the entire Q-network trained with CQL and apply an additional 100k gradient steps of batch size 32 on a single CPU, where each batch is sampled exclusively from the offline dataset of the fine-tuned game. We also experimented with smaller learning rates and larger batch sizes, but found the results largely unchanged. We also tried using offline C51 and double DQN instead of CQL, and found performance to be worse.


For the CPC baseline (Oord et al., 2018), we apply a contrastive loss between representations of the current state and the next state, using an InfoNCE-style objective in which positive pairs are scored by a bilinear form with a trainable matrix and negatives are drawn from a non-trainable prior distribution; for mini-batch training, we set this prior to be the distribution of states in the mini-batch. The state representation is parametrized by CNNs followed by two MLP layers with 512 units each, interleaved with ReLU activations. For the CNN architecture, we used the C51 implementation with an Impala neural network architecture of three blocks using 16, 32, and 32 channels respectively, and trained with a batch size of 256 and a learning rate of 0.00025 during both pretraining and downstream BC adaptation. We conduct representation learning for a total of 1M gradient steps, finetune on 1% of the data for 100k steps every 50k steps of representation learning, and report the best finetuning results.
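Under our reading of the objective above (a bilinear InfoNCE loss with in-batch negatives; the extracted equation itself is garbled, so this is a reconstruction), a minimal NumPy sketch is:

```python
import numpy as np

def cpc_loss(phi_s, phi_next, W):
    """InfoNCE-style loss between current-state representations phi_s
    (B, D) and next-state representations phi_next (B, D), with a
    trainable bilinear matrix W (D, D). Negatives for each state are the
    other states in the mini-batch."""
    logits = phi_s @ W @ phi_next.T              # (B, B) similarity scores
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs (s_t, s_{t+1}) lie on the diagonal.
    return -np.mean(np.diag(log_softmax))
```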


Our BERT and ACL baselines are based on the representation learning objectives described in Yang and Nachum (2021). For the BERT (Devlin et al., 2018) state representation learning baseline, we (1) take a sub-trajectory from the dataset (without the special tokenization used in DT), (2) randomly mask a subset of its states, (3) pass the masked sequence into a transformer, and then (4) for each masked input state, apply a contrastive loss between its representation and the transformer output at the corresponding sequence position, with negatives drawn from the distribution over states in the mini-batch. For attentive contrastive learning (ACL) (Yang and Nachum, 2021), we apply an additional action prediction loss to the output of BERT at the sequence positions of the action inputs.

To parameterize the state representations, we use the same CNN architecture as in CPC, while the transformer is parameterized by two self-attention layers with 4 attention heads of 256 units each and a feed-forward dimension of 512. The transformer does not apply any additional directional masking to its inputs.

Pretraining and finetuning are analogous to CPC. Namely, when finetuning, we take the pretrained representation and use a BC objective to learn a neural network policy (two MLP layers with 512 units each) on top of this representation.

Appendix D Comparisons between methods based on other aggregate metrics

We used the inter-quartile mean (IQM) to aggregate performance over individual games in Figure 1. The median is another metric commonly used to aggregate scores, although it has issues, as discussed in Agarwal et al. (2021): it has high variability, and in the most extreme case, the median is unaffected by zero performance on nearly half of the tasks. For completeness, we report the median scores for all methods:

Figure 11: Median human-normalized score across 41 Atari games. Grey bars are single-game specialist models while blue are generalists. Single-game BCQ results are from Gulcehre et al. (2020).

For the expert-filtering experiments in Section 4.8, we also provide a plot of the expert-filtering effects with median human-normalized scores in Figure 12. We note that the ranking of the various configurations does not change across aggregate metrics.

Figure 12: Median human-normalized scores of 40M transformer models trained on full data and only expert data.

For the Upside-Down RL comparison experiments in Section 4.9, we also provide median human-normalized scores in Figure 13.

Figure 13: How the UDRL (Impala architecture) median human-normalized score scales with model size on training-set games, in comparison with Decision Transformer and CQL (Impala architecture).
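The aggregate metrics above can be sketched as follows. The human-normalized score follows the standard convention, (score - random) / (human - random); our IQM here is a simplified per-game version (Agarwal et al. (2021) compute IQM over all runs across tasks):

```python
import numpy as np

def human_normalized_score(score, random_score, human_score):
    """Standard human-normalized score: 0 matches random play,
    1 matches the human reference score."""
    return (score - random_score) / (human_score - random_score)

def interquartile_mean(scores):
    """Inter-quartile mean (IQM): the mean of the middle 50% of scores,
    i.e. after discarding the bottom 25% and the top 25%."""
    s = np.sort(np.asarray(scores, dtype=np.float64))
    n = len(s)
    lo, hi = n // 4, n - n // 4
    return float(s[lo:hi].mean())
```

Unlike the median, the IQM reflects changes anywhere in the middle half of the score distribution, while still discarding outliers at both ends.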

Appendix E Details of Expert Dataset Generation

To generate the expert dataset for the experiments in Section 4.8, we filter our training data (Agarwal et al., 2020) from each game by episodic return and preserve only the top 10% of trajectories. We plot return histograms for reference in Figure 14.

Figure 14: Histograms of rollout performance from Agarwal et al. (2020) used to generate the expert dataset, with (unnormalized) score-density on the vertical axis, and game score (rewards are clipped) on the horizontal axis. We indicate the 90th percentile performance cutoff with a red vertical line for each game. Rollouts that exceeded this score threshold were included in the expert dataset.

Appendix F Details on Comparisons between transformers and convolution networks

In Section 4.9 we compare the transformer architecture to an UDRL (Schmidhuber, 2019; Srivastava et al., 2019) implementation that uses feed-forward, convolutional Impala networks (Espeholt et al., 2018). Specifically, we use the same return, action, and reward tokenizers as in DT, and only replace the observation encoding (four consecutive Atari frames stacked together) with the Impala architecture. As we do for CQL, we also experiment with different sizes of the Impala architecture by varying the number of blocks and the number of channels in each block of the Impala network. We use a fully-connected head to predict the next return token from the observation embedding; another head to predict the next action token from a concatenation of the observation embedding and return token embedding; and another head to predict the next reward token from a concatenation of the observation embedding, return token embedding, and action token embedding.

The input to the model is slightly different from what we use for DT: instead of considering a multi-timestep sub-trajectory in which each timestep contains observation, return, action, and reward tokens, we stack consecutive image frames (as is common in Mnih et al. (2015)) and only consider the tokens from the last timestep. All other design choices and evaluation protocols are the same as for DT.

Appendix G Effect of Model Size on Training Speed

It is believed that large transformer-based language models train faster than smaller models, in the sense that they reach higher performance after observing a similar number of tokens (Kaplan et al., 2020; Chowdhery et al., 2022). We find this trend to hold in our setting as well. Figure 15 shows performance on two example games as multi-game training progresses. We see that larger models reach higher scores for a given number of training steps (and thus tokens observed).

Figure 15: Example game scores for different model sizes as multi-game training progresses.

Appendix H Qualitative Attention Analysis

We find that the Decision Transformer model consistently attends to observation image patches that contain meaningful game entities. Figure 16 visualizes selected attention heads and layers for various games. We find that heads consistently attend to entities such as the player character, the player's free movement space, non-player objects, and environment features.

(a) Asterix: player
(b) Frostbite: player
(c) Breakout: ball
(d) Breakout: no paddle
(e) Breakout: unbroken blocks
(f) Asterix: non-players
Figure 16: Example image patches attended (red) for predicting next action by Decision Transformer.

Appendix I Raw Atari Scores

We report full raw scores on the 41 training Atari games for the best-performing sizes of the multi-game models in Table 2.

Game Name DT (200M) BC (200M) Online DQN (10M) CQL (60M)
Amidar 101.5 101.0 629.8 4.0
Assault 2,385.9 1,872.1 1,338.7 820.1
Asterix 14,706.3 5,162.5 2,949.1 950.0
Atlantis 3,105,342.3 4,237.5 976,030.4 16,800.0
BankHeist 5.0 63.1 1,069.6 20.0
BattleZone 17,687.5 9,250.0 26,235.2 5,000.0
BeamRider 8,560.5 4,948.4 1,524.8 3,246.4
Boxing 95.1 90.9 68.3 100.0
Breakout 290.6 185.6 32.6 62.0
Carnival 2,213.8 2,986.9 2,021.2 440.0
Centipede 2,463.0 2,262.8 4,848.0 2,904.0
ChopperCommand 4,268.8 1,800.0 951.4 400.0
CrazyClimber 126,018.8 123,350.0 146,362.5 139,300.0
DemonAttack 23,768.4 7,870.6 446.8 1,202.0
DoubleDunk -10.6 -1.5 -156.2 -2.0
Enduro 1,092.6 793.2 896.3 729.0
FishingDerby 11.8 5.6 -152.3 18.4
Freeway 30.4 29.8 30.6 32.0
Frostbite 2,435.6 782.5 2,748.4 408.0
Gopher 9,935.0 3,496.3 3,205.6 700.0
Gravitar 59.4 12.5 492.5 0.0
Hero 20,408.8 13,850.0 26,568.8 14,040.0
IceHockey -10.1 -8.3 -10.4 -10.5
Jamesbond 700.0 431.3 264.6 500.0
Kangaroo 12,700.0 12,143.8 7,997.1 6,700.0
Krull 8,685.6 8,058.8 8,221.4 7,170.0
KungFuMaster 15,562.5 4,362.5 29,383.1 13,700.0
NameThisGame 9,056.9 7,241.9 6,548.8 3,700.0
Phoenix 5,295.6 4,326.9 3,932.5 1,880.0
Pooyan 2,859.1 1,677.2 4,000.0 330.0
Qbert 13,734.4 11,276.6 4,226.5 11,700.0
Riverraid 14,755.6 9,816.3 7,306.6 3,810.0
RoadRunner 54,568.8 49,118.8 25,233.0 50,900.0
Robotank 63.2 44.6 9.2 17.0
Seaquest 5,173.8 1,175.6 1,415.2 643.0
TimePilot 2,743.8 1,312.5 -883.1 2,400.0
UpNDown 16,291.3 10,454.4 8,167.6 5,610.0
VideoPinball 1,007.7 1,140.8 85,351.0 0.0
WizardOfWor 187.5 443.8 975.9 500.0
YarsRevenge 28,897.9 20,738.9 18,889.5 19,505.4
Zaxxon 275.0 50.0 -0.1 0.0
Table 2: Raw scores of 41 training Atari games for best performing multi-game models.