Deep Reinforcement Learning (deep RL, François-Lavet et al., 2018) has proven to be an indispensable tool for training successful agents on difficult sequential decision-making problems (Bellemare et al., 2013; Tassa et al., 2018). The success of deep RL is particularly noteworthy in highly complex strategic games such as StarCraft (Vinyals et al., 2019) and Dota 2 (OpenAI et al., 2019), where deep RL agents now surpass expert human performance in some scenarios.
Deep RL involves training agents based on large neural networks using large amounts of data (Sutton, 2019), a trend evident across both model-based (Schrittwieser et al., 2019) and model-free (Badia et al., 2020) learning. The sample complexity of such state-of-the-art agents is often extremely high: MuZero (Schrittwieser et al., 2019) and Agent-57 (Badia et al., 2020) use 10-50 years of experience per Atari game, and OpenAI Five (OpenAI et al., 2019) uses 45,000 years of experience to accomplish its remarkable performance. This is clearly impractical: unlike easily-simulated environments such as video games, collecting interaction data for many real-world tasks is costly, making improved data efficiency a prerequisite for successful use of deep RL in these settings (Dulac-Arnold et al., 2019).
Meanwhile, new self-supervised representation learning methods have significantly improved data efficiency when learning new vision and language tasks, particularly in low data regimes or semi-supervised learning (Xie et al., 2019; Hénaff et al., 2019; Chen et al., 2020b). Self-supervised methods improve data efficiency by leveraging a nearly limitless supply of training signal from tasks generated on-the-fly, based on “views” drawn from the natural structure of the data (e.g., image patches, data augmentation or temporal proximity, see Doersch et al., 2015; Oord et al., 2018; Hjelm et al., 2019; Tian et al., 2019; Bachman et al., 2019; He et al., 2019; Chen et al., 2020a).
Motivated by successes in semi-supervised and self-supervised learning (Tarvainen and Valpola, 2017; Xie et al., 2019; Grill et al., 2020), we train better state representations for RL by forcing representations to be temporally predictive and consistent when subject to data augmentation. Specifically, we extend a strong model-free agent by adding a dynamics model which predicts future latent representations provided by a parameter-wise exponential moving average of the agent itself. We also add data augmentation to the future prediction task, which enforces consistency across different views of each observation. Contrary to some methods (Kaiser et al., 2019; Hafner et al., 2019), our dynamics model operates entirely in the latent space and doesn’t rely on reconstructing raw states.
We evaluate our method, which we call Momentum Predictive Representations (MPR), on 26 games in the Atari 100k benchmark (Kaiser et al., 2019), where agents are allowed only 100k steps of environment interaction (producing 400k frames of input) per game, which roughly corresponds to two hours of real-time experience. Notably, the human experts in Mnih et al. (2015) and Van Hasselt et al. (2016) were given the same amount of time to learn these games, so a budget of 100k steps permits a reasonable comparison in terms of data efficiency.
In our experiments, we augment a modified version of Data-Efficient Rainbow (DER) (van Hasselt et al., 2019) with the MPR loss, and evaluate versions of MPR with and without data augmentation. We find that each version is superior to controlled baselines. When coupled with data augmentation, MPR achieves a median human-normalized score of 0.444, which is a state-of-the-art result on this benchmark, outperforming prior methods by a significant margin. Notably, MPR also outperforms human expert scores on six out of 26 games while using roughly the same amount of in-game experience.
We consider reinforcement learning (RL) in the standard Markov Decision Process (MDP) setting where an agent interacts with its environment in episodes, each consisting of sequences of observations, actions and rewards. We use $s_t$, $a_t$ and $r_t$ to denote the observation, the action taken by the agent and the reward received at timestep $t$. We seek to train an agent whose expected cumulative reward in each episode is maximized. To do this, we combine a strong model-free RL algorithm, Rainbow (Hessel et al., 2018), with Momentum Predictive Representations as an auxiliary loss to improve sample efficiency. We now describe our overall approach in detail.
2.1 Deep Q-Learning
We focus on the Arcade Learning Environment (Bellemare et al., 2013), a challenging setting where the agent takes discrete actions while receiving purely visual, pixel-based observations. A prominent method for solving Atari, Deep Q Networks (Mnih et al., 2015), trains a neural network $Q_\theta$ to approximate the agent’s current Q-function (policy evaluation) while updating the agent’s policy greedily with respect to this Q-function (policy improvement). This involves minimizing the error between predictions from $Q_\theta$ and a target value estimated by $Q_\xi$, an earlier version of the network:
$$\mathcal{L}^{DQN}_\theta = \left(Q_\theta(s_t, a_t) - \left(r_t + \gamma \max_{a} Q_\xi(s_{t+1}, a)\right)\right)^2,$$
where $\gamma$ is the discount factor.
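As a rough illustration of the regression target described above, the following sketch computes the squared temporal-difference error for a single transition. The function name, the `done` flag and the toy numbers are assumptions made for this example; the Q-values are passed in as plain lists rather than produced by a network.

```python
def dqn_td_loss(q_online, q_target_next, action, reward, gamma=0.99, done=False):
    """Squared TD error for one transition (s_t, a_t, r_t, s_{t+1}).

    q_online:      Q-values of the online network at s_t, one per action.
    q_target_next: Q-values of the earlier (target) network at s_{t+1}.
    """
    # Bootstrap from the greedy action under the target network,
    # unless the episode terminated at this step (hypothetical `done` flag).
    bootstrap = 0.0 if done else gamma * max(q_target_next)
    target = reward + bootstrap
    return (q_online[action] - target) ** 2


# Toy numbers chosen only for illustration.
loss = dqn_td_loss(q_online=[1.0, 2.0], q_target_next=[0.5, 3.0],
                   action=1, reward=1.0, gamma=0.5)
print(loss)  # target = 1.0 + 0.5 * 3.0 = 2.5, so the loss is (2.0 - 2.5)**2 = 0.25
```

In practice the loss is averaged over a minibatch sampled from the replay buffer, and only the online network receives gradients.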
Various improvements have been made over the original DQN: Distributional RL (Bellemare et al., 2017) models the full distribution of future reward rather than just the mean; Dueling DQN (Wang et al., 2016) decouples the value of a state from the advantage of taking a given action in that state; and Double DQN (Van Hasselt et al., 2016) modifies the Q-learning update to avoid overestimation due to the $\max$ operation, among many others. Rainbow (Hessel et al., 2018) consolidates these improvements into a single combined algorithm and has been adapted to work well in data-limited regimes (van Hasselt et al., 2019).
2.2 Momentum Predictive Representations
For our auxiliary loss, we start with the intuition that encouraging state representations to be predictive of future observations given future actions should improve the data efficiency of RL algorithms. Let $(s_{t:t+K}, a_{t:t+K})$ denote a sequence of previously experienced states and actions sampled from a replay buffer, where $K$ is the maximum number of steps into the future which we want to predict. Our method has four main components, which we describe below:
Online and Target networks: We use an online encoder $f_o$ to transform observed states $s_t$ into representations $z_t \triangleq f_o(s_t)$. We use these representations in an objective that encourages them to be predictive of future observations up to some fixed temporal offset $K$, given a sequence of $K$ actions to perform. We augment each observation $s_t$ independently when using data augmentation. Rather than predicting representations produced by the online encoder, we follow prior work (Tarvainen and Valpola, 2017; Grill et al., 2020) by computing target representations for future states using a momentum encoder $f_m$, whose parameters are an exponential moving average (EMA) of the online encoder parameters. Denoting the parameters of $f_o$ as $\theta_o$, those of $f_m$ as $\theta_m$, and the momentum coefficient as $\tau \in [0, 1)$, the update rule for $\theta_m$ is:
$$\theta_m \leftarrow \tau \theta_m + (1 - \tau)\,\theta_o$$
Note that this means the momentum encoder is not updated via gradient descent.
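The exponential moving average update can be sketched in a few lines. This is a minimal illustration assuming the common convention in which a coefficient `tau` weights the old target parameters and `1 - tau` weights the online parameters; parameters are represented as flat lists of floats, and all names are hypothetical.

```python
def ema_update(target_params, online_params, tau):
    """Return tau * target + (1 - tau) * online, element-wise.

    No gradients are involved: the target parameters simply track the
    online parameters with a lag controlled by tau.
    """
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_params, online_params)]


online = [1.0, -2.0, 0.5]
target = [0.0, 0.0, 0.0]
for _ in range(3):
    target = ema_update(target, online, tau=0.5)
print(target)  # after 3 updates each entry has moved 1 - 0.5**3 = 7/8 of the way
```

With `tau = 0` the target is an exact copy of the online parameters after every update, while values close to 1 make the target change slowly.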
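Transition Model: For the prediction objective, we generate a sequence of $K$ predictions $\hat{z}_{t+1:t+K}$ of future state representations $\tilde{z}_{t+1:t+K}$ using an action-conditioned transition model $h$. We compute $\hat{z}_{t+1:t+K}$ iteratively: $\hat{z}_{t+k+1} \triangleq h(\hat{z}_{t+k}, a_{t+k})$, starting from $\hat{z}_t \triangleq z_t \triangleq f_o(s_t)$, where $f_o$ is the online encoder. We compute $\tilde{z}_{t+1:t+K}$ by applying the momentum encoder $f_m$ to the observed future states $s_{t+1:t+K}$: $\tilde{z}_{t+k} \triangleq f_m(s_{t+k})$. The transition model and prediction loss operate in the latent space, thus avoiding pixel-based reconstruction objectives. We describe the architecture of $h$ in section 2.3.
Projection Heads: We use online and momentum projection heads $g_o$ and $g_m$ (Chen et al., 2020a) to project online and target representations to a smaller latent space, and apply an additional prediction head $q$ (Grill et al., 2020) to the online projections to predict the target projections:
$$\hat{y}_{t+k} \triangleq q\left(g_o(\hat{z}_{t+k})\right), \qquad \tilde{y}_{t+k} \triangleq g_m(\tilde{z}_{t+k}), \qquad 1 \le k \le K$$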
The momentum projection head parameters are given by an EMA of the online projection head parameters, using the same update as the online and momentum encoders.
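To make the data flow concrete, the sketch below rolls a transition model forward in latent space and computes the corresponding targets with a momentum encoder. The three toy functions standing in for the online encoder, momentum encoder and transition model are hypothetical placeholders (simple arithmetic, not neural networks), and projection heads are omitted for brevity.

```python
def rollout_latents(z_t, actions, transition):
    """Apply the transition model iteratively, one action per step."""
    predictions, z_hat = [], z_t
    for action in actions:
        z_hat = transition(z_hat, action)
        predictions.append(z_hat)
    return predictions


def encode_targets(future_states, momentum_encoder):
    """Targets come from the momentum encoder applied to observed states."""
    return [momentum_encoder(state) for state in future_states]


# Hypothetical stand-ins for the learned components.
def online_encoder(state):
    return [2.0 * x for x in state]

def momentum_encoder(state):
    return [1.5 * x for x in state]

def transition(latent, action):
    return [x + action for x in latent]


states = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]  # s_t, s_{t+1}, s_{t+2}
actions = [1, 1]                               # a_t, a_{t+1}

# Only the first state is encoded by the online encoder; every later
# prediction is produced by the transition model alone.
z_hat = rollout_latents(online_encoder(states[0]), actions, transition)
z_tilde = encode_targets(states[1:], momentum_encoder)
print(z_hat, z_tilde)
```

The key design point is that future observations never pass through the online encoder when predictions are formed, so the online branch must carry enough information forward through the transition model.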
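Prediction Loss: We compute the future prediction loss for MPR as the negative sum of cosine similarities (footnote 1: minimizing negative cosine similarity is equivalent, up to a constant scale and offset, to minimizing the “normalized L2” loss used in BYOL (Grill et al., 2020)) between the predicted and observed representations at timesteps $t+k$ for $1 \le k \le K$:
$$\mathcal{L}^{MPR}_\theta(s_{t:t+K}, a_{t:t+K}) = -\sum_{k=1}^{K} \left(\frac{\tilde{y}_{t+k}}{\|\tilde{y}_{t+k}\|_2}\right)^{\top} \left(\frac{\hat{y}_{t+k}}{\|\hat{y}_{t+k}\|_2}\right),$$
where $\tilde{y}_{t+k}$ (the target projection) and $\hat{y}_{t+k}$ (the online prediction) are computed from $(s_{t:t+K}, a_{t:t+K})$ as we just described.
We call our method Momentum Predictive Representations (MPR), following the predictive nature of the objective and the use of an exponential moving average target network similar to Tarvainen and Valpola (2017) and He et al. (2019). During training, we combine the MPR loss with the Q-learning loss for Rainbow. The MPR loss affects the online encoder $f_o$, the online projection head $g_o$, the prediction head $q$ and the transition model $h$. The Q-learning loss affects $f_o$ and the Q-learning head, which contains additional layers specific to Rainbow. Denoting the Q-learning loss from Rainbow as $\mathcal{L}^{RL}_\theta$ and the weight of the MPR loss as $\lambda$, our full optimization objective is: $\mathcal{L}^{total}_\theta = \mathcal{L}^{RL}_\theta + \lambda \mathcal{L}^{MPR}_\theta$.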
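The prediction loss and the combined objective can be sketched as follows. This is an illustrative implementation on plain Python lists, assuming the loss is the negative sum of cosine similarities over the prediction steps; the `eps` guard against zero-norm vectors and all function names are assumptions of this example. The value `lam=2` mirrors the MPR loss coefficient listed in the hyperparameter table.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine of the angle between two vectors (eps guards zero norms)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (max(norm_u, eps) * max(norm_v, eps))


def mpr_loss(predictions, targets):
    """Negative sum of cosine similarities over the K prediction steps."""
    return -sum(cosine_similarity(p, t) for p, t in zip(predictions, targets))


def total_loss(rl_loss, predictions, targets, lam=2.0):
    """Q-learning loss plus the weighted auxiliary prediction loss."""
    return rl_loss + lam * mpr_loss(predictions, targets)


# Two prediction steps: the first is perfectly aligned with its target
# (similarity 1), the second is orthogonal to it (similarity 0).
preds = [[1.0, 0.0], [0.0, 2.0]]
targs = [[3.0, 0.0], [1.0, 0.0]]
print(mpr_loss(preds, targs))         # -1.0
print(total_loss(0.5, preds, targs))  # 0.5 + 2 * (-1.0) = -1.5
```

Because the vectors are normalized inside the similarity, only the direction of each representation matters, not its scale.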
Compared to prior work (Kostrikov et al., 2020; Laskin et al., 2020), our method can leverage data augmentations more effectively by encouraging consistency between representations of different augmented views. We empirically verify this via a controlled comparison to DrQ (see section 5). It should be noted that MPR can still be used in contexts where data augmentation is unavailable or counterproductive. Compared to related work on contrastive representation learning, MPR does not use negative samples, which may require careful design of contrastive tasks, large batch sizes (Chen et al., 2020a), or the use of a buffer to emulate large batch sizes (He et al., 2019).
2.3 Transition Model Architecture
For the transition model $h$, we apply a convolutional network directly to the spatial output of the convolutional encoder $f_o$. The network comprises two 64-channel convolutional layers with $3 \times 3$ filters, with batch normalization (Ioffe and Szegedy, 2015) after the first convolution and ReLU nonlinearities after each convolution. We append a one-hot vector representing the action taken to each location in the input to the first convolutional layer, similar to Schrittwieser et al. (2019). We use a maximum prediction depth of $K = 5$, and we truncate calculation of the MPR loss at episode boundaries to avoid encoding environment reset dynamics into the model.
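Appending a one-hot action vector to every spatial location amounts to adding one constant plane per action to the feature map. The sketch below illustrates this on nested lists in channel-first layout; the function name and the layout are assumptions of this example, and a real implementation would operate on tensors.

```python
def append_action_planes(features, action, num_actions):
    """Concatenate a spatially broadcast one-hot action to a feature map.

    features: list of C channels, each an H x W nested list of floats.
    Returns C + num_actions channels: the original channels followed by
    one plane per action, filled with 1.0 for the chosen action and 0.0
    for every other action.
    """
    if not 0 <= action < num_actions:
        raise ValueError("action index out of range")
    height, width = len(features[0]), len(features[0][0])
    planes = [[[1.0 if a == action else 0.0] * width for _ in range(height)]
              for a in range(num_actions)]
    return features + planes


# A toy 2-channel, 2 x 3 feature map and a 4-action environment.
feats = [[[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]] for _ in range(2)]
out = append_action_planes(feats, action=1, num_actions=4)
print(len(out))  # 6 channels: 2 feature channels + 4 action planes
```

The first convolution of the transition model then sees the action at every location, so its filters can condition local dynamics on the chosen action.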
2.4 Data Augmentation
When using augmentation, we use the same set of image augmentations as in DrQ from Kostrikov et al. (2020), consisting of small random shifts and color jitter. We found it important to normalize activations to lie in $[0, 1]$ at the output of the convolutional encoder and transition model when using augmentation, as in Schrittwieser et al. (2019). We use Kornia (Riba et al., 2020) for efficient GPU-based data augmentations.
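As an illustration of the random-shift augmentation, the sketch below pads an image by replicating its edge pixels and crops it back to the original size at a random offset. This is a pure-Python stand-in, not the Kornia-based GPU pipeline used in the experiments; the default shift of 4 pixels and the edge-replication padding are assumptions of this example, and color jitter is not shown.

```python
import random


def random_shift(image, pad=4, rng=random):
    """Shift a 2-D image (list of rows) by up to `pad` pixels per axis.

    Equivalent to padding with replicated edge pixels and then taking a
    random crop of the original size from the padded image.
    """
    height, width = len(image), len(image[0])
    dy = rng.randint(0, 2 * pad)  # vertical crop offset in the padded image
    dx = rng.randint(0, 2 * pad)  # horizontal crop offset

    def clamp(index, size):
        # Map a padded coordinate back to the nearest valid pixel.
        return min(max(index, 0), size - 1)

    return [[image[clamp(y + dy - pad, height)][clamp(x + dx - pad, width)]
             for x in range(width)]
            for y in range(height)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
shifted = random_shift(image, pad=1, rng=random.Random(0))
print(shifted)  # same 3 x 3 shape, content shifted by at most one pixel
```

Applying an independent shift to each observation yields the different “views” whose representations the auxiliary loss encourages to agree.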
When not using augmentation, we find that MPR performs better when dropout with probability 0.5 is applied at each layer in the online and momentum encoders. This is consistent with Laine and Aila (2016) and Tarvainen and Valpola (2017), who find that adding noise inside the network is important when not using image-specific augmentation, as proposed by Bachman et al. (2014). We found that applying dropout in this way was not helpful when using image-specific augmentation.
2.5 Implementation Details
For our Atari experiments, we largely follow van Hasselt et al. (2019) for DQN hyperparameters, with four exceptions. We follow DrQ (Kostrikov et al., 2020) by: using the 3-layer convolutional encoder from Mnih et al. (2015), using 10-step returns instead of 20-step returns for Q-learning, and not using a separate DQN target network when using augmentation (footnote 2: this makes Double DQN updates (Van Hasselt et al., 2016) identical to standard DQN updates). We also perform two gradient steps per environment step instead of one. We show results for this configuration with and without augmentation in Table 4, and confirm that these changes are not themselves responsible for our performance. We reuse the first layer of the DQN MLP head as the MPR projection head $g_o$. When using dueling DQN (Wang et al., 2016), $g_o$ concatenates the outputs of the first layers of the value and advantage heads. When these layers are noisy (Fortunato et al., 2018), $g_o$ does not use the noisy parameters. Finally, we parameterize the predictor $q$ as a linear layer. We list the complete hyperparameters in Table 3. For , we use based on early experiments.
We test MPR on the sample-efficient Atari setting introduced by Kaiser et al. (2019) and van Hasselt et al. (2019). In this task, only 100,000 environment steps of training data are available – equivalent to 400,000 frames, or just under two hours – compared to the typical standard of 50,000,000 environment steps, or roughly 39 days of experience. When used without image augmentation, MPR demonstrates scores comparable to the previous best result from Kostrikov et al. (2020). When combined with image augmentation, MPR achieves a median human-normalized score of 0.444, which is a new state-of-the-art result on this task. MPR achieves super-human performance on six games in this data-limited setting: Boxing, Krull, Kangaroo, Road Runner, James Bond and Crazy Climber, compared to a maximum of two for any previous methods, and achieves scores higher than DrQ (the previous state-of-the-art method) on 21 out of 26 games. See Table 1 for a full list of scores. For consistency with previous works, we report human and random scores from Wang et al. (2016).
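The correspondence between environment steps and real-time experience quoted above can be checked with a short calculation. The factor of four frames per step follows from the text (100,000 steps correspond to 400,000 frames); the rate of 60 frames per second is an assumption about the emulated console.

```python
FRAMES_PER_STEP = 4      # from the text: 100k steps correspond to 400k frames
FRAMES_PER_SECOND = 60   # assumption: standard Atari emulation rate


def experience_seconds(env_steps):
    """Real-time duration of a given number of environment steps."""
    return env_steps * FRAMES_PER_STEP / FRAMES_PER_SECOND


low_data_hours = experience_seconds(100_000) / 3600
standard_days = experience_seconds(50_000_000) / 86_400
print(f"{low_data_hours:.2f} hours, {standard_days:.1f} days")
# prints: 1.85 hours, 38.6 days
```

Under these assumptions the 100k-step budget is just under two hours of play, and the standard 50M-step budget is roughly 39 days.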
|Game||Random||Human||SimPLe||DER||OTRainbow||CURL||DrQ||MPR (no Aug)||MPR|
|Kung Fu Master||258.5||22736.3||17257.2||14346.1||5722.2||14307.8||9111.0||14535.2||14135.8|
|Up N Down||533.4||11693.2||3350.3||2877.4||2847.6||2955.2||3180.8||5167.9||7307.8|
We evaluate the performance of different methods by computing the average episodic return at the end of training. It is common to normalize scores with respect to expert human scores to account for different scales of scores in each game. The human-normalized performance of an agent on a game is calculated as
$$\frac{\text{agent score} - \text{random score}}{\text{human score} - \text{random score}}$$
and then aggregated across the 26 games by taking their mean or median. It is common to report the median human-normalized performance, as the median is less susceptible to outliers. However, we find that in some games human scores are so high that differences between methods are washed out when normalizing scores. This makes it difficult for scores in these games, such as Alien, Asterix, and Seaquest, to influence aggregate metrics. To account for this, we also report scores normalized by a DQN (Mnih et al., 2015) agent trained over 50 million steps, using DQN scores reported in Wang et al. (2016).
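As a worked example of the normalization, the sketch below applies the usual convention (agent minus random, divided by human minus random) to the two games whose rows appear in Table 1. The median over only two games is shown purely to illustrate the aggregation step; the reported metrics aggregate over all 26 games.

```python
from statistics import median


def human_normalized(agent, random_score, human):
    """0 corresponds to random play and 1 to the human reference score."""
    return (agent - random_score) / (human - random_score)


# (random, human, MPR) scores taken from the rows shown in Table 1.
scores = {
    "Kung Fu Master": (258.5, 22736.3, 14135.8),
    "Up N Down": (533.4, 11693.2, 7307.8),
}

hns = {game: human_normalized(agent, rnd, human)
       for game, (rnd, human, agent) in scores.items()}
median_hns = median(hns.values())

for game, value in hns.items():
    print(f"{game}: {value:.3f}")
print(f"median over these games: {median_hns:.3f}")
```

Both games land at roughly 0.6, i.e. about 60% of the way from random play to the human reference.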
Additionally, we note that the standard evaluation protocol of evaluating over only 500,000 frames per game is problematic, as the quantity we are trying to measure is expected return over episodes. Due to the very long lengths of some episodes (up to 108,000 frames), this method may collect as few as four complete episodes. Moreover, better policies tend to have longer episodes on many games, leading stronger algorithms to experience even greater variance in this estimate of expected episodic returns. As variance is already a concern in deep RL (see Henderson et al., 2018), we propose evaluating over 100 episodes irrespective of their length, but we report results using the standard metric.
4 Related Work
4.1 Data-Efficient RL:
A number of works have sought to improve sample efficiency in deep RL. SimPLe (Kaiser et al., 2019) learns an explicit pixel-level transition model for Atari to generate simulated training data, achieving strong results on several games in the 100k-step setting. However, both van Hasselt et al. (2019) and Kielak (2020) demonstrate that variants of Rainbow (Hessel et al., 2018) tuned for sample efficiency can achieve comparable or superior performance.
In the context of continuous control, several works propose to leverage a latent-space model trained on reconstruction loss to improve sample efficiency (Hafner et al., 2019; Lee et al., 2019; Hafner et al., 2020). Most recently, DrQ (Kostrikov et al., 2020) and RAD (Laskin et al., 2020) have found that applying modest image augmentation can substantially improve sample efficiency in reinforcement learning, yielding better results than prior model-based methods. Data augmentation has also been found to improve generalization of reinforcement learning methods (Combes et al., 2018; Laskin et al., 2020) in multi-task and transfer settings. We show that data augmentation can be more effectively leveraged in reinforcement learning by forcing representations to be consistent between different augmented views of an observation while also predicting future latent states.
4.2 Representation Learning in RL:
Representation learning has a long history of use in RL – see Lesort et al. (2018). For example, CURL (Srinivas et al., 2020) recently proposed a combination of image augmentation and a contrastive loss to perform representation learning for RL. However, follow-up results from RAD (Laskin et al., 2020) suggest that most of the benefits of CURL come from its use of image augmentation rather than its contrastive loss.
CPC (Oord et al., 2018), CPC|Action (Guo et al., 2018), ST-DIM (Anand et al., 2019) and DRIML (Mazoure et al., 2020) propose to optimize various temporal contrastive losses in reinforcement learning environments. We perform an ablation comparing such temporal contrastive losses to our method in section 5. Kipf et al. (2019) propose to learn object-oriented contrastive representations by training a structured transition model based on a graph neural network.
MPR bears some resemblance to DeepMDP (Gelada et al., 2019), which trains a transition model with an unnormalized L2 loss to predict representations of future states, along with a reward prediction objective. However, DeepMDP also uses its online encoder to produce prediction targets rather than employing a target encoder, and is thus prone to representational collapse (sec. C.5 in Gelada et al. (2019)). To mitigate this issue, DeepMDP relies on an additional observation reconstruction objective. In contrast, our model is self-supervised, trained entirely in the latent space, and uses a normalized loss. Our ablations (see section 5) demonstrate that using a momentum target encoder has a large impact on our method, making it another key difference between MPR and DeepMDP.
MPR is also similar to PBL (Guo et al., 2020), which also directly predicts representations of future states. However, PBL uses a separate target network trained via gradient descent, whereas MPR uses a momentum target encoder, which we find to be vital. Moreover, PBL studies multi-task generalization in the asymptotic limits of data, whereas MPR is concerned with single-task performance in low data regimes, using only a small fraction of the data used by PBL. Unlike PBL, MPR additionally enforces consistency across augmentations, which empirically provides a large boost in performance.
We now present several ablation studies to measure the contribution of components in our method.
The importance of the momentum encoder
To illustrate the importance of using a momentum target encoder, we test two variants of our method that omit it. In each experiment, we use the online network to encode both the inputs and targets. In the first ablation, we allow gradients to flow from the target representations into the online encoder, and in the second we place a stop-gradient operation on the target representations to stop the encoder from learning to make the representations more predictable. We find that both ablations degrade performance, the former achieving a median human-normalized score of and the latter , compared to our full method’s score of . These results are consistent with the findings of Tarvainen and Valpola (2017), who observe that a momentum target encoder produces a more effective learning signal than the online encoder.
Making better use of data augmentation
We examine an ablated version of MPR in which all temporal elements are removed. Similarly to Mean Teachers (Tarvainen and Valpola, 2017) and BYOL (Grill et al., 2020), we provide the online and momentum encoders with different augmented views of each observation, and calculate the MPR loss between their representations of these views. We find that this variant alone outperforms DrQ or CURL (see Table 2, line 3), suggesting MPR makes better use of data augmentation than prior methods in reinforcement learning.
Dynamics modeling is key
A key distinction between MPR and other recent approaches leveraging representation learning for reinforcement learning, such as CURL (Srinivas et al., 2020) and DRIML (Mazoure et al., 2020), is our use of an explicit multi-step dynamics model. We test two ablated versions of MPR, one with no dynamics modeling and one that models only a single step of dynamics. Each of these variants has degraded performance compared to five-step MPR, with extended dynamics modeling consistently improving performance (see Table 2).
Comparison with contrastive losses
Although many recent works in representation learning have employed contrastive learning, we find that MPR consistently outperforms both temporal and non-temporal variants of contrastive losses (see Table 5), including CURL (Srinivas et al., 2020).
|Method||Augmentation||MPR Loss||Transition Model||Steps (K)||Median HNS|
|MPR (no model)||✓||✓||0|
|MPR (no aug.)||✓||✓||5|
6 Future Work
Recent work in both visual (Chen et al., 2020b) and language representation learning (Brown et al., 2020) has suggested that self-supervised models trained on large datasets perform exceedingly well on downstream problems with limited data, often outperforming methods trained using only task-specific data. Future works could similarly exploit large corpora of unlabelled data, perhaps from multiple MDPs or raw videos, to further improve the performance of RL methods in low-data regimes. As the MPR objective is unsupervised, it could be directly applied in such settings.
Another interesting direction is to use the transition model learned by MPR for planning. MuZero (Schrittwieser et al., 2019) has demonstrated that planning with a model supervised via reward and value prediction can work extremely well given sufficient (massive) amounts of data. It remains unclear whether such models can work well in low-data regimes, and whether augmenting such models with self-supervised objectives such as MPR can improve their data efficiency.
In this paper we introduced Momentum Predictive Representations (MPR), a self-supervised representation learning algorithm designed to improve the data efficiency of deep reinforcement learning agents. MPR learns representations that are both temporally predictive and consistent across different views of environment observations, directly predicting representations of future states produced by a momentum target encoder. MPR achieves state-of-the-art performance on the 100k steps Atari benchmark, demonstrating significant improvements over prior work. Our experiments show that MPR is highly robust, and is able to outperform the previous state of the art when either data augmentation or temporal prediction is disabled. We identify important directions for future work, and hope continued research at the intersection of self-supervised learning and reinforcement learning leads to algorithms which rival the efficiency and robustness of humans.
We are grateful for the collaborative research environment provided by Mila and Microsoft Research. We would also like to acknowledge Hitachi for providing funding support for this project. We thank Nitarshan Rajkumar and Evan Racah for providing feedback on an earlier draft; Denis Yarats and Aravind Srinivas for answering questions about DrQ and CURL; Michal Valko and Sherjil Ozair for discussions about BYOL; and Phong Nguyen for helpful discussions. Finally, we thank Compute Canada and Microsoft Research for providing computational resources used in this project.
- Unsupervised state representation learning in atari. In NeurIPS, Cited by: §4.2.
- Learning with pseudo-ensembles. Advances in Neural Information Processing Systems (NIPS). Cited by: §2.4.
- Learning representations by maximizing mutual information across views. In NeurIPS, Cited by: §1.
- Agent57: outperforming the atari human benchmark. arXiv preprint arXiv:2003.13350. Cited by: §1.
- A distributional perspective on reinforcement learning. ICML. Cited by: §2.1.
- The arcade learning environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research 47. Cited by: §1, §2.1.
- Language models are few-shot learners. arXiv preprint arXiv:2005.14165. Cited by: §6.
- A simple framework for contrastive learning of visual representations. ICML. Cited by: Appendix B, §1, 3rd item, §2.2.
- Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029. Cited by: §1, §6.
- Learning invariances for policy generalization. arXiv preprint arXiv:1809.02591. Cited by: §4.1.
- Unsupervised visual representation learning by context prediction. In ICCV, Cited by: §1.
- Challenges of real-world reinforcement learning. arXiv preprint arXiv:1904.12901. Cited by: §1.
- Noisy networks for exploration. In ICLR, Cited by: §2.5.
- An introduction to deep reinforcement learning. arXiv preprint arXiv:1811.12560. Cited by: §1.
- Deepmdp: learning continuous latent space models for representation learning. ICML. Cited by: §4.2.
- Bootstrap your own latent: a new approach to self-supervised learning. arXiv preprint arXiv:2006.07733. Cited by: §1, 1st item, 3rd item, §5, footnote 1.
- Bootstrap latent-predictive representations for multitask reinforcement learning. arXiv preprint arXiv:2004.14646. Cited by: §4.2.
- Neural predictive belief representations. ICML. Cited by: 3rd item, §4.2.
- Dream to control: learning behaviors by latent imagination. ICLR. Cited by: §4.1.
- Learning latent dynamics for planning from pixels. In ICML, Cited by: §1, §4.1.
- Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722. Cited by: §1, §2.2, §2.2.
- Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272. Cited by: §1.
- Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §3.1.
- Rainbow: combining improvements in deep reinforcement learning. In AAAI, Cited by: §2.1, §2, §4.1.
- Learning deep representations by mutual information estimation and maximization. ICLR. Cited by: §1.
- Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, Cited by: §2.3.
- Model based reinforcement learning for atari. In ICLR, Cited by: Figure 1, §1, §1, Table 1, §3, §4.1.
- Do recent advancements in model-based deep reinforcement learning really improve data efficiency? Cited by: §4.1.
- Contrastive learning of structured world models. arXiv preprint arXiv:1911.12247. Cited by: §4.2.
- Image augmentation is all you need: regularizing deep reinforcement learning from pixels. arXiv preprint arXiv:2004.13649. Cited by: item *, §2.2, §2.4, §2.5, §3, §4.1.
- Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242. Cited by: §2.4.
- Reinforcement learning with augmented data. arXiv preprint arXiv:2004.14990. Cited by: §2.2, §4.1, §4.2.
- Stochastic latent actor-critic: deep reinforcement learning with a latent variable model. arXiv preprint arXiv:1907.00953. Cited by: §4.1.
- State representation learning for control: an overview. Neural Networks 108. Cited by: §4.2.
- Deep reinforcement and infomax learning. arXiv preprint arXiv:2006.07217. Cited by: §4.2, §5.
- Human-level control through deep reinforcement learning. Nature 518 (7540). Cited by: §1, §2.1, §2.5, §3.1.
- Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: 2nd item, Appendix B, §1, §4.2.
- Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680. Cited by: §1, §1.
- PyTorch: an imperative style, high-performance deep learning library. In NeurIPS, Cited by: §2.5.
- Kornia: an open source differentiable computer vision library for PyTorch. In The IEEE Winter Conference on Applications of Computer Vision, Cited by: §2.4.
- Mastering atari, go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265. Cited by: §1, §2.3, §2.4, §6.
- Curl: contrastive unsupervised representations for reinforcement learning. arXiv preprint arXiv:2004.04136. Cited by: 1st item, §4.2, §5, §5.
- Rlpyt: a research code base for deep reinforcement learning in pytorch. arXiv preprint arXiv:1909.01500. Cited by: §2.5.
- The bitter lesson. Incomplete Ideas (blog). Note: http://www.incompleteideas.net/IncIdeas/BitterLesson.html Cited by: §1.
- Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In NeurIPS, Cited by: §1, 1st item, §2.2, §2.4, §5, §5.
- Deepmind control suite. arXiv preprint arXiv:1801.00690. Cited by: §1.
- Contrastive multiview coding. arXiv preprint arXiv:1906.05849. Cited by: §1.
- Deep reinforcement learning with double q-learning. In Thirtieth AAAI conference on artificial intelligence, Cited by: §1, §2.1, footnote 2.
- When to use parametric models in reinforcement learning?. In NeurIPS, Cited by: Appendix B, §1, §2.1, §2.5, §3, §4.1.
- Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature. Cited by: §1.
- Dueling network architectures for deep reinforcement learning. In ICML, Cited by: §2.1, §2.5, §3.1, §3.
- Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848. Cited by: §1, §1.
Appendix A Atari Details
|Parameter||Setting (for both variations)|
|Reward clipping||[-1, 1]|
|Terminal on loss of life||True|
|Max frames per episode||108K|
|Support of Q-distribution||51|
|Optimizer: learning rate||0.0001|
|Max gradient norm||10|
|Priority correction||0.4 → 1|
|Noisy nets parameter||0.1|
|Min replay size for sampling||2000|
|Replay period every||1 step|
|Updates per step||2|
|Multi-step return length||10|
|Q network: channels||32, 64, 64|
|Q network: filter size||8 × 8, 4 × 4, 3 × 3|
|Q network: stride||4, 2, 1|
|Q network: hidden units||256|
|MPR loss coefficient||2|
|Parameter||With Augmentation||Without Augmentation|
|Data Augmentation||Random shifts ( pixels) &||None|
|Target network: update period||1*||1000|
*Similar to Kostrikov et al. (2020), we found target networks to not be necessary when using data augmentation.
A.1 Controlled baselines
To ensure that the minor hyper-parameter changes we make to the DER baseline are not solely responsible for our improved performance, we perform controlled experiments using the same hyper-parameters and the same random seeds for baselines. We find that our DQN without augmentation is slightly stronger than Data-Efficient Rainbow, while with augmentation enabled our results are comparable to those of DrQ. None of these methods, however, is close to the performance of MPR.
|Variant||Human-Normalized Score||DQN-Normalized Score|
|Base w/ augmentation||0.268||0.448||0.220||0.273|
Appendix B Comparison with a contrastive loss
To compare MPR with alternative methods drawn from contrastive learning, we examine several variants of contrastive losses based on InfoNCE (Oord et al., 2018):
- A contrastive loss based solely on different views of the same state, similar to CURL (Srinivas et al., 2020).
- A temporal contrastive loss with augmentation, where targets are drawn one step in the future, equivalent to single-step CPC (Oord et al., 2018).
- A temporal contrastive loss with an explicit dynamics model, similar to CPC|Action (Guo et al., 2018).
In each case we use a momentum target encoder with the same update constant as in MPR, 0.01. To create as fair a comparison as possible, we use the same augmentation (random shifts and intensity) and the same DQN hyperparameters as in MPR. As in MPR, we calculate contrastive losses using the output of the first layer of the Q-head MLP, with a bilinear classifier (as in Oord et al., 2018). Following Chen et al. (2020a), we normalize representations in the contrastive loss, with a temperature of 0.1. We present results in Table 5.
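For readers unfamiliar with InfoNCE, the following sketch shows the general shape of such a loss with normalized representations and a temperature of 0.1. It is a simplification: similarities are plain dot products rather than the bilinear classifier used in these experiments, the other items in the batch serve as negatives, and all names are hypothetical.

```python
import math


def l2_normalize(vector, eps=1e-8):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]


def info_nce(queries, keys, temperature=0.1):
    """Mean cross-entropy of matching queries[i] to keys[i] among all keys."""
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, q_i in enumerate(q):
        logits = [sum(a * b for a, b in zip(q_i, k_j)) / temperature
                  for k_j in k]
        # Numerically stable log-sum-exp over all candidate keys.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(q)


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)  # True: matching pairs give a much lower loss
```

Unlike the MPR loss, this objective depends on the other items in the batch, which is why contrastive methods are sensitive to batch size and to how negatives are chosen.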
Although all of these variants outperform the previous contrastive result on this task, CURL, none of them substantially improve performance over the DQN they use as a baseline; the two temporal InfoNCE variants each boost performance on 15/26 games, while the non-temporal version actually harms performance on 14/26 games. We consider these results broadly consistent with those of CURL, which observes a relatively small performance boost over their baseline, Data-Efficient Rainbow (van Hasselt et al., 2019).
|Method||Augmentation||InfoNCE||Temporal||Transition Model||Median HNS|
|InfoNCE w/ dynamics model||✓||✓||✓||✓|
|1-step temporal InfoNCE||✓||✓||✓|
|Base DQN w/ augmentation||✓|