Ever since the first fully-learned approach succeeded at playing Atari games from screen images (Mnih et al., 2015), standard practice in deep reinforcement learning (RL) has been to learn visual features and a control policy jointly, end-to-end. Several such deep RL algorithms have matured (Hessel et al., 2018; Schulman et al., 2017; Mnih et al., 2016; Haarnoja et al., 2018) and have been successfully applied to domains ranging from real-world (Levine et al., 2016; Kalashnikov et al., 2018) and simulated robotics (Lee et al., 2019; Laskin et al., 2020a; Hafner et al., 2020) to sophisticated video games (Berner et al., 2019; Jaderberg et al., 2019), and even high-fidelity driving simulators (Dosovitskiy et al., 2017). While the simplicity of end-to-end methods is appealing, relying on the reward function to learn visual features can be severely limiting. For example, it leaves features difficult to acquire under sparse rewards, and it can narrow their utility to a single task. Although our intent is broader than to focus on either sparse-reward or multi-task settings, they arise naturally in our studies. We investigate how to learn representations which are agnostic to rewards, without degrading the control policy.
A number of recent works have significantly improved RL performance by introducing auxiliary losses, which are unsupervised tasks that provide a feature-learning signal to the convolutional neural network (CNN) encoder in addition to the RL loss (Jaderberg et al., 2017; van den Oord et al., 2018; Laskin et al., 2020b; Guo et al., 2020; Schwarzer et al., 2020). Meanwhile, in computer vision, recent unsupervised and self-supervised methods (Chen et al., 2020; Grill et al., 2020; He et al., 2019) have demonstrated that powerful feature extractors can be learned without labels, as evidenced by their usefulness for downstream tasks such as ImageNet classification. Together, these advances suggest that visual features for RL could possibly be learned entirely without rewards, which would grant greater flexibility to improve overall learning performance. To our knowledge, however, no single unsupervised learning (UL) task has been shown adequate for this purpose in general vision-based environments.
In this paper, we demonstrate the first decoupling of representation learning from reinforcement learning that performs as well as or better than end-to-end RL. We update the encoder weights using only UL and train a control policy independently, on the (compressed) latent images. This capability stands in contrast to previous state-of-the-art methods, which have trained the UL and RL objectives jointly, or Laskin et al. (2020b), which observed diminished performance with decoupled encoders.
Our main enabling contribution is a new unsupervised task tailored to reinforcement learning, which we call Augmented Temporal Contrast (ATC). ATC requires a model to associate observations from nearby time steps within the same trajectory (Anand et al., 2019). Observations are encoded via a convolutional neural network (shared with the RL agent) into a small latent space, where the InfoNCE loss is applied (van den Oord et al., 2018). Within each randomly sampled training batch, the positive observation, $o_{t+k}$, for every anchor, $o_t$, serves as a negative for all other anchors. For regularization, observations undergo stochastic data augmentation (Laskin et al., 2020b) prior to encoding, namely random shift (Kostrikov et al., 2020), and a momentum encoder (He et al., 2020; Laskin et al., 2020b) is used to process the positives. A learned predictor layer further processes the anchor code (Grill et al., 2020; Chen et al., 2020) prior to contrasting. In summary, our algorithm is a novel combination of elements that enables generic learning of the structure of MDPs from visual observations, without requiring rewards or actions as input.
We include extensive experimental studies establishing the effectiveness of our algorithm in a visually diverse range of common RL environments: DeepMind Control Suite (DMControl; Tassa et al., 2018), DeepMind Lab (DMLab; Beattie et al., 2016), and Atari (Bellemare et al., 2013). Our experiments span discrete and continuous control, 2D and 3D visuals, and both on-policy and off-policy RL algorithms. Complete code for all of our experiments is available at https://github.com/astooke/rlpyt/rlpyt/ul. Our empirical contributions are summarized as follows:
Online RL with UL: We find that the convolutional encoder trained solely with the unsupervised ATC objective can fully replace the end-to-end RL encoder without degrading policy performance. ATC achieves nearly equal or greater performance in all DMControl and DMLab environments tested and in 5 of the 8 Atari games tested. In the other 3 Atari games, using ATC as an auxiliary loss or for weight initialization still brings improvements over end-to-end RL.
Encoder Pre-Training Benchmarks: We pre-train the convolutional encoder to convergence on expert demonstrations, and evaluate it by training an RL agent using the encoder with weights frozen. We find that ATC matches or outperforms all prior UL algorithms as tested across all domains, demonstrating that ATC is a state-of-the-art UL algorithm for RL.
Multi-Task Encoders: An encoder is trained on demonstrations from multiple environments, and is evaluated, with weights frozen, in separate downstream RL agents. A single encoder trained on four DMControl environments generalizes successfully, performing as well as or better than end-to-end RL in four held-out environments. Similar attempts to generalize across eight diverse Atari games result in mixed performance, confirming some limited feature sharing among games.
Ablations and Encoder Analysis: Components of ATC are ablated, showing their individual effects. Additionally, data augmentation is shown to be necessary in DMControl during RL even when using a frozen encoder. We introduce a new augmentation, subpixel random shift, which matches performance while augmenting the latent images, unlocking computation and memory benefits.
2 Related Work
Several recent works have used unsupervised/self-supervised representation learning methods to improve performance in RL. The UNREAL agent (Jaderberg et al., 2017) introduced unsupervised auxiliary tasks to deep RL, including the Pixel Control task, a Q-learning method requiring predictions of screen changes in discrete control environments, which has become a standard in DMLab (Hessel et al., 2019). CPC (van den Oord et al., 2018) applied contrastive losses over multiple time steps as an auxiliary task for the convolutional and recurrent layers of RL agents, and it has been extended with future action-conditioning (Guo et al., 2018). Recently, PBL (Guo et al., 2020) surpassed these methods with an auxiliary loss of forward and backward predictions in the recurrent latent space using partial agent histories. Where the trend is of increasing sophistication in auxiliary recurrent architectures, our algorithm is a markedly simpler, one-step temporal technique, requiring only observations, and yet it proves sufficient in POMDPs.
ST-DIM (Anand et al., 2019) and DRIML (Mazoure et al., 2020) introduced various temporal, contrastive losses, including on “local” features within the encoder, without data augmentation. CURL (Laskin et al., 2020b) introduced an augmented, contrastive auxiliary task similar to ours, including a momentum encoder but without the temporal aspect. Most recently, MPR (Schwarzer et al., 2020) combined data augmentation with multi-step, convolutional forward modeling and a similarity loss to improve DQN agents in the Atari 100k benchmark. Hafner et al. (2019, 2020); Lee et al. (2019) proposed to leverage world-modeling in a latent-space for continuous control. None of these methods attempt to decouple encoder training from the RL loss (except for CURL, with reduced performance), and none have previously been shown effective in as diverse a collection of RL environments as ours.
Finn et al. (2016); Devin et al. (2018); Ha and Schmidhuber (2018) are example works which pretrained encoder features in advance; they used image reconstruction losses such as the VAE (Kingma and Welling, 2013) or else assumed object-centric representations. MERLIN (Wayne et al., 2018) trained a convolutional encoder and sophisticated memory module online, detached from the RL agent, which learned read-only accesses to memory. It used reconstruction and one-step latent-prediction losses and achieved high performance in DMLab-like environments with extreme partial observability. Our loss function may benefit those settings, as it outperforms similar reconstruction losses in our experiments. Decoupling unsupervised pretraining from downstream tasks is common in computer vision (Hénaff et al., 2019; He et al., 2019; Chen et al., 2020) and has the favorable property of providing task-agnostic features which can be used for training smaller task-specific networks, yielding significant gains in computational efficiency over end-to-end methods.
3 Augmented Temporal Contrast
Our unsupervised learning task, Augmented Temporal Contrast (ATC), requires a model to associate an observation, $o_t$, with one from a specified, near-future time step, $o_{t+k}$. Within each training batch, we apply stochastic data augmentation to the observations (Laskin et al., 2020b), namely random shift (Kostrikov et al., 2020), which is simple to implement and provides highly effective regularization in most cases. The augmented observations are encoded into a small latent space where a contrastive loss is applied. This task encourages the learned encoder to extract meaningful elements of the structure of the MDP from observations.
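As a rough illustration of the random shift augmentation mentioned above, the following Python sketch pads an image by replicating its edge pixels and crops back to the original size at a random offset. The function name, the list-of-rows image format, and the `pad` and `prob` defaults are assumptions made for this example; they are not taken from the paper's implementation.

```python
import random


def random_shift(image, pad=4, prob=1.0, rng=random):
    """Randomly translate a 2D image (list of rows) by up to `pad` pixels.

    Equivalent to edge-replicate padding by `pad` on every side followed by
    a random crop back to the original size. `pad` and `prob` are
    illustrative defaults, not values from the paper.
    """
    # With probability 1 - prob, return an unshifted copy.
    if rng.random() >= prob:
        return [row[:] for row in image]

    height, width = len(image), len(image[0])

    def pixel(r, c):
        # Clamping indices to the valid range implements edge padding.
        return image[min(max(r, 0), height - 1)][min(max(c, 0), width - 1)]

    # Integer offsets drawn uniformly from [-pad, pad] in each direction.
    dy = rng.randint(-pad, pad)
    dx = rng.randint(-pad, pad)
    return [[pixel(r + dy, c + dx) for c in range(width)] for r in range(height)]
```

Drawing a fresh offset for every observation in a batch gives each anchor and positive its own shift. A lower `prob` applies the shift to only a fraction of the observations.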
Our architecture for ATC consists of four learned components: (i) a convolutional encoder, $f_\theta$, which processes the anchor observation, $o_t$, into the latent image $z_t = f_\theta(\text{AUG}(o_t))$, (ii) a linear global compressor, $g_\phi$, to produce a small latent code vector, $c_t = g_\phi(z_t)$, (iii) a residual predictor MLP, $h_\psi$, which acts as an implicit forward model to advance the code, $p_t = h_\psi(c_t) + c_t$, and (iv) a contrastive transformation matrix, $W$. To process the positive observation, $o_{t+k}$, into the target code $\bar{c}_{t+k} = g_{\bar{\phi}}(f_{\bar{\theta}}(\text{AUG}(o_{t+k})))$, we use a momentum encoder (He et al., 2019) parameterized as a slowly moving average of the weights from the learned encoder and compressor layer:
$$\bar{\theta} \leftarrow (1 - \tau)\bar{\theta} + \tau\theta; \qquad \bar{\phi} \leftarrow (1 - \tau)\bar{\phi} + \tau\phi$$
The complete architecture is shown in Figure 1. The convolutional encoder, $f_\theta$, alone is shared with the RL agent.
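The "slowly moving average" used for the momentum encoder can be sketched in a few lines of Python. This is a minimal illustration assuming the weights are held in a flat list of floats; the function name and the value of `tau` are hypothetical.

```python
def momentum_update(target_weights, online_weights, tau=0.01):
    """Exponential moving average of the online weights.

    Each target weight moves a fraction `tau` of the way toward the
    corresponding online (learned) weight. `tau` here is an illustrative
    value, not the setting used in the paper.
    """
    return [
        (1.0 - tau) * t + tau * w
        for t, w in zip(target_weights, online_weights)
    ]


# Usage: the target drifts toward the online weights over repeated updates.
online = [1.0, -2.0, 0.5]
target = [0.0, 0.0, 0.0]
for _ in range(3):
    target = momentum_update(target, online, tau=0.5)
```

Only the online weights receive gradients. The target weights change through this averaging step alone, so the codes used as positives vary slowly during training.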
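We employ the InfoNCE loss (Gutmann and Hyvärinen, 2010; van den Oord et al., 2018) using logits computed bilinearly, as $l = p_t W \bar{c}_{t+k}$. In our implementation, the positives from all other elements within the training batch serve as the negative examples for each anchor. Denoting an observation indexed $i$ from dataset $\mathcal{O}$ as $o_i$, and its positive as $o_{i+}$, the logits can be written as $l_{i,j+} = p_i W \bar{c}_{j+}$; our loss function in practice is:
$$\mathcal{L}^{ATC} = -\mathbb{E}_{\mathcal{O}}\left[\log \frac{\exp l_{i,i+}}{\sum_{o_j \in \mathcal{O}} \exp l_{i,j+}}\right]$$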
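The contrastive objective described above can be illustrated with a small, dependency-free Python sketch. It assumes a batch of anchor-side vectors (`predictions`), a matching batch of positive codes (`targets`), and a square matrix `W` for the bilinear logits. All names are chosen for this example, and the plain-list arithmetic stands in for the tensor operations a real implementation would use.

```python
import math


def atc_infonce_loss(predictions, targets, W):
    """InfoNCE loss with bilinear logits and in-batch negatives.

    predictions: list of B anchor-side vectors (length D each).
    targets:     list of B positive codes (length D each); targets[j] is
                 the positive for anchor j and a negative for all others.
    W:           D x D matrix (list of rows) for the bilinear product.
    """
    def bilinear(p, c):
        return sum(
            p[a] * W[a][b] * c[b]
            for a in range(len(p))
            for b in range(len(c))
        )

    total = 0.0
    for i, p in enumerate(predictions):
        logits = [bilinear(p, c) for c in targets]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching positive as the correct class.
        total += log_z - logits[i]
    return total / len(predictions)
```

With uninformative codes, every logit is equal and the loss is log(B) for batch size B. It approaches zero as each anchor scores its own positive far above the others.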
4.1 Evaluation Environments and Algorithms
We evaluate ATC on three standard, visually diverse RL benchmarks: the DeepMind Control Suite (DMControl; Tassa et al., 2018), Atari games in the Arcade Learning Environment (Bellemare et al., 2013), and DeepMind Lab (DMLab; Beattie et al., 2016). Atari requires discrete control in arcade-style games. DMControl comprises continuous-control robotic locomotion and manipulation tasks. In contrast, DMLab requires the RL agent to reason in more visually complex 3D maze environments with partial observability.
We use ATC to enhance both on-policy and off-policy RL algorithms. For DMControl, we use RAD-SAC (Laskin et al., 2020a; Haarnoja et al., 2018) with the random shift augmentation of Kostrikov et al. (2020). A difference from prior work is that we use more downsampling in our convolutional network, by using strides $(2, 2, 2, 1)$ instead of $(2, 1, 1, 1)$ to reduce the convolution output image by 25x. (For our input image size, $84 \times 84$, the convolution output image is $7 \times 7$ rather than $35 \times 35$.) Performance remains largely unchanged, except for a small decrease in the half-cheetah environment, but the experiments run significantly faster and use less GPU memory. For both Atari and DMLab, we use PPO (Schulman et al., 2017). In Atari, we use feed-forward agents, sticky actions, and no end-of-life boundaries for RL episodes. In DMLab, we use recurrent LSTM agents receiving only a single time-step image input and the four-layer convolutional encoder from Jaderberg et al. (2019), and we tune the entropy bonus for each level. In the online setting, the ATC loss is trained using a small replay buffer of recent experiences.
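The 25x reduction in convolution output size can be checked with the standard output-size formula. The sketch below assumes an 84x84 input, four unpadded 3x3 convolution layers, and the stride patterns (2, 2, 2, 1) versus (2, 1, 1, 1). These specifics are assumptions for this worked example, chosen because they reproduce the stated 25x figure.

```python
def conv_output_size(size, strides, kernel=3, padding=0):
    """Spatial size after a stack of square convolutions.

    Applies floor((size + 2 * padding - kernel) / stride) + 1 per layer.
    The kernel size and padding are assumed values for illustration.
    """
    for stride in strides:
        size = (size + 2 * padding - kernel) // stride + 1
    return size


downsampled = conv_output_size(84, (2, 2, 2, 1))  # 84 -> 41 -> 20 -> 9 -> 7
baseline = conv_output_size(84, (2, 1, 1, 1))     # 84 -> 41 -> 39 -> 37 -> 35
reduction = (baseline * baseline) // (downsampled * downsampled)  # 1225 // 49 = 25
```

Under these assumptions the latent image shrinks from 35x35 to 7x7, which is the 25-fold reduction in spatial positions described in the text.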
We include all our own baselines for fair comparison and provide complete settings in an appendix. Multiple seeds were run for each experiment, and the lightly shaded area around each curve represents the maximum extent of the best and worst seeds.
4.2 Online RL with UL
We found ATC capable of training the encoder online, fully detached from the RL agent, while achieving scores essentially equal to or better than end-to-end RL in all six DMControl environments we tested (Figure 2). In Cartpole-Swingup-Sparse, where rewards are only received once the pole reaches vertical, UL training enabled the agent to master the task significantly faster. The encoder is trained with one update for every RL update to the policy, using the same batch size, except in Cheetah-Run, which required twice the UL updates.
We experimented with two kinds of levels in DMLab: Explore_Goal_Locations, which requires repeatedly navigating a maze whose layout is randomized every episode, and Lasertag_Three_Opponents, which requires fast reflexes to pursue and tag enemies at a distance. We found ATC capable of training fully detached encoders while achieving equal or better performance than end-to-end RL. Results are shown in Figure 3. Both environments exhibit sparsity, which is greater in the “large” version than in the “small” version; we discuss next how our algorithm addresses it.
In Explore, the goal object is rarely seen, especially early on, making its appearance difficult to learn. We therefore introduced prioritized sampling for UL, with priorities corresponding to empirical absolute returns: $p \propto 1 + R_{abs}$, where $R_{abs} = \sum_{t=0}^{n} \gamma^t |r_t|$, to train more frequently on more informative scenes. (In Explore_Goal_Locations, the only reward is +10, earned when reaching the goal object.) Whereas uniform-UL performs slightly below RL, prioritized-UL outperforms RL and nearly matches using UL (uniform) as an auxiliary task. By considering the encoder as a stand-alone feature extractor separate from the policy, no importance sampling correction is required.
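A minimal sketch of prioritized sampling for the UL batch follows. It assumes priorities of the form 1 plus a discounted sum of absolute rewards observed after each stored observation. The function names, the discount `gamma`, and the horizon (the length of each reward list) are hypothetical choices for illustration.

```python
import random


def absolute_return(rewards, gamma=0.99):
    """Discounted sum of absolute rewards following an observation."""
    return sum((gamma ** t) * abs(r) for t, r in enumerate(rewards))


def ul_priorities(reward_sequences, gamma=0.99):
    """One priority per stored observation; the +1 keeps reward-free
    observations sampleable (assumed form, see lead-in)."""
    return [1.0 + absolute_return(rs, gamma) for rs in reward_sequences]


def sample_indices(priorities, batch_size, rng=random):
    """Draw replay indices with probability proportional to priority."""
    return rng.choices(range(len(priorities)), weights=priorities, k=batch_size)
```

Observations shortly before a reward receive larger priorities and so appear in more UL batches. Because the samples only train the encoder, the sketch applies no importance weights.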
In Lasertag, enemies are often seen, but the reward of tagging one is rarely achieved by the random agent. ATC learns the relevant features anyway, boosting performance while the RL-only agent remains at zero average score. We found that increasing the rate of UL training to do twice as many updates further improved the score to match the ATC-auxiliary agent, showing flexibility to address the representation-learning bottleneck when opponents are dispersed. (Since the ATC batch size was 512 but the RL batch size was 1024, performing twice as many UL updates still only consumed the same amount of encoder training data as RL. We did not fine-tune for batch size.)
We tested a diverse subset of eight Atari games, shown in Figure 4. We found detached-encoder training to work as well as end-to-end RL in five games, but performance suffered in Breakout and Space Invaders in particular. Using ATC as an auxiliary task, however, improves performance in these games and others. We found it helpful to anneal the amount of UL training over the course of RL in Atari (details in an appendix). Notably, we found several games, including Space Invaders, to benefit from using ATC only to initialize encoder weights, done using an initial 100k transitions gathered with a uniform random policy. Some of our remaining experiments provide more insights into the challenges of this domain.
4.3 Encoder Pre-Training Benchmarks
To benchmark the effectiveness of different UL algorithms for RL, we propose a new evaluation methodology that is similar to how UL pre-training techniques are measured in computer vision (see e.g. Chen et al. (2020); Grill et al. (2020)): (i) collect a data set composed of expert demonstrations from each environment; (ii) pre-train the CNN encoder with that data offline using UL; (iii) evaluate by using RL to learn a control policy while keeping the encoder weights frozen. This procedure isolates the asymptotic performance of each UL algorithm for RL. For convenience, we drew expert demonstrations from partially-trained RL agents. Further details about pre-training by each algorithm are provided in an appendix.
In DMControl, we compare ATC against two competing algorithms: Augmented Contrast (AC), from CURL (Laskin et al., 2020b), which uses the same observation for the anchor and the positive, and a VAE (Kingma and Welling, 2013), for which we found better performance by introducing a time delay to the target observation (VAE-T). We found ATC to match or outperform the other algorithms in all four test environments, as shown in Figure 5. Further, ATC is the only one to match or outperform the reference end-to-end RL across all cases.
In DMLab, we compare against both Pixel Control (Jaderberg et al., 2017; Hessel et al., 2019) and CPC (van den Oord et al., 2018), which have been shown to bring strong benefits in this domain. While all algorithms perform similarly well in Explore, ATC performs significantly better in Lasertag (Figure 6). Our algorithm is simpler than Pixel Control and CPC in the sense that it uses no actions, deconvolution, or recurrence.
In Atari, we compare against Pixel Control, VAE-T, and a basic inverse model which predicts actions between pairs of observations. We also compare against Spatio-Temporal Deep InfoMax (ST-DIM), which uses temporal contrastive losses with “local” features from an intermediate convolution layer to ensure attention to the whole screen; it was shown to produce detailed game-state knowledge when applied to individual frames (Anand et al., 2019). Of the four games shown in Figure 7, ATC is the only UL algorithm to match the end-to-end RL reference in Gravitar and Breakout, and it performs best in Space Invaders.
4.4 Multi-Task Encoders
In the offline setting, we conducted initial explorations into the capability of ATC to learn multi-task encoders, simply by pre-training on demonstrations from multiple environments. We evaluate the encoder by using it, with frozen weights, in separate RL agents learning each downstream task.
Figure 8 shows our results in DMControl, where we pretrained using only the four environments in the top row. Although the encoder was never trained on the Hopper, Pendulum, or Finger domains, the multi-task encoder supports efficient RL in them. Pendulum-Swingup and Cartpole-Swingup-Sparse stand out as challenging environments which benefited from cross-domain and cross-task pre-training, respectively. The pretraining was remarkably efficient, requiring only 20,000 updates to the encoder.
Atari proved a more challenging domain for learning multi-task encoders. Learning all eight games together (Figure 11, in the appendix) resulted in diminished performance relative to single-game pretraining in three of the eight. The decrease was partially alleviated by widening the encoder with twice as many filters per layer, indicating that representation capacity is a limiting factor. To test generalization, we conducted a seven-game pre-training experiment where we test the encoder on the held-out game. Most games suffered diminished performance (although they still score significantly higher than with a frozen random encoder), confirming the limited extent to which visual features transfer across these games.
4.5 Ablations and Encoder Analysis
Random Shift in ATC
In offline experiments, we discovered random shift augmentations to be helpful in all domains. To our knowledge, this is the first application of random shift to 3D visual environments as in DMLab. In Atari, we found performance in Gravitar to suffer from random shift, but reducing the probability of applying random shift to each observation from 1.0 to 0.1 alleviated the effect while still bringing benefits in other games, so we used this setting in our main experiments. Results are shown in Figure 12 in an appendix.
Random Shift in RL
In DMControl, we found the best results when using random shift during RL, even when training with a frozen encoder. This is evidence that the augmentation regularizes not only the representation but also the policy, which first processes the latent image into a 50-dimensional vector. To unlock computation and memory benefits of replaying only the latent images for the RL agent, we attempted to apply data augmentation to the latent image, but we found the smallest possible random shifts to be too extreme. Instead, we introduce a new augmentation, subpixel random shift, which linearly interpolates among neighboring pixels. As shown in Figure 13 in the appendix, this augmentation restores performance when applied to the latent images, allowing a pre-trained encoder to be entirely bypassed during policy training updates.
Temporal Contrast on Sequences
In Breakout alone, we discovered that composing the UL training batch of trajectory segments, rather than individual transitions, gave a significant benefit. Treating all elements of the training batch independently provides “hard” negatives, since the encoder must distinguish between neighboring time steps. This setting had no effect in the other Atari games tested, and we found equal or better performance using individual transitions in DMControl and DMLab. Figure 10 further shows that using a similarity loss (Grill et al., 2020) does not capture the benefit.
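To make the distinction between the two batch compositions concrete, the sketch below draws a batch as contiguous trajectory segments instead of independent transitions. The function name, the index-triple output format, and the default sizes (32 segments of 16 steps, with the positive `k` steps ahead) are assumptions for illustration.

```python
import random


def sample_segment_batch(num_trajectories, trajectory_len,
                         num_segments=32, segment_len=16, k=1, rng=random):
    """Return (trajectory, anchor_t, positive_t) triples for one UL batch.

    Each segment contributes `segment_len` consecutive anchors, so the
    in-batch negatives include observations from neighboring time steps.
    Requires trajectory_len >= segment_len + k.
    """
    batch = []
    for _ in range(num_segments):
        traj = rng.randrange(num_trajectories)
        # Leave room for the segment plus the positive of its last anchor.
        start = rng.randrange(trajectory_len - segment_len - k + 1)
        for t in range(start, start + segment_len):
            batch.append((traj, t, t + k))
    return batch
```

Setting `segment_len=1` recovers a batch of individual transitions, which makes it straightforward to compare the two settings within one training loop.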
We analyzed the learned encoders in Breakout to further study this ablation effect. Figure 10 shows the attention of four different encoders on the displayed scene. The poorly performing UL encoder heavily utilizes the paddle to distinguish the observation. The UL encoder trained with random shift and sequence data, however, focuses near the ball, as does the fully-trained RL encoder. (The random encoder mostly highlights the bricks, which are less relevant for control.) In an appendix, we include other example encoder analyses from Atari and DMLab which show ATC-trained encoders attending only to key objects on the game screen, while RL-trained encoders additionally attend to potentially distracting features such as game score.
Reward-free representation learning from images provides flexibility and insights for improving deep RL agents. We have shown a broad range of cases where our new unsupervised learning algorithm can fully replace RL for training convolutional encoders while maintaining or improving online performance. In a small number of environments (a few of the Atari games), including the RL loss for encoder training still surpasses our UL-only method, leaving opportunities for further improvements in UL for RL.
Our preliminary efforts to use actions as inputs (into the predictor MLP) or as prediction outputs (inverse loss) with ATC did not immediately yield improvements. We experimented only with random shift, but other augmentations may be useful, as well. In multi-task encoder training, our technique avoids any need for sophisticated reward-balancing (Hessel et al., 2019), but more advanced training methods may still help when the required features are in conflict, as in Atari, or if they otherwise impact our loss function unequally. On the theoretical side, it may be helpful to analyze the effects of domain shift on the policy when a detached representation is learned online.
One obvious application of our offline methodology would be in the batch RL setting, where the agent learns from a fixed data set. Our offline experiments showed that a relatively small number of transitions are sufficient to learn rich representations by UL, and the lower limit could be further explored. Overall, we hope that our algorithm and experiments spur further developments leveraging unsupervised learning for reinforcement learning.
We thank Ankesh Anand and Evan Racah for helpful discussions regarding ST-DIM, MPR, and other related matters in representation learning for RL.
- Anand et al. (2019). Unsupervised state representation learning in Atari. In Advances in Neural Information Processing Systems.
- Beattie et al. (2016). DeepMind Lab. arXiv preprint arXiv:1612.03801.
- Bellemare et al. (2013). The Arcade Learning Environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research 47, pp. 253–279.
- Berner et al. (2019). Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680.
- Chen et al. (2020). A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709.
- Devin et al. (2018). Deep object-centric representations for generalizable robot learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7111–7118.
- Dosovitskiy et al. (2017). CARLA: an open urban driving simulator. arXiv preprint arXiv:1711.03938.
- Finn et al. (2016). Deep spatial autoencoders for visuomotor learning. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 512–519.
- Grill et al. (2020). Bootstrap your own latent: a new approach to self-supervised learning. arXiv preprint arXiv:2006.07733.
- Guo et al. (2020). Bootstrap latent-predictive representations for multitask reinforcement learning. arXiv preprint arXiv:2004.14646.
- Guo et al. (2018). Neural predictive belief representations. arXiv preprint arXiv:1811.06407.
- Gutmann and Hyvärinen (2010). Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In International Conference on Artificial Intelligence and Statistics.
- Ha and Schmidhuber (2018). World models. arXiv preprint arXiv:1803.10122.
- Haarnoja et al. (2018). Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning.
- Hafner et al. (2020). Dream to control: learning behaviors by latent imagination. In International Conference on Learning Representations.
- Hafner et al. (2019). Learning latent dynamics for planning from pixels. In International Conference on Machine Learning.
- He et al. (2019). Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722.
- He et al. (2020). Momentum contrast for unsupervised visual representation learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- Hénaff et al. (2019). Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272.
- Hessel et al. (2018). Rainbow: combining improvements in deep reinforcement learning. In AAAI Conference on Artificial Intelligence.
- Hessel et al. (2019). Multi-task deep reinforcement learning with PopArt. In AAAI Conference on Artificial Intelligence.
- Jaderberg et al. (2019). Human-level performance in 3D multiplayer games with population-based reinforcement learning. Science 364 (6443), pp. 859–865.
- Jaderberg et al. (2017). Reinforcement learning with unsupervised auxiliary tasks. In International Conference on Learning Representations.
- Kalashnikov et al. (2018). QT-Opt: scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293.
- Kingma and Welling (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
- Kostrikov et al. (2020). Image augmentation is all you need: regularizing deep reinforcement learning from pixels. arXiv preprint arXiv:2004.13649.
- Laskin et al. (2020a). Reinforcement learning with augmented data. arXiv preprint arXiv:2004.14990.
- Laskin et al. (2020b). CURL: contrastive unsupervised representations for reinforcement learning. In International Conference on Machine Learning.
- Lee et al. (2019). Stochastic latent actor-critic: deep reinforcement learning with a latent variable model. arXiv preprint arXiv:1907.00953.
- Levine et al. (2016). End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research 17 (1), pp. 1334–1373.
- Mazoure et al. (2020). Deep reinforcement and infomax learning. arXiv preprint arXiv:2006.07217.
- Mnih et al. (2016). Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning.
- Mnih et al. (2015). Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533.
- Schulman et al. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Schwarzer et al. (2020). Data-efficient reinforcement learning with momentum predictive representations. arXiv preprint arXiv:2007.05929.
- Tassa et al. (2018). DeepMind Control Suite. arXiv preprint arXiv:1801.00690.
- van den Oord et al. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
- Wayne et al. (2018). Unsupervised predictive memory in a goal-directed agent. arXiv preprint arXiv:1803.10760.
Appendix A
A.1 Additional Figures
In subpixel random shift, new pixels are a linearly weighted average of the four nearest pixels to a randomly chosen coordinate location. We used uniformly random horizontal and vertical shifts, and tested maximum displacements in
pixels (with “edge” mode padding). We found to work well in all tested domains, restoring the performance of raw image augmentation but eliminating convolutions entirely from the RL training updates.
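The subpixel random shift described above can be sketched as bilinear interpolation at a randomly displaced coordinate, with out-of-range neighbors clamped to the nearest edge pixel. The function names, the list-of-rows image format, and the `max_shift` default are assumptions for this example; a practical version would operate on batched latent tensors.

```python
import math
import random


def subpixel_shift(image, dy, dx):
    """Resample a 2D image at coordinates displaced by fractional (dy, dx).

    Each output pixel is a linearly weighted average of the four pixels
    nearest to the displaced coordinate, using edge-mode padding.
    """
    height, width = len(image), len(image[0])

    def pixel(r, c):
        # Clamp to the valid range, which implements edge padding.
        return image[min(max(r, 0), height - 1)][min(max(c, 0), width - 1)]

    out = []
    for r in range(height):
        y = r + dy
        y0 = math.floor(y)
        wy = y - y0  # Vertical weight of the lower neighbor.
        row = []
        for c in range(width):
            x = c + dx
            x0 = math.floor(x)
            wx = x - x0  # Horizontal weight of the right neighbor.
            top = (1 - wx) * pixel(y0, x0) + wx * pixel(y0, x0 + 1)
            bottom = (1 - wx) * pixel(y0 + 1, x0) + wx * pixel(y0 + 1, x0 + 1)
            row.append((1 - wy) * top + wy * bottom)
        out.append(row)
    return out


def subpixel_random_shift(image, max_shift=0.5, rng=random):
    """Apply uniformly random vertical and horizontal subpixel shifts.

    `max_shift` (in pixels) is an illustrative default.
    """
    dy = rng.uniform(-max_shift, max_shift)
    dx = rng.uniform(-max_shift, max_shift)
    return subpixel_shift(image, dy, dx)
```

Because the operation is a fixed linear resampling of the stored latent image, it can be applied to replayed latents directly, without re-running the convolutional encoder.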
A.2 RL Settings

DMControl (RAD-SAC)
| Hyperparameter | Value |
| --- | --- |
| Observation rendering | RGB |
| Random shift pad | |
| Replay buffer size | |
| Action repeat | (Finger, Walker) |
| Learning rate | |
| Batch size | (Cheetah, Pendulum) |
| Critic target update freq | |
| Convolution filter size | |
| Hidden units (MLP) | |

Atari (PPO)
| Hyperparameter | Value |
| --- | --- |
| Observation rendering | Grey |
| Likelihood ratio clip | 0.1 |
| Convolution filter sizes | |
| Hidden units (MLP) | |
| Generalized Advantage Estimation | |
| Learning rate annealing | linear |
| Entropy bonus coefficient | 0.01 |
| Repeat action probability | |
| Value loss coefficient | |

DMLab (PPO)
| Hyperparameter | Value |
| --- | --- |
| Observation rendering | RGB |
| Likelihood ratio clip | 0.1 |
| Convolution filter sizes | |
| Hidden units (LSTM) | |
| Skip connections | conv 3, 4; LSTM |
| Generalized Advantage Estimation | |
| Learning rate annealing | none |
| Entropy bonus coefficient | (Explore) |
| Value loss coefficient | |
A.3 Online UL Settings

Common
| Hyperparameter | Value |
| --- | --- |
| Random shift pad | |
| Learning rate annealing | cosine |
| Target update interval | |
| Predictor hidden sizes | |

DMControl
| Hyperparameter | Value |
| --- | --- |
| Random shift probability | |
| Batch size | as RL (individual observations) |
| Min agent steps to UL | |
| Min agent steps to RL | |
| UL update schedule | as RL |

Atari
| Hyperparameter | Value |
| --- | --- |
| Random shift probability | |
| Batch size | 512 (32 trajectories of 16 time steps) |
| Min agent steps to UL | |
| Min agent steps to RL | |
| UL update schedule | Annealed quadratically from 6 per sampler iteration ( once at steps for weight initialization) |

DMLab
| Hyperparameter | Value |
| --- | --- |
| Random shift probability | |
| Batch size | 512 (individual observations) |
| Min agent steps to UL | |
| Min agent steps to RL | |
| UL update schedule | 2 per sampler iteration |
A.4 Offline Pre-Training Details
We conducted coarse hyperparameter sweeps to tune each competing UL algorithm. In all cases, the best setting is the one shown in our comparisons.
When our VAEs include a time difference between input and reconstruction observations, we include one hidden layer with action additionally input between the encoder and decoder. We tried both 1.0 and 0.1 KL-divergence weight in the VAE loss, and found 0.1 to perform better in both DMControl and Atari.
For the VAE, we experimented with 0 and 1 time step difference between input and reconstruction target observations and training for either or updates. The best settings were 1-step temporal, and updates, with batch size 128. ATC used 1-step temporal, updates (although this can be significantly decreased), and batch size 256 (including Cheetah). The pretraining data set consisted of the first transitions from a RAD-SAC agent learning each task, including random actions. Within this span, Cartpole and Ball_in_Cup learned completely, but Walker and Cheetah reached average returns of 514 and 630, respectively (collected without the compressive convolution).
For Pixel Control, we used the settings from Hessel et al. (2019) (see the appendix therein), except we used only empirical returns, computed offline (without bootstrapping). For CPC, we tried training batch shapes of (64, 8), (32, 16), and (16, 32), and found the setting with rollouts of length 16 to be best. We contrasted all elements of the batch against each other, rather than only forward contrasts. In all cases we also used 16 steps to warm up the LSTM. For all algorithms we tried learning rates and and both and updates. For ATC and CPC, the lower learning rate and higher number of updates helped in Lasertag especially. The pretraining data was samples from partially trained RL agents receiving average returns of 127 and 6 in Explore_Goal_Locations_Small and Lasertag_Three_Opponents_Small, respectively.
For the VAE, we experimented with 0, 1, and 3 time-step differences between input and reconstruction target, and found 3 to work best. For ST-DIM we experimented with 1, 3, and 4 time-step differences, and batch sizes from 64 to 256, learning rates and . Likewise, a 3-step delay worked best. For the inverse model, we tried 1- and 3-step predictions, with 1-step working better overall, and found random shift augmentation to help. For Pixel Control, we used the settings in Jaderberg et al. (2017), again with full empirical returns. We ran each algorithm for up to updates, although final ATC results used updates. We ran each RL agent with and without observation normalization on the latent image and observed no difference in performance. Pretraining data was samples sourced from the replay buffer of DQN agents trained for steps with epsilon-greedy . Evaluation scores were: