Recent progress in deep reinforcement learning (RL) has achieved impressive results in a range of applications, from playing games (Mnih et al., 2015; Silver et al., 2016) to dialogue systems (Li et al., 2017) and robotics (Levine et al., 2016; Andrychowicz et al., 2018). However, generalizing to unseen environments remains difficult for deep RL algorithms, which can fail catastrophically when encountering new environments (Leike et al., 2017). We consider the setting where an RL agent trains on a limited number of environments and must generalize to unseen environments. The agent will not perform perfectly on the unseen environments. But can it avoid dangers that were already encountered during training? In safety-critical domains there can be catastrophic outcomes which are unacceptable – see (García and Fernández, 2015) for a review on safety in RL. Ideally, our RL agents would avoid dangers consistent with those seen during training, without requiring a hand-crafted safe policy. In this work, we assume that we have access to a simulator, which captures the basic semantics of the world (i.e. dangers, goals and dynamics). In the simulator the agent can experience dangers and learn from them (Paul et al., 2018). We evaluate agents on how well they can transfer knowledge: can they generalize to unseen environments with the same basic semantics? At deployment, the agent has a single episode to solve an unseen environment, and any dangerous behaviour is considered an unacceptable catastrophe.
Motivated by the standard regularization methods for tackling overfitting in deep neural networks, Farebrother et al. (2018) and Cobbe et al. (2018) experiment with L2-regularisation, dropout (Srivastava et al., 2014) and batch normalization (Ioffe and Szegedy, 2015) with Deep Q-Networks (Mnih et al., 2015), showing improved generalization performance. Zhang et al. (2018) investigate the ability of A3C (Mnih et al., 2016) to generalize rather than memorize in a set of gridworlds similar to our environments. They show that perfect generalization is possible when a sufficient number of environments is provided (10000 environments), but they do not focus on the regime of a limited number of training environments, nor evaluate performance in terms of safety. Similarly, the focus of Cobbe et al. (2018) is on a large number of training environments. At the other extreme, Leike et al. (2017) introduce a ‘Distribution Shift’ gridworld setup, where they train on a single environment and deploy on another. In a different direction, Saunders et al. (2018)
approached danger avoidance by using supervised learning to train a blocker (i.e. a classifier) with a human-in-the-loop to maintain safety during training, which restricts its scalability. A collision prediction model was also considered in the model-based setting in Kahn et al. (2017). In Lipton et al. (2016), catastrophes are avoided by training an intrinsic fear model to predict whether a catastrophe will occur, and using this to perform reward shaping. From a modeling perspective, an ensemble of models often performs better than a single model (Dietterich, 2000). Ensembles can also be used for predictive uncertainty estimation of deep neural networks (Lakshminarayanan et al., 2017). In our work we make use of this uncertainty estimation. Finally, our approach can also be related to meta-learning (Schmidhuber, 1987; Thrun and Pratt, 2012; Hochreiter et al., 2001; Bengio et al., 1992), which is concerned with learning strategies that adapt quickly using prior experience. In the RL context, approaches include gradient-based (Finn et al., 2017) and recurrent-style (Wang et al., 2016; Duan et al., 2016) models, trained on multiple environments. Our setting corresponds to the zero-shot meta-RL setting, in which we train on multiple training environments but do not adapt based on test environment reward signals.
We first investigate safety and generalization in a class of gridworlds. We find that standard DQN fails to avoid catastrophes at test time, even with 1000 training environments. We compare standard DQN to modified versions that incorporate dropout, Q-network ensembling, and a classifier to recognize dangerous actions. These modifications reduce catastrophes significantly, including in the regime of very few training environments. We next look at safety and generalization in the more challenging CoinRun environment. In this case simple model averaging does not significantly reduce catastrophes compared to a PPO baseline. However, important uncertainty information is still captured in the ensemble of the PPO agents' value functions. We study whether the agent can predict ahead of time that a catastrophe will occur, given the information in this ensemble of value functions, and find that the uncertainty in the value functions is helpful for predicting a catastrophe. This is useful as it can be used to improve safety by requesting an intervention from a human.
We consider an agent interacting with an environment in the standard RL framework (Sutton and Barto, 2018). At each step, the agent selects an action based on its current state, and the environment provides a reward and the next state. Our task setup is the same as in (Zhang et al., 2018): there is a train/test split for environments that is analogous to the train/test split for data points in supervised learning. In our experiments all environments have the same reward and transition functions, and differ only in the initial state. Hence we can equivalently describe our setup in terms of a distribution on initial states for a single MDP. Formally, we denote our task by $(M, p_0)$, where $M = (\mathcal{S}, \mathcal{A}, P, R)$ is a Markov Decision Process (MDP), with state space $\mathcal{S}$, action space $\mathcal{A}$, transition probability $P$ and immediate reward function $R$. Additionally, $p_0$ is a probability distribution on the initial state. We use the undiscounted episodic setting, where each episode randomly samples an initial state from $p_0$ and ends in a finite number of timesteps, $T$. There are disjoint training and test sets which contain i.i.d. samples from $p_0$. During training the agent encounters initial states only from the training set and makes learning updates based on the observed rewards. Test performance is calculated on the test set, and no learning takes place at test time.
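This train/test split over initial states can be sketched as follows (a minimal sketch: the sampler interface and the example distribution over gridworld start configurations are hypothetical illustrations, not the paper's implementation):

```python
import random

def make_initial_state_splits(sample_initial_state, n_train, n_test, seed=0):
    """Draw i.i.d. initial states from p0 and split them into disjoint
    train/test sets, mirroring a supervised train/test split.

    `sample_initial_state` is a hypothetical sampler for p0, taking a
    random.Random instance and returning a hashable state."""
    rng = random.Random(seed)
    states = set()
    # Keep sampling until we have enough *distinct* states, so the
    # train and test sets are guaranteed disjoint.
    while len(states) < n_train + n_test:
        states.add(sample_initial_state(rng))
    states = sorted(states)  # deterministic order before shuffling
    rng.shuffle(states)
    return states[:n_train], states[n_train:]

# Example p0: agent, lava and goal placed on distinct cells of a grid.
def sample_gridworld_start(rng):
    return tuple(rng.sample(range(25), 3))  # 3 distinct cells out of 25

train_envs, test_envs = make_initial_state_splits(sample_gridworld_start, 100, 50)
```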
3 Gridworld Experiments
3.1 Experimental Setup
Our environment setup is a distribution of gridworld environments, each a fixed-size grid containing an agent (blue), a single lava cell (red) and a single goal cell (green). The agent receives a sparse positive reward for reaching the goal and a negative reward for reaching the lava. The episode terminates whenever the goal or lava is reached, or when fifty timesteps have elapsed (giving zero reward), whichever occurs first. We consider two environment settings, which we call Full and Reveal. In Full, the agent sees the full map (an example trajectory is shown in Supplementary Material, Fig. 7), whereas in Reveal, Fig. 1, the agent starts off seeing only part of the map and reveals it as it moves around, with a local view. Reveal is a more challenging setting because it requires the agent to move around to uncover the position of the goal. The agent receives the observation as an array of RGB pixel values flattened across the channel dimension. We treat moving onto the lava as a catastrophe. Our evaluation metrics are the percentage of environments that are solved (the agent reaches the goal before the timeout) and the percentage that end in catastrophe (the agent reaches the lava). On test environments we consider a timeout to be an acceptable failure, whereas a catastrophe is unacceptable.
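The episode logic above can be sketched as follows. The reward magnitudes (+1 for the goal, -1 for the lava) are assumed for illustration, and the `env` interface is hypothetical; the fifty-step timeout and the termination rules follow the text:

```python
# Assumed reward magnitudes for illustration; the timeout follows the text.
GOAL_REWARD, LAVA_REWARD, TIMEOUT = 1.0, -1.0, 50

def run_episode(policy, env):
    """Roll out one episode; returns (outcome, reward), where outcome is
    'solved', 'catastrophe' or 'timeout'. `env` is a hypothetical object
    with .state, .goal, .lava attributes and a .move(action) method."""
    for _ in range(TIMEOUT):
        env.move(policy(env.state))
        if env.state == env.goal:
            return "solved", GOAL_REWARD       # acceptable success
        if env.state == env.lava:
            return "catastrophe", LAVA_REWARD  # unacceptable failure
    return "timeout", 0.0                      # acceptable failure
```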
Deep Q-Networks (DQN). DQN (Mnih et al., 2015) learns an action-value function $Q(s, a; \theta)$, where $\theta$ is a parameter vector. DQN is optimized by minimizing, at each iteration $i$, the loss $L_i(\theta_i) = \mathbb{E}_{(s,a,r,s')}\big[(y_i - Q(s, a; \theta_i))^2\big]$, where $y_i = r + \gamma \max_{a'} Q(s', a'; \theta_i^{-})$. The $\theta_i^{-}$ are parameters of a target network that is kept frozen for a number of iterations whilst updating the online network parameters $\theta_i$. The optimization is performed off-policy, randomly sampling from an experience replay buffer. During training, actions are chosen using the $\epsilon$-greedy exploration strategy, selecting a random action with probability $\epsilon$ and otherwise taking the greedy action (which has maximum Q-value). At test time, the agent acts greedily.
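The loss and the exploration rule can be sketched on a batch of transitions (a minimal NumPy sketch; the array shapes and the `dones` mask are assumptions of this illustration, with gamma = 1 matching the undiscounted episodic setting):

```python
import random
import numpy as np

def dqn_loss(q_online, q_target_next, actions, rewards, dones, gamma=1.0):
    """Mean squared TD error over a batch of transitions.

    q_online:      (B, A) Q(s, a; theta) from the online network
    q_target_next: (B, A) Q(s', a'; theta^-) from the frozen target network
    dones:         (B,)   1.0 where the episode ended, else 0.0
    """
    y = rewards + gamma * (1.0 - dones) * q_target_next.max(axis=1)
    q_sa = q_online[np.arange(len(actions)), actions]
    return float(np.mean((y - q_sa) ** 2))

def epsilon_greedy(q_values, epsilon, rng):
    """Random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return int(np.argmax(q_values))
```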
Ensembles of models (i.e. model averaging) are usually used for estimating model (i.e. epistemic) uncertainty. In particular, instead of a single model $f$, a set of models $f_1, \dots, f_K$ is fitted. Then either the average or, in classification tasks, the mode (i.e. majority vote) is used for prediction. When neural networks are used as models, diversity between the ensemble members is obtained by initializing them differently and training them independently. For model averaging on DQN, we average the $Q$-values.
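For DQN this gives two natural action-selection rules: acting greedily on the averaged Q-values (Ens-DQN style) or taking a majority vote over the members' greedy actions (Maj-DQN style). A minimal sketch:

```python
import numpy as np

def ensemble_q_action(q_values_per_model):
    """Average the ensemble members' Q-values and act greedily on the
    mean (Ens-DQN style). Input shape: (n_models, n_actions)."""
    return int(np.argmax(np.mean(q_values_per_model, axis=0)))

def majority_vote_action(q_values_per_model):
    """Each member votes for its own greedy action; ties are broken by
    the lowest action index (Maj-DQN style)."""
    votes = np.argmax(q_values_per_model, axis=1)
    return int(np.bincount(votes).argmax())
```

Note the two rules can disagree: one member with an extreme Q-value can dominate the mean while being outvoted in the majority vote.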
Another approach to avoiding dangers is to learn a classifier for whether a state-action pair will be catastrophic and use this to block certain actions — see (Saunders et al., 2018) for an example trained with a human-in-the-loop. During training we store all state-action pairs, together with a binary label of whether a catastrophe occurred. Then, after training the DQN agent, we separately train the classifier to predict the probability that a state-action pair will result in a catastrophe. Training is done in a supervised manner by minimizing the binary cross-entropy loss. The classifier is used as a ‘blocker’ at deployment time. At test time we run our selected action through the classifier and block the action if the classifier predicts it is catastrophic with confidence greater than some threshold. We then move on to the action with the next highest Q-value and run that through the classifier. The process repeats until an acceptable action is found; if every action is blocked, the episode is terminated. Note that the blocker will only block dangerous actions that occur just before the danger is about to be experienced, but won’t help for those actions which irreversibly cause a catastrophe to occur many steps later (Saunders et al., 2018).
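The deployment-time loop can be sketched as follows (a minimal sketch; the `blocker_prob` interface is a hypothetical stand-in — a real blocker would take the full state-action pair as input):

```python
def act_with_blocker(q_values, blocker_prob, threshold=0.5):
    """Try actions in decreasing Q-value order, skipping any action the
    blocker flags as catastrophic with confidence above `threshold`.

    `blocker_prob(a)` is the classifier's catastrophe probability for
    action a in the current state (hypothetical interface). Returns the
    chosen action, or None if every action is blocked (episode ends)."""
    for a in sorted(range(len(q_values)), key=lambda a: -q_values[a]):
        if blocker_prob(a) <= threshold:
            return a  # first acceptable action in Q-value order
    return None       # all actions blocked: terminate the episode
```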
A summary of the methods used can be found in Tab. 1. The DQN, a multi-layer perceptron, was trained with RMSProp (Tieleman and Hinton, 2012) with learning rate 1e–4, a replay buffer with 10K capacity, and a target network updated every 1K episodes. An $\epsilon$-greedy policy was used, with $\epsilon$ decaying exponentially at rate 0.999 to an end value of 0.05. The blocker is also a 3-layer multi-layer perceptron, with hidden layer sizes [128, 256, 256], trained for 10K iterations with batch size 64 using the Adam optimizer (Kingma and Ba, 2014) with learning rate 5e–3.
| Method | Description |
| --- | --- |
| DQN | Same as (Mnih et al., 2015) |
| Drop-DQN | DQN with linear layers regularized by dropout |
| Block-DQN | Catastrophe classifier used along with DQN |
| Ens-DQN | Ensemble of 9 independently trained and differently initialized DQNs |
| Maj-DQN | Majority vote of 9 independently trained and differently initialized DQNs |
| Block&Ens-DQN | Combination of Block-DQN and Ens-DQN |
3.3 Results and Discussion
To make figures easier to read, this section includes only four methods: DQN, Ens-DQN, Block-DQN and Block&Ens-DQN. In Fig. 2 we present results on the Reveal gridworld. We plot the percentage of environments that ended in catastrophe in Fig. 2(a), and the percentage of solved environments in Fig. 2(b), as a function of the number of environments available during training. We trained all models to convergence on the training environments. See Fig. 8 and Fig. 9 in the supplementary material for results of all methods on the Full and Reveal settings, and for evaluations on the training environments. Fig. 2(b) shows that our agents never achieve perfect performance on the test environments. Moreover, when an agent fails to reach the goal, it does not always fail gracefully (e.g. by simply timing out) but instead often ends in catastrophe (visiting the lava). Most of the methods we investigated outperformed the DQN baseline in terms of the percentage of test catastrophes. Each method offers a different trade-off between test performance on catastrophes and solved environments. For example, Block-DQN offers better catastrophe performance than DQN, but its performance on solving environments is worse given more than 100 training environments. This is possibly because the blocker is over-cautious, with too high a false-positive rate for catastrophes, which prematurely stops some environments from being solved. Note that in a real-world setting, avoiding catastrophes (Fig. 2(a)) will be much more important than doing well on most environments (Fig. 2(b)).
In Fig. 3 we showcase an example state from our experiments highlighting the role of the ensemble and the blocker in avoiding the catastrophe.
4 CoinRun Experiments
4.1 Experimental Setup
Following our experiments on gridworlds, we next consider the more challenging CoinRun environment (Cobbe et al., 2018), a procedurally generated game in which the agent is spawned on the left and aims to reach the coin on the right whilst avoiding obstacles; see Fig. 4 for some screenshots. The agent receives a reward of +5 for reaching the coin, and the episode terminates with -5 reward either after 1000 timesteps or on collision with an obstacle. We simplified the environment from (Cobbe et al., 2018) to remove all crates and obstacles except for the lava, and to have only six actions (no-op, jump, jump-right, jump-left, right, left). This simplification allowed us to train our agents in 10 million timesteps, rather than 256 million. In our setup we consider falling in the lava to be a catastrophe, whereas a timeout is an acceptable failure. The observations given to the agent are the RGB 64x64 pixel values, flattened along the channel dimension.
Proximal Policy Optimization (PPO) (Schulman et al., 2017) has been shown to perform fairly well in the original CoinRun environment. We train five PPO agents independently, with different random initializations, on each of 10, 25, 50 and 200 training levels. We use model averaging with a majority vote (the mode of the actions sampled from the five agents), denoted Maj-PPO, and sampling from the ensemble mean, denoted Ens-mean (where the mean distribution is formed by taking the mean over the logits of the individual PPO policy categorical distributions). We also trained a single agent with dropout for each of the 10, 25, 50 and 200 training levels. For full algorithm settings see Sec. A.1 in the supplementary material.
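The two ensemble action-selection rules can be sketched as follows (a minimal sketch; array shapes are assumptions of this illustration):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ens_mean_policy(logits_per_agent):
    """Ens-mean: average the agents' policy logits, then form the
    categorical distribution to sample from.
    logits_per_agent: (n_agents, n_actions)."""
    return softmax(np.mean(logits_per_agent, axis=0))

def maj_ppo_action(sampled_actions):
    """Maj-PPO: the mode of the actions sampled from the individual
    agents; ties are broken by the lowest action index."""
    return int(np.bincount(sampled_actions).argmax())
```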
We plot the percentage of test levels ending in catastrophe and the percentage solved against the number of training environments in Fig. 5. The two methods using an ensemble, Maj-PPO and Ens-mean, give similar performance to the baseline, with a slight improvement for the ensembles in the 10-training-environment setting. The other methods, using dropout as a regularizer and MC dropout (Gal and Ghahramani, 2016) for ensembling, did not match baseline performance; see Fig. 10 of the supplementary material, which also contains performance on the training set. We emphasise that performing perfectly on a small number of training environments is not sufficient to get good test performance, both for % solved and, more importantly, for % catastrophes.
Figure 5: Percentage of test levels ending in catastrophe, and percentage solved, against the number of training environments for a range of methods. Five random seeds are used for each algorithm. For the PPO baseline the dots mark the five seeds’ performance, and the line and shading are the mean and one-standard-deviation intervals respectively. The other methods use all five seeds jointly, so no intervals appear for them. The ensemble algorithms don’t do significantly better than a single PPO agent (on average), both in terms of % catastrophes and % solved. The complete version is provided in Fig. 10 of the appendix, and includes both train and test performance as well as dropout experiments.
Predicting Catastrophes in CoinRun
In gridworld, catastrophes are local: they occur exactly one step after the dangerous action is taken. In CoinRun catastrophes are non-local: an agent takes a jump action and falls in the lava a few steps later (with no way to avoid the lava once in mid-air). We suspect this explains why it’s harder to reduce catastrophes in CoinRun than in gridworld. Rather than modifying the agent’s actions, we instead now consider a setup where the agent should call for help if it thinks it has taken a dangerous action that will lead to a catastrophe. This is, for example, the intervention setup used in an autonomous driving application by Michelmore et al. (2018). The agent requests an intervention based on a discrimination function $f(\mu, \sigma)$ which combines the mean, $\mu$, and standard deviation, $\sigma$, of the ensemble of the five agents’ value functions, similar to UCB (Auer, 2003). We consider a binary classification task with a catastrophe occurring within $k$ timesteps as the ‘positive’ class, and no catastrophe occurring within $k$ timesteps as the ‘negative’ class. A true positive is the agent predicting a catastrophe which then occurs, whereas a false positive is predicting a catastrophe which does not occur. We imagine a human intervention occurring on a positive prediction, and so would like to reduce the number of false positives (which might waste the human’s time, or be suboptimal) while maintaining a high true positive rate. An ROC curve captures the diagnostic ability of a binary classifier as its discrimination threshold is varied – the threshold is compared to the discrimination function $f(\mu, \sigma)$. In an ROC curve, the higher the sensitivity (true positive rate) and the lower the 1-specificity (false positive rate), the better. The AUC score is a summary statistic of the ROC curve; the higher, the better. In Fig. 6
we plot ROC curves for the Ens-mean agent, together with an agent that has random value functions and takes random actions. The different curves show different action selection methods (random or Ens-mean action selection), together with different discrimination function hyperparameters. Shown are the mean and one-standard-deviation confidence intervals based on ten bootstrap samples from the data collected from one rollout on 1000 test environments. We see from Fig. 6(a) that, for 10 training levels and a prediction time window of one step, the uncertainty information from using the standard deviation with the ensemble-mean action selection method gives superior prediction performance compared to not using the standard deviation (and is also better than random). However, it helps less as the time window increases, Fig. 6(b), or as the number of training levels increases, Figs. 6(c) and 6(d). Note that it is much easier to predict a catastrophe for a smaller time window (left column) than a longer one (right column). This supports our hypothesis that it is the time-extended nature of the CoinRun danger which is particularly challenging to generalize about. See Fig. 11 and Fig. 12 in the supplementary material for a wider range of ROC plots.
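The prediction task can be sketched as follows. The exact discrimination function is not reproduced here; the UCB-style form below (flagging states whose lower confidence bound $\mu - \beta\sigma$ is low) is an assumed illustration, and the AUC is computed via the standard rank formulation:

```python
import numpy as np

def catastrophe_score(values, beta=1.0):
    """UCB-style discrimination score from an ensemble of value
    estimates for the current state (shape: (n_agents,)). Assumed form:
    a high score (low lower-confidence-bound mu - beta*sigma) predicts
    a catastrophe."""
    mu, sigma = float(np.mean(values)), float(np.std(values))
    return -(mu - beta * sigma)

def catastrophe_labels(catastrophe_step, horizon, k):
    """Binary labels for the classification task: timestep t is positive
    iff a catastrophe occurs within the next k timesteps."""
    return [catastrophe_step is not None and t < catastrophe_step <= t + k
            for t in range(horizon)]

def auc_score(scores, labels):
    """AUC of the ROC curve via the rank (Mann-Whitney) formulation: the
    probability that a random positive scores above a random negative,
    counting ties as 1/2."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Sweeping a threshold over the scores traces out the ROC curve; `auc_score` summarizes it in a single number.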
In this paper we investigated how safety performance generalizes when an agent is deployed on unseen test environments drawn from the same distribution as the environments seen during training, where no further learning is allowed. We focused on the realistic case in which there is a limited number of training environments. We found RL algorithms can fail dangerously on the test environments, even when performing perfectly during training. We investigated some simple ways to improve safety generalization performance. We also investigated whether a future catastrophe can be predicted in the challenging CoinRun environment, finding that uncertainty information in an ensemble of agents is helpful when only a small number of training environments are available.
- Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.
- Andrychowicz et al. (2018) Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177, 2018.
- Auer (2003) Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. J. Mach. Learn. Res., 3:397–422, March 2003. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=944919.944941.
- Bengio et al. (1992) Samy Bengio, Yoshua Bengio, Jocelyn Cloutier, and Jan Gecsei. On the optimization of a synaptic learning rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks, pages 6–8. Univ. of Texas, 1992.
- Cobbe et al. (2018) Karl Cobbe, Oleg Klimov, Chris Hesse, Taehoon Kim, and John Schulman. Quantifying generalization in reinforcement learning. arXiv preprint arXiv:1812.02341, 2018.
- Dietterich (2000) Thomas G Dietterich. Ensemble methods in machine learning. In International workshop on multiple classifier systems, pages 1–15. Springer, 2000.
- Duan et al. (2016) Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL²: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
- Espeholt et al. (2018) Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures, 2018.
- Farebrother et al. (2018) Jesse Farebrother, Marlos C Machado, and Michael Bowling. Generalization and regularization in DQN. arXiv preprint arXiv:1810.00123, 2018.
- Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1126–1135. JMLR. org, 2017.
- Gal and Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
- García and Fernández (2015) Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. doi: 10.1109/cvpr.2016.90. URL http://dx.doi.org/10.1109/CVPR.2016.90.
- Hochreiter et al. (2001) Sepp Hochreiter, A Steven Younger, and Peter R Conwell. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, pages 87–94. Springer, 2001.
- Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
- Kahn et al. (2017) Gregory Kahn, Adam Villaflor, Vitchyr Pong, Pieter Abbeel, and Sergey Levine. Uncertainty-aware reinforcement learning for collision avoidance. arXiv preprint arXiv:1702.01182, 2017.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Lakshminarayanan et al. (2017) Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413, 2017.
- Leike et al. (2017) Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, and Shane Legg. AI safety gridworlds. arXiv preprint arXiv:1711.09883, 2017.
- Levine et al. (2016) Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
- Li et al. (2017) Xiujun Li, Yun-Nung Chen, Lihong Li, Jianfeng Gao, and Asli Celikyilmaz. End-to-end task-completion neural dialogue systems. arXiv preprint arXiv:1703.01008, 2017.
- Lipton et al. (2016) Zachary C Lipton, Jianfeng Gao, Lihong Li, Jianshu Chen, and Li Deng. Combating reinforcement learning’s sisyphean curse with intrinsic fear. arXiv preprint arXiv:1611.01211, 2016.
- Michelmore et al. (2018) Rhiannon Michelmore, Marta Kwiatkowska, and Yarin Gal. Evaluating uncertainty quantification in end-to-end autonomous driving control, 2018.
- Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- Mnih et al. (2016) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.
- Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
- Paul et al. (2018) Supratik Paul, Michael A Osborne, and Shimon Whiteson. Fingerprint policy optimisation for robust reinforcement learning. arXiv preprint arXiv:1805.10662, 2018.
- Saunders et al. (2018) William Saunders, Girish Sastry, Andreas Stuhlmueller, and Owain Evans. Trial without error: Towards safe reinforcement learning via human intervention. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 2067–2069. International Foundation for Autonomous Agents and Multiagent Systems, 2018.
- Schmidhuber (1987) Jürgen Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-… hook. PhD thesis, Technische Universität München, 1987.
- Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017.
- Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
- Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
- Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
- Thrun and Pratt (2012) Sebastian Thrun and Lorien Pratt. Learning to learn. Springer Science & Business Media, 2012.
- Tieleman and Hinton (2012) Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
- Wang et al. (2016) Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
- Watkins and Dayan (1992) Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.
- Zhang et al. (2018) Chiyuan Zhang, Oriol Vinyals, Remi Munos, and Samy Bengio. A study on overfitting in deep reinforcement learning. arXiv preprint arXiv:1804.06893, 2018.
Appendix A Supplementary Material
See Fig.7 for some frames from the Full gridworld setting.
Shown in Fig. 8 are the results for all the methods on the Full setting. See Fig. 9 for results on the Reveal setting. Shown are also the performance on the training environments (solid lines). We see similar results to the main paper, and note that as expected the Full setting has better generalization performance for % solved than Reveal, but the catastrophe performance is similar in each.
See Fig. 10 for complete results on CoinRun, including training performance and dropout as both a regularizer (turned off at test) or for MC dropout (dropout on at test time).
A.1 Algorithm Settings
We trained our agents for 10 million timesteps each, using a linearly decaying learning rate and the Adam optimizer [Kingma and Ba, 2014]. We used 256 PPO steps, 8 minibatches, 3 PPO epochs, an entropy coefficient of 0.01, and a decay rate of 0.999. We used the same IMPALA-CNN style architecture as Cobbe et al. [2018] (itself taken from Espeholt et al. [2018]), except we modify it to be smaller, using a convolutional layer with 5 filters and max pooling with pool size 3, stride 2 and same padding, followed by two residual blocks, each containing [relu, conv, relu, conv], in which the block input is added to the output in residual style [He et al., 2016]. This is followed by a relu and a fully connected layer to 256 hidden units, followed by another fully connected layer with 6 heads, one for each action logit in a categorical distribution. For the dropout agent we decayed the dropout probability linearly from 0.1 to zero over training. Training was performed on 32-CPU machines, using TensorFlow [Abadi et al., 2015] for PPO and PyTorch [Paszke et al., 2017] for DQN.
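Under the assumption that the convolution also uses 'same' padding (the text specifies it only for the pooling layer), the spatial shapes of this downsized network can be checked with a short calculation:

```python
def same_pool_out(size, stride=2):
    """Output size of a pooling layer with 'same' padding: ceil(size/stride)."""
    return -(-size // stride)

# Shape walk-through of the downsized IMPALA-style network described above.
h = w = 64                 # 64x64 RGB input
c = 5                      # conv with 5 filters keeps 64x64 ('same' padding assumed)
h, w = same_pool_out(h), same_pool_out(w)  # max-pool, stride 2 -> 32x32
# The two residual blocks preserve the spatial shape.
flat = h * w * c           # flattened features feeding the 256-unit fc layer
```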