Generalizing from a few environments in safety-critical reinforcement learning

07/02/2019 · by Zachary Kenton, et al. · University of Oxford

Before deploying autonomous agents in the real world, we need to be confident they will perform safely in novel situations. Ideally, we would expose agents to a very wide range of situations during training, allowing them to learn about every possible danger, but this is often impractical. This paper investigates safety and generalization from a limited number of training environments in deep reinforcement learning (RL). We find RL algorithms can fail dangerously on unseen test environments even when performing perfectly on training environments. Firstly, in a gridworld setting, we show that catastrophes can be significantly reduced with simple modifications, including ensemble model averaging and the use of a blocking classifier. In the more challenging CoinRun environment we find similar methods do not significantly reduce catastrophes. However, we do find that the uncertainty information from the ensemble is useful for predicting whether a catastrophe will occur within a few steps and hence whether human intervention should be requested.




1 Introduction

Problem Setting.

Recent progress in deep reinforcement learning (RL) has achieved impressive results in a range of applications, from playing games (Mnih et al., 2015; Silver et al., 2016) to dialogue systems (Li et al., 2017) and robotics (Levine et al., 2016; Andrychowicz et al., 2018). However, generalizing to unseen environments remains difficult for deep RL algorithms, which can fail catastrophically when encountering new environments (Leike et al., 2017). We consider the setting where an RL agent trains on a limited number of environments and must generalize to unseen ones. The agent will not perform perfectly on the unseen environments, but can it avoid dangers of kinds it already encountered during training? In safety-critical domains there can be catastrophic outcomes that are unacceptable; see (García and Fernández, 2015) for a review of safety in RL. We would ideally like our RL agents to avoid dangers consistent with those seen during training, without requiring a hand-crafted safe policy. In this work, we assume access to a simulator that captures the basic semantics of the world (i.e. dangers, goals and dynamics), in which the agent can experience dangers and learn from them (Paul et al., 2018). We evaluate agents on how well they transfer knowledge: can they generalize to unseen environments with the same basic semantics? At deployment, the agent has a single episode to solve an unseen environment, and any dangerous behaviour is considered an unacceptable catastrophe.

Related Work.

Motivated by the standard regularization methods for tackling overfitting in deep neural networks, Farebrother et al. (2018) and Cobbe et al. (2018) experiment with L2-regularisation, dropout (Srivastava et al., 2014) and batch normalization (Ioffe and Szegedy, 2015) in Deep Q-Networks (Mnih et al., 2015), showing improved generalization performance. Zhang et al. (2018) investigate the ability of A3C (Mnih et al., 2016) to generalize rather than memorize in a set of gridworlds similar to our environments. They show that perfect generalization is possible when a sufficient number of environments is provided (10,000 environments), but they do not focus on the regime of a limited number of training environments, nor evaluate performance in terms of safety. Similarly, the focus of Cobbe et al. (2018) is on a large number of training environments. At the other extreme, Leike et al. (2017) introduce a 'Distribution Shift' gridworld setup, in which they train on a single environment and deploy on another. In a different direction, Saunders et al. (2018) approach danger avoidance by using supervised learning to train a blocker (i.e. a classifier) with a human-in-the-loop to maintain safety during training, which restricts its scalability. A collision prediction model was also considered in the model-based setting by Kahn et al. (2017). In Lipton et al. (2016), catastrophes are avoided by training an intrinsic fear model to predict whether a catastrophe will occur, and using this for reward shaping. From a modeling perspective, an ensemble of models often performs better than a single model (Dietterich, 2000), and ensembles can also be used for predictive uncertainty estimation in deep neural networks (Lakshminarayanan et al., 2017); in our work we make use of this uncertainty estimation. Finally, our approach is also related to meta-learning (Schmidhuber, 1987; Thrun and Pratt, 2012; Hochreiter et al., 2001; Bengio et al., 1992), which is concerned with learning strategies that are fast to adapt using prior experience. In the RL context, approaches include gradient-based (Finn et al., 2017) and recurrent-style (Wang et al., 2016; Duan et al., 2016) models trained on multiple environments. Our setting corresponds to the zero-shot meta-RL setting: we train on multiple training environments but do not adapt based on test environment reward signals.


We first investigate safety and generalization in a class of gridworlds. We find that standard DQN fails to avoid catastrophes at test time, even with 1000 training environments. We compare standard DQN to modified versions that incorporate dropout, Q-network ensembling, and a classifier that recognizes dangerous actions. These modifications reduce catastrophes significantly, including in the regime of very few training environments. We next look at safety and generalization in the more challenging CoinRun environment. Here we find that simple model averaging does not significantly reduce catastrophes compared to a PPO baseline. However, the ensemble of the PPO agents' value functions still captures important uncertainty information. We study whether a catastrophe can be predicted ahead of time from this ensemble of value functions, and find that the uncertainty in these value functions is indeed helpful for predicting a catastrophe. This is useful because it can improve safety by triggering a request for human intervention.

2 Background

Task Setup.

We consider an agent interacting with an environment in the standard RL framework (Sutton and Barto, 2018). At each step, the agent selects an action based on its current state, and the environment provides a reward and the next state. Our task setup is the same as in (Zhang et al., 2018): there is a train/test split for environments that is analogous to the train/test split for data points in supervised learning. In our experiments all environments have the same reward and transition functions, and differ only in the initial state. Hence we can equivalently describe our setup in terms of a distribution on initial states for a single MDP. Formally, we denote our task by (M, ρ), where M = (S, A, T, R) is a Markov Decision Process (MDP), with state space S, action space A, transition probability T(s' | s, a) and immediate reward function R(s, a). Additionally, ρ is a probability distribution on the initial state s_0. We use the undiscounted episodic setting: each episode samples an initial state s_0 ~ ρ at random and ends within a finite number of timesteps. There are disjoint training and test sets, each containing i.i.d. samples from ρ. During training the agent encounters initial states only from the training set and makes learning updates based on the observed rewards. Test performance is calculated on the test set, and no learning takes place at test time.
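The train/test split over initial states can be sketched as follows. This is a minimal illustration, not the authors' code: integer seeds stand in for sampled initial states, and the function name is hypothetical. Note that sampling without replacement (rather than strictly i.i.d.) is used here simply to guarantee the two sets are disjoint.

```python
import random

def make_env_splits(num_train, num_test, seed=0):
    """Draw disjoint train and test sets of initial states from the
    initial-state distribution rho (integer seeds stand in for layouts).
    Sampling without replacement keeps the sets disjoint."""
    rng = random.Random(seed)
    draws = rng.sample(range(10**6), num_train + num_test)
    return draws[:num_train], draws[num_train:]

train_envs, test_envs = make_env_splits(num_train=100, num_test=1000)
```

During training, episodes are seeded only from `train_envs`; evaluation runs one episode per element of `test_envs` with learning disabled.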

3 Gridworld Experiments

3.1 Experimental Setup

Our environment setup is a distribution of gridworld environments, each of which contains an agent (blue), a single lava cell (red) and a single goal cell (green). The agent receives sparse rewards: a positive reward for reaching the goal and a negative reward for reaching the lava. The episode terminates when the goal or lava is reached, or when fifty timesteps have elapsed (giving zero reward), whichever occurs first. We consider two environment settings, which we call Full and Reveal. In Full, the agent sees the full map (an example trajectory is shown in Supplementary Material, Fig. 7), whereas in Reveal, Fig. 1, the agent starts off seeing only part of the map, and reveals the map through a local view around its position as it moves. Reveal is a more challenging setting because it requires the agent to move around to uncover the position of the goal. The agent receives the observation as an array of RGB pixel values flattened across the channel dimension. We treat moving onto the lava as a catastrophe. Our evaluation metrics are the percentage of environments that are solved (the agent reaches the goal before the timeout) and the percentage that end in catastrophe (the agent reaches the lava). On test environments we consider a timeout to be an acceptable failure, whereas a catastrophe is unacceptable.
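The two evaluation metrics can be computed as below. This is a sketch with hypothetical names, assuming each episode ends in exactly one of three outcomes:

```python
def evaluate(outcomes):
    """Convert per-episode outcomes ('solved', 'catastrophe' or 'timeout')
    into the two metrics used here: % solved and % ending in catastrophe."""
    n = len(outcomes)
    pct_solved = 100.0 * sum(o == "solved" for o in outcomes) / n
    pct_catastrophe = 100.0 * sum(o == "catastrophe" for o in outcomes) / n
    return pct_solved, pct_catastrophe
```

Timeouts count against % solved but not against % catastrophe, reflecting that a timeout is an acceptable failure.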

Figure 1: Example trajectory from a Reveal environment. Agent: blue. Goal: green. Lava: red. Walls: grey. Mask: black.

3.2 Methods

Deep Q-Networks (DQN).

Deep Q-networks (Mnih et al., 2015) perform Q-learning (Watkins and Dayan, 1992) using a deep neural network as a function approximator to estimate the optimal value function, Q(s, a; θ) ≈ Q*(s, a), where θ is a parameter vector. DQN is optimized by minimizing L_i(θ_i) = E[(y_i − Q(s, a; θ_i))²] at each iteration i, where y_i = r + γ max_{a'} Q(s', a'; θ_i⁻). The θ_i⁻ are parameters of a target network that is kept frozen for a number of iterations whilst updating the online network parameters θ_i. The optimization is performed off-policy, randomly sampling from an experience replay buffer. During training, actions are chosen using the ε-greedy exploration strategy: a random action is selected with probability ε, and otherwise the greedy action (which has maximum Q-value) is taken. At test time, the agent acts greedily.
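The bootstrapped target and the ε-greedy rule can be sketched as plain functions; this is a simplified illustration of the standard formulas (function names are ours), operating on Q-values as plain lists rather than network outputs:

```python
import random

def dqn_target(reward, next_q_values, done, gamma=1.0):
    """Bootstrapped target y = r + gamma * max_a' Q(s', a'; theta_minus);
    terminal transitions use the reward alone."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

At test time the agent acts greedily, i.e. `epsilon_greedy(q_values, epsilon=0.0)`.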

Model Averaging.

Ensembles of models (i.e. model averaging) are commonly used for estimating model (i.e. epistemic) uncertainty. Instead of a single model f_θ, a set of K models {f_{θ_1}, …, f_{θ_K}} is fitted, and either the average or, in classification tasks, the mode (i.e. majority vote) is used for prediction. When the models are neural networks, diversity between them is obtained by initializing them differently and training them independently. For model averaging on DQN, we average the Q-values.
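The two ensemble action-selection rules used later (Ens-DQN and Maj-DQN, Tab. 1) can be sketched as follows; a minimal illustration with hypothetical names, taking each member's Q-values as a plain list:

```python
from collections import Counter

def ens_dqn_action(q_sets):
    """Ens-DQN: average Q-values across ensemble members, then act greedily."""
    n_actions = len(q_sets[0])
    avg_q = [sum(qs[a] for qs in q_sets) / len(q_sets)
             for a in range(n_actions)]
    return max(range(n_actions), key=lambda a: avg_q[a])

def maj_dqn_action(q_sets):
    """Maj-DQN: each member votes for its own greedy action; take the mode."""
    votes = [max(range(len(qs)), key=lambda a: qs[a]) for qs in q_sets]
    return Counter(votes).most_common(1)[0][0]
```

The two rules can disagree: averaging weighs the magnitude of each member's Q-values, while majority vote only counts each member's argmax.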

Catastrophe Classifier.

Another approach to avoiding dangers is to learn a classifier for whether a state-action pair will be catastrophic and use this to block certain actions — see (Saunders et al., 2018) for an example trained with a human-in-the-loop. During training we store all state-action pairs, together with a binary label of whether a catastrophe occurred. Then after training the DQN agent, we separately train the classifier to predict the probability that a state-action pair will result in a catastrophe. Training is done in a supervised manner by minimizing the binary cross entropy loss. The classifier is used as a ‘blocker’ at deployment time. At test time we run our selected action through the classifier and block the action if the classifier predicts it is catastrophic with confidence greater than some threshold. We then move on to the next highest value action and run that through the classifier. The process repeats until an acceptable action is found, otherwise the episode is terminated. Note that the blocker will only block dangerous actions that occur just before the danger is about to be experienced, but won’t help for those actions which irreversibly cause a catastrophe to occur many steps later (Saunders et al., 2018).
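The deployment-time blocking loop described above can be sketched as follows. This is our illustration, not the authors' implementation: `catastrophe_prob` stands in for the trained classifier (here a function of the action alone rather than the full state-action pair):

```python
def blocked_action(q_values, catastrophe_prob, threshold=0.5):
    """Try actions in descending Q-value order; return the first whose
    predicted catastrophe probability is below the threshold. If every
    action is blocked, return None (terminate the episode)."""
    for a in sorted(range(len(q_values)), key=lambda a: -q_values[a]):
        if catastrophe_prob(a) < threshold:
            return a
    return None
```

Lowering the threshold makes the blocker more cautious, trading solved environments for fewer catastrophes.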

Algorithm Settings.

A summary of the methods used can be found in Tab. 1. All DQN models are 3-layer multi-layer perceptrons trained for 1M training episodes using: hidden layer sizes [256, 256, 512], batch size 32, RMSProp (Tieleman and Hinton, 2012) with learning rate 1e-4, a replay buffer with 10K capacity, and a target network updated every 1K episodes. An ε-greedy policy was used with an exponential decay rate of 0.999 and end value 0.05. The blocker is also a 3-layer multi-layer perceptron, with hidden layer sizes [128, 256, 256], trained for 10k iterations using batch size 64 and the Adam optimizer (Kingma and Ba, 2014) with learning rate 5e-3.

Method Description
DQN Same as (Mnih et al., 2015)
Drop-DQN Regularized linear layers with dropout probability
Block-DQN Catastrophe classifier used along with DQN
Ens-DQN Ensemble of 9 independently trained and differently initialized DQNs
Maj-DQN Majority vote of 9 independently trained and differently initialized DQNs
Block&Ens-DQN Combination of Block-DQN and Ens-DQN
Table 1: Description of methods used in our Gridworld experiments.

3.3 Results and Discussion.

To make figures easier to read, this section includes only four methods: DQN, Ens-DQN, Block-DQN and Block&Ens-DQN. In Fig. 2 we present results on the Reveal gridworld. We plot the percentage of environments that ended in catastrophe in Fig. 2(a), and the percentage of solved environments in Fig. 2(b), as a function of the number of environments available during training. We trained all models to convergence on the training environments. See Fig. 8 and Fig. 9 in the supplementary material for results of all methods on the Full and Reveal settings, as well as evaluations on the training environments. Fig. 2(b) shows that our agents never achieve perfect performance on the test environments. Moreover, when an agent fails to reach the goal, it does not always fail gracefully (e.g. by simply timing out) but instead often ends in catastrophe (visiting the lava). Most of the methods we investigated outperformed the DQN baseline in terms of the percentage of test catastrophes. Each method offers a different trade-off between test performance on catastrophes and on solved environments. For example, Block-DQN offers better catastrophe performance than DQN, but its performance on solving environments is worse given more than 100 training environments. This is possibly because the blocker is over-cautious, with too high a false-positive rate for catastrophes, which prematurely stops some environments from being solved. Note that in a real-world setting, avoiding catastrophes (Fig. 2(a)) will be much more important than doing well on most environments (Fig. 2(b)).

(a) Percentage of catastrophic outcomes in unseen environments (lower is better), as a function of number of training environments.
(b) Percentage of solved unseen environments (higher is better), as a function of training environments.
Figure 2: Results on the Reveal setting, evaluated on unseen test environments for a range of methods. Nine random seeds are used for each algorithm and the mean performance is shown. Figure (a) shows that the modified algorithms outperform the baseline DQN in terms of danger avoidance; the effect on return performance is shown in (b). The complete version, including both train and test performance, is provided in Figure 9 of the appendix.

In Fig. 3 we showcase an example state from our experiments highlighting the role of the ensemble and the blocker in avoiding the catastrophe.

Figure 3: Example transition by Block&Ens-DQN in an unseen environment, in the Full setting. (a) The environment state. (b) The output of the trained catastrophe classifier (i.e. blocker) conditioned on the environment state, with a threshold of 0.5. (c) The nine estimates of the state-action value function Q(s, a) from the differently initialized and independently trained DQNs; the background colour highlights the action with maximum value. The agent should not take the catastrophic action of going left, which both the blocker and the ensemble (i.e. model average) of DQNs avoid. However, if the top-middle agent in (c) were acting alone, it would choose to go left, leading to a catastrophic outcome.

4 CoinRun Experiments

4.1 Experimental Setup

Following our experiments on gridworlds, we next consider the more challenging CoinRun environment (Cobbe et al., 2018), a procedurally generated game in which the agent is spawned on the left and must reach the coin on the right whilst avoiding obstacles; see Fig. 4 for example screenshots. The agent receives a reward of +5 for reaching the coin, and the episode terminates with a -5 reward either after 1000 timesteps or on collision with an obstacle. We simplified the environment of (Cobbe et al., 2018) by removing all crates and obstacles except the lava, and by reducing the action space to six actions (no-op, jump, jump-right, jump-left, right, left). This simplification allowed us to train our agents in 10 million timesteps rather than 256 million. In our setup, falling in the lava is a catastrophe, whereas a timeout is an acceptable failure. The observation given to the agent is the 64x64 RGB pixel values, flattened along the channel dimension.

Figure 4: Two sample environments from our modified CoinRun setting.

4.2 Methods

Proximal Policy Optimization (PPO)

In these experiments we use the Proximal Policy Optimization (PPO) algorithm (Schulman et al., 2017), as it was shown by Cobbe et al. (2018) to perform fairly well in the original CoinRun environment. We train five PPO agents independently, with different random initializations, on each of 10, 25, 50 and 200 training levels. We use two forms of model averaging: majority vote (the mode of the actions sampled from the five agents), denoted Maj-PPO, and sampling from the ensemble mean, denoted Ens-mean, where the mean distribution is formed by averaging the logits of the individual PPO policies' categorical distributions. We also trained a single agent with dropout for each of the 10, 25, 50 and 200 training levels. For full algorithm settings see Sec. A.1 in the supplementary material.
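The two PPO ensembling rules can be sketched as follows; a simplified illustration with hypothetical names, taking each member's action logits as a plain list:

```python
import math
from collections import Counter

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ens_mean_policy(logit_sets):
    """Ens-mean: average the members' logits, then form a single
    categorical distribution to sample the action from."""
    n_actions = len(logit_sets[0])
    mean_logits = [sum(ls[a] for ls in logit_sets) / len(logit_sets)
                   for a in range(n_actions)]
    return softmax(mean_logits)

def maj_ppo_action(sampled_actions):
    """Maj-PPO: the mode of the actions sampled from the member policies."""
    return Counter(sampled_actions).most_common(1)[0][0]
```

Averaging logits (rather than probabilities) keeps Ens-mean equivalent to a geometric mean of the member distributions, up to normalization.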

4.3 Results

Generalization Performance

We plot the percentage of test levels ending in catastrophe and the percentage solved against the number of training environments in Fig. 5. The two ensemble methods, Maj-PPO and Ens-mean, give similar performance to the baseline, with a slight improvement in the 10-training-environment setting. The other methods, using dropout as a regularizer and MC dropout (Gal and Ghahramani, 2016) for ensembling, did not match baseline performance; see Fig. 10 of the supplementary material, which also contains performance on the training set. We emphasise that performing perfectly on a small number of training environments is not sufficient for good test performance, both for % solved and, more importantly, for % catastrophes.

(a) Percentage of catastrophic outcomes in unseen environments (lower is better), as a function of number of training environments.
(b) Percentage of solved unseen environments (higher is better), as a function of training environments.
Figure 5: Results in the CoinRun setting, evaluated on unseen test environments for a range of methods. Five random seeds are used for each algorithm. For the PPO baseline the dots mark the five seeds' performance, and the line and shading are the mean and one-standard-deviation intervals respectively. The other methods use all five seeds, so no intervals appear for them. The ensemble algorithms do not do significantly better than a single PPO agent (on average), in terms of both % catastrophes and % solved. The complete version is provided in Fig. 10 of the appendix, and includes both train and test performance as well as the dropout experiments.

Predicting Catastrophes in CoinRun

In gridworld, catastrophes are local: they occur exactly one step after the dangerous action is taken. In CoinRun catastrophes are non-local: an agent takes a jump action and falls in the lava a few steps later (with no way to avoid the lava once in mid-air). We suspect this explains why it is harder to reduce catastrophes in CoinRun than in gridworld. Rather than modifying the agent's actions, we instead consider a setup in which the agent should call for help if it thinks it has taken a dangerous action that will lead to a catastrophe. This is, for example, the intervention setup used in an autonomous driving application by Michelmore et al. (2018). The agent requests an intervention based on a discrimination function f which combines the mean, μ, and standard deviation, σ, of the ensemble of the five agents' value functions, similar to UCB (Auer, 2003). We consider a binary classification task with a catastrophe occurring within the next k timesteps as the 'positive' class, and no catastrophe occurring within k timesteps as the 'negative' class. A true positive is the agent predicting a catastrophe that then occurs, whereas a false positive is predicting a catastrophe that does not occur. We imagine a human intervention occurring on a positive prediction, and so would like to reduce the number of false positives (which waste the human's time, or are suboptimal) while maintaining a high true positive rate. An ROC curve captures the diagnostic ability of a binary classifier as its discrimination threshold is varied; here the threshold is compared to the discrimination function f. In an ROC curve, the higher the sensitivity (true positive rate) and the lower the 1-specificity (false positive rate), the better. The AUC score is a summary statistic of the ROC curve: the higher, the better. In Fig. 6 we plot ROC curves for the Ens-mean agent, together with an agent that has random value functions and takes random actions. The different curves show different action selection methods (random or Ens-mean action selection), together with different discrimination function hyperparameters. Shown are means and one-standard-deviation confidence intervals based on ten bootstrap samples from the data collected from one rollout on each of 1000 test environments. We see from Fig. 6(a) that for 10 training levels and a prediction time window of one step, using the standard deviation of the ensemble with the ensemble-mean action selection method gives superior prediction performance compared to not using the standard deviation (it is also better than random). However, it helps less as the time window increases, Fig. 6(b), or as the number of training levels increases, Figs. 6(c) and 6(d). Note that it is much easier to predict a catastrophe for a smaller time window (left column) than a longer one (right column). This supports our hypothesis that it is the time-extended nature of the CoinRun danger which is particularly challenging to generalize about. See Fig. 11 and Fig. 12 in the supplementary material for a wider range of ROC plots.
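A UCB-style discrimination function can be sketched as below. The exact combination is not spelled out here, so treat f = μ + k·σ (with tunable, possibly negative, weight k) and the below-threshold trigger as our reading of the setup; all names are hypothetical:

```python
import statistics

def discrimination_score(values, k):
    """f = mu + k * sigma over the ensemble's value estimates. The sign
    and magnitude of k are hyperparameters of the discrimination function."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population std. dev. of the ensemble
    return mu + k * sigma

def predict_catastrophe(values, k, threshold):
    """Flag a catastrophe (request human intervention) when the score falls
    below the threshold: low predicted value and, for k < 0, high
    ensemble disagreement both push toward an intervention."""
    return discrimination_score(values, k) < threshold
```

Sweeping the threshold over a rollout's scores, and comparing predictions against whether a catastrophe actually occurred within k timesteps, traces out the ROC curves in Fig. 6.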

(a) 10 training levels,
(b) 10 training levels,
(c) 25 training levels,
(d) 25 training levels,
Figure 6: ROC curves for a binary classifier based on the discrimination function f, a combination of the mean and standard deviation of the agents' value functions, classifying whether a catastrophe occurs in the next k steps. All plots use data from one episode on each of 1000 different test levels. Towards the top left is better; higher AUC is better. Left column: time window of one step. Right column: longer time window. Top row: 10 training levels. Bottom row: 25 training levels.

5 Conclusion

In this paper we investigated how safety performance generalizes to unseen test environments drawn from the same distribution as the environments seen during training, with no further learning allowed. We focused on the realistic case in which there is a limited number of training environments. We found that RL algorithms can fail dangerously on test environments even when performing perfectly during training. We investigated some simple ways to improve safety generalization performance. We also investigated whether a future catastrophe can be predicted in the challenging CoinRun environment, finding that uncertainty information from an ensemble of agents is helpful when only a small number of training environments is available.


Appendix A Supplementary Material

Full Setting

See Fig. 7 for some frames from the Full gridworld setting.

Figure 7: Example trajectory from a Full environment. Agent: blue. Goal: green. Lava: red. Walls: grey.

Further Results.

Fig. 8 shows the results for all methods on the Full setting; see Fig. 9 for results on the Reveal setting. Also shown is performance on the training environments (solid lines). We see results similar to the main paper, and note that, as expected, the Full setting has better generalization performance for % solved than Reveal, while catastrophe performance is similar in each.

(a) Percentage of catastrophic outcomes (lower is better), as a function of number of training environments.
(b) Percentage of solved environments (higher is better), as a function of number of training environments.
Figure 8: Complete quantitative experimental results on the Full setting, trained to convergence. Nine seeds are used for training the agents and the mean performances are visualized.
(a) Percentage of catastrophic outcomes (lower is better), as a function of number of training environments.
(b) Percentage of solved environments (higher is better), as a function of number of training environments.
Figure 9: Complete quantitative experimental results on the Reveal setting, trained to convergence. Nine seeds are used for training the agents and the mean performances are visualized.

See Fig. 10 for complete results on CoinRun, including training performance and dropout as both a regularizer (turned off at test) or for MC dropout (dropout on at test time).

(a) Percentage of catastrophic outcomes (lower is better), as a function of number of training environments.
(b) Percentage of solved environments (higher is better), as a function of number of training environments.
Figure 10: Complete quantitative experimental results on the CoinRun setting.

See Fig. 11 and Fig. 12 for ROC curves on test and train environments respectively, for 10, 25, 50 and 200 training levels.

(a) 10 levs,
(b) 10 levs,
(c) 10 levs,
(d) 10 levs,
(e) 25 levs,
(f) 25 levs,
(g) 25 levs,
(h) 25 levs,
(i) 50 levs,
(j) 50 levs,
(k) 50 levs,
(l) 50 levs,
(m) 200 levs,
(n) 200 levs,
(o) 200 levs,
(p) 200 levs,
Figure 11: ROC curves for a binary classifier based on the discrimination function f, a combination of the mean and standard deviation of the agents' value functions, classifying whether a catastrophe occurs in the next k steps. All plots use data from one episode on each of 1000 different test levels. Towards the top left is better; higher AUC is better. Columns: increasing time window k. Rows: 10, 25, 50 and 200 training levels.
(a) 10 levs,
(b) 10 levs,
(c) 10 levs,
(d) 10 levs,
(e) 25 levs,
(f) 25 levs,
(g) 25 levs,
(h) 25 levs,
(i) 50 levs,
(j) 50 levs,
(k) 50 levs,
(l) 50 levs,
(m) 200 levs,
(n) 200 levs,
(o) 200 levs,
(p) 200 levs,
Figure 12: ROC curves evaluated on training levels. Columns: increasing time window k. Rows: 10, 25, 50 and 200 training levels.

A.1 Algorithm Settings

PPO settings

We trained each agent for 10 million timesteps, using a linearly decaying learning rate and the Adam optimizer [Kingma and Ba, 2014]. We used 256 PPO steps, 8 minibatches, 3 PPO epochs, an entropy coefficient of 0.01, and a decay rate of 0.999. We used the same IMPALA-CNN-style architecture as Cobbe et al. [2018] (itself taken from Espeholt et al. [2018]), except modified to be smaller: a convolutional layer with 5 filters; max pooling with pool size 3, stride 2 and same padding; followed by two residual blocks, each containing [relu, conv, relu, conv], whose input is added to the output in residual style [He et al., 2016]. This is followed by a relu and a fully connected layer with 256 hidden units, then another fully connected layer with 6 heads, one for each action logit of a categorical distribution. For the dropout agent we decayed the dropout probability linearly from 0.1 to zero over training. Training was performed on 32-CPU machines, using TensorFlow [Abadi et al., 2015] for PPO and PyTorch [Paszke et al., 2017] for DQN.
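The linear decay of the dropout probability (from 0.1 to zero over training) can be written as a small schedule function; the name and exact parameterization are ours:

```python
def dropout_prob(step, total_steps, p0=0.1):
    """Linearly decay the dropout probability from p0 at step 0 to zero
    at the end of training; clamped outside [0, total_steps]."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p0 * (1.0 - frac)
```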