
Investigating Generalisation in Continuous Deep Reinforcement Learning

by   Chenyang Zhao, et al.

Deep Reinforcement Learning has shown great success in a variety of control tasks. However, it is unclear how close we are to the vision of putting Deep RL into practice to solve real world problems. In particular, common practice in the field is to train policies on largely deterministic simulators and to evaluate algorithms through training performance alone, without a train/test distinction to ensure models generalise and are not overfitted. Moreover, it is not standard practice to check for generalisation under domain shift, although robustness to such system change between training and testing would be necessary for real-world Deep RL control, for example, in robotics. In this paper we study these issues by first characterising the sources of uncertainty that provide generalisation challenges in Deep RL. We then provide a new benchmark and thorough empirical evaluation of generalisation challenges for state of the art Deep RL methods. In particular, we show that, if generalisation is the goal, then common practice of evaluating algorithms based on their training performance leads to the wrong conclusions about algorithm choice. Finally, we evaluate several techniques for improving generalisation and draw conclusions about the most robust techniques to date.



1 Introduction

Deep Reinforcement Learning (Deep RL) has achieved great success in solving many complex problems ranging from discrete control tasks like Go (Silver et al., 2017) and Atari games (Mnih et al., 2015), to continuous robot control tasks (Lillicrap et al., 2016). As intelligent systems, we would like our Deep RL agents to succeed in various environments, including ones unseen during training. However, as examples of high capacity machine learning models, Deep RL agents are at risk of overfitting: learning policies overly specific to their training environment and failing to generalise to new conditions. Overfitting risk, regularisation and generalisation to novel samples are well studied in supervised learning, where evaluating generalisation through train/test splits is ubiquitous. However, due to the relatively greater difficulty of obtaining a good solution to the RL training problem in the first place, evaluating generalisation to novel conditions through such train/test splits is not common practice in Deep RL. Correspondingly, mainstream Deep RL algorithm research focuses on optimising the training condition well, rather than developing models that generalise well to novel conditions. Nevertheless, now that Deep RL training is increasingly successful, it is timely to move focus onto models' generalisation properties. Achieving generalisation is crucial if Deep RL is to move out of the lab and solve real-world problems where noise and uncertainty are intrinsic, and novel conditions will certainly be encountered (Sunderhauf et al., 2018).

In this paper, along with several concurrent works (Zhang et al., 2018a; Packer et al., 2018; Zhang et al., 2018b), we advocate for a renewed focus on generalisation in Deep RL research, rather than on training performance alone. Aside from first principles interest in the ability of our agents to succeed in diverse and novel environments, there is particular demand for generalisation in the context of the reality gap in robotics. Despite continuing improvements in algorithmic sample efficiency, Deep RL requires a large number of environmental interaction samples for learning complex tasks without prior knowledge. For this reason, the majority of Deep RL training is done in simulation, which is moreover usually deterministic. Transfer from such simulated training environments to potential real-world deployment is known as crossing the reality gap in robotics, and is well known to be difficult (Koos et al., 2013), thus providing an important motivation for studying generalisation.

As we will see, there are several generalisation challenges that can arise for RL control agents. The first is generalisation from a deterministic training environment to a noisy and uncertain testing environment, for example in the form of real sensor and actuator noise. Secondly, assuming we correctly model environmental variability in our training simulation, there is the question of whether an agent learns to generalise to future conditions drawn from the same distribution, or overfits to its specific training experiences (Zhang et al., 2018b). Finally, there is the subtle but important point that no matter the effort applied to modelling environmental conditions and variability in simulated training, it is generally impossible to predict and accurately model the environmental conditions and variability an agent might encounter in the real world (Koos et al., 2013). Therefore an important way to think about model generalisation is not only robustness to overfitting per se, but generalisation under some level of domain shift. In supervised learning, domain shift refers to changes in the data distribution which we would like a predictive model to be robust to, for example the type of camera in visual object recognition (Csurka, 2017). The corresponding notion in RL is that we would like our policy's success to be invariant to nuisance changes in the environment (Cully et al., 2015). These could span both noise, for example sensor, actuator, and environmental noise; and variability, for example camera type, initial state of an agent, or mass of an object being manipulated.

In this paper, we study generalisation in Deep RL for continuous control, with a particular focus on robustness to domain shift between training and testing. We first provide a thorough characterisation of the diverse sources of uncertainty and variability that provide generalisation challenges. Secondly, we provide a thorough evaluation of several state of the art Deep RL methods on several OpenAI Gym benchmarks (Brockman et al., 2016) in terms of their generalisation properties, particularly across domain shift, with regards to these different sources of variability. In doing so we attempt to answer several questions including:

  • How do state of the art algorithms generalise under different sources of uncertainty and domain shift?

  • Does standard practice of picking algorithms and architectures based on training return lead to selecting models with good generalisation?

  • Can robustness be improved via simple modifications to existing methods?

Our analysis shows that existing RL methods are generally vulnerable to overfitting, showing poor generalisation to testing. This is particularly so in cases of domain shift, for example transferring from a deterministic to stochastic simulation; or where system parameters such as robot mass are varied between training and deployment. Correspondingly, the standard practice of picking algorithms and architectures based on training performance leads to the wrong choice in terms of generalisation performance. Therefore, if generalisation is of interest, then benchmarks such as ours that test generalisation are recommended instead. Finally, we thoroughly evaluate several existing techniques that might improve generalisation, and report the most robust combination as a starting point for future work.

2 Related Work

Standardised and Reproducible Evaluations  Deep RL research has benefitted tremendously from efforts on standardised environment models and benchmarks (Brockman et al., 2016; Tassa et al., 2018). Building on this, a variety of Deep RL algorithms for continuous control were implemented and compared based on training return in Duan et al. (2016). However, these results have high variance, leading to concerns about reproducibility of conclusions and dependence on specific choice of training seeds (Henderson et al., 2018). This in turn has led to statistical power analysis to determine the sufficient number of random seeds to allow reliable comparison among algorithms (Colas et al., 2018).

Generalisation and Overfitting  Recent concurrent work to ours has noted the overfitting risk in this standard practice of evaluation by training return. Zhang et al. (2018b) studied overfitting of Deep RL in discrete maze tasks. Testing environments are generated with the same maze configuration but different initial positions as training. Deep RL algorithms were shown to suffer from overfitting to training configurations and memorise training scenarios. Similarly Zhang et al. (2018a) proposed to formalise overfitting in continuous control by splitting random seeds for training and testing environments, and then diagnosing overfitting through the generalisation error – the difference between average return in training and testing environments. Cobbe et al. (2018) studied regularisation for improving generalisation across procedurally generated arcade environments in discrete control. All these studies showed improved generalisation when trained with more random seeds.

Domain Shift in RL  Overfitting models fail to generalise from training to testing data although they are drawn from the same underlying distribution. Domain shift challenges a model trained in one domain to perform in a target domain with different statistics, which typically leads to a significant drop in performance. This challenge is unavoidable if we wish to apply Deep RL-trained models in the real world, as the reality gap (Koos et al., 2013) of modelling errors and the unpredictability of the unconstrained real world mean that training in simulation will always mismatch reality. Recent concurrent work to ours has also studied some limited facets of domain shift, including adding noise to observations or initial states during testing compared to the training simulation (Zhang et al., 2018a). Meanwhile Packer et al. (2018) studied performance under train-test domain shift by modifying environmental parameters such as robot mass and length to generate new domains. In discrete control, generalisation under adversarially designed noise has also been studied (Huang et al., 2017).

Improving Generalisation  Various techniques can potentially improve generalisation of RL-trained policies in the face of overfitting and domain-shift. These include training under adversarially designed noise (Mandlekar et al., 2017) and specially designed network architectures (Srouji et al., 2018). While entropy-regularisation (Haarnoja et al., 2017) is topical in RL for improving training performance, we also investigate its potential for improving RL generalisation as it does in supervised learning (Chaudhar et al., 2017). Meanwhile domain randomised training has been used to improve generalisation to new domains (Tobin et al., 2017; Stulp et al., 2011; Colas et al., 2018).

Contributions  We present a more thorough characterisation of the generalisation challenges in overfitting and domain shift in continuous control compared to concurrent studies (Zhang et al., 2018b; Packer et al., 2018; Zhang et al., 2018a), which each only touch on a subset of the factors involved (Table 1). To quantify these issues empirically, we contribute a comprehensive benchmark for measuring RL generalisation performance, both within and across-domain, and evaluate several current algorithms and modifications.

3 Characterising Generalisation Challenges

We start by giving a thorough characterisation of the generalisation challenges that can arise for RL agents.

RL formalisation

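  In reinforcement learning, an agent learns to maximise its expected cumulative reward through interacting with an environment. The problem setting of RL can be described by a Markov Decision Process (MDP) $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma)$, with state space $\mathcal{S}$, action space $\mathcal{A}$, environment (transition probability) model $\mathcal{P}$, reward function $r$ and discount factor $\gamma$. In addition, $\rho_0$ describes the probability distribution of the initial state $s_0$. The objective of RL is to find the optimal policy $\pi^*$ to act in an MDP such that,

$$\pi^* = \arg\max_{\pi}\, \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \right], \qquad R(\tau) = \sum_{t} \gamma^{t}\, r(s_t, a_t),$$

where trajectory $\tau$ is a sequence of state and action pairs $(s_0, a_0, s_1, a_1, \dots)$, $R(\tau)$ is the accumulated return of trajectory $\tau$, and $\tau \sim \pi$ indicates that the trajectory is sampled with $s_0 \sim \rho_0$, $a_t \sim \pi(\cdot \mid s_t)$, $s_{t+1} \sim \mathcal{P}(\cdot \mid s_t, a_t)$.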
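The objective above is usually estimated by Monte Carlo rollouts. The following is a minimal sketch of that estimate, assuming a hypothetical `sample_rewards` callable that runs one episode and returns its list of per-step rewards; neither function name comes from the paper.

```python
def discounted_return(rewards, gamma):
    """Accumulated (discounted) return of one trajectory, given its per-step rewards."""
    total = 0.0
    # Work backwards so each step applies the discount exactly once.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def expected_return(sample_rewards, gamma, n_rollouts):
    """Monte Carlo estimate of the expected return over n_rollouts sampled trajectories."""
    returns = [discounted_return(sample_rewards(), gamma) for _ in range(n_rollouts)]
    return sum(returns) / len(returns)
```

With rewards `[1, 1, 1]` and `gamma = 0.5`, the discounted return is 1 + 0.5 + 0.25 = 1.75.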

3.1 Sources of Uncertainty and Variability

Figure 1: Block diagram of a classic control system. $h$ represents the observation function, $g$ represents the actuation function, $f$ represents the transition function, and $\epsilon$ represents the noise that enters the system due to different types of uncertainties.

Consider a classic robot control system as illustrated in Figure 1. The agent provides the controller and the environmental transition model can be broken into an actuation module, a sensor module and a dynamical module. In real world applications, each of the three modules may contain uncertainties due to noise. Moreover, each may exhibit contextual variability which provides a potential source of bias between distinct encounters with the environment, including between training and testing. Such variability creates domain shift in RL. To unpack the distinction between uncertainty and disturbances in rollouts (i.e. noise), and contextual variations that can induce systematic shifts between domains $d$, we specify the environmental MDP corresponding to Figure 1 in more detail as Eqs. (3) and (4). Note that though the uncertainties are described as Gaussian distributions for simplicity, they can be any distribution in practice.
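One way to write out the model described in this section is sketched below. The symbol names ($h$, $g$, $f$ for the observation, actuation and transition functions; $\theta_h^d$, $\theta_g^d$, $\theta_f^d$ for their parameters in domain $d$; $u_t$ for the actuated command; $\mu_0$ for the mean initial state) and the choice of additive noise are assumptions made for illustration, so this should be read as a plausible form rather than the exact original notation.

```latex
\begin{aligned}
o_t     &= h(s_t;\, \theta_h^d) + \epsilon_{o,t},      & \epsilon_{o,t} &\sim \mathcal{N}(0, \sigma_o^2), \\
u_t     &= g(a_t;\, \theta_g^d) + \epsilon_{a,t},      & \epsilon_{a,t} &\sim \mathcal{N}(0, \sigma_a^2), \\
s_{t+1} &= f(s_t, u_t;\, \theta_f^d) + \epsilon_{s,t}, & \epsilon_{s,t} &\sim \mathcal{N}(0, \sigma_s^2), \\
s_0     &\sim \mathcal{N}(\mu_0, \sigma_{s_0}^2).
\end{aligned}
```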


In this model of the MDP, noise is introduced at three time-scales. The observations $o_t$, commands $a_t$ and next states $s_{t+1}$ are perturbed at each time step, as indicated by the subscript $t$. They are perturbed by Gaussian noise with variances $\sigma_o^2$, $\sigma_a^2$ and $\sigma_s^2$, respectively. The initial state $s_0$ is sampled once per episode from a Gaussian with variance $\sigma_{s_0}^2$, before the episode starts. The combined function parameters $\theta^d = (\theta_h^d, \theta_g^d, \theta_f^d)$ and the corresponding variances define the MDP. Switching between these parameters implies a domain switch. As discussed in Section 3.2, these can also be sampled from a distribution before learning. These parameters then stay fixed during learning, which is why they are indexed with domain $d$.
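The time scales described above can be made concrete with a toy rollout. This is a minimal sketch under stated assumptions: the one-dimensional linear dynamics, the `gain` parameter standing in for a per-domain dynamics parameter, and all argument names are hypothetical and not taken from the paper's benchmark.

```python
import random


def noisy_rollout(policy, steps, sigma_o=0.0, sigma_a=0.0, sigma_s=0.0,
                  sigma_s0=0.0, gain=1.0, seed=0):
    """Roll out a toy 1-D system with per-episode and per-step Gaussian noise."""
    rng = random.Random(seed)
    # Per episode: the initial state is drawn once, before the episode starts.
    state = rng.gauss(0.0, sigma_s0)
    states = [state]
    for _ in range(steps):
        # Per step: observation noise is added to the plant output.
        obs = state + rng.gauss(0.0, sigma_o)
        # Per step: action noise is added to the policy output.
        command = policy(obs) + rng.gauss(0.0, sigma_a)
        # Per step: process noise is added to the (hypothetical) linear dynamics.
        # `gain` is fixed for the whole rollout, like a per-domain parameter.
        state = state + gain * command + rng.gauss(0.0, sigma_s)
        states.append(state)
    return states
```

Setting every scale to zero recovers the deterministic simulation, while changing `gain` between training and testing corresponds to a domain shift.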

Examples  To illustrate these terms by way of example: A change in the mass of an object to be manipulated (or of the robot itself) or in the friction constant of the environment would correspond to a change in the transition function parameter $\theta_f$. Stochasticity in the outcomes of the transition model due to environmental noise, such as changing wind conditions, is determined by $\sigma_s$. Changes in the observation function $h$ via its parameters $\theta_h$ correspond to events such as a change of camera when doing vision-driven control. Meanwhile proprioceptive noise is generated with variance $\sigma_o^2$. Changes in the actuation function $g$'s parameter $\theta_g$ could correspond to wear in a motor or increased joint friction reducing the obtained forces. Noise generated internally by a motor while actuating actions is sampled with variance $\sigma_a^2$. In general, we would like our agents to be robust to as many of these variations and as much of this noise as possible.

Study Train Setting Test Setting
Basic Gym
, ,
Table 1: Comparison of evaluation settings in: Basic Gym (Brockman et al., 2016), Overfitting (Zhang et al., 2018b), Dissection (Zhang et al., 2018a), Generalize (Packer et al., 2018). Training samples are generated with random training seeds; testing samples are generated with random testing seeds.

3.2 Generalisation Across MDP Distributions

With this formalisation in mind, we can understand the goal of generalisation as robustness to a potential distribution of both environments and samples from those environments. That is, insensitivity to both environmental parameters $\theta$ and noise samples $\epsilon$. This is in contrast to commonly used deterministic simulations ($\sigma = 0$), without environmental variability ($\theta$ constant).

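Formalising Generalisation  Given a fixed set of environmental parameters $\theta$, the corresponding transition model and initial state distribution are denoted $\mathcal{P}_\theta$ and $\rho_{0,\theta}$. We denote $J(\pi; \theta)$ as the expected return given a set of environmental parameters $\theta$. If the environmental parameters are varying across trials, we denote $J(\pi; p)$ as the expected return under the distribution of environment parameters $p(\theta)$:

$$J(\pi; \theta) = \mathbb{E}_{\tau \sim \pi,\, \mathcal{P}_\theta,\, \rho_{0,\theta}}\left[ R(\tau) \right], \tag{5}$$

$$J(\pi; p) = \mathbb{E}_{\theta \sim p(\theta)}\left[ J(\pi; \theta) \right]. \tag{6}$$

We would like our agents to solve a distribution over (non-deterministic, $\sigma > 0$) environments in Eq. (6), rather than the conventional RL criterion in Eq. (2). That is, for a given trial, we would expect to sample an environment once, $\theta \sim p(\theta)$, and then at each time-step sample noise $\epsilon_t \sim \mathcal{N}(0, \sigma^2)$. We want agents to perform well over both this long-timescale variability, and short-timescale uncertainty.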
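The two-level sampling described here (one environment per trial, fresh noise at every step) can be sketched as a Monte Carlo estimator. The callables `sample_theta` and `rollout_return`, and the toy return function, are hypothetical placeholders introduced for illustration only.

```python
import random


def expected_return_over_domains(sample_theta, rollout_return, n_trials, seed=0):
    """Monte Carlo estimate of the expected return over a distribution of environments."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        theta = sample_theta(rng)            # long time scale: one environment per trial
        total += rollout_return(theta, rng)  # short time scale: noise drawn inside the rollout
    return total / n_trials


def toy_rollout_return(theta, rng, steps=10, sigma=0.1):
    """Hypothetical return: per-step reward is highest at theta = 1, plus per-step noise."""
    return sum(1.0 - abs(theta - 1.0) + rng.gauss(0.0, sigma) for _ in range(steps))
```

For example, `expected_return_over_domains(lambda rng: rng.gauss(1.0, 0.2), toy_rollout_return, 100)` averages over 100 sampled environments, each with its own noisy rollout.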

Furthermore, by evaluating on training return, standard RL practice implicitly assumes that the simulated training domain models the testing domain perfectly: $\theta_{\text{train}} = \theta_{\text{test}}$ or $p_{\text{train}}(\theta) = p_{\text{test}}(\theta)$. While this assumption can hold for some tasks like Atari games, creating a sufficiently accurate simulated model is challenging for dynamic tasks (Koos et al., 2013), and is generally impossible if the testing domain is the unconstrained real world. Therefore an important quantity of interest to measure is how trained models generalise to encounters with a certain degree of domain shift between environments ($\theta_{\text{train}} \neq \theta_{\text{test}}$ or $p_{\text{train}}(\theta) \neq p_{\text{test}}(\theta)$). Besides training performance, we should therefore monitor the robustness of our models via the quantity:

$$J(\pi_{\text{train}};\, \theta_{\text{test}}) \quad \text{or} \quad J(\pi_{\text{train}};\, p_{\text{test}}). \tag{7}$$

That is, the performance of the model $\pi_{\text{train}}$ trained on $\theta_{\text{train}}$ or $p_{\text{train}}(\theta)$, when tested on $\theta_{\text{test}}$ or $p_{\text{test}}(\theta)$. This view encompasses robustness to changes in the distribution of starting conditions (Zhang et al., 2018b) and maps (Cobbe et al., 2018), training with deterministic observations and testing with non-deterministic ones (Zhang et al., 2018a) (but also includes action and environmental noise), and extrapolation in environmental parameters such as object mass (Packer et al., 2018) that will arise in practice due to the reality gap (Koos et al., 2013). Given the inability to exactly control or simulate the distribution of real-world environmental encounters, the model robustness quantified above should be a consideration in our development of new methods, and our choice of algorithms and architectures in practice.

Since the vast majority of existing work does not explicitly separate training and testing phases, in the following sections we introduce a set of benchmarks with clear train/test distinctions. Based on these, we systematically measure the generalization of several popular algorithms under the diverse variations discussed in this section.

4 Experimental Design

We design a benchmark of generalisation, measuring testing rather than training performance. We cover generalisation across seeds when the simulation is non-deterministic ($\sigma > 0$) in observation, actuation and process, and focus particularly on robustness to environment parameter variation, i.e., domain shift $\theta_{\text{train}} \neq \theta_{\text{test}}$ or $p_{\text{train}}(\theta) \neq p_{\text{test}}(\theta)$.

4.1 Training Algorithms and Architectures

We study several model-free policy gradient based Deep RL algorithms with OpenAI baseline implementations (Dhariwal et al., 2017) including Trust Region Policy Optimisation (TRPO) (Schulman et al., 2015), Proximal Policy Optimisation (PPO) (Schulman et al., 2017) and Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2016). In addition to basic Deep RL algorithms, we also consider several modifications of the baseline algorithms and architectures that may improve generalisation of learned policies.

Entropy Regulariser  Policies with higher entropy may be more robust to uncertain dynamics (Ziebart, 2010). To encourage learning higher-entropy policies, we add entropy to the training objective as a regularizer (Haarnoja et al., 2017), denoted with suffix -Ent.
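As an illustration of an entropy regulariser, the sketch below adds the entropy of a diagonal Gaussian policy to a surrogate objective. The coefficient `beta` and both function names are assumptions; the weighting used in the experiments is not stated in this section.

```python
import math


def gaussian_entropy(log_stds):
    """Differential entropy of a diagonal Gaussian with the given log standard deviations."""
    # Each dimension contributes log(sigma) + 0.5 * log(2 * pi * e).
    return sum(log_std + 0.5 * math.log(2.0 * math.pi * math.e) for log_std in log_stds)


def entropy_regularised_objective(surrogate, log_stds, beta=0.01):
    """Surrogate objective plus an entropy bonus weighted by the (assumed) coefficient beta."""
    return surrogate + beta * gaussian_entropy(log_stds)
```

Because the bonus grows with the log standard deviations, maximising this objective pushes the policy towards wider action distributions unless the surrogate term opposes it.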

Structured Control Net  Inspired by classic control theory, Structured Control Net (SCN) splits a Deep RL policy into a linear module and a nonlinear residual module and shows improved robustness against noise (Srouji et al., 2018). We train SCNs with PPO, denoted PPO-SCN.
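The split into a linear module and a nonlinear residual module can be sketched as a forward pass. This is a minimal illustration with plain Python lists; the single tanh hidden layer and the parameter names `K`, `b`, `W1`, `W2` are assumptions, not the architecture details of Srouji et al. (2018).

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def scn_action(state, K, b, W1, W2):
    """Action = linear module (K s + b) + nonlinear residual module (W2 tanh(W1 s))."""
    linear = [k + bias for k, bias in zip(matvec(K, state), b)]
    hidden = [math.tanh(h) for h in matvec(W1, state)]
    nonlinear = matvec(W2, hidden)
    return [lin + non for lin, non in zip(linear, nonlinear)]
```

When the nonlinear weights are zero the policy reduces to the linear controller, which is one way to see the nonlinear module as a residual on top of it.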


  A standard continuous control policy architecture is a multilayer perceptron (MLP) with two 64-unit hidden layers. To investigate the influence of network size on generalisation performance, we also use a smaller MLP with two 16-unit hidden layers. The smaller network is indicated with the suffix -16, such as PPO-16 and SCN-16.

Adversarial Attacks Assist Learning  Several attempts have been made to improve policy robustness with the assistance of adversaries (Pattanaik et al., 2018; Pinto et al., 2017). We follow 'adversarially robust policy learning' (ARPL) (Mandlekar et al., 2017) where adversarial noise maximises the norm of output actions $\lVert a \rVert$. To minimise the interference in the simulation platform, adversaries only attack in the observation space.
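To make the idea of an observation-space attack on the action norm concrete, the sketch below takes a single gradient-sign step for a hypothetical linear policy with weight matrix `W`. The gradient-sign step is a stand-in chosen for illustration; the exact attack used in ARPL may differ, and the function names are assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def action_norm(obs, W):
    """Euclidean norm of the action produced by the linear policy a = W o."""
    return sum(a * a for a in matvec(W, obs)) ** 0.5


def adversarial_observation(obs, W, epsilon):
    """Perturb the observation by epsilon along the sign of the gradient of ||W o||^2."""
    action = matvec(W, obs)
    # Gradient of ||W o||^2 with respect to o is 2 * W^T (W o).
    grad = [2.0 * sum(W[i][j] * action[i] for i in range(len(W)))
            for j in range(len(obs))]
    # Sign of each gradient entry: +1, 0 or -1.
    signs = [(g > 0) - (g < 0) for g in grad]
    return [o + epsilon * s for o, s in zip(obs, signs)]
```

For this linear policy the squared norm is convex in the observation, so the step never decreases the action norm.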

Multi-Domain Learning  Training agents on multiple domains is a simple strategy to improve generalisation over environment changes (Tobin et al., 2017). To simulate variability in domains, we generate a distribution of domains $p(\theta)$ controlled by a parameter $\sigma_d$. At each training rollout, we sample a new domain from the distribution, $\theta \sim p(\theta)$. In this case we only sample dynamics parameters $\theta_f$, and denote this training setting -MDL.

4.2 Environments and Evaluation

We experiment on several MuJoCo simulated (Todorov et al., 2012) environments in OpenAI Gym (Brockman et al., 2016). To explore robustness to environmental parameter variation, we modify various environmental dynamics parameters as shown in Table 2. For example, in Walker2d, we modify robot mass, friction and gravity coefficients, and apply constant horizontal force as wind.

Task Name Environment Factors
InvertedPendulum-v2 ,
InvertedDoublePendulum-v2 , ,
Walker2d-v2 , , ,
Hopper-v2 , , ,
HalfCheetah-v2 , ,
Table 2: Summary of evaluation environments and their environment factors that are included to generate shifts in transition model.

For each training setting (task, algorithm/architecture, train-environment), we train 12 policies with different random seeds. For each condition (task, algorithm/architecture, test-environment) we evaluate by averaging over 20 different testing rollouts. We assume constant sensor and actuation module context parameters $\theta_h$ and $\theta_g$, and the same initial state distribution $\rho_0$ but different random seeds between training and testing. Our evaluation focuses on robustness against various noise scales $\sigma$ in observation, action and environmental parameter space, as well as systematic shifts in the environmental parameters $\theta_f$. Gaussian noise is directly added to the outputs of the dynamic plant and the agent policies as observation and action noise. For environmental parameter noise, at each time step, a set of environmental parameters (e.g. wind condition) is sampled and the simulation is modified accordingly. In contrast, for systematic shifts, environmental parameters are sampled at the start of each trial and remain constant within the trial. The four testing settings are denoted Obs, Act, Env and Dom respectively.


  We use three evaluation metrics: Testing return (Eq. (5)), where possibly $\theta_{\text{test}} \neq \theta_{\text{train}}$, and Expected testing return (Eq. (6)), where possibly $p_{\text{test}}(\theta) \neq p_{\text{train}}(\theta)$. Finally, as an aggregate measure of performance given that we may not know the strength of noise or variability in the testing domain, we also compute the Area Under Curve (AUC) of testing return with respect to the scale $\sigma$ of the underlying Gaussian distribution:

$$\text{AUC} = \sum_{\sigma} J(\pi; \sigma)\, \Delta\sigma, \tag{8}$$

where $\sigma$ could be the scale of both noise and domain shift, and $\Delta\sigma$ is the step size of the varying scales.

5 Experimental Results

How do standard continuous controllers generalise under noise and domain-shift?

Figure 2: Generalisation of standard continuous control policies for Walker2d-v2. Top: Performance with varying testing noise and environment variation scale, with panels (a) Observation Noise, (b) Action Noise, (c) Environmental Noise and (d) Domain Shift. Bottom: Heatmaps for (e) PPO, (f) TRPO and (g) DDPG illustrate policy performance over a grid of environmental domain shifts. Each cell corresponds to a particular set of context parameters, with the training domain at the default setting. Results are averaged over 12 random seeds.

Standard practice in Deep RL research is to train policies on standard MuJoCo-Gym benchmark environments. Algorithms such as TRPO and PPO now achieve impressive training return; but do the resulting policies generalise to novel contexts at testing time?

We analyse this question for observation-, action-, and environment-noise and domain shifts. Figure 2 shows the results for Walker2d as an example environment, and the full results are summarised in Table 4 in Appendix. Figures 2(a-c) show performance degradation of standard policies as increasing observation, action, and environmental noise are added at testing. We can see that policies are relatively sensitive to observation noise compared to the other types. In terms of domain-shift rather than noise, Figures 2(e-g) show that the performance of a standard model degrades rapidly as example environmental parameters (mass ratio, wind direction) are changed at testing, with the degradation rate depending on the factor being modified (e.g., greater mass-sensitivity than wind). Figure 2(d) summarizes the domain-shift performance as an average over increasing shift in all walker parameters (Table 2: mass, wind, friction, gravity). In this particular environment, TRPO is usually the most robust algorithm. However, in the full evaluation over all tasks (Table 4 in appendix) using testing AUC score (area under the curves in Fig 2), there is no consistent winner in algorithm robustness.

(a) Training performance of deterministic vs. noisy training (columns: Noise Type, Return when training with $\sigma = 0$, Return when training with $\sigma > 0$)

Noise Type   Test scale   Train with $\sigma = 0$   Train with $\sigma > 0$
Obs.         0.0          3723.9 ± 321.3            3682.9 ± 415.6
Obs.         0.2          2522.2 ± 593.4            3036.1 ± 576.3
Obs.         0.4          937.8 ± 308.9             1303.8 ± 374.5
Act.         0.0          3678.8 ± 355.5            3596.5 ± 406.3
Act.         0.2          2096.3 ± 570.4            3044.0 ± 525.6
Act.         0.4          919.7 ± 249.5             1951.5 ± 486.9
Env.         0.0          3756.3 ± 341.7            3761.8 ± 415.1
Env.         0.2          3133.5 ± 465.9            3225.4 ± 450.0
Env.         0.4          2528.6 ± 415.9            2660.1 ± 436.1
Dom.         0.0          3725.1 ± 343.7            3417.9 ± 549.2
Dom.         0.2          3228.7 ± 408.6            2985.2 ± 486.4
Dom.         0.4          2462.3 ± 394.4            2468.2 ± 341.9
(b) Testing performance of deterministic vs. noisy training

Table 3: Improving Walker2d-PPO generalisation by training with noise. (a) Compares training return in deterministic ($\sigma = 0$) and noisy ($\sigma > 0$) conditions. (b) Compares the testing performance of different training conditions. 'MDL' and 'DOM' refer to training or testing on multiple domains respectively. For simplicity we overload $\sigma$ to refer to the scale of the domain distribution in the multi-domain setting, and use $J$ to indicate the expected return over domains.

Does modelling noise and variability in training improve generalisation?

We saw above that performance degrades rapidly with noise. However, as discussed earlier, testing a deterministically trained policy in a stochastic environment can be seen as a form of domain shift ($\sigma_{\text{train}} \neq \sigma_{\text{test}}$). We therefore study if reducing this domain shift by adding noise and environment variation during training improves generalisation.

As a detailed example, we analyse PPO-trained Walker2d in Table 3. To reduce domain shift between training and testing, we train the policies in a noisy environment ($\sigma > 0$) to align with a testing condition, in comparison to training with the default environment ($\sigma = 0$). We compare both training performance (Table 3a) and testing return under multiple testing noise levels (Table 3b). The experiment considers both i.i.d. Gaussian noise and training domain randomisation in preparation for testing on novel domains (denoted 'Dom'). The expected testing return results (cf. Eq. (6)) are averaged over 12 training x 20 testing seeds.

From the results in Table 3, we can see that: (i) In each case, except environmental noise, the (expected) training return is significantly lower when adding noise (Table 3a, compare columns). (ii) For observation and action noise, training with noise in preparation for testing with noise improves performance compared to the conventional deterministic training (Table 3b, compare green columns). (iii) However, there is not a clear benefit from removing the domain shift in this way if the testing scenario contains environment noise, or novel domains compared to training (Table 3b, compare red columns). (iv) Finally, we evaluate the impact of a domain shift corresponding to mis-specified noise strength. In this case we can see that while testing performance has generally degraded at the largest testing noise scale compared to the smaller ones, the degradation is ameliorated significantly (compared to deterministic training) in the case of action noise and observation noise. Welch's t-test is used for all significance tests.
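For reference, Welch's t-test compares two sample means without assuming equal variances. The sketch below computes the test statistic and the Welch-Satterthwaite degrees of freedom with the standard library only; it leaves out the p-value, and the function name is an assumption.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    mean_a, mean_b = statistics.fmean(sample_a), statistics.fmean(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each mean, using the unbiased sample variance.
    se_a = statistics.variance(sample_a) / n_a
    se_b = statistics.variance(sample_b) / n_b
    t_stat = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t_stat, dof
```

In this setting each sample would hold the returns of one training condition across seeds, for example the 12 policies trained per setting.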

Figure 3(a) shows the impact of training with multiple domains or observation noise compared to training with deterministic environments, averaging over all five benchmark environments. Results are expressed as the difference from vanilla PPO. Detailed results across all tasks under the aggregate testing AUC metric (Eq. (8)) are visible in Table 5 by comparing PPO (standard training) with PPO-MDL (multi-domain training) and PPO-Noisy Obs (training with observation noise), where the green highlighted entries show when the noisy training improves on the PPO baseline. The results show a positive influence on generalisation when training with noisy environments. However, adding noise during training can also raise the risk of failures in the learning process for some tasks (especially training with noisy observations). Interestingly, there is some transferability across noise types. MDL training often improves robustness not only to novel domains at testing, but also to i.i.d. observation, action, and environment noise. Meanwhile, observation noise training improves robustness to action noise and MDL testing in HalfCheetah. In summary, modelling uncertainty during training can help improve generalisation at testing, but better methods are still necessary, particularly if uncertainty is misspecified.

What existing techniques improve generalisation?

(a) Noisy Training
(b) Training Techniques
Figure 3: Change in normalised testing AUC score using various training settings compared to vanilla PPO training. (a) Impact of training with more stochastic environments. (b) Impact of architecture and algorithmic modifications. Results are averaged over all five training environments.

We next investigate if any of the methods discussed in Section 4.1 improve generalisation performance. We build on PPO because it is easier to integrate with the modifications (unlike, e.g., TRPO), and because of its good stability and training efficiency. Figure 3(b) illustrates the change in testing generalisation performance (normalised testing AUC score) compared to vanilla PPO, when using each training technique. Detailed expected testing AUC (Eq. (8)) results are summarised in Table 6 in the appendix. We can see that: (i) The smaller PPO-16 often surpasses the classic PPO architecture with 64 hidden units per layer in generalisation, and similarly for SCN-16 vs SCN. (ii) Entropy-regularised PPO usually surpasses vanilla PPO, sometimes by a large margin. (iii) Adversarial PPO-ARPL sometimes improves on vanilla PPO, but often also performs worse. (iv) The best performing model is either PPO-16, SCN-16, or PPO-Ent. Thus generalisation performance can be increased by reducing architecture size, or adding regularisers to reduce overfitting. However, there is no specific overfitting reduction strategy that works consistently across environments and noise types.

Is training return a valid metric for performing model selection?

In Deep RL research, training return is the standard evaluation metric for comparing learning algorithms and architectures. Given that we should ultimately care about testing return, this practice rests on the strong assumption that there is no overfitting and that all distributions are identical during training and testing. However, we now know that overfitting does occur, that modelling errors between training and testing are unavoidable, and that the real world is noisy (Sunderhauf et al., 2018; Koos et al., 2013). It is therefore important to ask what this evaluation practice implies for the algorithms and architectures we determine to be ‘winners’.

As an illustration, we first show how generalisation performance evolves during training. Figure 4 shows the testing AUC score and training return as a function of PPO training iterations in the Walker2d-v2 environment. Training and testing performance initially improve in tandem, but overfitting occurs as learning continues.

Figure 4: Comparison of testing AUC score and training return during PPO learning of Walker2d-v2 task.

To investigate the effect of this on algorithm choice, we fit a Pareto frontier to the testing AUC score vs training return of each method. Each algorithm is represented by the learned policy with the best training performance among multiple random seeds. Similar results are obtained when using average performance across seeds. Figures 5(a) and 5(b) show illustrative curves for Walker2D-ActNoise and HalfCheetah-ObsNoise. The Pareto frontiers illustrate that it is hard to achieve good training and testing performance simultaneously. We further compute the correlation between testing AUC and training return for each task under each noise type in Figure 5(c). (Here the seven variants of the PPO algorithm in Figure 5(a) are the elements being correlated.) Most environments and noise types show a clear negative correlation. Thus, if we follow the standard practice of evaluating algorithms based on training performance, we will often pick the least robust algorithm with the worst generalisation. If generalisation is of interest, as it should be, then evaluations should use generalisation metrics such as the benchmarks proposed here.

(a) Walker2D - Action
(b) HalfCheetah - Observation
(c) Correlation coefficients between training return and testing AUC, per task (rows) and noise type (columns)

Task          Obs.     Act.     Env.     Dom.
Walker       -0.760   -0.722   -0.457   -0.424
Hopper       -0.203   -0.346   -0.473   -0.415
HalfCheetah  -0.946   -0.843   -0.046   -0.477
Pendulum     -0.132   -0.584   -0.665   -0.630
D-Pendulum   -0.033   -0.732    0.025    0.324

Figure 5: Training return does not reflect generalisation performance. (a,b): Clear Pareto frontiers exist in testing AUC vs training return across algorithms (dots). (c): Testing AUC and training return are generally negatively correlated.
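The two analyses behind Figure 5 can be sketched in a few lines. The helpers below are illustrative assumptions rather than the authors' code: `pearson` computes a standard Pearson correlation coefficient, and `pareto_front` simply returns the non-dominated (training return, testing AUC) pairs, which is one plain way to identify a Pareto frontier. The final tally uses only the coefficients reported in Figure 5(c).

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def pareto_front(points):
    """Return the (training return, testing AUC) pairs that no other pair dominates.

    Higher is better on both axes; a point is dominated if another point is at
    least as good on both axes and strictly better on one.
    """
    front = []
    for i, (tr, te) in enumerate(points):
        dominated = any(
            o_tr >= tr and o_te >= te and (o_tr > tr or o_te > te)
            for j, (o_tr, o_te) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((tr, te))
    return sorted(front)


# Coefficients transcribed from Figure 5(c); columns are observation, action,
# environment and domain noise.
FIG5C = {
    "Walker": [-0.760, -0.722, -0.457, -0.424],
    "Hopper": [-0.203, -0.346, -0.473, -0.415],
    "HalfCheetah": [-0.946, -0.843, -0.046, -0.477],
    "Pendulum": [-0.132, -0.584, -0.665, -0.630],
    "D-Pendulum": [-0.033, -0.732, 0.025, 0.324],
}
num_negative = sum(c < 0 for row in FIG5C.values() for c in row)
num_total = sum(len(row) for row in FIG5C.values())
```

Counting the entries of Figure 5(c) this way gives 18 negative coefficients out of 20 task and noise-type combinations, with the only positive values occurring for D-Pendulum under environment and domain noise.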

6 Conclusion

We contributed an analysis and set of benchmarks to investigate generalisation in deep RL. The analysis showed that standard algorithms and architectures generalise poorly in the face of noise and environmental shift. In particular, training and testing performance are often anti-correlated, so the standard practice of developing models with the aim of maximising training performance may be leading the community to produce less robust models. The results show that different off-the-shelf algorithms better address different aspects of generalisation performance, and that various enhanced training strategies can also improve aspects of generalisation. However, there is currently no generally good solution to all facets of generalisation, and new algorithms are needed.



Training Hyperparameters

We use the implementations of the basic RL algorithms (PPO, TRPO, DDPG) from the OpenAI Baselines codebase (Dhariwal et al., 2017). The hyperparameters we use for each training algorithm are listed below:


PPO:

- Policy Network: (64, tanh, 64, tanh, Linear) + policy standard deviation variable
- Value Network: (64, tanh, 64, tanh, Linear)
- Normalised observations with running mean filter
- Number of time steps per batch: 2048
- Number of optimiser epochs per iteration: 10
- Size of optimiser minibatches: 64
- Optimiser learning rate:
- Generalised Advantage Estimator (GAE) factor
- Discount factor
- Cliprange parameter: 0.2
- Number of total training steps:
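The batch settings in the list above (time steps per batch, optimiser epochs and minibatch size) jointly determine how many gradient updates are taken per training iteration. The short calculation below works this out, assuming each epoch makes a single pass over the batch in disjoint minibatches; that assumption and the variable names are ours, not stated in the text.

```python
steps_per_batch = 2048   # time steps per batch, from the list above
optimiser_epochs = 10    # optimiser epochs per iteration, from the list above
minibatch_size = 64      # optimiser minibatch size, from the list above

# Assumption: each epoch splits the batch into disjoint minibatches.
minibatches_per_epoch = steps_per_batch // minibatch_size

# Every minibatch in every epoch yields one gradient update.
updates_per_iteration = optimiser_epochs * minibatches_per_epoch

# Each collected transition is then used once per epoch.
uses_per_transition = optimiser_epochs
```

Under that assumption there are 32 minibatches per epoch and 320 gradient updates per iteration, with each transition reused 10 times before new data is collected.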


TRPO:

- Policy Network: (64, tanh, 64, tanh, Linear) + policy standard deviation variable
- Value Network: (64, tanh, 64, tanh, Linear)
- Normalised observations with running mean filter
- Number of time steps per batch: 1024
- Maximum KL divergence:
- Conjugate gradient iterations: 10
- CG damping factor:
- Generalised Advantage Estimator (GAE) factor
- Discount factor
- Value network update epochs per iteration: 5
- Value network learning rate:
- Number of total training steps:


DDPG:

- Actor Network: (64, relu, 64, relu, tanh)
- Critic Network: (64, relu, 64, relu, Linear)
- Normalised observations with running mean filter
- Noise type: OU-Noise 0.2
- Learning rates: actor LR: , critic LR:
- L2 normalisation coeff: 0.01
- Batch size: 64
- Discount factor
- Soft target update
- Reward Scale: 1.0
- Number of total training steps:

Table 4: Generalisation performance of basic Deep RL learners in terms of AUC scores. All settings are evaluated with 5 environments and 4 noise types in each environment. All results are averaged over 12 random seeds. Blue: Winning algorithm among the basic PPO, TRPO and DDPG baselines. Welch’s t-test () is used for significance testing.

env. Name PPO PPO-MDL PPO-ObsNoise
Walker2d obs.
Hopper obs.
HalfCheetah obs.
Pendulum obs.
D-Pendulum obs.
Table 5: Generalisation performance of training with noisy environments in terms of AUC score. All settings are evaluated with 5 environments and 4 noise types in each environment. All results are averaged over 12 random seeds. Green: Noisy training settings that improve on the corresponding PPO baseline. Welch’s t-test () is used for significance testing.

env. Name
Walker2d obs.
Hopper obs.
HalfCheetah obs.
Pendulum obs.
D-Pendulum obs.
Table 6: Generalisation performance of different algorithms and architectures in terms of AUC score. All settings are evaluated with 5 environments and 4 noise types in each environment. All results are averaged over 12 random seeds. The training settings that give top- performance within a test setting are highlighted in boldface. Welch’s t-test () is used for significance testing.
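The table captions rely on Welch's t-test to compare settings across random seeds. As a reminder of what that test computes, the sketch below evaluates the Welch t statistic and the Welch-Satterthwaite degrees of freedom for two samples with possibly unequal variances. The function name is hypothetical, and the significance threshold is left out because it is not recoverable from the text.

```python
import math
import statistics


def welch_t_test(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean, using the unbiased variance.
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    mean_diff = statistics.mean(sample_a) - statistics.mean(sample_b)
    t_stat = mean_diff / math.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, dof
```

In this setting, `sample_a` and `sample_b` would each hold the 12 per-seed AUC scores of two training settings. A p-value then follows from the Student t distribution with `dof` degrees of freedom; where SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test directly.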