Modifiable OpenAI Gym environments for studying generalization in RL
Deep reinforcement learning (RL) has achieved breakthrough results on many tasks, but has been shown to be sensitive to system changes at test time. As a result, building deep RL agents that generalize has become an active research area. Our aim is to catalyze and streamline community-wide progress on this problem by providing the first benchmark and a common experimental protocol for investigating generalization in RL. Our benchmark contains a diverse set of environments and our evaluation methodology covers both in-distribution and out-of-distribution generalization. To provide a set of baselines for future research, we conduct a systematic evaluation of deep RL algorithms, including those that specifically tackle the problem of generalization.
Deep reinforcement learning (RL) has emerged as an important family of techniques that may support the development of intelligent systems that learn to accomplish goals in a variety of complex real-world environments (Mnih et al., 2015; Arulkumaran et al., 2017). A desirable characteristic of such intelligent systems is the ability to function in diverse environments, including ones that have never been encountered before. Yet, deep RL algorithms are commonly trained and evaluated on a fixed environment. The algorithms are evaluated in terms of their ability to optimize a policy in a complex environment, rather than their ability to learn a representation that generalizes to previously unseen circumstances. Indeed, their sensitivity to even subtle changes in the environment and the dangers of overfitting to a specific environment have been noted in the literature (Rajeswaran et al., 2017b; Henderson et al., 2018; Zhang et al., 2018; Whiteson et al., 2011).
Generalization refers to both interpolation to environments similar to those seen during training and extrapolation outside the training data distribution. The latter is particularly challenging but is crucial to the deployment of systems in the real world.
Generalization in deep RL has been recognized as an important problem and is under active investigation (Rajeswaran et al., 2017a; Pinto et al., 2017; Kansky et al., 2017; Yu et al., 2017; Wang et al., 2016; Duan et al., 2016b; Sung et al., 2017; Clavera et al., 2018; Sæmundsson et al., 2018). However, each work uses a different set of environments and experimental protocols. For example, Kansky et al. (2017) propose a graphical model architecture, evaluating on variations of the Atari game Breakout. Rajeswaran et al. (2017a) propose training on a distribution of domains in a risk-averse manner and evaluate on two continuous control tasks from MuJoCo (Hopper and HalfCheetah). Duan et al. (2016b) aim to learn a policy that automatically adapts to the environment dynamics and evaluate on bandits, tabular MDPs, and maze navigation. Sæmundsson et al. (2018) combine learning a hierarchical latent model for the environment dynamics with model predictive control, evaluating on two continuous control tasks (cart-pole swing-up and double-pendulum swing-up).
What appears to be missing is a common testbed for evaluating generalization in deep RL: a clearly defined set of tasks, metrics, and baselines that can support concerted community-wide progress. In other words, research on generalization in deep RL has not yet adopted the ‘common task framework’, a proven catalyst of progress (Donoho, 2015). Only by creating such testbeds and evaluating on them can we fairly compare and contrast the merits of different algorithms and accurately measure progress made on the problem.
Our contribution is to establish a reproducible framework for investigating generalization in deep RL, with the hope that it will catalyze progress on this problem, and to present an empirical evaluation of generalization in deep RL algorithms as a baseline. We select a diverse but manageable set of environments, comprising classic control problems and MuJoCo locomotion tasks, built on top of OpenAI Gym for ease of adoption. For each environment, we specify degrees of freedom (parameters) along which its specification can be varied, leading to changes in the system dynamics but not the reward structure. Significantly, we test generalization in two regimes: interpolation and extrapolation. Interpolation implies that agents should perform well in test environments where parameters are similar to those seen during training. Extrapolation requires agents to perform well in test environments where parameters are different from those seen during training. Importantly, we do not allow updates to the trained model or policy at test time, unlike many benchmarks for transfer learning and multi-task learning.
To provide the community with a set of clear baselines, we evaluate two deep RL algorithms on all environments and under different combinations of training and testing regimes. We chose one algorithm from each of the two major families: A2C from the actor-critic family and PPO from the policy gradient family. Using the same experimental protocol, we also evaluate two schemes for tackling generalization in deep RL: EPOpt, which learns a policy that is robust to environment changes by maximizing expected reward over the most difficult of a distribution of environment parameters, and RL², which learns a policy that can adapt to the environment at hand by taking into account the trajectory it sees. Because each scheme is constructed based on existing deep RL algorithms, our evaluation is of four algorithms: EPOpt-A2C, EPOpt-PPO, RL²-A2C, and RL²-PPO. We analyze the results and draw conclusions that can guide future work on generalization in deep RL. The experimental results confirm that extrapolation is more difficult than interpolation. The ‘vanilla’ deep RL algorithms (A2C and PPO) and their EPOpt variants were able to interpolate successfully. RL²-A2C and RL²-PPO proved to be difficult to train and were unable to reach the level of performance of the other algorithms given the same amount of training resources. In other words, training a conservative policy that is oblivious to changes in the system dynamics can generalize quite well, while training an adaptive policy is comparatively data inefficient.
Generalization in RL. There are two main approaches to generalization in RL: learning policies that are robust to environment variations, or learning policies that adapt to such variations. A popular approach to learn a robust policy is to maximize a risk-sensitive objective, such as the conditional value at risk (Tamar et al., 2015), over a distribution of environments. Morimoto & Doya (2001) maximize the minimum reward over possible disturbances, proposing robust versions of the actor-critic and value gradient methods in a control theory framework. This maximin objective is utilized by others in the context where environment changes are modeled by uncertainties in the transition probability distribution function of a Markov decision process. Nilim & Ghaoui (2004) assume that the set of possible transition probability distribution functions is known, while Lim et al. (2013) and Roy et al. (2017) estimate it using sampled trajectories from the distribution of environments of interest. A recent representative of this approach applied to deep RL is the EPOpt algorithm (Rajeswaran et al., 2017a), which maximizes the conditional value at risk, i.e. expected reward over the subset of environments with lowest expected reward. EPOpt has the advantage that it can be used in conjunction with any RL algorithm. Adversarial training has also been proposed to learn a robust policy; for MuJoCo locomotion tasks, Pinto et al. (2017) train an adversary that tries to destabilize the agent during training.
A robust policy may sacrifice performance on many environment variants in order to not fail on a few. Thus, an alternative, recently popular approach to generalization in RL is to learn a policy that can adapt to the environment at hand (Yu et al., 2017). To do so, a number of algorithms learn an embedding for each environment variant using trajectories sampled from that environment, which is input into a policy. Then, at test time, the current trajectory can be used to compute an embedding for the current environment, enabling automatic adaptation of the policy. Duan et al. (2016b), Wang et al. (2016), Sung et al. (2017), and Mishra et al. (2018), which differ mainly in the way embeddings are computed, consider model-free RL by letting the embedding be input into a policy and/or value function. Clavera et al. (2018) consider model-based RL, in which the embedding is input into a dynamics model and actions are selected using model predictive control. Under a similar setup, Sæmundsson et al. (2018) utilize probabilistic dynamics models and inference.
This literature review has focused on RL algorithms for generalization that do not require updating the learned model or policy at test time, in keeping with our benchmark’s evaluation procedure. There has been work on generalization in RL that utilizes such updates, primarily under the umbrellas of transfer learning, multi-task learning, and meta-learning. Taylor & Stone (2009) survey transfer learning in RL where a fixed test environment is considered, with Rusu et al. (2016) being an example of recent work on that problem using deep networks. Ruder (2017) provides a survey of multi-task learning in general, which, unlike our problem of interest, considers a fixed finite population of tasks. Finn et al. (2017) present a meta-learning formulation of generalization in RL and Al-Shedivat et al. (2018) extend it for continuous adaptation in non-stationary environments.
Empirical methodology in deep RL. Shared open-source software infrastructure, which enables reproducible experiments, has been crucial to the success of deep RL. The deep RL research community uses simulation frameworks, including OpenAI Gym (Brockman et al., 2016), the Arcade Learning Environment (Bellemare et al., 2013; Machado et al., 2017), DeepMind Lab (Beattie et al., 2016), and VizDoom (Kempka et al., 2016). The MuJoCo physics simulator (Todorov et al., 2012) has been influential in standardizing a number of continuous control tasks. For ease of adoption, our work builds on OpenAI Gym and MuJoCo tasks, allowing variations in the environment specifications in order to study generalization. OpenAI recently released a benchmark for transfer learning in RL (Nichol et al., 2018), in which the goal is to train an agent to play new levels of a video game with fine-tuning at test time. In contrast, our benchmark does not allow fine-tuning and focuses on control tasks.
Our work also follows in the footsteps of a number of empirical studies of reinforcement learning algorithms, which have primarily focused on the case where the agent is trained and tested on a fixed environment. Henderson et al. (2018) investigate reproducibility in deep RL, testing state-of-the-art algorithms on four MuJoCo tasks: HalfCheetah, Hopper, Walker2d, and Swimmer. They show that results may be quite sensitive to hyperparameter settings, initialization, random seeds, and other implementation details, indicating that care must be taken not to overfit to a particular environment. The problem of overfitting in RL was recognized earlier by Whiteson et al. (2011), who propose an evaluation methodology based on training and testing on multiple environments sampled from a distribution and experiment with three classic environments: MountainCar, Acrobot, and puddle world. Duan et al. (2016a) present a benchmark suite of continuous control tasks and conduct a systematic evaluation of reinforcement learning algorithms on those tasks. They consider generalization in terms of interpolation on a subset of their tasks. In contrast to these works, we address a greater variety of tasks, extrapolation as well as interpolation, and algorithms for learning deep RL agents that generalize.
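In RL, environments are formulated in terms of Markov Decision Processes (MDPs) (Sutton & Barto, 2017). An MDP $\mathcal{M}$ is defined by the tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma, \rho_0, T)$ where $\mathcal{S}$ is the set of possible states, $\mathcal{A}$ is the set of actions, $p$ is the transition probability distribution function, $r$ is the reward function, $\gamma$ is the discount factor, $\rho_0$ is the initial state distribution at the beginning of each episode, and $T$ is the time horizon per episode. Generalization to environment variations is usually characterized as generalization to changes in $p$ and $r$; our benchmark considers changes in $p$.
Let $s_t$ and $a_t$ be the state and action taken at time $t$. At the beginning of each episode, $s_0 \sim \rho_0(\cdot)$. Under a policy $\pi$ stochastically mapping a sequence of states to actions, $a_t \sim \pi(a_t \mid s_t, \ldots, s_0)$ and $s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)$, giving a trajectory $\{s_t, a_t, r(s_t, a_t)\}$, $t = 0, 1, \ldots$. RL algorithms, taking the MDP as fixed, learn $\pi$ to maximize the expected reward over an episode $J_{\mathcal{M}}(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$, where $r_t = r(s_t, a_t)$. They often utilize the concepts of a value function $V^{\pi}_{\mathcal{M}}(s)$ and a state-action value function $Q^{\pi}_{\mathcal{M}}(s, a)$.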
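As a small illustration of the quantity being maximized, the discounted sum of per-step rewards for one episode can be computed as in the sketch below. The function name `discounted_return` and the example values are illustrative assumptions, not part of the benchmark code; the expected reward is the average of this quantity over many episodes.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for a single episode.

    `rewards` is the list of per-step rewards in time order and `gamma`
    is the discount factor. Both names are illustrative.
    """
    total = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    # Three steps of reward 1 with gamma = 0.5: 1 + 0.5 + 0.25
    print(discounted_return([1.0, 1.0, 1.0], 0.5))
```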
We first evaluate ‘vanilla’ deep RL algorithms from two main categories: actor-critic and policy gradient. From the actor-critic family, we chose A2C (Mnih et al., 2016), and from the policy gradient family we chose PPO (Schulman et al., 2017). [Footnote 1: We carried out preliminary experiments on other deep RL algorithms including A3C, TRPO, and ACKTR. A2C and A3C/ACKTR had similar qualitative results, as did PPO and TRPO.] These algorithms are oblivious to variations in the environment; they were not designed with generalization in mind. We also include recently-proposed algorithms that are designed to be able to generalize: EPOpt (Rajeswaran et al., 2017a) from the robust approaches and RL² (Duan et al., 2016b) from the adaptive approaches. Both these methods are built on top of ‘vanilla’ deep RL algorithms, so for completeness we evaluate a Cartesian product of the algorithms for generalization and the ‘vanilla’ algorithms: EPOpt-A2C, EPOpt-PPO, RL²-A2C, and RL²-PPO. Next we briefly summarize A2C, PPO, EPOpt, and RL², using the notation in Section 3.
Advantage Actor-Critic (A2C). A2C involves the interplay of two optimizers: a critic learns a parametric value function, while an actor utilizes that value function to learn a parametric policy that maximizes expected reward. At each iteration, trajectories are generated using the current policy, with the environment and hidden states of the value function and policy reset at the end of each episode. Then, the policy and value function parameters are updated using RMSProp (Hinton et al., 2012), with an entropy term added to the policy objective function in order to encourage exploration. We use an implementation from OpenAI Baselines (Dhariwal et al., 2017).
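Proximal Policy Optimization (PPO). PPO aims to learn a sequence of monotonically improving parametric policies by maximizing a surrogate for the expected reward via gradient ascent, cautiously bounding the improvement achieved at each iteration. At iteration $j$, trajectories are generated using the current policy $\pi_{\theta_j}$, with the environment and hidden states of the policy reset at the end of each episode. The following objective is then maximized with respect to $\theta$ using Adam (Kingma & Ba, 2015):
$$\mathbb{E}_{s \sim \rho_{\theta_j},\, a \sim \pi_{\theta_j}} \min\left[\ell_{\theta}(a, s)\, A^{\pi_{\theta_j}}(s, a),\; m_{\theta}(a, s)\, A^{\pi_{\theta_j}}(s, a)\right]$$
where $\rho_{\theta_j}$ are the expected visitation frequencies under $\pi_{\theta_j}$, $\ell_{\theta}(a, s) = \pi_{\theta}(a \mid s) / \pi_{\theta_j}(a \mid s)$, $m_{\theta}(a, s)$ equals $\ell_{\theta}(a, s)$ clipped to the interval $[1 - \delta, 1 + \delta]$ with $\delta \in (0, 1)$, and $A^{\pi_{\theta_j}}(s, a) = Q^{\pi_{\theta_j}}(s, a) - V^{\pi_{\theta_j}}(s)$. Again, we use an implementation from OpenAI Baselines, PPO2.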
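The clipped surrogate described above can be illustrated on a batch of sampled state-action pairs. The following is a minimal sketch, not the OpenAI Baselines implementation: it assumes the probability ratios and advantage estimates have already been computed, and the function name `clipped_surrogate` and the default clipping width of 0.2 are illustrative assumptions.

```python
def clipped_surrogate(ratios, advantages, delta=0.2):
    """Sample average of min(ratio * A, clip(ratio, 1 - delta, 1 + delta) * A).

    ratios:     new-policy / old-policy probability of each sampled action
    advantages: advantage estimate for each sampled state-action pair
    delta:      clipping width in (0, 1); 0.2 is an assumed default
    """
    if len(ratios) != len(advantages) or not ratios:
        raise ValueError("ratios and advantages must be non-empty and equal length")
    terms = []
    for ratio, advantage in zip(ratios, advantages):
        clipped = min(max(ratio, 1.0 - delta), 1.0 + delta)
        # Taking the minimum removes any incentive to push the ratio
        # outside the interval in the direction that raises the objective.
        terms.append(min(ratio * advantage, clipped * advantage))
    return sum(terms) / len(terms)


if __name__ == "__main__":
    print(clipped_surrogate([1.5, 0.5], [1.0, -1.0], delta=0.2))
```

With a ratio of exactly 1 for every sample (the new policy equals the old one), the value reduces to the mean advantage, which is a quick sanity check on any implementation.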
Ensemble Policy Optimization (EPOpt). To generalize over a distribution of environments (MDPs) $p(\mathcal{M})$, we would like to learn a policy $\pi$ that maximizes the expected reward $J_{\mathcal{M}}(\pi)$ over the distribution, $\mathbb{E}_{\mathcal{M} \sim p(\mathcal{M})}\left[J_{\mathcal{M}}(\pi)\right]$. In order to obtain a policy that is also robust to out-of-distribution environments, EPOpt instead maximizes the expected reward over the $\epsilon \le 1$ fraction of environments with worst expected reward:
$$\mathbb{E}_{\mathcal{M} \sim p(\mathcal{M})}\left[J_{\mathcal{M}}(\pi) \mid J_{\mathcal{M}}(\pi) \le y\right] \quad \text{where} \quad P_{\mathcal{M} \sim p(\mathcal{M})}\left(J_{\mathcal{M}}(\pi) \le y\right) = \epsilon.$$
At each iteration, the algorithm generates a number of complete episodes according to the current policy, where at the end of each episode a new environment is sampled from $p(\mathcal{M})$ and reset. (As in A2C and PPO, at the end of each episode the hidden states of the policy and value function are reset.) It keeps the $\epsilon$ fraction of episodes with lowest reward and uses them to update the policy with some RL algorithm (TRPO (Schulman et al., 2015) in the paper). We instead use A2C and PPO, building our implementation of EPOpt on top of the implementations of A2C and PPO.
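The episode-selection step of EPOpt can be sketched as follows. This is an illustration under stated assumptions rather than the implementation evaluated here: each episode is represented as a hypothetical `(total_reward, trajectory)` pair, and rounding the kept count up with a minimum of one episode is a choice made for the sketch.

```python
import math


def worst_fraction(episodes, epsilon):
    """Keep the epsilon fraction of episodes with the lowest total reward.

    episodes: list of (total_reward, trajectory) pairs (hypothetical layout)
    epsilon:  fraction in (0, 1] of episodes to keep
    """
    if not 0.0 < epsilon <= 1.0:
        raise ValueError("epsilon must lie in (0, 1]")
    # Round up, and always keep at least one episode (assumed convention).
    keep = max(1, math.ceil(epsilon * len(episodes)))
    ranked = sorted(episodes, key=lambda episode: episode[0])
    return ranked[:keep]


if __name__ == "__main__":
    batch = [(5.0, "traj-a"), (1.0, "traj-b"), (3.0, "traj-c"), (2.0, "traj-d")]
    # Only the two lowest-reward episodes would be passed to the policy update.
    print(worst_fraction(batch, 0.5))
```

Setting `epsilon` to 1 keeps every episode, in which case the update coincides with ordinary training on the sampled distribution of environments.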
RL². To learn a policy that can adapt to the dynamics of the environment at hand, RL² models the policy and value functions as a recurrent neural network (RNN) with the current trajectory as input, not just the sequence of states. The hidden states of the RNN may be viewed as an environment embedding. Specifically, for the RNN the inputs at time $t$ are $s_t$, $a_{t-1}$, $r_{t-1}$, and $d_{t-1}$, where $d_{t-1}$ is a Boolean variable indicating whether the episode ended after taking action $a_{t-1}$; the output is $a_t$ and the hidden states are updated to $h_{t+1}$. Like the other algorithms, at each iteration trajectories are generated using the current policy with the environment state reset at the end of each episode. However, unlike the other algorithms, a new environment is sampled from the environment distribution $p(\mathcal{M})$ only at the end of every $N$ episodes, which we call a trial. ( in our experiments.) Likewise, the hidden states of the policy and value functions are reinitialized only at the end of each trial. The generated trajectories are then input into any RL algorithm, maximizing expected reward in a trial; the paper uses TRPO, while we use A2C and PPO. As with EPOpt, our implementation of RL² is built on top of the implementations of A2C and PPO.
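To make the recurrent input concrete, the sketch below assembles the per-step input vector from the current state and the previous action, reward, and termination flag. It assumes a discrete action space encoded one-hot and an all-zero action slot at the first step of a trial; the function name `rl2_input` and both conventions are illustrative assumptions, not details taken from the implementation evaluated here.

```python
def rl2_input(state, prev_action, prev_reward, prev_done, n_actions):
    """Concatenate state, previous action, previous reward, and done flag.

    state:       sequence of floats observed at the current step
    prev_action: index of the previous discrete action, or None at the
                 first step of a trial (assumed convention)
    prev_reward: reward received for the previous action
    prev_done:   True if the previous action ended an episode
    n_actions:   number of discrete actions
    """
    one_hot = [0.0] * n_actions
    if prev_action is not None:
        one_hot[prev_action] = 1.0
    return list(state) + one_hot + [float(prev_reward), 1.0 if prev_done else 0.0]


if __name__ == "__main__":
    # First step of a trial: no previous action, reward, or termination.
    print(rl2_input([0.1, 0.2], None, 0.0, False, n_actions=2))
    # Later step: previous action 1 earned reward 1.0 and did not end the episode.
    print(rl2_input([0.3, 0.4], 1, 1.0, False, n_actions=2))
```

Because the done flag is part of the input and the hidden state persists across the episodes of a trial, the network can in principle use what it saw in earlier episodes of the same environment.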
Our environments are modified versions of four environments from the classic control problems in OpenAI Gym (Brockman et al., 2016) (CartPole, MountainCar, Acrobot, and Pendulum) and two environments from OpenAI Roboschool (Schulman et al., 2017) (HalfCheetah and Hopper) that are based on the corresponding MuJoCo (Todorov et al., 2012) environments. CartPole, MountainCar, and Acrobot have discrete action spaces, while the others have continuous action spaces. We alter the implementations to allow control of several environment parameters that affect the transition probability distribution functions of the corresponding MDPs. Each of the six environments has three versions, with $d$ parameters allowed to vary.
Deterministic (D): The parameters of the environment are fixed at the default values in the implementations from Gym and Roboschool. Every time the environment is reset, only the state is reset.
Random (R): Every time the environment is reset, the parameters are uniformly sampled from a $d$-dimensional box containing the default values. This is done by independently sampling each parameter uniformly from an interval containing the default value.
Extreme (E): Every time the environment is reset, its parameters are uniformly sampled from $2^d$ $d$-dimensional boxes anchored at the vertices of the box in R. This is done by independently sampling each parameter uniformly from the union of two intervals that straddle the corresponding interval in R.
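The three sampling schemes can be sketched per parameter as below. This is an illustrative sketch only: the function names, the spec layout, and every numeric range in `EXAMPLE_SPECS` are hypothetical placeholders, not the ranges used in the benchmark (those are given in Appendix A). For E, an interval is chosen with probability proportional to its length so that the draw is uniform over the union of the two intervals.

```python
import random


def sample_parameter(version, default, r_interval, e_intervals, rng=random):
    """Draw one environment parameter for version 'D', 'R', or 'E'."""
    if version == "D":
        return default  # fixed at the default value
    if version == "R":
        low, high = r_interval
        return rng.uniform(low, high)
    if version == "E":
        (low1, high1), (low2, high2) = e_intervals
        len1, len2 = high1 - low1, high2 - low2
        # Pick an interval in proportion to its length, then sample inside it.
        if rng.random() < len1 / (len1 + len2):
            return rng.uniform(low1, high1)
        return rng.uniform(low2, high2)
    raise ValueError(f"unknown environment version: {version!r}")


def sample_environment(version, specs, rng=random):
    """Sample every parameter independently, as described for R and E."""
    return {name: sample_parameter(version, *spec, rng=rng)
            for name, spec in specs.items()}


# Hypothetical ranges: (default, R interval, (lower E interval, upper E interval)).
EXAMPLE_SPECS = {
    "force": (10.0, (5.0, 15.0), ((1.0, 5.0), (15.0, 20.0))),
    "length": (0.5, (0.25, 0.75), ((0.05, 0.25), (0.75, 1.0))),
}

if __name__ == "__main__":
    for version in ("D", "R", "E"):
        print(version, sample_environment(version, EXAMPLE_SPECS))
```

Calling `sample_environment` once per reset mirrors the description above: D always returns the defaults, while R and E return a fresh draw each time.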
Appendix A contains a schematic of the parameter ranges in D, R, and E when . We now describe the environments.
CartPole (Barto et al., 1983). A pole is attached to a cart that moves on a frictionless track. For at most time steps, the agent pushes the cart either left or right with the goal of keeping the pole upright. There is a reward of for each time step the pole is upright, with the episode ending when the angle of the pole is too large. Three environment parameters can be varied: (1) push force magnitude, (2) pole length, (3) pole mass.
MountainCar (Moore, 1990). The goal is to move a car to the top of a hill within time steps. At each time step, the agent pushes a car left or right, with a reward of . Two environment parameters can be varied: (1) push force magnitude, (2) car mass.
Acrobot (Sutton, 1995). The acrobot is a two-link pendulum attached to a bar with an actuator at the joint between the two links. At each time step, the agent applies torque (to the left, to the right, or not at all) to the joint in order to swing the end of the second link above the bar to a height equal to the length of the link. The reward system is the same as that of MountainCar, but with a maximum of time steps. We have required that the links have the same parameters, with the following three allowed to vary: (1) length, (2) mass, (3) moment of inertia.
Pendulum. The goal is to, for time steps, apply a continuous-valued force to a pendulum in order to keep it at a vertical position. The reward at each time step is a decreasing function of the pendulum’s angle from vertical, the speed of the pendulum, and the magnitude of the applied force. Two environment parameters can be varied, the pendulum’s: (1) length, (2) mass.
HalfCheetah. The half-cheetah is a bipedal robot with eight links and six actuated joints corresponding to the thighs, shins, and feet. The goal is for the robot to learn to walk on a track without falling over by applying continuous-valued forces to its joints. The reward at each time step is a combination of the progress made and the costs of the movements, e.g., electricity and penalties for collisions, with a maximum of time steps. Three environment parameters can be varied: (1) power, a factor by which the forces are multiplied before application, (2) torso density, (3) sliding friction of the joints.
Hopper. The hopper is a monopod robot with four links arranged in a chain corresponding to a torso, thigh, shin, and foot and three actuated joints. The goal, reward structure, and parameters are the same as those of HalfCheetah.
In all environments, the difficulty may depend on the values of the parameters; for example, in CartPole, a very light and long pole would be more difficult to balance. Therefore, the structure of the parameter ranges in R and E was constructed to include environments of various difficulties. The actual ranges of the parameters for each environment were chosen by hand and are listed in Appendix A. [Footnote 2: The ranges of the parameters were chosen so that a policy trained using PPO on D struggles quite a bit on the environments corresponding to the vertices of the box in R and fails completely on the environments corresponding to the most extreme vertices of the boxes in E.]
In sum, we benchmark six algorithms (A2C, PPO, EPOpt-A2C, EPOpt-PPO, RL²-A2C, RL²-PPO) and six environments (CartPole, MountainCar, Acrobot, Pendulum, HalfCheetah, Hopper). With each pair of algorithm and environment, we consider nine training-testing scenarios: training on D, R, and E and testing on D, R, and E. We refer to each scenario using the two-letter abbreviation of the training and testing environment versions, e.g., DR for training on D and testing on R. For A2C, PPO, EPOpt-A2C, and EPOpt-PPO, we train for episodes and test on episodes. For RL²-A2C and RL²-PPO, we train for trials, equivalent to episodes, and test on the last episodes of trials. Note that this is a fair comparison as policies without memory of previous episodes are expected to have the same performance in any episode of a trial, and we are able to evaluate the ability of RL²-A2C and RL²-PPO to adapt their policy to the environment parameters of the current trial. For the sake of completeness, we do a thorough sweep of hyperparameters and randomly generate random seeds. We report results over several runs of the entire hyperparameter sweep (the only difference being the random seeds). In the following paragraphs we describe the network architectures for the policy and value functions, our hyperparameter search, and the performance metrics we use for evaluation.
Policy and value function parameterization. We consider two network architectures for the policy and value functions. In the first, following Henderson et al. (2018), the policy and value function are multi-layer perceptrons (MLPs) with two hidden layers of units each and hyperbolic tangent activations; there is no parameter sharing. We refer to this architecture as FF (feed-forward). In the second, the policy and value functions are the outputs of two separate fully-connected layers on top of a one-hidden-layer RNN with long short-term memory (LSTM) cells of units. [Footnote 3: Based on personal communication with an author of Duan et al. (2016b).] The RNN itself is on top of an MLP with two hidden layers of units each, which we call the feature network. Again, hyperbolic tangent activations are used throughout; we refer to this architecture as RC (recurrent). For A2C, PPO, EPOpt-A2C, and EPOpt-PPO, we evaluate both architectures, while for RL²-A2C and RL²-PPO (which require recurrent networks), we evaluate only the second architecture. In all cases, for discrete action spaces, policies sample actions by taking a softmax function over the policy network output layer; for continuous action spaces, actions are sampled from a Gaussian distribution whose mean is the policy network output layer and whose diagonal covariance matrix has entries that are learned along with the policy and value function network parameters.
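The two action-sampling rules just described can be sketched with the standard library alone. This is an illustration, not the OpenAI Baselines code: the function names are hypothetical, and representing the learned diagonal covariance by per-dimension log standard deviations is an assumption made for the sketch.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def sample_discrete(logits, rng=random):
    """Sample an action index with probabilities softmax(logits)."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def sample_continuous(mean, log_std, rng=random):
    """Sample each action dimension independently from N(mean_i, exp(log_std_i)**2).

    Independent per-dimension draws correspond to a diagonal covariance
    matrix; the log-std parameterization is an assumed convention.
    """
    return [rng.gauss(m, math.exp(s)) for m, s in zip(mean, log_std)]


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_discrete([0.0, 1.0, 2.0], rng))
    print(sample_continuous([0.0, 1.0], [-1.0, -1.0], rng))
```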
Hyperparameters. During training, in each algorithm and each version of each environment, we performed grid search over a set of hyperparameters used in the optimizers, and selected the value with the highest success probability when tested on the same version of the environment. The set of hyperparameters includes the learning rate for all algorithms and the length of the trajectory generated at each iteration (which we call batch size) for A2C, PPO, RL²-A2C, and RL²-PPO. They also include the coefficient of the policy entropy in the objective for A2C, EPOpt-A2C, and RL²-A2C and the coefficient of the KL divergence between the previous policy and current policy for RL²-PPO. The grid values are listed in Section B. In EPOpt-A2C and EPOpt-PPO, we sample environments per iteration and set first to and then after iterations. Other hyperparameters, such as the discount factor, were set to the default values in OpenAI Baselines.
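The selection procedure amounts to an exhaustive search over the Cartesian product of the grid values. The sketch below is a generic illustration: `grid_search`, the `evaluate` callback (standing in for "train with this configuration, then measure success on the same environment version"), and the example grid values are all hypothetical and do not reproduce the grids listed in Section B.

```python
from itertools import product


def grid_search(grid, evaluate):
    """Return (best_config, best_score) over all combinations in `grid`.

    grid:     dict mapping hyperparameter name -> list of candidate values
    evaluate: callable taking a config dict and returning a score to
              maximize (hypothetical stand-in for train-then-test)
    """
    names = sorted(grid)
    best_config, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


if __name__ == "__main__":
    # Toy objective in place of a real training run.
    toy_grid = {"learning_rate": [0.1, 0.01], "batch_size": [1, 2]}
    print(grid_search(toy_grid, lambda c: c["batch_size"] - c["learning_rate"]))
```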
Performance metrics. The traditional performance metric used in the RL literature is the average total reward achieved by the policy in an episode. In the spirit of the definition of an RL agent as goal-seeking (Sutton & Barto, 2017) and to obtain a metric independent of reward shaping, we also compute the percentage of episodes in which a certain goal is successfully completed, the success rate. This additional metric is a clear and interpretable way to compare performance across conditions and environments. We define the goals of each environment as follows: CartPole: balance for at least time steps, MountainCar: get to the hilltop within time steps, Acrobot: swing the end of the second link to the desired height within time steps, Pendulum: keep the angle of the pendulum at most radians from vertical for the last time steps of a trajectory with length , HalfCheetah and Hopper: walk for meters.
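The success rate is simply the percentage of test episodes that satisfy the environment's goal. A minimal sketch follows; the function name, the per-episode summary format, and the threshold used in the example are hypothetical, not the goal values used in the benchmark.

```python
def success_rate(episode_summaries, goal_reached):
    """Percentage of episodes for which goal_reached(summary) is true.

    episode_summaries: one summary value per test episode (for example,
                       the number of time steps survived)
    goal_reached:      predicate encoding the environment's goal
    """
    if not episode_summaries:
        raise ValueError("need at least one episode")
    successes = sum(1 for summary in episode_summaries if goal_reached(summary))
    return 100.0 * successes / len(episode_summaries)


if __name__ == "__main__":
    # Hypothetical balancing goal: survive at least 195 time steps.
    lengths = [200, 150, 200, 90]
    print(success_rate(lengths, lambda steps: steps >= 195))
```

Keeping the goal as a separate predicate means the same function serves every environment, and the metric does not depend on how the reward is shaped.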
We highlight some of the key findings and present a summary of the experimental results here, concentrating on the binary success metric. For each algorithm, architecture, and environment, we compute three numbers. (1) Default: success percentage on DD (the classic RL setting). (2) Interpolation: success percentage on RR. (3) Extrapolation: geometric mean of the success percentages on DR, DE, and RE. Table 1 summarizes the results. Section C contains analogous tables for each environment, which will be referred to in the following discussion.
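The three summary numbers can be computed from a table of success percentages keyed by the two-letter scenario names. The sketch below is illustrative: the function name, the dictionary layout, and the example percentages are made up for demonstration and are not results from the benchmark.

```python
import math


def summarize(success):
    """Compute Default, Interpolation, and Extrapolation from scenario results.

    success: dict mapping a scenario such as 'DR' (train on D, test on R)
             to its success percentage
    """
    extrapolation_scenarios = ("DR", "DE", "RE")
    product = math.prod(success[name] for name in extrapolation_scenarios)
    return {
        "Default": success["DD"],
        "Interpolation": success["RR"],
        # Geometric mean: a zero on any one scenario gives an overall zero.
        "Extrapolation": product ** (1.0 / len(extrapolation_scenarios)),
    }


if __name__ == "__main__":
    made_up = {"DD": 100.0, "RR": 90.0, "DR": 64.0, "DE": 27.0, "RE": 8.0}
    print(summarize(made_up))
```

Compared with an arithmetic mean, the geometric mean penalizes an algorithm that does well on one extrapolation scenario but fails on another.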
Table 1: Generalization performance (in % success) of each algorithm, averaged over all environments (mean and standard deviation over five runs).
A2C and PPO. With the FF architecture, the two ‘vanilla’ deep RL algorithms are often successful on the classic RL setting of training and testing on a fixed environment, as evidenced by the high values for Default. However, when agents trained on environment version D are tested on R and E, they usually suffer a significant drop in performance on R and an even further drop on E. When an algorithm is successful in the classic RL setting, as for PPO with the FF architecture, which has a Default number of , it is able to interpolate. (Interpolation equals in that case.) That is, simply training on a distribution of environments, without any special mechanism for generalization, results in agents that can perform fairly well in similar environments. However, as expected, in general they are less successful at extrapolation; PPO with the FF architecture has an Extrapolation number of . A2C with either architecture shows similar behavior to PPO with the FF architecture, while PPO with the RC architecture had difficulty training on the classic RL setting and did not generalize well. For example, on all the environments except CartPole and Pendulum the FF architecture was necessary for PPO to train a successful policy on DD. The pattern of decrease from Default to Interpolation to Extrapolation shown in Table 1 also appears when looking at each environment individually. The magnitude of decrease depends on the combination of algorithm, architecture, and environment. For instance, on CartPole, A2C interpolates and extrapolates successfully, where Interpolation equals and Extrapolation equals ; this behavior is also shown for PPO with the FF architecture. On the other hand, on Hopper, PPO with the FF architecture has success rate in the classic RL setting but struggles to interpolate (Interpolation equals ) and fails to extrapolate (Extrapolation equals ). This indicates that our choice of environments and their parameter ranges led to a variety of difficulty in generalization.
EPOpt. With the FF architecture, EPOpt-PPO improved both interpolation and extrapolation performance over PPO, as shown in Table 1. Looking at specific environments, on Hopper EPOpt-PPO has nearly twice the interpolation performance and significantly improved extrapolation performance compared to PPO. Such an improvement also appears for Pendulum. EPOpt-PPO, similar to PPO, generally did not benefit from using the recurrent architecture; this may be due to the LSTM requiring more data to train. EPOpt however did not demonstrate the same performance gains when combined with A2C. EPOpt-A2C was able to find limited success using the RC architecture on CartPole but for other environments failed to learn a working policy even in the Default setting.
RL². RL²-A2C and RL²-PPO proved to be difficult to train and data inefficient. This is most likely due to the RC architecture, as PPO also has difficulty training on D with that architecture as shown in Table 1. On most environments, the Default numbers are low, indicating that a working policy was not found in the classic RL setting of training and testing on a fixed environment. As a result, they also have low Interpolation and Extrapolation numbers. In a few, such as RL²-PPO on CartPole and RL²-A2C on HalfCheetah, a working policy was found in the classic RL setting, but the algorithm struggled to interpolate or extrapolate. A success story is RL²-A2C on Pendulum, where we have nearly success rate in DD, interpolate extremely well (Interpolation is ), and extrapolate fairly well (Extrapolation is ).
We observed that the partial success of these algorithms on the environments appears to be dependent on two implementation choices: the feature network in the RC architecture and the nonzero coefficient of the KL divergence between the previous policy and current policy in RL²-PPO, which is intended to help stabilize training.
We introduced a new testbed and experimental protocol to measure the generalization ability of deep RL algorithms, to environments both similar to and different from those seen during training. Such a testbed enables us to compare the relative merits of algorithms for learning generalizable RL agents. Our code, based on OpenAI Gym, will be made available online and we hope that it will support future research on generalization in deep RL. Using our testbed we have evaluated two state-of-the-art deep RL algorithms, A2C and PPO, and two algorithms that explicitly tackle the problem of generalization in different ways: EPOpt, which aims to generalize by being robust to environment variations, and RL, which aims to automatically adapt to environment variations.
Overall, the ‘vanilla’ deep RL algorithms have good generalization performance, being able to interpolate quite well with some extrapolation success. They are only outperformed by EPOpt when it is combined with PPO. RL², on the other hand, is difficult to train, and in its success cases provides no clear generalization advantage over the ‘vanilla’ deep RL algorithms or EPOpt. We have considered model-free RL in our evaluation; a clear direction for future work is to perform a similar evaluation for model-based RL, in particular recent work such as Sæmundsson et al. (2018) and Clavera et al. (2018). Because model-based RL explicitly learns the system dynamics and generally is more data efficient, it could be better leveraged by adaptive techniques for generalization.
This material is in part based upon work supported by the National Science Foundation under Grant No. TWC-1409915, DARPA under FA8750-17-2-0091, and Berkeley Deep Drive. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Table 2 details the parameter ranges for each environment and environment setting: Deterministic (D), Random (R), and Extreme (E). Figure 1 illustrates the ranges from which the parameters are sampled; the parameters for D are fixed within the range of R, and E is uniformly sampled from a range wider than R, excluding the intervals corresponding to R.
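The sampling scheme described above can be sketched for a single parameter as follows. This is a minimal illustration rather than the benchmark's own code: the function name `sample_parameter`, the numeric ranges in the usage note, and the choice to sample E uniformly over the union of the two outer intervals (weighted by length) are assumptions made for the example.

```python
import random


def sample_parameter(setting, default, r_range, e_range, rng=random):
    """Draw one environment parameter for setting 'D', 'R', or 'E'.

    default: fixed value used in D (assumed to lie inside r_range).
    r_range: (low, high) interval for R.
    e_range: (low, high) wider interval for E; the part that overlaps
             r_range is excluded.
    """
    r_lo, r_hi = r_range
    e_lo, e_hi = e_range
    if setting == "D":
        return default
    if setting == "R":
        return rng.uniform(r_lo, r_hi)
    if setting == "E":
        # E is the union of [e_lo, r_lo] and [r_hi, e_hi]. Pick a point
        # uniformly over the combined length of the two pieces (assumption).
        lower_len = r_lo - e_lo
        upper_len = e_hi - r_hi
        u = rng.uniform(0.0, lower_len + upper_len)
        if u < lower_len:
            return e_lo + u
        return r_hi + (u - lower_len)
    raise ValueError(f"unknown setting: {setting!r}")
```

For instance, with the hypothetical ranges `r_range=(0.8, 1.2)` and `e_range=(0.5, 1.5)`, calling `sample_parameter("E", 1.0, (0.8, 1.2), (0.5, 1.5))` returns a value in either [0.5, 0.8] or [1.2, 1.5], never strictly inside the R interval.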
The grid values we search over for each hyperparameter and each algorithm are listed below. In sum, the search space contains unique hyperparameter configurations for all algorithms on a single training environment ( training configurations), and each trained agent is evaluated on test settings ( total train/test configurations). We report results for runs of the full grid search, a total of experiments.
A2C, EPOpt-A2C with RC architecture, and RL²-A2C:
EPOpt-A2C with FF architecture:
PPO, EPOpt-PPO with RC architecture:
EPOpt-PPO with FF architecture:
A2C and RL²-A2C:
PPO and RL²-PPO:
Policy entropy coefficient:
KL divergence coefficient:
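The number of configurations in a grid search of this kind is the product of the number of values tried per hyperparameter. A minimal sketch of how such a grid can be enumerated is given below; the helper `expand_grid`, the key names, and all numeric values are hypothetical and are not the grid values used in the experiments.

```python
from itertools import product


def expand_grid(grid):
    """Return every combination of hyperparameter values as a list of dicts."""
    keys = sorted(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[k] for k in keys))]


# Hypothetical grid for illustration only; not the values used in the paper.
example_grid = {
    "learning_rate": [1e-4, 1e-3],
    "entropy_coef": [0.0, 0.01],
    "kl_coef": [0.0, 0.5],
}
configs = expand_grid(example_grid)
```

Here `len(configs)` is 2 × 2 × 2 = 8. Multiplying the number of configurations by the number of training environments and test settings gives the total number of train/test configurations.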
In order to elucidate the generalization behavior of each algorithm, here we present versions of Table 1 for each environment.
On MountainCar, several of the algorithms, including A2C with both architectures and PPO with the FF architecture, have greater success on Extrapolation than Interpolation, which is sometimes greater than Default (see Table 5). This is unexpected because Extrapolation combines the success rates of DR, DE, and RE, with E containing more extreme parameter settings, while Interpolation is the success rate of RR. To explain this phenomenon, we hypothesize that compared to R, E is dominated by easy parameter settings, e.g., those where the car is light but the force of the push is strong, allowing the agent to reach the top of the hill easily. In order to test this hypothesis, we create a heatmap of the reward achieved by A2C with the FF architecture trained on D and tested on R and E. We also investigated A2C with the RC architecture and PPO with the FF architecture, but because the heatmaps are qualitatively similar, we show only the heatmap for A2C with the FF architecture, in Figure 2. Referring to the structure in Figure 1, we see that the reward achieved by the policy is higher in the regions corresponding to E. Indeed, it appears that the largest regions of E are those with a large force, which enables the trained policy to push the car up the hill in less than time steps, achieving the goal set in Section 6. (Note that the reward is the negative of the number of time steps taken to push the car up the hill.)
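To make the comparison between the three summary numbers concrete, the sketch below derives them from the per-scenario success rates. The text states only that Extrapolation combines DR, DE, and RE; the geometric mean used here is an assumption for illustration, and the function name `summarize` and the input values are hypothetical.

```python
import math


def summarize(success):
    """Collapse per-scenario success rates into three summary numbers.

    success maps 'DD', 'DR', 'DE', 'RR', 'RE' (train setting followed by
    test setting) to a success rate in percent.
    """
    extrapolation_rates = [success[k] for k in ("DR", "DE", "RE")]
    # Assumption: combine the three rates with a geometric mean.
    extrapolation = math.prod(extrapolation_rates) ** (1.0 / 3.0)
    return {
        "Default": success["DD"],
        "Interpolation": success["RR"],
        "Extrapolation": extrapolation,
    }


# Made-up rates, chosen so that Extrapolation exceeds Interpolation.
summary = summarize({"DD": 60.0, "RR": 20.0, "DR": 8.0, "DE": 27.0, "RE": 64.0})
```

With these made-up rates, Extrapolation is about 24 while Interpolation is 20, which mirrors the ordering observed on MountainCar when E contains many easy parameter settings.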
This special case demonstrates the importance of considering a wide variety of environments when assessing the generalization performance of an algorithm; each environment may have idiosyncrasies that cause performance to be correlated with parameters. For example, Figure 3 shows a similar heatmap for A2C with the FF architecture on Pendulum, in which Interpolation is greater than Extrapolation. In this case, the policy trained on D struggles more on environments from E than on those from R, which bolsters our hypothesis.
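A heatmap of this kind can be built by binning test episodes by their two sampled environment parameters and averaging the reward in each cell. The sketch below is one way to do this with the standard library; the function name `mean_reward_grid`, the episode tuple layout, and the bin edges are assumptions for illustration and do not describe how the figures in this paper were produced.

```python
from bisect import bisect_right


def mean_reward_grid(episodes, x_edges, y_edges):
    """Average reward per cell of a 2D parameter grid.

    episodes: iterable of (x_param, y_param, reward) tuples.
    x_edges, y_edges: sorted bin edges; bins are half-open [lo, hi).
    Cells without any episode are reported as None.
    """
    nx, ny = len(x_edges) - 1, len(y_edges) - 1
    sums = [[0.0] * ny for _ in range(nx)]
    counts = [[0] * ny for _ in range(nx)]
    for x, y, reward in episodes:
        i = bisect_right(x_edges, x) - 1
        j = bisect_right(y_edges, y) - 1
        if 0 <= i < nx and 0 <= j < ny:  # skip points outside the grid
            sums[i][j] += reward
            counts[i][j] += 1
    return [[sums[i][j] / counts[i][j] if counts[i][j] else None
             for j in range(ny)]
            for i in range(nx)]
```

The resulting nested list can be passed to any plotting routine; cells corresponding to E and R can then be compared visually against the layout in Figure 1.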
To investigate the effect of EPOpt and RL² and the different environment versions on training, we plotted the training curves for PPO, EPOpt-PPO, and RL²-PPO on each version of each environment, averaged over the five experiment runs and showing error bands based on the standard deviation over the runs. Training curves for all algorithms and environments are available at the following link: http://www.github.com/sunblaze-ucb/rl-generalization. We observe that in the majority of cases training appears to be stabilized by the increased randomness in the environments in R and E, including situations where successful policies are found. This behavior is particularly apparent for CartPole, whose training curves are shown in Figure 4 and in which all five algorithms above are able to find at least partial success. We see that especially towards the end of the training period, the error bands for training on E are narrower than those for training on D or R. Except for EPOpt-PPO with the FF architecture, the error bands for training on D appear to be the widest. Indeed, RL²-PPO is very unstable when trained on D, possibly because the more expressive policy network overfits to the generated trajectories.
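The averaged curves and error bands described here can be computed pointwise across runs. The following sketch assumes each run is logged as an equal-length sequence of returns and uses the sample standard deviation; the function name `curve_with_band` and both of these choices are assumptions, since the text does not specify the exact estimator.

```python
from statistics import mean, stdev


def curve_with_band(runs):
    """Pointwise mean and one-standard-deviation band across runs.

    runs: list of equal-length sequences, one per experiment run
    (at least two runs are needed for the sample standard deviation).
    Returns (centre, lower, upper) lists of the same length as each run.
    """
    steps = list(zip(*runs))  # regroup values by training step
    centre = [mean(s) for s in steps]
    spread = [stdev(s) for s in steps]
    lower = [c - d for c, d in zip(centre, spread)]
    upper = [c + d for c, d in zip(centre, spread)]
    return centre, lower, upper
```

A narrower gap between `lower` and `upper` late in training corresponds to the more stable behavior noted for training on E.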
The above link also contains videos of the trained agents of one run of the experiments for all environments and algorithms. We include the five scenarios considered in computing Default, Interpolation, and Extrapolation: DD, DR, DE, RR, and RE. Using HalfCheetah as a case study, we describe some particularly interesting behavior we saw.
A trend we noticed across several algorithms was a similar change in the cheetah’s gait that seems to be correlated with the difficulty of the environment. The cheetah’s gait became forward-leaning when trained on the Random and Extreme environments, and remained relatively flat in the agents trained on the Deterministic environment (see Figures 5 and 6). We hypothesize that the forward-leaning gait developed to counteract conditions in the R and E settings. The agents with the forward-leaning gait were able to recover from face planting (as seen in the second row of Figure 5), as well as maintain balance after violent leaps likely caused by settings with unexpectedly high power. In addition to becoming increasingly forward-leaning, the agents’ gait also tended to become stiffer in the more extreme settings, developing a much shorter, twitching stride. Though it reduces the agents’ speed, a shorter, stiffer stride appears to make the agent more resistant to adverse settings that would cause an agent with a longer stride to fall. This example illustrates how training on a range of different environment configurations may encourage policies that are more robust to changes in system dynamics at test time.