Dynamics generalization in deep reinforcement learning (RL) studies the problem of transferring an RL agent’s policy from training environments to settings with unseen system dynamics or structures, such as the layout of a maze or the physical parameters of a robot Nagabandi et al. (2018); Lee et al. (2020). Although recent advances in deep reinforcement learning have enabled agents to perform tasks in complex training environments, dynamics generalization remains a challenging problem Rajeswaran et al. (2017); Henderson et al. (2018).
Training policies that are robust to unseen environment dynamics has several merits. First and foremost, an agent trained in an ideal setting may be required to perform in more adversarial circumstances, such as increased obstacles, darker lighting and rougher surfaces. Secondly, it may enable efficient sim-to-real policy transfer Tobin et al. (2017), as the agent may quickly adapt to the differences in dynamics between the training environment and the testing environment. Lastly, an information bottleneck naturally divides a model into its encoder and controller components, improving the interpretability of end-to-end RL policies, which have traditionally been treated as black boxes.
In this work, we consider the problem of dynamics generalization from an information-theoretic perspective. Studies in the field of information bottleneck have shown that generalization of deep neural networks in supervised learning can be measured and improved by controlling the amount of information flow between layers Tishby and Zaslavsky (2015); in this paper, we hypothesize that the same can be applied to reinforcement learning. In particular, we show that the poor generalization in unseen tasks is due to the DNNs memorizing environment observations, rather than extracting the relevant information for a task. To prevent this, we impose communication constraints as an information bottleneck between the agent and the environment. Such a bottleneck limits the information flow between observations and representations, thus encouraging the encoder to extract only relevant information from the environment and preventing memorization.
A joint optimization of encoder and policy with an information bottleneck is a challenging problem because, in general, the separation principle Witsenhausen (1971) is not applicable. The separation principle allows one to estimate the state from observations (and, under certain conditions, to compress the observations Tanaka et al. (2017)) and then to derive a policy. In cases where the separation principle is not applicable, a joint optimization of encoder and policy can be seen as a ’chicken and egg’ problem: to derive an optimal policy one needs a meaningful state representation, which in turn depends on the performance of the policy.
Our main contributions are as follows. Firstly, we tackle the problem of poor generalization of DRL to unseen tasks by applying an information bottleneck between observations and state representations (see Figure 1). Specifically, we find a stochastic mapping from observations to internal representations, and regularize this mapping to limit the amount of information flow. Secondly and most significantly, we propose an annealing scheme for a stable joint optimization of the encoder and policy components of the network, finding a family of solutions parameterized by the weight of the information constraint. Thirdly, we demonstrate that policies trained with an information bottleneck achieve significantly better performance on tasks with unseen layouts, goals and dynamics, compared to standard DRL methods. Finally, we demonstrate that our method produces state representations which admit a semantic interpretation, which is in general not guaranteed for end-to-end DRL. Specifically, we demonstrate that the encoder in our approach maps stochastic observations to a space where distances between points are consistent with their values from the optimal critic.
Our proposed method is general and can be integrated with most state-of-the-art reinforcement learning architectures. A version of our method based on a PyTorch baseline is published and available at github.com/anonymous.
2 Related Work
There is a series of previous works that address the problem of control with information bottlenecks. Borkar and Mitter (1997) is one of the first works in this direction, studying the effects of state compression in the case of linear and known dynamics. Specifically, they showed that in the case of the Linear Quadratic Regulator, there exists an optimal compression scheme of state observations. Subsequent works Tatikonda and Mitter (2004); Tatikonda et al. (2004); Tanaka et al. (2017); Tiomkin and Tishby (2017) studied the optimality of compression schemes under different assumptions, although all of them assumed known dynamics and did not consider the information bottleneck for its generalization benefits.
Recently, it was shown that an information bottleneck improves generalization in adversarial inverse reinforcement learning Peng et al. (2019). By placing a bottleneck on the discriminator of a GAN, the authors effectively balance the performance of the discriminator and the generator to provide more meaningful gradients. This work, however, focuses strictly on imitation learning, and does not consider any online learning setting involving long-horizon planning.
Another relevant work is Pacelli and Majumdar (2020), where the information bottleneck is estimated and optimized through separate MINE estimators Belghazi et al. (2018) at each time step. While this work also tackles the problem of generalization, it only focuses on image-based environments with changing textures, without considering changing environment goals or dynamics. Additionally, the use of separate MINE estimators at each time step may limit the scalability of the method for long-horizon problems. Our work, in contrast, trains a single encoder whose information is regularized without any explicit estimators, and we focus on dynamics randomization problems with changing environment layouts and parameters.
Finally, in Goyal et al. (2019), the information bottleneck between actions and goals is studied with the aim of creating goal-independent policies. While both Goyal et al. (2019) and our work utilize the variational approximation of the upper bound on the mutual information, their work focuses on finding high-information states for more efficient exploration, which is a different objective from ours.
3.1 Markov Decision Process and Reinforcement Learning
This paper assumes a finite-horizon Markov Decision Process (MDP) Puterman (1994), defined by a tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma, T)$. Here, $\mathcal{S}$ denotes the state space (which could either be noisy observations or raw internal states), $\mathcal{A}$ denotes the action space, $p$ denotes the state transition distribution, $r$ denotes the reward function, $\gamma$ is the discount factor, and finally $T$ is the horizon. At each step $t$, the action $a_t$ is sampled from a policy distribution $\pi_\theta(a_t \mid s_t)$, where $s_t \in \mathcal{S}$ and $\theta$ is the policy parameter. After transiting into the next state by sampling from $p(s_{t+1} \mid s_t, a_t)$, where $s_{t+1} \in \mathcal{S}$, the agent receives a scalar reward $r(s_t, a_t)$. The agent continues performing actions until it enters a terminal state or reaches the horizon, at which point it has completed one episode. We let $\tau$ denote the sequence of states that the agent enters in one episode.
With this definition, the goal of RL is to learn a policy $\pi_\theta$ that maximizes the expected discounted reward $J(\theta) = \mathbb{E}_{\tau, s_0}\left[\sum_{t=0}^{T} \gamma^{t} r(s_t, a_t)\right]$, where the expectation is taken over the possible trajectories $\tau$ and the starting states $s_0$. In this paper, we assume model-free learning, meaning the agent does not have access to the environment dynamics $p$.
To study dynamics generalization, we further focus on context-conditional environments, which correspond to an MDP distribution parameterized by a context variable $c$. Here $c$ could range from a robot’s density to the coefficient of friction between any two surfaces. For each context $c$, the MDP adopts a specific state transition distribution $p_c(s_{t+1} \mid s_t, a_t)$, and the agent now aims to learn a policy that maximizes the reward given a particular context. Here, $c$ is directly provided to the agent as an oracle. Our goal is to train on a distribution of contexts $p_{\text{train}}(c)$, and evaluate the agent’s generalization performance on unseen contexts $c \sim p_{\text{test}}(c)$.
3.2 Mutual Information
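Mutual information (MI) measures the amount of information obtained about one random variable after observing another random variable Cover and Thomas (2012). Formally, given two random variables $X$ and $Y$ with joint distribution $p(x, y)$ and marginal densities $p(x)$ and $p(y)$, their MI is defined as the KL-divergence between the joint density and the product of the marginal densities:
$$I(X; Y) = D_{\mathrm{KL}}\left( p(x, y) \,\|\, p(x)\,p(y) \right) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \, dx \, dy.$$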
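As a concrete illustration of this definition, the following minimal sketch computes the mutual information of two discrete random variables directly as the KL-divergence between the joint table and the product of its marginals. The discrete setting, the function name `mutual_information`, and the two toy joint tables are assumptions made for illustration; the definition above is stated for densities.

```python
import math


def mutual_information(joint):
    """Mutual information (in nats) of a discrete joint distribution.

    `joint[i][j]` is P(X = i, Y = j); the entries are assumed to sum to 1.
    """
    p_x = [sum(row) for row in joint]        # marginal of X (row sums)
    p_y = [sum(col) for col in zip(*joint)]  # marginal of Y (column sums)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p_xy in enumerate(row):
            if p_xy > 0.0:                   # 0 * log(0) is taken to be 0
                mi += p_xy * math.log(p_xy / (p_x[i] * p_y[j]))
    return mi


# Toy joint tables (illustrative only).
independent = [[0.25, 0.25], [0.25, 0.25]]  # X and Y are independent
copied = [[0.5, 0.0], [0.0, 0.5]]           # Y is an exact copy of X
```

For the independent table the value is zero, while for the table in which one variable copies the other it equals $\log 2$ nats, the entropy of a fair bit.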
4.1 Problem Definition
We consider an architecture in which the agent learns with limited information from the environment: instead of learning directly from the environment states $s$, the agent needs to estimate a noisy encoding $z$ of the state, whose information is limited by a bottleneck.
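Formally, we decompose the agent policy $\pi_\theta$ into an encoder $p_{\text{enc}}$ and a decoder $p_{\text{dec}}$ (action policy), where $\theta$ collects the parameters of both components. The encoder maps environment states $s_t$ into a stochastic embedding $z_t$, and the decoder outputs agent actions $a_t$:
$$\pi_\theta(a_t \mid s_t) = \int p_{\text{dec}}(a_t \mid z_t)\, p_{\text{enc}}(z_t \mid s_t)\, dz_t.$$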
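With this setup, we maximize the RL objective $J(\theta)$ with a constraint on the mutual information between the environment states $S$ and the embedding $Z$:
$$\max_{\theta} \; J(\theta) \quad \text{s.t.} \quad I(Z; S) \le I_c,$$
where $I_c$ denotes the information budget.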
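To estimate the mutual information between $S$ and $Z$, we make use of the following identity:
$$I(Z; S) = \mathbb{E}_{s}\left[ D_{\mathrm{KL}}\left( p_{\text{enc}}(z \mid s) \,\|\, p(z) \right) \right],$$
where $p_{\text{enc}}(z \mid s)$ is the encoder distribution and $p(z)$ is the marginal distribution of the embedding.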
In practice, we take samples of $D_{\mathrm{KL}}\left( p_{\text{enc}}(z \mid s) \,\|\, p(z) \right)$ to estimate the mutual information. While $p_{\text{enc}}(z \mid s)$ is straightforward to compute, calculating $p(z)$ requires marginalization across the entire state space $\mathcal{S}$, which is intractable in most non-trivial environments. Instead, we follow the method adopted in many recent works and introduce an approximator, $q(z)$, to replace $p(z)$ Peng et al. (2019); Goyal et al. (2019). A proof that this replacement yields an upper bound on the mutual information can be found in the Appendix.
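To make the variational bound concrete, the sketch below assumes a diagonal Gaussian encoder and a standard normal approximator $q(z) = \mathcal{N}(0, I)$, for which the KL term has a closed form. The choice of $q$ and the helper names (`sample_embedding`, `kl_to_standard_normal`, `information_upper_bound`) are illustrative assumptions, not a description of the released implementation.

```python
import math
import random


def sample_embedding(mean, log_var, rng):
    """Draw z = mean + std * eps with eps ~ N(0, I) (reparameterized sample)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def kl_to_standard_normal(mean, log_var):
    """Closed-form KL( N(mean, diag(exp(log_var))) || N(0, I) ) in nats."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mean, log_var))


def information_upper_bound(encoder_outputs):
    """Average the per-state KL terms over a batch of (mean, log_var) pairs.

    With q(z) assumed to be N(0, I), this batch average is a sample
    estimate of the variational upper bound on I(Z; S).
    """
    kls = [kl_to_standard_normal(m, lv) for m, lv in encoder_outputs]
    return sum(kls) / len(kls)


# Toy encoder outputs for two states (illustrative values only).
batch = [([0.0, 0.0], [0.0, 0.0]), ([1.0, 0.0], [0.0, 0.0])]
bound = information_upper_bound(batch)
z = sample_embedding([1.0, -1.0], [-100.0, -100.0], random.Random(0))
```

An encoder that outputs exactly the prior for every state contributes zero to the bound, which is the "no information" extreme; the further the per-state Gaussians move from $q(z)$, the larger the penalty.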
4.2 Unconstrained Lagrangian
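We introduce a Lagrangian multiplier $\beta$ and optimize on the upper bound of $I(Z; S)$ given by the approximator $q(z)$:
$$\max_{\theta} \; J(\theta) - \beta\, \mathbb{E}_{s}\left[ D_{\mathrm{KL}}\left( p_{\text{enc}}(z \mid s) \,\|\, q(z) \right) \right].$$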
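As discussed in Strouse et al. (2018), the gradient update at time $t$ is the policy gradient update with the modified reward, minus a scaled penalty given by the KL-divergence between state and embedding:
$$\tilde{R}_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) - \beta\, \nabla_\theta D_{\mathrm{KL}}\left( p_{\text{enc}}(z \mid s_t) \,\|\, q(z) \right),$$
where $\tilde{R}_t$ is the discounted reward until step $t$, and $\tilde{r}_t$ is the environment reward modified by the KL penalty: $\tilde{r}_t = r_t - \beta\, D_{\mathrm{KL}}\left( p_{\text{enc}}(z \mid s_t) \,\|\, q(z) \right)$.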
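The reward modification can be sketched as follows, assuming the per-step KL values have already been computed and using the common reward-to-go form of the discounted return. The function names, the toy numbers, and the reward-to-go convention are assumptions for illustration.

```python
def modified_rewards(rewards, kl_values, beta):
    """Subtract the scaled KL penalty from each environment reward."""
    return [r - beta * kl for r, kl in zip(rewards, kl_values)]


def discounted_returns(rewards, gamma):
    """Reward-to-go: G_t = r_t + gamma * G_{t+1}, computed backwards in time."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Toy episode (illustrative values only).
env_rewards = [1.0, 1.0, 1.0]
kl_per_step = [2.0, 4.0, 0.0]
penalized = modified_rewards(env_rewards, kl_per_step, beta=0.5)
returns = discounted_returns(penalized, gamma=0.5)
```

With a penalty coefficient of zero the modified rewards reduce to the environment rewards, so the same code path covers the unconstrained baseline.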
4.3 Annealing Scheme
We generate a family of solutions (optimal pairs of encoder and policy) parametrized by the information bottleneck constraint weight $\beta$. In our case, each solution is characterized by a correspondingly constrained amount of information required to maximize the environment rewards.
The rationale is as follows: to encourage the agent to extract relevant information from the environment, we want to impose a high penalty for passing too much information through the encoder. At the beginning of training, however, such a penalty produces gradients that offset the agent’s learning gradients, making it difficult for the agent to form good policies.
To tackle this problem, we create the entire family of solutions through annealing, starting from a deterministic (unconstrained) encoder and gradually injecting noise by increasing the penalty coefficient (temperature parameter) $\beta$.
This approach allows training of well-formed policies for much larger $\beta$ values, as the encoder has already learned to extract useful information from the environment and only needs to learn to "forget" more information as $\beta$ increases. In the experiment section, we demonstrate that training the model using annealing enables the agent to learn with much larger coefficients than training from scratch. In particular, Figure 6 shows an increase followed by a decrease in generalization benefits along the annealing curve.
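The annealing procedure can be sketched as a loop that warm-starts each stage from the previous solution. The geometric schedule, the number of stages, and the callables `train_stage` and `evaluate` are hypothetical placeholders; the text only specifies that the coefficient is increased gradually from an unconstrained start.

```python
def beta_schedule(beta_min, beta_max, num_steps):
    """Return [0, beta_min, ..., beta_max], geometrically spaced after the 0.

    The leading 0 corresponds to the unconstrained starting stage.
    """
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    ratio = (beta_max / beta_min) ** (1.0 / (num_steps - 1))
    return [0.0] + [beta_min * ratio ** k for k in range(num_steps)]


def anneal(train_stage, evaluate, params, betas):
    """Train one stage per beta, warm-starting from the previous stage."""
    family = []
    for beta in betas:
        params = train_stage(params, beta)  # continue from the last solution
        family.append({"beta": beta, "params": params,
                       "score": evaluate(params)})
    return family


# Toy stand-ins so the sketch runs: "params" just counts completed stages.
betas = beta_schedule(1e-4, 1e-1, 4)
family = anneal(lambda p, b: p + 1, lambda p: float(p), 0, betas)
```

Keeping every stage in `family` mirrors the idea of a family of solutions indexed by the constraint weight: the best trade-off can be selected afterwards by comparing the stored evaluation scores.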
5 Experiment Results
In this section, we apply the approaches described in Section 4 to discrete maze environments and various control environments. In doing so, we aim to answer the following questions:
How effectively can we learn a policy with information bottleneck through annealing?
How well can a policy trained end-to-end with an information bottleneck transfer to new, unseen structure or dynamics?
MiniGrid environments Chevalier-Boisvert et al. (2018) are used for the primary discrete experiments. To validate the results statistically, we randomly generate and sample maze environments of the same size to test the agent’s ability to transfer to new layouts. The fixed layout and examples of the randomly generated layouts are shown in Figure 2.
For each transfer experiment, we randomly sample 4 mazes, 3 of which are used for the training set and 1 for testing. Specifically, we train a policy using the training set, then retrain it on the unseen maze to assess how fast the model learns the new maze layout. Figure 3 shows the learning curves of three different setups: learning with a tight information bottleneck (); learning with a loose information bottleneck ( as ablation); learning with full information ( and deterministic encoder as baseline). As the plot shows, learning with a tight information bottleneck achieves the best transfer learning result, reaching a near-optimal solution of 0.9 mean reward around 2 times faster than the baseline. The close performance between the baseline and the ablation suggests the benefit of generalization only emerges as we tighten the information bottleneck.
Furthermore, we demonstrate that the code learned through the information bottleneck captures structured information about the maze. Figure 3 illustrates the projection of every state’s embedding (after convergence) onto 2D space through t-SNE, with each point colored by its critic value. From the projection plot, we observe the emergence of consistent value gradients as well as local clustering by actions.
The CartPole environment consists of a pole attached to a cart sliding on a frictionless surface. The pole is free to swing around the connection point to the cart, and the environment goal is to move the cart either left or right to keep the pole upright. The agent obtains a reward of 1 for keeping the pole upright at each time step, and can achieve a maximum reward of 200 over the entire episode. Should the pole deviate by more than 12 degrees from the vertical, the episode terminates early.
The CartPole environment is configured to have 2 discrete actions: moving left or right at each time step. For this environment, we vary two environment parameters: the magnitude of the cart’s push force, and the length of the pole. The push force affects the cart’s movement at each time step, while the length of the pole affects its torque. We provide limited randomization during training compared to the configurations in Packer et al. (2018): we range push forces from to , and the pole length from to . For evaluation, we consider a much wider range as well as extreme values: we first test the policy’s performance on push forces ranging from to and pole lengths from to ; then, we test on extremely large values of push forces (, ) and pole lengths (, , ) to assess the policy’s stability. While push force is difficult to visualize, Figure 4 illustrates the different pole lengths used for training and evaluation.
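The evaluation protocol described above, in which every push force is paired with every pole length and test configurations are distinguished from those seen in training, can be sketched as below. All numeric values are hypothetical placeholders chosen only to exercise the code, and the rule that a configuration counts as unseen when either parameter lies outside its training range is an assumption.

```python
from itertools import product


def evaluation_grid(forces, lengths, train_forces, train_lengths):
    """Pair every force with every length and flag unseen configurations."""
    def inside(value, bounds):
        low, high = bounds
        return low <= value <= high

    grid = []
    for force, length in product(forces, lengths):
        seen = inside(force, train_forces) and inside(length, train_lengths)
        grid.append({"force": force, "length": length, "unseen": not seen})
    return grid


def mean_unseen_reward(grid, rewards):
    """Average per-configuration rewards over unseen configurations only."""
    unseen = [r for cfg, r in zip(grid, rewards) if cfg["unseen"]]
    return sum(unseen) / len(unseen)


# Hypothetical grid and rewards (not the settings used in the experiments).
grid = evaluation_grid([1.0, 2.0, 3.0], [1.0, 2.0], (1.0, 2.0), (1.0, 1.0))
rewards = [200.0, 100.0, 200.0, 100.0, 50.0, 50.0]
score = mean_unseen_reward(grid, rewards)
```

Separating seen from unseen cells in this way keeps training-range performance from inflating the reported generalization score.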
As illustrated in Figure 5, both the baseline and our approach achieve good training performance; the baseline, however, fails to generalize to unseen pole lengths, while our method produces a policy that adapts to almost all test configurations. The difference in generalization to unseen dynamics between the baseline and our approach showcases the power of the information bottleneck: by limiting the amount of information flow between observation and representation, we force the DNN to learn a general representation of the environment dynamics that can be readily adapted to unseen values.
A policy trained with a well-tuned bottleneck performs well even in extreme configurations. For the extreme ranges (force and pole length ), the agent trained with a bottleneck achieves optimal reward (> 195) on all configurations. The plot for this result is provided in the Appendix.
Next, we demonstrate the generalization benefits of our method in the HalfCheetah environment. In this environment, a bipedal robot with 6 joints and 8 links imitates a 2D cheetah, and its goal is to learn to move in the positive direction without falling over. The environment reward is a combination of its velocity in the positive direction and the cost of its movement (in the form of an L2 cost on the action). An illustration of the environment is provided in Figure 7.
The environment has continuous actions corresponding to the force values applied to its joints. Its dynamics are more complex than those of CartPole, making generalization a challenging task. Similar to Packer et al. (2018), we vary the torso density of the robot to change its movement dynamics. In particular, we vary the training density from to , and test the policy’s performance on density values ranging from to . As the robot’s actions correspond to forces, whose effects are linearly affected by density, policy extrapolation from the training parameters to the test parameters is extremely challenging.
While the performance of both the baseline and our method suffers outside of the training range, our method achieves significantly better reward when the density is low. Figure 8 better illustrates the performance difference between the baseline and our method: for most test configurations our method performs significantly better than the baseline, especially for density values that are lower than those seen in training. This again indicates better stability and generalization in the policy trained with an information bottleneck.
Finally, in the Humanoid environment (Figure 7) a human-like robot with 13 rigid links and 17 actuators freely moves on a flat surface. The goal is to move forward as quickly as possible, while keeping the cost of action low. The environment reward is the forward velocity of the center of the robot minus an L2 penalty on the action.
Similar to HalfCheetah, the environment has continuous actions corresponding to the force values applied to the robot’s joints. Another challenging environment, Humanoid tests a policy’s ability to generalize in a high-dimensional system. For our experiments, we scale both the robot’s mass and its joints’ damping factors from to , then test the policy’s performance on test mass and damping scales from to . Both of these parameters directly affect the impact of the robot’s actions on its movement.
The result for Humanoid is presented in Figure 9, where the average test reward (on unseen parameters only) across the different $\beta$ values is shown alongside the baseline reward. In particular, for a properly tuned bottleneck, our method achieves significantly better performance than the baseline: for a value of , the average test reward is around 30% higher than the baseline’s, signifying a substantial boost in generalization performance.
6 Conclusion and Future Work
In this work we proposed a principled way to improve generalization to unseen tasks in deep reinforcement learning, by introducing a stochastic encoder with an information bottleneck optimized through annealing.
Our results support the hypothesis that generalization in DRL can be improved by preventing explicit memorization of training environment observations. We showed that an explicit information bottleneck in the DRL cascade forces the agent to squeeze the minimum amount of information from the observation before the optimal solution is found, preventing it from overfitting to the training tasks. This led to much better generalization performance (for unseen maze layouts, unseen goals, and unseen dynamics) than baselines and other regularization techniques such as L2 penalty and dropout.
Practically, we showed that the suggested annealing scheme allowed the agent to find optimal encoder-decoder pairs under different information constraints, even for significant information compression that corresponds to very large $\beta$ values. This annealing scheme was designed to gradually inject noise into the encoder to reduce information (by gradually increasing $\beta$), while keeping a well-formed decoder (action policy) that receives meaningful RL gradients. This slow change in the value of $\beta$ is critical when an optimal joint solution for the encoder and decoder (action policy) is not guaranteed, as in cases where the separation principle is not satisfied.
Overall, we found significant generalization advantages of our approach over the baseline in the maze environment as well as in control environments such as CartPole, HalfCheetah, and Humanoid. A CartPole policy trained using an information bottleneck, for instance, was able to generalize to test parameters more than 10 times larger than the training parameters, clearly outperforming the baseline’s generalization performance.
A promising future direction for research is to rigorously study the properties of the representation space, which may contribute to improving the interpretability of representations in deep neural networks in general. One of the insights of this work was that the produced representation in the maze environments preserved critic value distances of the original states; the representation space was thus consistent with the planning space, allowing generalization over unseen layouts.
7 Broader Impact
Our work improves the generalization ability of RL agents to extreme unseen environment dynamics, and can contribute to current efforts to deploy RL agents in real-world circumstances. For instance, applying our method to an autonomous vehicle may boost its ability to navigate in extreme weather conditions, improving its safety for passengers; a household robot (e.g. a laundry-folding robot) may better serve people by adapting to variations in its task due to the complex nature of the real world; production robots may operate more efficiently by better handling misplaced materials or components. As our method is general and can be plugged into any RL architecture, it can potentially be employed in existing systems to further boost their ability to handle edge cases in their tasks.
However, our method’s focus on injecting noise into the agent may cause it to malfunction on rare occasions, due to the noisy encoder producing outlier codes. Thus, while we have demonstrated that in expectation our method achieves good generalization performance in extreme test settings, further studies in this direction with worst-case optimality guarantees in mind are required. One possibility is to significantly decrease stochasticity in the encoder at test time, which may decrease performance but will prevent outlier codes; another potential direction is to consider empowerment or other metrics as safety measures to prevent the agent from taking extreme actions.
This work was supported in part by NSF under grant NRI-#1734633 and by Berkeley Deep Drive.
- Belghazi et al. (2018) Mutual information neural estimation. In International Conference on Machine Learning, pp. 530–539.
- Borkar and Mitter (1997) LQG control with communication constraints. In Communications, Computation, Control, and Signal Processing, pp. 365–373.
- Chevalier-Boisvert et al. (2018) Minimalistic gridworld environment for OpenAI Gym. GitHub. https://github.com/maximecb/gym-minigrid
- Cover and Thomas (2012) Elements of information theory. John Wiley & Sons.
- Goyal et al. (2019) InfoBot: transfer and exploration via the information bottleneck. ICLR 2019.
- Haarnoja et al. (2017) Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1352–1361.
- Haarnoja et al. (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
- Henderson et al. (2018) Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Kostrikov (2018) PyTorch implementations of reinforcement learning algorithms. GitHub. https://github.com/ikostrikov/pytorch-a2c-ppo-acktr
- Lee et al. (2020) Context-aware dynamics model for generalization in model-based reinforcement learning. In ICML.
- Nagabandi et al. (2018) Learning to adapt in dynamic, real-world environments through meta-reinforcement learning. arXiv preprint arXiv:1803.11347.
- Pacelli and Majumdar (2020) Learning task-driven control policies via information bottlenecks. arXiv preprint arXiv:2002.01428.
- Packer et al. (2018) Assessing generalization in deep reinforcement learning. arXiv preprint arXiv:1810.12282.
- Peng et al. (2019) Variational discriminator bottleneck: improving imitation learning, inverse RL, and GANs by constraining information flow. ICLR 2019.
- Puterman (1994) Markov decision processes: discrete stochastic dynamic programming.
- Rajeswaran et al. (2017) Towards generalization and simplicity in continuous control. In Advances in Neural Information Processing Systems, pp. 6550–6561.
- Strouse et al. (2018) Learning to share and hide intentions using information regularization. In Advances in Neural Information Processing Systems, pp. 10249–10259.
- Tanaka et al. (2017) LQG control with minimum directed information: semidefinite programming approach. IEEE Transactions on Automatic Control 63 (1), pp. 37–52.
- Tatikonda and Mitter (2004) Control under communication constraints. IEEE Transactions on Automatic Control 49 (7), pp. 1056–1068.
- Tatikonda et al. (2004) Stochastic linear control over a communication channel. IEEE Transactions on Automatic Control 49 (9), pp. 1549–1561.
- Tiomkin and Tishby (2017) A unified Bellman equation for causal information and value in Markov decision processes. arXiv preprint arXiv:1703.01585.
- Tishby and Zaslavsky (2015) Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pp. 1–5.
- Tobin et al. (2017) Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 23–30.
- Witsenhausen (1971) Separation of estimation and control for discrete time systems. Proceedings of the IEEE 59 (11), pp. 1557–1566.
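9.1 Proof for Upper Bound on Mutual Information by Variational Approximator
Let $q(z)$ be a variational approximator of the marginal $p(z) = \int p(s)\, p_{\text{enc}}(z \mid s)\, ds$. Replacing $p(z)$ with $q(z)$ achieves an upper bound on $I(Z; S)$:
$$I(Z; S) = \mathbb{E}_{s}\left[ D_{\mathrm{KL}}\left( p_{\text{enc}}(z \mid s) \,\|\, q(z) \right) \right] - D_{\mathrm{KL}}\left( p(z) \,\|\, q(z) \right) \le \mathbb{E}_{s}\left[ D_{\mathrm{KL}}\left( p_{\text{enc}}(z \mid s) \,\|\, q(z) \right) \right],$$
where the inequality arises because of the non-negativity of the KL-divergence:
$$D_{\mathrm{KL}}\left( p(z) \,\|\, q(z) \right) \ge 0.$$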
9.2 Environment Descriptions
The agent is a point that can move horizontally or vertically in a 2-D maze structure. Each state observation is a compact encoding of the maze, with each layer containing information about the placement of the walls, the goal position, and the agent position respectively. The goal state is one in which the goal position and the agent position are the same. The agent obtains a positive reward of when it reaches the goal, and no reward otherwise.
The agent is a cart sliding on a frictionless horizontal surface with a pole attached to its top. The pole is free to swing about the cart, and at each time step the cart moves to the left or to the right to keep the pole in an upright position. Each state observation consists of four variables: the cart position, the cart velocity, the pole angle, and the pole velocity at the tip. The reward at every time step is 1, and the episode terminates when it reaches 200 steps in length or when the pole deviates from upright by more than 12 degrees.
The agent is a bipedal robot with 6 joints and 8 links imitating a 2D cheetah. The agent moves horizontally on a smooth surface, and its goal is to learn to move in the positive direction without falling over, by applying continuous forces to each individual joint. The state observations encode the robot’s position, velocity, joint angles, and joint angular velocities. The reward at each time is the robot’s velocity in the positive direction, , minus the action costs . Here, indicates the position of the robot at time , and is the robot’s action input.
The agent is a human-like robot with 13 rigid links and 17 actuators. The agent moves freely on a smooth surface, and its goal is to move in the forward direction as quickly as possible. Similar to HalfCheetah, its actions are continuous forces applied to each individual joint, and the state observations encode its position, velocity, joint angles, and joint angular velocities. The reward at each time step is its velocity () in the positive direction minus the action cost .
|Environment||Maze||CartPole||HalfCheetah||Humanoid|
|State Dimensions||(12, 12, 3)||(4,)||(18,)||(47,)|
|Action Dimensions||(4,)||(2,)||(6,)||(17,)|
9.3 Network Parameters and Hyperparameters for Learning
For all maze experiments we use standard A2C, and for all control experiments we use PPO. Our baseline is adopted from Kostrikov (2018), and we modify the code to add a stochastic encoder.
For the maze baseline, we use 3 convolutional layers with 2-by-2 kernels and channel sizes 16, 32, and 64 respectively. The convolutional layers are followed by a linear layer ("deterministic encoder") of hidden size 64. Finally, the actor and critic each use 1 linear layer of hidden size 64. We use Tanh activations between layers. For our approach, we add an additional linear layer after the convolutions to output the diagonal variance of the encoder to provide stochasticity.
For the CartPole baseline, we use 1 linear layer of hidden size 32, followed by an additional linear layer of hidden size 32 ("deterministic encoder"). The actor and critic each use 2 linear layers of hidden size 32. For our approach, we again add an additional linear layer of hidden size 32 after the first linear layer to output the diagonal variance for the stochastic encoder.
For HalfCheetah, we follow mostly the same architecture as for CartPole, except the hidden size is 128.
For the Humanoid baseline, we use 2 linear layers of hidden size 96, followed by an additional linear layer of hidden size 96 ("deterministic encoder"). The actor and critic each use 2 linear layers of hidden size 96. For our approach, we add an additional linear layer of hidden size 96 after the first 2 linear layers to output the diagonal variance.
9.4 Hyperparameter Selection
The most crucial hyperparameter is $\beta$, which determines the size of the information bottleneck. We evaluate the policy at even intervals during annealing to find the optimal representation and control policy for each $\beta$, and use these evaluations to determine the optimal value. For all other hyperparameters, we mostly followed the hyperparameters used in each environment’s respective baselines, with the exception of tuning the learning rates, batch size, and encoder dimension. The learning rate was tuned through random initialization and short training; batch size and encoder dimension were tuned through a binary sweep.
|value loss coef||0.5|
|max gradient norm||0.5|
|value loss coef||0.5|
|CartPole encoder dimension||32|
|HalfCheetah encoder dimension||128|
|max gradient norm||0.5|
|value loss coef||1|
9.5 Evaluation Results for Extreme Configurations in CartPole
We provide the evaluation grid for extreme configurations in CartPole in Figure 10.
9.6 Full Evaluation Results Along Annealing Curve for CartPole and HalfCheetah
We provide the full evaluation results for CartPole and HalfCheetah along their respective annealing curves in Figure 11 and Figure 12. For each set of plots, we demonstrate the increase in generalization performance due to tightening of the information bottleneck, followed by a sudden deterioration of the policy as the encoder loses too much information.