1 Introduction
Although reinforcement learning (RL) has enabled automatic solving of a variety of reward-driven environments, generalization in deep RL has recently become a prominent topic because trained policies have proven brittle across many environments.
One framework used to study RL generalization treats it as analogous to a classical supervised learning (SL) problem, i.e. train jointly on a finite "training set", and check performance on the "test set" as an approximation to the population distribution. Normally, statistical learning theory only provides measurements of generalization performance at the end of training, based on a probabilistic bounding approach using the complexity of the classifier.
This, however, ignores many trajectory-dependent factors. Many RL-specific parameters may affect RL performance during training, including the discount factor, action stochasticity, and network modifications such as Batch Normalization.
Trajectory-dependent methods of analysis (which include the direction of the gradient, gradient Lipschitz smoothness, and optimization landscapes) thus remain important, and seek to understand training behavior during gradient descent. However, one large caveat in RL is the inherent noisiness of evaluating the objective function, which can easily make landscape visualization such as Li et al. (2018); Ilyas et al. (2018) intractable and expensive for larger scale datasets such as CoinRun Cobbe et al. (2018). Instead, we must infer properties of the loss landscape and training from observed summary metrics. The main metric we use is the gradient cosine similarity between training and testing sets during training.
While there has been work on understanding what may happen to RL generalization when one hyperparameter is changed Jiang et al. (2015); Ahmed et al. (2018), the coupling effects of RL hyperparameters on generalization are not fully understood. One question raised is whether these factors are independent of each other: for instance, if we add entropy bonuses to improve generalization, should we ignore tuning other hyperparameters like the discount factor $\gamma$? Or, if the MDP family is more noisy and stochastic, should we ignore tuning parameters such as the minibatch size?
We provide separate experimental results measuring parts of the policy gradient optimization process, to show that:

Many hyperparameters are not orthogonal to each other with respect to generalization performance. In fact, the addition of one hyperparameter may completely change the monotonicity of another hyperparameter with respect to generalization.

Different regularizations do not affect the training process the same way, and gradient cosine similarity is one main metric to show their effects.
2 Notation and Methods
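To formalize our supervised learning analogy to the RL setup, let $\Theta$ be a distribution over parameters $\theta$ that parametrize an MDP family $\{\mathcal{M}_\theta : \theta \in \Theta\}$. Each $\theta$ parametrizes some state space, action space, reward, transitions, and observation function, with $\mathcal{M}_\theta = (\mathcal{S}_\theta, \mathcal{A}_\theta, r_\theta, \mathcal{T}_\theta, w_\theta)$. An appropriate train and test set can then be created by randomly sampling $\theta \sim \Theta$ and training or evaluating on $\mathcal{M}_\theta$.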
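For the sake of notational simplicity, we denote $s$ as the observation rather than $w(s)$. The standard policy gradient Williams (1992) without a discount factor, where $\tau_\theta = (s_1, a_1, \dots, s_T, a_T)$ is a trajectory of the policy $\pi_\phi$ in the MDP with parameter $\theta$, is the gradient with respect to the true objective:
(1) $\nabla_\phi R(\pi_\phi) = \mathbb{E}_{\theta \sim \Theta}\left[\mathbb{E}_{\tau_\theta}\left[\sum_{t=1}^{T} \nabla_\phi \log \pi_\phi(a_t \mid s_t) \, \sum_{t'=1}^{T} r_\theta(s_{t'}, a_{t'})\right]\right]$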
More recent RL algorithms such as PPO Schulman et al. (2017) optimize a surrogate objective $L^{PPO} = L^{CLIP} - c_1 L^{VF} + c_2 S$, where $L^{CLIP}$ is the clipped advantage-ratio objective, $L^{VF}$ is the value function error, and $S$ is the entropy.
These surrogate losses inherently affect our definition of the "gradient cosine similarity" (GCS) between training and test gradients, and what the GCS measures. With basic policy gradient this is simply the normalized dot product between the training-set policy gradient and the test-set policy gradient, which is unbiased, but with algorithms such as PPO, the GCS becomes biased. On datasets that practically require PPO, we minimize its subtle effects on the GCS by using the same algorithm hyperparameters for training and testing.
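As a minimal sketch of the metric defined in this paragraph, the GCS can be computed as a normalized dot product once both gradients have been flattened into vectors. The function name and the small `eps` guard are assumptions made for illustration and are not taken from the paper's code.

```python
import math


def gradient_cosine_similarity(train_grad, test_grad, eps=1e-12):
    """Normalized dot product between two flattened gradient vectors."""
    if len(train_grad) != len(test_grad):
        raise ValueError("gradients must have the same number of entries")
    dot = sum(a * b for a, b in zip(train_grad, test_grad))
    norm_train = math.sqrt(sum(a * a for a in train_grad))
    norm_test = math.sqrt(sum(b * b for b in test_grad))
    # eps (an assumption of this sketch) avoids division by zero
    # when one of the gradients vanishes.
    return dot / (norm_train * norm_test + eps)


# Toy vectors standing in for training-set and test-set policy gradients.
aligned = gradient_cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = gradient_cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposed = gradient_cosine_similarity([1.0, -1.0], [-1.0, 1.0])
```

In practice both gradients would come from the same network evaluated on rollouts from training and test MDPs, using identical algorithm hyperparameters as described above.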
We briefly explain some metrics and what they measure:

GCS: If the training gradient is more aligned with the true gradient, larger learning rates may be allowed. Furthermore, Santurkar et al. (2018) use GCS as an approximation of the Hessian and the smoothness of the training landscape; a GCS that oscillates too strongly suggests a sharp minimizer, which is bad for generalization.

Variance, Gradient Norm, and Weight Norm: The variance of the gradient estimate depends on both the batch size and the Lipschitz smoothness of the policy gradient with respect to policy parameters, i.e. $\|\nabla_\phi R(\pi_{\phi_1}) - \nabla_\phi R(\pi_{\phi_2})\| \le L \|\phi_1 - \phi_2\|$. Gradient Norm is a practical estimate of the Lipschitz constant between different total rewards, and is a rough estimate of the complexity of the policy.
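The dependence of gradient variance on batch size can be illustrated with a small simulation. This is a sketch under stated assumptions: the per-sample "gradients" are synthetic (a fixed vector plus Gaussian noise), and all function names are hypothetical rather than part of the paper's setup.

```python
import math
import random
import statistics


def minibatch_gradient(per_sample_grads, batch_size, rng):
    """Average a random minibatch of per-sample gradient vectors."""
    batch = rng.sample(per_sample_grads, batch_size)
    dim = len(batch[0])
    return [sum(g[i] for g in batch) / batch_size for i in range(dim)]


def gradient_variance(per_sample_grads, batch_size, n_trials, rng):
    """Total variance (summed over coordinates) of the minibatch estimate."""
    estimates = [minibatch_gradient(per_sample_grads, batch_size, rng)
                 for _ in range(n_trials)]
    dim = len(estimates[0])
    return sum(statistics.pvariance([e[i] for e in estimates])
               for i in range(dim))


def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))


rng = random.Random(0)
true_grad = [1.0, -2.0]  # synthetic stand-in for the true policy gradient
per_sample = [[g + rng.gauss(0.0, 1.0) for g in true_grad] for _ in range(512)]

var_small_batch = gradient_variance(per_sample, 4, 200, rng)
var_large_batch = gradient_variance(per_sample, 64, 200, rng)
full_batch_norm = l2_norm(minibatch_gradient(per_sample, len(per_sample), rng))
```

Under these assumptions the small-batch estimate is visibly noisier than the large-batch one, while the full-batch gradient norm stays close to the norm of the synthetic true gradient.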
3 MDP’s for Experimentation
Henderson et al. (2018); Ilyas et al. (2018) establish that deep RL experimentation is inherently noisy, and joint training further increases the variance of the policy gradient because of sampling from different environments. The raw versions of ALE and Mujoco environments do not strongly follow distributional sampling of levels. We instead use both synthetic (RNN-MDP) and real (CoinRun) datasets, with the policy gradient algorithm PPO as the default due to its reliability (hyperparameters in Appendix A.2.2).
3.1 RNN-MDP
To simulate nonlinear dynamics for generalization, we fix an RNN to simulate the transition function $\mathcal{T}$. The RNN's (input, hidden state, output) correspond to the (action of a policy, underlying MDP state, observations) with scalar rewards obtained through a nonlinear map of the state $s_t$. The initial state $s_0$ is determined by $\theta$; thus the parameter $\theta$ is only a seed used to generate the initial state. This setting is chosen in order to use a constant transition function, which guarantees that there always exists an optimal action for any state regardless of $\theta$. By allowing control over state transitions, we may study the effects of stochasticity in the environment by adding Gaussian noise into the hidden state at each time step, i.e. $s_{t+1} = \mathrm{RNN}(s_t, a_t) + \epsilon_t$ with $\epsilon_t \sim \mathcal{N}(0, \sigma^2 I)$. We compute gradients through the environment during joint training to improve training performance.
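The construction above can be sketched as a tiny environment. This is only an illustration under loud assumptions: it uses a small tanh cell instead of the GRU listed in the appendix, a tanh of a linear map as the reward, and hypothetical names (`RnnMdpSketch`, `theta`, `noise_std`); gradients through the environment are not implemented here.

```python
import math
import random


class RnnMdpSketch:
    """One fixed transition shared by all MDPs; theta only seeds the start state."""

    def __init__(self, state_dim=4, action_dim=2, weight_seed=0, noise_std=0.0):
        rng = random.Random(weight_seed)  # same weights for every theta
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.noise_std = noise_std
        self.W_s = [[rng.gauss(0.0, 0.5) for _ in range(state_dim)]
                    for _ in range(state_dim)]
        self.W_a = [[rng.gauss(0.0, 0.5) for _ in range(action_dim)]
                    for _ in range(state_dim)]
        self.w_r = [rng.gauss(0.0, 1.0) for _ in range(state_dim)]
        self.state = [0.0] * state_dim
        self.noise_rng = random.Random(0)

    def reset(self, theta):
        init_rng = random.Random(theta)  # theta acts purely as a seed
        self.noise_rng = random.Random(f"{theta}-noise")
        self.state = [init_rng.gauss(0.0, 1.0) for _ in range(self.state_dim)]
        return list(self.state)

    def step(self, action):
        new_state = []
        for i in range(self.state_dim):
            pre = sum(self.W_s[i][j] * self.state[j] for j in range(self.state_dim))
            pre += sum(self.W_a[i][k] * action[k] for k in range(self.action_dim))
            # Optional Gaussian noise injected into the hidden state.
            noise = self.noise_rng.gauss(0.0, self.noise_std) if self.noise_std > 0 else 0.0
            new_state.append(math.tanh(pre) + noise)
        self.state = new_state
        # Nonlinear scalar reward computed from the new state.
        reward = math.tanh(sum(w * s for w, s in zip(self.w_r, self.state)))
        return list(self.state), reward


env = RnnMdpSketch(noise_std=0.0)
first_obs = env.reset(theta=7)
next_obs, reward = env.step([1.0, 0.0])
```

Sampling different integers for `theta` then plays the role of drawing training and test levels from the MDP family.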
3.1.1 Results
In Figure 1, we find that adding stochasticity (e.g. random action stochasticity or sticky actions Zhang et al. (2018)) into actions increases GCS and decreases test loss. This result is isolated from any other factors in RL, and thus shows that stochasticity aligns the policy gradient more closely with the true gradient. Since entropy and stochasticity may smooth the loss landscape in RL Ahmed et al. (2018), this suggests that entropy may also be affecting the direction of the gradient in particular ways that impact generalization.
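The two forms of action stochasticity mentioned here can be sketched for a discrete action space as follows. The function and parameter names (`perturb_action`, `random_prob`, `sticky_prob`) are hypothetical, and the exact mechanism used in the experiments may differ.

```python
import random


def perturb_action(policy_action, prev_action, n_actions, rng,
                   random_prob=0.0, sticky_prob=0.0):
    """Return the action actually executed in the environment.

    With probability random_prob a uniformly random action is taken;
    with probability sticky_prob the previous action is repeated;
    otherwise the policy's own action is used.
    """
    u = rng.random()
    if u < random_prob:
        return rng.randrange(n_actions)
    if prev_action is not None and u < random_prob + sticky_prob:
        return prev_action
    return policy_action


rng = random.Random(0)
unchanged = perturb_action(3, 1, n_actions=5, rng=rng)
sticky = perturb_action(3, 1, n_actions=5, rng=rng, sticky_prob=1.0)
randomized = perturb_action(3, 1, n_actions=5, rng=rng, random_prob=1.0)
```

Applying this wrapper only at training time, or at both training and test time, is a separate design choice that the experiments vary.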
Furthermore, when introducing stochasticity into the environment by randomly perturbing the state transition at each turn, we find in Figure 2 that higher values of the discount factor $\gamma$ produce better generalization, while in the deterministic setting without transition stochasticity, lower $\gamma$ values produce better generalization. These seemingly conflicting results illustrate the complex interplay between stochasticity and the monotonicity of $\gamma$ with respect to generalization.
3.2 Realistic MDPs: CoinRun
3.2.1 Differing Effects for Regularizers in RL
We use CoinRun Cobbe et al. (2018) as a procedurally generated MDP family for testing larger scale generalization and analyze common regularization techniques such as batch normalization, action stochasticity and data augmentation. Throughout all settings, in Figure 3, we see that steadily decreasing gradient cosine similarity (GCS) produces overall worse testing performance, with higher asymptotic GCS correlated with test performance. However, regularizers have different effects: e.g. batchnorm changes the network weight norms while stochasticity does not, and stochasticity is heavily sensitive to batchsize.
In Figure 3, we note that training can exhibit instability in its gradient cosine similarity (GCS) and thus smoothness, and that such instability is also correlated with poor test performance (see Appendix A3 for instability results).
In Figures A3 and 3, we see that data augmentation produces a constant nonzero GCS, as well as a high asymptotic gradient norm, but does not significantly change the network norm or entropy. This implies that data augmentation has a strong effect on the Lipschitz constant of the loss, but also does not allow the policy to converge, which is consistent with the fact that its GCS does not converge to 0.
Meanwhile, stochasticity leads to an overall better GCS consistent with Ahmed et al. (2018), which correlates with the eventual higher test performance, consistent with the RNNMDP environment. Batchnorm reduces both the variance and the magnitude of the GCS consistent with Santurkar et al. (2018), while the addition of stochasticity with batch normalization increases the GCS overall slightly. Furthermore, batchnorm also has a similar effect to stochasticity on the policy entropy, but stochasticity does not change the norm of network weights, while batchnorm does. Batchnorm reduces the gradient norm, as does stochasticity when compared based on reward vs gradient norm.
3.2.2 Interdependence between Hyperparameters at a Larger Scale
We investigate stochasticity's effects further, especially its interaction with batch size. Both larger and smaller batch sizes generally benefit batchnorm Smith et al. (2017), and lower batch sizes usually improve generalization due to beneficial gradient noise Keskar et al. (2016); we find that this is not the case for stochasticity.
As shown above, action stochasticity (entropy regularization) naturally boosts GCS; however, extra boosts can be caused by unwanted high variance in the GCS (which can be exacerbated by poor choices of batch size, suggesting the existence of a sharp minimizer) and actually lead to poorer test performance. While keeping stochasticity fixed to 0.1 during training (Figure 5), the minibatch number (which inversely affects batch size) has a strong effect on test performance and the weight norms, similar to SL. Surprisingly, a higher batch size (and thus less noise) produces significantly more variance in the GCS, also suggesting convergence to a sharp minimizer (i.e. a high eigenvalue of the Hessian).
Stochasticity also reduces the acceptable range of the minibatch number (Figure 6), showing that this parameter is important to RL generalization when using entropy bonuses. The batch size normally provides a form of gradient noise which may help training; in the RL case this translates to sampling individual policy gradient estimates from a mixture of the true policy gradient and the entropy-added gradient. As expected, too much noise caused by a low batch size, such as nminibatch = 64, produces negligible GCS.
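To make the inverse relationship between the minibatch number and the batch size concrete, the following sketch assumes the convention used by common PPO implementations, where one rollout of `n_envs * n_steps` transitions is split evenly into `n_minibatches` minibatches. The values 32 and 256 are illustrative assumptions, not settings confirmed by this text.

```python
def minibatch_size(n_envs, n_steps, n_minibatches):
    """Number of transitions in each minibatch of one PPO update epoch."""
    rollout_size = n_envs * n_steps
    if rollout_size % n_minibatches != 0:
        raise ValueError("rollout size must be divisible by n_minibatches")
    return rollout_size // n_minibatches


# Illustrative rollout of 32 environments x 256 steps (assumed values).
sizes = {n: minibatch_size(32, 256, n) for n in (1, 8, 64)}
```

Under these assumed values, raising the minibatch number from 8 to 64 shrinks each minibatch from 1024 to 128 transitions, which is the sense in which a high minibatch number corresponds to a noisier gradient estimate.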
We provide more relationship analysis in Appendix A6, testing relationships between stochasticity and $\gamma$ on CoinRun and CoinRun-Mazes (an RNN exploration test).
4 Conclusion
From the above results, we conclude that when dealing with RL generalization, one must be very careful in hyperparameter optimization. As shown in this work, hyperparameters are simply not orthogonal in terms of improving generalization performance. One change in a hyperparameter may change the monotonicity relationship for another hyperparameter, and thus may easily lead to wrong conclusions during experimentation. In comparison to SL, RL also has many more important hyperparameters, all of which must be considered carefully when designing robust systems.
There is a lack of understanding of these relationships, and a deeper investigation could prove important for both the practice and theory of RL generalization. We hope that this work provides guidance for both future research and the practical design of robust and generalizable RL systems.
References
 Ahmed et al. (2018) Ahmed, Z., Roux, N. L., Norouzi, M., and Schuurmans, D. Understanding the impact of entropy on policy optimization. CoRR, abs/1811.11214, 2018.
 Cobbe et al. (2018) Cobbe, K., Klimov, O., Hesse, C., Kim, T., and Schulman, J. Quantifying generalization in reinforcement learning. CoRR, abs/1812.02341, 2018.

 Henderson et al. (2018) Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. Deep reinforcement learning that matters. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pp. 3207–3214, 2018. URL https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16669.
 Ilyas et al. (2018) Ilyas, A., Engstrom, L., Santurkar, S., Tsipras, D., Janoos, F., Rudolph, L., and Madry, A. Are deep policy gradient algorithms truly policy gradient algorithms? CoRR, abs/1811.02553, 2018.
 Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, volume 37 of JMLR Workshop and Conference Proceedings, pp. 448–456. JMLR.org, 2015.
 Jiang et al. (2015) Jiang, N., Kulesza, A., Singh, S., and Lewis, R. The dependence of effective planning horizon on model accuracy. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, AAMAS ’15, pp. 1181–1189, Richland, SC, 2015. International Foundation for Autonomous Agents and Multiagent Systems. ISBN 9781450334136.
 Keskar et al. (2016) Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On large-batch training for deep learning: Generalization gap and sharp minima. CoRR, abs/1609.04836, 2016.
 Li et al. (2018) Li, H., Xu, Z., Taylor, G., Studer, C., and Goldstein, T. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pp. 6391–6401, 2018.
 Santurkar et al. (2018) Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. How does batch normalization help optimization? In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 2487–2497. Curran Associates, Inc., 2018.
 Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
 Smith et al. (2017) Smith, S. L., Kindermans, P., and Le, Q. V. Don’t decay the learning rate, increase the batch size. CoRR, abs/1711.00489, 2017.
 Williams (1992) Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992. doi: 10.1007/BF00992696.
 Zhang et al. (2018) Zhang, C., Vinyals, O., Munos, R., and Bengio, S. A study on overfitting in deep reinforcement learning. CoRR, abs/1804.06893, 2018.
Appendix A.1 Full Results, Enlarged Pictures, Examples for Completeness
A.1.1 Why the RNN-MDP?
For the sake of simplicity, assume our observation function is the identity (i.e. we observe the state). Note that the "state" within an MDP is subjective, depending on certain modifications, and this can dramatically affect solvability. For instance, an agent whose observation consists only of single-timestep states is unable to adapt to different transition functions in different MDPs. This can occur in Mujoco when two MDPs possess different gravities, and thus the optimal action is different even if the witnessed states are the same. However, adding frame-stacking changes the "witnessed state" to a combination of the last 4 timesteps, from which an agent will be able to adapt to different gravities. Thus, we cannot simply generate "random MDPs" as a valid benchmark.
Thus, to prevent these ambiguities, we simply construct a family of MDPs by using one single transition function $\mathcal{T}$, but with different initial starting states: the reward is independent of $\theta$, and $\theta$ controls only the initial state $s_0$. This solves the above problem if the state also contains the gravity parameter. This way, there is always one optimal action given each state. We use a nonlinear RNN cell (with additional stochasticities as an option) to simulate the transition $\mathcal{T}$, as well as nonlinear functions for the reward. Conceptually, an RNN may approximate any sequential transition, and thus, with a large enough cell, may simulate any MDP.
A.1.2 RNN-MDP Graphical Results
We see that various forms of stochasticity (pure random stochasticity as well as action stickiness, with probabilities shown in the legend) also produce higher GCS, as well as better testing performance.
A.1.2.1 $\gamma$ in RNN-MDP
We apply a similar stochastic modification to the RNN-MDP by adding noise into the state at each step, and find that stochasticity in the environment produces a monotone relationship in which higher $\gamma$ produces better test performance, while determinism in the environment reverses this relationship: lower $\gamma$ produces better testing performance.
A.1.3 Why CoinRun?
Coinrun Package (Coinrun-Standard, RandomMazes) Cobbe et al. (2018): Both environments provide infinite level generation; Coinrun-Standard is a side-scroller which particularly benchmarks strong convolutional network generalization and establishes that various regularizations indeed help with test performance. RandomMazes (in A7) are sequentially generated mazes which test the exploration properties of RNNs.
A.1.4 Regularizers in RL
As a reminder to the reader, all algorithms produce a policy that optimizes a surrogate loss, but at test time, the policy is evaluated on the true objective. For example, adding the discount factor and entropy regularization makes the surrogate training objective
(2) 
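A hedged sketch of what such a surrogate can look like for a single trajectory: the undiscounted return is replaced by a discounted sum with a per-step entropy bonus. The exact form used by the authors is not recoverable from this text, so the coefficient name `entropy_coef` and the choice to discount the entropy term are assumptions of this illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def surrogate_return(rewards, action_probs, gamma, entropy_coef):
    """Discounted return with an entropy bonus added at every step.

    rewards[t] is the reward at step t and action_probs[t] is the
    policy's action distribution at step t.
    """
    total = 0.0
    for t, (r, probs) in enumerate(zip(rewards, action_probs)):
        total += (gamma ** t) * (r + entropy_coef * entropy(probs))
    return total


uniform = [0.5, 0.5]
true_objective = surrogate_return([1.0, 1.0, 1.0], [uniform] * 3, 1.0, 0.0)
discounted = surrogate_return([1.0, 1.0, 1.0], [uniform] * 3, 0.5, 0.0)
entropy_only = surrogate_return([0.0, 0.0], [uniform] * 2, 1.0, 1.0)
```

Setting `gamma = 1` and `entropy_coef = 0` recovers the undiscounted total reward that the policy is evaluated on at test time.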
We formally introduce the exact regularizers and what their effects may entail:

Action Stochasticity: Ahmed et al. (2018) argue that entropy bonuses and action stochasticity provide a smoother loss function. In the generalization case, we study its biasing effects on aligning the training set policy gradient with the true distribution policy gradient. In
case, is generated from with probability , which can be reinterpreted as purely adding to the objective. The difference in gradient cosine is examining the contribution of .We provide the normal policy gradient as found in Williams (1992) where :
Suppose for the sake of simplicity we considered the horizon case, and used action stochasticity, where is generated from with probability ; thus the policy gradient for a single batch instance is a mixture, i.e. is
with probability and with probability . This implies that in the infinite batch setting, we have:
Note that
with cross entropy . Thus we may view this as a reward-weighted entropy penalty: if
was a relatively uniform distribution, then we are simply taking the gradient with respect to the KL divergence between the policy and a uniform distribution:
. As an aside, for a discrete softmax policy, , and hence the contribution from the random action will be:
At a high level, as the policy’s entropy becomes lower and hence the policy is more confident, both the term and the norm of the gradient become high.
We see that adding an entropy penalty to the objective at training time only will align better with the true policy gradient. Unlike linear regression, where adding an explicit regularizer on the weight norm decreases the weights during training, this is not the case for RL, as we see that stochasticity does not provide a statistically significant change to the norm of the network weights. Note that this noise addition is not unbiased in parameter space: if we instead added simple zero-mean Gaussian noise $\epsilon$ to a gradient $g$, then in expectation over the noise sample, the dot product between the training sample gradient and the true gradient $g^*$ would remain fixed: $\mathbb{E}_{\epsilon}[\langle g + \epsilon, g^* \rangle] = \langle g, g^* \rangle$.
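The remark about Gaussian noise can be checked numerically. The sketch below uses made-up two-dimensional vectors and hypothetical function names; it only illustrates that zero-mean noise added to a gradient leaves its expected dot product with a fixed reference gradient unchanged.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mean_noisy_dot(grad, true_grad, noise_std, n_samples, rng):
    """Monte Carlo estimate of E[<grad + noise, true_grad>]."""
    total = 0.0
    for _ in range(n_samples):
        noisy = [g + rng.gauss(0.0, noise_std) for g in grad]
        total += dot(noisy, true_grad)
    return total / n_samples


rng = random.Random(0)
grad = [1.0, 2.0]        # stand-in for a training sample gradient
true_grad = [0.5, -1.0]  # stand-in for the true gradient
exact = dot(grad, true_grad)
estimate = mean_noisy_dot(grad, true_grad, noise_std=1.0, n_samples=20000, rng=rng)
```

Because the estimate stays close to the noise-free dot product, any systematic change in alignment observed under action stochasticity cannot be explained by unbiased parameter-space noise alone.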

BatchNorm Ioffe & Szegedy (2015): Cobbe et al. (2018) empirically showed the benefit of batchnorm on RL generalization. Translated to the RLsetting, Santurkar et al. (2018) establishes that for every action , if is after adding batchnorm on , batchnorm can reduce the norm on the gradient as well as the 2nd order Hessian (smoothness) term: and . We thus expect its gradient smoothing effects also reduce the variance on the gradient cosine similarity, as well as provide a smoother transition from the beginning of training to the end.
We see that Santurkar et al. (2018)’s results on smoothing both the gradient norm and the Hessian term translate to the policy gradient. They state that for a network which uses batchnorm layers while is the original network, for a loss function (shortened to , denote to be the loss function after adding batch norm:
(3) and Hessian smoothness term
(4) where is an activation, is the set of activations after batch norm, is batchsize,
is the standard deviation computed over the minibatch of
’s, is a constant term. In reinforcement learning, is thus . The contributions to the smoothness come from the terms to the right in both equations, which are subtracted from the gradient norm and Hessian term, respectively, of the original network. Thus from the above, increasing batch size generally reduces batchnorm's smoothing effects. Since experimental results show that entropy also increased asymptotically, this implies that contributes to the gradient norm significantly, while the network gradient norm is low, consistent with the lower norm during training.

$\gamma$: Jiang et al. (2015) show that higher $\gamma$ increases the size of the optimal policy set for the training MDPs, because $\gamma$ will force the policy space to consider multiple state-action pairs for planning. Thus $\gamma$ can also be a useful tool to verify the number of local minima of the training landscape. If high $\gamma$ produces worse testing performance, this implies that the set of optimal solutions for the true distribution is small, and the agent has overfitted by converging on one of the many optimal policies for the training set.

For data augmentation, a reasonable model is to sample , where randomly adds shapes to the picture, thus our policy gradient is
(5) If we consider for simplicity, a deterministic function , then the gradient is instead . As the entropy is not significantly raised at convergence from experimental results, a mild assumption is and thus has strong alignment with .
A.1.5 Extended CoinRun Results
A.1.5.1 GCS's Correlation with Test Instability
Runs with roughly nonmonotonically decreasing GCS tended to also produce more unstable testing curves, even with stable training curves. These large ”bumps” in the GCS corresponded to sudden drops on the testing curves. Furthermore, testing curves that produced such ”bumps” ultimately produced poorer final testing performances. In the monotonically decreasing GCS cases, this corresponded to better testing performance overall.
In terms of loss landscape, when both training and testing performance drop, this implies that the trajectory of gradient descent has accidentally reached a peak on the true distribution’s landscape, from too large of a step size. However, other times exist in which the GCS sharply rises when the training curve is stable while testing curve is not, suggesting the lack of smoothness on training landscape is causing instability.
A.1.5.2 When is GCS Inaccurate?
We present an extreme case in which the GCS correlation with test performance is diminished: higher complexity of the reward function. If using a synthetic reward (i.e. $k$-layer convolution networks on the observation, followed by average-pooling on the output), we find a monotonic decrease in GCS, but testing performance varied when using the same number of environments for each of the training sets:
The layer number induces more complexity on the reward function.
A.1.5.3 Gradient Norm
We examine the contributions to the GCS, with (A5) further showing that stochasticity reduces the gradient norm significantly, even while the policy has not yet converged.
A.1.5.4 $\gamma$, Continued
In order to check the interplay between $\gamma$ and stochasticity at a larger scale, we apply forced action stochasticity in both training and testing settings but vary $\gamma$, so that there is only one difference between training and testing. (A6) shows entropy noticeably affecting the range of $\gamma$, where higher stochasticity sharpens the range of allowable $\gamma$ values for high test performance.
A.1.5.5 CoinRun-Mazes
Applying the same stochasticities to the RandomMazes environment, which is a test of exploration strategy in a maze, we find that stochasticity actually improves test performance over larger ranges of $\gamma$; exploration/RNN tasks have more complexity due to the temporal component, consistent with this result.
Appendix A.2 Hyperparameters and Exact Setups
A.2.1 RNN-MDP
RNN-MDP Hyperparameter | Value
RNN Cell | GRU-512
Horizon | 256
State Noise | Gaussian Vector
Reward Output | 2-layer MLP ReLU
Policy Function | 2 to 4 layer MLP ReLU
Initial State Sampling | Gaussian Vector
Initializers | Orthogonal Initializers for all
Length of reward gradient | 40
Optimizer | SGD or Full-Batch GD

A.2.2 PPO Parameters
See Cobbe et al. (2018) for the default parameters used for CoinRun.