The importance of exploration in reinforcement learning (RL) cannot be overstated: agents trained to directly maximize the expected return in challenging environments are likely to become trapped in sub-optimal solutions. Two hard problems make the optimal policy difficult to find: sparse rewards (Pathak et al., 2017) and deceptive rewards (Ecoffet et al., 2019). For example, the first reward in the well-known Montezuma’s Revenge game is encountered only after a long sequence of environment steps, which induces a very large search space (see, e.g., Aytar et al., 2018) and makes the task unsolvable in practice without structured exploration strategies.
To address the problem of sparse rewards, exploration-based algorithms introduce a variety of components. Bellemare et al. (2016) and Tang et al. (2016) maintain state visitation frequencies, which act as intrinsic motivation. Other works (Osband et al., 2016; Henderson et al., 2017) model the uncertainty of the estimated state-action value function, thus simultaneously keeping track of multiple promising solutions. Bellemare, Dabney, and Munos (2017) use a categorical distribution to keep track of the random returns, which bolsters exploratory actions. In a similar vein, Doan, Mazoure, and Lyle (2018) rely on a generative model to learn the distribution of state-action values. In that case, approximating the return density with a generator makes it possible to cover a wide range of sub-optimal moves and plays a role similar to that of an exploration strategy.
Although count-based methods (Bellemare et al., 2016; Tang et al., 2016) have strong theoretical guarantees, they have traditionally been used to solve tasks with sparse rewards. Deceptive environments, which trap agents within local optima, are just as tricky to solve. For instance, some navigation and locomotion tasks such as biped walking possess deceptive reward regions which attract the agent into low-reward regions (Conti et al., 2017). A straightforward approach to boosting exploration in challenging environments consists in increasing the capacity of the policy so that it can capture a more refined reward landscape. Indeed, a more expressive policy can keep track of local optima (Haarnoja et al., 2017) and is hence at lower risk of falling into them.
Previous results (Rezende and Mohamed, 2015) suggest that a complex and potentially multi-modal distribution can be obtained by applying a sequence of invertible transformations, called normalizing flows, to an original (simple) density. This approach has been used in the on-policy setting, where it was positioned as an effective exploration technique (Tang and Agrawal, 2018).
In this work, we show that normalizing flow-based policies form a natural extension to the Soft Actor-Critic (SAC) algorithm (Haarnoja et al., 2018); we call the resulting method SAC-NF. We show empirically that this simple extension significantly improves upon the already high exploration rate of SAC and achieves better convergence properties as well as better empirical performance. Moreover, the class of policies that we propose requires significantly fewer parameters than its baseline counterpart, while also improving on the original results. We assess the performance of both SAC and SAC-NF across a variety of benchmark continuous control tasks from OpenAI Gym using the MuJoCo simulator (Todorov, Erez, and Tassa, 2012).
This paper is organized as follows. Section 2 reviews previous literature relevant to our work. Section 3 provides an overview of background material, and Section 4 introduces the proposed SAC-NF approach. Section 5 first analyzes the expressivity of SAC-NF through toy experiments and then compares it against SAC on MuJoCo benchmark environments. Finally, Section 6 provides conclusions and future directions.
2 Related Work
The idea behind off-policy strategies in RL is to collect samples under some behaviour policy and use them to train a target policy. Off-policy algorithms are known to train faster than their on-policy counterparts, but at the cost of higher variance and instability (Lillicrap et al., 2016). Among this family, actor-critic (AC) strategies have shown great success in solving continuous control tasks. Sitting between value-based and policy-based approaches, an AC algorithm trains an actor (policy-based) using guidance from a critic (value-based). Two major AC algorithms, SAC (Haarnoja et al., 2018) and TD3 (Fujimoto, van Hoof, and Meger, 2018), have shown a large performance improvement over previous algorithms such as DDPG (Lillicrap et al., 2016) or A3C (Mnih et al., 2016). While TD3 does so by maintaining a second critic network to alleviate the overestimation bias, SAC enforces exploration by adding an entropy regularization term.
Density estimation for better exploration
Using powerful density estimators to model state-action values, with the aim of improving exploration and generalization, has been a long-standing practice in RL. For instance, Henderson et al. (2017) use dropout approximation (Gal and Ghahramani, 2016) within a Bayesian network and show improved stability and performance of policy gradient methods. Osband et al. (2016) instead rely on an ensemble of neural networks to estimate the uncertainty in the prediction of the value function, which reduces learning times while improving performance. Finally, Doan, Mazoure, and Lyle (2018) consider generative adversarial networks (Goodfellow et al., 2014) to model the distribution of random state-action value functions. The current work considers a different approach based on normalizing flows for density estimation.
Flow-based generative models have proven to be powerful density approximators (Rezende and Mohamed, 2015). The idea is to relate an initial noise density to a posterior distribution using a sequence of invertible transformations, parametrized by a neural network and having desirable properties. For example, inverse autoregressive flows (IAF) are characterized by a simple-to-compute Jacobian (Kingma, Salimans, and Welling, 2016). In their original formulation, IAF layers learn location-scale (i.e. affine) transformations of the simple initial noise density. Normalizing flows have been used previously in the on-policy RL setting, where IAF extends a base policy found by TRPO (Tang and Agrawal, 2018). In this work, we tackle the off-policy learning setting, and we focus on planar and radial flows, which are known to provide a good trade-off between function expressivity and time complexity (Rezende and Mohamed, 2015).
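For reference, a single planar flow layer as defined by Rezende and Mohamed (2015) can be written as follows. The symbols $\mathbf{u}, \mathbf{w} \in \mathbb{R}^d$, $b \in \mathbb{R}$ and the smooth nonlinearity $h$ follow that paper's notation and are not otherwise defined in this section. Such a layer has only $2d + 1$ parameters and a determinant that can be computed in linear time, which illustrates the trade-off mentioned above.

```latex
f(\mathbf{z}) = \mathbf{z} + \mathbf{u}\, h(\mathbf{w}^\top \mathbf{z} + b),
\qquad
\left| \det \frac{\partial f}{\partial \mathbf{z}} \right|
  = \left| 1 + \mathbf{u}^\top \psi(\mathbf{z}) \right|,
\qquad
\psi(\mathbf{z}) = h'(\mathbf{w}^\top \mathbf{z} + b)\, \mathbf{w}
```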
3 Background
In this section, we review the formal setting of RL in a Markov decision process (MDP), the policy optimization approaches considered in this paper, and the general framework of normalizing flows, which will be used to improve exploration in Section 4.
3.1 Markov Decision Process
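A discrete-time, finite-horizon MDP (Bellman, 1957; Puterman, 2014) is described by a state space $\mathcal{S}$ (either discrete or continuous), an action space $\mathcal{A}$ (either discrete or continuous), a transition function $\mathcal{P}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}^+$, and a reward function $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. MDPs are useful for modelling sequential decision-making problems. On each round $t$, an agent interacting with this MDP observes the current state $s_t \in \mathcal{S}$, selects an action $a_t \in \mathcal{A}$, and observes a reward $r(s_t, a_t)$ upon transitioning to a new state $s_{t+1} \sim \mathcal{P}(\cdot \mid s_t, a_t)$. Let $\gamma \in [0, 1]$ be a discount factor. The goal of an agent evolving in a discounted MDP is to learn a policy $\pi: \mathcal{S} \times \mathcal{A} \to [0, 1]$ such that taking actions $a_t \sim \pi(\cdot \mid s_t)$ maximizes the expected sum of discounted rewards,
$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t} \gamma^t r(s_t, a_t) \,\Big|\, s_0 = s\right].$$
The corresponding state-action value function can be written as the expected discounted rewards from taking action $a$ in state $s$, that is
$$Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t} \gamma^t r(s_t, a_t) \,\Big|\, s_0 = s, a_0 = a\right].$$
We use $\rho_\pi$ to denote the trajectory distribution induced by following policy $\pi$. If $\mathcal{S}$ or $\mathcal{A}$ are vector spaces, action and state vectors are respectively denoted by $\mathbf{a}$ and $\mathbf{s}$.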
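As a small illustration of the discounted sum of rewards that the value functions above take an expectation over, the following sketch computes it for one finite reward sequence. The function name and the example rewards are hypothetical and only serve the illustration.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma**t * rewards[t] for one finite trajectory."""
    total = 0.0
    # Walk backwards so that each step applies G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    # 1 + 0.5 * 1 + 0.25 * 1 = 1.75
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

Averaging this quantity over many trajectories sampled from a policy gives a Monte-Carlo estimate of its value function at the start state.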
3.2 Policy Optimization
Policy gradient methods (Sutton et al., 1999) are a class of algorithms for learning RL policies, which optimize the discounted return through stochastic gradient steps on the policy parameters. While the stability of these updates has been addressed in the on-policy setting, for instance by constraining changes in the policy to a trust region (Schulman et al., 2015), on-policy strategies suffer from a high sample complexity. On the other hand, their off-policy alternatives (Lillicrap et al., 2016) are subject to instability, especially in continuous state and action spaces. Recently, SAC (Haarnoja et al., 2018) has been shown to mitigate this by optimizing a maximum entropy policy objective function:
$$J(\pi) = \mathbb{E}_{\tau \sim \rho_\pi}\left[\sum_{t} \gamma^t \Big( r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t)) \Big)\right], \qquad (1)$$
where $\rho_\pi$ is the trajectory distribution induced by $\pi$, $\mathcal{H}(\pi(\cdot \mid s_t))$ is the entropy of the policy and $\alpha$ is the importance given to the entropy regularizer. The entropy term prevents mode collapse on the highest reward and maintains the exploration rate above some threshold.
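To make the role of the entropy regularizer concrete, the sketch below estimates an entropy-augmented return from a single trajectory. It assumes that the per-step entropy is replaced by its one-sample estimate, the negative log-probability of the sampled action; the function name, the optional discount and the example numbers are illustrative assumptions rather than part of the method.

```python
from typing import Sequence


def soft_return(
    rewards: Sequence[float],
    log_probs: Sequence[float],
    alpha: float,
    gamma: float = 1.0,
) -> float:
    """Estimate sum_t gamma**t * (r_t - alpha * log pi(a_t | s_t)).

    -log pi(a_t | s_t) is a one-sample estimate of the policy entropy at s_t.
    """
    if len(rewards) != len(log_probs):
        raise ValueError("rewards and log_probs must have the same length")
    total = 0.0
    discount = 1.0
    for reward, log_prob in zip(rewards, log_probs):
        total += discount * (reward - alpha * log_prob)
        discount *= gamma
    return total


if __name__ == "__main__":
    # (1.0 + 0.1 * 1.0) + (2.0 + 0.1 * 0.5) = 3.15
    print(soft_return([1.0, 2.0], [-1.0, -0.5], alpha=0.1))
```

Setting `alpha` to zero recovers the usual return, while larger values reward the agent for keeping its action distribution spread out.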
3.3 Normalizing Flows
Normalizing flows are useful to generate samples from complex probability distributions (Rezende and Mohamed, 2015). More specifically, they provide a general framework for extending the change of variable theorem for density functions to a sequence of $d$-dimensional real random variables $\{\mathbf{z}_i\}_{i=0}^{N}$. The initial random variable $\mathbf{z}_0$ has density function $q_0$ and is linked to the final output of the flow $\mathbf{z}_N$ through a sequence of invertible, smooth mappings $\{f_i\}_{i=1}^{N}$, called a normalizing flow of length $N$, such that
$$\mathbf{z}_N = f_N \circ f_{N-1} \circ \dots \circ f_1(\mathbf{z}_0), \qquad \log q_N(\mathbf{z}_N) = \log q_0(\mathbf{z}_0) - \sum_{i=1}^{N} \log \left| \det \frac{\partial f_i}{\partial \mathbf{z}_{i-1}} \right|.$$
Specific forms of the mapping can be selected based on desired properties: highly expressive and sometimes invertible maps (Rezende and Mohamed, 2015), always invertible affine transformations (Kingma, Salimans, and Welling, 2016), volume-preserving and orthogonal transformations (Tomczak and Welling, 2016) or always invertible and highly expressive networks (Huang et al., 2018). For example, the family of radial contractions around a point $\mathbf{z}_c$ (Rezende and Mohamed, 2015), defined as
$$f(\mathbf{z}) = \mathbf{z} + \frac{\beta}{\alpha + \lVert \mathbf{z} - \mathbf{z}_c \rVert_2} (\mathbf{z} - \mathbf{z}_c),$$
is highly expressive (i.e. it represents a wide set of distributions) and yet very light (parameter-wise), in addition to enjoying a closed-form determinant
$$\det \frac{\partial f}{\partial \mathbf{z}} = \left[ 1 + \frac{\beta}{\alpha + r} \right]^{d-1} \left[ 1 + \frac{\alpha \beta}{(\alpha + r)^2} \right], \qquad r = \lVert \mathbf{z} - \mathbf{z}_c \rVert_2,$$
for $\alpha \in \mathbb{R}^+$ and $\beta \in \mathbb{R}$. The family of radial maps defined above approximates the target posterior through a sequence of concentric expansions of arbitrary width, centered around a learnable point $\mathbf{z}_c$. In order to guarantee that the flow is invertible, it is sufficient to pick $\beta \geq -\alpha$. As pointed out by Kingma, Salimans, and Welling (2016), radial and planar flows have an advantage in lower dimensions (up to a few hundred) since they require a low number of parameters while being highly expressive.
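The following minimal sketch applies one radial flow layer in the form given by Rezende and Mohamed (2015) and returns the log-determinant needed by the change of variables formula. The names `radial_flow`, `flow_log_density`, `center`, `alpha` and `beta` are chosen for illustration; the code requires the strict condition `beta > -alpha` so that the log-determinant stays finite, which is an assumption of this sketch.

```python
import math
from typing import List, Sequence, Tuple


def radial_flow(
    z: Sequence[float], center: Sequence[float], alpha: float, beta: float
) -> Tuple[List[float], float]:
    """Apply one radial flow layer and return (f(z), log|det df/dz|).

    f(z) = z + beta * h * (z - center), with h = 1 / (alpha + r)
    and r = ||z - center||_2.
    """
    # The strict inequality keeps the Jacobian determinant strictly positive.
    if alpha <= 0.0 or beta <= -alpha:
        raise ValueError("need alpha > 0 and beta > -alpha")
    d = len(z)
    diff = [zi - ci for zi, ci in zip(z, center)]
    r = math.sqrt(sum(x * x for x in diff))
    h = 1.0 / (alpha + r)
    out = [zi + beta * h * x for zi, x in zip(z, diff)]
    # det = (1 + beta * h)^(d - 1) * (1 + alpha * beta * h^2)
    log_det = (d - 1) * math.log1p(beta * h) + math.log1p(alpha * beta * h * h)
    return out, log_det


def flow_log_density(log_q0: float, log_dets: Sequence[float]) -> float:
    """Change of variables: log q_N(z_N) = log q_0(z_0) - sum_i log|det J_i|."""
    return log_q0 - sum(log_dets)


if __name__ == "__main__":
    z_out, log_det = radial_flow([1.0, 0.0], [0.0, 0.0], alpha=1.0, beta=1.0)
    print(z_out, log_det)
```

Each layer carries only `d + 2` parameters (the center, `alpha` and `beta`), which is what makes stacking several of them cheap.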
4 Augmenting SAC with Normalizing Flows
We now propose a flow-based formulation of the off-policy maximum entropy RL objective (Eq. 1) and argue that SAC is a special case of the resulting approach, called SAC-NF, where the normalizing flow layers are over-regularized.
4.1 Entropy maximization in RL
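Recall that the SAC algorithm (Haarnoja et al., 2018) finds the information projection of the Boltzmann Q-function onto the set of diagonal Gaussian policies $\Pi$, such that an update in the policy improvement step is given by:
$$\pi_{\text{new}} = \arg\min_{\pi' \in \Pi} D_{\mathrm{KL}}\left( \pi'(\cdot \mid s_t) \,\Big\|\, \frac{\exp\left( \frac{1}{\alpha} Q^{\pi_{\text{old}}}(s_t, \cdot) \right)}{Z^{\pi_{\text{old}}}(s_t)} \right), \qquad (5)$$
where $D_{\mathrm{KL}}$ denotes the Kullback-Leibler divergence, $Z^{\pi_{\text{old}}}(s_t)$ is the normalizing partition function, and $\alpha$ controls the temperature, i.e. the peakedness of the distribution. Equivalently, the objective can be reformulated as a maximum entropy RL task:
$$\pi^* = \arg\max_{\pi \in \Pi} \mathbb{E}_{\tau \sim \rho_\pi}\left[\sum_{t} \gamma^t \Big( r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t)) \Big)\right],$$
where the differential entropy of the policy is denoted pointwise as $\mathcal{H}(\pi(\cdot \mid s_t))$ for every state $s_t$ in trajectory $\tau$. If $\pi(\cdot \mid s_t) = \mathcal{N}(\mu, \Sigma)$ for some mean vector $\mu$ and covariance matrix $\Sigma$ in a fixed-dimensional space, then $\mathcal{H}(\pi(\cdot \mid s_t)) = \frac{1}{2} \log \det \Sigma + \text{const}$. When $\Sigma$ is diagonal, the entropy is, up to an additive constant, proportional to the sum of log-variances. The task tackled by SAC can hence be re-formulated as maximizing expected discounted rewards while keeping the volume of the policy at a certain level. Doing so keeps exploration active and prevents mode collapse on the highest reward.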
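The relationship between the entropy of a diagonal Gaussian policy and its log-variances can be checked with the short sketch below, which uses the standard closed form for the differential entropy of a Gaussian. The function name and the example values are illustrative.

```python
import math
from typing import Sequence


def diag_gaussian_entropy(log_vars: Sequence[float]) -> float:
    """Differential entropy of N(mu, diag(exp(log_vars))).

    H = d/2 * log(2 * pi * e) + 1/2 * sum(log_vars); it does not depend on mu.
    """
    d = len(log_vars)
    return 0.5 * d * math.log(2.0 * math.pi * math.e) + 0.5 * sum(log_vars)


if __name__ == "__main__":
    narrow = diag_gaussian_entropy([0.0, 0.0])  # unit variances
    wide = diag_gaussian_entropy([math.log(4.0)] * 2)  # variances of 4
    # Widening every dimension raises the entropy by 0.5 * sum of log-variances.
    print(narrow, wide, wide - narrow)
```

Because the entropy depends on the variances alone, a Gaussian policy can only raise it by spreading out symmetrically in every direction, which is the limitation discussed next.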
The policy is updated using Eq. 5, while the state-action value function is trained using soft updates on the critic and state value functions. A major drawback of diagonal Gaussian policies lies in their symmetry: reaching global optima lying far away requires a large variance, but avoiding being stuck in suboptimal solutions requires a damped entropy in cases where local and global optima are close.
4.2 Exploration through normalizing flows
We propose a class of policies comprised of an initial noise sample , a state-noise embedding and of a normalizing flow of arbitrary length parameterized by . Sampling from the policy can be described by the following set of equations:
where depends on the noise and the state. This deterministic mapping takes values from the set below to the action space
where denotes concatenation of vectors and . Intuitively, corresponds to a non-parametric generator network, while samples following the reparametrization trick in variational Bayes (Kingma and Welling, 2014). Precisely, is a state embedding function and plays the role of a covariance matrix. While in theory and are equivalent up to a multiplicative constant to sampling , in practice yields lower variance estimates.
In the limit, we can recover the original base policy through heavy regularization:
for all states , implying that . When , a similar statement holds for replaced by and by when . By analogy with the SAC updates, SAC-NF searches the information projection of onto the feasible set of policies by minimizing the negative variational lower bound w.r.t. parameters and , using samples from replay buffer :
where the policy density is decomposed on a log scale
The term involving appears through the chain rule, or by recalling the gradient of a multivariate normal density with respect to the Cholesky factor of its covariance. Note that for , it would be replaced by .
The (soft) state value function is parameterized by and defined exactly as for SAC:
and estimated by minimizing the mean squared error:
Similarly, the state-action value function is parameterized by and learned using the standard temporal difference loss:
In practice, we use Monte-Carlo samples to approximate the gradient of each loss:
Algorithm 1 outlines the proposed method: the major distinction from the original SAC is the additional gradient step on the normalizing flows layers while holding the base policy function constant.
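Since the value and critic losses are stated to be the same as in SAC, their per-sample regression targets can be sketched following the standard SAC formulation (Haarnoja et al., 2018). The function names are hypothetical, and `log_prob` is assumed to be the log-density of the sampled action under the full policy, i.e. with the log-determinant corrections of the flow layers already included.

```python
from typing import Sequence


def soft_value_target(q_value: float, log_prob: float, alpha: float) -> float:
    """One-sample target for the soft value: Q(s, a) - alpha * log pi(a | s)."""
    return q_value - alpha * log_prob


def soft_td_target(
    reward: float, next_value: float, gamma: float, done: bool = False
) -> float:
    """Temporal difference target for the critic: r + gamma * V(s')."""
    return reward if done else reward + gamma * next_value


def mean_squared_error(predictions: Sequence[float], targets: Sequence[float]) -> float:
    """Monte-Carlo estimate of a squared-error loss over a mini-batch."""
    if len(predictions) != len(targets) or not predictions:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)


if __name__ == "__main__":
    v_target = soft_value_target(q_value=2.0, log_prob=-1.0, alpha=0.5)  # 2.5
    q_target = soft_td_target(reward=1.0, next_value=v_target, gamma=0.9)  # 3.25
    print(v_target, q_target, mean_squared_error([3.0], [q_target]))
```

In a full implementation these targets would be computed on mini-batches drawn from the replay buffer and treated as constants when differentiating the losses.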
5 Experiments
This section addresses two major points: (1) it highlights the beneficial impact of NF on exploration through two toy examples and (2) it compares the proposed SAC-NF approach against SAC, i.e. the current state-of-the-art, on a set of continuous control tasks from MuJoCo (Todorov, Erez, and Tassa, 2012). For all experiments, we hold the entropy rate constant for every environment except Humanoid-v2, following the tuning reported by Haarnoja et al. (2018). Optimal values for SAC-NF are reported in the Appendix (Table 2).
5.1 Managing multi-modal policies
We first conduct a synthetic experiment to illustrate how augmenting a base policy with normalizing flows makes it possible to represent multi-modal policies. We consider a navigation task environment with continuous state and action spaces consisting of four goal states symmetrically placed around the origin. The agent starts at the origin and, at each time step $t$, receives a reward corresponding to the Euclidean distance to the closest goal. We consider a SAC-NF agent (Algorithm 1 with 4 flows and one hidden layer of 8 neurons), which can represent radial policies. The agent is trained over a number of epochs, each consisting of a fixed number of time steps.
The corresponding figure displays some trajectories sampled by the SAC-NF agent along with the kernel density estimation (KDE) of terminal state visitations by the agent. Trajectories are obtained by sampling from the policy distribution instead of taking the average action. We observe that the SAC-NF agent, following a flow-based policy, is able to successfully visit all four modes.
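A reward of the kind used in this navigation task can be sketched as follows. The goal coordinates and the sign convention (negative distance, so that approaching a goal increases the reward) are assumptions made for the sake of the example, since the exact layout is not specified here.

```python
import math
from typing import Sequence, Tuple

# Hypothetical goal layout: four goals placed symmetrically around the origin.
GOALS: Tuple[Tuple[float, float], ...] = (
    (1.0, 0.0),
    (-1.0, 0.0),
    (0.0, 1.0),
    (0.0, -1.0),
)


def navigation_reward(
    position: Sequence[float],
    goals: Sequence[Sequence[float]] = GOALS,
) -> float:
    """Negative Euclidean distance from `position` to the closest goal."""
    return -min(math.dist(position, goal) for goal in goals)


if __name__ == "__main__":
    print(navigation_reward((0.0, 0.0)))  # all four goals are equally close
    print(navigation_reward((0.5, 0.0)))  # the goal at (1, 0) is now closest
```

At the origin all four goals are equally attractive, so an optimal stochastic policy has four modes, which a single Gaussian cannot represent.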
5.2 Robustness to confounding rewards
We now show through an environment with deceptive rewards that, unlike the Gaussian policy, radial policies are able to find the globally optimal solution. We consider an environment composed of three reward areas: a locally optimal strip around the initial state and a global optimum on the opposing end of the room, separated by a pit of highly negative reward. The agent starts from its initial position and must navigate into the high-reward area without falling into the pit. At each time step $t$, the agent receives the reward associated with its current location.
We compare the SAC-NF agent (Algorithm 1 with 4 flows and one hidden layer of 8 neurons), which can represent radial policies, with a classical SAC agent (two hidden layers of 16 units) that models Gaussian policies. Both agents are trained for the same number of epochs and time steps per epoch.
Figure 2 displays the trajectories visited by both agents. This highlights the biggest weakness of diagonal Gaussian policies: the agent is unable to simultaneously reach the region of high rewards while avoiding the center of the room. In this case, lowering the entropy threshold will lead to the conservative behaviour of staying in the yellow zone; increasing the entropy leads the agent to die without reaching the goal. Breaking the symmetry of the policy by adding (in this case three) radial flows allows the agent to successfully reach the target area by walking along the safe path surrounding the room.
In the case of steep reward functions, where low rewards border on high rewards, symmetric distributions force the agent to explore in all possible directions. This leads the agent to sometimes attain the high-reward region but, more dangerously, to fall into low-reward areas with non-zero probability.
5.3 MuJoCo locomotion benchmarks
In this section, we compare our SAC-NF method against the SAC baseline on five continuous control tasks from the MuJoCo suite (see Figure 3): Ant-v2, HalfCheetah-v2, Humanoid-v2, Hopper-v2 and Walker2d-v2.
The SAC-NF agent consists of one feedforward hidden layer acting as state embedding, which is then followed by a normalizing flow. Details of the model can be found in Table 2. For the SAC baseline, two hidden layers are used. The critic and value function architectures are the same as in Haarnoja et al. (2018). All networks are trained with the Adam optimizer (Kingma and Ba, 2015).
The learning curves show the performance of both SAC and SAC-NF. Each curve is averaged over 5 random seeds (bold line), and one standard deviation confidence intervals are represented as shaded regions over 1 million steps (or 2 million steps for Humanoid-v2). At regular intervals of environment steps, we evaluate our policy over multiple episodes and report the average. The best observed reward for each method can be found in Table 1. We observe that SAC-NF shows faster convergence, which translates into better sample efficiency, compared to the baseline. Indeed, SAC-NF takes advantage of the expressivity of normalizing flows to allow for better exploration and thus discover new policies. In particular, we notice that SAC-NF performs well on two challenging tasks, Humanoid-v2 and Ant-v2, which are known to require hard exploration.
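The aggregation described above, a mean curve with a one standard deviation band across seeds, can be sketched as follows. The use of the sample standard deviation and the toy numbers are assumptions of this example; the text does not say which estimator was used.

```python
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def aggregate_over_seeds(
    curves: Sequence[Sequence[float]],
) -> List[Tuple[float, float]]:
    """Return (mean, standard deviation) per evaluation point across seeds.

    `curves[i][j]` is the average evaluation return of seed i at checkpoint j.
    """
    if len(curves) < 2:
        raise ValueError("need at least two seeds to compute a deviation")
    return [(mean(values), stdev(values)) for values in zip(*curves)]


if __name__ == "__main__":
    # Two hypothetical seeds evaluated at two checkpoints.
    for avg, dev in aggregate_over_seeds([[1.0, 2.0], [3.0, 4.0]]):
        print(f"{avg:.2f} +/- {dev:.2f}")
```

The shaded region of a learning curve then spans `avg - dev` to `avg + dev` at each checkpoint.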
Table 1 not only shows better performance from SAC-NF in most of the environments, but also points out the reduction in the number of parameters for our policy architecture. For instance, on Hopper-v2 and Humanoid-v2, we could significantly reduce the number of policy parameters relative to the SAC baseline, while performing at least as well as the baseline. Moreover, note that SAC-NF uses a lower entropy rate than the baseline, suggesting that unlike SAC, which explicitly encourages exploratory actions, SAC-NF achieves exploration through fitting the Boltzmann Q-function.
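To see why a flow-based policy head can be lighter than an extra hidden layer, the sketch below counts parameters for fully connected layers and for radial flow layers. All sizes in the example are hypothetical and are not the ones reported in Table 1; the count of `dim + 2` parameters per radial layer follows from its center, `alpha` and `beta`.

```python
from typing import Sequence


def mlp_param_count(layer_sizes: Sequence[int]) -> int:
    """Weights plus biases of a fully connected network with these layer sizes."""
    return sum(
        (n_in + 1) * n_out for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    )


def radial_flow_param_count(dim: int, n_flows: int) -> int:
    """Each radial layer has a center in R^dim plus the scalars alpha and beta."""
    return n_flows * (dim + 2)


if __name__ == "__main__":
    obs_dim, act_dim, width = 11, 3, 64  # hypothetical sizes
    two_layer = mlp_param_count([obs_dim, width, width, 2 * act_dim])
    one_layer_plus_flows = mlp_param_count(
        [obs_dim, width, 2 * act_dim]
    ) + radial_flow_param_count(act_dim, n_flows=4)
    print(two_layer, one_layer_plus_flows)
```

With these illustrative sizes, replacing the second hidden layer by a few radial layers removes the `width x width` weight matrix, which dominates the parameter count.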
6 Conclusion
In this paper, we proposed a novel algorithm which combines the soft actor-critic updates with a sequence of normalizing flows of arbitrary length. The high expressivity of the latter makes it possible to (1) quickly discover richer policies, (2) compress the cumbersome Gaussian policy into a lighter network and (3) better avoid local optima. Our proposed algorithm leverages connections between maximum entropy reinforcement learning and the evidence lower bound used to optimize variational approximations; this relationship allows for a straightforward extension of SAC to a wider class of distributions. We demonstrated through experiments on five MuJoCo environments that our method has a better convergence rate than the baseline, vastly improves the coverage of the parameter space and is at least as performant as SAC.
We want to thank Compute Canada/Calcul Québec and Mila – Quebec AI Institute for providing computational resources. We also thank Chin-Wei Huang for insightful discussions.
References
- Aytar et al. (2018) Aytar, Y.; Pfaff, T.; Budden, D.; Paine, T. L.; Wang, Z.; and de Freitas, N. 2018. Playing hard exploration games by watching YouTube. Advances in Neural Information Processing Systems.
- Bellemare et al. (2016) Bellemare, M. G.; Srinivasan, S.; Ostrovski, G.; Schaul, T.; Saxton, D.; and Munos, R. 2016. Unifying count-based exploration and intrinsic motivation. Advances in Neural Information Processing Systems.
- Bellemare, Dabney, and Munos (2017) Bellemare, M. G.; Dabney, W.; and Munos, R. 2017. A distributional perspective on reinforcement learning. International Conference on Machine Learning.
- Bellman (1957) Bellman, R. 1957. A markovian decision process. Journal of Mathematics and Mechanics 679–684.
- Conti et al. (2017) Conti, E.; Madhavan, V.; Such, F. P.; Lehman, J.; Stanley, K. O.; and Clune, J. 2017. Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. Advances in Neural Information Processing Systems.
- Doan, Mazoure, and Lyle (2018) Doan, T.; Mazoure, B.; and Lyle, C. 2018. Gan q-learning. arXiv preprint arXiv:1805.04874.
- Ecoffet et al. (2019) Ecoffet, A.; Huizinga, J.; Lehman, J.; Stanley, K. O.; and Clune, J. 2019. Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995.
- Fujimoto, van Hoof, and Meger (2018) Fujimoto, S.; van Hoof, H.; and Meger, D. 2018. Addressing function approximation error in actor-critic methods. International Conference on Machine Learning.
- Gal and Ghahramani (2016) Gal, Y., and Ghahramani, Z. 2016. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, 1050–1059.
- Goodfellow et al. (2014) Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2672–2680.
- Haarnoja et al. (2017) Haarnoja, T.; Tang, H.; Abbeel, P.; and Levine, S. 2017. Reinforcement learning with deep energy-based policies. CoRR abs/1702.08165.
- Haarnoja et al. (2018) Haarnoja, T.; Zhou, A.; Abbeel, P.; and Levine, S. 2018. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. International Conference on Machine Learning.
- Henderson et al. (2017) Henderson, P.; Doan, T.; Islam, R.; and Meger, D. 2017. Bayesian policy gradients via alpha divergence dropout inference. NIPS Bayesian Deep Learning Workshop.
- Huang et al. (2018) Huang, C.; Krueger, D.; Lacoste, A.; and Courville, A. C. 2018. Neural autoregressive flows. International Conference on Machine Learning.
- Kingma and Ba (2015) Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. International Conference on Learning Representations.
- Kingma and Welling (2014) Kingma, D. P., and Welling, M. 2014. Auto-encoding variational bayes. International Conference on Learning Representations.
- Kingma, Salimans, and Welling (2016) Kingma, D. P.; Salimans, T.; and Welling, M. 2016. Improving variational inference with inverse autoregressive flow. Advances in Neural Information Processing Systems.
- Lillicrap et al. (2016) Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2016. Continuous control with deep reinforcement learning. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.
- Maaten and Hinton (2008) Maaten, L. v. d., and Hinton, G. 2008. Visualizing data using t-sne. Journal of Machine Learning Research 9(Nov):2579–2605.
- Mnih et al. (2016) Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T. P.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. CoRR abs/1602.01783.
- Osband et al. (2016) Osband, I.; Blundell, C.; Pritzel, A.; and Roy, B. V. 2016. Deep exploration via bootstrapped DQN. Advances in Neural Information Processing Systems.
- Pathak et al. (2017) Pathak, D.; Agrawal, P.; Efros, A. A.; and Darrell, T. 2017. Curiosity-driven exploration by self-supervised prediction. International Conference on Machine Learning.
- Puterman (2014) Puterman, M. L. 2014. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
- Rezende and Mohamed (2015) Rezende, D. J., and Mohamed, S. 2015. Variational inference with normalizing flows. International Conference on Machine Learning.
- Schulman et al. (2015) Schulman, J.; Levine, S.; Moritz, P.; Jordan, M. I.; and Abbeel, P. 2015. Trust region policy optimization.
- Sutton et al. (1999) Sutton, R. S.; McAllester, D.; Singh, S.; and Mansour, Y. 1999. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, NIPS’99, 1057–1063. Cambridge, MA, USA: MIT Press.
- Tang and Agrawal (2018) Tang, Y., and Agrawal, S. 2018. Boosting trust region policy optimization by normalizing flows policy.
- Tang et al. (2016) Tang, H.; Houthooft, R.; Foote, D.; Stooke, A.; Chen, X.; Duan, Y.; Schulman, J.; Turck, F. D.; and Abbeel, P. 2016. #exploration: A study of count-based exploration for deep reinforcement learning. Advances in Neural Information Processing Systems.
- Todorov, Erez, and Tassa (2012) Todorov, E.; Erez, T.; and Tassa, Y. 2012. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, 5026–5033. IEEE.
- Tomczak and Welling (2016) Tomczak, J. M., and Welling, M. 2016. Improving variational auto-encoders using householder flow. NIPS Bayesian Deep Learning Workshop.
We provide a table of hyperparameters used to obtain results in the MuJoCo domain. The model variants listed are the concatenated, the average and the conditional models.
Table 2: Hyperparameters for the MuJoCo experiments, including the Adam optimizer parameters.
Comparing visitation states
In this part, we highlight the advantage of augmenting any simple policy with normalizing flows. For that purpose, we compared the state visitation counts of SAC-NF and SAC for Humanoid-v2. This is a challenging task which is known to require a high amount of exploration to avoid deceptively falling into a suboptimal solution. In Figure 5, we ran rollouts for each method (keeping the noise on) and projected the visited states into a two-dimensional space with t-SNE (Maaten and Hinton, 2008). NF expressivity allows our method to visit more diverse states and hence converge to good policies earlier. The major downside of the Gaussian policy lies in its symmetric nature, which prevents it from arbitrarily focusing on subsets of the action space.