1 Introduction
The importance of exploration in reinforcement learning (RL) cannot be overstated: agents trained to directly maximize the expected return in challenging environments are likely to become trapped in suboptimal solutions. Difficulties in finding the optimal policy are characterized by two hard problems: sparse rewards (Pathak et al., 2017) and deceptive rewards (Ecoffet et al., 2019). For example, the first reward in the well-known Montezuma's Revenge game is encountered after environment steps (a search space roughly of size , e.g., see Aytar et al., 2018), which makes the game unsolvable without structured exploration strategies.
In order to address the problem of sparse rewards, exploration-based algorithms introduce a variety of components to improve exploration. Bellemare et al. (2016) and Tang et al. (2016) maintain state visitation frequencies which act as intrinsic motivation. Other works (Osband et al., 2016; Henderson et al., 2017) model uncertainty by estimating the state-action value function, thus simultaneously keeping track of multiple promising solutions.
Bellemare, Dabney, and Munos (2017) use a categorical distribution to keep track of the random returns and bolster exploratory actions. In a similar vein, Doan, Mazoure, and Lyle (2018) rely on a generative model to learn the distribution of state-action values. In that case, approximating the return density with a generator covers a wide range of suboptimal moves and plays a role similar to that of an exploration strategy. Although count-based methods (Bellemare et al., 2016; Tang et al., 2016) have strong theoretical guarantees, they have traditionally been used to solve tasks with sparse rewards. On the other hand, deceptive environments, which trap agents within local minima, are just as tricky to solve. For instance, some navigation and locomotion tasks such as biped walking possess deceptive reward regions which attract the agent toward low rewards (Conti et al., 2017). A straightforward approach to boosting exploration in challenging environments consists in increasing the capacity of the policy so that it can capture a more refined reward landscape. Indeed, a more expressive policy can keep track of local optima (Haarnoja et al., 2017) and is hence at lower risk of falling into them.
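As a concrete illustration of the count-based idea mentioned above, the sketch below implements a visitation-count bonus that decays as a state is revisited. The 1/√N(s) form is a common choice in this literature, but the exact scale and form here are illustrative assumptions, not taken from any of the cited methods.

```python
import math
from collections import defaultdict

def count_bonus(counts, state, scale=1.0):
    """Intrinsic reward that decays with the visitation count of `state`.

    Illustrative sketch of a count-based bonus: scale / sqrt(N(s)).
    The scaling constant is a hypothetical choice.
    """
    counts[state] += 1
    return scale / math.sqrt(counts[state])

counts = defaultdict(int)
# A freshly visited state yields the largest bonus...
first = count_bonus(counts, (0, 0))
# ...which shrinks as the same state is revisited.
later = [count_bonus(counts, (0, 0)) for _ in range(8)]
```

Added to a sparse extrinsic reward, such a bonus pushes the agent toward rarely visited states without altering the asymptotic objective, since the bonus vanishes as counts grow.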
Previous results (Rezende and Mohamed, 2015) suggest that a complex and potentially multimodal distribution can be decomposed into a sequence of invertible transformations, called normalizing flows, applied to the original (simple) density. This approach has been used in the on-policy setting and positioned as an effective exploration technique (Tang and Agrawal, 2018).
In this work, we show that normalizing flow-based policies form a natural extension to the Soft Actor-Critic (SAC) algorithm (Haarnoja et al., 2018), which we call SAC-NF. We show empirically that this simple extension significantly improves upon the already high exploration rate of SAC and achieves better convergence properties as well as better empirical performance. Moreover, the class of policies that we propose requires significantly fewer parameters than its baseline counterpart, while also improving on the original results. Finally, we assess the performance of both SAC and SAC-NF across a variety of benchmark continuous control tasks from OpenAI Gym using the MuJoCo simulator (Todorov, Erez, and Tassa, 2012).
This paper is organized as follows. Section 2 first reviews previous literature relevant to our work. Section 3 provides an overview of background material, before introducing the proposed SAC-NF approach in Section 4. Section 5 first shows an empirical analysis of SAC-NF expressivity through toy experiments before comparing against SAC on MuJoCo benchmark environments. Section 6 finally provides conclusions and future directions.
2 Related Work
Off-policy RL
The idea behind off-policy strategies in RL is to collect samples under some behaviour policy and use them to train a target policy. Off-policy algorithms are known to train faster than their on-policy counterparts, but at the cost of higher variance and instability (Lillicrap et al., 2016). Among this family, actor-critic (AC) strategies have shown great success in solving continuous control tasks. Sitting between value-based and policy-based approaches, an AC algorithm trains an actor (policy-based) using guidance from a critic (value-based). Two major AC algorithms, SAC (Haarnoja et al., 2018) and TD3 (Fujimoto, van Hoof, and Meger, 2018), have shown a large performance improvement over previous off-policy algorithms such as DDPG (Lillicrap et al., 2016) or A3C (Mnih et al., 2016). While TD3 does so by maintaining a second critic network to alleviate the overestimation bias, SAC enforces exploration by adding an entropy regularization term.
Density estimation for better exploration
Using powerful density estimators to model state-action values with the aim of improving exploration and generalization is a long-standing practice in RL. For instance, Henderson et al. (2017) use dropout approximation (Gal and Ghahramani, 2016) within a Bayesian network and show improvements in the stability and performance of policy gradient methods. Osband et al. (2016) rather rely on an ensemble of neural networks to estimate the uncertainty in the prediction of the value function, which reduces learning time while improving performance. Finally, Doan, Mazoure, and Lyle (2018) consider generative adversarial networks (Goodfellow et al., 2014) to model the distribution of random state-value functions. The current work considers a different approach, based on normalizing flows, for density estimation.
Normalizing flows
Flow-based generative models have proven to be powerful density approximators (Rezende and Mohamed, 2015). The idea is to relate an initial noise density to a posterior distribution through a sequence of invertible transformations, parametrized by a neural network and having desirable properties. For example, inverse autoregressive flows (IAF) are characterized by a simple-to-compute Jacobian (Kingma, Salimans, and Welling, 2016). In their original formulation, IAF layers learn location-scale (i.e. affine) transformations of the simple initial noise density. Normalizing flows have previously been used in the on-policy RL setting, where IAF extends a base policy found by TRPO (Tang and Agrawal, 2018). In this work, we tackle the off-policy learning setting, and we focus on planar and radial flows, which are known to provide a good trade-off between expressivity and time complexity (Rezende and Mohamed, 2015).
3 Background
In this section, we review the formal setting of RL in a Markov decision process (MDP), the policy optimization approaches considered in this paper, and the general framework of normalizing flows, which will be used to improve exploration in Section 4.
3.1 Markov Decision Process
A discrete-time, finite-horizon MDP (Bellman, 1957; Puterman, 2014) is described by a state space S (either discrete or continuous), an action space A (either discrete or continuous), a transition function P, and a reward function r. MDPs are useful for modelling sequential decision-making problems. On each round t, an agent interacting with this MDP observes the current state s_t ∈ S, selects an action a_t ∈ A, and observes a reward r(s_t, a_t) upon transitioning to a new state s_{t+1} ∼ P(·|s_t, a_t). Let γ ∈ [0, 1) be a discount factor. The goal of an agent evolving in a discounted MDP is to learn a policy π(a|s) such that taking actions a_t ∼ π(·|s_t) maximizes the expected sum of discounted rewards,

J(π) = E_π [ Σ_t γ^t r(s_t, a_t) ].
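For a finite trajectory of observed rewards, the discounted return above can be computed with a short helper; this is a generic sketch, not code from the paper.

```python
def discounted_return(rewards, gamma):
    """Sum_t gamma^t * r_t, accumulated backwards so each reward
    is discounted exactly once per remaining step."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: three steps of reward 1 with gamma = 0.5 -> 1 + 0.5 + 0.25
g = discounted_return([1.0, 1.0, 1.0], 0.5)
```

The backward accumulation is the standard trick: it avoids recomputing powers of γ and is numerically identical to the forward sum.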
The corresponding state-action value function can be written as the expected discounted rewards from taking action a in state s, that is

Q^π(s, a) = E_π [ Σ_t γ^t r(s_t, a_t) | s_0 = s, a_0 = a ].
We use ρ_π to denote the trajectory distribution induced by following policy π. If S or A are vector spaces, action and state vectors are respectively denoted by a and s.
3.2 Policy Optimization
Policy gradient methods (Sutton et al., 1999) are a class of algorithms for learning RL policies, relying on stochastic gradient descent to optimize the discounted return through gradient steps on the policy parameters. While this has been addressed in the on-policy setting, for instance by restricting changes in the policy to lie within a trust region (Schulman et al., 2015), on-policy strategies suffer from high sample complexity. On the other hand, their off-policy alternatives (Lillicrap et al., 2016) are subject to instability, especially in continuous state and action spaces. Recently, SAC (Haarnoja et al., 2018) has been shown to mitigate this by optimizing a maximum entropy policy objective function:

J(π) = Σ_t E_{(s_t, a_t) ∼ ρ_π} [ r(s_t, a_t) + α H(π(·|s_t)) ]   (1)

where H(π(·|s_t)) is the entropy of the policy and α is the importance given to the entropy regularizer. The entropy term prevents mode collapse onto the highest reward and maintains the exploration rate above some threshold.
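To see how α trades off reward and entropy, consider the one-step (bandit) special case of the objective, where the maximizer of E_π[Q] + α·H(π) has the closed form π(a) ∝ exp(Q(a)/α). The sketch below is illustrative (not from the paper): a small α yields a near-greedy policy, a large α a near-uniform one.

```python
import math

def boltzmann_policy(q_values, alpha):
    """Closed-form maximizer of E_pi[Q] + alpha * H(pi) in the one-step case:
    a softmax over Q / alpha. Larger alpha -> flatter, more exploratory policy."""
    logits = [q / alpha for q in q_values]
    m = max(logits)                         # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

q = [1.0, 2.0, 0.5]
sharp = boltzmann_policy(q, alpha=0.1)   # nearly greedy on the best arm
flat = boltzmann_policy(q, alpha=10.0)   # nearly uniform
```

This is exactly the "peakedness" role the temperature plays in the full sequential objective, where the Boltzmann distribution is taken over Q(s, ·) at every state.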
3.3 Normalizing Flows
Normalizing flows are useful for generating samples from complex probability distributions (Rezende and Mohamed, 2015). More specifically, they provide a general framework for extending the change of variable theorem for density functions to a sequence of d-dimensional real random variables z_0, z_1, …, z_K. The initial random variable z_0 has density function q_0 and is linked to the final output z_K of the flow through a sequence of invertible, smooth mappings f_1, …, f_K, called normalizing flows of length K, such that

q_K(z_K) = q_0(z_0) ∏_{k=1}^{K} |det(∂f_k / ∂z_{k−1})|^{−1}.   (2)
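Equation (2) can be sketched numerically with scalar affine maps, for which the log-determinant of each step is simply log|a_k|. The maps and base density below are illustrative assumptions: composing z → 2z + 1 with z → 0.5z − 3 gives z → z − 2.5, so the pushed-forward density is N(−2.5, 1), and the log-density computed through the flow matches the closed form at the mode.

```python
import math

def base_logpdf(z):
    """q0 = standard normal density, on the log scale."""
    return -0.5 * (z * z + math.log(2 * math.pi))

# Each flow step is an invertible affine map f_k(z) = a_k * z + b_k.
flows = [(2.0, 1.0), (0.5, -3.0)]   # (a_k, b_k), arbitrary example values

def flow_sample_logpdf(z0):
    """Push z0 through the flow and accumulate Eq. (2):
    log q_K(z_K) = log q_0(z_0) - sum_k log|det df_k/dz|."""
    z, logdet = z0, 0.0
    for a, b in flows:
        z = a * z + b
        logdet += math.log(abs(a))
    return z, base_logpdf(z0) - logdet

zK, lp = flow_sample_logpdf(0.0)   # z0 = 0 maps to the mode of q_K
```

Here the two scale factors cancel (log 2 + log 0.5 = 0), so the final log-density at the mode equals the standard normal's, which is what the composed map z → z − 2.5 predicts.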
Specific forms of the mapping can be selected based on desired properties: highly expressive and sometimes invertible maps (Rezende and Mohamed, 2015), always invertible affine transformations (Kingma, Salimans, and Welling, 2016), volume-preserving and orthogonal transformations (Tomczak and Welling, 2016), or always invertible and highly expressive networks (Huang et al., 2018). For example, the family of radial contractions around a point z_0, defined as (Rezende and Mohamed, 2015):

f(z) = z + β h(α, r)(z − z_0),  where r = ‖z − z_0‖ and h(α, r) = 1 / (α + r),   (3)

is highly expressive (i.e. represents a wide set of distributions) and yet very light parameter-wise, in addition to enjoying a closed-form determinant

det(∂f / ∂z) = [1 + β h(α, r)]^{d−1} [1 + β h(α, r) + β h′(α, r) r]   (4)

for h′(α, r) = −1 / (α + r)² and d the dimensionality of z. The family of radial maps defined above allows the approximation of the target posterior through a sequence of concentric expansions of arbitrary width, centered around a learnable point z_0. In order to guarantee that the flow is invertible, it is sufficient to pick β ≥ −α. As pointed out by Kingma, Salimans, and Welling (2016), radial and planar flows have an advantage in lower dimensions (up to a few hundred) since they require a small number of parameters while being highly expressive.
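A minimal sketch of the radial map and its closed-form determinant (Eqs. 3 and 4); the finite-difference Jacobian at the end is only a sanity check, and all parameter values are arbitrary.

```python
import numpy as np

def radial_flow(z, z0, alpha, beta):
    """Radial flow, Eq. (3): f(z) = z + beta * h(alpha, r) * (z - z0),
    with h = 1/(alpha + r) and r = ||z - z0||."""
    r = np.linalg.norm(z - z0)
    return z + beta * (z - z0) / (alpha + r)

def radial_logdet(z, z0, alpha, beta):
    """Log of the closed-form determinant, Eq. (4):
    (1 + beta*h)^(d-1) * (1 + beta*h + beta*h'*r), h' = -1/(alpha + r)^2."""
    d = z.shape[0]
    r = np.linalg.norm(z - z0)
    h = 1.0 / (alpha + r)
    hp = -1.0 / (alpha + r) ** 2
    return (d - 1) * np.log1p(beta * h) + np.log1p(beta * h + beta * hp * r)

# Invertibility requires beta >= -alpha (both log1p arguments stay positive).
z, z0 = np.array([1.0, 2.0]), np.array([0.0, 0.0])
alpha, beta = 1.0, 0.8

# Sanity check: central-difference Jacobian of f at z.
eps = 1e-6
J = np.array([(radial_flow(z + eps * e, z0, alpha, beta) -
               radial_flow(z - eps * e, z0, alpha, beta)) / (2 * eps)
              for e in np.eye(2)]).T
```

The cheap determinant is the whole point: evaluating the policy density through K such layers costs O(Kd) instead of the O(Kd³) a generic Jacobian determinant would require.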
4 Augmenting SAC with Normalizing Flows
We now propose a flow-based formulation of the off-policy maximum entropy RL objective (Eq. 1) and argue that SAC is a special case of the resulting approach, called SAC-NF, in which the normalizing flow layers are over-regularized.
4.1 Entropy maximization in RL
Recall that the SAC algorithm (Haarnoja et al., 2018) finds the information projection of the Boltzmann Q-function onto the set Π of diagonal Gaussian policies, such that an update in the policy improvement step is given by:

π_new = argmin_{π′ ∈ Π} D_KL( π′(·|s_t) ‖ exp(Q^{π_old}(s_t, ·) / α) / Z^{π_old}(s_t) )   (5)

where D_KL denotes the Kullback-Leibler divergence and α controls the temperature, i.e. the peakedness of the distribution. Equivalently, the objective can be reformulated as a maximum entropy RL task:

max_π Σ_t E_{(s_t, a_t) ∼ ρ_π} [ r(s_t, a_t) + α H(π(·|s_t)) ]   (6)
where the differential entropy of the policy is denoted pointwise as H(π(·|s_t)) for every state s_t in trajectory τ. If π(·|s) = N(μ, Σ) for some mean vector μ and covariance matrix Σ in a d-dimensional space, then H(π(·|s)) = ½ ln[(2πe)^d det Σ]. When Σ is diagonal, the entropy thus reduces, up to an additive constant, to the sum of log standard deviations. The task tackled by SAC can hence be reformulated as maximizing the expected discounted rewards while keeping the volume of the policy at a certain level. Doing so keeps exploration active and prevents mode collapse onto the highest reward.
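The diagonal Gaussian entropy formula can be made concrete: only the sum of log standard deviations depends on the learned parameters, which is the quantity SAC's regularizer effectively controls.

```python
import math

def diag_gaussian_entropy(log_stds):
    """Differential entropy of N(mu, diag(sigma^2)):
    H = d/2 * (1 + log(2*pi)) + sum_i log(sigma_i).
    The first term is a constant; the policy only moves the second."""
    d = len(log_stds)
    return 0.5 * d * (1.0 + math.log(2.0 * math.pi)) + sum(log_stds)

# 1-D standard normal: H = 0.5 * log(2*pi*e)
h = diag_gaussian_entropy([0.0])
```

Doubling every standard deviation adds exactly d·log 2 to the entropy, which is why the regularizer acts as a "volume" constraint on the policy, as described above.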
The policy is updated using Eq. 5, while the state-action value function is trained using soft updates on the critic and state value functions. A major drawback of diagonal Gaussian policies lies in their symmetry: reaching global optima that lie far away requires a large variance, while avoiding getting stuck in suboptimal solutions requires a damped entropy in cases where local and global optima are close.
4.2 Exploration through normalizing flows
We propose a class of policies comprised of an initial noise sample ε, a state-noise embedding z = h_φ(s, ε), and a normalizing flow {f_1, …, f_K} of arbitrary length K parameterized by θ. Sampling from the policy can be described by the following set of equations:

ε ∼ N(0, σ² I)   (7)

z = h_φ(s, ε)   (8)

a = f_K ∘ f_{K−1} ∘ ⋯ ∘ f_1(z)   (9)
where a depends on both the noise and the state. This deterministic mapping takes values from the set below to the action space A:

{ h_φ(s, ε) : s ∈ S, ε ∼ N(0, σ² I) }   (10)

where [s, ε] denotes the concatenation of the vectors s and ε. Intuitively, h_φ corresponds to a nonparametric generator network, while ε is sampled following the reparametrization trick in variational Bayes (Kingma and Welling, 2014). Precisely, h_φ is a state embedding function and σ plays the role of a covariance matrix. While, in theory, scaling by σ is equivalent up to a multiplicative constant to sampling from a standard normal, in practice it yields lower-variance estimates.
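The three-stage sampling procedure of Eqs. (7)-(9) can be sketched as follows. The concatenation-based embedding, the linear read-out W, and the use of radial maps are assumptions made for illustration (in the method, h_φ is a learned network and the flow parameters are trained); only the structure of noise, state-noise embedding, then flow mirrors the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def radial(z, z0, alpha, beta):
    """One radial flow step, as in Eq. (3)."""
    r = np.linalg.norm(z - z0)
    return z + beta * (z - z0) / (alpha + r)

def sample_action(state, W, sigma, flow_params):
    """Sketch of Eqs. (7)-(9). W is a placeholder linear read-out of the
    concatenated [state, noise] vector standing in for the learned h_phi."""
    eps = sigma * rng.standard_normal(state.shape[0])   # (7) eps ~ N(0, sigma^2 I)
    z = W @ np.concatenate([state, eps])                # (8) z = h_phi(s, eps)
    for z0, alpha, beta in flow_params:                 # (9) a = f_K o ... o f_1(z)
        z = radial(z, z0, alpha, beta)
    return z

a = sample_action(np.zeros(2), np.eye(2, 4), 0.1,
                  [(np.zeros(2), 1.0, 0.5)])
```

Because the whole chain is a deterministic function of (s, ε), gradients can flow from the action back to φ and θ, exactly as in the reparametrization trick.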
In the limit, we can recover the original base policy through heavy regularization of the flow parameters:

f_k(z) = z,  k = 1, …, K   (11)

for all states s, implying that π_{φ,θ} collapses onto the base policy. When , a similar statement holds for replaced by and by when . By analogy with the SAC updates, SAC-NF searches for the information projection of the Boltzmann Q-function onto the feasible set of policies by minimizing the negative variational lower bound with respect to the parameters φ and θ, using samples from the replay buffer D:
J_π(φ, θ) = E_{s_t ∼ D} [ D_KL( π_{φ,θ}(·|s_t) ‖ exp(Q(s_t, ·) / α) / Z(s_t) ) ]   (12)

where the policy density is decomposed on a log scale as

log π_{φ,θ}(a|s) = log q_0(z) − Σ_{k=1}^{K} log |det(∂f_k / ∂z_{k−1})|,  z = h_φ(s, ε).
The term involving σ appears through the chain rule, or by recalling the gradient of a multivariate normal density with respect to the Cholesky factor of its covariance. Note that for , it would be replaced by .

The (soft) state value function is parameterized by ψ and defined exactly as for SAC:

V(s_t) = E_{a_t ∼ π} [ Q(s_t, a_t) − log π(a_t|s_t) ]
and estimated by minimizing the mean squared error:

J_V(ψ) = E_{s_t ∼ D} [ ½ ( V_ψ(s_t) − E_{a_t ∼ π}[ Q(s_t, a_t) − log π(a_t|s_t) ] )² ].
Similarly, the state-action value function is parameterized by ω and learned using the standard temporal difference loss:

J_Q(ω) = E_{(s_t, a_t) ∼ D} [ ½ ( Q_ω(s_t, a_t) − r(s_t, a_t) − γ E_{s_{t+1}}[ V_ψ̄(s_{t+1}) ] )² ].
In practice, we use Monte Carlo samples to approximate the gradient of each loss:
(13) 
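The targets behind the two losses above can be sketched with scalar values; the function names and the single-sample approximation of the inner expectation are illustrative, but the targets follow the standard soft Bellman backup used by SAC.

```python
import numpy as np

def soft_value_target(q, logp):
    """Soft state-value target E[Q(s,a) - log pi(a|s)],
    approximated here with a single sampled action."""
    return q - logp

def q_target(reward, gamma, v_next, done):
    """One-step soft Bellman target r + gamma * V(s'), zero bootstrap at terminals."""
    return reward + gamma * (1.0 - done) * v_next

def mse(pred, target):
    """Mean squared error over a minibatch of targets."""
    pred, target = np.asarray(pred), np.asarray(target)
    return float(np.mean((pred - target) ** 2))

# Tiny worked example (all numbers illustrative):
v_tgt = soft_value_target(q=1.0, logp=-0.3)                      # 1.0 + 0.3 = 1.3
q_tgt = q_target(reward=1.0, gamma=0.99, v_next=1.3, done=0.0)   # 1 + 0.99 * 1.3
loss = mse([2.0], [q_tgt])
```

Averaging such per-transition errors over a replay-buffer minibatch gives exactly the Monte Carlo gradient estimates referred to in the text.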
Algorithm 1 outlines the proposed method: the major distinction from the original SAC is the additional gradient step on the normalizing flow layers while holding the base policy function constant.
5 Experiments
This section addresses two major points: (1) it highlights the beneficial impact of NF on exploration through two toy examples, and (2) it compares the proposed SAC-NF approach against SAC, the current state-of-the-art, on a set of continuous control tasks from MuJoCo (Todorov, Erez, and Tassa, 2012). For all experiments, we hold the entropy rate constant at for every environment except Humanoid-v2, following the tuning reported by Haarnoja et al. (2018). Optimal values for SAC-NF are reported in the Appendix (Table 2).
5.1 Managing multimodal policies
We first conduct a synthetic experiment to illustrate how augmenting a base policy with normalizing flows allows it to represent multimodal policies. We consider a navigation task environment with continuous state and action spaces, consisting of four goal states symmetrically placed around the origin. The agent starts at the origin and, on each time step t, receives a reward corresponding to the Euclidean distance to the closest goal. We consider a SAC-NF agent (Algorithm 1 with minibatch size , 4 flows and one hidden layer of 8 neurons), which can represent radial policies. The agent is trained over epochs, each epoch consisting of time steps.

Figure 1 displays some trajectories sampled by the SAC-NF agent along with a kernel density estimate (KDE) of the agent's terminal state visitations. Trajectories are obtained by sampling from the policy distribution instead of taking the average action. We observe that the SAC-NF agent, following a flow-based policy, successfully visits all four modes.
5.2 Robustness to confounding rewards
We now show, through an environment with deceptive rewards, that radial policies, unlike the Gaussian policy, are able to find the globally optimal solution. We consider an environment composed of three reward areas: a locally optimal strip around the initial state and a global optimum on the opposite end of the room, separated by a pit of highly negative reward. The agent starts at the position and must navigate into the high-reward area without falling into the pit. On each time step t, the agent receives the reward associated with its current location.
We compare the SAC-NF agent (Algorithm 1 with minibatch size , 4 flows and one hidden layer of 8 neurons), which can represent radial policies, with a classical SAC agent (two hidden layers of 16 units) that models Gaussian policies. Both agents are trained over epochs, each epoch consisting of time steps.
Figure 2 displays the trajectories visited by both agents. This highlights the biggest weakness of diagonal Gaussian policies: the agent is unable to reach the region of high rewards while simultaneously avoiding the center of the room. In this case, lowering the entropy threshold leads to the conservative behaviour of staying in the yellow zone; increasing the entropy leads the agent to die without reaching the goal. Breaking the symmetry of the policy by adding (in this case three) radial flows allows the agent to successfully reach the target area by walking along the safe path bordering the room.
In the case of steep reward functions, where low rewards border on high rewards, symmetric distributions force the agent to explore in all possible directions. This sometimes leads the agent to the high-reward region but, more dangerously, also makes it fall into low-reward areas with nonzero probability.
5.3 MuJoCo locomotion benchmarks
In this section, we compare our SAC-NF method against the SAC baseline on five continuous control tasks from the MuJoCo suite (see Figure 3): Ant-v2, HalfCheetah-v2, Humanoid-v2, Hopper-v2 and Walker2d-v2.
The SAC-NF agent consists of one feedforward hidden layer of units acting as a state embedding, followed by a normalizing flow of length . Details of the model can be found in Table 2. For the SAC baseline, two hidden layers of units are used. The critic and value function architectures are the same as in Haarnoja et al. (2018). All networks are trained with the Adam optimizer (Kingma and Ba, 2015) with a learning rate of .
Figure 4 displays the performance of both SAC and SAC-NF. Each curve is averaged over 5 random seeds (bold line) and one-standard-deviation confidence intervals are represented as shaded regions over 1 million steps (or 2 million steps for Humanoid-v2). Every environment steps, we evaluate our policy times and report the average. The best observed reward for each method can be found in Table 1. We observe that SAC-NF converges faster, which translates into better sample efficiency compared to the baseline. Indeed, SAC-NF takes advantage of the expressivity of normalizing flows to explore better and thus discover new policies. In particular, we notice that SAC-NF performs well on Humanoid-v2 and Ant-v2, two challenging tasks which are known to require hard exploration.

Table 1 not only shows better performance from SAC-NF in most of the environments, but also points out the reduction in the number of parameters for our policy architecture. For instance, on Hopper-v2, we could reduce the number of parameters by up to ( parameters for the SAC baseline versus for SAC-NF), and by in Humanoid-v2, while performing at least as well as the baseline. Moreover, note that SAC-NF uses a lower entropy rate than the baseline ( versus ), suggesting that, unlike SAC which explicitly encourages exploratory actions, SAC-NF achieves exploration through fitting the Boltzmann Q-function.
                   SAC            SAC-NF
Ant-v2             4071 ± 341     33
HalfCheetah-v2     6609 ± 1334    8.6
Hopper-v2          3039 ± 698     5.6
Humanoid-v2        5613 ± 243     62
Walker2d-v2        4210 ± 353     8.5
6 Conclusion
In this paper, we proposed a novel algorithm which combines soft actor-critic updates with a sequence of normalizing flows of arbitrary length. The high expressivity of the latter allows the agent to (1) quickly discover richer policies, (2) compress the cumbersome Gaussian policy into a lighter network, and (3) better avoid local optima. Our proposed algorithm leverages connections between maximum entropy reinforcement learning and the evidence lower bound used to optimize variational approximations; this relationship allows for a straightforward extension of SAC to a wider class of distributions. We demonstrated through experiments on five MuJoCo environments that our method converges faster than the baseline, vastly improves the coverage of the parameter space, and is at least as performant as SAC.
Acknowledgements
We want to thank Compute Canada/Calcul Québec and Mila – Quebec AI Institute for providing computational resources. We also thank ChinWei Huang for insightful discussions.
References
 Aytar et al. (2018) Aytar, Y.; Pfaff, T.; Budden, D.; Paine, T. L.; Wang, Z.; and de Freitas, N. 2018. Playing hard exploration games by watching YouTube. Advances in Neural Information Processing Systems.
 Bellemare et al. (2016) Bellemare, M. G.; Srinivasan, S.; Ostrovski, G.; Schaul, T.; Saxton, D.; and Munos, R. 2016. Unifying count-based exploration and intrinsic motivation. Advances in Neural Information Processing Systems.

 Bellemare, Dabney, and Munos (2017) Bellemare, M. G.; Dabney, W.; and Munos, R. 2017. A distributional perspective on reinforcement learning. International Conference on Machine Learning.
 Bellman (1957) Bellman, R. 1957. A Markovian decision process. Journal of Mathematics and Mechanics 679–684.
 Conti et al. (2017) Conti, E.; Madhavan, V.; Such, F. P.; Lehman, J.; Stanley, K. O.; and Clune, J. 2017. Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. Advances in Neural Information Processing Systems.
 Doan, Mazoure, and Lyle (2018) Doan, T.; Mazoure, B.; and Lyle, C. 2018. GAN Q-learning. arXiv preprint arXiv:1805.04874.
 Ecoffet et al. (2019) Ecoffet, A.; Huizinga, J.; Lehman, J.; Stanley, K. O.; and Clune, J. 2019. Go-Explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995.
 Fujimoto, van Hoof, and Meger (2018) Fujimoto, S.; van Hoof, H.; and Meger, D. 2018. Addressing function approximation error in actor-critic methods. International Conference on Machine Learning.

 Gal and Ghahramani (2016) Gal, Y., and Ghahramani, Z. 2016. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, 1050–1059.
 Goodfellow et al. (2014) Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2672–2680.
 Haarnoja et al. (2017) Haarnoja, T.; Tang, H.; Abbeel, P.; and Levine, S. 2017. Reinforcement learning with deep energy-based policies. CoRR abs/1702.08165.
 Haarnoja et al. (2018) Haarnoja, T.; Zhou, A.; Abbeel, P.; and Levine, S. 2018. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. International Conference on Machine Learning.
 Henderson et al. (2017) Henderson, P.; Doan, T.; Islam, R.; and Meger, D. 2017. Bayesian policy gradients via alpha divergence dropout inference. NIPS Bayesian Deep Learning Workshop.
 Huang et al. (2018) Huang, C.; Krueger, D.; Lacoste, A.; and Courville, A. C. 2018. Neural autoregressive flows. International Conference on Machine Learning.
 Kingma and Ba (2015) Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. International Conference on Learning Representations.
 Kingma and Welling (2014) Kingma, D. P., and Welling, M. 2014. Auto-encoding variational Bayes. International Conference on Learning Representations.
 Kingma, Salimans, and Welling (2016) Kingma, D. P.; Salimans, T.; and Welling, M. 2016. Improving variational inference with inverse autoregressive flow. Advances in Neural Information Processing Systems.
 Lillicrap et al. (2016) Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2016. Continuous control with deep reinforcement learning. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2–4, 2016, Conference Track Proceedings.
 Maaten and Hinton (2008) Maaten, L. v. d., and Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov):2579–2605.
 Mnih et al. (2016) Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T. P.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. CoRR abs/1602.01783.
 Osband et al. (2016) Osband, I.; Blundell, C.; Pritzel, A.; and Roy, B. V. 2016. Deep exploration via bootstrapped DQN. Advances in Neural Information Processing Systems.
 Pathak et al. (2017) Pathak, D.; Agrawal, P.; Efros, A. A.; and Darrell, T. 2017. Curiosity-driven exploration by self-supervised prediction. International Conference on Machine Learning.
 Puterman (2014) Puterman, M. L. 2014. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
 Rezende and Mohamed (2015) Rezende, D. J., and Mohamed, S. 2015. Variational inference with normalizing flows. International Conference on Machine Learning.
 Schulman et al. (2015) Schulman, J.; Levine, S.; Moritz, P.; Jordan, M. I.; and Abbeel, P. 2015. Trust region policy optimization.
 Sutton et al. (1999) Sutton, R. S.; McAllester, D.; Singh, S.; and Mansour, Y. 1999. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, NIPS’99, 1057–1063. Cambridge, MA, USA: MIT Press.
 Tang and Agrawal (2018) Tang, Y., and Agrawal, S. 2018. Boosting trust region policy optimization by normalizing flows policy.
 Tang et al. (2016) Tang, H.; Houthooft, R.; Foote, D.; Stooke, A.; Chen, X.; Duan, Y.; Schulman, J.; Turck, F. D.; and Abbeel, P. 2016. #Exploration: A study of count-based exploration for deep reinforcement learning. Advances in Neural Information Processing Systems.
 Todorov, Erez, and Tassa (2012) Todorov, E.; Erez, T.; and Tassa, Y. 2012. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, 5026–5033. IEEE.
 Tomczak and Welling (2016) Tomczak, J. M., and Welling, M. 2016. Improving variational autoencoders using Householder flow. NIPS Bayesian Deep Learning Workshop.
Supplementary Material
Experimental parameters
We provide a table of hyperparameters used to obtain results in the MuJoCo domain. Note that 'concat' corresponds to the concatenated model, 'average' to the average model, and 'conditional' to the conditional model.

NF parameters
Environment        flows   type     alpha   model
Ant-v2                     radial           conditional
HalfCheetah-v2             radial           concat
Hopper-v2                  radial           conditional
Humanoid-v2                radial           average
Walker2d-v2                radial           average

Adam Optimizer parameters
Algorithm parameters
size
Comparing visitation states
In this part, we highlight the advantage of augmenting a simple policy with normalizing flows. For that purpose, we compared the state visitation counts of SAC-NF and SAC on Humanoid-v2, a challenging task which is known to require a high amount of exploration to avoid deceptively falling into a suboptimal solution. In Figure 5, we ran rollouts for each method (keeping the noise on) and projected the visited states into two dimensions (with t-SNE (Maaten and Hinton, 2008)). The expressivity of NF allows our method to visit more diverse states and hence converge to good policies earlier. The major downside of the Gaussian policy lies in its symmetric nature, which prevents it from arbitrarily focusing on subsets of the action space.