1 Introduction
Deep reinforcement learning (deep RL) has emerged as a promising direction for autonomous acquisition of complex behaviors (Mnih et al., 2015; Silver et al., 2016), due to its ability to process complex sensory input (Jaderberg et al., 2016) and to acquire elaborate behavior skills using general-purpose neural network representations (Levine et al., 2016). Deep reinforcement learning methods can be used to optimize deterministic (Lillicrap et al., 2015) and stochastic (Schulman et al., 2015a; Mnih et al., 2016) policies. However, most deep RL methods operate on the conventional deterministic notion of optimality, where the optimal solution, at least under full observability, is always a deterministic policy (Sutton & Barto, 1998). Although stochastic policies are desirable for exploration, this exploration is typically attained heuristically, for example by injecting noise (Silver et al., 2014; Lillicrap et al., 2015; Mnih et al., 2015) or initializing a stochastic policy with high entropy (Kakade, 2002; Schulman et al., 2015a; Mnih et al., 2016).
In some cases, we might actually prefer to learn stochastic behaviors. In this paper, we explore two potential reasons for this: exploration in the presence of multimodal objectives, and compositionality attained via pretraining. Other benefits include robustness in the face of uncertain dynamics (Ziebart, 2010), imitation learning (Ziebart et al., 2008), and improved convergence and computational properties (Gu et al., 2016a). Multimodality also has application in real robot tasks, as demonstrated in (Daniel et al., 2012). However, in order to learn such policies, we must define an objective that promotes stochasticity.
In which cases is a stochastic policy actually the optimal solution? As discussed in prior work, a stochastic policy emerges as the optimal answer when we consider the connection between optimal control and probabilistic inference (Todorov, 2008). While there are multiple instantiations of this framework, they typically include the cost or reward function as an additional factor in a factor graph, and infer the optimal conditional distribution over actions conditioned on states. The solution can be shown to optimize an entropy-augmented reinforcement learning objective or to correspond to the solution to a maximum entropy learning problem (Toussaint, 2009). Intuitively, framing control as inference produces policies that aim to capture not only the single deterministic behavior that has the lowest cost, but the entire range of low-cost behaviors, explicitly maximizing the entropy of the corresponding policy. Instead of learning the best way to perform the task, the resulting policies try to learn all of the ways of performing the task. It should now be apparent why such policies might be preferred: if we can learn all of the ways that a given task might be performed, the resulting policy can serve as a good initialization for fine-tuning to a more specific behavior (e.g., first learning all the ways that a robot could move forward, and then using this as an initialization to learn separate running and bounding skills); a better exploration mechanism for seeking out the best mode in a multimodal reward landscape; and a more robust behavior in the face of adversarial perturbations, where the ability to perform the same task in multiple different ways can provide the agent with more options to recover from perturbations.
Unfortunately, solving such maximum entropy stochastic policy learning problems in the general case is challenging. A number of methods have been proposed, including Z-learning (Todorov, 2007), maximum entropy inverse RL (Ziebart et al., 2008), approximate inference using message passing (Toussaint, 2009), $\Psi$-learning (Rawlik et al., 2012), and G-learning (Fox et al., 2016), as well as more recent proposals in deep RL such as PGQ (O’Donoghue et al., 2016). However, these generally either operate on simple tabular representations, which are difficult to apply to continuous or high-dimensional domains, or employ a simple parametric representation of the policy distribution, such as a conditional Gaussian. Therefore, although the policy is optimized to perform the desired skill in many different ways, the resulting distribution is typically very limited in terms of its representational power, even if the parameters of that distribution are represented by an expressive function approximator, such as a neural network.
How can we extend the framework of maximum entropy policy search to arbitrary policy distributions? In this paper, we borrow an idea from energy-based models, which in turn reveals an intriguing connection between Q-learning, actor-critic algorithms, and probabilistic inference. In our method, we formulate a stochastic policy as a (conditional) energy-based model (EBM), with the energy function corresponding to the “soft” Q-function obtained when optimizing the maximum entropy objective. In high-dimensional continuous spaces, sampling from this policy, just as with any general EBM, becomes intractable. We borrow from the recent literature on EBMs to devise an approximate sampling procedure based on training a separate sampling network, which is optimized to produce unbiased samples from the policy EBM. This sampling network can then be used both for updating the EBM and for action selection. In the parlance of reinforcement learning, the sampling network is the actor in an actor-critic algorithm. This reveals an intriguing connection: entropy-regularized actor-critic algorithms can be viewed as approximate Q-learning methods, with the actor serving the role of an approximate sampler from an intractable posterior. We explore this connection further in the paper, and in the course of this discuss connections to popular deep RL methods such as deterministic policy gradient (DPG) (Silver et al., 2014; Lillicrap et al., 2015), normalized advantage functions (NAF) (Gu et al., 2016b), and PGQ (O’Donoghue et al., 2016).
The principal contribution of this work is a tractable, efficient algorithm for optimizing arbitrary multimodal stochastic policies represented by energy-based models, as well as a discussion that relates this method to other recent algorithms in RL and probabilistic inference. In our experimental evaluation, we explore two potential applications of our approach. First, we demonstrate improved exploration performance in tasks with multimodal reward landscapes, where conventional deterministic or unimodal methods are at high risk of falling into suboptimal local optima. Second, we explore how our method can be used to provide a degree of compositionality in reinforcement learning by showing that stochastic energy-based policies can serve as a much better initialization for learning new skills than either random policies or policies pretrained with conventional maximum reward objectives.
2 Preliminaries
In this section, we will define the reinforcement learning problem that we are addressing and briefly summarize the maximum entropy policy search objective. We will also present a few useful identities that we will build on in our algorithm, which will be presented in Section 3.
2.1 Maximum Entropy Reinforcement Learning
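We will address learning of maximum entropy policies with approximate inference for reinforcement learning in continuous action spaces. Our reinforcement learning problem can be defined as policy search in an infinite-horizon Markov decision process (MDP), which consists of the tuple $(\mathcal{S}, \mathcal{A}, p_s, r)$. The state space $\mathcal{S}$ and action space $\mathcal{A}$ are assumed to be continuous, and the state transition probability $p_s: \mathcal{S} \times \mathcal{S} \times \mathcal{A} \to [0, \infty)$ represents the probability density of the next state $s_{t+1} \in \mathcal{S}$ given the current state $s_t \in \mathcal{S}$ and action $a_t \in \mathcal{A}$. The environment emits a reward $r: \mathcal{S} \times \mathcal{A} \to [r_{\min}, r_{\max}]$ on each transition, which we will abbreviate as $r_t \triangleq r(s_t, a_t)$ to simplify notation. We will also use $\rho_\pi(s_t)$ and $\rho_\pi(s_t, a_t)$ to denote the state and state-action marginals of the trajectory distribution induced by a policy $\pi(a_t \mid s_t)$.
Our goal is to learn a policy $\pi(a_t \mid s_t)$. We can define the standard reinforcement learning objective in terms of the above quantities as
$$\pi^*_{\mathrm{std}} = \arg\max_{\pi} \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ r(s_t, a_t) \right]. \tag{1}$$
Maximum entropy RL augments the reward with an entropy term, such that the optimal policy aims to maximize its entropy at each visited state:
$$\pi^*_{\mathrm{MaxEnt}} = \arg\max_{\pi} \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t)) \right], \tag{2}$$
where $\alpha$ is an optional but convenient parameter that can be used to determine the relative importance of entropy and reward. (Footnote 1: In principle, $1/\alpha$ can be folded into the reward function, eliminating the need for an explicit multiplier, but in practice, it is often convenient to keep $\alpha$ as a hyperparameter.)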
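To make the effect of the entropy term concrete, the following is a minimal sketch of the objective for a single decision step with a discrete action set. The three-action reward vector, the two candidate policies, and the value of `alpha` are illustrative assumptions, not quantities from the paper. Under the standard objective both policies score the same, while the entropy-augmented objective prefers the one that covers both good actions.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def maxent_objective(probs, rewards, alpha):
    """One-step entropy-augmented objective: E_pi[r] + alpha * H(pi)."""
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    return expected_reward + alpha * entropy(probs)

# Hypothetical one-step problem: two equally good actions and one bad action.
rewards = [1.0, 1.0, 0.0]
deterministic = [1.0, 0.0, 0.0]   # commits to a single good action
two_mode = [0.5, 0.5, 0.0]        # spreads probability over both good actions

alpha = 0.1
j_det = maxent_objective(deterministic, rewards, alpha)
j_two = maxent_objective(two_mode, rewards, alpha)
print(j_det, j_two)
```

The two policies have the same expected reward, so the gap between `j_det` and `j_two` is exactly `alpha * log(2)`, the entropy bonus for keeping both modes.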
Optimization problems of this type have been explored in a number of prior works (Kappen, 2005; Todorov, 2007; Ziebart et al., 2008), which are covered in more detail in Section 4. Note that this objective differs qualitatively from the behavior of Boltzmann exploration (Sallans & Hinton, 2004) and PGQ (O’Donoghue et al., 2016), which greedily maximize entropy at the current time step, but do not explicitly optimize for policies that aim to reach states where they will have high entropy in the future. This distinction is crucial, since the maximum entropy objective can be shown to maximize the entropy of the entire trajectory distribution for the policy $\pi$, while the greedy Boltzmann exploration approach does not (Ziebart et al., 2008; Levine & Abbeel, 2014). As we will discuss in Section 5, this maximum entropy formulation has a number of benefits, such as improved exploration in multimodal problems and better pretraining for later adaptation.
If we wish to extend either the conventional or the maximum entropy RL objective to infinite-horizon problems, it is convenient to also introduce a discount factor $\gamma$ to ensure that the sum of expected rewards (and entropies) is finite. In the context of policy search algorithms, the use of a discount factor is actually a somewhat nuanced choice, and writing down the precise objective that is optimized when using the discount factor is nontrivial (Thomas, 2014). We defer the full derivation of the discounted objective to Appendix A, since it is unwieldy to write out explicitly, but we will use the discount $\gamma$ in the following derivations and in our final algorithm.
2.2 Soft Value Functions and Energy-Based Models
Optimizing the maximum entropy objective in (2) provides us with a framework for training stochastic policies, but we must still choose a representation for these policies. The choices in prior work include discrete multinomial distributions (O’Donoghue et al., 2016) and Gaussian distributions (Rawlik et al., 2012). However, if we want to use a very general class of distributions that can represent complex, multimodal behaviors, we can instead opt for using general energy-based policies of the form
$$\pi(a_t \mid s_t) \propto \exp\left( -\mathcal{E}(s_t, a_t) \right), \tag{3}$$
where $\mathcal{E}$ is an energy function that could be represented, for example, by a deep neural network. If we use a universal function approximator for $\mathcal{E}$, we can represent any distribution $\pi(a_t \mid s_t)$. There is a close connection between such energy-based models and soft versions of value functions and Q-functions, where we set $\mathcal{E}(s_t, a_t) = -\frac{1}{\alpha} Q_{\mathrm{soft}}(s_t, a_t)$ and use the following theorem:
Theorem 1.
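Let the soft Q-function be defined by
$$Q^*_{\mathrm{soft}}(s_t, a_t) = r_t + \mathbb{E}_{(s_{t+1}, \ldots) \sim \rho_\pi}\left[ \sum_{l=1}^{\infty} \gamma^l \left( r_{t+l} + \alpha \mathcal{H}\left( \pi^*_{\mathrm{MaxEnt}}(\cdot \mid s_{t+l}) \right) \right) \right], \tag{4}$$
and soft value function by
$$V^*_{\mathrm{soft}}(s_t) = \alpha \log \int_{\mathcal{A}} \exp\left( \frac{1}{\alpha} Q^*_{\mathrm{soft}}(s_t, a') \right) da'. \tag{5}$$
Then the optimal policy for (2) is given by
$$\pi^*_{\mathrm{MaxEnt}}(a_t \mid s_t) = \exp\left( \frac{1}{\alpha} \left( Q^*_{\mathrm{soft}}(s_t, a_t) - V^*_{\mathrm{soft}}(s_t) \right) \right). \tag{6}$$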
Proof.
See Appendix A.1 as well as (Ziebart, 2010). ∎
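As an illustration of Theorem 1, the sketch below evaluates the soft value and the corresponding energy-based policy for a single state. It assumes a discrete action set, so the integral over actions becomes a sum; the Q-values and temperatures are made up for the example. It also checks numerically that the soft value approaches the ordinary maximum as the temperature shrinks.

```python
import math

def soft_value(q_values, alpha):
    """V = alpha * log sum_a exp(Q(a) / alpha), computed with the max subtracted for stability."""
    m = max(q_values)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_values))

def maxent_policy(q_values, alpha):
    """pi(a) = exp((Q(a) - V) / alpha); the soft value V acts as the normalizer."""
    v = soft_value(q_values, alpha)
    return [math.exp((q - v) / alpha) for q in q_values]

# Hypothetical Q-values for one state: two equally good actions and one poor action.
q = [1.0, 1.0, -1.0]
pi = maxent_policy(q, alpha=1.0)
v_small = soft_value(q, alpha=1e-3)   # close to max(q) for a small temperature
print(pi, v_small)
```

Because the normalizer is the soft value itself, the policy needs no separate partition-function computation in the discrete case. In continuous action spaces this normalizer is the intractable integral that motivates the approximate sampling procedure.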
Theorem 1 connects the maximum entropy objective in (2) and energy-based models, where $\frac{1}{\alpha} Q_{\mathrm{soft}}(s_t, a_t)$ acts as the negative energy, and $\frac{1}{\alpha} V_{\mathrm{soft}}(s_t)$ serves as the log-partition function. As with the standard Q-function and value function, we can relate the Q-function to the value function at a future state via a soft Bellman equation:
Theorem 2.
The soft Q-function in (4) satisfies the soft Bellman equation
$$Q^*_{\mathrm{soft}}(s_t, a_t) = r_t + \gamma \, \mathbb{E}_{s_{t+1} \sim p_s}\left[ V^*_{\mathrm{soft}}(s_{t+1}) \right], \tag{7}$$
where the soft value function $V^*_{\mathrm{soft}}$ is given by (5).
Proof.
See Appendix A.2, as well as (Ziebart, 2010). ∎
The soft Bellman equation is a generalization of the conventional (hard) equation, where we can recover the more standard equation as $\alpha \to 0$, which causes (5) to approach a hard maximum over the actions. In the next section, we will discuss how we can use these identities to derive a Q-learning style algorithm for learning maximum entropy policies, and how we can make this practical for arbitrary Q-function representations via an approximate inference procedure.
3 Training Expressive Energy-Based Models via Soft Q-Learning
In this section, we will present our proposed reinforcement learning algorithm, which is based on the soft Q-function described in the previous section, but can be implemented via a tractable stochastic gradient descent procedure with approximate sampling. We will first describe the general case of soft Q-learning, and then present the inference procedure that makes it tractable to use with deep neural network representations in high-dimensional continuous state and action spaces. In the process, we will relate this Q-learning procedure to inference in energy-based models and actor-critic algorithms.
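3.1 Soft Q-Iteration
We can obtain a solution to (7) by iteratively updating estimates of $V^*_{\mathrm{soft}}$ and $Q^*_{\mathrm{soft}}$. This leads to a fixed-point iteration that resembles Q-iteration:
Theorem 3.
Soft Q-iteration. Let $Q_{\mathrm{soft}}(\cdot, \cdot)$ and $V_{\mathrm{soft}}(\cdot)$ be bounded and assume that $\int_{\mathcal{A}} \exp\left( \frac{1}{\alpha} Q_{\mathrm{soft}}(\cdot, a') \right) da' < \infty$ and that $Q^*_{\mathrm{soft}} < \infty$ exists. Then the fixed-point iteration
$$Q_{\mathrm{soft}}(s_t, a_t) \leftarrow r_t + \gamma \, \mathbb{E}_{s_{t+1} \sim p_s}\left[ V_{\mathrm{soft}}(s_{t+1}) \right], \quad \forall s_t, a_t, \tag{8}$$
$$V_{\mathrm{soft}}(s_t) \leftarrow \alpha \log \int_{\mathcal{A}} \exp\left( \frac{1}{\alpha} Q_{\mathrm{soft}}(s_t, a') \right) da', \quad \forall s_t, \tag{9}$$
converges to $Q^*_{\mathrm{soft}}$ and $V^*_{\mathrm{soft}}$, respectively.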
Proof.
See Appendix A.2 as well as (Fox et al., 2016). ∎
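The fixed-point iteration of Theorem 3 can be sketched in a tabular setting, where both updates are exact. The two-state, two-action MDP below, with deterministic transitions, is a made-up example, and the integral over actions is replaced by a sum over the discrete action set. This is only meant to show the alternation between the two updates, not the continuous-space algorithm developed in the following sections.

```python
import math

def soft_value(q_row, alpha):
    """Discrete analogue of the soft value update: alpha * log sum_a exp(Q(s, a) / alpha)."""
    m = max(q_row)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_row))

def soft_q_iteration(rewards, next_state, gamma, alpha, iters=500):
    """Alternate the soft value update and the soft Q backup for a fixed number of sweeps."""
    n_states, n_actions = len(rewards), len(rewards[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iters):
        v = [soft_value(row, alpha) for row in q]            # soft value update
        q = [[rewards[s][a] + gamma * v[next_state[s][a]]    # soft Bellman backup
              for a in range(n_actions)] for s in range(n_states)]
    v = [soft_value(row, alpha) for row in q]
    return q, v

# Hypothetical MDP: rewards[s][a] and deterministic successor next_state[s][a].
rewards = [[0.0, 1.0], [1.0, 0.0]]
next_state = [[0, 1], [0, 1]]
gamma, alpha = 0.9, 0.5
q, v = soft_q_iteration(rewards, next_state, gamma, alpha)
policy = [[math.exp((q[s][a] - v[s]) / alpha) for a in range(2)] for s in range(2)]
print(q, v, policy)
```

After convergence, the returned tables satisfy the soft Bellman equation up to numerical precision, and each row of `policy` is a normalized distribution over actions.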
We refer to the updates in (8) and (9) as the soft Bellman backup operator that acts on the soft value function, and denote it by $\mathcal{T}$. The maximum entropy policy in (6) can then be recovered by iteratively applying this operator until convergence. However, there are several practicalities that need to be considered in order to make use of the algorithm. First, the soft Bellman backup cannot be performed exactly in continuous or large state and action spaces, and second, sampling from the energy-based model in (6) is intractable in general. We will address these challenges in the following sections.
3.2 Soft Q-Learning
This section discusses how the Bellman backup in Theorem 3 can be implemented in a practical algorithm that uses a finite set of samples from the environment, resulting in a method similar to Q-learning. Since the soft Bellman backup is a contraction (see Appendix A.2), the optimal value function is the fixed point of the Bellman backup, and we can find it by optimizing for a Q-function for which the soft Bellman error is minimized at all states and actions. While this procedure is still intractable due to the integral in (9) and the infinite set of all states and actions, we can express it as a stochastic optimization, which leads to a stochastic gradient descent update procedure. We will model the soft Q-function with a function approximator with parameters $\theta$ and denote it as $Q^\theta_{\mathrm{soft}}(s_t, a_t)$.
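To convert Theorem 3 into a stochastic optimization problem, we first express the soft value function in terms of an expectation via importance sampling:
$$V^\theta_{\mathrm{soft}}(s_t) = \alpha \log \mathbb{E}_{q_{a'}}\left[ \frac{\exp\left( \frac{1}{\alpha} Q^\theta_{\mathrm{soft}}(s_t, a') \right)}{q_{a'}(a')} \right], \tag{10}$$
where $q_{a'}$ can be an arbitrary distribution over the action space. Second, by noting the identity $g_1(x) = g_2(x) \; \forall x \in \mathbb{X} \Leftrightarrow \mathbb{E}_{x \sim q}\left[ (g_1(x) - g_2(x))^2 \right] = 0$, where $q$ can be any strictly positive density function on $\mathbb{X}$, we can express the soft Q-iteration in an equivalent form as minimizing
$$J_Q(\theta) = \mathbb{E}_{s_t \sim q_{s_t}, a_t \sim q_{a_t}}\left[ \frac{1}{2} \left( \hat{Q}^{\bar{\theta}}_{\mathrm{soft}}(s_t, a_t) - Q^\theta_{\mathrm{soft}}(s_t, a_t) \right)^2 \right], \tag{11}$$
where $q_{s_t}$ and $q_{a_t}$ are positive over $\mathcal{S}$ and $\mathcal{A}$ respectively, and $\hat{Q}^{\bar{\theta}}_{\mathrm{soft}}(s_t, a_t) = r_t + \gamma \, \mathbb{E}_{s_{t+1} \sim p_s}\left[ V^{\bar{\theta}}_{\mathrm{soft}}(s_{t+1}) \right]$ is a target Q-value, with $V^{\bar{\theta}}_{\mathrm{soft}}(s_{t+1})$ given by (10) and $\theta$ being replaced by the target parameters, $\bar{\theta}$.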
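The importance-sampled form of the soft value can be checked numerically on a case with a known answer. The sketch below assumes a one-dimensional action space, a toy quadratic Q-function $Q(a) = -a^2/2$ at a single fixed state, a temperature of 1, and a uniform proposal on $[-6, 6]$; all of these are illustrative choices. For this Q-function the exact soft value is $\frac{1}{2}\log(2\pi)$, up to a negligible truncation error from the finite proposal support.

```python
import math
import random

def q_fn(a):
    """Toy soft Q-function for one fixed state: Q(a) = -a^2 / 2."""
    return -0.5 * a * a

def soft_value_is(q, alpha, low, high, n_samples, rng):
    """Monte Carlo estimate of alpha * log E_{a' ~ Uniform(low, high)}[exp(Q(a') / alpha) / density]."""
    density = 1.0 / (high - low)
    total = 0.0
    for _ in range(n_samples):
        a = rng.uniform(low, high)
        total += math.exp(q(a) / alpha) / density
    return alpha * math.log(total / n_samples)

rng = random.Random(0)
v_hat = soft_value_is(q_fn, alpha=1.0, low=-6.0, high=6.0, n_samples=20000, rng=rng)
v_exact = 0.5 * math.log(2.0 * math.pi)   # log of the Gaussian integral sqrt(2 * pi)
print(v_hat, v_exact)
```

A uniform proposal is adequate in one dimension, but most samples land where $\exp(Q/\alpha)$ is tiny, which hints at why this choice degrades as the action dimension grows.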
This stochastic optimization problem can be solved approximately using stochastic gradient descent with sampled states and actions. While the sampling distributions $q_{s_t}$ and $q_{a_t}$ can be arbitrary, we typically use real samples from rollouts of the current policy $\pi(a_t \mid s_t) \propto \exp\left( \frac{1}{\alpha} Q^\theta_{\mathrm{soft}}(s_t, a_t) \right)$. For $q_{a'}$ we have more options. A convenient choice is a uniform distribution. However, this choice can scale poorly to high dimensions. A better choice is to use the current policy, which produces an unbiased estimate of the soft value, as can be confirmed by substitution. This overall procedure yields an iterative approach that optimizes over the Q-values, which we summarize in Section 3.4.
However, in continuous spaces, we still need a tractable way to sample from the policy $\pi(a_t \mid s_t) \propto \exp\left( \frac{1}{\alpha} Q^\theta_{\mathrm{soft}}(s_t, a_t) \right)$, both to take on-policy actions and, if so desired, to generate action samples for estimating the soft value function. Since the form of the policy is so general, sampling from it is intractable. We will therefore use an approximate sampling procedure, as discussed in the following section.
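The substitution that confirms the claim about using the current policy as the proposal can be written out in a few lines. This is a sketch that assumes the proposal is exactly the energy-based policy of Theorem 1, $\pi(a' \mid s_t) = \exp\left(\frac{1}{\alpha}(Q_{\mathrm{soft}}(s_t, a') - V_{\mathrm{soft}}(s_t))\right)$, rather than an approximate sampler:

```latex
\begin{aligned}
\alpha \log \mathbb{E}_{a' \sim \pi}\left[
  \frac{\exp\left(\tfrac{1}{\alpha} Q_{\mathrm{soft}}(s_t, a')\right)}{\pi(a' \mid s_t)}
\right]
&= \alpha \log \mathbb{E}_{a' \sim \pi}\left[
  \frac{\exp\left(\tfrac{1}{\alpha} Q_{\mathrm{soft}}(s_t, a')\right)}
       {\exp\left(\tfrac{1}{\alpha}\left(Q_{\mathrm{soft}}(s_t, a') - V_{\mathrm{soft}}(s_t)\right)\right)}
\right] \\
&= \alpha \log \mathbb{E}_{a' \sim \pi}\left[
  \exp\left(\tfrac{1}{\alpha} V_{\mathrm{soft}}(s_t)\right)
\right] \\
&= V_{\mathrm{soft}}(s_t).
\end{aligned}
```

Every sample contributes the same importance-weighted value, so under this idealized assumption the estimator has zero variance. With an approximate sampler the weights are no longer constant, and the estimate is only as good as the sampler.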
3.3 Approximate Sampling and Stein Variational Gradient Descent (SVGD)
In this section we describe how we can approximately sample from the energy-based policy defined by the soft Q-function. Existing approaches that sample from energy-based distributions generally fall into two categories: methods that use Markov chain Monte Carlo (MCMC) based sampling (Sallans & Hinton, 2004), and methods that learn a stochastic sampling network trained to output approximate samples from the target distribution (Zhao et al., 2016; Kim & Bengio, 2016). Since sampling via MCMC is not tractable when the inference must be performed online (e.g. when executing a policy), we will use a sampling network based on Stein variational gradient descent (SVGD) (Liu & Wang, 2016) and amortized SVGD (Wang & Liu, 2016). Amortized SVGD has several intriguing properties: First, it provides us with a stochastic sampling network that we can query for extremely fast sample generation. Second, it can be shown to converge to an accurate estimate of the posterior distribution of an EBM. Third, the resulting algorithm, as we will show later, strongly resembles an actor-critic algorithm, which provides for a simple and computationally efficient implementation and sheds light on the connection between our algorithm and prior actor-critic methods.
Formally, we want to learn a state-conditioned stochastic neural network $a_t = f^\phi(\xi; s_t)$, parametrized by $\phi$, that maps noise samples $\xi$ drawn from a standard Gaussian, or other arbitrary distribution, into unbiased action samples from the target EBM corresponding to $Q^\theta_{\mathrm{soft}}$. We denote the induced distribution of the actions as $\pi^\phi(a_t \mid s_t)$, and we want to find parameters $\phi$ so that the induced distribution approximates the energy-based distribution in terms of the KL divergence
$$J_\pi(\phi; s_t) = D_{\mathrm{KL}}\left( \pi^\phi(\cdot \mid s_t) \,\Big\|\, \exp\left( \frac{1}{\alpha} \left( Q^\theta_{\mathrm{soft}}(s_t, \cdot) - V^\theta_{\mathrm{soft}}(s_t) \right) \right) \right). \tag{12}$$
Suppose we “perturb” a set of independent samples $a_t^{(i)} = f^\phi(\xi^{(i)}; s_t)$ in appropriate directions $\Delta f^\phi(\xi^{(i)}; s_t)$; then the induced KL divergence can be reduced. Stein variational gradient descent (Liu & Wang, 2016) provides the most greedy directions as a functional
$$\Delta f^\phi(\cdot; s_t) = \mathbb{E}_{a_t \sim \pi^\phi}\left[ \kappa(a_t, f^\phi(\cdot; s_t)) \, \nabla_{a'} Q^\theta_{\mathrm{soft}}(s_t, a') \big|_{a' = a_t} + \alpha \, \nabla_{a'} \kappa(a', f^\phi(\cdot; s_t)) \big|_{a' = a_t} \right], \tag{13}$$
where $\kappa$ is a kernel function (typically Gaussian, see details in Appendix D.1). To be precise, $\Delta f^\phi$ is the optimal direction in the reproducing kernel Hilbert space of $\kappa$, and is thus not strictly speaking the gradient of (12), but it turns out that we can set $\frac{\partial J_\pi}{\partial a_t} \propto \Delta f^\phi$ as explained in (Wang & Liu, 2016). With this assumption, we can use the chain rule and backpropagate the Stein variational gradient into the policy network according to
$$\frac{\partial J_\pi(\phi; s_t)}{\partial \phi} \propto \mathbb{E}_{\xi}\left[ \Delta f^\phi(\xi; s_t) \frac{\partial f^\phi(\xi; s_t)}{\partial \phi} \right], \tag{14}$$
and use any gradient-based optimization method to learn the optimal sampling network parameters. The sampling network can be viewed as an actor in an actor-critic algorithm. We will discuss this connection in Section 4, but first we will summarize our complete maximum entropy policy learning algorithm.
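The structure of the Stein variational direction (a kernel-weighted Q-gradient that attracts samples to high-value regions, plus a kernel-gradient term that pushes samples apart) can be illustrated with plain particle-based SVGD. The sketch below is not the amortized variant with a sampling network: it moves a fixed set of one-dimensional action particles directly. The double-well Q-function, the Gaussian kernel bandwidth, the step size, and the initial particle grid are all illustrative assumptions.

```python
import math

def grad_q(a):
    """Gradient of the toy soft Q-function Q(a) = -2 * (a^2 - 1)^2, which has modes at a = -1 and a = +1."""
    return -8.0 * a * (a * a - 1.0)

def svgd_step(particles, alpha, bandwidth, step_size):
    """One synchronous SVGD update of all particles.

    Each particle x_j moves along the average over particles x_i of
    k(x_i, x_j) * grad Q(x_i) + alpha * d k(x_i, x_j) / d x_i.
    """
    n = len(particles)
    updated = []
    for x_j in particles:
        direction = 0.0
        for x_i in particles:
            diff = x_i - x_j
            k = math.exp(-diff * diff / (2.0 * bandwidth ** 2))
            grad_k = -diff / bandwidth ** 2 * k   # derivative of the Gaussian kernel w.r.t. x_i
            direction += k * grad_q(x_i) + alpha * grad_k
        updated.append(x_j + step_size * direction / n)
    return updated

n = 20
particles = [-1.5 + 3.0 * i / (n - 1) for i in range(n)]   # evenly spaced initial actions
for _ in range(500):
    particles = svgd_step(particles, alpha=1.0, bandwidth=0.3, step_size=0.02)
print(sorted(particles))
```

Setting `alpha` to zero removes the repulsive term, in which case the particles are driven only by the Q-gradient and tend to collapse onto the modes instead of covering the distribution around them.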
3.4 Algorithm Summary
To summarize, we propose the soft Q-learning algorithm for learning maximum entropy policies in continuous domains. The algorithm proceeds by alternating between collecting new experience from the environment, and updating the soft Q-function and sampling network parameters. The experience is stored in a replay memory buffer as standard in deep Q-learning (Mnih et al., 2013), and the parameters are updated using random minibatches from this memory. The soft Q-function updates use a delayed version of the target values (Mnih et al., 2015). For optimization, we use the ADAM (Kingma & Ba, 2015) optimizer and empirical estimates of the gradients, which we denote by $\hat{\nabla}$. The exact formulae used to compute the gradient estimates are deferred to Appendix C, which also discusses other implementation details, but we give an overview of soft Q-learning in Algorithm 1.
4 Related Work
Maximum entropy policies emerge as the solution when we cast optimal control as probabilistic inference. In the case of linear-quadratic systems, the mean of the maximum entropy policy is exactly the optimal deterministic policy (Todorov, 2008), which has been exploited to construct practical path planning methods based on iterative linearization and probabilistic inference techniques (Toussaint, 2009). In discrete state spaces, the maximum entropy policy can be obtained exactly. This has been explored in the context of linearly solvable MDPs (Todorov, 2007) and, in the case of inverse reinforcement learning, MaxEnt IRL (Ziebart et al., 2008). In continuous systems and continuous time, path integral control studies maximum entropy policies and maximum entropy planning (Kappen, 2005). In contrast to these prior methods, our work is focused on extending the maximum entropy policy search framework to high-dimensional continuous spaces and highly multimodal objectives, via expressive general-purpose energy functions represented by deep neural networks. A number of related methods have also used maximum entropy policy optimization as an intermediate step for optimizing policies under a standard expected reward objective (Peters et al., 2010; Neumann, 2011; Rawlik et al., 2012; Fox et al., 2016). Among these, the work of Rawlik et al. (2012) resembles ours in that it also makes use of a temporal difference style update to a soft Q-function. However, unlike this prior work, we focus on general-purpose energy functions with approximate sampling, rather than analytically normalizable distributions. A recent work (Liu et al., 2017) also considers an entropy-regularized objective, though the entropy is on policy parameters, not on sampled actions. Thus the resulting policy may not represent an arbitrarily complex multimodal distribution with a single parameter.
The form of our sampler resembles the stochastic networks proposed in recent work on hierarchical learning (Florensa et al., 2017). However, this prior work uses a task-specific reward bonus system to encourage stochastic behavior, while our approach is derived from optimizing a general maximum entropy objective.
A closely related concept to maximum entropy policies is Boltzmann exploration, which uses the exponential of the standard Q-function as the probability of an action (Kaelbling et al., 1996). A number of prior works have also explored representing policies as energy-based models, with the Q-value obtained from an energy model such as a restricted Boltzmann machine (RBM) (Sallans & Hinton, 2004; Elfwing et al., 2010; Otsuka et al., 2010; Heess et al., 2012). Although these methods are closely related, they have not, to our knowledge, been extended to the case of deep network models, have not made extensive use of approximate inference techniques, and have not been demonstrated on complex continuous tasks. More recently, O’Donoghue et al. (2016) drew a connection between Boltzmann exploration and entropy-regularized policy gradient, though in a theoretical framework that differs from maximum entropy policy search: unlike the full maximum entropy framework, the approach of O’Donoghue et al. (2016) only optimizes for maximizing entropy at the current time step, rather than planning for visiting future states where entropy will be further maximized. This prior method also does not demonstrate learning complex multimodal policies in continuous action spaces.
Although we motivate our method as Q-learning, its structure resembles an actor-critic algorithm. It is particularly instructive to observe the connection between our approach and the deep deterministic policy gradient method (DDPG) (Lillicrap et al., 2015), which updates a Q-function critic according to (hard) Bellman updates, and then backpropagates the Q-value gradient into the actor, similarly to NFQCA (Hafner & Riedmiller, 2011). Our actor update differs only in the addition of the $\kappa$ term. Indeed, without this term, our actor would estimate a maximum a posteriori (MAP) action, rather than capturing the entire EBM distribution. This suggests an intriguing connection between our method and DDPG: if we simply modify the DDPG critic updates to estimate soft Q-values, we recover the MAP variant of our method. Furthermore, this connection allows us to cast DDPG as simply an approximate Q-learning method, where the actor serves the role of an approximate maximizer. This helps explain the good performance of DDPG on off-policy data.
We can also make a connection between our method and policy gradients. In Appendix B, we show that the policy gradient for a policy represented as an energy-based model closely corresponds to the update in soft Q-learning. A similar derivation is presented in concurrent work (Schulman et al., 2017).
5 Experiments
Our experiments aim to answer the following questions: (1) Does our soft Q-learning method accurately capture a multimodal policy distribution? (2) Can soft Q-learning with energy-based policies aid exploration for complex tasks that require tracking multiple modes? (3) Can a maximum entropy policy serve as a good initialization for fine-tuning on different tasks, when compared to pretraining with a standard deterministic objective? We compare our algorithm to DDPG (Lillicrap et al., 2015), which has been shown to achieve better sample efficiency on the continuous control problems that we consider than other recent techniques such as REINFORCE (Williams, 1992), TRPO (Schulman et al., 2015a), and A3C (Mnih et al., 2016). This comparison is particularly interesting since, as discussed in Section 4, DDPG closely corresponds to a deterministic maximum a posteriori variant of our method. The detailed experimental setup can be found in Appendix D. Videos of all experiments (https://sites.google.com/view/softqlearning/home) and example source code (https://github.com/haarnoja/softqlearning) are available online.
5.1 Didactic Example: Multi-Goal Environment
In order to verify that amortized SVGD can correctly draw samples from energy-based policies of the form $\pi(a_t \mid s_t) \propto \exp\left( \frac{1}{\alpha} Q^\theta_{\mathrm{soft}}(s_t, a_t) \right)$, and that our complete algorithm can successfully learn to represent multimodal behavior, we designed a simple “multi-goal” environment, in which the agent is a 2D point mass trying to reach one of four symmetrically placed goals. The reward is defined as a mixture of Gaussians, with means placed at the goal positions. An optimal strategy is to go to an arbitrary goal, and the optimal maximum entropy policy should be able to choose each of the four goals at random. The final policy obtained with our method is illustrated in Figure 1. The Q-values indeed have complex shapes that vary with the state, being unimodal at some states, convex at others, and bimodal at others. The stochastic policy samples actions closely following the energy landscape, hence learning diverse trajectories that lead to all four goals. In comparison, a policy trained with DDPG randomly commits to a single goal.
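A reward of the kind described above could be sketched as follows. The goal coordinates and the width of each Gaussian bump are hypothetical values chosen for illustration; the text only specifies four symmetrically placed goals and a mixture-of-Gaussians reward, so the actual environment may differ in scale, normalization, and any additional cost terms.

```python
import math

# Hypothetical layout: four goals placed symmetrically around the origin.
GOALS = [(5.0, 0.0), (-5.0, 0.0), (0.0, 5.0), (0.0, -5.0)]
SIGMA = 1.0   # assumed width of each Gaussian bump

def multigoal_reward(position):
    """Unnormalized mixture of Gaussians with one bump centered at each goal."""
    x, y = position
    return sum(
        math.exp(-((x - gx) ** 2 + (y - gy) ** 2) / (2.0 * SIGMA ** 2))
        for gx, gy in GOALS
    )

goal_rewards = [multigoal_reward(g) for g in GOALS]
origin_reward = multigoal_reward((0.0, 0.0))
print(goal_rewards, origin_reward)
```

Because the four bumps are identical and symmetrically placed, every goal is equally rewarding, which is what makes a policy that commits to a single goal and a policy that covers all four indistinguishable under the standard objective.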
5.2 Learning Multi-Modal Policies for Exploration
Though not all environments have a clear multimodal reward landscape as in the “multi-goal” example, multimodality is prevalent in a variety of tasks. For example, a chess player might try various strategies before settling on one that seems most effective, and an agent navigating a maze may need to try various paths before finding the exit. During the learning process, it is often best to keep trying multiple available options until the agent is confident that one of them is the best (similar to a bandit problem (Lai & Robbins, 1985)). However, deep RL algorithms for continuous control typically use unimodal action distributions, which are not well suited to capture such multimodality. As a consequence, such algorithms may prematurely commit to one mode and converge to suboptimal behavior.
To evaluate how maximum entropy policies might aid exploration, we constructed simulated continuous control environments where tracking multiple modes is important for success. The first experiment uses a simulated swimming snake (see Figure 2), which receives a reward equal to its speed along the $x$-axis, either forward or backward. However, once the swimmer swims far enough forward, it crosses a “finish line” and receives a larger reward. Therefore, the best learning strategy is to explore in both directions until the bonus reward is discovered, and then commit to swimming forward. As illustrated in Figure 6 in Appendix D.3, our method is able to recover this strategy, keeping track of both modes until the finish line is discovered. All stochastic policies eventually commit to swimming forward. The deterministic DDPG method shown in the comparison commits to a mode prematurely, with only 80% of the policies converging on a forward motion, and 20% choosing the suboptimal backward mode.
The second experiment studies a more complex task with a continuous range of equally good options prior to discovery of a sparse reward goal. In this task, a quadrupedal 3D robot (adapted from Schulman et al. (2015b)) needs to find a path through a maze to a target position (see Figure 2). The reward function is a Gaussian centered at the target. The agent may choose either the upper or lower passage, which appear identical at first, but the upper passage is blocked by a barrier. Similar to the swimmer experiment, the optimal strategy requires exploring both directions and choosing the better one. Figure 3(b) compares the performance of DDPG and our method. The curves show the minimum distance to the target achieved so far, and the threshold equals the minimum possible distance if the robot chooses the upper passage. Therefore, successful exploration means reaching below the threshold. All policies trained with our method manage to succeed, while only a fraction of the policies trained with DDPG converge to choosing the lower passage.
5.3 Accelerating Training on Complex Tasks with Pretrained Maximum Entropy Policies
A standard way to accelerate deep neural network training is task-specific initialization (Goodfellow et al., 2016), where a network trained for one task is used as initialization for another task. The first task might be something highly general, such as classifying a large image dataset, while the second task might be more specific, such as fine-grained classification with a small dataset. Pretraining has also been explored in the context of RL (Shelhamer et al., 2016). However, in RL, near-optimal policies are often near-deterministic, which makes them poor initializers for new tasks. In this section, we explore how our energy-based policies can be trained with fairly broad objectives to produce an initializer for more quickly learning more specific tasks.
We demonstrate this on a variant of the quadrupedal robot task. The pretraining phase involves learning to locomote in an arbitrary direction, with a reward that simply equals the speed of the center of mass. The resulting policy moves the agent quickly in a randomly chosen direction. An overhead plot of the center of mass traces is shown in Figure 4 to illustrate this. This pretraining is similar in some ways to recent work on modulated controllers (Heess et al., 2016) and hierarchical models (Florensa et al., 2017). However, in contrast to these prior works, we do not require any task-specific high-level goal generator or reward.
Figure 4 also shows a variety of test environments that we used to fine-tune the running policy for a specific task. In the hallway environments, the agent receives the same reward, but the walls block sideways motion, so the optimal solution requires learning to run in a particular direction. Narrow hallways require choosing a more specific direction, but also allow the agent to use the walls to funnel itself. The U-shaped maze requires the agent to learn a curved trajectory in order to arrive at the target, with the reward given by a Gaussian bump at the target location.
As illustrated in Figure 7 in Appendix D.3, the pretrained policy explores the space extensively and in all directions. This gives a good initialization for the policy, allowing it to learn the behaviors in the test environments more quickly than training a policy with DDPG from a random initialization, as shown in Figure 5. We also evaluated an alternative pretraining method based on deterministic policies learned with DDPG. However, deterministic pretraining chooses an arbitrary but consistent direction in the training environment, providing a poor initialization for fine-tuning to a specific task, as shown in the results plots.
6 Discussion and Future Work
We presented a method for learning stochastic energy-based policies with approximate inference via Stein variational gradient descent (SVGD). Our approach can be viewed as a type of soft Q-learning method, with the additional contribution of using approximate inference to obtain complex multimodal policies. The sampling network trained as part of SVGD can also be viewed as taking the role of an actor in an actor-critic algorithm. Our experimental results show that our method can effectively capture complex multimodal behavior on problems ranging from toy point mass tasks to complex torque control of simulated walking and swimming robots. The applications of training such stochastic policies include improved exploration in the case of multimodal objectives and compositionality via pretraining general-purpose stochastic policies that can then be efficiently fine-tuned into task-specific behaviors.
While our work explores some potential applications of energy-based policies with approximate inference, an exciting avenue for future work would be to further study their capability to represent complex behavioral repertoires and their potential for composability. In the context of linearly solvable MDPs, several prior works have shown that policies trained for different tasks can be composed to create new optimal policies (Da Silva et al., 2009; Todorov, 2009). While these prior works have only explored simple, tractable representations, our method could be used to extend these results to complex and highly multimodal deep neural network models, making them suitable for composable control of complex high-dimensional systems, such as humanoid robots. This composability could be used in future work to create a huge variety of near-optimal skills from a collection of energy-based policy building blocks.
7 Acknowledgements
We thank Qiang Liu for insightful discussion of SVGD, and thank Vitchyr Pong and Shane Gu for help with implementing DDPG. Haoran Tang and Tuomas Haarnoja are supported by Berkeley Deep Drive.
References
 Da Silva et al. (2009) Da Silva, M., Durand, F., and Popović, J. Linear Bellman combination for control of character animation. ACM Trans. on Graphics, 28(3):82, 2009.
 Daniel et al. (2012) Daniel, C., Neumann, G., and Peters, J. Hierarchical relative entropy policy search. In AISTATS, pp. 273–281, 2012.
 Elfwing et al. (2010) Elfwing, S., Otsuka, M., Uchibe, E., and Doya, K. Free-energy based reinforcement learning for vision-based navigation with high-dimensional sensory inputs. In Int. Conf. on Neural Information Processing, pp. 215–222. Springer, 2010.
 Florensa et al. (2017) Florensa, C., Duan, Y., and Abbeel, P. Stochastic neural networks for hierarchical reinforcement learning. In Int. Conf. on Learning Representations, 2017.
 Fox et al. (2016) Fox, R., Pakman, A., and Tishby, N. Taming the noise in reinforcement learning via soft updates. In Conf. on Uncertainty in Artificial Intelligence, 2016.
 Goodfellow et al. (2016) Goodfellow, I., Bengio, Y., and Courville, A. Deep learning. chapter 8.7.4. MIT Press, 2016. http://www.deeplearningbook.org.
 Gu et al. (2016a) Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R. E., and Levine, S. Q-Prop: Sample-efficient policy gradient with an off-policy critic. arXiv preprint arXiv:1611.02247, 2016a.
 Gu et al. (2016b) Gu, S., Lillicrap, T., Sutskever, I., and Levine, S. Continuous deep Q-learning with model-based acceleration. In Int. Conf. on Machine Learning, pp. 2829–2838, 2016b.
 Hafner & Riedmiller (2011) Hafner, R. and Riedmiller, M. Reinforcement learning in feedback control. Machine Learning, 84(1-2):137–169, 2011.
 Heess et al. (2012) Heess, N., Silver, D., and Teh, Y. W. Actor-critic reinforcement learning with energy-based policies. In Workshop on Reinforcement Learning, pp. 43. Citeseer, 2012.
 Heess et al. (2016) Heess, N., Wayne, G., Tassa, Y., Lillicrap, T., Riedmiller, M., and Silver, D. Learning and transfer of modulated locomotor controllers. arXiv preprint arXiv:1610.05182, 2016.
 Jaderberg et al. (2016) Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
 Kaelbling et al. (1996) Kaelbling, L. P., Littman, M. L., and Moore, A. W. Reinforcement learning: A survey. Journal of artificial intelligence research, 4:237–285, 1996.
 Kakade (2002) Kakade, S. A natural policy gradient. Advances in Neural Information Processing Systems, 2:1531–1538, 2002.
 Kappen (2005) Kappen, H. J. Path integrals and symmetry breaking for optimal control theory. Journal of Statistical Mechanics: Theory And Experiment, 2005(11):P11011, 2005.
 Kim & Bengio (2016) Kim, T. and Bengio, Y. Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439, 2016.
 Kingma & Ba (2015) Kingma, D. and Ba, J. Adam: A method for stochastic optimization. 2015.
 Lai & Robbins (1985) Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
 Levine & Abbeel (2014) Levine, S. and Abbeel, P. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, pp. 1071–1079, 2014.
 Levine et al. (2016) Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1–40, 2016.
 Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

 Liu & Wang (2016) Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pp. 2370–2378, 2016.
 Liu et al. (2017) Liu, Y., Ramachandran, P., Liu, Q., and Peng, J. Stein variational policy gradient. arXiv preprint arXiv:1704.02399, 2017.
 Mnih et al. (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 Mnih et al. (2016) Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Int. Conf. on Machine Learning, 2016.
 Neumann (2011) Neumann, G. Variational inference for policy search in changing situations. In Int. Conf. on Machine Learning, pp. 817–824, 2011.
 O’Donoghue et al. (2016) O’Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. PGQ: Combining policy gradient and Q-learning. arXiv preprint arXiv:1611.01626, 2016.
 Otsuka et al. (2010) Otsuka, M., Yoshimoto, J., and Doya, K. Free-energy-based reinforcement learning in a partially observable environment. In ESANN, 2010.
 Peters et al. (2010) Peters, J., Mülling, K., and Altun, Y. Relative entropy policy search. In AAAI Conf. on Artificial Intelligence, pp. 1607–1612, 2010.
 Rawlik et al. (2012) Rawlik, K., Toussaint, M., and Vijayakumar, S. On stochastic optimal control and reinforcement learning by approximate inference. Proceedings of Robotics: Science and Systems VIII, 2012.
 Sallans & Hinton (2004) Sallans, B. and Hinton, G. E. Reinforcement learning with factored states and actions. Journal of Machine Learning Research, 5(Aug):1063–1088, 2004.
 Schulman et al. (2015a) Schulman, J., Levine, S., Abbeel, P., Jordan, M. I., and Moritz, P. Trust region policy optimization. In Int. Conf. on Machine Learning, pp. 1889–1897, 2015a.
 Schulman et al. (2015b) Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015b.
 Schulman et al. (2017) Schulman, J., Abbeel, P., and Chen, X. Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440, 2017.
 Shelhamer et al. (2016) Shelhamer, E., Mahmoudieh, P., Argus, M., and Darrell, T. Loss is its own reward: Self-supervision for reinforcement learning. arXiv preprint arXiv:1612.07307, 2016.
 Silver et al. (2014) Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. In Int. Conf. on Machine Learning, 2014.
 Silver et al. (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
 Sutton & Barto (1998) Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
 Thomas (2014) Thomas, P. Bias in natural actor-critic algorithms. In Int. Conf. on Machine Learning, pp. 441–448, 2014.
 Todorov (2007) Todorov, E. Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems, pp. 1369–1376. MIT Press, 2007.
 Todorov (2008) Todorov, E. General duality between optimal control and estimation. In IEEE Conf. on Decision and Control, pp. 4286–4292. IEEE, 2008.
 Todorov (2009) Todorov, E. Compositionality of optimal control laws. In Advances in Neural Information Processing Systems, pp. 1856–1864, 2009.
 Toussaint (2009) Toussaint, M. Robot trajectory optimization using approximate inference. In Int. Conf. on Machine Learning, pp. 1049–1056. ACM, 2009.
 Uhlenbeck & Ornstein (1930) Uhlenbeck, G. E. and Ornstein, L. S. On the theory of the brownian motion. Physical review, 36(5):823, 1930.
 Wang & Liu (2016) Wang, D. and Liu, Q. Learning to draw samples: With application to amortized mle for generative adversarial learning. arXiv preprint arXiv:1611.01722, 2016.
 Williams (1992) Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
 Zhao et al. (2016) Zhao, J., Mathieu, M., and LeCun, Y. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.
 Ziebart (2010) Ziebart, B. D. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. PhD thesis, 2010.
 Ziebart et al. (2008) Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. In AAAI Conference on Artificial Intelligence, pp. 1433–1438, 2008.
Appendix A Policy Improvement Proofs
In this appendix, we present proofs for the theorems that allow us to show that soft Q-learning leads to policy improvement with respect to the maximum entropy objective. First, we define a slightly more nuanced version of the maximum entropy objective that allows us to incorporate a discount factor. This definition is complicated by the fact that, when using a discount factor for policy gradient methods, we typically do not discount the state distribution, only the rewards. In that sense, discounted policy gradients typically do not optimize the true discounted objective. Instead, they optimize average reward, with the discount serving to reduce variance, as discussed by Thomas (2014). However, for the purposes of the derivation, we can define the objective that is optimized under a discount factor as
$$\pi^*_{\mathrm{MaxEnt}} = \arg\max_\pi \sum_t \mathbb{E}_{(s_t,a_t)\sim\rho_\pi}\left[\sum_{l=t}^{\infty}\gamma^{l-t}\,\mathbb{E}_{(s_l,a_l)}\left[r(s_l,a_l) + \alpha\,\mathcal{H}(\pi(\cdot\mid s_l)) \,\middle|\, s_t, a_t\right]\right].$$
This objective corresponds to maximizing the discounted expected reward and entropy for future states originating from every state-action tuple weighted by its probability under the current policy. Note that this objective still takes into account the entropy of the policy at future states, in contrast to greedy objectives such as Boltzmann exploration or the approach proposed by O’Donoghue et al. (2016).
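We can now derive policy improvement results for soft Q-learning. We start with the definition of the soft Q-value $Q^{\pi}_{\mathrm{soft}}$ for any policy $\pi$ as the expectation under $\pi$ of the discounted sum of rewards and entropy:
(15) $$Q^{\pi}_{\mathrm{soft}}(s,a) \triangleq r_0 + \mathbb{E}_{\tau\sim\pi,\, s_0=s,\, a_0=a}\left[\sum_{t=1}^{\infty}\gamma^{t}\left(r_t + \mathcal{H}(\pi(\cdot\mid s_t))\right)\right].$$
Here, $\tau = (s_0, a_0, s_1, a_1, \ldots)$ denotes the trajectory originating at $(s, a)$. Notice that for convenience, we set the entropy parameter $\alpha$ to 1. The theory can be easily adapted by dividing rewards by $\alpha$.
The discounted maximum entropy policy objective can now be defined as
(16) $$J(\pi) \triangleq \sum_t \mathbb{E}_{(s_t,a_t)\sim\rho_\pi}\left[Q^{\pi}_{\mathrm{soft}}(s_t,a_t) + \alpha\,\mathcal{H}(\pi(\cdot\mid s_t))\right].$$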
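As a small illustration of the soft Q-value definition, the following sketch computes a single-trajectory estimate of the discounted sum of rewards and policy entropies. It assumes a finite (truncated) trajectory and an entropy parameter of 1; the function name `soft_q_sample` and the input lists are hypothetical and not part of the original method description.

```python
def soft_q_sample(rewards, entropies, gamma):
    """Single-trajectory estimate of the soft Q-value at (s_0, a_0).

    rewards[t] is the reward r_t and entropies[t] is the policy entropy
    at state s_t. The entropy at t = 0 is not counted, because the first
    action a_0 is given rather than sampled from the policy.
    """
    total = rewards[0]
    for t in range(1, len(rewards)):
        total += gamma ** t * (rewards[t] + entropies[t])
    return total


# Hypothetical three-step trajectory.
estimate = soft_q_sample([1.0, 1.0, 1.0], [9.0, 0.5, 0.5], gamma=0.5)
```

Averaging such estimates over many trajectories that start from the same state-action pair would approximate the expectation in the definition.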
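A.1 The Maximum Entropy Policy
If the objective function is the expected discounted sum of rewards, the policy improvement theorem (Sutton & Barto, 1998) describes how policies can be improved monotonically. There is a similar theorem we can derive for the maximum entropy objective:
Theorem 4.
(Policy improvement theorem) Given a policy $\pi$, define a new policy $\tilde\pi$ as
(17) $$\tilde\pi(\cdot\mid s) \propto \exp\left(Q^{\pi}_{\mathrm{soft}}(s,\cdot)\right), \quad \forall s.$$
Assume that throughout our computation, $Q$ is bounded and $\int \exp(Q(s,a))\,da$ is bounded for any $s$ (for both $\pi$ and $\tilde\pi$). Then $Q^{\tilde\pi}_{\mathrm{soft}}(s,a) \ge Q^{\pi}_{\mathrm{soft}}(s,a)$ for all $s, a$.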
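The proof relies on the following observation: if one greedily maximizes the sum of entropy and value with one-step look-ahead, then one obtains $\tilde\pi$ from $\pi$:
(18) $$\mathcal{H}(\pi(\cdot\mid s)) + \mathbb{E}_{a\sim\pi}\left[Q^{\pi}_{\mathrm{soft}}(s,a)\right] \le \mathcal{H}(\tilde\pi(\cdot\mid s)) + \mathbb{E}_{a\sim\tilde\pi}\left[Q^{\pi}_{\mathrm{soft}}(s,a)\right].$$
The proof is straightforward by noticing that
(19) $$\mathcal{H}(\pi(\cdot\mid s)) + \mathbb{E}_{a\sim\pi}\left[Q^{\pi}_{\mathrm{soft}}(s,a)\right] = -D_{\mathrm{KL}}\left(\pi(\cdot\mid s)\,\|\,\tilde\pi(\cdot\mid s)\right) + \log\int \exp\left(Q^{\pi}_{\mathrm{soft}}(s,a)\right)da.$$
Then we can show that
(20) $$\begin{aligned}
Q^{\pi}_{\mathrm{soft}}(s,a) &= \mathbb{E}_{s_1}\left[r_0 + \gamma\left(\mathcal{H}(\pi(\cdot\mid s_1)) + \mathbb{E}_{a_1\sim\pi}\left[Q^{\pi}_{\mathrm{soft}}(s_1,a_1)\right]\right)\right] \\
&\le \mathbb{E}_{s_1}\left[r_0 + \gamma\left(\mathcal{H}(\tilde\pi(\cdot\mid s_1)) + \mathbb{E}_{a_1\sim\tilde\pi}\left[Q^{\pi}_{\mathrm{soft}}(s_1,a_1)\right]\right)\right] \\
&= \mathbb{E}_{s_1,\,a_1\sim\tilde\pi}\left[r_0 + \gamma\left(\mathcal{H}(\tilde\pi(\cdot\mid s_1)) + r_1\right)\right] + \gamma^{2}\,\mathbb{E}_{s_2}\left[\mathcal{H}(\pi(\cdot\mid s_2)) + \mathbb{E}_{a_2\sim\pi}\left[Q^{\pi}_{\mathrm{soft}}(s_2,a_2)\right]\right] \\
&\le \mathbb{E}_{s_1,\,a_1\sim\tilde\pi}\left[r_0 + \gamma\left(\mathcal{H}(\tilde\pi(\cdot\mid s_1)) + r_1\right)\right] + \gamma^{2}\,\mathbb{E}_{s_2}\left[\mathcal{H}(\tilde\pi(\cdot\mid s_2)) + \mathbb{E}_{a_2\sim\tilde\pi}\left[Q^{\pi}_{\mathrm{soft}}(s_2,a_2)\right]\right] \\
&\;\;\vdots \\
&\le \mathbb{E}_{\tau\sim\tilde\pi}\left[r_0 + \sum_{t=1}^{\infty}\gamma^{t}\left(\mathcal{H}(\tilde\pi(\cdot\mid s_t)) + r_t\right)\right] = Q^{\tilde\pi}_{\mathrm{soft}}(s,a).
\end{aligned}$$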
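The one-step observation used in this proof can be checked numerically. The sketch below assumes a finite action set, so that the integral over actions becomes a sum, and uses randomly drawn hypothetical Q-values; it verifies that entropy plus expected Q-value equals the negative KL divergence to the Boltzmann policy plus the log-sum-exp of the Q-values, and that the Boltzmann policy attains the maximum.

```python
import math
import random


def entropy(p):
    """Shannon entropy of a discrete distribution (natural log)."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def kl(p, q):
    """KL divergence D_KL(p || q) between discrete distributions."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0.0)


def log_sum_exp(values):
    """Numerically stable log of the sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def softmax(values):
    """Boltzmann policy proportional to exp(Q)."""
    lse = log_sum_exp(values)
    return [math.exp(v - lse) for v in values]


def soft_objective(p, q_values):
    """One-step objective: entropy of p plus expected Q-value under p."""
    return entropy(p) + sum(x * q for x, q in zip(p, q_values))


rng = random.Random(0)
q_values = [rng.uniform(-2.0, 2.0) for _ in range(5)]  # hypothetical Q(s, .)
raw = [rng.random() + 1e-3 for _ in range(5)]
pi_old = [x / sum(raw) for x in raw]                   # arbitrary policy
pi_new = softmax(q_values)                             # energy-based policy

lhs = soft_objective(pi_old, q_values)
rhs = -kl(pi_old, pi_new) + log_sum_exp(q_values)
best = soft_objective(pi_new, q_values)
```

Because the KL term is non-negative and vanishes only when the two policies coincide, `best` equals the log-sum-exp term and bounds `lhs` from above.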
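With Theorem 4, we start from an arbitrary policy $\pi_0$ and define the policy iteration as
(21) $$\pi_{i+1}(\cdot\mid s) \propto \exp\left(Q^{\pi_i}_{\mathrm{soft}}(s,\cdot)\right).$$
Then $Q^{\pi_i}_{\mathrm{soft}}(s,a)$ improves monotonically. Under certain regularity conditions, $\pi_i$ converges to $\pi_\infty$. Obviously, we have $\pi_\infty(a\mid s) \propto_a \exp\left(Q^{\pi_\infty}_{\mathrm{soft}}(s,a)\right)$. Since any non-optimal policy can be improved this way, the optimal policy must satisfy this energy-based form. Therefore we have proven Theorem 1.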
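The policy iteration above can be illustrated on a small tabular problem. The following sketch assumes a randomly generated MDP with three states and two actions, an entropy parameter of 1, and a hypothetical discount factor of 0.9; soft policy evaluation is done by fixed-point iteration, and each new policy is the Boltzmann distribution of the previous soft Q-values. All names are illustrative.

```python
import math
import random

GAMMA = 0.9  # hypothetical discount factor


def softmax(values):
    m = max(values)
    w = [math.exp(v - m) for v in values]
    z = sum(w)
    return [x / z for x in w]


def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0.0)


def evaluate_soft_q(policy, rewards, trans, sweeps=2000):
    """Iterate Q(s,a) = r(s,a) + gamma * E_s'[H(pi(.|s')) + E_a'[Q(s',a')]]."""
    n_s, n_a = len(rewards), len(rewards[0])
    q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(sweeps):
        # Entropy-augmented state value of the given policy.
        v = [entropy(policy[s]) + sum(policy[s][a] * q[s][a] for a in range(n_a))
             for s in range(n_s)]
        q = [[rewards[s][a] + GAMMA * sum(trans[s][a][s2] * v[s2] for s2 in range(n_s))
              for a in range(n_a)] for s in range(n_s)]
    return q


def make_random_mdp(n_s, n_a, rng):
    rewards = [[rng.uniform(-1.0, 1.0) for _ in range(n_a)] for _ in range(n_s)]
    trans = []
    for _ in range(n_s):
        row = []
        for _ in range(n_a):
            w = [rng.random() + 1e-3 for _ in range(n_s)]
            row.append([x / sum(w) for x in w])
        trans.append(row)
    return rewards, trans


rng = random.Random(0)
rewards, trans = make_random_mdp(3, 2, rng)
policy = [[0.9, 0.1] for _ in range(3)]         # arbitrary starting policy
q_history = []
for _ in range(5):
    q = evaluate_soft_q(policy, rewards, trans)
    q_history.append(q)
    policy = [softmax(q[s]) for s in range(3)]  # next policy, proportional to exp(Q)
```

Comparing consecutive entries of `q_history` element-wise shows the monotone improvement the theorem predicts, up to the numerical tolerance of the evaluation step.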
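A.2 Soft Bellman Equation and Soft Value Iteration
Recall the definition of the soft value function:
(22) $$V^{\pi}_{\mathrm{soft}}(s) \triangleq \log\int \exp\left(Q^{\pi}_{\mathrm{soft}}(s,a)\right)da.$$
Suppose $\pi(a\mid s) = \exp\left(Q^{\pi}_{\mathrm{soft}}(s,a) - V^{\pi}_{\mathrm{soft}}(s)\right)$. Then we can show that
(23) $$\begin{aligned}
Q^{\pi}_{\mathrm{soft}}(s,a) &= r(s,a) + \gamma\,\mathbb{E}_{s'\sim p_s}\left[\mathcal{H}(\pi(\cdot\mid s')) + \mathbb{E}_{a'\sim\pi(\cdot\mid s')}\left[Q^{\pi}_{\mathrm{soft}}(s',a')\right]\right] \\
&= r(s,a) + \gamma\,\mathbb{E}_{s'\sim p_s}\left[V^{\pi}_{\mathrm{soft}}(s')\right].
\end{aligned}$$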
This completes the proof of Theorem 2.
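Finally, we show that the soft value iteration operator $\mathcal{T}$, defined as
(24) $$\mathcal{T}Q(s,a) \triangleq r(s,a) + \gamma\,\mathbb{E}_{s'\sim p_s}\left[\log\int \exp Q(s',a')\,da'\right],$$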
is a contraction. Then Theorem 3 follows immediately.
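The contraction property can be observed numerically. The sketch below assumes a tiny hypothetical MDP with two states and two actions, so that the integral over actions becomes a sum, and a hypothetical discount factor of 0.9; it applies the soft backup to two arbitrary Q-tables and compares their sup-norm distance before and after.

```python
import math
import random

GAMMA = 0.9  # hypothetical discount factor


def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def soft_backup(q, rewards, trans):
    """One application of the soft value iteration operator to a Q-table."""
    n_s, n_a = len(q), len(q[0])
    v = [log_sum_exp(q[s]) for s in range(n_s)]  # soft value of each state
    return [[rewards[s][a] + GAMMA * sum(trans[s][a][s2] * v[s2] for s2 in range(n_s))
             for a in range(n_a)] for s in range(n_s)]


def sup_dist(q1, q2):
    """Sup-norm distance between two Q-tables."""
    return max(abs(x - y) for r1, r2 in zip(q1, q2) for x, y in zip(r1, r2))


# Tiny hypothetical MDP: rewards[s][a] and trans[s][a][s'].
rewards = [[1.0, 0.0], [0.0, 2.0]]
trans = [[[0.8, 0.2], [0.1, 0.9]],
         [[0.5, 0.5], [0.3, 0.7]]]

rng = random.Random(1)
q1 = [[rng.uniform(-5.0, 5.0) for _ in range(2)] for _ in range(2)]
q2 = [[rng.uniform(-5.0, 5.0) for _ in range(2)] for _ in range(2)]
before = sup_dist(q1, q2)
after = sup_dist(soft_backup(q1, rewards, trans), soft_backup(q2, rewards, trans))

# Repeated backups converge to the unique fixed point.
q = q1
for _ in range(500):
    q = soft_backup(q, rewards, trans)
residual = sup_dist(q, soft_backup(q, rewards, trans))
```

The distance shrinks by at least the discount factor because the log-sum-exp function is non-expansive in the sup norm.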
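Appendix B Connection between Policy Gradient and Q-Learning
We show that entropy-regularized policy gradient can be viewed as performing soft Q-learning on the maximum-entropy objective. First, suppose that we parametrize a stochastic policy as
(26) $$\pi^{\phi}(a_t\mid s_t) = \exp\left(-\mathcal{E}^{\phi}(s_t,a_t) + \bar{\mathcal{E}}^{\phi}(s_t)\right),$$
where $\mathcal{E}^{\phi}(s_t,a_t)$ is an energy function with parameters $\phi$, and $\bar{\mathcal{E}}^{\phi}(s_t) = -\log\int_{\mathcal{A}}\exp\left(-\mathcal{E}^{\phi}(s_t,a_t)\right)da_t$ is the corresponding partition function. This is the most general class of policies, as we can trivially transform any given distribution $p$ into exponential form by defining the energy as $-\log p(a_t\mid s_t)$. We can write an entropy-regularized policy gradient as follows:
(27) $$\nabla_\phi J_{\mathrm{PG}}(\phi) = \mathbb{E}_{(s_t,a_t)\sim\rho_{\pi^\phi}}\left[\nabla_\phi\log\pi^{\phi}(a_t\mid s_t)\left(\hat Q_{\pi^\phi}(s_t,a_t) + b^{\phi}(s_t)\right)\right] + \nabla_\phi\mathcal{H}(\pi^{\phi}),$$
where $\rho_{\pi^\phi}(s_t,a_t)$ is the distribution induced by the policy, $\hat Q_{\pi^\phi}(s_t,a_t)$ is an empirical estimate of the Q-value of the policy, and $b^{\phi}(s_t)$ is a state-dependent baseline that we get to choose. The gradient of the entropy term is given by
(28) $$\nabla_\phi\mathcal{H}(\pi^{\phi}) = -\nabla_\phi\,\mathbb{E}_{s_t\sim\rho_{\pi^\phi}}\left[\mathbb{E}_{a_t\sim\pi^{\phi}(\cdot\mid s_t)}\left[\log\pi^{\phi}(a_t\mid s_t)\right]\right] = -\mathbb{E}_{(s_t,a_t)\sim\rho_{\pi^\phi}}\left[\nabla_\phi\log\pi^{\phi}(a_t\mid s_t)\left(1 + \log\pi^{\phi}(a_t\mid s_t)\right)\right],$$
and after substituting this back into (27), noting (26), and choosing $b^{\phi}(s_t) = \bar{\mathcal{E}}^{\phi}(s_t) + 1$, we arrive at a simple form for the policy gradient:
(29) $$\nabla_\phi J_{\mathrm{PG}}(\phi) = \mathbb{E}_{(s_t,a_t)\sim\rho_{\pi^\phi}}\left[\left(\nabla_\phi\mathcal{E}^{\phi}(s_t,a_t) - \nabla_\phi\bar{\mathcal{E}}^{\phi}(s_t)\right)\left(-\mathcal{E}^{\phi}(s_t,a_t) - \hat Q_{\pi^\phi}(s_t,a_t)\right)\right].$$
To show that (29) indeed corresponds to a soft Q-learning update, we consider the Bellman error
(30) $$J_Q(\theta) = \mathbb{E}_{s_t\sim q_{s_t},\,a_t\sim q_{a_t}}\left[\tfrac{1}{2}\left(\hat Q^{\theta}_{\mathrm{soft}}(s_t,a_t) - Q^{\theta}_{\mathrm{soft}}(s_t,a_t)\right)^{2}\right],$$
where $\hat Q^{\theta}_{\mathrm{soft}}(s_t,a_t)$ is an empirical estimate of the soft Q-function. There are several valid alternatives for this estimate, but in order to show a connection to policy gradient, we choose a specific form
(31) $$\hat Q^{\theta}_{\mathrm{soft}}(s_t,a_t) = \hat A^{\bar\theta}_{\mathrm{soft}}(s_t,a_t) + V^{\theta}_{\mathrm{soft}}(s_t),$$
where $\hat A^{\bar\theta}_{\mathrm{soft}}(s_t,a_t)$ is an empirical soft advantage function that is assumed not to contribute to the gradient computation. With this choice, the gradient of the Bellman error becomes
(32) $$\nabla_\theta J_Q(\theta) = \mathbb{E}_{s_t\sim q_{s_t},\,a_t\sim q_{a_t}}\left[\left(\nabla_\theta Q^{\theta}_{\mathrm{soft}}(s_t,a_t) - \nabla_\theta V^{\theta}_{\mathrm{soft}}(s_t)\right)\left(Q^{\theta}_{\mathrm{soft}}(s_t,a_t) - V^{\theta}_{\mathrm{soft}}(s_t) - \hat A^{\bar\theta}_{\mathrm{soft}}(s_t,a_t)\right)\right].$$
Now, if we choose $\mathcal{E}^{\phi}(s_t,a_t) \triangleq -Q^{\theta}_{\mathrm{soft}}(s_t,a_t)$ (so that $\bar{\mathcal{E}}^{\phi}(s_t) = -V^{\theta}_{\mathrm{soft}}(s_t)$) and $q_{s_t}(s_t)\,q_{a_t}(a_t) \triangleq \rho_{\pi^\phi}(s_t,a_t)$, a descent step on (32) coincides with an ascent step along the policy gradient in (29), with $\hat A^{\bar\theta}_{\mathrm{soft}} + V^{\theta}_{\mathrm{soft}}$ playing the role of $\hat Q_{\pi^\phi}$. Note that the choice of using an empirical estimate of the soft advantage rather than soft Q-value makes the target independent of the soft value, and at convergence, $Q^{\theta}_{\mathrm{soft}}$ approximates the soft Q-value up to an additive constant. The resulting policy is still correct, since the Boltzmann distribution in (26) is independent of constant shift in the energy function.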
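The closing remark, that a constant shift in the energy leaves the Boltzmann policy unchanged, is easy to confirm for a finite action set. The sketch below uses hypothetical energy values and a hypothetical shift of 5.0; the function name `boltzmann` is illustrative.

```python
import math


def boltzmann(energies):
    """Policy proportional to exp(-energy), normalized over discrete actions."""
    m = min(energies)
    w = [math.exp(-(e - m)) for e in energies]
    z = sum(w)
    return [x / z for x in w]


energies = [0.3, -1.2, 2.0, 0.0]       # hypothetical energies E(s, a)
shifted = [e + 5.0 for e in energies]  # same energies plus a constant offset
pi_a = boltzmann(energies)
pi_b = boltzmann(shifted)
```

The two policies agree because the offset multiplies every unnormalized weight by the same factor, which the normalization cancels.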
Appendix C Implementation
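C.1 Computing the Policy Update
Here we explain in full detail how the policy update direction $\hat\nabla_\phi J_\pi$ in Algorithm 1 is computed. We reuse the indices $i, j$ in this section with a different meaning than in the body of the paper for the sake of providing a cleaner presentation.
Expectations appear in amortized SVGD in two places. First, SVGD approximates the optimal descent direction $\Delta f^{\phi}$ in Equation (13) with an empirical average over the samples $a_t^{(i)} = f^{\phi}(\xi^{(i)}; s_t)$. Similarly, SVGD approximates the expectation in Equation (14) with samples $\tilde a_t^{(j)} = f^{\phi}(\tilde\xi^{(j)}; s_t)$, which can be the same or different from $a_t^{(i)}$. Substituting (13) into (14) and taking the gradient gives the empirical estimate
$$\hat\nabla_\phi J_\pi(\phi; s_t) = \frac{1}{KM}\sum_{j=1}^{K}\sum_{i=1}^{M}\left(\kappa\left(a_t^{(i)}, \tilde a_t^{(j)}\right)\nabla_{a'}Q_{\mathrm{soft}}(s_t,a')\Big|_{a'=a_t^{(i)}} + \alpha\,\nabla_{a'}\kappa\left(a', \tilde a_t^{(j)}\right)\Big|_{a'=a_t^{(i)}}\right)\nabla_\phi f^{\phi}(\tilde\xi^{(j)}; s_t).$$
Finally, the update direction $\hat\nabla_\phi J_\pi(\phi)$ is the average of $\hat\nabla_\phi J_\pi(\phi; s_t)$, where $s_t$ is drawn from a mini-batch.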
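To make the structure of this estimate concrete, the sketch below computes only the action-space part of the update: for each target particle, the kernel-weighted average of Q-gradients plus the kernel-gradient (repulsive) term. It assumes scalar actions, an RBF kernel with a fixed hypothetical bandwidth, and a hypothetical quadratic Q-function; in a full implementation these directions would additionally be backpropagated through the sampling network, which is omitted here.

```python
import math


def rbf_kernel(a, b, bandwidth):
    """RBF kernel exp(-(a - b)^2 / h) for scalar actions."""
    return math.exp(-((a - b) ** 2) / bandwidth)


def rbf_kernel_grad(a, b, bandwidth):
    """Derivative of the kernel with respect to its first argument a."""
    return -2.0 * (a - b) / bandwidth * rbf_kernel(a, b, bandwidth)


def svgd_directions(actions, targets, grad_q, alpha=1.0, bandwidth=1.0):
    """Action-space update direction for each target particle.

    For a target particle b, this averages over the particles a of
    kernel(a, b) * dQ/da(a) + alpha * d kernel(a', b)/da' at a' = a.
    """
    directions = []
    for b in targets:
        total = 0.0
        for a in actions:
            attract = rbf_kernel(a, b, bandwidth) * grad_q(a)
            repel = alpha * rbf_kernel_grad(a, b, bandwidth)
            total += attract + repel
        directions.append(total / len(actions))
    return directions


MODE = 1.0  # hypothetical Q(a) = -0.5 * (a - MODE)^2


def grad_q(a):
    return -(a - MODE)


single = svgd_directions([0.0], [0.0], grad_q)            # one particle
pair = svgd_directions([0.5, 1.5], [0.5, 1.5], grad_q)    # symmetric pair
flat = svgd_directions([0.5, 1.5], [0.5, 1.5], lambda a: 0.0)
```

With a single particle the repulsive term vanishes and the direction is the plain Q-gradient; with a flat Q-function only the repulsive term remains and pushes the two particles apart.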
C.2 Computing the Density of Sampled Actions
As stated in the main text, the soft value can be computed by sampling from a distribution $q_{a'}$, and $q_{a'}(\cdot) \propto \exp\left(\frac{1}{\alpha}Q^{\theta}_{\mathrm{soft}}(s,\cdot)\right)$ is optimal. A direct solution is to obtain actions from the sampling network: $a' = f^{\phi}(\xi'; s)$. If the samples $\xi'$ and actions $a'$ have the same dimension, and if the Jacobian matrix $\partial a'/\partial\xi'$ is non-singular, then the probability density is
(33) $$q_{a'}(a') = p_{\xi}(\xi')\cdot\frac{1}{\left|\det\left(\partial a'/\partial\xi'\right)\right|}.$$
In practice, the Jacobian is usually singular at the beginning of training, when the sampler $f^{\phi}$ is not fully trained. A simple solution is to begin with uniform action sampling and then switch to $f^{\phi}$ later, which is reasonable, since an untrained sampler is unlikely to produce better samples for estimating the partition function anyway.
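The change-of-variables density can be sketched for a case where the answer is known in closed form. The example below assumes a hypothetical two-dimensional affine sampler with standard normal noise, so the Jacobian is simply the weight matrix; it also raises an error when the Jacobian is singular, mirroring the situation in which another proposal such as uniform sampling would be used instead.

```python
import math


def standard_normal_pdf(xi):
    """Density of a standard normal vector of any dimension."""
    d = len(xi)
    return math.exp(-0.5 * sum(x * x for x in xi)) / (2.0 * math.pi) ** (d / 2.0)


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def sampler(xi, weight, bias):
    """Hypothetical affine sampling network a = W xi + b in two dimensions."""
    return [sum(weight[i][j] * xi[j] for j in range(2)) + bias[i] for i in range(2)]


def action_density(xi, weight, tol=1e-12):
    """Density of a = f(xi) via change of variables: p(xi) / |det(da/dxi)|."""
    det = det2(weight)  # for an affine map the Jacobian is W itself
    if abs(det) < tol:
        raise ValueError("Jacobian is singular; use another proposal distribution")
    return standard_normal_pdf(xi) / abs(det)


def normal_pdf(x, mean, std):
    """Density of a scalar normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


weight = [[2.0, 0.0], [0.0, 0.5]]
bias = [1.0, -1.0]
xi = [0.3, -0.7]
action = sampler(xi, weight, bias)
q_a = action_density(xi, weight)
# With a diagonal W the action components are independent normals.
expected = normal_pdf(action[0], 1.0, 2.0) * normal_pdf(action[1], -1.0, 0.5)
```

For a neural sampler the Jacobian would depend on the noise sample and would typically be obtained by automatic differentiation rather than read off directly.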
Appendix D Experiments
D.1 Hyperparameters
Throughout all experiments, we use the following parameters for both DDPG and soft Q-learning. The Q-values are updated using ADAM with learning rate . The DDPG policy and soft Q-learning sampling network use ADAM with a learning rate of . The algorithm uses a replay pool of size one million. Training does not start until the replay pool has at least 10,000 samples. Every mini-batch has size . Each training iteration consists of time steps, and both the Q-values and policy / sampling network are trained at every time step. All experiments are run for epochs, except that the multi-goal task uses epochs and the fine-tuning tasks are trained for epochs. Both the Q-value and policy / sampling network are neural networks composed of two hidden layers, with hidden units at each layer and ReLU nonlinearity. Both DDPG and soft Q-learning use additional OU noise (Uhlenbeck & Ornstein, 1930; Lillicrap et al., 2015) to improve exploration. The parameters are and . In addition, we found that updating the target parameters too frequently can destabilize training. Therefore we freeze target parameters for every time steps (except for the swimming snake experiment, which freezes for epochs), and then copy the current network parameters to the target networks directly ().
Soft Q-learning uses action samples (see Appendix C.1) to compute the policy update, except that the multi-goal experiment uses . The number of additional action samples to compute the soft value is . The kernel is a radial basis function, written as , where , with equal to the median of pairwise distance of sampled actions . Note that the step size changes dynamically depending on the state , as suggested in (Liu & Wang, 2016).
The entropy coefficient is for the multi-goal environment, and for the swimming snake, maze, hallway (pretraining) and U-shaped maze (pretraining) experiments.
All fine-tuning tasks anneal the entropy coefficient quickly in order to improve performance, since the goal during fine-tuning is to recover a near-deterministic policy on the fine-tuning task. In particular, is annealed log-linearly to within epochs of fine-tuning. Moreover, the samples are fixed to a set and is reduced linearly to within epochs.
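The two annealing schedules mentioned above can be sketched as follows. The start values, end values, and epoch counts used here are hypothetical placeholders chosen only for illustration, since the actual settings are not recoverable from this text; log-linear annealing is taken to mean linear interpolation in log space.

```python
import math


def log_linear_anneal(start, end, epoch, n_epochs):
    """Interpolate linearly in log space from start to end over n_epochs.

    Both start and end must be positive. Epochs beyond n_epochs return end.
    """
    frac = min(max(epoch / n_epochs, 0.0), 1.0)
    return math.exp((1.0 - frac) * math.log(start) + frac * math.log(end))


def linear_anneal(start, end, epoch, n_epochs):
    """Interpolate linearly from start to end over n_epochs."""
    frac = min(max(epoch / n_epochs, 0.0), 1.0)
    return (1.0 - frac) * start + frac * end


# Hypothetical schedule: entropy coefficient from 1.0 down to 1e-4 in 20 epochs.
coefficients = [log_linear_anneal(1.0, 1e-4, k, 20) for k in range(0, 21, 5)]
```

Under the log-linear schedule the coefficient shrinks by a constant factor per epoch, so it spends comparable time at each order of magnitude before the policy becomes near-deterministic.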
D.2 Task description
All tasks have a horizon of , except the multi-goal task, which uses . We add an additional termination condition to the quadrupedal 3D robot to discourage it from flipping over.