1 Introduction
Reinforcement Learning (RL) combined with deep neural networks has led to a wide range of successful applications, including the game of Go, robotics control and video game playing [32, 30, 24]. During the training of deep RL agents, injecting noise into the learning procedure can usually prevent the agent from converging prematurely to bad locally optimal solutions, for example, by entropy regularization [30, 23] or by explicitly optimizing a maximum entropy objective [13, 25]. Though entropy regularization is much simpler to implement in practice, it greedily optimizes the policy entropy at each time step, without accounting for future effects. On the other hand, the maximum entropy objective considers the entropy of the distribution over entire trajectories, and is more conducive to theoretical analysis [2]. Recently, [13, 14] also show that optimizing the maximum entropy objective can lead to desirable properties such as robustness and multimodal policies.
Can we preserve the simplicity of entropy regularization while attaining the desirable properties of the maximum entropy framework? A necessary condition is an expressive policy representation. Though various flexible probabilistic models have been proposed in generative modeling [10, 37], such models are underexplored in policy based RL. To address this gap, we propose flexible policy classes and efficient algorithms to compute entropy regularized policy gradients.
In Section 3, we introduce Implicit Policy, a generic policy representation from which we derive two expressive policy classes, Normalizing Flows Policy (NFP) and, more generally, Noninvertible Blackbox Policy (NBP). NFP provides a novel architecture that embeds state information into Normalizing Flows; NBP assumes little about the policy architecture, yet we propose algorithms to efficiently compute entropy regularized policy gradients when the policy density is not accessible. In Section 4, we show that entropy regularization optimizes a lower bound of the maximum entropy objective. In Section 5, we show that when combined with entropy regularization, expressive policies achieve competitive performance on benchmarks and lead to robust and multimodal policies.
2 Preliminaries
2.1 Background
We consider the standard RL formalism consisting of an agent interacting with the environment. At time step $t$, the agent is in state $s_t$, takes action $a_t$, receives instant reward $r_t$ and transitions to next state $s_{t+1}$. Let $\pi$ be a policy. The objective of RL is to search for a policy which maximizes the cumulative expected reward $J(\pi) = \mathbb{E}_\pi[\sum_{t=0}^\infty \gamma^t r_t]$, where $\gamma$ is a discount factor. The action value function of policy $\pi$ is defined as $Q^\pi(s,a) = \mathbb{E}_\pi[\sum_{t=0}^\infty \gamma^t r_t \mid s_0 = s, a_0 = a]$. In policy based RL, a policy is explicitly parameterized as $\pi_\theta$ with parameter $\theta$, and the policy can be updated by policy gradients $\theta \leftarrow \theta + \alpha \nabla_\theta J(\pi_\theta)$, where $\alpha$ is the learning rate. There are in general two ways to compute policy gradients, for either on-policy or off-policy updates.
Score function gradient & Pathwise gradient.
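Given a stochastic policy $\pi_\theta$, the score function gradient for on-policy update is computed as $\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta}[\sum_{t=0}^\infty Q^{\pi_\theta}(s_t,a_t)\nabla_\theta \log\pi_\theta(a_t|s_t)]$ as in [31, 30, 23, 35]. For off-policy update, it is necessary to introduce importance sampling weights to correct for the difference between the behavior policy and the current policy. Given a deterministic policy $a = \pi_\theta(s)$, the pathwise gradient for on-policy update is computed as $\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta}[\sum_{t=0}^\infty \nabla_a Q^{\pi_\theta}(s_t,a)|_{a=\pi_\theta(s_t)} \nabla_\theta\pi_\theta(s_t)]$. In practice, this gradient is often computed off-policy [33, 32], where the exact derivation comes from a modified off-policy objective [3].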
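The two estimators can be compared on a toy problem. The following is a minimal sketch, not the paper's implementation: it assumes a one-step bandit with a hypothetical quadratic reward and a reparameterized Gaussian policy $a = \mu + \sigma\epsilon$, and estimates the gradient with respect to $\mu$ both ways. The function names and the reward are illustrative assumptions.

```python
import numpy as np

def reward(a):
    # Hypothetical one-step reward, peaked at a = 2.
    return -(a - 2.0) ** 2

def reward_grad(a):
    # Derivative of the reward with respect to the action.
    return -2.0 * (a - 2.0)

def score_function_grad(mu, sigma, eps):
    # d/dmu E[r(a)] estimated as E[r(a) * d/dmu log N(a; mu, sigma^2)].
    a = mu + sigma * eps
    return np.mean(reward(a) * (a - mu) / sigma ** 2)

def pathwise_grad(mu, sigma, eps):
    # d/dmu E[r(mu + sigma * eps)] estimated as E[r'(a) * da/dmu], with da/dmu = 1.
    a = mu + sigma * eps
    return np.mean(reward_grad(a))

rng = np.random.default_rng(0)
eps = rng.standard_normal(200_000)
mu, sigma = 0.0, 1.0
g_score = score_function_grad(mu, sigma, eps)
g_path = pathwise_grad(mu, sigma, eps)
true_grad = -2.0 * (mu - 2.0)  # closed form for this quadratic reward
print(g_score, g_path, true_grad)
```

Both estimates target the same quantity; on this example the pathwise estimate has the smaller variance, but it requires a differentiable reward (or critic), whereas the score function estimate only needs the log density of the policy.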
Entropy Regularization.
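For on-policy update, it is common to apply entropy regularization [38, 26, 23, 31]. Let $\mathbb{H}[\pi(\cdot|s)]$ be the entropy of policy $\pi$ at state $s$. The entropy regularized update is
(1) $\theta \leftarrow \theta + \alpha\, \mathbb{E}_{\pi_\theta}\big[\sum_{t=0}^\infty Q^{\pi_\theta}(s_t,a_t)\nabla_\theta\log\pi_\theta(a_t|s_t) + \beta\,\nabla_\theta\mathbb{H}[\pi_\theta(\cdot|s_t)]\big]$
where $\beta$ is a regularization constant. By boosting policy entropy, this update can potentially prevent the policy from converging prematurely to bad locally optimal solutions. In Section 3, we will introduce expressive policies that leverage both on-policy and off-policy updates, and algorithms to efficiently compute entropy regularized policy gradients.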
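As an illustration of how the entropy term changes the update, here is a minimal sketch under strong simplifying assumptions: a single-state bandit with three discrete actions, a softmax policy over logits, and exact gradients instead of sampled ones. The rewards, step size and regularization constant are hypothetical values chosen for the example.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def entropy_regularized_step(theta, rewards, alpha, beta):
    # One exact-gradient ascent step on E_pi[r] + beta * H(pi) for a softmax policy.
    pi = softmax(theta)
    entropy = -np.sum(pi * np.log(pi))
    grad_reward = pi * (rewards - pi @ rewards)   # gradient of the expected reward
    grad_entropy = -pi * (np.log(pi) + entropy)   # gradient of the policy entropy
    return theta + alpha * (grad_reward + beta * grad_entropy)

rewards = np.array([1.0, 0.9, 0.0])  # hypothetical bandit rewards
beta = 0.5
theta = np.zeros(3)
for _ in range(5000):
    theta = entropy_regularized_step(theta, rewards, alpha=0.5, beta=beta)
pi_learned = softmax(theta)
pi_boltzmann = softmax(rewards / beta)  # stationary point of the regularized objective
print(pi_learned, pi_boltzmann)
```

In this toy setting the regularized update settles on a Boltzmann distribution over rewards rather than a deterministic choice, which is the sense in which the entropy bonus keeps the policy from collapsing early.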
Maximum Entropy RL.
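In the maximum entropy RL formulation, the objective is to maximize both the cumulative reward and the policy entropy, $J_{\text{MaxEnt}}(\pi_\theta) = \mathbb{E}_{\pi_\theta}\big[\sum_{t=0}^\infty \gamma^t\big(r_t + \beta\,\mathbb{H}[\pi_\theta(\cdot|s_t)]\big)\big]$, where $\beta$ is a tradeoff constant. Note that $\nabla_\theta J_{\text{MaxEnt}}(\pi_\theta)$ differs from the update in (1) by an exchange of expectation and gradient. The intuition of $J_{\text{MaxEnt}}(\pi_\theta)$ is to achieve high reward while being as random as possible over trajectories. Since there is no simple low variance gradient estimate for $J_{\text{MaxEnt}}(\pi_\theta)$, several previous works [31, 13, 25] have proposed to optimize $J_{\text{MaxEnt}}(\pi_\theta)$ primarily using off-policy value based algorithms.
2.2 Related Work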
A large number of prior works have implemented policy gradient algorithms with entropy regularization [30, 31, 23, 26], which boost exploration by greedily maximizing policy entropy at each time step. In contrast to such a greedy procedure, the maximum entropy objective considers entropy over entire policy trajectories [13, 25, 29]. Though entropy regularization is simpler to implement in practice, [12, 13] argue in favor of the maximum entropy objective by showing that trained policies can be robust to noise, which is desirable for real life robotics tasks, and multimodal, a potentially desirable property for exploration and for fine-tuning on downstream tasks. However, their training procedure is fairly complex: it consists of training a soft Q function by fixed point iteration and a neural sampler by Stein variational gradient [21]. We argue that properties such as robustness and multimodality are attainable through simple entropy regularized policy gradient algorithms combined with expressive policy representations.
Prior works have studied the properties of the maximum entropy objective [25, 39], entropy regularization [26] and their connections with variants of operators [2]. It is commonly believed that entropy regularization greedily maximizes local policy entropy and does not account for how a policy update impacts future states. In Section 4, we show that the entropy regularized policy gradient update maximizes a lower bound of the maximum entropy objective, given constraints on the differences between consecutive policy iterates. This partially justifies why simple entropy regularization combined with expressive policy classes can achieve competitive empirical performance in practice.
There are a number of prior works that discuss different policy architectures. The most common policy for continuous control is the unimodal Gaussian [30, 31, 23]. [14] discusses mixtures of Gaussians, which can represent multimodal policies, but the number of modes must be specified in advance. [13] also represents a policy using an implicit model, but the policy is trained to sample from the soft Q function instead of being trained directly. Recently, we find that [11] also uses Normalizing Flows to represent policies, but their focus is on learning a hierarchy and involves layers of pretraining. In contrast to earlier works, we propose to represent flexible policies using implicit models/Normalizing Flows, together with efficient algorithms to train the policy end-to-end.
Implicit models have been extensively studied in probabilistic inference and generative modeling [10, 17, 19, 37]. Implicit models define distributions by transforming source noise via a forward pass of neural networks, which in general sacrifices tractable probability density for a more expressive representation. Normalizing Flows are a special case of implicit models [27, 5, 6], where transformations from source noise to output are invertible and allow for maximum likelihood inference. Borrowing inspiration from prior works, we introduce implicit models into policy representation and empirically show that such a rich policy class entails multimodal behavior during training. In [37], GAN [10] is used as an optimal density estimator for likelihood free inference. In our work, we apply a similar idea to compute entropy regularization when the policy density is not available.
3 Implicit Policy for Reinforcement Learning
We assume the action space $\mathcal{A}$ to be a compact subset of $\mathbb{R}^m$. Any sufficiently smooth stochastic policy can be represented as a blackbox $f_\theta(\cdot)$ with parameter $\theta$ that incorporates state information $s$ and independent source noise $\epsilon$ sampled from a simple distribution $\rho_0(\cdot)$. In state $s$, the action $a$ is sampled by a forward pass in the blackbox,
(2) $a = f_\theta(s, \epsilon), \quad \epsilon \sim \rho_0(\cdot)$
For example, a Gaussian policy reduces to $a = \mu_\theta(s) + \sigma_\theta(s)\cdot\epsilon$ where $\rho_0$ is a standard Gaussian [30]. In general, the distribution of $a$ is implicitly defined: for any set $A \subseteq \mathcal{A}$, $P(a \in A \mid s) = \int_{\epsilon: f_\theta(s,\epsilon)\in A}\rho_0(\epsilon)\,d\epsilon$. Let $\pi_\theta(\cdot|s)$ be the density of this distribution. (In later notation, when the context is clear, we use $\pi_\theta(\cdot|s)$ to denote both the density of the policy and the policy itself: for example, $a \sim \pi_\theta(\cdot|s)$ means sampling $a$ from the policy, and $\log\pi_\theta(a|s)$ means the log density of the policy at $a$ in state $s$.) We call such a policy an Implicit Policy, as similar ideas have previously been explored in the implicit generative modeling literature [10, 19, 37]. In the following, we derive two expressive stochastic policy classes following this blackbox formulation, and propose algorithms to efficiently compute entropy regularized policy gradients.
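The blackbox view can be sketched in a few lines. This is an illustrative example only: the two-layer network, its sizes, the concatenation of state and noise, and the tanh squashing (used here to keep actions in a compact set) are assumptions made for the sketch, not the architecture used in the paper.

```python
import numpy as np

def init_params(state_dim, noise_dim, hidden, action_dim, rng):
    # Hypothetical two-layer network; sizes are illustrative only.
    return {
        "W1": rng.standard_normal((state_dim + noise_dim, hidden)) * 0.5,
        "b1": np.zeros(hidden),
        "W2": rng.standard_normal((hidden, action_dim)) * 0.5,
        "b2": np.zeros(action_dim),
    }

def implicit_policy(params, state, eps):
    # Forward pass a = f_theta(s, eps): state and noise are concatenated and pushed
    # through the network; tanh keeps actions inside the compact set [-1, 1]^m.
    x = np.concatenate([state, eps], axis=-1)
    h = np.maximum(x @ params["W1"] + params["b1"], 0.0)
    return np.tanh(h @ params["W2"] + params["b2"])

rng = np.random.default_rng(0)
params = init_params(state_dim=3, noise_dim=2, hidden=16, action_dim=2, rng=rng)
state = np.tile(np.array([0.1, -0.2, 0.3]), (1000, 1))  # same state, 1000 noise draws
eps = rng.standard_normal((1000, 2))
actions = implicit_policy(params, state, eps)
print(actions.mean(axis=0), actions.std(axis=0))
```

Sampling is a single forward pass, but nothing in this construction gives access to the density of the resulting actions, which is the difficulty the two policy classes below address in different ways.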
3.1 Normalizing Flows Policy (NFP)
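We first construct a stochastic policy with Normalizing Flows. Normalizing Flows [27, 6] have been applied in variational inference and probabilistic modeling to represent complex distributions. In general, consider transforming a source noise $\epsilon \sim \rho_0(\cdot)$ by a series of invertible nonlinear functions $g_{\theta_i}(\cdot)$, $1 \le i \le K$, each with parameter $\theta_i$, to output a target sample $x$,
(3) $x = g_{\theta_K} \circ g_{\theta_{K-1}} \circ \cdots \circ g_{\theta_1}(\epsilon)$
Let $\Sigma_i$ be the Jacobian matrix of $g_{\theta_i}(\cdot)$, then the density of $x$ is computed by the chain rule,
(4) $\log p(x) = \log \rho_0(\epsilon) - \sum_{i=1}^K \log|\det \Sigma_i|$
For a general invertible transformation $g_{\theta_i}(\cdot)$, computing $\det\Sigma_i$ is expensive. We follow the architecture of [5] to ensure that $\det\Sigma_i$ is computed in linear time. To combine state information, we embed the state $s$ by another neural network $L_{\theta_s}(\cdot)$ with parameter $\theta_s$ and output a state vector $L_{\theta_s}(s)$ with the same dimension as $\epsilon$. We can then insert the state vector between any two layers of (3) to make the distribution conditional on state $s$. In our implementation, we insert the state vector after the first transformation (we detail our architecture design in Appendix C).
(5) $a = g_{\theta_K} \circ \cdots \circ g_{\theta_2}\big(L_{\theta_s}(s) + g_{\theta_1}(\epsilon)\big)$
Though the additive form of $L_{\theta_s}(s)$ and $g_{\theta_1}(\epsilon)$ may in theory limit the capacity of the model, in practice we find the resulting policy still very expressive. For simplicity, we denote the above transformation (5) as $a = f_\theta(s,\epsilon)$ with parameter $\theta = \{\theta_s, \theta_1, \dots, \theta_K\}$. The map is still invertible between $a$ and $\epsilon$, which is critical for computing $\log\pi_\theta(a|s)$ according to (4). Such representations build complex policy distributions with explicit probability density $\pi_\theta(\cdot|s)$, and hence allow training with score function gradient estimators.
Since there is no analytic form for the entropy, we use $N$ samples to estimate the entropy by reparameterization, $\mathbb{H}[\pi_\theta(\cdot|s)] = \mathbb{E}_{a\sim\pi_\theta(\cdot|s)}[-\log\pi_\theta(a|s)] \approx \frac{1}{N}\sum_{i=1}^N -\log\pi_\theta(f_\theta(s,\epsilon_i)|s)$. The gradient of the entropy can be computed as a pathwise gradient and implemented using backpropagation, $\nabla_\theta\mathbb{H}[\pi_\theta(\cdot|s)] \approx \frac{1}{N}\sum_{i=1}^N \nabla_\theta\big[-\log\pi_\theta(f_\theta(s,\epsilon_i)|s)\big]$.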
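A state-conditioned flow of this kind can be sketched as follows. The sketch assumes two affine coupling layers on a 2-D action with scalar-weight conditioners and a linear state embedding; all of these are illustrative choices, and the architecture detailed in Appendix C may differ. It shows the three ingredients discussed above: the forward pass with the state vector added after the first layer, the log density from the change of variables formula, and a sample-based entropy estimate. Gradients are omitted because plain NumPy has no automatic differentiation.

```python
import numpy as np

def affine_coupling(x, w, flip):
    # One affine coupling layer on 2-D inputs: one coordinate passes through unchanged
    # and parameterizes a scale and shift for the other. Returns output and log|det J|.
    x_id, x_tr = (x[:, 1], x[:, 0]) if flip else (x[:, 0], x[:, 1])
    log_scale = np.tanh(w[0] * x_id + w[1])
    shift = w[2] * x_id + w[3]
    y_tr = x_tr * np.exp(log_scale) + shift
    y = np.stack([y_tr, x_id], axis=1) if flip else np.stack([x_id, y_tr], axis=1)
    return y, log_scale

def affine_coupling_inverse(y, w, flip):
    # Exact inverse of affine_coupling; the untouched coordinate recovers scale and shift.
    y_id, y_tr = (y[:, 1], y[:, 0]) if flip else (y[:, 0], y[:, 1])
    log_scale = np.tanh(w[0] * y_id + w[1])
    shift = w[2] * y_id + w[3]
    x_tr = (y_tr - shift) * np.exp(-log_scale)
    return np.stack([x_tr, y_id], axis=1) if flip else np.stack([y_id, x_tr], axis=1)

def sample_with_log_prob(theta, state, eps):
    # a = g2(L(s) + g1(eps)); log pi(a|s) = log rho0(eps) - sum of log|det J_i|.
    h, ld1 = affine_coupling(eps, theta["g1"], flip=False)
    h = h + state @ theta["state_embed"]  # additive state vector L(s), unit Jacobian
    a, ld2 = affine_coupling(h, theta["g2"], flip=True)
    log_rho0 = -0.5 * np.sum(eps ** 2, axis=1) - np.log(2.0 * np.pi)
    return a, log_rho0 - ld1 - ld2

def invert(theta, state, a):
    # Recover the source noise from an action, undoing the layers in reverse order.
    h = affine_coupling_inverse(a, theta["g2"], flip=True)
    h = h - state @ theta["state_embed"]
    return affine_coupling_inverse(h, theta["g1"], flip=False)

rng = np.random.default_rng(0)
theta = {
    "g1": np.array([0.5, 0.1, -0.3, 0.2]),      # hypothetical layer parameters
    "g2": np.array([-0.4, 0.0, 0.7, -0.1]),
    "state_embed": rng.standard_normal((3, 2)),
}
n = 5000
state = np.tile(np.array([0.5, -1.0, 0.2]), (n, 1))
eps = rng.standard_normal((n, 2))
actions, log_prob = sample_with_log_prob(theta, state, eps)
entropy_estimate = np.mean(-log_prob)  # Monte Carlo estimate of H[pi(.|s)]
eps_recovered = invert(theta, state, actions)
print(entropy_estimate)
```

Because each coupling layer has a triangular Jacobian, the log determinant is a sum over coordinates, which is what keeps the density evaluation linear in the action dimension.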
On-policy algorithm for NFP.
Any on-policy policy optimization algorithm can be easily combined with NFP. Since NFP has explicit access to the policy density, it allows for training using score function gradient estimators with efficient entropy regularization.
3.2 Noninvertible Blackbox Policy (NBP)
The forward pass in (2) transforms the simple noise distribution $\rho_0(\cdot)$ into a complex action distribution $\pi_\theta(\cdot|s)$ through the blackbox $f_\theta(\cdot)$. However, the mapping is in general noninvertible and we do not have access to the density $\pi_\theta(\cdot|s)$. We derive a pathwise gradient for such cases and defer all proofs to Appendix A.
Theorem 3.1 (Stochastic Pathwise Gradient).
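Given an implicit stochastic policy $a = f_\theta(s,\epsilon)$, $\epsilon \sim \rho_0(\cdot)$, let $\pi_\theta$ be the implicitly defined policy. Then the pathwise policy gradient for the stochastic policy is
(6) $\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\big[\mathbb{E}_{\epsilon\sim\rho_0(\cdot)}\big[\nabla_\theta f_\theta(s,\epsilon)\,\nabla_a Q^{\pi_\theta}(s,a)|_{a=f_\theta(s,\epsilon)}\big]\big]$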
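The structure of the pathwise gradient in Theorem 3.1 can be checked numerically on a toy problem. The sketch below assumes a hypothetical linear blackbox policy and a hand-written stand-in critic in place of a learned Q network; it forms the Monte Carlo estimate of the product of the policy Jacobian and the action gradient of the critic, and compares it with finite differences on the same samples.

```python
import numpy as np

def f(theta, s, eps):
    # Hypothetical blackbox policy a = f_theta(s, eps), linear for readability.
    return theta[0] * s + theta[1] * eps

def q_value(s, a):
    # Stand-in critic; in practice Q would be a learned network.
    return -(a - s) ** 2

def dq_da(s, a):
    # Gradient of the stand-in critic with respect to the action.
    return -2.0 * (a - s)

def pathwise_gradient(theta, s, eps):
    # Monte Carlo estimate of E[grad_theta f_theta(s, eps) * grad_a Q(s, a)] at a = f_theta(s, eps).
    a = f(theta, s, eps)
    df_dtheta = np.stack([s, eps], axis=1)  # Jacobian of f w.r.t. (theta[0], theta[1])
    return np.mean(df_dtheta * dq_da(s, a)[:, None], axis=0)

def objective(theta, s, eps):
    # Sample average of Q(s, f_theta(s, eps)) with states and noise held fixed.
    return np.mean(q_value(s, f(theta, s, eps)))

rng = np.random.default_rng(0)
s = rng.uniform(-1.0, 1.0, size=5000)
eps = rng.standard_normal(5000)
theta = np.array([0.3, 0.8])
grad = pathwise_gradient(theta, s, eps)

# Central finite differences on the same samples as an independent check.
h = 1e-5
fd = np.array([
    (objective(theta + h * e, s, eps) - objective(theta - h * e, s, eps)) / (2 * h)
    for e in np.eye(2)
])
print(grad, fd)
```

Note that the estimator never evaluates the policy density, so it applies unchanged when the mapping from noise to action is noninvertible.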
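To compute the gradient of policy entropy for such a general implicit policy, we propose to train an additional classifier $c_\psi(a,s)$ with parameter $\psi$ along with the policy $\pi_\theta$. The classifier is trained to minimize the following objective given a policy $\pi_\theta$,
(7) $\min_\psi\ \mathbb{E}_{a\sim\pi_\theta(\cdot|s)}\big[-\log\sigma(c_\psi(a,s))\big] + \mathbb{E}_{a\sim U(\mathcal{A})}\big[-\log\big(1-\sigma(c_\psi(a,s))\big)\big]$
where $U(\mathcal{A})$ is a uniform distribution over $\mathcal{A}$ and $\sigma(\cdot)$ is the sigmoid function. A result in Appendix A.2 (A.1) guarantees that the optimal solution $c^*_\psi$ of (7) provides an estimate of the policy density, $c^*_\psi(a,s) = \log\big(|\mathcal{A}|\,\pi_\theta(a|s)\big)$, where $|\mathcal{A}|$ is the volume of $\mathcal{A}$. As a result, we can evaluate the entropy by simple reparametrization, $\mathbb{H}[\pi_\theta(\cdot|s)] = \mathbb{E}_{\epsilon\sim\rho_0(\cdot)}\big[-c^*_\psi(f_\theta(s,\epsilon),s)\big] + \log|\mathcal{A}|$. Further, we can compute gradients of the policy entropy through the density estimate, as shown by the following theorem.
Theorem 3.2 (Unbiased Entropy Gradient).
Let $c^*_\psi$ be the optimal solution of (7), where the policy $\pi_\theta$ is given by the implicit policy $a = f_\theta(s,\epsilon)$, $\epsilon\sim\rho_0(\cdot)$. The gradient of entropy can be computed as
(8) $\nabla_\theta\mathbb{H}[\pi_\theta(\cdot|s)] = -\mathbb{E}_{\epsilon\sim\rho_0(\cdot)}\big[\nabla_a c^*_\psi(a,s)|_{a=f_\theta(s,\epsilon)}\,\nabla_\theta f_\theta(s,\epsilon)\big]$
It is worth noting that to compute $\nabla_\theta\mathbb{H}[\pi_\theta(\cdot|s)]$, simply plugging in $c^*_\psi$ to replace $\log\pi_\theta$ in the entropy definition does not work in general, since the optimal solution $c^*_\psi$ of (7) implicitly depends on $\theta$. Fortunately, in this case the additional term vanishes. The above theorem guarantees that we can apply entropy regularization even when the policy density is not accessible.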
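The density-estimation idea can be illustrated in one dimension. The sketch below assumes that (7) is the usual logistic loss with policy samples labeled positive and uniform samples labeled negative, so that the optimal logit equals the log ratio between the policy density and the uniform density. The "policy" here is a Gaussian restricted to $[-1, 1]$ whose density is known, which allows a direct comparison; the quadratic classifier features and the plain gradient descent loop are illustrative choices, not the paper's implementation.

```python
import numpy as np

def features(a):
    # Hypothetical classifier features; the logit is c(a) = w[0] + w[1] * a^2.
    return np.stack([np.ones_like(a), a ** 2], axis=1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_classifier(a_policy, a_uniform, steps=5000, lr=1.0):
    # Minimize E_policy[-log sigmoid(c)] + E_uniform[-log(1 - sigmoid(c))] by gradient descent.
    x_p, x_u = features(a_policy), features(a_uniform)
    w = np.zeros(2)
    for _ in range(steps):
        grad = (
            -np.mean((1.0 - sigmoid(x_p @ w))[:, None] * x_p, axis=0)
            + np.mean(sigmoid(x_u @ w)[:, None] * x_u, axis=0)
        )
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
sigma = 0.3
a_policy = sigma * rng.standard_normal(20_000)
a_policy = a_policy[np.abs(a_policy) <= 1.0]      # keep samples inside A = [-1, 1]
a_uniform = rng.uniform(-1.0, 1.0, size=20_000)   # samples from U(A), density 1/2
w = fit_classifier(a_policy, a_uniform)

# At the optimum sigmoid(c) = pi / (pi + u), so c(a) = log pi(a) - log u(a) with u(a) = 1/2.
grid = np.linspace(-0.6, 0.6, 7)
log_pi_est = features(grid) @ w + np.log(0.5)
log_pi_true = -0.5 * (grid / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

# Entropy from the classifier alone: H = -E_policy[c(a)] + log|A|, with |A| = 2.
entropy_est = -np.mean(features(a_policy) @ w) + np.log(2.0)
entropy_true = 0.5 * np.log(2.0 * np.pi * np.e * sigma ** 2)
print(np.max(np.abs(log_pi_est - log_pi_true)), entropy_est, entropy_true)
```

The classifier only ever sees samples, never the density, which is why the same construction remains usable for a noninvertible blackbox policy. Its accuracy depends on the classifier being expressive enough and trained close to its optimum.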
Off-policy algorithm for NBP.
We develop an off-policy algorithm for NBP. The agent contains an implicit policy $f_\theta(s,\epsilon)$ with parameter $\theta$, a critic $Q_\phi(s,a)$ with parameter $\phi$ and a classifier $c_\psi(a,s)$ with parameter $\psi$. At each time step $t$, we sample action $a_t = f_\theta(s_t,\epsilon)$, $\epsilon\sim\rho_0(\cdot)$, and save the experience tuple $(s_t, a_t, r_t, s_{t+1})$ to a replay buffer $\mathcal{B}$. During training, we sample a minibatch of tuples from $\mathcal{B}$, update the critic using TD learning, update the policy using the pathwise gradient (6) and update the classifier by gradient descent on (7). We also maintain target networks with parameters $\theta^-, \phi^-$ to stabilize learning [24, 32]. The pseudocode is listed in Appendix D.
4 Entropy Regularization and Maximum Entropy RL
Though policy gradient algorithms with entropy regularization are easy to implement in practice, they are harder to analyze due to the lack of a global objective. We now show that entropy regularization maximizes a lower bound of the maximum entropy objective when consecutive policy iterates are close.
At each iteration of entropy regularized policy gradient algorithm, the policy parameter is updated as in (1). Following similar ideas in [15, 30], we now interpret such update as maximizing a linearized surrogate objective in the neighborhood of the previous policy iterate . The surrogate objective is
(9) 
The firstorder Taylor expansion of (9) centering at gives a linearized surrogate objective . Let , the entropy regularized update (1) is equivalent to solving the following optimization problem then update according to ,
where is a positive constant depending on both the learning rate and the previous iterate , and can be recovered from (1). The next theorem shows that by constraining the KL divergence of consecutive policy iterates, the surrogate objective (9) forms a nontrivial lower bound of maximum entropy objective,
Theorem 4.1 (Lower Bound).
If , then
(10) 
By optimizing the surrogate objective (9) at each iteration, entropy regularized policy gradient algorithms maximize a lower bound of the maximum entropy objective. This implies that though entropy regularization is a greedier procedure than optimizing the maximum entropy objective, it accounts for certain effects that the maximum entropy objective is designed to capture. Nevertheless, the optimal solutions of the two optimization procedures are different. Previous works [26, 13] have shown that the optimal solutions of both procedures are energy based policies, with energy functions being fixed points of the Boltzmann operator and the Mellowmax operator respectively [2]. In Appendix B, we show that the Boltzmann operator interpolates between the Bellman operator and the Mellowmax operator, which indicates that entropy regularization is greedier than optimizing the maximum entropy objective, yet it still maintains uncertainty in the policy updates.
Though the maximum entropy objective accounts for long term effects of policy entropy updates and is more conducive to analysis [2], it is hard to implement a simple yet scalable procedure to optimize the objective [13, 14, 2]. Entropy regularization, on the other hand, is simple to implement in both on-policy and off-policy settings. In experiments, we will show that entropy regularized policy gradients combined with expressive policies achieve competitive performance in multiple aspects.
5 Experiments
Our experiments aim to answer the following questions: (1) Are expressive policies hard to train, and do implicit policies provide competitive performance on benchmark tasks? (2) Are implicit policies robust to noise on locomotion tasks? (3) Does entropy regularization with implicit policies entail multimodal policies, as displayed under the maximum entropy framework [13]?
To answer (1), we evaluate both NFP and NBP agents on benchmark continuous control tasks in MuJoCo [36] and compare with baselines. To answer (2), we compare NFP with a unimodal Gaussian policy on locomotion tasks with additive observational noise. To answer (3), we illustrate the multimodal capacity of both policy representations on specially designed tasks described below, and compare with baselines. In all experiments, for NFP, we use standard PPO for the on-policy update to approximately enforce the KL constraint (10) as in [31]; for NBP, we use the off-policy algorithm developed in Section 3. In Appendices C and F, we detail hyperparameter settings in the experiments and provide a small ablation study.
5.1 Locomotion Tasks
Benchmark tasks.
One potential disadvantage of expressive policies compared to simple policies (like unimodal Gaussian) is that they pose a more serious statistical challenge due to a larger number of parameters. To see if implicit policies suffer from such problems, we evaluate NFP and NBP on MuJoCo benchmark tasks. For each task, we train for a prescribed number of time steps, then report the results averaged over 5 random seeds. We compare the results with baseline algorithms, such as DDPG [20], SQL [13], TRPO [30] and PPO [31], where baseline TRPO and PPO use unimodal Gaussian policies. As can be seen from Table 1, both NFP and NBP achieve competitive performance on benchmark tasks: they outperform DDPG, SQL and TRPO on most tasks. However, baseline PPO tends to come out on top on most tasks. Interestingly, on HalfCheetah, baseline PPO gets stuck in a locally optimal gait, which NFP improves upon by a large margin.
Tasks  Timesteps  DDPG  SQL  TRPO  PPO  NFP  NBP 

Hopper  
HalfCheetah  
Walker2d  
Ant 
Table 1: A comparison of implicit policy optimization with baseline algorithms on MuJoCo benchmark tasks. For each task, we show the average rewards achieved after training the agent for a fixed number of time steps. The results for NFP and NBP are averaged over 5 random seeds. The results for DDPG, SQL and TRPO are approximated based on the figures in [14], PPO is from OpenAI baseline implementation [4]. We highlight the top two algorithms for each task in bold font. Both TRPO and PPO use unimodal Gaussian policies.
Robustness to Noisy Observations.
We add independent Gaussian noise to each component of the observations to make the original tasks partially observable. Since PPO with unimodal Gaussian achieves leading performance on noise-free locomotion tasks across on-policy baselines (A2C [23], TRPO [30]) as shown in [31] and Appendix E.1, we compare NFP only with PPO with unimodal Gaussian on such noisy locomotion tasks. In Figure 1, we show the learning curves of both agents, where on many tasks NFP learns significantly faster than unimodal Gaussian. Why might complex policies add to robustness? We propose that since these control tasks are known to be solved by multiple separate modes of policy [22], observational noise potentially blurs these modes and makes it harder for a unimodal Gaussian policy to learn any single mode (e.g. a unimodal Gaussian puts probability mass between two neighboring modes [18]). In contrast, NFP can still navigate a more complex reward landscape thanks to a potentially multimodal policy distribution, and learn effectively. We leave a more detailed study of robustness, multimodality and complex reward landscapes as interesting future work.
5.2 Multimodal policy
Gaussian Bandits.
Though factorized unimodal policies suffice for most benchmark tasks, below we motivate the importance of a flexible policy by a simple example: Gaussian bandits. Consider a two dimensional bandit . The reward of action is for a positive definite matrix . The optimal policy for maximum entropy objective is , i.e. a Gaussian policy with covariance matrix . We compare NFP with PPO with factorized Gaussian. As illustrated in Figure 2(a), NFP can approximate the optimal Gaussian policy pretty closely while the factorized Gaussian cannot capture the high correlation between the two action components.
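The cost of the factorization can be made concrete with a closed-form calculation. The sketch below assumes the reward $r(a) = -\tfrac{1}{2}a^\top\Sigma^{-1}a$ and an entropy weight of one, so that the maximum entropy optimum over Gaussians is $\mathcal{N}(0,\Sigma)$; the scaling and the example covariance are assumptions made for illustration. It compares the objective value of the full-covariance optimum with that of the best factorized Gaussian.

```python
import numpy as np

def maxent_objective(precision, cov):
    # E_q[r(a)] + H(q) for q = N(0, cov) and r(a) = -0.5 * a^T precision a (assumed scaling).
    d = cov.shape[0]
    expected_reward = -0.5 * np.trace(precision @ cov)
    entropy = 0.5 * (d * np.log(2.0 * np.pi * np.e) + np.linalg.slogdet(cov)[1])
    return expected_reward + entropy

sigma_mat = np.array([[1.0, 0.9], [0.9, 1.0]])      # hypothetical strongly correlated target
precision = np.linalg.inv(sigma_mat)

full_cov = sigma_mat                                 # unrestricted Gaussian optimum
factorized_cov = np.diag(1.0 / np.diag(precision))   # best diagonal covariance for this objective

j_full = maxent_objective(precision, full_cov)
j_fact = maxent_objective(precision, factorized_cov)
gap = j_full - j_fact
print(j_full, j_fact, gap)
```

Under these assumptions the gap equals $-\tfrac{1}{2}\log(1-\rho^2)$ for correlation $\rho$, so it grows without bound as the two action components become more correlated, which is the regime where a factorized policy is least adequate.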
Navigating 2D Multigoal.
We illustrate the ability of implicit policies to represent multimodal behavior with the Multigoal environment [13]. The agent has 2D coordinates as states and 2D forces as actions. A ball is randomly initialized near the origin and the goal is to push the ball to reach one of the four goal positions plotted as red dots in Figure 2(b). While a unimodal policy can only deterministically commit the agent to one of the four goals, a multimodal policy obtained by NBP can stochastically commit the agent to multiple goals. On the right of Figure 2(b) we also show sampled actions and contours of Q value functions at various states: NBP learns a very flexible policy with a different number of modes in different states.
Learning a Bimodal Reacher.
For a more realistic example, consider learning a bimodal policy for reaching one of two targets (Figure 3(a)). The agent has the physical coordinates of the reaching arms as states and applies torques to the joints as actions. The objective is to move the reacher head close to one of the targets. As illustrated by the trajectories in Figure 2(c), while a unimodal Gaussian policy can only deterministically reach one target (red curves), an NFP agent can capture both modes by stochastically reaching one of the two targets (blue curves).
Figure 2: (a): Gaussian bandits. The two axes are actions. Green dots are actions from the optimal policy, a Gaussian distribution with covariance structure illustrated by the contours. Red dots and blue dots are actions sampled from a learned factorized Gaussian and NFP, respectively. NFP captures the covariance of the optimal policy while the factorized Gaussian cannot. (b): Illustration of the 2D multigoal environment. Left: trajectories generated by a trained NBP agent (solid blue curves). The two axes are coordinates of the agent (state). The agent is initialized randomly near the origin. The goals are red dots, and instant rewards are proportional to the agent's minimum distance to one of the four goals. Right: Q value contours predicted by the critic (light blue: low value, light green: high value) and actions sampled from the policy (blue dots) at three selected states. The NBP policy has a different number of modes at different states. (c): Trajectories of the reacher head by NFP (blue curves) and unimodal Gaussian policies (red curves) for the bimodal reacher. Yellow dots are locations of the two targets, and the green dot is the starting location of the reacher.
Fine-tuning for downstream tasks.
A recent paradigm for RL is to pretrain an agent to perform a conceptually high-level task, which may accelerate fine-tuning the agent to perform more specific tasks [13]. We consider pretraining a quadrupedal robot (Figure 3(b)) to run fast, then fine-tuning the robot to run fast in a particular direction [13] as illustrated in Figure 3(c), where we set walls to limit the directions in which to run. The Wide and Narrow Hallway tasks differ in the distance between the opposing walls. If an algorithm does not inject enough diversity during pretraining, it will commit the agent prematurely to running in a particular direction, which is bad for fine-tuning. We compare the pretraining capacity of DDPG [20], SQL [13] and NBP. As shown in Figure 3(d), after pretraining, the NBP agent manages to run in multiple directions, while the DDPG agent runs in a single direction due to its deterministic policy (Appendix E.2). In Table 2, we compare the cumulative rewards of agents after fine-tuning on downstream tasks with different pretraining as initializations. In both tasks, we find that NBP outperforms DDPG, SQL and random initialization (no pretraining) by statistically significant margins, potentially because the NBP agent learns a high-level running gait that is more conducive to fine-tuning. Interestingly, in Narrow Hallway, the randomly initialized agent performs better than DDPG pretraining, probably because running fast in Narrow Hallway requires running in a very narrow direction, and the DDPG pretrained agent needs to first unlearn the overly specialized running gait acquired from pretraining. In Wide Hallway, the randomly initialized agent easily gets stuck in a locally optimal gait (running between two opposing walls), while pretraining in general helps avoid such problems.
Tasks  Random init  DDPG init  SQL init  NBP init 

Wide Hallway  
Narrow Hallway 
Table 2: A comparison of downstream finetuning under different initializations. For each task, we show the cumulative rewards after pretraining for steps and finetuning for steps. The rewards are shown in the form , all results are averaged over 5 seeds. Random init means the agent is trained from scratch.
Combining multiple modes by Imitation Learning.
We propose another paradigm that can be of practical interest. In general, learning a multimodal policy from scratch is hard for complex tasks since it requires good exploration and an algorithm to learn multimodal distributions [13], which is itself a hard inference problem [10]. A big advantage of policy based algorithms over value based algorithms [13] is that the policy can be easily combined with imitation learning. We could decompose a complex task into several simpler tasks, each representing a simple mode of behavior easily learned by an RL agent, then combine them into a single agent using imitation learning or inverse RL [1, 8, 28].
We illustrate with a stochastic Swimmer example (see Appendix E.3). Consider training a Swimmer to move fast either forward or backward. The aggregate behavior has two modes and it is easy to solve each single mode. We train two separate Swimmers to move forward/backward and generate expert trajectories using the trained agents. We then train an NBP / NFP agent using GAN [10] / maximum likelihood estimation to combine both modes. Trained with the same algorithms, a unimodal policy either commits to only one mode or learns a policy that puts large probability mass between the two modes [18, 10], which greatly deviates from the expert policy. In contrast, expressive policies can more flexibly incorporate multiple modes into a single agent.
6 Conclusion
We have proposed Implicit Policy, a rich class of policies that can represent complex action distributions. We have derived efficient algorithms to compute entropy regularized policy gradients for generic implicit policies. Importantly, we have also shown that entropy regularization maximizes a lower bound of the maximum entropy objective, which implies that in practice entropy regularization combined with a rich policy class can lead to desired properties of maximum entropy RL. We have empirically shown that implicit policies achieve competitive performance on benchmark tasks, are more robust to observational noise, and can flexibly represent multimodal distributions.
Acknowledgements.
This research was supported by an Amazon Research Award (2017) and AWS cloud credits. The authors would like to thank Jalaj Bhandari for helpful discussions, and Sergey Levine for helpful comments on early stage experiments of the paper.
References

 Abbeel and Ng [2004] Abbeel, P. and Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1. ACM.
 Asadi and Littman [2017] Asadi, K. and Littman, M. L. (2017). An alternative softmax operator for reinforcement learning. In International Conference on Machine Learning, pages 243–252.
 Degris et al. [2012] Degris, T., White, M., and Sutton, R. S. (2012). Off-policy actor-critic. arXiv preprint arXiv:1205.4839.
 Dhariwal et al. [2017] Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., and Wu, Y. (2017). OpenAI Baselines. https://github.com/openai/baselines.
 Dinh et al. [2014] Dinh, L., Krueger, D., and Bengio, Y. (2014). NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516.
 Dinh et al. [2016] Dinh, L., Sohl-Dickstein, J., and Bengio, S. (2016). Density estimation using Real NVP. arXiv preprint arXiv:1605.08803.
 Duan et al. [2016] Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. (2016). Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329–1338.
 Finn et al. [2016] Finn, C., Christiano, P., Abbeel, P., and Levine, S. (2016). A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint arXiv:1611.03852.
 Fortunato et al. [2017] Fortunato, M., Azar, M. G., Piot, B., Menick, J., Osband, I., Graves, A., Mnih, V., Munos, R., Hassabis, D., Pietquin, O., et al. (2017). Noisy networks for exploration. arXiv preprint arXiv:1706.10295.
 Goodfellow et al. [2014] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680.
 Haarnoja et al. [2018a] Haarnoja, T., Hartikainen, K., Abbeel, P., and Levine, S. (2018a). Latent space policies for hierarchical reinforcement learning. arXiv preprint arXiv:1804.02808.
 Haarnoja et al. [2018b] Haarnoja, T., Pong, V., Zhou, A., Dalal, M., Abbeel, P., and Levine, S. (2018b). Composable deep reinforcement learning for robotic manipulation. arXiv preprint arXiv:1803.06773.
 Haarnoja et al. [2017] Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. (2017). Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165.
 Haarnoja et al. [2018c] Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018c). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
 Kakade and Langford [2002] Kakade, S. and Langford, J. (2002). Approximately optimal approximate reinforcement learning. In ICML, volume 2, pages 267–274.
 Kingma and Ba [2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 Kingma and Welling [2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
 Levine [2018] Levine, S. (2018). Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909.
 Li and Turner [2017] Li, Y. and Turner, R. E. (2017). Gradient estimators for implicit models. arXiv preprint arXiv:1705.07107.
 Lillicrap et al. [2015] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.

 Liu and Wang [2016] Liu, Q. and Wang, D. (2016). Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances In Neural Information Processing Systems, pages 2378–2386.
 Mania et al. [2018] Mania, H., Guy, A., and Recht, B. (2018). Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055.
 Mnih et al. [2016] Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937.
 Mnih et al. [2013] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
 Nachum et al. [2017] Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. (2017). Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems, pages 2772–2782.
 O’Donoghue et al. [2016] O’Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. (2016). PGQ: Combining policy gradient and Q-learning. arXiv preprint arXiv:1611.01626.
 Rezende and Mohamed [2015] Rezende, D. J. and Mohamed, S. (2015). Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770.

 Ross et al. [2011] Ross, S., Gordon, G., and Bagnell, D. (2011). A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635.
 Schulman et al. [2017a] Schulman, J., Chen, X., and Abbeel, P. (2017a). Equivalence between policy gradients and soft q-learning. arXiv preprint arXiv:1704.06440.
 Schulman et al. [2015] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897.
 Schulman et al. [2017b] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017b). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
 Silver et al. [2016] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016). Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489.
 Silver et al. [2014] Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. (2014). Deterministic policy gradient algorithms. In ICML.
 Srivastava et al. [2014] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.
 Sutton et al. [2000] Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063.
 Todorov et al. [2012] Todorov, E., Erez, T., and Tassa, Y. (2012). Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE.
 Tran et al. [2017] Tran, D., Ranganath, R., and Blei, D. M. (2017). Hierarchical implicit models and likelihood-free variational inference. arXiv preprint arXiv:1702.08896.
 Williams [1992] Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pages 5–32. Springer.
 Ziebart [2010] Ziebart, B. D. (2010). Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Carnegie Mellon University.
Appendix A Proof of Theorems
A.1 Stochastic Pathwise Gradient
See Theorem 3.1.
Proof.
We closely follow the derivation of the deterministic policy gradient [33]. We assume that all conditions are satisfied to exchange expectations and gradients when necessary. Let $\pi_\theta$ denote the implicit policy $a = f_\theta(s, \epsilon)$, $\epsilon \sim \rho_0(\cdot)$. Let $V^{\pi_\theta}, Q^{\pi_\theta}$ be the value function and action value function under this stochastic policy. We introduce $p(s \to s', k, \pi_\theta)$ as the probability of transitioning from $s$ to $s'$ in $k$ steps under policy $\pi_\theta$. Overloading the notation a bit, $p(s \to s', 1, a)$ is the probability of $s \to s'$ in one step by taking action $a$ (i.e., $p(s \to s', 1, a) = p(s' \mid s, a)$). We have
In the above derivation, we have used Fubini's theorem to interchange integrals (expectations) and gradients. Iterating the above derivation gives the following
With the above, we derive the pathwise policy gradient as follows
where $\rho^{\pi}(s)$ is the discounted state visitation probability under policy $\pi_\theta$. Writing the whole integral as an expectation over states, the policy gradient is
which is equivalent to the gradient in (6) of Theorem 3.1. ∎
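The recursion behind this proof can be sketched as follows. The notation is assumed here rather than taken from the main text: $r(s,a)$ is the reward, $p_1$ the initial state distribution, $\gamma$ the discount factor, $a = f_\theta(s,\epsilon)$ with $\epsilon \sim \rho_0$, and $p(s \to s', 0, \pi)$ is a point mass at $s' = s$. It mirrors the deterministic policy gradient argument and should be read as a sketch under these assumptions.

```latex
\begin{align*}
\nabla_\theta V^{\pi}(s)
  &= \nabla_\theta\, \mathbb{E}_{\epsilon}\big[ Q^{\pi}(s, f_\theta(s,\epsilon)) \big] \\
  &= \mathbb{E}_{\epsilon}\Big[ \nabla_\theta f_\theta(s,\epsilon)\,
       \nabla_a Q^{\pi}(s,a)\big|_{a=f_\theta(s,\epsilon)} \Big]
     + \gamma \int p(s \to s', 1, \pi)\, \nabla_\theta V^{\pi}(s')\, ds' \\
  &= \int \sum_{k=0}^{\infty} \gamma^{k}\, p(s \to s', k, \pi)\,
     \mathbb{E}_{\epsilon}\Big[ \nabla_\theta f_\theta(s',\epsilon)\,
       \nabla_a Q^{\pi}(s',a)\big|_{a=f_\theta(s',\epsilon)} \Big]\, ds', \\
\nabla_\theta J(\pi)
  &= \int p_1(s)\, \nabla_\theta V^{\pi}(s)\, ds
   = \int \rho^{\pi}(s')\,
     \mathbb{E}_{\epsilon}\Big[ \nabla_\theta f_\theta(s',\epsilon)\,
       \nabla_a Q^{\pi}(s',a)\big|_{a=f_\theta(s',\epsilon)} \Big]\, ds', \\
\rho^{\pi}(s') &= \int \sum_{k=0}^{\infty} \gamma^{k}\, p_1(s)\, p(s \to s', k, \pi)\, ds .
\end{align*}
```

In the second line, $\nabla_a Q^{\pi}$ collects the dependence on $\theta$ through the sampled action, while the integral term collects the dependence through the value of the next state; unrolling that term gives the third line.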
We can recover the result for the deterministic policy gradient by using a degenerate functional form $f_\theta(s, \epsilon) = f_\theta(s)$, i.e. with a deterministic function to compute actions.
A.2 Unbiased Entropy Gradient
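As a numerical illustration of the pathwise estimator at a single state, the sketch below uses a hypothetical toy setup that is not from the paper: a linear implicit policy `a = theta * s + sigma * eps` and a quadratic critic `Q(s, a) = -(a - s)**2`, for which the gradient is known in closed form. Setting `sigma=0` gives the degenerate deterministic case.

```python
import random

def pathwise_gradient(theta, s, sigma=0.5, n_samples=100_000, seed=0):
    """Monte Carlo estimate of E_eps[ df/dtheta * dQ/da ] at one state s.

    Hypothetical toy setup (illustration only):
      implicit policy  a = f_theta(s, eps) = theta * s + sigma * eps, eps ~ N(0, 1)
      critic           Q(s, a) = -(a - s)**2
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)
        a = theta * s + sigma * eps   # action sampled through the noise eps
        df_dtheta = s                 # d f_theta(s, eps) / d theta
        dq_da = -2.0 * (a - s)        # d Q(s, a) / d a at a = f_theta(s, eps)
        total += df_dtheta * dq_da
    return total / n_samples

def analytic_gradient(theta, s):
    # E_eps[Q] = -((theta - 1) * s)**2 - sigma**2, so d/dtheta = -2 * s**2 * (theta - 1)
    return -2.0 * s * s * (theta - 1.0)

estimate = pathwise_gradient(theta=0.3, s=2.0)
exact = analytic_gradient(theta=0.3, s=2.0)
deterministic = pathwise_gradient(theta=0.3, s=2.0, sigma=0.0, n_samples=1)
```

Here `estimate` should be close to `exact`, and `deterministic` matches it up to floating-point rounding because no noise enters the action.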
Lemma A.1 (Optimal Classifier as Density Estimator).
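Assume $c_\psi$ is expressive enough to represent any classifier (for example, $c_\psi$ is a deep neural net). Assume $\mathcal{A}$ to be bounded and let $\mathcal{U}(\mathcal{A})$ be the uniform distribution over $\mathcal{A}$. Let $\psi^\ast$ be the optimizer of the optimization problem in (7). Then $c_{\psi^\ast}(a, s) = \log\big(|\mathcal{A}|\, \pi_\theta(a \mid s)\big)$ and $|\mathcal{A}|$ is the volume of $\mathcal{A}$.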
Proof.
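A sketch of the standard density-ratio argument, under the assumption that (7) is a logistic classification loss with policy samples as positives and uniform samples as negatives. The symbols $c$ (the classifier logit at a fixed state $s$), $\sigma$ (the sigmoid) and $u$ (the uniform density) are notation assumed here.

```latex
% Assumed form of the loss at a fixed state s, with u(a) = 1/|\mathcal{A}|
\begin{align*}
L(c) &= -\int_{\mathcal{A}} \Big[ \pi_\theta(a \mid s)\, \log \sigma\big(c(a)\big)
        + u(a)\, \log\big(1 - \sigma(c(a))\big) \Big]\, da, \\
% Pointwise stationarity in c(a), using \sigma'(c) = \sigma(c)(1 - \sigma(c))
0 &= -\pi_\theta(a \mid s)\, \big(1 - \sigma(c^\ast(a))\big) + u(a)\, \sigma\big(c^\ast(a)\big), \\
\sigma\big(c^\ast(a)\big) &= \frac{\pi_\theta(a \mid s)}{\pi_\theta(a \mid s) + u(a)}, \\
c^\ast(a) &= \log \frac{\pi_\theta(a \mid s)}{u(a)}
           = \log\big(|\mathcal{A}|\, \pi_\theta(a \mid s)\big).
\end{align*}
```

Since the loss is convex in $c(a)$ at every $a$, the stationary point is the minimizer, so under these assumptions the optimal logit recovers the log-density of the policy up to the constant $\log |\mathcal{A}|$.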
See Theorem 3.2.
Proof.
Let $\pi_\theta(a \mid s)$ be the density of the implicit policy $a = f_\theta(s, \epsilon)$, $\epsilon \sim \rho_0(\cdot)$. The entropy is computed as follows
Computing its gradient
(11) 
In the second line we highlight the fact that the expectation depends on the parameter $\theta$ both implicitly through the density $\pi_\theta$ and through the sample $a$. After decomposing the gradient using the chain rule, we find that the first term vanishes, because the expected score $\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)]$ is zero, leaving the result shown in the theorem. ∎
A.3 Lower Bound
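The two chain-rule terms can be checked numerically on a case where the density is known. The sketch below assumes a hypothetical one-dimensional Gaussian policy `a = mu + sigma * eps` (not a policy class from the paper), whose entropy is `0.5 * log(2 * pi * e * sigma**2)` with derivative `1 / sigma` in `sigma`.

```python
import random

def entropy_gradient_terms(mu, sigma, n_samples=200_000, seed=0):
    """Estimate both chain-rule terms of d/dsigma of the entropy -E[log pi(a)].

    Hypothetical toy policy (illustration only): a = mu + sigma * eps, eps ~ N(0, 1),
    so log pi(a) = -log(sigma) - (a - mu)**2 / (2 * sigma**2) + const.
    """
    rng = random.Random(seed)
    score_sum = 0.0    # derivative through the density, with the sample a held fixed
    sample_sum = 0.0   # derivative through the sample a = f(eps)
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)
        a = mu + sigma * eps
        # d/dsigma log pi(a) with a fixed (the score with respect to sigma)
        score_sum += -1.0 / sigma + (a - mu) ** 2 / sigma ** 3
        # d/da log pi(a) times da/dsigma
        sample_sum += (-(a - mu) / sigma ** 2) * eps
    # The entropy gradient is minus the gradient of E[log pi(a)]
    return -score_sum / n_samples, -sample_sum / n_samples

score_part, sample_part = entropy_gradient_terms(mu=1.0, sigma=2.0)
exact = 1.0 / 2.0   # d/dsigma of the Gaussian entropy at sigma = 2
```

In this toy case `score_part` is close to zero and `sample_part` is close to `exact`, consistent with the statement that only the term through the sample survives.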
We recall that given a policy $\pi$, the standard RL objective is $J(\pi) = \mathbb{E}_{\pi}\big[\sum_{t=0}^{\infty} \gamma^t r_t\big]$. In the maximum entropy formulation, the maximum entropy objective is
(12) 
where $\beta$ is a regularization constant and $\mathbb{H}[\pi(\cdot \mid s_t)]$ is the entropy of policy $\pi$ at $s_t$. We construct a surrogate objective based on another policy $\tilde{\pi}$ as follows
(13) 
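The following proof closely follows the proof in [30]. We use the following definition for coupling two policies
Definition A.1 ($\alpha$-coupled).
Two policies $\pi, \tilde{\pi}$ are $\alpha$-coupled if $P\big(\pi(\cdot \mid s) \neq \tilde{\pi}(\cdot \mid s)\big) \le \alpha$ for any $s$.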
Lemma A.2.
Given $\pi, \tilde{\pi}$ are $\alpha$-coupled, then
Proof.
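Let $n_t$ denote the number of times that $a_i \neq \tilde{a}_i$ for $i < t$, i.e. the number of times that $\pi, \tilde{\pi}$ disagree before time $t$. We can decompose the expectations as follows
Note that $n_t = 0$ implies $a_i = \tilde{a}_i$ for all $i < t$, hence
The definition of coupling implies $P(n_t = 0) \ge (1 - \alpha)^t$, and so $P(n_t > 0) \le 1 - (1 - \alpha)^t$. Now we note that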
Combining previous observations, we have proved the lemma. ∎
Note that if we take , then the surrogate objective in (9) is equivalent to defined in (13). With Lemma A.2, we prove the following theorem.
See Theorem 4.1.
Proof.
Appendix B Operator view of Entropy Regularization and Maximum Entropy RL
Recall that in the standard RL formulation, the agent is in state $s$, takes action $a$, receives reward $r$ and transitions to $s'$. Let the discount factor be $\gamma \in (0, 1)$. Assume that the reward and the transitions are deterministic, i.e. $s'$ is a deterministic function of $(s, a)$; it is straightforward to extend the following to general stochastic transitions. For a given policy $\pi$, define the linear Bellman operator $\mathcal{T}^{\pi}$ as
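Any policy $\pi$ satisfies the linear Bellman equation $\mathcal{T}^{\pi} Q^{\pi} = Q^{\pi}$. Define the Bellman optimality operator (we will call it the Bellman operator) $\mathcal{T}^{\ast}$ as
Now we define the Mellowmax operator [2, 13] $\mathcal{T}_s$ with parameter $\beta$ as follows,
It can be shown that both $\mathcal{T}^{\ast}$ and $\mathcal{T}_s$ are contractive operators when $\gamma < 1$. Let $Q^{\ast}$ be the unique fixed point of $\mathcal{T}^{\ast}$, then $Q^{\ast}$ is the action value function of the optimal policy $\pi^{\ast}$. Let be the unique fixed point of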
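The contraction property and the relation between the two fixed points can be illustrated on a small example. The sketch below assumes a hypothetical 2-state, 2-action deterministic MDP and writes the soft backup in a log-sum-exp form with temperature `beta`; this is one common way such an operator is written and may differ from the exact definition used here.

```python
import math

# Hypothetical deterministic MDP (illustration only): next state and reward per (s, a).
NEXT = {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 1}
REWARD = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.5, (1, 1): 2.0}
ACTIONS = (0, 1)
GAMMA = 0.9

def bellman_backup(q):
    """Bellman optimality backup: r + gamma * max_a' Q(s', a')."""
    return {sa: REWARD[sa] + GAMMA * max(q[NEXT[sa], b] for b in ACTIONS)
            for sa in q}

def soft_backup(q, beta):
    """Assumed soft backup: r + gamma * beta * log sum_a' exp(Q(s', a') / beta)."""
    def soft_max(s):
        m = max(q[s, b] for b in ACTIONS)   # subtract the max for numerical stability
        return m + beta * math.log(sum(math.exp((q[s, b] - m) / beta) for b in ACTIONS))
    return {sa: REWARD[sa] + GAMMA * soft_max(NEXT[sa]) for sa in q}

def sup_dist(q1, q2):
    """Sup-norm distance between two action value tables."""
    return max(abs(q1[sa] - q2[sa]) for sa in q1)

def fixed_point(backup, iters=500):
    """Iterate a contractive backup from zero until it has numerically converged."""
    q = {sa: 0.0 for sa in NEXT}
    for _ in range(iters):
        q = backup(q)
    return q

q_star = fixed_point(bellman_backup)
q_soft = fixed_point(lambda q: soft_backup(q, beta=0.01))
```

With a small `beta` the soft fixed point `q_soft` sits slightly above `q_star`, by at most `GAMMA * beta * log(len(ACTIONS)) / (1 - GAMMA)` in this form, and a single application of either backup shrinks the sup-norm distance between two tables by at least a factor of `GAMMA`.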