1 Introduction
Reinforcement learning algorithms coupled with powerful function approximators have recently achieved a series of successes (Mnih et al., 2015; Silver et al., 2016; Lillicrap et al., 2015; Kalashnikov et al., 2018). Unfortunately, while being extremely powerful, deep reinforcement learning (DRL) algorithms often require a large number of interactions with the environment to achieve good results, partially because they are often applied “from scratch” rather than in settings where they can leverage existing experience. This reduces their applicability in domains where generating experience is expensive, or learning from scratch is challenging.
The data efficiency of DRL algorithms is affected by various factors and significant research effort has been directed at achieving improvements (e.g. Popov et al., 2017). At the same time the development of basic locomotor behavior in humans can, in fact, require large amounts of experience and practice (Adolph et al., 2012), and it can take significant effort and training to master complex, high-speed skills (Haith & Krakauer, 2013). Once such skills have been acquired, however, humans rapidly put them to work in new contexts and to solve new tasks, suggesting transfer learning as an important mechanism.
For the purpose of this paper we are interested in methods that are suitable for transfer in the context of high-dimensional motor control problems. We further focus on model-free approaches, which are evident in human motor control (Haith & Krakauer, 2013), and have recently been used by a variety of scalable deep RL methods (e.g. Lillicrap et al., 2015; Mnih et al., 2015; Schulman et al., 2017; Kalashnikov et al., 2018).
Transfer may be especially valuable in domains where a small set of skills can be composed, in different combinations, to solve a variety of tasks. Different notions of compositionality have been considered in the RL and robotics literature. For instance, ‘options’ are associated with discrete units of behavior that can be sequenced, thus emphasizing composition in time (Precup et al., 1998). In this paper we are concerned with a rather distinct notion of compositionality, namely how to combine and blend potentially concurrent behaviors. This form of composition is particularly relevant in high-dimensional continuous action spaces, where it is possible to achieve more than one task simultaneously (e.g. walking somewhere while juggling).
One approach to this challenge is via the composition of task rewards. Specifically, we are interested in the following question: If we have previously solved a set of tasks with similar transition dynamics but different reward functions, how can we leverage this knowledge to solve new tasks which can be expressed as a convex combination of those reward functions?
This question has recently been studied in two independent lines of work: by Barreto et al. (2017, 2018) in the context of successor feature (SF) representations used for Generalized Policy Improvement (GPI) with deterministic policies, and by Haarnoja et al. (2018a); van Niekerk et al. (2018) in the context of maximum entropy policies. These approaches operate in distinct frameworks but both achieve skill composition by combining the Q-functions associated with previously learned skills.
We clarify the relationship between the two approaches and show that both can perform well in some situations but achieve poor results in others, often in complementary ways. We introduce a novel method of behavior composition that can consistently achieve good performance.
Our contributions are as follows:

We introduce successor features (SF) in the context of maximum entropy and extend the GPI theorem to this case (maxent GPI).

We provide an analysis of when GPI and compositional “optimism” (Haarnoja et al., 2018a) of entropy-regularized policies transfer. We construct both tabular and continuous action tasks where both fail to transfer well.

We propose a correction term, which we call Divergence Correction (DC), based on the Rényi divergence between policies, which allows us, in principle, to recover the optimal policy for transfer for any convex combination of rewards.

We demonstrate a practical implementation of these methods in continuous action spaces using adaptive importance sampling and compare the approaches introduced here (maxent GPI and DC) with optimism (Haarnoja et al., 2018a) and conditional Q functions (Schaul et al., 2015) in a variety of non-trivial continuous action transfer tasks.
2 Background
2.1 Multitask RL
We consider Markov Decision Processes defined by the tuple $\mathcal{M}$ containing: a state space $\mathcal{S}$, action space $\mathcal{A}$, a start state distribution $p(s_1)$, a transition function $p(s_{t+1}|s_t,a_t)$, a discount $\gamma \in [0,1)$ and a reward function $r(s_t,a_t,s_{t+1})$. The objective of RL is to find a policy $\pi(a|s)$ which maximises the discounted expected return from any state, $J(\pi) = \mathbb{E}_{\pi,\mathcal{M}}\left[\sum_{\tau=t}^{\infty}\gamma^{\tau-t} r_\tau\right]$, where the expected reward is dependent on the policy $\pi$ and the MDP $\mathcal{M}$.
We formalize transfer as in Barreto et al. (2017); Haarnoja et al. (2018a), as the desire to perform well across all tasks in a set $\mathcal{T}'$ after having learned policies for tasks $\mathcal{T}$, without additional experience. We assume that $\mathcal{T}$ and $\mathcal{T}'$ are related in two ways: all tasks share the same state transition function, and tasks in $\mathcal{T}'$ can be expressed as convex combinations of rewards associated with tasks in set $\mathcal{T}$. So if we write the reward functions for tasks in $\mathcal{T}$ as the vector $\phi = (r_1, r_2, \dots)$, tasks in $\mathcal{T}'$ can be expressed as $r_w = \phi \cdot w$.
We focus on combinations of two policies, $r_b = b\,r_i + (1-b)\,r_j$, but the methods can be extended to more than two tasks. We refer to a transfer method as optimal if it achieves optimal returns on tasks in $\mathcal{T}'$, using only experience on tasks $\mathcal{T}$.
2.2 Successor Features
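Successor Features (SF) (Dayan, 1993) and Generalised Policy Improvement (GPI) (Barreto et al., 2017, 2018) provide a principled solution to transfer in the setting defined above. SF make the additional assumption that the reward feature $\phi$ is fully observable, that is, the agent has access to the rewards of all tasks in $\mathcal{T}$, but not $\mathcal{T}'$, during training on each individual task.
The key observation of SF representations is that linearity of the reward $r_w$ with respect to the features $\phi$ implies the following decomposition of the action value of policy $\pi$ on task $r_w$:
$$Q^{\pi}_{w}(s_t,a_t) = \mathbb{E}_{\pi}\left[\sum_{\tau=t}^{\infty}\gamma^{\tau-t}\phi_\tau \cdot w \,\Big|\, a_t\right] = \mathbb{E}_{\pi}\left[\sum_{\tau=t}^{\infty}\gamma^{\tau-t}\phi_\tau \,\Big|\, a_t\right]\cdot w \equiv \psi^{\pi}(s_t,a_t)\cdot w \tag{1}$$
where $\psi^{\pi}$ is the expected discounted sum of features $\phi$ induced by policy $\pi$. This decomposition allows us to compute the action-value for $\pi$ on any task $w$ by learning $\psi^{\pi}$.
If we have a set of policies $\pi_1, \pi_2, \dots, \pi_n$ indexed by $i$, SF and GPI provide a principled approach to transfer on task $w'$. Namely, we act according to the deterministic GPI policy $\pi^{GPI}_{w'}(s_t) \equiv \arg\max_{a_t} Q^{GPI}_{w'}(s_t,a_t)$ where
$$Q^{GPI}_{w'}(s_t,a_t) \equiv \max_i Q^{\pi_i}_{w'}(s_t,a_t) \tag{2}$$
The GPI theorem guarantees the GPI policy has a return at least as good as any component policy, that is, $V^{\pi^{GPI}}_{w'}(s) \ge \max_i V^{\pi_i}_{w'}(s)$ for all $s \in \mathcal{S}$.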
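The decomposition and the GPI rule can be illustrated with a small tabular sketch. The SF tables, the weight vector and the function names below are hypothetical values chosen for illustration; the code only assumes that each policy's successor features are stored as an array indexed by state, action and feature.

```python
import numpy as np

def q_from_sf(psi, w):
    """Action values on task w from successor features psi[s, a, d]."""
    return psi @ w

def gpi_action(psis, w, s):
    """Greedy action under max_i Q^i_w(s, .) for a list of SF tables."""
    q = np.stack([q_from_sf(psi, w)[s] for psi in psis])  # [n_policies, n_actions]
    return int(np.argmax(q.max(axis=0)))

# Hand-made SF tables (assumed values): 1 state, 3 actions, 2 reward features.
psi_1 = np.array([[[1.0, 0.0], [0.2, 0.2], [0.0, 0.1]]])
psi_2 = np.array([[[0.0, 0.1], [0.3, 0.3], [0.0, 1.0]]])

# A new task expressed as a convex combination of the two reward features.
w_new = np.array([0.6, 0.4])
action = gpi_action([psi_1, psi_2], w_new, s=0)
```

No new learning is needed for `w_new`: the same SF tables are reused and only the dot product with the weight vector changes.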
2.3 Maximum Entropy RL
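The maximum entropy (maxent) RL objective augments the reward to favor entropic solutions
$$J(\pi) = \mathbb{E}_{\pi,\mathcal{M}}\left[\sum_{\tau=t}^{\infty}\gamma^{\tau-t}\left(r_\tau + \alpha H[\pi(\cdot|s_\tau)]\right)\right] \tag{3}$$
where $\alpha$ is a parameter that determines the relative importance of the entropy term.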
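As a hedged illustration of the quantities the maxent objective gives rise to, the sketch below uses the standard soft-Q relations for a discrete action set (an assumption made here for simplicity; the paper works with continuous actions): the soft value is $\alpha$ times the log-partition of $Q/\alpha$, and the Boltzmann policy is $\exp((Q - V)/\alpha)$. The numerical values are arbitrary.

```python
import numpy as np

def soft_value(q, alpha):
    """V(s) = alpha * log sum_a exp(Q(s, a) / alpha), computed stably."""
    z = q / alpha
    m = z.max(axis=-1, keepdims=True)
    return alpha * (m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1)))

def boltzmann_policy(q, alpha):
    """pi(a|s) = exp((Q(s, a) - V(s)) / alpha)."""
    return np.exp((q - soft_value(q, alpha)[..., None]) / alpha)

q = np.array([[1.0, 2.0, 0.5]])   # one state, three actions (assumed values)
alpha = 0.5
pi = boltzmann_policy(q, alpha)
entropy = -(pi * np.log(pi)).sum(axis=-1)

# For the Boltzmann policy, V(s) equals E_pi[Q] + alpha * H[pi].
lhs = soft_value(q, alpha)
rhs = (pi * q).sum(axis=-1) + alpha * entropy
```

A larger `alpha` flattens `pi` towards uniform, while `alpha` close to zero recovers the greedy policy.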
3 Composing Policies in MaxEnt Reinforcement Learning
In this section we present two novel approaches for maxent transfer learning. In section 4 we then outline a practical method for making use of these results.
3.1 MaxEnt Successor Features and Generalized Policy Improvement
We introduce maxent SF, which provide a practical method for computing the value of a maximum entropy policy under any convex combination of rewards. We then show the GPI theorem (Barreto et al., 2017) holds for maximum entropy policies.
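We define the action-dependent SF to include the entropy of the policy, excluding the current state, analogous to the max-entropy definition of $Q^{\pi}$ in (4):
$$\psi^{\pi}(s_t,a_t) \equiv \phi_t + \mathbb{E}_{\pi}\left[\sum_{\tau=t+1}^{\infty}\gamma^{\tau-t}\left(\phi_\tau + \alpha \mathbf{1}\, H[\pi(\cdot|s_\tau)]\right) \Big|\, a_t\right] \tag{7}$$
where $\mathbf{1}$ is a vector of ones of the same dimensionality as $\phi$, and we define the state-dependent successor features as the expected $\psi^{\pi}$ in analogy with $V^{\pi}$:
$$\Upsilon^{\pi}(s) \equiv \mathbb{E}_{a\sim\pi(\cdot|s)}\left[\psi^{\pi}(s,a)\right] + \alpha \mathbf{1}\, H[\pi(\cdot|s)] \tag{8}$$
The max-entropy action-value of $\pi$ for any convex combination of rewards $w$ is then given by $Q^{\pi}_{w}(s,a) = \psi^{\pi}(s,a)\cdot w$. Maxent SF allow us to estimate the action-value of previous policies on a new task. We show that, as in the deterministic case, there is a principled way to combine multiple policies using their action-values on task $w$.
Theorem 3.1 (MaxEnt Generalized Policy Improvement)
Let $\pi_1, \pi_2, \dots, \pi_n$ be $n$ policies with maxent action-value functions $Q^1, Q^2, \dots, Q^n$ and value functions $V^1, V^2, \dots, V^n$. Define
$$\pi(a|s) \propto \exp\left(\frac{1}{\alpha}\max_i Q^i(s,a)\right)$$
Then,
$$Q^{\pi}(s,a) \ge \max_i Q^i(s,a) \quad \text{for all } s \in \mathcal{S},\ a \in \mathcal{A} \tag{9}$$
$$V^{\pi}(s) \ge \max_i V^i(s) \quad \text{for all } s \in \mathcal{S} \tag{10}$$
where $Q^{\pi}(s,a)$ and $V^{\pi}(s)$ are the maxent action-value and value function respectively of $\pi$.
Proof: See appendix A.1. In our setup, we learn $\psi^{\pi_i}(s,a)$, the SFs of policies $\pi_i$ for each task in $\mathcal{T}$, and we define the maxent GPI policy for task $w \in \mathcal{T}'$ as
$$\pi^{GPI}_{w}(a|s) \propto \exp\left(\frac{1}{\alpha}\max_i Q^{\pi_i}_{w}(s,a)\right) = \exp\left(\frac{1}{\alpha}\max_i \psi^{\pi_i}(s,a)\cdot w\right)$$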
3.2 Divergence Correction (DC)
Haarnoja et al. (2018a) introduced a simple approach to policy composition by estimating the action-value for the transfer task $r_b = b\,r_i + (1-b)\,r_j$ from the optimal action-values of the component tasks $Q^i$ and $Q^j$:
$$Q^{CO}_b(s,a) \equiv b\,Q^i(s,a) + (1-b)\,Q^j(s,a) \tag{11}$$
When using Boltzmann policies defined by $Q$, the resulting policy, $\pi^{CO}_b(a|s) \propto \exp\left(\frac{1}{\alpha}Q^{CO}_b(s,a)\right)$, is the product distribution of the two component policies. We refer to $\pi^{CO}_b$ as the compositionally “optimistic” (CO) policy, as it acts according to the optimistic assumption that the optimal returns of $Q^i$ and $Q^j$ will be, simultaneously, achievable. (Compositional optimism is not the same as optimism under uncertainty, often used in RL for exploration.)
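The product-distribution property can be checked numerically. The sketch below assumes a single state with four discrete actions and arbitrary action values; it is an illustration of the identity rather than the authors' implementation.

```python
import numpy as np

def boltzmann(q, alpha):
    """Normalised Boltzmann distribution over a discrete action set."""
    p = np.exp((q - q.max()) / alpha)
    return p / p.sum()

alpha, b = 0.3, 0.7
q_i = np.array([1.0, 0.2, -0.5, 0.0])   # assumed action values, task i
q_j = np.array([-0.3, 0.9, 0.1, 0.0])   # assumed action values, task j

pi_i, pi_j = boltzmann(q_i, alpha), boltzmann(q_j, alpha)
pi_co = boltzmann(b * q_i + (1 - b) * q_j, alpha)

# Weighted product of the component policies and its normaliser.
product = pi_i**b * pi_j**(1 - b)
product_normalised = product / product.sum()
overlap = product.sum()
```

By Hölder's inequality `overlap` is at most 1, with equality only when the two policies coincide; the further it falls below 1, the less the two policies agree in that state.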
Both maxent GPI, presented above, and CO can, in different ways, fail to transfer well in some situations (see fig. 1 for some examples in the tabular case). Neither approach consistently performs optimally during transfer, even if all component terms are known exactly. We desire a solution for transfer that, in principle, can perform optimally.
Here we show that, at the cost of learning a function conditional on the task weightings $b$, it is in principle possible to recover the optimal policy for the transfer tasks, without direct experience on those tasks, by correcting for the compositional optimism bias in $Q^{CO}_b$. For simplicity, as in Haarnoja et al. (2018a), we restrict this to the case with only 2 tasks, but it can be extended to multiple tasks.
The correction term for CO uses a property noted, but not exploited, in Haarnoja et al. (2018a). The bias in $Q^{CO}_b$ is related to the discounted sum of Rényi divergences of the two component policies. Intuitively, if the two policies result in trajectories with low divergence between the policies in each state, the CO assumption that both policies can achieve good returns is approximately correct. When the divergences are large, the CO assumption is overly optimistic and the correction term will be large.
Theorem 3.2 (DC Optimality)
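Let $\pi_i, \pi_j$ be maxent optimal policies for tasks $i$ and $j$ with rewards $r_i$ and $r_j$ and with maxent action-value functions $Q^i, Q^j$. Define $C^{\infty}_b(s_t,a_t)$ as the fixed point of
$$C^{(k+1)}_b(s_t,a_t) = -\alpha\gamma\,\mathbb{E}_{p(s_{t+1}|s_t,a_t)}\left[\log\int_{\mathcal{A}}\pi_i(a_{t+1}|s_{t+1})^{b}\,\pi_j(a_{t+1}|s_{t+1})^{(1-b)}\exp\left(-\frac{1}{\alpha}C^{(k)}_b(s_{t+1},a_{t+1})\right)da_{t+1}\right]$$
Given the conditions for Soft Q convergence, the maxent optimal $Q^{*}_b(s,a)$ for $r_b = b\,r_i + (1-b)\,r_j$ is
$$Q^{*}_b(s,a) = b\,Q^i(s,a) + (1-b)\,Q^j(s,a) - C^{\infty}_b(s,a) \quad \text{for all } s \in \mathcal{S},\ a \in \mathcal{A},\ b \in [0,1]$$
Proof: See appendix A.2. We call this Divergence Correction (DC) as the quantity $C^{\infty}_b$ is related to the Rényi divergence between policies (see appendix A.2 for details). Learning $C^{\infty}_b$ does not, in principle, require any information beyond that required to learn policies $\pi_i$ and $\pi_j$. Unlike with SF, it is not necessary to observe other task features while training the policies. On the other hand, unlike with GPI, which can be used to naturally combine any number of tasks with arbitrary weight vectors $w$, in order to apply DC one must estimate $C^{\infty}_b(s,a)$ for all values of $b$. The complexity of learning $C^{\infty}$ increases significantly if more than 2 tasks are combined.
Supplementary Table 1 provides a comparison of the properties of the methods we consider here. We also compare with simply learning a conditional Q function $Q(s,a,b)$ (CondQ) (e.g. Schaul et al., 2015; Andrychowicz et al., 2017). As with GPI, this requires observing the full set of task features $\phi$, in order to compute $r_b$ for arbitrary $b$.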
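Theorem 3.2 can be sanity-checked on a small random tabular MDP. The sketch below makes several simplifying assumptions that are not in the paper: discrete actions (so the integral becomes a sum), rewards that depend only on the state and action, a known transition tensor `P`, and exact iteration in place of the sampled TD updates used in practice. All function names are illustrative.

```python
import numpy as np

def lse(x, axis):
    """Numerically stable log-sum-exp along `axis`."""
    m = x.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.exp(x - m).sum(axis=axis))

def soft_q_iteration(r, P, gamma, alpha, iters=500):
    """Maxent-optimal Q for reward r[s, a] and transitions P[s, a, s']."""
    q = np.zeros_like(r)
    for _ in range(iters):
        v = alpha * lse(q / alpha, axis=1)   # soft value, shape [S]
        q = r + gamma * P @ v                # soft Bellman backup
    return q

def divergence_correction(q_i, q_j, P, b, gamma, alpha, iters=500):
    """Iterate the C recursion from C = 0 using Q^i, Q^j and P only."""
    log_pi_i = (q_i - alpha * lse(q_i / alpha, axis=1)[:, None]) / alpha
    log_pi_j = (q_j - alpha * lse(q_j / alpha, axis=1)[:, None]) / alpha
    c = np.zeros_like(q_i)
    for _ in range(iters):
        inner = lse(b * log_pi_i + (1 - b) * log_pi_j - c / alpha, axis=1)
        c = -alpha * gamma * P @ inner
    return c

rng = np.random.default_rng(0)
S, A, gamma, alpha, b = 5, 3, 0.9, 0.5, 0.3
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
r_i, r_j = rng.random((S, A)), rng.random((S, A))

q_i = soft_q_iteration(r_i, P, gamma, alpha)
q_j = soft_q_iteration(r_j, P, gamma, alpha)
c = divergence_correction(q_i, q_j, P, b, gamma, alpha)

q_dc = b * q_i + (1 - b) * q_j - c   # corrected estimate
q_star = soft_q_iteration(b * r_i + (1 - b) * r_j, P, gamma, alpha)
max_gap = np.abs(q_dc - q_star).max()
```

In this setting `max_gap` is at the level of floating-point error, and `c` is non-negative, which is consistent with the CO estimate being an over-estimate of the optimal action value.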
In this section we have introduced two new theoretical approaches to maxent transfer composition: maxent GPI and DC. We have shown how these are related to relevant prior methods. In the next section we address the question of how to practically learn and sample with these approaches in continuous action spaces.
4 Adaptive Importance Sampling for Boltzmann Policies Algorithm
The control of robotic systems with high-dimensional continuous action spaces is a promising use case for the ideas presented in this paper. Such control problems may allow for multiple solutions, and can exhibit exploitable compositional structure. Unfortunately, learning and sampling of general Boltzmann policies defined over continuous action spaces is challenging. While this can be mitigated by learning a parametric sampling distribution, during transfer we want to sample from the Boltzmann policy associated with a newly synthesized action-value function without having to learn such an approximation first. To address this issue we introduce Adaptive Importance Sampling for Boltzmann Policies (AISBP), a method which provides a practical solution to this challenge.
In the following we parametrise all functions with neural nets (denoting parameters by the subscript $\theta$), including the soft action-value for reward $i$: $Q^i_{\theta_Q}(s,a)$; the associated soft value function $V^i_{\theta_V}(s)$ and a proposal distribution $q^i_{\theta_q}(a|s)$, the role of which we explain below. We use an off-policy algorithm, so that experience generated by training on policy $i$ can be used to improve policy $j$. This is especially important since our analysis requires the action-value $Q^i(s,a)$ to be known in all states. This is less likely to be the case for an on-policy algorithm, which only updates $Q^i$ using trajectories generated by policy $\pi_i$. During training, experience generated by all tasks is stored in a replay buffer $\mathcal{R}$, and mini-batches are sampled uniformly and used to update all function approximators. Soft Q iteration (see eq. 4) is used to learn $Q^i_{\theta_Q}$ and $V^i_{\theta_V}$. These updates are, in principle, straightforward using transitions sampled from the replay buffer.
Sampling from the Boltzmann policy defined by $Q^i_{\theta_Q}$, $\pi_i(a|s) \propto \exp\left(\frac{1}{\alpha}Q^i_{\theta_Q}(s,a)\right)$, is challenging, as is estimating the partition function (the $\log$ of which is also the value, c.f. Eq. 6). One approach is to fit an expressive, tractable sampler, such as a stochastic neural network, to approximate $\pi_i$ (e.g. Haarnoja et al., 2018a). This approach works well when learning a single policy. However, during transfer this may require learning a new sampler for each new value composition. AISBP instead uses importance sampling to sample and estimate the partition function. The scalability of this approach is improved by using a learned proposal distribution $q_{\theta_q}(a|s)$, and by observing that modern architectures allow for efficient batch computation of a large number of importance samples. To facilitate transfer we restrict the parametric form of the proposals to mixtures of (truncated) Gaussians. The product of these mixture distributions is tractable, allowing sampling from the proposal product during transfer (see supplementary C.3).
More formally, for each policy in $\mathcal{T}$ we learn an action-value network $Q_{\theta_Q}(s,a)$, a value network $V_{\theta_V}(s)$ and a proposal distribution $q_{\theta_q}(a|s)$ (we drop the task index $i$ here when writing the losses for notational clarity, and write the losses for a single policy). The proposal distribution is a mixture of $M$ truncated Normal distributions $\mathcal{N}_T$, truncated to the square $a \in [-1,1]^n$ with diagonal covariances:
$$q_{\theta_q}(a|s) = \frac{1}{M}\sum_{m=1}^{M}\mathcal{N}_T\left(a;\ \mu^{m}_{\theta_q}(s),\ \sigma^{m}_{\theta_q}(s),\ -1,\ 1\right) \tag{12}$$
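The proposal distribution is optimized by minimizing the forward KL divergence with the Boltzmann policy $\pi(a|s) \propto \exp\left(\frac{1}{\alpha}Q_{\theta_Q}(s,a)\right)$. This KL is “zero avoiding” and overestimates the support of $\pi$ (Murphy, 2012), which is desirable for a proposal distribution (Gu et al., 2015),
$$\mathcal{L}(\theta_q) = \mathbb{E}_{\mathcal{R}}\left[\mathbb{E}_{a\sim\pi(\cdot|s_t)}\left[\log\pi(a|s_t) - \log q_{\theta_q}(a|s_t)\right]\right] \tag{13}$$
where the outer expectation is over the replay buffer state density.
The inner expectation in the proposal loss itself requires sampling from $\pi$. We approximate this expectation by self-normalized importance sampling and use a target proposal distribution $p(a|s)$, which is a mixture distribution consisting of the proposals for all policies along with a uniform distribution. For batch size $B$ and $N$ proposal samples $a_{kl} \sim p(\cdot|s_k)$, the estimator of the proposal loss is then, up to a term that does not depend on $\theta_q$,
$$\mathcal{L}(\theta_q) \approx -\frac{1}{B}\sum_{k=1}^{B}\sum_{l=1}^{N} w_{kl}\log q_{\theta_q}(a_{kl}|s_k), \qquad w_{kl} = \frac{\exp\left(\frac{1}{\alpha}Q_{\theta_Q}(s_k,a_{kl})\right)/p(a_{kl}|s_k)}{\sum_{m=1}^{N}\exp\left(\frac{1}{\alpha}Q_{\theta_Q}(s_k,a_{km})\right)/p(a_{km}|s_k)} \tag{14}$$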
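The value function loss is defined as the L2 error on the Soft Q estimate of value
$$\mathcal{L}(\theta_V) = \mathbb{E}_{\mathcal{R}}\left[\frac{1}{2}\left(V_{\theta_V}(s_t) - \alpha\log\int_{\mathcal{A}}\exp\left(\frac{1}{\alpha}Q_{\theta_Q}(s_t,a)\right)da\right)^{2}\right] \tag{15}$$
which is estimated using importance sampling to compute the integral, with samples $a_k \sim q_{\theta_q}(\cdot|s_t)$:
$$\alpha\log\int_{\mathcal{A}}\exp\left(\frac{1}{\alpha}Q_{\theta_Q}(s_t,a)\right)da \approx \alpha\log\left[\frac{1}{N}\sum_{k=1}^{N}\frac{\exp\left(\frac{1}{\alpha}Q_{\theta_Q}(s_t,a_k)\right)}{q_{\theta_q}(a_k|s_t)}\right] \tag{16}$$
This introduces bias due to the finite-sample approximation of the expectation inside the (concave) $\log$. In practice we found this estimator sufficiently accurate, provided the proposal distribution $q$ was close to $\pi$. We also use importance sampling to sample from $\pi$ while acting.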
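A minimal one-dimensional sketch of the importance-sampled value estimate and of acting by self-normalised importance resampling is given below. The action-value function `q_fn`, the uniform proposal and the sample count are assumptions chosen so that the estimate can be compared against a fine numerical integral; the paper instead uses a learned mixture proposal.

```python
import numpy as np

def q_fn(a):
    """Hypothetical 1-D action value, peaked at a = 0.3."""
    return -4.0 * (a - 0.3) ** 2

def soft_value_is(q_values, proposal_density, alpha):
    """alpha * log mean_k[exp(Q(a_k)/alpha) / q(a_k)], via log-sum-exp."""
    log_w = q_values / alpha - np.log(proposal_density)
    m = log_w.max()
    return alpha * (m + np.log(np.mean(np.exp(log_w - m))))

def sample_boltzmann_is(rng, actions, q_values, proposal_density, alpha):
    """Draw one action by self-normalised importance resampling."""
    log_w = q_values / alpha - np.log(proposal_density)
    w = np.exp(log_w - log_w.max())
    return rng.choice(actions, p=w / w.sum())

alpha, n = 0.5, 20_000
rng = np.random.default_rng(0)
actions = rng.uniform(-1.0, 1.0, size=n)   # proposal: uniform on [-1, 1]
density = np.full(n, 0.5)                  # density of that proposal
v_is = soft_value_is(q_fn(actions), density, alpha)

# Reference value from a fine midpoint rule on the same interval.
n_grid = 200_000
grid = -1.0 + (np.arange(n_grid) + 0.5) * (2.0 / n_grid)
v_ref = alpha * np.log(np.sum(np.exp(q_fn(grid) / alpha)) * (2.0 / n_grid))

a = sample_boltzmann_is(rng, actions, q_fn(actions), density, alpha)
```

With a proposal that is far from the Boltzmann policy, a few samples dominate the weights and both the bias and the variance of `v_is` grow, which is why a learned proposal matters in higher dimensions.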
The action-value loss is just the L2 norm with the Soft Q target:
$$\mathcal{L}(\theta_Q) = \mathbb{E}_{\mathcal{R}}\left[\frac{1}{2}\left(Q_{\theta_Q}(s_t,a_t) - \left(r(s_t,a_t,s_{t+1}) + \gamma V_{\theta'_V}(s_{t+1})\right)\right)^{2}\right] \tag{17}$$
To improve stability we employ target networks for the value $V_{\theta'_V}$ and proposal $q_{\theta'_q}$ networks (Mnih et al., 2015; Lillicrap et al., 2015). We also parameterize $Q$ as an advantage (Baird, 1994; Wang et al., 2015; Harmon et al., 1995), which is more stable when the advantage is small compared with the value. The full algorithm is given in Algorithm Box 1 and more details are provided in appendix C.
4.1 Importance Sampled MaxEnt GPI
The same importance sampling approach can also be used to estimate maxent SF. Maxent GPI requires us to learn the expected (maximum entropy) features $\psi^{i}$ for each policy $\pi_i$, in order to estimate its (entropic) value under a new convex combination task $w$. This requires that the experience tuple in the replay contain the full feature vector $\phi$, rather than just the reward $r_i$ for the policy which generated the experience. Given this information, $\psi_{\theta_\psi}$ and $\Upsilon_{\theta_\Upsilon}$ can be learned with analogous updates to $Q$ and $V$, which again requires importance sampling to estimate $\Upsilon$.
As with $V_{\theta_V}$, we use a target network for $\Upsilon$ and an advantage parametrization.
We found it more stable to use a larger target update period than for $V$. Full details of the losses and samplers are in appendix C.
4.2 Divergence Correction
All that is required for transfer using compositional optimism (eq. 11, Haarnoja et al. (2018a)) is the maxent action values of each task, so no additional training is required beyond the base policies. In section 3.2 we have shown that if we can learn the fixed point $C^{\infty}_b$ we can correct this compositional optimism and recover the optimal action-value $Q^{*}_b$.
We exploit the recursive relationship in $C^{\infty}_b$ to fit a neural net $C_{\theta_C}(s,a,b)$ with a TD(0) estimator. This requires learning a conditional estimator for any value of $b$, so as to support arbitrary task combinations. Fortunately, the same experience can be used to learn an estimator for all values of $b$, by sampling $b$ during each TD update. As before, we use target networks and an advantage parametrization for $C_{\theta_C}(s,a,b)$.
We learn $C^{\infty}_b$ as $C_{\theta_C}(s,a,b)$, for each pair of policies $\pi_i, \pi_j$, resulting in the loss
$$\mathcal{L}(\theta_C) = \mathbb{E}_{s,a,s'\sim\mathcal{R},\,b}\left[\frac{1}{2}\left(C_{\theta_C}(s,a,b) + \alpha\gamma\log\int_{\mathcal{A}}\exp\left(b\log\pi_i(a'|s') + (1-b)\log\pi_j(a'|s') - \frac{1}{\alpha}C_{\theta'_C}(s',a',b)\right)da'\right)^{2}\right] \tag{18}$$
As with other integrals over the action space, we approximate this loss using importance sampling to estimate the integral. Note that, unlike GPI and CondQ (next section), learning $C^{\infty}_b$ does not require observing $\phi$ while training.
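We also considered a heuristic approach where we learned $C$ only for $b = \frac{1}{2}$ (this is typically approximately the largest divergence). This avoids the complexity of a conditional estimator, and we estimate $C^{\infty}_b$ as $\hat{C}^{\infty}_b(s,a) \approx 4b(1-b)\,C^{\infty}_{1/2}(s,a)$. This heuristic, which we denote DC-Cheap, can be motivated by considering Gaussian policies with similar variance (see appendix D). The maxent GPI bound can be used to correct for overestimates of the heuristic $\hat{C}^{\infty}_b$: $Q^{DC\text{-}Cheap+GPI}_b(s,a) = \max\left(Q^{CO}_b(s,a) - \hat{C}^{\infty}_b(s,a),\ Q^{GPI}_b(s,a)\right)$.
4.3 Cond Q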
4.4 Sampling Compositional Policies
During transfer we would like to be able to sample from the Boltzmann policy defined by our estimate of the transfer action-value $Q_b$ (the estimate is computed using the methods we enumerated above) without first having to learn, offline, a new proposal or sampling distribution (which is the approach employed by Haarnoja et al. (2018a)).
As outlined earlier (and detailed in appendix C.3), we chose the proposal distributions so that the product of proposals is tractable, meaning we can sample from $q^{ij}_b(a|s) \propto q^{i}(a|s)^{b}\,q^{j}(a|s)^{(1-b)}$. This is a good proposal distribution when the CO bias is low, since $Q^{CO}_b$ defines a Boltzmann policy which is the product of the base policies. However, when $C^{\infty}_b$ is large, meaning the CO bias is large, $q^{ij}_b$ may not be a good proposal, as we show in the experiments. In this case none of the existing proposal distributions may be a good fit. Therefore we sample from a mixture distribution of all policies, all policy products and the uniform distribution:
$$p_b(a|s) \equiv \frac{1}{4}\left(\frac{1}{V^{\mathcal{A}}} + q^{i}(a|s) + q^{j}(a|s) + q^{ij}_b(a|s)\right) \tag{19}$$
where $V^{\mathcal{A}}$ is the volume of the action space. Empirically, we find this is sufficient to result in good performance during transfer. The algorithm for transfer is given in supplementary algorithm 2.
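The tractability of the proposal product can be illustrated for single, untruncated one-dimensional Gaussians, where the weighted product is again Gaussian with a precision-weighted mean. This is a simplified sketch: it ignores truncation and mixtures with several components (handled in the paper's appendix C.3), and the equal mixture weights and all function names are assumptions of the example.

```python
import numpy as np

def normal_pdf(a, mu, sigma):
    return np.exp(-0.5 * ((a - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def power_product(mu_i, sigma_i, mu_j, sigma_j, b):
    """Mean and std of the Gaussian proportional to N_i^b * N_j^(1-b)."""
    precision = b / sigma_i**2 + (1 - b) / sigma_j**2
    mu = (b * mu_i / sigma_i**2 + (1 - b) * mu_j / sigma_j**2) / precision
    return mu, precision ** -0.5

def sample_transfer_proposal(rng, n, comp_i, comp_j, b, low=-1.0, high=1.0):
    """Sample from an equal-weight mixture of uniform, q_i, q_j and q_ij."""
    comp_ij = power_product(*comp_i, *comp_j, b)
    which = rng.integers(0, 4, size=n)   # mixture component per sample
    mus = np.array([0.0, comp_i[0], comp_j[0], comp_ij[0]])[which]
    sigmas = np.array([1.0, comp_i[1], comp_j[1], comp_ij[1]])[which]
    gauss = rng.normal(mus, sigmas)
    return np.where(which == 0, rng.uniform(low, high, size=n), gauss)

def transfer_proposal_density(a, comp_i, comp_j, b, low=-1.0, high=1.0):
    """Density of the same mixture, needed for importance weights."""
    comp_ij = power_product(*comp_i, *comp_j, b)
    uniform = ((a >= low) & (a <= high)) / (high - low)
    return 0.25 * (uniform + normal_pdf(a, *comp_i)
                   + normal_pdf(a, *comp_j) + normal_pdf(a, *comp_ij))

comp_i, comp_j, b = (-0.5, 0.2), (0.5, 0.2), 0.5   # (mean, std) pairs
mu_ij, sigma_ij = power_product(*comp_i, *comp_j, b)
rng = np.random.default_rng(0)
samples = sample_transfer_proposal(rng, 1000, comp_i, comp_j, b)
weights_denominator = transfer_proposal_density(samples, comp_i, comp_j, b)
```

Including the uniform component keeps the density bounded away from zero on the action range, so importance weights stay finite even where neither base proposal nor their product has mass.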
5 Experiments
5.1 Discrete, tabular environment
We first consider some illustrative tabular cases of compositional transfer. These highlight situations in which GPI and CO transfer can perform poorly (Figure 1). As expected, we find that GPI performs well when the optimal transfer policy is close to one of the existing policies; CO performs well when both subtask policies are compatible. The task we refer to as “tricky” is illustrative of tasks we will focus on in the next section, namely scenarios in which the optimal policy for the transfer task does not resemble either existing policy: in the grid world, non-overlapping rewards for each task are provided in one corner of the grid world, while lower-value overlapping rewards are provided in the other corner (cf. Fig. 1). As a consequence both GPI and CO perform poorly, while DC performs well in all cases.
5.2 Continuous action spaces
We compare the transfer methods in more challenging continuous control tasks. We train maxent policies to solve individual tasks using the importance sampling approach from section 4 and then assess transfer. All methods use the same experience and proposal distribution.
Figure 2 examines the transfer policies in detail in a simple point-mass task and shows how the estimated $C^{\infty}_b$ corrects $Q^{CO}_b$ and dramatically changes the policy.
We then examine conceptually similar tasks in more difficult domains: a 5 DOF planar manipulator reaching task (figure 3), 3 DOF jumping ball and 8 DOF ant (figure 4). We see that DC recovers a qualitatively better policy in all cases. The performance of GPI depends noticeably on the choice of $\alpha$. DC-Cheap, which is a simpler heuristic, performs almost as well as DC in the tasks we consider except for the point mass task. When bounded by GPI (DC-Cheap+GPI) it performs well for the point mass task as well, suggesting simple approximations of $C^{\infty}_b$ may be sufficient in some cases. Videos of the more interesting tasks are available at https://tinyurl.com/yaplfwaq.
We focussed on “tricky” tasks as they are a challenging form of transfer. In general, we would expect DC to perform well in most situations where CO performs well, since in this case the correction term $C^{\infty}_b$ that DC must learn is inconsequential (CO is equivalent to assuming $C^{\infty}_b = 0$). Supplementary figure 9 demonstrates that on a task with non-composable solutions (i.e. $C^{\infty}_b$ is large and potentially challenging to learn), DC continues to perform as well as GPI and slightly better than CondQ, while, as expected, CO performs poorly.
6 Discussion
We have presented two approaches to transfer learning via convex combinations of rewards in the maximum entropy framework: maxent GPI and DC. We have shown that, under standard assumptions, the maxent GPI policy performs at least as well as its component policies, and that DC recovers the optimal transfer policy. Todorov (2009), Saxe et al. (2017) and van Niekerk et al. (2018) previously considered optimal composition of maxent policies. However, these approaches require stronger assumptions about the class of MDPs. By contrast, DC does not restrict the class of MDPs and learns how compatible policies are, allowing approximate recovery of optimal transfer policies both when the component rewards are jointly achievable (AND), and when only one subgoal can be achieved (OR).
We have compared our methods with conditional action-value functions (CondQ) (e.g. Schaul et al., 2015) and optimistic policy combination (Haarnoja et al., 2018a). Further, we have presented AISBP, a practical algorithm for training DC and maxent GPI models in continuous action spaces using adaptive importance sampling. We have compared these approaches, along with heuristic approximations of DC, and demonstrated that DC recovers an approximately optimal policy during transfer across a variety of high-dimensional control tasks. Empirically we have found CondQ may be harder to learn than DC, and it requires additional observation of $\phi$ during training.
References
 Adolph et al. (2012) Karen E Adolph, Whitney G Cole, Meghana Komati, Jessie S Garciaguirre, Daryaneh Badaly, Jesse M Lingeman, Gladys LY Chan, and Rachel B Sotsky. How do you learn to walk? thousands of steps and dozens of falls per day. Psychological science, 23(11):1387–1394, 2012.
 Andrychowicz et al. (2017) Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pp. 5048–5058, 2017.
 Baird (1994) Leemon C Baird. Reinforcement learning in continuous time: Advantage updating. In Neural Networks, 1994. IEEE World Congress on Computational Intelligence., 1994 IEEE International Conference on, volume 4, pp. 2448–2453. IEEE, 1994.
 Barreto et al. (2017) André Barreto, Will Dabney, Rémi Munos, Jonathan J Hunt, Tom Schaul, Hado P van Hasselt, and David Silver. Successor features for transfer in reinforcement learning. In Advances in neural information processing systems, pp. 4055–4065, 2017.

 Barreto et al. (2018) Andre Barreto, Diana Borsa, John Quan, Tom Schaul, David Silver, Matteo Hessel, Daniel Mankowitz, Augustin Zidek, and Remi Munos. Transfer in deep reinforcement learning using successor features and generalised policy improvement. In Proceedings of the International Conference on Machine Learning, pp. 501–510, 2018.
 Clevert et al. (2015) Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289, 2015.
 Dayan (1993) Peter Dayan. Improving generalization for temporal difference learning: The successor representation. Neural Computation, 5(4):613–624, 1993.
 Fox et al. (2015) Roy Fox, Ari Pakman, and Naftali Tishby. Taming the noise in reinforcement learning via soft updates. arXiv preprint arXiv:1512.08562, 2015.
 Gil et al. (2013) Manuel Gil, Fady Alajaji, and Tamas Linder. Rényi divergence measures for commonly used univariate continuous distributions. Information Sciences, 249:124–131, 2013.
 Gu et al. (2015) Shixiang Gu, Zoubin Ghahramani, and Richard E Turner. Neural adaptive sequential monte carlo. In Advances in Neural Information Processing Systems, pp. 2629–2637, 2015.
 Haarnoja et al. (2017) Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.
 Haarnoja et al. (2018a) Tuomas Haarnoja, Vitchyr Pong, Aurick Zhou, Murtaza Dalal, Pieter Abbeel, and Sergey Levine. Composable deep reinforcement learning for robotic manipulation. arXiv preprint arXiv:1803.06773, 2018a.
 Haarnoja et al. (2018b) Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018b.
 Haith & Krakauer (2013) Adrian M Haith and John W Krakauer. Model-based and model-free mechanisms of human motor learning. In Progress in motor control, pp. 1–21. Springer, 2013.
 Harmon et al. (1995) Mance E Harmon, Leemon C Baird III, and A Harry Klopf. Advantage updating applied to a differential game. In Advances in neural information processing systems, pp. 353–360, 1995.
 Kalashnikov et al. (2018) Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018.
 Kappen (2005) Hilbert J Kappen. Path integrals and symmetry breaking for optimal control theory. Journal of statistical mechanics: theory and experiment, 2005(11):P11011, 2005.
 Lillicrap et al. (2015) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 Murphy (2012) Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012. ISBN 0262018020, 9780262018029.
 Popov et al. (2017) Ivaylo Popov, Nicolas Heess, Timothy Lillicrap, Roland Hafner, Gabriel Barth-Maron, Matej Vecerik, Thomas Lampe, Yuval Tassa, Tom Erez, and Martin Riedmiller. Data-efficient deep reinforcement learning for dexterous manipulation. arXiv preprint arXiv:1704.03073, 2017.
 Precup et al. (1998) Doina Precup, Richard S Sutton, and Satinder Singh. Theoretical results on reinforcement learning with temporally abstract options. In European conference on machine learning, pp. 382–393. Springer, 1998.
 Saxe et al. (2017) Andrew M Saxe, Adam Christopher Earle, and Benjamin S Rosman. Hierarchy through composition with multitask lmdps. 2017.
 Schaul et al. (2015) Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In International Conference on Machine Learning, pp. 1312–1320, 2015.
 Schrempf et al. (2005) Oliver C Schrempf, Olga Feiermann, and Uwe D Hanebeck. Optimal mixture approximation of the product of mixtures. In Information Fusion, 2005 8th International Conference on, volume 1, pp. 8–pp. IEEE, 2005.
 Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484, 2016.
 Tassa et al. (2018) Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018.
 Todorov (2009) Emanuel Todorov. Compositionality of optimal control laws. In Advances in Neural Information Processing Systems, pp. 1856–1864, 2009.
 Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012.
 van Niekerk et al. (2018) Benjamin van Niekerk, Steven James, Adam Earle, and Benjamin Rosman. Will it blend? composing value functions in reinforcement learning. arXiv preprint arXiv:1807.04439, 2018.
 Wang et al. (2015) Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.
 Ziebart et al. (2008) Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pp. 1433–1438. Chicago, IL, USA, 2008.
Appendix A Proofs
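A.1 MaxEnt Generalized Policy Improvement
See Theorem 3.1.
For brevity we denote $Q^{max}(s,a) \equiv \max_i Q^i(s,a)$, so that $\pi(a|s) = \exp\left(\frac{1}{\alpha}Q^{max}(s,a)\right)/Z^{\pi}(s)$ with $Z^{\pi}(s) = \int_{\mathcal{A}}\exp\left(\frac{1}{\alpha}Q^{max}(s,a)\right)da$. Define the soft Bellman operator associated with policy $\pi$ as
$$T^{\pi}Q(s,a) \equiv \mathbb{E}_{p(s'|s,a)}\left[r(s,a,s') + \gamma\left(\alpha H[\pi(\cdot|s')] + \mathbb{E}_{a'\sim\pi(\cdot|s')}\left[Q(s',a')\right]\right)\right]$$
Haarnoja et al. (2018b) have pointed out that the soft Bellman operator $T^{\pi}$ corresponds to a conventional, “hard”, Bellman operator defined over the same MDP but with reward $r_{\pi}(s,a,s') = r(s,a,s') + \gamma\alpha H[\pi(\cdot|s')]$. Thus, as long as $r(s,a,s')$ and $H[\pi(\cdot|s')]$ are bounded, $T^{\pi}$ is a contraction with $Q^{\pi}$ as its fixed point. Applying $T^{\pi}$ to $Q^{max}$ we have:
$$T^{\pi}Q^{max}(s,a) = \mathbb{E}_{p(s'|s,a)}\left[r(s,a,s') + \gamma\,\mathbb{E}_{a'\sim\pi}\left[-\alpha\log\pi(a'|s') + Q^{max}(s',a')\right]\right] = \mathbb{E}_{p(s'|s,a)}\left[r(s,a,s') + \gamma\alpha\log Z^{\pi}(s')\right]$$
Similarly, if we apply $T^{\pi_i}$, the soft Bellman operator induced by policy $\pi_i$, to $Q^{max}$, we obtain:
$$T^{\pi_i}Q^{max}(s,a) = \mathbb{E}_{p(s'|s,a)}\left[r(s,a,s') + \gamma\,\mathbb{E}_{a'\sim\pi_i}\left[-\alpha\log\pi_i(a'|s') + Q^{max}(s',a')\right]\right]$$
We now note that the Kullback-Leibler divergence between $\pi_i$ and $\pi$ can be written as
$$D_{KL}\left(\pi_i(\cdot|s)\,\|\,\pi(\cdot|s)\right) = \mathbb{E}_{a\sim\pi_i}\left[\log\pi_i(a|s) - \frac{1}{\alpha}Q^{max}(s,a) + \log Z^{\pi}(s)\right]$$
The quantity above, which is always nonnegative, will be useful in the subsequent derivations. Next we write
$$T^{\pi}Q^{max}(s,a) - T^{\pi_i}Q^{max}(s,a) = \gamma\alpha\,\mathbb{E}_{p(s'|s,a)}\left[D_{KL}\left(\pi_i(\cdot|s')\,\|\,\pi(\cdot|s')\right)\right] \ge 0 \tag{20}$$
Since $Q^{max} \ge Q^i$ and $T^{\pi_i}$ is monotone with fixed point $Q^i$, (20) gives $T^{\pi}Q^{max} \ge T^{\pi_i}Q^{max} \ge T^{\pi_i}Q^i = Q^i$ for all $i$, and hence $T^{\pi}Q^{max} \ge Q^{max}$. Applying the monotone contraction $T^{\pi}$ repeatedly yields $Q^{\pi} = \lim_{k\to\infty}(T^{\pi})^{k}Q^{max} \ge Q^{max}$, which is (9). For (10), $V^{\pi}(s) = \mathbb{E}_{\pi}[Q^{\pi}(s,a)] + \alpha H[\pi(\cdot|s)] \ge \alpha\log Z^{\pi}(s) \ge \alpha\log\int_{\mathcal{A}}\exp\left(\frac{1}{\alpha}Q^i(s,a)\right)da \ge V^i(s)$ for all $i$.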
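A.2 DC Proof
See Theorem 3.2.
We follow a similar approach to Haarnoja et al. (2018a) but without making approximations and generalizing to all convex combinations.
First note that since $\pi_i$ and $\pi_j$ are optimal, $\pi_i(a|s) = \exp\left(\frac{1}{\alpha}\left(Q^i(s,a) - V^i(s)\right)\right)$, and likewise for $\pi_j$.
For brevity we use $s$ and $s'$ notation rather than writing the time index, and write $r_b(s,a)$ for the expected reward.
Define
$$Q^{(0)}_b(s,a) \equiv b\,Q^i(s,a) + (1-b)\,Q^j(s,a) \tag{23}$$
$$C^{(0)}_b(s,a) \equiv 0 \tag{24}$$
and consider soft Q-iteration on $r_b$ starting from $Q^{(0)}_b$. We prove, inductively, that at each iteration $Q^{(k)}_b = b\,Q^i + (1-b)\,Q^j - C^{(k)}_b$.
This is true by definition for $k = 0$. For the inductive step,
$$Q^{(k+1)}_b(s,a) = r_b(s,a) + \gamma\alpha\,\mathbb{E}_{p(s'|s,a)}\left[\log\int_{\mathcal{A}}\exp\left(\tfrac{1}{\alpha}Q^{(k)}_b(s',a')\right)da'\right] \tag{25}$$
$$= r_b(s,a) + \gamma\alpha\,\mathbb{E}_{p(s'|s,a)}\left[\log\int_{\mathcal{A}}\exp\left(\tfrac{1}{\alpha}\left(b\,Q^i(s',a') + (1-b)\,Q^j(s',a') - C^{(k)}_b(s',a')\right)\right)da'\right] \tag{26}$$
$$= r_b(s,a) + \gamma\alpha\,\mathbb{E}_{p(s'|s,a)}\left[\log\int_{\mathcal{A}}\exp\left(\tfrac{1}{\alpha}\left(b\,V^i(s') + (1-b)\,V^j(s')\right) + b\log\pi_i(a'|s') + (1-b)\log\pi_j(a'|s') - \tfrac{1}{\alpha}C^{(k)}_b(s',a')\right)da'\right] \tag{27}$$
$$= r_b(s,a) + \gamma\,\mathbb{E}_{p(s'|s,a)}\left[b\,V^i(s') + (1-b)\,V^j(s') + \alpha\log\int_{\mathcal{A}}\pi_i(a'|s')^{b}\,\pi_j(a'|s')^{(1-b)}\exp\left(-\tfrac{1}{\alpha}C^{(k)}_b(s',a')\right)da'\right] \tag{28}$$
$$= b\,Q^i(s,a) + (1-b)\,Q^j(s,a) - C^{(k+1)}_b(s,a) \tag{29}$$
where the last step uses $Q^i(s,a) = r_i(s,a) + \gamma\,\mathbb{E}_{p(s'|s,a)}[V^i(s')]$, the analogous identity for $Q^j$, and the recursion defining $C^{(k+1)}_b$ in Theorem 3.2.
Since soft Q-iteration converges to the maxent optimal soft $Q$, this completes the proof.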
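One can get an intuition for $C^{\infty}_b$ by noting that
$$C^{(1)}_b(s,a) = -\alpha\gamma\,\mathbb{E}_{p(s'|s,a)}\left[\log\int_{\mathcal{A}}\pi_i(a'|s')^{b}\,\pi_j(a'|s')^{(1-b)}da'\right] = \alpha\gamma(1-b)\,\mathbb{E}_{p(s'|s,a)}\left[D_b\left(\pi_i(\cdot|s')\,\|\,\pi_j(\cdot|s')\right)\right] \tag{30}$$
where $D_b$ is the Rényi divergence of order $b$. $C^{\infty}_b$ can be seen as the discounted sum of divergences, weighted by the unnormalized product distribution $\pi_i^{b}\,\pi_j^{(1-b)}$.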
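The Rényi divergence of order $b$, $D_b(p\|q) = \frac{1}{b-1}\log\int p^{b}q^{1-b}$, has a simple closed form for two Gaussians with equal variance, $D_b = b(\mu_p-\mu_q)^2/(2\sigma^2)$. The sketch below checks this numerically and evaluates the resulting one-step term $(1-b)D_b$ over a range of $b$. The Gaussian policies and their parameters are assumptions of the example, not quantities from the experiments.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def renyi_numeric(b, mu_p, mu_q, sigma, lo=-12.0, hi=12.0, n=400_000):
    """D_b(p||q) = log(integral p^b q^(1-b)) / (b - 1), midpoint rule."""
    dx = (hi - lo) / n
    x = lo + (np.arange(n) + 0.5) * dx
    integral = np.sum(normal_pdf(x, mu_p, sigma) ** b
                      * normal_pdf(x, mu_q, sigma) ** (1 - b)) * dx
    return np.log(integral) / (b - 1)

def renyi_equal_variance(b, mu_p, mu_q, sigma):
    """Closed form for two Gaussians sharing the same variance."""
    return b * (mu_p - mu_q) ** 2 / (2 * sigma**2)

b, mu_p, mu_q, sigma = 0.3, 0.0, 1.5, 1.0
d_num = renyi_numeric(b, mu_p, mu_q, sigma)
d_closed = renyi_equal_variance(b, mu_p, mu_q, sigma)

# One-step correction per unit alpha * gamma: (1 - b) * D_b.
bs = np.linspace(0.05, 0.95, 19)
first_step = (1 - bs) * renyi_equal_variance(bs, mu_p, mu_q, sigma)
```

For equal-variance Gaussians `first_step` is proportional to $b(1-b)$, so it peaks at $b = \frac{1}{2}$ and equals $4b(1-b)$ times its value there.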
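A.3 n policies
It is possible to extend Theorem 3.2 to the case with $n$ policies in a straightforward way.
Theorem A.1 (Multipolicy DC Optimality)
Let $\pi_1, \pi_2, \dots, \pi_n$ be maxent optimal policies for tasks $1, \dots, n$ with rewards $r_1, \dots, r_n$ and with maxent action-value functions $Q^1, \dots, Q^n$.
Define $C^{\infty}_w(s_t,a_t)$ as the fixed point of
$$C^{(k+1)}_w(s_t,a_t) = -\alpha\gamma\,\mathbb{E}_{p(s_{t+1}|s_t,a_t)}\left[\log\int_{\mathcal{A}}\prod_{i=1}^{n}\pi_i(a_{t+1}|s_{t+1})^{w_i}\exp\left(-\frac{1}{\alpha}C^{(k)}_w(s_{t+1},a_{t+1})\right)da_{t+1}\right]$$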