Entropic Policy Composition with Generalized Policy Improvement and Divergence Correction

12/05/2018 ∙ by Jonathan J Hunt, et al.

Deep reinforcement learning (RL) algorithms have made great strides in recent years. An important remaining challenge is the ability to quickly transfer existing skills to novel tasks, and to combine existing skills with newly acquired ones. In domains where tasks are solved by composing skills this capacity holds the promise of dramatically reducing the data requirements of deep RL algorithms, and hence increasing their applicability. Recent work has studied ways of composing behaviors represented in the form of action-value functions. We analyze these methods to highlight their strengths and weaknesses, and point out situations where each of them is susceptible to poor performance. To perform this analysis we extend generalized policy improvement to the max-entropy framework and introduce a method for the practical implementation of successor features in continuous action spaces. Then we propose a novel approach which, in principle, recovers the optimal policy during transfer. This method works by explicitly learning the (discounted, future) divergence between policies. We study this approach in the tabular case and propose a scalable variant that is applicable in multi-dimensional continuous action spaces. We compare our approach with existing ones on a range of non-trivial continuous control problems with compositional structure, and demonstrate qualitatively better performance despite not requiring simultaneous observation of all task rewards.


1 Introduction

Reinforcement learning algorithms coupled with powerful function approximators have recently achieved a series of successes (Mnih et al., 2015; Silver et al., 2016; Lillicrap et al., 2015; Kalashnikov et al., 2018). Unfortunately, while being extremely powerful, deep reinforcement learning (DRL) algorithms often require a large number of interactions with the environment to achieve good results, partially because they are often applied “from scratch” rather than in settings where they can leverage existing experience. This reduces their applicability in domains where generating experience is expensive, or learning from scratch is challenging.

The data efficiency of DRL algorithms is affected by various factors and significant research effort has been directed at achieving improvements (e.g. Popov et al., 2017). At the same time, the development of basic locomotor behavior in humans can, in fact, require large amounts of experience and practice (Adolph et al., 2012), and it can take significant effort and training to master complex, high-speed skills (Haith & Krakauer, 2013). Once such skills have been acquired, however, humans rapidly put them to work in new contexts and to solve new tasks, suggesting transfer learning as an important mechanism.

For the purpose of this paper we are interested in methods that are suitable for transfer in the context of high-dimensional motor control problems. We further focus on model-free approaches, which are evident in human motor control (Haith & Krakauer, 2013), and have recently been used by a variety of scalable deep RL methods (e.g. Lillicrap et al., 2015; Mnih et al., 2015; Schulman et al., 2017; Kalashnikov et al., 2018).

Transfer may be especially valuable in domains where a small set of skills can be composed, in different combinations, to solve a variety of tasks. Different notions of compositionality have been considered in the RL and robotics literature. For instance, ‘options’ are associated with discrete units of behavior that can be sequenced, thus emphasizing composition in time (Precup et al., 1998). In this paper we are concerned with a rather distinct notion of compositionality, namely how to combine and blend potentially concurrent behaviors. This form of composition is particularly relevant in high-dimensional continuous action spaces, where it is possible to achieve more than one task simultaneously (e.g. walking somewhere while juggling).

One approach to this challenge is via the composition of task rewards. Specifically, we are interested in the following question: If we have previously solved a set of tasks with similar transition dynamics but different reward functions, how can we leverage this knowledge to solve new tasks which can be expressed as a convex combination of those reward functions?

This question has recently been studied in two independent lines of work: by Barreto et al. (2017, 2018) in the context of successor feature (SF) representations used for Generalized Policy Improvement (GPI) with deterministic policies, and by Haarnoja et al. (2018a); van Niekerk et al. (2018) in the context of maximum entropy policies. These approaches operate in distinct frameworks but both achieve skill composition by combining the Q-functions associated with previously learned skills.

We clarify the relationship between the two approaches and show that both can perform well in some situations but achieve poor results in others, often in complementary ways. We introduce a novel method of behavior composition that can consistently achieve good performance.

Our contributions are as follows:

  1. We introduce successor features (SF) in the maximum entropy framework and extend the GPI theorem to this case (max-ent GPI).

  2. We provide an analysis of when GPI and the compositional “optimism” (Haarnoja et al., 2018a) of entropy-regularized policies transfer successfully. We construct both tabular and continuous-action tasks where each fails to transfer well.

  3. We propose a correction term – which we call Divergence Correction (DC) – based on the Rényi divergence between policies, which allows us, in principle, to recover the optimal policy for transfer for any convex combination of rewards.

  4. We demonstrate a practical implementation of these methods in continuous action spaces using adaptive importance sampling, and compare the approaches introduced here (max-ent GPI and DC) with compositional optimism (Haarnoja et al., 2018a) and conditional Q-functions (Schaul et al., 2015) on a variety of non-trivial continuous action transfer tasks.

2 Background

2.1 Multi-task RL

We consider Markov Decision Processes (MDPs) defined by the tuple $M = (\mathcal{S}, \mathcal{A}, p_0, p, \gamma, r)$ containing: a state space $\mathcal{S}$, an action space $\mathcal{A}$, a start state distribution $p_0(s)$, a transition function $p(s_{t+1} \mid s_t, a_t)$, a discount $\gamma \in [0, 1)$ and a reward function $r(s_t, a_t)$. The objective of RL is to find a policy $\pi(a \mid s)$ which maximises the discounted expected return from any state, where the expected return depends on the policy $\pi$ and the MDP $M$.

We formalize transfer as in Barreto et al. (2017); Haarnoja et al. (2018a): the desire to perform well across all tasks in a set $\mathcal{T}'$ after having learned policies for the tasks in a set $\mathcal{T}$, without additional experience. We assume that $\mathcal{T}$ and $\mathcal{T}'$ are related in two ways: all tasks share the same state transition function, and tasks in $\mathcal{T}'$ can be expressed as convex combinations of the rewards associated with tasks in $\mathcal{T}$. So if we write the reward functions for tasks in $\mathcal{T}$ as the vector $\boldsymbol{\phi}(s, a) = (r_1(s, a), \ldots, r_n(s, a))$, tasks in $\mathcal{T}'$ can be expressed as $r_{\mathbf{w}}(s, a) = \boldsymbol{\phi}(s, a) \cdot \mathbf{w}$ for a convex weight vector $\mathbf{w}$.

We focus on combinations of two policies but the methods can be extended to more than two tasks. We refer to a transfer method as optimal if it achieves optimal returns on tasks in $\mathcal{T}'$, using only experience on tasks in $\mathcal{T}$.

2.2 Successor Features

Successor Features (SF) (Dayan, 1993) and Generalised Policy Improvement (GPI) (Barreto et al., 2017, 2018) provide a principled solution to transfer in the setting defined above. SF make the additional assumption that the reward features $\boldsymbol{\phi}(s, a)$ are fully observable, that is, the agent has access to the rewards of all tasks in $\mathcal{T}$, but not $\mathcal{T}'$, during training on each individual task.

The key observation of SF representations is that linearity of the reward $r_{\mathbf{w}}$ with respect to the features $\boldsymbol{\phi}$ implies the following decomposition of the action-value of policy $\pi$ on task $\mathbf{w}$:

$$Q^{\pi}_{\mathbf{w}}(s, a) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^t \boldsymbol{\phi}(s_t, a_t) \,\Big|\, s_0 = s, a_0 = a\Big] \cdot \mathbf{w} = \boldsymbol{\psi}^{\pi}(s, a) \cdot \mathbf{w}, \qquad (1)$$

where $\boldsymbol{\psi}^{\pi}(s, a)$ is the expected discounted sum of features $\boldsymbol{\phi}$ induced by policy $\pi$. This decomposition allows us to compute the action-value of $\pi$ on any task $\mathbf{w}$ by learning $\boldsymbol{\psi}^{\pi}$.

If we have a set of policies $\pi_i$ indexed by $i = 1, \ldots, n$, SF and GPI provide a principled approach to transfer on task $\mathbf{w}$. Namely, we act according to the deterministic GPI policy $\pi^{\mathrm{GPI}}_{\mathbf{w}}(s) = \arg\max_a Q^{\mathrm{GPI}}_{\mathbf{w}}(s, a)$ where

$$Q^{\mathrm{GPI}}_{\mathbf{w}}(s, a) \equiv \max_i Q^{\pi_i}_{\mathbf{w}}(s, a) = \max_i \boldsymbol{\psi}^{\pi_i}(s, a) \cdot \mathbf{w}. \qquad (2)$$

The GPI theorem guarantees the GPI policy has a return at least as good as any component policy, that is, $Q^{\pi^{\mathrm{GPI}}_{\mathbf{w}}}_{\mathbf{w}}(s, a) \geq \max_i Q^{\pi_i}_{\mathbf{w}}(s, a)$ for all $(s, a)$.
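To make the composition concrete, here is a minimal tabular sketch of GPI action selection from successor features. The array layout and function name are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def gpi_action(psis, w, s):
    """Deterministic GPI action selection from successor features (eq. 2).

    psis: successor features psi^{pi_i}(s, a) for each stored policy,
          shape [n_policies, n_states, n_actions, n_features] (assumed tabular).
    w:    task weights defining r_w = phi . w, shape [n_features].
    s:    current discrete state index.
    """
    # Q^{pi_i}_w(s, a) = psi^{pi_i}(s, a) . w for every policy i and action a
    q_all = psis[:, s, :, :] @ w          # shape [n_policies, n_actions]
    # GPI: act greedily with respect to the max over component policies
    return int(np.argmax(q_all.max(axis=0)))
```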

2.3 Maximum Entropy RL

The maximum entropy (max-ent) RL objective augments the reward to favor entropic solutions:

$$J(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^t \big(r(s_t, a_t) + \alpha \mathcal{H}[\pi(\cdot \mid s_t)]\big)\Big], \qquad (3)$$

where $\alpha$ is a parameter that determines the relative importance of the entropy term.

This objective has been considered in a number of works including Kappen (2005); Todorov (2009); Haarnoja et al. (2017, 2018a); Ziebart et al. (2008); Fox et al. (2015).

We define the action-value associated with eq. 3 as

$$Q^{\pi}(s, a) = r(s, a) + \mathbb{E}_{\pi}\Big[\sum_{t=1}^{\infty} \gamma^t \big(r(s_t, a_t) + \alpha \mathcal{H}[\pi(\cdot \mid s_t)]\big) \,\Big|\, s_0 = s, a_0 = a\Big] \qquad (4)$$

(notice $Q^{\pi}(s, a)$ does not include any entropy terms for the state $s$). Soft Q iteration,

$$Q(s, a) \leftarrow r(s, a) + \gamma\, \mathbb{E}_{p(s' \mid s, a)}\big[V(s')\big], \qquad (5)$$
$$V(s) \leftarrow \alpha \log \int_{\mathcal{A}} \exp\big(Q(s, a)/\alpha\big)\, da, \qquad (6)$$

where the policy is the Boltzmann distribution $\pi(a \mid s) \propto \exp\big(Q(s, a)/\alpha\big)$, converges to the optimal max-ent policy under standard assumptions (Haarnoja et al., 2017).
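As an illustration of eqs. 5-6, the following is a small tabular sketch of soft Q iteration for a discrete MDP; the array layout is an assumption made for the example.

```python
import numpy as np

def soft_q_iteration(r, P, gamma=0.95, alpha=1.0, n_iters=500):
    """Tabular soft Q iteration (eqs. 5-6) for discrete states and actions.

    r: rewards, shape [n_states, n_actions]
    P: transition probabilities, shape [n_states, n_actions, n_states]
    Returns the soft-optimal Q and the Boltzmann policy pi ∝ exp(Q/alpha).
    """
    q = np.zeros(r.shape)
    for _ in range(n_iters):
        # V(s) = alpha * log sum_a exp(Q(s, a) / alpha)   (eq. 6, discrete actions)
        v = alpha * np.log(np.exp(q / alpha).sum(axis=1))
        # Q(s, a) = r(s, a) + gamma * E_{s'}[V(s')]        (eq. 5)
        q = r + gamma * P @ v
    pi = np.exp((q - q.max(axis=1, keepdims=True)) / alpha)
    return q, pi / pi.sum(axis=1, keepdims=True)
```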

3 Composing Policies in Max-Ent Reinforcement Learning

In this section we present two novel approaches for max-ent transfer learning. In section 4 we then outline a practical method for making use of these results.

3.1 Max-Ent Successor Features and Generalized Policy Improvement

We introduce max-ent SF, which provide a practical method for computing the value of a maximum entropy policy under any convex combination of rewards. We then show the GPI theorem (Barreto et al., 2017) holds for maximum entropy policies.

We define the action-dependent SF $\boldsymbol{\psi}^{\pi}$ to include the entropy of the policy, excluding the current state, analogous to the max-entropy definition of $Q$ in (4):

$$\boldsymbol{\psi}^{\pi}(s, a) \equiv \boldsymbol{\phi}(s, a) + \mathbb{E}_{\pi}\Big[\sum_{t=1}^{\infty} \gamma^t \big(\boldsymbol{\phi}(s_t, a_t) + \alpha \mathcal{H}[\pi(\cdot \mid s_t)]\,\mathbf{1}\big) \,\Big|\, s_0 = s, a_0 = a\Big], \qquad (7)$$

where $\mathbf{1}$ is a vector of ones of the same dimensionality as $\boldsymbol{\phi}$, and we define the state-dependent successor features $\boldsymbol{\Psi}^{\pi}$ as the expected $\boldsymbol{\psi}^{\pi}$ in analogy with $V$:

$$\boldsymbol{\Psi}^{\pi}(s) \equiv \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[\boldsymbol{\psi}^{\pi}(s, a)\big] + \alpha \mathcal{H}[\pi(\cdot \mid s)]\,\mathbf{1}. \qquad (8)$$

The max-entropy action-value of $\pi$ for any convex combination of rewards $\mathbf{w}$ is then given by $Q^{\pi}_{\mathbf{w}}(s, a) = \boldsymbol{\psi}^{\pi}(s, a) \cdot \mathbf{w}$. Max-ent SF allow us to estimate the action-value of previous policies on a new task. We show that, as in the deterministic case, there is a principled way to combine multiple policies using their action-values on task $\mathbf{w}$.

Theorem 3.1 (Max-Ent Generalized Policy Improvement)

Let $\pi_1, \ldots, \pi_n$ be policies with $\alpha$-max-ent action-value functions $Q^{\pi_1}, \ldots, Q^{\pi_n}$ and value functions $V^{\pi_1}, \ldots, V^{\pi_n}$. Define

$$\pi(a \mid s) \propto \exp\Big(\max_i Q^{\pi_i}(s, a)\big/\alpha\Big).$$

Then,

$$Q^{\pi}(s, a) \geq \max_i Q^{\pi_i}(s, a) \quad \forall\, s, a, \qquad (9)$$
$$V^{\pi}(s) \geq \max_i V^{\pi_i}(s) \quad \forall\, s, \qquad (10)$$

where $Q^{\pi}$ and $V^{\pi}$ are the $\alpha$-max-ent action-value and value function respectively of $\pi$.

Proof: See appendix A.1. In our setup, we learn $\boldsymbol{\psi}^{\pi_i}$, the SFs of the policies for each task in $\mathcal{T}$, and we define the max-ent GPI policy for transfer task $\mathbf{w}$ as $\pi^{\mathrm{GPI}}_{\mathbf{w}}(a \mid s) \propto \exp\big(\max_i \boldsymbol{\psi}^{\pi_i}(s, a) \cdot \mathbf{w}\big/\alpha\big)$.
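For discrete actions, the max-ent GPI policy reduces to a Boltzmann distribution over the max of the component soft action-values; the sketch below assumes tabular max-ent successor features and is purely illustrative.

```python
import numpy as np

def max_ent_gpi_policy(psis, w, s, alpha=1.0):
    """Max-ent GPI: Boltzmann policy over max_i psi^{pi_i}(s, a) . w.

    psis: max-ent successor features, shape [n_policies, n_states, n_actions, n_features]
    Returns action probabilities for discrete state s.
    """
    q_all = psis[:, s, :, :] @ w             # Q^{pi_i}_w(s, .) for each policy i
    q_max = q_all.max(axis=0)                # max_i Q^{pi_i}_w(s, .)
    logits = (q_max - q_max.max()) / alpha   # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()
```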

3.2 Divergence Correction (DC)

Haarnoja et al. (2018a) introduced a simple approach to policy composition, estimating the action-value for the transfer task $r_b = b\, r_1 + (1 - b)\, r_2$ from the optimal action-values of the component tasks, $Q^*_1$ and $Q^*_2$:

$$\hat{Q}^{\mathrm{CO}}_b(s, a) \equiv b\, Q^*_1(s, a) + (1 - b)\, Q^*_2(s, a). \qquad (11)$$

When using Boltzmann policies defined by $\pi(a \mid s) \propto \exp(Q(s, a)/\alpha)$, the resulting policy, $\pi^{\mathrm{CO}}_b$, is the product distribution of the two component policies. We refer to $\pi^{\mathrm{CO}}_b$ as the compositionally “optimistic” (CO) policy, as it acts according to the optimistic assumption that the optimal returns of $r_1$ and $r_2$ will be simultaneously achievable (compositional optimism is not the same as optimism under uncertainty, which is often used in RL for exploration).

Both max-ent GPI, presented above, and CO can, in different ways, fail to transfer well in some situations (see fig. 1 for examples in the tabular case). Neither approach consistently performs optimally during transfer, even if all component terms are known exactly. We desire a solution for transfer that, in principle, can perform optimally.

Here we show that, at the cost of learning a function conditional on the task weighting $b$, it is in principle possible to recover the optimal policy for the transfer tasks, without direct experience on those tasks, by correcting for the compositional optimism bias in $\hat{Q}^{\mathrm{CO}}_b$. For simplicity, as in Haarnoja et al. (2018a), we restrict this to the case of only 2 tasks, but it can be extended to multiple tasks.

The correction term for CO uses a property noted, but not exploited, in Haarnoja et al. (2018a): the bias in $\hat{Q}^{\mathrm{CO}}_b$ is related to the discounted sum of Rényi divergences of the two component policies. Intuitively, if the two policies result in trajectories with low divergence between the policies in each state, the CO assumption that both policies can achieve good returns is approximately correct. When the divergences are large, the CO assumption is overly optimistic and the correction term will be large.

Theorem 3.2 (DC Optimality)

Let $\pi^*_1, \pi^*_2$ be max-ent optimal policies for tasks with rewards $r_1$ and $r_2$, with max-ent action-value functions $Q^*_1, Q^*_2$. Define $C^{\infty}_b(s, a)$ as the fixed point of

$$C^{(k+1)}_b(s, a) = -\gamma\, \mathbb{E}_{p(s' \mid s, a)}\Big[\alpha \log \int_{\mathcal{A}} \pi^*_1(a' \mid s')^{b}\, \pi^*_2(a' \mid s')^{1-b} \exp\big(-C^{(k)}_b(s', a')/\alpha\big)\, da'\Big].$$

Given the conditions for Soft Q convergence, the max-ent optimal action-value for $r_b = b\, r_1 + (1 - b)\, r_2$ is

$$Q^*_b(s, a) = b\, Q^*_1(s, a) + (1 - b)\, Q^*_2(s, a) - C^{\infty}_b(s, a).$$

Proof: See appendix A.2. We call this Divergence Correction (DC), as the quantity $C^{\infty}_b$ is related to the Rényi divergence between the policies (see appendix A.2 for details). Learning $C^{\infty}_b$ does not require any additional information (in principle) beyond that required to learn the policies $\pi^*_1$ and $\pi^*_2$. Unlike with SF, it is not necessary to observe other task features while training the policies. On the other hand, unlike with GPI, which can be used to naturally combine any number of tasks with arbitrary weight vectors $\mathbf{w}$, in order to apply DC one must estimate $C^{\infty}_b$ for all values of $b$. The complexity of learning $C^{\infty}_b$ increases significantly if more than 2 tasks are combined.
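The following tabular sketch iterates the fixed point stated in Theorem 3.2 and composes the transfer action-value; the array interface is an assumption made for the example, and this is only an illustration of the recursion, not the paper's implementation.

```python
import numpy as np

def divergence_correction(pi1, pi2, P, b=0.5, gamma=0.95, alpha=1.0, n_iters=500):
    """Tabular fixed-point iteration for the DC term C_b (Theorem 3.2).

    pi1, pi2: Boltzmann policies of the two base tasks, shape [n_states, n_actions]
    P:        transition probabilities, shape [n_states, n_actions, n_states]
    """
    c = np.zeros(pi1.shape)
    for _ in range(n_iters):
        # alpha * log sum_a' pi1^b * pi2^(1-b) * exp(-C/alpha), per next state
        inner = alpha * np.log((pi1 ** b * pi2 ** (1 - b) * np.exp(-c / alpha)).sum(axis=1))
        c = -gamma * (P @ inner)          # expectation over next states s'
    return c

def dc_transfer_q(q1, q2, c, b=0.5):
    """Composed transfer action-value: Q_b = b*Q1 + (1-b)*Q2 - C_b."""
    return b * q1 + (1 - b) * q2 - c
```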

Supplementary Table 1 provides a comparison of the properties of the methods we consider here. We also compare with simply learning a conditional $Q$ function, $Q(s, a, \mathbf{w})$ (CondQ) (e.g. Schaul et al., 2015; Andrychowicz et al., 2017). As with GPI, this requires observing the full set of task features $\boldsymbol{\phi}$, in order to compute $r_{\mathbf{w}}$ for arbitrary $\mathbf{w}$.

In this section we have introduced two new theoretical approaches to max-ent transfer composition: max-ent GPI and DC. We have shown how these are related to relevant prior methods. In the next section we address the question of how to practically learn and sample with these approaches in continuous action spaces.

4 Adaptive Importance Sampling for Boltzmann Policies Algorithm

The control of robotic systems with high-dimensional continuous action spaces is a promising use case for the ideas presented in this paper. Such control problems may allow for multiple solutions, and can exhibit exploitable compositional structure. Unfortunately, learning and sampling of general Boltzmann policies defined over continuous action spaces is challenging. While this can be mitigated by learning a parametric sampling distribution, during transfer we want to sample from the Boltzmann policy associated with a newly synthesized action-value function without having to learn such an approximation first. To address this issue we introduce Adaptive Importance Sampling for Boltzmann Policies (AISBP), a method which provides a practical solution to this challenge.

In the following we parametrise all functions with neural nets (denoting parameters by a subscript), including the soft action-value $Q^i(s, a)$ for reward $r_i$; the associated soft value function $V^i(s)$; and a proposal distribution $q^i(a \mid s)$, the role of which we explain below. We use an off-policy algorithm, so that experience generated by training on policy $\pi_i$ can be used to improve policy $\pi_j$. This is especially important since our analysis requires the action-value of each policy to be known in all states; this is less likely to be the case for an on-policy algorithm that only updates $\pi_i$ using trajectories generated by $\pi_i$. During training, experience generated by all tasks is stored in a replay buffer $R$, and mini-batches are sampled uniformly and used to update all function approximators. Soft Q iteration (see eqs. 5 and 6) is used to learn $Q^i$ and $V^i$. These updates are, in principle, straightforward using transitions sampled from the replay buffer.

Sampling from the Boltzmann policy defined by $Q^i$, $\pi_i(a \mid s) \propto \exp(Q^i(s, a)/\alpha)$, is challenging, as is estimating the partition function (the log of which is also the value, c.f. eq. 6). One approach is to fit an expressive, tractable sampler, such as a stochastic neural network, to approximate $\pi_i$ (e.g. Haarnoja et al., 2018a). This approach works well when learning a single policy. However, during transfer this may require learning a new sampler for each new value composition. AISBP instead uses importance sampling to sample and to estimate the partition function. The scalability of this approach is improved by using a learned proposal distribution $q^i(a \mid s)$, and by observing that modern architectures allow for efficient batch computation of a large number of importance samples. To facilitate transfer we restrict the parametric form of the proposals to mixtures of (truncated) Gaussians. The product of these mixture distributions is tractable, allowing sampling from the proposal product during transfer (see supplementary C.3).
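As an aside on why the restriction to Gaussian-mixture proposals keeps the product tractable, here is a sketch for 1-D (untruncated) Gaussian mixtures; truncation and diagonal covariances, as used in AISBP, are handled per dimension in the same way. The function is illustrative, not the paper's implementation.

```python
import numpy as np

def product_of_gaussian_mixtures(w1, mu1, var1, w2, mu2, var2):
    """Product of two 1-D Gaussian mixture densities, returned as a mixture.

    Inputs are component weights, means and variances (1-D arrays) of the two
    mixtures. Each pair of components multiplies to a Gaussian, so the product
    is again a mixture with one component per pair (i, j).
    """
    prec = 1.0 / var1[:, None] + 1.0 / var2[None, :]       # precisions add
    var = 1.0 / prec
    mu = var * (mu1[:, None] / var1[:, None] + mu2[None, :] / var2[None, :])
    # Each pair also picks up the constant N(mu1; mu2, var1 + var2)
    s = var1[:, None] + var2[None, :]
    z = np.exp(-0.5 * (mu1[:, None] - mu2[None, :]) ** 2 / s) / np.sqrt(2 * np.pi * s)
    w = w1[:, None] * w2[None, :] * z
    return (w / w.sum()).ravel(), mu.ravel(), var.ravel()  # renormalized weights
```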

More formally, for each policy $\pi_i$ in $\mathcal{T}$ we learn an action-value network $Q^i$, a value network $V^i$, and a proposal distribution $q^i$ (we drop the task index $i$ when writing the losses for notational clarity, and write the losses for a single policy). The proposal distribution is a mixture of $M$ truncated Normal distributions, truncated to the bounded action space, with diagonal covariances:

$$q(a \mid s) = \sum_{m=1}^{M} w_m(s)\, \mathcal{N}_T\big(a;\, \mu_m(s), \sigma^2_m(s)\big). \qquad (12)$$

The proposal distribution is optimized by minimizing the forward KL divergence with the Boltzmann policy $\pi(a \mid s) \propto \exp(Q(s, a)/\alpha)$. This KL is “zero avoiding” and over-estimates the support of $\pi$ (Murphy, 2012), which is desirable for a proposal distribution (Gu et al., 2015):

$$L(q) = \mathbb{E}_{s \sim R}\Big[ D_{\mathrm{KL}}\big(\pi(\cdot \mid s)\,\big\|\, q(\cdot \mid s)\big) \Big], \qquad (13)$$

where the expectation is over the replay buffer state density.

The inner expectation in the proposal loss itself requires sampling from $\pi$. We approximate this expectation by self-normalized importance sampling, using a target proposal distribution $q'$, which is a mixture distribution consisting of the proposals for all policies along with a uniform distribution. For batch size $B$ and $N$ proposal samples, the estimator of the proposal loss (up to terms that do not depend on $q$) is then

$$\hat{L}(q) = -\frac{1}{B} \sum_{j=1}^{B} \sum_{k=1}^{N} \frac{w_{jk}}{\sum_{k'=1}^{N} w_{jk'}}\, \log q(a_{jk} \mid s_j), \qquad a_{jk} \sim q'(\cdot \mid s_j), \quad w_{jk} = \frac{\exp\big(Q(s_j, a_{jk})/\alpha\big)}{q'(a_{jk} \mid s_j)}. \qquad (14)$$

The value function loss is defined as the L2 error on the Soft Q estimate of value:

$$L(V) = \mathbb{E}_{s \sim R}\Big[\tfrac{1}{2}\Big(V(s) - \alpha \log \int_{\mathcal{A}} \exp\big(Q(s, a)/\alpha\big)\, da\Big)^2\Big], \qquad (15)$$

which is estimated using importance sampling to compute the integral:

$$\alpha \log \int_{\mathcal{A}} \exp\big(Q(s, a)/\alpha\big)\, da \;\approx\; \alpha \log \frac{1}{N} \sum_{k=1}^{N} \frac{\exp\big(Q(s, a_k)/\alpha\big)}{q'(a_k \mid s)}, \qquad a_k \sim q'(\cdot \mid s). \qquad (16)$$

This introduces a bias due to the finite-sample approximation of the expectation inside the (concave) log. In practice we found this estimator sufficiently accurate, provided the proposal distribution was close to $\pi$. We also use importance sampling to sample from $\pi$ while acting.
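A sketch of the importance-sampled value estimate in eq. 16; the callables stand in for the learned networks and proposal and are assumptions made for the example.

```python
import numpy as np

def soft_value_estimate(q_fn, proposal_sample, proposal_logpdf, s, alpha=1.0, n=128):
    """Estimate V(s) = alpha * log ∫ exp(Q(s, a)/alpha) da by importance sampling (eq. 16)."""
    a = proposal_sample(s, n)                           # n actions from q'(.|s)
    log_w = q_fn(s, a) / alpha - proposal_logpdf(s, a)  # log[ exp(Q/alpha) / q' ]
    m = log_w.max()
    # alpha * log-mean-exp of the importance weights
    return alpha * (m + np.log(np.mean(np.exp(log_w - m))))
```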

The action-value loss is just the L2 norm with the Soft Q target:

$$L(Q) = \mathbb{E}_{(s, a, r, s') \sim R}\Big[\tfrac{1}{2}\big(Q(s, a) - (r + \gamma V(s'))\big)^2\Big]. \qquad (17)$$

To improve stability we employ target networks for the value and proposal networks (Mnih et al., 2015; Lillicrap et al., 2015). We also parameterize $Q$ as an advantage (Baird, 1994; Wang et al., 2015; Harmon et al., 1995), which is more stable when the advantage is small compared with the value. The full algorithm is given in Algorithm Box 1 and more details are provided in appendix C.

Initialize proposal network, value network and action-value network parameters, and replay buffer $R$
while training do in parallel on each actor
     Obtain parameters from learner
     Sample task $i$
     Roll out episode, using $q^i$ to importance sample actions from $\pi_i(a \mid s) \propto \exp(Q^i(s, a)/\alpha)$
     Add experience to replay $R$
end while
while training do in parallel on the learner
     Sample SARS tuple from $R$
     Improve $L(Q)$, $L(V)$, $L(q)$
     Improve additional losses for transfer (max-ent SF $\boldsymbol{\psi}$, $\boldsymbol{\Psi}$; CondQ; $C$)
     if target update period then
         Update target network parameters
     end if
end while
Algorithm 1 AISBP training algorithm

4.1 Importance Sampled Max-Ent GPI

The same importance sampling approach can also be used to estimate max-ent SF. Max-ent GPI requires us to learn the expected (maximum entropy) features $\boldsymbol{\psi}^{\pi_i}$ for each policy $\pi_i$, in order to estimate its (entropic) value under a new convex combination task $\mathbf{w}$. This requires that the experience tuples in the replay contain the full feature vector $\boldsymbol{\phi}$, rather than just the reward $r_i$ for the policy which generated the experience. Given this information, $\boldsymbol{\psi}^{\pi_i}$ and $\boldsymbol{\Psi}^{\pi_i}$ can be learned with updates analogous to those for $Q$ and $V$, which again require importance sampling to estimate $\boldsymbol{\Psi}^{\pi_i}$.

As with $Q$, we use a target network for $\boldsymbol{\Psi}^{\pi_i}$ and an advantage parametrization. We found it more stable to use a larger target update period than for $V$. Full details of the losses and samplers are in appendix C.
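Under the state-SF definition in eq. 8, the importance-sampled estimate of $\boldsymbol{\Psi}(s)$ parallels the value estimate above; this sketch assumes placeholder callables for the networks and proposal and relies on the reconstruction of eq. 8 given earlier.

```python
import numpy as np

def state_sf_estimate(psi_fn, q_fn, proposal_sample, proposal_logpdf, s, alpha=1.0, n=128):
    """Self-normalized IS estimate of Psi(s) = E_{a~pi}[ psi(s,a) - alpha*log pi(a|s) * 1 ]."""
    a = proposal_sample(s, n)                             # [n, action_dim]
    qa = q_fn(s, a)                                       # soft Q at sampled actions, [n]
    log_w = qa / alpha - proposal_logpdf(s, a)            # unnormalized log pi - log q'
    m = log_w.max()
    v = alpha * (m + np.log(np.mean(np.exp(log_w - m))))  # V(s), as in eq. 16
    w = np.exp(log_w - m)
    w /= w.sum()                                          # self-normalized weights
    log_pi = (qa - v) / alpha                             # log pi(a|s) = (Q - V)/alpha
    psi = psi_fn(s, a)                                    # [n, n_features]
    return (w[:, None] * (psi - alpha * log_pi[:, None])).sum(axis=0)
```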

4.2 Divergence Correction

All that is required for transfer using compositional optimism (eq. 11, Haarnoja et al. (2018a)) is the max-ent action-values of each task, so no additional training is required beyond the base policies. In section 3.2 we have shown that if we can learn the fixed point $C^{\infty}_b$ we can correct this compositional optimism and recover the optimal action-value $Q^*_b$.

We exploit the recursive relationship in $C_b$ to fit a neural net $C(s, a, b)$ with a TD(0) estimator. This requires learning a conditional estimator for any value of $b$, so as to support arbitrary task combinations. Fortunately, the same experience can be used to learn an estimator for all values of $b$, by sampling $b$ during each TD update. As before, we use target networks and an advantage parametrization for $C$.

We learn $C(s, a, b)$ for each pair of policies, resulting in the loss

$$L(C) = \mathbb{E}_{(s, a, s') \sim R,\; b}\Big[\tfrac{1}{2}\Big(C(s, a, b) + \gamma\, \alpha \log \int_{\mathcal{A}} \pi_1(a' \mid s')^{b}\, \pi_2(a' \mid s')^{1-b} \exp\big(-C(s', a', b)/\alpha\big)\, da'\Big)^2\Big]. \qquad (18)$$

As with the other integrals over the action space, we approximate this loss using importance sampling to estimate the integral. Note that, unlike GPI and CondQ (next section), learning $C$ does not require observing $\boldsymbol{\phi}$ while training.
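A sketch of a single TD(0) target for the conditional correction, assuming the fixed-point form stated in Theorem 3.2 and placeholder callables for the networks, policies and proposal.

```python
import numpy as np

def dc_td_loss(c_fn, c_target_fn, log_pi1, log_pi2, proposal_sample, proposal_logpdf,
               s, a, s_next, alpha=1.0, gamma=0.95, n=128):
    """TD(0) loss for the conditional correction C(s, a, b) (cf. eq. 18)."""
    b = np.random.uniform()                              # train C for all mixture weights b
    a_next = proposal_sample(s_next, n)                  # importance-sample the integral over a'
    log_w = (b * log_pi1(s_next, a_next)
             + (1 - b) * log_pi2(s_next, a_next)
             - c_target_fn(s_next, a_next, b) / alpha
             - proposal_logpdf(s_next, a_next))
    m = log_w.max()
    # alpha * log ∫ pi1^b pi2^(1-b) exp(-C/alpha) da', estimated by importance sampling
    soft_int = alpha * (m + np.log(np.mean(np.exp(log_w - m))))
    target = -gamma * soft_int                           # fixed-point target for C(s, a, b)
    return 0.5 * (c_fn(s, a, b) - target) ** 2
```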

We also considered a heuristic approach in which we learn $C_b$ only for $b = 1/2$ (this is typically approximately the largest divergence). This avoids the complexity of a conditional estimator, and we estimate $Q^*_b$ using $C_{1/2}$ in place of $C_b$. This heuristic, which we denote DC-Cheap, can be motivated by considering Gaussian policies with similar variance (see appendix D). The max-ent GPI bound can be used to correct for over-estimates of the heuristic, by taking the maximum of the DC-Cheap and GPI action-value estimates (DC-Cheap+GPI).

4.3 CondQ

As a baseline, we directly learn a conditional action-value function $Q(s, a, \mathbf{w})$, using a similar approach to DC of sampling $\mathbf{w}$ during each update (Schaul et al., 2015). This, like GPI but unlike DC, requires observing $\boldsymbol{\phi}$ during training so that the reward $r_{\mathbf{w}}$ on task $\mathbf{w}$ can be estimated. We provide the full details in appendix C.

4.4 Sampling Compositional Policies

During transfer we would like to be able to sample from the Boltzmann policy defined by our estimate of the transfer action-value (the estimate is computed using the methods we enumerated above) without having to, offline, learn a new proposal or sampling distribution first (which is the approach employed by Haarnoja et al. (2018a)).

As outlined earlier (and detailed in appendix C.3), we chose the proposal distributions so that the product of proposals is tractable, meaning we can sample from the product $q^1 q^2$. This is a good proposal distribution when the CO bias is low, since $\hat{Q}^{\mathrm{CO}}_b$ defines a Boltzmann policy which is the product of the base policies. However, when $C_b$ is large, meaning the CO bias is large, the proposal product may not be a good proposal, as we show in the experiments. In this case none of the existing proposal distributions may be a good fit. Therefore we sample from a mixture distribution of all policy proposals, all proposal products and the uniform distribution,

$$q^{\mathrm{transfer}}(a \mid s) = \frac{1}{Z}\Big(\sum_i q^i(a \mid s) + \sum_{i < j} \tilde{q}^{ij}(a \mid s) + \frac{1}{|\mathcal{A}|}\Big), \qquad \tilde{q}^{ij}(a \mid s) \propto q^i(a \mid s)\, q^j(a \mid s), \qquad (19)$$

where $|\mathcal{A}|$ is the volume of the action space and $Z$ normalizes the mixture. Empirically, we find this is sufficient to result in good performance during transfer. The algorithm for transfer is given in supplementary algorithm 2.
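At transfer time the composed Boltzmann policy can be sampled by drawing candidates from the mixture proposal and resampling with self-normalized importance weights; the sketch below uses placeholder callables and equal mixture weights as assumptions.

```python
import numpy as np

def sample_transfer_action(q_hat_fn, proposals, s, alpha=1.0, n=512):
    """Approximately sample a ~ pi(a|s) ∝ exp(Q_hat(s, a)/alpha) at transfer time.

    q_hat_fn:  composed transfer action-value estimate (GPI, CO, DC or CondQ)
    proposals: list of (sample_fn, logpdf_fn) pairs; including a uniform
               component gives a mixture in the spirit of eq. 19.
    """
    # Draw candidates from the equal-weight mixture proposal
    idx = np.random.randint(len(proposals), size=n)
    a = np.stack([proposals[i][0](s) for i in idx])              # [n, action_dim]
    log_q = np.logaddexp.reduce(
        np.stack([lp(s, a) for _, lp in proposals]), axis=0) - np.log(len(proposals))
    # Self-normalized importance weights w ∝ exp(Q_hat/alpha) / q_mix
    log_w = q_hat_fn(s, a) / alpha - log_q
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    return a[np.random.choice(n, p=w)]                           # resample one action
```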

5 Experiments

5.1 Discrete, tabular environment

We first consider some illustrative tabular cases of compositional transfer. These highlight situations in which GPI and CO transfer can perform poorly (Figure 1). As expected, we find that GPI performs well when the optimal transfer policy is close to one of the existing policies; CO performs well when both subtask policies are compatible. The task we refer to as “tricky” is illustrative of the tasks we will focus on in the next section, namely scenarios in which the optimal policy for the transfer task does not resemble either existing policy: non-overlapping rewards for each task are provided in one corner of the grid world, while lower-value overlapping rewards are provided in the other corner (cf. Fig. 1). As a consequence both GPI and CO perform poorly while DC performs well in all cases.

Figure 1: Policy composition in the tabular case. All tasks are in an infinite-horizon tabular 8x8 world. The action space is the 4 diagonal movements (actions at the boundary transition back to the same state). (a-c) show 3 reward functions (color indicates reward magnitude: dark blue and light blue mark the two reward levels). The arrows indicate the action likelihoods for the max-ent optimal policy for each task. The tasks include L(eft) and T(ricky) tasks. (d) The log regret of the max-ent returns for 3 qualitatively distinct compositional tasks, using different approaches to transfer from the base policies. The compositional tasks we consider are left-right (LR), left-up (LU) and the “tricky” tasks (T). GPI performs well when the subtasks are incompatible, meaning the optimal policy is similar to one of the existing policies. CO performs poorly in these situations, resulting in indecision about which subtask to commit to (e shows the CO policy and value). Conversely, when the subpolicies are compatible, such as on the LU task, CO transfers well while the GPI policy (f) does not consistently take advantage of the compatibility of the two tasks to simultaneously achieve both subgoals. Neither GPI nor CO policies (g shows the GPI policy, but CO is similar) perform well when the optimal transfer policy is dissimilar to either existing task policy. The two tricky task policies are compatible in many states but have a high divergence in the bottom-left corner, since the rewards are non-overlapping there (i); thus the optimal policy on the composed task is to move to the top-right corner where there are overlapping rewards. By learning, and correcting for, this future divergence between policies, DC results in optimal policies for all task combinations including tricky (h).

Additional details in supplementary figure 5.

5.2 Continuous action spaces

We compare the transfer methods in more challenging continuous control tasks. We train max-ent policies to solve individual tasks using the importance sampling approach from section 4 and then assess transfer. All methods use the same experience and proposal distribution.

Figure 2 examines the transfer policies in detail in a simple point-mass task and shows how the estimated correction $C_b$ corrects $\hat{Q}^{\mathrm{CO}}_b$ and dramatically changes the policy.

Figure 2: Tricky point mass. The continuous “tricky” task with a simple 2-D velocity-controlled point mass. (a) Environment and example trajectories. The green, red and yellow squares each provide a different reward. Lines show sampled trajectories (starting in the center) for the compositional task with different transfer methods. Only DC and CondQ (not shown) navigate to the yellow reward area, which is the optimal direction for the compositional task. (b) The returns for each transfer method. DC and CondQ recover significantly better performance than GPI, and the CO policy performs poorly. (c) The composed action-value at the center position for the transfer task. As both base policies prefer moving left and down, most of the energy is on these actions. However, the future divergence under these actions is high, which results in the DC estimate differing qualitatively from CO and preferring the upward trajectory. Additional details in supplementary figure 6.

We then examine conceptually similar tasks in more difficult domains: a 5 DOF planar manipulator reaching task (figure 3), a 3 DOF jumping ball and an 8 DOF ant (figure 4). We see that DC recovers a qualitatively better policy in all cases. The performance of GPI depends noticeably on the choice of $\alpha$. DC-Cheap, which is a simpler heuristic, performs almost as well as DC in the tasks we consider, except for the point mass task. When bounded by GPI (DC-Cheap+GPI) it performs well for the point mass task as well, suggesting simple approximations of $C_b$ may be sufficient in some cases (we provide videos of the more interesting tasks at https://tinyurl.com/yaplfwaq).

We focused on “tricky” tasks as they are a challenging form of transfer. In general, we would expect DC to perform well in most situations where CO performs well, since in this case the correction term that DC must learn is inconsequential (CO is equivalent to assuming $C_b = 0$). Supplementary figure 9 demonstrates that, on a task with non-composable solutions (i.e. $C_b$ is large and potentially challenging to learn), DC continues to perform as well as GPI and slightly better than CondQ, while, as expected, CO performs poorly.

Figure 3: “Tricky” task with planar manipulator. The “tricky” tasks with a 5D torque-controlled planar manipulator. The training tasks consist of (mutually exclusive) rewards when the finger is at the green and red targets respectively, and a reward at the blue target. (b) Finger position at the end of trajectories (starting from randomly sampled start states) for the transfer task, with circles indicating the rewards. DC and CondQ trajectories reach towards the blue target (the optimal solution) while CO and GPI trajectories primarily reach towards one of the suboptimal partial solutions. (c) Box plot of returns on the transfer tasks; DC has much more consistent performance. Additional details in supplementary figure 7.
Figure 4: “Tricky” task with mobile bodies. “Tricky” task with two bodies: a 3 DOF jumping ball (supplementary figure 8a) and (a) an 8 DOF ant (both torque controlled). The task has rewards in the green and red boxes respectively, and in the blue square. (b) Sampled trajectories of the ant on the transfer task, starting from a neutral position. GPI and DC consistently go to the blue square (optimal); CondQ and CO do not. Box plots of returns for the jumping ball (c) and ant (d) when started in the center position. The CO approach does not recover a good transfer policy for the compositional task while the other approaches largely succeed, although CondQ does not learn a good policy on the ant. Additional details in supplementary figure 8.

6 Discussion

We have presented two approaches to transfer learning via convex combinations of rewards in the maximum entropy framework: max-ent GPI and DC. We have shown that, under standard assumptions, the max-ent GPI policy performs at least as well as its component policies, and that DC recovers the optimal transfer policy. Todorov (2009); Saxe et al. (2017); van Niekerk et al. (2018) previously considered optimal composition of max-ent policies. However, these approaches require stronger assumptions about the class of MDPs. By contrast, DC does not restrict the class of MDPs and learns how compatible policies are, allowing approximate recovery of optimal transfer policies both when the component rewards are jointly achievable (AND), and when only one sub-goal can be achieved (OR).

We have compared our methods with conditional action-value functions (CondQ) (e.g. Schaul et al., 2015) and optimistic policy combination (Haarnoja et al., 2018a). Further, we have presented AISBP, a practical algorithm for training DC and max-ent GPI models in continuous action spaces using adaptive importance sampling. We have compared these approaches, along with heuristic approximations of DC, and demonstrated that DC recovers an approximately optimal policy during transfer across a variety of high-dimensional control tasks. Empirically we have found that CondQ may be harder to learn than DC, and it requires additional observation of $\boldsymbol{\phi}$ during training.

References

Appendix A Proofs

A.1 Max-Ent Generalized Policy Improvement

Proof of Theorem 3.1.

For brevity we denote . Define the soft Bellman operator associated with policy as

Haarnoja et al. (2018b) have pointed out that the soft Bellman operator corresponds to a conventional, “hard”, Bellman operator defined over the same MDP but with an entropy-augmented reward. Thus, as long as the rewards and entropies are bounded, the soft Bellman operator is a contraction with the corresponding soft action-value as its fixed point. Applying this operator we have:

Similarly, applying the corresponding soft Bellman operator, we obtain:

We now note that the Kullback-Leibler divergence between these policies can be written as

The quantity above, which is always nonnegative, will be useful in the subsequent derivations. Next we write

(20)

From (20) we have that

Using the contraction and monotonicity of the soft Bellman operator we have

We have just shown (9). In order to show (10), we note that

(21)

Similarly, we have, for all ,

(22)

The bound (10) follows from (21) and (22).

A.2 DC Proof

Proof of Theorem 3.2.

We follow a similar approach to Haarnoja et al. (2018a) but without making approximations and generalizing to all convex combinations.

First note that since $\pi^*_1$ and $\pi^*_2$ are max-ent optimal, they are Boltzmann policies in their respective optimal action-values.

For brevity we use $s, a$ and $s', a'$ notation rather than writing the time index.

Define

(23)
(24)

and consider soft Q-iteration on starting from . We prove, inductively, that at each iteration .

This is true by definition for .

(25)
(26)
(27)
(28)
(29)

Since soft Q-iteration converges to the max-ent optimal soft action-value, this completes the proof.

One can get an intuition for $C^{\infty}_b$ by noting that

(30)

where $D_b$ is the Rényi divergence of order $b$. $C^{\infty}_b$ can be seen as the discounted sum of divergences, weighted by the unnormalized product distribution $\pi_1^b \pi_2^{1-b}$.

A.3 n policies

It is possible to extend Theorem 3.2 to the case with $n$ policies in a straightforward way.

Theorem A.1 (Multi-policy DC Optimality)

Let $\pi^*_1, \ldots, \pi^*_n$ be max-ent optimal policies for tasks with rewards $r_1, \ldots, r_n$ with max-ent action-value functions $Q^*_1, \ldots, Q^*_n$.

Define $C^{\infty}_{\mathbf{w}}$ as the fixed point of