1 Introduction
In recent years, Reinforcement Learning (RL) [sutton1998reinforcement] techniques have been successfully applied to solve complex sequential decision-making problems under uncertainty [mnih2015human; hessel2017rainbow; haarnoja2017reinforcement; haarnoja2018soft]. However, agents trained on a single task typically perform poorly when faced with a modified task that has different dynamics or reward functions [taylor2009transfer]. A key remaining challenge for advancing the field towards general-purpose applications is to train agents that can generalize over tasks [plappert2018multi]. Generalization over tasks is also important in hierarchical reinforcement learning (HRL) settings [dayan1993feudal; sutton1999between; dietterich1998maxq], where data efficiency requires skills that can be reused across different unseen situations. However, a longstanding problem in HRL is how to obtain a general skill set and how to properly reuse that set in different situations.
In this paper, we focus on the problem of learning policies that generalize well under changes in both the dynamics and reward functions. We do so by formulating a novel Multi-Task RL (MTRL) problem from a variational inference (VI) perspective. Our formulation relies on two latent skill embeddings which hold information about the dynamics of the system and the goal. The skill embeddings are disentangled in that one can independently specify a change in the dynamics or the goal of the system and still obtain a well-performing policy. We call this method Disentangled Skill Embeddings (DSE). Having trained such policies using DSE, we can tackle HRL problems by allowing the agent to move in the previously learned space of skills.
We contribute: 1) a novel MTRL formulation for learning disentangled skill embeddings using a VI objective; 2) two MTRL algorithms, DSE-REINFORCE and DSE-SAC (Soft Actor-Critic), that can learn multi-dynamics and multi-goal policies and generalize to new tasks after fast retraining; and 3) a demonstration that one can learn higher-level policies over latent skills in an HRL scenario.
2 Related work
Most approaches in MTRL [taylor2009transfer; oh2017zero; henderson2017benchmark; nachum2018data; steindor; deisenroth2014multi] consider changes in dynamics or reward functions to result in different tasks. That is, two tasks with the same dynamics but different reward functions are considered to be different tasks. Here, we focus on exploiting known changes either in dynamics or goals, or both, for better generalization.
An approach that decouples dynamics and rewards is [devin2017learning]. In that work, modular neural networks that capture robot dynamics are combined with modules that capture the task goals. This allows robots to solve novel tasks by recombining task and robot modules. In [zhang2018decoupling], reward and dynamics are decoupled in a model-based framework based on successor features.
Closest to our work is [hausman2018learning], where policies trained in an MTRL scenario are equipped with a latent variable embedding that describes a particular task. However, the latent variable embedding contains entangled information about both the transition and the reward function. This can impede generalization, as the latent space might not be able to represent a policy for a task with an unseen combination of reward and dynamics. In contrast, our approach disentangles the latent spaces to overcome this issue. Another similar approach is [gupta2018meta], a meta-learning algorithm that optimizes the latent space for the policies to enable structured exploration across multiple time steps.
We derive our algorithm using a variational inference formulation for RL. Several previous works have described RL as an inference problem [kappen2005path; todorov2008general; levine2013variational], including its relation to entropy regularization [haarnoja2017reinforcement; haarnoja2018soft; grau2016planning; peters2010relative; ziebart2008maximum]. This formalism has recently attracted attention [haarnoja2017reinforcement; levine2018reinforcement; abdolmaleki2018maximum] because it provides a powerful and intuitive way to describe more complex agent architectures using the tools from graphical models.
The multitask algorithm Distral [teh2017distral] not only uses entropy regularization but also adds a relative-entropy penalty that encourages the policies to stay close to a shared compressed policy for transfer between tasks. The trained policies of this approach are not parameterized by any variables that identify the task at hand. This is important, since it means that policies trained with Distral cannot generalize beyond their training tasks.
3 Background and Notation
We consider a set $\mathcal{M}$ of Markov decision processes (MDPs) $M_{d,g}$, each defined as the tuple $(\mathcal{S}, \mathcal{A}, \gamma, P_d, r_g)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $\gamma$ is the discount factor, $P_d$ denotes the $d$'th state transition function that fully specifies the dynamics of the system, and $r_g$ denotes the reward function that quantifies the agent's performance and fully specifies the goal of the system.
In the multitask problem, the agent must provide an optimal policy for each of the possible tasks. Obtaining all solutions independently maximizes performance on each individual task, but does not transfer information between the tasks. Solving all tasks with a single policy maximizes transfer, since all parameters of the policy are shared, but, importantly, the final solution only maximizes the average reward across tasks. Thus, a mixture of shared and task-specific parameters is ideal.
In the following we derive a variational multitask reinforcement learning formulation where the policy has some parameters that are shared across all tasks, and two latent embeddings (one dynamics-specific and one reward/goal-specific) that serve as task-specific parameters.
4 Disentangled Skill Embeddings
We learn flexible skills that are reusable across different dynamics and goals by learning two latent spaces, $\mathcal{Z}_d$ (dynamics) and $\mathcal{Z}_g$ (goal). We achieve this by gathering data from the set of MDPs $\mathcal{M}$, indexed by the dynamics condition $d$ and goal condition $g$ for all $d$ and $g$, and then learning the conditional distributions $q(z_d \mid d)$ and $q(z_g \mid g)$ for each $d$ and $g$. The latent variables are inputs to the policy, serving as behaviour modulators. Importantly, once the latent spaces are fully learnt, one can directly use the policy equipped with the skill embeddings without knowledge of the task indices.
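As a rough illustration of how two embeddings can modulate one shared policy, the sketch below draws a dynamics latent and a goal latent from per-index diagonal Gaussians and feeds them, together with the state, to a single linear-softmax policy. The function names (`sample_latent`, `policy_probs`), the two-dimensional latents, the embedding parameters, and the linear policy are all assumptions made for illustration; they are not the architecture used in the paper.

```python
import math
import random

def sample_latent(mean, std, rng):
    """Draw z ~ N(mean, diag(std^2)), one coordinate at a time."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]

def policy_probs(state, z_d, z_g, weights):
    """Softmax policy over discrete actions; the input is [state, z_d, z_g]."""
    x = list(state) + list(z_d) + list(z_g)
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
# Hypothetical embedding parameters: one (mean, std) pair per dynamics index d
# and one per goal index g.
q_d = {0: ([0.0, 1.0], [0.1, 0.1]), 1: ([1.0, 0.0], [0.1, 0.1])}
q_g = {0: ([-1.0, 0.0], [0.1, 0.1]), 1: ([0.0, -1.0], [0.1, 0.1])}

state = [0.0, 0.1, -0.2, 0.0]            # e.g. a 4-dimensional Cartpole state
n_inputs = len(state) + 2 + 2
weights = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(2)]

z_d = sample_latent(*q_d[1], rng)        # dynamics context d = 1
z_g = sample_latent(*q_g[0], rng)        # goal context g = 0
probs = policy_probs(state, z_d, z_g, weights)
```

Because the policy only sees the latents, any point in the two latent spaces can be supplied at test time, which is what makes the policy usable without task indices.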
We now derive a variational inference (VI) formulation for multitask RL that allows learning both latent spaces and the policy. As in [haarnoja2017reinforcement; kappen2005path; todorov2008general; levine2013variational; levine2018reinforcement; abdolmaleki2018maximum], we start by introducing a random variable that denotes whether the trajectory is optimal () or not (). Note that this includes the dynamics index and the goal index as well as and for all timesteps . The likelihood of an optimal trajectory is defined as . We denote the posterior trajectory probability assuming optimality as . Treating as a latent variable with prior probability , we specify the log-evidence as .
We now introduce a variational distribution on trajectories which, combined with Jensen’s inequality, provides the Evidence Lower Bound (ELBO) . In practice, we maximize the ELBO and use as an approximate posterior. The generative model is and the variational distribution is . We stress that the only difference between these is in the conditional factors involving the latent variables and . The MTRL problem can now be stated as a maximization of the ELBO w.r.t. and :
(1) 
where we added the scalars and to weight each information term (see Appendix A.1 for a mathematical justification) and we set the problem to have infinite horizon () with discount factor . The first two information terms measure how far the variational distributions and are from the specified priors that we assume fixed and equal for all conditions. Similarly, the last information term measures how far the variational policy is from the prior policy. Note that by setting we can eliminate this restriction. Furthermore, by setting to be an improper uniform prior we recover a formulation with entropy regularization in the policy.
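For reference, the generic Jensen step behind a bound of this kind can be written as follows. The symbols $\mathcal{O}$, $\tau$, $p$ and $q$ are assumed notation for the optimality variable, the trajectory, the generative model and the variational distribution; the weighted, discounted objective of Equation (1) is a refinement of this bound and is not reproduced here.

```latex
\begin{align}
\log p(\mathcal{O}=1)
  &= \log \int p(\mathcal{O}=1 \mid \tau)\, p(\tau)\, d\tau
   = \log \mathbb{E}_{q(\tau)}\!\left[ \frac{p(\mathcal{O}=1 \mid \tau)\, p(\tau)}{q(\tau)} \right] \\
  &\geq \mathbb{E}_{q(\tau)}\!\left[ \log p(\mathcal{O}=1 \mid \tau) \right]
   - \mathrm{KL}\!\left( q(\tau) \,\|\, p(\tau) \right).
\end{align}
```

When $p$ and $q$ share the same transition factors, those factors cancel inside the KL term, leaving only the factors in which the two distributions differ.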
This trajectory-based formulation of the problem is sufficient to derive a novel REINFORCE-type algorithm equipped with DSE. However, our work also provides a derivation of a multitask SAC algorithm with DSE (DSE-SAC) that requires a full specification of the recursive properties of value functions and the optimal solutions for the policy and embeddings. We describe those properties in the next section.
4.1 Recursions, Optimal Policies and Optimal Embeddings
Crucial to the construction of DSE-SAC is a recursive property that we can exploit for value bootstrapping. A task-indexed value function can be defined by taking the expectation in Equation (1) over all random variables except and . The Q-function is then defined as . We provide a lemma for the value recursion.
Lemma 1 (Index- and state-dependent Value Function Recursion).
The index-dependent Value function satisfies the following recursive property.
(2) 
The proof of the previous and subsequent lemmas can be found in Appendix A.2 and A.4.
DSE-SAC also requires analytic solutions for the policy and embeddings. The optimal policy can be obtained by computing the functional derivative of a Lagrangian (see Appendix A.3) of the variational problem w.r.t. the policy and equating the result to zero.
Lemma 2 (Optimal policy with DSE).
Let the variational distributions and be fixed. Then, the optimal policy is
where is the normalizing function and with and are the Bayesian posteriors over and .
Note that the Q-values are computed by using both Bayesian posterior distributions over the task indices, i.e., and . Intuitively, in the extreme case where a and can completely specify the task at hand with certainty (i.e., the Bayesian posterior is peaked), the optimal policy selects the correct Q-function for this task, whereas for non-extreme cases a mixture is computed.
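The two regimes described above (peaked posterior versus mixture) can be sketched numerically for a discrete action space. The snippet assumes one-dimensional Gaussian embeddings, a uniform prior over task indices, an improper uniform prior policy, independent posteriors for the two indices, and a temperature `alpha`; all names and numbers are illustrative and are not taken from the paper.

```python
import math

def gauss_pdf(z, mean, std):
    return math.exp(-0.5 * ((z - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def posterior(z, params):
    """Bayesian posterior over indices given a latent z, assuming a uniform
    prior over indices, so that p(i | z) is proportional to q(z | i)."""
    lik = [gauss_pdf(z, m, s) for m, s in params]
    total = sum(lik)
    return [l / total for l in lik]

def mixture_policy(q_values, post_d, post_g, alpha):
    """Boltzmann policy on posterior-weighted Q-values.
    q_values[d][g][a] is the Q-value of action a in task (d, g) at a fixed state."""
    n_actions = len(q_values[0][0])
    q_bar = [sum(post_d[d] * post_g[g] * q_values[d][g][a]
                 for d in range(len(post_d)) for g in range(len(post_g)))
             for a in range(n_actions)]
    m = max(q_bar)
    exps = [math.exp((q - m) / alpha) for q in q_bar]
    norm = sum(exps)
    return [e / norm for e in exps]

emb_d = [(-2.0, 0.3), (2.0, 0.3)]   # (mean, std) of the dynamics embedding for d = 0, 1
emb_g = [(-2.0, 0.3), (2.0, 0.3)]   # (mean, std) of the goal embedding for g = 0, 1
q_values = [[[1.0, 0.0], [0.0, 1.0]],
            [[0.0, 1.0], [1.0, 0.0]]]

# Latents at the means of d = 0 and g = 1: both posteriors are peaked, so the
# policy effectively uses the Q-function of task (0, 1), which prefers action 1.
peaked = mixture_policy(q_values, posterior(-2.0, emb_d), posterior(2.0, emb_g), alpha=0.1)
# Latents halfway between the means: both posteriors are uniform and the
# Q-values are averaged across all four tasks.
mixed = mixture_policy(q_values, posterior(0.0, emb_d), posterior(0.0, emb_g), alpha=0.1)
```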
Employing the same procedure as before, we write the optimal variational distributions as follows:
Lemma 3 (Optimal Embeddings).
Assuming fixed , the optimal variational distributions are
(3) 
where and are conceptually similar to Value functions but depend on , and , , respectively.
5 DSE Algorithms
This section focuses on describing two practical algorithms using disentangled embeddings. DSE-REINFORCE is updated on-policy and requires full trajectories from the different tasks. Although it is easier to implement, REINFORCE-type algorithms are known to suffer from high variance in the gradient estimates, which slows down training. The second algorithm, DSE-SAC, is inspired by the SAC algorithm [haarnoja2018soft] and is a more data-efficient off-policy algorithm that directly uses the transitions of all the tasks sampled from a replay memory. It achieves this by estimating the Value functions and Q-functions.
Common to both algorithms is the parametrization of the policy with parameters representing those of a neural network. Additionally, we consider pure entropy regularization by setting the prior policy to an improper uniform distribution, which we can therefore ignore. Similarly, we consider the embeddings and to have parameters (abusing notation) and , respectively. With this notation in place we proceed to describe DSE-REINFORCE.
5.1 DSE-REINFORCE
DSE-REINFORCE first samples trajectories from each combination of dynamics and goal contexts , where . Then, these are used to update the shared parameters and the parameters and of the variational distributions. For the updates of the variational distributions we use the reparametrization trick [kingma2013auto] and make the assumption that the variational parameters and contain a set of specific parameters and for each dynamics and goal context. We further assume that the latent variables are multivariate Gaussians with diagonal covariance matrix (this assumption can easily be relaxed). Therefore, the latent variables are expressed as and where and are the noise terms; and are the mean vectors; and and are the diagonal vectors of the covariance matrix.
The maximization in (1) can be written using Monte Carlo estimators as with
where we have used the following definition of the regularized discounted future returns
Moreover, we have separated the KL terms of the variational distributions out of the summation over , as they are independent of and can be computed in closed form due to the Gaussian assumption. Consequently, we added the corresponding sum of discounts by computing the geometric sum . Note that we implicitly redefined the trajectories so that they contain the noise realizations instead of the latent variables. Algorithmic details can be found in Appendix B.
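The three ingredients mentioned above (reparametrized sampling, a closed-form Gaussian KL term, and the geometric sum of discounts) can be sketched as follows. This is a minimal illustration under assumptions: the prior is taken to be a standard normal, `std` holds standard deviations rather than variances, and the helper names are hypothetical.

```python
import math
import random

def reparam_sample(mean, std, rng):
    """z = mean + std * eps with eps ~ N(0, I); returns (z, eps)."""
    eps = [rng.gauss(0.0, 1.0) for _ in mean]
    z = [m + s * e for m, s, e in zip(mean, std, eps)]
    return z, eps

def kl_diag_gaussian_to_standard_normal(mean, std):
    """Closed-form KL( N(mean, diag(std^2)) || N(0, I) )."""
    return 0.5 * sum(s * s + m * m - 1.0 - 2.0 * math.log(s)
                     for m, s in zip(mean, std))

def discount_sum(gamma):
    """Geometric series sum_{t >= 0} gamma^t = 1 / (1 - gamma), for 0 <= gamma < 1."""
    return 1.0 / (1.0 - gamma)

rng = random.Random(0)
mean, std = [0.5, -0.5], [1.0, 2.0]
z, eps = reparam_sample(mean, std, rng)
kl = kl_diag_gaussian_to_standard_normal(mean, std)
# Once the KL term is pulled out of the sum over time steps, it is multiplied
# by the total discount mass instead of being re-evaluated at every step.
kl_weight = discount_sum(0.99)
```

Storing `eps` rather than `z` in the trajectory is what lets gradients flow to `mean` and `std` when the objective is differentiated.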
Adaptive normalization using PopArt:
In our preliminary experiments we observed that DSE-REINFORCE was selectively solving some tasks but not others. For this reason we use the adaptive rescaling method PopArt [van2016learning; hessel2018multi] to normalize the discounted rewards to have zero mean and unit variance before each training iteration, so that all tasks affect the gradient equally.
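The effect of such a normalization can be illustrated with a simplified stand-in: the sketch below computes discounted returns and standardizes them per task within a batch. It is not the PopArt algorithm itself, which tracks running statistics adaptively and rescales the network outputs accordingly; the task names and reward values are hypothetical.

```python
import math

def discounted_returns(rewards, gamma):
    """Return-to-go G_t = r_t + gamma * G_{t+1}, computed backwards in time."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]

def standardize(values, eps=1e-8):
    """Shift and scale the values to zero mean and (approximately) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

# Two hypothetical tasks whose rewards differ by two orders of magnitude.
task_rewards = {"task_a": [1.0, 1.0, 1.0], "task_b": [100.0, 100.0, 100.0]}
raw = {name: discounted_returns(r, gamma=0.9) for name, r in task_rewards.items()}
normalized = {name: standardize(g) for name, g in raw.items()}
```

After standardization the two tasks contribute return signals of the same scale, so neither dominates the policy gradient merely because its rewards are larger.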
5.2 DSE-SAC
DSE-SAC collects transitions from each dynamics and reward context and stores them in separate replay memories . Then, samples from the replay memories are used to estimate the Q-functions , the value functions , the variational distributions and and the policy . Subscripts denote the symbols for the parameters of the neural networks used as function approximators.
The Q-functions are learned by optimizing the loss , where the target is a one-sample estimate obtained with real experience and denotes the parameters of a target network which is updated at every training iteration as for some .
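A common way to implement such a target network, and the one used in SAC, is an exponential moving average of the online parameters. The sketch below assumes that form, together with a squared-error Q loss against a one-sample target; the function names, the mixing rate `tau`, and the flat parameter lists are illustrative assumptions rather than details confirmed by the text.

```python
def soft_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau towards its online counterpart."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]

def q_loss(q_pred, reward, gamma, next_value):
    """Squared error against the one-sample target r + gamma * V_target(s')."""
    target = reward + gamma * next_value
    return 0.5 * (q_pred - target) ** 2

target = [0.0, 0.0]
online = [1.0, -2.0]
for _ in range(3):                       # three training iterations
    target = soft_update(target, online, tau=0.5)
loss = q_loss(q_pred=1.0, reward=0.5, gamma=0.9, next_value=1.0)
```

A small `tau` makes the bootstrapping target change slowly, which is the usual motivation for keeping a separate target network.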
The value functions are learned by minimizing , where the target exploits the value recursion in Equation (2):
In order to reduce overestimation of Q-values, we follow [haarnoja2018soft], where the minimum of two Q-function approximators is used, i.e., , which have different sets of parameters and initialization but are trained using the same loss .
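The minimum-of-two-critics idea can be sketched for a single sampled action. The target below follows the standard SAC form (minimum Q-value minus an entropy penalty) and deliberately omits any additional regularization terms that the value recursion of this paper may contain; `alpha` and the numeric values are assumptions.

```python
import math

def soft_value_target(q1, q2, log_prob, alpha):
    """min(Q1, Q2) minus the entropy penalty alpha * log pi(a | s)."""
    return min(q1, q2) - alpha * log_prob

log_prob = math.log(0.25)                # log-probability of the sampled action
target = soft_value_target(q1=1.3, q2=0.9, log_prob=log_prob, alpha=0.2)
```

Taking the minimum biases the target downwards, which counteracts the optimistic bias that arises when a single noisy critic is both maximized and bootstrapped from.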
The parameters can be learned by minimizing the following expected KL-divergence between the parametric policy and the optimal policy from Lemma 2 (with an improper prior )
The policy loss is written as , where the Q-function is estimated with a single sample, i.e., given that and . The normalizing function can be safely ignored as in [haarnoja2018soft].
Following a similar rationale as before, the variational parameters for each context () can be learned by minimizing the following KL-divergences
which translates into
(4) 
where, for clarity, we define with . All remaining algorithmic details are in the Appendix.
6 Experiments
Figure 1: Multitask Cartpole. The colors correspond to different goal contexts. (a) shows the average evolution of the total rewards across 4 random seeds, separated by goal; shaded regions are the standard error. For clarity, we averaged over dynamics (full plots in Appendix C.1). (b) and (c) show the latent spaces and respectively after the training. The distributions shown are standard deviation. (d) and (e) are results from a generalization experiment carried out on (see Sec 6.1 for description).
Here we empirically validate our algorithms and show the applicability of the trained policies equipped with DSE on both multitask and hierarchical RL problems. DSE-REINFORCE is tested on a discrete action-space problem (Cartpole) and DSE-SAC on a continuous action-space problem (Reacher, from the MuJoCo dynamics simulation software). On multitask problems we show the benefit of disentanglement when compared to three baselines: a single-embedding algorithm similar to [hausman2018learning], Distral [teh2017distral], and independent learners. Hyperparameter values are shown in Appendix B.3.
6.1 DSEREINFORCE on Cartpole
We extended the Cartpole environment provided in the OpenAI Gym library (https://gym.openai.com/envs/CartPole-v1/) by modifying the reward function to reflect the need to balance at different locations: left (), middle () and right (). Additionally, we allowed for three different dynamics conditions by changing the mass of the cart . Simulations were run for time steps.
Figure 1(a) shows that DSE-REINFORCE solves all nine tasks simultaneously at approximately the same rate, exceeding the performance of two baselines, Distral [teh2017distral] and independently trained algorithms (no multitask; trained with REINFORCE), and performing similarly to the single-embedding case. Importantly, we find that DSE-REINFORCE produces a policy that generalizes better than the baselines, as we show in the next section. In Figure 1(b) we observe the variational distributions learned for embeddings of the different dynamics contexts (in grey) and in (c) for the different reward contexts (in red, orange and blue). These have separated to represent the different tasks in the latent space. The variational distributions shown in dark red in (b) and green in (c) are the result of learning (with identical priors on and conditioned on the trained shared parameters) in a new unseen condition () successfully solving the task. This shows that the latent spaces are able to interpolate well. In Figure 1(d) we show the mean of the variational distributions for the goal contexts in color and, in grey, latent vectors that we used to test whether the learned policy is able to generalize to unseen goals. We show in panel (e) the x-location of the tip of the pole. As can be seen through the grey conditions, the policy is able to generalize to new locations in an ordered fashion (along the x-axis).
Retraining and generalization of DSE-REINFORCE on Cartpole
In this section we test generalization when dynamics or goal conditions are missing from the task matrix. We consider the case of training on off-diagonal tasks () and testing on the diagonal; and the case of training on only tasks () (see Figure 2). The testing phase is executed on each test task by initializing the variational distributions with matching indices and retraining both the variational and shared parameters.
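The off-diagonal split can be made concrete with a short sketch. It assumes a 3 x 3 task matrix (three dynamics conditions and three goals, as in the Cartpole setup) and integer task indices; the function name is hypothetical.

```python
from itertools import product

def split_tasks(n_dynamics, n_goals):
    """Train on off-diagonal (d, g) pairs and test on the diagonal."""
    all_tasks = list(product(range(n_dynamics), range(n_goals)))
    test = [(d, g) for d, g in all_tasks if d == g]
    train = [(d, g) for d, g in all_tasks if d != g]
    return train, test

train_tasks, test_tasks = split_tasks(3, 3)
# Every test task shares its dynamics index with some training task and its
# goal index with another, so both embeddings can be initialized from
# distributions that were already learned for those indices.
seen_d = {d for d, _ in train_tasks}
seen_g = {g for _, g in train_tasks}
```

This is the situation in which disentanglement helps: a test pair such as (1, 1) was never trained on, yet dynamics 1 and goal 1 were each seen in other combinations.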
In Figure 2, we show on the leftmost panels the multitask training for both the (6-3) and (4-5) settings and on the remaining panels the performance of the testing phase. We compare our DSE-REINFORCE (dse0) against the single-embedding algorithm from the previous section. We can clearly see the benefits of disentangling the dynamics from the reward; the DSE algorithm provides strong initializations for tasks never seen before that are not mere “interpolation” tasks as tested in the previous section. Note that independent single-task training would need about 10000 trajectories to train, whereas DSE sometimes solves the test task instantaneously (without accounting for the multitask training).
HRL on Cartpole
We test the validity of the trained policies equipped with DSE in an HRL scenario by training a high-level policy that acts on the reward latent space. For this, we developed a novel Cartpole problem (AsteroidCartpole), where a balanced cartpole must avoid falling asteroids; this is detailed in the Appendix. We fixed the mass to ; the latent variable for was fixed to the mean of . The high-level policy acted on a discrete action space consisting of 5 selected points of the reward latent space; three were the means of the learned variational distributions for the goal contexts and the remaining two were interpolations . Figure 3(a) shows the evolution of the episodic rewards while training the high-level policy (HRL) with standard REINFORCE equipped with a baseline and with PopArt. As a comparison, we also trained the same REINFORCE algorithm but acting directly on the low-level actions. As seen, the hierarchical policy outperforms the baseline and attains maximum reward.
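A discrete high-level action set of this kind could be built as in the sketch below. The interpolation rule is not recoverable from the text, so midpoints between neighbouring means are an assumption, as are the two-dimensional latents, the numeric means, and the helper names.

```python
def midpoint(u, v):
    return [(a + b) / 2.0 for a, b in zip(u, v)]

def build_skill_set(goal_means):
    """Interleave the learned means with midpoints between neighbouring means."""
    skills = []
    for i, mean in enumerate(goal_means):
        skills.append(list(mean))
        if i + 1 < len(goal_means):
            skills.append(midpoint(mean, goal_means[i + 1]))
    return skills

# Hypothetical means of the goal embeddings for the left, middle and right goals.
goal_means = [[-1.0, 0.0], [0.0, 0.5], [1.0, 0.0]]
skills = build_skill_set(goal_means)     # 3 means + 2 interpolations = 5 actions

def low_level_input(state, z_d_mean, high_level_action):
    """The high-level policy picks an index; the frozen low-level policy receives
    the state, the fixed dynamics latent, and the selected goal latent."""
    return list(state) + list(z_d_mean) + skills[high_level_action]
```

The high-level policy then only has to choose among five indices, while the pretrained low-level policy turns the selected latent into motor commands.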
6.2 DSESAC on Reacher
The original Reacher environment consists of moving the tip of a robotic arm to a random location; its state space includes the position of the goal. We modified this environment by removing the goal position information and instead learning an embedding for it. This is a considerably more difficult task. Further, we modified it to vary the dynamics and reward functions by changing the arm lengths and goal position. We chose different goal locations and different arm lengths.
Figure 4 shows the results of our experiments with DSE-SAC on our multitask Reacher problem. We compared these results with single-task independent learners and single-embedding SAC. As we see, single-embedding SAC and DSE-SAC have comparable performance and both exceed the single-task learner. We also carried out experiments comparing DSE-SAC with DSE-REINFORCE (Appendix C.5), where DSE-SAC outperforms DSE-REINFORCE in Reacher by a large margin. Further, we carried out “interpolation” experiments similar to the previous Cartpole experiments (Appendix C.5), showing generalization capabilities in Reacher.
Retraining and generalization of DSE-SAC on multitask Reacher
Similar to the Cartpole scenario, we test generalization of DSE-SAC when training with missing tasks on the conditions () and (). Testing is performed on unseen test tasks by initializing the variational distributions by matching index. Learning curves of the multitask policy are shown in Figure 5. Table 3 of the Appendix shows that the initial performance in test tasks is on average better for DSE-SAC than for the single-embedding SAC algorithm, and that DSE-SAC performs well in terms of the number of trajectories that a single-task learner needs to reach the same performance. DSE-SAC obtained episodic reward, while the single embedding obtained episodic reward. It also takes the single-task policy number trajectories to reach the performance of DSE-SAC.
HRL on Reacher
We tested the policy trained with DSE-SAC in an HRL scenario. In this case, we continuously moved the goal location in a circle passing through locations that the multitask policy has never seen. Using standard single-task Soft Actor-Critic, we trained a high-level policy (HSAC) that acts on the latent space and uses the pretrained multitask policy as the low-level policy. This policy is compared against the baseline of SAC trained directly on low-level actions. Figure 3(b) shows the performances of HSAC and SAC in purple and green respectively. We see that HSAC solves the task faster than standard SAC.
7 Conclusions
We have developed a multitask framework from a variational inference perspective that is able to learn latent spaces that generalize to unseen tasks where the dynamics and reward can change independently. In particular, the disentangling allows for better generalization and faster retraining in new tasks. We have shown that the policies learned with our two algorithms, DSE-REINFORCE and DSE-SAC, can be used successfully in HRL scenarios.
A promising future direction for DSE-SAC could be to learn Q-functions and Value functions that do not depend on the task index but directly depend on the latent variables. This would allow for the training of a single Q-function and Value function instead of one per goal and dynamics condition.
References
 [1] R. Sutton and A. Barto, Reinforcement learning. MIT Press, Cambridge, 1998.
 [2] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
 [3] M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver, “Rainbow: Combining improvements in deep reinforcement learning,” arXiv preprint arXiv:1710.02298, 2017.

 [4] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine, “Reinforcement learning with deep energy-based policies,” in International Conference on Machine Learning, pp. 1352–1361, 2017.
 [5] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” arXiv preprint arXiv:1801.01290, 2018.
 [6] M. E. Taylor and P. Stone, “Transfer learning for reinforcement learning domains: A survey,” Journal of Machine Learning Research, vol. 10, no. Jul, pp. 1633–1685, 2009.
 [7] M. Plappert, M. Andrychowicz, A. Ray, B. McGrew, B. Baker, G. Powell, J. Schneider, J. Tobin, M. Chociej, P. Welinder, et al., “Multi-goal reinforcement learning: Challenging robotics environments and request for research,” arXiv preprint arXiv:1802.09464, 2018.
 [8] P. Dayan and G. E. Hinton, “Feudal reinforcement learning,” in Advances in neural information processing systems, pp. 271–278, 1993.
 [9] R. S. Sutton, D. Precup, and S. Singh, “Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning,” Artificial intelligence, vol. 112, no. 1-2, pp. 181–211, 1999.
 [10] T. G. Dietterich, “The MAXQ method for hierarchical reinforcement learning,” in Proceedings of the Fifteenth International Conference on Machine Learning, pp. 118–126, Morgan Kaufmann Publishers Inc., 1998.
 [11] J. Oh, S. Singh, H. Lee, and P. Kohli, “Zero-shot task generalization with multi-task deep reinforcement learning,” in International Conference on Machine Learning, pp. 2661–2670, 2017.
 [12] P. Henderson, W.-D. Chang, F. Shkurti, J. Hansen, D. Meger, and G. Dudek, “Benchmark environments for multitask learning in continuous domains,” arXiv preprint arXiv:1708.04352, 2017.
 [13] O. Nachum, S. Gu, H. Lee, and S. Levine, “Data-efficient hierarchical reinforcement learning,” arXiv preprint arXiv:1805.08296, 2018.
 [14] S. Sæmundsson, K. Hofmann, and M. P. Deisenroth, “Meta reinforcement learning with latent variable gaussian processes,” May 2018.
 [15] M. P. Deisenroth, P. Englert, J. Peters, and D. Fox, “Multitask policy search for robotics,” in 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 3876–3881, IEEE, 2014.
 [16] C. Devin, A. Gupta, T. Darrell, P. Abbeel, and S. Levine, “Learning modular neural network policies for multitask and multirobot transfer,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on, pp. 2169–2176, IEEE, 2017.
 [17] A. Zhang, H. Satija, and J. Pineau, “Decoupling dynamics and reward for transfer learning,” 2018.
 [18] K. Hausman, J. T. Springenberg, Z. Wang, N. Heess, and M. Riedmiller, “Learning an embedding space for transferable robot skills,” in International Conference on Learning Representations, 2018.
 [19] A. Gupta, R. Mendonca, Y. Liu, P. Abbeel, and S. Levine, “Meta-reinforcement learning of structured exploration strategies,” arXiv preprint arXiv:1802.07245, 2018.
 [20] H. J. Kappen, “Path integrals and symmetry breaking for optimal control theory,” Journal of statistical mechanics: theory and experiment, vol. 2005, no. 11, p. P11011, 2005.
 [21] E. Todorov, “General duality between optimal control and estimation,” in Decision and Control, 2008. CDC 2008. 47th IEEE Conference on, pp. 4286–4292, IEEE, 2008.
 [22] S. Levine and V. Koltun, “Variational policy search via trajectory optimization,” in Advances in Neural Information Processing Systems, pp. 207–215, 2013.
 [23] J. Grau-Moya, F. Leibfried, T. Genewein, and D. A. Braun, “Planning with information-processing constraints and model uncertainty in Markov decision processes,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 475–491, Springer, 2016.
 [24] J. Peters, K. Mülling, and Y. Altun, “Relative entropy policy search.,” in AAAI, pp. 1607–1612, Atlanta, 2010.
 [25] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey, “Maximum entropy inverse reinforcement learning.,” in AAAI, vol. 8, pp. 1433–1438, Chicago, IL, USA, 2008.
 [26] S. Levine, “Reinforcement learning and control as probabilistic inference: Tutorial and review,” arXiv preprint arXiv:1805.00909, 2018.
 [27] A. Abdolmaleki, J. T. Springenberg, Y. Tassa, R. Munos, N. Heess, and M. Riedmiller, “Maximum a posteriori policy optimisation,” arXiv preprint arXiv:1806.06920, 2018.
 [28] Y. Teh, V. Bapst, W. M. Czarnecki, J. Quan, J. Kirkpatrick, R. Hadsell, N. Heess, and R. Pascanu, “Distral: Robust multitask reinforcement learning,” in Advances in Neural Information Processing Systems, pp. 4496–4506, 2017.
 [29] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” arXiv preprint arXiv:1312.6114, 2013.
 [30] H. P. van Hasselt, A. Guez, M. Hessel, V. Mnih, and D. Silver, “Learning values across many orders of magnitude,” in Advances in Neural Information Processing Systems, pp. 4287–4295, 2016.
 [31] M. Hessel, H. Soyer, L. Espeholt, W. Czarnecki, S. Schmitt, and H. van Hasselt, “Multitask deep reinforcement learning with popart,” arXiv preprint arXiv:1809.04474, 2018.
Appendix A Proofs
A.1 Information term weights justification
We can easily weigh each information term with by assuming
and
More concretely, this gives
where we have eliminated the constant terms since they do not affect the solution of the optimization problem. Therefore, to unclutter the notation, we override the definition of the variational distributions by and .
A.2 Proof of Lemma 1
Lemma 4 (Index- and state-dependent Value Function Recursion).
The index-dependent Value function satisfies the following recursive property.
Proof.
We start by stating again the definition of the value function:
Then we take out the terms with inside the summation and write explicitly the expectation, i.e.,
We see now that the inner expectation term is in fact . Therefore, changing the subindices to , , , and we proved the lemma. ∎
A.3 Lagrangian for DSE-SAC Optimal Policy
Definition 1 (DSE Lagrangian).
Let be an arbitrary distribution over states. Then the Lagrangian is defined as
where , , are the Lagrange multipliers ensuring that the policy and variational distributions are properly normalized.
A.4 Proof of Lemma 2
Lemma 5 (Optimal policy with DSE).
Let the variational distributions and be fixed. Then, the optimal policy is
where is the normalizing function and with and are the Bayesian posteriors over the task indices.
Proof.
We take the functional derivative of the Lagrangian with respect to where the star denotes a particular element resulting in
Next, equating the previous equation to zero and using the following equalities , and we obtain
(5) 
Rearranging the terms we have
(6) 
Finally, using the fact that we can obtain the value of the Lagrange multiplier . Then, we obtain the desired policy
(7) 
∎
A.5 Proof of Lemma 3
Lemma 6 (Optimal Embeddings).
Assuming a fixed policy the optimal variational distributions are given by
where and are conceptually similar to Value functions but depend on the latent variables and task indices. More formally,