1 Introduction
Current reinforcement learning (RL) algorithms have achieved impressive results across a broad range of games and continuous control platforms. While effective, such algorithms all too often require millions of environment interactions to learn, requiring access to large compute as well as simulators or large amounts of demonstrations. This stands in stark contrast to the efficiency of biological learning systems [24], as well as the need for data-efficiency in real-world systems, e.g. in robotics, where environment interactions can be expensive and risky. In recent years, data-efficient RL has thus become a key area of research and stands as one of the bottlenecks for RL to be applied in the real world [8]. Research in the area is multifaceted and encompasses multiple overlapping directions. Recent developments in off-policy and model-based RL have dramatically improved the stability and data-efficiency of RL algorithms which learn tabula rasa [e.g. 1, 18]. A rapidly growing body of literature, under broad headings such as transfer learning, meta-learning, or hierarchical RL, aims to speed up learning by reusing knowledge acquired in previous instances of similar learning problems. Transfer learning typically follows a two-step procedure: a system is first pretrained on one or multiple training tasks, then a second step adapts the system on a downstream task. While transfer learning approaches allow significant flexibility in system design, the two-step process is often criticised for being suboptimal. In contrast, meta-learning incorporates adaptation into the learning process itself. In gradient-based approaches, systems are explicitly trained such that they perform well on a downstream task after a few gradient descent steps [11]. Alternatively, in encoder-based approaches a mapping is learned from data collected in a downstream task to a task representation [e.g. 7, 36, 32, 38, 29, 22]. Because meta-learning approaches optimize the adaptation process directly, they are expected to adapt faster to downstream tasks than transfer learning approaches. But performing this optimization can be algorithmically or computationally challenging, making it difficult to scale to complex and broader task distributions, especially since many approaches simultaneously solve not just the meta-learning problem but also a challenging multitask learning problem.
Given the limitations of meta-learning, a number of recent works have raised the question of whether transfer learning methods, potentially combined with data-efficient off-policy algorithms, are sufficient to achieve effective generalization as well as rapid adaptation to new tasks. For example, in the context of supervised meta-learning, Raghu et al. [31] showed that learning good features and fine-tuning during adaptation led to results competitive with MAML. In reinforcement learning, Fakoor et al. [10] showed that direct application of TD3 [14] to maximize a multitask objective, along with a recurrent context and smart reuse of training data, was sufficient to match the performance of state-of-the-art meta-learning methods on current benchmarks.
In this paper, we take a similar perspective and try to understand the extent to which fast adaptation can be achieved using a simple transfer framework, with the generality of gradient-based adaptation. Central to our approach is the behaviour prior recovered by multitask KL-regularized objectives [34, 15]. We improve transfer performance by leveraging this prior in two important ways: first, as a regularizer which helps with exploration and restricts the space of solutions that need to be considered, and second, as a proposal distribution for importance weighting, where the weights are learnt and given by the exponentiated Q-function. This avoids the need to learn an explicit parametric policy for the transfer task; instead, the policy is obtained directly by tilting the prior with the learned, exponentiated action-value function. To further speed up adaptation and avoid learning this Q-function de novo, we make use of a particular parameterization of the action-value functions obtained during multitask training: the Q-values are parameterized to be linear in some shared underlying feature space. Intuitively, this shared feature representation captures the commonalities in terms of both reward and transition dynamics. In practice, we found this value function representation together with the behaviour prior to generalize well to transfer tasks, drastically speeding up the adaptation process. We show that across continuous control environments ranging from standard meta-RL benchmarks to more challenging environments with higher-dimensional action spaces and sparse rewards, our method can match or outperform recent meta-learning approaches, echoing recent observations in [10].
Our paper is structured as follows. Section 2 provides the necessary background material and characterizes the multitask reinforcement learning problem. Our method, based on importance weighting, is presented in Section 3 while Section 4 shows how our training algorithm can be adapted to improve transfer learning performance. Relevant work is discussed in Section 5 with experimental results presented in Section 6.
2 Background
We consider a multitask reinforcement learning setup, where we denote a probability distribution over tasks as
. Each task is a Markov Decision Process (MDP), i.e. a tuple
described by (respectively) the transition probability, initial state distribution, reward function, action and state spaces, where
and are identical across tasks. Furthermore, we assume that we are given finite i.i.d. samples of tasks split into training, , and test, sets. For each task, denoted by , we denote the task-specific policy as , whereas is a shared behaviour prior which regularizes the ’s. On top of that, we denote as , the transition probability, initial state distribution and reward function for the task .
The starting point in this paper is DISTRAL [34], which aims to optimize the following multitask objective on the training set:
(1) 
where is an inverse temperature parameter and denotes sampling a trajectory from the task using the policy . The objective in (1) is optimized with respect to all and jointly. In particular, for each task and for a fixed behaviour prior , the optimization of the objective is equivalent to solving a regularized RL problem with augmented reward . As for learning the behaviour prior , optimizing (1) with respect to amounts to minimizing the sum of KL divergences between the task-specific policies and the prior:
(2) 
The behaviour prior’s role is to model behaviour that is shared across the tasks. As shown in [15], a prior trained according to (1) with computational restrictions, such as access to only part of the observations (information asymmetry), can capture useful default behaviours (such as walking in a walking-related task). The prior regularizes the task-specific solutions and can transfer useful behaviour between tasks, which can speed up learning.
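As a small illustration of the prior-learning step, consider a single state with discrete actions and fixed task-specific policies. In that simplified setting (an assumption made here for illustration, not the continuous-control setup studied in the paper), the sum of KL divergences from the task policies to a shared prior is minimized by the average of the task policies. The sketch below checks this numerically; the array `task_policies` and the helper names are hypothetical.

```python
import numpy as np


def kl(p, q):
    """KL(p || q) for discrete distributions with full support."""
    return float(np.sum(p * (np.log(p) - np.log(q))))


def sum_kl_to_prior(task_policies, prior):
    """Sum over tasks of KL(pi_i || pi_0) at a single state."""
    return sum(kl(p, prior) for p in task_policies)


# Three hypothetical task-specific policies over 4 discrete actions.
task_policies = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.70, 0.10, 0.10],
    [0.40, 0.40, 0.10, 0.10],
])

# Candidate prior: the average of the task policies.
prior_mean = task_policies.mean(axis=0)
loss_at_mean = sum_kl_to_prior(task_policies, prior_mean)

# Any other prior on the simplex should give a loss that is at least as large.
rng = np.random.default_rng(0)
other_losses = [
    sum_kl_to_prior(task_policies, rng.dirichlet(np.ones(4)))
    for _ in range(100)
]
```

In this toy case the averaged prior puts most of its mass on the two actions that the tasks favour, which is the sense in which the prior captures behaviour shared across tasks.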
Let be the current policy for the task . For a fixed behaviour prior , we define the associated soft Q-function as
(3) 
This function was considered in [13]. Note that if is a uniform distribution, the definition in (3) is equivalent to the soft Q-function considered, for instance, in [18, 19]. Furthermore, the policy obtained by computing the 1-step soft-greedy policy, defined as:
(4) 
will have a higher soft Q-value on the task , i.e. (see [18]). Therefore, (4) gives us a principled way to perform policy improvement. A similar policy improvement step is used, for instance, in MPO [1] and Soft Actor-Critic (SAC) [18]. In both cases, the authors optimize a parametric representation to fit the distribution in (4).
But instead of fitting a parametric policy, one can directly act according to the improved policy in (4). This can potentially be more efficient, since it avoids the additional step of learning a policy with function approximation. However, sampling exactly from the distribution in (4) can only be done in a few special cases. Below, we propose a method which uses importance sampling to draw samples from a distribution that approximates the distribution in (4).
3 Importance weighted policy learning
For each task and for a fixed behaviour prior , we consider the following. Firstly, we sample a set of actions from the behaviour prior:
(5) 
We denote as , the set of sampled actions and as the set of discrete action distributions defined on for a state . For simplicity of notation, we will drop from and denote it as . We denote as the soft action value function for some policy and reward function . Then, we construct the following action distribution over for each state :
(6)  
with a normalizing constant :
Then, the resulting policy is a discrete approximation of the improved policy of the form (4). Note that procedure (6) corresponds to a softmax distribution over the sampled actions with respect to the exponent of the soft Q-function.
In the limit of infinitely many sampled actions, the procedure (5)-(6) is guaranteed to sample from the policy in (4). The above sampling scheme gives rise to the Importance Weighted Policy Learning (IWPL) algorithm, which combines the nonparametric policy evaluation and improvement steps described below.
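The sampling scheme above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `beta` stands for the inverse temperature, `sample_prior` and `q_fn` are hypothetical stand-ins for a behaviour prior and a soft Q-function at a fixed state, and the toy prior and Q-function are chosen so that the tilted policy is known in closed form.

```python
import numpy as np


def importance_weights(q_values, beta):
    """Softmax of beta * Q over the sampled actions (numerically stable)."""
    logits = beta * np.asarray(q_values, dtype=float)
    logits = logits - logits.max()
    weights = np.exp(logits)
    return weights / weights.sum()


def sample_improved_action(rng, sample_prior, q_fn, beta, num_samples):
    """Draw candidates from the prior, reweight by exp(beta * Q), resample one."""
    candidates = sample_prior(rng, num_samples)
    probs = importance_weights(q_fn(candidates), beta)
    index = rng.choice(num_samples, p=probs)
    return candidates[index], candidates, probs


# Toy 1-D example (all choices here are illustrative assumptions):
# prior N(0, 1) and Q(a) = -(a - 1)^2, for which the tilted policy
# prior(a) * exp(beta * Q(a)) is Gaussian with mean 2 * beta / (1 + 2 * beta).
def sample_prior(rng, n):
    return rng.standard_normal(n)


def q_fn(actions):
    return -(actions - 1.0) ** 2


rng = np.random.default_rng(0)
action, candidates, probs = sample_improved_action(
    rng, sample_prior, q_fn, beta=1.0, num_samples=20000
)
# Self-normalized importance-sampling estimate of the improved policy's mean.
estimated_mean = float(np.sum(probs * candidates))
```

With `beta = 1.0` the closed-form mean of the tilted policy is 2/3, and the weighted average of the candidates approaches it as the number of sampled actions grows.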
Nonparametric policy evaluation
Let be a function and let be a policy defined on . We define the soft Bellman backup operator:
It is easy to see (as in [18]) that the Bellman iteration converges to the soft value function (3) for . Then, for the policy defined by eq. (4), we consider an estimator for the Bellman operator induced by the importance weighting procedure (5)-(6) (with a newly sampled set of actions ):
(7) 
In the limit, this procedure would converge to the soft Q-function for : .
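One standard sample-based estimator of such a backup is sketched below; it is offered as an assumption-laden illustration and is not necessarily the exact form of (7). It relies on the identity that, for a policy proportional to the prior times exp(beta * Q), the KL-regularized value equals (1/beta) times the log of the prior expectation of exp(beta * Q), which can be estimated by a log-mean-exp over actions sampled from the prior. The names `beta`, `soft_value` and `soft_td_target` are hypothetical.

```python
import numpy as np


def soft_value(q_next, beta):
    """(1 / beta) * log-mean-exp of beta * Q over actions sampled from the prior."""
    scaled = beta * np.asarray(q_next, dtype=float)
    shift = scaled.max()
    return float((shift + np.log(np.mean(np.exp(scaled - shift)))) / beta)


def soft_td_target(reward, discount, q_next, beta):
    """One-step target: reward plus discounted soft value of the next state."""
    return reward + discount * soft_value(q_next, beta)


def weighted_form(q_next, beta):
    """Same quantity written with the importance weights: E_q[Q] - KL(q || uniform) / beta."""
    q_next = np.asarray(q_next, dtype=float)
    logits = beta * q_next - np.max(beta * q_next)
    weights = np.exp(logits) / np.sum(np.exp(logits))
    kl_to_uniform = np.sum(weights * np.log(weights * len(q_next)))
    return float(np.sum(weights * q_next) - kl_to_uniform / beta)


# Q-values of three actions sampled from the prior at the next state (toy numbers).
q_next = np.array([0.0, 1.0, 2.0])
target = soft_td_target(reward=1.0, discount=0.9, q_next=q_next, beta=1.0)
```

The two functions `soft_value` and `weighted_form` compute the same number, which makes explicit that the importance-weighted expectation of Q is corrected by a KL penalty towards the sampled proposal.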
Nonparametric policy improvement
Given the current proposal , some old policy , and the corresponding soft Q-function , we can obtain a new policy via (4). In this case, similar to [19] (Appendix B.2), we have:
where is the soft Q-function corresponding to the . To approximate the , we resample new actions via procedure (5) and apply procedure (6) to the and obtain the categorical distribution with the following probabilities:
This describes a policy improvement procedure based on importance sampling.
Behaviour prior (proposal) improvement
Temperature calibration
In the current formulation, IWPL requires us to choose the inverse temperature parameter in (1) and in (6). For varying reward scales, this could result in unstable behaviour of procedure (6). Some RL algorithms, such as REPS [30] and MPO [1], therefore replace similar (soft) regularization terms with hard limits on KL or entropy. Here, we consider a hard-constraint version of objective (1):
(8)  
The parameter defines the maximum average deviation of all the policies from the behaviour prior . Given , we can adjust the inverse temperature to match this constraint. In many cases is easier to choose than the inverse temperature since it does not, for instance, depend on the scale of the reward. The associated temperature parameter can be optimized by considering the Lagrangian for the objective (8), similar to REPS [30] and MPO [1].
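To make the calibration idea concrete, the sketch below adjusts an inverse temperature so that the KL between the importance-weighted distribution over sampled actions and the uniform weights over those samples matches a target bound. This is a simplification under stated assumptions: the paper optimizes a Lagrangian as in REPS and MPO, whereas here the constraint is imposed per set of samples and solved by bisection, which works because this KL increases monotonically with the inverse temperature. The names `epsilon`, `sample_kl` and `calibrate_beta` are hypothetical.

```python
import numpy as np


def sample_kl(q_values, beta):
    """KL between softmax(beta * Q) over the samples and the uniform weights 1/N."""
    scaled = beta * np.asarray(q_values, dtype=float)
    scaled = scaled - scaled.max()
    weights = np.exp(scaled)
    weights = weights / weights.sum()
    nonzero = weights > 0
    return float(np.sum(weights[nonzero] * np.log(weights[nonzero] * len(weights))))


def calibrate_beta(q_values, epsilon, lo=1e-6, hi=1e6, iters=200):
    """Bisection in log space for the beta whose sample KL equals epsilon."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if sample_kl(q_values, mid) > epsilon:
            hi = mid  # too greedy: lower the inverse temperature
        else:
            lo = mid  # too close to the prior: raise it
    return float(np.sqrt(lo * hi))


q_values = np.array([0.0, 1.0, 2.0, 3.0])
beta = calibrate_beta(q_values, epsilon=0.1)

# Rescaling the rewards (and hence the Q-values) by 10 rescales beta by 1/10,
# while the same epsilon can be kept.
beta_rescaled = calibrate_beta(10.0 * q_values, epsilon=0.1)
```

The second call illustrates why the bound is easier to choose than the inverse temperature itself: the same bound yields an equivalent policy regardless of the reward scale.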
Algorithm
The concrete algorithm is a combination of the steps above with parametric function approximation of the necessary quantities. We consider the approximation for the behaviour prior and an approximation for the soft value function for the task . We denote as and as the other set of parameters which correspond to the target networks (see Mnih et al. [26]), i.e. the networks which are kept fixed for some number of iterations. We denote as the discrete policy coming from (6) associated with and . Then, can be trained by minimizing the Bellman residual:
(9) 
where and:
(10) 
The behaviour prior is learned by minimizing:
(11) 
The full algorithm is presented in Algorithm 1.
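The two losses can be illustrated with a toy example. The sketch below is a guess at a natural instantiation rather than the paper's exact objectives: the critic loss is a squared TD(0) error against a target that would be computed from the target networks, and the prior loss is a weighted negative log-likelihood of the sampled actions under the prior, with weights given by the importance-weighted policy. A 1-D Gaussian prior with fixed standard deviation is assumed, and all function names are hypothetical.

```python
import numpy as np


def gaussian_log_prob(actions, mean, std):
    """Log-density of a 1-D Gaussian prior evaluated at each action."""
    return -0.5 * ((actions - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2.0 * np.pi)


def prior_loss(actions, weights, mean, std):
    """Weighted negative log-likelihood of the sampled actions under the prior."""
    return float(-np.sum(weights * gaussian_log_prob(actions, mean, std)))


def bellman_residual(q_sa, reward, discount, next_value):
    """Squared TD(0) error; next_value would come from the target networks."""
    target = reward + discount * next_value
    return 0.5 * (q_sa - target) ** 2


# Sampled actions and their importance weights at one state (toy numbers).
actions = np.array([-1.0, 0.0, 0.5, 2.0])
weights = np.array([0.1, 0.2, 0.3, 0.4])

# For a fixed std, the weighted NLL is minimized by the weighted mean of the actions.
best_mean = float(np.sum(weights * actions))
loss_at_best = prior_loss(actions, weights, best_mean, std=1.0)
critic_loss = bellman_residual(q_sa=1.0, reward=0.5, discount=0.9, next_value=1.0)
```

Minimizing this weighted likelihood moves the prior towards the actions that the importance weights favour, which is one way to realize a proposal improvement step.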
4 Importance weighted policy adaptation for transfer learning
Given pretrained action-value functions and a behaviour prior from optimization of the objective (8) on the training set, we show how to leverage them to quickly solve tasks from the test set. We call this process adaptation. Below, we describe how adaptation is facilitated by two components of our method: behaviour transfer and value transfer.
Behaviour Transfer.
Given a pretrained behaviour prior , we can learn the solution to a new task by learning a new value function and sampling from the implicit policy defined by (6). This can be achieved by executing the procedure in Section 3 without the prior improvement step. Because the policy is essentially initialized from the behaviour prior, the latter constrains possible solutions and leads to sensible exploration. In order to obtain a new optimal policy, we need to learn a new optimal soft Q-function, which can require a considerable number of samples when Q is naively parameterized by a neural network. Below, we propose a way to leverage the Q-functions learned for tasks in the training set to speed up transfer in terms of the number of interactions with the environment.
Value Transfer.
In order to acquire knowledge about the value function that can be leveraged for transfer, we choose to represent the task-specific value as a linear function of task-specific parameters and shared features :
(12) 
where is a function mapping states and actions to a feature vector (with parameters shared across tasks), is a task-specific vector used to identify task-specific Q-values, and . During the adaptation phase, we initialize as , with , and adapt using TD(0) learning. Furthermore, for some more challenging tasks, we replace (at training time) the task-specific vector by a nonlinear embedding of a structured goal descriptor which is available during training but not during adaptation, i.e. , where is a learned embedding of goal with parameters shared across training tasks. At test time, we initialize the critic as before: . Since some RL problems can still be challenging multitask learning problems, this "asymmetry" between learning and testing allows us to simplify the solution of the multitask problem without affecting the applicability of the learned representation, in contrast to most meta-learning approaches, which require that the training and adaptation phases be matched. Our proposed method thus exploits both the behaviour prior and the shared value features to derive an efficient off-policy transfer learning algorithm. Note that this approach does not require a finite or discrete set of tasks and could also work with continuously parameterised task distributions, since we essentially allow the task-specific Q-function to depend on the task conditioning.
Algorithm
Given the new task , we will learn the associated to construct a Q-function of the form (12). Let be a pretrained behaviour prior and be pretrained features for the Q-functions on the training set. We use similar notation as in Section 3, denoting as , the target network parameters and as the associated categorical distributions of the form (6). Let be the function approximator of the form (12) for the new task . Then, the adaptation on the task reduces to learning the Q-function by minimizing the TD(0) Bellman residual:
(13) 
where
(14) 
Note that in addition to learning the new , it is also possible to fine-tune the pretrained features . This may be required if the test tasks are too different from the training tasks. This scenario is discussed in the Generalization part of Section 6. We call the resulting algorithm Importance Weighted Policy Adaptation (IWPA), which is described in Algorithm 2.
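The core of the adaptation step, learning only a task-specific vector on top of frozen shared features with TD(0), can be sketched as follows. This is a minimal illustration under stated assumptions: the feature vector `phi` is a fixed toy vector rather than the output of a pretrained network, the zero initialisation of `w` is a placeholder and not the initialisation used in the paper, and `next_value` stands in for the soft value that would be computed from the target network and the behaviour prior.

```python
import numpy as np


def q_value(features, w):
    """Q(s, a) = phi(s, a)^T w with frozen shared features phi."""
    return float(np.dot(features, w))


def td0_update(w, features, reward, discount, next_value, lr):
    """One semi-gradient TD(0) step on the task-specific vector w only."""
    td_error = reward + discount * next_value - q_value(features, w)
    return w + lr * td_error * features


phi = np.array([0.6, 0.8])  # hypothetical frozen feature vector (unit norm)
w = np.zeros(2)             # placeholder initialisation (assumption)

# Repeatedly fit a single transition with a fixed bootstrap target of 1.0 + 0.9 * 0.5.
for _ in range(50):
    w = td0_update(w, phi, reward=1.0, discount=0.9, next_value=0.5, lr=0.5)

fitted_q = q_value(phi, w)
```

Because only the low-dimensional vector is updated, each step is a cheap linear regression update, which is consistent with the motivation for adapting quickly from few interactions.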
5 Related Work
The proposed algorithm has some similarities to recent off-policy RL methods. In both Maximum a Posteriori Policy Optimization (MPO) [1] and Soft Actor-Critic (SAC) [18], the authors propose to learn a parametric policy and fit it to the nonparametric improved policy as in eq. (4) (in MPO, the is replaced by the parametric policy, whereas in SAC, is replaced by the uniform distribution). Furthermore, as in our method, the authors of SAC use the induced soft Q-function. Both methods collect experience using the parametric policy. In contrast, in our method, we directly use the improved nonparametric policy to collect experience as well as to construct the bootstrapped Q-function. Moreover, our method is explicitly built in the context of multitask learning and makes use of a behaviour prior with information asymmetry [15], which encourages structured exploration.
In recent work on Q-learning, there were many attempts to scale it up to high-dimensional and continuous action domains. In soft Q-learning [17], in the context of maximum entropy RL, the authors learn a parametric mapping from normally-distributed samples to ones drawn from a policy distribution, which converges to the optimal nonparametric policy induced by a soft Q-function (in a similar way as in eq. (4) with a uniform ). In Amortized Q-learning (AQL) [6], the authors propose to learn a proposal distribution for actions and then select the one maximizing the Q-function. Unlike in our work, the authors do not regularize the induced nonparametric distribution to stay close to the proposal. Note that in the zero-temperature limit, our softmax operator over importance weights becomes a max, making our approach a strict generalization of AQL. Finally, Hunt et al. [23] propose to learn a proposal distribution which is good for transfer to a new task, in the context of successor features [2], while maximizing the entropy.
Transfer of knowledge from past tasks to future ones is a well-established problem in machine learning [5, 3] and has been addressed from several different angles. Meta-learning approaches try to learn the adaptation mechanism by explicitly optimizing either for minimal regret during adaptation or for performance after adaptation. Gradient-based approaches, often derived from MAML, aim at learning initial network weights such that a few gradient steps from this initialization are sufficient to adapt to new tasks [11, 12, 16, 28]. Memory-based meta-learning approaches model the adaptation procedure using recurrent networks [7, 36, 25, 22, 32]. One problem of meta-learning approaches is the explicit optimization for adaptation on a new task, which may be computationally expensive. In addition, most meta-learning methods require the training and adaptation processes to be matched. This could restrict the class of problems which can be solved by this approach, since some hard meta-RL problems could also constitute hard multitask problems. Our method allows additional information to be provided at training time to facilitate this learning without affecting the adaptation phase.
Other transfer learning methods (ours included) do not explicitly optimize the algorithm for adaptation. A common approach is to use a neural network which shares some parameters across training tasks and fine-tunes the rest. Recent work [31] suggests that this yields performance comparable to MAML-style training. Transfer learning with Successor Features [2] exploits a similar decomposition of the action-value function, but relies on Generalized Policy Improvement for efficient transfer, instead of our more general gradient-based adaptation. Another approach for reusing past experience is hierarchical RL, which tries to compress the experience into a shared low-level controller or a set of options which are reused in later tasks [4, 20, 35, 37]. Finally, an approach we build upon is to distill past behaviour into a prior policy [34, 15] from which we can bootstrap during adaptation. In Fakoor et al. [10], the authors propose a transfer learning approach based on fine-tuning a critic acquired via a multitask objective. To speed up adaptation, their method makes heavy use of off-policy data acquired during meta-training, and an adaptive trust region which regularizes the critic parameters based on task similarity.
6 Experiments
In this section, we empirically study the performance of our method in the following scenarios. Firstly, we assess how well the method performs in the multitask scenario. Then, we demonstrate the method's ability to achieve competitive performance in adapting to hold-out tasks compared to meta-reinforcement learning baselines on a few standard benchmarks. On top of that, we show that the method scales well to more challenging sparse reward scenarios and achieves superior adaptation performance on hold-out tasks compared to the considered baselines. Finally, we consider the case when the number of training tasks is very small. In this case the behaviour prior and value-function representation may overfit to the training tasks. We demonstrate that our method still generalizes to hold-out tasks when additional fine-tuning is allowed.
Task setup.
We consider two standard meta-reinforcement learning problems: 2D point mass navigation and the half-cheetah velocity task, described in Rakelly et al. [32]. In addition to these simple tasks, we design a set of sparse reward tasks, which are harder as control and exploration problems. Go To Ring: a quadruped body needs to navigate to a particular (unknown) position on a ring. Move Box: a sphere-like robot must move a box to a specific position. Reach: a simulated robotic arm is required to reach a particular (unknown) goal position. GTT: a humanoid body needs to navigate to a particular (unknown) position on a rectangle. For every task, we consider a set of training , and held-out tasks . For every task, the policy receives proprioceptive information, as well as the global position of the body and the unstructured task identifier (a number from to ). For the Move Box task, we additionally provide the global position of the target as a task observation on the training distribution to facilitate learning. We do not provide this information when working on test tasks. For more environment details, please refer to Appendix B.
Multitask training.
We first demonstrate our method's ability to solve multitask learning problems. As a baseline, we consider SVG(0) [21], an actor-critic algorithm, with an additional Retrace off-policy correction [27] for learning the Q-function as described in [33]. We refer to this algorithm as RS(0). We further consider a continuous-action version of DISTRAL [34] built on top of RS(0), where we learn a behaviour prior alongside the policy and value function, similar to [15]. This prior exhibits information asymmetry of observations with respect to the policy and the value function (it receives less information), which lets it learn useful default behaviour that speeds up learning. In Appendix B, we specify the information provided to the behaviour prior and the policy. Furthermore, we consider the MPO [1] algorithm as well as its version with a behaviour prior, which we call MPO + DISTRAL. The latter simply uses KL-regularization to the learned prior (alongside the policy learning) in the M-step as a soft constraint, as well as a soft Q-function. In our method, IWPL, we also use the behaviour prior with information asymmetry with respect to the Q-function, which receives task-specific information.
For each of the models, we optimize hyperparameters and report the best found configuration with 3 random seeds. The experiments are run in a distributed setup with 64 actors that generate experience and a single learner, somewhat similar to Espeholt et al. [9]. We use a replay buffer of size and control the number of times an individual experience tuple is considered by the learner. This ensures soft-synchronicity between actor and learner and a fair comparison between models that differ with respect to the compute cost of inference and learning. For more details, please refer to Appendix A.
The results are given in Figure 1. We can see that our method achieves competitive performance compared to the baselines. Note that it has larger gains in tasks where the control problem is harder. This effect of the behaviour prior was observed in [15] and is presumably amplified for IWPL, where there is no intermediate parametric policy in the loop: IWPL directly samples useful actions from the prior, which is learned faster than the agent policy due to the restricted set of observations, as discussed in [15]. Interestingly, we do not observe a difference between MPO and MPO + DISTRAL, presumably because the effect of the behaviour prior is reduced by the hard KL constraint to the previous policy.
Adaptation performance.
Next, we investigate the performance of our method in adapting to hold-out tasks. The main criterion is data efficiency in terms of the number of episodes on a new task. As discussed in Section 4, we want to leverage the behaviour prior as well as the learned shared representation for the action-value function. Therefore, we consider two variants of our method, IWPA, described in Section 4. We refer to "Shared Q + IW" as the version which leverages both the behaviour prior and the action-value function, and to "IW" as the version which leverages only the behaviour prior and learns the action-value function from scratch without making assumption (12). As a natural baseline, we consider the RS(0) + DISTRAL agent as in multitask learning, where for learning the Q-function we use TD(0) as in IWPA. Starting from this, we call "Shared Q" the agent which leverages both the behaviour prior and the action-value function, and "DISTRAL" the agent which leverages only the behaviour prior.
We pretrain the "RS(0) + DISTRAL" agent with Q-function parameterisation (12) on the training set, choose the best-performing hyperparameters and freeze the pretrained and action-value features for each task. Then we apply all four proposed adaptation methods to this behaviour prior and these action-value features. The reason to use one algorithm for pretraining is to isolate the adaptation performance from the multitask performance studied above. Empirically, we found that models trained based on IWPL lead to similar results, but we decided to report the results pretrained using "RS(0) + DISTRAL" because this agent was already considered in [15].
In addition, we consider two meta-reinforcement learning baselines: a reimplementation of RL2 [7, 36] as well as a reimplementation of PEARL [32]. For both implementations we build upon RS(0) as the base algorithm. In our implementation of PEARL (denoted as PEARL*), we use a simple LSTM to encode the context. As reported in Rakelly et al. [32], this variant is slower to learn but eventually achieves performance similar to PEARL. Despite this change, our results are comparable to those presented in Section 6.3 of [32]. On top of that, we also consider a baseline which learns to solve the test tasks "From Scratch" and corresponds to the RS(0) algorithm without pretraining and without a behaviour prior. For more details, see Appendix A.
We start by presenting test-time adaptation performance on two standard continuous control tasks used in [32]: half-cheetah velocity and Sparse 2D navigation. Note that for the Sparse 2D navigation task, PEARL receives dense reward during training whereas our agent is trained with sparse rewards. This additionally demonstrates that our method can be employed in more difficult scenarios. The results are presented in Figure 2. While RL2 and PEARL converge faster in absolute terms, IWPA remains competitive and converges quickly despite not optimizing the adaptation process directly.
Going further, we present the results on complex sparse reward tasks. Results on these tasks are depicted in Figure 3. Our proposed method achieves gains in adaptation time with respect to the baseline DISTRAL. Furthermore, we note that using shared features for the value function provides a significant gain. It is important to note that using shared features without the behaviour prior fails to learn fast, because the behaviour prior plays a crucial role in facilitating exploration (see Appendix D). On top of that, we observe that IWPA, as in the multitask results, provides bigger gains on harder control problems, like the GTT humanoid. Note that this is a very challenging task: the humanoid needs to locate a target and only receives a reward when successful. Furthermore, the humanoid may fall at any moment, in which case the episode terminates. This makes it extremely hard to learn without any prior knowledge. We note that both RL2 and PEARL failed to achieve optimal performance on these tasks. This could be for a variety of reasons, including the sparsity of the rewards and the complexity of learning a single policy that has to operate over long time horizons.
Generalization
An efficient transfer learning method should be robust to the low-data regime. Here we show that when only a few training tasks are available, the method is still able to generalize if we allow additional fine-tuning of the shared features for the Q-function after 20 episodes of interaction on a new task. For each of the sparse reward tasks, we consider a version which has few training tasks. We train IWPL on these and compare it to IWPL trained in the regime with many tasks. The results are given in Figure 4. As we see, the method trained in the few-task regime fails to generalize in most of the tasks, whereas the additional fine-tuning helps to recover the final performance, and does so faster than learning from scratch.
7 Discussion
We have presented a novel method for multitask learning as well as for adaptation to new hold-out tasks which does not explicitly meta-learn the adaptation process and yet can match the adaptation speed of common meta-reinforcement learning algorithms. Instead of explicit meta-learning, we relied on feature reuse and bootstrapping from a behaviour prior. The behaviour prior can be seen as an informed proposal for a task distribution that is then specialized to a particular task by a learned action-value function. This scheme can be easily integrated into different actor-critic algorithms for data-efficient off-policy learning at training and test time. Furthermore, it does not strictly require executing test-time adaptation as an inner loop during training, which adds extra flexibility.
References
 [1] (2018) Relative entropy regularized policy iteration. arXiv preprint arXiv:1812.02256. Cited by: §A.1, §1, §2, §3, §3, §5, §6.
 [2] (2017) Successor features for transfer in reinforcement learning. In Advances in neural information processing systems, pp. 4055–4065. Cited by: §5, §5.

[3]
(2000)
A model of inductive bias learning.
Journal of Artificial Intelligence Research
12, pp. 149–198. Cited by: §5.  [4] (2014) Pacinspired option discovery in lifelong reinforcement learning. In International Conference on Machine Learning, pp. 316–324. Cited by: §5.
 [5] (1997) Multitask learning. Machine learning 28 (1), pp. 41–75. Cited by: §5.
 [6] (2020) Qlearning in enormous action spaces via amortized approximate maximization. External Links: 2001.08116 Cited by: §5.
 [7] (2016) RL: fast reinforcement learning via slow reinforcement learning. External Links: 1611.02779 Cited by: §1, §5, §6.
 [8] (2019) Challenges of realworld reinforcement learning. External Links: 1904.12901 Cited by: §1.
 [9] (201810–15 Jul) IMPALA: scalable distributed deepRL with importance weighted actorlearner architectures. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 1407–1416. External Links: Link Cited by: §6.
 [10] (2020) Metaqlearning. In International Conference on Learning Representations, External Links: Link Cited by: §1, §1, §5.
 [11] (201706–11 Aug) Modelagnostic metalearning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, International Convention Centre, Sydney, Australia, pp. 1126–1135. External Links: Link Cited by: §1, §5.
 [12] (2018) Probabilistic modelagnostic metalearning. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. CesaBianchi, and R. Garnett (Eds.), pp. 9516–9527. External Links: Link Cited by: §5.
 [13] (2015) Taming the noise in reinforcement learning via soft updates. External Links: 1512.08562 Cited by: §2.
 [14] (2018) Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477. Cited by: §1.
 [15] (2019) Information asymmetry in KL-regularized RL. In International Conference on Learning Representations. Cited by: §1, §2, §5, §5, §6, §6, §6.
 [16] (2018) Meta-reinforcement learning of structured exploration strategies. arXiv preprint arXiv:1802.07245. Cited by: §5.
 [17] (2017) Reinforcement learning with deep energy-based policies. External Links: 1702.08165 Cited by: §5.
 [18] (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, pp. 1861–1870. Cited by: §1, §2, §3, §5.
 [19] (2018) Learning an embedding space for transferable robot skills. In International Conference on Learning Representations. Cited by: §2, §3.
 [20] (2016) Learning and transfer of modulated locomotor controllers. arXiv preprint arXiv:1610.05182. Cited by: §5.
 [21] (2015) Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 2944–2952. Cited by: §6.
 [22] (2019) Meta reinforcement learning as task inference. arXiv preprint arXiv:1905.06424. Cited by: §1, §5.
 [23] (2018) Composing entropic policies using divergence correction. External Links: 1812.02216 Cited by: §5.
 [24] (2017) Building machines that learn and think like people. Behavioral and Brain Sciences 40, pp. e253. Cited by: §1.
 [25] (2018) A simple neural attentive meta-learner. In International Conference on Learning Representations. Cited by: §5.
 [26] (2013) Playing atari with deep reinforcement learning. External Links: 1312.5602 Cited by: §3.
 [27] (2016) Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems 29, pp. 1054–1062. Cited by: §6.
 [28] (2018) Reptile: a scalable meta-learning algorithm. arXiv preprint arXiv:1803.02999 2, pp. 2. Cited by: §5.
 [29] (2019) Meta-learning of sequential strategies. arXiv preprint arXiv:1905.03030. Cited by: §1.
 [30] (2010) Relative entropy policy search. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI’10, pp. 1607–1612. Cited by: §3, §3.
 [31] (2019) Rapid learning or feature reuse? towards understanding the effectiveness of maml. arXiv preprint arXiv:1909.09157. Cited by: §1, §5.
 [32] (2019) Efficient off-policy meta-reinforcement learning via probabilistic context variables. arXiv preprint arXiv:1903.08254. Cited by: §A.2, Appendix A, §1, §5, §6, §6, §6.
 [33] (2018) Learning by playing solving sparse reward tasks from scratch. In Proceedings of the 35th International Conference on Machine Learning, pp. 4344–4353. Cited by: §6.
 [34] (2017) Distral: robust multitask reinforcement learning. In Advances in Neural Information Processing Systems, pp. 4496–4506. Cited by: §1, §2, §5, §6.
 [35] (2019) Exploiting hierarchy for learning and transfer in KL-regularized RL. arXiv preprint arXiv:1903.07438. Cited by: §5.
 [36] (2016) Learning to reinforcement learn. arXiv preprint arXiv:1611.05763. Cited by: §A.2, Appendix A, §1, §5, §6.
 [37] (2019) Regularized hierarchical policies for compositional transfer in robotics. arXiv preprint arXiv:1906.11228. Cited by: §5.
 [38] (2019) VariBAD: a very good method for Bayes-adaptive deep RL via meta-learning. External Links: 1910.08348 Cited by: §1.
Appendix A Experimental details
For all the models, we use similar architectures for all the components. Each agent has actor, critic and, optionally, behaviour prior networks. For all the methods except [36] and PEARL [32], the actor, critic and behaviour prior networks are 2-dimensional multilayer perceptrons with ELU activations, followed by a one-dimensional linear layer. In addition, each network uses a layer that normalizes its inputs. For [36], the actor and critic networks are 2-dimensional multilayer perceptrons with ELU activations, followed by an LSTM with ELU activations. In PEARL [32], the actor and critic networks have a similar structure to the other methods, and the encoder network is an LSTM followed by a one-dimensional stochastic layer encoding a Gaussian distribution. The actor and behaviour prior are represented by Gaussian distributions as well.
A.1 Multitask training experiment
We consider the following hyperparameter ranges:

Learning rates:

Initial inverse temperature :

Epsilon :

KL-cost (inverse temperature) for DISTRAL baseline :
For the multitask experiments, we found that the following values worked best for all the architectures:

Learning rate:

Epsilon :
The best hyperparameters for RS(0) + DISTRAL for multitask experiment:

Go to Target, Humanoid:

Go to Ring, Quadruped:

Move box, Jumping Ball:

Reach:
The best hyperparameters for IWPL for multitask experiment:

Go to Target, Humanoid:

Go to Ring, Quadruped:

Move box, Jumping Ball:

Reach:
For a fair comparison, we also optimize the E-step epsilon and the KL-cost for MPO [1]. We consider the same ranges as above, and the best hyperparameters are:

Go to Target, Humanoid:

Go to Ring, Quadruped:

Move box, Jumping Ball:

Reach:
For all the experiments, we use a batch size of and we split trajectories into chunks of size . For the multitask experiments in Figure 1, we report 3 random seeds for each model with the best hyperparameters. Shading under the curves corresponds to the 95% confidence interval within these evaluations. We split the data on the x-axis into chunks of timesteps and average the reward within each chunk. We then apply rolling-window smoothing with a window size of .
A.2 Adaptation experiment
For the adaptation experiment, we train the Shared Q + DISTRAL architecture on each of the tasks. We found that the same combination of a learning rate of and a KL-cost of worked best, so we use these values for pretraining on all the tasks. We run 3 random seeds of pretraining and use the best-performing seed for adaptation, which provides the behaviour prior and shared features . Then, for each task, we consider a small validation set consisting of 3 tasks, which we use to choose the best adaptation hyperparameters. For the adaptation hyperparameter ranges, we consider only:

Initial inverse temperature :

KL-cost (inverse temperature) for DISTRAL baseline :
For all the adaptation experiments we use learning rate of and epsilon of .
The best adaptation hyperparameters for IW and shared Q + IW:

Sparse 2d navigation:

Halfcheetah:

Go to Target, Humanoid:

Go to Ring, Quadruped:

Move box, Jumping Ball:

Reach:
The best adaptation hyperparameters for DISTRAL and DISTRAL + Shared Q:

Sparse 2d navigation:

Halfcheetah:

Go to Target, Humanoid:

Go to Ring, Quadruped:

Move box, Jumping Ball:

Reach:
For the baselines, [36] and PEARL [32], we use a learning rate of , and for PEARL we optimize a bottleneck cost over the range . We use a bottleneck layer dimension of . The bottleneck costs per task are given here:

Sparse 2d navigation:

Halfcheetah:

Go to Target, Humanoid:

Go to Ring, Quadruped:

Move box, Jumping Ball:

Reach:
Adaptation protocol
We use a fixed protocol for adaptation on all the tasks for gradient-based methods. After each unroll of a sub-trajectory of size , we apply 1 gradient update to the adapted parameters, and after each episode we apply gradient updates. The gradient updates are performed by sampling trajectories from a local replay buffer with a batch size of . Furthermore, for each task we act according to the behaviour prior (where appropriate) for a few exploration episodes.

Sparse 2d navigation: 5 episodes.

Halfcheetah: 2 episodes.

Go to Target, Humanoid: 20 episodes.

Go to Ring, Quadruped: 5 episodes.

Move box, Jumping Ball: 5 episodes.

Reach: 5 episodes.
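The update schedule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `run_adaptation`, its parameters and the `gradient_update` callback are hypothetical names, the unroll size and update counts that are elided in the text are left as arguments, and the sketch assumes that updates are also applied during the exploration episodes, which the text does not specify.

```python
from typing import Callable, List


def run_adaptation(
    num_episodes: int,
    steps_per_episode: int,
    unroll_length: int,
    updates_per_episode: int,
    exploration_episodes: int,
    gradient_update: Callable[[], None],
) -> List[str]:
    """Run the update schedule and return which policy acted in each episode.

    All arguments are hypothetical placeholders for values not given here.
    """
    acting_policy = []
    for episode in range(num_episodes):
        # The first few episodes act according to the behaviour prior.
        use_prior = episode < exploration_episodes
        acting_policy.append("prior" if use_prior else "adapted")

        for step in range(1, steps_per_episode + 1):
            # An environment step and a replay-buffer insertion would go here.
            if step % unroll_length == 0:
                # One gradient update after each completed unroll.
                # A trailing partial unroll triggers no update.
                gradient_update()

        # Additional updates at the end of every episode.
        for _ in range(updates_per_episode):
            gradient_update()
    return acting_policy
```

With 20 steps per episode and an unroll length of 10, each episode triggers two per-unroll updates plus the end-of-episode updates.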
Curves from Figures 2 and 3 plot average episodic return during adaptation, averaged over test tasks with 3 independent runs each (seeds). For each task and seed, we estimate average episodic return by averaging over the last 3 episodes. Shading under the curves corresponds to 95% confidence interval within these evaluations. Results on Sparse 2D navigation shown in Figure 2 are smoothed using a rolling window of 5. No smoothing is applied for Halfcheetah velocity. For Figure 3 we use a rolling window of 30.
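The evaluation statistics above can be sketched in a few lines of standard-library Python. The helper names are hypothetical. The trailing-window convention and the normal-approximation interval (mean ± 1.96 standard errors) are assumptions, since the text does not state how the window is aligned or how the 95% interval is computed.

```python
import math
import statistics
from typing import List, Sequence, Tuple


def last_k_return(episode_returns: Sequence[float], k: int = 3) -> float:
    """Average episodic return over the last k episodes."""
    tail = list(episode_returns)[-k:]
    return sum(tail) / len(tail)


def rolling_mean(values: Sequence[float], window: int) -> List[float]:
    """Trailing rolling mean; early points average the samples seen so far."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start : i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def mean_and_ci95(samples: Sequence[float]) -> Tuple[float, float]:
    """Mean and half-width of a normal-approximation 95% interval (assumed)."""
    mean = statistics.fmean(samples)
    if len(samples) < 2:
        return mean, 0.0
    stderr = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, 1.96 * stderr
```

With only 3 seeds per task, a Student-t interval would be wider than this normal approximation; which of the two the figures use is not stated.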
Appendix B Environment Details
On Go To Ring, the agent receives a reward of 10 on reaching the target and is given an immobility penalty of 0.005 for each time step. The episode is terminated either by reaching the target or after 10 seconds (with 20 steps per second). The task distribution is defined by and , which are sampled uniformly at each meta-episode. At training time, we provide only the task id as task-specific information. The walker is randomly spawned at each episode in the rectangle from . The number of training tasks is , and the number of test tasks is . We provide proprioception, global position and orientation to both the behaviour prior and the agent, whereas the task identifier is provided only to the agent at training time.
For Reach, we use a simulated Jaco robot which has to reach a target specified in a cube of size 0.4. Once the Jaco is within a radius of 0.05 of the target, it receives a reward of 1. The episode is terminated after 10 seconds (with 25 steps per second). At training time, we provide only the task id as task-specific information. The number of training tasks is , and the number of test tasks is . We provide proprioception, global position and orientation to both the behaviour prior and the agent, whereas the task identifier is provided only to the agent at training time.
For Move Box, the reward of 10 is given only once the box is on the target. The episode is terminated either after putting the box on the target or after 20 seconds (20 steps per second). The task distribution is defined by a tuple of box and target positions, which are kept fixed for the entire meta-episode. These positions are sampled uniformly in a room of size 8x8, at a maximum relative distance of 2. At training time, we provide the global target position as task information. The number of training tasks is , and the number of test tasks is . We provide proprioception, global position and orientation to both the behaviour prior and the agent, whereas the global target position is provided only to the agent at training time.
For Go To Target (GTT), the agent receives a reward of 1.0 on reaching the target and is given an immobility penalty of 0.005 for each time step and a penalty of 1.0 if the agent (humanoid) touches the floor with its upper body or knees. The episode is terminated either by reaching the target or after 10 seconds (with 20 steps per second). The task distribution is defined by a target position sampled uniformly on a rectangle of size 8x8. At training time, we provide only the task id as task-specific information. At training time, the walker position is randomly initialized in the room at each episode, whereas at test time the walker's initial position is kept fixed for the entire meta-episode. The number of training tasks is 100, and the number of test tasks is 30. We provide proprioception, global position and orientation to both the behaviour prior and the agent, whereas the task identifier is provided only to the agent at training time.
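As a worked check of the time limits quoted in this appendix, the snippet below converts each episode duration and control rate into a maximum number of steps. The dictionary and function names are hypothetical. The accumulated-penalty helper assumes the 0.005 penalty is charged at every step, which is one reading of "for each time step".

```python
# (seconds per episode, steps per second), as quoted in the text.
ENVIRONMENTS = {
    "Go To Ring": (10, 20),
    "Reach": (10, 25),
    "Move Box": (20, 20),
    "Go To Target": (10, 20),
}


def max_episode_steps(seconds: int, steps_per_second: int) -> int:
    """Number of steps before the time-limit termination."""
    return seconds * steps_per_second


STEP_LIMITS = {
    name: max_episode_steps(seconds, rate)
    for name, (seconds, rate) in ENVIRONMENTS.items()
}


def max_accumulated_penalty(num_steps: int, per_step_penalty: float = 0.005) -> float:
    """Total per-step penalty if it is charged at every step (assumption)."""
    return num_steps * per_step_penalty


for name, limit in STEP_LIMITS.items():
    print(f"{name}: {limit} steps")
```

Under that reading, a full-length Go To Ring episode accumulates a penalty of about 1.0 in total, small next to the reward of 10 for reaching the target.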
Appendix C Additional Results
In Section 4 “Value Transfer”, we describe how IWPA can make use of privileged information during meta-training by mapping features to task-specific Q-values , via an inner product with task features . Figure 6 reports meta-training performance of “Shared Q” with either (referred to as Task id) or (referred to as Task description), where is a structured task descriptor. The latter yields a qualitative difference on Move Box, where this information represents the global position of the target location. This confirms that using rich privileged information during meta-training is important for scaling meta-learning and transfer learning approaches to more challenging domains.
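The inner-product construction can be illustrated with a small sketch. All names here are hypothetical, and the two ways of producing the task vector are assumptions made for illustration only: a per-task lookup table for the "Task id" variant and a linear map of the descriptor for the "Task description" variant. The text does not specify either parameterization.

```python
from typing import Dict, List, Sequence


def q_value(features: Sequence[float], task_vector: Sequence[float]) -> float:
    """Task-specific Q-value as an inner product of shared and task features."""
    if len(features) != len(task_vector):
        raise ValueError("features and task_vector must have the same length")
    return sum(f * w for f, w in zip(features, task_vector))


def task_vector_from_id(table: Dict[int, List[float]], task_id: int) -> List[float]:
    """'Task id' variant (assumed): one learned vector per training task."""
    return table[task_id]


def task_vector_from_description(
    weights: Sequence[Sequence[float]], descriptor: Sequence[float]
) -> List[float]:
    """'Task description' variant (assumed): linear map of a task descriptor."""
    return [sum(w * d for w, d in zip(row, descriptor)) for row in weights]


# Shared state-action features are reused across tasks; only the task vector changes.
shared_features = [1.0, 2.0]
table = {0: [0.5, -1.0], 1: [1.0, 1.0]}
print(q_value(shared_features, task_vector_from_id(table, 0)))
print(q_value(shared_features, task_vector_from_id(table, 1)))
```

A descriptor-based task vector can be computed for tasks never seen in training, whereas a lookup table covers only the training task ids. This is a property of the sketch, not a claim about the paper's results.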
Appendix D Ablations
The IWPA method described in Section 4 and in Algorithm 2 relies on both the behaviour prior and the learnt Q-function features . Furthermore, based on the transfer learning results presented in Figure 3, it may seem that the state-action value function features are the crucial component for transfer. In this section, we provide an ablation showing that, without a behaviour prior, these features alone do not transfer. The combination of both the behaviour prior and the value features is therefore important. The results are given in Figure 7. The architecture which uses both components, “Shared Q + IW”, works very well, whereas the one which reloads only the value features fails to learn.