I Introduction
Lack of principled mechanisms to quickly and efficiently transfer learned policies when the robot dynamics or tasks change significantly has become a major bottleneck in Reinforcement Learning (RL). This inability to transfer policies is one major reason why RL has still not proliferated into physical applications such as robotics [22, 23, 19, 10]. The lack of methods to efficiently transfer robust policies forces the robot to learn control in isolation and from scratch, even when only a few parameters of the dynamics change across tasks (e.g., walking with a different payload weight or surface friction). This is expensive in both computation and samples, and it also makes simulation-to-real transfer difficult. Adapt-to-Learn (ATL) is inspired by the fact that adaptation of behaviors combined with learning through experience is a primary mechanism of learning in biological creatures [14, 9].
Imitation Learning (IL) [8, 36] also seems to play a crucial part in skill transfer, and as such has been widely studied in RL. In control theory, adaptation has typically been studied for tracking problems on deterministic dynamical systems with well-defined reference trajectories [3, 5]. Inspired by learning in the biological world, we seek to answer the question: will combining adaptation of transferred policies and learning from exploration lead to more efficient policy transfer? In this paper, we show that the ability to adapt and incorporate further learning can be achieved by optimizing over combined environmental and intrinsic adaptation rewards. Learning over the combined reward feedback ensures that the agent quickly adapts and also learns to acquire skills beyond what the source policy can teach. Our empirical results show that the presented policy transfer algorithm is able to adapt to significant differences in the transition models which would otherwise, using Imitation Learning or Guided Policy Search (GPS) [15], fail to produce a stable solution. We posit that the presented method can be the foundation of a broader class of RL algorithms that can choose seamlessly between learning through exploration and supervised adaptation using behavioral imitation.

II Related Work
Deep Reinforcement Learning (DRL) has recently enabled agents to learn policies for complex robotic tasks in simulation [22, 23, 19, 10]. However, DRL applications to robotics have been plagued by the curse of sample complexity. Transfer Learning (TL) seeks to mitigate this learning inefficiency of DRL [31]. A significant body of literature on transfer in RL is focused on initializing RL in the target domain using a learned source policy; these are known as jump-start/warm-start methods [30, 1, 2]. Some examples of these transfer architectures include transfer between similar tasks [4], transfer from human demonstrations [24], and transfer from simulation to real [21, 26, 35]. However, warm-start based approaches do not always work for transfer in RL, even though similar warm starts are quite effective in supervised learning. Efforts have also been made to explore accelerated learning directly on real robots, through Guided Policy Search (GPS) [17] and by parallelizing training across multiple agents using meta-learning [16, 20, 36], but these approaches are prohibitively sample expensive for real-world robotics. Sim-to-real transfer has been widely adopted in recent work and can be viewed as a subset of same-domain transfer problems. Daftry et al. [7] demonstrated policy transfer for control of aerial vehicles across different vehicle models and environments. Policy transfer from simulation to real using an inverse dynamics model estimated by interacting with the real robot is presented in [6]. Agents trained to achieve robust policies across various environments by learning over an adversarial loss are presented in [34]. Here we present a general architecture capable of transferring policies across MDPs with significantly different transition dynamics.

III Preliminaries and Problem Specification
Consider an infinite-horizon MDP defined as a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \rho_0, r, \gamma)$, where $\mathcal{S}$ denotes the set of continuous states; $\mathcal{A}$ is a set of continuous bounded actions; $\mathcal{P}(s_{t+1} \mid s_t, a_t)$ is the state transition probability distribution of reaching $s_{t+1}$ upon taking action $a_t$ in $s_t$; $\rho_0$ is the distribution over initial states; $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a deterministic reward function; and $\gamma$ is the discount factor, s.t. $\gamma \in [0, 1)$. Let $\pi_\theta(a \mid s)$ be a stochastic policy over the continuous state and action space. The action from a policy is a draw from the distribution $a_t \sim \pi_\theta(\cdot \mid s_t)$. The agent's goal is to find a policy which maximizes the total return. The total return under a policy is given as,

$$J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right] \qquad (1)$$

where $\tau = (s_0, a_0, s_1, a_1, \dots)$ is a trajectory with $s_0 \sim \rho_0$, $a_t \sim \pi_\theta(\cdot \mid s_t)$, and $s_{t+1} \sim \mathcal{P}(\cdot \mid s_t, a_t)$.
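As a concrete illustration, the discounted return in Eq. (1) can be estimated from a finite reward sequence by truncating the infinite horizon; a minimal sketch:

```python
import numpy as np

def discounted_return(rewards, gamma=0.995):
    """Truncated estimate of the total discounted return in Eq. (1):
    J = sum_t gamma^t * r_t over the recorded reward sequence."""
    rewards = np.asarray(rewards, dtype=float)
    # powers of gamma: [1, gamma, gamma^2, ...]
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))
```

With $\gamma$ close to 1 (0.995 in the experiments below), long trajectories are needed before truncation error becomes negligible.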
We will use the standard definitions of the state value function $V^{\pi}(s_t) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{l=0}^{\infty} \gamma^{l} r(s_{t+l}, a_{t+l}) \mid s_t\right]$ and the state-action value function $Q^{\pi}(s_t, a_t) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{l=0}^{\infty} \gamma^{l} r(s_{t+l}, a_{t+l}) \mid s_t, a_t\right]$, defined under any policy $\pi$.
With this quick overview of the preliminaries, we now specify the policy transfer problem studied here. Consider a source MDP $\mathcal{M}_S$ and a target MDP $\mathcal{M}_T$, each with its own state-action space and transition model $\mathcal{P}_S$, $\mathcal{P}_T$ respectively. We mainly focus on the problem of same-domain transfer in this paper, where the state and action spaces are analogous, $\mathcal{S}_S = \mathcal{S}_T$ and $\mathcal{A}_S = \mathcal{A}_T$, but the source and target state transition models differ significantly due to unmodeled dynamics or external environment interactions. We are concerned with this problem because such policy adaptation in the presence of significant transition model change happens quite often in robotics, e.g., a change in surface friction, the payload carried by the robot, modeling differences, or component deterioration. Note that extension to cross-domain transfer could be achieved with domain alignment techniques such as manifold alignment (MA) [33]; see [12] for a model-based policy transfer method for RL with MA.
Let $\pi^*_S$ be a parameterized optimal stochastic policy for the source MDP $\mathcal{M}_S$. The source policy can be obtained using any available RL method [29, 27, 28]. We use the Proximal Policy Optimization (PPO) [28] algorithm to learn an optimal source policy on the unperturbed source task. A warm-started PPO-TL [2] policy trained on the perturbed target task, and ideal imitation learning, are used as the TL baselines against which our proposed ATL policy is compared.
IV Adaptive Policy Transfer
In this paper, we approach the problem of transfer learning for RL through adaptation of previously learned policies. Our goal in the rest of this paper is to show that an algorithm that can judiciously combine adaptation and learning is capable of avoiding brute force random exploration to a large extent and therefore can be significantly less sample expensive.
Towards this goal, our approach is to enable the agent to learn the optimal mixture of adaptation (which we term behavioral imitation) and learning from exploration. Our method differs from existing transfer methods that rely on warm starts for TL or on policy imitation [36, 8] in a key way: unlike imitation learning, we do not aim to emulate the source optimal policy. Instead, we try to adapt, and we encourage a directional search for optimal reward by mimicking the source transitions under the source optimal policy as projected onto the target state space. The details of adaptive policy transfer are presented in the rest of this section.
IV-A Adapt-to-Learn: Policy Transfer in RL
ATL has two components to policy transfer: adaptation and learning. We begin by mathematically describing our approach of adaptation for policy transfer in RL and state all the necessary assumptions in Section IV-A1. We then develop the Adapt-to-Learn algorithm in Section IV-A2 by adding the learning component of random exploration through reward mixing.
IV-A1 Adaptation for Policy Learning
Adaptation of policies aims to generate a policy $\pi_\theta$ such that at every state the target transition approximately mimics the source transition under the source optimal policy, as projected onto the target state space. We can loosely state the adaptation objective as $\mathcal{P}_T(s_{t+1} \mid s_t, a_t \sim \pi_\theta) \approx \mathcal{P}_S(s_{t+1} \mid \chi(s_t), a_t \sim \pi^*_S)$, where $\chi(s_t)$ is the target state projected onto the source manifold. The mapping $\chi$ is a bijective Manifold Alignment (MA) between the source and target state-action spaces. The MA mapping is required for cross-domain transfers [2, 12]. For ease of exposition, in this paper we focus on same-domain transfer and assume the MA mapping to be identity, such that $\chi(s) = s$.
Note that our goal is not to directly imitate the source policy itself, but rather to use the source behavior as a baseline for finding optimal transitions in the target domain. This behavioral imitation objective can be achieved by minimizing the average KL divergence between the pointwise local target trajectory deviation likelihoods $\mathcal{P}_T(s^* \mid s_t, a_t)$ and the source transitions under the source optimal policy $\mathcal{P}_S(s^* \mid s_t, a^*_t)$.

We define the target transition deviation likelihood $\mathcal{P}_T(s^* \mid s_t, a_t)$ as the conditional probability of landing in the state $s^*$ starting from state $s_t$ under some action $a_t \sim \pi_\theta$. Unlike the state transition probability, the state $s^*$ is not the next transitioned state but a random state at which this probability is evaluated (hence $s^*$ need not carry the time stamp). The adaptation objective can be formalized as minimizing the average KL divergence [27] between source and target transition trajectories as follows,
$$\min_\theta \; \mathbb{E}_{\hat\tau}\left[ D_{KL}\left( P(\hat\tau \mid \pi_\theta, \mathcal{P}_T) \,\Big\|\, P(\tau^* \mid \pi^*_S, \mathcal{P}_S) \right) \right] \qquad (2)$$
where $\hat\tau = (s_0, s_1, \dots, s_T)$ is the trajectory in the target domain under the policy $\pi_\theta$, defined as the collection of states visited starting from state $s_0$ and making transitions under the target transition model $\mathcal{P}_T$. We now define the probabilities of the trajectories $\hat\tau$ and $\tau^*$:
The term $P(\hat\tau \mid \pi_\theta, \mathcal{P}_T)$ is defined as the total likelihood of one-step deviations to reference states over the trajectory $\hat\tau$. The reference states $s^*_{t+1}$ are obtained by re-initializing the source simulator at every time "$t$" to the states $s_t$ along the trajectory $\hat\tau$ and making optimal transitions to $s^*_{t+1}$ using the source optimal policy. The total trajectory deviation likelihood can be expressed as follows,

$$P(\hat\tau \mid \pi_\theta, \mathcal{P}_T) = \prod_{t=0}^{T-1} \mathcal{P}_T(s^*_{t+1} \mid s_t, a_t), \quad a_t \sim \pi_\theta(\cdot \mid s_t) \qquad (3)$$
Assumption 1
Access to a source simulator with restart capability is assumed. That is, the source transition model can be initialized to any desired state $s_t$, and under the optimal action $a^*_t \sim \pi^*_S$ the source simulator provides the next transitioned state $s^*_{t+1}$.
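Assumption 1 amounts to a simulator interface that can be reset to an arbitrary state and stepped once. The sketch below uses a hypothetical linear stand-in for the source dynamics (the `step_fn` lambda), not the actual MuJoCo models:

```python
import numpy as np

class SourceSimulator:
    """Source model with restart capability (Assumption 1): it can be
    re-initialized to any state s_t and queried for the next state
    under the source-optimal action."""

    def __init__(self, step_fn):
        self.step_fn = step_fn   # s' = f(s, a): the source transition model
        self.state = None

    def set_state(self, s):
        self.state = np.asarray(s, dtype=float)   # restart to arbitrary s_t

    def step(self, a):
        self.state = self.step_fn(self.state, a)  # one source transition
        return self.state

# Hypothetical stand-in dynamics for illustration: s' = 0.9 s + a
sim = SourceSimulator(lambda s, a: 0.9 * s + a)
sim.set_state([1.0, 0.0])
s_ref = sim.step(np.array([0.1, 0.1]))  # reference state s*_{t+1}
```

In the MuJoCo experiments, `set_state` would correspond to writing the generalized positions and velocities of the simulator before stepping it.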
Using Assumption 1, $P(\tau^* \mid \pi^*_S, \mathcal{P}_S)$ can be defined as the total probability of piecewise transitions at every state $s_t$ under the source optimal policy $\pi^*_S$ and the source transitions $\mathcal{P}_S$:

$$P(\tau^* \mid \pi^*_S, \mathcal{P}_S) = \prod_{t=0}^{T-1} \mathcal{P}_S(s^*_{t+1} \mid s_t, a^*_t), \quad a^*_t \sim \pi^*_S(\cdot \mid s_t) \qquad (4)$$
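Under the Gaussian transition assumption introduced later (Assumption 3), the deviation likelihood of Eq. (3) factors into per-step Gaussian likelihoods of the reference states; a sketch in log space (the fixed `sigma` value here is illustrative):

```python
import numpy as np

def traj_deviation_loglik(next_states, ref_states, sigma=0.1):
    """Log of the total one-step deviation likelihood (Eq. 3) under a
    Gaussian transition assumption centered at the propagated next
    states s_{t+1}, evaluated at the reference states s*_{t+1}."""
    next_states = np.asarray(next_states, dtype=float)  # s_{t+1}, shape (T, d)
    ref_states = np.asarray(ref_states, dtype=float)    # s*_{t+1}, shape (T, d)
    d = next_states.shape[1]
    sq = np.sum((ref_states - next_states) ** 2, axis=1)  # ||s* - s||^2 per step
    logp = -0.5 * d * np.log(2 * np.pi * sigma ** 2) - sq / (2 * sigma ** 2)
    return float(np.sum(logp))
```

The likelihood of the source trajectory in Eq. (4) has the same factored form, with the source-propagated states in place of the target ones.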
Unlike the conventional treatment of the KL term in the RL literature, the transition probabilities in the KL divergence in (2) are treated as transition likelihoods, and the transitioned states $s^*_{t+1}$ as random variables. The policy $\pi_\theta$ learns to mimic the optimal transitions by minimizing the KL distance between the transition likelihoods of the target agent reaching the reference states and the source transitions evaluated at $s^*_{t+1}$. Using the definitions of the probabilities of the trajectories under $\pi_\theta$ and $\pi^*_S$ (Equations (3) & (4)), the KL divergence (2) over the trajectory can be expressed as follows,

$$D_{KL}\left(P(\hat\tau) \,\Big\|\, P(\tau^*)\right) = \mathbb{E}_{\hat\tau}\left[\sum_{t=0}^{T-1} D_{KL}\Big( \mathcal{P}_T(\cdot \mid s_t, a_t) \,\Big\|\, \mathcal{P}_S(\cdot \mid s_t, a^*_t) \Big)\right] \qquad (5)$$
Assumption 2
We assume that the optimal source policy is available, and that the probability of taking the optimal action is $\pi^*_S(a^*_t \mid s_t) \approx 1$.
We make this simplifying Assumption 2, i.e., $\pi^*_S(a^*_t \mid s_t) \approx 1$, to derive a surrogate objective which lower bounds the true KL loss (5). The expression for the surrogate objective is as follows,

$$\hat{L}(\theta) = \mathbb{E}_{\hat\tau}\left[\sum_{t=0}^{T-1} \hat{r}(s_t, a_t)\right], \qquad \hat{r}(s_t, a_t) = -D_{KL}\Big( \mathcal{P}_T(\cdot \mid s_t, a_t) \,\Big\|\, \mathcal{P}_S(\cdot \mid s_t, a^*_t) \Big) \qquad (6)$$

where $\hat{r}(s_t, a_t)$ is defined as the intrinsic/adaptation reward term.
We need this assumption only for deriving the expression for the intrinsic reward. Assumption 2 allows us to extend the proposed transfer architecture to a deterministic source policy or a source policy derived from a human expert, since a human teacher cannot specify the probability of choosing an action. However, in the empirical evaluation of the algorithm, the source policy used is a stochastic optimal policy. The stochastic nature of such a policy adds natural exploration, and also demonstrates that the presented method is not restricted in its application by the above assumption.
IV-A2 Combined Adaptation and Learning
To achieve "adaptation" and "learning" simultaneously, we propose to augment the environment reward $r(s_t, a_t)$ with the intrinsic reward $\hat{r}(s_t, a_t)$. By optimizing over a mixture of intrinsic and environmental returns simultaneously, we obtain transferred policies that both follow the source advice and learn to acquire skills beyond what the source can teach. This trade-off between learning by exploration and learning by adaptation can be realized as follows:

$$J(\theta) = \mathbb{E}_{\hat\tau}\left[\sum_{t=0}^{\infty} \gamma^t \Big( (1-\beta)\, r(s_t, a_t) + \beta\, \hat{r}(s_t, a_t) \Big)\right] \qquad (7)$$

where the term $\beta \in [0, 1]$ is the mixing coefficient. The inclusion of the intrinsic reward can be seen as adding a directional search to RL exploration. The source optimal policy and source transitions provide a possible direction of search for the one-step optimal reward at every state along the trajectory $\hat\tau$. For consistency of the reward mixing, the rewards are normalized before forming the total reward.
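The per-step mixing can be sketched as follows, with $\beta$ weighting the intrinsic (adaptation) reward and $1-\beta$ the environment reward; the particular normalization scheme below is an assumption, since the text only states that the rewards are normalized:

```python
import numpy as np

def mixed_reward(r_env, r_intr, beta):
    """Mix normalized environment and intrinsic rewards (Eq. 7):
    r_total = (1 - beta) * r_env + beta * r_intr,
    with beta in [0, 1] trading exploration against adaptation."""
    r_env = np.asarray(r_env, dtype=float)
    r_intr = np.asarray(r_intr, dtype=float)
    # assumed normalization: zero-mean / unit-scale per reward stream
    norm = lambda r: (r - r.mean()) / (r.std() + 1e-8)
    return (1.0 - beta) * norm(r_env) + beta * norm(r_intr)
```

At $\beta = 0$ the agent optimizes only the (normalized) environment reward, recovering standard exploration-driven RL; at $\beta = 1$ it purely adapts to the source transitions.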
The optimal mixing coefficient $\beta$ is learned over episodic data collected by interacting with the environment [18]. The details of the hierarchical learning of the mixing coefficient are provided in Section V-B, and its behavior in experiments is analyzed in Section VII.
Assumption 3
The true transition distributions for the source $\mathcal{P}_S$ and the target $\mathcal{P}_T$ are unknown. However, we assume both source and target transition models follow a Gaussian distribution centered at the next propagated state with a fixed variance "$\sigma^2$", which is chosen heuristically.
Although Assumption 3 might look unrealistic, in reality it is not very restrictive. We empirically show that for all the experimented MuJoCo environments, a bootstrapped Gaussian transition assumption is good enough for the ATL agent to transfer the policy efficiently.
Using Assumptions 2 & 3, we can simplify the individual KL term (the intrinsic reward) as follows,

$$\hat{r}(s_t, a_t) = -\frac{1}{2\sigma^2}\left\| s_{t+1} - s^*_{t+1} \right\|^2 \qquad (8)$$

where $s_{t+1}$ is the next state propagated by the target model and $s^*_{t+1}$ is the source reference state.
The individual terms in the expectation (8) represent the distance between the two transition likelihoods of landing in the next state $s^*_{t+1}$, starting in $s_t$ under actions $a_t \sim \pi_\theta$ and $a^*_t \sim \pi^*_S$. The target agent is encouraged to take actions that lead to states which are close in expectation to the reference states provided by an optimal baseline policy operating on the source model. The intuition is that, by doing so, a possible direction of search for the higher environmental rewards $r(s_t, a_t)$ is provided.
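The simplified per-step intrinsic reward follows from the closed form of the KL divergence between two equal-variance Gaussians, which reduces to a scaled squared distance between their means; a minimal sketch:

```python
import numpy as np

def intrinsic_reward(s_next, s_ref, sigma=0.1):
    """Per-step adaptation reward (Eq. 8): the negative KL between two
    Gaussian transition likelihoods with equal variance sigma^2 reduces
    to -||s_{t+1} - s*_{t+1}||^2 / (2 sigma^2)."""
    diff = np.asarray(s_next, dtype=float) - np.asarray(s_ref, dtype=float)
    return float(-np.dot(diff, diff) / (2.0 * sigma ** 2))
```

The reward is maximal (zero) exactly when the target lands on the source reference state, and falls off quadratically with the deviation.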
IV-B Actor-Critic
Using the definition of the value function, the objective in (7) can be written as $J(\theta) = \mathbb{E}_{s_0 \sim \rho_0}\left[V^{\pi_\theta}(s_0)\right]$. The expectation can be rewritten as a sum over states and actions as follows:

$$J(\theta) = \sum_{s} \rho^{\pi_\theta}(s) \sum_{a} \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \qquad (9)$$

where $\rho^{\pi_\theta}(s)$ is the state visitation distribution [29].
Considering the case of off-policy RL, we use a behavioral policy $\pi_b$ for generating trajectories. This behavioral policy is different from the policy to be optimized. The objective function in off-policy RL measures the total return over the behavioral state visitation distribution, and the mismatch between the training data distribution and the true policy state distribution is compensated by an importance sampling estimator as follows,

$$J(\theta) = \sum_{s} \rho^{\pi_b}(s) \sum_{a} \pi_b(a \mid s)\, \frac{\pi_\theta(a \mid s)}{\pi_b(a \mid s)}\, Q^{\pi_\theta}(s, a) \qquad (10)$$
Using the previously updated policy as the behavioral policy, i.e., $\pi_b = \pi_{\theta_{old}}$, the objective expression can be rewritten as,

$$J(\theta) = \mathbb{E}_{s \sim \rho^{\pi_{\theta_{old}}},\, a \sim \pi_{\theta_{old}}}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{old}}(a \mid s)}\, \hat{Q}(s, a)\right] \qquad (11)$$

where $\theta_{old}$ is the parameter before the update and is known to us. An estimated state-action value function $\hat{Q}(s, a)$ replaces the true value function, as the true value function is usually unknown. The mixed reward $(1-\beta) r + \beta \hat{r}$ is used to compute the critic state-action value function $\hat{Q}(s, a)$. Further, any policy update algorithm [29, 28, 27, 25] can be used to update the policy in the direction of optimizing this objective.
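The importance-sampled objective in Eq. (11) can be estimated from a batch of samples as a mean of ratio-weighted critic values; a sketch working with log-probabilities for numerical stability (the names are illustrative):

```python
import numpy as np

def surrogate_objective(logp_new, logp_old, q_values):
    """Empirical estimate of the off-policy actor objective (Eq. 11):
    mean over samples of [pi_theta(a|s) / pi_old(a|s)] * Q_hat(s, a),
    where Q_hat is the critic computed from the mixed reward."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))  # IS weights
    return float(np.mean(ratio * np.asarray(q_values)))
```

When $\pi_\theta = \pi_{\theta_{old}}$ the ratios are all one and the objective is just the average critic value; PPO-style updates additionally clip this ratio, which this sketch omits.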
Learning over a mixture of intrinsic and environmental rewards helps in the directional search for maximum total return. The source transitions provide a possible direction in which maximum environmental reward can be achieved. This tradeoff between directional search and random exploration is achieved using the mixed reward.
V Optimization of Target Policy
In the previous section, we formulated the Adapt-to-Learn policy transfer algorithm; we now describe how to derive a practical algorithm from these theoretical foundations under finite sample counts and arbitrary policy parameterizations.
Optimizing the objective (11), we can generate adaptive policy updates for the target task:

$$\theta^{+} = \arg\max_\theta \; \mathbb{E}_{s \sim \rho^{\pi_{\theta_{old}}},\, a \sim \pi_{\theta_{old}}}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{old}}(a \mid s)}\, \hat{Q}(s, a)\right] \qquad (12)$$
If calculating the above expectation were feasible, it would be possible to maximize the objective in (11) and move the policy parameters in the direction of achieving a higher total discounted return. However, this is generally not the case, since the true expectation is intractable; a common practice is therefore to use an empirical estimate of the expectation.
V-A Sample-Based Estimation of the Gradient
The previous sections proposed an optimization method to find the adaptive policy using the KL divergence as an intrinsic reward, enforcing the target transition model to mimic the source transitions. This section describes how this objective can be approximated using Monte Carlo simulation. The approximate policy update method works by computing an estimate of the gradient of the return and plugging it into a stochastic gradient ascent algorithm. The gradient estimate over i.i.d. data from the collected trajectories is computed as follows:

$$\hat{g} = \frac{1}{N}\sum_{i=1}^{N} \frac{\pi_\theta(a_i \mid s_i)}{\pi_{\theta_{old}}(a_i \mid s_i)}\, \nabla_\theta \log \pi_\theta(a_i \mid s_i)\, \hat{Q}(s_i, a_i)$$

where the average is taken over the empirical distribution of the data $\{(s_i, a_i)\}$. The details of the empirical estimate of the gradient of the total return are provided in the Appendix.
Note that the importance-sampling based off-policy update renders the state-action value estimates independent of the policy parameter $\theta$, since $\hat{Q}$ is computed under the behavioral policy $\pi_{\theta_{old}}$. Hence the gradient of the state-action value estimates with respect to $\theta$ is zero in the above expression for the total gradient.
V-B Learning Mixing Coefficient from Data
A hierarchical update of the mixing coefficient $\beta$ is carried out over $n$ test trajectories collected using the updated policy network $\pi_{\theta^{+}}$, where $\theta^{+}$ is the parameter after the policy update step. The mixing coefficient is learned by optimizing the return over these trajectories.

We use gradient ascent to update the parameter $\beta$ in the direction of optimizing the reward mixing. Using the definition of the mixed reward $r_{total} = (1-\beta)\, r + \beta\, \hat{r}$, the gradient of the return with respect to $\beta$ simplifies to an average over the per-step differences $\hat{r}(s_t, a_t) - r(s_t, a_t)$, and the stochastic gradient ascent update is,

$$\beta \leftarrow \beta + \alpha_\beta\, \hat{g}_\beta, \qquad \hat{g}_\beta = \frac{1}{n}\sum_{i=1}^{n}\sum_{t=0}^{T-1} \gamma^t \Big( \hat{r}(s^i_t, a^i_t) - r(s^i_t, a^i_t) \Big)$$

where $\alpha_\beta$ is the learning rate, $\hat{g}_\beta$ is the empirical estimate of the gradient of the total return with respect to $\beta$, and $T$ is the truncated trajectory length from the experiments.
As we can see, the gradient of the objective with respect to the mixing coefficient is an average over the difference between the intrinsic and environmental rewards. If the environmental rewards dominate, the update will move the parameter $\beta$ towards favoring learning through exploration more than learning through adaptation, and vice versa.
Since the $\beta$ update is a constrained optimization with the constraint $\beta \in [0, 1]$, we handle the constraint by modelling $\beta$ as the output of a sigmoidal network parameterized by parameters $\phi$, i.e., $\beta = \mathrm{sigmoid}(\phi)$. The constrained optimization can then be equivalently written as an unconstrained optimization w.r.t. $\phi$. The details of the policy and $\beta$ updates are shown in Figure 1.
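A minimal sketch of the constrained update: $\beta = \mathrm{sigmoid}(\phi)$ keeps $\beta$ in $(0, 1)$, and by the chain rule the gradient with respect to $\phi$ scales $dJ/d\beta$ by $\beta(1-\beta)$. The mixing convention $r_{total} = (1-\beta) r + \beta \hat{r}$ and the scalar learning rate are assumptions of this sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_mixing_param(phi, r_env, r_intr, lr=0.01):
    """One gradient-ascent step on the unconstrained parameter phi,
    with beta = sigmoid(phi) enforcing beta in (0, 1).
    Assuming r_total = (1-beta)*r_env + beta*r_intr, dJ/dbeta is the
    average (r_intr - r_env) over the test trajectories, and
    dbeta/dphi = beta * (1 - beta) by the chain rule."""
    beta = sigmoid(phi)
    grad_beta = float(np.mean(np.asarray(r_intr) - np.asarray(r_env)))
    grad_phi = grad_beta * beta * (1.0 - beta)
    return phi + lr * grad_phi
```

When the environment rewards dominate, $\phi$ (and hence $\beta$) decreases, shifting the mixture towards exploration; when adaptation pays off, $\beta$ grows towards following the source.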
Table III: Model parameters for the source and perturbed target environments

Env          | Property             | Source | Target | % Change
Hopper       | Floor Friction       | 1.0    | 2.0    | +100%
HalfCheetah  | Gravity              | 9.81   | 15     | +52%
HalfCheetah  | Total Mass           | 14     | 35     | +150%
HalfCheetah  | Back-Foot Damping    | 3.0    | 1.5    | -50%
HalfCheetah  | Floor Friction       | 0.4    | 0.1    | -75%
Walker2d     | Density              | 1000   | 1500   | +50%
Walker2d     | Right-Foot Friction  | 0.9    | 0.45   | -50%
Walker2d     | Left-Foot Friction   | 1.9    | 1.0    | -47.37%
Table IV: Policy network and training details

                     | Hopper | Walker2d | HalfCheetah
State Space          | 12     | 18       | 17
Action Space         | 3      | 6        | 6
Number of layers     | 3      | 3        | 3
Layer Activation     | tanh   | tanh     | tanh
Network Parameters   | 10530  | 28320    | 26250
Discount             | 0.995  | 0.995    | 0.995
Learning rate ( )    | -      | -        | -
Initial Value ( )    | -      | -        | -
Learning rate ( )    | -      | -        | -
Batch size           | 20     | 20       | 5
Policy Iter          | 3000   | 5000     | 1500
VI Sample Complexity and Optimality
In this section we provide, without proofs, the sample complexity and optimality results for the proposed policy transfer method. The proofs of the following theorems can be found in the Appendix.
VI-A Lower Bounds on Sample Complexity
Although there is some empirical evidence that transfer can improve performance in subsequent reinforcement learning tasks, there are not many theoretical guarantees. Since many of the existing transfer algorithms approach the problem of transfer as a method of providing a good initialization to the target-task RL, we can expect the sample complexity of those algorithms to still be a function of the cardinality of the state-action pairs $|\mathcal{S} \times \mathcal{A}|$. On the other hand, in a supervised learning setting, the theoretical guarantees of most algorithms have no dependency on the size (or dimensionality) of the input domain (which is analogous to $|\mathcal{S} \times \mathcal{A}|$ in RL). Having formulated a policy transfer algorithm using labeled reference trajectories derived from the optimal source policy, we construct a supervised-learning-like PAC property of the proposed method. For deriving the lower bound on the sample complexity of the proposed transfer problem, we consider only the adaptation part of the learning, i.e., the case when $\beta = 1$ in Eq. (7). This is because, in ATL, adaptive learning is akin to supervised learning, since the source reference trajectories provide the target states for every state-action pair.
Suppose we are given the learning problem specified with a training set $\mathcal{D} = \{\hat\tau_1, \dots, \hat\tau_m\}$, where the $\hat\tau_i$ are trajectories independently drawn according to some distribution. Given the data, we can compute the empirical return $\hat{J}(\pi_\theta)$ for every $\pi_\theta \in \Pi$; we will show that the following holds:

$$\left| \hat{J}(\pi_\theta) - J(\pi_\theta) \right| \le \epsilon \qquad (15)$$

with probability at least $1 - \delta$, for some small $\epsilon, \delta$ s.t. $\epsilon, \delta > 0$. We can then claim that the empirical return for all $\pi_\theta \in \Pi$ is a sufficiently accurate estimate of the true return function. Thus a reasonable learning strategy is to find a $\pi_\theta$ that optimizes the empirical estimate of the objective Eq. (7).
Theorem VI.1
If the induced class of the policy $\Pi$ has the uniform convergence property in empirical mean, then empirical risk minimization is PAC, s.t.

$$\Pr\left( \sup_{\pi_\theta \in \Pi} \left| \hat{J}(\pi_\theta) - J(\pi_\theta) \right| \le \epsilon \right) \ge 1 - \delta \qquad (16)$$

and the number of trajectory samples required can be lower bounded as

$$m \ge \frac{1}{2\epsilon^2} \ln\frac{2|\Pi|}{\delta} \qquad (17)$$
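To illustrate the supervised-learning-style dependence on the sample count rather than on $|\mathcal{S} \times \mathcal{A}|$, a textbook Hoeffding-plus-union-bound calculation is sketched below; the constants are the standard ones for returns bounded in an interval of width `r_range`, and are not necessarily those of the theorem:

```python
import math

def hoeffding_sample_bound(eps, delta, r_range=1.0, policy_class_size=1):
    """Number of i.i.d. trajectory returns needed so that, with
    probability >= 1 - delta, every empirical return in a finite policy
    class is within eps of its true mean, for returns bounded in an
    interval of width r_range (Hoeffding + union bound)."""
    m = (r_range ** 2 / (2.0 * eps ** 2)) * math.log(2.0 * policy_class_size / delta)
    return math.ceil(m)
```

Note the bound grows with the accuracy $1/\epsilon^2$ and only logarithmically with the policy class size and $1/\delta$, with no dependence on the state-action cardinality.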
Please refer to the Appendix for the proof of the above theorem.
VI-B Optimality Result under Adaptive Transfer Learning
Consider MDPs $\mathcal{M}_S$ and $\mathcal{M}_T$, which differ in their transition models. For the sake of analysis, let $\bar{\mathcal{M}}$ be the MDP with the ideal transition model, such that the target follows the source transitions precisely. Let $\hat{\mathcal{P}}$ be the transition model achieved using the estimated policy $\pi_{\hat\theta}$ learned over data collected interacting with the target model, and let the associated MDP be denoted $\hat{\mathcal{M}}$. We analyze the optimality of the return under the source optimal policy adapted through ATL.
Definition VI.2
Given the value function and the models $\bar{\mathcal{M}}$ and $\hat{\mathcal{M}}$, which differ only in the corresponding transition models $\bar{\mathcal{P}}$ and $\hat{\mathcal{P}}$, define
Lemma VI.3
Given $\bar{\mathcal{M}}$ and $\hat{\mathcal{M}}$ and the value function, the following bound holds,

where $\bar{\mathcal{P}}$ and $\hat{\mathcal{P}}$ are the transition models of the MDPs $\bar{\mathcal{M}}$ and $\hat{\mathcal{M}}$, respectively.
VII Policy Transfer in Simulated Robotic Locomotion Tasks
To evaluate Adapt-to-Learn policy transfer in reinforcement learning, we design our experiments using sets of tasks based on the continuous control environments in the MuJoCo simulator [32]. Our experimental results demonstrate that ATL can adapt to significant changes in transition dynamics. We perturb the parameters of the simulated target models for the policy transfer experiments (see Table III for the original and perturbed parameters of the target models). To create a challenging training environment, we changed the parameters of the transition model such that the optimal source policy alone, without further learning, cannot produce any stable results (see the source policy performance in Figures 2 & 5). We compare our results against two baselines: (a) initialized RL (initialized PPO, i.e., Warm-Start RL [2]) and (b) standalone reinforcement policy learning (PPO) [28].
We experiment with the ATL algorithm on the Hopper, Walker2d, and HalfCheetah environments. The states of the robots are their generalized positions and velocities, and the actions are joint torques. High dimensionality, non-smooth dynamics due to contact discontinuities, and under-actuated dynamics make these tasks very challenging. We use deep neural networks to represent the source and target policies, the details of which are in Table IV. Learning curves showing the total reward averaged across five runs of each algorithm are provided in Figures 2 & 5. Adapt-to-Learn policy transfer solved all the tasks, yielding quicker learning compared to the other baseline methods. Figure 4 provides the $\beta$ vs. episode plot for the HalfCheetah, Hopper, and Walker2d environments. Recall that the extent of mixing of adaptation and learning from exploration is driven by the mixing coefficient $\beta$. Figure 4 shows that for all these environments with parameter perturbations, the ATL agent found it valuable to follow the source baseline and therefore favored adaptation over exploration in the search for the optimal policy. In cases where the source policy is not good enough to learn the target task, the proposed algorithm seamlessly switches to initialized RL by favoring exploration, as demonstrated in the Crippled-HalfCheetah experiment:
To further test the algorithm's capability to adapt, we evaluate its performance in the Crippled-HalfCheetah environment (Figure 5). With a loss of actuation in the front leg, adaptation using the source policy might not aid learning, and therefore learning from exploration is necessary. We observe that the mixing coefficient $\beta$ evolves closer to zero, indicating that ATL is learning to rely more on exploration rather than following the source policy. This results in a slight performance drop initially, but overall, ATL outperforms the baselines. This shows that in scenarios where adaptation using the source policy does not solve the task efficiently, ATL learns to seamlessly switch to exploring using environmental rewards.
VIII Conclusion
We introduced a new transfer learning technique for RL, Adapt-to-Learn (ATL), that utilizes combined adaptation and learning for policy transfer from source to target tasks. We demonstrated on nonlinear and continuous robotic locomotion tasks that our method leads to a significant reduction in sample complexity over the prevalent warm-start based approaches. We also demonstrated that ATL seamlessly learns to rely on environment rewards when learning from the source does not provide a direct benefit, such as in the Crippled-HalfCheetah environment. There are many exciting directions for future work. A network of policies that can generalize across multiple tasks could be learned based on each newly adapted policy; how to train this end-to-end is an important question. The ability of Adapt-to-Learn to handle significant perturbations to the transition model indicates that it should naturally extend to sim-to-real transfer [21, 26, 35] and cross-domain transfer [33]. Another exciting direction is to extend the work to other combinatorial domains (e.g., multi-player games). We expect, therefore, that follow-on work will find other exciting ways of exploiting such adaptation in RL, especially for robotics and real-world applications.
References
 [1] (2012) Reinforcement learning transfer via sparse coding. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 1, pp. 383–390. Cited by: §A, §II.
 [2] (2015) Unsupervised cross-domain transfer in policy gradient reinforcement learning via manifold alignment. In Proc. of AAAI, Cited by: §A, §II, §III, §IV-A1, §VII.
 [3] (2013) Adaptive control. Courier Corporation. Cited by: §I.
 [4] (2007) General game learning using knowledge transfer.. In IJCAI, pp. 672–677. Cited by: §A, §II.
 [5] (2013) Rapid transfer of controllers between UAVs using learning-based adaptive control. In Robotics and Automation (ICRA), 2013 IEEE International Conference on, pp. 5409–5416. Cited by: §I.
 [6] (2016) Transfer from simulation to real world through learning deep inverse dynamics model. arXiv preprint arXiv:1610.03518. Cited by: §A, §II.
 [7] (2016) Learning transferable policies for monocular reactive MAV control. In International Symposium on Experimental Robotics, pp. 3–11. Cited by: §A, §II.
 [8] (2017) One-shot imitation learning. In Advances in Neural Information Processing Systems, pp. 1087–1098. Cited by: §I, §IV.
 [9] (2011) Understanding observational learning: an interbehavioral approach. The Analysis of Verbal Behavior 27 (1), pp. 191–203. Cited by: §I.
 [10] (2017) Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286. Cited by: §A, §I, §II.
 [11] (2018) PAC reinforcement learning with an imperfect model. In Proc. of AAAI, Cited by: §A-B, §VI-B.
 [12] (2018) Cross-domain transfer in reinforcement learning using target apprentice. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7525–7532. Cited by: §III, §IV-A1.
 [13] (2002) Near-optimal reinforcement learning in polynomial time. Machine Learning 49 (2-3), pp. 209–232. Cited by: §A-B, §VI-B.
 [14] (2011) Human sensorimotor learning: adaptation, skill, and beyond. Current opinion in neurobiology 21 (4), pp. 636–644. Cited by: §I.
 [15] (2013) Guided policy search. In International Conference on Machine Learning, pp. 1–9. Cited by: §I.
 [16] (2016) Learning hand-eye coordination for robotic grasping with large-scale data collection. In International Symposium on Experimental Robotics, pp. 173–184. Cited by: §A, §II.
 [17] (2015) Learning contact-rich manipulation skills with guided policy search. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, pp. 156–163. Cited by: §A, §II.
 [18] (2017) Meta-SGD: learning to learn quickly for few-shot learning. External Links: 1707.09835 Cited by: §IV-A2.
 [19] (2017) Learning to schedule control fragments for physics-based characters using deep Q-learning. ACM Transactions on Graphics (TOG) 36 (3), pp. 29. Cited by: §A, §I, §II.
 [20] (2018) Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7559–7566. Cited by: §A, §II.
 [21] (2017) Sim-to-real transfer of robotic control with dynamics randomization. arXiv preprint arXiv:1710.06537. Cited by: §A, §II, §VIII.
 [22] (2016) Terrain-adaptive locomotion skills using deep reinforcement learning. ACM Transactions on Graphics (TOG) 35 (4), pp. 81. Cited by: §A, §I, §II.
 [23] (2017) DeepLoco: dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Transactions on Graphics (TOG) 36 (4), pp. 41. Cited by: §A, §I, §II.
 [24] (2006) Policy gradient methods for robotics. In Intelligent Robots and Systems, 2006 IEEE/RSJ International Conference on, pp. 2219–2225. Cited by: §A, §II.
 [25] (2008) Natural actor-critic. Neurocomputing 71 (7), pp. 1180–1190. Cited by: §IV-B.

 [26] (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635. Cited by: §A, §II, §VIII.
 [27] (2015) Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897. Cited by: §B, §III, §IV-A1, §IV-B.
 [28] (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §III, §IV-B, §VII.
 [29] (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063. Cited by: §III, §IV-B.
 [30] (2005) Value functions for RL-based behavior transfer: a comparative study. In Proceedings of the National Conference on Artificial Intelligence, Vol. 20, pp. 880. Cited by: §A, §II.
 [31] (2009) Transfer learning for reinforcement learning domains: a survey. Journal of Machine Learning Research 10 (Jul), pp. 1633–1685. Cited by: §A, §II.
 [32] (2012) MuJoCo: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §VII.
 [33] (2009) Manifold alignment without correspondence.. In IJCAI, Vol. 2, pp. 3. Cited by: §III, §VIII.
 [34] (2017) Mutual alignment transfer learning. arXiv preprint arXiv:1707.07907. Cited by: §A, §II.
 [35] (2017) Sim-to-real transfer of accurate grasping with eye-in-hand observations and continuous control. arXiv preprint arXiv:1712.03303. Cited by: §A, §II, §VIII.
 [36] (2018) Dexterous manipulation with deep reinforcement learning: efficient, general, and low-cost. arXiv preprint arXiv:1810.06045. Cited by: §A, §I, §II, §IV.
Appendix A Total Return Gradient with Respect to Policy Parameters
The total return which we aim to maximize in adapting the source policy to the target is a mixture of the environmental reward and the intrinsic KL divergence reward as follows,

$$R(\hat\tau) = \sum_{t=0}^{\infty} \gamma^t \Big( (1-\beta)\, r(s_t, a_t) + \beta\, \hat{r}(s_t, a_t) \Big) \qquad (21)$$

Taking the expectation over the policy and the transition distribution, we can write the above expression as

$$J(\theta) = \mathbb{E}_{\hat\tau \sim (\pi_\theta, \mathcal{P}_T)}\left[\sum_{t=0}^{\infty} \gamma^t \Big( (1-\beta)\, r(s_t, a_t) + \beta\, \hat{r}(s_t, a_t) \Big)\right] \qquad (22)$$

Using the definition of the state value function, the above objective function can be rewritten as

$$J(\theta) = \mathbb{E}_{s_0 \sim \rho_0}\left[ V^{\pi_\theta}(s_0) \right] \qquad (23)$$
The adaptive policy update methods work by computing an estimator of the gradient of the return and plugging it into a stochastic gradient ascent algorithm,

$$\theta_{k+1} = \theta_k + \alpha\, \hat{g} \qquad (24)$$

where $\alpha$ is the learning rate and $\hat{g}$ is the empirical estimate of the gradient of the total discounted return $\nabla_\theta J(\theta)$.
Taking the derivative of the total return term, and using the definition of the state value function in the resulting expression, we can rewrite the gradient of the total return over the policy. As the reward is independent of $\theta$, the expression can be simplified further.
As we can see, the resulting expression has a recursive property involving the value function term. Using the following definition of the discounted state visitation distribution,

$$\rho^{\pi}(s) = \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s \mid \pi) \qquad (28)$$

we can write the gradient of the transfer objective as follows,

$$\nabla_\theta J(\theta) = \sum_{s} \rho^{\pi_\theta}(s) \sum_{a} \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \qquad (29)$$
Considering an off-policy RL update, where $\pi_{\theta_{old}}$ is used for collecting the trajectories over which the state-action value function is estimated, we can rewrite the above gradient for the offline update. Multiplying and dividing Eq. (29) by $\pi_{\theta_{old}}(a \mid s)$, we form a gradient estimate for the offline update,

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^{\pi_{\theta_{old}}},\, a \sim \pi_{\theta_{old}}}\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{old}}(a \mid s)}\, \nabla_\theta \log \pi_\theta(a \mid s)\, Q(s, a) \right] \qquad (30)$$

where the ratio $\pi_\theta / \pi_{\theta_{old}}$ is the importance sampling term, and we have used the identity $\nabla_\theta \pi_\theta = \pi_\theta \nabla_\theta \log \pi_\theta$.
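The estimator in Eq. (30) can be sketched for an illustrative one-dimensional linear-Gaussian policy $a \sim \mathcal{N}(\theta s, \sigma^2)$, for which $\nabla_\theta \log \pi_\theta(a \mid s) = (a - \theta s)\, s / \sigma^2$; all names here are hypothetical:

```python
import numpy as np

def offpolicy_pg_estimate(theta, theta_old, states, actions, q_values, sigma=1.0):
    """Monte Carlo estimate of the off-policy gradient (Eq. 30):
    mean over samples of [pi_theta/pi_old] * grad log pi_theta(a|s) * Q(s,a),
    for a 1-D linear-Gaussian policy a ~ N(theta * s, sigma^2)."""
    states = np.asarray(states, dtype=float)
    actions = np.asarray(actions, dtype=float)
    q_values = np.asarray(q_values, dtype=float)

    def logp(th):
        # Gaussian log-density up to a constant; constants cancel in the ratio
        return -0.5 * ((actions - th * states) / sigma) ** 2

    ratio = np.exp(logp(theta) - logp(theta_old))              # IS weights
    score = (actions - theta * states) * states / sigma ** 2   # grad_theta log pi
    return float(np.mean(ratio * score * q_values))
```

When `theta == theta_old` the ratios are one and this reduces to the on-policy score-function estimator.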
A-A Theoretical Bounds on Sample Complexity
Although there is some empirical evidence that transfer can improve performance in subsequent reinforcement learning tasks, there are not many theoretical guarantees in the literature. Since many of the existing transfer algorithms approach the problem of transfer as a method of providing a good initialization to the target-task RL, we can expect the sample complexity of those algorithms to still be a function of the cardinality of the state-action pairs $|\mathcal{S} \times \mathcal{A}|$. On the other hand, in a supervised learning setting, the theoretical guarantees of most algorithms have no dependency on the size (or dimensionality) of the input domain (which is analogous to $|\mathcal{S} \times \mathcal{A}|$ in RL). Having formulated a policy transfer algorithm using labeled reference trajectories derived from the optimal source policy, we construct a supervised-learning-like PAC property of the proposed method. For deriving the lower bound on the sample complexity of the proposed transfer problem, we consider only the adaptation part of the learning, i.e., the case when $\beta = 1$. This is because, in ATL, adaptive learning is akin to supervised learning, since the source reference trajectories provide the target states for every state-action pair.

Suppose we are given the learning problem specified with a training set $\mathcal{D} = \{\hat\tau_1, \dots, \hat\tau_m\}$, where the $\hat\tau_i$ are trajectories independently drawn according to some distribution. Given the data, we can compute the empirical return $\hat{J}(\pi_\theta)$ for every $\pi_\theta \in \Pi$; we will show that the following holds: