Adaptive Policy Transfer in Reinforcement Learning

05/10/2021 ∙ by Girish Joshi, et al. ∙ University of Illinois at Urbana-Champaign

Efficient and robust policy transfer remains a key challenge for reinforcement learning to become viable for real-world robotics. Policy transfer through warm initialization, imitation, or interaction over a large set of agents with randomized instances has been commonly applied to solve a variety of Reinforcement Learning tasks. However, this is far from how skill transfer happens in the biological world: humans and animals are able to quickly adapt learned behaviors between similar tasks and learn new skills when presented with new situations. Here we seek to answer the question: Will learning to combine adaptation and exploration lead to a more efficient transfer of policies between domains? We introduce a principled mechanism that can "Adapt-to-Learn", that is, adapt the source policy to learn to solve a target task with significant transition differences and uncertainties. We show that the presented method learns to seamlessly combine learning from adaptation and exploration and leads to a robust policy transfer algorithm with significantly reduced sample complexity in transferring skills between related tasks.


I Introduction

The lack of principled mechanisms to quickly and efficiently transfer learned policies when the robot dynamics or tasks change significantly has become a major bottleneck in Reinforcement Learning (RL). This inability to transfer policies is one major reason why RL has still not proliferated into physical applications like robotics [22, 23, 19, 10]. The lack of methods to efficiently transfer robust policies forces the robot to learn to control in isolation and from scratch, even when only a few parameters of the dynamics change across tasks (e.g., walking with a different weight or surface friction). This is both computationally and sample expensive, and it also makes simulation-to-real transfer difficult.

Adapt-to-Learn (ATL) is inspired by the fact that adaptation of behaviors combined with learning through experience is a primary mechanism of learning in biological creatures [14, 9]. Imitation Learning (IL) [8, 36] also seems to play a crucial part in skill transfer, and as such has been widely studied in RL. In control theory, adaptation has typically been studied for tracking problems on deterministic dynamical systems with well-defined reference trajectories [3, 5]. Inspired by learning in the biological world, we seek to answer the question: Will combining adaptation of transferred policies and learning from exploration lead to more efficient policy transfer? In this paper, we show that the ability to adapt and incorporate further learning can be achieved by optimizing over combined environmental and intrinsic adaptation rewards. Learning over the combined reward feedback, we ensure that the agent quickly adapts and also learns to acquire skills beyond what the source policy can teach. Our empirical results show that the presented policy transfer algorithm is able to adapt to significant differences in the transition models which would otherwise cause Imitation Learning or Guided Policy Search (GPS) [15] to fail to produce a stable solution. We posit that the presented method can be the foundation of a broader class of RL algorithms that can choose seamlessly between learning through exploration and supervised adaptation using behavioral imitation.

II Related Work

Deep Reinforcement Learning (D-RL) has recently enabled agents to learn policies for complex robotic tasks in simulation [22, 23, 19, 10]. However, D-RL applications to robotics have been plagued by the curse of sample complexity. Transfer Learning (TL) seeks to mitigate this learning inefficiency of D-RL [31]. A significant body of the literature on transfer in RL focuses on initializing RL in the target domain using a learned source policy, known as jump-start/warm-start methods [30, 1, 2]. Examples of these transfer architectures include transfer between similar tasks [4], transfer from human demonstrations [24], and transfer from simulation to real [21, 26, 35]. However, warm-start based approaches do not always work in transfer for RL, even though similar warm-starts are quite effective in supervised learning. Efforts have also been made to explore accelerated learning directly on real robots, through Guided Policy Search (GPS) [17] and by parallelizing the training across multiple agents using meta-learning [16, 20, 36], but these approaches are prohibitively sample expensive for real-world robotics. Sim-to-real transfers have been widely adopted in recent works and can be viewed as a subset of same-domain transfer problems. Daftry et al. [7] demonstrated policy transfer for the control of aerial vehicles across different vehicle models and environments. Policy transfer from simulation to real using an inverse dynamics model estimated by interacting with the real robot is presented in [6]. Training agents to achieve robust policies across various environments by learning over an adversarial loss is presented in [34]. Here we present a general architecture capable of transferring policies across MDPs with significantly different transition dynamics.

III Preliminaries and Problem Specification

Consider an infinite horizon MDP defined as a tuple $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \rho_0, r, \gamma \rangle$, where $\mathcal{S}$ denotes the set of continuous states; $\mathcal{A}$ is a set of continuous bounded actions; $\mathcal{P}(s_{t+1} \mid s_t, a_t)$ is the state transition probability distribution of reaching $s_{t+1}$ upon taking action $a_t$ in $s_t$; $\rho_0$ is the distribution over initial states; $r$ is a deterministic reward function, s.t. $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$; and $\gamma$ is the discount factor, s.t. $\gamma \in [0, 1)$.

Let $\pi(a \mid s)$ be a stochastic policy over the continuous state and action space. The action from a policy is a draw from the distribution $a_t \sim \pi(\cdot \mid s_t)$. The agent's goal is to find a policy which maximizes the total return. The total return under a policy is given as

$J(\pi) \;=\; \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right],$ (1)

where $s_0 \sim \rho_0$, $a_t \sim \pi(\cdot \mid s_t)$, and $s_{t+1} \sim \mathcal{P}(\cdot \mid s_t, a_t)$.

We will use the standard definitions of the state value function $V^{\pi}(s)$ and the state-action value function $Q^{\pi}(s,a)$ defined under any policy $\pi$.
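For completeness, a sketch of these definitions in our own notation (the paper's original symbols were lost in extraction), using the reward and discount factor defined above:

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty}\gamma^{t}\, r(s_t,a_t) \;\middle|\; s_0=s\right],
\qquad
Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty}\gamma^{t}\, r(s_t,a_t) \;\middle|\; s_0=s,\, a_0=a\right].
```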

With this quick overview of the preliminaries, we now specify the policy transfer problem studied here. Consider a source MDP $\mathcal{M}_s$ and a target MDP $\mathcal{M}_t$, each with its own state-action space and transition model, $(\mathcal{S}_s, \mathcal{A}_s, \mathcal{P}_s)$ and $(\mathcal{S}_t, \mathcal{A}_t, \mathcal{P}_t)$ respectively. We mainly focus on the problem of same-domain transfer in this paper, where the state and action spaces are analogous, $\mathcal{S}_s = \mathcal{S}_t$ and $\mathcal{A}_s = \mathcal{A}_t$, but the source and target state transition models differ significantly due to unmodeled dynamics or external environment interactions. We are concerned with this problem because such policy adaptation in the presence of significant transition model change happens quite often in robotics, e.g., a change in surface friction, the payload carried by the robot, modeling differences, or component deterioration. Note that extension to cross-domain transfer could be achieved with domain alignment techniques such as Manifold Alignment (MA) [33]; see [12] for a model-based policy transfer method for RL with MA.

Let $\pi^{*}_{s}$ be a parameterized optimal stochastic policy for the source MDP $\mathcal{M}_s$. The source policy can be obtained using any available RL method [29, 27, 28]. We use the Proximal Policy Optimization (PPO) [28] algorithm to learn an optimal source policy on the unperturbed source task. A warm-started PPO-TL [2] policy trained on the perturbed target task, together with ideal imitation learning, is used as the set of TL baselines against which our proposed ATL policy is compared.

IV Adaptive Policy Transfer

In this paper, we approach the problem of transfer learning for RL through adaptation of previously learned policies. Our goal in the rest of this paper is to show that an algorithm that can judiciously combine adaptation and learning is capable of largely avoiding brute-force random exploration and can therefore be significantly less sample expensive.

Towards this goal, our approach is to enable the agent to learn the optimal mixture of adaptation (which we term behavioral imitation) and learning from exploration. Our method differs from existing transfer methods that rely on warm starts or policy imitation [36, 8] in a key way: unlike imitation learning, we do not aim to emulate the source optimal policy. Instead, we adapt and encourage a directional search for optimal reward by mimicking the source transitions under the source optimal policy as projected onto the target state space. The details of Adaptive Policy Transfer are presented in the remainder of this section.

IV-A Adapt-to-Learn: Policy Transfer in RL

ATL has two components of policy transfer: adaptation and learning. We begin by mathematically describing our approach of adaptation for policy transfer in RL and state the necessary assumptions in Section IV-A1. We then develop the Adapt-to-Learn algorithm in Section IV-A2 by adding the learning component of random exploration through reward mixing.

IV-A1 Adaptation for policy learning

Adaptation of policies aims to generate a target policy such that, at every state, the target transition approximately mimics the source transition under the source optimal policy as projected onto the target state space. We can loosely state the adaptation objective as requiring the target transition under the learned policy to match the source transition evaluated at the target state projected onto the source manifold. This projection is a bijective Manifold Alignment (MA) mapping between the source and target state-action spaces, and it is required for cross-domain transfers [2, 12]. For ease of exposition, in this paper we focus on same-domain transfer and assume the MA mapping to be the identity.
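Loosely, and writing $\chi$ for the MA mapping from target to source states (our notation, not necessarily the paper's), the adaptation objective asks that

```latex
\mathcal{P}_{t}\!\left(s_{t+1} \mid s_{t},\, a_{t}\right)\Big|_{a_{t}\sim\pi_{\theta}(\cdot\mid s_{t})}
\;\approx\;
\mathcal{P}_{s}\!\left(\chi(s_{t+1}) \mid \chi(s_{t}),\, \pi^{*}_{s}(\chi(s_{t}))\right),
```

with $\chi$ taken to be the identity for the same-domain setting considered here.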

Note that our goal is not to directly imitate the source policy itself, but rather to use the source behavior as a baseline for finding optimal transitions in the target domain. This behavioral imitation objective can be achieved by minimizing the average KL divergence between the point-wise local target trajectory deviation likelihoods and the source transitions under the source optimal policy.

We define the target transition deviation likelihood as the conditional probability of landing in a given reference state when starting from the current state under some action. Unlike the state transition probability, the reference state is not the actually transitioned next state but a random state at which this likelihood is evaluated (and hence it need not carry the time stamp).

The adaptation objective can be formalized as minimizing the average KL divergence [27] between the source and target transition trajectories as follows,

(2)

where the trajectory in the target domain under the policy is defined as the collection of states visited starting from an initial state and making transitions under the target transition model. We now define the probabilities of the two trajectories:

The target-side term is defined as the total likelihood of one-step deviations to the reference states over the trajectory. The reference states are obtained by re-initializing the source simulator at every time step to the states along the trajectory and making optimal transitions using the source optimal policy. The total trajectory deviation likelihood can be expressed as follows,

(3)
Assumption 1

Access to a source simulator with restart capability is assumed. That is, the source transition model can be initialized to any desired state, and under the optimal action the source simulator provides the next transitioned (reference) state.

Using Assumption 1, the source-side term can be defined as the total probability of the piece-wise transitions at every state along the trajectory under the source optimal policy and the source transition model.

(4)
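A sketch of the two trajectory likelihoods in (3) and (4), in our own notation, where $s^{\mathrm{ref}}_{t+1}$ denotes the reference state produced by the source simulator re-initialized at $s_t$:

```latex
P_{\pi_{\theta}}(\hat{\tau}\mid\tau) = \prod_{t=0}^{T-1} \mathcal{P}_{t}\!\left(s^{\mathrm{ref}}_{t+1} \,\middle|\, s_{t},\, a_{t}\right),\; a_{t}\sim\pi_{\theta}(\cdot\mid s_{t}),
\qquad
P_{\pi^{*}_{s}}(\tau^{*}\mid\tau) = \prod_{t=0}^{T-1} \mathcal{P}_{s}\!\left(s^{\mathrm{ref}}_{t+1} \,\middle|\, s_{t},\, a^{*}_{t}\right),\; a^{*}_{t}=\pi^{*}_{s}(s_{t}).
```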

Unlike the conventional treatment of KL terms in the RL literature, the transition probabilities in the above KL divergence are treated as transition likelihoods, and the reference states are treated as random variables. The policy learns to mimic the optimal transitions by minimizing the KL distance between the transition likelihood of the target agent reaching the reference states and the source transition likelihood evaluated at those states.

Using the definitions of the probabilities of the trajectories under the target policy and the source optimal policy (Equations (3) & (4)), the KL divergence over the trajectory can be expressed as follows

(5)
Assumption 2

We assume that the optimal source policy is available and that the probability of taking the optimal action under it is one, i.e., $\pi^{*}_{s}(a^{*}_{t} \mid s_{t}) = 1$ for all $t$.

We make this simplifying Assumption 2 to derive a surrogate objective which lower bounds the true KL loss of Section IV-A1.

The expression for the surrogate objective is defined as follows,

(6)

where the per-step KL term is defined as the intrinsic/adaptation reward.

We need this assumption only for deriving the expression for the intrinsic reward. Assumption 2 allows us to extend the proposed transfer architecture to a deterministic source policy or a source policy derived from a human expert, since a human teacher cannot specify the probability of choosing an action. However, in the empirical evaluation of the algorithm, the source policy used is a stochastic optimal policy. The stochastic nature of such a policy adds natural exploration and also demonstrates that the presented method is not restricted in its application by the above assumption.

IV-A2 Combined Adaptation and Learning

To achieve “Adaptation” and “Learning” simultaneously, we propose to augment the environmental reward with the intrinsic reward. By optimizing over a mixture of intrinsic and environmental returns simultaneously, we obtain transferred policies that both follow the source advice and learn to acquire skills beyond what the source can teach. This trade-off between learning by exploration and learning by adaptation can be realized as follows:

(7)

where the mixing coefficient weighs the two reward terms. The inclusion of the intrinsic reward can be seen as adding a directional search to RL exploration. The source optimal policy and the source transitions provide a possible direction of search for the one-step optimal reward at every state along the trajectory. For consistency of the reward mixing, the two rewards are normalized before forming the total reward.
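A plausible form of the mixing in (7), writing the mixing coefficient as $\beta \in [0,1]$ (our notation) and using bars for the normalized rewards mentioned above:

```latex
r^{\mathrm{mix}}_{t} = (1-\beta)\,\bar{r}^{\mathrm{env}}_{t} + \beta\,\bar{r}^{\mathrm{int}}_{t},
\qquad \beta \in [0,1],
```

so that $\beta \to 1$ favors adaptation (following the source transitions) and $\beta \to 0$ recovers standard RL exploration on the environmental reward. This convention is consistent with the Crippled-HalfCheetah discussion in Section VII, but the exact parameterization is an assumption on our part.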

The optimal mixing coefficient is learned from episodic data collected by interacting with the environment [18]. The details of the hierarchical learning of the mixing coefficient are provided in Section V-B, and its behavior in experiments is analyzed in Section VII.

Assumption 3

The true transition distributions for the source and the target are unknown. However, we assume that both the source and target transition models follow a Gaussian distribution centered at the next propagated state with a fixed variance, which is chosen heuristically.

Although Assumption 3 might look unrealistic, in practice it is not very restrictive. We empirically show that for all the MuJoCo environments in our experiments, a bootstrapped Gaussian transition assumption is good enough for the ATL agent to transfer the policy efficiently.

Using Assumptions 2 & 3, we can simplify the individual KL term (the intrinsic reward) as follows

(8)

The individual terms in the above expectation represent the distance between the two transition likelihoods of landing in a next state when starting from the same state under the respective actions. The target agent is encouraged to take actions that lead to states which are close in expectation to the reference states provided by the optimal baseline policy operating on the source model. The intuition is that by doing so, a possible direction of search for higher environmental rewards is provided.
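Under Assumption 3, taking both likelihoods as Gaussians with the same fixed variance $\sigma^{2}$ (a sketch in our notation), the per-step KL term collapses to a scaled squared distance between the target's propagated next state $\hat{s}_{t+1}$ and the source reference state $s^{\mathrm{ref}}_{t+1}$:

```latex
D_{\mathrm{KL}}\!\left(\mathcal{N}\!\left(\hat{s}_{t+1},\,\sigma^{2}I\right)\,\middle\|\,\mathcal{N}\!\left(s^{\mathrm{ref}}_{t+1},\,\sigma^{2}I\right)\right)
= \frac{1}{2\sigma^{2}}\left\lVert \hat{s}_{t+1}-s^{\mathrm{ref}}_{t+1}\right\rVert_{2}^{2}.
```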

IV-B Actor-Critic

Using the definition of the value function, the objective in (7) can be written as an expectation of the state-action value function under the policy. The expectation can be rewritten as a sum over states and actions as follows:

(9)

where the weighting term is the state visitation distribution [29].

Considering the case of off-policy RL, we use a behavioral policy for generating trajectories. This behavioral policy is different from the policy being optimized. The objective function in off-policy RL measures the total return over the behavioral state visitation distribution, while the mismatch between the training data distribution and the true policy state distribution is compensated by an importance sampling estimator as follows,

(10)

Using the previously updated policy as the behavioral policy, the objective expression can be rewritten as,

(11)

where the behavioral parameter is the parameter before the update and is known to us. An estimated state-action value function replaces the true value function, as the true value function is usually unknown. The mixed reward is used to compute the critic's state-action value function. Further, any policy update algorithm [29, 28, 27, 25] can be used to update the policy in the direction of optimizing this objective.
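A sketch of the importance-sampled surrogate implied by (10)–(11), in our notation, with $\pi_{\theta_{\mathrm{old}}}$ as the behavioral policy, $\rho^{\theta_{\mathrm{old}}}$ its state visitation distribution, and $\hat{Q}^{\mathrm{mix}}$ the critic estimate computed from the mixed reward:

```latex
J(\theta) \approx \mathbb{E}_{s\sim\rho^{\theta_{\mathrm{old}}},\; a\sim\pi_{\theta_{\mathrm{old}}}}
\left[\frac{\pi_{\theta}(a\mid s)}{\pi_{\theta_{\mathrm{old}}}(a\mid s)}\; \hat{Q}^{\mathrm{mix}}(s,a)\right].
```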

Learning over a mixture of intrinsic and environmental rewards helps the directional search for the maximum total return. The source transitions provide a possible direction in which the maximum environmental reward can be achieved. This trade-off between directional search and random exploration is achieved through the mixed reward.

V Optimization of Target Policy

In the previous section, we formulated the Adapt-to-Learn policy transfer objective; we now describe how to derive a practical algorithm from these theoretical foundations under finite sample counts and arbitrary policy parameterizations.

Optimizing the following objective, equivalent to (11), we can generate adaptive policy updates for the target task.

(12)

If calculating the above expectation were feasible, it would be possible to maximize the objective in (11) and move the policy parameters in the direction of achieving a higher total discounted return. However, this is generally not the case since the true expectation is intractable; therefore, a common practice is to use an empirical estimate of the expectation.


1: Input: source optimal policy and source simulator
2: Initialize the target policy. Draw the initial state from the given initial-state distribution of the target task.
3: for each policy iteration do
4:     while the episode has not terminated do
5:         Generate the optimal action using the source policy
6:         Generate the action using the target policy
7:         Apply the target-policy action at the current state in the target task
8:         Apply the source-policy action at the corresponding state in the source simulator
9:         Compute the point-wise KL intrinsic reward
10:        Incrementally store the trajectory for the policy update
11:     Form the empirical loss
12:     Maximize the total return to update the policy
13:     Collect test trajectories using the updated policy
14:     Maximize the total return with respect to the mixing coefficient
Algorithm 1: Adapt-to-Learn Policy Transfer in RL
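A minimal Python sketch of the interaction loop in Algorithm 1, under the Gaussian intrinsic-reward form discussed above. The API names (`target_env.step`, `source_sim.reset_to`, the callable policies) and the reward-mixing form are our assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def kl_intrinsic_reward(s_next_target, s_next_source, sigma=1.0):
    """Point-wise intrinsic reward: negated, scaled squared distance between the
    target's next state and the source reference state (Gaussian KL, Assumption 3)."""
    diff = np.asarray(s_next_target) - np.asarray(s_next_source)
    return -float(diff @ diff) / (2.0 * sigma ** 2)

def collect_episode(target_env, source_sim, source_policy, target_policy,
                    beta, horizon=1000):
    """One episode of combined adaptation + learning data collection (Alg. 1, lines 4-10)."""
    trajectory = []
    s = target_env.reset()
    for _ in range(horizon):
        a_src = source_policy(s)                      # line 5: source-optimal action
        a_tgt = target_policy(s)                      # line 6: stochastic target action
        s_next, r_env, done = target_env.step(a_tgt)  # line 7: step the target task
        source_sim.reset_to(s)                        # Assumption 1: restart capability
        s_ref, _, _ = source_sim.step(a_src)          # line 8: reference next state
        r_int = kl_intrinsic_reward(s_next, s_ref)    # line 9: intrinsic reward
        r_mix = (1.0 - beta) * r_env + beta * r_int   # assumed mixing (Eq. 7)
        trajectory.append((s, a_tgt, r_mix, s_next))  # line 10: store for the update
        if done:
            break
        s = s_next
    return trajectory
```

The stored tuples would then feed the empirical loss and the PPO-style policy update of lines 11–12, with the mixing coefficient updated afterwards on test rollouts (lines 13–14).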

V-A Sample-Based Estimation of the Gradient

The previous section proposed an optimization method to find the adaptive policy using the KL divergence as an intrinsic reward, enforcing the target transition model to mimic the source transitions. This section describes how this objective can be approximated using Monte Carlo simulation. The approximate policy update method works by computing an estimate of the gradient of the return and plugging it into a stochastic gradient ascent algorithm. The gradient estimate over i.i.d. data from the collected trajectories is computed as follows:

where the expectation is taken over the empirical distribution of the collected data. The details of the empirical estimate of the gradient of the total return are provided in the Appendix.

Note that with the importance-sampling-based off-policy update, the state-action value estimates are computed from data collected under the behavioral policy and are therefore independent of the policy parameter being optimized. Hence the gradient of these state-action value estimates with respect to the policy parameter is zero in the above expression for the total gradient.

V-B Learning Mixing Coefficient from Data

A hierarchical update of the mixing coefficient is carried out over n test trajectories collected using the updated policy network, i.e., using the policy parameters obtained after the policy update step. The mixing coefficient is learnt by optimizing the return over these test trajectories using stochastic gradient ascent, with a suitable learning rate and an empirical estimate of the gradient of the total return computed from the collected data.

Using the definition of the mixed reward, the gradient of the objective with respect to the mixing coefficient simplifies to an average, over the truncated test trajectories, of the difference between the environmental and intrinsic rewards. Depending on the sign of this difference, the update moves the mixing coefficient towards favoring learning through exploration over learning through adaptation, or vice versa.

Since the mixing coefficient must lie in [0, 1], its update is a constrained optimization. We handle this constraint by modelling the mixing coefficient as the output of a sigmoidal unit with an unconstrained parameter, so the constrained optimization can be equivalently written as an unconstrained optimization with respect to that parameter.
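A minimal sketch of this hierarchical update in Python, assuming the mixed reward takes the form $(1-\beta)\,r^{\mathrm{env}} + \beta\,r^{\mathrm{int}}$ and that $\beta$ is parameterized as a sigmoid of an unconstrained scalar $w$ (both are our assumptions; the names below are hypothetical):

```python
import numpy as np

def sigmoid(w):
    return 1.0 / (1.0 + np.exp(-w))

def update_mixing_coefficient(w, test_trajectories, lr=1e-2):
    """Gradient-ascent step on the mixed return with respect to the sigmoid
    parameter w, so that beta = sigmoid(w) stays in [0, 1].

    test_trajectories: list of rollouts, each a list of (r_env, r_int) pairs
    collected with the freshly updated policy.
    """
    beta = sigmoid(w)
    # d(r_mix)/d(beta) = r_int - r_env under the assumed mixing; averaging over
    # the truncated test trajectories gives the empirical gradient w.r.t. beta.
    diffs = [r_int - r_env
             for rollout in test_trajectories
             for (r_env, r_int) in rollout]
    grad_beta = float(np.mean(diffs))
    grad_w = grad_beta * beta * (1.0 - beta)   # chain rule through the sigmoid
    w_new = w + lr * grad_w                    # stochastic gradient ascent
    return w_new, sigmoid(w_new)
```

Under this parameterization, when environmental rewards dominate, the gradient is negative and the coefficient drifts toward zero (exploration); when intrinsic rewards dominate, it drifts toward one (adaptation).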

The details of the policy and mixing coefficient updates are shown in Figure 1.

Fig. 1: Total policy update scheme for learning the target task. The hyper-parameters are updated over test trajectories generated using the updated policy parameters.
Env          Property               Source   Target   %Change
Hopper       Floor Friction         1.0      2.0      +100%
HalfCheetah  Gravity                -9.81    -15      +52%
             Total Mass             14       35       +150%
             Back-Foot Damping      3.0      1.5      -100%
             Floor Friction         0.4      0.1      -75%
Walker2d     Density                1000     1500     +50%
             Right-Foot Friction    0.9      0.45     -50%
             Left-Foot Friction     1.9      1.0      -47.37%
TABLE I: Transition model and environment properties for the source and target tasks, and % change
                                     Hopper   Walker2d   HalfCheetah
State Space                          12       18         17
Action Space                         3        6          6
Number of layers                     3        3          3
Layer Activation                     tanh     tanh       tanh
Network Parameters                   10530    28320      26250
Discount                             0.995    0.995      0.995
Learning rate
Initial Value
Learning rate (mixing coefficient)
Batch size                           20       20         5
Policy Iter                          3000     5000       1500
TABLE II: Policy network details and network learning parameters

VI Sample Complexity and ε-Optimality

In this section we provide, without proofs, the sample complexity and ε-optimality results for the proposed policy transfer method. The proofs of the following theorems can be found in the Appendix.

VI-A Lower bounds on Sample Complexity

Although there is some empirical evidence that transfer can improve performance in subsequent reinforcement-learning tasks, there are not many theoretical guarantees. Since many existing transfer algorithms approach transfer as a method of providing a good initialization to the target-task RL, we can expect the sample complexity of those algorithms to still be a function of the cardinality of the state-action space. On the other hand, in a supervised learning setting, the theoretical guarantees of most algorithms have no dependency on the size (or dimensionality) of the input domain (which is analogous to the state-action space in RL). Having formulated a policy transfer algorithm using labeled reference trajectories derived from the optimal source policy, we construct a supervised-learning-like PAC property for the proposed method. For deriving the lower bound on the sample complexity of the proposed transfer problem, we consider only the adaptation part of the learning, i.e., the case where the mixing in Eq. (7) places all weight on the intrinsic reward. This is because, in ATL, adaptive learning is akin to supervised learning, since the source reference trajectories provide the reference target states.

Suppose we are given the learning problem specified by a training set of trajectories, each drawn independently according to some distribution. Given these data we can compute the empirical return for every policy in the policy class, and we will show that the following holds:

(15)

with probability at least $1-\delta$, for some small $\epsilon > 0$ and $\delta \in (0,1)$. We can then claim that the empirical return for every policy in the class is a sufficiently accurate estimate of the true return. Thus a reasonable learning strategy is to find a policy that minimizes the empirical estimate of the objective in Eq. (7).

Theorem VI.1

If the induced class of policy returns has the uniform convergence property in empirical mean, then empirical risk minimization is PAC, such that

(16)

and the number of trajectory samples required can be lower bounded as

(17)

Please refer to the Appendix for the proof of the above theorem.
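For intuition, a standard Hoeffding-style uniform-convergence bound of the type invoked in (16)–(17) looks as follows (a sketch assuming a finite policy class $\Pi$, returns bounded in $[0, R_{\max}]$, and $m$ i.i.d. trajectories; the paper's exact constants may differ):

```latex
\Pr\!\left(\sup_{\pi\in\Pi}\bigl|\hat{J}_{m}(\pi)-J(\pi)\bigr|\le\epsilon\right)\ge 1-\delta
\quad\text{whenever}\quad
m \;\ge\; \frac{R_{\max}^{2}}{2\epsilon^{2}}\,\ln\frac{2|\Pi|}{\delta}.
```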

VI-B ε-Optimality result under Adaptive Transfer Learning

Consider MDPs $\mathcal{M}^{*}$ and $\hat{\mathcal{M}}$, which differ in their transition models. For the sake of analysis, let $\mathcal{M}^{*}$ be the MDP with the ideal transition model, such that the target follows the source transitions precisely. Let $\hat{\mathcal{P}}$ be the transition model achieved using the estimated policy learned from data collected by interacting with the target model, and let the associated MDP be denoted $\hat{\mathcal{M}}$. We analyze the ε-optimality of the return under the source optimal policy adapted through ATL.

Definition VI.2

Given a value function and the models $\mathcal{M}^{*}$ and $\hat{\mathcal{M}}$, which differ only in their corresponding transition models $\mathcal{P}^{*}$ and $\hat{\mathcal{P}}$, define:

Lemma VI.3

Given $\mathcal{M}^{*}$, $\hat{\mathcal{M}}$, and a value function, the following bound holds

where $\mathcal{P}^{*}$ and $\hat{\mathcal{P}}$ are the transition models of the MDPs $\mathcal{M}^{*}$ and $\hat{\mathcal{M}}$, respectively.

The proof of this lemma is based on the simulation lemma [13]; please refer to the Appendix for the proof. Similar results for RL with imperfect models were reported in [11].
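For reference, a simulation-lemma-style bound [13] of this type reads as follows (a sketch in our notation, with $R_{\max}$ a bound on the reward; this is not the paper's exact statement):

```latex
\bigl\lVert V^{\pi}_{\mathcal{M}^{*}} - V^{\pi}_{\hat{\mathcal{M}}} \bigr\rVert_{\infty}
\;\le\; \frac{\gamma\, R_{\max}}{(1-\gamma)^{2}}\;
\max_{s,a}\,\bigl\lVert \mathcal{P}^{*}(\cdot\mid s,a)-\hat{\mathcal{P}}(\cdot\mid s,a)\bigr\rVert_{1}.
```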

VII Policy transfer in simulated robotic locomotion tasks

Fig. 2: Learning curves for locomotion tasks, averaged across five runs of each algorithm: Adapt-to-Learn (Ours), Randomly Initialized RL (PPO), Warm-Started PPO using source policy parameters, and best-case imitation learning using the source policy directly on the target task without any adaptation.
Fig. 3: Trajectory KL divergence (total intrinsic return), averaged across five runs.
Fig. 4: Mixing coefficient over one run for the HalfCheetah, Hopper, and Walker2d environments.
Fig. 5: (a) Learning curves for the Crippled-Fat HalfCheetah environment, averaged across five runs: Adapt-to-Learn (ATL, Ours), Randomly Initialized RL (PPO), and Warm-Started PPO using source policy parameters. (b) Average intrinsic reward for ATL. (c) Mixing coefficient.

To evaluate Adapt-to-Learn policy transfer in reinforcement learning, we design our experiments using sets of tasks based on the continuous control environments in the MuJoCo simulator [32]. Our experimental results demonstrate that ATL can adapt to significant changes in the transition dynamics. We perturb the parameters of the simulated target models for the policy transfer experiments (see Table I for the original and perturbed parameters of the target model). To create a challenging training environment, we changed the parameters of the transition model such that the optimal source policy alone, without further learning, cannot produce any stable results (see the source policy performance in Figures 2 & 5). We compare our results against two baselines: (a) initialized RL (warm-started PPO [2]) and (b) stand-alone reinforcement policy learning (PPO) [28].

We experiment with the ATL algorithm on the Hopper, Walker2d, and HalfCheetah environments. The states of the robots are their generalized positions and velocities, and the actions are joint torques. High dimensionality, non-smooth dynamics due to contact discontinuities, and under-actuated dynamics make these tasks very challenging. We use deep neural networks to represent the source and target policies, the details of which are given in Table II.

Learning curves showing the total reward averaged across five runs of each algorithm are provided in Figures 2 & 5. Adapt-to-Learn policy transfer solved all the tasks, yielding quicker learning compared to the other baseline methods. Figure 4 provides the mixing coefficient vs. episode plot for the HalfCheetah, Hopper, and Walker2d environments. Recall that the extent of mixing of adaptation and learning from exploration is driven by the mixing coefficient. Figure 4 shows that for all these environments with parameter perturbations, the ATL agent found it valuable to follow the source baseline and therefore favored adaptation over exploration in the search for the optimal policy. In cases where the source policy is not good enough to learn the target task, the proposed algorithm seamlessly switches to initialized RL by favoring exploration, as demonstrated in the Crippled-HalfCheetah experiment.

To further test the algorithm's capability to adapt, we evaluate its performance in the Crippled-HalfCheetah environment (Figure 5). With a loss of actuation in the front leg, adaptation using the source policy might not aid learning, and therefore learning from exploration is necessary. We observe the mixing coefficient evolving closer to zero, indicating that ATL learns to rely more on exploration rather than following the source policy. This results in a slight performance drop initially, but overall ATL outperforms the baselines. This shows that in scenarios where adaptation using the source policy does not solve the task efficiently, ATL learns to seamlessly switch to exploring using environmental rewards.

VIII Conclusion

We introduced a new transfer learning technique for RL, Adapt-to-Learn (ATL), that utilizes combined adaptation and learning for policy transfer from source to target tasks. We demonstrated on nonlinear and continuous robotic locomotion tasks that our method leads to a significant reduction in sample complexity over the prevalent warm-start based approaches. We also demonstrated that ATL seamlessly learns to rely on environmental rewards when learning from the source does not provide a direct benefit, such as in the Crippled-HalfCheetah environment. There are many exciting directions for future work. A network of policies that can generalize across multiple tasks could be learned based on each newly adapted policy; how to train this end-to-end is an important question. The ability of Adapt-to-Learn to handle significant perturbations to the transition model indicates that it should naturally extend to sim-to-real transfer [21, 26, 35] and cross-domain transfer [33]. Another exciting direction is to extend the work to other combinatorial domains (e.g., multiplayer games). We expect, therefore, that follow-on work will find other exciting ways of exploiting such adaptation in RL, especially for robotics and real-world applications.

References

  • [1] H. B. Ammar, K. Tuyls, M. E. Taylor, K. Driessens, and G. Weiss (2012) Reinforcement learning transfer via sparse coding. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems-Volume 1, pp. 383–390. Cited by: §-A, §II.
  • [2] H. B. Ammar, E. Eaton, P. Ruvolo, and M. E. Taylor (2015) Unsupervised cross-domain transfer in policy gradient reinforcement learning via manifold alignment. In Proc. of AAAI, Cited by: §-A, §II, §III, §IV-A1, §VII.
  • [3] K. J. Åström and B. Wittenmark (2013) Adaptive control. Courier Corporation. Cited by: §I.
  • [4] B. Banerjee and P. Stone (2007) General game learning using knowledge transfer.. In IJCAI, pp. 672–677. Cited by: §-A, §II.
  • [5] G. Chowdhary, T. Wu, M. Cutler, and J. P. How (2013) Rapid transfer of controllers between uavs using learning-based adaptive control. In Robotics and Automation (ICRA), 2013 IEEE International Conference on, pp. 5409–5416. Cited by: §I.
  • [6] P. Christiano, Z. Shah, I. Mordatch, J. Schneider, T. Blackwell, J. Tobin, P. Abbeel, and W. Zaremba (2016) Transfer from simulation to real world through learning deep inverse dynamics model. arXiv preprint arXiv:1610.03518. Cited by: §-A, §II.
  • [7] S. Daftry, J. A. Bagnell, and M. Hebert (2016) Learning transferable policies for monocular reactive mav control. In International Symposium on Experimental Robotics, pp. 3–11. Cited by: §-A, §II.
  • [8] Y. Duan, M. Andrychowicz, B. Stadie, O. J. Ho, J. Schneider, I. Sutskever, P. Abbeel, and W. Zaremba (2017) One-shot imitation learning. In Advances in neural information processing systems, pp. 1087–1098. Cited by: §I, §IV.
  • [9] M. J. Fryling, C. Johnston, and L. J. Hayes (2011) Understanding observational learning: an interbehavioral approach. The Analysis of Verbal Behavior 27 (1), pp. 191–203. Cited by: §I.
  • [10] N. Heess, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, A. Eslami, M. Riedmiller, et al. (2017) Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286. Cited by: §-A, §I, §II.
  • [11] N. Jiang (2018) PAC reinforcement learning with an imperfect model. In Proc. of AAAI, Cited by: §A-B, §VI-B.
  • [12] G. Joshi and G. Chowdhary (2018) Cross-domain transfer in reinforcement learning using target apprentice. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7525–7532. Cited by: §III, §IV-A1.
  • [13] M. Kearns and S. Singh (2002) Near-optimal reinforcement learning in polynomial time. Machine Learning 49 (2-3), pp. 209–232. Cited by: §A-B, §VI-B.
  • [14] J. W. Krakauer and P. Mazzoni (2011) Human sensorimotor learning: adaptation, skill, and beyond. Current opinion in neurobiology 21 (4), pp. 636–644. Cited by: §I.
  • [15] S. Levine and V. Koltun (2013) Guided policy search. In International Conference on Machine Learning, pp. 1–9. Cited by: §I.
  • [16] S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen (2016) Learning hand-eye coordination for robotic grasping with large-scale data collection. In International Symposium on Experimental Robotics, pp. 173–184. Cited by: §-A, §II.
  • [17] S. Levine, N. Wagener, and P. Abbeel (2015) Learning contact-rich manipulation skills with guided policy search. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, pp. 156–163. Cited by: §-A, §II.
  • [18] Z. Li, F. Zhou, F. Chen, and H. Li (2017) Meta-sgd: learning to learn quickly for few-shot learning. External Links: 1707.09835 Cited by: §IV-A2.
  • [19] L. Liu and J. Hodgins (2017) Learning to schedule control fragments for physics-based characters using deep q-learning. ACM Transactions on Graphics (TOG) 36 (3), pp. 29. Cited by: §-A, §I, §II.
  • [20] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine (2018) Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7559–7566. Cited by: §-A, §II.
  • [21] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel (2017) Sim-to-real transfer of robotic control with dynamics randomization. arXiv preprint arXiv:1710.06537. Cited by: §-A, §II, §VIII.
  • [22] X. B. Peng, G. Berseth, and M. Van de Panne (2016) Terrain-adaptive locomotion skills using deep reinforcement learning. ACM Transactions on Graphics (TOG) 35 (4), pp. 81. Cited by: §-A, §I, §II.
  • [23] X. B. Peng, G. Berseth, K. Yin, and M. Van De Panne (2017) Deeploco: dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Transactions on Graphics (TOG) 36 (4), pp. 41. Cited by: §-A, §I, §II.
  • [24] J. Peters and S. Schaal (2006) Policy gradient methods for robotics. In Intelligent Robots and Systems, 2006 IEEE/RSJ International Conference on, pp. 2219–2225. Cited by: §-A, §II.
  • [25] J. Peters and S. Schaal (2008) Natural actor-critic. Neurocomputing 71 (7), pp. 1180–1190. Cited by: §IV-B.
  • [26] S. Ross, G. Gordon, and D. Bagnell (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627–635. Cited by: §-A, §II, §VIII.
  • [27] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015) Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897. Cited by: §-B, §III, §IV-A1, §IV-B.
  • [28] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §III, §IV-B, §VII.
  • [29] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063. Cited by: §III, §IV-B, §IV-B.
  • [30] M. E. Taylor, P. Stone, and Y. Liu (2005) Value functions for rl-based behavior transfer: a comparative study. In Proceedings of the National Conference on Artificial Intelligence, Vol. 20, pp. 880. Cited by: §-A, §II.
  • [31] M. E. Taylor and P. Stone (2009) Transfer learning for reinforcement learning domains: a survey. Journal of Machine Learning Research 10 (Jul), pp. 1633–1685. Cited by: §-A, §II.
  • [32] E. Todorov, T. Erez, and Y. Tassa (2012) Mujoco: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §VII.
  • [33] C. Wang and S. Mahadevan (2009) Manifold alignment without correspondence.. In IJCAI, Vol. 2, pp. 3. Cited by: §III, §VIII.
  • [34] M. Wulfmeier, I. Posner, and P. Abbeel (2017) Mutual alignment transfer learning. arXiv preprint arXiv:1707.07907. Cited by: §-A, §II.
  • [35] M. Yan, I. Frosio, S. Tyree, and J. Kautz (2017) Sim-to-real transfer of accurate grasping with eye-in-hand observations and continuous control. arXiv preprint arXiv:1712.03303. Cited by: §-A, §II, §VIII.
  • [36] H. Zhu, A. Gupta, A. Rajeswaran, S. Levine, and V. Kumar (2018) Dexterous manipulation with deep reinforcement learning: efficient, general, and low-cost. arXiv preprint arXiv:1810.06045. Cited by: §-A, §I, §II, §IV.

Appendix A Total return gradient with respect to policy parameters

The total return which we aim to maximize in adapting the source policy to the target is a mixture of the environmental reward and the intrinsic KL divergence reward, as follows,

(21)

Taking the expectation over the policy and the transition distribution, we can write the above expression as

(22)

Using the definition of the state-value function, the above objective function can be rewritten as

(23)

The adaptive policy update methods work by computing an estimator of the gradient of the return and plugging it into a stochastic gradient ascent algorithm,

(24)

where the step size is the learning rate and the update direction is the empirical estimate of the gradient of the total discounted return.

Taking the derivative of the total return term,

and using the following definition in the above expression,

we can rewrite the gradient of the total return over the policy as,

As the reward is independent of the policy parameters, the above expression can be simplified and rewritten as

As we can see, the above expression has a recursive property involving the value-gradient term. Using the following definition of the discounted state visitation distribution,

(28)

we can write the gradient of the transfer objective as follows,

(29)

Considering an off-policy RL update, where a behavioral policy is used for collecting the trajectories over which the state-value function is estimated, we can rewrite the above gradient for the offline update as follows,

Multiplying and dividing Eq. (29) by the behavioral policy and its state visitation distribution, we form a gradient estimate for the offline update,

(30)

where the ratio of the target and behavioral policy probabilities is the importance sampling term; using the identity $\nabla_\theta \pi_\theta(a \mid s) = \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)$, the above expression can be rewritten accordingly.

A-A Theoretical bounds on sample complexity

Although there is some empirical evidence that transfer can improve performance in subsequent reinforcement-learning tasks, there are not many theoretical guarantees in the literature. Since many existing transfer algorithms approach transfer as a method of providing a good initialization to the target-task RL, we can expect the sample complexity of those algorithms to still be a function of the cardinality of the state-action space. On the other hand, in a supervised learning setting, the theoretical guarantees of most algorithms have no dependency on the size (or dimensionality) of the input domain (which is analogous to the state-action space in RL). Having formulated a policy transfer algorithm using labeled reference trajectories derived from the optimal source policy, we construct a supervised-learning-like PAC property for the proposed method. For deriving the lower bound on the sample complexity of the proposed transfer problem, we consider only the adaptation part of the learning, i.e., the case where the mixing in Eq. (7) places all weight on the intrinsic reward. This is because, in ATL, adaptive learning is akin to supervised learning, since the source reference trajectories provide the reference target states for every state-action pair.

Suppose we are given the learning problem specified by a training set of trajectories, each drawn independently according to some distribution. Given these data we can compute the empirical return for every policy in the policy class, and we will show that the following holds: