One of the goals of transfer learning is to efficiently learn policies in tasks where sample collection is cheap and then transfer the learned knowledge to tasks where sample collection is expensive. Recent deep reinforcement learning (Deep RL) algorithms require an extensive amount of data, which can be difficult, dangerous, or even impossible to obtain [28, 37, 48, 29]. Practical concerns regarding sample inefficiency make transfer learning a timely problem to solve, especially in the context of RL for robotics. Robots should be able to efficiently transfer knowledge from related tasks to new ones. For instance, consider an assistive robot that learns to feed a patient with a neck problem. The robot cannot learn a sophisticated feeding policy when trained directly with a disabled patient in the loop, because the number of interactions with the patient is limited. Instead, the robot can learn how to feed able-bodied users, for whom data is easier to obtain, and transfer the learned knowledge to the setting with the disabled patient using only a few samples.
We study transfer in the reinforcement learning setting where different tasks are parameterized by their reward function. While this problem and its similar variants have been studied using approaches like meta-RL [13, 45, 30, 15, 19], multitask learning [34, 42], and successor features, fine-tuning as an approach for transfer learning in RL is still not well explored. Fine-tuning is an important method to study for two reasons. First, it is a widely used transfer learning approach that is very well studied in supervised learning [27, 21, 47], but the limits of fine-tuning have been less studied in RL. Second, compared to peer approaches, fine-tuning does not require strong assumptions about the target domain, making it a general and easily applicable approach. Our goal is to broaden our understanding of transfer in RL by exploring when fine-tuning works, when it doesn’t, and how we can overcome its challenges. Concretely, we consider fine-tuning to be more efficient when it requires fewer interaction steps with the target environment.
In this paper, we find that fine-tuning does not always work as expected when transferring between rewards whose corresponding trajectories belong to different homotopy classes. A homotopy class is traditionally defined as a class of trajectories that can be continuously deformed to one another without colliding with any barriers, see Fig. 1 (a). In this work, we generalize the notion of barriers to include any set of states that incur a large negative reward. These states lead to phase transitions (discontinuities) in the reward function. We assume that we know these barriers (and therefore homotopy classes) beforehand, which is equivalent to assuming knowledge of the reward functions. Knowing the reward function a priori is a commonly made assumption in many robotics tasks, such as knowing goals [24, 23, 33] or having domain knowledge of unsafe states beforehand [17, 44]. Also, reinforcement learning algorithms naturally assume that the reward function is available. Generalizing the notion of barriers allows us to go beyond robotics tasks classically associated with homotopy classes, e.g., navigation around barriers, to include tasks like assistive feeding.
Our key insight is that fine-tuning continuously changes policy parameters, which leads to continuously deforming trajectories. Hence, fine-tuning across homotopy classes will induce trajectories that intersect with barriers. This introduces a high loss and gradients that point back to the source policy parameters, so it is difficult to fine-tune the policy parameters across homotopy classes. To address this challenge, we propose a novel Ease-In-Ease-Out fine-tuning approach consisting of two stages: a Relaxing Stage and a Curriculum Learning Stage. In the Relaxing Stage, we relax the barrier constraint by removing it. Then, in the Curriculum Learning Stage, we develop a curriculum starting from the relaxed reward to the target reward that gradually adds the barrier constraint back.
The contributions of the paper are summarized as follows:
- We introduce the idea of using homotopy classes as a way of characterizing the difficulty of fine-tuning in reinforcement learning. We extend the definition of homotopy classes to general cases and demonstrate that fine-tuning across homotopy classes requires more interaction steps with the environment than fine-tuning within the same homotopy class.
- We propose a novel Ease-In-Ease-Out fine-tuning approach that fine-tunes across homotopy classes, and consists of a relaxing and a curriculum learning stage.
- We evaluate Ease-In-Ease-Out fine-tuning on a variety of robotics-inspired environments and show that our approach can learn successful target policies with fewer interaction steps than other fine-tuning approaches.
II Related Work
Batch Spectral Shrinkage penalizes small singular values of model features so that untransferable spectral components are repressed. Progressive Neural Networks (PNN) transfer prior knowledge by merging the source feature into the target feature at the same layer. These works achieve state-of-the-art fine-tuning performance in supervised learning; however, directly applying fine-tuning methods to transfer RL does not necessarily lead to successful results, as supervised learning and reinforcement learning differ in many factors such as access to labeled data or the loss function optimized by each paradigm. We compare our approach with these fine-tuning methods for transfer RL.
In fine-tuning for robotics, a robot usually pre-trains its policy on a general source task, where there is more data available, and then fine-tunes to a specific target task. Recent work in vision-based manipulation shows that fine-tuning for off-policy RL algorithms can successfully adapt to variations in state and dynamics when starting from a general grasping policy. As another example, RoboNet trains models on different robot platforms and fine-tunes them to unseen tasks and robots. A key difference is that our work proposes a systematic approach using homotopy classes for discovering when fine-tuning can succeed or fail. This is very relevant to existing literature in this domain, as our approach can explain why a general policy, e.g., a general grasping policy, can or cannot easily be fine-tuned to more specific settings.
Transfer Reinforcement Learning. There are several lines of work for transfer RL including successor features, meta-RL and multitask learning. We refer the readers to  for a comprehensive survey. We compare these works to our approach below.
Successor Features. Barreto et al. address the same reward transfer problem as ours by learning a universal policy across tasks based on successor features. However, this work makes a number of assumptions about the structure of the reward function and requires that the rewards between source and target tasks be close to each other, while our work has no such constraints.
Meta-RL. Meta learning provides a generalizable model from multiple (meta-training) tasks to quickly adapt to new (meta-test) tasks. There are various meta-RL methods including RNN-based [13, 45, 30], gradient-based [15, 19, 30], or meta-critic approaches. The gradient-based approach, which finds policy parameters (roughly akin to finding a source task) that enable fast adaptation via fine-tuning, is the most related to our work. Note that all meta-RL approaches assume that agents have access to environments or data of meta-training tasks, which is not guaranteed in our setting. Here our focus is to discover when fine-tuning is generally challenging based on homotopy classes. In our experiments we compare our algorithm to core fine-tuning approaches rather than techniques that leverage ideas from fine-tuning or build upon them.
Multitask learning. Other works transfer knowledge by simultaneously learning multiple tasks or goals. In these works, transfer is enabled by learning shared representations of tasks and goals [10, 42, 34, 31]. In our work, we consider the setting where tasks are learned sequentially.
Regularization. Cobbe et al.’s work proposes a metric to quantify the generalization ability of RL algorithms and compares the effects of different regularization techniques on generalization. The paper compares the effects of deeper networks, batch normalization, dropout, L2 regularization, data augmentation, and stochasticity ($\epsilon$-greedy action selection and entropy bonus). The proposed techniques are designed for general-purpose transfer reinforcement learning but are not specially designed for transfer reinforcement learning across homotopy classes. We compare our approach against using deeper networks, dropout, and entropy bonuses in our Navigation and Lunar Lander experiments and show that these techniques alone are not sufficient to transfer across homotopy classes (see supplementary materials).
III Fine-tuning across Homotopy Classes
In transfer reinforcement learning, our goal is to fine-tune from a source task to a target task. We formalize a task using a Markov Decision Process $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, p, \rho_0, R, \gamma \rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $p$ is the transition probability, $\rho_0$ is the initial state distribution, $R$ is the reward function, and $\gamma$ is the discount factor. We denote $\mathcal{M}_s$ as the source task and $\mathcal{M}_t$ as the target task. We assume that $\mathcal{M}_s$ and $\mathcal{M}_t$ only differ on reward function, i.e., $R_s \neq R_t$. These different reward functions across the source and target task can, for instance, capture different preferences or constraints of the agent. A stochastic policy $\pi: \mathcal{S} \times \mathcal{A} \rightarrow [0, 1]$ defines a probability distribution over the action in a given state. The goal of RL is to learn an optimal policy $\pi^*$, which maximizes the expected discounted return $\mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k R(s_k, a_k)\right]$. We define a trajectory to be the sequence of states the agent has visited over time, $\tau = (s_0, s_1, \dots)$, and denote $\tau^*$ to be a trajectory produced by the optimal policy, $\pi^*$. We assume that the optimal policy $\pi_s^*$ for the source environment is available or can be easily learned. Our goal is then to leverage knowledge from $\pi_s^*$ to learn the optimal policy $\pi_t^*$ for task $\mathcal{M}_t$. We aim to learn $\pi_t^*$ with substantially fewer training samples than other comparable fine-tuning approaches.
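As a minimal sketch of this setup, the snippet below scores one fixed state sequence under two different reward functions, mirroring the assumption that the source and target tasks share everything except the reward. The function names, the scalar states, and the two example rewards are illustrative assumptions and do not come from the paper.

```python
from typing import Callable, Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_k gamma^k * r_k for one finite trajectory."""
    total = 0.0
    # Accumulate from the last step backwards: G_k = r_k + gamma * G_{k+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def trajectory_return(states: Sequence[float],
                      reward_fn: Callable[[float], float],
                      gamma: float) -> float:
    """Score a state sequence under a given (hypothetical) reward function."""
    return discounted_return([reward_fn(s) for s in states], gamma)


# Hypothetical source and target rewards: same dynamics, different goals.
def source_reward(s: float) -> float:
    return -abs(s - 2.0)   # prefers states near +2


def target_reward(s: float) -> float:
    return -abs(s + 2.0)   # prefers states near -2


states = [0.0, 1.0, 2.0]   # one trajectory that moves toward +2
source_score = trajectory_return(states, source_reward, gamma=1.0)
target_score = trajectory_return(states, target_reward, gamma=1.0)
```

The same trajectory is good for the source reward and poor for the target reward, which is the situation fine-tuning has to resolve.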
III-A Homotopy Classes
Homotopy classes are formally defined by homotopic trajectories in navigation scenarios in :
Homotopic Trajectories and Homotopy Class. Two trajectories connecting the same initial and end points are homotopic if and only if one can be continuously deformed into the other without intersecting any barriers. Homotopic trajectories are clustered into a homotopy class. (Footnote 1: Even though the presence of a single obstacle introduces infinitely many homotopy classes, in most applications we can work with a finite number of them, which can be formalized by the concept of $\mathbb{Z}_2$-homology. For algorithms that compute these homology classes, see the reference on measuring and computing natural generators for homology groups.)
Fig. 1 (a) illustrates a navigation scenario with two homotopy classes $\mathcal{H}_1$ and $\mathcal{H}_2$ separated by a red barrier. $\tau_1$ and $\tau_2$ can be continuously deformed into each other without intersecting the barrier, and hence are in the same homotopy class.
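For intuition, the classical definition can be checked numerically in the simplest setting: piecewise-linear paths in the plane with a single point obstacle. The sketch below is an illustrative assumption and not the procedure used in the paper. It closes the two paths into a loop and computes the loop's winding number around the obstacle; a winding number of zero means the paths are homotopic. It assumes that no path segment passes through the obstacle.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def _wrap(angle: float) -> float:
    """Wrap an angle difference into [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def winding_number(loop: List[Point], obstacle: Point) -> int:
    """Number of times a closed polygonal loop winds around the obstacle."""
    ox, oy = obstacle
    angles = [math.atan2(y - oy, x - ox) for x, y in loop]
    total = 0.0
    # Sum signed angle changes along each edge, including the closing edge.
    for a, b in zip(angles, angles[1:] + angles[:1]):
        total += _wrap(b - a)
    return round(total / (2.0 * math.pi))


def same_homotopy_class(path_a: List[Point], path_b: List[Point],
                        obstacle: Point) -> bool:
    """Paths share start and end; homotopic iff the closed loop has winding 0."""
    # Follow path_a forward, then path_b backward (dropping shared endpoints).
    loop = list(path_a) + list(reversed(path_b))[1:-1]
    return winding_number(loop, obstacle) == 0


obstacle = (0.0, 0.0)
left = [(0.0, -1.0), (-1.0, 0.0), (0.0, 1.0)]
wide_left = [(0.0, -1.0), (-2.0, 0.0), (0.0, 1.0)]
right = [(0.0, -1.0), (1.0, 0.0), (0.0, 1.0)]

left_vs_wide_left = same_homotopy_class(left, wide_left, obstacle)
left_vs_right = same_homotopy_class(left, right, obstacle)
```

Two paths that both swerve left fall into one class, while a left path and a right path enclose the obstacle and fall into different classes.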
Generalization. The original definition of homotopy classes is limited to navigation scenarios with deterministic trajectories and the same start and end states. We generalize this definition to encompass a wider range of tasks in three ways.
Firstly, we account for tasks where there could be more than one feasible initial and end state. We generalize the initial and end points to sets of states $\mathcal{S}_0$ and $\mathcal{S}_g$, where $\mathcal{S}_0$ contains all the possible starting states and $\mathcal{S}_g$ contains all possible ending states, as shown in Fig. 1 (b).
Secondly, we generalize the notion of a barrier to be a set of states $\mathcal{S}_b$ that are penalized with large negative rewards, $R(s, a) = R'(s, a) - M$ for $s \in \mathcal{S}_b$, where $M$ is a large positive number and $R'$ is the reward without any barriers. Large negative rewards correspond to any negative phase transitions or discrete jumps in the reward function. Importantly, the generalized ‘barrier’ allows us to define homotopy classes in tasks without physical barriers that penalize states with large negative rewards (see our Assistive Feeding experiment). Although source and target tasks differ in reward functions, they share the same barrier states.
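The generalized barrier can be sketched as a wrapper around a barrier-free reward. The names below (`make_barrier_reward`, `step_cost`, the grid cell `(1, 1)`, and the penalty value) are hypothetical and chosen only for illustration.

```python
from typing import Callable, Hashable, Set


def make_barrier_reward(base_reward: Callable[[Hashable, int], float],
                        barrier_states: Set[Hashable],
                        penalty: float) -> Callable[[Hashable, int], float]:
    """Wrap a barrier-free reward so that barrier states lose `penalty`."""
    def reward(state: Hashable, action: int) -> float:
        r = base_reward(state, action)
        if state in barrier_states:
            r -= penalty          # large negative jump inside the barrier set
        return r
    return reward


def step_cost(state: Hashable, action: int) -> float:
    """Assumed barrier-free reward: a constant cost per step."""
    return -1.0


# A single grid cell acts as the barrier in this toy example.
reward = make_barrier_reward(step_cost, barrier_states={(1, 1)}, penalty=100.0)
inside = reward((1, 1), 0)
outside = reward((0, 0), 0)
```

Because the barrier is defined purely through the reward, the same wrapper applies whether the penalized states are physical obstacles or merely undesirable configurations.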
Thirdly, we need to generalize the notion of continuously deforming trajectories to trajectory distributions when considering stochastic policies. We appeal to a distribution distance metric, the Wasserstein-$\infty$ ($W_\infty$) metric, that penalizes jumps (discontinuities) between trajectory distributions induced by stochastic policies. We can now define our generalized notion of homotopic trajectories.
General Homotopic Trajectories. Two trajectories with distributions $\mu_1$ and $\mu_2$, with initial states $s_0 \in \mathcal{S}_0$ and final states $s_g \in \mathcal{S}_g$, are homotopic if and only if one can be continuously deformed into the other in the $W_\infty$ metric without receiving large negative rewards. Definitions for the $W_\infty$ metric and $W_\infty$-continuity are in Section I of the supplementary materials.
General homotopic trajectories are depicted in Fig. 1 (b). The generalized definition of a homotopy class is the set of general homotopic trajectories. Note that using the $W_\infty$ metric is crucial here. Homotopic equivalence of stochastic policies according to other distances like total variation, KL-divergence, or even $W_1$ is usually trivial because distributions that have even a tiny mass on all deterministic homotopy classes become homotopically equivalent. On the other hand, in the $W_\infty$ metric, the distance between distributions that tweak the percentages, even by a small amount, would be at least the minimum distance between trajectories in different deterministic homotopy classes, which is a constant. So to go from one distribution over trajectories to another one with different percentages, one has to make a jump according to the $W_\infty$ metric.
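The contrast between an averaging distance and the worst-case Wasserstein distance can be seen in a one-dimensional stand-in, where each trajectory is summarized by a single number (0 for one homotopy class, 1 for the other). This summary is an illustrative assumption, not the paper's trajectory space. For equal-size samples on the real line, the sorted (monotone) coupling is optimal, so the Wasserstein-1 distance is the mean sorted gap and the Wasserstein-infinity distance is the largest sorted gap.

```python
from typing import Sequence, Tuple


def w1_and_winf(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Wasserstein-1 and Wasserstein-infinity for equal-size 1-D samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    # On the real line the sorted pairing is an optimal transport plan.
    gaps = [abs(a - b) for a, b in zip(sorted(xs), sorted(ys))]
    return sum(gaps) / len(gaps), max(gaps)


# 100 trajectories in class 0 versus 99 in class 0 and one in class 1.
all_left = [0.0] * 100
one_switched = [0.0] * 99 + [1.0]
w1, w_inf = w1_and_winf(all_left, one_switched)
```

Moving 1% of the mass to the other class changes the averaged distance by only 0.01, while the worst-case distance jumps to the full gap between the classes, which is the discontinuity the definition relies on.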
III-B Challenges of Fine-tuning across Homotopy Classes
Running Example. We explain a key optimization issue caused by barriers when fine-tuning across homotopy classes. We illustrate this problem in Fig. 1 (b). An agent must learn to navigate to its goal $\mathcal{S}_g$ without colliding with the barrier. Assuming that the agent only knows how to reach $\mathcal{S}_g$ by swerving right, denoted by $\pi_{\text{right}}$, we want to learn how to reach $\mathcal{S}_g$ by swerving left (i.e., find $\pi_{\text{left}}$).
We show how the barrier prevents fine-tuning from source to target in Fig. 2. This figure depicts the loss landscape for the target task with and without barriers. All policies are parameterized by a single parameter and optimized with the vanilla policy gradient algorithm. Warmer regions indicate higher losses in the target task whereas cooler regions indicate lower losses.
Policies that collide with barriers cause large losses, shown by the hump in Fig. 2 (b). Gradients point away from this large loss region, so it is difficult to cross the hump without a sufficiently large step size. In contrast, in Fig. 2 (a), the loss landscape without the barrier is smooth, so fine-tuning converges easily. Details on the landscape plots are in Section IV of the supplementary materials.
We now formally investigate how discontinuities in trajectory space caused by barriers affect fine-tuning of model-free RL algorithms. We let the model parameterized by $\theta$ induce a policy $\pi_\theta$, and define the loss for the model to be $L(\theta)$. We assume that $L(\theta)$ is high when the expected return is low. This assumption is satisfied in common model-free RL algorithms such as vanilla policy gradient. We optimize our policy using gradient descent with step size $\eta$: $\theta_{k+1} = \theta_k - \eta \nabla_\theta L(\theta_k)$. We can now define what it means to fine-tune from one task to another.
Let $\theta_s^*$ be the optimal set of parameters that minimizes the cost function on the source task. Using $L_t$ as the loss for the target reward, fine-tuning from $R_s$ to $R_t$ for $N$ gradient steps is defined as: $\theta_0 = \theta_s^*$ and $\theta_{k+1} = \theta_k - \eta \nabla_\theta L_t(\theta_k)$ for $k = 0, \dots, N-1$.
We consider a policy $\pi_\theta$ to have successfully fine-tuned to $R_t$ if the received expected return is less than $\epsilon$ away from the expected return of the optimal target policy $\pi_t^*$ for some small $\epsilon$, i.e., $|J_t(\pi_\theta) - J_t(\pi_t^*)| < \epsilon$, where $J_t(\pi)$ denotes the expected discounted return of $\pi$ under $R_t$.
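The effect of the hump on gradient-based fine-tuning can be reproduced in a one-parameter toy problem. The loss below is an assumed stand-in and not the paper's loss landscape: a quadratic bowl whose minimum at 1 plays the role of the target policy, plus a Gaussian hump at 0 that plays the role of the barrier. The source parameters sit at -1, on the other side of the hump. All constants are arbitrary illustrative choices.

```python
import math

HUMP_HEIGHT = 10.0   # assumed size of the barrier penalty in the loss
HUMP_WIDTH = 0.3     # assumed width of the high-loss region


def loss_grad(theta: float, barrier_weight: float) -> float:
    """Gradient of (theta - 1)^2 + barrier_weight * H * exp(-theta^2 / (2 w^2))."""
    hump = HUMP_HEIGHT * math.exp(-theta ** 2 / (2.0 * HUMP_WIDTH ** 2))
    return 2.0 * (theta - 1.0) - barrier_weight * hump * theta / HUMP_WIDTH ** 2


def fine_tune(theta0: float, barrier_weight: float,
              step_size: float = 0.01, num_steps: int = 2000) -> float:
    """Plain gradient descent starting from the source parameters."""
    theta = theta0
    for _ in range(num_steps):
        theta -= step_size * loss_grad(theta, barrier_weight)
    return theta


theta_source = -1.0
with_barrier = fine_tune(theta_source, barrier_weight=1.0)
without_barrier = fine_tune(theta_source, barrier_weight=0.0)
```

With the hump present, the iterate settles in a local minimum on the source side and never reaches the target side; without it, the same procedure converges to the target minimum.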
We now theoretically analyze why it is difficult to fine-tune across homotopy classes. Due to the space limit, we only include our main theorem and remark in the paper. We refer readers to the supplementary materials for the proofs.
$W_\infty$-continuity of policy. A policy $\pi_\theta$ parameterized by $\theta$ is $W_\infty$-continuous if the mapping $\theta \mapsto \pi_\theta$, which maps a vector of parameters in a metric space to a distribution over state-actions, is continuous in the $W_\infty$ metric.
$W_\infty$-continuity of transition probability function. An MDP with transition probability function $p$ is called $W_\infty$-continuous if the mapping $(s, a) \mapsto p(\cdot \mid s, a)$, which maps a state-action pair in a metric space to a distribution over states, is continuous in the $W_\infty$ metric.
Theorem 1. Assume that $\pi_\theta$ is a parametrized policy for an MDP $\mathcal{M}$. If both $\pi_\theta$ and $\mathcal{M}$ are $W_\infty$-continuous, then a continuous change of policy parameters $\theta$ results in a continuous deformation of the induced random trajectory in the $W_\infty$ metric. However, continuous deformations of the trajectories do not ensure continuous changes of their corresponding policy parameters.
Note that the theorem also applies to deterministic policies. For deterministic policies, $W_\infty$-continuity is the same as the classical notion of continuity. Theorem 1 bridges the idea of changes in policy parameters with trajectory deformation. To use this theorem, we need assumptions on the learning rate and bounds on the gradients. Specifically for any and induced by policies in two different homotopy classes, we need to assume: . With a small enough learning rate $\eta$, fine-tuning will always induce trajectories that visit barrier states, $\mathcal{S}_b$.
Intuitively, the conclusion we should reach from Theorem 1 is that fine-tuning model parameters across homotopy classes is more difficult or even infeasible, in terms of the number of interaction steps in the environment, compared to fine-tuning within the same homotopy class; this is under the assumptions that the transition probability function and policy of $\mathcal{M}$ are $W_\infty$-continuous, the learning rate is sufficiently small, and gradients are bounded. (Footnote 2: Modern optimizers and large step sizes can help evade local minima but risk making training unstable when step sizes are too large.)
IV Ease-In-Ease-Out Fine-tuning Approach
Our insight is that even though there are states with large negative rewards that make fine-tuning difficult across homotopy classes, there is still useful information that can be transferred across homotopy classes. Specifically, we first ease in or relax the problem by removing the negative reward associated with barriers, which enables the agent to focus on fine-tuning towards target reward without worrying about large negative rewards. We then ease out by gradually reintroducing the negative reward via a curriculum. We assume the environment is alterable in order to remove and re-introduce barrier states. In most cases, this requires access to a simulator, which is a common assumption in many robotics applications [12, 14, 5, 38]. We assume that during the relaxing stage as well as each subsequent curriculum stage, we are able to converge to an approximately optimal policy for that stage using reinforcement learning.
Ease In: Relaxing Stage. In the relaxing stage, we remove the barrier penalty in the reward function, i.e., $R_{\text{relax}}(s, a) = R'_t(s, a)$ for all $s \in \mathcal{S}$. We denote the target MDP with relaxed reward function as $\mathcal{M}_{\text{relax}}$. Note that we do not physically remove the barriers, so the transition function does not change. We start from $\pi_s^*$ and train the policy in $\mathcal{M}_{\text{relax}}$ to obtain $\pi_{\text{relax}}^*$. The relaxation removes large losses incurred by the barriers, making fine-tuning much easier than naïve fine-tuning.
Ease Out: Curriculum Learning Stage. The relaxing stage finds an optimal policy $\pi_{\text{relax}}^*$ for $\mathcal{M}_{\text{relax}}$. We now need to learn the optimal policy for the original target MDP $\mathcal{M}_t$ that actually penalizes barrier states with a large penalty $M$. We develop two curricula to gradually introduce this penalty.
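(1) Reward Weight (general case). We design a general curriculum that can be used for any environment by gradually increasing the penalty from 0 to $M$ using a series of values $\alpha_0, \dots, \alpha_K$ satisfying $0 = \alpha_0 < \alpha_1 < \dots < \alpha_K = 1$. We redefine our reward function to include intermediary values:
$$R_t(s, a; \alpha_k) = \begin{cases} R'_t(s, a) - \alpha_k M, & s \in \mathcal{S}_b \\ R'_t(s, a), & \text{otherwise} \end{cases}$$
This allows us to define a sequence of corresponding tasks $\mathcal{M}_t^0, \dots, \mathcal{M}_t^K$ where $\mathcal{M}_t^0 = \mathcal{M}_{\text{relax}}$ and $\mathcal{M}_t^K = \mathcal{M}_t$. For each new task $\mathcal{M}_t^k$, we initialize the policy with the previous task’s optimal policy and train it using the reward $R_t(\cdot, \cdot; \alpha_k)$. The detailed algorithm is shown in Algorithm 1 in Section III of the supplementary materials.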
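The two stages can be sketched on a one-parameter toy loss. This is an illustrative assumption and not Algorithm 1 from the supplementary materials: plain gradient descent stands in for the RL training performed at each stage, a quadratic bowl with minimum at 1 stands in for the barrier-free target loss, and a Gaussian hump at 0 scaled by the curriculum weight stands in for the barrier penalty. The constants and the weight schedule are arbitrary.

```python
import math
from typing import Sequence

HUMP_HEIGHT = 10.0   # assumed penalty scale in the loss
HUMP_WIDTH = 0.3     # assumed width of the barrier region


def loss_grad(theta: float, alpha: float) -> float:
    """Gradient of (theta - 1)^2 + alpha * H * exp(-theta^2 / (2 w^2))."""
    hump = HUMP_HEIGHT * math.exp(-theta ** 2 / (2.0 * HUMP_WIDTH ** 2))
    return 2.0 * (theta - 1.0) - alpha * hump * theta / HUMP_WIDTH ** 2


def train(theta: float, alpha: float,
          step_size: float = 0.01, num_steps: int = 2000) -> float:
    """Stand-in for training one stage to convergence."""
    for _ in range(num_steps):
        theta -= step_size * loss_grad(theta, alpha)
    return theta


def ease_in_ease_out(theta_source: float, alphas: Sequence[float]) -> float:
    # Ease in: train with the barrier penalty removed (weight 0).
    theta = train(theta_source, alpha=0.0)
    # Ease out: reintroduce the penalty, warm-starting from the previous stage.
    for alpha in alphas:
        theta = train(theta, alpha)
    return theta


theta_source = -1.0
naive = train(theta_source, alpha=1.0)
staged = ease_in_ease_out(theta_source, alphas=[0.25, 0.5, 0.75, 1.0])
```

In this toy setting, direct fine-tuning remains on the source side of the hump, whereas the staged procedure crosses while the penalty is absent and then stays on the target side as the penalty returns.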
(2) Barrier Set Size. When there is only a single barrier set (i.e., $\mathcal{S}_b$ is connected), we can also build a curriculum around the set itself. Here, we keep the penalty $M$ but gradually increase the set of states that incur this penalty. We can guarantee that our algorithm always converges, as we discuss in our analysis section below.
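To build a curriculum, we can choose any state as our initial set $\mathcal{S}_b^0$ and gradually inflate this set to $\mathcal{S}_b$ by connecting more and more states together. (Footnote 3: A connected path is defined differently for continuous and discrete state spaces. For example, in continuous state spaces, a connected path means a continuous path.) For example, we can connect new states that are within some radius of the current set. This allows us to define a series of connected sets $\mathcal{S}_b^0, \dots, \mathcal{S}_b^K$ satisfying $\mathcal{S}_b^0 \subset \mathcal{S}_b^1 \subset \dots \subset \mathcal{S}_b^K = \mathcal{S}_b$. We can then similarly redefine our reward function and parameterize it by including intermediary barrier sets:
$$R_t(s, a; \mathcal{S}_b^k) = \begin{cases} R'_t(s, a) - M, & s \in \mathcal{S}_b^k \\ R'_t(s, a), & \text{otherwise} \end{cases}$$
Note that the sets $\mathcal{S}_b^k$ only change the reward associated with the states, not the dynamics.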
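One way to picture the inflating barrier set is on a grid, where the barrier is a set of cells and each curriculum stage adds the barrier cells within a fixed Manhattan radius of the current set. The sketch below is an illustrative assumption: the grid representation, the radius rule, and the function names are hypothetical, and choosing a suitable seed is a separate problem that this code does not address.

```python
from typing import List, Set, Tuple

Cell = Tuple[int, int]


def inflate(current: Set[Cell], full: Set[Cell], radius: int) -> Set[Cell]:
    """Add every barrier cell within `radius` (Manhattan) of the current set."""
    grown = set(current)
    for cell in full:
        if any(abs(cell[0] - c[0]) + abs(cell[1] - c[1]) <= radius
               for c in current):
            grown.add(cell)
    return grown


def barrier_curriculum(seed: Cell, full: Set[Cell],
                       radius: int = 1) -> List[Set[Cell]]:
    """Nested barrier sets growing from {seed} to the full barrier."""
    if seed not in full:
        raise ValueError("seed must belong to the barrier set")
    stages = [{seed}]
    while stages[-1] != full:
        nxt = inflate(stages[-1], full, radius)
        if nxt == stages[-1]:
            # No growth: the barrier is not connected at this radius.
            raise ValueError("barrier set is not connected at this radius")
        stages.append(nxt)
    return stages


# A vertical wall of five cells, inflated from its middle cell.
wall = {(0, y) for y in range(5)}
stages = barrier_curriculum(seed=(0, 2), full=wall, radius=1)
```

Each stage is a strict superset of the previous one and the last stage equals the full barrier, so a stage-wise reward only needs to test membership in the current set. The explicit failure on a disconnected barrier reflects the single-barrier assumption stated in the text.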
Curriculum learning by evolving barrier set size is more interpretable and controllable than the general reward weight approach since for each task $\mathcal{M}_t^k$, an agent learns a policy that avoids a subset of states, $\mathcal{S}_b^k$. In the general reward weight approach, it is unclear which states the resulting policy will never visit. A shortcoming of the barrier set size approach is that the convergence guarantee is limited to single barriers because if we have multiple barriers, we may not find an initial set $\mathcal{S}_b^0$ as described in Lemma 4. The algorithm for the barrier set approach follows the same structure as Algorithm 1.
Analysis. For both curriculum learning by reward weight and barrier set size, if the agent can successfully find an optimal policy at every intermediary task, then we can find $\pi_t^*$ for $\mathcal{M}_t$. For the reward weight approach, we cannot prove that at every stage $k$, the optimal policy for $\mathcal{M}_t^k$ is guaranteed to be obtained, but we can still have the following proposition:
For curriculum learning by reward weight, in every stage, the learned policy achieves a higher reward than the initialized policy evaluated on the final target task.
Though the reward weight approach is not guaranteed to achieve the optimal policy in every curriculum step, the policy improves with respect to the final target reward. Each curriculum step is much easier than the original direct fine-tuning problem, which increases the possibility for successful fine-tuning. For the barrier set size approach, we prove that in every stage, the optimal policy for each stage is achievable. To learn an optimal policy in each stage, finding the initial set $\mathcal{S}_b^0$ is key:
There exists that divides the trajectories of and into two homotopy classes.
We propose an approach for finding $\mathcal{S}_b^0$ in Algorithm 2 in Section III of the supplementary materials.
A curriculum starting with $\mathcal{S}_b^0$ as described in Lemma 4 and inflating to $\mathcal{S}_b$ with sufficiently small changes in each step, i.e., small enough for reinforcement learning to find trajectories that should not visit barrier states, can always learn the optimal policy for the final target reward.
We evaluate our approach on four axes of complexity: (1) the size of barrier, (2) the number of barriers, (3) barriers in 3D environments, and (4) barriers that are not represented by physical obstacles but by a set of ‘undesirable’ states.
To evaluate these axes, we use various domains including navigation (Figs. 4, 5), lunar lander (Fig. 6 Left), fetch reach (Fig. 6 Right), mujoco ant (Fig. 7), and an assistive feeding task (Fig. 8). We compare our approach against naïve fine-tuning (Fine-tune) as well as three state-of-the-art fine-tuning approaches: Progressive Neural Networks (PNN), Batch Spectral Shrinkage (BSS), and $L^2$-SP. We also add training on the target task from a random initialization (Random) as a reference, but we do not consider Random as a comparable baseline because it is not a transfer learning algorithm. We evaluate all the experiments using the total number of interaction steps it takes to reach within some small distance of the desired return in the target task. We report the average number of interaction steps in units of 1000 (lower is better); table entries list the mean followed by the standard deviation. We indicate statistically significant differences with baselines by listing the first letter of those baselines. We ran the Navigation (barrier sizes) and Fetch Reach experiments with 5 random seeds and the rest with 10 random seeds. If more than half of the runs exceeded the maximum number of interaction steps without reaching the desired target task reward, we report that the task is unachievable with the maximum number of interaction steps. Finally, we use stochastic policies, which is why our source and target policies may not be symmetrical. Experiment details are in Section V of the supplementary materials.
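The reporting protocol can be sketched as two small helpers. Several details below are assumptions made only for illustration: the threshold test (return at least the desired return minus a tolerance), evaluation at fixed step intervals, averaging over the runs that succeeded, and the function names themselves. The paper's exact evaluation code may differ.

```python
import statistics
from typing import List, Optional, Sequence


def steps_to_threshold(returns_per_eval: Sequence[float], steps_per_eval: int,
                       desired_return: float, tolerance: float) -> Optional[int]:
    """Interaction steps until the return first comes within tolerance."""
    for i, ret in enumerate(returns_per_eval, start=1):
        if ret >= desired_return - tolerance:
            return i * steps_per_eval
    return None   # the run never reached the desired return


def summarize_runs(steps: List[Optional[int]], max_steps: int) -> str:
    """Mean ± std in thousands of steps, or '>max' if most runs failed."""
    failed = sum(1 for s in steps if s is None or s > max_steps)
    if failed > len(steps) / 2:
        return f">{max_steps // 1000}"
    # Assumption: statistics are computed over the successful runs only.
    reached = [s / 1000 for s in steps if s is not None and s <= max_steps]
    mean = statistics.mean(reached)
    std = statistics.stdev(reached) if len(reached) > 1 else 0.0
    return f"{mean:.1f} ± {std:.1f}"


first_hit = steps_to_threshold([0.1, 0.5, 0.95, 1.0], steps_per_eval=1000,
                               desired_return=1.0, tolerance=0.1)
mostly_failed = summarize_runs([None, None, None, 100000, 120000], 256000)
all_reached = summarize_runs([100000, 120000], 256000)
```

Here `mostly_failed` takes the ">max" form used in the tables for unachievable settings, and `all_reached` takes the mean-and-deviation form.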
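Table I: Navigation with varying barrier sizes. Interaction steps in thousands (mean ± standard deviation).
| Method | Barrier size 1 | Barrier size 3 | Barrier size 5 | Barrier size 7 |
| Ours | 117.4 ± 128.6 | 162.6 ± 70.5 | 102.7 ± 87.8 | 112.3 ± 111.3 |
| PNN | 92.2 ± 102 | 138.6 ± 92.1 | 159.8 ± 90.6 | 119.2 ± 125 |
| Fine-tune | 141.1 ± 53 | >256 | 157 ± 100 | 241 ± 27.5 |
| Random | 54.6 ± 61.5 | 88.4 ± 59.4 | 145 ± 74.8 | 77.1 ± 40.6 |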
1. Navigation. We address the first two axes by analyzing our problem under varying barrier sizes and varying number of homotopy classes. We experiment with our running example where an agent must navigate from a fixed start position to the goal set (green area).
Varying Barrier Sizes. We investigate how varying the size of the barrier affects the fine-tuning problem going from Right to Left. Here, we use a one-step curriculum so the barrier set size and reward weight approaches are the same. Table I demonstrates that when barrier sizes are small (1,3), our approach is not the most sample efficient, but remains comparable to other methods. With larger barrier sizes (5, 7), we find that our method requires the least amount of training updates. This result suggests that our approach is especially useful when barriers are large (i.e., fine-tuning is hard). When fine-tuning is easy, simpler approaches like starting from a random initialization can be used.
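Table II: Navigation with four homotopy classes, source task LL. Interaction steps in thousands (mean ± standard deviation).
| Method | LL → LR | LL → RL | LL → RR |
| PNN | 101.9 ± 37.2 | >300 | 119.2 ± 36.4 |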
Four Homotopy Classes. We next investigate how multiple homotopy classes can affect fine-tuning. As shown in Fig. 5, adding a second barrier creates four homotopy classes: LL, LR, RL, and RR. We experiment with both barrier set size and reward weight approaches and report results when using LL as our source task in Table II. Results for using LR, RL, and RR as the source task are included in the supplementary materials. We can observe that the proposed Ease-In-Ease-Out approach outperforms other fine-tuning methods. Having multiple barriers does not satisfy the single barrier assumption, so our reward weight approach performs better on average than the barrier set size approach. Note that in LL → LR, Random performs best, which implies that the task is easy to learn from scratch and no transfer learning is needed. We conclude that while increasing the number of barrier sets can result in a more challenging fine-tuning problem for other methods, it does not negatively affect our approach.
2. Lunar Lander. Before exploring 3D environments that differ significantly from the navigation environment, we conducted an experiment in Lunar Lander. The objective of the game is to land on the ground between the two flags without crashing. As shown in Fig. 6 (Left), this environment is similar to the navigation environments in that we introduce a barrier which creates two homotopy classes: Left and Right. However, the main difference is that the agent is controlled by two lateral thrusters and a main engine.
[Table III: Lunar Lander; columns L → R and R → L]
Results are shown in Table III. We observe that while $L^2$-SP suffers from a large variance and PNN needs many more steps, both our reward weight approach and our barrier set size approach outperform the fine-tuning methods. The reward weight approach has a small standard deviation and performs stably. Note that Random requires a large number of interaction steps, meaning that the landing task is difficult to train from scratch and benefits from transfer reinforcement learning. Our approach significantly reduces the number of steps needed to learn the optimal policy in both directions.
2. Fetch Reach. We address the third axis by evaluating our Ease-In-Ease-Out fine-tuning approach on a more realistic Fetch Reach environment. The Fetch manipulator must fine-tune across homotopy classes in $\mathbb{R}^3$. In the reaching task, the robot needs to reach either the orange or blue tables by stretching right or left respectively. The tables are separated by a wall which creates two homotopy classes, as shown in Fig. 6. Our results are shown in Table IV. We find that our approach was the most efficient compared to baseline methods. One reason why the baselines did not perform well was that the wall’s large size and its proximity to the robot caused the robot to collide often, making it particularly difficult to fine-tune across homotopy classes. We found that even training from a random initialization proved difficult. For this reason, we had to relax the barrier constraint to obtain valid Left and Right source policies.
[Table IV: Fetch Reach; columns L → R and R → L]
3. Mujoco Ant. Finally, we explore whether our algorithm can generalize beyond navigation-like tasks that are traditionally associated with homotopy classes. We demonstrate two examples, Mujoco Ant and Assistive Feeding, where barrier states correspond to undesirable states rather than physical objects. In the Mujoco Ant environment, the barrier states correspond to a set of joint angles that the ant’s upper right leg cannot move to. The boundary of the barrier states is shown by the red lines in Fig. 7. In our source task, the ant moves while its upper right joint angle remains greater than a threshold angle. We call this orientation Down. Our goal is to transfer to the target task where the joint angle is less than a threshold angle, or Up. Results are shown in Table V. We do not evaluate the other direction, Up → Down, because this direction was easy for all of our baselines to begin with, including our own approach. We find that our approach was the most successful in fine-tuning across the set of joint angle barrier states.
[Table V: Mujoco Ant (Down → Up) and Assistive Feeding (Up → Down)]
4. Assistive Gym. We use an assistive feeding environment to create another type of non-physical barrier in the robot’s range of motion. In Fig. 8 (right), we simulate a disabled person who cannot change her head orientation by a large amount. The goal is to feed the person using a spoon. Here, we can easily train a policy on an able-bodied person with a normal head orientation, as in Fig. 8 (left). However, we have limited data for the head orientation of the disabled person (the chin is pointing upwards, as is common in patients who use a head tracking device). To feed a disabled person, the spoon needs to point down, while for an able-bodied person, the spoon needs to point up. The barrier states correspond to holding the spoon in any direction between these two directions when close to the mouth, which may ‘feed’ the food to the user’s nose or chin. This environment is an example of settings with limited data in the target environment, i.e., interacting with the disabled person. It also shows a setting with no physical barriers, where the ‘barrier states’ correspond to the spoon orientations in between, which can be uncomfortable or even unsafe. As shown in Table V, our Ease-In-Ease-Out fine-tuning approach learns the new policy for the disabled person faster than training from scratch, while the other fine-tuning methods fail to learn the target policy.
Summary. We introduce the idea of using homotopy classes to characterize the difficulty of fine-tuning between tasks with different reward functions. We propose a novel Ease-In-Ease-Out fine-tuning method that first relaxes the problem and then forms a curriculum. We extend the notion of homotopy classes, which allows us to go beyond navigation environments and apply our approach to more general robotics tasks. We demonstrate that our method requires fewer samples on a variety of domains and tasks compared to other fine-tuning baselines.
Limitations and Future Work. Our work has a number of limitations. This includes the need for accessing the barrier states a priori. However, our assistive gym example is a step towards considering environments where barrier states are not as clearly defined a priori. In the future, we plan to apply our methods to other robotics domains with non-trivial homotopy classes by directly finding the homotopy classes  and then using our algorithm to fine-tune.
We would like to thank NSF Award Number 2006388 and the DARPA Hicon-Learn project for their support.
References
- (2017) Successor features for transfer in reinforcement learning. In NeurIPS.
- (2004) Reinforcement learning and its relationship to supervised learning. Handbook of Learning and Approximate Dynamic Programming.
- Infinite-horizon policy-gradient estimation. JAIR.
- (2012) Topological constraints in search-based robot path planning. Autonomous Robots.
- (2016) OpenAI Gym. arXiv preprint arXiv:1606.01540.
- (2010) Measuring and computing natural generators for homology groups. Computational Geometry.
- (2019) Catastrophic forgetting meets negative transfer: batch spectral shrinkage for safe transfer learning. In NeurIPS.
- (2019) Quantifying generalization in reinforcement learning. In ICML.
- (2019) RoboNet: large-scale multi-robot learning. arXiv preprint arXiv:1910.11215.
- (2014) Multi-task policy search for robotics. In ICRA.
- (2009) ImageNet: a large-scale hierarchical image database. In CVPR.
- (2017) CARLA: an open urban driving simulator. arXiv preprint arXiv:1711.03938.
- (2016) RL²: fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779.
- (2020) Assistive Gym: a physics simulation framework for assistive robotics. In ICRA.
- (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In ICML.
- Deep spatial autoencoders for visuomotor learning. In ICRA.
- (2012) Safe exploration of state and action spaces in reinforcement learning. JAIR.
- (2018) Robot learning in homes: improving generalization and reducing dataset bias. In NeurIPS.
- (2018) Meta-reinforcement learning of structured exploration strategies. In NeurIPS.
- (2018) Stable Baselines. GitHub. https://github.com/hill-a/stable-baselines
- (2006) Reducing the dimensionality of data with neural networks. Science.
- (2020) Efficient adaptation for end-to-end vision-based robotic manipulation. arXiv preprint arXiv:2004.10190.
- (2018) QT-Opt: scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293.
- (2013) Reinforcement learning in robotics: a survey. IJRR.
- (2012) Autonomous reinforcement learning on raw visual input data in a real world application. In IJCNN.
- (2016) End-to-end training of deep visuomotor policies. JMLR.
- Unsupervised and transfer learning challenge: a deep learning approach. In ICML Workshop on Unsupervised and Transfer Learning.
- (2013) Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
- (2014) Disaster robotics. MIT Press.
- (2018) Learning to adapt in dynamic, real-world environments through meta-reinforcement learning. arXiv preprint arXiv:1803.11347.
- (2018) Visual reinforcement learning with imagined goals. In NeurIPS.
- (2016) Supersizing self-supervision: learning to grasp from 50k tries and 700 robot hours. In ICRA.
- (2018) Multi-goal reinforcement learning: challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464.
- (2017) An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098.
- (2016) Progressive neural networks. arXiv preprint arXiv:1606.04671.
- (2018) Sim2Real viewpoint invariant visual servoing by recurrent control. In CVPR.
- (2017) Deep reinforcement learning framework for autonomous driving. Electronic Imaging.
- (2020) iGibson, a simulation environment for interactive tasks in large realistic scenes. arXiv preprint arXiv:2012.02924.
- (2017) Learning to learn: meta-critic networks for sample efficient learning. arXiv preprint arXiv:1706.09529.
- (2018) Reinforcement learning: an introduction. MIT Press.
- (2009) Transfer learning for reinforcement learning domains: a survey. JMLR.
- (2017) Distral: robust multitask reinforcement learning. In NeurIPS.
- (2012) MuJoCo: a physics engine for model-based control. In IROS.
- (2016) Safe exploration in finite Markov decision processes with Gaussian processes. arXiv preprint arXiv:1606.04753.
- (2016) Learning to reinforcement learn. arXiv preprint arXiv:1611.05763.
- (2018) Explicit inductive bias for transfer learning with convolutional networks. In ICML.
- (2014) How transferable are features in deep neural networks?. In NeurIPS.
- (2017) Target-driven visual navigation in indoor scenes using deep reinforcement learning. In ICRA.
Appendix A Definitions
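$W_\infty$ metric. Given a metric space $(M, d)$, the $W_\infty$ distance between two distributions $\mu$ and $\nu$ on $M$ is defined as follows:
$$W_\infty(\mu, \nu) = \inf_{\gamma \in \Gamma(\mu, \nu)} \sup_{(x, y) \in \mathrm{supp}(\gamma)} d(x, y),$$
where $\Gamma(\mu, \nu)$ represents the set of couplings of $\mu$ and $\nu$, i.e., matchings of probability mass.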
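As a small illustration of the Wasserstein-∞ distance defined above, the sketch below computes it for two one-dimensional empirical distributions with the same number of equally weighted samples. In that special case the sorted (monotone) matching is an optimal coupling, so the distance is the largest gap between sorted samples. The function name `w_inf_1d` and the restriction to 1-D, equal-size samples are assumptions made for this example and are not part of the paper.

```python
import numpy as np


def w_inf_1d(xs, ys):
    """W_inf distance between two 1-D empirical distributions.

    Assumes both sample sets have the same size and uniform weights,
    so that matching sorted samples gives an optimal coupling.
    """
    xs = np.sort(np.asarray(xs, dtype=float))
    ys = np.sort(np.asarray(ys, dtype=float))
    if xs.shape != ys.shape:
        raise ValueError("this sketch requires equally many samples")
    # The coupling pairs the i-th smallest x with the i-th smallest y;
    # W_inf is the largest distance on the support of that coupling.
    return float(np.max(np.abs(xs - ys)))


same = w_inf_1d([0.0, 1.0, 2.0], [2.0, 0.0, 1.0])
shifted = w_inf_1d([0.0, 1.0], [0.5, 3.0])
```

Unlike an average-case distance, a single far-apart matched pair determines the value, which is why a bound on this distance restricts the support of the coupling.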
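$W_\infty$-continuity. A mapping $f$ from a metric space $X$ with any distance function $d$ to a distribution space is continuous in the Wasserstein-$\infty$ metric if and only if $\forall x \in X$ and $\forall \epsilon > 0$, $\exists \delta > 0$ such that if $d(x, x') < \delta$, then $W_\infty(f(x), f(x')) < \epsilon$.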
Appendix B Proofs
Since we are dealing with compact spaces for inputs to our functions, continuity is in general equivalent to uniform continuity. So below we interchangeably use the terms continuity and uniform continuity. We also note that uniform continuity is a weaker assumption than Lipschitz continuity, which is satisfied by neural networks with bounded parameters.
Below we show that the composition of two stochastic functions, each of which is uniformly continuous, is uniformly continuous. A stochastic function maps the input to a random , and can also be viewed as a deterministic function that maps to a p.d.f. over . The composition of and maps input first to a random intermediate output using , and then is fed to to generate the random output . So the composition can also be viewed as the function that maps to the p.d.f. of this composed output . We remark that uniform continuity is implied by continuity as long as we are working over compact spaces. We also remark that the uniform continuity assumption can be relaxed for the inner function in the composition.
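The composition described above can be made concrete in the finite (discrete) case, where a stochastic function is a row-stochastic matrix and the integral over the intermediate output becomes a sum. The sketch below is only an illustration under that assumption; the names `compose`, `f`, `g`, and `h` and the example probabilities are hypothetical.

```python
import numpy as np


def compose(f, g):
    """Compose two discrete stochastic functions.

    f[x, y] is the probability of intermediate output y given input x,
    g[y, z] is the probability of final output z given intermediate y.
    The result h[x, z] = sum_y f[x, y] * g[y, z] is the distribution of
    the composed output, i.e. the discrete analogue of the integral.
    """
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    return f @ g


# Two inputs, two intermediate outputs, three final outputs.
f = np.array([[0.9, 0.1],
              [0.2, 0.8]])
g = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.3, 0.7]])
h = compose(f, g)
```

Each row of `h` is again a probability distribution, which mirrors the remark that integrating one p.d.f. against another yields a p.d.f.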
Lemma 6. Let , be two stochastic functions. Their composition, which we denote by , maps to a distribution whose p.d.f. is defined by where is the p.d.f. (note that is a p.d.f. itself, so the integration results in a p.d.f.). The result of this composition is uniformly continuous as long as are uniformly continuous.
Since both and are uniformly continuous functions, , such that
Now fix an . We need to show that there is a such that implies . By uniform continuity of , we first find a corresponding to , and then treating as the “” for the function above, we find a corresponding . Our goal is to show that implies .
Note that implies . We now take the coupling from the definition of the metric for and which achieves the . Note that we are making a simplifying assumption that the in the definition of is obtained at some particular ; in general, this may not be the case, but tweaking our argument by taking the limit of such proves the general case.
Because , this means the support of is only on pairs such that . Note that because ’s marginals are , we have
By the triangle inequality for the metric we have
But note that is only supported on pairs such that . Uniform continuity of implies that for all such pairs is at most . So we get
Theorem 7. Assume that is a parameterized policy for an MDP . If both and are $W_\infty$-continuous, then a continuous change of policy parameters results in a continuous deformation of the induced random trajectory in the $W_\infty$ metric. However, continuous deformations of the trajectories do not ensure continuous changes of their corresponding policy parameters.
Our strategy is to prove continuity by induction and repeated applications of Lemma 6.
When , is a deterministic constant function with respect to , thus, it is continuous.
For , assume by induction that we have proved is continuous w.r.t. . Our goal is to prove that is also continuous. We do this in two steps. First we obtain continuity of , and then .
For the first step, note that
where is the concatenation of the identity operator and the policy . Since this concatenation results in a continuous function, and is continuous, by Lemma 6, we get continuity of .
For the second step, we compose with the dynamics of . Namely
where is the concatenation of the identity operator and the MDP dynamics . By a similar reasoning, this results in a continuous function of .
For the second half of the theorem, the same trajectory may be induced by different policies, so a continuous change of trajectories can correspond to quite different model parameters.
We assume the action distribution is a type of distribution with parameters , such as a Gaussian distribution with mean and standard deviation. can be regarded as a mapping from the parameter to a distribution. We assume that is continuous, which is satisfied for commonly used distributions such as the Gaussian distribution. So and , such that if , .
The policy of a deep RL model is represented by a neural network parameterized by , which outputs the parameters of the distribution. Every layer of the neural network employs a Lipschitz continuous activation function such as ReLU or tanh. We define the number of layers in the network as and the Lipschitz constant for every layer as . We assume that the norms of the outputs of each layer with respect to all the states in the state space are upper bounded by a constant value, or otherwise the output is infinity. Then we can derive that
If , then , and then . So the action distribution is continuous with respect to the model parameters. Then we derive the continuity of .
Since the is continuous with respect to , then is also continuous. ∎
Remark 9. Fine-tuning deep RL model parameters across homotopy classes is more difficult or even infeasible in terms of sample complexity compared to fine-tuning within the same homotopy class, under the assumptions that the transition probability function and policy of are $W_\infty$-continuous, the learning rate is sufficiently small, and gradients are bounded.
Justification for Remark 9.
Large negative rewards, or barriers, correspond to high loss and create gradients pointing away from the barriers. Looking at Equation (2), the large cost incurred when a trajectory collides with a barrier creates a large negative term. This causes the gradient estimate to be extremely large.
As implied by Theorem 7, fine-tuning across homotopy classes means that trajectories must intersect the barrier at some steps during training if both the policies and the transition function are $W_\infty$-continuous. The policy induced by the deep RL model is $W_\infty$-continuous based on Lemma 8. When a trajectory intersects the barrier, large negative rewards create large gradients that point away from the optimal target policy weights. Thus fine-tuning across homotopy classes will always be blocked by the barriers. In contrast, fine-tuning within a homotopy class can always find a deformation process that does not intersect the barriers.
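To illustrate how a large barrier penalty inflates the gradient estimate, the sketch below uses a single-sample score-function (REINFORCE-style) estimate for a one-dimensional Gaussian policy. This estimator is an assumption chosen for illustration and may differ in form from the paper's Equation (2); the function name and the numeric values are hypothetical.

```python
def reinforce_grad_mu(action, mu, sigma, ret):
    """Single-sample score-function gradient w.r.t. the Gaussian mean.

    For a ~ N(mu, sigma^2), d/dmu log pi(a) = (a - mu) / sigma^2, so the
    estimate is the return times this score.
    """
    return ret * (action - mu) / sigma ** 2


# Same sampled action, once with an ordinary return and once with a
# return dominated by a barrier penalty (values are illustrative only).
g_free = reinforce_grad_mu(action=0.5, mu=0.0, sigma=1.0, ret=-1.0)
g_barrier = reinforce_grad_mu(action=0.5, mu=0.0, sigma=1.0, ret=-1000.0)
```

The two estimates share a direction but differ in magnitude by the ratio of the returns. Because the return is negative, a gradient ascent step moves the mean away from the sampled action, that is, away from the behavior that hit the barrier.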
For curriculum learning by reward weight, in every stage, the learned policy achieves a higher reward than the initialized policy evaluated on the final target task.
We denote by and set as . For the reward weight approach, in every stage , we can write the reward function
as a sum of two parts
can be regarded as the reward function which only penalizes the barrier states. Then if evaluated under , since is trained to maximize the expected return under , achieves higher expected return than . However, if evaluated under , similarly, achieves higher expected return than . Therefore, achieves higher expected return than if evaluated under the reward . Since , achieves higher expected return than under reward , which means that is penalized less than by the barrier states. Now has achieved higher expected return than under both and , and thus has also achieved higher expected return under the final target reward . ∎
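The decomposition used in this proof can be sketched in code under one plausible reading: each stage's reward is a barrier-free reward plus a weighted barrier penalty, with the weight increasing over stages. Under that assumption, the reward at stage k equals the reward at stage k-1 plus a non-negative multiple of the barrier-only penalty. All names, the toy 1-D state, and the weight schedule below are hypothetical, and the exact form used in the paper may differ.

```python
import math


def make_stage_reward(base_reward, barrier_penalty, weight):
    """Reward for one curriculum stage: base reward plus weighted penalty."""
    def reward(state):
        return base_reward(state) + weight * barrier_penalty(state)
    return reward


GOAL = 1.0


def base_reward(state):
    # Barrier-free part: negative distance to a goal on the real line.
    return -abs(state - GOAL)


def barrier_penalty(state):
    # Penalizes only the barrier states (here, an interval around 0).
    return -10.0 if -0.2 <= state <= 0.2 else 0.0


# Assumed schedule: weight 0 is the relaxed task, weight 1 the target task.
weights = [0.0, 0.25, 0.5, 1.0]
stage_rewards = [make_stage_reward(base_reward, barrier_penalty, w) for w in weights]


def decomposition_holds(k, state):
    """Check R_k(s) = R_{k-1}(s) + (w_k - w_{k-1}) * barrier_penalty(s)."""
    lhs = stage_rewards[k](state)
    rhs = stage_rewards[k - 1](state) + (weights[k] - weights[k - 1]) * barrier_penalty(state)
    return math.isclose(lhs, rhs)
```

Because the weight increments are positive, the second term only ever penalizes barrier states more strongly from one stage to the next, which is the property the argument above relies on.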
Lemma 11. There exists that divides the trajectories of and into two homotopy classes.
If there is no such , then the trajectories of and visit no states in and can be continuously deformed to each other without visiting any state in . So and are in the same homotopy class with respect to . Since optimizes the reward and visits no states in , it should be the optimal policy for the target reward . Therefore, the source and target rewards are in the same homotopy class, which contradicts the assumption that they are in different homotopy classes. Thus, there must exist such . ∎
Proposition 12. A curriculum beginning with as described in Lemma 11 is inflated to with sufficiently small changes in each step, i.e., small enough to enable reinforcement learning algorithms to approximate optimal policies whose trajectories do not visit barrier states. This curriculum can always learn the optimal policy for the final target reward in the last step.
We demonstrate the case with only two homotopy classes. If there are multiple homotopy classes, then the obstacle state set contains multiple non-connected subsets, and we can design the to form a sequence of transfer tasks that gradually transfer from the source homotopy class to the target homotopy class, where the two homotopy classes in each transfer task have only one connected obstacle state subset between them. Thus the analysis for two homotopy classes is sufficient, as it can be applied to each transfer task to ensure convergence. Importantly, here we assume the use of reinforcement learning algorithms that can converge to optimal policies for each task in the curriculum. We prove this proposition in two steps:
1) The optimal policy for the first step with is in the same homotopy class as the optimal policy for the final target reward
As shown in Lemma 11, there exists that divides the trajectories of and into two homotopy classes. So in the first stage, the barrier is between and (assuming our reinforcement learning algorithm is able to find the optimal policy ). Since there are only two homotopy classes and and are from different homotopy classes, the trajectories of and have to be from the same homotopy class.
2) In every step , if in the previous step, can be learned and its trajectories are in the same homotopy class as trajectories induced by under , then the current step’s optimal policy can also be learned and its trajectories are from the same homotopy class as trajectories induced by under
In step , the reinforcement learning algorithm needs only a small amount of exploration to prevent the policy from visiting , since we assume that the inflation from to is small enough for reinforcement learning to find trajectories that do not visit states in . Since the last step’s optimal policy is in the same homotopy class as , , which is fine-tuned from , cannot be in the same homotopy class as due to the larger set of barrier states between and . So can only be in the same homotopy class as .
With the above two statements, we can derive that at every step, a policy that does not visit the barrier states is achievable and it is in the same homotopy class as . In the final step, such a policy is exactly .
Appendix C Algorithm
In this section, we first provide our main curriculum learning algorithm in Algorithm 1.
As described in Lemma 11, we have shown that there exists a that divides the trajectories of and into two homotopy classes. This is particularly needed for the Barrier Set Size approach. Lemma 11 does not provide an approach for constructing . Here we propose one approach (Algorithm 2) for finding such an .
To find a desired , we assume that we have two oracle functions available: (1) a collision checker: the function , which outputs true if passes through , and outputs false otherwise; (2) a homotopy class checker: the function , which outputs true if the obstacle set divides and into two different homotopy classes, and outputs false otherwise. These assumptions are shown in the Input in Algorithm 2.
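For intuition, the two oracles can be sketched for a toy planar setting. The sketch assumes trajectories are 2-D polylines with shared start and end points and that the barrier is a single disc; in that case two barrier-avoiding trajectories lie in different homotopy classes exactly when the closed loop formed by one followed by the reverse of the other winds around the disc. The function names and this geometric setting are assumptions for illustration and are not the oracles used in the paper.

```python
import math


def passes_through(traj, center, radius):
    """Collision checker: True if any polyline segment enters the disc."""
    cx, cy = center
    for (x1, y1), (x2, y2) in zip(traj, traj[1:]):
        dx, dy = x2 - x1, y2 - y1
        seg_len2 = dx * dx + dy * dy
        # Parameter of the point on the segment closest to the disc center.
        t = 0.0 if seg_len2 == 0 else max(0.0, min(1.0, ((cx - x1) * dx + (cy - y1) * dy) / seg_len2))
        if math.hypot(x1 + t * dx - cx, y1 + t * dy - cy) <= radius:
            return True
    return False


def winding_number(loop, center):
    """Number of times a closed polyline winds around the center point."""
    cx, cy = center
    total = 0.0
    for (x1, y1), (x2, y2) in zip(loop, loop[1:] + loop[:1]):
        ax, ay, bx, by = x1 - cx, y1 - cy, x2 - cx, y2 - cy
        # Signed angle swept when moving from the first to the second vertex.
        total += math.atan2(ax * by - ay * bx, ax * bx + ay * by)
    return round(total / (2 * math.pi))


def in_different_classes(traj_a, traj_b, center):
    """Homotopy checker for two barrier-avoiding paths with shared endpoints."""
    loop = list(traj_a) + list(reversed(traj_b))
    return winding_number(loop, center) != 0


center, radius = (0.0, 0.0), 0.5
above = [(-2.0, 0.0), (0.0, 2.0), (2.0, 0.0)]
above_alt = [(-2.0, 0.0), (-1.0, 1.5), (1.0, 1.5), (2.0, 0.0)]
below = [(-2.0, 0.0), (0.0, -2.0), (2.0, 0.0)]
straight = [(-2.0, 0.0), (2.0, 0.0)]
```

Here `straight` collides with the disc, `above` and `below` avoid it on opposite sides and so fall in different classes, and `above` and `above_alt` can be deformed into each other without touching the disc.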
First, we observe that the trajectory induced by the optimal relaxed policy passes through the original obstacle , so there exists a subset of that prevents the source optimal trajectory from being continuously deformed to . This means that there exists a subset of that divides and into two different homotopy classes.
Based on this idea, we first construct which divides and into two homotopy classes (see Lines 3-19). We first assign the original to (Line 3). We then cut into two halves, and (Line 5) and update it to one of the two halves based on rules introduced below (see Lines 6-18) until we reach the desired starting barrier