I Introduction
One of the goals of transfer learning is to efficiently learn policies in tasks where sample collection is cheap and then transfer the learned knowledge to tasks where sample collection is expensive. Recent deep reinforcement learning (deep RL) algorithms require extensive amounts of data, which can be difficult, dangerous, or even impossible to obtain [28, 37, 48, 29]. Practical concerns about sample inefficiency make transfer learning a timely problem to solve, especially in the context of RL for robotics. Robots should be able to efficiently transfer knowledge from related tasks to new ones. For instance, consider an assistive robot that learns to feed a patient with a neck problem. The robot cannot learn a sophisticated feeding policy when trained directly with a disabled patient in the loop, due to the limited number of interactions available with the patient. Instead, the robot can learn how to feed able-bodied users, where data is easier to obtain, and transfer the learned knowledge to the setting with the disabled patient using only a few samples.
We study transfer in the reinforcement learning setting where different tasks are parameterized by their reward functions. While this problem and similar variants have been studied using approaches such as meta-RL [13, 45, 30, 15, 19], multi-task learning [34, 42], and successor features [1], fine-tuning as an approach for transfer learning in RL is still not well explored. Fine-tuning is an important method to study for two reasons. First, it is a widely used transfer learning approach that is well studied in supervised learning [27, 21, 47], but its limits have received much less attention in RL. Second, compared to peer approaches, fine-tuning does not require strong assumptions about the target domain, making it a general and easily applicable approach. Our goal is to broaden our understanding of transfer in RL by exploring when fine-tuning works, when it does not, and how we can overcome its challenges. Concretely, we consider fine-tuning to be more efficient when it requires fewer interaction steps with the target environment.
In this paper, we find that fine-tuning does not always work as expected when transferring between rewards whose corresponding trajectories belong to different homotopy classes. A homotopy class is traditionally defined as a class of trajectories that can be continuously deformed into one another without colliding with any barriers [4]; see Fig. 1 (a). In this work, we generalize the notion of barriers to include any set of states that incurs a large negative reward. These states lead to phase transitions (discontinuities) in the reward function. We assume that we know these barriers (and therefore the homotopy classes) beforehand, which is equivalent to assuming knowledge of the reward functions. Knowing the reward function a priori is a common assumption in many robotics tasks, such as knowing goals
[24, 23, 33] or having domain knowledge of unsafe states beforehand [17, 44]. Moreover, reinforcement learning algorithms naturally assume that the reward function is available [40]. Generalizing the notion of barriers allows us to go beyond robotics tasks classically associated with homotopy classes, e.g., navigation around barriers, to tasks like assistive feeding. Our key insight is that fine-tuning continuously changes the policy parameters, which in turn continuously deforms the induced trajectories. Hence, fine-tuning across homotopy classes will induce trajectories that intersect with barriers. These intersections introduce a high loss and gradients that point back toward the source policy parameters, making it difficult to fine-tune the policy parameters across homotopy classes. To address this challenge, we propose a novel Ease-In-Ease-Out fine-tuning approach consisting of two stages: a relaxing stage and a curriculum learning stage. In the relaxing stage, we relax the barrier constraint by removing it. Then, in the curriculum learning stage, we develop a curriculum from the relaxed reward to the target reward that gradually adds the barrier constraint back. The contributions of this paper are summarized as follows:

We introduce the idea of using homotopy classes as a way of characterizing the difficulty of fine-tuning in reinforcement learning. We extend the definition of homotopy classes to more general settings and demonstrate that fine-tuning across homotopy classes requires more interaction steps with the environment than fine-tuning within the same homotopy class.

We propose a novel Ease-In-Ease-Out approach for fine-tuning across homotopy classes, consisting of a relaxing stage and a curriculum learning stage.

We evaluate Ease-In-Ease-Out fine-tuning on a variety of robotics-inspired environments and show that our approach learns successful target policies with fewer interaction steps than other fine-tuning approaches.
II Related Work
Fine-tuning. Fine-tuning is well studied in supervised learning [25, 16, 32, 26, 18, 11, 36]. Approaches such as L2-SP penalize the Euclidean distance between the source and fine-tuned weights [46]. Batch Spectral Shrinkage penalizes small singular values of model features so that untransferable spectral components are suppressed [7]. Progressive Neural Networks (PNN) transfer prior knowledge by merging the source features into the target features at the same layer [35]. These works achieve state-of-the-art fine-tuning performance in supervised learning; however, directly applying these fine-tuning methods to transfer RL does not necessarily lead to successful results, as supervised learning and reinforcement learning differ in many factors, such as access to labeled data or the loss function optimized by each paradigm
[2]. We compare our approach with these fine-tuning methods for transfer RL. In fine-tuning for robotics, a robot usually pretrains its policy on a general source task, where more data is available, and then fine-tunes to a specific target task. Recent work in vision-based manipulation shows that fine-tuning off-policy RL algorithms can successfully adapt to variations in state and dynamics when starting from a general grasping policy [22]. As another example, RoboNet trains models on different robot platforms and fine-tunes them to unseen tasks and robots [9]. A key difference is that our work proposes a systematic approach, based on homotopy classes, for discovering when fine-tuning can succeed or fail. This is very relevant to the existing literature in this domain, as our approach can explain why a general policy, e.g., a general grasping policy, can or cannot easily be fine-tuned to more specific settings.
Transfer Reinforcement Learning. There are several lines of work on transfer RL, including successor features, meta-RL, and multi-task learning. We refer readers to [41] for a comprehensive survey, and compare these lines of work to our approach below.
Successor Features. Barreto et al. address the same reward-transfer problem as ours by learning a universal policy across tasks based on successor features [1]. However, their work makes a number of assumptions about the structure of the reward function and requires that the source and target rewards be close to each other, while our work has no such constraints.
Meta-RL. Meta-learning provides a generalizable model trained on multiple (meta-training) tasks that can quickly adapt to new (meta-test) tasks. There are various meta-RL methods, including RNN-based [13, 45, 30], gradient-based [15, 19, 30], and meta-critic approaches [39]. The gradient-based approach is the most related to our work: it finds policy parameters (roughly akin to finding a source task) that enable fast adaptation via fine-tuning. Note that all meta-RL approaches assume that agents have access to the environments or data of the meta-training tasks, which is not guaranteed in our setting. Our focus here is to discover when fine-tuning is fundamentally challenging based on homotopy classes. In our experiments, we compare our algorithm to core fine-tuning approaches rather than to techniques that leverage ideas from fine-tuning or build upon them.
Multi-task Learning. Other works transfer knowledge by simultaneously learning multiple tasks or goals [34]. In these works, transfer is enabled by learning shared representations of tasks and goals [10, 42, 34, 31]. In our work, we consider the setting where tasks are learned sequentially.
Regularization. Cobbe et al.'s work [8] proposes a metric to quantify the generalization ability of RL algorithms and compares the effects of different regularization techniques on generalization. The paper compares deeper networks, batch normalization, dropout, L2 regularization, data augmentation, and stochasticity (ε-greedy action selection and an entropy bonus). These techniques are designed for general-purpose transfer reinforcement learning, not specifically for transfer across homotopy classes. We compare our approach against using deeper networks, dropout, and entropy bonuses in our Navigation and Lunar Lander experiments and show that these techniques alone are not sufficient to transfer across homotopy classes (see supplementary materials).

III Fine-tuning across Homotopy Classes
In transfer reinforcement learning, our goal is to fine-tune from a source task to a target task. We formalize a task as a Markov Decision Process M = (S, A, P, ρ, R, γ), where S is the state space, A is the action space, P(s′ | s, a) is the transition probability, ρ is the initial state distribution, R is the reward function, and γ is the discount factor. We denote the source task by M_S and the target task by M_T. We assume that M_S and M_T differ only in their reward functions, i.e., R_S ≠ R_T. These different reward functions across the source and target task can, for instance, capture different preferences or constraints of the agent. A stochastic policy π(a | s) defines a probability distribution over actions in a given state. The goal of RL is to learn an optimal policy π*, which maximizes the expected discounted return E[Σ_t γ^t R(s_t, a_t)]. We define a trajectory τ to be the sequence of states the agent visits over time, and denote by τ* a trajectory produced by the optimal policy π*. We assume that the optimal policy π*_S for the source environment is available or can be easily learned. Our goal is then to leverage knowledge from π*_S to learn the optimal policy π*_T for task M_T. We aim to learn π*_T with substantially fewer training samples than other comparable fine-tuning approaches.

III-A Homotopy Classes
Homotopy classes are formally defined by homotopic trajectories in navigation scenarios in [4]:
Definition III.1.
Homotopic Trajectories and Homotopy Class. Two trajectories connecting the same initial and end points are homotopic if and only if one can be continuously deformed into the other without intersecting any barriers. Homotopic trajectories are clustered into a homotopy class.¹

¹ Even though the presence of a single obstacle introduces infinitely many homotopy classes, in most applications we can work with a finite number of them, which can be formalized by the concept of homology [6]. For algorithms that compute these homology classes, see [6].
Fig. 1 (a) illustrates a navigation scenario with two homotopy classes separated by a red barrier. Trajectories in the same class can be continuously deformed into one another without intersecting the barrier.
Generalization. The original definition of homotopy classes is limited to navigation scenarios with deterministic trajectories and the same start and end states. We generalize this definition to encompass a wider range of tasks in three ways.
Firstly, we account for tasks that may have more than one feasible initial and end state. We generalize the initial and end points to sets of states S_0 and S_f, where S_0 contains all possible starting states and S_f contains all possible ending states, as shown in Fig. 1 (b).
Secondly, we generalize the notion of a barrier to be a set of states S_barrier that incur a large negative reward: R(s) = R̄(s) − R_b for s ∈ S_barrier, where R_b is a large positive number and R̄ is the reward function without any barriers. Large negative rewards correspond to negative phase transitions, or discrete jumps, in the reward function. Importantly, the generalized ‘barrier’ allows us to define homotopy classes in tasks without physical barriers, as long as some set of states is penalized with large negative rewards (see our Assistive Feeding experiment). Although the source and target tasks differ in their reward functions, they share the same barrier states.
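As a concrete illustration, the generalized barrier reward described above can be sketched as follows. This is a minimal sketch, not the paper's implementation: the state representation, the base reward, and the penalty value are illustrative assumptions.

```python
def barrier_reward(state, base_reward, barrier_states, penalty=100.0):
    """Generalized barrier reward: R(s) = R_bar(s) - R_b * 1[s in S_barrier],
    where R_bar is the reward without barriers and R_b (penalty) is a large
    positive number."""
    r = base_reward(state)
    if state in barrier_states:
        r -= penalty  # large negative reward for barrier states
    return r
```

Note that a ‘barrier’ here need not be a physical obstacle; any set of heavily penalized states (e.g., uncomfortable spoon orientations in assistive feeding) plays the same role.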
Thirdly, we generalize the notion of continuously deforming trajectories to trajectory distributions when considering stochastic policies. We appeal to a distance metric on distributions, the Wasserstein (W) metric, which penalizes jumps (discontinuities) between the trajectory distributions induced by stochastic policies. We can now define our generalized notion of homotopic trajectories.
Definition III.2.
General Homotopic Trajectories. Two trajectories with distributions p_1 and p_2, with initial states in S_0 and final states in S_f, are homotopic if and only if one can be continuously deformed into the other in the W metric without receiving large negative rewards. Definitions of the W metric and of continuity are given in Section I of the supplementary materials.
General homotopic trajectories are depicted in Fig. 1 (b). The generalized definition of a homotopy class is then a set of general homotopic trajectories. Note that using the W metric is crucial here. Homotopic equivalence of stochastic policies under other distances, such as total variation or KL-divergence, is usually trivial: distributions that place even a tiny mass on all deterministic homotopy classes become homotopically equivalent. Under the W metric, by contrast, the distance between two distributions that shift mass between deterministic homotopy classes, even by a small amount, is at least that amount times the minimum distance between trajectories in different deterministic homotopy classes, which is a constant. So to go from one distribution over trajectories to another with different class proportions, one has to make a jump according to the W metric.
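The lower bound above can be seen in a toy calculation. Below, each trajectory is summarized by a single number, its lateral position when passing the barrier; this 1-D simplification is an illustrative assumption (the paper's W metric is defined on full trajectory distributions in the supplementary materials). For two equal-sized empirical samples, the 1-D Wasserstein-1 distance is the mean absolute difference of the sorted values:

```python
def w1_empirical(xs, ys):
    """1-D Wasserstein-1 distance between two equal-sized empirical samples:
    mean absolute difference of the sorted values."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

# Distribution A: all trajectories pass the barrier on the left (position -1).
# Distribution B: 90% on the left, 10% on the right (position +1).
A = [-1.0] * 10
B = [-1.0] * 9 + [1.0]
# Shifting even 10% of the mass across the barrier costs at least
# 0.1 * (gap between the classes) = 0.1 * 2 = 0.2 in W1.
```

By contrast, the total variation distance between A and B is only 0.1 regardless of how far apart the two classes lie; the W metric is the one that registers the geometric jump.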
III-B Challenges of Fine-tuning across Homotopy Classes
Running Example. We explain a key optimization issue caused by barriers when fine-tuning across homotopy classes, illustrated in Fig. 1 (b). An agent must learn to navigate to its goal without colliding with the barrier. Assuming the agent only knows how to reach the goal by swerving right (the source policy), we want to learn how to reach it by swerving left (i.e., find the target policy).
We show how the barrier prevents fine-tuning from source to target in Fig. 2. This figure depicts the loss landscape for the target task with and without the barrier. All policies are parameterized by a single parameter and optimized with the vanilla policy gradient algorithm [3]. Warmer regions indicate higher losses on the target task; cooler regions indicate lower losses.
Policies that collide with the barrier incur large losses, shown by the hump in Fig. 2 (b). Gradients point away from this large-loss region, so it is difficult to cross the hump without a sufficiently large step size. In contrast, the loss landscape without the barrier in Fig. 2 (a) is smooth, so fine-tuning converges easily. Details on the landscape plots are in Section IV of the supplementary materials.
We now formally investigate how discontinuities in trajectory space caused by barriers affect fine-tuning of model-free RL algorithms. We let a model parameterized by θ induce a policy π_θ, and define the loss of the model as L(θ). We assume that L(θ) is high when the expected return is low. This assumption is satisfied by common model-free RL algorithms such as vanilla policy gradient. We optimize the policy using gradient descent with step size α: θ ← θ − α∇L(θ). We can now define what it means to fine-tune from one task to another.
Let θ*_S be the optimal set of parameters that minimizes the loss on the source task. Using L_T as the loss for the target reward, fine-tuning from θ*_S to the target task for K gradient steps is defined as: θ⁰ = θ*_S and θᵏ = θᵏ⁻¹ − α∇L_T(θᵏ⁻¹) for k = 1, …, K.
We consider a policy to have successfully fine-tuned to the target task if its expected return is less than ε away from the expected return of the optimal target policy, for some small ε > 0.
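The failure mode of Fig. 2 can be reproduced in a one-parameter toy model. Everything below is an illustrative assumption rather than the paper's actual landscape: a quadratic loss with its optimum in the target homotopy class, plus a narrow bump standing in for the barrier penalty. Gradient descent from the source optimum reaches the target optimum when the landscape is smooth, but stalls at a local minimum in front of the barrier-induced hump:

```python
import math

def target_loss(theta, barrier_height):
    # Quadratic loss with its optimum at theta = 1 (target homotopy class),
    # plus a narrow bump at theta = 0 standing in for the barrier penalty.
    return (theta - 1.0) ** 2 + barrier_height * math.exp(-theta ** 2 / 0.05)

def finetune(theta_source, barrier_height, lr=0.005, steps=4000):
    """Plain gradient descent on the target loss, starting from the
    source-optimal parameter (numerical gradient for simplicity)."""
    theta = theta_source
    for _ in range(steps):
        g = (target_loss(theta + 1e-5, barrier_height)
             - target_loss(theta - 1e-5, barrier_height)) / 2e-5
        theta -= lr * g
    return theta

smooth = finetune(-1.0, barrier_height=0.0)    # no barrier: reaches ~1
blocked = finetune(-1.0, barrier_height=50.0)  # barrier: stuck left of 0
```

With the barrier removed, the iterate crosses θ = 0 freely; with the barrier present, the bump's gradient points back toward the source parameters, exactly the behavior described above.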
We now theoretically analyze why it is difficult to fine-tune across homotopy classes. Due to space limits, we include only our main theorem and a remark in the paper, and refer readers to the supplementary materials for the proofs.
Definition III.3.
W-continuity of a policy. A policy π_θ parameterized by θ is W-continuous if the mapping θ ↦ π_θ, which maps a vector of parameters in a metric space to a distribution over state-actions, is continuous in the W metric.

Definition III.4.

W-continuity of the transition probability function. An MDP with transition probability function P is W-continuous if the mapping (s, a) ↦ P(· | s, a), which maps a state-action pair in a metric space to a distribution over states, is continuous in the W metric.
Theorem 1.
Assume that π_θ is a parameterized policy for an MDP M. If both π_θ and the transition probability function of M are W-continuous, then a continuous change of the policy parameters results in a continuous deformation of the induced random trajectory in the W metric. However, continuous deformations of the trajectories do not ensure continuous changes of their corresponding policy parameters.
Note that the theorem also applies to deterministic policies, for which W-continuity reduces to the classical notion of continuity. Theorem 1 bridges changes in policy parameters with trajectory deformation. To use this theorem, we need assumptions on the learning rate and bounds on the gradients: for any pair of trajectories induced by policies in two different homotopy classes, the per-step parameter change α‖∇L(θ)‖ must be small relative to the parameter-space gap between those classes. With such a sufficiently small learning rate α, fine-tuning will always induce trajectories that visit barrier states in S_barrier.
Remark 2.
Intuitively, the conclusion to draw from Theorem 1 is that fine-tuning model parameters across homotopy classes is more difficult, or even infeasible, in terms of the number of interaction steps in the environment, compared to fine-tuning within the same homotopy class. This holds under the assumptions that the transition probability function and the policy are W-continuous, the learning rate is sufficiently small, and the gradients are bounded.²

² Modern optimizers and large step sizes can help evade local minima, but risk making training unstable when step sizes are too large.
IV Ease-In-Ease-Out Fine-tuning Approach
Our insight is that even though states with large negative rewards make fine-tuning across homotopy classes difficult, useful information can still be transferred between homotopy classes. Specifically, we first ease in, i.e., relax the problem by removing the negative reward associated with barriers, which lets the agent focus on fine-tuning toward the target reward without worrying about large negative rewards. We then ease out by gradually reintroducing the negative reward via a curriculum. We assume the environment is alterable so that barrier penalties can be removed and reintroduced. In most cases, this requires access to a simulator, which is a common assumption in many robotics applications [12, 14, 5, 38]. We assume that during the relaxing stage, as well as each subsequent curriculum stage, we are able to converge to an approximately optimal policy for that stage using reinforcement learning.
Ease In: Relaxing Stage. In the relaxing stage, we remove the barrier penalty from the reward function, i.e., we set the penalty R_b to 0. We denote the target MDP with the relaxed reward function by M_relax. Note that we do not physically remove the barriers, so the transition function does not change. We start from the source policy π*_S and train it in M_relax to obtain π*_relax. The relaxation removes the large losses incurred by the barriers, making fine-tuning much easier than naïve fine-tuning.
Ease Out: Curriculum Learning Stage. The relaxing stage finds an optimal policy π*_relax for M_relax. We now need to learn the optimal policy for the original target MDP M_T, which penalizes barrier states with the large penalty R_b. We develop two curricula to gradually reintroduce this penalty.
(1) Reward Weight (general case). We design a general curriculum that can be used in any environment by gradually increasing the penalty from 0 to R_b using a series of intermediate values 0 = R_b⁰ < R_b¹ < … < R_bⁿ = R_b. We redefine the reward function to include these intermediary values:
R_i(s) = R̄(s) − R_bⁱ · 1[s ∈ S_barrier]   (1)
This allows us to define a sequence of corresponding tasks M_1, …, M_n, where M_1 is the relaxed task and M_n = M_T. For each new task M_i, we initialize the policy with the previous task's optimal policy and train it using the reward R_i. The detailed algorithm is shown in Algorithm 1 in Section III of the supplementary materials.
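Reusing the one-parameter toy landscape from the running example (all constants are illustrative assumptions, not the paper's setup), the reward-weight curriculum can be sketched as follows: each stage runs gradient descent with an intermediate penalty weight, warm-started from the previous stage's solution. Applying the full penalty from the start stalls in front of the bump; the eased schedule crosses it:

```python
import math

def stage_loss(theta, penalty):
    # Optimum at theta = 1; a narrow bump at theta = 0 models the barrier,
    # scaled by the current curriculum penalty R_b^i.
    return (theta - 1.0) ** 2 + penalty * math.exp(-theta ** 2 / 0.05)

def descend(theta, penalty, lr=0.005, steps=4000):
    for _ in range(steps):
        g = (stage_loss(theta + 1e-5, penalty)
             - stage_loss(theta - 1e-5, penalty)) / 2e-5
        theta -= lr * g
    return theta

def ease_in_ease_out(theta_source, full_penalty, n_stages=5):
    """Stage 0 uses penalty 0 (the relaxed task); later stages raise the
    penalty back to full strength, warm-starting from the last solution."""
    theta = theta_source
    for i in range(n_stages):
        theta = descend(theta, full_penalty * i / (n_stages - 1))
    return theta

direct = descend(-1.0, 50.0)          # naive fine-tuning: stuck left of 0
ours = ease_in_ease_out(-1.0, 50.0)   # curriculum: reaches ~1
```

The relaxed stage moves the parameter into the target homotopy class while the penalty is absent; once there, reintroducing the penalty no longer pushes the iterate back, because the barrier bump is far from the current optimum.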
(2) Barrier Set Size. When there is only a single barrier set (i.e., S_barrier is connected), we can also build a curriculum around the set itself. Here, we keep the penalty at R_b but gradually grow the set of states that incur it. We can guarantee that this variant always converges, as discussed in the analysis section below.
To build the curriculum, we choose a state in S_barrier as our initial set and gradually inflate this set to S_barrier by connecting more and more states together.³ For example, we can connect new states that are within some radius of the current set. This yields a series of connected sets S¹_barrier ⊂ S²_barrier ⊂ … ⊂ Sⁿ_barrier = S_barrier. We can then similarly redefine the reward function, parameterizing it by the intermediary barrier sets:

³ A connected path is defined differently for continuous and discrete state spaces. For example, in continuous state spaces, a connected path means a continuous path.
R_i(s) = R̄(s) − R_b · 1[s ∈ Sⁱ_barrier]   (2)
Note that the sets Sⁱ_barrier only change the rewards associated with the states, not the dynamics.
Curriculum learning by growing the barrier set is more interpretable and controllable than the general reward weight approach, since for each task M_i the agent learns a policy that avoids a known subset of states, Sⁱ_barrier. In the reward weight approach, it is unclear which states the resulting policy will never visit. A shortcoming of the barrier set size approach is that its convergence guarantee is limited to a single barrier: with multiple barriers, we may not be able to find an initial set as described in Lemma 4. The algorithm for the barrier set approach follows the same structure as Algorithm 1.
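For a grid-world state space, the inflation step can be sketched as follows. This is a minimal illustration under assumed conventions: 4-neighbor grid adjacency and one ring of growth per stage (in continuous spaces one would instead grow the set by a small radius, as described above).

```python
def inflate_barrier(seed, barrier_set):
    """Return the nested curriculum sets S^1 ⊂ S^2 ⊂ ... ⊂ S_barrier,
    grown from a single seed cell via 4-neighbor grid adjacency."""
    current = {seed}
    stages = [set(current)]
    while current != barrier_set:
        # Barrier cells adjacent to the current set but not yet in it.
        neighbors = {
            (x + dx, y + dy)
            for (x, y) in current
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
        }
        frontier = (neighbors & barrier_set) - current
        if not frontier:  # barrier_set is not connected to the seed
            break
        current |= frontier
        stages.append(set(current))
    return stages
```

Each stage's set defines one intermediary reward in the curriculum; the policy trained against Sⁱ initializes training against Sⁱ⁺¹.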
Analysis. For both curricula, if the agent successfully finds an optimal policy at every intermediary task, then we obtain the optimal policy π*_T for M_T. For the reward weight approach, we cannot prove that the optimal policy for each stage M_i is guaranteed to be obtained, but we still have the following proposition:
Proposition 3.
For curriculum learning by reward weight, in every stage, the learned policy achieves a higher reward than the initialized policy evaluated on the final target task.
Though the reward weight approach is not guaranteed to achieve the optimal policy at every curriculum step, the policy improves with respect to the final target reward. Each curriculum step is much easier than the original direct fine-tuning problem, which increases the chance of successful fine-tuning. For the barrier set size approach, we prove that the optimal policy for each stage is achievable. To learn an optimal policy in each stage, finding the initial set S¹_barrier is key:
Lemma 4.
There exists an initial barrier set S¹_barrier that divides the trajectories of the optimal source and target policies into two homotopy classes.
We propose an approach for finding S¹_barrier in Algorithm 2 in Section III of the supplementary materials.
Proposition 5.
A curriculum starting with S¹_barrier as described in Lemma 4 and inflating it to S_barrier with sufficiently small changes in each step, i.e., small enough for reinforcement learning to find trajectories that avoid the current barrier states, can always learn the optimal policy for the final target reward.
V Experiments
We evaluate our approach along four axes of complexity: (1) the size of the barrier, (2) the number of barriers, (3) barriers in 3D environments, and (4) barriers that are not physical obstacles but a set of ‘undesirable’ states.
To evaluate these axes, we use several domains: navigation (Figs. 4, 5), Lunar Lander (Fig. 6, left), Fetch Reach (Fig. 6, right), Mujoco Ant (Fig. 7), and an assistive feeding task (Fig. 8). We compare our approach against naïve fine-tuning (Finetune) as well as three state-of-the-art fine-tuning approaches: Progressive Neural Networks (PNN) [35], Batch Spectral Shrinkage (BSS) [7], and L2-SP [46]. We also add training on the target task from a random initialization (Random) as a reference, but we do not consider Random a comparable baseline because it is not a transfer learning algorithm. We evaluate all experiments by the total number of interaction steps needed to reach within a small distance of the desired return on the target task. We report the average number of interaction steps in units of 1,000 (lower is better). We indicate statistically significant differences from baselines by listing the first letter of those baselines. We ran the Navigation (barrier sizes) and Fetch Reach experiments with 5 random seeds and the rest with 10 random seeds. If more than half of the runs exceeded the maximum number of interaction steps without reaching the desired target-task return, we report the task as unachievable within the maximum number of interaction steps. Finally, we use stochastic policies, which is why our source and target policies may not be symmetrical. Experiment details are in Section V of the supplementary materials.
Table I: Navigation with varying barrier sizes (mean ± std).

Method        Size 1          Size 3          Size 5          Size 7
Ours          117.4 ± 128.6   162.6 ± 70.5    102.7 ± 87.8    112.3 ± 111.3
PNN            92.2 ± 102.0   138.6 ± 92.1    159.8 ± 90.6    119.2 ± 125.0
L2-SP         138.2 ± 61.3    >256            >256            >256
BSS           >256            >256            >256            >256
Finetune      141.1 ± 53.0    >256            157.0 ± 100.0   241.0 ± 27.5
Random         54.6 ± 61.5     88.4 ± 59.4    145.0 ± 74.8     77.1 ± 40.6
1. Navigation. We address the first two axes by analyzing our problem under varying barrier sizes and varying number of homotopy classes. We experiment with our running example where an agent must navigate from a fixed start position to the goal set (green area).
Varying Barrier Sizes. We investigate how the size of the barrier affects fine-tuning from Right to Left. Here, we use a one-step curriculum, so the barrier set size and reward weight approaches coincide. Table I shows that when barrier sizes are small (1, 3), our approach is not the most sample-efficient but remains comparable to other methods. With larger barrier sizes (5, 7), our method requires the fewest training updates. This result suggests that our approach is especially useful when barriers are large (i.e., when fine-tuning is hard). When fine-tuning is easy, simpler approaches like starting from a random initialization can be used.
Table II: Navigation with four homotopy classes, source task LL (mean ± std).

Method         LL → LR         LL → RL         LL → RR
Ours:barrier    88.1 ± 3.2      52.1 ± 14.2     48.1 ± 13.2
Ours:reward     63.2 ± 9.1      47.1 ± 10.9     56.5 ± 9.9
PNN            101.9 ± 37.2    >300            119.2 ± 36.4
L2-SP          130.6 ± 28.6    >300            >300
BSS            >300            >300            >300
Finetune       141.2 ± 12.1    >300            >300
Random          43.5 ± 4.1     >300            169.4 ± 27.1
Four Homotopy Classes. We next investigate how multiple homotopy classes affect fine-tuning. As shown in Fig. 5, adding a second barrier creates four homotopy classes: LL, LR, RL, and RR. We experiment with both the barrier set size and reward weight approaches and report results using LL as the source task in Table II. Results for LR, RL, and RR as source tasks are included in the supplementary materials. The proposed Ease-In-Ease-Out approach outperforms the other fine-tuning methods. Having multiple barriers violates the single-barrier assumption, so the reward weight approach performs better on average than the barrier set size approach. Note that for LL → LR, Random performs best, which implies that this task is easy to learn from scratch and no transfer learning is needed. We conclude that while increasing the number of barrier sets can make fine-tuning more challenging for other methods, it does not negatively affect our approach.
2. Lunar Lander. Before exploring 3D environments that differ significantly from the navigation environment, we conducted an experiment in Lunar Lander. The objective of the game is to land on the ground between the two flags without crashing. As shown in Fig. 6 (left), this environment is similar to the navigation environments in that we introduce a barrier that creates two homotopy classes: Left and Right. The main difference is that the agent is controlled by two lateral thrusters and a main engine.
Table III: Lunar Lander (mean ± std).

Method          L → R            R → L
Ours:barrier    80.46 ± 46.58    80.23 ± 39.76
Ours:reward     75.13 ± 34.25    38.43 ± 6.46
PNN            117.35 ± 3.35    128.59 ± 44.56
L2-SP          124.54 ± 69.99    94.59 ± 51.23
BSS            >300             >300
Finetune       >300             >300
Random         232.32 ± 48.21   162.92 ± 49.54
Results are shown in Table III. We observe that while L2-SP suffers from a large variance and PNN needs many more steps, both our reward weight approach and our barrier set size approach outperform the baseline fine-tuning methods. The reward weight approach has a small standard deviation and performs stably. Note that Random requires a large number of interaction steps, meaning that learning the landing task from scratch is difficult and benefits from transfer reinforcement learning.
Our approach significantly reduces the number of steps needed to learn the optimal policy in both directions.

3. Fetch Reach. We address the third axis by evaluating our Ease-In-Ease-Out fine-tuning approach in the more realistic Fetch Reach environment [5]. The Fetch manipulator must fine-tune across homotopy classes in 3D space. In the reaching task, the robot needs to reach either the orange or the blue table by stretching right or left, respectively. The tables are separated by a wall, which creates two homotopy classes, as shown in Fig. 6. Our results are shown in Table IV. Our approach was the most efficient compared to the baseline methods. One reason the baselines did not perform well is that the wall's large size and its proximity to the robot caused frequent collisions, making it particularly difficult to fine-tune across homotopy classes. We found that even training from a random initialization proved difficult. For this reason, we had to relax the barrier constraint to obtain valid Left and Right source policies.
Table IV: Fetch Reach (mean ± std).

Method          L → R           R → L
Ours:barrier   308.7 ± 167.7   274.0 ± 130.5
PNN            >500            >500
L2-SP          >500            >500
BSS            >500            >500
Finetune       >500            >500
Random         >500            >500
4. Mujoco Ant. Finally, we explore whether our algorithm can generalize beyond navigation-like tasks traditionally associated with homotopy classes. We demonstrate two examples, Mujoco Ant and Assistive Feeding, where the barrier states correspond to undesirable states rather than physical objects. In the Mujoco Ant environment [43], the barrier states correspond to a set of joint angles that the ant's upper right leg cannot move to. The boundary of the barrier states is shown by the red lines in Fig. 7. In the source task, the ant moves while keeping its upper right joint angle above a threshold; we call this orientation Down. Our goal is to transfer to the target task where the joint angle stays below the threshold, which we call Up. Results are shown in Table V. We do not evaluate the other direction, Up → Down, because it was already easy for all of our baselines, including our own approach. We find that our approach was the most successful at fine-tuning across the joint-angle barrier states.
Table V: Mujoco Ant and Assistive Feeding (mean ± std).

Method        Mujoco Ant (Down → Up)   Assistive Feeding (Up → Down)
Ours          1420.0 ± 268.8           416 ± 32
PNN           >10000                   >2000
L2-SP         >10000                   >2000
BSS           >10000                   >2000
Finetune      2058.5 ± 535.2           >2000
Random        2290.4 ± 585.8           494 ± 28
5. Assistive Gym. We use an assistive feeding environment [14] to create another type of non-physical barrier in the robot's range of motion. In Fig. 8 (right), we simulate a disabled person who cannot change her head orientation by a large amount. The goal is to feed the person using a spoon. We can easily train a policy on an able-bodied person with a normal head orientation, as in Fig. 8 (left). However, we have limited data for the head orientation of the disabled person (the chin points upwards, as is common for patients who use a head-tracking device). To feed the disabled person, the spoon needs to point down, while for the able-bodied person, the spoon needs to point up. The barrier states correspond to holding the spoon in any direction between these two when close to the mouth, which may ‘feed’ the food to the user's nose or chin. This environment is an example of settings with limited data in the target environment, i.e., interacting with the disabled person. It also shows a setting with no physical barriers, where the ‘barrier states’ are the intermediate spoon orientations, which can be uncomfortable or even unsafe. As shown in Table V, our Ease-In-Ease-Out fine-tuning approach learns the new policy for the disabled person faster than training from scratch, while the other fine-tuning methods fail to learn the target policy.
VI Discussion
Summary. We introduce the idea of using homotopy classes to characterize the difficulty of fine-tuning between tasks with different reward functions. We propose a novel Ease-In-Ease-Out fine-tuning method that first relaxes the problem and then forms a curriculum. We extend the notion of homotopy classes, which allows us to go beyond navigation environments and apply our approach to more general robotics tasks. We demonstrate that our method requires fewer samples on a variety of domains and tasks compared to other fine-tuning baselines.
Limitations and Future Work. Our work has a number of limitations, including the need for access to the barrier states a priori. However, our Assistive Gym example is a step towards considering environments where barrier states are not as clearly defined a priori. In the future, we plan to apply our methods to other robotics domains with non-trivial homotopy classes by directly finding the homotopy classes [6] and then using our algorithm to fine-tune.
VII Acknowledgements
We would like to thank NSF Award Number 2006388 and the DARPA HiconLearn project for their support.
References
 [1] (2017) Successor features for transfer in reinforcement learning. In NeurIPS.
 [2] (2004) Reinforcement learning and its relationship to supervised learning. Handbook of Learning and Approximate Dynamic Programming.
 [3] (2001) Infinite-horizon policy-gradient estimation. JAIR.
 [4] (2012) Topological constraints in search-based robot path planning. Autonomous Robots.
 [5] (2016) OpenAI Gym. arXiv preprint arXiv:1606.01540.
 [6] (2010) Measuring and computing natural generators for homology groups. Computational Geometry.
 [7] (2019) Catastrophic forgetting meets negative transfer: batch spectral shrinkage for safe transfer learning. In NeurIPS.
 [8] (2019) Quantifying generalization in reinforcement learning. In ICML.
 [9] (2019) RoboNet: large-scale multi-robot learning. arXiv preprint arXiv:1910.11215.
 [10] (2014) Multi-task policy search for robotics. In ICRA.
 [11] (2009) ImageNet: a large-scale hierarchical image database. In CVPR.
 [12] (2017) CARLA: an open urban driving simulator. arXiv preprint arXiv:1711.03938.
 [13] (2016) RL²: fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779.
 [14] (2020) Assistive Gym: a physics simulation framework for assistive robotics. In ICRA.
 [15] (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In ICML.
 [16] (2016) Deep spatial autoencoders for visuomotor learning. In ICRA.
 [17] (2012) Safe exploration of state and action spaces in reinforcement learning. JAIR.
 [18] (2018) Robot learning in homes: improving generalization and reducing dataset bias. In NeurIPS.
 [19] (2018) Meta-reinforcement learning of structured exploration strategies. In NeurIPS.
 [20] (2018) Stable Baselines. GitHub repository, https://github.com/hill-a/stable-baselines.
 [21] (2006) Reducing the dimensionality of data with neural networks. Science.
 [22] (2020) Efficient adaptation for end-to-end vision-based robotic manipulation. arXiv preprint arXiv:2004.10190.
 [23] (2018) QT-Opt: scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293.
 [24] (2013) Reinforcement learning in robotics: a survey. IJRR.
 [25] (2012) Autonomous reinforcement learning on raw visual input data in a real world application. In IJCNN.
 [26] (2016) End-to-end training of deep visuomotor policies. JMLR.
 [27] (2011) Unsupervised and transfer learning challenge: a deep learning approach. In Unsupervised and Transfer Learning Workshop at ICML.
 [28] (2013) Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
 [29] (2014) Disaster Robotics. MIT Press.
 [30] (2018) Learning to adapt in dynamic, real-world environments through meta-reinforcement learning. arXiv preprint arXiv:1803.11347.
 [31] (2018) Visual reinforcement learning with imagined goals. In NeurIPS.
 [32] (2016) Supersizing self-supervision: learning to grasp from 50k tries and 700 robot hours. In ICRA.
 [33] (2018) Multi-goal reinforcement learning: challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464.
 [34] (2017) An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098.
 [35] (2016) Progressive neural networks. arXiv preprint arXiv:1606.04671.
 [36] (2018) Sim2Real viewpoint invariant visual servoing by recurrent control. In CVPR.
 [37] (2017) Deep reinforcement learning framework for autonomous driving. Electronic Imaging.
 [38] (2020) iGibson, a simulation environment for interactive tasks in large realistic scenes. arXiv preprint arXiv:2012.02924.
 [39] (2017) Learning to learn: meta-critic networks for sample efficient learning. arXiv preprint arXiv:1706.09529.
 [40] (2018) Reinforcement Learning: An Introduction. MIT Press.
 [41] (2009) Transfer learning for reinforcement learning domains: a survey. JMLR.
 [42] (2017) Distral: robust multitask reinforcement learning. In NeurIPS.
 [43] (2012) MuJoCo: a physics engine for model-based control. In IROS.
 [44] (2016) Safe exploration in finite Markov decision processes with Gaussian processes. arXiv preprint arXiv:1606.04753.
 [45] (2016) Learning to reinforcement learn. arXiv preprint arXiv:1611.05763.
 [46] (2018) Explicit inductive bias for transfer learning with convolutional networks. In ICML.
 [47] (2014) How transferable are features in deep neural networks? In NeurIPS.
 [48] (2017) Target-driven visual navigation in indoor scenes using deep reinforcement learning. In ICRA.
Appendix A Definitions
Definition A.1.
Wasserstein metric. Given a metric space $(M, d)$, the Wasserstein distance $W(\mu, \nu)$ between two distributions $\mu$ and $\nu$ on $M$ is defined as
$$W(\mu, \nu) = \inf_{\gamma \in \Gamma(\mu, \nu)} \mathbb{E}_{(x, y) \sim \gamma}[d(x, y)],$$
where $\Gamma(\mu, \nu)$ represents the set of couplings of $\mu$ and $\nu$, i.e., matchings of probability mass.
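For one-dimensional empirical distributions with equally many samples, the infimum over couplings is attained by matching sorted samples, which gives a simple way to compute this metric numerically. The sketch below is our own illustration (not from the paper), in plain Python:

```python
def wasserstein_1d(xs, ys):
    """W distance between two equal-size empirical distributions on the line.

    In 1-D, the optimal coupling matches the i-th smallest mass point of one
    sample with the i-th smallest of the other, so the infimum over couplings
    reduces to a mean of sorted pairwise distances.
    """
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

# Shifting a distribution by a constant c moves it a W distance of exactly |c|.
mu = [0.0, 1.0, 2.0, 3.0]
nu = [x + 0.5 for x in mu]
print(wasserstein_1d(mu, nu))  # → 0.5
```

The same helper is reused in the numeric illustrations later in this appendix.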
Definition A.2.
Wasserstein continuity. A mapping $h$ from a metric space $(X, d)$, with any distance function $d$, to a distribution space is continuous in the Wasserstein metric if and only if $\forall \epsilon > 0$, $\exists \delta > 0$, such that if $d(x_1, x_2) < \delta$, then $W(h(x_1), h(x_2)) < \epsilon$.
Appendix B Proofs
Since we are dealing with compact spaces for inputs to our functions, continuity is in general equivalent to uniform continuity. So below we interchangeably use the terms continuity and uniform continuity. We also note that uniform continuity is a weaker assumption than Lipschitz continuity, which is satisfied by neural networks with bounded parameters.
Below we show that the composition of two stochastic functions, each of which is uniformly continuous, is uniformly continuous. A stochastic function $f$ maps an input $x$ to a random output $y$, and can also be viewed as a deterministic function that maps $x$ to a p.d.f. over outputs. The composition of $f$ and $g$ maps an input $x$ first to a random intermediate output $y$ using $f$; then $y$ is fed to $g$ to generate the random output $z$. So the composition can also be viewed as the function that maps $x$ to the p.d.f. of this composed output $z$. We remark that uniform continuity is implied by continuity as long as we are working over compact spaces. We also remark that the uniform continuity assumption can be relaxed for the inner function in the composition.
Lemma 6.
Let $f: X \to \Delta(Y)$ and $g: Y \to \Delta(Z)$ be two stochastic functions. Their composition, which we denote by $g \circ f$, maps $x \in X$ to a distribution whose p.d.f. is defined by $(g \circ f)(z \mid x) = \int_Y g(z \mid y) f(y \mid x) \, dy$, where $f(\cdot \mid x)$ is the p.d.f. of $f(x)$ (note that $g(\cdot \mid y)$ is a p.d.f. itself, so the integration results in a p.d.f.). The result of this composition is uniformly continuous as long as $f$ and $g$ are uniformly continuous.
Proof.
Since both $f$ and $g$ are uniformly continuous functions, $\forall \epsilon > 0$, $\exists \delta > 0$ such that $d(x_1, x_2) < \delta$ implies $W(f(x_1), f(x_2)) < \epsilon$, and similarly for $g$.
Now fix an $\epsilon > 0$. We need to show that there is a $\delta > 0$ such that $d(x_1, x_2) < \delta$ implies $W(g \circ f(x_1), g \circ f(x_2)) \leq \epsilon$. By uniform continuity of $g$, we first find a $\delta_g$ corresponding to $\epsilon$, and then, treating $\delta_g$ as the "$\epsilon$" for the function $f$, we find a corresponding $\delta_f$. Our goal is to show that $d(x_1, x_2) < \delta_f$ implies $W(g \circ f(x_1), g \circ f(x_2)) \leq \epsilon$.
Note that $d(x_1, x_2) < \delta_f$ implies $W(f(x_1), f(x_2)) < \delta_g$. We now take the coupling $\gamma$ from the definition of the $W$ metric for $f(x_1)$ and $f(x_2)$ which achieves the infimum. Note that we are making a simplifying assumption that the infimum in the definition of $W$ is obtained at some particular $\gamma$; in general, this may not be the case, but tweaking our argument by taking the limit of such $\gamma$'s proves the general case.
Because $W(f(x_1), f(x_2)) < \delta_g$, the support of $\gamma$ is only on pairs $(y_1, y_2)$ such that $d(y_1, y_2) < \delta_g$. Note that because $\gamma$'s marginals are $f(x_1)$ and $f(x_2)$, we have $g \circ f(x_i) = \mathbb{E}_{(y_1, y_2) \sim \gamma}[g(y_i)]$ for $i = 1, 2$.
By the triangle inequality for the $W$ metric we have $W(g \circ f(x_1), g \circ f(x_2)) \leq \mathbb{E}_{(y_1, y_2) \sim \gamma}[W(g(y_1), g(y_2))]$.
But note that $\gamma$ is only supported on pairs $(y_1, y_2)$ such that $d(y_1, y_2) < \delta_g$. Uniform continuity of $g$ implies that $W(g(y_1), g(y_2))$ is at most $\epsilon$ for all such pairs. So we get
(3) $W(g \circ f(x_1), g \circ f(x_2)) \leq \epsilon$.
∎
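Lemma 6 can also be illustrated empirically (this is not part of the proof): compose two stochastic functions, each a Gaussian whose mean depends continuously on its input, and check that the estimated Wasserstein distance between composed outputs is small for nearby inputs and large for far-apart inputs. The particular functions $f$ and $g$ below are our own hypothetical choices:

```python
import math
import random

def wasserstein_1d(xs, ys):
    # 1-D empirical W distance via sorted matching (optimal in 1-D).
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def f(x, rng):
    # Stochastic function f: Gaussian centered at sin(x), std 0.1.
    return rng.gauss(math.sin(x), 0.1)

def g(y, rng):
    # Stochastic function g: Gaussian centered at 2*y, std 0.1.
    return rng.gauss(2.0 * y, 0.1)

def compose_samples(x, n, seed):
    # Draw n samples from the composition (g ∘ f)(x).
    rng = random.Random(seed)
    return [g(f(x, rng), rng) for _ in range(n)]

# Nearby inputs induce nearby output distributions (continuity in W),
# while far-apart inputs need not.
near = wasserstein_1d(compose_samples(0.0, 5000, 1), compose_samples(0.01, 5000, 2))
far = wasserstein_1d(compose_samples(0.0, 5000, 1), compose_samples(1.5, 5000, 2))
print(near < far)  # → True
```

With 5000 samples the two estimates differ by roughly two orders of magnitude, so the comparison is robust to sampling noise.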
Theorem 7.
Assume that $\pi_\theta$ is a parameterized policy for an MDP $\mathcal{M}$. If both $\pi_\theta$ and the transition function $\mathcal{T}$ are continuous, then a continuous change of the policy parameters $\theta$ results in a continuous deformation of the induced random trajectory in the $W$ metric. However, a continuous deformation of the trajectories does not ensure continuous changes of their corresponding policy parameters.
Proof.
Our strategy is to prove continuity by induction and repeated applications of Lemma 6.
When $t = 0$, the trajectory prefix $\tau_0 = s_0$ is a deterministic constant function with respect to $\theta$; thus, it is continuous.
For $t \geq 1$, assume by induction that we have proved that the distribution of $\tau_{t-1} = (s_0, a_0, \ldots, s_{t-1})$ is continuous w.r.t. $\theta$. Our goal is to prove that the distribution of $\tau_t$ is also continuous. We do this in two steps. First we obtain continuity of $(\tau_{t-1}, a_{t-1})$, and then of $\tau_t$.
For the first step, note that $(\tau_{t-1}, a_{t-1}) = (I, \pi_\theta)(\tau_{t-1})$, where $(I, \pi_\theta)$ is the concatenation of the identity operator and the policy $\pi_\theta$. Since this concatenation results in a continuous function, and the distribution of $\tau_{t-1}$ is continuous, by Lemma 6 we get continuity of $(\tau_{t-1}, a_{t-1})$.
For the second step, we compose with the dynamics of $\mathcal{M}$. Namely, $\tau_t = (I, \mathcal{T})(\tau_{t-1}, a_{t-1})$, where $(I, \mathcal{T})$ is the concatenation of the identity operator and the MDP dynamics $\mathcal{T}$. By similar reasoning, this results in a continuous function of $\theta$.
For the second half of the theorem, note that the same trajectory may be induced by different policies, so continuously changing trajectories can correspond to quite different model parameters.
∎
Lemma 8.
A deep RL model, which is represented by a neural network $f_\theta$ with Lipschitz continuous activation functions such as ReLU and TanH at each layer, induces a continuous policy $\pi_\theta$.
Proof.
We assume the action distribution is a type of distribution $D(\phi)$ with parameters $\phi$, such as a Gaussian distribution with mean and standard deviation. $D$ can be regarded as a mapping from the parameter $\phi$ to a distribution. We assume that $D$ is continuous in the Wasserstein metric, which is satisfied for commonly used distributions such as the Gaussian distribution. So $\forall \epsilon > 0$, $\exists \delta > 0$, such that if $\|\phi_1 - \phi_2\| < \delta$, then $W(D(\phi_1), D(\phi_2)) < \epsilon$. The policy of the deep RL model is represented by a neural network $f_\theta$ parameterized by $\theta$, which outputs the parameters $\phi = f_\theta(s)$ of the distribution. Every layer of the neural network employs a Lipschitz continuous activation function such as ReLU or TanH. We denote the number of layers in the network by $n$ and the Lipschitz constant of every layer by $K$. We assume that the norm of the output of each layer is upper bounded by a constant $B$ over all states in the state space, or otherwise the output is infinity. Then we can derive that
(4) $\|f_{\theta_1}(s) - f_{\theta_2}(s)\| \leq C(n, K, B) \, \|\theta_1 - \theta_2\|$,
for a constant $C(n, K, B)$ that depends only on $n$, $K$, and $B$. If $\|\theta_1 - \theta_2\| < \delta / C(n, K, B)$, then $\|f_{\theta_1}(s) - f_{\theta_2}(s)\| < \delta$, and then $W(D(f_{\theta_1}(s)), D(f_{\theta_2}(s))) < \epsilon$. So the action distribution is continuous with respect to the model parameters. Then we derive the continuity of $\pi_\theta$:
(5) $W(\pi_{\theta_1}(\cdot \mid s), \pi_{\theta_2}(\cdot \mid s)) = W(D(f_{\theta_1}(s)), D(f_{\theta_2}(s))) < \epsilon$.
Since the action distribution $D(f_\theta(s))$ is continuous with respect to $\theta$, $\pi_\theta$ is also continuous. ∎
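The Lipschitz behavior in the proof can be checked numerically on a toy network. The sketch below is illustrative only: the two-layer tanh network, its weights, and the sampled state range are our own choices, not the paper's model. It perturbs all parameters by a fixed amount and verifies that the worst-case change in the network output shrinks with the perturbation size:

```python
import math

def mlp(theta, s):
    # Tiny 2-layer network with tanh activations (tanh is 1-Lipschitz):
    # theta = [w1, b1, w2, b2]; s is a scalar state.
    w1, b1, w2, b2 = theta
    h = math.tanh(w1 * s + b1)
    return math.tanh(w2 * h + b2)

def output_gap(theta, eps, states):
    # Sup over sampled states of |f_{theta+eps}(s) - f_theta(s)|.
    shifted = [t + eps for t in theta]
    return max(abs(mlp(shifted, s) - mlp(theta, s)) for s in states)

theta = [0.7, -0.2, 1.1, 0.3]
states = [i / 10.0 for i in range(-20, 21)]
# Smaller parameter perturbations give smaller output changes, consistent
# with f_theta(s) being Lipschitz in theta (Equation (4)).
print(output_gap(theta, 1e-3, states) < output_gap(theta, 1e-1, states))  # → True
```

Because the network and states are fixed, the comparison is deterministic.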
Remark 9.
Fine-tuning deep RL model parameters across homotopy classes is more difficult, or even infeasible, in terms of sample complexity compared to fine-tuning within the same homotopy class, under the assumptions that the transition probability function and the policy of $\mathcal{M}$ are continuous, the learning rate is sufficiently small, and gradients are bounded.
Justification for Remark 9.
Large negative rewards, or barriers, correspond to high loss and create gradients pointing away from the barriers. Looking at Equation (2), the large cost incurred when a trajectory collides with a barrier creates a large negative term, which causes the gradient estimate to become extremely large.
As implied by Theorem 7, fine-tuning across homotopy classes requires that trajectories intersect the barrier at some point during training if both the policy and the transition function are continuous. The policy induced by the deep RL model is continuous by Lemma 8. When trajectories intersect the barrier, the large negative rewards create large gradients that point away from the optimal target policy weights. Thus fine-tuning across homotopy classes will always be blocked by the barriers. In contrast, fine-tuning within a homotopy class can always find a deformation process that does not intersect the barriers.
∎
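The scaling effect behind this justification can be made concrete with a toy single-sample policy-gradient estimate (our own example, not the paper's Equation (2)): a REINFORCE-style estimate multiplies the score function by the return, so a trajectory that incurs a huge barrier penalty produces a proportionally huge gradient:

```python
def grad_log_gaussian(theta, a, sigma=1.0):
    # d/d(theta) of log N(a; theta, sigma^2) = (a - theta) / sigma^2.
    return (a - theta) / sigma ** 2

def reinforce_grad(theta, action, ret):
    # Single-sample REINFORCE estimate: grad log pi(a | theta) * return.
    return grad_log_gaussian(theta, action) * ret

theta, action = 0.0, 1.0
normal_return = -1.0        # ordinary step cost
barrier_return = -1e6       # large penalty for colliding with a barrier

g_normal = reinforce_grad(theta, action, normal_return)
g_barrier = reinforce_grad(theta, action, barrier_return)
print(abs(g_barrier) / abs(g_normal))  # → 1000000.0
```

The gradient magnitude scales linearly with the return, so a barrier penalty a million times larger than the step cost yields a gradient a million times larger, pushing the parameters sharply away from the barrier.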
Proposition 10.
For curriculum learning by reward weight, in every stage, the learned policy achieves a higher reward than the policy it was initialized from, when evaluated on the final target task.
Proof.
We denote the barrier-penalty weight at stage $i$ by $\alpha_i$, with $\alpha_i$ increasing across stages, and set the final weight $\alpha_N$ to $1$. For the reward weight approach, in every stage $i$, we can write the reward function
(6) $R_i = \bar{R} + \alpha_i R_b$
as a sum of two parts,
(7) $R_{i-1} = \bar{R} + \alpha_{i-1} R_b$
and
(8) $(\alpha_i - \alpha_{i-1}) R_b$.
Here $R_b$ can be regarded as the reward function which only penalizes the barrier states, and $\bar{R}$ is the barrier-free part of the reward. If evaluated under $R_i$, since $\pi_i$ is trained to maximize the expected return under $R_i$, $\pi_i$ achieves a higher expected return than $\pi_{i-1}$. However, if evaluated under $R_{i-1}$, similarly, $\pi_{i-1}$ achieves a higher expected return than $\pi_i$. Therefore, $\pi_i$ achieves a higher expected return than $\pi_{i-1}$ if evaluated under the reward $R_i - R_{i-1} = (\alpha_i - \alpha_{i-1}) R_b$. Since $\alpha_i > \alpha_{i-1}$, $\pi_i$ achieves a higher expected return than $\pi_{i-1}$ under the reward $R_b$, which means that $\pi_i$ is penalized less than $\pi_{i-1}$ by the barrier states. Now $\pi_i$ has achieved a higher expected return than $\pi_{i-1}$ under both $R_i$ and $R_b$, and since the final target reward equals $R_i + (1 - \alpha_i) R_b$ with $1 - \alpha_i \geq 0$, $\pi_i$ also achieves a higher expected return under the final target reward. ∎
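The stage-wise decomposition that drives this proof can be written out directly. In the toy sketch below (entirely illustrative; the states, goal/barrier rewards, and weight schedule are hypothetical), each stage reward $R_i = \bar{R} + \alpha_i R_b$ splits, state by state, into the previous stage's reward plus the extra penalty $(\alpha_i - \alpha_{i-1}) R_b$:

```python
def stage_reward(r_goal, r_barrier, alpha):
    # R_i(s) = R_bar(s) + alpha_i * R_b(s), where R_b <= 0 on barrier states.
    return {s: r_goal[s] + alpha * r_barrier[s] for s in r_goal}

# Toy state space: two free states and one barrier state 'B'.
r_goal = {'s0': 0.0, 's1': 1.0, 'B': 0.5}
r_barrier = {'s0': 0.0, 's1': 0.0, 'B': -10.0}  # penalizes only barrier states
alphas = [0.0, 0.25, 0.5, 1.0]                  # ease-out: weight grows to 1

for a_prev, a_cur in zip(alphas, alphas[1:]):
    r_prev = stage_reward(r_goal, r_barrier, a_prev)
    r_cur = stage_reward(r_goal, r_barrier, a_cur)
    # Check R_i = R_{i-1} + (alpha_i - alpha_{i-1}) * R_b at every state.
    for s in r_goal:
        assert abs(r_cur[s] - (r_prev[s] + (a_cur - a_prev) * r_barrier[s])) < 1e-12
print("decomposition holds at every stage")
```

Since the expected return is linear in the reward, this identity is what lets the proof compare returns under $R_i$, $R_{i-1}$, and $R_b$ separately.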
Lemma 11.
There exists a subset $\mathcal{O}_{start} \subseteq \mathcal{O}$ of the barrier states that divides the trajectories of the optimal source policy $\pi^*_{source}$ and the optimal relaxed policy $\pi^*_{relaxed}$ into two homotopy classes.
Proof.
If there is no such $\mathcal{O}_{start}$, then the trajectories of $\pi^*_{source}$ and $\pi^*_{relaxed}$ visit no states in $\mathcal{O}$ and can be continuously deformed to each other without visiting any state in $\mathcal{O}$. So $\pi^*_{source}$ and $\pi^*_{relaxed}$ are in the same homotopy class with respect to $\mathcal{O}$. Since $\pi^*_{relaxed}$ optimizes the relaxed reward and visits no states in $\mathcal{O}$, it should be the optimal policy for the target reward. Therefore, the source and target rewards are in the same homotopy class, which violates the assumption that they are in different homotopy classes. Thus, there must exist such an $\mathcal{O}_{start}$. ∎
Proposition 12.
Consider a curriculum that begins with $\mathcal{O}_{start}$ as described in Lemma 11 and is inflated to the full barrier set $\mathcal{O}$ with sufficiently small changes in each step, i.e., small enough to enable reinforcement learning algorithms to approximate optimal policies whose trajectories do not visit barrier states. This curriculum can always learn the optimal policy for the final target reward in the last step.
Proof.
We demonstrate the case with only two homotopy classes. If there are multiple homotopy classes, then the barrier state set $\mathcal{O}$ contains multiple non-connected subsets, and we can design a sequence of transfer tasks that gradually moves from the source homotopy class to the target homotopy class, where the two homotopy classes in each transfer task have only one connected barrier subset between them. Thus the analysis for two homotopy classes is sufficient, as it can be applied to each transfer task to ensure convergence. Importantly, here we assume the use of reinforcement learning algorithms that can converge to optimal policies for each task in the curriculum. We prove this proposition in two steps:
1) The optimal policy $\pi^*_1$ for the first step, with barrier set $\mathcal{O}_{start}$, is in the same homotopy class as the optimal policy $\pi^*_T$ for the final target reward.
As shown in Lemma 11, there exists an $\mathcal{O}_{start}$ that divides the trajectories of $\pi^*_{source}$ and $\pi^*_{relaxed}$ into two homotopy classes. So in the first stage, the barrier $\mathcal{O}_{start}$ lies between the trajectories of $\pi^*_{source}$ and $\pi^*_1$ (assuming our reinforcement learning algorithm is able to find the optimal policy $\pi^*_1$). Since there are only two homotopy classes and $\pi^*_{source}$ and $\pi^*_T$ are from different homotopy classes, the trajectories of $\pi^*_1$ and $\pi^*_T$ have to be from the same homotopy class.
2) In every step $i$, if in the previous step $\pi^*_{i-1}$ could be learned and its trajectories are in the same homotopy class as the trajectories induced by $\pi^*_T$ under $\mathcal{O}$, then the current step's optimal policy $\pi^*_i$ can also be learned and its trajectories are from the same homotopy class as the trajectories induced by $\pi^*_T$ under $\mathcal{O}$.
In step $i$, the reinforcement learning algorithm needs to spend only very little exploration to prevent the policy from visiting $\mathcal{O}_i$, since we assume that the inflation from $\mathcal{O}_{i-1}$ to $\mathcal{O}_i$ is small enough for reinforcement learning to find trajectories that do not visit states in $\mathcal{O}_i$. Since the last step's optimal policy $\pi^*_{i-1}$ is in the same homotopy class as $\pi^*_T$, the policy $\pi^*_i$, which is fine-tuned from $\pi^*_{i-1}$, cannot be in the same homotopy class as $\pi^*_{source}$ due to the larger set of barrier states between $\pi^*_{i-1}$ and $\pi^*_{source}$. So $\pi^*_i$ can only be in the same homotopy class as $\pi^*_T$.
With the above two statements, we can derive that at every step, a policy that does not visit the barrier states is achievable, and it is in the same homotopy class as $\pi^*_T$. In the final step, such a policy is exactly the optimal policy for the final target reward.
∎
Appendix C Algorithm
In this section, we first provide our main curriculum learning algorithm in Algorithm 1.
As described in Lemma 11, there exists an $\mathcal{O}_{start}$ that divides the trajectories of $\pi^*_{source}$ and $\pi^*_{relaxed}$ into two homotopy classes. This is particularly needed for the Barrier Set Size approach. However, Lemma 11 does not provide an approach for constructing $\mathcal{O}_{start}$. Here we propose one approach (Algorithm 2) for finding such an $\mathcal{O}_{start}$.
To find a desired $\mathcal{O}_{start}$, we assume that we have two oracle functions available: (1) a collision checker, which outputs true if a trajectory passes through a given obstacle set, and false otherwise; (2) a homotopy class checker, which outputs true if a given obstacle set divides two trajectories into two different homotopy classes, and false otherwise. These assumptions are listed as the Input in Algorithm 2.
First, we observe that the trajectory induced by the optimal relaxed policy passes through the original obstacle set $\mathcal{O}$, so there exists a subset of $\mathcal{O}$ that prevents the optimal source trajectory from being continuously deformed into the relaxed trajectory. This means that there exists a subset of $\mathcal{O}$ that divides the two trajectories into two different homotopy classes.
Based on this idea, we first construct an $\mathcal{O}_{start}$ which divides the trajectories of $\pi^*_{source}$ and $\pi^*_{relaxed}$ into two homotopy classes (see lines 3-19). We first assign the original $\mathcal{O}$ to the candidate set (Line 3). We then cut the candidate set into two halves (Line 5) and update it to one of the two halves based on rules introduced below (see lines 6-18), until we reach the desirable starting barrier $\mathcal{O}_{start}$.
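The halving loop described above can be sketched with stand-in oracles. The toy below is entirely illustrative and is NOT Algorithm 2 itself: the obstacle is a 1-D open segment in the plane, the two trajectories are vertical lines passing on either side of (or through) it, and both oracles are simplified closed-form checks rather than the paper's collision and homotopy checkers. The candidate set is repeatedly cut in half while preserving the property that it still divides the two trajectories:

```python
# Toy setup: obstacle = open x-interval (lo, hi) on the line y = 0;
# trajectories are vertical lines x = const crossing that line.
SOURCE_X, RELAXED_X = -1.0, 0.0   # source passes left; relaxed passes through O

def collides(traj_x, seg):
    # Simplified collision oracle: the trajectory crosses the segment.
    lo, hi = seg
    return lo < traj_x < hi

def divides(seg):
    # Simplified homotopy oracle for this toy: a nonempty segment separates
    # the two vertical trajectories iff part of it lies strictly between them
    # and neither trajectory collides with it.
    lo, hi = seg
    return hi > lo and lo < RELAXED_X and hi > SOURCE_X and not (
        collides(SOURCE_X, seg) or collides(RELAXED_X, seg))

def find_o_start(obstacle, tol=1e-3):
    # Bisect the candidate set, keeping a half that still divides the two
    # trajectories (in this toy, such a half always exists), until it is small.
    cand = obstacle
    while cand[1] - cand[0] > tol:
        mid = (cand[0] + cand[1]) / 2.0
        left, right = (cand[0], mid), (mid, cand[1])
        cand = left if divides(left) else right
    return cand

o_start = find_o_start((-0.5, 0.5))
print(divides(o_start), o_start[1] - o_start[0] <= 1e-3)  # → True True
```

Note that the initial obstacle does not divide the trajectories (the relaxed trajectory collides with it), but the loop quickly discards the colliding half, mirroring the update rules of Algorithm 2.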