## 1. Introduction

Recent advances in Deep Reinforcement Learning (DRL) have achieved impressive success in reaching human-level control in complex tasks mnih2015human; lillicrap2015continuous. However, DRL still suffers from sample inefficiency, especially when the state-action space becomes large, which makes it difficult to learn from scratch. Transfer Learning (TL) has shown great potential to accelerate RL sutton1998reinforcement by leveraging prior knowledge from previously learned policies of relevant tasks taylor2009transfer; laroche2017transfer; rajendran2017attend. One major direction of transfer in RL focuses on measuring the similarity between two tasks, either by mapping the state spaces between the two tasks taylor2007transfer; brys2015policy or by computing the similarity of two Markov Decision Processes (MDPs) song2016measuring, and then transferring value functions directly according to their similarity.

Another direction of policy transfer focuses on selecting a suitable source policy for exploration fernandez2006probabilistic; li2018optimal. However, such single-policy transfer cannot be applied when a source policy is only partially useful for learning the target task. Although some transfer approaches use multiple source policies during target task learning, they suffer from one of the following limitations. Laroche and Barlier laroche2017transfer assumed that all tasks share the same transition dynamics and differ only in the reward function. Li et al. li2018context proposed Context-Aware Policy reuSe (CAPS), which requires the source policies to be optimal since it only learns an intra-option policy over these source policies. Furthermore, CAPS requires manually adding primitive policies to the policy library, which limits its generality, and it cannot be applied to problems with continuous action spaces.

To address the above problems, we propose a novel Policy Transfer Framework (PTF) which combines the above two directions of policy reuse. Instead of using source policies as guided exploration in a target task, we adaptively select a suitable source policy during target task learning and use it as a complementary optimization objective for the target policy. The backbone of PTF can still use existing DRL algorithms to update its policy, and the source policy selection problem is modeled as an option learning problem. In this way, PTF does not require any source policy to be perfect on any subtask and can still learn toward an optimal policy when none of the source policies is useful. In addition, the option framework allows us to use the termination probability as a performance indicator to decide whether the reuse of a source policy should be terminated to avoid negative transfer. In summary, the main contributions of our work are: 1) PTF learns when and which source policy is the best to reuse for the target policy, and when to terminate it, by modelling multi-policy transfer as an option learning problem; 2) we propose an adaptive and heuristic mechanism to ensure the efficient reuse of source policies and avoid negative transfer; and 3) both value-based and policy-based DRL approaches can be incorporated, and experimental results show that PTF significantly boosts the performance of existing DRL approaches and outperforms state-of-the-art policy transfer methods in both discrete and continuous action spaces.

## 2. Background

This paper focuses on standard RL tasks. Formally, a task can be specified by a Markov Decision Process (MDP), which can be described as a tuple $\langle S, A, T, R \rangle$, where $S$ is the set of states; $A$ is the set of actions; $T$ is the state transition function, $T: S \times A \times S \to [0, 1]$; and $R$ is the reward function, $R: S \times A \times S \to \mathbb{R}$. A policy $\pi$ is a probability distribution over actions conditioned on states, $\pi: S \times A \to [0, 1]$. The solution for an MDP is to find an optimal policy $\pi^*$ maximizing the total expected return with a discount factor $\gamma$: $U = \sum_{t \geq 0} \gamma^t r_t$.

Q-Learning, Deep Q-Network (DQN). Q-learning watkins1992q and DQN mnih2015human are popular value-based RL methods. Q-learning holds an action-value function for policy $\pi$ as $Q^\pi(s, a) = \mathbb{E}_\pi[U \mid s_0 = s, a_0 = a]$, and learns the optimal Q-function, which yields an optimal policy watkins1992q. DQN learns the optimal Q-function by minimizing the loss:

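$$L(\theta) = \mathbb{E}_{s, a, r, s'}\left[\left(r + \gamma \max_{a'} Q'(s', a' \mid \theta') - Q(s, a \mid \theta)\right)^2\right] \tag{1}$$

where $Q'$ is the target Q-network parameterized by $\theta'$ and periodically updated from $\theta$.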
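As a minimal sketch of the DQN loss described above, the following pure-Python snippet computes the one-step target from the target network's next-state values and averages the squared error over a batch. The function names and the `done` flag for terminal transitions are illustrative assumptions, not taken from the paper.

```python
def td_target(reward, next_q_values, gamma=0.99, done=False):
    """One-step target r + gamma * max_a' Q'(s', a') from the target network.

    The `done` shortcut for terminal transitions is an assumption of this sketch.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def dqn_loss(batch, gamma=0.99):
    """Mean squared TD error over a batch.

    Each item is (q_sa, reward, next_q_values, done), where q_sa is the online
    network's Q(s, a) and next_q_values are the target network's Q'(s', .).
    """
    errors = [
        (td_target(r, next_q, gamma, done) - q_sa) ** 2
        for q_sa, r, next_q, done in batch
    ]
    return sum(errors) / len(errors)


batch = [
    (1.0, 0.0, [0.5, 2.0], False),  # non-terminal: target = 0.0 + 0.9 * 2.0
    (0.5, 1.0, [3.0, 4.0], True),   # terminal: target = reward only
]
loss = dqn_loss(batch, gamma=0.9)
```

In a real agent the Q-values would come from neural networks and the loss would be differentiated with respect to the online parameters only.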

Policy Gradient (PG) Algorithms. Policy gradient methods are another choice for RL tasks; they directly optimize the policy $\pi_\theta$ parameterized by $\theta$. PG methods optimize the objective $J(\theta) = \mathbb{E}_{s \sim \rho^\pi, a \sim \pi_\theta}[U]$ by taking steps in the direction of $\nabla_\theta J(\theta)$. Using the Q-function, the gradient of the policy can be written as:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\pi, a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^\pi(s, a)\right] \tag{2}$$

where $\rho^\pi$ is the state distribution given $\pi$. Several practical PG algorithms differ in how they estimate $Q^\pi$. For example, REINFORCE williams1992simple simply uses a sample return $U_t = \sum_{i \geq t} \gamma^{i-t} r_i$. Alternatively, one could learn an approximation of the action-value function $Q^\pi(s, a)$; this approximation is called the critic and leads to a variety of actor-critic algorithms sutton1998reinforcement; mnih2016asynchronous.

The Option Framework. Sutton et al. sutton1999 first formalized the idea of temporally extended actions as options. An option $o$ is defined as a triple $\langle I_o, \pi_o, \beta_o \rangle$ in which $I_o \subseteq S$ is an initiation state set, $\pi_o$ is an intra-option policy and $\beta_o$ is a termination function that specifies the probability that the option terminates at state $s$. An MDP endowed with a set of options becomes a Semi-Markov Decision Process (Semi-MDP), which has a corresponding optimal option-value function over options learned using intra-option learning. The option framework considers the call-and-return option execution model, in which an agent picks an option $o$ according to its option-value function $Q_o(s, o)$, follows the intra-option policy $\pi_o$ until termination, then selects the next option and repeats the procedure.
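The call-and-return execution model can be sketched as a short loop. The snippet below is an illustration under stated assumptions: the helper `run_call_and_return`, the one-dimensional toy environment, and the hand-written option values and termination probabilities are all hypothetical and only serve to show when an option is re-selected.

```python
import random


def run_call_and_return(env_step, start_state, options, q_option, beta, max_steps, rng):
    """Follow the current option until it terminates, then pick a new one.

    options:  dict mapping option name -> intra-option policy (state -> action)
    q_option: function (state, name) -> option value
    beta:     function (state, name) -> termination probability in [0, 1]
    """
    state = start_state
    current = max(options, key=lambda o: q_option(state, o))
    trace = []
    for _ in range(max_steps):
        action = options[current](state)
        state = env_step(state, action)
        trace.append((current, action, state))
        # Terminate stochastically; only then is a new option selected.
        if rng.random() < beta(state, current):
            current = max(options, key=lambda o: q_option(state, o))
    return trace


# Toy chain: the agent walks right until state 3, where the option terminates.
options = {"right": lambda s: 1, "left": lambda s: -1}
q = lambda s, o: 1.0 if (o == "right") == (s < 3) else 0.0
beta = lambda s, o: 1.0 if s == 3 else 0.0
trace = run_call_and_return(lambda s, a: s + a, 0, options, q, beta, 5, random.Random(0))
```

Here the option is chosen greedily for brevity; an exploratory policy over options would replace the two `max` calls.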

## 3. Policy Transfer Framework

### 3.1. Motivation

One major direction of previous work focuses on transferring value functions directly according to the similarity between two tasks brys2015policy; song2016measuring; laroche2017transfer. However, this approach often assumes a well-estimated model for measuring similarity, which is computationally expensive and infeasible in complex scenarios. Another direction of policy transfer methods focuses on selecting appropriate source policies, based on their performance on the target task, to provide guided exploration during each episode fernandez2006probabilistic; li2018optimal; li2018context. However, most of these works face the challenge of how to select a suitable source policy, since each source policy may be only partially useful for the target task. Furthermore, some of them assume that source policies are optimal and deterministic, which restricts their generality. What is missing from previous work is a way to directly optimize the target policy by alternately drawing on knowledge from appropriate source policies, without explicitly measuring task similarity.

Based on the above analysis, we propose a novel Policy Transfer Framework (PTF) that accelerates RL by combining the above two directions of policy reuse. Instead of using source policies as guided exploration in a target task, PTF adaptively selects a suitable source policy during target task learning and uses it as a complementary optimization objective for the target policy. In this way, PTF does not require any source policy to be perfect on any subtask and can still learn toward an optimal policy when none of the source policies is useful. In addition, we propose a novel way of adaptively determining how much knowledge to transfer from a source policy to the target policy in order to avoid negative transfer, which remains effective when only some of the source policies share the same state-action space as the target task.

### 3.2. Framework Overview

Figure 1(a) illustrates the proposed Policy Transfer Framework (PTF), which contains two main components. The first (Figure 1(b)) is the agent module (shown here with an actor-critic model as an example), which learns the target policy with guidance from the option module. The second (Figure 1(c)) is the option module, which learns when and which source policy is useful for the agent module. Given a set of source policies as the intra-option policies, the PTF agent first initializes a set of options together with the option-value network with random parameters. At each step, it selects an action following its policy, receives a reward and transitions to the next state. Meanwhile, it also selects an option $o$ according to the policy over options and the termination probabilities. For the update, the PTF agent introduces a complementary loss, which transfers knowledge from the intra-option policy $\pi_o$ through imitation, weighted by an adaptive adjustment factor $f(\beta_o, t)$. The PTF agent also simultaneously updates the option-value network and the termination probability of $o$ using its own experience. The reuse of the policy $\pi_o$ terminates according to the termination probability of $o$, and then another option is selected for reuse following the policy over options. In this way, PTF efficiently exploits the useful information from the source policies and avoids negative transfer through the call-and-return option execution model. PTF can be easily integrated with both value-based and policy-based DRL methods. The next section describes in detail how it can be combined with A3C mnih2016asynchronous as an example.

### 3.3. Policy Transfer Framework (PTF)

In this section, we describe PTF applied to A3C mnih2016asynchronous: PTF-A3C. The whole learning process of PTF-A3C is shown in Algorithm 1. First, PTF-A3C initializes network parameters for the option-value network, the termination network (which shares the input and hidden layers with the option-value network and has a different output layer), and the A3C networks (Line 1). For each episode, the PTF-A3C agent first selects an option $o$ according to the policy over options (Line 6); then it selects an action $a$ following the current policy $\pi$, receives a reward $r$, transitions to the next state $s'$ and stores the transition in the replay buffer (Lines 8-11). Another option is selected if the option $o$ is terminated according to the termination probability of $o$ (Line 12).

For the update, the agent computes the gradient of the temporal difference loss for the critic network (Line 17). It also calculates the gradients of the standard actor loss and of an extra loss that measures the difference between the source policy $\pi_o$ inside the option and the current policy $\pi$ using the cross-entropy: $L_H = H(\pi_o \,\|\, \pi)$. $L_H$ is used as the supervision signal, weighted by an adaptive adjustment factor $f(\beta_o, t)$. To ensure sufficient exploration, an entropy bonus is also included mnih2016asynchronous, weighted by a constant factor (Line 18). Then it updates the option-value network following Algorithm 2 and the termination network accordingly (Lines 19, 20), which are described in detail in the following sections.
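The composition of the actor objective described above can be sketched for a single state with a discrete action distribution. This is an illustration rather than the authors' implementation: the function names, the explicit `advantage` argument, and the sign conventions are assumptions, and a real agent would compute these terms on network outputs and backpropagate through them.

```python
import math


def cross_entropy(source_probs, target_probs, eps=1e-12):
    """H(pi_o || pi) = -sum_a pi_o(a|s) * log pi(a|s)."""
    return -sum(p * math.log(q + eps) for p, q in zip(source_probs, target_probs))


def entropy(probs, eps=1e-12):
    """Shannon entropy of the current policy at one state."""
    return -sum(p * math.log(p + eps) for p in probs)


def ptf_actor_loss(target_probs, action, advantage, source_probs, weight, entropy_coef):
    """Policy-gradient term + weighted imitation term - entropy bonus."""
    pg_loss = -math.log(target_probs[action]) * advantage
    imitation = weight * cross_entropy(source_probs, target_probs)
    return pg_loss + imitation - entropy_coef * entropy(target_probs)


# With weight 0 the loss reduces to the plain actor loss with entropy bonus.
base = ptf_actor_loss([0.5, 0.5], 0, 1.0, [1.0, 0.0], 0.0, 0.01)
guided = ptf_actor_loss([0.5, 0.5], 0, 1.0, [1.0, 0.0], 0.5, 0.01)
```

Setting `weight` to zero recovers learning from scratch, which is how the adaptive factor can switch imitation off when a source policy stops being useful.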

#### 3.3.1. Learning Source Policy Selection

The remaining issue is how to update the option-value network, which is given in Algorithm 2. Since options are temporal abstractions sutton1999; bacon2017option, $U$ is introduced as the option-value function upon arrival. The expected return of executing option $o$ upon entering the next state $s'$ is $U(s', o \mid \theta_o)$, which depends on $\beta(s', o \mid \theta_\beta)$, i.e., the probability that option $o$ terminates in the next state $s'$:

$$U(s', o \mid \theta_o) = \left(1 - \beta(s', o \mid \theta_\beta)\right) Q_o(s', o \mid \theta_o) + \beta(s', o \mid \theta_\beta) \max_{o'} Q_o(s', o' \mid \theta_o) \tag{3}$$

Then, PTF-A3C samples a batch of transitions from the replay buffer and updates the option-value network by minimizing the loss (Line 6 in Algorithm 2). Each sample can be used to update the values of multiple options, as long as the option could have selected the sampled action (for continuous action spaces, this is checked by fitting the action $a$ to the source policy distribution within a certain confidence interval). Thus, the sample efficiency can be significantly improved in an off-policy manner.
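To make the option-value update concrete, the sketch below computes an arrival value that mixes "continue with the same option" and "switch to the best option" according to the termination probability, and then builds one-step targets for every option that could have produced the sampled action. The exact form of the arrival value, the squared-error style target, and all names (`arrival_value`, `option_td_targets`, `eligible`) are assumptions made for illustration.

```python
def arrival_value(q_next, option, beta_next):
    """Assumed form: U(s', o) = (1 - beta) * Q(s', o) + beta * max_o' Q(s', o')."""
    return (1.0 - beta_next) * q_next[option] + beta_next * max(q_next.values())


def option_td_targets(reward, q_next, beta_next, eligible, gamma=0.99):
    """One-step targets r + gamma * U(s', o) for every eligible option.

    `eligible` lists the options whose source policy could have selected the
    sampled action, so a single transition updates several option values.
    """
    return {
        o: reward + gamma * arrival_value(q_next, o, beta_next[o])
        for o in eligible
    }


q_next = {"o1": 1.0, "o2": 3.0}       # option values at the next state
beta_next = {"o1": 0.5, "o2": 0.1}    # termination probabilities at the next state
targets = option_td_targets(1.0, q_next, beta_next, ["o1", "o2"], gamma=0.5)
```

The option-value network would then be regressed toward these targets for the sampled state.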

PTF-A3C learns option-values in the call-and-return option execution model, where an option $o$ is executed until it terminates at state $s$ based on its termination probability, and then the next option is selected by a policy over options that is $\epsilon$-greedy with respect to the option-value $Q_o$. Specifically, with probability $1 - \epsilon$, the option with the highest option-value is selected (with random selection in case of a tie), and with probability $\epsilon$, PTF-A3C makes a random choice to explore other options with potentially better performance.
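The selection rule above can be sketched in a few lines. The function name `select_option` and the dictionary representation of option values are illustrative choices, not part of the paper.

```python
import random


def select_option(option_values, epsilon, rng=random):
    """Epsilon-greedy choice over options with random tie-breaking.

    option_values: dict mapping option name -> estimated option value.
    """
    names = list(option_values)
    if rng.random() < epsilon:
        return rng.choice(names)  # explore: uniform over all options
    best = max(option_values.values())
    # exploit: pick uniformly among the options that share the highest value
    return rng.choice([o for o in names if option_values[o] == best])


rng = random.Random(0)
values = {"o1": 0.2, "o2": 0.7, "o3": 0.7}
greedy_picks = {select_option(values, 0.0, rng) for _ in range(200)}
random_picks = {select_option(values, 1.0, rng) for _ in range(200)}
```

With `epsilon` set to 0 only the tied best options are ever returned, while `epsilon` set to 1 samples uniformly from all options.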

#### 3.3.2. Learning Termination Probabilities

According to the call-and-return option execution model, the termination probability $\beta_o$ controls when to terminate the currently selected option and select another option accordingly. The objective of learning the termination probability is to maximize the expected return $U$, so we update the termination network parameters by computing the gradient of the discounted return objective with respect to the initial condition $(s_1, o_0)$ bacon2017option:

$$\frac{\partial U(s_1, o_0 \mid \theta_o)}{\partial \theta_\beta} = -\sum_{s', o} \mu(s', o \mid s_1, o_0) \frac{\partial \beta(s', o \mid \theta_\beta)}{\partial \theta_\beta} A(s', o \mid \theta_o) \tag{4}$$

where $A(s', o \mid \theta_o)$ is the advantage function, which can be approximated as $Q_o(s', o \mid \theta_o) - \max_{o'} Q_o(s', o' \mid \theta_o)$, and $\mu(s', o \mid s_1, o_0)$ is a discounted weighting of state-option pairs from the initial condition $(s_1, o_0)$: $\mu(s', o \mid s_1, o_0) = \sum_{t=0}^{\infty} \gamma^t P(s_{t+1} = s', o_t = o \mid s_1, o_0)$. Here $P(s_{t+1} = s', o_t = o \mid s_1, o_0)$ is the transition probability along the trajectory starting from the initial condition $(s_1, o_0)$ to $(s', o)$ in $t$ steps. Since $\mu$ is estimated from samples along the on-policy stationary distribution, we neglect it for data efficiency Thomas14; li2018context. Then $\theta_\beta$ is updated as follows bacon2017option; li2018context:

$$\theta_\beta \leftarrow \theta_\beta - \alpha_\beta \frac{\partial \beta(s', o \mid \theta_\beta)}{\partial \theta_\beta} \left(A(s', o \mid \theta_o) + \xi\right) \tag{5}$$

where $\alpha_\beta$ is the learning rate and $\xi$ is a regularization term. The advantage term is $0$ if the option is the one with the maximum option value, and negative otherwise. In this way, the termination probability of every option whose value is not the maximum would increase. However, the estimate of the option-value function is not accurate initially. If we multiplied the gradient by the advantage alone, the termination probability of an option with the maximum true option value could also increase, which would lead to a sub-optimal policy over options. The purpose of $\xi$ is to ensure sufficient exploration so that the best option can be selected.
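The effect of the termination update can be illustrated with a deliberately simplified parameterization. The sketch below assumes one scalar logit per option with a sigmoid output, an advantage of the form "option value minus best option value", and an update that subtracts the learning rate times the gradient of the termination probability times (advantage plus regularizer). These are assumptions for illustration; the paper uses a termination network rather than a table of logits.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def update_termination_logit(logit, q_values, option, lr, xi):
    """One update step for a per-option termination logit (hypothetical setup).

    advantage = Q(s', o) - max_o' Q(s', o') is zero for the best option and
    negative otherwise; xi is the regularization term.
    """
    advantage = q_values[option] - max(q_values.values())
    beta = sigmoid(logit)
    grad_beta = beta * (1.0 - beta)  # d sigmoid(z) / dz
    return logit - lr * grad_beta * (advantage + xi)


q_values = {"good": 2.0, "bad": 0.5}
new_good = update_termination_logit(0.0, q_values, "good", lr=1.0, xi=0.1)
new_bad = update_termination_logit(0.0, q_values, "bad", lr=1.0, xi=0.1)
```

Under these assumptions the best option's termination probability shrinks slightly because of the regularizer, while the worse option's termination probability grows.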

#### 3.3.3. Transferring from Selected Source Policy

Next, we describe how to transfer knowledge from the selected source policy. The transfer mechanism is motivated by policy distillation rusu2016policy, which exploits multiple teacher policies to train a student policy. Namely, a teacher policy $\pi_T$ is used to generate trajectories $x$, each containing a sequence of states $(x_t)_{t \geq 0}$. The goal is to match the student's policy $\pi_S$, parameterized by $\theta$, to $\pi_T$. The corresponding loss function term for each sequence at each time step $t$ is $H(\pi_T(a \mid x_t) \,\|\, \pi_S(a \mid x_t, \theta))$, where $H(\cdot \,\|\, \cdot)$ is the cross-entropy loss. For value-based algorithms, e.g., DQN, we can measure the difference between two Q-value distributions using the Kullback-Leibler (KL) divergence with temperature $\tau$:

$$L_{KL} = \sum_{t} \mathrm{softmax}\left(\frac{Q_T(x_t)}{\tau}\right) \ln \frac{\mathrm{softmax}\left(Q_T(x_t) / \tau\right)}{\mathrm{softmax}\left(Q_S(x_t)\right)} \tag{6}$$
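As a small illustration of the temperature-scaled KL term for value-based students, the snippet below softens the teacher's Q-values with a temperature, leaves the student at temperature 1, and sums the KL contributions over actions for one state. The function names and the convention of applying the temperature only to the teacher are assumptions of this sketch, following the usual policy distillation setup.

```python
import math


def softmax(values, temperature=1.0):
    """Numerically stable softmax of values / temperature."""
    scaled = [v / temperature for v in values]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_kl(teacher_q, student_q, temperature):
    """KL(softmax(teacher_q / tau) || softmax(student_q)) for a single state."""
    p = softmax(teacher_q, temperature)
    q = softmax(student_q)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


same = distillation_kl([1.0, 2.0], [1.0, 2.0], 1.0)
different = distillation_kl([1.0, 2.0], [2.0, 1.0], 1.0)
```

A temperature below 1 sharpens the teacher distribution, which puts more weight on the teacher's greedy action.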

Kickstarting schmitt2018kickstarting trains a student policy that surpasses the teacher policy on the same task set by adding the cross-entropy loss between the teacher and student policies to the RL loss. However, it does not consider learning a new task that is different from the teacher's task set. Furthermore, using Population Based Training (PBT) pbt to adjust the weighting factor of the cross-entropy loss increases the computational cost and lacks an adaptive adjustment mechanism.

To this end, we propose an adaptive and heuristic way to adjust the weighting factor of the cross-entropy loss. The option module contains a termination network that reflects the performance of options on the target task. If the performance of the current option is not the best among all options, the termination probability of this option grows, which indicates that the current option should be terminated with a higher probability. Therefore, the termination probability of a source policy can be used as a performance indicator for adjusting its degree of exploitation. Specifically, the probability of exploiting the current source policy should decrease as the performance of the option decreases, so the weighting factor, which reflects the probability of exploiting the current source policy, should be inversely related to the termination probability. We propose to adaptively adjust $f(\beta_o, t)$ as follows:

$$f(\beta_o, t) = f(t)\left(1 - \beta(s_t, o \mid \theta_\beta)\right) \tag{7}$$

where $f(t)$ is a discount function. When the value of the termination function of option $o$ increases, it means that the option is not the best-performing one among all options based on the current experience. Thus we decrease the weighting factor of the cross-entropy loss, and vice versa. $f(t)$ controls the slow decrease in exploiting the knowledge transferred from source policies: at the beginning of learning, we mostly exploit source knowledge; as learning continues, past knowledge becomes less useful and we focus more on the current self-learned policy. In this way, PTF efficiently exploits useful information and avoids negative transfer from source policies.
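The weighting scheme can be sketched as the product of a time-dependent discount and one minus the termination probability. The exponential schedule `decay ** step` below is a hypothetical stand-in for the discount function, whose exact form is not given in this section; the function name is also illustrative.

```python
def imitation_weight(termination_prob, step, decay=0.999):
    """Weight of the imitation loss: discount(step) * (1 - termination_prob).

    The exponential discount `decay ** step` is an assumption of this sketch.
    """
    return (decay ** step) * (1.0 - termination_prob)


early_good = imitation_weight(0.1, step=0)      # useful option, start of training
early_bad = imitation_weight(0.9, step=0)       # option that is likely to terminate
late_good = imitation_weight(0.1, step=5000)    # same option, much later
```

The weight falls both when the option looks worse (higher termination probability) and as training progresses, matching the two effects described above.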

## 4. Experimental Results

In this section, we evaluate PTF on three test domains: grid world fernandez2006probabilistic, pinball konidaris2009skill and reacher dmcontrol. We compare it with several DRL methods learning from scratch (A3C mnih2016asynchronous and PPO ppo) and with the state-of-the-art policy transfer method CAPS li2018context, implemented as a deep version (Deep-CAPS). Results are averaged over random seeds. The source code is available at https://github.com/PTF-transfer/Code_PTF.

### 4.1. Evaluation on Different Environments

#### 4.1.1. Grid world

Figure 2(a) shows a grid world in which the agent starts from any of the grids and chooses one of four actions: up, down, left and right. Each action moves the agent one step in the corresponding direction. The figure marks the goals of the source tasks and the goals of the target tasks. One target task is similar to one of the source tasks, since their goals are close to each other, while the other target task differs from every source task because its goal is far from all of their goals. The game ends when the agent reaches the goal grid of the target task or when the time exceeds a fixed period. The agent receives a reward after reaching the goal grid. The source policies are trained using A3C learning from scratch. We also manually design primitive policies for deep-CAPS following its original settings (i.e., each primitive policy selects the same action for all states); this is unnecessary for our PTF framework.

We first investigate the performance of PTF when the target task is similar to one of the source tasks (i.e., their goal grids are very close). Figure 3 presents the average discounted rewards of various methods when learning this target task on the grid world. We can see from Figure 3(a) that PTF-A3C significantly accelerates the learning process and outperforms A3C. Similar results can be found in Figure 3(b). The reason is that PTF quickly identifies the optimal source policy and exploits useful information from source policies, which accelerates learning compared with learning from scratch. Figure 3(c) shows the performance gap between PTF-A3C and deep-CAPS. This gap arises because the policy reuse module and the target task learning module in PTF are loosely coupled: apart from reusing knowledge from source policies, PTF can also use its own experience from the environment. In deep-CAPS, by contrast, these two parts are tightly coupled, which means its exploration and exploitation depend entirely on the source policies inside the options. Thus, deep-CAPS places higher requirements on source policies than PTF and ultimately achieves lower performance than PTF-A3C.

Next, we investigate the performance of PTF when none of the source tasks is very similar to the target task (i.e., their goal grids are far apart). Figure 4 presents the average discounted rewards of various methods when learning this target task. We can see from Figure 4(a) and (b) that both PTF-A3C and PTF-PPO significantly accelerate the learning process and outperform A3C and PPO. The reason is that PTF identifies which source policy is best to exploit and when to terminate it, which accelerates learning compared with learning from scratch. The lower performance of deep-CAPS compared with PTF-A3C (Figure 4(c)) has the same cause as before: its exploration and exploitation depend entirely on the source policies, so it places higher requirements on them than PTF does.

To verify that PTF also works when the transition dynamics of the source and target tasks differ, we conduct experiments on learning a target task on a grid world (Figure 2(b)) whose map differs substantially from the map used for learning the source tasks. Figure 5 shows that PTF still outperforms the other methods even if only some parts of the source policies can be exploited. PTF identifies and exploits the useful parts automatically.

We further investigate whether PTF can efficiently avoid negative transfer. Figure 6 shows the average discounted rewards of PTF-A3C and deep-CAPS when the source policies are not optimal for their source tasks, which means they are not deterministic in all states. As described before, deep-CAPS depends entirely on source policies for exploration and exploitation on the target task. Thus, deep-CAPS cannot avoid the negative and stochastic impact of such source policies, which confuses the learning of the option-value network, and it finally obtains lower performance than PTF-A3C.

#### 4.1.2. Pinball

In the pinball domain (Figure 7(a)), a ball must be guided through a maze of arbitrarily shaped polygons to a designated target location. The state space is continuous over the position and velocity of the ball in the plane. The action space is continuous in the range of , which controls the increment of the velocity in the vertical or horizontal direction. Collisions with obstacles are elastic and can be used to the advantage of the agent. A drag coefficient of effectively stops ball movements after a finite number of steps when the null action is chosen repeatedly. Each thrust action incurs a penalty of while taking no action costs . The episode terminates with a reward when the agent reaches the target. We interrupted any episode taking more than steps and set the discount factor to . These rewards are all normalized to ensure more stable training. The source policies are trained using A3C learning from scratch. We also design primitive policies for deep-CAPS, an increment of the velocity in the vertical or horizontal direction; a decrement of the velocity in the vertical or horizontal direction and the null action, which is unnecessary for our PTF framework.

Figure 8 depicts the performance of PTF when learning a target task on Pinball that is similar to one of the source tasks (i.e., their goal states are very close). We can see that PTF significantly accelerates the learning process of A3C and PPO (Figure 8(a) and (b)) and outperforms deep-CAPS (Figure 8(c)). The advantage of PTF is similar to that in the grid world: PTF efficiently exploits the useful information from source policies to optimize the target policy, and thus achieves higher performance than learning from scratch. In contrast, deep-CAPS achieves lower average discounted rewards than PTF because it depends entirely on source policies for exploration in the target task, and the continuous action space is hard to cover fully even with the manually added primitive policies.

We further verify that PTF works well in the same setting as in the grid world, where none of the source tasks is very similar to the target task (i.e., their goal states are far apart). From Figure 9 we can see that PTF outperforms the other methods even if only some parts of the source policies can be exploited. PTF identifies which source policy is best to exploit and when to terminate it, and thus efficiently accelerates the learning process.

#### 4.1.3. Reacher

To further validate the performance of PTF, we provide an alternative scenario, Reacher (Figure 7(b)) dmcontrol, which is qualitatively different from the above two navigation tasks. Reacher is one of the robot control problems in MuJoCo mujoco, in which a two-link planar arm must reach a target location. The episode ends with a reward when the end effector penetrates the target sphere, or when it exceeds the step limit. We design several tasks in Reacher that differ in the location and size of the target sphere. Since deep-CAPS performs poorly in the continuous pinball domain due to the limitations described above, we only compare PTF with vanilla A3C and PPO in the following sections.

Figure 10(a) shows the performance of PTF-A3C and A3C on Reacher. We can see that PTF-A3C achieves higher average discounted rewards than A3C. Similar results hold for PTF-PPO and PPO, as shown in Figure 10(b). This is because PTF efficiently exploits the useful knowledge from source tasks and thus accelerates the learning process compared with the vanilla methods. Taken together, the results across these environments indicate the robustness of PTF.

### 4.2. The Influence of the Weighting Factor

Next, we provide an ablation study to investigate the influence of the weighting factor $f(\beta_o, t)$ (Equation 7), which is the key factor, on the performance of PTF. Figure 11 shows the influence of different parts of the weighting factor on the performance of PTF-A3C. We can see that when the extra loss is added without the weighting factor, it helps the agent at the beginning of learning compared with A3C learning from scratch, but it leads to a sub-optimal policy because the agent focuses too much on mimicking the source policies. In contrast, introducing the weighting factor allows us to stop exploiting source policies in time and thus achieves the best transfer performance.

### 4.3. The Performance of Option Learning

Finally, we validate whether PTF learns an effective policy over options. There may be some concern about learning the termination function $\beta$, namely that termination easily collapses bacon2017option; HarutyunyanDBHM19; HarbBKP18, which makes policy optimization difficult. In this section, we report the dynamics of the option switch frequency to investigate option learning in PTF. From Figure 12(a) and (b) we can see that the option switch frequency decreases quickly and stabilizes as learning proceeds. This indicates that both PTF-A3C and PTF-PPO efficiently learn when and which option is useful and provide meaningful guidance for target task learning.

## 5. Related Work

Recently, transfer in RL has become an important direction and a wide variety of methods have been studied in the context of RL transfer learning taylor2009transfer. Brys et al. brys2015policy applied a reward shaping approach to policy transfer, benefiting from the theoretical guarantees of reward shaping. However, it may suffer from negative transfer. Song et al. song2016measuring transferred the action-value functions of the source tasks to the target task according to a task similarity metric to compute the task distance. However, they assumed a well-estimated model which is not always available in practice. Later, Laroche et al. laroche2017transfer reused the experience instances of a source task to estimate the reward function of the target task. The limitation of this approach resides in the restrictive assumption that all the tasks share the same transition dynamics and differ only in the reward function.

Policy reuse is a technique to accelerate RL with guidance from previously learned policies, assuming to start with a set of available policies, and to select among them when faced with a new task, which is, in essence, a transfer learning approach taylor2009transfer. Fernández et al. fernandez2006probabilistic used policy reuse as a probabilistic bias when learning the new, similar tasks. Rajendran et al. rajendran2017attend proposed the A2T (Attend, Adapt and Transfer) architecture to select and transfer from multiple source tasks by incorporating an attention network which learns the weights of several source policies for combination. Li et al. li2018optimal proposed the optimal source policy selection through online explorations using multi-armed bandit methods. However, most of the previous works select the source policy according to the performance of source policies on the target task, i.e., the utility, which fails to address the problems where multiple source policies are partially useful for learning the target task and even cause negative transfer.

The option framework was first proposed in sutton1999 as a temporal abstraction, modeled as a Semi-MDP. A number of works have focused on option discovery bacon2017option; abs-1712-00004; HarbBKP18; HarutyunyanDBHM19. An important example is the option-critic bacon2017option, which learns multiple source policies in the form of options from scratch, end-to-end. However, the option-critic tends to collapse to single-action primitives in later training stages. The follow-up work on the option-critic with deliberation cost HarbBKP18 addresses this option collapse by modifying the termination objective to additionally penalize option termination, but it is highly sensitive to the associated cost parameter. Recently, Harutyunyan et al. HarutyunyanDBHM19 further modified the termination objective to be completely independent of the task reward and provided theoretical guarantees of optimality. The objectives of these option discovery works and of PTF are orthogonal: PTF transfers from source policies to the target task, whereas the other works learn multiple source policies from scratch. There are also some imitation learning works KipfLDZSGKB19; HausmanCSSL17; abs-1711-11289 related to option discovery, which are not the focus of this work.

## 6. Conclusion and Future Work

In this paper, we propose a Policy Transfer Framework (PTF) which can efficiently select the optimal source policy and exploit the useful information to facilitate target task learning. PTF also efficiently avoids negative transfer by terminating the exploitation of the current source policy and adaptively selecting another one. PTF can be easily combined with existing deep policy-based and actor-critic methods. Experimental results show that PTF efficiently accelerates the learning process of existing state-of-the-art DRL methods and outperforms previous policy reuse approaches. As a future topic, it is worthwhile to investigate how to extend PTF to multiagent settings. Another interesting direction is how to learn abstract knowledge for fast adaptation in new environments.

## Acknowledgments

The work is supported by the National Natural Science Foundation of China (Grant Nos.: 61702362, U1836214).

## References

## Appendix

### Network structure

The network structure is the same for all methods: the actor network has two fully-connected hidden layers both with 64 hidden units, the output layer is a fully-connected layer that outputs the action probabilities for all actions; the critic network contains two fully-connected hidden layers both with 64 hidden units and a fully-connected output layer with a single output: the state value; the option-value network contains two fully-connected hidden layers both with 32 units; two output layers, one outputs the option-values for all options, and the other outputs the termination probability of the selected option.
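To make the layer sizes above concrete, the following sketch counts the trainable parameters of each fully-connected network. The input dimension, the number of actions, the number of options, and the single-unit termination head are hypothetical values chosen only for illustration; the hidden sizes follow the description above.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully-connected network with the given layer widths."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Hypothetical problem dimensions (not taken from the paper).
STATE_DIM, N_ACTIONS, N_OPTIONS = 10, 4, 4

actor = mlp_param_count([STATE_DIM, 64, 64, N_ACTIONS])
critic = mlp_param_count([STATE_DIM, 64, 64, 1])

# Option module: shared two-layer trunk with two separate output heads.
option_trunk = mlp_param_count([STATE_DIM, 32, 32])
option_heads = mlp_param_count([32, N_OPTIONS]) + mlp_param_count([32, 1])
option_module = option_trunk + option_heads
```

Sharing the trunk between the option-value and termination heads keeps the option module much smaller than the actor or critic under these assumed dimensions.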

#### Grid world

The input consists of the following information: the coordinate of the agent and the environmental information (i.e., each of surrounding eight grids is a wall or not) which is encoded as a one-hot vector.

#### Pinball

The input contains the position of the ball ( and ) and the velocity of the ball in the plane.

#### Reacher

The input contains the positions of the finger ( and ), the relative distance to the target position, and the velocity of in the plane.

#### Parameter Settings

Hyperparameter | Value |
---|---|
Discount factor ($\gamma$) | 0.99 |
Optimizer | Adam |
Learning rate | |
$\epsilon$ decrement | |
$\epsilon$-start | |
$\epsilon$-end | |
Batch size | |
Number of episodes | |
Replacing the target network | |

Hyperparameter | Value |
---|---|
Number of processes | 8 |
Discount factor ($\gamma$) | 0.99 |
Optimizer | Adam |
Learning rate | |
Entropy term coefficient | |

Hyperparameter | Value |
---|---|
Discount factor ($\gamma$) | 0.99 |
Optimizer | Adam |
Learning rate | |
Clip value | |
Entropy term coefficient | 0.005 |

Hyperparameter | Value |
---|---|
Discount factor ($\gamma$) | 0.99 |
Optimizer | Adam |
Learning rate for the policy network | |
Learning rate for the option network | |
Regularization term $\xi$ for Equation 5 | |
$\epsilon$ decrement | |
$\epsilon$-start | |
$\epsilon$-end | |
Batch size | |
Number of episodes | |
Replacing the target network | |