
CUP: Critic-Guided Policy Reuse

The ability to reuse previous policies is an important aspect of human intelligence. To achieve efficient policy reuse, a Deep Reinforcement Learning (DRL) agent needs to decide when to reuse and which source policies to reuse. Previous methods solve this problem by introducing extra components to the underlying algorithm, such as hierarchical high-level policies over source policies, or estimations of source policies' value functions on the target task. However, training these components induces either optimization non-stationarity or heavy sampling cost, significantly impairing the effectiveness of transfer. To tackle this problem, we propose a novel policy reuse algorithm called Critic-gUided Policy reuse (CUP), which avoids training any extra components and efficiently reuses source policies. CUP utilizes the critic, a common component in actor-critic methods, to evaluate and choose source policies. At each state, CUP chooses the source policy that has the largest one-step improvement over the current target policy, and forms a guidance policy. The guidance policy is theoretically guaranteed to be a monotonic improvement over the current target policy. Then the target policy is regularized to imitate the guidance policy to perform efficient policy search. Empirical results demonstrate that CUP achieves efficient transfer and significantly outperforms baseline algorithms.



1 Introduction

Human intelligence can solve new tasks quickly by reusing previous policies (Guberman and Greenfield, 1991). Despite remarkable success, current Deep Reinforcement Learning (DRL) agents lack this knowledge transfer ability (Silver et al., 2017; Vinyals et al., 2019; Ceron and Castro, 2021), leading to enormous computation and sampling cost. As a consequence, a large number of works have been studying the problem of policy reuse in DRL, i.e., how to efficiently reuse source policies to speed up target policy learning (Fernández and Veloso, 2006; Barreto et al., 2018; Li et al., 2019; Yang et al., 2020b).

A fundamental challenge towards policy reuse is: how does an agent with access to multiple source policies decide when and where to use them (Fernández and Veloso, 2006; Kurenkov et al., 2020; Cheng et al., 2020)? Previous methods solve this problem by introducing additional components to the underlying DRL algorithm, such as hierarchical high-level policies over source policies (Li et al., 2018, 2019; Yang et al., 2020b), or estimations of source policies’ value functions on the target task (Barreto et al., 2017, 2018; Cheng et al., 2020). However, training these components significantly impairs the effectiveness of transfer, as hierarchical structures induce optimization non-stationarity (Pateria et al., 2021), and estimating the value functions for every source policy is computationally expensive and with high sampling cost. Thus, the objective of this study is to address the question:

Can we achieve efficient transfer without training additional components?

Notice that actor-critic methods (Lillicrap et al., 2016; Fujimoto et al., 2018; Haarnoja et al., 2018) learn a critic that approximates the actor’s Q function and serves as a natural way to evaluate policies. Based on this observation, we propose a novel policy reuse algorithm that utilizes the critic to choose source policies. The proposed algorithm, called Critic-gUided Policy reuse (CUP), avoids training any additional components and achieves efficient transfer. At each state, CUP chooses the source policy that has the largest one-step improvement over the current target policy, thus forming a guidance policy. Then CUP guides learning by regularizing the target policy to imitate the guidance policy. This approach has the following advantages. First, the one-step improvement can be estimated simply by querying the critic, so no additional components need to be trained. Second, the guidance policy is theoretically guaranteed to be a monotonic improvement over the current target policy, which ensures that CUP can reuse the source policies to improve the current target policy. Finally, CUP is conceptually simple and easy to implement, introducing very few hyper-parameters to the underlying algorithm.

We evaluate CUP on Meta-World (Yu et al., 2020), a popular reinforcement learning benchmark composed of multiple robot arm manipulation tasks. Empirical results demonstrate that CUP achieves efficient transfer and significantly outperforms baseline algorithms.

2 Preliminaries

Reinforcement learning (RL) deals with Markov Decision Processes (MDPs). An MDP can be modelled by a tuple $\langle \mathcal{S}, \mathcal{A}, r, p, \gamma \rangle$, with state space $\mathcal{S}$, action space $\mathcal{A}$, reward function $r$, transition function $p$, and discount factor $\gamma$ (Sutton and Barto, 2018). In this study, we focus on MDPs with continuous action spaces. RL's objective is to find a policy $\pi$ that maximizes the expected cumulative discounted return $\mathbb{E}_{\pi}\big[\sum_{t=0}^{\infty}\gamma^{t} r(s_t, a_t)\big]$.

While CUP is generally applicable to a wide range of actor-critic algorithms, in this work we use SAC (Haarnoja et al., 2018) as the underlying algorithm. The soft Q function and soft V function (Haarnoja et al., 2017) of a policy $\pi$ are defined as:

$$Q^{\pi}_{soft}(s,a) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t} r(s_t,a_t) + \alpha\sum_{t=1}^{\infty}\gamma^{t}\mathcal{H}\big(\pi(\cdot|s_t)\big)\,\Big|\, s_0=s,\ a_0=a\Big], \tag{1}$$
$$V^{\pi}_{soft}(s) = \mathbb{E}_{a\sim\pi(\cdot|s)}\big[Q^{\pi}_{soft}(s,a) - \alpha\log\pi(a|s)\big], \tag{2}$$

where $\alpha$ is the entropy weight. SAC's loss functions are defined as:

$$\begin{aligned}
L_Q(\theta) &= \mathbb{E}_{(s,a,r,s')\sim\mathcal{D}}\Big[\big(Q_{\theta}(s,a) - \big(r + \gamma \bar V_{\bar\theta}(s')\big)\big)^2\Big],\\
L_{\pi}(\phi) &= \mathbb{E}_{s\sim\mathcal{D},\, a\sim\pi_{\phi}(\cdot|s)}\big[\alpha\log\pi_{\phi}(a|s) - Q_{\theta}(s,a)\big],\\
L(\alpha) &= \mathbb{E}_{s\sim\mathcal{D},\, a\sim\pi_{\phi}(\cdot|s)}\big[-\alpha\log\pi_{\phi}(a|s) - \alpha\bar{\mathcal{H}}\big],
\end{aligned} \tag{3}$$

where $\mathcal{D}$ is the replay buffer, $\bar{\mathcal{H}}$ is a hyper-parameter representing the target entropy, $\theta$ and $\phi$ are network parameters, $\bar\theta$ is the target network's parameters, and $\bar V_{\bar\theta}(s') = \mathbb{E}_{a'\sim\pi_{\phi}(\cdot|s')}\big[Q_{\bar\theta}(s',a') - \alpha\log\pi_{\phi}(a'|s')\big]$ is the target soft value function.
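For concreteness, below is a minimal PyTorch-style sketch of these three losses. It is not the authors' implementation: `actor`, `critic`, `target_critic`, `log_alpha`, and the batch tensors are assumed placeholders, and the actor is assumed to return a `torch.distributions` object with per-dimension log-probabilities.

```python
import torch
import torch.nn.functional as F

# Sketch of the SAC losses in Eq. (3); all arguments are hypothetical placeholders.
def sac_losses(actor, critic, target_critic, log_alpha, target_entropy,
               state, action, reward, next_state, done, gamma=0.99):
    alpha = log_alpha.exp()

    # Critic loss: regress Q towards r + gamma * (target soft value of next state).
    with torch.no_grad():
        next_dist = actor(next_state)                       # assumed: a torch.distributions object
        next_action = next_dist.rsample()
        next_log_prob = next_dist.log_prob(next_action).sum(-1, keepdim=True)
        target_v = target_critic(next_state, next_action) - alpha * next_log_prob
        target_q = reward + gamma * (1 - done) * target_v
    critic_loss = F.mse_loss(critic(state, action), target_q)

    # Actor loss: maximize the soft Q value, i.e. minimize alpha*log pi - Q.
    dist = actor(state)
    new_action = dist.rsample()
    log_prob = dist.log_prob(new_action).sum(-1, keepdim=True)
    actor_loss = (alpha * log_prob - critic(state, new_action)).mean()

    # Entropy-weight loss: drive the policy entropy towards the target entropy.
    alpha_loss = (-log_alpha.exp() * (log_prob.detach() + target_entropy)).mean()
    return critic_loss, actor_loss, alpha_loss
```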

We define the soft expected advantage $A^{\pi}_{soft}(s, \pi')$ of an action probability distribution $\pi'$ over a policy $\pi$ at state $s$ as:

$$A^{\pi}_{soft}(s, \pi') = \mathbb{E}_{a\sim\pi'(\cdot|s)}\big[Q^{\pi}_{soft}(s,a) - \alpha\log\pi'(a|s)\big] - V^{\pi}_{soft}(s). \tag{4}$$

$A^{\pi}_{soft}(s, \pi')$ measures the one-step performance improvement brought by following $\pi'$ instead of $\pi$ at state $s$, and following $\pi$ afterwards.
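The expectation in Eq. (4) only involves quantities the agent already has (the critic and the policies' log-probabilities), so it can be estimated by Monte Carlo sampling. The sketch below illustrates such an estimator; the `critic`, `actor`, and `candidate` callables and the sample count are assumptions, not the paper's exact procedure.

```python
import torch

# Monte Carlo estimate of the soft expected advantage in Eq. (4).
# `critic`, `actor`, and `candidate` are assumed to map states to Q values /
# torch.distributions objects; they are hypothetical placeholders.
def soft_expected_advantage(critic, actor, candidate, alpha, state, n_samples=10):
    with torch.no_grad():
        # E_{a ~ pi'}[Q(s,a) - alpha * log pi'(a|s)]
        cand_dist = candidate(state)
        q_terms = []
        for _ in range(n_samples):
            a = cand_dist.sample()
            log_p = cand_dist.log_prob(a).sum(-1, keepdim=True)
            q_terms.append(critic(state, a) - alpha * log_p)
        cand_value = torch.stack(q_terms).mean(0)

        # V(s) estimated the same way under the current target policy.
        cur_dist = actor(state)
        v_terms = []
        for _ in range(n_samples):
            a = cur_dist.sample()
            log_p = cur_dist.log_prob(a).sum(-1, keepdim=True)
            v_terms.append(critic(state, a) - alpha * log_p)
        value = torch.stack(v_terms).mean(0)
    return cand_value - value
```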

The field of policy reuse focuses on solving a target MDP $\mathcal{M}$ efficiently by transferring knowledge from a set of source policies $\{\pi^{src}_1, \ldots, \pi^{src}_n\}$. We denote the target policy learned on $\mathcal{M}$ at iteration $t$ as $\pi_t$, and its corresponding soft Q function as $Q^{\pi_t}_{soft}$. In this work, we assume that the source policies and the target policy share the same state and action spaces.

3 Critic-Guided Policy Reuse

This section presents CUP, an efficient policy reuse algorithm that does not require training any additional components. CUP is built upon actor-critic methods. In each iteration, CUP uses the critic to form a guidance policy from the source policies and the current target policy. Then CUP guides policy search by regularizing the target policy to imitate the guidance policy. Section 3.1 presents how to form a guidance policy by aggregating source policies through the critic, and proves that the guidance policy is guaranteed to be a monotonic improvement over the current target policy. We also prove that the target policy is theoretically guaranteed to improve by imitating the guidance policy. Section 3.2 presents the overall framework of CUP.

3.1 Critic-Guided Source Policy Aggregation

CUP utilizes the action probabilities proposed by the source policies to improve the current target policy, forming a guidance policy. At iteration $t$ of target policy learning, for each state $s$, the agent has access to a set of candidate action probability distributions proposed by the source policies and the current target policy: $\Pi^{cand}_t(s) = \{\pi^{src}_1(\cdot|s), \ldots, \pi^{src}_n(\cdot|s), \pi_t(\cdot|s)\}$. The guidance policy can be formed by combining, at each state $s$, the action probability distribution that has the largest soft expected advantage over $\pi_t$:

$$\pi_g(\cdot|s) = \operatorname*{arg\,max}_{\pi' \in \Pi^{cand}_t(s)} A^{\pi_t}_{soft}(s, \pi') = \operatorname*{arg\,max}_{\pi' \in \Pi^{cand}_t(s)} \mathbb{E}_{a\sim\pi'(\cdot|s)}\big[Q^{\pi_t}_{soft}(s,a) - \alpha\log\pi'(a|s)\big]. \tag{5}$$

The second equality holds because adding $V^{\pi_t}_{soft}(s)$ to all soft expected advantages does not affect the result of the $\arg\max$ operator. Eq. (5) implies that at each state, we can choose which source policy to follow simply by querying its expected soft Q value under $\pi_t$. Note that with function approximation, the exact soft Q value cannot be acquired. The following theorem enables us to form the guidance policy with an approximated soft Q function, and guarantees that the guidance policy is a monotonic improvement over the current target policy.

Theorem 1

Let $\hat{Q}^{\pi_t}_{soft}$ be an approximation of $Q^{\pi_t}_{soft}$ such that

$$\big|\hat{Q}^{\pi_t}_{soft}(s,a) - Q^{\pi_t}_{soft}(s,a)\big| \le \epsilon \quad \forall s\in\mathcal{S},\ a\in\mathcal{A}. \tag{6}$$

Define

$$\pi_g(\cdot|s) = \operatorname*{arg\,max}_{\pi'\in\Pi^{cand}_t(s)} \mathbb{E}_{a\sim\pi'(\cdot|s)}\big[\hat{Q}^{\pi_t}_{soft}(s,a) - \alpha\log\pi'(a|s)\big]. \tag{7}$$

Then,

$$V^{\pi_g}_{soft}(s) \ge V^{\pi_t}_{soft}(s) - \frac{2\epsilon}{1-\gamma} \quad \forall s\in\mathcal{S}. \tag{8}$$

Theorem 1 provides a way to choose source policies using an approximation of the current target policy’s soft Q value. As SAC learns such an approximation, the guidance policy can be formed without training any additional components.
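A per-state sketch of this selection step (Eq. (7)) is given below, reusing the hypothetical `soft_expected_advantage` helper from above; the learned critic plays the role of the approximation $\hat{Q}^{\pi_t}_{soft}$. The batching and return values are illustrative assumptions.

```python
import torch

# Sketch of Eq. (5)/(7): per state, pick the candidate (source policies plus the
# current target policy) with the largest approximated soft expected advantage.
# Builds on the hypothetical `soft_expected_advantage` helper sketched above.
def select_guidance(critic, actor, source_policies, alpha, state):
    candidates = list(source_policies) + [actor]
    # The target policy itself is a candidate, so the chosen advantage is
    # (approximately) non-negative, up to sampling and approximation error.
    adv = torch.stack([
        soft_expected_advantage(critic, actor, cand, alpha, state)
        for cand in candidates
    ], dim=0)                                  # shape [n_candidates, batch, 1]
    best = adv.argmax(dim=0)                   # index of the chosen candidate per state
    best_adv = adv.max(dim=0).values           # its (approximated) expected advantage
    return best, best_adv
```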

The next question is, how to incorporate the guidance policy into target policy learning? The following theorem demonstrates that policy improvement can be guaranteed if the target policy is optimized to stay close to the guidance policy.

Theorem 2

If

$$D_{KL}\big(\pi_{t+1}(\cdot|s)\,\big\|\,\pi_g(\cdot|s)\big) \le \delta \quad \forall s\in\mathcal{S}, \tag{9}$$

then

$$V^{\pi_{t+1}}_{soft}(s) \ge V^{\pi_g}_{soft}(s) - \frac{\sqrt{2\delta}\,\big(R_{max} + \alpha H_{max}\big)}{(1-\gamma)^2} - \frac{\alpha\,\epsilon_{\mathcal{H}}}{1-\gamma} \quad \forall s\in\mathcal{S}, \tag{10}$$

where $R_{max}$ is the largest possible absolute value of the reward, $H_{max}$ is the largest entropy of $\pi_g$, and $\epsilon_{\mathcal{H}}$ is the largest possible absolute difference of the policy entropy.

According to Theorem 2, the target policy can be improved by minimizing the KL divergence between the target policy and the guidance policy. Thus we can use the KL divergence as an auxiliary loss to guide target policy learning. Proofs for this section are deferred to Appendix B.1 and Appendix B.2. Theorem 1 and Theorem 2 can be extended to common “hard” value functions (deferred to Appendix B.3), so CUP is also applicable to actor-critic algorithms that use “hard” Bellman updates, such as A3C (Mnih et al., 2016).

3.2 CUP Framework

Figure 1: CUP framework. In each iteration, CUP first forms a guidance policy by querying the critic, then guides policy learning by adding a KL regularization to policy search.

In this subsection we propose the overall framework of CUP. As shown in Fig. 1, at each iteration $t$, CUP first forms a guidance policy $\pi_g$ according to Eq. (7), then provides additional guidance to policy search by regularizing the target policy to imitate $\pi_g$ (Wu et al., 2019; Fujimoto and Gu, 2021). Specifically, CUP minimizes the following loss to optimize $\pi_{\phi}$:

$$L^{CUP}_{\pi}(\phi) = L_{\pi}(\phi) + \beta\,\mathbb{E}_{s\sim\mathcal{D}}\big[D_{KL}\big(\pi_{\phi}(\cdot|s)\,\big\|\,\pi_g(\cdot|s)\big)\big], \tag{11}$$

where $L_{\pi}(\phi)$ is the original actor loss defined in Eq. (3), and $\beta$ is a hyper-parameter controlling the weight of regularization. In practice, we find that using a fixed regularization weight has two problems. First, it is difficult to balance the scale between $L_{\pi}(\phi)$ and the regularization term, because $L_{\pi}(\phi)$ grows as the Q value gets larger. Second, a fixed weight cannot reflect the agent's confidence in $\pi_g$. For example, when no source policies have positive soft expected advantages, $\pi_g = \pi_t$. Then the agent should not imitate $\pi_g$ anymore, as $\pi_g$ cannot provide any guidance to further improve performance. Noticing that the soft expected advantage serves as a natural confidence measure, we weight the KL divergence with the corresponding soft expected advantage at that state:

$$L^{CUP}_{\pi}(\phi) = L_{\pi}(\phi) + \mathbb{E}_{s\sim\mathcal{D}}\Big[\beta_1\operatorname{clip}\Big(\hat{A}^{\pi_t}_{soft}(s,\pi_g),\, 0,\, \beta_2\big|\hat{V}^{\pi_t}_{soft}(s)\big|\Big)\, D_{KL}\big(\pi_{\phi}(\cdot|s)\,\big\|\,\pi_g(\cdot|s)\big)\Big], \tag{12}$$

where $\hat{A}^{\pi_t}_{soft}$ is the approximated soft expected advantage, $\beta_1$ and $\beta_2$ are two hyper-parameters, and $\hat{V}^{\pi_t}_{soft}$ is the approximated soft value function. This adaptive regularization weight automatically balances the two losses, and ignores the regularization term at states where $\pi_g$ cannot improve over $\pi_t$ anymore. We further upper clip the expected advantage with the absolute value of $\hat{V}^{\pi_t}_{soft}$ to avoid the agent being overly confident about $\pi_g$ due to function approximation error.
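A sketch of how this adaptively weighted regularizer might be added to the actor loss follows; `beta1`, `beta2`, the argument names, and the Monte Carlo KL estimate are assumptions rather than the authors' exact implementation.

```python
import torch

# Sketch of the adaptively weighted KL regularization of Eq. (12). `adv_hat` and
# `guide_dist` would come from the per-state selection sketched earlier; the
# hyper-parameter values are purely illustrative.
def cup_actor_loss(sac_actor_loss, actor_dist, guide_dist, adv_hat, v_hat,
                   beta1=1.0, beta2=1.0):
    # Per-state confidence weight: approximated expected advantage, floored at 0
    # and upper-clipped by |V_hat| to guard against critic approximation error.
    weight = beta1 * torch.minimum(adv_hat.clamp(min=0.0), beta2 * v_hat.abs())

    # Monte Carlo estimate of KL(pi_phi || pi_g), so no closed-form KL is required.
    a = actor_dist.rsample()
    kl = (actor_dist.log_prob(a) - guide_dist.log_prob(a)).sum(-1, keepdim=True)

    return sac_actor_loss + (weight.detach() * kl).mean()
```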

CUP’s pseudo-code is presented in Alg. 1. The modifications CUP made to SAC are marked in red. Additional implementation details are deferred to Appendix D.1.

  Require: Source policies $\{\pi^{src}_1, \ldots, \pi^{src}_n\}$, hyper-parameters $\beta_1$, $\beta_2$
  Initialize replay buffer $\mathcal{D}$
  Initialize actor $\pi_{\phi}$, entropy weight $\alpha$, critics $Q_{\theta_1}$, $Q_{\theta_2}$, target networks $Q_{\bar\theta_1}$, $Q_{\bar\theta_2}$
  while not done do
     for each environment step do
        $a_t \sim \pi_{\phi}(\cdot|s_t)$
        $s_{t+1} \sim p(\cdot|s_t, a_t)$
        $\mathcal{D} \leftarrow \mathcal{D} \cup \{(s_t, a_t, r(s_t, a_t), s_{t+1})\}$
     end for
     for each gradient step do
        Sample minibatch $B$ from $\mathcal{D}$
        Query source policies' action probabilities for states in $B$   (CUP)
        Compute expected advantages according to Eq. (4), form $\pi_g$ according to Eq. (7)   (CUP)
        $\theta_i \leftarrow \theta_i - \lambda_Q \nabla_{\theta_i} L_Q(\theta_i)$ for $i \in \{1, 2\}$
        $\phi \leftarrow \phi - \lambda_{\pi} \nabla_{\phi} L^{CUP}_{\pi}(\phi)$   (CUP)
        $\alpha \leftarrow \alpha - \lambda_{\alpha} \nabla_{\alpha} L(\alpha)$
        $\bar\theta_i \leftarrow \tau \theta_i + (1-\tau)\,\bar\theta_i$ for $i \in \{1, 2\}$
     end for
  end while
Algorithm 1 CUP
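To show how the pieces fit together, here is a rough sketch of one gradient step of Alg. 1, wiring together the hypothetical helpers sketched above; the names, shapes, and the single-sample value estimate are illustrative assumptions. For simplicity it follows a single best candidate for the whole minibatch, whereas the algorithm selects per state.

```python
import torch

# Sketch of one CUP gradient step (inner loop of Alg. 1); not the authors' code.
def cup_gradient_step(batch, actor, critic, target_critic, log_alpha,
                      target_entropy, source_policies, beta1, beta2):
    state, action, reward, next_state, done = batch
    alpha = log_alpha.exp().detach()

    # Standard SAC losses, as in Eq. (3).
    critic_loss, sac_actor_loss, alpha_loss = sac_losses(
        actor, critic, target_critic, log_alpha, target_entropy,
        state, action, reward, next_state, done)

    # CUP: score every candidate with the critic and pick the guidance policy, Eq. (7).
    best, adv_hat = select_guidance(critic, actor, source_policies, alpha, state)
    candidates = list(source_policies) + [actor]
    # Simplification: use the candidate chosen for the first state over the whole batch;
    # the per-state selection itself is sketched in `select_guidance` above.
    guide_dist = candidates[int(best.flatten()[0])](state)

    # Rough single-sample estimate of the soft value, used only to clip the weight.
    with torch.no_grad():
        dist = actor(state)
        a = dist.sample()
        v_hat = critic(state, a) - alpha * dist.log_prob(a).sum(-1, keepdim=True)

    # CUP: advantage-weighted KL regularization added to the actor loss, Eq. (12).
    actor_loss = cup_actor_loss(sac_actor_loss, actor(state), guide_dist,
                                adv_hat, v_hat, beta1, beta2)
    return critic_loss, actor_loss, alpha_loss
```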

4 Experiments

We evaluate CUP on Meta-World (Yu et al., 2020), a popular reinforcement learning benchmark composed of multiple robot manipulation tasks. These tasks are both correlated (performed by the same Sawyer robot arm) and distinct (interacting with different objects and having different reward functions), and serve as a proper evaluation benchmark for policy reuse. The source policies are obtained by training on three representative tasks: Reach, Push, and Pick-Place. We choose several complex tasks as target tasks, including Hammer, Peg-Insert-Side, Push-Wall, Pick-Place-Wall, Push-Back, and Shelf-Place. Among these target tasks, Hammer and Peg-Insert-Side require interacting with objects unseen in the source tasks. In Push-Wall and Pick-Place-Wall, there is a wall between the object and the goal. In Push-Back, the goal distribution is different from Push. In Shelf-Place, the robot is required to put a block on a shelf, and the shelf is unseen in the source tasks. Video demonstrations of these tasks are available at https://meta-world.github.io/. Similar to the settings in Yang et al. (2020a), in our experiments the goal position is randomly reset at the start of every episode. Code is available at https://github.com/NagisaZj/CUP.

4.1 Transfer Performance on Meta-World

We compare against several representative baseline algorithms, including HAAR (Li et al., 2019), PTF (Yang et al., 2020b), MULTIPOLAR (Barekatain et al., 2021), and MAMBA (Cheng et al., 2020). Among these algorithms, HAAR and PTF learn hierarchical high-level policies over source policies. MAMBA aggregates source policies' V functions to form a baseline function, and performs policy improvement over the baseline function. MULTIPOLAR learns a weighted sum of source policies' action probabilities, and learns an additional network to predict residuals. We also compare against the original SAC algorithm. All results are averaged over six random seeds. As shown in Figure 2, CUP is the only algorithm that achieves efficient transfer on all six tasks, significantly outperforming the original SAC algorithm. HAAR has a jump-start performance on Push-Wall and Pick-Place-Wall, but fails to further improve due to optimization non-stationarity induced by jointly training high-level and low-level policies. MULTIPOLAR achieves comparable performance on Push-Wall and Peg-Insert-Side, because the Push source policy is useful on Push-Wall (implied by HAAR's good jump-start performance), and learning residuals on Peg-Insert-Side is easier (implied by SAC's fast learning). In Pick-Place-Wall, the Pick-Place source policy is useful, but the residual is difficult to learn, so MULTIPOLAR does not work. For the remaining three tasks, the source policies are less useful, and MULTIPOLAR fails on these tasks. PTF fails as its hierarchical policy only gets updated when the agent chooses similar actions to one of the source policies, which is quite rare when the source and target tasks are distinct. MAMBA fails as estimating all source policies' V functions accurately is sample-inefficient. Algorithm performance evaluated by success rate is deferred to Appendix E.1.

Figure 2:

Evaluation of CUP and several baselines on various Meta-World tasks. Dashed areas represent 95% bootstrapped confidence intervals. CUP achieves substantially better performance than baseline algorithms.

4.2 Analyzing the Guidance Policy

This subsection provides visualizations of CUP’s source policy selection. Fig. 3 shows the percentages of each source policy being selected throughout training on Push-Wall. At early stages of training, the source policies are selected more frequently as they have positive expected advantages, which means that they can be used to improve the current target policy. As training proceeds and the target policy becomes better, the source policies are selected less frequently. Among these three source policies, Push is chosen more frequently than the other two source policies, as it is more related to the target task. Figure 4 presents the source policies’ expected advantages over an episode at convergence in Pick-Place-Wall. The Push source policy and Reach source policy almost always have negative expected advantages, which implies that these two source policies can hardly improve the current target policy anymore. Meanwhile, the Pick-Place source policy has expected advantages close to zero after 100 environment steps, which implies that the Pick-Place source policy is close to the target policy at these steps. Analyses on all six tasks as well as analyses on HAAR’s source policy selection are deferred to Appendix E.2 and Appendix E.6, respectively.

Figure 3: Percentages of source policies being selected by CUP during training on Push-Wall. The green dashed line represents the target policy’s success rate on the task.
Figure 4: Expected advantages of source policies at convergence on Pick-Place-Wall. The horizontal axis represents the environment steps of an episode.

4.3 Ablation Study

This subsection evaluates CUP’s sensitivity to hyper-parameter settings and the number of source policies. We also evaluate CUP’s robustness against random source policies, which do not provide meaningful candidate actions for solving target tasks.

4.3.1 Hyper-Parameter Sensitivity

Figure 5: Ablation studies on a wide range of hyper-parameters. CUP performs well on a wide range of hyper-parameters.

For all the experiments in Section 4.1, we use the same set of hyper-parameters, which indicates that CUP is generally applicable to a wide range of tasks without particular fine-tuning. CUP introduces only two additional hyper-parameters to the underlying SAC algorithm, and we further test CUP’s sensitivity to these additional hyper-parameters. As shown in Fig. 5, CUP is generally robust to the choice of hyper-parameters and achieves stable performance.

4.3.2 Number of Source Policies

We evaluate CUP as well as the baseline algorithms on a larger source policy set. We add three policies to the original source policy set, which solve three simple tasks: Drawer-Close, Push-Wall, and Coffee-Button. This forms a source policy set composed of six policies. As shown in Fig. 6, CUP is still the only algorithm that solves all six target tasks efficiently. MULTIPOLAR suffers a decrease in performance, which indicates that learning the weighted sum of source policies' actions becomes more difficult as the number of source policies grows. The remaining baseline algorithms perform similarly to their counterparts using three source policies. Fig. 7 provides a more direct comparison of CUP's performance with different numbers of source policies. CUP is able to utilize the additional source policies to further improve its performance, especially on Pick-Place-Wall and Peg-Insert-Side. Further detailed analysis is deferred to Appendix E.3.

Figure 6: Performance of CUP and baseline algorithms on various Meta-World tasks, with a set of six source policies.
Figure 7: Comparison of CUP's performance with different numbers of source policies.

4.3.3 Interference of Random Source Policies

In order to evaluate the efficiency of CUP's critic-guided source policy aggregation, we add random policies to the set of source policies. As shown in Fig. 8, adding up to 3 random source policies does not affect CUP's performance. This indicates that CUP can efficiently choose which source policy to follow even when many of the source policies are not meaningful. Adding 4 or 5 random source policies leads to a slight drop in performance. This drop occurs because, as the number of random policies grows, more random actions are sampled, and taking the argmax over these actions' expected advantages is more likely to be affected by errors in value estimation.

To further investigate CUP’s ability to ignore unsuitable source policies, we design another transfer setting that consists of another two source policy sets. The first set consists of three random policies that are useless for the target task, and the second set adds the Reach policy to the first set. As demonstrated in Fig. 8, when none of the source policies are useful, CUP performs similarly to the original SAC, and its sample efficiency is almost unaffected by the useless source policies. When there exists a useful source policy, CUP can efficiently utilize it to improve performance, even if there are many useless source policies.

Figure 8: Ablation studies on CUP’s sensitivity to useless source policies. (a) Adding up to 3 random policies to the source policy set does not affect CUP’s performance. (b) Ablation study in a setting where most source policies are useless. If none of the source policies are useful (3 Random Sources), CUP performs similarly to the original SAC. Even if only one of the four source policies is useful (3 Random Sources+Reach), CUP is still able to efficiently utilize the useful source policy to improve learning performance.

5 Related Work

Policy reuse.

A series of works on policy reuse utilize source policies for exploration in value-based algorithms (Fernández and Veloso, 2006; Li and Zhang, 2018; Gimelfarb et al., 2021), but they are not applicable to policy gradient methods due to the off-policyness problem (Fujimoto et al., 2019). AC-Teach (Kurenkov et al., 2020) mitigates this problem by improving the actor over the behavior policy's value estimate, but still fails in more complex tasks. One branch of methods trains hierarchical high-level policies over source policies. CAPS (Li et al., 2018) guarantees the optimality of the hierarchical policy by adding primitive skills to the low-level policy set, but is inapplicable to MDPs with continuous action spaces. HAAR (Li et al., 2019) fine-tunes low-level policies to ensure optimality, but joint training of high-level and low-level policies induces optimization non-stationarity (Pateria et al., 2021). PTF (Yang et al., 2020b) trains a hierarchical policy, which is imitated by the target policy. However, the hierarchical policy only gets updated when the target policy chooses similar actions to one of the source policies, so PTF fails in complex tasks with large action spaces. Another branch of works aggregates source policies via their Q functions or V functions on the target task. Barreto et al. (2017) and Barreto et al. (2018) focus on the situation where source tasks and target tasks share the same dynamics, and aggregate source policies by choosing the policy that has the largest Q value at each state. They use successor features to mitigate the heavy computation cost of estimating Q functions for all source policies. MAMBA (Cheng et al., 2020) forms a baseline function by aggregating source policies' V functions, and guides policy search by improving the policy over the baseline function. Finally, MULTIPOLAR (Barekatain et al., 2021) learns a weighted sum over source policies' actions, and learns an auxiliary network to predict residuals around the aggregated actions. MULTIPOLAR is computationally expensive, as it requires querying all the source policies at every sampling step. Our proposed method, CUP, focuses on the setting of learning continuous-action MDPs with actor-critic methods. CUP is both computationally and sample efficient, as it does not require training any additional components.

Policy regularization.

Adding regularization to policy optimization is a common approach to induce prior knowledge into policy learning. Distral (Teh et al., 2017) achieves inter-task transfer by imitating an average policy distilled from policies of related tasks. In offline RL, policy regularization serves as a common technique to keep the policy close to the behavior policy used to collect the dataset (Wu et al., 2019; Nair et al., 2020; Fujimoto and Gu, 2021). CUP uses policy regularization as a means to provide additional guidance to policy search with the guidance policy.

6 Conclusion

In this study, we address the problem of reusing source policies without training any additional components. By utilizing the critic as a natural evaluation of source policies, we propose CUP, an efficient policy reuse algorithm that avoids this extra training entirely. CUP is conceptually simple, easy to implement, and has theoretical guarantees. Empirical results demonstrate that CUP achieves efficient transfer on a wide range of tasks. As for future work, CUP assumes that all source policies and the target policy share the same state and action spaces, which limits its application to more general scenarios. One possible future direction is to take inspiration from previous works that map the state and action spaces of an MDP to those of another MDP with a similar high-level structure (Wan et al., 2020; Zhang et al., 2020; Heng et al., 2022; van der Pol et al., 2020b, a). Another interesting direction is to incorporate CUP into the continual learning setting (Rolnick et al., 2019; Khetarpal et al., 2020), in which an agent gradually enriches its source policy set in an online manner.

Acknowledgements

This work is supported in part by Science and Technology Innovation 2030 – “New Generation Artificial Intelligence” Major Project (No. 2018AAA0100904), National Natural Science Foundation of China (62176135), and China Academy of Launch Vehicle Technology (CALT2022-18).

References

  • M. Barekatain, R. Yonetani, and M. Hamaya (2021) MULTIPOLAR: multi-source policy aggregation for transfer reinforcement learning between diverse environmental dynamics. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, pp. 3108–3116. Cited by: §4.1, §5.
  • A. Barreto, D. Borsa, J. Quan, T. Schaul, D. Silver, M. Hessel, D. Mankowitz, A. Zidek, and R. Munos (2018) Transfer in deep reinforcement learning using successor features and generalised policy improvement. In International Conference on Machine Learning, pp. 501–510. Cited by: §1, §1, §5.
  • A. Barreto, W. Dabney, R. Munos, J. J. Hunt, T. Schaul, H. P. van Hasselt, and D. Silver (2017) Successor features for transfer in reinforcement learning. Advances in neural information processing systems 30. Cited by: §1, §5.
  • J. S. O. Ceron and P. S. Castro (2021) Revisiting rainbow: promoting more insightful and inclusive deep reinforcement learning research. In International Conference on Machine Learning, pp. 1373–1383. Cited by: §1.
  • C. Cheng, A. Kolobov, and A. Agarwal (2020) Policy improvement via imitation of multiple oracles. Advances in Neural Information Processing Systems 33, pp. 5587–5598. Cited by: §1, §4.1, §5.
  • A. A. Fedotov, P. Harremoës, and F. Topsoe (2003) Refinements of pinsker’s inequality. IEEE Transactions on Information Theory 49 (6), pp. 1491–1498. Cited by: §B.2, §B.3.
  • F. Fernández and M. Veloso (2006) Probabilistic policy reuse in a reinforcement learning agent. In Proceedings of the fifth international joint conference on Autonomous agents and multiagent systems, pp. 720–727. Cited by: §1, §1, §5.
  • S. Fujimoto and S. S. Gu (2021) A minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems 34. Cited by: §3.2, §5.
  • S. Fujimoto, H. Hoof, and D. Meger (2018) Addressing function approximation error in actor-critic methods. In International conference on machine learning, pp. 1587–1596. Cited by: §1.
  • S. Fujimoto, D. Meger, and D. Precup (2019) Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pp. 2052–2062. Cited by: §5.
  • M. Gimelfarb, S. Sanner, and C. Lee (2021) Contextual policy transfer in reinforcement learning domains via deep mixtures-of-experts. In Uncertainty in Artificial Intelligence, pp. 1787–1797. Cited by: §5.
  • S. R. Guberman and P. M. Greenfield (1991) Learning and transfer in everyday cognition. Cognitive Development 6 (3), pp. 233–260. Cited by: §1.
  • T. Haarnoja, H. Tang, P. Abbeel, and S. Levine (2017) Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning, pp. 1352–1361. Cited by: §2.
  • T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, et al. (2018) Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905. Cited by: §1, §2.
  • Y. Heng, T. Yang, Y. ZHENG, H. Jianye, and M. E. Taylor (2022) Cross-domain adaptive transfer reinforcement learning based on state-action correspondence. In The 38th Conference on Uncertainty in Artificial Intelligence, Cited by: §6.
  • S. Kakade and J. Langford (2002) Approximately optimal approximate reinforcement learning. In In Proc. 19th International Conference on Machine Learning, Cited by: §B.2, §B.3.
  • K. Khetarpal, M. Riemer, I. Rish, and D. Precup (2020) Towards continual reinforcement learning: a review and perspectives. arXiv preprint arXiv:2012.13490. Cited by: §6.
  • A. Kurenkov, A. Mandlekar, R. Martin-Martin, S. Savarese, and A. Garg (2020) AC-teach: a bayesian actor-critic method for policy learning with an ensemble of suboptimal teachers. In Conference on Robot Learning, pp. 717–734. Cited by: §1, §5.
  • S. Li, F. Gu, G. Zhu, and C. Zhang (2018) Context-aware policy reuse. arXiv preprint arXiv:1806.03793. Cited by: §1, §5.
  • S. Li, R. Wang, M. Tang, and C. Zhang (2019) Hierarchical reinforcement learning with advantage-based auxiliary rewards. Advances in Neural Information Processing Systems 32. Cited by: §1, §1, §4.1, §5.
  • S. Li and C. Zhang (2018) An optimal online method of selecting source policies for reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: §5.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2016) Continuous control with deep reinforcement learning.. In ICLR (Poster), Cited by: §1.
  • V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. Cited by: §3.1.
  • A. Nair, A. Gupta, M. Dalal, and S. Levine (2020) Awac: accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359. Cited by: §5.
  • G. Ostrovski, P. S. Castro, and W. Dabney (2021) The difficulty of passive learning in deep reinforcement learning. Advances in Neural Information Processing Systems 34, pp. 23283–23295. Cited by: Appendix C.
  • S. Pateria, B. Subagdja, A. Tan, and C. Quek (2021) Hierarchical reinforcement learning: a comprehensive survey. ACM Computing Surveys (CSUR) 54 (5), pp. 1–35. Cited by: §1, §5.
  • D. Rolnick, A. Ahuja, J. Schwarz, T. Lillicrap, and G. Wayne (2019) Experience replay for continual learning. Advances in Neural Information Processing Systems 32. Cited by: §6.
  • D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. (2017) Mastering the game of go without human knowledge. nature 550 (7676), pp. 354–359. Cited by: §1.
  • S. Sodhani, A. Zhang, and J. Pineau (2021) Multi-task reinforcement learning with context-based representations. In International Conference on Machine Learning, pp. 9767–9779. Cited by: §D.2.
  • R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §2.
  • Y. Teh, V. Bapst, W. M. Czarnecki, J. Quan, J. Kirkpatrick, R. Hadsell, N. Heess, and R. Pascanu (2017) Distral: robust multitask reinforcement learning. Advances in neural information processing systems 30. Cited by: §5.
  • E. van der Pol, T. Kipf, F. A. Oliehoek, and M. Welling (2020a) Plannable approximations to mdp homomorphisms: equivariance under actions. arXiv preprint arXiv:2002.11963. Cited by: §6.
  • E. van der Pol, D. Worrall, H. van Hoof, F. Oliehoek, and M. Welling (2020b) MDP homomorphic networks: group symmetries in reinforcement learning. Advances in Neural Information Processing Systems 33, pp. 4199–4210. Cited by: §6.
  • O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, et al. (2019) Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature 575 (7782), pp. 350–354. Cited by: §1.
  • M. Wan, T. Gangwani, and J. Peng (2020) Mutual information based knowledge transfer under state-action dimension mismatch. arXiv preprint arXiv:2006.07041. Cited by: §6.
  • Y. Wu, G. Tucker, and O. Nachum (2019) Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361. Cited by: §3.2, §5.
  • R. Yang, H. Xu, Y. Wu, and X. Wang (2020a) Multi-task reinforcement learning with soft modularization. Advances in Neural Information Processing Systems 33, pp. 4767–4777. Cited by: §4.
  • T. Yang, J. Hao, Z. Meng, Z. Zhang, Y. Hu, Y. Chen, C. Fan, W. Wang, Z. Wang, and J. Peng (2020b) Efficient deep reinforcement learning through policy transfer.. In AAMAS, pp. 2053–2055. Cited by: §1, §1, §4.1, §5.
  • T. Yu, D. Quillen, Z. He, R. Julian, K. Hausman, C. Finn, and S. Levine (2020) Meta-world: a benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning, pp. 1094–1100. Cited by: §1, §4.
  • Q. Zhang, T. Xiao, A. A. Efros, L. Pinto, and X. Wang (2020) Learning cross-domain correspondence for control with dynamics cycle-consistency. arXiv preprint arXiv:2012.09811. Cited by: §6.

Checklist

  1. For all authors…

    1. Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

    2. Did you describe the limitations of your work? See Section 2.

    3. Did you discuss any potential negative societal impacts of your work? See Appendix A.

    4. Have you read the ethics review guidelines and ensured that your paper conforms to them?

  2. If you are including theoretical results…

    1. Did you state the full set of assumptions of all theoretical results? See Section 3.1.

    2. Did you include complete proofs of all theoretical results? See Appendix B.1 and Appendix B.2.

  3. If you ran experiments…

    1. Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? See the supplemental materials.

    2. Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? See Appendix D.1.

    3. Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? See Section 4.

    4. Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? See Appendix D.1.

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

    1. If your work uses existing assets, did you cite the creators? See Section 4

    2. Did you mention the license of the assets? See the supplemental materials.

    3. Did you include any new assets either in the supplemental material or as a URL?

    4. Did you discuss whether and how consent was obtained from people whose data you’re using/curating?

    5. Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?

  5. If you used crowdsourcing or conducted research with human subjects…

    1. Did you include the full text of instructions given to participants and screenshots, if applicable?

    2. Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?

    3. Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?

Appendix A Broader Social Impact

We believe policy reuse serves as a promising way to transfer knowledge among AI agents. This ability will enable AI agents to master new skills efficiently. However, we are also aware of possible negative social impacts, such as plagiarizing other AI products by querying and reusing their policies.

Appendix B Proofs

b.1 Proof for Theorem 1

Proof. As $\big|\hat{Q}^{\pi_t}_{soft}(s,a) - Q^{\pi_t}_{soft}(s,a)\big| \le \epsilon$, we have that for all $s$ and any action probability distribution $\pi'$, the difference between the true value and the approximated value is bounded:

$$\Big|\mathbb{E}_{a\sim\pi'(\cdot|s)}\big[\hat{Q}^{\pi_t}_{soft}(s,a) - \alpha\log\pi'(a|s)\big] - \mathbb{E}_{a\sim\pi'(\cdot|s)}\big[Q^{\pi_t}_{soft}(s,a) - \alpha\log\pi'(a|s)\big]\Big| \le \epsilon.$$

As $\pi_t$ is contained in $\Pi^{cand}_t$, with $\pi_g$ defined in Eq. (7), it is obvious that for all $s$, $\mathbb{E}_{a\sim\pi_g(\cdot|s)}\big[\hat{Q}^{\pi_t}_{soft}(s,a) - \alpha\log\pi_g(a|s)\big] \ge \mathbb{E}_{a\sim\pi_t(\cdot|s)}\big[\hat{Q}^{\pi_t}_{soft}(s,a) - \alpha\log\pi_t(a|s)\big]$. Then for all $s$, $A^{\pi_t}_{soft}(s,\pi_g) \ge -2\epsilon$, and applying this one-step guarantee repeatedly, as in the soft policy improvement argument, yields $V^{\pi_g}_{soft}(s) \ge V^{\pi_t}_{soft}(s) - \frac{2\epsilon}{1-\gamma}$.

b.2 Proof of Theorem 2

Proof. According to Pinsker's inequality (Fedotov et al., 2003), $\|P - Q\|_1 \le \sqrt{2\,D_{KL}(P\,\|\,Q)}$, where $\|\cdot\|_1$ is the L1 norm. So we have that for all $s$, $\|\pi_{t+1}(\cdot|s) - \pi_g(\cdot|s)\|_1 \le \sqrt{2\delta}$. According to the Performance Difference Lemma (Kakade and Langford, 2002), we have that for all $s$:

$$V^{\pi_{t+1}}_{soft}(s) - V^{\pi_g}_{soft}(s) = \frac{1}{1-\gamma}\,\mathbb{E}_{s'\sim d^{\pi_{t+1}}_{s}}\Big[\mathbb{E}_{a\sim\pi_{t+1}(\cdot|s')}\big[Q^{\pi_g}_{soft}(s',a)\big] + \alpha\,\mathcal{H}\big(\pi_{t+1}(\cdot|s')\big) - V^{\pi_g}_{soft}(s')\Big],$$

where $d^{\pi_{t+1}}_{s}$ is the normalized discounted state occupancy distribution (we slightly abuse the notation here to indicate that the agent starts deterministically from state $s$). Note that

$$\Big|\mathbb{E}_{a\sim\pi_{t+1}(\cdot|s')}\big[Q^{\pi_g}_{soft}(s',a)\big] - \mathbb{E}_{a\sim\pi_g(\cdot|s')}\big[Q^{\pi_g}_{soft}(s',a)\big]\Big| \le \big\|\pi_{t+1}(\cdot|s') - \pi_g(\cdot|s')\big\|_1 \max_{a}\big|Q^{\pi_g}_{soft}(s',a)\big| \le \frac{\sqrt{2\delta}\,(R_{max} + \alpha H_{max})}{1-\gamma}, \tag{14}$$
$$\alpha\,\Big|\mathcal{H}\big(\pi_{t+1}(\cdot|s')\big) - \mathcal{H}\big(\pi_g(\cdot|s')\big)\Big| \le \alpha\,\epsilon_{\mathcal{H}}. \tag{15}$$

Eventually, we have $V^{\pi_{t+1}}_{soft}(s) \ge V^{\pi_g}_{soft}(s) - \frac{\sqrt{2\delta}\,(R_{max} + \alpha H_{max})}{(1-\gamma)^2} - \frac{\alpha\,\epsilon_{\mathcal{H}}}{1-\gamma}$.

b.3 Critic-Guided Source Policy Aggregation under “Hard” Value Functions

In this section we override the notation $Q^{\pi}$, $V^{\pi}$ to represent “hard” value functions, and override the notation $A^{\pi}(s,\pi')$ to represent the expected advantage, which is defined as $A^{\pi}(s,\pi') = \mathbb{E}_{a\sim\pi'(\cdot|s)}\big[Q^{\pi}(s,a)\big] - V^{\pi}(s)$. Then Theorem 1 and Theorem 2 can be extended as below.

Theorem 3

Let $\hat{Q}^{\pi_t}$ be an approximation of $Q^{\pi_t}$ such that

$$\big|\hat{Q}^{\pi_t}(s,a) - Q^{\pi_t}(s,a)\big| \le \epsilon \quad \forall s\in\mathcal{S},\ a\in\mathcal{A}. \tag{16}$$

Define

$$\pi_g(\cdot|s) = \operatorname*{arg\,max}_{\pi'\in\Pi^{cand}_t(s)} \mathbb{E}_{a\sim\pi'(\cdot|s)}\big[\hat{Q}^{\pi_t}(s,a)\big]. \tag{17}$$

Then,

$$V^{\pi_g}(s) \ge V^{\pi_t}(s) - \frac{2\epsilon}{1-\gamma} \quad \forall s\in\mathcal{S}. \tag{18}$$

Theorem 4

If

$$D_{KL}\big(\pi_{t+1}(\cdot|s)\,\big\|\,\pi_g(\cdot|s)\big) \le \delta \quad \forall s\in\mathcal{S}, \tag{19}$$

then

$$V^{\pi_{t+1}}(s) \ge V^{\pi_g}(s) - \frac{\sqrt{2\delta}\,R_{max}}{(1-\gamma)^2} \quad \forall s\in\mathcal{S}, \tag{20}$$

where $R_{max}$ is the largest possible absolute value of the reward.

Theorem 3 and Theorem 4 imply that CUP can still guarantee policy improvement under hard Bellman updates. Proofs are given below.

Proof for Theorem 3.

As $\big|\hat{Q}^{\pi_t}(s,a) - Q^{\pi_t}(s,a)\big| \le \epsilon$, we have that for all $s$ and any action probability distribution $\pi'$, the difference between the true value and the approximated value is bounded: $\big|\mathbb{E}_{a\sim\pi'(\cdot|s)}[\hat{Q}^{\pi_t}(s,a)] - \mathbb{E}_{a\sim\pi'(\cdot|s)}[Q^{\pi_t}(s,a)]\big| \le \epsilon$.

As $\pi_t$ is contained in $\Pi^{cand}_t$, with $\pi_g$ defined in Eq. (17), it is obvious that for all $s$, $\mathbb{E}_{a\sim\pi_g(\cdot|s)}\big[\hat{Q}^{\pi_t}(s,a)\big] \ge \mathbb{E}_{a\sim\pi_t(\cdot|s)}\big[\hat{Q}^{\pi_t}(s,a)\big]$. Then for all $s$, $A^{\pi_t}(s,\pi_g) \ge -2\epsilon$, and the bound in Eq. (18) follows as in the proof of Theorem 1.