1 Introduction
Human intelligence can solve new tasks quickly by reusing previous policies (Guberman and Greenfield, 1991). Despite remarkable success, current Deep Reinforcement Learning (DRL) agents lack this knowledge transfer ability (Silver et al., 2017; Vinyals et al., 2019; Ceron and Castro, 2021), leading to enormous computation and sampling cost. As a consequence, a large number of works have been studying the problem of policy reuse in DRL, i.e., how to efficiently reuse source policies to speed up target policy learning (Fernández and Veloso, 2006; Barreto et al., 2018; Li et al., 2019; Yang et al., 2020b).
A fundamental challenge in policy reuse is: how does an agent with access to multiple source policies decide when and where to use them (Fernández and Veloso, 2006; Kurenkov et al., 2020; Cheng et al., 2020)? Previous methods solve this problem by introducing additional components to the underlying DRL algorithm, such as hierarchical high-level policies over source policies (Li et al., 2018, 2019; Yang et al., 2020b), or estimations of source policies’ value functions on the target task (Barreto et al., 2017, 2018; Cheng et al., 2020). However, training these components significantly impairs the effectiveness of transfer, as hierarchical structures induce optimization non-stationarity (Pateria et al., 2021), and estimating the value functions for every source policy is computationally expensive and incurs a high sampling cost. Thus, the objective of this study is to address the question:
Can we achieve efficient transfer without training additional components?
Notice that actor-critic methods (Lillicrap et al., 2016; Fujimoto et al., 2018; Haarnoja et al., 2018) learn a critic that approximates the actor’s Q function and serves as a natural way to evaluate policies. Based on this observation, we propose a novel policy reuse algorithm that utilizes the critic to choose source policies. The proposed algorithm, called Critic-gUided Policy reuse (CUP), avoids training any additional components and achieves efficient transfer. At each state, CUP chooses the source policy that has the largest one-step improvement over the current target policy, thus forming a guidance policy. Then CUP guides learning by regularizing the target policy to imitate the guidance policy. This approach has the following advantages. First, the one-step improvement can be estimated simply by querying the critic, and no additional components need to be trained. Second, the guidance policy is theoretically guaranteed to be a monotonic improvement over the current target policy, which ensures that CUP can reuse the source policies to improve the current target policy. Finally, CUP is conceptually simple and easy to implement, introducing very few hyperparameters to the underlying algorithm.
We evaluate CUP on Meta-World (Yu et al., 2020), a popular reinforcement learning benchmark composed of multiple robot arm manipulation tasks. Empirical results demonstrate that CUP achieves efficient transfer and significantly outperforms baseline algorithms.
2 Preliminaries
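Reinforcement learning (RL) deals with Markov Decision Processes (MDPs). An MDP can be modelled by a tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, r, p, \gamma)$, with state space $\mathcal{S}$, action space $\mathcal{A}$, reward function $r(s, a)$, transition function $p(\cdot \mid s, a)$, and discount factor $\gamma$ (Sutton and Barto, 2018). In this study, we focus on MDPs with continuous action spaces. RL’s objective is to find a policy $\pi(\cdot \mid s)$ that maximizes the cumulative discounted return $\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\right]$.
While CUP is generally applicable to a wide range of actor-critic algorithms, in this work we use SAC (Haarnoja et al., 2018) as the underlying algorithm. The soft Q function and soft V function (Haarnoja et al., 2017) of a policy $\pi$ are defined as:
$$Q_{\pi}(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s, a)}\left[V_{\pi}(s')\right] \tag{1}$$
$$V_{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[Q_{\pi}(s, a) - \alpha \log \pi(a \mid s)\right] \tag{2}$$
where $\alpha > 0$ is the entropy weight. SAC’s loss functions are defined as:
$$\begin{aligned}
L_{critic}(Q_{\theta}) &= \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\left[\left(Q_{\theta}(s, a) - \left(r + \gamma V_{\bar{\theta}}(s')\right)\right)^{2}\right], \\
L_{actor}(\pi_{\phi}) &= \mathbb{E}_{s \sim \mathcal{D}}\left[\mathbb{E}_{a \sim \pi_{\phi}(\cdot \mid s)}\left[\alpha \log \pi_{\phi}(a \mid s) - Q_{\theta}(s, a)\right]\right], \\
L_{entropy}(\alpha) &= \mathbb{E}_{s \sim \mathcal{D}}\left[\mathbb{E}_{a \sim \pi_{\phi}(\cdot \mid s)}\left[-\alpha \log \pi_{\phi}(a \mid s) - \alpha \bar{\mathcal{H}}\right]\right],
\end{aligned} \tag{3}$$
where $\mathcal{D}$ is the replay buffer, $\bar{\mathcal{H}}$ is a hyperparameter representing the target entropy, $\theta$ and $\phi$ are network parameters, $\bar{\theta}$ is the target network’s parameters, and $V_{\bar{\theta}}(s) = \mathbb{E}_{a \sim \pi_{\phi}(\cdot \mid s)}\left[Q_{\bar{\theta}}(s, a) - \alpha \log \pi_{\phi}(a \mid s)\right]$ is the target soft value function.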
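We define the soft expected advantage of action probability distribution $\pi_i(\cdot \mid s)$ over policy $\pi_j$ at state $s$ as:
$$EA_{\pi_j}(s, \pi_i) = \mathbb{E}_{a \sim \pi_i(\cdot \mid s)}\left[Q_{\pi_j}(s, a) - \alpha \log \pi_i(a \mid s)\right] - V_{\pi_j}(s) \tag{4}$$
$EA_{\pi_j}(s, \pi_i)$ measures the one-step performance improvement brought by following $\pi_i$ instead of $\pi_j$ at state $s$, and following $\pi_j$ afterwards.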
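As a rough illustration of the soft expected advantage, the quantity can be estimated by Monte Carlo sampling from the candidate distribution. The sketch below is not the authors' implementation: the function names, the toy critic `toy_q`, and the one-dimensional Gaussian policies are all hypothetical, and it assumes the standard SAC form of the soft value, E[Q(s, a) − α log π(a|s)].

```python
import math
import random


def gaussian_log_prob(a, mean, std):
    """Log-density of a 1-D Gaussian N(mean, std^2) evaluated at action a."""
    return -0.5 * ((a - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)


def soft_value(q_fn, state, policy, alpha, n_samples, rng):
    """Monte Carlo estimate of E_{a~pi}[Q(s, a) - alpha * log pi(a|s)]."""
    mean, std = policy
    total = 0.0
    for _ in range(n_samples):
        a = rng.gauss(mean, std)
        total += q_fn(state, a) - alpha * gaussian_log_prob(a, mean, std)
    return total / n_samples


def soft_expected_advantage(q_fn, state, candidate, current, alpha,
                            n_samples=20000, seed=0):
    """Estimate the soft value under `candidate` minus the soft value under `current`.

    Both terms are evaluated with the critic of the *current* policy (q_fn).
    """
    rng = random.Random(seed)
    v_candidate = soft_value(q_fn, state, candidate, alpha, n_samples, rng)
    v_current = soft_value(q_fn, state, current, alpha, n_samples, rng)
    return v_candidate - v_current


def toy_q(state, action):
    """Toy critic (assumption for illustration): actions near 1.0 are best."""
    return -(action - 1.0) ** 2


current_policy = (0.0, 0.5)  # (mean, std) of the current policy
source_policy = (1.0, 0.5)   # (mean, std) of a candidate source policy

ea_source = soft_expected_advantage(toy_q, None, source_policy, current_policy, alpha=0.1)
ea_self = soft_expected_advantage(toy_q, None, current_policy, current_policy, alpha=0.1)
```

In this toy setting the source policy is centred on the critic's preferred action, so its estimated advantage is clearly positive, while the advantage of the current policy over itself is zero up to sampling noise.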
The field of policy reuse focuses on solving a target MDP $\mathcal{M}$ efficiently by transferring knowledge from a set of source policies $\{\pi_1, \pi_2, \dots, \pi_n\}$. We denote the target policy learned on $\mathcal{M}$ at iteration $t$ as $\pi_{tar}^{t}$, and its corresponding soft Q function as $Q_{\pi_{tar}^{t}}$. In this work, we assume that the source policies and the target policy share the same state and action spaces.
3 Critic-Guided Policy Reuse
This section presents CUP, an efficient policy reuse algorithm that does not require training any additional components. CUP is built upon actor-critic methods. In each iteration, CUP uses the critic to form a guidance policy from the source policies and the current target policy. Then CUP guides policy search by regularizing the target policy to imitate the guidance policy. Section 3.1 presents how to form a guidance policy by aggregating source policies through the critic, and proves that the guidance policy is guaranteed to be a monotonic improvement over the current target policy. We also prove that the target policy is theoretically guaranteed to improve by imitating the guidance policy. Section 3.2 presents the overall framework of CUP.
3.1 Critic-Guided Source Policy Aggregation
CUP utilizes action probabilities proposed by source policies to improve the current target policy, and forms a guidance policy. At iteration $t$ of target policy learning, for each state $s$, the agent has access to a set of candidate action probability distributions proposed by the $n$ source policies and the current target policy: $\Pi_s^{t} = \{\pi_1(\cdot \mid s), \dots, \pi_n(\cdot \mid s), \pi_{tar}^{t}(\cdot \mid s)\}$. The guidance policy $\pi_g^{t}$ can be formed by combining the action probability distributions that have the largest soft expected advantage over $\pi_{tar}^{t}$ at each state $s$:
$$\pi_g^{t}(\cdot \mid s) = \arg\max_{\pi(\cdot \mid s) \in \Pi_s^{t}} EA_{\pi_{tar}^{t}}(s, \pi) = \arg\max_{\pi(\cdot \mid s) \in \Pi_s^{t}} \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[Q_{\pi_{tar}^{t}}(s, a) - \alpha \log \pi(a \mid s)\right] \tag{5}$$
The second equality holds as adding $V_{\pi_{tar}^{t}}(s)$ to all soft expected advantages does not affect the result of the $\arg\max$ operator. Eq. 5 implies that at each state, we can choose which source policy to follow simply by querying its expected soft Q value under $Q_{\pi_{tar}^{t}}$. Note that with function approximation, the exact soft Q value cannot be acquired. The following theorem enables us to form the guidance policy with an approximated soft Q function, and guarantees that the guidance policy is a monotonic improvement over the current target policy.
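The per-state selection described above can be sketched as an argmax over candidate policies scored by the critic. This is a minimal illustration under stated assumptions rather than the authors' code: the candidate names, the toy critic, and the Gaussian policies are hypothetical, and the score is assumed to be a Monte Carlo estimate of E[Q(s, a) − α log π(a|s)].

```python
import math
import random


def log_prob(a, mean, std):
    """Log-density of a 1-D Gaussian N(mean, std^2) at action a."""
    return -0.5 * ((a - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)


def expected_soft_q(critic, state, policy, alpha, rng, n_samples=2000):
    """Monte Carlo estimate of E_{a~pi(.|s)}[Q(s, a) - alpha * log pi(a|s)]."""
    mean, std = policy(state)
    total = 0.0
    for _ in range(n_samples):
        a = rng.gauss(mean, std)
        total += critic(state, a) - alpha * log_prob(a, mean, std)
    return total / n_samples


def select_guidance(critic, state, candidates, alpha=0.1, seed=0):
    """Return the name of the highest-scoring candidate and all scores."""
    rng = random.Random(seed)
    scores = {name: expected_soft_q(critic, state, pi, alpha, rng)
              for name, pi in candidates.items()}
    return max(scores, key=scores.get), scores


def toy_critic(state, action):
    """Toy critic (assumption): prefers actions close to the state value."""
    return -(action - state) ** 2


# The candidate set contains the current target policy plus the source policies.
candidates = {
    "target": lambda s: (0.0, 0.3),     # (mean, std) of the current target policy
    "source_A": lambda s: (1.0, 0.3),   # helpful when the state is near 1
    "source_B": lambda s: (-1.0, 0.3),  # helpful when the state is near -1
}

choice_pos, _ = select_guidance(toy_critic, 1.0, candidates)
choice_neg, _ = select_guidance(toy_critic, -1.0, candidates)
choice_mid, _ = select_guidance(toy_critic, 0.0, candidates)
```

Because the target policy is itself a candidate, the selection falls back to it at states where no source policy scores higher, which is what happens at state 0.0 in this toy example.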
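Theorem 1
Let $\tilde{Q}_{\pi_{tar}^{t}}$ be an approximation of $Q_{\pi_{tar}^{t}}$ such that
$$\left|\tilde{Q}_{\pi_{tar}^{t}}(s, a) - Q_{\pi_{tar}^{t}}(s, a)\right| \le \epsilon \quad \text{for all } s \in \mathcal{S},\ a \in \mathcal{A}. \tag{6}$$
Define
$$\tilde{\pi}_g^{t}(\cdot \mid s) = \arg\max_{\pi(\cdot \mid s) \in \Pi_s^{t}} \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[\tilde{Q}_{\pi_{tar}^{t}}(s, a) - \alpha \log \pi(a \mid s)\right] \quad \text{for all } s \in \mathcal{S}. \tag{7}$$
Then,
$$V_{\tilde{\pi}_g^{t}}(s) \ge V_{\pi_{tar}^{t}}(s) - \frac{2\epsilon}{1 - \gamma} \quad \text{for all } s \in \mathcal{S}. \tag{8}$$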
Theorem 1 provides a way to choose source policies using an approximation of the current target policy’s soft Q value. As SAC learns such an approximation, the guidance policy can be formed without training any additional components.
The next question is, how to incorporate the guidance policy into target policy learning? The following theorem demonstrates that policy improvement can be guaranteed if the target policy is optimized to stay close to the guidance policy.
Theorem 2
If
(9) 
then
(10) 
where is the largest possible absolute value of the reward, is the largest entropy of , and is the largest possible absolute difference of the policy entropy.
According to Theorem 2, the target policy can be improved by minimizing the KL divergence between the target policy and the guidance policy. Thus we can use the KL divergence as an auxiliary loss to guide target policy learning. Proofs of this section are deferred to Appendix B.1 and Appendix B.2. Theorem 1 and Theorem 2 can be extended to common “hard” value functions (deferred to Appendix B.3), so CUP is also applicable to actor-critic algorithms that use “hard” Bellman updates, such as A3C (Mnih et al., 2016).
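For Gaussian policies, a KL divergence of this kind has a closed form. The sketch below assumes diagonal Gaussian policies and computes KL(p ‖ q); the function name is hypothetical, and which of the two policies plays the role of p is an assumption to be checked against the loss definition. If both policies pass their samples through the same invertible squashing function (as SAC implementations commonly do with tanh), the KL divergence between the squashed distributions equals that between the underlying Gaussians.

```python
import math


def kl_diag_gaussian(mean_p, std_p, mean_q, std_q):
    """Closed-form KL(p || q) for diagonal Gaussians, summed over action dimensions."""
    total = 0.0
    for mp, sp, mq, sq in zip(mean_p, std_p, mean_q, std_q):
        total += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
    return total


# Identical policies: the divergence is zero.
kl_same = kl_diag_gaussian([0.0, 1.0], [0.5, 0.5], [0.0, 1.0], [0.5, 0.5])

# Means differ by 1.0 in the first action dimension only.
kl_shifted = kl_diag_gaussian([0.0, 1.0], [0.5, 0.5], [1.0, 1.0], [0.5, 0.5])
```

With equal standard deviations of 0.5, shifting one mean by 1.0 contributes (1.0)² / (2 · 0.5²) = 2.0 to the divergence.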
3.2 CUP Framework
In this subsection we present the overall framework of CUP. As shown in Fig. 1, at each iteration $t$, CUP first forms a guidance policy $\tilde{\pi}_g^{t}$ according to Eq. 7, then provides additional guidance to policy search by regularizing the target policy to imitate $\tilde{\pi}_g^{t}$ (Wu et al., 2019; Fujimoto and Gu, 2021). Specifically, CUP minimizes the following loss to optimize the target policy:
(11) 
where $L_{actor}$ is the original actor loss defined in Eq. (3), and $\beta$ is a hyperparameter controlling the weight of regularization. In practice, we find that using a fixed weight for regularization has two problems. First, it is difficult to balance the scale between $L_{actor}$ and the regularization term, because $L_{actor}$ grows as the Q value gets larger. Second, a fixed weight cannot reflect the agent’s confidence in $\tilde{\pi}_g^{t}$. For example, when no source policies have positive soft expected advantages, $\tilde{\pi}_g^{t} = \pi_{tar}^{t}$. Then the agent should not imitate $\tilde{\pi}_g^{t}$ anymore, as $\tilde{\pi}_g^{t}$ cannot provide any guidance to further improve performance. Noticing that the soft expected advantage serves as a natural confidence measure, we weight the KL divergence with the corresponding soft expected advantage at that state:
(12) 
where $\widetilde{EA}_{\pi_{tar}^{t}}(s, \tilde{\pi}_g^{t})$ is the approximated soft expected advantage, $\beta_1$ and $\beta_2$ are two hyperparameters, and $\tilde{V}_{\pi_{tar}^{t}}(s)$ is the approximated soft value function. This adaptive regularization weight automatically balances the two losses, and ignores the regularization term at states where $\tilde{\pi}_g^{t}$ cannot improve over $\pi_{tar}^{t}$ anymore. We further upper clip the expected advantage with the absolute value of $\tilde{V}_{\pi_{tar}^{t}}(s)$ to avoid the agent being overly confident about $\tilde{\pi}_g^{t}$ due to function approximation error.
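One way to read the adaptive weight is as an advantage-proportional coefficient that is clipped from above by a multiple of the absolute soft value. The sketch below is an assumption-laden illustration, not the paper's exact formula: the names `adaptive_weight`, `regularized_actor_loss`, `beta1`, `beta2`, the default values, the lower clamp at zero, and the exact placement of the two coefficients are all hypothetical.

```python
def adaptive_weight(expected_advantage, value, beta1, beta2):
    """Per-state regularization weight (assumed form).

    The weight grows with the approximated soft expected advantage of the
    guidance policy and is upper-clipped by beta2 * |V(s)| so that critic
    errors cannot make the agent overly confident in the guidance policy.
    """
    non_negative_advantage = max(expected_advantage, 0.0)  # assumption: clamp at zero
    return min(beta1 * non_negative_advantage, beta2 * abs(value))


def regularized_actor_loss(actor_loss, kl, expected_advantage, value,
                           beta1=1.0, beta2=0.2):
    """Actor loss plus the adaptively weighted KL term, for a single state."""
    return actor_loss + adaptive_weight(expected_advantage, value, beta1, beta2) * kl


# No source policy improves on the target policy: the KL term is ignored.
w_zero = adaptive_weight(0.0, 5.0, beta1=1.0, beta2=0.2)

# Moderate advantage: the weight follows the advantage.
w_mid = adaptive_weight(0.5, 5.0, beta1=1.0, beta2=0.2)

# Suspiciously large advantage: the weight is clipped at beta2 * |V|.
w_clipped = adaptive_weight(10.0, -5.0, beta1=1.0, beta2=0.2)

loss = regularized_actor_loss(actor_loss=2.0, kl=3.0, expected_advantage=0.5, value=5.0)
```

In a batched implementation the same weight would be computed per state and treated as a constant with respect to the actor parameters, so that gradients flow only through the KL term and the original actor loss.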
4 Experiments
We evaluate on Meta-World (Yu et al., 2020), a popular reinforcement learning benchmark composed of multiple robot manipulation tasks. These tasks are both correlated (performed by the same Sawyer robot arm) and distinct (interacting with different objects and having different reward functions), and serve as a proper evaluation benchmark for policy reuse. The source policies are obtained by training on three representative tasks: Reach, Push, and PickPlace. We choose several complex tasks as target tasks, including Hammer, PegInsertSide, PushWall, PickPlaceWall, PushBack, and ShelfPlace. Among these target tasks, Hammer and PegInsertSide require interacting with objects unseen in the source tasks. In PushWall and PickPlaceWall, there is a wall between the object and the goal. In PushBack, the goal distribution is different from Push. In ShelfPlace, the robot is required to put a block on a shelf, and the shelf is unseen in the source tasks. Video demonstrations of these tasks are available at https://metaworld.github.io/. Similar to the settings in Yang et al. (2020a), in our experiments the goal position is randomly reset at the start of every episode. Code is available at https://github.com/NagisaZj/CUP.
4.1 Transfer Performance on Meta-World
We compare against several representative baseline algorithms, including HAAR (Li et al., 2019), PTF (Yang et al., 2020b), MULTIPOLAR (Barekatain et al., 2021), and MAMBA (Cheng et al., 2020). Among these algorithms, HAAR and PTF learn hierarchical high-level policies over source policies. MAMBA aggregates source policies’ V functions to form a baseline function, and performs policy improvement over the baseline function. MULTIPOLAR learns a weighted sum of source policies’ action probabilities, and learns an additional network to predict residuals. We also compare against the original SAC algorithm. All the results are averaged over six random seeds. As shown in Figure 2, CUP is the only algorithm that achieves efficient transfer on all six tasks, significantly outperforming the original SAC algorithm. HAAR achieves good jump-start performance on PushWall and PickPlaceWall, but fails to improve further due to the optimization non-stationarity induced by jointly training high-level and low-level policies. MULTIPOLAR achieves comparable performance on PushWall and PegInsertSide, because the Push source policy is useful on PushWall (implied by HAAR’s good jump-start performance), and learning residuals on PegInsertSide is easier (implied by SAC’s fast learning). In PickPlaceWall, the PickPlace source policy is useful, but the residual is difficult to learn, so MULTIPOLAR does not work. For the remaining three tasks, the source policies are less useful, and MULTIPOLAR fails on these tasks. PTF fails as its hierarchical policy only gets updated when the agent chooses similar actions to one of the source policies, which is quite rare when the source and target tasks are distinct. MAMBA fails as estimating all source policies’ V functions accurately is sample-inefficient. Algorithm performance evaluated by success rate is deferred to Appendix E.1.
Figure 2: Evaluation of CUP and several baselines on various Meta-World tasks. Dashed areas represent 95% bootstrapped confidence intervals. CUP achieves substantially better performance than baseline algorithms.
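For readers unfamiliar with bootstrapped confidence intervals over a small number of seeds, the following sketch shows one common variant, the percentile bootstrap of the mean. The paper does not specify which bootstrap variant was used, so this is an assumption; the function name is hypothetical and the six return values are made-up numbers, not results from the experiments.

```python
import random


def bootstrap_ci(values, n_resamples=10000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_resamples):
        # Resample the seeds with replacement and record the resampled mean.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower_idx = int((1.0 - confidence) / 2.0 * n_resamples)
    upper_idx = int((1.0 + confidence) / 2.0 * n_resamples) - 1
    return means[lower_idx], means[upper_idx]


# Hypothetical final returns from six random seeds (illustrative only).
returns = [4100.0, 3900.0, 4300.0, 4050.0, 3800.0, 4200.0]
low, high = bootstrap_ci(returns)
mean_return = sum(returns) / len(returns)
```

Applying the same procedure at every evaluation step of a learning curve yields the band drawn around the mean curve.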
4.2 Analyzing the Guidance Policy
This subsection provides visualizations of CUP’s source policy selection. Fig. 3 shows the percentages of each source policy being selected throughout training on PushWall. At early stages of training, the source policies are selected more frequently as they have positive expected advantages, which means that they can be used to improve the current target policy. As training proceeds and the target policy becomes better, the source policies are selected less frequently. Among these three source policies, Push is chosen more frequently than the other two source policies, as it is more related to the target task. Figure 4 presents the source policies’ expected advantages over an episode at convergence in PickPlaceWall. The Push source policy and Reach source policy almost always have negative expected advantages, which implies that these two source policies can hardly improve the current target policy anymore. Meanwhile, the PickPlace source policy has expected advantages close to zero after 100 environment steps, which implies that the PickPlace source policy is close to the target policy at these steps. Analyses on all six tasks as well as analyses on HAAR’s source policy selection are deferred to Appendix E.2 and Appendix E.6, respectively.
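Selection percentages of this kind can be computed by counting, over a batch of states, how often each candidate wins the argmax. The sketch below is only illustrative: the function name is hypothetical and the logged selections are made-up data rather than measurements from the paper.

```python
from collections import Counter


def selection_percentages(selected_names, candidate_names):
    """Percentage of states at which each candidate policy was selected."""
    total = len(selected_names)
    if total == 0:
        return {name: 0.0 for name in candidate_names}
    counts = Counter(selected_names)
    return {name: 100.0 * counts[name] / total for name in candidate_names}


# Hypothetical per-state selections for one training batch (illustrative only).
batch_selection = ["target", "Push", "target", "Push", "Reach", "target", "target", "Push"]
percentages = selection_percentages(
    batch_selection, ["target", "Reach", "Push", "PickPlace"]
)
```

Logging these percentages at regular intervals during training gives curves like the ones described above, with the share of the target policy expected to rise as it improves.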
4.3 Ablation Study
This subsection evaluates CUP’s sensitivity to hyperparameter settings and the number of source policies. We also evaluate CUP’s robustness against random source policies, which do not provide meaningful candidate actions for solving target tasks.
4.3.1 Hyper-Parameter Sensitivity
For all the experiments in Section 4.1, we use the same set of hyperparameters, which indicates that CUP is generally applicable to a wide range of tasks without particular fine-tuning. CUP introduces only two additional hyperparameters to the underlying SAC algorithm, and we further test CUP’s sensitivity to these additional hyperparameters. As shown in Fig. 5, CUP is generally robust to the choice of hyperparameters and achieves stable performance.
4.3.2 Number of Source Policies
We evaluate CUP as well as baseline algorithms on a larger source policy set. We add three policies to the original source policy set, which solve three simple tasks: DrawerClose, PushWall, and CoffeeButton. This forms a source policy set composed of six policies. As shown in Fig. 6, CUP is still the only algorithm that solves all six target tasks efficiently. MULTIPOLAR suffers from a decrease in performance, which indicates that learning the weighted sum of source policies’ actions becomes more difficult as the number of source policies grows. The rest of the baseline algorithms perform similarly to their counterparts using three source policies. Fig. 7 provides a more direct comparison of CUP’s performance with different numbers of source policies. CUP is able to utilize the additional source policies to further improve its performance, especially on PickPlaceWall and PegInsertSide. Further detailed analysis is deferred to Appendix E.3.
4.3.3 Interference of Random Source Policies
In order to evaluate the efficiency of CUP’s critic-guided source policy aggregation, we add random policies to the set of source policies. As shown in Fig. 8, adding up to 3 random source policies does not affect CUP’s performance. This indicates that CUP can efficiently choose which source policy to follow even if there exist many source policies that are not meaningful. Adding 4 and 5 random source policies leads to a slight drop in performance. This drop occurs because, as the number of random policies grows, more random actions are sampled, and taking the argmax over these actions’ expected advantages is more likely to be affected by errors in value estimation.
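The effect of estimation noise on the argmax can be illustrated with a small simulation. This is a toy model, not an experiment from the paper: it assumes the target policy has a true advantage of 0, every random policy has a true advantage of −1, and each estimate is corrupted by independent Gaussian noise. All names and numbers are hypothetical.

```python
import random


def misselection_rate(n_random, noise_std=0.5, n_trials=20000, seed=0):
    """Fraction of trials in which a useless candidate wins the argmax.

    True advantages: 0 for the target policy, -1 for each random policy.
    Every estimate receives independent Gaussian noise with std `noise_std`.
    """
    rng = random.Random(seed)
    wrong = 0
    for _ in range(n_trials):
        target_estimate = 0.0 + rng.gauss(0.0, noise_std)
        best_random = max(-1.0 + rng.gauss(0.0, noise_std) for _ in range(n_random))
        if best_random > target_estimate:
            wrong += 1
    return wrong / n_trials


# Misselection rates for 1, 3, and 5 random source policies.
rates = {k: misselection_rate(k) for k in (1, 3, 5)}
```

Under these assumptions the chance that some random policy is wrongly selected increases with the number of random candidates, which matches the qualitative explanation given above.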
To further investigate CUP’s ability to ignore unsuitable source policies, we design another transfer setting that consists of another two source policy sets. The first set consists of three random policies that are useless for the target task, and the second set adds the Reach policy to the first set. As demonstrated in Fig. 8, when none of the source policies are useful, CUP performs similarly to the original SAC, and its sample efficiency is almost unaffected by the useless source policies. When there exists a useful source policy, CUP can efficiently utilize it to improve performance, even if there are many useless source policies.
5 Related Work
Policy reuse.
A series of works on policy reuse utilize source policies for exploration in value-based algorithms (Fernández and Veloso, 2006; Li and Zhang, 2018; Gimelfarb et al., 2021), but they are not applicable to policy gradient methods due to the off-policyness problem (Fujimoto et al., 2019). AC-Teach (Kurenkov et al., 2020) mitigates this problem by improving the actor over the behavior policy’s value estimation, but still fails in more complex tasks. One branch of methods trains hierarchical high-level policies over source policies. CAPS (Li et al., 2018) guarantees the optimality of the hierarchical policies by adding primitive skills to the low-level policy set, but is inapplicable to MDPs with continuous action spaces. HAAR (Li et al., 2019) fine-tunes low-level policies to ensure optimality, but joint training of high-level and low-level policies induces optimization non-stationarity (Pateria et al., 2021). PTF (Yang et al., 2020b) trains a hierarchical policy, which is imitated by the target policy. However, the hierarchical policy only gets updated when the target policy chooses similar actions to one of the source policies, so PTF fails in complex tasks with large action spaces. Another branch of works aggregates source policies via their Q functions or V functions on the target task. Barreto et al. (2017) and Barreto et al. (2018) focus on the situation where source tasks and target tasks share the same dynamics, and aggregate source policies by choosing the policy that has the largest Q at each state. They use successor features to mitigate the heavy computation cost brought by estimating Q functions for all source policies. MAMBA (Cheng et al., 2020) forms a baseline function by aggregating source policies’ V functions, and guides policy search by improving the policy over the baseline function. Finally, MULTIPOLAR (Barekatain et al., 2021) learns a weighted sum over source policies’ actions, and learns an auxiliary network to predict residuals around the aggregated actions. MULTIPOLAR is computationally expensive, as it requires querying all the source policies at every sampling step. Our proposed method, CUP, focuses on the setting of learning continuous-action MDPs with actor-critic methods. CUP is both computationally efficient and sample-efficient, as it does not require training any additional components.
Policy regularization.
Adding regularization to policy optimization is a common approach to introduce prior knowledge into policy learning. Distral (Teh et al., 2017) achieves inter-task transfer by imitating an average policy distilled from policies of related tasks. In offline RL, policy regularization serves as a common technique to keep the policy close to the behavior policy used to collect the dataset (Wu et al., 2019; Nair et al., 2020; Fujimoto and Gu, 2021). CUP uses policy regularization as a means to provide additional guidance to policy search with the guidance policy.
6 Conclusion
In this study, we address the problem of reusing source policies without training any additional components. By utilizing the critic as a natural evaluation of source policies, we propose CUP, an efficient policy reuse algorithm. CUP is conceptually simple, easy to implement, and has theoretical guarantees. Empirical results demonstrate that CUP achieves efficient transfer on a wide range of tasks. As for limitations, CUP assumes that all source policies and the target policy share the same state and action spaces, which limits CUP’s application to more general scenarios. One possible future direction is to take inspiration from previous works that map the state and action spaces of an MDP to another MDP with similar high-level structure (Wan et al., 2020; Zhang et al., 2020; Heng et al., 2022; van der Pol et al., 2020b, a). Another interesting direction is to incorporate CUP into the continual learning setting (Rolnick et al., 2019; Khetarpal et al., 2020), in which an agent gradually enriches its source policy set in an online manner.
Acknowledgements
This work is supported in part by Science and Technology Innovation 2030 – “New Generation Artificial Intelligence” Major Project (No. 2018AAA0100904), National Natural Science Foundation of China (62176135), and China Academy of Launch Vehicle Technology (CALT202218).
References
MULTIPOLAR: multi-source policy aggregation for transfer reinforcement learning between diverse environmental dynamics. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, pp. 3108–3116.
Transfer in deep reinforcement learning using successor features and generalised policy improvement. In International Conference on Machine Learning, pp. 501–510.
Successor features for transfer in reinforcement learning. Advances in Neural Information Processing Systems 30.
Revisiting rainbow: promoting more insightful and inclusive deep reinforcement learning research. In International Conference on Machine Learning, pp. 1373–1383.
Policy improvement via imitation of multiple oracles. Advances in Neural Information Processing Systems 33, pp. 5587–5598.
Refinements of Pinsker’s inequality. IEEE Transactions on Information Theory 49 (6), pp. 1491–1498.
Probabilistic policy reuse in a reinforcement learning agent. In Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems, pp. 720–727.
A minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems 34.
Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pp. 1587–1596.
Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pp. 2052–2062.
Contextual policy transfer in reinforcement learning domains via deep mixtures-of-experts. In Uncertainty in Artificial Intelligence, pp. 1787–1797.
Learning and transfer in everyday cognition. Cognitive Development 6 (3), pp. 233–260.
Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning, pp. 1352–1361.
Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905.
Cross-domain adaptive transfer reinforcement learning based on state-action correspondence. In The 38th Conference on Uncertainty in Artificial Intelligence.
Approximately optimal approximate reinforcement learning. In Proc. 19th International Conference on Machine Learning.
Towards continual reinforcement learning: a review and perspectives. arXiv preprint arXiv:2012.13490.
AC-Teach: a Bayesian actor-critic method for policy learning with an ensemble of suboptimal teachers. In Conference on Robot Learning, pp. 717–734.
Context-aware policy reuse. arXiv preprint arXiv:1806.03793.
Hierarchical reinforcement learning with advantage-based auxiliary rewards. Advances in Neural Information Processing Systems 32.
An optimal online method of selecting source policies for reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
Continuous control with deep reinforcement learning. In ICLR (Poster).
Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937.
AWAC: accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359.
The difficulty of passive learning in deep reinforcement learning. Advances in Neural Information Processing Systems 34, pp. 23283–23295.
Hierarchical reinforcement learning: a comprehensive survey. ACM Computing Surveys (CSUR) 54 (5), pp. 1–35.
Experience replay for continual learning. Advances in Neural Information Processing Systems 32.
Mastering the game of Go without human knowledge. Nature 550 (7676), pp. 354–359.
Multi-task reinforcement learning with context-based representations. In International Conference on Machine Learning, pp. 9767–9779.
Reinforcement learning: an introduction. MIT Press.
Distral: robust multitask reinforcement learning. Advances in Neural Information Processing Systems 30.
Plannable approximations to MDP homomorphisms: equivariance under actions. arXiv preprint arXiv:2002.11963.
MDP homomorphic networks: group symmetries in reinforcement learning. Advances in Neural Information Processing Systems 33, pp. 4199–4210.
Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575 (7782), pp. 350–354.
Mutual information based knowledge transfer under state-action dimension mismatch. arXiv preprint arXiv:2006.07041.
Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361.
Multi-task reinforcement learning with soft modularization. Advances in Neural Information Processing Systems 33, pp. 4767–4777.
Efficient deep reinforcement learning through policy transfer. In AAMAS, pp. 2053–2055.
Meta-world: a benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning, pp. 1094–1100.
Learning cross-domain correspondence for control with dynamics cycle-consistency. arXiv preprint arXiv:2012.09811.
Checklist

For all authors…

Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

Did you describe the limitations of your work? See Section 2.

Did you discuss any potential negative societal impacts of your work? See Appendix A.

Have you read the ethics review guidelines and ensured that your paper conforms to them?


If you ran experiments…

Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? See the supplemental materials.

Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? See Appendix D.1.
Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? See Section 4.

Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? See Appendix D.1.


If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

If your work uses existing assets, did you cite the creators? See Section 4

Did you mention the license of the assets? See the supplemental materials.

Did you include any new assets either in the supplemental material or as a URL?

Did you discuss whether and how consent was obtained from people whose data you’re using/curating?

Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?


If you used crowdsourcing or conducted research with human subjects…

Did you include the full text of instructions given to participants and screenshots, if applicable?

Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?

Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?

Appendix A Broader Social Impact
We believe policy reuse serves as a promising way to transfer knowledge among AI agents. This ability will enable AI agents to master new skills efficiently. However, we are also aware of possible negative social impacts, such as plagiarizing other AI products by querying and reusing their policies.
Appendix B Proofs
B.1 Proof of Theorem 1
Proof. As , we have that for all , the difference between the true value function and the approximated value function is bounded:
As is contained in , with defined in Eq. (7), it is obvious that for all , . Then for all ,
B.2 Proof of Theorem 2
Proof. According to Pinsker’s inequality (Fedotov et al., 2003), , where is the L1 norm. So we have that for all , . According to the Performance Difference Lemma (Kakade and Langford, 2002), we have that for all :
where the state distribution is the normalized discounted state occupancy distribution, with a slight abuse of notation to indicate that the agent starts deterministically from the given state. Note that
(14)  
(15) 
Eventually, we have .
B.3 Critic-Guided Source Policy Aggregation under “Hard” Value Functions
In this section we override the notation , to represent “hard” value functions, and override the notation to represent the expected advantage, which is defined as . Then Theorem 1 and Theorem 2 can be extended as below.
Theorem 3
Let be an approximation of such that
(16) 
Define
(17) 
Then,
(18) 
Theorem 4
If
(19) 
then
(20) 
where is the largest possible absolute value of the reward.
Theorem 3 and Theorem 4 imply that CUP can still guarantee policy improvement under hard Bellman updates. Proofs are given below.
Proof of Theorem 3.
As , we have that for all , the difference between the true value function and the approximated value function is bounded:
As is contained in , with defined in Eq. (17), it is obvious that for all , . Then for all ,