Ranking Policy Gradient

06/24/2019 ∙ by Kaixiang Lin, et al. ∙ Michigan State University

Sample inefficiency is a long-standing problem in reinforcement learning (RL). State-of-the-art methods derive the policy from a value function, which usually requires an extensive search over the state-action space and is one reason for the inefficiency. Towards sample-efficient RL, we propose ranking policy gradient (RPG), a policy gradient method that learns the optimal ranking of a set of discrete actions. To accelerate the learning of policy gradient methods, we describe a novel off-policy learning framework and establish the equivalence between maximizing the lower bound of return and imitating a near-optimal policy without accessing any oracles. These results lead to a general sample-efficient off-policy learning framework, which accelerates learning and reduces variance. Furthermore, the sample complexity of RPG does not depend on the dimension of the state space, which makes RPG applicable to large-scale problems. We conduct extensive experiments showing that, when combined with the off-policy learning framework, RPG substantially reduces the sample complexity compared to the state-of-the-art.




1 Introduction

One of the major challenges in reinforcement learning (RL) is the high sample complexity (Kakade et al., 2003), which is the number of samples that must be collected to conduct successful learning. There are different reasons for the poor sample efficiency of RL (Yu, 2018). One often overlooked but decisive factor is the usage of value functions, which have been widely adopted in state-of-the-art methods (Van Hasselt et al., 2016; Hessel et al., 2017; Mnih et al., 2016; Schulman et al., 2017; Gruslys et al., 2018). Because algorithms that directly optimize return, such as Reinforce (Williams, 1992), could suffer from high variance (Sutton and Barto, 2018), value function baselines were introduced by actor-critic methods to reduce the variance. However, since a value function is associated with a certain policy, the samples collected by former policies cannot be readily used without complicated manipulations (Degris et al., 2012) and extensive parameter tuning (Nachum et al., 2017). Such an on-policy requirement increases the difficulty of sample-efficient learning. On the other hand, off-policy methods, such as one-step $Q$-learning (Watkins and Dayan, 1992) and variants of deep $Q$ networks (DQN) (Hessel et al., 2017; Dabney et al., 2018; Van Hasselt et al., 2016), are currently among the most sample-efficient algorithms. These algorithms, however, often require extensive searching (Bertsekas and Tsitsiklis, 1996) over a usually large state-action space to estimate the action value function. This requirement could limit the sample-efficiency of these algorithms.

To address the aforementioned challenge, we first revisit the decision process of RL. For a deterministic decision process, the action with the largest action value is chosen. For a stochastic decision process, the action values are normalized into a probability distribution, from which an action is sampled. In both cases, the crucial factor deciding which action to take is the relative relationship of actions rather than the absolute values. Therefore, instead of estimating the return of each action (i.e., the action-value function), we propose the ranking policy gradient (RPG), a policy gradient method that optimizes the ranking of actions with respect to the long-term reward by learning the pairwise relationship among actions.

Secondly, we propose an off-policy learning framework to improve the sample-efficiency of policy gradient methods, without relying on value function baselines. We establish the theoretical equivalence between RL optimizing the lower bound of the long-term reward and learning the near-optimal policy in a supervised manner. The central idea is that we separate policy learning into two stages: exploration and supervision. During the exploration stage, we use the near-optimal trajectories encountered to construct a training dataset that approximates the state-action pairs sampled from a near-optimal policy. During the supervision stage, we imitate the near-optimal policy with the training dataset. Such a separated design provides the flexibility of off-policy learning, so that we can smoothly incorporate various exploration strategies to improve sample-efficiency. Also, by using a supervised learning approach, the upper bound of gradient variance is largely reduced because it is independent of the horizon and reward scale, which does not hold for general policy gradient methods. This learning paradigm leads to a novel sample complexity analysis of large-scale MDPs, in a non-tabular setting without the linear dependence on the state space. Besides, we empirically show that there is a trade-off between optimality and sample-efficiency. Last but not least, we demonstrate that the proposed approach, combining RPG with off-policy learning, substantially outperforms the state-of-the-art (Hessel et al., 2017; Bellemare et al., 2017; Dabney et al., 2018; Mnih et al., 2015).

2 Related works

Sample Efficiency. Sample-efficient reinforcement learning can be roughly divided into two categories. The first category includes variants of $Q$-learning (Mnih et al., 2015; Schaul et al., 2015; Van Hasselt et al., 2016; Hessel et al., 2017). The main advantage of $Q$-learning methods is the use of off-policy learning, which is essential for sample efficiency. The representative DQN (Mnih et al., 2015) introduced deep neural networks in $Q$-learning, which further inspired a track of successful DQN variants such as Double DQN (Van Hasselt et al., 2016), Dueling networks (Wang et al., 2015), prioritized experience replay (Schaul et al., 2015), and Rainbow (Hessel et al., 2017). The second category is the actor-critic approaches. Most recent works (Degris et al., 2012; Wang et al., 2016; Gruslys et al., 2018) in this category leveraged importance sampling by re-weighting the samples to correct the estimation bias and reduce variance. Their main advantage is in wall-clock time, due to the distributed framework first presented in (Mnih et al., 2016), rather than in sample-efficiency. As of the time of writing, the variants of DQN (Hessel et al., 2017; Dabney et al., 2018; Bellemare et al., 2017; Schaul et al., 2015; Van Hasselt et al., 2016) are among the most sample-efficient algorithms, and they are adopted as our baselines for comparison.

RL as Supervised Learning. Many efforts have focused on developing the connections between RL and supervised learning, such as Expectation-Maximization algorithms (Dayan and Hinton, 1997; Peters and Schaal, 2007; Kober and Peters, 2009; Abdolmaleki et al., 2018), Entropy-Regularized RL (Oh et al., 2018; Haarnoja et al., 2018), and Interactive Imitation Learning (IIL) (Daumé et al., 2009; Syed and Schapire, 2010; Ross and Bagnell, 2010; Ross et al., 2011; Sun et al., 2017; Hester et al., 2018; Osa et al., 2018). EM-based approaches apply the probabilistic framework to formulate the RL problem of maximizing a lower bound of the return as a re-weighted regression problem, but they require on-policy estimation in the expectation step. Entropy-Regularized RL, which optimizes entropy-augmented objectives, can lead to off-policy learning without the usage of importance sampling, but it converges to soft optimality (Haarnoja et al., 2018).

Of the three tracks in prior works, IIL is most closely related to our work. The IIL works first pointed out the connection between imitation learning and reinforcement learning (Ross and Bagnell, 2010; Syed and Schapire, 2010; Ross et al., 2011) and explored the idea of facilitating reinforcement learning by imitating experts. However, most imitation learning algorithms assume access to the expert policy or demonstrations. The off-policy learning framework proposed in this paper can be interpreted as an online imitation learning approach that constructs expert demonstrations during the exploration without soliciting experts, and conducts supervised learning to maximize return at the same time.

In conclusion, our approach is different from prior art in terms of at least one of the following aspects: objectives, oracle assumptions, the optimality of the learned policy, and the on-policy requirement. More concretely, the proposed method is able to learn an optimal policy in terms of long-term reward, without access to an oracle (such as an expert policy or expert demonstrations), and it can be trained both empirically and theoretically in an off-policy fashion. A more detailed discussion of the related work is provided in Appendix A.

PAC Analysis of RL. Most existing studies on sample complexity analysis (Kakade et al., 2003; Strehl et al., 2006; Kearns et al., 2000; Strehl et al., 2009; Krishnamurthy et al., 2016; Jiang et al., 2017; Jiang and Agarwal, 2018; Zanette and Brunskill, 2019) are established on the value function estimation. The proposed approach leverages the probably approximately correct framework (Valiant, 1984) in a different way such that it does not rely on the value function. Such independence directly leads to a practically sample-efficient algorithm for large-scale MDP, as we demonstrated in the experiments.

3 Notations and Problem Setting

In this paper, we consider a finite horizon $T$, discrete time Markov Decision Process (MDP) with a finite discrete state space $\mathcal{S}$, and for each state $s \in \mathcal{S}$, the action space $\mathcal{A}_s$ is finite. The environment dynamics is denoted as $\mathbf{P} = \{p(s'|s,a), \forall s, s' \in \mathcal{S}, a \in \mathcal{A}_s\}$. We note that the dimension of the action space can vary given different states. We use $m = \max_s |\mathcal{A}_s|$ to denote the maximal action dimension among all possible states. Our goal is to maximize the expected sum of positive rewards, or return $J(\theta) = \mathbf{E}_{\tau, \pi_\theta}\left[\sum_{t=1}^{T} r(s_t, a_t)\right]$, where $0 < r(s,a) < \infty, \forall s, a$. In this case, the optimal deterministic Markovian policy always exists (Puterman, 2014, Proposition 4.4.3). The upper bound of the trajectory reward ($r(\tau)$) is denoted as $R_{\max} = \max_\tau r(\tau)$. Table 1 provides a comprehensive reference describing the notations used throughout this work.

Notation | Definition
$\lambda_{ij}$ | The discrepancy of the value of action $a_i$ and action $a_j$. $\lambda_{ij} = \lambda_i - \lambda_j$, where $\lambda_i = \lambda(s, a_i)$. Notice that the value here is not the estimation of return; it represents which action will have relatively higher return if followed.
$p_{ij}$ | The probability that the $i$-th action is to be ranked higher than the $j$-th action. Notice that $p_{ij}$ is controlled by $\theta$ through $\lambda_i, \lambda_j$.
$\tau$ | A trajectory $\tau = \{s(\tau,t), a(\tau,t)\}_{t=1}^{T}$ collected from the environment. It is worth noting that this trajectory is not associated with any policy. It only represents a series of state-action pairs. We also use the abbreviation $s_t = s(\tau,t)$, $a_t = a(\tau,t)$.
$r(\tau)$ | The trajectory reward $r(\tau) = \sum_{t=1}^{T} r(s_t, a_t)$ is the sum of rewards along one trajectory.
$\sum_\tau$ | The summation over all possible trajectories $\tau$.
$p_\theta(\tau)$ | The probability that a specific trajectory $\tau$ is collected from the environment given policy $\pi_\theta$.
$\mathcal{T}$ | The set of all possible near-optimal trajectories. $|\mathcal{T}|$ denotes the number of near-optimal trajectories in $\mathcal{T}$.
$n$ | The number of training samples, or equivalently state-action pairs, sampled from the uniformly (near)-optimal policy.
$m$ | The number of discrete actions.
Table 1: Notations

4 Ranking Policy Gradient

Value function estimation is widely used in advanced RL algorithms (Mnih et al., 2015, 2016; Schulman et al., 2017; Gruslys et al., 2018; Hessel et al., 2017; Dabney et al., 2018) to facilitate the learning process. In practice, the on-policy requirement of value function estimations in actor-critic methods has largely increased the difficulty of achieving sample-efficient learning (Degris et al., 2012; Gruslys et al., 2018). With the advantage of off-policy learning, the DQN (Mnih et al., 2015) variants are currently among the most sample-efficient algorithms (Hessel et al., 2017; Dabney et al., 2018; Bellemare et al., 2017). For complicated tasks, the value function can well align with the relative relationship of returns of actions, but the absolute values are hardly accurate (Mnih et al., 2015; Ilyas et al., 2018).

The above observations motivate us to look at the decision phase of RL from a different perspective: given a state, decision making amounts to performing a relative comparison over the available actions and then choosing the best action, which leads to a relatively higher return than the others. Therefore, an alternative solution is to learn the ranking of the actions, instead of deriving the policy from the action values. In this section, we show how to optimize the ranking of a set of discrete actions to maximize the return, and thus avoid value function estimation. The discussion of the second question will be given in Section 5. In this section, the action values no longer represent the return, but are only used to illustrate the relative relationship of available actions.

In order to optimize the ranking, we formulate the action ranking problem as follows. Denote the action value of action $a_i$ given a state $s$ as $\lambda_i = \lambda(s, a_i)$, where the model parameter $\theta$ is omitted for concise presentation. Our goal is to optimize the action values such that the best action has a higher probability of being selected than the others. Inspired by learning to rank (Burges et al., 2005), we consider the pairwise relationship among all actions, by modeling the probability (denoted as $p_{ij}$) of an action $a_i$ to be ranked higher than any action $a_j$ as follows:

$$p_{ij} = \frac{\exp(\lambda_i - \lambda_j)}{1 + \exp(\lambda_i - \lambda_j)}, \tag{1}$$

where $\lambda_{ij} = \lambda_i - \lambda_j = 0$ means the action value of $a_i$ is the same as that of the action $a_j$, and $p_{ij} > 0.5$ indicates that the action $a_i$ is ranked higher than $a_j$. We would like to increase the probability that the optimal action ranks higher than any other action, which is the probability that action $a_i$ is chosen. The pairwise ranking policy is defined in Eq (2), given that the mild Assumption 1 is satisfied. Please refer to Appendix H for the discussions on Assumption 1.

The pairwise ranking policy is defined as:

$$\pi(a = a_i \mid s) = \prod_{j=1, j \neq i}^{m} p_{ij}, \tag{2}$$

where $p_{ij}$ is defined in Eq (1), given the current state $s$, or equivalently given the current action values $\lambda_1, \dots, \lambda_m$.

Assumption 1

Given a state $s$, the events that the action $a_i$ is ranked higher than the action $a_j$ are independent, for all $j \neq i$.

Increasing the probability of selecting the best action is equivalent to increasing the joint probability that the best action is ranked higher than all other actions. With Assumption 1, we can decompose the joint probability as the product of the pairwise probabilities in Eq (1). Our ultimate goal is to maximize the long-term reward through optimizing the pairwise relationship among the action pairs. To achieve this goal, we resort to the policy gradient method. Formally, we propose the ranking policy gradient method (RPG), as shown in Theorem 4.
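The decomposition above can be illustrated with a minimal sketch. It assumes the logistic form of the pairwise probability from Eq (1); the function names and the example action values are hypothetical and are not taken from the paper's code.

```python
import math

def pairwise_prob(lam_i, lam_j):
    """Logistic probability that action i is ranked above action j (assumed Eq (1))."""
    return 1.0 / (1.0 + math.exp(-(lam_i - lam_j)))

def pairwise_ranking_scores(lams):
    """For each action i, the product over j != i of p_ij (uses Assumption 1)."""
    scores = []
    for i, lam_i in enumerate(lams):
        score = 1.0
        for j, lam_j in enumerate(lams):
            if j != i:
                score *= pairwise_prob(lam_i, lam_j)
        scores.append(score)
    return scores

lams = [2.0, 0.5, -1.0]  # hypothetical action values for a single state
scores = pairwise_ranking_scores(lams)
# Deterministic pairwise ranking policy: pick the action with the largest score.
greedy = max(range(len(lams)), key=lambda i: scores[i])
```

Note that the scores need not sum to one, which is consistent with the remark that the pairwise ranking policy takes extra effort to turn into a probability distribution.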

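[Ranking Policy Gradient Theorem] For any MDP, the gradient of the expected long-term reward $J(\theta) = \sum_\tau p_\theta(\tau) r(\tau)$ w.r.t. the parameter $\theta$ of a pairwise ranking policy (Def 4) is given by:

$$\nabla_\theta J(\theta) = \mathbf{E}_{\tau \sim \pi_\theta}\left[\sum_{t=1}^{T} \nabla_\theta \log\Big(\prod_{j=1, j \neq i}^{m} p_{ij}\Big)\, r(\tau)\right] \approx \mathbf{E}_{\tau \sim \pi_\theta}\left[\sum_{t=1}^{T} \nabla_\theta \Big(\sum_{j=1, j \neq i}^{m} \frac{\lambda_i - \lambda_j}{2}\Big)\, r(\tau)\right],$$

and the deterministic pairwise ranking policy $\pi_\theta$ is: $a = \arg\max_i \lambda_i,\ i = 1, \dots, m$, where $\lambda_i$ denotes the action value of action $a_i$ ($a_i = a_t$, $s = s_t$), $s_t$ and $a_t$ denote the $t$-th state-action pair in trajectory $\tau$, and $\lambda_j, \forall j \neq i$ denote the action values of all other actions that were not taken given state $s_t$ in trajectory $\tau$, i.e., $\lambda(s_t, a_j), \forall a_j \neq a_t$.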
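One way to see why differences of action values can stand in for the log of the pairwise probabilities is a first-order expansion. The sketch below assumes that $p_{ij}$ is the logistic function of $\lambda_{ij} = \lambda_i - \lambda_j$; it is an illustration, not a replacement for the proof in Appendix B.

```latex
\begin{aligned}
\log p_{ij} &= \lambda_{ij} - \log\left(1 + e^{\lambda_{ij}}\right),
  \qquad \lambda_{ij} = \lambda_i - \lambda_j, \\
\log p_{ij} &\approx \log\tfrac{1}{2} + \tfrac{1}{2}\lambda_{ij}
  \qquad \text{(first-order expansion about } \lambda_{ij} = 0\text{)}, \\
\nabla_\theta \sum_{j \neq i} \log p_{ij}
  &\approx \nabla_\theta \sum_{j \neq i} \frac{\lambda_i - \lambda_j}{2}.
\end{aligned}
```

The constant $\log\tfrac{1}{2}$ vanishes under the gradient, so only the half-differences of action values remain.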

The proof of Theorem 4 is provided in Appendix B. Theorem 4 states that optimizing the discrepancy between the action values of the best action and all other actions is optimizing the pairwise relationships that maximize the return. One limitation of RPG is that it is not convenient for tasks where only optimal stochastic policies exist, since the pairwise ranking policy takes extra effort to construct a probability distribution [see Appendix B.1]. In order to learn a stochastic policy, we introduce Listwise Policy Gradient (LPG), which optimizes the probability of ranking a specific action at the top of a set of actions, with respect to the return. In the context of RL, this top one probability is the probability of action $a_i$ being chosen, which is equal to the sum of the probabilities of all possible permutations that place action $a_i$ at the top. Computing this probability directly is prohibitive since we need to consider the probabilities of $m!$ permutations. Inspired by the listwise learning to rank approach (Cao et al., 2007), the top one probability can be modeled by the softmax function (see Theorem 4). Therefore, LPG is equivalent to the Reinforce (Williams, 1992) algorithm with a softmax layer. LPG provides another interpretation of the Reinforce algorithm from the perspective of learning the optimal ranking and enables the learning of both deterministic and stochastic policies (see Theorem 4).

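[(Cao et al., 2007), Theorem 6] Given the action values $\lambda_1, \dots, \lambda_m$, the probability of action $a_i$ being chosen (i.e., being ranked at the top of the list) is:

$$\pi(a = a_i \mid s) = \frac{\phi(\lambda_i)}{\sum_{j=1}^{m} \phi(\lambda_j)},$$

where $\phi(\cdot)$ is any increasing, strictly positive function. A common choice of $\phi$ is the exponential function.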
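The closed form for the top one probability can be checked numerically on a small action set. The sketch below assumes that the permutation probability of Cao et al. (2007) takes the sequential (Plackett-Luce style) form, with the exponential function as the positive increasing function; all names and values are illustrative.

```python
import itertools
import math

def permutation_prob(order, lams, phi=math.exp):
    """Probability of one full ranking (best first) under the sequential model."""
    prob = 1.0
    remaining = list(order)
    while remaining:
        denom = sum(phi(lams[k]) for k in remaining)
        prob *= phi(lams[remaining[0]]) / denom
        remaining.pop(0)
    return prob

def top_one_by_enumeration(i, lams):
    """Sum the probabilities of all rankings that place action i first."""
    m = len(lams)
    return sum(permutation_prob(p, lams)
               for p in itertools.permutations(range(m)) if p[0] == i)

def top_one_closed_form(i, lams, phi=math.exp):
    """phi(lam_i) divided by the sum of phi over all actions (softmax for phi = exp)."""
    return phi(lams[i]) / sum(phi(x) for x in lams)

lams = [1.2, -0.3, 0.4, 2.0]  # hypothetical action values
gaps = [abs(top_one_by_enumeration(i, lams) - top_one_closed_form(i, lams))
        for i in range(len(lams))]
```

The enumeration visits all $m!$ rankings, which is only feasible for tiny $m$; the closed form needs a single pass over the action values.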

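[Listwise Policy Gradient Theorem] For any MDP, the gradient of the long-term reward $J(\theta) = \sum_\tau p_\theta(\tau) r(\tau)$ w.r.t. the parameter $\theta$ of the listwise ranking policy takes the following form:

$$\nabla_\theta J(\theta) = \mathbf{E}_{\tau \sim \pi_\theta}\left[\sum_{t=1}^{T} \nabla_\theta \Big(\log \frac{e^{\lambda_i}}{\sum_{j=1}^{m} e^{\lambda_j}}\Big)\, r(\tau)\right],$$

where the listwise ranking policy $\pi_\theta$ parameterized by $\theta$ is given by Eq (6) for tasks with deterministic optimal policies:

$$a = \arg\max_i \lambda_i, \quad i = 1, \dots, m, \tag{6}$$

or Eq (7) for stochastic optimal policies:

$$a \sim \pi(\cdot \mid s), \tag{7}$$

where the policy takes the form as in Eq (8):

$$\pi(a = a_i \mid s) = \frac{e^{\lambda_i}}{\sum_{j=1}^{m} e^{\lambda_j}}, \tag{8}$$

which is the probability of action $a_i$ being ranked highest, given the current state and all the action values $\lambda_1, \dots, \lambda_m$.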

The proof of Theorem 4 exactly follows the direct policy differentiation (Peters and Schaal, 2008; Williams, 1992) by replacing the policy with the softmax function. The action probabilities form a probability distribution over the set of discrete actions (Cao et al., 2007, Lemma 7). Theorem 4 states that the vanilla policy gradient (Williams, 1992) parameterized by a softmax layer is optimizing the probability of each action being ranked highest, with respect to the long-term reward. Furthermore, it enables learning both deterministic and stochastic policies.

Comparing LPG with RPG, one advantage of LPG is that it is more convenient to model the stochastic policies since softmax directly constructs a valid probability distribution. However, forming a probability distribution also limits the expressive power of the model since the dimension of the action space is fixed. Since RPG is optimizing the pairwise relationship among actions, it is possible to use one model to solve the tasks with dynamic action spaces and multi-task RL, without the design of the action masking mechanism (Li et al., 2017) or adding each specific layer for each task (Rusu et al., 2015). Furthermore, RPG is more sample-efficient than LPG when we are learning a deterministic policy.

To this end, seeking sample-efficiency motivates us to learn the relative relationship (RPG in Theorem 4 and LPG in Theorem 4) of actions, instead of deriving policy based on value function estimations. However, both of RPG and LPG belong to policy gradient methods, which suffer from large variance and require on-policy learning (Sutton and Barto, 2018). Therefore, the intuitive implementations of RPG or LPG are still far from sample-efficient. In the next section, we will describe a general off-policy learning framework empowered by supervised learning, which provides an alternative way to accelerate learning other than using value function baselines.

5 Off-policy learning as supervised learning

In this section, we discuss the connections and discrepancies between RL and supervised learning, and our results lead to a sample-efficient off-policy learning paradigm for RL. The main result in this section is Theorem 5, which casts the problem of maximizing the lower bound of return into a supervised learning problem, given one relatively mild Assumption 2 and practical assumptions 1,3. It can be shown that these assumptions are valid in a range of common RL tasks, as discussed in Lemma G in Appendix G. The central idea is to collect only the near-optimal trajectories when the learning agent interacts with the environment, and imitate the near-optimal policy by maximizing the log likelihood of the state-action pairs from these near-optimal trajectories. With the road map in mind, we then begin to introduce our approach as follows.

In a discrete action MDP with finite states and horizon, given the near-optimal policy $\pi_*$, the stationary state distribution is given by $p_{\pi_*}(s) = \sum_\tau p(s \mid \tau)\, p_{\pi_*}(\tau)$, where $p(s \mid \tau)$ is the probability of a certain state given a specific trajectory $\tau$ and is not associated with any policy; only $p_{\pi_*}(\tau)$ is related to the policy parameters. The stationary distribution of state-action pairs is thus $p_{\pi_*}(s, a) = p_{\pi_*}(s)\, \pi_*(a \mid s)$. In this section, we consider MDPs in which each initial state leads to at least one (near)-optimal trajectory. For a more general case, please refer to the discussion in Appendix C. In order to connect supervised learning (i.e., imitating a near-optimal policy) with RL and enable sample-efficient off-policy learning, we first introduce the trajectory reward shaping (TRS), defined as follows:

[Trajectory Reward Shaping, TRS] Given a fixed trajectory $\tau$, its trajectory reward is shaped as follows:

$$w(\tau) = \begin{cases} 1, & \text{if } r(\tau) \geq c \\ 0, & \text{otherwise} \end{cases}$$

where $c = R_{\max} - \epsilon$ is a problem-dependent near-optimal trajectory reward threshold that indicates the least reward of a near-optimal trajectory, with $\epsilon \geq 0$ and $\epsilon \ll R_{\max}$. We denote the set of all possible near-optimal trajectories as $\mathcal{T} = \{\tau \mid w(\tau) = 1\}$, i.e., $w(\tau) = 1, \forall \tau \in \mathcal{T}$.

The threshold indicates a trade-off between the sample-efficiency and the optimality. The higher the threshold, the less frequently it will hit the near-optimal trajectories during exploration, which means it has higher sample complexity, while the final performance is better (see Figure 3).

The trajectory reward can be reshaped to any positive function that is not related to the policy parameter $\theta$. For example, if we set $w(\tau) = r(\tau), \forall \tau \in \mathcal{T}$, the conclusions in this section still hold (see Eq (18) in Appendix D). For the sake of simplicity, we set $w(\tau) = 1$.
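Trajectory reward shaping can be sketched in a few lines. The sketch assumes the shaped reward is the indicator that the trajectory reward reaches the threshold $c$; the dictionary layout of a trajectory and the example rewards are hypothetical.

```python
def trajectory_reward(rewards):
    """r(tau): sum of the per-step rewards along one trajectory."""
    return sum(rewards)

def shaped_reward(rewards, c):
    """w(tau): 1 if the trajectory reward reaches the threshold c, else 0 (assumed form)."""
    return 1 if trajectory_reward(rewards) >= c else 0

def near_optimal(trajectories, c):
    """Keep only the trajectories whose shaped reward equals 1."""
    return [t for t in trajectories if shaped_reward(t["rewards"], c) == 1]

# Hypothetical trajectories with R_max = 3; epsilon = 1 gives c = 2.
trajectories = [
    {"name": "a", "rewards": [1, 1, 1]},
    {"name": "b", "rewards": [1, 0, 1]},
    {"name": "c", "rewards": [0, 0, 1]},
]
kept_strict = [t["name"] for t in near_optimal(trajectories, c=3)]
kept_loose = [t["name"] for t in near_optimal(trajectories, c=2)]
```

Lowering the threshold admits more trajectories into the near-optimal set, which mirrors the trade-off between sample-efficiency and optimality noted above.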

Different from the reward shaping work (Ng et al., 1999), where shaping happens at each step on $r(s_t, a_t)$, the proposed approach directly shapes the trajectory reward $r(\tau)$, which facilitates the smooth transformation from RL to SL. After shaping the trajectory reward, we can transfer the goal of RL from maximizing the return to maximizing the long-term performance (Def 5).

[Long-term Performance] The long-term performance is defined by the expected shaped trajectory reward:

$$\sum_\tau p_\theta(\tau)\, w(\tau). \tag{9}$$

According to Def 5, the expectation over all trajectories is equal to that over the near-optimal trajectories in $\mathcal{T}$, i.e., $\sum_\tau p_\theta(\tau)\, w(\tau) = \sum_{\tau \in \mathcal{T}} p_\theta(\tau)\, w(\tau)$.

The optimality is preserved after trajectory reward shaping ($\epsilon = 0$, $c = R_{\max}$) since the optimal policy $\pi_*$ maximizing long-term performance is also an optimal policy for the original MDP, i.e., $\sum_\tau p_{\pi_*}(\tau)\, r(\tau) = \sum_{\tau \in \mathcal{T}} p_{\pi_*}(\tau)\, r(\tau) = R_{\max}$, where $\pi_* = \arg\max_{\pi_\theta} \sum_\tau p_{\pi_\theta}(\tau)\, w(\tau)$ and $0 \leq \epsilon \ll R_{\max}$ (see Lemma D in Appendix D). Similarly, when $\epsilon > 0$, the optimal policy after trajectory reward shaping is a near-optimal policy for the original MDP. Note that most policy gradient methods use the softmax function, in which case there exists $\tau \notin \mathcal{T}$ with $p_{\pi_\theta}(\tau) > 0$ (see Lemma D in Appendix D). Therefore, when softmax is used to model a policy, it will not converge to an exact optimal policy. On the other hand, ideally, the discrepancy of the performance between them can be made arbitrarily small, based on the universal approximation (Hornik et al., 1989) with general conditions on the activation function and Theorem 1 in Syed and Schapire (2010).

Essentially, we use TRS to select the near-optimal trajectories and then we maximize the probabilities of these near-optimal trajectories to maximize the long-term performance. This procedure can be approximated by maximizing the log-likelihood of near-optimal state-action pairs, which is a supervised learning problem. Before we state our main results, we first introduce the definition of uniformly near-optimal policy (Def 5) and a prerequisite (Asm. 2) specifying the applicability of the results.

[Uniformly Near-Optimal Policy, UNOP] The Uniformly Near-Optimal Policy $\pi_*$ is the policy whose probability distribution over near-optimal trajectories ($\mathcal{T}$) is a uniform distribution, i.e., $p_{\pi_*}(\tau) = \frac{1}{|\mathcal{T}|}, \forall \tau \in \mathcal{T}$, where $|\mathcal{T}|$ is the number of near-optimal trajectories. When we set $c = R_{\max}$, it is an optimal policy in terms of both maximizing return and long-term performance. In the case of $c = R_{\max}$, the corresponding uniform policy is an optimal policy, and we denote this type of optimal policy as the uniformly optimal policy (UOP).

Assumption 2 (Existence of Uniformly Near-Optimal Policy)

We assume the existence of Uniformly Near-Optimal Policy (Def. 5).

Based on Lemma G in Appendix G, Assumption 2 is satisfied for certain MDPs that have deterministic dynamics. Other than Assumption 2, all other assumptions in this work (Assumptions 1,3) can almost always be satisfied in practice, based on empirical observations. With these relatively mild assumptions, we present the following long-term performance theorem, which shows the close connection between supervised learning and RL.

[Long-term Performance Theorem] Maximizing the lower bound of expected long-term performance in Eq (9) is maximizing the log-likelihood of state-action pairs sampled from a uniformly (near)-optimal policy $\pi_*$, which is a supervised learning problem:

$$\arg\max_\theta \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}_s} p_{\pi_*}(s, a) \log \pi_\theta(a \mid s). \tag{10}$$

The optimal policy of maximizing the lower bound is also the optimal policy of maximizing the long-term performance and the return. It is worth noting that Theorem 5 does not require a uniformly near-optimal policy to be deterministic. The only requirement is the existence of a uniformly near-optimal policy.
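A rough sketch of how such a lower bound can arise, assuming $w(\tau) = 1$ on the near-optimal set $\mathcal{T}$ and the usual factorization of the trajectory probability (the indexing of the initial state is an assumption made here); the complete argument is the one in Appendix D.

```latex
\begin{aligned}
\log \sum_{\tau \in \mathcal{T}} p_\theta(\tau)
  &= \log\Big( |\mathcal{T}| \cdot \frac{1}{|\mathcal{T}|} \sum_{\tau \in \mathcal{T}} p_\theta(\tau) \Big)
  \;\geq\; \log |\mathcal{T}| + \frac{1}{|\mathcal{T}|} \sum_{\tau \in \mathcal{T}} \log p_\theta(\tau)
  \qquad \text{(Jensen's inequality)}, \\
\log p_\theta(\tau)
  &= \log p(s_1) + \sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t)
   + \sum_{t=1}^{T} \log p(s_{t+1} \mid s_t, a_t).
\end{aligned}
```

Only the middle sum depends on $\theta$, so maximizing the bound amounts to maximizing the log-likelihood of the state-action pairs along near-optimal trajectories, each trajectory weighted by $1/|\mathcal{T}|$ as under a uniformly near-optimal policy. The environment dynamics drop out of the gradient, which is what permits off-policy learning.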

Maximizing the lower bound of long-term performance is maximizing the lower bound of long-term reward, since we can set $w(\tau) = r(\tau)$ and $\sum_\tau p_\theta(\tau)\, r(\tau) \geq \sum_{\tau \in \mathcal{T}} p_\theta(\tau)\, w(\tau)$. An optimal policy that maximizes this lower bound is also an optimal policy maximizing the long-term performance when $c = R_{\max}$, thus maximizing the return. The proof of Theorem 5 can be found in Appendix D. Theorem 5 indicates that we break the dependency between the current policy $\pi_\theta$ and the environment dynamics, which means off-policy learning can be conducted by the above supervised learning approach. Furthermore, we point out that there is a potential discrepancy between imitating UNOP by maximizing log likelihood (even when the optimal policy's samples are given) and reinforcement learning, since we are maximizing a lower bound of expected long-term performance (or equivalently the return over the near-optimal trajectories only) instead of the return over all trajectories. In practice, the state-action pairs from an optimal policy are hard to construct, while the uniform characteristic of UNOP can alleviate this issue (see Sec 6).

Towards sample-efficient RL, we apply Theorem 5 to RPG, which reduces the ranking policy gradient to a classification problem by Corollary 5.

[Ranking performance policy gradient] Optimizing the lower bound of expected long-term performance (defined in Eq (9)) using the pairwise ranking policy (Eq (2)) is equal to:

$$\max_\theta \sum_{s, a_i} p_{\pi_*}(s, a_i) \Big(\sum_{j=1, j \neq i}^{m} \frac{\lambda_i - \lambda_j}{2}\Big) \;\Rightarrow\; \min_\theta \sum_{s, a_i} p_{\pi_*}(s, a_i) \Big(\sum_{j=1, j \neq i}^{m} \max\big(0,\, 1 + \lambda(s, a_j) - \lambda(s, a_i)\big)\Big). \tag{11}$$

The proof of Corollary 5 can be found in Appendix E. Similarly, we can reduce LPG to a classification problem (see Corollary 5). One advantage of casting RL to SL is variance reduction. With the proposed off-policy supervised learning, we can reduce the upper bound of the policy gradient variance, as shown in Corollary 2.

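[Listwise performance policy gradient] Optimizing the lower bound of expected long-term performance by the listwise ranking policy (Eq (8)) is equivalent to:

$$\max_\theta \sum_{s} p_{\pi_*}(s) \sum_{i=1}^{m} p_{\pi_*}(a_i \mid s) \log \frac{e^{\lambda_i}}{\sum_{j=1}^{m} e^{\lambda_j}}.$$

The proof of this Corollary is a direct application of Theorem 5 by replacing the policy with the softmax function.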
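The two classification views can be sketched as per-sample losses on one (state, action) pair drawn from the near-optimal data. This is a minimal illustration: the margin of 1 in the hinge loss and all function names are assumptions, and the listwise loss is the standard softmax cross-entropy.

```python
import math

def pairwise_hinge_loss(lams, i, margin=1.0):
    """RPG-style loss: penalize every action j whose value is within `margin` of action i."""
    return sum(max(0.0, margin + lams[j] - lams[i])
               for j in range(len(lams)) if j != i)

def listwise_cross_entropy(lams, i):
    """LPG-style loss: negative log softmax probability of action i (numerically stable)."""
    mx = max(lams)
    log_z = mx + math.log(sum(math.exp(x - mx) for x in lams))
    return log_z - lams[i]

lams = [3.0, 1.0, 0.5]  # hypothetical action values for one state
loss_when_best_taken = pairwise_hinge_loss(lams, i=0)
loss_when_other_taken = pairwise_hinge_loss(lams, i=1)
uniform_ce = listwise_cross_entropy([0.0, 0.0, 0.0, 0.0], i=2)
```

The hinge loss is zero once the labeled action leads every other action by the margin, whereas the cross-entropy keeps a nonzero gradient for any finite action values.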

[Policy gradient variance reduction] Given a stationary policy, the upper bound of the variance of each dimension of the policy gradient is $\mathcal{O}(T^2 C^2 R_{\max}^2)$. The upper bound of the gradient variance of maximizing the lower bound of long-term performance Eq (10) is $\mathcal{O}(C^2)$, where $C$ is the maximum norm of the log gradient based on Assumption 3. The supervised learning has reduced the upper bound of gradient variance by an order of $\mathcal{O}(T^2 R_{\max}^2)$ as compared to the regular policy gradient, considering $R_{\max} \geq 1$ and $T \geq 1$, which is a very common situation in practice.

The proof of Corollary 2 can be found in Appendix F. This corollary shows that the variance of the regular policy gradient is upper-bounded by the square of the time horizon and the maximum trajectory reward. It is aligned with our intuition and empirical observation: the longer the horizon, the harder the learning. Also, the common reward shaping tricks such as truncating the reward to $[-1, 1]$ (Castro et al., 2018) can help the learning since they reduce variance by decreasing $R_{\max}$. With supervised learning, we concentrate the difficulty of the long time horizon into the exploration phase, which is an inevitable issue for all RL algorithms, and we drop the dependence on $T$ and $R_{\max}$ for policy variance. Thus, it is more stable and efficient to train the policy using supervised learning. One limitation of this method is that it requires specific domain knowledge. We need to explicitly define the trajectory reward threshold $c$ for different tasks, which is crucial to the final performance and sample-efficiency. In many applications such as dialogue systems (Li et al., 2017), recommender systems (Melville and Sindhwani, 2011), etc., the reward functions are crafted to guide the learning process, and in these scenarios the values of $c$ are naturally known. For the cases where we have no prior knowledge of the reward function of the MDP, we treat $c$ as a tuning parameter to balance optimality and efficiency. The major theoretical uncertainty on general tasks is the existence of a uniformly near-optimal policy; however, its effect on empirical performance is almost negligible in practice, as demonstrated in our experiments. The rigorous theoretical analysis of this problem is beyond the scope of this work.

6 An algorithmic framework for off-policy learning

Based on the discussions in Section 5, we separate the training of an RL agent into two stages: exploration and supervision. The key idea is that although we have no access to the UNOP, we can approximate the state-action pairs that would be sampled by following UNOP by collecting only the near-optimal trajectories.

Exploration Stage. The goal of the exploration stage is to collect different near-optimal trajectories as frequently as possible. Under the off-policy framework, the exploration agent and the learning agent are separated, therefore, any existing RL algorithm can be used during the exploration. In fact, the principle of this framework is that we should use the most advanced RL agents in the exploration to collect as many near-optimal trajectories as possible and leave the policy learning to the supervision stage.

Supervision Stage. The goal of supervision is to learn near-optimal policy by maximizing the log-likelihood of the state-action pairs collected from the above exploration stage. When ranking policies are applied, this supervision stage is a classification problem as shown in Corollary 5 or Corollary 5.

The two-stage algorithmic framework can be directly incorporated in RPG and LPG to improve sample efficiency. The implementation of RPG is given in Algorithm 1, and LPG follows the same procedure except for the difference in the loss function. The main requirements of Alg. 1 are on exploration efficiency and the MDP structure. During the exploration stage, a sufficient number of different near-optimal trajectories need to be collected for constructing a representative supervised learning training dataset. Theoretically, this requirement always holds when a sufficient number of training episodes are conducted [see Appendix G, Lemma G]. One practical concern of the proposed algorithm is that the number of episodes could be prohibitively large, which makes this algorithm sample-inefficient. However, according to our extensive empirical observations, a sufficient number of near-optimal trajectories have almost always been collected long before the value function based state-of-the-art converges to near-optimal performance.

Therefore, we point out that instead of deriving policies from value functions, it can be more sample-efficient to imitate UNOP by supervised learning and use value functions to facilitate the exploration. It is not necessary to explore the suboptimal state-action space or estimate value functions accurately while we already acquired enough samples to learn a policy that can perform (nearly) optimally. In some relatively simple tasks such as Pong, we can even imitate the uniformly optimal policy directly without relying on any advanced exploration algorithm or value function estimations.

Require: the near-optimal trajectory reward threshold, the maximum number of training episodes, the maximum number of time steps in each episode, and the batch size.
while episode < the maximum number of training episodes do
    repeat
        Retrieve the state and sample an action with the specified exploration agent (can be random, ε-greedy, or any RL algorithm).
        Collect the experience and store it in the replay buffer.
        if t % update step == 0 then
            Sample a batch of experience from the near-optimal replay buffer.
            Update the RPG agent based on the hinge loss Eq (11).
            Update the exploration agent using samples from the regular replay buffer (in simple MDPs such as Pong, where near-optimal trajectories are encountered frequently, the near-optimal replay buffer can be used to update the exploration agent).
        end if
    until the state is terminal or the maximum number of time steps is reached
    if return ≥ the near-optimal trajectory reward threshold then
        Take the near-optimal trajectory in the latest episode from the regular replay buffer, and insert the trajectory into the near-optimal replay buffer.
    end if
    if an evaluation is due (every evaluation step) then
        Evaluate the RPG agent by greedily choosing the action. If the best performance is reached, then stop training.
    end if
end while
Algorithm 1 Off-Policy Learning for Ranking Policy Gradient (RPG)
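
The buffer handling in Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `NearOptimalBuffer`, its methods, and the representation of a trajectory as a list of (state, action) pairs are all hypothetical choices made for this example.

```python
import random


class NearOptimalBuffer:
    """Stores state-action pairs from episodes whose return meets a threshold.

    Hypothetical helper for illustration; the names are not from the paper's code.
    """

    def __init__(self, reward_threshold):
        self.reward_threshold = reward_threshold
        self.pairs = []  # flat list of (state, action) training samples

    def maybe_add(self, trajectory, trajectory_return):
        """Insert the episode if its return is no less than the threshold."""
        if trajectory_return >= self.reward_threshold:
            self.pairs.extend(trajectory)
            return True
        return False

    def sample(self, batch_size, rng=random):
        """Draw a batch without replacement (smaller if the buffer is short)."""
        return rng.sample(self.pairs, min(batch_size, len(self.pairs)))


buffer = NearOptimalBuffer(reward_threshold=1.0)
buffer.maybe_add([("s0", 1), ("s1", 0)], trajectory_return=1.0)  # kept
buffer.maybe_add([("s0", 0)], trajectory_return=0.2)             # discarded
batch = buffer.sample(batch_size=32)  # batch for the supervised update
```

Keeping this buffer separate from the regular replay buffer is what lets the exploration agent and the supervised learner be trained on different data.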

7 Sample Complexity and Generalization Performance

In this section, we present a theoretical analysis of the sample complexity of RPG with the off-policy learning framework of Section 6. The analysis leverages results from the Probably Approximately Correct (PAC) framework and provides an alternative approach to quantifying the sample complexity of RL from the perspective of the connection between RL and SL (see Theorem 5), which is significantly different from the existing approaches that use value function estimations (Kakade et al., 2003; Strehl et al., 2006; Kearns et al., 2000; Strehl et al., 2009; Krishnamurthy et al., 2016; Jiang et al., 2017; Jiang and Agarwal, 2018; Zanette and Brunskill, 2019). We show that the sample complexity of RPG (Theorem 7.1) depends on properties of the MDP such as the horizon, action space, and dynamics, and on the generalization performance of supervised learning. It is worth mentioning that the sample complexity of RPG has no linear dependence on the state space, which makes it suitable for large-scale MDPs. Moreover, we also provide a formal quantitative definition (Def 7.2) of the exploration efficiency of RL.

Corresponding to the two-stage framework in Section 6, the sample complexity of RPG also splits into two problems:

  • Learning efficiency: How many state-action pairs from the uniformly optimal policy do we need to collect, in order to achieve good generalization performance in RL?

  • Exploration efficiency: For a certain type of MDP, what is the probability of collecting training samples (state-action pairs from the uniformly near-optimal policy) in the first episodes in the worst case? This question leads to a quantitative evaluation metric for different exploration methods.

The first stage is resolved by Theorem 7.1, which connects the lower bound of the generalization performance of RL to the supervised learning generalization performance. Then we discuss the exploration efficiency of the worst case performance for a binary tree MDP in Lemma 7.2. Jointly, we show how to link the two stages to give a general theorem that studies how many samples we need to collect in order to achieve certain performance in RL.

In this section, we restrict our discussion to MDPs with a fixed action space and assume the existence of a deterministic optimal policy. The policy obtained through learning on the training samples corresponds to the empirical risk minimizer (ERM) in the learning theory literature, and it is selected from a hypothesis class. Each hypothesis (policy) has an associated empirical risk. Without loss of generality, we can normalize the reward function so that the upper bound of the trajectory reward equals one, similar to the assumption in (Jiang and Agarwal, 2018). It is worth noting that the training samples are generated i.i.d. from an unknown distribution, which is perhaps the most important assumption in statistical learning theory. The i.i.d. assumption is satisfied in this case since the state-action pairs (training samples) are collected by filtering the samples during the learning stage, and we can manually manipulate the samples to follow the distribution of UOP (Def 5) by only storing the unique near-optimal trajectories.
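
Storing only unique near-optimal trajectories can be sketched as follows. This is an illustrative helper, not part of the paper's code: the function name `unique_trajectories` is hypothetical, and it assumes that every (state, action) pair is hashable (for image observations, one would first need to map each state to a hashable key).

```python
def unique_trajectories(trajectories):
    """Keep the first occurrence of each trajectory, preserving order."""
    seen = set()
    kept = []
    for trajectory in trajectories:
        key = tuple(trajectory)  # assumes hashable (state, action) pairs
        if key not in seen:
            seen.add(key)
            kept.append(trajectory)
    return kept


collected = [[("s0", 0), ("s1", 1)], [("s0", 0), ("s1", 1)], [("s0", 1)]]
training_set = unique_trajectories(collected)  # the duplicate is dropped
```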

7.1 Supervision stage: Learning efficiency

To simplify the presentation, we restrict our discussion to a finite hypothesis class, since this dependence is not germane to our discussion. However, we note that the theoretical framework in this section is not limited to the finite hypothesis class. For example, we can simply use the VC dimension (Vapnik, 2006) or the Rademacher complexity (Bartlett and Mendelson, 2002) to generalize our discussion to infinite hypothesis classes, such as neural networks. For completeness, we first revisit the sample complexity result from PAC learning in the context of RL.

[Supervised Learning Sample Complexity (Mohri et al., 2018)] Let , and let be fixed, the inequality holds with probability at least , when the training set size satisfies:


where the generalization error (expected risk) of a hypothesis is defined as:
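
As a numerical aid for the lemma above, the sketch below evaluates the textbook PAC bound for a finite hypothesis class in the agnostic (inconsistent) case, as given in Mohri et al. (2018): a training set of size at least (log|H| + log(2/δ)) / (2ε²) suffices for the generalization error to exceed the empirical risk by at most ε with probability at least 1 − δ. Whether the lemma here uses exactly this form is an assumption, since the displayed inequality is not reproduced in this text; the function name `pac_sample_size` is hypothetical.

```python
import math


def pac_sample_size(hypothesis_count, epsilon, delta):
    """Training-set size sufficient for an epsilon gap with prob. >= 1 - delta.

    Uses the standard finite-class bound
        m >= (log|H| + log(2 / delta)) / (2 * epsilon^2),
    which is assumed (not confirmed) to match the lemma in the text.
    """
    numerator = math.log(hypothesis_count) + math.log(2.0 / delta)
    return math.ceil(numerator / (2.0 * epsilon ** 2))


# 1000 hypotheses, gap of at most 0.1, confidence 95%
required = pac_sample_size(hypothesis_count=1000, epsilon=0.1, delta=0.05)
```

The logarithmic dependence on the number of hypotheses and the inverse-square dependence on ε are the two features that carry over to the RL results in this section.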

Condition 1 (Action values)

We restrict the action values of RPG in certain range, i.e., , where is a positive constant.

This condition can be easily satisfied; for example, we can use a sigmoid to cast the action values into the required range. We can impose this constraint because in RPG we only focus on the relative relationship of action values. Given this mild condition, and building on prior work in statistical learning theory, we introduce the following results that connect supervised learning and reinforcement learning.
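
The reason a sigmoid is harmless here is that it is strictly increasing, so it bounds the action values without changing their order. The following sketch illustrates this; the function names and the `bound` parameter (standing in for the positive constant of Condition 1) are assumptions made for this example.

```python
import math


def squash(action_values, bound=1.0):
    """Map raw action values into (0, bound) with a scaled sigmoid.

    Uses the identity sigmoid(v) = 0.5 * (1 + tanh(v / 2)), which avoids
    overflow for large negative inputs.
    """
    return [bound * 0.5 * (1.0 + math.tanh(v / 2.0)) for v in action_values]


def ranking(values):
    """Indices of the actions sorted from highest to lowest value."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)


raw = [2.0, -1.0, 0.5]
bounded = squash(raw, bound=3.0)
same_order = ranking(bounded) == ranking(raw)  # the ranking is unchanged
```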

[Generalization Performance] Given an MDP where the UOP (Def 5) is deterministic, let denote the size of hypothesis space, and be fixed, the following inequality holds with probability at least :

where , denotes the environment dynamics. is the upper bound of supervised learning generalization performance, defined as .

[Sample Complexity] Given an MDP where the UOP (Def 5) is deterministic, let denote the size of hypothesis space, and let be fixed. Then for the following inequality to hold with probability at least :

it suffices that the number of state action pairs (training sample size ) from the uniformly optimal policy satisfies:

The proofs of Theorem 7.1 and Corollary 7.1 are provided in Appendix I. Theorem 7.1 establishes the connection between the generalization performance of RL and the sample complexity of supervised learning. The lower bound of the generalization performance decreases exponentially with respect to the horizon and the action space dimension. This is aligned with our empirical observation that it is more difficult to learn MDPs with a longer horizon and/or a larger action space. Furthermore, the generalization performance has a linear dependence on the transition probability of optimal trajectories. Therefore, the horizon, the action space dimension, and the transition probability of optimal trajectories jointly determine the difficulty of learning the given MDP. As pointed out by Corollary 7.1, the smaller the is, the higher the sample complexity. Note that all three quantities characterize intrinsic properties of MDPs, which cannot be improved by our learning algorithms. One advantage of RPG is that its sample complexity has no dependence on the state space, which enables RPG to solve large-scale complicated MDPs, as demonstrated in our experiments. In the supervision stage, our goal is the same as in traditional supervised learning: to achieve better generalization performance.

7.2 Exploration stage: Exploration efficiency

The exploration efficiency is highly related to the MDP properties and the exploration strategy. To interpret how the MDP properties (state space dimension, action space dimension, horizon) affect the sample complexity through exploration efficiency, we characterize a simplified MDP as in (Sun et al., 2017), in which we explicitly compute the exploration efficiency of a stationary policy (random exploration), as shown in Figure 1.

[Exploration Efficiency] We define the exploration efficiency of a certain exploration algorithm () within a MDP () as the probability of sampling and collecting distinct optimal trajectories in the first episodes. We denote the exploration efficiency as . When , , and optimality threshold are fixed, the higher the , the better the exploration efficiency. We use to denote the number of near-optimal trajectories in this subsection. If the exploration algorithm derives a series of learning policies, then we have , where is the number of steps the algorithm updated the policy. If we would like to study the exploration efficiency of a stationary policy, then we have .

[Expected Exploration Efficiency] The expected exploration efficiency of a certain exploration algorithm () within a MDP () is defined as:

The definitions provide a quantitative metric to evaluate the quality of exploration. Intuitively, the quality of exploration should be determined by how frequently it will hit different good trajectories. We use Def 7.2 for theoretical analysis and Def 7.2 for practical evaluation.

Figure 1: The binary tree structure MDP () with one initial state, similar to the one discussed in (Sun et al., 2017). In this work, we assume that there are no duplicated states in the tree, that the initial state distribution is uniform if the MDP has multiple initial states, and that the environment dynamics are deterministic. In the worst case, the exploration is random exploration, and each trajectory is visited with the same probability under random exploration. Note that Assumption 2 is satisfied in this type of MDP.

[The Worst Case Exploration Efficiency] The Exploration Efficiency of random exploration policy in a binary tree MDP () is given as:

where denotes the total number of different trajectories in the MDP. In binary tree MDP , , where the denotes the number of distinct initial states. denotes the number of optimal trajectories. denotes the random exploration policy, which means the probability of hitting each trajectory in is equal. The proof of Lemma 7.2 is available in Appendix J.
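
The quantity in Lemma 7.2 can be cross-checked numerically. The sketch below is not the lemma's closed form, which is not reproduced in this text. It is a small dynamic program that assumes each episode draws one of the trajectories uniformly at random, and it reads the definition as the probability of collecting at least `needed` distinct optimal trajectories within the first `episodes` episodes. The function and parameter names are hypothetical.

```python
def exploration_efficiency(total, optimal, needed, episodes):
    """P(at least `needed` distinct optimal trajectories in `episodes` draws).

    Assumes uniform random exploration over `total` trajectories, of which
    `optimal` are optimal. probs[j] is the probability that exactly j
    distinct optimal trajectories have been collected so far.
    """
    probs = [0.0] * (optimal + 1)
    probs[0] = 1.0
    for _ in range(episodes):
        updated = [0.0] * (optimal + 1)
        for found, p in enumerate(probs):
            hit_new = (optimal - found) / total  # chance of a new optimal one
            updated[found] += p * (1.0 - hit_new)
            if found < optimal:
                updated[found + 1] += p * hit_new
        probs = updated
    return sum(probs[needed:])


# Eight trajectories, one of them optimal, two episodes: 1 - (7/8)^2
two_episode_chance = exploration_efficiency(total=8, optimal=1, needed=1, episodes=2)
```

Because the number of trajectories in a binary tree grows exponentially with the horizon, this probability shrinks quickly as the tree deepens, which is the worst-case behaviour the lemma describes.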

7.3 Joint Analysis Combining Exploration and Supervision

In this section, we jointly consider the learning efficiency and exploration efficiency to study the generalization performance. Concretely, we would like to answer the following question: if we interact with the environment for a certain number of episodes and apply RPG, what is the worst generalization performance we can expect with a certain probability?

[RL Generalization Performance] Given an MDP where the UOP (Def 5) is deterministic, let be the size of the hypothesis space, and let be fixed, the following inequality holds with probability at least :

where is the number of episodes we have explored in the MDP, is the number of distinct optimal state-action pairs we needed from the UOP (i.e., size of training data.). denotes the number of distinct optimal state-action pairs collected by the random exploration. . The proof of Corollary 7.3 is provided in Appendix K. Corollary 7.3 states that the probability of sampling optimal trajectories is the main bottleneck of exploration and generalization, instead of state space dimension. In general, the optimal exploration strategy depends on the properties of MDPs. In this work, we focus on improving learning efficiency, i.e., learning optimal ranking instead of estimating value functions. The discussion of optimal exploration is beyond the scope of this work.

8 Applications and Empirical Results

Figure 2: The training curves of the proposed RPG, LPG, and the state-of-the-art baselines on eight Atari games. All results are averaged over random seeds from 1 to 5. The x-axis represents the number of training iterations (each iteration consists of 250,000 interactions) and the y-axis represents the averaged training episodic return. The last subfigure plots the expected exploration efficiency of the state-of-the-art baselines; these results are averaged over random seeds from 1 to 10.

In this section, we present our empirical results on sample complexity, comparing RPG (combined with different exploration strategies) and LPG agents with the state-of-the-art baselines. The baselines we evaluated include DQN (Mnih et al., 2015), C51 (Bellemare et al., 2017), Implicit Quantile (Dabney et al., 2018), and Rainbow (Hessel et al., 2017). Implementations of all methods are provided in the Dopamine framework (Castro et al., 2018). The proposed methods, including EPG, LPG, and RPG, are also implemented based on Dopamine (code is available at https://github.com/illidanlab/rpg) with the following adaptations:

EPG: EPG denotes the stochastic listwise policy gradient with off-policy supervised learning, which is the vanilla policy gradient trained with off-policy supervised learning. The exploration and supervision agents are parameterized by the same neural network. The supervision agent minimizes the cross-entropy loss (see Eq (12)) over the near-optimal trajectories collected in an online fashion.


LPG: LPG denotes the deterministic listwise policy gradient with off-policy supervised learning. During evaluation, we choose an action greedily based on the value of the logits, and the agent explores the environment stochastically, as EPG does.

RPG: RPG explores the environment using a separate EPG agent in Pong and Implicit Quantile in other games. Then RPG conducts supervised learning by minimizing the hinge loss Eq (11). It is worth mentioning that the exploration agent (EPG or Implicit Quantile) can be replaced by any other existing exploration method.
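
The difference between the two supervised objectives used by these agents can be illustrated with a short sketch. Eq (11) and Eq (12) are not reproduced in this section, so the code below uses a generic multi-class pairwise hinge (margin ranking) loss and the standard softmax cross-entropy as stand-ins. The function names and the `margin` parameter are assumptions, not the paper's definitions.

```python
import math


def cross_entropy_loss(logits, action):
    """Negative log-probability of `action` under a softmax over `logits`."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_partition = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_partition - logits[action]


def pairwise_hinge_loss(scores, action, margin=1.0):
    """Penalize every other action whose score is within `margin` of `action`."""
    target = scores[action]
    return sum(
        max(0.0, margin + score - target)
        for index, score in enumerate(scores)
        if index != action
    )


scores = [2.0, 0.0, 0.5]
hinge = pairwise_hinge_loss(scores, action=0)   # zero: the margin is satisfied
entropy = cross_entropy_loss(scores, action=0)  # still positive
```

In this stand-in, the hinge loss stops producing gradient once the demonstrated action is ranked first by the margin, whereas cross-entropy keeps pushing the probability toward one. This is one plausible reading of why a ranking loss could suit the imitation of a deterministic policy.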

The testbeds are eight Atari 2600 games (Pong, Breakout, Bowling, BankHeist, DoubleDunk, Pitfall, Boxing, Robotank) from the Arcade Learning Environment (Bellemare et al., 2013) without randomly repeating the previous action. In our experiments, we simply collect all trajectories with a trajectory reward no less than the threshold without eliminating duplicated trajectories, which we found to be an empirically reasonable simplification. As shown by the Pong results in Figure 2, the proposed RPG and LPG achieve higher sample efficiency than the competing baselines and converge to the optimal deterministic policy. We see that RPG converges faster than LPG, which may be because the hinge loss is more sample-efficient than cross-entropy for imitating a deterministic policy.

Comparing EPG with LPG, we notice that since Pong admits a deterministic optimal policy, EPG, which uses a stochastic policy to fit an optimal policy, is not as effective as LPG, let alone RPG. For all games except Pong, we use Implicit Quantile as our exploration agent (denoted as implicit_quantilerpg) and found it to be the most efficient exploration strategy for RPG. As the results in Figure 2 show, RPG configurations with different exploration strategies are more sample-efficient than the state-of-the-art.

Ablation Study: On the Trade-off between Sample-Efficiency and Optimality. Results in Figure 3 show that there is a trade-off between sample efficiency and optimality, which is controlled by the trajectory reward threshold. Recall that this threshold determines how close the learned UNOP is to optimal policies. A higher threshold leads to near-optimal trajectories being collected less frequently and thus to lower sample efficiency; however, the algorithm is then expected to converge to a better-performing strategy. We note that this threshold is the only parameter we tuned across all experiments.

Figure 3: The trade-off between sample efficiency and optimality on DoubleDunk, Breakout, and BankHeist. The three curves refer to the returns of algorithms that use different near-optimal trajectory reward thresholds. The thresholds are denoted by the numbers at the end of the legends.

Ablation Study: Exploration Efficiency. We empirically evaluate the Expected Exploration Efficiency (Def 7.2) of the state-of-the-art on Pong. It is worth noting that the RL generalization performance is determined by both learning efficiency and exploration efficiency. Therefore, higher exploration efficiency does not necessarily lead to a more sample-efficient algorithm because of learning inefficiency, as demonstrated by Rainbow and DQN (see the last subfigure in Figure 2). Also, Implicit Quantile achieves the best performance among the baselines, since its exploration efficiency largely surpasses that of the other baselines.

9 Conclusion and Discussions

In this work, we introduced ranking policy gradient methods that, for the first time, approach the RL problem from a ranking perspective. Furthermore, towards the sample-efficient RL, we propose an off-policy learning framework, which trains RL agents in a supervised learning manner and thus largely facilitates the learning efficiency. The off-policy learning framework separates RL into exploration and supervision stages, and enables the flexibility to integrate a variety of advanced exploration algorithms. Besides, we provide an alternative approach to analyze the sample complexity of RL, and show that the sample complexity of RPG has no dependency on the state space dimension. Last but not least, the RPG with the off-policy learning framework achieves surprising empirical results as compared to the state-of-the-art.


Appendix A Discussion of Existing Efforts on Connecting Reinforcement Learning to Supervised Learning.

There are two main distinctions between supervised learning and reinforcement learning. In supervised learning, the data distribution is static and training samples are assumed to be sampled i.i.d. from it. On the contrary, the data distribution is dynamic in reinforcement learning and the sampling procedure is not independent. First, the data distribution in RL is determined by both the environment dynamics and the learning policy, and the policy keeps being updated during the learning process. This updated policy results in a dynamic data distribution in reinforcement learning. Second, policy learning depends on previously collected samples, which in turn determine the sampling probability of incoming data. Therefore, the training samples we collect are not independently distributed. These intrinsic difficulties of reinforcement learning directly cause the sample-inefficient and unstable performance of current algorithms.

On the other hand, most state-of-the-art reinforcement learning algorithms can be shown to have a supervised learning equivalent. To see this, recall that most reinforcement learning algorithms eventually acquire the policy either explicitly or implicitly, which is a mapping from a state to an action or a probability distribution over the action space. The use of such a mapping implies that ultimately there exists a supervised learning equivalent to the original reinforcement learning problem, if optimal policies exist. The paradox is that it is almost impossible to construct this supervised learning equivalent on the fly, without knowing any optimal policy.

Although the question of how to construct and apply proper supervision is still an open problem in the community, there are many existing efforts providing insightful approaches to reduce reinforcement learning into its supervised learning counterpart over the past several decades. Roughly, we can classify the existing efforts into the following categories:

  • Expectation-Maximization (EM): Dayan and Hinton (1997); Peters and Schaal (2007); Kober and Peters (2009); Abdolmaleki et al. (2018), etc.

  • Entropy-Regularized RL (ERL): Oh et al. (2018); Haarnoja et al. (2018), etc.

  • Interactive Imitation Learning (IIL): Daumé et al. (2009); Syed and Schapire (2010); Ross and Bagnell (2010); Ross et al. (2011); Sun et al. (2017), etc.

The early approaches in the EM track applied Jensen’s inequality and approximation techniques to transform the reinforcement learning objective. Algorithms are then derived from the transformed objective, which resemble the Expectation-Maximization procedure and provide policy improvement guarantee (Dayan and Hinton, 1997). These approaches typically focus on a simplified RL setting, such as assuming that the reward function is not associated with the state (Dayan and Hinton, 1997), or approximating the goal as maximizing the expected immediate reward while assuming that the state distribution is fixed (Peters and Schaal, 2008). Later, Kober and Peters (2009) extended the EM framework from targeting immediate reward to episodic return. Recently, Abdolmaleki et al. (2018) used the EM framework on a relative entropy objective, which adds a parameter prior as regularization. It has been found that the estimation step using Retrace (Munos et al., 2016) can be unstable even with a linear function approximation (Touati et al., 2017). In general, the estimation step in EM-based algorithms involves on-policy evaluation, which is one challenge shared among policy gradient methods. On the other hand, off-policy learning usually leads to much better sample efficiency, which is one main motivation for reformulating RL as a supervised learning task.

On the track of entropy regularization, Soft Actor-Critic (Haarnoja et al., 2018) used the framework of entropy-regularized RL to improve sample efficiency and achieve faster convergence. It was shown to converge to the optimal policy that optimizes the composite objective including the long-term reward and policy entropy. This approach provides a rather efficient way to model suboptimal behavior and leads to empirically sound policies. However, the entropy term in the objective leads to a discrepancy between the entropy-regularized objective and the original long-term reward. On the other hand, Oh et al. (2018) is similar to our work in terms of how the samples are collected, but radically different from the proposed approach in terms of theoretical formulation. Their approach collects good samples based on past experience and then conducts imitation learning w.r.t. those good samples. However, this self-imitation learning procedure was eventually connected to lower-bound soft Q-learning, which belongs to entropy-regularized reinforcement learning. Indeed, there is a trade-off between sample efficiency and modeling suboptimal behavior: a stricter requirement on the samples being collected leads to a lower chance of hitting satisfactory samples, while the resulting policy comes closer to imitating the optimal behavior.

Methods Objective Cont. Action Optimality Off-Policy No Oracle
Table 2: A comparison of studies reducing RL to SL. The Objective column denotes whether the goal is to maximize long-term reward. The Cont. Action column denotes whether the method is applicable to both continuous and discrete action spaces. The Optimality column denotes whether the algorithms can model the optimal policy. ✓ denotes that the optimality achieved by ERL is w.r.t. the entropy-regularized objective instead of the original objective on return. The Off-Policy column denotes whether the algorithms enable off-policy learning. The No Oracle column denotes whether the algorithms need access to a certain type of oracle (expert policy or expert demonstrations).

From the track of interactive imitation learning, early efforts such as (Ross and Bagnell, 2010; Ross et al., 2011) pointed out that the main discrepancy between imitation learning and reinforcement learning is the violation of the i.i.d. assumption. SMILe (Ross and Bagnell, 2010) and DAgger (Ross et al., 2011) were proposed to overcome the distribution mismatch. Theorem 2.1 in Ross and Bagnell (2010) quantified the performance degradation from the expert, considering that the learned policy fails to imitate the expert with a certain probability. The theorem seems to resemble the long-term performance theorem (Thm. 5) in this paper. However, it studied the scenario in which the learning policy is trained through a state distribution induced by the expert, instead of the state-action distribution considered in Theorem 5. As such, Theorem 2.1 in Ross and Bagnell (2010) may be more applicable to situations where an interactive procedure is needed, such as querying the expert during the training process. On the contrary, the proposed work focuses on directly applying supervised learning without having access to an expert to label the data. The optimal state-action pairs are collected during exploration, and conducting supervised learning on the replay buffer provides a performance guarantee in terms of long-term expected reward. Concurrently, a result resembling Theorem 2.1 in (Ross and Bagnell, 2010) is Theorem 1 in (Syed and Schapire, 2010), where the authors reduced apprenticeship learning to classification, under the assumption that the apprentice policy is deterministic and the misclassification rate is bounded at all time steps. In this work, we show that it is possible to circumvent such a strong assumption and reduce RL to its SL counterpart. Furthermore, our theoretical framework also leads to an alternative analysis of sample complexity.

Later on, AggreVaTe (Ross and Bagnell, 2014) was proposed to incorporate the information of action costs to facilitate imitation learning, and its differentiable version AggreVaTeD (Sun et al., 2017) was developed in succession and achieved impressive empirical results. Recently, hinge loss was introduced to regular Q-learning as a pre-training step for learning from demonstration (Hester et al., 2018), or as a surrogate loss for imitating optimal trajectories (Osa et al., 2018). In this work, we show that hinge loss constructs a new type of policy gradient method and can be used to learn an optimal policy directly.

In conclusion, our method approaches the problem of reducing RL to SL from a unique perspective that is different from all prior work. With our reformulation from RL to SL, the samples collected in the replay buffer satisfy the i.i.d. assumption, since the state-action pairs are now sampled from the data distribution of UNOP. A multi-aspect comparison between the proposed method and relevant prior studies is summarized in Table 2.

Appendix B Ranking Policy Gradient Theorem

The Ranking Policy Gradient Theorem (Theorem 4) formulates the optimization of long-term reward using a ranking objective. The proof below illustrates the formulation process.

The following proof is based on direct policy differentiation (Peters and Schaal, 2008; Williams, 1992). For a concise presentation, the subscript for action value , and is omitted.


where the trajectory is a series of state-action pairs from , . From Eq (14) to Eq (15), we use the first-order Taylor expansion of to further simplify the ranking policy gradient.

B.1 Probability Distribution in Ranking Policy Gradient

In this section, we discuss the output property of the pairwise ranking policy. We show in Corollary B.1 that the pairwise ranking policy gives a valid probability distribution when the dimension of the action space . For cases when and the range of -value satisfies Condition 2, we show in Corollary 2 how to construct a valid probability distribution.

The pairwise ranking policy as shown in Eq (2) constructs a probability distribution over the set of actions when the action space is equal to , given any action values . For the cases with , this conclusion does not hold in general. It is easy to verify that , holds and the same conclusion cannot be applied to by constructing counterexamples. However, we can introduce a dummy action to form a probability distribution for RPG. During policy learning, the algorithm increases the probability of best actions and the probability of dummy action decreases. Ideally, if RPG converges to an optimal deterministic policy, the probability of taking best action is equal to 1 and . Similarly, we can introduce a dummy trajectory with the trajectory reward and . The trajectory probability forms a probability distribution since and and . The proof of a valid trajectory probability is similar to the following proof on to be a valid probability distribution with a dummy action. Its practical influence is negligible since our goal is to increase the probability of (near)-optimal trajectories. To present in a clear way, we avoid mentioning dummy trajectory in Proof B while it can be seamlessly included.

Condition 2 (The range of -value)

We restrict the range of -values in RPG so that it satisfies , where and is the dimension of the action space.

This condition can be easily satisfied since in RPG we only focus on the relative relationship of and we can constrain the range of -values so that satisfies the condition 2. Furthermore, since we can see that is decreasing w.r.t to action dimension . The larger the action dimension, the less constraint we have on the action values.

Given Condition 2, we introduce a dummy action and set , which constructs a valid probability distribution over the action space . Since we have and . To prove that this is a valid probability distribution, we only need to show that , i.e. . Let ,

This thus concludes the proof.

Appendix C Condition of Preserving Optimality

The following condition describes what types of MDPs are directly applicable to the trajectory reward shaping (TRS, Def 5):

Condition 3 (Initial States)

The (near)-optimal trajectories will cover all initial states of MDP. i.e. , where .

The MDPs satisfying this condition cover a wide range of tasks such as Dialogue System (Li et al., 2017), Go (Silver et al., 2017), video games (Bellemare et al., 2013) and all MDPs with only one initial state. If we want to preserve the optimality by TRS, the optimal trajectories of an MDP need to cover all initial states or, equivalently, all initial states must lead to at least one optimal trajectory. Similarly, the near-optimality is preserved for all MDPs whose near-optimal trajectories cover all initial states.

Figure 4: The binary tree structure MDP with two initial states (), similar to the one discussed in (Sun et al., 2017). Each path from a root to a leaf node denotes one possible trajectory in the MDP.

Theoretically, it is possible to transfer more general MDPs to satisfy Condition 3 and preserve the optimality with potential-based reward shaping (Ng et al., 1999). More concretely, consider the deterministic binary tree MDP () with the set of initial states as defined in Figure 4. There are eight possible trajectories in . Let . Therefore, this MDP does not satisfy Condition 3. We can compensate the trajectory reward of the best trajectory starting from to the by shaping the reward with the potential-based function and . This reward shaping requires more prior knowledge, which may not be feasible in practice. A more realistic method is to design a dynamic trajectory reward shaping approach. In the beginning, we set . Take as an example, . During the exploration stage, we track the current best trajectory of each initial state and update with its trajectory reward.
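
The dynamic trajectory reward shaping idea described above, tracking the best trajectory seen so far for each initial state, can be sketched as follows. This is a hypothetical illustration rather than the paper's procedure: the class name `DynamicThresholds` is invented for this example, and starting every threshold at negative infinity is an assumption, since the initial value is not legible in this text.

```python
class DynamicThresholds:
    """Tracks a separate trajectory reward threshold for each initial state."""

    def __init__(self):
        # Best return observed so far for each initial state (assumed to
        # start at negative infinity when the state has not been seen).
        self.best_return = {}

    def update(self, initial_state, trajectory_return):
        """Return True if the episode matches or beats the current threshold."""
        previous = self.best_return.get(initial_state, float("-inf"))
        is_near_optimal = trajectory_return >= previous
        if trajectory_return > previous:
            self.best_return[initial_state] = trajectory_return
        return is_near_optimal


thresholds = DynamicThresholds()
first = thresholds.update("s0", 1.0)      # new best for s0
second = thresholds.update("s0", 0.5)     # below the tracked best
third = thresholds.update("s0_alt", 0.2)  # a weaker start state still qualifies
```

With per-state thresholds, a trajectory from an initial state with a low attainable return can still enter the near-optimal buffer, which is the coverage that Condition 3 asks for.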

Nevertheless, if Condition 3 is not satisfied, we need more sophisticated prior knowledge than a predefined trajectory reward threshold to construct the replay buffer (training dataset of UNOP). The practical implementation of trajectory reward shaping and a rigorous theoretical study for general MDPs are beyond the scope of this work.

Appendix D Proof of Long-term Performance Theorem 5

Given a specific trajectory , the log-likelihood of state-action pairs over horizon is equal to the weighted sum over the entire state-action space, i.e.:

where the sum in the right hand side is the summation over all possible state-action pairs. It is worth noting that is not related to any policy parameters. It is the probability of a specific state-action pair in a specific trajectory . Given a trajectory , denote the unique state-action pairs in this trajectory as , where is the number of unique state-action pairs in and . The number of occurrences of a state-action pair in the trajectory is denoted as . Then we have the following:


From Eq (16) to Eq (17) we used the fact:

and therefore we have