The goal of a reinforcement learning (RL) agent is to maximize the reward it can obtain. In the finite horizon setting with horizon T, the typical approach is to maximize the total reward ∑_{t=0}^{T−1} r_t; whereas, in the infinite horizon setting, we may consider either the average reward or the discounted reward ∑_{t=0}^{∞} γ^t r_t. While the average setting is intuitive, in the discounted case, γ controls the planning horizon; specifically, γ close to 1 results in behavior that plans for the long term, whereas when γ is near 0, the policy becomes greedy and myopic.
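As a concrete reference, the three objectives can be computed from a sequence of rewards as follows (a minimal sketch; the helper names are ours, and the symbols r_t, T, and γ follow the definitions above):

```python
import numpy as np

def total_return(rewards):
    """Finite-horizon objective: sum of r_t for t = 0..T-1."""
    return float(np.sum(rewards))

def average_return(rewards):
    """Average-reward objective: (1/T) * sum of r_t."""
    return float(np.mean(rewards))

def discounted_return(rewards, gamma):
    """Discounted objective: sum of gamma^t * r_t."""
    discounts = gamma ** np.arange(len(rewards))
    return float(discounts @ np.asarray(rewards, dtype=float))

# A reward received far in the future contributes little when gamma is
# small, which is why a low discount induces myopic behavior.
```

For instance, four consecutive rewards of 1 yield a total return of 4, an average return of 1, and a discounted return of 1.875 when γ = 0.5.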
This work focuses on the gap between training and evaluation metrics. Although in most settings agents are evaluated on the undiscounted total reward, in practice, due to algorithmic limitations, state-of-the-art methods train the agent on a different objective – the discounted sum (mnih2015human; tessler2017deep; schulman2017proximal; haarnoja2018soft). While a discount sufficiently close to one induces a policy identical to that of the total reward setting (blackwell1962discrete), as the discount factor deviates from one, the solution obtained by solving the γ-discounted problem may become strictly sub-optimal when evaluated on the total reward.
The discounted setting has a very important property – the γ-contraction of the Bellman operator (banach1922operations), which enables RL algorithms to find solutions even when function approximators are used (e.g., neural networks). Although in the classic RL formulation the discount factor is considered one of the MDP parameters, i.e., part of the task the agent must solve, in practice it is often treated as a hyperparameter, due to the stability the γ-discounted setting provides.
Previous work has highlighted the sub-optimality of this training regime, and methods attempting to overcome these issues are commonly referred to as Meta-RL algorithms. The term “Meta” refers to the algorithm adaptively changing the objective to improve performance. For instance, Meta-gradients (xu2018meta) learn the discount γ and λ, a weighting between the n-step return and the bootstrap value, to maximize the return and minimize the prediction error. LIRPG (zheng2018learning) learns an intrinsic reward function that, when combined with the extrinsic reward, guides the policy towards better performance in the γ-discounted task. MeRL (agarwal2019learning) tackles the problem of under-specified and sparse rewards by learning a reward function that is easier to learn from – a dense feedback signal.
In this work, we take a different path. Our goal is to maximize the total reward, a finite-horizon task, while algorithmically solving a γ-discounted objective. To enable such a feat, we propose reward tweaking, a method for learning a surrogate reward function r̃. The surrogate reward is learned such that an agent maximizing r̃ in the γ-discounted setting will be optimal with respect to the original reward in the undiscounted task. Fig. 1 illustrates our proposed framework. We formalize the reward learning task as an optimization problem and analyze it both theoretically and empirically. Specifically, we cast it as finding the max-margin separator between trajectories. As this is a supervised learning problem, learning the surrogate reward is easier than finding the optimal policy. Once such a reward is found, learning the policy becomes more efficient (jiang2015dependence).
We evaluate reward tweaking on both a tabular domain and a high dimensional continuous control task (Hopper-v2 (todorov2012mujoco)). We observe that by applying reward tweaking when solving the γ-discounted objective, the agent is capable of recovering a reward that improves performance on the undiscounted task. Additionally, results on a tabular domain show that reward tweaking converts sparse reward problems into equivalent dense reward signals, enabling a dramatic reduction in learning time.
2 Related Work
Reward Shaping: Reward shaping has been thoroughly researched from multiple directions. As the policy is a by-product of the reward, previous works defined classes of reward transformations that preserve the optimality of the policies. For instance, potential-based reward shaping was proposed by ng1999policy and extended by devlin2012dynamic, who provided a dynamic method for learning such transformations online. While such methods aim to preserve the policy, others aim to shift it towards better exploration (pathak2017curiosity) or attempt to improve the learning process (chentanez2005intrinsically; zheng2018learning). However, while the general theme in reward shaping is that the underlying reward function is unknown, in reward tweaking we have access to the real returns and attempt to tweak the reward such that it enables more efficient learning.
Meta RL: This work can be seen as a form of meta-learning (andrychowicz2016learning; nichol2018first; xu2018meta), as the reward learning module acts as a ‘meta-learner’ aiming to adapt the reward signal, guiding the agent towards better performance on the original task. As opposed to prior work, our method does not aim to find a prior that enables few-shot learning (finn2017model). Rather, it can be seen as an approach similar to xu2018meta, who adapt the discount factor in an attempt to increase performance, or zheng2018learning, who learn an intrinsic reward function that guides the agent towards better policies. While xu2018meta and zheng2018learning solve and optimize for the discounted setting, we take a different approach – we aim to learn a reward function for the discounted scenario that guides the agent towards better behavior on the total reward.
Direct Optimization: While practical RL methods typically solve the wrong task, i.e., the γ-discounted one, there exist methods that can directly optimize the total reward. For instance, evolution strategies (salimans2017evolution) and augmented random search (mania2018simple) perform a finite-difference estimation of the gradient of the total reward. This optimization scheme does not suffer from stability issues and can thus optimize the real objective directly. However, as these methods can be seen as a form of finite differences, they require a complete evaluation of the newly suggested policies at each iteration. This procedure is sample inefficient.
Inverse RL and Trajectory Ranking: In Inverse RL (ng2000algorithms; abbeel2004apprenticeship; syed2008game), the goal is to infer a reward function which explains the behavior of an expert demonstrator; or in other words, a reward function such that the optimal solution results in a similar performance to that of the expert. The goal in Trajectory Ranking (wirth2017survey) is similar; however, instead of expert demonstrations, we are provided with a ranking between trajectories. While previous work (brown2019extrapolating) considered trajectory ranking to perform IRL, it focused on a set of trajectories, possibly strictly sub-optimal, which are provided a priori together with a relative ranking between them. In this work, the trajectories are collected online using the behavior policy, and the ranking is performed by evaluating the total reward provided by the environment.
3 Preliminaries
We consider a Markov Decision Process (puterman1994markov, MDP) defined by the tuple (S, A, r, P), where S is the state space, A the action space, r the reward function and P the transition kernel. The goal in RL is to learn a policy π : S → A. While one may consider the set of stochastic policies, it is well known that there exists an optimal policy that is deterministic. In addition to the policy, we define the value function V^π(s) to be the expected reward-to-go of the policy and the quality function Q^π(s, a) the utility of initially playing action a and then acting based on policy π afterwards.
A common objective is the discounted return, where there is an additional parameter γ ∈ [0, 1), which controls the effective planning horizon. Here, the goal is to solve

π* ∈ argmax_π E_{s_0 ∼ ρ} [ ∑_{t=0}^{∞} γ^t r(s_t, π(s_t)) ],

where ρ is the initial state distribution.
One way of solving the γ-discounted problem is through the value function, by iterating the optimal Bellman operator

(T V)(s) = max_a { r(s, a) + γ E_{s′ ∼ P(·|s,a)} [V(s′)] }.
As the Bellman operator is a γ-contraction (banach1922operations), iterative application is ensured to converge exponentially fast, at rate γ, to the optimal value.
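This geometric convergence can be checked numerically with tabular value iteration. The sketch below is ours, not from the paper; the single absorbing state paying reward 1 forever is an illustrative assumption, chosen so that the fixed point V* = 1/(1 − γ) is known in closed form:

```python
import numpy as np

def value_iteration(P, r, gamma, iters):
    """Repeatedly apply the optimal Bellman operator T*V = max_a [r + gamma * P V].

    P has shape (S, A, S), r has shape (S, A). Since T* is a
    gamma-contraction, the error shrinks by a factor of at least gamma
    per iteration.
    """
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        V = (r + gamma * (P @ V)).max(axis=1)
    return V

# One absorbing state paying reward 1 each step: V* = 1 / (1 - gamma) = 10.
P = np.ones((1, 1, 1))
r = np.ones((1, 1))
V = value_iteration(P, r, gamma=0.9, iters=300)
```

Starting from V = 0, the error after k iterations is exactly γ^k · V*, illustrating the rate-γ convergence claimed above.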
In addition to the discounted task, which considers an infinite horizon, when the horizon T is finite, we may consider the total return ∑_{t=0}^{T−1} r(s_t, a_t).
Our focus is on the scenario where the horizon T is defined a priori (e.g., timeout (bellemare2013arcade)); however, there are additional settings in which the horizon may be a random variable sampled at the start of each episode, or there exists a set of absorbing states, reachable with some terminal probability, in which the agent receives a fixed reward of 0.
The finite horizon setting with fixed horizon T introduces non-stationarity into the learning process, as the value of each state also depends on the “remaining” horizon. Hence, a solution may be to either (i) augment the timestep within the state, such that s̃_t = (s_t, t), or (ii) learn non-stationary value functions and policies {V_t, π_t} such that each V_t and π_t accounts for the remaining horizon T − t.
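Option (i) can be sketched as follows (the helper and the normalization by the horizon are our assumptions; any monotone encoding of the remaining time would serve the same purpose):

```python
import numpy as np

def augment_time(state, t, horizon):
    """Append the normalized remaining time (horizon - t) / horizon to the
    state vector, so a single stationary approximator can represent a
    time-dependent value function."""
    state = np.asarray(state, dtype=float)
    remaining = (horizon - t) / horizon
    return np.concatenate([state, [remaining]])
```

For example, a 2-dimensional state at t = 250 with horizon 1000 becomes a 3-dimensional vector whose last coordinate is 0.75.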
4 Maximizing the Total Reward by Learning with a Discount Factor
Although the total reward is often the objective of interest, due to numerical optimization issues, empirical methods solve an alternative problem – the γ-discounted objective. This re-definition of the task enables deep RL methods to solve complex high dimensional problems. However, in the general case, this may converge to a sub-optimal solution (blackwell1962discrete). We present such an example in Fig. 2, where taking the action left provides one reward and taking right another. For such reward pairs there exists a critical discount γ*: a discount γ > γ* results in the policy going to the right (maximizing the total reward in the process), whereas γ < γ* results in sub-optimal behavior, as the agent will prefer to go to the left.
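The critical discount for such a two-choice MDP can be computed in closed form. The sketch below assumes a stylized version of the example – an immediate reward r_left versus a larger reward r_right received delay steps later; the concrete values are illustrative, not the ones used in Fig. 2:

```python
def critical_discount(r_left, r_right, delay):
    """Solve gamma**delay * r_right = r_left for the discount at which the
    agent is indifferent between the immediate and the delayed reward."""
    return (r_left / r_right) ** (1.0 / delay)

# With r_left = 1, r_right = 2 and a delay of 9 steps, any discount above
# gamma_star makes the agent prefer the delayed (total-reward-optimal)
# choice; any discount below it makes the agent myopic.
gamma_star = critical_discount(1.0, 2.0, delay=9)
```

Because the discounted value of the delayed reward, γ^delay · r_right, is monotonically increasing in γ, the preference flips exactly once, at γ*.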
We define π*_γ as the optimal policy for the γ-discounted task, and denote the value of this policy in the total-reward, undiscounted regime by V^{π*_γ}.
The focus of this work is on learning with infeasible discount factors, which we define as the set of discount factors [0, γ*), where γ* is the minimal discount such that π*_{γ*} is optimal w.r.t. the total reward. Simply put, γ* is the minimal discount factor at which we can still solve the discounted task and remain optimal w.r.t. the total reward. This is motivated by the limitation imposed by empirical algorithms, which typically do not work well for discount factors close to 1. Hence, they may be restricted to operating only with infeasible discount factors.
While previous methods attempted to overcome this issue through algorithmic improvements, e.g., attempting to increase the maximal discount supported by the algorithm, we take a different approach. Our goal is to learn a surrogate reward function r̃ such that for any two trajectories τ_i, τ_j whose total rewards satisfy ∑_t r(s_t^i, a_t^i) > ∑_t r(s_t^j, a_t^j), the discounted surrogate returns satisfy ∑_t γ^t r̃(s_t^i, a_t^i) > ∑_t γ^t r̃(s_t^j, a_t^j).
In layman’s terms, we aim to find a new reward function r̃ for the γ-discounted setting, which will induce an optimal policy on the total reward setting with reward r.
4.1 Existence of the Solution
We begin by proving the existence of a solution, followed by a discussion of its uniqueness.
For any γ ∈ (0, 1] there exists a surrogate reward r̃ such that

argmax_π E [ ∑_{t=0}^{T−1} γ^t r̃_t(s_t, a_t) ] = argmax_π E [ ∑_{t=0}^{T−1} r(s_t, a_t) ],

where r is the original reward and T is the horizon.
We define the optimal value function for the finite-horizon total reward by V*_t, the optimal policy by π*, and the value of the surrogate objective with discount γ by Ṽ_t. Thus, extending the result shown in efroni2018beyond, we may define the surrogate reward as

r̃_t(s, a) = r(s, a) + (1 − γ) E_{s′ ∼ P(·|s,a)} [ V*_{t+1}(s′) ] .   (4)

As we only added a fixed term to the reward, trivially, under the optimal γ-discounted Bellman operator we have that Ṽ*_t = V*_t. Also, this has V* as its single fixed point:

max_a { r̃_t(s, a) + γ E_{s′} [ V*_{t+1}(s′) ] } = max_a { r(s, a) + E_{s′} [ V*_{t+1}(s′) ] } = V*_t(s);
the above holds for any t ∈ {0, …, T − 1}. ∎
Hence, by finding an optimal policy for the surrogate reward function r̃ in the γ-discounted case, we recover a policy that is optimal for the total reward.
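In the tabular case, the construction of Theorem 1 can be implemented directly by backward induction. The sketch below assumes known dynamics and Eq. 4's form of the surrogate reward; array shapes and helper names are our own choices:

```python
import numpy as np

def tweak_reward(P, r, horizon, gamma):
    """Build the non-stationary surrogate reward of Eq. 4:
    r_tilde_t(s, a) = r(s, a) + (1 - gamma) * E[V*_{t+1}(s')],
    with V* the optimal undiscounted finite-horizon value.

    P: (S, A, S) transition kernel; r: (S, A) reward.
    """
    S, A, _ = P.shape
    V = np.zeros((horizon + 1, S))               # V*_T = 0 at the horizon
    for t in range(horizon - 1, -1, -1):
        V[t] = (r + P @ V[t + 1]).max(axis=1)    # undiscounted Bellman backup
    r_tilde = np.stack([r + (1.0 - gamma) * (P @ V[t + 1])
                        for t in range(horizon)])
    return r_tilde, V

def greedy_discounted(P, r_tilde, horizon, gamma):
    """Solve the gamma-discounted finite-horizon task under reward r_tilde."""
    S, A, _ = P.shape
    W = np.zeros((horizon + 1, S))
    pi = np.zeros((horizon, S), dtype=int)
    for t in range(horizon - 1, -1, -1):
        Q = r_tilde[t] + gamma * (P @ W[t + 1])
        W[t] = Q.max(axis=1)
        pi[t] = Q.argmax(axis=1)
    return pi, W
```

On a small MDP where a myopic discount prefers an immediate reward over a larger delayed one, solving the γ-discounted task under r̃ recovers the total-reward-optimal action, and the discounted value under r̃ coincides with the undiscounted V*, as the proof of Theorem 1 predicts.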
It is important to notice that while the common theme in RL is to analyze MDPs with stationary reward functions, in our setting, as seen in Eq. 4, the surrogate reward function is not necessarily stationary. Hence, in the general case, r̃ must be time-dependent.
While we have shown that a solution exists, learning based on Eq. 4 is not much different from directly solving the total reward MDP. It is not clear whether a method that, due to stability issues, is unable to train directly on the total reward, will cope with this specific surrogate reward.
Fortunately, the problem is ill-posed – the solution is not unique – a well-known result from the Inverse RL literature (abbeel2004apprenticeship; syed2008game).
For any MDP, the surrogate reward r̃ for the γ-discounted problem is not unique.
This can be shown trivially by observing that multiplying the rewards by any positive (non-zero) scalar leaves the set of optimal policies unchanged. Meaning that, for any c > 0:

argmax_π E [ ∑_t γ^t c · r̃_t(s_t, a_t) ] = argmax_π E [ ∑_t γ^t r̃_t(s_t, a_t) ].
In this work, we operate in the on-policy regime. The reward tweaker observes trajectories sampled from the behavior policy. Hence, the reward does not need to be optimal across all states, but rather only along the states visited by the optimal policies. For instance, setting γ = 0, a tweaked reward is to set r̃(s, a) = 1 for the action a = π*(s) at states visited by π*, and 0 otherwise, i.e., a bandit problem operating only on the active states.
kearns1998near proposed the Simulation Lemma, which enables analysis of the value error when using an empirically estimated MDP. By extending their results to the finite-horizon γ-discounted case, we show that in addition to improving rates of convergence, the optimal surrogate reward increases the robustness of the MDP to uncertainty in the transition matrix.
If r̃ enables recovering the optimal policy, e.g., Eq. 4, and ||P̂(·|s, a) − P(·|s, a)||_1 ≤ ε for all s, a, where P̂ denotes the empirical probability transition estimates, then the resulting value estimation error is bounded by a term proportional to γ ε.
The proof follows that of the Simulation Lemma, with a slight adaptation for the finite horizon discounted scenario with zero reward error.
where V̂ denotes the value in the empirically estimated MDP. Moreover, in order to recover the bounds for the original problem, plugging in Theorem 1 we retrieve the original value, and thus the same guarantee holds w.r.t. the total reward; in this case, the robustness improves by a factor that shrinks with γ. ∎
As we are analyzing the finite-horizon objective, it is natural to consider a scenario where the uncertainty in the transitions is concentrated at the end of the trajectories. This setting is motivated by practical observations – the agent commonly starts in the same set of states, hence, most of the data it observes is concentrated in those regions leading to larger errors at advanced stages.
The following proposition presents how the bounds improve when the uncertainty is concentrated near the end of the trajectories, as a function of both the discount and the size of the uncertainty region.
We assume the MDP can be factorized, such that all states that can be reached within the first T − H steps are unreachable during the last H steps, and vice versa. We denote the states reachable during the last H steps by S_H.
If r̃ enables recovering the optimal policy and the estimation error is confined to S_H, i.e., ||P̂(·|s, a) − P(·|s, a)||_1 ≤ ε for s ∈ S_H and P̂ = P otherwise, then the value error bound improves by a factor of γ^{T−H}.
Notice that this formulation defines a factorized MDP, in which we can analyze each portion independently.
The sub-optimality of the value estimate for s ∈ S_H can be bounded similarly to Proposition 1, where the effective horizon H is due to the factorization of the MDP. For s ∉ S_H, clearly, as the estimation error is zero for states near the initial position, the overall error in these states is bounded by the error within S_H scaled by γ^{T−H}, where the factor γ^{T−H} is due to the T − H steps it takes to reach the uncertainty region. ∎
4.3 The trade-off
It is well-known that as γ decreases, the sample complexity of the algorithm decreases and the convergence rate increases (petrik2009biasing; strehl2009reinforcement; jiang2015dependence). This raises an immediate question – if we can learn an optimally tweaked reward for any γ ∈ [0, γ_max], where γ_max is the maximal discount the algorithm is capable of learning with, why not focus on where it is easiest to learn the policy, i.e., γ = 0? Obviously, there exists a trade-off between the complexity of learning the policy and that of learning a tweaked reward that leads to the optimal policy.
Consider the chain MDP in Fig. 3, where the goal is to reach the rightmost state as fast as possible. The game ends when either (i) the goal is reached, or (ii) T time-steps have passed. In this case, the trade-off is clear; when γ ≥ γ*, the original reward r̃ = r is a solution – yet finding an optimal policy in this case is a complex task. On the other hand, while for γ = 0 the rewards need to be constructed at each state, and learning the reward is thus a hard task, given these rewards, learning the policy is relatively easy. This shows that there exists a trade-off, and the optimal discount will reside within [0, γ_max].
5 Learning the Surrogate Objective
Previously, we discussed the existence of, and the benefits provided by, a reward function which enables learning with smaller discount factors. In this section, we focus on how to learn the surrogate reward function r̃. When the model is known, a non-stationary variant of κ-Policy Iteration (efroni2018beyond, κ-PI) can be used to learn the reward function. In addition, when the undiscounted task can be solved, the reward can be constructed as shown in Eq. 4.
However, an important component of κ-PI is an inner loop that evaluates the policy using the original discount value. As we consider a scenario where the algorithm fails in the original task, i.e., we can only estimate the value for infeasible values of γ, it will also fail in the evaluation phase. Hence, we opt for an alternative, iterative optimization scheme. With a slight abuse of notation, we define the set of previously seen trajectories D and a sub-trajectory from timestep i to j by τ_{i:j}. Our goal is to find a reward function that satisfies the ranking problem for any τ_i, τ_j ∈ D. Assuming the total rewards satisfy R(τ_i) > R(τ_j), the loss is defined as the logistic ranking loss

ℓ(r̃; τ_i, τ_j) = log ( 1 + exp ( ∑_t γ^t r̃(s_t^j) − ∑_t γ^t r̃(s_t^i) ) ).
For a tabular state space, we may represent a trajectory by Φ(τ) = ∑_t γ^t φ(s_t) and the reward by r̃(s) = w^T φ(s); where φ(s) is a one-hot vector whose elements are all 0 besides the one corresponding to s, and w is a linear mapping.
This is equivalent to the logistic regression, where the goal is to find the max-margin separator between trajectories
Since this is a linear separator and we know that a solution exists, then following the results of soudry2018implicit, we know that gradient descent will indeed converge to the max-margin separator. In addition, lyu2020gradient showed that under mild assumptions, deep homogeneous networks converge to some local margin-maximizing solution.
As we are concerned with finding the margin between all possible trajectories, the optimization task becomes

min_w ∑_{τ_i, τ_j ∈ D : R(τ_i) > R(τ_j)} log ( 1 + exp ( w^T Φ(τ_j) − w^T Φ(τ_i) ) ). (8)
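A minimal version of this trajectory-ranking objective, for the tabular case with a linear reward, can be sketched as follows (the function names are ours, and trajectories are given as lists of state indices):

```python
import numpy as np

def traj_features(states, n_states, gamma):
    """Phi(tau) = sum_t gamma^t * phi(s_t), with phi(s) a one-hot vector."""
    phi = np.zeros(n_states)
    for t, s in enumerate(states):
        phi[s] += gamma ** t
    return phi

def pairwise_ranking_loss(w, tau_hi, tau_lo, n_states, gamma):
    """Logistic loss encouraging the discounted surrogate return of the
    trajectory with the higher TOTAL reward (tau_hi) to exceed that of
    tau_lo; minimizing it drives w towards the max-margin separator."""
    margin = w @ (traj_features(tau_hi, n_states, gamma)
                  - traj_features(tau_lo, n_states, gamma))
    return np.log1p(np.exp(-margin))
```

A reward vector w that ranks the pair correctly yields a positive margin and hence a loss below log 2, the value attained by the uninformative reward w = 0.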
As opposed to brown2019extrapolating, we are not provided with demonstrations and a ranking between them; instead, as shown in Algorithm 1 and Fig. 1, we follow an online scheme. The policy is used to collect trajectories during the exploratory phase, which are provided as input to the reward learner. The reward learner, in turn, provides the reward r̃ to the policy. Assuming a solution to the above loss can be found, this scheme will eventually converge to an optimal policy that maximizes the total reward.
As Eq. 8 aims to maximize the margin between trajectories, this increases the action-gap (farahmand2011action; bellemare2016increasing), potentially enabling more efficient learning in low discount scenarios.
6 Experiments
We focus on two domains: a tabular puddle-world and a robotic control task. While the tabular domain enables analysis of reward tweaking through visualization of the reward as the discount factor changes, the robotic task showcases the ability of reward tweaking to work in high dimensional tasks where function approximation is needed.
For the robotics task, we focus on the Hopper domain, where current methods can find near-optimal solutions. We show that (1) for high discount factors the algorithms fail, even when solving a finite-horizon task with an appropriate non-stationary model, and (2) reward tweaking is capable of guiding the agent towards optimal performance, even when the effective planning horizon is reduced dramatically. Hence, for domains where the algorithm's maximal planning horizon is smaller than the task horizon, reward tweaking can lead to large performance gains.
To understand how reward tweaking behaves, we initially focus on a grid-world, depicted in Fig. 4. Here, the agent starts at the bottom left corner and is required to navigate towards the goal (the wall on the right). Each step, the agent receives a negative reward, whereas the goal, an absorbing state, provides a reward of 0. Clearly, the optimal solution is the stochastic shortest path. However, at the center resides a puddle, where the per-step penalty is smaller in magnitude. The puddle serves as a distractor for the agent, creating a critical discount γ*. While an agent trained with γ > γ* attains the optimal total reward, an agent trained without reward tweaking for γ < γ* prefers to reside in the puddle and obtains a lower total reward.
For analysis of plausible surrogate rewards obtained by reward tweaking, we present the heatmap of the reward in each state, based on Eq. 4. As the reward is non-stationary, we plot the reward at the minimal reaching time, i.e., r̃_{d(s)}(s), where d(s) is the Manhattan distance from the start in the grid. The arrows represent the gradient of the surrogate reward r̃. Notice that when γ > γ*, the default reward signal is sufficient. However, in this domain, when the discount declines below the critical point, the 1-step surrogate reward becomes informative enough for the 1-step greedy estimator to find the optimal policy (represented by the arrows). In other words, reward tweaking with small discount factors leads to denser rewards; this, in turn, results in an easier training phase for the policy.
Here, we test reward tweaking on a complex task, using approximation schemes (deep neural networks). We focus on Hopper (todorov2012mujoco), a finite-horizon (T = 1000) robotic control task where the agent controls the torque applied to each motor. We build upon the TD3 algorithm (fujimoto2018addressing), a deterministic off-policy policy-gradient method (silver2014deterministic; lillicrap2015continuous) that combines two major components – an actor and a critic. The actor maps from state s to action a, i.e., the policy; the critic is trained to predict the expected return. Our computing infrastructure consists of two machines with two GTX 1080 GPUs each. Our implementation is based on the original code provided by fujimoto2018addressing, including their original hyper-parameters. All our results compare 5 randomly sampled seeds, showing the mean and the confidence interval of the mean.
6.2.1 Experimental Details
In this section, we compare reward tweaking with the baseline TD3 model, adapted to the infinite horizon setting. The results are presented in Fig. 5 and Table 1. Our implementations are based on the original code provided by fujimoto2018addressing. In their work, as they considered stationary models, they converted the task into an infinite horizon setting by considering “termination signals due to timeout” as non-terminal. However, they still evaluate the policies on the total reward achieved over T steps.
Although this works well in the MuJoCo control suite, it is not clear if this should work in the general case, and whether or not the algorithm designer has access to this knowledge (that termination occurred due to timeout). Hence, we opt for a finite-horizon non-stationary actor and critic scheme. To achieve this, we concatenate the normalized remaining time to the state.
As we are interested in a finite-horizon task, the reward tweaking experiments are performed on this non-stationary version of TD3. In our experiments, r̃ and the policy are learned in parallel (Fig. 1). The reward tweaker r̃ is represented using 4 fully-connected layers with ReLU activations. Training is performed on trajectories sampled from the replay buffer at fixed intervals of time-steps. Each time, the loss is computed over a batch of sub-trajectories of bounded length. Due to the structure of the environment, timeout can occur at any state. Hence, given a sampled sub-trajectory, we update the normalized remaining time of the sampled states based on the length of the sub-trajectory, i.e., as if timeout occurred in the last sampled state. This enables training r̃ using partial trajectories, while the model is required to extrapolate the rest. Code accompanying this paper is provided in bit.ly/2O1klVu.
Numerical results for the Hopper experiments. For each experiment, we present the average over the best results of the various seeds, the standard deviation, and the best (max) performance across all seeds. For each, we highlight the models that, with high confidence (based on the Avg. and STD), performed best.
We present the results in Fig. 5 and the numerical values in Table 1. For each discount and model, we run a simulation using 5 random seeds. For each seed, at the end of training, we re-evaluate the policy that is believed to be best over additional evaluation episodes. The results we report are the average, STD and max of these values (over the simulations performed).
Baseline analysis: We observe that the baseline behaves best for an intermediate discount. This leads to two important observations. (1) Although the model is fit for a finite-horizon task, and the horizon is indeed finite, it fails when the effective planning horizon grows too large – suggesting that practical algorithms have a maximal feasible discount value, such that for larger values they suffer from performance degradation. (2) As γ decreases, the agent becomes myopic, leading to sub-optimal behavior.
These results motivate the use of reward tweaking. While reward tweaking is incapable of overcoming the first problem, it is designed to cope with the second – the failure due to the decrease in the planning horizon.
Reward Tweaking: Analyzing the results obtained by reward tweaking leads to some important observations. (1) There exist discount values for which the algorithm (TD3) staggers, as the performance deteriorates for the largest discounts. (2) Reward tweaking improves the performance across feasible discount values, i.e., values where the algorithm is capable of learning. (3) As learning the reward becomes harder as γ decreases (Section 4.3), the variance grows as the planning horizon decreases. Finally, (4) focusing on the maximal performance across seeds, we see that while this is a hard task, reward tweaking improves the performance in terms of average and maximal score – and even achieves optimal performance in some seeds.
In this work, we presented reward tweaking, a method for learning the reward function in finite horizon tasks, when the algorithm imposes a limitation on the set of available discount factors. When this set is infeasible, the resulting policies are sub-optimal. Our proposed method enables the learning algorithm to find optimal policies by learning a new reward function r̃. We showed, in the tabular case, that there exists a reward function for the discounted case which leads the policy towards optimal behavior with respect to the total reward. In addition, we proposed an objective for solving this task, which is equivalent to finding the max-margin separator between trajectories.
We evaluated reward tweaking in a tabular setting, where the reward is ensured to be recoverable, and in a high dimensional continuous control task. In the tabular case, visualization enables us to analyze the structure of the learned rewards, suggesting that reward tweaking learns dense reward signals – these are easier for the learning algorithm to learn from. In the Hopper domain, our results show that reward tweaking is capable of outperforming the baseline across most planning horizons. Finally, we observe that while learning a good reward function is hard for short planning horizons, in some seeds, reward tweaking is capable of finding good reward functions. This suggests that further improvement in the stability and parallelism of the reward learning procedure may further benefit this method.
Although reward tweaking has many benefits in terms of performance, the method has a computational cost. Even though learning the surrogate reward does not require additional interactions with the environment (sample complexity), it does require a significant increase in computation. At each iteration, we sample partial trajectories and compute the loss over their entirety – this requires many more computations compared to the standard TD3 scheme. Thus, it is beneficial to use reward tweaking when the performance degradation, due to the use of infeasible discount factors, is large.
We focused on scenarios where the algorithm is incapable of solving for the correct γ. Future work can extend reward tweaking to additional capabilities. For instance, to stabilize learning, many works clip the reward to [−1, 1] (mnih2015human). Reward tweaking can incorporate structural constraints on the surrogate reward, such as residing within a clipped range. Moreover, while our focus was on maximizing the total reward, an interesting question is whether methods such as reward tweaking can be used to guide the policy towards robust/safe policies (Proposition 1).
Finally, while reward tweaking has a computational overhead, it comes with significant benefits. Once an optimal surrogate reward has been found, it can be re-used in future training procedures. As it enables the agent to solve the task using smaller discount factors, given a learned , the policy training procedure is potentially faster.
The authors would like to thank Tom Jurgenson, Esther Derman and Nadav Merlis for the constructive talks and feedback during the work on this paper.