Beyond Exponentially Discounted Sum: Automatic Learning of Return Function

05/28/2019 ∙ by Yufei Wang, et al. ∙ 0

In reinforcement learning, Return, which is the weighted accumulated future rewards, and Value, which is the expected return, serve as the objective that guides the learning of the policy. In classic RL, return is defined as the exponentially discounted sum of future rewards. One key insight is that there could be many feasible ways to define the form of the return function (and thus the value), from which the same optimal policy can be derived, yet these different forms might render dramatically different speeds of learning this policy. In this paper, we research how to modify the form of the return function to enhance the learning towards the optimal policy. We propose to use a general mathematical form for return function, and employ meta-learning to learn the optimal return function in an end-to-end manner. We test our methods on a specially designed maze environment and several Atari games, and our experimental results clearly indicate the advantages of automatically learning optimal return functions in reinforcement learning.




1 Introduction

The objective of reinforcement learning (RL) is to find, through trial and error, a policy that takes the best actions to maximize the accumulated long-term reward. There are two important notions in RL: return, which is the weighted accumulation of future rewards after taking a series of actions, and value, which is the expected value of the return and serves as the objective for the policy function to maximize. In classic RL (sutton1998introduction, ), the return is usually defined as the exponentially discounted sum of future rewards, where the discounting factor balances the importance of immediate and future rewards. The motivation for using such a mathematical form of return comes from economic theory (sutton1998introduction, ; pitis2019rethinking, ). However, from the perspective of machine learning, there could be many different feasible ways to define the form of the return function (and thus the value), from which the same optimal policy can be derived. For example, consider the task of navigating through a maze. Any form of the return function that gives higher values to the correct path leading to the exit would generate the same optimal policy. However, these different forms might render dramatically different speeds of learning this policy. For example, suppose the agent only receives a long-delayed non-zero reward when it reaches the exit. A return function that exponentially discounts this reward backwards along the path might incur very slow learning at the beginning of the correct path, while a return function that propagates this reward back without decay might greatly ease learning at the beginning. Based on this observation, our paper is concerned with the following question: Is there a better form of the return function than the exponentially discounted sum, and can we design an algorithm to automatically learn such a function?

There have been previous works that aim at finding better return/value functions that render more stable and faster policy learning. They can be mainly classified into two categories. The first kind modifies the form of the return function. For example, xu2018meta proposed to use meta-learning to learn the hyperparameters $\gamma$ and $\lambda$ used in TD($\lambda$) methods. This generalizes the case of using fixed values of $\gamma$ and $\lambda$, but the return function is still constrained to the exponentially discounted sum form. arjona2018rudder employed an LSTM model to predict the delayed terminal reward at each time step, and then redistributed the delayed reward to previous time steps according to the prediction error (or a heuristic analysis of the hidden states of the LSTM). However, there is no guarantee that the heuristic redistribution leads to better performance or faster learning. The second kind focuses on modifying the rewards used for return computation, and does not change the form of the return function at all. Examples include reward shaping (ng1999policy, ), reward clipping (mnih2013playing, ), intrinsic reward (chentanez2005intrinsically, ; icarte2018using, ), exploration bonus (bellemare2016unifying, ), and reward from auxiliary tasks (jaderberg2016reinforcement, ). In addition to these hand-crafted manipulations of rewards, a few recent works attempted to learn the manipulations automatically (zheng2018learning, ; xu2018meta, ).

In contrast to the aforementioned works, our work aims at automatically learning the appropriate mathematical form of the return function. Rather than just learning the hyperparameters of the exponentially discounted sum, we propose to use a general mathematical form for the return function, and employ meta-learning to learn the optimal return function in an end-to-end manner. In particular, we argue that the general form of the return function should take into account the information of the whole trajectory instead of merely the current state. This mimics the process of human reasoning, as humans tend to analyze a sequence of moves as a whole. We implement our idea upon modern actor-critic algorithms (babaeizadeh2017ga3c, ; schulman2017proximal, ), and demonstrate its efficiency first in a specially designed Maze environment (mattchantk_2016, ), and then in the high-dimensional Atari games using the Arcade Learning Environment (ALE) (bellemare13arcade, ; machado17arcade, ). The experimental results indicate the advantages of automatically learning optimal return functions in reinforcement learning.

The paper is organized as follows. In Section 2 we introduce related works and background settings. In Section 3, we formalize the proposed idea and describe the meta learning algorithm in detail. In Section 4, the experimental results are presented. The paper concludes in Section 5.

2 Background and Related Work

As mentioned above, there are in general two ways to modify how the return is computed. The first directly modifies the mathematical form of the return function, and the second modifies the reward used in the return computation. Our method falls in the first class, and is orthogonal to the second.

2.1 Modifying The Mathematical Form of Return Function

Many works have studied the discounting factor in the exponentially discounted return function. lattimore2011time proposed to generalize exponential discounting through discount functions that change with the agent's age. Choosing different values of $\gamma$ at different times has been proposed in (franccois2015discount, ; OpenAI_dota, ). xu2018meta treated the hyper-parameters ($\gamma$, $\lambda$) in TD($\lambda$) as learnable parameters that can be optimized by a meta-gradient approach. pitis2019rethinking introduced a flexible state-action discounting factor by characterizing rationality in sequential decision making. barnard1993temporal generalized the work of singh1992scaling to propose a multi-time-scale TD learning model, based on which romoff2019separating ; reinke2017average ; sherstan2018generalizing explored decomposing the original discounting factor into multiple ones. arjona2018rudder proposed to redistribute the delayed return to previous time steps by analyzing the hidden states of an LSTM. All these works suggest that the form of the return function is one of the key elements in RL. Our work generalizes them by proposing to use a general form of return function.

2.2 Reward Augmentation

Another family of works focuses on manipulating the reward function to enhance the learning of the agent. Reward shaping (ng1999policy, ) shows what property a reward modification should possess to preserve the optimal policy, i.e., that it be a potential-based reward. Many works focus on designing intrinsic rewards to help learning, e.g., (oudeyer2009intrinsic, ; schmidhuber2010formal, ; stadie2015incentivizing, ) used hand-engineered intrinsic rewards that greatly accelerate learning. Exploration bonus is another class of works that adds a bonus reward to improve the exploration behaviour of the agent, e.g., bellemare2016unifying ; martin2017count ; tang2017exploration propose to use the pseudo-count of states to derive the bonus reward. (sutton2011horde, ; mirowski2016learning, ; jaderberg2016reinforcement, ) propose to augment the reward by considering information coming from the task itself.

2.3 Relational Reinforcement Learning

The attention mechanism has proven successful at extracting the relationships between words in sentences in natural language processing vaswani2017attention ; bahdanau2014neural , and has recently been combined with relational learning (dvzeroski2001relational, ) to extract the relations between entities in a single state, so as to enhance learning in RL zambaldi2018relational . To encode the trajectory information in the general return function, we also employ the attention mechanism, applied not to different entities in a single state but to different state-action pairs in a trajectory. More details are given in Section 3.3.

3 Methods

3.1 Classic RL Settings

Here we briefly review the classic RL settings, where return and value are defined as the exponentially discounted sum of accumulated future rewards. We also review the policy gradient theorem, upon which our method is built. We assume an episodic setting. At each time step $t$, the agent receives a state $s_t$ from the state space $\mathcal{S}$, takes an action $a_t$ in the action space $\mathcal{A}$, receives a reward $r_t$ from the environment based on the reward function $R(s_t, a_t)$, and then transfers to the next state $s_{t+1}$ according to the transition dynamics $P(s_{t+1} \mid s_t, a_t)$. Given a trajectory $\tau = (s_0, a_0, r_0, \dots, s_T, a_T, r_T)$, the return of $\tau$ is computed in the following exponentially discounted sum form: $G(\tau) = \sum_{t=0}^{T} \gamma^{t} r_t$, $\gamma \in [0, 1]$, where $\gamma$ is the discounting factor. Similarly, the return of a state-action pair $(s_t, a_t)$ is computed as:

$$G(s_t, a_t) = \sum_{l=t}^{T} \gamma^{\,l-t}\, r_l$$
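As a minimal sketch of the exponentially discounted return of each state-action pair, the per-step returns of one episode can be computed with a single backward pass over the rewards. The function name and the example reward sequence are illustrative assumptions, not taken from the paper.

```python
def discounted_returns(rewards, gamma):
    """Per-step return: sum over l >= t of gamma^(l - t) * r_l, for every t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# A delayed reward that only appears at the last step of the episode.
rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_returns(rewards, gamma=0.5))  # [0.125, 0.25, 0.5, 1.0]
```

The further a step is from the delayed reward, the smaller the learning signal it receives, which is the effect the paper sets out to remove.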

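A policy $\pi_\theta$ (usually parameterized by $\theta$, e.g., a neural network) is a probability distribution on the action space $\mathcal{A}$ given a state $s$, written $\pi_\theta(a \mid s)$. We say a trajectory $\tau$ is generated under policy $\pi_\theta$ if all the actions along the trajectory are chosen following $\pi_\theta$, i.e., $\tau \sim \pi_\theta$ means $a_t \sim \pi_\theta(\cdot \mid s_t)$ and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$. Given a policy $\pi_\theta$, the value of a state $s$ is defined as the expected return of all the trajectories when the agent starts at $s$ and then follows $\pi_\theta$: $V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ G(\tau) \mid s_0 = s \right]$. Similarly, the value of a state-action pair $(s, a)$ is defined as the expected return of all trajectories when the agent starts at $s$, takes action $a$, and then follows $\pi_\theta$: $Q^{\pi}(s, a) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ G(\tau) \mid s_0 = s, a_0 = a \right]$.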

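The performance of a policy $\pi_\theta$ can be measured as $J(\pi_\theta) = \mathbb{E}_{s_0 \sim \rho_0}\left[ V^{\pi}(s_0) \right]$, where $\rho_0$ is the initial state distribution. For brevity, we will write $J(\theta)$ for $J(\pi_\theta)$ throughout. The policy gradient theorem (sutton2000policy, ) states:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi}(s_t, a_t) \right] \tag{1}$$

Usually, a baseline value $b(s_t)$ can be subtracted from $Q^{\pi}(s_t, a_t)$ to lower the variance, e.g., in A3C (mnih2016asynchronous, ) the state value $V^{\pi}(s_t)$ is used. In the following, for simplicity of derivation, we assume that $Q^{\pi}(s_t, a_t)$ is computed using a sample return $G(s_t, a_t)$.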
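To make the policy gradient estimate concrete, here is a small sketch for a tabular softmax policy, where each state has its own vector of action logits and the sample return (optionally minus a baseline) weights the gradient of the log-probability. The tabular parameterization and all function names are assumptions chosen for illustration; the paper itself uses neural network policies.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_softmax(logits, action):
    """d log pi(action) / d logits for a softmax policy: one_hot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if b == action else 0.0) - p for b, p in enumerate(probs)]


def policy_gradient_estimate(theta, states, actions, returns, baselines=None):
    """Single-trajectory estimate: sum_t grad log pi(a_t|s_t) * (G_t - b_t).

    theta is a list of per-state logit lists; the result has the same shape.
    """
    grad = [[0.0] * len(row) for row in theta]
    for t, (s, a) in enumerate(zip(states, actions)):
        baseline = baselines[t] if baselines is not None else 0.0
        weight = returns[t] - baseline
        for j, g in enumerate(grad_log_softmax(theta[s], a)):
            grad[s][j] += g * weight
    return grad


theta = [[0.0, 0.0]]  # one state, two actions, uniform policy
print(policy_gradient_estimate(theta, states=[0], actions=[0], returns=[1.0]))
# [[0.5, -0.5]]
```

Subtracting a baseline changes only the weight on each step, not the direction of the log-probability gradient, which is why it can reduce variance without biasing the estimate.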

3.2 Learning a General Form of Return Function

The key of our method is that we do not restrict the functional form of the return to be just the exponentially discounted sum of future rewards. Instead, we compute it using an arbitrary function $g$, which is parameterized by a neural network with parameters $\beta$; we call $g_\beta$ the return generating model. In addition, the function considers the information of the whole trajectory:

$$G^{g}(s_t, a_t) = g_\beta(t;\, s_0, a_0, r_0, \dots, s_T, a_T, r_T) \tag{2}$$

We can then optimize our policy with regard to this new return by substituting it in the policy gradient theorem, which gives the update with step size $\alpha$:

$$\theta' = \theta + \alpha \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G^{g}(s_t, a_t) \tag{3}$$

We now show how to train the return generating model $g_\beta$, so that it renders faster learning when the policy is updated using the new return.

For simplicity of derivation, we take $G^{g}$ in the following particular form:

$$G^{g}(s_t, a_t) = \sum_{l=t}^{T} w_l\, r_l, \qquad (w_0, \dots, w_T) = g_\beta(\tau) \tag{4}$$

i.e., the new return is calculated as a linear combination of future rewards, and the linear coefficients $w_l$ are computed using the return generating model $g_\beta$. For the sake of convergence, we require that $|w_l| \le 1$, as in the case of the exponentially discounted sum form.
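The linear-combination return can be sketched in a few lines, assuming one coefficient per time step that is shared by all earlier steps. The coefficient values below are made up for illustration; in the paper they come from the return generating model.

```python
def linear_combination_returns(rewards, coeffs):
    """New return of step t: sum over l >= t of coeffs[l] * rewards[l]."""
    assert len(rewards) == len(coeffs)
    returns = [0.0] * len(rewards)
    running = 0.0
    # Backward pass: each step adds its own weighted reward to the tail sum.
    for t in reversed(range(len(rewards))):
        running += coeffs[t] * rewards[t]
        returns[t] = running
    return returns


rewards = [0.0, 0.0, 0.0, 1.0]
# Hypothetical coefficients that put all the weight on the final reward.
print(linear_combination_returns(rewards, [0.0, 0.0, 0.0, 1.0]))
# [1.0, 1.0, 1.0, 1.0]
```

With all the weight on the last step, every earlier step sees the delayed reward at full strength, in contrast to the geometric decay of the discounted sum.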

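Using meta-learning, it is straightforward to set the learning objective of the return generating model $g_\beta$ to be the performance $J(\theta')$ of the updated policy $\pi_{\theta'}$, so that $\beta$ can be trained using the chain rule. A similar idea has also been used in xu2018meta and zheng2018learning . Formally, with the particular linear combination form of the new return, where $w_l$ denotes the $l$-th coefficient produced by $g_\beta$, the update in E.q. 3 becomes:

$$\theta' = \theta + \alpha \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{l=t}^{T} w_l\, r_l \tag{5}$$

We want the new return to maximize the effect of such an update, i.e., to maximize the performance $J(\theta')$ of the updated policy $\pi_{\theta'}$. By the chain rule, we have:

$$\frac{\partial J(\theta')}{\partial \beta} = \frac{\partial J(\theta')}{\partial \theta'}\, \frac{\partial \theta'}{\partial \beta} \tag{6}$$

The first term in E.q. 6 can be computed by the policy gradient theorem (E.q. 1), with the original return $G$ and samples $\tau'$ collected under the new policy $\pi_{\theta'}$:

$$\frac{\partial J(\theta')}{\partial \theta'} = \mathbb{E}_{\tau' \sim \pi_{\theta'}}\left[ \sum_{t=0}^{T} \nabla_{\theta'} \log \pi_{\theta'}(a'_t \mid s'_t)\, G(s'_t, a'_t) \right] \tag{7}$$

and the second term in E.q. 6 can be computed using E.q. 5:

$$\frac{\partial \theta'}{\partial \beta} = \alpha \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{l=t}^{T} r_l\, \frac{\partial w_l}{\partial \beta} \tag{8}$$

Therefore, the gradient $\partial J(\theta') / \partial \beta$ can be computed by combining E.q. 7 and E.q. 8, and any first-order method can be used to optimize $\beta$ (e.g., gradient ascent).

In practice, when computing $\partial J(\theta') / \partial \theta'$, we do not sample a new trajectory using the updated policy $\pi_{\theta'}$, as this increases the sample complexity. Instead, we compute it using the current trajectory $\tau \sim \pi_\theta$ with importance sampling:

$$\frac{\partial J(\theta')}{\partial \theta'} \approx \sum_{t=0}^{T} \frac{\nabla_{\theta'} \pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}\, G(s_t, a_t) \tag{9}$$
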
The full procedure of our method is shown in Algorithm 1, and an overview of the relationship between our return generating model and existing policy gradient-based RL algorithms is shown in figure 1.
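The meta-gradient computation described above can be sketched for a tiny tabular softmax policy. This is a simplified illustration under stated assumptions, not the authors' implementation: the per-step coefficients `w` are treated directly as the meta-parameters (no neural network in between), the outer gradient uses the single-trajectory importance-sampling estimate, and all names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_pi(theta, s, a):
    """Gradient of log pi(a|s) w.r.t. the tabular logits (same shape as theta)."""
    grad = [[0.0] * len(row) for row in theta]
    for b, p in enumerate(softmax(theta[s])):
        grad[s][b] = (1.0 if b == a else 0.0) - p
    return grad


def axpy(scale, x, y):
    """Return scale * x + y for nested lists of equal shape."""
    return [[scale * xi + yi for xi, yi in zip(xr, yr)] for xr, yr in zip(x, y)]


def dot(x, y):
    return sum(xi * yi for xr, yr in zip(x, y) for xi, yi in zip(xr, yr))


def meta_gradient(theta, states, actions, rewards, w, alpha, gamma):
    T = len(rewards)
    zeros = [[0.0] * len(row) for row in theta]
    glogs = [grad_log_pi(theta, s, a) for s, a in zip(states, actions)]

    # Inner policy update using the learned return sum_{l>=t} w_l * r_l.
    new_ret, run = [0.0] * T, 0.0
    for t in reversed(range(T)):
        run += w[t] * rewards[t]
        new_ret[t] = run
    theta_new = theta
    for t in range(T):
        theta_new = axpy(alpha * new_ret[t], glogs[t], theta_new)

    # Gradient of the updated policy's performance, estimated on the same
    # trajectory with importance ratios and the original discounted return.
    orig_ret, run = [0.0] * T, 0.0
    for t in reversed(range(T)):
        run = rewards[t] + gamma * run
        orig_ret[t] = run
    grad_j = zeros
    for t, (s, a) in enumerate(zip(states, actions)):
        ratio = softmax(theta_new[s])[a] / softmax(theta[s])[a]
        grad_j = axpy(ratio * orig_ret[t], grad_log_pi(theta_new, s, a), grad_j)

    # Chain rule: d theta'/d w_l = alpha * r_l * sum_{t<=l} grad log pi_t.
    meta_grad, prefix = [], zeros
    for l in range(T):
        prefix = axpy(1.0, glogs[l], prefix)
        meta_grad.append(alpha * rewards[l] * dot(grad_j, prefix))
    return theta_new, meta_grad


theta = [[0.0, 0.0], [0.0, 0.0]]  # two states, two actions
theta_new, mg = meta_gradient(theta, states=[0, 1], actions=[0, 1],
                              rewards=[0.0, 1.0], w=[0.5, 0.5],
                              alpha=0.1, gamma=0.9)
print(mg)  # the coefficient on the zero reward gets zero meta-gradient
```

In this toy run the meta-gradient for the last coefficient is positive, so gradient ascent on the coefficients would move more weight onto the delayed reward. With a neural return generating model, the same chain rule would continue through the network that produces the coefficients.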

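For clarity of illustration we only show our algorithm in the simplest case, i.e., 1) using the linear combination of future rewards as the new form of return function; 2) using vanilla gradient ascent to update $\theta$; and 3) using the sample return to approximate the Q-value $Q^{\pi}(s_t, a_t)$. Our method can be applied much more broadly: 1) $g_\beta$ can take any form as long as it is differentiable with respect to the parameter $\beta$; 2) the update of the policy $\pi_\theta$ can employ any modern optimizer like RMSProp (tieleman2012lecture, ) or Adam (kingma2014adam, ), as long as the update procedure of computing $\theta'$ is fully differentiable with respect to $\beta$ (as in E.q. 8); and 3) our method can be combined with any advanced actor-critic algorithm like A3C (mnih2016asynchronous, ) or PPO (schulman2017proximal, ), as long as the new gradient $\partial J(\theta') / \partial \theta'$ can be effectively evaluated (as in E.q. 7). For example, we can use the clipped surrogate objective proposed in PPO to measure the performance of the updated policy $\pi_{\theta'}$, and E.q. 9 then becomes:

$$\frac{\partial J(\theta')}{\partial \theta'} \approx \nabla_{\theta'} \sum_{t=0}^{T} \min\left( \rho_t(\theta')\, \hat{A}_t,\ \mathrm{clip}\left(\rho_t(\theta'),\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_t \right), \qquad \rho_t(\theta') = \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)} \tag{10}$$

where $\hat{A}_t$ is the advantage estimate computed with the original return and $\epsilon$ is the PPO clipping range.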

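Input: policy network $\pi_\theta$, return generating model $g_\beta$, training iterations $N$, step sizes $\alpha$ and $\alpha'$
for $i = 1$ to $N$ do
      Sample a trajectory $\tau$ with length $T$ using $\pi_\theta$
      Compute the new return $G^{g}$ using $g_\beta$ as in E.q. 4
      Update $\theta$ to $\theta'$ as in E.q. 5
      Approximate $\partial J(\theta') / \partial \theta'$ using E.q. 9 or E.q. 10
      Compute $\partial \theta' / \partial \beta$ using E.q. 8
      Compute $\partial J(\theta') / \partial \beta = \left(\partial J(\theta') / \partial \theta'\right) \left(\partial \theta' / \partial \beta\right)$
      Update $\beta \leftarrow \beta + \alpha'\, \partial J(\theta') / \partial \beta$
Algorithm 1 Learning Return Computation with Meta Learning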

3.3 Discussion

Viewing in Trajectory. In this subsection we present how we design the structure of the return generating model $g_\beta$. Our first design principle is that $g_\beta$ should be able to use the information of the whole trajectory to generate the return. This mimics the process of human reasoning, as we tend to analyze a sequence of moves as a whole. Our second design principle is that the model should excel at analyzing the relationships between different time steps when generating the return, which is again natural to human reasoning. With these two design principles, we employ the Multi-Head Attention module (vaswani2017attention, ) as the main block of our model; specifically, we adopt the encoder part of the Transformer architecture vaswani2017attention . The input to our model is a whole trajectory $\tau = (s_0, a_0, r_0, \dots, s_T, a_T, r_T)$. The model embeds the trajectory into a vector representation of dimension $d$, and then passes it through a stack of layers that consist of multi-head attention and residual feed-forward connections. Finally, the model uses a feed-forward generator to produce the new returns $G^{g}(s_t, a_t)$. In the linear combination case, the generator uses a softmax to produce the normalized linear coefficients $w_l$.
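As a rough sketch of the idea of attending over a trajectory and emitting softmax-normalized coefficients, the following uses a single attention head and a single layer in plain Python. It is a heavily simplified stand-in, not the paper's architecture (which stacks several Transformer encoder layers with multiple heads); the per-step feature layout, parameter names, and random initialization are all assumptions.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over the time steps of X."""
    Q = [matvec(Wq, x) for x in X]
    K = [matvec(Wk, x) for x in X]
    V = [matvec(Wv, x) for x in X]
    scale = math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        attn = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in K])
        out.append([sum(p * v[j] for p, v in zip(attn, V))
                    for j in range(len(V[0]))])
    return out


def return_coefficients(X, params):
    """Map per-step features to one normalized coefficient per time step."""
    H = self_attention(X, params["Wq"], params["Wk"], params["Wv"])
    H = [[h + x for h, x in zip(hr, xr)] for hr, xr in zip(H, X)]  # residual
    scores = [sum(u * h for u, h in zip(params["u"], hr)) for hr in H]
    return softmax(scores)


def random_params(d, seed=0):
    rng = random.Random(seed)

    def mat():
        return [[rng.gauss(0.0, 0.1) for _ in range(d)] for _ in range(d)]

    return {"Wq": mat(), "Wk": mat(), "Wv": mat(),
            "u": [rng.gauss(0.0, 0.1) for _ in range(d)]}


# Hypothetical per-step features (x, y, action, reward) for a 3-step trajectory.
X = [[0.0, 0.0, 1.0, 0.0], [0.0, 1.0, 3.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
coeffs = return_coefficients(X, random_params(d=4))
print(coeffs, sum(coeffs))
```

Because every step attends to every other step, each coefficient can depend on the whole trajectory, and the final softmax keeps the coefficients positive and summing to one.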

Figure 1: Relationship between the Return Generating Model and existing policy gradient-based RL. We can optionally use a new value network to learn the new return generated by the return generating model, as shown by the dashed black box. The dashed red lines and box show how the return generating model can be trained using meta-learning and the chain rule.

From New Return to New Value. The new return function naturally brings about a new value function. Indeed, we can similarly define the value of $s$ and of $(s, a)$ as the expected value of the new returns of trajectories starting from $s$ (taking action $a$), and following policy $\pi_\theta$:

$$V^{g}(s) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ G^{g}(s_0, a_0) \mid s_0 = s \right], \qquad Q^{g}(s, a) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ G^{g}(s_0, a_0) \mid s_0 = s, a_0 = a \right] \tag{11}$$

Here we give only a brief analysis of this new value definition (a deeper analysis is an obvious direction for future work). The new value definition is far more general than the original one, but this generality also loses useful properties, and perhaps the most important one is that the Bellman equation no longer applies. The Bellman equation states that the value of the current state $s_t$ can be computed by bootstrapping from the value of the next state $s_{t+1}$, and the correctness of such bootstrapping relies on the special exponentially discounted sum form of the return function. Similarly, temporal difference learning methods (e.g., Q-learning, SARSA) would usually fail with the new value, as such methods can be viewed as a sample approximation to the Bellman equation. On the other hand, Monte-Carlo methods for evaluating the new value would still work, as they rely only on the Law of Large Numbers.
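The loss of the bootstrapping property can be checked numerically on one trajectory. The sketch below compares the one-step recursion "return at t equals reward at t plus discount times return at t+1" for the discounted return and for a linear-combination return; the reward and coefficient values are made up for illustration.

```python
def discounted_returns(rewards, gamma):
    out, run = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        run = rewards[t] + gamma * run
        out[t] = run
    return out


def linear_combination_returns(rewards, coeffs):
    out, run = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        run += coeffs[t] * rewards[t]
        out[t] = run
    return out


def bellman_residuals(returns, rewards, gamma):
    """|G_t - (r_t + gamma * G_{t+1})| per step, with G after the end taken as 0."""
    residuals = []
    for t in range(len(rewards)):
        nxt = returns[t + 1] if t + 1 < len(returns) else 0.0
        residuals.append(abs(returns[t] - (rewards[t] + gamma * nxt)))
    return residuals


rewards, gamma = [1.0, 2.0, 4.0], 0.5
classic = discounted_returns(rewards, gamma)
learned = linear_combination_returns(rewards, [0.2, 0.3, 0.5])  # hypothetical

print(bellman_residuals(classic, rewards, gamma))  # [0.0, 0.0, 0.0]
print(bellman_residuals(learned, rewards, gamma))  # all clearly non-zero
```

The discounted return satisfies the recursion exactly, so a value function can be learned by bootstrapping. The linear-combination return does not, which is why the text falls back on Monte-Carlo estimation for the new value.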

4 Experiments

4.1 General Setups

In this subsection, we describe the general setup for the following experiments. We implemented our RGM (short for Return Generating Model) upon the A2C (mnih2016asynchronous, ) algorithm. All experiments were run on a machine equipped with an Intel(R) Xeon(R) CPU E5-2690 and four Nvidia Tesla M40 GPUs. In the following, we denote the A2C baseline as vanilla A2C, and our algorithm as A2C + RGM. In both the illustrative case and the Atari experiments, the new form of return function is the linear combination of future rewards, as discussed in Section 3.2 and E.q. 4. The RGM has 4 stacked layers and 4 heads in multi-head attention.

4.2 Illustrative Case

In this section, we illustrate our algorithm using a maze example. As shown in the leftmost panel of figure 2, the maze (mattchantk_2016, ) is a 2-D grid world. The agent starts from the top-left corner, and needs to find its way to the exit at the bottom-right corner. Black lines in the maze represent unbreakable walls. The key feature of the maze is that it has portals, shown as colored grids in figure 2, which transport the agent immediately from one location to another with the same color. We deliberately created four isolated rooms in the maze, so that successfully reaching the exit requires the agent to use these portals to move between rooms. Two possible routes are marked in figure 2.

At each time step, the agent gets a state, which is the coordinate of its current location, chooses a moving direction among up, down, left and right, and receives a per-step reward if it has not reached the exit, and a positive reward when it reaches the exit, which also ends the game. We set the per-step reward to be very small in our experiments. Therefore, the maze renders an environment with a very delayed reward. Besides, from a human perspective, we would pay special attention to the portals, as they enable non-consecutive changes of spatial location.

We compare the vanilla A2C algorithm and our A2C + RGM algorithm. We use separate value and policy networks, both of which are feed-forward MLPs with three hidden layers of 64 neurons and ReLU activations. Figure 3 (a) compares the learning curves of these two algorithms. As an ablation study, we implemented two more versions of our A2C + RGM besides the vanilla one. 1) ‘RGM + target’: since our RGM is always learning (changing) itself, the learning objective it provides to the value network is also changing, which might cause oscillation in the learning of the value network. To alleviate this, similar to DQN (mnih2013playing, ) and DDPG (lillicrap2015continuous, ), we use a target network of the RGM model and optimize the value network towards the target network to stabilize its learning. The parameters of the target network are copied from the learning RGM periodically. 2) ‘RGM without attention’: we replaced all the attention modules with feed-forward layers to test whether the attention module is necessary. The results in figure 3 (a) show that our algorithm learns dramatically faster and better than the vanilla A2C algorithm. They also show that the attention module is necessary for stable learning, and that using a target network does not make a large difference to learning in the maze environment.

To gain a deeper understanding of how the RGM works, we visualized the values of the grids under the vanilla A2C value network and our A2C + RGM value network. The result is shown in figure 2. As expected, the values of grids learned by vanilla A2C decrease exponentially as the distance to the exit increases, and there is no difference in values between the portal grids and ordinary grids. However, the values learned with our RGM are quite different. A heavy part of the return is redistributed to the beginning of the correct path that leads to the exit, and the value almost follows a decreasing trend as it approaches the exit. Also, compared with the vanilla values, the portals in our A2C + RGM tend to have much higher values than the ordinary grids around them. As discussed earlier, both value functions give a correct policy, as they both have higher values along a correct path. However, as indicated by Figure 3 (a), these two value functions differ greatly in learning speed.

To further investigate how the new value is learned, we visualized the linear coefficients $w_l$ generated by our RGM along the correct path, and compared them with the traditional discounted weights $\gamma^{t}$. The result is shown in figure 3 (b). We normalize both sets of coefficients to make them sum up to 1. With the discount factor $\gamma$ used in our experiments, the discounted return gives almost equal favour to each reward encountered along the path. In contrast, our RGM assigns almost all the weight to the delayed reward at the last time step, which is the only positive reward the agent receives, obtained when reaching the exit. As a result, when using the traditional discounted return, the delayed reward (which in our setting is also the only effective learning signal) is decayed by a factor of $\gamma$ at every step before it reaches the initial state, which greatly hurts its propagation. On the other hand, our RGM can propagate this learning signal back to all previous states with nearly zero loss (recall that our RGM computes the return as $G^{g}(s_t, a_t) = \sum_{l=t}^{T} w_l r_l$; if some $w_l$ is near 1, then $r_l$ is fully used when computing $G^{g}(s_t, a_t)$ for all $t \le l$). With this effective back-propagation of the learning signal, our RGM greatly eases the learning of the agent.
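The comparison of normalized weights can be reproduced in spirit with a few lines. The discount factor and path length below are illustrative assumptions, not the values used in the paper's maze experiment.

```python
def normalized_discount_weights(gamma, horizon):
    """Weights gamma^t for t = 0..horizon-1, rescaled to sum to 1."""
    weights = [gamma ** t for t in range(horizon)]
    total = sum(weights)
    return [w / total for w in weights]


def terminal_reward_seen_from_start(gamma, horizon):
    """Factor applied to the last reward inside the discounted return of step 0."""
    return gamma ** (horizon - 1)


gamma, horizon = 0.99, 50  # illustrative values only
weights = normalized_discount_weights(gamma, horizon)

# Spread between the largest and smallest normalized discount weight.
print(max(weights) / min(weights))
# How much of the final reward survives in the return of the first step.
print(terminal_reward_seen_from_start(gamma, horizon))

# A hypothetical learned coefficient vector with all mass on the last step
# passes the final reward back to the first step without any decay.
learned = [0.0] * (horizon - 1) + [1.0]
print(learned[-1])
```

For a discount factor close to 1 the normalized discount weights are spread fairly evenly over the path, whereas a coefficient vector concentrated on the final step keeps the delayed reward intact for every earlier state.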

Figure 2: The values of grids under vanilla A2C (the value is learned towards the traditional discounted sum return) and our A2C + RGM (the value is learned towards the return generated by the RGM). Darker color represents larger value. The value of A2C + RGM derives the correct path labeled by the red + black arrows in the leftmost panel, and the value of vanilla A2C derives another correct path labeled by the blue + black arrows in the leftmost panel.

Figure 3: (a) Learning curves of the compared algorithms. The Y-axis shows the moving average reward of the last 1000 episodes, and the shaded region represents half a standard deviation. (b) Visualization of the RGM coefficients $w_l$ and the discounted weights $\gamma^{t}$ along the correct path, which is labeled by the arrows in the left maze.

4.3 Deep Reinforcement Learning Results on Atari Games

We also tested our A2C + RGM on multiple Atari games from the Arcade Learning Environment (ALE) bellemare2013arcade . We implemented the baseline A2C algorithm in PyTorch with exactly the same network architecture as in mnih2016asynchronous , and trained it using the same hyper-parameters as in the OpenAI implementation dhariwal2017openai , except for the parameter $\epsilon$ used in RMSProp. Due to a slight difference between the implementations of RMSProp in PyTorch and TensorFlow, we could only reproduce results comparable to those in published papers by tuning $\epsilon$ of RMSProp in PyTorch. Our RGM has one extra hyper-parameter, its learning rate $\alpha'$; we detail below how we set it. We do not use a separate target network for the RGM, as it did not bring significant help in the maze experiments.

Figure 4 shows the comparison of our A2C + RGM and the baseline A2C on 6 Atari games: Bowling, Privateeye, Frostbite, Tennis, Mspacman, and SpaceInvaders. The first four games all have very rare and delayed rewards, e.g., in Privateeye it takes the agent 180 time steps to collect the glue to get its first reward, while the last two games have rich immediate rewards. Using the original discounted sum return, learning in the delayed-reward games is much more difficult than in the rich-reward games. As shown in Figure 4, our A2C + RGM brings large benefits in the delayed-reward games by learning a better form of return computation. Meanwhile, it also boosts performance in the rich-reward games, where learning with the original return is already easy. For our RGM, we searched over several values of the learning rate and plotted the best results from the search.

(a) Bowling (b) Privateeye (c) Frostbite
(d) Tennis (e) Mspacman (f) SpaceInvaders
Figure 4: Comparison of Vanilla A2C and A2C + RGM on six Atari games. Each game is trained for 150000 episodes, and each episode is truncated to have a maximum of 200 time steps. The Y-axis is the raw game scores. The solid line is the mean of the raw scores in the last 100 episodes, averaged over eight environment seeds, and the shaded area is half a standard deviation.

5 Conclusion

Return and value serve as the key objectives that guide the learning of the policy. One key insight is that there could be many different ways to define the computational form of the return (and thus the value), from which the same optimal policy can be derived. However, these different forms can lead to dramatic differences in learning speed. In this paper, we propose to use an arbitrary general form for return computation, and design an end-to-end algorithm that learns such a general form by meta-gradient methods to enhance policy learning. We test our method on a specially designed maze environment and several Atari games, and the experimental results show that it effectively learns new forms of return computation that greatly improve learning performance.