SIBRE: Self Improvement Based REwards for Reinforcement Learning

by Somjit Nath, et al.
Tata Consultancy Services

We propose a generic reward shaping approach for improving rate of convergence in reinforcement learning (RL), called Self Improvement Based REwards, or SIBRE. The approach can be used for episodic environments in conjunction with any existing RL algorithm, and consists of rewarding improvement over the agent's own past performance. We show that SIBRE converges under the same conditions as the algorithm whose reward has been modified. The new rewards help discriminate between policies when the original rewards are either weakly discriminated or sparse. Experiments show that in certain environments, this approach speeds up learning and converges to the optimal policy faster. We analyse SIBRE theoretically, and follow it up with tests on several well-known benchmark environments for reinforcement learning.






1 Introduction

Reinforcement learning (RL) is useful for solving sequential decision-making problems in complex environments. Value-based [15, 23], actor-critic and its extensions [19, 20], and Monte-Carlo methods [7] have been shown to match or exceed human performance in games. However, the training effort required for these algorithms tends to be high [14, 21, 18], especially in environments with complex state and action spaces. One reason for slow learning is that in environments with long episodes and sparse rewards, the agent observes only a weak reward signal, leading to slow initial progress.

In this paper, we propose a modification to the reward function (called SIBRE, short for Self Improvement Based REward) that aims to improve the rate of learning in episodic sparse reward environments. SIBRE is a threshold-based reward for RL algorithms, which provides a positive reward when the agent improves on its past performance, and negative reward otherwise. This reward mechanism can be used in conjunction with any standard RL algorithm (value or policy-based) without additional changes.

Motivating applications in literature: Apart from game-based applications, reinforcement learning has been used in operations research problems [26], robotics [6], and networked systems [17]. We expect SIBRE to be helpful in such scenarios, because it addresses challenges such as (i) lack of knowledge of optimal reward level, (ii) variation of optimal reward values across instances within the same domain, and (iii) weak differentiation between rewards from optimal and suboptimal actions. It does so by comparing current performance with the agent’s own history, thus providing a baseline for learning. Similar approaches appear to have worked in literature on container loading [24] and railway scheduling [10] problems, without being formally proposed or analysed. One study on bin packing does propose reward shaping explicitly, and is described below.

Literature on formal reward shaping: The proposed approach (SIBRE) falls under the category of reward shaping approaches for RL. Prior literature has shown that the optimal policy learnt by RL remains invariant under reward shaping if the modification can be expressed as a potential function [16]. Other studies such as [1] have used potential functions for transfer learning. While the concept is valuable, designing a potential function for each problem could be a difficult task. More closely related is [12], where the authors show improved performance on a bin packing task under a ranked reward scheme, in which the agent's reward is based on its recent performance. The reward signal is binary (+1 or -1), and is based on a comparison with the 75th percentile of recently observed rewards. These binary rewards are used as targets for value estimation. While SIBRE is conceptually similar, the key differences are (i) a continuous rather than binary reward, (ii) a mechanism designed to work with any existing RL algorithm, (iii) a theoretical analysis of convergence, and (iv) demonstration of the effect of SIBRE on multiple standard RL algorithms, and on multiple benchmark environments.
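For comparison, the ranked-reward scheme of [12] can be sketched as follows. This is a minimal illustration, not the authors' implementation; the helper names and the nearest-rank percentile method are our own choices:

```python
def percentile(values, p):
    """Nearest-rank percentile of a list of returns (illustrative helper)."""
    ordered = sorted(values)
    k = int(round(p / 100.0 * (len(ordered) - 1)))
    return ordered[k]

def ranked_reward(episode_return, recent_returns, p=75):
    """Binary reward: +1 if the episode beats the p-th percentile of
    recently observed returns, -1 otherwise."""
    return 1.0 if episode_return > percentile(recent_returns, p) else -1.0
```

These binary values would then replace the environment's terminal reward when forming value-estimation targets.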

We consider the key contributions of this work to be: (i) explicitly defining a novel, general, and reusable reward shaping approach for improving the rate of convergence in episodic sparse-reward environments (Section 2), (ii) theoretically establishing its convergence (Section 3), and (iii) experiments on several well-known RL environments (Section 4) that show improvement in the rate of convergence to the optimal policy (including during transfer learning), compared to standard RL algorithms.

2 Description of Methodology

Consider an episodic Markov Decision Process (MDP) specified by the standard tuple ⟨S, A, R, P⟩ [22], where S is the state space, A is the action space, R is the set of possible rewards, and P is the transition function. We assume the existence of a reinforcement learning algorithm for learning the optimal mapping from S to A. We use Q-Learning [25] for ease of explanation in this section, but similar arguments can be developed for policy and actor-critic based algorithms. Typically, the reward structure is a natural consequence of the problem from which the MDP was derived. For example, in the popular environment Gridworld [4], the task is to navigate through a 2-D grid towards a goal state. The most common reward structure in this problem is to provide a small negative step reward for every action that does not end in the goal state, and a large positive terminal reward for reaching the goal state. It follows that the value of the optimal return depends both on the values of the step and terminal rewards, and on the size of the grid. In this paper, we retain the original step rewards r_t for time step t within the episode, but replace the terminal reward for episode n by a baseline-differenced value of the total return G_n:


r'_t = r_t,          if s_t ∉ T
r'_t = G_n − ρ_n,    if s_t ∈ T        (1)

where t is the step within an episode, n is the number of the episode, T is the set of terminal states, G_n is the return for episode n, and ρ_n is the performance threshold at episode n. Note that the return is based on the original reward structure of the MDP. If the original step reward at time t is r_t, then G_n = Σ_t r_t. The net effect of SIBRE is to provide a positive terminal reward if G_n > ρ_n, and a negative one otherwise, which gives the notion of self-improvement. For the purposes of the subsequent proof, we assume that a number K of episodes is run after every threshold update, allowing the q-values to converge with respect to the latest threshold value. Note that K can be a different number from one update to another. Once the q-values have converged, the threshold can be updated using the relation,

ρ ← ρ + β (Ḡ − ρ),

where β is the step size (assumed to be externally defined according to a fixed schedule) and Ḡ is the average return over the episodes since the last threshold update.

Training process: The q-values are trained after every episode n, while the threshold ρ is updated every K episodes, once the q-values have converged. If the initial threshold value is very low, it is fairly easy for the algorithm to achieve the positive terminal reward, and a large proportion of state-action pairs converge to positive q-values. During the next threshold update, the high average returns since the last update result in an increase in the value of ρ. The threshold thus acts as a lagging performance measure of the algorithm over the training history. The algorithm is said to have converged when both the threshold and the q-values converge. A schematic of the procedure is shown in Figure 1. The original rewards r_t only affect the returns G_n, which in turn are used to update the threshold ρ for SIBRE. At the end of each episode, the current return and threshold values are used to compute the new rewards r'_t, which implicitly or explicitly drive the policy π.

Figure 1: Training process using SIBRE.
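The training process above can be sketched as a thin wrapper around any episodic RL loop. This is a minimal sketch with illustrative names and default values, not the authors' exact implementation; following the practical approximation discussed later, the threshold here is updated after every episode:

```python
class SIBRERewards:
    """Self Improvement Based REwards (sketch). Step rewards pass through
    unchanged; the terminal reward is replaced by the baseline-differenced
    return G_n - rho_n, and the threshold rho tracks past performance."""

    def __init__(self, beta=0.1, rho_init=0.0):
        self.beta = beta      # threshold step size
        self.rho = rho_init   # performance threshold rho_n
        self._G = 0.0         # running return of the current episode

    def shape(self, reward, done):
        """Map the environment reward to the SIBRE reward r'_t."""
        self._G += reward
        if not done:
            return reward                   # r'_t = r_t away from terminal states
        terminal = self._G - self.rho       # r'_T = G_n - rho_n
        self.rho += self.beta * (self._G - self.rho)  # rho <- rho + beta*(G - rho)
        self._G = 0.0
        return terminal
```

The shaped reward is then fed to the learner in place of the environment reward; the underlying algorithm (Q-learning, A2C, Rainbow, etc.) is left unchanged.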

3 Proof of Convergence

After every update to the threshold ρ, we assume that the chosen reinforcement learning algorithm is allowed to converge using the modified SIBRE rewards r'. Further, we assume that the chosen algorithm in its default form is known to converge to the optimal policy. For example, in Q-Learning it is known that the q-values converge to the optimal values q* under mild conditions [9]. Let ρ* be the maximum expected return from the start state under the original reward structure. The following theorem shows that the algorithm with SIBRE-defined rewards also converges in expectation to the same return. Note that we do not make any assumptions about the form of the RL algorithm.

Theorem 1: An RL algorithm with known convergence properties, still converges to the optimal return when used in conjunction with SIBRE.

Proof: We prove the claim by considering three cases for the current threshold ρ with respect to the optimal threshold ρ*. The crux of the proof is to show that the expectation of ρ moves towards (or remains at) ρ* in all three cases.

Case 1: ρ < ρ*
Since ρ < ρ*, there exists a policy for which E[G] > ρ, and therefore there exists a policy for which E[r'_T] = E[G] − ρ > 0. If we let the RL algorithm converge,

E[Ḡ] − ρ > 0

{Since the converged policy does at least as well as the policy above, for which E[r'_T] > 0}

E[ρ_new] − ρ = β (E[Ḡ] − ρ)

{Since ρ_new = ρ + β(Ḡ − ρ) by definition}

Combining the two relations above:

E[ρ_new] > ρ,

so the threshold increases in expectation towards ρ*.
Case 2: ρ > ρ*
There exists no policy for which E[G] > ρ, since by definition of ρ*, it is the maximum expected return. If we let the RL algorithm converge,

E[ρ_new] − ρ = β (E[Ḡ] − ρ) < 0

{Since E[Ḡ] ≤ ρ* < ρ and β > 0}

so the threshold decreases in expectation towards ρ*.

Case 3: ρ = ρ*
There exists a policy (an optimal one) for which E[G] = ρ, and therefore for the same policy E[r'_T] = E[G] − ρ = 0. If we let the RL algorithm converge,

E[ρ_new] − ρ = β (E[Ḡ] − ρ) = 0

{Since E[Ḡ] = ρ* = ρ at convergence}

so the threshold remains at ρ* in expectation. Hence, in all three cases, E[ρ] moves towards (or remains at) ρ*, which proves the claim.

Therefore ρ → ρ* in expectation, and the optimal policy for the new reward structure (SIBRE) when ρ = ρ* is an optimal policy for the original MDP, since by definition the optimal policy is one that attains the maximum expected return.

3.1 Notes on the Proof of Theorem 1

  • A step in the direction of optimality for the new reward structure r', given by SIBRE, is also a step in the direction of optimality for the original reward structure r.

  • In order to observe the convergence characteristics described above, we do not necessarily need to let the RL algorithm converge to the final policy after each threshold update. We only need sufficient training between updates to ensure that the average return lands on the correct side of the threshold, i.e.,

sign(E[Ḡ] − ρ) = sign(ρ* − ρ) whenever ρ ≠ ρ*.

  • In practical use, we found that updating the threshold once after every episode also shows similar convergence characteristics. We use this approximation for all the results reported later in this study.
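The threshold dynamics underlying the proof can also be checked numerically. Under the idealised assumption that the learner recovers an optimal policy between updates (so the average return Ḡ equals the maximum expected return ρ*), the update ρ ← ρ + β(Ḡ − ρ) contracts towards ρ*. The values below are arbitrary illustrative choices:

```python
rho_star = 5.0        # maximum expected return rho* (illustrative value)
rho, beta = 0.0, 0.5  # initial threshold and step size (illustrative values)

for _ in range(60):
    G_bar = rho_star             # converged policy attains the optimal return
    rho += beta * (G_bar - rho)  # threshold update

assert abs(rho - rho_star) < 1e-6  # the threshold converges to rho*
```

The error shrinks geometrically by a factor (1 − β) per update, which is the fixed-point behaviour the three cases of the proof establish in expectation.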

3.2 Intuition

Our hypothesis is that the rewards under SIBRE, as defined in (1), help RL algorithms discriminate between good and bad actions more easily. The effect is similar to that of baselines in policy gradient algorithms [5], but can be generically applied to all reinforcement learning algorithms, including value-based methods. The effect is particularly noticeable in sparse reward settings, or in environments where the measurable difference in outcomes (rewards) is small. In both cases, the terminal reward in (1) becomes significant in proportion to the total rewards over the episode. Section 4 provides empirical support for this reasoning.
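A small numerical illustration of this effect, with made-up returns: two policies whose original returns differ by only 1% receive SIBRE terminal rewards of opposite sign once the threshold sits between them.

```python
G_a, G_b = 100.0, 101.0  # returns of two weakly discriminated policies
rho = 100.5              # threshold tracking the agent's recent performance

# The original returns differ by at most 1%:
assert (G_b - G_a) / G_a <= 0.01

# The SIBRE terminal rewards G - rho have opposite signs, so the better
# policy is reinforced while the worse one is penalised:
assert G_a - rho < 0 and G_b - rho > 0
```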

4 Experiments and Results

To validate our approach and evaluate its effectiveness, we broadly use the following two experimental setups.

4.1 Gridworld

To evaluate the effectiveness of SIBRE, we use a variable-sized Gridworld environment [4] with a negative step reward and a positive terminal reward for reaching the goal. We use the following versions of the Gridworld domain:

  • Door & Key environment: This environment has a key that the agent must pick up to unlock a door and then get to the gold. This environment is challenging to solve because of its sparse reward structure.

  • Multi-room environment: This environment has a series of connected rooms with doors that must be opened to reach the next room. The final room contains the gold that the agent must reach. The environment becomes particularly challenging as the number of rooms increases.

We earlier proved the convergence properties of SIBRE using Q-learning for illustration, but we experiment with A2C [11] as well.

Figure 2: (a) SIBRE+A2C vs pure A2C on 6x6 Door & Key Environment. (b) Parameter Sensitivity plots for SIBRE on Doorkey.
Figure 3: (a) SIBRE+A2C vs pure A2C on Multi-room Environment with two rooms. (b) Parameter Sensitivity plots for SIBRE on Multiroom.

4.1.1 Door & Key Environment

We compare the results of pure A2C with those of A2C+SIBRE on grids of dimensions up to 6x6. The initial position of the agent, the position of the gold, and that of the key are all set randomly. An episode terminates when the agent reaches the gold or after 1000 time-steps have elapsed, whichever happens earlier.

We use fixed values for the A2C hyper-parameters and learning rate, and choose the SIBRE threshold step size β from the sensitivity study in Fig. 2 (b). Both algorithms were trained for 1.8 million frames over 30 runs.

There is a penalty for every step and a reward for reaching the gold. There is no intermediate reward for picking up the key or opening the door, which makes the task challenging. We observe that A2C with SIBRE converges for grids with dimensions up to 6x6 (Figure 2 (a)) much faster than plain A2C, and SIBRE also attains a higher average reward.

Figure 2 (b) shows a parameter sensitivity curve with respect to the threshold step size β for this environment. For smaller values of β, the agent does not learn as well. We attribute this to slow learning of the threshold, which we observe is very important for good performance in this domain.

4.1.2 Multi-room Environment

We compare the results of pure A2C with those of A2C+SIBRE on the Multi-room environment. The initial position of the agent and that of the gold are set randomly. An episode terminates when the agent reaches the gold or after 1000 time-steps have elapsed, whichever happens earlier.

We use fixed values for the A2C hyper-parameters and learning rate, and choose the SIBRE threshold step size β from the sensitivity study in Fig. 3 (b). Both algorithms were trained for 1.8 million frames over 30 runs.

As before, there is a penalty for every step and a reward for reaching the gold. There is no intermediate reward for moving from one room to the next, which increases the complexity of the task. We observe that both techniques converge for environments with up to 2 rooms. The learning curves are shown in Figure 3 (a). The performance comparison makes it evident that A2C converges somewhat faster with SIBRE.

Figure 3 (b) shows the β-sensitivity curves. For this environment, SIBRE is considerably more robust with respect to β, with varying values affecting performance much less.

4.2 Gym Environments and Atari

In addition, we tested SIBRE on FrozenLake [3] and three Atari 2600 games: Pong, Freeway, and Venture. Since we used a policy-based method (A2C) for our previous experiments, here we experimented with a value-based method, Rainbow [8], which shows good performance across almost all Atari games. For FrozenLake we used Q-learning.

Figure 4: (a) Q-learning+SIBRE vs Q-learning on Frozenlake (b) Rainbow+SIBRE vs Rainbow on Pong
Figure 5: (a) SIBRE+Rainbow vs Rainbow on Freeway (b) SIBRE+Rainbow vs Rainbow on Venture

4.2.1 FrozenLake

FrozenLake is a stochastic grid-world setting, where the agent needs to traverse a 4x4 grid. Some of the states have slippery ice that is safe to walk on, while others have holes where the episode terminates with no reward. One of the states is a goal that provides a terminal reward of +1. The slippery ice induces stochasticity in the environment: the state the agent ends up in only partially depends on the chosen action.

For this environment, we trained an agent with tabular Q-learning. The agent was trained for 10000 episodes with a limit of 100 steps per episode, after which the episode terminates with a reward of 0. We used a learning rate of 0.01, along with a fixed threshold step size β for SIBRE.

The learning curves of both algorithms, each trained over 100 independent runs, are shown in Figure 4 (a). We used UCB1-style exploration [22]. Plain Q-learning did not work as well: it converged to a much lower value than Q-learning with SIBRE, which attains a higher average reward.
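The UCB1-style action selection used here can be sketched as follows. This is a generic sketch rather than the exact implementation; the exploration coefficient `c` and the function name are our own assumptions:

```python
import math

def ucb1_action(q, counts, t, c=2.0):
    """Pick the action maximising Q(s, a) + c * sqrt(ln t / N(s, a))."""
    # Try every action at least once before applying the UCB formula.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    scores = [q[a] + c * math.sqrt(math.log(t) / counts[a])
              for a in range(len(q))]
    return max(range(len(scores)), key=scores.__getitem__)
```

Actions with few visits receive a large exploration bonus, so the agent keeps sampling them until their value estimates are trustworthy.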

4.2.2 Pong

In Pong, the agent has to defeat the opponent by hitting a ball with a paddle and making the opponent miss it. The agent gets a reward of +1 every time the opponent misses, and a reward of -1 every time the agent misses the ball. The episode terminates when either player reaches 21 points.

For Pong, we used Rainbow without multi-step updates (n=1). The agent was trained for 1 million frames and the results across 5 runs are plotted in Fig. 4 (b). The remaining hyper-parameters are the same as in [8], with a fixed threshold step size β for SIBRE. Across all 5 runs on Pong, Rainbow with SIBRE reaches the optimal policy much faster than Rainbow; even the worst-performing run with SIBRE does better than the best-performing run without it, which shows the value of a threshold-based reward structure for accelerating learning. Asymptotically, however, both reach optimal performance.

4.2.3 Freeway

In Freeway, the agent has to cross a road with traffic without being hit by a car. Every time it crosses the road, the agent gets a reward of +1; it is moved back if it is hit by a car. This is also a relatively sparse reward setting.

To solve this environment, we use Rainbow with the same settings as the original paper [8], and run it for 30 million frames. We plot the mean episode rewards every 125k frames. For Rainbow to work well without SIBRE, we clipped the rewards to [-1, +1], whereas with SIBRE we passed the original rewards through. All remaining hyper-parameters are the same as [8], and we used a fixed threshold step size β for SIBRE.

In this environment, we find that SIBRE accelerates training, as shown in Fig. 5 (a): it converges faster to the optimal policy, with the same asymptotic reward.

4.2.4 Venture

Venture is a maze-navigation game, which is difficult to solve because of the sparsity of rewards. The agent has to navigate through a maze with enemies and escape within a certain time to avoid being killed. The reward depends on the time taken to exit the maze.

Rainbow [8] does quite well in this game, unlike many deep reinforcement learning algorithms that struggle to solve the task. Both algorithms were run for 120 million frames, using the same hyper-parameters as for Freeway. Rainbow used reward clipping to improve performance; SIBRE, however, did not use any reward clipping, and we ran it with a fixed threshold step size β.

In Venture, as well (Fig. 5 (b)), we noticed a speed-up in training and a higher average reward across training.

4.3 Similar reward shaping in practical problems

We found three prior studies in the literature that used reward shaping in practical applications in a way that is similar in concept to SIBRE. [12] showed improved results on a bin-packing task using a ranked-reward scheme, in which a binary reward signal was computed from the agent's own past performance. [10] use a related approach for online computation of railway schedules: comparison with a prior performance threshold defines whether the latest schedule has improved on the benchmark, and this binary signal (success or failure) is used to train a tabular Q-learning algorithm. [24] use improvement with respect to previous objective values as a proxy for total return in a policy gradient framework. While we could not locate any studies that used SIBRE in its precise form, the success of similar versions of reward shaping indicates that SIBRE could be useful for such problems.

4.4 Transfer Learning on DoorKey

SIBRE learns the value of a threshold which it aims to beat after each episode; the threshold encourages the agent to keep doing better. We therefore believe that once the threshold has been learnt properly, we get optimal performance. When the same model is used to learn on a larger state space with the same reward structure, the learnt threshold provides a high initial value to beat, which helps transfer the learning.

In DoorKey, we first train on a 6x6 grid for 0.8 million frames and then transfer to an 8x8 grid, training for a further 2.4 million frames; the plotted results are averaged over 10 runs. All other hyper-parameters are the same as in the previous experiment. From Figure 6 it is evident that SIBRE enables faster transfer as well as good overall performance.

Figure 6: SIBRE+A2C vs A2C on transfer learning from 6x6 to 8x8 grid in DoorKey environment

5 Conclusion and Future Work

The results presented in this work show that RL algorithms empirically perform better, both qualitatively and quantitatively, when coupled with SIBRE on the selected environments. In particular, the ability of the proposed technique to perform well on large instances via transfer learning makes it suitable for solving real-life problems. By averaging the returns obtained over past episodes, SIBRE provides an improvement-based reward function, which allows the agent to constantly improve over its past performance.

In future work, we plan to test the potential advantages of using a threshold-based reward, in terms of stability and performance, for additional RL algorithms (especially recent advances such as DDPG [13], TRPO [19] and PPO [20]). We also want to use the technique along with curriculum/transfer learning [2], especially in partially observed environments where the state and action space sizes remain the same (local) but the environment complexity (scale, observability, stochasticity, non-linearity) is scaled up. Finally, there remains the possibility of extending the threshold-based setup to non-episodic MDPs.


  • [1] B. Badnava and N. Mozayani (2019) A new potential-based reward shaping for reinforcement learning agent. arXiv preprint arXiv:1902.06239. Cited by: §1.
  • [2] Y. Bengio (2009) Curriculum learning. In

    ICML ’09 Proceedings of the 26th Annual International Conference on Machine Learning

    pp. 41–48. Cited by: §5.
  • [3] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI gym. External Links: arXiv:1606.01540 Cited by: §4.2.
  • [4] M. Chevalier-Boisvert, L. Willems, and S. Pal (2018) Minimalistic gridworld environment for openai gym. Cited by: §2, §4.1.
  • [5] E. Greensmith, P. L. Bartlett, and J. Baxter (2004) Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research 5 (Nov), pp. 1471–1530. Cited by: §3.2.
  • [6] S. Gu, E. Holly, T. Lillicrap, and S. Levine (2017) Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE international conference on robotics and automation (ICRA), pp. 3389–3396. Cited by: §1.
  • [7] X. Guo, S. Singh, H. Lee, R. L. Lewis, and X. Wang (2014) Deep learning for real-time atari game play using offline monte-carlo tree search planning. In Advances in neural information processing systems, pp. 3338–3346. Cited by: §1.
  • [8] M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. G. Azar, and D. Silver (2017) Rainbow: combining improvements in deep reinforcement learning. CoRR abs/1710.02298. External Links: Link, 1710.02298 Cited by: §4.2.2, §4.2.3, §4.2.4, §4.2.
  • [9] T. Jaakkola, M. Jordan, and S. Singh (1994-11) On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation 6, pp. 1185–1201. External Links: Document Cited by: §3.
  • [10] H. Khadilkar (2019) A scalable RL algorithm for scheduling railway lines. IEEE Transactions on Intelligent Transportation Systems 20 (2), pp. 727–736. Cited by: §1, §4.3.
  • [11] V. Konda (2002) Actor-critic algorithms. (6). Cited by: §4.1.
  • [12] A. Laterre, Y. Fu, M. K. Jabri, A. Cohen, D. Kas, K. Hajjar, T. S. Dahl, A. Kerkeni, and K. Beguir (2018)

    Ranked reward: enabling self-play reinforcement learning for combinatorial optimization

    arXiv preprint arXiv:1807.01672. Cited by: §1, §4.3.
  • [13] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2016) Continuous control with deep reinforcement learning. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, External Links: Link Cited by: §5.
  • [14] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep RL. In International conference on machine learning, pp. 1928–1937. Cited by: §1.
  • [15] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep RL. Nature 518 (7540), pp. 529. Cited by: §1.
  • [16] A. Y. Ng, D. Harada, and S. Russell (1999) Policy invariance under reward transformations: theory and application to reward shaping. In ICML, Vol. 99, pp. 278–287. Cited by: §1.
  • [17] D. O’Neill, M. Levorato, A. Goldsmith, and U. Mitra (2010) Residential demand response using RL. In IEEE International Conference on Smart Grid Communications, pp. 409–414. Cited by: §1.
  • [18] J. Pachocki, G. Brockman, J. Raiman, S. Zhang, H. Pondé, J. Tang, F. Wolski, C. Dennison, R. Jozefowicz, P. Debiak, et al. (2018) OpenAI five, 2018. URL https://blog.openai.com/openai-five. Cited by: §1.
  • [19] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel (2015) Trust region policy optimization. CoRR abs/1502.05477. External Links: Link, 1502.05477 Cited by: §1, §5.
  • [20] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. CoRR abs/1707.06347. External Links: Link, 1707.06347 Cited by: §1, §5.
  • [21] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. (2017) Mastering the game of go without human knowledge. Nature 550 (7676), pp. 354. Cited by: §1.
  • [22] R. S. Sutton and A. G. Barto (2018) Introduction to reinforcement learning. 2nd edition, MIT Press, Cambridge, MA, USA. External Links: ISBN 9780262039246 Cited by: §2, §4.2.1.
  • [23] H. Van Hasselt, A. Guez, and D. Silver (2016) Deep RL with double q-learning.. In AAAI, Vol. 2, pp. 5. Cited by: §1.
  • [24] R. Verma, S. Saikia, H. Khadilkar, P. Agarwal, G. Shroff, and A. Srinivasan (2019) A reinforcement learning framework for container selection and ship load sequencing in ports. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pp. 2250–2252. Cited by: §1, §4.3.
  • [25] C. J. C. H. Watkins and P. Dayan (1992-05-01) Q-learning. Machine Learning 8 (3), pp. 279–292. External Links: ISSN 1573-0565, Document, Link Cited by: §2.
  • [26] W. Zhang and T. G. Dietterich (1995) An RL approach to job-shop scheduling. In IJCAI, Vol. 95, pp. 1114–1120. Cited by: §1.