1 Introduction
Reinforcement learning (RL) is useful for solving sequential decision-making problems in complex environments. Value-based [15, 23], actor-critic and its extensions [19, 20], and Monte-Carlo methods [7] have been shown to match or exceed human performance in games. However, the training effort required for these algorithms tends to be high [14, 21, 18], especially in environments with complex state and action spaces. One reason for slow learning is that in environments with long episodes and sparse rewards, the agent observes only a weak reward signal, leading to sluggish initial progress.
In this paper, we propose a modification to the reward function (called SIBRE, short for Self Improvement Based REward) that aims to improve the rate of learning in episodic sparse-reward environments. SIBRE is a threshold-based reward for RL algorithms, which provides a positive reward when the agent improves on its past performance, and a negative reward otherwise. This reward mechanism can be used in conjunction with any standard RL algorithm (value- or policy-based) without additional changes.
Motivating applications in literature: Apart from game-based applications, reinforcement learning has been used in operations research problems [26], robotics [6], and networked systems [17]. We expect SIBRE to be helpful in such scenarios, because it addresses challenges such as (i) lack of knowledge of the optimal reward level, (ii) variation of optimal reward values across instances within the same domain, and (iii) weak differentiation between rewards from optimal and suboptimal actions. It does so by comparing current performance with the agent’s own history, thus providing a baseline for learning. Similar approaches appear to have worked in literature on container loading [24] and railway scheduling [10] problems, without being formally proposed or analysed. One study on bin packing does propose reward shaping explicitly, and is described below.
Literature on formal reward shaping: The proposed approach (SIBRE) falls under the category of reward shaping approaches for RL. Prior literature has shown that the optimal policy learnt by RL remains invariant under reward shaping if the modification can be expressed as a potential function [16]. Other studies such as [1] have used potential functions for transfer learning. While the concept is valuable, designing a potential function for each problem could be a difficult task. More closely related is [12], where the authors show improved performance on a bin-packing task under a ranked-reward scheme, in which the agent’s reward is based on its recent performance. The reward signal is binary ($\pm 1$), and is based on a comparison with the 75th percentile of recently observed rewards. These binary rewards are used as targets for value estimation. While SIBRE is conceptually similar, the key differences are (i) a continuous rather than binary reward, (ii) a mechanism designed to work with any existing RL algorithm, (iii) a theoretical analysis of convergence, and (iv) demonstration of the effect of SIBRE on multiple standard RL algorithms, and on multiple benchmark environments.
We consider the key contributions of this work to be: (i) explicitly defining a novel, general, and reusable reward shaping approach for improving the rate of convergence in episodic sparse-reward environments (Section 2), (ii) theoretically establishing its convergence (Section 3), and (iii) experiments on several well-known RL environments (Section 4) that show improvement in the rate of convergence to the optimal policy (including during transfer learning), compared to standard RL algorithms.
2 Description of Methodology
Consider an episodic Markov Decision Process (MDP) specified by the standard tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P} \rangle$ [22], where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{R}$ is the set of possible rewards, and $\mathcal{P}$ is the transition function. We assume the existence of a reinforcement learning algorithm for learning the optimal mapping $\pi^*: \mathcal{S} \to \mathcal{A}$. We use Q-Learning [25] for ease of explanation in this section, but similar arguments can be developed for policy-based and actor-critic algorithms. Typically, the reward structure is a natural consequence of the problem from which the MDP was derived. For example, in the popular environment Gridworld [4], the task is to navigate through a 2D grid towards a goal state. The most common reward structure in this problem is to provide a small negative step reward for every action that does not end in the goal state, and a large positive terminal reward for reaching the goal state. It follows that the optimal return depends both on the values of the step and terminal rewards, and on the size of the grid. In this paper, we retain the original step rewards $r_{t,n}$ for time step $t$ within the episode, but replace the terminal reward for episode $n$ by a baseline-differenced value of the total return $G_n$:
$$r'_{t,n} = \begin{cases} r_{t,n}, & s_{t+1} \notin \mathcal{T} \\ G_n - \rho_n, & s_{t+1} \in \mathcal{T} \end{cases} \qquad (1)$$
where $t$ is the step within an episode, $n$ is the number of the episode, $\mathcal{T}$ is the set of terminal states, $G_n$ is the return for episode $n$, and $\rho_n$ is the performance threshold at episode $n$. Note that the return is based on the original reward structure of the MDP. If the original step reward at $t$ is $r_{t,n}$, then $G_n = \sum_t r_{t,n}$. The net effect of SIBRE is to provide a positive terminal reward if $G_n > \rho_n$, and a negative one otherwise, which gives the notion of self-improvement. For the purposes of the subsequent proof, we assume that a number of episodes $K_i$ is run after every threshold update $i$, allowing the Q-values to converge with respect to the latest threshold value. Note that $K_i$ can be a different number from one update to another. Once the Q-values have converged, the threshold can be updated using the relation
$$\rho \leftarrow \rho + \beta \left( \bar{G} - \rho \right),$$
where $\bar{G}$ is the average return over the episodes since the last update, and $\beta \in (0,1)$ is the step size, assumed to be externally defined according to a fixed schedule.
Training process: The Q-values are trained after every episode $n$, while the threshold $\rho$ is updated after a batch of episodes, once the Q-values have converged. If the initial threshold value is very low, it is fairly easy for the algorithm to achieve the positive terminal reward, and a large proportion of state-action pairs converge to positive Q-values. During the next threshold update, the high average returns since the last update result in an increase in the value of $\rho$. The threshold thus acts as a lagging performance measure of the algorithm over the training history. The algorithm is said to have converged when both the threshold and the Q-values converge. A schematic of the procedure is shown in Figure 1. The original rewards only affect the returns $G_n$, which in turn are used to update the threshold $\rho$ for SIBRE. At the end of each episode, the current return and threshold values are used to compute the new rewards $r'_{t,n}$, which implicitly or explicitly drive the policy $\pi$.
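The two SIBRE operations described above can be sketched in a few lines. This is a minimal illustration of Eq. (1) and the threshold update, not the authors' implementation; the function and argument names are ours:

```python
def sibre_reward(r_t, done, episode_return, rho):
    # Eq. (1): keep the original step reward at non-terminal steps; at a
    # terminal state, replace the terminal reward with the baseline-differenced
    # return G_n - rho_n. episode_return (G_n) is computed under the ORIGINAL
    # reward structure of the MDP.
    return (episode_return - rho) if done else r_t

def update_threshold(rho, avg_return, beta):
    # Move the threshold a step of size beta towards the average return
    # observed since the last update; rho acts as a lagging performance
    # measure over the training history.
    return rho + beta * (avg_return - rho)

# Positive terminal reward when the agent beats its own past performance:
print(sibre_reward(r_t=0.0, done=True, episode_return=5.0, rho=2.0))   # 3.0
# Step rewards pass through unchanged:
print(sibre_reward(r_t=-0.1, done=False, episode_return=5.0, rho=2.0)) # -0.1
# The threshold then rises towards the recent average return:
print(update_threshold(rho=2.0, avg_return=5.0, beta=0.5))             # 3.5
```

Because the threshold tracks past performance, a return that beats it yields a positive terminal reward and a return that falls short yields a negative one, regardless of the absolute reward scale of the environment.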
3 Proof of Convergence
After every update to the threshold $\rho$, we assume that the chosen reinforcement learning algorithm is allowed to converge using the modified SIBRE rewards $r'$. Further, we assume that the chosen algorithm in its default form is known to converge to the optimal policy. For example, in Q-Learning it is known that the Q-values converge to $Q^*$ under mild conditions [9]. Let $G^*$ be the maximum expected return from the start state under the original reward structure. The following theorem shows that the algorithm with SIBRE-defined rewards also converges in expectation to the same return. Note that we do not make any assumptions about the form of the RL algorithm.
Theorem 1: An RL algorithm with known convergence properties still converges to the optimal expected return when used in conjunction with SIBRE.
Proof: We prove the claim by considering three cases for the current threshold $\rho_n$ with respect to the optimal value $G^*$. The crux of the proof is to show that the expectation of $\rho_{n+1}$ moves towards $G^*$ (or remains at $G^*$) in all three cases.
Case 1: $\rho_n < G^*$
There exists a policy for which $\mathbb{E}[G_n] > \rho_n$, therefore there exists a policy for which $\mathbb{E}[G_n - \rho_n] > 0$.
If we let the RL algorithm converge, $\mathbb{E}[G_n] > \rho_n$.
$$\mathbb{E}[\rho_{n+1}] = \rho_n + \beta\,(\mathbb{E}[G_n] - \rho_n) > \rho_n \qquad (2)$$
{Since $\mathbb{E}[G_n] > \rho_n$ and $\beta > 0$}
$$\mathbb{E}[\rho_{n+1}] = (1-\beta)\,\rho_n + \beta\,\mathbb{E}[G_n] \le G^* \qquad (3)$$
{Since $\rho_n < G^*$ and $\mathbb{E}[G_n] \le G^*$ by definition}
From (2) and (3):
$$\rho_n < \mathbb{E}[\rho_{n+1}] \le G^*, \qquad (4)$$
so the threshold increases in expectation towards $G^*$. ∎
Case 2: $\rho_n = G^*$
There exists no policy for which $\mathbb{E}[G_n] > \rho_n$, since by definition of $G^*$, it is the maximum expected return.
If we let the RL algorithm converge, $\mathbb{E}[G_n] = G^* = \rho_n$.
$$\mathbb{E}[\rho_{n+1}] = \rho_n + \beta\,(\mathbb{E}[G_n] - \rho_n) = \rho_n = G^* \qquad (5)$$
{Since $\mathbb{E}[G_n] = \rho_n$ and $\mathbb{E}[G_n] \le G^*$}
so the threshold remains at $G^*$ in expectation. ∎
Case 3: $\rho_n > G^*$
There exists a policy for which $\mathbb{E}[G_n] = G^*$, therefore for the same policy $\mathbb{E}[G_n - \rho_n] < 0$.
If we let the RL algorithm converge, $\mathbb{E}[G_n] = G^* < \rho_n$.
$$G^* \le \mathbb{E}[\rho_{n+1}] = \rho_n + \beta\,(\mathbb{E}[G_n] - \rho_n) < \rho_n \qquad (6)$$
{Since $\mathbb{E}[G_n] = G^* < \rho_n$}
so the threshold decreases in expectation towards $G^*$. ∎
Hence, in all three cases, $\mathbb{E}[\rho_{n+1}]$ moves towards (or remains at) $G^*$. Therefore $\rho_n \to G^*$ in expectation, and the optimal policy under the new reward structure (SIBRE), when $\rho_n = G^*$, is an optimal policy for the original MDP, since by definition the optimal policy is one that attains the maximum expected return.
3.1 Notes on the Proof of Theorem 1

A step in the direction of optimality for the new reward structure $r'$, given by SIBRE, is also a step in the direction of optimality for the original reward structure $r$.

In order to observe the convergence characteristics as described above, we do not necessarily need to let the RL algorithm converge to the final policy after each threshold update. We only need sufficient training to ensure:
$\mathbb{E}[G_n] > \rho_n$ for $\rho_n < G^*$
$\mathbb{E}[G_n] = \rho_n$ for $\rho_n = G^*$
$\mathbb{E}[G_n] < \rho_n$ for $\rho_n > G^*$

In practical use, we found that updating the threshold once after every episode (rather than waiting for the Q-values to converge between updates) also shows similar convergence characteristics. We use this approximation for all the results reported later in this study.
3.2 Intuition
Our hypothesis is that the rewards under SIBRE, as defined in (1), help RL algorithms discriminate between good and bad actions more easily. The effect is similar to that of baselines in policy gradient algorithms [5], but can be generically applied to all reinforcement learning algorithms, including valuebased methods. The effect is particularly noticeable in sparse reward settings, or in environments where the measurable difference in outcomes (rewards) is small. In both cases, the terminal reward in (1) becomes significant in proportion to the total rewards over the episode. Section 4 provides empirical support for this reasoning.
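A small numerical illustration of this intuition (the reward magnitudes and threshold value below are our own hypothetical choices, not from the experiments): two episodes whose raw returns differ by only a modest fraction receive oppositely-signed terminal rewards under SIBRE once the threshold sits between them.

```python
# Hypothetical sparse-reward setting: terminal reward +10, step reward -0.1.
good = 10 - 0.1 * 20   # return of a shorter (better) episode: 8.0
bad  = 10 - 0.1 * 40   # return of a longer (worse) episode:  6.0

rho = 7.0              # threshold learned from the agent's own history

# The raw returns differ by only 25%, but the SIBRE terminal rewards
# have opposite signs, making the two outcomes easy to tell apart:
print(good - rho)      # ~ +1.0 (improvement -> positive reward)
print(bad - rho)       # ~ -1.0 (regression  -> negative reward)
```

The baseline-differencing thus amplifies small differences in outcomes into a clear positive/negative learning signal.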
4 Experiments and Results
To validate our approach and evaluate its effectiveness, we broadly use the following two experimental setups.
4.1 Gridworld
To evaluate the effectiveness of SIBRE, we use a variable-sized Gridworld environment [4] with a negative step reward and a positive terminal reward for reaching the goal. We use the following versions of the Gridworld domain:

Door & Key environment: This environment has a key that the agent must pick up to unlock a door and then get to the gold. This environment is challenging to solve because of its sparse reward structure.

Multiroom environment: This environment has a series of connected rooms with doors that must be opened to get to the next room. The final room has the gold that the agent must get to. This environment also becomes particularly challenging as the number of rooms increases.
We have earlier proved the convergence properties of SIBRE with Q-learning, but we experiment with A2C [11] as well.
4.1.1 Door & Key Environment
We compare the results of pure A2C with those of A2C+SIBRE on grids of dimensions up to 6x6. The initial position of the agent, the position of the gold, and that of the key are all randomly set. An episode terminates after the agent reaches the gold or after 1000 timesteps have elapsed, whichever happens earlier.
We use fixed values of the discount factor, learning rate, and other A2C hyperparameters; for SIBRE, the threshold step size $\beta$ is chosen based on the sensitivity study in Fig. 2 (b). Both algorithms were trained for 1.8 million frames over 30 runs. There is a penalty for every step and a reward for reaching the gold. There is no intermediate reward for picking up the key or opening the door, which makes the task challenging. We observe that A2C with SIBRE converges much faster than plain A2C for grids with dimensions up to 6x6 (Figure 2 (a)), and also attains a higher average reward.
Figure 2 (b) shows a parameter sensitivity curve with respect to $\beta$ for this environment. For smaller values of $\beta$, the agent does not learn as well. We attribute this to slow learning of the threshold, which we observe is very important for good performance in this domain.
4.1.2 Multiroom Environment
We compare the results of pure A2C with those of A2C+SIBRE on the Multiroom environment. The initial position of the agent and that of the gold are set randomly. An episode terminates after the agent reaches the gold or after 1000 timesteps have elapsed, whichever happens earlier.
We use fixed values of the A2C hyperparameters; for SIBRE, the step size $\beta$ is chosen based on Fig. 3 (b). Both algorithms were trained for 1.8 million frames over 30 runs.
Initially, there is a penalty for every step and a reward for reaching the gold. There is no intermediate reward for moving from one room to the other, which increases the complexity of the task. We observe that both techniques are able to converge for environments with up to 2 rooms. The learning curves are shown in Figure 3 (a). It is evident from the performance comparison that A2C converges somewhat faster with SIBRE.
Figure 3 (b) shows the sensitivity curves. For this environment, we observe much more robust performance of SIBRE with respect to $\beta$, with varying values not affecting performance as much.
4.2 Gym Environments and Atari
In addition, we also tested SIBRE on FrozenLake [3] and 3 Atari 2600 games: Pong, Freeway and Venture. Since we used a policy-based method (A2C) for our previous experiments, we experimented here with a value-based method, Rainbow [8], which shows good performance across almost all Atari games. For FrozenLake we used Q-learning.
4.2.1 FrozenLake
FrozenLake is a stochastic grid-world setting, where the agent needs to traverse a 4x4 grid. Some of the states have slippery ice that is safe to walk on; some have holes, where the episode terminates with no reward. One of the states contains the goal, which provides a terminal reward of +1. The stochasticity is induced by the slippery ice: the state the agent ends up in depends only partially on the chosen action.
For this environment, we trained an agent with tabular Q-learning. The agent was trained for 10000 episodes with a limit of 100 steps per episode, after which the episode terminates with a reward of 0. We used a learning rate of 0.01, with fixed values of the discount factor and of the SIBRE step size $\beta$.
The learning curves of both algorithms, each trained over 100 independent runs, are shown in Figure 4 (a). We used UCB1-style exploration [22]. Plain Q-learning did not work as well: it converged to a much lower value than Q-learning with SIBRE, which attained a higher average reward.
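This setup can be sketched compactly. The code below is an illustrative stand-in, not the experimental code: a toy deterministic chain world replaces FrozenLake, and all names and constants (`ChainWorld`, `alpha`, `beta`, the episode cap, the UCB1 constant `c`) are our own assumptions. It combines tabular Q-learning, UCB1-style exploration, and the per-episode SIBRE approximation from Section 3.1:

```python
import math

class ChainWorld:
    """Tiny sparse-reward stand-in for FrozenLake (hypothetical, not the gym
    environment): walk right along a chain to reach a +1 terminal goal."""
    def __init__(self, n=8):
        self.n = n
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):  # a: 0 = left, 1 = right
        self.s = max(0, self.s - 1) if a == 0 else min(self.n - 1, self.s + 1)
        done = (self.s == self.n - 1)
        return self.s, (1.0 if done else 0.0), done

def ucb1_action(Q, counts, s, t, c=1.0):
    # UCB1-style exploration: greedy value plus a count-based bonus.
    for a in (0, 1):
        if counts[s][a] == 0:
            return a  # try every action at least once
    return max((0, 1), key=lambda a: Q[s][a] + c * math.sqrt(math.log(t) / counts[s][a]))

def train(episodes=400, alpha=0.1, gamma=0.99, beta=0.1):
    env = ChainWorld()
    Q = [[0.0, 0.0] for _ in range(env.n)]
    counts = [[0, 0] for _ in range(env.n)]
    rho, t = 0.0, 1
    for _ in range(episodes):
        s, done, G, traj, steps = env.reset(), False, 0.0, [], 0
        while not done and steps < 100:
            a = ucb1_action(Q, counts, s, t)
            s2, r, done = env.step(a)
            counts[s][a] += 1
            t += 1
            steps += 1
            G += r
            traj.append((s, a, r, s2, done))
            s = s2
        # Replay the episode with SIBRE rewards: original step rewards, but the
        # terminal reward replaced by the baseline-differenced return G - rho.
        for (s0, a0, r0, s1, d) in traj:
            r_shaped = (G - rho) if d else r0
            target = r_shaped if d else r_shaped + gamma * max(Q[s1])
            Q[s0][a0] += alpha * (target - Q[s0][a0])
        rho += beta * (G - rho)  # per-episode threshold update (Sec. 3.1)
    return Q, rho

Q, rho = train()
```

Since episode returns here lie in [0, 1], the threshold always stays in that range; as the agent reliably reaches the goal, the threshold rises towards the optimal return and the shaped terminal reward shrinks towards zero.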
4.2.2 Pong
In Pong, the agent has to defeat the opponent by hitting a ball with a paddle and making the opponent miss the ball. The agent gets a reward of +1 every time the opponent misses, and a reward of -1 every time the agent misses the ball. The episode terminates once either side reaches a score of 21.
For Pong, we used Rainbow without multi-step updates (n=1). The agent was trained for 1 million frames, and the results across the 5 runs are plotted in Fig. 4 (b). The remaining hyperparameters are the same as in [8], with a fixed SIBRE step size $\beta$. Across all 5 runs in Pong, Rainbow with SIBRE reaches the optimal policy much faster than Rainbow alone; even the worst performing run with SIBRE does better than the best performing run without SIBRE, which shows the value of a threshold-based reward structure for accelerating learning. Asymptotically, however, both reach optimal performance.
4.2.3 Freeway
In Freeway, the agent has to cross a road with traffic; every time it crosses the road the agent gets a reward of +1, and it is moved back if it hits a car. This is also a relatively sparse reward setting.
To solve this environment, we use Rainbow with the same settings as the original paper [8], and we run it for 30 million frames. We plot the mean episode rewards every 125k frames. For Freeway to work well without SIBRE, we clipped the rewards to (-1, +1), whereas with SIBRE we passed the original rewards. All remaining hyperparameters are the same as in [8], with a fixed step size $\beta$ for SIBRE.
In this environment, we find SIBRE to accelerate training, as shown in Fig. 5 (a) and it converges faster to the optimal policy with the same asymptotic reward.
4.2.4 Venture
Venture is a maze-navigation game, which is difficult to solve because of the sparsity of rewards. The agent has to navigate through a maze with enemies and escape within a certain time to avoid being killed. The reward depends on the time taken to exit the maze.
Rainbow [8] performs well in this game, whereas many deep reinforcement learning algorithms struggle to solve the task. Both algorithms were run for 120 million frames. In this game too, we used the same hyperparameters as for Freeway. Rainbow used reward clipping to improve performance; SIBRE, however, did not use any reward clipping, and we ran it with a fixed step size $\beta$.
In Venture as well (Fig. 5 (b)), we observed a speedup in training and a higher average reward across training.
4.3 Similar reward shaping in practical problems
We found three prior studies in the literature that used reward shaping conceptually similar to SIBRE in practical applications. [12] showed improved results on a bin-packing task using a ranked-reward scheme, where a binary reward signal was computed using the agent’s own past performance. [10] use a related approach for online computation of railway schedules: comparison with a prior performance threshold defines whether the latest schedule has improved on the benchmark, and this binary signal (success or failure) is used to train a tabular Q-learning algorithm. [24] use improvement with respect to previous objective values as the proxy for total return in a policy gradient framework. While we could not locate any studies that used SIBRE in its precise form for practical problems, the success of similar versions of reward shaping indicates that SIBRE could be useful for such problems.
4.4 Transfer Learning on DoorKey
SIBRE learns the value of a threshold which the agent aims to beat after each episode; the threshold encourages it to keep doing better. We therefore believe that once the threshold has been learnt properly, we get optimal performance. When we use the same model to learn on a bigger state space with the same reward structure, the learnt threshold provides a high initial value to beat, and this helps in easy transfer of learning.
In DoorKey, we first train on a 6x6 grid for 0.8 million frames, then transfer to an 8x8 grid and train for a further 2.4 million frames; the plotted results are averaged over 10 runs. All other hyperparameters are the same as in the previous experiment. From Figure 6 it is evident that SIBRE enables faster transfer as well as good overall performance.
5 Conclusion and Future Work
The results presented in this work show that RL algorithms achieve qualitatively and quantitatively better empirical performance when coupled with SIBRE on the selected environments. In particular, the capacity of the suggested technique to perform better on large instances via transfer learning makes it suitable for solving real-life problems. By averaging the rewards obtained over past episodes, SIBRE provides an improvement-based reward function, which pushes the agent to constantly improve over its past performance.
In future work, we plan to test the potential advantages of using a threshold-based reward, in terms of stability and performance, for additional RL algorithms (especially recent advances such as DDPG [13], TRPO [19] and PPO [20]). We also want to use the technique along with curriculum/transfer learning [2], especially in partially observed environments where the state and action space sizes remain the same (local) but the environment complexity (scale, observability, stochasticity, nonlinearity) is scaled up. Finally, there remains the possibility of extending the threshold-based setup to non-episodic MDPs.
References
[1] (2019) A new potential-based reward shaping for reinforcement learning agent. arXiv preprint arXiv:1902.06239. Cited by: §1.
[2] (2009) Curriculum learning. In ICML ’09: Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41–48. Cited by: §5.
[3] (2016) OpenAI gym. External Links: arXiv:1606.01540 Cited by: §4.2.
[4] (2018) Minimalistic gridworld environment for OpenAI gym. https://github.com/maximecb/gym-minigrid. Cited by: §2, §4.1.
 [5] (2004) Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research 5 (Nov), pp. 1471–1530. Cited by: §3.2.
[6] (2017) Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3389–3396. Cited by: §1.
[7] (2014) Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In Advances in Neural Information Processing Systems, pp. 3338–3346. Cited by: §1.
 [8] (2017) Rainbow: combining improvements in deep reinforcement learning. CoRR abs/1710.02298. External Links: Link, 1710.02298 Cited by: §4.2.2, §4.2.3, §4.2.4, §4.2.
[9] (1994) On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation 6, pp. 1185–1201. External Links: Document Cited by: §3.
 [10] (2019) A scalable RL algorithm for scheduling railway lines. IEEE Transactions on Intelligent Transportation Systems 20 (2), pp. 727–736. Cited by: §1, §4.3.
[11] (2002) Actor-critic algorithms. Cited by: §4.1.
[12] (2018) Ranked reward: enabling self-play reinforcement learning for combinatorial optimization. arXiv preprint arXiv:1807.01672. Cited by: §1, §4.3.
[13] (2016) Continuous control with deep reinforcement learning. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2–4, 2016, Conference Track Proceedings. External Links: Link Cited by: §5.
 [14] (2016) Asynchronous methods for deep RL. In International conference on machine learning, pp. 1928–1937. Cited by: §1.
[15] (2015) Human-level control through deep RL. Nature 518 (7540), pp. 529. Cited by: §1.
 [16] (1999) Policy invariance under reward transformations: theory and application to reward shaping. In ICML, Vol. 99, pp. 278–287. Cited by: §1.
 [17] (2010) Residential demand response using RL. In IEEE International Conference on Smart Grid Communications, pp. 409–414. Cited by: §1.
[18] (2018) OpenAI five. URL https://blog.openai.com/openai-five. Cited by: §1.
 [19] (2015) Trust region policy optimization. CoRR abs/1502.05477. External Links: Link, 1502.05477 Cited by: §1, §5.
 [20] (2017) Proximal policy optimization algorithms. CoRR abs/1707.06347. External Links: Link, 1707.06347 Cited by: §1, §5.
 [21] (2017) Mastering the game of go without human knowledge. Nature 550 (7676), pp. 354. Cited by: §1.
[22] (2018) Reinforcement learning: an introduction. 2nd edition, MIT Press, Cambridge, MA, USA. External Links: ISBN 9780262039246 Cited by: §2, §4.2.1.
[23] (2016) Deep RL with double Q-learning. In AAAI, Vol. 2, pp. 5. Cited by: §1.
 [24] (2019) A reinforcement learning framework for container selection and ship load sequencing in ports. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pp. 2250–2252. Cited by: §1, §4.3.
[25] (1992) Q-learning. Machine Learning 8 (3), pp. 279–292. External Links: ISSN 1573-0565, Document, Link Cited by: §2.
 [26] (1995) An RL approach to jobshop scheduling. In IJCAI, Vol. 95, pp. 1114–1120. Cited by: §1.