1 Introduction
In reinforcement learning (RL) problems, an agent learns to maximize a reward signal, which may be time-delayed [16]. The RL framework has achieved considerable success on complex problems in recent years. However, mastering difficult tasks is often slow. Much RL research therefore focuses on speeding up the learning process by using knowledge provided by an expert or a heuristic function.
Transfer learning (TL) is one approach that tries to speed up learning by using knowledge extracted from previously learned tasks [6]. Reward shaping (RS) is one of the methods that have been used to transfer such knowledge, or any other type of knowledge.
In problems that can be modeled using RL, many tasks have a sparse reward signal. For example, in some tasks the agent receives nothing until it reaches a goal, or until certain events occur in the environment. The reward function of a task is independent of time, and the agent tries to maximize the sum of its discounted rewards within an episode. From one episode to the next, however, the agent could use knowledge extracted from past episodes to reinforce the reward signal. Much information is available for this purpose. For example, after an episode of learning has finished, the agent could compare its performance with that of the best or the worst episode so far and use this comparison to improve its own learning process.
2 Background
2.1 Reinforcement learning
RL is a group of machine learning methods that can be used to train an agent from environmental feedback. An RL agent in an environment can be modeled as a Markov decision process (MDP) [16]. An MDP is a tuple $\langle S, A, T, R, \gamma \rangle$, where $S$ is the set of states that appear in the environment; $A$ is the set of the agent's possible actions; $T(s, a, s')$ is a function that gives the probability of reaching state $s'$ when the agent is in state $s$ and chooses action $a$; $R(s, a, s')$ is a function that gives the reward the agent receives when it performs action $a$ in state $s$ and moves to state $s'$; and $\gamma \in [0, 1]$ is the discount factor, which indicates how important future rewards are to the agent.
$$G = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right] \tag{1}$$
The agent wants to maximize equation 1, which is called the expected discounted return, during its lifetime [16]. There are many methods for training RL agents. Methods such as Q-learning [19] and SARSA estimate a value for each state-action pair and use these estimated values to compute a policy that maximizes the agent's discounted return. There are also methods that use the policy gradient [21, 17, 2] to approximate a policy and use this policy for the agent's decision making.
(2)
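As a small illustration of the discounted return, the following Python sketch computes $\sum_t \gamma^t r_t$ for a single finite episode. The function name and the toy reward sequence are assumptions made for this example; the sequence mimics a sparse task in which the only reward arrives at the final step.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one finite episode."""
    total = 0.0
    # Iterate backwards so that each step computes G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical sparse episode: nothing until the goal is reached.
sparse_episode = [0.0, 0.0, 0.0, 1.0]
print(discounted_return(sparse_episode, gamma=0.5))  # 0.125
```

With a sparse signal like this one, the return says little about which of the earlier steps were useful, which is the situation reward shaping tries to improve.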
2.2 Reward shaping
Reward shaping (RS) refers to allowing an agent to train on an artificial reward signal rather than environmental feedback [18]. However, this artificial signal must be a potential function; otherwise, the optimal policy of the new MDP may differ from that of the original one [14]. The authors of [14] prove that $F$ is a potential function if there exists a real-valued function $\Phi : S \to \mathbb{R}$ such that for all $s, s' \in S$ and $a \in A$:
$$F(s, a, s') = \gamma \Phi(s') - \Phi(s) \tag{3}$$
Consequently, when this potential function is used, the optimal policy in the new MDP ($M' = \langle S, A, T, R + F, \gamma \rangle$) remains optimal for the old MDP ($M = \langle S, A, T, R, \gamma \rangle$). In the new MDP, equation 3 and all equations extended from it are unchanged and must have the same value as in the old MDP. Other works in the literature extend the potential function. The authors of [20] have shown that the potential function can be a function of the joint state-action space (equation 4). The authors of [7] have shown that the potential function can be dynamic, changing over time, while all the properties of potential-based reward shaping (PBRS) are preserved (equation 5). Finally, [12] combines these two extensions and shows that any function of the form of equation 6 can be a potential function.
$$F(s, a, s', a') = \gamma \Phi(s', a') - \Phi(s, a) \tag{4}$$
$$F(s, t, s', t') = \gamma \Phi(s', t') - \Phi(s, t) \tag{5}$$
$$F(s, a, t, s', a', t') = \gamma \Phi(s', a', t') - \Phi(s, a, t) \tag{6}$$
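The policy-invariance property can be checked numerically on a toy trajectory. The sketch below assumes the state-based form $F(s, a, s') = \gamma \Phi(s') - \Phi(s)$ from [14]; the corridor environment, the potential (negative distance to the goal), and all names are hypothetical choices for illustration. Along any trajectory the shaping terms telescope, so the shaped and unshaped returns differ only by $\gamma^{T} \Phi(s_T) - \Phi(s_0)$, which does not depend on the actions taken in between.

```python
def shaping_reward(phi, s, s_next, gamma):
    """State-based shaping term: gamma * phi(s') - phi(s)."""
    return gamma * phi(s_next) - phi(s)


def shaped_return(rewards, states, phi, gamma):
    """Discounted return of r + F along a trajectory.

    `states` must contain len(rewards) + 1 entries (s_0 ... s_T).
    """
    total = 0.0
    for t, r in enumerate(rewards):
        f = shaping_reward(phi, states[t], states[t + 1], gamma)
        total += gamma ** t * (r + f)
    return total


# Hypothetical 1-D corridor with states 0..4 and the goal at state 4.
def phi(s):
    return -abs(4 - s)  # assumed potential: negative distance to the goal


states = [0, 1, 2, 3, 4]
rewards = [0.0, 0.0, 0.0, 1.0]
gamma = 0.5

plain = sum(gamma ** t * r for t, r in enumerate(rewards))
shaped = shaped_return(rewards, states, phi, gamma)
horizon = len(rewards)

# The difference depends only on the first and last states.
expected_gap = gamma ** horizon * phi(states[-1]) - phi(states[0])
print(plain, shaped, shaped - plain, expected_gap)
```

Although the total shift is fixed, the agent now sees a nonzero signal at every step, which is what makes the shaped task easier to learn.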
With reward shaping, the Q-learning update rule becomes equation 7. In equation 7, $\alpha$ is the learning rate and $F$ can be any of the functions given in equations 3, 4, 5, or 6. Equation 7 gives us the opportunity to enhance learning by providing the agent with extra information about the problem. This additional knowledge can come from any source, such as human knowledge of the problem, a heuristic function, or knowledge extracted by the agent itself.
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + F + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] \tag{7}$$
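A shaped Q-learning step can be sketched in a few lines. The code below is a minimal tabular version that assumes the state-based shaping term $\gamma \Phi(s') - \Phi(s)$; the function name, the dictionary-based Q-table, and the toy potential are assumptions for this example rather than part of any cited implementation.

```python
from collections import defaultdict


def shaped_q_update(q, s, a, r, s_next, actions, phi, alpha=0.1, gamma=0.99):
    """Apply one Q-learning update with a potential-based shaping term.

    `q` maps (state, action) pairs to values; missing entries default to 0.
    Returns the temporal-difference error used for the update.
    """
    f = gamma * phi(s_next) - phi(s)                  # shaping reward F
    best_next = max(q[(s_next, b)] for b in actions)  # max_a' Q(s', a')
    td_error = r + f + gamma * best_next - q[(s, a)]
    q[(s, a)] += alpha * td_error
    return td_error


# Hypothetical usage: two actions, potential equal to the state index.
q = defaultdict(float)
td = shaped_q_update(q, s=0, a=0, r=0.0, s_next=1, actions=[0, 1],
                     phi=lambda s: float(s), alpha=0.5, gamma=0.5)
print(td, q[(0, 0)])  # 0.5 0.25
```

Even though the environment reward is zero in this step, the shaping term produces a nonzero update, which is how the extra knowledge reaches the value estimates.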
3 Related works
The first method to consider is multigrid RL with RS [11], which proposes a method for learning a potential function online. The authors of [11] estimate a value function during the learning process and use it to shape a potential function. The estimated value function is defined over abstract states, and any state abstraction method can be applied.
(8) 
(9) 
Another work in the literature is [10], which proposes a potential function based on a plan. In [10], the agent receives an extra reward based on its progress in the plan. In equation 10, z is an abstract state that can be any state of the plan, and the function returns the time step at which the given abstract state appears while the plan is executed.
(10) 
An extension of [10] to multi-agent reinforcement learning is [8], which proposes two methods for shaping the agents' reward signals and improving their learning process.
Another work from the transfer learning literature transfers the agent's policy using reward shaping. The authors of [5] assume that there is a mapping function that maps the target task to the source task. Using this mapping, they define a potential function for shaping the reward signal. The proposed potential function, shown in equation 11, is defined in terms of the policy of the source task. In equation 11, $\chi_S$ is a mapping function from the target task state space to the source task state space, $\chi_A$ is a mapping function from the target task action space to the source task action space, and $\pi$ is the policy of the agent in the source task.
$$\Phi(s, a) = \pi(\chi_S(s), \chi_A(a)) \tag{11}$$
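The idea of using a source-task policy as a potential can be sketched as follows. The code assumes a potential over state-action pairs of the form $\Phi(s, a) = \pi(\chi_S(s), \chi_A(a))$ and a shaping term of the state-action form $\gamma \Phi(s', a') - \Phi(s, a)$; the toy source policy, the two mapping functions, and every name in the block are hypothetical and are not taken from [5].

```python
def make_transfer_potential(source_policy, map_state, map_action):
    """Build phi(s, a) from a source-task policy and two inter-task mappings."""
    def phi(s, a):
        # Probability the source policy assigns to the mapped action
        # in the mapped state.
        return source_policy(map_state(s), map_action(a))
    return phi


def advice_reward(phi, s, a, s_next, a_next, gamma):
    """State-action shaping term: gamma * phi(s', a') - phi(s, a)."""
    return gamma * phi(s_next, a_next) - phi(s, a)


# Hypothetical source task with two abstract states and two actions.
SOURCE_PROBS = {
    ("near", "left"): 0.9, ("near", "right"): 0.1,
    ("far", "left"): 0.2, ("far", "right"): 0.8,
}


def source_policy(state, action):
    return SOURCE_PROBS[(state, action)]


def map_state(x):
    # Assumed mapping: target states are integers, source states are labels.
    return "near" if x < 5 else "far"


def map_action(a):
    return {0: "left", 1: "right"}[a]


phi = make_transfer_potential(source_policy, map_state, map_action)
print(phi(3, 0), phi(7, 1))
print(advice_reward(phi, s=7, a=1, s_next=3, a_next=0, gamma=0.5))
```

Actions that the source policy favors receive a higher potential, so the shaped reward nudges the target agent toward behavior that worked in the source task.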
In [5], RS is used as a knowledge-transfer procedure, and a policy learned on another task is used as the knowledge.
Many other works in the literature propose potential-based reward shaping methods for multi-agent reinforcement learning (MARL) and single-agent reinforcement learning (SARL). [4] proposes an RS method for the case in which a demonstration of the task is available; it uses a similarity measure to calculate the similarity between state-action pairs. Another work that assumes a similar setting is [15]. Instead of a similarity measure, the authors of [15] use an inverse reinforcement learning approach to approximate a reward function for shaping the potential function. Another technique, presented for MARL, is [9], in which difference rewards are used to shape a potential function. The difference reward helps the agent learn the impact of its own action on the environment by excluding the impact of the other agents' actions.
4 Proposed potential based reward shaping method
Given these points and motivations, we now describe our approach to reinforcing the reward signal by using knowledge gained from the learning process. We want a reward function that changes every time the agent makes progress in the task. This reward function must encourage or punish the agent according to the best and worst episodes so far, and it must also handle sparse reward signals. The function given in equation 12 has the properties we want. As we know from [14], changing the reward function might change the optimal policy for the current task. Hence, we use reward shaping to manipulate the reward function so that it reflects the agent's improvement during the learning process.
(12) 
Equation 12 uses four quantities: the immediate reward, which is a sparse reward signal; the sum of rewards in the current episode, which we call the episode reward; the maximum value of the episode reward so far; and the minimum value of the episode reward so far. The function returns zero whenever the environment's reward signal carries no information; this case controls the effect of sparse rewards. Otherwise, the function reinforces the reward signal by measuring a distance from a fixed point. This approach can be used with any kind of learning algorithm to boost the learning process. As presented in the accompanying algorithm, after each episode of learning the agent updates the parameters of the potential function if needed.
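The bookkeeping described above can be sketched in Python. The exact form of equation 12 is not reproduced here, so the normalization used below, 1 + (episode reward - maximum) / (maximum - minimum), is an illustrative assumption, as are the class and method names. What the sketch does take from the text is the structure: the potential is zero when the immediate reward is zero, it otherwise depends on the running episode reward relative to the best and worst episodes so far, and those bounds are updated after each episode.

```python
class EpisodeReturnPotential:
    """Potential driven by the episode reward and its best/worst values so far.

    The bounds may be supplied up front (known in advance) or left as None
    and learned during training.
    """

    def __init__(self, r_max=None, r_min=None):
        self.r_max = r_max
        self.r_min = r_min
        self.episode_return = 0.0

    def potential(self, reward):
        """Accumulate the immediate reward and return the current potential."""
        self.episode_return += reward
        bounds_unknown = self.r_max is None or self.r_min is None
        # Zero potential when the reward carries no information, or when the
        # bounds are not yet usable for normalization.
        if reward == 0 or bounds_unknown or self.r_max == self.r_min:
            return 0.0
        # Assumed normalization: 0 at the worst episode, 1 at the best.
        span = self.r_max - self.r_min
        return 1.0 + (self.episode_return - self.r_max) / span

    def end_episode(self):
        """Update the best and worst episode rewards, then reset the sum."""
        ret = self.episode_return
        self.r_max = ret if self.r_max is None else max(self.r_max, ret)
        self.r_min = ret if self.r_min is None else min(self.r_min, ret)
        self.episode_return = 0.0


# Hypothetical usage with made-up reward sequences.
tracker = EpisodeReturnPotential()
for episode in ([0.0, 1.0], [1.0, 1.0, 1.0], [0.0, 1.0, 1.0, 1.0]):
    values = [tracker.potential(r) for r in episode]
    tracker.end_episode()
    print(values)
```

The resulting potential would then enter the shaping term in the usual way, as the discounted potential of the next step minus the potential of the current step. For the multitask setting, one such tracker per task is a natural choice, although that detail is also an assumption.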
4.1 Extending to multitask agents
The presented potential function helps learning methods improve their performance by using knowledge extracted from previous episodes. Sometimes the agent needs to learn multiple tasks simultaneously. With this in mind, we extended the proposed method for use in multitask RL.
5 Results
To demonstrate the usefulness of the proposed approach, we perform experiments in two domains: the Breakout and Pong games from the Arcade Learning Environment (ALE) [3]. According to [3]:
“ALE provides an interface to hundreds of Atari 2600 game environments, each one different, interesting, and designed to be a challenge for human players. ALE presents significant research challenges for reinforcement learning, model learning, model-based planning, imitation learning, transfer learning, and intrinsic motivation.”
We compared our proposed method against two baselines: [1] as the first and [6] as the second. [1] is an implementation of [13], which uses a policy gradient method to approximate the optimal policy of a specific task. For the second baseline, we implemented [6], a method from the transfer learning literature. We evaluate our work in two phases: first during the learning process, and then on the performance of the final policy. We tested our method under two assumptions: first, that we know the maximum and minimum values of the episode reward in each task; and second, that we have no information about the episode reward of the task, so these values are obtained during the learning process.
5.1 Breakout
Breakout is an arcade game in which an agent controls a paddle to hit a ball, trying to destroy as many bricks as possible while preventing the ball from passing the paddle. We trained an agent in this environment for 1000000 episodes. Figure 1 shows the learning curves of the agents trained on Breakout. As figure 1 shows, the area under the curve of our method is greater than that of the baseline method, and its final performance is also better than the baseline's.
5.2 Pong
Pong is also an arcade game, in which an agent moves a paddle to hit a ball. The agent receives a reward of -1 if it misses the ball and +1 if the opponent misses the ball. We trained an agent in this environment for 30000 episodes. Figure 2 shows the learning curves of the agents trained on Pong. As figure 2 shows, the area under the curve of our method is greater than that of one of the baseline methods. In general, however, the proposed method did not work very well in this environment and yielded only a small improvement.
5.3 Multitask Agent
We also implemented a multitask agent that learns to play both Pong and Breakout, and trained it for 115k episodes. Figures 3 and 4 show the results of our experiments with the multitask agent. Figure 3 shows the learning curve of the multitask agent for each game, while figure 4 shows the performance of the learned policy, obtained by running the agent with the learned policy for 100 episodes and averaging the episode rewards. As we can see, the multitask agent that uses the reinforced reward signal performs better on average than the multitask agent that does not.
6 Conclusion
In this paper, we first studied the literature on transfer learning and potential-based reward shaping. In that literature, we found no method that tries to extract knowledge from the learning process itself. Hence, we introduced a novel way of extracting knowledge from the learning process, and we used reward shaping as the knowledge-transfer method.
The proposed method guides the agent toward a goal by looking at the value of the episode reward. If the agent moves toward the goal, the reward signal is reinforced. The method can be used with any learning algorithm, and it is expected to be efficient wherever it is applied. We implemented our method on top of [1] and compared the results with two baselines in two environments, and we then extended the method for use in multitask agents. In most of the experiments, the results were promising. We also tested our method with a multitask agent that learns the games of Pong and Breakout simultaneously, and we saw that our method could lead to an improvement.
References
 [1] Mohammad Babaeizadeh, Iuri Frosio, Stephen Tyree, Jason Clemons, and Jan Kautz. Reinforcement learning through asynchronous advantage actor-critic on a GPU. In 5th International Conference on Learning Representations, 2017.
 [2] Peter L. Bartlett and Jonathan Baxter. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15(3):319–350, 2001.
 [3] Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47(1):253–279, May 2013.
 [4] Tim Brys, Anna Harutyunyan, Halit Bener Suay, Sonia Chernova, Matthew E. Taylor, and Ann Nowé. Reinforcement learning from demonstration through shaping. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI’15, pages 3352–3358. AAAI Press, 2015.
 [5] Tim Brys, Anna Harutyunyan, Matthew E. Taylor, and Ann Nowé. Policy transfer using reward shaping. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, AAMAS ’15, pages 181–188, Richland, SC, 2015. International Foundation for Autonomous Agents and Multiagent Systems.
 [6] Gabriel De la Cruz, Yunshu Du, James Irwin, and Matthew Taylor. Initial progress in transfer for deep reinforcement learning algorithms. In International Joint Conference on Artificial Intelligence (IJCAI), 2016.
 [7] Sam Devlin and Daniel Kudenko. Dynamic potential-based reward shaping. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 1, AAMAS ’12, pages 433–440, Richland, SC, 2012. International Foundation for Autonomous Agents and Multiagent Systems.
 [8] Sam Devlin and Daniel Kudenko. Plan-based reward shaping for multi-agent reinforcement learning. Knowledge Eng. Review, 31(1):44–58, 2016.
 [9] Sam Devlin, Logan Michael Yliniemi, Daniel Kudenko, and Kagan Tumer. Potential-based difference rewards for multiagent reinforcement learning. In AAMAS, pages 165–172. IFAAMAS/ACM, 2014.
 [10] M. Grzes and D. Kudenko. Plan-based reward shaping for reinforcement learning. In 2008 4th International IEEE Conference Intelligent Systems, volume 2, pages 10–22–10–29, Sept 2008.

 [11] Marek Grześ and Daniel Kudenko. Multigrid reinforcement learning with reward shaping. In Proceedings of the 18th International Conference on Artificial Neural Networks, Part I, ICANN ’08, pages 357–366, Berlin, Heidelberg, 2008. Springer-Verlag.
 [12] Anna Harutyunyan, Sam Devlin, Peter Vrancx, and Ann Nowe. Expressing arbitrary reward functions as potential-based advice. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, pages 2652–2658. AAAI Press, 2015.
 [13] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1928–1937, New York, New York, USA, 20–22 Jun 2016. PMLR.
 [14] Andrew Y. Ng, Daishi Harada, and Stuart J. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning, ICML ’99, pages 278–287, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.
 [15] Halit Bener Suay, Tim Brys, Matthew E. Taylor, and Sonia Chernova. Learning from demonstration for shaping through inverse reinforcement learning. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, AAMAS ’16, pages 429–437, Richland, SC, 2016. International Foundation for Autonomous Agents and Multiagent Systems.
 [16] Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.
 [17] Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, NIPS’99, pages 1057–1063, Cambridge, MA, USA, 1999. MIT Press.
 [18] Matthew E. Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10:1633–1685, July 2009.
 [19] Christopher J.C.H. Watkins and Peter Dayan. Technical note: Q-learning. Machine Learning, 8(3):279–292, May 1992.
 [20] Eric Wiewiora, Garrison Cottrell, and Charles Elkan. Principled methods for advising reinforcement learning agents. In Proceedings of the Twentieth International Conference on Machine Learning, ICML’03, pages 792–799. AAAI Press, 2003.
 [21] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, May 1992.