Reinforcement learning (RL) is a framework that allows an agent to complete tasks in an environment, even when a model of the environment is not known. The agent ‘learns’ to complete a task by maximizing its expected long-term reward, where the reward signal is supplied by the environment. RL algorithms have been successfully implemented in many fields, including robotics [1, 2] and games [3, 4]. However, it remains difficult for an RL agent to master new tasks in unseen environments. This is especially true when the reward given by the environment is sparse or significantly delayed.
It may be possible to guide an RL agent towards promising solutions faster if it is equipped with some form of prior knowledge about the environment. This knowledge can be encoded by modifying the reward signal received by the agent during training. However, the modification must be carried out in a principled manner, since providing an additional reward at each step might distract the agent from the true goal. Potential-based reward shaping (PBRS) is one such method that augments the reward in an environment specified by a Markov Decision Process (MDP) with a term that is a difference of potentials. This method is attractive since it easily allows for the recovery of optimal policies, while enabling the agent to learn these policies faster.
Potential functions are typically functions of states. This could be a limitation, since in some cases such a function may not be able to encode all the information available in the environment. To impart more information to the agent, a potential-based advice (PBA) scheme was proposed in . The potential functions in PBA include both states and actions as their arguments.
To the best of our knowledge, PBRS and PBA schemes in the literature [6, 7, 8] assume that an optimal policy is deterministic. This will not always be the case, since an optimal policy might be a stochastic policy. This is especially true when there are states in the environment that are partially observable or indistinguishable from each other. Moreover, the aforementioned papers limit their focus to discrete state and action spaces.
In this paper, we study the addition of PBRS and PBA schemes to the reward, in settings where: i) the optimal policy will be stochastic, and ii) state and action spaces may be continuous. We additionally provide guarantees on the convergence of an advantage actor-critic architecture that is augmented with a PBA scheme. We make the following contributions:
We prove that the ability of an agent to learn an optimal stochastic policy remains unaffected when augmenting PBRS to soft Q-learning.
We propose a technique for adapting PBA in policy-based methods, in order to use these schemes in environments with continuous state and action spaces.
We present an algorithm, AC-PBA, describing an advantage actor-critic architecture augmented with PBA, and provide guarantees on its convergence.
We evaluate our approach on two experimental domains: a discrete-state, discrete-action Puddle-jump Gridworld that has indistinguishable states, and a continuous-state, continuous-action Mountain Car.
The remainder of this paper is organized as follows: Section II presents related work in reward shaping. Required preliminaries on RL, PBRS, and PBA are presented in Section III. Section IV presents our results on using PBRS for stochastic policy learning. We present a method to augment PBA to policy gradient frameworks and an algorithm detailing this in Section V. Experiments validating our approach are reported in Section VI, and we conclude the paper in Section VII.
II Related Work
Shaping or augmenting the reward received by an RL agent in order to enable it to learn optimal policies faster is an active area of research. Reward modification via human feedback was used in [9, 10] to interactively shape an agent’s response so that it learned a desired behavior. However, frequent human supervision is usually costly and may not be possible in every situation. A curiosity-based RL algorithm for sparse reward environments was presented in , where an intrinsic reward signal characterized the prediction error of the agent as a curiosity reward. The reward received by the agent was augmented with a function that represented the number of times the agent had visited a state in .
Entropy regularization as a way to encourage exploration of policies during the early stages of learning was studied in  and . This was used to lead a policy towards states with a high reward in  and .
Static potential-based functions were shown to preserve the optimality of deterministic policies in . This property was extended to dynamic potential-based functions in . The authors of  showed that when an agent learned a policy using Q-learning, applying PBRS at each training step was equivalent to initializing the Q-function with the potentials. They studied value-based methods, but restricted their focus to learning deterministic policies. The authors of  demonstrated a method to transform a reward function into a potential-based function during training. The potential function in PBA was obtained using an ‘experience filter’ in .
The use of PBRS in model-based RL was studied in , and for episodic RL in . PBRS was extended to planning in partially observable domains in . However, these papers only considered the finite-horizon case. In comparison, we consider the infinite horizon, discounted cost setting in this paper.
In control theoretic settings, RL algorithms have been used to establish guarantees on convergence to an optimal controller for the Linear Quadratic Regulator, when a model of the underlying system was not known in [23, 24]. A survey of using RL for control is presented in . OpenAI Gym  enables the solving of several problems in classical control using RL algorithms.
III Preliminaries
III-A Reinforcement Learning
An MDP $M$ is a tuple $(S, A, \mathbb{T}, \rho_0, r)$. $S$ is the set of states, $A$ the set of actions, and $\mathbb{T}$ encodes $\mathbb{T}(s_{t+1}|s_t, a_t)$, the probability of transitioning to state $s_{t+1}$, given current state $s_t$ and action $a_t$. $\rho_0$ is a probability distribution over the initial states. $r(s_t, a_t, s_{t+1})$ denotes the reward that the agent receives when transitioning from $s_t$ to $s_{t+1}$ while taking action $a_t$.
The goal for an RL agent is to learn a policy $\pi$ in order to maximize $\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$. Here, $\gamma \in (0, 1)$ is a discounting factor, and the expectation is taken over the trajectory induced by policy $\pi$. If $\pi : S \rightarrow A$, the policy is deterministic. On the other hand, a randomized policy returns a probability distribution over the set of actions, and is denoted $\pi(a|s)$.
The value of a state-action pair $(s, a)$ when following policy $\pi$ is represented by the Q-function, written $Q^{\pi}(s, a)$. The Q-function allows us to calculate the state value $V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot|s)}\left[Q^{\pi}(s, a)\right]$. The advantage of a particular action $a$ over other actions at a state $s$ is defined by $A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$.
III-B Value-based and Policy-based Methods
The RL problem has two general solution techniques. Value-based
methods determine an optimal policy by maintaining a set of reward estimates when following a particular policy. At each state, an action that achieves the highest (expected) reward is taken. Typical value-based methods to learn greedy (deterministic) policies include Q-learning and Sarsa. Recently, the authors of  proposed soft Q-learning, which is a value-based method that is able to learn stochastic policies.
In comparison, policy-based methods directly search over the policy space . Starting from an initial policy, specified by a set of parameters, these methods compute the expected reward for this policy, and update the parameter set according to certain rules to improve the policy. Policy gradient  is one way to achieve policy improvement. This method repeatedly computes (an estimate of) the gradient of the expected reward with respect to the policy parameters. Policy-based approaches usually exhibit better convergence properties, and can be used in continuous action spaces . They can also be used to learn stochastic policies. REINFORCE and actor-critic are examples of policy gradient algorithms .
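To make the policy gradient idea concrete, the sketch below performs a single REINFORCE-style update for a softmax policy over one state with two actions. The episode data, discount factor, and step size are made-up illustrations, not settings from this paper:

```python
import numpy as np

gamma, lr = 0.99, 0.01        # illustrative discount and step size
theta = np.zeros(2)           # softmax policy parameters for one state, two actions

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def grad_log_pi(theta, a):
    # Gradient of log pi(a) for a softmax policy: e_a - pi
    g = -softmax(theta)
    g[a] += 1.0
    return g

# One made-up episode: actions taken and rewards received.
actions, rewards = [0, 1, 0], [0.0, 1.0, 2.0]
T = len(rewards)
returns = [sum(gamma ** (k - t) * rewards[k] for k in range(t, T)) for t in range(T)]

# Monte Carlo policy gradient estimate, accumulated over the episode.
grad = sum(G * grad_log_pi(theta, a) for a, G in zip(actions, returns))
theta = theta + lr * grad     # ascend the estimated gradient
```

After the update, the parameters shift towards the actions that received higher discounted returns; an actor-critic replaces the Monte Carlo returns with TD estimates, as discussed later in the paper.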
III-C PBRS and PBA
Reward shaping methods augment the environment reward $r$ with an additional reward $F$, so that the agent receives $r + F$ at each step. This changes the structure of the original MDP $M$ to $M' = (S, A, \mathbb{T}, \rho_0, r + F)$. The goal is to choose $F$ so that an optimal policy for $M'$ is also optimal for the original MDP $M$. Potential-based reward shaping (PBRS) schemes were shown to be able to preserve the optimality of deterministic policies in .
In PBRS, the function $F$ is defined as a difference of potentials. Specifically, $F(s_t, s_{t+1}) = \gamma \phi(s_{t+1}) - \phi(s_t)$, where $\phi : S \rightarrow \mathbb{R}$ is a potential function over states. Then, the Q-function $Q^*_{M'}$ of the optimal greedy policy for $M'$ and the optimal Q-function $Q^*_M$ for $M$ are related by $Q^*_{M'}(s, a) = Q^*_M(s, a) - \phi(s)$. Therefore, the optimal greedy policy is not changed [6, 8], since $\arg\max_a Q^*_{M'}(s, a) = \arg\max_a \left(Q^*_M(s, a) - \phi(s)\right) = \arg\max_a Q^*_M(s, a)$.
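The telescoping property behind this invariance can be checked numerically. In the sketch below, the potentials, rewards, and trajectory are hypothetical; it verifies that PBRS changes the return of a trajectory only by an amount determined by its first and last states, independent of the actions chosen:

```python
import numpy as np

def shaped_reward(r, s, s_next, phi, gamma):
    # PBRS: r + F(s, s') with F(s, s') = gamma * phi[s_next] - phi[s]
    return r + gamma * phi[s_next] - phi[s]

gamma = 0.99
phi = np.array([2.0, 1.0, 5.0])     # hypothetical potentials for 3 states
states = [0, 1, 2, 0]               # a trajectory s0 -> s1 -> s2 -> s0
rewards = [0.0, 0.0, 10.0]          # hypothetical environment rewards

ret_orig = sum(gamma ** t, * (r,))[0] if False else sum(gamma ** t * r for t, r in enumerate(rewards))
ret_shaped = sum(gamma ** t * shaped_reward(r, states[t], states[t + 1], phi, gamma)
                 for t, r in enumerate(rewards))

# The shaping terms telescope: the two returns differ exactly by
# gamma^T * phi(s_T) - phi(s_0), a quantity the policy cannot influence mid-trajectory.
diff = gamma ** 3 * phi[states[3]] - phi[states[0]]
print(abs((ret_shaped - ret_orig) - diff) < 1e-12)  # True
```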
The authors of  augmented the potential function to include the action as an argument. They termed this potential-based advice (PBA). There are two forms, look-ahead PBA and look-back PBA, respectively defined by $F(s_t, a_t, s_{t+1}, a_{t+1}) = \gamma \phi(s_{t+1}, a_{t+1}) - \phi(s_t, a_t)$ and $F(s_{t-1}, a_{t-1}, s_t, a_t) = \phi(s_t, a_t) - \gamma^{-1} \phi(s_{t-1}, a_{t-1})$.
For the look-ahead PBA scheme, the state-action value function when following policy $\pi$ satisfies $Q^{\pi}_{M'}(s, a) = Q^{\pi}_{M}(s, a) - \phi(s, a)$. The optimal greedy policy for $M$ can thus be recovered from the optimal state-action value function for $M'$ as $\pi^*_M(s) = \arg\max_a \left[Q^*_{M'}(s, a) + \phi(s, a)\right]$. The optimal greedy policy for $M$ using look-back PBA can be recovered similarly.
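A small sketch of the two advice forms and of greedy-policy recovery under look-ahead PBA; the potentials, shaped Q-values, and discount factor below are made-up illustrations:

```python
import numpy as np

gamma = 0.9   # illustrative discount factor

def lookahead_pba(phi, s, a, s_next, a_next):
    # Look-ahead advice: F = gamma * phi(s', a') - phi(s, a)
    return gamma * phi[s_next, a_next] - phi[s, a]

def lookback_pba(phi, s_prev, a_prev, s, a):
    # Look-back advice: F = phi(s, a) - (1/gamma) * phi(s_prev, a_prev)
    return phi[s, a] - phi[s_prev, a_prev] / gamma

# Recovering a greedy policy under look-ahead PBA: the optimal action in the
# original MDP is argmax_a [ Q*_{M'}(s, a) + phi(s, a) ].
phi = np.array([[0.0, 2.0], [1.0, 0.0]])        # made-up potentials (2 states x 2 actions)
q_shaped = np.array([[3.0, 0.5], [0.2, 4.0]])   # made-up optimal Q for the shaped MDP
greedy = np.argmax(q_shaped + phi, axis=1)
```

Note that recovering the policy requires adding the potential back to the shaped Q-function; this is the step that becomes problematic for stochastic policies, as discussed in Section V.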
IV PBRS for Stochastic Policy Learning
The existing literature on PBRS has focused on augmenting value-based methods to learn optimal deterministic policies. In this section, we first show that PBRS preserves optimality, when the optimal policy is stochastic. Then, we show that the learnability will not be changed when using PBRS in soft Q-learning.
Assume that the optimal policy for MDP $M$ is stochastic. Then, with $F(s_t, s_{t+1}) = \gamma \phi(s_{t+1}) - \phi(s_t)$, PBRS preserves the optimality of stochastic policies.
The goal in the original MDP $M$ was to find a policy that maximizes $\eta(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$. In PBRS, the goal is to determine a policy that maximizes $\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t \left(r_t + \gamma \phi(s_{t+1}) - \phi(s_t)\right)\right]$. The shaping terms telescope, so this objective equals $\eta(\pi) - \mathbb{E}_{s_0 \sim \rho_0}\left[\phi(s_0)\right]$, which differs from the original objective by a constant that does not depend on $\pi$. Hence the set of optimal policies, stochastic or deterministic, is unchanged.
Next, we examine the effect on learnability when using PBRS with soft Q-learning. Soft Q-learning is a value-based method for stochastic policy learning that was proposed in . Different from Equation (5), the goal is to maximize both the accumulated reward and the policy entropy at each visited state: $\pi^* = \arg\max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t \left(r_t + \alpha \mathcal{H}(\pi(\cdot|s_t))\right)\right]$, where $\mathcal{H}$ denotes entropy.
The entropy term encourages exploration of the state space, and the parameter $\alpha$ trades off exploitation against exploration.
Before stating our result, we summarize the soft Q-learning update procedure. The optimal soft value function is given by $V^*(s) = \alpha \log \sum_{a} \exp\left(\frac{1}{\alpha} Q^*(s, a)\right)$. The optimal soft Q-function is determined by solving the soft Bellman equation $Q^*(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim \mathbb{T}}\left[V^*(s')\right]$. The optimal policy can be obtained from Equation (9) as $\pi^*(a|s) = \exp\left(\frac{1}{\alpha}\left(Q^*(s, a) - V^*(s)\right)\right)$.
In the rest of this section, we assume that both states and actions are discrete, and that no function approximator is used. We also omit the subscripts on $Q$ and $V$, and set $\alpha = 1$ for simplicity. From Equation (9), and as in Q-learning, soft Q-learning updates the soft Q-function by minimizing the soft Bellman error $\hat{Q}(s_t, a_t) - Q(s_t, a_t)$, where $\hat{Q}(s_t, a_t) = r_t + \gamma V(s_{t+1})$ is the target value. During training, $V(s_{t+1}) = \log \sum_a \exp Q(s_{t+1}, a)$. With $\eta$ denoting the learning rate, the Q-function update is given by $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \eta \left(\hat{Q}(s_t, a_t) - Q(s_t, a_t)\right)$.
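The tabular update just described can be sketched as follows; the temperature, learning rate, and toy table sizes are illustrative assumptions:

```python
import numpy as np

alpha, gamma, eta = 1.0, 0.99, 0.1   # temperature, discount, learning rate (illustrative)

def soft_value(q_row):
    # V(s) = alpha * log sum_a exp(Q(s, a) / alpha), computed stably (log-sum-exp)
    m = (q_row / alpha).max()
    return alpha * (m + np.log(np.exp(q_row / alpha - m).sum()))

def soft_q_update(Q, s, a, r, s_next):
    # One tabular step: move Q(s, a) toward the soft Bellman target
    target = r + gamma * soft_value(Q[s_next])
    Q[s, a] += eta * (target - Q[s, a])

def soft_policy(q_row):
    # pi(a|s) proportional to exp(Q(s, a) / alpha): the learned stochastic policy
    z = np.exp((q_row - q_row.max()) / alpha)
    return z / z.sum()

Q = np.zeros((2, 2))                    # toy 2-state, 2-action table
soft_q_update(Q, s=0, a=1, r=1.0, s_next=1)
```

After one rewarded transition, the induced policy at the visited state is slightly biased toward the taken action, while remaining stochastic.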
The main result of this section shows that the ability of an agent to learn an optimal policy is unaffected when using soft Q-learning augmented with PBRS. We define a notion of learnability, and use this to establish our claim.
During training, an agent encounters a sequence of states, actions, and rewards that serves as ‘raw data’ fed to the RL algorithm. Let $\mathcal{A}$ and $\mathcal{B}$ denote two RL agents. Let $x^{\mathcal{A}}_t$ and $x^{\mathcal{B}}_t$ denote the experience tuples $(s_t, a_t, s_{t+1}, r_t)$ at learning step $t$ from trajectories used by $\mathcal{A}$ and $\mathcal{B}$, respectively.
Definition 1 (Learnability)
Denote the accumulated difference in the Q-functions of $\mathcal{A}$ and $\mathcal{B}$ after learning for $t$ steps by $\Delta^{\mathcal{A}}_t$ and $\Delta^{\mathcal{B}}_t$, respectively. Then, given identical sample experiences (that is, $x^{\mathcal{A}}_k = x^{\mathcal{B}}_k$ for all $k \leq t$), $\mathcal{A}$ and $\mathcal{B}$ are said to have the same learnability if $\Delta^{\mathcal{A}}_t = \Delta^{\mathcal{B}}_t$ for all $t$.
Soft Q-learning, with initial soft Q-values $Q_0(s, a)$ and augmented with PBRS with state potential $\phi(s)$, has the same learnability as soft Q-learning without PBRS but with its soft Q-values initialized to $Q_0(s, a) + \phi(s)$.
Consider an agent $\mathcal{A}$ that uses a PBRS scheme during learning, and an agent $\mathcal{B}$ that does not use PBRS but has its soft Q-values initialized as $Q^{\mathcal{B}}_0(s, a) = Q^{\mathcal{A}}_0(s, a) + \phi(s)$, where $Q^{\mathcal{A}}_0$ is the initial Q-value of $\mathcal{A}$. We further assume that $\mathcal{A}$ and $\mathcal{B}$ adopt the same learning rate. From Definition 1, to show that $\mathcal{A}$ and $\mathcal{B}$ have the same learnability, we need to show that the soft Bellman errors $\delta^{\mathcal{A}}_t$ and $\delta^{\mathcal{B}}_t$ are equal at each training step $t$, given the same experience tuples $x^{\mathcal{A}}_t$ and $x^{\mathcal{B}}_t$. From Equation (11), the soft Bellman errors for $\mathcal{A}$ and $\mathcal{B}$ can be respectively written as $\delta^{\mathcal{A}}_t = r_t + \gamma \phi(s_{t+1}) - \phi(s_t) + \gamma V^{\mathcal{A}}_t(s_{t+1}) - Q^{\mathcal{A}}_t(s_t, a_t)$ and $\delta^{\mathcal{B}}_t = r_t + \gamma V^{\mathcal{B}}_t(s_{t+1}) - Q^{\mathcal{B}}_t(s_t, a_t)$.
Since $r_t$ is identical for both agents at each step $t$, comparing $\delta^{\mathcal{A}}_t$ and $\delta^{\mathcal{B}}_t$ reduces to comparing $\gamma \phi(s_{t+1}) - \phi(s_t) + \gamma V^{\mathcal{A}}_t(s_{t+1}) - Q^{\mathcal{A}}_t(s_t, a_t)$ and $\gamma V^{\mathcal{B}}_t(s_{t+1}) - Q^{\mathcal{B}}_t(s_t, a_t)$. We show these are equal by induction.
At training step $t = 0$ there is no update, so $Q^{\mathcal{B}}_0 = Q^{\mathcal{A}}_0 + \phi$ by construction; since $V(s) = \log \sum_a \exp Q(s, a)$, adding $\phi(s)$ to every entry of $Q(s, \cdot)$ adds $\phi(s)$ to $V(s)$, and thus $\delta^{\mathcal{A}}_0 = \delta^{\mathcal{B}}_0$. Assume that the Bellman errors are identical up to a step $t$; that is, $\delta^{\mathcal{A}}_k = \delta^{\mathcal{B}}_k$ for all $k \leq t$. Then, the accumulated updates for the two agents until this step are also identical, so $Q^{\mathcal{B}}_{t+1}(s, a) = Q^{\mathcal{A}}_{t+1}(s, a) + \phi(s)$ for every $(s, a)$. Consider training step $t + 1$. The state values at this step satisfy $V^{\mathcal{B}}_{t+1}(s) = V^{\mathcal{A}}_{t+1}(s) + \phi(s)$. The Bellman errors at $t + 1$ are $\delta^{\mathcal{A}}_{t+1} = r_{t+1} + \gamma \phi(s_{t+2}) - \phi(s_{t+1}) + \gamma V^{\mathcal{A}}_{t+1}(s_{t+2}) - Q^{\mathcal{A}}_{t+1}(s_{t+1}, a_{t+1})$ and $\delta^{\mathcal{B}}_{t+1} = r_{t+1} + \gamma V^{\mathcal{B}}_{t+1}(s_{t+2}) - Q^{\mathcal{B}}_{t+1}(s_{t+1}, a_{t+1})$. Substituting $V^{\mathcal{B}}_{t+1} = V^{\mathcal{A}}_{t+1} + \phi$ and $Q^{\mathcal{B}}_{t+1} = Q^{\mathcal{A}}_{t+1} + \phi$, it follows that $\delta^{\mathcal{A}}_{t+1} = \delta^{\mathcal{B}}_{t+1}$.
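The induction argument can also be checked numerically. The sketch below runs agent A (soft Q-learning with PBRS, zero-initialized) and agent B (no PBRS, Q-values initialized with the potentials added) on an identical stream of randomly generated experience, and asserts that their soft Bellman errors coincide at every step. The potentials, transitions, and rewards are made up for the check:

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma, eta = 4, 3, 0.99, 0.1
phi = rng.normal(size=n_s)                 # arbitrary state potentials

def soft_value(q_row):
    # V(s) = log sum_a exp(Q(s, a)), with alpha = 1 as in the text
    m = q_row.max()
    return m + np.log(np.exp(q_row - m).sum())

Q_A = np.zeros((n_s, n_a))                 # agent A: PBRS, zero-initialized
Q_B = Q_A + phi[:, None]                   # agent B: no PBRS, initialized to Q0 + phi

for _ in range(200):                       # identical experience stream for both agents
    s, a, s2 = rng.integers(n_s), rng.integers(n_a), rng.integers(n_s)
    r = rng.normal()
    F = gamma * phi[s2] - phi[s]           # PBRS term seen only by agent A
    d_A = r + F + gamma * soft_value(Q_A[s2]) - Q_A[s, a]
    d_B = r + gamma * soft_value(Q_B[s2]) - Q_B[s, a]
    assert abs(d_A - d_B) < 1e-10          # identical soft Bellman errors at every step
    Q_A[s, a] += eta * d_A
    Q_B[s, a] += eta * d_B

print(np.allclose(Q_A + phi[:, None], Q_B))  # True: the tables differ by phi throughout
```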
If the Q-function is represented by a function approximator (as is typical for continuous action spaces), then Proposition 2 may not hold. This is because the Q-function in this scenario is updated using gradient descent, instead of Equation (12). Gradient descent is sensitive to initialization. Thus, different initial values will result in different updates of the Q-function.
V PBA for Stochastic Policy Learning
Although PBRS can preserve the optimality of policies in several settings, it suffers from the drawback of being unable to encode richer information, such as desired relations between states and actions. The authors of  proposed potential-based advice (PBA), a scheme that augments the potential function by including actions as an argument together with states. In this section, we show that while using PBA, recovering the optimal policy can be difficult if the optimal policy is stochastic. Then, we propose a novel way to impart prior information in order to learn a stochastic policy with PBA.
V-A Stochastic policy learning with PBA
Assume that we can compute $Q^*_M(s, a)$, the optimal value for state-action pair $(s, a)$ in MDP $M$. The optimal stochastic policy for $M$ is $\pi^*_M(a|s) \propto \exp\left(Q^*_M(s, a)\right)$. From Equation (3), the optimal stochastic policy for the modified MDP $M'$ that has its reward augmented with PBA is given by $\pi^*_{M'}(a|s) \propto \exp\left(Q^*_M(s, a) - \phi(s, a)\right)$. If the optimal policy is deterministic, then the policy for $M$ can be recovered easily from that for $M'$ using Equation (4). However, when it is stochastic, we need to average over trajectories in the MDP, which makes it difficult to recover the optimal policy for $M$ from that of $M'$.
In the sequel, we will propose a novel way to take advantage of PBA in the policy gradient framework in order to directly learn a stochastic policy.
V-B Imparting PBA in policy gradient
Let $\eta(\pi_{\theta})$ denote the value of a parameterized policy $\pi_{\theta}$ in MDP $M$. That is, $\eta(\pi_{\theta}) = \mathbb{E}_{\pi_{\theta}}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$. Following the policy gradient theorem, and defining $R_t = \sum_{k=t}^{\infty} \gamma^{k-t} r_k$, the gradient of $\eta(\pi_{\theta})$ with respect to the parameter $\theta$ is given by $\nabla_{\theta}\, \eta(\pi_{\theta}) = \mathbb{E}_{\pi_{\theta}}\left[\sum_{t=0}^{\infty} \gamma^t R_t \nabla_{\theta} \log \pi_{\theta}(a_t|s_t)\right]$.
REINFORCE  is a policy gradient method that uses Monte Carlo simulation to learn $\theta$, where the parameter update is performed only at the end of an episode (a trajectory of length $T$). If we apply a look-ahead PBA scheme as in Equation (1) along with REINFORCE, then the total return from time $t$ is given by $\hat{R}_t = \sum_{k=t}^{T} \gamma^{k-t} \left(r_k + \gamma \phi(s_{k+1}, a_{k+1}) - \phi(s_k, a_k)\right) = R_t + \gamma^{T-t+1} \phi(s_{T+1}, a_{T+1}) - \phi(s_t, a_t)$.
Notice that if $\hat{R}_t$ is used in Equation (13) instead of $R_t$, then the policy gradient is biased. One way to resolve the problem is to add the difference $R_t - \hat{R}_t$ back to the return. However, this makes the learning process identical to the original REINFORCE, and PBA is not used. While using PBA in a policy gradient setup, it is important to add a term so that the policy gradient is unbiased, while still leveraging the advantage that PBA offers during learning.
To apply PBA in policy gradient, we turn to temporal difference (TD) methods. TD methods update estimates of the accumulated return based in part on other learned estimates, before the end of an episode. A popular TD-based policy gradient method is the actor-critic framework . In this setup, after performing action $a_t$ at step $t$, the accumulated return $R_t$ is estimated by $Q(s_t, a_t)$, which, in turn, is estimated by $r_t + \gamma V(s_{t+1})$. It should be noted that these estimates are unbiased.
When the reward is augmented with look-ahead PBA, the accumulated return is changed to $\hat{R}_t$, which is estimated by $r_t + \gamma \phi(s_{t+1}, a_{t+1}) - \phi(s_t, a_t) + \gamma V_{M'}(s_{t+1})$. From Equation (3), at steady state, $Q_{M'}(s, a) = Q_M(s, a) - \phi(s, a)$. Intuitively, to keep the policy gradient unbiased when augmented with look-ahead PBA, we can add $\phi(s_t, a_t)$ at each training step. In other words, we can use $r_t + \gamma \phi(s_{t+1}, a_{t+1}) + \gamma V_{M'}(s_{t+1})$ as the estimated return. It should be noted that before the policy reaches steady state, adding $\phi(s_t, a_t)$ at each time step will not cancel out the effect of PBA. This is unlike in REINFORCE, where the addition of this term negates the effect of using PBA. In the advantage actor-critic, an advantage term is used instead of the Q-function in order to reduce the variance of the estimated policy gradient. In this case also, the potential term $\phi(s_t, a_t)$ can be added in order to keep the policy gradient unbiased.
A procedure for augmenting the advantage actor-critic with PBA is presented in Algorithm AC-PBA. $\beta_{\theta}$ and $\beta_{w}$ denote learning rates for the actor and critic, respectively. When applying look-ahead PBA, at training step $t$, the parameter $w$ of the critic is updated as $w \leftarrow w + \beta_{w}\, \delta_t \nabla_w V_w(s_t)$, where $\delta_t = r_t + \gamma \phi(s_{t+1}, a_{t+1}) - \phi(s_t, a_t) + \gamma V_w(s_{t+1}) - V_w(s_t)$ is the estimation error of the state value after receiving the new reward at step $t$. To ensure an unbiased estimate of the policy gradient, the potential term $\phi(s_t, a_t)$ is added while updating the policy parameter $\theta$ as $\theta \leftarrow \theta + \beta_{\theta} \left(\delta_t + \phi(s_t, a_t)\right) \nabla_{\theta} \log \pi_{\theta}(a_t|s_t)$.
A similar method can be used when learning with look-back PBA. In this case, the critic is updated using the TD error computed on the look-back shaped reward, $\delta'_t = r_t + \phi(s_t, a_t) - \gamma^{-1} \phi(s_{t-1}, a_{t-1}) + \gamma V_w(s_{t+1}) - V_w(s_t)$. In fact, the potential term need not be added to ensure an unbiased estimate in this case. The policy parameter update becomes $\theta \leftarrow \theta + \beta_{\theta}\, \delta'_t\, \nabla_{\theta} \log \pi_{\theta}(a_t|s_t)$, which is exactly the policy update of the advantage actor-critic. This is formally stated in Proposition 3.
It is equivalent to show that:
The inner expectation is a function of , policy , and transition probability . Denoting this expectation by , we obtain:
The last equality follows from the fact that the integral $\int_{\mathcal{A}} \pi_{\theta}(a|s)\, da$ evaluates to $1$, and hence its gradient with respect to $\theta$ is $0$.
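The look-ahead AC-PBA updates described in this section can be sketched in tabular form. The state/action counts, potentials, and learning rates below are illustrative assumptions, and the paper's Algorithm AC-PBA may differ in details:

```python
import numpy as np

n_s, n_a = 3, 2
gamma, beta_actor, beta_critic = 0.99, 0.01, 0.1   # illustrative rates
theta = np.zeros((n_s, n_a))    # actor: tabular softmax parameters
w = np.zeros(n_s)               # critic: tabular state values
phi = np.ones((n_s, n_a))       # hypothetical advice potentials phi(s, a)

def pi(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def ac_pba_step(s, a, r, s2, a2):
    """One look-ahead AC-PBA update for transition (s, a, r, s2) with next action a2."""
    F = gamma * phi[s2, a2] - phi[s, a]              # look-ahead PBA term
    delta = r + F + gamma * w[s2] - w[s]             # critic TD error on the shaped reward
    w[s] += beta_critic * delta                      # critic update
    grad_log = -pi(s)                                # d log pi(a|s) / d theta[s] ...
    grad_log[a] += 1.0                               # ... for a softmax policy
    theta[s] += beta_actor * (delta + phi[s, a]) * grad_log  # phi added: unbiased gradient

ac_pba_step(s=0, a=0, r=1.0, s2=1, a2=0)
```

A positive shaped TD error moves the critic's value of the visited state up and reinforces the action taken; the added $\phi(s, a)$ term in the actor update corresponds to the bias correction discussed above.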
A1: The value function belongs to a linear family. That is, $V_w = \Psi w$, where $\Psi$ is a known full-rank feature matrix, and $w \in \mathbb{R}^d$.
A2: For the set of policies , there exists a constant such that .
A3: Learning rates of the actor and critic satisfy: $\sum_t \beta_{\theta, t} = \sum_t \beta_{w, t} = \infty$, $\sum_t \left(\beta_{\theta, t}^2 + \beta_{w, t}^2\right) < \infty$, and $\beta_{\theta, t} / \beta_{w, t} \rightarrow 0$.
For any probability measure $\mu$ on a finite set $\mathcal{X}$, the $\mu$-norm of a function $f$ with respect to $\mu$ is given by $\|f\|^2_{\mu} = \sum_{x \in \mathcal{X}} \mu(x) f(x)^2$. Theorem 1 gives a bound on the error introduced as a result of approximating the value function with $V_w$ as in assumption A1. This error term is small if the family $\{V_w\}$ is rich. In fact, if the critic is updated in batches, a tighter bound can be achieved, as shown in Proposition 1 of . Extending the result to the case of online updates is a subject of future work.
Let . Then, for any limit point of Algorithm AC-PBA, .
We consider only look-ahead PBA. The proof for look-back PBA follows similarly. Define . From assumption A3, the actor is updated at a slower rate than the critic. This allows us to fix the actor to study the asymptotic behavior of the critic . The update dynamics of the critic can be represented by:
where if look-ahead PBA is applied. When the critic is approximated by a linear function (assumption A1), will converge to , an asymptotically stable equilibrium of Equation (20). The update of the actor is then:
Now, consider the evaluation of , , in the original MDP . We obtain the following equations:
The result follows by applying assumption A2.
Look-back PBA could result in better performance compared to look-ahead PBA since look-back PBA does not involve estimating a future action.
VI Experiments
Our experiments seek to compare the performance of an actor-critic architecture augmented with PBA or with PBRS against the ‘vanilla’ advantage actor-critic (A2C). We consider two setups. The first is a Puddle-jump Gridworld, where the state and action spaces are discrete. The second environment we study is a continuous-state, continuous-action Mountain Car.
In each experiment, we compare the rewards received by the agent when it uses the following schemes: i) ‘vanilla’ A2C; ii) A2C augmented with PBRS; iii) A2C with look-ahead PBA; iv) A2C with look-back PBA.
VI-A Puddle-jump Gridworld
Figure 1 depicts the Puddle-jump gridworld environment as a 10x10 grid. The state space is the set of positions of the agent in the grid. The goal of the agent is to navigate from the start state to the goal state. At each step, the agent chooses from a set of actions that includes a jump action. There is a puddle along one row of the grid, which the agent should jump over. Further, two of the states (blue squares in Figure 1) are indistinguishable to the agent. As a result, any optimal policy for the agent is a stochastic policy.
If the jump action is chosen in one of the two rows adjacent to the puddle, the agent will land on the other side of the puddle with some fixed probability, and remain in the same state otherwise. The jump action chosen in any other row will keep the agent in its current state. Any action that would move the agent off the grid leaves its state unchanged. The agent receives a small negative reward for each action, and a positive reward for reaching the goal.
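A minimal sketch of dynamics consistent with this description follows. The puddle row index, jump success probability, start/goal placement, and reward values are illustrative assumptions rather than the paper's exact settings, and the aliasing of the two indistinguishable states is omitted:

```python
import numpy as np

SIZE, PUDDLE_ROW, P_JUMP = 10, 5, 0.5       # assumed layout parameters
GOAL, STEP_R, GOAL_R = (9, 9), -1.0, 10.0   # assumed goal position and rewards
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action, rng):
    """One transition of the gridworld; 'jump' crosses the puddle stochastically."""
    r, c = state
    if action == "jump":
        # Jumping from a row adjacent to the puddle succeeds with prob. P_JUMP;
        # otherwise (and in all other rows) the agent stays where it is.
        if r in (PUDDLE_ROW - 1, PUDDLE_ROW + 1) and rng.random() < P_JUMP:
            r = PUDDLE_ROW + 1 if r == PUDDLE_ROW - 1 else PUDDLE_ROW - 1
    else:
        dr, dc = MOVES[action]
        nr, nc = r + dr, c + dc
        # Moves off the grid or into the puddle row leave the state unchanged.
        if 0 <= nr < SIZE and 0 <= nc < SIZE and nr != PUDDLE_ROW:
            r, c = nr, nc
    next_state = (r, c)
    reward = GOAL_R if next_state == GOAL else STEP_R
    return next_state, reward, next_state == GOAL
```

An agent interacting with this environment would additionally observe the two aliased positions as a single state, which is what forces the optimal policy to be stochastic.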
When using PBRS, we set the potential $\phi(s)$ to a higher value for states in the two rows adjacent to the puddle than for all other states, in order to encourage the agent to jump over the puddle. Unlike PBRS, PBA can provide the agent with more information about the actions it can take. We set $\phi(s, a)$ to a ‘large’ value if taking action $a$ at state $s$ moves the agent closer to the goal. We additionally stipulate that the state potential of PBA, averaged under a uniform distribution over the actions, equals the state potential of PBRS. This ensures a fair comparison between PBRS and PBA.
In our experiment, we fix the discount factor $\gamma$. Since the dimensions of the state and action spaces are not large, we do not use a function approximator for the policy. A parameter $\theta(s, a)$ is associated to each state-action pair, and the policy is computed as a softmax over these parameters: $\pi_{\theta}(a|s) = \exp\left(\theta(s, a)\right) / \sum_{a'} \exp\left(\theta(s, a')\right)$. We fix the learning rates to the same values for all cases.
From Figure 2, we observe that the look-back PBA scheme performs best, in that the agent converges to the goal in about five times fewer episodes than A2C without advice. When A2C is augmented with PBRS, convergence to the goal is slightly faster than without any reward shaping. When augmented with look-ahead PBA, the reward increases faster in the first few episodes than in the case of A2C augmented with PBRS. However, this slows down after the early training stages, and the policy converges to the goal in about the same number of episodes as a policy trained without advice. A reason for this could be that during later stages of training, a look-ahead PBA scheme might advise the agent with ‘bad’ actions, leading to bad policies and thereby impeding the progress of learning. For example, an action