I Introduction
Reinforcement learning (RL) is a framework that allows an agent to complete tasks in an environment, even when a model of the environment is not known. The agent 'learns' to complete a task by maximizing its expected long-term reward, where the reward signal is supplied by the environment. RL algorithms have been successfully implemented in many fields, including robotics [1, 2] and games [3, 4]. However, it remains difficult for an RL agent to master new tasks in unseen environments. This is especially true when the reward given by the environment is sparse or significantly delayed.
It may be possible to guide an RL agent towards promising solutions faster if it is equipped with some form of prior knowledge about the environment. This knowledge can be encoded by modifying the reward signal received by the agent during training. However, the modification must be carried out in a principled manner, since providing an additional reward at each step might distract the agent from the true goal [5]. Potential-based reward shaping (PBRS) is one such method that augments the reward in an environment specified by a Markov Decision Process (MDP) with a term that is a difference of potentials [6]. This method is attractive since it easily allows for the recovery of optimal policies, while enabling the agent to learn these policies faster. Potential functions are typically functions of states. This could be a limitation, since in some cases such a function may not be able to encode all the information available in the environment. To allow for imparting more information to the agent, a potential-based advice (PBA) scheme was proposed in [7]. The potential functions in PBA include both states and actions as their arguments.
To the best of our knowledge, PBRS and PBA schemes in the literature [6, 7, 8] assume that an optimal policy is deterministic. This will not always be the case, since an optimal policy might be stochastic. This is especially true when there are states in the environment that are partially observable or indistinguishable from each other. Moreover, the aforementioned papers limit their focus to discrete state and action spaces.
In this paper, we study the addition of PBRS and PBA schemes to the reward in settings where: i) the optimal policy will be stochastic, and ii) state and action spaces may be continuous. We additionally provide guarantees on the convergence of an advantage actor-critic architecture that is augmented with a PBA scheme. We make the following contributions:

We prove that the ability of an agent to learn an optimal stochastic policy remains unaffected when soft Q-learning is augmented with PBRS.

We propose a technique for adapting PBA in policy-based methods, in order to use these schemes in environments with continuous state and action spaces.

We present an algorithm, AC-PBA, describing an advantage actor-critic architecture augmented with PBA, and provide guarantees on its convergence.

We evaluate our approach on two experimental domains: a discrete-state, discrete-action PuddleJump Gridworld that has indistinguishable states, and a continuous-state, continuous-action Mountain Car.
The remainder of this paper is organized as follows: Section II presents related work in reward shaping. Required preliminaries on RL, PBRS, and PBA are presented in Section III. Section IV presents our results on using PBRS for stochastic policy learning. We present a method to augment PBA to policy gradient frameworks, and an algorithm detailing this, in Section V. Experiments validating our approach are reported in Section VI, and we conclude the paper in Section VII.
II Related Work
Shaping or augmenting the reward received by an RL agent in order to enable it to learn optimal policies faster is an active area of research. Reward modification via human feedback was used in [9, 10] to interactively shape an agent's response so that it learned a desired behavior. However, frequent human supervision is usually costly and may not be possible in every situation. A curiosity-based RL algorithm for sparse reward environments was presented in [11], where an intrinsic reward signal characterized the prediction error of the agent as a curiosity reward. In [12], the reward received by the agent was augmented with a function that represented the number of times the agent had visited a state.
Entropy regularization as a way to encourage exploration of policies during the early stages of learning was studied in [13] and [14]. This was used to lead a policy towards states with a high reward in [15] and [16].
Static potential-based functions were shown to preserve the optimality of deterministic policies in [6]. This property was extended to dynamic potential-based functions in [8]. The authors of [17] showed that when an agent learned a policy using Q-learning, applying PBRS at each training step was equivalent to initializing the Q-function with the potentials. They studied value-based methods, but restricted their focus to learning deterministic policies. The authors of [18] demonstrated a method to transform a reward function into a potential-based function during training. The potential function in PBA was obtained using an 'experience filter' in [19].
The use of PBRS in model-based RL was studied in [20], and for episodic RL in [21]. PBRS was extended to planning in partially observable domains in [22]. However, these papers only considered the finite-horizon case. In comparison, we consider the infinite-horizon, discounted-cost setting in this paper.
In control-theoretic settings, RL algorithms have been used to establish guarantees on convergence to an optimal controller for the Linear Quadratic Regulator when a model of the underlying system was not known [23, 24]. A survey of the use of RL for control is presented in [25]. OpenAI Gym [26] provides several classical control problems that can be solved using RL algorithms.
III Preliminaries
III-A Reinforcement Learning
An MDP [27] is a tuple $(\mathcal{S}, \mathcal{A}, \mathbb{P}, R, \rho_0)$. $\mathcal{S}$ is the set of states, $\mathcal{A}$ the set of actions, and $\mathbb{P}(s'|s,a)$ encodes the probability of transitioning to state $s' \in \mathcal{S}$, given the current state $s$ and action $a$. $\rho_0$ is a probability distribution over the initial states. $R(s, a, s')$ denotes the reward that the agent receives when transitioning from $s$ to $s'$ while taking action $a$. In this paper, $R: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$. The goal for an RL agent [28] is to learn a policy $\pi$, in order to maximize $\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right]$. Here, $\gamma \in (0, 1)$ is a discounting factor, and the expectation is taken over the trajectory induced by policy $\pi$. If $\pi: \mathcal{S} \to \mathcal{A}$, the policy is deterministic. On the other hand, a randomized policy returns a probability distribution over the set of actions, and is denoted $\pi(a|s)$.
The value of a state-action pair $(s, a)$ following policy $\pi$ is represented by the Q-function, written $Q^{\pi}(s, a)$. The Q-function allows us to calculate the state value $V^{\pi}(s) = \mathbb{E}_{a \sim \pi}\left[Q^{\pi}(s, a)\right]$. The advantage of a particular action $a$ over other actions at a state $s$ is defined by $A^{\pi}(s, a) := Q^{\pi}(s, a) - V^{\pi}(s)$.
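As a small numerical illustration of these definitions (the Q-values and policy below are arbitrary), the state value and advantage follow directly from a Q-table and a stochastic policy:

```python
import numpy as np

# Q-values for two states and two actions, and a stochastic policy pi(a|s)
Q = np.array([[1.0, 3.0],
              [2.0, 2.0]])
pi = np.array([[0.25, 0.75],
               [0.5, 0.5]])

V = (pi * Q).sum(axis=1)   # V(s) = E_{a ~ pi}[Q(s, a)]
A = Q - V[:, None]         # advantage A(s, a) = Q(s, a) - V(s)

# the advantage averages to zero under the policy itself
assert np.allclose((pi * A).sum(axis=1), 0.0)
```

The advantage averaging to zero under the policy is what makes it a useful baseline-corrected learning signal in the actor-critic methods discussed later.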
III-B Value-based and Policy-based Methods
The RL problem has two general solution techniques. Value-based methods determine an optimal policy by maintaining a set of reward estimates when following a particular policy. At each state, an action that achieves the highest (expected) reward is taken. Typical value-based methods to learn greedy (deterministic) policies include Q-learning and Sarsa [28]. Recently, the authors of [29] proposed soft Q-learning, which is a value-based method that is able to learn stochastic policies. In comparison, policy-based methods directly search over the policy space [28]. Starting from an initial policy, specified by a set of parameters, these methods compute the expected reward for this policy, and update the parameter set according to certain rules to improve the policy. Policy gradient [30] is one way to achieve policy improvement. This method repeatedly computes (an estimate of) the gradient of the expected reward with respect to the policy parameters. Policy-based approaches usually exhibit better convergence properties, and can be used in continuous action spaces [31]. They can also be used to learn stochastic policies. REINFORCE and actor-critic are examples of policy gradient algorithms [28].
III-C PBRS and PBA
Reward shaping methods augment the environment reward $R$ with an additional reward $F$, where $F: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$. This changes the structure of the original MDP $M$ to $M' = (\mathcal{S}, \mathcal{A}, \mathbb{P}, R + F, \rho_0)$. The goal is to choose $F$ so that an optimal policy for $M'$, denoted $\pi^{*}_{M'}$, is also optimal for the original MDP $M$. Potential-based reward shaping (PBRS) schemes were shown to be able to preserve the optimality of deterministic policies in [6].
In PBRS, the shaping reward $F$ is defined as a difference of potentials $\Phi$. Specifically, $F(s_t, s_{t+1}) = \gamma\Phi(s_{t+1}) - \Phi(s_t)$. Then, the Q-function $Q^{*}_{M'}$ of the optimal greedy policy for $M'$ and the optimal Q-function $Q^{*}_{M}$ for $M$ are related by: $Q^{*}_{M'}(s, a) = Q^{*}_{M}(s, a) - \Phi(s)$. Therefore, the optimal greedy policy is not changed [6, 8], since:

$\arg\max_{a} Q^{*}_{M'}(s, a) = \arg\max_{a} \left[ Q^{*}_{M}(s, a) - \Phi(s) \right] = \arg\max_{a} Q^{*}_{M}(s, a)$
The authors of [7] augmented the potential function to include the action as an argument. They termed this potential-based advice (PBA). There are two forms, look-ahead PBA and look-back PBA, respectively defined by:
(1) $F(s_t, a_t, s_{t+1}, a_{t+1}) = \gamma\Phi(s_{t+1}, a_{t+1}) - \Phi(s_t, a_t)$

(2) $F(s_{t-1}, a_{t-1}, s_t, a_t) = \Phi(s_t, a_t) - \gamma^{-1}\Phi(s_{t-1}, a_{t-1})$
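For concreteness, the PBRS and PBA shaping terms can be written as small helper functions. This is a sketch assuming tabular potentials, an assumed discount factor, and the standard look-ahead and look-back forms of [7]:

```python
import numpy as np

gamma = 0.99  # discount factor (assumed value, for illustration)

def pbrs(phi_s, s, s_next):
    # PBRS: F(s, s') = gamma * Phi(s') - Phi(s)
    return gamma * phi_s[s_next] - phi_s[s]

def lookahead_pba(phi_sa, s, a, s_next, a_next):
    # look-ahead PBA, Eq. (1): F = gamma * Phi(s', a') - Phi(s, a)
    return gamma * phi_sa[s_next, a_next] - phi_sa[s, a]

def lookback_pba(phi_sa, s_prev, a_prev, s, a):
    # look-back PBA, Eq. (2): F = Phi(s, a) - (1/gamma) * Phi(s_prev, a_prev)
    return phi_sa[s, a] - phi_sa[s_prev, a_prev] / gamma
```

At training time, the relevant shaping term is simply added to the environment reward before the agent's update.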
For the look-ahead PBA scheme, the state-action value function for following policy $\pi$ is given by:

(3) $Q^{\pi}_{M'}(s, a) = Q^{\pi}_{M}(s, a) - \Phi(s, a)$

The optimal greedy policy for $M$ can be recovered from the optimal state-action value function for $M'$ from:

(4) $\pi^{*}_{M}(s) = \arg\max_{a} \left[ Q^{*}_{M'}(s, a) + \Phi(s, a) \right]$
The optimal greedy policy for $M$ using look-back PBA can be recovered similarly.
IV PBRS for Stochastic Policy Learning
The existing literature on PBRS has focused on augmenting value-based methods to learn optimal deterministic policies. In this section, we first show that PBRS preserves optimality when the optimal policy is stochastic. Then, we show that the learnability will not be changed when using PBRS in soft Q-learning.
Proposition 1
Assume that the optimal policy is stochastic. Then, with $F(s_t, s_{t+1}) = \gamma\Phi(s_{t+1}) - \Phi(s_t)$, PBRS preserves the optimality of stochastic policies.
The goal in the original MDP $M$ was to find a policy $\pi$ in order to maximize:

(5) $\eta_{M}(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t, s_{t+1})\right]$

In PBRS, the goal is to determine a policy $\pi$ so that:

(6) $\eta_{M'}(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} \left( R(s_t, a_t, s_{t+1}) + \gamma\Phi(s_{t+1}) - \Phi(s_t) \right)\right] = \eta_{M}(\pi) - \mathbb{E}_{s_0 \sim \rho_0}\left[\Phi(s_0)\right]$

The last term in Equation (6) is constant, and does not affect the identity of the maximizing policy of (5).
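The constancy of this term can be seen from a telescoping argument. Assuming bounded potentials $\Phi$ and $\gamma < 1$, the shaping rewards accumulate as:

```latex
\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} F(s_t, s_{t+1})\right]
  = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t+1}\,\Phi(s_{t+1})
    - \sum_{t=0}^{\infty} \gamma^{t}\,\Phi(s_t)\right]
  = -\,\mathbb{E}_{s_0 \sim \rho_0}\left[\Phi(s_0)\right]
```

since every potential term other than $\Phi(s_0)$ appears once with each sign. The shaped objective therefore differs from the original objective by a constant that does not depend on the policy.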
Next, we examine the effect on learnability when using PBRS with soft Q-learning. Soft Q-learning is a value-based method for stochastic policy learning that was proposed in [29]. Different from Equation (5), the goal is to maximize both the accumulated reward and the policy entropy at each visited state:
(7) $\pi^{*} = \arg\max_{\pi} \sum_{t} \mathbb{E}_{\pi}\left[ r_t + \alpha \mathcal{H}\left(\pi(\cdot | s_t)\right) \right]$
The entropy term $\mathcal{H}$ encourages exploration of the state space, and the parameter $\alpha$ controls the trade-off between exploitation and exploration.
Before stating our result, we summarize the soft Q-learning update procedure. From [29], the optimal soft value function, $V^{*}_{\mathrm{soft}}$, is given by:

(8) $V^{*}_{\mathrm{soft}}(s) = \alpha \log \sum_{a \in \mathcal{A}} \exp\left( \tfrac{1}{\alpha} Q^{*}_{\mathrm{soft}}(s, a) \right)$

The optimal soft Q-function is determined by solving the soft Bellman equation:

(9) $Q^{*}_{\mathrm{soft}}(s, a) = \mathbb{E}\left[ r_t + \gamma V^{*}_{\mathrm{soft}}(s_{t+1}) \right]$

The optimal policy can be obtained from Equation (9) as:

(10) $\pi^{*}(a | s) = \exp\left( \tfrac{1}{\alpha} \left( Q^{*}_{\mathrm{soft}}(s, a) - V^{*}_{\mathrm{soft}}(s) \right) \right)$
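A minimal sketch of Equations (8) and (10) for a single state with discrete actions; here `alpha` denotes the entropy temperature, and the Q-values are arbitrary illustrative numbers:

```python
import numpy as np

def soft_policy(q_row, alpha=1.0):
    """Soft state value and soft-optimal policy for one state,
    given a row of Q-values over the discrete actions."""
    v = alpha * np.log(np.sum(np.exp(q_row / alpha)))  # soft value, Eq. (8)
    pi = np.exp((q_row - v) / alpha)                   # soft policy, Eq. (10)
    return pi, v

pi, v = soft_policy(np.array([1.0, 2.0, 0.5]))
assert np.isclose(pi.sum(), 1.0)  # a proper probability distribution
```

As `alpha` decreases, the resulting distribution concentrates on the greedy action, recovering the deterministic limit.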
In the rest of this section, we assume that both states and actions are discrete, and that no function approximator is used. We also omit the subscript 'soft' on $Q$ and $V$, and set $\alpha = 1$ for simplicity. From Equation (9), and as in Q-learning, soft Q-learning updates the soft Q-function by minimizing the soft Bellman error:

(11) $\delta_t = r_t + \gamma V(s_{t+1}) - Q(s_t, a_t)$

where $r_t = R(s_t, a_t, s_{t+1})$. During training, $V(s_{t+1}) = \log \sum_{a} \exp Q(s_{t+1}, a)$. With $\beta$ denoting the learning rate, the Q-function update is given by:

(12) $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \beta \delta_t$
The main result of this section shows that the ability of an agent to learn an optimal policy is unaffected when using soft Qlearning augmented with PBRS. We define a notion of learnability, and use this to establish our claim.
During training, an agent encounters a sequence of states, actions, and rewards that serves as 'raw data' which is fed to the RL algorithm. Let $\mathsf{A}$ and $\mathsf{B}$ denote two RL agents. Let $e^{\mathsf{A}}_t$ and $e^{\mathsf{B}}_t$ denote the experience tuple at learning step $t$ from a trajectory used by $\mathsf{A}$ and $\mathsf{B}$, respectively.
Definition 1 (Learnability)
Denote the accumulated difference in the Q-functions of $\mathsf{A}$ and $\mathsf{B}$ after learning for $t$ steps by $\Delta Q^{\mathsf{A}}_t$ and $\Delta Q^{\mathsf{B}}_t$, respectively. Then, given identical sample experiences (that is, $e^{\mathsf{A}}_t = e^{\mathsf{B}}_t$ for all $t$), $\mathsf{A}$ and $\mathsf{B}$ are said to have the same learnability if $\Delta Q^{\mathsf{A}}_t = \Delta Q^{\mathsf{B}}_t$ for all $t$.
Proposition 2
Soft Q-learning, with initial soft Q-values $Q_0(s, a)$ and augmented with PBRS with state potential $\Phi(s)$, has the same learnability as soft Q-learning without PBRS but with its soft Q-values initialized to $Q_0(s, a) + \Phi(s)$.
Consider an agent $\mathsf{A}$ that uses a PBRS scheme during learning, and an agent $\mathsf{B}$ that does not use PBRS, but has its soft Q-values initialized as $Q^{\mathsf{B}}_0(s, a) = Q^{\mathsf{A}}_0(s, a) + \Phi(s)$, where $Q^{\mathsf{A}}_0$ is the initial Q-value of $\mathsf{A}$. We further assume that $\mathsf{A}$ and $\mathsf{B}$ adopt the same learning rate. From Definition 1, to show that $\mathsf{A}$ and $\mathsf{B}$ have the same learnability, we need to show that the soft Bellman errors $\delta^{\mathsf{A}}_t$ and $\delta^{\mathsf{B}}_t$ are equal at each training step $t$, given the same experience sets $e^{\mathsf{A}}_t = e^{\mathsf{B}}_t$. From Equation (11), the soft Bellman errors for $\mathsf{A}$ and $\mathsf{B}$ can be respectively written as:

$\delta^{\mathsf{A}}_t = r_t + \gamma\Phi(s_{t+1}) - \Phi(s_t) + \gamma V^{\mathsf{A}}(s_{t+1}) - Q^{\mathsf{A}}(s_t, a_t)$

$\delta^{\mathsf{B}}_t = r_t + \gamma V^{\mathsf{B}}(s_{t+1}) - Q^{\mathsf{B}}(s_t, a_t)$

Since $r_t$ is identical for each $t$, comparing $\delta^{\mathsf{A}}_t$ and $\delta^{\mathsf{B}}_t$ reduces to comparing the remaining terms. We show this by induction.

At training step $t = 0$ there is no update. Thus, $Q^{\mathsf{B}}_0(s, a) = Q^{\mathsf{A}}_0(s, a) + \Phi(s)$. Assume that the Bellman errors are identical up to a step $t = k$. Then, the accumulated updates for the two agents until this step are also identical, so that $Q^{\mathsf{B}}_k(s, a) = Q^{\mathsf{A}}_k(s, a) + \Phi(s)$. Consider training step $k + 1$. The state values at this step are $V^{\mathsf{B}}(s) = \log \sum_{a} \exp\left( Q^{\mathsf{A}}_k(s, a) + \Phi(s) \right) = V^{\mathsf{A}}(s) + \Phi(s)$. The Bellman errors at $k + 1$ satisfy:

$\delta^{\mathsf{B}}_{k+1} = r_t + \gamma\left( V^{\mathsf{A}}(s_{t+1}) + \Phi(s_{t+1}) \right) - Q^{\mathsf{A}}_k(s_t, a_t) - \Phi(s_t) = \delta^{\mathsf{A}}_{k+1}$

It follows that $\delta^{\mathsf{A}}_t = \delta^{\mathsf{B}}_t$ for all $t$.
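The argument above can be checked numerically. The sketch below runs a tabular soft Q-learning update (with the temperature set to 1) for a shaped agent and for an agent whose Q-table is initialized with the potentials, on an identical stream of random experience; all numerical values are arbitrary:

```python
import numpy as np

def soft_value(Q, s):
    # soft state value with alpha = 1: V(s) = log sum_a exp Q(s, a)
    return np.log(np.sum(np.exp(Q[s])))

def soft_q_update(Q, s, a, r, s_next, beta, gamma):
    # tabular soft Q-learning update in the style of Eqs. (11)-(12)
    delta = r + gamma * soft_value(Q, s_next) - Q[s, a]
    Q[s, a] += beta * delta
    return delta

rng = np.random.default_rng(0)
nS, nA, gamma, beta = 4, 3, 0.9, 0.5
phi = rng.normal(size=nS)            # state potentials (arbitrary)
Q_A = rng.normal(size=(nS, nA))      # agent A: learns with PBRS
Q_B = Q_A + phi[:, None]             # agent B: potentials folded into the init

for _ in range(200):
    s, a, s2 = rng.integers(nS), rng.integers(nA), rng.integers(nS)
    r = rng.normal()
    # A sees the PBRS-shaped reward; B sees the raw reward
    dA = soft_q_update(Q_A, s, a, r + gamma * phi[s2] - phi[s], s2, beta, gamma)
    dB = soft_q_update(Q_B, s, a, r, s2, beta, gamma)
    assert np.isclose(dA, dB)        # identical soft Bellman errors

assert np.allclose(Q_B, Q_A + phi[:, None])  # the potential shift is preserved
```

The two agents produce the same Bellman error at every step, and the invariant $Q^{\mathsf{B}} = Q^{\mathsf{A}} + \Phi$ is maintained throughout training, as the induction predicts.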
Remark 1
If the Qfunction is represented by a function approximator (as is typical for continuous action spaces), then Proposition 2 may not hold. This is because the Qfunction in this scenario is updated using gradient descent, instead of Equation (12). Gradient descent is sensitive to initialization. Thus, different initial values will result in different updates of the Qfunction.
V PBA for Stochastic Policy Learning
Although PBRS can preserve the optimality of policies in several settings, it suffers from the drawback of being unable to encode richer information, such as desired relations between states and actions. The authors of [7] proposed potential-based advice (PBA), a scheme that augments the potential function by including actions as an argument together with states. In this section, we show that while using PBA, recovering the optimal policy can be difficult if the optimal policy is stochastic. Then, we propose a novel way to impart prior information in order to learn a stochastic policy with PBA.
V-A Stochastic Policy Learning with PBA
Assume that we can compute $Q^{*}_{M}(s, a)$, the optimal value for state-action pair $(s, a)$ in MDP $M$. The optimal stochastic policy for $M$ is $\pi^{*}_{M}(a | s) = \exp\left( Q^{*}_{M}(s, a) - V^{*}_{M}(s) \right)$. From Equation (3), the optimal stochastic policy for the modified MDP $M'$ that has its reward augmented with PBA is given by $\pi^{*}_{M'}(a | s) \propto \exp\left( Q^{*}_{M}(s, a) - \Phi(s, a) \right)$. Without loss of generality, we take $\alpha = 1$. If the optimal policy is deterministic, then the policy for $M$ can be recovered easily from that for $M'$ using Equation (4). However, when it is stochastic, we need to average over trajectories in the MDP, which makes it difficult to recover the optimal policy for $M$ from that of $M'$.
In the sequel, we will propose a novel way to take advantage of PBA in the policy gradient framework in order to directly learn a stochastic policy.
V-B Imparting PBA in Policy Gradient
Let $\eta(\pi_\theta)$ denote the value of a parameterized policy $\pi_\theta$ in MDP $M$. That is, $\eta(\pi_\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right]$. Following the policy gradient theorem [28], and defining $G_t := \sum_{k=t}^{\infty} \gamma^{k-t} r_k$, the gradient of $\eta$ with respect to the parameter $\theta$ is given by:

(13) $\nabla_\theta \eta(\pi_\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^{t} G_t \nabla_\theta \log \pi_\theta(a_t | s_t)\right]$

Then, $\mathbb{E}_{\pi_\theta}\left[ G_t \mid s_t, a_t \right] = Q^{\pi_\theta}(s_t, a_t)$.
REINFORCE [28] is a policy gradient method that uses Monte Carlo simulation to learn $\theta$, where the parameter update is performed only at the end of an episode (a trajectory of length $T$). If we apply a look-ahead PBA scheme as in Equation (1) along with REINFORCE, then the total return from time $t$ is given by:

(14) $\hat{G}_t = \sum_{k=t}^{T-1} \gamma^{k-t} \left[ r_k + \gamma\Phi(s_{k+1}, a_{k+1}) - \Phi(s_k, a_k) \right] = G_t + \gamma^{T-t}\Phi(s_T, a_T) - \Phi(s_t, a_t)$

Notice that if $\hat{G}_t$ is used in Equation (13) instead of $G_t$, then the policy gradient is biased. One way to resolve the problem is to add the difference $G_t - \hat{G}_t$ to $\hat{G}_t$. However, this makes the learning process identical to the original REINFORCE, and PBA is not used. While using PBA in a policy gradient setup, it is important to add a term so that the policy gradient is unbiased, while also leveraging the advantage that PBA offers during learning.
To apply PBA in policy gradient, we turn to temporal difference (TD) methods. TD methods update estimates of the accumulated return based in part on other learned estimates, before the end of an episode. A popular TD-based policy gradient method is the actor-critic framework [28]. In this setup, after performing action $a_t$ at step $t$, the accumulated return $G_t$ is estimated by $Q(s_t, a_t)$, which, in turn, is estimated by $r_t + \gamma V(s_{t+1})$. It should be noted that these estimates are unbiased.
When the reward is augmented with look-ahead PBA, the accumulated return is changed to $\hat{G}_t$, which is estimated by $r_t + \gamma\Phi(s_{t+1}, a_{t+1}) - \Phi(s_t, a_t) + \gamma V(s_{t+1})$. From Equation (3), at steady state, $Q_{M'}(s, a) = Q_{M}(s, a) - \Phi(s, a)$. Intuitively, to keep the policy gradient unbiased when augmented with look-ahead PBA, we can add $\Phi(s_t, a_t)$ at each training step. In other words, we can use $r_t + \gamma\Phi(s_{t+1}, a_{t+1}) + \gamma V(s_{t+1})$ as the estimated return. It should be noted that before the policy reaches steady state, adding $\Phi(s_t, a_t)$ at each time step will not cancel out the effect of PBA. This is unlike in REINFORCE, where the addition of this term negates the effect of using PBA. In the advantage actor-critic, an advantage term is used instead of the Q-function in order to reduce the variance of the estimated policy gradient. In this case also, the potential term $\Phi(s_t, a_t)$ can be added in order to keep the policy gradient unbiased.

A procedure for augmenting the advantage actor-critic with PBA is presented in Algorithm AC-PBA. $\beta_a$ and $\beta_c$ denote learning rates for the actor and critic, respectively. When applying look-ahead PBA, at training step $t$, the parameter $w$ of the critic is updated as follows:

$w \leftarrow w + \beta_c \delta_t \nabla_w V_w(s_t), \quad \delta_t = r_t + \gamma\Phi(s_{t+1}, a_{t+1}) - \Phi(s_t, a_t) + \gamma V_w(s_{t+1}) - V_w(s_t)$

where $\delta_t$ is the estimation error of the state value after receiving the new reward at step $t$. To ensure an unbiased estimate of the policy gradient, the potential term $\Phi(s_t, a_t)$ is added while updating the policy parameter $\theta$ as:

$\theta \leftarrow \theta + \beta_a \left( \delta_t + \Phi(s_t, a_t) \right) \nabla_\theta \log \pi_\theta(a_t | s_t)$

A similar method can be used when learning with look-back PBA. In this case, the critic and the policy parameter are updated as follows:
(15) $w \leftarrow w + \beta_c \delta_t \nabla_w V_w(s_t), \quad \delta_t = r_t + \Phi(s_t, a_t) - \gamma^{-1}\Phi(s_{t-1}, a_{t-1}) + \gamma V_w(s_{t+1}) - V_w(s_t)$

In fact, the potential term need not be added to ensure an unbiased estimate in this case. Then, the policy parameter update becomes:

(16) $\theta \leftarrow \theta + \beta_a \delta_t \nabla_\theta \log \pi_\theta(a_t | s_t)$

which is exactly the policy update of the advantage actor-critic. This is formally stated in Proposition 3.
Proposition 3

When the reward is augmented with look-back PBA, the policy update in Equation (16) yields an unbiased estimate of the policy gradient.

It is equivalent to show that:

(18) $\mathbb{E}_{\pi_\theta}\left[ \gamma^{-1}\Phi(s_{t-1}, a_{t-1}) \nabla_\theta \log \pi_\theta(a_t | s_t) \right] = 0$

The inner expectation is a function of $(s_{t-1}, a_{t-1})$, the policy $\pi_\theta$, and the transition probability $\mathbb{P}$. Writing this expectation out, we obtain:

(19) $\mathbb{E}\left[ \gamma^{-1}\Phi(s_{t-1}, a_{t-1}) \int_{\mathcal{A}} \pi_\theta(a | s_t) \nabla_\theta \log \pi_\theta(a | s_t)\, da \right] = \mathbb{E}\left[ \gamma^{-1}\Phi(s_{t-1}, a_{t-1}) \nabla_\theta \int_{\mathcal{A}} \pi_\theta(a | s_t)\, da \right] = 0$

The last equality follows from the fact that the integral evaluates to $1$, and its gradient is therefore $0$.
The main result of this paper presents guarantees on the convergence of Algorithm AC-PBA, using the theory of two-timescale stochastic approximation [32]. We assume that:

A1: The value function belongs to a linear family. That is, $V_w = \Psi w$, where $\Psi$ is a known full-rank feature matrix, and $w$ is the weight vector of the critic.

A2: For the set of policies , there exists a constant such that .

A3: Learning rates of the actor and critic satisfy: $\sum_t \beta_{a,t} = \sum_t \beta_{c,t} = \infty$, $\sum_t \left( \beta_{a,t}^2 + \beta_{c,t}^2 \right) < \infty$, and $\beta_{a,t} / \beta_{c,t} \to 0$.
For any probability measure $\mu$ on a finite set $\mathcal{X}$, the norm of a function $f$ with respect to $\mu$ is given by $\|f\|_{\mu}^{2} = \sum_{x \in \mathcal{X}} \mu(x) f(x)^{2}$. Theorem 1 gives a bound on the error introduced as a result of approximating the value function with $\Psi w$ as in assumption A1. This error term is small if the linear family is rich. In fact, if the critic is updated in batches, a tighter bound can be achieved, as shown in Proposition 1 of [33]. Extending the result to the case of online updates is a subject of future work.
Theorem 1
Let . Then, for any limit point of Algorithm AC-PBA, .
We consider only look-ahead PBA; the proof for look-back PBA follows similarly. From assumption A3, the actor is updated at a slower rate than the critic. This allows us to fix the actor while studying the asymptotic behavior of the critic [34]. The update dynamics of the critic can be represented by:
(20) 
where the shaped reward $r_t + \gamma\Phi(s_{t+1}, a_{t+1}) - \Phi(s_t, a_t)$ is used if look-ahead PBA is applied. When the critic is approximated by a linear function (assumption A1), the critic parameter will converge to an asymptotically stable equilibrium of Equation (20). The update of the actor is then:
(21) 
Let $\Theta$ denote the set of asymptotically stable equilibria of Equation (21). Any $\theta \in \Theta$ will satisfy the stationarity condition in Equation (21). Then, the actor parameter will converge to $\Theta$.
Now, consider the evaluation of the policies $\pi_\theta$, $\theta \in \Theta$, in the original MDP $M$. We obtain the following equations:

(22)
Subtracting Equation (21) from Equation (22), and applying the Cauchy-Schwarz inequality to the result yields:
The result follows by applying assumption A2.
Remark 2
Look-back PBA could result in better performance compared to look-ahead PBA, since look-back PBA does not involve estimating a future action.
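One training step of the look-ahead variant described in this section can be sketched as follows. The tabular critic, tabular softmax actor, and advice table `phi` are illustrative choices of this sketch, not the paper's exact implementation, and the learning-rate values are assumptions:

```python
import numpy as np

gamma, beta_c, beta_a = 0.99, 0.1, 0.01  # assumed hyperparameters

def ac_pba_step(V, theta, phi, s, a, r, s2, a2):
    """One advantage actor-critic step with look-ahead PBA."""
    # look-ahead PBA shaping of the reward
    shaped_r = r + gamma * phi[s2, a2] - phi[s, a]
    # critic TD error on the shaped reward
    delta = shaped_r + gamma * V[s2] - V[s]
    V[s] += beta_c * delta
    # softmax actor: gradient of log pi(a|s) w.r.t. the logits theta[s, :]
    pi_s = np.exp(theta[s] - theta[s].max())
    pi_s /= pi_s.sum()
    grad_log_pi = -pi_s
    grad_log_pi[a] += 1.0
    # adding phi(s, a) keeps the policy-gradient estimate unbiased
    theta[s] += beta_a * (delta + phi[s, a]) * grad_log_pi
    return delta
```

For the look-back variant, the shaping term in `shaped_r` would use the previous state-action pair, and the `phi[s, a]` correction in the actor update would be dropped, per Equation (16).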
VI Experiments
Our experiments compare the performance of an actor-critic architecture augmented with PBA and with PBRS against the 'vanilla' advantage actor-critic (A2C). We consider two setups. The first is a PuddleJump Gridworld [35], where the state and action spaces are discrete. The second is a Mountain Car with continuous state and action spaces [26].
In each experiment, we compare the rewards received by the agent under the following schemes: i) 'vanilla' A2C; ii) A2C augmented with PBRS; iii) A2C with look-ahead PBA; and iv) A2C with look-back PBA.
VI-A PuddleJump Gridworld
Figure 1 depicts the PuddleJump Gridworld environment as a 10x10 grid. The state is the position $(x, y)$ of the agent in the grid. The goal of the agent is to navigate from the start state to the goal. At each step, the agent chooses from a set of actions that includes moves in the four cardinal directions and a jump action. There is a puddle along one row of the grid, which the agent should jump over. Further, two states (blue squares in Figure 1) are indistinguishable to the agent. As a result, any optimal policy for the agent is a stochastic policy.

If the jump action is chosen in one of the rows bordering the puddle, the agent will land on the other side of the puddle with some fixed probability, and remain in the same state otherwise. The jump action chosen in any other row will keep the agent in its current state. Any action that would move the agent off the grid leaves its state unchanged. The agent receives a small negative reward for each action, and a positive reward for reaching the goal.
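A step function for such an environment might look as follows. The grid size comes from the description above, but the puddle row, the jump success probability, and the reward values are placeholder assumptions, since the paper's exact settings are not reproduced here:

```python
import numpy as np

N, PUDDLE_ROW, JUMP_P = 10, 5, 0.5  # puddle row and jump prob. are assumed
GOAL = (9, 9)                       # assumed goal position

def step(state, action, rng):
    """One transition of a PuddleJump-style gridworld."""
    x, y = state
    if action == "jump":
        # jump clears the puddle only from the adjacent rows, w.p. JUMP_P
        if x in (PUDDLE_ROW - 1, PUDDLE_ROW + 1) and rng.random() < JUMP_P:
            x = PUDDLE_ROW + 1 if x == PUDDLE_ROW - 1 else PUDDLE_ROW - 1
    else:
        dx, dy = {"up": (-1, 0), "down": (1, 0),
                  "left": (0, -1), "right": (0, 1)}[action]
        nx, ny = x + dx, y + dy
        # moves off the grid or into the puddle row leave the state unchanged
        if 0 <= nx < N and 0 <= ny < N and nx != PUDDLE_ROW:
            x, y = nx, ny
    r = 10.0 if (x, y) == GOAL else -1.0  # assumed reward values
    return (x, y), r
```

The per-step negative reward makes shorter paths preferable, which is what the potential functions below are designed to encourage.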
When using PBRS, we set the potential $\Phi(s)$ to a higher value for states in the rows bordering the puddle than for all other states, since we need to encourage the agent to jump over the puddle. Unlike PBRS, PBA can provide the agent with more information about the actions it can take. We set $\Phi(s, a)$ to a 'large' value if taking action $a$ at state $s$ moves the agent closer to the goal, measured by a norm-based distance. We additionally stipulate that the state potential of PBA under a uniform distribution over the actions is the same as the state potential of PBRS; that is, $\frac{1}{|\mathcal{A}|}\sum_{a} \Phi(s, a) = \Phi(s)$. This is to ensure a fair comparison between PBRS and PBA.
In our experiment, we fix the discount factor $\gamma$. Since the dimensions of the state and action spaces are not large, we do not use a function approximator for the policy. A parameter $\theta_{s,a}$ is associated to each state-action pair, and the policy is computed as: $\pi_\theta(a | s) = \exp(\theta_{s,a}) / \sum_{a'} \exp(\theta_{s,a'})$. We fix the learning rates of the actor and the critic across all cases.
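The tabular softmax policy described above can be implemented directly; the state and action counts below are assumptions for illustration:

```python
import numpy as np

def tabular_softmax_policy(theta, s):
    # pi(a|s) proportional to exp(theta[s, a]); max-shifted for stability
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

theta = np.zeros((100, 5))   # e.g., 100 grid states, 5 actions (assumed)
pi = tabular_softmax_policy(theta, 0)
assert np.allclose(pi, 0.2)  # uniform over actions at a zero initialization
```

The max-shift before exponentiation does not change the resulting distribution, but avoids numerical overflow as the logits grow during training.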
From Figure 2, we observe that the look-back PBA scheme performs the best, in that the agent converges to the goal in about five times fewer episodes than A2C without advice. When A2C is augmented with PBRS, convergence to the goal is slightly faster than without any reward shaping. When augmented with look-ahead PBA, in the first few episodes, the reward increases faster than in the case of A2C augmented with PBRS. However, this slows down after the early training stages, and the policy converges to the goal in about the same number of episodes as a policy trained without advice. A reason for this could be that during later stages of training, a look-ahead PBA scheme might advise an agent with 'bad' actions, leading to bad policies, thereby impeding the progress of learning. For example, an action