1 Introduction
In reinforcement learning an agent explores an environment and, through the use of a reward signal, learns to optimize its behavior to maximize the expected long-term return. Reinforcement learning has seen success in several areas including robotics (Lin, 1993; Levine et al., 2015), computer games (Mnih et al., 2013, 2015), online advertising (Pednault et al., 2002), board games (Tesauro, 1995; Silver et al., 2016), and many others. For an introduction to reinforcement learning we refer to the classic text by Sutton & Barto (1998). In this paper we consider model-free reinforcement learning, where the state-transition function is not known or learned. There are many different algorithms for model-free reinforcement learning, but most fall into one of two families: action-value fitting and policy gradient techniques.
Action-value techniques involve fitting a function, called the Q-values, that captures the expected return for taking a particular action at a particular state, and then following a particular policy thereafter. Two alternatives we discuss in this paper are SARSA (Rummery & Niranjan, 1994) and Q-learning (Watkins, 1989), although there are many others. SARSA is an on-policy algorithm whereby the action-value function is fit to the current policy, which is then refined by being mostly greedy with respect to those action-values. On the other hand, Q-learning attempts to find the Q-values associated with the optimal policy directly and does not fit to the policy that was used to generate the data. Q-learning is an off-policy algorithm that can use data generated by another agent or from a replay buffer of old experience. Under certain conditions both SARSA and Q-learning can be shown to converge to the optimal Q-values, from which we can derive the optimal policy (Sutton, 1988; Bertsekas & Tsitsiklis, 1996).
In policy gradient techniques the policy is represented explicitly and we improve the policy by updating the parameters in the direction of the gradient of the performance (Sutton et al., 1999; Silver et al., 2014; Kakade, 2001). Online policy gradient typically requires an estimate of the action-value function of the current policy. For this reason such methods are often referred to as actor-critic methods, where the actor refers to the policy and the critic to the estimate of the action-value function (Konda & Tsitsiklis, 2003). Vanilla actor-critic methods are on-policy only, although some attempts have been made to extend them to off-policy data (Degris et al., 2012; Levine & Koltun, 2013).
In this paper we derive a link between the Q-values induced by a policy and the policy itself when the policy is the fixed point of a regularized policy gradient algorithm (where the gradient vanishes). This connection allows us to derive an estimate of the Q-values from the current policy, which we can refine using off-policy data and Q-learning. We show in the tabular setting that when the regularization penalty is small (the usual case) the resulting policy is close to the policy that would be found without the addition of the Q-learning update. Separately, we show that regularized actor-critic methods can be interpreted as action-value fitting methods, where the Q-values have been parameterized in a particular way. We conclude with some numerical examples that provide empirical evidence of improved data efficiency and stability of the resulting method, which we call PGQL.
1.1 Prior work
Here we highlight various axes along which our work can be compared to others. In this paper we use entropy regularization to ensure exploration in the policy, which is a common practice in policy gradient (Williams & Peng, 1991; Mnih et al., 2016). An alternative is to use KL-divergence instead of entropy as a regularizer, or as a constraint on how much deviation is permitted from a prior policy (Bagnell & Schneider, 2003; Peters et al., 2010; Schulman et al., 2015; Fox et al., 2015). Natural policy gradient can also be interpreted as putting a constraint on the KL-divergence at each step of the policy improvement (Amari, 1998; Kakade, 2001; Pascanu & Bengio, 2013). In Sallans & Hinton (2004) the authors use a Boltzmann exploration policy over estimated Q-values which they update using TD-learning. In Heess et al. (2012) this was extended to use an actor-critic algorithm instead of TD-learning; however, the two updates were not combined as we have done in this paper. In Azar et al. (2012) the authors develop an algorithm called dynamic policy programming, whereby they apply a Bellman-like update to the action-preferences of a policy, which is similar in spirit to the update we describe here. In Norouzi et al. (2016) the authors augment a maximum likelihood objective with a reward in a supervised learning setting, and develop a connection that resembles the one we develop here between the policy and the Q-values. Other works have attempted to combine on- and off-policy learning, primarily using action-value fitting methods (Wang et al., 2013; Hausknecht & Stone, 2016; Lehnert & Precup, 2015), with varying degrees of success.
In this paper we establish a connection between actor-critic algorithms and action-value learning algorithms. In particular we show that TD-actor-critic (Konda & Tsitsiklis, 2003) is equivalent to expected-SARSA (Sutton & Barto, 1998, Exercise 6.10) with Boltzmann exploration where the Q-values are decomposed into advantage function and value function. The algorithm we develop extends actor-critic with a Q-learning style update that, due to the decomposition of the Q-values, resembles the update of the dueling architecture (Wang et al., 2016).
Recently, the field of deep reinforcement learning, i.e., the use of deep neural networks to represent action-values or a policy, has seen a lot of success (Mnih et al., 2015, 2016; Silver et al., 2016; Riedmiller, 2005; Lillicrap et al., 2015; Van Hasselt et al., 2016). In the examples section we use a neural network with PGQL to play the Atari games suite.
2 Reinforcement Learning
We consider the infinite horizon, discounted, finite state and action space Markov decision process, with state space $\mathcal{S}$, action space $\mathcal{A}$ and rewards at each time period denoted by $r_t \in \mathbb{R}$. A policy $\pi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}_+$ is a mapping from state-action pair to the probability of taking that action at that state, so it must satisfy $\sum_{a \in \mathcal{A}} \pi(s,a) = 1$ for all states $s \in \mathcal{S}$. Any policy $\pi$ induces a probability distribution over visited states, $d^\pi : \mathcal{S} \to \mathbb{R}_+$ (which may depend on the initial state), so the probability of seeing state-action pair $(s,a) \in \mathcal{S} \times \mathcal{A}$ is $d^\pi(s)\pi(s,a)$.
In reinforcement learning an ‘agent’ interacts with an environment over a number of time steps. At each time step $t$ the agent receives a state $s_t$ and a reward $r_t$ and selects an action $a_t$ from the policy $\pi_t$, at which point the agent moves to the next state $s_{t+1} \sim P(\cdot, s_t, a_t)$, where $P(s', s, a)$ is the probability of transitioning from state $s$ to state $s'$ after taking action $a$. This continues until the agent encounters a terminal state (after which the process is typically restarted). The goal of the agent is to find a policy $\pi$ that maximizes the expected total discounted return $J(\pi) = \mathbb{E}\left(\sum_{t=0}^{\infty} \gamma^t r_t \mid \pi\right)$, where the expectation is with respect to the initial state distribution, the state-transition probabilities, and the policy, and where $\gamma \in (0,1)$ is the discount factor that, loosely speaking, controls how much the agent prioritizes long-term versus short-term rewards. Since the agent starts with no knowledge of the environment it must continually explore the state space and so will typically use a stochastic policy.
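Action-values.
The action-value, or Q-value, of a particular state-action pair $(s,a)$ under policy $\pi$ is the expected total discounted return from taking that action at that state and following $\pi$ thereafter, i.e., $Q^\pi(s,a) = \mathbb{E}\left(\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right)$. The value of state $s$ under policy $\pi$ is denoted by $V^\pi(s) = \mathbb{E}\left(\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, \pi\right)$, which is the expected total discounted return of policy $\pi$ from state $s$. The optimal action-value function is denoted $Q^\star$ and satisfies $Q^\star(s,a) = \max_\pi Q^\pi(s,a)$ for each $(s,a)$. The policy that achieves the maximum is the optimal policy $\pi^\star$, with value function $V^\star$. The advantage function is the difference between the action-value and the value function, i.e., $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$, and represents the additional expected reward of taking action $a$ over the average performance of the policy from state $s$. Since $V^\pi(s) = \sum_a \pi(s,a) Q^\pi(s,a)$ we have the identity $\sum_a \pi(s,a) A^\pi(s,a) = 0$, which simply states that the policy $\pi$ has no advantage over itself.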
Bellman equation.
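The Bellman operator $\mathcal{T}^\pi$ (Bellman, 1957) for policy $\pi$ is defined as
$$\mathcal{T}^\pi Q(s,a) = \mathbb{E}_{s',r,b}\left(r(s,a) + \gamma Q(s',b)\right),$$
where the expectation is over next state $s' \sim P(\cdot, s, a)$, the reward $r(s,a)$, and the action $b$ from policy $\pi(s', \cdot)$. The Q-value function for policy $\pi$ is the fixed point of the Bellman operator for $\pi$, i.e., $\mathcal{T}^\pi Q^\pi = Q^\pi$. The optimal Bellman operator $\mathcal{T}^\star$ is defined as
$$\mathcal{T}^\star Q(s,a) = \mathbb{E}_{s',r}\left(r(s,a) + \gamma \max_b Q(s',b)\right),$$
where the expectation is over the next state $s' \sim P(\cdot, s, a)$, and the reward $r(s,a)$. The optimal Q-value function is the fixed point of the optimal Bellman equation, i.e., $\mathcal{T}^\star Q^\star = Q^\star$. Both the Bellman operator and the optimal Bellman operator are $\gamma$-contraction mappings in the sup-norm, i.e., $\|\mathcal{T} Q_1 - \mathcal{T} Q_2\|_\infty \leq \gamma \|Q_1 - Q_2\|_\infty$, for any $Q_1, Q_2 \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$. From this fact one can show that the fixed point of each operator is unique, and that value iteration converges, i.e., $(\mathcal{T}^\pi)^k Q_0 \to Q^\pi$ and $(\mathcal{T}^\star)^k Q_0 \to Q^\star$ from any initial $Q_0$ (Bertsekas, 2005).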
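To make the contraction and value-iteration statements concrete, the following is a minimal sketch on a hypothetical two-state, two-action deterministic MDP. The transition table `P`, reward table `R`, discount `GAMMA`, and all function names are illustrative assumptions, not anything taken from the paper.

```python
# Hypothetical 2-state, 2-action MDP; every number here is illustrative only.
GAMMA = 0.9

# P[s][a] is a list of (next_state, probability); R[s][a] is the expected reward.
P = {0: {0: [(0, 1.0)], 1: [(1, 1.0)]},
     1: {0: [(0, 1.0)], 1: [(1, 1.0)]}}
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 0.0, 1: 2.0}}


def optimal_bellman(Q):
    """Apply the optimal Bellman operator once to a tabular Q (dict of dicts)."""
    return {
        s: {
            a: R[s][a] + GAMMA * sum(p * max(Q[s2].values()) for s2, p in P[s][a])
            for a in P[s]
        }
        for s in P
    }


def sup_norm(Q1, Q2):
    """Largest absolute entrywise difference between two tabular Q functions."""
    return max(abs(Q1[s][a] - Q2[s][a]) for s in Q1 for a in Q1[s])


def value_iteration(Q, iters=500):
    """Repeatedly apply the optimal Bellman operator, starting from Q."""
    for _ in range(iters):
        Q = optimal_bellman(Q)
    return Q


zeros = {s: {a: 0.0 for a in P[s]} for s in P}
tens = {s: {a: 10.0 for a in P[s]} for s in P}
Q_star = value_iteration(zeros)
```

In this toy problem the best behavior is to stay in state 1 collecting reward 2, so the fixed point can be checked by hand, and applying the operator to `zeros` and `tens` shrinks their sup-norm distance by the factor `GAMMA`.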
2.1 Action-value learning
In value-based reinforcement learning we approximate the Q-values using a function approximator. We then update the parameters so that the Q-values are as close to the fixed point of a Bellman equation as possible. If we denote by $Q(s,a;\theta)$ the approximate Q-values parameterized by $\theta$, then Q-learning updates the Q-values along direction $\mathbb{E}_{s,a}\left(\mathcal{T}^\star Q(s,a;\theta) - Q(s,a;\theta)\right)\nabla_\theta Q(s,a;\theta)$ and SARSA updates the Q-values along direction $\mathbb{E}_{s,a}\left(\mathcal{T}^\pi Q(s,a;\theta) - Q(s,a;\theta)\right)\nabla_\theta Q(s,a;\theta)$. In the online setting the Bellman operator is approximated by sampling and bootstrapping, whereby the Q-values at any state are updated using the Q-values from the next visited state. Exploration is achieved by not always taking the action with the highest Q-value at each time step. One common technique called ‘epsilon greedy’ is to sample a random action with probability $\epsilon > 0$, where $\epsilon$ starts high and decreases over time. Another popular technique is ‘Boltzmann exploration’, where the policy is given by the softmax over the Q-values with a temperature $T > 0$, i.e., $\pi(s,a) = \exp(Q(s,a)/T) / \sum_b \exp(Q(s,b)/T)$, where it is common to decrease the temperature over time.
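The two exploration rules described above can be sketched as follows. This is an illustrative sketch only: the function names and the `rng` parameter are assumptions made for the example, and the max-subtraction in the softmax is a standard numerical-stability device rather than something stated in the text.

```python
import math
import random


def boltzmann_policy(q_values, temperature):
    """Softmax over the Q-values of one state at the given temperature.

    Subtracting the maximum before exponentiating leaves the result
    unchanged but avoids overflow for small temperatures.
    """
    m = max(q_values)
    weights = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


probs = boltzmann_policy([1.0, 2.0, 3.0], temperature=1.0)
greedy = epsilon_greedy_action([0.1, 0.5, 0.2], epsilon=0.0)
```

As the temperature shrinks the Boltzmann policy concentrates on the highest-valued action, which mirrors the effect of annealing epsilon toward zero.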
2.2 Policy gradient
Alternatively, we can parameterize the policy directly and attempt to improve it via gradient ascent on the performance $J$. The policy gradient theorem (Sutton et al., 1999) states that the gradient of $J$ with respect to the parameters $\theta$ of the policy is given by
$$\nabla_\theta J(\pi) = \mathbb{E}_{s,a}\, Q^\pi(s,a) \nabla_\theta \log \pi(s,a), \tag{1}$$
where the expectation is over $(s,a)$ with probability $d^\pi(s)\pi(s,a)$. In the original derivation of the policy gradient theorem the expectation is over the discounted distribution of states, i.e., over $d^{\pi,s_0}_\gamma(s) = \sum_{t=0}^{\infty} \gamma^t \Pr\{s_t = s \mid s_0, \pi\}$. However, the gradient update in that case will assign a low weight to states that take a long time to reach and can therefore have poor empirical performance. In practice the non-discounted distribution of states is frequently used instead. In certain cases this is equivalent to maximizing the average (i.e., non-discounted) policy performance, even when $Q^\pi$ uses a discount factor (Thomas, 2014). Throughout this paper we will use the non-discounted distribution of states.
In the online case it is common to add an entropy regularizer to the gradient in order to prevent the policy from becoming deterministic. This ensures that the agent will explore continually. In that case the (batch) update becomes
$$\Delta\theta \propto \mathbb{E}_{s,a}\, Q^\pi(s,a) \nabla_\theta \log \pi(s,a) + \alpha\, \mathbb{E}_s \nabla_\theta H^\pi(s), \tag{2}$$
where $H^\pi(s) = -\sum_a \pi(s,a) \log \pi(s,a)$ denotes the entropy of policy $\pi$, and $\alpha > 0$ is the regularization penalty parameter. Throughout this paper we will make use of entropy regularization; however, many of the results are true for other choices of regularizers with only minor modification, e.g., KL-divergence. Note that equation (2) requires exact knowledge of the Q-values. In practice they can be estimated, e.g., by the sum of discounted rewards along an observed trajectory (Williams, 1992), and the policy gradient will still perform well (Konda & Tsitsiklis, 2003).
3 Regularized policy gradient algorithm
In this section we derive a relationship between the policy and the Q-values when using a regularized policy gradient algorithm. This allows us to transform a policy into an estimate of the Q-values. We then show that for small regularization the Q-values induced by the policy at the fixed point of the algorithm have a small Bellman error in the tabular case.
3.1 Tabular case
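Consider the fixed points of the entropy regularized policy gradient update (2). Let us define $f(\theta) = \mathbb{E}_{s,a}\, Q^\pi(s,a) \nabla_\theta \log \pi(s,a) + \alpha\, \mathbb{E}_s \nabla_\theta H^\pi(s)$, and $g_s(\pi) = \sum_a \pi(s,a)$ for each $s$. A fixed point is one where we can no longer update $\theta$ in the direction of $f(\theta)$ without violating one of the constraints $g_s(\pi) = 1$, i.e., where $f(\theta)$ is in the span of the vectors $\{\nabla_\theta g_s(\pi)\}$. In other words, any fixed point must satisfy $f(\theta) = \sum_s \lambda_s \nabla_\theta g_s(\pi)$, where for each $s$ the Lagrange multiplier $\lambda_s \in \mathbb{R}$ ensures that $g_s(\pi) = 1$. Substituting in terms to this equation we obtain
$$\mathbb{E}_{s,a}\left(Q^\pi(s,a) - \alpha \log \pi(s,a) - c_s\right) \nabla_\theta \log \pi(s,a) = 0, \tag{3}$$
where we have absorbed all constants into $c \in \mathbb{R}^{|\mathcal{S}|}$. Any solution $\pi$ to this equation is strictly positive element-wise since it must lie in the domain of the entropy function. In the tabular case $\pi$ is represented by a single number for each state and action pair and the gradient of the policy with respect to the parameters is the indicator function, i.e., $\nabla_{\theta(t,b)} \pi(s,a) = \mathbf{1}_{(t,b)=(s,a)}$. From this we obtain $Q^\pi(s,a) - \alpha \log \pi(s,a) - c_s = 0$ for each $s$ (assuming that the measure $d^\pi(s) > 0$). Multiplying by $\pi(s,a)$ and summing over $a \in \mathcal{A}$ we get $c_s = \alpha H^\pi(s) + V^\pi(s)$. Substituting $c$ into equation (3) we have the following formulation for the policy:
$$\pi(s,a) = \exp\left(A^\pi(s,a)/\alpha - H^\pi(s)\right), \tag{4}$$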
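for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$. In other words, the policy at the fixed point is a softmax over the advantage function induced by that policy, where the regularization parameter $\alpha$ can be interpreted as the temperature. Therefore, we can use the policy to derive an estimate of the Q-values,
$$\tilde{Q}^\pi(s,a) = \tilde{A}^\pi(s,a) + V^\pi(s) = \alpha\left(\log \pi(s,a) + H^\pi(s)\right) + V^\pi(s). \tag{5}$$
With this we can rewrite the gradient update (2) as
$$\Delta\theta \propto \mathbb{E}_{s,a}\left(Q^\pi(s,a) - \tilde{Q}^\pi(s,a)\right) \nabla_\theta \log \pi(s,a), \tag{6}$$
since the update is unchanged by per-state constant offsets. When the policy is parameterized as a softmax, i.e., $\pi(s,a) = \exp(W(s,a)) / \sum_b \exp(W(s,b))$, the quantity $W$ is sometimes referred to as the action-preferences of the policy (Sutton & Barto, 1998, Chapter 6.6). Equation (4) states that the action preferences are equal to the Q-values scaled by $1/\alpha$, up to an additive per-state constant.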
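The softmax relation between the fixed-point policy and the advantage function can be checked numerically at a single state. The sketch below assumes the Q-value estimate takes the form alpha * (log pi + H) + V, which is the form implied by a softmax over advantages with temperature alpha; the function names and the example numbers are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def q_estimate(probs, value, alpha):
    """Q-value estimate at one state from a strictly positive policy.

    Assumed form: alpha * (log pi(a) + H) + V.
    """
    h = entropy(probs)
    return [alpha * (math.log(p) + h) + value for p in probs]


def softmax(xs, temperature):
    """Numerically stable softmax of xs / temperature."""
    m = max(xs)
    w = [math.exp((x - m) / temperature) for x in xs]
    z = sum(w)
    return [wi / z for wi in w]


pi = [0.7, 0.2, 0.1]   # illustrative policy at one state
alpha = 0.1            # illustrative regularization penalty / temperature
v = 1.5                # illustrative state value

q_tilde = q_estimate(pi, v, alpha)
pi_back = softmax(q_tilde, alpha)                       # recovers pi
v_back = sum(p * q for p, q in zip(pi, q_tilde))        # recovers v
```

Two properties are visible here: a softmax of the estimate at temperature `alpha` returns the original policy, and the policy-weighted average of the estimate returns the state value, so the implied advantages average to zero under the policy.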
3.2 General case
Consider the following optimization problem:
$$\begin{array}{ll} \text{minimize} & \mathbb{E}_{s,a}\left(q(s,a) - \alpha \log \pi(s,a)\right)^2 \\ \text{subject to} & \sum_a \pi(s,a) = 1, \quad s \in \mathcal{S} \end{array} \tag{7}$$
over variable $\theta$ which parameterizes $\pi$, where we consider both the measure in the expectation and the values $q(s,a)$ to be independent of $\theta$. The optimality condition for this problem is
$$\mathbb{E}_{s,a}\left(q(s,a) - \alpha \log \pi(s,a) - c_s\right) \nabla_\theta \log \pi(s,a) = 0,$$
where $c \in \mathbb{R}^{|\mathcal{S}|}$ is the (suitably scaled) Lagrange multiplier associated with the constraint that the policy sum to one at each state. Comparing this to equation (3), we see that if $q = Q^\pi$ and the measure in the expectation is the same then they describe the same set of fixed points. This suggests an interpretation of the fixed points of the regularized policy gradient as a regression of the log-policy onto the Q-values. In the general case of using an approximation architecture we can interpret equation (3) as indicating that the error between $Q^\pi$ and $\tilde{Q}^\pi$ is orthogonal to $\nabla_{\theta_i} \log \pi$ for each $i$, and so cannot be reduced further by changing the parameters, at least locally. In this case equation (4) is unlikely to hold at a solution to (3); however, with a good approximation architecture it may hold approximately, so that we can derive an estimate of the Q-values from the policy using equation (5). We will use this estimate of the Q-values in the next section.
3.3 Connection to action-value methods
The previous section made a connection between regularized policy gradient and a regression onto the Q-values at the fixed point. In this section we go one step further, showing that actor-critic methods can be interpreted as action-value fitting methods, where the exact method depends on the choice of critic.
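Actor-critic methods.
Consider an agent using an actor-critic method to learn both a policy $\pi$ and a value function $V$. At any iteration $k$, the value function $V^k$ has parameters $w^k$, and the policy is of the form
$$\pi^k(s,a) = \exp\left(W^k(s,a)/\alpha\right) / \sum_b \exp\left(W^k(s,b)/\alpha\right), \tag{8}$$
where $W^k$ is parameterized by $\theta^k$ and $\alpha > 0$ is the entropy regularization penalty. In this case $\nabla_\theta \log \pi^k(s,a) = (1/\alpha)\left(\nabla_\theta W^k(s,a) - \sum_b \pi^k(s,b) \nabla_\theta W^k(s,b)\right)$. Using equation (6) the parameters are updated as
$$\Delta\theta \propto \mathbb{E}_{s,a}\, \delta_{\rm ac} \left(\nabla_\theta W^k(s,a) - \sum_b \pi^k(s,b) \nabla_\theta W^k(s,b)\right), \qquad \Delta w \propto \mathbb{E}_{s,a}\, \delta_{\rm ac} \nabla_w V^k(s), \tag{9}$$
where $\delta_{\rm ac}$ is the critic minus baseline term, which depends on the variant of actor-critic being used (see the remark below).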
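Action-value methods.
Compare this to the case where an agent is learning Q-values with a dueling architecture (Wang et al., 2016), which at iteration $k$ is given by
$$Q^k(s,a) = Y^k(s,a) - \sum_b \mu(s,b) Y^k(s,b) + V^k(s),$$
where $\mu$ is a probability distribution, $Y^k$ is parameterized by $\theta^k$, $V^k$ is parameterized by $w^k$, and the exploration policy is Boltzmann with temperature $\alpha$, i.e.,
$$\pi^k(s,a) = \exp\left(Y^k(s,a)/\alpha\right) / \sum_b \exp\left(Y^k(s,b)/\alpha\right). \tag{10}$$
In action-value fitting methods at each iteration the parameters are updated to reduce some error, where the update is given by
$$\Delta\theta \propto \mathbb{E}_{s,a}\, \delta_{\rm av} \left(\nabla_\theta Y^k(s,a) - \sum_b \mu(s,b) \nabla_\theta Y^k(s,b)\right), \qquad \Delta w \propto \mathbb{E}_{s,a}\, \delta_{\rm av} \nabla_w V^k(s), \tag{11}$$
where $\delta_{\rm av}$ is the action-value error term and depends on which algorithm is being used (see the remark below).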
Equivalence.
The two policies (8) and (10) are identical if $W^k = Y^k$ for all $k$. Since $W^0$ and $Y^0$ can be initialized and parameterized in the same way, and assuming the two value function estimates are initialized and parameterized in the same way, all that remains is to show that the updates in equations (11) and (9) are identical. Comparing the two, and assuming that $\delta_{\rm ac} = \delta_{\rm av}$ (see remark), we see that the only difference is that the measure is not fixed in (9), but is equal to the current policy and therefore changes after each update. Replacing $\mu$ in (11) with $\pi^k$ makes the updates identical, in which case $W^k = Y^k$ at all iterations and the two policies (8) and (10) are always the same. In other words, the slightly modified action-value method is equivalent to an actor-critic policy gradient method, and vice versa (modulo using the non-discounted distribution of states, as discussed in §2.2). In particular, regularized policy gradient methods can be interpreted as advantage function learning techniques (Baird III, 1993), since at the optimum the quantity $W(s,a) - \sum_b \pi(s,b) W(s,b) = \alpha\left(\log \pi(s,a) + H^\pi(s)\right)$ will be equal to the advantage function values in the tabular case.
Remark.
In SARSA (Rummery & Niranjan, 1994) we set $\delta_{\rm av} = r(s,a) + \gamma Q(s',b) - Q(s,a)$, where $b$ is the action selected at state $s'$, which would be equivalent to using a bootstrap critic in equation (6) where $Q^\pi(s,a) = r(s,a) + \gamma \tilde{Q}(s',b)$. In expected-SARSA (Sutton & Barto, 1998, Exercise 6.10; Van Seijen et al., 2009) we take the expectation over the Q-values at the next state, so $\delta_{\rm av} = r(s,a) + \gamma V(s') - Q(s,a)$. This is equivalent to TD-actor-critic (Konda & Tsitsiklis, 2003) where we use the value function to provide the critic, which is given by $Q^\pi = r(s,a) + \gamma V(s')$. In Q-learning (Watkins, 1989) $\delta_{\rm av} = r(s,a) + \gamma \max_b Q(s',b) - Q(s,a)$, which would be equivalent to using an optimizing critic that bootstraps using the max Q-value at the next state, i.e., $Q^\pi(s,a) = r(s,a) + \gamma \max_b \tilde{Q}^\pi(s',b)$. In REINFORCE the critic is the Monte Carlo return from that state on, i.e., $Q^\pi(s,a) = \left(\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a\right)$. If the return trace is truncated and a bootstrap is performed after $n$ steps, this is equivalent to $n$-step SARSA or $n$-step Q-learning, depending on the form of the bootstrap (Peng & Williams, 1996).
3.4 Bellman residual
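In this section we show that $\|\mathcal{T}^\star Q^{\pi_\alpha} - Q^{\pi_\alpha}\| \to 0$ with decreasing regularization penalty $\alpha$, where $\pi_\alpha$ is the policy defined by (4) and $Q^{\pi_\alpha}$ is the corresponding Q-value function, both of which are functions of $\alpha$. We shall show that it converges to zero by bounding the sequence below by zero and above with a sequence that converges to zero. First, we have that $\mathcal{T}^\star Q^{\pi_\alpha} \geq \mathcal{T}^{\pi_\alpha} Q^{\pi_\alpha} = Q^{\pi_\alpha}$, since $\mathcal{T}^\star$ is greedy with respect to the Q-values. So $\mathcal{T}^\star Q^{\pi_\alpha} - Q^{\pi_\alpha} \geq 0$. Now, to bound from above we need the fact that $\pi_\alpha(s,a) = \exp(Q^{\pi_\alpha}(s,a)/\alpha) / \sum_b \exp(Q^{\pi_\alpha}(s,b)/\alpha) \leq \exp\left((Q^{\pi_\alpha}(s,a) - \max_c Q^{\pi_\alpha}(s,c))/\alpha\right)$. Using this we have
$$\begin{array}{rcl} 0 \leq \mathcal{T}^\star Q^{\pi_\alpha}(s,a) - Q^{\pi_\alpha}(s,a) &=& \mathcal{T}^\star Q^{\pi_\alpha}(s,a) - \mathcal{T}^{\pi_\alpha} Q^{\pi_\alpha}(s,a) \\ &=& \gamma\, \mathbb{E}_{s'}\left(\max_c Q^{\pi_\alpha}(s',c) - \sum_b \pi_\alpha(s',b) Q^{\pi_\alpha}(s',b)\right) \\ &=& \gamma\, \mathbb{E}_{s'} \sum_b \pi_\alpha(s',b)\, \Delta_\alpha(s',b) \\ &\leq& \gamma\, \mathbb{E}_{s'} \sum_b \exp\left(-\Delta_\alpha(s',b)/\alpha\right) \Delta_\alpha(s',b), \end{array}$$
where we define $\Delta_\alpha(s,b) = \max_c Q^{\pi_\alpha}(s,c) - Q^{\pi_\alpha}(s,b)$. To conclude our proof we use the fact that $x \exp(-x/\alpha) \leq \alpha e^{-1}$ for all $x \geq 0$, which yields
$$0 \leq \mathcal{T}^\star Q^{\pi_\alpha}(s,a) - Q^{\pi_\alpha}(s,a) \leq \gamma |\mathcal{A}| \alpha e^{-1}$$
for all $(s,a)$, and so the Bellman residual converges to zero with decreasing $\alpha$. In other words, for small enough $\alpha$ (which is the regime we are interested in) the Q-values induced by the policy (4) will have a small Bellman residual. Moreover, this implies that $\lim_{\alpha \to 0} Q^{\pi_\alpha} = Q^\star$, as one might expect.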
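The final step of an argument of this kind typically rests on an elementary bound on a function of the form $x \exp(-x/\alpha)$. As a hedged aside, and assuming that this is the fact being invoked, the bound can be verified by a one-line maximization over $x \geq 0$ for fixed $\alpha > 0$:

```latex
\begin{align*}
h(x) &= x \exp(-x/\alpha), \qquad x \geq 0,\ \alpha > 0, \\
h'(x) &= \exp(-x/\alpha)\left(1 - \frac{x}{\alpha}\right) = 0
  \quad\Longleftrightarrow\quad x = \alpha, \\
\sup_{x \geq 0} h(x) &= h(\alpha) = \alpha e^{-1}.
\end{align*}
```

Since $h'(x) > 0$ for $x < \alpha$ and $h'(x) < 0$ for $x > \alpha$, the stationary point is the global maximum, so each action contributes at most $\alpha e^{-1}$ to a sum of such terms, and the total over $|\mathcal{A}|$ actions vanishes linearly as $\alpha \to 0$.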
4 PGQL
In this section we introduce the main contribution of the paper, which is a technique to combine policy gradient with Q-learning. We call our technique ‘PGQL’, for policy gradient and Q-learning. In the previous section we showed that the Bellman residual is small at the fixed point of a regularized policy gradient algorithm when the regularization penalty is sufficiently small. This suggests adding an auxiliary update where we explicitly attempt to reduce the Bellman residual as estimated from the policy, i.e., a hybrid between policy gradient and Q-learning.
We first present the technique in a batch update setting, with perfect knowledge of $Q^\pi$ (i.e., a perfect critic). Later we discuss the practical implementation of the technique in a reinforcement learning setting with function approximation, where the agent generates experience from interacting with the environment and needs to estimate a critic simultaneously with the policy.
4.1 PGQL update
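Define the estimate of $Q$ using the policy as
$$\tilde{Q}^\pi(s,a) = \tilde{A}^\pi(s,a) + V(s) = \alpha\left(\log \pi(s,a) + H^\pi(s)\right) + V(s), \tag{12}$$
where $V$ has parameters $w$ and is not necessarily $V^\pi$ as it was in equation (5). In (2) it was unnecessary to estimate the constant since the update was invariant to constant offsets, although in practice it is often estimated for use in a variance reduction technique (Williams, 1992; Sutton et al., 1999).
Since we know that at the fixed point the Bellman residual will be small for small $\alpha$, we can consider updating the parameters to reduce the Bellman residual in a fashion similar to Q-learning, i.e.,
$$\Delta\theta \propto \mathbb{E}_{s,a}\left(\mathcal{T}^\star \tilde{Q}^\pi(s,a) - \tilde{Q}^\pi(s,a)\right) \nabla_\theta \log \pi(s,a), \qquad \Delta w \propto \mathbb{E}_{s,a}\left(\mathcal{T}^\star \tilde{Q}^\pi(s,a) - \tilde{Q}^\pi(s,a)\right) \nabla_w V(s). \tag{13}$$
This is Q-learning applied to a particular form of the Q-values, and can also be interpreted as an actor-critic algorithm with an optimizing (and therefore biased) critic.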
The full scheme simply combines two updates to the policy, the regularized policy gradient update (2) and the Q-learning update (13). Assuming we have an architecture that provides a policy $\pi$, a value function estimate $V$, and an action-value critic $Q^\pi$, then the parameter updates can be written as (suppressing the $(s,a)$ notation)
$$\begin{array}{rcl} \Delta\theta &\propto& (1-\eta)\, \mathbb{E}_{s,a}\left(Q^\pi - \tilde{Q}^\pi\right) \nabla_\theta \log \pi + \eta\, \mathbb{E}_{s,a}\left(\mathcal{T}^\star \tilde{Q}^\pi - \tilde{Q}^\pi\right) \nabla_\theta \log \pi, \\ \Delta w &\propto& (1-\eta)\, \mathbb{E}_{s,a}\left(Q^\pi - \tilde{Q}^\pi\right) \nabla_w V + \eta\, \mathbb{E}_{s,a}\left(\mathcal{T}^\star \tilde{Q}^\pi - \tilde{Q}^\pi\right) \nabla_w V, \end{array} \tag{14}$$
where $\eta \in [0,1]$ is a weighting parameter that controls how much of each update we apply. In the case where $\eta = 0$ the above scheme reduces to entropy regularized policy gradient. If $\eta = 1$ then it becomes a variant of (batch) Q-learning with an architecture similar to the dueling architecture (Wang et al., 2016). Intermediate values of $\eta$ produce a hybrid between the two. Examining the update we see that two error terms are trading off. The first term encourages consistency with the critic, and the second term encourages optimality over time. However, since we know that under standard policy gradient the Bellman residual will be small, it follows that adding a term that reduces that error should not make much difference at the fixed point. That is, the updates should be complementary, pointing in the same general direction, at least far away from a fixed point. This update can also be interpreted as an actor-critic update where the critic is given by a weighted combination of a standard critic and an optimizing critic. Yet another interpretation of the update is a combination of expected-SARSA and Q-learning, where the Q-values are parameterized as the sum of an advantage function and a value function.
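The trade-off between the two error terms can be illustrated per state-action pair. The sketch below assumes the combined error is a convex combination, with a weight called `eta` here, of a critic-consistency error and a one-step Q-learning (Bellman) error; the function names, the weight name, and the sample numbers are hypothetical and this is not the authors' implementation.

```python
def q_learning_target(reward, gamma, next_q_tilde, terminal=False):
    """One-sample optimal Bellman backup: r + gamma * max_b Q~(s', b)."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_tilde)


def pgql_error(q_critic, q_tilde, q_target, eta):
    """Weighted error multiplying the gradient terms in a combined update.

    eta = 0 leaves only the critic-consistency (policy gradient) error;
    eta = 1 leaves only the Bellman (Q-learning) error.
    """
    pg_error = q_critic - q_tilde      # consistency with the critic
    ql_error = q_target - q_tilde      # one-step Bellman residual
    return (1.0 - eta) * pg_error + eta * ql_error


# Illustrative numbers only.
target = q_learning_target(reward=1.0, gamma=0.9, next_q_tilde=[0.5, 2.0])
only_pg = pgql_error(q_critic=2.0, q_tilde=1.0, q_target=3.0, eta=0.0)
only_ql = pgql_error(q_critic=2.0, q_tilde=1.0, q_target=3.0, eta=1.0)
hybrid = pgql_error(q_critic=2.0, q_tilde=1.0, q_target=3.0, eta=0.5)
```

Because the combination is convex, the hybrid error always lies between the two pure errors, which is one way to read the claim that the updates point in the same general direction when both errors share a sign.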
4.2 Practical implementation
The updates presented in (14) are batch updates, with an exact critic $Q^\pi$. In practice we want to run this scheme online, with an estimate of the critic, where we don’t necessarily apply the policy gradient update at the same time or from the same data source as the Q-learning update.
Our proposed scheme is as follows. One or more agents interact with an environment, encountering states and rewards and performing on-policy updates of (shared) parameters using an actor-critic algorithm where both the policy and the critic are being updated online. Each time an agent receives new data from the environment it writes it to a shared replay memory buffer. Periodically a separate learner process samples from the replay buffer and performs a step of Q-learning on the parameters of the policy using (13). This scheme has several advantages. The critic can accumulate the Monte Carlo return over many time periods, allowing us to spread the influence of a reward received in the future backwards in time. Furthermore, the replay buffer can be used to store and replay ‘important’ past experiences by prioritizing those samples (Schaul et al., 2015). The use of the replay buffer can help to reduce problems associated with correlated training data, as generated by an agent exploring an environment where the states are likely to be similar from one time step to the next. Also the use of replay can act as a kind of regularizer, preventing the policy from moving too far from satisfying the Bellman equation, thereby improving stability, in a similar sense to that of a policy ‘trust-region’ (Schulman et al., 2015). Moreover, by batching up replay samples to update the network we can leverage GPUs to perform the updates quickly; this is in contrast to pure policy gradient techniques, which are generally implemented on CPU (Mnih et al., 2016).
Since we perform Q-learning using samples from a replay buffer that were generated by an old policy we are performing (slightly) off-policy learning. However, Q-learning is known to converge to the optimal Q-values in the off-policy tabular case (under certain conditions) (Sutton & Barto, 1998), and has shown good performance off-policy in the function approximation case (Mnih et al., 2013).
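The replay memory in this scheme can be as simple as a bounded first-in-first-out store with uniform sampling. The following is a minimal single-process sketch under that assumption; the class name, the transition layout, and the `seed` argument are illustrative choices, and prioritized sampling or sharing across processes are deliberately left out.

```python
import random
from collections import deque


class ReplayBuffer:
    """Bounded FIFO store of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        # deque(maxlen=...) silently discards the oldest item when full.
        self._data = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state, terminal):
        """Append one transition, evicting the oldest if at capacity."""
        self._data.append((state, action, reward, next_state, terminal))

    def sample(self, batch_size):
        """Return up to batch_size distinct stored transitions."""
        k = min(batch_size, len(self._data))
        return self._rng.sample(list(self._data), k)

    def __len__(self):
        return len(self._data)


# Usage: actors call add() as they act; a learner calls sample() periodically.
buf = ReplayBuffer(capacity=3, seed=0)
for i in range(5):
    buf.add(state=i, action=0, reward=0.0, next_state=i + 1, terminal=False)
batch = buf.sample(2)
```

Sampling uniformly from a window of past transitions is what breaks the correlation between consecutive states; the transitions it returns were generated by earlier parameters, which is why the Q-learning step is (slightly) off-policy.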
4.3 Modified fixed point
The PGQL updates in equation (14) have modified the fixed point of the algorithm, so the analysis of §3 is no longer valid. Considering the tabular case once again, it is still the case that the policy $\pi \propto \exp(\tilde{Q}^\pi/\alpha)$ as before, where $\tilde{Q}^\pi$ is defined by (12); however, where previously the fixed point satisfied $\tilde{Q}^\pi = Q^\pi$, with $Q^\pi$ corresponding to the Q-values induced by $\pi$, now we have
$$\tilde{Q}^\pi = (1-\eta) Q^\pi + \eta\, \mathcal{T}^\star \tilde{Q}^\pi. \tag{15}$$
Or equivalently, if , we have . In the appendix we show that $\|\tilde{Q}^\pi - Q^\pi\| \to 0$ and that $\|\mathcal{T}^\star Q^\pi - Q^\pi\| \to 0$ with decreasing $\alpha$ in the tabular case. That is, for small $\alpha$ the induced Q-values and the Q-values estimated from the policy are close, and we still have the guarantee that in the limit the Q-values are optimal. In other words, we have not perturbed the policy very much by the addition of the auxiliary update.
5 Numerical experiments
5.1 Grid world
In this section we discuss the results of running PGQL on a toy by grid world, as shown in Figure 1(a). The agent always begins in the square marked ‘S’ and the episode continues until it reaches the square marked ‘T’, upon which it receives a reward of . All other times it receives no reward. For this experiment we chose regularization parameter and discount factor .
Figure 1(b) shows the performance traces of three different agents learning in the grid world, running from the same initial random seed. The lines show the true expected performance of the policy from the start state, as calculated by value iteration after each update. The blue line is standard TD-actor-critic (Konda & Tsitsiklis, 2003), where we maintain an estimate of the value function and use that to generate an estimate of the Q-values for use as the critic. The green line is Q-learning where at each step an update is performed using data drawn from a replay buffer of prior experience and where the Q-values are parameterized as in equation (12). The policy is a softmax over the Q-value estimates with temperature $\alpha$. The red line is PGQL, which at each step first performs the TD-actor-critic update, then performs the Q-learning update as in (14).
The grid world was totally deterministic, so the step size could be large and was chosen to be . A step size any larger than this made the pure actor-critic agent fail to learn, but both PGQL and Q-learning could handle some increase in the step size, possibly due to the stabilizing effect of using replay.
It is clear that PGQL outperforms the other two. At any point along the x-axis the agents have seen the same amount of data, which would indicate that PGQL is more data efficient than either of the vanilla methods since it has the highest performance at practically every point.
5.2 Atari
We tested our algorithm on the full suite of Atari benchmarks (Bellemare et al., 2012), using a neural network to parameterize the policy. In figure 2 we show how a policy network can be augmented with a parameterless additional layer which outputs the Q-value estimate. With the exception of the extra layer, the architecture and parameters were chosen to exactly match the asynchronous advantage actor-critic (A3C) algorithm presented in Mnih et al. (2016), which in turn reused many of the settings from Mnih et al. (2015). Specifically we used the exact same learning rate, number of workers, entropy penalty, bootstrap horizon, and network architecture. This allows a fair comparison between A3C and PGQL, since the only difference is the addition of the Q-learning step. Our technique augmented A3C with the following change: after each actor-learner has accumulated the gradient for the policy update, it performs a single step of Q-learning from replay data as described in equation (13), where the minibatch size was 32 and the Q-learning learning rate was chosen to be times the actor-critic learning rate (we mention learning rate ratios rather than choice of $\eta$ in (14) because the updates happen at different frequencies and from different data sources). Each actor-learner thread maintained a replay buffer of the last transitions seen by that thread. We ran the learning for million agent steps ( million Atari frames), as in (Mnih et al., 2016).
In the results we compare against both A3C and a variant of asynchronous deep Q-learning. The changes we made to Q-learning are to make it similar to our method, with some tuning of the hyperparameters for performance. We use the exact same network, the exploration policy is a softmax over the Q-values with a temperature of , and the Q-values are parameterized as in equation (12) (i.e., similar to the dueling architecture (Wang et al., 2016)), where . The Q-value updates are performed every 4 steps with a minibatch of 32 (roughly 5 times more frequently than PGQL). For each method, all games used identical hyperparameters.
The results across all games are given in table 3 in the appendix. All scores have been normalized by subtracting the average score achieved by an agent that takes actions uniformly at random. Each game was tested 5 times per method with the same hyperparameters but with different random seeds. The scores presented correspond to the best score obtained by any run from a random start evaluation condition (Mnih et al., 2016). Overall, PGQL performed best in 34 games, A3C performed best in 7 games, and Q-learning was best in 10 games. In 6 games two or more methods tied. In tables 1 and 2 we give the mean and median normalized scores as percentage of an expert human normalized score across all games for each tested algorithm from random and human-start conditions respectively. In a human-start condition the agent takes over control of the game from randomly selected human-play starting points, which generally leads to lower performance since the agent may not have found itself in that state during training. In both cases, PGQL has both the highest mean and median, and the median score exceeds 100%, the human performance threshold.
It is worth noting that PGQL was the worst performer in only one game; in cases where it was not the outright winner it was generally somewhere in between the performance of the other two algorithms. Figure 3 shows some sample traces of games where PGQL was the best performer. In these cases PGQL has far better data efficiency than the other methods. In figure 4 we show some of the games where PGQL under-performed. In practically every case where PGQL did not perform well it had better data efficiency early on in the learning, but performance saturated or collapsed. We hypothesize that in these cases the policy has reached a local optimum, or overfit to the early data, and might perform better were the hyperparameters to be tuned.
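Table 1: Mean and median normalized scores (percentage of human score) across all games, random-start condition.
          A3C      Q-learning   PGQL
Mean      636.8    756.3        877.2
Median    107.3    58.9         145.6
Table 2: Mean and median normalized scores (percentage of human score) across all games, human-start condition.
          A3C      Q-learning   PGQL
Mean      266.6    246.6        416.7
Median    58.3     30.5         103.3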
6 Conclusions
We have made a connection between the fixed point of regularized policy gradient techniques and the Q-values of the resulting policy. For small regularization (the usual case) we have shown that the Bellman residual of the induced Q-values must be small. This leads us to consider adding an auxiliary update to the policy gradient which is related to the Bellman residual evaluated on a transformation of the policy. This update can be performed off-policy, using stored experience. We call the resulting method ‘PGQL’, for policy gradient and Q-learning. Empirically, we observe better data efficiency and stability of PGQL when compared to actor-critic or Q-learning alone. We verified the performance of PGQL on a suite of Atari games, where we parameterize the policy using a neural network, and achieved performance exceeding that of both A3C and Q-learning.
7 Acknowledgments
We thank Joseph Modayil for many comments and suggestions on the paper, and Hubert Soyer for help with performance evaluation. We would also like to thank the anonymous reviewers for their constructive feedback.
References
 Amari (1998) Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.
 Azar et al. (2012) Mohammad Gheshlaghi Azar, Vicenç Gómez, and Hilbert J Kappen. Dynamic policy programming. Journal of Machine Learning Research, 13(Nov):3207–3245, 2012.
 Bagnell & Schneider (2003) J Andrew Bagnell and Jeff Schneider. Covariant policy search. In IJCAI, 2003.
 Baird III (1993) Leemon C Baird III. Advantage updating. Technical Report WL-TR-93-1146, Wright-Patterson Air Force Base Ohio: Wright Laboratory, 1993.

 Bellemare et al. (2012) Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 2012.
 Bellman (1957) Richard Bellman. Dynamic programming. Princeton University Press, 1957.
 Bertsekas (2005) Dimitri P Bertsekas. Dynamic programming and optimal control, volume 1. Athena Scientific, 2005.
 Bertsekas & Tsitsiklis (1996) Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
 Degris et al. (2012) Thomas Degris, Martha White, and Richard S Sutton. Off-policy actor-critic. 2012.
 Fox et al. (2015) Roy Fox, Ari Pakman, and Naftali Tishby. Taming the noise in reinforcement learning via soft updates. arXiv preprint arXiv:1512.08562, 2015.
 Hausknecht & Stone (2016) Matthew Hausknecht and Peter Stone. On-policy vs. off-policy updates for deep reinforcement learning. Deep Reinforcement Learning: Frontiers and Challenges, IJCAI 2016 Workshop, 2016.
 Heess et al. (2012) Nicolas Heess, David Silver, and Yee Whye Teh. Actor-critic reinforcement learning with energy-based policies. In JMLR: Workshop and Conference Proceedings 24, pp. 43–57, 2012.
 Kakade (2001) Sham Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems, volume 14, pp. 1531–1538, 2001.
 Konda & Tsitsiklis (2003) Vijay R Konda and John N Tsitsiklis. On actorcritic algorithms. SIAM Journal on Control and Optimization, 42(4):1143–1166, 2003.
 Lehnert & Precup (2015) Lucas Lehnert and Doina Precup. Policy gradient methods for off-policy control. arXiv preprint arXiv:1512.04105, 2015.
 Levine & Koltun (2013) Sergey Levine and Vladlen Koltun. Guided policy search. In Proceedings of the 30th International Conference on Machine Learning (ICML), pp. 1–9, 2013.
 Levine et al. (2015) Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. arXiv preprint arXiv:1504.00702, 2015.
 Lillicrap et al. (2015) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 Lin (1993) Long-Ji Lin. Reinforcement learning for robots using neural networks. Technical report, DTIC Document, 1993.

 Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. In NIPS Deep Learning Workshop, 2013.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015. URL http://dx.doi.org/10.1038/nature14236.
 Mnih et al. (2016) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy P Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. arXiv preprint arXiv:1602.01783, 2016.
 Norouzi et al. (2016) Mohammad Norouzi, Samy Bengio, Zhifeng Chen, Navdeep Jaitly, Mike Schuster, Yonghui Wu, and Dale Schuurmans. Reward augmented maximum likelihood for neural structured prediction. arXiv preprint arXiv:1609.00150, 2016.
 Pascanu & Bengio (2013) Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584, 2013.
 Pednault et al. (2002) Edwin Pednault, Naoki Abe, and Bianca Zadrozny. Sequential cost-sensitive decision making with reinforcement learning. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 259–268. ACM, 2002.
 Peng & Williams (1996) Jing Peng and Ronald J Williams. Incremental multi-step Q-learning. Machine Learning, 22(1-3):283–290, 1996.
 Peters et al. (2010) Jan Peters, Katharina Mülling, and Yasemin Altun. Relative entropy policy search. In AAAI. Atlanta, 2010.
 Riedmiller (2005) Martin Riedmiller. Neural fitted Q iteration–first experiences with a data efficient neural reinforcement learning method. In Machine Learning: ECML 2005, pp. 317–328. Springer Berlin Heidelberg, 2005.
 Rummery & Niranjan (1994) Gavin A Rummery and Mahesan Niranjan. On-line Q-learning using connectionist systems. 1994.
 Sallans & Hinton (2004) Brian Sallans and Geoffrey E Hinton. Reinforcement learning with factored states and actions. Journal of Machine Learning Research, 5(Aug):1063–1088, 2004.
 Schaul et al. (2015) Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
 Schulman et al. (2015) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of The 32nd International Conference on Machine Learning, pp. 1889–1897, 2015.
 Silver et al. (2014) David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning (ICML), pp. 387–395, 2014.
 Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
 Sutton & Barto (1998) R. Sutton and A. Barto. Reinforcement Learning: an Introduction. MIT Press, 1998.
 Sutton (1988) Richard S Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44, 1988.
 Sutton et al. (1999) Richard S Sutton, David A McAllester, Satinder P Singh, Yishay Mansour, et al. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, volume 99, pp. 1057–1063, 1999.
 Tesauro (1995) Gerald Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–68, 1995.
 Thomas (2014) Philip Thomas. Bias in natural actor-critic algorithms. In Proceedings of The 31st International Conference on Machine Learning, pp. 441–448, 2014.
 Van Hasselt et al. (2016) Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16), pp. 2094–2100, 2016.
 Van Seijen et al. (2009) Harm Van Seijen, Hado Van Hasselt, Shimon Whiteson, and Marco Wiering. A theoretical and empirical analysis of expected sarsa. In 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pp. 177–184. IEEE, 2009.
 Wang et al. (2013) Yin-Hao Wang, Tzuu-Hseng S Li, and Chih-Jui Lin. Backward Q-learning: The combination of Sarsa algorithm and Q-learning. Engineering Applications of Artificial Intelligence, 26(9):2184–2193, 2013.
 Wang et al. (2016) Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pp. 1995–2003, 2016.
 Watkins (1989) Christopher John Cornish Hellaby Watkins. Learning from delayed rewards. PhD thesis, University of Cambridge England, 1989.
 Williams (1992) Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
 Williams & Peng (1991) Ronald J Williams and Jing Peng. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268, 1991.
Appendix A PGQL Bellman residual
Here we demonstrate that in the tabular case the Bellman residual of the induced Qvalues for the PGQL updates of (14) converges to zero as the temperature decreases, which is the same guarantee as vanilla regularized policy gradient (2). We will use the notation that is the policy at the fixed point of PGQL updates (14) for some , i.e., , with induced Qvalue function .
First, note that we can apply the same argument as in §3.4 to show that (the only difference is that we lack the property that is the fixed point of ). Secondly, from equation (15) we can write . Combining these two facts we have
and so as . Using this fact we have
which therefore also converges to zero in the limit. Finally we obtain
which combined with the two previous results implies that , as before.
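For readers who want a concrete handle on the quantity bounded in this appendix, the following sketch computes a Bellman residual for a tabular Q-function. It assumes the standard Bellman optimality operator, (T*Q)(s,a) = r(s,a) + γ Σ_s' P(s'|s,a) max_b Q(s',b), measured in the sup norm; the two-state MDP, its rewards, and the function names are hypothetical and are not taken from the paper. The sketch illustrates only the residual itself, not the PGQL update.

```python
def bellman_optimality_backup(Q, P, R, gamma):
    """Return T*Q for a tabular Q given transitions P[s][a][s'] and rewards R[s][a]."""
    n_states = len(Q)
    # Greedy state values: v[s] = max_b Q[s][b].
    v = [max(Q[s]) for s in range(n_states)]
    return [
        [
            R[s][a] + gamma * sum(P[s][a][t] * v[t] for t in range(n_states))
            for a in range(len(Q[s]))
        ]
        for s in range(n_states)
    ]


def bellman_residual(Q, P, R, gamma):
    """Sup-norm Bellman residual: max over (s, a) of |(T*Q)(s,a) - Q(s,a)|."""
    TQ = bellman_optimality_backup(Q, P, R, gamma)
    return max(
        abs(TQ[s][a] - Q[s][a]) for s in range(len(Q)) for a in range(len(Q[s]))
    )


# Hypothetical deterministic MDP: action 0 moves to state 0, action 1 to state 1.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[1.0, 0.0], [0.0, 1.0]]]
R = [[0.0, 1.0],
     [0.0, 2.0]]
gamma = 0.5

Q_zero = [[0.0, 0.0], [0.0, 0.0]]
residual_at_zero = bellman_residual(Q_zero, P, R, gamma)

# Repeated backups (value iteration) drive the residual towards zero.
Q = Q_zero
for _ in range(200):
    Q = bellman_optimality_backup(Q, P, R, gamma)
residual_at_fixed_point = bellman_residual(Q, P, R, gamma)
```

A residual of zero characterizes the optimal Q-values, so a vanishing residual in the limit says the induced Q-values approach optimality in this sense.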
Appendix B Atari scores
Game  A3C  Q-learning  PGQL

alien  38.43  25.53  46.70 
amidar  68.69  12.29  71.00 
assault  854.64  1695.21  2802.87 
asterix  191.69  98.53  3790.08 
asteroids  24.37  5.32  50.23 
atlantis  15496.01  13635.88  16217.49 
bank heist  210.28  91.80  212.15 
battle zone  21.63  2.89  52.00 
beam rider  59.55  79.94  155.71 
berzerk  79.38  55.55  92.85 
bowling  2.70  7.09  3.85 
boxing  510.30  299.49  902.77 
breakout  2341.13  3291.22  2959.16 
centipede  50.22  105.98  73.88 
chopper command  61.13  19.18  162.93 
crazy climber  510.25  189.01  476.11 
defender  475.93  58.94  911.13 
demon attack  4027.57  3449.27  3994.49 
double dunk  1250.00  91.35  1375.00 
enduro  9.94  9.94  9.94 
fishing derby  140.84  14.48  145.57 
freeway  0.26  0.13  0.13 
frostbite  5.85  10.71  5.71 
gopher  429.76  9131.97  2060.41 
gravitar  0.71  1.35  1.74 
hero  145.71  15.47  92.88 
ice hockey  62.25  21.57  76.96 
jamesbond  133.90  110.97  142.08 
kangaroo  0.94  0.94  0.75 
krull  736.30  3586.30  557.44 
kung fu master  182.34  260.14  254.42 
montezuma revenge  0.49  1.80  0.48 
ms pacman  17.91  10.71  25.76 
name this game  102.01  113.89  188.90 
phoenix  447.05  812.99  1507.07 
pitfall  5.48  5.49  5.49 
pong  116.37  24.96  116.37 
private eye  0.88  0.03  0.04 
qbert  186.91  159.71  136.17 
riverraid  107.25  65.01  128.63 
road runner  603.11  179.69  519.51 
robotank  15.71  134.87  71.50 
seaquest  3.81  3.71  5.88 
skiing  54.27  54.10  54.16 
solaris  27.05  34.61  28.66 
space invaders  188.65  146.39  608.44 
star gunner  756.60  205.70  977.99 
surround  28.29  1.51  78.15 
tennis  145.58  15.35  145.58 
time pilot  270.74  91.59  438.50 
tutankham  224.76  110.11  239.58 
up n down  1637.01  148.10  1484.43 
venture  1.76  1.76  1.76 
video pinball  3007.37  4325.02  4743.68 
wizard of wor  150.52  88.07  325.39 
yars revenge  81.54  23.39  252.83 
zaxxon  4.01  44.11  224.89 
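The per-method win counts reported in the results can be tallied from a table of this shape with a few lines of code. The sketch below uses five rows copied from the table as printed; the tie rule (exact equality of the top scores at two decimal places) is an assumption, since the text does not state how ties were determined.

```python
METHODS = ("A3C", "Q-learning", "PGQL")

# A few rows copied from the table above: game -> (A3C, Q-learning, PGQL).
rows = {
    "alien": (38.43, 25.53, 46.70),
    "breakout": (2341.13, 3291.22, 2959.16),
    "crazy climber": (510.25, 189.01, 476.11),
    "enduro": (9.94, 9.94, 9.94),
    "pong": (116.37, 24.96, 116.37),
}


def tally(rows):
    """Count outright wins per method; games whose top score is shared count as tied."""
    counts = {method: 0 for method in METHODS}
    counts["tied"] = 0
    for scores in rows.values():
        best = max(scores)
        winners = [m for m, s in zip(METHODS, scores) if s == best]
        if len(winners) > 1:
            counts["tied"] += 1
        else:
            counts[winners[0]] += 1
    return counts


counts = tally(rows)
```

On these five rows each method wins one game outright and two games (enduro and pong) are ties.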