1 Introduction
Reinforcement learning (RL) considers a problem where an agent interacts with the environment to maximize the cumulative reward through time. A standard approach to the RL problem is value-function-based reinforcement learning, which finds a policy that maximizes the value function (Sutton and Barto, 1998). Thus, the estimation of the value function of a given stationary policy of a Markov Decision Process (MDP) is an important subroutine of generalized policy iteration (Sutton and Barto, 1998) and a key intermediate step in generating a good control policy (Gelly and Silver, 2008; Tesauro, 1992). The value function is known to solve the Bellman equation, which succinctly describes the recursive relation on the state-action value function $Q^\pi(s,a)$:
$$Q^\pi(s,a) = \mathbb{E}\, R(s,a) + \gamma\, \mathbb{E}_{s',a'}\, Q^\pi(s',a'),$$
where the expectation is taken over the next state $s'$, the reward $R(s,a)$ and the action $a'$ from policy $\pi$, and $\gamma$ is the discount factor. Hence, many RL algorithms are based on the idea of solving the above Bellman equation in a sample-driven way, and one popular technique is temporal-difference (TD) learning (Sutton and Barto, 1998).
The last several years have witnessed the success of TD learning with value function approximation (Mnih et al., 2015; Van Hasselt et al., 2016), especially when using a deep neural network. In their seminal work, Tsitsiklis and Van Roy (1996) proved that the TD($\lambda$) algorithm converges when a linear function approximator is used and states are sampled according to the policy being evaluated (sometimes referred to as the on-policy setting in the RL literature). However, if either the function approximator is nonlinear or the on-policy setting does not hold, there are counterexamples demonstrating that TD($\lambda$) may diverge. To mitigate this problem, a family of TD-style algorithms called Gradient Temporal Difference (GTD) was proposed by Sutton et al. (2009a, b) to address the instability of the TD algorithm with a linear function approximator in the off-policy setting. These works rely on an objective function called the mean-squared projected Bellman error (MSPBE), whose unique optimum is the fixed point of the TD(0) algorithm. Bhatnagar et al. (2009) extend this idea to nonlinear smooth function approximators (e.g., neural networks) and prove the convergence of the algorithm under mild conditions. In the control setting, Maei et al. (2010) propose Greedy-GQ, which has an objective function similar to the MSPBE but w.r.t. the Bellman optimality operator.
Recently, the distributional perspective on reinforcement learning has gained much attention. Rather than studying only the expectation of the long-term return (i.e., $Q(s,a)$), it explicitly takes into consideration the stochastic nature of the long-term return $Z(s,a)$ (whose expectation is $Q(s,a)$). The recursion of $Z(s,a)$ is described by the distributional Bellman equation as follows,
$$Z(s,a) \overset{D}{=} R(s,a) + \gamma Z(s',a'), \qquad (1)$$
where $\overset{D}{=}$ stands for “equal in distribution” (see Section 2 for more detailed explanations). The distributional Bellman equation essentially asserts that the distribution of $Z$ is characterized by the reward $R$, the next random state-action pair $(s',a')$ following policy $\pi$ and its random return $Z(s',a')$. Following the notion in (Bellemare et al., 2017a) we call $Z$ the value distribution. Bellemare et al. (2017a) showed that for a fixed policy the Bellman operator over value distributions is a contraction in a maximal form of the Wasserstein metric, thus making it possible to learn the value distribution in a sample-driven way. There are several advantages to studying value distributions. First, real-life decision makers are sometimes interested in seeking big wins on rare occasions or avoiding a small chance of suffering a large loss. For example, in financial engineering, this risk-sensitive scenario is one of the central topics. Because of this, risk-sensitive RL has been an active research field in RL (Heger, 1994; Defourny et al., 2008; Bagnell and Schneider, 2003; Tamar et al., 2016), and the value distribution provides a very useful tool for designing risk-sensitive RL algorithms. Second, it can model the uncertainty. Engel et al. (2005) leveraged the distributional Bellman equation to define a Gaussian process over the unknown value function. Third, from the algorithmic view, as discussed in (Bellemare et al., 2017a), the distributional Bellman operator preserves multimodality in the value distribution, which leads to more stable learning. From the exploration-exploitation tradeoff perspective, if the value distribution is known, the agent can explore the region with high uncertainty, which is often called “optimism in the face of uncertainty” (Kearns and Singh, 2002; O’Donoghue et al., 2017).
Contributions: Although distributional approaches to RL (e.g., C51 in Bellemare et al. (2017a)) have shown promising results, their theoretical properties are not yet well understood, especially when the function approximator is nonlinear. Nonlinear function approximation is inevitable if we hope to combine RL with deep neural networks, a paradigm with tremendous recent success that enables automatic feature engineering and end-to-end learning to solve real problems. Therefore, to extend the applicability of the distributional approach to real problems and close the gap between theory and practical algorithms, we propose nonlinear distributional gradient temporal-difference learning. It inherits the merits of the nonlinear gradient TD and distributional approaches mentioned above. Using a similar heuristic, we also propose a distributional control algorithm called distributional Greedy-GQ.
The contributions of this paper are the following.

We propose a distributional MSPBE (DMSPBE) as the objective function to optimize, which is an extension of MSPBE when the stochastic nature of the random return is considered.

We derive two stochastic gradient algorithms to optimize the DMSPBE using the weight-duplication trick in (Sutton et al., 2009a; Bhatnagar et al., 2009). In each step, the computational complexity is linear w.r.t. the number of parameters of the function approximator, so the algorithms can be efficiently implemented for neural networks.

We propose a distributional RL algorithm in the control setting called distributional Greedy-GQ, which is a distributional counterpart of the Greedy-GQ of Maei et al. (2010).

We prove that distributional GTD2 and TDC converge to a locally optimal solution in the policy evaluation setting under mild conditions, using a two-timescale stochastic approximation argument. If a linear function approximator is used, we have a finite-sample bound.
Remarks: More precisely, we have addition operations in each step of algorithm but the costs of them are negligible compared to computations in neural networks in general. Thus the computational complexity in each step is still linear w.r.t. the number of parameters of the function approximator (neural networks).
2 Problem setting and preliminaries
We consider a standard setting in reinforcement learning, where an agent acts in a stochastic environment by sequentially choosing actions over a sequence of time steps in order to maximize the cumulative reward (Sutton and Barto, 1998). This problem is formulated as a Markov Decision Process (MDP), which is a 5-tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$: $\mathcal{S}$ is the finite state space, $\mathcal{A}$ is the finite action space, $P = (P(s' \mid s,a))_{s,s' \in \mathcal{S},\, a \in \mathcal{A}}$ are the transition probabilities, $R = (R(s,a))_{s \in \mathcal{S},\, a \in \mathcal{A}}$ are the real-valued immediate rewards and $\gamma \in (0,1)$ is the discount factor. A policy is used to select actions in the MDP. In general, the policy is stochastic and denoted by $\pi$, where $\pi(a \mid s)$ is the conditional probability density at $a$ associated with the policy. We also define , . Suppose the policy $\pi$ to be evaluated is followed and it generates a trajectory $(s_0, a_0, r_1, s_1, a_1, r_2, s_2, \dots)$. We are given an infinite sequence of 3-tuples $(s_t, r_t, s'_t)$ that satisfies the following assumption.
Assumption 1.
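$(s_t)_{t \ge 0}$ is an $\mathcal{S}$-valued stationary Markov process, $s_t \sim d(\cdot)$, $r_t = R(s_t, a_t)$ and $s'_t \sim P(\cdot \mid s_t, a_t)$.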
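Here $d(\cdot)$ denotes the probability distribution over initial states for a transition. Since stationarity is assumed, we can drop the index $t$ in the $t$-th transition and use $(s, r, s')$ to denote a random transition. The (state-action) value function $Q^\pi$ of a policy $\pi$ describes the expected return from taking action $a$ from state $s$:
$$Q^\pi(s,a) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t R(s_t,a_t) \,\Big|\, s_0 = s,\ a_0 = a\Big].$$
It is well known that the value function satisfies the Bellman equation $Q^\pi(s,a) = \mathbb{E}\, R(s,a) + \gamma\, \mathbb{E}_{s',a'}\, Q^\pi(s',a')$. Define the Bellman operator $\mathcal{T}^\pi$ as $\mathcal{T}^\pi Q(s,a) := \mathbb{E}\, R(s,a) + \gamma\, \mathbb{E}_{s',a'}\, Q(s',a')$; then the Bellman equation becomes $Q^\pi = \mathcal{T}^\pi Q^\pi$. To lighten the notation, from now on we may drop the superscript $\pi$ when the policy to be evaluated is kept fixed.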
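To make the fixed-point view of the Bellman equation concrete, the following minimal sketch applies the Bellman operator repeatedly on a hypothetical two-state, two-action MDP. The tables `P`, `R`, `pi`, the discount `GAMMA` and the function names are illustrative assumptions and do not come from the paper.

```python
# Hypothetical 2-state, 2-action MDP; every number below is illustrative.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]
# P[s][a][s2]: probability of moving to state s2 after taking action a in s.
P = {0: {0: [0.8, 0.2], 1: [0.1, 0.9]},
     1: {0: [0.5, 0.5], 1: [0.0, 1.0]}}
# R[s][a]: expected immediate reward.
R = {0: {0: 1.0, 1: 0.0}, 1: {0: 0.5, 1: 2.0}}
# pi[s][a]: probability that the evaluated policy picks action a in state s.
pi = {0: [0.5, 0.5], 1: [0.2, 0.8]}


def bellman_operator(q):
    """Return T q, where (T q)(s, a) = R(s, a) + gamma * E[q(s', a')]."""
    new_q = {}
    for s in STATES:
        for a in ACTIONS:
            expected_next = sum(
                P[s][a][s2] * pi[s2][a2] * q[(s2, a2)]
                for s2 in STATES for a2 in ACTIONS)
            new_q[(s, a)] = R[s][a] + GAMMA * expected_next
    return new_q


def evaluate_policy(num_iters=500):
    """Iterate q <- T q from zero; T is a gamma-contraction in the sup norm."""
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(num_iters):
        q = bellman_operator(q)
    return q
```

After enough iterations the returned table is numerically a fixed point of `bellman_operator`, which is the statement $Q^\pi = \mathcal{T}^\pi Q^\pi$ for this toy problem.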
2.1 Distributional Bellman equation and Cramér distance
Recall that the return $Z$ is the sum of discounted rewards along the agent’s trajectory of interactions with the environment, and hence $Q(s,a) = \mathbb{E}\, Z(s,a)$. When the stochastic nature of the return $Z$ is considered, we need a distributional variant of the Bellman equation which the distribution of $Z$ satisfies. Following the notion in (Bellemare et al., 2017a), we define the transition operator $P^\pi$:
$$P^\pi Z(s,a) :\overset{D}{=} Z(s',a'), \quad s' \sim P(\cdot \mid s,a),\ a' \sim \pi(\cdot \mid s'),$$
where $A :\overset{D}{=} B$ indicates that the random variable $A$ is distributed according to the same law as $B$. The distributional Bellman operator $\mathcal{T}^\pi$ is
$$\mathcal{T}^\pi Z(s,a) :\overset{D}{=} R(s,a) + \gamma P^\pi Z(s,a).$$
For a more rigorous definition and discussion of this operator, we refer the reader to (Bellemare et al., 2017a).
Bellemare et al. (2017a) prove that the distributional Bellman operator is a contraction in a maximal form of the Wasserstein metric. However, as they point out (see proposition 5 in their paper), in practice it is hard to estimate the Wasserstein distance using samples, and furthermore the gradient estimate w.r.t. the parameter of the function approximator is biased in general. Thus the KL divergence, rather than the Wasserstein metric, is used in the algorithm C51. However, the KL divergence may not be robust to discrepancies in the support of the distributions (Arjovsky et al., 2017). In this paper, we adopt the Cramér distance (Székely, 2003; Bellemare et al., 2017b) instead, since an unbiased sample gradient estimate of the Cramér distance can be easily obtained in the setting of reinforcement learning (Bellemare et al., 2017b). The square root of the Cramér distance is defined as follows: suppose there are two distributions $P$ and $Q$ whose cumulative distribution functions are $F_P$ and $F_Q$ respectively; then the square root of the Cramér distance between $P$ and $Q$ is
$$\ell_2(P,Q) := \Big(\int_{-\infty}^{\infty} \big(F_P(x) - F_Q(x)\big)^2\, dx\Big)^{1/2}.$$
Intuitively, it can be thought of as the two-norm on the distribution function. Indeed, the distributional Bellman operator is a contraction in a maximal form of the square root of the Cramér distance. Here, for two random returns $Z_1, Z_2$ with distributions $P_{Z_1}$ and $P_{Z_2}$, a maximal form of the square root of the Cramér distance is defined as
$$\bar\ell_2(P_{Z_1}, P_{Z_2}) := \sup_{s,a}\ \ell_2\big(P_{Z_1}(s,a), P_{Z_2}(s,a)\big).$$
Proposition 1.
$\bar\ell_2$ is a metric over value distributions and $\mathcal{T}^\pi$ is a contraction in $\bar\ell_2$.
The proof makes use of Theorem 2 in (Bellemare et al., 2017b), and is deferred to appendix.
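For discrete distributions on a shared, sorted set of atoms, the integral in the square root of the Cramér distance reduces to a finite sum, because both distribution functions are step functions that are constant between consecutive atoms. The sketch below illustrates this special case; the function names and the shared-atom restriction are assumptions made for the example.

```python
import math


def cdf(probs):
    """Running sums of the atom probabilities: F(z_j) = sum_{k <= j} p_k."""
    out, total = [], 0.0
    for p in probs:
        total += p
        out.append(total)
    return out


def cramer_sqrt(atoms, p, q):
    """Square root of the Cramer distance between two discrete distributions.

    Both distributions live on the same sorted atoms. Between z_j and
    z_{j+1} the two CDFs are constant, so the integral of their squared
    difference is a sum of (F_p(z_j) - F_q(z_j))^2 times the gap width.
    """
    fp, fq = cdf(p), cdf(q)
    squared = 0.0
    for j in range(len(atoms) - 1):
        width = atoms[j + 1] - atoms[j]
        squared += (fp[j] - fq[j]) ** 2 * width
    return math.sqrt(squared)
```

For example, two point masses at 0 and 2 on the atoms `[0, 1, 2]` differ by one in distribution function on the whole interval from 0 to 2, so the value is the square root of 2.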
2.2 Gradient TD and Greedy-GQ
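We now review the linear and nonlinear gradient TD and Greedy-GQ algorithms proposed by Sutton et al. (2009a); Bhatnagar et al. (2009); Maei et al. (2010), which helps to better understand the nonlinear distributional gradient TD and distributional Greedy-GQ. One approach in reinforcement learning for large-scale problems is to use a linear function approximation for the value function $Q$. Particularly, the value function is $\hat Q(s,a) = \theta^\top \phi(s,a)$, where the feature map is $\phi: \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$, and $\theta \in \mathbb{R}^d$ is the parameter of the linear model. The objective function of the gradient TD family is the mean squared projected Bellman error (MSPBE),
$$J(\theta) = \|\hat Q - \Pi \mathcal{T} \hat Q\|_D^2, \qquad (2)$$
where $\hat Q$ is the vector of the value function over $\mathcal{S} \times \mathcal{A}$, $D$ is a diagonal matrix with diagonal elements being the stationary distribution $d(s,a)$ over $\mathcal{S} \times \mathcal{A}$ induced by the policy $\pi$, and $\Pi$ is the weighted projection matrix onto the linear space spanned by the features, which is $\Pi = \Phi(\Phi^\top D \Phi)^{-1}\Phi^\top D$, with $\Phi$ the matrix whose rows are $\phi(s,a)^\top$. Substituting $\Pi$ into (2), the MSPBE can be written as
$$J(\theta) = \mathbb{E}[\delta\phi]^\top\, \mathbb{E}[\phi\phi^\top]^{-1}\, \mathbb{E}[\delta\phi], \qquad (3)$$
where $\delta$ is the TD error for a given transition $(s,a,r,s',a')$, i.e., $\delta = r + \gamma\theta^\top\phi' - \theta^\top\phi$ with $\phi = \phi(s,a)$ and $\phi' = \phi(s',a')$. Its negative gradient is, up to a factor of $1/2$,
$$-\tfrac{1}{2}\nabla J(\theta) = \mathbb{E}[(\phi - \gamma\phi')\phi^\top]\, w, \quad \text{where } w = \mathbb{E}[\phi\phi^\top]^{-1}\mathbb{E}[\delta\phi].$$
Sutton et al. (2009a) use the weight-duplication trick to update $w$ on a "faster" time scale as follows:
$$w_{t+1} = w_t + \beta_t(\delta_t - \phi_t^\top w_t)\phi_t.$$
Two different ways to update $\theta$ lead to GTD2 and TDC:
$$\text{GTD2: } \theta_{t+1} = \theta_t + \alpha_t(\phi_t - \gamma\phi'_t)(\phi_t^\top w_t), \qquad \text{TDC: } \theta_{t+1} = \theta_t + \alpha_t\delta_t\phi_t - \alpha_t\gamma\phi'_t(\phi_t^\top w_t). \qquad (4)$$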
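Once a nonlinear approximation $\hat Q_\theta$ is used, we can optimize a slightly different version of the MSPBE. There is an additional term $h_t$ in the update rule of $\theta$,
$$h_t = (\delta_t - \phi_t^\top w_t)\, \nabla^2 \hat Q_{\theta_t}(s_t,a_t)\, w_t,$$
where $\delta_t$ is the TD error, $w_t$ is the auxiliary weight vector and $\phi_t = \nabla_\theta \hat Q_{\theta_t}(s_t,a_t)$.
See more discussion in Section 3.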
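For the linear case, one step of the GTD2 and TDC updates of Sutton et al. (2009b) can be sketched in a few lines. This is a minimal illustration that represents vectors as plain Python lists; the function names and argument order are assumptions for the example, and the nonlinear correction term is deliberately left out.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def td_error(theta, phi, phi_next, reward, gamma):
    """delta = r + gamma * theta^T phi' - theta^T phi."""
    return reward + gamma * dot(theta, phi_next) - dot(theta, phi)


def gtd2_step(theta, w, phi, phi_next, reward, gamma, alpha, beta):
    """One GTD2 update with linear features.

    theta <- theta + alpha * (phi - gamma * phi') * (phi^T w)
    w     <- w + beta * (delta - phi^T w) * phi
    """
    delta = td_error(theta, phi, phi_next, reward, gamma)
    phi_w = dot(phi, w)
    new_theta = [t + alpha * (f - gamma * fn) * phi_w
                 for t, f, fn in zip(theta, phi, phi_next)]
    new_w = [wi + beta * (delta - phi_w) * f for wi, f in zip(w, phi)]
    return new_theta, new_w


def tdc_step(theta, w, phi, phi_next, reward, gamma, alpha, beta):
    """One TDC update with linear features.

    theta <- theta + alpha * (delta * phi - gamma * phi' * (phi^T w))
    w     <- w + beta * (delta - phi^T w) * phi
    """
    delta = td_error(theta, phi, phi_next, reward, gamma)
    phi_w = dot(phi, w)
    new_theta = [t + alpha * (delta * f - gamma * fn * phi_w)
                 for t, f, fn in zip(theta, phi, phi_next)]
    new_w = [wi + beta * (delta - phi_w) * f for wi, f in zip(w, phi)]
    return new_theta, new_w
```

Both variants share the same update of the auxiliary weights `w`; they differ only in how the primary parameter `theta` uses the estimate `phi^T w`. In practice `beta` is chosen larger than `alpha` so that `w` tracks its target on the faster timescale.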
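Similarly, Greedy-GQ optimizes the following objective function,
$$J(\theta) = \|\Pi \mathcal{T}^{\pi_\theta} \hat Q_\theta - \hat Q_\theta\|_D^2,$$
where $\pi_\theta$ is a greedy policy w.r.t. $\hat Q_\theta$. Reusing the weight-duplication trick, Maei et al. (2010) give the update rule
$$\theta_{t+1} = \theta_t + \alpha_t\big[\delta_t\phi_t - \gamma(w_t^\top\phi_t)\hat\phi_{t+1}\big], \qquad w_{t+1} = w_t + \beta_t(\delta_t - \phi_t^\top w_t)\phi_t,$$
where $\delta_t$ is the TD error, $\phi_t$ is the feature vector of the current state-action pair, and $\hat\phi_{t+1}$ is an unbiased estimate of the expected feature vector of the next state under $\pi_\theta$.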
3 Nonlinear Distributional Gradient TD
In this section, we propose distributional gradient TD algorithms by considering the Cramér distance between the value distributions of $Z(s)$ and $\mathcal{T}Z(s)$, which is a distributional counterpart of Bhatnagar et al. (2009). To ease the exposition, in the following we consider the value distribution on the state $s$ rather than the state-action pair $(s,a)$, since the extension to $(s,a)$ is straightforward.
3.1 Distributional MSPBE (DMSPBE)
Suppose there are states. One simple choice of the objective function is as follows
(5) 
However, a major challenge in optimizing (5) is the double sampling problem, i.e., two independent samples are required from each state. To see this, notice that if we only consider the expectation of the return, (5) reduces to the mean squared Bellman error (MSBE), and the corresponding algorithms to minimize the MSBE are the well-known residual gradient algorithms (Baird, 1995), which are known to require two independent samples for each state (Dann et al., 2014). To get around the double sampling problem, we instead adapt the MSPBE into its distributional version. To simplify the problem, we follow (Bellemare et al., 2017a) and assume that the value distribution is discrete with range $[V_{\min}, V_{\max}]$ and that its support is the set of atoms $\{z_j = V_{\min} + (j-1)\Delta z : 1 \le j \le m\}$, $\Delta z = \frac{V_{\max} - V_{\min}}{m-1}$. In practice $V_{\min}$ and $V_{\max}$ are not hard to get. For instance, suppose we know the bound on the reward $|r| \le b$; then we can take $V_{\min}$ and $V_{\max}$ as $\pm\frac{b}{1-\gamma}$. We further assume that the atom probability can be given by a parametric model with parameter $\theta$, such as a softmax function:
$$p_\theta(s, z_j) = \frac{\exp\big(l_\theta(s,z_j)\big)}{\sum_{k=1}^{m}\exp\big(l_\theta(s,z_k)\big)},$$
where $l_\theta(s,z_j)$ can be a nonlinear function, e.g., a neural network. From an algorithmic perspective, such an assumption or approximation is necessary, since it is hard to represent the full space of probability distributions.
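The categorical parametrization described above can be sketched as follows: build equally spaced atoms on a bounded range, turn arbitrary scores into atom probabilities with a softmax, and accumulate them into a distribution function. The helper names are hypothetical, and the scores passed to `softmax` stand in for the outputs of whatever nonlinear model is used.

```python
import math


def make_atoms(v_min, v_max, m):
    """m equally spaced atoms from v_min to v_max (inclusive), m >= 2."""
    step = (v_max - v_min) / (m - 1)
    return [v_min + j * step for j in range(m)]


def value_range(reward_bound, gamma):
    """Symmetric return bound b / (1 - gamma) for rewards with |r| <= b."""
    bound = reward_bound / (1.0 - gamma)
    return -bound, bound


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def distribution_function(probs):
    """Cumulative sums F(z_j) = sum_{k <= j} p_k of the atom probabilities."""
    out, total = [], 0.0
    for p in probs:
        total += p
        out.append(total)
    return out
```

Subtracting the maximum score before exponentiating does not change the softmax output but avoids overflow for large scores.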
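We denote the (cumulative) distribution function of $Z(s)$ as $F_\theta(s, z)$. Notice that $F_\theta(s,z)$ is nonlinear w.r.t. $\theta$ in general, thus it is not restricted to a hyperplane as in the linear function approximation. Following the nonlinear gradient TD in (Bhatnagar et al., 2009), we need to define the tangent space at $\theta$. Particularly, we denote $F_\theta \in \mathbb{R}^{|\mathcal{S}|m}$ as the vector of $F_\theta(s,z_j)$ over all states and atoms, where $m$ is the number of atoms, and assume $F_\theta$ is a differentiable function of $\theta$. Then $\mathcal{M} = \{F_\theta : \theta \in \mathbb{R}^d\}$ becomes a differentiable submanifold of $\mathbb{R}^{|\mathcal{S}|m}$. Define $\phi_\theta(s,z_j) = \partial F_\theta(s,z_j)/\partial\theta$; then the tangent space at $\theta$ is $T\mathcal{M}_\theta = \{\Phi_\theta a : a \in \mathbb{R}^d\}$, where $\Phi_\theta \in \mathbb{R}^{|\mathcal{S}|m \times d}$ is defined as $(\Phi_\theta)_{(s,z_j),i} = \partial F_\theta(s,z_j)/\partial\theta_i$, i.e., each row of it is $\phi_\theta(s,z_j)^\top$. Let $\Pi_\theta$ be the projection that projects vectors onto $T\mathcal{M}_\theta$. Particularly, to project a distribution function $F(s,z)$ onto $T\mathcal{M}_\theta$ w.r.t. the Cramér distance, we need to solve the following problem
$$\hat a = \arg\min_{a} \sum_{s} d(s) \sum_{j=1}^{m} \big(F(s,z_j) - \phi_\theta(s,z_j)^\top a\big)^2, \qquad (6)$$
where $F(s,z_j)$ is the value of the distribution function of $Z(s)$ at $z_j$. Since this is a least squares problem, the projection operator has the closed form
$$\Pi_\theta = \Phi_\theta(\Phi_\theta^\top D \Phi_\theta)^{-1}\Phi_\theta^\top D,$$
where $D$ is a $|\mathcal{S}|m \times |\mathcal{S}|m$ diagonal matrix whose diagonal elements are the stationary probabilities $d(s)$, each repeated for the $m$ atoms of state $s$. Given this projection operator, we propose the distributional MSPBE (DMSPBE). Particularly, the objective function to optimize is as follows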
where is the vector form of , and is the value of distribution function of at atom . Assume is nonsingular, similar to the MSPBE, we rewrite the above formulation into another form.
DMSPBE:
(7) 
To better understand DMSPBE, we compare it with MSPBE. First, in equation (3), we have the term , which is the difference between the value function and , while we have the difference between the distribution of and in equation (7). Second, the matrix is slightly different, since in each state we need atoms to describe the value distribution. Thus we have the diagonal element as . Third, is a gradient for and thus depends on the parameter rather than a constant feature matrix, which is similar to (Bhatnagar et al., 2009).
3.2 Distributional GTD2 and Distributional TDC
In this section, we use the stochastic gradient to optimize the DMSPBE (equation 7) and derive the update rule of distributional GTD2 and distributional TDC.
The first step is to estimate from samples. We denote as the empirical distribution of and . Notice one unbiased empirical distribution at state is the distribution of , whose distribution function is by simply shifting and shrinking the distribution of . Then we have
Then we can write DMSPBE in the following way
We define , analogous to the temporal difference, and call it temporal distribution difference. To ease the exposition, we denote Then we have
(8) 
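The "shifting and shrinking" step can be made concrete with a small sketch. For $\gamma > 0$, the distribution function of $r + \gamma Z(s')$ at a point $z$ equals the distribution function of $Z(s')$ at $(z - r)/\gamma$. The code below evaluates this at every atom and subtracts the current distribution function, giving one value of the temporal distribution difference per atom. Two choices here are assumptions made for illustration: the sign convention (target minus current, by analogy with the TD error) and the exact evaluation of the step-function CDF at off-support points instead of rounding to the nearest atom.

```python
def step_cdf(atoms, probs, x):
    """P(Z <= x) for a discrete Z supported on the given atoms."""
    return sum(p for z, p in zip(atoms, probs) if z <= x)


def temporal_distribution_difference(atoms, probs_s, probs_next,
                                     reward, gamma):
    """Per-atom difference between target and current distribution functions.

    The target is the CDF of reward + gamma * Z(s'), obtained by evaluating
    the CDF of Z(s') at (z - reward) / gamma; the current value is the CDF
    of Z(s) at the same atom z. Requires gamma > 0.
    """
    diffs = []
    for z in atoms:
        target = step_cdf(atoms, probs_next, (z - reward) / gamma)
        current = step_cdf(atoms, probs_s, z)
        diffs.append(target - current)
    return diffs
```

When the current distribution already equals the shifted and shrunk next-state distribution, every entry is zero, which mirrors a zero TD error in the non-distributional case.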
In the following theorem, we choose , an unbiased empirical distribution we mentioned above and give the gradient of DMSPBE w.r.t. . We defer the proof to the appendix.
Theorem 1.
Assume that is twice continuously differentiable in for any and , is nonsingular in a small neighborhood of . Denote , then we have
(9) 
which has another form
(10) 
where
Based on Theorem 1, we obtain the algorithm of distributional GTD2 and distributional TDC. Particularly (9) leads to Algorithm 1, and (10) leads to Algorithm 2. The difference between distributional gradient TD methods and regular ones are highlighted in boldface.
Some remarks are in order. We use distributional GTD2 as an example, but all remarks hold for the distributional TDC as well.
(1). We stress the difference between the update rule of GTD2 in (4) and that of the distributional GTD2 (highlighted in boldface): in the distributional GTD2, we use the temporal distribution difference instead of the temporal difference in GTD2. Also notice there is a summation over , which corresponds to the integral in the Cramér distance, since we need the difference over the whole distribution rather than at a certain point. The term comes from the shifting and shrinkage of the distribution function of .
(2). The term results from the nonlinear function approximation, which is zero in the linear case. This term is similar to the one in nonlinear GTD2 (Bhatnagar et al., 2009). Notice we do not need to explicitly calculate the Hessian in the term . This term can be evaluated using forward and backward propagation in neural networks with the complexity scaling linearly w.r.t. the number of parameters in neural networks, see the work (Pearlmutter, 1994) or chapter 5.4.6 in (Christopher, 2016). We give an example in the appendix to illustrate how to calculate this term.
(3). It is possible that is not on the support of the distribution in practice. Thus we need to approximate it by projecting it onto the support of the distribution, e.g., rounding to the nearest atom. This projection step would lead to further errors, which are out of the scope of this paper. We refer readers to the related discussion in (Dabney et al., 2017; Rowland et al., 2018), and leave its analysis as future work.
(4). The aim of is to estimate for a fixed value of . Thus is updated on a "faster" timescale and parameter is updated on a "slower" timescale.
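Remark (2) relies on computing a Hessian-vector product without forming the Hessian. The sketch below illustrates the idea with a central finite difference of the gradient, which is a simpler substitute for the exact method of Pearlmutter (1994) and is not the procedure used in the paper. The toy function, its hand-written gradient and the step size `eps` are assumptions for the example.

```python
import math


def grad_f(theta):
    """Gradient of the toy function f(t0, t1) = t0**2 * t1 + sin(t1)."""
    t0, t1 = theta
    return [2.0 * t0 * t1, t0 ** 2 + math.cos(t1)]


def hessian_vector_product(grad, theta, v, eps=1e-5):
    """Approximate H(theta) v using two gradient evaluations.

    Uses (grad(theta + eps * v) - grad(theta - eps * v)) / (2 * eps), so the
    cost is two gradient calls and the Hessian is never stored.
    """
    plus = grad([t + eps * vi for t, vi in zip(theta, v)])
    minus = grad([t - eps * vi for t, vi in zip(theta, v)])
    return [(gp - gm) / (2.0 * eps) for gp, gm in zip(plus, minus)]
```

For the toy function the Hessian is [[2 t1, 2 t0], [2 t0, -sin(t1)]], so the approximation can be checked by hand. For a neural network, `grad` would be one backward pass, which is why the overall cost stays linear in the number of parameters.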
4 Distributional Greedy-GQ
In practice, we care more about the control problem. Thus in this section, we propose the distributional Greedy-GQ for the control setting. Now we denote as the distribution function of . Policy is a greedy policy w.r.t. , i.e., the mean of . is the distribution function of . The aim of the distributional Greedy-GQ is to optimize the following objective function.
Using almost the same derivation as for distributional GTD2 (the only difference is in notation, and we omit the term here), we give the following Algorithm 3, analogous to the Greedy-GQ of Maei et al. (2010).
Some remarks are in order.

interpolates the distributional Q-learning and distributional Greedy-GQ. When , it reduces to the distributional Q-learning with the Cramér distance, while C51 uses the KL divergence. When it is the distributional Greedy-GQ, where we mainly use the temporal distribution difference to replace the TD error in Maei et al. (2010).

Unfortunately, for the nonlinear function approximator in the control setting, we do not yet have a convergence guarantee. If linear function approximation is used, we may obtain an asymptotic convergence result following an argument similar to that in Maei et al. (2010). We leave both of them as future work.
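The greedy policy in the control setting acts on the mean of each action's value distribution. A minimal sketch of that selection rule for categorical distributions is given below; the function names and the list-of-lists layout of `probs_per_action` are assumptions for the example.

```python
def expected_value(atoms, probs):
    """Mean of a discrete distribution: sum_j z_j * p_j."""
    return sum(z * p for z, p in zip(atoms, probs))


def greedy_action(atoms, probs_per_action):
    """Index of the action whose value distribution has the largest mean.

    probs_per_action[a] holds the atom probabilities of action a in the
    current state. Ties are broken in favour of the smallest index.
    """
    means = [expected_value(atoms, probs) for probs in probs_per_action]
    return max(range(len(means)), key=means.__getitem__)
```

Only the mean enters the action choice here; the rest of the distribution is used by the learning update, not by the greedy step.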
5 Convergence analysis
In this section, we analyze the convergence of distributional GTD2 and distributional TDC and leave the convergence of distributional Greedy-GQ as future work. To the best of our knowledge, the convergence of control algorithms, even in the non-distributional setting, is still a tricky problem. The argument essentially follows the two-timescale analysis (e.g., Theorem 2 in (Sutton et al., 2009b) and (Bhatnagar et al., 2009)). We first define some notions used in our theorem. Given a compact set , let be the space of continuous mappings . Given projection onto , let operator be
When (interior of ), . Otherwise, if , is the projection of to the tangent space of at . Consider the following ODE:
where is the DMSPBE in (7). Let be the set of all asymptotically stable equilibria of the above ODE. By the definitions . We then have the following convergence theorem, whose proof is deferred to the appendix.
Theorem 2.
If we assume the distribution function can be approximated by a linear function, we can obtain a finite-sample bound. Due to space limits, we defer it to the appendix.
6 Experimental result
6.1 Distributional GTD2 and distributional TDC
In this section we assess the empirical performance of the proposed distributional GTD2 and distributional TDC and compare it with that of their non-distributional counterparts, namely GTD2 and TDC. Since it is hard to compare the performance of distributional GTD2 and TDC with their non-distributional counterparts in a pure policy evaluation environment, we use a simple control problem, the cartpole problem, to test the algorithms, where we do several policy evaluation steps to get an accurate estimate of the value function and then do a policy improvement step. To apply distributional GTD2 or distributional TDC, we use a neural network to approximate the distribution function . Particularly, we use a neural network with one hidden layer; the inputs of the neural network are state-action pairs, and the output is a softmax function. There are hidden units and we choose the number of atoms as in the distribution, i.e., the number of outputs in the softmax function is . In the policy evaluation step, the update rule of and is simple, since we just need to calculate the gradient of , which can be obtained by forward and backward propagation. The update rule of is slightly more involved, where we have the term . Roughly speaking, it requires four times as many computations as regular back propagation, and we present the update rule in the appendix. In the control step, we use the $\epsilon$-greedy policy over the expected action values, where $\epsilon$ starts at and decreases gradually to . To implement regular nonlinear GTD2 and TDC (Bhatnagar et al., 2009), we still use a one-hidden-layer neural network with 30 hidden units. The output is . The control policy is $\epsilon$-greedy with at the beginning and decreases to gradually. In the experiment, we choose discount factor . Since the reward is bounded in , in distributional GTD2 and distributional TDC we choose and . In the experiment, we use episodes to evaluate the policy, and then choose the policy by the $\epsilon$-greedy strategy. We report experimental results (mean performance with standard deviation) in the left and middle panels of Figure 1. All experimental results are averaged over 30 repetitions. We observe that the distributional GTD2 has the best result, followed by the distributional TDC. The distributional TDC seems to improve the policy faster at the early stage of training and then slows down. The performance of regular GTD2 and TDC is inferior to that of their distributional counterparts. We also observe that the standard deviations of the distributional versions are smaller than those of the regular ones. In addition, the performance of the distributional algorithms increases steadily with few oscillations. Thus the simulation results show that distributional RL is more stable than the regular one, which matches the argument and observations in (Bellemare et al., 2017a). In the right panel, we draw a distribution function of estimated by the algorithm.
To test whether the algorithm converges in the off-policy setting, we run an experiment on grid world to compare distributional GTD2 (with 50 atoms) and GTD2 in the off-policy setting. The target policy is set to be the optimal policy. The data-generating policy is a perturbation of the optimal policy (with probability 0.05 of choosing a random action). We run distributional GTD2 and GTD2 and calculate the MSPBE every 1k time steps. Results are shown in Figure 2. Both GTD2 and distributional GTD2 converge at a similar speed, while distributional GTD2 has smaller variance.
6.2 Distributional Greedy-GQ
In practice, we are more interested in the control problem. Therefore, the aim of this section is to test the performance of distributional Greedy-GQ and compare the result with DQN and distributional DQN (C51). All algorithms are implemented in the off-policy setting with standard experience replay. Particularly, we test the algorithms in the environments CartPole-v0 and LunarLander-v2 in OpenAI Gym (Brockman et al., 2016) and in ViZDoom (Kempka et al., 2016). On the ViZDoom platform, we choose "defend the center" as the environment, where the agent occupies the center of a circular arena. Enemies are continuously spawned far away and move closer to the agent until they are close enough to attack. The death penalty is given by the environment. We give the penalty if the agent loses ammo and health. Reward is received if one enemy is killed. In total, there are 26 bullets. The aim of the agent is to kill enemies and avoid being attacked and killed. For the environments CartPole and LunarLander, to implement distributional Greedy-GQ and C51, we use a neural network with two hidden layers of 64 hidden units to approximate the value distribution, where the activation functions are ReLU. The outputs are softmax functions with 40 units to approximate the atom probabilities. We apply Adam with learning rate 5e-4 to train the agent. In the ViZDoom experiment, the first three layers are convolutional, followed by a dense layer, and all activation functions are ReLU. The outputs are softmax functions with 50 units. We set and in the experiment.
We show all experimental results in Figure 3. In the CartPole and ViZDoom experiments, the performance of distributional Greedy-GQ is comparable with that of C51, and both of them are better than DQN. Particularly, in the left panel, distributional Greedy-GQ is slightly better than C51. The variances of both are smaller than that of DQN, possibly because the distributional algorithms are more stable. In the ViZDoom experiment, C51 learns faster than distributional Greedy-GQ at the beginning, but after 7000 episodes of training distributional Greedy-GQ matches the performance of C51 and later starts to outperform it. In the middle panel, C51 and distributional Greedy-GQ are slightly better than their non-distributional counterpart DQN.
7 Conclusion and Future work
In this paper, we propose two nonlinear distributional gradient TD algorithms and prove their convergence to a local optimum, while in the control setting we propose the distributional Greedy-GQ. We compare the performance of our algorithms with that of their non-distributional counterparts and show their superiority. Distributional RL has several advantages over the regular approach, e.g., it provides a richer set of predictions and the learning is more stable. Based on this work on distributional RL, we foresee many interesting future research directions on performing RL beyond point estimation of the value function. An immediate one is to develop efficient exploration and to devise the control policy using more information from the distribution rather than only its expectation.
References
 Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. stat, 1050:26, 2017.
 Bagnell and Schneider (2003) J Andrew Bagnell and Jeff Schneider. Covariant policy search. IJCAI, 2003.
 Baird (1995) Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pages 30–37. Elsevier, 1995.
 Bellemare et al. (2017a) Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pages 449–458, 2017a.
 Bellemare et al. (2017b) Marc G Bellemare, Ivo Danihelka, Will Dabney, Shakir Mohamed, Balaji Lakshminarayanan, Stephan Hoyer, and Rémi Munos. The cramer distance as a solution to biased wasserstein gradients. arXiv preprint arXiv:1705.10743, 2017b.
 Bhatnagar et al. (2009) Shalabh Bhatnagar, Doina Precup, David Silver, Richard S Sutton, Hamid R Maei, and Csaba Szepesvári. Convergent temporal-difference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems, pages 1204–1212, 2009.
 Borkar and Meyn (2000) Vivek S Borkar and Sean P Meyn. The ode method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447–469, 2000.
 Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
 Christopher (2016) M Bishop Christopher. Pattern Recognition and Machine Learning. Springer-Verlag New York, 2016.
 Dabney et al. (2017) Will Dabney, Mark Rowland, Marc G Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. arXiv preprint arXiv:1710.10044, 2017.
 Dann et al. (2014) Christoph Dann, Gerhard Neumann, and Jan Peters. Policy evaluation with temporal differences: A survey and comparison. The Journal of Machine Learning Research, 15(1):809–883, 2014.
 Defourny et al. (2008) Boris Defourny, Damien Ernst, and Louis Wehenkel. Risk-aware decision making and dynamic programming. 2008.
 Engel et al. (2005) Yaakov Engel, Shie Mannor, and Ron Meir. Reinforcement learning with gaussian processes. In Proceedings of the 22nd international conference on Machine learning, pages 201–208. ACM, 2005.
 Gelly and Silver (2008) Sylvain Gelly and David Silver. Achieving master level play in 9 x 9 computer go. In AAAI, volume 8, pages 1537–1540, 2008.
 Heger (1994) Matthias Heger. Consideration of risk in reinforcement learning. In Machine Learning Proceedings 1994, pages 105–111. Elsevier, 1994.
 Juditsky et al. (2011) Anatoli Juditsky, Arkadi Nemirovski, and Claire Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 1(1):17–58, 2011.
 Kearns and Singh (2002) Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine learning, 49(2-3):209–232, 2002.
 Kempka et al. (2016) Michał Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaśkowski. Vizdoom: A doom-based ai research platform for visual reinforcement learning. In Computational Intelligence and Games (CIG), 2016 IEEE Conference on, pages 1–8. IEEE, 2016.
 Maei et al. (2010) Hamid Reza Maei, Csaba Szepesvári, Shalabh Bhatnagar, and Richard S Sutton. Toward off-policy learning control with function approximation. In ICML, pages 719–726, 2010.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 O’Donoghue et al. (2017) Brendan O’Donoghue, Ian Osband, Remi Munos, and Volodymyr Mnih. The uncertainty bellman equation and exploration. arXiv preprint arXiv:1709.05380, 2017.
 Pearlmutter (1994) Barak A Pearlmutter. Fast exact multiplication by the hessian. Neural computation, 6(1):147–160, 1994.

 Rowland et al. (2018) Mark Rowland, Marc Bellemare, Will Dabney, Remi Munos, and Yee Whye Teh. An analysis of categorical distributional reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 29–37, 2018.
 Sutton and Barto (1998) Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
 Sutton et al. (2009a) Richard S Sutton, Hamid R Maei, and Csaba Szepesvári. A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in neural information processing systems, pages 1609–1616, 2009a.
 Sutton et al. (2009b) Richard S Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári, and Eric Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 993–1000. ACM, 2009b.
 Székely (2003) GJ Székely. E-statistics: The energy of statistical samples. Bowling Green State University, Department of Mathematics and Statistics Technical Report, 3(05):1–18, 2003.
 Tamar et al. (2016) Aviv Tamar, Dotan Di Castro, and Shie Mannor. Learning the variance of the reward-to-go. Journal of Machine Learning Research, 17(13):1–36, 2016.
 Tesauro (1992) Gerald Tesauro. Practical issues in temporal difference learning. In Advances in neural information processing systems, pages 259–266, 1992.
 Tsitsiklis and Van Roy (1996) JN Tsitsiklis and B Van Roy. An analysis of temporaldifference learning with function approximation. IEEE Transactions on Automatic Control, 42:674–690, 1996.
 Van Hasselt et al. (2016) Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In AAAI, volume 16, pages 2094–2100, 2016.
Appendix A Proof of Theorem 1
We take the gradient of w.r.t. and denote . Notice that is a function depending on rather than the constant feature vector in the linear function approximation.
(11) 
To get around the double sampling problem [Dann et al., 2014], we follow the idea in Gradient TD and introduce a new variable
Notice this is the solution of the following problem Using stochastic gradient method, we can solve above problem and obtain the update rule of at time step .
(12) 
Replace corresponding terms by , we obtain
Thus we have