1 Introduction
In recent years, Reinforcement Learning(RL) has come into fasion. General method in ordinary Reinforcement Learning using Markov decision processes use a state action value functions[1]. Agents created by these algorithms take strategies to maximize the expected value of the cumulative reward. However, in practical use , there are many situations where it is necessary to consider not only expected values but also risks. Therefore, Distributional Reinforcement Learning(DRL) that considers the distribution of cumulative rewards has also been studied. DRL research presents a particle method of risk responsive algorithm[2]. As for similar research, there are[3][4],which is equivalent to [2] mathematically,but used the different algorithm and parametric methods[5]. [4] discusses the convergence of measures in discrete steps. Another way to practice DRL is using the Bayesian approach. In [22],it is regarded as an estimation of the uncertainty of the expected value.But in fact, the Bayesian inferece can approximate the distribution of uncertain objecsion. It can perform distributed reinforcement learning. There are other existing papers on Bayesian reinforcement learning. We want to take [6][7] up this time.It is a method using Gaussian processes, and it can be said that the reward follows Gaussian distributions. [5] also supports unbounded rewards like Gaussian distributions. We want to show that the approximation of the cumulative reward distribution converges even in unbounded rewards. In this paper, we prove the convergence of the normal state action value function as a preliminary step. In addition, we perform the convergence proof for Q functions with continuous concentration domain,taking Deep Qlearning(DQN) into consideration.
1.1 Related works
The proof history of Qfunction convergence is long. For example, there are papers such as [8], [9], [10], and [11] using [10]. A paper on an unusual proof method is [12] using ordinary differential equations. For DQN, there is a study [13] summarizing the approximation error. The approximation error due to the neural network is verified there. Other research results include [14][15][16][17][18]. All of these studies assume that rewards are bounded. That is, there is a certain constant
and(1.1) 
holds. Therefore, Gaussian distributions cannot be assumed. In this paper, we prove the convergence of the Q function under condition ,
(1.2) 
which is more relaxed than (1.1), with normal distribution in mind. Finally, we prove the convergence of the Q function in the domain of continuous concentration under ideal conditions. This is a frequent concept in reinforcement learning.
2 Background
2.1 transition kearnel
Let two tuples be both measurable spaces.Transition kernel is defined to satisfy the following two conditions.
(2.1)  
(2.2) 
This is used in situations where is fixed and the distribution on is fixed.
2.2 Markov decision process
Assume that both the set of states and the set of actions are finite sets. A transition kernel is defined on . That is,
is a probability measure that governs the distribution of the next state
and immediate reward when an action is taken in state is there. The strategy is the action probability determined from the current situation, as can be seen from the definition. The deterministic approach is that for any , there is a and. A set of random variables
taking values in is written as . This stochastic process is called Markov decision process(MDP).2.3 Optimal measures and state action value functions
Put the whole set of policies as . The state action value function for the policy is defined as follows.
(2.3) 
Furthermore, the state value function is defined as follows.
(2.4) 
Define the optimal strategy as
(2.5) 
In addition, the state action value function for the optimal policy is called the optimum state action value function, and simply expressed as . The action that takes the maximum value for the optimal state action function is the optimal policy.
(2.6) 
holds for any .
3 Update of state action value function and Robbins Monro condition
Update the Qunction as follows
(3.1) 
The following sequence satisfies the RobbinsMonro condition.
(3.2)  
(3.3)  
(3.4) 
Using this, the mapping is defined as follows.
(3.5) 
In addition,it is assumed that this also satisfies the Robbins Monroe condition stochastically uniformly for arbitrary .
(3.6)  
(3.7) 
4 Proof of Qfunction convergence for unbounded rewards
Consider a realvalued function on a finite set .
Theorem 1
Convergence of Qvalue in case of Gaussian rewards
is finite set. Let ramdom value . Let be the set of functions and is defined as . For any , let .
(4.1) 
proof.
In line with the proof of [9]. The condition is relaxed and the statement is stronger, so it needs to be done more precisely. Consider a stochastic process of . Since is a constant, . Putting, this is measurable stochastic process. Furthermore, if we put , by definition . The two stochastic processes are taken so that . Define time evolution as
(4.2)  
(4.3) 
However, . At this time, . First, we show that converges to 0 for with probability 1 by using Lemma 2. By definition, , so holds. From Lemma 1 and the definition of , holds. Putting , this random variable is measurable and takes a finite value with probability . Since is a finite value, a certain constant can be taken so that holds. And the following holds with probability 1.
(4.4) 
Using the above formula, the following holds
(4.5) 
Suppose there is that is . At this time, put
(4.6)  
(4.7)  
(4.8) 
Then,
(4.9)  
(4.10)  
(4.11)  
(4.12) 
Putting , can be said. Since exists, exists for any , and can be said. It is clear from the equation that when , and Then, holds. Therefore, it was shown earlier that exists for any , in addition can be also said. holds, so the following equation hold.for all
(4.13)  
(4.14)  
(4.15) 
Then,
(4.16)  
(4.17) 
holds for all . When we use Lemma2,putting
(4.18)  
(4.19) 
can be said. Since , holds. Then, for any , set and , then
(4.20)  
(4.21) 
holds. The latter is based on Robbins Monro conditions. Therefore, holds for any . Define the linear operator as follows: for
(4.22)  
(4.23) 
is a fixed point for this operator. For any
(4.24)  
(4.25)  
(4.26)  
(4.27) 
Thus is a reduction operator.
(4.28)  
(4.29)  
(4.30)  
(4.31) 
Then,
(4.32)  
(4.33) 
converges uniformly to 0 with a probability of 1 for any as described above. Therefore, from Lemma 3, for any . That is, for any , , which holds the main theorem assertion.
5 Theorem for SARASA
The method in Chapter 3 is called Qlearning, and the value is updated before performing the next action. On the other hand, SARASA updates the value after performing the following actions.
(5.1) 
is often stochastically determined by softmax function or the like.
Theorem 2
Suppose that the Q function is updated by the above SARASA method. At this time,
(5.2) 
proof.
Put It is clear from the definition that . Later along this follows the proof of Theorem 1.
6 Convergence proof for unbounded rewards under continuous concentration
For example, in a situation such as DQN, an update for one has an effect on other state actions. As a simple model to take such situations into account, we put the ripple function defined on the compact set . This satisfies the next conditions.
(6.1)  
(6.2) 
If is a continuous function, it can be used to depart from any continuous function and have the same convergence on the compact set. Let be a simple connected compact set. Let be a continuous function on . Let be a continuous function on .
(6.3) 
At this time,
proof. Consider a finite set on . Limiting to converges to a correct function uniformly over from Theorem 1.For any Since is a continuous function, the function whose value is defined on a dense set is uniquely determined. Convergence can be said.
7 Conclusion and Future Work
As we mentioned earlier,we want to prove the convergence of the distribution. An order evaluation of the expected value should be also performed. We also want to estimate the convergence order for a specific neural network such as [13]. According to [13], as with Theorem 3, in the domain of continuous concentration, as , using constants
(7.1) 
is established. However, when follows a normal distribution, , so the upper limit of the error is infinite, and this unexpected expression has no meaning. In case of using unbounded rewards, stronger inequality proofs are needed.
References
 [1] Watldns, C.J.C.H. . Learning from delayed rewards. PhD Thesis, University of Cambridge, England. 1989

[2]
T.Morimura,M Sugiyama,H Kashima,H Hachiya,and T.Tanaka.Nonparametric return distribution approximation for reinforcement learning.In International Conference on Machine Learning,2010
 [3] Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pp. 449–458, 2017.

[4]
Rowland, M., Bellemare, M. G., Dabney, W., Munos, R., and Teh, Y. W. . An analysis of categorical distributional reinforcement learning. In Artificial Intelligence and Statistics (AISTATS).2018
 [5] T.Morimura,M.Sugiyama,H.Kashima,H.Hachiya,and T.Tanaka.Parametric return density estimation for reinforcement learning. In Conference on Uncertainty in Artificial Intterigence,2010
 [6] Azizzadenesheli, K., Brunskill, E., Anandkumar, A., 2018. Efficient exploration through bayesian deep qnetworks.arXiv preprint arXiv:1802.04412.2018
 [7] Rasmussen, C. E. and Kuss, M.. Gaussian processes in reinforcement learning. In Thrun, S., Saul, L. K., and Schölkopf, B., editors, Advances in Neural Information Processing Systems 16, pages 751–759, Cambridge, MA, USA. MIT Press.2004
 [8] CHRISTOPHER J.C.H. WATKINS ,PETER DAYAN,Qlearning,Machine Learning, 8, 279292 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.1992
 [9] J. N. Tsitsiklis, Asynchronous Stochastic Approximation and Qlearning, Machine Learning, 16, 1994
 [10] Tommi Jakkola, Michael I. Jordan, and Satinder P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6):1185–1201, 1994.
 [11] F. S. Melo, Convergence of Qlearning: A simple proof, Institute Of Systems and Robotics, Tech. Rep.2001
 [12] V. S. Borkar and S. P. Meyn. The O.D.E. method for convergence of stochastic approximation.and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447–469, 2000.
 [13] Zhuoran Yang,Yuchen Xie,Zhaoran Wang,A Theoretical Analysis of Deep QLearning,arXiv preprint arXiv:1901.00137v2.2019
 [14] Scherrer, B., Ghavamzadeh, M., Gabillon, V., Lesner, B. and Geist, M. . Approximate modified policy iteration and its application to the game of tetris. Journal of Machine Learning Research, 16 1629–1676.2015
 [15] Farahmand, A.m., Szepesvári, C. and Munos, R. . Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems.2010
 [16] Andras Antos, Csaba Szepesvári, and Rémi Munos. Learning nearoptimal policies with Bellmanresidual minimization based fitted policy iteration and a single sample path. Machine Learning, 71:89–129, 2008.
 [17] Remi Munos. Performance bounds in ĺp norm for approximate value iteration. SIAM Journalon Control and Optimization, 2007.
 [18] Remi Munos. Error bounds for approximate policy iteration. In ÍCML 2003: Proceedings of the 20th Annual International Conference on Machine Learning, 2003
 [19] Venter J, On Dvoretzky stochastic approximation theorems, Annals of Mathematical Statistics 37, 1534–1544,1966
 [20] Berti, P., Crimaldi, I., Pratelli, L. and Rigo, P. Rate of convergence of predictive distributions for dependent data. Bernoulli 15, 1351–1367.2009.
 [21] BERTI, P., PRATELLI, L. and RIGO, P. . Limit theorems for predictive sequences of random variables. Technical Report 146, Dip. EPMQ, Univ. Pavia. Available at economia.2002
 [22] Ghavamzadeh, M., Mannor, S., Pineau, J., and Tamar, A. . Bayesian reinforcement learning:a survey. Foundations and Trends in Machine Learning, 8(56):359–483.2015
 [23] Chen Tessler, Guy Tennenholtz, and Shie Mannor. Distributional policy optimization: An alternative approach for continuous control. arXiv preprint arXiv:1905.09855,2019.
 [24] T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud, Neural ordinary differential equations, arXiv preprint arXiv:1806.07366, 2018.
 [25] David Ha, Andrew Dai, Quoc V. Le,HyperNetworks. In International Conference on Learning Representations, 2017
Appendix A Lemmas and proofs
Lemma 1
Consider a random variable and a partial algebla . If , the following equation holds.
(A.1) 
We quote the important theorem.
Lemma 2
Convergence theorem for stochastic systems[19]
Consider the following stochastic process.
(A.2) 
This satisfies the following equation with probability 1.
(A.3) 
However, with , with probability 1, holds, and with probability 1, Let .
(A.4)  
(A.5) 
At this time, there exists a certain , and it holds for any
(A.6) 
If are taken again for any and the same can be said, ”uniform convergence to 0” can be said that is much stronger than approximate convergence.
Lemma 3
is assumed to be a real number.
(A.7) 
is a constant.At this time, holds with probability 1.
proof.
Look at each . That is, is constant sequence that satisfies . is nonnegative for a sufficiently large , so it is bounded below. In addition, since is apparent from the equation, is a monotonically decreasing sequence. The sequence converges because it is bounded and monotonically decreasing below.Putting , this satisfies . You can say, and the convergence destination is . , the infinite product of is , but diverges. However, since it is , is known, and can be said.
Lemma 4
Let .
(A.8) 
Then holds.
proof.
(A.9)  
(A.10) 
The difference from is reduced by . If , by definition it is clearly . Moreover,
(A.11)  
(A.12) 
After that, it is because it is by the same argument in Lemma 3.
Lemma 5
Suppose that the sequence converges uniformly to 0 on a set of probabilities 1. That is, for any , there is a certain , and when , holds with probability 1. At this time,
(A.13) 
converges to 0.
proof.
(A.14)  
(A.15) 
for such . from Lemma 4. That is, for any , there is a certain , and for any can be arbitrarily taken, so if we define a new , this is also can be taken arbitrarily. Within the range of Using , there is a for any and for .
Appendix B Strict Proof of Policy Gradient thorem and Distributionaly
We prove the famous policy gradient theorem using the function and its version in distributed reinforcement learning [23].
Theorem 3
Policy Gradient thorem
Consider the gradient of the policy value function . At this time, it is assumed that
is implemented by a neural network, the activation function is Lipschitz continuous, and
. Then,The following equation holds,(B.1) 
However, is memory data in general implementation. Next, consider the case of distributed reinforcement learning. If a random variable representing the cumulative reward sum is expressed as , then holds. Suppose is a neural network with stochastic output.
(B.2) 
Then
(B.3) 
proof.
The interchangeable conditions of differentiation and Lebesgue integration are described as follows. Suppose there is a function that can be Lebesgue integrable over and differentiable by . At this time, there is an integrable function , and can be differentiated almost everywhere on by and holds, then is differentiable by , and holds,
(B.4) 
When
, An example of a function class that satisfies this is the “Lipschitz continuous function”. Neural networks is generally combinations of linear transformations and Lipschitz continuous activation map.
^{1}^{1}1General activation functions such as sigmoid, ReLu, Reaky Relu, and Swish are all Lipschitz continuous functions.
Moreover, if the Lipschitz constant of the function is written as , then considering two Lipschitz continuous functions , . From this, are Lipschitz continuous for , respectively. Although it is not Lipschitz continuous for , it is Lipschitz continuous for each element, and the definition and definition ofallow the exchange of differentiation and integration. That is, the following holds from the differential chain rule,
(B.5) 
Similarly, and is Lipschitz continuous functions for any , For distribution type
(B.6)  
(B.7) 
As described above, the policy gradient theorem is established because the policy is Lipschitzcontinuous for each parameter, and is obviously not for a policy function composed of ODEnet[24], hypernet[25], or the like that reuses parameters.
Appendix C Notation
Let be a topological space.

:Smallest algebla containing all .


:If is a finite set, it is the set of all probability measures defined by measurable space , and if it is an infinite set,
