
# Convergence of Q-value in case of Gaussian rewards

In this paper, we study the convergence of the Q-function in reinforcement learning when rewards are unbounded, for example Gaussian. By the central limit theorem, it is natural in some real-world applications to assume that rewards follow a Gaussian distribution, but existing proofs cannot guarantee convergence of the Q-function in that case. Furthermore, in distributional reinforcement learning and Bayesian reinforcement learning, which have become popular in recent years, it is desirable to allow rewards to follow a Gaussian distribution. Therefore, in this paper we prove the convergence of the Q-function under the condition E[r(s,a)^2]<∞, which is much weaker than in existing research. Finally, as a bonus, a proof of the policy gradient theorem for distributional reinforcement learning is also provided.

04/21/2020


## 1 Introduction

In recent years, Reinforcement Learning (RL) has come into fashion. Standard RL methods based on Markov decision processes use a state-action value function [1]. Agents produced by these algorithms follow policies that maximize the expected value of the cumulative reward. In practical use, however, there are many situations where one must consider not only expected values but also risk. Therefore, Distributional Reinforcement Learning (DRL), which considers the distribution of the cumulative reward, has also been studied. DRL research includes a particle-based risk-sensitive algorithm [2]. Similar work includes [3][4], which is mathematically equivalent to [2] but uses a different algorithm, and parametric methods [5]. [4] discusses the convergence of measures in discrete steps. Another way to practice DRL is the Bayesian approach. In [22] it is regarded as an estimation of the uncertainty of the expected value, but in fact Bayesian inference can approximate the distribution of an uncertain quantity, so it can perform distributional reinforcement learning. There are other existing papers on Bayesian reinforcement learning; here we take up [6][7], a method using Gaussian processes, in which the reward can be said to follow a Gaussian distribution. [5] also supports unbounded rewards such as Gaussian ones. We want to show that the approximation of the cumulative reward distribution converges even for unbounded rewards. In this paper, as a preliminary step, we prove the convergence of the ordinary state-action value function. In addition, with Deep Q-learning (DQN) in mind, we prove convergence for Q-functions with a continuous domain.

### 1.1 Related works

The proof history of Q-function convergence is long. For example, there are papers such as [8], [9], [10], and [11] using [10]. An unusual proof method, based on ordinary differential equations, appears in [12]. For DQN, there is a study [13] summarizing the approximation error; the error due to the neural network is verified there. Other research results include [14][15][16][17][18]. All of these studies assume that rewards are bounded. That is, there is a certain constant Rmax such that

 |r(s,a)|≤Rmax a.e. (1.1)

holds. Therefore, Gaussian rewards cannot be handled. In this paper, with the normal distribution in mind, we prove the convergence of the Q-function under the condition

 ∀(s,a)∈S×A, E[r(s,a)2]<∞ (1.2)

which is weaker than (1.1). Finally, we prove the convergence of the Q-function on a continuous domain under ideal conditions, a setting that appears frequently in reinforcement learning.
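As a quick numerical illustration of the gap between (1.1) and (1.2), the following sketch (the mean, variance, and sample size are illustrative assumptions) draws Gaussian rewards: no constant Rmax bounds the samples, yet the empirical second moment settles near μ² + σ² < ∞.

```python
import numpy as np

# A Gaussian reward has unbounded support, so no constant Rmax satisfies
# (1.1); yet E[r^2] = mu^2 + sigma^2 is finite, so condition (1.2) holds.
# All numbers below are illustrative assumptions.
rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0
samples = rng.normal(mu, sigma, size=1_000_000)

second_moment = np.mean(samples ** 2)   # empirical E[r^2], near mu^2 + sigma^2 = 5
largest = np.abs(samples).max()         # grows with the sample size; no fixed bound
print(second_moment, largest)
```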

## 2 Background

### 2.1 Transition kernel

Let two tuples (S,Σ) and (Y,T) both be measurable spaces. A transition kernel k is defined to satisfy the following two conditions.

 ⋅ ∀B∈T, k(⋅,B) is measurable on S (2.1)
 ⋅ ∀s∈S, k(s,⋅) is a measure on T (2.2)

This is used in situations where s is fixed and a distribution on the second space is thereby determined.

### 2.2 Markov decision process

Assume that both the set of states S and the set of actions A are finite sets. A transition kernel p is defined on S×A. That is, p(⋅,⋅|s,a) is a probability measure that governs the joint distribution of the next state st+1 and the immediate reward rt when an action a is taken in state s. A policy π(a|s) is, as can be seen from the definition, the action probability determined from the current state. A deterministic policy is one for which, for any s, there is an a with π(a|s)=1. The stochastic process of random variables (st,at,rt) taking values in S×A×R generated in this way is called a Markov decision process (MDP).

### 2.3 Optimal policies and state action value functions

Write Π for the whole set of policies. The state-action value function for a policy π is defined as follows.

 Qπ(s,a):=E[∑t=0∞ γt rt | s0=s, a0=a, (rt,st+1)∼p(⋅,⋅|st,at), at∼π(⋅|st)] (2.3)

Furthermore, the state value function is defined as follows.

 Vπ(s):=∑a∈Aπ(a|s)Qπ(s,a) (2.4)

Define the optimal policy as

 π∗:=argmaxπ∈Π Vπ(s0) (2.5)

In addition, the state-action value function for the optimal policy is called the optimal state-action value function and is written simply as Q∗. The action that maximizes the optimal state-action value function gives the optimal policy:

 π∗(a|s)={1 (a=argmaxb∈A Q∗(s,b)); 0 (else) (2.6)

holds for any s∈S.

## 3 Update of the state action value function and the Robbins-Monro condition

The Q-function is updated as follows.

 Qt+1(s,a)=(1−α(s,a,st,at,t))Qt(s,a)+α(s,a,st,at,t)[rt(st,at)+γ maxb∈A Qt(st+1,b)] (3.1)

Let (ct) be a sequence satisfying the Robbins-Monro condition.

 ∀t, ct∈[0,1] (3.2)
 ∞∑t=0 ct=∞ (3.3)
 ∞∑t=0 c2t<∞ (3.4)

Using this, the map α is defined as follows.

 α(s,a,st,at,t)={ct (st=s, at=a); 0 (else) (3.5)

In addition, it is assumed that α also satisfies the Robbins-Monro condition almost surely for every (s,a).

 ∞∑t=0 α(s,a,st,at,t)=∞ a.e. (3.6)
 ∞∑t=0 α(s,a,st,at,t)2<∞ a.e. (3.7)
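As a concrete illustration, the update (3.1) with the step-size map (3.5) can be sketched as follows. The two-state MDP, its transition probabilities, and the reward means are assumptions made only for this example; the per-pair step size ct = 1/(number of visits) satisfies the Robbins-Monro conditions (3.6)-(3.7), and the Gaussian rewards are unbounded but satisfy (1.2).

```python
import numpy as np

# Toy 2-state, 2-action MDP (all numbers are illustrative assumptions).
rng = np.random.default_rng(1)
n_states, n_actions, gamma = 2, 2, 0.5
P = np.array([[[0.8, 0.2], [0.2, 0.8]],      # P[s, a] = next-state distribution
              [[0.5, 0.5], [0.9, 0.1]]])
R_mean = np.array([[1.0, 0.0], [0.0, 2.0]])  # E[r(s, a)]

# Exact Q* by value iteration on the mean rewards.
Q_star = np.zeros((n_states, n_actions))
for _ in range(200):
    Q_star = R_mean + gamma * P @ Q_star.max(axis=1)

# Q-learning, update (3.1): Gaussian (unbounded) rewards, 1/n step sizes.
Q = np.zeros((n_states, n_actions))
visits = np.zeros((n_states, n_actions))
s = 0
for _ in range(200_000):
    a = int(rng.integers(n_actions))             # exploratory behaviour policy
    r = rng.normal(R_mean[s, a], 1.0)            # unbounded Gaussian reward
    s_next = int(rng.choice(n_states, p=P[s, a]))
    visits[s, a] += 1
    alpha = 1.0 / visits[s, a]                   # Robbins-Monro step size c_t
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print(np.abs(Q - Q_star).max())                  # sup-norm error, small here
```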

## 4 Proof of Q-function convergence for unbounded rewards

Consider real-valued functions on the finite set S×A.

###### Theorem 1

Convergence of the Q-value in the case of Gaussian rewards

Let S×A be a finite set, and let the reward satisfy condition (1.2). Let W be the set of real-valued functions on S×A, equipped with the sup norm ||⋅||W. If Qt is updated by (3.1) with step sizes satisfying (3.6) and (3.7), then with probability 1

 ||Qt−Q∗||W→0 (4.1)

proof.

The proof follows that of [9]; since our condition is weaker and the statement stronger, the argument must be carried out more precisely. Consider a stochastic process of . Since is a constant, . Putting , this is a measurable stochastic process. Furthermore, if we put , by definition . The two stochastic processes are taken so that . Define the time evolution as

 δt+1(x)=(1−at(x))δt(x)+at(x)E[Ft(x)|Ft] (4.2) wt+1(x)=(1−at(x))wt(x)+at(x)pt(x) (4.3)

However, . At this time, . First, we show by Lemma 2 that converges to 0 with probability 1 for . By definition , so holds. From Lemma 1 and the definition of , holds. Putting , this random variable is measurable and takes a finite value with probability 1. Since is finite, a constant can be taken so that holds. Then the following holds with probability 1.

 Lt+1≤max(Lt,(1−bt)Lt+bt(supx|rt(x)|+γLt)) (4.4)

Using the above formula, the following holds

 E[L2t+1] ≤max(E[L2t],E[((1−bt)Lt+bt(supx|rt(x)|+γLt))2]) (4.5)

Suppose there is that is . At this time, put

 E[H2t] =E[supx|rt(x)|2]+2E[supx|rt(x)|γLt]+γ2E[L2t] (4.6) ≤CR+2γ√CRK2tCR+K2tCR (4.7) =(1+γKt)2CR (4.8)

Then,

 E[((1−bt)Lt+bt(supx|rt(x)|+γLt))2] ≤(1−bt)2E[L2t]+2(1−bt)bt√E[L2t]E[H2t]+b2tγ2E[H2t] (4.9) ≤(1−bt)2K2tCR+2(1−bt)btKt(1+γKt)CR+(1+γKt)2CR (4.10) =((1−bt)Kt+bt(1+γKt))2CR (4.11) =(Kt+bt(1−(1−γ)Kt))2CR (4.12)

Putting , can be said. Since exists, exists for any , and can be said. It is clear from the equation that when , and , then holds. Therefore, as shown earlier, exists for any , and in addition can also be said. Since holds, the following equation holds for all

 14E[p2t(x)] ≤E[G2t(x)] (4.13) =E[rt(x)2]+2γ√E[rt(x)2]E[L2t]+E[L2t] (4.14) ≤(1+γK∗)CR (4.15)

Then,

 ∑tE[a2tp2t] ≤∑t4b2t(1+γK∗)CR (4.16) ≤4M(1+γK∗)CR<∞ (4.17)

holds for all . When we use Lemma 2, putting

 Ut:=at(x)pt(x) (4.18) T(wt,ω):=(1−at(x))wn (4.19)

can be said. Since , holds. Then, for any , set and , then

 T2(wt,ω)≤max(α,(1+βt)w2t−γt) (4.20) ∑tγt=∞ a.e (4.21)

holds. The latter follows from the Robbins-Monro conditions. Therefore, holds for any . Define the linear operator as follows: for

 Tq(s,a) =∫R∑s′[r(s,a)+γsupbq(s′,b)]p(dr,s′|s,a) (4.22) =E[r(s,a)+supbq(X(s,a),b)] (4.23)

is a fixed point for this operator. For any

 ||Tq1−Tq2||W =sups,a|∫R∑s′[r(s,a)+γsupbq1(s′,b)]p(dr,s′|s,a)−∫R∑s′[r(s,a)+γsupbq2(s′,b)]p(dr,s′|s,a)| (4.24)
 ≤sups,a∫R∑s′γ|supbq1(s′,b)−supbq2(s′,b)|p(dr,s′|s,a) (4.25)
 ≤sups,a∫R∑s′γsupb|q1(s′,b)−q2(s′,b)|p(dr,s′|s,a) (4.26)
 =γ||q1−q2||W (4.27)

Thus T is a contraction operator.
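The chain (4.24)-(4.27) can also be checked numerically; the randomly drawn finite MDP below is an illustrative assumption, and the operator applies the mean-reward form of (4.22).

```python
import numpy as np

# Numeric check of the contraction bound (4.27) on a random finite MDP
# (the MDP itself is an illustrative assumption).
rng = np.random.default_rng(2)
n_states, n_actions, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))    # mean rewards E[r(s, a)]

def T(q):
    # (Tq)(s, a) = E[r(s, a)] + gamma * sum_s' p(s'|s, a) sup_b q(s', b)
    return R + gamma * P @ q.max(axis=1)

q1 = rng.normal(size=(n_states, n_actions))
q2 = rng.normal(size=(n_states, n_actions))
lhs = np.abs(T(q1) - T(q2)).max()
rhs = gamma * np.abs(q1 - q2).max()
print(lhs <= rhs + 1e-9)    # True: ||Tq1 - Tq2||_W <= gamma ||q1 - q2||_W
```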

 |E[Ft(x,a)|Ft]| ≤∫R∑s′|r(s,a)+γsupbQt(s′,b)−Q∗(s,a)|p(dr,s′|s,a) (4.28) =|TQt(x,a)−Q∗(s,a)| (4.29) =|TQt(x,a)−TQ∗(s,a)| (4.30) ≤γ||Δt||W (4.31)

Then,

 ||δt+1|| ≤(1−at(x))||δt||+at(x)||δt+wt|| (4.32) ≤(1−at(x))||δt||+at(x)(||δt||+||wt||) (4.33)

As described above, converges uniformly to 0 with probability 1 for any . Therefore, from Lemma 3, for any . That is, for any , , which proves the assertion of the main theorem.

## 5 Theorem for SARSA

The method in Chapter 3 is called Q-learning, and it updates the value before performing the next action. SARSA, on the other hand, updates the value after performing the next action.

 Qt+1(s,a)=(1−α(s,a,st,at,t))Qt(s,a)+α(s,a,st,at,t)(rt(st,at)+γQt(st+1,at+1)) (5.1)

The next action at+1 is often chosen stochastically, for example by a softmax function.
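A minimal sketch of the SARSA update (5.1) with a softmax behaviour policy follows; the toy MDP, its reward means, and the temperature are illustrative assumptions. The bootstrap target uses the action actually sampled at the next state, not the max.

```python
import numpy as np

# SARSA with softmax action selection on a toy 2-state, 2-action MDP
# (all numeric values are illustrative assumptions, not from the paper).
rng = np.random.default_rng(3)
n_states, n_actions, gamma, tau = 2, 2, 0.5, 1.0
P = np.array([[[0.8, 0.2], [0.2, 0.8]],
              [[0.5, 0.5], [0.9, 0.1]]])
R_mean = np.array([[1.0, 0.0], [0.0, 2.0]])

def softmax_action(q_row):
    z = np.exp((q_row - q_row.max()) / tau)   # numerically stable softmax
    return int(rng.choice(len(q_row), p=z / z.sum()))

Q = np.zeros((n_states, n_actions))
visits = np.zeros((n_states, n_actions))
s = 0
a = softmax_action(Q[s])
for _ in range(100_000):
    r = rng.normal(R_mean[s, a], 1.0)          # Gaussian (unbounded) reward
    s_next = int(rng.choice(n_states, p=P[s, a]))
    a_next = softmax_action(Q[s_next])         # on-policy next action
    visits[s, a] += 1
    alpha = 1.0 / visits[s, a]                 # Robbins-Monro step size
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
    s, a = s_next, a_next

print(Q)
```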

###### Theorem 2

Suppose that the Q-function is updated by the SARSA method above. At this time,

 ||Qt−Q∗||W→0 as t→∞ (5.2)

proof.

Put . It is clear from the definition that . The rest follows along the proof of Theorem 1.

## 6 Convergence proof for unbounded rewards on a continuous domain

For example, in a situation such as DQN, an update for one state-action pair has an effect on other state-action pairs. As a simple model taking such situations into account, we introduce a ripple function f defined on a compact set. It satisfies the following conditions.

 f(x,x)=1 (6.1)
 f(x,y) is continuous (6.2)

Starting from any continuous function, the same convergence then holds on the compact set. Let the domain be a simply connected compact set, and let the initial Q-function and the reward function be continuous on it.

 Qt+1(s,a)=(1−f(s,a,st,at)α(s,a,st,at,t))Qt(s,a)+f(s,a,st,at)α(s,a,st,at,t)(rt(s,a)+γmaxb∈AQt(st+1,b)) (6.3)

At this time, Qt converges uniformly to Q∗ on the compact set.

proof. Consider a finite subset of the compact set. Restricted to it, Qt converges uniformly to the correct function by Theorem 1. Since Q∗ is continuous, a continuous function is uniquely determined by its values on a dense subset, and the convergence follows.
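A one-dimensional sketch of the ripple-weighted update (6.3): the Gaussian bump f(x,y)=exp(-((x-y)/scale)²) is an assumed ripple function satisfying (6.1)-(6.2), so a single observed transition also moves nearby points of the Q-function. The grid, target, and step size are illustrative.

```python
import numpy as np

# One observed transition at s_t nudges the whole (discretised) Q-function,
# weighted by the ripple f(s, s_t); f(s_t, s_t) = 1 as in (6.1).
# Single action and all numbers are illustrative assumptions.
grid = np.linspace(0.0, 1.0, 11)          # discretised compact state set
Q = np.zeros_like(grid)

def ripple(x, y, scale=0.1):
    return np.exp(-((x - y) / scale) ** 2)

s_t, target, alpha = 0.5, 1.0, 0.5        # visited state, TD target, step size
w = ripple(grid, s_t)                     # weights f(s, s_t) for every grid point
Q = (1 - w * alpha) * Q + w * alpha * target

print(Q[5])   # the visited point moves most: 0.5 * target = 0.5
```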

## 7 Conclusion and Future Work

As mentioned earlier, we want to prove the convergence of the distribution itself. An order evaluation of the expected value should also be performed. We also want to estimate the convergence order for a specific neural network as in [13]. According to [13], as with Theorem 3, on a continuous domain, as n→∞, using constants C1, C2,

 ||Q∗−Qn||W≤C1⋅(logn)ξn−α+C2Rmax (7.1)

is established. However, when the reward follows a normal distribution, Rmax=∞, so the upper bound on the error is infinite and this expression carries no information. For unbounded rewards, stronger inequalities are needed.

## Appendix A Lemmas and proofs

###### Lemma 1

Consider a random variable Y and a sub-σ-algebra F. If Z=Y−E[Y|F], the following equation holds.

 E[Z2]≤4E[Y2] (A.1)

We quote the important theorem.

###### Lemma 2

Convergence theorem for stochastic systems[19]

Consider the following stochastic process.

 Xt+1:=T(X0,......,Xt,ω)+Ut(ω) (A.2)

This satisfies the following equation with probability 1.

 |T(x1,x2,......,xt,ω)|2≤max(α,(1+βt(ω))x2t−γt) (A.3)

However, with , with probability 1, holds, and with probability 1, Let .

 ∑tE[U2t] <∞ (A.4) ∑tE[Ut|Ft] <∞ (A.5)

At this time, there exists a certain , and it holds for any

 limsupt→∞|Xt|2<α a.e. (A.6)

If can be taken again for any so that the same holds, then "uniform convergence to 0" can be said, which is much stronger than almost sure convergence.

###### Lemma 3

is assumed to be a real number.

 xn+1=(1−an)xn+γan|xn| (A.7)

is a constant. At this time, holds with probability 1.

proof.

Look at each ω. That is, is a constant sequence that satisfies . is nonnegative for sufficiently large , so it is bounded below. In addition, since is apparent from the equation, is a monotonically decreasing sequence. The sequence converges because it is bounded below and monotonically decreasing. Putting , this satisfies , and the limit is . Since , the infinite product of is , but diverges. However, since , is known, and can be said.
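The recursion in Lemma 3 can be simulated directly; with a_n = 1/(n+1), which satisfies the Robbins-Monro conditions, and γ = 0.5 as an illustrative choice, each step multiplies a nonnegative x_n by (1 - a_n(1-γ)), driving it to 0.

```python
# Simulation of the Lemma 3 recursion x_{n+1} = (1 - a_n) x_n + gamma a_n |x_n|
# with a_n = 1/(n+1); gamma and x_0 are illustrative choices.
gamma, x = 0.5, 10.0
for n in range(200_000):
    a = 1.0 / (n + 1)
    x = (1 - a) * x + gamma * a * abs(x)   # equals x * (1 - a*(1 - gamma)) for x >= 0
print(x)   # decays toward 0, roughly like n^{-(1 - gamma)}
```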

###### Lemma 4

Let .

 xn+1=(1−an)xn+γan|xn+ϵ| (A.8)

Then holds.

proof.

 xn+1−xn =−an((1−γ)xn−ϵγ) (A.9) =−an(1−γ)(xn−ϵγ1−γ) (A.10)

The difference from is reduced by . If , by definition it is clearly . Moreover,

 yn+1−yn =−an(1−γ)(yn) (A.11) yn+1 =(1−an(1−γ))yn (A.12)

The claim then follows by the same argument as in Lemma 3.

###### Lemma 5

Suppose that the sequence converges uniformly to 0 with probability 1. That is, for any , there is a certain such that for , holds with probability 1. At this time,

 xn+1=(1−an)xn+γan|xn+cn| (A.13)

converges to 0.

proof.

 zNϵ1 =xNϵ1 (A.14) zn+1 =(1−an)zn+γan|zn+ϵ1| (A.15)

for such . From Lemma 4, . That is, for any , there is a certain such that for any . Since can be taken arbitrarily, if we define a new , this can also be taken arbitrarily. Using this, for any there is a such that for .

## Appendix B Strict proof of the policy gradient theorem and its distributional version

We prove the famous policy gradient theorem using the Q-function, together with its version in distributional reinforcement learning [23].

###### Theorem 3

Consider the gradient of the policy value function J(θ). Assume that the policy is implemented by a neural network and that the activation functions are Lipschitz continuous. Then the following equation holds,

 ∇θJ(θ)=Eρ[∇θπ(θ)∇aQ(s,a)|a=π(x)] (B.1)

However, ρ is the memory data in a general implementation. Next, consider the case of distributional reinforcement learning. If the random variable representing the cumulative reward sum is expressed as Z, then Q(s,a)=E[Z(s,a)] holds. Suppose Z is given by a neural network with stochastic output.

 Z(ω)(s,a)=fω(s,a) (B.2)

Then

 ∇θJ(θ)=Eρ[∇θπ(x)En[∇aZ(x,a)]|a=π(x)] (B.3)

proof.

The condition for interchanging differentiation and Lebesgue integration is as follows. Suppose f(x,ω) is Lebesgue integrable over Ω and differentiable with respect to x almost everywhere on Ω, and there is an integrable function g with |∇x f(x,ω)|≤g(ω). Then the integral is differentiable with respect to x, and the following holds,

 ∇x∫Ωf(x,ω)dμ(ω)=∫Ω∇xf(x,ω)dμ(ω) (B.4)

An example of a function class that satisfies this condition is the Lipschitz continuous functions. Neural networks are in general compositions of linear transformations and Lipschitz continuous activation maps.

General activation functions such as sigmoid, ReLU, Leaky ReLU, and Swish are all Lipschitz continuous functions.

Moreover, if the Lipschitz constant of a function f is written Lip(f), then for two Lipschitz continuous functions f and g we have Lip(g∘f)≤Lip(g)Lip(f). From this, the maps are Lipschitz continuous in each argument. Although the composite is not jointly Lipschitz continuous, it is Lipschitz continuous in each element, and this allows the exchange of differentiation and integration. That is, the following holds from the chain rule,

 ∇θJ(θ)=Eρ[∇θπθ(x)∇aQ(s,a)|a=πθ(x)] (B.5)

Similarly, and are Lipschitz continuous functions for any . For the distributional type,

 ∇θJ(θ) =Eρ[∇θπ(x)En[∇afω(x,a)]|a=π(x)] (B.6) =Eρ[∇θπ(x)En[∇aZ(x,a)]|a=π(x)] (B.7)

As described above, the policy gradient theorem holds because the policy is Lipschitz continuous in each parameter; this is obviously not the case for a policy function composed with ODEnet [24], hypernetworks [25], or the like, which reuse parameters.
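The composition bound used above, Lip(g∘f) ≤ Lip(g)·Lip(f), can be checked numerically; tanh (constant 1) and the linear map x↦2x (constant 2) are assumed stand-ins for an activation and a network layer.

```python
import numpy as np

# Check Lip(g∘f) <= Lip(g) * Lip(f) for f = tanh (Lip 1) and g(x) = 2x (Lip 2)
# on random point pairs; both functions are illustrative stand-ins.
rng = np.random.default_rng(5)
x, y = rng.normal(size=1000), rng.normal(size=1000)
f = np.tanh
g = lambda z: 2.0 * z
ratios = np.abs(g(f(x)) - g(f(y))) / np.abs(x - y)   # difference quotients
print(ratios.max() <= 2.0 + 1e-9)   # True: composed Lipschitz constant <= 2
```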

## Appendix C Notation

Let be a topological space.

• : Smallest σ-algebra containing all open sets of .

• : If is a finite set, it is the set of all probability measures on the measurable space , and if it is an infinite set,