## Introduction

Reinforcement learning (RL) is a powerful tool for sequential decision-making problems.
In RL, the agent's goal is to learn from experience and to find an optimal policy in a decision system with delayed rewards.
Tabular learning methods are core ideas in RL algorithms with simple forms: *value functions* are represented as arrays, or tables [16].
One merit of tabular learning methods is that they converge to the optimal solution with solid theoretical guarantees [15]. However, when the state space is enormous, we suffer from what Bellman called the "curse of dimensionality" [14], and cannot expect to obtain the value function accurately by tabular learning methods. An efficient approach to address this problem is to use a parameterized function to approximate the value function [16].

Recently, Yang et al. [23] proposed a new algorithm Q(σ,λ) that develops Q(σ) [17, 7] with eligibility traces.
Q(σ,λ) unifies Sarsa(λ) [1] and Q^π(λ) [9].
However, the original theoretical results by Yang et al. [23] are limited to tabular learning.
In this paper, we extend the tabular Q(σ,λ) algorithm with linear function approximation and propose the gradient Q(σ,λ) (GQ(σ,λ)) algorithm.
The proposed GQ(σ,λ) algorithm unifies the *full-sampling* (σ = 1) and *pure-expectation* (σ = 0) algorithms through the *sampling degree* σ.
Results show that GQ(σ,λ) with a varying combination of full-sampling and pure-expectation achieves better performance than either the full-sampling or the pure-expectation method alone.

Unfortunately, it is not sound to extend Q(σ,λ) by the semi-gradient method via the *mean square value error* (MSVE) objective function directly,
although the linear, semi-gradient method is arguably the simplest and best-understood kind of function approximation.
In this paper, we provide a thorough analysis of the instability of Q(σ,λ) with function approximation under the semi-gradient method.

Furthermore, to address the above instability, we propose the GQ(σ,λ) algorithm under the framework of the *mean squared projected Bellman error* (MSPBE) [18]. However, as pointed out by Liu et al. (2015), we cannot get an unbiased estimate of the gradient with respect to the MSPBE objective function. In fact, since the gradient involves a product of expectations, an unbiased estimate cannot be obtained via a single sample; this is the double-sampling problem. Secondly, the gradient of the MSPBE objective function contains the inverse of an expectation matrix, which also cannot be estimated via a single sample; this is the second bottleneck of applying the stochastic gradient method to optimize the MSPBE objective. Inspired by the key step in the derivation of the TDC algorithm [18], we apply two-timescale stochastic approximation [3] to address the double-sampling problem, and propose a convergent GQ(σ,λ) algorithm that unifies the full-sampling and pure-expectation algorithms with function approximation.

Finally, we conduct extensive experiments on standard domains to show that GQ(σ,λ) with a σ value that results in a mixture of the full-sampling and pure-expectation methods performs better than either extreme, σ = 0 or σ = 1.

## Background and Notations

The standard reinforcement learning framework [16] is often formalized as a *Markov decision process* (MDP) [13].
It considers a 5-tuple (S, A, P, R, γ),
where S indicates the set of all states and A indicates the set of all actions.
At each time t, the agent is in a state s_t ∈ S and takes an action a_t ∈ A;
the environment then produces a reward r_{t+1}.
P(s′|s, a) is the conditional probability of the state transitioning from s to s′ under the action a.
R: S × A → ℝ is the reward function, and the discount factor γ ∈ (0, 1).

A *policy* π(·|s) is a probability distribution defined on A; the *target policy* π is the policy that will be learned, and the *behavior policy* μ is used to generate behavior. If π = μ, the algorithm is called *on-policy* learning; otherwise, it is *off-policy* learning. We assume that the Markov chain induced by the behavior policy μ is ergodic; then there exists a stationary distribution d_μ such that d_μ(s′) = Σ_s d_μ(s) P_μ(s′|s). We denote by D the diagonal matrix whose diagonal elements are the stationary distribution of the states.
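As a concrete illustration, the stationary distribution of an ergodic chain can be approximated by repeatedly applying the transition matrix. The 3-state transition matrix below is a hypothetical example invented for illustration, not one of the paper's domains:

```python
# Hypothetical 3-state behavior-policy transition matrix (rows sum to 1).
P_mu = [[0.5, 0.5, 0.0],
        [0.1, 0.6, 0.3],
        [0.2, 0.0, 0.8]]

def stationary_distribution(P, iters=5000):
    """Approximate the stationary distribution d satisfying d = d P by power iteration."""
    n = len(P)
    d = [1.0 / n] * n                  # start from the uniform distribution
    for _ in range(iters):
        d = [sum(d[i] * P[i][j] for i in range(n)) for j in range(n)]
    return d

d_mu = stationary_distribution(P_mu)
```

For an ergodic chain the iteration converges to the unique distribution with d = dP; the diagonal matrix D in the text simply places these values on its diagonal.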

For a given policy π, one of the key steps in RL is to estimate the *state-action value function*

q^π(s, a) = E_π[ Σ_{t=0}^{∞} γ^t r_{t+1} | s_0 = s, a_0 = a ],

where E_π stands for the expectation of a random variable with respect to the probability distribution induced by π. It is known that q^π is the unique fixed point of the *Bellman operator* T^π,

T^π q^π = q^π,   (1)

where the Bellman operator T^π is defined as:

(T^π q)(s, a) = R(s, a) + γ Σ_{s′} P(s′|s, a) Σ_{a′} π(a′|s′) q(s′, a′),   (2)

and T^π q = R + γ P^π q in matrix form, with the corresponding elements P^π[(s, a), (s′, a′)] = P(s′|s, a) π(a′|s′).
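Since T^π is a γ-contraction, iterating the backup q ← T^π q converges to q^π. The sketch below applies this to a small hypothetical two-state, two-action MDP (all numbers invented for illustration):

```python
# Hypothetical MDP: P[(s, a)] lists (probability, next_state); R[(s, a)] is the reward.
P = {(0, 0): [(1.0, 0)], (0, 1): [(0.4, 0), (0.6, 1)],
     (1, 0): [(1.0, 1)], (1, 1): [(0.7, 0), (0.3, 1)]}
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.5, (1, 1): -1.0}
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}   # target policy pi(a|s)
gamma = 0.9

def bellman_backup(q):
    """(T^pi q)(s,a) = R(s,a) + gamma * sum_{s'} P(s'|s,a) sum_{a'} pi(a'|s') q(s',a')."""
    return {(s, a): R[(s, a)] + gamma * sum(
                p * sum(pi[s2][a2] * q[(s2, a2)] for a2 in pi[s2])
                for p, s2 in P[(s, a)])
            for (s, a) in R}

q = {sa: 0.0 for sa in R}
for _ in range(500):        # 0.9**500 is negligible, so q is essentially q^pi
    q = bellman_backup(q)
```

At the fixed point, one more backup leaves the table (numerically) unchanged, which is exactly Eq.(1).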

### Temporal Difference Learning and λ-Return

We cannot calculate q^π from the Bellman equation (1) directly in the model-free RL setting (where the agent cannot access P or R for a given MDP). In RL, temporal difference (TD) learning [19] is one of the most important methods for solving the model-free RL problem. The λ-return is a multi-step TD method that uses a longer sequence of experienced rewards to learn the value function.

**TD learning** One-step TD learning estimates the value function by taking an action according to the behavior policy, sampling the reward, and bootstrapping via the current estimate of the value function. Sarsa [1], Q-learning [22] and Expected-Sarsa [21] are typical one-step TD learning algorithms.

From the view of sampling degree, TD learning methods fall into two categories: *full-sampling* and *pure-expectation*, which are discussed in depth in Sections 7.5 and 7.6 of [17] and in [7].

Sarsa and Q-learning are typical full-sampling algorithms, which sample all transitions to learn the value function.
Instead,
pure-expectation algorithms take into account how likely each action is
under the current policy;
e.g., *Tree-Backup* [12] and Expected-Sarsa use the expectation of the state-action value to estimate the value function.
The Q^π(λ) algorithm [9] is also a pure-expectation algorithm, which combines TD learning with eligibility traces.
Harutyunyan et al. [9] prove that when the behavior and target policies are sufficiently close, the off-policy Q^π(λ) algorithm converges in both policy evaluation and control tasks.

**λ-Return** For a trajectory, the λ-return is an average containing all the n-step returns, each weighted proportionally to λ^{n−1}, λ ∈ [0, 1]. Since the mechanisms of all the λ-return algorithms are similar, we only present the definition of the λ-return of Sarsa [17] as follows,

G_t^λ = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} G_{t:t+n},   (3)

where G_{t:t+n} is the n-step return of Sarsa from time t. After some simple algebra, the λ-return can be rewritten as a sum of TD errors,

G_t^λ = q(s_t, a_t) + Σ_{i=t}^{∞} (γλ)^{i−t} δ_i,   (4)

where δ_i = r_{i+1} + γ q(s_{i+1}, a_{i+1}) − q(s_i, a_i) is the TD error.
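The equivalence between the weighted-average form (3) and the TD-error-sum form (4) is easy to check numerically. The sketch below uses a short episode with hypothetical rewards and value estimates (the terminal value is taken as zero, and the remaining λ-mass of the truncated episode is placed on the full return):

```python
gamma, lam = 0.9, 0.7
rewards = [1.0, -0.5, 2.0]   # r_1, r_2, r_3 of a 3-step episode (hypothetical numbers)
q = [0.3, -0.2, 0.8]         # current estimates q(s_t, a_t); terminal value is 0

def n_step_return(t, n):
    """G_{t:t+n}: n discounted rewards plus a bootstrapped tail (0 past the terminal state)."""
    n = min(n, len(rewards) - t)
    tail = q[t + n] if t + n < len(q) else 0.0
    return sum(gamma ** k * rewards[t + k] for k in range(n)) + gamma ** n * tail

def lambda_return_forward(t):
    """Eq.(3), truncated at termination: weight lam**(H-1) goes to the full return."""
    H = len(rewards) - t
    g = (1 - lam) * sum(lam ** (n - 1) * n_step_return(t, n) for n in range(1, H))
    return g + lam ** (H - 1) * n_step_return(t, H)

def lambda_return_td_sum(t):
    """Eq.(4): q(s_t, a_t) plus the (gamma*lam)-discounted sum of TD errors."""
    g = q[t]
    for i in range(t, len(rewards)):
        q_next = q[i + 1] if i + 1 < len(q) else 0.0
        g += (gamma * lam) ** (i - t) * (rewards[i] + gamma * q_next - q[i])
    return g
```

Both forms agree for every start time t, which is exactly the "sum of TD errors" rewriting used to derive backward-view eligibility traces.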

### A Unified View

In this section, we introduce the existing methods that unify full-sampling and pure-expectation algorithms.

**Q(σ) Algorithm**
Recently, Sutton and Barto [17] and De Asis et al. [7] proposed a new TD algorithm, Q(σ), that unifies Sarsa and Expected-Sarsa^{1}.
Q(σ) estimates the value function by weighting the average between Sarsa and Expected-Sarsa through a *sampling parameter* σ ∈ [0, 1].
For a transition (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}),
Q(σ) updates the value function as follows,

q(s_t, a_t) ← q(s_t, a_t) + α δ_t^σ,   (5)

where δ_t^σ = r_{t+1} + γ[σ q(s_{t+1}, a_{t+1}) + (1 − σ) Σ_a π(a|s_{t+1}) q(s_{t+1}, a)] − q(s_t, a_t), and α is the step-size.

^{1} For the multi-step case, Q(σ) unifies *n-step Sarsa* and *n-step Tree-Backup* [12]. For more details, please refer to [7].

Q(0) reduces to Expected-Sarsa, while Q(1) is exactly Sarsa. Experiments by De Asis et al. [7] show that an intermediate value of σ, which results in a mixture of the existing algorithms, performs better than either extreme σ = 0 or σ = 1.
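The σ-interpolation can be made concrete in a few lines. The sketch below computes the Q(σ) TD error for a single hypothetical transition and checks both extremes (the table, policy, and numbers are invented for illustration):

```python
def q_sigma_delta(q, pi, transition, gamma, sigma):
    """TD error delta^sigma = r + gamma*[sigma*q(s',a') + (1-sigma)*E_pi q(s',.)] - q(s,a)."""
    s, a, r, s_next, a_next = transition
    expected_q = sum(pi[s_next][b] * q[(s_next, b)] for b in pi[s_next])
    target = r + gamma * (sigma * q[(s_next, a_next)] + (1 - sigma) * expected_q)
    return target - q[(s, a)]

# Hypothetical two-action value table and target policy.
q = {(0, 0): 0.5, (0, 1): -0.1, (1, 0): 1.0, (1, 1): 2.0}
pi = {0: {0: 0.3, 1: 0.7}, 1: {0: 0.25, 1: 0.75}}
trans = (0, 1, 1.0, 1, 0)   # (s, a, r, s', a')

sarsa = q_sigma_delta(q, pi, trans, 0.9, 1.0)            # sigma = 1: Sarsa error
expected_sarsa = q_sigma_delta(q, pi, trans, 0.9, 0.0)   # sigma = 0: Expected-Sarsa error
mixed = q_sigma_delta(q, pi, trans, 0.9, 0.5)
```

For any σ the error is the linear interpolation σ·δ_Sarsa + (1 − σ)·δ_ExpSarsa, so σ = 0.5 lands exactly at their midpoint.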

**Q(σ,λ) Algorithm** Later, Yang et al. [23] extended Q(σ) with eligibility traces and proposed Q(σ,λ), which unifies Sarsa(λ) and Q^π(λ). Q(σ,λ) updates the value function as:

e_t(s, a) = γλ e_{t−1}(s, a) + 𝕀{(s, a) = (s_t, a_t)},   (6)

q_{t+1}(s, a) = q_t(s, a) + α δ_t^σ e_t(s, a),   (7)

where δ_t^σ is the TD error of Q(σ) in Eq.(5), 𝕀 is the indicator function, and α is the step-size.

We notice that Q(0,λ) reduces to Q^π(λ) [9], while Q(1,λ) is exactly Sarsa(λ). The experiments in [23] show a similar conclusion to De Asis et al. [7]: an intermediate value of σ achieves better performance than either extreme σ = 0 or σ = 1. Besides, Yang et al. [23] have shown that for a trajectory {(s_t, a_t, r_{t+1})}_{t≥0}, under Q(σ,λ), the total update for (s_t, a_t) in a given episode reaches

G_t^{σ,λ} − q(s_t, a_t) = Σ_{i=t}^{∞} (γλ)^{i−t} δ_i^σ,   (8)

which is an off-line (forward) view of Q(σ,λ) with eligibility traces. If σ = 1, Eq.(8) is Eq.(4) exactly.
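In the backward view, each step decays all traces by γλ, bumps the trace of the visited pair, and moves every table entry along its trace by the σ-mixed TD error. A minimal tabular sketch of one such step (the dictionary layout is our own choice; with σ = 1 this is a Sarsa(λ) step, with σ = 0 a step of the pure-expectation variant):

```python
def q_sigma_lambda_step(q, e, transition, pi, gamma, lam, sigma, alpha):
    """One backward-view step: decay traces by gamma*lam, bump the visited pair,
    then move every entry along its trace by alpha * delta^sigma."""
    s, a, r, s_next, a_next = transition
    expected_q = sum(pi[s_next][b] * q[(s_next, b)] for b in pi[s_next])
    delta = r + gamma * (sigma * q[(s_next, a_next)]
                         + (1 - sigma) * expected_q) - q[(s, a)]
    e = {sa: gamma * lam * v for sa, v in e.items()}   # decay all traces
    e[(s, a)] = e.get((s, a), 0.0) + 1.0               # accumulate the visited pair
    q = {sa: q[sa] + alpha * delta * e.get(sa, 0.0) for sa in q}
    return q, e

q0 = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.0}
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}
q1, e1 = q_sigma_lambda_step(q0, {}, (0, 0, 1.0, 1, 1), pi,
                             gamma=0.9, lam=0.5, sigma=1.0, alpha=0.1)
```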

## Q(σ,λ) with Semi-Gradient Method

In this section, we analyze the instability of extending tabular Q(σ,λ) with linear function approximation by the semi-gradient method. We first need some notation for linear function approximation.

When the dimension of S is huge, we cannot expect to obtain the value function accurately by tabular learning methods. We often use a linear function with a parameter θ ∈ ℝ^p to estimate q^π as follows,

q̂(s, a; θ) = φ(s, a)^⊤ θ,   (9)

where φ: S × A → ℝ^p is a *feature map*; specifically,
its corresponding elements are φ_i(s, a), i = 1, …, p.
Then Eq.(9) can be written in a matrix version,

Q̂_θ = Φθ,

where Φ ∈ ℝ^{|S||A| × p} is a matrix whose rows are the state-action features φ(s, a)^⊤, (s, a) ∈ S × A.

### Semi-Gradient Method

For a trajectory {(s_t, a_t, r_{t+1})}_{t≥0}, we define the update rule of Q(σ,λ) with the semi-gradient method as follows,

θ_{t+1} = θ_t + α_t (G_t^{σ,λ} − φ_t^⊤ θ_t) φ_t,   (10)

where α_t is the step-size and G_t^{σ,λ} is an off-line estimate of the value function according to Eq.(8); specifically, for each t,

G_t^{σ,λ} = φ_t^⊤ θ_t + Σ_{i=t}^{∞} (γλ)^{i−t} δ_i^σ,   (11)

where δ_i^σ = r_{i+1} + γ θ_t^⊤ φ̄_{i+1}^σ − θ_t^⊤ φ_i, φ_i = φ(s_i, a_i), and φ̄_{i+1}^σ = σ φ(s_{i+1}, a_{i+1}) + (1 − σ) Σ_a π(a|s_{i+1}) φ(s_{i+1}, a).
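Each semi-gradient step moves θ toward a return target while treating the target as a constant; only the prediction φ^⊤θ is differentiated. A minimal sketch (names and numbers are our own):

```python
def semi_gradient_step(theta, phi_t, g_target, alpha):
    """theta <- theta + alpha * (G - phi^T theta) * phi; the target G is held
    fixed (not differentiated), which is what makes the method 'semi'-gradient."""
    err = g_target - sum(p * t for p, t in zip(phi_t, theta))
    return [t + alpha * err * p for t, p in zip(theta, phi_t)]

theta = semi_gradient_step([0.0, 0.0], [1.0, 2.0], g_target=3.0, alpha=0.1)
```

Because the target itself depends on θ but is not differentiated, this is not true gradient descent on any fixed objective, which is the root of the instability analyzed next.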

### Instability Analysis

Now, we show that iteration (10) is an unstable algorithm. Consider the sequence {θ_t} generated by iteration (10); then the following holds,

θ_{t+1} = θ_t + α_t (A_t θ_t + b_t),   (12)

where A = E[A_t] and b = E[b_t], with

A_t = e_t (γ φ̄_{t+1}^σ − φ_t)^⊤,   (13)

b_t = r_{t+1} e_t,   (14)

where e_t = γλ e_{t−1} + φ_t is the trace vector. Furthermore,

E[θ_{t+1} | θ_t] = θ_t + α_t (A θ_t + b).   (15)

Eq.(12) plays a critical role in our analysis; we provide its proof in Appendix A.

Following the same discussion by Tsitsiklis and Van Roy (1997) and Sutton, Mahmood, and White (2016),
under the conditions of Proposition 4.8 proved by Bertsekas and Tsitsiklis (1995),
*if A is a negative definite matrix, then the sequence {θ_t}
generated by iteration (10) converges. By (12), θ_t converges to the unique TD fixed point θ**:

A θ* + b = 0.   (16)

In the on-policy case, for μ = π,

A = Φ^⊤ D (I − γλ P^π)^{-1} (γ P^π − I) Φ.   (17)

It has been shown that the A in Eq.(17) is negative definite (e.g., Section 9.4 in [17]); thus iteration (10) is a convergent algorithm: it converges to the θ* satisfying (16).

Unfortunately, since the steady state-action distribution does not match the transition probability during off-policy learning (μ ≠ π), A may not have an analog of (17). Thus, unlike on-policy learning, there is no guarantee that A keeps the negative definite property, and θ_t may diverge. We use a typical example to illustrate this.

### An Unstable Counterexample

Figure 2 shows the numerical solution of the parameter θ learned by (10) for the counterexample (Figure 1). This simple example is striking because full-sampling and pure-expectation are arguably the simplest and best-understood methods, and the linear, semi-gradient method is arguably the simplest and best-understood kind of function approximation. The result shows that even the simplest combination of full-sampling and pure-expectation with function approximation can be unstable if the updates are not done according to the on-policy distribution.
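The divergence mechanism can also be seen in an even smaller classic example (a textbook construction, not the paper's Baird star): two states sharing one parameter θ with features 1 and 2, updated off-policy only on the first transition.

```python
# Classic two-state divergence sketch: v(s1) = theta, v(s2) = 2*theta,
# reward 0 on the s1 -> s2 transition, updates applied only on that transition.
gamma, alpha = 0.99, 0.1
theta = 1.0
trajectory = [theta]
for _ in range(100):
    delta = 0.0 + gamma * (2.0 * theta) - theta   # TD error on s1 -> s2
    theta += alpha * delta * 1.0                  # semi-gradient step; feature of s1 is 1
    trajectory.append(theta)
# each step multiplies theta by 1 + alpha*(2*gamma - 1) > 1, so theta grows without bound
```

Since 2γ − 1 > 0 for γ > 1/2, the update amplifies θ geometrically even though every individual step looks like an ordinary TD update.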

## Gradient Q(σ,λ)

We have discussed the divergence of Q(σ,λ) with the semi-gradient method. In this section, we propose a convergent and stable TD algorithm: gradient Q(σ,λ) (GQ(σ,λ)).

### Objective Function

We derive the algorithm via the MSPBE [18]:

MSPBE(θ) = ‖Q̂_θ − Π T^{π,σ,λ} Q̂_θ‖_D^2,   (18)

where Π = Φ(Φ^⊤ D Φ)^{-1} Φ^⊤ D is the *projection matrix*, which projects any value function onto the space generated by Φ.
After some simple algebra, we can further rewrite the MSPBE as a standard weighted least-squares equation:

MSPBE(θ) = (Aθ + b)^⊤ C^{-1} (Aθ + b),   (19)

where C = E[φ_t φ_t^⊤].

Now, we define the update rule as follows: for a given trajectory {(s_t, a_t, r_{t+1})}_{t≥0},

θ_{t+1} = θ_t − (α_t/2) ∇_θ MSPBE(θ_t) = θ_t − α_t A^⊤ C^{-1} (A θ_t + b),   (20)

where α_t is the step-size, E[δ_t^σ e_t] = A θ_t + b,
e_t = γλ e_{t−1} + φ_t is the trace vector, and
δ_t^σ is the TD error defined in Eq.(11). By Eq.(20), it is worth noticing that the challenges of solving (18) are two-fold:

- The computational complexity of inverting the matrix C is at least O(p^3) [8], where p is the dimension of the feature space. Thus, it is too expensive to use the gradient to solve problem (18) directly.
- Besides, as pointed out by Szepesvári (2010) and Liu et al. (2015), we cannot get an unbiased estimate of ∇_θ MSPBE(θ). In fact, since the gradient involves a product of expectations, an unbiased estimate cannot be obtained via a single sample; it requires two independent samples, which is the double-sampling problem. Secondly, C^{-1} also cannot be estimated via a single sample, which is the second bottleneck of applying stochastic gradient methods to problem (18).

We provide a practical way to solve the above problem in the next subsection.

### Algorithm Derivation

The gradient in Eq.(20) can be replaced by the following equation:

−(1/2) ∇_θ MSPBE(θ) = −A^⊤ C^{-1} E[δ_t^σ e_t].   (21)

The proof of Eq.(21) is similar to the derivation in Chapter 7 of [11]; thus we omit it. Furthermore, the following Proposition 1 provides a new way to estimate E[δ_t^σ e_t].

###### Proposition 1.

Let e_t be the eligibility trace vector generated as e_t = γλ e_{t−1} + φ_t, and let

A = E[e_t (γ φ̄_{t+1}^σ − φ_t)^⊤],   (22)

b = E[r_{t+1} e_t];   (23)

then the following holds,

E[δ_t^σ e_t] = A θ + b.   (24)

###### Proof.

See Appendix B. ∎

It is too expensive to calculate the inverse matrix C^{-1} in Eq.(21). In order to develop an efficient algorithm, Sutton et al. (2009) use a weight-duplication trick. Following it, we estimate w_t ≈ C^{-1} E[δ_t^σ e_t] on a fast timescale:

w_{t+1} = w_t + β_t (δ_t^σ e_t − (φ_t^⊤ w_t) φ_t).   (25)

Now, sampling from Eq.(24) directly, we define the update rule of θ as follows,

θ_{t+1} = θ_t + α_t (φ_t − γ φ̄_{t+1}^σ)(e_t^⊤ w_t),   (26)

where e_t is the trace vector defined in Eq.(20), δ_t^σ and φ̄_{t+1}^σ are defined as in Eq.(11), and α_t, β_t are step-sizes. More details of gradient Q(σ,λ) are summarized in Algorithm 1.
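The paired recursion can be sketched in a few lines. The forms below follow our reconstruction of Eqs.(25)-(26) (a GTD2/TDC-style scheme), so they are an illustrative sketch rather than the paper's exact Algorithm 1:

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def gq_step(theta, w, e, phi, phi_bar_next, r, gamma, lam, alpha, beta):
    """One paired two-timescale update: the fast iterate w tracks C^{-1} E[delta e];
    the slow iterate theta follows the corrected gradient direction."""
    e = [gamma * lam * ei + p for ei, p in zip(e, phi)]   # trace: e = gamma*lam*e + phi
    delta = r + gamma * dot(phi_bar_next, theta) - dot(phi, theta)
    corr = dot(e, w)
    theta_new = [th + alpha * (p - gamma * pb) * corr     # slow timescale (Eq. 26 sketch)
                 for th, p, pb in zip(theta, phi, phi_bar_next)]
    phi_w = dot(phi, w)
    w_new = [wi + beta * (delta * ei - phi_w * p)         # fast timescale (Eq. 25 sketch)
             for wi, ei, p in zip(w, e, phi)]
    return theta_new, w_new, e

theta, w, e = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
theta, w, e = gq_step(theta, w, e, phi=[1.0, 0.0], phi_bar_next=[0.0, 1.0],
                      r=1.0, gamma=0.9, lam=0.5, alpha=0.1, beta=0.5)
```

Note that only a single sample per step feeds both iterates; the second expectation in the gradient is carried by the slowly varying w, which is how the double-sampling problem is avoided.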

### Convergence Analysis

We need some additional assumptions to present the convergence of Algorithm 1.

###### Assumption 1.

The positive step-size sequences {α_t} and {β_t} satisfy Σ_t α_t = Σ_t β_t = ∞ and Σ_t (α_t^2 + β_t^2) < ∞ with probability one.

###### Assumption 2 (Boundedness of Features, Rewards and Parameters [10]).

(1) The features φ_t = φ(s_t, a_t) have uniformly bounded second moments. (2) The reward r_{t+1} has uniformly bounded second moments. (3) There exists a bounded region Θ such that θ_t ∈ Θ for all t.

Assumption 2 guarantees that the matrices A and C and the vector b are uniformly bounded. After some simple algebra, the traces e_t are also uniformly bounded; see Appendix C. The following Assumption 3 implies that the TD fixed point θ* = −A^{-1} b is well-defined.

###### Assumption 3.

The matrices A and C are non-singular, where A is defined in Eq.(22) and C = E[φ_t φ_t^⊤].

###### Theorem 1 (Convergence of Algorithm 1).

Consider the iterations generated by (26) and (25). Suppose α_t/β_t → 0 as t → ∞ and {α_t}, {β_t} satisfy Assumption 1; the sequence {(φ_t, r_{t+1})} satisfies Assumption 2; furthermore, A satisfies Assumption 3. Let

θ̇(t) = −A^⊤ C^{-1} (A θ(t) + b),   (27)

ẇ(t) = (A θ(t) + b) − C w(t).   (28)

Then θ_t converges to θ* with probability one, where θ* = −A^{-1} b is the unique globally asymptotically stable equilibrium of the ODE (27), i.e., A θ* + b = 0.

###### Proof.

The ODE method (see Lemma 1, Appendix D) is our main tool to prove Theorem 1. Let

h(θ, w) = −A^⊤ w,   (29)

g(θ, w) = (Aθ + b) − C w.   (30)

Then, we rewrite the iterations (26) and (25) as follows,

θ_{t+1} = θ_t + α_t (h(θ_t, w_t) + M_{t+1}),   (31)

w_{t+1} = w_t + β_t (g(θ_t, w_t) + N_{t+1}),   (32)

where M_{t+1} and N_{t+1} are the corresponding noise terms. Lemma 1 requires us to verify the following four steps.

Step 1: (Verifying the condition A2) *Both of the functions h and g are Lipschitz.*

By Assumptions 2-3, A, b and C are uniformly bounded; thus it is easy to check that h and g are Lipschitz functions.

Step 2: (Verifying the condition A3)
*Let the σ-field F_t = σ(θ_0, w_0, M_1, N_1, …, M_t, N_t); then {M_{t+1}} and {N_{t+1}} are martingale difference sequences with respect to F_t.
Furthermore, there exists a non-negative constant K such that M_{t+1} and N_{t+1} are square-integrable with*

E[‖M_{t+1}‖^2 + ‖N_{t+1}‖^2 | F_t] ≤ K (1 + ‖θ_t‖^2 + ‖w_t‖^2).   (33)

By Eq.(24), E[δ_t^σ e_t | F_t] = A θ_t + b. By Assumption 2, Eq.(13) and Eq.(14), there exist non-negative constants k_1, k_2, k_3 such that ‖A‖ ≤ k_1, ‖b‖ ≤ k_2, ‖C‖ ≤ k_3, which implies all the above terms are bounded. Thus, there exists a non-negative constant K_1 such that the following Eq.(34) holds,

E[‖M_{t+1}‖^2 | F_t] ≤ K_1 (1 + ‖θ_t‖^2 + ‖w_t‖^2).   (34)

Similarly, by Assumption 2, the features have uniformly bounded second moments, so E[‖N_{t+1}‖^2 | F_t] ≤ K_2 (1 + ‖θ_t‖^2 + ‖w_t‖^2) holds for a constant K_2. Thus, Eq.(33) holds.

Step 3: (Verifying the condition A4)
*For each θ, the ODE ẇ(t) = g(θ, w(t))
has a unique globally asymptotically stable equilibrium w*(θ) such that
the map θ ↦ w*(θ) is Lipschitz.*

For a fixed θ, let g_θ(w) = g(θ, w). We consider the ODE

ẇ(t) = g_θ(w(t)) = (Aθ + b) − C w(t).   (35)

Assumption 3 implies that C is a positive definite matrix; thus, for the homogeneous ODE ẇ(t) = −C w(t), the origin is a globally asymptotically stable equilibrium. Thus, for a fixed θ, by Assumption 3,

w*(θ) = C^{-1} (Aθ + b)   (36)

is the unique globally asymptotically stable equilibrium of the ODE (35). Since w*(θ) = C^{-1}(Aθ + b) is affine in θ, it is obvious that θ ↦ w*(θ) is Lipschitz.

Step 4: (Verifying the condition A5) *The ODE θ̇(t) = h(θ(t), w*(θ(t))) has a unique globally asymptotically stable equilibrium θ*.*

Let h̃(θ) = h(θ, w*(θ)) = −A^⊤ C^{-1} (Aθ + b). We first consider the following ODE

θ̇(t) = −A^⊤ C^{-1} A θ(t).   (37)

By Assumptions 2-3, A is invertible and C is positive definite; thus A^⊤ C^{-1} A is a positive definite matrix. Hence the ODE (37) has a unique globally asymptotically stable equilibrium: the origin. Now, let us consider the ODE associated with iteration (26)/(31), which can be rewritten as follows,

θ̇(t) = −A^⊤ C^{-1} (A θ(t) + b)   (38)

= −A^⊤ C^{-1} A (θ(t) + A^{-1} b).   (39)

Eq.(39) holds due to the invertibility of A. By Assumption 3, A is invertible; then

θ* = −A^{-1} b   (40)

is the unique globally asymptotically stable equilibrium of the ODE (39).

## Experiment

In this section, we test both the policy evaluation and control capability of the proposed GQ(σ,λ) algorithm and validate the trade-off between full-sampling and pure-expectation on some standard domains. For all experiments, we set the hyper-parameter σ as follows: σ is drawn from a Gaussian distribution (with a fixed standard deviation) whose mean ranges dynamically from 0.02 to 0.98 in steps of 0.02. In the following, we use the term *dynamic* to denote this way of setting σ.

### Policy Evaluation Task

We employ three typical domains in RL for policy evaluation: Baird Star [2], Boyan Chain [5] and linearized Cart-Pole balancing.

**Domains** Baird Star is a well-known example of divergence in off-policy TD learning, which considers a 7-state MDP with A = {dashed, solid}. The behavior policy selects the dashed and solid actions with probabilities 6/7 and 1/7, respectively. The target policy always takes the solid action: π(solid|·) = 1.

The second benchmark MDP is the classic chain example from [5], which considers a chain of states s_1, …, s_N. Each transition from state s_i results in state s_{i+1} or s_{i+2} with equal probability and a fixed reward. The behavior policy we chose is random.

Due to space limitations, we provide more details of the dynamics of Boyan Chain and Baird Star, the chosen policies, and the features in Appendix E.

Cart-Pole balancing is widely used for many RL tasks. According to [6], the target policy we use in this section is the optimal policy.
