# Q-learning with Uniformly Bounded Variance: Large Discounting is Not a Barrier to Fast Learning

It has been a trend in the Reinforcement Learning literature to derive sample complexity bounds: a bound on how many experiences with the environment are required to obtain an ε-optimal policy. In the discounted cost, infinite horizon setting, all of the known bounds have a factor that is a polynomial in 1/(1-β), where β < 1 is the discount factor. For a large discount factor, these bounds seem to imply that a very large number of samples is required to achieve an ε-optimal policy. The objective of the present work is to introduce a new class of algorithms that have sample complexity uniformly bounded for all β < 1. One may argue that this is impossible, due to a recent min-max lower bound. The explanation is that this previous lower bound is for a specific problem, which we modify, without compromising the ultimate objective of obtaining an ε-optimal policy. Specifically, we show that the asymptotic variance of the Q-learning algorithm, with an optimized step-size sequence, is a quadratic function of 1/(1-β); an expected, and essentially known result. The new relative Q-learning algorithm proposed here is shown to have asymptotic variance that is a quadratic in 1/(1- ρβ), where 1 - ρ > 0 is the spectral gap of an optimal transition matrix.


## 1 Introduction

Most Reinforcement Learning (RL) algorithms can be cast as parameter estimation techniques, where the goal is to recursively estimate the parameter vector θ∗ ∈ Rd that directly, or indirectly, yields an optimal decision making rule within a parameterized family. The update equation for the d-dimensional parameter estimates can be expressed in the general form

 θn+1=θn+αn+1[f̄(θn)+Δn+1],n≥0 (1)

in which θ0 ∈ Rd is given, {αn} is a positive scalar gain sequence (also known as the learning rate in the RL literature), f̄: Rd → Rd is a deterministic function, and {Δn} is a “noise” sequence.

The recursion (1) is an example of stochastic approximation (SA), for which there is a vast research literature. Under standard assumptions, it can be shown that

 limn→∞θn=θ∗

where f̄(θ∗) = 0. Moreover, it can be shown that the best algorithms achieve the optimal mean-square error (MSE) convergence rate:

 E[∥θn−θ∗∥2]=O(1/n) (2)

It is known that TD- and Q-learning can be written in the form (1) [2, 3]. In these algorithms, {θn} represents the sequence of parameter estimates that are used to approximate a value function or Q-function. It is also widely recognized that these algorithms can be slow to converge.

It was first established in our work [1, 4] that the convergence rate of the MSE of Watkins’ Q-learning can be slower than O(1/n), if the discount factor satisfies β > 1/2, and if the step-size is either of two standard forms (see discussion in Section 3.1). It was also shown that the optimal convergence rate (2) is obtained by using a step-size of the form αn = g/n, where g is a scalar proportional to 1/(1−β); this is consistent with conclusions in more recent research [5, 6]. In the earlier work [7], a sample path upper bound was obtained on the rate of convergence that is roughly consistent with the mean-square rate established for αn = 1/n in [1, 4].

Since the publication of [7], many papers have appeared with proposed improvements to the algorithm. Many of these papers also derive sample complexity bounds, which are essentially bounds on the MSE (2). Ignoring higher order terms, these bounds can be expressed in the following general form [8, 5, 9, 6, 10]:

 E[∥θn−θ∗∥2]≤1(1−β)p⋅Bn (3)

where p > 0 is a scalar; Bn is a function of the total number of state-action pairs, the discount factor β, and the maximum per-stage cost. Much of the literature has worked towards minimizing p through a combination of hard analysis and algorithm design.

It is widely recognized that Q-learning algorithms can be very slow to converge, especially when the discount factor is close to 1. Quoting [8], a primary reason for slow convergence is “the fact that the Bellman operator propagates information throughout the whole space”, especially when the discount factor is close to 1. We do not dispute these explanations, but in this paper we argue that the challenge presented by discounting is relatively minor. In order to make this point clear we must take a step back and rethink fundamentals:

Why do we need to estimate the Q-function?

Denoting Q∗(x,u) to be the optimal Q-function for the state-action pair (x,u), the ultimate goal of estimating the Q-function is to obtain from it the corresponding optimal policy:

 ϕ∗(x)=argminuQ∗(x,u)

It is clear from the above definition that adding a constant to Q∗ will not alter ϕ∗. This is a fortunate fact: it is shown in Section 4 that Q∗ can be decomposed as

 Q∗(x,u)=Q̃∗(x,u)+ηβ/(1−β)

where η denotes the average cost under the optimal policy, and Q̃∗ is uniformly bounded in x, u, and β.
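A one-line sanity check (ours, not from the paper) of where the constant comes from: a constant per-stage cost η incurred from stage 1 onward contributes exactly

```latex
\sum_{n=1}^{\infty} \beta^{n}\,\eta \;=\; \frac{\eta\,\beta}{1-\beta}
```

to the discounted sum, which diverges as β ↑ 1, while the relative term Q̃∗ absorbs only the bounded, state-dependent variation.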

The reason for the slow performance of Q-learning when β is close to 1 is the high variance in the indirect estimate of the large constant ηβ/(1−β). We argue that if we ignore constants, we can obtain a sample complexity result of the form

 E[∥θn−θ∗∥2]≤1(1−ρβ)p⋅Bn (4)

where p > 0, and 1−ρ > 0 is the spectral gap of the transition matrix for the pair process (X, U) under the optimal policy (the non-zero spectral gap assumption is replaced by a milder assumption in Section 4.3).

The new relative Q-learning algorithm proposed here is designed to achieve the upper bound (4). Unfortunately, we have not yet obtained this explicit finite-n bound. We have instead obtained formulas for the asymptotic covariance that correspond to each of the algorithms considered in this paper (see (17)). The close relationship between the asymptotic covariance and sample complexity bounds is discussed in Section 1.2, based on the theoretical background in Section 1.1.

### 1.1 Stochastic Approximation & Reinforcement Learning

Consider a parameterized family of Rd-valued functions {f̄(θ) : θ ∈ Rd} that can be expressed as an expectation,

 f̄(θ):=E[f(θ,Φ)],θ∈Rd, (5)

with Φ a random vector, f a deterministic function, and the expectation is with respect to the distribution of the random vector Φ. It is assumed throughout that there exists a unique vector θ∗ satisfying f̄(θ∗) = 0. Under this assumption, the goal of SA is to estimate θ∗.

The SA algorithm recursively estimates θ∗ as follows: For initialization θ0 ∈ Rd, obtain the sequence of estimates {θn}:

 θn+1=θn+αn+1f(θn,Φn+1) (6)

where Φn+1 has the same distribution as Φ for each n (or its distribution converges to that of Φ as n → ∞), and {αn} is a non-negative scalar step-size sequence. We assume αn = g/n for some scalar g > 0, and special cases in applications to Q-learning are discussed separately in Section 3.
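As an illustration (ours, not from the paper), the following sketch runs the SA recursion (6) on a toy root-finding problem with f(θ, Φ) = Φ − θ, so that f̄(θ) = E[Φ] − θ and θ∗ = E[Φ]; the gain g in αn = g/n is chosen so that the optimal 1/n MSE rate applies.

```python
import random

random.seed(0)

# Toy problem (hypothetical): Phi ~ N(1, 1), so theta* = E[Phi] = 1.
theta_star = 1.0
g = 2.0            # scalar gain in the step-size alpha_n = g/n
n_iter = 100_000

theta = 0.0        # initialization theta_0
for n in range(1, n_iter + 1):
    Phi = random.gauss(theta_star, 1.0)   # sample Phi_{n+1}
    theta += (g / n) * (Phi - theta)      # SA recursion (6)

print(abs(theta - theta_star))            # error is O(1/sqrt(n))
```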

Asymptotic statistical theory for SA is extremely rich. Large Deviations or Central Limit Theorem (CLT) limits hold under very general assumptions for both SA and related Monte-Carlo techniques [11, 12, 13, 14, 15].

The CLT will be a guide to algorithm design in this paper. For a typical SA algorithm, this takes the following form: denote the error sequence by

 ~θn:=θn−θ∗ (7)

Under general conditions, the scaled sequence {√n θ̃n} converges in distribution to a Gaussian N(0, Σθ). Typically, the covariance of this scaled sequence is also convergent:

 Σθ=limn→∞nE[~θn~θ⊺n] (8)

The limit Σθ is known as the asymptotic covariance. Provided Σθ is finite, this implies (2), which is the fastest possible rate [11, 12, 14, 16, 17]. For Q-learning, this also implies a bound of the form (3), but only for n “large enough”.

An asymptotic bound such as (8) may not be satisfying for RL practitioners, given the success of finite-time performance bounds in prior research. There are however good reasons to apply this asymptotic theory in algorithm design:

1. The asymptotic covariance has a simple representation as the solution to a Lyapunov equation.

2. The MSE convergence is refined in [18] for linear SA algorithms (see Section 1.3): For some δ > 0,

 E[~θn~θ⊺n]=n−1Σθ+O(n−1−δ)

Extensions of this bound to the nonlinear algorithms found in RL are a topic of current research.

3. The asymptotic covariance lies beneath the surface in the theory of finite-time error bounds. Here is what can be expected from the theory of large deviations [19, 20], for which the rate function is denoted

 Ii(ε):=−limn→∞1nlogP{|θn(i)−θ∗(i)|>ε} (9)

The second order Taylor series approximation holds under general conditions:

 Ii(ε)=12Σθ(i,i)ε2+O(ε3) (10)

from which we obtain

 P{|θn(i)−θ∗(i)|>ε}=exp{−12Σθ(i,i)ε2n+O(nε3)+o(n)} (11)

where o(n)/n → 0 as n → ∞, and the term O(nε3) is bounded in n after normalization by n, and absolutely bounded by a constant times nε3 for small ε.

4. The Central Limit Theorem (CLT) holds under general assumptions:

 √n~θnd⟶W (12)

where the convergence is in distribution, and where W is Gaussian N(0, Σθ) [12, 11]; a version of the Law of the Iterated Logarithm also holds [21]:

 √(n/loglogn) θ̃n is bounded, with limit points in the set C={v∈Rd:v⊺Σθ−1v≤1}

The asymptotic theory provides insight into the slow convergence of Watkins’ Q-learning algorithm, and motivates better algorithms such as Zap Q-learning [4], and the relative Q-learning introduced in Section 4.

### 1.2 Sample complexity bounds

The inequalities of Hoeffding and Bennett are finite-n approximations of (9):

 P{|θn(i)−θ∗(i)|>ε}≤¯bexp(−n¯Ii(ε)) (13)

where b̄ is a constant and Īi(ε) ≤ Ii(ε) for ε > 0. For a given δ > 0, denote

 n̄(ε,δ):=(1/Īi(ε))log(b̄/δ) (14)

A sample complexity bound then follows easily from (13): P{|θn(i)−θ∗(i)|>ε} ≤ δ for all n ≥ n̄(ε,δ). Explicit bounds were obtained in [22, 5, 6] for Watkins’ algorithm, and in [8] for the “speedy” Q-learning algorithm. General theory for SA algorithms is presented in [23, 24, 18].

Observe that whenever both the limit (9) and the bound (13) are valid, the rate function must dominate: Īi(ε) ≤ Ii(ε). To maximize this upper bound we must minimize the asymptotic covariance (recall (10), and remember we are typically interested in small ε).

The value of (14) depends on the size of the constants. Ideally, the function Īi(ε) is quadratic as a function of ε. Theorem 6 of [22] asserts that this ideal is not in general possible for Q-learning: an example is given for which the best bound requires a sample size of order (1/ε)^{1/(1−β)} when the discount factor satisfies β > 1/2.

We conjecture that the sample-path complexity bound (14) with quadratic Īi is possible in the setting of [22], provided a sufficiently large scalar gain g is introduced on the right hand side of the update equation (1). This conjecture is rooted in the large deviations approximation (11), which requires a finite asymptotic covariance. In the very recent preprint [5], the finite-n bound (14) with quadratic Īi was obtained for Watkins’ algorithm in a special synchronous setting, subject to a specific scaling of the step-size: αn = 1/(1+(1−β)n). This result is consistent with our conjecture: it was shown in [25] that the asymptotic covariance is finite for the equivalent step-size sequence, αn = g/n with g = 1/(1−β) (see Thm. 3.3 for details).

### 1.3 Explicit Mean Square Error bounds for Linear SA

Here, we present a special case of the main result of [18], which we recall later in applications to Q-learning.

The analysis of the SA recursion (6) begins with the transformation to (1):

 θn+1=θn+αn+1[f̄(θn)+Δn+1] (15)

in which Δn+1 := f(θn, Φn+1) − f̄(θn). The difference f(θ,Φ) − f̄(θ) has zero mean for any (deterministic) θ when Φ is distributed according to the distribution appearing in (5). Though the results of [18] extend to Markovian noise, for the purposes of this paper, we assume here that {Δn} is a martingale difference sequence:

• (A1) The sequence {Δn} is a martingale difference sequence. Moreover, for some σ̄2Δ < ∞ and any initial condition θ0,

 E[∥Δn+1∥2∣Δ1,…,Δn]≤σ̄2Δ(1+∥θn∥2),n≥0
• (A2) αn = g/n, for some scalar g > 0, and all n ≥ 1.

Our primary interest regards the rate of convergence of the error sequence θ̃n := θn − θ∗, measured by the error covariance Σn := E[θ̃nθ̃n⊺], and σ2n := E[∥θ̃n∥2]. We say that σ2n tends to zero at rate n−μ (with μ > 0) if for each ε > 0,

 limn→∞ nμ−ε σ2n = 0 and limn→∞ nμ+ε σ2n = ∞ (16)

It is known that the maximal value is μ = 1, and we will show that when this optimal rate is achieved, there is typically an associated limiting matrix known as the asymptotic covariance:

 Σθ=limn→∞nΣn (17)

Under the conditions imposed here, the existence of the finite limit (17) also implies the CLT (12).

The analysis in [18] is based on a “linearized” approximation of the SA recursion (6):

 θn+1=θn+αn+1[An+1θn−bn+1] (18)

where An+1 = A(Φn+1) is a d × d matrix, and bn+1 = b(Φn+1) is d × 1. Let A and b denote the respective means:

 A=E[A(Φ)],b=E[b(Φ)] (19)

We assume that the mean matrix A is Hurwitz, a necessary condition for convergence of (18).

• (A3) The matrix A is Hurwitz. Consequently, A is invertible, and θ∗ = A−1b.

The recursion (18) can be rewritten in the form (15):

 θn+1=θn+αn+1[Aθn−b+Δn+1] (20)

in which {Δn} is the noise sequence:

 Δn+1=An+1θ∗−bn+1+~An+1~θn (21)

with Ãn+1 := An+1 − A. The parameter error sequence also evolves as a simple linear recursion:

 ~θn+1=~θn+αn+1[A~θn+Δn+1] (22)

The asymptotic covariance (17) exists under special conditions, and under these conditions it satisfies the Lyapunov equation

 (gA+½I)Σθ+Σθ(gA+½I)⊺+g2ΣΔ=0 (23)

where the “noise covariance matrix” is defined to be

 ΣΔ=E[(An+1θ∗−bn+1)(An+1θ∗−bn+1)⊺] (24)
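The asymptotic covariance can be computed numerically from the Lyapunov equation and the noise covariance (24). The sketch below is an illustration (not from the paper), assuming the step-size αn = g/n and the Lyapunov equation in the form (gA + ½I)Σθ + Σθ(gA + ½I)⊺ + g²ΣΔ = 0 from [18]; it vectorizes the equation using the Kronecker identity vec(MS + SM⊺) = (I⊗M + M⊗I)vec(S), valid for symmetric S.

```python
import numpy as np

def asymptotic_covariance(A, Sigma_Delta, g=1.0):
    """Solve (gA + I/2) S + S (gA + I/2)^T = -g^2 Sigma_Delta for S.

    Assumes Re(lambda) < -1/2 for every eigenvalue lambda of gA,
    so a unique positive semidefinite solution exists.
    """
    d = A.shape[0]
    M = g * A + 0.5 * np.eye(d)
    # vec(M S + S M^T) = (I kron M + M kron I) vec(S) for symmetric S
    K = np.kron(np.eye(d), M) + np.kron(M, np.eye(d))
    vec_S = np.linalg.solve(K, (-g**2 * Sigma_Delta).reshape(-1))
    return vec_S.reshape(d, d)

# Scalar sanity check: A = -1, Sigma_Delta = 1, g = 1  =>  Sigma_theta = 1.
print(asymptotic_covariance(np.array([[-1.0]]), np.array([[1.0]]))[0, 0])
```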

Recall (16) for the definition of convergence rate n−μ, and the step-size αn = g/n in (A2). Thm. 1.1 is a special case of the main result of [18] (which does not impose the martingale assumption (A1)).

###### Theorem 1.1.

Suppose (A1) – (A3) hold. Then the following hold for the linear recursion (22), for each initial condition θ̃0 ∈ Rd:

1. If Re(λ) < −1/2 for every eigenvalue λ of gA, then

 Σn=n−1Σθ+O(n−1−δ)

where δ = δ(g, A) > 0, and Σθ is the solution to the Lyapunov equation (23). Consequently, σ2n converges to zero at rate n−1.

2. Suppose there is an eigenvalue λ of gA that satisfies ϱ := −Re(λ) < 1/2. Let v ≠ 0 denote a corresponding left eigenvector, and suppose that v⊺ΣΔv̄ > 0. Then, E[|v⊺θ̃n|2] converges to zero at a rate n−2ϱ. Consequently, σ2n converges to zero at rate no faster than n−2ϱ.
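The dichotomy in Thm. 1.1 is easy to observe numerically. The following sketch is our own illustration: it runs the scalar linear recursion (22) with A = −1 and i.i.d. martingale-difference noise, comparing the gain g = 1 (so gA = −1 < −1/2, optimal 1/n rate) against g = 1/4 (so gA = −1/4 > −1/2, rate n^{−1/2}).

```python
import numpy as np

rng = np.random.default_rng(1)

def empirical_mse(g, n_iter=10_000, runs=200):
    """MSE of theta_{n+1} = theta_n + (g/n)(-theta_n + W_{n+1}), theta* = 0."""
    errs = np.ones(runs)                  # theta_0 = 1 in every run
    for n in range(1, n_iter + 1):
        W = rng.standard_normal(runs)     # martingale-difference noise
        errs += (g / n) * (-errs + W)
    return float(np.mean(errs ** 2))

mse_fast = empirical_mse(1.0)    # g*A = -1   < -1/2: MSE ~ 1/n
mse_slow = empirical_mse(0.25)   # g*A = -1/4 > -1/2: MSE ~ n^{-1/2}
print(mse_fast, mse_slow)
```

The slow-gain MSE is dominated by the initial-condition transient, which decays only like n^{−2g}; this is the mechanism behind case 2 of the theorem.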

### 1.4 Organization

Readers should skip to Section 4 if they have either read [1], or have a good understanding of the connections between Stochastic Approximation and Q-learning. Though most of the contents of Sections 2 and 3 are essentially known, Section 3 contains new interpretations on the convergence rate of Q-learning. The tutorial sections of this paper are taken from [26].

## 2 Markov Decision Processes Formulation

Consider a Markov Decision Process (MDP) model with state space X, action space U, cost function c: X × U → R, and discount factor β ∈ (0,1). It is assumed throughout this section that the state and action spaces are finite: denote ℓX = |X| and ℓU = |U|. In the following, the terms ‘action’, ‘control’, and ‘input’ are used interchangeably.

Along with the state-action process (X, U) is an i.i.d. sequence I = {In} used to model a randomized policy. We assume without loss of generality that each In is real-valued, with uniform distribution on the interval [0, 1]. An input sequence U is called non-anticipative if

 Un=zn(X0,U0,I1,…,Un−1,Xn,In),n≥0

where {zn} is a sequence of functions. The input sequence is admissible if it is non-anticipative, and if it is feasible in the sense that Xn remains in the state space X for each n.

Under the assumption that the state and action spaces are finite, it follows that there are a finite number of deterministic stationary policies {ϕ(i) : 1 ≤ i ≤ ℓϕ}, where each ϕ(i): X → U. A randomized stationary policy is defined by a probability mass function (pmf) μ on the integers {1, …, ℓϕ} such that

 Un = Σk=1..ℓϕ ιn(k)ϕ(k)(Xn) (25)

with E[ιn(k)] = μ(k) for each n and k. It is assumed that ιn is a fixed function of In for each n, so that this input sequence is non-anticipative.

It is convenient to use the following operator-theoretic notation. The controlled transition matrix Pu acts on functions V: X → R via

 PuV(x):=∑x′∈XPu(x,x′)V(x′)=E[V(Xn+1)∣Xn=x,Un=u; Xk,Ik,Uk:k&lt;n] (26)

where the second equality holds for any non-anticipative input sequence U. For any deterministic stationary policy ϕ, let Sϕ denote the substitution operator, defined for any function q: X × U → R by

 Sϕq(x):=q(x,ϕ(x))

If the policy is randomized, of the form (25), we then define

 Sϕq(x)=∑kμ(k)q(x,ϕ(k)(x))

With P viewed as a single matrix with ℓX·ℓU rows and ℓX columns, and Sϕ viewed as a matrix with ℓX rows and ℓX·ℓU columns, the following interpretations hold:

###### Lemma 2.1.

Suppose that U is defined using a stationary policy ϕ (possibly randomized). Then, both X and the pair process (X, U) are Markovian, and

1. SϕP is the transition matrix for X.

2. PSϕ is the transition matrix for (X, U).
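A quick numerical check of Lemma 2.1 (our own illustration, with arbitrary dimensions and a hypothetical deterministic policy): build P as an ℓXℓU × ℓX matrix and Sϕ as an ℓX × ℓXℓU matrix, and verify that both products SϕP and PSϕ are row-stochastic, as the lemma asserts.

```python
import numpy as np

rng = np.random.default_rng(5)
nx, nu = 3, 2                  # |X| and |U| (arbitrary for illustration)
d = nx * nu

# Hypothetical controlled transition probabilities P_u(x, x')
Pu = rng.dirichlet(np.ones(nx), size=(nu, nx))   # Pu[u, x] is a pmf over x'

# P as a d x nx matrix: row (x,u) holds the pmf P_u(x, .)
P = np.vstack([Pu[u, x] for x in range(nx) for u in range(nu)])

# S_phi as an nx x d matrix, for the deterministic policy phi(x) = 0
phi = np.zeros(nx, dtype=int)
S = np.zeros((nx, d))
for x in range(nx):
    S[x, x * nu + phi[x]] = 1.0

# Lemma 2.1: S_phi P is the nx x nx transition matrix of X,
# and P S_phi is the d x d transition matrix of the pair (X, U).
print((S @ P).sum(axis=1), (P @ S).sum(axis=1))
```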

### 2.1 Q-function and the Bellman Equation

For any (possibly randomized) stationary policy ϕ, we consider two value functions

 Vϕ(x):=∞∑n=0(βPϕ)nSϕc(x) (27a) Qϕ(x,u):=∞∑n=0(βPSϕ)nc(x,u) (27b)

which are related via

 Qϕ(x,u)=c(x,u)+βPuVϕ(x) (28)

The function Vϕ in (27a) is the value function that corresponds to the policy ϕ (with the corresponding transition probability matrix Pϕ = SϕP), and cost function Sϕc, that appears in TD-learning algorithms [27, 2]. The function Qϕ is the fixed-policy Q-function considered in the SARSA algorithm [28, 29, 30].

The minimal (optimal) value function is denoted

 V∗(x):=minϕVϕ(x)

It is known that this is the unique solution to the following Bellman equation:

 V∗(x)=minu{c(x,u)+β∑x′∈XPu(x,x′)V∗(x′)},x∈X (29)

Any minimizer defines a deterministic stationary policy that is optimal over all input sequences [31]:

 ϕ∗(x)∈argminu{c(x,u)+β∑x′∈XPu(x,x′)V∗(x′)},x∈X (30)

The Q-function associated with ϕ∗ is given by (28) with ϕ = ϕ∗, which is precisely the term within the brackets in (29):

 Q∗(x,u):=c(x,u)+βPuV∗(x),x∈X,u∈U

The Bellman equation (29) implies a similar fixed point equation for the Q-function:

 Q∗(x,u)=c(x,u)+β∑x′∈XPu(x,x′)Q––∗(x′) (31)

in which Q̲(x):=minu Q(x,u) for any function Q: X × U → R.
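The fixed point equation (31) can be solved by successive approximation, since the operator on its right-hand side is a β-contraction in the sup norm. A minimal sketch on a randomly generated MDP (all model data hypothetical, dimensions arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
nx, nu, beta = 4, 2, 0.9       # |X|, |U|, discount factor (illustrative)

P = rng.dirichlet(np.ones(nx), size=(nu, nx))   # P[u, x] is a pmf over x'
c = rng.uniform(0.0, 1.0, size=(nx, nu))        # per-stage cost c(x, u)

# Successive approximation of (31):
#   Q(x,u) <- c(x,u) + beta * sum_x' P_u(x,x') * min_u' Q(x',u')
Q = np.zeros((nx, nu))
for _ in range(500):
    Q = c + beta * np.einsum('uxy,y->xu', P, Q.min(axis=1))

# Bellman residual: how far Q is from satisfying (31)
residual = np.abs(Q - (c + beta * np.einsum('uxy,y->xu', P, Q.min(axis=1)))).max()
print(residual)
```

The policy ϕq of (32) is then recovered as `Q.argmin(axis=1)`.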

For any function q: X × U → R, let ϕq denote an associated policy that satisfies

 ϕq(x)∈argminuq(x,u),x∈X (32)

It is assumed to be specified uniquely as follows:

 ϕq :=ϕ(κ)such thatκ=min{i:ϕ(i)(x)∈argminuq(x,u),for all x∈X} (33)

Using the above notations, the fixed point equation (31) can be rewritten as

 Q∗(x,u)=c(x,u)+βPSϕQ∗(x,u), with ϕ=ϕq, q=Q∗ (34)

In general, there may be many optimal policies, so we remove ambiguity by denoting

 ϕ∗ :=ϕ(κ)such thatκ=min{i:ϕ(i)(x)∈argminuQ∗(x,u),for all x∈X} (35)

## 3 Q-learning

The goal in Q-learning is to approximately solve the fixed point equation (31), without assuming knowledge of the controlled transition matrix. We restrict the discussion to the case of a linear parameterization for the Q-function: Qθ(x,u)=θ⊺ψ(x,u), where θ ∈ Rd denotes the parameter vector, and ψ: X × U → Rd denotes the vector of basis functions.

A Galerkin approach to approximating Q∗ is formulated as follows: Obtain a non-anticipative input sequence U (using a randomized stationary policy ϕ), and a d-dimensional stationary stochastic process ζ = {ζn} that is adapted to (X, U). The Galerkin relaxation of the fixed point equation (31) is the root finding problem: find θ∗ such that

 0=E[{c(Xn,Un)+βQ̲θ∗(Xn+1)−Qθ∗(Xn,Un)}ζn(i)], 1≤i≤d (36)

where Q̲θ(x):=minu Qθ(x,u), and the expectation is with respect to the steady state distribution of the Markov chain (X, U). This is clearly a special case of the general root-finding problem that is the focus of SA algorithms.

The following Q(0) algorithm is the SA algorithm (6), applied to estimate the θ∗ that solves (36): For initialization θ0 ∈ Rd, define the sequence of estimates recursively:

 θn+1=θn+αn+1ζndn+1,ζn=ψ(Xn,Un) (37a) dn+1=c(Xn,Un)+βQ––θn(Xn+1)−Qθn(Xn,Un) (37b)

The choice for the sequence of eligibility vectors {ζn} in (37a) is inspired by the TD(λ) algorithm [32, 2].

Matrix gain Q-learning algorithms are also popular. For a sequence of d × d matrices G = {Gn}, the matrix-gain Q(0) algorithm is described as follows: For initialization θ0 ∈ Rd, the sequence of estimates is defined recursively:

 θn+1=θn+αn+1Gn+1ψ(Xn,Un)dn+1 (38a) dn+1=c(Xn,Un)+βQ––θn(Xn+1)−Qθn(Xn,Un) (38b)

A common choice is

 Gn=(1nn∑k=1ψ(Xk,Uk)ψ⊺(Xk,Uk))−1 (39)

A popular example will follow shortly.

The success of these algorithms has been demonstrated in a few restricted settings, such as optimal stopping [33, 34, 35], deterministic optimal control [36], and the tabular setting discussed next.

### 3.1 Tabular Q-learning

The basic Q-learning algorithm of Watkins [37, 38] (also known as “tabular” Q-learning) is a particular instance of the Galerkin approach (37). The basis functions are taken to be indicator functions:

 ψi(x,u)=I{x=xi,u=ui},1≤i≤d (40)

where {(xi, ui) : 1 ≤ i ≤ d} is an enumeration of all state-input pairs, with d = ℓX·ℓU. The goal of this approach is to exactly compute the function Q∗. Substituting ψ with the basis defined in (40), the objective (36) can be rewritten as follows: Find θ∗ such that, for each 1 ≤ i ≤ d,

 0 =E[{c(Xn,Un)+βQ––θ∗(Xn+1)−Qθ∗(Xn,Un)}ψi(Xn,Un)] (41) =[c(xi,ui)+βE[Q––θ∗(Xn+1)|Xn=xi,Un=ui]−Qθ∗(xi,ui)]ϖ(xi,ui) (42)

where the expectation in (41) is in steady state, and ϖ in (42) denotes the invariant pmf of the Markov chain (X, U). The conditional expectation in (42) is

 E[Q––θ∗(Xn+1)|Xn=xi,Un=ui]=∑x′∈XPui(xi,x′)Q––θ∗(x′)

Consequently, (42) can be rewritten as

 0 =[c(xi,ui)+β∑x′∈XPui(xi,x′)Q––θ∗(x′)−Qθ∗(xi,ui)]ϖ(xi,ui) (43)

If ϖ(xi,ui) > 0 for each i, then the function Qθ∗ that solves (43) is identical to the optimal Q-function in (31).

There are three flavors of Watkins’ Q-learning that are popular in the literature. We discuss each of them below.

Asynchronous Q-learning: The SA algorithm applied to solve (41) coincides with the most basic version of Watkins’ Q-learning algorithm: For initialization θ0 ∈ Rd, define the sequence of estimates recursively:

 θn+1=θn+αn+1[c(Xn,Un)+βQ––θn(Xn+1)−Qθn(Xn,Un)]ψ(Xn,Un) (44)

where {αn} denotes the non-negative step-size sequence.

Algorithm (44) coincides with the Q(0) algorithm (37), with ψ defined in (40). Based on this choice of basis functions, a single entry of θn is updated at each iteration, corresponding to the state-input pair (Xn, Un) observed (hence the term “asynchronous”). Observing that θn is identified with the estimate Qn = Qθn, a more familiar form of (44) is:

 Qn+1(Xn,Un)=Qn(Xn,Un)+αn+1[c(Xn,Un)+βQ––n(Xn+1)−Qn(Xn,Un)] (45)

and Qn+1(x′,u′)=Qn(x′,u′) if (x′,u′) ≠ (Xn,Un).

With αn = 1/n, the ODE approximation of (44) takes the form (the reader is referred to [3] for details):

 ddtqt(x,u)=ϖ(x,u)[c(x,u)+βPuq–t(x)−qt(x,u)] (46)

in which q̲t(x) = minu qt(x,u), as defined below (31). We recall in Section 3.2 conditions under which this ODE is stable, and explain why we cannot expect a finite asymptotic covariance in typical settings.
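For concreteness, here is a minimal sketch of the asynchronous update (45) with the visit-count step-size described next, on a small randomly generated MDP. All model data are hypothetical, and a small discount factor β = 0.3 is chosen deliberately so that the algorithm lies in the fast-rate regime discussed in Section 3.2.

```python
import numpy as np

rng = np.random.default_rng(3)
nx, nu, beta = 4, 2, 0.3
P = rng.dirichlet(np.ones(nx), size=(nu, nx))   # hypothetical P_u(x, .)
c = rng.uniform(0.0, 1.0, size=(nx, nu))        # hypothetical cost c(x, u)

Q = np.zeros((nx, nu))
counts = np.zeros((nx, nu))      # n(x,u): number of visits to (x,u)
x = 0
for _ in range(50_000):
    u = rng.integers(nu)                     # uniform randomized policy
    counts[x, u] += 1
    x_next = rng.choice(nx, p=P[u, x])       # observe X_{n+1}
    td = c[x, u] + beta * Q[x_next].min() - Q[x, u]   # temporal difference
    Q[x, u] += td / counts[x, u]             # step-size 1/n(x,u)
    x = x_next
```

Only the entry for the observed pair (Xn, Un) changes at each step, which is exactly the “asynchronous” structure of (45).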

A second and perhaps more popular “Q-learning flavor” is defined using a particular “state-action dependent” step-size [7, 22, 25]. For each (x,u) and n ≥ 1, denote αn(x,u) = 1 if the pair (x,u) has not been visited up until time n. Otherwise,

 αn(x,u)=[n(x,u)]−1,n(x,u):=n−1∑j=0I{Xj=x,Uj=u} (47)

At stage n of the algorithm, once Xn = x and Un = u are observed, a single entry of the Q-function is updated as in (45):

 Qn+1(x,u)=Qn(x,u)+αn+1(x,u)[c(x,u)+βQ––n(Xn+1)−Qn(x,u)] (48)

The ODE approximation simplifies when using this step-size rule:

 ddtqt(x,u)=c(x,u)+βPuq̲t(x)−qt(x,u) (49)

Conditions for a finite asymptotic covariance are also greatly simplified (see Thm. 3.3).

The asynchronous variant of Watkins’ Q-learning algorithm (44) with step-size (47) can be viewed as the G-Q(0) algorithm defined in (38), with the matrix gain sequence (39), and step-size αn = 1/n. On substituting the Watkins’ basis defined in (40), we find that this matrix gain is diagonal:

 Gn =ˆΠ−1n,ˆΠn(i,i)=1nn∑k=1I{Xk=xi,Uk=ui},1≤i≤d (50)

By the Law of Large Numbers, we have

 limn→∞Gn=limn→∞ˆΠ−1n=Π−1 (51)

where Π is a diagonal matrix with entries Π(i,i)=ϖ(xi,ui). It is easy to see why the ODE approximation (46) simplifies to (49) with this matrix gain.

Synchronous Q-learning: In this final flavor, each entry of the Q-function approximation is updated in each iteration. It is popular in the literature because the analysis is greatly simplified in this case.

The algorithm assumes access to an “oracle” that provides the next state of the Markov chain, conditioned on any given current state-action pair: let {Xin+1 : 1 ≤ i ≤ d, n ≥ 0} denote a collection of mutually independent random variables taking values in X. Assume moreover that for each i, the sequence {Xin+1 : n ≥ 0} is i.i.d. with common distribution Pui(xi, ·). The synchronous Q-learning algorithm is then obtained as follows: For initialization θ0 ∈ Rd, define the sequence of estimates recursively:

 θn+1(i)=θn(i)+αn+1[c(xi,ui)+βQ̲θn(Xin+1)−Qθn(xi,ui)], 1≤i≤d (52)

Once again, based on the choice of basis functions (40), and observing that θn is identified with the estimate Qn = Qθn, an equivalent form of the update rule (52) is

 Qn+1(xi,ui)=Qn(xi,ui)+αn+1[c(xi,ui)+βQ––n(Xin+1)−Qn(xi,ui)],  1≤i≤d (53)

Using the step-size αn = 1/n we obtain the simple ODE approximation (49).
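The synchronous flavor (53) can be sketched in the same way: each iteration draws one oracle sample for every pair (xi, ui), and all d entries are updated with the common step-size αn = 1/n from a frozen copy of Qn. Again, the model data are hypothetical and a small β is used for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
nx, nu, beta = 4, 2, 0.3
P = rng.dirichlet(np.ones(nx), size=(nu, nx))   # hypothetical P_u(x, .)
c = rng.uniform(0.0, 1.0, size=(nx, nu))        # hypothetical cost c(x, u)

Q = np.zeros((nx, nu))
for n in range(1, 5_001):
    alpha = 1.0 / n              # common step-size alpha_n = 1/n
    Qn = Q.copy()                # freeze Q_n: all entries use the same estimate
    for x in range(nx):
        for u in range(nu):
            x_next = rng.choice(nx, p=P[u, x])   # oracle sample X^i_{n+1}
            Q[x, u] = Qn[x, u] + alpha * (c[x, u] + beta * Qn[x_next].min()
                                          - Qn[x, u])
```

Freezing `Qn` before the sweep matters: updating in place would let later pairs in the sweep see partially updated values, which is the asynchronous rather than the synchronous recursion.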

### 3.2 Convergence and Rate of Convergence

Convergence of the tabular Q-learning algorithms requires the following assumptions:

(Q1) The input U is defined by a randomized stationary policy of the form (25). The joint process (X, U) is an irreducible Markov chain. That is, it has a unique invariant pmf ϖ satisfying ϖ(x,u) > 0 for each (x,u).

(Q2) The optimal policy is unique.

Both ODEs (46) and (49) are stable under assumption (Q1) [3], which then (based on the results of [3]) implies that θn converges to θ∗ a.s. To obtain rates of convergence requires an examination of the linearization of the ODEs at their equilibrium.

Linearization is justified under Assumption (Q2), which implies the existence of ε > 0 such that

 ϕ∗(x)=argminu∈UQθ(x,u),x∈X, θ∈Rd, ∥Qθ−Q∗∥<ε (54)
###### Lemma 3.1.

Under Assumptions (Q1) and (Q2), the following approximations hold:

1. When αn = 1/n, the ODE (46) reduces to

 ddtqt=−Π[I−βPSϕ∗]qt−b

where Π is the diagonal matrix defined below (50), and b = −Πc, with c expressed as a d-dimensional column vector.

2. When the step-size (47) is used, the ODE (49) reduces to

 ddtqt=−[I−βPSϕ∗]qt−b

where b = −c.

The proof is contained in Appendix A.

Recall the definition of the linearization matrix [1, 26]:

 A=∂θf̄(θ)∣θ=θ∗

The crucial take-away from Lemma 3.1 is the pair of linearization matrices that correspond to the different tabular Q-learning algorithms:

 A=−Π[I−βPSϕ∗]  in case (i) of Lemma 3.1 (55a)
 A=−[I−βPSϕ∗]  in case (ii) of Lemma 3.1 (55b)

Since PSϕ∗ is a transition probability matrix of an irreducible Markov chain (see Lemma 2.1), it follows that both matrices are Hurwitz.

We consider next conditions under which the asymptotic covariance for Q-learning is not finite. The noise covariance ΣΔ defined in (24) is diagonal in all three flavors. For the asynchronous Q-learning algorithm (48) with step-size (47), or the synchronous Q-learning algorithm (53), the diagonal elements of ΣΔ are given by

 ΣsΔ(i,i)=β2E[(Q̲∗(Xn+1)−∑x′∈XPui(xi,x′)Q̲∗(x′))2∣Xn=xi,Un=ui] (56)
 =β2E[(V∗(Xn+1)−∑x′∈XPui(xi,x′)V∗(x′))2∣Xn=xi,Un=ui]