1 Introduction
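Most Reinforcement Learning (RL) algorithms can be cast as parameter estimation techniques, where the goal is to recursively estimate the parameter vector $\theta^* \in \mathbb{R}^d$ that directly or indirectly yields an optimal decision-making rule within a parameterized family. The update equation for the $d$-dimensional parameter estimates $\{\theta_n : n \ge 0\}$ can be expressed in the general form
$$\theta_{n+1} = \theta_n + \alpha_{n+1}\bigl[\bar{f}(\theta_n) + \Delta_{n+1}\bigr], \qquad n \ge 0, \tag{1}$$
in which $\theta_0 \in \mathbb{R}^d$ is given, $\{\alpha_n\}$ is a positive scalar gain sequence (also known as the learning rate in the RL literature), $\bar{f} : \mathbb{R}^d \to \mathbb{R}^d$ is a deterministic function, and $\{\Delta_n\}$ is a “noise” sequence.
The recursion (1) is an example of stochastic approximation (SA), for which there is a vast research literature. Under standard assumptions, it can be shown that
$$\lim_{n \to \infty} \theta_n = \theta^*,$$
where $\bar{f}(\theta^*) = 0$. Moreover, it can be shown that the best algorithms achieve the optimal mean-square error (MSE) convergence rate:
$$\mathsf{E}\bigl[\|\theta_n - \theta^*\|^2\bigr] = O(1/n). \tag{2}$$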
It is known that TD- and Q-learning can be written in the form (1) [2, 3]. In these algorithms, $\{\theta_n\}$ represents the sequence of parameter estimates that are used to approximate a value function or Q-function. It is also widely recognized that these algorithms can be slow to converge.
It was first established in our work [1, 4] that the convergence rate of the MSE of Watkins’ Q-learning can be slower than $O(1/n^{2(1-\beta)})$ if the discount factor $\beta \in (0,1)$ satisfies $\beta > 1/2$, and if the step-size $\alpha_n$ is either of two standard forms (see discussion in Section 3.1). It was also shown that the optimal convergence rate (2) is obtained by using a step-size of the form $\alpha_n = g/n$, where $g$ is a scalar proportional to $1/(1-\beta)$; this is consistent with conclusions in more recent research [5, 6]. In the earlier work [7], a sample path upper bound was obtained on the rate of convergence that is roughly consistent with the mean-square rate established in [1, 4].
Since the publication of [7], many papers have appeared with proposed improvements to the algorithm. Many of these papers also derive sample complexity bounds that are essentially a bound on the MSE (2). Ignoring higher order terms, these bounds can be expressed in the following general form [8, 5, 9, 6, 10]:
$$\mathsf{E}\bigl[\|\theta_n - \theta^*\|^2\bigr] \le \frac{1}{(1-\beta)^p} \cdot \frac{B}{n}, \tag{3}$$
where $p$ is a scalar; $B$ is a function of the total number of state-action pairs, the discount factor $\beta$, and the maximum per-stage cost. Much of the literature has worked towards minimizing $p$ through a combination of hard analysis and algorithm design.
It is widely recognized that Q-learning algorithms can be very slow to converge, especially when the discount factor is close to 1. Quoting [8], a primary reason for slow convergence is “the fact that the Bellman operator propagates information throughout the whole space”, especially when the discount factor is close to 1. We do not dispute these explanations, but in this paper we argue that the challenge presented by discounting is relatively minor. To make this point clear we must take a step back and rethink fundamentals:
Why do we need to estimate the Q-function?
Denoting by $Q^*(x,u)$ the optimal Q-function for the state-action pair $(x,u)$, the ultimate goal of estimating the Q-function is to obtain from it the corresponding optimal policy:
$$\phi^*(x) = \arg\min_{u} Q^*(x,u).$$
It is clear from the above definition that adding a constant to $Q^*$ will not alter $\phi^*$. This is a fortunate fact: it is shown in Section 4 that $Q^*$ can be decomposed as
$$Q^*(x,u) = \tilde{Q}^*(x,u) + \frac{\eta^*}{1-\beta},$$
where $\eta^*$ denotes the average cost under the optimal policy, and $\tilde{Q}^*$ is uniformly bounded in $x$, $u$, and $\beta$.
The slow performance of Q-learning when $\beta \approx 1$ is due to the high variance in the indirect estimate of the large constant $\eta^*/(1-\beta)$. We argue that if we ignore constants, we can obtain a sample complexity result of the form
(4) 
where , and is the spectral gap of the transition matrix for the pair process under the optimal policy. (The nonzero spectral gap is replaced by a milder assumption in Section 4.3.)
The new relative Q-learning algorithm proposed here is designed to achieve the upper bound (4). Unfortunately, we have not yet obtained this explicit finite bound. We have instead obtained a formula for the asymptotic covariance corresponding to each of the algorithms considered in this paper (see (17)). The close relationship between the asymptotic covariance and sample complexity bounds is discussed in Section 1.2, based on the theoretical background in Section 1.1.
1.1 Stochastic Approximation & Reinforcement Learning
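Consider a parameterized family of $\mathbb{R}^d$-valued functions $\{\bar{f}(\theta) : \theta \in \mathbb{R}^d\}$ that can be expressed as an expectation,
$$\bar{f}(\theta) := \mathsf{E}\bigl[f(\theta, \Phi)\bigr], \qquad \theta \in \mathbb{R}^d, \tag{5}$$
with $\Phi \in \mathbb{R}^m$ a random vector, $f : \mathbb{R}^d \times \mathbb{R}^m \to \mathbb{R}^d$, and the expectation is with respect to the distribution of the random vector $\Phi$. It is assumed throughout that there exists a unique vector $\theta^* \in \mathbb{R}^d$ satisfying $\bar{f}(\theta^*) = 0$. Under this assumption, the goal of SA is to estimate $\theta^*$.
The SA algorithm recursively estimates $\theta^*$ as follows: For initialization $\theta_0 \in \mathbb{R}^d$, obtain the sequence of estimates $\{\theta_n : n \ge 0\}$:
$$\theta_{n+1} = \theta_n + \alpha_{n+1} f(\theta_n, \Phi_{n+1}), \tag{6}$$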
where has the same distribution as for each (or its distribution converges to that of as ), and is a nonnegative scalar stepsize sequence. We assume for some scalar , and special cases in applications to Qlearning are discussed separately in Section 3.
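As a minimal sketch of the recursion (6), consider the scalar case with the illustrative choices $f(\theta, \Phi) = \Phi - \theta$ and $\alpha_n = 1/n$ (both are assumptions made for this example, not choices taken from the text). Then $\bar{f}(\theta) = \mathsf{E}[\Phi] - \theta$, the root $\theta^*$ is the mean of $\Phi$, and the SA iterates coincide with the running sample average. The function name `sa_mean` is hypothetical.

```python
def sa_mean(samples, theta0=0.0):
    """Scalar stochastic approximation for the mean of the samples.

    Implements theta_{n+1} = theta_n + alpha_{n+1} * f(theta_n, Phi_{n+1})
    with the assumed choices f(theta, phi) = phi - theta and alpha_n = 1/n.
    """
    theta = theta0
    for n, phi in enumerate(samples, start=1):
        alpha = 1.0 / n                  # step-size alpha_n = 1/n (assumption)
        theta += alpha * (phi - theta)   # SA update with f(theta, phi) = phi - theta
    return theta


samples = [2.0, 4.0, 9.0, 1.0]
estimate = sa_mean(samples)  # equals the sample mean up to rounding
```

Because $\alpha_1 = 1$, the first update overwrites the initial condition, so the result does not depend on `theta0` in this particular example.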
Asymptotic statistical theory for SA is extremely rich. Large Deviations or Central Limit Theorem (CLT) limits hold under very general assumptions for both SA and related Monte-Carlo techniques [11, 12, 13, 14, 15].
The CLT will be a guide to algorithm design in this paper. For a typical SA algorithm, this takes the following form: denote the error sequence by
$$\tilde{\theta}_n := \theta_n - \theta^*. \tag{7}$$
Under general conditions, the scaled sequence $\{\sqrt{n}\,\tilde{\theta}_n : n \ge 1\}$ converges in distribution to a Gaussian $N(0, \Sigma_\theta)$. Typically, the covariance of this scaled sequence is also convergent:
$$\Sigma_\theta = \lim_{n \to \infty} n\,\mathsf{E}\bigl[\tilde{\theta}_n \tilde{\theta}_n^{\mathsf{T}}\bigr]. \tag{8}$$
The limit $\Sigma_\theta$ is known as the asymptotic covariance. Provided $\Sigma_\theta$ is finite, this implies (2), which is the fastest possible rate [11, 12, 14, 16, 17]. For Q-learning, this also implies a bound of the form (3), but only for $n$ “large enough”.
An asymptotic bound such as (8) may not be satisfying for RL practitioners, given the success of finite-time performance bounds in prior research. There are, however, good reasons to apply this asymptotic theory in algorithm design:

The asymptotic covariance has a simple representation as the solution to a Lyapunov equation.

The asymptotic covariance lies beneath the surface in the theory of finitetime error bounds. Here is what can be expected from the theory of large deviations [19, 20], for which the rate function is denoted
(9) The second order Taylor series approximation holds under general conditions:
(10) from which we obtain
(11) where as , and is bounded in , and absolutely bounded by a constant times for small .
The asymptotic theory provides insight into the slow convergence of Watkins’ Q-learning algorithm, and motivates better algorithms such as Zap Q-learning [4], and the relative Q-learning introduced in Section 4.
1.2 Sample complexity bounds
The inequalities of Hoeffding and Bennett are finite approximations of (9):
(13) 
where is a constant and for . For a given , denote
(14) 
A sample complexity bound then follows easily from (13): for all . Explicit bounds were obtained in [22, 5, 6] for Watkins’ algorithm, and in [8] for the “speedy” Qlearning algorithm. General theory for SA algorithms is presented in [23, 24, 18].
Observe that whenever both the limit (9) and the bound (13) are valid, the rate function must dominate: . To maximize this upper bound we must minimize the asymptotic covariance (recall (10), and remember we are typically interested in small ).
The value of (14) depends on the size of the constants. Ideally, the function is quadratic as a function of . Theorem 6 of [22] asserts that this ideal is not in general possible for Qlearning: an example is given for which the best bound requires when the discount factor satisfies .
We conjecture that the samplepath complexity bound (14) with quadratic is possible in the setting of [22], provided a sufficiently large scalar gain is introduced on the right hand side of the update equation (1). This conjecture is rooted in the large deviations approximation (11), which requires a finite asymptotic covariance. In the very recent preprint [5], the finite bound (14) with quadratic was obtained for Watkins’ algorithm in a special synchronous setting, subject to a specific scaling of the stepsize: . This result is consistent with our conjecture: it was shown in [25] that the asymptotic covariance is finite for the equivalent stepsize sequence, (see Thm. 3.3 for details).
1.3 Explicit Mean Square Error bounds for Linear SA
Here, we present a special case of the main result of [18], which we recall later in applications to Qlearning.
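The analysis of the SA recursion (6) begins with the transformation to (1):
$$\theta_{n+1} = \theta_n + \alpha_{n+1}\bigl[\bar{f}(\theta_n) + \Delta_{n+1}\bigr], \tag{15}$$
in which $\Delta_{n+1} := f(\theta_n, \Phi_{n+1}) - \bar{f}(\theta_n)$. The difference $f(\theta, \Phi) - \bar{f}(\theta)$ has zero mean for any (deterministic) $\theta \in \mathbb{R}^d$ when $\Phi$ is distributed as in (5). Though the results of [18] extend to Markovian noise, for the purposes of this paper we assume here that $\{\Delta_n\}$ is a martingale difference sequence: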

The sequence is a martingale difference sequence. Moreover, for some and any initial condition ,

, for some scalar , and all .
Our primary interest regards the rate of convergence of the error sequence , measured by the error covariance , and . We say that tends to zero at rate (with ) if for each ,
(16) 
It is known that the maximal value is , and we will show that when this optimal rate is achieved, there is typically an associated limiting matrix known as the asymptotic covariance:
(17) 
Under the conditions imposed here, the existence of the finite limit (17) also implies the CLT (12).
The analysis in [18] is based on a “linearized” approximation of the SA recursion (6):
(18) 
where, is a matrix, and is . Let and denote the respective means:
(19) 
We assume that the matrix is Hurwitz, a necessary condition for convergence of (18).

The matrix is Hurwitz. Consequently, is invertible, and .
The recursion (18) can be rewritten in the form (15):
(20) 
in which is the noise sequence:
(21) 
with . The parameter error sequence also evolves as a simple linear recursion:
(22) 
The asymptotic covariance (17) exists under special conditions, and under these conditions it satisfies the Lyapunov equation
(23) 
where the “noise covariance matrix” is defined to be
(24) 
Recall (16) for the definition of convergence rate , and the definition . Thm. 1.1 is a special case of the main result of [18] (which does not impose the martingale assumption (A1)).
Theorem 1.1.
Suppose (A1) – (A3) hold. Then the following hold for the linear recursion (22), for each initial condition :

If for every eigenvalue of , then
where , and is the solution to the Lyapunov equation (23). Consequently, converges to zero at rate .

Suppose there is an eigenvalue of that satisfies . Let denote a corresponding left eigenvector, and suppose that . Then, converges to at a rate . Consequently, converges to zero at rate no faster than .
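The dichotomy in Thm. 1.1 can be illustrated in the scalar case. The sketch below assumes the step-size $\alpha_n = 1/n$ and i.i.d. zero-mean noise with variance `sigma2` (assumptions for this example only). Under these assumptions the error variance $v_n$ of the scalar linear recursion $\tilde\theta_{n+1} = \tilde\theta_n + \alpha_{n+1}[a\tilde\theta_n + \Delta_{n+1}]$ evolves deterministically as $v_{n+1} = (1 + a\alpha_{n+1})^2 v_n + \alpha_{n+1}^2\sigma^2$. For $a < -1/2$ the scaled variance $n v_n$ approaches $\sigma^2/(-2a-1)$, which is the scalar solution of the Lyapunov equation in its standard form for this step-size; for $-1/2 < a < 0$ the scaled variance grows without bound. The function name `scaled_variance` is hypothetical.

```python
def scaled_variance(a, sigma2=1.0, n_steps=100_000, v0=1.0):
    """Return n * Var(theta_n) for a scalar linear SA recursion.

    Assumes alpha_n = 1/n and i.i.d. zero-mean noise with variance sigma2,
    so the variance obeys v_{n+1} = (1 + a*alpha)^2 * v_n + alpha^2 * sigma2.
    """
    v = v0
    for n in range(n_steps):
        alpha = 1.0 / (n + 1)
        v = (1.0 + alpha * a) ** 2 * v + alpha ** 2 * sigma2
    return n_steps * v


# Eigenvalue below -1/2: finite asymptotic variance sigma2 / (-2a - 1).
fast = scaled_variance(a=-2.0)            # approaches 1/3
# Eigenvalue above -1/2: n * v_n keeps growing with n.
slow_early = scaled_variance(a=-0.25, n_steps=1_000)
slow_late = scaled_variance(a=-0.25, n_steps=100_000)
```

In the Q-learning setting, an eigenvalue close to zero corresponds to a discount factor close to one, which is the regime in which the text reports slow convergence.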
1.4 Organization
Readers should skip to Section 4 if they have either read [1], or have a good understanding of the connections between Stochastic Approximation and Q-learning. Though most of the contents of Sections 2 and 3 are essentially known, Section 3 contains new interpretations of the convergence rate of Q-learning. The tutorial sections of this paper are taken from [26].
2 Markov Decision Processes Formulation
Consider a Markov Decision Process (MDP) model with state space , action space , cost function , and discount factor . It is assumed throughout this section that the state and action spaces are finite: denote and . In the following, the terms ‘action’, ‘control’, and ‘input’ are used interchangeably.
Along with the state-action process is an i.i.d. sequence used to model a randomized policy. We assume without loss of generality that each is real-valued, with uniform distribution on the interval $[0, 1]$. An input sequence is called nonanticipative if
where is a sequence of functions. The input sequence is admissible if it is nonanticipative, and if it is feasible in the sense that remains in the state space for each .
Under the assumption that the state and action spaces are finite, it follows that there are a finite number of deterministic stationary policies , where each , and
. A randomized stationary policy is defined by a probability mass function (pmf)
on the integers such that
(25) 
with for each and . It is assumed that is a fixed function of for each , so that this input sequence is nonanticipative.
It is convenient to use the following operator-theoretic notation. The controlled transition matrix acts on functions via
(26)  
where the second equality holds for any nonanticipative input sequence . For any deterministic stationary policy , let denote the substitution operator, defined for any function by
If the policy is randomized, of the form (25), we then define
With viewed as a single matrix with rows and columns, and viewed as a matrix with rows and columns, the following interpretations hold:
Lemma 2.1.
Suppose that is defined using a stationary policy (possibly randomized). Then, both and the pair process are Markovian, and

is the transition matrix for .

is the transition matrix for .
2.1 Q-function and the Bellman Equation
For any (possibly randomized) stationary policy , we consider two value functions
(27a)  
(27b) 
which are related via
(28) 
The function in (27a) is the value function that corresponds to the policy (with the corresponding transition probability matrix ), and cost function , that appears in TD-learning algorithms [27, 2]. The function is the fixed-policy Q-function considered in the SARSA algorithm [28, 29, 30].
The minimal (optimal) value function is denoted
It is known that this is the unique solution to the following Bellman equation:
(29) 
Any minimizer defines a deterministic stationary policy that is optimal over all input sequences [31]:
(30) 
The Q-function associated with is given by (28) with , which is precisely the term within the brackets in (29):
The Bellman equation (29) implies a similar fixed point equation for the Q-function:
(31) 
in which for any function .
For any function , let denote an associated policy that satisfies
(32) 
It is assumed to be specified uniquely as follows:
(33) 
Using the above notations, the fixed point equation (31) can be rewritten as
(34) 
In general, there may be many optimal policies, so we remove ambiguity by denoting
(35) 
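The fixed point equation for the Q-function and the associated minimizing policy can be sketched numerically. The example below assumes that (31) takes the standard form $Q^*(x,u) = c(x,u) + \beta \sum_{x'} P_u(x,x') \min_{u'} Q^*(x',u')$, and it breaks ties in the minimization by choosing the smallest action index (an assumption for this sketch; the exact rule in (33) may differ). The two-state model and the names `q_value_iteration` and `greedy_policy` are hypothetical.

```python
def q_value_iteration(P, cost, beta, n_iter=200):
    """Iterate the assumed Bellman fixed point equation for the Q-function.

    P[u][x][y] is the probability of moving from x to y under action u,
    cost[x][u] is the one-step cost, and beta is the discount factor.
    """
    n_x, n_u = len(cost), len(cost[0])
    Q = [[0.0] * n_u for _ in range(n_x)]
    for _ in range(n_iter):
        q_min = [min(row) for row in Q]  # min over actions, for each state
        Q = [
            [cost[x][u] + beta * sum(P[u][x][y] * q_min[y] for y in range(n_x))
             for u in range(n_u)]
            for x in range(n_x)
        ]
    return Q


def greedy_policy(Q):
    """Minimizing action for each state; ties go to the smallest index."""
    return [min(range(len(row)), key=lambda u: row[u]) for row in Q]


# Hypothetical model: action 0 stays in place, action 1 switches state.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # action 0
    [[0.0, 1.0], [1.0, 0.0]],  # action 1
]
cost = [[1.0, 1.5], [0.0, 0.5]]
Q_star = q_value_iteration(P, cost, beta=0.5)
policy = greedy_policy(Q_star)
```

For this model the iteration converges to $Q^* = [[1.75, 1.5], [0, 1.25]]$: from state 0 it is optimal to pay 1.5 once and move to state 1, where staying costs nothing.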
3 Q-learning
The goal in Q-learning is to approximately solve the fixed point equation (31), without assuming knowledge of the controlled transition matrix. We restrict the discussion to the case of linear parameterization for the Q-function: , where denotes the parameter vector, and denotes the vector of basis functions.
A Galerkin approach to approximating is formulated as follows: Obtain a nonanticipative input sequence (using a randomized stationary policy ), and a dimensional stationary stochastic process that is adapted to . The Galerkin relaxation of the fixed point equation (31) is the root finding problem:
(36) 
where
, and the expectation is with respect to the steady state distribution of the Markov chain
. This is clearly a special case of the general root-finding problem that is the focus of SA algorithms.
The following Q($\lambda$) algorithm is the SA algorithm (6), applied to estimate that solves (36): For initialization , define the sequence of estimates recursively:
(37a)  
(37b) 
The choice for the sequence of eligibility vectors in (37a) is inspired by the TD($\lambda$) algorithm [32, 2].
Matrix-gain Q-learning algorithms are also popular. For a sequence of matrices , the matrix-gain Q(0) algorithm is described as follows: For initialization , the sequence of estimates is defined recursively:
(38a)  
(38b) 
A common choice is
(39) 
A popular example will follow shortly.
The success of these algorithms has been demonstrated in a few restricted settings, such as optimal stopping [33, 34, 35], deterministic optimal control [36], and the tabular setting discussed next.
3.1 Tabular Q-learning
The basic Q-learning algorithm of Watkins [37, 38] (also known as “tabular” Q-learning) is a particular instance of the Galerkin approach (37). The basis functions are taken to be indicator functions:
(40) 
where is an enumeration of all stateinput pairs, with . The goal of this approach is to exactly compute the function . Substituting with defined in (40), the objective (36) can be rewritten as follows: Find such that, for each ,
(41)  
(42) 
where the expectation in (41) is in steady state, and in (42) denotes the invariant distribution of the Markov chain . The conditional expectation in (42) is
Consequently, (42) can be rewritten as
(43) 
If for each , then the function that solves (43) is identical to the optimal Qfunction in (31).
There are three flavors of Watkins’ Q-learning that are popular in the literature. We discuss each of them below.
Asynchronous Q-learning: The SA algorithm applied to solve (41) coincides with the most basic version of Watkins’ Q-learning algorithm: For initialization , define the sequence of estimates recursively:
(44) 
where denotes the nonnegative stepsize sequence.
Algorithm (44) coincides with the Q() algorithm (37), with defined in (40). Based on this choice of basis functions, a single entry of is updated at each iteration, corresponding to the stateinput pair observed (hence the term “asynchronous”). Observing that is identified with the estimate , a more familiar form of (44) is:
(45) 
and if .
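The familiar single-entry form of the asynchronous update can be sketched as follows. The code assumes the usual cost-minimizing form of Watkins’ update, in which only the entry for the observed state-input pair changes: $Q(x,u) \leftarrow Q(x,u) + \alpha\,[c(x,u) + \beta \min_{u'} Q(x',u') - Q(x,u)]$. The function name `watkins_update` and the numerical values are hypothetical.

```python
def watkins_update(Q, x, u, cost, x_next, alpha, beta):
    """One asynchronous Q-learning step; only the entry Q[x][u] is modified.

    Q is a list of rows indexed by state, each row indexed by action.
    """
    # Temporal difference for the observed pair (x, u) and next state x_next.
    td = cost + beta * min(Q[x_next]) - Q[x][u]
    Q[x][u] += alpha * td
    return Q


Q = [[0.0, 0.0], [1.0, 2.0]]
watkins_update(Q, x=0, u=1, cost=1.5, x_next=1, alpha=0.5, beta=0.5)
# td = 1.5 + 0.5 * min(1.0, 2.0) - 0.0 = 2.0, so Q[0][1] becomes 1.0
```

All other entries of the table are left unchanged, which is the sense in which the algorithm is asynchronous.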
With , the ODE approximation of (44) takes the form (the reader is referred to [3] for details):
(46) 
in which as defined below (31). We recall in Section 3.2 conditions under which this ODE is stable, and explain why we cannot expect a finite asymptotic covariance in typical settings.
A second and perhaps more popular “Q-learning flavor” is defined using a particular “state-action dependent” step-size [7, 22, 25]. For each , denote if the pair has not been visited up until time . Otherwise,
(47) 
At stage of the algorithm, once and are observed, then a single entry of the Q-function is updated as in (45):
(48) 
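A state-action dependent step-size can be sketched by keeping a visit counter for each pair. The code below assumes that the step-size in (47) is the reciprocal of the number of visits to the observed pair, which is the usual form of this rule, and that the entry update is the standard cost-minimizing one. The name `asynchronous_q_learning` and the short trajectory are hypothetical.

```python
def asynchronous_q_learning(trajectory, n_x, n_u, beta):
    """Tabular Q-learning with step-size 1 / (visits to the observed pair).

    trajectory is an iterable of tuples (x, u, cost, x_next).
    Returns the Q table and the table of visit counts.
    """
    Q = [[0.0] * n_u for _ in range(n_x)]
    visits = [[0] * n_u for _ in range(n_x)]
    for x, u, cost, x_next in trajectory:
        visits[x][u] += 1
        alpha = 1.0 / visits[x][u]  # assumed form of the step-size rule
        Q[x][u] += alpha * (cost + beta * min(Q[x_next]) - Q[x][u])
    return Q, visits


# Hypothetical path 0 -> 1 -> 0 -> 1 in a two-state, two-action model.
trajectory = [(0, 1, 1.5, 1), (1, 1, 0.5, 0), (0, 1, 1.5, 1)]
Q_async, visit_counts = asynchronous_q_learning(trajectory, n_x=2, n_u=2, beta=0.5)
```

Pairs that are never visited keep their initial value, which matches the convention that their step-size is zero.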
The ODE approximation simplifies when using this stepsize rule:
(49) 
Conditions for a finite asymptotic covariance are also greatly simplified (see Thm. 3.3).
The asynchronous variant of Watkins’ Q-learning algorithm (44) with step-size (47) can be viewed as the Q(0) algorithm defined in (38), with the matrix gain sequence (39), and step-size . On substituting Watkins’ basis defined in (40), we find that this matrix is diagonal:
(50) 
By the Law of Large Numbers, we have
(51) 
where is a diagonal matrix with entries . It is easy to see why the ODE approximation (46) simplifies to (49) with this matrix gain.
Synchronous Q-learning: In this final flavor, each entry of the Q-function approximation is updated in each iteration. It is popular in the literature because the analysis is greatly simplified in this case.
The algorithm assumes access to an “oracle” that provides the next state of the Markov chain, conditioned on any given current state-action pair: let denote a collection of mutually independent random variables taking values in . Assume moreover that for each , the sequence is i.i.d. with common distribution . The synchronous Q-learning algorithm is then obtained as follows: For initialization , define the sequence of estimates recursively:
(52) 
Once again, based on the choice of basis functions (40), and observing that is identified with the estimate , an equivalent form of the update rule (52) is
(53) 
Using the stepsize we obtain the simple ODE approximation (49).
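The synchronous flavor can be sketched with a user-supplied oracle. In the code below, `sample_next(x, u)` is a hypothetical callable that returns a next state for the pair $(x,u)$; every entry of the table is updated in each iteration from the previous iterate. The step-size $\alpha_n = \min(1, g/n)$ with $g = 1/(1-\beta)$ is an assumption for this example, motivated by the discussion of gains of the form $g/n$ in the introduction. A deterministic oracle is used so that the outcome is reproducible.

```python
def synchronous_q_learning(sample_next, cost, beta, n_iter, gain=1.0):
    """Synchronous tabular Q-learning: all entries are updated per iteration.

    sample_next(x, u) plays the role of the oracle (hypothetical interface).
    """
    n_x, n_u = len(cost), len(cost[0])
    Q = [[0.0] * n_u for _ in range(n_x)]
    for n in range(1, n_iter + 1):
        alpha = min(1.0, gain / n)        # assumed step-size, capped at 1
        q_min = [min(row) for row in Q]   # computed from the previous iterate
        Q = [
            [Q[x][u] + alpha * (cost[x][u] + beta * q_min[sample_next(x, u)] - Q[x][u])
             for u in range(n_u)]
            for x in range(n_x)
        ]
    return Q


def switch_oracle(x, u):
    """Deterministic example: action 0 stays in place, action 1 switches state."""
    return x if u == 0 else 1 - x


cost = [[1.0, 1.5], [0.0, 0.5]]
beta = 0.5
Q_est = synchronous_q_learning(switch_oracle, cost, beta, n_iter=2000,
                               gain=1.0 / (1.0 - beta))
```

For this deterministic model the estimates approach $[[1.75, 1.5], [0, 1.25]]$, the solution of the fixed point equation for the same costs and discount factor. With a random oracle the same code applies, but the estimates then fluctuate around the limit.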
3.2 Convergence and Rate of Convergence
Convergence of the tabular Q-learning algorithms requires the following assumptions:
(Q1) The input is defined by a randomized stationary policy of the form (25). The joint process is an irreducible Markov chain. That is, it has a unique invariant pmf satisfying for each .
(Q2) The optimal policy is unique.
Both ODEs (46) and (49) are stable under assumption (Q1) [3], which then (based on the results of [3]) implies that converges to a.s. Obtaining rates of convergence requires an examination of the linearization of the ODEs at their equilibrium.
Linearization is justified under Assumption (Q2), which implies the existence of such that
(54) 
Lemma 3.1.
The proof is contained in Appendix A.
Recall the definition of the linearization matrix [1, 26]:
The crucial takeaways from Lemma 3.1 are the linearization matrices that correspond to the different tabular Q-learning algorithms:
(55a)  
(55b) 
Since is a transition probability matrix of an irreducible Markov chain (see Lemma 2.1), it follows that both matrices are Hurwitz.
We consider next conditions under which the asymptotic covariance for Q-learning is not finite. The noise covariance defined in (24) is diagonal in all three flavors. For the asynchronous Q-learning algorithm (48) with step-size (47), or the synchronous Q-learning algorithm (53), the diagonal elements of are given by
(56)  