in which is given, is a positive scalar gain sequence (also known as learning rate in the RL literature), is a deterministic function, and is a “noise” sequence.
The recursion (1) is an example of stochastic approximation (SA), for which there is a vast research literature. Under standard assumptions, it can be shown that
where . Moreover, it can be shown that the best algorithms achieve the optimal mean-square error (MSE) convergence rate:
It is known that TD- and Q-learning can be written in the form (1) [2, 3]. In these algorithms, represents the sequence of parameter estimates that are used to approximate a value function or Q-function. It is also widely recognized that these algorithms can be slow to converge.
It was first established in our work [1, 4] that the convergence rate of the MSE of Watkins’ Q-learning can be slower than , if the discount factor satisfies , and if the step-size is either of two standard forms (see discussion in Section 3.1). It was also shown that the optimal convergence rate (2) is obtained by using a step-size of the form , where is a scalar proportional to ; this is consistent with conclusions in more recent research [5, 6]. In the earlier work , a sample path upper bound was obtained on the rate of convergence that is roughly consistent with the mean-square rate established for in [1, 4].
Since the publication of , many papers have appeared with proposed improvements to the algorithm. Many of these papers also derive sample complexity bounds, which are essentially bounds on the MSE (2). Ignoring higher order terms, these bounds can be expressed in the following general form [8, 5, 9, 6, 10]:
where is a scalar; is a function of the total number of state-action pairs, the discount factor , and the maximum per-stage cost. Much of the literature has worked towards minimizing through a combination of hard analysis and algorithm design.
It is widely recognized that Q-learning algorithms can be very slow to converge, especially when the discount factor is close to 1. Quoting , a primary reason for slow convergence is “the fact that the Bellman operator propagates information throughout the whole space”, especially when the discount factor is close to 1. We do not dispute these explanations, but in this paper we argue that the challenge presented by discounting is relatively minor. To make this point clear, we must take a step back and rethink fundamentals:
Why do we need to estimate the Q-function?
Denoting to be the optimal Q-function for the state-action pair , the ultimate goal of estimating the Q-function is to obtain from it the corresponding optimal policy:
It is clear from the above definition that adding a constant to will not alter . This is a fortunate fact: it is shown in Section 4 that can be decomposed as
where denotes the average cost under the optimal policy, and is uniformly bounded in , and .
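The invariance of the policy to an additive constant can be checked numerically. The following is a minimal sketch, assuming the cost-minimization convention used in this paper (the policy selects an action with the smallest Q-value). The Q-table values and the helper name `greedy_policy` are hypothetical and chosen only for illustration.

```python
def greedy_policy(Q):
    """For each state (row), return the index of an action with the smallest Q-value."""
    return [min(range(len(row)), key=row.__getitem__) for row in Q]

# Hypothetical Q-table with 3 states and 2 actions (illustrative values only).
Q = [[1.0, 2.5], [0.3, 0.1], [4.0, 4.2]]

# Add the same constant to every entry of the Q-table.
shifted = [[q + 100.0 for q in row] for row in Q]

policy = greedy_policy(Q)
policy_shifted = greedy_policy(shifted)
print(policy, policy_shifted)
```

Only the differences between entries in each row matter for the policy, which is why an estimate of the Q-function that is accurate up to an additive constant is sufficient.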
The reason for the slow performance of Q-learning when the discount factor is close to 1 is the high variance in the indirect estimate of the large constant . We argue that if we ignore constants, we can obtain a sample complexity result of the form
where , and is the spectral gap of the transition matrix for the pair process under the optimal policy. (Footnote 1: The non-zero spectral gap is replaced by a milder assumption in Section 4.3.)
The new relative Q-learning algorithm proposed here is designed to achieve the upper bound (4). Unfortunately, we have not yet obtained this explicit finite- bound. We have instead obtained a formula for the asymptotic covariance that corresponds to each of the algorithms considered in this paper (see (17)). The close relationship between the asymptotic covariance and sample complexity bounds is discussed in Section 1.2, based on the theoretical background in Section 1.1.
1.1 Stochastic Approximation & Reinforcement Learning
Consider a parameterized family of -valued functions that can be expressed as an expectation,
with a random vector, , and the expectation is with respect to the distribution of the random vector . It is assumed throughout that there exists a unique vector satisfying . Under this assumption, the goal of SA is to estimate .
The SA algorithm recursively estimates as follows: For initialization , obtain the sequence of estimates :
where has the same distribution as for each (or its distribution converges to that of as ), and is a non-negative scalar step-size sequence. We assume for some scalar , and special cases in applications to Q-learning are discussed separately in Section 3.
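The SA recursion described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the algorithm studied in the paper: the step-size is taken to be `gain / n`, the function `f(theta, phi) = phi - theta` is a hypothetical choice whose root is the mean of the samples, and the Gaussian sampling distribution is made up for the example.

```python
import random

def stochastic_approximation(f, sample, theta0, n_iter, gain=1.0):
    """Run theta <- theta + (gain / n) * f(theta, Phi_n) for n = 1, ..., n_iter."""
    theta = theta0
    for n in range(1, n_iter + 1):
        alpha = gain / n  # assumed step-size of the form gain / n
        theta += alpha * f(theta, sample())
    return theta

rng = random.Random(0)
draws = []  # record the samples so the estimate can be compared with their average

def sample():
    phi = rng.gauss(2.0, 1.0)  # hypothetical distribution with mean 2
    draws.append(phi)
    return phi

# Root-finding problem: E[Phi - theta] = 0, so the root is the mean of Phi.
theta_hat = stochastic_approximation(lambda th, phi: phi - th, sample, 0.0, 5000)
sample_mean = sum(draws) / len(draws)
print(theta_hat, sample_mean)
```

With `gain = 1` the recursion reproduces the running sample average, so in this special case the estimate coincides with the sample mean up to rounding.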
Asymptotic statistical theory for SA is extremely rich. Large Deviations or Central Limit Theorem (CLT) limits hold under very general assumptions for both SA and related Monte-Carlo techniques [11, 12, 13, 14, 15].
The CLT will be a guide to algorithm design in this paper. For a typical SA algorithm, this takes the following form: denote the error sequence by
Under general conditions, the scaled sequence converges in distribution to a Gaussian . Typically, the covariance of this scaled sequence is also convergent:
The limit is known as the asymptotic covariance. Provided is finite, this implies (2), which is the fastest possible rate [11, 12, 14, 16, 17]. For Q-learning, this also implies a bound of the form (3), but for “large enough”.
An asymptotic bound such as (8) may not be satisfying for RL practitioners, given the success of finite-time performance bounds in prior research. There are however good reasons to apply this asymptotic theory in algorithm design:
The asymptotic covariance has a simple representation as the solution to a Lyapunov equation.
The asymptotic covariance lies beneath the surface in the theory of finite-time error bounds. Here is what can be expected from the theory of large deviations [19, 20], for which the rate function is denoted
The second order Taylor series approximation holds under general conditions:
from which we obtain
where as , and is bounded in , and absolutely bounded by a constant times for small .
The asymptotic theory provides insight into the slow convergence of Watkins’ Q-learning algorithm, and motivates better algorithms such as Zap Q-learning , and the relative Q-learning introduced in Section 4.
1.2 Sample complexity bounds
The inequalities of Hoeffding and Bennett are finite- approximations of (9):
where is a constant and for . For a given , denote
A sample complexity bound then follows easily from (13): for all . Explicit bounds were obtained in [22, 5, 6] for Watkins’ algorithm, and in  for the “speedy” Q-learning algorithm. General theory for SA algorithms is presented in [23, 24, 18].
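As a point of reference, the way a tail bound translates into a sample complexity bound can be sketched for the simplest possible case: the sample mean of i.i.d. random variables taking values in an interval of known width, for which Hoeffding's inequality gives a two-sided tail bound of the form 2 exp(-2 n eps^2 / width^2). This is an illustration of the general mechanism only; it is not a bound for Q-learning, and the function names are hypothetical.

```python
import math

def hoeffding_tail_bound(n, eps, width=1.0):
    """Hoeffding bound on P(|sample mean - mean| >= eps) for n i.i.d. samples
    taking values in an interval of length `width`."""
    return 2.0 * math.exp(-2.0 * n * eps ** 2 / width ** 2)

def hoeffding_sample_size(eps, delta, width=1.0):
    """Smallest n for which the Hoeffding tail bound is at most delta."""
    return math.ceil(width ** 2 * math.log(2.0 / delta) / (2.0 * eps ** 2))

n_needed = hoeffding_sample_size(eps=0.1, delta=0.05)
print(n_needed, hoeffding_tail_bound(n_needed, 0.1))
```

In this special case the exponent is quadratic in the tolerance, which is the ideal form discussed in this section.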
Observe that whenever both the limit (9) and the bound (13) are valid, the rate function must dominate: . To maximize this upper bound we must minimize the asymptotic covariance (recall (10), and remember we are typically interested in small ).
The value of (14) depends on the size of the constants. Ideally, the function is quadratic as a function of . Theorem 6 of  asserts that this ideal is not in general possible for Q-learning: an example is given for which the best bound requires when the discount factor satisfies .
We conjecture that the sample-path complexity bound (14) with quadratic is possible in the setting of , provided a sufficiently large scalar gain is introduced on the right hand side of the update equation (1). This conjecture is rooted in the large deviations approximation (11), which requires a finite asymptotic covariance. In the very recent preprint , the finite- bound (14) with quadratic was obtained for Watkins’ algorithm in a special synchronous setting, subject to a specific scaling of the step-size: . This result is consistent with our conjecture: it was shown in  that the asymptotic covariance is finite for the equivalent step-size sequence, (see Thm. 3.3 for details).
1.3 Explicit Mean Square Error bounds for Linear SA
Here, we present a special case of the main result of , which we recall later in applications to Q-learning.
in which . The difference has zero mean for any (deterministic) when is distributed according to (recall (5)). Though the results of  extend to Markovian noise, for the purposes of this paper, we assume here that is a martingale difference sequence:
The sequence is a martingale difference sequence. Moreover, for some and any initial condition ,
, for some scalar , and all .
Our primary interest regards the rate of convergence of the error sequence , measured by the error covariance , and . We say that tends to zero at rate (with ) if for each ,
It is known that the maximal value is , and we will show that when this optimal rate is achieved, there is typically an associated limiting matrix known as the asymptotic covariance:
where, is a matrix, and is . Let and denote the respective means:
We assume that the matrix is Hurwitz, a necessary condition for convergence of (18).
The matrix is Hurwitz. Consequently, is invertible, and .
in which is the noise sequence:
with . The parameter error sequence also evolves as a simple linear recursion:
The asymptotic covariance (17) exists under special conditions, and under these conditions it satisfies the Lyapunov equation
where the “noise covariance matrix” is defined to be
Suppose (A1) – (A3) hold. Then the following hold for the linear recursion (22), for each initial condition :
Suppose there is an eigenvalue of that satisfies . Let
denote a corresponding left eigenvector, and suppose that. Then, converges to at a rate . Consequently, converges to zero at rate no faster than .
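The dichotomy in convergence rates can be illustrated in the scalar case, where the mean-square error obeys an exact deterministic recursion. The sketch below makes several simplifying assumptions that are not part of the result above: the recursion is scalar with a deterministic coefficient `a` in place of the matrix, the step-size is 1/n, and the noise is zero-mean with variance `sigma2` and uncorrelated with the current error. Under these assumptions the scalar Lyapunov equation has the solution sigma2 / (-2a - 1) when a < -1/2.

```python
def mse_after(a, sigma2, v0, n_iter):
    """Exact mean-square error of e <- e + (1/n) * (a * e + noise) after n_iter steps,
    for zero-mean noise with variance sigma2 that is uncorrelated with e."""
    v = v0
    for n in range(1, n_iter + 1):
        alpha = 1.0 / n
        v = (1.0 + alpha * a) ** 2 * v + alpha ** 2 * sigma2
    return v

N = 100_000

# a = -1 < -1/2: the scaled error N * MSE settles at sigma2 / (-2a - 1) = 1.
scaled_fast = N * mse_after(a=-1.0, sigma2=1.0, v0=1.0, n_iter=N)

# a = -0.25 > -1/2: the scaled error N * MSE keeps growing with N.
scaled_slow = N * mse_after(a=-0.25, sigma2=1.0, v0=1.0, n_iter=N)

print(scaled_fast, scaled_slow)
```

In the second case the mean-square error decays roughly like n raised to the power 2a, which is slower than 1/n.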
Readers should skip to Section 4 if they have either read , or have a good understanding of the connections between Stochastic Approximation and Q-learning. Though most of the contents of Sections 2 and 3 are essentially known, Section 3 contains new interpretations on the convergence rate of Q-learning. The tutorial sections of this paper are taken from .
2 Markov Decision Processes Formulation
Consider a Markov Decision Process (MDP) model with state space , action space , cost function , and discount factor . It is assumed throughout this section that the state and action spaces are finite: denote and . In the following, the terms ‘action’, ‘control’, and ‘input’ are used interchangeably.
Along with the state-action process is an i.i.d. sequence used to model a randomized policy. We assume without loss of generality that each
is real-valued, with uniform distribution on the interval [0, 1]. An input sequence is called non-anticipative if
where is a sequence of functions. The input sequence is admissible if it is non-anticipative, and if it is feasible in the sense that remains in the state space for each .
Under the assumption that the state and action spaces are finite, it follows that there are a finite number of deterministic stationary policies , where each , and
. A randomized stationary policy is defined by a probability mass function (pmf) on the integers such that
with for each and . It is assumed that is a fixed function of for each , so that this input sequence is non-anticipative.
It is convenient to use the following operator-theoretic notation. The controlled transition matrix acts on functions via
where the second equality holds for any non-anticipative input sequence . For any deterministic stationary policy , let denote the substitution operator, defined for any function by
If the policy is randomized, of the form (25), we then define
With viewed as a single matrix with rows and columns, and viewed as a matrix with rows and columns, the following interpretations hold:
Suppose that is defined using a stationary policy (possibly randomized). Then, both and the pair process are Markovian, and
is the transition matrix for .
is the transition matrix for .
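The matrix interpretation above can be made concrete with a small example. The sketch below assumes a hypothetical MDP with two states and two actions. The controlled transition matrix `P` has one row per state-action pair and one column per next state, and the matrix `S` built from a randomized policy `phi` has one row per state and one column per state-action pair. The two products are then square matrices of the dimensions expected for the pair process and the state process respectively. All numerical values are made up.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

n_x, n_u = 2, 2

# Rows indexed by pairs (x, u) in the order (0,0), (0,1), (1,0), (1,1); columns by next state.
P = [[0.9, 0.1],
     [0.2, 0.8],
     [0.5, 0.5],
     [0.0, 1.0]]

# Hypothetical randomized stationary policy: phi[x][u] = probability of action u in state x.
phi = [[0.7, 0.3],
       [0.4, 0.6]]

# S[x][(x', u)] = phi[x][u] if x' == x, and 0 otherwise.
S = [[phi[x][u] if xp == x else 0.0 for xp in range(n_x) for u in range(n_u)]
     for x in range(n_x)]

T_pair = matmul(P, S)   # 4 x 4: candidate transition matrix for the pair process
T_state = matmul(S, P)  # 2 x 2: candidate transition matrix for the state process

print(T_pair)
print(T_state)
```

Both products have non-negative entries and rows that sum to one, as required of a transition matrix.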
2.1 Q-function and the Bellman Equation
For any (possibly randomized) stationary policy , we consider two value functions
which are related via
The function in (27a) is the value function that corresponds to the policy (with the corresponding transition probability matrix ), and cost function , that appears in TD-learning algorithms [27, 2]. The function is the fixed-policy Q-function considered in the SARSA algorithm [28, 29, 30].
The minimal (optimal) value function is denoted
It is known that this is the unique solution to the following Bellman equation:
Any minimizer defines a deterministic stationary policy that is optimal over all input sequences :
The Bellman equation (29) implies a similar fixed point equation for the Q-function:
in which for any function .
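When the model is known, the fixed point equation for the Q-function can be solved by successive approximation. The sketch below assumes the standard form of the equation for a discounted-cost MDP: the Q-value of a pair equals the one-stage cost plus the discounted expected minimum of the Q-function at the next state. The small MDP, the function names, and the tolerance are hypothetical.

```python
def bellman_q(Q, cost, P, gamma):
    """Apply the Bellman operator: cost + gamma * E[min over actions of Q at next state]."""
    n_x, n_u = len(cost), len(cost[0])
    q_min = [min(row) for row in Q]
    return [[cost[x][u] + gamma * sum(P[x][u][y] * q_min[y] for y in range(n_x))
             for u in range(n_u)] for x in range(n_x)]

def solve_q(cost, P, gamma, tol=1e-10, max_iter=100_000):
    """Iterate the Bellman operator from zero until successive iterates differ by < tol."""
    Q = [[0.0] * len(cost[0]) for _ in cost]
    for _ in range(max_iter):
        Q_new = bellman_q(Q, cost, P, gamma)
        gap = max(abs(a - b) for r, s in zip(Q, Q_new) for a, b in zip(r, s))
        Q = Q_new
        if gap < tol:
            break
    return Q

# Hypothetical MDP: P[x][u] is the distribution of the next state given (x, u).
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.0, 1.0]]]
cost = [[1.0, 2.0], [0.0, 3.0]]
gamma = 0.9

Q_star = solve_q(cost, P, gamma)
TQ = bellman_q(Q_star, cost, P, gamma)
residual = max(abs(a - b) for r, s in zip(Q_star, TQ) for a, b in zip(r, s))
print(Q_star, residual)
```

Q-learning targets the same fixed point, but uses observed transitions in place of the expectation with respect to the controlled transition matrix.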
For any function , let denote an associated policy that satisfies
It is assumed to be specified uniquely as follows:
Using the above notations, the fixed point equation (31) can be rewritten as
In general, there may be many optimal policies, so we remove ambiguity by denoting
The goal in Q-learning is to approximately solve the fixed point equation (31), without assuming knowledge of the controlled transition matrix. We restrict the discussion to the case of linear parameterization for the Q-function: , where denotes the parameter vector, and denotes the vector of basis functions.
A Galerkin approach to approximating is formulated as follows: Obtain a non-anticipative input sequence (using a randomized stationary policy ), and a -dimensional stationary stochastic process that is adapted to . The Galerkin relaxation of the fixed point equation (31) is the root finding problem:
, and the expectation is with respect to the steady state distribution of the Markov chain. This is clearly a special case of the general root-finding problem that is the focus of SA algorithms.
Matrix gain Q-learning algorithms are also popular. For a sequence of matrices , the matrix-gain Q(0) algorithm is described as follows: For initialization , the sequence of estimates are defined recursively:
A common choice is
A popular example will follow shortly.
3.1 Tabular Q-learning
The basic Q-learning algorithm of Watkins [37, 38] (also known as “tabular” Q-learning) is a particular instance of the Galerkin approach (37). The basis functions are taken to be indicator functions:
where is an enumeration of all state-input pairs, with . The goal of this approach is to exactly compute the function . Substituting with defined in (40), the objective (36) can be rewritten as follows: Find such that, for each ,
Consequently, (42) can be rewritten as
There are three flavors of Watkins’ Q-learning that are popular in the literature. We discuss each of them below.
Asynchronous Q-learning: The SA algorithm applied to solve (41) coincides with the most basic version of Watkins’ Q-learning algorithm: For initialization , define the sequence of estimates recursively:
where denotes the non-negative step-size sequence.
Algorithm (44) coincides with the Q(0) algorithm (37), with defined in (40). Based on this choice of basis functions, a single entry of is updated at each iteration, corresponding to the state-input pair observed (hence the term “asynchronous”). Observing that is identified with the estimate , a more familiar form of (44) is:
and if .
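The familiar form of the asynchronous update can be sketched as follows. This is a minimal illustration assuming the usual cost-minimization form of Watkins' update, in which only the entry for the observed state-input pair changes. The function name, the table layout (`Q[x][u]`), and the numerical values are hypothetical.

```python
def watkins_update(Q, x, u, cost, x_next, alpha, gamma):
    """Update the single entry Q[x][u] in place and return the temporal difference."""
    td = cost + gamma * min(Q[x_next]) - Q[x][u]
    Q[x][u] += alpha * td
    return td

# Hypothetical table with 2 states and 2 actions.
Q = [[0.0, 0.0], [1.0, 2.0]]

# Observe pair (x, u) = (0, 1), cost 1, and next state 1.
td = watkins_update(Q, x=0, u=1, cost=1.0, x_next=1, alpha=0.5, gamma=0.9)
print(td, Q)
```

All entries other than the one for the observed pair are left unchanged, which is the sense in which the algorithm is asynchronous.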
A second and perhaps more popular “Q-learning flavor” is defined using a particular “state-action dependent” step-size [7, 22, 25]. For each , denote if the pair has not been visited up until time . Otherwise,
At stage of the algorithm, once and are observed, then a single entry of the Q-function is updated as in (45):
The ODE approximation simplifies when using this step-size rule:
Conditions for a finite asymptotic covariance are also greatly simplified (see Thm. 3.3).
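A state-action dependent step-size can be sketched by keeping a visit counter for each pair. The sketch assumes the step-size for the observed pair is the reciprocal of the number of visits to that pair so far (counting the current visit), which is a common choice. The function name and data layout are hypothetical.

```python
def update_with_visit_count(Q, counts, x, u, cost, x_next, gamma):
    """Watkins-style update of Q[x][u] with step-size 1 / (number of visits to (x, u))."""
    counts[x][u] += 1
    alpha = 1.0 / counts[x][u]
    Q[x][u] += alpha * (cost + gamma * min(Q[x_next]) - Q[x][u])
    return alpha

Q = [[0.0, 0.0], [0.0, 0.0]]
counts = [[0, 0], [0, 0]]

# With gamma = 0 the updated entry is the running average of the observed costs.
a1 = update_with_visit_count(Q, counts, x=0, u=0, cost=2.0, x_next=1, gamma=0.0)
a2 = update_with_visit_count(Q, counts, x=0, u=0, cost=4.0, x_next=1, gamma=0.0)
print(a1, a2, Q[0][0])
```

With this rule each entry is averaged over its own visits rather than over the total number of iterations, so rarely visited pairs are not penalized by a step-size that has already become small.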
The asynchronous variant of Watkins’ Q-learning algorithm (44) with step-size (47) can be viewed as the matrix-gain Q(0) algorithm defined in (38), with the matrix gain sequence (39), and step-size . On substituting Watkins’ basis defined in (40), we find that this matrix is diagonal:
By the Law of Large Numbers, we have
Synchronous Q-learning: In this final flavor, each entry of the Q-function approximation is updated in each iteration. It is popular in the literature because the analysis is greatly simplified in this case.
The algorithm assumes access to an “oracle” that provides the next state of the Markov chain, conditioned on any given current state-action pair: let
denote a collection of mutually independent random variables taking values in. Assume moreover that for each , the sequence is i.i.d. with common distribution . The synchronous Q-learning algorithm is then obtained as follows: For initialization , define the sequence of estimates recursively:
Using the step-size we obtain the simple ODE approximation (49).
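One iteration of the synchronous algorithm can be sketched as follows. The sketch assumes the oracle is implemented by sampling a next state independently for every state-action pair from a known distribution `P[x][u]`, and that every entry is updated from the same current table. The function name, the data layout, and the example MDP (chosen with deterministic transitions so the output is reproducible) are hypothetical.

```python
import random

def synchronous_step(Q, cost, P, gamma, alpha, rng):
    """Update every entry of Q once, drawing one oracle sample per state-action pair."""
    n_x, n_u = len(cost), len(cost[0])
    q_min = [min(row) for row in Q]  # computed from the current table before any update
    new_Q = [row[:] for row in Q]
    for x in range(n_x):
        for u in range(n_u):
            x_next = rng.choices(range(n_x), weights=P[x][u])[0]  # oracle sample
            new_Q[x][u] = Q[x][u] + alpha * (cost[x][u] + gamma * q_min[x_next] - Q[x][u])
    return new_Q

# Deterministic transitions, so the sampled next states are known in advance.
P = [[[0.0, 1.0], [1.0, 0.0]],
     [[1.0, 0.0], [0.0, 1.0]]]
cost = [[1.0, 2.0], [0.0, 3.0]]
rng = random.Random(0)

Q0 = [[0.0, 0.0], [0.0, 0.0]]
Q1 = synchronous_step(Q0, cost, P, gamma=0.9, alpha=1.0, rng=rng)
Q2 = synchronous_step(Q1, cost, P, gamma=0.9, alpha=0.5, rng=rng)
print(Q1, Q2)
```

Because all entries are updated at every iteration, the number of updates per entry equals the iteration count, which is what simplifies the analysis.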
3.2 Convergence and Rate of Convergence
Convergence of the tabular Q-learning algorithms requires the following assumptions:
(Q1) The input is defined by a randomized stationary policy of the form (25). The joint process is an irreducible Markov chain. That is, it has a unique invariant pmf satisfying for each .
(Q2) The optimal policy is unique.
Both ODEs (46) and (49) are stable under assumption (Q1), which then (based on the results of ) implies that converges to a.s. Obtaining rates of convergence requires an examination of the linearization of the ODEs at their equilibrium.
Linearization is justified under Assumption (Q2), which implies the existence of such that
The proof is contained in Appendix A.
The crucial take-aways from Lemma 3.1 are the linearization matrices that correspond to the different tabular Q-learning algorithms:
Since is a transition probability matrix of an irreducible Markov chain (see Lemma 2.1), it follows that both matrices are Hurwitz.
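The Hurwitz property can be verified directly in the simplest case. The short derivation below is a sketch under an assumption that should be checked against the lemma: the linearization matrix is taken to be of the form $A = -(I - \gamma T)$, where $T$ is a row-stochastic matrix and $0 < \gamma < 1$. The symbols $A$, $T$, $\lambda$ and $v$ are introduced here for illustration only.

```latex
% Assumption: A = -(I - \gamma T), with T row-stochastic and 0 < \gamma < 1.
\begin{align*}
T v = \lambda v
  &\;\Longrightarrow\; A v = -(1 - \gamma\lambda)\, v , \\
|\lambda| \le 1
  &\;\Longrightarrow\; \operatorname{Re}\bigl(-(1 - \gamma\lambda)\bigr) \le -(1 - \gamma) < 0 , \\
\lambda = 1
  &\;\Longrightarrow\; -(1 - \gamma) \text{ is an eigenvalue of } A,
  \quad\text{and}\quad -(1 - \gamma) > -\tfrac{1}{2} \iff \gamma > \tfrac{1}{2} .
\end{align*}
```

Every eigenvalue of such a matrix has strictly negative real part, so it is Hurwitz. Since a row-stochastic matrix always has the eigenvalue 1, the matrix also has the eigenvalue $-(1-\gamma)$, which lies to the right of $-1/2$ exactly when $\gamma > 1/2$. In standard SA theory with step-size $1/n$, an eigenvalue with real part larger than $-1/2$ is the usual obstruction to a finite asymptotic covariance.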
We consider next conditions under which the asymptotic covariance for Q-learning is not finite. The noise covariance defined in (24) is diagonal in all three flavors. For the asynchronous Q-learning algorithm (48) with step-size (47), or the synchronous Q-learning algorithm (53), the diagonal elements of are given by