Q-learning, introduced originally for discounted cost Markov decision processes in 
, is a data-driven reinforcement learning algorithm for learning the ‘Q-factor’ function arising from the dynamic programming equation of the infinite horizon discounted cost problem. It can be viewed as a stochastic approximation counterpart of classical value iteration for computing the value function that solves the corresponding dynamic programming equation. Passing from the value function to the so-called Q-factors permits an interchange of the conditional expectation and the nonlinearity (the minimization, to be precise) in the recursion, making it amenable to stochastic approximation. These ideas, however, do not extend automatically to the average cost problem, which is harder to analyze even when the model (i.e., the controlled transition probabilities) is readily available; the reason is the non-contractive nature of the associated Bellman operator. This extension was achieved in in two different ways. The first, called RVI Q-learning, is a stochastic approximation counterpart of the ‘relative value iteration’ (RVI) algorithm for average cost  and is close in spirit to the original. The second, dubbed SSP Q-learning, is based on an alternative scheme due to Bertsekas , which does involve a contraction under a weighted max-norm. Motivated by a recent paper on concentration for stochastic approximation in , we present here a similar concentration bound for SSP Q-learning, exploiting its explicitly contractive nature. Such a contraction is missing in RVI Q-learning, which leads to non-trivial technical issues in providing finite time guarantees for it (see, e.g., ). We also provide an empirical comparison between the two, with suggestive outcomes.
Section II builds up the background, and Section III states the key assumptions and the main result. Its proof follows in Section IV. Section V describes the numerical experiments.
We consider a controlled Markov chain on a finite state space with a finite action space and transition probabilities giving the probability of transition from one state to another under each action. Associated with this transition mechanism is a “running cost”, and the aim is to choose actions non-anticipatively (i.e., conditionally independent of the future state trajectory given past states and actions) so as to minimize the “average cost”
We shall be interested in “stationary policies”, wherein the action is a fixed function of the current state given by a map . It is known that an optimal stationary policy exists under the following “unichain” condition, which we assume throughout: under any stationary policy, the chain has a single communicating class containing a common state (say, ). The dynamic programming equation for the above problem is 
The unknowns are , where is uniquely characterized as the optimal average cost, while is unique only up to an additive constant. The associated “Q-factor” is
The aim is to get these Q-factors even when we do not know the transition probabilities, but have access to a black box which can generate random variables according to the above transition probabilities.
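In the discounted setting this black-box viewpoint yields the familiar tabular scheme. Below is a minimal sketch under assumed specifics: the toy MDP instance, the uniform sampling scheme, and the stepsize schedule are all our illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP, used only to drive the simulator below; the learner
# itself never reads P or `cost` beyond sampled transitions and costs.
n_states, n_actions, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
cost = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

def step(s, a):
    """Black-box simulator: sample the next state from p(. | s, a)."""
    return rng.choice(n_states, p=P[s, a])

Q = np.zeros((n_states, n_actions))
for n in range(1, 50001):
    s = int(rng.integers(n_states))   # uniform exploration, for simplicity
    a = int(rng.integers(n_actions))
    s_next = step(s, a)
    lr = 1.0 / (1.0 + n / 500.0)      # tapering stepsize
    # Stochastic-approximation form of value iteration on Q-factors: the
    # minimization sits inside the sampled target, so no expectation (hence
    # no model) is needed.
    Q[s, a] += lr * (cost[s, a] + gamma * Q[s_next].min() - Q[s, a])
```

At convergence, Q approximates the fixed point of the discounted Bellman operator on Q-factors; the same template, suitably modified, underlies the average cost algorithms below.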
II-B SSP Q-learning
Recall the stochastic shortest path problem. Let with and . The objective is to minimize
where is the terminal cost and . Under our assumptions, a.s., in fact . The dynamic programming equation to solve this problem is given by
Coming back to the average cost problem, SSP Q-learning is based on the observation that the average cost under any stationary policy is simply the ratio of the expected total cost to the expected time between two successive visits to the reference state . This connection was exploited by  to convert the average cost problem into a stochastic shortest path (SSP) problem. Consider a family of SSP problems parameterized by , with the cost given by for as above and some scalar . Then the dynamic programming equation for this SSP problem is
For each fixed policy, the cost is linear in with negative slope. Thus , being the lower envelope thereof, is piecewise linear with finitely many linear pieces and concave decreasing in for each component. When we replace by and force , we recover (2). This suggests the coupled iterations
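The concavity and monotonicity just described are easy to verify numerically. The sketch below builds a small synthetic unichain instance (our own construction, not from the paper), solves the λ-parametrized SSP by value iteration, and tracks one component of the value as λ varies.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic unichain MDP; state s0 plays the role of the reference/terminal state.
nS, nA, s0 = 4, 2, 0
P = rng.dirichlet(np.ones(nS), size=(nS, nA))
cost = rng.uniform(0.0, 1.0, size=(nS, nA))

def ssp_value(lam, iters=4000):
    """Optimal cost vector of the SSP with running cost c(i, u) - lam,
    terminating on arrival at s0 (zero cost-to-go there)."""
    V = np.zeros(nS)
    for _ in range(iters):
        cont = V.copy()
        cont[s0] = 0.0                        # no cost accrues after reaching s0
        V = (cost - lam + P @ cont).min(axis=1)
    return V

lams = np.linspace(0.0, 1.0, 21)
vals = np.array([ssp_value(l)[1] for l in lams])  # one fixed component vs. lambda
```

First differences of `vals` are strictly negative and second differences non-positive, matching the picture of a lower envelope of finitely many decreasing linear functions.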
The SSP Q-learning scheme for the above problem is 
Here is a projection operator onto the interval , with chosen so as to satisfy . Although this assumes some prior knowledge of , such knowledge can be obtained from a bound on . This also ensures that (14) below holds. We rewrite the above equations as follows
As observed in , the map is a contraction for a fixed under a certain weighted max-norm
for an appropriate weight vector .
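Putting the pieces together, the coupled recursion can be rendered schematically as follows. Everything concrete here, the toy MDP, the projection interval, and the relative stepsizes of the two iterates, is an assumption made for illustration; only the shape of the updates mirrors the scheme above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy unichain MDP; s_ref is the reference state of the embedded SSP.
nS, nA, s_ref = 3, 2, 0
P = rng.dirichlet(np.ones(nS), size=(nS, nA))
cost = rng.uniform(0.0, 1.0, size=(nS, nA))

def step(s, a):
    return rng.choice(nS, p=P[s, a])

Q = np.zeros((nS, nA))
lam = 0.0
LAM_MAX = 1.0   # projection interval [0, LAM_MAX]; must contain the optimal average cost
for n in range(1, 100001):
    s, a = int(rng.integers(nS)), int(rng.integers(nA))
    s_next = step(s, a)
    a_n = 1.0 / (1.0 + n / 500.0)   # Q-factor stepsize
    b_n = a_n / 10.0                # slower stepsize for the scalar iterate
    # SSP target: running cost minus lambda plus the cost-to-go, the latter
    # being zero once the reference state is reached.
    togo = 0.0 if s_next == s_ref else Q[s_next].min()
    Q[s, a] += a_n * (cost[s, a] - lam + togo - Q[s, a])
    # lambda is nudged toward the root of min_a Q(s_ref, a) and projected
    # back onto the admissible interval.
    lam = float(np.clip(lam + b_n * Q[s_ref].min(), 0.0, LAM_MAX))
```

The scalar iterate settles near the optimal average cost, at which point the minimal Q-factor at the reference state vanishes.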
III Main Result
We state our main theorem in this section, after setting up the notation and assumptions. The assumptions are specifically geared for the SSP Q-learning applications in Section II-B, as will become apparent.
Consider the coupled iteration
for . Here:
is the ‘Markov noise’ taking values in a finite state space , i.e.,
where for each , is the transition probability of an irreducible Markov chain on with unique stationary distribution . We assume that the map is Lipschitz, i.e., for some ,
By Cramer’s rule, is a rational function of with a non-vanishing denominator, so the map is likewise Lipschitz, i.e., for some ,
See Appendix B,  for some bounds on .
is, for each , an -valued martingale difference sequence parametrized by , with respect to the increasing family of -fields , . That is,
where is the zero vector. We also assume the componentwise bound: for some ,
for some . By the contraction mapping theorem, this implies that has a unique fixed point (i.e., ). We assume that is independent of , i.e., there exists a such that
We also assume that the map is Lipschitz (w.l.o.g., uniformly in and ). Let the common Lipschitz constant be , i.e.,
We assume that is concave piecewise linear and decreasing in . Furthermore, is assumed to satisfy
Moreover, we assume that is Lipschitz with Lipschitz constant : for all
is a sequence of stepsizes satisfying
and is assumed to be eventually non-increasing, i.e., there exists such that . Since , there exists such that for all . (Observe that we do not require the classical square-summability condition of stochastic approximation, viz., . This is because the contractive nature of our iterates gives us an additional handle on the errors by putting less weight on past errors. A similar effect was observed in .) We further assume that , so that for all for some and . We also assume that there exists such that , i.e., for all for some and . Larger values of and , and smaller values of , improve the main result presented below; the role this assumption plays in our bounds will become clear later. Define , i.e., is non-increasing after and . It is also assumed that the sequence , i.e., .
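To make the remark about square-summability concrete, consider stepsizes of the form a_n = (n+1)^{-κ}; the exponent κ = 1/2 below is a purely illustrative choice, not one prescribed above. This family is non-increasing and non-summable, yet its squares sum only logarithmically, so the classical square-summability requirement fails in the limit.

```python
import numpy as np

# Illustrative schedule a_n = (n+1)^(-kappa) with kappa = 1/2 (our choice,
# not one prescribed by the assumptions).
N = 100_000
n = np.arange(N, dtype=float)
a = (n + 1.0) ** -0.5

nonincreasing = bool(np.all(np.diff(a) <= 0))
partial_sum = float(a.sum())        # grows like 2*sqrt(N): the schedule is non-summable
square_sum = float((a ** 2).sum())  # grows like log N: square-summability fails
                                    # as N -> infinity, but only logarithmically
```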
For , we further define:
Our main result is as follows:
(a) Let . Then there exist finite positive constants , , and , depending on , such that for , and , the inequality
holds with probability exceeding
(b) There exist finite constants , and a large enough such that for , the inequality
holds with probability exceeding
We begin with a lemma adapted from .
Using (14), we have
For , define if and otherwise. Note that, since , for all . Then
Now . Suppose
for some . Then,
IV-A Concentration bound for the first iteration
Define and for :
We use the following theorem adapted from , which gives a concentration inequality for stochastic approximation algorithms with Markov noise.
Let . Then there exist finite constants , depending on , such that for , and , the inequality
holds with probability exceeding
Since , we have
From the definition of , we have
We have suppressed the subscript of , which is irrelevant by virtue of (13). Let denote the standard basis vectors. Then the r.h.s. above can be written as
Thus we finally have
which leads us to the claim that
To get a bound on , we use the non-expansiveness of the projection operator as follows
where we use the fact that . Combining the above inequalities, we get
where and . The summation in the last term can be bounded as
where . Note that for any ,
This implies that
Combining the above,
IV-B Concentration bound for the second iteration
The second iteration is given by
Let . Subtracting from both sides, we get:
Since the map is concave, decreasing, and piecewise linear, there exists a finite constant such that
Replace by and by . Since :