I Introduction
Q-learning, introduced originally for discounted cost Markov decision processes in [6], is a data-driven reinforcement learning algorithm for learning the ‘Q-factor’ function arising from the dynamic programming equation for the infinite horizon discounted cost problem. It can be viewed as a stochastic approximation counterpart of classical value iteration for computing the value function arising as the solution of the corresponding dynamic programming equation. Going over from the value function to the so-called Q-factors facilitates an interchange of the conditional expectation and the nonlinearity (the minimization, to be precise) in the recursion, making it amenable to stochastic approximation. These ideas, however, do not extend automatically to the average cost problem, which is harder to analyze even when the model (i.e., the controlled transition probabilities) is readily available. The reason for this is the noncontractive nature of the associated Bellman operator. This extension was achieved in [1] in two different ways. The first, called RVI Q-learning, is a stochastic approximation counterpart of the ‘relative value iteration’ (or RVI) algorithm for average cost [3] and is close in spirit to the original. There is, however, another algorithm, dubbed SSP Q-learning, based on an alternative scheme due to Bertsekas [2], which does involve a contraction under a weighted max-norm. Motivated by a recent paper on concentration for stochastic approximation [4], we present here a similar concentration bound for SSP Q-learning, exploiting its explicitly contractive nature. This contraction property is missing in RVI Q-learning, leading to nontrivial technical issues in providing finite time guarantees for it (see, e.g., [7]). We also provide an empirical comparison between the two with suggestive outcomes.

Section II builds up the background and Section III states the key assumptions and the main result. Its proof follows in Section IV. Section V describes the numerical experiments.
II Background
II-A Preliminaries
We consider a controlled Markov chain $\{X_n\}$ on a finite state space $S$ with a finite action space $A$ and transition probabilities $p(j|i,u) :=$ the probability of transition from $i$ to $j$ under action $u$, for $i, j \in S$, $u \in A$. Associated with this transition is a “running cost” $c(i,u)$ and the aim is to choose actions $\{U_n\}$ non-anticipatively (i.e., conditionally independent of the future state trajectory given past states and actions) so as to minimize the “average cost”
$$\limsup_{N \uparrow \infty} \frac{1}{N} \sum_{n=0}^{N-1} E\left[c(X_n, U_n)\right]. \tag{1}$$
We shall be interested in “stationary policies” wherein $U_n = v(X_n)$ for a map $v : S \to A$. It is known that an optimal stationary policy exists under the following “unichain” condition which we assume throughout: under any stationary policy the chain has a single communicating class containing a common state (say, $i_0$). The dynamic programming equation for the above problem is [3]
$$V(i) = \min_u \left[ c(i,u) - \beta + \sum_j p(j|i,u) V(j) \right], \quad i \in S. \tag{2}$$
The unknowns are $(V(\cdot), \beta)$, where $\beta$ is uniquely characterized as the optimal average cost. $V(\cdot)$ is unique only up to an additive constant. The associated “Q-factor” is
$$Q(i,u) = c(i,u) - \beta + \sum_j p(j|i,u) \min_v Q(j,v), \quad i \in S,\ u \in A. \tag{3}$$
The aim is to compute these Q-factors even when we do not know the transition probabilities, but have access to a black box which can generate random variables according to the above transition probabilities.
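To fix ideas, here is a minimal Python sketch of such a black box for a hypothetical two-state, two-action MDP (all transition probabilities and costs below are made up for illustration), together with a Monte Carlo estimate of the average cost (1) under a stationary policy:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP used purely for illustration.
# P[u][i][j] = probability of moving from state i to j under action u;
# c[i][u] = running cost at state i under action u.  All numbers made up.
rng = np.random.default_rng(0)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # action 0
              [[0.5, 0.5], [0.7, 0.3]]])  # action 1
c = np.array([[1.0, 2.0],
              [0.5, 3.0]])

def step(i, u):
    """Black-box simulator: sample the next state given (state, action)."""
    return rng.choice(2, p=P[u][i])

def empirical_average_cost(policy, n_steps=200_000):
    """Monte Carlo estimate of the average cost (1) under a stationary policy."""
    i, total = 0, 0.0
    for _ in range(n_steps):
        u = policy[i]
        total += c[i][u]
        i = step(i, u)
    return total / n_steps
```

Under the unichain condition the estimate converges to the stationary average cost of the policy; e.g., for the policy that always picks action 0 in this toy model, the stationary distribution is $(2/3, 1/3)$ and the average cost is $\tfrac{2}{3}(1.0) + \tfrac{1}{3}(0.5) = 5/6$.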
II-B SSP Q-learning
Recall the stochastic shortest path problem. Let $\tau := \min\{n > 0 : X_n = i_0\}$, with $\tau := \infty$ if $X_n \neq i_0$ for all $n > 0$. The objective is to minimize
$$E\left[ \sum_{n=0}^{\tau - 1} c(X_n, U_n) + h(X_\tau) \right],$$
where $h(\cdot)$ is the terminal cost. Under our assumptions, $\tau < \infty$ a.s., in fact $E[\tau] < \infty$. The dynamic programming equation to solve this problem is given by
$$V(i) = \min_u \left[ c(i,u) + p(i_0|i,u)h(i_0) + \sum_{j \neq i_0} p(j|i,u) V(j) \right], \quad i \in S.$$
Coming back to the average cost problem, SSP Q-learning is based on the observation that the average cost under any stationary policy is simply the ratio of the expected total cost and the expected time between two successive visits to the reference state $i_0$. This connection was exploited by [2] to convert the average cost problem into a stochastic shortest path (SSP) problem. Consider a family of SSP problems parameterized by $\lambda \in \mathbb{R}$, with the cost given by $c(i,u) - \lambda$ for $i, u$ as above and some scalar $\lambda$. Then the dynamic programming equation for the above SSP problem is
$$V_\lambda(i) = \min_u \left[ c(i,u) - \lambda + \sum_{j \neq i_0} p(j|i,u) V_\lambda(j) \right], \quad i \in S, \tag{4a}$$
$$Q_\lambda(i,u) = c(i,u) - \lambda + \sum_{j \neq i_0} p(j|i,u) \min_v Q_\lambda(j,v), \quad i \in S,\ u \in A. \tag{4b}$$
For each fixed policy, the cost is linear in $\lambda$ with negative slope. Thus $V_\lambda(\cdot)$, being the lower envelope thereof, is piecewise linear with finitely many linear pieces and concave decreasing in $\lambda$ in each component. When we replace $\lambda$ by $\beta$ and force $V_\beta(i_0) = 0$, we recover (2). This suggests the coupled iterations
$$Q_{n+1}(i,u) = c(i,u) - \lambda_n + \sum_{j \neq i_0} p(j|i,u) \min_v Q_n(j,v), \tag{5a}$$
$$\lambda_{n+1} = \lambda_n + a(n) \min_v Q_n(i_0, v), \tag{5b}$$
where $\{a(n)\}$ are positive stepsizes.
The SSP Q-learning scheme for the above problem is [1]
$$Q_{n+1}(i,u) = Q_n(i,u) + a(n)\, I\{X_n = i,\, U_n = u\} \left[ c(i,u) - \lambda_n + \min_v Q_n(X_{n+1}, v)\, I\{X_{n+1} \neq i_0\} - Q_n(i,u) \right], \tag{6a}$$
$$\lambda_{n+1} = \Gamma\left( \lambda_n + a(n) \min_v Q_n(i_0, v) \right). \tag{6b}$$
Here $\Gamma$ is the projection operator onto an interval $[-C, C]$, with $C$ chosen so that $\beta \in (-C, C)$. Although this assumes some prior knowledge of $\beta$, such a $C$ can be obtained from a bound on the running cost. This also ensures that (14) below holds. We rewrite the above equations as follows
$$Q_{n+1} = Q_n + a(n)\left[ F(Q_n, \lambda_n, Y_n) - Q_n + M_{n+1} \right], \tag{7a}$$
$$\lambda_{n+1} = \Gamma\left( \lambda_n + a(n)\, g(Q_n) \right), \tag{7b}$$
and
$$g(Q) := \min_v Q(i_0, v),$$
where $Y_n$ denotes the underlying Markov noise and $M_{n+1}$ is a martingale difference term.
As observed in [5], the map $F(\cdot, \lambda)$ is a contraction for a fixed $\lambda$ under a certain weighted max-norm
$$\|x\|_w := \max_i \frac{|x(i)|}{w(i)}$$
for an appropriate weight vector $w = (w(1), \ldots, w(d))$, $w(i) > 0$.

III Main Result
We state our main theorem in this section, after setting up the notation and assumptions. The assumptions are specifically geared for the SSP Q-learning applications in Section II-B, as will become apparent.
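Before turning to the abstract iteration, the concrete SSP Q-learning recursion of Section II-B can be sketched in Python. This is only an illustrative implementation on a hypothetical two-state, two-action MDP with uniform exploration; the MDP, the projection interval $[-C, C]$, and the stepsize schedule are all illustrative assumptions, and the update form follows the scheme of [1]:

```python
import numpy as np

# Illustrative SSP Q-learning on a hypothetical 2-state, 2-action MDP.
# P[u][i][j]: transition probabilities; cost[i][u]: running cost (made up).
rng = np.random.default_rng(1)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])
cost = np.array([[1.0, 2.0],
                 [0.5, 3.0]])
i0 = 0        # reference state
C = 10.0      # projection interval [-C, C], assumed wide enough to contain beta

Q = np.zeros((2, 2))
lam = 0.0
x = 0
for n in range(1, 200_001):
    a = n ** -0.7                        # eventually nonincreasing stepsize
    u = int(rng.integers(2))             # uniform exploration
    x_next = int(rng.choice(2, p=P[u][x]))
    # Q update: the next-state value is truncated at the reference state.
    target = cost[x][u] - lam + (0.0 if x_next == i0 else Q[x_next].min())
    Q[x, u] += a * (target - Q[x, u])
    # Projected update of the average-cost estimate.
    lam = float(np.clip(lam + a * Q[i0].min(), -C, C))
    x = x_next
# lam now estimates the optimal average cost; the greedy policy is argmin_u Q.
```

In this toy model the optimal stationary policy picks action 0 in both states, with optimal average cost $5/6$, and the greedy policy with respect to the learned Q-factors can be read off by minimizing over actions.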
Consider the coupled iteration
$$x_{n+1} = x_n + a(n)\left[ F(x_n, y_n, Y_n) - x_n + M_{n+1}(x_n, y_n) \right], \tag{8}$$
$$y_{n+1} = \Gamma\left( y_n + a(n)\, g(x_n, y_n) \right), \tag{9}$$
for $n \geq 0$. Here:

$\{Y_n\}$ is the ‘Markov noise’ taking values in a finite state space $\mathcal{S}$, i.e.,
$$P\left(Y_{n+1} = z \mid Y_m, x_m, y_m,\ m \leq n\right) = p_{x_n, y_n}\left(z \mid Y_n\right) \quad \forall\, n,$$
where for each $(x, y)$, $p_{x,y}(\cdot \mid \cdot)$ is the transition probability of an irreducible Markov chain on $\mathcal{S}$ with unique stationary distribution $\pi_{x,y}$. We assume that the map $(x, y) \mapsto p_{x,y}(\cdot \mid \cdot)$ is Lipschitz, i.e., for some $L_1 > 0$,
$$\left\| p_{x,y} - p_{x',y'} \right\| \leq L_1 \left( \|x - x'\| + |y - y'| \right).$$
By Cramer’s rule, $\pi_{x,y}$ is a rational function of the entries of $p_{x,y}$ with a nonvanishing denominator, so the map $(x, y) \mapsto \pi_{x,y}$ is similarly Lipschitz, i.e., for some $L_2 > 0$,
$$\left\| \pi_{x,y} - \pi_{x',y'} \right\| \leq L_2 \left( \|x - x'\| + |y - y'| \right).$$
See Appendix B of [4] for some bounds on these Lipschitz constants.

$\{M_{n+1}(x, y)\}$ is, for each $(x, y)$, an $\mathbb{R}^d$-valued martingale difference sequence parametrized by $(x, y)$, with respect to the increasing family of $\sigma$-fields $\mathcal{F}_n := \sigma(Y_m, M_m(\cdot, \cdot),\ m \leq n)$, $n \geq 0$. That is,
$$E\left[ M_{n+1}(x, y) \mid \mathcal{F}_n \right] = \theta \quad \forall\, n \geq 0, \tag{10}$$
where $\theta$ is the zero vector. We also assume the componentwise bound: for some $K > 0$,
$$\max_i \left| M_{n+1}(x, y)(i) \right| \leq K\left( 1 + \|x\| + |y| \right) \quad \forall\, n \geq 0. \tag{11}$$
$F$ satisfies
$$\left\| F(x, y, z) - F(x', y, z) \right\|_w \leq \alpha \left\| x - x' \right\|_w \quad \forall\, x, x', y, z, \tag{12}$$
for some $\alpha \in (0, 1)$. By the contraction mapping theorem, this implies that $F(\cdot, y, z)$ has a unique fixed point $x^*(y, z)$ (i.e., $F(x^*(y,z), y, z) = x^*(y,z)$). We assume that $x^*(y, z)$ is independent of $z$, i.e., there exists an $x^*(y)$ such that
$$F(x^*(y), y, z) = x^*(y) \quad \forall\, z. \tag{13}$$
We also assume that the map $(x, y) \mapsto F(x, y, z)$ is Lipschitz (w.l.o.g., uniformly in $z$ and in each component). Let the common Lipschitz constant be $L$, i.e.,
$$\left\| F(x, y, z) - F(x', y', z) \right\| \leq L\left( \|x - x'\| + |y - y'| \right) \quad \forall\, z.$$
We assume that $y \mapsto g(x, y)$ is concave, piecewise linear, and decreasing in $y$. Furthermore, $g$ is assumed to satisfy
$$g(x^*(-C), -C) > 0 > g(x^*(C), C), \tag{14}$$
so that the equation $g(x^*(y), y) = 0$ has a root $y^*$ in the interior of $[-C, C]$.
Moreover, we assume that $x^*(\cdot)$ is Lipschitz with Lipschitz constant $L^*$: for all $y, y'$,
$$\left\| x^*(y) - x^*(y') \right\| \leq L^* |y - y'|. \tag{15}$$
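The content of the contraction assumption (12) and the fixed point in (13) can be illustrated numerically: repeated application of a contraction under a weighted max-norm converges to its unique fixed point. A sketch with a hypothetical affine map (all constants are made up):

```python
import numpy as np

# A hypothetical affine map F(x) = A x + b.  The entries of A are nonnegative
# with row sums at most 0.7, so F is a 0.7-contraction under the (here
# uniformly weighted) max-norm ||x||_w = max_i |x_i| / w_i, and by the
# contraction mapping theorem iterating F converges to its unique fixed point.
A = np.array([[0.5, 0.2],
              [0.1, 0.6]])
b = np.array([1.0, 2.0])
w = np.array([1.0, 1.0])   # weight vector; uniform weights for simplicity

def wnorm(x):
    return np.max(np.abs(x) / w)

F = lambda x: A @ x + b
x = np.zeros(2)
for _ in range(200):
    x = F(x)

x_star = np.linalg.solve(np.eye(2) - A, b)   # exact fixed point of F
assert wnorm(x - x_star) < 1e-10             # the iteration has converged
```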
$\{a(n)\}$ is a sequence of stepsizes satisfying
$$a(n) > 0 \ \ \forall\, n, \qquad \sum_n a(n) = \infty, \tag{16}$$
and is assumed to be eventually nonincreasing, i.e., there exists $n^*$ such that $a(n+1) \leq a(n)$ for all $n \geq n^*$. Since $a(n) \to 0$, there exists $n_1$ such that $a(n) < 1$ for all $n \geq n_1$.^1 We further assume that the stepsizes decay at most polynomially, i.e., $a(n) \geq \underline{c}\, n^{-\underline{q}}$ for all $n$, for some $\underline{c} > 0$ and $\underline{q} \in (0, 1]$. We also assume that there exists a polynomial upper bound, i.e., $a(n) \leq \bar{c}\, n^{-\bar{q}}$ for all $n$, for some $\bar{c} > 0$ and $\bar{q} \in (0, 1]$. Larger values of $\underline{q}$ and $\bar{q}$ and smaller values of $\bar{c}$ improve the main result presented below. The role this assumption plays in our bounds will become clear later. Define $n_0 := \max(n^*, n_1)$, i.e., $\{a(n)\}$ is nonincreasing after $n_0$ and $a(n) < 1$ for $n \geq n_0$. Also, it is assumed that the sequence $\{a(n)/a(n+1)\}$ is bounded, i.e., $\sup_n a(n)/a(n+1) < \infty$.

^1 Observe that we do not require the classical square-summability condition in stochastic approximation, viz., $\sum_n a(n)^2 < \infty$. This is because the contractive nature of our iterates gives us an additional handle on errors by putting less weight on past errors. A similar effect was observed in [4].
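A polynomially decaying schedule illustrates these stepsize conditions. The sketch below (with an arbitrarily chosen exponent) checks monotonicity and shows that for exponent $1/2$ the sum $\sum_n a(n)$ diverges while the square sum is also unbounded, i.e., the classical square-summability condition fails, which the present analysis does not require:

```python
import numpy as np

# Stepsize schedule a(n) = (n + 1)^(-q).  For q in (0, 1] the sum of a(n)
# diverges and the schedule is nonincreasing.  For q = 1/2 the squares
# a(n)^2 = 1/(n + 1) are ALSO not summable: square-summability fails.
q = 0.5
n = np.arange(1_000_000)
a = (n + 1.0) ** (-q)

assert np.all(np.diff(a) <= 0)   # (eventually) nonincreasing -- here globally
partial_sum = a.sum()            # grows like 2 * sqrt(N): unbounded in N
partial_sq_sum = (a ** 2).sum()  # grows like log N: also unbounded for q = 1/2
```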
For , we further define:
Our main result is as follows:
Theorem 1
(a) Let . Then there exist finite positive constants , , and , depending on , such that for , and , the inequality
(17) 
holds with probability exceeding
(18)  
(19) 
(b) There exist finite constants , and an large enough such that for , the inequality
(20) 
holds with probability exceeding
(21)  
(22) 
IV Proof
We begin with a lemma adapted from [4].
Lemma 1
a.s.
Using (14), we have
For , define if and otherwise. Note that, since , for all . Then
Now . Suppose
(23) 
for some . Then,
By induction, (23) holds for all , which completes the proof of Lemma 1.
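The boundedness phenomenon behind Lemma 1 can be illustrated numerically: for a contractive map and bounded noise, the stochastic approximation iterate remains in a bounded set. A scalar sketch under assumed constants (contraction factor, noise bound, and stepsizes are all made up for illustration):

```python
import numpy as np

# Scalar illustration of the boundedness argument in Lemma 1: for an
# alpha-contraction F and noise bounded by K, the iterate
#   x_{n+1} = x_n + a(n) * (F(x_n) - x_n + M_{n+1})
# with a(n) in (0, 1] stays in a bounded set.  All constants are made up.
rng = np.random.default_rng(2)
alpha, K = 0.7, 1.0
F = lambda x: alpha * x + 1.0          # an alpha-contraction on the reals
x, sup_abs = 0.0, 0.0
for n in range(1, 100_001):
    a = n ** -0.6
    M = rng.uniform(-K, K)             # bounded "martingale difference" noise
    x += a * (F(x) - x + M)
    sup_abs = max(sup_abs, abs(x))

# A crude deterministic bound: starting from 0, |x_n| can never exceed
# (1 + K) / (1 - alpha), by the same induction as in the lemma.
assert sup_abs <= (1.0 + K) / (1.0 - alpha)
```

The final iterate also hovers near the fixed point $1/(1 - \alpha)$ of $F$, as the stepsizes decay.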
IV-A Concentration bound for the first iteration
Define and for :
(24)  
(25) 
We use the following theorem, adapted from [4], which gives a concentration inequality for stochastic approximation algorithms with Markov noise.
Theorem 2
Let . Then there exist finite constants , depending on , such that for , and , the inequality
(26) 
holds with probability exceeding
(27)  
(28) 
Since , we have
(29) 
Since the map is piecewise linear and concave decreasing, so is the map . By (III) and (13), we have the following lemma.
Lemma 2
From the definition of , we have
We have suppressed the subscript of , which is irrelevant by virtue of (13). Let denote the standard basis vectors. Then the r.h.s. in the above can be written as
Thus we finally have
which leads us to the claim that
where .
To get a bound on we use the nonexpansive property of the projection operator as follows
where we use the fact that . Combining the above inequalities, we get
(30) 
Thus,
(31) 
where . Since is bounded by Lemma 1, . Iterating (31) for , we get
(32)  
(33) 
where and . The summation in the last term can be bounded as
(34) 
where . Note that for any ,
and hence
This implies that
(35) 
Hence
(36) 
Combining the above,
(37)  
IV-B Concentration bound for the second iteration
The second iteration is given by
(38) 
Let . Subtracting from both sides, we get:
(39) 
Since the map is concave decreasing and piecewise linear, there exists a finite constant such that
Replace by and by . Since :
Thus,