
# Concentration bounds for SSP Q-learning for average cost MDPs

We derive a concentration bound for a Q-learning algorithm for average cost Markov decision processes based on an equivalent shortest path problem, and compare it numerically with the alternative scheme based on relative value iteration.

10/03/2022


## I Introduction

Q-learning, introduced originally for discounted cost Markov decision processes in [6], is a data-driven reinforcement learning algorithm for learning the ‘Q-factor’ function arising from the dynamic programming equation for the infinite horizon discounted cost problem. It can be viewed as a stochastic approximation counterpart of the classical value iteration for computing the value function arising as the solution of the corresponding dynamic programming equation. Going over from the value function to the so-called Q-factors facilitates an interchange of the conditional expectation and the nonlinearity (the minimization, to be precise) in the recursion, making it amenable to stochastic approximation. These ideas, however, do not extend automatically to the average cost problem, which is harder to analyze even when the model (i.e., the controlled transition probabilities) is readily available. The reason for this is the non-contractive nature of the associated Bellman operator. This extension was achieved in [1] in two different ways. The first, called RVI Q-learning, is a stochastic approximation counterpart of the ‘relative value iteration’ (or RVI) algorithm for average cost [3] and is close in spirit to the original. The second, dubbed SSP Q-learning, is based on an alternative scheme due to Bertsekas [2], which does involve a contraction under a weighted max-norm. Motivated by a recent paper on concentration for stochastic approximation [4], we present here a similar concentration bound for SSP Q-learning that exploits its explicitly contractive nature. RVI Q-learning lacks this property, which leads to non-trivial technical issues in providing finite time guarantees for it (see, e.g., [7]). We also provide an empirical comparison between the two with suggestive outcomes.

Section II builds up the background and section III states the key assumptions and the main result. Its proof follows in section IV. Section V describes the numerical experiments.

## II Background

### II-A Preliminaries

We consider a controlled Markov chain $\{X_n\}$ on a finite state space $S$ with a finite action space and transition probabilities $p(j|i,u)$, the probability of transition from $i$ to $j$ under action $u$, for $i, j \in S$. Associated with this transition is a “running cost” $k(i,u)$ and the aim is to choose actions $\{Z_n\}$ non-anticipatively (i.e., conditionally independent of the future state trajectory given past states and actions) so as to minimize the “average cost”

$$\limsup_{n\to\infty}\frac{1}{n}\sum_{m=0}^{n-1}E[k(X_m,Z_m)]. \tag{1}$$

We shall be interested in “stationary policies” wherein $Z_n = v(X_n)$ for a map $v$ from $S$ to the action space. It is known that an optimal stationary policy exists under the following “unichain” condition which we assume throughout: under any stationary policy the chain has a single communicating class containing a common state (say, $i_0$). The dynamic programming equation for the above problem is [3]

$$V(i)=\min_u\Big[k(i,u)+\sum_{j\in S}p(j|i,u)V(j)-\beta\Big]. \tag{2}$$

The unknowns are $(V, \beta)$, where $\beta$ is uniquely characterized as the optimal average cost. $V$ is unique only up to an additive constant. The associated “Q-factor” is

$$Q(i,u)=k(i,u)+\sum_{j\in S}p(j|i,u)\min_v Q(j,v)-\beta. \tag{3}$$

The aim is to get these Q-factors even when we do not know the transition probabilities, but have access to a black box which can generate random variables according to the above transition probabilities.
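As a small illustration of how (2) and (3) fit together, the following sketch builds the Q-factors from a candidate pair $(V, \beta)$ and checks that $V(i) = \min_u Q(i,u)$. The two-state, two-action model and all of its numbers are hypothetical and chosen only so that the arithmetic can be followed by hand; the model is assumed known here, unlike in the learning setting.

```python
# Toy 2-state, 2-action MDP (hypothetical numbers, for illustration only).
# P[i][u][j] = p(j | i, u), K[i][u] = k(i, u).
P = [[[0.5, 0.5], [0.9, 0.1]],
     [[0.5, 0.5], [0.8, 0.2]]]
K = [[1.0, 2.0],
     [3.0, 4.0]]

# Candidate solution of (2), normalized so that V(0) = 0.
beta = 2.0
V = [0.0, 2.0]


def q_factors(P, K, V, beta):
    """Right-hand side of (3), using min_v Q(j, v) = V(j)."""
    n_states, n_actions = len(K), len(K[0])
    return [[K[i][u] + sum(P[i][u][j] * V[j] for j in range(n_states)) - beta
             for u in range(n_actions)]
            for i in range(n_states)]


Q = q_factors(P, K, V, beta)
# (2) holds exactly when V(i) = min_u Q(i, u) for every state i.
residual = max(abs(min(Q[i]) - V[i]) for i in range(len(V)))
# Greedy (minimizing) action in each state.
greedy = [min(range(len(Q[i])), key=lambda u: Q[i][u]) for i in range(len(V))]
```

For this toy model the residual vanishes, so $(V, \beta) = ((0, 2), 2)$ solves (2), and the minimizing action is the first one in both states.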

### II-B SSP Q-learning

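Recall the stochastic shortest path problem. Let $S$ be partitioned into a set $S_0$ of non-terminal states and a nonempty set $T$ of terminal states. The objective is to minimize

$$E\Big[\sum_{n=1}^{\tau-1}k(X_n,Z_n)+h(X_\tau)\Big]$$

where $h(\cdot)$ is the terminal cost and $\tau$ is the first hitting time of $T$. Under our assumptions, $\tau < \infty$ a.s., in fact $E[\tau] < \infty$. The dynamic programming equation to solve this problem is given by

$$V(i)=\min_u\Big[k(i,u)+\sum_{j\in S}p(j|i,u)V(j)\Big]\ \ \forall\, i\in S_0, \qquad V(i)=h(i)\ \ \forall\, i\in T.$$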

Coming back to the average cost problem, SSP Q-learning is based on the observation that the average cost under any stationary policy is simply the ratio of the expected total cost and the expected time between two successive visits to the reference state $i_0$. This connection was exploited by [2] to convert the average cost problem into a stochastic shortest path (SSP) problem. Consider a family of SSP problems parameterized by $\lambda$, with terminal state $i_0$ and the cost given by $k(i,u) - \lambda$ for $k$ as above and some scalar $\lambda$. Then the dynamic programming equation for the above SSP problem is

$$V(i)=\min_u\Big[k(i,u)+\sum_{j\in S,\, j\neq i_0}p(j|i,u)V(j)-\lambda\Big], \tag{4a}$$

$$V(i_0)=0. \tag{4b}$$
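The ratio observation behind this conversion can be checked numerically for a fixed stationary policy. The sketch below is only an illustration under stated assumptions: the helper name `average_cost_two_ways`, the two-state chain and its costs are all hypothetical, and both quantities are computed by plain fixed-point iteration rather than by any method from the paper.

```python
def average_cost_two_ways(P, c, i0=0, iters=2000):
    """Return (pi . c, cycle_cost / cycle_time) for a fixed-policy chain P with costs c."""
    n = len(c)
    # (a) Stationary distribution by power iteration: pi <- pi P.
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    avg_stationary = sum(pi[i] * c[i] for i in range(n))

    # (b) Expected cost h(i) and time t(i) accumulated from state i until i0 is next hit:
    #     h(i) = c(i) + sum_{j != i0} P(i, j) h(j), and the same with c replaced by 1.
    h = [0.0] * n
    t = [0.0] * n
    for _ in range(iters):
        h = [c[i] + sum(P[i][j] * h[j] for j in range(n) if j != i0) for i in range(n)]
        t = [1.0 + sum(P[i][j] * t[j] for j in range(n) if j != i0) for i in range(n)]
    # h[i0] and t[i0] are the expected cost and length of one cycle from i0 back to i0.
    return avg_stationary, h[i0] / t[i0]


# Hypothetical fixed policy on two states.
P_fixed = [[0.5, 0.5],
           [0.8, 0.2]]
c_fixed = [1.0, 4.0]
avg_pi, avg_ratio = average_cost_two_ways(P_fixed, c_fixed)
```

Both numbers agree with $28/13$ for this chain. With running cost $k - \lambda$ the cycle cost becomes $h(i_0) - \lambda\, t(i_0)$, which is linear in $\lambda$ with slope $-t(i_0) < 0$.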

For each fixed policy, the cost is linear in $\lambda$ with negative slope. Thus $V(\cdot)$, being the lower envelope thereof, is piecewise linear with finitely many linear pieces and concave decreasing in $\lambda$ for each component. When we replace $\lambda$ by $\beta$ and force $V(i_0) = 0$, we recover (2). This suggests the coupled iterations

$$V_{k+1}(i)=\min_u\Big[k(i,u)+\sum_{j\in S,\, j\neq i_0}p(j|i,u)V_k(j)-\lambda_k\Big], \tag{5a}$$

$$\lambda_{k+1}=\lambda_k+a(k)V_k(i_0). \tag{5b}$$
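A minimal sketch of the coupled iterations (5a)-(5b) on a toy model follows. It assumes a known model and, as a simplification not made in the text, a small constant stepsize in place of $a(k)$; the function name and the numbers are hypothetical. For this model, enumerating the four stationary policies gives an optimal average cost of $2$.

```python
def ssp_coupled_iteration(P, K, i0=0, step=0.05, iters=2000):
    """Iterate (5a)-(5b) with a constant stepsize (an illustrative simplification)."""
    n_states, n_actions = len(K), len(K[0])
    V = [0.0] * n_states
    lam = 0.0
    for _ in range(iters):
        # (5a): one value-iteration sweep for the SSP problem with cost k - lam.
        V_next = [
            min(K[i][u]
                + sum(P[i][u][j] * V[j] for j in range(n_states) if j != i0)
                - lam
                for u in range(n_actions))
            for i in range(n_states)
        ]
        # (5b): move lam in the direction of the current V_k(i0).
        lam = lam + step * V[i0]
        V = V_next
    return V, lam


# Toy model (hypothetical numbers): P[i][u][j] = p(j | i, u), K[i][u] = k(i, u).
P = [[[0.5, 0.5], [0.9, 0.1]],
     [[0.5, 0.5], [0.8, 0.2]]]
K = [[1.0, 2.0],
     [3.0, 4.0]]
V_hat, lam_hat = ssp_coupled_iteration(P, K)
```

At the limit, `V_hat[0]` is zero and `lam_hat` plays the role of $\beta$ in (2), which is the sense in which forcing $V(i_0) = 0$ recovers the average cost equation.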

The SSP Q-learning scheme for the above problem is [1]

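$$Q_{n+1}(i,u)=Q_n(i,u)+a(n)I\{X_n=i,Z_n=u\}\Big(k(i,u)+\min_v Q_n(X_{n+1},v)I\{X_{n+1}\neq i_0\}-\lambda_n-Q_n(i,u)\Big), \tag{6a}$$

$$\lambda_{n+1}=\Gamma\big(\lambda_n+a'(n)\min_v Q_n(i_0,v)\big). \tag{6b}$$

Here $\Gamma$ is a projection operator onto a bounded interval chosen so as to contain $\beta$. Although this assumes some prior knowledge of $\beta$, that can be obtained from a bound on the running cost $k(\cdot,\cdot)$. This also ensures that (14) below holds. We rewrite the above equations as follows

$$Q_{n+1}(i,u)=Q_n(i,u)+a(n)\Big[F^{i,u}(Q_n,(X_n,Z_n),\lambda_n)-Q_n(i,u)+M^{i,u}_{n+1}(Q_n)\Big], \tag{7a}$$

$$\lambda_{n+1}=\Gamma\big(\lambda_n+a'(n)f(Q_n)\big), \tag{7b}$$

and

$$\begin{aligned}
F^{i,u}(Q_n,(X_n,Z_n),\lambda_n)&=I\{X_n=i,Z_n=u\}\Big(k(i,u)+\sum_{j\neq i_0}p(j|i,u)\min_v Q_n(j,v)-\lambda_n\Big),\\
M^{i,u}_{n+1}(Q_n)&=I\{X_n=i,Z_n=u\}\Big(\min_v Q_n(X_{n+1},v)\,I\{X_{n+1}\neq i_0\}-\sum_{j\neq i_0}p(j|i,u)\min_v Q_n(j,v)\Big),\\
f(Q)&=\min_u Q(i_0,u).
\end{aligned}$$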

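As observed in [5], the associated SSP dynamic programming map is a contraction for a fixed $\lambda$ under a certain weighted max-norm

$$\|x\|:=\max_i|w_ix_i|,\quad x\in\mathbb{R}^d,$$

for an appropriate weight vector $w=[w_1,\cdots,w_d]^T$, $w_i>0$ for all $i$.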
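The scheme (6a)-(6b) can be sketched along a single simulated trajectory as follows. This is a rough illustration under several assumptions that are not taken from the text: the toy model, the uniformly random exploratory actions, the stepsizes $a(n) = (n+1)^{-0.6}$ and $a'(n) = 1/(n+1)$, and the choice of $[\min k, \max k]$ as the projection interval are all hypothetical. No accuracy claim is made for the output.

```python
import random


def ssp_q_learning(P, K, i0=0, n_steps=20000, seed=0):
    """Single-trajectory sketch of (6a)-(6b) with a simulated black-box model."""
    rng = random.Random(seed)
    n_states, n_actions = len(K), len(K[0])
    # Projection interval for lambda: [min k, max k] always contains the average cost.
    lo = min(min(row) for row in K)
    hi = max(max(row) for row in K)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    lam = lo
    x = i0
    for n in range(n_steps):
        a_fast = (n + 1) ** -0.6      # a(n): assumed schedule
        a_slow = 1.0 / (n + 1)        # a'(n) = o(a(n)): assumed schedule
        u = rng.randrange(n_actions)  # exploratory action Z_n (assumed uniform)
        # Black box: sample X_{n+1} from p(. | x, u) by inverting the cumulative sum.
        r, acc, x_next = rng.random(), 0.0, n_states - 1
        for j in range(n_states):
            acc += P[x][u][j]
            if r < acc:
                x_next = j
                break
        # Continuation value is dropped when the reference state i0 is hit.
        cont = 0.0 if x_next == i0 else min(Q[x_next])
        f_val = min(Q[i0])            # f(Q_n), evaluated before the Q update
        Q[x][u] += a_fast * (K[x][u] + cont - lam - Q[x][u])   # (6a)
        lam = min(max(lam + a_slow * f_val, lo), hi)           # (6b) with projection
        x = x_next
    return Q, lam


# Toy model (hypothetical numbers): P[i][u][j] = p(j | i, u), K[i][u] = k(i, u).
P = [[[0.5, 0.5], [0.9, 0.1]],
     [[0.5, 0.5], [0.8, 0.2]]]
K = [[1.0, 2.0],
     [3.0, 4.0]]
Q_hat, lam_hat = ssp_q_learning(P, K)
```

Only the component $(X_n, Z_n)$ is updated at step $n$, which is what the indicator in (6a) expresses; the projection keeps $\lambda_n$ in a bounded interval regardless of the noise.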

## III Main Result

We state our main theorem in this section, after setting up the notation and assumptions. The assumptions are specifically geared for the SSP Q-learning applications in Section II-B, as will become apparent.

Consider the coupled iteration

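$$x_{n+1}=x_n+a(n)\big(F(x_n,Y_n,\lambda_n)-x_n+M_{n+1}(x_n)\big), \tag{8}$$

$$\lambda_{n+1}=\Gamma\big(\lambda_n+a'(n)f(x_n)\big),\quad n\ge0, \tag{9}$$

with $x_n\in\mathbb{R}^d$ and $\lambda_n\in\mathbb{R}$. Here: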

• $\{Y_n\}$ is the ‘Markov noise’ taking values in a finite state space $S$, i.e.,

$$P(Y_{n+1}|Y_m,x_m,m\le n)=P(Y_{n+1}|Y_n,x_n)=p_{x_n}(Y_{n+1}|Y_n),\quad n\ge0,$$

where for each $w\in\mathbb{R}^d$, $p_w(\cdot|\cdot)$ is the transition probability of an irreducible Markov chain on $S$ with unique stationary distribution $\pi_w$. We assume that the map $w\mapsto p_w(j|i)$ is Lipschitz, i.e., for some $L_1>0$,

$$\sum_{j\in S}|p_w(j|i)-p_v(j|i)|\le L_1\|w-v\|,\quad\forall\, i\in S,\ w,v\in\mathbb{R}^d.$$

By Cramer’s rule, $\pi_w$ is a rational function of $p_w(\cdot|\cdot)$ with a non-vanishing denominator, so the map $w\mapsto\pi_w$ is similarly Lipschitz, i.e., for some $L_2>0$,

$$\sum_{i\in S}|\pi_w(i)-\pi_v(i)|\le L_2\|w-v\|,\quad\forall\, w,v\in\mathbb{R}^d.$$

See Appendix B, [4] for some bounds on $L_2$.

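• $\{M_n(x)\}$ is, for each $x\in\mathbb{R}^d$, an $\mathbb{R}^d$-valued martingale difference sequence parametrized by $x$, with respect to the increasing family of $\sigma$-fields $\{\mathcal{F}_n\}$, $n\ge0$. That is,

$$E[M_{n+1}(x)|\mathcal{F}_n]=\theta\ \text{ a.s. }\ \forall\, x,n, \tag{10}$$

where $\theta$ is the zero vector. We also assume the componentwise bound: for some $K_0>0$,

$$|M^\ell_n(x)|\le K_0(1+\|x\|)\ \text{ a.s. }\ \forall\, x,n,\ell. \tag{11}$$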
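• $F(\cdot,\cdot,\cdot)=[F^1,\cdots,F^d]^T$ satisfies

$$\Big\|\sum_{i\in S}\pi_w(i)\big(F(x,i,\lambda)-F(z,i,\lambda)\big)\Big\|\le\alpha\|x-z\|,\quad\forall\, x,z,w\in\mathbb{R}^d,\ \lambda\in\mathbb{R}, \tag{12}$$

for some $\alpha\in(0,1)$. By the contraction mapping theorem, this implies that $\sum_{i\in S}\pi_w(i)F(\cdot,i,\lambda)$ has a unique fixed point for each $w$ and $\lambda$. We assume that this fixed point is independent of $w$, i.e., there exists an $x^*(\lambda)$ such that

$$\sum_{i\in S}\pi_w(i)F(x^*(\lambda),i,\lambda)=x^*(\lambda),\quad\forall\, w\in\mathbb{R}^d. \tag{13}$$

We also assume that the map $x\mapsto F(x,i,\lambda)$ is Lipschitz (w.l.o.g., uniformly in $i$ and $\lambda$). Let the common Lipschitz constant be $L_3$, i.e.,

$$|F^\ell(x,i,\lambda)-F^\ell(z,i,\lambda)|\le L_3\|x-z\|,\quad\forall\, i\in S,\ \ell\in\{1,\cdots,d\},\ x,z\in\mathbb{R}^d.$$

We assume that $x^*(\lambda)$ is concave, piecewise linear and decreasing in $\lambda$. Furthermore, $\tilde F_n(x,Y_n,\lambda):=F(x,Y_n,\lambda)+M_{n+1}(x)$ is assumed to satisfy

$$\|\tilde F_n(x,Y_n,\lambda)\|\le K+\alpha\|x\|\ \text{ a.s.} \tag{14}$$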
• Moreover, we assume that $f$ is Lipschitz with Lipschitz constant $L_4$: for all $\lambda_1,\lambda_2$,

$$|f(x^*(\lambda_1))-f(x^*(\lambda_2))|\le L_4\|x^*(\lambda_1)-x^*(\lambda_2)\|. \tag{15}$$
• $\{a(n)\}$ is a sequence of stepsizes satisfying

$$a(n)\to0,\quad\sum_n a(n)=\infty, \tag{16}$$

and is assumed to be eventually non-increasing, i.e., $a(n+1)\le a(n)$ for all sufficiently large $n$. Since $a(n)\to0$, we also have $a(n)<1$ for all sufficiently large $n$. (Observe that we do not require the classical square-summability condition in stochastic approximation, viz., $\sum_n a(n)^2<\infty$. This is because the contractive nature of our iterates gives us an additional handle on errors by putting less weight on past errors. A similar effect was observed in [4].) We further assume that $a(n)=\Omega(1/n)$, so that $a(n)\ge d_1/n$ for all sufficiently large $n$, for some $d_1>0$. We also assume that there exists $d_2\in(0,1]$ such that $a(n)=O(n^{-d_2})$, i.e., $a(n)\le d_3n^{-d_2}$ for all sufficiently large $n$, for some $d_3>0$. Larger values of $d_1$ and $d_2$ and smaller values of $d_3$ improve the main result presented below. The role this assumption plays in our bounds will become clear later. Define $N$ as the larger of the first two thresholds above, i.e., $a(n)$ is non-increasing after $N$ and $a(n)<1$ for all $n\ge N$. Also, it is assumed that the sequence $\epsilon(n):=a'(n)/a(n)\to0$, i.e., $a'(n)=o(a(n))$.

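We further define:

$$\begin{aligned}
b_k(n)&=\sum_{m=k}^{n}a(m),\quad 0\le k\le n<\infty,\\
b'_k(n)&=\sum_{m=k}^{n}a'(m),\quad 0\le k\le n<\infty,\\
\beta_k(n)&=\begin{cases}\dfrac{1}{k^{d_2-d_1}n^{d_1}},&\text{if } d_1\le d_2,\\[4pt]\dfrac{1}{n^{d_2}},&\text{otherwise},\end{cases}\\
\kappa(d)&=\|\mathbf{1}\|,\quad\mathbf{1}:=[1,1,\cdots,1]^T\in\mathbb{R}^d,\ d\ge1,\\
\bar\epsilon(n)&=\sup_{m\ge n}\epsilon(m).
\end{aligned}$$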
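The first three quantities are simple to evaluate for a given stepsize schedule. The helper names below are hypothetical, and the sample schedule $a(m) = 1/(m+1)$ is only an example, not a choice made in the text.

```python
def b(a, k, n):
    """b_k(n) = sum_{m=k}^{n} a(m) for a stepsize function a."""
    return sum(a(m) for m in range(k, n + 1))


def beta_kn(k, n, d1, d2):
    """beta_k(n) from the definition above (assumes k >= 1 and n >= 1)."""
    if d1 <= d2:
        return 1.0 / (k ** (d2 - d1) * n ** d1)
    return 1.0 / n ** d2


# Example schedule (an assumption for illustration): a(m) = 1 / (m + 1).
harmonic_step = lambda m: 1.0 / (m + 1)
b_0_2 = b(harmonic_step, 0, 2)          # 1 + 1/2 + 1/3
beta_small_d1 = beta_kn(4, 16, 0.5, 1.0)  # 1 / (4**0.5 * 16**0.5)
beta_large_d1 = beta_kn(4, 10, 2.0, 1.0)  # 1 / 10
```

Passing `b` a different stepsize function gives $b'_k(n)$, since the two definitions differ only in the schedule being summed.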

Our main result is as follows:

###### Theorem 1

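(a) Let $n_0\ge N$. Then there exist finite positive constants $c_1$, $c_3$, $C$ and $D$ such that for $\delta>0$ and $n\ge n_0$, the inequality

$$\|x_n-x^*(\lambda_n)\|\le e^{-(1-\alpha)b_{n_0}(n)}\|x_{n_0}-x^*(\lambda_{n_0})\|+\frac{\delta+a(n_0)c_1+\bar\epsilon(n_0)c_3}{1-\alpha} \tag{17}$$

holds with probability exceeding

$$1-2d\sum_{m=n_0+1}^{n}e^{-D\delta^2/\beta_{n_0}(m)},\quad 0<\delta\le C, \tag{18}$$

$$1-2d\sum_{m=n_0+1}^{n}e^{-D\delta/\beta_{n_0}(m)},\quad\delta>C. \tag{19}$$

(b) There exist finite constants $c_4$, $c_5$ and an $\hat n$ large enough such that for $n\ge\hat n$, the inequality

$$|\lambda_n-\beta|\le|\lambda_{\hat n}-\beta|\,e^{-(1-\alpha)c_4b'_{\hat n}(n)}+c_5\Big(e^{-(1-\alpha)b_{\hat n}(n)}\|x_{\hat n}-x^*(\lambda_{\hat n})\|+\frac{\delta+a(\hat n)c_1+\bar\epsilon(\hat n)c_3}{1-\alpha}\Big) \tag{20}$$

holds with probability exceeding

$$1-2d\sum_{m=\hat n+1}^{n}e^{-D\delta^2/\beta_{\hat n}(m)},\quad 0<\delta\le C, \tag{21}$$

$$1-2d\sum_{m=\hat n+1}^{n}e^{-D\delta/\beta_{\hat n}(m)},\quad\delta>C. \tag{22}$$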
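To get a feel for the probability bounds (18)-(19), the following sketch evaluates them for user-supplied constants. The theorem only asserts that $D$ and $C$ exist, so any numeric values passed in are placeholders, and the function name and the `beta_fn` callback are hypothetical.

```python
import math


def probability_lower_bound(delta, n0, n, d, D, C, beta_fn):
    """Evaluate (18) if 0 < delta <= C, else (19); beta_fn(n0, m) returns beta_{n0}(m)."""
    power = 2 if delta <= C else 1
    tail = sum(math.exp(-D * delta ** power / beta_fn(n0, m))
               for m in range(n0 + 1, n + 1))
    return 1.0 - 2.0 * d * tail


# Placeholder inputs (assumptions for illustration): constant beta, d = 1, D = 1, C = 2.
const_beta = lambda k, m: 1.0
bound_small_delta = probability_lower_bound(1.0, 0, 2, 1, 1.0, 2.0, const_beta)
bound_large_delta = probability_lower_bound(3.0, 0, 2, 1, 1.0, 2.0, const_beta)
```

The returned value can be negative, in which case the bound is vacuous; it becomes informative once $\beta_{n_0}(m)$ is small enough, i.e., for large $n_0$.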

## IV Proof

We begin with a lemma adapted from [4].

###### Lemma 1

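$\sup_{n\ge N}\|x_n\|\le\|x_N\|+\dfrac{K}{1-\alpha}$ a.s.

Using (14), we have

$$\begin{aligned}
\|x_{n+1}\|&=\|(1-a(n))x_n+a(n)\tilde F_n(x_n,Y_n,\lambda_n)\|\\
&\le(1-a(n))\|x_n\|+a(n)(K+\alpha\|x_n\|)\\
&=(1-(1-\alpha)a(n))\|x_n\|+a(n)K.
\end{aligned}$$

For $n,m\ge0$, define $\psi(n,m):=\prod_{k=m}^{n-1}(1-(1-\alpha)a(k))$ if $n>m$ and $1$ otherwise. Note that, since $a(n)<1$ for $n\ge N$, $0\le\psi(n,m)\le1$ for all $n\ge m\ge N$. Then

$$\|x_{n+1}\|-\frac{K}{1-\alpha}\le(1-(1-\alpha)a(n))\Big(\|x_n\|-\frac{K}{1-\alpha}\Big).$$

Now $\|x_N\|\le\psi(N,N)\|x_N\|+\frac{K}{1-\alpha}$. Suppose

$$\|x_n\|\le\psi(n,N)\|x_N\|+\frac{K}{1-\alpha} \tag{23}$$

for some $n\ge N$. Then,

$$\begin{aligned}
\|x_{n+1}\|-\frac{K}{1-\alpha}&\le(1-(1-\alpha)a(n))\Big(\psi(n,N)\|x_N\|+\frac{K}{1-\alpha}-\frac{K}{1-\alpha}\Big)\\
&\le\psi(n+1,N)\|x_N\|.
\end{aligned}$$

By induction, (23) holds for all $n\ge N$, which completes the proof of Lemma 1.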
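The induction can be checked numerically on the scalar worst case in which the first inequality of the proof holds with equality. The sketch below assumes a sample stepsize $a(n) = 1/(n+2)$ and sample constants; the helper names are hypothetical.

```python
def psi(a, alpha, n, m):
    """psi(n, m) = prod_{k=m}^{n-1} (1 - (1 - alpha) a(k)), equal to 1 if n <= m."""
    out = 1.0
    for k in range(m, n):
        out *= 1.0 - (1.0 - alpha) * a(k)
    return out


def worst_case_norms(a, alpha, K, y_N, N, n_max):
    """Iterate y_{n+1} = (1 - (1 - alpha) a(n)) y_n + a(n) K starting from y_N."""
    ys = {N: y_N}
    for n in range(N, n_max):
        ys[n + 1] = (1.0 - (1.0 - alpha) * a(n)) * ys[n] + a(n) * K
    return ys


# Sample inputs (assumptions for illustration only).
step = lambda n: 1.0 / (n + 2)
alpha, K, y_N, N = 0.5, 1.0, 10.0, 0
ys = worst_case_norms(step, alpha, K, y_N, N, 200)
# Slack in (23): psi(n, N) * y_N + K / (1 - alpha) - y_n, which should be nonnegative.
slack = [psi(step, alpha, n, N) * y_N + K / (1.0 - alpha) - ys[n] for n in ys]
```

Every entry of `slack` is nonnegative, and the sequence `ys` stays below $y_N + K/(1-\alpha)$, in line with the statement of the lemma.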

### IV-A Concentration bound for the first iteration

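Define $z_{n_0}=x_{n_0}$ and for $n\ge n_0$:

$$z_{n+1}=z_n+a(n)\Big(\sum_{i\in S}\pi_{x_n}(i)F(z_n,i,\lambda_n)-z_n\Big), \tag{24}$$

$$\lambda_{n+1}=\Gamma\big(\lambda_n+a'(n)f(x_n)\big),\quad n\ge0. \tag{25}$$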

We use the following theorem adapted from [4], which gives a concentration inequality for the stochastic approximation algorithm with Markov noise.

###### Theorem 2

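Let $n_0\ge N$. Then there exist finite positive constants $c_1$, $C$ and $D$ such that for $\delta>0$ and $n\ge n_0$, the inequality

$$\|x_n-z_n\|\le\frac{\delta+a(n_0)c_1}{1-\alpha},\quad n\ge n_0, \tag{26}$$

holds with probability exceeding

$$1-2d\sum_{m=n_0+1}^{n}e^{-D\delta^2/\beta_{n_0}(m)},\quad 0<\delta\le C, \tag{27}$$

$$1-2d\sum_{m=n_0+1}^{n}e^{-D\delta/\beta_{n_0}(m)},\quad\delta>C. \tag{28}$$

Since $x^*(\lambda_n)=\sum_{i\in S}\pi_{x_n}(i)F(x^*(\lambda_n),i,\lambda_n)$ by (13), we have

$$z_{n+1}-x^*(\lambda_{n+1})=(1-a(n))(z_n-x^*(\lambda_n))+a(n)\Big(\sum_{i\in S}\pi_{x_n}(i)\big(F(z_n,i,\lambda_n)-F(x^*(\lambda_n),i,\lambda_n)\big)+\frac{1}{a(n)}\big(x^*(\lambda_n)-x^*(\lambda_{n+1})\big)\Big). \tag{29}$$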

The map $\lambda\mapsto x^*(\lambda)$ is piecewise linear and concave decreasing, and therefore so is the map $\lambda\mapsto f(x^*(\lambda))$. By (12) and (13), we have the following lemma.

###### Lemma 2

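From the definition of $x^*(\cdot)$, we have

$$\|x^*(\lambda_{n+1})-x^*(\lambda_n)\|=\Big\|\sum_{i\in S}\pi(i)\big(F(x^*(\lambda_n),i,\lambda_n)-F(x^*(\lambda_{n+1}),i,\lambda_{n+1})\big)\Big\|.$$

We have suppressed the subscript of $\pi$, which is irrelevant by virtue of (13). Let $e_i$ denote the standard basis vectors. Then the r.h.s. in the above can be written as

$$\begin{aligned}
&\Big\|\sum_{i\in S}\pi(i)\big(F(x^*(\lambda_n),i,\lambda_n)-F(x^*(\lambda_{n+1}),i,\lambda_{n+1})\big)\Big\|\\
&\quad=\Big\|\sum_{i\in S}\pi(i)\big(F(x^*(\lambda_n),i,\lambda_{n+1})-F(x^*(\lambda_{n+1}),i,\lambda_{n+1})+e_i(\lambda_{n+1}-\lambda_n)\big)\Big\|\\
&\quad\le\Big\|\sum_{i\in S}\pi(i)\big(F(x^*(\lambda_n),i,\lambda_{n+1})-F(x^*(\lambda_{n+1}),i,\lambda_{n+1})\big)\Big\|+|\lambda_{n+1}-\lambda_n|\Big\|\sum_{i\in S}\pi(i)e_i\Big\|\\
&\quad\le\alpha\|x^*(\lambda_{n+1})-x^*(\lambda_n)\|+|\lambda_{n+1}-\lambda_n|\Big\|\sum_{i\in S}\pi(i)e_i\Big\|.
\end{aligned}$$

Thus we finally have

$$\|x^*(\lambda_{n+1})-x^*(\lambda_n)\|\le\alpha\|x^*(\lambda_{n+1})-x^*(\lambda_n)\|+|\lambda_{n+1}-\lambda_n|\Big\|\sum_{i\in S}\pi(i)e_i\Big\|$$

which leads us to the claim that

$$\|x^*(\lambda_{n+1})-x^*(\lambda_n)\|\le L_x|\lambda_{n+1}-\lambda_n|$$

where $L_x:=\frac{1}{1-\alpha}\big\|\sum_{i\in S}\pi(i)e_i\big\|$.

To get a bound on $|\lambda_{n+1}-\lambda_n|$ we use the nonexpansive property of the projection operator $\Gamma$ as follows

$$|\lambda_{n+1}-\lambda_n|=|\Gamma(\lambda_n+a'(n)f(x_n))-\Gamma(\lambda_n)|\le|\lambda_n+a'(n)f(x_n)-\lambda_n|=a'(n)|f(x_n)|$$

where we use the fact that $\Gamma(\lambda_n)=\lambda_n$. Combining the above inequalities, we get

$$\|x^*(\lambda_{n+1})-x^*(\lambda_n)\|\le L_xa'(n)|f(x_n)|. \tag{30}$$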

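Thus,

$$\begin{aligned}
\|z_{n+1}-x^*(\lambda_{n+1})\|&\le\|(1-a(n))(z_n-x^*(\lambda_n))\|\\
&\quad+a(n)\Big(\Big\|\sum_{i\in S}\pi_{x_n}(i)\big(F(z_n,i,\lambda_n)-F(x^*(\lambda_n),i,\lambda_n)\big)\Big\|+\frac{1}{a(n)}\|x^*(\lambda_n)-x^*(\lambda_{n+1})\|\Big)\\
&\le(1-(1-\alpha)a(n))\|z_n-x^*(\lambda_n)\|+a(n)\epsilon(n)L_x|f(x_n)|,
\end{aligned} \tag{31}$$

where $\epsilon(n):=a'(n)/a(n)$. Since $\{x_n\}$ is bounded by Lemma 1, $L_x|f(x_n)|\le K'$ for a finite constant $K'$. Iterating (31) for $n_0\le m\le n$, we get,

$$\begin{aligned}
\|z_{n+1}-x^*(\lambda_{n+1})\|&\le\|x_{n_0}-x^*(\lambda_{n_0})\|\prod_{m=n_0}^{n}(1-(1-\alpha)a(m))+K'\sum_{m=n_0}^{n}\psi(n+1,m+1)a(m)\epsilon(m) &(32)\\
&\le e^{-(1-\alpha)b_{n_0}(n)}\|x_{n_0}-x^*(\lambda_{n_0})\|+K'\sum_{m=n_0}^{n}\psi(n+1,m+1)a(m)\epsilon(m), &(33)
\end{aligned}$$

where we have used $z_{n_0}=x_{n_0}$ and $1-x\le e^{-x}$. The summation in the last term can be bounded as

$$\sum_{m=n_0}^{n}\psi(n+1,m+1)a(m)\epsilon(m)\le\bar\epsilon(n_0)\sum_{m=n_0}^{n}\psi(n+1,m+1)a(m), \tag{34}$$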

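where $\bar\epsilon(n_0)=\sup_{m\ge n_0}\epsilon(m)$. Note that for any $0\le k<m$,

$$\psi(m,k)+\psi(m,k+1)(1-\alpha)a(k)=\psi(m,k+1),$$

and hence

$$\frac{\psi(m+1,n_0)}{1-\alpha}+\frac{1}{1-\alpha}\sum_{k=n_0}^{m}\psi(m+1,k+1)(1-\alpha)a(k)=\frac{\psi(m+1,m+1)}{1-\alpha}=\frac{1}{1-\alpha}.$$

This implies that

$$\sum_{k=n_0}^{m}\psi(m+1,k+1)a(k)\le\frac{1}{1-\alpha}. \tag{35}$$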
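The telescoping identity behind (35) is easy to confirm numerically. In the sketch below the stepsize $a(k) = 1/(k+2)$, the value of $\alpha$ and the index range are arbitrary choices made for illustration, and the helper names are hypothetical.

```python
def psi(a, alpha, n, m):
    """psi(n, m) = prod_{k=m}^{n-1} (1 - (1 - alpha) a(k)), equal to 1 if n <= m."""
    out = 1.0
    for k in range(m, n):
        out *= 1.0 - (1.0 - alpha) * a(k)
    return out


def weighted_step_sum(a, alpha, n0, m):
    """Left-hand side of (35): sum_{k=n0}^{m} psi(m + 1, k + 1) a(k)."""
    return sum(psi(a, alpha, m + 1, k + 1) * a(k) for k in range(n0, m + 1))


# Sample inputs (assumptions for illustration only).
step = lambda k: 1.0 / (k + 2)
alpha, n0, m = 0.3, 5, 60
lhs = weighted_step_sum(step, alpha, n0, m)
# Telescoping: psi(m+1, n0) + (1 - alpha) * lhs should equal psi(m+1, m+1) = 1.
identity_gap = abs(psi(step, alpha, m + 1, n0) + (1.0 - alpha) * lhs - 1.0)
```

Since $\psi(m+1, n_0) \ge 0$, the identity immediately gives `lhs` $\le 1/(1-\alpha)$, which is (35).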

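Hence

$$\sum_{m=n_0}^{n}\psi(n+1,m+1)a(m)\epsilon(m)\le\frac{\bar\epsilon(n_0)}{1-\alpha}. \tag{36}$$

Combining the above,

$$\|z_{n+1}-x^*(\lambda_{n+1})\|\le e^{-(1-\alpha)b_{n_0}(n)}\|x_{n_0}-x^*(\lambda_{n_0})\|+\frac{K'\bar\epsilon(n_0)}{1-\alpha}. \tag{37}$$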

Combining (37) with Theorem 2 yields Theorem 1(a).

### IV-B Concentration bound for the second iteration

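The second iteration is given by

$$\lambda_{n+1}=\Gamma\big(\lambda_n+a'(n)f(x_n)\big). \tag{38}$$

Let $\xi_n:=f(x_n)-f(x^*(\lambda_n))$. Subtracting $\beta$ from both sides, we get:

$$\begin{aligned}
\lambda_{n+1}-\beta&=\Gamma\big(\lambda_n+a'(n)f(x^*(\lambda_n))+a'(n)(f(x_n)-f(x^*(\lambda_n)))\big)-\beta\\
&=\Gamma\big(\lambda_n+a'(n)f(x^*(\lambda_n))+a'(n)\xi_n\big)-\beta.
\end{aligned} \tag{39}$$

Since the map $\lambda\mapsto f(x^*(\lambda))$ is concave decreasing and piecewise linear, there exist finite constants $L_5,L_6>0$ such that

$$-L_6(\lambda_1-\lambda_2)\le f(x^*(\lambda_1))-f(x^*(\lambda_2))\le-L_5(\lambda_1-\lambda_2).$$

Replace $\lambda_1$ by $\lambda_n$ and $\lambda_2$ by $\beta$. Since $f(x^*(\beta))=0$:

$$-L_6(\lambda_n-\beta)\le f(x^*(\lambda_n))\le-L_5(\lambda_n-\beta).$$

Thus,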

 Γ(λn−a′(n)L6(λn−β)+a′(n)ξn)−β≤λn+1−β≤ Γ(λn−a′(n)L5(λn−β)+a′(n)ξn)−β |λn+1−β|≤max(|Γ(λn−a′(n)L6(λn−β)+a′(n)ξn) −β|,|Γ