# Zap Q-Learning for Optimal Stopping Time Problems

We propose a novel reinforcement learning algorithm that approximates solutions to the problem of discounted optimal stopping in an irreducible, uniformly ergodic Markov chain evolving on a compact subset of $\mathbb{R}^n$. A dynamic programming approach has been taken by Tsitsiklis and Van Roy to solve this problem, wherein they propose a Q-learning algorithm to estimate the value function, in a linear function approximation setting. The Zap-Q learning algorithm proposed in this work is the first algorithm that is designed to achieve optimal asymptotic variance. We prove convergence of the algorithm using ODE analysis, and the optimal asymptotic variance property is reflected via fast convergence in a finance example.


## I Introduction

Consider a discrete-time Markov chain $\{X_n\}$ evolving on a general state-space $\mathsf{X}$. The goal in optimal stopping time problems is to minimize, over all stopping times $\tau$, the associated expected cost:

$$E\Big[\sum_{n=0}^{\tau-1}\beta^n c(X_n) + \beta^\tau c_s(X_\tau)\Big] \tag{1}$$

where $c$ denotes the per-stage cost, $c_s$ the terminal cost, and $\beta$ is the discount factor. Examples of such a problem arise mostly in financial applications such as derivatives analysis (see Section V), timing of a purchase or sale of an asset, and more generally in sequential analysis problems.

In this work, the optimal decision rule is approximated using reinforcement learning techniques. We propose and analyze an optimal variance algorithm to approximate the value function associated with the optimal stopping rule.

### I-A Problem Setup

It is assumed that $\mathsf{X}$ is compact, equipped with its Borel $\sigma$-algebra. The time-homogeneous Markov chain $\{X_n\}$ is defined on a probability space, and its distribution is determined by an initial distribution and a transition kernel $P$, defined for each $x \in \mathsf{X}$ and each Borel set $A$ by

$$P(x, A) = \Pr(X_{n+1} \in A \mid X_n = x)$$

It is assumed that the chain is uniformly ergodic, so in particular it has a unique invariant probability measure, denoted $\pi$ [9].

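Denote by $\{\mathcal{F}_n\}$ the filtration associated with $\{X_n\}$. The Markov property asserts that for bounded measurable functions $h$,

$$E[h(X_{n+1}) \mid \mathcal{F}_n] = \int P(x, dy)\, h(y) \quad \text{with } x = X_n$$

In this paper a stopping time $\tau$ is a random variable taking on values in the non-negative integers, with the defining property $\{\tau \le n\} \in \mathcal{F}_n$ for each $n \ge 0$. A stationary policy is a measurable function $\phi : \mathsf{X} \to \{0, 1\}$ that defines a stopping time as follows:

$$\tau_\phi = \min\{n \ge 0 : \phi(X_n) = 1\} \tag{2}$$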

The (optimal) value function is defined as the infimum of (1) over all stopping times: for any $x \in \mathsf{X}$,

$$h^*(x) := \inf_\tau E\Big[\sum_{n=0}^{\tau-1}\beta^n c(X_n) + \beta^\tau c_s(X_\tau) \,\Big|\, X_0 = x\Big] \tag{3}$$

The associated Q-function is defined by $Q^*(x) := c(x) + \beta E[h^*(X_1) \mid X_0 = x]$. The optimal stopping problem is a special case of discounted-cost optimal control [1]. It follows that $Q^*$ solves the following Bellman equation: for each $x \in \mathsf{X}$,

$$Q^*(x) = c(x) + \beta E[\min(c_s(X_1), Q^*(X_1)) \mid X_0 = x] \tag{4}$$

Moreover, the optimal stopping rule is defined by the stationary policy,

$$\phi^*(x) = I\{c_s(x) \le Q^*(x)\} \tag{5}$$

An optimal stopping time is $\tau^* = \tau_{\phi^*}$, using the general definition (2).

The Bellman equation (4) can be expressed as the functional fixed point equation:

$$Q^* = FQ^*,$$

where $F$ denotes the dynamic programming operator: for any function $Q : \mathsf{X} \to \mathbb{R}$, and $x \in \mathsf{X}$,

$$FQ(x) := c(x) + \beta E[\min(c_s(X_1), Q(X_1)) \mid X_0 = x] \tag{6}$$
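For intuition, the fixed point of (6) can be computed by successive approximation when the state space is finite. The following is a minimal sketch under stated assumptions, not the method of this paper: the three-state transition matrix `P` and the cost vectors `c` and `c_s` are hypothetical, and the stopping criterion relies on the fact that, for a finite chain, $F$ is also a $\beta$-contraction in the supremum norm (a different norm from the $L_2(\pi)$ norm used in the analysis here).

```python
import numpy as np

def bellman_operator(Q, P, c, c_s, beta):
    """Apply (6): FQ(x) = c(x) + beta * E[min(c_s(X_1), Q(X_1)) | X_0 = x]."""
    return c + beta * P @ np.minimum(c_s, Q)

def solve_stopping(P, c, c_s, beta, tol=1e-12, max_iter=10_000):
    """Iterate Q <- FQ until successive iterates differ by less than tol."""
    Q = np.zeros_like(c, dtype=float)
    for _ in range(max_iter):
        Q_next = bellman_operator(Q, P, c, c_s, beta)
        if np.max(np.abs(Q_next - Q)) < tol:
            return Q_next
        Q = Q_next
    return Q

# Hypothetical three-state example (all numbers are illustrative assumptions).
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
c = np.array([1.0, 1.0, 1.0])      # per-stage cost
c_s = np.array([0.0, 5.0, 20.0])   # terminal cost
beta = 0.9

Q_star = solve_stopping(P, c, c_s, beta)
phi_star = (c_s <= Q_star).astype(int)   # stopping rule (5)
h_star = np.minimum(c_s, Q_star)         # value function: stop or continue
```

In state 0 the terminal cost is zero while continuing costs at least one unit, so the computed rule stops there immediately.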

Analysis is framed in the usual Hilbert space $L_2(\pi)$ of real-valued measurable functions on $\mathsf{X}$ with inner product:

$$\langle f, g\rangle_\pi = E[f(X)g(X)] \tag{7}$$

and norm:

$$\|f\|_\pi = \sqrt{\langle f, f\rangle_\pi} \tag{8}$$

where the expectation in (7) is with respect to the steady state distribution $\pi$. It is assumed throughout that the cost functions $c$ and $c_s$ are in $L_2(\pi)$.

### I-B Objective

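The objective is to approximate $Q^*$ using a parameterized family of functions $\{Q^\theta\}$, where $\theta \in \mathbb{R}^d$ denotes the parameter vector. We restrict to linear parameterization throughout, so that:

$$Q^\theta(x) := \theta^{\top}\psi(x), \quad x \in \mathsf{X} \tag{9}$$

where $\psi = [\psi_1, \dots, \psi_d]^{\top}$, with $\psi_i \in L_2(\pi)$ for $1 \le i \le d$, denotes the basis functions. For any parameter vector $\theta$, we denote the Bellman error

$$\mathcal{B}^{\theta}_{E} = FQ^\theta - Q^\theta.$$

It is assumed that the basis functions are linearly independent: The covariance matrix $\Sigma_\psi$ is full rank, where

$$\Sigma_\psi(i,j) = \langle \psi_i, \psi_j\rangle_\pi, \quad 1 \le i, j \le d \tag{10}$$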

In a finite state-space setting, it is possible to construct a consistent algorithm that computes the Q-function exactly [8]. The Q-learning algorithm of Watkins [18, 19] can be used in this case (see [21] for a discussion).

In a function approximation setting, we need to relax the goal of solving (4). As in previous works [17, 6, 21], the goal in this paper is to obtain the solution to a Galerkin relaxation of (4): Find $\theta^*$ such that,

$$E\big[\mathcal{B}^{\theta^*}_{E}(X_n)\,\zeta_n(i)\big] = 0, \quad 1 \le i \le d, \tag{11}$$

where $\{\zeta_n\}$ is adapted to $\{X_n\}$, and the expectation is in steady state. For a given constant $\lambda$, the Zap-Q($\lambda$) algorithm that is proposed in this work intends to solve (11) with

$$\zeta_n = \sum_{k=0}^{\infty} \lambda^{n-k}\psi(X_k)$$

in which $\{X_k\}$ is a stationary realization. This is similar to what is considered in the TD($\lambda$) algorithm [12, 16].

In much of this paper we fix $\lambda = 0$ for simplicity. In this special case, we have: $\zeta_n(i) = \psi_i(X_n)$, for each $1 \le i \le d$, and the goal is to find $\theta^*$ such that, for each $1 \le i \le d$,

$$\langle FQ^{\theta^*} - Q^{\theta^*}, \psi_i\rangle_\pi = 0 \tag{12}$$

### I-C Literature Survey

Obtaining an approximate solution to the original problem (4) using a modified objective (12) was first considered in [17]. The authors propose an extension of the TD($\lambda$) algorithm of [13], and prove convergence under general conditions. Though it is not obvious at first sight, the algorithm in [17] is more closely connected to Watkins’ Q-learning algorithm [18, 19] than to the TD($\lambda$) algorithm. This is specifically due to the minimum term that appears in (12) (see the definition of $F$ in (6)), similar to what appears in Q-learning. This is important to note, because Q-learning algorithms are not known to converge in function approximation settings, since the dynamic programming operator may not be a contraction in general [1]. The operator $F$ defined in (6) is quite special in this sense: it can be shown that it is a contraction with respect to the $L_2(\pi)$-norm defined in (8) [15]:

$$\|FQ - FQ'\|_\pi \le \beta\|Q - Q'\|_\pi, \quad \text{for all } Q, Q' \in L_2(\pi)$$

Since [15], many other algorithms have been proposed to improve the convergence rate. In [6] the authors propose a matrix gain variant of the algorithm presented in [15], improving the rate of convergence in numerical studies [7]. In [21], the authors take a least squares approach to solve the problem, and propose the least squares Q-learning algorithm, which closely resembles the least squares policy evaluation algorithm (LSPE($\lambda$) of [10]). The authors recognize the high computational complexity of the algorithm, and propose variants in [20]. In prior works [6] and [20], though a function-approximation setting is considered, the state-space is assumed finite.

More recently, in [8, 7], the authors propose the Zap Q-learning algorithm to solve a fixed point equation similar to (but more general than) (4). The proof of convergence is provided only for the tabular case (wherein the $\psi_i$’s span all possible functions), and when the state-action space is finite.

The remainder of the paper is organized as follows: Section II contains the approximation architecture, and introduces the Zap Q-learning algorithm. The assumptions and main results are contained in Section III. Section IV provides a high-level proof of the results, numerical results are collected together in Section V, and conclusions in Section VI. Full proofs are available in an extended version of this paper, available on arXiv [4].

## II Q-learning for Optimal Stopping

### II-A Notation

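The following notation will be used throughout. Define for each $\theta \in \mathbb{R}^d$, the corresponding policy $\phi^\theta$,

$$\phi^\theta(x) := I\{c_s(x) \le Q^\theta(x)\} \tag{13}$$

For any function $f$ with domain $\mathsf{X}$, two operators are defined as the simple products,

$$S_\theta f(x) := I\{Q^\theta(x) < c_s(x)\}\, f(x), \qquad S^c_\theta f(x) := I\{c_s(x) \le Q^\theta(x)\}\, f(x) \tag{14}$$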

Observe that .

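We then define a $d \times d$ matrix and two $d$-dimensional vectors as follows:

$$A(\theta) := E\big[\beta\,\psi(X_n)\,S_\theta\psi^{\top}(X_{n+1}) - \psi(X_n)\psi^{\top}(X_n)\big] \tag{15}$$

$$b^* := E[\psi(X_n)c(X_n)] \tag{16}$$

$$\overline{c}_s(\theta) := E[\psi(X_n)\,S^c_\theta c_s(X_{n+1})] \tag{17}$$

The objective (12) can be expressed:

$$A(\theta^*)\theta^* + \beta\,\overline{c}_s(\theta^*) + b^* = 0 \tag{18}$$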

### II-B Zap Q-Learning

It is useful to first look at a more general class of “matrix gain Q-learning” algorithms. Given a matrix gain sequence $\{G_n\}$ with each $G_n$ invertible, and a scalar step-size sequence $\{\alpha_n\}$, the corresponding matrix gain Q-learning algorithm for optimal stopping is given by the following recursion:

$$\theta_{n+1} = \theta_n + \alpha_{n+1} G_{n+1}^{-1}\psi(X_n)\,d_{n+1} \tag{19}$$

with $\{d_n\}$ denoting the “temporal difference” sequence:

$$d_{n+1} := c(X_n) + \beta\min(c_s(X_{n+1}), Q^{\theta_n}(X_{n+1})) - Q^{\theta_n}(X_n).$$

The algorithm proposed in [17] is (19), with $G_n \equiv I$ (the identity matrix). This is similar to the TD(0) algorithm [16, 13]. The fixed point Kalman filter algorithm of [6] can also be written as a special case of (19): Each $G_n$ is an estimate of the matrix $\Sigma_\psi$ defined in (10), which can be obtained recursively:

$$G_{n+1} = G_n + \alpha_{n+1}\big[\psi(X_n)\psi^{\top}(X_n) - G_n\big] \tag{20}$$

In the Zap Q($\lambda$) algorithm, the matrix gain sequence is designed so that the asymptotic covariance of the resulting algorithm is minimized (see Section III for details). Similar to the Zap-Q algorithm of [8], it uses a matrix gain built from the projected pseudo-inverse of $\hat{A}_{n+1}$, an estimate of $A(\theta_n)$, with $A(\theta)$ defined in (15).

The term inside the expectation in (15), following the substitution $\theta = \theta_n$, is denoted

$$A_{n+1} := \psi(X_n)\big[\beta S_{\theta_n}\psi(X_{n+1}) - \psi(X_n)\big]^{\top} \tag{21}$$

Using (21), the matrix $A(\theta_n)$ is recursively estimated using Monte Carlo in the Zap Q($\lambda$) algorithm. For simplicity we give details only for $\lambda = 0$:

The gain sequences $\{\alpha_n\}$ and $\{\gamma_n\}$ in the above algorithm are chosen as: for some $\rho$,

$$\alpha_n = 1/n, \qquad \gamma_n = 1/n^{\rho}. \tag{24}$$
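A two-time-scale recursion consistent with (19), (21) and (24) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' listing: it assumes a finite state space, takes the matrix gain in (19) to be $G_{n+1} = -\hat{A}_{n+1}$, uses `np.linalg.pinv` in place of the projected pseudo-inverse, and picks `rho=0.85`, the initialization `A_hat = -I`, and the initial state as hypothetical defaults. The function name `zap_q0_stopping` and the three-state example are also made up.

```python
import numpy as np

def zap_q0_stopping(P, c, c_s, psi, beta, n_iter, rho=0.85, seed=0):
    """Sketch of a Zap Q(0)-style recursion for optimal stopping.

    P   : (S, S) transition matrix of a finite chain (assumption)
    c   : (S,) per-stage cost;  c_s : (S,) terminal cost
    psi : (S, d) feature matrix whose row x is psi(x)
    """
    rng = np.random.default_rng(seed)
    n_states, d = psi.shape
    theta = np.zeros(d)
    A_hat = -np.eye(d)   # assumed initial matrix estimate
    x = 0                # assumed initial state
    for n in range(1, n_iter + 1):
        alpha, gamma = 1.0 / n, 1.0 / n ** rho   # step sizes as in (24)
        x_next = rng.choice(n_states, p=P[x])
        q_next = psi[x_next] @ theta
        continue_next = float(q_next < c_s[x_next])   # indicator in S_theta
        # Sample matrix (21), evaluated at the current parameter
        A_sample = np.outer(psi[x], beta * continue_next * psi[x_next] - psi[x])
        # Temporal difference d_{n+1}
        td = c[x] + beta * min(c_s[x_next], q_next) - psi[x] @ theta
        # Fast time scale: running Monte Carlo estimate of A(theta_n)
        A_hat += gamma * (A_sample - A_hat)
        # Slow time scale: recursion (19) with G_{n+1} = -A_hat (assumption)
        theta = theta - alpha * (np.linalg.pinv(A_hat) @ psi[x]) * td
        x = x_next
    return theta

# Hypothetical three-state example with tabular features (psi = identity).
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
c = np.array([1.0, 1.0, 1.0])
c_s = np.array([0.0, 5.0, 20.0])
theta_hat = zap_q0_stopping(P, c, c_s, psi=np.eye(3), beta=0.9, n_iter=5_000)
```

The matrix estimate is updated with the larger step size $\gamma_n$, so it tracks $A(\theta_n)$ on a faster time scale than the parameter itself moves.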

For each $\theta \in \mathbb{R}^d$, consider the following terms:

$$b(\theta) = -A(\theta)\theta - \beta\,\overline{c}_s(\theta) \tag{25a}$$

$$c^\theta(x) = Q^\theta(x) - E[\beta\min(c_s(X_n), Q^\theta(X_n)) \mid X_{n-1} = x] \tag{25b}$$

The vector $b(\theta)$ is analogous to $b^*$ in (18), and (25b) recalls the Bellman equation (4). Prop. II.1 follows from these definitions. It shows that $b(\theta)$ is the “projection” of the cost function $c^\theta$, similar to how $b^*$ is related to $c$ through (16).

###### Proposition II.1

For each $\theta \in \mathbb{R}^d$, we have:

$$b(\theta) = E[c^\theta(X_n)\psi(X_n)] \tag{26}$$

where the expectation is in steady state. In particular, $b(\theta^*) = b^*$.

## III Assumptions and Main Results

### III-A Preliminaries

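Preliminary results are summarized here that will be useful in establishing the main results. We start with the contraction property of the dynamic programming operator $F$ defined in (6). This result is obtained directly from [17].

###### Lemma III.1

The dynamic programming operator $F$ defined in (6) satisfies:

$$\|FQ - FQ'\|_\pi \le \beta\|Q - Q'\|_\pi, \quad Q, Q' \in L_2(\pi).$$

Furthermore, $Q^*$ is the unique fixed point of $F$ in $L_2(\pi)$.

Recall that $Q^\theta$ is defined in (9). Similar to the operator $S_\theta$, for each $\theta \in \mathbb{R}^d$ we define operators $H^\theta$ and $F^\theta$ that operate on functions $Q$ as follows:

$$H^\theta Q(x) = \begin{cases} Q(x), & \text{if } Q^\theta(x) < c_s(x) \\ c_s(x), & \text{otherwise} \end{cases} \tag{27}$$

$$F^\theta Q := c + \beta P H^\theta Q \tag{28}$$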

The following Lemma is a slight extension of Lemma III.1.

###### Lemma III.2

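For each $\theta \in \mathbb{R}^d$, the operator $F^\theta$ satisfies:

$$\|F^\theta Q - F^\theta Q'\|_\pi \le \beta\|Q - Q'\|_\pi, \quad \forall\, Q, Q' \in L_2(\pi).$$

Based on the definition (28), we have:

$$\|F^\theta Q - F^\theta Q'\|_\pi = \beta\|PH^\theta Q - PH^\theta Q'\|_\pi \le \beta\|H^\theta Q - H^\theta Q'\|_\pi \le \beta\|Q - Q'\|_\pi,$$

where the first inequality follows from the fact that $\|P\|_\pi \le 1$ (with $\|\cdot\|_\pi$ the induced operator norm in $L_2(\pi)$). The last inequality is true because:

$$H^\theta Q(x) - H^\theta Q'(x) = S_\theta(Q - Q')(x), \quad x \in \mathsf{X}$$

and $|S_\theta g(x)| \le |g(x)|$ for every function $g$ and every $x \in \mathsf{X}$.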

The next result is a direct consequence of Lemma III.2.

###### Lemma III.3

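For each $\theta \in \mathbb{R}^d$,

1. The matrix $A(\theta)$ defined in (15) satisfies:

$$-v^{\top}A(\theta)v \ge (1-\beta)\,v^{\top}\Sigma_\psi v, \tag{29}$$

for each $v \in \mathbb{R}^d$, with $\Sigma_\psi$ defined in (10).

2. The eigenvalues of $A(\theta)$ are strictly bounded away from 0, and the inverses $A(\theta)^{-1}$ are uniformly bounded.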

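Prop. II.1 implies a Lipschitz bound on the function $b$ defined in (25a):

###### Lemma III.4

The mapping $b$ is Lipschitz: For some $\ell_1 > 0$, and each $\theta^1, \theta^2 \in \mathbb{R}^d$,

$$\|b(\theta^1) - b(\theta^2)\| \le \ell_1\|\theta^1 - \theta^2\|$$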

### III-B Assumptions

The following assumptions are made throughout the paper:

Assumption A1: $\{X_n\}$ is a uniformly ergodic Markov chain on the compact state space $\mathsf{X}$. Its unique invariant probability measure is denoted $\pi$ (see [9] for definitions).

Assumption A2: There exists a unique solution to the objective (12).

Assumption A3: The conditional distribution of given has a density, . This density is also assumed to have uniformly bounded likelihood ratio with respect to the Gaussian density . Consequently, .

It is assumed moreover that the function is in the span of .

Assumption A4: The parameter sequence $\{\theta_n\}$ is bounded a.s.

Assumption (A3) consists of technical conditions used in the proof. The density assumption is imposed to ensure that the conditional expectation, given $X_n = x$, of functions such as $S_\theta\psi(X_{n+1})$ is smooth as a function of $\theta$.

As for (A4), it is highly likely that boundedness can be established via an extension of the “Borkar & Meyn Theorem” of [3, 2]. The “ODE at infinity” posed there is stable as required, but applying the theorem to the two-time-scale algorithm presents a challenge.

### III-C Main Result

The main result of this paper establishes convergence of iterates obtained using Algorithm 1:

###### Theorem III.5

Suppose that Assumptions A1-A4 hold. Then,

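• The parameter sequence $\{\theta_n\}$ obtained using the Zap-Q(0) algorithm converges to $\theta^*$ a.s., where $\theta^*$ satisfies (12).

• An ODE approximation holds for the sequences $\{\theta_n, b(\theta_n)\}$ by continuous time functions $(w, b)$ satisfying

$$\frac{d}{dt}b(t) = -b(t) + b^*, \qquad b(t) = -A(w(t))w(t) - \beta\,\overline{c}_s(w(t)) \tag{30}$$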
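As a short worked step (not part of the original statement), the first equation in (30) is a linear ODE in $b$ alone, with forcing term $b^*$ as in (16) and (44), so it can be integrated in closed form from any starting time $t_0$:

```latex
\frac{d}{dt}\bigl(b(t) - b^*\bigr) = -\bigl(b(t) - b^*\bigr)
\quad\Longrightarrow\quad
b(t) = b^* + e^{-(t - t_0)}\bigl(b(t_0) - b^*\bigr), \qquad t \ge t_0 .
```

Hence $b(t) \to b^*$ exponentially fast regardless of the initial condition, which is consistent with the identity $b(\theta^*) = b^*$ implied by (18) and (25a).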

The term ODE approximation is standard in the SA literature: For $s \ge 0$, let $w^s$ denote the solution to:

$$\frac{d}{dt}w^s(t) = \xi(w^s(t)), \qquad w^s(s) = \bar{w}(s) \tag{31}$$

for some $\xi : \mathbb{R}^d \to \mathbb{R}^d$. We say that the ODE approximation:

$$\frac{d}{dt}w(t) = \xi(w(t))$$

holds for the sequence $\{\theta_n\}$ if the following is true for any $T > 0$:

$$\lim_{s\to\infty}\ \sup_{t\in[s,\,s+T]} \|\bar{w}(t) - w^s(t)\| = 0, \quad \text{a.s.}$$

where $\bar{w}$ denotes the continuous time process constructed from the sequence $\{\theta_n\}$, whose meaning will be made precise in Sec. IV-B. The optimality of the algorithm in terms of the asymptotic variance is discussed next.

### III-D Asymptotic Variance

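The asymptotic covariance of any algorithm is defined to be the following limit:

$$\Sigma_\Theta := \lim_{n\to\infty} n\,E\big[\tilde\theta_n\tilde\theta_n^{\top}\big], \qquad \tilde\theta_n := \theta_n - \theta^* \tag{32}$$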

Consider the matrix gain Q-learning algorithm (19), and suppose the matrix sequence is constant: $G_n^{-1} \equiv G$. Also, suppose that all eigenvalues $\lambda$ of $GA(\theta^*)$ satisfy $\mathrm{Re}(\lambda) < -\tfrac12$. Following standard analysis (see Section 2.2 of [8] and references therein), it can be shown that, under general assumptions, the asymptotic covariance of the algorithm (19) can be obtained as a solution to the Lyapunov equation:

$$\big(GA(\theta^*) + \tfrac12 I\big)\Sigma_\Theta + \Sigma_\Theta\big(GA(\theta^*) + \tfrac12 I\big)^{\top} + G\Sigma_E G^{\top} = 0 \tag{33}$$

where $\Sigma_E$ is the “noise covariance matrix”, defined as follows.

A “noise sequence” $\{E_n\}$ is defined as

$$E_n := \tilde{A}_{n+1}\theta^* + \tilde{b}_{n+1} + \tilde{A}_{n+1}\tilde\theta_n \tag{34}$$

where , ,
, with defined in (21), defined in (15),

$$b_{n+1} := \psi(X_n)\big[c(X_n) + S^c_{\theta_n}c_s(X_{n+1})\big] \tag{35}$$

and defined in (16). For the matrix gain algorithm with , the algorithm would be deterministic in the ideal case .

The noise covariance matrix $\Sigma_E$ is defined as the limit

$$\Sigma_E = \lim_{T\to\infty}\frac{1}{T}E\big[S_T S_T^{\top}\big] \tag{36}$$

in which $S_T = \sum_{n=1}^{T} E_n$, and the expectation is in steady state.

#### Optimality of the asymptotic covariance

The asymptotic covariance can be obtained as a solution to (33) only when all eigenvalues $\lambda$ of $GA(\theta^*)$ satisfy $\mathrm{Re}(\lambda) < -\tfrac12$. If there exists at least one eigenvalue such that $\mathrm{Re}(\lambda) \ge -\tfrac12$, then, under general conditions, it can be shown that the asymptotic covariance is not finite. This implies that the rate of convergence of $\theta_n$ is slower than $O(1/\sqrt{n})$.

It is possible to optimize the covariance over all matrix gains using (33). Specifically, it can be shown that letting $G^* := -A(\theta^*)^{-1}$ will result in the minimum asymptotic covariance $\Sigma^*$, where

$$\Sigma^* = A(\theta^*)^{-1}\,\Sigma_E\,\big(A(\theta^*)^{-1}\big)^{\top} \tag{37}$$

That is, for any other gain $G$, denoting by $\Sigma^G_\Theta$ the asymptotic covariance of the algorithm (19), obtained as a solution to the Lyapunov equation (33), the difference $\Sigma^G_\Theta - \Sigma^*$ is positive semi-definite.
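The optimality claim can be checked numerically on a small example. The sketch below is illustrative only: the $2\times 2$ matrices `A` and `Sigma_E` are hypothetical stand-ins for $A(\theta^*)$ and $\Sigma_E$, and the Lyapunov equation is taken in its standard form $M\Sigma + \Sigma M^{\top} + G\Sigma_E G^{\top} = 0$ with $M = GA + \tfrac12 I$, solved here by vectorization with Kronecker products.

```python
import numpy as np

def asymptotic_covariance(G, A, Sigma_E):
    """Solve M S + S M^T + G Sigma_E G^T = 0 for S, where M = G A + I/2."""
    d = A.shape[0]
    M = G @ A + 0.5 * np.eye(d)
    if np.max(np.linalg.eigvals(M).real) >= 0.0:
        raise ValueError("eigenvalues of G A must have real part below -1/2")
    # Column-major vec: vec(M S) = (I kron M) vec(S), vec(S M^T) = (M kron I) vec(S)
    K = np.kron(np.eye(d), M) + np.kron(M, np.eye(d))
    rhs = -(G @ Sigma_E @ G.T).reshape(-1, order="F")
    return np.linalg.solve(K, rhs).reshape((d, d), order="F")

# Hypothetical stand-ins for A(theta*) and Sigma_E.
A = np.array([[-1.0, 0.3],
              [0.0, -2.0]])
Sigma_E = np.array([[1.0, 0.2],
                    [0.2, 0.5]])

# Gain G* = -A^{-1}: the Lyapunov solution reduces to formula (37).
A_inv = np.linalg.inv(A)
Sigma_zap = asymptotic_covariance(-A_inv, A, Sigma_E)
Sigma_star = A_inv @ Sigma_E @ A_inv.T

# A scalar gain G = I also meets the eigenvalue condition for this A.
Sigma_identity = asymptotic_covariance(np.eye(2), A, Sigma_E)
gap_eigenvalues = np.linalg.eigvalsh(Sigma_identity - Sigma_star)
```

For this example the eigenvalues of the gap are non-negative up to rounding, in line with the positive semi-definite ordering. A gain such as $G = 0.2\,I$ violates the eigenvalue condition, and the helper raises an error instead of returning a covariance.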

The Zap Q algorithm is specifically designed to achieve the optimal asymptotic covariance. A full proof of optimality will require extra effort. Thm. III.5 tells us that we have the required convergence for this algorithm. Provided we can obtain additional tightness bounds for the scaled error $\sqrt{n}\,(\theta_n - \theta^*)$, we obtain a functional Central Limit Theorem with optimal covariance $\Sigma^*$ [2]. Minor additional bounds ensure convergence of (32) to the optimal covariance $\Sigma^*$.

The next section is dedicated to the proof of Thm. III.5.

## IV Proof of Theorem III.5

### IV-A Overview of the Proof

Unlike the martingale difference assumptions in standard stochastic approximation [2], the noise in our algorithm is Markovian. The first part of this section establishes that our noise sequences satisfy the so-called ODE-friendly property [14], so that their asymptotic effect on the parameter update is zero. This enables the argument that the gain matrices $\{\hat{A}_n\}$ are close to their equilibrium over the fast time scale defined by $\{\gamma_n\}$. We then go back to the slow time scale defined by $\{\alpha_n\}$, and obtain the ODE approximations for $\{\theta_n\}$ and the expected projected cost $\{b(\theta_n)\}$.

### IV-B ODE Analysis

The remainder of this section is devoted to the proof of the ODE approximation (30). The construction of an approximating ODE involves first defining a continuous time process $\bar{w}$. Denote

$$t_n = \sum_{i=1}^{n}\alpha_i, \quad n \ge 1, \qquad t_0 = 0, \tag{38}$$

and define $\bar{w}_{t_n} = \theta_n$ at these time points, with the definition extended to $\mathbb{R}_+$ via linear interpolation.

Along with the piecewise linear continuous-time process $\{\bar{w}_t\}$, denote by $\{\bar{A}_t\}$ the piecewise linear continuous-time process defined similarly, with $\bar{A}_{t_n} = \hat{A}_n$, $n \ge 1$. Furthermore, for each $t \ge 0$, denote

$$\bar{b}_t \equiv b(\bar{w}_t) := -A(\bar{w}_t)\bar{w}_t - \beta\,\overline{c}_s(\bar{w}_t)$$

To construct an ODE, it is convenient first to obtain an alternative and suggestive representation for the pair of equations (22,23).

The following definition is needed in Lemma IV.1:

ODE-friendly sequence: A vector-valued sequence of random variables $\{E_k\}$ will be called ODE-friendly if it admits the decomposition,

$$E_k = \Delta_k + T_k - T_{k-1} + \varepsilon_k, \quad k \ge 1 \tag{39}$$

in which:

1. is a martingale-difference sequence satisfying a.s. for all

2. $\{T_k\}$ is a bounded sequence

3. The final sequence $\{\varepsilon_k\}$ is bounded and satisfies:

$$\sum_{k=1}^{\infty}\gamma_k\|\varepsilon_k\| < \infty \quad \text{a.s.} \tag{40}$$

Lemma IV.1 establishes that the error sequences that appear in the updates for $\theta_n$ and $\hat{A}_n$ are ODE-friendly.

###### Lemma IV.1

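The pair of equations (22, 23) can be expressed,

$$\theta_{n+1} = \theta_n - \alpha_{n+1}\hat{A}_{n+1}^{-1}\big[A(\theta_n)\theta_n + \beta\,\overline{c}_s(\theta_n) + b^* + E^A_{n+1}\theta_n + E^\theta_{n+1}\big] \tag{41a}$$

$$\hat{A}_{n+1} = \hat{A}_n + \gamma_{n+1}\big[A(\theta_n) - \hat{A}_n + E^A_{n+1}\big] \tag{41b}$$

in which the sequences $\{E^A_n, E^\theta_n\}$ are ODE-friendly.

Based on (22, 23), the two error sequences in (41) are

$$E^\theta_{n+1} = c(X_n)\psi(X_n) - b^* + \beta\,\psi(X_n)S^c_{\theta_n}c_s(X_{n+1}) - \beta\,\overline{c}_s(\theta_n)$$

$$E^A_{n+1} = A_{n+1} - A(\theta_n)$$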

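The argument proceeds by decomposing the noise sequences into tractable terms, each of which is shown to be ODE-friendly by standard arguments based on solutions to Poisson’s equation [11]. We only present the treatment for $\{E^A_n\}$ here, through Lemmas IV.2 to IV.5, since the same technique can be applied to the remaining sequences.

Denote for $\theta \in \mathbb{R}^d$,

$$M_{\psi,\theta}(n+1) = \psi(X_n)\,S_\theta\psi^{\top}(X_{n+1})$$

For the noise sequence in (41b), we have

$$A_{n+1} - A(\theta_n) = \psi(X_n)\big[\beta S_{\theta_n}\psi(X_{n+1}) - \psi(X_n)\big]^{\top} - E\Big[\psi(X_n)\big[\beta S_{\theta_n}\psi(X_{n+1}) - \psi(X_n)\big]^{\top}\Big] = \tilde{A}^1_{n+1} + \beta\tilde{A}^2_{n+1} + \beta\tilde{A}^3_{n+1} \tag{42}$$

where

$$\tilde{A}^1_{n+1} = -\psi(X_n)\psi^{\top}(X_n) + E[\psi(X_n)\psi^{\top}(X_n)]$$

$$\tilde{A}^2_{n+1} = M_{\psi,\theta_n}(n+1) - E[M_{\psi,\theta_n}(n+1) \mid \mathcal{F}_n]$$

$$\tilde{A}^3_{n+1} = E[M_{\psi,\theta_n}(n+1) \mid \mathcal{F}_n] - E[M_{\psi,\theta_n}(n+1)]$$

###### Lemma IV.2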

Suppose that $f$ is a bounded function on $\mathsf{X}$ with zero mean. Then the sequence $\{f(X_n)\}$ is ODE-friendly, with $\{\varepsilon_k\}$ equal to zero, and $\{T_k\}$ defined through $\hat{f}$, the solution to Poisson’s equation with forcing function $f$.
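The decomposition behind this lemma can be illustrated on a finite chain, where Poisson’s equation $\hat{f} - P\hat{f} = f$ is a linear system. The sketch below is an assumption-laden illustration rather than the construction used in the proof: the three-state matrix `P` and the function `f` are hypothetical, and the solution is obtained from the fundamental matrix $Z = (I - P + \mathbf{1}\pi^{\top})^{-1}$ of a finite ergodic chain.

```python
import numpy as np

def stationary_distribution(P):
    """Solve pi P = pi with sum(pi) = 1 by least squares (finite ergodic chain)."""
    n = P.shape[0]
    lhs = np.vstack([(np.eye(n) - P).T, np.ones((1, n))])
    rhs = np.concatenate([np.zeros(n), [1.0]])
    pi, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)
    return pi

def poisson_solution(P, f):
    """Return (f_hat, f0): f0 is f centered under pi, and f_hat - P f_hat = f0."""
    n = P.shape[0]
    pi = stationary_distribution(P)
    f0 = f - pi @ f
    Z = np.linalg.inv(np.eye(n) - P + np.outer(np.ones(n), pi))
    return Z @ f0, f0

# Hypothetical chain and forcing function.
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
f = np.array([1.0, -2.0, 4.0])
f_hat, f0 = poisson_solution(P, f)

# Simulate a path and split the partial sum of f0 into a martingale part
# plus a telescoping (bounded) part.
rng = np.random.default_rng(1)
path = [0]
for _ in range(1000):
    path.append(rng.choice(3, p=P[path[-1]]))
path = np.array(path)

Pf_hat = P @ f_hat
martingale_diffs = f_hat[path[1:]] - Pf_hat[path[:-1]]
partial_sum = f0[path[:-1]].sum()
telescoped = martingale_diffs.sum() + f_hat[path[0]] - f_hat[path[-1]]
```

The two quantities `partial_sum` and `telescoped` agree up to rounding on every path, which is the finite-state analogue of writing a zero-mean function of the chain as a martingale difference plus a telescoping term.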

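The sequence $\{\tilde{A}^1_n\}$ is then ODE-friendly, since it is a bounded function of the state with zero mean. Before we show the same for the sequences $\{\tilde{A}^2_n\}$ and $\{\tilde{A}^3_n\}$, we need two preliminary results: the Lipschitz continuity in $\theta$ of the conditional mean of $M_{\psi,\theta}$, and that this continuity is preserved in associated Poisson equation solutions.

###### Lemma IV.3

There is a deterministic constant $\ell_M$ such that, with probability one, for all $\theta^1, \theta^2 \in \mathbb{R}^d$,

$$\big\|E[M_{\psi,\theta^1}(n+1) - M_{\psi,\theta^2}(n+1) \mid \mathcal{F}_n]\big\| \le \ell_M\|\theta^1 - \theta^2\|$$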

The second result is similar to Lemma IV.2. Both are consequences of the fact that the fundamental kernel $Z$ defined in Section 20.1 of [9] is a bounded linear operator on $L_\infty$ when the Markov chain is uniformly ergodic.

###### Lemma IV.4

There is a constant $B_Z$ such that the following holds: For any family of zero mean functions $\{f_\theta\}$ satisfying, for some constants $B_F$, $\ell_F$,

$$\sup_x\|f_\theta(x)\| \le B_F, \qquad \sup_x\|f_{\theta^1}(x) - f_{\theta^2}(x)\| \le \ell_F\|\theta^1 - \theta^2\|$$

for all $\theta, \theta^1, \theta^2$, there are solutions $\{\hat{f}_\theta\}$ to the corresponding Poisson’s equations, with zero mean, and satisfying

$$\sup_x\|\hat{f}_{\theta^1}(x) - \hat{f}_{\theta^2}(x)\| \le B_Z\,\ell_F\|\theta^1 - \theta^2\|$$

Now we are ready to claim:

###### Lemma IV.5

The sequences $\{\tilde{A}^2_n\}$ and $\{\tilde{A}^3_n\}$ are ODE-friendly.

The following Lemma shows that the matrix gain $\hat{A}_n$, recursively obtained by (22), approximates the mean $A(\theta_n)$.

###### Lemma IV.6

Suppose the sequence is ODE-friendly. Then,

1. Consequently, only finitely often, and

With the definition of ODE approximation below (31), we have:

###### Lemma IV.7

The ODE approximation for $\{\theta_n\}$ holds: with probability one, $\bar{w}$ asymptotically tracks the ODE:

$$\frac{d}{dt}w(t) = -A(w(t))^{-1}\big[b^* - b(w(t))\big] \tag{43}$$

For a fixed but arbitrary time horizon , we define two families of uniformly bounded and uniformly Lipschitz continuous functions: and . Sub-sequential limits of and are denoted and respectively.

We recast the ODE limit of the projected cost as follows:

###### Lemma IV.8

For any sub-sequential limits $\{w_t, b_t\}$,

1. They satisfy $b_t = b(w_t)$.

2. For a.e. $t \in [0, T]$,

$$\frac{d}{dt}b_t = -A(w_t)\frac{d}{dt}w_t = -b_t + b^* \tag{44}$$

##### Proof of Thm. III.5

Boundedness of the matrix gain sequences is established in Lemma IV.6. Together with the boundedness assumption (A4) on $\{\theta_n\}$, the ODE approximation is established in Lemma IV.8. Result (i) then follows from those two results using standard arguments from [2].

## V Numerical Results

In this section we illustrate the performance of the Zap Q-learning algorithm in comparison with existing techniques, on a finance problem that has been studied in prior work [6, 17]. We observe that the Zap algorithm performs very well, despite the fact that some of the technical assumptions made in Section III do not hold.

### V-A Finance model

The following finance example is used in [6, 17] to evaluate the performance of their algorithms for optimal stopping. The reader is referred to these references for complete details of the problem set-up.

The Markovian state process considered is the vector of ratios: , , in which is a geometric Brownian motion (derived from an exogenous price-process). This uncontrolled Markov chain is positive Harris recurrent on the state space , so is not compact. The Markov chain is uniformly ergodic.

The “time to exercise” is modeled as a stopping time . The associated expected reward is defined as , with and fixed. The objective of finding a policy that maximizes the expected reward is modeled as an optimal stopping time problem.

The value function is defined to be the infimum (3), with $c \equiv 0$ and $c_s \equiv -r$ (the objective in Section I is to minimize the expected cost, while here, the objective is to maximize the expected reward). The associated Q-function is defined using (4), and the associated optimal policy using (5).

When the Q-function is linearly approximated using (9), for a fixed parameter vector $\theta$, the associated value function can be expressed:

$$h^{\phi^\theta}(x) := E\big[\beta^{\tau_\theta} r(X_{\tau_\theta}) \mid X_0 = x\big], \tag{45}$$

where,

$$\tau_\theta := \min\{n : \phi^\theta(X_n) = 1\} \tag{46}$$

$$\phi^\theta(x) := I\{r(x) \ge Q^\theta(x)\}$$

Given a parameter estimate $\theta$ and the initial state $X_0 = x$, the corresponding average reward $h^{\phi^\theta}(x)$ was estimated using Monte-Carlo in the numerical experiments that follow.

### V-B Approximation & Algorithms

Along with the Zap Q-learning algorithm, we also implement the fixed point Kalman filter algorithm of [6] to estimate $\theta^*$. This algorithm is given by the update equations (19) and (20). The computational as well as storage complexities of the least squares Q-learning algorithm (and its variants) [20] are too high for implementation.

### V-C Implementation Details

The experimental setting of [6, 17] is used to define the set of basis functions and other parameters. We choose the dimension of the parameter vector , with the basis functions defined in [6]. The objective here is to compare the performances of the fixed point Kalman filter algorithm with the Zap-Q learning algorithm in terms of the resulting average reward (45).

Recall that the step-size for the Zap Q-learning algorithm is given in (24). We set for all implementations of the Zap algorithm, but similar to what is done in [6], we experiment with different choices for $\alpha_n$. Specifically, in addition to $\alpha_n = 1/n$, we let:

$$\alpha_n = \frac{g}{b + n} \tag{47}$$

with and experiment with and . In addition, we also implement Zap with . Based on the discussion in Section III-D, we expect this choice of step-size sequences to result in infinite asymptotic variance.

In the implementation of the fixed point Kalman filter algorithm, as suggested by the authors, we choose step-size for the matrix gain update rule in (20), and step-size of the form (47) for the parameter update in (19). Specifically, we let , and and .

The number of iterations for each of the algorithm is fixed to be .

### V-D Experimental Results

The average reward histogram was obtained by the following steps: We simulate parallel simulations of each of the algorithms to obtain as many estimates of . Each of these estimates defines a policy defined in (46). We then estimate the corresponding average reward defined in (45), with .

Along with the average discounted rewards, we also plot the histograms to visualize the asymptotic variance (32), for each of the algorithms. The theoretical values of the covariance matrices $\Sigma^*$ and $\Sigma^G_\Theta$ were estimated through the following steps: The matrices $A(\theta^*)$ and $\Sigma_\psi$ (the limit of the matrix gain used in [6]) were estimated via Monte-Carlo. Estimation of $A(\theta^*)$ requires an estimate of $\theta^*$; this was taken to be the estimate obtained using the Zap-Q algorithm. This estimate of $\theta^*$ was also used to estimate the covariance matrix $\Sigma_E$ defined in (36) using the batch means method. The matrices $\Sigma^G_\Theta$ and $\Sigma^*$ were then obtained using (33) and (37), respectively.
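The batch means step can be sketched as follows. This is a generic illustration, not the authors' implementation: the helper name `batch_means_covariance`, the batch size, and the scalar autoregressive test sequence are all assumptions, chosen because the long-run variance of that sequence is known in closed form.

```python
import numpy as np

def batch_means_covariance(samples, batch_size):
    """Estimate lim (1/T) E[S_T S_T^T] from one trajectory of shape (T, d)."""
    samples = np.asarray(samples, dtype=float)
    T, d = samples.shape
    n_batches = T // batch_size
    if n_batches < 2:
        raise ValueError("need at least two batches")
    used = samples[: n_batches * batch_size]
    used = used - used.mean(axis=0)   # center; the noise has zero mean in theory
    batch_sums = used.reshape(n_batches, batch_size, d).sum(axis=1)
    # Average of (1/batch_size) * S_b S_b^T over the batches
    return batch_sums.T @ batch_sums / (n_batches * batch_size)

# Hypothetical test sequence: x[n] = 0.5 x[n-1] + w[n] with unit-variance noise.
# Its long-run variance is 1 / (1 - 0.5)^2 = 4.
rng = np.random.default_rng(0)
T = 200_000
w = rng.standard_normal(T)
x = np.zeros(T)
for n in range(1, T):
    x[n] = 0.5 * x[n - 1] + w[n]
estimate = batch_means_covariance(x[:, None], batch_size=500)
```

Summing within batches before forming outer products is what captures the correlation between nearby samples; a plain sample covariance of `x` would instead estimate the marginal variance, which is smaller for this sequence.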

Fig. 1 contains the histograms of the average rewards obtained using the above algorithms. Fig. 2 contains the histograms of along with a plot of the theoretical prediction.

It was observed that the eigenvalues of the matrix have a wide spread: The condition-number is of the order . Despite a badly conditioned matrix gain, it is observed in Fig. 1 that the average rewards of the Zap-Q algorithms are better than those of their competitors. It is also observed that the algorithm is robust to the choice of step-sizes. In Fig. 2 we observe that the asymptotic behavior of the algorithms is a close match to the theoretical prediction. Specifically, the large variance of Zap-Q with step-size confirms that the asymptotic variance is very large (ideally, infinity) if the eigenvalues of the matrix $GA(\theta^*)$ do not all satisfy $\mathrm{Re}(\lambda) < -\tfrac12$.

## VI Conclusion

In this paper, we extend the theory for the Zap Q-learning algorithm to a linear function approximation setting, with application to optimal stopping. We prove