# Finite-Sample Analysis for SARSA and Q-Learning with Linear Function Approximation

Though the convergence of major reinforcement learning algorithms has been extensively studied, the finite-sample analysis to further characterize the convergence rate in terms of the sample complexity for problems with continuous state space is still very limited. Such a type of analysis is especially challenging for algorithms with dynamically changing learning policies and under non-i.i.d. sampled data. In this paper, we present the first finite-sample analysis for the SARSA algorithm and its minimax variant (for zero-sum Markov games), with a single sample path and linear function approximation. To establish our results, we develop a novel technique to bound the gradient bias for dynamically changing learning policies, which can be of independent interest. We further provide finite-sample bounds for Q-learning and its minimax variant. Comparison of our result with the existing finite-sample bound indicates that linear function approximation achieves order-level lower sample complexity than the nearest neighbor approach.

## 1 Introduction

The major success of reinforcement learning (RL) lies in the development of various algorithms for finding policies that attain desirable (often optimal) cumulative rewards over time (Sutton & Barto, 2018). In particular, model-free examples, which do not assume knowledge of the underlying Markov decision process (MDP), include the temporal-difference (TD) (Sutton, 1988), SARSA (Rummery & Niranjan, 1994), Q-learning (Watkins & Dayan, 1992), and more recently deep Q-network (DQN) (Mnih et al., 2015) and actor-critic (A3C) (Mnih et al., 2016) algorithms. Among them, algorithms that incorporate parametrized function approximation have gained extensive attention due to their efficiency and scalability under continuous state spaces in practice.

Focusing on the basic TD, SARSA and Q-learning algorithms (see Section 1.2 for a summary of their variants) with continuous state space and with function approximation, the theoretical asymptotic convergence has been established for TD algorithms with linear function approximation by (Tsitsiklis & Roy, 1997), for Q-learning and SARSA with linear function approximation by (Melo et al., 2008), and for Q-learning with kernel-based approximation by (Ormoneit & Glynn, 2002; Ormoneit & Sen, 2002). Furthermore, finite-sample analysis of the convergence rate in terms of the sample complexity has been provided for TD with function approximation in (Dalal et al., 2018a; Lakshminarayanan & Szepesvari, 2018), under the assumption that sample observations are independent and identically distributed (i.i.d.).

Under the non-i.i.d. assumption, (Bhandari et al., 2018) recently provided a finite-sample analysis for TD algorithm with linear function approximation, which was further shown to be valid without modification to Q-learning for high-dimensional optimal stopping problems. Furthermore, (Shah & Xie, 2018) proposed a Q-learning algorithm based on the nearest neighbor approach. In fact, the finite-sample analysis for RL algorithms under the non-i.i.d. assumption is still a largely open direction, and the focus of this paper is on the following three open and fundamental problems.

• Under non-i.i.d. observations, existing studies provided finite-sample analysis only for TD and Q-learning algorithms, where samples are taken under a fixed policy. The existing analysis tools are not sufficient to handle the additional challenges due to dynamically changing sample distributions arising in algorithms such as SARSA.

• The finite-sample analysis in (Shah & Xie, 2018) for Q-learning with the nearest neighbor approach relies closely on the state discretization, which can suffer from slow convergence to attain a high accuracy solution in practice. It is thus of interest to provide finite-sample analysis for Q-learning with linear function approximation, which requires different analysis from that in (Shah & Xie, 2018).

• The finite-sample analysis for two-player zero-sum MDP games has been provided for a deep Q-learning model in (Yang et al., 2019) (see a summary of other studies in Section 1.2), but under i.i.d. observations. This motivates us to provide the finite-sample analysis for minimax SARSA and Q-learning algorithms under non-i.i.d. observations.

### 1.1 Contributions

Our contributions are summarized as follows.

• We develop the first finite-sample analysis for the on-policy algorithm SARSA with a continuous state space and linear function approximation, which is applicable to a single sample path and non-i.i.d. data. To accomplish this analysis, we propose a novel technique to handle on-policy algorithms with dynamically changing policies, which may be of independent interest. Existing studies (e.g., (Bhandari et al., 2018)) on a fixed policy exploited the uniform ergodicity of the Markov chain to decouple the dependency on the Markovian noise. But uniform ergodicity does not hold in general for the Markov chain induced by a dynamically changing policy. Our approach constructs an auxiliary uniformly ergodic Markov chain that approximates the true MDP to facilitate the analysis.

• We further develop the finite-sample analysis for Q-learning with a continuous state space and linear function approximation. In contrast to many existing studies which assumed i.i.d. samples, our analysis is applicable to the online case with a single sample path and non-i.i.d. data. Furthermore, our analysis of both SARSA and Q-learning indicates that linear function approximation yields an order-level faster convergence rate than the nearest neighbor approach (Shah & Xie, 2018).

• By leveraging the aforementioned technique we develop for single-agent algorithms, we further provide the first finite-sample analysis for the minimax SARSA and Q-learning algorithms for two-player zero-sum Markov games with a single sample path and non-i.i.d. samples.

### 1.2 Related Work

Due to a vast amount of literature on theoretical analysis of RL algorithms, we here focus only on highly relevant studies, which investigated model-free RL algorithms for solving continuous state-space MDP problems.

Fitted value iteration algorithms: The least-squares temporal difference learning (LSTD) algorithms have been extensively studied in (Bradtke & Barto, 1996; Boyan, 2002; Munos & Szepesvari, 2008; Lazaric et al., 2010; Ghavamzadeh et al., 2010; Pires & Szepesvari, 2012; Prashanth et al., 2013; Tagorti & Scherrer, 2015; Tu & Recht, 2018) and references therein. These algorithms follow the TD type of update, with each iteration solving a least-squares regression problem based on a batch of data in order to fit the approximate function model. A differential LSTD algorithm was recently proposed and studied in (Devraj et al., 2018).

Fitted policy iteration algorithms: Approximate (fitted) policy iteration (API) algorithms further extend fitted value iteration with policy improvement. Several variants of such a type were studied, which adopt different objective functions, including least-squares policy iteration (LSPI) algorithms in (Lagoudakis & Parr, 2003; Lazaric et al., 2012; Yang et al., 2019), fitted policy iteration based on Bellman residual minimization (BRM) in (Antos et al., 2008; Farahmand et al., 2010), and classification-based policy iteration algorithm in (Lazaric et al., 2016).

Our study here focuses on the SARSA and Q-learning algorithms in the online setting, with each iteration updated based on one data sample, whereas the above fitted value and policy iteration algorithms use full batches of data samples at each iteration for model fitting. Hence, our analysis tools are very different from those used for these two types of algorithms.

Gradient TD algorithms: The off-policy gradient TD algorithm with linear function approximation was proposed in (Sutton et al., 2009a, b) based on gradient descent for minimizing the mean-square projected Bellman error. The convergence of gradient TD algorithms was established in (Sutton et al., 2009a, b), and the finite-sample analysis was further provided in (Dalal et al., 2018b; Liu et al., 2015; Touati et al., 2018). All these analyses were based on the i.i.d. data assumption, whereas our paper focuses on non-i.i.d. scenarios.

Two-player zero-sum MDP game: The zero-sum Markov game problem has been studied extensively for discrete state space models in (Littman, 1994; Bowling, 2001; Conitzer & Sandholm, 2007; Srinivasan et al., 2018; Wei et al., 2017) and references therein, where the convergence and finite-sample analysis have been developed. This problem was further studied with function approximation in (Prasad et al., 2015; Perolat et al., 2018), where only the convergence was established; the finite-sample behavior was not characterized. Fitted (i.e., batch) algorithms have also been designed and studied for zero-sum Markov game problems in (Lagoudakis & Parr, 2002; Perolat et al., 2016b, a, 2018; Zhang et al., 2018; Yang et al., 2019). Our study here provides the first finite-sample analysis for the online SARSA and Q-learning algorithms (differently from the aforementioned fitted algorithms) under the non-i.i.d. data assumption, for which the previously developed technical tools are not applicable.

## 2 Preliminaries

In this section, we introduce the basic MDP problem, and the linear function approximation.

### 2.1 Markov Decision Process

Consider a general reinforcement learning setting, where an agent interacts with a stochastic environment modeled as a Markov decision process (MDP). Specifically, we consider an MDP that consists of $(\mathcal{X}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{X}$ is a compact continuous state space and $\mathcal{A}$ is a finite action set. We further let $X_t$ denote the state at time $t$, and $A_t$ denote the action at time $t$. Then the measure $P$ defines the action-dependent transition kernel for the underlying Markov chain $\{X_t\}$: $P(X_{t+1} \in U \,|\, X_t = x, A_t = a) = \int_U P(dy|x,a)$ for any measurable set $U \subseteq \mathcal{X}$. The one-stage reward at time $t$ is given by $r(X_t, A_t)$, where $r$ is the reward function, which is assumed to be uniformly bounded, i.e., $|r(x,a)| \le r_{\max}$ for any $(x,a) \in \mathcal{X} \times \mathcal{A}$. Finally, $\gamma \in (0,1)$ denotes the discount factor.

A stationary Markov policy $\pi(\cdot|x)$ maps a state $x \in \mathcal{X}$ to a probability distribution over $\mathcal{A}$, which does not depend on time. For a policy $\pi$, the corresponding value function $V^\pi$ is defined as the expected total discounted reward obtained by actions executed according to $\pi$:

$$V^\pi(x_0) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r(X_t, A_t) \,\Big|\, X_0 = x_0\Big].$$

The action-value function $Q^\pi$ is defined as

$$Q^\pi(x,a) = r(x,a) + \gamma \int_{\mathcal{X}} P(dy|x,a)\, V^\pi(y).$$

The goal is to find an optimal policy that maximizes the value function from any initial state. The optimal value function is defined as $V^*(x) = \sup_\pi V^\pi(x)$ for all $x \in \mathcal{X}$. The optimal action-value function is defined as

$$Q^*(x,a) = \sup_\pi Q^\pi(x,a), \quad \forall (x,a) \in \mathcal{X} \times \mathcal{A}.$$

Based on $Q^*$, the optimal policy $\pi^*$ is the greedy policy with respect to $Q^*$. It can be verified that $V^*(x) = \max_{a \in \mathcal{A}} Q^*(x,a)$. The Bellman operator $H$ is defined as

$$(HQ)(x,a) = r(x,a) + \gamma \int_{\mathcal{X}} \max_{b \in \mathcal{A}} Q(y,b)\, P(dy|x,a).$$

It is clear that $H$ is a contraction in the sup norm $\|Q\|_\infty = \sup_{(x,a)} |Q(x,a)|$, and the optimal action-value function $Q^*$ is the fixed point of $H$ (Bertsekas, 2012).
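As a quick numerical sanity check of the contraction property, the sketch below applies $H$ on a small, randomly generated finite MDP (a hypothetical discretization; the sizes `n_states`, `n_actions` and the random kernel are our own illustrative choices, not from the paper) and verifies $\|HQ_1 - HQ_2\|_\infty \le \gamma \|Q_1 - Q_2\|_\infty$:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Random row-stochastic transition kernel P(y | x, a) and bounded rewards.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
r = rng.uniform(-1.0, 1.0, (n_states, n_actions))

def bellman(Q):
    """(HQ)(x,a) = r(x,a) + gamma * sum_y P(y|x,a) * max_b Q(y,b)."""
    return r + gamma * P @ Q.max(axis=1)

Q1 = rng.normal(size=(n_states, n_actions))
Q2 = rng.normal(size=(n_states, n_actions))

lhs = np.abs(bellman(Q1) - bellman(Q2)).max()   # ||HQ1 - HQ2||_inf
rhs = gamma * np.abs(Q1 - Q2).max()             # gamma * ||Q1 - Q2||_inf
assert lhs <= rhs + 1e-12                       # gamma-contraction holds
```

Iterating `bellman` from any initial table therefore converges geometrically to the fixed point, which is what value iteration exploits.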

### 2.2 Linear Function Approximation

Let $\mathcal{Q}$ be a family of real-valued functions defined on $\mathcal{X} \times \mathcal{A}$. We consider the problem where any function in $\mathcal{Q}$ is a linear combination of a set of $N$ fixed, linearly independent basis functions $\phi_i$, $i = 1, \dots, N$. Specifically, for $\theta \in \mathbb{R}^N$,

$$Q_\theta(x,a) = \sum_{i=1}^N \theta_i \phi_i(x,a) = \phi^T(x,a)\theta.$$

We assume that $\|\phi(x,a)\|_2 \le 1$ for any $(x,a)$, which can be ensured by normalizing the basis functions. The goal is to find a $Q_\theta$, with a compact representation in $\theta$, to approximate the optimal action-value function with a continuous state space.
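A minimal sketch of the parametrization $Q_\theta(x,a) = \phi^T(x,a)\theta$: the radial-basis feature map on $\mathcal{X} = [0,1]$ below is our own illustrative choice (the paper does not fix a basis), normalized so that $\|\phi(x,a)\|_2 \le 1$:

```python
import numpy as np

N = 8                                    # number of basis functions (illustrative)

def phi(x, a, n_actions=2):
    """Feature vector for continuous state x in [0, 1] and discrete action a."""
    per_action = N // n_actions
    centers = np.linspace(0.0, 1.0, per_action)
    feats = np.zeros(N)
    # One RBF block per action, so features for different actions don't overlap.
    feats[a * per_action:(a + 1) * per_action] = np.exp(-((x - centers) ** 2) / 0.1)
    norm = np.linalg.norm(feats)
    return feats / norm if norm > 0 else feats   # normalize: ||phi(x,a)||_2 <= 1

theta = np.zeros(N)
q = phi(0.3, 1) @ theta                  # Q_theta(x=0.3, a=1); zero for theta = 0
assert abs(q) < 1e-12
assert np.linalg.norm(phi(0.3, 1)) <= 1.0 + 1e-12
```

Any choice of normalized, linearly independent features fits the framework; the learning algorithms below only interact with the basis through inner products $\phi^T(x,a)\theta$.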

## 3 Finite-Sample Analysis for SARSA

As suggested in (Melo et al., 2008; Tsitsiklis & Roy, 1997; Perkins & Pendrith, 2002), on-policy algorithms may potentially yield more reliable convergence performance. In this section, we present our main result: a finite-sample analysis for the on-policy SARSA algorithm, viewed as a variation of the off-policy Q-learning algorithm. We further present our finite-sample analysis of the Q-learning algorithm as a performance comparison in Section 4. As will be seen, the sufficient condition for Q-learning to converge is stricter.

### 3.1 SARSA with Linear Function Approximation

We consider a $\theta$-dependent learning policy $\pi_{\theta_t}$, which changes with time. Specifically, the learning policy is $\epsilon$-greedy with respect to the Q-function $Q_{\theta_t}$. Suppose that $\{(x_t, a_t, r_t)\}_{t \ge 0}$ is a sampled trajectory of states, actions and rewards obtained from the MDP following the time-dependent learning policy $\pi_{\theta_t}$. Then the projected SARSA algorithm with linear function approximation updates $\theta_t$ as follows:

$$\theta_{t+1} = \mathrm{proj}_{2,R}\big(\theta_t + \alpha_t\, g_t(\theta_t)\big), \qquad (1)$$

where $\alpha_t$ is the step size at time $t$, $g_t(\theta_t) = \Delta_t\,\phi(x_t,a_t)$, $\Delta_t$ denotes the temporal difference at time $t$ given by $\Delta_t = r(x_t,a_t) + \gamma\,\phi^T(x_{t+1},a_{t+1})\theta_t - \phi^T(x_t,a_t)\theta_t$, and

$$\mathrm{proj}_{2,R}(\theta) := \arg\min_{\theta': \|\theta'\|_2 \le R} \|\theta - \theta'\|_2.$$

Here, the projection step is used to control the norm of the gradient $g_t(\theta_t)$, which is a commonly used technique to control the gradient bias (Bhandari et al., 2018; Kushner, 2010; Lacoste-Julien et al., 2012; Bubeck et al., 2015; Nemirovski et al., 2009). The basic idea is that with a decaying step size and a projected gradient, $\theta_t$ does not change too fast.

The convergence of this algorithm (without a projection operation) was established using an O.D.E. argument (Melo et al., 2008). However, the finite-sample analysis of the convergence still remains unsolved, which is the goal here.
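The update (1) can be sketched as follows. This is a hedged illustration, not the paper's experimental setup: the one-dimensional random-walk environment, the polynomial features, and the reward are illustrative stand-ins; the $\epsilon$-greedy policy and the $\ell_2$-ball projection follow the algorithm's description.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n_actions, gamma, eps, R = 6, 2, 0.9, 0.1, 10.0

def phi(x, a):
    f = np.zeros(N)
    f[a * 3:(a + 1) * 3] = [1.0, x, x * x]      # per-action polynomial block
    return f / max(np.linalg.norm(f), 1.0)      # keep ||phi(x,a)||_2 <= 1

def proj_2R(theta):
    """proj_{2,R}: rescale onto the l2 ball of radius R if outside it."""
    nrm = np.linalg.norm(theta)
    return theta if nrm <= R else theta * (R / nrm)

def eps_greedy(theta, x):
    """epsilon-greedy learning policy pi_theta."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax([phi(x, a) @ theta for a in range(n_actions)]))

def step(x, a):                                  # toy dynamics and reward
    x_next = min(max(x + (0.1 if a == 1 else -0.1) + 0.05 * rng.normal(), 0.0), 1.0)
    return x_next, 1.0 - abs(x_next - 0.8)       # reward peaks near x = 0.8

theta, x = np.zeros(N), 0.5
a = eps_greedy(theta, x)
for t in range(1, 2001):
    x_next, reward = step(x, a)
    a_next = eps_greedy(theta, x_next)           # on-policy: next action from pi_theta
    delta = reward + gamma * phi(x_next, a_next) @ theta - phi(x, a) @ theta
    theta = proj_2R(theta + (1.0 / t) * delta * phi(x, a))   # update (1)
    x, a = x_next, a_next

assert np.linalg.norm(theta) <= R + 1e-9         # projection keeps theta in the ball
```

Note the defining on-policy feature: the TD target uses the action $a_{t+1}$ actually drawn from $\pi_{\theta_t}$, so the sampling distribution drifts with $\theta_t$; this is exactly the coupling the analysis below must handle.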

For any $\theta$, the learning policy $\pi_\theta$ is assumed to be Lipschitz with respect to $\theta$:

$$|\pi_{\theta_1}(a|x) - \pi_{\theta_2}(a|x)| \le C\,\|\theta_1 - \theta_2\|_2, \qquad (2)$$

where $C$ is the Lipschitz constant. Since $\pi_\theta$ is an $\epsilon$-greedy policy with respect to $Q_\theta$, it is clear that $C$ decreases with a larger $\epsilon$, and equals zero if $\epsilon = 1$. We further assume that for any fixed $\theta$, the Markov chain $\{X_t\}$ induced by the learning policy $\pi_\theta$ and the transition kernel $P$ is uniformly ergodic with the invariant measure denoted by $P_\theta$, and satisfies the following assumption.

###### Assumption 1.

There are constants $m > 0$ and $\rho \in (0,1)$ such that

$$\sup_{x \in \mathcal{X}} d_{TV}\big(P(X_t \in \cdot \,|\, X_0 = x),\, P_\theta\big) \le m\rho^t, \quad \forall t \ge 0,$$

where $d_{TV}$ denotes the total-variation distance between two probability measures.

We note that such an assumption holds for irreducible and aperiodic Markov chains (Meyn & Tweedie, 2012). We further denote the probability measure over state-action pairs induced by the invariant measure $P_\theta$ and the learning policy $\pi_\theta$ accordingly.
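Assumption 1 can be illustrated numerically on a small chain: for an irreducible, aperiodic finite-state chain, the total-variation distance to the stationary distribution decays geometrically. The 3-state kernel below is an arbitrary illustrative example, not from the paper:

```python
import numpy as np

# An irreducible, aperiodic 3-state transition matrix (rows sum to 1).
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# Stationary distribution: left eigenvector of P for eigenvalue 1, normalized.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi /= pi.sum()

def tv_to_stationary(t):
    """sup_x d_TV(P^t(x, .), pi): worst-case TV distance after t steps."""
    Pt = np.linalg.matrix_power(P, t)
    return max(0.5 * np.abs(Pt[x] - pi).sum() for x in range(3))

d = [tv_to_stationary(t) for t in range(1, 8)]
assert all(b < a for a, b in zip(d, d[1:]))   # distance shrinks at every step
assert d[-1] < 1e-2                           # geometric decay: m * rho^t
```

The decay rate $\rho$ here is governed by the chain's second-largest eigenvalue modulus; Assumption 1 posits the analogous geometric mixing for the continuous-state chain induced by each fixed $\pi_\theta$.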

### 3.2 Finite-Sample Analysis

We first note the following facts, which are useful for our analysis. The limit point $\theta^*$ of the algorithm in (1) satisfies the following relation (Theorem 2 in (Melo et al., 2008)):

$$A_{\theta^*}\theta^* + b_{\theta^*} = 0,$$

where $A_\theta = \mathbb{E}_\theta\big[\phi(X,A)\big(\gamma\,\phi^T(Y,B) - \phi^T(X,A)\big)\big]$ and $b_\theta = \mathbb{E}_\theta[r(X,A)\,\phi(X,A)]$. Here $\mathbb{E}_\theta$ denotes the expectation in which $X$ follows the invariant probability measure $P_\theta$, $A$ is generated by the learning policy $\pi_\theta$, $Y$ is the subsequent state of $X$ following the action $A$, i.e., $Y$ follows the transition kernel $P(\cdot|X,A)$, and again, $B$ is generated by the learning policy $\pi_\theta$. Note that it has been shown in (Perkins & Precup, 2003; Tsitsiklis & Roy, 1997) that $A_\theta$ is negative definite.

Recall from (2) that the learning policy is assumed to be Lipschitz with respect to $\theta$ with Lipschitz constant $C$. We then make the following assumption, as justified in (Melo et al., 2008).

###### Assumption 2.

The Lipschitz constant $C$ is small enough so that the matrix introduced above is negative definite, with its largest eigenvalue denoted accordingly.

We then present our main result of the finite-sample bound on the convergence of SARSA.

###### Theorem 1.

Consider the projected SARSA algorithm with linear function approximation in (1) with $\|\theta^*\|_2 \le R$. Under Assumptions 1 and 2, with a decaying step size $\alpha_t$ of order $1/t$, we have

$$\mathbb{E}\|\theta_T - \theta^*\|_2^2 \le \frac{G^2\big(4C|\mathcal{A}|G\tau_0^2 + 12\tau_0 w_s + 1\big)(\log T + 1)}{w_s^2\, T} + \frac{4G^2\big(\tau_0 w_s + 2\rho^{-1}\big)}{w_s^2\, T}, \qquad (3)$$

where $\tau_0$ is the mixing-time constant specified in the proof.

###### Remark 1.

As $T \to \infty$, $\tau_0 = O(\log T)$, and therefore,

$$\mathbb{E}\|\theta_T - \theta^*\|_2^2 \lesssim \frac{\log^3 T}{T}.$$

From Theorem 1, it is clear that for a smaller Lipschitz constant $C$, the algorithm converges faster. However, to obtain a smaller $C$, the exploration parameter $\epsilon$ shall be increased. On the other hand, a smaller $\epsilon$ means that the learning policy, and thus the temporal difference update, is "more greedy", which results in a more accurate approximation of the optimal Q-function. Hence there is a trade-off between the convergence rate and the accuracy with which the algorithm approximates the optimal Q-function.

In order for Theorem 1 to hold, the projection radius $R$ shall be chosen such that $\|\theta^*\|_2 \le R$. However, $\theta^*$ is unknown in advance. We next provide an upper bound on $\|\theta^*\|_2$, which can be estimated in practice (Bhandari et al., 2018).

###### Lemma 1.

For the projected SARSA algorithm in (1), the limit point $\theta^*$ satisfies an upper bound on $\|\theta^*\|_2$ determined by the largest eigenvalue of $A_{\theta^*}$.

### 3.3 Outline of Technical Proof

The challenges in analyzing the SARSA algorithm are twofold: (1) non-i.i.d. samples; and (2) a dynamically changing learning policy. First, as per the updating rule in (1), there is a strong coupling between the sample path and $\theta_t$, since the samples are used to compute the gradient and then $\theta_t$, which introduces a strong dependency between $\theta_t$ and the samples, and thus a bias in the gradient $g_t(\theta_t)$. Moreover, differently from TD learning and Q-learning, $\theta_t$ is further used (through the policy $\pi_{\theta_t}$) to generate the subsequent actions, which makes the dependency even stronger. Although the convergence can still be established using the O.D.E. approach (Benveniste et al., 2012; Melo et al., 2008), in order to derive a finite-sample analysis, the stochastic bias in the gradient needs to be explicitly characterized, which makes the problem challenging. Second, as $\theta_t$ updates, the transition kernel for the state-action pair changes with time. The analysis in (Bhandari et al., 2018) relies on the fact that the learning policy is fixed, so that the Markov process reaches its stationary distribution quickly. In (Perkins & Precup, 2003), an episodic SARSA algorithm is studied, where within each episode the learning policy is fixed, only the Q-function of the learning policy is updated, and the learning policy is updated only at the end of each episode. Therefore, within each episode, the Markov process can reach its stationary distribution so that the analysis can be conducted. The SARSA algorithm studied here does not possess these nice properties, since the learning policy changes at each time step. Thus, to provide a finite-sample analysis, we design a new uniformly ergodic Markov chain to approximate the original Markov chain induced by the SARSA algorithm. Using such an approach, the gradient bias can be explicitly characterized. To illustrate our idea of the proof, we provide a sketch; the detailed proof can be found in the supplemental materials.

###### Proof sketch.

We sketch the key steps in our proof. We first introduce some notation. For any fixed $\theta$, define $\bar{g}(\theta)$ as the expectation of the update direction when the state follows the stationary distribution $P_\theta$, and the actions and the subsequent state are generated according to the policy $\pi_\theta$ and the transition kernel $P$. Here, $\bar{g}(\theta)$ can be interpreted as the noiseless gradient at $\theta$. We then define $\Lambda_t(\theta) = \langle \theta - \theta^*,\, g_t(\theta) - \bar{g}(\theta)\rangle$. Thus, $\Lambda_t(\theta_t)$ measures the bias caused by using non-i.i.d. samples to estimate the gradient.

Step 1. Error decomposition. The error at each time step can be decomposed recursively as follows:

$$\mathbb{E}\big[\|\theta_{t+1} - \theta^*\|_2^2\big] \le \mathbb{E}\big[\|\theta_t - \theta^*\|_2^2\big] + 2\alpha_t\, \mathbb{E}\big[\langle \theta_t - \theta^*,\, \bar{g}(\theta_t) - \bar{g}(\theta^*)\rangle\big] + \alpha_t^2\, \mathbb{E}\big[\|g_t(\theta_t)\|_2^2\big] + 2\alpha_t\, \mathbb{E}\big[\Lambda_t(\theta_t)\big]. \qquad (4)$$

Step 2. Gradient descent type analysis. The first three terms in (4) mimic the analysis of the gradient descent algorithm without noise, because the accurate gradient at $\theta_t$ is used.

Due to the projection step in (1), $\|g_t(\theta_t)\|_2$ is upper bounded by $G$. It can also be shown that $\bar{g}$ is Lipschitz. For a small enough Lipschitz constant $C$, i.e., when the learning policy is smooth enough with respect to $\theta$, the mean-update matrix is negative definite. Then, we have

$$\mathbb{E}\big[\langle \theta_t - \theta^*,\, \bar{g}(\theta_t) - \bar{g}(\theta^*)\rangle\big] \le \lambda_s\, \mathbb{E}\big[\|\theta_t - \theta^*\|_2^2\big], \qquad (5)$$

where $\lambda_s < 0$ is the largest eigenvalue from Assumption 2.

Step 3. Stochastic bias analysis. This step contains our major technical developments. The last term in (4) corresponds to the bias caused by using a single sample path with non-i.i.d. data and dynamically changing learning policies. For convenience, we rewrite $\Lambda_t(\theta_t)$ as $\Lambda_t(\theta_t, O_t)$, where $O_t$ denotes the tuple of observations used at time $t$. This term is very challenging to bound due to the strong dependency between $\theta_t$ and $O_t$.

We first show that $\Lambda_t(\theta, O_t)$ is Lipschitz in $\theta$. Due to the projection step, $\theta_t$ changes slowly with $t$. Combining the two facts, we can show that for any $\tau < t$,

$$\Lambda_t(\theta_t, O_t) \le \Lambda_t(\theta_{t-\tau}, O_t) + 6G^2 \sum_{i=t-\tau}^{t-1} \alpha_i. \qquad (6)$$

Such a step is intended to decouple the dependency between $\theta_t$ and $O_t$ by considering $\theta_{t-\tau}$ and $O_t$ instead. If the Markov chain induced by SARSA were uniformly ergodic and satisfied Assumption 1, then given any $\theta_{t-\tau}$, $O_t$ would reach its stationary distribution quickly for large $\tau$. However, such an argument is not necessarily true, since the learning policy changes with time.

Our idea is to construct an auxiliary Markov chain to assist our proof. Consider the following new Markov chain. Before time $t-\tau$, the states and actions are generated according to the SARSA algorithm, but after time $t-\tau$, the learning policy is kept fixed at $\pi_{\theta_{t-\tau}}$ to generate all the subsequent actions. We then denote by $\tilde{O}_t$ the observations of the new Markov chain at time $t$ and time $t+1$. For this new Markov chain, for large $\tau$, $\tilde{O}_t$ approaches the stationary distribution induced by $\pi_{\theta_{t-\tau}}$ and $P$. Thus, it can be shown that

$$\mathbb{E}\big[\Lambda_t(\theta_{t-\tau}, \tilde{O}_t)\big] \le 4G^2 m \rho^{\tau-1}. \qquad (7)$$

The next step is to bound the difference between the Markov chain generated by the SARSA algorithm and the auxiliary Markov chain that we construct. Since the learning policy changes slowly, due to its Lipschitz property and the decaying step size $\alpha_t$, the two Markov chains should not deviate from each other too much. It can be shown that

$$\mathbb{E}\big[\Lambda_t(\theta_{t-\tau}, O_t)\big] - \mathbb{E}\big[\Lambda_t(\theta_{t-\tau}, \tilde{O}_t)\big] \le \frac{2C|\mathcal{A}|G^3\tau}{w_s}\cdot\frac{\log t}{t-\tau}. \qquad (8)$$

Combining (6), (7) and (8) yields an upper bound on $\mathbb{E}[\Lambda_t(\theta_t)]$.

Step 4. Putting the first three steps together and recursively applying Step 1 complete the proof. ∎
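Schematically, and with constants suppressed (a hedged sketch with placeholder constants $c_1, c_2$, not the paper's exact derivation), Steps 1-3 combine into a one-step recursion whose unrolling gives the stated rate:

```latex
% One-step recursion after substituting (5)-(8) into (4), with tau = O(log t):
\mathbb{E}\|\theta_{t+1}-\theta^*\|_2^2
  \;\le\; \Big(1 - \frac{c_1}{t}\Big)\,\mathbb{E}\|\theta_t-\theta^*\|_2^2
  \;+\; \frac{c_2\,\tau^2 \log t}{t^2},
  \qquad \tau = O(\log t).
% Unrolling from t = 1 to T with a step size of order 1/t then yields
\mathbb{E}\|\theta_T-\theta^*\|_2^2 \;=\; O\!\Big(\frac{\log^3 T}{T}\Big).
```

The $\log^3 T$ factor can be read off directly: $\tau^2 \log t$ contributes $\log^2 t \cdot \log t$ per step, matching Remark 1.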

## 4 Finite-Sample Analysis for Q-Learning

### 4.1 Q-Learning with Linear Function Approximation

Let $\pi$ be a fixed stationary learning policy. Suppose that $\{(x_t, a_t, r_t)\}_{t \ge 0}$ is a sampled trajectory of states, actions and rewards obtained from the MDP using the policy $\pi$. We consider the projected Q-learning algorithm with the following update rule:

$$\theta_{t+1} = \mathrm{proj}_{2,R}\big(\theta_t + \alpha_t\, g_t(\theta_t)\big), \qquad (9)$$

where $\alpha_t$ is the step size at time $t$, $g_t(\theta_t) = \Delta_t\,\phi(x_t,a_t)$, and $\Delta_t$ is the temporal difference at time $t$:

$$\Delta_t = r(x_t,a_t) + \gamma\max_{b\in\mathcal{A}}\phi^T(x_{t+1},b)\theta_t - \phi^T(x_t,a_t)\theta_t.$$

The same projection step as in the SARSA algorithm is used to control the norm of the gradient $g_t(\theta_t)$, and thus the gradient bias. The convergence of this algorithm (without the projection operation) was established using an O.D.E. argument (Melo et al., 2008). However, the finite-sample characterization of the convergence still remains unsolved, which is our interest here.
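The update (9) differs from SARSA in two ways: the TD target takes a greedy max over actions, and the data are collected under a fixed behavior policy. A hedged sketch (the environment, features, and the uniform behavior policy are illustrative stand-ins, not the paper's setting):

```python
import numpy as np

rng = np.random.default_rng(2)
N, n_actions, gamma, R = 6, 2, 0.9, 10.0

def phi(x, a):
    f = np.zeros(N)
    f[a * 3:(a + 1) * 3] = [1.0, x, x * x]
    return f / max(np.linalg.norm(f), 1.0)       # keep ||phi(x,a)||_2 <= 1

def proj_2R(theta):
    nrm = np.linalg.norm(theta)
    return theta if nrm <= R else theta * (R / nrm)

def step(x, a):                                   # toy dynamics and reward
    x_next = min(max(x + (0.1 if a == 1 else -0.1) + 0.05 * rng.normal(), 0.0), 1.0)
    return x_next, 1.0 - abs(x_next - 0.8)

theta, x = np.zeros(N), 0.5
for t in range(1, 2001):
    a = int(rng.integers(n_actions))              # FIXED (uniform) behavior policy
    x_next, reward = step(x, a)
    # Off-policy TD target: greedy max over next actions, independent of behavior.
    q_max = max(phi(x_next, b) @ theta for b in range(n_actions))
    delta = reward + gamma * q_max - phi(x, a) @ theta
    theta = proj_2R(theta + (1.0 / t) * delta * phi(x, a))   # update (9)
    x = x_next

assert np.linalg.norm(theta) <= R + 1e-9
```

Because the behavior policy never changes, the induced state-action chain mixes to a single stationary distribution, which is why the analysis below needs no auxiliary chain.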

### 4.2 Finite-Sample Analysis

To explicitly quantify the statistical bias in the iterative update in (9), we assume that the Markov chain $\{X_t\}$ induced by the learning policy $\pi$ is uniformly ergodic with invariant measure $P_\pi$, which satisfies Assumption 1, and is initialized with $P_\pi$. We then introduce some notation. We denote by $\mathbb{E}_\pi$ the expectation with respect to the probability measure induced by the invariant measure $P_\pi$ and the learning policy $\pi$. We define the matrix $\Sigma_\pi$ as

$$\Sigma_\pi = \mathbb{E}_\pi\big[\phi(X,A)\phi(X,A)^T\big] = \int_{x\in\mathcal{X}}\sum_{a\in\mathcal{A}}\phi(x,a)\phi^T(x,a)\,\pi(a|x)\,P_\pi(dx).$$

For any fixed $\theta$ and $x \in \mathcal{X}$, define $a^\theta_x = \arg\max_{a\in\mathcal{A}}\phi^T(x,a)\theta$. We then define the $\theta$-dependent matrix

$$\Sigma^*_\pi(\theta) = \mathbb{E}_\pi\big[\phi(X,a^\theta_X)\phi^T(X,a^\theta_X)\big] = \int_{\mathcal{X}}\phi(x,a^\theta_x)\phi^T(x,a^\theta_x)\,P_\pi(dx).$$

By construction, $\Sigma_\pi$ and $\Sigma^*_\pi(\theta)$ are positive definite. We further make the following assumption.

###### Assumption 3.

For all $\theta \in \mathbb{R}^N$, $\Sigma_\pi \succ \gamma^2\, \Sigma^*_\pi(\theta)$.

We denote by $w_s$ the minimum of the smallest eigenvalue of $\Sigma_\pi - \gamma^2\Sigma^*_\pi(\theta)$ over all $\theta$. The limit point $\theta^*$ of the algorithm in (9) (without the projection operation) satisfies the recursive relation

$$Q_{\theta^*} = \mathrm{proj}_{\mathcal{Q}}\, H Q_{\theta^*}, \qquad (10)$$

where $\mathrm{proj}_{\mathcal{Q}}$ is the orthogonal projection onto $\mathcal{Q}$ defined with the inner product induced by $\mathbb{E}_\pi$ (Theorem 1 in (Melo et al., 2008)). We then characterize the finite-sample bound on the performance of the projected Q-learning in (9).

###### Theorem 2.

Consider the projected Q-learning algorithm in (9) with $\|\theta^*\|_2 \le R$. If Assumptions 1 and 3 are satisfied, then with a decaying step size $\alpha_t$ of order $1/t$, we have

$$\mathbb{E}\|\theta_T - \theta^*\|_2^2 \le \frac{(9 + 24\tau_0)\, G^2\, (\log T + 1)}{w_s^2\, T}, \qquad (11)$$

where $\tau_0$ is the mixing-time constant specified in the proof.

###### Remark 2.

As $T \to \infty$, $\tau_0 = O(\log T)$, and therefore,

$$\mathbb{E}\|\theta_T - \theta^*\|_2^2 \lesssim \frac{\log^2 T}{T}.$$
###### Remark 3.

As an interesting comparison, we note that the convergence rate of the nearest neighbor approach for Q-learning with continuous state space has been characterized in (Shah & Xie, 2018). Clearly, Theorem 2 implies that linear function approximation yields a much faster convergence rate. On the other hand, the learning policy and the basis functions need to satisfy the condition in Assumption 3 to guarantee the convergence (Melo et al., 2008), due to the nature of linear function approximation.

To clarify the difference between the proof for SARSA and that for Q-learning, we note that for Q-learning, the learning policy does not change with time, and the induced Markov chain can get close to its stationary distribution. Thus, to characterize the stochastic bias, it is not necessary to construct an auxiliary Markov chain as in Step 3 for SARSA. On the other hand, to compute the temporal difference in Q-learning, a greedy action is taken, which does not satisfy the Lipschitz condition we impose on the learning policy in the proof for SARSA.

Similarly to the SARSA algorithm, we also provide the following upper bound on $\|\theta^*\|_2$ for practical consideration.

###### Lemma 2.

For the projected Q-learning algorithm in (9), the limit point $\theta^*$ satisfies an upper bound on $\|\theta^*\|_2$ analogous to that in Lemma 1.

## 5 Minimax SARSA

### 5.1 Zero-Sum Markov Game

In this subsection, we introduce the two-player zero-sum Markov game and the corresponding linear function approximation.

A two-player zero-sum Markov game is defined by a six-tuple $(\mathcal{X}, \mathcal{A}^1, \mathcal{A}^2, P, r, \gamma)$, where $\mathcal{X}$ is a compact continuous state space, and $\mathcal{A}^1$ and $\mathcal{A}^2$ are the finite action sets of players 1 and 2. We further let $X_t$ denote the state at time $t$, and $A^1_t$, $A^2_t$ denote the actions of players 1 and 2 at time $t$, respectively. The measure $P$ defines the action-dependent transition kernel for the underlying Markov chain $\{X_t\}$:

$$P\big(X_{t+1}\in U \,|\, X_t = x, A^1_t = a^1, A^2_t = a^2\big) = \int_U P(dy|x,a^1,a^2), \qquad (12)$$

for any measurable set $U \subseteq \mathcal{X}$. The one-stage reward at time $t$ is given by $r(X_t, A^1_t, A^2_t)$, where $r$ is the reward function, which is assumed to be uniformly bounded, i.e., $|r(x,a^1,a^2)| \le r_{\max}$ for any $x\in\mathcal{X}$, $a^1\in\mathcal{A}^1$, $a^2\in\mathcal{A}^2$. Finally, $\gamma \in (0,1)$ denotes the discount factor.

A stationary policy $\pi_i$, $i \in \{1,2\}$, maps a state to a probability distribution over $\mathcal{A}^i$, which does not depend on time. For a policy pair $\pi = (\pi_1, \pi_2)$, the corresponding value function $V^\pi$ is defined as the expected total discounted reward obtained by actions executed according to $\pi$, analogously to the single-agent case. The action-value function $Q^\pi(x,a^1,a^2)$ is then also defined analogously.

For the two-player zero-sum game, the goal of player 1 is to maximize the expected accumulated $\gamma$-discounted reward from any initial state, while the goal of player 2 is to minimize it. The optimal value function for both players is then defined in the minimax sense as follows:

$$V^*(x) = \min_{\pi_2}\max_{\pi_1} V^\pi(x), \quad \forall x \in \mathcal{X}. \qquad (13)$$

The above minimization and maximization are well defined because $\pi_1$ and $\pi_2$ lie in compact probability simplices, since the action sets are finite. For all $(x,a^1,a^2)$, the optimal action-value function $Q^*$ is also defined in the minimax sense. It can be verified that $Q^*$ has the following property (Perolat et al., 2015):

$$Q^*(x,a^1,a^2) = Q^{\pi^*}(x,a^1,a^2) = \min_{\pi_2}\max_{\pi_1} Q^\pi(x,a^1,a^2) = \max_{\pi_1}\min_{\pi_2} Q^\pi(x,a^1,a^2).$$

The Bellman operator $\hat{H}$ for the Markov game is defined as

$$(\hat{H}Q)(x,a^1,a^2) = r(x,a^1,a^2) + \gamma\int_{\mathcal{X}}\min_{\pi_2\in\delta^2}\max_{\pi_1\in\delta^1}\pi_1^T\,\mathbf{Q}(y)\,\pi_2\; P(dy|x,a^1,a^2), \qquad (14)$$

where $\mathbf{Q}(y)$ is the action-value matrix at state $y$, and $\delta^i$ denotes the probability simplex over $\mathcal{A}^i$, i.e., each entry of $\pi_i \in \delta^i$ denotes the probability that player $i$ takes the corresponding action, for $i = 1, 2$.
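The inner $\min$-$\max$ term in (14) is the value of the matrix game $\mathbf{Q}(y)$ at each next state. A hedged, numpy-only sketch of computing it: we use fictitious play, whose time-averaged strategies converge (slowly, roughly as $O(1/\sqrt{t})$) to a minimax pair in zero-sum games. The 2x2 game below (matching pennies, value 0) is an illustrative example; this is one of several ways to solve the matrix game, not the paper's prescribed method.

```python
import numpy as np

Q = np.array([[1.0, -1.0],
              [-1.0, 1.0]])   # matching pennies; minimax value is 0

counts1 = np.ones(2)          # player 1 (row, maximizer) empirical action counts
counts2 = np.ones(2)          # player 2 (column, minimizer) empirical action counts
for _ in range(20000):
    p2 = counts2 / counts2.sum()
    counts1[np.argmax(Q @ p2)] += 1     # maximizer best-responds to p2
    p1 = counts1 / counts1.sum()
    counts2[np.argmin(p1 @ Q)] += 1     # minimizer best-responds to p1

p1, p2 = counts1 / counts1.sum(), counts2 / counts2.sum()
value = p1 @ Q @ p2                     # approximates min_{pi2} max_{pi1} pi1^T Q pi2
assert abs(value) < 0.05                # true value is 0 for this game
assert np.allclose(p1, [0.5, 0.5], atol=0.05)
```

In practice an exact linear-programming solve is often preferred for the per-state matrix game; fictitious play is shown here only because it needs no solver dependency.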

The linear function approximation here takes a form similar to the single-agent case, except that the function depends on two actions. Let $\mathcal{Q}$ be a family of real-valued functions defined on $\mathcal{X}\times\mathcal{A}^1\times\mathcal{A}^2$. We assume that any function in $\mathcal{Q}$ is a linear combination of a fixed set of $N$ linearly independent basis functions $\phi_i$, $i = 1,\dots,N$. Specifically, for $\theta\in\mathbb{R}^N$,

$$Q_\theta(x,a^1,a^2) = \sum_{i=1}^N \theta_i\phi_i(x,a^1,a^2) = \phi^T(x,a^1,a^2)\theta.$$

We assume that $\|\phi(x,a^1,a^2)\|_2 \le 1$ for any $(x,a^1,a^2)$, which can be ensured by normalizing the basis functions. The goal is to find a $Q_\theta$ to approximate the optimal action-value function with a continuous state space. We also express $Q_\theta(x,\cdot,\cdot)$ in a matrix form $[\Phi(x)^T\theta]$, with its $(i,j)$-th entry defined as $\phi^T(x,a^1_i,a^2_j)\theta$, for $a^1_i\in\mathcal{A}^1$, $a^2_j\in\mathcal{A}^2$.

### 5.2 Minimax SARSA with Linear Function Approximation

In this subsection, we present our finite-sample analysis for an on-policy minimax SARSA algorithm which adapts SARSA to solve the two-player zero-sum Markov game.

We consider a $\theta$-dependent $\epsilon$-greedy learning policy $\pi_{\theta_t}$, which changes with time. For any $\theta$, the optimal minimax policy pair at state $x$ is given by

$$\{\pi^\theta_{2,x},\, \pi^\theta_{1,x}\} = \arg\min_{\pi_2\in\delta^2}\arg\max_{\pi_1\in\delta^1}\pi_1^T\big[\Phi(x)^T\theta\big]\pi_2,$$

where $[\Phi(x)^T\theta]$ is the action-value matrix at state $x$. The $\epsilon$-greedy learning policy balances exploration and exploitation for each player by choosing every action in $\mathcal{A}^1$ and $\mathcal{A}^2$ with probability bounded below by a constant depending on $\epsilon$. Suppose that $\{(x_t, a^1_t, a^2_t, r_t)\}_{t\ge 0}$ is a sampled trajectory of states, actions and rewards obtained from the Markov game using the time-dependent learning policy $\pi_{\theta_t}$. Then the projected minimax SARSA algorithm takes the following update rule:

$$\theta_{t+1} = \mathrm{proj}_{2,R}\big(\theta_t + \alpha_t\, g_t(\theta_t)\big), \qquad (15)$$

where $g_t(\theta_t) = \Delta_t\,\phi(x_t,a^1_t,a^2_t)$, with $\Delta_t$ the minimax temporal difference at time $t$. Similarly to the single-agent case, we introduce a projection step to control the norm of the gradient. We assume that for any $\theta$, the learning policy $\pi_\theta$ is Lipschitz with respect to $\theta$: for any $(a^1, a^2, x)$,

$$|\pi_{\theta_1}(a^1,a^2|x) - \pi_{\theta_2}(a^1,a^2|x)| \le C\,\|\theta_1 - \theta_2\|_2, \qquad (16)$$

where $C$ is the Lipschitz constant. We further assume that for any fixed $\theta$, the Markov chain $\{X_t\}$ induced by the learning policy $\pi_\theta$ is uniformly ergodic with invariant measure $P_\theta$, and satisfies Assumption 1.

### 5.3 Finite-Sample Analysis for Minimax SARSA

Define $\theta^*$ such that $A_{\theta^*} + b_{\theta^*} = 0$, where

$$A_{\theta^*} = \mathbb{E}_{\theta^*}\Big[\phi(X,A^1,A^2)\big(\gamma\,\phi^T(Y,B^1,B^2)\theta^* - \phi^T(X,A^1,A^2)\theta^*\big)\Big]$$

and $b_{\theta^*} = \mathbb{E}_{\theta^*}\big[r(X,A^1,A^2)\,\phi(X,A^1,A^2)\big]$. Here $\mathbb{E}_{\theta^*}$ is defined similarly to that in Section 3.2. Following steps similar to those in the proof of Theorem 5.1 in (De Farias & Van Roy, 2000), we can verify that such a $\theta^*$ exists.

Following the proofs in (Perkins & Precup, 2003; Tsitsiklis & Roy, 1997), we can verify that the corresponding matrix is negative definite for any $\theta$. Recall from (16) that the learning policy is assumed to be Lipschitz with respect to $\theta$ with Lipschitz constant $C$. We then make the following assumption.

###### Assumption 4.

The Lipschitz constant $C$ is small enough so that the matrix introduced above is negative definite, with its largest eigenvalue denoted accordingly.

We have the following finite-sample bound.

###### Theorem 3.

Consider the projected minimax SARSA algorithm in (15) with $\|\theta^*\|_2 \le R$. Under Assumptions 1 and 4, with a decaying step size $\alpha_t$ of order $1/t$, we have

$$\mathbb{E}\|\theta_T - \theta^*\|_2^2 \le \frac{G^2\big(4C|\mathcal{A}|G\tau_0^2 + 12\tau_0 w_s + 1\big)(\log T + 1)}{w_s^2\, T} + \frac{4G^2\big(\tau_0 w_s + 2\rho^{-1}\big)}{w_s^2\, T}, \qquad (17)$$

where $|\mathcal{A}|$ denotes the size of the joint action space and $\tau_0$ is the mixing-time constant specified in the proof.

###### Remark 4.

As $T \to \infty$, $\tau_0 = O(\log T)$, and it then follows from Theorem 3 that $\mathbb{E}\|\theta_T - \theta^*\|_2^2 \lesssim \log^3 T / T$.

We provide the following upper bound on $\|\theta^*\|_2$.

###### Lemma 3.

For the minimax SARSA algorithm in (15), the limit point $\theta^*$ satisfies an upper bound on $\|\theta^*\|_2$ determined by the largest eigenvalue of $A_{\theta^*}$.

## 6 Minimax Q-Learning

### 6.1 Minimax Q-Learning with Linear Function Approximation

Let $\pi = (\pi_1, \pi_2)$ be a pair of fixed stationary learning policies for players 1 and 2. Suppose that $\{(x_t, a^1_t, a^2_t, r_t)\}_{t\ge 0}$ is a sampled trajectory of states, actions and rewards obtained from the Markov game using the policy $\pi$. We consider the projected minimax Q-learning algorithm with the following update rule:

$$\theta_{t+1} = \mathrm{proj}_{2,R}\big(\theta_t + \alpha_t\, g_t(\theta_t)\big), \qquad (18)$$

where $\alpha_t$ is the step size at time $t$, and

$$g_t(\theta_t) = \phi(x_t,a^1_t,a^2_t)\Big(r(x_t,a^1_t,a^2_t) + \gamma\min_{\pi_2\in\delta^2}\max_{\pi_1\in\delta^1}\pi_1^T\big[\Phi^T(x_{t+1})\theta_t\big]\pi_2 - \phi^T(x_t,a^1_t,a^2_t)\theta_t\Big).$$

Here, we also use a projection to control the norm of the gradient $g_t(\theta_t)$. We assume that the Markov chain $\{X_t\}$ induced by the learning policy $\pi$ is uniformly ergodic with invariant measure $P_\pi$, and is initialized with $P_\pi$.

### 6.2 Finite-Sample Analysis

We denote by $\mathbb{E}_\pi$ the expectation with respect to the probability measure $\mu_\pi$ induced by the invariant measure and the learning policy $\pi$, and define the matrix $\Sigma_\pi$:

$$\Sigma_\pi = \mathbb{E}_\pi\big[\phi(X,A^1,A^2)\phi(X,A^1,A^2)^T\big] = \int_{\mathcal{X}}\sum_{a^1\in\mathcal{A}^1}\sum_{a^2\in\mathcal{A}^2}\phi(x,a^1,a^2)\phi^T(x,a^1,a^2)\,d\mu_\pi.$$

For any fixed , and , define