# Feature-Based Q-Learning for Two-Player Stochastic Games

Consider a two-player zero-sum stochastic game where the transition function can be embedded in a given feature space. We propose a two-player Q-learning algorithm for approximating the Nash equilibrium strategy via sampling. The algorithm is shown to find an ϵ-optimal strategy using a sample size that is linear in the number of features. To further improve its sample efficiency, we develop an accelerated algorithm by adopting techniques such as variance reduction, monotonicity preservation and two-sided strategy approximation. We prove that the accelerated algorithm is guaranteed to find an ϵ-optimal strategy using no more than Õ(K/(ϵ^2(1-γ)^4)) samples with high probability, where K is the number of features and γ is a discount factor. The sample, time and space complexities of the algorithm are independent of the original dimensions of the game.

## 1 Introduction

The two-player turn-based stochastic game (2-TBSG) is a generalization of the Markov decision process (MDP); both are widely used models in machine learning and operations research. While an MDP involves one agent whose objective is simply to maximize the total reward, a 2-TBSG is a zero-sum game involving two players with opposite objectives: one player seeks to maximize the total reward and the other player seeks to minimize it. In a 2-TBSG, the set of all states is divided into two subsets that are controlled by the two players, respectively. We focus on the discounted stationary 2-TBSG, where the probability transition model is invariant across time and the total reward is the infinite sum of all discounted rewards. Our goal is to approximate the Nash equilibrium of the 2-TBSG, whose existence is proved in Shapley (1953).

There are two practical obstacles to solving 2-TBSG:

• We usually do not know the transition probability model explicitly;

• The number of possible states and actions is very large or even infinite.

In this paper we assume access to a sampling oracle that can generate sample transitions from any state-action pair. We also suppose that a finite number of state-action features are available, such that the unknown probability transition model can be embedded using the feature space. These features allow us to solve 2-TBSG of arbitrary dimensions using parametric algorithms.

This raises a natural question: how many samples are needed in order to find an approximate Nash equilibrium? For solving the one-player MDP to $\epsilon$-optimality using features, Yang and Wang (2019) prove an information-theoretic lower bound on the sample complexity. Since MDP is a special case of 2-TBSG, the same lower bound applies to 2-TBSG. Yet there has not been any provably efficient algorithm for solving 2-TBSG using features.

To answer this question, we propose two sampling-based algorithms and provide sample complexity analysis. Motivated by the value iteration and Q-learning-like algorithms given by Hansen et al. (2013); Yang and Wang (2019), we propose a two-player Q-learning algorithm for solving 2-TBSG using given features. When the true transition model can be fully embedded in the feature space without losing any information, our algorithm finds an $\epsilon$-optimal strategy using a number of sample transitions that is linear in $K$, where $K$ is the number of state-action features. We also provide a model misspecification error bound for the case where the features cannot fully embed the transition model.

To further improve the sample complexity, we use a variance reduction technique, together with a specifically designed monotonicity preservation technique, both of which were previously used in Yang and Wang (2019), to develop an algorithm that is even more sample-efficient. This algorithm uses a two-sided approximation scheme to find the equilibrium value from both above and below. It computes the final $\epsilon$-optimal strategy by combining two approximate strategies. This algorithm is proved to find an $\epsilon$-optimal strategy with high probability using $\tilde{O}(K/(\epsilon^2(1-\gamma)^4))$ samples, which improves significantly on our first result. To the best of our knowledge, our results are the first and sharpest sample complexity bounds for solving two-player stochastic games using features, and our algorithms are the first of their kind with provable sample efficiency. It is also worth noting that the algorithms are space and time efficient: their complexities depend polynomially on $K$ and $1/(1-\gamma)$, and are free from the game's dimensions.

In Section 2 we review the related literature. Section 3 presents the problem formulation and basics. We introduce a basic two-player Q-learning algorithm in Section 4 together with its analysis. The accelerated two-player Q-learning algorithm and its analysis are presented in Section 5 and Section 6.

## 2 Related Works

The 2-TBSG is a special case of games and stochastic games (SG), which were first introduced in Von Neumann and Morgenstern (2007) and Shapley (1953). For a comprehensive introduction to SG, we refer the reader to the books Neyman et al. (2003) and Filar and Vrieze (2012). A number of deterministic algorithms have been developed for solving 2-TBSG when its explicit form is fully given, including Littman (1996); Ludwig (1995); Hansen et al. (2013). For example, Rao et al. (1973) propose the strategy iteration algorithm. A value iteration method is proposed by Hansen et al. (2013), which is one of the motivations for our algorithm.

In the special case of MDP, there exists a large body of work on sample complexity and sampling-based algorithms. For the tabular setting (finitely many states and actions), the sample complexity of MDP with a sampling oracle has been studied in Kearns and Singh (1999); Azar et al. (2013); Sidford et al. (2018b, a); Kakade (2003); Singh and Yee (1994); Azar et al. (2011b). Lower bounds for sample complexity have been studied in Azar et al. (2013); Even-Dar et al. (2006); Azar et al. (2011a), where the first tight lower bound is obtained in Azar et al. (2013). The first sample-optimal algorithm for finding an $\epsilon$-optimal value is proposed in Azar et al. (2013). Sidford et al. (2018a) give the first algorithm that finds an $\epsilon$-optimal policy using the optimal sample complexity for all values of $\epsilon$. For solving MDP using linearly additive features, Yang and Wang (2019) proved a lower bound on the sample complexity. They also provided an algorithm that achieves this lower bound up to log factors; however, their analysis of the algorithm relies heavily on an extra “anchor state” assumption. In Chen et al. (2018), a primal-dual method solving MDP with linear and bilinear representation of value functions and transition models is proposed for the undiscounted MDP. In Jiang et al. (2017), the sample complexity of contextual decision processes is studied.

As for general stochastic games, the minimax Q-learning algorithm and the friend-or-foe Q-learning algorithm are introduced in Littman (1994) and Littman (2001a), respectively. The Nash Q-learning algorithm is proposed for zero-sum games in Hu and Wellman (2003) and for general-sum games in Littman (2001b); Hu and Wellman (1999). In Perolat et al. (2015), the error of approximate Q-learning is estimated. In Zhang et al. (2018), a finite-sample analysis of multi-agent reinforcement learning is provided. To the best of our knowledge, there is no known algorithm that solves 2-TBSG using features with sample complexity analysis.

There is a large number of works analyzing linear model approximation of value and Q functions, for example Tsitsiklis and Van Roy (1997); Nedić and Bertsekas (2003); Lagoudakis and Parr (2003); Melo et al. (2008); Parr et al. (2008); Sutton et al. (2009); Lazaric et al. (2012); Tagorti and Scherrer (2015). These works mainly focus on approximating the value function or Q function for a fixed policy. The convergence of temporal difference learning with a linear model for a given policy is proved in Tsitsiklis and Van Roy (1997). Melo et al. (2008) and Sutton et al. (2009) study the convergence of Q-learning and off-policy temporal difference learning with linear function parametrization, respectively. In Parr et al. (2008), the relationship between linear transition models and linearly parametrized value functions is explained. It is also pointed out by Yang and Wang (2019) that using a linear model for the Q function is essentially equivalent to assuming that the transition model can be embedded using these features, provided that there is zero Bellman error.

The fitted value iteration for MDPs or 2-TBSGs, where the value function is approximated by functions in a general function space, is analyzed in Munos and Szepesvári (2008); Antos et al. (2008a, b); Farahmand et al. (2010); Yang et al. (2019); Pérolat et al. (2016). In these papers, it is shown that the error is related to the Bellman error of the function space, and depends polynomially on, among other quantities, the dimension of the function space. However, only convergence is analyzed in these papers.

## 3 Preliminaries

#### Basics of 2-TBSG

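A discounted 2-TBSG (2-TBSG for short) consists of a tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S} = \mathcal{S}_1 \cup \mathcal{S}_2$ and $\mathcal{A} = \mathcal{A}_1 \cup \mathcal{A}_2$; here $\mathcal{S}_1, \mathcal{S}_2$ and $\mathcal{A}_1, \mathcal{A}_2$ are the state sets and action sets for Player 1 and Player 2, respectively. $P$ is a transition probability matrix, where $P(s'|s,a)$ denotes the probability of transitioning to state $s'$ from state $s$ if action $a$ is used. $r$ is the reward vector, where $r(s,a)$ denotes the immediate reward received using action $a$ at state $s$.

For a given state $s \in \mathcal{S}$, we use $\mathcal{A}_s$ to denote the available action set for state $s$. A value function is a mapping from $\mathcal{S}$ to $\mathbb{R}$. A deterministic strategy (strategy for short) $\pi = (\pi_1, \pi_2)$ is defined such that $\pi_1, \pi_2$ are mappings from $\mathcal{S}_1$ to $\mathcal{A}$ and from $\mathcal{S}_2$ to $\mathcal{A}$, respectively. Given a strategy $\pi$, the value function of $\pi$ is defined to be the expectation of the total discounted reward starting from $s$, i.e.,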

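$$V^{\pi}(s) = \mathbb{E}\left[\sum_{i=0}^{\infty} \gamma^i r(s_i, \pi(s_i)) \,\middle|\, s_0 = s\right], \quad \forall s \in \mathcal{S}, \tag{1}$$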

where $\gamma \in (0,1)$ is the discount factor, and the expectation is over all trajectories starting from $s$.

The two players in a 2-TBSG have opposite objectives. While the first player seeks to maximize the value function (1), the second player seeks to minimize it. In the following we present the definition of the equilibrium strategy.

###### Definition 1.

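A strategy $\pi^* = (\pi_1^*, \pi_2^*)$ is called a Nash equilibrium strategy (equilibrium strategy for short) if $V^{\pi_1, \pi_2^*} \le V^{\pi_1^*, \pi_2^*} \le V^{\pi_1^*, \pi_2}$ for any player 1's strategy $\pi_1$ and player 2's strategy $\pi_2$.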

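The existence of the Nash equilibrium strategy is proved in Shapley (1953). Moreover, all equilibrium strategies share the same value function, which we denote by $v^*$.

Notice that $v^*$ is the equilibrium value if and only if it satisfies the following Bellman equation (Hansen et al., 2013):

$$v^* = \mathcal{T} v^*, \tag{2}$$

where $\mathcal{T}$ is an operator mapping a value function into another:

$$\mathcal{T}V(s) = \begin{cases} \max_{a \in \mathcal{A}_s} \left[ r(s,a) + \gamma P(\cdot|s,a)^\top V \right], & \forall s \in \mathcal{S}_1, \\ \min_{a \in \mathcal{A}_s} \left[ r(s,a) + \gamma P(\cdot|s,a)^\top V \right], & \forall s \in \mathcal{S}_2. \end{cases} \tag{3}$$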
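To make the operator in (3) concrete, the following is a minimal sketch on a made-up two-state game in which Player 1 controls state 0 and Player 2 controls state 1. All rewards, transitions and names (`bellman`, `value_iteration`, and so on) are illustrative assumptions rather than anything defined in the paper; the sketch only shows that repeatedly applying the operator converges to a fixed point of (2).

```python
# Toy 2-TBSG; every number below is made up for illustration.
GAMMA = 0.5
S1 = {0}                                   # states controlled by Player 1 (maximizer)
ACTIONS = {0: ["a", "b"], 1: ["a", "b"]}   # available actions A_s (state 1 belongs to Player 2)
REWARD = {(0, "a"): 1.0, (0, "b"): 0.0, (1, "a"): 0.0, (1, "b"): 1.0}
# P[(s, a)] is the next-state distribution, indexed by the next state s'.
P = {(0, "a"): [0.0, 1.0], (0, "b"): [1.0, 0.0],
     (1, "a"): [1.0, 0.0], (1, "b"): [0.0, 1.0]}


def q_value(V, s, a):
    """r(s,a) + gamma * P(.|s,a)^T V."""
    return REWARD[(s, a)] + GAMMA * sum(p * v for p, v in zip(P[(s, a)], V))


def bellman(V):
    """Apply the operator in (3): max over actions on S1, min elsewhere."""
    out = []
    for s in range(len(V)):
        qs = [q_value(V, s, a) for a in ACTIONS[s]]
        out.append(max(qs) if s in S1 else min(qs))
    return out


def value_iteration(num_iters=60):
    """Iterate the operator from V = 0; it is a gamma-contraction in sup norm."""
    V = [0.0, 0.0]
    for _ in range(num_iters):
        V = bellman(V)
    return V
```

On this toy instance the fixed point can be checked by hand: Player 1 plays `a` in state 0 and Player 2 plays `a` in state 1, which gives the equilibrium value (4/3, 2/3).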

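We give definitions of $\epsilon$-optimal values and $\epsilon$-optimal strategies.

###### Definition 2.

We call a value function $V$ an $\epsilon$-optimal value if $\|V - v^*\|_\infty \le \epsilon$.

###### Definition 3.

We call a strategy $\pi = (\pi_1, \pi_2)$ an $\epsilon$-optimal strategy if, for any $s \in \mathcal{S}$,

$$\max_{\bar{\pi}_1} \left[ V^{\bar{\pi}_1, \pi_2}(s) - v^*(s) \right] \le \epsilon, \qquad \min_{\bar{\pi}_2} \left[ V^{\pi_1, \bar{\pi}_2}(s) - v^*(s) \right] \ge -\epsilon.$$

Since $\max_{\bar{\pi}_1} V^{\bar{\pi}_1, \pi_2} \ge v^* \ge \min_{\bar{\pi}_2} V^{\pi_1, \bar{\pi}_2}$, the above definition is equivalent to $\|\max_{\bar{\pi}_1} V^{\bar{\pi}_1, \pi_2} - v^*\|_\infty \le \epsilon$ and $\|\min_{\bar{\pi}_2} V^{\pi_1, \bar{\pi}_2} - v^*\|_\infty \le \epsilon$.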

#### Features and Probability Transition Model

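Suppose we have $K$ feature functions $\phi_1, \ldots, \phi_K$ mapping from $\mathcal{S} \times \mathcal{A}$ into $\mathbb{R}$. For every state-action pair $(s,a)$, these features give a feature vector

$$\phi(s,a) = [\phi_1(s,a), \cdots, \phi_K(s,a)]^\top \in \mathbb{R}^K.$$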

Throughout this paper, we focus on 2-TBSG where the probability transition model can be nearly embedded using the features without losing any information.

###### Definition 4.

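We say that the transition model $P$ can be embedded into the feature space $\phi$ if there exist functions $\psi_1, \ldots, \psi_K$ on $\mathcal{S}$ such that

$$P(s'|s,a) = \sum_{k \in [K]} \phi_k(s,a)\, \psi_k(s'), \quad \forall s' \in \mathcal{S},\ (s,a) \in \mathcal{S} \times \mathcal{A}.$$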
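The practical consequence of Definition 4 is that the expected next-state value becomes linear in the features. The sketch below illustrates this on hypothetical data with $K = 2$: the dictionaries `PHI` and `PSI` and the function names are assumptions made for the example, and each $\psi_k$ is chosen to be a probability distribution only so that the resulting rows of $P$ are valid distributions.

```python
# Hypothetical features (K = 2) for three state-action pairs; numbers are illustrative.
STATES = ["s0", "s1"]
PHI = {("s0", "a"): [1.0, 0.0],
       ("s0", "b"): [0.0, 1.0],
       ("s1", "a"): [0.25, 0.75]}
# psi_k(s') for k = 1, 2; here each psi_k is itself a distribution over next states.
PSI = [{"s0": 0.9, "s1": 0.1},
       {"s0": 0.2, "s1": 0.8}]


def transition(s_next, s, a):
    """P(s'|s,a) = sum_k phi_k(s,a) * psi_k(s')."""
    return sum(phi_k * psi_k[s_next] for phi_k, psi_k in zip(PHI[(s, a)], PSI))


def expected_value_direct(V, s, a):
    """P(.|s,a)^T V, summing over all next states."""
    return sum(transition(t, s, a) * V[t] for t in STATES)


def expected_value_linear(V, s, a):
    """Same quantity as phi(s,a)^T w with w_k = sum_{s'} psi_k(s') V(s')."""
    w = [sum(psi_k[t] * V[t] for t in STATES) for psi_k in PSI]
    return sum(phi_k * w_k for phi_k, w_k in zip(PHI[(s, a)], w))
```

The vector `w` depends on $V$ but not on $(s,a)$, which is why a single $K$-dimensional parameter suffices to describe the expected next-state value at every state-action pair.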

The preceding model is closely related to the linear model for Q-functions. When $P$ can be fully embedded using $\phi$, each Q-function is the reward plus a linear combination of the features, so we can parameterize Q-functions using $K$-dimensional vectors. Note that the feature representation is only concerned with the probability transition and has nothing to do with the reward function. It is pointed out by Yang and Wang (2019) that having a transition model which can be embedded into $\phi$ is equivalent to using a linear Q-function model with zero Bellman error. In our subsequent analysis, we also provide an approximation guarantee for the case where $P$ cannot be fully embedded using $\phi$.

It is worth noting that Definition 4 has a kernel interpretation. It is equivalent to requiring that the left singular functions of $P$ belong to the Hilbert space whose kernel function is induced by the features $\phi$. Our model and method can be viewed as approximating and solving the 2-TBSG in a given kernel space.

#### Notations

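For two value functions $V_1, V_2$, we use $V_1 \le V_2$ to denote $V_1(s) \le V_2(s)$ for all $s \in \mathcal{S}$. We use $\Pi_{[a,b]}(\cdot)$ to denote the projection into the interval $[a,b]$. We write $\|\cdot\|_{TV}$ for the total variation (TV) distance between two distributions on the state space, and we use $\tilde{O}(\cdot)$ to hide logarithmic factors.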

## 4 A Basic Two-Player Q-learning Algorithm

In this section, we develop a basic two-player Q-learning algorithm for 2-TBSG. The algorithm is motivated by the two-player value iteration algorithm of Hansen et al. (2013). It is also motivated by the parametric Q-learning algorithm for solving MDP given by Yang and Wang (2019).

### 4.1 Algorithm and Parametrization

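The algorithm uses a vector $w \in \mathbb{R}^K$ to parametrize Q-functions, value functions and strategies as follows:

$$Q_w(s,a) = r(s,a) + \gamma\, \phi(s,a)^\top w, \tag{4}$$

$$V_w(s) = \begin{cases} \max_{a \in \mathcal{A}} Q_w(s,a), & s \in \mathcal{S}_1, \\ \min_{a \in \mathcal{A}} Q_w(s,a), & s \in \mathcal{S}_2, \end{cases} \qquad \pi_w(s) = \begin{cases} \arg\max_{a \in \mathcal{A}} Q_w(s,a), & s \in \mathcal{S}_1, \\ \arg\min_{a \in \mathcal{A}} Q_w(s,a), & s \in \mathcal{S}_2. \end{cases}$$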

The algorithm keeps track of the parameter vector $w$ only. The value functions and strategies can be obtained from $w$ according to the preceding equations whenever they are needed.
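As a small illustration of parametrization (4), the sketch below recovers $Q_w$, $V_w$ and $\pi_w$ from a parameter vector on demand. The two-state game, its rewards and its features are hypothetical values chosen for the example, and the function names are not taken from the paper.

```python
GAMMA = 0.9
MAX_STATES = {"s0"}        # Player 1 (maximizer) controls s0; Player 2 controls s1
ACTIONS = ["a", "b"]
# Illustrative rewards and K = 2 features.
REWARD = {("s0", "a"): 0.0, ("s0", "b"): 0.5, ("s1", "a"): 1.0, ("s1", "b"): 0.2}
PHI = {("s0", "a"): [1.0, 0.0], ("s0", "b"): [0.0, 1.0],
       ("s1", "a"): [0.5, 0.5], ("s1", "b"): [0.5, 0.5]}


def q_w(w, s, a):
    """Q_w(s,a) = r(s,a) + gamma * phi(s,a)^T w."""
    return REWARD[(s, a)] + GAMMA * sum(f * x for f, x in zip(PHI[(s, a)], w))


def v_w(w, s):
    """V_w(s): max over actions for Player 1's states, min for Player 2's."""
    qs = [q_w(w, s, a) for a in ACTIONS]
    return max(qs) if s in MAX_STATES else min(qs)


def pi_w(w, s):
    """pi_w(s): the action attaining the max (Player 1) or min (Player 2)."""
    pick = max if s in MAX_STATES else min
    return pick(ACTIONS, key=lambda a: q_w(w, s, a))
```

With `w = [1.0, 0.0]`, Player 1 chooses `a` in `s0` (value 0.9) and Player 2 chooses `b` in `s1` (value 0.65); only `w` needs to be stored between iterations.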

We present Algorithm 1, which is an approximate value iteration. Our algorithm first picks a set $\mathcal{K}$ of representative state-action pairs. Then, at each iteration, it uses sampling to estimate the expected next-state values at these pairs under the current parameter, and carries out value iteration using these estimates. The set $\mathcal{K}$ can be chosen nearly arbitrarily, but it must be representative of the feature space. This means that the feature vectors of the state-action pairs in this set cannot be too alike and need to be linearly independent.
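Algorithm 1 itself is not reproduced in this section, so the following is only a rough sketch of a sampling-based approximate value iteration in its spirit, under stated assumptions: the update is taken to be "set $w$ to the sample means of $V_w$ at the next states drawn from each representative pair", and the representative pairs are chosen so that their feature vectors are the unit vectors, which makes the feature matrix the identity (in general one would multiply the sample means by the inverse of that matrix). The toy game, the sample sizes and all names are hypothetical.

```python
import random

GAMMA = 0.5
STATES = [0, 1]
MAX_STATES = {0}                         # Player 1 (maximizer) controls state 0
ACTIONS = ["a", "b"]
REWARD = {(0, "a"): 1.0, (0, "b"): 0.0, (1, "a"): 0.5, (1, "b"): 1.0}
PHI = {(0, "a"): [1.0, 0.0], (0, "b"): [0.0, 1.0],
       (1, "a"): [0.5, 0.5], (1, "b"): [0.0, 1.0]}
PSI = [[0.8, 0.2], [0.3, 0.7]]           # psi_k as a distribution over next states
REPRESENTATIVES = [(0, "a"), (0, "b")]   # features e_1 and e_2, so the feature matrix is I


def transition_probs(s, a):
    """P(.|s,a) from the embedded model: sum_k phi_k(s,a) * psi_k(.)."""
    return [sum(PHI[(s, a)][k] * PSI[k][t] for k in range(2)) for t in STATES]


def sample_next_states(rng, s, a, n):
    """Sampling oracle: n i.i.d. next states drawn from P(.|s,a)."""
    return rng.choices(STATES, weights=transition_probs(s, a), k=n)


def q_w(w, s, a):
    return REWARD[(s, a)] + GAMMA * sum(f * x for f, x in zip(PHI[(s, a)], w))


def v_w(w, s):
    qs = [q_w(w, s, a) for a in ACTIONS]
    return max(qs) if s in MAX_STATES else min(qs)


def approximate_value_iteration(num_iters=30, n_samples=2000, seed=0):
    rng = random.Random(seed)
    w = [0.0, 0.0]
    for _ in range(num_iters):
        values = [v_w(w, s) for s in STATES]   # V_w under the previous parameter
        # New parameter: sample mean of V_w(s') at each representative pair.
        w = [sum(values[t] for t in sample_next_states(rng, s, a, n_samples)) / n_samples
             for (s, a) in REPRESENTATIVES]
    return w
```

The learner only ever queries the oracle at the $K$ representative pairs, so the number of samples per iteration scales with the number of features rather than with the number of states or actions.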

###### Assumption 1.

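There exist $K$ state-action pairs forming a set $\mathcal{K}$ satisfying

$$\|\phi(s,a)\|_1 \le 1,\ \forall s \in \mathcal{S},\ a \in \mathcal{A}_s, \qquad \exists L > 0,\ \|\Phi_{\mathcal{K}}^{-1}\|_\infty \le L,$$

where $\Phi_{\mathcal{K}}$ is the $K \times K$ matrix whose rows are the feature vectors of the pairs in $\mathcal{K}$.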
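To see what Assumption 1 asks of the representative set, the sketch below computes the constant $L$ for two hypothetical choices of a $2 \times 2$ feature matrix. It assumes that the norm on the inverse is the induced infinity norm (maximum absolute row sum), and it checks the $\ell_1$ bound only on the representative rows, whereas the assumption requires it for every state-action pair; the helper names are illustrative.

```python
def l1_norm(v):
    return sum(abs(x) for x in v)


def inverse_2x2(m):
    """Closed-form inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("feature matrix is singular: rows are linearly dependent")
    return [[d / det, -b / det], [-c / det, a / det]]


def inf_norm(m):
    """Induced infinity norm: maximum absolute row sum."""
    return max(sum(abs(x) for x in row) for row in m)


def assumption_constant(phi_k):
    """Smallest L with ||Phi_K^{-1}||_inf <= L, after checking the l1 bound on the rows."""
    if any(l1_norm(row) > 1.0 + 1e-12 for row in phi_k):
        raise ValueError("a representative feature vector has l1 norm above 1")
    return inf_norm(inverse_2x2(phi_k))


WELL_SEPARATED = [[0.8, 0.2], [0.3, 0.7]]
NEARLY_PARALLEL = [[0.5, 0.5], [0.51, 0.49]]
```

For the well-separated rows the constant is about 2.2, while for the nearly parallel rows it is about 101, which illustrates why representative pairs with similar feature vectors make the assumption hold only with a large $L$.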

### 4.2 Sample Complexity Analysis

The next theorem establishes the sample complexity of Algorithm 1, which is independent of the numbers of states and actions. Its proof is deferred to the appendix.

###### Theorem 1 (Convergence of Algorithm 1).

Let Assumption 1 hold. Suppose that the transition model of $\mathcal{M}$ can be fully embedded into the $\phi$ space. Then for some , with probability at least , the parametrized strategy according to the output of Algorithm 1 is $\epsilon$-optimal. The number of samples used is .

## 5 Variance-Reduced Q-Learning for Two-Player Stochastic Games

In this section, we show how to accelerate the two-player Q-learning algorithm and achieve near-optimal sample efficiency. A main technique is to leverage monotonicity of the Bellman operator to guarantee that solutions improve monotonically in the algorithm, which was used in Yang and Wang (2019).

### 5.1 Nonnegative Features

To preserve monotonicity in the algorithm, we assume without loss of generality that the features are nonnegative:

$$\phi_k(s,a) \ge 0, \quad \forall k \in [K],\ \forall (s,a) \in \mathcal{S} \times \mathcal{A}.$$

This condition can be easily satisfied. If the raw features does not satisfy nonnegativity, we can construct new features to make it satisfied. For any state-action pair we append another 1D feature such that for , and there is a subset of such that and is nonsingular. Then satisfies nonnegativity condition and Assumption 1 for some by normalization. More details are deferred to appendix.

### 5.2 Parametrization

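We use a “max-linear” parameterization to guarantee that value functions improve monotonically in the algorithm. Instead of using a single vector $w$, we now use a finite collection of $K$-dimensional vectors $\theta = \{w^{(h)}\}_{h=1}^{Z}$, where $Z$ is an integer. We use the following parameterization for the Q-functions, the value functions and strategies (footnote: the strategy $\pi_\theta(s)$ is defined to be the action attaining the optimum in the corresponding max-min problem in (5); the definition in the max-max case is similar):

$$Q_{w^{(h)}}(s,a) = r(s,a) + \gamma\, \phi(s,a)^\top w^{(h)}, \tag{5}$$

$$V_\theta(s) = \begin{cases} \max_{h \in [Z]} \max_{a \in \mathcal{A}_s} Q_{w^{(h)}}(s,a), & \forall s \in \mathcal{S}_1, \\ \max_{h \in [Z]} \min_{a \in \mathcal{A}_s} Q_{w^{(h)}}(s,a), & \forall s \in \mathcal{S}_2. \end{cases}$$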
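The following sketch illustrates the max-linear parameterization (5) on a hypothetical two-state game. It assumes that the strategy at a state takes the action that is optimal for the vector $w^{(h)}$ attaining the outer maximum; the game data and the function names are illustrative only.

```python
GAMMA = 0.9
MAX_STATES = {"s0"}        # Player 1 controls s0; Player 2 controls s1
ACTIONS = ["a", "b"]
# Illustrative rewards and nonnegative K = 2 features.
REWARD = {("s0", "a"): 0.0, ("s0", "b"): 0.5, ("s1", "a"): 1.0, ("s1", "b"): 0.2}
PHI = {("s0", "a"): [1.0, 0.0], ("s0", "b"): [0.0, 1.0],
       ("s1", "a"): [0.5, 0.5], ("s1", "b"): [0.5, 0.5]}


def q_value(w, s, a):
    return REWARD[(s, a)] + GAMMA * sum(f * x for f, x in zip(PHI[(s, a)], w))


def v_single(w, s):
    """Inner problem for one vector w: max over actions on S1, min on S2."""
    qs = [q_value(w, s, a) for a in ACTIONS]
    return max(qs) if s in MAX_STATES else min(qs)


def v_theta(theta, s):
    """V_theta(s): outer maximum over the collection of vectors."""
    return max(v_single(w, s) for w in theta)


def pi_theta(theta, s):
    """Action chosen under the vector that attains the outer maximum (an assumption)."""
    best_w = max(theta, key=lambda w: v_single(w, s))
    pick = max if s in MAX_STATES else min
    return pick(ACTIONS, key=lambda a: q_value(best_w, s, a))
```

Because of the outer maximum, appending a new vector to `theta` can never decrease `v_theta` at any state, which is the property that lets values improve monotonically across iterations.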

For a given $\theta$ and $s$, computing the corresponding Q-value and action requires solving a one-step optimization problem. We assume that there is an oracle that solves this problem, and we treat its time complexity as a parameter of the analysis.

###### Remark 1.

When the action space is continuous, this may become a constant which is independent to the state set and the action set.

### 5.3 Preserving Monotonicity

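A drawback of value-iteration-like methods is that an $\epsilon$-optimal value function cannot be used greedily to obtain an $\epsilon$-optimal strategy. In order for Algorithm 1 to find a $w$ whose strategy $\pi_w$ is $\epsilon$-optimal, it first needs a value function whose accuracy is on the order of $(1-\gamma)\epsilon$ (compare Lemma 1 in the appendix), which is very inefficient. However, if a strategy $\pi$ and a value function $V$ satisfy the following inequality

$$V \le \mathcal{T}_\pi V, \tag{6}$$

then there is a strong connection between $V$ and $V^\pi$ as follows (due to monotonicity of the Bellman operator $\mathcal{T}_\pi$):

$$V \le \mathcal{T}_\pi V \le \mathcal{T}_\pi^2 V \le \cdots \le \mathcal{T}_\pi^\infty V = V^\pi.$$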
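The chain of inequalities can be checked numerically for a fixed strategy. In the sketch below, the two-state chain, its rewards and the names `t_pi` and `iterate` are illustrative assumptions; starting from $V = 0$ with nonnegative rewards guarantees $V \le \mathcal{T}_\pi V$, so every further application of the operator moves the value up toward $V^\pi$.

```python
GAMMA = 0.5
# A fixed strategy pi on a 2-state chain; the numbers are illustrative.
R_PI = [1.0, 0.0]                     # r(s, pi(s))
P_PI = [[0.5, 0.5], [0.0, 1.0]]       # P(.|s, pi(s))


def t_pi(V):
    """Bellman operator of the fixed strategy: r_pi + gamma * P_pi V."""
    return [R_PI[s] + GAMMA * sum(P_PI[s][t] * V[t] for t in range(2))
            for s in range(2)]


def iterate(V, n):
    """Return [V, T_pi V, T_pi^2 V, ..., T_pi^n V]."""
    history = [V]
    for _ in range(n):
        V = t_pi(V)
        history.append(V)
    return history
```

On this chain $V^\pi = (4/3, 0)$, and the iterates starting from zero are $(0,0), (1,0), (1.25,0), \ldots$, each a lower bound on $V^\pi$.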

This relation will be used to show that if $V$ is close to optimal, the policy $\pi$ is also close to optimal.

The accelerated algorithm is given partly in Algorithm 2, which uses two tricks to preserve monotonicity:

• We use parametrization (5) for $V_\theta$ and $\pi_\theta$. This parametrization ensures that, in our algorithm, the values and strategies keep improving throughout the iterations.

• In each iteration, we shift the new parameter downwards to $\bar{w}^{(i,j)}$ by using a confidence bound, such that

$$\phi(s,a)^\top \bar{w}^{(i,j)} \le P(\cdot|s,a)^\top V^{(i,j-1)} \le P(\cdot|s,a)^\top V^{(i,j)},$$

which uses the nonnegativity of features. The shift is used to guarantee (6).

### 5.4 Approximating the Equilibrium from Two Sides

Making value functions monotonically increasing is not enough to find an $\epsilon$-optimal strategy for two-player stochastic games. There are two sides of the game, and the value of a strategy may be either greater or less than the Nash value. Having only a lower bound on this value does not lead to an approximate strategy. This is a major difference from the one-player MDP.

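In order to fix this problem, we approximate the Nash equilibrium from two sides, both from above and from below. Given player 1's strategy $\pi_1$ and player 2's strategy $\pi_2$, we introduce two Bellman operators $\mathcal{T}_{\pi_1,\min}$ and $\mathcal{T}_{\max,\pi_2}$:

$$\mathcal{T}_{\pi_1,\min} V(s) = \begin{cases} r(s,\pi_1(s)) + \gamma P(\cdot|s,\pi_1(s))^\top V, & \text{if } s \in \mathcal{S}_1, \\ \min_{a \in \mathcal{A}_s} \left[ r(s,a) + \gamma P(\cdot|s,a)^\top V \right], & \text{if } s \in \mathcal{S}_2, \end{cases} \tag{7}$$

$$\mathcal{T}_{\max,\pi_2} V(s) = \begin{cases} \max_{a \in \mathcal{A}_s} \left[ r(s,a) + \gamma P(\cdot|s,a)^\top V \right], & \text{if } s \in \mathcal{S}_1, \\ r(s,\pi_2(s)) + \gamma P(\cdot|s,\pi_2(s))^\top V, & \text{if } s \in \mathcal{S}_2. \end{cases}$$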

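If there exist value functions $V, W$ such that both inequalities in

$$V \le \mathcal{T}_{\pi_1,\min} V, \qquad \mathcal{T}_{\max,\pi_2} W \le W \tag{8}$$

hold, then by the monotonicity of $\mathcal{T}_{\pi_1,\min}$ and $\mathcal{T}_{\max,\pi_2}$ we get

$$V \le \min_{\bar{\pi}_2} V^{\pi_1,\bar{\pi}_2} \le v^* \le \max_{\bar{\pi}_1} V^{\bar{\pi}_1,\pi_2} \le W.$$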

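Hence if we have $\|V - v^*\|_\infty \le \epsilon$ and $\|W - v^*\|_\infty \le \epsilon$, they jointly imply

$$\left\| \min_{\bar{\pi}_2} V^{\pi_1,\bar{\pi}_2} - v^* \right\|_\infty \le \epsilon, \qquad \left\| \max_{\bar{\pi}_1} V^{\bar{\pi}_1,\pi_2} - v^* \right\|_\infty \le \epsilon, \tag{9}$$

which indicates that $\pi = (\pi_1, \pi_2)$ is an $\epsilon$-optimal strategy.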
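The step from the sandwich inequality to (9) can be written out explicitly. The short derivation below assumes that the two accuracy conditions are $\|V - v^*\|_\infty \le \epsilon$ and $\|W - v^*\|_\infty \le \epsilon$, where $V$ and $W$ are the lower and upper value functions from (8); it is a sketch of the argument rather than a quotation of the paper's proof.

$$0 \;\le\; v^* - \min_{\bar{\pi}_2} V^{\pi_1,\bar{\pi}_2} \;\le\; v^* - V \;\le\; \|V - v^*\|_\infty \;\le\; \epsilon, \qquad 0 \;\le\; \max_{\bar{\pi}_1} V^{\bar{\pi}_1,\pi_2} - v^* \;\le\; W - v^* \;\le\; \|W - v^*\|_\infty \;\le\; \epsilon.$$

In words, the worst-case value of each player's strategy is trapped between the equilibrium value and a function that is already within $\epsilon$ of it.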

To achieve this goal, we construct a “flipped” instance $\mathcal{M}'$ of the original 2-TBSG $\mathcal{M}$, where the state set and the action set for each player, the transition probability matrix and the discount factor are identical to those of $\mathcal{M}$. The reward function is defined as

$$r'(s,a) = 1 - r(s,a). \tag{10}$$

The objectives of the two players are switched, which means that in $\mathcal{M}'$ the first player aims to minimize and the second player aims to maximize. $\mathcal{M}$ and $\mathcal{M}'$ share the same optimal strategy (but flipped).
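The relation between the two games follows directly from (1) and (10). As a brief worked step, write $V'^{\pi}$ for the value of a strategy $\pi$ under the flipped reward $r'$ (this notation is introduced here only for the derivation):

$$V'^{\pi}(s) = \mathbb{E}\left[\sum_{i=0}^{\infty} \gamma^i \big(1 - r(s_i,\pi(s_i))\big) \,\middle|\, s_0 = s\right] = \frac{1}{1-\gamma} - V^{\pi}(s).$$

Since the two values differ only by a sign and a constant, minimizing $V'^{\pi}$ is the same as maximizing $V^{\pi}$, which is why switching the players' objectives leaves the optimal strategies unchanged, and why a lower bound on the flipped value translates into an upper bound on the original one.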

We use $V'_\eta$ to denote the value function of $\mathcal{M}'$, and let $W_\eta(s) = 1/(1-\gamma) - V'_\eta(s)$ for any $s \in \mathcal{S}$, which serves as the value function approximating the equilibrium value from the upper side. This $W_\eta$, together with $V_\theta$, forms a two-sided approximation to the equilibrium value function.

We use Algorithm 2 to solve $\mathcal{M}$ and $\mathcal{M}'$ at the same time. Next we construct a strategy where the first player's strategy is based on parameters from the lower approximation, and the second player's strategy is based on parameters from the upper approximation. This process is described in Algorithm 3, and its output is the following approximate Nash equilibrium strategy:

$$\pi(s) = \begin{cases} \pi_{\theta^{(R',R)}}(s), & \text{if } s \in \mathcal{S}_1, \\ \pi'_{\eta^{(R',R)}}(s), & \text{if } s \in \mathcal{S}_2, \end{cases} \tag{11}$$

where for , is the strategy defined as

### 5.5 Variance Reduction

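We use inner-outer loops for variance reduction in Algorithm 2. Let the parameters at the $(i,j)$-th iteration be $\theta^{(i,j)}$. At the beginning of the $i$-th outer iteration, we aim to approximate $P(\cdot|s,a)^\top V_{\theta^{(i,0)}}$ accurately (Steps 6 and 7). Then in the $j$-th inner iteration, we use $V_{\theta^{(i,0)}}$ as a reference to reduce the variance of estimation. That is, we estimate the difference $P(\cdot|s,a)^\top (V_{\theta^{(i,j-1)}} - V_{\theta^{(i,0)}})$ using samples and then use the following equation (Steps 11 and 12)

$$P(\cdot|s,a)^\top V_{\theta^{(i,j-1)}} = P(\cdot|s,a)^\top V_{\theta^{(i,0)}} + P(\cdot|s,a)^\top \left( V_{\theta^{(i,j-1)}} - V_{\theta^{(i,0)}} \right)$$

to approximate $P(\cdot|s,a)^\top V_{\theta^{(i,j-1)}}$. Since the infinity norm of $V_{\theta^{(i,j-1)}} - V_{\theta^{(i,0)}}$ is guaranteed to be smaller than that of $V_{\theta^{(i,j-1)}}$, the number of samples needed for each inner iteration can be substantially reduced. Hence our algorithm is more sample-efficient.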
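The decomposition above can be illustrated with a small Monte Carlo sketch. It is not the paper's Algorithm 2: the function names, the two-state distribution and the sample sizes are assumptions chosen for the example. The reference term is estimated once with many samples, and the difference term is estimated with few samples because its range is small.

```python
import random


def sample_mean(rng, probs, values, n):
    """Monte Carlo estimate of sum_s probs[s] * values[s] from n sampled next states."""
    draws = rng.choices(range(len(probs)), weights=probs, k=n)
    return sum(values[s] for s in draws) / n


def variance_reduced_estimate(rng, probs, ref_estimate, v_ref, v_cur, n_diff):
    """Estimate P^T v_cur as (stored estimate of P^T v_ref) + estimate of P^T (v_cur - v_ref)."""
    diff = [c - r for c, r in zip(v_cur, v_ref)]
    # The sampling error of this term is bounded by the range of diff,
    # which is small when v_cur stays close to the reference v_ref.
    diff_part = sample_mean(rng, probs, diff, n_diff)
    return ref_estimate + diff_part


PROBS = [0.3, 0.7]        # illustrative next-state distribution P(.|s,a)
V_REF = [0.0, 10.0]       # reference value function from the start of the outer loop
V_CUR = [0.1, 10.0]       # current value function, close to the reference
```

In this example the true expectation is 7.03. The difference vector has range 0.1, so even 50 samples pin down the correction term to within 0.1, whereas estimating the full expectation directly to the same accuracy would require far more samples because `V_CUR` has range about 10.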

### 5.6 Putting Together

Algorithms 2-3 puts together all the techniques that were explained. In the next section, we will prove that they output an -optimal strategy with high probability. It is easy to see that the time complexity of Algorithm 2 is . The first term is the time calculating . The second term is the time of sampling and calculating the value function in each iteration, and is the time of calculating given parameter and state , which can be viewed as solving an optimization problem over the action space. The last term is due to the calculation of . As for the space complexity, we only need to store and the parameter at each iteration, which take space. Hence the total time and space complexities are independent from the numbers of states and actions.

## 6 Sample Complexity of Algorithms 2-3

In this section, we analyze the sample complexity of our Algorithms 2-3.

###### Theorem 2.

Let Assumption 1 hold and let features be nonnegative. Suppose that the transition model of can be fully embedded into space. Then for some , with probability at least , the output of Algorithm 3 is an -optimal strategy. The number of samples used is .

We present a proof sketch here, and the complete proof is deferred to appendix.

###### Proof Sketch.

We prove by induction. It is easy to know that . Next we assume holds.

The error between and involves two types of error: the estimation error due to sampling and the convergence error of value iteration. Due to the variance reduction technique, estimation error has two parts. The first part is the estimation error of , which we denote as , and the second part is the error of , which we denote as for short.

According to the Hoeffding inequality, we have and with high probability. By the induction hypothesis, we have . If we choose and , we will have and .

The convergence error of value iteration in the inner loop is . If we choose , we will have . Bringing these two types of errors together, we have with high probability that

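$$\|v^* - V_{\theta^{(i,0)}}\|_\infty \le \frac{\gamma^R}{1-\gamma} + \sum_{j=1}^{R} \gamma^{R-j} \cdot \left( \epsilon^{(i,0)} + \epsilon^{(i,j)} \right) \le O\!\left( \epsilon + \frac{\epsilon(1-\gamma) + 2^{-i}}{1-\gamma} \right) = O\!\left( \frac{2^{-i}}{1-\gamma} \right),$$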

where the last equality is due to . Choosing , we have . Here we have omitted the dependence on any constant factors.

Similarly, we can show for the “flipped" side. Hence we have , therefore the combined strategy given by Algorithm 3 is an -optimal strategy since

$$V_{\theta^{(R',R)}} \le \min_{\bar{\pi}_2} V^{\pi_1,\bar{\pi}_2} \le v^* \le \max_{\bar{\pi}_1} V^{\bar{\pi}_1,\pi_2} \le W_{\eta^{(R',R)}}, \tag{13}$$

whose proof is based on the monotonicity of two operators . The total number of samples used by Algorithm 3 is . ∎

According to Theorem 1 and 2, we have the following theorem when the transition model cannot be embedded exactly, whose proof is deferred to appendix.

###### Theorem 3 (Approximation error due to model misspecification).

Let Assumption 1 hold and let the features be nonnegative. If there is another transition model which can be fully embedded into the $\phi$ space, and there exists such that for and for , then with probability at least , the output of Algorithm 3 is an -optimal strategy, and with probability at least , the parametrized strategy according to the output of Algorithm 1 is -optimal.

## 7 Conclusion

In this paper, we develop a two-player Q-learning algorithm for solving 2-TBSG in feature space. This algorithm is proved to find an $\epsilon$-optimal strategy with high probability using $\tilde{O}(K/(\epsilon^2(1-\gamma)^4))$ samples. To the best of our knowledge, it is the first and sharpest sample complexity bound for solving two-player stochastic games using features and linear models. The algorithm is sample efficient as well as space and time efficient.

## References

• Antos et al. (2008a) Antos, A., Szepesvári, C., and Munos, R. (2008a). Fitted q-iteration in continuous action-space mdps. In Advances in neural information processing systems, pages 9–16.
• Antos et al. (2008b) Antos, A., Szepesvári, C., and Munos, R. (2008b). Learning near-optimal policies with bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129.
• Azar et al. (2011a) Azar, M. G., Munos, R., Ghavamzadeh, M., and Kappen, H. (2011a). Reinforcement learning with a near optimal rate of convergence.
• Azar et al. (2011b) Azar, M. G., Munos, R., Ghavamzadeh, M., and Kappen, H. (2011b). Speedy q-learning. In Advances in neural information processing systems.
• Azar et al. (2013) Azar, M. G., Munos, R., and Kappen, H. J. (2013). Minimax pac bounds on the sample complexity of reinforcement learning with a generative model. Machine learning, 91(3):325–349.
• Chen et al. (2018) Chen, Y., Li, L., and Wang, M. (2018). Scalable bilinear pi learning using state and action features. In Proceedings of the 35th International Conference on Machine Learning, pages 834–843, Stockholmsmässan, Stockholm Sweden. PMLR.
• Even-Dar et al. (2006) Even-Dar, E., Mannor, S., and Mansour, Y. (2006). Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of machine learning research, 7(Jun):1079–1105.
• Farahmand et al. (2010) Farahmand, A.-m., Szepesvári, C., and Munos, R. (2010). Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems, pages 568–576.
• Filar and Vrieze (2012) Filar, J. and Vrieze, K. (2012). Competitive Markov decision processes. Springer Science & Business Media.
• Hansen et al. (2013) Hansen, T. D., Miltersen, P. B., and Zwick, U. (2013). Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. Journal of the ACM (JACM), 60(1):1.
• Hu and Wellman (1999) Hu, J. and Wellman, M. P. (1999). Multiagent reinforcement learning in stochastic games. Submitted for publication.
• Hu and Wellman (2003) Hu, J. and Wellman, M. P. (2003). Nash q-learning for general-sum stochastic games. Journal of machine learning research, 4(Nov):1039–1069.
• Jiang et al. (2017) Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J., and Schapire, R. E. (2017). Contextual decision processes with low bellman rank are pac-learnable. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1704–1713. JMLR. org.
• Kakade (2003) Kakade, S. M. (2003). On the sample complexity of reinforcement learning. PhD thesis, University of London London, England.
• Kearns and Singh (1999) Kearns, M. J. and Singh, S. P. (1999). Finite-sample convergence rates for q-learning and indirect algorithms. In Advances in neural information processing systems, pages 996–1002.
• Lagoudakis and Parr (2003) Lagoudakis, M. G. and Parr, R. (2003). Least-squares policy iteration. Journal of machine learning research, 4(Dec):1107–1149.
• Lazaric et al. (2012) Lazaric, A., Ghavamzadeh, M., and Munos, R. (2012). Finite-sample analysis of least-squares policy iteration. Journal of Machine Learning Research, 13(Oct):3041–3074.
• Littman (1994) Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Machine learning proceedings 1994, pages 157–163. Elsevier.
• Littman (1996) Littman, M. L. (1996). Algorithms for sequential decision making.
• Littman (2001a) Littman, M. L. (2001a). Friend-or-foe q-learning in general-sum games. In ICML, volume 1, pages 322–328.
• Littman (2001b) Littman, M. L. (2001b). Value-function reinforcement learning in markov games. Cognitive Systems Research, 2(1):55–66.
• Ludwig (1995) Ludwig, W. (1995). A subexponential randomized algorithm for the simple stochastic game problem. Information and computation, 117(1):151–155.
• Melo et al. (2008) Melo, F. S., Meyn, S. P., and Ribeiro, M. I. (2008). An analysis of reinforcement learning with function approximation. In Proceedings of the 25th international conference on Machine learning, pages 664–671. ACM.
• Munos and Szepesvári (2008) Munos, R. and Szepesvári, C. (2008). Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(May):815–857.
• Nedić and Bertsekas (2003) Nedić, A. and Bertsekas, D. P. (2003). Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dynamic Systems, 13(1-2):79–110.
• Neyman et al. (2003) Neyman, A. and Sorin, S. (2003). Stochastic games and applications, volume 570. Springer Science & Business Media.
• Parr et al. (2008) Parr, R., Li, L., Taylor, G., Painter-Wakefield, C., and Littman, M. L. (2008). An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings of the 25th international conference on Machine learning, pages 752–759. ACM.
• Pérolat et al. (2016) Pérolat, J., Piot, B., Geist, M., Scherrer, B., and Pietquin, O. (2016). Softened approximate policy iteration for markov games. In ICML 2016-33rd International Conference on Machine Learning.
• Perolat et al. (2015) Perolat, J., Scherrer, B., Piot, B., and Pietquin, O. (2015). Approximate dynamic programming for two-player zero-sum markov games. In International Conference on Machine Learning (ICML 2015).
• Puterman (2014) Puterman, M. L. (2014). Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
• Rao et al. (1973) Rao, S. S., Chandrasekaran, R., and Nair, K. (1973). Algorithms for discounted stochastic games. Journal of Optimization Theory and Applications, 11(6):627–637.
• Shapley (1953) Shapley, L. S. (1953). Stochastic games. Proceedings of the national academy of sciences, 39(10):1095–1100.
• Sidford et al. (2018a) Sidford, A., Wang, M., Wu, X., Yang, L., and Ye, Y. (2018a). Near-optimal time and sample complexities for solving markov decision processes with a generative model. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R., editors, Advances in Neural Information Processing Systems 31, pages 5186–5196. Curran Associates, Inc.
• Sidford et al. (2018b) Sidford, A., Wang, M., Wu, X., and Ye, Y. (2018b). Variance reduced value iteration and faster algorithms for solving markov decision processes. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 770–787. Society for Industrial and Applied Mathematics.
• Singh and Yee (1994) Singh, S. P. and Yee, R. C. (1994). An upper bound on the loss from approximate optimal-value functions. Machine Learning, 16(3):227–233.
• Sutton et al. (2009) Sutton, R. S., Maei, H. R., and Szepesvári, C. (2009). A convergent temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in neural information processing systems, pages 1609–1616.
• Tagorti and Scherrer (2015) Tagorti, M. and Scherrer, B. (2015). On the rate of convergence and error bounds for LSTD(λ). In International Conference on Machine Learning, pages 1521–1529.
• Tsitsiklis and Van Roy (1997) Tsitsiklis, J. N. and Van Roy, B. (1997). Analysis of temporal-difference learning with function approximation. In Advances in neural information processing systems, pages 1075–1081.
• Von Neumann and Morgenstern (2007) Von Neumann, J. and Morgenstern, O. (2007). Theory of games and economic behavior (commemorative edition). Princeton university press.
• Yang and Wang (2019) Yang, L. F. and Wang, M. (2019). Sample-optimal parametric q-learning with linear transition models. arXiv preprint arXiv:1902.04779.
• Yang et al. (2019) Yang, Z., Xie, Y., and Wang, Z. (2019). A theoretical analysis of deep q-learning. arXiv preprint arXiv:1901.00137.
• Zhang et al. (2018) Zhang, K., Yang, Z., Liu, H., Zhang, T., and Başar, T. (2018). Finite-sample analyses for fully decentralized multi-agent reinforcement learning. arXiv preprint arXiv:1812.02783.

## Appendix A Proof of Theorem 1

We first present the definition of optimal counterstrategies.

###### Definition 5.

For player 1's strategy $\pi_1$, we call $\pi_2$ a player 2's optimal counterstrategy against $\pi_1$ if, for any player 2's strategy $\bar{\pi}_2$, we have $v^{\pi_1,\pi_2} \le v^{\pi_1,\bar{\pi}_2}$. For player 2's strategy $\pi_2$, we call $\pi_1$ a player 1's optimal counterstrategy against $\pi_2$ if, for any player 1's strategy $\bar{\pi}_1$, we have $v^{\pi_1,\pi_2} \ge v^{\bar{\pi}_1,\pi_2}$.

It is known from Puterman (2014) that for any player 1's strategy $\pi_1$ (player 2's strategy $\pi_2$), the optimal counterstrategy against $\pi_1$ ($\pi_2$) always exists.

Our next lemma indicates that we can use the error of the parametrized Q-functions to bound the error of the value functions of the parametrized strategies.

###### Lemma 1.

If

$$\|Q_w(s,a) - Q^*(s,a)\|_\infty \le \zeta, \tag{14}$$

then we have

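$$\|v^{\pi_1,\pi_2^*} - v^*\|_\infty \le \frac{2\zeta}{1-\gamma}, \qquad \|v^{\pi_1^*,\pi_2} - v^*\|_\infty \le \frac{2\zeta}{1-\gamma}, \tag{15}$$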

where , and are optimal counterstrategies of .

###### Proof.

We only prove the first inequality of (15). The proof of the second inequality is similar.

For any ,

 |vπ1,π∗2(s)−v∗(s)| =|vπ1,π∗2(s)−Q∗(s,π∗(s))| ≤|vπ1,π∗2(s)−Q∗(s,π1(s))|+|Q∗(s,π1(s))−Q∗(s,π∗(s))| =|γP(⋅|s,π1(s))vπ1,π∗2−γP(⋅|s,π1(s))v∗|+|Q∗(s,π1(s))−Q∗(s,π∗(s