1 Introduction
In this paper we study the sample complexity of learning a near-optimal strategy in discounted two-player turn-based zero-sum stochastic games Shapley (1953); Hansen et al. (2013), which we refer to more concisely as stochastic games. Stochastic games model dynamic strategic settings in which two players take turns and the state of the game evolves stochastically according to some transition law. This model encapsulates a major challenge in multi-agent learning: other agents may be learning and adapting as well. Further, stochastic games generalize the Markov decision process (MDP), a fundamental model for reinforcement learning, to the two-player setting Littman (1994). MDPs can be viewed as degenerate stochastic games in which one of the players has no influence. Consequently, understanding stochastic games is a natural step towards resolving the challenge in reinforcement learning of extending single-agent learning to multi-agent settings.
There is a long line of research on both MDPs and stochastic games (for a more thorough introduction, see Filar and Vrieze (2012); Hansen et al. (2013) and references therein). Strikingly, Hansen et al. (2013) showed that stochastic games admit a pure-strategy Nash equilibrium which can be computed in strongly polynomial time, provided the game matrix is fully accessible and the discount factor is fixed. In reinforcement learning settings, however, the transition function of the game is unknown, and a common goal is to obtain an approximately optimal strategy (a function that maps states to actions) that obtains an expected cumulative reward of at least (or at most) the Nash equilibrium value no matter what the other player does. Unfortunately, despite interest in generalizing MDP results to stochastic games, the best known running times and sample complexities for solving stochastic games in a variety of settings are currently worse than those for solving MDPs. This may not be surprising, since stochastic games are in general harder to solve than MDPs: whereas MDPs can be solved in (weakly) polynomial time, it remains open whether the same can be done for stochastic games.
There are two natural approaches towards achieving sample complexity bounds for solving stochastic games. The first is to note that the popular stochastic value iteration, dynamic programming, and Q-learning methods all apply to stochastic games Littman (1994); Hu and Wellman (2003); Littman (2001a); Perolat et al. (2015). Consequently, recent advances in these methods Kearns and Singh (1999); Sidford et al. (2018b) developed for MDPs can be directly generalized to solving stochastic games (though the sample complexity of these generalized methods has not been analyzed previously). It is tempting to generalize the analysis of sample-optimal methods for estimating values Azar et al. (2013) and estimating policies Sidford et al. (2018a) of MDPs to stochastic games. However, this is challenging, as these methods rely on monotonicities in MDPs induced by the linear programming nature of the problem Azar et al. (2013); Sidford et al. (2018a). The second approach is to apply strategy iteration or alternating minimization/maximization to reduce solving stochastic games to approximately solving a sequence of MDPs. Unfortunately, the best analysis of such a method Hansen et al. (2013) requires solving MDPs. Consequently, even if this approach could be carried out with approximate MDP solvers, the resulting sample complexity for solving stochastic games would be larger than that needed for solving MDPs. More discussion of the related literature is given in Section 1.4.
Given the importance of solving stochastic games in reinforcement learning (e.g. Hu et al. (1998); Bowling and Veloso (2000, 2001); Hu and Wellman (2003); Arslan and Yüksel (2017)), this suggests the following fundamental open problem:
Can we design stochastic game learning algorithms that provably match the performance of MDP algorithms and achieve near-optimal sample complexities?
In this paper, we answer this question in the affirmative in the particular case of solving discounted stochastic games with a generative model, i.e. an oracle for sampling from the transition function at state-action pairs. We provide an algorithm with the same near-optimal sample complexity that is known for solving discounted MDPs. Further, we achieve this result by showing how to transform particular MDP algorithms that satisfy particular two-sided monotonicity constraints into algorithms for solving stochastic games. Therefore, while there is a major gap between MDPs and stochastic games in terms of the computation time for obtaining exact solutions, this gap disappears when considering the sample complexity of the two. We hope this work opens the door to extending results for MDPs to stochastic games more generally, thereby enabling the application of the rich research on reinforcement learning to broader multi-player settings with little overhead.
1.1 The Model
Formally, throughout this paper we consider discounted turn-based two-player zero-sum stochastic games described as a tuple . In these games there are two players: a min or minimization player, which seeks to minimize the cumulative reward of the game, and a max or maximization player, which seeks to maximize it. Here, and are disjoint finite sets of states controlled by the min-player and the max-player respectively, and their union is the set of all possible states of the game. Further, is a finite set of actions available at each state, is a transition probability function, is the payoff or reward function, and is a discount factor.^{1} ^{1}Standard reductions allow this result to be applied for rewards of a broader range Sidford et al. (2018a). Further, while we assume there are the same number of actions per state, our results easily extend to the case where this is non-uniform; in this case our dependencies on can be replaced with the number of state-action pairs.
Stochastic games are played dynamically over a sequence of turns, , starting from some initial state at turn . In each turn , the game is in one of the states , and the player who controls that state chooses, or plays, an action from the action space . This action yields reward for the turn and causes the next state to be chosen at random from , where the transition probability is . The goal of the min-player (resp. max-player) is to choose actions so as to minimize (resp. maximize) the expected infinite-horizon discounted reward, or value, of the game .
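As a concrete illustration of these dynamics, the following sketch rolls out one trajectory of a toy turn-based game and computes its discounted return. All states, rewards, and transition probabilities here are hypothetical and not from the paper.

```python
import random

# Toy turn-based zero-sum stochastic game; all numbers are hypothetical.
# States 0 and 1 belong to the min player; state 2 belongs to the max player.
MIN_STATES = {0, 1}
GAMMA = 0.9

# P[s][a] is the distribution over next states; R[s][a] is the turn's reward.
P = {
    0: {0: [0.8, 0.1, 0.1], 1: [0.1, 0.8, 0.1]},
    1: {0: [0.2, 0.2, 0.6], 1: [0.5, 0.3, 0.2]},
    2: {0: [0.3, 0.3, 0.4], 1: [0.1, 0.1, 0.8]},
}
R = {0: {0: 0.0, 1: 0.2}, 1: {0: 0.5, 1: 0.1}, 2: {0: 1.0, 1: 0.4}}

def play(min_strategy, max_strategy, s0=0, horizon=500, seed=0):
    """Roll out one trajectory; return the discounted cumulative reward."""
    rng = random.Random(seed)
    s, total, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        # The player who controls the current state chooses the action.
        a = min_strategy[s] if s in MIN_STATES else max_strategy[s]
        total += discount * R[s][a]
        discount *= GAMMA
        s = rng.choices([0, 1, 2], weights=P[s][a])[0]
    return total

value = play({0: 0, 1: 0}, {2: 0})
```

Since rewards here lie in [0, 1], any discounted return is bounded by 1/(1-γ); truncating at a finite horizon only changes the value by a γ^horizon term.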
In this paper we focus on the case where the players play pure (deterministic) stationary strategies (policies), i.e. strategies which depend only on the current state. That is, we wish to compute a min-player strategy or policy , which defines the action the min-player chooses at each state in , and a max-player strategy , which defines the action the max-player chooses at each state in . We call a pair of min-player and max-player strategies simply a strategy. Further, we let for and for , and define the value function or expected discounted cumulative reward by where
and the expectation is over the random sequence of states generated according to under the strategy , i.e. for all .
Our goal in solving a game is to compute an approximate Nash equilibrium restricted to stationary strategies Nash (1951); Maskin and Tirole (2001). We call a strategy an equilibrium strategy or optimal if
and we call it ε-optimal if these same inequalities hold up to an additive ε entrywise. It is worth noting that the best response to a stationary strategy is itself stationary Fudenberg and Tirole (1991), and that there always exists a pure stationary strategy attaining the Nash equilibrium Shapley (1953). Consequently, it is sufficient to focus on deterministic strategies.
Throughout this paper we focus on solving stochastic games in the learning setting, where the game is not fully specified. We assume access to a generative model which, given any state-action pair, i.e. and , can sample a next state independently at random from the transition probability function, i.e. . Access to a generative model is a standard and natural assumption (Kakade (2003); Azar et al. (2013); Sidford et al. (2018a); Agarwal et al. (2019)) and corresponds to PAC learning. The special case of solving an MDP given a generative model has been studied extensively (Kakade (2003); Azar et al. (2013); Sidford et al. (2018b, a); Agarwal et al. (2019)) and is a natural proving ground for designing theoretically motivated reinforcement learning algorithms.
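A minimal sketch of such a generative-model oracle (toy interface and hypothetical numbers, not the paper's notation): the learner never sees the transition distribution itself, only i.i.d. next-state samples for queried state-action pairs.

```python
import random

def generative_oracle(P, s, a, n, seed=0):
    """Generative model: draw n i.i.d. next-state samples for (s, a).

    P[s][a] is the transition distribution, hidden from the learner;
    the oracle only exposes samples drawn from it.
    """
    rng = random.Random(seed)
    support = range(len(P[s][a]))
    return [rng.choices(support, weights=P[s][a])[0] for _ in range(n)]

# The empirical frequencies approximate P(. | s, a):
P = {0: {0: [0.7, 0.3]}}
samples = generative_oracle(P, 0, 0, 20000)
p_hat = samples.count(1) / len(samples)   # close to the true 0.3
```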
1.2 Our Results
In this paper we provide an algorithm that computes an ε-optimal strategy using a sample size that matches the best known sample complexity for solving discounted MDPs. Further, our algorithm runs in time proportional to the number of samples and uses space proportional to . Interestingly, we achieve this result by showing how to run a two-player variant of Q-learning such that the induced value-strategy sequences enjoy certain monotonicity properties. Essentially, we show that provided a value-improving algorithm is sufficiently stable, it can be extended to the two-player setting with limited loss. This allows us to leverage recent advances in solving single-player games to solve stochastic games with limited overhead. Our main result is given below.
Theorem 1.1 (Main Theorem).
There is an algorithm which, given a stochastic game with a generative model, outputs, with probability at least , an ε-optimal strategy by querying samples, where and hides polylogarithmic factors. The algorithm runs in time and uses space .
Our sample and time complexities are optimal due to a known lower bound for the single-player case by Azar et al. (2013), who showed that solving any one-player MDP to optimality with high probability requires at least samples. Our sample complexity upper bound generalizes the recent sharp sample complexity results for solving discounted MDPs Sidford et al. (2018a); Agarwal et al. (2019), and matches the information-theoretic sample complexity up to polylogarithmic factors. This result provides the first, and near-optimal, sample complexity for solving two-player stochastic games.
1.3 Notation and Preliminaries
Notation:
We use to denote the all-ones vector, whose dimension is adapted to the context. We use the operators as entrywise operators on vectors. We identify the transition probability function with a matrix in , and each of its rows with a vector. We denote as a vector in and as a vector in ; therefore is a vector in . We use to denote strategy pairs and for the min-player or max-player strategy. For any strategy , we define as for . We denote as a linear operator defined as

Min-value and max-value:
For a min-player strategy , we define its value as
(1) 
We let denote a maximizing argument of the above, and call it an optimal counter-strategy to . Thus the value of a min-player strategy gives its expected reward in the worst case. We say a min-player strategy is optimal if
The value and optimality for the max-player are defined similarly. We denote by the optimal strategy and by the value function of the optimal strategy.
Q-function:
For a strategy , we denote its Q-function (or action-value function) by . For a vector , we denote . Given a , we denote the greedy value of as
Bellman Operator:
We denote the Bellman operator, , as follows: , and
We also denote the greedy strategy, or , as the maximization/minimization argument of the operator. Moreover, for a given strategy , we denote . For a given min-player strategy , we define the half Bellman operator
We define similarly. Note that is the unique fixed point of the Bellman operator, i.e., (known as the Bellman equation Bellman (1957)). Similarly, (resp. ) is the unique fixed point of (resp. ). The (half) Bellman operators satisfy the following properties (see, e.g., Hansen et al. (2013); Puterman (2014))

contraction: ;

monotonicity: .
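Both properties can be checked numerically on a toy two-state game (hypothetical numbers; the min player controls state 0 and the max player controls state 1). This sketch implements the two-player Bellman operator as a min over actions at min-player states and a max at max-player states.

```python
import numpy as np

GAMMA = 0.9
# P[s, a] is a distribution over next states; R[s, a] is the reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[0.0, 1.0], [0.5, 0.2]])
MIN_STATES = {0}

def bellman(v):
    """Two-player Bellman operator: min over actions at min-player
    states, max over actions at max-player states."""
    q = R + GAMMA * (P @ v)                    # action values q[s, a]
    return np.array([q[s].min() if s in MIN_STATES else q[s].max()
                     for s in range(len(v))])

# gamma-contraction in the sup norm on two arbitrary vectors:
rng = np.random.default_rng(0)
u, w = rng.random(2) * 10, rng.random(2) * 10
gap_out = np.abs(bellman(u) - bellman(w)).max()
gap_in = np.abs(u - w).max()
```

Repeated application of `bellman` from any starting vector converges to the unique fixed point, and `gap_out <= GAMMA * gap_in` holds deterministically.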
High Probability:
We say an algorithm has a property “with high probability” if, for any , by increasing its time and sample complexity by it has the property with probability .
1.4 Previous Work
Here we provide a more detailed survey of previous work related to stochastic games and MDPs. Two-person stochastic games generalize MDPs Shapley (1953): when one of the players has only one action to choose from, the problem reduces to an MDP. A related model is the stochastic game in which both players choose their respective actions simultaneously at each state, and the process transitions to the next state under the control of both players Shapley (1953). The turn-based stochastic game can be reduced to the game with simultaneous moves Pérolat et al. (2015).
Computing an optimal strategy for a two-player turn-based zero-sum stochastic game is known to be in NP ∩ coNP Condon (1992). Later, Hansen et al. (2013) showed that strategy iteration, a generalization of Howard’s policy iteration algorithm Howard (1960), solves the discounted problem in strongly polynomial time when the discount factor is fixed. Their work uses ideas from Ye (2011), which proved that the policy iteration algorithm solves the discounted MDP (DMDP) in strongly polynomial time when the discount factor is fixed. In general (e.g., if the discount factor is part of the input), it is open whether stochastic games can be solved in polynomial time Littman (1996). This is in contrast to MDPs, which can be solved in (weakly) polynomial time as a special case of linear programming.
The algorithms and complexity theory for solving two-player stochastic games are closely related to those for solving MDPs. There is a vast literature on solving MDPs, dating back to Bellman, who developed value iteration in 1957 Bellman (1957). Policy iteration was introduced shortly after by Howard Howard (1960), and its complexity has been studied extensively in Mansour and Singh (1999); Ye (2011); Scherrer (2013). Later, d’Epenoux (1963) and De Ghellinck (1960) discovered that MDPs are special cases of linear programming, which leads to the insight that the simplex method, when applied to solving DMDPs, is a simple policy iteration method. Ye (2011) showed that policy iteration (a variant of the general simplex method for linear programming) and the simplex method are strongly polynomial for DMDPs and terminate in iterations. Hansen et al. (2013) and Scherrer (2013) improved the iteration bound to for Howard’s policy iteration method. The best known convergence results for policy and strategy iteration are given by Ye (2005) and Hansen et al. (2013); the best known iteration complexities for both problems are of the order , which becomes unbounded as . It is worth mentioning that Ye (2005) designed a combinatorial interior-point algorithm (CIPA) that solves the DMDP in strongly polynomial time.
Sample-based algorithms for learning value and policy functions for MDPs have been studied in Kearns and Singh (1999); Kakade (2003); Singh and Yee (1994); Azar et al. (2011b, 2013); Sidford et al. (2018b, a); Agarwal et al. (2019), and many others. Among these papers, Azar et al. (2013) obtains the first tight sample bound for finding an optimal value function, and for finding optimal policies in a restricted regime, and Sidford et al. (2018a) obtains the first tight sample bound for finding an optimal policy for any . Both sample complexities are of the form . Lower bounds have been shown in Azar et al. (2011a); Even-Dar et al. (2006) and Azar et al. (2013); Azar et al. (2013) gives the first tight lower bound . For undiscounted average-reward MDPs, a primal-dual based method was proposed in Wang (2017) which achieves sample complexity , where is the worst-case mixing time and is the ergodicity ratio. A sampling-based method for two-player stochastic games was considered in Wei et al. (2017) in an online learning setting; however, their algorithm leads to a suboptimal sample complexity when generalized to the generative model setting.
As for general stochastic games, the minimax Q-learning algorithm and the friend-and-foe Q-learning algorithm were introduced in Littman (1994) and Littman (2001a), respectively. The Nash Q-learning algorithm was proposed for zero-sum games in Hu and Wellman (2003) and for general-sum games in Littman (2001b); Hu and Wellman (1999).
2 Technique Overview
Since stochastic games are a generalization of MDPs, many techniques for solving MDPs immediately generalize to stochastic games. However, as we have discussed, some of the techniques used to achieve optimal sample complexities for solving MDPs with a generative model have no clear generalization to stochastic games. Nevertheless, we show how to design an algorithm that carefully extends particular Q-learning based methods, i.e. methods that always maintain an estimator of the optimal value function (or ), to achieve our goals.
Q-Learning:
To motivate our approach, we first briefly review previous Q-learning based methods and the core technique that achieves near-optimal sample complexity. To motivate Q-learning, we first recall the value iteration algorithm for solving an MDP. Given a full model of the MDP, value iteration updates the iterates as follows:
where can be an arbitrary vector. Since the Bellman operator is contractive and is a fixed point of , this method gives an optimal value in iterations. In the learning setting, cannot be computed exactly. The Q-learning approach replaces by an approximate version: to compute , we obtain samples from and then compute the empirical average. We then compute the approximate Q-value at the -th iteration as
where
for some . Then the estimation error per step is defined as
Since exact value iteration takes at least iterations to converge, Q-learning (or approximate value iteration) takes at least iterations. The total number of samples used over all iterations is the sample complexity of the algorithm.
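The sampled iteration just described can be sketched as follows on the toy two-state game from before (hypothetical parameters; the per-iteration batch size M plays the role of the empirical-average sample count above).

```python
import numpy as np

GAMMA, M = 0.9, 4000          # M samples per state-action pair per iteration
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[0.0, 1.0], [0.5, 0.2]])
MIN_STATES = {0}
rng = np.random.default_rng(0)

def sampled_backup(v):
    """One approximate value-iteration step: the exact expectation P v
    is replaced by an empirical average over generative-model samples."""
    q = np.empty_like(R)
    for s in range(2):
        for a in range(2):
            nxt = rng.choice(2, size=M, p=P[s, a])     # generative model
            q[s, a] = R[s, a] + GAMMA * v[nxt].mean()  # empirical P v
    return np.array([q[s].min() if s in MIN_STATES else q[s].max()
                     for s in range(2)])

v = np.zeros(2)
for _ in range(100):   # roughly (1 - GAMMA)^{-1} log(1/eps) iterations
    v = sampled_backup(v)
```

With enough samples per iteration, the sampled iterate lands close to the exact fixed point; the analysis in the text is about how small M can be made while preserving this.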
Variance Control and Monotonicity Techniques:
To obtain the optimal sample complexity for the one-player MDP, one approach is to carefully bound each entry of . By Bernstein’s inequality (Azar et al. (2013); Sidford et al. (2018a); Agarwal et al. (2019)), we have, with high probability,
where is the variance-of-value vector and “” means “approximately less than.” Let be a policy maintained at the -th iteration (e.g. the greedy policy of the current Q-value). Due to the estimation error , the per-step error bound reads
To derive the overall error accumulation, Sidford et al. (2018a) use the crucial monotonicity property, i.e., since , we have
(2) 
We thus have
By induction, we have
(3) 
The leading-order error accumulation term satisfies the so-called total-variance property and can be upper bounded uniformly by , resulting in the correct dependence on . The monotonicity property therefore allows us to use as a proxy policy with which to carefully bound the error accumulation. For the additional subtlety of how to obtain an optimal policy, we refer to Sidford et al. (2018a) for the variance-reduction and monotone-policy techniques.
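In the notation above, the monotonicity-based recursion and its unrolled form, reconstructed here informally from the standard single-player analysis (the precise statements and constants are in Azar et al. (2013); Sidford et al. (2018a)), read roughly as:

```latex
% One-step recursion: playing the optimal policy \pi^\star as a proxy,
% monotonicity of P^{\pi^\star} gives (informally)
v^\star - v_i \;\le\; \gamma P^{\pi^\star}\bigl(v^\star - v_{i-1}\bigr) + \epsilon_i .
% Unrolling by induction over i = 1, \dots, R:
v^\star - v_R \;\le\; \sum_{i=1}^{R} \bigl(\gamma P^{\pi^\star}\bigr)^{R-i} \epsilon_i
    \;+\; \bigl(\gamma P^{\pi^\star}\bigr)^{R} \bigl(v^\star - v_0\bigr).
```

The first sum is the leading-order error-accumulation term discussed above; the second term decays geometrically in R.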
Similar observations regarding MDPs were used in Agarwal et al. (2019) as well. This powerful technique, however, does not generalize to the game case due to the lack of monotonicity. Indeed, (2) does not hold for stochastic games because of the presence of both minimization and maximization operations in the Bellman operator. This is the critical issue that this paper seeks to overcome.
Finding Monotone Value-Strategy Sequences for Stochastic Games:
Analogously to the MDP case, one approach to bounding the error accumulation for stochastic games is to bound each entry of the error vector carefully. In fact, our method for solving stochastic games closely parallels the MDP method of Sidford et al. (2018a); however, the analysis differs substantially in order to resolve the difficulty introduced by the lack of monotonicity.
Since a stochastic game has two players, we modify the variance-reduced Q-value iteration (vQVI) method of Sidford et al. (2018a) to obtain a min-player strategy and a max-player strategy respectively. Since the two players are symmetric, we focus on introducing and analyzing the algorithm for the min-player. By a slight modification of the vQVI method, we can guarantee a sequence of strategies and values, , that satisfies, with high probability,
(4) 
where . The first property guarantees that the value sequence is monotonically decreasing, the second property guarantees that is always an upper bound on the value , and the third and fourth inequalities guarantee that is well approximated by and that the estimation error satisfies
where is the total number of samples used per state-action pair. Note that as long as we can guarantee , we can guarantee that the min-player strategy is also good:
Controlling Error Accumulation using Auxiliary Markovian Strategy:
Due to the lack of monotonicity (2), we cannot use the optimal strategy as a proxy to carefully account for the error accumulation. To resolve this issue, we construct a new proxy strategy . This is a Markovian strategy, which is time-dependent but not history-dependent, i.e., at time the strategy played is a deterministic map . The proxy strategy satisfies the following:

Underestimation. Its value, , (the expected discounted cumulative reward starting from any time) is upper bounded by ;

Contraction. ,
Similarly, we can bound the error by the variance-of-value of the proxy strategy
Based on the first property, we can upper bound
Based on the second property, and induction on , we can now write a new form of error accumulation,
where for all . We derive a new law-of-total-variance bound for the first term and ultimately prove an error-accumulation upper bound:
giving the optimal sample bound.
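An informal rendering of the total-variance bound used here, with the time-dependent proxy strategy replacing powers of a single transition matrix (constants hedged; this is a sketch of the shape of the bound, not its precise statement):

```latex
% \sigma_{\pi_t} denotes the variance-of-value vector of the proxy
% strategy at time t, and values are bounded by (1-\gamma)^{-1}.
\Bigl\| \sum_{t \ge 0} \gamma^{t}
        \Bigl( \textstyle\prod_{j \le t} P^{\pi_j} \Bigr)
        \sqrt{\sigma_{\pi_t}} \Bigr\|_{\infty}
  \;\lesssim\; \sqrt{(1-\gamma)^{-3}} .
```

Combined with the Bernstein-type per-step error, this is what yields the correct dependence on the effective horizon.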
3 Sample Complexity of Stochastic Games
In this section, we provide and analyze our sampling-based algorithm for solving stochastic games. Recall that we have a generative model for the game from which we can obtain samples for state-action pairs; each sample is obtained in time . We therefore care about the total number of samples used and the total amount of time consumed by the algorithm. We provide an efficient algorithm that takes as input a generative model and obtains a good strategy for the underlying stochastic game.
We now describe the algorithm. Since the min-player and max-player are symmetric, we focus on the min-player strategy. For the max-player strategy, we can either consider the game , in which the roles of the max- and min-players are switched, or use the corresponding algorithm for the max-player defined in Section 4.4, which is a direct generalization of the min-player algorithm.
(5) 
The Full Algorithm.
For simplicity, let us denote . Our full algorithm uses the QVI-MDVSS algorithm (Algorithm 1) as a subroutine. As we will show shortly, this subroutine maintains a monotone value-strategy sequence with high probability. Suppose the algorithm is specified by an accuracy parameter . We initialize a value vector and an arbitrary strategy . Let . Then our initial value and strategy satisfy the input requirement of Algorithm 1:
Let and .
We run Algorithm 1 repeatedly:
where , and we take the terminal value and strategy of the output sequence of Algorithm 1 as the input for the next iteration. In total we run the iteration (6) for rounds. At the end, we output from as our min-player strategy.
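The outer loop above can be sketched as follows. The subroutine stands in for Algorithm 1 (QVI-MDVSS), and the halving guarantee is the one claimed in the proof sketch below; the function and constant names here are hypothetical.

```python
import math

def solve_min_player(subroutine, v0, pi0, eps, gamma):
    """Error-halving outer loop (sketch).  Each call to `subroutine`
    (standing in for QVI-MDVSS / Algorithm 1) is assumed to halve the
    optimality gap, so O(log(1 / ((1 - gamma) * eps))) calls suffice
    starting from the trivial gap of 1 / (1 - gamma)."""
    v, pi = v0, pi0
    rounds = max(1, math.ceil(math.log2(1.0 / ((1.0 - gamma) * eps))))
    for _ in range(rounds):
        v, pi = subroutine(v, pi)   # terminal value/strategy fed back in
    return v, pi

# Stub subroutine that literally halves a scalar "gap", for illustration:
halver = lambda v, pi: (v / 2.0, pi)
final_gap, _ = solve_min_player(halver, 10.0, None, eps=0.1, gamma=0.9)
```

With gamma = 0.9 the trivial starting gap is 10, and seven halvings bring it below eps = 0.1.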
The formal guarantee of the algorithm is presented in the following theorem.
Theorem 3.1 (Restatement of Theorem 1.1).
Given a stochastic game with a generative model, there exists (constructively) an algorithm that outputs, with probability at least , an ε-optimal strategy by querying samples in time using space , where and hides polylogarithmic factors.
The formal proof of Theorem 3.1 is given in the next section. Here we give a sketch of the proof.
Proof Sketch of Theorem 3.1:
We first describe the high-level idea. Considering one iteration of (6), we claim that if the input value and strategy satisfy the input condition (5), then with probability at least , the terminal value and strategy of the output sequence, , satisfy
(7) 
and satisfies the input condition (5). Namely, with high probability, the error of the output is decreased by at least half, and the output can be used as an input to the QVI-MDVSS algorithm again. Suppose we run the subroutine of Algorithm 1 for iterations; conditioned on the event that all instances of QVI-MDVSS succeed, the final error of is at most , as desired. By setting for some , all QVI-MDVSS instances succeed with probability at least . It remains to show that the algorithm QVI-MDVSS works as claimed.
High-level Structure of Algorithm 1. To outline the proof, we define a monotone decreasing value-strategy sequence (MDVSS) as a sequence satisfying (4), where and . A more formal treatment of such sequences is presented in Section 4.2.
We next introduce the high-level idea of Algorithm 1. The basic step of the algorithm is to perform approximate value iteration while preserving all the monotonicity properties required by an MDVSS, i.e., we would like to approximate
We would like to approximate using samples, but we do not want to use the same number of samples in every iteration (as this becomes costly if the number of iterations is large). Instead, we compute only the first iteration (i.e., the estimate of ) to high accuracy with a large number of samples ( samples, defined in Line 10). These computations are presented in Lines 17-23. To maintain an upper bound on the estimation error, we also compute the empirical variances of the updates in Line 19. We shift our estimates upwards by the estimation-error upper bounds to make our estimators one-sided, which is crucial for maintaining the MDVSS properties. For the subsequent steps (Lines 29-40), we use samples per iteration () to estimate . The expectation is that has a small norm, and hence can be estimated to high accuracy with only a small number of samples. The estimator of plus the estimator of from the initialization steps gives a high-accuracy estimator (Line 40) for the value iteration. Since , the total number of samples per state-action pair is dominated by . This idea is formally known as variance reduction, first proposed for solving MDPs in Sidford et al. (2018b). Similarly, we shift our estimators to be one-sided. We additionally maintain carefully designed strategies in Lines 29-31 to preserve monotonicity. Hence the algorithm can be viewed as a value-strategy iteration algorithm.
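The variance-reduction idea can be shown in isolation on a single row p of the transition matrix (hypothetical numbers): estimate the anchor term p·v_ref once with many samples, then estimate only the small correction p·(v - v_ref) with few samples.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])        # one row of P (hidden from the learner)
v_ref = np.array([5.0, 7.0, 9.0])    # reference value from the first iteration

# Expensive, done once: high-accuracy estimate of p . v_ref (many samples).
nxt = rng.choice(3, size=100_000, p=p)
anchor = v_ref[nxt].mean()

def vr_estimate(v, m2=200):
    """Cheap variance-reduced estimate of p . v: the anchor plus a
    small-sample estimate of p . (v - v_ref), which has low variance
    whenever ||v - v_ref||_inf is small."""
    nxt = rng.choice(3, size=m2, p=p)
    return anchor + (v - v_ref)[nxt].mean()

estimate = vr_estimate(v_ref + 0.05)          # a v close to v_ref
truth = p @ (v_ref + 0.05)
```

Because the correction term has small magnitude, m2 can be far smaller than the anchor's sample count without losing accuracy, which is exactly why the total sample count is dominated by the initialization step.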
Correctness of Algorithm 1. We now sketch the proof of correctness for Algorithm 1. First, Proposition 4.3 shows that if an MDVSS, e.g., , satisfies for some , then its terminal strategies and values satisfy
This indicates that as long as we can show , the halving-error property (7) holds.
Proposition 4.4 shows that the halving-error property can be achieved by setting
where is the variance-of-value vector of and . The proof is based on constructing an auxiliary Markovian strategy for analyzing the error accumulation throughout the value-strategy iterations. The Markovian strategy is a time-dependent strategy used as a proxy for analyzing the entrywise error recursion (Lemmas 4.4-4.11).
Proposition 4.12 shows that, with high probability, Algorithm 1 produces value-strategy sequences , which are indeed MDVSSs and satisfy Proposition 4.4. The proof involves analyzing the probability of “good events” on which monotonicity is preserved at every iteration, using confidence estimates computed during the iterations and concentration arguments. See Lemmas 4.13-4.18 for the full proof of Proposition 4.12.
Putting Everything Together. Finally, by putting together the strategies, we conclude that the terminal strategy of iteration (6) is always an approximately optimal min-player strategy for the game, with high probability. As for the implementation, since our algorithm only computes inner products based on samples, the total computation time is proportional to the number of samples. Moreover, since we can update as samples are drawn and output the monotone sequences as they are generated, we do not need to store the samples or the value-strategy sequences; thus the overall space is . ∎
4 Proof of Main Results
The remainder of this section is devoted to proving Theorem 1.1. We prove it by formally introducing the notion of monotone value-strategy sequences. We then show that if an algorithm outputs a monotone value-strategy sequence, the terminal strategy of the sequence is always an approximately optimal strategy for the game, and that Algorithm 1 produces monotone value-strategy sequences with high probability.
4.1 Additional Notation
First we provide additional notation critical to our proofs.
Markovian Strategies:
We denote a Markovian strategy as an infinitely long sequence of predefined strategies
where each is a normal deterministic strategy. We denote
as another Markovian strategy. We denote and as the min-player strategy and the max-player strategy respectively. When using the strategy, the players play at time . The strategy is Markovian because it does not depend on the history of moves. Note that a stationary strategy is a special case of a Markovian strategy: . The value of a Markovian strategy is defined as before, but with the states generated by playing the action at time . Since the strategy is time-dependent, we denote
The (half) Bellman operators are defined similarly to that of stationary policies.
4.2 Monotone Value-Strategy Sequences
In this section we formally define monotone value-strategy sequences. Such sequences, although not explicitly stated in Sidford et al. (2018b, a), are crucial for these algorithms to obtain a good policy while also obtaining a good value for an MDP. In the following sections, we denote , and as parameters. Monotone value-strategy sequences are formally defined as follows.
Definition 4.1 (Monotone Decreasing Value-Strategy Sequence).
A monotone decreasing value-strategy sequence (MDVSS) is a sequence where and satisfy

;

, ;

,