In this paper we study the sample complexity of learning a near-optimal strategy in discounted two-player turn-based zero-sum stochastic games Shapley (1953); Hansen et al. (2013), which we refer to more concisely as stochastic games. Stochastic games model dynamic strategic settings in which two players take turns and the state of the game evolves stochastically according to some transition law. This model encapsulates a major challenge in multi-agent learning: other agents may be learning and adapting as well. Further, stochastic games generalize the Markov decision process (MDP), a fundamental model for reinforcement learning, to the two-player setting Littman (1994). MDPs can be viewed as degenerate stochastic games in which one of the players has no influence. Consequently, understanding stochastic games is a natural step towards the broader challenge of extending single-agent reinforcement learning to multi-agent settings.
There is a long line of research on both MDPs and stochastic games (for a more thorough introduction, see Filar and Vrieze (2012); Hansen et al. (2013) and references therein). Strikingly, Hansen et al. (2013) showed that stochastic games admit a pure-strategy Nash equilibrium which can be computed in strongly polynomial time, provided the game matrix is fully accessible and the discount factor is fixed. In reinforcement learning settings, however, the transition function of the game is unknown, and a common goal is to obtain an approximately optimal strategy (a function that maps states to actions) that obtains an expected cumulative reward of at least (or at most) the Nash equilibrium value no matter what the other player does. Unfortunately, despite interest in generalizing MDP results to stochastic games, the best known running times/sample complexities for solving stochastic games in a variety of settings are currently worse than those for solving MDPs. This may not be surprising, since stochastic games are in general harder to solve than MDPs: whereas MDPs can be solved in (weakly) polynomial time, it remains open whether the same can be done for stochastic games.
There are two natural approaches towards achieving sample complexity bounds for solving stochastic games. The first is to note that the popular stochastic value iteration, dynamic programming, and Q-learning methods all apply to stochastic games Littman (1994); Hu and Wellman (2003); Littman (2001a); Perolat et al. (2015). Consequently, recent advances in these methods Kearns and Singh (1999); Sidford et al. (2018b)
developed for MDPs can be directly generalized to solving stochastic games (though the sample complexity of these generalized methods has not been analyzed previously). It is tempting to generalize the analysis of sample-optimal methods for estimating values Azar et al. (2013) and estimating policies Sidford et al. (2018a)
of MDPs to stochastic games. However, this is challenging, as these methods rely on monotonicities in MDPs induced by the linear programming nature of the problem Azar et al. (2013); Sidford et al. (2018a).
The second approach would be to apply strategy iteration or alternating minimization/maximization to reduce solving stochastic games to approximately solving a sequence of MDPs. Unfortunately, the best analysis of such a method Hansen et al. (2013) requires solving the induced MDPs exactly. Consequently, even if this approach could be carried out with approximate MDP solvers, the resulting sample complexity for solving stochastic games would be larger than that needed for solving MDPs. More discussion of the related literature is given in Section 1.4.
Given the importance of solving stochastic games in reinforcement learning (e.g. Hu et al. (1998); Bowling and Veloso (2000, 2001); Hu and Wellman (2003); Arslan and Yüksel (2017)), this suggests the following fundamental open problem:
Can we design stochastic game learning algorithms that provably match the performance of MDP algorithms and achieve near-optimal sample complexities?
In this paper, we answer this question in the affirmative in the particular case of solving discounted stochastic games with a generative model, i.e. an oracle for sampling from the transition function at state-action pairs. We provide an algorithm with the same near-optimal sample complexity that is known for solving discounted MDPs. Further, we achieve this result by showing how to transform particular MDP algorithms into algorithms for stochastic games that satisfy particular two-sided monotonicity constraints. Therefore, while there is a major gap between MDPs and stochastic games in terms of the computation time needed for exact solutions, this gap disappears when considering the sample complexity of the two. We hope this work opens the door to extending results for MDPs to stochastic games more generally, thereby enabling the application of the rich research on reinforcement learning to broader multi-player settings with little overhead.
1.1 The Model
Formally, throughout this paper, we consider discounted turn-based two-player zero-sum stochastic games described as the tuple . In these games there are two players: a min or minimization player, which seeks to minimize the cumulative reward in the game, and a max or maximization player, which seeks to maximize the cumulative reward. Here, and are disjoint finite sets of states controlled by the min-player and the max-player respectively, and their union is the set of all possible states of the game. Further, is a finite set of actions available at each state, is a transition probability function, is the payoff or reward function, and is a discount factor. (Standard reductions allow this result to be applied to rewards of a broader range Sidford et al. (2018a). Further, while we assume there are the same number of actions per state, our results easily extend to the case where this is non-uniform; in that case our dependencies on can be replaced with the number of state-action pairs.)
Stochastic games are played dynamically in a sequence of turns, , starting from some initial state at turn . In each turn , the game is in one of the states and the player who controls the state chooses or plays an action from the action space . This action yields reward for the turn and causes the next state to be chosen at random from where the transition probability . The goal of the min-player (resp. max-player) is to choose actions to minimize (resp. maximize) the expected infinite-horizon discounted-reward or value of the game .
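The turn-based dynamics above can be sketched in code. The two-state game below (its states, transition law `P`, rewards `r`, discount `GAMMA`, and strategies) is an illustrative assumption, not an example from the paper:

```python
import random

# Toy turn-based zero-sum stochastic game: state 0 is controlled by the
# min player, state 1 by the max player. All numbers are assumptions.
GAMMA = 0.9
MIN_STATES = {0}
P = {  # P[(state, action)] -> [(next_state, probability), ...]
    (0, 0): [(0, 0.5), (1, 0.5)],
    (0, 1): [(1, 1.0)],
    (1, 0): [(0, 1.0)],
    (1, 1): [(0, 0.2), (1, 0.8)],
}
r = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.5, (1, 1): 1.0}

def sample_next(state, action, rng):
    """Draw the next state from P(. | state, action)."""
    u, acc = rng.random(), 0.0
    for s_next, prob in P[(state, action)]:
        acc += prob
        if u <= acc:
            return s_next
    return P[(state, action)][-1][0]

def discounted_return(pi_min, pi_max, s0, horizon, rng):
    """Play `horizon` turns; the player controlling the state picks the action."""
    total, state = 0.0, s0
    for t in range(horizon):
        a = pi_min[state] if state in MIN_STATES else pi_max[state]
        total += GAMMA ** t * r[(state, a)]
        state = sample_next(state, a, rng)
    return total

ret = discounted_return({0: 0}, {1: 1}, 0, 200, random.Random(0))
```

Here `pi_min` and `pi_max` are stationary deterministic strategies, matching the class of strategies studied below; the return is always bounded by the maximum reward divided by one minus the discount factor.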
In this paper we focus on the case where the players play pure (deterministic) stationary strategies (policies), i.e. strategies which depend only on the current state. That is, we wish to compute a min-player strategy or policy, which defines the action the min player chooses at a state in , and a max-player strategy, which defines the action the max player chooses at a state in . We call a pair of min-player and max-player strategies simply a strategy. Further, we let for and for and define the value function or expected discounted cumulative reward by where
and the expectation is over the random sequence of states, generated according to under the strategy , i.e. for all .
Our goal in solving a game is to compute an approximate Nash equilibrium restricted to stationary strategies Nash (1951); Maskin and Tirole (2001). We call a strategy an equilibrium strategy or optimal if
and we call it -optimal if these same inequalities hold up to an additive entrywise. It is worth noting that the best response strategy to a stationary policy is also stationary Fudenberg and Tirole (1991) and there always exists a pure stationary strategy attaining the Nash equilibrium Shapley (1953). Consequently, it is sufficient to focus on deterministic strategies.
Throughout this paper we focus on solving stochastic games in the learning setting, where the game is not fully specified. We assume that a generative model is available which, given any state-action pair, i.e. and , can sample a next state independently at random from the transition probability function, i.e. . Access to a generative model is a standard and natural assumption (Kakade (2003); Azar et al. (2013); Sidford et al. (2018a); Agarwal et al. (2019)) and corresponds to PAC learning. The special case of solving an MDP given a generative model has been studied extensively (Kakade (2003); Azar et al. (2013); Sidford et al. (2018b, a); Agarwal et al. (2019)) and is a natural proving ground for designing theoretically motivated reinforcement learning algorithms.
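As a rough illustration of what such a generative model provides, the sketch below draws i.i.d. next-state samples for a single hypothetical state-action pair and forms the empirical transition estimate; the two-outcome distribution is a toy assumption:

```python
import random

# Sketch of a generative-model oracle for one fixed state-action pair.
# The true distribution P(. | s, a) = [0.3, 0.7] is unknown to the learner.
def generative_model(rng):
    return 0 if rng.random() < 0.3 else 1

def empirical_transition(num_samples, rng):
    """Empirical estimate of P(. | s, a) from oracle samples."""
    counts = [0, 0]
    for _ in range(num_samples):
        counts[generative_model(rng)] += 1
    return [c / num_samples for c in counts]

p_hat = empirical_transition(10_000, random.Random(1))
```

With enough samples per state-action pair, the empirical estimate concentrates around the true transition law, which is what sample-complexity bounds in this setting quantify.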
1.2 Our Results
In this paper we provide an algorithm that computes an -optimal strategy using a sample size that matches the best known sample complexity for solving discounted MDPs. Further, our algorithm runs in time proportional to the number of samples and uses space proportional to . Interestingly, we achieve this result by showing how to run a two-player variant of Q-learning such that the induced value-strategy sequences enjoy certain monotonicity properties. Essentially, we show that provided a value-improving algorithm is sufficiently stable, it can be extended to the two-player setting with limited loss. This allows us to leverage recent advances in the single-player case to solve stochastic games with limited overhead. Our main result is given below.
Theorem 1.1 (Main Theorem).
There is an algorithm which given a stochastic game, with a generative model, outputs, with probability at least , an -optimal strategy by querying samples, where and hides polylogarithmic factors. The algorithm runs in time and uses space .
Our sample and time complexities are optimal due to a known lower bound for the single-player case: Azar et al. (2013) showed that solving any one-player MDP to -optimality with high probability requires at least samples. Our sample complexity upper bound generalizes the recent sharp sample complexity results for solving discounted MDPs Sidford et al. (2018a); Agarwal et al. (2019), and tightly matches the information-theoretic sample complexity up to polylogarithmic factors. This result provides the first near-optimal sample complexity for solving two-player stochastic games.
1.3 Notation and Preliminaries
We use to denote the all-ones vector, whose dimension is adapted to the context. We use the operators as entrywise operators on vectors. We identify the transition probability function with a matrix in and each row with a vector. We denote as a vector in and as a vector in . Therefore is a vector in . We use to denote strategy pairs and for the min-player or max-player strategy. For any strategy , we define as for . We denote as a linear operator defined as
Min-value and max-value:
For a min-player strategy , we define its value as
We let denote a maximizing argument of the above and call it an optimal counter-strategy of . Thus the value of a min-player strategy gives its expected reward in the worst case. We say a min-player strategy is -optimal if
The value and -optimality for the max player is defined similarly. We denote by the optimal strategy and by the value function of the optimal strategy.
For a strategy , we denote its -function (or action-value function) by . For a vector , we denote . Given a , we denote the greedy value of as
We denote the Bellman operator, , as follows: , and
We also denote the greedy strategy, or , as the maximization/minimization argument of the operator. Moreover, for a given strategy , we denote . For a given min-player strategy , we define the half Bellman operator
We define similarly. Note that is the unique fixed point of the Bellman operator, i.e., (known as the Bellman equation Bellman (1957)). Similarly, (resp. ) is the unique fixed point of (resp. ). The (half) Bellman operators satisfy the following properties (see, e.g., Hansen et al. (2013); Puterman (2014))
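The turn-based Bellman operator and its fixed-point property can be sketched as follows: the one-step backup minimizes over actions at min-player states and maximizes at max-player states, and by gamma-contraction repeated application converges to the unique fixed point. The two-state game below is a toy assumption:

```python
# Toy turn-based game used to illustrate the Bellman operator; all numbers
# are assumptions for illustration.
GAMMA = 0.9
STATES, ACTIONS = (0, 1), (0, 1)
MIN_STATES = {0}
P = {
    (0, 0): [(0, 0.5), (1, 0.5)],
    (0, 1): [(1, 1.0)],
    (1, 0): [(0, 1.0)],
    (1, 1): [(0, 0.2), (1, 0.8)],
}
r = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.5, (1, 1): 1.0}

def bellman(v):
    """One-step backup: min over actions at min states, max at max states."""
    out = []
    for s in STATES:
        q = [r[(s, a)] + GAMMA * sum(p * v[sn] for sn, p in P[(s, a)])
             for a in ACTIONS]
        out.append(min(q) if s in MIN_STATES else max(q))
    return out

# By gamma-contraction, iterating converges to the unique fixed point v*
# with T(v*) = v* (the Bellman equation).
v = [0.0, 0.0]
for _ in range(400):
    v = bellman(v)
```

After enough iterations the iterate is numerically a fixed point of the operator, mirroring the Bellman equation stated above.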
We say an algorithm has a property “with high probability” if, for any , by increasing the time and sample complexity by , it has the property with probability .
1.4 Previous Work
Here we provide a more detailed survey of previous work related to stochastic games and MDPs. Two-player stochastic games generalize MDPs Shapley (1953). When one of the players has only one action to choose from, the problem reduces to an MDP. A related game is the stochastic game in which both players choose their respective actions simultaneously at each state and the process transitions to the next state under the control of both players Shapley (1953). The turn-based stochastic game can be reduced to the game with simultaneous moves Pérolat et al. (2015).
Computing an optimal strategy for a two-player turn-based zero-sum stochastic game is known to be in NP ∩ co-NP Condon (1992). Later, Hansen et al. (2013) showed that strategy iteration, a generalization of Howard’s policy iteration algorithm Howard (1960), solves the discounted problem in strongly polynomial time when the discount factor is fixed. Their work uses ideas from Ye (2011), which proved that the policy iteration algorithm solves the discounted MDP (DMDP) in strongly polynomial time when the discount factor is fixed. In general (e.g., if the discount factor is part of the input), it is open whether stochastic games can be solved in polynomial time Littman (1996). This is in contrast to MDPs, which can be solved in (weakly) polynomial time as a special case of linear programming.
The algorithms and complexity theory for solving two-player stochastic games are closely related to those for solving MDPs. There is a vast literature on solving MDPs, dating back to Bellman, who developed value iteration in 1957 Bellman (1957). Policy iteration was introduced shortly after by Howard Howard (1960), and its complexity has been extensively studied in Mansour and Singh (1999); Ye (2011); Scherrer (2013). Then d’Epenoux (1963) and De Ghellinck (1960) discovered that MDPs are special cases of linear programming, which leads to the insight that the simplex method, when applied to solving DMDPs, is a simple policy iteration method. Ye (2011) showed that policy iteration (a variant of the general simplex method for linear programming) and the simplex method are strongly polynomial for DMDPs and terminate in iterations. Hansen et al. (2013) and Scherrer (2013) improved the iteration bound to for Howard’s policy iteration method. The best known convergence results for policy and strategy iteration are given by Ye (2005) and Hansen et al. (2013). The best known iteration complexities for both problems are of the order , which becomes unbounded as . It is worth mentioning that Ye (2005) designed a combinatorial interior-point algorithm (CIPA) that solves the DMDP in strongly polynomial time.
Sample-based algorithms for learning value and policy functions for MDPs have been studied in Kearns and Singh (1999); Kakade (2003); Singh and Yee (1994); Azar et al. (2011b, 2013); Sidford et al. (2018b, a); Agarwal et al. (2019), among many others. Among these papers, Azar et al. (2013) obtains the first tight sample bound for finding an -optimal value function, and for finding -optimal policies in a restricted regime, and Sidford et al. (2018a) obtains the first tight sample bound for finding an -optimal policy for any . Both sample complexities are of the form . Lower bounds have been shown in Azar et al. (2011a); Even-Dar et al. (2006) and Azar et al. (2013); Azar et al. (2013) give the first tight lower bound . For undiscounted average-reward MDPs, a primal-dual based method was proposed in Wang (2017) which achieves sample complexity , where is the worst-case mixing time and is the ergodicity ratio. Sample-based methods for two-player stochastic games have been considered in Wei et al. (2017) in an online learning setting; however, their algorithm leads to a sub-optimal sample complexity when generalized to the generative model setting.
As for general stochastic games, the minimax Q-learning algorithm and the friend-and-foe Q-learning algorithm were introduced in Littman (1994) and Littman (2001a), respectively. The Nash Q-learning algorithm was proposed for zero-sum games in Hu and Wellman (2003) and for general-sum games in Littman (2001b); Hu and Wellman (1999).
2 Technique Overview
Since stochastic games are a generalization of MDPs, many techniques for solving MDPs can be immediately generalized to stochastic games. However, as we have discussed, some of the techniques used to achieve optimal sample complexities for solving MDPs in a generative model do not have a clear generalization to stochastic games. Nevertheless, we show how to design an algorithm that carefully extends particular Q-learning based methods, i.e. methods that always maintain an estimator for the optimal value function (or ), to achieve our goals.
To motivate our approach we first briefly review previous Q-learning based methods and the core technique that achieves near-optimal sample complexity. To motivate Q-learning, we first recall the value iteration algorithm for solving an MDP. Given a full model of the MDP, value iteration updates the iterates as follows
where can be an arbitrary vector. Since the Bellman operator is contractive and is a fixed point of , this method gives an -optimal value in iterations. In the learning setting, cannot be computed exactly. The Q-learning approach estimates by its approximate version, i.e., to compute , we obtain samples from , and then compute the empirical average. Then we compute the approximate Q-value at the -th iteration as
for some . Then the estimation error per step is defined as
Since exact value iteration takes at least iterations to converge, Q-learning (or approximate value iteration) takes at least iterations. The total number of samples used over all the iterations is the sample complexity of the algorithm.
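This approximate value iteration can be sketched as follows, with the exact backup replaced by an empirical average over a fixed number of generative-model samples per state-action pair, per iteration. The two-state game and sample sizes are toy assumptions:

```python
import random

# Approximate value iteration on a toy turn-based game; all numbers are
# assumptions for illustration.
GAMMA, M_SAMPLES = 0.9, 2000
STATES, ACTIONS, MIN_STATES = (0, 1), (0, 1), {0}
P = {
    (0, 0): [(0, 0.5), (1, 0.5)],
    (0, 1): [(1, 1.0)],
    (1, 0): [(0, 1.0)],
    (1, 1): [(0, 0.2), (1, 0.8)],
}
r = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.5, (1, 1): 1.0}

def sample_next(s, a, rng):
    u, acc = rng.random(), 0.0
    for s_next, prob in P[(s, a)]:
        acc += prob
        if u <= acc:
            return s_next
    return P[(s, a)][-1][0]

def approx_bellman(v, rng):
    """One approximate backup: an empirical mean replaces the expectation."""
    out = []
    for s in STATES:
        q = []
        for a in ACTIONS:
            est = sum(v[sample_next(s, a, rng)]
                      for _ in range(M_SAMPLES)) / M_SAMPLES
            q.append(r[(s, a)] + GAMMA * est)
        out.append(min(q) if s in MIN_STATES else max(q))
    return out

rng = random.Random(0)
v = [0.0, 0.0]
for _ in range(60):
    v = approx_bellman(v, rng)
# v now fluctuates near the exact fixed point of this toy game
```

The total sample count here is (iterations) × (state-action pairs) × `M_SAMPLES`, which is exactly the kind of budget the sample-complexity analysis accounts for.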
Variance Control and Monotonicity Techniques:
To obtain the optimal sample complexity for one-player MDPs, one approach is to carefully bound each entry of . By Bernstein's inequality (Azar et al. (2013); Sidford et al. (2018a); Agarwal et al. (2019)), we have, with high probability,
where is the variance-of-value vector and “” means “approximately less than.” Let be a policy maintained in the -th iteration (e.g. the greedy policy of the current Q-value). Due to the estimation error , the per-step error bound reads,
To derive the overall error accumulation, Sidford et al. (2018a) use the crucial monotonicity property, i.e., since , we have
We thus have
By induction, we have
The leading-order error accumulation term satisfies the so-called total variance property, and can be upper bounded uniformly by , resulting in the correct dependence on . Therefore the monotonicity property allows us to use as a proxy policy, which carefully bounds the error accumulation. For the additional subtlety of how to obtain an optimal policy, please refer to Sidford et al. (2018a) for the variance reduction technique and the monotone-policy technique.
Similar observations regarding MDPs were used in Agarwal et al. (2019) as well. This powerful technique, however, does not generalize to the game case due to the lack of monotonicity. Indeed, (2) does not hold for stochastic games due to the presence of both minimization and maximization operations in the Bellman operator. This is the critical issue that this paper seeks to overcome.
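As a rough illustration of the Bernstein-style control of a single entry of the estimation error, the sketch below computes an empirical-variance-based confidence width for an estimate of an expected next-state value. The two-point distribution, sample size, and constants are assumptions for illustration, not the paper's exact bound:

```python
import math
import random

# Bernstein-style width: a variance term plus a range term, both shrinking
# with the number of samples m. Constants are illustrative assumptions.
def bernstein_width(sample_var, value_range, m, delta):
    return (math.sqrt(2 * sample_var * math.log(2 / delta) / m)
            + value_range * math.log(2 / delta) / (3 * m))

rng = random.Random(0)
v = [0.0, 10.0]          # next-state values under a current value vector
p = 0.3                  # true probability of landing in the second state
m, delta = 4000, 0.001
draws = [v[1] if rng.random() < p else v[0] for _ in range(m)]
mean = sum(draws) / m                          # estimate of P(.|s,a)^T v
var = sum((x - mean) ** 2 for x in draws) / m  # empirical variance
width = bernstein_width(var, v[1] - v[0], m, delta)
true_mean = p * v[1]
```

The point of the variance term is that the width scales with the standard deviation of the sampled values rather than with their full range, which is what yields the sharp dependence on the variance-of-value vector.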
Finding Monotone Value-Strategy Sequences for Stochastic Games:
Analogously to the MDP case, one approach to bounding error accumulation for stochastic games is to bound each entry of the error vector carefully. In fact, our method for solving stochastic games closely resembles the MDP method of Sidford et al. (2018a). However, the analysis differs substantially in order to resolve the difficulty introduced by the lack of monotonicity.
Since a stochastic game has two players, we modify the variance reduced Q-value iteration (vQVI) method in Sidford et al. (2018a) to obtain a min-player strategy and a max-player strategy respectively. Since the two players are symmetric, let us focus on introducing and analyzing the algorithm for the min-player. By a slight modification of the vQVI method, we can guarantee to obtain a sequence of strategies and values, , that satisfy, with high probability,
where . The first property guarantees that the value sequences are monotonically decreasing, the second property guarantees that is always an upper bound on the value , and the third and fourth inequalities guarantee that is well approximated by and that the estimation error satisfies
where is the total number of samples used per state-action pair. Note that, as long as we can guarantee that , we can guarantee the min-strategy is also good:
Controlling Error Accumulation using Auxiliary Markovian Strategy:
Due to the lack of monotonicity (2), we cannot use the optimal strategy as a proxy strategy to carefully account for the error accumulation. To resolve this issue, we construct a new proxy strategy . This strategy is a Markovian strategy, which is time-dependent but not history-dependent, i.e., at time , the strategy played is a deterministic map . The proxy strategy satisfies the following:
Underestimation: its value, , (the expected discounted cumulative reward starting from any time) is upper bounded by ;
Similarly, we can bound the error by the variance-of-value of the proxy strategy
Based on the first property, we can upper bound
Based on the second property, and induction on , we can now write a new form of error accumulation,
where for all . We derive a new law of total variance bound for the first term and ultimately prove an error accumulation upper bound:
giving the optimal sample bound.
3 Sample Complexity of Stochastic Games
In this section, we provide and analyze our sampling-based algorithm for solving stochastic games. Recall that we have a generative model for the game from which we can obtain samples for state-action pairs. Each sample is obtained in time . As such, we care about the total number of samples used and the total amount of time consumed by the algorithm. We provide an efficient algorithm that takes as input a generative model and obtains a good strategy for the underlying stochastic game.
We now describe the algorithm. Since the min-player and max-player are symmetric, let us focus on the min-player strategy. For the max-player strategy, we can either consider the game , in which the roles of the max and min players are switched, or use the corresponding algorithm for the max-player defined in Section 4.4, which is a direct generalization of the min-player algorithm.
The Full Algorithm.
For simplicity, let us denote . Our full algorithm will use the QVI-MDVSS algorithm (Algorithm 1) as a subroutine. As we will show shortly, this subroutine maintains a monotonic value strategy sequence with high probability. Suppose the algorithm is specified by an accuracy parameter . We initialize a value vector , and an arbitrary strategy . Let . Then our initial value and strategy satisfy the requirement of the input specified by Algorithm 1:
Let and .
We run Algorithm 1 repeatedly:
where and we take the terminal value and strategy of the output sequence of Algorithm 1 as the input for the next iteration. In total, we run (6) for iterations. In the end, we output from as our min-player strategy.
The formal guarantee of the algorithm is presented in the following theorem.
Theorem 3.1 (Restatement of Theorem 1.1).
Given a stochastic game with a generative model, there exists (constructively) an algorithm that outputs, with probability at least , an -optimal strategy by querying samples in time using space where and hides factors.
The formal proof of Theorem 3.1 is given in the next section. Here we give a sketch of the proof.
Proof Sketch of Theorem 3.1:
We first show the high-level idea. Considering one iteration of (6), we claim that if the input value and strategy satisfy the input condition (5), then with probability at least , the terminal value and strategy of the output sequence, , satisfy,
and satisfy the input condition (5). Namely, with high probability, the error of the output is decreased by at least half and the output can be used as an input to the QVI-MDVSS algorithm again. Suppose we run the subroutine of Algorithm 1 for times; then, conditioned on the event that all the instances of QVI-MDVSS succeed, the final error of is at most , as desired. By setting for some , we have that all QVI-MDVSS instances succeed with probability at least . It remains to show that the algorithm QVI-MDVSS works as claimed.
High-level Structure of Algorithm 1. To outline the proof, we denote a monotone decreasing value-strategy sequence (MDVSS) as , satisfying (4), where and . A more formal treatment of the sequence is presented in Section 4.2.
We next introduce the high-level idea of Algorithm 1. The basic step of the algorithm is to perform approximate value iteration while preserving all the monotonic properties required by an MDVSS, i.e., we would like to approximate
We would like to approximate using samples, but we do not want to use the same number of samples in every iteration (as this becomes costly if the number of iterations is large). Instead, we compute only the first iteration (i.e., estimate ) up to high accuracy with a large number of samples ( samples, defined in Line 10). These computations are presented in Lines 17-23. To maintain an upper bound on the estimation error, we also compute the empirical variances of the updates in Line 19. We shift our estimates upwards by the estimation error upper bounds to make our estimators one-sided, which is crucial for maintaining the MDVSS properties. For the subsequent steps (Lines 29-40), we use samples per iteration () to estimate . The expectation is that has a small norm, and hence can be estimated up to high accuracy with only a small number of samples. The estimator of plus the estimator of from the initialization steps gives a high-accuracy estimator (Line 40) for the value iteration. Since , the total number of samples per state-action pair is dominated by . This idea is formally known as variance reduction, first proposed for solving MDPs in Sidford et al. (2018b). Similarly, we shift our estimators to be one-sided. We additionally maintain carefully designed strategies in Lines 29-31 to preserve monotonicity. Hence the algorithm can be viewed as a value-strategy iteration algorithm.
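The variance-reduction idea (a one-time, high-accuracy estimate of the anchor term plus cheap per-iteration estimates of the small-range correction) can be illustrated as follows; the two-point transition law and all sample sizes are toy assumptions:

```python
import random

# Variance-reduction sketch for estimating P v across iterations:
# split P v = P v0 + P (v - v0), estimate the anchor P v0 once with many
# samples, and the small-range correction with few samples per iteration.
rng = random.Random(0)
P_TRUE = 0.4                       # probability of the second successor state

def sample_mean(values, n):
    """Empirical mean of values[s'] over n draws of s' ~ P."""
    return sum(values[1] if rng.random() < P_TRUE else values[0]
               for _ in range(n)) / n

v0 = [2.0, 8.0]                    # initial value estimate (large range)
anchor = sample_mean(v0, 50_000)   # m1 samples, spent once

def vr_estimate(v, m2=200):
    # the correction v - v0 has small range, so m2 can be much smaller than m1
    diff = [v[0] - v0[0], v[1] - v0[1]]
    return anchor + sample_mean(diff, m2)

v_new = [v0[0] + 0.10, v0[1] + 0.05]   # a nearby iterate
est = vr_estimate(v_new)
exact = (1 - P_TRUE) * v_new[0] + P_TRUE * v_new[1]
```

Because the correction term has small range, its empirical estimate is accurate even with few samples, so the total sample budget is dominated by the one-time anchor estimate, mirroring the accounting in the paragraph above.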
Correctness of Algorithm 1. We now sketch the proof of correctness for Algorithm 1. First, Proposition 4.3 shows that if an MDVSS, e.g., , satisfies for some , then its terminal strategies and values satisfy
This indicates that, as long as we can show , the halving-error property (7) holds.
Proposition 4.4 shows that the halving-error property can be achieved by setting
where is the variance-of-value vector of and . This proof is based on constructing an auxiliary Markovian strategy for analyzing the error accumulation throughout the value-strategy iterations. The Markovian strategy is a time-dependent strategy used as a proxy for analyzing the entrywise error recursion (Lemmas 4.4-4.11).
Proposition 4.12 shows, with high probability, Algorithm 1 produces value-strategy sequences , which is indeed an MDVSS and satisfies Proposition 4.4. The proof involves analyzing the probability of “good events” on which monotonicity is preserved at every iteration by using confidence estimates computed during the iterations and concentration arguments. See Lemmas 4.13-4.18 for the full proof of Proposition 4.12.
Putting Everything Together. Finally, by putting together the strategies, we conclude that the terminal strategy of the iteration (6) is always an approximately optimal min-player strategy for the game, with high probability. As for the implementation, since our algorithm only computes inner products based on samples, the total computation time is proportional to the number of samples. Moreover, since we can update as samples are drawn and output the monotone sequences as they are generated, we do not need to store the samples or the value-strategy sequences; thus the overall space is . ∎
4 Proof of Main Results
The remainder of this section is devoted to proving Theorem 1.1. We prove this by formally providing a notion of monotone value-strategy sequences. With this, we show if an algorithm outputs some monotone value-strategy sequence, then the terminal strategy of the sequence is always an approximately optimal strategy to the game. We then show that Algorithm 1 produces monotone value-strategy sequences with high probability.
4.1 Additional Notation
First we provide additional notation critical to our proofs.
We denote a Markovian strategy as an infinitely long sequence of pre-defined strategies
where each is a normal deterministic strategy. We denote
as another Markovian strategy. We denote and as the min-player strategy and the max-player strategy respectively. When using the strategy, the players play at time . The strategy is Markovian because it does not depend on the history of moves. Note that a stationary strategy is a special case of a Markovian strategy: . The value of a Markovian strategy is defined as before, but the states are generated by playing the action at time . Since the strategy has a time dependence, we denote
The (half) Bellman operators are defined similarly to that of stationary policies.
4.2 Monotone Value-Strategy Sequence
In this section we formally define monotone value-strategy sequences. Such a sequence, although not explicitly stated in Sidford et al. (2018b, a), is crucial for these algorithms to obtain a good policy while obtaining a good value for an MDP. In the following sections, we denote , and as parameters. Monotone value-strategy sequences are formally defined as follows.
Definition 4.1 (Monotone Decreasing Value-Strategy Sequence).
A monotone decreasing value-strategy sequence (MDVSS) is a sequence of where and satisfy