1 Introduction
The two-player turn-based stochastic game (2TBSG) is a generalization of the Markov decision process (MDP); both are widely used models in machine learning and operations research. While an MDP involves one agent whose objective is to maximize the total reward, a 2TBSG is a zero-sum game involving two players with opposite objectives: one player seeks to maximize the total reward and the other seeks to minimize it. In a 2TBSG, the set of all states is divided into two subsets, each controlled by one of the two players. We focus on the discounted stationary 2TBSG, where the probability transition model is invariant across time and the total reward is the infinite sum of all discounted rewards. Our goal is to approximate the Nash equilibrium of the 2TBSG, whose existence is proved in Shapley (1953). Two practical obstacles stand in the way of solving a 2TBSG:

We usually do not know the transition probability model explicitly;

The number of possible states and actions is very large or even infinite.
In this paper we assume access to a sampling oracle that can generate sample transitions from any state-action pair. We also suppose that a finite number of state-action features are available, such that the unknown probability transition model can be embedded using the feature space. These features allow us to solve 2TBSGs of arbitrary dimensions using parametric algorithms.
A natural question arises: how many samples are needed to find an approximate Nash equilibrium? For solving the one-player MDP to optimality using features, Yang and Wang (2019) prove an information-theoretic lower bound on the sample complexity. Since MDP is a special case of 2TBSG, the same lower bound applies to 2TBSG. Yet there has not been any provably efficient algorithm for solving 2TBSG using features.
To answer this question, we propose two sampling-based algorithms and provide sample complexity analysis. Motivated by the value iteration and Q-learning-like algorithms given by Hansen et al. (2013); Yang and Wang (2019), we propose a two-player Q-learning algorithm for solving 2TBSG using given features. When the true transition model can be fully embedded in the feature space without losing any information, our algorithm finds an optimal strategy using no more than sample transitions, where is the number of state-action features. We also provide a model misspecification error bound for the case where the features cannot fully embed the transition model.
To further improve the sample complexity, we use a variance reduction technique, together with a specifically designed monotonicity preservation technique, both previously used in Yang and Wang (2019), to develop an algorithm that is even more sample-efficient. This algorithm uses a two-sided approximation scheme to find the equilibrium value from both above and below. It computes the final optimal strategy by combining two approximate strategies. This algorithm is proved to find an optimal strategy with high probability using samples, which improves significantly on our first result. To the best of our knowledge, our results are the first and sharpest sample complexity bounds for solving two-player stochastic games using features, and our algorithms are the first of their kind with provable sample efficiency. It is also worth noting that the algorithms are space and time efficient: their complexities depend polynomially on and , and are free from the game’s dimensions.
2 Related Works
The 2TBSG is a special case of games and stochastic games (SG), which were first introduced in Von Neumann and Morgenstern (2007) and Shapley (1953). For a comprehensive introduction to SG, please refer to the books Neyman et al. (2003) and Filar and Vrieze (2012). A number of deterministic algorithms have been developed for solving 2TBSG when its explicit form is fully given, including Littman (1996); Ludwig (1995); Hansen et al. (2013). For example, Rao et al. (1973) propose the strategy iteration algorithm. A value iteration method is proposed by Hansen et al. (2013), which is one of the motivations for our algorithm.
In the special case of MDP, there exists a large body of work on sample complexity and sampling-based algorithms. For the tabular setting (finitely many states and actions), the sample complexity of MDP with a sampling oracle has been studied in Kearns and Singh (1999); Azar et al. (2013); Sidford et al. (2018b, a); Kakade (2003); Singh and Yee (1994); Azar et al. (2011b). Lower bounds for sample complexity have been studied in Azar et al. (2013); Even-Dar et al. (2006); Azar et al. (2011a), where the first tight lower bound is obtained in Azar et al. (2013). The first sample-optimal algorithm for finding an optimal value is proposed in Azar et al. (2013). Sidford et al. (2018a) gives the first algorithm that finds an optimal policy using the optimal sample complexity for all values of . For solving MDP using linearly additive features, Yang and Wang (2019) proved a lower bound of sample complexity that is . It also provided an algorithm that achieves this lower bound up to log factors; however, the analysis of that algorithm relies heavily on an extra “anchor state” assumption. In Chen et al. (2018), a primal-dual method for solving MDP with linear and bilinear representations of value functions and transition models is proposed for the undiscounted MDP. In Jiang et al. (2017), the sample complexity of contextual decision processes is studied.
As for general stochastic games, the minimax Q-learning algorithm and the friend-or-foe Q-learning algorithm are introduced in Littman (1994) and Littman (2001a), respectively. The Nash Q-learning algorithm is proposed for zero-sum games in Hu and Wellman (2003) and for general-sum games in Littman (2001b); Hu and Wellman (1999). In Perolat et al. (2015), the error of approximate Q-learning is estimated. In Zhang et al. (2018), a finite-sample analysis of multi-agent reinforcement learning is provided. To the best of our knowledge, there is no known algorithm that solves 2TBSG using features with sample complexity analysis.
There are a large number of works analyzing linear model approximation of value and Q functions, for example Tsitsiklis and Van Roy (1997); Nedić and Bertsekas (2003); Lagoudakis and Parr (2003); Melo et al. (2008); Parr et al. (2008); Sutton et al. (2009); Lazaric et al. (2012); Tagorti and Scherrer (2015). These works mainly focus on approximating the value function or Q function for a fixed policy. The convergence of temporal difference learning with a linear model for a given policy is proved in Tsitsiklis and Van Roy (1997). Melo et al. (2008) and Sutton et al. (2009) study the convergence of Q-learning and off-policy temporal difference learning with linear function parametrization, respectively. In Parr et al. (2008), the relationship between linear transition models and linearly parametrized value functions is explained. It is also pointed out by Yang and Wang (2019) that using a linear model for the Q function is essentially equivalent to assuming that the transition model can be embedded using these features, provided that there is zero Bellman error.
The fitted value iteration for MDPs or 2TBSGs, where the value function is approximated by functions in a general function space, is analyzed in Munos and Szepesvári (2008); Antos et al. (2008a, b); Farahmand et al. (2010); Yang et al. (2019); Pérolat et al. (2016). In these papers, it is shown that the error is related to the Bellman error of the function space, and depends polynomially on and the dimension of the function space. However, only convergence is analyzed in these papers.
3 Preliminaries
Basics of 2TBSG
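A discounted 2TBSG (2TBSG for short) consists of a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S} = \mathcal{S}_1 \cup \mathcal{S}_2$, $\mathcal{A} = \mathcal{A}_1 \cup \mathcal{A}_2$ and $\mathcal{S}_1, \mathcal{S}_2, \mathcal{A}_1, \mathcal{A}_2$ are the state sets and action sets for Player 1 and Player 2, respectively. $P$ is a transition probability matrix, where $P(s' \mid s, a)$ denotes the probability of transitioning to state $s'$ from state $s$ if action $a$ is used. $r$ is the reward vector, where $r(s, a)$ denotes the immediate reward received using action $a$ at state $s$. For a given state $s \in \mathcal{S}$, we use $\mathcal{A}_s$ to denote the available action set for state $s$. A value function is a mapping from $\mathcal{S}$ to $\mathbb{R}$. A deterministic strategy (strategy for short) $\pi = (\pi_1, \pi_2)$ is defined such that $\pi_1, \pi_2$ are mappings from $\mathcal{S}_1$ to $\mathcal{A}_1$ and from $\mathcal{S}_2$ to $\mathcal{A}_2$, respectively. Given a strategy $\pi$, the value function of $\pi$ is defined to be the expectation of the total discounted reward starting from $s$, i.e.,
$$V^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r\big(s^{t}, \pi(s^{t})\big) \,\middle|\, s^{0} = s\right], \qquad (1)$$
where $\gamma \in (0, 1)$ is the discount factor, and the expectation is over all trajectories starting from $s$.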
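As a concrete illustration of definition (1), the following minimal sketch evaluates a fixed deterministic strategy on a toy two-state game. The game itself (states, transitions, rewards, the discount factor 0.5) and the helper name `evaluate` are assumptions made up for this example; they do not come from the paper.

```python
# Toy 2TBSG (all numbers are illustrative assumptions).
# State 0 belongs to Player 1, state 1 to Player 2; each state has actions 0 and 1.
GAMMA = 0.5
P = {  # P[(s, a)] = distribution over next states
    (0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0],
    (1, 0): [1.0, 0.0], (1, 1): [0.0, 1.0],
}
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 0.0}


def evaluate(strategy, sweeps=200):
    """Approximate V^pi by iterating V <- r_pi + gamma * P_pi V."""
    v = [0.0, 0.0]
    for _ in range(sweeps):
        v = [
            R[(s, strategy[s])]
            + GAMMA * sum(p * x for p, x in zip(P[(s, strategy[s])], v))
            for s in (0, 1)
        ]
    return v


# Player 1 plays action 1 in state 0; Player 2 plays action 0 in state 1.
v_pi = evaluate({0: 1, 1: 0})
```

Under this strategy the play alternates between the two states and collects reward 1 at every step, so both entries of `v_pi` approach 1 / (1 - 0.5) = 2.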
The two players in a 2TBSG have opposite objectives. While the first player seeks to maximize the value function (1), the second player seeks to minimize it. In the following we present the definition of the equilibrium strategy.
Definition 1.
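A strategy $\pi^* = (\pi_1^*, \pi_2^*)$ is called a Nash equilibrium strategy (equilibrium strategy for short), if $V^{\pi_1, \pi_2^*} \le V^{\pi_1^*, \pi_2^*} \le V^{\pi_1^*, \pi_2}$ for any player 1’s strategy $\pi_1$ and player 2’s strategy $\pi_2$.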
The existence of the Nash equilibrium strategy is proved in Shapley (1953). Moreover, all equilibrium strategies share the same value function, which we denote by $v^*$.
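Notice that a value function $v$ is the equilibrium value if and only if it satisfies the following Bellman equation (Hansen et al., 2013):
$$v = \mathcal{T} v, \qquad (2)$$
where $\mathcal{T}$ is an operator mapping a value function into another:
$$[\mathcal{T} v](s) = \begin{cases} \max_{a \in \mathcal{A}_s} \big[ r(s, a) + \gamma\, P(\cdot \mid s, a)^\top v \big], & s \in \mathcal{S}_1, \\ \min_{a \in \mathcal{A}_s} \big[ r(s, a) + \gamma\, P(\cdot \mid s, a)^\top v \big], & s \in \mathcal{S}_2. \end{cases} \qquad (3)$$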
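The Bellman operator can be sketched on a small tabular game as follows. This is only an illustration under assumed toy data: the two-state game, the `OWNER` map and the function names are hypothetical. Player 1's states take a maximum over actions, Player 2's states take a minimum, and repeated application converges to the fixed point because the operator is a contraction with factor equal to the discount factor.

```python
# Toy 2TBSG (illustrative assumptions): state 0 is owned by Player 1 (maximizer),
# state 1 by Player 2 (minimizer); each state has actions 0 and 1.
GAMMA = 0.5
OWNER = {0: 1, 1: 2}
P = {
    (0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0],
    (1, 0): [1.0, 0.0], (1, 1): [0.0, 1.0],
}
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 0.0}


def bellman(v):
    """Apply the two-player Bellman operator once to the value function v."""
    out = []
    for s in (0, 1):
        q = [
            R[(s, a)] + GAMMA * sum(p * x for p, x in zip(P[(s, a)], v))
            for a in (0, 1)
        ]
        out.append(max(q) if OWNER[s] == 1 else min(q))
    return out


def value_iteration(sweeps=100):
    """Iterate the operator from the zero function."""
    v = [0.0, 0.0]
    for _ in range(sweeps):
        v = bellman(v)
    return v


v_star = value_iteration()
```

In this toy game Player 2 can stay in state 1 at zero cost, so the equilibrium value is 1 in state 0 and 0 in state 1.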
We give definitions of optimal values and optimal strategies.
Definition 2.
We call a value function an optimal value, if .
Definition 3.
We call a strategy an optimal strategy, if for any ,
Since , the above definition is equivalent to and .
Features and Probability Transition Model
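Suppose we have $K$ feature functions $\phi_1, \dots, \phi_K$ mapping from $\mathcal{S} \times \mathcal{A}$ into $\mathbb{R}$. For every state-action pair $(s, a)$, these features give a feature vector $\phi(s, a) = [\phi_1(s, a), \dots, \phi_K(s, a)]^\top \in \mathbb{R}^K$.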
Throughout this paper, we focus on 2TBSG where the probability transition model can be nearly embedded using the features without losing any information.
Definition 4.
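We say that the transition model $P$ can be embedded into the feature space $\phi$, if there exist functions $\psi_1, \dots, \psi_K$ such that $P(s' \mid s, a) = \sum_{k=1}^{K} \phi_k(s, a)\, \psi_k(s')$ for all $s' \in \mathcal{S}$ and all $(s, a) \in \mathcal{S} \times \mathcal{A}$.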
The preceding model is closely related to the linear model for Q-functions. When can be fully embedded using , the Q-functions belong to so we can parameterize them using -dimensional vectors. Note that the feature representation is only concerned with the probability transition but has nothing to do with the reward function. It is pointed out by Yang and Wang (2019) that having a transition model which can be embedded into is equivalent to using a linear Q-function model with zero Bellman error. In our subsequent analysis, we also provide approximation guarantees when cannot be fully embedded using .
It is worth noting that Definition 4 has a kernel interpretation. It is equivalent to requiring that the left singular functions of belong to the Hilbert space with the kernel function . Our model and method can be viewed as approximating and solving the 2TBSG in a given kernel space.
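To make the embedding concrete, the sketch below builds a transition model as a feature-weighted mixture of a few "factor" distributions and checks that expected next-state values become linear in the features. The decomposition form (transition probability equal to the sum over k of feature k times a function of the next state), the numbers, and the names `PHI`, `PSI`, `transition` are assumptions chosen for illustration.

```python
# Assumed toy setting: K = 2 features, 2 states, 2 actions per state.
K = 2
STATES = (0, 1)
PHI = {  # PHI[(s, a)] = feature vector of the state-action pair
    (0, 0): [1.0, 0.0], (0, 1): [0.25, 0.75],
    (1, 0): [0.5, 0.5], (1, 1): [0.0, 1.0],
}
PSI = [[0.9, 0.1], [0.2, 0.8]]  # PSI[k][s_next], one distribution per feature


def transition(s, a):
    """P(. | s, a) = sum_k PHI[(s, a)][k] * PSI[k][.]"""
    return [sum(PHI[(s, a)][k] * PSI[k][sp] for k in range(K)) for sp in STATES]


def expected_value(s, a, v):
    """Direct computation of P(. | s, a)^T v."""
    return sum(p * x for p, x in zip(transition(s, a), v))


def expected_value_via_features(s, a, v):
    """Same quantity as a linear function of the features: PHI(s, a)^T (PSI v)."""
    w = [sum(PSI[k][sp] * v[sp] for sp in STATES) for k in range(K)]
    return sum(PHI[(s, a)][k] * w[k] for k in range(K))
```

Because the expected next-state value depends on the pair only through its features, a K-dimensional vector (here `w`) is enough to describe it for every state-action pair, which is what makes a parametric algorithm possible.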
Notations
For two value functions , we use to denote . We use to denote the projection of into the interval . The total variation (TV) distance between two distributions on the state space is defined as . We use to hide log factors of and .
4 A Basic Two-Player Q-learning Algorithm
In this section, we develop a basic two-player Q-learning algorithm for 2TBSG. The algorithm is motivated by the two-player value iteration algorithm of Hansen et al. (2013). It is also motivated by the parametric Q-learning algorithm for solving MDP given by Yang and Wang (2019).
4.1 Algorithm and Parametrization
The algorithm uses a vector to parametrize Q-functions, value functions and strategies as follows:
(4)  
The algorithm keeps track of the parameter vector only. The value functions and strategies can be obtained from according to the preceding equations when they are needed.
We present Algorithm 1, which is an approximate value iteration. Our algorithm first picks a set of representative state-action pairs. Then at iteration , it uses sampling to estimate the values , and carries out value iteration using these estimates. The set can be chosen nearly arbitrarily, but it must be representative of the feature space: the feature vectors of the state-action pairs in this set cannot be too alike and need to be linearly independent.
Assumption 1.
There exist state-action pairs forming a set satisfying
where is the matrix formed by row features of those in .
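The approximate value iteration described above can be sketched as follows. This is a hedged illustration, not the paper's Algorithm 1 verbatim: it assumes the parametrization Q_w(s, a) = r(s, a) + gamma * phi(s, a)^T w, a toy game with K = 2 features whose entries are nonnegative and sum to one (so they can also drive the simulated sampling oracle), and hypothetical names such as `REP`, `solve2` and `sampled_value_iteration`.

```python
import random

GAMMA = 0.5
OWNER = {0: 1, 1: 2}  # state -> controlling player (1 maximizes, 2 minimizes)
PHI = {(0, 0): [1.0, 0.0], (0, 1): [0.25, 0.75], (1, 0): [0.5, 0.5], (1, 1): [0.0, 1.0]}
PSI = [[0.9, 0.1], [0.2, 0.8]]  # PSI[k][s_next]
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 0.0}
REP = [(0, 0), (1, 1)]  # representative pairs with linearly independent features


def value(w, s):
    """V_w(s): max over actions for Player 1's states, min for Player 2's."""
    q = [R[(s, a)] + GAMMA * sum(f * x for f, x in zip(PHI[(s, a)], w)) for a in (0, 1)]
    return max(q) if OWNER[s] == 1 else min(q)


def sample_next(s, a, rng):
    """Simulated sampling oracle: pick a factor k ~ PHI(s, a), then s' ~ PSI[k]."""
    k = 0 if rng.random() < PHI[(s, a)][0] else 1
    return 0 if rng.random() < PSI[k][0] else 1


def solve2(m, b):
    """Solve the 2x2 linear system m x = b by Cramer's rule."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [(b[0] * m[1][1] - m[0][1] * b[1]) / det,
            (m[0][0] * b[1] - m[1][0] * b[0]) / det]


def sampled_value_iteration(iterations, samples, seed=0):
    rng = random.Random(seed)
    w = [0.0, 0.0]
    phi_rep = [PHI[pair] for pair in REP]
    for _ in range(iterations):
        # Monte Carlo estimate of P(. | s, a)^T V_w at each representative pair.
        est = [sum(value(w, sample_next(s, a, rng)) for _ in range(samples)) / samples
               for (s, a) in REP]
        # The estimates approximate phi_rep @ w_new, so invert the feature matrix.
        w = solve2(phi_rep, est)
    return w


w_hat = sampled_value_iteration(iterations=30, samples=2000)
```

The inversion step is where the representativeness condition matters: if the rows of the feature matrix were nearly collinear, the sampling noise in `est` would be strongly amplified when solving for `w`.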
4.2 Sample Complexity Analysis
The next theorem establishes the sample complexity of Algorithm 1, which is independent of and . Its proof is deferred to the appendix.
5 Variance-Reduced Q-Learning for Two-Player Stochastic Games
In this section, we show how to accelerate the two-player Q-learning algorithm and achieve near-optimal sample efficiency. A main technique is to leverage monotonicity of the Bellman operator to guarantee that solutions improve monotonically in the algorithm, which was used in Yang and Wang (2019).
5.1 Nonnegative Features
To preserve monotonicity in the algorithm, we assume without loss of generality the features are nonnegative:
This condition can be easily satisfied. If the raw features do not satisfy nonnegativity, we can construct new features that do. For any state-action pair we append another one-dimensional feature such that for , and there is a subset of such that and is nonsingular. Then satisfies the nonnegativity condition and Assumption 1 for some by normalization. More details are deferred to the appendix.
5.2 Parametrization
We use a “max-linear” parameterization to guarantee that value functions improve monotonically in the algorithm. Instead of using a single vector , we now use a finite collection of -dimensional vectors , where is an integer of order . We use the following parameterization for the Q-functions, the value functions and strategies (here is defined to be the solution of in the max-min problem ; the definition of is similar):
(5)  
For a given and , computing the corresponding Q-value and action requires solving a one-step optimization problem. We assume that there is an oracle that solves the problem with time complexity .
Remark 1.
When the action space is continuous, this may become a constant which is independent of the state set and the action set.
5.3 Preserving Monotonicity
A drawback of value-iteration-like methods is that an optimal value function cannot be used greedily to obtain an optimal strategy. In order for Algorithm 1 to find an such that is an optimal value, it needs to find an optimal value function first, which is very inefficient. However, if a strategy and a value function satisfy the following inequality
(6) 
then there is a strong connection between and as follows (due to monotonicity of the Bellman operator ):
This relation will be used to show that if is close to optimal, the policy is also close to optimal.
The accelerated algorithm is given partly in Algorithm 2, which uses two tricks to preserve monotonicity:

We use parametrization (5) for and . This parametrization ensures that in our algorithm, the values and strategies keep improving throughout the iterations.

In each iteration, we shift downwards the new parameter to by using a confidence bound, such that
which uses the nonnegativity of features. The shift is used to guarantee (6).
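The role of nonnegative features in the downward shift can be illustrated with a few lines of arithmetic. This is a sketch under assumptions: the numbers and the names `shift_down`, `w_bar` and `radii` are hypothetical, and the shift is simply a per-coordinate subtraction of a confidence radius. With nonnegative features the shifted parameter can only lower the linear prediction; with a negative feature entry the prediction may go up and the pessimism guarantee is lost.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def shift_down(w, radii):
    """Subtract a per-coordinate confidence radius from the parameter vector."""
    return [x - r for x, r in zip(w, radii)]


w_bar = [1.2, 0.7]   # unshifted estimate (illustrative numbers)
radii = [0.1, 0.1]   # confidence radii, e.g. from a concentration bound
w_shifted = shift_down(w_bar, radii)

phi_nonneg = [0.3, 0.7]   # nonnegative feature vector
phi_mixed = [0.2, -0.8]   # feature vector with a negative entry

# Drop in the linear prediction caused by the shift: phi^T w_bar - phi^T w_shifted.
gap_nonneg = dot(phi_nonneg, w_bar) - dot(phi_nonneg, w_shifted)
gap_mixed = dot(phi_mixed, w_bar) - dot(phi_mixed, w_shifted)
```

Here `gap_nonneg` equals the radius times the sum of the features and is therefore nonnegative, while `gap_mixed` is negative, meaning the "shifted down" parameter actually raises the prediction for that pair.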
5.4 Approximating the Equilibrium from Two Sides
Making value functions monotonically increasing is not enough to find an optimal strategy for two-player stochastic games. There are two sides of the game, and may be either greater or less than the Nash value. Having a lower bound for does not lead to an approximate strategy. This is a major difference from the one-player MDP.
In order to fix this problem, we approximate the Nash equilibrium from two sides – both from above and below. Given player 1’s strategy and player 2’s strategy , we introduce two Bellman operators .
(7)  
If there exist value functions such that all of the following
(8) 
hold, then by using the monotonicity of we get
Hence if we have and , they jointly imply
(9)  
which indicates that is an optimal strategy.
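The two-sided sandwich can be checked numerically on a toy tabular game. In this sketch (toy data and the helper name `solve` are assumptions for illustration), pinning Player 1 to some strategy and letting Player 2 best-respond yields a value below the equilibrium value, while pinning Player 2 and letting Player 1 best-respond yields a value above it; when both bounds are close to the equilibrium value, the pinned pair is close to optimal.

```python
GAMMA = 0.5
OWNER = {0: 1, 1: 2}  # state 0: Player 1 (max), state 1: Player 2 (min)
P = {
    (0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0],
    (1, 0): [1.0, 0.0], (1, 1): [0.0, 1.0],
}
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 0.0}


def solve(pinned, sweeps=200):
    """Value iteration where states in `pinned` play a fixed action and every
    other state is optimized by its owner (max for Player 1, min for Player 2)."""
    v = [0.0, 0.0]
    for _ in range(sweeps):
        new = []
        for s in (0, 1):
            q = [R[(s, a)] + GAMMA * sum(p * x for p, x in zip(P[(s, a)], v))
                 for a in (0, 1)]
            if s in pinned:
                new.append(q[pinned[s]])
            else:
                new.append(max(q) if OWNER[s] == 1 else min(q))
        v = new
    return v


v_star = solve({})       # equilibrium value: both players optimize
lower = solve({0: 0})    # Player 1 pinned (suboptimally), Player 2 best-responds
upper = solve({1: 0})    # Player 2 pinned (suboptimally), Player 1 best-responds
tight_lower = solve({0: 1})  # Player 1 pinned to its equilibrium action
tight_upper = solve({1: 1})  # Player 2 pinned to its equilibrium action
```

For the suboptimal pinned actions the bounds are loose; for the equilibrium actions both bounds coincide with `v_star`, which is the situation the two-sided approximation aims to certify.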
To achieve this goal, we construct a “flipped” instance of 2TBSG , where the state set and the action set for each player, the transition probability matrix and the discount factor are identical with those of . The reward function is defined as
(10) 
The objectives of the two players are switched, which means that in the first player aims to minimize and the second player aims to maximize. The two games share the same optimal strategy (but flipped).
We use to denote the value function of , and let for any , which serves as the value function approximating the equilibrium value from above. This , together with , forms a two-sided approximation to the equilibrium value function.
We use Algorithm 2 to solve and at the same time. Next we construct a strategy where the first player’s strategy is based on parameters from the lower approximation, and the second player’s strategy is based on parameters from the upper approximation. This process is described in Algorithm 3, and its output is the following approximate Nash equilibrium strategy:
(11) 
where for , is the strategy defined as
5.5 Variance Reduction
We use inner-outer loops for variance reduction in Algorithm 2. Let the parameters at the th iteration be . At the beginning of the th outer iteration, we aim to approximate accurately (Steps 6 and 7). Then in the th inner iteration, we use as a reference to reduce the variance of estimation. That is, we estimate the difference using samples and then use the following equation (Steps 11 and 12)
to approximate . Since the infinity norm of is guaranteed to be smaller than the absolute value of , the number of samples needed for each inner iteration can be substantially reduced. Hence our algorithm is more sample-efficient.
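The idea of using a reference value function to reduce variance can be sketched as follows. This is an illustration under assumptions (a two-state next-state distribution, hypothetical names and sample sizes), not the paper's Algorithm 2: the expectation of the reference is estimated once with many samples, and only the small difference between the current and the reference value function is re-estimated with few samples.

```python
import random


def sample_state(probs, rng):
    """Draw a next state from a two-point distribution [p0, p1]."""
    return 0 if rng.random() < probs[0] else 1


def variance_reduced_estimate(v, v_ref, probs, n_ref, n_inner, rng):
    """Estimate E[v(s')] as E[v_ref(s')] + E[(v - v_ref)(s')].

    The first term uses n_ref samples (done once per outer iteration in the
    scheme described above); the second uses only n_inner samples."""
    ref = sum(v_ref[sample_state(probs, rng)] for _ in range(n_ref)) / n_ref
    draws = [sample_state(probs, rng) for _ in range(n_inner)]
    diff = sum(v[s] - v_ref[s] for s in draws) / n_inner
    return ref, diff, ref + diff


rng = random.Random(0)
v_ref = [1.0, 0.0]     # reference value function (illustrative)
v_cur = [1.02, 0.01]   # current value function, close to the reference
ref, diff, estimate = variance_reduced_estimate(
    v_cur, v_ref, probs=[0.3, 0.7], n_ref=5000, n_inner=50, rng=rng
)
```

Each sampled difference lies in a range of width 0.01 here, compared with a range of width 1 for the raw values, so far fewer inner samples are needed for the same accuracy.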
5.6 Putting Together
Algorithms 2 and 3 put together all the techniques explained above. In the next section, we will prove that they output an optimal strategy with high probability. It is easy to see that the time complexity of Algorithm 2 is . The first term is the time for calculating . The second term is the time for sampling and calculating the value function in each iteration, and is the time for calculating given parameter and state , which can be viewed as solving an optimization problem over the action space. The last term is due to the calculation of . As for the space complexity, we only need to store and the parameter at each iteration, which take space. Hence the total time and space complexities are independent of the numbers of states and actions.
6 Sample Complexity of Algorithms 2 and 3
Theorem 2.
We present a proof sketch here, and the complete proof is deferred to appendix.
Proof Sketch.
We prove the claim by induction. It is easy to see that . Next we assume holds.
The error between and involves two types of error: the estimation error due to sampling and the convergence error of value iteration. Due to the variance reduction technique, estimation error has two parts. The first part is the estimation error of , which we denote as , and the second part is the error of , which we denote as for short.
According to the Hoeffding inequality, we have and with high probability. By the induction hypothesis, we have . If we choose and , we will have and .
The convergence error of value iteration in the inner loop is . If we choose , we will have . Bringing these two types of errors together, we have with high probability that
where the last equality is due to . Choosing , we have . Here we have omitted the dependence on any constant factors.
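The Hoeffding step in the proof sketch can be turned into a small sample-size calculation. The sketch below uses the standard Hoeffding bound for i.i.d. samples taking values in an interval of width B: the empirical mean deviates from the true mean by at least eps with probability at most 2 exp(-2 m eps^2 / B^2), so m >= B^2 log(2 / delta) / (2 eps^2) samples suffice for failure probability delta. The function name and the numbers are illustrative assumptions; the constants in the paper's analysis are not reproduced here.

```python
import math


def hoeffding_samples(value_range, eps, delta):
    """Smallest m with 2 * exp(-2 * m * eps**2 / value_range**2) <= delta."""
    return math.ceil(value_range ** 2 * math.log(2.0 / delta) / (2.0 * eps ** 2))


# Estimating a quantity with range 1 to accuracy 0.1 with 95% confidence.
n_full = hoeffding_samples(1.0, 0.1, 0.05)

# If variance reduction halves the range of the sampled quantity, the
# required number of samples drops by roughly a factor of four.
n_reduced = hoeffding_samples(0.5, 0.1, 0.05)
```

The quadratic dependence on the range is what makes estimating a small difference much cheaper than estimating the full value.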
According to Theorems 1 and 2, we have the following theorem when the transition model cannot be embedded exactly; its proof is deferred to the appendix.
Theorem 3 (Approximation error due to model misspecification).
Let Assumption 1 hold and let the features be nonnegative. If there is another transition model which can be fully embedded into space, and there exists such that for and for , then with probability at least , the output of Algorithm 3 is an optimal strategy, and with probability at least , the parametrized strategy according to the output of Algorithm 1 is optimal.
7 Conclusion
In this paper, we develop a two-player Q-learning algorithm for solving 2TBSG in feature space. This algorithm is proved to find an optimal strategy with high probability using samples. To the best of our knowledge, it is the first and sharpest sample complexity bound for solving two-player stochastic games using features and linear models. The algorithm is sample efficient as well as space and time efficient.
References
 Antos et al. (2008a) Antos, A., Szepesvári, C., and Munos, R. (2008a). Fitted Q-iteration in continuous action-space MDPs. In Advances in Neural Information Processing Systems, pages 9–16.
 Antos et al. (2008b) Antos, A., Szepesvári, C., and Munos, R. (2008b). Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129.
 Azar et al. (2011a) Azar, M. G., Munos, R., Ghavamzadeh, M., and Kappen, H. (2011a). Reinforcement learning with a near optimal rate of convergence.
 Azar et al. (2011b) Azar, M. G., Munos, R., Ghavamzadeh, M., and Kappen, H. (2011b). Speedy Q-learning. In Advances in Neural Information Processing Systems.
 Azar et al. (2013) Azar, M. G., Munos, R., and Kappen, H. J. (2013). Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 91(3):325–349.
 Chen et al. (2018) Chen, Y., Li, L., and Wang, M. (2018). Scalable bilinear pi learning using state and action features. In Proceedings of the 35th International Conference on Machine Learning, pages 834–843, Stockholmsmässan, Stockholm Sweden. PMLR.
 Even-Dar et al. (2006) Even-Dar, E., Mannor, S., and Mansour, Y. (2006). Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7(Jun):1079–1105.
 Farahmand et al. (2010) Farahmand, A.-m., Szepesvári, C., and Munos, R. (2010). Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems, pages 568–576.
 Filar and Vrieze (2012) Filar, J. and Vrieze, K. (2012). Competitive Markov decision processes. Springer Science & Business Media.
 Hansen et al. (2013) Hansen, T. D., Miltersen, P. B., and Zwick, U. (2013). Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. Journal of the ACM (JACM), 60(1):1.
 Hu and Wellman (1999) Hu, J. and Wellman, M. P. (1999). Multiagent reinforcement learning in stochastic games. Submitted for publication.
 Hu and Wellman (2003) Hu, J. and Wellman, M. P. (2003). Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4(Nov):1039–1069.
 Jiang et al. (2017) Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J., and Schapire, R. E. (2017). Contextual decision processes with low Bellman rank are PAC-learnable. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1704–1713. JMLR.org.
 Kakade (2003) Kakade, S. M. (2003). On the sample complexity of reinforcement learning. PhD thesis, University of London, London, England.
 Kearns and Singh (1999) Kearns, M. J. and Singh, S. P. (1999). Finite-sample convergence rates for Q-learning and indirect algorithms. In Advances in Neural Information Processing Systems, pages 996–1002.
 Lagoudakis and Parr (2003) Lagoudakis, M. G. and Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning Research, 4(Dec):1107–1149.
 Lazaric et al. (2012) Lazaric, A., Ghavamzadeh, M., and Munos, R. (2012). Finite-sample analysis of least-squares policy iteration. Journal of Machine Learning Research, 13(Oct):3041–3074.
 Littman (1994) Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994, pages 157–163. Elsevier.
 Littman (1996) Littman, M. L. (1996). Algorithms for sequential decision making.
 Littman (2001a) Littman, M. L. (2001a). Friend-or-foe Q-learning in general-sum games. In ICML, volume 1, pages 322–328.
 Littman (2001b) Littman, M. L. (2001b). Value-function reinforcement learning in Markov games. Cognitive Systems Research, 2(1):55–66.
 Ludwig (1995) Ludwig, W. (1995). A subexponential randomized algorithm for the simple stochastic game problem. Information and Computation, 117(1):151–155.
 Melo et al. (2008) Melo, F. S., Meyn, S. P., and Ribeiro, M. I. (2008). An analysis of reinforcement learning with function approximation. In Proceedings of the 25th International Conference on Machine Learning, pages 664–671. ACM.
 Munos and Szepesvári (2008) Munos, R. and Szepesvári, C. (2008). Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(May):815–857.
 Nedić and Bertsekas (2003) Nedić, A. and Bertsekas, D. P. (2003). Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dynamic Systems, 13(1-2):79–110.
 Neyman et al. (2003) Neyman, A., Sorin, S., and Sorin, S. (2003). Stochastic games and applications, volume 570. Springer Science & Business Media.

 Parr et al. (2008) Parr, R., Li, L., Taylor, G., Painter-Wakefield, C., and Littman, M. L. (2008). An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings of the 25th International Conference on Machine Learning, pages 752–759. ACM.
 Pérolat et al. (2016) Pérolat, J., Piot, B., Geist, M., Scherrer, B., and Pietquin, O. (2016). Softened approximate policy iteration for Markov games. In ICML 2016 - 33rd International Conference on Machine Learning.
 Perolat et al. (2015) Perolat, J., Scherrer, B., Piot, B., and Pietquin, O. (2015). Approximate dynamic programming for two-player zero-sum Markov games. In International Conference on Machine Learning (ICML 2015).
 Puterman (2014) Puterman, M. L. (2014). Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
 Rao et al. (1973) Rao, S. S., Chandrasekaran, R., and Nair, K. (1973). Algorithms for discounted stochastic games. Journal of Optimization Theory and Applications, 11(6):627–637.
 Shapley (1953) Shapley, L. S. (1953). Stochastic games. Proceedings of the national academy of sciences, 39(10):1095–1100.
 Sidford et al. (2018a) Sidford, A., Wang, M., Wu, X., Yang, L., and Ye, Y. (2018a). Near-optimal time and sample complexities for solving Markov decision processes with a generative model. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R., editors, Advances in Neural Information Processing Systems 31, pages 5186–5196. Curran Associates, Inc.
 Sidford et al. (2018b) Sidford, A., Wang, M., Wu, X., and Ye, Y. (2018b). Variance reduced value iteration and faster algorithms for solving Markov decision processes. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 770–787. Society for Industrial and Applied Mathematics.
 Singh and Yee (1994) Singh, S. P. and Yee, R. C. (1994). An upper bound on the loss from approximate optimal-value functions. Machine Learning, 16(3):227–233.
 Sutton et al. (2009) Sutton, R. S., Maei, H. R., and Szepesvári, C. (2009). A convergent temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems, pages 1609–1616.
 Tagorti and Scherrer (2015) Tagorti, M. and Scherrer, B. (2015). On the rate of convergence and error bounds for LSTD(λ). In International Conference on Machine Learning, pages 1521–1529.
 Tsitsiklis and Van Roy (1997) Tsitsiklis, J. N. and Van Roy, B. (1997). Analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems, pages 1075–1081.
 Von Neumann and Morgenstern (2007) Von Neumann, J. and Morgenstern, O. (2007). Theory of games and economic behavior (commemorative edition). Princeton University Press.
 Yang and Wang (2019) Yang, L. F. and Wang, M. (2019). Sample-optimal parametric Q-learning with linear transition models. arXiv preprint arXiv:1902.04779.
 Yang et al. (2019) Yang, Z., Xie, Y., and Wang, Z. (2019). A theoretical analysis of deep Q-learning. arXiv preprint arXiv:1901.00137.
 Zhang et al. (2018) Zhang, K., Yang, Z., Liu, H., Zhang, T., and Başar, T. (2018). Finite-sample analyses for fully decentralized multi-agent reinforcement learning. arXiv preprint arXiv:1812.02783.
Appendix A Proof of Theorem 1
We first present the definition of optimal counterstrategies.
Definition 5.
For player 1’s strategy , we call a player 2’s optimal counterstrategy against , if for any player 2’s strategy , we have . For player 2’s strategy , we call a player 1’s optimal counterstrategy against , if for any player 1’s strategy , we have .
It is known in Puterman (2014) that for any player 1’s strategy (player 2’s strategy ), the optimal counterstrategy against () always exists.
Our next lemma indicates that we can use the error of parametrized functions to bound the error of value functions of parametrized strategies.
Lemma 1.
If
(14) 
then we have
(15) 
where , and are optimal counterstrategies of .
Proof.
We only prove the first inequality of (15). The proof of the second inequality is similar.
For any ,