1 Introduction
The Markov decision process (MDP) is a widely used model in machine learning and operations research [1], and it establishes the basic rules of reinforcement learning. While solving an MDP means maximizing (minimizing) the total reward (cost) for a single player, we consider a broader class of problems, two-player turn-based stochastic games (2TBSGs) [15], which involve two players with opposite objectives: one player aims to maximize the total reward, while the other aims to minimize it. MDPs and 2TBSGs have many useful applications; see [8, 3, 12, 2, 5, 10, 16]. Similar to an MDP, every 2TBSG has a state set and an action set, each of which is divided into two subsets, one for each player. Moreover, its transition probability matrix describes the transition distribution over the state set conditioned on the current action, and its reward function gives the immediate reward received when taking an action.
We use a strategy (policy) to denote a mapping from the state set into the action set. We focus on the discounted 2TBSG, where the reward earned at later steps is multiplied by powers of a discount factor. Given strategies (policies) for both players, the total reward is defined as the sum of all discounted rewards. We solve a 2TBSG by finding a Nash equilibrium strategy (equilibrium strategy for short), from which the first player cannot change its own strategy to obtain a larger total reward, and the second player cannot change its own strategy to obtain a smaller total reward. An MDP can be viewed as a special case of a 2TBSG in which all states belong to the first player; in that case, the equilibrium strategy coincides with the optimal policy of the MDP.
MDPs admit linear programming (LP) formulations [3], so algorithms for solving LP problems can be used to solve MDPs. One of the most commonly used algorithms for MDPs is the policy iteration algorithm [8], which can be viewed as a parallel counterpart of the simplex method applied to the corresponding LP. In [18], both the simplex method applied to the corresponding LP and the policy iteration algorithm were proved to find the optimal policy in $O\!\left(\frac{mn}{1-\gamma}\log\frac{n}{1-\gamma}\right)$ iterations, where $m$, $n$ and $\gamma$ are the number of actions, the number of states, and the discount factor, respectively. Later, in [7], the bound for the policy iteration algorithm was improved by a factor of $n$ to $O\!\left(\frac{m}{1-\gamma}\log\frac{n}{1-\gamma}\right)$. In [14], this bound was improved to $O\!\left(\frac{m}{1-\gamma}\log\frac{1}{1-\gamma}\right)$. When the MDP is deterministic (all transition probabilities are either $0$ or $1$), a strongly polynomial bound independent of the discount factor is proved in [11] for the simplex policy iteration method (each iteration changes only one action): $O(n^3 m^2 \log^2 n)$ for uniformly discounted MDPs and $O(n^5 m^3 \log^2 n)$ for nonuniformly discounted MDPs.

However, there is no simple LP formulation for 2TBSGs. The strategy iteration algorithm [13], an analogue of policy iteration, is a commonly used algorithm for finding the equilibrium strategy of a 2TBSG. It was first proved in [7] to be a strongly polynomial time algorithm, guaranteed to find the equilibrium in $O\!\left(\frac{m}{1-\gamma}\log\frac{n}{1-\gamma}\right)$ iterations when the discount factor is fixed. When the discount factor is not fixed, exponential lower bounds are known for policy iteration on MDPs [4] and for strategy iteration on 2TBSGs [6]. It is an open problem whether there is a strongly polynomial algorithm for 2TBSGs whose complexity is independent of the discount factor.
Motivated by the strongly polynomial simplex algorithm for solving MDPs, we present a simplex strategy iteration algorithm and a modified simplex strategy iteration algorithm for 2TBSGs. In both algorithms the players update in turn, and the second player always finds the best counterstrategy on its turn. In the simplex strategy iteration algorithm, the first player updates its strategy using the simplex algorithm. In the modified simplex strategy iteration algorithm, the first player updates the action leading to the largest improvement after the second player finds the optimal counterstrategy. When the second player is trivial, the 2TBSG becomes an MDP, and the simplex strategy iteration algorithm can find its solution in strongly polynomial time independent of the discount factor, a property not possessed by the strategy iteration algorithm of [7].
We also develop a proof technique to establish the strongly polynomial complexity of a class of geometrically converging algorithms. This class includes the strategy iteration algorithm, the simplex strategy iteration algorithm, and the modified simplex strategy iteration algorithm. The complexity of the strategy iteration algorithm given in [7] can be recovered by our technique. Our technique uses a combination of the current strategy and the equilibrium strategy: we bound the ratio between the difference in value from the current strategy to the equilibrium strategy and the difference in value from the combined strategy to the equilibrium strategy. Using this bound and the geometric convergence property, we prove that after a certain number of iterations one action disappears forever, which leads to strongly polynomial convergence when the discount factor is fixed. Although we have not fully answered the open problem, our algorithms and analysis point out a possible way of conquering the difficulties.
Furthermore, a 2TBSG in which each state has exactly two actions can be transformed into a linear complementarity problem [9], and an MDP in which each state has exactly two actions can be solved by a combinatorial interior point method [17]. In this paper we present a way to transform a general 2TBSG into a 2TBSG where each state has exactly two actions. The number of states in the constructed 2TBSG is $\widetilde{O}(m)$, where $m$ is the number of actions of the original game and $\widetilde{O}(\cdot)$ hides logarithmic factors. This result enables the application of both results in [9, 17] to general cases.
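One natural building block for such a transformation is to replace a state that has many available actions by a balanced binary tree of auxiliary states offering exactly two choices each, so that the blow-up in the number of states is linear in the number of actions. The sketch below is a hypothetical illustration of this splitting step only; the reward and discount adjustments needed to preserve the game's values (which introduce the hidden logarithmic factors) are not modeled.

```python
def split_state(actions):
    """Replace one state offering `actions` by a balanced binary tree of
    auxiliary states, each offering exactly two choices.  Leaves are the
    original actions; internal nodes are new two-action states.  This is a
    structural sketch only: the reward and discount bookkeeping needed to
    keep the game equivalent is omitted."""
    nodes = list(actions)
    tree = []  # each entry is the (left, right) pair of one auxiliary state
    while len(nodes) > 1:
        merged = []
        for i in range(0, len(nodes) - 1, 2):
            tree.append((nodes[i], nodes[i + 1]))
            merged.append(('node', len(tree) - 1))
        if len(nodes) % 2:  # an odd leftover is promoted to the next level
            merged.append(nodes[-1])
        nodes = merged
    return tree

tree = split_state(['a1', 'a2', 'a3', 'a4', 'a5'])
# a state with k actions becomes k - 1 auxiliary states with 2 actions each
```

Since a binary tree with $k$ leaves has $k-1$ internal nodes, the total number of auxiliary states over all original states is linear in the total number of actions.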
The rest of this paper is organized as follows. In Section 2 we present basic concepts and lemmas for 2TBSGs. In Section 3 we describe the simplex strategy iteration algorithm and the modified simplex strategy iteration algorithm. The complexity proof for the class of geometrically converging algorithms is given in Section 4. The transformation from general 2TBSGs into special 2TBSGs is introduced in Section 5.
2 Preliminaries
In this section, we present some basic concepts of 2TBSG. Our focus here is on the discounted 2TBSG, defined as follows.
Definition 2.1.
A discounted 2TBSG (2TBSG for short) consists of a tuple $G=(S_1,S_2,A_1,A_2,P,r,\gamma)$, where $\gamma\in(0,1)$ is the discount factor. The sets $S_1,S_2$ and $A_1,A_2$ are the state sets and the action sets of the two players, respectively; we write $S=S_1\cup S_2$ and $A=A_1\cup A_2$. $P\in\mathbb{R}^{|A|\times|S|}$ is the transition probability matrix, where $P_{a,s'}$ denotes the probability that the next state is $s'$ conditioned on the current action $a$.

$r\in\mathbb{R}^{|A|}$ is the reward vector, where $r(a)$
denotes the immediate reward received when using action $a$. For convenience, we use $m=|A|$ to denote the number of actions and $n=|S|$ to denote the number of states. Given a state $s$ in the 2TBSG setting, we use $A_s$ to denote the set of available actions at state $s$. A deterministic strategy (strategy for short) is a pair $\pi=(\pi_1,\pi_2)$, where $\pi_1$ and $\pi_2$ are mappings from $S_1$ to $A_1$ and from $S_2$ to $A_2$, respectively. Moreover, each state $s$ is matched to an action $\pi(s)\in A_s$.
For a given strategy $\pi$, we define the transition probability matrix $P_\pi\in\mathbb{R}^{n\times n}$ and the reward vector $r_\pi\in\mathbb{R}^n$ with respect to $\pi$. The $s$th row of $P_\pi$ is chosen to be the row of action $\pi(s)$ in $P$, and the $s$th entry of $r_\pi$ is chosen to be the reward of action $\pi(s)$. It is easy to observe that the matrix $P_\pi$
is a stochastic matrix. We next define the value vector and the modified reward function.
Definition 2.2.
The value vector of a given strategy $\pi$ is
$$v_\pi=\sum_{t=0}^{\infty}\gamma^t P_\pi^t r_\pi=(I-\gamma P_\pi)^{-1}r_\pi.$$
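Concretely, $v_\pi$ can be computed by solving the linear system $(I-\gamma P_\pi)v_\pi=r_\pi$. The following minimal sketch does this on a hypothetical two-state, four-action game (all numerical data here is illustrative, not taken from the paper):

```python
import numpy as np

gamma = 0.9  # discount factor
# Hypothetical toy game: rows of P are actions; P[a, s'] is the probability
# that the next state is s' after taking action a.  Actions 0, 1 are
# available in state 0 and actions 2, 3 in state 1.
P = np.array([[0.0, 1.0],
              [1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])
r = np.array([1.0, 0.0, 0.0, 2.0])  # immediate reward of each action

def value(pi):
    """Value vector v_pi = (I - gamma*P_pi)^{-1} r_pi, where pi lists the
    action chosen in each state."""
    return np.linalg.solve(np.eye(len(pi)) - gamma * P[pi], r[pi])

v = value([0, 3])  # strategy: action 0 in state 0, action 3 in state 1
```

Solving the linear system is equivalent to summing the geometric series $\sum_{t\ge 0}\gamma^t P_\pi^t r_\pi$, which a truncated sum confirms numerically.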
Definition 2.3.
The modified reward function of a given strategy $\pi$ is defined as
$$\bar r_\pi = r-(J-\gamma P)v_\pi,$$
where $J\in\{0,1\}^{m\times n}$ is defined as $J_{a,s}=1$ if $a\in A_s$ and $J_{a,s}=0$ otherwise.
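Intuitively, the modified reward (reduced cost) of every action chosen by $\pi$ itself is zero, while for an off-strategy action it measures the one-step advantage of deviating to that action. A numeric sketch on hypothetical toy data (illustrative, not from the paper):

```python
import numpy as np

gamma = 0.9
# Toy game: actions 0, 1 belong to state 0; actions 2, 3 belong to state 1.
P = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
r = np.array([1.0, 0.0, 0.0, 2.0])
state_of = np.array([0, 0, 1, 1])  # which state each action belongs to

def value(pi):
    return np.linalg.solve(np.eye(len(pi)) - gamma * P[pi], r[pi])

def reduced_costs(pi):
    """rbar_pi(a) = r(a) + gamma * P[a] @ v_pi - v_pi(state of a)."""
    v = value(pi)
    return r + gamma * P @ v - v[state_of]

rbar = reduced_costs([0, 3])
# actions used by pi have zero reduced cost; the others measure the
# one-step gain (or loss) of deviating to them
```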
Furthermore, for a given 2TBSG, the optimal counterstrategy against the other player's given strategy is defined in Definition 2.4, and the equilibrium strategy is given in Definition 2.5.
Definition 2.4.
For player 2's strategy $\pi_2$, player 1's strategy $\pi_1$ is the optimal counterstrategy against $\pi_2$ if and only if for any strategy $\pi_1'$ of player 1, we have
$$v_{(\pi_1,\pi_2)}\ge v_{(\pi_1',\pi_2)}.$$

Player 2's optimal counterstrategy can be defined similarly: $\pi_2$ is the optimal counterstrategy against $\pi_1$ if and only if for any strategy $\pi_2'$, $v_{(\pi_1,\pi_2)}\le v_{(\pi_1,\pi_2')}$. Here, for two value vectors $u,v\in\mathbb{R}^n$, we say $u\ge v$ ($u\le v$) if and only if $u(s)\ge v(s)$ ($u(s)\le v(s)$) for every $s\in S$.
Definition 2.5.
A strategy $\pi=(\pi_1,\pi_2)$ is called an equilibrium strategy if and only if $\pi_1$ is the optimal counterstrategy against $\pi_2$, and $\pi_2$ is the optimal counterstrategy against $\pi_1$.
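On a small game, the equilibrium condition of Definition 2.5 can be verified exhaustively: enumerate all strategy pairs and keep those in which each player's action is an optimal counterstrategy against the other's. A sketch on hypothetical toy data (player 1 maximizes, player 2 minimizes; the data is illustrative, not from the paper):

```python
import numpy as np
from itertools import product

gamma = 0.9
# Actions 0, 1 belong to state 0 (player 1, the maximiser);
# actions 2, 3 belong to state 1 (player 2, the minimiser).
P = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
r = np.array([1.0, 0.0, 0.0, 2.0])
A1, A2 = [0, 1], [2, 3]

def value(pi):
    return np.linalg.solve(np.eye(len(pi)) - gamma * P[pi], r[pi])

def is_equilibrium(a1, a2):
    v = value([a1, a2])
    # a1 is an optimal counterstrategy against a2 (every alternative gives a
    # componentwise no-larger value), and symmetrically for player 2
    p1_ok = all((value([b1, a2]) <= v + 1e-9).all() for b1 in A1)
    p2_ok = all((value([a1, b2]) >= v - 1e-9).all() for b2 in A2)
    return p1_ok and p2_ok

eqs = [(a1, a2) for a1, a2 in product(A1, A2) if is_equilibrium(a1, a2)]
```

On this toy game the exhaustive search returns a single equilibrium pair, consistent with Theorem 2.6's statement that all equilibria share the same value vector.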
To describe the property of equilibrium strategies, we present Theorems 2.6 and 2.7 given in [7, 15]. Theorem 2.6 indicates the existence of an equilibrium strategy.
Theorem 2.6.
Every 2TBSG has at least one equilibrium strategy. If $\pi$ and $\pi'$ are two equilibrium strategies, then $v_\pi=v_{\pi'}$. Furthermore, for any strategy $\pi_1$ of player 1 (or strategy $\pi_2$ of player 2), there always exists an optimal counterstrategy of player 2 against $\pi_1$ (an optimal counterstrategy of player 1 against $\pi_2$), and for any two optimal counterstrategies $\pi_2,\pi_2'$ ($\pi_1,\pi_1'$), we have $v_{(\pi_1,\pi_2)}=v_{(\pi_1,\pi_2')}$ ($v_{(\pi_1,\pi_2)}=v_{(\pi_1',\pi_2)}$).
The next theorem gives a useful characterization of the value vector at the equilibrium.
Theorem 2.7.
Let $\pi^*=(\pi_1^*,\pi_2^*)$ be an equilibrium strategy of the 2TBSG. If $\pi_1$ is a strategy of player 1, and $\pi_2$ is player 2's optimal counterstrategy against $\pi_1$, then we have $v_{(\pi_1,\pi_2)}\le v_{\pi^*}$. The equality holds if and only if $(\pi_1,\pi_2)$ is an equilibrium strategy.
We now define the flux vector of a given strategy $\pi$.
Definition 2.8.
The flux of a given strategy $\pi$ is defined as
$$x_\pi^\top=\mathbf 1^\top(I-\gamma P_\pi)^{-1},$$
where $\mathbf 1\in\mathbb{R}^n$ is the all-ones vector.
Our next lemma presents bounds and properties of the flux vector, and the relationship among the value vector, the flux vector, and the reduced costs. This lemma and the following several lemmas can be found in [7]. To make the paper self-contained, we briefly give their proofs.
Lemma 2.9.
For any strategies $\pi$ and $\pi'$, we have

(1) $\mathbf 1^\top x_\pi=\frac{n}{1-\gamma}$;

(2) for any $s\in S$, $1\le x_\pi(s)\le\frac{n}{1-\gamma}$;

(3) $\mathbf 1^\top v_\pi=x_\pi^\top r_\pi$;

(4) $v_{\pi'}-v_\pi=(I-\gamma P_{\pi'})^{-1}\bar r_\pi(\pi')$, and moreover, $\mathbf 1^\top(v_{\pi'}-v_\pi)=x_{\pi'}^\top\bar r_\pi(\pi')$, where $\bar r_\pi(\pi')\in\mathbb{R}^n$ has entries $\big(\bar r_\pi(\pi')\big)(s)=\bar r_\pi(\pi'(s))$.
Proof.
Item (1) is proved by
$$\mathbf 1^\top x_\pi=\mathbf 1^\top\Big(\sum_{t=0}^{\infty}\gamma^t(P_\pi^\top)^t\Big)\mathbf 1=\sum_{t=0}^{\infty}\gamma^t\,n=\frac{n}{1-\gamma}.$$
Item (2) is due to
$$x_\pi=\sum_{t=0}^{\infty}\gamma^t(P_\pi^\top)^t\mathbf 1.$$
This indicates that $x_\pi(s)\ge 1$ for every $s\in S$, since the $t=0$ term contributes $1$ and all remaining terms are nonnegative. Hence we have $x_\pi\ge\mathbf 1$ and $x_\pi(s)\le\mathbf 1^\top x_\pi=\frac{n}{1-\gamma}$ from item (1). Finally, the last two items are obtained from
$$v_\pi=(I-\gamma P_\pi)^{-1}r_\pi$$
and
$$\bar r_\pi(\pi')=r_{\pi'}-(I-\gamma P_{\pi'})v_\pi.$$
∎
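The flux bounds and identities of Lemma 2.9 can also be checked numerically. A sketch on hypothetical toy data, using $x_\pi^\top=\mathbf 1^\top(I-\gamma P_\pi)^{-1}$ (illustrative data, not from the paper):

```python
import numpy as np

gamma, n = 0.9, 2
# Toy game: actions 0, 1 belong to state 0; actions 2, 3 belong to state 1.
P = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

def flux(pi):
    """x_pi = (I - gamma*P_pi)^T^{-1} 1: discounted state-visitation counts
    summed over all n starting states."""
    return np.linalg.solve((np.eye(n) - gamma * P[pi]).T, np.ones(n))

x = flux([0, 3])
# total flux is n/(1-gamma), and every component lies in [1, n/(1-gamma)]
```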
In the following, we present a lemma giving the sign conditions on the reduced costs of optimal counterstrategies and equilibrium strategies.
Lemma 2.10.

(1) A strategy $\pi_1$ for player 1 is an optimal counterstrategy against player 2's strategy $\pi_2$ if and only if $\bar r_{(\pi_1,\pi_2)}(a)\le 0$ for every $a\in A_1$.

(2) A strategy $\pi_2$ for player 2 is an optimal counterstrategy against player 1's strategy $\pi_1$ if and only if $\bar r_{(\pi_1,\pi_2)}(a)\ge 0$ for every $a\in A_2$.

(3) A strategy $\pi$ is an equilibrium strategy if and only if it satisfies
$$\bar r_\pi(a)\le 0\ \text{ for all }a\in A_1\quad\text{and}\quad\bar r_\pi(a)\ge 0\ \text{ for all }a\in A_2.$$
Proof.
If $\pi=(\pi_1,\pi_2)$ satisfies $\bar r_\pi(a)\le 0$ for all $a\in A_1$, then for any strategy $\pi_1'$ of player 1, we have
$$v_{(\pi_1',\pi_2)}-v_{(\pi_1,\pi_2)}=(I-\gamma P_{(\pi_1',\pi_2)})^{-1}\,\bar r_\pi\big((\pi_1',\pi_2)\big)\le 0,$$
where the last inequality follows from $\bar r_\pi(a)\le 0$ for $a\in A_1$, $\bar r_\pi(\pi_2(s))=0$ for $s\in S_2$, and the entrywise nonnegativity of $(I-\gamma P_{(\pi_1',\pi_2)})^{-1}$.

Conversely, suppose that player 1's strategy $\pi_1$ is the optimal counterstrategy against player 2's strategy $\pi_2$. For any $s\in S_1$ and $a\in A_s$, we let $\pi_1'$ be the strategy that agrees with $\pi_1$ everywhere except that $\pi_1'(s)=a$. Then again from Lemma 2.9 (4), we have
$$0\ge\mathbf 1^\top\big(v_{(\pi_1',\pi_2)}-v_{(\pi_1,\pi_2)}\big)=x_{(\pi_1',\pi_2)}^\top\,\bar r_{(\pi_1,\pi_2)}\big((\pi_1',\pi_2)\big),$$
where the inequality comes from the definition of optimal counterstrategies. Since $\bar r_{(\pi_1,\pi_2)}$ vanishes at every action chosen by $(\pi_1,\pi_2)$, the right-hand side equals $x_{(\pi_1',\pi_2)}(s)\,\bar r_{(\pi_1,\pi_2)}(a)$, which indicates that
$\bar r_{(\pi_1,\pi_2)}(a)\le 0$. With this estimate and $x_{(\pi_1',\pi_2)}(s)\ge 1$
from Lemma 2.9 (2), we have proved that $\bar r_{(\pi_1,\pi_2)}(a)\le 0$ for all $a\in A_1$. Hence, item (1) is established, and the proof of item (2) is similar. Finally, item (3) follows from items (1) and (2) directly. ∎

3 Geometrically Converging Algorithms
Inspired by the simplex method solving the LP corresponding to the MDP and the strategy iteration algorithm given in [7], we propose a simplex strategy iteration (Algorithm 1) and a modified simplex strategy iteration algorithm (Algorithm 2) for 2TBSG.
The simplex strategy iteration algorithm can be viewed as a generalization of the strongly polynomial simplex algorithm for solving MDPs [11]. In our algorithm, both players update their strategies in turn. In each iteration, the first player updates its strategy using the simplex method, which means updating only the action with the largest reduced cost, while the second player updates its strategy to the optimal counterstrategy. When the second player has only one possible action and the transition matrix is deterministic, the 2TBSG reduces to a deterministic MDP. Then the simplex strategy iteration algorithm can find an equilibrium (optimal) strategy in strongly polynomial time independent of the discount factor, a property that has not been proven for the strategy iteration algorithm [7].
The modified simplex strategy iteration algorithm can be viewed as a modification of the simplex strategy iteration algorithm. In this algorithm, both players also update their strategies in turn, and the second player always finds the optimal counterstrategy on its moves. However, in each of the first player's moves, only the action is updated that leads to the biggest improvement in the value vector when the second player uses the optimal counterstrategy.
Note that every iteration of the simplex strategy iteration algorithm involves one simplex update and the solution of one MDP, and every iteration of the modified simplex strategy iteration algorithm involves the solutions of multiple MDPs. Hence every iteration of both algorithms can be carried out in strongly polynomial time when the discount factor is fixed.
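To make the loop structure concrete, here is an illustrative sketch in the spirit of the simplex strategy iteration algorithm; it is not the paper's exact Algorithm 1. It runs on hypothetical toy data in which player 2 owns a single state, so its optimal counterstrategy can be found by exhaustive search; player 1 pivots to the action with the largest positive reduced cost.

```python
import numpy as np

gamma = 0.9
# Hypothetical toy game: actions 0, 1 belong to state 0 (player 1,
# maximiser); actions 2, 3 belong to state 1 (player 2, minimiser).
P = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
r = np.array([1.0, 0.0, 0.0, 2.0])
state_of = np.array([0, 0, 1, 1])
A1, A2 = [0, 1], [2, 3]

def value(pi):
    return np.linalg.solve(np.eye(len(pi)) - gamma * P[pi], r[pi])

def reduced_costs(pi):
    v = value(pi)
    return r + gamma * P @ v - v[state_of]

def best_counter(a1):
    # Player 2's optimal counterstrategy by exhaustive search; the
    # componentwise minimum exists by theory, so minimising the sum
    # recovers it on this toy game.
    return min(A2, key=lambda a2: value([a1, a2]).sum())

def simplex_strategy_iteration(a1):
    while True:
        a2 = best_counter(a1)              # player 2 best-responds
        rbar = reduced_costs([a1, a2])
        improving = [a for a in A1 if rbar[a] > 1e-9]
        if not improving:                  # no improving pivot: equilibrium
            return a1, a2
        a1 = max(improving, key=lambda a: rbar[a])  # largest reduced cost
```

Starting from a suboptimal action for player 1, the loop pivots and then stops at a pair where no player-1 action has positive reduced cost, which by Lemma 2.10 is an equilibrium.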
Next we present a class of geometrically converging algorithms used for proving the strongly polynomial complexity for several algorithms in the next section.
Definition 3.1.
We say a strategy-update algorithm (an algorithm which updates the strategies of both players in each iteration) is a geometrically converging algorithm with parameter $\lambda>1$ if it updates a strategy $\pi^t=(\pi_1^t,\pi_2^t)$ to $\pi^{t+1}=(\pi_1^{t+1},\pi_2^{t+1})$ such that the following properties hold.

(1) $\pi_2^{t+1}$ is the optimal counterstrategy against $\pi_1^{t+1}$;

(2) $v_{\pi^{t+1}}\ge v_{\pi^t}$;

(3) if $\pi^{t+1}=\pi^t$, then $\pi^t$ is an equilibrium strategy;

(4) the updates of the algorithm satisfy
$$\mathbf 1^\top\big(v_{\pi^*}-v_{\pi^{t+1}}\big)\le\Big(1-\frac{1}{\lambda}\Big)\,\mathbf 1^\top\big(v_{\pi^*}-v_{\pi^t}\big),$$
where $\pi^*$ is an equilibrium strategy.
To begin with, we exhibit a lemma establishing the geometric convergence of the value vector in the simplex strategy iteration algorithm.
Lemma 3.2.
Suppose the sequence of strategies generated by the simplex strategy iteration algorithm is $\{\pi^t\}_{t\ge 0}$. Then the following inequality holds:
$$\mathbf 1^\top\big(v_{\pi^*}-v_{\pi^{t+1}}\big)\le\Big(1-\frac{1-\gamma}{n}\Big)\,\mathbf 1^\top\big(v_{\pi^*}-v_{\pi^t}\big).\qquad(1)$$
Proof.
Using this lemma, we show in the next proposition that the strategy iteration algorithm, Algorithm 1 and Algorithm 2 all belong to the class of geometrically converging algorithms.
Proposition 3.3.

The strategy iteration algorithm given in [7] is a geometrically converging algorithm with parameter $\lambda=\frac{1}{1-\gamma}$;

The simplex strategy iteration algorithm (Algorithm 1) is a geometrically converging algorithm with parameter $\lambda=\frac{n}{1-\gamma}$;

The modified simplex strategy iteration algorithm (Algorithm 2) is a geometrically converging algorithm with parameter $\lambda=\frac{n^2}{1-\gamma}$.
Proof.
It is easy to verify that the three algorithms described above satisfy the first three conditions in the definition of geometrically converging algorithms. Next, we prove that all of them satisfy the last condition. For the strategy iteration algorithm, according to Lemma 4.8 and Lemma 5.4 given in [7], we have
$$\mathbf 1^\top\big(v_{\pi^*}-v_{\pi^{t+1}}\big)\le\gamma\,\mathbf 1^\top\big(v_{\pi^*}-v_{\pi^t}\big).$$
Hence if $\lambda=\frac{1}{1-\gamma}$ (a constant once $\gamma$ is fixed), then we obtain
$$\mathbf 1^\top\big(v_{\pi^*}-v_{\pi^{t+1}}\big)\le\Big(1-\frac{1}{\lambda}\Big)\,\mathbf 1^\top\big(v_{\pi^*}-v_{\pi^t}\big),$$
and the last condition of geometrically converging algorithms is verified.
For the simplex strategy iteration algorithm, if we choose $\lambda=\frac{n}{1-\gamma}$ (a constant once $\gamma$ is fixed), then according to inequality (1) we have
$$\mathbf 1^\top\big(v_{\pi^*}-v_{\pi^{t+1}}\big)\le\Big(1-\frac{1}{\lambda}\Big)\,\mathbf 1^\top\big(v_{\pi^*}-v_{\pi^t}\big),$$
and the last condition of geometrically converging algorithms is verified.
Finally we consider the modified simplex strategy iteration algorithm. For , let , where is an action of state . Let
be player 2’s optimal counterstrategy against , and . Then from inequality (1), we have
According to Algorithm 2, we have
which leads to the following estimation:
Therefore, similar to the previous case we can choose such that
and the last condition of geometrically converging algorithms is verified. ∎
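The update rule of the modified algorithm can be sketched in the same hedged spirit (illustrative toy data, not the paper's exact Algorithm 2): for every candidate action of player 1, let player 2 re-optimize first, and commit only to the single action whose post-response value improves the most.

```python
import numpy as np

gamma = 0.9
# Hypothetical toy game: actions 0, 1 belong to state 0 (player 1,
# maximiser); actions 2, 3 belong to state 1 (player 2, minimiser).
P = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
r = np.array([1.0, 0.0, 0.0, 2.0])
A1, A2 = [0, 1], [2, 3]

def value(pi):
    return np.linalg.solve(np.eye(len(pi)) - gamma * P[pi], r[pi])

def best_counter(a1):
    # exhaustive search is enough since player 2 owns a single state here
    return min(A2, key=lambda a2: value([a1, a2]).sum())

def modified_simplex(a1):
    while True:
        a2 = best_counter(a1)
        base = value([a1, a2]).sum()
        # gain of each candidate action *after* the opponent re-optimises
        gains = {b: value([b, best_counter(b)]).sum() - base
                 for b in A1 if b != a1}
        best, gain = max(gains.items(), key=lambda kv: kv[1])
        if gain <= 1e-9:
            return a1, a2  # no single switch improves the post-response value
        a1 = best
```

Compared with the plain simplex update, the switch is judged by the actual improvement after the opponent's best response rather than by the reduced cost alone.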
4 Strongly Polynomial Complexity of Geometrically Converging Algorithms
In this section, we establish the strongly polynomial complexity of geometrically converging algorithms when the parameter $\lambda$ is viewed as a constant. Slightly different from the proof in [7], which tracks the strategy $\pi^t$ at the $t$th iteration, we present a proof that considers the combined strategy $\sigma^t=(\pi_1^t,\pi_2^*)$, where $\pi^*=(\pi_1^*,\pi_2^*)$ is an equilibrium strategy. We show that $\mathbf 1^\top(v_{\pi^*}-v_{\sigma^t})$ can be both upper and lower bounded by some proportion of $\mathbf 1^\top(v_{\pi^*}-v_{\pi^t})$. By applying the property of geometrically converging algorithms, we obtain that after a certain number of iterations, one of player 1's actions disappears from $\pi_1^t$ forever.
Theorem 4.1.
Any geometrically converging algorithm with parameter $\lambda$ finds the equilibrium strategy in
$$O\!\left(m\,\lambda\,\log\frac{n\lambda}{1-\gamma}\right)$$
iterations.
Proof.
Suppose $\{\pi^t\}_{t\ge 0}$ is the sequence generated by a geometrically converging algorithm. We define $\sigma^t=(\pi_1^t,\pi_2^*)$, where $\pi^*=(\pi_1^*,\pi_2^*)$ is one of the equilibrium strategies.
According to Lemma 2.10 and the fact that is the optimal counterstrategy against , and the definition of geometrically converging algorithm, we have
which directly leads to
(2) 
According to Lemma 2.10, we have
which implies
(3) 
We next prove the following inequality:
(4) 
A direct calculation gives
where the last inequality is obtained from Lemma 2.10. Then noticing that
we have
Then the inequality (4) is proved.
Finally, we prove that for any , either there exists an action in will never belong to when , or we have
Actually for any , suppose , we obtain
from (2) and the definition of geometrically converging algorithm. Hence according to (3) and (4), we get
(5) 
Therefore, choosing , and because for any , according to Lemma 2.10, we obtain
from Lemma 2.9. If , we have
where the first inequality is due to Lemma 2.10 and the second inequality is due to Lemma 2.9. Therefore, combining these two inequalities and the inequality (5) and noticing that , we get
This leads to a contradiction.
The previous derivation means that if the above inequality does not hold for some $t$, then an action of $\pi_1^t$ must disappear forever after iteration $t$. Hence, after every fixed number of iterations, an action disappears forever. This process cannot happen more than $m-n$ times (since there are $m$ actions and every strategy uses $n$ of them), which indicates that for some $t$, $\pi^{t+1}=\pi^t$.
It follows from the definition of geometrically converging algorithms that $\pi^t$ is an equilibrium strategy. This indicates that within
$$O\!\left(m\,\lambda\,\log\frac{n\lambda}{1-\gamma}\right)$$
iterations, we can find one of the equilibrium strategies. ∎
Our next theorem presents the complexity of the strategy iteration algorithm, the simplex strategy iteration algorithm and the modified simplex strategy iteration algorithm.
Theorem 4.2.
The following algorithms have strongly polynomial convergence when the discount factor is fixed.

The strategy iteration algorithm given in [7] can find the equilibrium strategy within $O\!\left(\frac{m}{1-\gamma}\log\frac{n}{1-\gamma}\right)$ iterations;

The simplex strategy iteration algorithm (Algorithm 1) can find the equilibrium strategy within $O\!\left(\frac{mn}{1-\gamma}\log\frac{n}{1-\gamma}\right)$ iterations;

The modified simplex strategy iteration algorithm (Algorithm 2) can find the equilibrium strategy within