
Temporal Induced Self-Play for Stochastic Bayesian Games

One practical requirement in solving dynamic games is to ensure that the players play well from any decision point onward. To satisfy this requirement, existing efforts focus on equilibrium refinement, but the scalability and applicability of existing techniques are limited. In this paper, we propose Temporal-Induced Self-Play (TISP), a novel reinforcement learning-based framework to find strategies with decent performances from any decision point onward. TISP uses belief-space representation, backward induction, policy learning, and non-parametric approximation. Building upon TISP, we design a policy-gradient-based algorithm TISP-PG. We prove that TISP-based algorithms can find approximate Perfect Bayesian Equilibrium in zero-sum one-sided stochastic Bayesian games with finite horizon. We test TISP-based algorithms in various games, including finitely repeated security games and a grid-world game. The results show that TISP-PG is more scalable than existing mathematical programming-based methods and significantly outperforms other learning-based methods.





1 Introduction

Many real-world problems involve multiple decision-makers interacting strategically. Over the years, a significant amount of work has focused on building game models for these problems and designing computationally efficient algorithms to solve the games [27, 26]. While Nash equilibrium (NE) is a well-accepted solution concept, the players’ behavior prescribed by an NE can be irrational off the equilibrium path: one player can threaten to play a suboptimal action in a future decision point to convince the other players that they would not gain from unilateral deviation. Such “non-credible threats” restrict the practical applicability of these strategies as in the real world, one may make mistakes unexpectedly, and it is hard to enforce such threats. Thus it is important to find strategy profiles such that each player’s strategy is close to optimal (in expectation) from any point onward given the other players’ strategies.

To find such strategy profiles, researchers have proposed equilibrium refinement concepts such as subgame perfect equilibrium and perfect Bayesian equilibrium (PBE) [7] and studied the computational complexity [3, 9, 14]. However, existing methods for computing refined equilibria have limited scalability and often require full access to the game environment, thus can hardly apply to complex games and real-world problems (as detailed in Section 2). On the other hand, deep reinforcement learning (RL) has shown great promise in complex sequential decision-making problems for single-agent and multi-agent settings [24, 29]. Deep RL leverages a compact representation of the game’s state and the players’ action space, making it possible to handle large games that are intractable for non-learning-based methods. Despite the promise, to our knowledge, no prior work has applied deep RL to equilibrium refinements.

In this paper, we focus on two-player stochastic Bayesian games with finite horizon as they can be used to model various long-term strategic interactions with private information [2]. We propose Temporal-Induced Self-Play (TISP), the first RL-based framework to find strategy profiles with decent performances from any decision point onward. There are several crucial challenges in using RL for this task. First, in these games, a player’s action at a decision point should be dependent on the entire history of states and joint actions. As the number of histories grows exponentially, a tabular approach that enumerates all the histories is intractable. Although recurrent neural networks (RNNs) can be used to encode the history, RNNs are typically brittle in training and often fail to capture long-term dependency in complex games. Second, using standard RL algorithms with self-play suffers from limited exploration. Hence, it is extremely hard to improve the performance on rarely visited decision points. Our framework TISP tackles these two challenges jointly. We use a belief-based representation to address the first challenge, so that the policy representation remains constant in size regardless of the number of rounds. Besides, we use backward induction to ensure exploration in training. TISP also uses non-parametric approximation in the belief space. Building upon TISP, we design the TISP-PG approach, which uses policy gradient (PG) for policy learning. TISP can also be combined with other game-solving techniques such as counterfactual regret minimization (CFR) [33].

Further, we prove that TISP-based algorithms can find approximate PBE in zero-sum stochastic Bayesian games with one-sided incomplete information and finite horizon. We evaluate TISP-based algorithms in different games. We first test them in finitely repeated security games with unknown attacker types whose PBE can be approximated through mathematical programming (MP) under certain conditions [26]. Results show that our algorithms can scale to larger games and apply to more general game settings, and the solution quality is much better than other learning-based approaches. We also test the algorithms in a two-step matrix game with a closed-form PBE. Our algorithms can find close-to-equilibrium strategies. Lastly, we test the algorithms in a grid-world game, and the experimental results show that TISP-PG performs significantly better than other methods.

2 Related Work

The study of equilibrium refinements is not new in economics [18]. In addition to the backward induction method for perfect information games, mathematical programming (MP)-based methods [26, 10, 23] have been proposed to compute refined equilibria. However, the MPs used are non-linear and often have an exponential number of variables or constraints, resulting in limited scalability. A few works use iterative methods [11, 19] but they require exponentiation in game tree traversal and full access to the game structure, which limits their applicability to large complex games.

Stochastic Bayesian games have been extensively studied in mathematics and economics [12, 30, 6, 2]. The work in [1] discusses the advantage of type approximation, which approximates the behaviors of agents by types that have already been trained, to reduce complexity in artificial intelligence (AI) research. We focus on equilibrium refinement in these games and provide an RL-based framework.

Various classical multi-agent RL algorithms [21, 16] are guaranteed to converge to an NE. Recent variants [15, 22, 17] leverage the advances in deep learning [24] and have been empirically shown to find well-performing strategies in large-scale games, such as Go [29] and StarCraft [31]. We present an RL-based approach for equilibrium refinement.

Algorithms for solving large zero-sum imperfect information games like Poker [25, 5] need to explicitly reason about beliefs. Many recent algorithms in multi-agent RL use belief-space policies or reason about joint beliefs. These works either assume a fixed set of opponent policies that remain unchanged [28] or consider specific problem domains [27, 32]. Foerster et al. (2019) use a public belief state to find strategies in a fully cooperative partial information game. We consider stochastic Bayesian games and use a belief over opponents’ types.

3 Preliminaries

3.1 One-Sided Stochastic Bayesian Game

For expository purposes, we will mainly focus on what we call one-sided stochastic Bayesian games (OSSBG), which extend finite-horizon two-player stochastic games with type information. In particular, player 1 has a private type that affects the payoff function. Hence, in a competitive setting, player 1 needs to hide this information, while player 2 needs to infer the type from player 1’s actions. Our algorithms can be extended to handle games where both players have types, as we will discuss in Section 6. Formally, an OSSBG is defined by a tuple with the following components: the state space; the initial state distribution; the set of types for player 1; the prior over player 1’s type; the joint action space, composed of the action space of each player $i$; the transition function, which maps a state and a joint action to a distribution over next states, i.e., a point in the probability simplex over the state space; the payoff function for each player $i$ given player 1’s type; the length of the horizon, i.e., the number of rounds; and the discount factor.

One play of an OSSBG starts with a type sampled from the prior and an initial state sampled from the initial state distribution. The rounds of the game are then rolled out one after another. In each round, the players take their actions simultaneously and independently, based on the history of states and joint actions so far, and then receive their payoffs. Note that the payoffs of each round are not revealed until the end of the play, to prevent type information leakage. The state transitions across rounds according to the transition function.

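Consider the set of all possible histories in each round, and let the players’ behavioral strategies map histories (and, for player 1, its type) to distributions over actions. Given the type, a history and the strategy profile, player $i$’s discounted accumulative expected utility from a given round onward is the expectation, under the strategy profile and the transition function, of the discounted sum of player $i$’s per-round payoffs from that round to the end of the horizon.

Similarly, we can define the Q function by additionally conditioning on the joint action taken in the current round.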

An OSSBG can be converted into an equivalent extensive-form game (EFG) with imperfect information where each node in the game tree corresponds to a (type, history) pair (see Appendix F). However, this EFG is exponentially large, and existing methods for equilibrium refinement in EFGs [19] are not suitable due to their limited scalability.

3.2 Equilibrium Concepts in OSSBG

For each player $i$, consider the space of all valid strategies for that player.

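Definition 3.1 ($\epsilon$-NE).

A strategy profile is an $\epsilon$-NE if, for each player $i$ and every history that is visitable under that profile, player $i$ cannot improve its expected utility from that history onward by more than $\epsilon$ through a deviation. Here, a deviation means playing the original strategy until the history is reached and then playing an alternative strategy from that point onward.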

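Definition 3.2 ($\epsilon$-PBE).

A strategy profile is an $\epsilon$-PBE if, for each player $i$ and all histories, whether or not they are visitable under that profile, player $i$ cannot improve its expected utility from that history onward by more than $\epsilon$ through a deviation. Here, a deviation means playing the original strategy until the history is reached and then playing an alternative strategy from that point onward.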

It is straightforward to see that an $\epsilon$-PBE is also an $\epsilon$-NE.

4 Temporal-Induced Self-Play

Our TISP framework (Alg. 1) considers each player as an RL agent and trains them with self-play. Each agent maintains a policy and an estimated value function, which will be updated during training. TISP has four ingredients. It uses belief-based representation (Sec. 4.1), backward induction (Sec. 4.2), policy learning (Sec. 4.3) and belief-space approximation (Sec. 4.4). We discuss the test-time strategy and show that TISP converges to $\epsilon$-PBEs under certain conditions in Sec. 4.5.

4.1 Belief-Based Representation

Instead of representing a policy as a function of the history, we consider player 2’s belief of player 1’s type and represent each policy as a function of the belief and the current state in each round. The belief represents the posterior probability distribution over player 1’s type and can be obtained using Bayes’ rule given player 1’s strategy (Eq. (4)): the updated probability of each type is proportional to its current belief probability multiplied by the probability that player 1 of that type takes the observed action. Each entry of the belief is thus the probability of player 1 being a given type given all its actions up to the current round. This belief-based representation avoids the enumeration of the exponentially many histories. Although it requires training a policy that outputs an action for any input belief in the continuous space, it is possible to use approximation as we show in Sec. 4.4. We can also define the belief-based value function for each agent in each round analogously, as a function of the belief and the current state (Eq. (5)).

The Q-function can be defined similarly. We assume the policies and value functions are parameterized by neural networks.
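As an illustration of the Bayes update described above, the following sketch writes it out under assumed notation. The symbols are introduced here for illustration only and are not taken from the original: $b(\lambda)$ is the current belief that player 1 has type $\lambda$, $s$ is the current state, $a_1$ is player 1's observed action, and $\pi_1(a_1 \mid s, b, \lambda)$ is the probability that a type-$\lambda$ player 1 takes $a_1$.

```latex
% Assumed notation (illustrative, not the paper's): b = current belief,
% b' = next belief, \pi_1 = player 1's type-conditioned policy.
b'(\lambda) \;=\;
  \frac{\pi_1(a_1 \mid s, b, \lambda)\, b(\lambda)}
       {\sum_{\lambda'} \pi_1(a_1 \mid s, b, \lambda')\, b(\lambda')}

% Worked example with two types and a uniform belief b = (0.5, 0.5):
% if the observed action has probability 0.9 under type 1 and 0.3 under type 2,
b'(\text{type 1}) = \frac{0.9 \cdot 0.5}{0.9 \cdot 0.5 + 0.3 \cdot 0.5} = 0.75,
\qquad
b'(\text{type 2}) = 0.25 .
```

The update is only defined when the denominator is positive, i.e., when the observed action has nonzero probability under at least one type.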

4.2 Backward Induction

Standard RL approaches with self-play train policies in a top-down manner: they execute the learning policies from the first round to the last and only learn from the experiences at visited decision points. To find strategy profiles with decent performances from any decision point onward, we use backward induction and train the policies and calculate the value functions in the reverse order of rounds: we start by training the last-round policies for all agents and then calculate the corresponding value functions, then train the policies of the preceding round, and so on.

The benefit of using backward induction is two-fold. In the standard forward approach, one needs to roll out the entire trajectory to estimate the accumulated reward for policy and value learning. In contrast, with backward induction, when training the policies of a given round, we have already obtained the value functions of the following round. Thus, we just need to roll out the policy for 1 round and directly estimate the expected accumulated value using the next-round value function and Eq. (5). Hence, we effectively reduce the original multi-round game into a sequence of 1-round games, which makes the learning much easier.

Another important benefit is that we can uniformly sample all possible combinations of state, belief and type at each round to ensure effective exploration. More specifically, in each round, we can sample a belief and then construct a new game by resetting the environment with a uniformly randomly sampled state and a type sampled from that belief. Implementation-wise, we assume access to an auxiliary function from the environment that produces a new game as described. This function takes two parameters, a round and a belief, as input and produces a new game by drawing a random state from the entire state space with equal probability and a random type according to the belief distribution. This function is an algorithmic requirement for the environment, which is typically feasible in practice. For example, most RL environments provide a reset function that generates a random starting state, so a simple code-level enhancement of this reset function can make existing testbeds compatible with our algorithm. We remark that even with such a minimal environment enhancement requirement, our framework does not utilize the transition information. Hence, our method remains nearly model-free compared to other methods that assume full access to the underlying environment transitions, which is the assumption of most CFR-based algorithms. Furthermore, using a customized reset function is not rare in the RL literature. For example, most automatic curriculum learning algorithms assume a flexible reset function that can reset the state of an environment to a desired configuration.
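The reset requirement described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the environment interface used in the experiments: the function name `sub_reset`, the dictionary-based belief, and the returned dictionary are all hypothetical choices made for this example.

```python
import random


def sub_reset(round_index, belief, states, rng=random):
    """Hypothetical reset helper: start a new 1-round game at `round_index`.

    The state is drawn uniformly from `states`; player 1's type is drawn
    according to `belief`, a dict mapping type -> probability.
    """
    state = rng.choice(states)  # uniform over the whole state space
    types = list(belief)
    weights = [belief[t] for t in types]
    player1_type = rng.choices(types, weights=weights, k=1)[0]
    return {"round": round_index, "state": state, "type": player1_type}


rng = random.Random(0)  # seeded only so that repeated runs match
game = sub_reset(2, {"ally": 0.7, "enemy": 0.3}, ["s0", "s1", "s2"], rng)
print(game)
```

Note that the helper never touches the transition function: it only needs to be able to place the environment in an arbitrary state, which is what keeps the overall procedure nearly model-free.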

4.3 Policy Learning

Each time a new game is produced by the environment’s reset function, we perform 1-round learning with self-play to find the policies for that round. TISP allows different policy learning methods. Here we consider two popular choices, policy gradient and regret matching.

4.3.1 Policy Gradient

The PG method directly takes a gradient step over the expected utility. For notational conciseness, the following equations omit super/subscripts and only distinguish the state and belief at the current round from those at the next round.

Theorem 4.1.

In the belief-based representation, the policy gradient derives as follows:


Compared to the standard policy gradient theorem, Eq. (6) has an additional term (the second term). Intuitively, when the belief space is introduced, the next belief is a function of the current belief and the policy in the current round (Eq. (4)). Thus, a change in the current policy may influence future beliefs, resulting in the second term. The full derivation can be found in Appendix E. We also show in the experiment section that when the second term is ignored, the learned policies can be substantially worse. We refer to this PG variant of the TISP framework as TISP-PG.

4.3.2 Regret Matching

Regret matching is another popular choice for imperfect information games. We take inspiration from Deep CFR [4] and propose another variant of TISP, referred to as TISP-CFR. Specifically, at each training iteration, we track the regret of each action at each state, the current policy, and the Q-function and value function corresponding to that policy. The regret of an action accumulates, over iterations, the difference between its Q-value and the value of the current policy, and the next policy selects each action with probability proportional to the positive part of its regret.

Since the policy can be directly computed from the value function, we only learn the value function here. Besides, most regret-based algorithms require known transition functions, which enable them to reset to any information-set node during training. We use the outcome sampling method [20], which samples a batch of transitions to update the regret. This is similar to the value learning procedure in standard RL algorithms and keeps our algorithm model-free. Although TISP-CFR does not require any gradient computation, it is in practice much slower than TISP-PG since it has to learn an entire value network in every single iteration.
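To make the regret-matching step concrete, here is a minimal tabular sketch for a single state. It implements textbook regret matching and is offered only as an illustration: the function names are hypothetical, and the exact update used by TISP-CFR (which learns values with a network and outcome sampling) may differ in its details.

```python
def regret_matching_policy(cumulative_regret):
    """Play each action in proportion to the positive part of its regret."""
    positive = [max(r, 0.0) for r in cumulative_regret]
    total = sum(positive)
    n = len(cumulative_regret)
    if total <= 0.0:
        # No action has positive regret: fall back to the uniform policy.
        return [1.0 / n] * n
    return [p / total for p in positive]


def update_regret(cumulative_regret, q_values, policy):
    """Accumulate Q(s, a) - V(s), where V(s) is the policy-weighted Q-value."""
    value = sum(p * q for p, q in zip(policy, q_values))
    return [r + (q - value) for r, q in zip(cumulative_regret, q_values)]


regret = [0.0, 0.0]
policy = regret_matching_policy(regret)            # uniform at the start
regret = update_regret(regret, [1.0, 3.0], policy)  # action 1 looks better
policy = regret_matching_policy(regret)
print(regret, policy)
```

Because the policy is a deterministic function of the accumulated regrets, only the value estimates need to be learned, which matches the remark above that no policy gradient is required.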

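Algorithm 1 Temporal-Induced Self-Play
for each round, from the last to the first do
     for each sampled belief point do (run in parallel)
         for each policy learning iteration do
              Initialize the replay buffer
              for each sample in the batch do (in parallel)
                  reset the environment to a random state and a type drawn from the belief;
                  sample the players’ actions from the current policies;
                  get next state and utility from env;
                  add the transition to the replay buffer;
              Update the policy and value networks using the replay buffer;
         store the learned policy and value for this round and belief point
return the learned policies and values for all rounds and belief points;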

4.4 Belief-Space Policy Approximation

A small change to the belief can drastically change the policy. When using a function approximator for the policy, this requires the network to be sensitive to the belief input. We empirically observe that a single neural network often fails to capture the desired belief-conditioned policy. Therefore, we use an ensemble-based approach to tackle this issue.

At each round, we sample a set of belief points from the belief space. For each sampled belief, we use self-play to learn an accurate independent strategy and value over the state space, specifically conditioned on this particular belief input. When querying the policy and value for an arbitrary belief different from the sampled ones, we use a distance-based non-parametric method to approximate the target policy and value. Specifically, we define a distance metric between any two beliefs, and for the query belief we calculate its policy and value from the policies and values learned at the sampled belief points, weighted according to their distance to the query belief (Eq. (7)).

We introduce a density parameter to ensure that the sampled points are dense and have good coverage over the belief space. It is defined as the largest distance between any point in the belief space and its nearest sampled point. A smaller density parameter indicates a denser set of sampled points.

Note that the policy and value networks at different belief points in each round are completely independent and can therefore be trained fully in parallel. Given enough parallel workers, the overall training wall-clock time therefore remains unchanged with policy ensembles. Additional discussions on belief sampling are in Appx. A.
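One simple way to realize a distance-based non-parametric approximation is inverse-distance weighting, sketched below. This is an assumed instantiation chosen for illustration and should not be read as the exact weighting in Eq. (7); the function names and the `eps` clamp are hypothetical.

```python
import math


def l2_distance(b1, b2):
    """Euclidean distance between two belief vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(b1, b2)))


def interpolate(query_belief, sampled_beliefs, sampled_outputs, eps=1e-8):
    """Inverse-distance-weighted average of per-belief outputs.

    `sampled_outputs[k]` is the policy (or value) vector learned at
    `sampled_beliefs[k]`; `eps` avoids division by zero when the query
    coincides with a sampled point.
    """
    weights = [1.0 / max(l2_distance(query_belief, b), eps)
               for b in sampled_beliefs]
    total = sum(weights)
    dim = len(sampled_outputs[0])
    return [sum(w * out[j] for w, out in zip(weights, sampled_outputs)) / total
            for j in range(dim)]


beliefs = [(0.0, 1.0), (1.0, 0.0)]    # two sampled belief points
policies = [[1.0, 0.0], [0.0, 1.0]]   # policy learned at each point
print(interpolate((0.5, 0.5), beliefs, policies))
```

With this choice, a query that coincides with a sampled belief essentially returns the policy learned there, while queries between sampled points blend the neighbouring policies smoothly.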

4.5 Test-time Strategy

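Algorithm 2 Compute Test-Time Strategy
function GetStrategy(history)
     initialize the belief to the prior over player 1’s type
     for each past round in the history do
         update the belief using the state, player 1’s action, and player 1’s training policy with Eq. (4)
     return the policy at the current belief with Eq. (7)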

Note that our algorithm requires computing the precise belief using Eq. (4), which in turn requires knowing the policy of player 1. This might not be feasible at test time when competing against an unknown opponent. Therefore, at test time, we update the belief according to the training policy of player 1, regardless of the actual opponent policy. That is, even though the actions produced by the actual opponent can be completely different from the oracle strategies we learned from training, we still use the oracle policy from training to compute the belief. Note that it is possible to obtain an infeasible belief, i.e., the opponent chooses an action that has zero probability in the oracle strategy. In this case, we simply use a uniform belief instead. The procedure is summarized in Alg. 2. We also theoretically prove in Thm. 4.2 that in the zero-sum case, the strategy produced by Alg. 2 is a bounded approximate PBE and further converges to a PBE with infinite policy learning iterations and sampled beliefs. The proof can be found in Appendix F.
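The test-time belief tracking described above, including the uniform fallback for infeasible beliefs, can be sketched as follows. This is an illustrative sketch only: the `training_policy(player1_type, state, belief)` signature, the list-based belief, and the toy policy are assumptions made for the example.

```python
def bayes_update(belief, likelihoods):
    """One Bayes step; falls back to a uniform belief if it is infeasible."""
    unnormalized = [b * p for b, p in zip(belief, likelihoods)]
    total = sum(unnormalized)
    if total <= 0.0:
        # Observed action has zero probability under the training policy.
        return [1.0 / len(belief)] * len(belief)
    return [u / total for u in unnormalized]


def track_belief(prior, history, training_policy):
    """Replay (state, action) pairs, always using the *training* policy."""
    belief = list(prior)
    for state, action in history:
        likelihoods = [training_policy(t, state, belief)[action]
                       for t in range(len(belief))]
        belief = bayes_update(belief, likelihoods)
    return belief


def toy_policy(player1_type, state, belief):
    # Hypothetical training policy over three actions, ignoring state/belief.
    return [0.9, 0.1, 0.0] if player1_type == 0 else [0.3, 0.7, 0.0]


print(track_belief([0.5, 0.5], [("s0", 0)], toy_policy))
print(track_belief([0.5, 0.5], [("s0", 2)], toy_policy))
```

The second call shows the fallback: action 2 is never played by either type under the toy training policy, so the tracked belief resets to uniform rather than becoming undefined.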

Theorem 4.2.

When the game is zero-sum and the state space is finite, the strategies produced by Alg. 2 form an $\epsilon$-PBE, where $\epsilon$ depends on the distance parameter in belief sampling, the number of iterations in policy learning, and a positive constant associated with the particular algorithm (TISP-PG or TISP-CFR, details in Appendix F). When the distance parameter goes to zero and the number of iterations goes to infinity, the strategy becomes a PBE.

We remark that TISP is nearly-model-free and does not utilize the transition probabilities, which ensures its adaptability.

5 Experiment

We test our algorithms in three sets of games. Since very few previous works in RL focus on equilibrium refinement, we compare our algorithm with self-play PG with an RNN-based policy (referred to as RNN) and provide ablation studies for the TISP-PG algorithm: BPG uses only the belief-based policy without backward induction or belief-space approximation; BI adopts backward induction and the belief-based policy but does not use belief-space approximation. Full experiment details can be found in Appx. D.

5.1 Finitely Repeated Security Game

5.1.1 Game Setting

We consider a finitely repeated simultaneous-move security game, as discussed in [26]. Specifically, this is an extension of a one-round security game obtained by repeating it for $L$ rounds. Each round’s utility function can be seen as a special form of matrix game and remains the same across rounds. In each round, the attacker can choose to attack one position from all positions, and the defender can choose to defend one. The attacker succeeds if the target is not defended. The attacker gets a reward if it successfully attacks the target and a penalty if it fails. Correspondingly, the defender gets a penalty if it fails to defend a place and a reward otherwise. In the zero-sum setting, the payoff of the defender is the negative of the attacker’s. We also adopt a general-sum setting described in [26] where the defender’s payoff is only related to the action it chooses, regardless of the attacker’s type.

5.1.2 Evaluation

We evaluate our solution by calculating the minimum $\epsilon$ such that our solution is an $\epsilon$-PBE. We show the average result of 5 different game instances.
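For intuition about this metric, the sketch below computes the analogous quantity in a single-round two-player matrix game: the smallest $\epsilon$ for which a given strategy pair is an $\epsilon$-equilibrium, i.e., the largest gain either player can obtain by deviating. It is a simplified single-round illustration with hypothetical function names; the evaluation in the experiments additionally ranges over rounds, states and beliefs.

```python
def expected_payoff(payoff, row_strategy, col_strategy):
    """Expected payoff of a bimatrix entry under mixed strategies."""
    return sum(row_strategy[i] * col_strategy[j] * payoff[i][j]
               for i in range(len(row_strategy))
               for j in range(len(col_strategy)))


def min_epsilon(row_payoff, col_payoff, row_strategy, col_strategy):
    """Largest unilateral deviation gain over both players."""
    n_rows, n_cols = len(row_payoff), len(row_payoff[0])
    row_value = expected_payoff(row_payoff, row_strategy, col_strategy)
    col_value = expected_payoff(col_payoff, row_strategy, col_strategy)
    # A best response can always be found among pure actions.
    best_row = max(sum(col_strategy[j] * row_payoff[i][j] for j in range(n_cols))
                   for i in range(n_rows))
    best_col = max(sum(row_strategy[i] * col_payoff[i][j] for i in range(n_rows))
                   for j in range(n_cols))
    return max(best_row - row_value, best_col - col_value)


# Matching pennies as a toy zero-sum example.
row_payoff = [[1.0, -1.0], [-1.0, 1.0]]
col_payoff = [[-1.0, 1.0], [1.0, -1.0]]
print(min_epsilon(row_payoff, col_payoff, [0.5, 0.5], [0.5, 0.5]))
print(min_epsilon(row_payoff, col_payoff, [1.0, 0.0], [0.5, 0.5]))
```

The uniform profile is an exact equilibrium of matching pennies, so no deviation helps, whereas a row player who always plays the first action can be exploited by the column player.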

Table 1: Results for the finitely repeated security game. Lower is better. These results are evaluated with a uniform prior distribution.

(a) Zero-sum result for model-free methods.
Mean     0.881   15.18   101.2   27.51
Worst    1.220   31.81   111.8   42.54

(b) General-sum result for model-free methods.
Mean     0.892   34.62   89.21   83.00
Worst    1.120   57.14   182.1   111.9

(c) Result for known model variants.
         Zero-sum          General-sum
Mean     0.446   0.474     0.608   0.625
Worst    1.041   1.186     1.855   1.985

(d) Zero-sum result.
Mean     1.888   18.20   79.74   40.75
Worst    3.008   28.15   97.67   49.74
Table 2: Comparing mathematical-programming and our methods, i.e., TISP-PG and TISP-CFR, with known model. These results are averaged over 21 starting prior distributions of the attacker.

L          2       4       6       8       10
TISP-PG    0.053   0.112   0.211   0.329   0.473
TISP-CFR   0.008   0.065   0.190   0.331   0.499

L          2       4       6       8       10
TISP-PG    0.120   0.232   0.408   0.599   0.842
TISP-CFR   0.002   0.049   0.285   0.525   0.847

5.1.3 Results

We first experiment with the zero-sum setting, where we have proved that our method converges to an $\epsilon$-PBE. The comparisons are shown in Table 1(a) and 1(c). We use two known-model variants, TISP-PG and TISP-CFR, and a model-free version of TISP-PG in this comparison. TISP-PG achieves the best results, while TISP-CFR also has comparable performance. We note that simply using an RNN or using a belief-space policy performs only slightly better than a random policy.

Then we conduct experiments in general-sum games, with results shown in Table 1(b) and 1(c). We empirically observe that the derived solution has comparable quality with the zero-sum setting. We also compare our methods with the mathematical-programming-based method (MP) in [26], which requires full access to the game transition. The results are shown in Table 2. Although the MP solution achieves superior accuracy when $L$ is small, it quickly runs out of memory (marked as “N/A”) since its time and memory requirements grow exponentially w.r.t. $L$. Again, our TISP variants perform the best among all learning-based methods. We remark that despite the performance gap between our approach and the MP methods, the error on those games unsolvable for MP methods is small compared to the total utility.

In our experiments, we do not observe order-of-magnitude differences in running time between our method and the baselines: TISP-PG and TISP-CFR in the Tagging game use 20 hours with 10M samples in total even with particle-based approximation (200k samples per belief point), while RNN and BPG in the Tagging game take roughly 7 hours with 2M total samples to converge.

Regarding scalability in the number of types, the intrinsic learning difficulty increases as the number of types increases. This is a challenge faced by all the methods. The primary contribution of this paper is a new learning-based framework for PBE. While we primarily focus on the case of 2 types, our approach generalizes naturally to more types, and an additional experiment for 3 types is reported in Table 1(d). Advanced sampling techniques can potentially be utilized for more types, which we leave for future work.

5.2 Exposing Game

5.2.1 Game Setting

We also present a two-step matrix game, which we call Exposing. In this game, player 2 aims to guess the correct type of player 1 in the second round. There are two actions available for player 1 and three actions available for player 2 in each round. Specifically, the three actions for player 2 correspond to guessing that player 1 is type 1, guessing that it is type 2, or not guessing at all. Player 2 receives different rewards for a correct guess, a wrong guess, and not guessing. Player 1 receives a positive reward when player 2 chooses to guess in the second round, regardless of which type player 2 guesses. In this game, player 1 therefore has the incentive to expose its type to encourage player 2 to guess in the second round. We further add a reward for player 1 choosing action 1 in the first round regardless of its type, to give player 1 an incentive not to expose its type. With this reward, a short-sighted player 1 may pursue the first-round reward and forgo the reward in the second round. The payoff matrices for this game are in Appx. D.

The equilibrium strategy for player 2 in the second round w.r.t. different types of player 1 is:

Figure 1: Ground truth (first panel) and approximated (second panel) value of player 1 in the second round of the Exposing game. The horizontal axis corresponds to player 2’s belief. The vertical axis corresponds to the equilibrium value.

In this game, the equilibrium values in the second round for both types of player 1 are highly discontinuous with regard to player 2’s belief, as shown in Fig. 1, which makes the belief-space gradient term in Eq. 6 ineffective. However, the approximation introduced in Sec. 4.4 can serve as a soft representation of the true value function, as shown in Fig. 1. This soft representation provides an approximated gradient in the belief space, which allows for belief-space exploration. We will show later that this belief-space exploration cannot be achieved otherwise.

Table 3: Detailed first round policy in Exposing game. TISP-PG is the only algorithm that separates the two player-1 types’ strategies and yields a result very close to the optimal solution (the optimal solution refers to the Pareto-optimal PBE in this game; there are two symmetric optimal solutions, and only one is shown here for easy comparison and simplicity).

Method                                      P1’s type   Action 1   Action 2   Reward
TISP-PG                                     type 1      0.985      0.015      5.985
                                            type 2      0.258      0.742      5.258
TISP-CFR                                    type 1      1.000      0.000      1.000
                                            type 2      1.000      0.000      1.000
TISP-PG (no belief-space gradient term)     type 1      0.969      0.031      0.969
                                            type 2      0.969      0.031      0.969
Optimal                                     type 1      1.000      0.000      6.000
                                            type 2      0.333      0.667      5.333

5.2.2 Results

We compare the training results of TISP-PG and TISP-CFR to exhibit the effectiveness of our non-parametric approximation. We further add an ablation study that removes the belief-space gradient term in TISP-PG (the second TISP-PG entry in Table 3). The obtained policies are shown in Table 3. We can see that the full TISP-PG is the only algorithm that successfully escapes from the basin area in Fig. 1, as it is the only algorithm that is capable of belief-space exploration. The training curves can be found in Appx. D. Note that the policies trained with TISP-CFR and the TISP-PG ablation are also close to a PBE, in which player 1 takes action 1 regardless of its type in the first round and player 2 chooses not to guess in the second round, although this PBE is not Pareto-optimal.

5.3 Tagging Game

Figure 2: An illustration of the Tagging game.

5.3.1 Game Setting

We test our method in a gridworld game, Tagging, as illustrated in Fig. 2. The game is inspired by [28]. Specifically, the game is played on a square grid, and player 1 has two types, i.e., ally and enemy, each of which corresponds to a unique target place. Each player receives a distance-based reward that encourages it to move towards its target place. There is a river in the top half of the grid, which player 2 cannot enter. Player 1 starts from the bottom middle of the map and player 2 starts from a random position below the river. Both players can choose to move by one cell in one of the four directions, [up(U), down(D), left(L), right(R)]. Player 2 has an additional action, tag, to tag player 1 as the enemy type. The tag action is only available when player 1 has not entered the river and the Euclidean distance between the two players is below a threshold. Player 1 (the attacker) gets higher rewards for getting closer to its type-specific target, and player 2 (the defender) gets higher rewards for getting closer to the attacker. Moreover, if the defender chooses to tag, it receives one reward if the attacker is of the enemy type and a different reward if the attacker is of the ally type, while the attacker receives a fixed reward for being tagged no matter what its type is. More detail is provided in Appx. D.

Based on the game’s rules, an enemy-typed player 1 is likely to go to its own target immediately to get a higher reward. However, such a strategy reveals its type straight away, which can be punished by being tagged. A cleverer strategy for an enemy-typed player 1 is to mimic the behavior of the ally-typed player 1 in most situations and never let player 2’s belief of the enemy type exceed the level at which tagging becomes worthwhile, so that player 2 has no incentive to tag it. The ally-typed player 1 may simply go up in most states in order to get closer to its target for a reward, while player 2 should learn to tag player 1 if its belief of the enemy type is high enough and try to move closer to player 1 in other cases.

5.3.2 Evaluation

Although it is computationally intractable to calculate the exact exploitability in this gridworld game, we examine the results following [13] by evaluating the performance in induced games. We choose 256 induced games among all the induced games after the second round and check their exploitability. We obtain the best response in each induced game in two ways. First, in the first round of each game, we enumerate all the possible actions and manually choose the best action. Second, we train a new BPG player 2 from scratch. This new BPG-based player is trained in a single-agent-like environment and does not need to consider changes in the player-1 policy. We check how much reward the agents can get after the learning process finishes. A high reward for player 2 would indicate that player 1 fails to hide its type. We also provide the detailed strategies from different baselines for the first round of the game for additional insight.

Method    P1's Type  U      D      R      L
TISP-PG   Ally       0.839  0.001  0.001  0.159
TISP-PG   Enemy      0.932  0.001  0.001  0.066
RNN       Ally       0.248  0.237  0.238  0.274
RNN       Enemy      0.599  0.067  0.077  0.255
BPG       Ally       0.000  0.000  0.000  1.000
BPG       Enemy      1.000  0.000  0.000  0.000
TISP-CFR  Ally       0.000  0.000  0.000  1.000
TISP-CFR  Enemy      1.000  0.000  0.000  0.000
Table 4: The policy at one of the starting states in Tagging game, where player 2 is two cells above the fixed starting point of player 1.
P2 reward -1.90 -1.67 -0.98 -1.82
P1 reward (ally) -2.55 -2.87 -3.26 -3.17
P1 reward (enemy) -2.41 -2.71 -9.29 -4.49
Table 5: The average exploitability result of 256 induced games in the Tagging game. The lower player 2’s reward, the better the algorithm.

5.3.3 Results

We show the derived policy at the very first step in Table 4. The policy learned by our method successfully keeps the belief below the threshold while keeping a large probability of going to the target of each type. The RNN policy shows no preference between different actions, so it avoids being tagged but also fails to earn a larger reward for getting closer to the target. The BPG policy simply goes straight towards the target and is therefore punished by being tagged. The exploitability results are shown in Table 5. Judging by the training reward achieved by the new exploiter player 2, TISP-PG performs the best among all baselines, and TISP-CFR also produces a robust player-1 policy. We remark that the relative performance of the different methods in the gridworld game is consistent with what we previously observed in the finitely repeated security games, which further validates the effectiveness of our approach.
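To make the type-hiding argument concrete, one can apply Bayes' rule to player 2's belief after observing player 1's first move. The sketch below does this with the action probabilities from Table 4. The uniform prior of 0.5 and the helper name `posterior_enemy` are assumptions made only for this illustration and are not taken from the paper.

```python
def posterior_enemy(prior_enemy, p_action_given_enemy, p_action_given_ally):
    """Return P(enemy | action) by Bayes' rule for a two-type player 1."""
    joint_enemy = prior_enemy * p_action_given_enemy
    joint_ally = (1.0 - prior_enemy) * p_action_given_ally
    total = joint_enemy + joint_ally
    if total == 0.0:
        # The action has zero probability under both types; keep the prior.
        return prior_enemy
    return joint_enemy / total


# First-step action probabilities in the order (U, D, R, L), copied from Table 4.
TABLE_4 = {
    "TISP-PG": {"ally": [0.839, 0.001, 0.001, 0.159],
                "enemy": [0.932, 0.001, 0.001, 0.066]},
    "BPG": {"ally": [0.0, 0.0, 0.0, 1.0],
            "enemy": [1.0, 0.0, 0.0, 0.0]},
}

PRIOR_ENEMY = 0.5  # hypothetical prior, chosen only for this illustration

for method, policy in TABLE_4.items():
    beliefs = [posterior_enemy(PRIOR_ENEMY, e, a)
               for e, a in zip(policy["enemy"], policy["ally"])]
    print(method, [round(b, 3) for b in beliefs])
```

Under this assumed prior, the most likely TISP-PG action (U) moves the belief in the enemy type only slightly above 0.5, whereas a single BPG action reveals the type completely.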

6 Discussion and Conclusion

We proposed TISP, an RL-based framework to find strategies with a decent performance from any decision point onward. We provided theoretical justification and empirically demonstrated its effectiveness. Our algorithms can be easily extended to a two-sided stochastic Bayesian game. The TISP framework still applies, and the only major modification needed is to add a for-loop to sample belief points for player 2. This extension will cost more computing resources, but the networks can be trained fully in parallel. The full version of the extended algorithm is in Appendix C.


We would like to thank Ryan Shi for some help in writing the early workshop version of this paper. Co-author Fang is supported in part by NSF grant IIS-1850477, a research grant from Lockheed Martin, and by the U.S. Army Combat Capabilities Development Command Army Research Laboratory Cooperative Agreement Number W911NF-13-2-0045 (ARL Cyber Security CRA). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the funding agencies.


  • [1] S. V. Albrecht, J. W. Crandall, and S. Ramamoorthy (2016) Belief and truth in hypothesised behaviours. Artificial Intelligence. Cited by: §2.
  • [2] S. V. Albrecht and S. Ramamoorthy (2013) A game-theoretic model and best-response learning method for ad hoc coordination in multiagent systems. In AAMAS, Cited by: §1, §2.
  • [3] B. An, M. Tambe, F. Ordonez, E. Shieh, and C. Kiekintveld (2011) Refinement of strong stackelberg equilibria in security games. In AAAI, Cited by: §1.
  • [4] N. Brown, A. Lerer, S. Gross, and T. Sandholm (2019) Deep counterfactual regret minimization. ICML. Cited by: §4.3.2.
  • [5] N. Brown and T. Sandholm (2018) Superhuman ai for heads-up no-limit poker: libratus beats top professionals. Science. Cited by: §2.
  • [6] M. Chandrasekaran, Y. Chen, and P. Doshi (2017) On markov games played by bayesian and boundedly-rational players. In AAAI, Cited by: §2.
  • [7] I. Cho and D. M. Kreps (1987) Signaling games and stable equilibria. QJE. Cited by: §1.
  • [8] K. Cho, B. V. Merrienboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. In EMNLP, Cited by: Appendix B.
  • [9] K. Etessami, K. A. Hansen, P. B. Miltersen, and T. B. Sørensen (2014) The complexity of approximating a trembling hand perfect equilibrium of a multi-player game in strategic form. In SAGT, Cited by: §1.
  • [10] G. Farina and N. Gatti (2017) Extensive-form perfect equilibrium computation in two-player games. In AAAI, Cited by: §2.
  • [11] G. Farina, C. Kroer, and T. Sandholm (2017) Regret minimization in behaviorally-constrained zero-sum games. In ICML, Cited by: §2.
  • [12] F. Forges (1992) Repeated games of incomplete information: non-zero-sum. Handbook of Game Theory with Economic Applications. Cited by: §2.
  • [13] J. Gray, A. Lerer, A. Bakhtin, and N. Brown (2021) Human-level performance in no-press diplomacy via equilibrium search. ICLR. Cited by: §5.3.2.
  • [14] K. A. Hansen and T. B. Lund (2018) Computational complexity of proper equilibrium. In EC, Cited by: §1.
  • [15] J. Heinrich, M. Lanctot, and D. Silver (2015) Fictitious self-play in extensive-form games. In ICML, Cited by: §2.
  • [16] J. Hu, M. P. Wellman, et al. (1998) Multiagent reinforcement learning: theoretical framework and an algorithm.. In ICML, Cited by: §2.
  • [17] S. Iqbal and F. Sha (2019) Actor-attention-critic for multi-agent reinforcement learning. In ICML, Cited by: §2.
  • [18] D. M. Kreps and R. Wilson (1982) Sequential equilibria. Econometrica. Cited by: §2.
  • [19] C. Kroer, G. Farina, and T. Sandholm (2017) Smoothing method for approximate extensive-form perfect equilibrium. arXiv. Cited by: §2, §3.1.
  • [20] M. Lanctot, K. Waugh, M. Zinkevich, and M. Bowling (2009) Monte carlo sampling for regret minimization in extensive games. In NIPS, Cited by: §4.3.2.
  • [21] M. L. Littman (1994) Markov games as a framework for multi-agent reinforcement learning. In Machine learning proceeding, Cited by: §2.
  • [22] R. Lowe, Y. I. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. In NIPS, Cited by: §2.
  • [23] P. B. Miltersen and T. B. Sørensen (2010) Computing a quasi-perfect equilibrium of a two-player game. Economic Theory. Cited by: §2.
  • [24] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, et al. (2015) Human-level control through deep reinforcement learning. Nature. Cited by: §1, §2.
  • [25] M. Moravčík, M. Schmid, N. Burch, V. Lisỳ, D. Morrill, N. Bard, T. Davis, et al. (2017) Deepstack: expert-level artificial intelligence in heads-up no-limit poker. Science. Cited by: §2.
  • [26] T. H. Nguyen, Y. Wang, A. Sinha, and M. P. Wellman (2019) Deception in finitely repeated security games. AAAI. External Links: Document, Link Cited by: §1, §1, §2, §5.1.1, §5.1.3.
  • [27] J. Serrino, M. Kleiman-Weiner, D. C. Parkes, and J. Tenenbaum (2019) Finding friend and foe in multi-agent games. In Neurips, Cited by: §1, §2.
  • [28] M. Shen and J. P. How (2019) Robust opponent modeling via adversarial ensemble reinforcement learning in asymmetric imperfect-information games. arXiv. Cited by: §2, §5.3.1.
  • [29] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, et al. (2018) A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science. Cited by: §1, §2.
  • [30] S. Sorin (2003) Stochastic games with incomplete information. In Stochastic Games and applications, Cited by: §2.
  • [31] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, et al. (2019) Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature. Cited by: §2.
  • [32] M. Woodward, C. Finn, and K. Hausman (2020) Learning to interactively learn and assist. AAAI. Cited by: §2.
  • [33] M. Zinkevich, M. Johanson, M. Bowling, and C. Piccione (2007) Regret minimization in games with incomplete information. NIPS. Cited by: §1.
  • [34] M. Zinkevich, M. Johanson, M. Bowling, and C. Piccione (2008) Regret minimization in games with incomplete information. In NeurIPS, Cited by: §F.3.
  • [35] M. Zinkevich (2003) Online convex programming and generalized infinitesimal gradient ascent. In ICML, Cited by: §F.3.

Appendix A Discussion on Belief Sampling

Our algorithm generally requires that every possible belief lies within a given distance of some sampled belief point, where this distance is a hyperparameter. In a one-dimensional belief space (two types), the most efficient way to achieve this is to place the sampled beliefs at equal distances. This ensures that the samples are never too far from any point at which the policy may change. With more sampled beliefs, the belief-space approximation becomes more precise, but it also requires more training resources. An alternative is adaptive sampling based on the quality of the obtained policies, which, however, requires sequential execution at different belief points and thus restricts the level of parallelism. The complete training framework is summarized in Algo. 
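As an illustration of equal-distance sampling in the two-type case, the following sketch builds a grid on [0, 1] so that every belief is strictly closer than a chosen distance to some sample. The names `equal_distance_beliefs`, `worst_case_distance`, and `max_gap` are hypothetical; this is a minimal sketch of the idea, not the authors' implementation.

```python
import math


def equal_distance_beliefs(max_gap):
    """Evenly spaced beliefs in [0, 1] so every belief is within max_gap of one."""
    if max_gap <= 0:
        raise ValueError("max_gap must be positive")
    # With k equal intervals the worst-case distance to a sample is 1 / (2k),
    # so the strict "less than max_gap" requirement needs k > 1 / (2 * max_gap).
    intervals = math.floor(1.0 / (2.0 * max_gap)) + 1
    return [i / intervals for i in range(intervals + 1)]


def worst_case_distance(points, resolution=10_000):
    """Largest distance from any belief on a fine test grid to its nearest sample."""
    return max(
        min(abs(q / resolution - p) for p in points)
        for q in range(resolution + 1)
    )


points = equal_distance_beliefs(0.1)
print(len(points), worst_case_distance(points) < 0.1)
```

A smaller `max_gap` yields more belief points, which matches the trade-off between approximation precision and training cost described above.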


Appendix B Implementation Details

Sampling New Games in Backward Induction

Implementation-wise, we assume access to an auxiliary function provided by the environment. This function takes two parameters, a round and a belief, as input and produces a new game by drawing a random state from the entire state space with equal probability and a random type according to the belief distribution. This function is an algorithmic requirement on the environment, which is typically feasible in practice. For example, most RL environments provide a reset function that generates a random starting state, so a simple code-level enhancement of this reset function can make existing testbeds compatible with our algorithm. We remark that even with this minimal environment requirement, our framework does NOT use the transition information. Hence, our method remains nearly model-free compared to methods that assume full access to the underlying environment transitions, which is the assumption of most CFR-based algorithms.
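A minimal sketch of such an auxiliary function is given below, assuming the state space is small enough to enumerate as a list. The name `sample_subgame`, its arguments, and the returned dictionary layout are all hypothetical and stand in for whatever interface a concrete environment exposes.

```python
import random


def sample_subgame(round_idx, belief, states, rng=random):
    """Draw the start of a sub-game at a given round (illustrative only).

    round_idx: index of the round the sub-game starts from
    belief:    list of type probabilities for player 1 (must sum to 1)
    states:    list of all environment states (assumed enumerable here)
    rng:       the random module or a random.Random instance
    """
    if abs(sum(belief) - 1.0) > 1e-8:
        raise ValueError("belief must be a probability distribution")
    # Uniform draw over the whole state space; no transition model is used.
    state = rng.choice(states)
    # Draw player 1's type according to the belief distribution.
    type_idx = rng.choices(range(len(belief)), weights=belief, k=1)[0]
    return {"round": round_idx, "state": state, "type": type_idx,
            "belief": list(belief)}


grid_states = [(x, y) for x in range(3) for y in range(3)]
print(sample_subgame(2, [0.7, 0.3], grid_states, random.Random(0)))
```

The function only needs the ability to place the environment in an arbitrary state, which is why it can usually be layered on top of an existing reset routine.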


We use Gated Recurrent Unit networks [8] to encode the state-action history and perform a top-down level self-play.


1:function Training
2:     for  do
3:         for  do This loop can run in fully parallel.
4:              Initialize the supervise set
5:              for  do
6:                  ;
7:                  ;
8:                  ;
9:                  if not done then ;                   
10:                   PGUpdate(states, acts, rews);               
11:              Initialize the supervise set
12:              for  do
13:                  ;
14:                  ;
15:                  ;
16:                  if not done then
17:                       ;                   
18:                  ;               
19:              ;               
20:     return All groups ;
Algorithm 3 TISP-PG

We implement our method TISP-PG as shown above. There are two ways to implement the attacker: we can either use a separate network for each type or use one network for all types and feed the type as an additional input dimension. In this paper, we use a separate network for each type.


The pseudo-code of our TISP-CFR is shown here:

1:function Training
2:     for  do
3:         for  do This loop can run in fully parallel.
4:              Initialize the supervise set
5:              for  do
6:                  ;
7:                  ;
8:                  ;
9:                  if not done then
10:                       ;                   
11:                  ;               
12:              Update using D;
13:              Calculate using regret matching;               
14:     return ;
Algorithm 4 TISP-CFR
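The regret-matching step named in Algorithm 4 can be illustrated with the standard rule: play each action in proportion to its positive cumulative regret, and fall back to the uniform strategy when no regret is positive. The function name and the list-based representation are assumptions for this sketch, and how the regrets are accumulated from the supervise set is not shown.

```python
def regret_matching(cumulative_regrets):
    """Map cumulative regrets to a strategy proportional to positive regret."""
    positive = [max(r, 0.0) for r in cumulative_regrets]
    total = sum(positive)
    if total <= 0.0:
        # No action has positive regret: use the uniform strategy.
        n = len(cumulative_regrets)
        return [1.0 / n] * n
    return [p / total for p in positive]


print(regret_matching([1.0, 3.0, -2.0]))  # -> [0.25, 0.75, 0.0]
```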

Appendix C Extension algorithm to two-sided Bayesian game

1:function Training
2:     for  do
3:         for  do
4:              for  do This loop can run in fully parallel.
5:                  Initialize the supervise set
6:                  for  do
7:                       ;
8:                       ;
9:                       ;
10:                       if not done then
11:                           ;                        
12:                       ;                   
13:                  Update using D;
14:                  Update ;                             
15:     return ;
Algorithm 5 Temporal-induced Self-Play for two-sided Stochastic Bayesian games

Our algorithm can be extended to two-sided stochastic Bayesian games, with the learning algorithm shown above. Specifically, we now perform belief approximation for both players and compute all the policies. This means that in all the equations in the paper, e.g., Eqs. 4 and 5, the belief vector should now consist of the belief points of both sides. In other words, if the belief vector is taken to carry the beliefs over the types of both players, the equations stay the same.
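One way to picture the additional loop over player 2's belief points is as a Cartesian product of the two players' belief grids. The sketch below is illustrative only; `joint_belief_points` and the five-point grid are hypothetical choices.

```python
from itertools import product


def joint_belief_points(points_p1, points_p2):
    """All pairs (b1, b2) of sampled belief points for the two players."""
    return list(product(points_p1, points_p2))


grid = [i / 4 for i in range(5)]  # hypothetical one-dimensional belief grid
pairs = joint_belief_points(grid, grid)
print(len(pairs))  # -> 25
```

The number of joint belief points is the product of the two grid sizes, which is the source of the extra computing cost, while each pair can still be trained independently.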

Appendix D Experiment Details

D.1 Computing Infrastructure

All the experiments on Security game and Exposing game are conducted on a laptop with 12 core Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz, 16 Gigabytes of RAM and Ubuntu 20.04.1 LTS. All the experiments on Tagging game are conducted on a server with two 64 core AMD EPYC 7742 64-Core Processor @ 3.40GHz, 64 Gigabytes of RAM and Linux for Server 4.15.0.

1 0.034 0.003
2 0.053 0.008
3 0.083 0.030
4 0.112 0.065
5 0.162 0.117
6 0.211 0.190
7 N/A 0.267 0.251
8 N/A 0.329 0.331
9 N/A 0.448 0.459
10 N/A 0.473 0.499
Table 6: Comparing mathematical-programming and our methods, i.e., TISP-PG and TISP-CFR, with known model and . These results are averaged over 21 starting prior distributions of the attacker ().

D.2 Security Game

We have also conducted the experiments in an easier setting of , and the results are shown in Table 6.

D.3 Exposing Game

[Round 1] [Type 1]   [Type 2]
[Round 2] [Type 1]   [Type 2]

Figure 3: Payoff matrices for Exposing game. The first number in the tuple indicates the payoff for player 1.

The payoff matrices are shown in Fig. 3.

The training curves for the three algorithms are shown in Fig. 4. In the latter two plots, the strategies for both types of player 1 stay the same throughout the training process. This is why there is only one curve visible.

Figure 4: Training curves for the first round of Exposing.

D.4 Tagging Game

The target for the ally type is in the upper left corner, and the target for the enemy type is in the upper right corner. The reward for each player has two components: a distance component, which rewards the player for getting close to its own target, and a tag component, which is non-zero only when player 2 takes the tag action. For player 1, the distance component is determined by the Euclidean distance between player 1 and the target of its type, and the tag component is a penalty if player 2 tags player 1 in that round and zero if player 2 does not tag in that round. Note that player 1 receives this penalty even if it is actually of the ally type. For player 2, the distance component is determined by the Euclidean distance between the two players, and the tag component takes one value if player 1 is of the ally type and is tagged in that round, and another if player 1 is of the enemy type and is tagged in that round. The prior distribution of ally type and enemy type is . The tag action can be used multiple times during the game. We set an episode length limit of in our experiments. Although this many steps may not be enough for both players to reach their targets, the environment does not give any large reward for reaching the target. We assume that a player learns its reward only when the game ends, which prevents player 2 from inferring player 1's type by looking at its reward immediately after taking a tag action.
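The two-part reward structure can be sketched as follows. The exact formulas and constants are not recoverable from the text, so this sketch assumes that the two components are added and that the distance component is a negative multiple of the Euclidean distance. All coefficient names (`dist_coef`, `tag_penalty`, `tag_enemy_reward`, `tag_ally_reward`) and the values in the usage line are placeholders, not the paper's settings.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two grid cells given as (x, y) pairs."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def player1_reward(p1_pos, target_pos, tagged, *, dist_coef, tag_penalty):
    """Distance component plus tag component for player 1 (assumed additive)."""
    r_dist = -dist_coef * euclidean(p1_pos, target_pos)  # closer is better
    r_tag = -tag_penalty if tagged else 0.0              # same for both types
    return r_dist + r_tag


def player2_reward(p1_pos, p2_pos, tagged, p1_is_enemy, *, dist_coef,
                   tag_enemy_reward, tag_ally_reward):
    """Distance component plus type-dependent tag component for player 2."""
    r_dist = -dist_coef * euclidean(p1_pos, p2_pos)
    if not tagged:
        r_tag = 0.0
    else:
        r_tag = tag_enemy_reward if p1_is_enemy else tag_ally_reward
    return r_dist + r_tag


# Placeholder coefficients, for illustration only.
print(player1_reward((0, 0), (3, 4), True, dist_coef=1.0, tag_penalty=5.0))
```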

Appendix E Changes to Policy Gradient Corresponding to Belief-space Policy


We demonstrate the proof for player 1. The proof for player 2 should be basically the same with some minor modifications.


Let . Then we have


where . Now we expand the term :



. Together with (12), we can rewrite (10) to:


With policy gradient theorem, we have: