1 Introduction
Many real-world problems involve multiple decision-makers interacting strategically. Over the years, a significant amount of work has focused on building game models for these problems and designing computationally efficient algorithms to solve the games [27, 26]. While Nash equilibrium (NE) is a well-accepted solution concept, the players’ behavior prescribed by an NE can be irrational off the equilibrium path: one player can threaten to play a suboptimal action at a future decision point to convince the other players that they would not gain from unilateral deviation. Such “non-credible threats” restrict the practical applicability of these strategies because, in the real world, players may make unexpected mistakes and such threats are hard to enforce. Thus it is important to find strategy profiles such that each player’s strategy is close to optimal (in expectation) from any point onward given the other players’ strategies.
To find such strategy profiles, researchers have proposed equilibrium refinement concepts such as subgame perfect equilibrium and perfect Bayesian equilibrium (PBE) [7] and studied the computational complexity [3, 9, 14]. However, existing methods for computing refined equilibria have limited scalability and often require full access to the game environment, and thus can hardly be applied to complex games and real-world problems (as detailed in Section 2). On the other hand, deep reinforcement learning (RL) has shown great promise in complex sequential decision-making problems for single-agent and multi-agent settings [24, 29]. Deep RL leverages a compact representation of the game’s state and the players’ action space, making it possible to handle large games that are intractable for non-learning-based methods. Despite the promise, to our knowledge, no prior work has applied deep RL to equilibrium refinements.
In this paper, we focus on two-player stochastic Bayesian games with finite horizon, as they can be used to model various long-term strategic interactions with private information [2]. We propose Temporal-Induced Self-Play (TISP), the first RL-based framework to find strategy profiles with decent performance from any decision point onward. There are several crucial challenges in using RL for this task. First, in these games, a player’s action at a decision point should depend on the entire history of states and joint actions. As the number of histories grows exponentially, a tabular approach that enumerates all the histories is intractable. Although recurrent neural networks (RNNs) can be used to encode the history, RNNs are typically brittle in training and often fail to capture long-term dependencies in complex games. Second, using standard RL algorithms with self-play suffers from limited exploration. Hence, it is extremely hard to improve the performance on rarely visited decision points. Our framework TISP tackles these two challenges jointly. We use a belief-based representation to address the first challenge, so that the policy representation remains constant in size regardless of the number of rounds. Besides, we use backward induction to ensure exploration in training. TISP also uses non-parametric approximation in the belief space. Building upon TISP, we design the TISP-PG approach, which uses policy gradient (PG) for policy learning. TISP can also be combined with other game-solving techniques such as counterfactual regret minimization (CFR) [33]. Further, we prove that TISP-based algorithms can find approximate PBE in zero-sum stochastic Bayesian games with one-sided incomplete information and finite horizon.
We evaluate TISP-based algorithms in different games. We first test them in finitely repeated security games with unknown attacker types, whose PBE can be approximated through mathematical programming (MP) under certain conditions [26]. Results show that our algorithms can scale to larger games and apply to more general game settings, and the solution quality is much better than that of other learning-based approaches. We also test the algorithms in a two-step matrix game with a closed-form PBE. Our algorithms can find close-to-equilibrium strategies. Lastly, we test the algorithms in a grid-world game, and the experimental results show that TISP-PG performs significantly better than other methods.
2 Related Work
The study of equilibrium refinements is not new in economics [18]. In addition to the backward induction method for perfect-information games, mathematical programming (MP)-based methods [26, 10, 23] have been proposed to compute refined equilibria. However, the MPs used are non-linear and often have an exponential number of variables or constraints, resulting in limited scalability. A few works use iterative methods [11, 19] but they require exponentiation in game tree traversal and full access to the game structure, which limits their applicability to large complex games.
Stochastic Bayesian games have been extensively studied in mathematics and economics [12, 30, 6, 2]. [1] discussed the advantage of using type approximation, i.e., approximating the behaviors of agents by those that have already been trained, to reduce complexity in artificial intelligence (AI) research. We focus on equilibrium refinement in these games and provide an RL-based framework.
Various classical multi-agent RL algorithms [21, 16] are guaranteed to converge to an NE. Recent variants [15, 22, 17] leverage the advances in deep learning [24] and have been empirically shown to find well-performing strategies in large-scale games, such as Go [29] and StarCraft [31]. We present an RL-based approach for equilibrium refinement.
Algorithms for solving large zero-sum imperfect-information games like Poker [25, 5] need to explicitly reason about beliefs. Many recent algorithms in multi-agent RL use belief-space policies or reason about joint beliefs. These works assume a fixed set of opponent policies that are unchanged [28], or consider specific problem domains [27, 32]. Foerster et al. (2019) use public belief states to find strategies in a fully cooperative partial-information game. We consider stochastic Bayesian games and use beliefs over opponents’ types.
3 Preliminaries
3.1 One-Sided Stochastic Bayesian Game
For expository purposes, we will mainly focus on what we call one-sided stochastic Bayesian games (OSSBG), which extend finite-horizon two-player stochastic games with type information. In particular, player 1 has a private type that affects the payoff function. Hence, in a competitive setting, player 1 needs to hide this information, while player 2 needs to infer the type from player 1’s actions. Our algorithms can be extended to handle games where both players have types, as we will discuss in Section 6. Formally, an OSSBG is defined by a tuple $\Gamma = \langle \Omega, \mu^0, \Lambda, p^0, \mathcal{A}, P, \{u_i^\lambda\}, L, \gamma \rangle$. $\Omega$ is the state space. $\mu^0$ is the initial state distribution. $\Lambda$ is the set of types for player 1. $p^0$ is the prior over player 1’s type. $\mathcal{A} = \mathcal{A}_1 \times \mathcal{A}_2$ is the joint action space with $\mathcal{A}_i$ the action space for player $i$. $P: \Omega \times \mathcal{A} \to \Delta_{|\Omega|}$ is the transition function, where $\Delta_k$ represents the $k$-dimensional probability simplex. $u_i^\lambda: \Omega \times \mathcal{A} \to \mathbb{R}$ is the payoff function for player $i$ given player 1’s type $\lambda$. $L$ denotes the length of the horizon or number of rounds. $\gamma$ is the discount factor.
One play of an OSSBG starts with a type $\lambda$ sampled from $p^0$ and an initial state $s^0$ sampled from $\mu^0$. Then, $L$ rounds of the game are rolled out. In round $l$, the players take actions $a_1^l \in \mathcal{A}_1$ and $a_2^l \in \mathcal{A}_2$ simultaneously and independently, based on the history $h^l = \{s^0, (a_1^0, a_2^0), \ldots, (a_1^{l-1}, a_2^{l-1}), s^l\}$. The players then get payoffs $u_i^\lambda(s^l, a_1^l, a_2^l)$. Note that the payoffs of each round are not revealed until the end of each play of the game, to prevent type information leakage. The state transitions according to $P(s^{l+1} \mid s^l, a_1^l, a_2^l)$ across rounds.
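Let $\mathcal{H}^l$ denote the set of all possible histories in round $l$ and $\mathcal{H} = \bigcup_l \mathcal{H}^l$. Let $\pi_1: \Lambda \times \mathcal{H} \to \Delta_{|\mathcal{A}_1|}$ and $\pi_2: \mathcal{H} \to \Delta_{|\mathcal{A}_2|}$ be the players’ behavioral strategies. Given the type $\lambda$, the history $h^l$ and the strategy profile $\pi = (\pi_1, \pi_2)$, player $i$’s discounted accumulative expected utility from round $l$ onward is
$$V_i^\lambda(\pi, h^l) = \sum_{a_1, a_2} \pi_1(a_1 \mid \lambda, h^l)\, \pi_2(a_2 \mid h^l) \Big( u_i^\lambda(s^l, a_1, a_2) + \gamma \sum_{s'} P(s' \mid s^l, a_1, a_2)\, V_i^\lambda\big(\pi, h^l \cup \{(a_1, a_2), s'\}\big) \Big). \quad (1)$$
Similarly, we can define the Q function by $Q_i^\lambda(\pi, h^l, a) = u_i^\lambda(s^l, a) + \gamma\, \mathbb{E}_{s'}\big[ V_i^\lambda(\pi, h^l \cup \{a, s'\}) \big]$, where $a = (a_1, a_2)$ is the joint action.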
An OSSBG can be converted into an equivalent extensive-form game (EFG) with imperfect information where each node in the game tree corresponds to a (type, history) pair (see Appendix F). However, this EFG is exponentially large, and existing methods for equilibrium refinement in EFGs [19] are not suitable due to their limited scalability.
3.2 Equilibrium Concepts in OSSBG
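Let $\Pi_i$ denote the space of all valid strategies for player $i$.
Definition 3.1 ($\epsilon$-NE).
A strategy profile $\pi = (\pi_1, \pi_2)$ is an $\epsilon$-NE if for $i = 1, 2$ and all visitable histories $h^l$ corresponding to the final policy,
$$\max_{\pi_i' \in \Pi_i} V_i^\lambda\big(\pi_i|_{h^l} \to \pi_i', \pi_{-i}, h^l\big) - V_i^\lambda(\pi, h^l) \le \epsilon, \quad (2)$$
where $\pi_i|_{h^l} \to \pi_i'$ means playing $\pi_i$ until $h^l$ is reached, then playing $\pi_i'$ onwards.
Definition 3.2 ($\epsilon$-PBE).
A strategy profile $\pi = (\pi_1, \pi_2)$ is an $\epsilon$-PBE if for $i = 1, 2$ and all histories $h^l$,
$$\max_{\pi_i' \in \Pi_i} V_i^\lambda\big(\pi_i|_{h^l} \to \pi_i', \pi_{-i}, h^l\big) - V_i^\lambda(\pi, h^l) \le \epsilon, \quad (3)$$
where $\pi_i|_{h^l} \to \pi_i'$ means playing $\pi_i$ until $h^l$ is reached, then playing $\pi_i'$ onwards.
Since the condition for an $\epsilon$-PBE must hold at all histories rather than only at visitable ones, it is straightforward that an $\epsilon$-PBE is also an $\epsilon$-NE.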
4 Temporal-Induced Self-Play
Our TISP framework (Alg. 1) considers each player as an RL agent and trains them with self-play. Each agent maintains a policy and an estimated value function, which are updated during training. TISP has four ingredients: belief-based representation (Sec. 4.1), backward induction (Sec. 4.2), policy learning (Sec. 4.3) and belief-space approximation (Sec. 4.4). We discuss the test-time strategy and show that TISP converges to PBEs under certain conditions in Sec. 4.5.
4.1 Belief-Based Representation
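Instead of representing a policy as a function of the history, we consider player 2’s belief $b \in \Delta_{|\Lambda|}$ of player 1’s type and represent $\pi_i$ as a function of the belief $b^l$ and the current state $s^l$ in round $l$, i.e., $\pi_{1,l}(\cdot \mid s^l, b^l, \lambda)$ and $\pi_{2,l}(\cdot \mid s^l, b^l)$. The belief $b^l$ represents the posterior probability distribution of $\lambda$ and can be obtained using Bayes rule given player 1’s strategy:
$$b^{l+1}_\lambda = \frac{\pi_{1,l}(a_1^l \mid s^l, b^l, \lambda)\, b^l_\lambda}{\sum_{\lambda' \in \Lambda} \pi_{1,l}(a_1^l \mid s^l, b^l, \lambda')\, b^l_{\lambda'}}, \quad (4)$$
where $b^l_\lambda$ is the probability of player 1 being type $\lambda$ given all its actions up to round $l-1$. This belief-based representation avoids the enumeration of the exponentially many histories. Although it requires training a policy that outputs an action for any input belief in the continuous space, it is possible to use approximation as we show in Sec. 4.4. We can also define the belief-based value function for agent $i$ in round $l$ by
$$V_{i,l}^\lambda(\pi, b^l, s^l) = \mathbb{E}_{a_1, a_2, s^{l+1}} \big[ u_i^\lambda(s^l, a_1, a_2) + \gamma\, V_{i,l+1}^\lambda(\pi, b^{l+1}, s^{l+1}) \big]. \quad (5)$$
The Q-function can be defined similarly. We assume the policies and value functions are parameterized by $\theta$ and $\phi$ respectively with neural networks.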
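As a concrete illustration of the Bayes-rule belief update described above, the following minimal sketch updates a belief vector after one observed action of player 1. The function name `update_belief` and the tabular `action_probs` array are assumptions made for this example only; in the framework itself the per-type action probabilities would come from player 1's learned policy at the current state and belief.

```python
import numpy as np

def update_belief(belief, action_probs, action):
    """Posterior over player 1's types after observing `action`.

    belief:       shape (num_types,), current probability of each type.
    action_probs: shape (num_types, num_actions); row t holds the action
                  distribution of player 1 when its type is t
                  (hypothetical tabular stand-in for the learned policy).
    Returns None when the observed action has zero probability under the
    current belief, i.e., when Bayes' rule is undefined.
    """
    belief = np.asarray(belief, dtype=float)
    likelihood = np.asarray(action_probs, dtype=float)[:, action]
    joint = belief * likelihood          # numerator of Bayes' rule per type
    total = joint.sum()                  # normalizing constant
    if total <= 0.0:
        return None
    return joint / total

prior = np.array([0.5, 0.5])
policy = np.array([[0.9, 0.1],   # type 0 mostly plays action 0
                   [0.3, 0.7]])  # type 1 mostly plays action 1
posterior = update_belief(prior, policy, action=0)
print(posterior)
```

Because the update only needs the current belief and one row lookup per type, the input to the policy stays the same size no matter how many rounds have been played.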
4.2 Backward Induction
Standard RL approaches with self-play train policies in a top-down manner: they execute the learning policies from round $0$ to round $L-1$ and only learn from the experiences at visited decision points. To find strategy profiles with decent performance from any decision point onward, we use backward induction and train the policies and calculate the value functions in the reverse order of rounds: we start by training $\pi_{i,L-1}$ for all agents and then calculate the corresponding value functions $V_{i,L-1}$, and then train $\pi_{i,L-2}$ and so on.
The benefit of using backward induction is twofold. In the standard forward-fashioned approach, one needs to roll out the entire trajectory to estimate the accumulative reward for policy and value learning. In contrast, with backward induction, when training $\pi_{i,l}$, we have already obtained $V_{i,l+1}$. Thus, we just need to roll out the policy for 1 round and directly estimate the expected accumulated value using $V_{i,l+1}$ and Eq. (5). Hence, we effectively reduce the original $L$-round game into $L$ 1-round games, which makes the learning much easier. Another important benefit is that we can uniformly sample all possible combinations of state, belief and type at each round to ensure effective exploration. More specifically, in round $l$, we can sample a belief $b$ and then construct a new game by resetting the environment with a uniformly randomly sampled state and a type sampled from $b$. Implementation-wise, we assume access to an auxiliary reset function from the environment that produces a new game as described. This function takes two parameters, a round $l$ and a belief $b$, as input and produces a new game by drawing a random state from the entire state space with equal probability and a random type according to the belief distribution $b$. This function is an algorithmic requirement for the environment, which is typically feasible in practice. For example, most RL environments provide a reset function that generates a random starting state, so a simple code-level enhancement of this reset function can make existing testbeds compatible with our algorithm. We remark that even with such a minimal environment enhancement requirement, our framework does not utilize the transition information. Hence, our method remains nearly model-free compared with other methods that assume full access to the underlying environment transitions, which is the assumption of most CFR-based algorithms. Furthermore, using a customized reset function is not rare in the RL literature. For example, most automatic curriculum learning algorithms assume a flexible reset function that can reset the state of an environment to a desired configuration.
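The reverse-order training schedule can be sketched as a short loop. This is only a structural sketch under stated assumptions: `make_subgame` stands in for the auxiliary reset function (its name, signature, and dictionary return value are hypothetical), and `learn_one_round` is a stub for whichever one-round self-play learner is plugged in. A real implementation would call the reset hook many times per belief point inside the learner.

```python
import random

def backward_induction(num_rounds, belief_points, make_subgame, learn_one_round):
    """Train per-round policies in reverse order of rounds.

    make_subgame(round_idx, belief): hypothetical environment hook that
        returns a freshly reset one-round game (uniform state, type ~ belief).
    learn_one_round(game, next_values): hypothetical learner returning
        (policy, value), bootstrapping from the next round's values.
    """
    policies = [None] * num_rounds
    values = [None] * num_rounds
    for l in reversed(range(num_rounds)):
        # The last round has no successor, so there is nothing to bootstrap from.
        next_values = values[l + 1] if l + 1 < num_rounds else None
        policies[l], values[l] = {}, {}
        for b in belief_points:
            game = make_subgame(l, b)
            policies[l][b], values[l][b] = learn_one_round(game, next_values)
    return policies, values

_rng = random.Random(0)

def make_subgame(round_idx, belief, num_states=4):
    state = _rng.randrange(num_states)                              # uniform state
    type_idx = _rng.choices(range(len(belief)), weights=belief)[0]  # type ~ belief
    return {"round": round_idx, "state": state, "type": type_idx, "belief": belief}

trained_order = []

def learn_one_round(game, next_values):
    trained_order.append(game["round"])  # record the schedule for inspection
    return "policy-stub", 0.0

policies, values = backward_induction(
    3, [(0.5, 0.5), (1.0, 0.0)], make_subgame, learn_one_round)
print(trained_order)
```

Storing policies and values per round and per belief point keeps each one-round problem independent once the later rounds are fixed, which is what allows every (state, belief, type) combination to be sampled directly.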
4.3 Policy Learning
Each time a game is produced by the reset function, we perform 1-round learning with self-play to find the policies $\pi_{i,l}$. TISP allows different policy learning methods. Here we consider two popular choices, policy gradient and regret matching.
4.3.1 Policy Gradient
The PG method directly takes a gradient step over the expected utility. For notational conciseness, we omit the super/subscripts of $i$, $l$, $\lambda$ in the following equations and use $s$, $b$ and $s'$, $b'$ to denote the state and belief at the current round and the next round.
Theorem 4.1.
In the beliefbased representation, the policy gradient derives as follows:
(6)  
Compared with the standard policy gradient theorem, we have an additional term in Eq. (6) (the second term). Intuitively, when the belief space is introduced, the next belief is a function of the current belief and the policy in the current round (Eq. (4)). Thus, a change in the current policy may influence future beliefs, resulting in the second term. The full derivation can be found in Appendix E. We also show in the experiment section that when the second term is ignored, the learned policies can be substantially worse. We refer to this PG variant of the TISP framework as TISP-PG.
4.3.2 Regret Matching
Regret matching is another popular choice for imperfect-information games. We take inspiration from Deep CFR [4] and propose another variant of TISP, referred to as TISP-CFR. Specifically, for each training iteration $t$, let $R^t(s, a)$ denote the regret of action $a$ at state $s$, $\pi^t$ denote the current policy, $Q^t$ denote the Q-function and $V^t$ denote the value function corresponding to $\pi^t$. Then we have
$$\pi^{t+1}(a \mid s) = \frac{\big(R^{t+1}(s, a)\big)^+}{\sum_{a'} \big(R^{t+1}(s, a')\big)^+},$$
where $R^{t+1}(s, a) = \sum_{\tau=1}^{t} \big( Q^\tau(s, a) - V^\tau(s) \big)$ and $(\cdot)^+ = \max(\cdot, 0)$. Since the policy can be directly computed from the value function, we only learn the value function here. Besides, most regret-based algorithms require known transition functions, which enables them to reset to any information-set node in the training. We use the outcome sampling method [20], which samples a batch of transitions to update the regret. This is similar to the value learning procedure in standard RL algorithms and keeps our algorithm model-free. Although TISP-CFR does not require any gradient computation, it is in practice much slower than TISP-PG since it has to learn an entire value network in every single iteration.
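For readers unfamiliar with regret matching, the following tabular sketch shows the standard rule at a single state: the policy is proportional to the positive part of the accumulated regrets, and is uniform when no regret is positive. The function names are illustrative, and the tabular arrays are an assumption of this example; TISP-CFR as described here uses a learned value network and sampled transitions rather than exact tables.

```python
import numpy as np

def regret_matching_policy(regrets):
    """Policy proportional to positive regrets; uniform if none is positive."""
    positive = np.maximum(np.asarray(regrets, dtype=float), 0.0)
    total = positive.sum()
    if total > 0.0:
        return positive / total
    return np.full(len(positive), 1.0 / len(positive))

def accumulate_regret(regrets, q_values, policy):
    """Add the instantaneous regret Q(s, a) - V(s), with V(s) = sum_a pi(a) Q(s, a)."""
    q_values = np.asarray(q_values, dtype=float)
    value = float(np.dot(policy, q_values))
    return np.asarray(regrets, dtype=float) + (q_values - value)

regrets = np.zeros(3)
policy = regret_matching_policy(regrets)           # uniform at the start
regrets = accumulate_regret(regrets, [1.0, 0.0, -1.0], policy)
policy = regret_matching_policy(regrets)           # all mass on the best action
print(regrets, policy)
```

Since the policy is a deterministic function of the accumulated regrets, only the value estimates need to be learned, which is the property the text relies on.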
4.4 Belief-Space Policy Approximation
A small change to the belief can drastically change the policy. When using a function approximator for the policy, this requires the network to be sensitive to the belief input. We empirically observe that a single neural network often fails to capture the desired belief-conditioned policy. Therefore, we use an ensemble-based approach to tackle this issue.
At each round, we sample $K$ belief points $\{b_1, \ldots, b_K\}$ from the belief space. For each belief $b_k$, we use self-play to learn an accurate independent strategy and value over the state space, conditioned specifically on this particular belief input $b_k$. When querying the policy and value for an arbitrary belief $b$ different from the sampled ones, we use a distance-based non-parametric method to approximate the target policy and value. Specifically, for any two beliefs $b_1$ and $b_2$, we define a distance metric $w(b_1, b_2)$, and then for the query belief $b$, we calculate its policy and value by
(7)  
(8) 
We introduce a density parameter $\delta$ to ensure that the sampled points are dense and have good coverage over the belief space. $\delta$ is defined as the farthest distance between any point in the belief space and its nearest sampled point, or formally $\delta = \max_{b} \min_{k} w(b, b_k)$. A smaller $\delta$ indicates a denser set of sampled points.
Note that the policy and value networks at different belief points in each round are completely independent and can therefore be trained fully in parallel. So the overall training wall-clock time remains unchanged with policy ensembles. Additional discussions on belief sampling are in Appx. A.
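To make the ensemble lookup concrete, here is a minimal sketch of one possible distance-based interpolation over sampled belief points, together with a grid-based estimate of the density parameter. Inverse-distance weighting with Euclidean distance is an assumption chosen for illustration, as are the function names and the `eps` floor; the weighting actually used is the one given in Eqs. (7) and (8).

```python
import numpy as np

def interpolate_policy(query, belief_points, policies, eps=1e-6):
    """Inverse-distance-weighted mix of the policies stored at sampled beliefs.

    belief_points: shape (K, num_types); policies: shape (K, num_actions).
    `eps` floors the distance so that a query equal to a sampled belief
    does not divide by zero (assumed safeguard).
    """
    belief_points = np.asarray(belief_points, dtype=float)
    policies = np.asarray(policies, dtype=float)
    dist = np.linalg.norm(belief_points - np.asarray(query, dtype=float), axis=1)
    weights = 1.0 / np.maximum(dist, eps)
    weights /= weights.sum()
    return weights @ policies          # convex combination, still a distribution

def density(belief_points, probe_points):
    """Largest distance from any probe belief to its nearest sampled belief."""
    pts = np.asarray(belief_points, dtype=float)
    probes = np.asarray(probe_points, dtype=float)
    pairwise = np.linalg.norm(probes[:, None, :] - pts[None, :, :], axis=2)
    return pairwise.min(axis=1).max()

points = [[0.0, 1.0], [1.0, 0.0]]
stored = [[1.0, 0.0], [0.0, 1.0]]      # one policy per sampled belief
mixed = interpolate_policy([0.5, 0.5], points, stored)
near = interpolate_policy([0.0, 1.0], points, stored)
probes = [[p, 1.0 - p] for p in np.linspace(0.0, 1.0, 11)]
delta = density(points, probes)
print(mixed, near, delta)
```

The same weights can be reused to interpolate the stored values, and the density estimate gives a quick check of how well a candidate set of belief points covers the simplex.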
4.5 Test-time Strategy
Note that our algorithm requires computing the precise belief using Eq. (4), which in turn requires knowing the policy of player 1; this might not be feasible at test time when competing against an unknown opponent. Therefore, at test time, we update the belief according to the training policy of player 1, regardless of the actual opponent policy. That is, even though the actions produced by the actual opponent can be completely different from the oracle strategies we learned from training, we still use the oracle policy from training to compute the belief. Note that it is possible to obtain an infeasible belief, i.e., the opponent chooses an action that has zero probability in the oracle strategy. In this case, we simply use a uniform belief instead. The procedure is summarized in Alg. 2. We also theoretically prove in Thm. 4.2 that in the zero-sum case, the strategy provided in Alg. 2 is a bounded approximate PBE and further converges to a PBE with infinite policy learning iterations and sampled beliefs. The proof can be found in Appendix F.
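The test-time rule, including the uniform fallback for infeasible beliefs, can be sketched as follows. The function name `track_belief` and the per-round tabular policies are assumptions for this example; for brevity the dependence of the training policy on the state and on the current belief is omitted, whereas in the framework the training policy would be queried at the current state and belief.

```python
import numpy as np

def track_belief(prior, trained_policies, observed_actions):
    """Track player 2's belief at test time using player 1's training policy.

    trained_policies[l]: shape (num_types, num_actions), the training-time
        action distribution of player 1 in round l for each type
        (hypothetical tabular stand-in; state and belief inputs omitted).
    observed_actions[l]: the action the actual opponent played in round l.
    """
    belief = np.asarray(prior, dtype=float)
    for probs, action in zip(trained_policies, observed_actions):
        joint = belief * np.asarray(probs, dtype=float)[:, action]
        total = joint.sum()
        if total > 0.0:
            belief = joint / total                               # Bayes update
        else:
            belief = np.full(len(belief), 1.0 / len(belief))     # infeasible: uniform
    return belief

policies = [np.array([[1.0, 0.0], [0.5, 0.5]]),   # round 0
            np.array([[1.0, 0.0], [1.0, 0.0]])]   # round 1
after_one = track_belief([0.5, 0.5], policies, [1])
after_two = track_belief([0.5, 0.5], policies, [1, 1])
print(after_one, after_two)
```

In the usage above, the first observed action rules out type 0, while the second has zero probability under the training policy for every type, so the belief is reset to uniform.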
Theorem 4.2.
When the game is zerosum and is finite, the strategies produced by Alg. 2 is PBE, where , is the distance parameter in belief sampling, , is the number of iterations in policy learning and is a positive constant associated with the particular algorithm (TISPPG or TISPCFR, details in Appendix F). When and , the strategy becomes a PBE.
We remark that TISP is nearly model-free and does not utilize the transition probabilities, which ensures its adaptability.
5 Experiment
We test our algorithms in three sets of games. While very few previous works in RL focus on equilibrium refinement, we compare our algorithm with self-play PG with an RNN-based policy (referred to as RNN) and provide ablation studies for the TISP-PG algorithm: BPG uses only the belief-based policy without backward induction or belief-space approximation; BI adopts backward induction and the belief-based policy but does not use belief-space approximation. Full experiment details can be found in Appx. D.
5.1 Finitely Repeated Security Game
5.1.1 Game Setting
We consider a finitely repeated simultaneous-move security game, as discussed in [26]. Specifically, this is an extension of a one-round security game obtained by repeating it for $L$ rounds. Each round’s utility function can be seen as a special form of matrix game and remains the same across rounds. In each round, the attacker can choose to attack one position from all positions, and the defender can choose to defend one. The attacker succeeds if the target is not defended. The attacker gets a reward if it successfully attacks the target and a penalty if it fails. Correspondingly, the defender gets a penalty if it fails to defend a place and a reward otherwise. In the zero-sum setting, the payoff of the defender is the negative of the attacker’s. We also adopt a general-sum setting described in [26] where the defender’s payoff is only related to the action it chooses, regardless of the attacker’s type.
5.1.2 Evaluation
We evaluate our solution by calculating the minimum $\epsilon$ such that our solution is an $\epsilon$-PBE. We show the average result over 5 different game instances.






5.1.3 Results
We first experiment with the zero-sum setting, where we have proved that our model can converge to an $\epsilon$-PBE. The comparisons are shown in Tables 1(a) and 1(c). We use two known-model variants, TISP-PG and TISP-CFR, and a model-free version of TISP-PG in this comparison. TISP-PG achieves the best results, while TISP-CFR also has comparable performance. We note that simply using an RNN or using a belief-space policy performs only slightly better than a random policy.
Then we conduct experiments in general-sum games, with results shown in Tables 1(b) and 1(c). We empirically observe that the derived solution has quality comparable to that in the zero-sum setting. We also compare our methods with the mathematical-programming-based method (MP) in [26], which requires full access to the game transition. The results are shown in Table 2. Although the MP solution achieves superior accuracy when $L$ is small, it quickly runs out of memory (marked as “N/A”) since its time and memory requirements grow at an exponential rate w.r.t. $L$. Again, our TISP variants perform the best among all learning-based methods. We remark that despite the performance gap between our approach and the MP methods, the error on those games unsolvable for MP methods is small relative to the total utility.
In our experiments, we do not observe orders-of-magnitude differences in running time between our method and the baselines: TISP-PG and TISP-CFR in the Tagging game use 20 hours with 10M samples in total even with particle-based approximation (200k samples per belief point), while RNN and BPG in the Tagging game use roughly 7 hours with 2M total samples for convergence.
Regarding scalability in the number of types, as the number of types increases, the intrinsic learning difficulty increases. This is a challenge faced by all the methods. The primary contribution of this paper is a new learning-based framework for PBE. While we primarily focus on the case of 2 types, our approach generalizes naturally to more types, with an additional experiment conducted for 3 types in Table 1(d). Advanced sampling techniques can potentially be utilized for more types, which we leave for future work.
5.2 Exposing Game
5.2.1 Game Setting
We also present a two-step matrix game, which we call Exposing. In this game, player 2 aims to guess the correct type of player 1 in the second round. There are two actions available for player 1 and three actions available for player 2 in each round. Specifically, the three actions for player 2 mean guessing that player 1 is type 1, guessing that it is type 2, or not guessing at all. Player 2 receives different rewards for a correct guess, for a wrong guess, and for not guessing. Player 1 receives a positive reward when player 2 chooses to guess in the second round, regardless of which type player 2 guesses. In this game, player 1 has an incentive to expose its type to encourage player 2 to guess in the second round. We further add a reward for player 1 choosing action 1 in the first round, regardless of its type, to give player 1 an incentive not to expose its type. With this reward, a short-sighted player 1 may pursue it and forgo the reward in the second round. The payoff matrices for this game are in Appx. D.
The equilibrium strategy for player 2 in the second round w.r.t. different types of player 1 is:
In this game, the equilibrium values in the second round for both types of player 1 are highly discontinuous with regard to player 2’s belief, as shown in Fig. 1, which makes the belief-space gradient term in Eq. (6) ineffective. However, the approximation introduced in Sec. 4.4 can serve as a soft representation of the true value function, as shown in Fig. 1. This soft representation provides an approximated gradient in the belief space, which allows for belief-space exploration. We will show later that this belief-space exploration cannot be achieved otherwise.
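| Method | P1’s type | Action 1 | Action 2 | Reward |
|---|---|---|---|---|
| TISP-PG | type 1 | 0.985 | 0.015 | 5.985 |
| TISP-PG | type 2 | 0.258 | 0.742 | 5.258 |
| TISP-CFR | type 1 | 1.000 | 0.000 | 1.000 |
| TISP-CFR | type 2 | 1.000 | 0.000 | 1.000 |
| TISP-PG (no belief-space gradient) | type 1 | 0.969 | 0.031 | 0.969 |
| TISP-PG (no belief-space gradient) | type 2 | 0.969 | 0.031 | 0.969 |
| Optimal | type 1 | 1.000 | 0.000 | 6.000 |
| Optimal | type 2 | 0.333 | 0.667 | 5.333 |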
5.2.2 Results
We compare the training results of TISP-PG and TISP-CFR to exhibit the effectiveness of our non-parametric approximation. We further add an ablation study that removes the belief-space gradient term in TISP-PG (the second TISP-PG entry in Table 3). The results and the obtained policies are shown in Table 3. We can see that the full TISP-PG is the only algorithm that successfully escapes from the basin area in Fig. 1, as it is the only algorithm that is capable of belief-space exploration. The training curves can be found in Appx. D. Note that the policies trained with TISP-CFR and with the TISP-PG ablation are also close to a PBE in which player 1 takes action 1 regardless of its type in the first round and player 2 chooses not to guess in the second round, although this PBE is not Pareto-optimal.
5.3 Tagging Game
5.3.1 Game Setting
We test our method in a grid-world game, Tagging, as illustrated in Fig. 2. The game is inspired by [28]. Specifically, the game is played on a square grid; player 1 has two types, i.e., ally and enemy, and each type corresponds to a unique target place. Each player receives a distance-based reward to encourage it to move towards its target place. There is a river in the top half of the grid which player 2 cannot enter. Player 1 starts from the bottom middle of the map and player 2 starts from a random position below the river. Both players can choose to move in one of the four directions, [up (U), down (D), left (L), right (R)], by one cell. Player 2 has an additional action, tag, to tag player 1 as the enemy type. The tag action is only available when player 1 has not entered the river and the Euclidean distance between the two players is below a threshold. The attackers (player 1) get higher rewards for getting closer to their type-specific target, and the defenders (player 2) get higher rewards for getting closer to the attacker. Moreover, if the defender chooses to tag, its reward depends on whether the attacker is of the enemy type or of the ally type, while the attacker receives the same reward for being tagged no matter what its type is. More details are provided in Appx. D.
Based on the game’s rules, an enemy-typed player 1 is likely to go to its own target immediately to get a higher reward. However, such a strategy reveals its type straight away, which can be punished by being tagged. A cleverer strategy for an enemy-typed player 1 is to mimic the behavior of the ally-typed player 1 in most situations and never let player 2’s belief of the enemy type grow large enough for player 2 to have an incentive to tag it. The ally-typed player 1 may simply go up in most states in order to get closer to its target for a reward, while player 2 should learn to tag player 1 if its belief of the enemy type is high enough and try to move closer to player 1 in other cases.
5.3.2 Evaluation
Although it is computationally intractable to calculate the exact exploitability in this grid-world game, we examine the results following [13] by evaluating the performance in induced games. We choose 256 induced games among all the induced games after the second round and check their exploitability. We get the best response in each induced game in two ways. First, in the first round of each game, we enumerate all the possible actions and manually choose the best action. Second, we train a new BPG player 2 from scratch. This new BPG-based player is trained in a single-agent-like environment and does not need to consider changes in the player 1 policy. We check how much reward the agents can get after the learning process finishes. A high reward for player 2 would indicate that player 1 fails to hide its type. We also provide the detailed strategies from different baselines for the first round of the game for additional insight.
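| Method | P1’s type | U | D | R | L |
|---|---|---|---|---|---|
| TISP-PG | Ally | 0.839 | 0.001 | 0.001 | 0.159 |
| TISP-PG | Enemy | 0.932 | 0.001 | 0.001 | 0.066 |
| RNN | Ally | 0.248 | 0.237 | 0.238 | 0.274 |
| RNN | Enemy | 0.599 | 0.067 | 0.077 | 0.255 |
| BPG | Ally | 0.000 | 0.000 | 0.000 | 1.000 |
| BPG | Enemy | 1.000 | 0.000 | 0.000 | 0.000 |
| TISP-CFR | Ally | 0.000 | 0.000 | 0.000 | 1.000 |
| TISP-CFR | Enemy | 1.000 | 0.000 | 0.000 | 0.000 |

| | TISP-PG | RNN | BPG | TISP-CFR |
|---|---|---|---|---|
| P2 reward | 1.90 | 1.67 | 0.98 | 1.82 |
| P1 reward (ally) | 2.55 | 2.87 | 3.26 | 3.17 |
| P1 reward (enemy) | 2.41 | 2.71 | 9.29 | 4.49 |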
5.3.3 Results
We show the derived policy in the very first step in Table 4. The policy learned by our method successfully keeps the belief below the level at which player 2 would tag, and keeps a large probability of going to the target of each type. The RNN policy shows no preference between different actions, resulting in not being tagged but also not getting a larger reward for getting closer to the target. The BPG policy simply goes straight towards the target and is therefore punished by being tagged. The exploitability results are shown in Table 5. Judging from the training reward achieved by the new exploiter player 2, TISP-PG performs the best among all baselines and TISP-CFR also produces a robust player 1 policy. We remark that the relative performance of different methods in the grid-world game is consistent with what we have previously observed in the finitely repeated security games, which further validates the effectiveness of our approach.
6 Discussion and Conclusion
We proposed TISP, an RL-based framework to find strategies with decent performance from any decision point onward. We provided theoretical justification and empirically demonstrated its effectiveness. Our algorithms can be easily extended to two-sided stochastic Bayesian games. The TISP framework still applies, and the only major modification needed is to add a for-loop to sample belief points for player 2. This extension will cost more computing resources, but the networks can be trained fully in parallel. The full version of the extended algorithm is in Appendix C.
Acknowledgements
We would like to thank Ryan Shi for some help in writing the early workshop version of this paper. Co-author Fang is supported in part by NSF grant IIS-1850477, a research grant from Lockheed Martin, and by the U.S. Army Combat Capabilities Development Command Army Research Laboratory Cooperative Agreement Number W911NF-13-2-0045 (ARL Cyber Security CRA). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the funding agencies.
References
[1] (2016) Belief and truth in hypothesised behaviours. Artificial Intelligence.
[2] (2013) A game-theoretic model and best-response learning method for ad hoc coordination in multiagent systems. In AAMAS.
[3] (2011) Refinement of strong Stackelberg equilibria in security games. In AAAI.
[4] (2019) Deep counterfactual regret minimization. ICML.
[5] (2018) Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science.
[6] (2017) On Markov games played by Bayesian and boundedly-rational players. In AAAI.
[7] (1987) Signaling games and stable equilibria. QJE.
[8] (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP.
[9] (2014) The complexity of approximating a trembling hand perfect equilibrium of a multi-player game in strategic form. In SAGT.
[10] (2017) Extensive-form perfect equilibrium computation in two-player games. In AAAI.
[11] (2017) Regret minimization in behaviorally-constrained zero-sum games. In ICML.
[12] (1992) Repeated games of incomplete information: non-zero-sum. Handbook of Game Theory with Economic Applications.
[13] (2021) Human-level performance in no-press Diplomacy via equilibrium search. ICLR.
[14] (2018) Computational complexity of proper equilibrium. In EC.
[15] (2015) Fictitious self-play in extensive-form games. In ICML.
[16] (1998) Multiagent reinforcement learning: theoretical framework and an algorithm. In ICML.
[17] (2019) Actor-attention-critic for multi-agent reinforcement learning. In ICML.
[18] (1982) Sequential equilibria. Econometrica.
[19] (2017) Smoothing method for approximate extensive-form perfect equilibrium. arXiv.
[20] (2009) Monte Carlo sampling for regret minimization in extensive games. In NIPS.
[21] (1994) Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings.
[22] (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. In NIPS.
[23] (2010) Computing a quasi-perfect equilibrium of a two-player game. Economic Theory.
[24] (2015) Human-level control through deep reinforcement learning. Nature.
[25] (2017) DeepStack: expert-level artificial intelligence in heads-up no-limit poker. Science.
[26] (2019) Deception in finitely repeated security games. AAAI.
[27] (2019) Finding friend and foe in multi-agent games. In NeurIPS.
[28] (2019) Robust opponent modeling via adversarial ensemble reinforcement learning in asymmetric imperfect-information games. arXiv.
[29] (2018) A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science.
[30] (2003) Stochastic games with incomplete information. In Stochastic Games and Applications.
[31] (2019) Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature.
[32] (2020) Learning to interactively learn and assist. AAAI.
[33] (2007) Regret minimization in games with incomplete information. NIPS.
[34] (2008) Regret minimization in games with incomplete information. In NeurIPS.
[35] (2003) Online convex programming and generalized infinitesimal gradient ascent. In ICML.
Appendix A Discussion on Belief Sampling
Our algorithm generally requires that the sampled belief points guarantee that every possible belief lies at a distance less than a given threshold from some sampled point, where the threshold is a hyperparameter. In a one-dimensional belief space (for two types), the most efficient way to achieve this is to draw the beliefs at equal distances. This ensures that no point at which the policy may change is too far from a sampled belief. With more sampled beliefs, the belief-space approximation becomes more precise, but it also requires correspondingly more training resources. An alternative is adaptive sampling based on the quality of the obtained policies, which, however, requires sequential execution at different belief points and therefore restricts the level of parallelism. The complete training framework is summarized in Algo. 5.
Appendix B Implementation Details
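As an implementation-level illustration of the equal-distance belief sampling discussed in Appendix A, the following minimal sketch builds an evenly spaced grid over a one-dimensional belief space (two types) and reports how far any belief can be from its nearest sampled point. The function names and the choice of including both endpoints 0 and 1 are assumptions made for illustration and are not taken from the paper.

```python
def uniform_belief_grid(num_points):
    """Return `num_points` equally spaced beliefs P(type 1) covering [0, 1].

    Hypothetical helper: the endpoints 0 and 1 are always included.
    """
    if num_points < 2:
        raise ValueError("need at least two belief points to cover [0, 1]")
    step = 1.0 / (num_points - 1)
    return [i * step for i in range(num_points)]


def covering_radius(grid):
    """Largest distance from any belief in [0, 1] to its nearest grid point.

    Assumes `grid` is sorted and contains both 0 and 1, so the worst case
    is the midpoint of the widest gap between neighbouring points.
    """
    return max(right - left for left, right in zip(grid, grid[1:])) / 2.0


def nearest_belief(grid, belief):
    """Return the sampled belief point closest to `belief`."""
    return min(grid, key=lambda point: abs(point - belief))


grid = uniform_belief_grid(5)
print(grid, covering_radius(grid), nearest_belief(grid, 0.3))
```

With an equally spaced grid of K points, the covering radius is 1 / (2(K - 1)), so the number of points can be chosen directly from the desired distance threshold.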
Sampling New Games in Backward Induction
Implementation-wise, we assume access to an auxiliary function provided by the environment. This function takes two parameters, a round and a belief, as input and produces a new game by drawing a random state from the entire state space with equal probability and a random type according to the belief distribution. This function is an algorithmic requirement on the environment, which is typically feasible in practice. For example, most RL environments provide a reset function that generates a random starting state, so a simple code-level enhancement of this reset function can make existing testbeds compatible with our algorithm. We remark that even with this minimal environment enhancement, our framework does not use the transition information. Hence, our method remains nearly model-free compared with other methods that assume full access to the underlying environment transitions, which is the assumption made by most CFR-based algorithms.
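The auxiliary function described above could look roughly like the following sketch. The name `sample_subgame`, the list-based state space, and the returned tuple are hypothetical stand-ins; a real environment would wrap its own reset logic, and nothing here relies on the transition function.

```python
import random


def sample_subgame(states, types, belief, round_index, rng=random):
    """Draw the starting point of a new game that begins at `round_index`.

    `states` and `types` are hypothetical finite lists standing in for the
    environment's state space and type set; `belief` holds one probability
    per type.
    """
    if len(belief) != len(types):
        raise ValueError("belief must assign one probability to each type")
    # The state is drawn uniformly from the whole state space.
    state = rng.choice(states)
    # The type is drawn according to the belief distribution.
    player_type = rng.choices(types, weights=belief, k=1)[0]
    return round_index, state, player_type


rng = random.Random(0)
print(sample_subgame(["s0", "s1", "s2"], ["ally", "enemy"], [0.5, 0.5], 3, rng))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which is convenient when several belief points are trained in parallel.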
RNN
We use Gated Recurrent Unit (GRU) networks [8] to encode the state-action history and perform top-down level self-play.
TISP-PG
We implement our method TISP-PG as shown above. Specifically, there are two ways to implement the attacker: we can either use a separate network for each type, or use one network for all types and feed the type as an additional dimension of the input. In this paper, we use a separate network for each type.
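The two attacker parameterizations mentioned above can be contrasted with the following minimal sketch. It uses a tiny linear-softmax policy in place of a real neural network, and the class names as well as the one-hot encoding of the type are illustrative assumptions, not the implementation used in the experiments.

```python
import math
import random


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


class LinearPolicy:
    """Hypothetical stand-in for a policy network: softmax(W x)."""

    def __init__(self, input_dim, num_actions, rng):
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(input_dim)]
                        for _ in range(num_actions)]

    def action_probs(self, features):
        logits = [sum(w * x for w, x in zip(row, features))
                  for row in self.weights]
        return softmax(logits)


class SeparateTypePolicies:
    """Option 1: one independent network per type."""

    def __init__(self, num_types, obs_dim, num_actions, rng):
        self.policies = [LinearPolicy(obs_dim, num_actions, rng)
                         for _ in range(num_types)]

    def action_probs(self, observation, type_index):
        return self.policies[type_index].action_probs(observation)


class SharedTypePolicy:
    """Option 2: one network; the type is appended to the input (one-hot here)."""

    def __init__(self, num_types, obs_dim, num_actions, rng):
        self.num_types = num_types
        self.policy = LinearPolicy(obs_dim + num_types, num_actions, rng)

    def action_probs(self, observation, type_index):
        one_hot = [1.0 if i == type_index else 0.0
                   for i in range(self.num_types)]
        return self.policy.action_probs(list(observation) + one_hot)


rng = random.Random(0)
obs = [0.5, -1.0, 2.0]
print(SeparateTypePolicies(2, 3, 4, rng).action_probs(obs, 1))
print(SharedTypePolicy(2, 3, 4, rng).action_probs(obs, 1))
```

Separate networks avoid interference between types at the cost of more parameters, while the shared variant lets types share features through a single set of weights.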
TISP-CFR
The pseudocode of our TISP-CFR is shown here:
Appendix C Extension of the Algorithm to Two-Sided Bayesian Games
Our algorithm can be extended to two-sided stochastic Bayesian games, with the learning algorithm shown above. Specifically, we now perform belief approximation for both players and compute all the corresponding policies. This means that in all the equations in the paper, e.g., Eqs. 4 and 5, the belief vector should now consist of the belief points on both sides. In other words, if the belief vector is taken to carry the beliefs over the types of both players, the equations remain the same.
Appendix D Experiment Details
D.1 Computing Infrastructure
All experiments on the Security game and the Exposing game were conducted on a laptop with a 12-core Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz, 16 GB of RAM, and Ubuntu 20.04.1 LTS. All experiments on the Tagging game were conducted on a server with two AMD EPYC 7742 64-Core Processors @ 3.40GHz, 64 GB of RAM, and Linux for Server 4.15.0.
      MP     TISP-PG   TISP-CFR
1            0.034     0.003
2            0.053     0.008
3            0.083     0.030
4            0.112     0.065
5            0.162     0.117
6            0.211     0.190
7     N/A    0.267     0.251
8     N/A    0.329     0.331
9     N/A    0.448     0.459
10    N/A    0.473     0.499
D.2 Security Game
We also conducted the experiments in an easier setting; the results are shown in Table 6.
D.3 Exposing Game
The payoff matrices are shown in Fig. 3.
The training curves for the three algorithms are shown in Fig. 4. In the latter two plots, the strategies for both types of player 1 stay the same throughout the training process. This is why there is only one curve visible.
D.4 Tagging Game
The target for the ally is in the upper left corner, and the target for the enemy is in the upper right corner. The reward of each player consists of two terms: a distance term, which rewards the player for getting close to its own target, and a tag term, which is nonzero only when player 2 takes the tag action. Specifically, for player 1, the distance term depends on the Euclidean distance between player 1 and the target of its type, while the tag term is a penalty if player 2 tags player 1 in that round and zero if player 2 does not take the tag action in that round. Note that player 1 receives this penalty even if it is actually of the ally type. For player 2, the distance term depends on the Euclidean distance between the two players, and the tag term takes one value if player 1 is of the ally type and is tagged by player 2 in that round, and another value if player 1 is of the enemy type and is tagged by player 2 in that round. The ally and enemy types follow a given prior distribution. The tag action can be used multiple times during the game. We impose an episode length limit in our experiments. Although this many steps may not seem enough for both players to reach their targets, the environment does not give any large reward for reaching the target. Note that we assume a player only learns its reward when the game ends, which prevents player 2 from inferring player 1's type by looking at its reward immediately after taking a tag action.
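The structure of the Tagging game rewards can be sketched as follows. This is only an illustration under stated assumptions: the total reward is taken to be the sum of the distance term and the tag term, the distance term is taken to be linear in the Euclidean distance, and every coefficient is a caller-supplied placeholder. The numbers in the usage lines are made up for the example and are not the values used in the experiments.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two 2-D grid positions."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def player1_reward(pos1, target, tagged, dist_coef, tag_penalty):
    """Reward of player 1 in one round (assumed form: distance term + tag term).

    `dist_coef` and `tag_penalty` are hypothetical coefficients. The penalty
    applies whenever player 1 is tagged, regardless of its type.
    """
    distance_term = dist_coef * euclidean(pos1, target)
    tag_term = tag_penalty if tagged else 0.0
    return distance_term + tag_term


def player2_reward(pos1, pos2, tagged, player1_is_ally,
                   dist_coef, tag_ally, tag_enemy):
    """Reward of player 2 in one round (assumed form: distance term + tag term).

    The tag term is zero unless player 2 tags player 1; its value then
    depends on player 1's type. All coefficients are hypothetical.
    """
    distance_term = dist_coef * euclidean(pos1, pos2)
    if not tagged:
        tag_term = 0.0
    else:
        tag_term = tag_ally if player1_is_ally else tag_enemy
    return distance_term + tag_term


# Illustrative values only.
print(player1_reward((0, 0), (3, 4), True, dist_coef=-0.25, tag_penalty=-10.0))
print(player2_reward((0, 0), (3, 4), True, False,
                     dist_coef=-0.25, tag_ally=-20.0, tag_enemy=10.0))
```

Because rewards are only revealed at the end of the game, a training loop built on such functions would accumulate the per-round values and expose them to the players only after the final round.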