Recent advances in deep reinforcement learning (DRL) have achieved breakthroughs in solving challenging decision-making problems in both single-agent environments and multiagent games. Multiagent reinforcement learning (MARL) deals with multiple agents concurrently learning in a multiagent environment such as a multiagent game. One of the difficulties of learning in multiagent environments is that, in general, the state transition and reward depend on the joint action of all the agents. As a result, the best response of each agent depends on the joint policy of all the other agents. This inter-dependency between agents makes it generally impossible to learn an optimal policy from a single-agent perspective. In order to determine its optimal policy, each agent has to reason about the likely policies of the other agents and plan its actions accordingly, which is much more complicated than the single-agent case.
Many of the successful MARL applications deal with two-player symmetric zero-sum games, which are known to have a Nash equilibrium strategy profile that is equal to both the maximin strategy and the minimax strategy. This implies that the optimal policies in those games can be found by optimizing against the worst-case opponent policy. Since the game is symmetric, self-play can be used, which assigns an agent’s own policy to its opponent, essentially reducing the multiagent learning problem to a single-agent learning problem.
In more general scenarios, there could be multiple equilibrium profiles. Agents do not necessarily adopt the equilibrium policy from the same equilibrium policy profile. As a result, solving for all the equilibrium profiles does not entail finding the optimal policies. Reasoning about other agents’ policies becomes crucial for optimizing one’s own policy.
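As a minimal illustration of why multiple equilibria are problematic, consider the following sketch of a two-player coordination game. The payoff values and all names are hypothetical and chosen only for illustration. The game has two pure Nash equilibria, and if each agent plays its part of a different equilibrium, both receive the worst payoff.

```python
# Hypothetical 2x2 coordination game (values chosen for illustration only).
# Rows index agent 1's action, columns index agent 2's action.
PAYOFF_1 = [[2, 0],
            [0, 1]]
PAYOFF_2 = [[2, 0],
            [0, 1]]


def is_pure_nash(a1, a2):
    """True if neither agent can gain by unilaterally deviating from (a1, a2)."""
    best_1 = all(PAYOFF_1[a1][a2] >= PAYOFF_1[d][a2] for d in range(2))
    best_2 = all(PAYOFF_2[a1][a2] >= PAYOFF_2[a1][d] for d in range(2))
    return best_1 and best_2


# Enumerate all pure-strategy equilibria of the game.
equilibria = [(a1, a2) for a1 in range(2) for a2 in range(2) if is_pure_nash(a1, a2)]

# Agent 1 follows the first equilibrium while agent 2 follows the second one.
mismatch = (equilibria[0][0], equilibria[1][1])
mismatch_payoff = (PAYOFF_1[mismatch[0]][mismatch[1]],
                   PAYOFF_2[mismatch[0]][mismatch[1]])

print(equilibria)       # [(0, 0), (1, 1)]
print(mismatch_payoff)  # (0, 0)
```

Each agent plays an equilibrium action here, yet the joint outcome is not an equilibrium, which is why knowing the set of equilibria is not sufficient without reasoning about the other agent’s actual policy.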
Opponent modeling studies the problem of constructing models to reason about and make predictions about various properties (e.g., actions and goals) of the modeled agents. Classic methods, such as policy reconstruction and plan recognition, develop parametric models of agent behaviors. One of the limitations of these approaches is the requirement of domain-specific models, which could be difficult to acquire. Moreover, these models tend to decouple the interactions between the modeling agent and the modeled agents to simplify the modeling process, which is likely to introduce bias where strong coupling exists between agent rewards and interactions. In contrast, a more natural approach to opponent modeling is to concurrently train all the agents via MARL in a self-play manner. This approach requires little domain-specific knowledge other than a black-box simulator. The interactions between the modeling agent and the modeled agents are fully captured in the joint observations, state-action pairs, and rewards. Moreover, concurrent learning provides a natural curriculum with the right level of difficulty for each agent.
MARL in general-sum games, however, is more challenging than that in two-player zero-sum games. A general-sum game may have multiple equilibria, corresponding to a variety of diverse strong policies. During training, the agents might have seen only a small subset of these policies, which could lead to significant performance degradation when playing against unseen opponent policies. A common approach to mitigate this type of ‘policy over-fitting’ is training policy ensembles, such as in [16, 12, 11]. Each policy ensemble consists of several policies for each agent, which would be robust on average against all the policies within the ensembles of the other agents. Ensemble training has also been widely applied to learning classifiers that are robust to adversarial attacks in computer vision.
Although ensemble training improves the policy robustness, it also significantly increases the computational complexity, as each policy within an ensemble has to be optimized against ensembles of policies of other agents. In addition, choosing the right size of the ensemble is critical. A large ensemble size would likely result in high robustness but poor scalability, while a small size would scale better but potentially lead to a less robust policy. Therefore, finding a reasonable ensemble size that maintains a good trade-off between robustness and complexity is highly desirable, which has not been addressed in the related works. One fundamental issue with these works is the lack of a quantitative measure of robustness. Without this measure, we cannot optimize the ensemble size and the population selection.
Imperfect information and belief space planning
Imperfect information and partial observability are another common difficulty in decision-making problems. The partially observable Markov decision process (POMDP), the decentralized POMDP (Dec-POMDP), and the partially observable stochastic game (POSG) are the decision-making models for single-agent, multiagent fully cooperative, and multiagent general-sum scenarios, respectively. Model-based planning is the most prevalent technique for solving POMDPs and Dec-POMDPs. In a POMDP, a value function satisfying the Bellman equation can be defined on the belief space. Piecewise linearity and convexity (PWLC) is an important property of finite-horizon POMDP value functions. This property implies that belief states of high uncertainty tend to have lower value, while those of low uncertainty tend to have higher value. Emergent exploration behavior is a natural result of PWLC.
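The PWLC property can be illustrated with a small sketch. It assumes a two-state problem whose value function is the upper envelope of three alpha vectors; the vectors and function names are hypothetical and are not taken from any experiment in this paper.

```python
# Hypothetical alpha vectors for a two-state POMDP. Each vector gives the
# value of one fixed conditional plan, expressed per hidden state.
ALPHA_VECTORS = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.6)]


def value(belief):
    """PWLC value function: the maximum over linear functions of the belief."""
    return max(sum(a * b for a, b in zip(alpha, belief)) for alpha in ALPHA_VECTORS)


def mix(b1, b2, w):
    """Convex combination w * b1 + (1 - w) * b2 of two beliefs."""
    return tuple(w * x + (1.0 - w) * y for x, y in zip(b1, b2))


certain_a = (1.0, 0.0)   # the agent knows the state is the first one
certain_b = (0.0, 1.0)   # the agent knows the state is the second one
uncertain = mix(certain_a, certain_b, 0.5)

v_certain = value(certain_a)
v_uncertain = value(uncertain)
# Convexity: the value at a mixed belief never exceeds the mixture of values.
gap = 0.5 * value(certain_a) + 0.5 * value(certain_b) - v_uncertain
```

In this toy instance the maximally uncertain belief has a lower value than either certain belief, so an action that reduces uncertainty can be worth taking even if it yields no immediate reward.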
A belief space policy, where the belief is a sufficient statistic of the action-observation history, is a special case of a recurrent policy. One limitation of belief space planning is the requirement of an environment model for the belief update, which is typically unavailable or intractable in scenarios with complex environments. In model-free DRL, the recurrent neural network (RNN) is a widely used architecture for handling partial observability. Although RNN-based DRL approaches have achieved impressive successes in partially observable domains, we identify two limitations of model-free learning with RNNs. First, learning exploration behavior could be challenging. To the best of the authors’ knowledge, there is little evidence in the literature showing emergent exploration behavior learned by an RNN alone. Our conjecture is that the RNN has to simultaneously learn an encoding of the action-observation history that has a similar information structure to the belief space, and a mapping from this hidden encoding to an optimal action. This is a more challenging learning task than model-based planning, and one that a single black-box RNN might struggle to accomplish. Second, since there is no belief state in the RNN approach, the RNN policy learns directly from the actual reward instead of the belief space reward. The actual reward could be very noisy due to different realizations of the hidden state. This high reward variance poses challenges to reinforcement learning algorithms.
Besides, within the imperfect-information and partially observable domains, different problems have different levels of difficulty. Most works deal with domains where the hidden state has a well-modeled probabilistic relationship with the observations, such as partial observability due to sensor noise or failure, limited field of view, or screen flickering. These types of partial observability are relatively simple, in the sense that the hidden information can be inferred without bias via Bayes’ rule. In the remaining cases, the hidden state cannot be directly inferred from the observation. For example, in a poker game, the observation is all the hands that have been played, and the hidden state is the hands that have not been revealed. There is no probabilistic relationship between the hidden state and the observation. Nonetheless, inferring the hidden state is still possible given knowledge about the agent types and assumptions about rationality (agent modeling). For example, in Bayesian game theory, a Bayesian-Nash equilibrium is well-defined given a joint equilibrium policy profile assuming perfect rationality. However, the resulting belief is likely to be biased, since it is unlikely that the actual agent adopts the exact model policy. Moreover, in even more complicated imperfect-information scenarios, agent types could also be uncertain. For example, in the one-night werewolf game, agents do not know whether the other agents are their allies or enemies. In this case, one has to jointly reason about the (hidden) agent types and their policies, which is more challenging than the aforementioned situations.
This paper presents an algorithmic framework for learning robust policies in asymmetric imperfect-information competitive games. We mainly focus on scenarios where the opponent type is unknown to the protagonist agent but the joint reward is strongly correlated with this hidden type. This setting models a spectrum of real-world scenarios, but has seldom been studied in the context of MARL.
In this section, we review the preliminaries of the decision-making framework and solution techniques.
Partially observable stochastic games
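A partially observable stochastic game (POSG) is a tuple $\langle \mathcal{I}, \mathcal{S}, b^0, \{\mathcal{A}_i\}, \{\mathcal{O}_i\}, \mathcal{P}, \{R_i\} \rangle$, where
$\mathcal{I}$ is a finite set of agents indexed by $1, \ldots, n$
$\mathcal{S}$ is a state space
$b^0$ is the prior state distribution
$\mathcal{A}_i$ is the action space of each agent $i$, and we use $a = (a_1, \ldots, a_n)$ to denote the joint action
$\mathcal{O}_i$ is the observation space for each agent $i$, and we use $o = (o_1, \ldots, o_n)$ to denote the joint observation
$\mathcal{P}$ is the Markovian state transition and observation probability, which is denoted as $P(s' \mid s, a)$ and $P(o \mid s', a)$
$R_i(s, a)$ is the reward function of each agent $i$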
Asymmetric imperfect-information POSG
In this paper, we primarily focus on a special subset of POSGs, which we refer to as asymmetric imperfect-information games (AIIG). We define an AIIG as a POSG in which not all the agents hold the same initial belief over the state and agent type distributions, i.e., $b^0_i \neq b^0_j$ for some agents $i$ and $j$, where $i \neq j$ and $b^0_i$ denotes the initial belief held by agent $i$. This is in contrast to the common prior assumption in Bayesian games, which is unrealistic in some scenarios.
AIIG models a variety of real-world scenarios. For example, in poker games, agents have perfect knowledge of their own cards while maintaining a belief distribution over the others’ cards. Therefore, each agent has its own initial belief that is distinct from the others’. In a buyer-seller game, the seller knows the true value of the goods while the buyer does not, which leads to different initial beliefs over the value of the goods. In an urban-security scenario, suppose a police officer wants to identify a terrorist among a crowd of people. The officer does not have prior knowledge of the type of each person, so he has to assign the same belief to each of them. In contrast, the terrorist knows his own type and the types of all the other, innocent people.
Within AIIG, we focus on a special subset, where the protagonist agent is uncertain about the types of its opponents, while its opponents have full observability over the global state. This special case is difficult for the protagonist, as its opponents have an information advantage.
POSG with extended observation space
We propose a small modification to the POSG framework, which we refer to as POSG with extended observation space (POSGEOS). In a POSG, one of the general assumptions is that an agent cannot observe the observations and actions of the other agents. We propose to extend the observation space with the actions taken by the other agents. We argue that this is a realistic assumption in many real game scenarios. For example, in the buyer-seller game, the actions are the prices proposed by the buyer or the seller, which are observed by both agents. In games with deterministic state transitions, the actions taken by the agents can often be deduced from the state transition. Effectively, the actions are also observable in these games.
The motivation for proposing POSGEOS is to tackle imperfect-information scenarios with uncertain agent types. The idea is to infer the hidden types of the other agents from their observed actions, by reasoning about their policies under some rationality assumption.
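The observation extension can be sketched as a thin wrapper around the per-agent observations. The function name `extend_observations`, the dictionary layout, and the buyer-seller values below are assumptions made for illustration; they do not describe the implementation used in this work.

```python
def extend_observations(observations, last_actions):
    """Augment each agent's observation with the other agents' last actions.

    observations: dict mapping agent id -> that agent's private observation
    last_actions: dict mapping agent id -> the action it took at the last step
    Returns a dict mapping agent id -> (private observation, others' actions).
    """
    extended = {}
    for agent, obs in observations.items():
        # An agent sees everyone's last action except its own (already known).
        others = {other: act for other, act in last_actions.items() if other != agent}
        extended[agent] = (obs, others)
    return extended


# Hypothetical buyer-seller step: actions are the proposed prices.
observations = {"buyer": "budget=10", "seller": "true_value=7"}
last_actions = {"buyer": 6, "seller": 9}
extended = extend_observations(observations, last_actions)
print(extended["buyer"])   # ('budget=10', {'seller': 9})
```

The private part of each observation is unchanged, so the seller’s knowledge of the true value stays hidden from the buyer; only the proposed prices become common knowledge.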
Belief space reward
In single-agent partially observable domains, the value function is defined as the expected cumulative reward with respect to the state-action distribution under the belief space policy $\pi(a \mid b)$,
$$V^{\pi} = \mathbb{E}_{s_t \sim p,\; a_t \sim \pi(\cdot \mid b_t)} \left[ \sum_{t} \gamma^{t}\, r(s_t, a_t) \right], \tag{1}$$
where $p(s)$ is the state distribution, $b(s)$ is the belief over the state, and $\gamma$ is the discount factor. If the belief is unbiased, then $b(s) = p(s)$, and Eq. 1 reduces to
$$V^{\pi} = \mathbb{E}_{a_t \sim \pi(\cdot \mid b_t)} \left[ \sum_{t} \gamma^{t}\, r_b(b_t, a_t) \right], \tag{2}$$
where $r_b(b, a) = \sum_{s} b(s)\, r(s, a)$ is the belief space reward. In reinforcement learning, we sample the reward from the environment. The belief space reward sample $r_b(b_t, a_t)$ has lower variance than the actual reward sample $r(s_t, a_t)$, because the uncertainty associated with the state distribution has been analytically marginalized out. As a result, learning in the belief space benefits from the low reward variance, in contrast to RNN-based approaches that learn directly from the state space reward, which has higher variance.
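The variance argument can be checked numerically with a small sketch. It assumes a single fixed action, a binary hidden state, and an unbiased belief; the reward values and names are hypothetical.

```python
import random
import statistics

# Hypothetical reward r(s, a) for one fixed action, per hidden state.
REWARD = {0: 1.0, 1: -1.0}
# Unbiased belief: it equals the true distribution of the hidden state.
BELIEF = {0: 0.5, 1: 0.5}


def sample_state_reward(rng):
    """Sample a hidden state from the true distribution and return r(s, a)."""
    state = rng.choices([0, 1], weights=[BELIEF[0], BELIEF[1]])[0]
    return REWARD[state]


def belief_reward():
    """Belief space reward: the hidden state is marginalized out analytically."""
    return sum(BELIEF[s] * REWARD[s] for s in BELIEF)


rng = random.Random(0)
state_samples = [sample_state_reward(rng) for _ in range(10_000)]
belief_samples = [belief_reward() for _ in range(10_000)]

state_var = statistics.pvariance(state_samples)    # close to 1
belief_var = statistics.pvariance(belief_samples)  # exactly 0
```

Both estimators have the same mean when the belief is unbiased, but only the state-space samples carry the variance due to the hidden state.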
In general, however, the state distribution and the belief could be different, for example, when the environment model used for the belief update is biased. In this case, the policy maximizing the belief space cumulative reward (Eq. 2) does not necessarily maximize the actual cumulative reward (Eq. 1). That is, the agent learns a policy that is optimal in its imagined world but actually sub-optimal, due to the discrepancy between its world model and the actual world. This makes it challenging to solve asymmetric imperfect-information games with uncertain opponent types. On the one hand, we want to exploit the belief space reward for stable learning. On the other hand, the belief update requires an opponent model, which is likely to be biased. Therefore, accurately modeling the opponent is crucial in our problem.
We first give an overview of our approach. We use MARL for policy learning, where competitive agents are trained against each other to consistently improve their skills. We use a neural network to represent a belief space policy that maps a belief over the hidden state to an action. The belief state is updated via Bayes’ rule using a learned model of the opponent policy. The opponent model learning process consists of an ensemble policy training step and a policy distillation step. We apply a neuro-evolutionary method to improve the diversity of the ensemble population for robustness. The above steps are illustrated in Fig. 1. We then develop a stochastic optimization framework to meta-optimize the policy ensemble allocation for an improved balance between robustness and complexity. We present the details of each step in the following sections.
MARL with ensemble training
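In order to improve the policy robustness of the protagonist agent, we formulate its RL objective as the average cumulative reward against an ensemble of opponent policies of size $K$,
$$J(\pi_i) = \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}_{a_i \sim \pi_i,\; a_{-i} \sim \pi_{-i}^{(k)}} \left[ \sum_{t} \gamma^{t}\, r_i(s_t, a_t) \right], \tag{4}$$
where $\pi_i$ is the protagonist policy, $\pi_{-i}^{(k)}$ is the $k$-th opponent policy in the ensemble, $a_t = (a_{i,t}, a_{-i,t})$ is the joint action, $r_i$ is the protagonist reward, and $\gamma$ is the discount factor. The policy ensemble is itself learned by training RL agents against the protagonist policy. Via this self-play, both the protagonist agent and its opponent improve their policies. Nonetheless, there is no explicit mechanism to enforce distinction among the policies within the ensemble. As a result, there could be redundant policies that are very similar to the others.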
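The averaging in this objective can be sketched on a one-step matrix game, where expected returns can be computed exactly. The payoff matrix, the two opponent policies, and the function names are hypothetical; a real implementation would estimate the returns from rollouts instead.

```python
def expected_payoff(payoff, protagonist, opponent):
    """Expected protagonist payoff when both agents play mixed policies."""
    return sum(protagonist[i] * opponent[j] * payoff[i][j]
               for i in range(len(protagonist))
               for j in range(len(opponent)))


def ensemble_objective(payoff, protagonist, ensemble):
    """Average expected payoff against every opponent policy in the ensemble."""
    return sum(expected_payoff(payoff, protagonist, opp) for opp in ensemble) / len(ensemble)


# Hypothetical payoff: rows are protagonist actions, columns are opponent actions.
PAYOFF = [[2.0, -2.0],
          [-2.0, 2.0],
          [1.0, 1.0]]
ENSEMBLE = [(1.0, 0.0), (0.0, 1.0)]   # two opponent policies with opposite styles

specialist = (1.0, 0.0, 0.0)   # best response to the first opponent only
robust = (0.0, 0.0, 1.0)       # decent against both opponents

print(expected_payoff(PAYOFF, specialist, ENSEMBLE[0]))   # 2.0
print(ensemble_objective(PAYOFF, specialist, ENSEMBLE))   # 0.0
print(ensemble_objective(PAYOFF, robust, ENSEMBLE))       # 1.0
```

The specialist scores highest against the single opponent it was tuned for, but the ensemble average favors the policy that performs reasonably against both.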
To address this redundancy issue, we apply the collaborative evolutionary reinforcement learning (CERL) approach. The key idea is to use different hyper-parameter settings for each opponent policy, while using an off-policy learning algorithm and a shared experience replay buffer to retain the advantage of concurrently training multiple policies. Furthermore, a neuro-evolutionary algorithm is applied to create mutated policies from the ensemble, and the trajectories under the mutated policies are also stored in the shared experience replay buffer for better diversity and exploration.
Belief space policy and belief update
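In asymmetric imperfect-information games, however, the global state is not fully observable. We use the belief space approach for agent policy learning. Agents explicitly maintain a belief over the hidden states (e.g., the hidden state includes the actual opponent types), and learn a belief space policy that maps a belief to an action. We parameterize this mapping using a multi-layer perceptron (MLP). The learning objective, instead of Eq. 4, now becomes
$$J(\pi_i) = \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}_{a_i \sim \pi_i(\cdot \mid b_t),\; a_{-i} \sim \pi_{-i}^{(k)}} \left[ \sum_{t} \gamma^{t}\, r_b(b_t, a_t) \right], \tag{5}$$
where $K$ is the size of the opponent policy ensemble, $\pi_{-i}^{(k)}$ is its $k$-th policy, $b_t$ is the protagonist’s belief over the hidden state at time $t$, and $r_b(b, a) = \sum_{s} b(s)\, r_i(s, a)$ is the belief space reward.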
A belief update mechanism is required to fully specify the agent policy. The belief is the posterior distribution over the hidden states given action and observation history, . In the proposed POSGEOS framework, agent observation includes the actions taken by the other agents, therefore,
where denotes the remaining part of the observation, excluding the other agents’ actions.
Using Bayes’ rule, the following recursive update can be derived,
In order to further simplify Eq. 6, we make the following approximation,
where denotes the action probability taken by agent with type when it observes . The assumption behind Eq. 7 is that all the other agents make decisions based on their own observations, without any explicit coordination mechanism.
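As a simplified sketch of this kind of update, suppose the only hidden variable is a static opponent type, so that no state transition model is needed. The posterior then follows the familiar form: new belief in a type is proportional to the old belief times the probability that an opponent of that type takes the observed action. The type names, the action label, and the probabilities below are hypothetical; in the full framework the action probabilities would come from the learned opponent policy models.

```python
def update_type_belief(belief, action_probs):
    """One Bayes step over opponent types.

    belief:       dict mapping type -> prior probability of that type
    action_probs: dict mapping type -> probability that an opponent of this
                  type takes the observed action given its own observation
    Returns the normalized posterior over types.
    """
    unnormalized = {t: belief[t] * action_probs[t] for t in belief}
    total = sum(unnormalized.values())
    if total == 0.0:
        # The observed action is impossible under every type model; the
        # sketch keeps the prior rather than dividing by zero.
        return dict(belief)
    return {t: p / total for t, p in unnormalized.items()}


# Hypothetical two-type example: the opponent moves toward the enemy base.
prior = {"ally": 0.5, "enemy": 0.5}
likelihood = {"ally": 0.1, "enemy": 0.8}   # modeled P(observed action | type)
posterior = update_type_belief(prior, likelihood)
```

With these numbers the posterior probability of the enemy type is 0.4 / 0.45, roughly 0.89. Applying the function once per time step gives a recursive filter; a biased opponent model would bias every step of it, which is why the quality of the opponent model matters.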
Eq. 6 and 7 imply that we need the policy of each agent $j$ for each of its possible types $\theta$. Recall that in the ensemble training step, we create $K$ different policies for each agent of each type. Here we use the shorthand $j_\theta$ to denote agent $j$ with type $\theta$. Each policy $\pi^{(k)}$ within one ensemble can be interpreted as one of the likely strategies that could be adopted by agent $j_\theta$. However, in the belief update equation, we need only one single policy for agent $j_\theta$. As a result, we need to synthesize the policy ensemble into one representative policy that best represents the average behavior of the ensemble. We propose to learn this representative policy by minimizing an information-theoretic distance between this policy and the policy ensemble. More specifically, we choose the Kullback-Leibler (KL) divergence as the distance measure, and formulate the following minimization objective for learning the representative policy $\pi_0$,
$$\min_{\pi_0} \; \frac{1}{K} \sum_{k=1}^{K} D_{\mathrm{KL}}\!\left( \pi^{(k)} \,\|\, \pi_0 \right), \tag{8}$$
whose minimizer is
$$\pi_0 = \frac{1}{K} \sum_{k=1}^{K} \pi^{(k)}, \tag{9}$$
which happens to be a simple average over the policies within one ensemble. Conceptually, this is straightforward to implement. However, computationally, averaging policies is undesirable, because $K$ could be large. Instead, we propose to store an additional action probability term in the shared experience replay buffer, and fit a policy network to samples of action probability from the experience replay using a mean squared error (MSE) loss. This operation approximates Eq. 9 at almost constant computational cost, since no additional computation is needed to obtain the action probability samples.
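The claim that the average policy minimizes the mean KL divergence from the ensemble members can be checked numerically. The sketch below assumes a single observation with three discrete actions; the three member policies are hypothetical, and the comparison candidates are arbitrary alternatives.

```python
import math


def kl(p, q):
    """KL divergence D(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def mean_kl(ensemble, candidate):
    """Average KL divergence from each ensemble policy to the candidate."""
    return sum(kl(p, candidate) for p in ensemble) / len(ensemble)


def average_policy(ensemble):
    """Action-wise mean of the ensemble policies."""
    k = len(ensemble)
    return [sum(p[a] for p in ensemble) / k for a in range(len(ensemble[0]))]


# Hypothetical action distributions of three ensemble members.
ensemble = [[0.7, 0.2, 0.1],
            [0.1, 0.8, 0.1],
            [0.3, 0.3, 0.4]]
representative = average_policy(ensemble)

# Alternatives to compare against: the uniform policy and each member itself.
candidates = [[1.0 / 3.0] * 3] + ensemble
best_alternative = min(mean_kl(ensemble, c) for c in candidates)
```

Here `mean_kl(ensemble, representative)` is no larger than `best_alternative`. Fitting a network to sampled action probabilities with an MSE loss targets the same mean, because the minimizer of a squared error over samples is their average.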
Policy ensemble optimization
The ensemble training step typically improves the robustness of the protagonist agent’s policy. However, two problems need to be addressed to make this approach more effective and efficient. First, we want a metric for measuring policy robustness, and we want to explicitly optimize this metric. Second, we want to minimize the additional computational overhead introduced by ensemble training.
We propose to address these two problems through a meta-optimization of the policy ensemble. Instead of using a fixed-size ensemble, we dynamically resize the ensemble through three operations: pop, append, and exchange. The pop operation randomly removes one policy from the ensemble and pushes it into a deactivation set. The append operation randomly selects one policy from the deactivation set and appends it to the ensemble. The exchange operation randomly selects one policy from each of the ensemble and the deactivation set and swaps them.
The objective of modifying the ensemble is to obtain a good trade-off between robustness and computational complexity, which is dominated by the ensemble size. We propose to measure the robustness via Procedure 1,
and we define the following metric,
where are weight parameters, and is the varying size of the policy ensemble.
The combined reward term is a measure of the robustness of the protagonist policy, which is noisy due to the intrinsic stochasticity of reinforcement learning, while the ensemble size is a surrogate measure for computational complexity. Therefore, minimizing this metric leads to an optimal trade-off between policy robustness and computational complexity. We interpret this minimization problem as a stochastic optimization over the power set of the initial policy ensemble. We solve this stochastic optimization via simulated annealing, as described in Procedure 2.
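A minimal sketch of such an annealing loop over ensemble subsets is given below. It is not a reproduction of Procedure 2: the cooling schedule, the step count, and in particular the cost function `toy_cost` are assumptions. The toy cost replaces the noisy reward-based robustness measure with a deterministic coverage proxy so that the example is reproducible.

```python
import math
import random


def propose(active, inactive, rng):
    """Apply one random pop / append / exchange move; return new copies."""
    active, inactive = list(active), list(inactive)
    moves = []
    if len(active) > 1:
        moves.append("pop")            # never empty the ensemble
    if inactive:
        moves.extend(["append", "exchange"])
    if not moves:
        return active, inactive
    move = rng.choice(moves)
    if move == "pop":
        inactive.append(active.pop(rng.randrange(len(active))))
    elif move == "append":
        active.append(inactive.pop(rng.randrange(len(inactive))))
    else:
        i, j = rng.randrange(len(active)), rng.randrange(len(inactive))
        active[i], inactive[j] = inactive[j], active[i]
    return active, inactive


def anneal(initial, cost, steps=200, t0=1.0, decay=0.98, seed=0):
    """Simulated annealing over subsets of the initial ensemble."""
    rng = random.Random(seed)
    active, inactive = list(initial), []
    current = cost(active)
    best, best_cost = list(active), current
    temperature = t0
    for _ in range(steps):
        cand_active, cand_inactive = propose(active, inactive, rng)
        cand = cost(cand_active)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if cand <= current or rng.random() < math.exp((current - cand) / temperature):
            active, inactive, current = cand_active, cand_inactive, cand
            if current < best_cost:
                best, best_cost = list(active), current
        temperature *= decay
    return best, best_cost


# Hypothetical setup: six policies, of which pairs share the same playing style.
STYLE = {pid: pid % 3 for pid in range(6)}


def toy_cost(active, size_weight=0.3):
    """Negative style coverage (robustness proxy) plus a penalty on ensemble size."""
    coverage = len({STYLE[pid] for pid in active})
    return -coverage + size_weight * len(active)


best, best_cost = anneal(list(range(6)), toy_cost)
```

In this toy setup redundant policies add cost without adding coverage, so the search is driven toward smaller ensembles that still cover all three styles. With a noisy cost, as in the actual setting, one would likely average several evaluations per candidate before applying the acceptance test.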
This section addresses the following questions:
Is it necessary to use ensemble training, considering its additional computation overhead?
Is it beneficial to explicitly model opponent policy and maintain a belief?
How much improvement do we get from ensemble training and ensemble meta-optimization?
Scenario: two-player asymmetric game
We design a two-player asymmetric-information game to evaluate our algorithm, as illustrated in Fig. 2. There are two agents: the protagonist agent (a grasshopper officer) and the opponent agent with two possible types (either an ally beaver or an enemy turtle). The opponent’s objective is to reach its home base (which depends on its type) as soon as possible. The protagonist’s objective is to identify the type of its opponent, and to obtain reward by tagging the opponent if it turns out to be an enemy turtle. Mistakenly tagging an ally beaver incurs a large penalty for the protagonist. The grasshopper protagonist does not swim, so once the opponent jumps into the river, the officer can no longer tag it. If the opponent is an enemy turtle, it receives a large penalty if tagged. The opponent always receives a penalty if it has not reached its base, and the penalty increases with its distance from its base. The detailed description of this game and the hyper-parameters are provided in the supplementary materials.
Based on the rules of this game, an enemy turtle might adopt multiple different strategies. For example, one strategy is to rush towards its home base to minimize the distance penalty. However, the protagonist can then quickly identify the enemy and try to tag it. As a result, the enemy might end up with a huge penalty from being tagged. Another strategy is to initially head towards the ally base, such that the officer is fooled into believing that the opponent is an ally. Once the enemy is close enough to the river, it can jump into the river and rush to its base. This strategy incurs a larger distance penalty, but might eventually obtain a higher reward by avoiding being tagged.
Ensemble training vs. single model
To answer the first question, we compared the protagonist policy learned by training against an ensemble of opponent policies with that learned by training against a single opponent policy. We used an ensemble that consists of four opponent policies, each learned by training against the protagonist policy. We used four different discount factors for the opponent learning objectives. An interpretation of this setting is a variety of opponent playing styles ranging from myopic to far-sighted strategies.
For comparison, we also trained the protagonist policy individually against each opponent model, so we obtained five protagonist policies in total. For evaluation, we trained five separate opponent evaluation policies, each corresponding to one of the protagonist policies. The evaluation policies all used the same discount factor.
Fig. 5 shows the training and evaluation rewards. During training, the single-model policies generally lead to higher protagonist reward, while ensemble training results in the lowest protagonist reward. This suggests that the protagonist policy overfits to the single opponent model it is trained against, thus achieving high training reward but low evaluation reward. In contrast, the protagonist policy trained against the ensemble achieves the best evaluation reward. It is worth pointing out that, in the second single-model setting, although the hyper-parameter is exactly the same as that of the evaluation opponent, the evaluation reward is still significantly worse than the training reward. This is not surprising, as agents could learn different policies even with the same hyper-parameter setting. Therefore, overfitting is almost inevitable when training against a single model.
Belief space policy vs. RNN
To answer the second question, we replaced the belief space policy with a recurrent policy parameterized by an LSTM. Fig. 5 and Table 1 show the comparison between these two settings, where the belief space policy consistently outperforms the recurrent policy. This result agrees with our conjecture that learning a recurrent policy might be difficult due to the lack of prior knowledge of the information structure and the high-variance state-space reward.
Table 1
| belief space, with EO & CE | -13.2 / -14.4 | -90.8 / -83.0 |
| LSTM, with EO & CE | -16.5 / -17.7 | -80.6 / -66.2 |
| belief space, w/o EO & CE | -11.8 / -16.5 | -73.8 / -58.6 |
| LSTM, w/o EO & CE | -17.2 / -16.8 | -54.2 / -49.4 |
To answer the third question, we compared our algorithm with its ablated versions: (I) without neuro-evolution, and (II) without both neuro-evolution and ensemble optimization. For ablated version II, we randomly sampled subsets of the ensemble from its power set, and used the fixed subset for training. Fig. 5 and Table 2 show the training and evaluation rewards of the full and ablated versions of our algorithm. The results suggest that both neuro-evolution and ensemble optimization contribute substantially to the performance improvement.
Table 2
| with EO & CE | -13.2 / -14.4 | -90.8 / -83.0 |
| w/o EO | -15.0 / -15.6 | -73.6 / -65.8 |
| w/o EO & CE | -11.8 / -16.5 | -73.8 / -58.6 |
Our contributions can be summarized as follows:
We propose a new decision-making framework, POSGEOS, modified from POSG, which is suitable for reasoning in scenarios involving uncertain opponent types.
Within this framework, we propose algorithms based on MARL and ensemble training for robust opponent modeling and posterior inference over the opponent type from the observed actions.
We propose an explicit metric for policy robustness evaluation, and formulate a stochastic optimization to maximize robustness and minimize computational complexity.
We empirically demonstrate that explicit opponent modeling outperforms a black-box RNN approach, and that the stochastic optimization yields better results (in terms of the robustness-complexity trade-off) than the standard ensemble training approach.
References
[1] (2018) Autonomous agents modelling other agents: a comprehensive survey and open problems. Artificial Intelligence 258, pp. 66–95.
[2] (2018) Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177.
[3] (2017) Emergent complexity via multi-agent competition. arXiv preprint arXiv:1710.03748.
[4] (1998) Model-based learning of interaction strategies in multi-agent systems. Journal of Experimental & Theoretical Artificial Intelligence 10 (3), pp. 309–332.
[5] (2014) Multiagent learning in the presence of memory-bounded agents. Autonomous Agents and Multi-Agent Systems 28 (2), pp. 182–213.
[6] (2004) Learning to play Bayesian games. Games and Economic Behavior 46 (2), pp. 282–303.
[7] (2003) Case-based plan recognition in computer games. In International Conference on Case-Based Reasoning, pp. 161–170.
[8] (2004) Dynamic programming for partially observable stochastic games. In AAAI, Vol. 4, pp. 709–715.
[9] (2015) Deep recurrent Q-learning for partially observable MDPs. In 2015 AAAI Fall Symposium Series.
[10] (2016) Deep reinforcement learning from self-play in imperfect-information games. arXiv preprint arXiv:1603.01121.
[11] (2018) Human-level performance in first-person multiplayer games with population-based deep reinforcement learning. arXiv preprint arXiv:1807.01281.
[12] (2017) Population based training of neural networks. arXiv preprint arXiv:1711.09846.
[13] (1998) Planning and acting in partially observable stochastic domains. Artificial Intelligence 101 (1-2), pp. 99–134.
[14] (2019) Collaborative evolutionary reinforcement learning. arXiv preprint arXiv:1905.00976.
[15] SARSOP: efficient point-based POMDP planning by approximating optimally reachable belief spaces.
[16] (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pp. 6379–6390.
[17] (2013) Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
[18] (2017) DeepStack: expert-level artificial intelligence in heads-up no-limit poker. Science 356 (6337), pp. 508–513.
[19] (2016) A concise introduction to decentralized POMDPs. Vol. 1, Springer.
[20] (2018) OpenAI Five. OpenAI blog.
[21] (2018) Modeling others using oneself in multi-agent reinforcement learning. arXiv preprint arXiv:1802.09640.
[22] (2007) Memory-bounded dynamic programming for Dec-POMDPs. In IJCAI, pp. 2009–2015.
[23] (2017) Mastering the game of Go without human knowledge. Nature 550 (7676), pp. 354.
[24] (2010) Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems, pp. 2164–2172.
[25] (2016) Plan recognition as planning revisited. In IJCAI, pp. 3258–3264.
[26] (2013) DESPOT: online POMDP planning with regularization. In Advances in Neural Information Processing Systems, pp. 1772–1780.
[27] (2017) Distral: robust multitask reinforcement learning. In Advances in Neural Information Processing Systems, pp. 4496–4506.
[28] Collision avoidance for unmanned aircraft using Markov decision processes. In AIAA Guidance, Navigation, and Control Conference, pp. 8040.
[29] (2016) AI wolf contest: development of game AI using collective intelligence. In Computer Games, pp. 101–115.
[30] (2017) Ensemble adversarial training: attacks and defenses. arXiv preprint arXiv:1705.07204.
[31] (2019) AlphaStar: mastering the real-time strategy game StarCraft II. DeepMind Blog.