Learning Generalizable Risk-Sensitive Policies to Coordinate in Decentralized Multi-Agent General-Sum Games

by Ziyi Liu, et al.
Nankai University

While various multi-agent reinforcement learning methods have been proposed for cooperative settings, few works investigate how self-interested learning agents achieve mutual coordination in decentralized general-sum games and generalize pre-trained policies to non-cooperative opponents during execution. In this paper, we present a generalizable and sample-efficient algorithm for multi-agent coordination in decentralized general-sum games without any access to other agents' rewards or observations. Specifically, we first learn the distributions over the returns of individual agents and estimate a dynamic risk-seeking bonus to encourage agents to discover risky coordination strategies. Furthermore, to avoid overfitting to opponents' coordination strategies during training, we propose an auxiliary opponent modeling task so that agents can infer their opponents' type and dynamically alter their strategies during execution. Empirically, we show that agents trained via our method achieve mutual coordination during training and avoid being exploited by non-cooperative opponents during execution, outperforming all baseline methods and reaching state-of-the-art performance.




1 Introduction

Inspired by advances in deep reinforcement learning (DRL) [Mnih et al., 2015, Lillicrap et al., 2015, Silver et al., 2017], many researchers have recently focused on utilizing DRL methods to tackle multi-agent problems [Vinyals et al., 2019, Singh et al., 2020, Berner et al., 2019]. However, most of these works either consider fully cooperative multi-agent reinforcement learning (MARL) settings [Foerster et al., 2018, Rashid et al., 2018, Son et al., 2019, Wang et al., 2020, Qiu et al., 2021] or study general-sum games under restrictive assumptions about opponents [Foerster et al., 2017, Papoudakis et al., 2021, Mealing and Shapiro, 2015], e.g., that they are stationary [Papoudakis et al., 2021] or altruistic [Tang et al., 2021, Peysakhovich and Lerer, 2017b]. In future applications of MARL, such as self-driving cars [Toghi et al., 2021] and human-robot interaction [Shirado and Christakis, 2017], multiple learning agents optimize their own rewards independently in general-sum games where win-win outcomes are achieved only through coordination, which is often coupled with risk [Wang et al., 2021, Foerster et al., 2017, Wang et al., 2018] ("risk" here refers to the uncertainty of future outcomes [Dabney et al., 2018a]), and their pre-trained policies should generalize to non-cooperative opponents during execution.

To achieve coordination alongside other learning agents and generalize learned policies to non-cooperative opponents, an agent must be willing to undertake a certain amount of risk and identify the opponents' type efficiently. One set of approaches uses explicit reward shaping to force agents to coordinate [Tampuu et al., 2017, Peysakhovich and Lerer, 2017b, Tang et al., 2021], which can be viewed as shaping the risk degree of coordination strategies. To learn generalizable policies, [Tang et al., 2021, Wang et al., 2018] propose to train an adaptive agent with population-based training methods. However, training a population of policy profiles is computationally intractable in complex real-world multi-agent scenarios and cannot leverage features of the opponent's type that emerge during multi-agent exploration. Other works either treat the other agents as stationary [Papoudakis et al., 2021, Grover et al., 2018, Papoudakis and Albrecht, 2020, Wang et al., 2018] or directly access the opponent's policy parameters [Foerster et al., 2017].

By contrast, we are interested in a less restrictive setting in which we assume no access to opponents' rewards, observations, or policy parameters; instead, each agent can infer other agents' current strategies from their past behaviors. Consider a self-driving scenario: the ego car has no knowledge of what the cars in front observe or what their true payoffs are, but it can see the actions they take, e.g., risky merging and lane changing, and thus adjust its own strategy.

In this paper, one key insight is that learning from an opponent's past behaviors allows the agent to infer the opponent's type and dynamically alter its strategy between different modes, e.g., cooperate or compete, during execution. Moreover, given that the other learning agents are non-stationary, decision-making over the agent's return distributions enables it to tackle uncertainties arising from other agents' behaviors and to alter its risk preference, i.e., from risk-neutral to risk-seeking, in order to discover coordination strategies. Motivated by this analysis, we propose GRSP, a Generalizable Risk-Sensitive MARL algorithm, whose contributions are summarized as follows:

Leading to mutual coordination in decentralized general-sum games. We estimate a dynamic risk-seeking bonus using the complete distortion risk measure Wang's Transform (WT) [Wang, 2000] to encourage agents to discover risky cooperative strategies. The risk-seeking bonus only affects the action-selection procedure instead of shaping environment rewards, and it decreases throughout training, leading to an unbiased policy.

Generalizing pre-trained policies to non-cooperative opponents during execution. Policies learned independently can overfit to the other agents' policies during training and fail to generalize during execution [Lanctot et al., 2017]. We therefore train each learning agent with two objectives: a standard quantile regression objective [Koenker and Bassett Jr, 1978, Dabney et al., 2018b] and a supervised opponent modeling objective, which models the behaviors of the opponent and is applied to an intermediate representation of the value network. The auxiliary supervised task allows the shallow part of the value network, which is designed as a feature extractor, to be influenced by the opponent's past behaviors. During execution, the agent can update the parameters of the feature extractor using its local observations and the observed opponent's actions, forcing the intermediate representation to adapt to the new opponent.

Evaluating in multi-agent settings. We evaluate GRSP in four different Markov games: Monster-Hunt [Tang et al., 2021, Zhou et al., 2021], Escalation [Tang et al., 2021, Peysakhovich and Lerer, 2017b], Iterated Prisoners' Dilemma (IPD) [Foerster et al., 2017, Wang et al., 2018], and Iterated Stag Hunt (ISH) [Wang et al., 2021, Tang et al., 2021]. Compared with several baselines, including Multi-Agent Deep Deterministic Policy Gradient (MADDPG) [Lowe et al., 2017], Multi-Agent PPO (MAPPO) [Yu et al., 2021], Local Information Agent Modelling (LIAM) [Papoudakis et al., 2021], and Independent Actor-Critic (IAC) [Christianos et al., 2020], GRSP learns substantially faster, achieves mutual coordination during training, and generalizes to non-cooperative opponents during execution, reaching state-of-the-art performance.

2 Related Work

Risk-sensitive RL. Risk-sensitive policies, which depend on more than the mean of the outcomes, enable agents to handle the intrinsic uncertainty arising from the stochasticity of the environment. In MARL, these uncertainties are amplified by the non-stationarity and partial observability created by other agents that change their policies during learning [Wang et al., 2022, Papoudakis et al., 2019, Hernandez-Leal et al., 2017]. Distributional RL [Bellemare et al., 2017, Dabney et al., 2018b] provides a new perspective for optimizing policies under different risk preferences within a unified framework [Dabney et al., 2018a, Markowitz et al., 2021]. With distributions of the return, value functions can be approximated under different risk measures, such as variance-related measures [Di Castro et al., 2012, Ma et al., 2020, Prashanth and Ghavamzadeh, 2016], Conditional Value at Risk (CVaR) [Rockafellar and Uryasev, 2002, Chow et al., 2015], and WT [Wang, 2000], thus producing risk-averse or risk-seeking policies. Qiu et al. [Qiu et al., 2021] propose a cooperative MARL method called RMIX that applies the CVaR measure to the learned distributions of individual Q values to obtain risk-averse policies. However, RMIX focuses on the fully cooperative setting and learns a centralized but factored value function to guide optimization of the decentralized policies, so it cannot be applied beyond cooperative games. Lyu et al. [Lyu and Amato, 2018] propose LH-IQN, a decentralized risk-sensitive policy through which all agents seek higher team rewards. In contrast with these works, we are interested in decentralized general-sum games where agents use risk-seeking policies to explore coordination strategies and can generalize them to non-cooperative opponents.

Generalization across different opponents. Many real-world scenarios require agents to adapt to different opponents during execution. However, most existing works focus on learning a fixed, team-dependent policy in the fully cooperative setting [Sunehag et al., 2017, Rashid et al., 2018, Son et al., 2019, Qiu et al., 2021, Wang et al., 2020], which cannot generalize to slightly altered environments or new opponents. Other works either use population-based training to obtain an adaptive agent [Tang et al., 2021] or adapt to different opponents under the Tit-for-Tat principle [Wang et al., 2018, Peysakhovich and Lerer, 2017a]. Our work is closely related to test-time training methods [Sun et al., 2020, Hansen et al., 2020], which, however, focus on image recognition or single-agent policy adaptation. We consider generalization across different opponents in MARL and use opponents' past behaviors as supervision. Ad hoc teamwork [Stone et al., 2010, Zhang et al., 2020] also requires agents to generalize to new teams, but it focuses on cooperative games and has different concerns from ours.

Opponent modeling. Our approach to learning generalizable policies can be viewed as a kind of opponent modeling [Albrecht and Stone, 2018]. Such approaches model intentions [Wang et al., 2013, Raileanu et al., 2018], assume an assignment of roles [Losey et al., 2020], or exploit opponent learning dynamics [Foerster et al., 2017, Zhang and Lesser, 2010]. Our approach is similar to policy reconstruction methods [Raileanu et al., 2018], which make explicit predictions about the opponent's actions. However, instead of predicting the opponent's future actions, we learn from the opponent's past behaviors to update the belief, i.e., the parameters of the value network, about the opponent's type.

3 Preliminaries

Stochastic games. In this work, we consider multiple self-interested learning agents that interact with each other. We model the problem as a Partially Observable Stochastic Game (POSG) [Shapley, 1953, Hansen et al., 2004], which consists of $N$ agents, a state space $\mathcal{S}$ describing the possible configurations of all agents, and a set of actions $\mathcal{A}_i$ and a set of observations $\mathcal{O}_i$ for each agent $i$. At each time step, each agent $i$ receives its own observation $o_i \in \mathcal{O}_i$ and selects an action $a_i \in \mathcal{A}_i$ based on a stochastic policy $\pi_i$, which results in a joint action vector $\boldsymbol{a} = (a_1, \ldots, a_N)$. The environment then transitions to a new state based on the transition function $P(s' \mid s, \boldsymbol{a})$. Each agent obtains a reward $r_i(s, a_i)$ as a function of the state and its own action. The initial states are determined by a distribution $\rho_0$. We treat the reward "function" $R_i(s, a_i)$ of each agent as a random variable to emphasize its stochasticity, and use $Z_i^\pi = \sum_{t=0}^{T} \gamma^t R_i(s_t, a_{i,t})$ to denote the random variable of the cumulative discounted rewards, where $s_t \sim P(\cdot \mid s_{t-1}, \boldsymbol{a}_{t-1})$, $a_{i,t} \sim \pi_i(\cdot \mid o_{i,t})$, $\gamma \in [0, 1)$ is a discount factor, and $T$ is the time horizon.

Distributional reinforcement learning with quantile regression. Instead of learning the expected return $Q(s, a) = \mathbb{E}[Z(s, a)]$, distributional RL focuses on learning the full distribution of the random variable $Z(s, a)$ directly [Bellemare et al., 2017, Dabney et al., 2018b, a]. As in the scalar setting, a distributional Bellman optimality operator can be defined by

$\mathcal{T} Z(s, a) \overset{D}{=} R(s, a) + \gamma Z\!\left(s', \arg\max_{a'} \mathbb{E}\left[ Z(s', a') \right] \right),$

with $s'$ distributed according to the transition function $P(\cdot \mid s, a)$.

Bellemare et al. [Bellemare et al., 2017] show that the distributional Bellman operator is a contraction in a maximal form of the Wasserstein metric between probability distributions, but the proposed algorithm, C51, does not itself minimize the Wasserstein metric. This theory-practice gap was closed by Dabney et al. [Dabney et al., 2018b], who propose to use quantile regression for distributional RL. In this paper, we focus on the quantile representation used in QR-DQN [Dabney et al., 2018b], where the distribution of $Z$ is represented by a uniform mixture of $N$ supporting quantiles, $Z_\theta(s, a) = \frac{1}{N} \sum_{j=1}^{N} \delta_{\theta_j(s, a)}$, where $\delta_z$ denotes a Dirac at $z \in \mathbb{R}$, and each $\theta_j$ is an estimate of the quantile corresponding to the quantile fraction $\hat{\tau}_j = \frac{\tau_{j-1} + \tau_j}{2}$ with $\tau_j = \frac{j}{N}$ for $1 \le j \le N$. The state-action value can then be approximated by $Q(s, a) \approx \frac{1}{N} \sum_{j=1}^{N} \theta_j(s, a)$.
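As a concrete sketch (ours, not the authors' implementation), the midpoint quantile fractions and the resulting mean-based Q estimate can be computed as follows:

```python
import numpy as np

def midpoint_fractions(n: int) -> np.ndarray:
    """Quantile fractions tau_hat_j = (2j - 1) / (2N), the midpoints used by QR-DQN."""
    return (2 * np.arange(1, n + 1) - 1) / (2.0 * n)

def q_value(quantiles: np.ndarray) -> float:
    """State-action value as the mean of the uniform mixture of N quantile estimates."""
    return float(np.mean(quantiles))

theta = np.array([-1.0, 0.0, 1.0, 4.0])  # toy quantile estimates for one (s, a) pair
assert np.allclose(midpoint_fractions(4), [0.125, 0.375, 0.625, 0.875])
assert q_value(theta) == 1.0
```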

These quantile estimates are trained using the Huber [Huber, 1992] quantile regression loss. The loss of the quantile value network is then given by

$\mathcal{L}(\theta) = \sum_{j=1}^{N} \mathbb{E}_{j'} \left[ \rho_{\hat{\tau}_j}^{\kappa} \left( \delta_{j j'} \right) \right],$

where $\delta_{j j'} = r + \gamma \theta_{j'}(s', a') - \theta_j(s, a)$ is the distributional TD error, and

$\rho_{\hat{\tau}}^{\kappa}(u) = \left| \hat{\tau} - \mathbb{I}\{u < 0\} \right| \frac{\mathcal{L}_\kappa(u)}{\kappa},$

where $\mathbb{I}$ is the indicator function and $\mathcal{L}_\kappa$ is the Huber loss:

$\mathcal{L}_\kappa(u) = \begin{cases} \frac{1}{2} u^2, & \text{if } |u| \le \kappa \\ \kappa \left( |u| - \frac{1}{2} \kappa \right), & \text{otherwise.} \end{cases}$
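A minimal NumPy version of this loss (our sketch; QR-DQN implements it with deep networks and minibatches) is:

```python
import numpy as np

def huber(u, kappa=1.0):
    """Huber loss L_kappa(u): quadratic near zero, linear beyond kappa."""
    return np.where(np.abs(u) <= kappa, 0.5 * u ** 2, kappa * (np.abs(u) - 0.5 * kappa))

def quantile_huber_loss(theta, target, tau_hat, kappa=1.0):
    """Quantile Huber loss between N quantile estimates theta and M target
    samples (e.g. r + gamma * theta'), averaged over the targets."""
    u = target[None, :] - theta[:, None]         # TD errors delta_{jj'}, shape (N, M)
    weight = np.abs(tau_hat[:, None] - (u < 0))  # |tau_hat_j - 1{delta < 0}|
    return float(np.mean(np.sum(weight * huber(u, kappa) / kappa, axis=0)))

tau_hat = (2 * np.arange(1, 3) - 1) / 4.0        # N = 2 midpoint fractions
assert quantile_huber_loss(np.zeros(2), np.zeros(2), tau_hat) == 0.0  # perfect fit
assert quantile_huber_loss(np.zeros(2), np.ones(2), tau_hat) > 0.0
```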
Distorted expectation. A distorted expectation is a risk-weighted expectation of the value distribution under a specific distortion function [Wirch and Hardy, 2001]. A function $g: [0, 1] \to [0, 1]$ is a distortion function if it is non-decreasing and satisfies $g(0) = 0$ and $g(1) = 1$ [Balbás et al., 2009]. The distorted expectation of $Z$ under $g$ is defined as $\mathbb{E}_g[Z] = \int_0^1 F_Z^{-1}(\tau) \, dg(\tau)$, where $F_Z^{-1}(\tau)$ is the quantile function at $\tau \in [0, 1]$ for the random variable $Z$. We introduce two common distortion functions as follows:


  • CVaR is the expectation of the lower or upper tail of the value distribution, corresponding to a risk-averse or risk-seeking policy respectively. Its distortion function is $g(\tau) = \min(\tau / \alpha, 1)$ (risk-averse) or $g(\tau) = \max\!\big( (\tau - (1 - \alpha)) / \alpha, 0 \big)$ (risk-seeking), where $\alpha \in (0, 1]$ denotes the confidence level.

  • WT is proposed by Wang [Wang, 2000]: $g_\lambda(\tau) = \Phi(\Phi^{-1}(\tau) + \lambda)$, where $\Phi$ is the cumulative distribution function of a standard normal. The parameter $\lambda$ is called the market price of risk and reflects systematic risk; under the distorted expectation above, $\lambda > 0$ yields a risk-averse and $\lambda < 0$ a risk-seeking policy.

CVaR assigns zero weight to all quantiles above (risk-averse) or below (risk-seeking) the confidence level, which can lead to erroneous decisions in some cases [Balbás et al., 2009]. Instead, WT is a complete distortion risk measure that uses all the information in the original distribution, which makes training much more stable; we demonstrate this empirically in Sec. 5.
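As an illustration (our own sketch, not the paper's code), the distorted expectation can be approximated on $N$ discrete quantiles by weighting each quantile with $g(\tau_j) - g(\tau_{j-1})$. The stdlib `NormalDist` supplies $\Phi$ and $\Phi^{-1}$; the choice $\lambda = -0.75$ as a risk-seeking setting follows the sign convention stated above and is our assumption:

```python
import numpy as np
from statistics import NormalDist

PHI = NormalDist()  # standard normal, provides cdf and inv_cdf

def cvar_averse(tau, alpha=0.25):
    """CVaR distortion g(tau) = min(tau / alpha, 1): all weight on the lower tail."""
    return np.minimum(tau / alpha, 1.0)

def wang(tau, lam):
    """Wang's Transform g(tau) = Phi(Phi^-1(tau) + lam)."""
    return np.array([PHI.cdf(PHI.inv_cdf(t) + lam) for t in tau])

def distorted_expectation(theta, g, **kw):
    """Approximate E_g[Z] = int F_Z^-1(tau) dg(tau) over N uniform quantile bins."""
    n = len(theta)
    edges = np.clip(np.arange(n + 1) / n, 1e-9, 1.0 - 1e-9)  # tau_0 ... tau_N
    return float(np.dot(np.diff(g(edges, **kw)), np.sort(theta)))

theta = np.array([-10.0, 1.0, 1.5, 2.0])  # asymmetric toy return distribution
neutral = float(np.mean(theta))
assert distorted_expectation(theta, cvar_averse) < neutral      # lower tail only: averse
assert distorted_expectation(theta, wang, lam=-0.75) > neutral  # upper tail up-weighted
```

Note that CVaR zeroes out most bins, while the Wang weights stay strictly positive everywhere, which is the "complete distortion" property mentioned above.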

4 Methods

In this section, we describe our proposed GRSP method. We first introduce the risk-seeking bonus used to encourage agents to discover coordination strategies in Sec. 4.1 and then propose the auxiliary opponent modeling task to learn generalizable policies in Sec. 4.2. Finally, we provide the details of test-time policy adaptation under different opponents in Sec. 4.3.

4.1 Risk-Seeking Bonus

Figure 1: Quantile value distribution of cooperation and defection in Sequential Stag Hunt weighted by WT compared with risk-neutral policy.

In this section, we first provide an illustrative example of the insight behind the risk-seeking bonus and then describe its details. Consider a two-player, 10-step sequential matrix game of Stag Hunt, where each player decides whether to hunt the stag (S) or hunt a hare (H) in each round. If both agents choose S, they receive the highest payoff, 2. However, if one agent defects, it receives a decent reward of 1 for eating the hare alone, while the other agent, having chosen S, suffers a big loss of -10. If both agents choose H, they each receive a payoff of 1.
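The pull toward the safe action can be seen with a one-line expected-value computation (our illustration, using the payoffs above):

```python
# If the opponent hunts stag with probability p, the per-step expected payoffs are:
#   E[S] = 2p - 10(1 - p) = 12p - 10,   E[H] = 1 (a lone hare-hunter gets 1 either way)
# so hunting stag is only optimal for a risk-neutral agent when p > 11/12.
def expected_stag(p):
    return 12 * p - 10

assert expected_stag(11 / 12) == 1.0  # indifference point with E[H] = 1
assert expected_stag(0.9) < 1.0       # even a 90%-cooperative partner is not enough
```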

Even state-of-the-art RL algorithms fail to discover such "risky" cooperation strategies [Tang et al., 2021, Peysakhovich and Lerer, 2017b, Wang et al., 2021]. One important reason is that the expected, i.e., risk-neutral, Q value ignores the complete distribution information, especially the upper- and lower-tail information when the learned distribution is asymmetric. Another reason is that when the risk is high, i.e., there is a large loss for being betrayed, the probability of finding the S-S (cooperation) strategy via policy gradient is very low [Tang et al., 2021].

Therefore, we adopt the distributional RL method to model the whole distribution of the Q value. The left part of Fig. 1 shows the quantile distributions of cooperation and defection under the risk-neutral policy learned by QR-DQN. The mean value of defection is higher than that of cooperation, but the quantile distribution of cooperation has a longer upper tail, which means it has a higher potential payoff.

We propose to use the WT distortion function to reweight the expectation of the quantile distribution. Specifically, the risk-seeking bonus for agent $i$ is

$\xi_i(o, a) = \frac{1}{N} \sum_{j=1}^{N} g'_\lambda(\hat{\tau}_j) \, \theta_j(o, a),$

where $g'_\lambda(\hat{\tau}_j)$ is the derivative of the WT distortion function at $\hat{\tau}_j$, and $\lambda$ controls the risk-seeking level. The right part of Fig. 1 shows the WT-weighted quantile distribution, in which the upper quantile values are multiplied by bigger weights and the lower quantile values by smaller weights, encouraging agents to adopt risky coordination strategies.
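A sketch (ours) of such a derivative-weighted bonus: the WT derivative is $g'_\lambda(\tau) = \varphi(\Phi^{-1}(\tau) + \lambda) / \varphi(\Phi^{-1}(\tau)) = \exp(-\lambda \Phi^{-1}(\tau) - \lambda^2 / 2)$, which up-weights upper quantiles when $\lambda < 0$; the toy quantile values are hypothetical:

```python
import numpy as np
from statistics import NormalDist

PHI = NormalDist()

def wt_derivative(tau, lam):
    """g'_lam(tau) = exp(-lam * Phi^-1(tau) - lam^2 / 2), the WT density weight."""
    x = np.array([PHI.inv_cdf(t) for t in tau])
    return np.exp(-lam * x - 0.5 * lam ** 2)

def risk_seeking_bonus(theta, lam=-0.75):
    """Reweighted mean of the quantile estimates; lam < 0 up-weights upper quantiles."""
    n = len(theta)
    tau_hat = (2 * np.arange(1, n + 1) - 1) / (2.0 * n)
    return float(np.mean(wt_derivative(tau_hat, lam) * theta))

coop = np.array([-10.0, 0.5, 1.0, 10.0])  # long upper tail (risky cooperation)
defect = np.array([0.5, 1.0, 1.0, 1.5])   # safe defection
assert np.mean(coop) < np.mean(defect)                        # risk-neutral prefers defection
assert risk_seeking_bonus(coop) > risk_seeking_bonus(defect)  # WT bonus prefers cooperation
```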

Instead of using the $\epsilon$-greedy method for exploration, we adopt the left truncated variance of the quantile distribution to further leverage intrinsic uncertainty for efficient exploration. The left truncated variance, analysed in [Mavrin et al., 2019], is defined as

$\sigma^2_+ = \frac{1}{2N} \sum_{j=N/2}^{N} \left( \theta_j - \theta_{N/2} \right)^2,$

where $\theta_{N/2}$ is the median quantile estimate. We anneal the two exploration bonuses dynamically so that in the end we produce unbiased policies. The annealing coefficient is defined as $c_t = c \sqrt{\log t / t}$, which is the parametric uncertainty decay rate [Koenker and Hallock, 2001], and $c$ is a constant factor. This approach leads to choosing the action such that

$a_t = \arg\max_{a'} \left( Q(o, a') + c_t \left( \xi(o, a') + \sqrt{\sigma^2_+(o, a')} \right) \right).$
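The pieces above can be sketched as follows (our illustration; the way the annealed bonuses are combined in the greedy step is our assumption, and the constant `c` is a hypothetical hyperparameter):

```python
import numpy as np

def left_truncated_variance(theta):
    """Upper-tail variability around the median quantile (Mavrin et al., 2019):
    sigma^2_+ = 1/(2N) * sum over the upper half of (theta_j - theta_median)^2."""
    theta = np.sort(theta)
    n = len(theta)
    med = theta[n // 2]
    return float(np.sum((theta[n // 2:] - med) ** 2) / (2 * n))

def decay(t, c=1.0):
    """Annealing coefficient c_t = c * sqrt(log t / t) for the exploration bonuses."""
    return c * np.sqrt(np.log(t) / t)

def select_action(q, bonus, tvar, t, c=1.0):
    """Hypothetical greedy step combining mean value, risk bonus and truncated variance."""
    return int(np.argmax(q + decay(t, c) * (bonus + np.sqrt(tvar))))

assert left_truncated_variance(np.zeros(4)) == 0.0
assert left_truncated_variance(np.array([-5.0, 0.0, 0.0, 9.0])) == 81.0 / 8
assert decay(100) < decay(10)  # the bonuses anneal over time
```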
4.2 Auxiliary Opponent Modeling Task

In order to alter the agent's strategies under different opponents, we split the value network into two parts: a feature extractor $f_{\phi_i}$ and a decision maker $d_{\psi_i}$. Specifically, the parameters of the value network for agent $i$ are divided into $\phi_i$ and $\psi_i$, i.e., $\theta_i = \{\phi_i, \psi_i\}$. The auxiliary opponent modeling task shares the feature extractor with the value network. We can update the parameters $\phi_i$ during execution using gradients from the auxiliary opponent modeling task, so that the extracted features can generalize to different opponents. The supervised prediction head is $h_{\zeta_i}$ with its specific parameters $\zeta_i$. The details of our network architecture are provided in the appendix.

During training, the agent can collect a set of transitions $\mathcal{D}_i = \{(o_i^t, a_i^t, \boldsymbol{a}_{-i}^t)\}$, where $\boldsymbol{a}_{-i}^t$ denotes the joint action of all agents except $i$ at time step $t$. We use the embeddings of agent $i$'s observations and actions to predict the joint action $\boldsymbol{a}_{-i}^t$; i.e., the prediction head $h_{\zeta_i}$ is a multi-head neural network whose outputs are soft-max distributions over the action space of each other agent, and the objective function of the auxiliary opponent modeling task can be formulated as

$\mathcal{L}_{aom}(\phi_i, \zeta_i) = \mathbb{E}_{\mathcal{D}_i} \left[ \ell_{CE} \left( h_{\zeta_i}\!\left( f_{\phi_i}(o_i^t, a_i^t) \right), \boldsymbol{a}_{-i}^t \right) \right],$

where $\ell_{CE}$ is the cross-entropy loss function. The strategies of opponents change constantly during multi-agent exploration, so various strategies emerge. The agent can leverage them to gain experience about how to make the best response by jointly optimizing the auxiliary opponent modeling task and the quantile value distribution. The joint training problem is therefore

$\min_{\theta_i, \zeta_i} \; \mathcal{L}(\theta_i) + \beta \, \mathcal{L}_{aom}(\phi_i, \zeta_i),$

where $\beta$ is a weighting coefficient.
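The two-headed architecture can be sketched in a few lines of NumPy (our illustration; the layer sizes, the linear layers, and the specific shapes are hypothetical choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical sizes: obs dim 4, hidden 8, 3 own actions, 3 opponent actions, 5 quantiles
W_feat = rng.normal(size=(4, 8)) * 0.1   # phi_i: shared feature extractor
W_q = rng.normal(size=(8, 3 * 5)) * 0.1  # psi_i: quantile head, 5 quantiles per action
W_opp = rng.normal(size=(8, 3)) * 0.1    # zeta_i: opponent-action prediction head

def forward(obs):
    z = np.tanh(obs @ W_feat)                  # shared representation f_phi(o)
    return (z @ W_q).reshape(3, 5), z @ W_opp  # theta_j(o, a) and opponent logits

def cross_entropy(logits, label):
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return float(-np.log(p[label]))

quantiles, logits = forward(rng.normal(size=4))
l_aom = cross_entropy(logits, label=1)   # supervised opponent-modeling term
assert quantiles.shape == (3, 5) and l_aom > 0.0
# the joint objective then adds beta * l_aom to the quantile regression loss
```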
4.3 Test-Time Policy Adaptation under Different Opponents

We assume the agent can observe the actions of its opponents during execution, so it can continue optimizing $\mathcal{L}_{aom}$ to update the parameters $\phi_i$ of the value network's feature extractor. Learning from opponents' past behaviors at test time lets the agent generalize its policy to different opponents efficiently. The test-time objective can be formulated as

$\min_{\phi_i} \; \mathcal{L}_{aom}(\phi_i, \zeta_i),$

while the decision maker $d_{\psi_i}$ is kept fixed.
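A minimal sketch (ours) of such a test-time update: only the extractor weights receive gradients, while the prediction head stays frozen; the single-layer network and the learning rate are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
W_feat = rng.normal(size=(4, 8)) * 0.5  # phi_i: adapted at test time
W_opp = rng.normal(size=(8, 3)) * 0.5   # zeta_i: frozen prediction head

def loss_and_grad(obs, opp_action):
    """Cross-entropy on the observed opponent action; gradient w.r.t. W_feat only."""
    z = np.tanh(obs @ W_feat)
    logits = z @ W_opp
    p = np.exp(logits - logits.max())
    p /= p.sum()
    ce = float(-np.log(p[opp_action]))
    dlogits = p.copy()
    dlogits[opp_action] -= 1.0            # d(CE)/d(logits) = softmax - one_hot
    dz = W_opp @ dlogits                  # backprop through the frozen head
    return ce, np.outer(obs, dz * (1.0 - z ** 2))  # tanh' = 1 - z^2

obs, opp_action = rng.normal(size=4), 2
before, grad = loss_and_grad(obs, opp_action)
W_feat -= 0.01 * grad                     # adapt the extractor only
after, _ = loss_and_grad(obs, opp_action)
assert after < before  # the representation adapts to the observed opponent behavior
```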
5 Experiments

In this section, we empirically evaluate our method on four multi-agent environments. In Sec. 5.1 and 5.2, we introduce the four environments used in our experiments and describe the baseline methods. In Sec. 5.3, we compare the performance of GRSP with the baselines. In Sec. 5.4, we evaluate the generalization ability of GRSP to different opponents during execution. Ablations are studied in Sec. 5.5. More details can be found in the Appendix.

5.1 Environment Setup

Repeated games. We consider two repeated matrix games: Iterated Stag Hunt (ISH) and Iterated Prisoners' Dilemma (IPD). Both consist of two agents and a constant episode length of 10 time steps [Foerster et al., 2017, Tang et al., 2021, Wang et al., 2021]. At each time step, an agent can choose either cooperation or defection. If both agents cooperate simultaneously, each gets a bonus of 2. However, if only a single agent cooperates, it gets a penalty of -10 in ISH and -1 in IPD, while the other agent gets a bonus of 1 and 3, respectively. If both agents defect, each gets a bonus of 1 in ISH and 0 in IPD. The optimal strategy in both ISH and IPD is to cooperate at every time step, and the highest global payoff of the two agents is 40, i.e., 20 for each of them.
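The payoffs described above can be written down directly; this small script (ours) checks the episode returns:

```python
C, D = 0, 1  # cooperate, defect
# payoff[(a1, a2)] -> (r1, r2), per the per-step rewards described above
ISH = {(C, C): (2, 2), (C, D): (-10, 1), (D, C): (1, -10), (D, D): (1, 1)}
IPD = {(C, C): (2, 2), (C, D): (-1, 3), (D, C): (3, -1), (D, D): (0, 0)}

def episode_return(game, a1_seq, a2_seq):
    """Undiscounted per-agent returns over a repeated matrix game episode."""
    r1 = sum(game[(a, b)][0] for a, b in zip(a1_seq, a2_seq))
    r2 = sum(game[(a, b)][1] for a, b in zip(a1_seq, a2_seq))
    return r1, r2

always_c, always_d = [C] * 10, [D] * 10
assert episode_return(ISH, always_c, always_c) == (20, 20)   # mutual coordination: global 40
assert episode_return(IPD, always_c, always_c) == (20, 20)
assert episode_return(ISH, always_c, always_d) == (-100, 10) # a lone cooperator is exploited
```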

Figure 2: Monster-Hunt.

Monster-Hunt. The environment is a grid-world consisting of two agents, two apples, and one monster. The apples are static, while the monster keeps moving towards its closest agent. When a single agent meets the monster, it gets a penalty of -10. If the two agents catch the monster together, each gets a bonus of 5. If a single agent meets an apple, it gets a bonus of 2. Whenever an apple is eaten or the monster meets an agent, the entity respawns randomly. The optimal strategy, i.e., both agents move towards and catch the monster together, is a risky coordination strategy, since an agent receives a penalty if the other agent defects.

Figure 3: Escalation.

Escalation. Escalation is a grid-world with sparse rewards, consisting of two agents and a static light. If both agents step on the light simultaneously, each receives a bonus of 1, and the light then moves to a random adjacent grid. If only one agent steps on the light, it gets a penalty proportional to $L$, where $L$ denotes the latest number of consecutive cooperation steps, and the light respawns randomly. To maximize their individual payoffs and the global reward, agents must coordinate to stay together and step on the light grid. For each integer $L$, there is a corresponding coordination strategy in which each agent follows the light for $L$ steps and then both simultaneously stop coordinating.

5.2 Baselines and Training

In this subsection, we introduce the baseline methods. We do not share parameters between agents, and more implementation details can be found in Appendix.

Independent actor-critic (IAC). IAC is a decentralized multi-agent policy gradient method in which each agent learns policy and value networks based on its local observations and treats the other agents as part of the environment.

Centralized training and decentralized execution (CTDE) methods. MADDPG [Lowe et al., 2017] and MAPPO [Yu et al., 2021] are two multi-agent policy gradient methods that improve upon decentralized RL by adopting an actor-critic structure and learning a centralized critic. We implement both algorithms based on their open-sourced code and perform a limited grid search over certain hyperparameters, including the learning rate, entropy bonus coefficient, and buffer size (MADDPG).

Local information agent modelling (LIAM). LIAM [Papoudakis et al., 2021] learns latent representations of the modeled agents from the local information of the controlled agent using an encoder-decoder architecture. The modeled agent's observations and actions are used as reconstruction targets for the decoder, and the learned latent representation conditions the policy of the controlled agent in addition to its local observation. The policy and model are optimized with the A2C algorithm.

Training. We carry out our experiments on one NVIDIA RTX 3080 Ti GPU and an Intel i9-11900K CPU.

Figure 4: Mean evaluation returns for GRSP, MADDPG, MAPPO, IAC, and LIAM on two repeated matrix games. An average global reward of 40 means that both agents have learned the coordination strategy, i.e., cooperating at each time step.

5.3 Evaluation of Returns

In this subsection, we evaluate all methods on four multi-agent environments, using 5 different random seeds to train each method. We pause training every 50 episodes and run 30 independent episodes, with each agent performing greedy action selection, to evaluate the average performance of each method.

5.3.1 Iterated Games

Fig. 4 shows the average global rewards, i.e., the sum of all agents' average returns, of all methods evaluated during training in the ISH and IPD environments. The shaded region represents a confidence interval; an average global reward of 40 means that both agents have learned the coordination strategy, i.e., cooperating at each time step. Agents trained with our method achieve mutual coordination in a sample-efficient way in these two high-risk repeated matrix games, while the other methods converge only to safe non-cooperative strategies, even though some of them rely on much more restrictive assumptions.

5.3.2 Grid-Worlds

We further show the effectiveness of GRSP in two grid-world games, Monster-Hunt and Escalation [Tang et al., 2021], both of which have high-payoff but risky cooperation strategies for agents to converge to. Fig. 5 shows that GRSP consistently and significantly outperforms the baselines with higher sample efficiency over the whole training process, both in global rewards and in agents' individual rewards. Specifically, in Monster-Hunt, GRSP agents efficiently find one of the risky cooperation strategies, in which the two agents stay together and wait for the monster. Furthermore, the policies learned by each agent are very stable, and neither agent is inclined to deviate from the cooperative strategy. The baseline methods, in contrast, converge only to safe non-cooperative strategies and obtain low payoffs due to their poor exploration. In Escalation, GRSP significantly outperforms the other baselines, and both agents achieve coordination in a decentralized paradigm.

Figure 5: Mean evaluation returns for GRSP, MADDPG, MAPPO, IAC, and LIAM on Monster-Hunt and Escalation. Global rewards are the sum of both agents' rewards.

5.4 Generalization Study

Table 1: Mean evaluation returns of GRSP with and without the auxiliary opponent modeling task on four multi-agent environments (ISH, IPD, Monster-Hunt, and Escalation), against cooperative (Coop) and defecting (Defect) opponents.

This subsection investigates how well pre-trained GRSP agents generalize to different opponents, i.e., cooperative or defecting ones, during execution. The cooperative opponents are trained with the GRSP method, while the non-cooperative opponents are trained with MADDPG. During evaluation, the random seeds of the four environments differ from those used during training, and the hyperparameters of GRSP are the same and fixed across opponent types. Furthermore, the pre-trained coordinated agents can no longer access rewards to update their policies; they must rely on the auxiliary opponent modeling task to adapt to different opponents. Network details and hyperparameters can be found in the Appendix.

Table 1 shows the mean evaluation returns of the GRSP agent with and without the auxiliary opponent modeling task on the four multi-agent environments when interacting with different opponents. All returns are averaged over 100 episodes. The GRSP-Aom agent, which uses the auxiliary opponent modeling task to adapt to different opponents, significantly outperforms the GRSP-No-Aom agent, especially when interacting with non-cooperative opponents. In particular, the GRSP-Aom agent, by using the past behaviors of its opponents to update its policy, learns to switch its strategy from coordination to non-cooperation when encountering a non-cooperative opponent. These results further demonstrate that independently learned policies can overfit to the other agents' policies during training, and that our auxiliary opponent modeling task provides a way to tackle this problem.

5.5 Ablations

Figure 6: Mean evaluation return of GRSP compared with other ablation methods in two grid-world multi-agent environments.

In this subsection, we perform an ablation study to examine the components of GRSP and better understand our method. GRSP is based on QR-DQN and has three components: the risk-seeking exploration bonus, the left truncated variance (Tv), and the auxiliary opponent modeling task (Aom). We design and evaluate six ablations of GRSP in the two grid-world environments, as shown in Fig. 6. The performance of GRSP-No-Aom, in which we ablate the Aom module and retain all other features of our method, is slightly lower than that of GRSP but has much higher variance, indicating that learning from the opponent's behaviors can stabilize training and improve performance. Moreover, GRSP-No-Aom is a completely decentralized method trained without any opponent information, and its results show that the risk-seeking bonus is essential for agents to achieve mutual coordination in general-sum games. We observe that ablating the left truncated variance module leads to a significantly lower return than GRSP in Escalation but makes no difference in Monster-Hunt. Furthermore, ablating the risk-seeking bonus increases training variance, slows convergence, and performs worse than GRSP. It is noteworthy that Escalation is a sparse-reward, hard-exploration multi-agent environment, since our two decentralized agents receive a reward only if they navigate to and step on the light simultaneously and consistently. These two ablations indicate that the exploration ability of the left truncated variance is important to our method, and that the risk-seeking bonus encourages agents to coordinate with each other stably and to converge to high-risk cooperation strategies efficiently. We also implement the risk-seeking bonus with CVaR instead of WT, shown as GRSP-CVaR; it performs worse than our method and has higher training variance.
Finally, we ablate all components of GRSP and use an $\epsilon$-greedy policy for exploration, which leads to the IQR-DQN algorithm. As shown in Fig. 6, IQR-DQN cannot learn effective policies in Monster-Hunt and performs poorly in Escalation.

6 Discussion

Conclusion. While various MARL methods have been proposed for cooperative settings, few works investigate how self-interested learning agents can achieve mutual coordination, which is coupled with risk, in decentralized general-sum games and generalize learned policies to non-cooperative opponents during execution. In this paper, we present GRSP, a novel decentralized MARL algorithm with an estimated risk-seeking bonus and an auxiliary opponent modeling task. Empirically, we show that agents trained via GRSP not only achieve mutual coordination during training with high sample efficiency but also generalize learned policies to non-cooperative opponents during execution, while the baseline methods do neither.

Limitations and future work. The risk-seeking bonus in GRSP is estimated using the WT distorted expectation, and its risk-sensitivity level is a hyperparameter that cannot change dynamically throughout training. Developing a method that adjusts agents' risk-sensitivity levels dynamically using their observations, rewards, or opponent information is a direction for future work.


  • S. V. Albrecht and P. Stone (2018) Autonomous agents modelling other agents: a comprehensive survey and open problems. Artificial Intelligence 258, pp. 66–95. Cited by: §2.
  • A. Balbás, J. Garrido, and S. Mayoral (2009) Properties of distortion risk measures. Methodology and Computing in Applied Probability 11 (3), pp. 385–399. Cited by: §3, §3.
  • M. G. Bellemare, W. Dabney, and R. Munos (2017) A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pp. 449–458. Cited by: §2, §3, §3.
  • C. Berner, G. Brockman, B. Chan, V. Cheung, P. Dębiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, et al. (2019) Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680. Cited by: §1.
  • Y. Chow, A. Tamar, S. Mannor, and M. Pavone (2015) Risk-sensitive and robust decision-making: a cvar optimization approach. Advances in neural information processing systems 28. Cited by: §2.
  • F. Christianos, L. Schäfer, and S. Albrecht (2020) Shared experience actor-critic for multi-agent reinforcement learning. Advances in Neural Information Processing Systems 33, pp. 10707–10717. Cited by: §1.
  • W. Dabney, G. Ostrovski, D. Silver, and R. Munos (2018a) Implicit quantile networks for distributional reinforcement learning. In International conference on machine learning, pp. 1096–1105. Cited by: §1, §2, §3.
  • W. Dabney, M. Rowland, M. Bellemare, and R. Munos (2018b) Distributional reinforcement learning with quantile regression. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: §1, §2, §3, §3.
  • D. Di Castro, A. Tamar, and S. Mannor (2012) Policy gradients with variance related risk criteria. arXiv preprint arXiv:1206.6404. Cited by: §2.
  • J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson (2018) Counterfactual multi-agent policy gradients. In Proceedings of the AAAI conference on artificial intelligence, Vol. 32. Cited by: §1.
  • J. N. Foerster, R. Y. Chen, M. Al-Shedivat, S. Whiteson, P. Abbeel, and I. Mordatch (2017) Learning with opponent-learning awareness. arXiv preprint arXiv:1709.04326. Cited by: §1, §1, §1, §2, §5.1.
  • A. Grover, M. Al-Shedivat, J. Gupta, Y. Burda, and H. Edwards (2018) Learning policy representations in multiagent systems. In International conference on machine learning, pp. 1802–1811. Cited by: §1.
  • E. A. Hansen, D. S. Bernstein, and S. Zilberstein (2004) Dynamic programming for partially observable stochastic games. In AAAI, Vol. 4, pp. 709–715. Cited by: §3.
  • N. Hansen, R. Jangir, Y. Sun, G. Alenyà, P. Abbeel, A. A. Efros, L. Pinto, and X. Wang (2020) Self-supervised policy adaptation during deployment. arXiv preprint arXiv:2007.04309. Cited by: §2.
  • P. Hernandez-Leal, M. Kaisers, T. Baarslag, and E. M. de Cote (2017) A survey of learning in multiagent environments: dealing with non-stationarity. arXiv preprint arXiv:1707.09183. Cited by: §2.
  • P. J. Huber (1992) Robust estimation of a location parameter. In Breakthroughs in statistics, pp. 492–518. Cited by: §3.
  • R. Koenker and G. Bassett Jr (1978) Regression quantiles. Econometrica: journal of the Econometric Society, pp. 33–50. Cited by: §1.
  • R. Koenker and K. F. Hallock (2001) Quantile regression. Journal of economic perspectives 15 (4), pp. 143–156. Cited by: §4.1.
  • M. Lanctot, V. Zambaldi, A. Gruslys, A. Lazaridou, K. Tuyls, J. Pérolat, D. Silver, and T. Graepel (2017) A unified game-theoretic approach to multiagent reinforcement learning. Advances in neural information processing systems 30. Cited by: §1.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §1.
  • D. P. Losey, M. Li, J. Bohg, and D. Sadigh (2020) Learning from my partner’s actions: roles in decentralized robot teams. In Conference on robot learning, pp. 752–765. Cited by: §2.
  • R. Lowe, Y. I. Wu, A. Tamar, J. Harb, O. Pieter Abbeel, and I. Mordatch (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in neural information processing systems 30. Cited by: §1, §5.2.
  • X. Lyu and C. Amato (2018) Likelihood quantile networks for coordinating multi-agent reinforcement learning. arXiv preprint arXiv:1812.06319. Cited by: §2.
  • X. Ma, L. Xia, Z. Zhou, J. Yang, and Q. Zhao (2020) DSAC: distributional soft actor critic for risk-sensitive reinforcement learning. arXiv preprint arXiv:2004.14547. Cited by: §2.
  • J. Markowitz, R. Gardner, A. Llorens, R. Arora, and I. Wang (2021) A risk-sensitive policy gradient method. Cited by: §2.
  • B. Mavrin, H. Yao, L. Kong, K. Wu, and Y. Yu (2019) Distributional reinforcement learning for efficient exploration. In International conference on machine learning, pp. 4424–4434. Cited by: §4.1.
  • R. Mealing and J. L. Shapiro (2015) Opponent modeling by expectation–maximization and sequence prediction in simplified poker. IEEE Transactions on Computational Intelligence and AI in Games 9 (1), pp. 11–24. Cited by: §1.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. nature 518 (7540), pp. 529–533. Cited by: §1.
  • G. Papoudakis and S. V. Albrecht (2020) Variational autoencoders for opponent modeling in multi-agent systems. arXiv preprint arXiv:2001.10829. Cited by: §1.
  • G. Papoudakis, F. Christianos, and S. Albrecht (2021) Agent modelling under partial observability for deep reinforcement learning. Advances in Neural Information Processing Systems 34. Cited by: §1, §1, §1, §5.2.
  • G. Papoudakis, F. Christianos, A. Rahman, and S. V. Albrecht (2019) Dealing with non-stationarity in multi-agent deep reinforcement learning. arXiv preprint arXiv:1906.04737. Cited by: §2.
  • A. Peysakhovich and A. Lerer (2017a) Consequentialist conditional cooperation in social dilemmas with imperfect information. arXiv preprint arXiv:1710.06975. Cited by: §2.
  • A. Peysakhovich and A. Lerer (2017b) Prosocial learning agents solve generalized stag hunts better than selfish ones. arXiv preprint arXiv:1709.02865. Cited by: §1, §1, §1, §4.1.
  • L. Prashanth and M. Ghavamzadeh (2016) Variance-constrained actor-critic algorithms for discounted and average reward mdps. Machine Learning 105 (3), pp. 367–417. Cited by: §2.
  • W. Qiu, X. Wang, R. Yu, R. Wang, X. He, B. An, S. Obraztsova, and Z. Rabinovich (2021) RMIX: learning risk-sensitive policies for cooperative reinforcement learning agents. Advances in Neural Information Processing Systems 34. Cited by: §1, §2, §2.
  • R. Raileanu, E. Denton, A. Szlam, and R. Fergus (2018) Modeling others using oneself in multi-agent reinforcement learning. In International conference on machine learning, pp. 4257–4266. Cited by: §2.
  • T. Rashid, M. Samvelyan, C. Schroeder, G. Farquhar, J. Foerster, and S. Whiteson (2018) Qmix: monotonic value function factorisation for deep multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 4295–4304. Cited by: §1, §2.
  • R. T. Rockafellar and S. Uryasev (2002) Conditional value-at-risk for general loss distributions. Journal of banking & finance 26 (7), pp. 1443–1471. Cited by: §2.
  • L. S. Shapley (1953) Stochastic games. Proceedings of the national academy of sciences 39 (10), pp. 1095–1100. Cited by: §3.
  • H. Shirado and N. A. Christakis (2017) Locally noisy autonomous agents improve global human coordination in network experiments. Nature 545 (7654), pp. 370–374. Cited by: §1.
  • D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. (2017) Mastering the game of go without human knowledge. nature 550 (7676), pp. 354–359. Cited by: §1.
  • A. J. Singh, A. Kumar, and H. C. Lau (2020) Hierarchical multiagent reinforcement learning for maritime traffic management. Cited by: §1.
  • K. Son, D. Kim, W. J. Kang, D. E. Hostallero, and Y. Yi (2019) Qtran: learning to factorize with transformation for cooperative multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 5887–5896. Cited by: §1, §2.
  • P. Stone, G. A. Kaminka, S. Kraus, and J. S. Rosenschein (2010) Ad hoc autonomous agent teams: collaboration without pre-coordination. In Twenty-Fourth AAAI Conference on Artificial Intelligence, Cited by: §2.
  • Y. Sun, X. Wang, Z. Liu, J. Miller, A. Efros, and M. Hardt (2020) Test-time training with self-supervision for generalization under distribution shifts. In International Conference on Machine Learning, pp. 9229–9248. Cited by: §2.
  • P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls, et al. (2017) Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296. Cited by: §2.
  • A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, and R. Vicente (2017) Multiagent cooperation and competition with deep reinforcement learning. PloS one 12 (4), pp. e0172395. Cited by: §1.
  • Z. Tang, C. Yu, B. Chen, H. Xu, X. Wang, F. Fang, S. Du, Y. Wang, and Y. Wu (2021) Discovering diverse multi-agent strategic behavior via reward randomization. arXiv preprint arXiv:2103.04564. Cited by: §1, §1, §1, §2, §4.1, §5.1, §5.3.2.
  • B. Toghi, R. Valiente, D. Sadigh, R. Pedarsani, and Y. P. Fallah (2021) Social coordination and altruism in autonomous driving. arXiv preprint arXiv:2107.00200. Cited by: §1.
  • O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, et al. (2019) Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature 575 (7782), pp. 350–354. Cited by: §1.
  • J. Wang, Z. Ren, T. Liu, Y. Yu, and C. Zhang (2020) Qplex: duplex dueling multi-agent q-learning. arXiv preprint arXiv:2008.01062. Cited by: §1, §2.
  • S. S. Wang (2000) A class of distortion operators for pricing financial and insurance risks. Journal of risk and insurance, pp. 15–36. Cited by: §1, §2, 2nd item.
  • W. Wang, J. Hao, Y. Wang, and M. Taylor (2018) Towards cooperation in sequential prisoner’s dilemmas: a deep multiagent reinforcement learning approach. arXiv preprint arXiv:1803.00162. Cited by: §1, §1, §1, §2.
  • W. Z. Wang, M. Beliaev, E. Bıyık, D. A. Lazar, R. Pedarsani, and D. Sadigh (2021) Emergent prosociality in multi-agent games through gifting. arXiv preprint arXiv:2105.06593. Cited by: §1, §1, §4.1, §5.1.
  • W. Z. Wang, A. Shih, A. Xie, and D. Sadigh (2022) Influencing towards stable multi-agent interactions. In Conference on Robot Learning, pp. 1132–1143. Cited by: §2.
  • Z. Wang, K. Mülling, M. P. Deisenroth, H. Ben Amor, D. Vogt, B. Schölkopf, and J. Peters (2013) Probabilistic movement modeling for intention inference in human–robot interaction. The International Journal of Robotics Research 32 (7), pp. 841–858. Cited by: §2.
  • J. L. Wirch and M. R. Hardy (2001) Distortion risk measures: coherence and stochastic dominance. In International congress on insurance: Mathematics and economics, pp. 15–17. Cited by: §3.
  • C. Yu, A. Velu, E. Vinitsky, Y. Wang, A. Bayen, and Y. Wu (2021) The surprising effectiveness of ppo in cooperative, multi-agent games. arXiv preprint arXiv:2103.01955. Cited by: §1, §5.2.
  • C. Zhang and V. Lesser (2010) Multi-agent learning with policy prediction. In Twenty-fourth AAAI conference on artificial intelligence, Cited by: §2.
  • T. Zhang, H. Xu, X. Wang, Y. Wu, K. Keutzer, J. E. Gonzalez, and Y. Tian (2020) Multi-agent collaboration via reward attribution decomposition. arXiv preprint arXiv:2010.08531. Cited by: §2.
  • Z. Zhou, W. Fu, B. Zhang, and Y. Wu (2021) Continuously discovering novel strategies via reward-switching policy optimization. In Deep RL Workshop NeurIPS 2021, Cited by: §1.