1 Introduction
Inspired by advances in deep reinforcement learning (DRL) [Mnih et al., 2015, Lillicrap et al., 2015, Silver et al., 2017], many researchers have recently focused on utilizing DRL methods to tackle multi-agent problems [Vinyals et al., 2019, Singh et al., 2020, Berner et al., 2019]. However, most of these works either consider fully cooperative multi-agent reinforcement learning (MARL) settings [Foerster et al., 2018, Rashid et al., 2018, Son et al., 2019, Wang et al., 2020, Qiu et al., 2021] or consider general-sum games but make restrictive assumptions about opponents [Foerster et al., 2017, Papoudakis et al., 2021, Mealing and Shapiro, 2015], e.g., that they are stationary [Papoudakis et al., 2021] or altruistic [Tang et al., 2021, Peysakhovich and Lerer, 2017b]. In future applications of MARL, such as self-driving cars [Toghi et al., 2021] and human-robot interactions [Shirado and Christakis, 2017], multiple learning agents optimize their own rewards independently in general-sum games where win-win outcomes are achieved only through coordination, which is often coupled with risk [Wang et al., 2021, Foerster et al., 2017, Wang et al., 2018] ("risk" refers to the uncertainty of future outcomes [Dabney et al., 2018a]), and their pretrained policies should generalize to non-cooperative opponents during execution.
To achieve coordination alongside other learning agents and generalize learned policies to non-cooperative opponents, an agent must be willing to undertake a certain amount of risk and identify the opponents' types efficiently. One set of approaches uses explicit reward shaping to force agents to coordinate [Tampuu et al., 2017, Peysakhovich and Lerer, 2017b, Tang et al., 2021], which can be viewed as shaping the risk degree of coordination strategies. To learn generalizable policies, [Tang et al., 2021, Wang et al., 2018] propose to train an adaptive agent with population-based training methods. However, training a population of policy profiles is computationally intractable in complex real-world multi-agent scenarios and cannot leverage features about the opponent's type that emerge during the multi-agent exploration procedure. Other works either treat the other agents as stationary [Papoudakis et al., 2021, Grover et al., 2018, Papoudakis and Albrecht, 2020, Wang et al., 2018] or directly access the opponent's policy parameters [Foerster et al., 2017].
By contrast, we are interested in a less restrictive setting where we do not assume access to opponents' rewards, observations, or policy parameters; instead, each agent can infer other agents' current strategies from their past behaviors. Consider a self-driving scenario: the ego car has no knowledge of what the cars in front observe or of their true payoffs, but it can see the actions they take, e.g., risky merging and lane changing, and thus adjust its own strategy.
In this paper, one key insight is that learning from an opponent's past behaviors allows the agent to infer the opponent's type and dynamically alter his strategy between different modes, e.g., cooperate or compete, during execution. Moreover, given that the other learning agents are non-stationary, decision-making over the agent's return distributions enables the agent to tackle uncertainties resulting from other agents' behaviors and to alter his risk preference, i.e., from risk-neutral to risk-seeking, to discover coordination strategies. Motivated by the analysis above, we propose GRSP, a Generalizable Risk-Sensitive MARL algorithm, and our contributions are summarized as follows:
Leading to mutual coordination in decentralized general-sum games. We estimate a dynamic risk-seeking bonus using a complete distortion risk measure, Wang's Transform (WT) [Wang, 2000], to encourage agents to discover risky cooperative strategies. The risk-seeking bonus only affects the action selection procedure instead of shaping environment rewards and decreases throughout training, leading to an unbiased policy.
Generalizing pretrained policies to non-cooperative opponents during execution. Policies learned independently can overfit to the other agents' policies during training, failing to generalize sufficiently during execution [Lanctot et al., 2017]. We therefore propose to train each learning agent with two objectives: a standard quantile regression objective [Koenker and Bassett Jr, 1978, Dabney et al., 2018b] and a supervised agent modeling objective, which models the behaviors of the opponent and is applied to an intermediate representation of the value network. The auxiliary supervised task allows the shallow part of the value network, which is designed as a feature extractor, to be influenced by the opponent's past behaviors. During execution, the agent can update the parameters of the feature extractor using his local observations and the observed opponent's actions, forcing the intermediate representation to adapt to the new opponent.

Evaluating in multi-agent settings. We evaluate GRSP in four different Markov games: MonsterHunt [Tang et al., 2021, Zhou et al., 2021], Escalation [Tang et al., 2021, Peysakhovich and Lerer, 2017b], Iterated Prisoners' Dilemma (IPD) [Foerster et al., 2017, Wang et al., 2018] and Iterated Stag Hunt (ISH) [Wang et al., 2021, Tang et al., 2021]. Compared with several baseline methods, including Multi-Agent Deep Deterministic Policy Gradient (MADDPG) [Lowe et al., 2017], Multi-Agent PPO (MAPPO) [Yu et al., 2021], Local Information Agent Modelling (LIAM) [Papoudakis et al., 2021] and Independent Actor-Critic (IAC) [Christianos et al., 2020], GRSP learns substantially faster, achieves mutual coordination during training, and generalizes to non-cooperative opponents during execution, outperforming all baseline methods and reaching state-of-the-art performance.
2 Related Work
Risk-sensitive RL. Risk-sensitive policies, which depend on more than the mean of the outcomes, enable agents to handle the intrinsic uncertainty arising from the stochasticity of the environment. In MARL, these intrinsic uncertainties are amplified by the non-stationarity and partial observability created by other agents that change their policies during the learning procedure [Wang et al., 2022, Papoudakis et al., 2019, Hernandez-Leal et al., 2017]. Distributional RL [Bellemare et al., 2017, Dabney et al., 2018b] provides a new perspective for optimizing policies under different risk preferences within a unified framework [Dabney et al., 2018a, Markowitz et al., 2021]. With distributions of return, it is possible to approximate the value function under different risk measures, such as variance-related measures [Di Castro et al., 2012, Ma et al., 2020, Prashanth and Ghavamzadeh, 2016], Conditional Value at Risk (CVaR) [Rockafellar and Uryasev, 2002, Chow et al., 2015] and WT [Wang, 2000], and thus produce risk-averse or risk-seeking policies. Qiu et al. [Qiu et al., 2021] propose a cooperative MARL method called RMIX that applies the CVaR measure over the learned distributions of individual Q values to obtain risk-averse policies. However, RMIX focuses on the fully cooperative setting and learns a centralized but factored value function to guide optimization of the decentralized policies, which cannot be applied beyond cooperative games. Lyu et al. [Lyu and Amato, 2018] propose LHIQN, a decentralized risk-sensitive policy for all agents to seek higher team rewards. In contrast with these works, we are interested in decentralized general-sum games where agents use risk-seeking policies to explore coordination strategies and can generalize them to non-cooperative opponents.

Generalization across different opponents. Many real-world scenarios require agents to adapt to different opponents during execution. However, most existing works focus on learning a fixed and team-dependent policy in the fully cooperative setting [Sunehag et al., 2017, Rashid et al., 2018, Son et al., 2019, Qiu et al., 2021, Wang et al., 2020], which cannot generalize to slightly altered environments or new opponents. Other works either use a population-based training method to train an adaptive agent [Tang et al., 2021] or adapt to different opponents under the Tit-for-Tat principle [Wang et al., 2018, Peysakhovich and Lerer, 2017a]. Our work is closely related to test-time training methods [Sun et al., 2020, Hansen et al., 2020]; however, they focus on image recognition or single-agent policy adaptation. We consider generalization across different opponents in MARL and use opponents' past behaviors as supervised information.
Ad hoc teamwork [Stone et al., 2010, Zhang et al., 2020] also requires agents to generalize to new teams, but it focuses on cooperative games and has different concerns from ours.
Opponent modeling. Our approach to learning generalizable policies can be viewed as a kind of opponent modeling method [Albrecht and Stone, 2018]. These approaches either model intentions [Wang et al., 2013, Raileanu et al., 2018], assume an assignment of roles [Losey et al., 2020], or exploit opponent learning dynamics [Foerster et al., 2017, Zhang and Lesser, 2010]. Our approach is similar to policy reconstruction methods [Raileanu et al., 2018], which make explicit predictions about the opponent's actions. However, instead of predicting the opponent's future actions, we learn from the opponent's past behaviors to update the belief, i.e., the parameters of the value network, about the opponent's type.
3 Preliminaries
Stochastic games. In this work, we consider multiple self-interested learning agents interacting with each other. We model the problem as a Partially Observable Stochastic Game (POSG) [Shapley, 1953, Hansen et al., 2004], which consists of $N$ agents, a state space $\mathcal{S}$ describing the possible configurations of all agents, and a set of actions $\mathcal{A}_i$ and a set of observations $\mathcal{O}_i$ for each agent $i$. At each time step, each agent $i$ receives his own observation $o_i \in \mathcal{O}_i$ and selects an action $a_i \in \mathcal{A}_i$ based on a stochastic policy $\pi_i$, which results in a joint action vector $\boldsymbol{a} = (a_1, \dots, a_N)$. The environment then transitions to a new state based on the transition function $\mathcal{T}$. Each agent obtains a reward $r_i$ as a function of the state and his action. The initial states are determined by a distribution $\rho: \mathcal{S} \to [0, 1]$. We treat the reward "function" $R_i$ of each agent as a random variable to emphasize its stochasticity, and use $Z_i^\pi(s, a) = \sum_{t=0}^{T} \gamma^t R_i(s_t, a_t)$ to denote the random variable of the cumulative discounted rewards, where $s_0 = s$, $a_0 = a$, $\gamma \in [0, 1)$ is a discount factor and $T$ is the time horizon.

Distributional reinforcement learning with quantile regression. Instead of learning the expected return $Q^\pi(s, a) = \mathbb{E}[Z^\pi(s, a)]$, distributional RL focuses on learning the full distribution of the random variable $Z^\pi(s, a)$ directly [Bellemare et al., 2017, Dabney et al., 2018b, a]. As in the scalar setting, a distributional Bellman optimality operator can be defined by

$$\mathcal{T} Z(s, a) :\overset{D}{=} R(s, a) + \gamma Z\big(s', \arg\max_{a'} \mathbb{E}[Z(s', a')]\big), \tag{1}$$

with $s'$ distributed according to $P(\cdot \mid s, a)$.
Bellemare et al. [Bellemare et al., 2017] show that the distributional Bellman operator is a contraction in a maximal form of the Wasserstein metric between probability distributions, but the proposed algorithm C51 does not itself minimize the Wasserstein metric. This theory-practice gap is closed by Dabney et al. [Dabney et al., 2018b], who propose to use quantile regression for distributional RL. In this paper, we focus on the quantile representation used in QR-DQN [Dabney et al., 2018b], where the distribution of $Z$ is represented by a uniform mix of $N$ supporting quantiles, $Z_\theta(s, a) = \frac{1}{N} \sum_{i=1}^{N} \delta_{\theta_i(s, a)}$, where $\delta_z$ denotes a Dirac at $z \in \mathbb{R}$, and each $\theta_i$ is an estimation of the quantile corresponding to the quantile fraction $\hat{\tau}_i = \frac{\tau_{i-1} + \tau_i}{2}$ with $\tau_i = \frac{i}{N}$ for $1 \le i \le N$. The state-action value can then be approximated by $Q(s, a) \approx \frac{1}{N} \sum_{i=1}^{N} \theta_i(s, a)$. These quantile estimates are trained using the Huber [Huber, 1992] quantile regression loss. The loss of the quantile value network is then given by

$$\mathcal{L}(\theta) = \sum_{i=1}^{N} \mathbb{E}_{z \sim \mathcal{T} Z} \left[ \rho_{\hat{\tau}_i}^{\kappa}\big(z - \theta_i(s, a)\big) \right], \tag{2}$$

where $\rho_{\tau}^{\kappa}(u) = |\tau - \mathbb{1}\{u < 0\}| \frac{L_{\kappa}(u)}{\kappa}$, where $\mathbb{1}$ is the indicator function and $L_{\kappa}$ is the Huber loss:

$$L_{\kappa}(u) = \begin{cases} \frac{1}{2} u^2, & \text{if } |u| \le \kappa, \\ \kappa \left(|u| - \frac{1}{2} \kappa\right), & \text{otherwise.} \end{cases} \tag{3}$$
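As a concrete reference, the quantile Huber loss above can be sketched in NumPy. This is a minimal sketch with assumed array shapes (`theta` holding one quantile vector, `target_samples` holding samples from the Bellman target); the paper's networks and batching are not reproduced.

```python
import numpy as np

def huber(u, kappa=1.0):
    """Huber loss L_kappa(u): quadratic for |u| <= kappa, linear beyond."""
    au = np.abs(u)
    return np.where(au <= kappa, 0.5 * u ** 2, kappa * (au - 0.5 * kappa))

def quantile_huber_loss(theta, target_samples, kappa=1.0):
    """Quantile regression Huber loss of QR-DQN (Eqs. 2-3).

    theta:          (N,) predicted quantile values at midpoints tau_hat_i
    target_samples: (M,) samples from the Bellman target distribution

    Each quantile theta_i is penalized asymmetrically: over-estimating the
    tau_hat_i quantile is weighted by (1 - tau_hat_i), under-estimating by
    tau_hat_i, which is what drives theta_i toward the target's quantiles.
    """
    N = theta.shape[0]
    tau_hat = (np.arange(N) + 0.5) / N              # quantile midpoints
    u = target_samples[None, :] - theta[:, None]    # pairwise TD errors (N, M)
    weight = np.abs(tau_hat[:, None] - (u < 0.0))   # |tau - 1{u < 0}|
    return float((weight * huber(u, kappa) / kappa).mean())
```

Minimizing this loss over `theta` recovers the quantiles of the target distribution, which is the property QR-DQN relies on.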
Distorted expectation. A distorted expectation is a risk-weighted expectation of the value distribution under a specific distortion function [Wirch and Hardy, 2001]. A function $g: [0, 1] \to [0, 1]$ is a distortion function if it is non-decreasing and satisfies $g(0) = 0$ and $g(1) = 1$ [Balbás et al., 2009]. The distorted expectation of $Z$ under $g$ is defined as $\int_0^1 F_Z^{-1}(\tau)\, dg(\tau)$, where $F_Z^{-1}(\tau)$ is the quantile function at $\tau \in [0, 1]$ for the random variable $Z$. We introduce two common distortion functions as follows:

- CVaR is the expectation of the lower or upper tail of the value distribution, corresponding to a risk-averse or risk-seeking policy respectively. Its distortion function is $g(\tau) = \min(\tau / \alpha, 1)$ (risk-averse) or $g(\tau) = \max\big((\tau - (1 - \alpha)) / \alpha,\, 0\big)$ (risk-seeking), where $\alpha \in (0, 1]$ denotes the confidence level.

- WT is proposed by Wang [Wang, 2000]: $g_\eta(\tau) = \Phi\big(\Phi^{-1}(\tau) + \eta\big)$, where $\Phi$ is the cumulative distribution function of a standard normal. The parameter $\eta$ is called the market price of risk and reflects systematic risk. Under the distorted expectation above, $\eta > 0$ reweights toward the lower tail (risk-averse) and $\eta < 0$ toward the upper tail (risk-seeking).
CVaR assigns zero weight to all percentiles below (or above) the $\alpha$ significance level, which leads to erroneous decisions in some cases [Balbás et al., 2009]. Instead, WT is a complete distortion risk measure that uses all the information in the original loss distribution, which makes training much more stable; we demonstrate this empirically in Sec. 5.
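The distorted expectations above can be sketched numerically by reweighting a discrete quantile distribution with the distorted probability masses $g(\tau_i) - g(\tau_{i-1})$. This is a hedged sketch: the function names are ours, and the sign convention for $\eta$ (negative for risk-seeking under this weighting) varies across papers.

```python
import numpy as np
from statistics import NormalDist

_PHI = NormalDist()  # standard normal CDF / inverse CDF (stdlib)

def wang_transform(tau, eta):
    """Wang's Transform g_eta(tau) = Phi(Phi^{-1}(tau) + eta).

    Under the weighting E_g[Z] = sum F^{-1}(tau_i) * (g(tau_i) - g(tau_{i-1})),
    eta < 0 shifts mass toward the upper tail (risk-seeking); some papers use
    the opposite sign convention, so treat the sign here as an assumption."""
    tau = np.clip(tau, 1e-9, 1.0 - 1e-9)
    return np.array([_PHI.cdf(_PHI.inv_cdf(t) + eta) for t in tau])

def cvar_averse(tau, alpha):
    """CVaR distortion g(tau) = min(tau / alpha, 1): mean of the lower tail."""
    return np.minimum(tau / alpha, 1.0)

def distorted_expectation(theta, g):
    """Risk-weighted expectation of a quantile distribution.

    theta: (N,) sorted quantile estimates at fractions i/N; each quantile is
    reweighted by the distorted probability mass g(tau_i) - g(tau_{i-1})."""
    N = theta.shape[0]
    tau = np.arange(N + 1) / N
    return float(np.dot(np.diff(g(tau)), theta))

theta = np.arange(1.0, 11.0)  # toy quantile distribution, mean 5.5
print(distorted_expectation(theta, lambda t: t))                         # ~5.5, risk-neutral
print(distorted_expectation(theta, lambda t: cvar_averse(t, 0.5)))       # ~3.0, lower-tail mean
print(distorted_expectation(theta, lambda t: wang_transform(t, -0.75)))  # > 5.5, risk-seeking
```

Note that the WT weights are strictly positive for every quantile, while the CVaR weights vanish outside the chosen tail; this is the "complete distortion" property discussed above.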
4 Methods
In this section, we describe our proposed GRSP method. We first introduce the risk-seeking bonus used to encourage agents to discover coordination strategies in Sec. 4.1, then propose the auxiliary opponent modeling task to learn generalizable policies in Sec. 4.2. Finally, we provide the details of test-time policy adaptation under different opponents in Sec. 4.3.
4.1 Risk-Seeking Bonus
In this section, we first provide an illustrative example of the insight behind the risk-seeking bonus and then describe its details. Consider a two-player, 10-step sequential matrix game, Stag Hunt, where each player decides whether to hunt the stag (S) or hunt a hare (H) in each round. If both agents choose S, they receive the highest payoff of 2. However, if one agent defects, he receives a decent reward of 1 for eating the hare alone, while the other agent, having chosen S, suffers a big loss of 10. If both agents choose H, they each receive a payoff of 1.
Even state-of-the-art RL algorithms fail to discover the "risky" cooperation strategy [Tang et al., 2021, Peysakhovich and Lerer, 2017b, Wang et al., 2021]. One important reason is that the expected, i.e., risk-neutral, Q value ignores the complete distribution information, especially the upper- and lower-tail information when the learned distribution is asymmetric. Another reason is that when the risk is high, i.e., there is a high loss for being betrayed, the probability of finding the SS (cooperation) strategy via policy gradient is very low [Tang et al., 2021].

Therefore, we adopt the distributional RL method to model the whole distribution of the Q value. The left part of Fig. 1 shows the quantile distributions of cooperation and defection for a risk-neutral policy learned by QR-DQN. The mean value of defection is higher than that of cooperation, but the quantile value distribution of cooperation has a longer upper tail, which means that it has a higher potential payoff.
We propose to use the WT distortion function to reweight the expectation of the quantile distribution. Specifically, the risk-seeking bonus for agent $i$ is

$$b_i(s, a) = \frac{1}{N} \sum_{j=1}^{N} g'_\eta(\hat{\tau}_j)\, \theta_j^i(s, a), \tag{4}$$
where $g'_\eta(\hat{\tau}_j)$ is the derivative of the WT distortion function at $\hat{\tau}_j$, and $\eta$ controls the risk-seeking level. The right part of Fig. 1 shows the WT-weighted quantile distribution, in which the upper quantile values are multiplied by bigger weights and the lower quantile values by smaller weights, encouraging agents to adopt risky coordination strategies.
Instead of using an $\epsilon$-greedy method for exploration, we adopt the left truncated variance of the quantile distribution to further leverage the intrinsic uncertainty for efficient exploration. The left truncated variance is defined as

$$\sigma_+^2 = \frac{1}{2N} \sum_{j=N/2}^{N} \big(\theta_{N/2} - \theta_j\big)^2, \tag{5}$$
and analysed in [Mavrin et al., 2019]. We anneal the two exploration bonuses dynamically so that in the end we produce unbiased policies. The anneal coefficient is defined as $c_t = c \sqrt{\log t / t}$, which is the parametric uncertainty decay rate [Koenker and Hallock, 2001], and $c$ is a constant factor. This approach leads to choosing the action such that

$$a_t = \arg\max_{a} \Big( Q(s_t, a) + c_t \big( b_i(s_t, a) + \sigma_+^2(s_t, a) \big) \Big). \tag{6}$$
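The annealed action-selection rule of Eq. (6) can be sketched as follows. This is a hedged sketch under stated assumptions: `c`, `beta`, and the exact decay schedule are hyperparameters we assume, and the truncated-variance form follows Mavrin et al. (2019) rather than being copied from the paper.

```python
import numpy as np

def left_truncated_variance(theta):
    """Left truncated variance (Mavrin et al., 2019): spread of the upper half
    of the sorted quantile estimates around the median, an optimism signal
    that ignores downside spread."""
    N = theta.shape[0]
    median = theta[N // 2]
    return float(np.sum((theta[N // 2:] - median) ** 2) / (2 * N))

def select_action(quantiles, wt_bonus, t, c=50.0, beta=1.0):
    """Annealed greedy rule in the spirit of Eq. (6).

    quantiles: (A, N) sorted quantile estimates per action
    wt_bonus:  (A,)  WT risk-seeking bonus per action
    c_t = c * sqrt(log t / t) is the parametric-uncertainty decay schedule;
    c and beta are assumed hyperparameters."""
    c_t = c * np.sqrt(np.log(t) / t)
    q_mean = quantiles.mean(axis=1)                      # risk-neutral value
    sigma_plus = np.array([left_truncated_variance(q) for q in quantiles])
    return int(np.argmax(q_mean + c_t * (wt_bonus + beta * sigma_plus)))
```

Because `c_t` decays to zero, the rule reduces to greedy selection over the mean values late in training, which is what makes the final policy unbiased.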
4.2 Auxiliary Opponent Modeling Task
In order to alter the agent's strategy under different opponents, we split the value network into two parts: a feature extractor $f_{\phi_i}$ and a decision maker $d_{\psi_i}$. Specifically, the parameters of the value network for agent $i$ are divided into $\phi_i$ and $\psi_i$. The auxiliary opponent modeling task shares the common feature extractor with the value network. We can update the parameters $\phi_i$ during execution using gradients from the auxiliary opponent modeling task, so that $f_{\phi_i}$ can generalize to different opponents. The supervised prediction head is $h_{\xi_i}$ with its own specific parameters $\xi_i$. The details of our network architecture are provided in the appendix.
During training, the agent can collect a set of transitions $\{(o_t^i, a_t^i, r_t^i, a_t^{-i})\}_t$, where $a_t^{-i}$ indicates the joint actions of the other agents except $i$ at time step $t$. We use the embeddings of agent $i$'s observations and actions to predict the joint actions $a_t^{-i}$, i.e., the prediction head $h_{\xi_i}$ is a multi-head neural network whose outputs are multiple softmax distributions over the action space of each other agent, and the objective function of the auxiliary opponent modeling task can be formulated as

$$\mathcal{L}_{\text{aux}}(\phi_i, \xi_i) = \mathbb{E}_t \Big[ \ell_{\text{CE}}\big( h_{\xi_i}(f_{\phi_i}(o_t^i)),\, a_t^{-i} \big) \Big], \tag{7}$$
where $\ell_{\text{CE}}$ is the cross-entropy loss function. The strategies of opponents change constantly during the multi-agent exploration procedure, and thus various strategies emerge. The agent can leverage them to gain experience about how to make the best response by jointly optimizing the auxiliary opponent modeling task and the quantile value distribution. The joint training problem is therefore

$$\min_{\phi_i, \psi_i, \xi_i}\; \mathcal{L}(\phi_i, \psi_i) + \lambda\, \mathcal{L}_{\text{aux}}(\phi_i, \xi_i), \tag{8}$$

where $\mathcal{L}$ is the quantile regression loss of Eq. (2) and $\lambda$ balances the two objectives.
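The auxiliary objective can be sketched as a per-opponent cross-entropy over the prediction head's softmax outputs. This is a hedged sketch with assumed shapes; the paper's actual head architecture is described in its appendix.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def aux_opponent_loss(head_logits, opp_actions):
    """Sketch of the auxiliary objective in Eq. (7): one softmax head per
    modeled opponent, trained with cross-entropy against observed actions.

    head_logits: list of (B, A_j) logit arrays, one per opponent j
    opp_actions: list of (B,) integer action arrays, one per opponent j"""
    loss = 0.0
    for logits, acts in zip(head_logits, opp_actions):
        p = softmax(logits)
        loss += -np.mean(np.log(p[np.arange(acts.shape[0]), acts] + 1e-12))
    return loss / len(head_logits)
```

In the joint problem of Eq. (8) this term would be added to the quantile loss with weight $\lambda$, with gradients from both flowing into the shared feature extractor.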
4.3 TestTime Policy Adaptation under Different Opponents
We assume the agent can observe the actions made by his opponents during execution, so we can continue optimizing $\mathcal{L}_{\text{aux}}$ to update the parameters $\phi_i$ of the value network's feature extractor. Learning from opponents' past behaviors at test time lets the agent generalize his policy to different opponents efficiently. The test-time objective can be formulated as

$$\min_{\phi_i}\; \mathcal{L}_{\text{aux}}(\phi_i, \xi_i). \tag{9}$$
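The adaptation loop can be illustrated with a deliberately tiny stand-in for the paper's networks: a linear feature extractor (the only part updated at test time) feeding a frozen opponent-prediction head. All shapes and the learning rate are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumed shapes): a linear feature extractor phi, adapted at
# test time, and a frozen opponent-action prediction head.
phi = rng.normal(size=(4, 8)) * 0.1
head = rng.normal(size=(8, 2)) * 0.1

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def adapt_step(obs, opp_action, lr=0.5):
    """One test-time gradient step on L_aux w.r.t. the feature extractor only;
    the decision maker and prediction head stay frozen. Returns the
    cross-entropy on the observed opponent action."""
    global phi
    h = obs @ phi                         # features f_phi(o)
    p = softmax(h @ head)                 # predicted opponent action distribution
    g_logits = p.copy()
    g_logits[opp_action] -= 1.0           # d CE / d logits = p - onehot
    phi -= lr * np.outer(obs, head @ g_logits)  # chain rule through frozen head
    return -np.log(p[opp_action] + 1e-12)

obs = np.array([1.0, 0.0, -1.0, 0.5])
losses = [adapt_step(obs, opp_action=1) for _ in range(100)]
```

The prediction loss on the newly observed behavior shrinks step by step even though no reward is available, which is the mechanism the test-time objective of Eq. (9) relies on.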
5 Experiments
In this section, we empirically evaluate our method on four multi-agent environments. In Sec. 5.1 and 5.2, we introduce the four environments we use for experiments and describe the baseline methods. In Sec. 5.3, we compare the performance of GRSP with the baselines. In Sec. 5.4, we evaluate the generalization ability of GRSP under different opponents during execution. Ablations are studied in Sec. 5.5. More details can be found in the appendix.
5.1 Environment Setup
Repeated games. We consider two kinds of repeated matrix games: Iterated Stag Hunt (ISH) and Iterated Prisoners' Dilemma (IPD). Both consist of two agents and a constant episode length of 10 time steps [Foerster et al., 2017, Tang et al., 2021, Wang et al., 2021]. At each time step, the agents can choose either cooperation or defection. If both agents cooperate simultaneously, they both get a bonus of 2. However, if a single agent cooperates, he gets a penalty of 10 in ISH and 1 in IPD, while the other agent gets a bonus of 1 and 3, respectively. If both agents defect, they get a bonus of 1 in ISH and 0 in IPD. The optimal strategy in both ISH and IPD is to cooperate at each time step, and the highest global payoff of the two agents is 40, i.e., 20 for each of them.
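The payoff structure described above can be written down directly. A small sketch follows; the sign of the lone-cooperator penalty is assumed to be negative, as the text implies.

```python
C, D = 0, 1  # cooperate / defect

# Payoff bimatrices reconstructed from the text; entries are (r1, r2).
ISH = {(C, C): (2, 2), (C, D): (-10, 1), (D, C): (1, -10), (D, D): (1, 1)}
IPD = {(C, C): (2, 2), (C, D): (-1, 3), (D, C): (3, -1), (D, D): (0, 0)}

def episode_return(payoff, a1, a2, steps=10):
    """Total (r1, r2) over one fixed-length episode with stationary actions."""
    r1, r2 = payoff[(a1, a2)]
    return steps * r1, steps * r2
```

Mutual cooperation yields the global payoff of 40 (20 each) in both games, while a lone cooperator in ISH loses 100 over an episode, which is why the cooperative strategy is risky.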
MonsterHunt. The environment is a grid-world consisting of two agents, two apples, and one monster. The apples are static while the monster keeps moving towards its closest agent. When a single agent meets the monster, he gets a penalty of 10. If the two agents catch the monster together, they both get a bonus of 5. If a single agent meets an apple, he gets a bonus of 2. Whenever an apple is eaten or the monster meets an agent, the entity respawns randomly. The optimal strategy, i.e., both agents move towards and catch the monster, is a risky coordination strategy, since an agent receives a penalty if the other agent defects.
Escalation. Escalation is a grid-world with sparse rewards, consisting of two agents and a static light. If both agents step on the light simultaneously, they receive a bonus of 1, and the light then moves to a random adjacent grid. If only one agent steps on the light, he gets a penalty that grows with $L$, where $L$ denotes the latest number of consecutive cooperation steps, and the light respawns randomly. To maximize their individual payoffs and the global reward, agents must coordinate to stay together and step on the light grid. For each integer $L$, there is a corresponding coordination strategy in which each agent follows the light for $L$ steps and then both simultaneously stop coordinating.
5.2 Baselines and Training
In this subsection, we introduce the baseline methods. We do not share parameters between agents; more implementation details can be found in the appendix.
Independent actor-critic (IAC). IAC is a decentralized multi-agent policy gradient method in which each agent learns policy and value networks based on his local observations and treats the other agents as part of the environment.
Centralized training and decentralized execution (CTDE) methods. MADDPG [Lowe et al., 2017] and MAPPO [Yu et al., 2021] are two multi-agent policy gradient methods that improve upon decentralized RL by adopting an actor-critic structure and learning a centralized critic. We implement the two algorithms based on their open-sourced code and perform a limited grid search over certain hyperparameters, including the learning rate, entropy bonus coefficient, and buffer size (MADDPG).
Local information agent modelling (LIAM). LIAM [Papoudakis et al., 2021] learns latent representations of the modeled agents from the local information of the controlled agent using an encoder-decoder architecture. The modeled agent's observations and actions are utilized as reconstruction targets for the decoder, and the learned latent representation conditions the policy of the controlled agent in addition to its local observation. The policy and model are optimized with the A2C algorithm.
Training. We carry out our experiments on one NVIDIA RTX 3080 Ti GPU and an Intel i9-11900K CPU.
5.3 Evaluation of Returns
In this subsection, we evaluate all methods on four multiagent environments and use 5 different random seeds to train each method. We pause training every 50 episodes and run 30 independent episodes with each agent performing greedy action selection to evaluate the average performance of each method.
5.3.1 Iterated Games
Fig. 4 shows the average global rewards, i.e., the sum of all agents' average returns, of all methods evaluated during training in the ISH and IPD environments. The shaded area represents a confidence interval. An average global reward of 40 means that both agents have learned the coordination strategy, i.e., cooperating at each time step. We find that agents trained with our method achieve mutual coordination in a sample-efficient way in the two high-risk repeated matrix games, while the other methods converge only to safe non-cooperative strategies even though some of them make much more restrictive assumptions.
5.3.2 GridWorlds
We further show the effectiveness of GRSP in two grid-world games, MonsterHunt and Escalation [Tang et al., 2021], both of which have high-payoff but risky cooperation strategies for agents to converge to. Fig. 5 shows that GRSP consistently and significantly outperforms the baselines with higher sample efficiency over the whole training process, both in global rewards and in each agent's individual rewards. Specifically, in MonsterHunt, GRSP agents efficiently find one of the risky cooperation strategies, in which the two agents stay together and wait for the monster. Furthermore, the policies learned by each agent are very stable, and neither agent would deviate from the cooperative strategy. The other baseline methods, however, converge only to safe non-cooperative strategies and obtain low payoffs due to their poor exploration. In Escalation, GRSP outperforms the other baselines significantly, and both agents achieve coordination in a decentralized paradigm.
5.4 Generalization Study
Table 1: Mean evaluation return of GRSP-NoAom and GRSP-Aom against cooperative (and defecting) opponents in ISH, IPD, MonsterHunt (MH), and Escalation.
This subsection investigates how well pretrained GRSP agents generalize to different opponents, i.e., cooperative or defecting ones, during execution. The cooperative opponents are trained with the GRSP method, while the non-cooperative opponents are trained with MADDPG. During evaluation, the random seeds of the four environments differ from those used during training, and the hyperparameters of GRSP are the same and fixed across opponent types. Furthermore, the pretrained coordinated agents can no longer access rewards to update their policies; they must rely on the auxiliary opponent modeling task to adapt to different opponents. The network details and hyperparameters can be found in the appendix.
Table 1 shows the mean evaluation return of the GRSP agent with and without the auxiliary opponent modeling task on the four multi-agent environments when interacting with different opponents. All returns are averaged over 100 episodes. The GRSP-Aom agent, which utilizes the auxiliary opponent modeling task to adapt to different opponents, significantly outperforms the GRSP-NoAom agent, especially when interacting with non-cooperative opponents. Specifically, by using the past behaviors of its opponents to update its policy, the GRSP-Aom agent learns to alter its strategy from coordination to defection when encountering a non-cooperative opponent. These empirical results further demonstrate that policies learned independently can overfit to the other agents' policies during training, and that our auxiliary opponent modeling task provides a way to tackle this problem.
5.5 Ablations
In this subsection, we perform an ablation study to examine the components of GRSP and better understand our method. GRSP is based on QR-DQN and has three components: the risk-seeking exploration bonus, the left truncated variance (Tv), and the auxiliary opponent modeling task (Aom). We design and evaluate six different ablations of GRSP in the two grid-world environments, as shown in Fig. 6. The performance of GRSP-NoAom, in which we ablate the Aom module and retain all other features of our method, is slightly lower than that of GRSP but has a much higher variance, indicating that learning from the opponent's behaviors stabilizes training and improves performance. Moreover, GRSP-NoAom is a completely decentralized method trained without any opponent information, and its results show that our risk-seeking bonus is essential for agents to achieve mutual coordination in general-sum games. We observe that ablating the left truncated variance module leads to a significantly lower return than GRSP in Escalation but makes no difference in MonsterHunt. Furthermore, ablating the risk-seeking bonus increases the training variance, leads to slower convergence, and performs worse than GRSP. It is noteworthy that Escalation is a sparse-reward and hard-exploration multi-agent environment, since our two decentralized agents get a reward only if they navigate to and step on the light simultaneously and continuously. These two ablations indicate that the exploration ability of the left truncated variance is important to our method, and that the risk-seeking bonus encourages agents to coordinate with each other stably and converge to high-risk cooperation strategies efficiently. We also implement the risk-seeking bonus with CVaR instead of WT; the results are shown as GRSP-CVaR. GRSP-CVaR performs worse than our method and has a higher training variance.
Finally, we ablate all components of GRSP and use an $\epsilon$-greedy policy for exploration, which leads to the independent QR-DQN (IQR-DQN) algorithm. As shown in Fig. 6, IQR-DQN cannot learn effective policies in MonsterHunt and performs poorly in Escalation.
6 Discussion
Conclusion. While various MARL methods have been proposed for cooperative settings, few works investigate how self-interested learning agents can achieve mutual coordination, which is coupled with risk, in decentralized general-sum games and generalize learned policies to non-cooperative opponents during execution. In this paper, we present GRSP, a novel decentralized MARL algorithm with an estimated risk-seeking bonus and an auxiliary opponent modeling task. Empirically, we show that agents trained via GRSP not only achieve mutual coordination during training with high sample efficiency but also generalize learned policies to non-cooperative opponents during execution, while the baseline methods cannot.
Limitations and future work. The risk-seeking bonus in GRSP is estimated using the WT distorted expectation, and its risk-sensitivity level is a hyperparameter that cannot change dynamically throughout training. Developing a method that adjusts agents' risk-sensitivity levels dynamically by utilizing their observations, rewards, or opponents' information is a direction for future work.
References
 Autonomous agents modelling other agents: a comprehensive survey and open problems. Artificial Intelligence 258, pp. 66–95. Cited by: §2.
 Properties of distortion risk measures. Methodology and Computing in Applied Probability 11 (3), pp. 385–399. Cited by: §3, §3.

 A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pp. 449–458. Cited by: §2, §3, §3.
 Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680. Cited by: §1.
 Risksensitive and robust decisionmaking: a cvar optimization approach. Advances in neural information processing systems 28. Cited by: §2.
 Shared experience actorcritic for multiagent reinforcement learning. Advances in Neural Information Processing Systems 33, pp. 10707–10717. Cited by: §1.
 Implicit quantile networks for distributional reinforcement learning. In International conference on machine learning, pp. 1096–1105. Cited by: §1, §2, §3.
 Distributional reinforcement learning with quantile regression. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: §1, §2, §3, §3.
 Policy gradients with variance related risk criteria. arXiv preprint arXiv:1206.6404. Cited by: §2.
 Counterfactual multiagent policy gradients. In Proceedings of the AAAI conference on artificial intelligence, Vol. 32. Cited by: §1.
 Learning with opponentlearning awareness. arXiv preprint arXiv:1709.04326. Cited by: §1, §1, §1, §2, §5.1.
 Learning policy representations in multiagent systems. In International conference on machine learning, pp. 1802–1811. Cited by: §1.
 Dynamic programming for partially observable stochastic games. In AAAI, Vol. 4, pp. 709–715. Cited by: §3.
 Selfsupervised policy adaptation during deployment. arXiv preprint arXiv:2007.04309. Cited by: §2.
 A survey of learning in multiagent environments: dealing with nonstationarity. arXiv preprint arXiv:1707.09183. Cited by: §2.
 Robust estimation of a location parameter. In Breakthroughs in statistics, pp. 492–518. Cited by: §3.
 Regression quantiles. Econometrica: journal of the Econometric Society, pp. 33–50. Cited by: §1.
 Quantile regression. Journal of economic perspectives 15 (4), pp. 143–156. Cited by: §4.1.
 A unified gametheoretic approach to multiagent reinforcement learning. Advances in neural information processing systems 30. Cited by: §1.
 Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §1.
 Learning from my partner’s actions: roles in decentralized robot teams. In Conference on robot learning, pp. 752–765. Cited by: §2.
 Multiagent actorcritic for mixed cooperativecompetitive environments. Advances in neural information processing systems 30. Cited by: §1, §5.2.
 Likelihood quantile networks for coordinating multiagent reinforcement learning. arXiv preprint arXiv:1812.06319. Cited by: §2.
 DSAC: distributional soft actor critic for risksensitive reinforcement learning. arXiv preprint arXiv:2004.14547. Cited by: §2.
 A risksensitive policy gradient method. Cited by: §2.
 Distributional reinforcement learning for efficient exploration. In International conference on machine learning, pp. 4424–4434. Cited by: §4.1.

 Opponent modeling by expectation–maximization and sequence prediction in simplified poker. IEEE Transactions on Computational Intelligence and AI in Games 9 (1), pp. 11–24. Cited by: §1.
 Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §1.

Variational autoencoders for opponent modeling in multiagent systems
. arXiv preprint arXiv:2001.10829. Cited by: §1.  Agent modelling under partial observability for deep reinforcement learning. Advances in Neural Information Processing Systems 34. Cited by: §1, §1, §1, §5.2.
Dealing with non-stationarity in multi-agent deep reinforcement learning. arXiv preprint arXiv:1906.04737.
Consequentialist conditional cooperation in social dilemmas with imperfect information. arXiv preprint arXiv:1710.06975.
Prosocial learning agents solve generalized stag hunts better than selfish ones. arXiv preprint arXiv:1709.02865.
Variance-constrained actor-critic algorithms for discounted and average reward MDPs. Machine Learning 105 (3), pp. 367–417.
RMIX: learning risk-sensitive policies for cooperative reinforcement learning agents. Advances in Neural Information Processing Systems 34.
Modeling others using oneself in multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 4257–4266.
QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 4295–4304.
Conditional value-at-risk for general loss distributions. Journal of Banking & Finance 26 (7), pp. 1443–1471.
Stochastic games. Proceedings of the National Academy of Sciences 39 (10), pp. 1095–1100.
Locally noisy autonomous agents improve global human coordination in network experiments. Nature 545 (7654), pp. 370–374.
Mastering the game of Go without human knowledge. Nature 550 (7676), pp. 354–359.
Hierarchical multiagent reinforcement learning for maritime traffic management.
QTRAN: learning to factorize with transformation for cooperative multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 5887–5896.
Ad hoc autonomous agent teams: collaboration without pre-coordination. In Twenty-Fourth AAAI Conference on Artificial Intelligence.
Test-time training with self-supervision for generalization under distribution shifts. In International Conference on Machine Learning, pp. 9229–9248.
Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296.
Multiagent cooperation and competition with deep reinforcement learning. PLoS ONE 12 (4), pp. e0172395.
Discovering diverse multi-agent strategic behavior via reward randomization. arXiv preprint arXiv:2103.04564.
Social coordination and altruism in autonomous driving. arXiv preprint arXiv:2107.00200.
Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575 (7782), pp. 350–354.
QPLEX: duplex dueling multi-agent Q-learning. arXiv preprint arXiv:2008.01062.
A class of distortion operators for pricing financial and insurance risks. Journal of Risk and Insurance, pp. 15–36.
Towards cooperation in sequential prisoner's dilemmas: a deep multiagent reinforcement learning approach. arXiv preprint arXiv:1803.00162.
Emergent prosociality in multi-agent games through gifting. arXiv preprint arXiv:2105.06593.
Influencing towards stable multi-agent interactions. In Conference on Robot Learning, pp. 1132–1143.
Probabilistic movement modeling for intention inference in human–robot interaction. The International Journal of Robotics Research 32 (7), pp. 841–858.
Distortion risk measures: coherence and stochastic dominance. In International Congress on Insurance: Mathematics and Economics, pp. 15–17.
The surprising effectiveness of PPO in cooperative, multi-agent games. arXiv preprint arXiv:2103.01955.
Multiagent learning with policy prediction. In Twenty-Fourth AAAI Conference on Artificial Intelligence.
Multi-agent collaboration via reward attribution decomposition. arXiv preprint arXiv:2010.08531.
Continuously discovering novel strategies via reward-switching policy optimization. In Deep RL Workshop NeurIPS 2021.