Balancing Rational and Other-Regarding Preferences in Cooperative-Competitive Environments

by   Dmitry Ivanov, et al.

Recent reinforcement learning studies extensively explore the interplay between cooperative and competitive behaviour in mixed environments. Unlike cooperative environments where agents strive towards a common goal, mixed environments are notorious for conflicts between selfish and social interests. As a consequence, purely rational agents often struggle to achieve and maintain cooperation. A prevalent approach to induce cooperative behaviour is to assign additional rewards based on other agents' well-being. However, this approach suffers from the issue of multi-agent credit assignment, which can hinder performance. In cooperative settings, this issue is efficiently alleviated by state-of-the-art algorithms such as QMIX and COMA. Still, when applied to mixed environments, these algorithms may result in an unfair allocation of rewards. We propose BAROCCO, an extension of these algorithms capable of balancing individual and social incentives. The mechanism behind BAROCCO is to train two distinct but interwoven components that jointly affect each agent's decisions. Our meta-algorithm is compatible with both Q-learning and Actor-Critic frameworks. We experimentally confirm its advantages over existing methods and explore the behavioural aspects of BAROCCO in two mixed multi-agent setups.




1 Introduction

Human cooperation is considered an evolutionary puzzle in the economic literature (Axelrod & Hamilton, 1981; Fehr & Schmidt, 1999; Johnson et al., 2003; Colman, 2006; Rand & Nowak, 2013). Although rational choice theory predicts selfish behaviour (Scott, 2000), people of different age, gender, culture, and socioeconomic status engage in cooperation in a multitude of economic situations (Croson & Buchan, 1999; Henrich et al., 2001; Alvard, 2004; Benenson et al., 2007; Chen et al., 2013; Kettner & Waichman, 2016). A notable example of such situations is the prisoner’s dilemma (Rapoport et al., 1965), where a rational agent chooses to defect despite preferring mutual cooperation over mutual defection. One possible mechanism that resolves the paradox is that agents take social and other-regarding preferences into account during decision making (Fehr & Schmidt, 1999; Fehr & Fischbacher, 2002).

The questions of emergence and maintenance of cooperation are mirrored in the Multi-Agent Reinforcement Learning (MARL) literature (Tan, 1993; Lowe et al., 2017; Sunehag et al., 2017; Rashid et al., 2018; Foerster et al., 2018; Peysakhovich & Lerer, 2018b). Numerous works have repeatedly demonstrated that purely rational agents are unable to maintain mutually beneficial cooperation, unlike the agents guided by social incentives (Peysakhovich & Lerer, 2018b; Hughes et al., 2018; Jaques et al., 2019; Wang et al., 2019a). Despite this, training fully social agents can be undesirable when fairness is a concern.

Figure 1: BAROCCO in (a) the Q-learning framework and (b) the Actor-Critic framework. Solid lines represent parts that are used during both training and execution. Dashed lines represent parts that are only used during training. (a) Selfish components predict selfish Q-values $Q_i$ and are trained independently. Social components predict per-agent contributions $Q_i^{SW}$, combined into the social Q-value $Q^{SW}$ through a mixing network. Social components and the mixing network are trained end-to-end to approximate the temporal difference target of social welfare (SW), defined as a combination (e.g. sum) of the selfish targets $y_i$. Agents act according to combined Q-values $\hat{Q}_i$, which are convex mixtures of $Q_i$ and $Q_i^{SW}$. (b) Selfish components predict selfish values $V_i$ and are trained independently. Social components predict social Q-values $Q^{SW}$ and are trained to approximate the temporal difference target of social welfare (SW), defined as a combination (e.g. sum) of the selfish targets $y_i$. Selfish advantages $A_i$ are estimated as temporal differences (TD) of $V_i$. Social advantages $A_i^{SW}$ are estimated by subtracting counterfactual (CO) baselines from $Q^{SW}$. Decentralized policies $\pi_i$ are trained via policy gradient on combined advantages $\hat{A}_i$, which are convex mixtures of $A_i$ and $A_i^{SW}$.

As an example, consider the problem of coordination of autonomous vehicles. On the one hand, each car’s passengers have their own goals in terms of destination and desirable arrival time. Treating this problem as fully cooperative, as implied in (Cao et al., 2012; Rashid et al., 2018), may favor solutions where agents sacrifice these personal goals for the social good. For instance, a fully cooperative agent would be willing to let the other cars pass and stay on a crossroad indefinitely as long as average arrival time decreases. In contrast, a selfish agent would not. This example illustrates how fairness emerges from selfishness. On the other hand, it is still crucial for each agent to avoid creating inconvenient or dangerous situations for other cars. Therefore, this scenario falls in-between selfish and social and requires agents to balance these preferences.

The simplest way to achieve such balance is to train agents on a mixture of selfish and social rewards (Durugkar et al., 2020), which we refer to as Cooperative Reward Shaping (CRS). In this work, we define selfish reward as the standard reward an agent receives in the environment, and social reward as some combination (e.g. sum) of selfish rewards of all agents. However, CRS implies decentralized training and does not address several crucial issues of MARL, such as credit assignment, partial observability, and inherent non-stationarity (Agogino & Tumer, 2004; Hernandez-Leal et al., 2017, 2019). The two latter issues can be alleviated by considering global information and actions of other agents during training, as done in MADDPG (Lowe et al., 2017). Still, the combination of CRS and MADDPG does not address credit assignment of agents to the social welfare. On the other hand, all these issues are addressed by the techniques from fully cooperative MARL like QMIX (Rashid et al., 2018) or COMA (Foerster et al., 2018) that were shown to outperform decentralized training in such complex environments as StarCraft 2 (Vinyals et al., 2017). Still, these techniques are only concerned with team performance and ignore fairness.

In this paper we propose a meta-algorithm that extends techniques like QMIX and COMA to mixed environments and adds the capability to balance selfish and social incentives. We refer to this meta-algorithm as BAROCCO, i.e. BAlancing Rational and Other-regarding preferences in Cooperative-COmpetitive environments. BAROCCO is based on the insight that instead of relying on a single model to balance incentives via CRS, two distinct components, one selfish and one social, can be trained concurrently and combined during decision making. While we show that the two approaches are mathematically equivalent, the latter allows us to train the social component via techniques that address credit assignment.

More specifically, BAROCCO is compatible with both Q-learning and Actor-Critic frameworks. In the Q-learning framework, for each agent we train a selfish Q-value via Rainbow (Hessel et al., 2018) and a social Q-value via QMIX (Rashid et al., 2018). During decision making, each agent chooses the action that maximizes the mixture of the two Q-values, and the importance of each Q-value is controlled by a predefined prosociality coefficient. In the Actor-Critic framework, we train a selfish critic via a variant of MADDPG (Lowe et al., 2017) and a social critic via COMA (Foerster et al., 2018). The actor is then trained via proximal policy optimization (Schulman et al., 2017) using a mixture of the predictions of these two critics.

For both frameworks, we show that varying the prosociality coefficient in BAROCCO results in a trade-off between efficiency and fairness. In particular, we find that fully social agents may choose to concentrate all of the environment’s rewards in one particular agent, whereas agents with a non-zero selfish component refuse to make such sacrifices. More surprisingly, in some cases we find that less social agents are not only fairer but also more efficient.

A crucial novelty of BAROCCO concerns the training of the social component. The natural approach would be to construct common reward as a combination of selfish rewards. Instead, we directly combine selfish values, omitting construction of common reward. We respectively refer to these approaches as short-term and long-term. While in certain cases the two approaches are mathematically equivalent, the long-term approach might be more suitable for mixed environments. We also formulate two qualitative advantages of the long-term approach: compatibility with a broader set of social welfare functions and applicability to a wider range of environments.

Finally, an alternative to achieving fairness through selfishness could be to train a fair centralized system by maximizing the minimum of agents’ payoffs rather than their sum. We show that such a procedure can also be viable, but only if the system is trained via the long-term approach used in BAROCCO. In this case, the selfish components are vital for efficiency, although they are only used to estimate the target and do not influence agents’ decisions directly.

2 Definitions and Background

2.1 Notations

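A tuple $\langle S, N, \mathcal{A}, O, T, \rho, r \rangle$ defines a temporally-extended Markov game (Littman, 1994), where:

  • $S$ is the set of states $s$, $N$ is the number of players, and $\mathcal{A}_i$ is the set of actions $a_i$ of player $i$. Let $a_{-i}$ denote the actions of all agents other than $i$, and let $s_{-i} = (s, a_{-i})$ denote the concatenation of the state and the actions of other agents.

  • $O: S \times \{1, \dots, N\} \to \mathbb{R}^d$ is the function that specifies the $d$-dimensional observations available to each agent. $O_i = \{o_i \mid o_i = O(s, i),\ s \in S\}$ is the set of observations of agent $i$.

  • $T: S \times \mathcal{A}_1 \times \dots \times \mathcal{A}_N \to \Delta(S)$ is the transition function, where $\Delta(S)$ is the set of discrete probability distributions over $S$. $\rho \in \Delta(S)$ is the distribution of initial states.

  • $r_i: S \times \mathcal{A}_1 \times \dots \times \mathcal{A}_N \to \mathbb{R}$ is the reward function of player $i$.

  • $R_i(s_t) = \sum_{k \ge 0} \gamma^k r_{i,t+k}$ is the return of player $i$ in state $s_t$ with discount factor $\gamma$. $\pi_i$ is the policy of player $i$, and $\pi_i(a_i \mid o_i)$ is the probability of taking action $a_i$ in local state $o_i$.

  • $V_i(s)$ is the state value function, $Q_i(s, \mathbf{a})$ is the state-action value function, and $A_i(s, \mathbf{a}) = Q_i(s, \mathbf{a}) - V_i(s)$ is the advantage function. The subscript $t$ denotes the time-step, e.g. $s_t$. Bold font denotes a vector, e.g. $\mathbf{a} = (a_1, \dots, a_N)$, $\mathbf{r} = (r_1, \dots, r_N)$.

  • $SW: \mathbb{R}^N \to \mathbb{R}$ is the social welfare function that evaluates the well-being of all agents. The simplest example of its application is the sum of agents’ rewards: $SW(\mathbf{r}) = \sum_{i=1}^{N} r_i$.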

2.2 Single-Agent Reinforcement Learning

Deep Q-Learning.

In Q-learning (Watkins & Dayan, 1992), the agent’s goal is to learn Q-values for each state-action pair, and the agent’s policy is to choose actions that correspond to the highest Q-values. This approach has been successfully applied to such complex environments as Atari games when coupled with deep learning (Hessel et al., 2018; Badia et al., 2020). In Deep Q-Networks (DQN) (Mnih et al., 2015), the Q-values are no longer tabular and are instead approximated with a neural network trained with the squared Temporal Difference (TD) loss function:

$$L(\theta) = \mathbb{E}\left[\left(y_t - Q(s_t, a_t; \theta)\right)^2\right] \tag{1}$$

where $y_t = r_t + \gamma \max_{a} Q(s_{t+1}, a; \theta^{-})$ and $\theta^{-}$ denotes the parameters of the target network. The essential features of DQN are a replay buffer, which enables the reuse of past experiences, and a separate network for target estimation, which stabilises training. The performance of DQN was greatly improved in Rainbow (Hessel et al., 2018) by combining several modifications proposed in different papers.

In the Actor-Critic framework (Mnih et al., 2016), the Actor’s goal is to learn a policy that maximizes the agent’s long-term payoffs predicted by the Critic. A widely-used method is proximal policy optimization (PPO) (Schulman et al., 2017), where the Actor’s neural network is trained on the following loss:

$$L(\theta) = -\mathbb{E}\left[\min\left(\rho_t(\theta) A_t,\ \mathrm{clip}\left(\rho_t(\theta), 1 - \epsilon, 1 + \epsilon\right) A_t\right)\right] \tag{2}$$

where $\rho_t(\theta) = \pi_{\theta}(a_t \mid s_t) / \pi_{\theta_{old}}(a_t \mid s_t)$ denotes the probability ratio of the policies after and before the update. Using this loss ensures that the agent’s policy stays within a trust region during the update. The advantage is defined as $A_t = y_t - V(s_t)$, where $y_t = r_t + \gamma V(s_{t+1})$. The Critic’s neural network is independently trained to predict $V$ by minimizing the squared TD error $(y_t - V(s_t))^2$.
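As a minimal sketch of the Actor-Critic quantities described above, the following computes a one-step TD advantage and the clipped PPO loss for a single sample. The function names, the scalar (non-batched) form, and the default values of `gamma` and `eps` are assumptions made for illustration, not details taken from the paper.

```python
def td_advantage(reward, value, next_value, gamma=0.99):
    """One-step TD advantage: (r + gamma * V(s')) - V(s)."""
    return reward + gamma * next_value - value


def ppo_clipped_loss(ratio, advantage, eps=0.2):
    """Negative clipped surrogate objective for one sample.

    ratio is pi_new(a|s) / pi_old(a|s); eps is the clipping range
    (an illustrative default, not a value from the paper).
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return -min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, a ratio above `1 + eps` earns no extra credit, which is what keeps the updated policy close to the old one.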

2.3 Independent and Centralized Multi-Agent Reinforcement Learning

In Multi-Agent Reinforcement Learning (MARL), multiple agents learn and interact in the same environment. One of the simplest approaches in MARL is to train agents independently using unmodified single-agent RL techniques (Tan, 1993). Unfortunately, this naive approach invalidates convergence guarantees (Lowe et al., 2017) of Q-Learning (Watkins & Dayan, 1992) and Actor-Critic (Konda & Tsitsiklis, 2000). The reason for that is the inherent non-stationarity of multi-agent environments (Laurent et al., 2011; Hernandez-Leal et al., 2017). Furthermore, independent MARL does not address the issue of credit assignment in environments with common reward (Wolpert & Tumer, 2002; Agogino & Tumer, 2004). Nevertheless, this approach can be effective in both cooperative (Berner et al., 2019) and mixed (Leibo et al., 2017; Tampuu et al., 2017) setups. As the opposite extreme, the fully centralized approach reduces MARL to single-agent RL by controlling all agents simultaneously based on global information. Unfortunately, centralized MARL suffers from scalability issues due to exponential growth of the joint action space in the number of agents (Guestrin et al., 2002; Sunehag et al., 2017).

2.4 Centralized Training with Decentralized Execution

Centralized Training with Decentralized Execution (CTDE) is a compromise between independent and centralized MARL (Kraemer & Banerjee, 2016; Lowe et al., 2017; Sunehag et al., 2017; Rashid et al., 2018; Foerster et al., 2018; Son et al., 2019). Under this paradigm, training can be enhanced with the use of global information as long as it results in decentralized policies. Typically, CTDE techniques alleviate the issues of multi-agent credit assignment and/or non-stationarity while effectively dealing with the curse of dimensionality.

QMIX (Rashid et al., 2018) is an algorithm designed to train multiple Q-learning agents in cooperative environments. During training, it approximates the joint Q-value as a monotonic mixture of the individual Q-values: $Q^{tot}(s, \mathbf{a}) = f(Q_1(o_1, a_1), \dots, Q_N(o_N, a_N); s)$, where $\partial f / \partial Q_i \ge 0$ for all $i$. During execution, each agent acts according to its individual Q-value $Q_i$, restricting the use of global information to the training phase. The function $f$ is trained as a mixture network in an end-to-end fashion via the TD loss $(y - Q^{tot}(s, \mathbf{a}))^2$. By enforcing the monotonicity of $f$, the joint Q-value can be factorised in a way that preserves the order of actions. As a result, maximization over the joint action becomes tractable: $\arg\max_{\mathbf{a}} Q^{tot}(s, \mathbf{a}) = (\arg\max_{a_1} Q_1(o_1, a_1), \dots, \arg\max_{a_N} Q_N(o_N, a_N))$. To utilize global information $s$, the weights of the mixture network are predicted with a set of hypernetworks (Ha et al., 2017). QMIX is a direct extension of Value Decomposition Networks (Sunehag et al., 2017), where the joint Q-value is simply approximated as a sum of agents’ contributions rather than a monotonic mixture.
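The tractability argument can be illustrated with a toy sketch. It assumes a single linear mixing layer with fixed weights made non-negative via `abs`, whereas QMIX itself uses a nonlinear mixing network whose weights are produced by hypernetworks from the global state. All names and numbers below are hypothetical. The sketch checks that per-agent greedy actions coincide with the brute-force maximizer of the mixed joint value.

```python
import itertools


def monotonic_mix(agent_qs, weights, bias=0.0):
    """Toy mixer: abs() keeps every weight non-negative, so the output
    is monotone in each agent's Q-value (simplified stand-in for QMIX)."""
    return sum(abs(w) * q for w, q in zip(weights, agent_qs)) + bias


def greedy_joint_action(q_tables):
    """Each agent independently picks the argmax of its own Q-values."""
    return tuple(max(range(len(q)), key=q.__getitem__) for q in q_tables)


def brute_force_joint_action(q_tables, weights):
    """Search every joint action for the maximum of the mixed value."""
    joint_actions = itertools.product(*(range(len(q)) for q in q_tables))
    return max(
        joint_actions,
        key=lambda joint: monotonic_mix(
            [q[a] for q, a in zip(q_tables, joint)], weights
        ),
    )


# Hypothetical per-agent Q-values: agent 0 has 3 actions, agent 1 has 2.
q_tables = [[1.0, 3.0, 2.0], [0.5, -1.0]]
weights = [-2.0, 0.7]  # signs are removed by abs() inside the mixer
```

The brute-force search grows exponentially with the number of agents, while the greedy rule costs one argmax per agent.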

Counterfactual Multi-Agent policy gradient (COMA) (Foerster et al., 2018) is an adaptation of the Actor-Critic framework to cooperative environments. COMA uses an efficiently designed centralized critic, which outputs Q-values $Q(s, (a_i, a_{-i}))$ for a specified agent $i$ based on the global state $s$ and the actions of the other agents $a_{-i}$. Furthermore, COMA estimates the advantage for each agent by marginalising out the agent’s action while keeping the actions of other agents fixed: $A_i(s, \mathbf{a}) = Q(s, \mathbf{a}) - \sum_{a_i'} \pi_i(a_i' \mid o_i)\, Q(s, (a_i', a_{-i}))$. Decentralized policies are trained on these advantages via policy gradient.
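The counterfactual advantage can be sketched for a single agent as follows. The list `q_values` is assumed to hold the centralized critic's outputs for each of the agent's own actions with the other agents' actions held fixed. The function name and argument layout are illustrative rather than taken from COMA's reference implementation.

```python
def counterfactual_advantage(q_values, policy, taken_action):
    """Q of the action actually taken minus the policy-weighted average
    Q over the agent's own actions (the counterfactual baseline)."""
    baseline = sum(p * q for p, q in zip(policy, q_values))
    return q_values[taken_action] - baseline
```

For example, with critic outputs `[1.0, 3.0]` and a uniform policy, the baseline is 2.0, so the second action has advantage 1.0 and the first has advantage -1.0.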

Multi-Agent Deep Deterministic Policy Gradient (MADDPG) (Lowe et al., 2017) is a CTDE algorithm specifically designed for mixed environments. The core idea is to train DDPG agents using centralized critics conditioned on the global state $s$ and the actions of all agents $\mathbf{a}$. Similarly to the enhanced critic in COMA, this modification reduces the variance of the policy gradient and addresses non-stationarity and partial observability. However, MADDPG does not address credit assignment. In our paper, we apply the same centralization of the critic as in MADDPG to multi-agent PPO with discrete action spaces when training the selfish components of Actor-Critic agents.

2.5 Cooperative Reward Shaping

We broadly define Cooperative Reward Shaping (CRS) as reward shaping with respect to the behaviour of other agents, e.g. their rewards (Lerer & Peysakhovich, 2017; Peysakhovich & Lerer, 2018a, b; Hughes et al., 2018; Wang et al., 2019b), temporal differences (Hostallero et al., 2018), policies (Jaques et al., 2019), etc. CRS aims to learn cooperative yet not selfless policies in mixed environments. In this paper, we will only be concerned with a particular instance of CRS where agents’ rewards are mixed:

$$\hat{r}_i = (1 - \lambda)\, r_i + \lambda\, SW(\mathbf{r}) \tag{3}$$

where $\lambda \in [0, 1]$ is the prosociality coefficient and $\hat{r}_i$ is the combined reward. The agents are fully selfish when $\lambda = 0$ and fully social when $\lambda = 1$. The social reward defined as a sum of individual rewards is routinely used in MARL papers to train cooperative policies (Lerer & Peysakhovich, 2017; Peysakhovich & Lerer, 2018a, b; Wang et al., 2019b). The idea of training agents on a convex mixture of selfish and social rewards similar to (3) is explored by Durugkar et al. (2020). While simple, this approach is limited by its inability to find some of the Pareto optimal solutions, particularly those that lie in concave regions of the Pareto front (Vamplew et al., 2008).

3 Barocco

3.1 Factorization of CRS

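While CRS can achieve a balance between selfish and social incentives, it does not address multi-agent credit assignment, which can be crucial for performance. At the same time, CTDE algorithms like QMIX and COMA address credit assignment but are intended for cooperative environments, requiring agents to forgo selfish incentives. As a middle ground, we notice that the value $\hat{V}_i$ that CRS agents optimize can be factored as a mixture of the selfish and social values $V_i$ and $V^{SW}$, respectively:

$$\hat{V}_i(s) = \mathbb{E}_{\pi_i}\Big[\sum_t \gamma^t \hat{r}_{i,t}\Big] = (1 - \lambda)\, \mathbb{E}_{\pi_i}\Big[\sum_t \gamma^t r_{i,t}\Big] + \lambda\, \mathbb{E}_{\pi_i}\Big[\sum_t \gamma^t SW(\mathbf{r}_t)\Big] = (1 - \lambda)\, V_i(s) + \lambda\, V^{SW}(s) \tag{4}$$

where $\hat{r}_{i,t} = (1 - \lambda)\, r_{i,t} + \lambda\, SW(\mathbf{r}_t)$ is the combined reward with prosociality coefficient $\lambda$, and expectations over the policies of other agents and over the transition function are omitted for brevity. Note that the same factorization can be applied to the Q-value $\hat{Q}_i$ and the advantage $\hat{A}_i$.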
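The factorization follows from the linearity of the discounted sum. It can be checked numerically on a single deterministic trajectory, which is a simplifying assumption that makes the expectations trivial. The trajectory, the coefficient `lam`, and the helper names below are made up for illustration; social welfare is taken to be the sum of rewards.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def crs_reward(step_rewards, agent, lam, social_welfare=sum):
    """Convex mixture of the agent's own reward and the social reward."""
    return (1.0 - lam) * step_rewards[agent] + lam * social_welfare(step_rewards)


# Hypothetical two-agent trajectory: one reward tuple per time-step.
trajectory = [(1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
gamma, lam, agent = 0.9, 0.3, 0

# Left-hand side: value of the mixed (CRS) reward stream.
v_combined = discounted_return(
    [crs_reward(step, agent, lam) for step in trajectory], gamma
)
# Right-hand side: mixture of separately computed selfish and social values.
v_selfish = discounted_return([step[agent] for step in trajectory], gamma)
v_social = discounted_return([sum(step) for step in trajectory], gamma)
v_factored = (1.0 - lam) * v_selfish + lam * v_social
```

Both `v_combined` and `v_factored` agree up to floating-point error, which is what allows the two terms to be learned by separate components.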

The factorization (4) allows us to train the social component separately from the selfish component via algorithms like QMIX and COMA that address credit assignment in cooperative environments. This technique forms the basis for BAROCCO. In Section 3.2, we take a more in-depth look at the social value and propose an alternative definition that is not based on common reward. Then, in Sections 3.3 and 3.4 we discuss the specifics of training and combining the two components within the Q-learning and Actor-Critic frameworks. Pseudocode of BAROCCO is available in the Appendix.

3.2 Assessing Social Welfare

In the previous subsection, we defined the social value based on the common reward $SW(\mathbf{r})$, which is a combination of the individual rewards of all agents. We will denote this value as $V^{SW}_{st}$, where the subscript stands for ‘short-term’. For convenience, this definition is repeated in (5). The most straightforward way to extend COMA or QMIX to mixed environments is to train the COMA critic or QMIX on a TD loss based on this definition of value with fully social agents (i.e. prosociality coefficient $\lambda = 1$). We will refer to this extension as Vanilla.

The difference between BAROCCO and the Vanilla algorithms is two-fold. First, BAROCCO agents can consider both selfish and social motives during decision making, which is also reflected in the modified training procedure. This will be discussed in detail in the following subsections. Second, BAROCCO utilizes an alternative definition of social value $V^{SW}_{lt}$, formulated in (6) and referred to as ‘long-term’. The long-term value is not based on common reward and is instead defined as a combination of agents’ selfish values $V_i$. Essentially, the two values $V^{SW}_{st}$ and $V^{SW}_{lt}$ differ in the order in which expectation, sum, and social welfare function are applied. We experimentally confirm that replacing $V^{SW}_{st}$ with $V^{SW}_{lt}$ can increase performance. Additionally, we identify two qualitative advantages of the long-term value. We briefly formulate these advantages below and verify them experimentally in Section 4.3. We also provide detailed examples in the Appendix.

$$V^{SW}_{st}(s) = \mathbb{E}\Big[\sum_t \gamma^t\, SW(\mathbf{r}_t)\Big] \tag{5}$$

$$V^{SW}_{lt}(s) = SW\Big(\mathbb{E}\Big[\sum_t \gamma^t\, \mathbf{r}_t\Big]\Big) = SW(\mathbf{V}(s)) \tag{6}$$

The first limitation of $V^{SW}_{st}$ is in the choice of social welfare functions $SW$. When $SW$ is chosen as the sum, $V^{SW}_{st}$ is mathematically equivalent to $V^{SW}_{lt}$ because the sum commutes with the expectation (although practical implementations of the algorithms still differ). However, this is not always the case. For instance, choosing $SW$ as the minimum can be a way to account for both efficiency and fairness (Rawls, 2009). In this case, maximizing $V^{SW}_{st}$ requires a fair reward distribution at each time-step, whereas to maximize $V^{SW}_{lt}$ the rewards only need to be fairly distributed on average. While solving the first task is sufficient for solving the second, it is also unnecessarily constraining and might result in poor performance. Our experiments support this conjecture.
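The difference between the two orders of operations can be seen in a toy example where two agents take turns receiving a unit reward. The trajectory is deterministic (so expectations are trivial), undiscounted, and entirely hypothetical. The helper names are illustrative.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def short_term_value(trajectory, gamma, social_welfare):
    """Apply social welfare to each step's rewards, then discount and sum."""
    return discounted_return([social_welfare(step) for step in trajectory], gamma)


def long_term_value(trajectory, gamma, social_welfare):
    """Discount and sum each agent's rewards, then apply social welfare."""
    n_agents = len(trajectory[0])
    returns = [
        discounted_return([step[i] for step in trajectory], gamma)
        for i in range(n_agents)
    ]
    return social_welfare(returns)


# Agents alternate: rewards are fair on average but never within a step.
trajectory = [(1.0, 0.0), (0.0, 1.0), (1.0, 0.0), (0.0, 1.0)]
gamma = 1.0

st_min = short_term_value(trajectory, gamma, min)  # per-step minimum is always 0
lt_min = long_term_value(trajectory, gamma, min)   # each agent's return is 2
st_sum = short_term_value(trajectory, gamma, sum)
lt_sum = long_term_value(trajectory, gamma, sum)
```

With the minimum as social welfare, the short-term value sees no merit in turn-taking (`st_min` is 0.0) while the long-term value does (`lt_min` is 2.0). With the sum, both values equal 4.0.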

The second limitation of $V^{SW}_{st}$ is its inapplicability to environments where trajectory lengths are variable. As an example, consider an environment where the agents receive negative rewards upon termination. In such an environment, an agent that maximizes $V^{SW}_{st}$ might adopt two opposite strategies. The first strategy is to prolong the episodes of all agents, thus postponing the negative rewards. The second strategy is to terminate its own episode early, thus avoiding the negative rewards from other agents altogether. In contrast, an agent that optimizes $V^{SW}_{lt}$ anticipates the termination of other agents regardless of witnessing it and therefore can only adopt the first strategy. This issue is akin to the bias in rewards identified in generative adversarial imitation learning (Kostrikov et al., 2018).

(a) Eldorado
(b) Harvest
Figure 2: Environments. Illustration of Harvest map is taken from (Hughes et al., 2018).

As a side note, if simultaneous optimization of the payoffs of multiple agents is viewed as multi-objective optimization, then the proposed long-term approach to MARL corresponds to the ‘scalarization of the expected return’ approach to multi-objective RL (Roijers et al., 2013). It could also be interesting to explore the alternative ‘expectation of the scalarized return’ approach, which would imply changing the order of the social welfare function and the expectation in (6), but we leave this direction to future work.

3.3 Combining Independent DQN and QMIX

Here we describe BAROCCO in Q-learning framework. The algorithm is schematically illustrated in Figure 1a.

When choosing an action, each agent maximizes the following convex combination of Q-values:

$$\hat{Q}_i(o_i, a_i) = (1 - \lambda)\, Q_i(o_i, a_i) + \lambda\, Q_i^{SW}(o_i, a_i) \tag{7}$$

where $Q_i$, $Q_i^{SW}$, and $\hat{Q}_i$ denote selfish, social, and combined Q-values, respectively, and $\lambda$ is the prosociality coefficient. Equation (7) is, in essence, equation (4) rewritten for Q-values, but with one distinction: the social component is not common but is based on each agent’s contribution to social welfare. These contributions are disentangled via a mixture network, as proposed in QMIX. Although the two Q-values $Q_i$ and $Q_i^{SW}$ are optimized separately, they still affect each other through the agent’s policy.

For each agent $i$, the selfish Q-value $Q_i$ is trained via independent Q-learning (see Section 2.2). In particular, we use the Rainbow architecture (Hessel et al., 2018), which is a modification of DQN (Mnih et al., 2015). The only important distinction is that the agents do not act according to the estimated selfish Q-values, i.e. they maximize the combined Q-value $\hat{Q}_i$ rather than $Q_i$. At the same time, $Q_i$ should be the expectation over the behavioural policy according to the definition of selfish value in (4). To account for this discrepancy, the TD target is modified akin to double Q-learning. Specifically, maximization of $Q_i$ over actions is replaced with $Q_i$ of the action that maximizes $\hat{Q}_i$:

$$y_i = r_i + \gamma\, Q_i\big(o_i', \arg\max_{a} \hat{Q}_i(o_i', a)\big) \tag{8}$$
The social component is based on QMIX. The common Q-value $Q^{SW}$ is trained on a TD loss and is disentangled into agents’ individual contributions $Q_i^{SW}$ via a mixture network (see Section 2.4). These individual contributions constitute the social components for each agent. We explore two alternative estimates of the TD target for QMIX that correspond to the two definitions of social value discussed in Section 3.2. The first estimate (9) is based on the common reward $SW(\mathbf{r})$ and the prediction of QMIX for the next state. As in the case of the selfish component, the target is modified with respect to the combined Q-values $\hat{Q}_i$. When the prosociality coefficient $\lambda = 1$, this target is equivalent to the target used in Vanilla QMIX. The second estimate (10), used in BAROCCO, is based on the TD targets $y_i$ for the selfish components.

$$y^{SW}_{st} = SW(\mathbf{r}) + \gamma\, Q^{SW}(s', \hat{\mathbf{a}}'), \quad \hat{a}_i' = \arg\max_{a} \hat{Q}_i(o_i', a) \tag{9}$$

$$y^{SW}_{lt} = SW(\mathbf{y}), \quad \mathbf{y} = (y_1, \dots, y_N) \tag{10}$$
In our implementation, both Vanilla QMIX and BAROCCO utilize noisy exploration (Fortunato et al., 2017).
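The Q-learning variant described in this subsection can be sketched for a single transition as follows: action selection on the combined Q-values, the selfish target that evaluates the selfish Q-value at the action preferred by the combined Q-value, and the long-term social target built from the selfish targets. This is a tabular, single-sample sketch under stated assumptions. Function names, list-based Q-values, and the numbers are hypothetical, and terminal-state handling, the mixing network, and Rainbow's extensions are omitted.

```python
def combined_q(q_selfish, q_social, lam):
    """Convex mixture of selfish and social Q-values, per action."""
    return [(1.0 - lam) * qs + lam * qw for qs, qw in zip(q_selfish, q_social)]


def greedy_action(q_values):
    """Index of the largest Q-value."""
    return max(range(len(q_values)), key=q_values.__getitem__)


def selfish_target(reward, next_q_selfish, next_q_social, lam, gamma):
    """Selfish TD target: the next action is chosen by the combined
    Q-value, but it is evaluated with the selfish Q-value."""
    next_action = greedy_action(combined_q(next_q_selfish, next_q_social, lam))
    return reward + gamma * next_q_selfish[next_action]


def long_term_social_target(selfish_targets, social_welfare=sum):
    """Long-term social TD target: social welfare of the selfish targets."""
    return social_welfare(selfish_targets)


# Hypothetical next-state Q-values for one agent with two actions.
next_q_selfish, next_q_social = [2.0, 1.0], [0.0, 5.0]
y_mixed = selfish_target(1.0, next_q_selfish, next_q_social, lam=0.5, gamma=0.5)
y_selfish = selfish_target(1.0, next_q_selfish, next_q_social, lam=0.0, gamma=0.5)
```

With `lam=0.5` the combined Q-value prefers the second action, so the selfish target bootstraps from 1.0 rather than from the selfish maximum 2.0. With `lam=0.0` the target reduces to the usual Q-learning target.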

3.4 Combining MADDPG, COMA, and PPO

Here we describe BAROCCO in Actor-Critic framework. The algorithm is schematically illustrated in Figure 1b.

Each agent acts according to its decentralized policy $\pi_i$ trained on the PPO loss (2) with advantage $\hat{A}_i$, where $\hat{A}_i$ is a convex combination of the selfish and social advantages $A_i$ and $A_i^{SW}$ with prosociality coefficient $\lambda$:

$$\hat{A}_i = (1 - \lambda)\, A_i + \lambda\, A_i^{SW} \tag{11}$$

For agent $i$, the selfish advantage is estimated as the TD of a critic that predicts the agent’s selfish value: $A_i = y_i - V_i(s, a_{-i})$, where $y_i = r_i + \gamma V_i(s', a_{-i}')$. The selfish critic estimates the value with respect to the behavioural policy $\pi_i$, which corresponds to the definition of $V_i$ in (4). Therefore, no additional modifications of its target are required. Note that instead of using only local observations, the critic makes predictions based on the concatenation of the global state $s$ and the actions of other agents $a_{-i}$. Therefore, it is trained with a variation of MADDPG.

The social component is based on COMA. For each agent, its social critic is trained on a TD loss and predicts the social Q-value $Q^{SW}$. Then, the advantage $A_i^{SW}$, i.e. the effect of the agent’s actions on social welfare, is estimated by subtracting the counterfactual baseline from the social Q-value (see Section 2.4). This advantage enters (11) as the social component. We explore two alternative estimates of the TD target for COMA that correspond to the two definitions of social value discussed in Section 3.2. The first estimate (12) is based on the common reward $SW(\mathbf{r})$ and the prediction of COMA for the next state. When the prosociality coefficient $\lambda = 1$, this target is equivalent to the target used in Vanilla COMA. The second estimate (13), used in BAROCCO, is based on the TD targets $y_i$ for the selfish components.

$$y^{SW}_{st} = SW(\mathbf{r}) + \gamma\, Q^{SW}(s', \mathbf{a}') \tag{12}$$

$$y^{SW}_{lt} = SW(\mathbf{y}), \quad \mathbf{y} = (y_1, \dots, y_N) \tag{13}$$
In our implementation, neither critics nor policies share weights.
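For a single agent and a single transition, the combined advantage used in the Actor-Critic variant can be sketched as below. The selfish part is a one-step TD error of the selfish critic, the social part is a counterfactual advantage of the social critic, and the two are mixed with the prosociality coefficient (called `lam` here). All names and numbers are illustrative assumptions. The critics themselves are replaced by plain numbers.

```python
def selfish_advantage(reward, value, next_value, gamma):
    """TD error of the selfish critic: (r + gamma * V(s')) - V(s)."""
    return reward + gamma * next_value - value


def social_advantage(social_q_values, policy, taken_action):
    """Social Q-value of the taken action minus the counterfactual
    baseline (policy-weighted average over the agent's own actions)."""
    baseline = sum(p * q for p, q in zip(policy, social_q_values))
    return social_q_values[taken_action] - baseline


def combined_advantage(a_selfish, a_social, lam):
    """Convex mixture that is plugged into the policy-gradient loss."""
    return (1.0 - lam) * a_selfish + lam * a_social


# Hypothetical numbers for one agent with two actions.
a_self = selfish_advantage(reward=1.0, value=2.0, next_value=2.0, gamma=0.5)
a_soc = social_advantage([0.0, 4.0], policy=[0.5, 0.5], taken_action=1)
a_hat = combined_advantage(a_self, a_soc, lam=0.25)
```

Here the action is neutral for the agent itself (`a_self` is 0.0) but beneficial for the group (`a_soc` is 2.0), so a partially prosocial agent still receives a positive learning signal (`a_hat` is 0.5).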

Figure 3: Experiments in Eldorado, Actor-Critic framework. Panels: (a) Lifetime, all algorithms; (b) Gini, all algorithms; (c) Lifetime, BAROCCO with varying $\lambda$; (d) Gini, BAROCCO with varying $\lambda$. ‘Lifetime’ denotes the sum of agents’ episode lengths, ‘Gini’ is a metric of unfairness. ‘sum’ and ‘min’ denote the choice of the $SW$ function.
Figure 4: Experiments in Eldorado, Q-learning framework. Panels: (a) Lifetime, all algorithms; (b) Gini, all algorithms; (c) Lifetime, BAROCCO with varying $\lambda$; (d) Gini, BAROCCO with varying $\lambda$. ‘Lifetime’ denotes the sum of agents’ episode lengths, ‘Gini’ is a metric of unfairness. ‘sum’ and ‘min’ denote the choice of the $SW$ function.

4 Experiments

4.1 Modified Prisoner’s Dilemma

As a motivating example that illustrates the importance of balance between selfish and social incentives, we present a modified prisoner’s dilemma (Table 1). In this 2 by 3 matrix game, both agents have access to the ‘Cooperate’ and ‘Defect’ actions, but one of the agents can also ‘Sacrifice’ his payoffs for the common good. As in the classic prisoner’s dilemma (Rapoport et al., 1965), defection is a dominant strategy for a selfish agent. At the same time, mutual defection is Pareto dominated by mutual cooperation. As a result, selfish agents are stuck with mutual defection, even though both agents would benefit from mutual cooperation. In contrast, a social agent prefers ‘Cooperate’ to ‘Defect’.

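              Defect    Cooperate    Sacrifice
Defect        5, 5      15, 0        21, 0
Cooperate     0, 15     10, 10       21, 0
Table 1: Modified Prisoner’s Dilemma. Rows are the actions of the row agent, columns are the actions of the column agent; each cell lists the row agent’s payoff followed by the column agent’s payoff.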

Now, consider the ‘Sacrifice’ action of the column agent. While this action achieves the highest social welfare, it also ensures the worst individual payoff for the second agent. Nevertheless, a social agent always prefers ‘Sacrifice’, regardless of how small the surplus of social welfare over the mutual cooperation is. Instead, an agent that is willing to cooperate but refuses to self-sacrifice might be preferable.
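The claims about the payoff matrix can be checked directly with a short script. The payoffs are copied from Table 1, social welfare is taken to be the sum of the two payoffs, and the dictionary layout and variable names are illustrative choices.

```python
ROW_ACTIONS = ("Defect", "Cooperate")
COL_ACTIONS = ("Defect", "Cooperate", "Sacrifice")

# (row action, column action) -> (row payoff, column payoff), from Table 1.
PAYOFFS = {
    ("Defect", "Defect"): (5, 5),
    ("Defect", "Cooperate"): (15, 0),
    ("Defect", "Sacrifice"): (21, 0),
    ("Cooperate", "Defect"): (0, 15),
    ("Cooperate", "Cooperate"): (10, 10),
    ("Cooperate", "Sacrifice"): (21, 0),
}


def social_welfare(cell):
    """Sum of both agents' payoffs in one cell."""
    return sum(PAYOFFS[cell])


# For a selfish column agent, 'Defect' strictly beats every alternative.
column_defect_dominant = all(
    PAYOFFS[(r, "Defect")][1] > PAYOFFS[(r, c)][1]
    for r in ROW_ACTIONS
    for c in ("Cooperate", "Sacrifice")
)
# For a selfish row agent, 'Defect' is never worse than 'Cooperate'.
row_defect_weakly_dominant = all(
    PAYOFFS[("Defect", c)][0] >= PAYOFFS[("Cooperate", c)][0] for c in COL_ACTIONS
)
# A fully social column agent picks the welfare-maximizing column.
social_column_choice = {
    r: max(COL_ACTIONS, key=lambda c: social_welfare((r, c))) for r in ROW_ACTIONS
}
```

The fully social column agent chooses ‘Sacrifice’ against either row action, even though it raises social welfare by only one point over mutual cooperation (21 versus 20) and leaves the column agent with a payoff of 0.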

$\lambda$                 0-0.3    0.4-0.8    0.9-1
Row player action       D        C          C
Column player action    D        C          S
Table 2: Actions of Agents in Modified Prisoner’s Dilemma

We report the behaviour of agents trained to solve the Modified Prisoner’s Dilemma with tabular Q-learning in Table 2. We vary the prosociality coefficient $\lambda$ in $[0, 1]$, each time incrementing it by 0.1. Both agents Defect when $\lambda$ is low and start to Cooperate when $\lambda$ is as high as 0.4. The column agent further switches to Sacrifice when $\lambda$ reaches 0.9. As we will see later in the paper, such sacrificial behaviour is not unique to simple matrix games.

The agents were trained with tabular Q-learning for 100000 iterations. The learning rate was set to 0.1. The exploration rate was initialized at 1 and annealed to 0 over the course of training.

4.2 Environments


Eldorado (Fig. 2a) is based on the NeuralMMO environment (Suarez et al., 2019). Two agents navigate a fully observable grid-like map, collecting two types of resources: water and food. Both water and food tiles provide 6 points of the corresponding resource. The water tile has infinite supply, while the food tile has a recharge period of 6 turns. Each agent has limited capacity for the resources, as well as a health pool limited to 10 points. Furthermore, both food and water supplies decrease by 1 each turn. If some supply is absent, the health points also start to decrease. Conversely, the health regenerates when both supplies are above the threshold of 16. If an agent’s health reaches zero, the episode terminates with a unit negative reward. However, if an agent successfully survives for 1000 steps, its episode terminates with a unit positive reward. Upon termination, an agent immediately respawns. Additionally, the agents can interact by attacking each other. This action has two effects. First, it decreases the health of the target by 1. With a small probability of 1/50, the damage is doubled. Second, it steals a unit of both resources. Attack is thus a very appealing action in the short term. However, in order to successfully complete the task, the agents are required to coordinate their movement while refraining from combat.


Harvest (Fig. 2b) is a popular environment (Perolat et al., 2017; Hughes et al., 2018; Jaques et al., 2019) where five agents collect apples on a partially observable grid-like map. Each episode lasts for a thousand steps. The regrowth rate of apples increases with the number of uncollected apples nearby. Therefore, the agents that harvest every apple in sight quickly exhaust the apple supplies. The optimal strategy for a group of agents is to balance harvesting and cultivating apples.

Figure 5: Experiments in Harvest, Actor-Critic framework. Panels: (a) Apples, all algorithms; (b) Gini, all algorithms; (c) Apples, BAROCCO with varying $\lambda$; (d) Gini, BAROCCO with varying $\lambda$. ‘Apples’ denotes the total number of apples collected by all agents in an episode, ‘Gini’ is a metric of unfairness. ‘sum’ and ‘min’ denote the choice of the $SW$ function.
Figure 6: Experiments in Harvest, Q-learning framework. Panels: (a) Apples, all algorithms; (b) Gini, all algorithms; (c) Apples, BAROCCO with varying $\lambda$; (d) Gini, BAROCCO with varying $\lambda$. ‘Apples’ denotes the total number of apples collected by all agents in an episode, ‘Gini’ is a metric of unfairness. ‘sum’ and ‘min’ denote the choice of the $SW$ function.

4.3 Results

We report experimental results for the Eldorado and Harvest environments in Figures 3, 4 and 5, 6, respectively. We investigate how varying the prosociality coefficient affects agents’ behaviour and performance, and compare BAROCCO to baselines: the selfish baseline, CRS, and Vanilla QMIX / COMA. CRS and Vanilla QMIX / COMA are defined in Sections 2.5 and 3.2, respectively. The selfish baseline is defined as BAROCCO without the social component, i.e. with the prosociality coefficient set to zero. For the other algorithms, the coefficient is kept at its default value unless stated otherwise. The algorithms are compared by performance, defined as the sum of payoffs, and by fairness, defined following Perolat et al. (2017) as unity minus the Gini index. We repeat each experiment 3 times. Technical details and hyperparameters are reported in the Appendix.
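The two evaluation metrics can be sketched as follows. This is a minimal illustration and not the authors' code: it uses the standard Gini index, computed as the mean absolute difference between payoffs normalised by twice the mean, and the function names and the zero-total convention are assumptions.

```python
def gini(payoffs):
    """Gini index of non-negative payoffs: 0 means all payoffs are equal."""
    n = len(payoffs)
    total = sum(payoffs)
    if n == 0 or total == 0:
        return 0.0  # convention chosen here: nothing to distribute, no inequality
    # Sum of absolute differences over all ordered pairs of agents.
    abs_diffs = sum(abs(x - y) for x in payoffs for y in payoffs)
    return abs_diffs / (2 * n * total)


def fairness(payoffs):
    """Fairness as used in the comparison: unity minus the Gini index."""
    return 1.0 - gini(payoffs)


def performance(payoffs):
    """Performance as used in the comparison: sum of the agents' payoffs."""
    return sum(payoffs)
```

For example, five agents that collect the same number of apples have fairness 1, whereas a single agent collecting everything among five gives a Gini index of 0.8 and hence fairness 0.2.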

4.3.1 Actor-Critic agents in Eldorado

  • Selfish agents are able to coordinate movement but are unable to refrain from attacking, since this action is very appealing in the short term. For this reason, they only achieve an average lifetime of 800 (Fig. 3a).

  • Unlike selfish agents, prosocial ‘BAROCCO, sum’ agents achieve a higher average lifetime (Fig. 3a), but most of it is concentrated in a single agent that collects all resources (Fig. 3b). This illustrates how maximizing the sum of agents’ payoffs can result in an unfair reward distribution.

  • ‘BAROCCO, min’ agents manage to cooperate and successfully solve the environment, reaching an average lifetime close to optimal (Fig. 3a). This illustrates how optimizing the minimum of agents’ payoffs instead of the sum favours solutions where payoffs are distributed evenly.

  • Increasing the influence of the selfish component is another way to reject solutions with an uneven payoff distribution. When the prosociality coefficient is decreased, ‘BAROCCO, sum’ agents are able to escape the local optimum where one of the agents is exploited, and both agents learn to successfully complete the task (Fig. 3c,d). This illustrates how fairness emerges from selfishness. The best performance is achieved at an intermediate value of the coefficient. It might seem counter-intuitive that decreasing agents’ prosociality positively affects performance, but similar results were reported by Durugkar et al. (2020).

  • Finally, CRS and COMA agents perform abysmally in Eldorado (Fig. 3a). Since these algorithms optimize a common reward that is always non-positive in Eldorado, each agent attempts to avoid the negative reward for the termination of the other agent and races to terminate earlier, as discussed in Section 3.2.

4.3.2 Q-learning agents in Eldorado

  • Selfish Q-learning agents perform about as well as selfish Actor-Critic agents, reaching an average lifetime of 800 that is evenly distributed (Fig. 4a, b). These agents are unable to refrain from attacking.

  • Prosocial ‘BAROCCO, sum’ and ‘BAROCCO, min’ agents are able to cooperate and successfully survive in the environment (Fig. 4a, b). Unlike the Actor-Critic agents, Q-learning ‘BAROCCO, sum’ agents do not converge to a local optimum where one of the agents is exploited.

  • CRS again underperforms compared to the selfish baseline. QMIX outperforms the selfish baseline but still performs slightly worse than BAROCCO (Fig. 4a).

  • At first, decreasing the prosociality coefficient has a slight negative effect on agents’ performance (Fig. 4c). At higher values of the coefficient, the agents refrain from attacking and manage to survive in the environment. However, once the coefficient is 0.3 or lower, survivability drops significantly as the agents begin to fight. The existence of such a threshold is consistent with the game-theoretic analysis of Durugkar et al. (2020), as well as with our toy experiment in Section 4.1.

4.3.3 Actor-Critic agents in Harvest

  • Selfish agents quickly learn to naively harvest every apple in sight and exhaust the supplies long before the episode ends. Unable to collude, they only gather about 200 apples per episode (Fig. 5a).

  • ‘BAROCCO, sum’ agents learn to alternate between harvesting and cultivating apples and manage to gather more than 800 apples per episode (Fig. 5a).

  • ‘BAROCCO, min’ agents collect slightly fewer apples than ‘BAROCCO, sum’ agents (Fig. 5a), but distribute the apples significantly more evenly among themselves (Fig. 5b). This result highlights that optimizing the minimum instead of the sum of agents’ payoffs might be preferable if fairness is a concern.

  • ‘COMA, sum’ also outperforms selfish agents but is less stable than ‘BAROCCO, sum’ (Fig. 5a), suggesting that our modifications of the training procedure can be beneficial.

  • Decentralized ‘CRS, sum’ performs better than all other algorithms (Fig. 5a), suggesting that the additional complexity of centralized algorithms can hinder performance in some environments. This result contradicts the findings of the prior literature, where centralization of training consistently improved performance (Rashid et al., 2018; Foerster et al., 2018). However, the algorithms suggested in this literature had not previously been tested in complex mixed environments like Harvest.

  • Performance of CRS and COMA plummets when minimum is chosen as the aggregation function (Fig. 5a), which is consistent with our prediction in Section 3.2 that optimizing the minimum of agents’ rewards at each step might be too restrictive. This result also highlights the flexibility of BAROCCO in the choice of this function.

  • The effect of varying the prosociality coefficient is monotonic: increasing it improves performance (Fig. 5c) but can result in unfair reward allocation (Fig. 5d).

4.3.4 Q-learning agents in Harvest

  • By and large, the results (Fig. 6) are similar to the case of Actor-Critic agents. Selfish agents converge to a naive strategy of collecting every apple in sight. ‘BAROCCO, sum’ agents outperform selfish agents by balancing harvesting and cultivating apples. ‘BAROCCO, min’ performs a little worse than ‘BAROCCO, sum’ but leads to a more even apple distribution. ‘QMIX, sum’ is less stable than ‘BAROCCO, sum’, suggesting that our training procedure is more suitable for mixed environments. ‘CRS, sum’ performs better than all other algorithms. ‘CRS, min’ and ‘QMIX, min’ fail to outperform even selfish agents, in contrast to BAROCCO, which is flexible in the choice of the aggregation function.

  • While the best team performance is achieved when the prosociality coefficient is maximal (Fig. 6c), this solution favours unfair apple allocation (Fig. 6d). Setting a lower coefficient leads to slower convergence and slightly lower final team performance, but yields a considerably fairer solution.

5 Conclusion

In this paper, we present BAROCCO, a meta-algorithm for combining social and selfish incentives in cooperative-competitive environments. We confirm the effectiveness of BAROCCO over the existing methods in two mixed multi-agent environments for both Q-learning and Actor-Critic frameworks. Specifically, we find that BAROCCO consistently improves over vanilla QMIX and COMA in all experiments, highlighting the usefulness of the modifications that we propose for training these algorithms in mixed environments. Furthermore, we find that varying the prosociality coefficient results in unique mixtures of selfish and selfless behaviour. While decreasing the coefficient typically increases fairness at the expense of efficiency, in some cases both efficiency and fairness can benefit from the influence of the selfish component. As an alternative way to achieve fairness, BAROCCO also allows training fair cooperative agents by maximizing the minimum of selfish payoffs. An exciting extension of our work could be to train reciprocal agents that dynamically assess the cooperativeness of others and adapt their policies accordingly. We also note that BAROCCO is not limited to the algorithms utilized in this paper, i.e. DQN, PPO, MADDPG, COMA and QMIX. Rather, we propose a unified framework of two separate modules, which can be modified by other state-of-the-art techniques from single-agent, mixed, or cooperative setups.

This work contributes to the broader discussion of what constitutes cooperation. Most MARL papers that study mixed environments focus on efficiency, but we argue that this metric can be too limiting. Agents that act towards a single common goal are more reminiscent of a swarm system than a group of distinct individuals that could mutually benefit from cooperation. We explore ways to incorporate the notion of fairness into such systems, either by preserving some individuality of the agents or by modifying the centralized objective. We hope that our work sparks further discussion regarding other desirable qualities of multi-agent systems and the means to achieve these qualities.

6 Acknowledgements

This research was supported in part through computational resources of HPC facilities at HSE University. Support from the Basic Research Program of the National Research University Higher School of Economics is gratefully acknowledged.

Appendix A Assessing Social Welfare: Examples

In this section, we elaborate on the advantages of the long-term value approach formulated in Section 3.2.

Applicability of sum and minimum as the aggregation function.

Consider the following toy environment. A centralized controller distributes positive unit rewards between two agents for two time-steps. Furthermore, if the same agent is rewarded twice, the second reward is doubled. In this environment, there are 4 options to distribute rewards: 2 options to reward the same agent at both time-steps, and 2 options to reward one agent at the first time-step and the other agent at the second time-step. We are interested in how to distribute the rewards in order to maximize social welfare. We analyze two definitions of the prosocial value function: the per-step definition, which aggregates the agents’ rewards at each time-step and then accumulates the result over time, and the long-term definition, which aggregates the agents’ individual values. We also analyze two choices of the aggregation function, i.e. sum and minimum. The results of the analysis are summarized in Table 3.

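| Option | Rewards of agent 1 (steps 1, 2) | Rewards of agent 2 (steps 1, 2) | Sum, per-step value | Sum, long-term value | Min, per-step value | Min, long-term value |
|---|---|---|---|---|---|---|
| Option 1 | 1, 2 | 0, 0 | 3 | 3 | 0 | 0 |
| Option 2 | 1, 0 | 0, 1 | 2 | 2 | 0 | 1 |
| Option 3 | 0, 1 | 1, 0 | 2 | 2 | 0 | 1 |
| Option 4 | 0, 0 | 1, 2 | 3 | 3 | 0 | 0 |
Table 3: Reward distributions and corresponding social welfare in the toy environment (values shown without discounting). The per-step value aggregates rewards across agents at each time-step and then sums over time; the long-term value sums each agent’s rewards over time and then aggregates across agents.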

We can observe several patterns consistent with our experimental findings. First, when the aggregation function is chosen as sum, the per-step value function (rewards aggregated across agents at each time-step) and the long-term value function (agents’ individual values aggregated) are equivalent. This is a consequence of the commutativity of summation: the order of summation over time-steps and over agents does not affect the resulting value function. Furthermore, maximizing the social welfare requires sacrificing the interests of one of the agents by choosing either option 1 or 4. Second, when the aggregation function is chosen as minimum, all four options are equivalent from the standpoint of the per-step value function. This is a consequence of the environment design: the controller is unable to reward both agents at the same time-step, so the minimum of the two rewards is always 0. In contrast, maximizing the long-term value function requires a fair reward distribution on average, and thus options 2 and 3 are preferred. Therefore, if fairness is a concern, the aggregation function should be chosen as minimum, and the long-term approach to defining the value function, which is used in BAROCCO, should be preferred.
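The toy environment is small enough to enumerate. The sketch below is an illustration under stated assumptions: rewards are not discounted, and the names `per_step_value` and `long_term_value` are labels introduced here for the two definitions of the prosocial value function (aggregate across agents at each time-step and then sum over time, versus sum each agent's rewards over time and then aggregate across agents).

```python
from itertools import product


def rewards_for(option):
    """Return rewards[agent][time_step] for a pair of recipients (agents 0 and 1)."""
    first, second = option
    rewards = [[0, 0], [0, 0]]
    rewards[first][0] = 1
    # The second reward is doubled if it goes to the same agent again.
    rewards[second][1] = 2 if second == first else 1
    return rewards


def per_step_value(rewards, aggregate):
    # Aggregate across agents at each time-step, then sum over time.
    return sum(aggregate(step_rewards) for step_rewards in zip(*rewards))


def long_term_value(rewards, aggregate):
    # Sum each agent's rewards over time, then aggregate across agents.
    return aggregate([sum(agent_rewards) for agent_rewards in rewards])


table = {}
for option in product((0, 1), repeat=2):  # options 1..4 in the order used in the text
    r = rewards_for(option)
    table[option] = {
        (name, kind): value_fn(r, agg)
        for name, agg in (("sum", sum), ("min", min))
        for kind, value_fn in (("per_step", per_step_value),
                               ("long_term", long_term_value))
    }
```

In this enumeration, only the combination of minimum and the long-term definition distinguishes the even splits (recipients `(0, 1)` and `(1, 0)`) from the options that reward a single agent twice.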

Environments where trajectory lengths vary.

Consider a two-agent environment where the only reward that each agent receives is a unit negative reward upon termination. We are interested in the incentives that drive selfish and prosocial agents under such a reward structure. In this example, the aggregation function is chosen as sum. Let the two agents’ trajectories have different lengths. The agent that terminates earlier will be referred to as the first agent, and the other as the second agent. The selfish values and the two kinds of prosocial values (per-step and long-term) of the two agents are estimated in Table 4 (the expectation operator is omitted).

Table 4: Values of two agents in an environment with reward upon termination. Columns: selfish agents, prosocial agents. Rows: general formula, value of agent 1, value of agent 2.

Depending on the value function that the agents optimize, they might learn different behaviour. First, each of the selfish agents is only incentivized to prolong its own trajectory. As was shown in the literature, such agents may struggle to achieve the mutual benefits of stable cooperation (Peysakhovich & Lerer, 2018b; Hughes et al., 2018; Jaques et al., 2019; Wang et al., 2019a). In contrast, the prosocial agents that optimize the long-term value are incentivized to prolong the trajectories of both agents and thus are willing to cooperate. However, this is not the only incentive that drives the prosocial agents that optimize the per-step value. While both such agents do benefit from longer episodes, each agent also prefers to be the first agent rather than the second, i.e. to terminate earlier. This incentive emerges because the first agent does not observe the termination of the second agent. Moreover, by comparing the values of such agents (Table 4, column 2), it is evident that the first agent receives higher payoffs than the second regardless of how long the second agent survives, since the first agent’s value omits the penalty for the second agent’s termination. Therefore, instead of cooperating to survive, such agents would compete for early termination.

A similar analysis can be performed for the opposite kind of environment, where the termination reward is positive. In such environments, the agents are usually required to complete certain tasks. Instead, the agents that optimize the per-step value would delay task completion in an attempt to observe the termination of the others.
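The argument above can be checked numerically with a small sketch. It rests on assumptions that are not taken from the paper: rewards are discounted by a factor `gamma` per step, the termination penalty of -1 is received at the step of termination, and an agent optimizing the per-step prosocial value only accumulates the summed rewards that occur while it is still alive. The function and key names are illustrative.

```python
def termination_values(t_first, t_second, gamma=0.99):
    """Values at time 0 for two agents that terminate at t_first < t_second."""
    assert t_first < t_second
    penalty_first = -(gamma ** t_first)    # discounted penalty of the first agent
    penalty_second = -(gamma ** t_second)  # discounted penalty of the second agent
    return {
        # Each selfish agent sees only its own termination penalty.
        "selfish": (penalty_first, penalty_second),
        # Per-step aggregation: the first agent is gone before the second penalty
        # arrives, while the second agent observes both penalties.
        "per_step": (penalty_first, penalty_first + penalty_second),
        # Long-term aggregation: both agents optimize the sum of selfish values.
        "long_term": (penalty_first + penalty_second,
                      penalty_first + penalty_second),
    }
```

With these assumptions, the first agent's per-step value does not depend on `t_second` at all and always exceeds the second agent's, whereas the long-term value is shared and improves whenever either agent survives longer.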

The two discussed environments with positive and negative termination rewards are extreme examples with two opposite artifacts. However, a combination of these artifacts may emerge in an environment with an arbitrary reward structure and varying episode length, which can result in unexpected and suboptimal behaviour.

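| Hyperparameter | Q-learning, Eldorado | Q-learning, Harvest | Actor-Critic, Eldorado | Actor-Critic, Harvest |
|---|---|---|---|---|
| discount factor | 0.99 | 0.99 | 0.99 | 0.99 |
| Adam learning rate | 0.0005 | 0.0005 | 0.0005 | 0.001 |
| learning rate decay | 0.999995 | 0.999995 | 0.999998 | 0.9998 |
| batch size | 64 | 128 | 2000 | 3000 |
| mini-batch size | - | - | 500 | 500 |
| # epochs | - | - | 10 | 3 |
| # FC layers | 3 | 2 | 2 | 3 |
| # per-layer FC neurons | 64 | 64 | 128 | 64 |
| # LSTM layers | 0 | 0 | 0 | 1 |
| # CNN layers | 0 | 1 | 0 | 1 |
| target network period | 2K | 2K | - | - |
| exploration rate | | | - | - |
| exploration rate decay | 0.99999 | 0.999975 | - | - |
| noisy exploration | 0.5 | 0.5 | - | - |
| entropy coefficient | - | - | | |
| entropy decay | - | - | 0.99998 | 0.998 |
| buffer size | 500K | 250K | - | - |
| prioritization exponent | 0.6 | 0.6 | - | - |
| # quantiles (selfish component) | 10 | 10 | 1 | 1 |
| # steps in n-step returns | 5 | 5 | 1 | 1 |
Table 5: Hyperparameters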

Appendix B Technical Details

Pseudocode of BAROCCO for Q-learning and Actor-Critic agents is presented in Algorithms 1 and 2, respectively. The choice of hyperparameters for the algorithms is reported in Table 5.

In the Q-learning framework, the selfish component is implemented via Rainbow (Hessel et al., 2018), and the prosocial component is implemented via QMIX (Rashid et al., 2018). Neither of the components utilizes parameter sharing for the selfish or prosocial action-value predictions. Both noisy (Fortunato et al., 2017) and ε-greedy exploration are applied. The rate of exploration is annealed to 0. This is an important detail, because for a given agent the hard-coded randomness of other agents’ actions can change its optimal policy (Wunder et al., 2010). Both Rainbow and QMIX use experience replay buffers. A well-known issue of experience replay is that it can be harmful in non-stationary environments (Lin, 1992). To address the inherent non-stationarity of multi-agent environments, we adopt the fingerprint technique (Foerster et al., 2017) by adding the exploration rate to the state space. Vanilla QMIX additionally utilizes double Q-learning (Van Hasselt et al., 2016; Fu et al., 2019). Finally, we utilize multiprocessing to perform interaction with the environment, updates of the selfish components, and updates of the prosocial components in parallel, similarly to APEX (Horgan et al., 2018).
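The exploration-related details can be sketched in a few lines. This is an illustration only: the multiplicative annealing schedule is an assumption suggested by the decay values in Table 5, the observation is taken to be a flat vector, and all function names are hypothetical.

```python
import random


def add_fingerprint(observation, exploration_rate):
    """Append the current exploration rate to a flat observation vector."""
    return list(observation) + [exploration_rate]


def epsilon_greedy(action_values, exploration_rate, rng=random):
    """Pick a random action with probability exploration_rate, else the greedy one."""
    if rng.random() < exploration_rate:
        return rng.randrange(len(action_values))
    return max(range(len(action_values)), key=action_values.__getitem__)


def anneal(exploration_rate, decay):
    """Assumed schedule: multiply the exploration rate by a constant decay factor."""
    return exploration_rate * decay
```

Because the exploration rate is part of the input, transitions stored in the replay buffer carry a marker of how random the other agents were when the transition was collected, which is the purpose of the fingerprint.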

In the Actor-Critic framework, the selfish component for each agent is a critic that estimates the agent’s value function based on global information, and the prosocial component is a critic that estimates social welfare using COMA (Foerster et al., 2018). Again, neither the selfish nor the prosocial critics share parameters. The decentralized policies are trained on a combination of selfish and social advantages via PPO (Schulman et al., 2017). The combined advantage is normalized over the batch. To enhance exploration, we apply entropy regularization (Mnih et al., 2016), annealed to 0 over the course of training. All weights of the networks use orthogonal initialization (Hu et al., 2020). Finally, neither of the components utilizes experience replay.
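A minimal sketch of combining and normalizing the two advantages is given below. The linear mixing rule with a `prosociality` weight is an assumption made for illustration and is not confirmed by this section; the normalization subtracts the batch mean and divides by the batch standard deviation, with a small `eps` added for numerical safety.

```python
from statistics import fmean, pstdev


def combined_advantages(selfish_adv, social_adv, prosociality, eps=1e-8):
    """Mix per-transition advantages and normalize the result over the batch."""
    # Assumed mixing rule: convex combination weighted by the prosociality coefficient.
    mixed = [prosociality * social + (1.0 - prosociality) * selfish
             for selfish, social in zip(selfish_adv, social_adv)]
    mean = fmean(mixed)
    std = pstdev(mixed)
    return [(a - mean) / (std + eps) for a in mixed]
```

Setting `prosociality` to 0 in this sketch recovers a purely selfish advantage, which matches the description of the selfish baseline as BAROCCO without the social component.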

In Eldorado, both agents receive global information as inputs. The state space is a vector with 28 elements. It includes the statuses of food tiles, as well as characteristics of both agents, such as their coordinates, health points, resources, and actions taken in the previous turn. The action space consists of 10 possible options, which include 4 movement options, an option to pass, and an option to attack (combined with movement and passing). The only reward that each agent receives is a unit positive reward upon surviving for 1000 steps or a unit negative reward upon earlier termination.

In Harvest, each agent’s local observation is restricted to a 15 by 15 part of the map, whereas the global state includes information about the whole 16 by 38 map. Both local and global states are 3-dimensional RGB images and are always preprocessed with a 6-channel CNN. The action space consists of 8 possible options, which include 4 movement options, 2 turn options, an option to pass, and an option to attack. The reward structure is the same as in the original implementation (Hughes et al., 2018): each agent receives a fixed reward for each of three events, namely collecting an apple, being attacked directly, and stepping into the fire left after an attack.

  Initialize replay buffers for the selfish and prosocial components
    Networks that predict the selfish and prosocial action-values
    Hypernetwork that predicts weights of the mixing network
    Target networks
  while True do
     for each transition do
        Sample weights in noisy layers, reduce exploration rate
        for each agent do
           With probability equal to the exploration rate, sample a random action
           Otherwise, select the greedy action
        end for
        Apply agents’ actions, observe rewards and next state
        Store transitions to the selfish and prosocial replay buffers
     end for
     for each agent do
        Sample mini-batch of transitions from the selfish buffer to update the selfish action-value
        for each transition do
           Compute the temporal difference target
        end for
        Update the selfish network via gradient descent on temporal difference loss
     end for
     Periodically, copy weights of online networks to target networks
     Sample mini-batch from the prosocial buffer to update the prosocial action-values
     for each transition do
        Utilize global state via hypernetworks
        for each agent do
           Evaluate the agent’s prosocial action-value
        end for
     end for
     Update the prosocial networks and the hypernetwork via gradient descent on the resulting loss
  end while
Algorithm 1 BAROCCO for Q-learning framework
  Initialize critic networks that predict the selfish and prosocial values
    Actor networks that predict policies
  while True do
     for each transition do
        Sample agents’ actions from respective policies
        Apply agents’ actions, observe rewards and next state
        Store transition to the batch
     end for
     for each mini-batch do
        for each agent do
           for each transition do
              Compute value targets and advantages
           end for
           Update the selfish critic via gradient descent on temporal difference loss
           Update the actor on PPO loss with entropy regularization
        end for
        for each agent do
           Update the prosocial critic via gradient descent on temporal difference loss
        end for
     end for
  end while
Algorithm 2 BAROCCO for Actor-Critic framework