Learning Altruistic Behaviours in Reinforcement Learning without External Rewards

07/20/2021 ∙ by Tim Franzmeyer, et al. ∙ Google ∙ University of Oxford ∙ ETH Zurich

Can artificial agents learn to assist others in achieving their goals without knowing what those goals are? Generic reinforcement learning agents could be trained to behave altruistically towards others by rewarding them for altruistic behaviour, i.e., rewarding them for benefiting other agents in a given situation. Such an approach assumes that other agents' goals are known so that the altruistic agent can cooperate in achieving those goals. However, explicit knowledge of other agents' goals is often difficult to acquire. Even assuming such knowledge to be given, training of altruistic agents would require manually-tuned external rewards for each new environment. Thus, it is beneficial to develop agents that do not depend on external supervision and can learn altruistic behaviour in a task-agnostic manner. Assuming that other agents rationally pursue their goals, we hypothesize that giving them more choices will allow them to pursue those goals better. Some concrete examples include opening a door for others or safeguarding them to pursue their objectives without interference. We formalize this concept and propose an altruistic agent that learns to increase the choices another agent has by maximizing the number of states that the other agent can reach in its future. We evaluate our approach on three different multi-agent environments where another agent's success depends on the altruistic agent's behaviour. Finally, we show that our unsupervised agents can perform comparably to agents explicitly trained to work cooperatively. In some cases, our agents can even outperform the supervised ones.




1 Introduction

Altruistic behaviour is often described as behaviour that is intended to benefit others, sometimes at a cost for the actor (Dowding1997TheHumanity; Fehr2003TheAltruism). Such behaviour might be a desirable trait when integrating artificial intelligence into various aspects of human life and society, such as personal artificial assistants, house or warehouse robots, autonomous vehicles, and even recommender systems for news and entertainment. We may expect that artificial agents, by observing and interacting with us, could adapt to our behaviour and objectives, and learn to act helpfully and selflessly. Altruistic behaviour could be a step towards value alignment (Allen2005ArtificialApproaches; Gabriel2020ArtificialAlignment), which aims to incorporate common-sense human values into artificial agents, by mimicking the same biological mechanisms that allowed human societies to cooperate and develop (Fehr2003TheAltruism).

Typically, we could achieve such an altruistic behaviour through various forms of supervision such as providing ground-truth actions at each time step, training agents with reinforcement learning and suitable rewards, or through imitation learning (Song2018Multi-agentLearning). However, none of the approaches above scale up easily. They either require a large amount of supervision or carefully crafted rewards that can easily be misstated, leading to unwanted behaviour (Russell2019HumanControl, ch. 1).

Given all the challenges above, we propose a new approach that enables artificial agents to learn altruistic behaviour purely from observations and interactions with other agents. We hypothesise that altruistic behaviour can be acquired by actively increasing the number of choices that other agents face, which we refer to as their generic choice. Intuitively, the more paths to achieve an objective, the easier it can be achieved. Given that intuition, we propose an unsupervised approach to learn altruistic behaviour without any extrinsic supervision such as rewards or expert demonstrations. Towards that goal, we propose three different computational methods that estimate the choice of another agent in a given state. The first method, which we call discrete choice, estimates the number of states the agent can reach in its near future. This approach is intractable for larger or continuous state spaces, hence our second proposal, entropic choice. This method uses entropy to evaluate uncertainties over possible future states. To further reduce computational complexity, we derive immediate choice as a special case of entropic choice, which estimates the short-term choice of an agent as the entropy of its policy.

We evaluate our methods in three diverse multi-agent environments. We always assume there are at least two agents: the leader agent that executes its own policy and can be trained using standard supervised methods and an altruistic agent whose role is to help the leader. In all our environments, the overall success of the leader agent depends on the altruistic agents’ behaviour. We show that our unsupervised approach outperforms unsupervised baselines by a large margin and, in some cases, also outperforms the supervised ones. Finally, we demonstrate possible failure cases of our approach where maximising the leader agent’s choice can lead to suboptimal behaviour.

Our work makes the following three contributions:

  • We devise a multi-agent RL framework for intrinsically motivated artificial agents that act altruistically by maximising the choice of others.

  • We define and evaluate three task-agnostic methods to estimate the choice that an agent has in a given situation, which are all related to the variety in states it can reach.

  • We experimentally evaluate our unsupervised approach in three multi-agent environments and are able to match and, in some cases, outperform supervised baselines.

2 Related work

To the best of our knowledge, we are the first to experimentally evaluate intrinsically motivated agents with purely altruistic objectives. However, there are many related concepts in the literature.

Cooperative multi-agent reinforcement learning.

Existing research mostly focuses on improving cooperation among agents in order to improve the overall outcome for all participating agents. As the outcome is defined and measured by their supervised rewards, such agents do not have purely altruistic objectives. Foerster2018LearningAwareness enable agents to learn intrinsically motivated cooperation by considering the anticipated parameter updates of other agents. Wang2019EvolvingBehavior combine principles from high-level natural selection with low-level deep learning to achieve improved Nash equilibria in social dilemmas. Other work develops a hierarchical social model to infer the intentions of others and decide whether to cooperate or to compete with them. The same authors recently developed a Bayesian inference framework that allows agents to cooperatively solve sets of predefined subtasks (Wang2020TooPlanning). Guckelsberger2016IntrinsicallyMaximisation developed an intrinsically motivated general-companion agent for computer games, using coupled empowerment maximization. This work is the most similar to ours, as it develops an agent that explicitly supports others in achieving their goals. However, the objective of the supportive agent also encompasses maximizing its own empowerment (Salge2014EmpowermentAnIntroduction). In contrast to these previous works, we design a purely altruistic agent, i.e. the agent is not incentivized to cooperate in order to maximize its own objectives.

Imitation learning.

Inverse reinforcement learning learns objectives from observations and can be used in single-agent (Fu2017LearningLearning) and multi-agent scenarios (Song2018Multi-agentLearning; Yu2019Multi-agentLearning; Jeon2020ScalableActor-attention-critic). In particular, hadfield2016cooperative presented an approach to explicitly learn cooperative behaviour from expert demonstrations. However, inverse RL relies on the existence of expert demonstrations, which are often difficult to obtain at scale. Using inverse RL to learn rewards or objectives in multi-agent environments is also less straightforward than in the single-agent setting (yu2019multi). In contrast, we do not rely on expert demonstrations or any form of supervision to train an agent to behave altruistically.

Entropy maximization.

One objective that we consider, maximizing entropy over reachable states, is related to maximum-state-entropy exploration in single-agent RL (Hazan2019ProvablyExploration; Mutti2019AnPolicies; Mutti2020AExploration; Lee2019EfficientMatching; Tarbouriech2019ActiveProcesses; Seo2021StateExploration). Furthermore, the concept of maximizing the capacity of an agent to affect its environment is known as empowerment (Klyubin2005Empowerment:Control; Klyubin2008KeepSystems) and can be used as a measure for an agent state’s quality (Anthony2008OnStructure). Gregor2016VariationalControl and Mohamed2015VariationalLearning furthermore use empowerment as an objective for intrinsically motivated exploration. Volpi2020Goal-DirectedBehaviour elaborate on this concept and combine intrinsic motivation through empowerment maximization with externally specified task-oriented behaviour. Moreover, entropy is also used as a regularizer in maximum entropy RL. This regularization term increases the stochasticity of learnt policies, and thus it encourages exploration (haarnoja2018soft; Haarnoja2018SoftApplications; Fox2016TamingUpdates). On another note, Rens2020EvidenceBehaviour conducted a study with human subjects that showed that human free-choice behaviour maximizes both utility and entropy. In a different domain, Wissner-Gross2013CausalForces demonstrated how causal entropic forces drive mechanical systems towards favourable states.

3 Methods

In this section, we formalize our framework. We start with the generic definition of the multi-agent setting. Next, we describe several approaches to estimating choice for a single agent and show how they can be applied to a two-agent Markov Game.

3.1 Markov Game

We consider a Markov Game (Littman1994MarkovLearning), which generalizes a Markov Decision Process (MDP) to a multi-agent scenario. In a Markov Game, agents interact in the same environment. At time step $t$, each agent (the $i$-th of a total of $N$ agents) takes the action $a_i^t$, receives a reward $r_i^t$, and finally the environment transitions from state $s^t$ to $s^{t+1}$. A Markov Game is then defined by a state space $\mathcal{S}$ ($s^t \in \mathcal{S}$), a distribution of initial states $\eta$, the action space $\mathcal{A}_i$ ($a_i^t \in \mathcal{A}_i$) and reward function $r_i(s, a_1, \ldots, a_N)$ of each agent $i$, an environment state transition probability $P(s^{t+1} \mid s^t, a_1, \ldots, a_N)$, and finally the agents’ discount factors $\gamma_i$, which prevent divergence of accumulated future rewards.

3.2 Estimating choice for a single agent

We first consider a single-agent scenario, i.e. $N = 1$, where only a leader agent, indicated by the subscript $L$, interacts with the environment through its pretrained and stochastic policy $\pi_L$.

We denote the leader agent’s generic choice in a given state $s$ as $C_L(s)$, for which we propose concrete realizations below. Each method relies on the probability distribution over states after the leader agent has taken $n$ actions starting with a given state $s^t$. This is the $n$-step state distribution of the underlying single-agent MDP, conditioned on the current state: $P(S^{t+n} \mid \pi_L, s^t)$, where $S^{t+n}$ is the random variable holding the state $n$ steps ahead. We will now define three methods based on $P(S^{t+n} \mid \pi_L, s^t)$.

Discrete choice.

Our first derived method simply defines the choice of the leader agent in state $s^t$ as the number of states that it can reach within $n$ transitions, which we refer to as its discrete choice:

$$DC_L^n(s^t) = \left|\operatorname{range}\left(P(S^{t+n} \mid \pi_L, s^t)\right)\right|, \tag{1}$$

where $\operatorname{range}(X)$ is the set of all values that a random variable $X$ takes on with positive probability and $|\cdot|$ measures the size of that set. While this count-based estimator of choice is intuitive and easily interpretable, it can hardly be estimated practically in large or continuous state spaces.
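As a minimal sketch of this count-based estimator, the following Python snippet counts the states that have positive probability after exactly $n$ transitions in a small tabular MDP. The corridor environment, the layout of the `transition` dictionary and the function names are assumptions made for illustration and are not taken from the paper. Because the toy leader can stay still, the states reachable after exactly $n$ steps coincide with those reachable within $n$ steps.

```python
from typing import Dict, Hashable, Set

# Hypothetical representation: transition[s][s2] = P(s2 | s) under the
# leader's fixed policy, i.e. policy and dynamics already combined.
Transition = Dict[Hashable, Dict[Hashable, float]]


def n_step_support(transition: Transition, start: Hashable, n: int) -> Set[Hashable]:
    """Return the states with positive probability after n transitions."""
    frontier = {start}
    for _ in range(n):
        frontier = {
            nxt
            for state in frontier
            for nxt, prob in transition[state].items()
            if prob > 0.0
        }
    return frontier


def discrete_choice(transition: Transition, start: Hashable, n: int) -> int:
    """Count-based choice: size of the support of the n-step distribution."""
    return len(n_step_support(transition, start, n))


# Toy corridor with cells 0-1-2-3; the leader moves left, right or stays,
# each available option being equally likely.
corridor: Transition = {
    0: {0: 0.5, 1: 0.5},
    1: {0: 1 / 3, 1: 1 / 3, 2: 1 / 3},
    2: {1: 1 / 3, 2: 1 / 3, 3: 1 / 3},
    3: {2: 0.5, 3: 0.5},
}
```

For the leftmost cell and a horizon of two, `discrete_choice(corridor, 0, 2)` counts cells 0, 1 and 2, whereas starting from the more central cell 1 all four cells are reachable.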

Entropic choice.

It can be shown that the entropy $H(X)$ of a random variable $X$ acts as a lower bound for the logarithm of the size of the set of values that $X$ takes on with positive probability (Galvin2014ThreeCounting, Property 2.6), i.e. $H(X) \le \log |\operatorname{range}(X)|$. We define a lower bound of the (logarithmic) discrete choice by computing the Shannon entropy of the $n$-step state distribution, which we refer to as the agent’s entropic choice:

$$EC_L^n(s^t) = H(S^{t+n} \mid \pi_L, s^t) = -\sum_{s \in \mathcal{S}} P(S^{t+n} = s \mid \pi_L, s^t) \log P(S^{t+n} = s \mid \pi_L, s^t), \tag{2}$$

which estimates the agent’s choice as the uncertainty about its state after $n$ transitions. Unlike eq. 1, $EC_L^n$ can be computed in continuous state spaces or efficiently estimated by Monte Carlo sampling.
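The relation between the entropy and the count-based estimate can be checked numerically. The sketch below, under the same illustrative assumptions as before (a hypothetical four-cell corridor and a `transition` dictionary that already combines policy and dynamics), propagates the state distribution exactly for $n$ steps and computes its Shannon entropy in nats. It is not the estimator used in the paper's experiments, only a small worked instance of the definition.

```python
import math
from typing import Dict, Hashable

Transition = Dict[Hashable, Dict[Hashable, float]]


def n_step_distribution(transition: Transition, start: Hashable, n: int) -> Dict[Hashable, float]:
    """Exact distribution over states after n transitions from start."""
    dist = {start: 1.0}
    for _ in range(n):
        next_dist: Dict[Hashable, float] = {}
        for state, p_state in dist.items():
            for nxt, p_next in transition[state].items():
                next_dist[nxt] = next_dist.get(nxt, 0.0) + p_state * p_next
        dist = next_dist
    return dist


def entropy(dist: Dict[Hashable, float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return sum(p * math.log(1.0 / p) for p in dist.values() if p > 0.0)


def entropic_choice(transition: Transition, start: Hashable, n: int) -> float:
    """Entropy of the n-step state distribution."""
    return entropy(n_step_distribution(transition, start, n))


# Hypothetical corridor: cells 0-1-2-3, leader moves left, right or stays.
corridor: Transition = {
    0: {0: 0.5, 1: 0.5},
    1: {0: 1 / 3, 1: 1 / 3, 2: 1 / 3},
    2: {1: 1 / 3, 2: 1 / 3, 3: 1 / 3},
    3: {2: 0.5, 3: 0.5},
}

two_step = n_step_distribution(corridor, 0, 2)
support_size = sum(1 for p in two_step.values() if p > 0.0)
```

In this example the two-step distribution from cell 0 is not uniform over its three reachable cells, so its entropy lies strictly below the logarithm of the support size; the bound is tight only for a uniform distribution.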

Immediate choice.

To further simplify entropic choice and reduce its computational complexity, we limit the look-ahead horizon to $n = 1$ and assume that each action taken by the leader agent deterministically transfers the system into a mutually exclusive new state, i.e. no two actions taken at $s^t$ lead to the same state $s^{t+1}$. This assumption is often true in navigation environments, where different step actions result in different states. We can then simplify the one-step state distribution of the leader agent to $P(S^{t+1} = s \mid \pi_L, s^t) = \pi_L(a_L^t \mid s^t)$, where $a_L^t$ is the action that leads to $s$, and compute a simplified, short-horizon version of the agent’s entropic choice, which we refer to as the immediate choice:

$$IC_L(s^t) = H(S^{t+1} \mid \pi_L, s^t) = H(\pi_L(\cdot \mid s^t)). \tag{3}$$

Immediate choice (IC) defines the choice of an agent for a 1-step horizon as the uncertainty about the agent’s next action, hence it can be easily computed as the entropy over its policy conditioned on the current state.
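Since immediate choice only needs the leader's action probabilities in the current state, it reduces to a few lines of code. The following sketch assumes the policy is available as a list of probabilities over a discrete action set; the function name and the five-action example (four moves plus staying still) are illustrative.

```python
import math
from typing import Sequence


def immediate_choice(action_probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of the leader's action distribution."""
    if any(p < 0.0 for p in action_probs):
        raise ValueError("probabilities must be non-negative")
    if not math.isclose(sum(action_probs), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    return sum(p * math.log(1.0 / p) for p in action_probs if p > 0.0)


# A leader that is indifferent between five actions has maximal choice,
# while a leader forced into a single action has none.
uniform_policy = [0.2] * 5
forced_policy = [1.0, 0.0, 0.0, 0.0, 0.0]
```

A uniform policy over five actions attains the maximum value $\log 5$, and a deterministic policy yields zero, which matches the intuition that a cornered agent has no choice left.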

3.3 Behaving altruistically by maximizing another agent’s choice

Having considered three methods to estimate an agent’s choice in a given state, we now apply them to a Markov Game of two agents. The main hypothesis is that maximizing the number of choices of another agent will allow it to reach more favourable regions of the state-space, thus cooperating without a task-specific reward signal.

Altruistic agent’s policy definition.

In this Markov Game, one agent is the leader, with the subscript $L$, and another one is the altruistic agent, with the subscript $A$. We define the optimal policy of the altruistic agent as the one that maximizes the future discounted choice of the leader agent,

$$\pi_A^* = \arg\max_{\pi_A} \sum_{t=0}^{\infty} \gamma_A^t \, C_L(s^t), \tag{4}$$

where the generic choice $C_L(s^t)$ can be estimated by one of several methods: discrete choice $DC_L^n(s^t)$, entropic choice $EC_L^n(s^t)$ or immediate choice $IC_L(s^t)$.

Conditional estimates of choice.

As the agents interact in the same environment, they both have influence over the system state $s^t$, which contains the state of both agents. This makes single-agent objectives based on the state distribution (such as eq. 1 and 2) difficult to translate to a multi-agent setting, since the states of both agents are intermingled. For example, an altruistic agent that maximizes entropic choice naively (eq. 2) will maximize both the state uncertainty of the leader agent (which mirrors the single-agent entropic choice) and its own state uncertainty (by acting more randomly, which does not contribute towards the altruism goal).

To maximize entropic choice without also increasing the entropy of the altruistic agent’s actions, we propose to condition the choice estimate on the altruistic agent’s actions over the same time horizon, $A_A^{t:t+n-1} = (A_A^t, \ldots, A_A^{t+n-1})$:

$$EC_L^n(s^t) = H(S^{t+n} \mid A_A^{t:t+n-1}, \pi_L, s^t). \tag{5}$$

In order to better understand eq. 5, we can use the chain rule of conditional entropy (Cover2005ElementsTheory, ch. 2) to decompose it into two terms:

$$EC_L^n(s^t) = H(S^{t+n}, A_A^{t:t+n-1} \mid \pi_L, s^t) - H(A_A^{t:t+n-1} \mid \pi_L, s^t), \tag{6}$$

respectively the joint entropy of the states and actions, and the entropy of the actions. Therefore, we can interpret this objective as the altruistic agent maximizing the uncertainty of states and actions, but subtracting the uncertainty of its own actions, which is the undesired quantity. We can also relate eq. 5 to discrete choice (eq. 1). Using the fact that $H(X \mid E) \le \log |\operatorname{range}(X \mid E)|$ for a random variable $X$ and event $E$ (Galvin2014ThreeCounting, Property 2.12), we see that eq. 5 is a lower bound for a count-based choice estimate (analogous to eq. 1), also conditioned on the altruistic agent’s actions:

$$EC_L^n(s^t) \le \log \left|\operatorname{range}\left(P(S^{t+n} \mid A_A^{t:t+n-1}, \pi_L, s^t)\right)\right| = \log DC_L^n(s^t). \tag{7}$$

However, assuming simultaneous actions, the immediate choice estimate (eq. 3) stays unchanged, i.e. $IC_L(s^t) = H(\pi_L(\cdot \mid s^t, a_A^t)) = H(\pi_L(\cdot \mid s^t))$. The technical details of how these estimates can be computed from observations of the environment transitions are given in Appendix A.
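The chain-rule decomposition of the conditional entropy can be verified on a small joint distribution. The sketch below is purely illustrative: the two altruistic actions (`hold_switch`, `leave_switch`), the cell names and the probabilities are hypothetical, and the joint table stands in for the distribution over the altruistic agent's action and the resulting state. It computes the conditional entropy once as joint entropy minus action entropy and once directly as the action-weighted average of per-action state entropies.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Joint = Dict[Tuple[str, str], float]  # joint[(action, state)] = P(action, state)


def entropy(probs: Iterable[float]) -> float:
    """Shannon entropy in nats of a collection of probabilities."""
    return sum(p * math.log(1.0 / p) for p in probs if p > 0.0)


def conditional_entropy_chain_rule(joint: Joint) -> float:
    """H(S | A) computed as H(S, A) - H(A)."""
    action_marginal: Dict[str, float] = defaultdict(float)
    for (action, _state), p in joint.items():
        action_marginal[action] += p
    return entropy(joint.values()) - entropy(action_marginal.values())


def conditional_entropy_direct(joint: Joint) -> float:
    """H(S | A) computed as the sum over a of P(a) * H(S | A = a)."""
    by_action: Dict[str, Dict[str, float]] = defaultdict(dict)
    for (action, state), p in joint.items():
        by_action[action][state] = p
    total = 0.0
    for state_probs in by_action.values():
        p_action = sum(state_probs.values())
        total += p_action * entropy(q / p_action for q in state_probs.values())
    return total


# Hypothetical example: holding the switch lets the leader end up in any of
# four cells, leaving it confines the leader to two cells.
joint: Joint = {}
for cell in ("c0", "c1", "c2", "c3"):
    joint[("hold_switch", cell)] = 0.5 * 0.25
for cell in ("c0", "c1"):
    joint[("leave_switch", cell)] = 0.5 * 0.5
```

Both functions agree on this table, and the value excludes the one bit of randomness that comes from the altruistic agent's own coin flip between the two actions, which is the quantity the conditioning is meant to remove.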

4 Experimental evaluation

We introduce three multi-agent environments of increasing complexity, in which the success of a leader agent depends on the behaviour of one or more additional agents. In each environment, we first evaluate a subset of the proposed methods for choice estimation ($DC_L^n$, $EC_L^n$ and $IC_L$) by comparing the estimated choice of the leader agent in minimalist scenarios. We then evaluate our approach of behaving altruistically towards others by maximizing their choice (section 3.3). We provide videos of the emergent behaviours in appendix E. We compare our method to both an unsupervised and a supervised approach. Please note, however, that the supervised approach works under stronger assumptions as it requires direct access to the leader agent’s rewards.

4.1 Discrete environments with controllable gates

Figure 1: Top row: Single-occupancy grid environment in which agents can either move to a free adjacent cell or stay still. The green apple (reward +1) can only be obtained by the leader agent (in green) and no other external rewards exist. Grey cells are blocked. Bottom row: Visualisation of the estimated choice of the leader agent when positioned at the respective cells. Left to right: discrete choice, entropic choice and immediate choice (eq. 1, 2 and 3).

We start by considering three different scenarios on a grid, illustrated in Fig. 1 (top row). The starting positions of the leader (green) and an additional agent (blue) are shown in faded colors, obstacles are gray, and agents’ actions consist of moving in one of the four cardinal directions or staying still.

Choice estimate analysis.

We first verify whether the estimated choice for each state (agent position) correctly maps to our intuitive understanding of choice (that is, the diversity of actions that can be taken). To this end, we conducted an analysis of the estimated choice of the leader agent using a simplified version of the environment (Fig. 1, top left), in which only the leader agent is present and selects actions uniformly at random. Fig. 1 (bottom row) shows the three different methods of estimating choice evaluated for each possible cell position of the leader agent. We can observe that states in less confined areas, e.g. further away from walls, generally feature higher choice estimates, with the least choice being afforded by the dead end at the right. All three methods’ estimates are qualitatively similar, which validates the chosen approximations. In line with the simplifications made, the immediate choice (IC) estimates tend to be more local, as can be observed when comparing the estimates for the cell at row 2, column 4 (bottom and the rightmost panels in the figure). In conclusion, these results qualitatively agree with an intuitive understanding of choice of an agent in a grid environment.

4.1.1 Evaluating the altruistic agent

Here, we quantitatively evaluate whether these estimates (sec. 3.3) can be used to train an additional agent to behave altruistically towards the leader.

Environment setup.

In the Door Scenario (Fig. 1, top center), the door switch (row 1, col. 8) can only be operated by the altruistic agent. The door (row 2, col. 4) remains open as long as the altruistic agent is on the switch cell and is closed otherwise. As the leader agent always starts to the left of the door and the altruistic agent to the right, the leader agent can only attain its goal, the apple (row 2, col. 6), if the altruistic agent uses the door switch to enable the leader agent to pass through the door. In the Dead End Scenario (Fig. 1, top right), the door is always open, and the leader agent’s target object (green apple) is moved to the top right cell. Hence, the leader agent can obtain the apple without additional help from the altruistic agent. However, the altruistic agent could potentially block the path by positioning itself at the entry to the dead end. This situation would be the opposite of altruistic behaviour and is, of course, undesired.


We start by pretraining the leader agent with Q-Learning (Watkins1992Q-learning), with the altruistic agent executing a random policy. Hence, after convergence, the leader agent’s policy targets the green apple. Appendix B has more details and all parameters for our setting. Afterwards, the leader agent’s learning is frozen and the altruistic agent is trained; it always observes the position of the leader agent, its own position, and the environment state, which is composed of the door state (open, closed) and the food state (present, eaten). The altruistic agent is trained with Q-Learning to maximize the discounted future choice of the leader agent (eq. 4). For that, it uses one of the three proposed methods (eq. 3, eq. 5 or eq. 7), as detailed in appendix A.1.
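The training loop described above can be illustrated with tabular Q-learning in which the only reward is an estimate of the leader's choice. The sketch below is a deliberately tiny stand-in and not the paper's grid environment: the two altruistic-agent states, the toggle dynamics, the numeric choice values in `LEADER_CHOICE` and all hyperparameters are hypothetical, chosen only so that the door-switch intuition is visible.

```python
import random
from collections import defaultdict
from typing import DefaultDict, Tuple

ACTIONS = ("stay", "move")

# Hypothetical choice estimates of the leader: standing on the switch opens
# the door, so the leader can reach more states and has more choice.
LEADER_CHOICE = {"on_switch": 1.0, "off_switch": 0.2}

QTable = DefaultDict[Tuple[str, str], float]


def step(state: str, action: str) -> Tuple[str, float]:
    """Toy dynamics: 'move' toggles between the switch cell and elsewhere.

    The reward is the leader's estimated choice in the resulting state;
    no task reward is involved.
    """
    if action == "move":
        next_state = "off_switch" if state == "on_switch" else "on_switch"
    else:
        next_state = state
    return next_state, LEADER_CHOICE[next_state]


def train(n_steps: int = 5000, alpha: float = 0.1, gamma: float = 0.9, seed: int = 0) -> QTable:
    """Off-policy tabular Q-learning with uniformly random exploration."""
    rng = random.Random(seed)
    q: QTable = defaultdict(float)
    state = "off_switch"
    for _ in range(n_steps):
        action = rng.choice(ACTIONS)
        next_state, reward = step(state, action)
        best_next = max(q[(next_state, a)] for a in ACTIONS)
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        state = next_state
    return q


def greedy_action(q: QTable, state: str) -> str:
    """Action with the highest learned value in the given state."""
    return max(ACTIONS, key=lambda a: q[(state, a)])


q_table = train()
```

After training, the greedy policy moves onto the switch and then stays there, mirroring on a small scale the door-opening behaviour that the choice-maximizing objective is intended to produce.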


We investigate the developed behaviour of the altruistic agent after convergence for different choices of the hyperparameters: the look-ahead horizon $n$ (which determines the scale at which choices are considered) and the discount factor $\gamma_A$ (which defines whether the altruistic agent gives higher importance to the short-term or long-term choice of the leader agent). Success is binary: either the leader agent attains its goal (green apple), or not.

In the Door Scenario (Fig. 1, top center), we found that, for longer horizons $n$ and higher discount factors $\gamma_A$, the altruistic agent opens the door to allow the leader agent to reach its target, by occupying the switch position (square outline; row 1, col. 8). For shorter horizons $n$ and lower discount factors $\gamma_A$, the altruistic agent does not execute any coordinated policy and the leader does not succeed.

In the Dead End Scenario (Fig. 1, top right), we observe that, for longer horizons $n$ and large discount factors $\gamma_A$, the altruistic agent stays out of the leader agent’s way by occupying a far-away cell (square outline; row 1, col. 6). For short horizons $n$ and high discount factors $\gamma_A$, the altruistic agent actively blocks the entry to the hallway that contains the target (circle outline; row 3, col. 7), to prevent the leader agent from entering this region of low estimated choice (recall that the choice for each cell is visualized in Fig. 1, bottom right). This failure case can be prevented by having a large enough horizon $n$ and discount factor $\gamma_A$, analogously to the selection of the temperature hyperparameter in maximum entropy single-agent RL (Haarnoja2018SoftApplications). We find that this configuration performs consistently better than the others in both scenarios, and is hence preferred.

We found that the resulting behaviour is independent of the method used for choice estimation, i.e. discrete choice (eq. 7) and entropic choice (eq. 5) yield the same outcome, with immediate choice (eq. 3) being a special case of entropic choice. The binary outcomes for all hyperparameter combinations are given in appendix B.

Figure 2: Left: In Level Based Foraging (LBF) two agents must cooperate to receive rewards for eating apples. The leader agent is green, the altruistic agent is blue. Right: In Tag, a leader agent (green) tries to escape from adversaries (red colors). Altruistic agents (who may help the leader) are blue.

Table 3: Comparison of the estimated choice (IC) of a leader agent in a high performance scenario (with a cooperative partner) vs. a low performance scenario (with a randomly-acting partner). IC correlates with high performance.

| Environment | Metric | Low | High |
|---|---|---|---|
| Tag | IC | 26.9% | 56.7% |
| Tag | Score | 7.55 | 2.11 |
| LBF | IC | 19.1% | 55.3% |
| LBF | Score | 4.0% | 94.3% |

Computational efficiency.

Due to the need to estimate a long-term distribution of states ($P(S^{t+n} \mid \pi_L, s^t)$), all choice estimates except immediate choice (eq. 3) are computationally infeasible for large or continuous state spaces, and even Monte Carlo estimators would quickly suffer from the curse of dimensionality (mcbook, ch. 11). We therefore focus on immediate choice (IC) to estimate the leader agent’s choice in the remaining sections. In rare state-action sequences, the assumptions made for IC may not hold in these environments. Nonetheless, we did not find this to adversely affect the results.

4.2 Level-based foraging experiments

Environment setup.

To evaluate the performance of altruistic agents in more complex environments with discrete state spaces, we use a fully-observable multi-agent environment that enables us to assess the level of cooperation among agents (level-based foraging, LBF, Christianos2020SharedLearning). A visualization of the environment is depicted in Fig. 2 (left). The two agents can forage apples by simultaneously taking positions at different sides of a targeted apple, yielding a fixed reward.


We first train two agents – which receive an equal reward for foraging – using Deep Q-Learning (DQL, van2015deep), corresponding to fully-supervised shared-reward in multi-agent reinforcement learning (MARL). We then take one of these pretrained agents that has learned to forage apples when accompanied by a cooperating agent, freeze its policy, and place it as the leader agent (green) into the test scenario (additional details are provided in appendix C).

Choice estimate analysis.

We first qualitatively evaluate IC as an estimator for choice in Fig. 4, by comparing representative scenarios. To quantitatively analyse IC as an estimator for the leader agent’s choice, we compare the leader agent’s average IC (over 100 episodes) in two scenarios. The first is a scenario in which it can acquire many rewards, i.e. the other agent acts cooperatively, while the second is one where it can acquire only a few rewards, i.e. the other agent takes random actions. We show the results in Table 3. We observe that the leader agent’s estimated choice is substantially higher when it is able to acquire high rewards. Note that the IC estimate does not have privileged access to the reward function of the leader agent, and so this experiment evaluates its worth as a generic proxy for the leader’s reward. Assuming that an agent has more choice when enabled to acquire high rewards, these results indicate that IC is a reasonable estimator for the leader agent’s choice in LBF.

Figure 4: IC estimates, i.e. policy entropy of the leader agent, given as a percentage of the maximum possible, for two representative positions of the leader. Left (LBF): The leader agent has low IC when having to wait for another agent to support it in foraging an apple and high IC when having the option of approaching multiple apples. Right (Tag): The leader agent has low policy entropy when chased by the adversaries and high IC when it has multiple escape paths.
Figure 5: Normalized reward of the leader agent (right-most vertical axis, green) when the altruistic agent is trained to maximize the leader’s choice (ours), acts randomly (random), receives the same reward as the leader (supervised), or maximizes the state entropy (Mutti2020AExploration). Left-most vertical axis (blue): internal reward of our altruistic agent, i.e. the normalized policy entropy of the leader.

4.2.1 Evaluating the altruistic agent


We now consider an environment that consists of the previously pretrained leader agent and an additional agent, the altruistic agent. This altruistic agent is trained from scratch and does not receive a reward for foraging apples, but is rewarded according to the leader agent’s choice. Its reward is given as the current estimate of the leader agent’s IC, i.e. the leader agent’s current policy entropy (equation 3) and it is trained using DQL. To compute its internal reward signal, the altruistic agent would therefore include a policy estimation network that predicts the policy of the leader agent, given the current environment state, as detailed in A.2. To decouple our approach’s performance from that of the policy estimation network, we instead directly compute the altruistic agent’s reward from the leader agent’s pretrained policy.
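When the leader's policy is not directly accessible, its action distribution has to be estimated from observed behaviour before its entropy can serve as a reward. The sketch below uses a simple count-based table as a stand-in for the policy estimation network mentioned above; the class name, the state labels and the choice of returning a uniform distribution for unseen states are assumptions made for illustration. The reward is normalized by the maximum entropy $\log |\mathcal{A}|$, so that it lies between 0 and 1.

```python
import math
from collections import Counter, defaultdict
from typing import DefaultDict, Hashable, List


class TabularPolicyEstimator:
    """Count-based estimate of the leader's policy from observed actions.

    Hypothetical tabular replacement for a learned policy estimation
    network; only sensible for small discrete state spaces.
    """

    def __init__(self, n_actions: int) -> None:
        if n_actions < 2:
            raise ValueError("need at least two actions to normalise entropy")
        self.n_actions = n_actions
        self.counts: DefaultDict[Hashable, Counter] = defaultdict(Counter)

    def observe(self, state: Hashable, action: int) -> None:
        """Record that the leader took `action` in `state`."""
        self.counts[state][action] += 1

    def action_probs(self, state: Hashable) -> List[float]:
        """Empirical action frequencies; uniform for states never seen."""
        total = sum(self.counts[state].values())
        if total == 0:
            return [1.0 / self.n_actions] * self.n_actions
        return [self.counts[state][a] / total for a in range(self.n_actions)]

    def intrinsic_reward(self, state: Hashable) -> float:
        """Estimated policy entropy as a fraction of its maximum."""
        probs = self.action_probs(state)
        policy_entropy = sum(p * math.log(1.0 / p) for p in probs if p > 0.0)
        return policy_entropy / math.log(self.n_actions)


estimator = TabularPolicyEstimator(n_actions=4)
for action in range(4):
    estimator.observe("open_field", action)  # leader uses every action
for _ in range(5):
    estimator.observe("cornered", 0)         # leader always takes action 0
```

A state in which the leader spreads its actions evenly receives the maximal reward, while a state in which it always repeats the same action receives none.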


We define the performance of the altruistic agent not as its achieved internal reward but as the reward achieved by the leader agent, i.e. its performance in enabling the leader agent to forage apples. Fig. 5 shows a comparison of the altruistic agent’s performance to that achieved by 3 baselines (two unsupervised and one supervised), averaged over 5 random seeds, with the standard deviation as the shaded area. It can be observed that the performance of the altruistic agent converges to a similar performance to that of the supervised agent, and outperforms the baseline approaches by a large margin. Furthermore, it can be observed that the policy entropy improvement of the leader agent is correlated with its reward improvement, which supports using IC as a reasonable proxy for the choice of the leader agent.

4.3 Multi-agent tag game with protective agents

Environment setup.

We use a multi-agent tag environment (Tag, Mordatch2018EmergencePopulations; lowe2017multi; Terry2020Pettingzoo:Learning), illustrated in Fig. 2 (right), to evaluate the capabilities of altruistic agents in complex environments with continuous state spaces. Adversaries are rewarded for catching the leader, which in turn receives a negative reward for being caught or crossing the environment boundaries. To speed up training, altruistic agents additionally receive a small negative reward for violating the environment boundaries.


We pretrain the adversaries and the leader (without the presence of altruistic agents) using MADDPG (lowe2017multi) and DDPG (Lillicrap2016ContinuousLearning) respectively. After pretraining, the adversary agents have learned to cooperatively chase the leader agent, which in turn has learned to flee from the adversaries. Exact setup specifications and all parameters are given in appendix D.

Choice estimate analysis.

As done for the LBF environment, we evaluate the IC of the leader agent in representative scenarios in Fig. 4. We also quantitatively evaluate IC as an estimator for the leader agent’s choice, by comparing the leader agent’s average policy entropy per timestep in a scenario in which it receives high rewards with that in one where it receives low rewards. We again hypothesize that the leader agent has more choice in the high-success scenario in which it is caught less frequently. Table 3 shows that the estimated choice is substantially higher in the high-success scenario, indicating that IC is a reasonable estimator for the leader agent’s choice in Tag.

Figure 6: Example behaviour of altruistic agents (blue) that learned to actively defend the leader agent (green) from the adversaries (red) in Tag. Obstacles are black. The trajectories taken by some of the agents in the last 10 steps are shown as dotted lines.
Table 7: Results of the Tag experiments (mean ± standard deviation over 5 random seeds). Refer to sec. 4.3 for more details.

| Method | # Caught |
|---|---|
| None | 7.55 ± 0.56 |
| Random | 7.12 ± 1.08 |
| Cage | 6.80 ± 1.12 |
| Supervised | 6.94 ± 1.13 |
| Supervised + Cage | 6.80 ± 1.37 |
| Ours | 2.87 ± 0.96 |

4.3.1 Evaluating the altruistic agent


We freeze the pretrained policies of the adversary agents and the leader agent and insert three additional altruistic agents which observe all agents but are not observed themselves. Each additional altruistic agent’s internal reward signal is given as the IC of the leader agent (equation 3), which is directly computed as done in LBF (see 4.2.1).


Performance of the altruistic agents is defined as the number of times per episode that the leader agent is caught by the adversaries, i.e. the lower the better. In Table 7, the performance of the team of three altruistically trained agents (Ours) is compared to three relevant baselines, with the altruistic agents either removed (None), acting randomly (Random), or solely receiving a small negative reward for violating the environment boundaries (Cage). In contrast to LBF, we do not compare to an unsupervised exploration approach, as we are not aware of such an implementation for cooperative MARL. Additionally, we report results for the case in which the altruistic agents receive the same reward as the leader agent (Supervised), optionally combined with a negative reward for violating the environment boundaries (Supervised + Cage). It can be observed that our approach outperforms all relevant baselines by a substantial margin and also outperforms the supervised approach. We hypothesize this to be due to the dense internal reward signal that our approach provides, as compared to the extremely sparse rewards in the supervised scenario: recall that in the supervised scenario the additional altruistic agents receive a large negative reward only when the leader agent is caught by the adversaries, whereas our approach provides a dense reward signal that corresponds to the current estimate of the leader agent’s choice. Fig. 6 displays the emerging protective behaviour of altruistic agents trained with our approach. Videos of the results can be found in the supplemental material.

5 Conclusions

In this paper, we lay out some initial steps into developing artificial agents that learn altruistic behaviour from observations and interactions with other agents. Our experimental results demonstrate that artificial agents can behave altruistically towards other agents without knowledge of their objective or any external supervision, by actively maximizing their choice. However, our derived methods for estimating the choice of an agent require the relative frequency with which the agent chooses its actions to be related to the associated expected rewards, i.e. the agent must act rationally.

This work was motivated by a desire to address the potential negative outcomes of deploying agents that are oblivious to the values and objectives of others into the real world. As such, we hope that our work serves both as a baseline and facilitator for future research into value alignment in simulation settings, and as a complementary objective to standard RL that biases the behaviour towards more altruistic policies. In addition to the positive impacts of deployed altruistic agents outside of simulation, we remark that altruistic proxy objectives do not yet come with strict guarantees of optimizing for other agents’ rewards, and we identify failure modes (sec. 4.1.1) which are hyperparameter-dependent, and which we hope provide interesting starting points for future work.

We thank Thore Graepel and Yoram Bachrach for their advice and feedback on the outcomes of this project. This work was supported by the Royal Academy of Engineering (RF\201819\18\163).


Appendix A Estimation of leader agent’s choice from observation

a.1 Model-based estimation of choice from observations

We introduce a model-based estimator of choice that is suitable for small-scale discrete-state environments, having the advantage that it is easily interpretable. Recalling how we compute the discrete choice and entropic choice estimates for the leader agent, an estimate of the $n$-step state distribution conditioned on the altruistic agent’s actions is needed, i.e. $P(s_{t+n} \mid \pi_A^{t:t+n-1}, s_t)$. To simplify this computation, we assume that the altruistic agent’s action is held fixed for the next $n$ steps. More specifically, we assume that the altruistic agent’s state $s_A$ is unchanged for the next $n$ steps. Furthermore assuming that both the state and the action space are discrete, we compute

$$ s^{\text{vec}}_{t+n} = T(s_A)^n \, s^{\text{vec}}_t, $$

where the state transition matrix $T(s_A)$ holds the transition probabilities between all possible states, as a function of the state of the altruistic agent $s_A$. To compute $s^{\text{vec}}_{t+n}$, the system state $s_t$ is encoded into a one-hot vector $s^{\text{vec}}_t$. The $n$-step discrete choice of the leader agent can then be computed as

$$ DC^n_L(s_t) = \left\| s^{\text{vec}}_{t+n} \right\|_0, \tag{10} $$

its $n$-step entropic choice as

$$ EC^n_L(s_t) = H\!\left(s^{\text{vec}}_{t+n}\right), \tag{11} $$

and its immediate choice as

$$ IC_L(s_t) = H\!\left(s^{\text{vec}}_{t+1}\right) = H\!\left(T(s_A)\, s^{\text{vec}}_t\right). \tag{12} $$

In environments with a discrete state and action space, the altruistic agent can hence use an estimate of the state transition matrix $T$ to estimate the choice of the leader agent using either of the proposed methods, i.e. DC, EC or IC. An estimate of $T$ can be built over time, by observing the environment transitions and computing the transition probabilities as relative frequencies of observed transitions.
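The estimator described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `TransitionModel` and the functions `discrete_choice` and `entropic_choice` are hypothetical names, states are arbitrary hashable identifiers, one model is assumed to be kept per (fixed) state of the altruistic agent, and states that were never observed are treated as absorbing.

```python
import math
from collections import defaultdict


class TransitionModel:
    """Count-based estimate of the state transition probabilities.

    Assumption: one model is kept per fixed state of the altruistic agent,
    so the counts below only track the remaining system state.
    """

    def __init__(self):
        # counts[state][next_state] = number of observed transitions
        self.counts = defaultdict(lambda: defaultdict(int))

    def record(self, state, next_state):
        self.counts[state][next_state] += 1

    def step(self, dist):
        """Propagate a state distribution by one step (one product with T)."""
        new_dist = defaultdict(float)
        for state, prob in dist.items():
            successors = self.counts.get(state)
            if not successors:
                # Never observed: treat as absorbing (illustrative assumption).
                new_dist[state] += prob
                continue
            total = sum(successors.values())
            for next_state, count in successors.items():
                # Relative frequency of the observed transition.
                new_dist[next_state] += prob * count / total
        return dict(new_dist)

    def n_step_distribution(self, state, n):
        dist = {state: 1.0}  # one-hot encoding of the current state
        for _ in range(n):
            dist = self.step(dist)
        return dist


def discrete_choice(dist):
    """Number of states reachable with positive probability."""
    return sum(1 for p in dist.values() if p > 0)


def entropic_choice(dist):
    """Shannon entropy (in nats) of the state distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)
```

Under these assumptions, immediate choice corresponds to `entropic_choice(model.n_step_distribution(state, 1))`, i.e. the entropy of the one-step distribution.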

a.2 Model-free estimation of choice from observations

To limit the computational complexity, which is important for environments with large or continuous state spaces, we also consider immediate choice as an estimator for the leader agent’s choice (IC, equation 3). Hence, for these experiments, the altruistic agents require an estimate of the leader agent’s policy to compute its choice, which can be obtained with a policy estimation network (Hong2018ASystems, Papoudakis2020OpponentAutoencoders, Mao2019ModellingDdpg, Grover2018LearningSystems).

Appendix B Gridworld experiments

b.1 Training procedure

b.1.1 Setup

The environment setup is described and displayed in section 4.1.

b.1.2 Pretraining

We first pretrain the leader agent using tabular Q-Learning, with learning parameters given in Table 2. During this pretraining, the altruistic agent takes random actions. We train until all Q-Values are fully converged, i.e. training runs for 300000 environment steps.

b.1.3 Reward computation for altruistic agents

The altruistic agent is then also trained using tabular Q-Learning, and its internal reward signal is given as the choice estimate of the leader agent, i.e. either discrete choice (DC) or entropic choice (EC), which is computed with the model-based estimation introduced in appendix A.1. The altruistic agent records all environment transitions and frequently updates its estimate of the state transition matrix $T$, which is needed to compute the internal reward signal for the altruistic agent. All training parameters can be found in Table 2. Training time is about thirty minutes per experiment.
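As a rough sketch of the tabular Q-Learning used for both agents, the following Python snippet shows an epsilon-greedy action selection and the standard one-step update. The constants are the Gridworld values from Table 2; the function names and the table layout are assumptions made for illustration. For the altruistic agent, `reward` would be the choice estimate of the leader agent rather than an environment reward.

```python
import random
from collections import defaultdict

ALPHA = 0.01    # learning rate (Table 2, Gridworld)
GAMMA = 0.9     # discount factor (Table 2, Gridworld)
EPSILON = 0.1   # exploration rate; start and final value coincide in Table 2


def make_q_table(num_actions):
    """Q-table mapping each state to a list of action values, initialised to 0."""
    return defaultdict(lambda: [0.0] * num_actions)


def select_action(q_table, state, num_actions, rng=random):
    """Epsilon-greedy action selection."""
    if rng.random() < EPSILON:
        return rng.randrange(num_actions)
    values = q_table[state]
    return max(range(num_actions), key=lambda a: values[a])


def q_update(q_table, state, action, reward, next_state, done):
    """Standard one-step Q-Learning update."""
    target = reward if done else reward + GAMMA * max(q_table[next_state])
    q_table[state][action] += ALPHA * (target - q_table[state][action])
```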

Category Scen.
Opens door A
Non blocking B
Gives way B
Table 1: Results Gridworld - behaviour of altruistic agent

b.2 Performance evaluation

Performance of the altruistic agent is reported for three different categories over the two presented scenarios, as shown in Table 1. For each category, we report success or failure for different choice estimate look-ahead horizons $n$ and discount factors $\gamma$ of the altruistic agent. Success or failure was always deterministic, conditioned on the experiment setup, i.e. 10 simulations were run for each setup, which always yielded the same outcome. To estimate the leader agent’s choice, the altruistic agent uses either discrete choice (DC, equations 1 and 10) or entropic choice (EC, equations 2 and 11). It must be noted that horizon is equivalent to an infinite horizon look-ahead for the given environment size and that entropic choice is equivalent to immediate choice (equations 3 and 12) at horizon $n = 1$, as the environment satisfies the necessary conditions listed for equation 3.

Table 1 displays the results of this experiment. The first row evaluates whether the altruistic agent opens the door at all times, such that the leader agent can eat the green apple. It can be observed that the altruistic agent only opens the door for longer horizons $n$ and for higher discount factors $\gamma$.

Given the definitions of discrete choice (Equation 1) and entropic choice (Equation 2), it can be assumed that the horizon $n$ determines the locality for which choice is considered and that the discount factor $\gamma$ defines whether the altruistic agent gives higher importance to the short-term or long-term choice of the leader agent. This is in line with the observed results for the first category (Opens door). It can be assumed that, for short horizons $n$, the altruistic agent does not open the door, as it does not estimate that this would lead to an increase in the leader agent’s choice. A similar argument applies to low discount factors $\gamma$.

The middle-row category evaluates whether the altruistic agent refrains from blocking the hallway that leads up to the leader agent’s target apple in the top right environment cell. This category demonstrates a possible failure case of the proposed approach of maximizing another agent’s choice. For short horizons $n$ and high discount factors $\gamma$, the altruistic agent actively blocks the entry to the low-entropy hallway towards the top right cell – by constantly occupying cell (2, 6) – to prevent the leader agent from entering this region of low estimated choice. This failure case can be prevented by an appropriate selection of the hyperparameters – horizon $n$ and discount factor $\gamma$. It is related to the selection of the temperature hyperparameter in maximum entropy single-agent RL (Haarnoja2018SoftApplications); if chosen incorrectly, the agent does not pursue environment rewards in low-entropy regions. A possible solution to this problem would be to define a constrained optimization problem, as shown by Haarnoja2018SoftApplications.

The last category (bottom row) evaluates whether the altruistic agent always gives way to the leader agent, i.e. it must not only never block the tunnel entry but also never get into the leader agent’s path. The results show that the altruistic agent does not generally give way to the leader agent. This happens because the leader agent’s choice over a very long term (infinite-horizon choice) is often independent of the state of the altruistic agent, as it is often possible to simply go around the altruistic agent.

Appendix C Level Based Foraging experiments

c.1 Training procedure

c.1.1 Setup

We adopted the Level Based Foraging environment (https://github.com/semitable/lb-foraging) as given in Christianos2020SharedLearning. We only focus on two-agent scenarios and only consider the subset of possible environments that require full cooperation among agents, i.e. those where food can only be foraged by two agents cooperatively. We therefore only consider environments where both agents are at level one, and all present food is at level two. In the original implementation, both agents have to simultaneously select the LOAD action while docking at different sides of a food object to forage the object and receive the reward. To reduce training time, we simplify this setup by reducing the action space to {NONE, NORTH, SOUTH, WEST, EAST}, i.e. we remove the LOAD action and enable agents to forage food by simultaneously docking at different sides of a food object, with no further action required.
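The simplified foraging rule can be illustrated with a short sketch. The functions `adjacent_agents` and `is_foraged` are hypothetical helpers and not part of the lb-foraging package; the sketch assumes grid positions given as (row, column) tuples, 4-neighbourhood adjacency, and that food is foraged once the summed levels of the docked agents reach the food level.

```python
def adjacent_agents(food_pos, agent_positions):
    """Indices of agents docked at one of the four sides of the food cell."""
    food_row, food_col = food_pos
    return [
        i
        for i, (row, col) in enumerate(agent_positions)
        if abs(row - food_row) + abs(col - food_col) == 1
    ]


def is_foraged(food_pos, food_level, agent_positions, agent_levels):
    """Food is foraged once the docked agents' levels sum to the food level.

    No explicit loading action is required in this simplified rule.
    """
    docked = adjacent_agents(food_pos, agent_positions)
    return sum(agent_levels[i] for i in docked) >= food_level
```

With two level-one agents and level-two food, `is_foraged` only returns `True` when both agents are docked, which matches the full-cooperation setting described above. Since two agents cannot occupy the same cell, docked agents are necessarily at different sides.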

c.1.2 Pretraining

To obtain a pretrained leader agent, we first train two agents in the environment that are equally rewarded for foraging food. This setup corresponds to shared-reward cooperative MARL (Tan1993Multi-AgentAgents). Both agents are trained using Deep Q Learning (DQL, van2015deep), using a fully connected neural network with two hidden layers and five output values, representing the Q values of the five possible actions. The exact training parameters are listed in Table 2. We then take either one of the two agents and set it as the pretrained leader agent for the subsequent evaluation of the altruistic agent.

c.1.3 Training of additional agents

We then insert an additional agent into the environment that shall act altruistically towards the leader agent. This additional agent is trained in the same fashion and with the same parameters as the previously trained leader agents. Only its reward signal is different, as laid out in the next section.

c.1.4 Reward computation for additional agents

We compare four different approaches for how the reward of the additional agent is defined or, in the random case, how the agent behaves. Random: The agent takes random actions. Supervised: The agent receives the same reward as the leader agent, i.e. a shared reward as in cooperative MARL. Ours: The reward of the additional agent is defined as the immediate choice of the leader agent, i.e. its policy entropy, as detailed in equation 3. We compute the leader agent’s policy entropy by computing the entropy of the softmax of the leader agent’s Q values in the given state. We further consider an Unsupervised baseline, as detailed in the next paragraph.
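The reward computation for our approach, i.e. the entropy of the softmax over the leader agent's Q values, can be sketched as follows. This is a minimal illustration; the function names are hypothetical, and a softmax temperature of 1 is assumed since no temperature is specified here.

```python
import math


def softmax(values):
    """Numerically stable softmax (temperature 1 is an assumption)."""
    max_value = max(values)
    exps = [math.exp(v - max_value) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def policy_entropy(q_values):
    """Shannon entropy (in nats) of the softmax policy induced by the Q values."""
    probs = softmax(q_values)
    return -sum(p * math.log(p) for p in probs if p > 0)
```

The entropy is largest when all five Q values are equal (the leader agent is indifferent between its actions) and approaches zero when one action clearly dominates.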

Unsupervised baseline - MaxEnt

As an unsupervised baseline, we implemented the MEPOL approach of Mutti2020AExploration. Their task-agnostic unsupervised exploration approach maximizes the entropy over the state distribution of trajectory rollouts. For this baseline, the additional agent is trained with the implementation given by the authors (https://github.com/muttimirco/mepol), which itself builds on TRPO (Schulman2015TrustOptimization). We leave all parameters unchanged but evaluate different learning rates; the learning rate that achieved the best results was picked as the relevant baseline.

c.2 Performance evaluation

Each experiment was run for 5 different random seeds, and mean and standard deviation are reported. Training progress is shown in Figure 5. Evaluations are computed every 10000 environment steps for 200 episodes, with the exploration set to zero. Training time was about 14 hours for each run.

Appendix D Tag experiments

Figure 8: Training progress of Tag experiments. Left: Reward of adversary agents. Middle: Reward of leader agent. Right: Reward of additional altruistic agents (rescaled).

d.1 Training procedure

d.1.1 Pretraining

We use the Simple Tag (Tag) implementation from (Terry2020Pettingzoo:Learning) (https://github.com/PettingZoo-Team/PettingZoo), which is unchanged as compared to the original implementation in (Mordatch2018EmergencePopulations) (https://github.com/openai/multiagent-particle-envs), apart from minor error fixes. We first adopt the original configuration and pretrain three adversaries and one good agent (leader agent) using the parameters listed in Table 2. We use MADDPG (lowe2017multi) to train adversary agents (our implementation is loosely based on https://github.com/starry-sky6688/MADDPG), and modify the framework as follows. The last layer of each agent’s actor-network outputs one value for each of the environment’s five possible actions, over which the softmax is computed. We then sample the agent’s action from the output softmax vector, which corresponds to the probabilities with which the agent takes a specific action in a given state. We train the leader agent with DDPG (Lillicrap2016ContinuousLearning), loosely based on the same implementation, where we modify the output layer in the same way. Each actor and critic network is implemented as a fully-connected neural network with two hidden layers, with layer sizes as given in Table 2.

To make the task harder for the leader agent, we reduce its maximum speed and acceleration relative to the original values. We next insert three additional agents into the environment whose observations include all agents and objects. These additional agents are not observed by adversary agents or the leader agent. The additional agents are of the same size as the adversary agents, and their acceleration and maximum velocity are equal to that of the leader agent. To speed up training, we made the following changes to the environment, which are applied to our approach as well as to all baselines. First, we spawn the three additional agents in the vicinity of the leader agent, which itself is spawned at a random position. Furthermore, we randomly pick two out of the three adversary agents and decrease their maximum acceleration and maximum speed by 50%. We made these changes to be able to observe substantial differences between the different approaches after a training time of less than 24h.

| Parameter | Gridworld | Tag | Level Based Foraging |
|---|---|---|---|
| Environment Steps | 300000 | 7500000 | 50000000 |
| Episode Length | 25 | 25 | 10/15 |
| Learning Rate Actor | - | 0.001 | - |
| Learning Rate Critic / Q-Learning | 0.01 | 0.001 | 0.001 |
| Exploration Noise | 0 | 0.1 | 0 |
| Epsilon start | 0.1 | - | 1.0 |
| Epsilon final | 0.1 | - | 0.2 |
| Discount Factor | 0.9 | 0.95 | 0.9 |
| Target Network Update Rate | - | 0.01 | 0.001 |
| Replay Buffer Size | - | 1000000 | 200000 |
| Training Batch Size | - | 256 | 256 |
| Train every steps | - | 64 | 32 |
| Layer Size Adversary Agent | - | 64 | - |
| Layer Size Leader Agent | - | 64 | 64 |
| Layer Size Altruistic Agent | - | 128 | 64 |
| Optimizer | - | Adam | Adam |
| Gradient Norm Clip | - | - | 0.5 |
| Activation Function | - | relu | relu |

Table 2: Results Tag

d.1.2 Training of additional agents

We train these three additionally inserted agents with the previously described modified version of MADDPG. The reward for each agent is defined either according to our developed approach, or any of the given baselines, as detailed in the next section.

d.1.3 Reward computation for additional agents for different baselines

We consider the following implementations for the reward computation of the additional agents and, in one case, a different environment configuration.

None: For this scenario, the additional agents are removed from the environment.

The remaining approaches purely differ in the way that the reward of the additional agents is computed. No other changes are made.

Random: The additional agents take random actions.

Cage: The additional agents receive a negative reward for violating the environment boundaries, which is equal to the negative reward that the leader agent receives for itself violating the environment boundaries (part of the original Tag implementation).

Supervised: The additional agents receive the same reward as the leader agent. That is, they receive a reward of -10 if the leader agent is caught by the adversaries and a small negative reward if the leader agent violates the environment boundaries.

Supervised + Cage: The additional agents receive the same reward as the leader agent, and an additional small negative reward if they themselves violate the environment boundaries.

Ours: The reward of the additional agents is defined as the immediate choice of the leader agent, i.e. its policy entropy, as detailed in equation 3. To stabilize the estimate of the leader agent’s policy entropy, we implement an ensemble of five pretrained actor-networks for the leader agent and compute the reward for the altruistic agents as the median policy entropy estimated by the five networks. Furthermore, the additional agents receive a small negative reward for themselves violating the environment boundaries.
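The ensemble-based reward of our approach can be sketched as follows. This is a minimal illustration under stated assumptions: `altruistic_reward` is a hypothetical function, each ensemble member is assumed to provide a softmax action distribution for the leader agent in the current state, and the size of the boundary penalty is left as a parameter because only "a small negative reward" is specified.

```python
import math
import statistics


def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def altruistic_reward(ensemble_action_probs, boundary_penalty=0.0):
    """Median policy entropy over the ensemble, minus a boundary penalty.

    ensemble_action_probs: one action distribution per ensemble member.
    boundary_penalty: non-negative value subtracted when the altruistic
        agent violates the environment boundaries (magnitude is an assumption).
    """
    entropies = [entropy(probs) for probs in ensemble_action_probs]
    return statistics.median(entropies) - boundary_penalty
```

Taking the median rather than the mean makes the reward insensitive to a minority of ensemble members that produce outlying entropy estimates.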

d.2 Performance evaluation

We train Cage, Supervised, Supervised + Cage and Ours for five different random seeds with parameters as detailed in Table 2. Training progress is shown in Figure 8. We then compute the results listed in Table 7 by freezing all weights across all networks, setting the exploration noise to zero and computing the average and standard deviation over 500 rollout episodes.
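The final aggregation step can be sketched as follows. The function `summarise_catches` is a hypothetical helper, and the use of the population standard deviation is an assumption, as the text does not state which estimator was used.

```python
import statistics


def summarise_catches(catches_per_episode):
    """Mean and standard deviation of the number of catches per episode.

    Uses the population standard deviation (an assumption).
    """
    mean = statistics.mean(catches_per_episode)
    std = statistics.pstdev(catches_per_episode)
    return mean, std
```

In the evaluation described above, `catches_per_episode` would hold one entry for each of the 500 rollout episodes, recorded with frozen weights and zero exploration noise.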

Appendix E Videos of behaviour of altruistic agent

We provide videos for the most relevant outcomes of our experiments, which can be accessed at https://youtube.com/playlist?list=PLtq_h_35OaPnO-VWNT5q1-Onqb2GKVNQI

e.1 Videos for results of Gridworld experiments (Section 4.1.1)

In these videos, the target object (apple) of the leader agent is displayed as a green rectangle. The leader agent is always in green, the altruistic agent always in blue.

e.1.1 Door Scenario in Fig. 1 top center

The switch – positioned at the top right cell – that the altruistic agent can use to open the door, is displayed as a blue dot.

Altruistic agent opens door for leader agent : It can be observed that the altruistic agent has learned to operate the door switch to enable the leader agent to pass through the door and reach its target on the other side.

(Fail case) Altruistic agent does not open door for leader agent: It can be observed that for an unfavourable choice of hyperparameters, the altruistic agent does not open the door.

e.1.2 Dead End scenario in Fig. 1 top right

Altruistic agent gives way to leader agent: It can be observed that the altruistic agent does not get into the way of the leader agent, which is hence able to reach its target in the top right cell.

(Fail case) Altruistic agent blocks path of leader agent: It can be observed that for an unfavourable choice of hyperparameters, the altruistic agent blocks the entry to the hallway towards the right side of the environment such that the leader agent cannot reach its target at the top right cell. This happens because the altruistic agent forcefully maximizes the estimated choice of the leader agent by hindering it from entering the hallway, which is a region of lower estimated choice.

e.2 Video for results of Level Based Foraging (Section 4.2.1)

Altruistic agent enables leader to forage apples: It can be observed how the altruistic agent (blue) learned to coordinate its movements with the leader agent (green), to enable the leader agent to forage apples. It has learned this behaviour purely through optimizing for the leader agent’s choice and is itself not rewarded for foraging apples.

e.3 Video for results of Tag (Section 4.3.1)

Altruistic agents protect leader from adversaries: It can be observed how the altruistic agents (blue colors) learned to coordinate their movements to protect the leader agent (green) from its adversaries. The adversaries (red colors) try to catch the leader, which in turn tries to flee from them. The altruistic agents protect the leader by actively intercepting the paths of the adversaries. They have learned this behaviour purely through optimizing for the leader agent’s choice.