
Agent-Temporal Attention for Reward Redistribution in Episodic Multi-Agent Reinforcement Learning

This paper considers multi-agent reinforcement learning (MARL) tasks where agents receive a shared global reward at the end of an episode. The delayed nature of this reward affects the ability of the agents to assess the quality of their actions at intermediate time-steps. This paper focuses on developing methods to learn a temporal redistribution of the episodic reward to obtain a dense reward signal. Solving such MARL problems requires addressing two challenges: identifying (1) relative importance of states along the length of an episode (along time), and (2) relative importance of individual agents' states at any single time-step (among agents). In this paper, we introduce Agent-Temporal Attention for Reward Redistribution in Episodic Multi-Agent Reinforcement Learning (AREL) to address these two challenges. AREL uses attention mechanisms to characterize the influence of actions on state transitions along trajectories (temporal attention), and how each agent is affected by other agents at each time-step (agent attention). The redistributed rewards predicted by AREL are dense, and can be integrated with any given MARL algorithm. We evaluate AREL on challenging tasks from the Particle World environment and the StarCraft Multi-Agent Challenge. AREL results in higher rewards in Particle World, and improved win rates in StarCraft compared to three state-of-the-art reward redistribution methods. Our code is available at https://github.com/baicenxiao/AREL.


1. Introduction

Cooperative multi-agent reinforcement learning (MARL) involves multiple autonomous agents that learn to collaborate to complete tasks in a shared environment by maximizing a global reward (busoniu2008comprehensive). Examples of systems where MARL has been used include autonomous vehicle coordination sallab2017deep, and video games (samvelyan19smac; tampuu2017multiagent).

One approach to enable better coordination is to use a single centralized controller that can access the observations of all agents jiang2018learning. In this setting, algorithms designed for single-agent RL can be used for the multi-agent case. However, this may not be feasible when trained agents are deployed independently, or when the communication costs between agents and the controller are prohibitive. In such situations, agents need to be able to learn decentralized policies.

The centralized training with decentralized execution (CTDE) paradigm, introduced in lowe2017multi; rashid2018qmix, enables agents to learn decentralized policies efficiently. Agents using CTDE can communicate with each other during training, but are required to make decisions independently at test-time. The absence of a centralized controller will require each agent to assess how its own actions can contribute to a shared global reward. This is called the multi-agent credit assignment problem, and has been the focus of recent work in MARL, such as COMA foerster2018counterfactual, QMIX rashid2018qmix and QTRAN son2019qtran. Solving the multi-agent credit assignment problem alone, however, is not adequate to efficiently learn agent policies when the (global) reward signal is delayed until the end of an episode.

In reinforcement learning, agents seek to solve a sequential decision problem guided by reward signals at intermediate time-steps. This is called the temporal credit assignment problem sutton2018reinforcement. In many applications, rewards may be delayed. For example, in molecular design olivecrona2017molecular, Go silver2016mastering, and computer games such as Skiing bellemare2013arcade, a summarized score is revealed only at the end of an episode. The episodic reward implies an absence of feedback on the quality of actions at intermediate time steps, making it difficult to learn good policies. The long-term temporal credit assignment problem has been studied in single-agent RL by performing return decomposition via contribution analysis arjona2019rudder and by using sequence modeling liu2019sequence. These methods do not directly scale well to MARL, since the size of the joint observation space grows exponentially with the number of agents lowe2017multi.

Besides scalability, addressing temporal credit assignment in MARL with episodic rewards presents two challenges. It is critical to identify the relative importance of: i) each agent’s state at any single time-step (agent dimension); ii) states along the length of an episode (temporal dimension). We introduce Agent-Temporal Attention for Reward Redistribution in Episodic Multi-Agent Reinforcement Learning (AREL) to address these challenges.

AREL uses attention mechanisms vaswani2017attention to carry out multi-agent temporal credit assignment by concatenating: i) a temporal attention module that characterizes the influence of actions on state transitions along trajectories; and ii) an agent attention module that determines how any single agent is affected by other agents at each time-step. The attention modules enable learning a redistribution of the episodic reward along the length of the episode, resulting in a dense reward signal. To overcome the challenge of scalability, instead of working with the concatenation of (joint) agents' observations, AREL analyzes the observations of each agent using a temporal attention module that is shared among agents. The outcome of the temporal attention module is passed to an agent attention module that characterizes the relative contribution of each agent to the shared global reward. The output of the agent attention module is then used to learn the redistributed rewards.

When rewards are delayed or episodic, it is important to identify ‘critical’ states that contribute to the reward. The authors of gangwani2020learning recently demonstrated that rewards delayed by a long time-interval make it difficult for temporal-difference (TD) learning methods to carry out temporal credit assignment effectively. AREL overcomes this shortcoming by using attention mechanisms to effectively learn a redistribution of an episodic reward. This is accomplished by identifying critical states through capturing long-term dependencies between states and the episodic reward.

Agents that have identical action and observation spaces are said to be homogeneous. Consider a task where two homogeneous agents need to collaborate to open a door by locating two buttons and pressing them simultaneously. In this example, while locations of the two buttons (states) are important, the identities of the agent at each button are not. This property is termed permutation invariance, and can be utilized to make the credit assignment process sample efficient gangwani2020learning; liu2019sequence. Thus, a redistributed reward must identify whether an agent is in a ‘good’ state, and should also be invariant to the identity of the agent in that state. AREL enforces this property by designing the credit assignment network with permutation-invariant operations among homogeneous agents, and can be integrated with MARL algorithms to learn agent policies.

We evaluate AREL on three tasks from the Particle World environment lowe2017multi, and three combat scenarios in the StarCraft Multi-Agent Challenge samvelyan19smac. In each case, agents receive a summarized reward only at the end of an episode. We compare AREL with three state-of-the-art reward redistribution techniques, and observe that AREL results in accelerated learning of policies and higher rewards in Particle World, and improved win rates in StarCraft.

2. Related Work

Several techniques have been proposed to address temporal credit assignment when prior knowledge of the problem domain is available. Potential-based reward shaping is one such method that provided theoretical guarantees in single-agent ng1999policy and multi-agent devlin2011theoretical; lu2011policy RL, and was shown to accelerate learning of policies in devlin2011empirical. Credit assignment was also studied by incorporating human feedback through imitation learning kelly2019hg; ross2011reduction and demonstrations brown2019extrapolating; huang2020ma.

When prior knowledge of the problem domain is not available, recent work has studied temporal credit assignment in single-agent RL with delayed rewards. An approach named RUDDER arjona2019rudder used contribution analysis to decompose episodic rewards by computing the difference between predicted returns at successive time-steps. In parallel, the authors of liu2019sequence proposed using natural language processing models to carry out temporal credit assignment for episodic rewards. The scalability of the above methods to MARL, though, can be a challenge due to the exponential growth in the size of the joint observation space lowe2017multi.

In the multi-agent setting, recent work has studied performing multi-agent credit assignment at each time-step. Difference rewards were used to assess the contribution of an agent to a global reward in agogino2006quicr; devlin2014potential; foerster2018counterfactual by computing a counterfactual term that marginalized out actions of that agent while keeping actions of other agents fixed. Value decomposition networks, proposed in sunehag2018value, decomposed a centralized value into a sum of agent values to assess each one’s contributions. A monotonicity assumption on value functions was imposed in QMIX rashid2018qmix to assign credit to individual agents. A generalized approach to decompose a joint value into individual agent values was presented in QTRAN son2019qtran. The Shapley Q-value was used in wang2020shapley to distribute a global reward to identify each agent’s contribution. The authors of yang2020q decomposed global Q-values along trajectory paths, while zhou2020learning used an entropy-regularized method to encourage exploration to aid multi-agent credit assignment. The above techniques did not address long-term temporal credit assignment and hence will not be adequate for learning policies efficiently when rewards are delayed.

Attention mechanisms have been used for multi-agent credit assignment in recent work. The authors of mao2019modelling used an attention mechanism with a CTDE-based algorithm to enable each agent to effectively model the policies of other agents (from its own perspective). Hierarchical graph attention networks proposed in ryu2020multi modeled hierarchical relationships among agents and used two attention networks to effectively represent individual- and group-level interactions. The authors of jiang2019graph; liu2020multi combined attention networks with graph-based representations to indicate the presence and importance of interactions between any two agents. The above approaches used attention mechanisms primarily to identify relationships between agents at a specific time-step. They did not consider long-term temporal dependencies, and therefore may not be sufficient to learn policies effectively when rewards are delayed.

A method for temporal redistribution of episodic rewards in single and multi-agent RL was recently presented in gangwani2020learning. A ‘surrogate objective’ was used to uniformly redistribute an episodic reward along a trajectory. However, this work did not use information from sample trajectories to characterize the relative contributions of agents at intermediate time-steps along an episode.

Our approach differs from the above-mentioned related work in that it uses attention mechanisms for multi-agent temporal credit assignment. AREL overcomes the challenge of scalability by analyzing observations of each agent using temporal and agent attention modules, which respectively characterize the effect of actions on state transitions along a trajectory and how each agent is influenced by other agents at each time-step. Together, these modules will enable an effective redistribution of an episodic reward. AREL does not require human intervention to guide agent behaviors, and can be integrated with MARL algorithms to learn decentralized agent policies in environments with episodic rewards.

3. Background

Figure 1. Schematic of AREL. The agent-temporal attention block concatenates temporal and agent attention modules, and summarizes input feature (e.g., observation) vectors. This is accomplished by establishing relationships between (attending to) information along time and among agents. The attention flow indicates that an output feature vector of the agent-temporal attention block for an agent at a given time (green square) can attend to input features from all other agents before and including that time. Multiple agent-temporal attention blocks can be concatenated to improve expressivity. The output of the last such block is fed to the credit assignment block, which applies shared multi-layer perceptrons to each attention feature. The output is the redistributed reward, which is integrated with MARL algorithms (e.g., MADDPG, QMIX) to learn agent policies.

A fully cooperative multi-agent task can be specified as a decentralized partially observable Markov decision process (Dec-POMDP) oliehoek2016concise. A Dec-POMDP is a tuple $\langle \mathcal{S}, \{\mathcal{A}^i\}_{i=1}^{N}, \mathcal{T}, \{\mathcal{O}^i\}_{i=1}^{N}, \{Z^i\}_{i=1}^{N}, R, N, \gamma \rangle$, where $s \in \mathcal{S}$ describes the environment state. Each agent $i$ receives an observation $o^i \in \mathcal{O}^i$ according to an observation function $Z^i$. At each time step, agent $i$ chooses an action $a^i \in \mathcal{A}^i$ according to its policy $\pi^i$. The product $\mathcal{A} = \mathcal{A}^1 \times \cdots \times \mathcal{A}^N$ forms the joint action space, and the environment transitions to the next state according to the function $\mathcal{T}(s' \mid s, a)$. All agents share a global reward $R$. The goal of the agents is to determine their individual policies to maximize the return $\mathbb{E}\big[\sum_{t=0}^{T} \gamma^{t} r_t\big]$, where $\gamma \in [0, 1]$ is a discount factor and $T$ is the length of the horizon. Let $o_t = (o_t^1, \dots, o_t^N)$ and $a_t = (a_t^1, \dots, a_t^N)$. A trajectory of length $T$ is an alternating sequence of observations and actions, $\tau = (o_0, a_0, o_1, a_1, \dots, o_T)$.
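To make the notation concrete, the following is a minimal illustrative sketch (not from the paper; all names and shapes are our assumptions) of how a trajectory with an episodic reward could be stored in code:

```python
# Illustrative sketch: a container for a Dec-POMDP trajectory whose reward
# is revealed only at the end of the episode. Names and shapes are ours.
from dataclasses import dataclass, field
from typing import List
import numpy as np


@dataclass
class Trajectory:
    observations: List[np.ndarray] = field(default_factory=list)  # one (N, obs_dim) array per time-step
    actions: List[np.ndarray] = field(default_factory=list)       # one (N,) array per time-step
    episodic_reward: float = 0.0                                   # R_ep, known only at time T

    def add_step(self, joint_obs: np.ndarray, joint_action: np.ndarray) -> None:
        self.observations.append(joint_obs)
        self.actions.append(joint_action)

    def length(self) -> int:
        return len(self.observations)
```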

In a typical MARL task, agents receive a reward $r_t$ immediately following execution of action $a_t$ at state $s_t$. The expected return can then be determined by accumulating rewards at each time step. In episodic RL, a reward $R_{ep}$ is revealed only at the end of an episode at time $T$, and agents do not receive a reward at intermediate time-steps. As a consequence, the expected return at every intermediate time-step will be the same (when $\gamma = 1$). Therefore, the quality of information available for learning policies will be poor at all intermediate time steps. Moreover, delayed rewards have been shown to introduce a large bias arjona2019rudder or variance ng1999policy in the performance of RL algorithms.

The CTDE paradigm foerster2018counterfactual; lowe2017multi can be adopted to learn decentralized policies effectively when dimensions of state and action spaces are large. During training, an agent can make use of information about other agents’ states and actions to aid its own learning. At test-time, decentralized policies are executed. This paradigm has been used to successfully complete tasks in complex MARL environments gupta2017cooperative; iqbal2019actor; rashid2018qmix.

4. Approach

This paper considers MARL tasks where agents share the same global reward, which is received only at the end of an episode. The objective is to redistribute this episodic reward for effective multi-agent temporal credit assignment. To accomplish this goal, it is critical to identify the relative importance of: i) individual agents’ observations at each time-step, and; ii) observations along the length of a trajectory. We introduce AREL to address the above challenges. AREL uses an agent-temporal attention block to infer relationships among states at different times, and among agents. A schematic is shown in Fig. 1, and we describe its key components and overall workflow in the remainder of this section.

4.1. Agent-Temporal Attention

In order to redistribute an episodic reward in a meaningful way, we need to be able to extract useful information from trajectories. Each trajectory contains a sequence of observations involving all agents. At each time-step of an episode of length $T$, a feature of dimension $d$ corresponds to the embedding of a single observation. When there are $N$ agents, a trajectory is denoted by $X \in \mathbb{R}^{T \times N \times d}$. The objective is to learn a mapping to assign credit to the agents at each time-step. The information in a trajectory comprises two parts: (1) temporal information between (embeddings of) observations at different time steps, which provides insight into the influence of actions on transitions between states; and (2) structural information, which provides insight into how any single agent is affected by other agents.

These two parts are coupled, and are hence studied together. The process of learning these relationships is termed attention. We propose an agent-temporal attention structure, inspired by the Transformer vaswani2017attention. This structure selectively pays attention to different types of information: either from individual agents, or at different time-steps along a trajectory. This is accomplished by associating a weight with an observation based on its relative importance to other observations along the trajectory. The agent-temporal attention structure is formed by concatenating one temporal attention module with one agent attention module. The temporal attention module determines how entries of $X$ at different time-steps are related (along the first, temporal dimension of $X$). The agent attention module determines how agents influence one another (along the second, agent dimension of $X$).

4.1.1. Temporal-Attention Module

The input is a trajectory $X \in \mathbb{R}^{T \times N \times d}$. To calculate the temporal attention feature, we obtain the transpose of $X$ along its first two dimensions, $X^{\top} \in \mathbb{R}^{N \times T \times d}$, so that attention is computed over time separately for each agent. Adopting notation from vaswani2017attention, each row of $X^{\top}$ is transformed to a query $q_i$, key $k_i$, and value $v_i$ through learnable parameters $W^{Q}, W^{K}, W^{V} \in \mathbb{R}^{d \times d_k}$, where $d_k$ is the attention dimension. The $i$-th row of the temporal attention feature is a weighted sum $\sum_{j} (w_i)_j\, v_j$. The attention weight vector $w_i$ is a normalization (softmax) of the inner-product between the $i$-th row of the query matrix $Q$, namely $q_i$, and the key matrix $K$:

$$ w_i = \mathrm{softmax}\!\left(\frac{q_i K^{\top}}{\sqrt{d_k}} \odot m_i\right), \qquad (1) $$

where $\odot$ is an element-wise product, and $m_i$ is a mask whose first $i$ entries are equal to $1$ and whose remaining entries are $-\infty$. The mask preserves causality by ensuring that, at any time $i$, information beyond time $i$ will not be used to assign credit. A temporal positional embedding devlin2018bert maintains information about the relative positions of states in an episode. Position embeddings are learnable vectors associated with each temporal position of a trajectory. The sum of the position and trajectory embeddings forms the input to the temporal attention module. The output of this module, $F_t$, is obtained by stacking the weighted sums $\sum_{j} (w_i)_j\, v_j$. The temporal attention process can be described by a function $f_t: \mathbb{R}^{T \times N \times d} \rightarrow \mathbb{R}^{T \times N \times d_k}$.

The output of the temporal attention module results in an assessment of each agent’s observation at any single time-step relative to observations at other time-steps of an episode. To obtain further insight into how an agent’s observation is related to other agents’ observations, an agent-attention module is concatenated to the temporal-attention module.
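The following is a minimal PyTorch sketch (ours, under assumptions) of such a temporal attention module: causally masked self-attention applied along the time axis, shared across agents, with a learnable positional embedding. The tensor layout `(batch, T, N, d)` and all class and argument names are our choices, not the authors' implementation.

```python
import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    """Causally masked self-attention over time, shared across agents (sketch)."""

    def __init__(self, d_model: int, n_heads: int, max_len: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, d_model))  # learnable positions

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, N, d) -- embeddings of each agent's observation.
        B, T, N, d = x.shape
        x = x + self.pos_emb[:, :T].unsqueeze(2)            # add temporal position embedding
        x = x.permute(0, 2, 1, 3).reshape(B * N, T, d)      # fold agents into the batch
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        out, _ = self.attn(x, x, x, attn_mask=causal)       # True entries mask future time-steps
        return out.reshape(B, N, T, d).permute(0, 2, 1, 3)  # back to (batch, T, N, d)
```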

4.1.2. Agent-Attention Module

The agent-attention module uses the transpose of the temporal attention output $F_t$, denoted $F_t^{\top}$, so that attention is computed across agents at each time-step. Each row of $F_t^{\top}$ is transformed to a query $q_i'$, key $k_i'$, and value $v_i'$ through learnable parameters $W^{Q'}, W^{K'}, W^{V'}$. The $i$-th row of the agent attention feature is a weighted sum $\sum_{j} (w_i')_j\, v_j'$. Maintaining causality is not necessary when computing the agent attention weight vector $w_i'$. These weights are determined similarly to the temporal attention weight vector in Eqn. (1), except without a masking operation. Therefore,

$$ w_i' = \mathrm{softmax}\!\left(\frac{q_i' (K')^{\top}}{\sqrt{d_k}}\right). \qquad (2) $$

The agent attention procedure can be described by a function $f_a: \mathbb{R}^{T \times N \times d_k} \rightarrow \mathbb{R}^{T \times N \times d_k}$, where $F_a = f_a(F_t)$ denotes its output.
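A matching sketch (again ours, with the same assumed `(batch, T, N, d)` layout) of the agent attention module applies unmasked self-attention across agents at every time-step:

```python
import torch
import torch.nn as nn


class AgentAttention(nn.Module):
    """Unmasked self-attention across agents at each time-step (sketch)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, N, d) -- typically the output of the temporal attention module.
        B, T, N, d = x.shape
        x = x.reshape(B * T, N, d)      # fold time into the batch
        out, _ = self.attn(x, x, x)     # no mask: every agent attends to every agent
        return out.reshape(B, T, N, d)
```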

4.1.3. Concatenating Attention Modules

The output of the temporal attention module is an entity that, for each agent, attends to information at time-steps along the length of an episode. Passing it through the agent attention module results in an output that has attended to embeddings at all time-steps and from all agents. The data-flow of this process can be written as a composition of functions, $f_a \circ f_t$. The temporal and agent attention modules can be repeatedly composed to improve expressivity. The position embedding is required only at the first temporal attention module when more than one block is used.
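Continuing the two sketches above, one agent-temporal attention block can be expressed as the composition $f_a \circ f_t$, and several blocks can be stacked; for simplicity this sketch re-adds the positional embedding in every block, whereas the paper adds it only before the first one.

```python
import torch.nn as nn


class AgentTemporalBlock(nn.Module):
    """One agent-temporal attention block: temporal attention, then agent attention."""

    def __init__(self, d_model: int, n_heads: int, max_len: int):
        super().__init__()
        # TemporalAttention and AgentAttention are the sketches defined above.
        self.temporal = TemporalAttention(d_model, n_heads, max_len)
        self.agent = AgentAttention(d_model, n_heads)

    def forward(self, x):
        return self.agent(self.temporal(x))   # f_a o f_t


def make_attention_stack(depth: int, d_model: int, n_heads: int, max_len: int) -> nn.Sequential:
    # Stacking blocks improves expressivity, as described above.
    return nn.Sequential(*[AgentTemporalBlock(d_model, n_heads, max_len) for _ in range(depth)])
```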

4.2. Credit Assignment

The output of the attention modules is used to assign credit at each time-step along the length of the episode. Let $E = (f_a \circ f_t)(X)$ denote this output, where $E \in \mathbb{R}^{T \times N \times d_k}$. In order to carry out temporal credit assignment effectively, we leverage a property of permutation invariance.

4.2.1. Permutation Invariance

Agents sharing the same action and observation spaces are termed homogeneous. When homogeneous agents $i$ and $j$ cooperate to achieve a goal, the reward when agent $i$ observes $o_1$ and agent $j$ observes $o_2$ should be the same as when agent $i$ observes $o_2$ and agent $j$ observes $o_1$. This property is called permutation invariance, and has been shown to improve the sample-efficiency of multi-agent credit assignment as the number of agents increases liu2020pic; gangwani2020learning. When this property is satisfied, the output of the reward redistribution function should be invariant to the order of the agents' observations. Formally, if the set of all permutations along the agent dimension (the second dimension of $X$) is denoted $\Pi$, then $g(\pi(X)) = g(X)$ must hold for all $\pi \in \Pi$, where $g$ denotes the mapping from a trajectory to redistributed rewards.

The agent-temporal attention block treats agents symmetrically along the agent dimension by design. A sufficient condition for the overall mapping $g$ to be permutation invariant is therefore that the credit assignment function applied to the attention output be permutation invariant. To ensure this, we apply a multi-layer perceptron (MLP) to each agent's attention feature, add the MLP outputs element-wise across agents, and pass the result through another MLP. When the functions $\phi$ and $\rho$ associated with the two MLPs are continuous and shared among agents, the evaluation at time $t$ is the predicted reward $\hat{r}_t = \rho\big(\sum_{i=1}^{N} \phi(e_t^i)\big)$, where $e_t^i$ denotes the attention feature of agent $i$ at time $t$. It was shown in zaheer2017deep that any permutation invariant function can be represented in this form.
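A Deep Sets-style sketch (ours; hidden sizes are illustrative) of this credit assignment head applies a shared MLP $\phi$ to each agent's feature, sums over agents, and maps the result through a second MLP $\rho$ to a scalar reward per time-step:

```python
import torch
import torch.nn as nn


class CreditHead(nn.Module):
    """Permutation-invariant credit assignment head: rho(sum_i phi(e_t^i)) (sketch)."""

    def __init__(self, d_model: int, hidden: int = 50):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(d_model, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))
        self.rho = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        # e: (batch, T, N, d) -- output of the agent-temporal attention stack.
        pooled = self.phi(e).sum(dim=2)        # sum over the agent dimension
        return self.rho(pooled).squeeze(-1)    # (batch, T): one predicted reward per time-step
```

Because the sum over the agent dimension is order-independent and $\phi$ is shared, permuting the agents along `dim=2` leaves the predicted rewards unchanged.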

Remark 4.1.

AREL can be adapted to the heterogeneous case when cooperative agents are divided into homogeneous groups. Similar to a position embedding, we can apply an agent-group embedding such that agents within a group share an agent-group embedding. This will maintain permutation invariance of observations within a group, while enabling identification of agents from different groups. AREL will also work in the case when the multi-agent system is fully heterogeneous. This is equivalent to a scenario when there is only one agent in each homogeneous group. Therefore, AREL can handle agent types ranging from fully homogeneous to fully heterogeneous.

4.2.2. Credit Assignment Learning

Given a reward $R_{ep}$ received at the end of an episode of length $T$, the goal is to learn a temporal decomposition of $R_{ep}$ to assess the contributions of agents at each time-step along the trajectory. Specifically, we want to learn a redistributed reward $\hat{r}(X; \theta)$ satisfying $\sum_{t=1}^{T} \hat{r}_t(X; \theta) \approx R_{ep}$. Since $\hat{r}$ is a vector in $\mathbb{R}^{T}$, its $t$-th entry is denoted $\hat{r}_t$ ($t = 1, \dots, T$). The sequence $\{\hat{r}_t\}$ is learned by minimizing a regression loss

$$ \mathcal{L}_{reg}(\theta) = \mathbb{E}_{X}\Big[\Big(R_{ep} - \sum_{t=1}^{T} \hat{r}_t(X; \theta)\Big)^{2}\Big], $$

where $\theta$ are neural network parameters.

The redistributed rewards will be provided as an input to a MARL algorithm. We want to discourage $\hat{r}$ from being sparse, since sparse rewards may impede learning policies devlin2011theoretical. We observe that more than one combination of $\{\hat{r}_t\}$ can minimize $\mathcal{L}_{reg}$. We add a regularization loss to select among solutions that minimize $\mathcal{L}_{reg}$. Specifically, we aim to choose a solution that also minimizes the variance of the redistributed rewards (we examine other choices of the regularization loss in the Appendix). Using the variance $\mathcal{L}_{var}$ as a regularization term in the loss function leads to learning less sparsely redistributed rewards. With $\lambda$ denoting a hyperparameter, the combined loss function used to learn $\theta$ is:

$$ \mathcal{L}(\theta) = \mathcal{L}_{reg}(\theta) + \lambda\, \mathcal{L}_{var}(\theta), \qquad (3) $$

where $\mathcal{L}_{var}(\theta) = \mathbb{E}_{X}\big[\frac{1}{T}\sum_{t=1}^{T} (\hat{r}_t - \bar{r})^{2}\big]$ and $\bar{r} = \frac{1}{T}\sum_{t=1}^{T} \hat{r}_t$. The form of $\mathcal{L}$ in Eqn. (3) incorporates the possibility that not all intermediate states will contribute equally to $R_{ep}$, and additionally results in learning less sparsely redistributed rewards. Note that minimizing $\mathcal{L}$ will not typically yield $\hat{r}_t = R_{ep}/T$ for all $t$ (which corresponds to a uniform redistribution of rewards). Since some states may be common to different episodes, the redistributed reward at each time-step cannot be chosen arbitrarily. For example, consider two episodes $\tau_1$ and $\tau_2$, each of length $T$, with distinct cumulative episodic rewards $R_{ep}^{1} \neq R_{ep}^{2}$. If an intermediate state is common to episodes $\tau_1$ and $\tau_2$, then under a uniform redistribution the distinct rewards $R_{ep}^{1}/T$ and $R_{ep}^{2}/T$ would both be assigned to that state, which is not possible. Thus, a uniform redistribution may not be viable.
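A minimal sketch (ours) of this combined loss, assuming the credit head above outputs a `(batch, T)` tensor of predicted per-step rewards:

```python
import torch


def credit_loss(pred_rewards: torch.Tensor,     # (batch, T) predicted per-step rewards
                episodic_reward: torch.Tensor,  # (batch,) episodic rewards R_ep
                lam: float) -> torch.Tensor:
    # Regression term: predicted rewards should sum to the episodic reward.
    regression = ((pred_rewards.sum(dim=1) - episodic_reward) ** 2).mean()
    # Variance term: discourages sparse (spiky) redistributions.
    variance = pred_rewards.var(dim=1, unbiased=False).mean()
    return regression + lam * variance
```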

4.3. Algorithm

0:  Number of agents $N$. Reward weight $\alpha$. Initialize parameters $\theta$ and $\psi$ for the credit assignment and RL (policy/critic) modules, respectively. Experience buffer $\mathcal{D}$ for storing trajectories. Prediction function update frequency $K$.
1:  for episode $e = 1, 2, \dots$ do
2:     Reset episode return $R_{ep}$; reset trajectory $\tau$ for the current episode
3:     for step $t = 1, \dots, T$ do
4:        Sample action $a_t^i \sim \pi^i(\cdot \mid o_t^i)$, for $i = 1, \dots, N$
5:        Take joint action $a_t$; observe $o_{t+1}$
6:        Store transition $(o_t, a_t)$ in $\tau$
7:     end for
8:     Update episode reward $R_{ep}$; store $(\tau, R_{ep})$ in $\mathcal{D}$
9:     Sample a batch of trajectories $\mathcal{B}$ from $\mathcal{D}$
10:    Predict rewards $\hat{r}_t$ using $\theta$ for each trajectory in $\mathcal{B}$
11:    Update $\psi$ using $\mathcal{B}$ and the weighted reward $\tilde{r}_t$
12:    if $e$ mod $K$ is 0 then
13:       for each gradient update do
14:          Sample a batch from $\mathcal{D}$, and compute an estimate of the total loss $\mathcal{L}(\theta)$ in Eqn. (3)
15:          Update $\theta$ by a gradient step on $\mathcal{L}(\theta)$
16:       end for
17:    end if
18:  end for
Algorithm 1 AREL

AREL is summarized in Algorithm 1. The parameters $\psi$ of the RL modules and $\theta$ of the credit assignment function are randomly initialized. Observations and actions of agents are collected in each episode (Lines 2-6). Trajectories and episodic rewards are stored in an experience buffer $\mathcal{D}$ (Line 8). The reward at each time step of every trajectory in a batch (sampled from $\mathcal{D}$) is predicted (Lines 9-10). The predicted reward $\hat{r}_t$ changes as $\theta$ is updated, but the episode reward remains the same. A weighted sum $\tilde{r}_t = \alpha\, \hat{r}_t + (1 - \alpha)\, \mathbb{1}[t = T]\, R_{ep}$ (where $\mathbb{1}[\cdot]$ is an indicator function) is used to update $\psi$ in a stable manner using a MARL algorithm (Line 11). The credit assignment function is updated when new trajectories are available (Lines 12-17).
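A short sketch (ours) of this weighted reward, under the reading of Line 11 given above; shapes and names are assumptions:

```python
import torch


def mixed_rewards(pred_rewards: torch.Tensor,     # (batch, T) predicted by the credit network
                  episodic_reward: torch.Tensor,  # (batch,) original episodic rewards
                  alpha: float) -> torch.Tensor:
    # Blend the redistributed reward with the episodic reward, which is
    # placed entirely at the final time-step of each trajectory.
    bonus = torch.zeros_like(pred_rewards)
    bonus[:, -1] = episodic_reward
    return alpha * pred_rewards + (1.0 - alpha) * bonus
```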

4.4. Analysis

Figure 2. Average agent rewards and standard deviation for tasks in Particle World with episodic rewards: (a) Cooperative Push, (b) Predator and Prey, (c) Cooperative Navigation. AREL (dark blue) results in the highest average rewards in all tasks.

In order to establish a connection between the redistributed rewards from Line 10 of Algorithm 1 and the episodic reward, we define return equivalence of decentralized partially observable sequence-Markov decision processes (Dec-POSDPs). This generalizes the notion of return equivalence introduced in arjona2019rudder for the fully observable, single-agent setting. A Dec-POSDP is a decision process with Markov transition probabilities but with a reward distribution that need not be Markov. We present a result establishing that return-equivalent Dec-POSDPs have the same optimal policies.

Definition 4.2.

Dec-POSDPs $\mathcal{P}$ and $\tilde{\mathcal{P}}$ are return-equivalent if they differ only in their reward functions but have the same return $R_{ep}(\tau)$ for any trajectory $\tau$.

Theorem 4.3.

Given an initial state $s_0$, return-equivalent Dec-POSDPs have the same optimal policies.

According to Definition 4.2, any two return-equivalent Dec-POSDPs have the same expected return for any trajectory $\tau$; that is, $\mathbb{E}\big[R_{ep}^{\mathcal{P}}(\tau)\big] = \mathbb{E}\big[R_{ep}^{\tilde{\mathcal{P}}}(\tau)\big]$. This is used to prove Theorem 4.3.

Proof.

Consider two return-equivalent Dec-POSDPs $\mathcal{P}$ and $\tilde{\mathcal{P}}$. Since $\mathcal{P}$ and $\tilde{\mathcal{P}}$ have the same transition probability and observation functions, the probability that a trajectory $\tau$ is realized is the same in both Dec-POSDPs if they are provided with the same policy. For any joint agent policy $\pi$ and sequence of states, we therefore have

$$ J_{\mathcal{P}}(\pi) = \mathbb{E}_{\tau \sim \pi}\big[R_{ep}^{\mathcal{P}}(\tau)\big] = \mathbb{E}_{\tau \sim \pi}\big[R_{ep}^{\tilde{\mathcal{P}}}(\tau)\big] = J_{\tilde{\mathcal{P}}}(\pi). $$

These equations follow from Definition 4.2. Let $\pi^{*}$ denote an optimal policy for $\mathcal{P}$. Then, for any policy $\pi$, we have

$$ J_{\tilde{\mathcal{P}}}(\pi^{*}) = J_{\mathcal{P}}(\pi^{*}) \;\geq\; J_{\mathcal{P}}(\pi) = J_{\tilde{\mathcal{P}}}(\pi). $$

Therefore, $\pi^{*}$ will also be an optimal policy for $\tilde{\mathcal{P}}$. ∎

When $\mathcal{L}_{reg}(\theta) = 0$ in Eqn. (3), a Dec-POSDP with the redistributed reward will be return-equivalent to a Dec-POSDP with the original episodic reward. Theorem 4.3 indicates that in this scenario, the two Dec-POSDPs will have the same optimal policies. An additional result in Appendix A gives a bound on the total loss when the estimators are unbiased at each time-step.

5. Experiments

In this section, we describe the tasks that we evaluate AREL on, and present results of our experiments. Our code is available at https://github.com/baicenxiao/AREL.

5.1. Environments and Tasks

We study tasks from Particle World lowe2017multi and the StarCraft Multi-Agent Challenge samvelyan19smac. These have been identified as challenging multi-agent environments in lowe2017multi; samvelyan19smac. In each task, a reward is received by agents only at the end of an episode. No reward is provided at other time steps. We briefly summarize the tasks below and defer detailed task descriptions to Appendix B.
(1) Cooperative Push: agents work together to move a large ball to a landmark.
(2) Predator-Prey: predators seek to capture prey; landmarks impede the movement of agents.
(3) Cooperative Navigation: agents seek to reach landmarks. The maximum reward is obtained when there is exactly one agent at each landmark.
(4) StarCraft: units from one group (controlled by RL agents) collaborate to attack units from another (controlled by heuristics). We report results for three maps: 2 Stalkers and 3 Zealots (2s3z); 1 Colossus, 3 Stalkers, and 5 Zealots (1c3s5z); and 3 Stalkers vs. 5 Zealots (3s_vs_5z).

5.2. Architecture and Training

In order to make the agent-temporal attention module more expressive, we use a transformer architecture with multi-head attention vaswani2017attention for both agent and temporal attention. The permutation invariant critic (PIC) based on the multi-agent deep deterministic policy gradient (MADDPG) from liu2020pic is used as the base RL algorithm in Particle World. In StarCraft, we use QMIX rashid2018qmix as the base RL algorithm. The reward weight $\alpha$ is set to different values in Particle World and StarCraft; additional details are presented in Appendix C.

5.3. Evaluation

We compare AREL with three state-of-the-art methods:
(1) RUDDER arjona2019rudder: a long short-term memory (LSTM) network is used for reward decomposition along the length of an episode.
(2) Sequence Modeling liu2019sequence: an attention mechanism is used for temporal decomposition of rewards along an episode.
(3) Iterative Relative Credit Refinement (IRCR) gangwani2020learning: 'guidance rewards' for temporal credit assignment are learned using a surrogate objective.

RUDDER and Sequence Modeling were originally developed for the single-agent case. We adapted these methods to MARL by concatenating the observations of all agents. For Sequence Modeling, we added the variance-based regularization loss in our experiments, and observed that incorporating the regularization term improved performance compared to the unregularized variant.

5.4. Results

Figure 3. Average test win rate and variance in StarCraft: (a) 2s3z, (b) 3s_vs_5z, (c) 1c3s5z. AREL (dark blue) results in the highest win rates in 2s3z and 3s_vs_5z, and obtains a win rate comparable to Sequence Modeling in 1c3s5z.
Figure 4. Ablations: effects of (a) the agent attention module and (b) the regularization parameter $\lambda$ in Eqn. (3) in Cooperative Push, and (c) the reward weight $\alpha$ in the 2s3z StarCraft map.

5.4.1. AREL Enables Improved Performance

Figure 2 shows the results of our experiments for tasks in Particle World. In each case, AREL is able to guide agents to learn policies that result in higher average rewards compared to the other methods. This is a consequence of using an attention mechanism to redistribute an episodic reward along the length of an episode while also characterizing the contributions of individual agents.

The PIC baseline liu2020pic fails to learn policies that complete tasks with episodic rewards. A similar failure to complete tasks was observed when using RUDDER arjona2019rudder. A possible explanation is that RUDDER only carries out a temporal redistribution of rewards and does not account for agents contributing differently to the reward.

Sequence Modeling liu2019sequence performs better than RUDDER and the PIC baseline, possibly because it uses attention to redistribute episodic rewards. Attention-based redistribution was shown to outperform LSTM-based models, including RUDDER, in single-agent episodic RL liu2019sequence, due to the relative ease of training the attention mechanism. We believe that the absence of an explicit characterization of agent attention resulted in a lower reward for this method compared to AREL.

Using a surrogate objective in IRCR gangwani2020learning results in obtaining rewards comparable to AREL in some runs in the Cooperative Navigation task. However, the reward when using IRCR has a much higher variance compared to that obtained when using AREL. A possible reason for this is that IRCR does not characterize the relative contributions of agents at intermediate time-steps.

Figure 3 shows the results of our experiments for the three maps in StarCraft. AREL achieves the highest average win rate in the 2s3z and 3s_vs_5z maps, and obtains a comparable win rate to Sequence Modeling in 1c3s5z. Sequence Modeling does not explicitly model agent-attention, which could explain the lower average win rates in 2s3z and 3s_vs_5z. RUDDER achieves a nonzero, albeit much lower win rate than AREL in two maps, possibly because the increased episode length might affect the redistribution of the episode reward for this method. IRCR and QTRAN son2019qtran obtain the lowest win rates. Additional experimental results are provided in Appendix D.

5.4.2. Ablations

We carry out several ablations to evaluate the components of AREL. Figure 4(a) demonstrates the impact of the agent-attention module. In the absence of agent-attention (while retaining permutation invariance among agents through the shared temporal attention module), rewards are significantly lower. We study the effect of the value of $\lambda$ in Eqn. (3) on rewards in Figure 4(b). This term is critical in ensuring that agents learn good policies, which is underscored by the observation that rewards are significantly lower for very small or very large values of $\lambda$. Third, we evaluate the effect of mixing the original episodic reward and the redistributed reward by changing the reward weight $\alpha$ in Figure 4(c). The reward mixture influences win rates: $\alpha = 0.5$ or $0.8$ yields the highest win rate, while the win rate is lower when using the redistributed reward alone. Additional ablations and an evaluation of the choice of regularization loss are presented in Appendices E and F.

Figure 5. Comparison of AREL with QMIX and a strategic exploration technique, MAVEN, in a StarCraft map (avg. over 5 runs). AREL yields the highest rewards and win rates.
Figure 6. An instantiation of the Cooperative Navigation task where rewards are provided only at the end of an episode. Blue and red dots respectively denote agents and landmarks. Arrows on agents represent their directions of movement. The objective of this task is for each agent to cover a distinct landmark. The graph shows the normalized predicted rewards for a sample trajectory. The positions of agents relative to landmarks are shown at several points along this trajectory. The figure shows a scenario where two agents are close to a single landmark. In this case, one of them must remain close to this landmark, while the other moves towards a different landmark. The predicted redistributed reward encourages such behavior, since it has a higher magnitude when agents navigate towards distinct landmarks. The predicted redistributed reward is not uniform along the length of the episode.

5.4.3. Credit Assignment vs. exploration

This section demonstrates the importance of effective redistribution of an episodic reward vis-a-vis strategic exploration of the environment. The episodic reward takes continuous values and provides fine-grained information on performance (beyond only win/loss). AREL learns a redistribution of this reward by identifying critical states in an episode, and does not provide exploration abilities beyond those of the base RL algorithm. The redistributed rewards of AREL can be given as input to any RL algorithm to learn policies (in our experiments, QMIX for StarCraft and MADDPG for Particle World). Figure 5 illustrates a comparison of AREL with a state-of-the-art exploration strategy, MAVEN mahajan2019maven, and with QMIX rashid2018qmix in a StarCraft map. We observe that when rewards are delayed to the end of an episode, effectively redistributing the reward can be more beneficial than strategically exploring the environment for improving win rates or total rewards.

5.4.4. Interpretability of Learned Rewards

Figure 6 presents an interpretation of the decomposed predicted rewards vis-a-vis the positions of agents relative to landmarks in the Cooperative Navigation task. When the reward is provided only at the end of an episode, AREL is used to learn a temporal redistribution of this episodic reward. The predicted rewards are normalized for ease of representation. The positions of the agents relative to the landmarks are shown at several points along a sample trajectory. Successfully trained agents must learn policies that enable each agent to cover a distinct landmark. For example, in a scenario where two agents are close to a single landmark, one of them must remain close to this landmark while the other moves towards a different landmark. We observe that the magnitude of the predicted rewards is consistent with this insight: it is higher when agents move apart and navigate towards distinct landmarks.

This visualization in Figure 6 reveals that the attention mechanism in AREL is able to learn to redistribute an episodic reward effectively in order to successfully train agents to accomplish task objectives in cooperative multi-agent reinforcement learning. Moreover, it reveals that the redistributed reward predicted by AREL is not uniform along the length of the episode.

5.5. Discussion

This paper focused on developing techniques to effectively learn policies in MARL environments when rewards were delayed or episodic. Our experiments demonstrate that AREL can be used as a module that enables more effective credit assignment by identifying critical states through capturing long-term temporal dependencies between states and an episodic reward. Redistributed rewards predicted by AREL are dense, which can then be provided as an input to MARL algorithms that learn value functions for credit assignment (we used MADDPG lowe2017multi and QMIX rashid2018qmix in our experiments).

By including a variance-based regularization term, the total loss in Eqn. (3) enabled incorporating the possibility that not all intermediate states would contribute equally to an episodic reward, while also learning less sparse redistributed rewards. Moreover, any exploration ability available to the agents was provided solely by the MARL algorithm, and not by AREL. We further demonstrated that effective credit assignment was more beneficial than strategic exploration of the environment when rewards are episodic.

6. Conclusion

This paper studied the multi-agent temporal credit assignment problem in MARL tasks with episodic rewards. Solving this problem required addressing the twin challenges of identifying the relative importance of states along the length of an episode and individual agent’s state at any single time-step. We presented an attention-based method called AREL to deal with the above challenges. The temporally redistributed reward predicted by AREL was dense, and could be integrated with MARL algorithms. AREL was evaluated on tasks from Particle World and StarCraft, and was successful in obtaining higher rewards and better win rates than three state-of-the-art reward redistribution techniques.

Acknowledgment

This work was supported by the Office of Naval Research via Grant N00014-17-S-B001.

References

7. Appendices

These appendices include detailed analysis and proofs of the theoretical results in the main paper. They also contain details of the environments and present additional experimental results and ablation studies.

Appendix A: Analysis

Using the variance of the predicted redistributed rewards as the regularization loss allows us to provide an interpretation of the overall loss in Eqn. (3) in terms of a bias-variance trade-off in a mean square estimator. The variance-regularized predicted rewards are analyzed in detail in the sequel.

First, assume that the episodic reward is given by the sum of 'ground-truth' rewards at each time step; that is, $R_{ep} = \sum_{t=1}^{T} r_t$. The objective in Eqn. (3) is then

$$ \mathcal{L}(\theta) = \mathbb{E}_{X}\Big[\Big(\sum_{t=1}^{T} r_t - \sum_{t=1}^{T} \hat{r}_t\Big)^{2}\Big] + \lambda\, \mathbb{E}_{X}\Big[\frac{1}{T}\sum_{t=1}^{T} \big(\hat{r}_t - \bar{r}\big)^{2}\Big]. \qquad (4) $$

This type of regularization determines $\theta$ in a manner that is robust to overfitting.

For a sample trajectory, the expectation over $X$ in $\mathcal{L}(\theta)$ can be omitted. Let $\bar{r} = \frac{1}{T}\sum_{t=1}^{T}\hat{r}_t$ (i.e., the mean of the predicted rewards). Moreover, let $\frac{1}{T}\sum_{t=1}^{T} g(\hat{r}_t) = \mathbb{E}_t[g(\hat{r}_t)]$ for any function $g$ (i.e., $\{\hat{r}_t\}$ is treated as an ergodic process). Then, the following holds:

$$ \mathcal{L}(\theta) = \Big(\sum_{t=1}^{T}(r_t - \hat{r}_t)\Big)^{2} + \lambda\,\mathbb{E}_t\big[(\hat{r}_t - \bar{r})^{2}\big] \qquad (5) $$
$$ \leq T^{2}\,\mathbb{E}_t\big[(r_t - \hat{r}_t)^{2}\big] + \lambda\,\mathbb{E}_t\big[(\hat{r}_t - \bar{r})^{2}\big]. \qquad (6) $$

The first term in (6) is obtained by applying the Cauchy-Schwarz inequality to the first term of (5). Consider $\frac{1}{T}\sum_{t=1}^{T}(r_t - \hat{r}_t)^{2}$. From the linearity of the expectation operator, this is equal to $\mathbb{E}_t\big[(r_t - \hat{r}_t)^{2}\big]$. Then, $\hat{r}_t$ can be interpreted as an estimator of $r_t$, and the expression above is the mean square error of this estimator. By adding and subtracting $\bar{r}$, the mean square error can be expressed as the sum of the variance and the squared bias of the estimator domingos2000unified. Formally,

$$ \mathbb{E}_t\big[(r_t - \hat{r}_t)^{2}\big] = \mathbb{E}_t\big[(\hat{r}_t - \bar{r})^{2}\big] + \mathbb{E}_t\big[(\bar{r} - r_t)^{2}\big] + 2\,\mathbb{E}_t\big[(\hat{r}_t - \bar{r})(\bar{r} - r_t)\big]. $$

After distributing the expectation in the above expression, we obtain the following:
(a): since the ground-truth reward is treated as constant across time-steps and $\mathbb{E}_t[\hat{r}_t - \bar{r}] = 0$, the third term is equal to zero;
(b): the first two terms correspond to the variance of the random variable $\hat{r}_t$ and the square of a bias between $\bar{r}$ and $r_t$, and may not both be zero.
Substituting these in Eqn. (6),

$$ \mathcal{L}(\theta) \;\leq\; (T^{2} + \lambda)\,\mathbb{E}_t\big[(\hat{r}_t - \bar{r})^{2}\big] + T^{2}\,\mathbb{E}_t\big[(\bar{r} - r_t)^{2}\big]. \qquad (7) $$

Therefore, the total loss is upper-bounded by an expression that represents the sum of a bias term and a variance term. The parameter $\lambda$ determines the relative significance of each term. Let $\mathcal{U}(\theta)$ represent the term on the right-hand side of (7). If we denote by $\theta^{*}$ the parameters that minimize $\mathcal{L}$, and by $\tilde{\theta}$ the parameters that minimize $\mathcal{U}$, then an optimization carried out on $\mathcal{U}$ can be related to one carried out on $\mathcal{L}$ as:

$$ \mathcal{L}(\theta^{*}) \;\leq\; \mathcal{L}(\tilde{\theta}) \;\leq\; \mathcal{U}(\tilde{\theta}) \;\leq\; \mathcal{U}(\theta^{*}). $$

For the special case when the mean of the predicted rewards satisfies $\bar{r} = R_{ep}/T$, the first term of $\mathcal{L}$ in Eqn. (5) evaluates to zero. The optimization of $\mathcal{L}$ in this case reduces to minimizing the square error of the predictors at each time $t$. This setting is consistent with the principle of maximum entropy when the objective is to distribute an episodic return uniformly along the length of the trajectory.

Consider a single time-step $t$, and assume that there are enough samples (say, $n$) to 'learn' $\hat{r}_t$. Then, at each time-step $t$, the goal is to solve the problem $\min_{\theta}\, \mathbb{E}\big[(r_t - \hat{r}_t(\theta))^{2}\big]$.

The squared loss above admits a bias-variance decomposition domingos2000unified; geman1992neural that is commonly interpreted as a trade-off between the two terms. This underscores the insight that the complexity of the estimator (in terms of the dimension of the set containing the parameters $\theta$) should achieve an 'optimal' balance friedman2001elements; goodfellow2016deep. This is represented as a U-shaped curve for the total error, where bias decreases and variance increases with the complexity of the estimator. However, recent work has demonstrated that the variance of the prediction also decreases with the complexity of the estimator belkin2019reconciling; neal2018modern.

In order to determine a bound on the variance of the predictor at each time-step, we first state a result from neal2018modern. We use this in Theorem 7.2 to provide a bound when the estimators are unbiased at each time-step.

Theorem 7.1.

(neal2018modern) Let $p$ be the dimension of the parameter space containing $\theta$. Assume that the parameters are initialized by a Gaussian as $\theta \sim \mathcal{N}(0, \sigma^{2} I_p)$. Let $L$ be a Lipschitz constant associated with the map $\theta \mapsto \hat{r}_t(\theta)$. Then, for some constant $c$, the variance of the prediction satisfies $\mathrm{Var}\big(\hat{r}_t(\theta)\big) \leq c\, \sigma^{2} L^{2}$.

Theorem 7.2.

Let $\bar{r}$ denote the mean of the predicted rewards. Assume that there are $n$ samples to 'learn' $\hat{r}_t$ at each time-step $t$. Then, for unbiased estimators $\hat{r}_t$ and associated Lipschitz constants $L_t$, there exists a constant $c$ such that

$$ \sum_{t=1}^{T} \mathbb{E}\big[(r_t - \hat{r}_t)^{2}\big] \;\leq\; c\,\sigma^{2} \sum_{t=1}^{T} L_t^{2}. $$

Proof.

The proof follows from applying Theorem 7.1 at each time step $t$. Since the estimators at each time-step are unbiased, the mean square error of the predictors is equal to the variance of the predictor. The expectation operator in Eqn. (5) is then over a constant, which completes the proof. ∎

The assumption on estimators being unbiased is reasonable. Proposition 7.3 indicates that in the more general case (i.e. for an estimator that may not be unbiased), the prediction is concentrated around its mean with high probability.

Proposition 7.3.

(wainwright2019high) Under a Gaussian initialization of the parameters as $\theta \sim \mathcal{N}(0, \sigma^{2} I_p)$, at each time-step $t$ the following holds for all $s \geq 0$:

$$ \Pr\big(\,\big|\hat{r}_t(\theta) - \mathbb{E}[\hat{r}_t(\theta)]\big| \geq s\,\big) \;\leq\; 2\exp\!\Big(-\frac{s^{2}}{2\,\sigma^{2} L_t^{2}}\Big). $$

Theorem 7.2 indicates that optimizing $\mathcal{L}(\theta)$ is equivalent to minimizing the square error of the predictors at each time-step $t$.

Appendix B: Detailed Task Descriptions

This Appendix gives a detailed description of the tasks that we evaluate AREL on. In each experiment, a reward is obtained by the agents only at the end of an episode. No reward is provided at other time steps.

  • Cooperative Push: This task has agents working together to move a large ball to a landmark. Agents are rewarded when the ball reaches the landmark. Each agent observes its position and velocity, the relative position of the target landmark and the large ball, and the relative positions of the nearest agents. We report results for $N = 3$, $6$, and $15$ agents. At each time step, the distance between the agents and the ball, the distance between the ball and the landmark, and whether the agents touch the ball are recorded. These quantities, though, are not immediately revealed to the agents. Agents receive a reward at the end of each episode at time $T$.

  • Predator-Prey: This task has predators working together to capture prey, while landmarks impede the movement of the agents. Prey can move faster than predators, and predators obtain a positive reward when they collide with a prey. The prey agents are controlled by the environment. Each predator observes its position and velocity, the relative locations of the nearest landmarks, and the relative positions and velocities of the nearest prey and nearest predators. We report results for $N = 3$, $6$, and $15$ predators. At each time step, the distance between a prey and the closest predator, and whether a predator touches a prey, are recorded. These quantities, though, are not immediately revealed to the agents. The agents receive a reward at the end of each episode at time $T$.

  • Cooperative Navigation: This task has agents seeking to reach landmarks. The maximum reward is obtained when there is exactly one agent at each landmark. Agents are also penalized for colliding with each other. Each agent observes its position, velocity, and the relative locations of the nearest landmarks and agents. We report results for $N = 3$, $6$, and $15$ agents. At each time step, the distance between an agent and the closest landmark, and whether an agent collides with other agents, are recorded. These quantities, though, are not immediately revealed to the agents. The agents receive a reward at the end of each episode at time $T$.

  • StarCraft: We use the SMAC benchmark from samvelyan19smac. The environment comprises two groups of army units, and units from one group (controlled by learning agents) collaborate to attack units from the other (controlled by handcrafted heuristics). Each learning agent controls one army unit. We report results for three maps: 2 Stalkers and 3 Zealots (2s3z); 1 Colossus, 3 Stalkers, and 5 Zealots (1c3s5z); and 3 Stalkers versus 5 Zealots (3s_vs_5z). In 2s3z and 1c3s5z, two groups of identical units are placed symmetrically on the map. In 3s_vs_5z, the learning agents control 3 Stalkers to attack 5 Zealots controlled by the StarCraft AI. In all maps, units can only observe other units if they are both alive and located within the sight range. The 2s3z and 1c3s5z maps comprise heterogeneous agents, since there are different types of units, while the 3s_vs_5z is a homogeneous map. In all our experiments, the default environment reward is delayed and revealed only at the end of an episode. The reader is referred to samvelyan19smac for a detailed description of the default rewards.

Appendix C: Implementation Details

Figure 7. Average agent rewards and standard deviation for tasks in Particle World with episodic rewards: (a) Cooperative Push, (b) Predator and Prey, (c) Cooperative Navigation. AREL (dark blue) results in the highest average rewards in all tasks.
Figure 8. Average agent rewards for tasks in Particle World with episodic rewards: (a) Cooperative Push, (b) Predator and Prey, (c) Cooperative Navigation. AREL (dark blue) results in the highest average rewards.

All the results presented in this paper are averaged over 5 runs with different random seeds. We tested several values of the regularization parameter $\lambda$ in Eqn. (3), and used the value that resulted in the best performance.

In order to make the agent-temporal attention module more expressive, we use a transformer architecture with multi-head attention vaswani2017attention for both agent and temporal attention. Specifically, in our experiments, the transformer architecture applies, in sequence: an attention layer, layer normalization, two feed-forward layers with ReLU activation, and another layer normalization. Residual connections are added before each layer normalization.
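The following is a sketch (ours, with illustrative sizes) of one such sub-block, assuming post-norm residual connections as described above:

```python
import torch
import torch.nn as nn


class TransformerSubBlock(nn.Module):
    """Attention -> residual + LayerNorm -> 2-layer ReLU feed-forward -> residual + LayerNorm."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, attn_mask=None) -> torch.Tensor:
        # x: (batch, sequence, d_model); the sequence axis is time or agents.
        a, _ = self.attn(x, x, x, attn_mask=attn_mask)
        x = self.norm1(x + a)                  # residual connection before layer norm
        return self.norm2(x + self.ff(x))      # second residual + layer norm
```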

When the dimension of the observation space is large, a single fully connected layer is applied to compress each observation before the attention modules. The credit assignment block that produces redistributed rewards consists of two MLPs, each with a single hidden layer of 50 units. The neural networks for credit assignment are trained using the Adam optimizer.

In both Particle World and StarCraft, the credit assignment networks are updated periodically during training; each update processes batches of fully unrolled episodes sampled uniformly from the trajectory experience buffer $\mathcal{D}$.

During training and testing, the length of each episode in Particle World is kept fixed, except for one Predator-Prey configuration, where a different episode length is used. In StarCraft, an episode is restricted to a map-specific maximum number of time steps for 2s3z, 1c3s5z, and 3s_vs_5z. If both armies are alive at the end of the episode, we count it as a loss for the team of learning agents. An episode terminates after one army has been defeated or the time limit has been reached.

In Particle World, we use the permutation invariant critic (PIC) based on MADDPG from liu2020pic as the base reinforcement learning algorithm. The code is based on the implementation available at https://github.com/IouJenLiu/PIC. Following MADDPG lowe2017multi, the actor policy is parameterized by a two-layer MLP with 128 hidden units per layer and ReLU activation. The permutation invariant critic is a two-layer graph convolution network with 128 hidden units per layer, max pooling at the top, and ReLU activation. The learning rates for the actor and critic are 0.01, and are linearly decreased to zero by the end of training. Trajectories of the initial episodes are sampled randomly to fill the experience buffer. During training, uniform noise is added to actions for exploration.

In StarCraft, we use QMIX rashid2018qmix as the base algorithm. The QMIX code is based on the implementation from https://github.com/starry-sky6688/StarCraft. In this implementation, all agent networks share a deep recurrent Q-network whose recurrent layer is a GRU with a 64-dimensional hidden state, with a fully-connected layer before and after. Trajectories of the initial episodes are sampled randomly to fill the experience buffer. Target networks are updated every 200 training episodes. QMIX is trained using RMSprop. Throughout training, $\epsilon$-greedy exploration is adopted, with $\epsilon$ annealed linearly from 1.0 to 0.05 over 50k time steps and kept constant for the rest of learning.
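For reference, a one-function sketch (ours) of this linear annealing schedule:

```python
def epsilon(step: int, start: float = 1.0, end: float = 0.05,
            anneal_steps: int = 50_000) -> float:
    # Linearly anneal epsilon from `start` to `end` over `anneal_steps`,
    # then keep it constant for the rest of training.
    frac = min(step / anneal_steps, 1.0)
    return start + frac * (end - start)
```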

The experiments that we perform in this paper require computational resources to train the attention modules of AREL in addition to those needed to train deep RL algorithms. Using these resources might result in higher energy consumption, especially as the number of agents grows. This is a potential limitation of the methods studied in this paper. However, we believe that AREL partially addresses this concern by sharing certain modules among all agents in order to improve scalability.

We provide a description of our hardware resources below:
Hardware Configuration: All our experiments were carried out on a machine running Ubuntu, equipped with an Intel Xeon CPU and two NVIDIA GeForce RTX Ti graphics cards.

Appendix D: Additional Experimental Results

Table 1. Ablation: effect of removing the agent attention module and uniformly weighting the attention of each agent (termed Uniform) in the three Particle World tasks (CP: Cooperative Push, PP: Predator-Prey, CN: Cooperative Navigation) for $N = 15$, $6$, and $3$ agents. We report the average reward over training episodes and the final agent reward at the end of training for both AREL and Uniform. AREL consistently results in higher average and final rewards, which indicates that the agent attention module plays a crucial role in effective credit assignment.

This Appendix presents additional experimental results carried out in the Particle World environment.

Figure 7 shows the results of our experiments for tasks in Particle World with a different number of agents. In each case, AREL consistently enables agents to learn policies that result in higher average rewards compared to the other methods. This is a consequence of using an attention mechanism that decomposes an episodic reward along the length of an episode and also characterizes the contributions of individual agents to the reward. The performance of the PIC baseline liu2020pic, RUDDER arjona2019rudder, and Sequence Modeling liu2019sequence can be explained similarly to the case presented in the main paper. Using a surrogate objective in IRCR gangwani2020learning results in comparable agent rewards in some cases in the Cooperative Navigation task, but the reward curves are unstable and have high variance.

Figure 8 shows the results of experiments on these tasks with fewer agents. The PIC baseline and RUDDER are unable to learn good policies, and IRCR results in lower rewards than AREL in two tasks. The performance of Sequence Modeling is comparable to AREL, which indicates that characterizing agent attention plays a smaller role when there are fewer agents.

Appendix E: Additional Ablations

Figure 9. Ablation: Effect of the number of agent-temporal attention blocks on rewards.

This Appendix presents additional ablations that examine the impact of uniformly weighting the attention of agents and the number of agent-temporal attention blocks.

We evaluate the effect of removing the agent-attention block and uniformly weighting the attention of each agent. This is termed uniform agent attention (Uniform). The average reward over training episodes and the final agent reward at the end of training obtained when using AREL and when using Uniform are compared. The results of these experiments, presented in Table 1, indicate that both quantities are higher for AREL than for Uniform. This shows that the agent-attention block in AREL plays a crucial role in performing credit assignment effectively.

We examine the effect of the number of agent-temporal attention blocks (the depth) on rewards in the Cooperative Push task in Figure 9. The depth has a negligible impact on average rewards at the end of training. However, rewards during the early stages of training are lower for one of the depth settings, and these rewards also have a larger variance than in the other cases.

Appendix F: Effect of Choice of Regularization Loss

This Appendix examines the effect of the choice of the regularization loss term in Eqn. (3). The need for the regularization loss term arises due to the possibility that there could be more than one choice of redistributed rewards that minimize the regression loss alone. In our results in the main paper, we used the variance of the redistributed rewards as the regularization loss. This choice was motivated by a need to discourage the predicted redistributed rewards from being sparse, since sparse rewards might impede learning of policies when provided as an input to a MARL algorithm devlin2011theoretical. By adding a variance-based regularization, the total loss enables incorporating the possibility that not all intermediate states would contribute equally to an episodic reward, while also resulting in learning redistributed rewards that are less sparse.

We compare the variance-based regularization loss with two other widely used choices of regularization loss: the $\ell_1$- and $\ell_2$-based losses. The $\ell_1$-based regularization encourages learning sparse redistributed rewards, and the $\ell_2$-based regularization discourages learning a redistributed reward of large magnitude (i.e., 'spikes' in the redistributed reward). Specifically, we study the three total losses

$$ \mathcal{L}_{reg} + \lambda\,\mathcal{L}_{var}, \qquad \mathcal{L}_{reg} + \lambda_1 \lVert \hat{r} \rVert_1, \qquad \mathcal{L}_{reg} + \lambda_2 \lVert \hat{r} \rVert_2, $$

where $\mathcal{L}_{reg}$ and $\mathcal{L}_{var}$ are the regression loss and the variance of the redistributed rewards as in Eqn. (3), and $\lVert \hat{r} \rVert_1$ ($\lVert \hat{r} \rVert_2$) is the $\ell_1$ norm ($\ell_2$ norm) of the redistributed reward.
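A small sketch (ours) of the three regularization terms, applied to a batch of predicted per-step rewards of shape `(batch, T)`; the regression term is unchanged across the three variants:

```python
import torch


def variance_reg(pred: torch.Tensor) -> torch.Tensor:
    return pred.var(dim=1, unbiased=False).mean()    # used in the main experiments


def l1_reg(pred: torch.Tensor) -> torch.Tensor:
    return pred.abs().sum(dim=1).mean()              # encourages sparse redistributed rewards


def l2_reg(pred: torch.Tensor) -> torch.Tensor:
    return pred.norm(dim=1).mean()                   # discourages large 'spikes'
```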

We compare the use of the three regularization loss functions on the tasks in Particle World. In each task, we calculate the normalized reward received by the agents during the final training steps. For the variance-based regularization loss, we use the same value of $\lambda$ as in the main experiments. For the other two regularization losses, we searched over several values of $\lambda_1$ and $\lambda_2$, and report results for the best-performing values. The graph in Figure 10 shows the average normalized reward. We observe that using a variance-based regularization loss results in agents obtaining the highest average rewards.

Figure 10. Normalized average agent rewards for tasks in Particle World when using variance-based, $\ell_1$-based, and $\ell_2$-based regularization losses. The variance-based regularization results in agents obtaining the highest average rewards.

In particular, we observe that using the $\ell_1$-based regularization results in significantly smaller rewards. A possible reason for this is that the $\ell_1$-based regularization encourages learning a sparse redistributed reward, which hinders the learning of policies when provided as an input to the MARL algorithm. Using the $\ell_2$-based regularization results in a comparable, albeit slightly lower, average agent reward than the variance-based regularization. This is reasonable, since both the variance-based and $\ell_2$-based regularizations result in less sparse predicted redistributed rewards.

Appendix G: Verification of QMIX Implementation

Figure 11. Comparison of PyMARL and our implementations (avg. over 5 runs). Test win-rates using either implementation are almost identical.

This Appendix demonstrates the correctness of the QMIX implementation that we use from https://github.com/starry-sky6688/StarCraft. In the QMIX evaluation first used in rashid2018qmix, rewards were not delayed. In our experiments, rewards are delayed and revealed only at the end of an episode. In such a scenario, QMIX may not be able to perform long-term credit assignment, which explains the difference in performance between the default and delayed reward cases. We observe that using redistributed rewards from AREL as an input to QMIX results in improved performance compared to using QMIX alone when rewards from the environment are delayed (Figure 3 in the main paper). Using the default, non-delayed rewards, we compare the performance of the QMIX implementation used in our experiments with the benchmark implementation from samvelyan19smac. Figure 11 shows that test win rates in two StarCraft maps using both implementations are almost identical.