1. Introduction
Cooperative multi-agent reinforcement learning (MARL) involves multiple autonomous agents that learn to collaborate to complete tasks in a shared environment by maximizing a global reward (busoniu2008comprehensive). Examples of systems where MARL has been used include autonomous vehicle coordination (sallab2017deep) and video games (samvelyan19smac; tampuu2017multiagent).
One approach to enable better coordination is to use a single centralized controller that can access the observations of all agents (jiang2018learning). In this setting, algorithms designed for single-agent RL can be used in the multi-agent case. However, this may not be feasible when trained agents are deployed independently, or when communication costs between agents and the controller are prohibitive. In such situations, agents need to learn decentralized policies.
The centralized training with decentralized execution (CTDE) paradigm, introduced in (lowe2017multi; rashid2018qmix), enables agents to learn decentralized policies efficiently. Agents using CTDE can communicate with each other during training, but are required to make decisions independently at test time. The absence of a centralized controller requires each agent to assess how its own actions contribute to a shared global reward. This is called the multi-agent credit assignment problem, and it has been the focus of recent work in MARL, such as COMA (foerster2018counterfactual), QMIX (rashid2018qmix), and QTRAN (son2019qtran). Solving the multi-agent credit assignment problem alone, however, is not adequate for efficiently learning agent policies when the (global) reward signal is delayed until the end of an episode.
In reinforcement learning, agents seek to solve a sequential decision problem guided by reward signals at intermediate time steps. This is called the temporal credit assignment problem (sutton2018reinforcement). In many applications, rewards may be delayed. For example, in molecular design (olivecrona2017molecular), Go (silver2016mastering), and computer games such as Skiing (bellemare2013arcade), a summarized score is revealed only at the end of an episode. An episodic reward implies an absence of feedback on the quality of actions at intermediate time steps, making it difficult to learn good policies. The long-term temporal credit assignment problem has been studied in single-agent RL by performing return decomposition via contribution analysis (arjona2019rudder) and by using sequence modeling (liu2019sequence). These methods do not directly scale to MARL, since the size of the joint observation space grows exponentially with the number of agents (lowe2017multi).
Besides scalability, addressing temporal credit assignment in MARL with episodic rewards presents two challenges: it is critical to identify the relative importance of i) each agent's state at any single time step (agent dimension), and ii) states along the length of an episode (temporal dimension). We introduce Agent-Temporal Attention for Reward Redistribution in Episodic Multi-Agent Reinforcement Learning (AREL) to address these challenges.
AREL uses attention mechanisms (vaswani2017attention) to carry out multi-agent temporal credit assignment by concatenating: i) a temporal attention module that characterizes the influence of actions on state transitions along trajectories, and ii) an agent attention module that determines how any single agent is affected by other agents at each time step. The attention modules enable learning a redistribution of the episodic reward along the length of the episode, resulting in a dense reward signal. To overcome the challenge of scalability, instead of working with the concatenation of (joint) agents' observations, AREL analyzes the observations of each agent using a temporal attention module that is shared among agents. The outcome of the temporal attention module is passed to an agent attention module that characterizes the relative contribution of each agent to the shared global reward. The output of the agent attention module is then used to learn the redistributed rewards.
When rewards are delayed or episodic, it is important to identify 'critical' states that contribute to the reward. The authors of (gangwani2020learning) recently demonstrated that rewards delayed by a long time interval make it difficult for temporal-difference (TD) learning methods to carry out temporal credit assignment effectively. AREL overcomes this shortcoming by using attention mechanisms to learn an effective redistribution of an episodic reward. This is accomplished by identifying critical states through capturing long-term dependencies between states and the episodic reward.
Agents that have identical action and observation spaces are said to be homogeneous. Consider a task where two homogeneous agents need to collaborate to open a door by locating two buttons and pressing them simultaneously. In this example, while the locations of the two buttons (states) are important, the identities of the agents at the buttons are not. This property is termed permutation invariance, and it can be exploited to make the credit assignment process sample-efficient (gangwani2020learning; liu2019sequence). Thus, a redistributed reward must identify whether an agent is in a 'good' state, and should also be invariant to the identity of the agent in that state. AREL enforces this property by designing the credit assignment network with permutation-invariant operations among homogeneous agents, and it can be integrated with MARL algorithms to learn agent policies.
We evaluate AREL on three tasks from the Particle World environment (lowe2017multi) and three combat scenarios in the StarCraft Multi-Agent Challenge (samvelyan19smac). In each case, agents receive a summarized reward only at the end of an episode. We compare AREL with three state-of-the-art reward redistribution techniques, and observe that AREL results in accelerated learning of policies and higher rewards in Particle World, and improved win rates in StarCraft.
2. Related Work
Several techniques have been proposed to address temporal credit assignment when prior knowledge of the problem domain is available. Potential-based reward shaping is one such method; it provides theoretical guarantees in single-agent (ng1999policy) and multi-agent (devlin2011theoretical; lu2011policy) RL, and was shown to accelerate the learning of policies in (devlin2011empirical). Credit assignment has also been studied by incorporating human feedback through imitation learning (kelly2019hg; ross2011reduction) and demonstrations (brown2019extrapolating; huang2020ma). When prior knowledge of the problem domain is not available, recent work has studied temporal credit assignment in single-agent RL with delayed rewards. An approach named RUDDER (arjona2019rudder) used contribution analysis to decompose episodic rewards by computing the difference between predicted returns at successive time steps. In parallel, the authors of (liu2019sequence) proposed using natural language processing models to carry out temporal credit assignment for episodic rewards. Scaling these methods to MARL, though, is a challenge due to the exponential growth in the size of the joint observation space (lowe2017multi).
In the multi-agent setting, recent work has studied multi-agent credit assignment at each time step. Difference rewards were used to assess the contribution of an agent to a global reward in (agogino2006quicr; devlin2014potential; foerster2018counterfactual) by computing a counterfactual term that marginalized out the actions of that agent while keeping the actions of other agents fixed. Value decomposition networks, proposed in (sunehag2018value), decomposed a centralized value into a sum of agent values to assess each agent's contribution. A monotonicity assumption on value functions was imposed in QMIX (rashid2018qmix) to assign credit to individual agents. A generalized approach to decomposing a joint value into individual agent values was presented in QTRAN (son2019qtran). The Shapley Q-value was used in (wang2020shapley) to distribute a global reward in order to identify each agent's contribution. The authors of (yang2020q) decomposed global Q-values along trajectory paths, while (zhou2020learning) used an entropy-regularized method to encourage exploration that aids multi-agent credit assignment. These techniques do not address long-term temporal credit assignment, and hence are not adequate for learning policies efficiently when rewards are delayed.
Attention mechanisms have been used for multi-agent credit assignment in recent work. The authors of (mao2019modelling) used an attention mechanism with a CTDE-based algorithm to enable each agent to effectively model the policies of other agents (from its own perspective). Hierarchical graph attention networks proposed in (ryu2020multi) modeled hierarchical relationships among agents and used two attention networks to represent individual- and group-level interactions. The authors of (jiang2019graph; liu2020multi) combined attention networks with graph-based representations to indicate the presence and importance of interactions between any two agents. These approaches used attention mechanisms primarily to identify relationships between agents at a specific time step. They did not consider long-term temporal dependencies, and therefore may not be sufficient for learning policies effectively when rewards are delayed.
A method for the temporal redistribution of episodic rewards in single- and multi-agent RL was recently presented in (gangwani2020learning). A 'surrogate objective' was used to uniformly redistribute an episodic reward along a trajectory. However, this work did not use information from sample trajectories to characterize the relative contributions of agents at intermediate time steps along an episode.
Our approach differs from the related work above in that it uses attention mechanisms for multi-agent temporal credit assignment. AREL overcomes the challenge of scalability by analyzing the observations of each agent using temporal and agent attention modules, which respectively characterize the effect of actions on state transitions along a trajectory, and how each agent is influenced by other agents at each time step. Together, these modules enable an effective redistribution of an episodic reward. AREL does not require human intervention to guide agent behaviors, and can be integrated with MARL algorithms to learn decentralized agent policies in environments with episodic rewards.
3. Background
A fully cooperative multi-agent task can be specified as a decentralized partially observable Markov decision process (Dec-POMDP) (oliehoek2016concise). A Dec-POMDP is a tuple $\langle \mathcal{S}, \{\mathcal{A}_i\}_{i=1}^N, P, R, \{\mathcal{O}_i\}_{i=1}^N, Z, N, \gamma \rangle$, where $s \in \mathcal{S}$ describes the environment state. Each agent $i$ receives an observation $o_i \in \mathcal{O}_i$ according to an observation function $Z$. At each time step, agent $i$ chooses an action $a_i \in \mathcal{A}_i$ according to its policy $\pi_i$. $\mathcal{A} := \mathcal{A}_1 \times \cdots \times \mathcal{A}_N$ forms the joint action space, and the environment transitions to the next state according to the function $P$. All agents share a global reward $r_t = R(s_t, a_t)$. The goal of the agents is to determine their individual policies to maximize the return $G = \sum_{t=0}^{T} \gamma^t r_t$, where $\gamma \in [0,1]$ is a discount factor and $T$ is the length of the horizon. Let $o := (o_1, \dots, o_N)$ and $a := (a_1, \dots, a_N)$. A trajectory of length $T$ is an alternating sequence of observations and actions, $\tau = (o_0, a_0, o_1, a_1, \dots, o_T)$.
In a typical MARL task, agents receive a reward $r_t$ immediately following the execution of action $a_t$ at state $s_t$. The expected return can then be determined by accumulating rewards at each time step. In episodic RL, a reward is revealed only at the end of an episode at time $T$, and agents do not receive a reward at intermediate time steps. As a consequence, the expected return will be the same at all intermediate time steps (when $\gamma = 1$). Therefore, the quality of information available for learning policies is poor at all intermediate time steps. Moreover, delayed rewards have been shown to introduce a large bias (arjona2019rudder) or variance (ng1999policy) in the performance of RL algorithms.
The CTDE paradigm (foerster2018counterfactual; lowe2017multi) can be adopted to learn decentralized policies effectively when the dimensions of state and action spaces are large. During training, an agent can make use of information about other agents' states and actions to aid its own learning. At test time, decentralized policies are executed. This paradigm has been used to successfully complete tasks in complex MARL environments (gupta2017cooperative; iqbal2019actor; rashid2018qmix).
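As a concrete illustration of why episodic rewards are uninformative at intermediate time steps, the following sketch (hypothetical reward values, not from the paper) compares returns-to-go under dense and episodic rewards with $\gamma = 1$:

```python
import numpy as np

# A hypothetical illustration (values not from the paper) of why episodic
# rewards give no intermediate feedback: with gamma = 1, the return-to-go
# is the same constant at every time step of the episode.
def returns_to_go(rewards, gamma=1.0):
    """G_t = sum_{k >= t} gamma^(k - t) * r_k for each time step t."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

dense = np.array([1.0, 0.0, 2.0, 0.0, 1.0])             # per-step feedback
episodic = np.array([0.0, 0.0, 0.0, 0.0, dense.sum()])  # same total, delayed

print(returns_to_go(dense))     # [4. 3. 3. 1. 1.]: informative at each step
print(returns_to_go(episodic))  # [4. 4. 4. 4. 4.]: constant, uninformative
```

Since the return-to-go is constant under the episodic reward, no intermediate observation is distinguished by the reward signal, which is the gap that reward redistribution aims to close.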
4. Approach
This paper considers MARL tasks where agents share the same global reward, which is received only at the end of an episode. The objective is to redistribute this episodic reward for effective multi-agent temporal credit assignment. To accomplish this goal, it is critical to identify the relative importance of i) individual agents' observations at each time step, and ii) observations along the length of a trajectory. We introduce AREL to address these challenges. AREL uses an agent-temporal attention block to infer relationships among states at different times and among agents. A schematic is shown in Fig. 1, and we describe its key components and overall workflow in the remainder of this section.
4.1. Agent-Temporal Attention
In order to redistribute an episodic reward in a meaningful way, we need to be able to extract useful information from trajectories. Each trajectory contains a sequence of observations involving all agents. At each time step of an episode of length $T$, a feature of dimension $d$ corresponds to the embedding of a single observation. When there are $N$ agents, a trajectory is denoted by $\tau \in \mathbb{R}^{T \times N \times d}$. The objective is to learn a mapping to assign credit to the agents at each time step. The information in a trajectory comprises two parts: (1) temporal information between (embeddings of) observations at different time steps, which provides insight into the influence of actions on transitions between states; and (2) structural information, which provides insight into how any single agent is affected by other agents.
These two parts are coupled, and are hence studied together. The process of learning these relationships is termed attention. We propose an agent-temporal attention structure, inspired by the Transformer (vaswani2017attention). This structure selectively pays attention to different types of information, either from individual agents or at different time steps along a trajectory. This is accomplished by associating a weight with an observation based on its relative importance to other observations along the trajectory. The agent-temporal attention structure is formed by concatenating one agent attention module with one temporal attention module. The temporal attention module determines how entries of $\tau$ at different time steps are related (along the first, temporal, dimension of $\tau$). The agent attention module determines how agents influence one another (along the second, agent, dimension of $\tau$).
4.1.1. Temporal-Attention Module
The input is a trajectory $\tau \in \mathbb{R}^{T \times N \times d}$. To calculate the temporal attention feature, we work with the transpose of $\tau$, so that, for each agent, rows are indexed by time. Adopting notation from (vaswani2017attention), each row $x_t$ is transformed to a query $q_t = x_t W_Q$, key $k_t = x_t W_K$, and value $v_t = x_t W_V$. $W_Q, W_K, W_V \in \mathbb{R}^{d \times d_k}$ are learnable parameters. The $t$-th row of the temporal attention feature is a weighted sum $z_t = \sum_{t'} w^t_{t'} v_{t'}$. The attention weight vector $w^t$ is a normalization (softmax) of the inner product between the $t$-th row of the query matrix, $q_t$, and the key matrix $K$:
$$w^t = \mathrm{softmax}\left(\frac{q_t K^{\top}}{\sqrt{d_k}} \odot m^t\right), \qquad (1)$$
where $\odot$ is an element-wise product, and $m^t$ is a mask with its first $t$ entries equal to $1$, and remaining entries $-\infty$. The mask preserves causality by ensuring that at any time $t$, information beyond $t$ will not be used to assign credit. A temporal positional embedding (devlin2018bert) maintains information about the relative positions of states in an episode. Position embeddings are learnable vectors associated with each temporal position of a trajectory. The sum of the position and trajectory embeddings forms the input to the temporal attention module. The output of this module, $Z_t$, is obtained by stacking the rows $z_t$. The temporal attention process can be described by a function $F_t$.
The output of the temporal attention module provides an assessment of each agent's observation at any single time step relative to observations at other time steps of an episode. To obtain further insight into how an agent's observation is related to other agents' observations, an agent-attention module is concatenated to the temporal-attention module.
4.1.2. Agent-Attention Module
The agent-attention module operates on the transpose of $Z_t$, so that rows are indexed by agents at a given time step. Each row $y_i$ is transformed to a query $q_i = y_i W'_Q$, key $k_i = y_i W'_K$, and value $v_i = y_i W'_V$. Here, $W'_Q, W'_K, W'_V \in \mathbb{R}^{d \times d_k}$ are learnable parameters. The $i$-th row of the agent attention feature is a weighted sum $z_i = \sum_j w^i_j v_j$. Maintaining causality is not necessary when computing the agent attention weight vector $w^i$. These weights are determined similarly to the temporal attention weight vector in Eqn. (1), except without the masking operation. Therefore,
$$w^i = \mathrm{softmax}\left(\frac{q_i K^{\top}}{\sqrt{d_k}}\right). \qquad (2)$$
The agent attention procedure can be described by a function $F_a$, with output $Z_a = F_a(Z_t) \in \mathbb{R}^{T \times N \times d}$.
4.1.3. Concatenating Attention Modules
The output $Z_t$ of the temporal attention module is an entity that attends to information at time steps along the length of an episode for each agent. Passing $Z_t$ through the agent attention module results in an output that is attended to by embeddings at all time steps and from all agents. The data flow of this process can be written as a composition of functions: $Z_a = (F_a \circ F_t)(\tau)$. The temporal and agent attention modules can be repeatedly composed to improve expressivity. The position embedding is required only at the first temporal attention module when more than one is used.
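The composition of the temporal and agent attention modules described above can be sketched as follows. This is a minimal single-head sketch: the weight matrices, head dimension, and `-inf` mask handling are illustrative assumptions rather than the paper's exact implementation, and both modules reuse one set of weights here for brevity (in practice each module has its own parameters):

```python
import numpy as np

# A minimal single-head sketch of the agent-temporal attention block on a
# trajectory tensor of shape (T, N, d). Weight matrices, head dimension, and
# -inf masking are illustrative assumptions, not the paper's implementation.
rng = np.random.default_rng(0)
T, N, d, d_k = 6, 3, 8, 8

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv, mask=None):
    """Single-head self-attention over the rows of X (shape (L, d))."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    logits = Q @ K.T / np.sqrt(d_k)               # (L, L) scores
    if mask is not None:
        logits = np.where(mask, logits, -np.inf)  # hide masked positions
    return softmax(logits) @ V

Wq, Wk, Wv = (rng.normal(size=(d, d_k)) for _ in range(3))
causal = np.tril(np.ones((T, T), dtype=bool))     # step t attends to steps <= t

tau = rng.normal(size=(T, N, d))
# Temporal module: attend over time independently for each agent, with
# weights shared across agents (this sharing keeps the module scalable).
Z_t = np.stack([attention(tau[:, i], Wq, Wk, Wv, causal) for i in range(N)], axis=1)
# Agent module: attend over agents at each time step, with no causal mask.
Z_a = np.stack([attention(Z_t[t], Wq, Wk, Wv) for t in range(T)], axis=0)
print(Z_a.shape)  # (6, 3, 8): same shape as the input trajectory
```

Because the output shape matches the input shape, pairs of temporal and agent modules can be stacked repeatedly, as noted above.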
4.2. Credit Assignment
The output of the attention modules is used to assign credit at each time step along the length of the episode. Let $Z = (F_a \circ F_t)(\tau)$, where $Z \in \mathbb{R}^{T \times N \times d}$. In order to carry out temporal credit assignment effectively, we leverage a property called permutation invariance.
4.2.1. Permutation Invariance
Agents sharing the same action and observation spaces are termed homogeneous. When homogeneous agents $i$ and $j$ cooperate to achieve a goal, the reward when agent $i$ observes $o_1$ and agent $j$ observes $o_2$ should be the same as when agent $i$ observes $o_2$ and agent $j$ observes $o_1$. This property is called permutation invariance, and it has been shown to improve the sample-efficiency of multi-agent credit assignment as the number of agents increases (liu2020pic; gangwani2020learning). When this property is satisfied, the output of the credit assignment function should be invariant to the order of the agents' observations. Formally, if the set of all permutations along the agent dimension (the second dimension of $\tau$) is denoted $\Pi$, then $f(\pi(\tau)) = f(\tau)$ must hold for all $\pi \in \Pi$.
The attention block is permutation-invariant along the agent dimension by design. A sufficient condition for the overall credit assignment to be permutation-invariant is that the function mapping $Z$ to the predicted rewards be permutation-invariant. To ensure this, we apply a multi-layer perceptron (MLP) to each agent's features, add the MLP outputs element-wise across agents, and pass the result through another MLP. When the functions $\phi$ and $\rho$ associated with the two MLPs are continuous and shared among agents, the evaluation at time $t$ is the predicted reward $\hat{r}_t = \rho\big(\sum_{i=1}^{N} \phi(Z_{t,i})\big)$. It was shown in (zaheer2017deep) that any permutation-invariant function can be represented in this form.
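The permutation-invariant credit head described above (an MLP applied per agent, an element-wise sum over agents, then a second MLP, in the Deep Sets form of zaheer2017deep) can be sketched as follows; the one-layer `phi` and `rho` and the dimensions are illustrative stand-ins for the MLPs in the paper:

```python
import numpy as np

# A sketch of the permutation-invariant credit head in the Deep Sets form
# rho(sum_i phi(.)). The one-layer phi and rho are illustrative stand-ins.
rng = np.random.default_rng(1)
T, N, d, h = 6, 3, 8, 16

W_phi = rng.normal(size=(d, h))
w_rho = rng.normal(size=h)

def predicted_rewards(Z):
    """Map attention output Z of shape (T, N, d) to one reward per time step."""
    phi = np.tanh(Z @ W_phi)        # per-agent features with shared weights
    pooled = phi.sum(axis=1)        # element-wise sum over the agent dimension
    return np.tanh(pooled) @ w_rho  # rho maps pooled features to a scalar

Z = rng.normal(size=(T, N, d))
r_hat = predicted_rewards(Z)

# Shuffling the agent dimension leaves the predicted credit unchanged.
perm = rng.permutation(N)
assert np.allclose(r_hat, predicted_rewards(Z[:, perm]))
print(r_hat.shape)  # (6,): one redistributed reward per time step
```

The sum over agents is what guarantees invariance: any reordering of the agent dimension produces the same pooled feature, and hence the same predicted reward.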
Remark 4.1.
AREL can be adapted to the heterogeneous case when cooperative agents are divided into homogeneous groups. Similar to a position embedding, we can apply an agent-group embedding such that agents within a group share the same embedding. This maintains permutation invariance of observations within a group, while enabling the identification of agents from different groups. AREL also works when the multi-agent system is fully heterogeneous, which is equivalent to having only one agent in each homogeneous group. Therefore, AREL can handle agent compositions ranging from fully homogeneous to fully heterogeneous.
4.2.2. Credit Assignment Learning
Given a reward $R_{ep}(\tau)$ at the end of an episode of length $T$, the goal is to learn a temporal decomposition of $R_{ep}(\tau)$ to assess the contributions of agents at each time step along the trajectory. Specifically, we want to learn $\hat{r}_\theta$ satisfying $R_{ep}(\tau) \approx \sum_{t=1}^{T} \hat{r}_\theta(\tau)_t$. Since $\hat{r}_\theta(\tau)$ is a vector in $\mathbb{R}^T$, its $t$-th entry is denoted $\hat{r}_\theta(\tau)_t$ ($t = 1, \dots, T$). The sequence $\{\hat{r}_\theta(\tau)_t\}$ is learned by minimizing a regression loss
$$L_{reg}(\theta) = \mathbb{E}_\tau\Big[\Big(R_{ep}(\tau) - \sum_{t=1}^{T} \hat{r}_\theta(\tau)_t\Big)^2\Big],$$
where $\theta$ are neural network parameters.
The redistributed rewards will be provided as an input to a MARL algorithm. We want to discourage $\hat{r}_\theta$ from being sparse, since sparse rewards may impede learning policies (devlin2011theoretical). We observe that more than one combination of $\{\hat{r}_\theta(\tau)_t\}$ can minimize $L_{reg}$. We add a regularization loss to select among solutions that minimize $L_{reg}$. Specifically, we aim to choose a solution that also minimizes the variance of the redistributed rewards, and set $L_{var}(\theta) = \mathbb{E}_\tau\big[\mathrm{Var}\big(\{\hat{r}_\theta(\tau)_t\}_{t=1}^{T}\big)\big]$ (we examine other choices of the regularizer in the Appendix). Using $L_{var}$ as a regularization term in the loss function leads to learning less sparsely redistributed rewards. With $\alpha$ denoting a hyperparameter, the combined loss function used to learn $\theta$ is:
$$L(\theta) = L_{reg}(\theta) + \alpha L_{var}(\theta). \qquad (3)$$
The form of $L(\theta)$ in Eqn. (3) incorporates the possibility that not all intermediate states contribute equally to $R_{ep}(\tau)$, and additionally results in learning less sparsely redistributed rewards. Note that minimizing $L(\theta)$ will not typically yield a uniform redistribution of rewards.
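A sketch of how the two terms of the loss in Eqn. (3) interact is given below, under the assumption that the regression term penalizes the squared gap between the episodic reward and the sum of the predicted per-step rewards, with a variance regularizer weighted by a hyperparameter `alpha`; the exact batching and normalization are assumptions:

```python
import numpy as np

# A sketch of the combined credit-assignment loss: a regression term that
# matches the episodic return, plus a variance regularizer. The exact
# batching/normalization is an assumption, not the paper's implementation.
def credit_loss(r_hat, R_episode, alpha=0.1):
    L_reg = (R_episode - r_hat.sum()) ** 2  # match the episodic return
    L_var = r_hat.var()                     # discourage sparse, spiky credit
    return L_reg + alpha * L_var

r_uniform = np.full(4, 2.0)                 # uniform split of R = 8
r_spiky = np.array([0.0, 0.0, 0.0, 8.0])    # all credit on the final step

# Both minimize the regression term; the variance term prefers the less
# sparse solution, while the regression term (fit across many episodes
# sharing states) keeps the minimizer from collapsing to a uniform split.
print(credit_loss(r_uniform, 8.0))  # regression 0, variance 0
print(credit_loss(r_spiky, 8.0))    # regression 0, alpha * Var = 0.1 * 12
```

On a single episode in isolation the variance term is minimized by a uniform split; it is the regression term, fit jointly over episodes that share states but have different returns, that pushes the learned redistribution away from uniformity.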
Since some states may be common to different episodes, the redistributed reward at each time step cannot be chosen arbitrarily. For example, consider two episodes $\tau_1$ and $\tau_2$, each of length $T$, with distinct cumulative episodic rewards $R_1 \neq R_2$. If an intermediate state is common to episodes $\tau_1$ and $\tau_2$, then under a uniform redistribution, the distinct rewards $R_1/T$ and $R_2/T$ would both be assigned to that state, which is not possible. Thus, a uniform redistribution cannot fit both episodes simultaneously, implying that a uniform redistribution may not be viable.
4.3. Algorithm
AREL is summarized in Algorithm 1. Parameters of the RL modules and of the credit assignment function $\hat{r}_\theta$ are randomly initialized. Observations and actions of agents are collected in each episode (Lines 2-6). Trajectories and episodic rewards are stored in an experience buffer (Line 8). The reward at each time step for every trajectory in a batch (sampled from the buffer) is predicted (Lines 9-10). The predicted reward changes as $\theta$ is updated, but the episode reward remains the same. A weighted sum of the predicted and episodic rewards (with an indicator function selecting the final time step) is used to update policies in a stable manner using a MARL algorithm (Line 11). The credit assignment function is updated when new trajectories are available (Lines 12-17).
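The weighted sum used in Line 11 can be sketched as follows, under the assumption that a mixing weight `beta` combines the predicted per-step reward with the episodic reward placed at the final step via the indicator term; both `beta` and this exact mixing form are assumptions for illustration:

```python
import numpy as np

# A sketch of the reward passed to the base MARL algorithm in Line 11 of
# Algorithm 1, assuming the weighted sum mixes the predicted per-step reward
# with the episodic reward placed at the final step via an indicator 1{t = T}.
def training_rewards(r_hat, R_episode, beta=0.5):
    T = len(r_hat)
    indicator = np.zeros(T)
    indicator[-1] = 1.0  # indicator selecting the final time step
    return beta * r_hat + (1.0 - beta) * indicator * R_episode

r_hat = np.array([1.0, 3.0, 0.0, 4.0])  # predicted redistributed rewards
mixed = training_rewards(r_hat, R_episode=8.0, beta=0.5)
print(mixed)  # 0.5 * r_hat everywhere, plus 0.5 * 8.0 at the final step
```

Retaining part of the original episodic reward stabilizes policy updates while the credit assignment network is still being trained; the ablation in Section 5.4.2 studies the effect of this mixing weight.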
4.4. Analysis
(Figure: Average agent rewards and standard deviation for tasks in Particle World with episodic rewards. AREL (dark blue) results in the highest average rewards in all tasks.)
In order to establish a connection between the redistributed rewards from Line 10 of Algorithm 1 and the episodic reward, we define return equivalence of decentralized partially observable sequence-Markov decision processes (Dec-POSDPs). This generalizes the notion of return equivalence introduced in (arjona2019rudder) for the fully observable, single-agent setting. A Dec-POSDP is a decision process with Markov transition probabilities but with a reward distribution that need not be Markov. We present a result establishing that return-equivalent Dec-POSDPs have the same optimal policies.
Definition 4.2.
Dec-POSDPs $\tilde{M}$ and $M$ are return-equivalent if they differ only in their reward functions but have the same return $G(\tau)$ for any trajectory $\tau$.
Theorem 4.3.
Given an initial state $s_0$, return-equivalent Dec-POSDPs have the same optimal policies.
According to Definition 4.2, any two return-equivalent Dec-POSDPs have the same expected return for any trajectory $\tau$; that is, $\mathbb{E}[G \mid \tau, \tilde{M}] = \mathbb{E}[G \mid \tau, M]$. This is used to prove Theorem 4.3.
Proof.
Consider two return-equivalent Dec-POSDPs $\tilde{M}$ and $M$. Since $\tilde{M}$ and $M$ have the same transition probability and observation functions, the probability that a trajectory is realized is the same when both Dec-POSDPs are provided with the same policy. For any joint agent policy $\pi$, we have:
$$\mathbb{E}[G \mid \pi, \tilde{M}] = \mathbb{E}_\tau\big[\mathbb{E}[G \mid \tau, \tilde{M}]\big] = \mathbb{E}_\tau\big[\mathbb{E}[G \mid \tau, M]\big] = \mathbb{E}[G \mid \pi, M].$$
These equations follow from Definition 4.2. Let $\pi^*$ denote an optimal policy for $\tilde{M}$. Then, for any policy $\pi$,
$$\mathbb{E}[G \mid \pi^*, M] = \mathbb{E}[G \mid \pi^*, \tilde{M}] \geq \mathbb{E}[G \mid \pi, \tilde{M}] = \mathbb{E}[G \mid \pi, M].$$
Therefore, $\pi^*$ is also an optimal policy for $M$. ∎
When the regression loss in Eqn. (3) is zero, a Dec-POSDP with the redistributed reward is return-equivalent to a Dec-POSDP with the original episodic reward. Theorem 4.3 indicates that in this scenario, the two Dec-POSDPs have the same optimal policies. An additional result in Appendix A bounds the deviation between returns when the reward estimators are unbiased at each time step.
5. Experiments
In this section, we describe the tasks that we evaluate AREL on, and present results of our experiments. Our code is available at https://github.com/baicenxiao/AREL.
5.1. Environments and Tasks
We study tasks from Particle World (lowe2017multi) and the StarCraft Multi-Agent Challenge (samvelyan19smac). These have been identified as challenging multi-agent environments (lowe2017multi; samvelyan19smac). In each task, a reward is received by agents only at the end of an episode; no reward is provided at other time steps. We briefly summarize the tasks below and defer detailed task descriptions to Appendix B.
(1) Cooperative Push: Agents work together to move a large ball to a landmark.
(2) Predator-Prey: Predators seek to capture prey; landmarks impede the movement of agents.
(3) Cooperative Navigation: Agents seek to reach landmarks. The maximum reward is obtained when there is exactly one agent at each landmark.
(4) StarCraft: Units from one group (controlled by RL agents) collaborate to attack units from another (controlled by heuristics). We report results for three maps: 2 Stalkers, 3 Zealots (2s3z); 1 Colossus, 3 Stalkers, 5 Zealots (1c3s5z); and 3 Stalkers vs. 5 Zealots (3s_vs_5z).
5.2. Architecture and Training
In order to make the agent-temporal attention module more expressive, we use a transformer architecture with multi-head attention (vaswani2017attention) for both agent and temporal attention. The permutation invariant critic (PIC) (liu2020pic), based on the multi-agent deep deterministic policy gradient (MADDPG), is used as the base RL algorithm in Particle World. In StarCraft, we use QMIX (rashid2018qmix) as the base RL algorithm. The value of $\alpha$ is set separately for Particle World and StarCraft. Additional details are presented in Appendix C.
5.3. Evaluation
We compare AREL with three state-of-the-art methods:
(1) RUDDER (arjona2019rudder): A long short-term memory (LSTM) network is used for reward decomposition along the length of an episode.
(2) Sequence Modeling (liu2019sequence): An attention mechanism is used for the temporal decomposition of rewards along an episode.
(3) Iterative Relative Credit Refinement (IRCR) (gangwani2020learning): 'Guidance rewards' for temporal credit assignment are learned using a surrogate objective.
RUDDER and Sequence Modeling were originally developed for the single-agent case. We adapted these methods to MARL by concatenating the observations of all agents. We added the variance-based regularization loss to Sequence Modeling in our experiments, and observed that incorporating the regularization term improved performance compared to omitting it.
5.4. Results
5.4.1. AREL enables improved performance
Figure 2 shows the results of our experiments for the tasks in Particle World. In each case, AREL guides agents to learn policies that result in higher average rewards than the other methods. This is a consequence of using an attention mechanism to redistribute an episodic reward along the length of an episode, while also characterizing the contributions of individual agents.
The PIC baseline (liu2020pic) fails to learn policies that complete the tasks with episodic rewards. A similar failure was observed when using RUDDER (arjona2019rudder). An explanation could be that RUDDER only carries out a temporal redistribution of rewards, and does not account for agents contributing differently to a reward.
Sequence Modeling (liu2019sequence) performs better than RUDDER and the PIC baseline, possibly because it uses attention to redistribute episodic rewards. It was shown to outperform LSTM-based models, including RUDDER, in single-agent episodic RL (liu2019sequence), due to the relative ease of training the attention mechanism. We believe that the absence of an explicit characterization of agent attention resulted in a lower reward for this method compared to AREL.
Using a surrogate objective in IRCR (gangwani2020learning) yields rewards comparable to AREL in some runs of the Cooperative Navigation task. However, the reward when using IRCR has a much higher variance than that obtained when using AREL. A possible reason is that IRCR does not characterize the relative contributions of agents at intermediate time steps.
Figure 3 shows the results of our experiments for the three maps in StarCraft. AREL achieves the highest average win rate in the 2s3z and 3s_vs_5z maps, and obtains a win rate comparable to Sequence Modeling in 1c3s5z. Sequence Modeling does not explicitly model agent attention, which could explain its lower average win rates in 2s3z and 3s_vs_5z. RUDDER achieves a nonzero, albeit much lower, win rate than AREL in two maps, possibly because the increased episode length affects the redistribution of the episodic reward for this method. IRCR and QTRAN (son2019qtran) obtain the lowest win rates. Additional experimental results are provided in Appendix D.
5.4.2. Ablations
We carry out several ablations to evaluate the components of AREL. Figure 3(a) demonstrates the impact of the agent-attention module: in its absence (while retaining permutation invariance among agents through the shared temporal attention module), rewards are significantly lower. We study the effect of the value of $\alpha$ in Eqn. (3) on rewards in Figure 3(b). This term is critical in ensuring that agents learn good policies, underscored by the observation that rewards are significantly lower for very small or very large values of $\alpha$. Third, we evaluate the effect of mixing the original episodic reward and the redistributed reward by changing the reward weight in Figure 3(c). The reward mixture influences win rates; a weight of 0.5 or 0.8 yields the highest win rate, and the win rate is lower when using the redistributed reward alone. Additional ablations and an evaluation of the choice of regularization loss are presented in Appendices E and F.
5.4.3. Credit Assignment vs. Exploration
This section demonstrates the importance of effective redistribution of an episodic reward vis-à-vis strategic exploration of the environment. The episodic reward takes continuous values and provides fine-grained information on performance (beyond only win/loss). AREL learns a redistribution of the episodic reward by identifying critical states in an episode, and does not provide exploration abilities beyond those of the base RL algorithm. The redistributed rewards from AREL can be given as input to any RL algorithm to learn policies (in our experiments, QMIX for StarCraft and MADDPG for Particle World). Figure 5 compares AREL with a state-of-the-art exploration strategy, MAVEN (mahajan2019maven), and with QMIX (rashid2018qmix) in StarCraft. We observe that when rewards are delayed to the end of an episode, effectively redistributing the reward can be more beneficial than strategically exploring the environment to improve win rates or total rewards.
5.4.4. Interpretability of Learned Rewards
Figure 6 presents an interpretation of the decomposed predicted rewards vis-à-vis the positions of agents relative to landmarks in the Cooperative Navigation task. When the reward is provided only at the end of an episode, AREL is used to learn a temporal redistribution of this episodic reward. The predicted rewards are normalized for ease of representation. The positions of the agents relative to the landmarks are shown at several points along a sample trajectory. Successfully trained agents must learn policies that enable each agent to cover a distinct landmark. For example, in a scenario where two agents are close to a single landmark, one of them must remain close to this landmark while the other moves towards a different landmark. We observe that the magnitude of the predicted rewards is consistent with this insight: it is higher when agents move apart and towards distinct landmarks.
The visualization in Figure 6 shows that the attention mechanism in AREL learns to redistribute an episodic reward effectively, allowing agents to be trained to accomplish task objectives in cooperative MARL. Moreover, it shows that the redistributed reward predicted by AREL is not uniform along the length of the episode.
5.5. Discussion
This paper focused on developing techniques to effectively learn policies in MARL environments when rewards are delayed or episodic. Our experiments demonstrate that AREL can be used as a module that enables more effective credit assignment by identifying critical states through capturing long-term temporal dependencies between states and an episodic reward. The redistributed rewards predicted by AREL are dense, and can be provided as input to MARL algorithms that learn value functions for credit assignment (we used MADDPG lowe2017multi and QMIX rashid2018qmix in our experiments).
By including a variance-based regularization term, the total loss in Eqn. (3) incorporates the possibility that not all intermediate states contribute equally to an episodic reward, while also yielding less sparse redistributed rewards. Moreover, any exploration ability available to the agents was provided solely by the MARL algorithm, and not by AREL. We further demonstrated that effective credit assignment is more beneficial than strategic exploration of the environment when rewards are episodic.
6. Conclusion
This paper studied the multiagent temporal credit assignment problem in MARL tasks with episodic rewards. Solving this problem requires addressing twin challenges: identifying the relative importance of states along the length of an episode, and identifying the relative importance of individual agents' states at any single timestep. We presented an attention-based method called AREL to address these challenges. The temporally redistributed reward predicted by AREL is dense and can be integrated with MARL algorithms. AREL was evaluated on tasks from Particle World and StarCraft, and obtained higher rewards and better win rates than three state-of-the-art reward redistribution techniques.
Acknowledgment
This work was supported by the Office of Naval Research via Grant N0001417SB001.
References
7. Appendices
These appendices include detailed analysis and proofs of the theoretical results in the main paper. They also contain details of the environments and present additional experimental results and ablation studies.
Appendix A: Analysis
Using the variance of the predicted redistributed rewards as the regularization loss allows us to interpret the overall loss in Eqn. (3) in terms of a bias-variance tradeoff in a mean square estimator. The variance-regularized predicted rewards are analyzed in detail in the sequel.
First, assume that the episodic reward is the sum of 'ground-truth' rewards at each time step. That is, $R_{ep}(\tau) = \sum_{t=1}^{T} r_t$, where $r_t$ denotes the (unobserved) ground-truth reward at time $t$. The objective in Eqn. (3) is:

$$\mathcal{L}(\phi) = \mathbb{E}_{\tau}\left[\Big(\sum_{t=1}^{T} r_t - \sum_{t=1}^{T} \hat{r}_t(\phi)\Big)^{2}\right] + \lambda\,\mathrm{Var}\big(\hat{r}_t(\phi)\big), \qquad (4)$$

where $\hat{r}_t(\phi)$ is the redistributed reward predicted at time $t$ with parameters $\phi$, and $\lambda > 0$ is the regularization coefficient.
This type of regularization determines $\phi$ in a manner that is robust to overfitting.
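For a single trajectory, the objective in Eqn. (4) can be sketched as a short function: the squared gap between the episodic reward and the sum of predicted rewards, plus $\lambda$ times the variance of the predictions. The value of `lam` below is an arbitrary illustrative choice, not the paper's tuned setting.

```python
import numpy as np

def arel_loss(r_hat, episodic_reward, lam=0.1):
    """Sketch of the per-trajectory objective in Eqn. (4).

    r_hat: predicted per-timestep redistributed rewards.
    episodic_reward: the scalar reward revealed at episode end.
    lam: illustrative regularization weight (an assumption).
    """
    r_hat = np.asarray(r_hat, dtype=float)
    regression = (episodic_reward - r_hat.sum()) ** 2  # regression loss
    variance = r_hat.var()                             # regularizer
    return regression + lam * variance
```

A uniform redistribution that sums to the episodic reward attains zero loss, while a "spiky" redistribution with the same sum is penalized through the variance term.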
For a sample trajectory, the expectation over $\tau$ in $\mathcal{L}(\phi)$ can be omitted. Let $\bar{r}(\phi) := \frac{1}{T}\sum_{t=1}^{T} \hat{r}_t(\phi)$ (i.e., the mean of the predicted rewards). Moreover, let $\mathbb{E}[\hat{r}_t(\phi)] = \mathbb{E}[\bar{r}(\phi)]$ for every $t$ (i.e., $\hat{r}_t(\phi)$ is an ergodic process). Then, the following holds:

$$\mathcal{L}(\phi) = \Big(\sum_{t=1}^{T}\big(r_t - \hat{r}_t(\phi)\big)\Big)^{2} + \lambda\,\mathrm{Var}\big(\hat{r}_t(\phi)\big) \qquad (5)$$

$$\le T\sum_{t=1}^{T}\big(r_t - \hat{r}_t(\phi)\big)^{2} + \lambda\,\mathrm{Var}\big(\hat{r}_t(\phi)\big). \qquad (6)$$
The first term in (6) is obtained by applying the Cauchy-Schwarz inequality to the first term of (5). Consider $\mathbb{E}\big[\sum_{t=1}^{T}(r_t - \hat{r}_t(\phi))^2\big]$. From linearity of the expectation operator, this is equal to $\sum_{t=1}^{T}\mathbb{E}\big[(r_t - \hat{r}_t(\phi))^2\big]$. Then, $\hat{r}_t(\phi)$ can be interpreted as an estimator of $r_t$, and the expression above is the mean square error of this estimator. By adding and subtracting $\mathbb{E}[\hat{r}_t(\phi)]$, the mean square error can be expressed as the sum of the variance and the squared bias of the estimator domingos2000unified. Formally,

$$\mathbb{E}\big[(r_t - \hat{r}_t(\phi))^2\big] = \mathbb{E}\Big[\big((\hat{r}_t(\phi) - \mathbb{E}[\hat{r}_t(\phi)]) + (\mathbb{E}[\hat{r}_t(\phi)] - r_t)\big)^2\Big].$$

After distributing the expectation in the above expression, we obtain the following: (a) since $r_t$ is constant, the third (cross) term is equal to $2\,\mathbb{E}\big[\hat{r}_t(\phi) - \mathbb{E}[\hat{r}_t(\phi)]\big]\,\big(\mathbb{E}[\hat{r}_t(\phi)] - r_t\big) = 0$; (b) the first two terms correspond to the variance of the random variable $\hat{r}_t(\phi)$ and the square of the bias between $\mathbb{E}[\hat{r}_t(\phi)]$ and $r_t$, and may not both be zero. Substituting these in Eqn. (6),

$$\mathcal{L}(\phi) \le T\sum_{t=1}^{T}\Big[\mathrm{Var}\big(\hat{r}_t(\phi)\big) + \big(\mathbb{E}[\hat{r}_t(\phi)] - r_t\big)^2\Big] + \lambda\,\mathrm{Var}\big(\hat{r}_t(\phi)\big). \qquad (7)$$
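The bias-variance decomposition used to reach Eqn. (7) can be checked numerically: for any fixed target, the empirical mean square error equals the empirical variance plus the squared bias. The estimator below is synthetic (a noisy, deliberately biased prediction of a fixed reward), constructed only to illustrate the identity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed ground-truth reward at one timestep, and a biased noisy estimator of it.
r_t = 2.0
r_hat = r_t + 0.5 + 0.3 * rng.standard_normal(100_000)  # bias 0.5, noise std 0.3

mse = np.mean((r_hat - r_t) ** 2)
variance = np.var(r_hat)
bias_sq = (np.mean(r_hat) - r_t) ** 2

# MSE decomposes exactly into variance + squared bias (empirical moments).
assert abs(mse - (variance + bias_sq)) < 1e-9
```

The identity holds exactly for empirical moments, not only in expectation, which is why the assertion uses a tight tolerance.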
Therefore, the total loss is upper-bounded by an expression that represents the sum of a bias and a variance. The parameter $\lambda$ determines the relative significance of each term. Let $\mathcal{L}_{UB}(\phi)$ denote the term on the right-hand side of (7). If we denote by $\phi^*$ the parameters that minimize $\mathcal{L}(\phi)$, and by $\phi^*_{UB}$ the parameters that minimize $\mathcal{L}_{UB}(\phi)$, then an optimization carried out on $\mathcal{L}_{UB}$ can be related to one carried out on $\mathcal{L}$ as: $\mathcal{L}(\phi^*) \le \mathcal{L}(\phi^*_{UB}) \le \mathcal{L}_{UB}(\phi^*_{UB})$.
For the special case when the mean of the predicted rewards satisfies $\bar{r}(\phi) = \frac{1}{T}\sum_{t=1}^{T} r_t$, the first term of $\mathcal{L}(\phi)$ in Eqn. (5) evaluates to zero. The optimization of $\mathcal{L}(\phi)$ in this case reduces to minimizing the square error of the predictors at each time $t$. This setting is consistent with the principle of maximum entropy when the objective is to distribute an episodic return uniformly along the length of the trajectory.
Consider a single timestep $t$, and assume that there are enough samples (say, $n$) to 'learn' $\hat{r}_t(\phi)$. Then, at each timestep $t$, the goal is to solve the problem $\min_{\phi} \mathbb{E}\big[(r_t - \hat{r}_t(\phi))^2\big]$.
The squared loss above admits a bias-variance decomposition domingos2000unified; geman1992neural that is commonly interpreted as a tradeoff between the two terms. This underscores the insight that the complexity of the estimator (in terms of the dimension of the set containing the parameters $\phi$) should achieve an 'optimal' balance friedman2001elements; goodfellow2016deep. This is represented as a U-shaped curve for the total error, where bias decreases and variance increases with the complexity of the estimator. However, recent work has demonstrated that the variance of the prediction can also decrease with the complexity of the estimator belkin2019reconciling; neal2018modern.
In order to bound the variance of the predictor at each timestep, we first state a result from neal2018modern. We use this to provide a bound on $\mathcal{L}(\phi)$ when the estimators are unbiased at each timestep in Theorem 7.2.
Theorem 7.1.
neal2018modern Let $p$ be the dimension of the parameter space containing $\phi$. Assume that the parameters are initialized by a Gaussian as $\phi \sim \mathcal{N}(0, \frac{1}{p}I)$. Let $L$ be a Lipschitz constant associated to $\hat{r}_t(\phi)$. Then, for some constant $c$, the variance of the prediction satisfies $\mathrm{Var}\big(\hat{r}_t(\phi)\big) \le \frac{cL^2}{p}$.
Theorem 7.2.
Let $\bar{r}(\phi)$ denote the mean of the predicted rewards. Assume that there are $n$ samples to 'learn' $\hat{r}_t(\phi)$ at each timestep $t$. Then, for unbiased estimators $\hat{r}_t(\phi)$ with associated Lipschitz constants $L_t$,

$$\mathcal{L}(\phi) \le T\sum_{t=1}^{T} \frac{c L_t^2}{p} + \lambda\,\mathrm{Var}\big(\hat{r}_t(\phi)\big).$$

Proof. For unbiased estimators, the bias terms in Eqn. (7) vanish, and each variance term is bounded using Theorem 7.1. ∎
The assumption on estimators being unbiased is reasonable. Proposition 7.3 indicates that in the more general case (i.e. for an estimator that may not be unbiased), the prediction is concentrated around its mean with high probability.
Proposition 7.3.
wainwright2019high Under a Gaussian initialization of parameters as $\phi \sim \mathcal{N}(0, \frac{1}{p}I)$, at each timestep $t$, the following holds for all $\delta > 0$: $\mathbb{P}\big(|\hat{r}_t(\phi) - \mathbb{E}[\hat{r}_t(\phi)]| \ge \delta\big) \le 2\exp\big(-\frac{p\delta^2}{2L_t^2}\big)$.
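The concentration behavior in Proposition 7.3 can be illustrated with a Monte Carlo check on a linear (hence Lipschitz) predictor under an $\mathcal{N}(0, \frac{1}{p}I)$ initialization. The exponential bound used below is one standard form of Gaussian-Lipschitz concentration and is an assumption made for illustration; the predictor and all constants are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n_samples, delta = 256, 200_000, 2.0

# A linear predictor f(phi) = w . phi is Lipschitz with constant L = ||w||.
w = rng.standard_normal(p)
L = np.linalg.norm(w)

# Parameters initialized as phi ~ N(0, I/p).
phi = rng.standard_normal((n_samples, p)) / np.sqrt(p)
f = phi @ w

# Empirical tail probability vs. the sub-Gaussian bound 2 exp(-p d^2 / (2 L^2)).
empirical = np.mean(np.abs(f - f.mean()) >= delta)
bound = 2.0 * np.exp(-p * delta ** 2 / (2.0 * L ** 2))
assert empirical <= bound
```

With these settings the empirical tail mass sits well below the exponential bound, illustrating that the prediction concentrates around its mean.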
Theorem 7.2 indicates that the optimization of $\mathcal{L}(\phi)$ is equivalent to minimizing the square error of the predictors at each time $t$.
Appendix B: Detailed Task Descriptions
This Appendix gives a detailed description of the tasks on which we evaluate AREL. In each experiment, a reward is obtained by the agents only at the end of an episode. No reward is provided at other time steps.

Cooperative Push: This task has $N$ agents working together to move a large ball to a landmark. Agents are rewarded when the ball reaches the landmark. Each agent observes its position and velocity, the relative position of the target landmark and the large ball, and the relative positions of the nearest agents. We report results for $N = 3$, $N = 6$, and $N = 15$. At each time step, the distance between the agents and the ball, the distance between the ball and the landmark, and whether the agents touch the ball are recorded. These quantities, though, are not immediately revealed to the agents. Agents receive a reward only at the end of each episode.

Predator-Prey: This task has $N$ predators working together to capture preys. Landmarks impede the movement of the agents. Preys can move faster than predators, and predators obtain a positive reward when they collide with a prey. The prey agents are controlled by the environment. Each predator observes its position and velocity, the relative locations of the nearest landmarks, and the relative positions and velocities of the nearest preys and the nearest predators. We report results for $N = 3$, $N = 6$, and $N = 15$. At each time step, the distance between a prey and the closest predator, and whether a predator touches a prey, are recorded. These quantities, though, are not immediately revealed to the agents. The agents receive a reward only at the end of each episode.

Cooperative Navigation: This task has $N$ agents seeking to reach $N$ landmarks. The maximum reward is obtained when there is exactly one agent at each landmark. Agents are also penalized for colliding with each other. Each agent observes its position, velocity, and the relative locations of the nearest landmarks and agents. We report results for $N = 3$, $N = 6$, and $N = 15$. At each time step, the distance between an agent and the closest landmark, and whether an agent collides with other agents, are recorded. These quantities, though, are not immediately revealed to the agents. The agents receive a reward only at the end of each episode.

StarCraft: We use the SMAC benchmark from samvelyan19smac. The environment comprises two groups of army units, and units from one group (controlled by learning agents) collaborate to attack units from the other (controlled by handcrafted heuristics). Each learning agent controls one army unit. We report results for three maps: 2 Stalkers and 3 Zealots (2s3z); 1 Colossus, 3 Stalkers, and 5 Zealots (1c3s5z); and 3 Stalkers versus 5 Zealots (3s_vs_5z). In 2s3z and 1c3s5z, two groups of identical units are placed symmetrically on the map. In 3s_vs_5z, the learning agents control 3 Stalkers to attack 5 Zealots controlled by the StarCraft AI. In all maps, units can only observe other units if they are both alive and located within the sight range. The 2s3z and 1c3s5z maps comprise heterogeneous agents, since there are different types of units, while the 3s_vs_5z is a homogeneous map. In all our experiments, the default environment reward is delayed and revealed only at the end of an episode. The reader is referred to samvelyan19smac for a detailed description of the default rewards.
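The delayed-reward setup common to all of these tasks can be sketched as a thin wrapper around a gym-style environment: per-step rewards are accumulated internally and revealed as a single episodic reward at the final step, with all other steps returning zero. The class and method names below are illustrative assumptions, not code from the AREL codebase.

```python
class EpisodicRewardWrapper:
    """Delay all per-step rewards to the end of an episode.

    `env` is assumed to expose gym-style reset()/step() methods, where
    step() returns (obs, reward, done, info).
    """

    def __init__(self, env):
        self.env = env
        self._accumulated = 0.0

    def reset(self):
        self._accumulated = 0.0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._accumulated += reward
        if done:
            return obs, self._accumulated, done, info  # reveal episodic sum
        return obs, 0.0, done, info                    # hide per-step reward
```

Wrapping an environment whose per-step rewards are 1, 2, 3 over a three-step episode yields observed rewards 0, 0, 6.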
Appendix C: Implementation Details
All the results presented in this paper are averaged over 5 runs with different random seeds. We tested several values of the regularization parameter $\lambda$ in Eqn. (3) and selected the value that resulted in the best performance.
In order to make the agent-temporal attention module more expressive, we use a transformer architecture with multi-head attention vaswani2017attention for both agent and temporal attention. Specifically, in our experiments, the transformer architecture applies, in sequence: an attention layer, layer normalization, two feed-forward layers with ReLU activation, and another layer normalization. Before each layer normalization, residual connections are added.
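A minimal single-head sketch of the block order described above (attention, residual + layer norm, two-layer ReLU feed-forward net, residual + layer norm). The actual module uses multi-head attention and learned weights, so all dimensions and parameter shapes here are illustrative assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row (token) to zero mean and unit variance.
    mu = x.mean(-1, keepdims=True)
    sigma = x.std(-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def transformer_block(x, params):
    """One block: self-attention -> residual + layer norm ->
    two feed-forward layers with ReLU -> residual + layer norm.
    Single-head for brevity; `params` holds illustrative weights."""
    Wq, Wk, Wv, W1, W2 = params
    d = x.shape[-1]
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(d)) @ v        # scaled dot-product attention
    x = layer_norm(x + attn)                        # residual before layer norm
    ffn = np.maximum(x @ W1, 0.0) @ W2              # two layers with ReLU
    return layer_norm(x + ffn)                      # residual before layer norm
```

The block maps a sequence of $T$ feature vectors of dimension $d$ to an output of the same shape, so blocks can be stacked to any depth.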
When the dimension of the observation space is large, a single fully connected layer is applied to compress the observation before the attention module. The credit assignment block that produces redistributed rewards consists of two MLPs, each with a single hidden layer of 50 units. Neural networks for credit assignment are trained using Adam.
In Particle World, credit assignment networks are updated for a fixed number of batches after a fixed number of episodes; each batch contains fully unrolled episodes uniformly sampled from the trajectory experience buffer. The same update procedure is used in StarCraft, with a different update frequency and batch size.
During training and testing, the length of each episode in Particle World is kept fixed, except in Predator-Prey with the largest number of agents, where a longer episode length is used. In StarCraft, an episode is restricted to a maximum length that depends on the map (2s3z, 1c3s5z, or 3s_vs_5z). If both armies are alive at the end of the episode, we count it as a loss for the team of learning agents. An episode terminates after one army has been defeated, or the time limit has been reached.
In Particle World, we use the permutation invariant critic (PIC) based on MADDPG from liu2020pic as the base reinforcement learning algorithm. The code is based on the implementation available at https://github.com/IouJenLiu/PIC. Following MADDPG lowe2017multi, the actor policy is parameterized by a two-layer MLP with 128 hidden units per layer and ReLU activation. The permutation invariant critic is a two-layer graph convolution net with 128 hidden units per layer, max pooling at the top, and ReLU activation. Learning rates for the actor and critic are 0.01, and are linearly decreased to zero at the end of training. Trajectories of the initial episodes are sampled randomly to fill the experience buffer. During training, uniform noise was added for exploration during action selection.

In StarCraft, we use QMIX rashid2018qmix as the base algorithm. The QMIX code is based on the implementation from https://github.com/starrysky6688/StarCraft. In this implementation, all agent networks share a deep recurrent Q-network whose recurrent layer comprises a GRU with a 64-dimensional hidden state, with a fully connected layer before and after. Trajectories of the initial episodes are sampled randomly to fill the experience buffer. Target networks are updated every 200 training episodes. QMIX is trained using RMSprop. Throughout training, $\epsilon$-greedy exploration is adopted, with $\epsilon$ annealed linearly from 1.0 to 0.05 over 50k time steps and kept constant for the rest of learning.

The experiments in this paper require computational resources to train the attention modules of AREL in addition to those needed to train deep RL algorithms. Using these resources might result in higher energy consumption, especially as the number of agents grows. This is a potential limitation of the methods studied in this paper. However, we believe that AREL partially addresses this concern by sharing certain modules among all agents in order to improve scalability.
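The $\epsilon$-greedy annealing schedule described above can be written as a small helper. This is a straightforward reading of the stated schedule (linear decay from 1.0 to 0.05 over 50k steps, then constant), not code from the referenced implementation.

```python
def epsilon_schedule(step, eps_start=1.0, eps_end=0.05, anneal_steps=50_000):
    """Linear epsilon-greedy annealing: decay from eps_start to eps_end
    over anneal_steps time steps, then hold eps_end constant."""
    if step >= anneal_steps:
        return eps_end
    frac = step / anneal_steps
    return eps_start + frac * (eps_end - eps_start)
```

For example, `epsilon_schedule(0)` returns 1.0, `epsilon_schedule(25_000)` returns 0.525, and any step beyond 50k returns 0.05.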
We provide a description of our hardware resources below:
Hardware Configuration: All our experiments were carried out on a machine running Ubuntu^{®}, equipped with a multi-core Intel^{®} Xeon^{®} CPU and two NVIDIA^{®} GeForce^{®} RTX Ti graphics cards.
Appendix D: Additional Experimental Results
Table 1: Average reward over training ($R_{avg}$) and final reward at the end of training ($R_{final}$) for AREL and uniform agent attention (Uniform) on Cooperative Push (CP), Predator-Prey (PP), and Cooperative Navigation (CN).

No. of Agents | Task | $R_{avg}$: AREL | $R_{avg}$: Uniform | $R_{final}$: AREL | $R_{final}$: Uniform
N = 15 | CP | 1553.6 |  |  |
N = 15 | PP |  |  |  |
N = 15 | CN |  |  |  |
N = 6  | CP |  |  |  |
N = 6  | PP |  |  |  |
N = 6  | CN |  |  |  |
N = 3  | CP |  |  |  |
N = 3  | PP |  |  |  |
N = 3  | CN |  |  |  |
This Appendix presents additional experimental results carried out in the Particle World environment.
Figure 7 shows the results of our experiments for tasks in Particle World. In each case, AREL consistently allows agents to learn policies that result in higher average rewards compared to the other methods. This is a consequence of using an attention mechanism that enables decomposition of an episodic reward along the length of an episode, and that also characterizes the contributions of individual agents to the reward. The performances of the PIC baseline liu2020pic, RUDDER arjona2019rudder, and Sequence Modeling liu2019sequence can be explained similarly to the case presented in the main paper. Using a surrogate objective in IRCR gangwani2020learning results in comparable agent rewards in some cases in the Cooperative Navigation task, but the reward curves are unstable and have high variance.
Figure 8 shows the results of experiments on these tasks with fewer agents. The PIC baseline and RUDDER are unable to learn good policies, and IRCR results in lower rewards than AREL in two tasks. The performance of Sequence Modeling is comparable to AREL, which indicates that characterizing agent attention plays a smaller role when there are fewer agents.
Appendix E: Additional Ablations
This Appendix presents additional ablations that examine the impact of uniformly weighting the attention of agents and the number of agenttemporal attention blocks.
We evaluate the effect of removing the agent-attention block and uniformly weighting the attention of each agent. This is termed uniform agent attention (Uniform). We compare the average reward over the number of training episodes ($R_{avg}$) and the final agent reward at the end of training ($R_{final}$) obtained when using AREL and when using Uniform. The results of these experiments, presented in Table 1, indicate that $R_{avg}$ and $R_{final}$ are higher for AREL than for Uniform. This shows that the agent-attention block in AREL plays a crucial role in performing credit assignment effectively.
We examine the effect of the number of agent-temporal attention blocks (depth) on rewards in the Cooperative Push task in Figure 9. The depth has negligible impact on average rewards at the end of training. However, rewards during the early stages of training are lower, and have a larger variance, for some settings of the depth than for the others.
Appendix F: Effect of Choice of Regularization Loss
This Appendix examines the effect of the choice of the regularization loss term in Eqn. (3). The need for a regularization loss term arises from the possibility that there could be more than one choice of redistributed rewards that minimizes the regression loss alone. In the results in the main paper, we used the variance of the redistributed rewards as the regularization loss. This choice was motivated by the need to discourage the predicted redistributed rewards from being sparse, since sparse rewards might impede the learning of policies when provided as input to a MARL algorithm devlin2011theoretical. By adding a variance-based regularization, the total loss incorporates the possibility that not all intermediate states contribute equally to an episodic reward, while also resulting in redistributed rewards that are less sparse.
We compare the variance-based regularization loss with two other widely used choices of the regularization loss: the $L_1$- and $L_2$-based losses. The $L_1$-based regularization encourages learning sparse redistributed rewards, and the $L_2$-based regularization discourages learning a redistributed reward of large magnitude (i.e., 'spikes' in the redistributed reward). Specifically, we study:

$$\mathcal{L}_{var} = \mathcal{L}_{reg} + \lambda\,\mathrm{Var}(\hat{r}), \qquad \mathcal{L}_{1} = \mathcal{L}_{reg} + \lambda_1 \|\hat{r}\|_1, \qquad \mathcal{L}_{2} = \mathcal{L}_{reg} + \lambda_2 \|\hat{r}\|_2,$$

where $\mathcal{L}_{reg}$ and $\mathrm{Var}(\hat{r})$ are the regression loss and variance of the redistributed rewards as in Eqn. (3), and $\|\hat{r}\|_1$ ($\|\hat{r}\|_2$) is the $L_1$ norm ($L_2$ norm) of the redistributed reward.
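The three regularization choices can be compared on the same regression loss with a short sketch. The weight `coef` is an arbitrary illustrative value, not one of the paper's tuned coefficients.

```python
import numpy as np

def regularized_loss(r_hat, episodic_reward, kind, coef=0.1):
    """Regression loss plus one of the three regularizers from Appendix F.

    kind: "variance", "l1", or "l2"; coef is an illustrative weight.
    """
    r_hat = np.asarray(r_hat, dtype=float)
    regression = (episodic_reward - r_hat.sum()) ** 2
    if kind == "variance":
        penalty = r_hat.var()                 # penalizes non-uniform rewards
    elif kind == "l1":
        penalty = np.abs(r_hat).sum()         # encourages sparsity
    elif kind == "l2":
        penalty = np.sqrt((r_hat ** 2).sum()) # penalizes large 'spikes'
    else:
        raise ValueError(kind)
    return regression + coef * penalty
```

For two redistributions with the same sum, a uniform one and a single-spike one, the variance and $L_2$ penalties prefer the uniform redistribution, while the $L_1$ penalty cannot distinguish them (both have the same $L_1$ norm), consistent with the discussion above.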
We compare the three regularization loss functions on the tasks in Particle World. In each task, we calculate the normalized reward received by the agents during the final stage of training. For the $L_1$- and $L_2$-based regularization losses, we searched over a range of values for $\lambda_1$ and $\lambda_2$, and selected the values that resulted in the best performance. The graph in Figure 10 shows the average normalized reward. We observe that using a variance-based regularization loss results in agents obtaining the highest average rewards.
In particular, we observe that using the $L_1$-based regularization results in significantly smaller rewards. A possible reason is that the $L_1$-based regularization encourages learning a sparse redistributed reward, which hinders the learning of policies when provided as input to the MARL algorithm. Using the $L_2$-based regularization results in a comparable, albeit slightly lower, average agent reward compared to the variance-based regularization. This is reasonable, since both the variance-based and $L_2$-based regularizations result in less sparse predicted redistributed rewards.
Appendix G: Verification of QMIX Implementation
This Appendix demonstrates the correctness of the QMIX implementation that we use from https://github.com/starrysky6688/StarCraft. In the QMIX evaluation first reported in rashid2018qmix, rewards were not delayed. In our experiments, rewards are delayed and revealed only at the end of an episode. In such a scenario, QMIX may not be able to perform long-term credit assignment, which explains the difference in performance between the default and delayed reward cases. We observe that using redistributed rewards from AREL as input to QMIX results in improved performance compared to using QMIX alone when rewards from the environment are delayed (Figure 3 in the main paper). Using the default, non-delayed rewards, we compare the performance of the QMIX implementation used in our experiments with the benchmark implementation from samvelyan19smac. Figure 11 shows that the test win rates in two StarCraft maps using both implementations are almost identical.