1. Introduction
Multiagent reinforcement learning (MARL) has shown exceptional results in many real-life applications, such as multiplayer games vinyals2017starcraft; lowe2017maddpg, traffic control kuyer2008traffic, and social dilemmas leibo2017multi_social. A suitable control policy is extremely important in multiagent systems (MASs). One choice is to treat the MAS as a single agent and adopt a centralized control policy han2019grid; jiang2018atoc; however, this approach is constrained by poor scalability for high-dimensional state and action spaces. In contrast, decentralized control sunehag2018vdn; rashid2018qmix; du2019liir; lowe2017maddpg; iqbal2019maac allows agents to make decisions independently, but struggles to enable coordinated behaviors on complex tasks. Taking traffic flow as an example, when multiple vehicles try to cross an intersection without traffic lights, the traffic will most likely become congested if all vehicles act simultaneously without a rational sequence. This problem may be solved, however, if those vehicles move in an orderly way based on some coordination structure. This example shows that it is imperative to improve on a fully decentralized decision-making process, and one way to alleviate the above issue is to develop a coordinated control policy that produces cooperative behaviors.
Several approaches have been reported to address the problem of action coordination. BiAC zhang2020bilevel mainly focuses on coordination of the asynchronous decisions of two agents. The multiagent rollout algorithm bertsekas2019multiagent_rollout provides a theoretical view of executing a local rollout with some coordinating information, but is limited to an agent-by-agent decision dependency structure. Although these works investigate the action execution order of two-agent and multiagent systems, they are still insufficient to characterize the complicated and dynamic underlying decision dependency structure of a general multiagent system. Moreover, we assert that representing the underlying decision dependency structure and using it to control action execution is essential to improving coordination.
In this work, we propose a graph-based coordination strategy (GCS) that learns coordinated behaviors by factorizing the joint team policy into a graph generator and a graph-based coordinated policy. The former aims to learn an action coordination graph (ACG) that properly represents the decision dependency. The latter further coordinates the dependent behaviors among agents by exploiting the underlying decision dependency. We train the graph generator and the graph-based coordinated policy simultaneously to maximize the discounted return. For the ACG we employ directed acyclic graphs (DAGs), whose nodes represent agents and whose directed edges denote action dependencies of the associated agents. Moreover, we propose using DAGness-constrained and DAG depth-constrained optimization in the graph generator to balance efficiency and performance.
The contributions of this paper can be summarized as follows:


As far as we know, we are the first to introduce directed acyclic graphs to action coordination, dynamically representing the underlying decision dependencies of MAS.

We propose a DAGness-constrained and DAG depth-constrained optimization in the graph generator, achieving a tradeoff between decision-making efficiency and performance.

Empirical evaluations on several challenging MARL benchmarks (Collaborative Gaussian Squeeze, Cooperative Navigation, and Google Research Football) show that our method can achieve superior performance and obtain meaningful results consistent with intuitive expectations.
2. Related Work
Deep reinforcement learning has been successfully applied to addressing complex decision problems (silver2017alphagozero; schrittwieser2020muzero; zhang2021population; wang2021ordering; meng2021offline). Due to the widespread existence of multiagent tasks, MARL has attracted increasing attention, and learning appropriate control policies is important to obtain the maximum cumulative discounted return. Based on the structures of their execution schemes, we classify the existing approaches into three categories.
First, a fully independent execution scheme allows agents to determine actions according to their individual policies without any interaction. One line of research, such as IQL tan1993iql, VDN (sunehag2018vdn), QMIX (rashid2018qmix), and QTRAN (son2019qtran), focuses on value-based methods, which assign each agent an independent policy for execution. Another line of research, including MAAC (iqbal2019maac), COMA (foerster2018coma), and LIIR (du2019liir), extends the actor-critic algorithm degris2012actorcritic to the multiagent case, where each actor represents an individual policy for an agent.
Second, the communication-based independent execution scheme is widely used; it allows agents to use extensive information in their individual decision making (busoniu2008comprehensive_survey). In this scheme, agents learn how to transmit and process informative messages during training. Agents then exchange the messages to determine their actions individually during independent execution. Representative methods foerster2016RIALDIAL; sukhbaatar2016CommNet; zhang2013coordinating; peng2017bicnet; jiang2018atoc; du2021flowcomm autonomously learn the communication protocols required to generate informative messages: these determine whom to communicate with and what messages to transmit for assisted decision making.
Third, the coordinated execution scheme, where agents develop their policies conditioned on other agents’ actions and make decisions in a coordinated manner, is important in MAS. Some methods implicitly model the coordinated behaviours from the perspective of a coordination graph. DCG bohmer2020dcg uses pairwise graphs to propagate beliefs for joint decisions, while DICG li2021dicg focuses on generating a coordination graph with soft edge weights for message passing. DGN jiang2018dgn uses a graph attention network as an embedding extractor to assist in the decision making. Furthermore, some methods have been proposed to explicitly model the coordinated behaviours in order, such as BiAC (zhang2020bilevel) and multiagent rollout (bertsekas2019multiagent_rollout), which propose utilizing two-agent and agent-by-agent dependency structures, respectively, to help agents make decisions in order and promote action coordination. In a similar but essentially distinct way, we introduce a mechanism to learn the underlying DAG structure that represents the decision dependency among agents.
Moreover, the generation of the DAGs is an essential part of our work. Recently, some continuous optimization approaches zheng2018dagsnotears; yu2019daggnn; yu2020dagsnocurl; lachapelle2020gradientdag have been proposed to recover the DAGs through structure learning. The method zhu2019dagRL, closely related to our work, uses reinforcement learning as its search strategy to maximize a predefined score function. Borrowing from an idea of zhu2019dagRL, which obtains the DAG using reinforcement learning, we construct a graph generator module to generate the DAG structure as an action coordination graph and regard the extrinsic reward as the incentives to jointly train with MARL tasks.
3. Problem Setup
MMDPs
Cooperative multiagent problems can be modeled as multiagent Markov decision processes (MMDPs) boutilier1996planning, which can be expressed as a tuple $\langle \mathcal{N}, \mathcal{S}, \mathcal{A}, P, r, \gamma \rangle$. $\mathcal{N} = \{1, \dots, n\}$ is the set of players, $\mathcal{S}$ is the global state space, and $\mathcal{A}_i$ denotes the action space for the $i$-th player. We label $a = (a_1, \dots, a_n) \in \mathcal{A}$ the joint actions for all players. Intuitively, agent $i$ will select an individual action $a_i \in \mathcal{A}_i$ to perform and execute it. $P$ is the transition dynamics, and $P(s' \mid s, a)$ gives the distribution of the next state $s'$ at state $s$ taking action $a$. All agents share the same reward function $r(s, a)$. $\gamma$ denotes a discount factor, and $\tau$ denotes the trajectory induced by the policy $\pi$. All the agents coordinate together to maximize the cumulative discounted return $\mathbb{E}_{\tau \sim \pi}\left[\sum_{t} \gamma^{t} r(s_t, a_t)\right]$.
Factored MMDPs
We formalize our problem based on MMDPs as factored MMDPs. Unlike in MMDPs, where all actions are taken simultaneously and do not depend on each other, we impose a hierarchical order on the joint action based on the learned DAG structure $G$, called an action coordination graph (ACG). The adjacency matrix $A$ representing the ACG encodes the decision dependency produced by the graph generator $\rho$. With $G$, we can define $\pi_i(a_i \mid o_i, a_{\mathrm{pa}(i)})$ as the graph-based coordinated policy for the $i$-th player, where $o_i$ is the observation of the $i$-th player, $\mathrm{pa}(i)$ are the parents of agent $i$, and $a_{\mathrm{pa}(i)}$ are the actions taken by the parents, whose order is generated from $G$. Note that a fully decentralized policy is a special case of our graph-based coordinated policy in which none of the agents has parents.
Figure 1 gives an example of the DAG and the relationships between nodes. The nodes in correspond to the agents in the MAS, and the parent-child relationships represent the hierarchical decision dependencies among agents. For example, the case that the node in the graph has two parents illustrates that the action taken by the agent is constrained by and .
Now, the graph-based coordination strategy is factorized as:
(1) 
where is the graph generator, and is the graph-based coordinated policy.
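To make the dependency structure concrete, the sketch below shows how a binary adjacency matrix of an ACG can induce an action execution order. It is an illustration under stated assumptions, not the authors' implementation: the convention that `adj[i, j] == 1` means "agent i is a parent of agent j", the function names `execution_order` and `act_in_order`, and the `policy` callback signature are all hypothetical.

```python
import numpy as np

def execution_order(adj):
    """Return a topological order of agents for a DAG adjacency matrix.

    Assumed convention: adj[i, j] == 1 means agent i is a parent of agent j.
    """
    adj = np.asarray(adj)
    n = adj.shape[0]
    in_degree = adj.sum(axis=0).astype(int)
    ready = [i for i in range(n) if in_degree[i] == 0]  # agents without parents
    order = []
    while ready:
        i = ready.pop(0)
        order.append(i)
        for j in np.flatnonzero(adj[i]):  # children of agent i
            in_degree[j] -= 1
            if in_degree[j] == 0:
                ready.append(int(j))
    if len(order) != n:
        raise ValueError("graph contains a cycle, so no execution order exists")
    return order

def act_in_order(adj, observations, policy):
    """Let each agent act after its parents.

    policy(i, obs_i, parent_actions) -> action is a hypothetical callback;
    parent_actions maps each parent index to the action it already chose.
    """
    adj = np.asarray(adj)
    actions = {}
    for i in execution_order(adj):
        parents = [int(p) for p in np.flatnonzero(adj[:, i])]
        parent_actions = {p: actions[p] for p in parents}
        actions[i] = policy(i, observations[i], parent_actions)
    return actions
```

With an all-zero adjacency matrix every agent is parentless, which recovers the fully decentralized special case mentioned above.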
4. Methodology
The overall goal is to maximize the cumulative return, denoted as:
(2) 
Now we further elaborate on the derivation of the graph-based coordinated policy and the graph generator , respectively.
4.1. Graph-based Coordinated Policy
Given a known graph generator (elaborated in Section 4.2), we have to represent the underlying decision dependency. Based on it, we can denote the decision policy as , called the graph-based coordinated policy. We explore a graph-based coordinated policy that obtains the final joint action as follows.
The graph-based coordinated policy for agents can be parameterized by . Correspondingly, the gradient of the expected return for agent is expressed as:
(4) 
By applying the mini-batch technique to off-policy training, the gradient can be approximately estimated as:
(5) 
where is the experience replay buffer, recording experiences of all agents. Moreover, the centralized action-value function can be updated as:
(6) 
where is the learning target and is the target network parameterized by .
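For orientation, a generic actor-critic form that matches the description above is sketched below. The symbols $\theta_i$ (actor parameters), $\phi$ and $\phi'$ (critic and target-critic parameters), $\mathcal{D}$ (replay buffer), $y$ (learning target), and $\mathrm{pa}(i)$ (parents of agent $i$) are assumed notation, and the expressions are the standard off-policy stochastic policy gradient and temporal-difference loss rather than a verified reproduction of Equations (4) to (6).

```latex
% Mini-batch policy gradient for agent i (assumed generic form)
\nabla_{\theta_i} J(\theta_i) \approx
  \mathbb{E}_{(s, a) \sim \mathcal{D}}
  \left[ \nabla_{\theta_i} \log \pi_{\theta_i}\!\left(a_i \mid o_i, a_{\mathrm{pa}(i)}\right) Q_{\phi}(s, a) \right]

% Centralized critic update with a target network (assumed generic form)
\mathcal{L}(\phi) =
  \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}
  \left[ \left( Q_{\phi}(s, a) - y \right)^{2} \right],
\qquad
y = r + \gamma \, Q_{\phi'}(s', a')
```

Here $a'$ would be the joint action selected at $s'$ by the current coordinated policy, with each agent conditioning on its parents' actions.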
During the training process, the graph-based coordinated policy and the graph generator are updated iteratively. We will describe how to find the graph generator under a given policy .
4.2. Graph Generator
The graph generator aims to generate the DAG to define the decision dependency among agents. We will introduce it in detail from three aspects: (a) DAGness constraint, (b) DAG Depth constraint, and (c) optimization objective.
DAGness constraint
The acyclicity constraint is an important issue in our problem setting. In this work, we also use penalty terms, as in zheng2018dagsnotears; lachapelle2020gradientdag; zhu2019dagRL, to ensure acyclicity. The result in zheng2018dagsnotears shows that a directed graph with binary adjacency matrix $A$ is acyclic if and only if:
$\operatorname{tr}\left[e^{A \circ A}\right] - n = 0$ (7)
where $e^{A \circ A}$ is the matrix exponential of $A \circ A$, the Hadamard product $\circ$ guarantees the non-negativity, and $n$ is the number of nodes in the DAG. The trace $\operatorname{tr}$ of a matrix is defined as the sum of its diagonal elements zheng2018dagsnotears. The constraint function should satisfy two requirements: (a) its derivatives are computable, and (b) it can serve as a measure of DAG-ness.
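The acyclicity measure of zheng2018dagsnotears can be evaluated numerically as in the following sketch. It uses only NumPy and a truncated power series for the matrix exponential; the function name `dagness` and the truncation length `terms` are illustrative assumptions, and a library routine for the matrix exponential could be substituted.

```python
import numpy as np

def dagness(adj, terms=30):
    """Return tr(exp(adj * adj)) - n, where * is the element-wise product.

    The value is zero for an acyclic graph and positive when the graph has a cycle.
    exp is approximated by the first `terms` terms of its power series.
    """
    m = np.asarray(adj, dtype=float)
    m = m * m  # element-wise square keeps all entries non-negative
    n = m.shape[0]
    power = np.eye(n)
    trace_exp = float(n)  # trace of the identity, the k = 0 term of the series
    for k in range(1, terms + 1):
        power = power @ m / k  # power now equals m^k / k!
        trace_exp += np.trace(power)
    return trace_exp - n
```

For a chain 1 → 2 → 3 every power of the matrix has a zero diagonal, so the measure is exactly zero, whereas a two-node cycle contributes closed walks of every even length and yields a strictly positive value.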
DAG Depth constraint
Moreover, taking the tradeoff between efficiency and performance into account, we claim that the maximum depth of graph structure should be adjustable over different tasks. Therefore, we propose an alternative constraint to control the hierarchy of the generated DAGs as follows.
Definition 1.
A square matrix $M$ is a nilpotent matrix Algebra75 if $M^{k} = \mathbf{0}$ and $M^{k-1} \neq \mathbf{0}$ for some positive integer $k$, where $\mathbf{0}$ is the zero matrix and $M$ is called nilpotent of index $k$.
Proposition 2.
Let $A$ be an adjacency matrix for a directed acyclic graph. Then the maximal path length between any two nodes $i$ and $j$ is $k - 1$ if $A$ is nilpotent of index $k$.
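The link between nilpotency and depth can be checked directly: if the smallest vanishing power of the adjacency matrix is the $k$-th, then no walk of $k$ edges exists and the longest path has $k - 1$ edges. The sketch below computes both quantities; the function names are hypothetical, and entries are clipped to 0/1 after each product so that only reachability (not walk counts) is tracked.

```python
import numpy as np

def nilpotent_index(adj):
    """Smallest k >= 1 with adj^k == 0, or None if adj is not nilpotent (the graph has a cycle)."""
    a = np.asarray(adj, dtype=np.int64)
    n = a.shape[0]
    power = np.eye(n, dtype=np.int64)
    for k in range(1, n + 1):
        # Clip to {0, 1}: we only need to know whether some walk of length k exists.
        power = np.minimum(power @ a, 1)
        if not power.any():
            return k
    return None  # a DAG on n nodes always satisfies adj^n == 0

def longest_path_length(adj):
    """Number of edges on the longest path of a DAG, via the nilpotent index."""
    k = nilpotent_index(adj)
    if k is None:
        raise ValueError("adjacency matrix is not nilpotent, so the graph is not a DAG")
    return k - 1
```

An edgeless graph has index 1 and depth 0, which corresponds to fully simultaneous decisions; a chain over all agents gives the maximum depth.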
Optimization objective.
Based on the foregoing, we can optimize parameterized by with the maximal length by:
(8) 
where denotes the weight matrix generated from the graph generator . Then the weight matrix is modeled as a Bernoulli distribution, from which the binary adjacency matrix is sampled. Here, we use the constraints of the weight matrix and to approximate those of the adjacency matrix and due to the consistency of representing the graph structure. With this approximation, we restate as:
(9)
Fixing the graph-based coordinated policy , we approximate the graph generator as follows. We augment the original problem shown in Equation (8) with a quadratic penalty using the augmented Lagrangian technique bertsekas1997nonlinear:
(10) 
with the penalty .
Next, we convert Equation (10) to an unconstrained Lagrangian function:
(11) 
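As a reminder of the general technique, the textbook augmented Lagrangian for an equality-constrained problem is sketched below. The symbols $f$, $h$, $\lambda$, and $c$ are generic placeholders rather than the paper's notation: here $f$ would stand for the negative expected return and $h$ for a constraint such as the DAG-ness or depth measure, with one multiplier per constraint.

```latex
% Generic problem: \min_{\theta} f(\theta) \quad \text{s.t.} \quad h(\theta) = 0

% Augmented Lagrangian with multiplier \lambda and penalty coefficient c > 0
L_{c}(\theta, \lambda) = f(\theta) + \lambda \, h(\theta) + \frac{c}{2} \, h(\theta)^{2}

% Typical dual update after (approximately) minimizing L_c over \theta
\lambda \leftarrow \lambda + c \, h(\theta)
```

The quadratic term penalizes constraint violation directly, while the multiplier update lets the constraint be satisfied without driving $c$ to infinity.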
Proposition 3.
The gradient for Equation (11) to optimize the coordination graph generation policy can be derived as follows:
In Proposition 3, we remark that, by accounting for the influence of various decision dependencies on the reinforcement learning task, we can obtain the underlying graph structure that best responds to the MARL task.
4.3. Implementation Details
As shown in Figure 2, the proposed framework includes the graph-based coordinated policy and the graph generator, which are elaborated below.
Graph-based Coordinated Policy
The graph-based coordinated policy can be obtained from the standard multiagent actor-critic framework. For the policy of the actor, we use a recurrent neural network (RNN) with the stochastic policy gradient to model the action distributions. The critic, which evaluates the joint actions made by the actors, is a three-layer feedforward neural network with ReLU activations, denoted as .
Graph Generator
As shown in Figure 2, the graph generator adopts an encoder-decoder module to find the ACG. The GAT-based encoder can model the interplay of agents and extract further latent representations. The MLP-based decoder is used to recover the pairwise relationship between agents to generate an ACG. The graph generator takes the local observations of the agents as input and outputs the ACG to obtain the decision dependency for the decision-making process of the graph-based coordinated policy, elaborated as follows.
The graph generator contains two submodules. First, we use the graph attention network (GAT) velivckovic2018gat as the attention-based encoder to extract the latent information for the graph structure generation. As an initial step, a simple feature is extracted by a multilayer perceptron (MLP):
(12) 
Because GAT is sufficiently expressive, we use it to extract further latent information from the simple feature. We compute the importance coefficients through the attention mechanism:
(13) 
where is a learnable weight matrix, and denote the dimensions of the input vector and the latent vector, respectively, and indexes other agents except the agent .
Then multi-head attention is used to stabilize the learning process of self-attention, and the final latent feature is as follows:
(14) 
where is the number of attention heads. are importance coefficients computed by the kth attention mechanism, and is the corresponding weight matrix.
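As a rough illustration of the attention step, the sketch below computes single-head importance coefficients in the style of the original GAT formulation: a shared linear map, a LeakyReLU-scored pairwise logit, and a softmax over the other agents. The function name, argument names, the LeakyReLU slope, and the choice to return the plain weighted sum are assumptions; how the heads are combined in the paper is not reproduced here.

```python
import numpy as np

def gat_attention(h, W, a, negative_slope=0.2):
    """Single-head GAT-style attention over n >= 2 agents.

    h: (n, d_in) input features; W: (d_in, d_out) shared weight matrix;
    a: (2 * d_out,) attention vector. Returns (alpha, z), where alpha[i, j]
    is the importance of agent j to agent i (self excluded) and z = alpha @ (h W).
    """
    wh = h @ W
    d_out = wh.shape[1]
    # a . [wh_i || wh_j] splits into a source part and a destination part.
    src = wh @ a[:d_out]
    dst = wh @ a[d_out:]
    e = src[:, None] + dst[None, :]
    e = np.where(e > 0, e, negative_slope * e)  # LeakyReLU
    np.fill_diagonal(e, -np.inf)                # each agent attends only to the others
    e = e - e.max(axis=1, keepdims=True)        # numerical stability for the softmax
    alpha = np.exp(e)
    alpha /= alpha.sum(axis=1, keepdims=True)
    return alpha, alpha @ wh
```

A multi-head variant would run several such heads with separate `W` and `a` and then concatenate or average the resulting features.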
Another submodule in the graph generator is the decoder, which generates a weight matrix used to sample the graph structure. Since the GAT-based encoder already provides sufficiently expressive features among agents, a single-layer decoder is enough to construct the pairwise relationship between the encoder outputs and find a better DAG structure for the decision policy.
(15) 
where and are agents’ higher-level representations from two encoder outputs, are trainable parameters, and are the hidden dimension and encoder output dimension, respectively.
Moreover, the logistic sigmoid function generates the probability for constructing the Bernoulli distribution from which the binary adjacency matrix is sampled. The binary adjacency matrix forms a directed graph corresponding to the ACG . Here, we denote this graph generation process as .
Parameter Setting.
In the graph generator, the number of attention heads in the GAT encoder is set to 8, the number of stacked attentional layers is set to 4, and the number of hidden units in the MLP is set to 64. In the graph-based coordinated policy, the actor-critic architecture is adopted. The actor consists of a recurrent layer, a GRU with a 64-dimensional hidden state, with a fully connected layer before and after it. The critic is a two-layer MLP with ReLU activation.
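The sampling step of the graph generator described above (a sigmoid that turns decoder outputs into edge probabilities, followed by Bernoulli sampling of a binary adjacency matrix) can be sketched as follows. The function name `sample_adjacency`, the `logits` argument standing in for the decoder output, and the removal of self-loops are assumptions made for illustration.

```python
import numpy as np

def sample_adjacency(logits, rng):
    """Sample a binary adjacency matrix from edge logits via sigmoid + Bernoulli.

    logits: (n, n) array of unnormalized edge scores (hypothetical decoder output).
    rng: a numpy.random.Generator. Returns (adj, probs).
    """
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))  # logistic sigmoid
    np.fill_diagonal(probs, 0.0)  # assumption: an agent never depends on itself
    adj = (rng.random(probs.shape) < probs).astype(int)  # independent Bernoulli draws
    return adj, probs
```

A matrix sampled this way is only a directed graph; nothing in the sampling itself rules out cycles, which is why the DAG-ness and depth constraints are imposed during optimization.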
4.4. Algorithm Description
The main procedures are summarized in Algorithm 1, where , , and are optimized. Our ultimate goal is to obtain the graph-based coordinated policy . The graph generator is an intermediate module used to find a good graph structure that guides the decision-making sequence among agents to achieve a high degree of multiagent coordination. The graph generator is a pluggable module that can be replaced by other algorithms for solving the DAGs. Note that DAGs are necessary because we need an execution structure that determines a clear sequence of actions, and therefore it must be directed and acyclic. The policy solver is a universal module, which, in general, can be chosen from a diverse set of cooperative MARL algorithms lowe2017maddpg; yu2021mappo; iqbal2019maac.
5. Experiments
We evaluate the effectiveness of our algorithm on three different environments: Collaborative Gaussian Squeeze son2019qtran (the MGS environment is at https://github.com/Sonkyunghwan/QTRAN), Cooperative Navigation lowe2017maddpg (the code is at https://github.com/openai/multiagent-particle-envs), and Google Research Football kurach2020google (the code is at https://github.com/google-research/football).
5.1. Experimental Setting
Collaborative Gaussian Squeeze (CGS)
As an extension of Multi-domain Gaussian Squeeze (MGS) son2019qtran, Collaborative Gaussian Squeeze is a challenging environment for evaluating coordination. In MGS, there exist domains in the system. The system contains agents, and each agent can take actions within range of . The prior given by the environment represents the unit-level resource for each agent . The total amount of resources mobilized by all agents is denoted as . The goal is to maximize the joint reward . In our settings, we modify the original MGS to a collaborative task. We use two domains . The computation of the joint reward is as follows:
(16) 
Figure 3 shows the reward curves for this setting. According to the above definition, the reward is maximized when the resources of all the agents reach or . In this environment, the social welfare depends on the intensity of collaboration.
Cooperative Navigation (CN)
Cooperative Navigation is a classic scenario implemented in the multiagent particle world. This scenario has agents and landmarks, which are initialized with random locations at the beginning of each episode. The objective of the agents is to cooperate to cover all landmarks by controlling their velocities with directions. The action set includes five actions: [up, down, left, right, stop]. Each agent can only observe its velocity, position, and displacement from other agents and the landmarks. The shared reward is the negative sum of displacements between each landmark and its nearest agent. Agents must also avoid collisions, since each agent is penalized with a ‘’ shared reward for every collision with other agents. We set the length of each episode as 25 timesteps. Therefore, the agents have to learn to navigate toward the landmarks cooperatively to cover all positions quickly and accurately. Figure 3 shows a classic scenario in Cooperative Navigation with .
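The shared reward described above can be written down in a few lines. The sketch below is an assumption-laden illustration rather than the environment's code: the function name, the `collision_penalty` and `agent_radius` parameters (whose values are not given here), and the choice to count each colliding pair once are all hypothetical.

```python
import numpy as np

def cooperative_navigation_reward(agent_pos, landmark_pos, collision_penalty, agent_radius):
    """Shared reward: negative sum of landmark-to-nearest-agent distances minus collision penalties.

    agent_pos: (n_agents, 2) positions; landmark_pos: (n_landmarks, 2) positions.
    """
    agent_pos = np.asarray(agent_pos, dtype=float)
    landmark_pos = np.asarray(landmark_pos, dtype=float)
    # dist[l, a] is the distance between landmark l and agent a.
    dist = np.linalg.norm(landmark_pos[:, None, :] - agent_pos[None, :, :], axis=-1)
    coverage = -dist.min(axis=1).sum()
    # Two agents collide when their centers are closer than two radii (assumed rule).
    pair = np.linalg.norm(agent_pos[:, None, :] - agent_pos[None, :, :], axis=-1)
    i, j = np.triu_indices(len(agent_pos), k=1)
    collisions = int((pair[i, j] < 2 * agent_radius).sum())
    return coverage - collision_penalty * collisions
```

Because the reward is shared, an agent that already covers a landmark gains nothing by moving toward another agent's target, which is what makes an explicit decision order useful in this task.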
Google Research Football (GRF)
GRF is a realistically complicated and dynamic simulation environment without any clearly defined behavior abstraction, which is a suitable testbed for studying multiagent decision making and coordination. In GRF, we use the Floats wrappers to represent the state. The Floats representation contains a 115-dimensional vector that summarizes the information, such as the ball position and possession, coordinates of all players, and the game state. Each player has 19 actions to control, including the standard move actions and different ball-kicking techniques. The rewards include the reward and the reward, which is the shaped reward that specifically addresses the sparsity of . Detailed descriptions are shown in Appendix B.
Baselines
We compare our results with several baselines as follows. VDN and QMIX are state-of-the-art value factorization approaches that follow the regime of centralized training and decentralized execution; they belong to the class of fully decentralized control policies, with which it is difficult to obtain coordinated behaviours. DCG uses fully connected graphs for belief propagation, which only allows message passing between paired agents. DGN aims at learning abstract representations to make simultaneous decisions.

VDN sunehag2018vdn: Value Decomposition Network (VDN) imposes the structural constraint of additivity in the factorization, which represents as a sum of individual Q-values.

QMIX rashid2018qmix: This was proposed to overcome the limitation that VDN uses a linear decomposition and ignores any extra state information available during training. QMIX enforces to be monotonic in the individual Q-values .

DCG bohmer2020dcg: Deep Coordination Graph (DCG) factorizes the joint value function of all agents according to a coordination graph into payoffs between pairs of agents, which coordinates the actions between agents explicitly.

DGN jiang2018dgn: DGN relies on a graph convolutional network to model the relation representations, implicitly modeling the action coordination.
5.2. Main Results
Here we report the experimental results from the setup described in Section 5.1. Performance validation indicates the superiority of introducing ACG to multiagent systems.
Collaborative Gaussian Squeeze
In this game, there are 10 agents, and the maximum episode length is also set to 10. To emphasize the feasibility and effectiveness of our proposed framework, we first conduct the experiment on CGS. We report the average episode rewards over 10 random runs, shown in Figure 4. Our proposed algorithm GCS outperforms the baseline methods by a large margin. It can be seen that our algorithm handles the collaborative problem well; the action coordination graph facilitates behavior learning to promote cooperation. Next, we will verify the effectiveness of our algorithm on more complicated environments.
Cooperative Navigation
As shown in Figure 3, Cooperative Navigation is a fully cooperative environment in which agents (circles) must cooperate to reach landmarks (crosses) with as few collisions as possible. We conduct experiments in Cooperative Navigation with and with . Figures 5 and 5 show the learning curve comparisons for the two cases. We report the average episode reward at a training step, averaged over 10 independent running seeds.
First, our algorithm outperforms most baseline algorithms by giving higher converged rewards. We consider that our performance improvement results from the action coordination graph representing the action dependency for better coordination. Moreover, our algorithm converges fast during training, which is possibly because the hierarchical decision policies can efficiently induce coordination among agents in this cooperative setting. In addition, our algorithm can achieve a lower variance than those baselines, which indicates that the learned action coordination graph can reduce the uncertainty in decision making to facilitate cooperative behaviors among agents.
In contrast, VDN and QMIX take actions simultaneously without considering the action dependency among agents. They are faster during training, but that is of no benefit in inducing cooperation among agents. Additionally, DCG exhibits mediocre performance in this task. We believe that DCG considers only pairwise relationships between agents, which may disturb the overall balance in the system. In this case, DGN shows good performance consistent with ours, which shows that implicit action coordination modeling is also effective in pure cooperative settings.
In order to clearly show the effect of the action coordination graph in the decision-making process, we give a visualization at a time step in an episode of Cooperative Navigation, as shown in Figure 6. Figure 6 shows the topological structure of the learned ACG, and it suggests that the decision order of the agents is [3, 4, 2, 1]. The parent sets of the four agents are denoted as , , , and , respectively. First, agent 3 decides to move to the bottom-left landmark; then agent 4 takes the best response and decides to move to the bottom-right in order to avoid conflict with agent 3. After agent 2 knows the decisions of the previous two agents, it chooses the closer upper-left landmark as its target instead of the bottom-right. Finally, agent 1 moves after observing the decisions of agents 2 and 4. This visualization shows how the agents’ joint actions, derived from the ACG that represents the underlying decision dependencies, achieve efficiency.
Google Research Football (GRF)
To evaluate our method in complicated and dynamic environments, we conduct several experiments on GRF, as shown in Figure 7. In the 3vs2 scenario, three of our players try to score from the edge of the box, and the opponent team contains one defender and one keeper. In the 3vs6 scenario, there are six opponent players on the pitch to play against three of our players. In the 5vs5 scenario, each team has a keeper, an offensive player, and three defenders. Here, we report the average episode reward at a training step for each scenario, averaged over 10 independent running seeds.
As can be seen in Figure 7, our algorithm always obtains higher rewards than all the baselines in the different scenarios of GRF. This indicates that our method is quite general in complicated and dynamic environments. Moreover, this performance improvement in GRF demonstrates that our approach is good at handling stochasticity and sparse rewards. This is because the learned ACG with the decision dependency is an efficient way to mitigate uncertainty and induce cooperation among agents. Taking the 3vs6 scenario for further demonstration, the training curve of QMIX fluctuates and is unstable, indicating this method’s inability to adapt to the dynamically complicated scenario with multiple opponent players. Here, DCG shows a trend of non-convergence, but our algorithm steadily rises to converge and obtains the highest reward, which demonstrates the modeling advantage of our approach for handling complicated tasks.
5.3. Results on DAG Depth
We observe that the inference efficiency and the performance gains are inversely affected by the ACG’s depth. Therefore, we aim to find the suitable depth that best balances the tradeoff. As shown in Figure 8, to validate the impact of the depth, we test our method on the Collaborative Gaussian Squeeze with different depth sizes of the learned ACG. In this figure, the horizontal axis is the depth, and the vertical axis is the testing episode reward averaged over five seeds. We test episodes for each seed and obtain an average episode reward.
As the depth of the ACG increases, the training time will increase correspondingly. However, the performance growth will gradually slow down, and performance degradation may even occur. In this case, is the optimal depth that balances the computational burden and the performance gains. As for the reason for the performance degradation with and , we speculate that as the hierarchy of action dependency deepens, the complexity of the hypothesis space for the inference will increase, and it becomes harder to learn the optimal policy. In summary, a higher dependency level of the graph structure can provide more decision information to promote coordination and facilitate performance. However, this higher dependency level leads to a lower efficiency of inference, as the leaf node on the ACG needs to wait for all the parent nodes’ decisions before it makes its own decision.
5.4. Results on Dropping Edges
In this section, we verify the stability of the learned ACG in our proposed algorithm. Given a trained model with on Collaborative Gaussian Squeeze, we evaluate 1000 episodes and count the average number of edges of ACG, denoted as . A fair comparison requires the same depth and number of edges during training and evaluation. Therefore, we generate a fixed DAG structure , see Appendix C, whose depth is and whose number of edges is , as the baseline to compare with our algorithm.
In Figure 9, the horizontal axis represents the number of edges dropped from the generated or fixed graph structure, denoted as , and the vertical axis is the testing average episode reward over 1000 episodes.
“inf” denotes dropping all the edges. The box plots visually show the distribution of the testing episode rewards and skewness by displaying the data quartiles. From the overall trend of Figure 9, we can observe that the data quartiles of the baseline reduce faster and change more drastically than our algorithm when the disturbance of edge dropping increases. This demonstrates that our algorithm has better stability. Moreover, in Table 1, our algorithm outperforms the baseline with higher mean rewards in most cases, which demonstrates the power of the ACG learned by our model to promote the coordination among agents reliably and stably even when confronted with disturbances of different intensities. It is worth noting that to guarantee stability, a slight performance loss may occur. It will be an interesting future research direction to study the stability–performance tradeoff.

Table 1.
Methods \ #drop  0     1     3     5     7     9     15    inf
The baseline     45.3  44.0  40.7  36.9  30.8  29.3  25.8  26.2
GCS (ours)       41.6  41.3  40.9  39.4  37.9  36.0  30.5  26.1
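The edge-dropping perturbation used in this evaluation can be sketched as below. The function name `drop_edges`, the uniform random choice of which edges to remove, and the use of `None` to stand for the "inf" setting are assumptions; the exact selection rule used in the experiments is not specified here.

```python
import numpy as np

def drop_edges(adj, num_drop, rng):
    """Return a copy of adj with up to num_drop randomly chosen edges removed.

    num_drop=None removes every edge (the "inf" column). rng is a numpy.random.Generator.
    """
    adj = np.array(adj, dtype=int, copy=True)  # leave the caller's matrix untouched
    rows, cols = np.nonzero(adj)
    num_edges = len(rows)
    k = num_edges if num_drop is None else min(int(num_drop), num_edges)
    if k == 0:
        return adj
    chosen = rng.choice(num_edges, size=k, replace=False)  # distinct edge indices
    adj[rows[chosen], cols[chosen]] = 0
    return adj
```

Removing edges from a DAG can never create a cycle, so the perturbed graph always remains a valid execution structure, and dropping every edge reduces it to fully simultaneous decisions.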
In summary, comparisons with VDN, QMIX, and DCG on three environments demonstrate that our algorithm achieves better performance, stronger stability, and more powerful modeling capability for handling dynamically complicated tasks than any of these methods. Moreover, the proposed DAG depth constraint provides an insightful view on balancing efficiency and performance.
6. Conclusions
In this paper, we introduce a novel graph generator and graph-based coordinated policy in MARL to dynamically represent the underlying decision dependency structure and facilitate behavior learning, respectively. We propose the DAGness-constrained and DAG depth-constrained optimization to balance training efficiency and performance gains. Extensive empirical experiments on Collaborative Gaussian Squeeze, Cooperative Navigation, and Google Research Football, as well as comparisons to baseline algorithms, demonstrate the superiority of our method.
Future research may consider improving the limited performance by upgrading the graph generator model. We will also investigate an automatic mechanism for finding an appropriate depth for the action coordination graph.
Acknowledgements
Jingqing Ruan is supported in part by the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant XDA27010404 and in part by the National Natural Science Foundation of China under Grant 62073324. Coauthor Haifeng Zhang is supported in part by the Strategic Priority Research Program of the Chinese Academy of Sciences, Grant No. XDA27030401.
References
Appendix A Detailed Proofs
We provide the detailed proofs of the propositions in the following.
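A.1. Proof of Proposition 2
Proof.
First, we prove that the $(i, j)$ entry of the $k$-th power of the adjacency matrix $A$ equals the number of walks of length $k$ from node $i$ to node $j$. Let $w^{(k)}_{ij}$ denote the number of walks of length $k$ from node $i$ to node $j$. When $k = 1$, $w^{(1)}_{ij} = A_{ij}$. Let $a^{(k)}_{ij}$ denote the $(i, j)$ entry of $A^{k}$. We have $a^{(k+1)}_{ij} = \sum_{l} a^{(k)}_{il} A_{lj}$, where $a^{(k)}_{il} = w^{(k)}_{il}$ by the induction hypothesis. So $a^{(k+1)}_{ij}$ denotes the number of walks of length $k + 1$ from $i$ to $j$. Then:
$A^{k} = \mathbf{0}, \quad A^{k-1} \neq \mathbf{0}$ (17)
The two conditions state, respectively, that there is no walk of length $k$ and that there is at least one walk of length $k - 1$ between some pair of nodes; that is, the longest path has length $k - 1$. We define the hierarchy of a DAG as its longest path length. Therefore, the hierarchy of the DAG is $k - 1$ if $A$ is a nilpotent matrix of index $k$. ∎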
A.2. Proof of Proposition 3
Appendix B Details of Google Research Football
Observations
The environment exposes the raw observations listed in Table 2. We use the Simple115StateWrapper as the simplified representation of a game state encoded with 115 floats (we refer the reader to https://github.com/google-research/football for details of the encoded information).
Information  Descriptions  

Ball information  position of ball  
direction of ball  
rotation angles of ball  
owned team of ball  
owned player of ball  
Left team  position of players in left team  
direction of players in left team  
tiredness factor of players  
numbers of players with yellow card  
whether a player got a red card  
roles of players  
Right team  position of players in right team  
direction of players in right team  
tiredness factor of players  
numbers of players with yellow card  
whether a player got a red card  
roles of players  

controlled player index  
designated player index  
active action  
Match state  goals of left and right teams  
left steps  
current game mode  
Screen  rendered screen 
Actions
The number of actions available to an individual agent can be denoted as .
The standard move actions (in directions) include .
Moreover, the actions that represent different ways to kick the ball are
.
Rewards
The reward function mainly includes two parts. The first is , which corresponds to the natural reward where the team obtains when scoring a goal and when losing one to the opposing team. The second part is , which is proposed to address the issue of sparse rewards. It is encoded with domain knowledge by an additional auxiliary reward contribution. For example, we can increase the reward when the player owns the ball to boost passing the ball.
Appendix C The detailed structure of
The adjacency matrix of is shown as follows.
Appendix D Additional Experimental Details
We set discount factor . The optimization is conducted using RMSprop with a learning rate of and with no weight decay. Exploration for action selection is performed during training, and each agent executes policy over its actions. is annealed from to over the first time steps and is kept constant afterwards.
In addition, the experiments were run on an Enterprise Linux server with 96 CPU cores and 6 Tesla K80 GPU cores (12 GB memory).
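The annealing schedule described above can be sketched as follows, assuming a linear decay of an exploration rate. The function name and all four arguments are hypothetical placeholders; the start value, end value, and number of annealing steps are not filled in here and must be taken from the actual experimental configuration.

```python
def annealed_epsilon(step, eps_start, eps_end, anneal_steps):
    """Linearly anneal an exploration rate from eps_start to eps_end, then hold it constant."""
    if step >= anneal_steps:
        return eps_end
    fraction = step / anneal_steps  # progress through the annealing phase, in [0, 1)
    return eps_start + (eps_end - eps_start) * fraction
```

Other decay shapes, such as exponential decay, fit the same description; the linear form is chosen here only because it is the simplest.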