
CM3: Cooperative Multi-goal Multi-stage Multi-agent Reinforcement Learning

by Jiachen Yang, et al.
Honda Research Institute USA
Georgia Institute of Technology

We propose CM3, a new deep reinforcement learning method for cooperative multi-agent problems where agents must coordinate for joint success in achieving different individual goals. We restructure multi-agent learning into a two-stage curriculum, consisting of a single-agent stage for learning to accomplish individual tasks, followed by a multi-agent stage for learning to cooperate in the presence of other agents. These two stages are bridged by modular augmentation of neural network policy and value functions. We further adapt the actor-critic framework to this curriculum by formulating local and global views of the policy gradient and learning via a double critic, consisting of a decentralized value function and a centralized action-value function. We evaluated CM3 on a new high-dimensional multi-agent environment with sparse rewards: negotiating lane changes among multiple autonomous vehicles in the Simulation of Urban Mobility (SUMO) traffic simulator. Detailed ablation experiments show the positive contribution of each component in CM3, and the overall synthesis converges significantly faster to higher performance policies than existing cooperative multi-agent methods.



1 Introduction

Cooperative multi-agent control problems that are pervasive in real-world settings, such as social dilemmas (Van Lange et al., 2013) and coordination among autonomous vehicles (Cao et al., 2013), often carry a distinctive feature: the need for learning cooperative policies emerges only because agents should accomplish different individual goals while contributing to the common goal of joint success of all agents. For example, multiple autonomous vehicles must cooperate when their individual target locations and nominal trajectories are in conflict—e.g. at unsignalized intersections and during double lane merges—and each vehicle should reach its target without compromising the success of all other vehicles. The effectiveness of deep reinforcement learning (RL) in solving high-dimensional single-agent optimal control problems (Mnih et al., 2015; Lillicrap et al., 2015; Silver et al., 2017) has motivated recent extensions of RL methods to such multi-agent problems with complex interactions (Lowe et al., 2017; Foerster et al., 2017; Mordatch and Abbeel, 2017; Leibo et al., 2017). Using global state information to train policies that choose actions only based on local information, in the paradigm of centralized training of decentralized policies, has shown promise for multi-agent cooperation (Oliehoek et al., 2008; Foerster et al., 2017). This is practically relevant when centralized training of a few agents suffices for learning a cooperative policy, which is used for decentralized operation of a large agent population (e.g. autonomous vehicle fleet). In the context of cooperative multi-agent learning, however, these multi-agent reinforcement learning (MARL) methods have mainly tackled environments with a single task shared by all agents, while the question of how best to learn multi-agent cooperation for different individual goals remains open.

We consider a multi-agent setting where each agent learns to accomplish all goals within a finite set and cooperate with other agents with possibly different goals. Our contribution is a new general framework called cooperative multi-goal multi-stage multi-agent RL (CM3) for restructuring the problem into a new multi-agent curriculum, involving three synergistic components:

  1. We address the difficulty of multi-agent exploration by training a Stage 1 policy to achieve different goals in a single-agent setting. Our insight is that exploratory actions are most useful for finding cooperative solutions after agents can reliably generate conflict by acting toward individual goals.

  2. Next, we observe that multi-agent environments generally permit a decomposition of an agent’s observation into a representation of the agent’s own state (e.g. distance to target), and a representation of other agents. This motivates a new neural network construction for MARL: we simplify training of Stage 1 actor and critic networks by limiting their input space to the part that is sufficient for achieving individual goals in a single-agent environment, then augment the architecture in Stage 2 for further learning in the full multi-agent environment.

  3. Third, we address the problem of learning both local objectives and cooperation by showing two equivalent views of the policy gradient and proposing a new actor-critic adaptation: we train a decentralized policy using a double critic, consisting of a decentralized value function for learning local objectives and a centralized action-value function for learning cooperation.

CM3 combines these methods within one curriculum: a simplified policy network with a decentralized critic learns to achieve multiple goals in Stage 1, while Stage 2 augments the policy’s observation space to represent other agents and learns multi-agent cooperation using the double critic. While we assume parameter-sharing among all agents, CM3 is easily applicable to agent-specific policies.

We evaluated our method on a challenging multi-agent autonomous vehicle environment with a high-dimensional state space and sparse rewards in the SUMO simulator (Krajzewicz et al., 2012). CM3 converges to higher-performing policies significantly faster than alternative multi-agent methods. Multiple ablations show the positive contribution of each component to the overall synthesis.

2 Related work

Early work on MARL was limited to small discrete state and action spaces (Tan, 1993; Littman, 1994; Hu and Wellman, 2003). Recently, extensions of deep RL methods (Mnih et al., 2015; Lillicrap et al., 2015) to high-dimensional multi-agent environments have demonstrated the potential of MARL on a diverse set of tasks, including two-player Pong with independent DQN agents (Tampuu et al., 2017), learning a grounded language for cooperation (Mordatch and Abbeel, 2017), mixed cooperation-competition in a 2D particle environment (Lowe et al., 2017), and competition in 3D worlds with simulated physics (Bansal et al., 2017).

Cooperative multi-agent learning is of particular interest, as many real-world problems can be formulated as distributed systems in which decentralized agents must coordinate to achieve shared objectives (Panait and Luke, 2005). Instead of learning with independent observations, sharing global state information among agents with centralized training can increase the learning speed and performance of decentralized policies, which act on purely local information during execution (Tan, 1993; Oliehoek et al., 2008; Foerster et al., 2017; Lowe et al., 2017). Foerster et al. (2017) proposed an actor-critic method called counterfactual multi-agent (COMA) policy gradients to address the issue of multi-agent credit assignment that arises when all agents receive a shared reward (Panait and Luke, 2005; Chang et al., 2004). COMA pertains to the specific case of a single shared task, whereas we target the general case where agents must learn multiple tasks and coordinate with others with different goals. Alternatively, multi-agent deep deterministic policy gradient (MADDPG) (Lowe et al., 2017) is an extension of DDPG (Lillicrap et al., 2015) to cooperative-competitive environments with continuous actions, using one pair of centralized critic and decentralized actor per agent to accommodate different reward functions. While MADDPG may be extended to our problem setting by using one actor-critic pair to learn each goal, this does not directly address the need for cooperation; in fact, a direct extension would not distinguish between the problem of cooperation and competition, despite the fundamental difference. The resulting architecture may also face scalability issues with large numbers of goals and agents.

Multi-agent cooperation for different goals was previously explored in environments that require coordination for agents’ success. Using two independent tabular Q-learning agents in a small grid world, Austerweil et al. (2016) showed that agents whose rewards depend on all agents’ success achieve higher scores than agents who only optimize for their own success. Also using independent DQN (Mnih et al., 2015) agents, Leibo et al. (2017) investigated the effect of environmental parameters, such as resource abundance, on the degree of agents’ cooperation in team games and social dilemmas. While these works focus on the impact of a Markov game’s parameters on learning agents’ behavior, our work proposes a multi-agent learning method for such cooperative Markov games with individual goals.

Curriculum learning (Bengio et al., 2009) and transfer learning (Pan and Yang, 2010) are pertinent to the context of multi-task reinforcement learning. Gupta et al. (2017) sampled stages from a curriculum defined by an increasing number of agents involved in a single task. Schaul et al. (2015) proposed universal value function approximation, with both state and goals as input, for single-agent settings. Rusu et al. (2016) instantiate new neural network “columns” for transfer across a sequence of tasks. Our proposed method unifies the general ideas behind these two techniques (function approximation generalized with task information, and multi-stage construction of neural networks) into a new curriculum for cooperative multi-agent RL.

The closest to our problem setting is the recent independent work of Zhang et al. (2018). While they learn one actor-critic pair per agent using local rewards and experiment on a small particle environment, we improve scalability and learning rate by using both local and joint rewards to learn an actor and double critics shared by all agents, within a new multi-agent curriculum, and our test environment has larger spatial extent with highly sparse rewards.

3 Preliminaries

Each agent should learn to accomplish any goal within a finite set, cooperate with other agents for collective success, and act independently with limited local information during test time. This learning problem can be formulated as an episodic environment, where all agents are assigned randomly-sampled goals during each training episode. We formalize the environment as a multi-goal Markov game, then review an actor-critic approach to centralized training of decentralized policies.

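Markov Games. We define a multi-goal Markov game as a tuple $\langle \mathcal{S}, \{\mathcal{O}^n\}, \{\mathcal{A}^n\}, P, R, \mathcal{G}, N, \gamma \rangle$ with $N$ agents labeled by $n \in \{1, \dots, N\}$. Each agent $n$ has one goal $g^n \in \mathcal{G}$ during each episode. At each time step $t$, the configuration of agents is completely specified by a state $s_t \in \mathcal{S}$, while each agent $n$ receives a partial observation $o^n_t \in \mathcal{O}^n$ and chooses an action $a^n_t \in \mathcal{A}^n$. The environment moves to a next state $s_{t+1}$ due to the joint action $\mathbf{a}_t := (a^1_t, \dots, a^N_t)$, according to transition probability $P(s_{t+1} \mid s_t, \mathbf{a}_t)$. Each agent receives a reward $R^n_t := R(s_t, \mathbf{a}_t, g^n)$, and the learning task is to find stochastic policies $\pi^n(a^n \mid o^n, g^n)$, which condition only on local observations and goals, to maximize $\mathbb{E}\big[\sum_{t=0}^{T} \gamma^t \sum_{n=1}^{N} R(s_t, \mathbf{a}_t, g^n)\big]$ over horizon $T$, where $\gamma$ is a discount factor. For the rest of this paper, we use $a^{-n}$ and $g^{-n}$ to denote the collection of all agents’ actions and goals, respectively, except that of agent $n$; $\mathbf{g}$ to denote the collection of all agents’ goals; and $\boldsymbol{\pi}(\mathbf{a} \mid s, \mathbf{g}) := \prod_{n=1}^{N} \pi^n(a^n \mid o^n, g^n)$ to denote the joint policy. We use $\mathbb{E}_{\boldsymbol{\pi}}[\cdot]$ to stand for $\mathbb{E}_{s \sim \rho^{\boldsymbol{\pi}}, \mathbf{a} \sim \boldsymbol{\pi}}[\cdot]$, where $\rho^{\boldsymbol{\pi}}$ is the discounted stationary state distribution under $\boldsymbol{\pi}$.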

Centralized learning of decentralized policies. The actor-critic approach in single-agent reinforcement learning (Sutton and Barto, 1998; Sutton et al., 2000; Konda and Tsitsiklis, 2000) is suitable for adaptation to the paradigm of centralized multi-agent learning of decentralized policies. A centralized critic that receives full state-action information can speed up learning of decentralized actors (i.e. policies) that receive only local information, and only the actors are retained for execution after training (Lowe et al., 2017; Foerster et al., 2017). In a single-agent setting, policy $\pi(a \mid s)$ (with parameter $\theta$) maximizes the objective $J(\pi) := \mathbb{E}_{\pi}\big[\sum_{t} \gamma^t R(s_t, a_t)\big]$ by ascending the gradient

$$\nabla_\theta J(\pi) = \mathbb{E}_{\pi}\big[\nabla_\theta \log \pi(a \mid s)\,\big(Q^{\pi}(s, a) - b(s)\big)\big] \tag{1}$$

where $Q^{\pi}(s, a)$ is the action-value function, and $b(s)$ is any state-dependent baseline. Extending this to the multi-agent setting with a single global reward for all agents, COMA (Foerster et al., 2017) uses a counterfactual baseline

$$b(s, a^{-n}) := \sum_{\hat{a}^n} \pi^n(\hat{a}^n \mid o^n)\, Q\big(s, (a^{-n}, \hat{a}^n)\big) \tag{2}$$

to address the issue of multi-agent credit assignment: $Q(s, (a^{-n}, a^n)) - b(s, a^{-n})$ represents the contribution of an agent’s chosen action $a^n$ versus the average of all possible counterfactual actions $\hat{a}^n$, keeping other agents’ actions fixed. COMA also employs parameter-sharing among all agents, meaning that all agents execute the same policy but can behave differently according to their individual observations. The policy gradient is given by (Foerster et al., 2017, Lemma 1)

$$\nabla_\theta J(\boldsymbol{\pi}) = \mathbb{E}_{\boldsymbol{\pi}}\Big[\sum_{n} \nabla_\theta \log \pi^n(a^n \mid o^n)\,\big(Q(s, \mathbf{a}) - b(s, a^{-n})\big)\Big] \tag{3}$$

In our work, we incorporate the counterfactual baseline in a larger framework that accounts for individual objectives and mixtures of local and joint rewards.
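To make the counterfactual baseline concrete, the following is a minimal sketch for a single state with a tabular critic. It is an illustration under assumptions, not the authors' implementation: the dictionary-based `q_values` table, the function name `counterfactual_advantage`, and the toy numbers are all hypothetical.

```python
from typing import Dict, List, Tuple

JointAction = Tuple[int, ...]


def counterfactual_advantage(
    q_values: Dict[JointAction, float],
    policy_probs: List[float],
    joint_action: JointAction,
    agent: int,
) -> float:
    """Return Q(s, a) - b(s, a^{-n}) for one agent at a fixed state s.

    q_values maps every joint action to a (hypothetical) critic value.
    policy_probs[k] is the agent's probability of choosing action k.
    """
    baseline = 0.0
    for alt_action, prob in enumerate(policy_probs):
        # Swap in a counterfactual action for this agent only;
        # the other agents' actions stay fixed.
        alt_joint = joint_action[:agent] + (alt_action,) + joint_action[agent + 1:]
        baseline += prob * q_values[alt_joint]
    return q_values[joint_action] - baseline


# Toy example: two agents with two actions each (values are made up).
q_table = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 3.0, (1, 1): 2.0}
adv_agent0 = counterfactual_advantage(q_table, [0.5, 0.5], (1, 0), agent=0)
adv_agent1 = counterfactual_advantage(q_table, [0.5, 0.5], (1, 0), agent=1)
```

In this toy table, agent 0's choice of action 1 scores above its own policy-weighted average, so its advantage is positive, while the other agent's action is held at 0 throughout.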

4 Methods

We present our learning framework in two parts. First we show two views of the learning objective, corresponding to the need for agents to act toward their goal and cooperate for the success of other agents. Then we show how to incorporate these two views into a new two-stage curriculum bridged by neural network construction.

4.1 Multi-agent cooperation with individual goals

If all agents were to take greedy goal-directed actions that are individually optimal in the absence of other agents, the joint action can be sub-optimal. In social dilemmas, this occurs when a defecting agent receives higher individual payoff at the cost of lower joint payoff; in the physical world, a driver who takes a greedy straight-line trajectory toward a target location may cause disastrous consequences for surrounding cars. On the other hand, agents who are rewarded for both individual and collective success can find cooperative policies to avoid such local optima and achieve higher joint payoff. Formally, we seek a joint policy $\boldsymbol{\pi}$, with implied parameter $\theta$ shared by all individual $\pi^n$, to maximize the objective:

$$J(\boldsymbol{\pi}) := \mathbb{E}_{\boldsymbol{\pi}}\Big[\sum_{t=0}^{T} \gamma^t \sum_{n=1}^{N} R(s_t, \mathbf{a}_t, g^n)\Big] \tag{4}$$

This objective can be viewed in two ways, leading to our actor-critic method involving a pair of decentralized and centralized critics. Standard derivations are included in Appendix E.

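Local view. Individual rewards corresponding to each agent’s goal provide exact goal-specific learning signals that cannot be easily extracted from a joint reward. We learn a decentralized critic from these rewards to provide a policy gradient for agents to achieve local goals without explicit regard to the joint success of all agents. Let objectives $J_n(\boldsymbol{\pi}) := \mathbb{E}_{\boldsymbol{\pi}}\big[\sum_{t} \gamma^t R(s_t, \mathbf{a}_t, g^n)\big]$ correspond to individual goals $g^n$. We rename the original objective (4) as $J_{\text{local}}$ for clarity, so that $J_{\text{local}} = \sum_{n=1}^{N} J_n$. Note that despite this objective decomposition, a collection of greedy policies for each $J_n$ may not be optimal for $J_{\text{local}}$, since rewards $R(s, \mathbf{a}, g^n)$ are different for various $a^{-n}$. Applying the policy gradient theorem (Sutton et al., 2000) to each $J_n$, the objective (4) is maximized by ascending the gradient

$$\begin{aligned} \nabla_\theta J_{\text{local}} &= \sum_{n=1}^{N} \nabla_\theta J_n = \mathbb{E}_{\boldsymbol{\pi}}\Big[\sum_{n=1}^{N} \nabla_\theta \log \boldsymbol{\pi}(\mathbf{a} \mid s, \mathbf{g})\, Q^{\boldsymbol{\pi}}_n(s, \mathbf{a})\Big] \\ &\approx \mathbb{E}_{\boldsymbol{\pi}}\Big[\nabla_\theta \sum_{n=1}^{N} \log \boldsymbol{\pi}(\mathbf{a} \mid s, \mathbf{g})\,\big(R(s_t, \mathbf{a}_t, g^n) + \gamma V(o^n_{t+1}, g^n) - V(o^n_t, g^n)\big)\Big] \end{aligned} \tag{5}$$

Each $Q^{\boldsymbol{\pi}}_n$ is the state-action value corresponding to the individual reward $R(s, \mathbf{a}, g^n)$. To arrive at the second line, we took the following steps: 1) motivated by Schaul et al. (2015), we approximate all $Q^{\boldsymbol{\pi}}_n(s, \mathbf{a})$ by a single $Q^{\boldsymbol{\pi}}(s, \mathbf{a}, g^n)$ with an additional input goal $g^n$ for scalability, instead of using $N$ different function approximators; 2) without changing the expectation, we further replace $Q^{\boldsymbol{\pi}}(s, \mathbf{a}, g^n)$ with the advantage function $A^{\boldsymbol{\pi}}(s, \mathbf{a}, g^n) := Q^{\boldsymbol{\pi}}(s, \mathbf{a}, g^n) - V^{\boldsymbol{\pi}}(s, g^n)$, and use the TD error $\delta^n_t := R(s_t, \mathbf{a}_t, g^n) + \gamma V^{\boldsymbol{\pi}}(s_{t+1}, g^n) - V^{\boldsymbol{\pi}}(s_t, g^n)$ as an unbiased estimate of the advantage value; 3) we arrive at the decentralized critic $V(o^n_t, g^n)$ by approximating $s_t$ with $o^n_t$, which is common practice in settings with partial observability (Foerster et al., 2017). Parameterized by $\theta_V$, the critic is updated by minimizing the loss

$$L(\theta_V) = \mathbb{E}\Big[\big(R(s_t, \mathbf{a}_t, g^n) + \gamma V(o^n_{t+1}, g^n; \theta'_V) - V(o^n_t, g^n; \theta_V)\big)^2\Big] \tag{6}$$

where $\theta'_V$ are parameters of a target network (Mnih et al., 2015) that slowly updates towards the main $\theta_V$.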
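The decentralized critic update can be sketched as follows: a TD error computed against a target network, a mean squared loss over a minibatch, and a slow target update. This is a sketch under stated assumptions. The critic values are passed in as plain floats rather than produced by a network, and the soft-update rule with rate `tau` is a common choice assumed here, since the text only says the target network "slowly updates towards the main" one.

```python
from typing import List, Sequence


def td_error(reward: float, v_next_target: float, v_current: float, gamma: float) -> float:
    """TD error: r + gamma * V_target(o', g) - V(o, g)."""
    return reward + gamma * v_next_target - v_current


def critic_loss(
    rewards: Sequence[float],
    v_next_target: Sequence[float],
    v_current: Sequence[float],
    gamma: float,
) -> float:
    """Mean squared TD error over a minibatch of transitions."""
    errors = [
        td_error(r, v_next, v_cur, gamma)
        for r, v_next, v_cur in zip(rewards, v_next_target, v_current)
    ]
    return sum(e * e for e in errors) / len(errors)


def soft_update(target: Sequence[float], main: Sequence[float], tau: float) -> List[float]:
    """Assumed target update: theta' <- tau * theta + (1 - tau) * theta'."""
    return [tau * m + (1.0 - tau) * t for t, m in zip(target, main)]


# Toy minibatch of two transitions (made-up numbers).
loss = critic_loss([1.0, 0.0], [0.5, 1.0], [1.0, 1.0], gamma=0.9)
new_target = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.1)
```

The same TD error doubles as the advantage estimate in the local-view policy gradient, which is why the critic and the actor can share one forward pass per transition.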

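Global view. Equivalently, we can define a joint reward $R^g(s, \mathbf{a}, \mathbf{g}) := \sum_{n=1}^{N} R(s, \mathbf{a}, g^n)$ and use it to learn a centralized critic that encourages each agent to contribute to other agents’ success. We rename the original objective (4) as $J_{\text{global}}$ for clarity, so that $J_{\text{global}} = \mathbb{E}_{\boldsymbol{\pi}}\big[\sum_{t} \gamma^t R^g(s_t, \mathbf{a}_t, \mathbf{g})\big]$. Applying the COMA policy gradient lemma (Foerster et al., 2017) to the new case with multiple goals, the gradient is:

$$\nabla_\theta J_{\text{global}} = \mathbb{E}_{\boldsymbol{\pi}}\Big[\sum_{n} \nabla_\theta \log \pi^n(a^n \mid o^n, g^n)\,\big(Q^{\boldsymbol{\pi}}(s, \mathbf{a}, \mathbf{g}) - b(s, a^{-n}, \mathbf{g})\big)\Big] \tag{7}$$

where $Q^{\boldsymbol{\pi}}(s, \mathbf{a}, \mathbf{g})$ is the centralized critic and $b(s, a^{-n}, \mathbf{g})$ is our generalized counterfactual baseline with multiple goals:

$$b(s, a^{-n}, \mathbf{g}) := \sum_{\hat{a}^n} \pi^n(\hat{a}^n \mid o^n, g^n)\, Q^{\boldsymbol{\pi}}\big(s, (a^{-n}, \hat{a}^n), \mathbf{g}\big) \tag{8}$$

Parameterized by $\theta_Q$, the centralized critic is updated by minimizing the loss

$$L(\theta_Q) = \mathbb{E}\Big[\big(R^g(s_t, \mathbf{a}_t, \mathbf{g}) + \gamma Q(s_{t+1}, \mathbf{a}'_{t+1}, \mathbf{g}; \theta'_Q)\big|_{\mathbf{a}'_{t+1} \sim \boldsymbol{\pi}'} - Q(s_t, \mathbf{a}_t, \mathbf{g}; \theta_Q)\big)^2\Big] \tag{9}$$

where $\theta'_Q$ and $\boldsymbol{\pi}'$ represent slowly-updating target $Q$ and target policy networks, respectively.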

Combined view. We interpolate between both views using $\alpha \in [0, 1]$, which determines the extent to which the joint reward affects each agent’s policy. The overall policy gradient is

$$\nabla_\theta J := \alpha \nabla_\theta J_{\text{local}} + (1 - \alpha) \nabla_\theta J_{\text{global}} \tag{10}$$

This can be viewed as a weighted-sum scalarization of a two-objective optimization problem (Marler and Arora, 2004). This reformulation of the objective into two components is also well-suited to the two-stage multi-agent curriculum that we define immediately below.
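The interpolation between the two views amounts to a convex combination of two gradient vectors. The sketch below assumes a particular weight convention, namely that `alpha` multiplies the local-view gradient and `1 - alpha` the global-view gradient; that convention, the flat-list gradient representation, and the function name `combine_gradients` are assumptions made for illustration.

```python
from typing import List, Sequence


def combine_gradients(
    grad_local: Sequence[float],
    grad_global: Sequence[float],
    alpha: float,
) -> List[float]:
    """Weighted-sum scalarization of the local-view and global-view gradients.

    Assumed convention: alpha weights the local view, (1 - alpha) the global view.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(grad_local) != len(grad_global):
        raise ValueError("gradients must have the same length")
    return [alpha * gl + (1.0 - alpha) * gg for gl, gg in zip(grad_local, grad_global)]


# Toy gradients over two shared policy parameters (made-up numbers).
combined = combine_gradients([1.0, 2.0], [3.0, 0.0], alpha=0.25)
local_only = combine_gradients([1.0, 2.0], [3.0, 0.0], alpha=1.0)
```

Under this convention, the two extremes of `alpha` recover a purely local or a purely global policy gradient, so the single scalar controls how strongly the joint reward shapes each agent's update.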

4.2 Curriculum

Figure 1: In Stage 1, a pair of reduced networks $V_1$ and $\pi_1$ learn to achieve multiple goals in a single-agent environment, using policy gradient $\nabla_\theta J_{\text{local}}$ only. Between Stage 1 and 2, a new policy network $\pi$ is constructed from the trained $\pi_1$ and a new module $\pi_2$ according to (11) (the same construction is done for $V$, not shown). In the full multi-agent environment of Stage 2, these larger $\pi$ and $V$ are instantiated for each of $N$ agents (with full parameter-sharing), along with a new centralized critic $Q$, and trained using the interpolated policy gradient (10).

The issue of efficient exploration in reinforcement learning (Thrun, 1992) is especially acute in the multi-agent setting, where the state space and joint action space scale exponentially with the number of agents. Relying on random exploration to learn both individual task completion and cooperative behavior concurrently can be highly inefficient. Agents that have not yet learned to accomplish local goals may rarely encounter the region of state space where cooperation is needed, rendering any exploratory actions useless for learning cooperative behavior. At the other extreme, exploratory actions taken in situations that require coordination can easily lead to failure, and the resulting penalties could cause agents to avoid the coordination problem altogether and fail to learn their individual tasks. Instead, agents that achieve local objectives in the absence of other agents can more reliably produce the necessary state configurations for learning cooperative behavior. This motivates our multi-agent curriculum: first we train an actor $\pi$ and decentralized critic $V$ to learn multiple goals in a single-agent setting; then we instantiate all $N$ agents, augment the pretrained $\pi$ and $V$ neural networks with new modules, activate a new centralized critic $Q$, and further train for cooperative behavior. We describe the CM3 curriculum in detail below (see Figure 1 and pseudo-code in Appendix A).

Stage 1. This stage is identical to a single-agent Markov decision process (MDP). We train an actor $\pi(a \mid o, g)$ and critic $V(o, g)$ according to policy gradient (5) and loss (6), respectively. A goal is uniformly sampled from $\mathcal{G}$ in each training episode, for the agent to learn all goals over the course of training. Using deep neural networks for function approximation, the input to the actor and critic networks consists of the agent’s observation vector $o$ and a vector $g$ representing the goal for a particular episode. We make the simple observation that multi-agent environments typically permit a decomposition of the agent’s observation space into $\mathcal{O}^n = \mathcal{O}^n_{\text{self}} \cup \mathcal{O}^n_{\text{others}}$, where $o_{\text{self}} \in \mathcal{O}^n_{\text{self}}$ contains information about the agent’s own state (e.g. position) while $o_{\text{others}} \in \mathcal{O}^n_{\text{others}}$ is the agent’s local observation of surrounding agents, and that the ability to process $o_{\text{others}}$ is unnecessary in Stage 1. This allows us to reduce the size of the input space of $\pi$ and $V$ to $(\mathcal{O}^n_{\text{self}}, \mathcal{G})$, thereby reducing the number of trainable parameters in Stage 1 and simplifying the learning problem. These reduced actor and critic networks are trained until convergence, and we label them as $\pi_1$ and $V_1$.
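The input reduction for Stage 1 can be illustrated with a small sketch. It assumes the observation arrives already split into two flat lists and that the goal is encoded as a one-hot vector; the source only says that a vector represents the goal, so the encoding and the names `Observation`, `stage1_input`, and `stage2_inputs` are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Observation(NamedTuple):
    # Hypothetical container for the two parts of an agent's observation.
    o_self: List[float]    # the agent's own state, e.g. position
    o_others: List[float]  # local view of surrounding agents


def one_hot(index: int, size: int) -> List[float]:
    """Assumed goal encoding: a one-hot vector over a finite goal set."""
    if not 0 <= index < size:
        raise ValueError("goal index out of range")
    return [1.0 if i == index else 0.0 for i in range(size)]


def stage1_input(obs: Observation, goal: int, num_goals: int) -> List[float]:
    """Reduced network input: own state plus goal; o_others is ignored."""
    return list(obs.o_self) + one_hot(goal, num_goals)


def stage2_inputs(obs: Observation, goal: int, num_goals: int) -> Tuple[List[float], List[float]]:
    """Stage 2 keeps the reduced input and routes o_others to the new module."""
    return stage1_input(obs, goal, num_goals), list(obs.o_others)


obs = Observation(o_self=[0.2, 0.5], o_others=[1.0, 0.0, 0.0])
x_stage1 = stage1_input(obs, goal=1, num_goals=3)
x_main, x_others = stage2_inputs(obs, goal=1, num_goals=3)
```

Because the reduced input is identical in both stages, the Stage 1 layers can be reused unchanged and only the module that reads `o_others` is new.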

Stage 2. The Markov game is instantiated with all $N$ agents. We retain the previously-trained parameters, instantiate a new neural network $\pi_2$ for agents to process the $o_{\text{others}}$ component of their local observation, and introduce hidden connections from the output of $\pi_2$ to a selected layer of $\pi_1$. Specifically, let $h^1_i \in \mathbb{R}^{m_i}$ be the hidden activations of layer $i \leq L$ with $m_i$ units in an $L$-layer neural network representation of $\pi_1$, connected to layer $i-1$ via $h^1_i = f(W^1_i h^1_{i-1})$ with $W^1_i \in \mathbb{R}^{m_i \times m_{i-1}}$ and activation function $f$. Stage 2 introduces a $K$-layer neural network $\pi_2(o_{\text{others}})$ with output layer $h^2_K \in \mathbb{R}^{m_K}$, chooses a particular layer $i^*$ [1] of $\pi_1$, and augments the hidden activations $h^1_{i^*}$ to be

$$h^*_{i^*} = f\big(W^1_{i^*} h^1_{i^*-1} + W^{1:2} h^2_K\big) \tag{11}$$

with $W^{1:2} \in \mathbb{R}^{m_{i^*} \times m_K}$. An equivalent augmentation is made to critic $V_1$ using a new neural network $V_2(o_{\text{others}})$. Finally, we instantiate the centralized critic $Q(s, \mathbf{a}, \mathbf{g})$, which was not required and therefore absent during Stage 1, and train $\pi$, $V$, and $Q$ using the combined gradient (10), loss (6) and loss (9), respectively. Similar to Stage 1, we assign goals to agents by sampling from a distribution over $\mathcal{G}^N$ during each training episode. The distribution can be constructed to ensure sufficient training on difficult goal combinations that require cooperation, along with easier combinations for maintaining agents’ ability to act toward their goal.

[1] While choosing layer $i^*$ is domain-dependent, it should be chosen among fully-connected layers after convolutional layers have processed agents’ image-representations of their field of view. We also found it unnecessary to tune this parameter for our experiments.
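The Stage 2 augmentation of one hidden layer can be sketched without any deep learning library. The example assumes the augmented activation has the form f(W1 h_prev + W12 h_others), with ReLU standing in for the activation f; the names `w_stage1`, `w_bridge`, and `augmented_layer` are hypothetical, and the tiny matrices are made up.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def matvec(weights: Matrix, vector: Sequence[float]) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def relu(values: Sequence[float]) -> List[float]:
    return [max(0.0, v) for v in values]


def augmented_layer(
    w_stage1: Matrix,            # pretrained weights into the chosen layer
    h_prev: Sequence[float],     # activations of the layer below it
    w_bridge: Matrix,            # new weights from the Stage 2 module's output
    h_others: Sequence[float],   # output of the new module that reads o_others
) -> List[float]:
    """Augmented hidden activations: f(W1 h_prev + W12 h_others), with f = ReLU."""
    from_stage1 = matvec(w_stage1, h_prev)
    from_others = matvec(w_bridge, h_others)
    return relu([a + b for a, b in zip(from_stage1, from_others)])


w1 = [[1.0, 0.0], [0.0, 1.0]]
h_prev = [1.0, -2.0]
h_aug = augmented_layer(w1, h_prev, [[0.0], [3.0]], [1.0])
h_zero_bridge = augmented_layer(w1, h_prev, [[0.0], [0.0]], [1.0])
```

One property of this form is that a bridge matrix of all zeros leaves the layer computing exactly what it computed in Stage 1; whether the authors initialize the new connections that way is not stated in the text.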

We postulated, and verified in experiments, that this two-stage construction of actor-critic networks with curriculum learning improves learning speed compared to direct training on the full multi-agent environment. Hidden layers that were pre-trained for processing $(o_{\text{self}}, g)$ in Stage 1 retain the ability to process task information, while the new module learns the effect of surrounding agents. Higher layers that can generate goal-directed actions in the single-agent setting of Stage 1 are fine-tuned by the combined gradient to generate cooperative actions for joint success of all agents.

5 Experiments

We evaluated CM3 on the problem of learning cooperative policies for negotiating lane changes among multiple autonomous vehicles in the Simulation of Urban Mobility (SUMO) traffic simulator (Krajzewicz et al., 2012). While previous works have applied reinforcement learning to autonomous driving tasks in simulation (Isele et al., 2017; Kuefler et al., 2017; Mukadam et al., 2017), they modeled the problem as a single-agent MDP, in which other vehicles behave according to hand-designed policies without the capacity for strategic response to the learning agent. However, driving in real-world traffic must involve deliberate cooperation [2] among interacting vehicles that have different individual intentions (e.g. change to different target lanes), which makes the problem of autonomous vehicle negotiation a natural testbed for MARL methods. The following sections describe environment details, results of a comprehensive ablation study, and performance of CM3 versus existing methods. We find that CM3 learns significantly faster and finds more successful policies than strong baselines. Performance of the ablations shows that both the two-stage curriculum and the decentralized critic are crucial to success, while the global view of policy gradient gives a noticeable advantage in finding a cooperative solution.

[2] We view multi-vehicle interaction in driving as cooperative rather than competitive, because it is undesirable for any vehicle to follow a self-interested policy at the risk of other vehicles’ safety.

5.1 Autonomous vehicle negotiation

Figure 2: The final segment of a road network in SUMO. Four agent vehicles with different start and goal lanes are colored according to labels of the inset window. All vehicles in yellow and bright red are controlled by SUMO and appear in E only.

Figure 2 shows one segment of a larger road network in SUMO that we used for all experiments. It consists of initial lanes that start at a common horizontal position, two of which encounter a merge point, and goal lanes that end at a terminal position. In each episode, agents are emitted at the start position on randomly-selected initial lanes, and each agent is associated with a randomly-selected goal lane that it should learn to reach at the terminal position. Agents receive observations with a limited field of view, choose actions from a discrete action space, and receive rewards according to both terminal and instantaneous criteria (e.g. reached goal, exceeded speed limit). Following the curriculum structure of CM3, we define the following environments:

  • E1: a single agent on an otherwise empty road learns to reach any goal lane from any initial lane. This is used for Stage 1 of CM3, which trains initial networks $\pi_1$ and $V_1$ with objective $J_{\text{local}}$.

  • E2: agents are randomly initialized: with probability 0.8, initial and goal lanes are set so that a double-merge is necessary (see Figure 4 of Appendix C); with probability 0.2, initial and goal lanes are uniformly sampled. The full Stage 2 architecture of CM3 is trained in E2.

  • E: used to test generalization, with SUMO-controlled vehicles emitted with probability 0.5/sec.

Stage 1 of CM3 was trained in E1, followed by training of Stage 2 in E2. Competitor methods were trained directly in E2. Neural network architectures and hyperparameters for our algorithm and all baselines are reported in Appendices F and D. Full parameter-sharing among all agents was used in all methods, unless specified otherwise. Appendix C exhaustively describes the full Markov game.
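The episode initialization described for E2 is a two-component mixture, which the sketch below illustrates. Only the 0.8 and 0.2 probabilities come from the text. The lane counts, the single double-merge configuration, and the function name `sample_lanes` are hypothetical placeholders, and lanes are drawn independently per agent as a simplification.

```python
import random
from typing import List, Tuple

Lanes = Tuple[int, ...]


def sample_lanes(
    rng: random.Random,
    merge_configs: List[Tuple[Lanes, Lanes]],
    num_agents: int,
    num_initial_lanes: int,
    num_goal_lanes: int,
    p_merge: float = 0.8,
) -> Tuple[str, Lanes, Lanes]:
    """Return (kind, initial_lanes, goal_lanes) for one training episode."""
    if rng.random() < p_merge:
        # With probability p_merge, pick a configuration that forces a double merge.
        initial, goals = rng.choice(merge_configs)
        return "merge", initial, goals
    # Otherwise sample lanes uniformly (independently per agent: an assumption).
    initial = tuple(rng.randrange(num_initial_lanes) for _ in range(num_agents))
    goals = tuple(rng.randrange(num_goal_lanes) for _ in range(num_agents))
    return "uniform", initial, goals


rng = random.Random(0)
hypothetical_merges = [((0, 1), (1, 0))]  # placeholder, not taken from the paper
draws = [sample_lanes(rng, hypothetical_merges, 2, 2, 2) for _ in range(10000)]
merge_fraction = sum(kind == "merge" for kind, _, _ in draws) / len(draws)
```

Over many episodes, roughly four in five draws come from the hard configurations, while the uniform component keeps easier goal combinations in the training mix.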

5.2 Results

Our method contains three key components: curriculum learning with two-stage construction of actor and decentralized critic networks, decentralized value function for learning multiple individual objectives, and centralized action-value function for learning coordination. We conducted ablation experiments to demonstrate the impact of each component. Each paragraph below describes an ablation in which the component written in bold is absent.

Curriculum. The benefit of our multi-agent curriculum can be seen by contrasting CM3 against the same architecture trained directly in E2, without first training components $\pi_1$ and $V_1$ in the multi-goal single-agent environment E1. We call this ablation “CM3 (direct)”. Learning curves in Figure 3(a) show that the performance of CM3 (direct) at episode 80k was already attained by CM3 at episode 25k. Accounting for the 10k episodes used by CM3 in E1 (Appendix G), CM3 still learns faster by 45k episodes (80k versus 25k + 10k = 35k). This shows that our two-stage curriculum with neural net construction enables significantly more efficient exploration of the state-action space to find cooperative solutions.

Decentralized value function. We found that the decentralized critic $V$ and the gradient $\nabla_\theta J_{\text{local}}$ in the local view play a crucial role in our method, shown by the lower performance of CM3-V (“minus” V) in Figure 3(a). CM3-V inherits the same Stage 1 policy and uses the same neural network construction for bridging Stages 1-2 as CM3; the sole difference is that the local view is switched off, so only the centralized $Q$, the joint reward, and the global-view gradient are available in E2. This gives evidence that a joint reward is insufficient when agents have different goals, even when a counterfactual baseline is used for credit assignment, and that goal-specific gradients from local rewards and a decentralized critic are needed.

Centralized action-value function. The contrast between learning curves of CM3-Q (“minus” Q) and CM3 in Figure 3(a) shows that the centralized critic $Q$ and the gradient $\nabla_\theta J_{\text{global}}$ in the global view can improve the time taken to learn a cooperative policy. CM3-Q uses the same $\pi_1$, $V_1$, and neural network construction for bridging Stages 1-2 as CM3, but turns off the $Q$-function for training in E2, so that only the local view contributes to the policy gradient. The inability to represent joint actions leads to fully-decentralized learning in a non-stationary environment. One variant, labeled CM3-Q (1), trains the policy and value function simultaneously using all agents’ experiences in each transition. Another variant, CM3-Q (2), aims to improve stationarity by training a main policy with only one selected agent’s experiences while using a fixed policy for other agents, and periodically updating the fixed policy with the main policy. CM3 finds a cooperative solution (reward ) 10k episodes faster than CM3-Q (1), giving evidence that the global view promotes cooperation. Both learn faster than CM3-Q (2), showing that the benefits of stationarity were negated by the reduction in training experiences. Similar to findings in other work (Foerster et al., 2017; Zhang et al., 2018), both decentralized variants are able to learn a successful policy if given enough time.

(a) Learning curves averaged over 5 runs per method.
(b) Generalization performance (avg. over 100 episodes)
Figure 3: (a) CM3 converged to higher performance more than 45k episodes earlier than competitors in E2. All ablations except CM3 (direct) were initialized using the same policy trained in E1. (b) Policy learned by CM3 generalizes better to E and to configurations C1-C4 that rarely occur in E2.

CM3 outperforms a generalization of COMA [3] (Foerster et al., 2017) to the setting of multiple goals. Whereas COMA’s centralized critic only learns a shared task and does not represent agents’ differing goals, the generalization receives input $(s, a^{-n}, \mathbf{g})$, so that each output node represents $Q(s, (a^{-n}, a^n), \mathbf{g})$ for one action $a^n$. Due to the lack of a decentralized critic and curriculum, it learns much more slowly and with higher variance. CM3 also converged 45k episodes earlier than independent actor-critic (IAC) (Foerster et al., 2017), which is equivalent to removing both the curriculum and the centralized critic from CM3. We do not directly compare against MADDPG (Lowe et al., 2017) because it must undergo essential modifications for our problem setting, such as adapting the deterministic policy gradient to a discrete action space. However, MADDPG contains neither decentralized critics nor a two-stage curriculum with neural network construction, and our ablations show that these components give significant performance gains.

[3] To use the same neural network design and hyperparameters for a fair comparison, we generalized a variant of COMA with feed-forward policy networks and without using previous actions as input.

We investigated how well policies learned in E2 generalize to new cooperative settings in two ways: (1) running agents directly among SUMO-vehicles in E without training; (2) testing agents in E2 with difficult configurations of initial and goal lanes that require cooperation but which are rarely seen during training (C1-C4 described in Appendix B). Given equal total training episodes, Figure 3(b) shows that CM3 generalizes marginally better than IAC and outperforms COMA.

6 Conclusion

We presented a general framework called CM3 for cooperative multi-agent RL with differing agent goals. CM3 enables efficient exploration of a multi-agent state-action space via a two-stage curriculum bridged by neural network construction. It learns a double critic under local and global views of the policy gradient to address the need for cooperation while optimizing individual success. Experiments on negotiating lane merges in simulation show that CM3 learns significantly faster than other actor-critic methods with higher performance, and that each component contributes positively to the whole framework. Future work can generalize these ideas to learn separate policies for inhomogeneous agents, and further investigate the benefit of centralized learning in more difficult environments.


Appendix A Algorithm

procedure CM3
    for curriculum stage c = 1 to 2 do
        if c = 1 then
            Set number of agents to 1
            Initialize Stage 1 main networks and target network
        else if c = 2 then
            Instantiate agents
            Construct Stage 2 networks according to (11)
            Initialize centralized critic network and target networks
            Restore values of previously-trained parameters
        end if
        Set all target network weights equal to main network weights
        Initialize exploration parameter and empty replay buffer
        for each training episode do
            Assign goal(s) to agent(s) according to given distribution
            Get initial state and observation(s)
            for each time step do
                Sample action for each agent
                Execute action(s), receive reward(s) and next state
                Store transition into replay buffer
                Sample minibatch of transitions from replay buffer
                // Execute the following computations for all agents in parallel
                Compute TD target
                Update decentralized critic by minimizing its loss
                Compute policy gradient in local view (5)
                if c = 1 then
                    Update policy using the local-view gradient
                else if c = 2 then
                    Compute TD target
                    Minimize loss
                    Compute
                    Compute policy gradient in global view (7)
                    Update policy using the local-view and global-view gradients
                end if
                Update target network parameters
            end for
            If the exploration parameter is above its final value, then decrease it
        end for
    end for
end procedure
Algorithm 1 Cooperative multi-goal multi-stage multi-agent reinforcement learning (CM3)
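The decentralized-critic step of Algorithm 1 (compute a TD target, then minimize a loss against it) can be sketched as follows. This is a minimal NumPy illustration rather than the authors' implementation: the discount factor `gamma`, the `done` mask for terminal transitions, and the squared-error form of the loss are assumptions, and `v_next_target` stands for the target network's value at the next observation.

```python
import numpy as np

def td_target(rewards, v_next_target, done, gamma=0.99):
    """One-step TD target r + gamma * V'(next); bootstrapping is masked at terminal steps."""
    rewards = np.asarray(rewards, dtype=float)
    v_next_target = np.asarray(v_next_target, dtype=float)
    done = np.asarray(done, dtype=float)
    return rewards + gamma * (1.0 - done) * v_next_target

def critic_loss(v_pred, target):
    """Mean squared TD error over the minibatch (assumed loss form)."""
    v_pred = np.asarray(v_pred, dtype=float)
    return float(np.mean((target - v_pred) ** 2))

# Hypothetical minibatch of three transitions; the last one is terminal.
y = td_target(rewards=[0.0, -1.0, 10.0],
              v_next_target=[2.0, 4.0, 7.0],
              done=[0, 0, 1],
              gamma=0.5)
loss = critic_loss(v_pred=[1.0, 1.0, 8.0], target=y)
```

In a full implementation the loss would be minimized with respect to the main critic's parameters only, with the target network held fixed.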

Appendix B Generalization

| Config | Initial lanes | Goal lanes | Mean departure times (s) | CM3 | IAC | COMA |
|---|---|---|---|---|---|---|
| C1 | | | | 37.34 | 36.14 | -6.63 |
| C2 | | | | 38.02 | 37.39 | 18.75 |
| C3 | | | | 38.95 | 28.96 | -6.72 |
| C4 | | | | 38.42 | 37.43 | 14.45 |

Table 1: Test performance on rarely-occurring configurations of initial and goal lanes

Table 1 shows the sum of agents’ reward, averaged over 100 test episodes, on several special configurations of initial and goal lanes that require cooperation for success in E2. These configurations occur with low probability (1e-5) during training on E2. “-1” for initial lane indicates that the agent begins on the merge lane. Agents’ departure times are drawn from normal distributions with mean specified in the table and standard deviation 0.2s. Maximum possible sum of rewards is 40. IAC and COMA both received a total of 90k training episodes. For CM3, Stage 1 was trained for 10k episodes in E1 and Stage 2 was trained for 80k episodes in E2.

Appendix C Implementation details

Figure 4: Initial lane configuration and goal assignment in E2, which requires agents to perform a double-merge. Occurs with probability 0.8.

C.1 SUMO road architecture

We constructed a straight road of total length , consisting of five main lanes and one merge lane. Vehicles on the merge lane are able to merge onto main lanes during the segment , and the merge lane ends at . All lanes have width , and vehicles can be aligned along any of four sub-lanes within a lane, with lateral spacing . Legal speed limit was set to . In E, SUMO-controlled passenger cars and trucks (semitrailer) that behave according to the Krauß car-following model (Krajzewicz et al., 2012) are emitted on to main lanes with probability 0.5 per second. Simulation time resolution was per step. Supplementary file merge_stage3_dense.rou.xml contains all vehicle parameters, and defines the complete road architecture.

C.2 Markov game definition

Initial distribution.

In E1, the single agent’s initial lane and goal lane were sampled randomly from uniform distributions over the start and end lanes. In E2, with probability 0.2, all agents’ initial and goal lanes were sampled independently from uniform distributions over the start and end lanes; with probability 0.8, agents [1,2,3,4] were initialized with initial lanes [2,2,3,3] and goal lanes [4,4,0,0]. The spatial arrangement is shown in Figure 4. Departure times were drawn from normal distributions with mean and standard deviation .
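The E2 initial distribution can be sketched as a small sampler. This is an illustration under stated assumptions, not the authors' code: the 0.8 probability of the fixed double-merge configuration is taken from the Figure 4 caption, start lanes are assumed to be labeled -1 (merge lane) through 4, goal lanes 0 through 4, and the function name `sample_e2_lanes` is hypothetical.

```python
import random

START_LANES = [-1, 0, 1, 2, 3, 4]  # assumption: -1 denotes the merge lane
GOAL_LANES = [0, 1, 2, 3, 4]       # five main lanes
FIXED_INITIAL = [2, 2, 3, 3]       # double-merge configuration from the text
FIXED_GOALS = [4, 4, 0, 0]

def sample_e2_lanes(rng, n_agents=4, p_fixed=0.8):
    """Return (initial_lanes, goal_lanes) for one E2 episode."""
    if rng.random() < p_fixed:
        return list(FIXED_INITIAL), list(FIXED_GOALS)
    # Otherwise sample every agent's lanes independently and uniformly.
    initial = [rng.choice(START_LANES) for _ in range(n_agents)]
    goals = [rng.choice(GOAL_LANES) for _ in range(n_agents)]
    return initial, goals

rng = random.Random(0)
episodes = [sample_e2_lanes(rng) for _ in range(10000)]
fixed_fraction = sum(
    ep == (FIXED_INITIAL, FIXED_GOALS) for ep in episodes
) / len(episodes)
```

Over many episodes `fixed_fraction` should be close to 0.8, so most training episodes present the configuration that requires cooperation.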

Local observation. Each agent vehicle’s local observation consists of two components. The first component , used in all training stages, is a vector consisting of:

  1. agent’s speed normalized by

  2. normalized number of sub-lanes between agent’s current sub-lane and center sub-lane of goal lane

  3. normalized longitudinal distance to goal position

  4. binary indicator for being on the merge lane

  5. normalized distance to the next segment along the road (segment boundaries are )

The second component , is a discretized observation grid (i.e. picture) centered on the agent, with four channels:

  1. binary indicator of vehicle occupancy

  2. normalized relative speed between other vehicle and agent

  3. binary indicator of vehicle type being a passenger car

  4. binary indicator of vehicle type being a truck

Each channel is a matrix with dimensions , corresponding to visibility of forward and backward (with resolution ) and four sub-lanes to the left and right.

Global state. The global state vector is the concatenation of all agents’ observation component .

Goals. Each goal vector is a one-hot vector of length 5, indicating the goal lane at which agent should arrive once it reaches position . Goals are randomly sampled for all agents during each episode (see details of initial distribution above).

Actions. All agents have the same discrete action space, consisting of five options: no-op (maintain current speed and lane), accelerate (), decelerate (), shift one sub-lane to the left, shift one sub-lane to the right. Each agent’s action is represented as a one-hot vector of length 5.

Individual rewards. The reward for agent with goal is given according to the conditions:

  • for a collision (followed by termination of episode)

  • for time-out (exceed simulation steps during episode)

  • for reaching the end of the road and having a normalized sub-lane difference of from the center of the goal lane

  • for entering the merge lane from another lane during

  • for being in the merge lane during

  • if current speed exceeds

Global reward. The shared global reward is determined by: 1. if any collision occurred; 2. the average of all individual rewards of agents who reached the end of the road at time .
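The two rules for the global reward can be sketched as a single function. The collision value is not recoverable from the text above, so `collision_penalty=-10.0` is a placeholder assumption, as is returning 0 when no agent reached the end of the road at the current step.

```python
def global_reward(collision, finished_rewards, collision_penalty=-10.0):
    """Shared reward for one time step.

    collision: True if any collision occurred at this step.
    finished_rewards: individual rewards of agents that reached the end
        of the road at this step.
    collision_penalty: placeholder value (assumption).
    """
    if collision:
        return collision_penalty
    if not finished_rewards:
        return 0.0  # assumption: nothing to average at this step
    return sum(finished_rewards) / len(finished_rewards)

r_collision = global_reward(True, [10.0])
r_average = global_reward(False, [10.0, 5.0])
r_empty = global_reward(False, [])
```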

Appendix D Architecture

The policy network during Stage 1 feeds each of the inputs and to one fully-connected layer with 32 units. The concatenation is then fully-connected to a layer with 64 units, and fully-connected to a softmax output layer with 5 units, each corresponding to one discrete action. In Stage 2, the input observation grid is processed by a convolutional layer with 4 filters of size 5x3 and stride 1x1, flattened and fully-connected to a layer with 64 units, then fully-connected to the layer . ReLU nonlinearity was used for all hidden layers. Action probabilities are computed by lower-bounding the softmax outputs via , where is a decaying exploration parameter and .
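The Stage 1 policy forward pass can be sketched in NumPy as below. Several details are assumptions rather than facts from the text: the two inputs are taken to be the agent's 5-element self-observation and its 5-element one-hot goal, the weights are random placeholders, and the lower bound is implemented as a mixture with the uniform distribution, (1 - epsilon) * softmax + epsilon / 5, which is one common way to guarantee a minimum probability per action.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    z = z - np.max(z)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def init_params(rng, obs_dim=5, goal_dim=5, n_actions=5):
    """Random placeholder weights with the layer sizes described in the text."""
    def dense(n_in, n_out):
        return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)
    return {
        "obs": dense(obs_dim, 32),    # one 32-unit layer per input
        "goal": dense(goal_dim, 32),
        "hidden": dense(64, 64),      # concatenation -> 64 units
        "out": dense(64, n_actions),  # softmax over 5 discrete actions
    }

def policy_probs(params, obs, goal, epsilon):
    h_obs = relu(obs @ params["obs"][0] + params["obs"][1])
    h_goal = relu(goal @ params["goal"][0] + params["goal"][1])
    h = np.concatenate([h_obs, h_goal])
    h = relu(h @ params["hidden"][0] + params["hidden"][1])
    probs = softmax(h @ params["out"][0] + params["out"][1])
    # Assumed lower bound: every action keeps probability >= epsilon / n_actions.
    return (1.0 - epsilon) * probs + epsilon / probs.shape[0]

rng = np.random.default_rng(0)
params = init_params(rng)
obs = rng.random(5)
goal = np.eye(5)[2]  # one-hot goal lane 2
p = policy_probs(params, obs, goal, epsilon=0.5)
```

With this form the output remains a valid distribution for any epsilon in [0, 1], and exploration shrinks smoothly as epsilon decays.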

The decentralized critic during Stage 1 feeds each of the inputs and to one fully-connected layer with 32 units. The concatenation is then fully-connected to the output linear layer with a single unit. In Stage 2, the input observation grid is processed by a convolutional layer with 4 filters of size 5x3 and stride 1x1, flattened and fully-connected to a layer with 32 units, then fully-connected to the output layer of . ReLU nonlinearity was used for all hidden layers.

The centralized critic receives input , which is connected to two fully-connected layers with 128 units and ReLU activation, then fully-connected to a linear output layer with 5 units. The value of each output node is interpreted as the action-value for agent taking action and all other agents taking action . The agent label vector is a one-hot indicator vector, used as input to differentiate between evaluations of the Q-function for different agents.

IAC and COMA used the same neural networks as those in Stage 2 of CM3, except: 1. IAC does not use the centralized critic; 2. COMA does not use the decentralized critic. Both train completely in environment E2 from scratch without curriculum and neural network construction.

Double replay buffers were used as a heuristic to improve training stability for all algorithms on Stage 2. Instead of storing each environment transition immediately, an additional episode buffer was used to store all transitions encountered during each episode. At the end of each episode, the cumulative reward of all agents is compared against a threshold (32, in our experiments) to determine which of the two replay buffers receives the transitions in the episode buffer. For training, half of the minibatch is sampled from each of the two replay buffers.
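The double-buffer heuristic can be sketched as follows. The class and attribute names are hypothetical, and two choices are assumptions not stated in the text: the total capacity is split evenly between the two buffers, and an episode whose cumulative reward equals the threshold goes to the high-reward buffer.

```python
import random
from collections import deque

class DoubleReplayBuffer:
    def __init__(self, capacity=50_000, threshold=32.0, seed=0):
        # Assumption: capacity is split evenly between the two buffers.
        self.high = deque(maxlen=capacity // 2)
        self.low = deque(maxlen=capacity // 2)
        self.threshold = threshold
        self.episode = []  # temporary per-episode buffer
        self.rng = random.Random(seed)

    def add(self, transition):
        """Hold the transition until the episode outcome is known."""
        self.episode.append(transition)

    def end_episode(self, cumulative_reward):
        """Route the whole episode to one buffer based on its total reward."""
        target = self.high if cumulative_reward >= self.threshold else self.low
        target.extend(self.episode)
        self.episode = []

    def sample(self, batch_size):
        """Draw half of the minibatch from each buffer (fewer if a buffer is short)."""
        half = batch_size // 2
        return (self.rng.sample(list(self.high), min(half, len(self.high)))
                + self.rng.sample(list(self.low), min(half, len(self.low))))

buf = DoubleReplayBuffer()
for ep in range(4):
    for t in range(10):
        buf.add((ep, t))  # placeholder transition
    buf.end_episode(40.0 if ep % 2 == 0 else 5.0)
batch = buf.sample(8)
```

Sampling equally from both buffers keeps successful episodes represented in every minibatch even while they are still rare.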

Appendix E Derivations

Policy gradient. In Equation 5, we used the fact that . For each agent , define

Note that the full transition dynamics of the Markov game is accounted by conditioning on the joint action , but only the local reward is accumulated in these value functions. This means that all steps in the proof of the policy gradient theorem (Sutton et al., 2000) carry over to the multi-agent setting, by making the substitutions , , , .

State-dependent baseline. We replaced by the advantage function , which does not change the expectation in Equation 5 because:
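A sketch of the standard argument, in generic notation that is assumed here rather than taken from the paper's equations: \pi_\theta(a \mid o, g) denotes one agent's policy, and b(s) is any baseline that does not depend on that agent's action a.

```latex
\begin{aligned}
\mathbb{E}_{a \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a \mid o, g)\, b(s) \right]
&= b(s) \sum_{a} \pi_\theta(a \mid o, g)\, \nabla_\theta \log \pi_\theta(a \mid o, g) \\
&= b(s) \sum_{a} \nabla_\theta \pi_\theta(a \mid o, g) \\
&= b(s)\, \nabla_\theta \sum_{a} \pi_\theta(a \mid o, g)
 = b(s)\, \nabla_\theta 1 = 0 .
\end{aligned}
```

Choosing the state value as the baseline therefore turns the action-value into an advantage without changing the expected gradient.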

Approximation of . The temporal difference error is an unbiased estimate of the advantage function, since:
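A sketch of the standard argument, again in assumed generic notation, with V^\pi and Q^\pi the true value functions of the current policy and \gamma the discount factor:

```latex
\begin{aligned}
\mathbb{E}\!\left[ r_t + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_t) \,\middle|\, s_t, a_t \right]
&= \mathbb{E}\!\left[ r_t + \gamma V^{\pi}(s_{t+1}) \,\middle|\, s_t, a_t \right] - V^{\pi}(s_t) \\
&= Q^{\pi}(s_t, a_t) - V^{\pi}(s_t) = A^{\pi}(s_t, a_t) .
\end{aligned}
```

The identity holds exactly only for the true value function; with a learned critic in place of V^\pi, the TD error is an approximation whose bias depends on the critic's error.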

Appendix F Parameters

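| Parameter | Stage 1 | Stage 2 | CM3-V | CM3-Q (1) | CM3-Q (2) | CM3 (direct) | IAC | COMA |
|---|---|---|---|---|---|---|---|---|
| Episodes | 10k | 80k | 80k | 80k | 80k | 100k | 100k | 100k |
| Exploration parameter, initial value | 1.0 | 0.5 | 0.5 | 0.5 | 0.5 | 1.0 | 1.0 | 1.0 |
| Exploration parameter, final value | 0.01 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 |
| Exploration parameter, decrement per episode | 9.9e-5 | 5.6e-6 | 5.6e-6 | 5.6e-6 | 5.6e-6 | 9.5e-6 | 9.5e-6 | 9.5e-6 |
| Replay buffer | 10k | 50k | 50k | 50k | 50k | 50k | 50k | 50k |
| Minibatch size | 256 | 128 | 128 | 128 | 128 | 128 | 128 | 128 |
| Steps per train | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| Policy gradient weight alpha | 1.0 | 0.7 | 0 | 1.0 | 1.0 | 0.7 | N/A | N/A |
| Learning rate (decentralized critic) | 1e-3 | 1e-4 | N/A | 1e-4 | 1e-4 | 1e-4 | 1e-4 | N/A |
| Learning rate (policy) | 1e-4 | 1e-4 | 1e-4 | 1e-4 | 1e-4 | 1e-4 | 1e-4 | 1e-4 |
| Learning rate (centralized critic) | N/A | 1e-4 | 1e-4 | N/A | N/A | 1e-4 | N/A | 1e-4 |

Table 2: Parameters used for CM3, ablations, and baselines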

We used the Adam optimizer in TensorFlow with the learning rates in Table 2 for all experiments.
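The three exploration-related rows of Table 2 are consistent with a linear per-episode decay, which the short check below illustrates. Reading those rows as an initial value, a final value, and a per-episode decrement is an interpretation based on the arithmetic, and the function names are hypothetical.

```python
def epsilon_step(eps_start, eps_end, episodes):
    """Per-episode decrement that moves eps_start to eps_end in `episodes` episodes."""
    return (eps_start - eps_end) / episodes

def decay(eps, eps_end, step):
    """One per-episode update: decrease eps by `step` while it is above eps_end."""
    return max(eps_end, eps - step) if eps > eps_end else eps

stage1_step = epsilon_step(1.0, 0.01, 10_000)      # Stage 1 column
stage2_step = epsilon_step(0.5, 0.05, 80_000)      # Stage 2 column
baseline_step = epsilon_step(1.0, 0.05, 100_000)   # IAC and COMA columns

# Simulate the Stage 1 schedule over its full training length.
eps = 1.0
for _ in range(10_000):
    eps = decay(eps, 0.01, stage1_step)
```

The computed decrements round to 9.9e-5, 5.6e-6 and 9.5e-6, matching the tabulated values.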

Appendix G Stage 1

We trained Stage 1 of CM3 on the single-agent environment E1 for 10k episodes, with reward curve shown in Figure 5. Environment E1 uses the same SUMO road architecture and Markov game definitions as E2 (described in Appendix C), except that only a single agent is instantiated during each episode. Starting lane and goal lane of each episode were randomly sampled from a uniform distribution over lanes. Activated neural network modules are the Stage 1 policy network and decentralized critic, with architecture described in Appendix D. The resulting policy at episode 10k was used for neural network construction in Stage 2 by CM3, CM3-V, CM3-Q (1) and CM3-Q (2). The resulting critic at episode 10k was used for neural network construction in Stage 2 by CM3, CM3-Q (1) and CM3-Q (2). The ablation CM3 (direct) does not use such pretraining on E1.

Appendix H Interpolated policy gradient

Higher values of alpha assign more weight to the local policy gradient, while lower values assign more weight to the global view to encourage cooperation. We trained Stage 2 of CM3 in E2 across a range of alpha values, resulting in Figure 6. was sufficient for finding a successful policy. is most stable at later episodes, since optimizing for local rewards is feasible once agents have learned to cooperate. Lower alpha values allow slightly faster learning in earlier episodes.
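The weighting described above can be sketched as a convex combination of the two gradient estimates. The exact form alpha * local + (1 - alpha) * global is an assumption consistent with the description (alpha = 1 is purely local, alpha = 0 purely global), and the gradient vectors below are hypothetical placeholders.

```python
import numpy as np

def interpolated_gradient(grad_local, grad_global, alpha):
    """Blend local-view and global-view policy gradients with weight alpha on the local view."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    grad_local = np.asarray(grad_local, dtype=float)
    grad_global = np.asarray(grad_global, dtype=float)
    return alpha * grad_local + (1.0 - alpha) * grad_global

# Placeholder gradients for a two-parameter policy.
g = interpolated_gradient([1.0, 0.0], [0.0, 1.0], alpha=0.7)
g_local_only = interpolated_gradient([1.0, 0.0], [0.0, 1.0], alpha=1.0)
```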

Figure 5: Reward versus episode in single-agent environment E1. Maximum reward is 10.
Figure 6: Reward curves of CM3 in E2 over a range of alpha values. Moving average over 10 points.