Single-agent reinforcement learning (RL) algorithms have made significant progress in game playing and robotics; however, single-agent learning algorithms in multi-agent settings are prone to learning stereotyped behaviors that overfit to the training environment [7, 17]. There are several reasons why multi-agent environments are more difficult: 1) Interacting with an unknown agent requires either having multiple responses to a given situation or a more nuanced ability to perceive differences. The former breaks the Markov assumption; the latter rules out simpler solutions, which are likely to be found first. 2) The intentions and goals of other agents are not known and must be inferred, which can also break the Markov assumption. 3) Agents are co-evolving, and their policies are non-stationary during training. In this work we investigate a property associated with non-stationary policies: partially successful policies for one agent can be repeated multiple times, or “burned in”, causing the other agents to adapt to that specific behavior. This sets off a chain of mutual adaptation that encourages agents to visit only suboptimal regions of the state space.
When training independent agents, the competing learning processes of multiple agents are sufficiently difficult that either the agents fail to learn or the agents learn one after the other, resulting in stereotyped policies that are sensitive to the behavior of other agents. Recent approaches to multi-agent reinforcement learning have relaxed the independence of an agent by exploiting centralized training and assuming the other agents’ actions are known [28, 13]. While these techniques are more successful, we show that they still exhibit one-after-the-other learning, producing highly dependent policies.
We propose the use of an independent curriculum-based training procedure for learning policies that avoids mutual adaptation without using either centralized training or knowledge of the other agents’ actions. We start from the observation that when agents learn at different rates (as often happens under random initialization), one agent learns a policy around the other agents’ policies. Since the policies being accommodated are suboptimal during the early stages of learning, the agents drive each other into poor local optima. One solution to pacing the learning of multiple agents is self-play, but self-play requires symmetric agents. As an alternative, we consider learning via a curriculum. The use of curriculum learning lets us pace the learning of each agent while still handling the case of agents with different goals.
We structure our curriculum by first learning an optimal single agent and then explicitly learning the modifications required to interact with other agents. This approach leverages the intuition that when other agents are not present in the environment, the agent should behave like a single agent trying to reach a goal. We address the interaction-aware decision making problem in the second stage of the curriculum. In the second stage, we introduce an architecture that uses the learned single-agent policy, and adjusts it with a learned interactive multi-agent policy.
We consider two robotic applications where the agents’ policies must be mutually consistent in order to achieve the intended goals: a lane-change scenario and a mobile navigation scenario. The agents do not have access to the goals or intentions of other agents and are learning different policies simultaneously. We show that not only is our curriculum-based approach better able to learn the desired behaviors, but that the learned policies also generalize against agents that were not included in the training process.
II Related Work
Multiple works have focused on reinforcement learning methods for multi-agent domains in fully cooperative, fully competitive, and mixed environments. Foerster et al. propose a multi-agent actor-critic method to address the challenges of multi-agent credit assignment. Unlike our work, where each agent pursues its individual goal, their approach is appropriate for problems with a single shared task. Lowe et al. propose an actor-critic approach, called MADDPG, that augments the agent’s critic with the action policies of other agents and is able to successfully learn the coordination policy between multiple agents. Unlike our approach, these methods use centralized training.
He et al. present new models based on deep Q-networks for decision making in adversarial games that jointly learn a policy and the different strategies of opponents. Their proposed approach is appropriate for two agents, whereas our approach is tested on more than two agents. Furthermore, our approach is demonstrated in the non-stationary case where the agents are co-evolving, while the opponents in their approach have a set of fixed strategies. Some work leverages self-play to provide the agents with a curriculum for learning complex competitive tasks. That work uses dense rewards in the beginning phase of training to allow the agents to learn basic motor skills. In our environments, it is difficult to create motor skills or define a reward function for them. In our problems, self-play using single-agent RL algorithms failed to learn successful policies because the environment is non-stationary; moreover, without cooperation the agents could not succeed in their tasks even occasionally, and thus could not learn successful policies.
In social dilemma research, most works focus on one-shot or repeated tasks [2, 10] and ignore that in real-world scenarios behaviors are temporally extended. Among the most relevant RL approaches in this area is work on the sequential prisoner’s dilemma (SPD), which leverages deep Q-networks to study the effects of environmental parameters on the agents’ degree of cooperation. In contrast to that work, which focuses on the impact of the games’ parameters on the agents’ behavior, we provide a multi-agent learning approach for cooperative problems with individual goals. Another relevant work proposes a deep RL approach for mutual cooperation in SPD games. Their approach adaptively selects its policy with the proper cooperation degree based on the detected cooperation degree of an opponent. Unlike their approach, ours is not specific to two-player games. In addition, since we focus on mobile navigation problems, our approach learns how to react to the continuous policies of other agents, not just their cooperation degree.
A significant amount of work has focused on motion planning for autonomous vehicles, where the problem of intention prediction or trajectory estimation of other agents has been studied. Among these, relevant work has focused on intention-aware POMDP planning for autonomous vehicles. These methods leverage machine learning to learn models of other agents, an area that has also been surveyed. In contrast to these approaches, we focus on RL algorithms for interaction-aware agents, where the agents are co-evolving and their motion models depend on the others’ policies. Another work presents an approach for computing optimal trajectories for multiple robots in a distributed fashion.
We focus on multi-goal multi-agent settings, where each agent cooperates with other agents in order to accomplish its individual goal. We leverage the intuition that in many settings, such as autonomous driving, the interaction between multiple agents is limited to certain parts of the state space where a conflict of interest is present; elsewhere, the agent behaves according to its single-agent policy. That is, the agent starts with its own single-agent policy and adapts it to account for the other agents that appear in the environment. We propose a two-stage approach to learn multiple interactive policies for multiple agents. In the first stage of learning, each agent learns a single-agent policy to accomplish its individual goal. The learned single-agent policies are then passed to the multi-agent model, which enables each agent to learn an interactive policy that accounts for the other agents.
III-A Single Agent Module
We model an agent with an individual goal as a Markov Decision Process (MDP) with goals. The MDP is defined as a tuple comprising the state of the world, the agent’s observation, a set of actions, a transition function that determines the distribution over next states, the agent’s goal G, the goal distribution, an immediate reward function, and the discount factor. The solution to an MDP is a policy defined by a set of policy parameters. For continuous actions, the policy is assumed to be Gaussian, and in our work its mean is represented by a neural network with these parameters. The robot seeks to find a policy that maximizes the expected future discounted reward. In the following paragraphs, we add the subscript “self” or “s” to our notation to denote the agent’s own properties, and “single” or “sng” to highlight the single-agent scenario.
We use a decentralized actor-critic policy gradient algorithm to learn the single-agent policies. The single-agent model takes the agent’s observation and goal as inputs and outputs the action that the agent should take. Each agent learns a policy according to its individual goal-specific reward function in the absence of other agents. The decentralized actor-critic policy gradient algorithm maximizes the expected discounted reward by ascending the policy gradient, in which the expectation is taken over the state distribution and the update is weighted by the advantage function.
The gray boxes in Fig. 2 show the actor-critic model for the single agent.
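As a rough illustration of the actor update described above, the following Python sketch estimates an advantage-weighted score-function gradient, $\nabla_\theta J \approx \frac{1}{N}\sum_i \nabla_\theta \log \pi_\theta(a_i \mid o_i, g_i)\, A_i$, for a Gaussian policy. To stay short it assumes a linear mean and a fixed standard deviation, whereas the paper uses a neural-network mean and TRPO. The symbols in the formula and the names `policy_gradient_estimate`, `W`, and `sigma` are illustrative assumptions and are not taken from the authors' notation or code.

```python
import numpy as np

def policy_gradient_estimate(W, obs, goals, actions, advantages, sigma=1.0):
    """Advantage-weighted score-function gradient for a linear-Gaussian policy.

    Assumed (hypothetical) policy: a ~ N(W @ [obs, goal], sigma^2 I).
    Shapes: W (act_dim, in_dim), obs (N, obs_dim), goals (N, goal_dim),
    actions (N, act_dim), advantages (N,).
    """
    x = np.concatenate([obs, goals], axis=1)   # policy input: observation and goal
    mean = x @ W.T                             # predicted action means, (N, act_dim)
    # d/dW log N(a; W x, sigma^2 I) = ((a - W x) / sigma^2) x^T
    score = (actions - mean) / sigma**2
    weighted = score * advantages[:, None]     # weight each sample by its advantage
    return weighted.T @ x / len(x)             # average over samples, (act_dim, in_dim)

# Tiny hand-checkable example: one sample, one action dimension.
W = np.zeros((1, 2))
grad = policy_gradient_estimate(
    W,
    obs=np.array([[1.0]]),
    goals=np.array([[2.0]]),
    actions=np.array([[0.5]]),
    advantages=np.array([2.0]),
)
```

A gradient-ascent step would then move `W` in the direction of `grad`; a trust-region method such as TRPO additionally constrains the size of that step.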
III-B Multiple Agents Module
We assume that each agent has a noisy estimate of the other agents’ states but does not have access to the other agents’ actions or intentions. We model the multi-agent decision problem as a Markov game, modified to accommodate mixed goals and defined as a tuple over N agents. The possible configurations of the agents are specified by the set of states. Each agent receives an observation that includes both an observation of its own state and an observation of the other agents. Each agent has its own set of actions, a goal, and a reward function. The Markov game includes a transition function, which determines the distribution over next states. The solution to the Markov game is a policy for each agent, defined by that agent’s policy parameters. For continuous-action problems, each policy is assumed to be Gaussian, with the mean modeled by a neural network. Each agent seeks to find a policy that maximizes its own expected future discounted reward. In the following paragraphs, we simplify the notation and add the subscript “self” or “s” to denote the agent’s own properties. We add “multi” or “mlt” to highlight the multi-agent scenario, and the subscript “others” or “o” refers to the other agents’ properties.
We modify each agent’s reward function to account for the presence of other agents in the environment. Each agent is rewarded based on its individual objective but is punished if it gets into conflicts (e.g., collisions in mobile agent scenarios) with other agents. The new reward function for each agent is its individual reward minus a positive constant that penalizes the agent whenever a conflict is present.
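The conflict penalty described above can be sketched in a few lines of Python. This is a minimal illustration under the assumption that the penalty is subtracted from the individual reward whenever a conflict is detected. The function name, the argument names, and the default penalty value are hypothetical and do not come from the paper.

```python
def multi_agent_reward(individual_reward: float, in_conflict: bool, penalty: float = 1.0) -> float:
    """Individual reward minus a positive penalty when a conflict is present.

    `penalty` stands in for the positive constant mentioned in the text;
    its default value here is an arbitrary placeholder.
    """
    if penalty <= 0:
        raise ValueError("penalty must be a positive constant")
    # The conflict indicator is 1 when the agent conflicts with another agent
    # (for example a collision) and 0 otherwise.
    return individual_reward - penalty * float(in_conflict)
```

With no conflict the agent receives exactly its single-agent reward, which is consistent with the intuition that an agent should behave like a single agent when no other agents interfere.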
We use a decentralized actor-critic policy gradient algorithm to learn the multi-agent policies. Each agent learns an actor-critic model that accounts for the multiple agents in the environment. The model takes the agent’s own observation, its goal, and its observation of the other agents as inputs and outputs the action that the agent should take. Each agent learns a policy according to its multi-agent goal-specific reward function in the presence of other agents. The decentralized actor-critic policy gradient algorithm maximizes the expected discounted reward by ascending the policy gradient, in which the expectation is also taken over the other agents’ state distribution.
Fig. 2 shows our architecture for both reaching the individual goal and cooperative non-conflicting behavior. In this architecture, each agent leverages its learned single-agent actor-critic models. The learned model is frozen and combined with another multi-agent model that addresses the cooperative non-conflicting behavior.
The multi-agent module includes the single-agent (SA) module from the previous stage of the curriculum. The output of the single-agent module is combined with the output of a multi-agent (MA) module to learn a policy that modifies the single-agent value functions to account for the other agents. The agent’s own state and an estimate of the other agents’ states are passed to the actor and critic models of the multi-agent module. In Fig. 2, we show only the actor models for simplicity; the critic models have the same structure. In this work, we used a summation to combine the single-agent and multi-agent models since we found it sufficient in our experiments.
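The following NumPy sketch illustrates the summation of a frozen single-agent actor and a multi-agent actor. It is a simplified forward pass only, not the authors' implementation. The input dimensions, the random initialization, the choice to feed the goal to the multi-agent module, and all function names are assumptions made for illustration. The two hidden layers of 128 ReLU units follow the network description in the experiments section.

```python
import numpy as np

def init_mlp(rng, in_dim, out_dim, hidden=(128, 128)):
    """Randomly initialised MLP parameters as a list of (weights, bias) pairs."""
    sizes = (in_dim, *hidden, out_dim)
    return [(rng.normal(0.0, 0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_forward(params, x):
    """Forward pass with ReLU on hidden layers and a linear output layer."""
    for layer, (weights, bias) in enumerate(params):
        x = x @ weights + bias
        if layer < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

def combined_action_mean(sa_params, ma_params, own_obs, goal, others_obs):
    """Sum of the single-agent (SA) and multi-agent (MA) actor outputs.

    sa_params is treated as frozen: in stage two only ma_params would be updated.
    """
    sa_out = mlp_forward(sa_params, np.concatenate([own_obs, goal]))
    ma_out = mlp_forward(ma_params, np.concatenate([own_obs, goal, others_obs]))
    return sa_out + ma_out

# Hypothetical sizes: 5 own-state values, a 2-D goal, 5 values for one other agent.
rng = np.random.default_rng(0)
own_obs, goal, others_obs = rng.normal(size=5), rng.normal(size=2), rng.normal(size=5)
sa_params = init_mlp(rng, in_dim=7, out_dim=2)
ma_params = init_mlp(rng, in_dim=12, out_dim=2)
action_mean = combined_action_mean(sa_params, ma_params, own_obs, goal, others_obs)
```

One consequence of the additive design is that when the multi-agent output is zero, the combined policy reduces exactly to the single-agent policy.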
In this section, first we discuss our network architecture. We then delve into the details of the environments and experimental setup, and then discuss our results.
IV-A Algorithm and Network Architecture
We use ReLU as the activation function instead of the tanh used in the original implementation. The actor and critic models each have 2 hidden layers with 128 neurons. We call our approach Interaction-Aware TRPO, or IATRPO.
IV-B Simulation Environments
We tested our proposed approach on the following two environments. Both environments are designed such that interaction is required for successfully achieving the individual goals. The environments are shown in Fig. 3.
This environment consists of two cars that cross paths to reach their goal destinations (matching color stars). In each episode, a pair of non-adjacent goals is selected and kept constant throughout the episode. The agents’ positions are selected randomly in the top-left and bottom-left quarters of the course. The agents start with a velocity, angular velocity, and heading angle. We call this environment “C2”. We created an easier version of this environment where the goals are fixed to the 1st and 3rd lanes, and the agents’ position has randomness only in the direction. We call this environment “C2-fixed”.
The agent’s state includes its x and y position, velocity, angular velocity, heading angle, whether it is broken due to a collision with other agents or the environment, and whether it has reached its goal. The observation noise for is in . The cars have acceleration and angular acceleration as their actions with uniform noise in . The car can reach a minimum and maximum velocity of and respectively, and a minimum and maximum angular velocity of and respectively. The multi-agent module for each agent has access to the other agent’s x and y positions, velocity, heading, and angular velocity with a uniform noise in . The cars use a bicycle kinematics model.
The reward functions for the single-agent scenario and the multi-agent scenario are as follows with a reward scale . Function computes the Euclidean distance between the center of the car and the agent’s goal. Function or specify if the agent is in collision with the environment or the other agents respectively.
IV-B2 Multi-Robot Navigation
This environment consists of two or more mobile robots that cross one another’s paths to reach their goal destinations (matching color stars). Three fixed goals are located at the top, left, and right sides of the course. The agents’ positions are selected randomly in the left (goal in the right region of the environment), right (goal in the left region), and bottom (goal in the top region) of the course to ensure that the agents pass one another on the way to their goal positions. The 2-agent environment has the same setting, but the bottom (gray) agent is not present. We call the environments with 2 agents and 3 agents “R2” and “R3” respectively. The robots have the same state space and parameters as the cars in the lane-changing environment. The mobile robots use the unicycle kinematics model. The reward function is as before.
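For readers unfamiliar with the unicycle model, the sketch below shows one discrete-time update in Python with acceleration and angular acceleration as the actions, matching the action and state description given for the environments. The time step, the velocity limits, the integration scheme, and all names are placeholder assumptions, since this text does not specify them.

```python
import math
from dataclasses import dataclass

@dataclass
class UnicycleState:
    x: float
    y: float
    heading: float  # radians
    v: float        # linear velocity
    omega: float    # angular velocity

def clip(value, low, high):
    return max(low, min(high, value))

def unicycle_step(state, accel, ang_accel, dt=0.1,
                  v_limits=(0.0, 1.0), omega_limits=(-1.0, 1.0)):
    """One semi-implicit Euler step of a unicycle driven by accelerations.

    The limits and dt are illustrative placeholders, not the paper's values.
    """
    v = clip(state.v + accel * dt, *v_limits)                    # integrate and bound speed
    omega = clip(state.omega + ang_accel * dt, *omega_limits)    # integrate and bound turn rate
    heading = state.heading + omega * dt
    x = state.x + v * math.cos(heading) * dt
    y = state.y + v * math.sin(heading) * dt
    return UnicycleState(x, y, heading, v, omega)
```

For example, `unicycle_step(UnicycleState(0, 0, 0, 0, 0), accel=1.0, ang_accel=0.0)` moves the robot a short distance along the x axis while leaving its heading unchanged.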
We provide quantitative and qualitative results on the performance of our approach. As our baseline, we compare against MADDPG. We leveraged the main contribution of the MADDPG approach and implemented a multi-agent version of the TRPO algorithm (MATRPO) in which each agent is provided with the action values of the other agents. TRPO was used in place of DDPG because it consistently outperformed DDPG in all our experiments. TRPO was also used to train IATRPO for all stages of the curriculum.
IV-D Qualitative Results
Figures 4 and 5 illustrate the agents’ learned policies on the two environments. In both scenarios, all the agents must cross paths to reach their goal destinations (matching color stars). We show the single-agent policies with dashed lines and the multi-agent policies with solid lines. We ran the learned multi-agent models and observed that the agents successfully learn how to interact. We refer to the agents by their colors or their start positions.
In the multi-agent policy in Fig. 4, the bottom agent (orange, abbreviated as O) slows down to let the top agent (blue, or B) pass first and then proceeds to its goal, whereas its single-agent trajectory collides with the other agent. The top agent (B) also modifies the shape of its trajectory to avoid a collision with the bottom agent. In the multi-agent policy in Fig. 5, the bottom agent (gray, or G) learns to go first at maximum speed, while the right agent (B) slows down and modifies the shape of its trajectory to let the bottom agent (G) pass first. The left agent (O) modifies its speed to prevent a collision with the other agents. Notice that the single-agent policies differ in both speed and shape from the multi-agent policies; if all the agents executed their single-agent policies, they would collide with one another. Please refer to the video accompanying the paper to see examples of successful and failed executions.
|Environment||MATRPO success rate (%)||IATRPO success rate (%)|
IV-E Quantitative Results
We evaluate the IATRPO approach against the MATRPO approach and report the final results in three evaluations:
Success Rate: Table I shows the success rate of our approach against the MATRPO approach. We ran both approaches on the 4 environments with 5 random seeds. An episode is successful if both agents are away from their goals . To compute the success rate, the final learned policies were run on random episodes. Both the MATRPO and IATRPO approaches achieve a high success rate on the C2-fixed environment, but the success rate of the IATRPO algorithm is higher. On the C2 environment, which has a greater amount of randomness in the start position and has random goals, the MATRPO algorithm is not able to learn successful policies for all the agents, but the IATRPO algorithm has success rate. MATRPO learns a successful policy for one agent, but the second agent is stuck in a local optimum, having only learned not to collide with the environment or the successful agent. Most of the failure cases in IATRPO happen around the boundaries of the environments or when the agents are too close to one another. We believe this is because of the noise associated with both the agent’s actions and the observations of others.
The performance of the IATRPO algorithm is much higher than that of the MATRPO algorithm on the R2 environment. In half of the random runs the MATRPO algorithm did not learn a successful policy for both agents. However, the IATRPO algorithm has a success rate of around . The MATRPO algorithm completely failed to learn a successful policy on the R3 environment, but the IATRPO algorithm achieves a success rate of .
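The success-rate evaluation can be sketched as follows. The code assumes that an episode counts as successful when every agent ends within some distance threshold of its own goal. The threshold is left as a parameter because its value is not given in this text, and the function names and data layout are hypothetical.

```python
import math

def episode_success(final_positions, goals, threshold):
    """True if every agent finishes within `threshold` of its own goal."""
    return all(math.dist(pos, goal) <= threshold
               for pos, goal in zip(final_positions, goals))

def success_rate(episodes, threshold):
    """Percentage of successful episodes.

    `episodes` is a list of (final_positions, goals) pairs, one per test episode.
    """
    successes = sum(episode_success(pos, goals, threshold) for pos, goals in episodes)
    return 100.0 * successes / len(episodes)

# Two illustrative two-agent episodes: the second agent misses its goal in episode 2.
episodes = [
    ([(0.0, 0.0), (1.0, 1.0)], [(0.0, 0.1), (1.0, 1.0)]),
    ([(0.0, 0.0), (5.0, 5.0)], [(0.0, 0.1), (1.0, 1.0)]),
]
rate = success_rate(episodes, threshold=0.5)
```

Requiring all agents to succeed in the same episode is what makes the metric sensitive to the failure mode described above, where one agent converges and the other does not.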
Level of Interaction: We measure how interactive the final policies learned by the IATRPO and MATRPO approaches are. We ran both approaches on the C2-fixed and R2 environments, where MATRPO was able to learn successful policies. We performed 5 training runs with random seeds and tested the final learned policies on random episodes. We estimate how interactive the policies are by finding which agent reached its goal first. For each agent, we compute the average and standard deviation of the interactiveness metric over the 5 runs. If, in a given run, one agent always waits for the other agent to go first and compromises, the agents are considered non-interactive. In a two-agent scenario, the ideal case is for both agents to reach their goals first about equally often, with a low standard deviation. Fig. 7 shows the results of the two algorithms on the two environments. In each run on the C2-fixed environment where we applied the MATRPO algorithm, one agent always waited for the other agent to go first. In contrast, the IATRPO algorithm learned more interactive policies than MATRPO, with both agents sometimes compromising. Fig. 8 provides more evidence of why, in the MATRPO training, one agent always compromises.
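One way to compute such a first-arrival statistic is sketched below in Python with NumPy. It assumes that each training run yields an array of goal-arrival times with one row per test episode and one column per agent. This data layout, the tie-breaking rule, and the function names are assumptions made for illustration.

```python
import numpy as np

def first_arrival_share(arrival_times):
    """Fraction of episodes in which each agent reaches its goal first.

    arrival_times: array of shape (episodes, agents). Ties go to the agent
    with the lowest index because of how argmin breaks ties.
    """
    arrival_times = np.asarray(arrival_times)
    winners = np.argmin(arrival_times, axis=1)
    counts = np.bincount(winners, minlength=arrival_times.shape[1])
    return counts / len(arrival_times)

def interactiveness(runs):
    """Mean and standard deviation of the first-arrival share across training runs."""
    shares = np.stack([first_arrival_share(run) for run in runs])
    return shares.mean(axis=0), shares.std(axis=0)

# Two illustrative runs with two agents and two test episodes each.
runs = [
    [[1, 2], [2, 1]],   # each agent arrives first once
    [[1, 2], [1, 3]],   # agent 0 always arrives first
]
mean_share, std_share = interactiveness(runs)
```

Under this sketch, a run in which one agent always yields gives shares of 1 and 0, so a mean near one half with a low standard deviation indicates that the agents take turns compromising.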
Fig. 8 shows the mean episode length of both the algorithms in one of the training runs on the C2-fixed environment. When the mean episode length becomes constant, the agent has converged to a successful policy. In the MATRPO training, the orange agent converges to the successful policy and after about 800 training iterations the blue agent adapts its policy to the orange agent’s policy and converges as well. However, with IATRPO the two agents learn a successful policy around the same time.
We applied both algorithms to the R2 environment and noticed that with the IATRPO algorithm the two agents are better balanced in how often each achieves first place. The agents are less balanced when using the MATRPO approach and have higher variance than with the IATRPO approach.
We also investigate the influence of our two-stage approach on the interactiveness of the agents. We measured the distance between the single-agent trajectories and the multi-agent trajectories to quantify how much each agent modified its trajectory to account for the other agents in the IATRPO algorithm. We use the Fréchet distance for our comparison. As before, we use the final learned policy, run it on random episodes, and compute the distance between the single-agent module’s trajectory and the multi-agent module’s trajectory. For each agent, we average the computed distance and use that to compute the overall compromise () that each agent makes compared to other agents. Fig. 6 shows the results on the C2 and R3 environments. Although the top agent gets the first place of the times, the change in the distance is almost equal for the two agents in the C2 environment.
In the R3 environment, the bottom agent (G) always arrives first at the goal, but the distance between its single agent trajectory and multi-agent trajectory is about . This implies that the bottom agent is also trying to adapt its policy to the other agents’ policies. The right agent and the left agent get the second place and respectively. The overall impact of the right agent (B) on distance is compared to the left agent (O) . This implies that both agents change their single agent policies, the right agent mostly changes the shape of its policy and the left agent mostly changes its speed to account for the other agents.
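To make the trajectory comparison concrete, the following Python function computes the discrete Fréchet distance between two sampled trajectories using the standard dynamic program. The paper does not state whether the discrete or continuous variant was used, so treating the trajectories as sequences of 2-D waypoints is an assumption, as is the function name.

```python
import math

def discrete_frechet(p, q):
    """Discrete Fréchet distance between two polylines given as lists of points."""
    n, m = len(p), len(q)
    # ca[i][j] is the coupling distance between the prefixes p[:i+1] and q[:j+1].
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                best_prev = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_prev, d)
    return ca[-1][-1]

# Two parallel straight-line trajectories one unit apart.
single_agent_traj = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
multi_agent_traj = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
distance = discrete_frechet(single_agent_traj, multi_agent_traj)
```

Because the Fréchet distance respects the ordering of points along each path, it reflects changes in the shape of a trajectory; a change that only affects speed along the same path would need time-indexed samples to show up.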
|Environment||MATRPO success rate (%)||IATRPO success rate (%)|
We conducted experiments that examine the performance of agents that were not trained together. This scenario is known in the literature to cause agents to fail due to dependent policies [7, 17]. We used the 5 random training runs and generated 20 pairs (or triples) of agents, where the agents in each pair were trained separately using different seeds. Table II shows the success rate of the MATRPO and IATRPO algorithms on the four environments. We used the same approach as above to compute the success rate on 1000 random episodes. The performance of the MATRPO algorithm is not affected much in the C2-fixed experiments, but it decreases drastically in the R2 experiments. We believe the reason for this behavior is the following. In the C2-fixed experiments with MATRPO, the bottom agent (O) learns to always go first regardless of what the top agent (B) is doing. Even when the bottom agent is tested against other top agents (Bs), both agents show the same behavior, so the performance of the algorithm is not affected. However, in the R2 experiments with MATRPO, the agents show more interactive behavior than in C2-fixed, so when agents that were not trained together are tested against each other, the performance decreases drastically.
The success rate of the IATRPO algorithm also decreases when we test it on the 20 pairs (or triples). The performance degrades less on the easier environments, such as C2-fixed, and on the environments where the agents learned a more interactive policy with low variance, such as R2. The success rate for C2 and R3 degrades more than for R2 since the agents learned a less interactive behavior with high variance.
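A simple way to build such cross-run pairings is sketched below in Python. It enumerates every ordered pair of distinct training runs, taking the first agent from one run and the second agent from another. With 5 runs this gives 20 pairs, which matches the count reported above, although the paper does not state that this is how the pairs (or the triples) were chosen. The function name is hypothetical.

```python
from itertools import permutations

def cross_run_pairs(num_runs=5):
    """Ordered pairs (run of agent 1, run of agent 2) with distinct training runs."""
    return list(permutations(range(num_runs), 2))

pairs = cross_run_pairs()
# Each entry says which training run supplies each agent's final policy,
# e.g. (0, 3) tests agent 1 from run 0 against agent 2 from run 3.
```

Each pair would then be evaluated with the same success-rate procedure used for agents that were trained together.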
We focus on multi-agent settings where each agent learns a policy to simultaneously achieve its individual goal and interact with others. We provide a curriculum learning approach and an architecture that learn how to adapt single-agent policies to the multi-agent setting. We tested the method on two robotics problems and observed that our approach outperforms the state-of-the-art approach and results in interactive policies.
Our formulation is generalizable to domains with inhomogeneous agents since we make no assumptions regarding the homogeneity of the agents, and our future work involves testing the approach on such domains. One method that we are planning to try is the ensemble of policies method proposed in . If a model is learned in an environment with agents, we can apply the same model on an environment with agents where we assume the non-existent agent is in a corner and is not interacting with others. However, a limitation of our work is that a new model should be learned if we increase the number of agents. We believe this issue can be addressed by leveraging an approach that is agnostic to the number of agents such as .
-  (2015) Kinematic and dynamic vehicle models for autonomous driving control design. In IV, Cited by: §IV-B1.
-  (2015) New winning strategies for the iterated prisoner’s dilemma. In AAMAS, Cited by: §II.
-  (2016) A survey of motion planning and control techniques for self-driving urban vehicles. IV. Cited by: §II.
-  (2017) Multi-agent reinforcement learning in sequential social dilemmas. In AAMAS, Cited by: §II.
-  (2016) Opponent modeling in deep reinforcement learning. In ICML, Cited by: §II.
-  (1999) Kinematic time-invariant control of a 2D nonholonomic vehicle. In CDC, Cited by: §IV-B2.
-  (2017) Can deep reinforcement learning solve Erdos-Selfridge-Spencer games?. arXiv preprint arXiv:1711.02301. Cited by: §I, §IV-E.
-  (2010) Multi-agent path planning with multiple tasks and distance constraints. In ICRA, Cited by: §II.
-  (2015) Intention-aware online POMDP planning for autonomous driving in a crowd. In ICRA, Cited by: §II.
-  (2015) Introducing decision entrustment mechanism into repeated bilateral agent interactions to achieve social optimality. AAMAS. Cited by: §II.
-  (1994) Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings, Cited by: §III-B.
-  (2017) Emergent complexity via multi-agent competition. arXiv preprint arXiv:1710.03748. Cited by: §II, §V.
-  (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. In NIPS, Cited by: §I, §II, §IV-C.
-  (2015) Trust region policy optimization. In ICML, Cited by: §III-A, §IV-A, §IV-C.
-  (2015) Human-level control through deep reinforcement learning. Nature. Cited by: §I.
-  (2016) Mastering the game of Go with deep neural networks and tree search. Nature. Cited by: §I.
-  (2017) A unified game-theoretic approach to multiagent reinforcement learning. In NIPS, Cited by: §I, §IV-E.
-  (2017) OpenAI baselines. GitHub. Note: https://github.com/openai/baselines Cited by: §IV-A.
-  (2013) Reinforcement learning in robotics: a survey. IJRR. Cited by: §I.
-  (2008) A comprehensive survey of multiagent reinforcement learning. IEEE SMC-Part C. Cited by: §II.
-  (2015) Universal value function approximators. In ICML, Cited by: §III-A.
-  (2018) Autonomous agents modelling other agents: a comprehensive survey and open problems. Artificial Intelligence. Cited by: §II.
-  (1993) Multi-agent reinforcement learning: independent vs. cooperative agents. In ICML, Cited by: §I.
-  (2018) Towards cooperation in sequential prisoner’s dilemmas: a deep multiagent reinforcement learning approach. arXiv preprint arXiv:1803.00162. Cited by: §II.
Zero shot transfer learning for robot soccer. In AAMAS, Cited by: §V.
-  (2009) Curriculum learning. In ICML, Cited by: §I.
-  (1989) Markov decision processes. EJOR. Cited by: §III-A.
-  (2017) Counterfactual multi-agent policy gradients. arXiv preprint arXiv:1705.08926. Cited by: §I, §II.