I Introduction and Related Work
Reinforcement Learning (RL) algorithms, when trained with the correct reward functions and under training conditions favorable for learning, have shown surprisingly impressive results. Some popular examples are mastering the game of Go , playing Atari games  and, very recently, defeating the world's top professionals in DOTA . RL has also shown promising results in learning driving behaviors for single agents , , , , , . Inspired by the same, this work focuses on behavior-based learning in multi agent settings.
A lot of prior work exists on multi agent systems [10, 11, 12]. The major paradigms include frameworks which use inter-agent communication [13, 14], frameworks which learn in a decentralized manner  and ones which learn in a centralized manner ; there also exist frameworks where training is centralized but testing is decentralized [17, 18]. Centralized learning refers to learning actions jointly for all the agents: the input to the algorithm is the observation and action of every agent, which results in a major disadvantage, the exponential growth of the state space with the number of agents. Secondly, centralized approaches are centralized not only during training but during testing as well, resulting in higher resource requirements for actual deployments. Unlike centralized approaches, in a concurrent learning setting multiple agents in the same environment learn independently: each of them has its own networks, policies, observations and actions. This is equivalent to running multiple single-agent learners in the same environment. The disadvantages of this approach are the huge number of parameters and that no advantage is drawn from the fact that the agents are learning together. Lastly, since each agent learns independently, the environment is non-stationary from any one agent's perspective, which can lead to instability.
When similar agents are learning similar behaviors, their parameters can be shared to increase the speed of learning and to decrease the complexity and resource utilization of the algorithm. This concept of parameter sharing was first introduced by Tan et al. in . The authors showed that if cooperation is done intelligently, each agent can benefit from other agents' instantaneous information, episodic experience and learned knowledge. Sharing learned policies and episodes between agents can speed up the whole learning process. Policies can be shared between homogeneous agents only, but if episodes can be interpreted, heterogeneous agents can also benefit from sharing them.
Chu et al. have shown parameter sharing in special cases in . Recently, Gupta et al.  introduced parameter-sharing extensions of three popular RL algorithms: Deep Q-Network (DQN) , Asynchronous Advantage Actor-Critic (A3C)  and Trust Region Policy Optimization (TRPO) . Their results yield a scalable cooperative reinforcement learning algorithm, Parameter-Sharing TRPO, and also show that policy gradient methods outperform temporal-difference and actor-critic methods. Inspired by their success, we have developed a Parameter Sharing Deep Deterministic Policy Gradients (DDPG)  architecture, in which the agents share their Actor and Critic networks.
The proposed architecture was applied to traffic-agent behaviors, where multiple homogeneous agents learn a similar behavior trained by the asynchronous and cumulative efforts of all agents. The primary motivation behind the proposed work is to develop a method which can be used to generate behavior-based traffic in simulators. With the rising interest in autonomous driving research, simulated environments provide a fast and risk-free way to develop and test algorithms. To the best of our knowledge, this is the first work that targets behavior-based multi agent learning using deep RL. Further, the results indicate faster training and scalable learning, which can be tested with a varied number of agents, independent of the number of agents trained earlier. Importantly, the architecture is able to learn multiple behaviors simultaneously using a single pair of Actor and Critic networks.
The rest of the paper is organized as follows: Section II explains the architectural details of our proposed approach, Section III contains the implementation details specific to behavior learning for driving agents, and finally Section IV shows the results of various experiments with our approach.
II Proposed Architecture
Multi agent learning is a challenging task because of the dynamic nature of the environment: each agent explores the environment in an attempt to learn a policy, which increases the complexity of learning for the other agents. Above everything, learning for multiple agents involves a large number of parameters and high resource requirements, which further limits the performance of the algorithms. In this section we propose and explain an architecture that addresses these problems using the concept of parameter sharing. The proposed architecture is based on Deep Deterministic Policy Gradients (DDPG) , one of the first RL algorithms targeting problems in continuous action spaces. It has shown promising results in wide-ranging domains: humanoids , controlling a bicycle  and, most relevant here, driving on tracks  and overtaking in the presence of other cars .
The proposed architecture is shown in Fig. 1. As shown in the figure, both the Actor and the Critic Network are shared among all the agents. Apart from these, we also maintain a shared Replay Buffer, which stores the experiences from all the agents. Each agent has its own copy of its state information, its observations from the environment, the actions it takes and the corresponding rewards; for a given agent, this information is not known to any other agent. However, the data stored in the Replay Buffer is not distinguished by agent, hence each agent benefits from the experiences of all agents. Finally, the actor and critic networks are updated asynchronously by each agent at every step. The update equations are given as follows:
The Critic Network learns by minimising the loss between the target and the current Q value:

$$L = \frac{1}{N}\sum_i \left(y_i - Q(s_i, a_i \mid \theta^Q)\right)^2, \qquad y_i = r_i + \gamma\, Q'\!\left(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\right) \tag{1}$$

where $r_i$ is the reward for the agent at the $i$-th timestep, $y_i$ is the target Q value for the state-action pair, where the action $\mu'(s_{i+1})$ is obtained from the target actor network, $Q(s_i, a_i \mid \theta^Q)$ is the Q value from the learned network, $N$ is the batch size and $\gamma$ is the discount factor.
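As a small numerical sketch of this update, the critic loss can be computed as follows; the batch values are made up for illustration, and the target-network output $Q'(s_{i+1}, \mu'(s_{i+1}))$ is assumed to be already available as an array:

```python
import numpy as np

# Sketch of the critic loss in Eq. 1 for a batch of N transitions.
# q_target_next stands in for Q'(s_{i+1}, mu'(s_{i+1})) from the target networks.
def critic_loss(rewards, q_target_next, q_current, gamma=0.99):
    """Mean squared error between y_i = r_i + gamma * Q'(...) and Q(s_i, a_i)."""
    y = rewards + gamma * q_target_next        # targets y_i from target networks
    return np.mean((y - q_current) ** 2)       # average over the batch of size N

# Illustrative batch of size N = 2 (values are dummies, not from the paper).
rewards = np.array([1.0, 0.5])
q_next = np.array([2.0, 1.0])
q_curr = np.array([2.9, 1.5])
loss = critic_loss(rewards, q_next, q_curr)
```

In a full implementation, `q_next` and `q_curr` would come from forward passes of the target and learned critic networks respectively, and the loss would be minimised with respect to $\theta^Q$ by gradient descent.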
The Actor Network weights are updated as:

$$\nabla_{\theta^\mu} J \approx \frac{1}{N}\sum_i \nabla_a Q(s, a \mid \theta^Q)\big|_{s=s_i,\, a=\mu(s_i)}\; \nabla_{\theta^\mu}\mu(s \mid \theta^\mu)\big|_{s=s_i}, \qquad \theta^\mu \leftarrow \theta^\mu + \alpha\, \nabla_{\theta^\mu} J \tag{2}$$

where $N$ is the batch size, $\theta^Q$ are the critic network parameters, $\theta^\mu$ are the actor network parameters and $\alpha$ is the learning rate. The rest of the terms have the same meaning as those in Eq. 1.
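A toy version of this chain-rule update, using a linear actor $a = w_\mu \cdot s$ and a critic that is linear in the action so both gradients are exact; all names, shapes and values are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

# Deterministic policy gradient (Eq. 2) for a 1-D action:
# grad_J ~ (1/N) * sum_i  dQ/da |_{a=mu(s_i)}  *  dmu/dw |_{s_i}
def actor_gradient(states, w_mu, w_q_action, lr=0.01):
    """Returns actor weights after one gradient-ascent step on expected return."""
    grad = np.zeros_like(w_mu)
    for s in states:
        dq_da = w_q_action          # for a critic linear in a, dQ/da is constant
        dmu_dw = s                  # for a = w_mu . s, dmu/dw = s
        grad += dq_da * dmu_dw
    grad /= len(states)             # average over the batch of size N
    return w_mu + lr * grad         # ascent: actor maximizes Q

# Illustrative batch of N = 2 states.
states = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
w_mu = np.zeros(2)
w_new = actor_gradient(states, w_mu, w_q_action=2.0)
```

With neural networks, the two gradient factors are produced by backpropagation through the critic (with respect to the action) and through the actor (with respect to its weights).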
The reward function has to be expressed as one standard function for all the agents, independent of the behavior they learn. Additionally, the reward function should only contain variables which can be derived from the state information and the observation of the agent. This condition makes sure that the experiences in the replay buffer generalize to all agents.
The proposed setting is highly advantageous over the multiple-DDPG setting (i.e., learning for multiple agents in the same environment using an independent DDPG for each of them). Firstly, in each step every agent updates the networks, hence the speed of training is increased $n$ times, where $n$ is the total number of learning agents. In a multiple-DDPG setting, since each agent maintains a separate Actor and Critic, only one update of the corresponding networks is possible in each step; hence it takes longer to converge. Moreover, because of the multiple Actor and Critic networks, a very large number of parameters is present in the architecture.
Secondly, the agents in the proposed architecture use a shared replay buffer, which increases the diversity of experience for all the agents. This way, the learned behavior of an agent does not depend only on the experiences it sees itself, but on the experiences of all the agents being trained. Sharing is possible since the agents are homogeneous in their properties. In contrast, the multiple-DDPG setting has no shared replay buffer and each agent depends only on its individual experiences, even when the agents are homogeneous. This is another drawback of that setting: even though the agents learn in a multi agent environment, they do not exploit it for faster learning.
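The sharing described in this section can be sketched in a few lines of code; the class and variable names are our own illustrative choices, not the paper's implementation:

```python
import random
from collections import deque

import numpy as np

class SharedReplayBuffer:
    """One buffer for all agents; stored samples are indistinguishable by agent."""
    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)
    def add(self, state, action, reward, next_state):
        self.data.append((state, action, reward, next_state))
    def sample(self, batch_size):
        return random.sample(list(self.data), min(batch_size, len(self.data)))

class Agent:
    """Holds only per-agent data; networks and buffer are shared references."""
    def __init__(self, actor, critic, buffer):
        self.actor, self.critic, self.buffer = actor, critic, buffer

# Placeholder parameter containers standing in for the actor/critic networks.
shared_actor = {"weights": np.zeros(4)}
shared_critic = {"weights": np.zeros(4)}
buffer = SharedReplayBuffer(capacity=10000)

# All 6 agents reference the SAME actor, critic and replay buffer objects,
# so an update performed by any agent is immediately visible to every other.
agents = [Agent(shared_actor, shared_critic, buffer) for _ in range(6)]
```

Because the agents hold references rather than copies, the parameter count does not grow with the number of agents, and each agent's transitions enter the common buffer that all of them sample from.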
III Implementation Details
We use a modified version of TORCS called Gym-TORCS , which supports the development of RL algorithms. The agent car used is "scr_server". We use an NVIDIA GeForce GTX 1080 GPU for training.
For individual behavior learning (either lanekeeping or overtaking), the state vector is a 65-sized array consisting of the following sensor data:
Angle between the car and the axis of the track.
Track Information: Readings from 19 sensors with a 200 m range, placed every 10° on the front half of the car. They return the distance to the track edge.
Track Position: Distance between the car and the axis of the track, normalized with respect to the track width.
SpeedX, SpeedY, SpeedZ: speed of the car along its longitudinal, transverse and vertical axes.
Wheel Spin Velocity of each of the 4 wheels.
Rotations per minute of the car engine.
Opponent information: Array of 36 sensor values, each corresponding to the distance of the nearest opponent within a range of 200 meters, located at intervals of 10°, spanning the complete car.
Further details about each of these sensor readings can be found in . The action vector consists of continuous values for steering (-1, 1), acceleration (0, 1) and brake (0, 1).
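For illustration, the 65 dimensions listed above can be assembled as follows; the dictionary keys mirror the usual scr_server sensor names, and the dummy values are our own:

```python
import numpy as np

# Assemble the 65-dimensional state vector from a TORCS observation dict.
# Dimensions: 1 (angle) + 19 (track) + 1 (trackPos) + 3 (speeds)
#           + 4 (wheel spin) + 1 (rpm) + 36 (opponents) = 65.
def build_state(obs):
    state = np.concatenate([
        [obs["angle"]],                                  # angle to track axis
        obs["track"],                                    # 19 range finders
        [obs["trackPos"]],                               # normalized lane offset
        [obs["speedX"], obs["speedY"], obs["speedZ"]],   # velocity components
        obs["wheelSpinVel"],                             # 4 wheel spin velocities
        [obs["rpm"]],                                    # engine rpm
        obs["opponents"],                                # 36 opponent distances
    ])
    assert state.shape == (65,)
    return state

# Dummy observation with plausible shapes (values are illustrative only).
dummy = {"angle": 0.0, "track": np.ones(19), "trackPos": 0.0,
         "speedX": 1.0, "speedY": 0.0, "speedZ": 0.0,
         "wheelSpinVel": np.zeros(4), "rpm": 0.3,
         "opponents": np.full(36, 200.0)}
state = build_state(dummy)
```

For multi-behavior learning (Section III-A3), one extra entry holding the agent id would be appended, giving the 66-dimensional vector.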
III-A Reward Functions for the Behaviors Learned
For all of our experiments we have used two main reward functions, both inspired by the work done in .
Lanekeeping is the behavior in which the agent drives straight along the road, motivated by the distance it moves along the lane in each step. The reward function to learn this behavior is given by:

$$R_{\text{lanekeeping}} = v_x \cos\theta - v_x \sin\theta$$

where $v_x$ denotes the longitudinal velocity of the car and $\theta$ denotes the angle between the car and the track axis. We give a positive reward when the car moves forward along the track axis, given by $v_x \cos\theta$, and a negative reward when it moves laterally, i.e. perpendicular to the track axis, given by $v_x \sin\theta$. This function can by itself handle negative-impact conditions like collisions and off-track drifting, since on colliding with walls or other agents the ego vehicle's velocity decreases, and hence so does the term above. However, whether the decrease is significant or not, the velocity has a high probability of remaining positive, so the learning algorithm would need a large number of episodes to understand that collisions are bad. To speed up learning, we introduce extra reward terms for such undesired cases.
Off-track drifting, for instance, is penalized with an additional negative reward.
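A sketch of this reward in code; the penalty magnitude for the undesired cases is an assumed placeholder value, not the paper's constant:

```python
import numpy as np

# Lanekeeping reward: forward progress along the track axis is rewarded,
# lateral motion is penalized, and undesired cases (collision, off-track
# drifting) receive an extra negative reward with an ASSUMED magnitude.
def lanekeeping_reward(v_x, angle, collided=False, off_track=False):
    r = v_x * np.cos(angle) - v_x * np.sin(angle)
    if collided or off_track:
        r -= 1.0        # illustrative penalty, not the paper's value
    return r
```

Driving straight at speed 10 with zero angle yields a reward of 10, while the same step with a collision yields 9, so the agent sees the collision immediately instead of only through the slow decay of its velocity.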
III-A3 Multi-Behavior Learning
By multi-behavior, we mean learning multiple behaviors simultaneously using a single instance of the architecture. For multi-behavior learning, the types of agents have to be distinguished somehow; for this, we gave them ids. Agents which had to learn the overtaking behavior were given id 1 and the lanekeeping agents were given id 0. The state vector was extended from 65 to 66 dimensions by the addition of the id. The reward function should be a single equation using only terms derivable from the observation or state vector of the agent. Following this, the overtaking reward augments the lanekeeping reward with a race-position term, $(n - \text{racePos})$.
For a given training, $n$ is the total number of agents present in the simulator, which is a constant term and hence satisfies the requirement of permissible variables in the reward function. Next, $\text{racePos}$ is a term TORCS provides as an observation for each agent, so this variable also satisfies the requirement.
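A sketch of how one shared reward function could branch on the agent id appended to the state; the exact combination and weighting of the terms is our illustrative assumption:

```python
import numpy as np

# Single shared reward for multi-behavior learning. The id (0 = lanekeeping,
# 1 = overtaking) selects between the plain lanekeeping term and a version
# augmented with the race-position term (n - racePos). Every input is
# derivable from the agent's own state/observation, as the text requires.
def shared_reward(agent_id, v_x, angle, n_agents, race_pos):
    progress = v_x * (np.cos(angle) - np.sin(angle))   # lanekeeping term
    if agent_id == 0:
        return progress                                # lanekeeping agent
    return progress + (n_agents - race_pos)            # overtaking agent
```

With 6 agents in the scene, an overtaking agent in position 3 driving straight at speed 10 would receive 10 + (6 − 3) = 13, while a lanekeeping agent in the same situation receives 10.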
IV-A Lanekeeping Behavior
Figure 2 depicts the results of our architecture. We trained the lanekeeping behavior with 6 agents and tested it with as many as 20. The agents moved harmoniously with minimal collisions and followed the lane, staying near the middle most of the time.
Table II shows how the learning evolves with training episodes. At episode 0, the agents are just starting to train, hence the reward during testing is 0. By episode 300, the agents have learned to drive on the lane, with collisions occurring only around 11% of the time; after 600 episodes of training, the average sum of rewards of the complete system has improved, while collisions remain approximately the same. From 300 to 600 episodes, the training results have almost saturated. The reward indicates the sum of rewards of all agents over an episode. While training, there were 6 learning agents in the environment.
No. of Training Episodes | Sum of Reward of all agents | %colliding steps in the system | Observations
0   | 0     | 0     | Nothing learned
300 | 47476 | 10.89 | Learns to drive on lane
600 | 50180 | 11.4  | More stable driving

TABLE II: The percentage increase in reward from 300 to 600 episodes is 5%, indicating that over time the reward value starts saturating. All values are averaged over 20 episodes. The sum of reward of all agents is calculated episode-wise, averaged over 20 episodes. %colliding steps indicates the timesteps at which one or more agents were experiencing collisions.

Number of Agents | Average total Reward/Progress per agent
3  | 8165.8
5  | 7903.8
7  | 7918.6
10 | 7510.3
12 | 7570.8
15 | 7678

TABLE III: Total reward of each agent in an episode, averaged over 600 episodes, for the lanekeeping behavior. We observe that as the number of agents increases, the total reward for each agent stays approximately the same: the per-agent reward is not affected by the number of agents, implying that the behavior and performance of the agents are unaffected. This demonstrates both the scalability and the stability of our approach.
Lastly, we compare PS-DDPG with regular DDPG trained in a single-agent environment and with multiple DDPG agents trained together.
In the multi-DDPG setting, when 6 agents were trained simultaneously, only the 3 agents in the front were able to learn the behavior. The last 3 agents could not learn to drive and got stuck in a local minimum: during training they collided with the front cars and accumulated negative rewards, and consequently learned not to move forward at all. In this setting, the average reward per agent is highest for a single agent, at 79.3257, and declines as more agents are introduced; with 7 agents, the average reward obtained is 42.6297. In the PS-DDPG setting, however, the average reward remained almost constant, as evident from Fig. 3.
The number of training episodes required was 2000 for DDPG, 300 for PS-DDPG and 3000 for multiple DDPGs trained together.
IV-B Overtaking Behavior
Similar to the curriculum learning approach for overtaking behaviors followed in , we initialized our overtaking learning with the weights from the lanekeeping learning. Curriculum learning not only helps in learning the behavior but also reduces the training time.
Figure 4 shows the results of our approach for the overtaking behavior. Our learned agents align themselves towards the right end, overtake the opponent agents and scatter back onto the road.
Table IV evaluates how the training progresses. The overtaking behavior is learned in a curriculum fashion, starting from the lanekeeping weights; hence the total system reward at episode 0 is not zero, unlike in the lanekeeping case. Overtaking, being a more complex behavior, is effectively learned in 600 episodes, unlike lanekeeping, which was learned in only 300.
No. of Training Episodes | Sum of Reward of all agents | Sum of Progress of all agents | %colliding steps in the system | Observations
0   | 241420 | 43516 | 21.6  | Follows lanekeeping behavior
300 | 228160 | 39464 | 15.98 | Learns to deviate from lane and move towards the side to avoid collision
600 | 295010 | 49813 | 9.55  | Learns to overtake

TABLE IV: Progress of training results with the number of episodes. The value "progress" in the table indicates the reward for the lanekeeping behavior, i.e. the forward movement made along the track in one timestep. The terms reward and %colliding steps are the same as in Table II. Observe how the progress first decreases and then increases: the initial high value arises because the agent blindly follows the lane, colliding with anyone who comes in between; as the agent learns to navigate safely, collisions decrease along with the progress value, while the reward keeps increasing over the training epochs.
The number of agents indicates the number of agents following the overtaking behavior. The values of reward and progress are cumulative over all timesteps in an episode, averaged over 20 episodes. Average colliding steps are also defined over all timesteps in an episode. We observe that the percentage of colliding steps remained in the range 9-14, with an outlier when the number of agents was 7. Similarly, progress remained in the range of 7.7k to 8.2k. This indicates that the performance of the agents was not affected by the increase in their number, which further supports the scalability of the architecture. Lastly, the reward values increase with the number of agents, since the reward is proportional to the total number of agents in the scene (through the term $(n - \text{racePos})$), which causes the linear increase in average reward values.
Lastly, we compare PS-DDPG with single-agent DDPG in Table V. Similar to the lanekeeping case, we experimented with replicating the overtaking results using multiple DDPG learners. Even though the networks were initialized with the stable lanekeeping weights, only the first two agents were able to learn the overtaking behavior; the agents behind them did not learn anything. The resulting learned behavior of the first two agents was equivalent to the single-DDPG behavior. The number of training episodes required, after curriculum learning, was 1k for DDPG, 600 for PS-DDPG and more than 2k for multiple DDPGs together.
IV-C Learning Cooperative Multiple Behaviors
We learned the two behaviors using a shared network; the reward functions have been described in Section III. The training process required around 1.5k episodes to converge. Our main observations: the lanekeeping agents move slowly in the presence of other agents, since they cannot distinguish whether the others are lanekeeping or overtaking agents; when not in the vicinity of other agents, they move with higher velocities. The overtaking agents, in contrast, always move at high speed. They do not compete with other agents, since competition would lead to instability had the other agent also been an overtaking agent; instead, they learn how to change lanes smoothly in order to overtake, and once they have overtaken, they maintain their speed to stay ahead in the lane. Our quantitative results are shown in Figure 6 and Table VI.
V Conclusion
Parameter sharing is a well-known concept in multi agent systems; we extended it to Deep Deterministic Policy Gradients for the particular case of simulated highway behaviors. The homogeneous nature of the agents enabled sharing the replay buffer, so each agent now has a plethora of experiences. The networks are updated $n$ times in each timestep, where $n$ is the number of agents, because of which the algorithm converges faster. In cases where connecting additional agents does not require heavy resources, PS-DDPG can be used to speed up training and to learn more generically. Apart from its advantages over DDPG, it serves as a fast, asynchronous multi agent learning algorithm. With a correct formulation of the reward function and state vector, multiple behaviors can be learned jointly, an example of which is shown in this work. Another advantage the current work offers is scalability. This can be used to generate behavioral traffic in simulations; given the interest in autonomous driving, a simulator which provides scalable traffic will help accelerate many complex research problems.
-  David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 2016.
-  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. NIPS Deep Learning Workshop, 2013.
-  OpenAI. Openai five. https://blog.openai.com/openai-five/, 2018.
-  Wei Xia, Huiyun Li, and Baopu Li. A control strategy of autonomous vehicles based on deep reinforcement learning. In International Symposium on Computational Intelligence and Design (ISCID), 2016.
-  Ahmad EL Sallab, Mohammed Abdou, Etienne Perot, and Senthil Yogamani. Deep reinforcement learning framework for autonomous driving. Electronic Imaging, 2017.
-  M Kaushik, Vignesh Prasad, K Madhava Krishna, and Balaraman Ravindran. Overtaking maneuvers in simulated highway driving using deep reinforcement learning. In Intelligent Vehicles Symposium (IV), 2018 IEEE. IEEE, 2018.
-  Sahand Sharifzadeh, Ioannis Chiotellis, Rudolph Triebel, and Daniel Cremers. Learning to drive using inverse reinforcement learning and deep q-networks. In NIPS workshop on Deep Learning for Action and Interaction, 2016.
-  Daniele Loiacono, Alessandro Prete, Pier Luca Lanzi, and Luigi Cardamone. Learning to overtake in torcs using simple reinforcement learning. IEEE Congress on Evolutionary Computation (CEC), 2010.
-  Lei Tai, Giuseppe Paolo, and Ming Liu. Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation. In IEEE/RSJ International Conference onIntelligent Robots and Systems (IROS). IEEE, 2017.
-  Lucian Busoniu, Robert Babuska, and Bart De Schutter. Multi-agent reinforcement learning: A survey. In Control, Automation, Robotics and Vision, 2006. ICARCV’06. 9th International Conference on, pages 1–6. IEEE, 2006.
-  Daan Bloembergen, Karl Tuyls, Daniel Hennes, and Michael Kaisers. Evolutionary dynamics of multi-agent learning: a survey. Journal of Artificial Intelligence Research, 53:659–697, 2015.
-  Norihiko Ono and Kenji Fukumoto. A modular approach to multi-agent reinforcement learning. Distributed Artificial Intelligence Meets Machine Learning: Learning in Multi-Agent Environments, pages 25–39. Springer, 1997.
-  Sainbayar Sukhbaatar, Rob Fergus, et al. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, pages 2244–2252, 2016.
-  Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pages 2137–2145, 2016.
-  Martin Lauer and Martin Riedmiller. An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In In Proceedings of the Seventeenth International Conference on Machine Learning. Citeseer, 2000.
-  A centralized reinforcement learning method for multi-agent job scheduling in grid. Computer and Knowledge Engineering (ICCKE), 2016 6th International Conference on, pages 171–176. IEEE, 2016.
-  Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. arXiv preprint arXiv:1705.08926, 2017.
-  Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pages 6379–6390, 2017.
-  Junling Hu, Michael P Wellman, et al. Multiagent reinforcement learning: theoretical framework and an algorithm. In ICML, volume 98, pages 242–250. Citeseer, 1998.
-  Xiangxiang Chu and Hangjun Ye. Parameter sharing deep deterministic policy gradient for cooperative multi-agent reinforcement learning. arXiv preprint arXiv:1710.00336, 2017.
-  Jayesh K Gupta, Maxim Egorov, and Mykel Kochenderfer. Cooperative multi-agent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems, pages 66–83. Springer, 2017.
-  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 2015.
-  Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.
-  John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
-  Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2016.
-  S Phaniteja, Parijat Dewangan, Pooja Guhan, Abhishek Sarkar, and K Madhava Krishna. A deep reinforcement learning approach for dynamically stable inverse kinematics of humanoid robots. In 2017 IEEE International Conference on Robotics and Biomimetics (ROBIO), pages 1818–1823. IEEE, 2017.
-  TaeChoong Chung et al. Controlling bicycle using deep deterministic policy gradient algorithm. In Ubiquitous Robots and Ambient Intelligence (URAI), 2017 14th International Conference on, pages 413–417. IEEE, 2017.
-  Naoto Yoshida. Gym-torcs. github.com/ugo-nama-kun/gym_torcs, 2016.
-  Daniele Loiacono, Luigi Cardamone, and Pier Luca Lanzi. Simulated car racing championship: Competition software manual. arXiv preprint arXiv:1304.1672, 2013.