Multi-vehicle systems (MVS) have attracted tremendous research interest in recent years. Compared to a single-vehicle system, an MVS usually has higher efficiency and operational capability in accomplishing complex tasks such as transportation, search and rescue, and mapping. However, MVS applications also require more sophisticated interaction between the vehicles and the environment, which includes high-level cooperation, competition and behavior control strategies and significantly increases the system complexity. Some MVS collective behaviors, such as formation and flocking, have recently been studied extensively in the control community, with both theoretical analysis and experimental results. However, these controller design and analysis methods may be unable to deal with a large-scale MVS that pursues several interaction objectives in parallel, such as way-point tracking, neighbor cooperation and competition, and communication preserving.
Recent developments in reinforcement learning (RL) have provided an alternative way to deal with the vehicle control problem and show potential in applications that involve interaction between multiple vehicles, such as collision avoidance, communication, and environment exploration. Inspired by the works above, we aim to develop an alternative flocking control method based on RL.
Some RL-based methods have already been proposed to deal with flocking control with collision avoidance in an MVS setup. A hybrid predator/intruder avoidance method for robot flocking, which combines a reinforcement learning based decision process with a low-level flocking controller, has been proposed in . A complete reinforcement learning based approach for UAV flocking is proposed in , where collision avoidance is incorporated into the reward function of the Q-learning scheme. The works most closely related to our topic are proposed in  and . In , a learning framework imitating ORCA is proposed by designing a neural network. However, that work relies on a training set generated by the well-validated algorithm, which may limit its application to more general environments. A deep reinforcement learning method for collision avoidance is proposed in . Based on a policy learned with a designed reward function, the method is shown to outperform ORCA.
Because the policy of each agent changes in every training loop, the environment is non-stationary from the viewpoint of each agent, and the classical Q-learning method is inapplicable. To deal with policy learning in a non-stationary environment with a large-scale multi-agent system, in this paper we adopt the deep deterministic policy gradient (DDPG) method, similar to , with a centralized training process and a distributed execution process. First, to avoid an observation space whose size changes from vehicle to vehicle, a three-layer tensor of constant size is used to represent the observation of neighbors, obstacles and way-points. Then the reward function covering collision avoidance, way-point tracking and communication preserving is designed. To further take the state of the neighbors into consideration, the reward function is augmented with the discounted reward of the neighbors. Finally, the DDPG is trained with a replay buffer filled with the state transitions of all vehicles, which means that the training process is centralized and the policy is shared among all vehicles.
The remainder of this paper is organized as follows. Section II describes the basic idea of the deep reinforcement learning method. The system modeling and problem description are given in Section III. Section IV proposes our reinforcement learning based method, and Section V validates it through experiments. Section VI concludes the paper.
II Deep Reinforcement Learning
II-A Deep Q-Learning
In reinforcement learning, an agent receives the current state of the environment and selects an action based on this state according to a stochastic policy or a deterministic policy; the agent then receives a reward and arrives at a new state. The transition dynamics are in general taken to be Markovian, so that the probability of the next state depends only on the current state and action.
The reinforcement learning problem can therefore be treated as a Markov decision process (MDP), described by a tuple that includes a discount parameter. The core objective is to find a policy that maximizes the cumulative long-term discounted gain.
Specifically, in Q-learning, the value of taking a certain action from a given state is called the Q-function, which is progressively updated from the observed reward and the estimated value of the next state.
Traditional RL algorithms are typically limited to discrete, low-dimensional domains and are poorly suited to multi-vehicle environments with continuous states, actions and observations.
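To make the tabular case concrete, the following is a minimal sketch of one Q-learning update on a discrete table. The function name, the step size `alpha` and the discount `gamma` are illustrative assumptions and are not taken from this paper; the update itself is the textbook rule.

```python
import numpy as np

def q_learning_update(q_table, state, action, reward, next_state,
                      alpha=0.1, gamma=0.9):
    """One tabular Q-learning step (textbook rule; names are illustrative).

    q_table has shape (n_states, n_actions) and is modified in place.
    """
    # Bootstrapped target: immediate reward plus discounted best next value.
    td_target = reward + gamma * np.max(q_table[next_state])
    # Move the current estimate a fraction alpha towards the target.
    q_table[state, action] += alpha * (td_target - q_table[state, action])
    return q_table

q = np.zeros((2, 2))
q_learning_update(q, state=0, action=1, reward=1.0, next_state=1,
                  alpha=0.5, gamma=0.9)
```

The table grows with the number of discrete states and actions, which is why this form does not carry over directly to continuous multi-vehicle observations.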
Recent advances in deep reinforcement learning  have demonstrated human-level performance in complex and high-dimensional spaces. As an extension of Q-learning to high-dimensional applications, the deep Q-learning method uses a deep neural network to approximate the Q-function over a continuous state space and a discrete action space. A loss function is defined between the network output and the target values, and the network parameters are updated using back-propagation.
One such update rule is ADAM.
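For reference, the deep Q-learning loss, target value and gradient step are commonly written as below. The notation is the conventional one and is an assumption here, not necessarily the notation of this paper: $\theta$ denotes the network parameters, $\theta^-$ the parameters of a periodically copied target network, $\gamma$ the discount parameter and $\alpha$ the step size.

```latex
L(\theta) = \mathbb{E}_{(s,a,r,s')}\Big[\big(y - Q(s,a;\theta)\big)^2\Big],
\qquad
y = r + \gamma \max_{a'} Q(s',a';\theta^-),
\qquad
\theta \leftarrow \theta - \alpha \nabla_\theta L(\theta)
```

ADAM replaces the plain gradient step in the last expression with a step that is rescaled by running estimates of the first and second moments of the gradient.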
II-B Deep deterministic policy gradient
The deep Q-learning method above may be difficult to extend to the multi-vehicle environment: because each local policy changes during its local training process, learning from a global training set may be unstable, and therefore a global Q-function is not feasible. To overcome this drawback, the deep deterministic policy gradient (DDPG) method can be implemented by introducing a deterministic actor function, extra target networks for the Q-function and the actor function, and a replay buffer. The DDPG updates as follows.
First, the target value is computed from the sampled reward and the target networks. The critic is then updated by minimizing the loss between its output and the target value. Given an objective function similar to (2), the gradient with respect to the actor parameters is calculated, and the parameters of the critic and the actor are updated respectively by gradient steps. Finally, the parameters of the target networks are updated after a sequence of critic and actor network updates. Two rate parameters determine the update rates of the actor-critic networks and of the target networks, respectively.
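The target computation and the target-network update can be sketched as follows. This is a minimal illustration of the standard DDPG rules, assuming a soft (exponentially averaged) target update with rate `tau`; the function names and default values are hypothetical and the paper's exact update may differ.

```python
import numpy as np

def ddpg_target(reward, next_q_target, gamma=0.99, done=False):
    """Target value y = r + gamma * Q'(s', mu'(s')).

    next_q_target is the target critic evaluated at the next state and the
    target actor's action (assumed to be computed by the caller).
    """
    return reward + (0.0 if done else gamma * next_q_target)

def soft_update(target_params, online_params, tau=0.01):
    """Standard DDPG soft update: theta' <- tau * theta + (1 - tau) * theta'."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]

# Illustrative parameters: one weight vector per network.
target = [np.zeros(2)]
online = [np.ones(2)]
target = soft_update(target, online, tau=0.1)
```

A small `tau` makes the target networks change slowly, which keeps the target value from chasing the critic it is used to train.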
For our multi-vehicle application, we implement a centralized training and decentralized execution process. In the centralized training process, each critic network is augmented with the policies of its neighbors by including the reward of the neighbors, and is trained on a shared training set. In the decentralized execution process, each decision is made based on the observation of neighbor states.
III System Modeling and Problem
In this section, the MVS is modeled using the unicycle model and a proximity network, and the multi-vehicle control problem involving way-point tracking, collision avoidance and communication preserving is described.
III-A System Modeling
III-A1 Dynamic Model
In this paper we consider a set of homogeneous mobile vehicles operating in 2D space with the unicycle model. The discretized dynamic model of each vehicle is expressed in terms of its position and heading angle in 2D space, with the linear velocity and the angular velocity as the control input and with a given sampling period.
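A discretized unicycle model is often obtained by a forward-Euler step. The sketch below assumes that discretization; the function and argument names are illustrative, and the paper's exact discretization is not reproduced here.

```python
import math

def unicycle_step(x, y, theta, v, omega, dt):
    """One forward-Euler step of the unicycle model (assumed discretization).

    (x, y) is the position, theta the heading angle, v the linear velocity,
    omega the angular velocity and dt the sampling period.
    """
    x_next = x + v * math.cos(theta) * dt
    y_next = y + v * math.sin(theta) * dt
    theta_next = theta + omega * dt
    return x_next, y_next, theta_next

# Drive straight along the x-axis for one sampling period.
state = unicycle_step(0.0, 0.0, 0.0, v=1.0, omega=0.0, dt=0.1)
```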
III-A2 Proximity Network
In this paper, an undirected proximity graph is used to represent the communication topology of the multi-vehicle system at each time instance. Its vertex set stands for the local vehicles and its edge set stands for the communication links. In a proximity network, the edge set is defined according to the spatial distance between vehicles: two vehicles are connected when their distance is within the proximity network threshold. The neighborhood set of a vehicle is the set of vehicles connected to it.
Besides the communication graph, an additional directed graph is used to represent the obstacle detection status of the MVS; its edge set denotes the pairwise detection between vehicles and obstacles. Similarly, the existence of an edge depends on the detection range of each vehicle: an obstacle is detected by a vehicle when the spatial distance between them is within that range. The set of obstacles within a vehicle's sensing range is defined analogously to the neighborhood set.
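Both graphs reduce to thresholding pairwise distances, which can be sketched as below. The helper names and the use of a non-strict inequality at the threshold are assumptions made for illustration.

```python
import numpy as np

def within_range(points_a, points_b, radius):
    """Boolean matrix M[i, j], True when ||a_i - b_j|| <= radius."""
    diff = points_a[:, None, :] - points_b[None, :, :]
    return np.linalg.norm(diff, axis=-1) <= radius

def neighbor_sets(vehicles, comm_range):
    """Neighborhood set of every vehicle in the undirected proximity graph."""
    adj = within_range(vehicles, vehicles, comm_range)
    np.fill_diagonal(adj, False)  # a vehicle is not its own neighbor
    return [set(np.flatnonzero(row).tolist()) for row in adj]

def detected_obstacles(vehicles, obstacles, sense_range):
    """Obstacles within each vehicle's sensing range (directed detection graph)."""
    det = within_range(vehicles, obstacles, sense_range)
    return [set(np.flatnonzero(row).tolist()) for row in det]

vehicles = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]])
obstacles = np.array([[0.0, 1.0]])
neighbors = neighbor_sets(vehicles, comm_range=1.5)
detected = detected_obstacles(vehicles, obstacles, sense_range=1.2)
```

Because the sets are recomputed from positions at every time instance, the number of neighbors and detected obstacles of a vehicle varies over time.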
III-B Problem description
The objective is to develop distributed local controllers that achieve way-point tracking, collision avoidance and network preserving.
Way-point tracking: Given the discretized way-points, the control objective is to minimize the weighted norm of the tracking error.
Collision avoidance: Given predefined minimum separation distances to neighbors and to obstacles, the distance between any two vehicles, and between any vehicle and any obstacle, should not fall below the corresponding minimum.
Communication preserving: Given a maximum communication range between two vehicles, the objective is to keep the graph connected by driving the vehicles to stay within the sensing range of each other.
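The collision-avoidance and communication-preserving objectives can be checked on a snapshot of positions as in the following sketch. The function names and thresholds are hypothetical, and graph connectivity is tested here with a simple depth-first search, which is one possible reading of the connectivity requirement.

```python
import numpy as np

def pairwise_dist(a, b):
    """Matrix of Euclidean distances between rows of a and rows of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def collision_free(vehicles, obstacles, d_nei, d_obs):
    """True when all vehicle-vehicle and vehicle-obstacle distances respect
    the minimum separations d_nei and d_obs (assumed names)."""
    dv = pairwise_dist(vehicles, vehicles)
    upper = np.triu_indices(len(vehicles), k=1)  # each unordered pair once
    ok_vehicles = bool(np.all(dv[upper] >= d_nei))
    ok_obstacles = bool(np.all(pairwise_dist(vehicles, obstacles) >= d_obs))
    return ok_vehicles and ok_obstacles

def is_connected(vehicles, comm_range):
    """True when the proximity graph with threshold comm_range is connected."""
    adj = pairwise_dist(vehicles, vehicles) <= comm_range
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for j in np.flatnonzero(adj[i]).tolist():
            if j not in seen:
                seen.add(j)
                stack.append(j)
    return len(seen) == len(vehicles)

vehicles = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
obstacles = np.array([[1.0, 1.0]])
status = (collision_free(vehicles, obstacles, 0.5, 0.5),
          is_connected(vehicles, 1.0))
```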
A similar problem setup has been studied extensively in  based on controller design and stability analysis. In this paper we solve the problem in an alternative way, without an explicit analytical process. Instead, we formulate the above three objectives and the dynamic model within a reinforcement learning scheme. Based on the centralized training process and distributed execution process, behavior similar to flocking  can be achieved.
IV Flocking control with DRL
In this section, we introduce the key ingredients of our reinforcement learning based flocking control framework. Specifically, we begin by representing the observed state as a three-layer tensor; then the reward function is described with regard to our flocking control objective. The DDPG-based flocking control is described at the end.
IV-A Observation Representation
For the MVS, the interaction between an agent and the environment contains three aspects, namely the status of its cooperative neighbors, the status of the obstacles within its sensing range, and the common way-points. In order to model this status in a continuous space while keeping the size of the observation space invariant, we represent the three types of observation mentioned above as three channels of an image-like tensor at different scales.
Neighbor channel: In this channel, the locations of the neighboring agents, expressed in the local frame of the vehicle, are projected onto a 2D matrix. The transformation is as follows. First, virtual points, namely the anchors, are equally distributed within a given area; then each anchor measures the neighbor intensity around it based on a Gaussian radial function of the distance between the anchor and each neighbor, both expressed in the local frame of the agent. A parameter of this function represents the anchor's sensitivity to each vehicle's radiation. The measurement value of an anchor is the summation of all neighbors' radiation on that anchor.
An example of the data representation is given in Figure 1, where a number of vehicles lie within the sensing range of a vehicle and are denoted as its neighbors, as in Figure 1(a). With an anchor grid, the neighbor observation based on the above definition is represented in the color domain, as in Figure 1(b).
Obstacle channel: In this channel, the locations of the obstacles within the agent's sensing range, expressed in the local frame, are compressed into a similar channel over the same area as the neighbor status, using a similar Gaussian radial function as in Eq. 17 between each anchor's position and each obstacle's position in the vehicle's local frame.
Way-point channel: In this channel, the next way-points are projected onto a similar channel with anchors equally positioned in a given region, from which the observation is calculated.
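The three channels described above can be sketched as follows. The grid size, the extents `near` and `far`, the width `sigma` and the exact form of the Gaussian radial function are assumptions chosen for illustration; only the overall construction (anchors on a regular grid, summed Gaussian responses, one channel per observation type) follows the text.

```python
import numpy as np

def anchor_grid(half_width, n):
    """n x n anchors equally spaced over [-half_width, half_width]^2."""
    axis = np.linspace(-half_width, half_width, n)
    gx, gy = np.meshgrid(axis, axis)
    return np.stack([gx, gy], axis=-1)  # shape (n, n, 2)

def to_local(points_world, position, heading):
    """Express world-frame points in the local frame of a vehicle."""
    c, s = np.cos(heading), np.sin(heading)
    rot = np.array([[c, s], [-s, c]])  # world -> local rotation
    pts = np.asarray(points_world, dtype=float).reshape(-1, 2)
    return (pts - position) @ rot.T

def gaussian_channel(points_local, anchors, sigma):
    """Sum over points of exp(-||anchor - point||^2 / (2 sigma^2))."""
    channel = np.zeros(anchors.shape[:2])
    for p in np.asarray(points_local, dtype=float).reshape(-1, 2):
        d2 = np.sum((anchors - p) ** 2, axis=-1)
        channel += np.exp(-d2 / (2.0 * sigma ** 2))
    return channel

def observation(position, heading, neighbors, obstacles, waypoints,
                n=8, near=1.0, far=4.0, sigma=0.3):
    """Stack neighbor, obstacle and way-point channels into an (n, n, 3) tensor."""
    near_grid, far_grid = anchor_grid(near, n), anchor_grid(far, n)
    channels = [
        gaussian_channel(to_local(neighbors, position, heading), near_grid, sigma),
        gaussian_channel(to_local(obstacles, position, heading), near_grid, sigma),
        # Way-points use a wider grid, i.e. a different scale.
        gaussian_channel(to_local(waypoints, position, heading), far_grid,
                         sigma * far / near),
    ]
    return np.stack(channels, axis=-1)

pos = np.array([0.0, 0.0])
obs = observation(pos, 0.0, [[0.5, 0.0], [0.0, 0.5]], [[0.3, 0.3]], [[2.0, 0.0]])
```

The tensor shape depends only on the grid size, not on how many neighbors or obstacles are present, which is the property the representation is designed to provide.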
IV-B Reward function
The reward function concerns three aspects of our objective, namely connectivity preserving, obstacle avoidance and way-point tracking, which are detailed as follows:
Connectivity Preserving: This function maintains the distance between each vehicle and its neighbors within the maximum communication range while keeping a minimum separation distance. Consequently, a pairwise reward function is defined for each vehicle and each of its neighbors.
Obstacle Avoidance: This function avoids collisions with obstacles, that is, it keeps a minimum separation distance between the vehicle and each obstacle within its sensing range. A pairwise reward function is defined accordingly.
Way-points Tracking: The agents should follow the predefined mission way-points. The reward is defined based on a normalized distance from the local vehicle to the way-point, using a normalizing factor for the distance between the target way-point and the position of an agent.
Finally, the reward function that evaluates the behavior of an agent is composed of the terms above and a penalty term. The last term is a penalty that enforces a smooth action trajectory and economical maneuvers, with a negative weight factor. In the MVS operation environment, we define an inclusive reward function as a combination of the reward function (24) and the discounted reward of the neighbors, where a weight factor denotes what portion of the neighbors' interest is considered.
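The composition of the reward can be sketched as below. The shapes of the pairwise terms (a fixed penalty of -1 when a distance constraint is violated), the quadratic action penalty and all names and default weights are assumptions made purely for illustration; only the structure (pairwise connectivity and obstacle terms, a normalized tracking term, an action penalty, and a neighbor-augmented inclusive reward) follows the description above.

```python
import numpy as np

def connectivity_reward(d, d_min, d_max):
    """Assumed shape: 0 inside [d_min, d_max], -1 otherwise."""
    return 0.0 if d_min <= d <= d_max else -1.0

def obstacle_reward(d, d_min):
    """Assumed shape: 0 when separation is kept, -1 otherwise."""
    return 0.0 if d >= d_min else -1.0

def tracking_reward(pos, waypoint, scale):
    """Negative distance to the way-point, normalized by scale."""
    return -float(np.linalg.norm(np.subtract(waypoint, pos))) / scale

def local_reward(pos, action, neighbors, obstacles, waypoint,
                 d_min_nei, d_max, d_min_obs, scale, w_action=-0.01):
    """Sum of the three objective terms plus a negative-weight action penalty."""
    r = tracking_reward(pos, waypoint, scale)
    r += sum(connectivity_reward(np.linalg.norm(np.subtract(q, pos)),
                                 d_min_nei, d_max) for q in neighbors)
    r += sum(obstacle_reward(np.linalg.norm(np.subtract(o, pos)), d_min_obs)
             for o in obstacles)
    r += w_action * float(np.dot(action, action))
    return r

def inclusive_reward(i, local_rewards, neighbor_sets, beta):
    """Own reward plus the neighbors' rewards weighted by beta."""
    return local_rewards[i] + beta * sum(local_rewards[j] for j in neighbor_sets[i])

rewards = [1.0, 2.0, 4.0]
neighbors_of = [{1}, {0, 2}, {1}]
r1 = inclusive_reward(1, rewards, neighbors_of, beta=0.5)
```

Setting `beta` to zero recovers a purely selfish reward, while larger values make each vehicle account for the outcome of its neighbors.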
IV-C DDPG network
Following the DDPG framework proposed in , we derive a policy learning method based on the following setup: (1) the input of the learned policy is the local observation of neighbors, obstacles and way-points, and (2) the experiences are collected from all vehicles to train a shared policy, so that the framework involves a centralized training process and a distributed execution process. According to Algorithm 1, the reward for each action accounts for the vehicle as well as its neighbors. The replay buffer is filled with the state transitions of all vehicles, which means that the training process is centralized and the policy is shared among all vehicles.
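A replay buffer shared by all vehicles can be sketched as follows. The class and method names are hypothetical and the buffer stores plain tuples; it is meant only to illustrate that every vehicle's transition at a time step lands in the same buffer from which the shared policy is trained.

```python
import random
from collections import deque

class SharedReplayBuffer:
    """Single buffer holding the transitions of all vehicles (illustrative)."""

    def __init__(self, capacity, seed=None):
        self._buffer = deque(maxlen=capacity)  # oldest transitions are dropped
        self._rng = random.Random(seed)

    def add_step(self, observations, actions, rewards, next_observations):
        """Store one (o, a, r, o') tuple per vehicle for the current time step."""
        for transition in zip(observations, actions, rewards, next_observations):
            self._buffer.append(transition)

    def sample(self, batch_size):
        """Uniformly sample a mini-batch, regardless of which vehicle produced it."""
        size = min(batch_size, len(self._buffer))
        return self._rng.sample(list(self._buffer), size)

    def __len__(self):
        return len(self._buffer)

buffer = SharedReplayBuffer(capacity=1000, seed=0)
buffer.add_step(["o1", "o2", "o3"], ["a1", "a2", "a3"],
                [0.1, 0.2, 0.3], ["n1", "n2", "n3"])
batch = buffer.sample(2)
```

At execution time the buffer is not needed: each vehicle evaluates the shared actor on its own local observation.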
The critic network and the actor network are shown in Figure 2. The critic network contains a neural network with four hidden layers for the state and one hidden layer for the action to approximate the state-action value function; the actor network contains a similar four-layer network, appended with the action bias and action bound, to approximate the actor function. Specifically, two convolution layers are implemented to pre-process the observation, and each convolution layer contains multiple convolution kernels and ReLU layers. The overall algorithm of the described DDPG-based flocking control framework is presented as Algorithm 1.
Table I: Test results over 1000 evaluation episodes (3 vehicles, one obstacle)
|Metric|Value|
|Reference way-point tracking error|0.08|
|Minimum separation distance to obstacle|0.138|
|Minimum separation distance to neighbors|0.094|
|Average separation distance to neighbors|0.185|
Table II: Setup and flocking performance
|No. of vehicles|No. of obstacles|Avg. training time (ms)|Way-point tracking error|Min. separation distance (obstacle / neighbor)|Avg. separation distance to neighbors|
In this section, the experiments and results of the proposed flocking control framework are presented. We set up the flocking control environment with different numbers of cooperative agents and uncooperative obstacles. The objective is to track the desired reference way-point while maintaining a proper distance to the neighbors and avoiding randomly moving obstacles. In particular, our DDPG network is implemented with the TensorFlow deep learning framework, and the scenario is built using the Gym package  and the multi-agent particle environment (MPE) package. The Python implementation of our algorithm runs on a laptop with an i7-7700HQ CPU and a GTX 1050 graphics card.
During the experiments, the vehicles and obstacles move according to the dynamic model (14) with sample time . Specifically, the maximum velocities of the vehicles and the obstacles are and respectively. The minimum separation distances to neighbors and to obstacles are set as and . The distance thresholds for the proximity network and are respectively set as and . Initially, the vehicles are placed randomly in the 2D plane with random and . The obstacle is placed in the same area and is subject to a random walk process. The reference way-point is placed randomly and moves with constant velocity 0.1 towards the origin.
In this experiment, the collision avoidance, reference way-point tracking and obstacle avoidance capabilities of our proposed method are evaluated. The scenario consists of 3 vehicles and one obstacle. Training runs for 30000 episodes, and we use 1000 episodes to evaluate the performance. The average training time is 122 ms per episode. The reward averaged over every 1000 episodes is plotted in Figure 3, which shows that a stable flocking control policy is obtained after 10000 episodes. The test results over 1000 episodes are shown in Table I. The results demonstrate both collision avoidance and way-point tracking.
In this experiment, different scenarios are set up to evaluate the performance of our method. The results are shown in Table II. Three scenarios are defined with different numbers of vehicles and obstacles. The training time shows only a slight increase as the numbers of vehicles and obstacles grow, mainly because, with our observation representation method, the size of the state space remains constant as the number of vehicles or obstacles varies. Way-point tracking, collision avoidance and communication preserving are all demonstrated with similar statistical performance.
In this paper, a reinforcement learning framework for flocking control with collision avoidance and communication preserving is proposed. The main differences of our work are twofold: 1) we implement an observation representation method that transforms a state of dynamically changing size into a tensor-based state of constant size, and 2) we design a centralized training framework that uses an augmented reward function and a shared policy trained on a common replay buffer filled by all vehicles. The experimental results show that the proposed method achieves flocking control with acceptable performance. In the future, the work will be extended with more experiments and detailed analysis. As one possible direction, a theoretical analysis of the consensus on policy will be carried out.
The authors would like to thank the National Research Foundation, Keppel Corporation, and National University of Singapore for supporting this work done in the Keppel-NUS Corporate Laboratory. The conclusions put forward reflect the views of the authors alone and not necessarily those of the institutions within the Corporate Laboratory. The WBS number of this project is R-261-507-004-281.
-  W. Ren, R. W. Beard, and E. M. Atkins, “Information consensus in multivehicle cooperative control,” IEEE Control Systems, vol. 27, no. 2, pp. 71–82, April 2007.
-  J. Alonso-Mora, S. Baker, and D. Rus, “Multi-robot formation control and object transport in dynamic environments via constrained optimization,” The International Journal of Robotics Research, vol. 36, no. 9, pp. 1000–1021, 2017.
-  Y. Liu and G. Nejat, “Multirobot cooperative learning for semiautonomous control in urban search and rescue applications,” Journal of Field Robotics, vol. 33, no. 4, pp. 512–536, 2016.
-  S. Liu, K. Mohta, S. Shen, and V. Kumar, “Towards collaborative mapping and exploration using multiple micro aerial robots,” in Experimental Robotics. Springer, 2016, pp. 865–878.
-  R. W. Beard, J. Lawton, and F. Y. Hadaegh, “A coordination architecture for spacecraft formation control,” IEEE Transactions on control systems technology, vol. 9, no. 6, pp. 777–790, 2001.
-  R. Olfati-Saber, “Flocking for multi-agent dynamic systems: Algorithms and theory,” IEEE Transactions on automatic control, vol. 51, no. 3, pp. 401–420, 2006.
-  R. S. Sutton and A. G. Barto, “Reinforcement learning: An introduction, bradford book,” IEEE Transactions on Neural Networks, vol. 16, no. 1, pp. 285–286, 2005.
-  Y. F. Chen, M. Liu, M. Everett, and J. P. How, “Decentralized non-communicating multiagent collision avoidance with deep reinforcement learning,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 285–292.
-  J. Foerster, I. A. Assael, N. de Freitas, and S. Whiteson, “Learning to communicate with deep multi-agent reinforcement learning,” in Advances in Neural Information Processing Systems, 2016, pp. 2137–2145.
-  H. X. Pham, H. M. La, D. Feil-Seifer, and L. Van Nguyen, “Cooperative and distributed reinforcement learning of drones for field coverage,” arXiv preprint arXiv:1803.07250, 2018.
-  H. M. La, R. Lim, and W. Sheng, “Multirobot cooperative learning for predator avoidance,” IEEE Transactions on Control Systems Technology, vol. 23, no. 1, pp. 52–63, 2015.
-  S. M. Hung and S. N. Givigi, “A q-learning approach to flocking with uavs in a stochastic environment.” IEEE Transactions on Cybernetics, vol. 47, no. 1, pp. 186–197, 2016.
-  P. Long, W. Liu, and J. Pan, “Deep-learned collision avoidance policy for distributed multiagent navigation,” IEEE Robotics & Automation Letters, vol. 2, no. 2, pp. 656–663, 2016.
-  J. V. D. Berg, M. Lin, and D. Manocha, “Reciprocal velocity obstacles for real-time multi-agent navigation,” in IEEE International Conference on Robotics and Automation, 2008, pp. 1928–1935.
-  R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” CoRR, vol. abs/1706.02275, 2017. [Online]. Available: http://arxiv.org/abs/1706.02275
-  V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, and G. Ostrovski, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
-  D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” Computer Science, 2014.
-  M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: A system for large-scale machine learning.” in OSDI, vol. 16, 2016, pp. 265–283.
-  G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” arXiv preprint arXiv:1606.01540, 2016.
-  R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” Neural Information Processing Systems (NIPS), 2017.