I Introduction
The Multi-Vehicle System (MVS) has attracted tremendous research interest in recent years [1]. Compared to a single-vehicle system, the MVS usually has higher efficiency and operational capability in accomplishing complex tasks such as transportation [2], search and rescue [3], and mapping [4]. However, MVS applications also require more sophisticated interaction between vehicles and the environment, including high-level cooperation, competition and behavior control strategies, which significantly increases the system complexity. Some MVS collective behaviors, such as formation [5] and flocking [6], have recently been extensively studied in the control community with both theoretical analysis and experimental results. However, controller design and analysis methods may be unable to deal with large-scale MVS with multiple parallel interaction purposes, such as waypoint tracking, neighbor cooperation and competition, and communication preserving.
Recent developments in reinforcement learning (RL) [7] have provided an alternative way to deal with the vehicle control problem and have shown its potential in applications that involve interaction between multiple vehicles, such as collision avoidance [8], communication [9], and environment exploration [10]. Inspired by the works above, we aim to develop an alternative flocking control method by implementing the RL method.
Some RL-based methods have already been proposed to deal with flocking control with collision avoidance under the MVS setup. A hybrid predator/intruder avoidance method for robot flocking, combining a reinforcement learning based decision process with a low-level flocking controller, has been proposed in [11]. A complete reinforcement learning based approach for UAV flocking is proposed in [12], where collision avoidance is incorporated into the reward function of the Q-learning scheme. The works most similar to our topic are proposed in [13] and [8]. In [13], a learning framework imitating ORCA [14] is proposed by designing a neural network. However, the work is based on a training set from the well-validated algorithm, which may limit its application to more general environments. A deep reinforcement learning method for collision avoidance is proposed in [8]. Based on the learned policy with a designed reward function, the method is validated with improved performance over the ORCA method. Since the policy of each agent changes dynamically in every training loop, which results in a non-stationary environment for each agent, the classical Q-learning method is inapplicable. To deal with policy learning in a non-stationary environment with a large-scale multi-agent system, in this paper we adopt the deep deterministic policy gradient (DDPG) method similar to [15], with a centralized training process and a distributed execution process. First, to avoid a changing observation space size across different vehicles, a three-layer tensor with constant size is implemented to represent the observation of neighbors, obstacles and waypoints. Then the reward functions for collision avoidance, waypoint tracking and communication preserving are designed. To further take into consideration the state of the neighbors, the reward function is augmented with the reward of the neighbors weighted by a discount factor. Finally, the DDPG is trained and the replay buffer is filled with all vehicles' state transitions, which means the training process is centralized and the policy is shared among all vehicles.
The remainder of this paper is organized as follows. In Section II the basic idea of the deep reinforcement learning method is described. The system modeling and problem description are given in Section III. In Section IV our reinforcement learning based method is proposed, and Section V validates the proposed method based on experiments. Section VI concludes this paper.
II Deep Reinforcement Learning
II-A Deep Q-Learning
In reinforcement learning, an agent receives the current state $s_t$ of the environment and selects an action $a_t$ based on this state according to a stochastic policy $\pi(a_t|s_t)$ or a deterministic policy $a_t=\mu(s_t)$; then the agent receives a reward $r_t$ and arrives at a new state $s_{t+1}$. The transition dynamics are in general taken as Markovian, with the transition probability

$p(s_{t+1} \mid s_1, a_1, \ldots, s_t, a_t) = p(s_{t+1} \mid s_t, a_t)$   (1)
Obviously the reinforcement learning problem can be treated as a Markov Decision Process (MDP) with the tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$, where $\gamma \in [0,1]$ is a discount parameter. The core objective is to find a policy which maximizes the cumulative long-term discounted gain

$R_t = \sum_{i \ge 0} \gamma^{i} r_{t+i}$   (2)
Specifically, in Q-learning, the value of taking a certain action $a_t$ from state $s_t$ is called the Q-function $Q(s_t,a_t)$, which is progressively updated with

$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\big(r_t + \gamma \max_{a} Q(s_{t+1},a) - Q(s_t,a_t)\big)$   (3)
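The tabular update in Eq. (3) can be sketched directly; the state/action indices, learning rate and discount factor below are illustrative:

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step, Eq. (3):
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Toy 2-state, 2-action table; one transition (s=0, a=1) -> (s'=1, r=1).
Q = np.zeros((2, 2))
Q = q_update(Q, s=0, a=1, r=1.0, s_next=1)
```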
Traditional RL algorithms are typically limited to discrete, low-dimensional domains and poorly suited to multi-vehicle environments with continuous state/action and observation spaces.
Recent advances in deep reinforcement learning [16] have demonstrated human-level performance in complex and high-dimensional spaces. As an extension of Q-learning to high-dimensional applications, the deep Q-learning method uses a deep neural network with parameters $\theta$ to approximate the Q-function on a continuous state and discrete action space. Defining the loss function

$L(\theta) = \mathbb{E}\big[(y_t - Q(s_t,a_t;\theta))^2\big]$   (4)

with the target values

$y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta)$   (5)

the parameter $\theta$ is updated using backpropagation

$\theta \leftarrow \theta - \eta \nabla_{\theta} L(\theta)$   (6)
One such update rule is ADAM [17].
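As an illustrative sketch of the loss and update in Eqs. (4)-(6), the snippet below stands in a linear Q approximator with a hand-made feature map for the deep network; the feature map, action set and all names are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=4)          # parameters of a toy linear Q approximator

def q_value(state, action, theta):
    # toy feature map phi(s, a); stand-in for the deep network Q(s, a; theta)
    phi = np.array([state, float(action), state * action, 1.0])
    return theta @ phi, phi

def dqn_step(state, action, reward, next_state, theta, theta_frozen,
             actions=(0, 1), gamma=0.99, lr=0.01):
    # target y = r + gamma * max_a' Q(s', a')  -- Eq. (5)
    y = reward + gamma * max(q_value(next_state, a, theta_frozen)[0] for a in actions)
    q, phi = q_value(state, action, theta)
    grad = -2.0 * (y - q) * phi     # gradient of (y - Q)^2 -- Eqs. (4), (6)
    return theta - lr * grad        # gradient descent step -- Eq. (6)

theta_new = dqn_step(1.0, 1, 0.5, 0.0, theta, theta.copy())
```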
II-B Deep Deterministic Policy Gradient
The deep Q-learning method above may be difficult to extend to the multi-vehicle environment: since the local policy changes with each local training process, learning based on a global training set may be unstable, and therefore a global Q-function is not feasible. To overcome this drawback, the deep deterministic policy gradient (DDPG) method can be implemented by introducing, for the Q-function $Q(s,a;\theta^{Q})$ and a deterministic actor function $\mu(s;\theta^{\mu})$, extra target networks $Q'(s,a;\theta^{Q'})$ and $\mu'(s;\theta^{\mu'})$, together with a replay buffer. DDPG updates as follows.
Define the target value as

$y_t = r_t + \gamma Q'\big(s_{t+1}, \mu'(s_{t+1};\theta^{\mu'});\theta^{Q'}\big)$   (7)
Update the critic by minimizing the loss function

$L(\theta^{Q}) = \mathbb{E}\big[(y_t - Q(s_t,a_t;\theta^{Q}))^2\big]$   (8)
Given an objective function similar to (2), the policy gradient can be calculated as

$\nabla_{\theta^{\mu}} J \approx \mathbb{E}\big[\nabla_{a} Q(s,a;\theta^{Q})\big|_{a=\mu(s)} \nabla_{\theta^{\mu}} \mu(s;\theta^{\mu})\big]$   (9)
Then the parameters $\theta^{Q}$ and $\theta^{\mu}$ can be updated respectively as

$\theta^{Q} \leftarrow \theta^{Q} - \eta_c \nabla_{\theta^{Q}} L(\theta^{Q})$   (10)

$\theta^{\mu} \leftarrow \theta^{\mu} + \eta_a \nabla_{\theta^{\mu}} J$   (11)
The parameters of the target networks are updated after a sequence of critic and actor network updates as

$\theta^{Q'} \leftarrow \tau \theta^{Q} + (1-\tau)\theta^{Q'}$   (12)

$\theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1-\tau)\theta^{\mu'}$   (13)

where the learning rates $\eta_c, \eta_a$ and the factor $\tau$ determine the update rates of the actor-critic networks and the target networks respectively.
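The soft target update of Eqs. (12)-(13) can be sketched directly on parameter vectors; the vectors and the value of $\tau$ below are illustrative:

```python
import numpy as np

def soft_update(target, source, tau=0.01):
    """Polyak averaging of Eqs. (12)-(13): theta' <- tau*theta + (1-tau)*theta'."""
    return tau * source + (1.0 - tau) * target

# Toy critic parameters and their (initially zero) target copy.
critic = np.array([1.0, 2.0])
critic_target = np.zeros(2)
critic_target = soft_update(critic_target, critic, tau=0.1)
```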
For our multi-vehicle application, we implement a centralized training and decentralized execution process. In the centralized training process, each critic network is augmented with the policies of its neighbors by including the rewards of the neighbors, and is trained on a shared training set. In the decentralized execution process, each decision is made based on the observation of neighbor states.
III System Modeling and Problem Description
In this section the MVS is modeled using a unicycle model and a proximity network, and the multi-vehicle control problem involving waypoint tracking, collision avoidance and communication preserving is described.
III-A System Modeling
III-A1 Dynamic Model
In this paper we consider a set of $N$ homogeneous mobile vehicles, denoted as $\mathcal{V} = \{1,\ldots,N\}$, operating in 2D space with the unicycle model. The discretized dynamic model for each vehicle $i$ is described as

$x_i(k+1) = x_i(k) + T v_i(k)\cos\theta_i(k)$
$y_i(k+1) = y_i(k) + T v_i(k)\sin\theta_i(k)$
$\theta_i(k+1) = \theta_i(k) + T \omega_i(k)$   (14)

where $(x_i, y_i)$ and $\theta_i$ denote the position and the heading angle respectively in 2D space, and $v_i$ and $\omega_i$ are respectively the linear velocity and the angular velocity. We define the position of vehicle $i$ at time instance $k$ as $p_i(k) = [x_i(k), y_i(k)]^T$ and the control input as $u_i(k) = [v_i(k), \omega_i(k)]^T$. $T$ denotes the sampling period.
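A minimal sketch of the discretized unicycle model (14); the state, input and sampling period values are illustrative:

```python
import math

def unicycle_step(x, y, theta, v, omega, T=0.1):
    """One step of the discretized unicycle model, Eq. (14)."""
    return (x + T * v * math.cos(theta),
            y + T * v * math.sin(theta),
            theta + T * omega)

# Drive straight along +x for one sampling period.
state = unicycle_step(0.0, 0.0, 0.0, v=1.0, omega=0.0, T=0.1)
```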
III-A2 Proximity Network
In this paper, an undirected proximity graph $\mathcal{G}(k) = (\mathcal{V}, \mathcal{E}(k))$ is used to represent the communication topology of the multi-vehicle system at each time instance $k$, where $\mathcal{V}$ and $\mathcal{E}(k)$ are, respectively, the set of vertices that stands for the local vehicles and the edge set that stands for the communication links. In a proximity network, the edge set is defined according to the spatial distance between vehicles, namely $d_{ij}(k) = \|p_i(k) - p_j(k)\|$, as

$\mathcal{E}(k) = \{(i,j) : d_{ij}(k) \le r_c,\ i \ne j\}$   (15)

where $r_c$ is the proximity network threshold. The neighborhood set of vehicle $i$ is defined as $\mathcal{N}_i(k) = \{j \in \mathcal{V} : (i,j) \in \mathcal{E}(k)\}$.
Besides the graph $\mathcal{G}(k)$, an additional directed graph $\mathcal{G}_o(k) = (\mathcal{V} \cup \mathcal{O}, \mathcal{E}_o(k))$ is also implemented to represent the MVS obstacle detection status, where $\mathcal{O}$ is the set of obstacles and the edge set $\mathcal{E}_o(k)$ denotes the pairwise detection between vehicles and obstacles. Similarly, the existence of an edge depends on the detection range $r_o$ of each vehicle, defined as

$\mathcal{E}_o(k) = \{(i,m) : d_{im}(k) \le r_o,\ i \in \mathcal{V},\ m \in \mathcal{O}\}$   (16)

where the spatial distance between vehicle $i$ and obstacle $m$ is $d_{im}(k) = \|p_i(k) - p_m(k)\|$. The set of obstacles within vehicle $i$'s sensing range is defined as $\mathcal{O}_i(k) = \{m \in \mathcal{O} : (i,m) \in \mathcal{E}_o(k)\}$.
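The neighborhood set of Eq. (15) can be computed directly from pairwise distances; the positions and threshold below are illustrative, and the obstacle detection set of Eq. (16) follows the same pattern with the detection range:

```python
import numpy as np

def neighbor_sets(positions, r_comm):
    """Neighborhood sets N_i = {j != i : ||p_i - p_j|| <= r_comm}, Eq. (15)."""
    n = len(positions)
    return {i: [j for j in range(n)
                if j != i and np.linalg.norm(positions[i] - positions[j]) <= r_comm]
            for i in range(n)}

pos = np.array([[0.0, 0.0], [0.5, 0.0], [2.0, 0.0]])
N = neighbor_sets(pos, r_comm=1.0)
```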
III-B Problem Description
The objective is to develop distributed local controllers with waypoint tracking, collision avoidance and network preserving.

Waypoint tracking: Given the discretized waypoints $p_w(k)$, the control objective is to minimize the weighted tracking error norm $\|p_i(k) - p_w(k)\|$.

Collision avoidance: Given predefined minimum separation distances $d_{\min}$ and $d_{\min}^{o}$, the relationship between vehicle $i$ and vehicle $j$, or vehicle $i$ and obstacle $m$, should satisfy $d_{ij}(k) \ge d_{\min}$ and $d_{im}(k) \ge d_{\min}^{o}$.

Communication preserving: Given a maximum communication range $r_c$ between two vehicles, the objective is to keep the connectivity of the graph $\mathcal{G}(k)$ by driving the vehicles to stay within the sensing range of each other, i.e., $d_{ij}(k) \le r_c$ for all $j \in \mathcal{N}_i(k)$.
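The objectives above can be checked pointwise for a single vehicle against one neighbor and one obstacle; the threshold values and the helper name here are illustrative assumptions:

```python
import numpy as np

def constraints_satisfied(p_i, p_j, p_obs, d_min=0.1, d_min_o=0.15, r_c=1.0):
    """Check collision avoidance (vs. neighbor and obstacle) and
    communication preserving (vs. neighbor) for one vehicle.
    Threshold values are illustrative stand-ins."""
    d_ij = np.linalg.norm(np.subtract(p_i, p_j))    # vehicle-vehicle distance
    d_im = np.linalg.norm(np.subtract(p_i, p_obs))  # vehicle-obstacle distance
    return bool(d_ij >= d_min and d_im >= d_min_o and d_ij <= r_c)

ok = constraints_satisfied((0.0, 0.0), (0.5, 0.0), (0.0, 0.8))
```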
A similar problem setup is extensively studied in [6] based on controller design and stability analysis. In this paper we solve the problem in an alternative way, without an explicit analytical process. Instead, we formulate the above three objectives and the dynamic model using the reinforcement learning scheme. Based on the centralized training process and distributed execution process, behavior similar to flocking [6] can be achieved.
IV Flocking Control with DRL
In this section, we introduce the key ingredients of our reinforcement learning based flocking control framework. Specifically, we begin by representing the observed state as a three-layer tensor; then the reward function is described with regard to our flocking control objectives. The DDPG based flocking control is described at the end.
IV-A Observation Representation
For the MVS, the interaction between agent $i$ and the environment contains three aspects, namely the status of its cooperative neighbors $\mathcal{N}_i$, the status of the obstacles within its sensing range $\mathcal{O}_i$, and the common waypoints. In order to model the status in a continuous space while keeping the observation space invariant, we represent the three types of observation mentioned above as three channels of an image-like tensor at different scales.
Neighbor channel: In this channel, the locations of the neighboring agents in the local frame of vehicle $i$, defined as $\tilde p_j = p_j - p_i,\ j \in \mathcal{N}_i$, are projected onto a 2D matrix. The transformation is as follows. First, virtual points, namely the anchors, are equally distributed within the sensing area; then each anchor is able to measure the neighbor intensity around it based on a Gaussian radial function

$f(\tilde p_j, \tilde p_k^{a}) = \exp\big(-\|\tilde p_j - \tilde p_k^{a}\|^2 / (2\sigma^2)\big)$   (17)

where $\tilde p_j$ denotes neighbor $j$'s location in the local frame of agent $i$, calculated as $\tilde p_j = p_j - p_i$; $\tilde p_k^{a}$ denotes the location of anchor $k$ in the local frame of agent $i$; and $\sigma$ represents the anchor's sensitivity to each vehicle's radiation. The measurement value of anchor $k$ is the summation of all neighbors' radiation on anchor $k$, that is

$m_k = \sum_{j \in \mathcal{N}_i} f(\tilde p_j, \tilde p_k^{a})$   (18)
An example of the data representation is shown in Figure 1: several vehicles lie within vehicle $i$'s sensing range, denoted as the neighbors $\mathcal{N}_i$, as in Figure 1(a). By implementing an anchor grid, the neighbor observation based on the above definition is represented in the color domain as in Figure 1(b).
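A minimal sketch of the anchor-grid projection of Eqs. (17)-(18); the grid size, area half-width and $\sigma$ below are illustrative assumptions:

```python
import numpy as np

def neighbor_channel(neighbors_local, grid=8, half_width=1.0, sigma=0.3):
    """Project neighbor positions (already in the ego vehicle's frame) onto
    an anchor grid; each anchor sums Gaussian radial responses, Eqs. (17)-(18)."""
    xs = np.linspace(-half_width, half_width, grid)
    ax, ay = np.meshgrid(xs, xs)              # anchor coordinates
    channel = np.zeros((grid, grid))
    for px, py in neighbors_local:
        channel += np.exp(-((ax - px) ** 2 + (ay - py) ** 2) / (2 * sigma ** 2))
    return channel

obs = neighbor_channel([(0.2, -0.1), (-0.5, 0.4)])
```

The obstacle and goal channels of Eqs. (19)-(20) follow the same projection with their own ranges.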
Obstacle channel: In this channel, the locations of the obstacles within the agent's sensing range, expressed in the local frame, are compressed into a similar channel over the detection area as for the neighbor status:

$m_k^{o} = \sum_{m \in \mathcal{O}_i} f(\tilde p_m, \tilde p_k^{a})$   (19)

where $f$ is the same Gaussian radial function as in Eq. (17), $\tilde p_k^{a}$ is anchor $k$'s position in the local frame of agent $i$, and $\tilde p_m$ denotes obstacle $m$'s position in vehicle $i$'s local frame, $\tilde p_m = p_m - p_i$.
Goal channel: In this channel, the next waypoint is projected onto a similar channel with anchors equally positioned in the region, and the observation is calculated as

$m_k^{w} = f(\tilde p_w, \tilde p_k^{a})$   (20)
IV-B Reward Function
The reward function concerns three aspects of our objective, namely connectivity preserving, obstacle avoidance and waypoint tracking, which are detailed as follows:

Connectivity preserving: This function is to maintain the distance between each vehicle and its neighbors within the maximum communication range $r_c$ while keeping a minimum separation distance $d_{\min}$. Consequently the pairwise reward function $r_{ij}^{c}$ for vehicle $i$ and its neighbor $j$ can be defined as

(21)
Obstacle avoidance: This function is to avoid collisions with obstacles, that is, to keep a minimum separation distance $d_{\min}^{o}$ between vehicle $i$ and the obstacles within its sensing range $\mathcal{O}_i$. The pairwise reward function $r_{im}^{o}$ is defined as

(22)
Waypoint tracking: The agents should follow the predefined mission waypoints. The reward is defined based on a normalized distance of the local vehicle to the waypoint,

$r_i^{w} = -\|p_i - p_w\| / \bar d$   (23)

where $\bar d$ is a normalizing factor of the distance between the target waypoint and the position of the agent.
Finally, the reward function to evaluate the behavior of agent $i$ is composed as

$r_i = \sum_{j \in \mathcal{N}_i} r_{ij}^{c} + \sum_{m \in \mathcal{O}_i} r_{im}^{o} + r_i^{w} + w\|u_i\|$   (24)

The last term is a penalty term to enforce smooth action trajectories and economic maneuvers, with a negative weight factor $w$. In the MVS operation environment, we define an inclusive reward function as a combination of the reward function (24) and the discounted reward of the neighbors as

$\tilde r_i = r_i + \beta \sum_{j \in \mathcal{N}_i} r_j$   (25)

where $\beta$ is a weight factor denoting how much of the neighbors' interest is considered.
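Assuming the additive composition described around Eqs. (24)-(25), the combination can be sketched as follows; the weight values and the scalar stand-ins for the individual reward terms are illustrative, and the piecewise forms of Eqs. (21)-(22) are not reproduced:

```python
def composed_reward(r_connect, r_obstacle, r_track, action_cost, w=-0.1):
    """Eq. (24): sum of the three task rewards plus a penalized action term
    with negative weight w. Input rewards are illustrative scalars."""
    return r_connect + r_obstacle + r_track + w * action_cost

def inclusive_reward(own_reward, neighbor_rewards, beta=0.5):
    """Eq. (25): augment the local reward with the discounted neighbor rewards."""
    return own_reward + beta * sum(neighbor_rewards)

r_own = composed_reward(r_connect=1.0, r_obstacle=0.0, r_track=-0.2, action_cost=0.5)
r = inclusive_reward(r_own, neighbor_rewards=[0.4, -0.1], beta=0.5)
```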
IV-C DDPG Network
According to the DDPG framework proposed in [15], we derive a policy learning method based on the following setup: (1) the input of the learned policy is based on local observations of neighbors, obstacles and waypoints, and (2) the experiences are collected from all vehicles to train a shared policy, which makes the framework involve a centralized training process and a distributed execution process. According to Algorithm 1, the reward for each action is defined over the vehicle as well as its neighbors. The replay buffer is filled with all vehicles' state transitions, which means the training process is centralized and the policy is shared among all vehicles.
The critic network and the actor network are represented in Figure 2. The critic network contains a four-hidden-layer neural network for the state and one hidden layer for the action to approximate the action-state function $Q(s,a;\theta^{Q})$; the actor network contains a similar four-layer network, appended with the action bias and action bound, to approximate the actor function $\mu(s;\theta^{\mu})$. Specifically, two convolution layers are implemented as preprocessing of the observation, and each convolution layer contains multiple convolution kernels and ReLU activations. The overall algorithm of the described DDPG based flocking control framework is presented as Algorithm 1.
TABLE I: Test results of Experiment 1

| Reference waypoint tracking error       | 0.08  |
| Minimum separation distance to obstacle | 0.138 |
| Minimum separation distance to neighbors| 0.094 |
| Average separation distance to neighbors| 0.185 |
TABLE II: Flocking performance under different setups

| No. vehicles | No. obstacles | Aver. training time (ms) | W.P. tracking error | Min. sep. dis. (obs./nei.) | Aver. sep. dis. to nei. |
| 3 | 1 | 122 | 0.125 | 0.158/0.102 | 0.187 |
| 5 | 1 | 140 | 0.152 | 0.152/0.110 | 0.174 |
| 5 | 2 | 147 | 0.147 | 0.144/0.103 | 0.182 |
V Experiment
In this section, the experiments and results of the proposed flocking control framework are presented. We set up the flocking control environment with different numbers of cooperative agents and uncooperative obstacles. The objective is to track the desired reference waypoint while maintaining a proper distance to neighbors and avoiding a randomly moving obstacle. In particular, our DDPG network is designed based on the TensorFlow deep learning framework [18], and the scenario is built using the Gym package [19] and the multi-agent particle environment (MPE) package [20]. The Python implementation of our algorithm is carried out on a laptop with an i7-7700HQ CPU and a GTX 1050 graphics card. During the experiments, the vehicles and obstacles move according to the dynamic model (14) with sampling period $T$, with separate maximum velocities imposed on the vehicles and the obstacle. The minimum separation distances to neighbors and to the obstacle are set as $d_{\min}$ and $d_{\min}^{o}$, and the distance thresholds for the proximity network are set as $r_c$ and $r_o$ respectively. Initially, the vehicles are placed randomly in the 2D plane with random $v$ and $\theta$. The obstacle is placed in the same area and subject to a random-walk process. The reference waypoint is randomly placed and moves with a constant velocity of 0.1 towards the origin.
Experiment 1
In this experiment, the collision avoidance, reference waypoint tracking and obstacle avoidance capabilities of our proposed method are evaluated. The scenario is set as 3 vehicles and one obstacle. Training is run for 30000 episodes, and we use 1000 episodes to evaluate the performance. The average training time is 122 ms per episode. The reward averaged over every 1000 episodes is plotted in Figure 3, which shows that a stable flocking control policy is obtained after 10000 episodes. The test results over 1000 episodes are shown in Table I. Clearly, collision avoidance and waypoint tracking are demonstrated.
Experiment 2
In this experiment, different scenarios are set to evaluate the performance of our method. The results are shown in Table II. Three scenarios are defined with different numbers of vehicles and obstacles. The training time shows only a slight increase as the number of vehicles and obstacles grows, mainly because the state space remains constant under variation in the number of vehicles or obstacles thanks to our observation representation method. Waypoint tracking, collision avoidance and communication preserving are all demonstrated with similar statistical performance.
VI Conclusion
In this paper, a reinforcement learning framework for flocking control with collision avoidance and communication preserving is proposed. The main contributions of our work are twofold: 1) we implemented an observation representation method which transforms the state with dynamically changing size into a tensor-based state of constant size, and 2) we designed a centralized training framework which uses the augmented reward function and a shared policy trained on a common replay buffer filled by all vehicles. The experimental results show that the proposed method achieves flocking control with acceptable performance. In the future, the work will be extended with more experiments and detailed analysis. As one possible direction, theoretical analysis of the consensus on policy will be carried out.
Acknowledgment
The authors would like to thank the National Research Foundation, Keppel Corporation, and National University of Singapore for supporting this work done in the KeppelNUS Corporate Laboratory. The conclusions put forward reflect the views of the authors alone and not necessarily those of the institutions within the Corporate Laboratory. The WBS number of this project is R261507004281.
References
 [1] W. Ren, R. W. Beard, and E. M. Atkins, “Information consensus in multivehicle cooperative control,” IEEE Control Systems, vol. 27, no. 2, pp. 71–82, April 2007.
 [2] J. Alonso-Mora, S. Baker, and D. Rus, “Multi-robot formation control and object transport in dynamic environments via constrained optimization,” The International Journal of Robotics Research, vol. 36, no. 9, pp. 1000–1021, 2017.
 [3] Y. Liu and G. Nejat, “Multirobot cooperative learning for semiautonomous control in urban search and rescue applications,” Journal of Field Robotics, vol. 33, no. 4, pp. 512–536, 2016.
 [4] S. Liu, K. Mohta, S. Shen, and V. Kumar, “Towards collaborative mapping and exploration using multiple micro aerial robots,” in Experimental Robotics. Springer, 2016, pp. 865–878.
 [5] R. W. Beard, J. Lawton, and F. Y. Hadaegh, “A coordination architecture for spacecraft formation control,” IEEE Transactions on control systems technology, vol. 9, no. 6, pp. 777–790, 2001.
 [6] R. Olfati-Saber, “Flocking for multi-agent dynamic systems: Algorithms and theory,” IEEE Transactions on Automatic Control, vol. 51, no. 3, pp. 401–420, 2006.
 [7] R. S. Sutton and A. G. Barto, “Reinforcement learning: An introduction, bradford book,” IEEE Transactions on Neural Networks, vol. 16, no. 1, pp. 285–286, 2005.
 [8] Y. F. Chen, M. Liu, M. Everett, and J. P. How, “Decentralized non-communicating multiagent collision avoidance with deep reinforcement learning,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 285–292.
 [9] J. Foerster, I. A. Assael, N. de Freitas, and S. Whiteson, “Learning to communicate with deep multiagent reinforcement learning,” in Advances in Neural Information Processing Systems, 2016, pp. 2137–2145.
 [10] H. X. Pham, H. M. La, D. Feil-Seifer, and L. Van Nguyen, “Cooperative and distributed reinforcement learning of drones for field coverage,” arXiv preprint arXiv:1803.07250, 2018.
 [11] H. M. La, R. Lim, and W. Sheng, “Multirobot cooperative learning for predator avoidance,” IEEE Transactions on Control Systems Technology, vol. 23, no. 1, pp. 52–63, 2015.
 [12] S. M. Hung and S. N. Givigi, “A Q-learning approach to flocking with UAVs in a stochastic environment,” IEEE Transactions on Cybernetics, vol. 47, no. 1, pp. 186–197, 2016.
 [13] P. Long, W. Liu, and J. Pan, “Deep-learned collision avoidance policy for distributed multiagent navigation,” IEEE Robotics & Automation Letters, vol. 2, no. 2, pp. 656–663, 2016.
 [14] J. van den Berg, M. Lin, and D. Manocha, “Reciprocal velocity obstacles for real-time multi-agent navigation,” in IEEE International Conference on Robotics and Automation, 2008, pp. 1928–1935.
 [15] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multiagent actorcritic for mixed cooperativecompetitive environments,” CoRR, vol. abs/1706.02275, 2017. [Online]. Available: http://arxiv.org/abs/1706.02275
 [16] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, and G. Ostrovski, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
 [17] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” Computer Science, 2014.

 [18] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: A system for large-scale machine learning,” in OSDI, vol. 16, 2016, pp. 265–283.
 [19] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” arXiv preprint arXiv:1606.01540, 2016.
 [20] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multiagent actorcritic for mixed cooperativecompetitive environments,” Neural Information Processing Systems (NIPS), 2017.