I Introduction
The traditional control problem of dynamical systems with nonholonomic constraints is a heavily researched area because of its challenging theoretical nature and its practical use. A wheeled mobile robot (WMR) is a typical example of a nonholonomic system. Researchers in the control community have targeted problems in WMR including setpoint regulation, tracking [7], and formation control [31]. Due to the nature of these problems, the control law design involves sophisticated mathematical derivations and assumptions [13].
One of these problems is image-based localization, in which an autonomous WMR estimates its camera pose with respect to the world frame [30]. It is an important problem in robotics since many other tasks, including navigation, SLAM, and obstacle avoidance, require accurate knowledge of a WMR’s pose [30]. PoseNet adopts a convolutional neural network (CNN) for indoor and outdoor localization [19]. Using end-to-end camera pose estimation, the authors sidestep the need for feature engineering. This method was later extended by introducing uncertainty modeling for factors such as noisy environments, motion blur, and silhouette lighting [18]. Recently, there has been an emerging trend of localization in dynamic indoor environments [22, 24].
Given accurate localization methods, various vision-based control tasks such as leader following can be accomplished. The leader-following problem is defined as an autonomous vehicle trying to follow the movement of a leader object [27]. The form of a leader is not limited to an actual robot, but can also include virtual markers [29], which can serve as a new method for controlling real robots.
However, a virtual leader is able to pass through territory that a real follower cannot, which requires the follower to perform obstacle avoidance [5]. The major challenge of applying classical control methods to the obstacle avoidance problem is having controllers respond correctly to different obstacles or unreachable areas [6, 9]. A common approach is to design an adaptive or hybrid controller by considering all the cases, which is time-consuming [5].
Formation control is a leader-follower problem with multiple leaders or followers. It includes the additional challenge of collision avoidance among group members [16]. Two major approaches are graph-based solutions with interactive topology and optimization-based solutions [12]. However, the nonlinearity of nonholonomic robots adds to the challenges of modeling robots and multi-agent consensus. Deep reinforcement learning (DRL) can address these problems, but training an agent requires complex environmental simulation and physics modeling of robots [2, 25].
In this paper, modular localization circumvents the challenge of not having an actual WMR’s physics model or a time-consuming environmental simulation. While end-to-end training from images to control has become popular [20, 4], partitioning the problem into two steps (localization and control) enables the flexibility of retraining the controller while reusing the localization module. The flowchart of our modular design is shown in Figure 2. The first phase of this work is to apply a CNN model called residual networks (ResNets) [14] to vision-based localization. We focus on indoor localization, which includes challenges such as dynamic foreground, dynamic background, motion blur, and varying lighting. The dynamic foreground and background are realized by giving intermittent views of landmarks. A picture of our training and testing environment is shown in Figure 1. We show that our model accurately predicts the position of robots, which enables DRL without the need for a detailed 3D simulation.
To replace traditional controllers that usually involve complex mathematical modeling and control law design, we leverage DRL for several control problems including leader tracking, formation control, and obstacle avoidance. We propose a new actor-critic algorithm called Momentum Policy Gradient (MPG), an improved framework of TD3 that reduces under/overestimation [11]. We prove MPG’s convergence theoretically and demonstrate its stability experimentally. Furthermore, the proposed algorithm is efficient at solving leader-following problems with regular and irregular leader trajectories. MPG can be extended to solve the collision avoidance and formation control problems by simply modifying the reward function.
In summary, our contribution is fourfold.

We propose Momentum Policy Gradient for continuous control tasks that combats under/overestimation.

A modular approach that circumvents the need for complex modeling work or control law design.

Robust image-based localization achieved using CNNs.

Natural extensions from the single leader-follower tracking problem to collision/obstacle avoidance and formation control with reward shaping.
II Mobility of Wheeled Robots
We consider the problem of forward kinematics for WMRs, which has been studied extensively over the decades. A WMR takes two inputs: angular and linear velocities. The velocity inputs are then fed into onboard encoders to generate torque for each wheel. Due to differences in robot size, number of wheels, and moving mechanisms, robots can be classified into holonomic agents (e.g., omnidirectional Mecanum wheel robots) [17] and nonholonomic agents (e.g., real vehicles with constrained moving direction and speed) [3]. In this paper, we consider nonholonomic agents.
Ideally, angular and linear velocities are the action components of a DRL control agent. However, these two action spaces have very different scales and usually cause size-asymmetric competition [34]. We found that training a DRL agent with angular and linear velocities as actions converges more slowly than our methods presented below. Since there is no loss in degrees of freedom, it is sufficient to use scalar velocities along the x and y axes, similar to the work done in [31].
Let be the Cartesian position, as orientation, and denote linear and angular velocities of the WMR agent . The dynamics of each agent is then
(1) 
The nonholonomic simplified kinematic Equation (2) can then be derived by linearizing (1) with respect to a fixed reference point distance off the center of the wheel axis of the robot, where .
Following the original setting meters in [31], it is trivial to transfer velocity signals as actions used in nonholonomic system control.
(2) 
where and are input control signals to each robot .
In addition, other differential drive methods such as Instantaneous Center of Curvature (ICC) can be used for nonholonomic robots. However, compared to (2), ICC requires more physical details, such as the distance between the centers of the two wheels, and ultimately only works for two-wheeled robots [8]. Meanwhile, decomposing the velocity dynamics into velocities on the x and y axes can reasonably be applied to any WMR.
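As a rough illustration of the idea behind Equation (2), the sketch below converts planar velocity commands for an off-axis reference point into linear and angular velocity commands for a unicycle-type robot. It assumes the standard unicycle model and a reference point at distance `d` from the wheel-axis center along the heading direction. The function names, the symbol names, and the value `d = 0.2` are illustrative assumptions and are not taken from this paper or from [31].

```python
import math


def to_unicycle_commands(u_x, u_y, theta, d):
    """Map planar velocity commands (u_x, u_y) of an off-axis reference point
    to unicycle commands (v, omega). Assumes d > 0 (hypothetical parameter)."""
    v = u_x * math.cos(theta) + u_y * math.sin(theta)
    omega = (-u_x * math.sin(theta) + u_y * math.cos(theta)) / d
    return v, omega


def reference_point_velocity(v, omega, theta, d):
    """Velocity of the reference point under the standard unicycle model."""
    x_dot = v * math.cos(theta) - d * omega * math.sin(theta)
    y_dot = v * math.sin(theta) + d * omega * math.cos(theta)
    return x_dot, y_dot


# Round trip: the reference point should move with the commanded velocity.
d = 0.2                      # assumed offset in meters, for illustration only
theta = math.pi / 3
v, omega = to_unicycle_commands(0.5, -0.3, theta, d)
x_dot, y_dot = reference_point_velocity(v, omega, theta, d)
```

Under these assumptions the reference point behaves like a holonomic point with independent x and y velocities, which is why an agent can act in x and y velocity commands while the robot itself remains nonholonomic.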
III Localization
The localization module focuses on the problem of estimating position directly from images in a noisy environment with both dynamic background and foreground. Several landmarks (e.g., books, other robots) were placed so as to be visible in the foreground. Using landmarks as reference for indoor localization tasks has proven to be successful for learning-based methods [21]. As the WMR moves around the environment, the camera captures only intermittent views of the landmarks, whose positions in the image change from frame to frame.
Overall, data was gathered from types of robot trajectories (i.e., regular, random, whirlpool) with multiple trials taking place at different times of day, when the background changes greatly and lighting conditions vary across morning, afternoon, and evening. The image dataset and ground truth pose of the agent were collected using an HD camera and a motion capture system, respectively. As the WMR moved along a closed path, images were sampled at a rate of Hz, and the ground truth pose of the camera and agent at a rate of Hz. Data collection was performed at the Nonlinear Controls and Robotics Lab. The camera used for recording images is placed on a TurtleBot at a fixed viewing angle. An example from our dataset is displayed in Figure 1.
A ResNet-50 model [14] with a resized output layer is trained from scratch on the dataset. Residual networks contain connections between layers of different depth to improve backpropagation. This mitigates the vanishing gradient problems encountered when training very deep CNNs. The ResNet predicts the robot’s current 2D position, orientation, and the distance to nearby landmarks. Distance to landmarks was not used in later modules, but it helps focus attention onto the landmarks, which are more reliable indicators of the current pose than background changes between images.
We claim that our approach is robust enough to accurately predict a robot’s pose for even very complex trajectories. These paths can have multiple points of self-intersection. Furthermore, ResNet-based localization works well even with a dynamic foreground (multiple landmarks) and a dynamic background. Different lighting conditions and changes in background objects between trials do not affect the accuracy. The error rates are given in Table I for the robot’s 2D coordinates and 4D quaternion orientation. They are all less than . Figure 3 shows an example of predicted and motion-captured poses as well as landmarks; there is almost no difference between the two.
x (%)  y (%)  q1 (%)  q2 (%)  q3 (%)  q4 (%) 

IV Momentum Policy Gradient
Since nonholonomic controllers have a continuous action space, we design our algorithm based on the framework established by DDPG [23]. There are two neural networks: a policy network predicts the action given the state, and a Q-network estimates the expected cumulative reward for each state-action pair.
(3) 
The Q-network is part of the loss function for the policy network. For this reason, the policy network is called the actor and the Q-network is called the critic.
(4) 
The critic itself is trained using a Bellman equation derived loss function.
(5) 
However, this type of loss leads to overestimation of the true total return [10]. TD3 fixes this by using two Q-value estimators and taking the lesser of the two [11].
(6) 
Note that this is equivalent to taking the maximum and then subtracting the absolute difference. However, always choosing the lower value brings underestimation and higher variance [11].
To lower the variance in the estimate, inspired by the momentum used for optimization in [36], we propose Momentum Policy Gradient, illustrated in Algorithm 1, which averages the current difference with the previous difference.
(7)  
(8) 
This combats overestimation bias more aggressively than just taking the minimum of . Moreover, this counters any overtendency TD3 might have towards underestimation. Because neural networks are randomly initialized, by pure chance could be large. However, it is unlikely that and are both large as they are computed using different batches of data. Thus has a lower variance than .
In the case of negative rewards, the minimum takes the larger in magnitude. This will actually encourage overestimation (here the estimates trend toward ).
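The difference between the two target rules can be made concrete with a small numerical sketch. The code below is written only from the verbal description above (TD3 takes the minimum of the two estimates; MPG subtracts from the maximum the average of the current and previous absolute differences). The function and variable names are assumptions, and the exact form used in Algorithm 1 may differ.

```python
def td3_value(q1, q2):
    """Clipped double-Q value: the lesser of the two estimates."""
    return min(q1, q2)


def mpg_value(q1, q2, delta_last):
    """Momentum-adjusted value (assumed form, see lead-in).

    Returns the adjusted value and the current absolute difference,
    which becomes `delta_last` for the next update.
    """
    delta = abs(q1 - q2)
    delta_adj = 0.5 * (delta + delta_last)
    return max(q1, q2) - delta_adj, delta


def bellman_target(reward, gamma, next_value, done=False):
    """One-step bootstrapped target r + gamma * next_value."""
    return reward + (0.0 if done else gamma * next_value)


q1, q2 = 1.0, 3.0
# min(q1, q2) equals max(q1, q2) minus the absolute difference.
identity_holds = td3_value(q1, q2) == max(q1, q2) - abs(q1 - q2)

# With no previous disagreement, the adjustment is half the current difference.
value_first, delta = mpg_value(q1, q2, delta_last=0.0)
# If the previous disagreement equals the current one, MPG matches TD3 here.
value_second, _ = mpg_value(q1, q2, delta_last=delta)
target = bellman_target(reward=1.0, gamma=0.5, next_value=value_first)
```

In this toy case a single large disagreement between the two estimators moves the target less under the averaged rule than under the plain minimum, which is the variance-reduction intuition described above.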
Before proving the convergence of our algorithm, we first require a lemma proved in [33].
Lemma 1
Consider a stochastic process where such that
for all . Let be a sequence of increasing algebras such that are measurable. If

the set is finite

, , but
with probability 1

where and converges to 0 with probability 1

for some constant ,
Then converges to 0 with probability 1.
The theorem and proof of MPG’s convergence are borrowed heavily from those for Clipped Double Q-learning [11].
Theorem 1 (Convergence of MPG update rule)
Consider a finite MDP with and suppose

each state-action pair is sampled an infinite number of times

Q-values are stored in a lookup table

receive an infinite number of updates

the learning rates satisfy , , but with probability 1

for all state-action pairs.
Then Momentum converges to the optimal value function .
Let , , and . Conditions 1 and 2 of Lemma 1 are satisfied. By the definition of Momentum Policy Gradient,
where . Then
where
We have split into two parts: a term from standard Q-learning, and times another expression.
As it is well known that , condition (3) of Lemma 1 holds if we can show converges to 0 with probability 1. Let . If with probability 1, then
so . Therefore showing proves that converges to 0.
This clearly converges to 0. Hence converges to as converges to 0 with probability by Lemma 1. The convergence of follows from a similar argument, with the roles of and reversed.
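The flavor of this result can be checked numerically on a toy problem. The sketch below runs a tabular momentum-style update on a small deterministic MDP that is invented purely for illustration and compares the two lookup tables with the optimal values from value iteration. It assumes the target form described earlier (maximum minus the averaged absolute difference), a greedy action chosen from the first table as in the tabular analysis of Clipped Double Q-learning, and a constant learning rate, which suffices here only because the toy MDP is deterministic. It is a sanity check under these assumptions, not a substitute for the proof.

```python
import random

GAMMA = 0.9
ALPHA = 0.5
# Hypothetical 2-state, 2-action deterministic MDP.
NEXT = [[0, 1], [0, 1]]            # NEXT[s][a] is the next state
REWARD = [[0.0, 1.0], [2.0, 0.0]]  # REWARD[s][a] is the immediate reward


def optimal_q(iters=2000):
    """Optimal Q-values by synchronous value iteration."""
    q = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(iters):
        q = [[REWARD[s][a] + GAMMA * max(q[NEXT[s][a]]) for a in range(2)]
             for s in range(2)]
    return q


def run_momentum_updates(sweeps=1000, seed=0):
    """Sweep all state-action pairs, updating both tables toward one target."""
    rng = random.Random(seed)
    q1 = [[rng.uniform(-5, 5) for _ in range(2)] for _ in range(2)]
    q2 = [[rng.uniform(-5, 5) for _ in range(2)] for _ in range(2)]
    delta_last = 0.0
    for _ in range(sweeps):
        for s in range(2):
            for a in range(2):
                s_next = NEXT[s][a]
                # Greedy next action according to the first table.
                a_star = max(range(2), key=lambda b: q1[s_next][b])
                high = max(q1[s_next][a_star], q2[s_next][a_star])
                delta = abs(q1[s_next][a_star] - q2[s_next][a_star])
                delta_adj = 0.5 * (delta + delta_last)
                y = REWARD[s][a] + GAMMA * (high - delta_adj)
                q1[s][a] += ALPHA * (y - q1[s][a])
                q2[s][a] += ALPHA * (y - q2[s][a])
                delta_last = delta
    return q1, q2


q_star = optimal_q()
q1, q2 = run_momentum_updates()
max_error = max(abs(q[s][a] - q_star[s][a])
                for q in (q1, q2) for s in range(2) for a in range(2))
```

Because both tables are pulled toward the same target, their disagreement shrinks geometrically, the adjustment term vanishes, and the update approaches ordinary Q-learning, mirroring the structure of the argument above.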
V Continuous Control
We demonstrate MPG on a variety of leader-follower continuous control tasks. In all simulations, neural networks have two hidden layers of and units respectively. Output and input sizes vary depending on the environment and purpose of the model. Constraints on the motion of robots (Table II) are enforced by ending episodes once they are breached. A penalty of is also added to the reward for that time step. The hyperparameters of MPG are given in Table III.
Leader
Follower
Hyperparameter  Symbol  Value 

Actor Learning Rate  
Critic Learning Rate  
Batch Size  
Discount factor  
Number of steps in each episode  
Training noise variance  
Initial exploration noise variance  
Minimum exploration noise variance  
Exploration noise variance decay rate 
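The last three hyperparameters describe an exploration noise schedule. One common way to combine an initial variance, a minimum variance, and a decay rate is a multiplicative decay with a floor, sketched below. The multiplicative form, the per-step granularity, the function name, and the numbers used are all assumptions for illustration; the table above does not specify them.

```python
def exploration_variance(step, initial_var, min_var, decay_rate):
    """Exploration noise variance after `step` decay steps.

    Assumed schedule: multiply by `decay_rate` at every step and
    never drop below `min_var`.
    """
    return max(min_var, initial_var * decay_rate ** step)


# Hypothetical values chosen so that the arithmetic is easy to follow.
schedule = [exploration_variance(t, 1.0, 0.25, 0.5) for t in range(5)]
```

With these illustrative values the variance halves twice and then stays at the floor.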
Suppose there are agents whose kinematics are governed by Equations 1 and 2. Let denote the state of agent and the desired formation control for agents follow the constraint [28]:
(9) 
Definition 1
From (9), we consider displacement-based formation control with the updated constraint given as:
(10) 
Each agent measures the position of other agents with respect to a global coordinate system. However, absolute state measurements with respect to the global coordinate system are not needed. A general assumption made for formation control communication is that each agent’s position, trajectory, or dynamics should be partially or fully observable [35, 13].
The leader-follower tracking problem can be viewed as formation control with only two agents.
VA Tracking
We first test MPG by training a follower agent to track a leader whose position is known. The leader constantly moves without waiting for the follower. The follower agent is punished based on its distance to the leader and rewarded for being very close. Let be the 2D position of the leader and be the corresponding position for the follower. The reward function for discrete leader movement is defined as
(11) 
where is the L2 norm.
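To make the verbal description concrete, the sketch below shows one possible reward of this kind: a penalty equal to the L2 distance to the leader, replaced by a fixed bonus when the follower is within a small radius. The threshold `close_radius`, the `bonus` value, and the function name are hypothetical and are not the constants of Equation (11).

```python
import math


def tracking_reward(leader_xy, follower_xy, close_radius=0.1, bonus=1.0):
    """Distance-based tracking reward (illustrative shape only).

    Returns `bonus` when the follower is within `close_radius` of the
    leader, and the negative L2 distance otherwise.
    """
    dist = math.hypot(leader_xy[0] - follower_xy[0],
                      leader_xy[1] - follower_xy[1])
    if dist <= close_radius:
        return bonus
    return -dist


far = tracking_reward((0.0, 0.0), (3.0, 4.0))
near = tracking_reward((0.0, 0.0), (0.05, 0.0))
```

A reward of this shape gives feedback at every time step, so the follower receives a learning signal even when it is far from the leader.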
We then train a follower to track a leader moving in a circular pattern. Remarkably, this follower can generalize to a scaled-up or scaled-down trajectory, is robust to perturbations in the initial starting location, and even tracks a leader with an elliptical motion pattern that it has never encountered before. However, the follower fails to track a leader moving in a square, which is caused by under-exploration of the entire environment. The under-exploration issue is further highlighted when training a follower to track a leader with square motion. The model learns to go straight along an edge in fewer than episodes, but fails to learn to turn a angle for around episodes.
Agents trained against a randomly moving leader generalize well to regular patterns.
(12) 
This is very similar to the discrete courier task in [26]. The random leader provides the follower with a richer set of possible movements compared to previous settings. As shown in Figure 4, this allows the follower to track even more regular trajectories. The follower is robust to changes in the initial position, as seen in Figure 4(c) where the follower starts outside the circle. Average distance between the follower and the leader and the average reward are given in Table IV.
Average Distance  Average Reward  

Random  
Square  
Circle 
For comparison, we also trained several followers using TD3. An example reward is displayed in Figure 5. The values are smoothed using a 1D box filter of size 200. For the circle leader task, TD3 struggles to close the loop, slowly drifting away from the leader as time progresses. The MPG-trained agent does not suffer from this problem.
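A 1D box filter is a moving average with equal weights. The sketch below shows a minimal version that could be used to smooth a reward curve. It keeps only windows that fit entirely inside the data (no padding), which is an assumption, since the boundary handling used for Figure 5 is not stated.

```python
def box_filter(values, size):
    """Moving average of `values` with a window of `size` samples.

    Only full windows are used, so the output has
    len(values) - size + 1 entries (or none if the input is too short).
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if len(values) < size:
        return []
    window_sum = sum(values[:size])
    smoothed = [window_sum / size]
    for i in range(size, len(values)):
        # Slide the window: add the new sample, drop the oldest one.
        window_sum += values[i] - values[i - size]
        smoothed.append(window_sum / size)
    return smoothed


small = box_filter([1.0, 2.0, 3.0, 4.0], 2)
curve = box_filter([float(t) for t in range(1000)], 200)
```

For a curve of 1000 rewards and a window of 200, this produces 801 smoothed values.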
We also noticed that some TD3-trained followers do not move smoothly. This is demonstrated in Figure 6. When trying to track a circle, these agents first dip down from despite the leader moving counterclockwise, starting at .
VB Formation Control
We naturally extend the tracking problem to displacement-based formation control by adding trained followers to track the sole leader. However, additional work is needed to achieve collision avoidance among the followers. Based on displacement-based formation control, we design and conduct two simulations similar to [15]:

Unison formation control: There is a predefined formation with respect to global frame. All agents maintain a rigid formation throughout the entire movement process. The orientation of the formation does not change.

Consensus formation control: Agents are given freedom to adapt rapidly while keeping the formation. The orientation of the formation can be changed.
In all simulations, a single neural network takes as input the positions of all the agents with respect to a global coordinate system and issues movement commands to all agents in the same coordinate system.
For unison formation control, groups of agents have a rigid formation that should not rotate. Given the position of the leader, whose index is , each follower should try to minimize its distance to an intended relative location with respect to the leader while avoiding collisions. As the leader moves, the expected positions move in unison. Let be the minimum safety distance, above which collision can be avoided. The reward is
(13) 
where is the collision coefficient, is the number of collisions, and is the position of agent . Upon any collision, we reset the environment and start a new episode.
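The structure of such a reward can be sketched as follows: each follower is penalized by its distance to its intended location relative to the leader, and every pair of agents closer than the safety distance adds a collision penalty. The function name, argument layout, and numerical values are hypothetical, and the exact terms and constants of Equation (13) may differ.

```python
import math


def unison_reward(positions, offsets, safety_distance, collision_coeff):
    """Illustrative unison formation reward.

    positions[0] is the leader; offsets[k] is the desired displacement of
    follower k + 1 from the leader in the global frame.
    Returns (reward, number_of_collisions).
    """
    leader = positions[0]
    tracking_cost = 0.0
    for pos, off in zip(positions[1:], offsets):
        target = (leader[0] + off[0], leader[1] + off[1])
        tracking_cost += math.hypot(pos[0] - target[0], pos[1] - target[1])

    # Count every pair of agents that violates the safety distance.
    collisions = 0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            gap = math.hypot(positions[i][0] - positions[j][0],
                             positions[i][1] - positions[j][1])
            if gap < safety_distance:
                collisions += 1
    return -tracking_cost - collision_coeff * collisions, collisions


offsets = [(1.0, 0.0), (0.0, 1.0)]
in_formation = unison_reward([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
                             offsets, 0.5, 10.0)
too_close = unison_reward([(0.0, 0.0), (0.1, 0.0), (0.0, 1.0)],
                          offsets, 0.5, 10.0)
```

Because the desired offsets are fixed in the global frame, the formation translates with the leader but does not rotate, matching the unison setting described above.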
We then explore unison formation control for a square formation with curved and random leader movements. As seen in Figure 7, three followers move in unison equally well when tracking (a) a leader with a smooth trajectory starting from the lower left corner and (b) a random leader. During training, we observed that adding more agents results in longer training time. This is because adding an agent increases the state space and action space, and thus our input and output dimensions, by 2. Training time grows approximately linearly in the number of agents. This makes it impractical to train and deploy large formations in the hundreds or thousands of robots. The average reward and distances are reported in Table V.
Pattern  Reward  Dist. to  Dist. to  Dist. to 

Random  
Regular 
Reward  

Unlike unison formation control, which dictates the individual motion of all agents, multi-agent consensus keeps a formation with respect to the local frame. The formation can rotate with respect to the global frame. The problem definition allows for switching and expansion within a given topology, as long as there are no collisions. This can be beneficial. For example, the agents may need to move further apart to avoid an obstacle or tighten their formation to fit through some passageway. In general, agents should maintain a constant distance to each other. Letting , the reward function is given by the following equation.
(14) 
In our simulations, we trained follower agents to maintain a triangle formation with a leader undergoing random motion. As shown in Figure 7 (c), we test the performance of the multi-agent consensus while having the leader traverse a counterclockwise circular trajectory. The leader starts at , follower starts at a random lower left position, and follower starts at a random upper right position. Initially, the three agents are very far away from each other, but they quickly form a triangle when the leader reaches around . We observe that the followers swap their local positions within the formation when the leader arrives at . This is because the reward function depends only on their relative distances, and the agents can maintain the formation with less total movement by switching their relative positions in the group. Results are reported in Table VI.
For the purpose of comparison, an agent was also trained using TD3. The rewards per time step are shown in Figure 8. These were collected over 5 training runs of 200 episodes each. The MPG curves are longer than the TD3 curves because the MPG networks avoid episode-ending collisions for longer. Hence, MPG-trained agents achieve the desired behavior sooner than the TD3 agents.
VC Obstacle Avoidance
Based on the collision penalty embedded in Equation (14) of formation control, fixed or moving obstacle avoidance can naturally be integrated into the leader-following or formation control problems. Instead of redesigning the control law, obstacle avoidance can be achieved by adding a term to the reward function.
The simulation setup is illustrated in Figure 9. The leader linearly travels from to . We have fixed obstacles on the path of the leader that only stop the follower, and a moving obstacle travelling linearly from to . The follower, starting from a random location, must track the leader without hitting any obstacles. The reward function is
(15) 
where is the position of the obstacles and is the relative importance of avoiding the obstacles.
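A dense reward of this type can be sketched as a tracking term plus a weighted obstacle term that varies smoothly with distance. In the example below the obstacle term is an exponentially decaying penalty summed over obstacles; this particular shape, the function name, and the numbers are assumptions chosen for illustration and are not the form of Equation (15).

```python
import math


def obstacle_avoidance_reward(leader_xy, follower_xy, obstacles, weight):
    """Illustrative dense reward: follow the leader, stay away from obstacles.

    `weight` sets the relative importance of avoiding obstacles.
    """
    tracking_cost = math.hypot(leader_xy[0] - follower_xy[0],
                               leader_xy[1] - follower_xy[1])
    # The penalty is largest at an obstacle and decays smoothly with distance,
    # so the follower receives feedback at every time step.
    obstacle_cost = sum(
        math.exp(-math.hypot(ox - follower_xy[0], oy - follower_xy[1]))
        for ox, oy in obstacles
    )
    return -tracking_cost - weight * obstacle_cost


leader, follower = (0.0, 0.0), (1.0, 0.0)
near_obstacle = obstacle_avoidance_reward(leader, follower, [(1.0, 0.5)], 2.0)
far_obstacle = obstacle_avoidance_reward(leader, follower, [(1.0, 5.0)], 2.0)
no_obstacle = obstacle_avoidance_reward(leader, follower, [], 2.0)
```

With the same distance to the leader, the follower is rewarded more when the obstacle is farther away, and the reward reduces to the plain tracking term when no obstacles are present.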
We use this reward, instead of a single penalty for colliding with an obstacle, because Equation (15) gives constant feedback. Training with sparse rewards is a big challenge in DRL [32]. In particular, with only a sparse collision penalty, because the obstacle is encountered later on, the follower learns to strictly copy the leader’s movements without regard for the obstacles. This is a local optimum that the agent fails to escape. For our purposes, however, it is not important that the agent be able to learn from poorly shaped rewards. The realism of the training settings is irrelevant as the agent is already in an artificial and simplified environment.
VI Conclusion and Future Work
In this paper, we propose a new DRL algorithm, Momentum Policy Gradient (MPG), to solve leader-follower tracking, formation control, and obstacle/collision avoidance problems, which are difficult to solve using traditional control methods. These controllers can be trained in a simple toy environment, and then plugged into a larger modular framework. The results show that MPG performs well in training agents to tackle a variety of continuous control tasks. Image-based localization is achieved with a ResNet trained on a custom dataset. Analysis demonstrates that the model can reliably predict a WMR’s pose even with a dynamic foreground and background and with changes in lighting and view.
These methods are computationally inexpensive. On an Nvidia M40 GPU and Intel Xeon E5 processor, MPG takes at most hours to fully converge. For localization, the CNN model took about hours to finish training on a dataset of over GB of images. This is because the ResNet-50 model has residual blocks, compared to the layers in our DRL agents.
In the future, we would like to apply MPG to a wider variety of robotics problems. In particular, we believe our current framework can generalize to larger indoor and outdoor settings. Some work has already been done in this field [2, 1], but there are still major challenges that have not been solved. One of these is data collection. Currently, we require many samples to train our localization module. However, using the techniques of few-shot/one-shot learning [1] and data augmentation [2], it may no longer be necessary to collect so many samples at different times of day. This opens up the possibility of online learning as a robot explores a new area of the environment.
References
[1] (2017) One-shot reinforcement learning for robot navigation with interactive replay. arXiv preprint arXiv:1711.10137. Cited by: §VI.
[2] (2018) Learning deployable navigation policies at kilometer scale from a single traversal. arXiv preprint arXiv:1807.05211. Cited by: §I, §VI.
[3] (2006) Geometry of manifolds with special holonomy. 150 Years of Mathematics at Washington University in St. Louis: Sesquicentennial of Mathematics at Washington University, October 3-5, 2003, Washington University, St. Louis, Missouri 395, pp. 29. Cited by: §II.
[4] (2018) Learning navigation behaviors end to end. CoRR abs/1809.10124. External Links: Link, 1809.10124 Cited by: §I.
[5] (1993) A path algorithm for robotic machining. Robotics and computer-integrated manufacturing 10 (3), pp. 185–198. Cited by: §I.
[6] (2004) Adaptive tracking and regulation of a wheeled mobile robot with controller/update law modularity. IEEE Transactions on control systems technology 12 (1), pp. 138–147. Cited by: §I.
[7] (2000) Robust tracking and regulation control for mobile robots. International Journal of Robust and Nonlinear Control: IFAC-Affiliated Journal 10 (4), pp. 199–216. Cited by: §I.
[8] (2010) Computational principles of mobile robotics. Cambridge university press. Cited by: §II.
[9] (2010) Allostatic control for robot behaviour regulation: an extension to path planning. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1935–1942. Cited by: §I.
[10] (2018) An introduction to deep reinforcement learning. Foundations and Trends® in Machine Learning 11 (3-4), pp. 219–354. Cited by: §IV.
[11] (2018) Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477. Cited by: §I, §IV, §IV.
[12] (2018) A survey on recent advances in distributed sampled-data cooperative control of multi-agent systems. Neurocomputing 275, pp. 1684–1701. Cited by: §I.
[13] (2019) Integrated relative localization and leader–follower formation control. IEEE Transactions on Automatic Control 64 (1), pp. 20–34. Cited by: §I, §V.
[14] (2015) Deep residual learning for image recognition. CoRR abs/1512.03385. External Links: Link, 1512.03385 Cited by: §I, §III.
[15] (2019) Leader–follower formation control of USVs with prescribed performance and collision avoidance. IEEE Transactions on Industrial Informatics 15 (1), pp. 572–581. Cited by: §VB.
[16] (2008) Distributed observers design for leader-following control of multi-agent networks. Automatica 44 (3), pp. 846–850. Cited by: §I.
[17] (1975, April 8) Wheels for a course stable selfpropelling vehicle movable in any desired direction on the ground or some other base. Google Patents. Note: US Patent 3,876,255 Cited by: §II.

[18] (2016) Modelling uncertainty in deep learning for camera relocalization. In 2016 IEEE international conference on Robotics and Automation (ICRA), pp. 4762–4769. Cited by: §I.
[19] (2015) Posenet: a convolutional network for real-time 6-dof camera relocalization. In Proceedings of the IEEE international conference on computer vision, pp. 2938–2946. Cited by: §I.
[20] (2018) Navigation without localisation: reliable teach and repeat based on the convergence theorem. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1657–1664. Cited by: §I.
[21] (2018) AMID: accurate magnetic indoor localization using deep learning. Sensors 18 (5), pp. 1598. Cited by: §III.
[22] (2018) Indoor relocalization in challenging environments with dual-stream convolutional neural networks. IEEE Transactions on Automation Science and Engineering 15 (2), pp. 651–662. Cited by: §I.
[23] (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §IV.
[24] (2018) Deep global-relative networks for end-to-end 6-dof visual localization and odometry. arXiv preprint arXiv:1812.07869. Cited by: §I.
[25] (2018) The formation control of mobile autonomous multi-agent systems using deep reinforcement learning. 13th Annual IEEE International Systems Conference. Cited by: §I.
[26] (2018) Learning to navigate in cities without a map. In Advances in Neural Information Processing Systems, pp. 2424–2435. Cited by: §VA.
[27] (2017) Following the leader using a tracking system based on pre-trained deep neural networks. In Neural Networks (IJCNN), 2017 International Joint Conference on, pp. 4332–4339. Cited by: §I.
[28] (2015) A survey of multi-agent formation control. Automatica 53, pp. 424–440. Cited by: §V.
[29] (2004) Flocking for multi-agent dynamic systems: algorithms and theory. Technical report CALIFORNIA INST OF TECH PASADENA CONTROL AND DYNAMICAL SYSTEMS. Cited by: §I.
[30] (2018) A survey on visual-based localization: on the benefit of heterogeneous data. Pattern Recognition 74, pp. 90–109. Cited by: §I.
[31] (2008) Consensus tracking under directed interaction topologies: algorithms and experiments. In 2008 American Control Conference, pp. 742–747. Cited by: §I, §II, §II.
[32] (2018) Learning Montezuma’s revenge from a single demonstration. arXiv preprint arXiv:1812.03381. Cited by: §VC.
[33] (2000) Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning 38, pp. 287–308. Cited by: §IV.
[34] (1990) Asymmetric competition in plant populations. Trends in ecology & evolution 5 (11), pp. 360–364. Cited by: §II.
[35] (2019) A switched systems approach to consensus of a distributed multiagent system with intermittent communication. In Proc. Am. Control Conf., Cited by: §V.
[36] (2015) Deep learning with elastic averaging sgd. In Advances in Neural Information Processing Systems, pp. 685–693. Cited by: §IV.