Continuous Deep Hierarchical Reinforcement Learning for Ground-Air Swarm Shepherding

04/24/2020 ∙ by Hung The Nguyen, et al. ∙ 0

The control and guidance of multi-robot systems (swarms) is a non-trivial problem due to the complexity inherent in the coupled interaction among the group. Whether the swarm is cooperative or non-cooperative, lessons could be learnt from sheepdogs herding sheep. Biomimicry of shepherding offers computational methods for swarm control with the potential to generalize and scale in different environments. However, learning to shepherd is complex due to the large search space that a machine learner is faced with. We present a deep hierarchical reinforcement learning approach for shepherding, whereby an unmanned aerial vehicle (UAV) learns to act as an aerial sheepdog to control and guide a swarm of unmanned ground vehicles (UGVs). The approach extends our previous work on machine education to decompose the search space into a hierarchically organized curriculum. Each lesson in the curriculum is learnt by a deep reinforcement learning model. The hierarchy is formed by fusing the outputs of the models. The approach is demonstrated first in a high-fidelity Robot Operating System (ROS)-based simulation environment, then with physical UGVs and a UAV in an indoor testing facility. We investigate the ability of the method to generalize as the models move from simulation to the real world and as the models move from one scale to another.




I Introduction

A wide variety of attempts have been made to apply computational intelligence algorithms to the design of swarm robotic systems [18, 5]. One ultimate objective has been centered around the concept of emergence, whereby a long-standing question is how to design local rules that create self-organized and interesting group-level emergent properties [3, 5, 28]. The practical motivation is to design multi-robot systems that can cooperatively solve problems that a single robot, on its own, cannot [43].

The computational requirements of a multi-robot swarm grow as the swarm size increases. We hypothesize that the scalability problem could be solved by borrowing concepts from nature, where a single agent can control a flock. The computational shepherding problem [17], inspired by real shepherding in agriculture, is borrowed in this paper for swarm control.

Similar to a number of other authors, Strömbom et al. [38] developed a heuristic model and validated it against actual sheepdog behaviors. An artificial agent mimicked the sheepdog and a swarm of artificial agents mimicked the behavior of the sheep. The flexibility of the model could extend its uses to human-swarm interaction and to the dynamic control of a robotic swarm using a single, or a few, control agents. A large swarm could be guided and controlled using a smaller number of well-trained artificial sheepdogs.

Swarm guidance in the form of shepherding offers an opportunity for an agent in one domain to guide a swarm in a different domain. For example, the coordination between unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs) has been investigated in previous studies for search-and-rescue, surveillance, path planning for delivery, and wildlife research missions [6, 21, 19, 14]. UAVs possess an aerial mode of observation which provides an extensive field of view (FoV); they have significant advantages when guiding ground vehicles [26] and, therefore, can plan and guide motion from the air. It is plausible to use shepherding principles to design an effective coordination strategy for autonomous air-ground vehicles [4]. The swarm control problem in this case aims at developing an artificial intelligence that controls a UAV to influence the ground vehicles. The coupling in the dynamics arises because the presence of the UAV exerts a repulsive force on the UGVs.

By leveraging the generalization ability of a deep neural network and the long-term planning ability of reinforcement learning algorithms, deep reinforcement learning methods [22, 45, 46] could potentially design an intelligent agent which controls the UAV in the proposed aerial shepherding problem. Nevertheless, designing a deep reinforcement learning algorithm for this problem is challenging due to the complexity of the search space and the uncertainties inherent in the operation of real vehicles and the environment [26]. Previous studies of machine education [10, 7] have demonstrated the feasibility of reducing the complexity of the learning problem by disassembling the search space into smaller chunks.

Some of our previous research [25] applied a decomposition approach using deep hierarchical reinforcement learning (DHRL). The approach was successful in designing a UAV-shepherd to guide a group of ground vehicles. The performance of this DHRL approach is competitive with the rule-based baseline method of Strömbom et al. [38]. In this paper, we advance the research in three directions.

First, we use a Hierarchical Deep Deterministic Policy Gradients (H-DDPGs) methodology to design continuous control policies. Using the same aerial shepherding task as conducted in our DHRL approach [25], we train an aerial sheepdog to guide a swarm of UGVs, and decompose the entire mission into chunks. Each chunk corresponds to a sub-task that requires learning of a group of basic sub-skills. In the context of shepherding, two basic sub-skill groups include the ability to collect the sheep into a flock and the ability to drive the flock to a goal. Each of these sub-skills is learnt individually using DDPGs then the sub-skills are aggregated to solve the overall problem using H-DDPGs. When we compare the baseline method of Strömbom and our previous DHRL work against the proposed H-DDPG, the proposed method is at least as good in terms of performance, with better success rate and efficiency.

Second, we transfer the models learnt in a simulated environment to physical UAV and UGV vehicles in our indoor testing facility. We show that the UAV driven by H-DDPGs is able to perform effectively in this system.

Third, we validated the robustness of transferring the models trained in a small environment to larger environments. We trained an H-DDPG model in a small simulation environment in order to reduce training time. We then successfully transferred the model to a larger simulation environment.

The remainder of the paper is organized as follows. In Section III, we formally define shepherding using an appropriate notational system and a corresponding mathematical objective. The proposed H-DDPG framework is introduced in Section IV, along with the decomposition algorithm, reward scheme, and testing method. The framework is then applied to a physical UAV-UGV shepherding task in Section V. Sections VI and VII present the results of the framework in simulation and physical environments, respectively. Conclusions are drawn in Section VIII, followed by a discussion on future work.

II Related Work

Two approaches have emerged for the control of a robotic swarm: rule-based and learning-based techniques [29, 2]. Rule-based systems [42, 31] represent the controller in different forms including symbolic if…then… rules, tree representations, predicate or higher-order logic, equations using calculus-based or probabilistic representations, or graph representations such as finite state machines and probabilistic graphical models. These systems define and fix the mapping from the states of an agent into action vectors [35, 11, 20]. Learning systems dynamically form a model from experiences and interactions with the environment [13].

Rule-based systems are scalable and fast in the contexts they were designed for. However, when the context changes, for example through changes in the distribution of uncertainties and disturbances [47], such that the model is no longer appropriate, they are unable to adapt or generalize to the new context. Learning systems, by contrast, are flexible and can adapt to control an unmanned vehicle or swarm in novel situations [34].

The literature on other swarm behavior learning problems, such as swarm flocking and leader-follower models, has used decentralized policies learned with reinforcement learning (RL) [13, 44]. A reward scheme is used in RL to dynamically design a learning model for intelligent lifelong learning agents able to adapt to changing environments through trial and error [36]. Asada et al. [1] use RL to learn cooperative behaviors in a robotic soccer application. Zema et al. [48] introduce a Q-learning algorithm, applied on each UAV follower, which uses the radio signal strength values from communications among followers and a single leader to learn to maintain the swarm's formation. Coupled with reinforcement learning methods for producing control policies, knowledge sharing techniques are commonly used in swarm systems to distribute information on the environment or a common value system among swarm members [27, 37, 30].

III Problem Formulation

Without loss of generality, we assume a square environment with two types of agents forming two sets: a set of sheep and a set of shepherds. Each sheep performs three basic behaviors at every time step:

  • Escaping behavior : Sheep at position attempts to escape a predator (sheepdog) by a repulsive force if the distance between the sheep and sheepdog is less than the sheep sensing range for sheepdog (Equation 1).


    The force vector of this behavior is computed by:

  • Collision avoidance: A sheep avoids colliding with other sheep using a force vector that sums the repulsive force vectors from all sheep within its neighborhood. The behavior is active when at least one other sheep is closer than the sheep-to-sheep sensing range; that is,


    The force vector for this behavior is then computed by:

  • Grouping behavior : Sheep gets attracted to the local center of mass of the flock in its neighborhood with a force :


The total force vector from the previous time step, scaled by its own weight, is also taken into account. Finally, the weighted sum of all force vectors is computed to determine the movement of each sheep.
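The three sheep behaviors and the weighted sum described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the weights `w_dog`, `w_avoid`, `w_group`, `w_inertia`, and the ranges `r_dog`, `r_sheep` are hypothetical placeholders, and the grouping term uses the centre of mass of all other sheep rather than a local neighborhood.

```python
import numpy as np

def unit(v):
    """Return v scaled to unit length, or a zero vector if v has zero length."""
    n = np.linalg.norm(v)
    return v / n if n > 0 else np.zeros_like(v)

def sheep_force(i, sheep, dog, prev_force, r_dog, r_sheep,
                w_dog=1.0, w_avoid=2.0, w_group=1.05, w_inertia=0.5):
    """Weighted sum of escaping, collision-avoidance, grouping and previous-force terms.

    All weights and ranges are placeholder values chosen for illustration only.
    """
    p = sheep[i]
    others = np.delete(sheep, i, axis=0)

    # Escaping: pushed directly away from the sheepdog, only inside the sensing range.
    escape = unit(p - dog) if np.linalg.norm(p - dog) < r_dog else np.zeros(2)

    # Collision avoidance: sum of unit repulsions from every sheep closer than r_sheep.
    avoid = np.zeros(2)
    for q in others:
        if np.linalg.norm(p - q) < r_sheep:
            avoid += unit(p - q)

    # Grouping: attraction toward the centre of mass of the other sheep (simplification).
    group = unit(others.mean(axis=0) - p) if len(others) else np.zeros(2)

    return w_inertia * prev_force + w_dog * escape + w_avoid * avoid + w_group * group

# Example: a sheep at the origin with the dog one unit to its left.
flock = np.array([[0.0, 0.0], [3.0, 0.0]])
force = sheep_force(0, flock, np.array([-1.0, 0.0]), np.zeros(2), r_dog=2.0, r_sheep=1.0)
```

In this example the escaping and grouping terms both point along the positive x axis, so the sheep moves away from the dog and toward its flock mate.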


The Strömbom et al. model specifies two basic behaviors for a shepherd:

  • Driving behavior: The behavior is triggered when the distances of all sheep to their center of mass are smaller than a threshold specified in Equation 7; that is, all sheep are gathered as a cluster within a circle of small enough diameter. The shepherd then moves according to a driving force vector, which points from the shepherd to a driving sub-goal point situated a set distance behind the sheep's cluster relative to the goal location. The expression for this distance involves the sensing range between two sheep and the unit distance. The driving force vector is computed by Equation 8.

  • Collecting behavior: If at least one sheep is outside the allowed flock radius, that is, there exist sheep that need to be brought back to the flock, the shepherd moves according to a collecting force vector. The vector points from the shepherd to the collection point situated a set distance behind the furthest sheep relative to the flock center of mass. The collecting force vector is computed by Equation 9.


The updated positions of each agent are computed by Equations 10 and 11, where and are the speed at time of shepherd and sheep , respectively.
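The choice between the driving and collecting sub-goals, followed by a position update, might look like the sketch below. The parameters `flock_radius`, `drive_offset` and `collect_offset` are hypothetical stand-ins for the threshold and distances of Equations 7 to 9, and `step_towards` is an illustrative helper that clips the step at the target, which the position update equations may not do.

```python
import numpy as np

def shepherd_subgoal(sheep, goal, flock_radius, drive_offset, collect_offset):
    """Return (mode, sub-goal point) for a Strömbom-style shepherd.

    Assumes the centre of mass does not coincide with the goal or the furthest sheep.
    """
    com = sheep.mean(axis=0)
    dists = np.linalg.norm(sheep - com, axis=1)
    if np.all(dists < flock_radius):
        # Driving: stand behind the flock on the line from the goal through the centre of mass.
        direction = (com - goal) / np.linalg.norm(com - goal)
        return "drive", com + drive_offset * direction
    # Collecting: stand behind the furthest sheep on the line from the centre of mass through it.
    far = sheep[np.argmax(dists)]
    direction = (far - com) / np.linalg.norm(far - com)
    return "collect", far + collect_offset * direction

def step_towards(position, target, speed, dt=1.0):
    """Move at most speed * dt along the straight line from position to target."""
    delta = target - position
    dist = np.linalg.norm(delta)
    if dist == 0:
        return position
    return position + min(speed * dt, dist) * delta / dist

# Example: a scattered flock triggers collecting behind the outlier at (3, 0).
scattered = np.array([[0.0, 0.0], [0.0, 0.0], [3.0, 0.0]])
mode, point = shepherd_subgoal(scattered, np.array([5.0, 5.0]), 0.5, 0.5, 0.5)
```

Keeping the sub-goal computation separate from the motion step mirrors the way the later sections treat the sub-goal as an input to the learned policies.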


There are two main objectives in the shepherding problem. The shepherd agents need to collect the sheep from a scattered swarm into a flock and herd them towards a goal destination. The first objective is to find a solution that minimizes the completion time: the course of actions taken by the shepherds should complete the shepherding task in the minimum possible time, and the state of the sheep flock must satisfy the constraints below:


where is the position of the goal at time and is the local centre of mass of the shepherd .

In addition, the second objective is to minimize the total travel distance of the shepherds, that is


IV Deep Hierarchical Reinforcement Learning

Our previous study [25] has demonstrated that a deep hierarchical reinforcement learning (DHRL) framework with multi-skill learning is capable of producing an effective strategy for aerial shepherding of ground vehicles. Simulation results of the study indicate that the performance of the DHRL is equivalent to the performance of the Strömbom model. A limitation of DHRL is its reliance on Deep Q-Networks, which generate only four discrete actions: moving forward, moving backward, turning right, and turning left, corresponding to four axis-parallel movements with a fixed travel distance at every step. In reality, the action space of a UAV is continuous, i.e. the UAV moves according to an arbitrary vector of variable length. Hence, the trajectory produced by the DHRL might be sub-optimal and less smooth than that of a continuous output policy.

To address that limitation, we propose a Hierarchical Deep Deterministic Policy Gradient (H-DDPG) algorithm for aerial shepherding of autonomous rule-based UGV agents. The algorithm inherits the advantages of a deep hierarchical reinforcement learning framework, but combines them with a type of reinforcement learning network that produces continuous actions. The H-DDPG produces a continuous output policy for the UAV. We first describe the DDPG algorithm for controlling the UAV in Subsection IV-A, then introduce our proposed framework for multi-skill learning in the shepherding application in Subsection IV-B. In Section VI, we also compare our proposed H-DDPG approach with the DHRL and the Strömbom method [38] in a simulation environment.

IV-A Deep Deterministic Policy Gradient (DDPG)

Deep reinforcement learning (DRL) [23] couples a multi-layer hierarchy of deep neural networks with optimal planning using reinforcement learning for effective behavioral learning in large and continuous state spaces. Less emphasis is placed on feature engineering due to the ability of deep models to autonomously approximate necessary features.

This section revisits the formulation of the reinforcement learning (RL) problem and then describes the Deep Deterministic Policy Gradient (DDPG) algorithm. RL searches for the strategy that offers the best long-term outcome at each state of the environment. In a state $s_t$ at time $t$, an agent performs an action $a_t$ which leads it to a new state $s_{t+1}$ in the state space. The agent receives an instant reward $r_t$ corresponding to the return of the performed action and the value of the next state. The objective is to find the best strategy, given the states and the reward function, with the maximum accumulated return over time:

$$R = \sum_{t=0}^{T} \gamma^{t} r_t \tag{14}$$

where $T$ is the total number of time steps, and $\gamma$ is the discount factor.
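As a small worked example of the accumulated return, the sketch below sums a reward sequence with a discount factor. The function name `discounted_return` is illustrative, and the rewards are made-up numbers rather than values from the experiments.

```python
def discounted_return(rewards, gamma):
    """Accumulate rewards backwards so that each step is discounted once more than the next."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# Three unit rewards with discount 0.5: 1 + 0.5 * 1 + 0.25 * 1.
example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

Accumulating from the last step backwards avoids computing explicit powers of the discount factor.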

Model-free approaches in RL consider the problem in the absence of a world model, which makes them highly suitable for real problems in novel and unknown environments. In these approaches, the Q-value represents the expected value of each state-action pair. Instead of storing Q-values for every state-action pair, which is infeasible in continuous environments, DRL uses deep neural networks as universal function approximators to approximate the mapping from input states to Q-values.

While many problems have been solved using Deep Q-networks (DQN), a popular DRL method in the literature is DDPG [16], which outputs continuous actions. The method is more appropriate for shepherding to learn and generate the influence vectors produced by sheepdogs in the environment.

DDPG employs an architecture that involves two networks called actor and critic. The actor network $\mu(s \mid \theta^{\mu})$ (with weights $\theta^{\mu}$) estimates the output policy in the form of continuous values. To approximate the corresponding Q-value of the policy proposed by the actor, the critic network $Q(s, a \mid \theta^{Q})$ (with weights $\theta^{Q}$) is used. Common practice in the DDPG algorithm is to initialize those two main networks and their clones, called target networks (with weights $\theta^{\mu'}$ and $\theta^{Q'}$, respectively), in order to stabilize the learning process, which can become unstable when the networks change too quickly. The loss function for the critic network, computed over a mini-batch of $N$ transitions, is shown below:

$$L(\theta^{Q}) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - Q(s_i, a_i \mid \theta^{Q}) \right)^2 \tag{15}$$

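where the target value $y_i$ is computed from the target actor $\mu'$ and the target critic $Q'$ as follows:

$$y_i = r_i + \gamma \, Q'\!\left(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\right) \tag{16}$$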

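The updating process of the deep critic network, with weights $\theta^{Q}$ and loss $L(\theta^{Q})$, is based on the following equation:

$$\theta^{Q} \leftarrow \theta^{Q} - \alpha_{Q} \nabla_{\theta^{Q}} L(\theta^{Q}) \tag{17}$$

where $\alpha_{Q}$ is the learning rate of the critic network. The action gradient computed with the critic network is then used for updating the weights $\theta^{\mu}$ of the actor network $\mu$:

$$\theta^{\mu} \leftarrow \theta^{\mu} + \alpha_{\mu} \frac{1}{N} \sum_{i=1}^{N} \nabla_{a} Q(s_i, a \mid \theta^{Q}) \Big|_{a = \mu(s_i)} \nabla_{\theta^{\mu}} \mu(s_i \mid \theta^{\mu}) \tag{18}$$

where $\alpha_{\mu}$ is the learning rate of the actor network.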

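The target networks' weights might be updated by hard replacement or soft replacement procedures. The hard replacement copies the weights of the main networks to the target networks, while the soft replacement changes the weights of the target networks by a proportion of the main networks' weights; that is,

$$\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau)\, \theta^{Q'} \tag{19}$$

$$\theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau)\, \theta^{\mu'} \tag{20}$$

where $\theta^{Q}$ and $\theta^{\mu}$ are the weights of the main critic and actor networks, the primed weights belong to their target networks, and $\tau$ is the replacement proportion.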

Random experience replay is regularly utilized with this algorithm by randomly sampling from the historical data stored in a memory to diminish biases from strongly correlated transitions and reduce training time. DDPG is summarized in Algorithm 1.

Input : Maximum number of episodes, maximum number of time steps in one episode, replay memory, mini-batch size
Initialization : Randomly initialize the actor and critic networks' weights and the target networks' weights
for each episode, up to the maximum number of episodes, do
       Get initial state.
       Initialize random noise.
       while the maximum number of time steps is not exceeded and target state is not achieved do
             Select action using actor network.
             Add noise to the selected action.
             Execute the action and get next state and reward.
             Store transition in the replay memory.
             if the replay memory holds enough transitions then
                   Randomly sample a mini-batch of transitions from the replay memory.
                   Train on batch of transitions and compute target value according to equation 16.
                   Update critic and actor networks according to equations 17 and 18.
                   Update target critic and target actor networks according to equations 19 and 20.
             end if
       end while
end for
Algorithm 1 Deep Deterministic Policy Gradient (DDPG) Algorithm
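The numerical core of the update steps in Algorithm 1 (target value, critic loss, and soft replacement) can be sketched with plain NumPy arrays. This follows the standard DDPG formulation and is only an illustration: the function names are hypothetical, network weights are represented as lists of arrays, and the `done` mask that stops bootstrapping at terminal transitions is an assumption not stated in the text.

```python
import numpy as np

def critic_targets(rewards, next_q, done, gamma):
    """Target value: reward plus the discounted target-critic estimate of the next state.

    next_q holds the target critic's value at the action chosen by the target actor.
    The done mask (assumed here) removes the bootstrap term on terminal transitions.
    """
    return rewards + gamma * (1.0 - done) * next_q

def critic_loss(q_pred, targets):
    """Mean squared error between the target values and the critic's predictions."""
    return float(np.mean((targets - q_pred) ** 2))

def soft_update(target_weights, main_weights, tau):
    """Move each target weight array a proportion tau toward the main network's weights."""
    return [(1.0 - tau) * t + tau * m for t, m in zip(target_weights, main_weights)]

# Example with a mini-batch of two transitions, the second one terminal.
y = critic_targets(np.array([1.0, 0.0]), np.array([2.0, 2.0]), np.array([0.0, 1.0]), gamma=0.9)
loss = critic_loss(np.array([2.5, 0.5]), y)
new_target = soft_update([np.zeros(2)], [np.ones(2)], tau=0.1)
```

Setting `tau=1.0` in `soft_update` reproduces the hard replacement described in the text, since the target weights then become exact copies of the main weights.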

IV-B A Hierarchical Framework of DDPGs

Figure 1 illustrates a hierarchical framework of deep RL using two DDPG networks for controlling the shepherding UAV. The inputs to the networks are the relative directional vectors between the UAV and the sub-goal location (collecting/driving point), and between the UAV and the center of mass of the UGVs. These inputs are calculated directly from sensor data and are used for training the deep neural networks.

Fig. 1: Deep networks for controlling UAV in aerial shepherding scenario.
Fig. 2: DDPG networks architecture.

IV-B1 Skill Decomposition Framework

Given the complexity of the problem, it is challenging for an agent to learn to control a swarm of other agents to meet multiple mission requirements [9]. Thus, using a single deep reinforcement learning network to achieve optimal behavior for different objectives is difficult due to the large search space. Our previous work decomposed the shepherding learning problem into two sub-problems: one for collecting and the other for driving. The UAV can learn to complete a shepherding mission by learning these two sub-problems independently; thus, the complexity of the learning problem becomes more manageable [27, 24].

Two independent training sessions are conducted simultaneously, one for learning to collect and the second for learning to drive. In the former, the training session begins with the initialization of an environment where a UGV is situated at a position far away from a cluster of the other UGVs. For every time step, the UAV learns to navigate to a collection point behind the furthest UGV from the group’s center of mass. When the UAV reaches the collection point, the training session for collecting ends. The training session for driving is similar except that the UAV learns to reach a position on the vector from the goal to the UGVs center of mass, and outside the perimeter of the UGVs’ cluster.

During testing, the two trained networks get connected to a logical gate that switches between the two behaviors based on the state of the UGVs in the environment.
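A minimal sketch of such a switching gate is given below. It assumes the gate uses the same condition as the rule-based model, namely driving when every UGV lies within a radius of the centre of mass and collecting otherwise. The names `select_skill`, `hierarchical_action`, and `flock_radius` are hypothetical, and the two policies are passed in as plain callables standing in for the trained DDPG actors.

```python
import numpy as np

def select_skill(ugv_positions, flock_radius):
    """Return 'driving' when all UGVs are within flock_radius of their centre of mass."""
    com = ugv_positions.mean(axis=0)
    spread = np.linalg.norm(ugv_positions - com, axis=1)
    return "driving" if np.all(spread < flock_radius) else "collecting"

def hierarchical_action(state, ugv_positions, flock_radius, collect_policy, drive_policy):
    """Route the state to whichever trained skill the gate selects."""
    skill = select_skill(ugv_positions, flock_radius)
    policy = drive_policy if skill == "driving" else collect_policy
    return skill, policy(state)

# Example with stand-in policies that return fixed velocity commands.
positions = np.array([[0.0, 0.0], [0.2, 0.0], [3.0, 3.0]])
skill, action = hierarchical_action(
    np.zeros(4), positions, flock_radius=1.0,
    collect_policy=lambda s: np.array([0.1, 0.0]),
    drive_policy=lambda s: np.array([0.0, 0.1]),
)
```

Because the gate is a fixed rule rather than a learned component, the two skills can be trained separately and combined only at test time, as the text describes.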

IV-B2 Reward Design

Consider the positions of the UAV and the sub-goal at a given time step, and the distance between them. When the UAV moves with a given velocity, it reaches a new position, and its distance to the sub-goal changes accordingly. A positive reward is received if the new distance is smaller than the previous one, i.e. the UAV gets closer to the sub-goal, and a negative reward is received otherwise. The reward is discounted over time; therefore, the UAV has to learn to optimize its course of actions to navigate to the sub-goal as fast as possible.


In the testing scenario, a large reward is received when the whole mission is completed.
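The sign rule of the step reward can be illustrated with the sketch below. The reward magnitudes `r_pos` and `r_neg` and the step duration `dt` are placeholder assumptions, since the actual values are not recoverable from the text, and the large mission-completion reward is left out.

```python
import numpy as np

def step_reward(uav_pos, velocity, subgoal, dt=1.0, r_pos=1.0, r_neg=-1.0):
    """Positive reward if the move brings the UAV closer to the sub-goal, negative otherwise."""
    d_now = np.linalg.norm(subgoal - uav_pos)
    next_pos = uav_pos + velocity * dt
    d_next = np.linalg.norm(subgoal - next_pos)
    return r_pos if d_next < d_now else r_neg

# Moving toward the sub-goal at (2, 0) is rewarded; moving away is penalized.
toward = step_reward(np.array([0.0, 0.0]), np.array([0.5, 0.0]), np.array([2.0, 0.0]))
away = step_reward(np.array([0.0, 0.0]), np.array([-0.5, 0.0]), np.array([2.0, 0.0]))
```

A move that leaves the distance unchanged falls in the "otherwise" branch here and receives the negative reward.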

V Experimental Setups

In this paper, we test our proposed model in both simulation and physical environments. We investigate the learning success rate and the training environment’s scalability, where we evaluate how the algorithm performs when the size of the training environment during simulation is different from the physical environment. While, in hindsight, this sounds like a simple rescaling that needs to occur, the scaling problem is non-trivial due to the coupling between the scale and the non-linear functional approximation of DDPG.

V-A Experimental Design

In our shepherding scenario, there is one UAV acting as a sheepdog and three UGVs acting similar to sheep. An underlying assumption is that the UAV has global-sensing ability which enables it to sense the positions of all the UGVs in the environment. In practice, this means that either the UGVs are within the camera field of view of the UAV or they are within the communication range of the UAV. The UAV is required to drive the swarm of the UGVs to a target.

V-A1 Action and state space

For the UGVs, the action space consists of two continuous values representing the linear velocity and the angular velocity (yaw rate). The linear velocity represents the step length of a sheep per second. Similarly, for the UAV, the action space consists of two real values representing the linear velocity along the longitudinal and lateral directions. The commanded speed of the UAV is bounded. In each episode of both the training and testing processes, the UAV automatically takes off to a fixed height, and the height is maintained until the end of the episode. While the height of a UAV would change the influence zone on the ground vehicles, it does not impact the ability of the AR.Drone quadrotor to track a complicated reference trajectory along the x and y axes, as demonstrated in [33, 32]. The state inputs include the x and y coordinates of two representational vectors: UGVs' centre of mass to UAV, and sub-goal to UAV.

V-A2 Deep neural network structure

In the paper, we use a deep feedforward architecture for DDPG. The deep actor network includes two hidden layers with 32 and 64 nodes while the deep critic network is more complex as shown in Figure 2. The optimizer method is Adam [15], and the training parameters are the same used in the original DDPG algorithm [23]. The H-DDPG algorithm is trained with a replay memory size of state-action pairs, the discount factor , mini-batch size 32, and the learning rate for the actor network and the critic network being 0.00025 and 0.001, respectively. The maximum number of steps in both collecting and driving DDPG models is 1000.
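To make the actor's shape concrete, the sketch below builds a feedforward pass with four inputs (the x and y coordinates of the two state vectors), hidden layers of 32 and 64 nodes, and two outputs squashed by tanh. The ReLU hidden activation, the random weight initialization, and the function names are assumptions for illustration; the critic architecture of Figure 2 is not reproduced.

```python
import numpy as np

def init_actor(rng, n_in=4, hidden=(32, 64), n_out=2):
    """Create (weights, bias) pairs for a 4 -> 32 -> 64 -> 2 feedforward network."""
    sizes = (n_in, *hidden, n_out)
    return [(rng.normal(0.0, 0.1, size=(a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def actor_forward(params, state):
    """Map a 4-dimensional state to a 2-dimensional action in [-1, 1]."""
    h = np.asarray(state, dtype=float)
    for w, b in params[:-1]:
        h = np.maximum(0.0, h @ w + b)  # ReLU hidden activation (assumed)
    w, b = params[-1]
    return np.tanh(h @ w + b)  # tanh keeps each action component within [-1, 1]

# Example: state = (centre-of-mass-to-UAV x, y, sub-goal-to-UAV x, y).
actor = init_actor(np.random.default_rng(0))
action = actor_forward(actor, [0.5, -0.2, 1.0, 0.3])
```

The bounded output would then be mapped to the commanded velocity range of the UAV before being sent to the vehicle.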

V-A3 Experimental setups

In the simulation experiment, we aim to investigate the learning success rate and the scalability of the training environment when training our proposed H-DDPG approach in both 4×4 and 6×6 environments. Firstly, we train the collecting and driving DDPG models using both environment sizes. Then, we test the models trained in the 4×4 environment on the 6×6 environment. The learning framework demonstrates scalability when an agent that is successfully trained in a small simulation environment can be effectively applied in a larger simulation or physical environment. When adopting the model trained in the smaller environment in a larger environment, the inputs and outputs of the deep networks are scaled. The scaling factor is calculated as in Equation 22.


Here the sizes of the small and large environments are each described by the width and height of the environment. The state input and the action output of the model are each multiplied by a corresponding scaling term. The action output is normalized using tanh to limit its range to between -1 and 1.
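One plausible reading of this rescaling is sketched below. It assumes, purely for illustration, that the factor is the per-axis ratio of the small environment's size to the large one's, that states are multiplied by this factor before entering the model, and that actions are multiplied by its inverse afterwards. The exact form of Equation 22 and the placement of the tanh are not recoverable here, so every name and choice in this block is hypothetical.

```python
import numpy as np

def scale_factor(small, large):
    """Per-axis ratio of the small to the large environment, each given as (width, height)."""
    return np.asarray(small, dtype=float) / np.asarray(large, dtype=float)

def act_in_large_env(policy, state_vectors, small, large):
    """Shrink the state to the training scale, query the policy, then enlarge the action."""
    lam = scale_factor(small, large)
    scaled_state = np.asarray(state_vectors, dtype=float) * lam  # rows are (x, y) vectors
    action = np.asarray(policy(scaled_state), dtype=float)
    return action / lam

# Example: a stand-in policy that simply heads along the first state vector.
state = [[3.0, 1.5], [6.0, 6.0]]
command = act_in_large_env(lambda s: s[0], state, small=(4.0, 4.0), large=(6.0, 6.0))
```

With a 4×4 training environment and a 6×6 test environment the factor is 2/3 on each axis, so the policy sees state vectors shrunk to the scale it was trained on.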

In this paper, we conduct two experimental setups. The first setup aims to evaluate the learning success rate and the scalability in response to changes in the size of the environment. The second setup tests the models trained in simulation in a physical environment. Table I shows the parameters used in both the setups.

Name Parameters
0.5 (m)
1 (m)
1.3 (m)
2 (m)
2 (m)
0.5 (m/s)
0.1 (s)
TABLE I: Parameters in setups.

V-A4 Evaluation Metrics

Seven metrics for evaluating the performance of the learning framework for the UAV-UGV shepherding problem are listed below.

  • Cumulative reward over training episodes that UAV agent receives.

  • Success rate (%) is the percentage of missions completed, computed over 30 trials. Mission success is achieved when all the UGVs are collected and driven to the goal position.

  • Number of steps is the number of steps for the centre of mass of UGVs to reach goal position.

  • Travel Distance (m) is the total distance that the UAV covers in the environment.

  • Error per step (m) is the difference between the desired and actual positions, where the desired position is the sum of the previous position and the desired movement. The desired movement is calculated by multiplying the linear velocity by the duration of a step.

  • Distance between the aerial shepherd and the sub-goal per step (m).

  • Reduction in distance between the center of mass of the UGVs and the target per step (m).

While the cumulative reward and success rate reflect the training and testing effectiveness of the learning algorithm, the remaining metrics represent the efficiency of the policy proposed by the learning framework. In addition, the trajectories of the UAV and UGVs’ center of mass in the simulation and physical environments are visualized to illustrate mission performance.
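Two of these metrics, travel distance and success rate, can be computed from logged trial data along the lines of the sketch below. The function names and the example data are illustrative and are not taken from the experiments.

```python
import numpy as np

def travel_distance(trajectory):
    """Sum of straight-line segment lengths along a sequence of (x, y) positions."""
    points = np.asarray(trajectory, dtype=float)
    return float(np.linalg.norm(np.diff(points, axis=0), axis=1).sum())

def success_rate(outcomes):
    """Percentage of trials flagged as successful."""
    return 100.0 * sum(bool(o) for o in outcomes) / len(outcomes)

def mean_and_std(values):
    """Average and standard deviation of a per-trial metric."""
    return float(np.mean(values)), float(np.std(values))

# Example: a short UAV path and a batch of 30 trials with one failure.
path_length = travel_distance([[0.0, 0.0], [3.0, 4.0], [3.0, 5.0]])
rate = success_rate([True] * 29 + [False])
```

With 30 trials, a single failure corresponds to a success rate of about 96.67%.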

V-B Environment and Control Network Setups

Our proposed hierarchical framework using DDPG networks to control shepherding UAV is initially trained and tested within a simulation environment. Robot operating system (ROS) is used as an interface which allows the agents to communicate with the Gazebo simulation environment. The simulator package for the UAV is Tum-Simulator [12], which simulates the Parrot AR Drone 2.

We further evaluate the transferability of our proposed model to a physical environment. The control network in the physical environment includes a VICON motion capture system (MCS) to detect the states of the environment, a base station where the information is processed and command messages are automatically generated, and the UAV and UGVs. A detailed description and the specifications of each component in the system can be found in Section S1 of the Supplementary Document.

Fig. 3: Overall network architecture.

Figure 3 shows how data is transferred in the physical system. Firstly, the VICON MCS broadcasts continuously to each entity at a frequency of 100 Hz using the UDP network protocol. In the control network, the central computer, which contains the AI code, receives the states (position and orientation) of the UAV and all UGVs in order to calculate the input states for the AI program. After that, the AI program produces velocity commands for the UGVs and for the UAV and sends them as ROS messages.

VI Evaluation in Simulation

We compare the performance of the H-DDPG with that of the DHRL and the Strömbom approach as a baseline method. For the DHRL [25], we re-trained the two deep Q-network (DQN) models (driving and collecting) until convergence with the same parameters of the shepherding task shown in Table I. In the testing phase, our proposed H-DDPG is tested on 30 different testing cases. In each testing case, the UAV is initialized at a different position in the simulation environment. Both the DHRL and Strömbom methods are tested on the same testing set.

VI-A Training

The collecting and driving DDPG models of our H-DDPG approach in both the 4×4 and 6×6 environments are fully trained in 3000 episodes. Figure 4 shows the averages and standard deviations of the cumulative total reward per action over every 10 episodes.

The cumulative total rewards per action of both the collecting and driving models increase significantly in the first 100 episodes, and then stabilize at around 0.09 until the end of training. These trends demonstrate that the collecting and driving DDPG models are able to learn effectively, converging at an approximate reward of 0.085.

(a) Driving in 4×4
(b) Collecting in 4×4
(c) Driving in 6×6
(d) Collecting in 6×6
Fig. 4: Learning curves of the driving and collecting DDPG models in the 4×4 and 6×6 environments.

This tendency shows the convergence of the learning processes of the four DDPG models in both the 4×4 and 6×6 environments.

VI-B Testing

The testing shepherding environment is the same as the training environment. In total, we conduct eight testing scenarios as shown in Table II. For the DHRL-44to66 testing scenario, we scale only the state, because the discrete action output of the DHRL is fixed in both environments.

Testing ID        Description
Strombom-44       Testing the Strombom model on 4×4
Strombom-66       Testing the Strombom model on 6×6
DHRL-44           Testing the 4×4 trained model on 4×4
DHRL-66           Testing the 6×6 trained model on 6×6
DHRL-44to66       Testing the 4×4 trained model on 6×6 with scale.
H-DDPG-44         Testing the 4×4 trained model on 4×4
H-DDPG-66         Testing the 6×6 trained model on 6×6
H-DDPG-44to66     Testing the 4×4 trained model on 6×6 with scale.
TABLE II: Testing scenarios on 4×4 and 6×6 environments.

In each testing scenario, we conduct the 30 different testing cases, and then calculate average and standard deviations of the number of steps, the travel distance, and success rate.

Table III shows the results of the three methods tested in the 4×4 environment. H-DDPG outperforms the Strömbom approach on the three metrics (the number of steps, travelled distance, and success rate). However, in the larger 6×6 environment, the Strömbom model performs better than the learning methods in both the number of steps and the travelled distance, but not in terms of success rate.

H-DDPG used slightly fewer steps than DHRL in both the 4×4 and 6×6 environments. Although the travelled distance of the H-DDPG is slightly longer, the discrete action space of DHRL is unrealistic, and its shorter distance therefore reflects non-smooth, sharper manoeuvres by the UAV. The discrete actions of the DHRL produce unnatural zigzag movements that reduce the travelled distance. The difference between the two movements is shown in Figure 5.

Table IV shows that, after scaling both actions and states, the H-DDPG model trained in the 4×4 environment outperforms the H-DDPG model trained in the larger 6×6 environment in terms of the travelled distance, although it takes somewhat more steps. A possible explanation is that the action values of the 4×4 H-DDPG model are slightly smaller than those of the 6×6 model once these values are scaled and passed through the tanh function. These smaller values help to reduce the wide movements of the UAV, causing the path of the 4×4 H-DDPG agent to be slightly shorter. This view is illustrated in Figure 5.

These simulation results show that the H-DDPG learning approach produces agents that can successfully perform aerial shepherding. Additionally, the results show that an agent trained in a small environment is able to perform the same aerial shepherding task in a larger environment.

Experiments | Number-Steps | Travelled-Distance | Success (%)
Strombom-4×4 | 135 ± 119 | 11 ± 9.3 | 96.67
DHRL-4×4 | 107 ± 16 | 8.3 ± 1.2 | 100
H-DDPG-4×4 | 100 ± 16 | 9.7 ± 1.5 | 100
TABLE III: Averages and standard deviations of number of steps, travelled distance, and success rates of the three setups in 30 testing cases in 4x4 simulation environment.
Experiments | Number-Steps | Travelled-Distance | Success (%)
Strombom-6×6 | 187 ± 21 | 15.2 ± 1.8 | 96.67
DHRL-6×6 | 224 ± 32 | 16.9 ± 2.4 | 100
DHRL-4×4to6×6 | 206 ± 29 | 15.9 ± 2.3 | 100
H-DDPG-6×6 | 188 ± 44 | 19.9 ± 5.4 | 100
H-DDPG-4×4to6×6 | 205 ± 20 | 16.7 ± 1.8 | 100
TABLE IV: Averages and standard deviations of number of steps, travelled distance, and success rates of the five setups in 30 testing cases in 6x6 simulation environment.
(a) Strombom-66
(b) DHRL-66
(c) HDDPG-66
(d) HDDPG-44to66
Fig. 5: Trajectories of the UAV and the center of mass in the 15th testing case performed in a 6×6 simulation environment. The movement of DHRL is zigzag and less smooth than that of HDDPG. The movement of the HDDPG model transferred from 4×4 to 6×6 is narrower than that of the HDDPG model trained in the 6×6 environment. Compared to the Strömbom model, the three trained models appear to drive better, guiding the center of mass of the UGVs straight to the target.

The promising simulation results provide initial evidence of the effectiveness of our proposed learning approach for the aerial shepherding task. They also show that it is not necessary to train the agent in the same simulation environment in which it is tested, and that the skills learnt are usable after appropriate re-scaling. An agent trained in the smaller environment still performed well, and in some respects better.

VII Evaluating on physical system

Transferring the learnt models from simulation to the physical environment is challenging. Firstly, the yaw of the UGVs needs to be stabilized in order to make them move in the required directions without violating the non-holonomic constraints for a wheeled vehicle.

Secondly, our proposed algorithm outputs virtual force commands, which essentially correspond to acceleration commands. We integrate these once to obtain a velocity and again to obtain a position, which we use as the prescribed position command. The UAV quadcopter is controlled to follow this desired position trajectory rather than the acceleration references in order to reduce the tracking errors caused by offset errors and exogenous disturbances. In the next section, we describe these solutions in the physical environment before conducting the testing scenarios for our learning approach.
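The double integration described above can be sketched as follows. This is an illustrative sketch only: the source does not state the integration scheme, so the semi-implicit Euler update, the function name `integrate_force_command`, the unit mass, and the example time step are all assumptions.

```python
def integrate_force_command(position, velocity, force, dt, mass=1.0):
    """Turn a virtual force command into a position setpoint.

    Semi-implicit Euler (assumed scheme): the velocity is updated first and
    the updated velocity is then used to advance the position.
    """
    ax, ay = force[0] / mass, force[1] / mass            # force -> acceleration
    vx, vy = velocity[0] + ax * dt, velocity[1] + ay * dt
    px, py = position[0] + vx * dt, position[1] + vy * dt
    return (px, py), (vx, vy)


# Example: start at rest at the origin and apply a constant command for three steps.
pos, vel = (0.0, 0.0), (0.0, 0.0)
for _ in range(3):
    pos, vel = integrate_force_command(pos, vel, force=(1.0, 0.0), dt=0.1)
```

The resulting `pos` is what a position tracking controller would receive as its reference, in place of the raw acceleration command.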

VII-A System stabilization

VII-A1 UGVs yaw stabilization

The absolute yaw angle of each UGV was controlled by an adaptive Strictly Negative Imaginary (SNI) controller based on a Fuzzy Inference System (FIS), whose parameters were tuned by trial and error to minimize the tracking errors [39]. To do so, we first rotated the yaw axis in VMTS to coincide with the one in the Gazebo simulation. We then designed the closed-loop yaw controller to stabilize the rotational motion of all mobile robots. Descriptions of the Fuzzy-SNI control strategy and a test experiment for control stabilization can be found in Section S2 of the Supplementary Document.

VII-A2 UAV control stabilization

Since the AR.Drone quadcopter is subject to multiple disturbances and has control offsets that can cause position drift, controlling it with velocity commands is not feasible in areas with restricted space [8], especially in the area available in our VICON lab.

To solve this problem, we estimated the next global position from the UAV's current coordinates and its velocity setpoints, as in Equation 23, and then stabilized the UAV position using the adaptive Strictly Negative Imaginary (SNI) position tracking controller described in [40, 39, 41]. As soon as the desired waypoint position is reached, the quadcopter receives the next velocity references and produces the next desired position using Eq. (23).


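$$x_d(k+1) = x(k) + v_x(k)\,\Delta t, \qquad y_d(k+1) = y(k) + v_y(k)\,\Delta t \tag{23}$$

where $(x_d, y_d)$ denotes the desired position of the UAV in the plane, $(x, y)$ is its current position, and $v_x$ and $v_y$ are the actual velocities of the UAV along the $x$ and $y$ axes. Here $k$ is the time step and $\Delta t$ is the sample time.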
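The waypoint logic described in this subsection (predict the next position from the current coordinates and the velocity setpoints, track it, and request new velocity references once it is reached) can be sketched as below. This is a hedged illustration, not the authors' controller: the function names, the arrival tolerance of 0.05, and the example numbers are assumptions, and the SNI position tracking controller itself is not modelled.

```python
import math


def next_waypoint(position, velocity_setpoint, sample_time):
    """Desired position = current position + velocity setpoint * sample time."""
    return (position[0] + velocity_setpoint[0] * sample_time,
            position[1] + velocity_setpoint[1] * sample_time)


def waypoint_reached(position, waypoint, tolerance=0.05):
    """True once the UAV is within `tolerance` of the waypoint (assumed value)."""
    return math.hypot(waypoint[0] - position[0],
                      waypoint[1] - position[1]) <= tolerance


# Example: one waypoint generated from the current pose and a velocity reference.
current = (1.0, 2.0)
waypoint = next_waypoint(current, velocity_setpoint=(0.5, -0.5), sample_time=0.2)
```

A new velocity reference would only be accepted once `waypoint_reached` returns true for the latest measured position.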

VII-B Testing

We adopt the parameters as shown in Table I except for the time step. For this setup, we set the time step at seconds in both the simulation and physical environment. Firstly, we conduct testing scenarios in the simulation, and then in the physical environment with the same parameters. Four testing scenarios are shown in Table V.

Testing ID | Description
HDDPG-6×6-Sim | Testing the 6×6 trained model on the 6×6 simulation
HDDPG-4×4to6×6-Sim | Testing the 4×4 trained model on the 6×6 simulation with scaling
HDDPG-6×6-Phy | Testing the 6×6 trained model on the 6×6 physical environment
HDDPG-4×4to6×6-Phy | Testing the 4×4 trained model on the 6×6 physical environment with scaling
TABLE V: Testing scenarios on 6×6 simulation and physical environments.

We calculate the average and standard deviation of the number of steps, the travelled distance, the error per step between the desired and actual movement, the distance per step between the position of the aerial shepherd and the sub-goal, and the reduced distance per step between the center of mass of the UGVs and the target, as well as the success rate, as shown in Table VI.

The error per step of each episode is calculated by measuring the distance between the actual position and the desired position in every step. For both the simulations and real experiments, the UAV is stabilized by a position tracking system. The desired position is updated by adding the desired linear velocity multiplied by the time step ( second) at each time step.
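The per-episode quantities defined above can be computed from logged positions roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, positions are assumed to be logged as planar (x, y) tuples at every step, and the reduced distance per step is assumed to be the average decrease in the distance from the center of mass to the target.

```python
import math


def dist(a, b):
    """Euclidean distance between two planar points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def travelled_distance(path):
    """Sum of segment lengths along the logged UAV path."""
    return sum(dist(p, q) for p, q in zip(path, path[1:]))


def error_per_step(actual, desired):
    """Mean distance between actual and desired UAV positions over an episode."""
    return sum(dist(a, d) for a, d in zip(actual, desired)) / len(actual)


def reduced_distance_per_step(centers, target):
    """Average per-step decrease of the center-of-mass-to-target distance."""
    d = [dist(c, target) for c in centers]
    return (d[0] - d[-1]) / (len(d) - 1)
```

Averaging these values over the testing cases of a scenario then gives entries of the kind reported in the results table.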

Table VI shows interesting results for the testing scenarios. Firstly, the success rate of all four testing scenarios is 100%, as all the agents pass the three testing cases in both the simulation and the physical experiments. As in the previous testing setup, Table VI shows that the 4×4 agent, after being scaled by Equation 22, performs better in the 6×6 testing environment than the agent trained in the 6×6 environment. In the simulation, the travelled distance of the 4×4 agent is 13.4 ± 0.9, compared to 17.4 ± 1.4 for the 6×6 agent. The error per step of the 4×4 agent is also considerably smaller than that of the 6×6 agent: 0.018 ± 0.001 and 0.026 ± 0.003, respectively. Although the distance per step between the UAV and the sub-goal is larger for the 4×4 agent, it is not enough to affect the overall performance. This higher distance per step arises from the different size of the environment, even though the agent's actions are scaled up. The reduced distance per step between the center of mass and the target is the same, 0.006, for both simulation scenarios.

Experiments | Number-Steps | Travelled-Distance | Error-Per-Step | Dog-to-Subgoal | Center-to-Target | Success (%)
HDDPG-6×6-Sim | 505 ± 23 | 17.4 ± 1.4 | 0.026 ± 0.003 | 0.63 ± 0.03 | 0.006 ± 0.000 | 100
HDDPG-4×4to6×6-Sim | 533 ± 55 | 13.4 ± 0.9 | 0.018 ± 0.001 | 0.72 ± 0.04 | 0.006 ± 0.000 | 100
HDDPG-6×6-Phy | 911 ± 93 | 27.6 ± 2.6 | 0.05 ± 0.000 | 1.35 ± 0.05 | 0.003 ± 0.000 | 100
HDDPG-4×4to6×6-Phy | 881 ± 88 | 24.1 ± 1.7 | 0.05 ± 0.001 | 1.29 ± 0.06 | 0.003 ± 0.000 | 100
TABLE VI: Averages and standard deviations of number of steps, travelled distance, cumulative error and success rates of three testing cases of the four setups in simulation and physical environments.
(a) HDDPG-6x6to6x6-Sim:Trajectory
(b) HDDPG-4x4to6x6-Sim:Trajectory
(c) HDDPG-6x6to6x6-Phy:Trajectory
(d) HDDPG-4x4to6x6-Phy:Trajectory
(e) HDDPG-6x6to6x6-Sim:Errors
(f) HDDPG-4x4to6x6-Sim:Errors
(g) HDDPG-6x6to6x6-Phy:Errors
(h) HDDPG-4x4to6x6-Phy:Errors
Fig. 6: Trajectories of the UAV and the center of mass, and errors of the distance from the UAV to the sub-goal and the reduced distance from the center to the target, in a testing case performed in a 6×6 simulation and physical environment.

Similar behavior is observed in the physical environment. The travelled distance of the 4×4 aerial shepherd is smaller than that of the 6×6 agent: 24.1 ± 1.7 and 27.6 ± 2.6, respectively. Both values are higher than the corresponding values in the simulation. However, the error per step of the two agents is the same, 0.05, which is considerably higher than their errors per step in the simulation. This difference means that the UAV agents need to travel further to achieve the task in the physical environment. The same effect is seen in the distance per step between the position of the UAV and the sub-goal, which is larger than in the simulation and degrades the UAV's driving control of the UGVs. Consequently, the reduced distance per step from the center of mass to the target is smaller than in the simulation, so the agents need more steps in the physical environment than in the simulation. However, it is worth noting that in the 6×6 area of the physical environment, this error per step is acceptable [39, 41].
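To make the size of the simulation-to-physical gap concrete, the mean values reported in Table VI can be compared directly. The short sketch below is only illustrative arithmetic on those reported means; the dictionary layout, the key names, and the `ratio` helper are assumptions for illustration, and the standard deviations are ignored.

```python
# Means copied from Table VI: (number of steps, travelled distance, error per step).
# Keys are (model, setting); "4x4to6x6" is the scaled 4x4 model tested on 6x6.
TABLE_VI_MEANS = {
    ("6x6", "sim"): (505, 17.4, 0.026),
    ("4x4to6x6", "sim"): (533, 13.4, 0.018),
    ("6x6", "phy"): (911, 27.6, 0.05),
    ("4x4to6x6", "phy"): (881, 24.1, 0.05),
}


def ratio(phy_key, sim_key):
    """Element-wise physical / simulation ratio of the reported means."""
    return tuple(p / s for p, s in zip(TABLE_VI_MEANS[phy_key],
                                       TABLE_VI_MEANS[sim_key]))


for model in ("6x6", "4x4to6x6"):
    steps, distance, error = ratio((model, "phy"), (model, "sim"))
    print(f"{model}: steps x{steps:.2f}, distance x{distance:.2f}, "
          f"error/step x{error:.2f}")
```

On these means, both agents need roughly 1.6 to 1.8 times as many steps and as much travelled distance in the physical environment as in the simulation, while the error per step is about two to three times larger.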

Some general observations are shown in Figure 6. Although the paths of the 4×4 and 6×6 agents in the physical environment are longer and more oscillatory than those in the simulation, the shapes of the movements are highly similar: the agents collect the furthest sheep and then herd the entire swarm of UGVs towards the target successfully. The error comes from disturbances and the natural offset behavior of the drone in motion, even though it is stabilized by the position tracking system. Figure 6 confirms that the distance from the UAV to the sub-goal in the simulation is stable and smaller than the corresponding distance in the physical environment. However, the distance between the center of mass and the target in both the simulation and the physical environment tends to decrease, although not monotonically, until the task is completed.

From these testing results, we can see that the proposed learning approach shows promise for producing successful aerial shepherds. Although there are gaps between the performance of the agents in the simulation and the physical environment due to disturbances and un-modelled dynamics, these gaps do not significantly change the behaviour. Additionally, within both the simulation and physical environments, we show that it is feasible that an agent trained in a smaller environment can transfer its skills to a larger environment.

VIII Conclusion and Future Work

In this paper, we have introduced a deep hierarchical reinforcement learning framework that decomposes a complex ground-air shepherding problem into simpler sub-problem spaces and trains the UAV agent to obtain the desired behavior in each case. The deep deterministic policy gradient (DDPG) networks demonstrate effective learning capabilities, achieving near-optimal solutions in a dynamic, continuous environment and outputting continuous values as velocity vectors that are easy to execute through the control systems. The framework with the trained networks is then tested on the entire UAV-UGV shepherding mission, where the objective is fulfilled by behavior that emerges from combining low-level actions learned through interaction with two simpler search spaces.

The framework is tested in simulated and physical environments of the same size. We compared the H-DDPG approach against DHRL [25] and the Strömbom method [38] in the simulated environment. In the physical environment, there is more uncertainty due to the limitation of dynamic modelling in simulation, which results in some offsets between the desired and actual trajectories. The difference in performances between the simulation and physical environments is insignificant.

The scalability of the framework is also examined through transfer learning with state and action scaling from a smaller environment of 4×4 (m × m) to a larger one of 6×6 (m × m). Our results demonstrate that the transferred model achieves similar completion time and travel distance while performing in a more stable manner than the model trained in the target environment.

Future directions include the need to validate the performance and scalability of our proposed framework with different UGV scenarios. There is also an opportunity to account for obstacles in the environment by adding navigation and path planning skills to the behavioural hierarchy. Last, but not least, adding more UAVs and allowing for coordination skills in the hierarchy will generalize the problem to more realistic settings.


This material is based upon work supported by the Air Force Office of Scientific Research and the Office of Naval Research - Global (ONR-G).


  • [1] M. Asada, E. Uchibe, and K. Hosoda (1999) Cooperative behavior acquisition for mobile robots in dynamically changing real worlds via vision-based reinforcement learning and development. Artificial Intelligence 110 (2), pp. 275–292. Cited by: §II.
  • [2] T. Balch and R. C. Arkin (1998) Behavior-based formation control for multirobot teams. IEEE Transactions on robotics and automation 14 (6), pp. 926–939. Cited by: §II.
  • [3] R. Carelli, C. De la Cruz, and F. Roberti (2006) Centralized formation control of non-holonomic mobile robots. Latin American applied research 36 (2), pp. 63–69. Cited by: §I.
  • [4] L. Chaimowicz and V. Kumar (2004) Aerial shepherds: coordination among UAVs and swarms of robots. In Proc. of DARS’04, Cited by: §I.
  • [5] Y. Chang, C. Chang, C. Chen, and C. Tao (2012-04) Fuzzy sliding-mode formation control for multirobot systems: design and implementation. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 42 (2), pp. 444–457. External Links: Document, ISSN 1083-4419 Cited by: §I.
  • [6] J. Chen, X. Zhang, B. Xin, and H. Fang (2016-04) Coordination between unmanned aerial and ground vehicles: a taxonomy and optimization perspective. IEEE Transactions on Cybernetics 46 (4), pp. 959–972. External Links: Document, ISSN 2168-2267 Cited by: §I.
  • [7] N. R. Clayton and H. Abbass (2019) Machine teaching in hierarchical genetic reinforcement learning: curriculum design of reward functions for swarm shepherding. arXiv preprint arXiv:1901.00949. Cited by: §I.
  • [8] R. A. S. Fernández, S. Dominguez, and P. Campoy (2017) L1 adaptive control for wind gust rejection in quad-rotor UAV wind turbine inspection. In 2017 International Conference on Unmanned Aircraft Systems (ICUAS), Cited by: §VII-A2.
  • [9] G. H. W. Gebhardt, K. Daun, M. Schnaubelt, and G. Neumann (2018-05) Learning robust policies for object manipulation with robot swarms. In 2018 IEEE International Conference on Robotics and Automation (ICRA), Vol. , pp. 7688–7695. External Links: Document, ISSN 2577-087X Cited by: §IV-B1.
  • [10] A. Gee and H. Abbass (2019) Transparent machine education of neural networks for swarm shepherding using curriculum design. arXiv preprint arXiv:1903.09297. Cited by: §I.
  • [11] A. Guillet, R. Lenain, B. Thuilot, and V. Rousseau (2017) Formation control of agricultural mobile robots: a bidirectional weighted constraints approach. Journal of Field Robotics. Cited by: §II.
  • [12] H. Huang and J. Sturm (2014) Tum simulator. Note: 2019-06-20 Cited by: §V-B.
  • [13] S. Hung and S. N. Givigi (2017-01) A Q-learning approach to flocking with UAVs in a stochastic environment. IEEE Transactions on Cybernetics 47 (1), pp. 186–197. External Links: Document, ISSN 2168-2267 Cited by: §II, §II.
  • [14] A. Khan, B. Rinner, and A. Cavallaro (2018-01) Cooperative robots to observe moving targets: review. IEEE Transactions on Cybernetics 48 (1), pp. 187–198. External Links: Document, ISSN 2168-2267 Cited by: §I.
  • [15] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §V-A2.
  • [16] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §IV-A.
  • [17] N. K. Long, K. Sammut, D. Sgarioto, M. Garratt, and H. Abbass (2019) A comprehensive review of shepherding as a bio-inspired swarm-robotics guidance approach. arXiv preprint arXiv:1912.07796. Cited by: §I.
  • [18] S. Martinez, J. Cortes, and F. Bullo (2007) Motion coordination with distributed information. IEEE Control Systems Magazine 27 (4), pp. 75–88. Cited by: §I.
  • [19] N. Mathew, S. L. Smith, and S. L. Waslander (2015) Planning paths for package delivery in heterogeneous multirobot teams. IEEE Transactions on Automation Science and Engineering 12 (4), pp. 1298–1308. Cited by: §I.
  • [20] Z. Miao, Y. Liu, Y. Wang, G. Yi, and R. Fierro (2018) Distributed estimation and control for leader-following formations of nonholonomic mobile robots. IEEE Transactions on Automation Science and Engineering 15 (4), pp. 1946–1954. Cited by: §II.
  • [21] S. Minaeian, J. Liu, and Y. Son (2016-07) Vision-based target detection and localization via a team of cooperative UAV and UGVs. IEEE Transactions on Systems, Man, and Cybernetics: Systems 46 (7), pp. 1005–1016. External Links: Document, ISSN 2168-2216 Cited by: §I.
  • [22] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. Cited by: §I.
  • [23] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: §IV-A, §V-A2.
  • [24] H. Nguyen, M. Garratt, and H. Abbass (2018) Apprenticeship bootstrapping. In 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. Cited by: §IV-B1.
  • [25] H. Nguyen, T. Nguyen, M. Garratt, K. Kasmarik, S. Anavatti, M. Barlow, and H. Abbass (2019) A deep hierarchical reinforcement learner for aerial shepherding of ground swarms. In Neural Information Processing: 26th International Conference, ICONIP 2019, Sydney, Australia, December 12-15, 2019, Proceedings, Cited by: §I, §I, §IV, §VI, §VIII.
  • [26] H. T. Nguyen, M. Garratt, L. T. Bui, and H. Abbass (2017) Supervised deep actor network for imitation learning in a ground-air UAV-UGVs coordination task. In 2017 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1–8. Cited by: §I, §I.
  • [27] T. Nguyen, H. Nguyen, E. Debie, K. Kasmarik, M. Garratt, and H. Abbass (2018) Swarm Q-learning with knowledge sharing within environments for formation control. In 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. Cited by: §II, §IV-B1.
  • [28] H. Oh, A. R. Shirazi, C. Sun, and Y. Jin (2017) Bio-inspired self-organising multi-robot pattern formation: a review. Robotics and Autonomous Systems 91, pp. 83–100. Cited by: §I.
  • [29] K. Oh, M. Park, and H. Ahn (2015) A survey of multi-agent formation control. Automatica 53, pp. 424–440. Cited by: §II.
  • [30] G. Palmer, K. Tuyls, D. Bloembergen, and R. Savani (2018) Lenient multi-agent deep reinforcement learning. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 443–451. Cited by: §II.
  • [31] S. Ramazani, R. Selmic, and M. de Queiroz (2017-08) Rigidity-based multiagent layered formation control. IEEE Transactions on Cybernetics 47 (8), pp. 1902–1913. External Links: Document, ISSN 2168-2267 Cited by: §II.
  • [32] L. V. Santana, A. S. Brandao, M. Sarcinelli-Filho, and R. Carelli (2014) A trajectory tracking and 3D positioning controller for the AR Drone quadrotor. In 2014 international conference on unmanned aircraft systems (ICUAS), pp. 756–767. Cited by: §V-A1.
  • [33] L. V. Santana, A. S. Brandao, and M. Sarcinelli-Filho (2015) Outdoor waypoint navigation with the AR Drone quadrotor. In 2015 International Conference on Unmanned Aircraft Systems (ICUAS), pp. 303–311. Cited by: §V-A1.
  • [34] F. Santoso, M. A. Garratt, and S. G. Anavatti (2017) State-of-the-art intelligent flight control systems in unmanned aerial vehicles. IEEE Transactions on Automation Science and Engineering 15 (2), pp. 613–627. Cited by: §II.
  • [35] A. Sen, S. R. Sahoo, and M. Kothari (2017) Cooperative formation control strategy in heterogeneous network with bounded acceleration. In Control Conference (ICC), 2017 Indian, pp. 344–349. Cited by: §II.
  • [36] P. Singh, R. Tiwari, and M. Bhattacharya (2016-03) Navigation in multi robot system using cooperative learning: a survey. In 2016 International Conference on Computational Techniques in Information and Communication Technologies (ICCTICT), Vol. , pp. 145–150. External Links: Document, ISSN Cited by: §II.
  • [37] C. Speck and D. J. Bucci (2018) Distributed UAV swarm formation control via object-focused, multi-objective SARSA. In 2018 Annual American Control Conference (ACC), pp. 6596–6601. Cited by: §II.
  • [38] D. Strömbom, R. P. Mann, A. M. Wilson, S. Hailes, A. J. Morton, D. J. Sumpter, and A. J. King (2014) Solving the shepherding problem: heuristics for herding autonomous, interacting agents. Journal of the royal society interface 11 (100), pp. 20140719. Cited by: §I, §I, §IV, §VIII.
  • [39] P. V. Tran, F. Santoso, M. A. Garratt, and I. R. Petersen (2019) Adaptive second order strictly negative imaginary controllers based on the interval type-2 fuzzy systems for a hovering quadrotor with uncertainties. IEEE/ASME Transactions on Mechatronics (), pp. 1–1. External Links: Document, ISSN Cited by: §VII-A1, §VII-A2, §VII-B.
  • [40] V. P. Tran, M. Garratt, and I. R. Petersen (2017) Formation control of multi-UAVs using negative-imaginary systems theory. In 2017 11th Asian Control Conference (ASCC), pp. 2031–2036. Cited by: §VII-A2.
  • [41] V. P. Tran, M. Garratt, and I. R. Petersen (2020) Switching time-invariant formation control of a collaborative multi-agent system using negative imaginary systems theory. Control Engineering Practice 95, pp. 104245. Cited by: §VII-A2, §VII-B.
  • [42] G. Vásárhelyi, C. Virágh, G. Somorjai, N. Tarcai, T. Szörényi, T. Nepusz, and T. Vicsek (2014-Sep.) Outdoor flocking and formation flight with autonomous aerial robots. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vol. , pp. 3866–3873. External Links: Document, ISSN 2153-0858 Cited by: §II.
  • [43] J. Wen, L. He, and F. Zhu (2018-07) Swarm robotics control and communications: imminent challenges for next generation smart logistics. IEEE Communications Magazine 56 (7), pp. 102–107. External Links: Document, ISSN 0163-6804 Cited by: §I.
  • [44] Y. Yang, H. Modares, D. C. Wunsch, and Y. Yin (2018-06) Leader-follower output synchronization of linear heterogeneous systems with active leader using reinforcement learning. IEEE Transactions on Neural Networks and Learning Systems 29 (6), pp. 2139–2153. External Links: Document, ISSN 2162-237X Cited by: §II.
  • [45] Z. Yang, K. Merrick, L. Jin, and H. A. Abbass (2018) Hierarchical deep reinforcement learning for continuous action control. IEEE transactions on neural networks and learning systems (99), pp. 1–11. Cited by: §I.
  • [46] T. Yasuda and K. Ohkura (2018-01) Collective behavior acquisition of real robotic swarms using deep reinforcement learning. In 2018 Second IEEE International Conference on Robotic Computing (IRC), Vol. , pp. 179–180. External Links: Document, ISSN Cited by: §I.
  • [47] X. Yi, A. Zhu, S. X. Yang, and C. Luo (2017-04) A bio-inspired approach to task assignment of swarm robots in 3-D dynamic environments. IEEE Transactions on Cybernetics 47 (4), pp. 974–983. External Links: Document, ISSN 2168-2267 Cited by: §II.
  • [48] N. R. Zema, D. Quadri, S. Martin, and O. Shrit (2019-06) Formation control of a mono-operated UAV fleet through ad-hoc communications: a Q-learning approach. In 2019 16th Annual IEEE International Conference on Sensing, Communication, and Networking (SECON), Vol. , pp. 1–6. External Links: Document, ISSN 2155-5494 Cited by: §II.