Context-Aware Deep Q-Network for Decentralized Cooperative Reconnaissance by a Robotic Swarm

01/31/2020 ∙ by N. Mohanty, et al.

This paper addresses the problem of decentralized cooperation in a robotic swarm. The aim is to perform a target search-and-destroy operation in an unknown/uncertain environment, without any communication within the swarm. The environment consists of heterogeneous targets: some can be handled by a single robot, while others require a team of cooperating robots to neutralize them. Please refer to the manuscript for the full abstract.




I Introduction

Recent advancements in unmanned aerial vehicle (UAV) technologies have led to the deployment of swarms of vehicles that cooperate and coordinate with each other to achieve a common goal in complex mission scenarios such as search and reconnaissance, territory protection and disaster relief. In the past decade, multiple unmanned aerial vehicles have been shown to be effective in surveillance [8], hazardous environment operations [10, 23], and cooperative search [24], [26], [11], [20], [1]. Due to their inherently distributed, decentralized and diverse nature, swarm-based operations are preferred for typical search and reconnaissance missions. Further, in an uncertain/unknown environment, heterogeneous targets may be present that need to be neutralized by a single vehicle or by multiple vehicles (in a sequence of operations or in a simultaneous operation). In such situations, the absence of inter-vehicle communication complicates multi-vehicle search. The inherent resilience and heterogeneity of the robots in a swarm make it an ideal candidate for search and reconnaissance missions. With this in mind, the main focus of this paper is to address the problem of coordination and cooperation between the vehicles in a swarm (without inter-vehicle communication) for effective search and neutralization of heterogeneous targets.

In a typical search and reconnaissance mission, the distributed vehicles are expected to carry out the following operations: a) search for and detect an unknown number of targets; b) assign specific targets to the vehicles based on their respective distributions; and c) neutralize the targets through coordination/cooperation in the presence of information uncertainty. Earlier research works have addressed only a subset of these tasks. Simultaneous search by multiple UAVs and task assignment has been formulated as a single optimization problem in [27], [25], [16]. In these papers, in the absence of communication, a single UAV is assigned to only a single task. Once the target is assigned, it is assumed that the UAVs are capable of generating the necessary feasible trajectories and avoiding collisions. Negotiation- and consensus-based approaches [9], [13] have also been proposed in the literature. The problem of task allocation with known target locations has been dealt with using game-theoretic approaches in [18, 12]. It should be noted that the above works are suitable for single-robot, single-task settings, whereas a practical environment may contain heterogeneous targets that require coordination among multiple robots. Further, information uncertainty, partial observability and the absence of communication add more complexity to the mission.

Recently in [5], a game-theoretic approach was presented for coordination between robots on a task allocation problem with information uncertainty. Here, measurements about the targets are used to compensate for the loss of communication when assigning tasks to soccer-playing robots. In [7], an information-theoretic approach was proposed to learn individual agent policies that collectively mimic a centralized static optimization problem. Both of these approaches assume a certain level of interaction between the agents' actions for task allocation and are not scalable. Further, in a typical reconnaissance mission, the number of agents and targets is large, the information available to an agent for decision-making is dynamic (the number of agents and targets changes over time), and there is no observation of agent interaction with targets during the task allocation process. To overcome these problems, recent works in swarm robotics have focused on applying Deep Reinforcement Learning (DRL) strategies to multi-agent cooperation. Reinforcement learning techniques such as Q-Learning [28] define tasks indirectly using cost functions, which are easier to specify than a model of the task [17]. Although this makes them easier to implement [29, 3], these techniques are not scalable and cannot be decentralized, as they often require the states of the entire system rather than of a single agent. They also require significant memory, as the Q-Learning technique specifies different Q-maps [28] for every given state. One can reduce the memory requirements by using a neural network to approximate the optimal policy, rather than Q-maps for a given state. Even though neural-network-based solutions work well for a single agent, Multi-Agent Reinforcement Learning (MARL) remains a challenging problem, as it involves the interaction of multiple robots with the same environment along with their interactions with each other. Although the use of DRL for the cooperative task allocation problem (a single target handled by a single agent) seems promising and has been successfully implemented in [6, 4, 21], the scalability issue remains. Since the number of input dimensions of the neural network policy is fixed, the approach does not scale to a large number of agents or to a number of observations that changes over time; one needs to retrain the neural network for each different number of agents. Recently in [15, 2], the issue of scalability was addressed by using a decentralized partially observable Markov decision process for targets handled by a single agent, with a fixed information collection region. For better convergence of the network, one needs to generate a large amount of training data in a simulator. In summary, these DRL algorithms do not address target detection in an unknown/uncertain environment or the neutralization of heterogeneous targets. Further, collision avoidance with obstacles and other agents in the environment is important for efficient task allocation in a reconnaissance mission. Hence, there is a need to develop new deep reinforcement learning algorithms for a swarm of robots to perform reconnaissance in an uncertain environment.

In this paper, we present such an algorithm, referred to as the Context-Aware Deep Q-Network (CA-DQN), a decision-making framework for a swarm of robots to perform reconnaissance missions in an unknown/uncertain environment without communication. Here, the environment has an unknown number of heterogeneous targets, i.e., single-robot-targets (SRTs) and multi-robot-targets (MRTs), where SRTs require only one robot to neutralize them while MRTs require multiple robots to interact with the target in a sequence (one after the other). The robots searching for targets in a given area as a swarm are assumed to have a sensor to detect the targets (their location and type) and their neighbouring robots. (Note: swarm here means that all the robots are in the vicinity of each other; they stay within a certain distance from the centroid of the swarm, i.e., they remain together in a certain region.) Depending upon the information regarding the distribution of the detected targets (global perception of the targets within the sensor radius), a robot generates its context-aware grid (adaptive grid). Using this information, the structure of the grid keeps changing (it deforms into the required structure) as the swarm moves across the area. The information about the grid locations is embedded as a matrix and is used for allocating unique targets to individual robots. After the allocation, scenario identification is done based on the local data perceived by a robot about its neighbouring robots (local perception within the sensor radius). Every scenario is classified as either conflict or conflict-free for a robot with respect to the other robots of the swarm. To deal with these scenarios, the context information, which is invariant to the number of targets seen by the robots and to the number of robots in the swarm, is used as the input to the Deep Q-Network. Two different DQNs (one for the conflict scenario and another for the conflict-free scenario) are used for action-value approximation. The DQN parameters are updated using the standard Deep Q-Learning algorithm [19] with experience replay and ε-greedy policy execution. The performance of the CA-DQN framework was evaluated in a simulated environment with heterogeneous targets. Monte Carlo simulations were done to study the performance under varying target distributions and swarm sizes. Further, the CA-DQN approach is validated experimentally in a laboratory environment. Both the experimental and the Monte Carlo simulation results highlight that a swarm of robots with CA-DQN can effectively conduct reconnaissance missions under information uncertainty and without communication.

The paper is organized as follows. Section II first presents the mathematical formulation of the problem, followed by the proposed decision-making framework. Section III discusses the results of the Monte Carlo simulation study. Section IV provides the performance evaluation of the proposed approach in an experimental setup with ground robots and different targets. Finally, the conclusions from the studies are summarised in Section V.

II Decentralized cooperation of swarm for reconnaissance missions

In this section, we first define the reconnaissance problem and provide details about the environment along with robot capabilities. Next, we provide the algorithm for information embedding in context regions and scenario identification, and finally, provide details regarding the DQN architectures and their training procedure.

II-A Problem Definition

A swarm of homogeneous robots with the necessary sensors/actuators is deployed to perform reconnaissance and neutralize targets in a bounded area of given length and width, as shown in Fig. 1. The robots do not know the number of targets or the target types in the area a priori. The targets are assumed to be randomly distributed in the area. Every robot is assumed to have sensing capabilities within a fixed range up to which it can detect targets (global perception), and a second fixed range for sensing other robots in the vicinity (local perception).

Fig. 1: A typical scenario of swarm based search in unknown/uncertain environment.
Fig. 2: Information connectivity to the robots considering detected targets and observable neighbouring robots for the scenario depicted in Fig. 1.

For example, consider the scenario shown in Fig. 1, where three robots are deployed to perform reconnaissance and neutralise the targets. For clarity, the two sensing radii are drawn as a single sensing radius in the figure. The number of targets in the area and their locations are unknown beyond these ranges, and the environment itself is assumed to be uncertain. As shown in the figure, the targets can be of two types: a) targets that can be neutralized by a single robot, referred to as Single-Robot-Targets (SRTs); b) targets that require multiple robots to neutralize them in a sequence, referred to as Multi-Robot-Targets (MRTs). A realistic environment will contain a combination of such targets, and it is also assumed that no communication network is available for the robots to share their information.

At any given time, the information about the target locations and the neighbouring robots within the sensing radius is used to define the state of a robot within the swarm. For the swarm to stay intact, each robot needs to stay within a fixed radius of the centroid of the swarm, represented by a virtual robot, as shown in Fig. 1. Mathematically, the distance of each robot from the virtual robot must not exceed this radius. In this way, every individual robot has an indirect means of staying within the zone to maintain the swarm. While moving in a group, once a target is identified, the swarm needs to get into the required formation, depending upon the target location and type, without the robots colliding with each other, so that they can neutralize the target. In the figure, the grid structure of the swarm is shown (with black solid circles representing the grid nodes) while manoeuvring across the area, along with a virtual robot at the centroid. Fig. 2 shows the information connectivity (the robots' visibility of the targets and of neighbouring robots) between individual robots of the swarm and the observed targets for the scenario depicted in Fig. 1. In Fig. 1, the red robot has the blue robot, an MRT and the virtual robot at the centroid of the grid within its sensing radius. The information connectivity between these objects is shown by distinct solid lines in Fig. 2. Similarly, the connectivity of the green robot with the other observed targets is also shown. Fig. 1 also shows some targets that are beyond the sensing radii of the robots and hence do not affect the decision making of the swarm robots at this instant. One should note that the information structure (both global and local) is dynamic in nature and hence the decision-making ability of the robots should be adaptive and scalable.
In summary, the problem considered in this paper is to detect a maximum number of targets, allocate different tasks to individual robots and neutralise the targets by cooperating/coordinating with other robots within the swarm.
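
The swarm-cohesion condition described above (every robot stays within a fixed radius of the centroid, i.e. the virtual robot) can be illustrated with a minimal sketch. The function names, the example positions and the radius values are hypothetical, and the centroid is computed here from all positions for illustration only; in the paper each robot estimates the virtual robot on its own.

```python
import math

def centroid(positions):
    """Centroid of the swarm, used here as the position of the virtual robot."""
    n = len(positions)
    return (sum(x for x, _ in positions) / n, sum(y for _, y in positions) / n)

def swarm_is_intact(positions, max_radius):
    """True if every robot lies within max_radius of the centroid."""
    cx, cy = centroid(positions)
    return all(math.hypot(x - cx, y - cy) <= max_radius for x, y in positions)

# Hypothetical robot positions (not taken from the paper).
robots = [(0.0, 0.0), (2.0, 0.0), (1.0, 2.0)]
intact_loose = swarm_is_intact(robots, max_radius=1.5)
intact_tight = swarm_is_intact(robots, max_radius=1.0)
```

With these example positions the farthest robot is about 1.33 units from the centroid, so the check passes for a radius of 1.5 and fails for a radius of 1.0.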

II-B Context Aware Deep Q-Network (CA-DQN)

In this section, we present a decision-making framework for a swarm of robots to perform a reconnaissance mission in an uncertain/unknown environment. With an unknown number of heterogeneous targets distributed in a given area, the robots move across the area as a swarm, searching for the targets and neutralizing them upon detection with the required sequence of actions based on their types. This work treats the movement of individual robots within a swarm as a multi-robot grid game. Every individual robot estimates the movement of a virtual robot across the area.

Based on their locations, at any given time, individual robots generate their grid of a fixed size, keeping the virtual robot at the centroid node of the grid. Hence, as the virtual robot moves across the area, the grid of every individual robot moves along with it. At any given time, every robot is assigned a particular node in its own grid. As the grid moves, the location of the assigned node also keeps changing with time. These node locations essentially act as waypoints that generate a trajectory, which the robot follows using a PI controller to achieve the desired objective while keeping the swarm intact. As the swarm moves across the area, every individual robot is assumed to be capable of differentiating between the types of targets upon detection within its sensing radius. If there exist multiple targets within this radius, as in the case of the green robot in Fig. 1, the robot chooses the one nearest to it for engaging. Sometimes, other robots exist in this region along with targets, as in the case of the red robot in Fig. 1. In such scenarios, the robot estimates the distances of the other robots to the target, and the robot closest to the target is assigned the task of neutralizing it. In the scenario depicted in Fig. 1, one robot estimates that it is closer to the target than its neighbour and hence chooses to engage and neutralise it, while the neighbour estimates that it is farther from the target and hence leaves the target to the closer robot.
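
The communication-free assignment rule described above can be sketched as follows: each robot picks its nearest visible target and yields if it estimates that a visible neighbour is closer to that target. This is a minimal illustration under assumptions; `choose_target`, the coordinates and the tie-handling (a robot keeps the target on an exact tie) are hypothetical and not taken from the paper.

```python
import math

def dist(a, b):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def choose_target(me, visible_targets, visible_robots):
    """Return the target this robot engages, or None if it yields or sees none."""
    if not visible_targets:
        return None
    # Pick the nearest target within the sensing radius.
    target = min(visible_targets, key=lambda t: dist(me, t))
    my_distance = dist(me, target)
    # Yield if any visible neighbour is estimated to be strictly closer.
    if any(dist(other, target) < my_distance for other in visible_robots):
        return None
    return target

# Hypothetical layout: two robots that both see the same target.
red, blue, target = (0.0, 0.0), (4.0, 0.0), (3.0, 0.0)
red_choice = choose_target(red, [target], [blue])
blue_choice = choose_target(blue, [target], [red])
```

Because both robots run the same rule on their own observations, the farther robot yields and the closer robot engages without any message exchange.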

After the detection and assignment of targets for every individual robot, the proposed approach, referred to here as the Context-Aware Deep Q-Network (CA-DQN), uses the context information about the neighbouring robots' positions and the target positions as the state input to the algorithm. The output of CA-DQN is a probabilistic array over a discrete set of five move actions, which change the node assignment of the robot to the next adjacent grid node so that the robot can maneuver in its grid. Using these actions, the robot's waypoint changes to follow a different grid node. As all the robots of the swarm operate in a neighbourhood, each robot must be assigned a unique node at any given time. This ensures that no conflicts exist in the swarm.
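
How a probabilistic action output can be turned into a new node assignment is sketched below. The action names (up, down, left, right, stay), the grid size and the clamping at the grid border are assumptions made for illustration; the text above only fixes that there are five discrete move actions and that node assignments must be unique.

```python
# Hypothetical action set: (row offset, column offset) per action.
ACTIONS = [("up", (-1, 0)), ("down", (1, 0)), ("left", (0, -1)),
           ("right", (0, 1)), ("stay", (0, 0))]

def pick_action(probabilities):
    """Index of the most probable action in the network output."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)

def next_node(node, action_index, grid_shape):
    """Move a (row, col) node assignment by one step, clamped to the grid."""
    _, (d_row, d_col) = ACTIONS[action_index]
    rows, cols = grid_shape
    row = min(max(node[0] + d_row, 0), rows - 1)
    col = min(max(node[1] + d_col, 0), cols - 1)
    return (row, col)

def assignments_are_unique(nodes):
    """True if no two robots are assigned the same grid node."""
    return len(set(nodes)) == len(nodes)

probs = [0.1, 0.6, 0.1, 0.1, 0.1]          # hypothetical DQN output
chosen = pick_action(probs)                 # index of "down"
moved = next_node((1, 1), chosen, (3, 3))   # one row down from the centre
```

A uniqueness check such as `assignments_are_unique` is only meaningful in a simulator that sees all robots; on a real robot the same condition has to be enforced through the learned policy.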

Fig. 3: Schematic of CA-DQN algorithm and information flow.

The schematic of the proposed CA-DQN and its information flow are shown in Fig. 3. Every robot has both global and local perception, obtained by sensing the targets and the neighbouring robots in the vicinity respectively. The grid described above is updated depending upon this context information and is hence called the context-aware grid. This context-aware grid helps the robot reach the desired target location for neutralizing it. As multiple robots operate in a single environment, the actions taken by individual robots may at times conflict with each other, since multiple robots cannot occupy the same space at the same time. Here, each robot uses a Scenario-Identifier (SI) algorithm, with the context information and the local perception, to decide whether a situation is a conflict or a conflict-free scenario for itself. Two different DQNs are designed for the robots to use depending upon the scenario. Based on this classification, the robot decides whether to use the DQN for the conflict-free scenario or the DQN for the conflict scenario to ensure collision-free navigation to the target location.

II-B1 Context Awareness Scheme

Fig. 4: Matrix modification upon detection of targets where the grid is being reconfigured from configuration A to B, for context aware grid.

In the problem setting, a swarm of homogeneous robots performs a reconnaissance mission in an unknown/uncertain environment. As discussed earlier, every individual robot of the swarm is assigned a unique node in the grid. Each robot is equipped with a visual sensor of fixed range to detect targets, which provides its perception of the global environment. Similarly, a proximity sensor of fixed range is used to detect the neighbouring robots (local environment). A fixed relation between these two ranges is assumed.

(a) Context region towards assigned target.
(b) Context region towards other target
(c) region to check for conflicts.
Fig. 5: A depiction of targets and neighbouring robots information embedded in an adaptive grid representation.

As the environment is dynamic, the distribution of the targets may not coincide with that of a uniform grid (with a constant distance between the nodes). It is assumed that a robot needs to be over a target to neutralize it. As each robot follows a particular node at all times, the grid needs to be deformed depending upon the target location for the robot to be over the target. For this, based upon the context information, the distances between the grid nodes are stored and modified so that the nearest node lies over the target location. Hence, for the context-aware grid, two matrices are used to store the distances between the nodes along the x and y axes respectively. To modify the grid based upon the context data, the distance elements in these matrices are updated using the angle made by the target location with the nearest node and the distance between them. For example, the method used for updating these matrices is shown in Fig. 4. In the figure, the context grid (configuration A) of the robot is reconfigured into the grid (configuration B) based upon the target distribution observed by the robot. The two distance matrices for the corresponding grid in the figure are given by:


where the entries are built from the distance between adjacent nodes and the corresponding distances of the targets from the nearest nodes, as shown in the figure.
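
One way to picture the grid deformation is sketched below. It assumes, purely for illustration, that the x and y spacings next to the nearest node are changed by the projections of the target offset (distance `delta` at angle `theta`) onto the two axes; this projection rule, the function names and the numbers are assumptions and should not be read as the paper's exact update.

```python
import math

def deformed_spacing(spacing, delta, theta):
    """Assumed update: add the x and y projections of the target offset."""
    return spacing + delta * math.cos(theta), spacing + delta * math.sin(theta)

def update_matrices(dist_x, dist_y, row, col, delta, theta):
    """Return copies of the spacing matrices with entry (row, col) deformed."""
    new_x = [r[:] for r in dist_x]
    new_y = [r[:] for r in dist_y]
    new_x[row][col], _ = deformed_spacing(dist_x[row][col], delta, theta)
    _, new_y[row][col] = deformed_spacing(dist_y[row][col], delta, theta)
    return new_x, new_y

# Uniform 2x2 spacing matrices with a hypothetical node distance of 1.0.
uniform_x = [[1.0, 1.0], [1.0, 1.0]]
uniform_y = [[1.0, 1.0], [1.0, 1.0]]
# Target lies 0.5 beyond the nearest node, straight along the x axis.
grid_x, grid_y = update_matrices(uniform_x, uniform_y, 0, 1, delta=0.5, theta=0.0)
```

In this example only the x spacing at entry (0, 1) grows to 1.5, while the y spacing and the original matrices are left unchanged.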

II-B2 Scenario-Identifier (SI) Algorithm

Based on the local information (from sensing the neighbouring robots) and the global information (from sensing targets) received by a robot, the SI classifies the given scenario as a 'conflict scenario' or a 'conflict-free scenario' so that the two can be dealt with separately. Every robot has its own context-aware grid that adapts to any given distribution of detected targets. As the robot moves along, it is assigned a unique node in the grid. Based on the information about the target location and the positions of the neighbouring robots, the robot estimates the grid nodes that its neighbours occupy and the nodes that would lie above the targets.

function conflict with robot
     generate new context region on the grid;
     if presence of other robots then
         assign as 'conflict scenario';
         assign context region;
     else
         mask the robot;
         assign as a conflict-free scenario;
         use original context region;
     end if
end function
Task Allocation: Nearest target assignment;
Input: Generate context region;
Scenario Identifier procedure:
if presence of any other targets then
     generate new context region for each target;
     if presence of any robots then
         call conflict with robot;
     else
         mask target location;
         assign as a conflict-free scenario;
     end if
end if
if presence of any robots then
     call conflict with robot;
else
     assign as a conflict-free scenario;
end if
Algorithm 1 Pseudocode of SI Algorithm

After this information is embedded into the grid, the revised grid representation can be visualized as shown in Fig. 5. Since many representations can arise depending upon the scenario, the proposed approach classifies scenarios using the information embedded in particular regions of the grid, referred to here as context regions, as shown in Fig. 5. In particular, the SI algorithm considers two types of context regions that differ in size. A formal description of the SI algorithm is given in Algorithm 1. First, after targets are detected, task allocation is done so that the robots decide which robot will neutralize which target. With this information, the robot generates a context region towards its assigned target, as shown in Fig. 5(a). In this region, the robot checks for the presence of other targets or robots. If neither is present, the robot classifies the scenario as 'conflict-free' and proceeds as an individual (without regard to neighbouring robots). The other possible cases involve the presence of a) other targets; b) neighbouring robots; or c) both, in the context region. If there are other targets within this region, the robot checks a context region in the grid towards these targets, as shown in Fig. 5(b). If no other robots are present within this region, the robot classifies the scenario as conflict-free and proceeds as an individual by masking the target. Otherwise, if another robot is present in this region, the robot checks whether it lies within the smaller context region shown in Fig. 5(c). If the robot senses other robots in this smaller region, it classifies the scenario as a 'conflict scenario'; otherwise, it classifies the scenario as a 'conflict-free scenario' and proceeds by masking the robot.
In the remaining case, where other robots exist within the context region towards the assigned target, the robot directly checks for the presence of other robots within the smaller region of the grid. If there are other robots in this region, it classifies the scenario as a 'conflict scenario'; otherwise, it classifies it as a 'conflict-free scenario'.
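
The decision logic of the SI algorithm can be condensed into a small sketch that takes the outcome of each region check as a boolean. The function and argument names are hypothetical, the region checks themselves are not implemented, and the early return after the other-target branch is one reading of the description above rather than a confirmed detail.

```python
CONFLICT = "conflict"
CONFLICT_FREE = "conflict-free"

def conflict_with_robot(robots_in_conflict_region):
    """Inner check: a conflict exists only if robots sit in the small region."""
    if robots_in_conflict_region:
        return CONFLICT
    return CONFLICT_FREE  # the neighbouring robot is masked

def identify_scenario(other_targets, robots_towards_other_targets,
                      robots_towards_assigned_target, robots_in_conflict_region):
    """Classify a scenario from the results of the context-region checks."""
    if other_targets:
        if robots_towards_other_targets:
            return conflict_with_robot(robots_in_conflict_region)
        return CONFLICT_FREE  # the other targets are masked
    if robots_towards_assigned_target:
        return conflict_with_robot(robots_in_conflict_region)
    return CONFLICT_FREE

# A lone robot with only its assigned target in view.
alone = identify_scenario(False, False, False, False)
# A neighbour near the assigned target and inside the small conflict region.
crowded = identify_scenario(False, False, True, True)
```

The returned label then selects which of the two DQNs the robot queries for its next action.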

Fig. 6: Information from context region for DQN in a conflicting scenario.
Fig. 7: Architecture for DQN in a conflicting scenario.

II-C Deep Q-Networks and its Learning Algorithm

The problem of finding a collision-free path towards its target for a given robot is modelled as a Markov game [22]. In a reinforcement learning framework, at every time step, the robot perceives a state from the environment, takes an action based on that state using its policy, and gets a reward for the transition to a new state. In a Markov game, this is represented as an experience tuple of state, action, reward and new state. The robots need to learn the optimal policies in any given state to achieve the desired objective.

Fig. 8: Information from context region for DQN in a conflict-free scenario.
Fig. 9: Architecture for DQN in a conflict-free scenario.

The Q-Learning technique developed in [28] uses Q-value approximation to find the optimal policy. Here, the action-value function gives the utility of taking an action when the robot is in a given state. We use deep neural networks (called DQNs) to approximate the Q-value function [19]. In every iteration, an experience tuple is sampled from the replay memory and the network parameters are updated using the back-propagation algorithm to minimize the loss function (eqn. 3). Usually, a DQN uses an ε-greedy strategy during the training process. CA-DQN has two different DQNs to handle 'conflict scenarios' and 'conflict-free scenarios' separately. Next, we describe the state representation based on the context region and the DQN architectures for the individual cases.
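
The training ingredients named above (a replay memory of experience tuples, ε-greedy action selection and a loss on sampled tuples) can be sketched as follows. The loss is written here as the squared temporal-difference error, (r + γ max Q(s', a') − Q(s, a))², which is the commonly used form in deep Q-learning; the discount factor `gamma`, the class and function names and all numbers are assumptions, and terminal states are ignored for brevity.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity store of (state, action, reward, next_state) tuples."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest tuples are dropped first

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.buffer), batch_size)

def epsilon_greedy(q_values, epsilon, rng=random):
    """Random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

def td_loss(q_sa, reward, next_q_values, gamma):
    """Squared temporal-difference error for one experience tuple."""
    target = reward + gamma * max(next_q_values)
    return (target - q_sa) ** 2

memory = ReplayMemory(capacity=2)
for step in range(3):
    memory.push(state=step, action=0, reward=1.0, next_state=step + 1)
greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
loss = td_loss(q_sa=1.0, reward=1.0, next_q_values=[0.5, 2.0], gamma=0.9)
```

In a full training loop the loss would be averaged over a sampled batch and minimized by back-propagation through the network that produces the Q-values.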

II-C1 Deep Q-Network for conflict scenario

First, we describe the process of extracting state information from the context region and then present the DQN architecture used for the Q-value approximation. For this scenario, the context region is a grid that contains the local information around the robot. The information in this local grid is used to represent the state of the robot, which is then input to the DQN. The state of the robot is formed by traversing the grid in a clockwise direction, with each node in the grid represented by a vector. If the node is occupied by the robot itself, the vector encodes the robot position and its target position. Similarly, if the node is occupied by another robot, the vector encodes that robot's position and its target position. Empty nodes in the context grid are represented by a null vector. A typical 'conflict scenario' is shown in Fig. 6, where the context regions of robots 'R1' and 'R2' coincide. The robots and their targets are marked in the figure. The context region in the grid has two blank nodes. With this information about each node in the region, the state vectors for R1 and R2 are constructed as described above.
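
The clockwise state encoding can be illustrated with a minimal sketch. It assumes a 2×2 context region, a four-element vector per node (robot position followed by target position) and an all-zero null vector; these sizes and the identical layout for the robot itself and for other robots are simplifying assumptions, not details confirmed by the text.

```python
NULL_VECTOR = [0.0, 0.0, 0.0, 0.0]

# Clockwise visiting order for an assumed 2x2 region, starting top-left.
CLOCKWISE_2X2 = [(0, 0), (0, 1), (1, 1), (1, 0)]

def encode_node(node):
    """Encode a node: None for empty, else ((x, y), (target_x, target_y))."""
    if node is None:
        return list(NULL_VECTOR)
    (x, y), (target_x, target_y) = node
    return [x, y, target_x, target_y]

def encode_state(region):
    """Concatenate the node vectors in clockwise order into one state vector."""
    state = []
    for row, col in CLOCKWISE_2X2:
        state.extend(encode_node(region[row][col]))
    return state

# Two robots on opposite corners and two blank nodes (hypothetical values).
region = [[((0.0, 0.0), (2.0, 2.0)), None],
          [None, ((1.0, 1.0), (3.0, 3.0))]]
state = encode_state(region)
```

Because the region has a fixed number of nodes, the resulting vector has a fixed length regardless of how many robots or targets the swarm contains.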

The architecture of the DQN for the conflict scenario is shown in Fig. 7. The DQN approximates the Q-function between the state and the action space. The DQN has five hidden layers, one of which receives the outputs of two other layers; this architecture helps the DQN converge properly. The neurons in the hidden layers (except layer 4) employ the hyperbolic tangent activation function. A softmax activation function is used over the output layer to get a probabilistic output. The architecture also has an augmented layer that is concatenated into the network to ensure a proper flow of gradient during back-propagation [14].


For any given state, the DQN predicts an action. The reward for the predicted action is generated, and the weights of the network are updated using back-propagation [14] by minimizing the loss function in eqn. 3. In this case, there can be a maximum of four robots in the local context region, and hence a self-play approach is adopted to train the DQN for the conflict scenario. Here, every individual robot is trained on its own reward, but the policy is shared among all robots. In the training process, two robots were first made to play in the context region. Since the conflict scenario is dynamic in nature, every iteration in the learning process is a single step of action in the environment. A positive reward is given for every iteration that passes without collision, with an additional reward if there is no motion, and a negative reward is given for a collision. Starting from the trained policy, the same training process is then repeated with three and four robots.
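
The reward structure for the conflict scenario can be sketched as below. The numeric reward values are placeholders chosen for illustration, and a collision is modelled here as two robots selecting the same grid node in the same step; both are assumptions rather than the paper's settings.

```python
def conflict_reward(collided, moved, step_bonus=1.0, still_bonus=0.5,
                    collision_penalty=-10.0):
    """Per-robot reward for one step; all magnitudes are hypothetical."""
    if collided:
        return collision_penalty
    reward = step_bonus          # survived the step without a collision
    if not moved:
        reward += still_bonus    # extra reward for staying in place
    return reward

def step_rewards(previous_nodes, next_nodes):
    """Each robot gets its own reward while all robots share one policy."""
    rewards = []
    for index, node in enumerate(next_nodes):
        collided = next_nodes.count(node) > 1   # same node chosen twice
        moved = node != previous_nodes[index]
        rewards.append(conflict_reward(collided, moved))
    return rewards

safe = step_rewards([(0, 0), (0, 1)], [(0, 0), (1, 1)])
crash = step_rewards([(0, 0), (1, 1)], [(0, 1), (0, 1)])
```

In the safe step the first robot stays put and earns the extra bonus, while in the crash step both robots receive the collision penalty.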

II-C2 Deep Q-Network for conflict-free scenario

The context region for this scenario is a region that contains both the local and global information (embedded into the context-aware grid) around the robot. This information is used to represent the state of the robot, which is the input to the DQN. The state is represented as a vector. Fig. 8 shows a typical conflict-free scenario in which the context region towards the assigned target is considered, and the state vector for the robot is built from this region. The architecture of the DQN for the conflict-free scenario, shown in Fig. 9, is used to approximate the Q-function between the state and the action space. This DQN has two hidden layers with the hyperbolic tangent activation function, and a softmax activation function is used over the output layer to get a probabilistic output. In the training process, the state is given as input to the DQN to predict an action, and the reward is generated based on that action. A negative reward is given for every iteration that passes without the robot reaching the target, and a positive reward is given when the robot reaches the target. With these reward values, the loss is calculated using eqn. 3, and the weights of the network are updated using back-propagation.
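
A forward pass through a network of the shape described above (two hidden layers with hyperbolic tangent activations and a softmax output) can be sketched in plain Python. The layer widths, the state dimension, the random initialization and the helper names are all assumptions; only the layer count, the activations and the five-way output follow the text.

```python
import math
import random

def dense(inputs, weights, bias):
    """Affine map with one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Softmax with the maximum subtracted for numerical stability."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(state, layers):
    """Two tanh hidden layers followed by a softmax output layer."""
    (w1, b1), (w2, b2), (w3, b3) = layers
    hidden1 = [math.tanh(v) for v in dense(state, w1, b1)]
    hidden2 = [math.tanh(v) for v in dense(hidden1, w2, b2)]
    return softmax(dense(hidden2, w3, b3))

rng = random.Random(0)
STATE_DIM, HIDDEN, N_ACTIONS = 8, 16, 5   # hypothetical sizes
layers = [init_layer(STATE_DIM, HIDDEN, rng),
          init_layer(HIDDEN, HIDDEN, rng),
          init_layer(HIDDEN, N_ACTIONS, rng)]
probabilities = forward([0.1] * STATE_DIM, layers)
```

The output is a probability vector over the five actions, from which the robot can pick its next grid move.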

III Performance Evaluation of CA-DQN

(a) Simulation frame at time 16 seconds
(b) Simulation frame at time 37 seconds
(c) Simulation frame at time 59 seconds
Fig. 10: Maneuvering of three robots across the area while neutralizing distinct targets. Fig. 10(a) shows a conflict-free scenario in which each robot proceeds without any conflict with the other robots. Fig. 10(b) and Fig. 10(c) represent conflict scenarios in which the robots have to cooperate for target neutralization.

In this section, we first present the working of CA-DQN, followed by its performance evaluation. It was implemented in a swarm of robots performing a reconnaissance mission in a simulated unknown/uncertain bounded search area. The time to complete the mission and the number of targets neutralized are measured and compared through Monte Carlo simulation with a varying number of constituent robots, proportion of heterogeneous targets and sensor radii. The robots in the swarm are assumed to be equipped with global sensors that detect targets within a fixed radius and local sensors that detect neighbours within a fixed radius. The targets are assumed to be randomly distributed in the area, with their locations and types unknown to the robots until they are detected. The speed of a robot varies dynamically depending on the requirement, but its maximum speed is bounded. The simulations are carried out in a Python 3.7 environment on a 2.6-GHz Intel Core i7 processor with 16 GB of RAM.

III-A Working of CA-DQN

For a better understanding of the working of the CA-DQN algorithm, we consider an example with a swarm of three robots (’R1’, ’R2’ and ’R3’) performing a search and reconnaissance mission. The initial positions and orientations () of the robots are taken as (), () and (), respectively. Here, heterogeneous targets, with of SRTs and of MRTs, are randomly placed at the locations shown in Fig. 10. The figure also shows the state of the simulation at different times. Each robot is assigned to its own unique node of a grid structure. The grid nodes are not shown in the images, for clarity. As the centroid moves across the area, the robots move along with it to sweep the entire area in search of targets while keeping the swarm intact. The sequential images in the figure show how the configuration of the robots within the swarm changes, in a collision-free manner, depending on the target distribution and the types of targets. Different shapes are used to depict the MRTs and SRTs that are not yet neutralized, and the dotted lines show the paths taken by the robots.

Fig. 11: Monte Carlo simulation results showing the total time taken by the robots to neutralize all targets in the given area.

Fig. 10(a) shows the motion of the robots as a swarm in a conflict-free scenario, where each robot proceeds to neutralize its respective target without having to consider the presence of the other robots. Figs. 10(b) and 10(c) depict two conflict scenarios in which the robots are required to cooperate before proceeding further, using the information within the context region.

III-B Performance Evaluation using Monte Carlo Simulation

Fig. 12: Monte Carlo simulation results with varying percentage of MRTs among the total number of targets.

For the Monte Carlo simulations, a search area of was considered, with heterogeneous targets randomly spread across the area. The context grid for each robot is of size . The robot swarm is randomly initialized in the bottom-left corner of . The Monte Carlo simulation study was conducted with a varying number of robots ( to ) while keeping the rest of the parameters constant. For a given number of robots, the simulations were performed in this environment repeatedly for iterations with random initialization of the target locations. The time taken to detect and neutralize all the targets is measured as a performance index of the robotic swarm. The mean and standard deviation of the mission completion time are shown in Fig. 11. From the figure, we can see that the time taken to complete the mission decreases as the number of robots increases. The rate of decrease in the total time diminishes beyond a swarm size of , which is due to the increase in coordination time between the robots inside the swarm. Further, the swarm size is restricted to due to the size of the search area and the coverage area of individual robots. Note that the CA-DQN algorithm runs on each robot individually and is independent of the number of robots. Hence, the approach is scalable.
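The evaluation procedure (repeat the mission with random target placement, then report the mean and standard deviation of the completion time) can be sketched as a small harness. The `toy_mission_time` function is a deliberately crude placeholder and not the paper's simulator; the trial count, target count and swarm sizes are likewise assumptions, so its output says nothing about CA-DQN itself.

```python
import random
import statistics

def toy_mission_time(n_robots, n_targets, rng):
    """Placeholder for one simulation run (NOT the paper's simulator).

    Assumption: each target takes a random service time and the work is
    shared evenly among the robots, with no coordination overhead.
    """
    return sum(rng.uniform(5.0, 15.0) for _ in range(n_targets)) / n_robots

def monte_carlo(run_once, n_robots, n_targets, n_trials, seed=0):
    """Run `run_once` n_trials times and return (mean, standard deviation)."""
    rng = random.Random(seed)
    times = [run_once(n_robots, n_targets, rng) for _ in range(n_trials)]
    return statistics.mean(times), statistics.stdev(times)

# Sweep over hypothetical swarm sizes while the other parameters stay fixed.
for n in (2, 4, 6):
    mean_t, std_t = monte_carlo(toy_mission_time, n, n_targets=10, n_trials=50)
    print(n, round(mean_t, 2), round(std_t, 2))
```

Passing the simulator in as `run_once` keeps the statistics code unchanged when the placeholder is swapped for a real mission simulation.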

Similarly, a Monte Carlo simulation was performed by varying the percentage of MRTs among the targets ( to ) while the rest of the parameters remained constant. The number of robots in the swarm and the total number of targets in the area are taken to be and , respectively. The mean and standard deviation of the time taken to complete the mission are shown in Fig. 12. From the figure, we can see that the time taken to complete the mission increases as the percentage of MRTs increases.

Fig. 13: Monte Carlo simulation results with varying global sensing radius of the constituent robots in the swarm.

To study the impact of the global sensing radius, sensor radii ranging from m to m were considered. The average time taken for the robots in the swarm to neutralize all the targets is shown in Fig. 13. From the figure, we can see that the time taken decreases as the global sensor radius increases. However, the rate of decrease is very small beyond a radius of . This could be due to the high overlap of information when the sensing range is increased beyond .

The mission performance also depends on the target distribution. To study this effect, we conducted a study with dispersed and cluttered target distributions. In Fig. 14(a), the target distribution is cluttered, with , , , and marked as four distinct clusters for the robots to neutralize.

(a) Cluttered targets.
(b) Dispersed targets.
(c) Step plot for target neutralization.
Fig. 14: Comparison of the performance of CA-DQN for different target distributions: cluttered targets (a) and dispersed targets (b), compared in the step plot (c) for target neutralization.

In Fig. 14(b), the targets are evenly distributed across the area. Fig. 14(c) shows the step plot comparing the functioning of the robots in these two scenarios. In the cluttered environment, the coordination time of the robots may be higher, but multiple robots neutralize their respective targets simultaneously. Hence, from these figures, we can conclude that the total time consumed for the mission is lower for cluttered targets than for dispersed targets.
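A step plot of this kind is simply the cumulative number of neutralized targets over time. A minimal sketch of how such a curve can be built from a list of neutralization timestamps follows; the timestamps and sampling interval are made up for illustration and are not the paper's data.

```python
import numpy as np

def neutralization_steps(neutralization_times, t_end, dt=1.0):
    """Return sample times and the cumulative count of targets neutralized by each time."""
    times = np.sort(np.asarray(neutralization_times, dtype=float))
    t = np.arange(0.0, t_end + dt, dt)
    # side="right" counts every event whose timestamp is <= the sample time
    counts = np.searchsorted(times, t, side="right")
    return t, counts

# Hypothetical timestamps (seconds); two targets are neutralized simultaneously at t = 12.
t, counts = neutralization_steps([12.0, 12.0, 30.0, 45.0], t_end=50.0, dt=10.0)
# t      -> [ 0. 10. 20. 30. 40. 50.]
# counts -> [0 0 2 3 3 4]
```

Simultaneous neutralizations appear as a jump of more than one in `counts`, and the mission completion time is the first sample at which the count reaches the total number of targets.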

(a) Simulation frame at time 22 seconds
(b) Simulation frame at time 45 seconds
(c) Simulation frame at time 73 seconds
(d) Experiment frame at time 31 seconds
(e) Experiment frame at time 48 seconds
(f) Experiment frame at time 79 seconds
Fig. 15: Snapshots of the simulation results (a-c) and of the experimental results captured by the fish-eye camera (d-f) for SRTs.
Fig. 16: System architecture implemented in TurtleBot3.
Fig. 17: Experimental setup using TurtleBot3 and arena highlighting the targets.

IV Experimental Setup and Results

We validated the proposed CA-DQN strategy using three TurtleBot3 robots. The experimental setup and the results are discussed below. The video of the experiment can be found in

IV-A Experimental Setup

The experiments are conducted in a laboratory of area . A fixed camera mounted on the ceiling monitors the robot movements. SRTs are represented using the ’red’ colour and MRTs using the ’green’ colour. For the experimental study, three TurtleBot3 Burger ground robots were used.

Each robot has an on-board Raspberry Pi 3B+ board for on-board processing and is controlled by an OpenCR control module, which is built around an STM32F7-series chip with an ARM Cortex-M7 processor. A Logitech C930e webcam is placed on top of each robot as the global sensor to detect targets, and the 360 Laser Distance Sensor LDS-01 (2D lidar) is used as the local sensor to detect neighbours. Fig. 16 shows the system architecture implemented in the TurtleBot3 Burger robot. The control layer realizes the steering control and trajectory tracking control using a PI controller. The control layer and firmware layer are implemented on the OpenCR ARM Cortex-M7 board. The sensor layer, perception layer, decision control layer and navigation layer are realized on the Raspberry Pi 3B+ board using the Robot Operating System. The on-board IMU measurements are processed with an extended Kalman filter to provide local positioning of the robot. The camera feed from the Logitech C930e webcam is sent to the vision stack, where a colour detection algorithm and a reference image transformation are used to detect targets and their positions with respect to the local grid, so that the robots can move towards them. Using the way-points, the navigation layer realizes point-to-point navigation and provides the necessary commands to the control layer.
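To make the role of the control layer concrete, the following is a minimal sketch of a PI steering loop that turns a robot towards a way-point. The source states only that a PI controller is used; the gains, the output limit, the class and function names, and the absence of anti-windup are all assumptions of this sketch rather than details of the OpenCR implementation.

```python
import math

class PIController:
    """Discrete PI controller with a saturated output (gains are hypothetical)."""

    def __init__(self, kp, ki, output_limit):
        self.kp = kp
        self.ki = ki
        self.output_limit = output_limit
        self.integral = 0.0

    def update(self, error, dt):
        """Accumulate the integral term and return the clamped control output."""
        self.integral += error * dt
        u = self.kp * error + self.ki * self.integral
        return max(-self.output_limit, min(self.output_limit, u))

def heading_error(robot_xy, robot_yaw, waypoint_xy):
    """Signed angle (rad) from the robot's heading to the way-point, wrapped to [-pi, pi]."""
    desired = math.atan2(waypoint_xy[1] - robot_xy[1], waypoint_xy[0] - robot_xy[0])
    err = desired - robot_yaw
    return math.atan2(math.sin(err), math.cos(err))

# One control step towards a hypothetical way-point directly to the robot's left.
steer = PIController(kp=1.0, ki=0.5, output_limit=2.0)
err = heading_error(robot_xy=(0.0, 0.0), robot_yaw=0.0, waypoint_xy=(0.0, 1.0))
angular_velocity = steer.update(err, dt=0.1)  # command that would go to the motors
```

Wrapping the error with `atan2(sin, cos)` keeps the robot turning through the shorter angle when the desired heading crosses the ±π boundary.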

(a) Simulation frame at time 26 seconds
(b) Simulation frame at time 32 seconds
(c) Simulation frame at time 86 seconds
(d) Experiment frame at time 34 seconds
(e) Experiment frame at time 41 seconds
(f) Experiment frame at time 91 seconds
Fig. 18: Snapshots of the simulation results (a-c) and of the experimental results captured by the fish-eye camera (d-f) for heterogeneous targets.
Fig. 19: The time history of target neutralization from simulation and experimental setup for homogeneous targets.
Fig. 20: The time history of target neutralization from simulation and experimental setup for heterogeneous targets.

IV-B Experimental Results for Homogeneous Targets

For the experiment, targets (only SRTs) are placed in the arena and are marked in ’red’ colour, as shown in Fig. 15. The robots are initialized at the bottom-right corner of the arena. The robots search the area to detect the targets and move over them to neutralize them. Because the experimental setup has no counterpart to the high surveillance aircraft, the centroid position of the grid is transmitted to the robots over WiFi. This helps the swarm of robots to stay within the radius of operation. The simulation frames and their corresponding experimental frames for certain scenarios are shown in Fig. 15 (a-f). Although the allocated targets remain the same, differences can be observed between the paths taken by the robots in simulation and in the corresponding experimental scenario. These differences result from the different robot models in the simulation and the experiment, and from uncertainties in the experimental setup such as wheel slip and differences in motor rpm. The paths traveled by the robots in the experiment and in the simulation are highlighted for better understanding. Fig. 15(a) and Fig. 15(d) show the simulation and experimental results of a scenario in which robot 2 takes a rectangular trajectory in the experiment, whereas its trajectory is smooth in the simulation. Similarly, in Fig. 15(b) and Fig. 15(c) it can be observed that robot 3 and robot 1, respectively, take a path opposite to that of the swarm and then correct themselves according to their assigned nodes, which is not emulated in the experiment. Fig. 19 shows the time taken to neutralize all targets in the simulation and in the experiment. There is a significant difference in the time taken to complete the mission.

IV-C Experimental Results for Heterogeneous Targets

We consider heterogeneous targets ( SRTs and MRTs) that are randomly placed in the area. SRTs are marked in ’red’ colour and MRTs in ’green’ colour, as shown in Fig. 18. Note that, in the heterogeneous case, an MRT requires two robots to neutralize it, and the robots need to move over the target in sequence, one after another. As in the homogeneous case, the robots are randomly initialized at the bottom-right corner of the arena. The robots search the area to neutralize the targets, and the centroid of the robots is transmitted to all robots using WiFi. The snapshots of the simulation results for the heterogeneous case are shown in Fig. 18 (a-c), and the snapshots from the fish-eye camera in the experimental setup are shown in Fig. 18 (d-f). There is a difference between the paths followed by the robots in the experiment and in the simulation, due to robot dynamics, but the sequence of allocated tasks is the same.

In Figs. 18(b) and 18(e), it can be observed that a robot can detect the nearest target but does not approach it to neutralize it. This is a case where the target lies beyond the bounding radius of the swarm from the centroid; hence, the robot chooses not to approach it. As in the previous case, one can observe the trajectory differences in Figs. 18(a) and 18(d) and in Figs. 18(c) and 18(e). The time history of target neutralization for the simulation and the experimental setup is shown in Fig. 20. From this experimental study, we can see that CA-DQN can be implemented on a low-power computing device and is capable of handling MRTs through proper coordination between neighbors.
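Two behaviours described in this subsection lend themselves to a short sketch: a robot ignores a detected target that lies outside the swarm's bounding radius around the centroid, and an MRT counts as neutralized only after the required number of distinct robots have moved over it in sequence. The class, function names, coordinates and radius are hypothetical; only the two-robot requirement for an MRT comes from the text.

```python
import math

def should_approach(target_xy, centroid_xy, bounding_radius):
    """A detected target is pursued only if it lies within the swarm's bounding radius."""
    dx = target_xy[0] - centroid_xy[0]
    dy = target_xy[1] - centroid_xy[1]
    return math.hypot(dx, dy) <= bounding_radius

class Target:
    """Target that needs `robots_required` distinct robots to pass over it."""

    def __init__(self, xy, robots_required):
        self.xy = xy
        self.robots_required = robots_required
        self.visited_by = set()

    @property
    def neutralized(self):
        return len(self.visited_by) >= self.robots_required

    def visit(self, robot_id):
        """Record a robot moving over the target; return True once it is neutralized."""
        self.visited_by.add(robot_id)
        return self.neutralized

# Hypothetical example: an MRT close to the centroid and an SRT outside the bounding radius.
mrt = Target(xy=(1.0, 1.0), robots_required=2)
srt = Target(xy=(5.0, 0.0), robots_required=1)
approach_mrt = should_approach(mrt.xy, centroid_xy=(0.0, 0.0), bounding_radius=3.0)
approach_srt = should_approach(srt.xy, centroid_xy=(0.0, 0.0), bounding_radius=3.0)
first_pass = mrt.visit("R1")   # one robot alone is not enough for an MRT
second_pass = mrt.visit("R2")  # a second, distinct robot completes the neutralization
```

Storing visiting robots in a set means a repeated pass by the same robot does not count twice towards an MRT.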

V Conclusion

This paper presents a context-awareness-based DQN architecture (CA-DQN) for a swarm of robots performing a reconnaissance mission in an unknown/uncertain environment. For the first time in the literature, CA-DQN has been proposed to perform missions in environments with no communication while dealing with heterogeneous targets as a swarm; since it is a decentralized approach, it is also scalable. The context-awareness scheme enables each robot to use local and global information from its sensors to identify the scenario and the state information. Based on the state information, the DQN generates the necessary action space for coordination in the absence of communication. CA-DQN can handle heterogeneous targets (SRTs and MRTs) effectively. The performance of CA-DQN has been evaluated through a Monte Carlo simulation study by varying the number of robots, the global sensing radius and the ratio of SRTs to MRTs in the environment. Further, from the simulations and the experimental study, it was observed that there exists a number of robots beyond which the performance improvement is not significant. The algorithm was able to tackle both dispersed and cluttered targets effectively. Due to the limitations of the experimental setup, the simulation could not be exactly emulated in the real world, but the decision-making capability remains the same. The experimental study shows that CA-DQN can be implemented with low-power computing devices, which form the fundamental basis of swarm robotics.


  • [1] E. T. Alotaibi, S. S. Alqefari, and A. Koubaa (2019) LSAR: Multi-UAV Collaboration for Search and Rescue Missions. IEEE Access 7, pp. 55817–55832.
  • [2] C. Amato (2018) Decision-making under uncertainty in multi-agent and multi-robot systems: planning and learning. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 5662–5666.
  • [3] L. Buşoniu, R. Babuška, and B. De Schutter (2010) Multi-agent Reinforcement Learning: An Overview. In Innovations in Multi-Agent Systems and Applications - 1, pp. 183–221.
  • [4] D. Claes, F. Oliehoek, H. Baier, and K. Tuyls (2017) Decentralised Online Planning for Multi-Robot Warehouse Commissioning. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pp. 492–500.
  • [5] W. Dai, H. Lu, J. Xiao, and Z. Zheng (2019-06) Task Allocation Without Communication Based on Incomplete Information Game Theory for Multi-robot Systems. Journal of Intelligent & Robotic Systems 94 (3), pp. 841–856.
  • [6] J. S. Dibangoye, C. Amato, O. Buffet, and F. Charpillet (2016-01) Optimally solving Dec-POMDPs as Continuous-State MDPs. Journal of Artificial Intelligence Research 55, pp. 443–497.
  • [7] R. Dobbe, D. Fridovich-Keil, and C. J. Tomlin (2017) Fully Decentralized Policies for Multi-Agent Systems: An Information Theoretic Approach. In NIPS.
  • [8] R. L. Finn and D. Wright (2012) Unmanned aircraft systems: surveillance, ethics and privacy in civil applications. Computer Law & Security Review 28 (2), pp. 184–194.
  • [9] B. P. Gerkey and M. J. Mataric (2002-10) Sold!: auction methods for multirobot coordination. IEEE Transactions on Robotics and Automation 18 (5), pp. 758–768.
  • [10] K. Harikumar, J. Senthilnath, and S. Sundaram (2019-04) Multi-UAV Oxyrrhis Marina-Inspired Search and Dynamic Formation Control for Forest Firefighting. IEEE Transactions on Automation Science and Engineering 16 (2), pp. 863–873.
  • [11] K. Harikumar, J. Senthilnath, and S. Sundaram (2020-03) Mission Aware Motion Planning (MAP) Framework With Physical and Geographical Constraints for a Swarm of Mobile Stations. IEEE Transactions on Cybernetics 50 (3), pp. 1209–1219.
  • [12] A. Kanakia, B. Touri, and N. Correll (2016-06) Modeling multi-robot task allocation with limited information as global game. Swarm Intelligence 10 (2), pp. 147–160.
  • [13] H. Kandath, J. Senthilnath, and S. Sundaram (2018-11) Multi-agent consensus under communication failure using Actor-Critic Reinforcement Learning. In 2018 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1461–1465.
  • [14] H. J. Kelley (1960) Gradient theory of optimal flight paths. ARS Journal 30 (10), pp. 947–954.
  • [15] M. Liu, K. Sivakumar, S. Omidshafiei, C. Amato, and J. P. How (2017) Learning for multi-robot cooperation in partially observable stochastic environments with macro-actions. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1853–1860.
  • [16] J. G. Manathara, P. B. Sujit, and R. W. Beard (2011-04) Multiple UAV Coalitions for a Search and Prosecute Mission. Journal of Intelligent & Robotic Systems 62 (1), pp. 125–158.
  • [17] M. J. Matarić (1997-03) Reinforcement Learning in the Multi-Robot Domain. Autonomous Robots 4 (1), pp. 73–83.
  • [18] Y. Meng (2008-11) Multi-Robot Searching using Game-Theory Based Approach. International Journal of Advanced Robotic Systems 5.
  • [19] V. Mnih, K. Kavukcuoglu, D. Silver, et al. (2015-02) Human-level control through deep reinforcement learning. Nature 518, pp. 529–533.
  • [20] P. Molina, M. Parés Calaf, I. Colomina, T. Vitoria, P. F. Silva, J. Skaloud, W. Kornus, R. Prades, and C. Aguilera (2012-01) Drones to the rescue! Unmanned aerial search missions based on thermal imaging and reliable navigation. InsideGNSS 7, pp. 36–47.
  • [21] D. T. Nguyen, A. Kumar, and H. C. Lau (2017) Policy gradient with value function approximation for collective multiagent planning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 4322–4332.
  • [22] R. S. Sutton and A. G. Barto (1998) Reinforcement Learning: An Introduction. Vol. 1, MIT Press, Cambridge.
  • [23] B. Ristic, D. Angley, B. Moran, and J. Palmer (2017-04) Autonomous Multi-Robot Search for a Hazardous Source in a Turbulent Environment. Sensors 17, pp. 918.
  • [24] F. Rossi, S. Bandyopadhyay, M. Wolf, and M. Pavone (2018) Review of multi-agent algorithms for collective behavior: a structural taxonomy. arXiv:1803.05464.
  • [25] P. B. Sujit, A. Sinha, and D. Ghose (2005) Multi-UAV Task Allocation using Team Theory. In Proceedings of the 44th IEEE Conference on Decision and Control, pp. 1497–1502.
  • [26] C. Tang and L. Dou (2019-07) A game-theory based UKF algorithm for multi-robots cooperative localization. In 2019 IEEE 15th International Conference on Control and Automation (ICCA), pp. 899–904.
  • [27] D. Turra, L. Pollini, and M. Innocenti (2004-08) Real-Time Unmanned Vehicles Task Allocation with Moving Targets. In AIAA Guidance, Navigation, and Control Conference and Exhibit, Providence, Rhode Island, AIAA Paper 2004-5253.
  • [28] C. J. C. H. Watkins and P. Dayan (1992-05) Technical Note: Q-Learning. Machine Learning 8 (3), pp. 279–292.
  • [29] E. Yang and D. Gu (2004) Multiagent Reinforcement Learning for Multi-Robot Systems: A Survey.