Cooperative Multi-Agent Deep Reinforcement Learning for Reliable Surveillance via Autonomous Multi-UAV Control

by   Won Joon Yun, et al.

CCTV-based surveillance using unmanned aerial vehicles (UAVs) is considered a key technology for security in smart city environments. This paper considers a scenario in which UAVs equipped with CCTV cameras fly over a city area to provide flexible and reliable surveillance services. For a reliable surveillance system, the UAVs should be deployed to cover a large area while minimizing overlapping and shadow areas. However, the operation of UAVs is subject to high uncertainty, necessitating autonomous recovery systems. This work develops a multi-agent deep reinforcement learning-based management scheme for reliable industry surveillance in smart city applications. The core idea is to autonomously replenish a UAV's deficient network requirements through communications. In intensive simulations, the proposed algorithm outperforms state-of-the-art algorithms in terms of surveillance coverage, user support capability, and computational cost.




I Introduction

The unprecedented demand for reliable surveillance services has fueled the need for flexible and reliable network supporting systems [1]. Owing to recent advances, unmanned aerial vehicles (UAVs), or drones, are now considered core network devices for providing flexible and reliable network services such as mobile surveillance. Using their mobility, UAVs have been shown to adaptively and dynamically update their surveillance locations [2, 3, 4, 5, 6, 7, 8]. UAVs leverage their mobility to enable line-of-sight (LOS), CCTV-based surveillance, ensuring reliable monitoring services. On-demand deployment allows surveillance UAVs to change their positions continuously to enlarge the covered area. Moreover, because of their relatively low cost and broad range of applications, UAVs are a promising solution for robust and flexible mobile CCTV-based surveillance. However, UAVs are generally insufficiently maintained, which exposes them to higher safety risks. For example, engine shutdowns due to collisions with other aircraft or terrain can damage the wireless base station’s hardware. In addition, the on-board power of UAVs is jointly consumed by their mobility and communication support functions. To prevent unexpected battery problems, such as low battery levels, the power consumption of communication, surveillance, and mobility needs to be monitored regularly.

As such, an autonomous management system for surveillance UAVs is essential for providing more robust and resilient surveillance services in UAV-based network systems. It is essential to jointly optimize energy consumption and the reliability of the surveillance services under uncertainty in the movements/deployments of target objects and in the behavior of neighboring UAVs. Recently, a considerable amount of work has been published on optimizing the deployment of UAVs for cellular services, including optimization-based coverage control (which is essential for surveillance) for transmit power reduction and UAV deployment under various uncertainties (e.g., changes in user demand) [9], optimal path planning for multiple UAVs to provide wireless services to cell-edge users based on a convex relaxation technique [10], and coverage maximization with the minimum number of drone base stations [11].

Even though these previous works show reasonable performance with respect to their objectives, their solution approaches all rely on centralized optimization. Such approaches cannot yield an online solution, in terms of computation, for highly dynamic and distributed UAV-enabled networks. To solve these problems in a distributed setting, machine learning (ML) based approaches are effective. To handle the high dynamics and uncertainty of UAV-enabled networks (e.g., surging user demand, malfunction of neighboring UAVs, etc.), this study designs a new multi-agent deep reinforcement learning (DRL) scheme for distributed computation over UAVs that considers both users and multiple UAVs.

Compared to the conventional optimization approaches, various ML techniques have been applied to improve the performance of UAV-based computing, including an ML-based approach for autonomous trajectory optimization [12], the optimization of UAV locations in a downlink system with joint K-means and expectation maximization (EM) based on a Gaussian mixture model (GMM) [13], and dynamic optimization of the locations of UAVs in a VLC-enabled UAV-based network for minimizing the transmit power [14], among others. Some studies apply DRL methods to UAV network systems, including meta-reinforcement learning-based path planning for UAVs in dynamic and unknown wireless network environments [15], Q-learning-based dynamic location planning of UAVs in a non-orthogonal multiple access (NOMA) based wireless network [16], and the optimization of UAV energy consumption control considering communication coverage, fairness, and connectivity.

However, these general ML approaches cannot be applied to deterministic multi-UAV decision-making under uncertain environments without producing undesirable outputs. This is mainly because the aforementioned studies do not treat the UAV-based network system as a partially observable multi-agent environment, in which each agent has different information and the UAVs cannot fully exchange information. Moreover, these approaches focus only on the network communication system even when visual information is available. This underlines the need for more advanced research on communication among multiple surveillance UAVs. Thus, this paper addresses these issues with a novel algorithm that shares partially observed information from one UAV with other UAVs in an uncertain and constantly varying environment.

Fig. 1: CommNet-based autonomous and reliable surveillance management systems for smart city applications.

Fig. 1 shows a schematic of the system, which consists of two units: the UAVs and the targets of surveillance, which are called users. Some UAVs cooperatively exchange information for reliable monitoring services, while uncooperative UAVs do not. The objective of each surveillance UAV is to autonomously maneuver itself to the area with the highest number of users. Hence, it is essential to configure a system in which UAVs automatically derive the optimal trajectories and coverage to achieve this objective. In this process, the autonomous optimization system needs to take into account the characteristics of UAVs (e.g., on-board battery). This paper configures the autonomous surveillance resilience system based on a multi-agent DRL (MADRL) scheme called communication neural network (CommNet) [17]. The goal of the proposed scheme is to optimize the energy consumption of UAVs and the trajectories that strengthen surveillance performance, which means deploying the surveillance UAVs to cover the area with the largest number of users. The monitoring of the target is achieved through a camera lens attached to the body of the surveillance UAV. Hence, this paper also analyzes the relationship between the surveillance resolution and the corresponding covered area. The surveillance resolution is defined as the resolution of the images captured by the surveillance UAVs. The experiments show that the lower the resolution of the captured image, the larger the area covered by that UAV.

To the best of the authors’ knowledge, an autonomous cooperative surveillance management system for UAVs (under the consideration of UAVs that do not fully exchange information) via MADRL has not been studied yet. Thus, the proposed scheme can guide future studies on dynamic UAV-based autonomous surveillance. To sum up, the contributions of this paper are as follows:

  • This paper proposes a new MADRL algorithm to achieve autonomous UAV cooperation in a distributed UAV context to estimate the uncertainty of the environment and optimize energy consumption and overall surveillance reliability and operation.

  • Experiments conducted in this paper show the relationship between the performance of the surveillance UAVs, which refers to the resolutions of the captured surveillance UAV data, and the area covered by the corresponding surveillance UAV.

  • Through performance evaluations of the proposed CommNet based autonomous UAV surveillance management scheme, this work shows the scheme’s ability to successfully perform reliable and robust cooperative management of distributed UAVs.

The rest of this paper is organized as follows. Sec. II summarizes the state-of-the-art of reinforcement learning algorithms. Sec. III presents the proposed autonomous UAV coordination inspired by CommNet-based reinforcement learning algorithms. Sec. IV evaluates the performance of the proposed autonomous UAV coordination algorithm. Finally, Sec. V concludes this paper and provides future research directions.

II Deep Reinforcement Learning

II-A Preliminaries

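In this work, a Markov decision process (MDP) is defined as a tuple $\langle \mathcal{S}, \mathcal{A}, P, R, T \rangle$, where $\mathcal{S}$ is a finite set of all valid states, and $\mathcal{A}$ is a finite set of all valid actions. The function $P: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{P}(\mathcal{S})$ is a transition probability function, with $P(s' \mid s, a)$ being the probability of transitioning into state $s'$ if an agent starts executing action $a$ in state $s$. The function $R: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}$ denotes the reward function, with $r_t = R(s_t, a_t, s_{t+1})$. The MDP has a finite time horizon $T$, and solving an MDP means finding a policy $\pi_\theta$, where $\pi_\theta$ is parameterized with $\theta$ (e.g., the weights and biases of a neural layer); $\pi_\theta$ specifies which action must be executed in each state $s \in \mathcal{S}$ for maximizing the discounted cumulative rewards received during the finite time $T$.

When the environment transitions and the policy are stochastic, the probability of a $T$-step trajectory is defined as $P(\tau \mid \pi_\theta) = \rho_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$, where $\rho_0$ is the starting state distribution. Then, the expected return is defined as $J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$, where the trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots)$ is a sequence of states and actions in the environment and $R(\tau) = \sum_{t=0}^{T-1} \gamma^t r_t$ is its discounted cumulative reward with discount factor $\gamma \in (0, 1]$. The goal in reinforcement learning is to learn a policy that maximizes the expected return $J(\pi_\theta)$ when the agent acts according to the policy $\pi_\theta$. Therefore, the optimization objective is expressed by

$\pi^* = \arg\max_{\pi_\theta} J(\pi_\theta)$, (1)

with $\pi^*$ being the optimal policy. The multi-agent MDP (MMDP) [18] generalizes the MDP to the multi-agent system, where the state space is defined by taking the Cartesian product of the state spaces of the individual agents, and actions represent the joint actions that can be executed by the agents.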
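As a small illustration of the discounted cumulative reward that the objective above maximizes, the following sketch computes the return of a single finite-horizon trajectory. The function name `discounted_return`, the reward values, and the discount factor are illustrative assumptions and are not taken from this paper.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 0.99) -> float:
    """Return sum_t gamma^t * r_t for one finite-horizon trajectory."""
    total = 0.0
    # Accumulate backwards so that each step is a single multiply-add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical per-step rewards of a three-step trajectory.
example_rewards = [1.0, 0.0, 2.0]
print(discounted_return(example_rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

The expected return is then the average of this quantity over trajectories sampled from the policy.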

II-B Deep reinforcement learning

Deep Q-Network (DQN) [19]. A deep Q-network (DQN) is a model-free reinforcement learning method designed to learn the optimal policy with a high-dimensional state space. The DQN is inspired by Q-learning [20], where a neural network is used to approximate the Q-function. Experience replay and a target network are two key features used to stabilize the optimization. Experiences of the agent are stored in the experience buffer $\mathcal{D}$ and are periodically resampled to train the Q-networks. A mini-batch of resampled experiences is used to update the parameters $\theta_i$ of the policy with the loss function $L_i(\theta_i)$ at the $i$-th training iteration, where the loss function is defined as

$L_i(\theta_i) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i) \right)^2 \right]$, (2)

where $\theta_i^-$ are the target network parameters. The target network parameters $\theta_i^-$ are updated using the Q-network parameters $\theta_i$ in every predefined step. The stochastic gradient descent method is used to optimize the loss function.
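The DQN loss described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: small lookup tables (`q_online`, `q_target`) stand in for the Q-network and the target network, the minibatch is hand-written, and zeroing the bootstrap term after terminal transitions is a common practical convention that the text does not state.

```python
from typing import List, Sequence, Tuple

# (state, action, reward, next_state, done); integer states index the tables.
Transition = Tuple[int, int, float, int, bool]


def dqn_loss(q_online: List[List[float]], q_target: List[List[float]],
             batch: Sequence[Transition], gamma: float = 0.99) -> float:
    """Mean squared Bellman error of a minibatch for tabular stand-in Q-functions."""
    squared_errors = []
    for s, a, r, s_next, done in batch:
        # Bootstrapped target from the frozen target network
        # (assumption: zero bootstrap after a terminal transition).
        bootstrap = 0.0 if done else gamma * max(q_target[s_next])
        y = r + bootstrap
        squared_errors.append((y - q_online[s][a]) ** 2)
    return sum(squared_errors) / len(squared_errors)


q_online = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
q_target = [[0.0, 1.0], [2.0, 0.0], [0.0, 0.0]]
minibatch = [(0, 1, 1.0, 1, False), (1, 0, 0.0, 2, True)]
loss = dqn_loss(q_online, q_target, minibatch, gamma=0.5)
print(loss)  # (2.0**2 + 0.0**2) / 2 = 2.0
```

In a real implementation the tables would be neural networks and the loss would be minimized by stochastic gradient descent on the online parameters only.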

Proximal Policy Optimization (PPO) [21]. PPO is one of the breakthroughs among DRL algorithms. It adopts the trust region concept [22] to improve training stability by ensuring that the update at every iteration is small, which it achieves by clipping the probability ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$, where $\theta$ is the parameter of the new policy while $\theta_{\mathrm{old}}$ is that of the old policy. A clipped surrogate objective, which prevents the new policy from straying away from the old one, is used to train the policy $\pi_\theta$. The clipped objective function is as follows:

$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) \hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t \right) \right]$, (3)

where $\hat{A}_t$ is the estimated advantage value and $\epsilon$ is a hyperparameter that denotes how far the new policy is allowed to move away from the old policy. PPO uses stochastic gradient ascent to maximize the objective (3).
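To make the clipping behavior concrete, the sketch below evaluates the clipped surrogate objective on a hand-written batch. The function names, the log-probabilities, the advantage values, and the clipping range of 0.2 are illustrative assumptions rather than settings from this paper.

```python
import math
from typing import Sequence


def clip(x: float, low: float, high: float) -> float:
    """Clamp x to the interval [low, high]."""
    return max(low, min(high, x))


def ppo_clip_objective(new_logp: Sequence[float], old_logp: Sequence[float],
                       advantages: Sequence[float], epsilon: float = 0.2) -> float:
    """Average clipped surrogate objective over a batch of samples."""
    terms = []
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)  # probability ratio new/old
        unclipped = ratio * adv
        clipped = clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * adv
        # The pessimistic (smaller) term removes the incentive for large steps.
        terms.append(min(unclipped, clipped))
    return sum(terms) / len(terms)


# Ratio 1.5 with positive advantage is capped at 1.2;
# ratio 0.5 with negative advantage is evaluated at the clipped ratio 0.8.
objective = ppo_clip_objective(
    new_logp=[math.log(1.5), math.log(0.5)],
    old_logp=[0.0, 0.0],
    advantages=[1.0, -1.0],
    epsilon=0.2,
)
print(round(objective, 6))  # (1.2 + (-0.8)) / 2 = 0.2
```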

Limitations and MADRL. The prior approaches are designed for a single-agent system. In such a system, the agent considers only the changes that result from its own actions. However, in a multi-agent system, each agent needs to concurrently observe the effect of its own actions as well as the behavior of other agents. This characteristic constantly reshapes the environment and leads to non-stationarity. As a result, the convergence guarantees of the single-agent methods above generally do not hold in multi-agent systems [23]. Therefore, the information collection and processing method should not affect the convergence stability of the agents in multi-agent systems. To this end, various MADRL algorithms have been proposed for multi-agent cooperation and coordination. In centralized MADRL algorithms, multi-agent states and actions are formulated as the inputs and outputs of a single deep neural network, so only one policy determines the MADRL actions. In addition, the centralized policy collects all the observable information (fully observable information) and determines the actions of all agents at once. In CommNet [24, 25, 26], which is one of the well-known centralized MADRL algorithms, the cooperation and coordination among multiple agents is formulated by mixing information in each hidden layer computation.

III Autonomous Multi-UAV Coordination for Reliable Surveillance

III-A System description

Suppose that users, multi-agent cooperative agents, non-agent UAVs are deployed. The sets of users, UAVs agents, and non-agent UAVs are denoted as , and , respectively. In addition, , , and represent -th user, -th UAV agent, and -th non-agent UAV, where , , and . The proposed system assumes a leader UAV agent for handling communications between UAV agents. The leader UAV agent receives information for multi-agent cooperation, as shown in Fig. 3. In addition, this paper assumes that each user is associated with only one UAV , whereas each UAV can be associated with multiple users. A camera is embedded on each UAV. Each UAV captures surveillance coverage with the resolution , where stands for the set of resolutions. This study presupposes that each camera sensor is not affected by zoom-in, zoom-out, or the UAV’s movement. These assumptions are reasonable since, according to [27], the camera sensor dynamically controls the sizes of micro pixels, so that the resolution is not degraded even when zooming-in. Therefore, the surveillance resolutions have an inverse relationship with surveillance areas. Each UAV can conduct multi-agent cooperation to manage its surveillance coverage autonomously. This paper assumes that the wireless connection between inter-UAVs is good enough to deliver UAV’s information without losing robust and stable MADRL training. Each UAV has a maximum capacity of the battery.

The objective of the proposed autonomous coordination for reliability (ACR) is to enlarge the monitored area for reliable surveillance at high resolution. Therefore, the proposed algorithm tries to achieve reliable surveillance when the number of users to be observed changes, while considering the UAVs’ conditions, e.g., the possibility that UAVs are dropped, malfunction, or exhaust their energy. At the same time, each UAV tries to optimize its energy consumption to cope with the power-hungry nature of UAVs.

III-B MADRL formulation

For the MADRL formulation of the proposed ACR, the state space, action space, and rewards should be defined.

III-B1 State space

The state space of our proposed ACR consists of location information (e.g., absolute position information, and the relative position or distance information with other users/UAVs), energy information, and surveillance information (e.g., whether user is monitored or not; and which UAV monitors with which resolutions). The absolute positions of each UAV agent, user, and non-agent UAV are defined as ,, , , and , , respectively. The relative positions for with , , and are denoted by , , and , respectively. The distance for with , , and are denoted as , , and . Note that , and stand for the relative position and physical distance between and , respectively. The location information of is defined as , where . In each time, every single UAV consumes basic operational energy for aviation and monitoring, where the basic energy consumption of is denoted by . In addition, denotes the energy consumption of , which depends on the surveillance coverage range; and it can be computed as follows:


In (4), is the minimum power consumption to fly UAV over the ground, and is the speed multiplier of the motor operation. Here, and represent the speed and operating time. The required power consumption for lifting the UAV to height at speed is denoted as  [28, 29]. In (5), and represent the current coverage range of the -th UAV agent and the coefficient of monitoring energy consumption. Note that the monitoring related energy consumption is related with the radiation and signal processing. Here, the energy information of is defined as . depends on the surveillance resolution of , which is denoted by . The surveillance coverage of is defined as positional set and is written as follows:


Note that , and represent the surveillance field and an arbitrary point on , respectively. Whether surveils is determined by the allocation variable , which follows the surveillance rule defined as


Note that is indicator function. Here, the surveillance information of is defined as . Taking all the above into consideration, the set of states is denoted as where represents the state of -th UAV which is defined as .
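The coverage and allocation idea described above can be illustrated with a short sketch. It assumes, for illustration only, that each UAV covers a circular region of a given radius around its ground position and that a user inside several coverage regions is allocated to the closest UAV, so that each user is associated with at most one UAV. The function name `allocate_users`, the tie-breaking rule, and the coordinates are hypothetical; the paper's exact surveillance rule is not reproduced here.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def allocate_users(users: List[Point], uavs: List[Point],
                   radii: List[float]) -> List[Optional[int]]:
    """For each user, return the index of the covering UAV, or None if uncovered."""
    allocation: List[Optional[int]] = []
    for ux, uy in users:
        best, best_dist = None, math.inf
        for j, ((px, py), radius) in enumerate(zip(uavs, radii)):
            dist = math.hypot(ux - px, uy - py)
            # Indicator: the user lies inside UAV j's circular coverage.
            if dist <= radius and dist < best_dist:
                best, best_dist = j, dist
        allocation.append(best)
    return allocation


users = [(0.0, 0.0), (5.0, 0.0), (20.0, 0.0)]
uavs = [(0.0, 0.0), (6.0, 0.0)]
radii = [3.0, 2.0]
print(allocate_users(users, uavs, radii))  # [0, 1, None]
```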

Fig. 2: UAV coordination

III-B2 Action space

The action space of the proposed ACR consists of discrete actions, i.e., actions are for moving actions in direction, in direction, or in both and direction; and the other actions are for controlling the surveillance resolution level , where current position is and surveillance coverage range is . The illustrative description for this action space is as shown in Fig. 2. To achieve monitoring service, each UAV takes moving actions and coverage range controlling actions. The discrete set of actions is defined as .

III-B3 Reward

The rewards of the proposed ACR are classified into two groups, i.e., UAV rewards and cooperation rewards.

UAV Reward. For defining the rewards in UAVs, this paper considers energy consumption, battery discharge, and the number of users. The UAV is power-hungry by nature, the rewards for -th UAV with energy consumption , and the of each UAV agent is the scaled summation of energy consumption of and :


where , and are the scaling factor for energy consumption and for aviation status of UAV agent, respectively.

The reward for the surveillance is defined as the ratio of the number of users in the current surveillance coverage to the number of users when the surveillance coverage is maximized. The surveillance reward for is written as follows:


where is the scaling factor for surveillance. As is increased, the surveillance resolution and the number of monitored/observed users are also increased. That is obviously beneficial in terms of surveillance reliability.
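The surveillance reward described above is a scaled ratio, which the following sketch makes explicit. The function name, the default scaling factor, and the handling of the case with no reachable users are assumptions made for illustration.

```python
def surveillance_reward(users_covered: int, users_at_max_coverage: int,
                        scale: float = 1.0) -> float:
    """Scaled ratio of currently covered users to users covered at maximum range."""
    if users_at_max_coverage <= 0:
        # Assumption: with no reachable users there is no surveillance reward.
        return 0.0
    return scale * users_covered / users_at_max_coverage


# A UAV that currently covers 3 of the 6 users reachable at maximum coverage.
print(surveillance_reward(3, 6, scale=2.0))  # 2.0 * 3 / 6 = 1.0
```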

Cooperation Reward. To define the rewards for cooperation among UAVs, this paper considers the overlapped area among UAVs and the number of users. A large overlap in surveillance coverage is inefficient in terms of energy and resources. Thus, the reward formulation also aims to minimize overlapped areas. The overlapped UAV agent surveillance coverage is computed as:


where returns the area of input. This paper proposes the overlapped threshold, , for reward constraint of the overlapped area among UAVs. The reward of the total utilization is defined as the number of users who use the service and total number of users, and it is computed as follows:


where is a scaling factor for total utilization. Therefore, the total reward for each agent is defined as follows:


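As an illustration of the overlapped area used in the cooperation reward, the sketch below computes the intersection area of two circular coverage regions with the standard circle-circle intersection formula. Treating coverage as circles on the ground plane is an assumption made here, and the function name and example values are hypothetical; how the overlap is compared with the overlapped threshold is not shown.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def circle_overlap_area(c1: Point, r1: float, c2: Point, r2: float) -> float:
    """Area of the intersection of two circles (0 if they are disjoint)."""
    d = math.hypot(c1[0] - c2[0], c1[1] - c2[1])
    if d >= r1 + r2:
        return 0.0  # no overlap
    if d <= abs(r1 - r2):
        return math.pi * min(r1, r2) ** 2  # one circle lies inside the other
    # Sum of the two circular sectors minus the kite formed by the two
    # centers and the two intersection points.
    a1 = r1 * r1 * math.acos((d * d + r1 * r1 - r2 * r2) / (2 * d * r1))
    a2 = r2 * r2 * math.acos((d * d + r2 * r2 - r1 * r1) / (2 * d * r2))
    a3 = 0.5 * math.sqrt((-d + r1 + r2) * (d + r1 - r2) * (d - r1 + r2) * (d + r1 + r2))
    return a1 + a2 - a3


# Two unit-radius coverage circles whose centers are one unit apart.
area = circle_overlap_area((0.0, 0.0), 1.0, (1.0, 0.0), 1.0)
print(round(area, 4))  # 2*pi/3 - sqrt(3)/2, about 1.2284
```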
III-C Algorithm for learning cooperation

1 Initialize the critic and actor networks with weights and
2 Initialize the target networks as:
3 for episode = 1, MaxEpisode do
4        Initialize UAV Agent Environments
5        for time step = 1, T do
6               With probability select a set of actions for each
7               Execute actions at in Simulation Environments and observe reward and the next set of states
8               Store the transition pairs in replay buffer
9               If the time step is an update period, do the following:
10               Sample a random minibatch from
11               Set
12               Update the by applying stochastic gradient descent to the loss function of critic network:
13               Update the by applying stochastic gradient ascent with respect to the gradient of actor network:
15               Target Update and
16       end
Algorithm 1 Autonomous multi-UAV coordination for reliable surveillance
Fig. 3: Structure of CommNet.

The UAV agents of ACR exchange information about users and other UAVs. Through communication with other UAV agents, the autonomous UAV coordination scheme tries to learn how to maintain surveillance under environmental uncertainty, as shown in Algorithm 1. The proposed ACR scheme considers a multi-agent system, so CommNet, a representative communication-based multi-agent DRL algorithm, is applied. In CommNet, each agent uses only its observable state, as shown in Fig. 3. CommNet enables communication among multiple agents using a single deep neural network. Note that conventional CommNet considers a central server that collects information from the agents and distributes the processed information; that is, each agent has access to a central server to share information. In contrast, the proposed scheme randomly selects one of the UAV agents as a leader UAV, and the leader collects and distributes the other agents’ information. Every UAV agent except the leader sends its embedded state information to the leader UAV as a communication message. The leader UAV collects the embedded information and averages all of the received messages. The averaged message is then taken as the input of the next layer, and the leader UAV also sends the averaged message to the other UAV agents. The final layer output determines the agent’s action. Here, each layer is a module $f^i$ with $i \in \{0, \ldots, K\}$, where $K$ is the number of communication steps. Each $f^i$ takes two input vectors for each UAV agent $j$: the hidden state $h_j^i$ and the communication $c_j^i$. The output of $f^i$ is a vector $h_j^{i+1}$. In CommNet, the communication and hidden state are calculated as follows:

$h_j^{i+1} = f^i(h_j^i, c_j^i)$, (12)

$c_j^{i+1} = \frac{1}{J-1} \sum_{j' \neq j} h_{j'}^{i+1}$, (13)

where $J$ is the number of UAV agents. The softmax activation function is placed at the output layer. Then, the output of the softmax can be interpreted as the probability that action $a$ is taken when the UAV agent observes state $s$. This paper adopts the actor and critic reinforcement learning model based CommNet [30]. The overall training process is then defined as follows:
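One communication step of this kind can be sketched in plain Python as follows. The sketch follows the original CommNet formulation, in which each agent's communication input is the mean of the other agents' hidden states and the layer is a `tanh` of two linear maps; the leader-based relay of the proposed scheme is abstracted away, and the weight matrices `w_h` and `w_c`, the vector sizes, and all numerical values are illustrative assumptions.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Matrix-vector product for small dense matrices."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def mean_of_others(hidden: List[Vector], j: int) -> Vector:
    """Communication input of agent j: mean hidden state of all other agents."""
    others = [h for k, h in enumerate(hidden) if k != j]
    if not others:
        return [0.0] * len(hidden[j])
    return [sum(col) / len(others) for col in zip(*others)]


def commnet_step(hidden: List[Vector], w_h: Matrix, w_c: Matrix) -> List[Vector]:
    """Next hidden state of every agent: tanh(w_h h_j + w_c c_j)."""
    next_hidden = []
    for j, h in enumerate(hidden):
        c = mean_of_others(hidden, j)
        pre = [a + b for a, b in zip(matvec(w_h, h), matvec(w_c, c))]
        next_hidden.append([math.tanh(x) for x in pre])
    return next_hidden


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax that turns final outputs into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


identity = [[1.0, 0.0], [0.0, 1.0]]
hidden = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
next_hidden = commnet_step(hidden, identity, identity)
action_probs = softmax(next_hidden[0])
```

Because the same weights are applied to every agent, the same sketch works unchanged for a different number of agents.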

  1. For actor and critic networks, the weight parameters, i.e., and , are initialized (line 1).

  2. The weight parameters of the target actor and critic networks, i.e., and , are initialized (line 2).

  3. The set of UAVs recursively follows this procedures for learning autonomous UAV coordination policies: (i) For every episode, the transition pairs are stored in replay buffer . Here, , , , and stand for the set of states, the set of actions generated by the parameter of actor , the reward of UAV agents, and the observed set of next state spaces. Note that all transition pairs are derived from ACR environments (lines 4–8). (ii) When it comes to update period, a minibatch is randomly sampled from . Using Bellman optimality equation (line 11). The -th transition pair of the minibatch, mean squared Bellman error is calculated with target value and to update the critic network (line 12) [31]. The parameters of the actor network, i.e., , are updated via gradient-based optimization (line 13–14). (iii) After updating the parameters of actor and critic, i.e., and , the target parameters and are updated (line 15).

  4. The parameters of the actor and critic are shared among the UAV agents; thus, all UAV agents use the same cooperation policy parameters. Note that this parameter sharing makes it easy to scale to more UAV agents.
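Two building blocks of the training procedure above, the replay buffer and the target update, can be sketched as follows. This is a minimal illustration: the class and function names are hypothetical, parameters are represented as flat lists of floats, and the interpolation factor `tau` is an assumption (with `tau=1.0` the target becomes a plain copy of the online parameters); the paper's exact target update rule is not reproduced here.

```python
import random
from collections import deque
from typing import Any, List, Tuple

Transition = Tuple[Any, Any, float, Any]  # (states, actions, reward, next_states)


class ReplayBuffer:
    """Fixed-capacity buffer that drops the oldest transition when full."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        self._buffer: deque = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def store(self, states: Any, actions: Any, reward: float, next_states: Any) -> None:
        self._buffer.append((states, actions, reward, next_states))

    def sample(self, batch_size: int) -> List[Transition]:
        """Sample a random minibatch without replacement."""
        return self._rng.sample(list(self._buffer), batch_size)

    def __len__(self) -> int:
        return len(self._buffer)


def update_target(target_params: List[float], online_params: List[float],
                  tau: float = 1.0) -> List[float]:
    """Move target parameters toward the online ones (tau=1.0 is a hard copy)."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online_params, target_params)]


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.store(step, 0, 1.0, step + 1)  # toy transitions
minibatch = buffer.sample(2)
new_target = update_target([0.0, 0.0], [1.0, 2.0], tau=0.5)
print(len(buffer), new_target)  # 3 [0.5, 1.0]
```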

IV Performance Evaluation

IV-A Performance metric

Parameters                        Value
Energy for hovering               128.89 W
Energy for flying                 170.32 W
Energy for surveillance           5 W
Time steps per episode            40 min
Surveillance resolution
The number of UAV agents          4
The number of users               25
The number of non-agent UAVs      3
Overlapped threshold              0.5
TABLE I: Environmental setup parameters.

In the dynamic ACR environment, the number of monitored users changes continuously. This study investigates the convergence of ACR and its benchmark schemes. The UAVs transfer their latest locations. Every episode starts by randomly spreading the agent UAVs on the grid. Each UAV is then randomly assigned a specific area of interest, which is the last location left by a UAV. To provide an optimal surveillance service, which is the common goal of the agent UAVs, they must provide the best surveillance for each area. As a benchmark, this paper compares the proposed CommNet-based ACR with the following state-of-the-art techniques:

  1. ACR with communication (Comp1): In the Comp1 scheme, all agents collect the observation information of the others. Comp1 ACR utilizes a state-of-the-art algorithm (communication learning). The comparison experiment covers both performance and computational cost. In addition, this work analyzes the efficiency of the proposed scheme in Sec. IV-C. Note that all agents in Comp1 make action decisions with a CommNet-based policy.

  2. ACR with disconnection (Comp2): In the Comp2 scheme, there is no leader UAV agent that collects the observation information of other agents. Comp2 ACR utilizes a state-of-the-art algorithm (an independent actor-critic (IAC)-based algorithm). This work compares the proposed scheme with Comp2 to assess the effect of inter-agent cooperation. Note that the policy structure in Comp2 ACR (i.e., a DNN-based policy) is equivalent to the CommNet-based policy without (12) and (13).

IV-B Simulation setup

In order to numerically analyze the performance of the proposed ACR scheme, the paper considers a -dimensional grid map, randomly distributed users, and multi-agents system based UAV agents. This experiment setup assumes each episode has a total of time steps. The initial positions of agent UAVs are set to center of the grid. In addition, the initial positions of non-agent UAVs are detached 750m from the center of the grid. Each non-agent UAV is randomly malfunctioned with probability 0.03 for each step. Note that ACR parameters are summarized in Table I. In addition, experiments were conducted with the system parameters m, m, , , , , and . This study configured two neural network structures (e.g., CommNet and DNN) of ACR as follows. The neural networks consist of six dense layers. In the first layer to the last layer, the number of nodes is

. The ReLU function is used in first layer to fifth layer, respectively. A Xavier initializer is used for weight initialization. This paper uses Adam optimizer with the learning rate

. In the training procedure, -greedy method is used to make the UAV agents experience a variety of actions. The initial value of epsilon is and the annealing epsilon is .
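The exploration strategy mentioned above can be sketched as an epsilon-greedy rule with an annealed exploration rate. The multiplicative decay, the floor value, and all numbers below are illustrative assumptions; the initial and annealing values used in the experiments are not reproduced here.

```python
import random
from typing import Sequence


def epsilon_greedy(action_values: Sequence[float], epsilon: float,
                   rng: random.Random) -> int:
    """Pick a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(action_values))
    return max(range(len(action_values)), key=lambda a: action_values[a])


def anneal(epsilon: float, decay: float, floor: float) -> float:
    """Assumed multiplicative annealing with a lower bound on exploration."""
    return max(floor, epsilon * decay)


rng = random.Random(0)
values = [0.1, 0.7, 0.2]       # hypothetical action preferences of one UAV agent
epsilon = 1.0                  # hypothetical initial exploration rate
for _ in range(3):
    action = epsilon_greedy(values, epsilon, rng)
    epsilon = anneal(epsilon, decay=0.5, floor=0.2)
print(epsilon)  # 1.0 -> 0.5 -> 0.25 -> 0.2 (floor reached)
```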

IV-C Evaluation results

In the following, this study investigates the reward convergence, the surveillance reliability, and the trained behaviors of the proposed scheme and the comparison schemes, as well as their computation costs.

Fig. 4: The learning curves of the proposed scheme, Comp1, and Comp2: (a) total reward; (b) training loss; (c)–(e) per-agent reward for the proposed scheme, Comp1, and Comp2; (f)–(h) per-agent loss for the proposed scheme, Comp1, and Comp2. All agents are trained for 100,000 iterations in each scheme.

IV-C1 Reward convergence

This work studies the training behavior in terms of total reward and training loss. Fig. 4 shows the training results. As shown in Fig. 4(a)/(b), both metrics are more stable in the proposed scheme and Comp1 than in Comp2. The total rewards in the proposed scheme and Comp1 converge at around , whereas the total reward in Comp2 fluctuates in . The malfunction of non-agent UAVs contributes to the reward fluctuation because of its uncertainty. Regarding training loss, the proposed scheme starts at 1.7, the highest value among the schemes, and converges at 1.2, which ranks second among the three schemes. Comp1 and Comp2 show the lowest and highest loss, respectively. As shown in Fig. 4(c–e), commonly obtains the highest reward among the four agents. Fig. 4(c)/(f) show that and obtain high rewards in the proposed scheme, whereas and show the lowest and the highest loss, respectively. As shown in Fig. 4(d)/(g), the agents in Comp1 show similar tendencies and smooth curves in both reward and loss. However, Comp2 shows unstable fluctuations in both metrics. To sum up, CommNet-based decision making stabilizes ACR by helping the neural network converge during training, even though only one agent performs the communication operation, as shown in Fig. 4.

IV-C2 Surveillance reliability

Fig. 5: The performance of the proposed scheme, Comp1, and Comp2: (a)–(c) number of surveilled users for the proposed scheme, Comp1, and Comp2; (d) resolution; (e) support rate. The graphs are plotted with the trained models, averaged over 25 test iterations.

This paper studies the impact of CommNet-based ACR on the number of users surveilled by UAV agents and on the surveillance resolution. The test is conducted with fixed user positions and non-agent UAV positions, and the results are averaged over 25 test iterations. Fig. 5 presents the test results with the trained model for each scheme. Fig. 5(a–c) shows the number of surveilled users. In the proposed scheme, 12–15 users are surveilled on average, as shown in Fig. 5(a). UAV agents and surveil more users than the other agents or the 3 non-agent UAVs. In the Comp1 scheme, 12–14 users are surveilled by UAVs. As shown in Fig. 5(b), the agents in Comp1 provide monitoring services to users more evenly than those in the proposed or Comp2 scheme because all agents perform communication operations. According to Fig. 5(c), 11–13 users are surveilled in Comp2. Fig. 5(d)/(e) show the resolution of the monitoring service and the support rate, respectively. The surveillance resolution of all schemes starts at around p. The terminal monitoring resolution is highest in the proposed scheme, followed by Comp1 and Comp2. The support rate is the ratio of the number of users surveilled by UAVs to the total number of users. The support rate is found to be high at , in the order of the proposed scheme, Comp1, and Comp2. In addition, Comp1 is the same as the proposed scheme, and Comp2 is the lowest at . In these results, the agents in the proposed scheme take a totally different strategy. The agent that performs the communication operation surveils the maximum number of users, whereas the agents in the comparison schemes surveil users evenly. This strategy gives the proposed scheme the highest resolution and the maximum support rate among the benchmark schemes.

IV-C3 Behavior pattern analysis of UAV agents

(a) Behavior patterns of UAV agent in the proposed scheme
(b) Behavior patterns of UAV agent in the Comp1 scheme
(c) Behavior patterns of UAV agent in the Comp2 scheme
Fig. 6: UAV agent behaviors. Circles depict the surveillance coverage range of each UAV agent, purple/blue stars represent the positions of the CommNet-based/DNN-based UAV agents, respectively, and squares represent users.

Fig. 6 shows how the trained UAV agents adjust their locations and surveillance coverage over time. Fig. 6(a) shows the trajectories and coverage of UAVs in the proposed scheme, and Fig. 6(b)/(c) present those of Comp1 and Comp2, respectively. The four agents consist of one CommNet UAV and three DNN UAVs in the proposed scheme, four CommNet UAVs in Comp1, and four DNN UAVs in Comp2.

As shown in Fig. 6(a), the non-agent UAVs malfunction at , , and , respectively. The UAV agent with the CommNet-based policy occupies the optimal position, where the most users are located at . The UAV agents with DNN-based policies roam to surveil the remaining users while enhancing the resolution. This strategy yields the fastest increase in resolution, as shown in Fig. 5(d).

Fig. 6(b) presents the Comp1 scheme, in which all agents use CommNet-based policies. The malfunction events occur at , , and . There is a static CommNet agent that controls the resolution between 1080p and 2160p. Two agents move to where the malfunction event occurs at . All users receive monitoring service at . When , all UAV agents move slowly to guarantee the surveillance resolution and the number of surveilled users. As shown in Fig. 5(d)/(e), the strategy of Comp1 increases the resolution and support rate even though all non-agent UAVs have malfunctioned.

Fig. 6(c) shows the behavior pattern of the DNN-based UAV agents in the Comp2 scheme. The non-agent UAVs malfunction at , , and . Comp2 shows the largest overlapped area. Agents in Comp2 try to preempt an area from the other agents at . This behavior pattern causes the lowest performance on every metric, in both training and testing, as shown in Fig. 4 and Fig. 5.

IV-C4 Computational cost and efficiency

Metric Computational Cost [FLOPS]
Policy CommNet
Scheme Proposed
TABLE II: Comparison of computational cost between CommNet-based and DNN-based policy and their application scheme (Proposed, Comp1, and Comp2).

This subsection examines the computational cost of the CommNet-based and DNN-based policies, as well as the total computational cost of the proposed scheme, Comp1, and Comp2. The computational cost of a neural network can be calculated according to [32]. As shown in Table II, the computational cost of the CommNet-based policy is 22.7% larger than that of the DNN-based policy. The proposed scheme has one leader UAV agent, which requires a CommNet-based policy to collect the observations of the other agents. All agents have CommNet-based policies in Comp1 and DNN-based policies in Comp2. Per time step, the proposed scheme requires 16.1% fewer FLOPS than Comp1 and 5.3% more FLOPS than Comp2. Although Comp1 requires a higher computational cost than the other schemes, the proposed scheme shows stable convergence and performance comparable to Comp1. The proposed scheme requires a higher computational cost than Comp2 but outperforms it. Therefore, the proposed scheme offers the most favorable balance between performance and computational cost among the compared schemes.
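As a rough illustration of how such a cost comparison can be set up, the sketch below counts forward-pass FLOPs for fully connected layers using the common approximation of 2 × inputs × outputs per layer, then sums per-agent costs for each scheme. The layer widths, the extra communication layer, and the assumption that agent costs simply add up are all hypothetical; the resulting numbers are not expected to reproduce Table II.

```python
def dense_flops(layer_sizes):
    """Approximate forward-pass FLOPs of a fully connected network.

    Each dense layer with n_in inputs and n_out outputs is counted as
    2 * n_in * n_out FLOPs (one multiply and one add per weight).
    Biases and activations are ignored.
    """
    return sum(2 * n_in * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

def scheme_flops(n_commnet, n_dnn, commnet_cost, dnn_cost):
    """Per-time-step cost of a scheme, assuming agent costs are additive."""
    return n_commnet * commnet_cost + n_dnn * dnn_cost

# Hypothetical layer widths, NOT the architecture used in the paper.
dnn_cost = dense_flops([16, 64, 64, 8])
# Assumption: the CommNet policy adds one extra 64x64 communication layer.
commnet_cost = dnn_cost + 2 * 64 * 64

proposed = scheme_flops(1, 3, commnet_cost, dnn_cost)  # 1 CommNet + 3 DNN
comp1 = scheme_flops(4, 0, commnet_cost, dnn_cost)     # 4 CommNet
comp2 = scheme_flops(0, 4, commnet_cost, dnn_cost)     # 4 DNN

print(f"proposed vs Comp1: {100 * (proposed - comp1) / comp1:+.1f}%")
print(f"proposed vs Comp2: {100 * (proposed - comp2) / comp2:+.1f}%")
```

Whatever the actual layer sizes are, a scheme with a single communicating agent falls between the all-DNN and all-CommNet schemes in cost under this additive model, which matches the ordering reported above.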

IV-D Discussion

This subsection discusses why these experimental results are obtained. According to [24, 33], collaboration for surveillance reliability is not guaranteed without inter-agent communication. In the proposed scheme and Comp1, the cooperative reward enables inter-UAV cooperation. In Comp2, the non-communicative neural network structure means that UAVs do not communicate, which leads to a lack of cooperation. In addition, it is notable that the performance of ACR increases even though only one CommNet UAV agent exists. For this reason, the proposed scheme achieves performance equal to or better than that of Comp1 with less computational cost.
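The difference between a communicative and a non-communicative policy layer can be sketched as follows. The sketch follows the general CommNet formulation in [17], where each agent's hidden state is updated from its own state and the mean of the other agents' states; it is not necessarily the exact layer used in this paper. The function names, weight matrices `H` and `C`, and the example values are hypothetical.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def mean_of_others(hidden, i):
    """Communication vector for agent i: mean hidden state of all other agents."""
    others = [h for j, h in enumerate(hidden) if j != i]
    dim = len(hidden[i])
    if not others:
        return [0.0] * dim
    return [sum(h[d] for h in others) / len(others) for d in range(dim)]

def commnet_step(hidden, H, C):
    """One communication step: h_i' = tanh(H h_i + C c_i)."""
    new_hidden = []
    for i, h_i in enumerate(hidden):
        c_i = mean_of_others(hidden, i)
        pre = [a + b for a, b in zip(matvec(H, h_i), matvec(C, c_i))]
        new_hidden.append([math.tanh(p) for p in pre])
    return new_hidden

def independent_step(hidden, H):
    """Non-communicative layer: each agent only sees its own hidden state."""
    return [[math.tanh(p) for p in matvec(H, h_i)] for h_i in hidden]

# Hypothetical hidden states for three agents and 2x2 weights.
hidden = [[1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
H = [[1.0, 0.0], [0.0, 1.0]]
C = [[0.5, 0.0], [0.0, 0.5]]

with_comm = commnet_step(hidden, H, C)
without_comm = independent_step(hidden, H)
print(with_comm[1], without_comm[1])
```

With `C` set to zero the two layers coincide, so in this sketch any coordination gained through the network structure comes from the communication term alone.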

V Conclusions and Future Work

This paper considers the deployment of surveillance UAVs for energy-efficient and reliable surveillance. To improve monitoring services and handle various uncertainty problems with UAVs, this paper proposes an autonomous cooperation scheme for surveillance UAVs based on cooperative model-free MADRL, called CommNet. The proposed scheme offers a promising and reliable solution for finding optimal UAV trajectories in the operating area and for controlling surveillance coverage so that as many users as possible are covered. In addition, the proposed scheme outperforms the comparison techniques in computational efficiency as well as in the benchmarks. Future work will extend ACR to various conditions (e.g., ACR with geographic constraints [34]) and diverse applications (e.g., applying federated learning to ACR in order to train on data collected from different agents).


  • [1] D. Kim, S. Park, J. Kim, J. Y. Bang, and S. Jung, “Stabilized adaptive sampling control for reliable real-time learning-based surveillance systems,” Journal of Communications and Networks, vol. 23, no. 2, pp. 129–137, April 2021.
  • [2] H. Huang and A. V. Savkin, “An algorithm of reactive collision free 3-D deployment of networked unmanned aerial vehicles for surveillance and monitoring,” IEEE Transactions on Industrial Informatics, vol. 16, no. 1, pp. 132–140, January 2020.
  • [3] R. Nawaratne, D. Alahakoon, D. D. Silva, and X. Yu, “Spatiotemporal anomaly detection using deep learning for real-time video surveillance,” IEEE Transactions on Industrial Informatics, vol. 16, no. 1, pp. 393–402, January 2020.
  • [4] S. Jung, W. J. Yun, M. Shin, J. Kim, and J.-H. Kim, “Orchestrated scheduling and multi-agent deep reinforcement learning for cloud-assisted multi-UAV charging systems,” IEEE Transactions on Vehicular Technology, vol. 70, no. 6, pp. 5362–5377, June 2021.
  • [5] K. Muhammad, T. Hussain, J. D. Ser, V. Palade, and V. H. C. de Albuquerque, “DeepReS: A deep learning-based video summarization strategy for resource-constrained industrial surveillance scenarios,” IEEE Transactions on Industrial Informatics, vol. 16, no. 9, pp. 5938–5947, September 2020.
  • [6] M. Shin, J. Kim, and M. Levorato, “Auction-based charging scheduling with deep learning framework for multi-drone networks,” IEEE Transactions on Vehicular Technology, vol. 68, no. 5, pp. 4235–4248, May 2019.
  • [7] Z. Zhang, Y. Xiao, Z. Ma, M. Xiao, Z. Ding, X. Lei, G. K. Karagiannidis, and P. Fan, “6G wireless networks: Vision, requirements, architecture, and key technologies,” IEEE Vehicular Technology Magazine, vol. 14, no. 3, pp. 28–41, September 2019.
  • [8] S. Park, W.-Y. Shin, M. Choi, and J. Kim, “Joint mobile charging and coverage-time extension for unmanned aerial vehicles,” IEEE Access, vol. 9, pp. 94 053–94 063, June 2021.
  • [9] M. Mozaffari, W. Saad, M. Bennis, and M. Debbah, “Optimal transport theory for power-efficient deployment of unmanned aerial vehicles,” in Proc. IEEE International Conference on Communications (ICC), Kuala Lumpur, Malaysia, May 2016, pp. 1–6.
  • [10] F. Cheng, S. Zhang, Z. Li, Y. Chen, N. Zhao, F. R. Yu, and V. C. Leung, “UAV trajectory optimization for data offloading at the edge of multiple cells,” IEEE Transactions on Vehicular Technology, vol. 67, no. 7, pp. 6732–6736, July 2018.
  • [11] E. Kalantari, H. Yanikomeroglu, and A. Yongacoglu, “On the number and 3D placement of drone base stations in wireless cellular networks,” in Proc. IEEE Vehicular Technology Conference (VTC), Montreal, QC, Canada, September 2016, pp. 1–6.
  • [12] J. Chen, U. Yatnalli, and D. Gesbert, “Learning radio maps for UAV-aided wireless networks: A segmented regression approach,” in Proc. IEEE International Conference on Communications (ICC), Paris, France, May 2017, pp. 1–6.
  • [13] L. Lu, Y. Hu, Y. Zhang, G. Jia, J. Nie, and M. Shikh-Bahaei, “Machine learning for predictive deployment of UAVs with multiple access,” in Proc. IEEE GLOBECOM Workshops, Taipei, Taiwan, December 2020.
  • [14] Y. Wang, M. Chen, Z. Yang, T. Luo, and W. Saad, “Deep learning for optimal deployment of UAVs with visible light communications,” IEEE Transactions on Wireless Communications, vol. 19, no. 11, pp. 7049–7063, 2020.
  • [15] Y. Hu, M. Chen, W. Saad, H. V. Poor, and S. Cui, “Meta-reinforcement learning for trajectory design in wireless UAV networks,” arXiv preprint arXiv:2005.12394, 2020.
  • [16] Y. Liu, Z. Qin, Y. Cai, Y. Gao, G. Y. Li, and A. Nallanathan, “UAV communications based on non-orthogonal multiple access,” IEEE Wireless Communications, vol. 26, no. 1, pp. 52–57, February 2019.
  • [17] S. Sukhbaatar, R. Fergus et al., “Learning multiagent communication with backpropagation,” in Proc. Advances in Neural Information Processing Systems (NIPS), Barcelona, Spain, December 2016, pp. 2244–2252.
  • [18] C. Boutilier, “Planning, learning and coordination in multiagent decision processes,” in Proc. Conference on Theoretical Aspects of Rationality and Knowledge (TARK), De Zeeuwse Stromen, The Netherlands, March 1996, pp. 195–210.
  • [19] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with deep reinforcement learning,” arXiv:1312.5602, 2013.
  • [20] C. J. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3–4, pp. 279–292, May 1992.
  • [21] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv:1707.06347, 2017.
  • [22] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” arXiv:1502.05477, 2017.
  • [23] T. T. Nguyen, N. D. Nguyen, and S. Nahavandi, “Deep reinforcement learning for multiagent systems: A review of challenges, solutions, and applications,” IEEE Transactions on Cybernetics, vol. 50, no. 9, pp. 3826–3839, 2020.
  • [24] M. Shin, D. Choi, and J. Kim, “Cooperative management for PV/ESS-enabled electric vehicle charging stations: A multiagent deep reinforcement learning approach,” IEEE Transactions on Industrial Informatics, vol. 16, no. 5, pp. 3493–3503, May 2020.
  • [25] W. J. Yun, S. Jung, J. Kim, and J.-H. Kim, “Distributed deep reinforcement learning for autonomous aerial eVTOL mobility in drone taxi applications,” ICT Express, vol. 7, no. 1, pp. 1–4, March 2021.
  • [26] S. Jung, W. J. Yun, J. Kim, and J.-H. Kim, “Infrastructure-assisted cooperative multi-UAV deep reinforcement energy trading learning for big-data processing,” in Proc. IEEE International Conference on Information Networking (ICOIN), Jeju Island, Republic of Korea, January 2021.
  • [27] Samsung Electronics, “Samsung brings advanced ultra-fine pixel technologies to new mobile image sensors,” September 2021.
  • [28] L. D. P. Pugliese, F. Guerriero, D. Zorbas, and T. Razafindralambo, “Modelling the mobile target covering problem using flying drones,” Optimization Letters, vol. 10, no. 5, pp. 1021–1052, June 2016.
  • [29] D. Zorbas, L. D. P. Pugliese, T. Razafindralambo, and F. Guerriero, “Optimal drone placement and cost-efficient target coverage,” Journal of Network and Computer Applications, vol. 75, pp. 16–31, November 2016.
  • [30] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in Proc. International Conference on Machine Learning (ICML), NY, USA, June 2016, pp. 1928–1937.
  • [31] R. S. Sutton, “On the significance of Markov decision processes,” in Proc. International Conference on Artificial Neural Networks (ICANN), Lausanne, Switzerland, October 1997, pp. 273–282.
  • [32] J. Cohen, “How to optimize a deep learning model for faster inference?” Think Autonomous, April 2021.
  • [33] J. N. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson, “Learning to communicate with deep multi-agent reinforcement learning,” in Proc. Advances in Neural Information Processing Systems (NIPS), Barcelona, Spain, December 2016, pp. 2244–2252.
  • [34] S. Milner, C. Davis, H. Zhang, and J. Llorca, “Nature-inspired self-organization, control, and optimization in heterogeneous wireless networks,” IEEE Transactions on Mobile Computing, vol. 11, no. 7, pp. 1207–1222, July 2012.