Optimising Energy Efficiency in UAV-Assisted Networks using Deep Reinforcement Learning

by Babatunji Omoniwa, et al.

In this letter, we study the energy efficiency (EE) optimisation of unmanned aerial vehicles (UAVs) providing wireless coverage to static and mobile ground users. Recent multi-agent reinforcement learning approaches optimise the system's EE using a 2D trajectory design, neglecting interference from nearby UAV cells. We aim to maximise the system's EE by jointly optimising each UAV's 3D trajectory, number of connected users, and the energy consumed, while accounting for interference. Thus, we propose a cooperative Multi-Agent Decentralised Double Deep Q-Network (MAD-DDQN) approach. Our approach outperforms existing baselines in terms of EE by as much as 55–80%.




I Introduction

The deployment of unmanned aerial vehicles (UAVs) to provide wireless coverage to ground users has received significant research attention [1]–[7]. UAVs can play a vital role in supporting Internet of Things (IoT) networks by providing connectivity to a large number of devices, static or mobile [1]. More importantly, UAVs have numerous real-world applications, ranging from assisted communication in disaster-affected areas to surveillance and search and rescue operations [8], [9]. Specifically, UAVs can be deployed in circumstances of network congestion or downtime of existing terrestrial infrastructure. Nevertheless, to provide ubiquitous services to dynamic ground users, UAVs require robust strategies to optimise their flight trajectory while providing coverage. As energy-constrained UAVs operate in the sky, they may be faced with the challenge of interference from nearby UAV cells or other access points sharing the same frequency band, thereby impacting the system’s energy efficiency (EE) [7].

There has been significant research effort on optimising EE in multi-UAV networks [1]–[5]. The authors in [2] proposed an iterative algorithm to minimise the energy consumption of UAVs serving as aerial base stations to static ground users. In [4], a game-theoretic approach was proposed to maximise the system’s EE while maximising the ground area covered by the UAVs irrespective of the presence of ground users. However, these works rely on a central ground controller for the UAVs’ decision making, which makes them impractical to deploy in emergencies due to the significant amount of information exchanged between the UAVs and the controller. Moreover, it may be difficult to track user locations in such a scenario. Machine learning is increasingly being used to address complex multi-UAV deployment problems. In particular, multi-agent reinforcement learning (MARL) approaches have been deployed in several works to optimise the system’s EE. A distributed Q-learning approach [1] focused on optimising the energy utilisation of UAVs without considering the system’s EE. To address this challenge, a deep reinforcement learning (DRL) approach [7] could be adopted. In our prior work [10], a DRL-based approach was proposed to optimise the EE of fixed-wing UAVs that move in circular orbits and are typically incapable of hovering like rotary-wing UAVs. Moreover, the focus was on UAVs providing coverage to static ground users. The distributed DRL work in [3] was an improvement on the centralised approach in [5], where all UAVs are controlled by a single autonomous agent. The authors in [3] and [5] proposed a deep deterministic policy gradient (DDPG) approach to improve the system’s EE as UAVs hover at fixed altitudes while providing coverage to static ground users in an interference-free network environment. Although the approaches in [3] and [5] promise performance gains in terms of coverage score, they focus on the 2D trajectory optimisation of UAVs serving static ground users.

Figure 1: System model for UAVs serving static and mobile ground users.

Motivated by the research gaps above, we focus on maximising the system’s EE by optimising the 3D trajectory of each UAV over a series of time steps, while taking into account the impact of interference from nearby UAV cells and the coverage of both static and mobile ground users. We propose a cooperative Multi-Agent Decentralised Double Deep Q-Network (MAD-DDQN) approach, where each agent’s reward reflects the coverage performance in its neighbourhood. The MAD-DDQN approach maximises the system’s EE without hampering performance gains in the network.

II System Model

We consider a set $\xi$ of static and mobile ground users located in a given area, as shown in Figure 1. Each user $i \in \xi$ at time $t$ is located at the coordinate $(x_i^t, y_i^t) \in \mathbb{R}^2$. We assume service unavailability from the existing terrestrial infrastructure due to disasters or increased network load. As such, a set $N$ of quadrotor UAVs is deployed within the area to provide wireless coverage to the ground users. A serving UAV $j \in N$ at time $t$ is located at the coordinate $(x_j^t, y_j^t, h_j^t) \in \mathbb{R}^3$. Without loss of generality, we assume a guaranteed line-of-sight (LOS) channel condition [11], due to the aerial positions of the UAVs. The signal-to-interference-plus-noise ratio (SINR) is a measure of signal quality, defined as the ratio of the power of the signal of interest to the sum of the interference power from all other interfering signals and the noise power. Each user $i$ at time $t$ can be connected to a single UAV $j$, namely the one that provides the strongest downlink SINR. Thus, the SINR at time $t$ is expressed as [1],

$$\gamma_{i,j}^t = \frac{\beta P (d_{i,j}^t)^{-\alpha}}{\sum_{z \in \chi_{int}} \beta P (d_{i,z}^t)^{-\alpha} + \sigma^2}, \tag{1}$$

where $\beta$ and $\alpha$ are the attenuation factor and path loss exponent that characterise the wireless channel, respectively, $\sigma^2$ is the power of the additive white Gaussian noise at the receiver, $d_{i,j}^t$ is the distance between user $i$ and UAV $j$ at time $t$, $\chi_{int} \subset N$ is the set of interfering UAVs, $z$ is the index of an interfering UAV in the set $\chi_{int}$, and $P$ is the transmit power of the UAVs. We model the mobility of mobile users using the Gauss Markov Mobility (GMM) model [12], which allows users to dynamically change their positions. UAVs must optimise their flight trajectory to provide ubiquitous connectivity to users. Given a channel bandwidth $B_w$, the receiving data rate of a ground user can be expressed using Shannon’s equation [7],

$$R_{i,j}^t = B_w \log_2\left(1 + \gamma_{i,j}^t\right). \tag{2}$$


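In our interference-limited system, coverage is affected by the SINR. Hence, we compute the connectivity score of UAV $j$ at time $t$ as [3],

$$C_j^t = \sum_{\forall i \in \xi} w_j^t(i), \tag{3}$$

where $w_j^t(i) \in \{0, 1\}$ denotes whether user $i$ is connected to UAV $j$ at time $t$: $w_j^t(i) = 1$ if $\gamma_{i,j}^t \geq \gamma_{th}$, otherwise $w_j^t(i) = 0$, where $\gamma_{i,j}^t$ is the SINR of user $i$ with respect to UAV $j$ and $\gamma_{th}$ is the predefined SINR threshold. Likewise, $w_j^t(i) = 0$ if user $i$ is not connected to UAV $j$.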
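The SINR, Shannon rate and connectivity score described above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the function names, the attenuation factor of 1.0, the "greater than or equal" threshold test and the link distances are hypothetical, while the 20 dBm transmit power, -130 dBm noise power, 1 MHz bandwidth, path loss exponent of 2 and 5 dB threshold are the values listed in Table I.

```python
import math

def dbm_to_watt(dbm):
    """Convert a power level in dBm to watts."""
    return 10 ** ((dbm - 30) / 10)

def sinr(d_serving, d_interferers, p_tx, noise, beta=1.0, alpha=2.0):
    """Linear SINR: received power over interference plus noise (LOS path loss)."""
    received = beta * p_tx * d_serving ** (-alpha)
    interference = sum(beta * p_tx * d ** (-alpha) for d in d_interferers)
    return received / (interference + noise)

def data_rate(bandwidth_hz, sinr_linear):
    """Shannon rate in bit/s for a given linear SINR."""
    return bandwidth_hz * math.log2(1 + sinr_linear)

def connectivity_score(sinrs_linear, threshold_db=5.0):
    """Number of a UAV's associated users whose SINR meets the threshold."""
    threshold = 10 ** (threshold_db / 10)
    return sum(1 for g in sinrs_linear if g >= threshold)

P_TX = dbm_to_watt(20)     # transmit power from Table I
NOISE = dbm_to_watt(-130)  # noise power from Table I

# Hypothetical geometry: three users of one UAV, each seeing one interfering UAV.
# Each entry is (distance to serving UAV, [distances to interfering UAVs]) in metres.
links = [(100.0, [300.0]), (150.0, [160.0]), (120.0, [400.0])]

sinrs = [sinr(d, d_int, P_TX, NOISE) for d, d_int in links]
rates = [data_rate(1e6, g) for g in sinrs]
score = connectivity_score(sinrs)
```

In this toy geometry the second user sits almost as close to the interfering UAV as to its serving UAV, so its SINR falls below the threshold and it does not count towards the score.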

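During flight operations, UAV $j$ at time $t$ expends energy $e_j^t$. A UAV’s total energy is expressed as the sum of its propulsion energy $e_P$ and communication energy $e_C$, i.e., $e_T = e_P + e_C$. Since $e_C$ is in practice much smaller than $e_P$, i.e., $e_C \ll e_P$ [1], we ignore $e_C$. A closed-form analytical propulsion power consumption model for a rotary-wing UAV at time $t$ is given as [13],

$$P(t) = P_0\left(1 + \frac{3V^2}{U_{tip}^2}\right) + P_1\left(\sqrt{1 + \frac{V^4}{4v_0^4}} - \frac{V^2}{2v_0^2}\right)^{1/2} + \frac{1}{2} d_0 \rho s A V^3, \tag{4}$$

where $P_0$ and $P_1$ are the UAV’s flight constants (which depend on, e.g., rotor radius or weight), $U_{tip}$ is the rotor blade’s tip speed, $v_0$ is the mean hovering velocity, $d_0$ is the drag ratio, $s$ is the rotor solidity, $A$ is the rotor disc area, $V$ is the UAV’s speed at time $t$ and $\rho$ is the air density. In particular, we take into account the basic operations of the UAV, such as hovering and acceleration. Therefore, we can derive the average propulsion power over all $T$ time steps as $\frac{1}{T}\sum_{t=1}^{T} P(t)$, and the total consumed energy of UAV $j$ is given as [1],

$$e_j^t = \delta_t \cdot P(t), \tag{5}$$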


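where $\delta_t$ is the duration of each time step. The EE at time $t$ can be expressed as the ratio of the total data throughput to the total energy consumed by all UAVs,

$$\eta^t = \frac{\sum_{j \in N} \sum_{i \in \xi} w_j^t(i) R_{i,j}^t}{\sum_{j \in N} e_j^t}, \tag{6}$$

where $R_{i,j}^t$ is the data rate of user $i$ served by UAV $j$, $w_j^t(i)$ indicates whether user $i$ is connected to UAV $j$, and $e_j^t$ is the energy consumed by UAV $j$ at time $t$.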
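A minimal sketch of the energy model and the EE ratio follows, assuming the three-term rotary-wing propulsion model of [13] (blade profile, induced and parasite power). The numerical rotor constants below are illustrative assumptions and are not listed in this letter; the function names are hypothetical as well.

```python
import math

# Illustrative rotor constants (assumed for this sketch, not taken from the letter).
P0, P1 = 79.86, 88.63    # flight constants: profile and induced power in hover [W]
U_TIP = 120.0            # rotor blade tip speed [m/s]
V0 = 4.03                # mean hovering (induced) velocity [m/s]
D0, RHO, S, A = 0.6, 1.225, 0.05, 0.503  # drag ratio, air density, solidity, disc area

def propulsion_power(v):
    """Propulsion power [W] of a rotary-wing UAV flying at speed v [m/s]."""
    profile = P0 * (1 + 3 * v**2 / U_TIP**2)
    induced = P1 * math.sqrt(math.sqrt(1 + v**4 / (4 * V0**4)) - v**2 / (2 * V0**2))
    parasite = 0.5 * D0 * RHO * S * A * v**3
    return profile + induced + parasite

def step_energy(v, dt):
    """Energy [J] consumed in one time step of duration dt [s] at speed v."""
    return propulsion_power(v) * dt

def energy_efficiency(rates_bps, energies_j):
    """Total throughput of connected users divided by total UAV energy."""
    return sum(rates_bps) / sum(energies_j)

hover_power = propulsion_power(0.0)    # equals P0 + P1 when hovering
cruise_power = propulsion_power(10.0)  # lower than hover for these constants
ee = energy_efficiency([2e6, 1e6], [100.0, 200.0])
```

With these constants, moderate forward flight draws less power than hovering, which is one reason the trajectory and the energy consumed have to be optimised jointly rather than separately.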


III Multi-Agent Reinforcement Learning Approach for Energy Efficiency Optimisation

In this section, we formulate the problem and propose our MAD-DDQN algorithm to improve the trajectory of each UAV in a manner that maximises the total system’s EE.

Figure 2: Multi-agent decentralised double deep Q-network framework, where each UAV is equipped with a DDQN agent that interacts with its environment. The environment shows a simulation snapshot of UAVs providing wireless coverage to 200 static (blue) and 200 mobile (red) ground users, with flight trajectories. The left side shows the broadcast range of a UAV in a multi-UAV scenario, where UAVs broadcast their telemetry information to their nearest neighbours.

III-A Problem Formulation

Our objective is to maximise the total system’s EE by jointly optimising each UAV’s 3D trajectory, the number of connected users, and the energy consumed by the UAVs serving ground users under a strict energy budget. Maximising the number of connected users $C_j^t$ will maximise the total amount of data the UAV delivers in time step $t$ which, for a given amount of consumed energy $e_j^t$, will also maximise the EE $\eta^t$. Therefore, the optimisation problem can be formulated as,

$$\max_{\forall j \in N:\ x_j^t,\, y_j^t,\, h_j^t,\, C_j^t,\, e_j^t} \ \eta^t \tag{7a}$$

$$\text{s.t.} \quad e_j^t \leq e_{max}, \quad \forall j, t, \tag{7b}$$

$$x_{min} \leq x_j^t \leq x_{max}, \quad \forall j, t, \tag{7c}$$

$$y_{min} \leq y_j^t \leq y_{max}, \quad \forall j, t, \tag{7d}$$

$$h_{min} \leq h_j^t \leq h_{max}, \quad \forall j, t, \tag{7e}$$

where $e_{max}$ is the maximum UAV energy level, and $x_{min}$, $y_{min}$, $h_{min}$ and $x_{max}$, $y_{max}$, $h_{max}$ are the minimum and maximum 3D coordinates of $x$, $y$ and $h$, respectively. When multiple wireless transmitters sharing the same frequency band are in close proximity to one another, the possibility of interference is significantly increased. Problem (7a) is known to be NP-complete [6]. It is also non-convex, and thus has multiple local optima. For this reason, solving (7a) with conventional optimisation approaches is challenging [1], [6]. Specifically, problem (7a) becomes more complex as more UAVs are deployed in a shared wireless environment, so it is challenging to find the optimal cooperative strategies that improve the system’s EE while completing the coverage tasks under dynamic settings. This is often because UAVs may become selfish and pursue the goal of improving their individual EE while minimising their own communication outage and energy consumption, rather than the collective goal of maximising the system’s EE. In such cases, where the individual and collective interests of UAVs conflict, cooperative MARL approaches may be suitable. Deep RL has been shown to perform well in decision-making tasks in such dynamic environments [14]. Hence, we adopt a cooperative deep MARL approach to solve the system’s EE optimisation problem.

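1: Input: UAV3Dposition, ConnectivityScore, InstantaneousEnergyConsumed $\in S$ and Output: Q-values corresponding to each possible action $(+x_s, 0, 0)$, $(-x_s, 0, 0)$, $(0, +y_s, 0)$, $(0, -y_s, 0)$, $(0, 0, +z_s)$, $(0, 0, -z_s)$, $(0, 0, 0)$ $\in A$
2: $D$ – empty replay memory, $\theta$ – initial network parameters, $\theta^-$ – copy of $\theta$, $N_r$ – maximum size of replay memory, $N_b$ – batch size, $N^-$ – target replacement frequency.
3: $s \leftarrow$ initial state, maxStep $\leftarrow$ maximum number of steps in the episode
4: while goal not Reached and Agent alive and maxStep not reached do
5:      $s \leftarrow$ MapLocalObservationToState(Env)
6:      ▷ Execute $\epsilon$-greedy method based on the agent’s policy
7:      $a \leftarrow$ DeepQnetwork.SelectAction($s$)
8:      ▷ Agent executes action $a$ in state $s$
9:      a.execute(Env)
10:      if a.execute(Env) is True then
11:            ▷ Map sensed observations to new state $s'$
12:            Env.UAV3Dposition [6]
13:            Env.ConnectivityScore (3)
14:            Env.InstantaneousEnergyConsumed (5)
15:      $r \leftarrow$ Env.RewardWithCooperativeNeighbourFactor (8)
16:      ▷ Execute UpdateDDQNprocedure()
17:      Sample a minibatch of $N_b$ tuples $(s, a, r, s') \sim \mathrm{Unif}(D)$
18:      Construct target values, one for each of the $N_b$ tuples:
19:      Define $a^{max}(s'; \theta) = \arg\max_{a'} Q(s', a'; \theta)$
20:      if $s'$ is Terminal then
21:            $y_j = r$
22:      else
23:            $y_j = r + \gamma\, Q(s', a^{max}(s'; \theta); \theta^-)$, with $\gamma$ the discount factor
24:      Apply a gradient descent step with loss $\| y_j - Q(s, a; \theta) \|^2$
25:      Replace target parameters $\theta^- \leftarrow \theta$ every $N^-$ steps
Algorithm 1 Double Deep Q-Network (DDQN) for Agent $j$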
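Lines 17–25 of Algorithm 1 can be illustrated with a minimal pure-Python sketch of the double-DQN target, assuming the standard formulation in which the main network selects the next action and the target network evaluates it. The names `ddqn_targets`, `q_main` and `q_target` are hypothetical, and toy lookup tables stand in for the two neural networks.

```python
def ddqn_targets(batch, q_main, q_target, gamma=0.95):
    """Build one double-DQN target per (s, a, r, s_next, done) tuple.

    q_main and q_target are callables mapping a state to a list of Q-values.
    """
    targets = []
    for _state, _action, reward, next_state, done in batch:
        if done:
            # Terminal transition: the target is just the reward.
            targets.append(reward)
            continue
        # Action selection uses the main network...
        q_next = q_main(next_state)
        a_max = max(range(len(q_next)), key=q_next.__getitem__)
        # ...while action evaluation uses the target network.
        targets.append(reward + gamma * q_target(next_state)[a_max])
    return targets

# Toy Q-tables standing in for the main and target networks (hypothetical values).
q_main = {"s1": [0.2, 0.9, 0.1]}.get
q_target = {"s1": [1.0, 0.5, 2.0]}.get

batch = [
    ("s0", 0, 1.0, "s1", False),  # non-terminal transition
    ("s0", 1, -1.0, "s1", True),  # terminal transition
]
targets = ddqn_targets(batch, q_main, q_target)
```

For the first tuple the main network prefers action 1, so the target uses the target network's value 0.5 for that action rather than its own maximum of 2.0. Decoupling selection from evaluation in this way is what separates double DQN from plain DQN and reduces overestimation of action values.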

III-B Cooperative Multi-Agent Decentralised Double Deep Q-Network (MAD-DDQN)

We propose a cooperative MAD-DDQN approach, where each agent’s reward reflects the coverage performance in its neighbourhood. Here, each UAV is controlled by a Double Deep Q-Network (DDQN) agent that aims to maximise the system’s EE by jointly optimising its 3D trajectory, number of connected users, and the energy consumed. We assume the agents interact with each other in a shared and dynamic environment, which may lead to learning instabilities due to conflicting policies from other agents. From Algorithm 1, Agent $j$ follows an $\epsilon$-greedy policy by executing an action $a$, transiting from state $s$ to a new state $s'$ and receiving a reward $r$ reflecting the coverage performance in its neighbourhood, as given in (8), after which the DDQN procedure described in lines 17–25 optimises the agent’s decisions. We explicitly define the states, actions, and reward as follows:


  • State space: We consider the three-dimensional (3D) position of each UAV [6], the connectivity score and the UAV’s instantaneous energy level at time $t$, expressed as a tuple, $\langle x_j^t, y_j^t, h_j^t, C_j^t, e_j^t \rangle$.

  • Action space: At each time step $t$, each UAV takes an action by changing its direction along the 3D coordinates. Unlike our closest related work and evaluation baseline [3], we discretise the agent’s actions following the design from [1] and [6], as follows: $(+x_s, 0, 0)$, $(-x_s, 0, 0)$, $(0, +y_s, 0)$, $(0, -y_s, 0)$, $(0, 0, +z_s)$, $(0, 0, -z_s)$ and $(0, 0, 0)$. Our rationale for discretising the action space was to ensure quick adaptability and convergence of the agents.

  • Reward: The agent’s goal is to learn a policy that implicitly maximises the system’s EE by jointly minimising the ground users’ outage and the total UAV energy consumption. Hence, we introduce a shared cooperative factor to shape the reward formulation (8) of each agent $j$ in each time step $t$. The reward depends on $C_j^t$ and $C_j^{t-1}$, the connectivity scores in the present and previous time steps, respectively, and on $e_j^t$ and $e_j^{t-1}$, the instantaneous energy consumed by agent $j$ in the present and previous time steps, respectively. To enhance cooperation, the cooperative factor assigns each agent an incentive from its neighbourhood that depends on whether the overall connectivity score, i.e. the total number of users connected by the UAVs in its locality, in the present time step exceeds that in the previous time step.

Figure 3: Impact of the number of deployed UAVs on (a) the UAVs’ EE, (b) ground users’ outage and (c) total energy consumption under dynamic network conditions, with 400 ground users deployed in a 1 km² area and results from 2000 runs of MC trials.
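The discretised action space of this subsection can be sketched as a lookup table of 3D displacements. This is an illustration under stated assumptions: the six-directions-plus-hover layout follows the 7-dimensional output layer, the 20 m step is the upper end of the step-distance range in Table I, the altitude limits are hypothetical, and clamping the position to the bounds is just one possible way to enforce the coordinate constraints of the optimisation problem.

```python
STEP = 20.0  # step distance in metres (upper end of the Table I range)

# Seven discrete actions: +/- one step along x, y and altitude, plus hovering.
ACTIONS = [
    (+STEP, 0.0, 0.0), (-STEP, 0.0, 0.0),
    (0.0, +STEP, 0.0), (0.0, -STEP, 0.0),
    (0.0, 0.0, +STEP), (0.0, 0.0, -STEP),
    (0.0, 0.0, 0.0),
]

# Coordinate limits: a 1000 m x 1000 m area; the altitude limits are hypothetical.
BOUNDS = ((0.0, 1000.0), (0.0, 1000.0), (20.0, 300.0))

def apply_action(position, action_index):
    """Move a UAV by the chosen displacement, clamped to the allowed region."""
    moved = [p + d for p, d in zip(position, ACTIONS[action_index])]
    return tuple(min(max(c, lo), hi) for c, (lo, hi) in zip(moved, BOUNDS))

new_position = apply_action((990.0, 500.0, 100.0), 0)  # would leave the area on x
hover_position = apply_action((10.0, 10.0, 100.0), 6)  # hovering leaves it unchanged
```

A fixed table of displacements keeps the Q-network output small (one Q-value per entry), which matches the stated rationale of discretising actions for quick convergence.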

III-C DDQN Implementation

The neural network (NN) architecture of Agent $j$’s DDQN, shown in Figure 2, comprises a 5-dimensional state space input vector, densely connected to 2 hidden layers with 128 and 64 nodes, each using a rectified linear unit (ReLU) activation function, leading to an output layer with 7 dimensions. Our decentralised approach assumes agents to be independent learners. Following the analysis presented in [15], the computational complexity of the NN architecture used in the MAD-DDQN and that of our closest related work and evaluation baseline [3] (MADDPG) are both expressed in terms of the dimension of the state space, the dimension of the action space, the number of layers and the number of nodes in each hidden layer. The average response time of the MAD-DDQN is 5.6 ms, while that of the MADDPG is 7.4 ms.
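To give a feel for the size of this network, the short sketch below counts the trainable parameters of a fully connected 5-128-64-7 architecture. It assumes standard dense layers with one bias per output node, which the letter does not state explicitly; `dense_param_count` is a hypothetical helper.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# State input (5) -> hidden (128) -> hidden (64) -> Q-value output (7).
per_layer = [n_in * n_out + n_out
             for n_in, n_out in zip([5, 128, 64], [128, 64, 7])]
total_params = dense_param_count([5, 128, 64, 7])
```

Under these assumptions the layers hold 768, 8256 and 455 parameters, respectively, for a total of 9479 per agent.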
In the training phase, given the state information as input, Agent $j$ trains the main network to make better decisions by yielding Q-values corresponding to each possible action as output. The maximum Q-value obtained determines the action the agent executes. At each time step $t$, Agent $j$ observes its present state $s$ and updates its trajectory by selecting an action $a$ in accordance with its policy. Following its action in time step $t$, Agent $j$ observes a reward $r$, which is defined in (8), and transits to a new state $s'$. The information $(s, a, r, s')$ is stored in the replay memory as shown in Figure 2. Agent $j$ then samples a random mini-batch from the replay memory and uses the mini-batch to obtain the target values $y_j$. The optimisation is performed with the loss $\| y_j - Q(s, a; \theta) \|^2$, and the main network parameters $\theta$ are updated accordingly. Every 100th time step, the target Q-network updates its parameters $\theta^-$ with the parameters $\theta$ of the main network. For the training, the memory size was set to 10,000 and the mini-batch size to 1024. The optimisation is performed using a variant of stochastic gradient descent called RMSprop to minimise the loss, following the methodology described in [16, Chapter 4]. The learning rate and discount factor were set to 0.0001 and 0.95, respectively. We train the Q-networks by running multiple episodes, and at each training step the $\epsilon$-greedy policy is used to balance exploration and exploitation [16]. In the $\epsilon$-greedy policy, the action is selected randomly with probability $\epsilon$, whereas the action with the largest action value is selected with probability $1 - \epsilon$. The initial value of $\epsilon$ was set to 1 and linearly decreased to 0.01.
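The exploration schedule described above can be sketched as follows. The start and end values (1 and 0.01) come from the text; the number of steps over which the linear decay happens is not given in the letter, so `DECAY_STEPS` is a hypothetical value, as are the function names.

```python
import random

EPS_START, EPS_END = 1.0, 0.01
DECAY_STEPS = 100_000  # hypothetical decay horizon; not specified in the letter

def epsilon_at(step):
    """Linearly anneal epsilon from EPS_START to EPS_END, then hold it constant."""
    fraction = min(step / DECAY_STEPS, 1.0)
    return EPS_START + fraction * (EPS_END - EPS_START)

def select_action(q_values, step, rng=random):
    """Epsilon-greedy choice: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon_at(step):
        return rng.randrange(len(q_values))  # random action
    return max(range(len(q_values)), key=q_values.__getitem__)  # greedy action

q_values = [0.1, 0.7, 0.2, 0.0, 0.3, 0.4, 0.5]  # one Q-value per discrete action
early_action = select_action(q_values, step=0)             # always random at step 0
late_action = select_action(q_values, step=DECAY_STEPS,
                            rng=random.Random(0))          # almost always greedy
```

Early in training every action is random; once the schedule bottoms out, the agent picks the highest-valued action 99% of the time.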

Parameters                              Value
Software platform/Library               Python 3.7.4/PyTorch 1.8.1
Optimiser/Loss function
Learning rate/Discount factor           0.0001/0.95
Hidden layers/Activation function       2 (128, 64)/ReLU
Replay memory size/Batch size           10,000/1024
Policy/Episodes/maxStep                 ε-greedy/250/1500
No. of ground users/Model               400/GMM
Ground user direction/Velocity          [0, 2π]/[0, 15] m/s
Number of UAVs/Weight per UAV           [2–12]/16 kg
Nominal battery capacity                16,000 mAh
Maximum transmit power [6]              20 dBm
Noise power/SINR threshold [2]          -130 dBm/5 dB
Bandwidth [6]                           1 MHz
Pathloss exponent [2][6]                2
UAV step distance                       [0–20] m
Table I: Simulation Parameters

IV Evaluation and Results

In this section, we verify the effectiveness of the proposed MAD-DDQN approach against the following baselines: (i) the random policy; and (ii) the MADDPG [3] approach, which considers a 2D trajectory optimisation while neglecting interference from nearby UAV cells. Simulation parameters are presented in Table I. We simulate a varying number of UAVs, ranging from 2 to 12, to serve both static and mobile ground users in a 1000 × 1000 m² area, as shown in Figure 2. We perform 2000 runs of Monte-Carlo (MC) trials over trained episodes. In Figure 3, we compare the MAD-DDQN approach with the baselines to evaluate the impact of the number of deployed UAVs on the EE, ground users’ outage and total energy consumption. Because the baseline MADDPG approach takes significantly longer to converge (i.e. to learn suitable behaviours), Figure 3 compares the performance after training the MAD-DDQN approach for 250 episodes and the MADDPG approach for 2000 episodes, to achieve a fair comparison.

Since we focus on comparing the EE values rather than showing their absolute values, we normalise the EE values with respect to the mean values of the proposed MAD-DDQN approach. From Figure 3(a), we observe that the MAD-DDQN approach consistently outperforms the random policy and MADDPG approaches across the entire range of UAV deployments by approximately 80% and 55%, respectively.
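The normalisation step can be sketched as dividing every EE value by the mean EE of the MAD-DDQN runs. The numbers below are made-up placeholders used only to show the arithmetic; they are not results from the letter, and `normalise` is a hypothetical helper.

```python
from statistics import mean

def normalise(values, reference):
    """Scale values by the mean of the reference series."""
    ref_mean = mean(reference)
    return [v / ref_mean for v in values]

# Placeholder EE values per approach (hypothetical, for illustration only).
ee = {"MAD-DDQN": [4.0, 6.0], "MADDPG": [2.0, 2.5]}

normalised = {name: normalise(vals, ee["MAD-DDQN"]) for name, vals in ee.items()}
```

After this step the MAD-DDQN series averages exactly 1, so the other curves read directly as fractions of the proposed approach's EE.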

Figure 4: Energy efficiency vs. learning episodes showing the convergence of MAD-DDQN while varying the number of agents.

Interestingly, we see a marginally better performance by the MADDPG approach over the MAD-DDQN approach in minimising the outages experienced by ground users, by about 2%, as shown in Figure 3(b). However, this slight performance gain by the MADDPG comes at a huge computational training cost, which is 8 times higher than that of the MAD-DDQN approach. Intuitively, the MAD-DDQN approach hides redundant information about the environment through discretisation of the agent’s action space, which means the MAD-DDQN approach requires less experience to successfully learn a policy than the MADDPG approach. On the other hand, the random policy performed worst among the approaches in reducing connection outages, emphasising the relevance of strategic decision making in MARL problems. Figure 3(c) clearly shows that the proposed approach significantly reduces the total energy consumed by all UAVs compared to the baselines. Although the MADDPG approach performs slightly better at reducing outages than our approach, our MAD-DDQN approach is significantly more energy efficient, thereby implying that the MADDPG approach trades energy consumption for improved coverage of ground users. In Figure 4, we plot the EE versus the learning episodes while varying the number of agents to demonstrate the convergence behaviour of the MAD-DDQN approach. We observe a steady decrease in the converged values of the EE as the number of UAVs increases, because the system becomes more unstable with more UAVs, thereby decreasing the system throughput as interference increases. Overall, the cooperative MAD-DDQN approach shows convergence in the system’s EE irrespective of the number of UAVs deployed in the network.

V Conclusion

In this letter, we propose a MAD-DDQN approach to optimise the EE of a fleet of UAVs serving static and mobile ground users in an interference-limited environment. The MAD-DDQN approach guarantees quick adaptability and convergence, thereby allowing agents to learn policies that maximise the total system’s EE by jointly optimising its 3D trajectory, number of connected users, and the energy consumed by the UAVs serving ground users under a strict energy budget. Extensive simulation results have demonstrated that the MAD-DDQN approach significantly outperforms the random policy and a state-of-the-art decentralised MARL solution in terms of EE without degrading coverage performance in the network.


  • [1] B. Omoniwa, B. Galkin and I. Dusparic, “Energy-aware optimization of UAV base stations placement via decentralized multi-agent Q-learning,” 2022 IEEE 19th Annual Consumer Communications & Networking Conference (CCNC), Jan. 2022, pp. 216-222.
  • [2] M. Mozaffari, W. Saad, M. Bennis and M. Debbah, “Mobile Unmanned Aerial Vehicles (UAVs) for Energy-Efficient Internet of Things Communications,” IEEE Transactions on Wireless Communications, vol. 16, no. 11, pp. 7574-7589, Nov. 2017.
  • [3] C. H. Liu, X. Ma, X. Gao and J. Tang, “Distributed Energy-Efficient Multi-UAV Navigation for Long-Term Communication Coverage by Deep Reinforcement Learning,” IEEE Transactions on Mobile Computing, vol. 19, no. 6, pp. 1274-1285, June 2020.
  • [4] L. Ruan et al., “Energy-efficient multi-UAV coverage deployment in UAV networks: A game-theoretic framework,” China Communications, vol. 15, no. 10, pp. 194-209, Oct. 2018.
  • [5] C. H. Liu, Z. Chen, J. Tang, J. Xu and C. Piao, “Energy-Efficient UAV Control for Effective and Fair Communication Coverage: A Deep Reinforcement Learning Approach,” IEEE Journal on Selected Areas in Communications, vol. 36, no. 9, pp. 2059-2070, Sept. 2018.
  • [6] X. Liu, Y. Liu and Y. Chen, “Reinforcement Learning in Multiple-UAV Networks: Deployment and Movement Design,” IEEE Transactions on Vehicular Technology, vol. 68, no. 8, pp. 8036-8049, Aug. 2019.
  • [7] B. Galkin, E. Fonseca, R. Amer, L. A. DaSilva and I. Dusparic, “REQIBA: Regression and Deep Q-Learning for Intelligent UAV Cellular User to Base Station Association,” IEEE Transactions on Vehicular Technology, vol. 71, no. 1, pp. 5-20, Jan. 2022.
  • [8] C. Zhang, M. Dong and K. Ota, “Heterogeneous Mobile Networking for Lightweight UAV Assisted Emergency Communication,” IEEE Transactions on Green Communications and Networking, vol. 5, no. 3, pp. 1345-1356, Sept. 2021.
  • [9] J. Xu, K. Ota and M. Dong, “Big Data on the Fly: UAV-Mounted Mobile Edge Computing for Disaster Management,” IEEE Transactions on Network Science and Engineering, vol. 7, no. 4, pp. 2620-2630, Oct.-Dec. 2020.
  • [10] B. Galkin, B. Omoniwa, and I. Dusparic, “Multi-Agent Deep Reinforcement Learning For Optimising Energy Efficiency of Fixed-Wing UAV Cellular Access Points,” ICC 2022 - IEEE International Conference on Communications, (to appear), arXiv:2111.02258, May 2022.
  • [11] B. Galkin, J. Kibilda and L. A. DaSilva, “Deployment of UAV-mounted access points according to spatial user locations in two-tier cellular networks,” 2016 Wireless Days (WD), 2016, pp. 1-6.
  • [12] T. Camp, J. Boleng, V. Davies, “A Survey of Mobility Models for Ad Hoc Network Research,” Wireless Communication & Mobile Computing (WCMC): Special issue on Mobile Ad Hoc Networking: Research, Trends and Applications, 2002, pp. 483-502.
  • [13] Y. Zeng, J. Xu and R. Zhang, “Energy Minimization for Wireless Communication With Rotary-Wing UAV,” IEEE Transactions on Wireless Communications, vol. 18, no. 4, pp. 2329-2345, April 2019.
  • [14] M. Zhang, S. Fu and Q. Fan, “Joint 3D Deployment and Power Allocation for UAV-BS: A Deep Reinforcement Learning Approach,” IEEE Wireless Commun. Lett., vol. 10, no. 10, pp. 2309-2312, Oct. 2021.
  • [15] J. Hribar, A. Marinescu, A. Chiumento and L. A. DaSilva, “Energy Aware Deep Reinforcement Learning Scheduling for Sensors Correlated in Time and Space,” IEEE Internet of Things Journal, doi: 10.1109/JIOT.2021.3114102.
  • [16] V. François-Lavet, P. Henderson, R. Islam, M. G. Bellemare, and J. Pineau, “An Introduction to Deep Reinforcement Learning,” Foundations and Trends in Machine Learning, vol. 11, no. 3-4, 2018.