
Energy Minimization in UAV-Aided Networks: Actor-Critic Learning for Constrained Scheduling Optimization

In unmanned aerial vehicle (UAV) applications, the UAV's limited energy supply and storage have triggered the development of intelligent energy-conserving scheduling solutions. In this paper, we investigate energy minimization for UAV-aided communication networks by jointly optimizing data-transmission scheduling and UAV hovering time. The formulated problem is combinatorial and non-convex with bilinear constraints. To tackle the problem, we first provide an optimal relax-and-approximate solution and develop a near-optimal algorithm. Both proposed solutions serve as offline performance benchmarks but might not be suitable for online operation. To this end, we develop a solution from a deep reinforcement learning (DRL) perspective. Conventional RL/DRL, e.g., deep Q-learning, however, is limited in dealing with two main issues in constrained combinatorial optimization, i.e., an exponentially increasing action space and infeasible actions. The novelty of the solution development lies in handling these two issues. To address the former, we propose an actor-critic-based deep stochastic online scheduling (AC-DSOS) algorithm and develop a set of approaches to confine the action space. For the latter, we design a tailored reward function to guarantee solution feasibility. Numerical results show that, with computation time of the same order of magnitude, AC-DSOS is able to provide feasible solutions and saves 29.94% energy compared with a conventional deep actor-critic method. Compared to the developed near-optimal algorithm, AC-DSOS consumes around 10% higher energy but reduces the computational time from minute-level to millisecond-level.



I Introduction

Unmanned aerial vehicles (UAVs) have attracted much attention for high-speed data transmission in dynamic, distributed, or plug-and-play scenarios, e.g., disaster rescue, live concerts, or sports events [1]. However, UAVs’ limited endurance, energy supply, and storage are critical issues for their applications, which motivates the study of energy efficiency in UAV-aided communication networks. A UAV’s energy consumption comes from two sources: propulsion energy for flying and hovering, and communication energy for data transmission. The flying energy mainly depends on the UAV’s velocity and trajectory [2]. The hovering energy is in general proportional to the hovering time. Relative to the propulsion energy, the communication energy is not negligible, e.g., considerable communication energy can be consumed in scenarios with high traffic requests from a large number of users. Thus, joint energy optimization of both parts is necessary and has attracted considerable attention in the literature [3, 4, 5, 6, 7, 8].

The authors in [3, 4] maximized the energy efficiency, referring to the ratio between transmitted data and propulsion energy. In [5], the authors introduced a complete UAV energy model and proposed a user-timeslot scheduling method to minimize the sum of the propulsion energy and communication energy. Based on the energy model in [5], the authors formulated an energy minimization problem with latency constraints by trajectory design in [6]. The above works in [3, 4, 5, 6] adopted a time division multiple access (TDMA) mode, where the UAV serves one user per timeslot. Besides TDMA, space division multiple access (SDMA) enables simultaneous data transmission to multiple users, such that the hovering time and hovering energy can be reduced. In [7], the authors designed an SDMA-based beamforming scheme to minimize the total transmit power for multi-antenna UAVs. In [8], an energy efficiency maximization problem was investigated in an SDMA-based multi-antenna UAV network via optimizing the flying velocity and power allocation. However, serving multiple users simultaneously may lead to strong inter-user interference and may require more communication energy to fulfill users’ demands.

Deterministic optimization algorithms, e.g., [3, 4, 5, 6, 7, 8], might not be suitable for fast decision making in a dynamic wireless environment. To address this issue, deep learning-based solutions have been investigated in the literature. For example, a deep neural network (DNN) has been applied to UAV-enabled hybrid networks to efficiently predict the resource allocation scheme, and a deep learning-based auction algorithm has been proposed to determine dynamic battery charging schedules for UAV-aided systems. Supervised learning with DNNs, however, requires large amounts of training data, and generating these data offline is a non-trivial task. Another category of studies is deep reinforcement learning (DRL), which has the following advantages. Firstly, DRL provides timely solutions that adapt to environment variations. Secondly, DRL integrates DNNs to make decisions and improve solution quality. Thirdly, whereas supervised DNN training requires an offline data generation and training phase, DRL needs less prior knowledge and can train online by exploring unknown environments and exploiting the received feedback. In [11], the authors applied a deep Q network (DQN) to design an energy-efficient flying trajectory scheme for UAV-aided networks. In general, DQN is used to deal with a relatively small and discrete action space, where the action space refers to the set of all possible decisions [12]. The authors in [13] designed a different deep Q-learning architecture for a high-dimensional action space, but it needs to evaluate all of the actions before making a decision, which is time-consuming.

Deep actor-critic is an emerging DRL method with fast convergence and the capability to deal with a large action space [14]. In [15], an actor-critic-based DRL (AC-DRL) algorithm was proposed to reduce the UAV’s energy consumption and enhance the UAV’s coverage of ground users by optimizing the UAV’s flying direction and distance. In [16], the authors employed deep actor-critic to design a learning algorithm for UAV-aided systems, considering energy efficiency and users’ fairness. Note that the AC-DRL in [15, 16] was developed for unconstrained problems. However, most problems in UAV systems are constrained and involve discrete variables. Conventional AC-DRL algorithms have limitations in tackling constrained combinatorial optimization problems, which may result in slow convergence and infeasible or degraded solutions. The authors in [17] developed an AC-DRL algorithm for a combinatorial optimization problem in a UAV-aided system, but when the size of the action space grows exponentially, the convergence of the algorithm deteriorates.

In this study, we minimize the UAV’s communication and propulsion energy in a downlink UAV-aided communication system. The novelty of solution development lies in two aspects. Firstly, compared to offline optimization approaches, we provide online learning and timely energy-saving solutions based on DRL. Secondly, unlike the conventional DRL methods, the proposed solution is designed to address the challenging issues in constrained combinatorial optimization. The major contributions are summarized as follows:

  • We formulate an energy minimization problem for an SDMA-enabled UAV communication system, where user-timeslot allocation and UAV’s hovering time assignment are the coupled optimization tasks. The formulated problem is combinatorial and non-convex with bilinear constraints.

  • We provide a relax-and-approximate method to approach the optimum. That is, the bilinear terms are addressed by McCormick envelope relaxation, and the remaining integer linear programming problem is then solved by branch-and-bound (B&B).

  • We characterize the interplay among communication energy, hovering time, and hovering energy. Based on the derived analytical results, we develop a golden section search-based heuristic (GSS-HEU) algorithm for benchmarking general instances with lower complexity than the optimal solution.

  • Being aware of the issues in optimal/sub-optimal and conventional DRL approaches, we propose an actor-critic-based deep stochastic online scheduling (AC-DSOS) algorithm, where the original problem is transformed to a Markov decision process (MDP). Unlike conventional AC-DRL solutions, in AC-DSOS, we design a set of approaches, e.g., stochastic policy quantification, action space reduction, and feasibility-guaranteed reward function design, to specifically address the constrained combinatorial problem.

  • Simulations demonstrate that the proposed AC-DSOS enables a feasible, fast-converging, and dynamically-adaptive solution. The designed approaches are effective in reducing action space and guaranteeing feasibility. AC-DSOS achieves 29.94% and 52.51% energy reduction compared with a conventional AC-DRL method and a heuristic user scheduling method with almost the same computation time.

The rest of the paper is organized as follows. Section II provides the system model and Section III formulates the considered optimization problem. In Section IV, we analyze the relationship between the energy consumption and hovering time, and propose a heuristic algorithm. In Section V, we reformulate the problem as an MDP and develop an AC-DSOS algorithm. Numerical results are presented and analyzed in Section VI. Finally, we draw the conclusions in Section VII.

The codes for generating the results are online available at the link:

II System Model and Problem Formulation

II-A System Model

We consider a downlink UAV-aided communication system. A UAV serves as an aerial base station (BS) to deliver data to ground users, e.g., for the scenarios if terrestrial BSs are unavailable or overloaded by high traffic demand from numerous users. We assume that the UAV is equipped with antennas and each ground user has a single antenna [8]. The UAV is fully loaded with data and energy at a dock station before the task starts. The service area is divided into clusters considering the UAV’s limited coverage area. This setup can be used in many practical scenarios such as emergency rescue and temporary communication[18, 19]. We denote as the set of clusters and as the extended set, where the -th cluster denotes the dock station. The UAV flies through all the clusters successively according to a pre-optimized trajectory, and transmits data to the users by hovering at a given point, e.g., above the cluster’s center. Let and denote the number and set of the users in the -th cluster. The demands of user are denoted by (in bits). When all the demands in a cluster are satisfied, the UAV leaves the current cluster and visits the next one. After serving all the clusters, the UAV flies back to the dock station. The process of the UAV from leaving to returning the dock station is defined as a round or a task. Fig. 1 illustrates an example of the considered system.

Figure 1: An illustrative UAV-aided network.

The data stored in the UAV typically has a certain life span [20]. Thus, we consider the transmitted data is delay-sensitive, and all data delivery must be completed within (in frames), where the time domain is divided by frames in set . One frame consists of timeslots, and the duration of a timeslot is . With SDMA, the UAV can simultaneously transmit data to more than one user in each timeslot. The frame-timeslot structure is shown in Fig. 2, where the shaded blocks indicate that the users are scheduled. We define the scheduled users at a timeslot as a user group. The union of the possible groups in cluster is denoted by . The maximum number of candidate groups in cluster is [21], which increases exponentially with . The number and set of the users of group in cluster are denoted by and , respectively.
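The exponential growth of the candidate group set can be illustrated with a short enumeration. This is only a sketch: the exact counting formula from [21] is not reproduced here, and the cap `max_group_size` (for instance, the number of UAV antennas) is an assumption introduced for illustration.

```python
from itertools import combinations
from math import comb


def candidate_groups(users, max_group_size):
    """Enumerate every non-empty user group with at most max_group_size members.

    max_group_size is a hypothetical cap (e.g. the number of antennas);
    the source does not state the exact rule.
    """
    groups = []
    for size in range(1, min(max_group_size, len(users)) + 1):
        groups.extend(combinations(users, size))
    return groups


def count_candidate_groups(num_users, max_group_size):
    """Count the groups without enumerating them."""
    largest = min(max_group_size, num_users)
    return sum(comb(num_users, size) for size in range(1, largest + 1))


if __name__ == "__main__":
    # With no effective cap, the count is 2^K - 1 and doubles with each extra user.
    for num_users in (4, 8, 16):
        print(num_users, count_candidate_groups(num_users, num_users))
```

Even for a moderate number of users per cluster, enumerating all groups quickly becomes impractical, which is why the later sections look for ways to confine the action space.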

Figure 2: An illustration of the frame-timeslot structure.

We consider a quasi-static Rician fading channel, which comprises both a deterministic line-of-sight (LoS) component and a random multipath component [22]. The channel states are static within a transmission frame and vary from one frame to another. The channel vector from the UAV antennas to each ground user is determined by the multipath Rician fading vector and the free-space propagation loss between the UAV and that user. We collect the channel vectors of all the users in a group to form a channel matrix. Within a user group, we apply a linear minimum mean square error (MMSE) precoding scheme due to its high efficiency and low computational complexity in mitigating intra-group interference. The precoding vector for each user in the group is calculated by:


where is the transmit power for user in group , is the -th column in , and is the noise power. Note that transmit power is fixed as parameters in this work by following practical UAV applications, e.g., constant transmit power can be selected from 0.1 W to 10 W [23]. The signal-to-interference-plus-noise ratio (SINR) for the user is given by:


where the channel coefficients appearing in the expression are the effective channel gains. Since the channel states vary over frames, the SINR and the channel coefficients are indexed by frame. In this work, the time-varying channel is further modeled as a finite-state Markov channel (FSMC). Under the FSMC, we quantize each channel coefficient into multiple Markov states and obtain transition probabilities such that the variations of the coefficients follow a Markov process between frames [24].
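A minimal sketch of how a finite-state Markov channel evolves from frame to frame is given below. The three quantization levels and the transition matrix are hypothetical values chosen for illustration; they are not taken from the paper or from [24].

```python
import random

# Hypothetical quantized channel-gain levels (one per Markov state).
LEVELS = [0.2, 0.6, 1.0]

# Hypothetical transition matrix: row i holds P(next state = j | current state = i).
# Only transitions to the same or an adjacent state are allowed here.
TRANSITION = [
    [0.8, 0.2, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.2, 0.8],
]


def simulate_fsmc(transition, initial_state, num_frames, seed=0):
    """Sample one channel-state index per frame from the Markov chain."""
    rng = random.Random(seed)
    states = [initial_state]
    for _ in range(num_frames - 1):
        row = transition[states[-1]]
        next_state = rng.choices(range(len(row)), weights=row, k=1)[0]
        states.append(next_state)
    return states


if __name__ == "__main__":
    trajectory = simulate_fsmc(TRANSITION, initial_state=1, num_frames=10)
    print([LEVELS[state] for state in trajectory])
```

The state stays constant within a frame and changes only at frame boundaries, which matches the quasi-static assumption stated above.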

If group is scheduled at timeslot on frame , the amount of data transmitted to user and the consumed communication energy of group can be expressed by:




where is the system bandwidth. Note that within a frame, we assume a user’s channel condition is identical across all the timeslots, thus index is omitted in and .
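The per-timeslot quantities can be sketched as follows, assuming the usual Shannon-rate form (timeslot duration times bandwidth times log2(1 + SINR)) for the transmitted data and the sum of the fixed transmit powers times the timeslot duration for the group's communication energy. The function and parameter names are illustrative and do not come from the paper.

```python
import math


def data_per_timeslot(sinr, bandwidth_hz, slot_duration_s):
    """Bits delivered to one user in one timeslot (Shannon-rate assumption)."""
    return slot_duration_s * bandwidth_hz * math.log2(1.0 + sinr)


def group_energy_per_timeslot(transmit_powers_w, slot_duration_s):
    """Communication energy (joules) of a scheduled group in one timeslot,
    assuming fixed per-user transmit powers."""
    return slot_duration_s * sum(transmit_powers_w)


def timeslots_needed(demand_bits, sinr, bandwidth_hz, slot_duration_s):
    """Timeslots required to deliver a demand if the SINR stays constant."""
    per_slot = data_per_timeslot(sinr, bandwidth_hz, slot_duration_s)
    return math.ceil(demand_bits / per_slot)


if __name__ == "__main__":
    # Linear SINR of 3 (about 4.8 dB), 1 MHz bandwidth, 0.1 s timeslots.
    print(data_per_timeslot(3.0, 1e6, 0.1))
    print(group_energy_per_timeslot([1.0, 2.0], 0.1))
    print(timeslots_needed(1e6, 3.0, 1e6, 0.1))
```

Scheduling more users in one group shortens the hovering time but lowers each user's SINR through inter-user interference, so more timeslots (and more communication energy) may be needed per user.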

II-B UAV’s Energy Model

We employ a UAV energy model proposed in [5]. The flying power is formulated as a function of flying velocity :



  • : the blade profile power in hovering status;

  • : the induced power in hovering status;

  • : the tip speed of the rotor blade;

  • : the mean rotor induced velocity;

  • : the parameter related to the fuselage drag ratio, rotor solidity, and the rotor disc area;

  • : the air density.

When the UAV approaches the hovering point of each cluster, it flies around the point at a certain velocity, which is more energy-efficient than remaining stationary [6]. Thus, the hovering power is the flying power evaluated at this velocity. The flying energy for a constant velocity and a given traveling distance is expressed as:


Hovering energy and communication energy need to be jointly optimized since they are coupled by hovering time, whereas the optimization of flying energy is independent. By applying graph-based numerical methods [25], the minimum flying energy along with the optimal flying speed can be obtained by:


where .
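For intuition, the sketch below evaluates a widely used rotary-wing propulsion power model whose terms match the parameter descriptions listed above (blade profile power, induced power, and a drag term). Whether this is exactly the expression used in the paper is an assumption, and the numerical parameter values are illustrative defaults rather than values from this work. The energy-minimizing speed is found with a plain grid search, not the graph-based method of [25].

```python
import math

# Illustrative parameter values (assumptions, not taken from this paper).
P0 = 79.86      # blade profile power in hovering status (W)
P1 = 88.63      # induced power in hovering status (W)
U_TIP = 120.0   # tip speed of the rotor blade (m/s)
V0 = 4.03       # mean rotor induced velocity (m/s)
DRAG = 0.5 * 0.6 * 1.225 * 0.05 * 0.503  # lumped drag parameter incl. air density


def flying_power(v):
    """Propulsion power (W) at horizontal speed v (m/s)."""
    blade = P0 * (1.0 + 3.0 * v**2 / U_TIP**2)
    induced = P1 * math.sqrt(
        math.sqrt(1.0 + v**4 / (4.0 * V0**4)) - v**2 / (2.0 * V0**2)
    )
    parasite = DRAG * v**3
    return blade + induced + parasite


def flying_energy(distance_m, v):
    """Energy (J) to cover distance_m at constant speed v > 0."""
    return flying_power(v) * distance_m / v


def energy_optimal_speed(step=0.1, v_max=40.0):
    """Grid search for the speed that minimizes energy per meter."""
    speeds = [step * i for i in range(1, int(v_max / step) + 1)]
    return min(speeds, key=lambda v: flying_power(v) / v)


if __name__ == "__main__":
    v_star = energy_optimal_speed()
    print(v_star, flying_energy(1000.0, v_star))
```

Under these illustrative values, the power at a moderate circling speed is lower than the power at zero speed, which is consistent with the remark that flying around the hovering point is more energy-efficient than staying still.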

The main notations are summarized in Table I.

Table I: Summary of Symbols and Notations
Description
number and set of clusters
number of antennas in the UAV
number and set of users in a cluster
number and set of groups in a cluster
number and set of users in a group of a cluster
demands of a user in a cluster
maximum number and set of frames in each round
number and set of timeslots in each frame
duration of each timeslot (in seconds)
SINR of a user on a frame
channel coefficient from one user’s precoding vector to another user on a frame
transmitted data of a user per timeslot on a frame
communication energy of a group per timeslot on a frame
UAV’s flying velocity that minimizes flying energy with a predetermined flying path
minimal flying energy with a predetermined flying path

III Problem Formulation

We denote one set of binary variables as the scheduling indicators, where a value of 1 indicates that a user group is assigned to a given timeslot on a given frame, and 0 otherwise. Another set of binary variables indicates whether the UAV is hovering above a given cluster on a given frame (value 1) or not (value 0). The UAV energy consumption consists of flying energy, hovering energy, and communication energy. Since the minimal flying energy can be independently obtained by Eq. (7) without loss of optimality, the objective focuses on the joint optimization of the hovering energy and the communication energy, which are expressed by:


Note that the UAV is battery-limited in practice. We focus on instances in which the minimum consumed energy in (10a) is within the UAV’s battery storage; otherwise, the task is infeasible. The optimization problem is formulated as:


Constraints (10b) guarantee that all the users’ requests have to be satisfied within . Constraints (10c) define that the UAV follows a successive and forward manner in visiting clusters. For example, if the UAV is hovering above cluster on frame , in the next frame , the UAV either chooses to stay at the current cluster or move to the next cluster . The option of flying back to previously visited clusters, e.g., , is thus excluded. Note that the UAV takes off from the first cluster, i.e., . Constraints (10d) represent that all the timeslots on frame are assigned to a user group when , otherwise, no users are scheduled in any timeslot. Constraints (10e) and (10f) indicate that no more than one group can be scheduled at a timeslot and only one cluster can be served within a frame. Constraints (10g) and (10h) confine variables and to binary.
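The successive, forward-only visiting rule described for constraints (10c) can be checked on a candidate hovering schedule with a few lines of code. The representation below (one served-cluster index per frame, with 1-based indices and the dock station as the last index) is an assumption made for illustration; serving exactly one cluster per frame is implied by that representation.

```python
def follows_forward_visiting(served_cluster_per_frame, num_clusters):
    """Check the forward-visiting rule on a per-frame cluster sequence.

    served_cluster_per_frame[t] is the 1-based index of the cluster the UAV
    hovers above on frame t; index num_clusters + 1 stands for the dock station.
    """
    sequence = list(served_cluster_per_frame)
    if not sequence or sequence[0] != 1:
        return False  # the UAV must start from the first cluster
    if any(cluster < 1 or cluster > num_clusters + 1 for cluster in sequence):
        return False
    for current, following in zip(sequence, sequence[1:]):
        # Either stay at the current cluster or move to the next one.
        if following not in (current, current + 1):
            return False
    return True


if __name__ == "__main__":
    print(follows_forward_visiting([1, 1, 2, 3, 3], num_clusters=3))  # stays or advances
    print(follows_forward_visiting([1, 2, 1], num_clusters=3))        # flies back
```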

Note that the formulated problem is a combinatorial optimization problem with a non-convex bilinear objective and constraints. The optimum can be approached by a well-established relax-and-approximate method. That is, the non-convex bilinear terms are relaxed and bounded by McCormick envelopes [26], where each variable in a bilinear term is confined between a lower and an upper bound. The relaxed problem becomes an integer linear programming (ILP) problem, which can be optimally solved by B&B. Overall, the optimum of the original problem can be approached by progressively tightening the bounds, e.g., increasing the number of breakpoints in the envelopes, but this results in exponentially increasing complexity, which is unaffordable in practice [27]. Thus, we adopt the above relax-and-approximate method to provide an optimal solution for benchmarking small-to-medium cases. For general cases, we propose a sub-optimal algorithm in the next section.
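As a reminder of how a McCormick envelope bounds a bilinear term, the sketch below computes the standard lower and upper envelopes of a product of two bounded variables. The variable names are generic and are not the paper's symbols; the piecewise refinement with breakpoints mentioned above is not implemented.

```python
def mccormick_bounds(x, y, x_lo, x_hi, y_lo, y_hi):
    """Return (lower, upper) McCormick bounds on the product x * y.

    Valid whenever x_lo <= x <= x_hi and y_lo <= y <= y_hi.
    """
    lower = max(
        x_lo * y + x * y_lo - x_lo * y_lo,
        x_hi * y + x * y_hi - x_hi * y_hi,
    )
    upper = min(
        x_hi * y + x * y_lo - x_hi * y_lo,
        x_lo * y + x * y_hi - x_lo * y_hi,
    )
    return lower, upper


if __name__ == "__main__":
    # Continuous x: the envelope is a relaxation (an interval around x * y).
    print(mccormick_bounds(0.5, 2.0, 0.0, 1.0, 0.0, 4.0))
    # x at one of its bounds: the envelope collapses to the exact product.
    print(mccormick_bounds(1.0, 2.0, 0.0, 1.0, 0.0, 4.0))
```

The four linear inequalities replace the bilinear term with a new variable, so the relaxed problem is linear in all variables. Narrower variable bounds give a tighter envelope, at the cost of more pieces to enumerate.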

IV Heuristic Approach

We decompose the joint optimization into two sub-problems, i.e., user-timeslot allocation and hovering time allocation, corresponding to the optimization of the scheduling variables and the hovering variables, respectively. We then solve one sub-problem while the other is fixed.

IV-A User-Timeslot Scheduling

The bilinear terms are resolved once the hovering variables are fixed. The number of frames at each cluster is determined by:


and is the hovering duration. The user-timeslot scheduling can be carried out independently in each cluster, and the resulting problem for the -th cluster is formulated in with a given . We denote and as the hovering and communication energy for the -th cluster:


where refers to the number of elapsed frames before the UAV arriving cluster , which can be calculated by:


The sub-problem is formulated as:


The resulting sub-problem is a multi-choice multi-dimensional knapsack problem (MMKP), which can be solved by a guided local search (GLS)-based heuristic algorithm that provides high-quality sub-optimal solutions with pseudo-polynomial-time complexity [28].

IV-B Hovering Time Allocation

To optimize hovering time efficiently, we first investigate the connection between the objective energy and . From Eq. (12) and Eq. (13), increases linearly with while is determined by both and . Next, we show the relationship between the optimum and . For cluster , we denote as the communication energy with the optimal scheduling decision at a given hovering time .

Lemma 1.

is a non-increasing function of ,


We denote the optimal user scheduling for as . If increases from to , is still feasible for such that


might not be necessarily optimal for . There exists an optimal scheduling resulting in lower communication energy, i.e.,


Thus the conclusion. ∎

From Lemma 1, we can observe that is a non-increasing function of , i.e., . For , we can derive that based on Eq. (12). Thus, the extreme point of can be obtained at when


Since the existence and the number of extreme points are undetermined, there are three possible cases, i.e., unimodal, multimodal, and monotonic, for , as illustrated in Fig. 3. In case 1, the curve is a unimodal function with only one extreme point. In case 2, the fluctuation of leads to multiple extreme points such that the curve is a multimodal function. In case 3, Eq. (19) cannot hold, e.g., is consistently larger than , so the curve is monotonically increasing with no extreme point.

Figure 3: Energy curves for three possible cases.

Observing the possible cases, we employ an efficient golden section search (GSS) to find the extreme points [29]. In GSS, we limit the hovering time to ensure that the total service duration does not exceed , where is a maximal time limitation for cluster . Intuitively, the clusters with more demands need more transmission frames. We assume is proportional to the users’ demands:


IV-C Algorithm Summary

We summarize the proposed GSS-based heuristic (GSS-HEU) algorithm in Alg. 1. We denote as the set of channel states of cluster on frame , which is expressed as:


In GSS-HEU, the initial search range of GSS is set as , which is partitioned into 3 sections by two points and with the golden ratio 0.618 in lines 2-4, where is an operation to round a value up to an integer. When a hovering time is searched in GSS, e.g., or , the corresponding user-timeslot allocation is obtained by solving in line 6. In lines 9-13, we compare the objective energy and update the search range. The search process terminates at . The selected hovering time is and the corresponding scheduling scheme is .

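Input:  Users’ demands for all clusters; channel states for all clusters and frames; upper bound of the search range for each cluster.
Output: Heuristic solution (hovering time and user scheduling scheme for each cluster).
1:  for each cluster do
2:     Initialize the lower and upper bounds of the search range;
3:     Set the left search point with the golden ratio 0.618, rounded up to an integer;
4:     Set the right search point with the golden ratio 0.618, rounded up to an integer;
5:     for each search iteration until the search range converges do
6:        Solve the user-timeslot scheduling sub-problem at the left and right search points;
7:        Obtain the corresponding user scheduling schemes at both points;
8:        Obtain the objective energy at both points;
9:        if the comparison of the two objective energies favors the left point then
10:           Update the search range and the search points toward the left point;
11:        else
12:           Update the search range and the search points toward the right point;
13:        end if
14:     end for
15:     Select the resulting hovering time and the corresponding scheduling scheme.
16:  end for
Algorithm 1 GSS-HEU Algorithm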
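The search step of the heuristic can be sketched as an integer golden section search over the hovering time of one cluster. This is an illustration under stated assumptions, not the authors' implementation: `objective` stands in for "solve the scheduling sub-problem and return hovering plus communication energy", and `toy_energy` is a made-up curve with a linearly increasing hovering term and a non-increasing communication term.

```python
import math

GOLDEN_RATIO = (math.sqrt(5) - 1) / 2  # about 0.618


def golden_section_search(objective, lower, upper):
    """Minimize a unimodal objective over the integers lower..upper."""
    cache = {}

    def evaluate(frames):
        # Each evaluation may be expensive (a scheduling sub-problem), so cache it.
        if frames not in cache:
            cache[frames] = objective(frames)
        return cache[frames]

    while upper - lower > 2:
        step = math.ceil(GOLDEN_RATIO * (upper - lower))
        left, right = upper - step, lower + step
        if evaluate(left) <= evaluate(right):
            upper = right   # the minimum lies in [lower, right]
        else:
            lower = left    # the minimum lies in [left, upper]
    # At most three candidates remain; pick the best one directly.
    return min(range(lower, upper + 1), key=evaluate)


def toy_energy(frames):
    """Hypothetical energy curve: hovering grows, communication shrinks."""
    hovering = 100.0 * frames
    communication = 5000.0 / frames
    return hovering + communication


if __name__ == "__main__":
    best = golden_section_search(toy_energy, 1, 20)
    print(best, toy_energy(best))
```

For the unimodal and monotonic cases the search returns the global minimizer on the interval. For a multimodal curve it may stop at a local extreme point, which is one reason the method is a heuristic.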

The complexity of GSS-HEU is much lower than that of the optimal method. However, both the optimal and GSS-HEU approaches may have limitations in fast decision-making. The computational time of both algorithms grows exponentially with the number of users [30]. In addition, both algorithms need complete estimated channel states for all frames of the task, which may make channel estimation difficult. Therefore, we reconsider the problem from the perspective of DRL to enable the UAV to make decisions intelligently, while the developed optimal and sub-optimal algorithms are used to benchmark the performance of learning-based solutions.

V Actor-Critic-Based DRL Algorithm

V-A Overview of Actor-Critic-Based DRL (AC-DRL)

In DRL, an agent learns to make decisions by exploring unknown environments and exploiting the received feedback. At each learning step (in this paper, a learning step is equivalent to a transmission frame), the agent observes the current state and takes an action based on a policy. Then, a reward is fed back to the agent. The policy is updated step by step according to the feedback. Actor-critic is an emerging reinforcement learning method that separates the agent into two parts, an actor and a critic. The actor is responsible for taking actions following a stochastic policy, which is a conditional probability density function of the action given the state. The critic is used to evaluate the decisions via a Q-value, which is given by:


where is a conditional expectation under the policy , and is the cumulative discounted reward with a discount factor , which can be expressed as:


However, obtaining explicit expressions of the policy and the Q-value is difficult. DRL therefore uses DNNs as parameterized approximators to estimate them, with one parameter vector for the actor and another for the critic. The goal of the agent is to minimize the loss function of the actor:



Based on the fundamental results of the policy gradient theorem [12], the gradient of can be calculated by:


The update rule of can be derived based on gradient descent:


where is the learning rate of the actor. For the critic, the parameter vector is updated based on temporal-difference (TD) learning [12]. In TD learning, the loss function of the critic is defined as the expectation of the square of TD error , i.e., . The TD error refers to the difference between the TD target and estimated Q-value, which is given by:


where is the TD target. The objective of the critic is to minimize the loss function, and the update rule of can be derived by gradient descent:


where is the learning rate for the critic.

However, approximating the Q-value brings about a large variance in the gradient, resulting in poor convergence [31]. To solve this problem, a V-value is introduced:


Approximating the V-value instead can reduce the variance. With the parameterized V-value, the TD error and the loss function of the critic are expressed as:




In addition, the TD target built from the V-value provides an unbiased estimate of the Q-value [31]. Thus, we can rewrite Eq. (25) as:
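The quantities of this subsection can be sketched in a few lines, using a tabular V-value and a softmax policy over action preferences instead of DNNs. This is a generic one-step actor-critic written in the common reward-maximizing convention; the paper's own loss definitions and sign conventions may differ, and all names below are illustrative.

```python
import math


def discounted_return(rewards, gamma):
    """Cumulative discounted reward: r_0 + gamma * r_1 + gamma^2 * r_2 + ..."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def td_error(reward, gamma, value_now, value_next):
    """TD error: (TD target) minus (current V-value estimate)."""
    return reward + gamma * value_next - value_now


def critic_update(value_now, delta, learning_rate):
    """One semi-gradient step on the squared TD error for a tabular V-value."""
    return value_now + learning_rate * delta


def softmax(preferences):
    peak = max(preferences)
    weights = [math.exp(p - peak) for p in preferences]
    total = sum(weights)
    return [w / total for w in weights]


def actor_update(preferences, action, delta, learning_rate):
    """Policy-gradient step for a softmax policy, with the TD error as the signal."""
    probs = softmax(preferences)
    return [
        pref + learning_rate * delta * ((1.0 if index == action else 0.0) - probs[index])
        for index, pref in enumerate(preferences)
    ]


if __name__ == "__main__":
    delta = td_error(reward=1.0, gamma=0.9, value_now=2.0, value_next=3.0)
    print(delta, critic_update(2.0, delta, 0.1))
    print(softmax(actor_update([0.0, 0.0], action=0, delta=delta, learning_rate=0.1)))
```

A positive TD error means the action turned out better than the critic expected, so the actor raises its probability. A negative TD error lowers it.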


V-B Problem Reformulation

To apply AC-DRL, we reformulate the problem as an MDP, in which the UAV acts as an agent. We define the states, actions, and rewards as follows.

V-B1 States

The system state consists of the channel states for all the clusters on the current frame, the undelivered demands, and the currently served cluster. The undelivered demands are the residual data still to be delivered to each cluster on the current frame:


where is the delivered data for cluster in frame under the policy . We denote as an indicator to represent which cluster the UAV is serving in frame . When the users’ requests in the current cluster are completed, the UAV will move to the next cluster in the next frame, otherwise, staying at the current cluster. For example, we assume that the UAV is hovering above cluster on frame , i.e., . For the next frame, is obtained by:


When the UAV’s duration exceeds , the UAV will fly back to the dock station. By assembling the above three parts, the state is defined as:


Note that the elements of are modeled as FSMC. In addition, based on Eq. (33) and Eq. (35), the next state of and only depend on the current state and current policy. Therefore, the transition of the state conforms to MDP [12].
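The deterministic part of the state transition (undelivered demands and served-cluster indicator) can be sketched as below. The data layout is an assumption made for illustration: clusters are 0-based, the index equal to the number of clusters stands for the dock station, and the channel-state component and the frame limit are left out.

```python
def step_state(undelivered, current_cluster, delivered_bits):
    """Advance the demand and cluster components of the state by one frame.

    undelivered:      list (per cluster) of lists of residual bits per user
    current_cluster:  0-based index of the cluster being served
    delivered_bits:   bits delivered to each user of that cluster in this frame
    """
    updated = [list(cluster) for cluster in undelivered]
    updated[current_cluster] = [
        max(0.0, residual - delivered)
        for residual, delivered in zip(updated[current_cluster], delivered_bits)
    ]
    # Move on only when every request in the current cluster is completed.
    cluster_done = all(residual == 0.0 for residual in updated[current_cluster])
    next_cluster = current_cluster + 1 if cluster_done else current_cluster
    return updated, next_cluster


if __name__ == "__main__":
    demands = [[5.0, 3.0], [4.0]]
    demands, cluster = step_state(demands, 0, [5.0, 1.0])
    print(demands, cluster)   # cluster 0 still has residual data
    demands, cluster = step_state(demands, cluster, [0.0, 2.0])
    print(demands, cluster)   # cluster 0 is finished, move to cluster 1
```

Because the next demands and the next cluster depend only on the current state and the data delivered under the current action, this part of the transition has the Markov property noted above.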

V-B2 Actions

The action of the UAV is the user-timeslot assignment on frame t, which is given by: