Unmanned aerial vehicles (UAVs) have attracted much attention for high-speed data transmission in dynamic, distributed, or plug-and-play scenarios, e.g., disaster rescue, live concerts, or sports events. However, UAVs’ limited endurance, energy supply, and storage are critical issues for their applications, which motivates the study of energy efficiency in UAV-aided communication networks. The UAV’s energy consumption comes from two aspects: propulsion energy for flying and hovering, and communication energy for data transmission. The flying energy mainly depends on the UAV’s velocity and trajectory. The hovering energy is in general proportional to the hovering time. Compared to the propulsion energy, the communication energy consumption is not negligible, e.g., considerable communication energy can be consumed in scenarios with high traffic requests from a large number of users. Thus, joint energy optimization for both parts is necessary and has attracted considerable attention in the literature [6, 3, 4, 5, 7, 8].
The authors in [3, 4] maximized the energy efficiency, defined as the ratio between the transmitted data and the propulsion energy. In , the authors introduced a complete UAV energy model and proposed a user-timeslot scheduling method to minimize the sum of the propulsion energy and the communication energy. Based on the energy model in , the authors in  formulated an energy minimization problem with latency constraints via trajectory design. The above works [3, 4, 5, 6] adopted a time division multiple access (TDMA) mode, where the UAV serves one user per timeslot. Besides TDMA, space division multiple access (SDMA) enables simultaneous data transmission to multiple users, such that the hovering time and hovering energy can be reduced. In , the authors designed an SDMA-based beamforming scheme to minimize the total transmit power of multi-antenna UAVs. In , an energy efficiency maximization problem was investigated for an SDMA-based multi-antenna UAV network via optimizing the flying velocity and power allocation. However, serving multiple users simultaneously may lead to strong inter-user interference and may require more communication energy to fulfill the users’ demands.
Such optimization-based approaches might not be suitable for fast decision making in a dynamic wireless environment. To address this issue, deep learning-based solutions have been investigated in the literature. The authors in
applied a deep neural network (DNN) for UAV-enabled hybrid networks to efficiently predict the resource allocation scheme. In
, a deep learning-based auction algorithm was proposed to determine a dynamic battery charging schedule for UAV-aided systems. Supervised learning, such as DNN-based methods, requires large amounts of training data, whose offline collection is a non-trivial task. Another category of studies is deep reinforcement learning (DRL), with the following advantages. Firstly, DRL provides timely solutions adapted to environment variations. Secondly, DRL integrates DNNs to make decisions and improve solution quality. Thirdly, a DNN requires an offline data-generation and training phase, whereas DRL needs less prior knowledge and is able to train by exploring unknown environments and exploiting the received feedback in an online manner. In , the authors applied a deep Q network (DQN) to design an energy-efficient flying trajectory scheme for UAV-aided networks. In general, DQN is used to deal with a relatively small and discrete action space, where the action space refers to the set of all possible decisions. The authors in  designed a different deep Q-learning architecture with a high-dimensional action space, but it needs to evaluate all of the actions before making a decision, which is time-consuming.
Deep actor-critic is an emerging DRL method with fast convergence properties and the capability to deal with a large action space. In , an actor-critic-based DRL (AC-DRL) algorithm was proposed to reduce the UAV’s energy consumption and enhance the UAV’s coverage of ground users via optimizing the UAV’s flying direction and distance. In , the authors employed deep actor-critic to design a learning algorithm for UAV-aided systems, considering energy efficiency and users’ fairness. Note that the AC-DRL in [15, 16] was developed for unconstrained problems. However, most of the problems in UAV systems are constrained and involve discrete variables. The conventional AC-DRL algorithms have limitations in tackling constrained combinatorial optimization problems, which may result in slow convergence, infeasibility, and degraded solutions. The authors in  developed an AC-DRL algorithm for a combinatorial optimization problem in a UAV-aided system, but when the size of the action space grows exponentially, the convergence of the algorithm deteriorates.
In this study, we minimize the UAV’s communication and propulsion energy in a downlink UAV-aided communication system. The novelty of solution development lies in two aspects. Firstly, compared to offline optimization approaches, we provide online learning and timely energy-saving solutions based on DRL. Secondly, unlike the conventional DRL methods, the proposed solution is designed to address the challenging issues in constrained combinatorial optimization. The major contributions are summarized as follows:
We formulate an energy minimization problem for an SDMA-enabled UAV communication system, where user-timeslot allocation and UAV’s hovering time assignment are the coupled optimization tasks. The formulated problem is combinatorial and non-convex with bilinear constraints.
We provide a relax-and-approximate method to approach the optimum. That is, the bilinear terms are addressed by McCormick envelope relaxation, and the remaining integer linear programming problem is solved by branch-and-bound (B&B).
We characterize the interplay among communication energy, hovering time, and hovering energy. Based on the derived analytical results, we develop a golden section search-based heuristic (GSS-HEU) algorithm for benchmarking general instances with lower complexity than the optimal solution.
Being aware of the issues in optimal/sub-optimal and conventional DRL approaches, we propose an actor-critic-based deep stochastic online scheduling (AC-DSOS) algorithm, where the original problem is transformed to a Markov decision process (MDP). Unlike conventional AC-DRL solutions, in AC-DSOS, we design a set of approaches, e.g., stochastic policy quantification, action space reduction, and feasibility-guaranteed reward function design, to specifically address the constrained combinatorial problem.
Simulations demonstrate that the proposed AC-DSOS enables a feasible, fast-converging, and dynamically-adaptive solution. The designed approaches are effective in reducing the action space and guaranteeing feasibility. AC-DSOS achieves 29.94% and 52.51% energy reduction compared with a conventional AC-DRL method and a heuristic user scheduling method, respectively, with almost the same computation time.
The rest of the paper is organized as follows. Section II provides the system model and Section III formulates the considered optimization problem. In Section IV, we analyze the relationship between the energy consumption and hovering time, and propose a heuristic algorithm. In Section V, we reformulate the problem as an MDP and develop an AC-DSOS algorithm. Numerical results are presented and analyzed in Section VI. Finally, we draw the conclusions in Section VII.
The code for generating the results is available online at: https://github.com/ArthuretYuan.
II System Model and Problem Formulation
II-A System Model
We consider a downlink UAV-aided communication system. A UAV serves as an aerial base station (BS) to deliver data to ground users, e.g., in scenarios where terrestrial BSs are unavailable or overloaded by high traffic demand from numerous users. We assume that the UAV is equipped with antennas and each ground user has a single antenna. The UAV is fully loaded with data and energy at a dock station before the task starts. The service area is divided into clusters considering the UAV’s limited coverage area. This setup applies to many practical scenarios such as emergency rescue and temporary communication [18, 19]. We denote as the set of clusters and as the extended set, where the -th cluster denotes the dock station. The UAV flies through all the clusters successively according to a pre-optimized trajectory, and transmits data to the users while hovering at a given point, e.g., above the cluster’s center. Let and denote the number and set of the users in the -th cluster. The demand of user is denoted by (in bits). When all the demands in a cluster are satisfied, the UAV leaves the current cluster and visits the next one. After serving all the clusters, the UAV flies back to the dock station. The process of the UAV from leaving to returning to the dock station is defined as a round or a task. Fig. 1 illustrates an example of the considered system.
The data stored in the UAV typically has a certain life span. Thus, we consider the transmitted data to be delay-sensitive, and all data delivery must be completed within (in frames), where the time domain is divided into frames in set . One frame consists of timeslots, and the duration of a timeslot is . With SDMA, the UAV can simultaneously transmit data to more than one user in each timeslot. The frame-timeslot structure is shown in Fig. 2, where the shaded blocks indicate that the users are scheduled. We define the users scheduled at a timeslot as a user group. The union of the possible groups in cluster is denoted by . The maximum number of candidate groups in cluster is , which increases exponentially with . The number and set of the users of group in cluster are denoted by and , respectively.
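As a small illustration of why the number of candidate groups grows exponentially with the cluster size, the non-empty user subsets of a cluster can be enumerated as follows (a sketch; the function name and list representation are illustrative assumptions):

```python
from itertools import combinations

def candidate_groups(users):
    """Enumerate all non-empty user groups of a cluster.

    For a cluster with N users there are 2**N - 1 such groups,
    which grows exponentially with N.
    """
    groups = []
    for r in range(1, len(users) + 1):
        groups.extend(combinations(users, r))
    return groups
```

For a three-user cluster this already yields seven candidate groups, which is why exhaustive evaluation over groups quickly becomes impractical.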
We consider a quasi-static Rician fading channel which comprises both a deterministic line-of-sight (LoS) component and a random multipath component. The channel states are static within a transmission frame, and vary from one frame to another. The channel vector from the UAV antennas to ground user is denoted as , which can be expressed by , where is the multipath Rician fading vector and is the free-space propagation loss between the UAV and ground user . We collect all the channel vectors of the users in to form a matrix . Within a user group, we apply a linear minimum mean square error (MMSE) precoding scheme due to its high efficiency and low computational complexity in mitigating intra-group interference. The precoding vector for user is calculated by:
where is the transmit power for user in group , is the -th column of , and is the noise power. Note that the transmit power is fixed as a parameter in this work, following practical UAV applications, e.g., a constant transmit power can be selected from 0.1 W to 10 W. The signal-to-interference-plus-noise ratio (SINR) for the user is given by:
where and are the effective channel gains. Since the channel states vary over frames, we use and to track the SINR and channel coefficients on the -th frame. In this work, the time-varying channel is further modeled as a finite-state Markov channel (FSMC). Under the FSMC, we quantize each coefficient and to multiple Markov states and obtain a transition probability such that the variations of and follow a Markov process between frames.
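A minimal sketch of such an FSMC channel coefficient, assuming a given row-stochastic transition matrix between quantized coefficient levels (the matrix, level values, and function name below are illustrative assumptions):

```python
import numpy as np

def simulate_fsmc(P, levels, n_frames, start=0, seed=0):
    """Simulate one channel coefficient as a finite-state Markov chain.

    P: (S, S) row-stochastic transition matrix between quantized states.
    levels: length-S array of quantized coefficient values.
    Returns the sequence of coefficient values over n_frames frames.
    """
    rng = np.random.default_rng(seed)
    S = len(levels)
    states = np.empty(n_frames, dtype=int)
    states[0] = start
    for t in range(1, n_frames):
        # Next state drawn from the row of P for the current state
        states[t] = rng.choice(S, p=P[states[t - 1]])
    return np.asarray(levels)[states]
```

The coefficient is thus static within a frame and jumps between quantized levels from one frame to the next, matching the quasi-static assumption above.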
If group is scheduled at timeslot on frame , the amount of data transmitted to user and the consumed communication energy of group can be expressed by:
where is the system bandwidth. Note that within a frame, we assume a user’s channel condition is identical across all the timeslots, thus index is omitted in and .
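The MMSE precoding and per-user SINR computations described above can be sketched as follows (the matrix shapes, unit-norm column normalization, and variable names are illustrative assumptions, not the paper's exact notation):

```python
import numpy as np

def mmse_precoders(H, noise_power):
    """Linear MMSE precoding vectors for a user group.

    H: (M, K) channel matrix; column k is the channel of user k.
    Returns an (M, K) matrix whose k-th column is the unit-norm
    precoding vector for user k.
    """
    M, _ = H.shape
    # Regularized inverse: (H H^H + sigma^2 I)^{-1} H
    A = H @ H.conj().T + noise_power * np.eye(M)
    W = np.linalg.solve(A, H)
    # Normalize columns; transmit power is applied separately
    return W / np.linalg.norm(W, axis=0, keepdims=True)

def sinr(H, W, p, noise_power):
    """Per-user SINR for channels H, precoders W, fixed powers p."""
    G = np.abs(H.conj().T @ W) ** 2      # effective gains |h_k^H w_j|^2
    sig = p * np.diag(G)                 # desired-signal power per user
    interf = G @ p - sig                 # intra-group interference
    return sig / (interf + noise_power)
```

This mirrors the structure of the model: fixed per-user transmit powers, MMSE precoders to suppress intra-group interference, and SINRs driven by the effective channel gains.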
II-B UAV’s Energy Model
We employ a UAV energy model proposed in . The flying power is formulated as a function of flying velocity :
: the blade profile power in hovering status;
: the induced power in hovering status;
: the tip speed of the rotor blade;
: the mean rotor induced velocity;
: the parameter related to the fuselage drag ratio, rotor solidity, and the rotor disc area;
: the air density.
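The rotary-wing power model just listed can be sketched numerically; the parameter values below are illustrative assumptions (not taken from the paper), with the drag-related constants lumped into a single coefficient, and the energy-minimizing cruise speed found by a simple grid search of the kind mentioned in the text:

```python
import numpy as np

# Illustrative parameter values (assumptions, not from the paper)
P0, Pi = 79.86, 88.63     # blade profile / induced power in hover (W)
U_tip, v0 = 120.0, 4.03   # rotor-blade tip speed, mean induced velocity (m/s)
c_par = 0.009             # lumped parasite-drag coefficient (fuselage drag
                          # ratio, rotor solidity, disc area, air density)

def flying_power(V):
    """Propulsion power (W) at constant velocity V."""
    blade = P0 * (1 + 3 * V ** 2 / U_tip ** 2)
    induced = Pi * np.sqrt(np.sqrt(1 + V ** 4 / (4 * v0 ** 4))
                           - V ** 2 / (2 * v0 ** 2))
    parasite = c_par * V ** 3
    return blade + induced + parasite

def min_flying_energy(distance):
    """Energy over `distance` at speed V is distance * P(V) / V;
    grid-search V for the minimum."""
    V = np.linspace(1.0, 40.0, 400)
    E = distance * flying_power(V) / V
    i = int(np.argmin(E))
    return V[i], E[i]
```

The blade-profile term grows with speed, the induced term shrinks, and the parasite term dominates at high speed, so the energy per meter has an interior minimum at a moderate cruise speed.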
When the UAV approaches the hovering point of each cluster, it flies around the point with a certain velocity , which is more energy-efficient than . Thus, the hovering power is . The flying energy with constant velocity and traveling distance is expressed as:
Hovering energy and communication energy need to be jointly optimized since they are coupled by hovering time, whereas the optimization of flying energy is independent. By applying graph-based numerical methods , the minimum flying energy along with the optimal flying speed can be obtained by:
The main notations are summarized in Table I.
|number and set of clusters|
|number of antennas in UAV|
|number and set of users in cluster|
|number and set of groups in cluster|
|number and set of users in group of cluster|
|demands of user in cluster|
|maximum number and set of frames in each round|
|number and set of timeslots in each frame|
|duration of each timeslot (in seconds)|
|SINR of user on frame|
|channel coefficient from user ’s precoding vector to user () on frame|
|transmitted data of user per timeslot|
|communication energy of group per timeslot on frame|
|UAV’s flying velocity that minimizes flying energy with a predetermined flying path|
|minimal flying energy with a predetermined flying path|
III Problem Formulation
We denote binary variables as the scheduling indicators, where indicates that user group is assigned to timeslot on frame and otherwise. Other binary variables indicate that the UAV is hovering above cluster on frame (), and otherwise. The UAV energy consumption consists of the flying energy , the hovering energy , and the communication energy . Since the minimal flying energy can be independently obtained by Eq. (7) without loss of optimality, the objective focuses on the joint optimization of and , which are expressed by:
Note that the UAV is battery limited in practice. We focus on the instances that the minimum consumed energy in (10a) is within the UAV’s battery storage, otherwise the task is infeasible. The optimization problem is formulated as:
Constraints (10b) guarantee that all the users’ requests are satisfied within . Constraints (10c) define that the UAV visits the clusters in a successive and forward manner. For example, if the UAV is hovering above cluster on frame , then in the next frame , the UAV either stays at the current cluster or moves to the next cluster . The option of flying back to previously visited clusters, e.g., , is thus excluded. Note that the UAV takes off from the first cluster, i.e., . Constraints (10d) represent that all the timeslots on frame are assigned to a user group when ; otherwise, no users are scheduled in any timeslot. Constraints (10e) and (10f) indicate that no more than one group can be scheduled at a timeslot and only one cluster can be served within a frame. Constraints (10g) and (10h) confine the variables and to be binary.
Note that is a combinatorial optimization problem with a non-convex bilinear objective and constraints. The optimum can be approached by a well-established relax-and-approximate method. That is, the non-convex bilinear terms are relaxed and bounded by the McCormick envelope , where each variable ( and ) is bounded by an upper and a lower bound. The relaxed problem becomes an integer linear programming (ILP) problem which can be optimally solved by B&B. Overall, the optimum of can be approached by ultimately tightening the bounds, e.g., by increasing the number of breakpoints in the envelopes, but this results in exponentially increasing complexity which is unaffordable in practice . Thus, we adopt the above relax-and-approximate method to provide an optimal solution for benchmarking small-to-medium cases. For general cases, we propose a sub-optimal algorithm in the next section.
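As a brief illustration of the McCormick envelope used in the relaxation: for a generic bilinear term w = x·y with box bounds on x and y, the four envelope inequalities give the following lower and upper bounds on w (a generic sketch, not the paper's exact formulation):

```python
def mccormick_bounds(x, y, xl, xu, yl, yu):
    """McCormick envelope for the bilinear term w = x * y on the box
    [xl, xu] x [yl, yu].

    Returns (lo, hi): the tightest lower/upper bounds on w that the
    four envelope inequalities permit at the point (x, y).
    """
    lo = max(xl * y + x * yl - xl * yl,
             xu * y + x * yu - xu * yu)
    hi = min(xu * y + x * yl - xu * yl,
             xl * y + x * yu - xl * yu)
    return lo, hi
```

Replacing each bilinear term with a new variable constrained by these linear inequalities turns the problem into an ILP; the envelope is exact at the corners of the box and loosest in its interior, which is why tightening the bounds (more breakpoints) improves the approximation at the cost of complexity.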
IV Heuristic Approach
We decompose the joint optimization into two sub-problems, i.e., user-timeslot allocation and hovering time allocation, corresponding to the optimization of and , respectively. We then solve one sub-problem while the other is fixed.
IV-A User-Timeslot Scheduling
The bilinear terms are resolved with fixed . The number of frames at each cluster is determined by:
and is the hovering duration. The user-timeslot scheduling can be carried out independently in each cluster, and the resulting problem for the -th cluster is formulated in with a given . We denote and as the hovering and communication energy for the -th cluster:
where refers to the number of elapsed frames before the UAV arrives at cluster , which can be calculated by:
The sub-problem is formulated as:
is a multi-choice multi-dimensional knapsack problem (MMKP), which can be solved by a guided local search (GLS)-based heuristic algorithm with high-quality sub-optimal solutions and pseudo-polynomial-time complexity .
IV-B Hovering Time Allocation
To optimize hovering time efficiently, we first investigate the connection between the objective energy and . From Eq. (12) and Eq. (13), increases linearly with while is determined by both and . Next, we show the relationship between the optimum and . For cluster , we denote as the communication energy with the optimal scheduling decision at a given hovering time .
is a non-increasing function of ,
We denote the optimal user scheduling for as . If increases from to , is still feasible for such that
might not necessarily be optimal for . There exists an optimal scheduling resulting in lower communication energy, i.e.,
Thus the conclusion. ∎
Since the existence and the number of extreme points are undetermined, there are three possible cases, i.e., unimodal, multimodal, and monotonic, for , as illustrated in Fig. 3. In case 1, the curve is a unimodal function with only one extreme point. In case 2, the fluctuation of leads to multiple extreme points such that the curve is a multimodal function. In case 3, Eq. (19) cannot hold, e.g., is consistently larger than , so the curve is monotonically increasing with no extreme point.
Observing the possible cases, we employ an efficient golden section search (GSS) to find the extreme points. In GSS, we limit the hovering time to ensure that the total service duration does not exceed , where is a maximal time limit for cluster . Intuitively, the clusters with more demands need more transmission frames. We assume is proportional to the users’ demands:
IV-C Algorithm Summary
We summarize the proposed GSS-based heuristic (GSS-HEU) algorithm in Alg. 1. We denote as the set of channel states of cluster on frame , which is expressed as:
In GSS-HEU, the initial search range of GSS is set as , which is partitioned into three sections by two points and with the golden ratio 0.618 in lines 2-4, where is an operation that rounds a value up to an integer. When a hovering time is searched in GSS, e.g., or , the corresponding user-timeslot allocation is obtained by solving in line 6. In lines 9-13, we compare the objective energy and update the search range. The search process terminates at . The selected hovering time is and the corresponding scheduling scheme is .
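The range-shrinking loop of Alg. 1 can be sketched as a generic golden section search (a continuous version with a fixed iteration budget; in GSS-HEU the objective f would be the total hovering-plus-communication energy returned by solving the per-cluster scheduling sub-problem, and the result would be rounded to an integer number of frames):

```python
GOLDEN = 0.618  # golden ratio used to partition the search range

def gss_min(f, lo, hi, iters=40):
    """Golden section search for a minimizer of a unimodal f on [lo, hi].

    Each iteration shrinks the interval by the factor 0.618, reusing one
    interior evaluation, so f is called only once per iteration.
    """
    a, b = float(lo), float(hi)
    x1 = b - GOLDEN * (b - a)
    x2 = a + GOLDEN * (b - a)
    f1, f2 = f(x1), f(x2)
    for _ in range(iters):
        if f1 <= f2:              # minimum lies in [a, x2]
            b, x2, f2 = x2, x1, f1
            x1 = b - GOLDEN * (b - a)
            f1 = f(x1)
        else:                     # minimum lies in [x1, b]
            a, x1, f1 = x1, x2, f2
            x2 = a + GOLDEN * (b - a)
            f2 = f(x2)
    return (a + b) / 2
```

The reuse of one interior point per step is what gives GSS its logarithmic number of objective evaluations, which matters here because each evaluation requires solving a scheduling sub-problem.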
The complexity of GSS-HEU is , which is much lower than that of the optimal method. However, both the optimal and GSS-HEU approaches may have limitations in fast decision-making. The computational time of both algorithms grows exponentially with the number of users since . In addition, both algorithms need the estimated and complete channel states for the whole task duration, i.e., from to . This may result in difficulties in channel estimation. Therefore, we reconsider from the perspective of DRL to enable the UAV to make decisions intelligently, while the developed optimal and sub-optimal algorithms are used to benchmark the performance of the learning-based solutions.
V Actor-Critic-Based DRL Algorithm
V-A Overview of Actor-Critic-Based DRL (AC-DRL)
In DRL, an agent learns to make decisions by exploring the unknown environment and exploiting the received feedback. At each learning step (in this paper, a learning step is equivalent to a transmission frame), the agent observes the current state and takes an action based on a policy. Then, a reward is fed back to the agent. The policy is updated step by step according to the feedback. Actor-critic is an emerging reinforcement learning method that separates the agent into two parts, an actor and a critic. The actor is responsible for taking actions following a stochastic policy , where
refers to a conditional probability density function. The critic is used to evaluate the decisions via a Q-value, which is given by:
where is a conditional expectation under the policy , and is the cumulative discounted reward with a discount factor , which can be expressed as:
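As a small numerical sketch, the cumulative discounted reward can be evaluated backwards over a reward sequence (the episode is assumed finite here for illustration):

```python
def discounted_return(rewards, gamma=0.95):
    """Cumulative discounted reward R_t = sum_k gamma**k * r_{t+k},
    computed by a single backward pass over the reward sequence."""
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R
```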
However, obtaining the explicit expressions of and is difficult. DRL uses DNNs as parameterized approximators to provide estimations of and . We denote and as the parameter vectors of the actor and the critic, and and as the corresponding parameterized functions (for simplicity, ). The goal of the agent is to minimize the loss function of the actor:
Based on the fundamental results of the policy gradient theorem , the gradient of can be calculated by:
The update rule of can be derived based on gradient descent:
where is the learning rate of the actor. For the critic, the parameter vector is updated based on temporal-difference (TD) learning . In TD learning, the loss function of the critic is defined as the expectation of the squared TD error , i.e., . The TD error refers to the difference between the TD target and the estimated Q-value, which is given by:
where is the TD target. The objective of the critic is to minimize the loss function, and the update rule of can be derived by gradient descent:
where is the learning rate for the critic.
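A minimal sketch of one TD(0) critic update, using a linear function approximator in place of the DNN critic (the linear form, feature vectors, and names are illustrative simplifications):

```python
import numpy as np

def td_critic_step(w, phi_s, phi_next, reward, gamma=0.95, beta=0.01):
    """One TD(0) update of a linear critic V(s) = w . phi(s).

    phi_s / phi_next: feature vectors of the current / next state.
    Returns the updated weights and the TD error delta.
    """
    v, v_next = w @ phi_s, w @ phi_next
    delta = reward + gamma * v_next - v   # TD error: target minus estimate
    w = w + beta * delta * phi_s          # semi-gradient step on the loss
    return w, delta
```

The same structure carries over to the DNN critic, with the feature-vector product replaced by backpropagation through the network.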
brings about a large variance in the gradient, resulting in poor convergence . To address this problem, a V-value is introduced:
Approximating can reduce the variance. With the parameterized V-value , the TD error and the loss function of the critic are expressed as:
provides an unbiased estimation of Q-value. Thus, we can rewrite in Eq. (25) as:
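The resulting actor update, with the TD error serving as the advantage estimate, can be sketched for a softmax policy over a discrete action set (a linear simplification of the DNN actor; all names and shapes below are assumptions):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def actor_step(theta, phi_s, action, delta, alpha=0.01):
    """Policy-gradient step for a softmax policy pi(a|s) with logits
    theta @ phi(s); the TD error delta is the advantage estimate.

    theta: (num_actions, num_features) parameter matrix.
    """
    pi = softmax(theta @ phi_s)
    # Gradient of log pi(action|s): (1[a = action] - pi_a) * phi(s)
    grad_log = -np.outer(pi, phi_s)
    grad_log[action] += phi_s
    return theta + alpha * delta * grad_log
```

A positive TD error raises the probability of the taken action, a negative one lowers it, which is exactly the actor-critic interaction described in the text.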
V-B Problem Reformulation
To apply AC-DRL, we reformulate as an MDP, in which the UAV acts as the agent. We define the states, actions, and rewards as follows.
The system state consists of the channel states of all the clusters on the current frame, i.e., , the undelivered demands, and the currently served cluster on frame . The undelivered demand is the residual data to be delivered for cluster on frame :
where is the delivered data for cluster on frame under the policy . We denote as an indicator representing which cluster the UAV is serving on frame . When the users’ requests in the current cluster are completed, the UAV moves to the next cluster in the next frame; otherwise, it stays at the current cluster. For example, we assume that the UAV is hovering above cluster on frame , i.e., . For the next frame, is obtained by:
When the UAV’s duration exceeds , the UAV will fly back to the dock station. By assembling the above three parts, the state is defined as:
Note that the elements of are modeled as an FSMC. In addition, based on Eq. (33) and Eq. (35), the next states of and depend only on the current state and the current policy. Therefore, the state transition conforms to an MDP .
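The serving-cluster transition rule just described can be sketched as follows (the function name, arguments, and the use of index 0 for the dock station are illustrative assumptions):

```python
def next_cluster(k, residual_demand, frames_elapsed, T, K):
    """Transition of the serving-cluster indicator: stay while the
    current cluster k has residual demand, advance otherwise, and
    return to the dock station (index 0) once the deadline T (in
    frames) is exceeded or all K clusters have been served."""
    if frames_elapsed >= T:
        return 0                      # deadline passed: back to the dock
    if residual_demand > 0:
        return k                      # keep serving the current cluster
    return k + 1 if k < K else 0      # advance, or finish after cluster K
```

Because this transition depends only on the current indicator and the current residual demand, it is consistent with the Markov property claimed above.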
The action of the UAV is the user-timeslot assignment on frame t, which is given by: