I Introduction
Unmanned aerial vehicles (UAVs) have attracted much attention for high-speed data transmission in dynamic, distributed, or plug-and-play scenarios, e.g., disaster rescue, live concerts, or sports events [1]. However, UAVs’ limited endurance, energy supply, and storage are critical issues for their applications, which motivates the study of energy efficiency in UAV-aided communication networks. A UAV’s energy consumption comes from two aspects: propulsion energy for flying and hovering, and communication energy for data transmission. The flying energy mainly depends on the UAV’s velocity and trajectory [2]. The hovering energy is in general proportional to the hovering time. Compared to the propulsion energy, the communication energy consumption is not negligible, e.g., considerable communication energy can be consumed in scenarios with high traffic requests from a large number of users. Thus, joint energy optimization for both parts is necessary and has attracted considerable attention in the literature [6, 3, 4, 5, 7, 8].
The authors in [3, 4] maximized the energy efficiency, defined as the ratio between the transmitted data and the propulsion energy. In [5], the authors introduced a complete UAV energy model and proposed a user-timeslot scheduling method to minimize the sum of the propulsion energy and communication energy. Based on the energy model in [5], the authors of [6] formulated an energy minimization problem with latency constraints via trajectory design. The above works [3, 4, 5, 6] adopted a time division multiple access (TDMA) mode, where the UAV serves one user per timeslot. Besides TDMA, space division multiple access (SDMA) enables simultaneous data transmission to multiple users, such that the hovering time and hovering energy can be reduced. In [7], the authors designed an SDMA-based beamforming scheme to minimize the total transmit power for multi-antenna UAVs. In [8], an energy efficiency maximization problem was investigated in an SDMA-based multi-antenna UAV network via optimizing the flying velocity and power allocation. However, serving multiple users simultaneously may lead to strong inter-user interference and may require more communication energy to fulfill users’ demands.
Deterministic optimization algorithms, e.g., [6, 3, 4, 5, 7, 8], might not be suitable for fast decision making in a dynamic wireless environment. To address this issue, deep learning-based solutions have been investigated in the literature. The authors in [9] applied a deep neural network (DNN) to UAV-enabled hybrid networks to efficiently predict the resource allocation scheme. In [10], a deep learning-based auction algorithm was proposed to determine a dynamic battery charging schedule for UAV-aided systems. Supervised learning, such as DNN-based methods, requires large amounts of training data, whose offline collection is a nontrivial task. Another category of studies is deep reinforcement learning (DRL), which has the following advantages. Firstly, DRL provides timely solutions adapted to environment variations. Secondly, DRL integrates DNNs to make decisions and improve solution quality. Thirdly, a DNN requires an offline data generation and training phase, whereas DRL needs little prior knowledge and is able to train by exploring unknown environments and exploiting received feedback in an online manner. In
[11], the authors applied a deep Q-network (DQN) to design an energy-efficient flying trajectory scheme for UAV-aided networks. In general, DQN is used to deal with a relatively small and discrete action space, where the action space refers to the set of all possible decisions [12]. The authors in [13] designed a different deep Q-learning architecture for a high-dimensional action space, but it needs to evaluate all of the actions before making a decision, which is time-consuming. Deep actor-critic is an emerging DRL method with fast convergence properties and the capability to deal with a large action space [14]. In [15], an actor-critic-based DRL (AC-DRL) algorithm was proposed to reduce the UAV’s energy consumption and enhance the UAV’s coverage of ground users via optimizing the UAV’s flying direction and distance. In [16], the authors employed deep actor-critic methods to design a learning algorithm for UAV-aided systems, considering energy efficiency and users’ fairness. Note that the AC-DRL in [15, 16] was developed for unconstrained problems. However, most of the problems in UAV systems are constrained and contain discrete variables. Conventional AC-DRL algorithms have limitations in tackling constrained combinatorial optimization problems, which may result in slow convergence and infeasible or degraded solutions. The authors in [17] developed an AC-DRL algorithm for a combinatorial optimization problem in a UAV-aided system, but when the size of the action space grows exponentially, the convergence of the algorithm deteriorates.
In this study, we minimize the UAV’s communication and propulsion energy in a downlink UAV-aided communication system. The novelty of the solution development lies in two aspects. Firstly, compared to offline optimization approaches, we provide online learning and timely energy-saving solutions based on DRL. Secondly, unlike conventional DRL methods, the proposed solution is designed to address the challenging issues in constrained combinatorial optimization. The major contributions are summarized as follows:

We formulate an energy minimization problem for an SDMA-enabled UAV communication system, where user-timeslot allocation and the UAV’s hovering time assignment are the coupled optimization tasks. The formulated problem is combinatorial and non-convex with bilinear constraints.

We provide a relax-and-approximate method to approach the optimum. That is, the bilinear terms are addressed by McCormick envelope relaxation; the remaining integer linear programming problem is then solved by branch-and-bound (B&B).

We characterize the interplay among communication energy, hovering time, and hovering energy. Based on the derived analytical results, we develop a golden section search-based heuristic (GSS-HEU) algorithm for benchmarking general instances with lower complexity than the optimal solution.

Aware of the issues in optimal/suboptimal and conventional DRL approaches, we propose an actor-critic-based deep stochastic online scheduling (AC-DSOS) algorithm, where the original problem is transformed into a Markov decision process (MDP). Unlike conventional AC-DRL solutions, in AC-DSOS we design a set of approaches, e.g., stochastic policy quantification, action space reduction, and a feasibility-guaranteed reward function, to specifically address the constrained combinatorial problem.

Simulations demonstrate that the proposed AC-DSOS enables a feasible, fast-converging, and dynamically adaptive solution. The designed approaches are effective in reducing the action space and guaranteeing feasibility. AC-DSOS achieves 29.94% and 52.51% energy reduction compared with a conventional AC-DRL method and a heuristic user scheduling method, respectively, with almost the same computation time.
The rest of the paper is organized as follows. Section II provides the system model and Section III formulates the considered optimization problem. In Section IV, we analyze the relationship between the energy consumption and hovering time, and propose a heuristic algorithm. In Section V, we reformulate the problem as an MDP and develop an ACDSOS algorithm. Numerical results are presented and analyzed in Section VI. Finally, we draw the conclusions in Section VII.
The code for generating the results is available at https://github.com/ArthuretYuan.
II System Model and Problem Formulation
II-A System Model
We consider a downlink UAV-aided communication system. A UAV serves as an aerial base station (BS) to deliver data to ground users, e.g., in scenarios where terrestrial BSs are unavailable or overloaded by high traffic demand from numerous users. We assume that the UAV is equipped with multiple antennas and each ground user has a single antenna [8]. The UAV is fully loaded with data and energy at a dock station before the task starts. The service area is divided into clusters considering the UAV’s limited coverage area. This setup applies to many practical scenarios such as emergency rescue and temporary communication [18, 19]. We denote the set of clusters, and the extended set additionally includes the dock station. The UAV flies through all the clusters successively according to a pre-optimized trajectory, and transmits data to the users by hovering at a given point, e.g., above the cluster’s center. Each cluster contains a set of users, and each user has a data demand (in bits). When all the demands in a cluster are satisfied, the UAV leaves the current cluster and visits the next one. After serving all the clusters, the UAV flies back to the dock station. The process of the UAV from leaving to returning to the dock station is defined as a round, or a task. Fig. 1 illustrates an example of the considered system.
The data stored in the UAV typically has a certain life span [20]. Thus, we consider the transmitted data to be delay-sensitive, and all data delivery must be completed within a maximum number of frames, where the time domain is divided into frames. One frame consists of several timeslots of fixed duration. With SDMA, the UAV can simultaneously transmit data to more than one user in each timeslot. The frame-timeslot structure is shown in Fig. 2, where the shaded blocks indicate that the users are scheduled. We define the users scheduled in a timeslot as a user group, and the union of the possible groups in a cluster accordingly. The maximum number of candidate groups in a cluster increases exponentially with the number of its users [21]. The number and set of the users of each group in a cluster are denoted accordingly.
We consider a quasi-static Rician fading channel, which comprises both a deterministic line-of-sight (LoS) component and a random multipath component [22]. The channel states are static within a transmission frame and vary from one frame to another. The channel vector from the UAV antennas to a ground user consists of a multipath Rician fading vector scaled by the free-space propagation loss between the UAV and the user. We collect all the channel vectors of the users in a group to form a channel matrix. Within a user group, we apply a linear minimum mean square error (MMSE) precoding scheme due to its high efficiency and low computational complexity in mitigating intra-group interference. The precoding vector for a user is calculated by: (1)
where the transmit power for each user in the group, the corresponding column of the channel matrix, and the noise power enter the computation. Note that the transmit power is fixed as a parameter in this work, following practical UAV applications, e.g., a constant transmit power can be selected from 0.1 W to 10 W [23]. The signal-to-interference-plus-noise ratio (SINR) for each user is given by:
(2) 
where the effective channel gains appear in the numerator and denominator. Since the channel states vary over frames, we index the SINR and channel coefficients by the frame. In this work, the time-varying channel is further modeled as a finite-state Markov channel (FSMC). Under the FSMC model, we quantize each channel coefficient into multiple Markov states and obtain a transition probability such that the variations of the coefficients follow a Markov process between frames [24]. If a group is scheduled at a timeslot on a frame, the amount of data transmitted to each user and the consumed communication energy of the group can be expressed by:
(3) 
and
(4) 
where the system bandwidth appears in Eq. (3). Note that, within a frame, we assume a user’s channel condition is identical across all timeslots; thus, the timeslot index is omitted in Eqs. (3) and (4).
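As an illustration of the per-group computation behind Eqs. (1)-(4), the following sketch builds MMSE precoders for a random channel, then evaluates the per-user SINR, the per-timeslot data, and the group's communication energy. All variable names, the regularization term, and the unit-norm normalization are illustrative assumptions, not specifications from the paper.

```python
import numpy as np

def mmse_precoders(H, p, sigma2):
    """MMSE precoding vectors (one per user) for a group channel H (M x K).

    A sketch in the spirit of Eq. (1); the regularization K*sigma2/sum(p)
    and the unit-norm column normalization are common conventions (assumed).
    """
    M, K = H.shape
    A = H @ H.conj().T + (K * sigma2 / np.sum(p)) * np.eye(M)
    F = np.linalg.solve(A, H)            # unnormalized precoders
    F /= np.linalg.norm(F, axis=0)       # unit-norm columns
    return F

def sinr_rate_energy(H, F, p, sigma2, tau, B):
    """Per-user SINR (Eq. (2)), per-timeslot data (Eq. (3)), group energy (Eq. (4))."""
    G = np.abs(H.conj().T @ F) ** 2      # G[k, j] = |h_k^H f_j|^2
    sig = p * np.diag(G)                 # desired-signal power per user
    interf = G @ p - sig                 # intra-group interference
    sinr = sig / (interf + sigma2)
    q = tau * B * np.log2(1.0 + sinr)    # bits delivered per timeslot
    e = tau * np.sum(p)                  # communication energy per timeslot
    return sinr, q, e

rng = np.random.default_rng(0)
M, K = 4, 2                              # UAV antennas, users in the group
H = (rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))) / np.sqrt(2)
p = np.full(K, 1.0)                      # 1 W per user, within the 0.1-10 W range
F = mmse_precoders(H, p, sigma2=0.1)
sinr, q, e = sinr_rate_energy(H, F, p, sigma2=0.1, tau=1e-3, B=1e6)
```

With the fixed transmit powers of this paper, the group energy per timeslot depends only on the scheduled group's power sum, while the delivered bits depend on the precoders and channel realization.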
II-B UAV’s Energy Model
We employ the UAV energy model proposed in [5]. The flying power is formulated as a function of the flying velocity $V$:

(5) $P(V) = P_0\left(1+\frac{3V^2}{U_{\mathrm{tip}}^2}\right) + P_i\left(\sqrt{1+\frac{V^4}{4v_0^4}}-\frac{V^2}{2v_0^2}\right)^{1/2} + \frac{1}{2}d_0\rho s A V^3,$

where

$P_0$: the blade profile power in hovering status;

$P_i$: the induced power in hovering status;

$U_{\mathrm{tip}}$: the tip speed of the rotor blade;

$v_0$: the mean rotor induced velocity;

$d_0$, $s$, $A$: the parameters related to the fuselage drag ratio, rotor solidity, and the rotor disc area;

$\rho$: the air density.
When the UAV approaches the hovering point of each cluster, it flies around the point with a certain velocity $V_h$, which is more energy-efficient than hovering statically [6]. Thus, the hovering power is $P_h = P(V_h)$. The flying energy with constant velocity $V$ and traveling distance $D$ is expressed as:

(6) $E_{\mathrm{fly}}(V) = P(V)\,\frac{D}{V}.$

Hovering energy and communication energy need to be jointly optimized since they are coupled by the hovering time, whereas the optimization of the flying energy is independent. By applying graph-based numerical methods [25], the minimum flying energy along with the optimal flying speed can be obtained by:

(7) $E_{\mathrm{fly}}^{\min} = \min_{V}\; P(V)\,\frac{D}{V},$

where $D$ is the total traveling distance along the pre-optimized flying path.
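The propulsion model above can be evaluated numerically. The sketch below uses rotary-wing parameter values commonly reported alongside the model in [5]; both these values and the grid search (a simple stand-in for the graph-based numerical method of [25]) are illustrative assumptions.

```python
import numpy as np

# Illustrative rotary-wing parameters in the style of the model in [5];
# the numerical values are assumptions, not taken from this paper.
P0, Pi = 79.86, 88.63        # blade profile / induced power in hover (W)
U_tip, v0 = 120.0, 4.03      # rotor tip speed, mean rotor induced velocity (m/s)
d0, rho, s, A = 0.6, 1.225, 0.05, 0.503  # drag ratio, air density, solidity, disc area

def flying_power(V):
    """Propulsion power P(V) of Eq. (5)."""
    blade = P0 * (1.0 + 3.0 * V**2 / U_tip**2)
    induced = Pi * np.sqrt(np.sqrt(1.0 + V**4 / (4 * v0**4)) - V**2 / (2 * v0**2))
    parasite = 0.5 * d0 * rho * s * A * V**3
    return blade + induced + parasite

def min_flying_energy(D, v_grid=np.linspace(0.1, 40.0, 4000)):
    """Grid-search version of Eq. (7): minimize P(V)*D/V over the velocity V."""
    energy = flying_power(v_grid) * D / v_grid
    i = np.argmin(energy)
    return v_grid[i], energy[i]

V_star, E_min = min_flying_energy(D=1000.0)   # 1 km flying path
```

Note that at zero velocity the model reduces to the hovering power $P_0 + P_i$, and the energy-optimal cruise speed is strictly positive because flying faster shortens the travel time.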
The main notations are summarized in Table I.
Notation  Description
number and set of clusters
number of antennas in UAV
number and set of users in cluster
number and set of groups in cluster
number and set of users in group of cluster
demands of user in cluster
maximum number and set of frames in each round
number and set of timeslots in each frame
duration of each timeslot (in seconds)
SINR of user on frame
channel coefficient from user ’s precoding vector to user on frame
transmitted data of user per timeslot on frame
communication energy of group per timeslot on frame
UAV’s flying velocity that minimizes flying energy with a predetermined flying path
minimal flying energy with a predetermined flying path
III Problem Formulation
We denote binary variables as the scheduling indicators, where each indicates whether a user group is assigned to a timeslot on a frame. Another set of binary variables indicates whether the UAV is hovering above a cluster on a frame. The UAV energy consumption consists of flying energy, hovering energy, and communication energy. Since the minimal flying energy can be independently obtained by Eq. (7) without loss of optimality, the objective focuses on the joint optimization of the hovering and communication energy, which are expressed by:(8)
(9) 
Note that the UAV is battery-limited in practice. We focus on instances in which the minimum consumed energy in (10a) is within the UAV’s battery capacity; otherwise, the task is infeasible. The optimization problem is formulated as:
(10a)  
(10b)  
(10c)  
(10d)  
(10e)  
(10f)  
(10g)  
(10h) 
Constraints (10b) guarantee that all the users’ requests are satisfied within the deadline. Constraints (10c) define that the UAV visits clusters in a successive, forward manner. For example, if the UAV is hovering above a cluster on the current frame, then in the next frame the UAV either stays at the current cluster or moves to the next cluster. The option of flying back to previously visited clusters is thus excluded. Note that the UAV takes off from the first cluster. Constraints (10d) represent that all the timeslots of a frame are assigned to user groups when the UAV hovers above the corresponding cluster; otherwise, no users are scheduled in any timeslot. Constraints (10e) and (10f) indicate that no more than one group can be scheduled in a timeslot and only one cluster can be served within a frame. Constraints (10g) and (10h) confine the variables to be binary.
Note that this is a combinatorial optimization problem with a non-convex bilinear objective and constraints. The optimum can be approached by a well-established relax-and-approximate method. That is, the non-convex bilinear terms are relaxed and bounded by McCormick envelopes [26], where each bilinear term is bounded by an upper and a lower envelope. The relaxed problem becomes an integer linear programming (ILP) problem which can be optimally solved by B&B. Overall, the optimum can be approached by ultimately tightening the bounds, e.g., increasing the number of breakpoints in the envelopes, but this results in exponentially increasing complexity which is unaffordable in practice [27]. Thus, we adopt the above relax-and-approximate method to provide an optimal benchmark for small-to-medium cases. For general cases, we propose a suboptimal algorithm in the next section.
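As a minimal illustration of the relaxation step, the McCormick envelope replaces a bilinear term w = x*y by four linear cuts built from the variable bounds. The function below (names are illustrative) returns the resulting lower and upper envelopes at a point; for products of binary variables, as in (10g)-(10h), the envelope is exact at the integer points.

```python
def mccormick_bounds(x, y, xL, xU, yL, yU):
    """Lower/upper linear envelopes of w = x*y from the four McCormick cuts [26]."""
    lower = max(xL * y + x * yL - xL * yL,   # w >= xL*y + x*yL - xL*yL
                xU * y + x * yU - xU * yU)   # w >= xU*y + x*yU - xU*yU
    upper = min(xU * y + x * yL - xU * yL,   # w <= xU*y + x*yL - xU*yL
                xL * y + x * yU - xL * yU)   # w <= xL*y + x*yU - xL*yU
    return lower, upper

# The envelope always contains the true product inside the bounding box.
lo, up = mccormick_bounds(x=0.3, y=0.7, xL=0.0, xU=1.0, yL=0.0, yU=1.0)
assert lo <= 0.3 * 0.7 <= up
```

In the B&B scheme, each bilinear term is replaced by an auxiliary variable constrained by these four inequalities, and tightening the bounds (more breakpoints) shrinks the gap between envelope and product at the cost of more constraints.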
IV Heuristic Approach
We decompose the joint optimization into two subproblems, i.e., user-timeslot allocation and hovering time allocation. We then solve one subproblem while the other is fixed.
IV-A User-Timeslot Scheduling
The bilinear terms are resolved by fixing the hovering durations. The number of frames spent at each cluster is determined by:
(11) 
where the hovering duration is given. The user-timeslot scheduling can then be carried out independently in each cluster, and the resulting subproblem for each cluster is formulated with a given hovering time. We denote the hovering and communication energy for the cluster as:
(12)  
(13) 
where the offset refers to the number of elapsed frames before the UAV arrives at the cluster, which can be calculated by:
(14) 
The subproblem is formulated as:
(15a)  
(15b)  
(15c)  
(15d)  
(15e) 
This subproblem is a multi-choice multi-dimensional knapsack problem (MMKP), which can be solved by a guided local search (GLS)-based heuristic algorithm with high-quality suboptimal solutions and pseudo-polynomial-time complexity [28].
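The GLS solver of [28] is beyond the scope of a short example, but the structure of the subproblem can be illustrated with a greedy stand-in: per timeslot, pick the user group delivering the most still-useful bits per joule. The data values and the greedy rule below are assumptions for illustration only, not the paper's algorithm.

```python
import numpy as np

def greedy_schedule(q, e, demand, T):
    """Greedy stand-in for the GLS-based MMKP solver of [28] (illustrative only).

    q[g][u]: bits delivered to user u when group g is scheduled for one timeslot;
    e[g]: communication energy of group g per timeslot; demand[u]: bits required;
    T: number of available timeslots. Returns (schedule, energy, feasible).
    """
    residual = np.array(demand, dtype=float)
    schedule, energy = [], 0.0
    for _ in range(T):
        if np.all(residual <= 0):
            break
        # useful bits = bits that count toward a still-unmet demand
        useful = [np.minimum(qi, np.maximum(residual, 0.0)).sum() for qi in q]
        scores = [u / ei if u > 0 else -1.0 for u, ei in zip(useful, e)]
        g = int(np.argmax(scores))
        if scores[g] <= 0:
            break
        residual -= q[g]
        schedule.append(g)
        energy += e[g]
    feasible = bool(np.all(residual <= 0))
    return schedule, energy, feasible

q = np.array([[4e5, 0.0], [0.0, 3e5], [2e5, 2e5]])  # bits/timeslot per group
e = np.array([2e-3, 1.5e-3, 2.5e-3])                # J/timeslot per group
sched, E, ok = greedy_schedule(q, e, demand=[4e5, 3e5], T=10)
```

A GLS layer would additionally penalize recently used features of local optima to escape them; the greedy pass above only illustrates the choose-one-group-per-timeslot structure of the MMKP.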
IV-B Hovering Time Allocation
To optimize the hovering time efficiently, we first investigate the connection between the objective energy and the hovering time. From Eq. (12) and Eq. (13), the hovering energy increases linearly with the hovering time, while the communication energy is determined by both the hovering time and the scheduling decision. Next, we show the relationship between the optimum and the hovering time. For each cluster, we denote the communication energy under the optimal scheduling decision at a given hovering time.
Lemma 1.
is a nonincreasing function of ,
(16) 
Proof.
We denote the optimal user scheduling at the original hovering time. If the hovering time increases, that scheduling remains feasible for the enlarged hovering time, such that
(17) 
However, it is not necessarily optimal for the enlarged hovering time. There exists an optimal scheduling resulting in lower communication energy, i.e.,
(18) 
Thus the conclusion. ∎
From Lemma 1, we observe that the optimal communication energy is a nonincreasing function of the hovering time, whereas, based on Eq. (12), the hovering energy increases linearly with it. Thus, an extreme point of the total energy can be obtained when
(19) 
Since the existence and the number of extreme points are undetermined, there are three possible cases for the total energy curve, i.e., unimodal, multimodal, and monotonic, as illustrated in Fig. 3. In case 1, the curve is a unimodal function with only one extreme point. In case 2, the fluctuation of the communication energy leads to multiple extreme points such that the curve is a multimodal function. In case 3, Eq. (19) cannot hold, e.g., one side is consistently larger than the other, so the curve is monotonically increasing with no extreme point.
Observing the possible cases, we employ an efficient golden section search (GSS) to find the extreme points [29]. In GSS, we limit the hovering time to ensure that the total service duration does not exceed the deadline, with a maximal time limit per cluster. Intuitively, clusters with more demands need more transmission frames. We thus assume that each cluster’s time limit is proportional to its users’ demands:
(20) 
IV-C Algorithm Summary
We summarize the proposed GSS-based heuristic (GSS-HEU) algorithm in Alg. 1. We denote the set of channel states of a cluster on a frame, which is expressed as:
(21) 
In GSS-HEU, the initial search range of GSS is partitioned into three sections by two interior points with the golden ratio 0.618 in lines 2-4, where the ceiling operation rounds a value up to an integer. When a hovering time is probed in GSS, the corresponding user-timeslot allocation is obtained by solving the scheduling subproblem in line 6. In lines 9-13, we compare the objective energy values and update the search range. The search process terminates when the range closes, returning the selected hovering time and the corresponding scheduling scheme.
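The search skeleton of GSS-HEU can be sketched as follows, where f(h) stands for the total hovering-plus-communication energy obtained by solving the scheduling subproblem at hovering time h; here it is replaced by a toy unimodal curve (an assumption), matching case 1 in Fig. 3.

```python
import math

def golden_section_search(f, lo, hi):
    """Integer golden section search over the hovering time, as in Alg. 1.

    f is assumed unimodal on [lo, hi] (case 1 in Fig. 3); on multimodal
    curves (case 2) the search returns a local extreme point only.
    """
    phi = 0.618  # golden ratio used to place the two interior probes
    while hi - lo > 2:
        h1 = math.ceil(hi - phi * (hi - lo))  # left probe (rounded up)
        h2 = math.ceil(lo + phi * (hi - lo))  # right probe (rounded up)
        if f(h1) <= f(h2):
            hi = h2   # minimum lies in [lo, h2]
        else:
            lo = h1   # minimum lies in [h1, hi]
    return min(range(lo, hi + 1), key=f)

# Toy stand-in for the energy curve: unimodal with minimum at h = 17.
h_star = golden_section_search(lambda h: (h - 17) ** 2 + 3.0, lo=1, hi=60)
```

Each probe of f in GSS-HEU costs one call to the scheduling subproblem, so the logarithmic number of probes is what makes the heuristic cheaper than evaluating every candidate hovering time.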
The complexity of GSS-HEU is much lower than that of the optimal method. However, both the optimal and GSS-HEU approaches may have limitations for fast decision-making. The computational time of both algorithms grows exponentially with the number of users, since the number of candidate groups does [30]. In addition, both algorithms need the estimated and complete channel states for the whole task duration, which may result in difficulties in channel estimation. Therefore, we reconsider the problem from the perspective of DRL to enable the UAV to make decisions intelligently, while the developed optimal and suboptimal algorithms are used to benchmark the performance of the learning-based solutions.

V Actor-Critic-Based DRL Algorithm
V-A Overview of Actor-Critic-Based DRL (AC-DRL)
In DRL, an agent learns to make decisions by exploring unknown environments and exploiting the received feedback. At each learning step (in this paper, a learning step is equivalent to a transmission frame), the agent observes the current state and takes an action based on a policy. A reward is then fed back to the agent, and the policy is updated step by step according to the feedback. Actor-critic is an emerging reinforcement learning method that separates the agent into two parts, an actor and a critic. The actor is responsible for taking actions following a stochastic policy, given as a conditional probability density function over actions. The critic evaluates the decisions via a Q-value, which is given by:
(22) 
where the expectation is taken under the policy, and the return is the cumulative discounted reward with a discount factor, which can be expressed as:
(23) 
However, obtaining the explicit expressions of the policy and the Q-value is difficult. DRL uses DNNs as parameterized approximators to provide estimations for them. We denote the parameter vectors for the actor and the critic, and the corresponding parameterized functions. The goal of the agent is to minimize the loss function of the actor:(24)
Based on the fundamental results of the policy gradient theorem [12], the gradient of can be calculated by:
(25) 
The update rule of can be derived based on gradient descent:
(26) 
where is the learning rate of the actor. For the critic, the parameter vector is updated based on temporal-difference (TD) learning [12]. In TD learning, the loss function of the critic is defined as the expectation of the squared TD error. The TD error refers to the difference between the TD target and the estimated Q-value, which is given by:
(27) 
where is the TD target. The objective of the critic is to minimize the loss function, and the update rule of the critic parameters can be derived by gradient descent:
(28) 
where is the learning rate for the critic.
However, approximating the Q-value directly brings about a large variance in the gradient, resulting in poor convergence [31]. To address this problem, a V-value is introduced:(29)
Approximating the V-value can reduce the variance. With the parameterized V-value, the TD error and the loss function of the critic are expressed as:
(30) 
and
(31) 
In addition, the TD target provides an unbiased estimation of the Q-value [31]. Thus, we can rewrite the gradient in Eq. (25) as:(32)
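The update rules of Eqs. (26), (28), and (30)-(32) can be sketched with tabular (one-hot) approximators on a toy MDP; the environment, learning rates, and step count below are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA = 3, 2
gamma, lr_actor, lr_critic = 0.9, 0.05, 0.1

theta = np.zeros((nS, nA))   # actor parameters (tabular = one-hot features)
w = np.zeros(nS)             # critic parameters: V(s) = w[s]

def policy(s):
    """Stochastic policy pi(a|s): softmax over the actor logits."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def step(s, a):
    """Toy environment (an assumption): action 0 earns reward 1 in state 0."""
    r = 1.0 if (s == 0 and a == 0) else 0.0
    return r, rng.integers(nS)   # uniformly random next state

s = 0
for _ in range(5000):
    pi = policy(s)
    a = rng.choice(nA, p=pi)
    r, s_next = step(s, a)
    td = r + gamma * w[s_next] - w[s]          # TD error, Eq. (30)
    w[s] += lr_critic * td                     # critic update, Eq. (28)
    grad_log = -pi                             # d log pi(a|s) / d theta[s]
    grad_log[a] += 1.0
    theta[s] += lr_actor * td * grad_log       # actor update, Eqs. (26)/(32)
    s = s_next

# After training, the agent should strongly prefer action 0 in state 0.
```

Replacing the tabular parameters with DNNs and the exact softmax gradient with backpropagation yields the AC-DRL scheme described above; the TD error plays the role of the advantage weighting the policy gradient.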
V-B Problem Reformulation
To apply AC-DRL, we reformulate the problem as an MDP, in which the UAV acts as the agent. We define the states, actions, and rewards as follows.
V-B1 States
The system states consist of the channel states of all the clusters on the current frame, the undelivered demands, and the currently served cluster. The undelivered demand is the residual data to be delivered for a cluster on the current frame:
(33)  
(34) 
where the delivered data for the cluster in the frame depends on the policy. We further use an indicator to represent which cluster the UAV is serving in the frame. When the users’ requests in the current cluster are completed, the UAV moves to the next cluster in the next frame; otherwise, it stays at the current cluster. For example, assume that the UAV is hovering above a cluster on the current frame. For the next frame, the indicator is obtained by:
(35) 
When the task duration exceeds the deadline, the UAV flies back to the dock station. By assembling the above three parts, the state is defined as:
(36) 
Note that the channel elements of the state are modeled as an FSMC. In addition, based on Eq. (33) and Eq. (35), the next values of the residual demands and the serving indicator depend only on the current state and the current policy. Therefore, the state transition conforms to an MDP [12].
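The residual-demand update of Eq. (33) and the cluster-indicator transition of Eq. (35) can be sketched as follows; all names are illustrative assumptions, and the channel part of the state is omitted for brevity.

```python
import numpy as np

def next_state(residual, serving, delivered, K, T_max, t):
    """One transition of the non-channel state components (Eqs. (33), (35)).

    residual[k]: undelivered demand of cluster k; serving: index of the cluster
    the UAV hovers above; delivered: bits sent to the serving cluster this frame.
    Returns the updated residuals and indicator (None = back at the dock).
    """
    residual = residual.copy()
    residual[serving] = max(0.0, residual[serving] - delivered)  # Eq. (33)
    if t + 1 > T_max:                      # deadline reached: return to dock
        return residual, None
    if residual[serving] == 0.0 and serving < K - 1:
        serving += 1                       # demands met: move to next cluster
    return residual, serving               # Eq. (35): stay or move forward

residual = np.array([2e6, 3e6, 1e6])       # bits still owed to each cluster
residual, serving = next_state(residual, serving=0, delivered=2.5e6,
                               K=3, T_max=50, t=4)
```

Because the update uses only the current residuals, indicator, and the action-dependent delivered bits, the transition is Markovian, which is what justifies the MDP reformulation.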
V-B2 Actions
The action of the UAV is the user-timeslot assignment on frame t, which is given by: