I Introduction
With the popularity of computationally-intensive tasks, e.g., smart navigation and augmented reality, people expect to enjoy a more convenient life than ever before. However, current smart devices and user equipments (UEs), due to their small size and limited resources, e.g., computation and battery, may not be able to provide satisfactory Quality of Service (QoS) and Quality of Experience (QoE) when executing such highly demanding tasks.
Mobile edge computing (MEC) has been proposed to move computation resources to the network edge, and it has been shown to greatly enhance UEs' ability to execute computation-hungry tasks [1]. Recently, flying mobile edge computing (FMEC) has been proposed, which goes one step further by considering that the computing resources can be carried by unmanned aerial vehicles (UAVs) [2]. FMEC inherits the merits of UAVs and is expected to provide more flexible, easier and faster computing services than traditional fixed-location MEC infrastructures. However, FMEC also brings several challenges: 1) how to minimize the long-term energy consumption of all UEs by choosing the proper user association (i.e., whether a UE should offload its tasks and, if so, which UAV to offload to, in the case of multiple flying UAVs); 2) how much computation each UAV should allocate to each offloaded UE, considering the limited amount of on-board resources; 3) how to control each UAV's trajectory in real time (namely, its flying direction and distance), especially considering the dynamic environment (i.e., a UAV may take off from different starting points). Traditional approaches like exhaustive search can hardly tackle the above problems, because the decision variable space of FMEC, e.g., the optimal trajectory and resource allocation, is continuous instead of discrete. In [3], the authors propose a quantized dynamic programming algorithm to address the resource allocation problem of MEC. However, the complexity of this approach is very high, as the flying choices of a UAV are nearly infinite (being continuous variables). Moreover, the authors in [4] discretize the UAV trajectory into a sequence of UAV locations to make their proposed problem tractable. Similarly, in [5], the authors assume that the UAV's trajectory can be approximated by discrete variables and then deal with it using traditional convex optimization approaches.
However, the above treatment may decrease the control accuracy of the UAV and is also inflexible. Furthermore, the above contributions only considered the single-UAV case. In practice, one UAV may not have enough resources to serve all the users. If the served area is very large, more than one UAV is normally needed, which undoubtedly increases the decision space and makes it very difficult for traditional convex-optimization-based approaches to obtain the optimal control strategy of each UAV. In [6], Liu et al. propose a deep reinforcement learning based DRL-EC algorithm, which can control the trajectories of multiple UAVs, but they did not consider user association and resource allocation.
Inspired by the challenges mentioned above, in this paper we first propose a Convex optimizAtion based Trajectory control algorithm (CAT) to minimize the energy consumption of all the UEs by jointly optimizing user association, resource allocation and UAV trajectory. Specifically, by applying the block coordinate descent (BCD) method, CAT is divided into two parts, i.e., a subproblem for deciding UAV trajectories and a subproblem for deciding user association and resource allocation. In each iteration, we solve each part separately while keeping the other part fixed, until convergence is achieved.
Next, we propose a deep Reinforcement leArning based Trajectory control algorithm (RAT) to facilitate real-time decision making. In RAT, two deep Q networks (DQNs), i.e., actor and critic networks, are applied, where the actor network is responsible for deciding the flying direction and distance of each UAV, while the critic network is in charge of evaluating the actions generated by the actor network. Then, we propose a low-complexity matching algorithm to decide the user association and resource allocation with the UAVs. We choose the overall energy consumption of all the UEs as the reward of RAT. In addition, we deploy a mini-batch to collect samples from the experience replay buffer by using a Prioritized Experience Replay (PER) scheme.
Different from traditional optimization-based algorithms, which normally need iterations and are susceptible to the initial points, the proposed RAT can adapt to any take-off points of the UAVs and can obtain the solutions very rapidly once the training process has been completed. In other words, if the take-off points of the UAVs are input to RAT, the trajectories of the UAVs are determined with only some simple algebraic calculations, instead of solving the original optimization problem through traditional high-complexity optimization algorithms. This is attributed to the fact that during the training stage, a large number of random UAV take-off points are generated and used to train the networks until they converge. Also, with the help of prioritized experience replay (PER), the convergence speed is increased significantly. RAT can be applied to practical scenarios where the UAVs need to act and fly swiftly, such as battlefields. By inputting the current coordinates as the take-off points to the networks, the trajectories of the UAVs are immediately obtained, and then all the UAVs can take off and fly according to the obtained trajectories. Also, the resource allocation and user association are determined by the proposed low-complexity matching algorithm. This is particularly useful in emergency scenarios (e.g., battlefields, earthquakes, large fires), as fast decision making is crucial in these areas.
In the simulation, we can see that the proposed RAT achieves similar performance to the convex-optimization-based solution CAT. Both have a considerable performance gain over other traditional algorithms. In addition, we can see that during the learning procedure, the proposed RAT is less sensitive to the hyperparameters, i.e., the sizes of the mini-batch and the experience replay buffer, compared to traditional reinforcement learning where PER is not applied.
The remainder of this paper is organized as follows. Section II presents the related work. Section III describes the system model. Section IV introduces the proposed CAT algorithm, whereas Section V gives the proposed RAT algorithm including the preliminaries of DRL. The simulation results are reported in Section VI. Finally, conclusions are given in Section VII.
II Related Work
There are many related works that study UAVs, MEC and DRL separately, but only very few consider them holistically. For UAV-aided wireless communications, several scenarios have been studied, such as relay transmission [7, 8, 9], cellular systems [10], data collection [11, 12, 13, 14], wireless power transfer [15], caching networks [16], and D2D communication [17]. In [18], the authors presented an approach to optimize the altitude of a UAV to guarantee maximum radio coverage on the ground. In [19], the authors presented a fly-hover-and-communicate protocol for a UAV-enabled multi-user communication system. They partitioned the ground terminals into disjoint clusters and deployed the UAV as a flying base station. Then, by jointly optimizing the UAV altitude and antenna beamwidth, they optimized the throughput in UAV-enabled downlink multicasting, downlink broadcasting, and uplink multiple access models. In [4], to maximize the minimum average throughput of covered users in an OFDMA system, the authors proposed an efficient iterative algorithm based on block coordinate descent and convex optimization techniques to optimize the UAV trajectory and resource allocation. Furthermore, UAV trajectory optimization has also been investigated. For instance, in [20], Zeng et al. proposed an efficient design that optimizes the UAV's flight radius and speed for the sake of maximizing the energy efficiency of UAV communication. In order to maximize the minimum throughput of all mobile terminals in cellular networks, Lyu et al. [13] developed a new hybrid network architecture by deploying a UAV as an aerial mobile base station. Different from the single-UAV systems of [18, 19, 4, 20], a multi-UAV enabled wireless communication system serving a group of users was considered in [21]. Also, in [22], resource allocation between communication and computation has been investigated in multi-UAV systems.
In addition, some recent literature has focused on mobile edge computing (MEC), which is considered a promising technology for bringing computing resources to the edge of wireless networks [23], where UEs can benefit from offloading their intensive tasks to MEC servers. In [24], partial computation offloading was studied, in which a computation task can be divided into two parts, one executed locally and the other offloaded to MEC servers. In [25], binary computation offloading was studied, where a computation task is either executed locally or offloaded entirely to MEC servers.
By taking advantage of the mobility of UAVs, UAV-enabled MEC has also been studied in [26, 27]. In [26], the authors minimized the overall mobile energy consumption by jointly optimizing the UAV trajectory and bit allocation, while satisfying the QoS requirements of the offloaded mobile application. In [27], the authors studied UAV-enabled MEC where wireless power transfer technology is applied to power Internet of Things devices and collect data from them.
For most of the above works, optimization theory is mainly applied to obtain optimal and/or suboptimal solutions, e.g., for trajectory design and resource allocation. However, solving such optimization problems normally requires plenty of computational resources and takes much time. To address this problem, DRL has been applied and has attracted much attention recently. In [28], the authors proposed an RL framework that uses a DQN as the function approximator. In addition, two important ingredients, experience replay and a target network, are used to improve the convergence performance. In [29], the authors pointed out that the classical DQN algorithm may suffer from substantial overestimation in some scenarios, and proposed a double Q-learning algorithm. In order to solve control problems with continuous state and action spaces, Lillicrap et al. [30] proposed a policy gradient based algorithm. For the purpose of obtaining faster learning and state-of-the-art performance, in [31], the authors proposed a more robust and scalable approach named prioritized experience replay. Although DRL has achieved remarkable successes in game-playing scenarios, it is still an open research area in UAV-enabled MEC.
III System Model
As shown in Fig. 1, we consider a scenario in which a set of UEs and a set of UAVs form an FMEC platform. For clarity, the main notations used in this paper are listed in Table I.
Notation  Definition
Index of a UE, the number of UEs and the set of UEs, respectively
Index of a UAV, the number of UAVs, the set of UAVs and the set of offloading places, respectively
Index of a time slot, the number of time slots and the set of time slots, respectively
The th UE's task in the th time slot
The data size of the th UE's task in the th time slot
The required CPU cycles of the th UE's task in the th time slot
User association between the th UE and the th place in the th time slot
Maximal horizontal coverage range of the th UAV
Flying direction and flying distance of the th UAV, respectively
Maximal flying distance and flying velocity of the th UAV, respectively
Coordinates of the th UAV in the th time slot
Maximal duration of a time slot
Maximal number of tasks and maximal computation resource that the th UAV possesses, respectively
Coordinates of the th UE
Euclidean distance between the th UE and the th UAV in the th time slot
Channel bandwidth, transmitting power, channel power gain and noise power, respectively
The times for task completion, offloading and execution, respectively
Energy consumption for offloading and local execution, respectively
The sets of UAV trajectories, UAV coordinates, user association and resource allocation, respectively
State, action and reward in the th time slot, respectively
Factors of flying direction and flying distance in the th time slot, respectively
Policy function, Q function and loss function, respectively
Network parameters, TD-error and policy gradient, respectively
We assume that the th UE generates exactly one task in the th time slot, and this continues for the whole horizon of time slots. Then, tasks will be generated for each UE, and one has
(1) 
where denotes the size of data required to be transmitted to a UAV if the UE chooses to offload the task, and denotes the total number of CPU cycles needed to execute this task. Assume that each UE can choose either to offload the task to one of the UAVs or execute the task locally. Then one can have
(2) 
where implies that the th UE decides to offload the task to the th UAV in the th time slot, means that the th UE executes the task itself in the th time slot, and otherwise. Define a new set to represent the possible places where the tasks from UEs can be executed, where one element indicates that the UE conducts its own task locally without offloading.
In addition, we assume that each UE can only be served by at most one UAV or itself, and each task only has one place to execute. Then, it follows
(3) 
III-A UAV Movement
Assume that the th UAV flies at a fixed altitude like [19], and it has a maximal horizontal coverage , which depends on the transmitting angle of antennas and the flying altitude. Also, assume that in the th time slot, the th UAV can fly with direction as
(4) 
and distance as
(5) 
where one can have the maximal flying distance in each time slot as , is the constant flying velocity, is the maximal duration of the time slot. We also denote the coordinate of the th UAV in the th time slot as , where , and is the starting coordinate of the th UAV. Then, the flying time of the th UAV in the th time slot is
(6) 
and one has
(7) 
Also, in each time slot, we assume that each UAV can accept only a limited number of offloaded tasks. Then, one has
(8) 
where is the maximal number of tasks that the th UAV can accept in the th time slot.
III-B Task Execution
If the th UE decides to offload the task to the th UAV in the th time slot, then the Euclidean distance can be written as
(9) 
where is the coordinate of the th UE, and one has
(10) 
where is the maximal horizontal coverage of the th UAV. Then, the uplink data rate is given by
(11) 
where is the bandwidth of each communication channel, is the transmitting power of the th UE, is a channel parameter with value 2.2846, is the channel power gain at the reference distance of 1, and is the noise power. Note that each user is assigned an orthogonal frequency division multiplexing (OFDM) channel, so there is no interference among users.
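Since the inline symbols of (11) were lost in extraction, the following minimal sketch only illustrates the assumed Shannon-rate form; the bandwidth, transmit power, reference gain and noise power below are illustrative placeholders, and only the exponent 2.2846 is taken from the text.

```python
import math

def uplink_rate(bandwidth_hz, tx_power_w, d_m,
                g0=1e-4, theta=2.2846, noise_w=1e-13):
    """Shannon uplink rate (bit/s) for a UE at distance d_m from a UAV.

    g0 plays the role of the channel power gain at the reference
    distance of 1, and theta is the distance exponent (2.2846 in the
    text); all numeric defaults are placeholders, not values from
    the paper.
    """
    snr = tx_power_w * g0 * d_m ** (-theta) / noise_w
    return bandwidth_hz * math.log2(1.0 + snr)

# A closer UAV yields a higher uplink rate, hence cheaper offloading.
r_near = uplink_rate(1e6, 0.1, d_m=50.0)
r_far = uplink_rate(1e6, 0.1, d_m=200.0)
```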
If the th UE decides to offload its task to the th UAV in the th time slot, the total task completion time is given by
(12) 
where is the time to offload the data from the th UE to the th UAV in the th time slot, given by
(13) 
and is the time required to execute the task at the UAV as
(14) 
where is the computation resource that the th UAV can provide to the th UE in the th time slot.
Note that the time needed for returning the results from the UAV back to the UE is ignored, similar to [32]. The overall energy consumption of the th UE for offloading to the th UAV in the th time slot is given by
(15) 
If the UE decides to execute the task locally, the power consumption can be evaluated as , where is the effective switched capacitance, is typically set to 3, and is the computation resource that the th UE applies to execute the task. The overall time for local execution can be given by
(16) 
Thus, the total energy consumption for local execution equals
(17) 
To sum up, the overall energy consumption for task execution is given by
(18) 
and the time to complete the task is expressed as
(19) 
Without loss of generality, we assume that each task has to be completed within the time duration , which is consistent with the maximal flying time in each time slot, given by (7). Then, one has
(20) 
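To make the offload-versus-local trade-off behind (15)-(18) concrete, the following sketch compares offloading energy (transmit power times offloading time) against local-execution energy (dynamic power, with the exponent set to 3 as in the text, sustained for the execution time). All numeric values are illustrative placeholders, not values from the paper.

```python
def offload_energy(tx_power_w, data_bits, rate_bps):
    # Energy spent transmitting the task's input data to a UAV.
    return tx_power_w * data_bits / rate_bps

def local_energy(kappa, f_hz, cycles, nu=3):
    # Power kappa * f^nu sustained for cycles / f seconds of execution.
    return kappa * f_hz ** nu * (cycles / f_hz)

# Illustrative numbers: kappa, f and the task profile are placeholders.
e_off = offload_energy(0.1, data_bits=1e6, rate_bps=5e6)
e_loc = local_energy(1e-28, f_hz=1e9, cycles=1e9)
best = min(e_off, e_loc)  # a UE picks whichever option costs less
```

Here offloading wins (0.02 J versus 0.1 J), which is the kind of comparison the matching algorithm of Section V performs per UE.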
In each time slot, since the computation resource that each UAV can provide is limited, we have
(21) 
where is the maximal computation resource that the th UAV can provide in the th time slot. Next, we show our proposed problem formulation.
III-C Problem Formulation
Denote = , = , = . Then, the energy minimization for all UEs is formulated as
(22a)  
subject to:  
(22b)  
(22c)  
(22d)  
(22e)  
(22f)  
(22g)  
(22h)  
(22i) 
One can see that the above problem is a mixed integer nonlinear program (MINLP), as it includes both integer variables and continuous variables, which is very difficult to solve in general. We first propose a convex optimization based algorithm, CAT, to address it iteratively. Then, we propose a Deep Reinforcement Learning (DRL) based algorithm, RAT, to facilitate fast decision making, which can be applied in dynamic environments. Note that in practice, if the th UE does not generate a task in the th time slot, the corresponding task parameters can be set to zero.
IV Proposed CAT Algorithm
In this section, the convex optimization based CAT algorithm is proposed to solve the above problem. We first define a set of new variables to denote the trajectories of the UAVs, with the corresponding coordinates. Thus, the optimization problem can be reformulated as
(23a)  
(23b)  
(23c) 
where . In order to solve the problem, we divide it into two subproblems and apply the block coordinate descent (BCD) method. Specifically, we first optimize the user association and resource allocation given the UAV trajectory. Then, we optimize the UAV trajectory given the user association and resource allocation. We solve the two subproblems iteratively until convergence is achieved.
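The alternating structure of CAT can be sketched generically as follows. The two block solvers are placeholders standing in for the subproblems of Sections IV-A and IV-B, and the one-dimensional toy objective exists only to make the skeleton runnable; it is not the paper's problem.

```python
def bcd(solve_assoc, solve_traj, q0, max_iter=100, tol=1e-6):
    """Block coordinate descent skeleton mirroring CAT: alternately fix
    the trajectory block q and solve for the association/allocation
    block x, then fix x and re-optimize q, until the objective stops
    improving by more than tol."""
    q, prev = q0, float("inf")
    for _ in range(max_iter):
        x, _ = solve_assoc(q)        # subproblem 1: trajectory fixed
        q, val = solve_traj(x, q)    # subproblem 2: association fixed
        if prev - val < tol:         # convergence check on the objective
            break
        prev = val
    return x, q, val

# Toy instance: minimize (x - q)^2 + (q - 3)^2, each block in closed form.
x, q, v = bcd(
    lambda q: (q, (q - 3) ** 2),
    lambda x, q: ((x + 3) / 2,
                  (x - (x + 3) / 2) ** 2 + ((x + 3) / 2 - 3) ** 2),
    q0=0.0)
```

On this toy problem both blocks converge to the joint minimizer (x = q = 3), illustrating the monotone-descent behavior that the paper relies on for the convergence of CAT.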
IV-A User Association and Resource Allocation
Given the UAV trajectory , the subproblem to decide user association and resource allocation can be formulated as
(24a)  
One can see that (22h) can be written as
(25) 
if the th UE chooses to offload the task, and
(26) 
if the th UE decides to execute the task locally. It is easy to see that the equality holds for both (25) and (26).
IV-B UAV Trajectory Optimization
Given the user association and resource allocation from (27) and removing the constant terms, the problem can be simplified as
(28a)  
(28b) 
It is easy to see that the above optimization problem is nonconvex with respect to the trajectory variables. Next, we introduce a set of auxiliary variables, with which problem (28) can be transformed into
(29a)  
(29b)  
(29c) 
One observes that the left-hand sides of (29b) and (29c) are convex with respect to the trajectory variables, which makes (29b) and (29c) nonconvex constraints. Then, similar to [4, 5], we apply successive convex approximation (SCA) to solve this problem. Specifically, for any given local point, one has the following inequality:
(30)  
where
(31) 
and
(32) 
Then, problem (29) can be written as
(33a)  
(33b)  
(33c) 
The above problem is a convex quadratically constrained quadratic program (QCQP) and it can be solved by a standard Python package CVXPY [34].
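The property exploited by the SCA step above is that the first-order Taylor expansion of a convex function is a global lower bound that is tight at the local point, which is what allows replacing the nonconvex constraints (29b) and (29c) by convex ones at each iteration. A minimal numerical check, using f(x) = x² as a stand-in for the convex terms (the paper's actual expressions were lost in extraction):

```python
def f(x):
    # A convex term of the kind appearing in a nonconvex constraint.
    return x ** 2

def f_lb(x, x0):
    # First-order Taylor expansion of f at the local point x0:
    # f(x0) + f'(x0) * (x - x0), a global lower bound on f by convexity.
    return f(x0) + 2.0 * x0 * (x - x0)

x0 = 2.0
tight = f_lb(x0, x0) == f(x0)                      # bound is tight at x0
below = all(f_lb(x, x0) <= f(x)                    # and global elsewhere
            for x in [-3.0, -1.0, 0.0, 1.5, 4.0])
```

Because the surrogate is a lower bound that touches f at the current iterate, each SCA iteration solves a convex restriction whose optimum remains feasible for the original problem, giving the monotone improvement used in Algorithm 1.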
IV-C Overall Algorithm Design
In this section, the overall convex optimization based CAT algorithm is presented to solve the problem, where we iteratively solve the user association and resource allocation subproblem and the UAV trajectory subproblem until convergence is achieved. The pseudo code of the proposed CAT is described in Algorithm 1.
Discussions: Algorithm 1 needs to be rerun whenever the initial take-off locations of the UAVs change. Moreover, the complexity of Algorithm 1 is high, as the solutions are obtained iteratively and each subproblem involves a huge number of optimization variables, especially when the total number of time slots is large. Hence, Algorithm 1 is not suitable for emergency scenarios (e.g., battlefields, earthquakes, large fires), where fast decision making is highly demanded. This motivates the DRL-based algorithm developed in the following section.
V Proposed RAT Algorithm
To facilitate fast decision making, the DRL-based RAT algorithm is proposed in this section. We first give some preliminaries as follows.
V-A Preliminaries
V-A1 DQN
In standard reinforcement learning, an agent interacts with the environment and selects the optimal action that maximizes the accumulated reward. In [28], the Deep Q Network (DQN) structure developed by Google DeepMind integrates deep neural networks with traditional reinforcement learning. The DQN is used to estimate the well-known Q-value defined as
(34) 
where and denote the state and action, respectively, denotes the expectation, is the reward function in the th time step (or time slot), and is the discount factor. As the objective is to maximize the reward, a widely used policy is the greedy one, where is the parameter of the deep neural network. Then, the DQN can be trained by minimizing the loss function [28]. Also, since deep networks are known to be unstable and very difficult to converge, two effective approaches, i.e., the target network and experience replay, have been introduced in [28]. The target network has the same structure as the original DQN, but its parameters are updated more slowly. The experience replay stores state transition samples, which helps the DQN converge. However, the DQN was originally designed to solve problems with discrete variables. Although one can adapt the DQN to continuous problems by discretizing the action space, this may result in a huge search space that is intractable to deal with.
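A tabular toy sketch may clarify the two stabilizing ingredients: the online table moves toward the bootstrapped Bellman target computed from a slowly-updated target copy. The environment here is a dummy with constant reward, not the FMEC model, and all sizes and rates are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9
Q = np.zeros((n_states, n_actions))
Q_target = Q.copy()  # slowly-updated copy, playing the target network's role

def td_update(s, a, r, s_next, alpha=0.1):
    # Bellman target y = r + gamma * max_a' Q_target(s', a'); the online
    # table moves toward y, mimicking the DQN loss on a single sample.
    y = r + gamma * Q_target[s_next].max()
    Q[s, a] += alpha * (y - Q[s, a])

for _ in range(200):
    s, a = rng.integers(n_states), rng.integers(n_actions)
    td_update(s, a, r=1.0, s_next=rng.integers(n_states))
    Q_target = 0.99 * Q_target + 0.01 * Q  # slow target update

# With constant reward 1, the values approach (without exceeding)
# the fixed point 1 / (1 - gamma) = 10.
```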
V-A2 DDPG
To deal with problems with continuous variables, e.g., the trajectory control of UAVs, one may apply the actor-critic approach, which was developed in [35]. DeepMind proposed the deep deterministic policy gradient (DDPG) approach [30] by integrating the actor-critic approach into DRL. DDPG includes two DQNs: the actor network, which generates an action for a given state, and the critic network, which produces the Q-value evaluating the action generated by the actor network. In order to improve the learning stability, two additional target networks corresponding to the actor and critic networks, with their respective parameters, are also applied.
Then, the critic network can be updated with the loss function, , as
(35) 
where in each time step we obtain samples constituting a mini-batch from the experience replay buffer, and is the temporal difference (TD) error [36], which is given by
(36) 
On the other hand, the actor network can be updated by applying the policy gradient [30], which is described as
(37)  
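The shape of the critic update in (35)-(36) can be sketched with a hand-rolled linear function approximator. The actor and target networks are stubbed out (the target weights are kept frozen), so this only illustrates how the TD-error drives the critic parameters; it is not the paper's network architecture.

```python
import numpy as np

# Linear critic Q(s, a) = w . phi(s, a) trained on the TD-error
# delta = r + gamma * Q_target(s', a') - Q(s, a).
gamma, lr = 0.9, 0.05
w = np.zeros(3)
w_target = w.copy()  # frozen here for simplicity

def phi(s, a):
    # A tiny hand-picked feature map (bias, state, action).
    return np.array([1.0, s, a])

def critic_step(s, a, r, s_next, a_next):
    global w
    delta = r + gamma * w_target @ phi(s_next, a_next) - w @ phi(s, a)
    w = w + lr * delta * phi(s, a)  # gradient step on 0.5 * delta^2
    return delta

# Repeating one fixed transition: the TD-error should shrink as the
# critic fits the (stationary) target.
deltas = [abs(critic_step(s=0.5, a=0.2, r=1.0, s_next=0.5, a_next=0.2))
          for _ in range(300)]
```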
V-B The RAT Algorithm
In this section, we introduce the DRL-based RAT algorithm, which includes the deep neural networks (i.e., actor and critic networks) and the matching algorithm. In order to apply DRL, we first define the state, action and reward as follows:

State : the set of the coordinates of all UAVs.

Action : the set of the actions of all UAVs, including the flying direction and flying distance of each UAV. Since the absolute value of the activation function is applied, the output of the DQN is confined to a bounded interval. Thus, the flying direction and distance are reformulated as

(38)

and

(39)

Then, the action set can be defined accordingly.

Reward : defined as the minus of the overall energy consumption of all the UEs in each time slot, as

(40)
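Assuming the network outputs for direction and distance lie in [0, 1] (the exact scaling in (38) and (39) was lost in extraction, so this form is an assumption), the mapping from a raw action to the next UAV position can be sketched as:

```python
import math

def scale_action(raw_dir, raw_dist, d_max):
    """Map raw network outputs in [0, 1] to a flying direction in
    [0, 2*pi) and a flying distance in [0, d_max]."""
    angle = 2.0 * math.pi * raw_dir
    dist = d_max * raw_dist
    return angle, dist

def next_position(x, y, raw_dir, raw_dist, d_max):
    # Advance the UAV's horizontal coordinate by the scaled action.
    angle, dist = scale_action(raw_dir, raw_dist, d_max)
    return x + dist * math.cos(angle), y + dist * math.sin(angle)

# A UAV at the origin flying "east" at full distance d_max = 30.
nx, ny = next_position(0.0, 0.0, raw_dir=0.0, raw_dist=1.0, d_max=30.0)
```

By construction the per-slot displacement never exceeds d_max, so the maximal-flying-distance constraint (5) is satisfied for any network output.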
The algorithm framework used in this paper is depicted in Fig. 2, where an agent, which could be deployed in the central control center of the base station, interacts with the environment. The actor network is applied to generate the action, which includes the flying direction and distance for each UAV. The critic network is used to obtain the Q-value of the action (i.e., to evaluate the actions generated by the actor network). In each time slot, the agent generates the actions for all the UAVs (including moving direction and distance). Then, each UE tries to associate with one UAV in its coverage, i.e., satisfying (10), by using the matching algorithm in Algorithm 3. More specifically, each UE tries to connect to the UAV with the least offloading energy. If the minimum offloading energy is larger than the energy of local execution, the UE decides to conduct the task locally. Note that RAT uses the same optimization strategy for resource allocation as CAT.
Also, each UAV selects UEs based on the following criteria: 1) a UE should be in its coverage area; 2) a UE with a smaller resource requirement will be given higher priority in offloading to this UAV. We introduce the details of the proposed matching algorithm in Algorithm 3. After the matching algorithm runs, the reward in (40) can be obtained.
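A greedy sketch of the matching logic described above (UE side: cheapest beneficial UAV, otherwise local execution; UAV side: admit in-range UEs with smaller demand first, up to capacity). The data layout below is an assumption for illustration, not the paper's Algorithm 3.

```python
def match(ues, uavs, capacity):
    """ues: list of dicts with keys
         "local_e"   - local execution energy,
         "offload_e" - {uav_id: offloading energy} for in-range UAVs,
         "req"       - required computation resource.
    Returns {ue_index: uav_id or None (local execution)}."""
    prefs = {u: [] for u in uavs}
    assoc = {}
    for i, ue in enumerate(ues):
        if ue["offload_e"]:
            best = min(ue["offload_e"], key=ue["offload_e"].get)
            if ue["offload_e"][best] < ue["local_e"]:
                prefs[best].append(i)     # offloading is beneficial
                continue
        assoc[i] = None                   # fall back to local execution
    for u in uavs:
        prefs[u].sort(key=lambda i: ues[i]["req"])  # smaller demand first
        for rank, i in enumerate(prefs[u]):
            assoc[i] = u if rank < capacity else None
    return assoc

ues = [{"local_e": 5.0, "offload_e": {0: 1.0}, "req": 2.0},
       {"local_e": 5.0, "offload_e": {0: 2.0}, "req": 1.0},
       {"local_e": 5.0, "offload_e": {0: 3.0}, "req": 3.0},
       {"local_e": 0.5, "offload_e": {0: 1.0}, "req": 1.0}]
assoc = match(ues, uavs=[0], capacity=2)
```

With capacity 2, the UAV admits the two beneficial UEs with the smallest demand; the third is pushed back to local execution, and the last UE prefers local execution from the start.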
We assume that there is an experience replay buffer for the agent to store the experiences. Once the experience replay buffer is full, the learning procedure starts. A mini-batch of a given size can then be sampled from the experience replay buffer to train the networks.
In the classical DRL algorithms, such as Q-learning [37], SARSA [38] and DDPG [30], the mini-batch uniformly samples experiences from the experience replay buffer. However, since the TD-error in (36) is used to update the Q network, an experience with high TD-error often indicates a successful attempt. Therefore, a better way to select experiences is to assign different weights to the samples. Schaul et al. [31] developed a prioritized experience replay scheme, in which the absolute TD-error is used to evaluate the sampling priority of each experience. Then, the probability of sampling the th experience can be given by

(41)

where a small positive constant is added to avoid the edge case of transitions never being revisited when the TD-error is 0, and a factor determines the degree of prioritization [31].
However, frequently sampling experiences with high TD-error can cause divergence and oscillation. To tackle this issue, the importance-sampling weight [39] is introduced to represent the importance of each sampled experience, which can be given by

(42)

where is the size of the experience replay buffer and the exponent is set to 0.4 [31]. Thus, the loss function in (35) can be updated as
(43) 
which is used in our proposed RAT to train the networks. Next, we describe the pseudo code of the overall RAT framework in Algorithm 2.
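The PER sampling rule in (41)-(42) can be sketched as follows, with the prioritization factor set to an assumed 0.6 (its value is elided in the text) and the importance-sampling exponent set to 0.4 as stated:

```python
import numpy as np

def per_sample(td_errors, batch_size, alpha=0.6, beta=0.4,
               eps=1e-3, rng=None):
    """Prioritized sampling a la Schaul et al.:
    p_i = (|delta_i| + eps)^alpha, P(i) = p_i / sum_j p_j, and
    importance-sampling weights w_i = (N * P(i))^(-beta), normalized
    by their maximum for stability."""
    rng = rng or np.random.default_rng(0)
    p = (np.abs(td_errors) + eps) ** alpha
    probs = p / p.sum()
    idx = rng.choice(len(td_errors), size=batch_size, p=probs)
    weights = (len(td_errors) * probs[idx]) ** (-beta)
    return idx, weights / weights.max()

# One "surprising" transition dominates the sampling distribution.
td = np.array([0.01, 0.01, 5.0, 0.01])
idx, w = per_sample(td, batch_size=64)
```

The high-TD-error transition is drawn far more often than the others, while its importance-sampling weight is correspondingly smaller, which is exactly the bias correction applied in the loss (43).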
We first initialize the actor, critic, two target networks and the experience replay buffer in Lines 1-3. At each epoch, the take-off points of all UAVs are randomly generated within the square area of the UEs. We add a random noise to the action, where the noise follows a zero-mean normal distribution whose spread is initially set to 3 and decays with a rate of 0.9995 in each time step. In Lines 8-11, each UAV flies according to the generated action and enters the next state. Then, we obtain the user association by using Algorithm 3. Next, the reward is obtained according to (40) (i.e., Line 13). The experience is also stored in the replay buffer. When the buffer is full, the mini-batch samples experiences by applying prioritized experience replay (i.e., Lines 16-19). Then, we update the actor and critic networks by using the loss function in (43) and the policy gradient in (37), respectively. Finally, we update the target networks (i.e., Line 22) as

(44)
and
(45) 
where is the updating rate.
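The soft target update in (44)-(45) amounts to Polyak averaging with a small updating rate, so the target parameters drift toward the online parameters without ever jumping. A minimal sketch, with an assumed rate of 0.01:

```python
import numpy as np

def soft_update(target, online, tau=0.01):
    # theta_target <- tau * theta_online + (1 - tau) * theta_target,
    # applied element-wise to the parameter vector.
    return tau * online + (1.0 - tau) * target

theta = np.array([1.0, -2.0])   # online parameters (held fixed here)
theta_t = np.zeros(2)           # target parameters
for _ in range(500):
    theta_t = soft_update(theta_t, theta)
# After many steps the target closely tracks the online parameters.
```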
Next, we introduce the low-complexity matching algorithm, which decides the user association and resource allocation given the UAVs' trajectories, as shown in Algorithm 3. First, we denote a vector of appropriate size to record the user association between UEs and UAVs: if an entry equals a UAV index, the th UE matches with that UAV, and otherwise the th UE is not matched yet and has to execute its task locally. In addition, we denote a preference list for the th UAV to record the UEs that can benefit from offloading. Then, from Line 2 to Line 10, we generate the preference list for the th UAV. Precisely, if constraint (10) is met, we obtain the corresponding energy terms according to (17), (15) and (25), respectively. UEs that benefit from offloading are stored in the preference list. Since UAVs wish to accept as many UEs as possible, we sort the preference list in ascending order with respect to the required resource, as shown in Line 11. The UE that consumes less