With the growing popularity of computationally intensive tasks, e.g., smart navigation and augmented reality, people expect to enjoy a more convenient life than ever before. However, current smart devices and user equipments (UEs), due to their small size and limited resources, e.g., computation capability and battery capacity, may not be able to provide satisfactory Quality of Service (QoS) and Quality of Experience (QoE) when executing those highly demanding tasks.
Mobile edge computing (MEC) has been proposed to move computation resources to the network edge, and it has been shown to greatly enhance the UE’s ability to execute computation-hungry tasks . Recently, flying mobile edge computing (F-MEC) has been proposed, which goes one step further by considering that the computing resource can be carried by unmanned aerial vehicles (UAVs) . F-MEC inherits the merits of UAVs and is expected to provide more flexible, easier and faster computing services than traditional fixed-location MEC infrastructures. However, F-MEC also brings several challenges: 1) how to minimize the long-term energy consumption of all UEs by choosing a proper user association (i.e., whether a UE should offload its tasks and, if so, which UAV to offload to in the case of multiple flying UAVs); 2) how much computation resource each UAV should allocate to each offloading UE, considering the limited amount of on-board resources; 3) how to control each UAV’s trajectory in real time (namely, flying direction and distance), especially in a dynamic environment (i.e., the UAVs may take off from different starting points). Traditional approaches such as exhaustive search can hardly tackle the above problems because the decision variable space of F-MEC, e.g., for the trajectory and resource allocation, is continuous instead of discrete. In , the authors propose a quantized dynamic programming algorithm to address the resource allocation problem of MEC. However, the complexity of this approach is very high, as the number of flying choices of a UAV is nearly infinite (the variables being continuous). Moreover, the authors in  discretize the UAV trajectory into a sequence of UAV locations to make their proposed problem tractable. Similarly, in , the authors assume that the UAV’s trajectory can be approximated by discrete variables and then deal with it using traditional convex optimization approaches. 
However, the above treatment may decrease the control accuracy of the UAV and is also inflexible. Furthermore, the above contributions only consider the single-UAV case. In practice, one UAV may not have enough resources to serve all the users. If the served area is very large, more than one UAV is normally needed, which enlarges the decision space and makes it very difficult for traditional convex optimization based approaches to obtain the optimal control strategy of each UAV. In , Liu et al. propose a deep reinforcement learning based DRL-EC algorithm, which can control the trajectories of multiple UAVs but does not consider user association and resource allocation.
Motivated by the challenges mentioned above, in this paper we first propose a Convex optimizAtion based Trajectory control algorithm (CAT) to minimize the energy consumption of all the UEs by jointly optimizing user association, resource allocation and UAV trajectory. Specifically, by applying the block coordinate descent (BCD) method, CAT is divided into two parts, i.e., a subproblem for deciding the UAV trajectories and a subproblem for deciding user association and resource allocation. In each iteration, we solve each part separately while keeping the other part fixed, until convergence is achieved.
Next, we propose a deep Reinforcement leArning based Trajectory control algorithm (RAT) to facilitate real-time decision making. In RAT, two deep Q networks (DQNs), i.e., actor and critic networks, are applied, where the actor network is responsible for deciding the flying direction and distance of each UAV, while the critic network is in charge of evaluating the actions generated by the actor network. Then, we propose a low-complexity matching algorithm to decide the user association and resource allocation with the UAVs. We choose the negative of the overall energy consumption of all the UEs as the reward of RAT. In addition, a mini-batch of samples is drawn from the experience replay buffer by using a Prioritized Experience Replay (PER) scheme.
Different from traditional optimization based algorithms, which normally need iterations and are sensitive to the initial points, the proposed RAT can adapt to any take-off points of the UAVs and can obtain solutions very rapidly once the training process has been completed. In other words, if the take-off points of the UAVs are input to RAT, the trajectories of the UAVs are determined with only some simple algebraic calculations, instead of solving the original optimization problem through traditional high-complexity optimization algorithms. This is because, during the training stage, a large number of random take-off points of the UAVs are generated and used to train the networks until they converge. Also, with the help of prioritized experience replay (PER), the convergence speed is increased significantly. RAT can be applied to practical scenarios where the UAVs need to act and fly swiftly, such as battlefields. By inputting the current coordinates as the take-off points to the networks, the trajectories of the UAVs are immediately obtained, and all the UAVs can then take off and fly according to the obtained trajectories. Also, the resource allocation and user association are determined by the proposed low-complexity matching algorithm. This is particularly useful in emergency scenarios (e.g., battlefields, earthquakes, large fires), where fast decision making is crucial.
In the simulations, we show that the proposed RAT achieves performance similar to that of the convex optimization based solution CAT. Both have a considerable performance gain over other traditional algorithms. In addition, during the learning procedure, the proposed RAT is less sensitive to the hyperparameters, i.e., the sizes of the mini-batch and the experience replay buffer, than traditional reinforcement learning in which PER is not applied.
The remainder of this paper is organized as follows. Section II presents the related work. Section III describes the system model. Section IV introduces the proposed CAT algorithm, whereas Section V gives the proposed RAT algorithm including the preliminaries of DRL. The simulation results are reported in Section VI. Finally, conclusions are given in Section VII.
II Related Work
There are many related works that study UAV, MEC and DRL separately, but only very few consider them holistically. For UAV aided wireless communications, several scenarios have been studied, such as relay transmissions [7, 8, 9], cellular system , data collection [11, 12, 13, 14], wireless power transfer , caching networks , and D2D communication . In , the authors presented an approach to optimize the altitude of the UAV to guarantee the maximum radio coverage on the ground. In , the authors presented a fly-hover-and-communicate protocol in a UAV-enabled multiuser communication system. They partitioned the ground terminals into disjoint clusters and deployed the UAV as a flying base station. Then, by jointly optimizing the UAV altitude and antenna beamwidth, they optimized the throughput in UAV-enabled downlink multicasting, downlink broadcasting, and uplink multiple access models. In , to maximize the minimum average throughput of covered users in an OFDMA system, the authors proposed an efficient iterative algorithm based on block coordinate descent and convex optimization techniques to optimize the UAV trajectory and resource allocation. Furthermore, UAV trajectory optimization has also been investigated. For instance, in , Zeng et al. proposed an efficient design that optimizes the UAV’s flight radius and speed to maximize the energy efficiency of UAV communication. In order to maximize the minimum throughput of all mobile terminals in cellular networks, Lyu et al.  developed a new hybrid network architecture by deploying a UAV as an aerial mobile base station. Different from [18, 19, 4, 20], which consider a single-UAV system, a multi-UAV enabled wireless communication system was considered to serve a group of users in . Also, in , resource allocation between communication and computation has been investigated in multi-UAV systems.
In addition, some recent literature has focused on mobile edge computing (MEC), which is considered to be a promising technology for bringing computing resources to the edge of wireless networks , where UEs can benefit from offloading their intensive tasks to MEC servers. In , partial computation offloading was studied. The computation tasks can be divided into two parts, where one part is executed locally and the other part is offloaded to MEC servers. In , binary computation offloading was studied, where the computation tasks can either be executed locally or offloaded to MEC servers.
By taking advantage of the mobility of UAVs, UAV-enabled MEC has also been studied in [26, 27]. In , the authors minimized the overall mobile energy consumption by jointly optimizing UAV trajectory and bit allocation, while satisfying the QoS requirements of the offloaded mobile application. In , the authors studied UAV-enabled MEC, where wireless power transfer technology is applied to power the Internet of Things devices and collect data from them.
For most of the above works, optimization theory is mainly applied in order to obtain optimal and/or suboptimal solutions, e.g., for trajectory design and resource allocation. However, solving such optimization problems normally requires plenty of computational resources and takes much time. To address this problem, DRL has been applied and has attracted much attention recently. In , the authors proposed an RL framework that uses a DQN as the function approximator. In addition, two important ingredients, i.e., experience replay and the target network, are used to improve the convergence performance. In , the authors pointed out that the classical DQN algorithm may suffer from substantial overestimations in some scenarios, and proposed a double Q-learning algorithm. In order to solve control problems with continuous state and action spaces, Lillicrap et al.  proposed a policy gradient based algorithm. For the purpose of obtaining faster learning and state-of-the-art performance, in , the authors proposed a more robust and scalable approach named prioritized experience replay. Although DRL has achieved remarkable successes in game-playing scenarios, it is still an open research area in UAV-enabled MEC.
III System Model
As shown in Fig. 1, we consider a scenario in which there are UEs with the set denoted as ) and UAVs with the set denoted as ), which form an F-MEC platform. For clarity, the main notations used in this paper are listed in Table I.
|Index of a UE, the number of UEs and the set of UEs, respectively|
|Index of a UAV, the number of UAVs, the set of UAVs and the set of offloading places, respectively|
|Index of a timeslot, the number of timeslots and the set of timeslots, respectively|
|The task of the -th UE in the -th time slot|
|The data size of the task of the -th UE in the -th time slot|
|The required CPU cycles of the task of the -th UE in the -th time slot|
|User association between -th UE and -th place in -th timeslot|
|Maximal horizontal coverage range of -th UAV|
|Flying direction and flying distance of -th UAV, respectively|
|Maximal flying distance and flying velocity of -th UAV, respectively|
|Coordinates of -th UAV in -th timeslot|
|Maximal duration of timeslot|
|Maximal number of tasks and maximal computation resource that -th UAV possesses, respectively|
|Coordinates of -th UE|
|Euclidean distance between -th UE and -th UAV in -th timeslot|
|Channel bandwidth, transmitting power, channel power gain and noise power, respectively|
|The time for task completion and offloading, and executing, respectively|
|Energy consumption for offloading and local execution, respectively|
|The set of UAV trajectory, UAV coordinates, user association and resource allocation, respectively|
|State, action and reward in -th timeslot, respectively|
|Factor of flying direction and flying distance in -th timeslot, respectively|
|Policy function, Q function and loss function, respectively|
|Network parameter, TD-error and policy gradient, respectively|
We assume that the -th UE constantly generates one task in the -th time slot and lasting for time slots. Then, tasks will be generated for each UE and one has and
where denotes the size of data required to be transmitted to a UAV if the UE chooses to offload the task, and denotes the total number of CPU cycles needed to execute this task. Assume that each UE can choose either to offload the task to one of the UAVs or execute the task locally. Then one can have
where , implies that the -th UE decides to offload the task to the -th UAV in the -th time slot, while , means that the -th UE executes the task itself in the -th time slot, and otherwise, . Define a new set to represent the possible place where the tasks from UEs can be executed, where indicates that UE conducts its own task locally without offloading.
In addition, we assume that each UE can only be served by at most one UAV or itself, and each task only has one place to execute. Then, it follows
III-A UAV Movement
Assume that the -th UAV flies at a fixed altitude like , and it has a maximal horizontal coverage , which depends on the transmitting angle of antennas and the flying altitude. Also, assume that in the -th time slot, the -th UAV can fly with direction as
and distance as
where one can have the maximal flying distance in each time slot as , is the constant flying velocity, is the maximal duration of the time slot. We also denote the coordinate of the -th UAV in the -th time slot as , where , and is the starting coordinate of the -th UAV. Then, the flying time of the -th UAV in the -th time slot is
and one has
Also, in each time slot, we assume that each UAV can accept the limited amount of offloaded tasks. Then, one has
where is the maximal number of tasks that the -th UAV can accept in the -th time slot.
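The movement model described above can be illustrated with a minimal Python sketch. It assumes the usual planar update in which the new horizontal position equals the old position plus the flying distance multiplied by the cosine and sine of the flying direction; this update rule and all names (`move_uav`, `velocity`, `slot_duration`) are illustrative assumptions and not the paper's notation.

```python
import math


def move_uav(x, y, direction, distance, velocity, slot_duration):
    """Advance one UAV by one time slot in the horizontal plane.

    direction is an angle in radians, distance is in metres (hypothetical units).
    """
    max_distance = velocity * slot_duration      # farthest reach within one slot
    if not 0.0 <= distance <= max_distance:
        raise ValueError("distance exceeds what the UAV can fly in one slot")
    new_x = x + distance * math.cos(direction)   # assumed planar kinematics
    new_y = y + distance * math.sin(direction)
    flying_time = distance / velocity            # cannot exceed slot_duration
    return new_x, new_y, flying_time
```

Because the flying distance is capped by the velocity multiplied by the slot duration, the returned flying time never exceeds the duration of the time slot.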
III-B Task Execution
If the -th UE decides to offload the task to the -th UAV in the -th time slot, then the Euclidean distance between them can be written as
where is the coordinate of the -th UE, and it has
where is the maximal horizontal coverage of the -th UAV. Then, the uplink data rate is given by
where is the bandwidth for each communication channel; is the transmitting power of the -th UE; = with 2.2846; is the channel power gain at the reference distance 1 and is the noise power. Note that we consider each user applies orthogonal frequency division multiplexing (OFDM) channel and there is no interference among them.
If the -th UE decides to offload its task to the -th UAV in the -th time slot, the total task completion time is given by
where is the time to offload the data from the -th UE to the -th UAV in the -th time slot, given by
and is the time required to execute the task at the UAV as
where is the computation resource that the -th UAV can provide to the -th UE in the -th time slot.
Note that the time needed for returning the results back to UE from UAV is ignored, similar to . The overall energy consumption of the -th UE to the -th UAV in the -th time slot is given by
If the UE decides to execute the task locally, the power consumption can be evaluated as , where is the effective switched capacitance, is typically set to 3, and is the computation resource that the -th UE applies to execute the task. The overall time for local execution can be given by
Thus, the total energy consumption for local execution equals
To sum up, the overall energy consumption for task execution is given by
and the time to complete the task is expressed as
Without loss of generality, we assume that each task has to be completed within the time duration , which is consistent with the maximal flying time in each time slot, given by (7). Then, one has
In each time slot, since the computation resource that each UAV can provide is limited, we have
where is the maximal computation resource that the -th UAV can provide in the -th time slot. Next, we show our proposed problem formulation.
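The cost model of this subsection can be sketched in a few lines of Python. The sketch assumes a Shannon-type uplink rate with a generic channel gain supplied by the caller, an offloading time equal to the data size divided by the rate, an execution time equal to the CPU cycles divided by the allocated computation resource, and a UE-side offloading energy equal to the transmit power multiplied by the offloading time. All function and parameter names are hypothetical.

```python
import math


def uplink_rate(bandwidth_hz, tx_power_w, channel_gain, noise_power_w):
    """Shannon-type rate in bit/s; the channel gain model is left to the caller."""
    return bandwidth_hz * math.log2(1.0 + tx_power_w * channel_gain / noise_power_w)


def offloading_cost(data_bits, cpu_cycles, rate_bps, uav_cpu_hz, tx_power_w):
    """Return (completion time, UE energy) when the task is offloaded."""
    t_offload = data_bits / rate_bps       # time to upload the task data
    t_execute = cpu_cycles / uav_cpu_hz    # time to run the task on the UAV
    energy = tx_power_w * t_offload        # assumption: UE only spends transmit energy
    return t_offload + t_execute, energy


def local_cost(cpu_cycles, ue_cpu_hz, kappa, nu=3.0):
    """Return (completion time, UE energy) when the task runs locally."""
    power = kappa * ue_cpu_hz ** nu        # effective switched capacitance model
    t_local = cpu_cycles / ue_cpu_hz
    return t_local, power * t_local


def is_feasible(completion_time, slot_duration):
    """Each task has to finish within the duration of the time slot."""
    return completion_time <= slot_duration
```

With the exponent set to 3, the local energy grows with the square of the CPU frequency for a fixed number of cycles, which is why offloading can save energy even though it costs transmit power.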
III-C Problem Formulation
Denote = , = , = . Then, the energy minimization for all UEs is formulated as
One can see that the above problem is a mixed integer nonlinear programming (MINLP) problem, as it includes both integer variable, and continuous variables, and , and it is very difficult to solve in general. We first propose a convex optimization based algorithm CAT to address it iteratively. Then, we propose a Deep Reinforcement Learning (DRL) based RAT to facilitate fast decision making, which can be applied in a dynamic environment. Note that in practice, if the -th UE does not generate a task in the -th time slot, the corresponding and can be set to zero.
IV Proposed CAT Algorithm
In this section, a convex optimization based CAT is proposed to solve the above problem . We first define a set of new variables to denote the trajectories of UAVs as = , where the coordinates are , and . Thus, the optimization problem can be reformulated as
where . In order to solve , we divide it into two subproblems and apply the block coordinate descent (BCD) method to address it. To this end, we first optimize the user association and resource allocation given the UAV trajectory . Then, we optimize the UAV trajectory given the user association and resource allocation . We solve the two optimization problems iteratively, until the convergence is achieved.
IV-A User Association and Resource Allocation
Given the UAV trajectory , the subproblem to decide user association and resource allocation can be formulated as
One can see that (22h) can be written as
if the -th UE chooses to offload the task, and
IV-B UAV Trajectory Optimization
Given the user association and resource allocation from (27) and removing the constant, can be simplified as
It is easy to see that the above optimization problem is non-convex with respect to . Next, we introduce a set , where , then, problem (28) can be transformed into
One observes that the functions involved in (29b) and (29c) are convex with respect to , respectively, which makes (29b) and (29c) non-convex constraints. Then, similar to [4, 5], we apply the successive convex approximation (SCA) to solve this problem. Specifically, for any given local point in , one can have the following inequality as
Then, problem (29) can be written as
The above problem is a convex quadratically constrained quadratic program (QCQP) and it can be solved by a standard Python package CVXPY .
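The key step of SCA is that a convex function is globally lower-bounded by its first-order Taylor expansion at any local point, so replacing the convex term by this affine bound yields a convex restriction of the original constraint. The following Python sketch checks this property for a squared horizontal distance, which is used here only as a stand-in for the functions appearing in the constraints; the function names and the choice of the squared distance are assumptions.

```python
def sq_dist(q, w):
    """Squared Euclidean distance between 2-D points q and w (a convex function of q)."""
    return (q[0] - w[0]) ** 2 + (q[1] - w[1]) ** 2


def sca_lower_bound(q, q0, w):
    """First-order Taylor expansion of sq_dist(., w) around the local point q0.

    The result is affine in q and never exceeds sq_dist(q, w).
    """
    grad = (2.0 * (q0[0] - w[0]), 2.0 * (q0[1] - w[1]))   # gradient at q0
    return sq_dist(q0, w) + grad[0] * (q[0] - q0[0]) + grad[1] * (q[1] - q0[1])
```

The gap between the function and its bound equals the squared distance between `q` and `q0`, so the bound is tight at the local point and becomes looser as the new trajectory point moves away from it. This is why the local point is refreshed in every SCA iteration.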
IV-C Overall Algorithm Design
In this section, the convex optimization based CAT algorithm is proposed to solve Problem , where we solve the user association and resource allocation subproblem and the UAV trajectory subproblem alternately until convergence is achieved. We describe the pseudo code of the proposed CAT in Algorithm 1.
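The alternating structure of block coordinate descent can be illustrated on a toy problem. In the Python sketch below, a jointly convex two-variable quadratic replaces the actual energy minimization problem, and the two closed-form block updates stand in for the two subproblems (which in CAT would be solved by a convex solver). The objective, the update formulas and the stopping rule are illustrative assumptions only.

```python
def objective(x, y):
    """Toy jointly convex objective standing in for the total UE energy."""
    return (x - 1.0) ** 2 + (y + 2.0) ** 2 + x * y


def block_coordinate_descent(x=0.0, y=0.0, tol=1e-10, max_iters=1000):
    """Alternate exact minimization over x and y until the objective stalls."""
    prev = objective(x, y)
    for iteration in range(1, max_iters + 1):
        x = 1.0 - y / 2.0          # minimizer over x with y fixed
        y = -2.0 - x / 2.0         # minimizer over y with x fixed
        cur = objective(x, y)
        if prev - cur < tol:       # objective is non-increasing by construction
            return x, y, iteration
        prev = cur
    return x, y, max_iters
```

Each block update can only decrease the objective, so the sequence of objective values is monotone and bounded below, which is the standard argument for the convergence of this kind of alternating scheme.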
Discussions: Algorithm 1 needs to be rerun whenever the initial take-off locations of the UAVs change. Moreover, the complexity of Algorithm 1 is high, as the solutions are obtained iteratively and each subproblem involves a huge number of optimization variables, especially when the total number of time slots is large. Hence, Algorithm 1 is not suitable for emergency scenarios (e.g., battlefields, earthquakes, large fires), where fast decision making is required. This motivates the DRL-based algorithm developed in the following section.
V Proposed RAT Algorithm
To facilitate the fast decision making, the DRL-based RAT algorithm is proposed in this section. We first give some preliminaries as follows.
In standard reinforcement learning, an agent is assumed to interact with the environment and select the action that maximizes the accumulated reward. In , a Deep Q Network (DQN) structure, developed by Google DeepMind, integrates deep neural networks with traditional reinforcement learning. The DQN is used to estimate the well-known Q-value defined as
where and denote the state and action respectively, denotes the expectation, whereas is a reward and is the discount factor and is a reward function in the -th time step (or time slot). As the objective is to maximize the reward, a widely used policy is , where is the parameter of the deep neural network. Then, the DQN can be trained by minimizing the loss function . Also, since deep networks are known to be unstable and very difficult to converge, two effective approaches, i.e., the target network and experience replay, have been introduced in . The target network has the same structure as the original DQN, but its parameters are updated more slowly. The experience replay stores the state transition samples, which helps the DQN converge. However, the DQN was originally designed to solve problems with discrete variables. Although we can adapt the DQN to continuous problems by discretizing the action space, this may unfortunately result in a huge search space that is intractable to deal with.
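For reference, the loss commonly used to train a DQN is the squared difference between a bootstrapped target and the current Q-value estimate. The Python sketch below uses the standard one-step target (reward plus the discounted maximum Q-value of the next state, the latter typically taken from the target network). It is written for a single transition with plain lists and is a generic illustration; the names are hypothetical and it is not the exact loss of this paper.

```python
def td_target(reward, next_q_values, gamma, done=False):
    """One-step bootstrapped target: r + gamma * max_a' Q_target(s', a')."""
    if done:
        return reward                      # no bootstrap at the end of an episode
    return reward + gamma * max(next_q_values)


def dqn_loss(q_sa, reward, next_q_values, gamma, done=False):
    """Squared TD-error for one transition (s, a, r, s')."""
    return (td_target(reward, next_q_values, gamma, done) - q_sa) ** 2
```

The `max` over `next_q_values` requires enumerating all actions, which is exactly what becomes impractical when the action space is continuous or finely discretized.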
To deal with problems with continuous variables, e.g., the trajectory control of UAVs, one may apply the actor-critic approach, which was developed in . DeepMind has proposed a deep deterministic policy gradient (DDPG) approach  by integrating the actor-critic approach into DRL. DDPG includes two DQNs. One of them, named the actor network, with function , is applied to generate the action for a given state . The other, named the critic network, with function , is used to generate the Q-value, which evaluates the action produced by the actor network. In order to improve the learning stability, two adjacent target networks corresponding to the actor and critic networks, , with respective parameters , , are also applied.
Then, the critic network can be updated with the loss function, , as
where in each time step we obtain samples constituting mini-batch from the experience replay buffer, and is the temporal difference (TD)-error  which is given by
On the other hand, the actor network can be updated by applying the policy gradient, which is described as .
V-B The RAT Algorithm
In this section, we introduce the DRL based RAT algorithm, which includes deep neural networks (i.e., actor and critic networks) and the matching algorithms. In order to apply the DRL, we first define the state, action and reward as follows:
State : , is the set of the coordinates of all UAVs.
Action : is the set of the actions of all UAVs, including the flying direction and distance . Since the absolute operation of is used as the activation function, the output value of the DQN is within the interval. Thus, the flying direction and distance are reformulated as and , where
Then, the action set can be defined as .
Reward : is defined as the negative of the overall energy consumption of all the UEs in each time slot as
The algorithm framework used in this paper is depicted in Fig. 2, where an agent, which could be deployed in the central control center in the base station, is assumed to interact with the environment. An actor network is applied to generate the action, which includes the flying direction and distance of each UAV. The critic network is used to obtain the Q value of the action (i.e., to evaluate the actions generated by the actor network). In each time slot, the agent generates the actions for all the UAVs (including moving direction and distance). Then, each UE tries to associate with one UAV whose coverage it lies in, i.e., satisfying (10), by using the matching algorithm in Algorithm 3. More specifically, each UE tries to connect to the UAV that requires the least offloading energy. If the minimum offloading energy is larger than the energy of local execution, the UE decides to execute the task locally. Note that RAT has the same optimization strategy for resource allocation as CAT.
Also, each UAV selects the UEs based on the following criteria: 1) the UE should be in its coverage area; 2) a UE with a smaller resource requirement, i.e., will be given higher priority in offloading to this UAV. We will introduce the details of the proposed matching algorithm in Algorithm 3. After the matching algorithm, the reward in (40) can be obtained.
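One decision step of the framework described above (scaling the actor outputs, matching UEs to UAVs, and computing the reward) can be sketched in Python as follows. The sketch is a simplified single-round greedy matching and not the paper's Algorithm 3: each UE requests the covering UAV with the least offloading energy if that beats local execution, and each UAV accepts its requests in ascending order of resource requirement up to its task capacity. The scaling of the direction factor by 2π is an assumption, and all names are hypothetical.

```python
import math


def scale_action(direction_factor, distance_factor, max_distance):
    """Map two actor outputs in [0, 1] to a flying direction and distance."""
    return 2.0 * math.pi * direction_factor, max_distance * distance_factor


def greedy_match(offload_energy, local_energy, in_coverage, demand, capacity):
    """Return assoc, where assoc[i] is the UAV index of UE i or None (local)."""
    n_ue, n_uav = len(local_energy), len(capacity)
    requests = [[] for _ in range(n_uav)]
    for i in range(n_ue):
        # UAVs that cover UE i and save energy compared with local execution
        candidates = [j for j in range(n_uav)
                      if in_coverage[i][j] and offload_energy[i][j] < local_energy[i]]
        if candidates:
            best = min(candidates, key=lambda j: offload_energy[i][j])
            requests[best].append(i)
    assoc = [None] * n_ue
    for j in range(n_uav):
        # smaller resource requirement first, limited by the task capacity
        for i in sorted(requests[j], key=lambda i: demand[i])[:capacity[j]]:
            assoc[i] = j
    return assoc


def slot_reward(assoc, offload_energy, local_energy):
    """Negative of the total UE energy in one time slot."""
    total = 0.0
    for i, j in enumerate(assoc):
        total += local_energy[i] if j is None else offload_energy[i][j]
    return -total
```

In this simplified version a UE that is rejected by its preferred UAV falls back to local execution rather than trying its next-best UAV.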
We assume that there is an experience replay buffer for the agent to store the experience . Once the experience replay buffer is full, the learning procedure starts. A mini-batch with size can be obtained from the experience replay buffer to train the networks.
In the classical DRL algorithms, such as Q-learning , SARSA  and DDPG , the mini-batch uniformly samples experiences from the experience replay buffer. However, since the TD-error in (36) is used to update the Q value network, an experience with a high TD-error often indicates a successful attempt. Therefore, a better way to select the experience is to assign different weights to samples. Schaul et al.  developed a prioritized experience replay scheme, in which the absolute TD-error is used to evaluate the probability of the sampled -th experience from the mini-batch. Then, the probability of sampling the -th experience can be given by
where , is a positive constant to avoid the edge-case of transitions not being revisited if is 0, is denoted as a factor to determine the prioritization .
However, frequently sampling experiences with high can cause divergence and oscillation. To tackle this issue, the importance-sampling weight  is introduced to represent the importance of sampled experience, which can be given by
which is used in our proposed RAT to train the networks. Next, we describe the pseudo code of the overall RAT framework in Algorithm 2.
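The proportional variant of prioritized experience replay can be sketched in plain Python as follows. The sketch follows the commonly used form in which the priority of an experience is its absolute TD-error plus a small positive constant, the sampling probability is the priority raised to a prioritization exponent and normalized, and the importance-sampling weights are normalized by their maximum. The default values of `alpha`, `beta` and `eps`, the weighted loss, and all function names are illustrative assumptions.

```python
def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Sampling probabilities proportional to (|TD-error| + eps) ** alpha."""
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def importance_weights(probs, beta=0.4):
    """Importance-sampling weights (1 / (N * P)) ** beta, scaled to at most 1."""
    n = len(probs)
    raw = [(1.0 / (n * p)) ** beta for p in probs]
    max_w = max(raw)
    return [w / max_w for w in raw]


def weighted_td_loss(td_errors, weights):
    """Mini-batch critic loss with each squared TD-error scaled by its weight."""
    return sum(w * d * d for w, d in zip(weights, td_errors)) / len(td_errors)
```

Setting `alpha` to zero recovers uniform sampling, while a larger `beta` compensates more strongly for the bias introduced by sampling high-error experiences more often.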
We first initialize the actor, critic, two target networks, and the experience replay buffer in Lines 1-3. At each epoch, the take-off points of all UAVs are randomly generated in the square area of UEs. We add a random noise to the action, where follows a normal distribution with mean and variance, is set to 3 and decays with a rate of 0.9995 in each time step. In Lines 8-11, each UAV flies according to the generated action and enters the next state . Then, we obtain the user association by using Algorithm 3. Next, the reward is obtained according to (40) (i.e., Line 13). The experience is also stored in the replay buffer . When is full, the mini-batch samples experiences by applying the prioritized experience replay (i.e., Lines 16-19). Then, we update the critic and actor networks by using the loss function in (43) and the policy gradient in (37), respectively. Finally, we update the target networks by using the following equations (i.e., Line 22)
where is the updating rate.
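The target network update and the decaying exploration noise can be illustrated with the short Python sketch below. It assumes the usual soft (Polyak) update in which each target parameter moves a small step, given by the updating rate, towards the corresponding online parameter, and it models the noise parameter as starting at 3 and being multiplied by 0.9995 in each time step. Parameters are represented as flat lists of floats, and all names are hypothetical.

```python
def soft_update(target_params, online_params, tau):
    """Return new target parameters: tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * w_t
            for w_t, w in zip(target_params, online_params)]


def decayed_noise_scale(initial_scale, decay_rate, steps):
    """Noise parameter after a number of time steps of multiplicative decay."""
    return initial_scale * decay_rate ** steps
```

With a decay rate of 0.9995, the noise parameter halves roughly every 1386 time steps, so exploration dominates early in training and fades as the networks converge.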
Next, we introduce the low-complexity matching algorithm which can decide the user association and resource allocation given UAVs’ trajectory, as shown in Algorithm 3. First, we denote with size to record the user association between UEs and UAVs. If , it means the -th UE matches with the -th UAV, and if , it denotes that the -th UE is not matched yet and has to execute its task locally. In addition, we denote a preference list for the -th UAV to record UEs that can benefit from offloading. Then, from Line 2 to 10, we generate the preference list for the -th UAV. Precisely, if constraint (10) is met, we obtain , and according to (17), (15), and (25), respectively. UEs that benefit from offloading will be stored in . Since UAVs wish to accept as many UEs as possible, we sort the preference list with ascending order with respect to , as shown in Line 11. The UE that consumes less