Deep Reinforcement Learning Based Dynamic Trajectory Control for UAV-assisted Mobile Edge Computing

11/10/2019, by Liang Wang et al.

In this paper, we consider a platform of flying mobile edge computing (F-MEC), where unmanned aerial vehicles (UAVs) serve as equipment providing computation resource, and they enable task offloading from user equipment (UE). We aim to minimize energy consumption of all the UEs via optimizing the user association, resource allocation and the trajectory of UAVs. To this end, we first propose a Convex optimizAtion based Trajectory control algorithm (CAT), which solves the problem in an iterative way by using block coordinate descent (BCD) method. Then, to make the real-time decision while taking into account the dynamics of the environment (i.e., UAV may take off from different locations), we propose a deep Reinforcement leArning based Trajectory control algorithm (RAT). In RAT, we apply the Prioritized Experience Replay (PER) to improve the convergence of the training procedure. Different from the convex optimization based algorithm which may be susceptible to the initial points and requires iterations, RAT can be adapted to any taking off points of the UAVs and can obtain the solution more rapidly than CAT once training process has been completed. Simulation results show that the proposed CAT and RAT achieve the similar performance and both outperform traditional algorithms.


I Introduction

With the popularity of computationally intensive tasks, e.g., smart navigation and augmented reality, people expect to enjoy a more convenient life than ever before. However, current smart devices and user equipments (UEs), due to their small size and limited resources, e.g., computation and battery, may not be able to provide satisfactory Quality of Service (QoS) and Quality of Experience (QoE) when executing such highly demanding tasks.

Mobile edge computing (MEC) has been proposed to move computation resources to the network edge, and it has been shown to greatly enhance a UE's ability to execute computation-hungry tasks [1]. Recently, flying mobile edge computing (F-MEC) has been proposed, which goes one step further by considering that the computing resources can be carried by unmanned aerial vehicles (UAVs) [2]. F-MEC inherits the merits of UAVs and is expected to provide more flexible, easier and faster computing services than traditional fixed-location MEC infrastructures. However, F-MEC also brings several challenges: 1) how to minimize the long-term energy consumption of all UEs by choosing proper user associations (i.e., whether a UE should offload its tasks and, if so, which UAV to offload to in the case of multiple flying UAVs); 2) how much computation resource each UAV should allocate to each offloaded UE, considering the limited amount of on-board resources; 3) how to control each UAV's trajectory in real time (namely, flying direction and distance), especially in a dynamic environment (i.e., the UAVs may take off from different starting points). Traditional approaches such as exhaustive search can hardly tackle the above problems, because the decision variable space of F-MEC, e.g., the optimal trajectory and resource allocation, is continuous rather than discrete. In [3], the authors propose a quantized dynamic programming algorithm to address the resource allocation problem of MEC. However, the complexity of this approach is very high, as the flying choices of a UAV are nearly infinite (they are continuous variables). Moreover, the authors in [4] discretize the UAV trajectory into a sequence of UAV locations to make their proposed problem tractable. Similarly, in [5], the authors assume that the UAV's trajectory can be approximated by discrete variables and then deal with it using traditional convex optimization approaches. However, this treatment may decrease the control accuracy of the UAV and is also inflexible. Furthermore, the above contributions only consider a single-UAV case. In practice, one UAV may not have enough resources to serve all the users. If the served area is very large, more than one UAV is normally needed, which will undoubtedly increase the decision space and make it very difficult for traditional convex optimization based approaches to obtain the optimal control strategy of each UAV. In [6], Liu et al. propose a deep reinforcement learning based algorithm, DRL-EC, which can control the trajectories of multiple UAVs but does not consider user association and resource allocation.

Inspired by the challenges mentioned above, in this paper we first propose a Convex optimizAtion based Trajectory control algorithm (CAT) to minimize the energy consumption of all the UEs by jointly optimizing user association, resource allocation and UAV trajectories. Specifically, by applying the block coordinate descent (BCD) method, CAT is divided into two parts, i.e., a subproblem that decides the UAV trajectories and a subproblem that decides the user association and resource allocation. In each iteration, we solve one part while keeping the other fixed, until convergence is achieved.

Next, we propose a deep Reinforcement leArning based Trajectory control algorithm (RAT) to facilitate real-time decision making. In RAT, two deep neural networks, i.e., the actor and critic networks, are applied, where the actor network is responsible for deciding the flying direction and distance of each UAV, while the critic network is in charge of evaluating the actions generated by the actor network. We then propose a low-complexity matching algorithm to decide the user association and resource allocation with the UAVs. The reward of RAT is defined based on the overall energy consumption of all the UEs. In addition, mini-batches of samples are drawn from the experience replay buffer using a Prioritized Experience Replay (PER) scheme.

Different from traditional optimization based algorithms, which normally need iterations and are susceptible to the initial points, the proposed RAT can adapt to any take-off points of the UAVs and can obtain the solutions very rapidly once the training process has been completed. In other words, if the take-off points of the UAVs are input to RAT, the trajectories of the UAVs are determined with only some simple algebraic calculations, instead of solving the original optimization problem through traditional high-complexity optimization algorithms. This is attributed to the fact that during the training stage, a large number of random take-off points of the UAVs are generated and used to train the networks until they converge. Also, with the help of prioritized experience replay (PER), the convergence speed is increased significantly. RAT can be applied to practical scenarios where the UAVs need to act and fly swiftly, such as battlefields. By inputting the current coordinates as the take-off points to the networks, the trajectories of the UAVs are immediately obtained and all the UAVs can then take off and fly according to the obtained trajectories. Also, the resource allocation and user association are determined by the proposed low-complexity matching algorithm. This is particularly useful in emergency scenarios (e.g., battlefields, earthquakes, large fires), where fast decision making is crucial.

In the simulations, we can see that the proposed RAT achieves similar performance to the convex optimization based CAT, and both have a considerable performance gain over traditional algorithms. In addition, during the learning procedure the proposed RAT is less sensitive to the hyperparameters, i.e., the sizes of the mini-batch and the experience replay buffer, compared to traditional reinforcement learning where PER is not applied.

The remainder of this paper is organized as follows. Section II presents the related work. Section III describes the system model. Section IV introduces the proposed CAT algorithm, whereas Section V gives the proposed RAT algorithm including the preliminaries of DRL. The simulation results are reported in Section VI. Finally, conclusions are given in Section VII.

II Related Work

There are many related works that study UAVs, MEC and DRL separately, but only very few consider them holistically. For UAV aided wireless communications, several scenarios have been studied, such as relay transmission [7, 8, 9], cellular systems [10], data collection [11, 12, 13, 14], wireless power transfer [15], caching networks [16], and D2D communication [17]. In [18], the authors presented an approach to optimize the altitude of the UAV to guarantee maximum radio coverage on the ground. In [19], the authors presented a fly-hover-and-communicate protocol in a UAV-enabled multiuser communication system. They partitioned the ground terminals into disjoint clusters and deployed the UAV as a flying base station. Then, by jointly optimizing the UAV altitude and antenna beamwidth, they optimized the throughput in UAV-enabled downlink multicasting, downlink broadcasting, and uplink multiple access models. In [4], to maximize the minimum average throughput of covered users in an OFDMA system, the authors proposed an efficient iterative algorithm based on block coordinate descent and convex optimization techniques to optimize the UAV trajectory and resource allocation. Furthermore, UAV trajectory optimization has also been investigated. For instance, in [20], Zeng et al. proposed an efficient design by optimizing the UAV's flight radius and speed for the sake of maximizing the energy efficiency of UAV communication. In order to maximize the minimum throughput of all mobile terminals in cellular networks, Lyu et al. [13] developed a new hybrid network architecture by deploying the UAV as an aerial mobile base station. Different from [18, 19, 4, 20], which consider a single-UAV system, a multi-UAV enabled wireless communication system was considered to serve a group of users in [21]. Also, in [22], resource allocation between communication and computation was investigated in multi-UAV systems.

In addition, some recent works have focused on mobile edge computing (MEC), which is considered a promising technology for bringing computing resources to the edge of wireless networks [23], where UEs can benefit from offloading their intensive tasks to MEC servers. In [24], partial computation offloading was studied, where a computation task can be divided into two parts, one executed locally and the other offloaded to MEC servers. In [25], binary computation offloading was studied, where a computation task can either be executed locally or offloaded to MEC servers.

By taking advantage of the mobility of UAVs, UAV-enabled MEC has also been studied in [26, 27]. In [26], the authors minimized the overall mobile energy consumption by jointly optimizing the UAV trajectory and bit allocation, while satisfying the QoS requirements of the offloaded mobile application. In [27], the authors studied UAV-enabled MEC where wireless power transfer technology is applied to power Internet of Things devices and collect data from them.

In most of the above works, optimization techniques are mainly applied to obtain optimal and/or suboptimal solutions, e.g., for trajectory design and resource allocation. However, solving such optimization problems normally requires considerable computational resources and takes much time. To address this problem, DRL has been applied and has attracted much attention recently. In [28], the authors proposed an RL framework that uses a DQN as the function approximator. In addition, two important ingredients, experience replay and the target network, are used to improve convergence. In [29], the authors pointed out that the classical DQN algorithm may suffer from substantial overestimation in some scenarios, and proposed a double Q-learning algorithm. In order to solve control problems with continuous state and action spaces, Lillicrap et al. [30] proposed a policy gradient based algorithm. For the purpose of faster learning and state-of-the-art performance, the authors in [31] proposed a more robust and scalable approach named prioritized experience replay. Although DRL has achieved remarkable successes in game-playing scenarios, it remains an open research area in UAV-enabled MEC.

III System Model

As shown in Fig. 1, we consider a scenario in which a set of UEs and a set of UAVs form an F-MEC platform. For clarity, the main notations used in this paper are listed in Table I.

Fig. 1: Multi-UAV enabled F-MEC architecture.
Notation Definition
Index of a UE, the number of UEs and the set of UEs, respectively
Index of a UAV, the number of UAVs, the set of UAVs and the set of offloading places, respectively
Index of a time slot, the number of time slots and the set of time slots, respectively
The -th UE's task in the -th time slot
The data size of the -th UE's task in the -th time slot
The required CPU cycles of the -th UE's task in the -th time slot
User association between the -th UE and the -th place in the -th time slot
Maximal horizontal coverage range of the -th UAV
Flying direction and flying distance of the -th UAV, respectively
Maximal flying distance and flying velocity of the -th UAV, respectively
Coordinates of the -th UAV in the -th time slot
Maximal duration of a time slot
Maximal number of tasks and maximal computation resource that the -th UAV possesses, respectively
Coordinates of the -th UE
Euclidean distance between the -th UE and the -th UAV in the -th time slot
Channel bandwidth, transmitting power, channel power gain and noise power, respectively
The time for task completion, offloading and execution, respectively
Energy consumption for offloading and for local execution, respectively
The sets of UAV trajectories, UAV coordinates, user association and resource allocation, respectively
State, action and reward in the -th time slot, respectively
Factors of flying direction and flying distance in the -th time slot, respectively

Policy function, Q function and loss function, respectively

Network parameters, TD-error and policy gradient, respectively
TABLE I: Main Notations.

We assume that the -th UE generates one task in the -th time slot, and this continues over the whole set of time slots, so that one task per time slot is generated for each UE. One has

(1)

where denotes the size of the data required to be transmitted to a UAV if the UE chooses to offload the task, and denotes the total number of CPU cycles needed to execute this task. We assume that each UE can choose either to offload the task to one of the UAVs or to execute it locally. Then, one has

(2)

where the first case implies that the -th UE decides to offload the task to the -th UAV in the -th time slot, the second case means that the -th UE executes the task itself in the -th time slot, and the variable is zero otherwise. We define a new set to represent the possible places where the tasks from the UEs can be executed, in which one element indicates that the UE conducts its own task locally without offloading.

In addition, we assume that each UE can only be served by at most one UAV or itself, and each task only has one place to execute. Then, it follows

(3)

III-A UAV Movement

Assume that the -th UAV flies at a fixed altitude, as in [19], and has a maximal horizontal coverage range, which depends on the transmitting angle of its antennas and the flying altitude. Also, assume that in the -th time slot, the -th UAV can fly with direction as

(4)

and distance as

(5)

where the maximal flying distance in each time slot is the product of the constant flying velocity and the maximal duration of the time slot. We also denote the coordinate of the -th UAV in the -th time slot, with the initial coordinate being the starting point of the -th UAV. Then, the flying time of the -th UAV in the -th time slot is

(6)

and one has

(7)

Also, in each time slot, we assume that each UAV can only accept a limited number of offloaded tasks. Then, one has

(8)

where is the maximal number of tasks that the -th UAV can accept in the -th time slot.

III-B Task Execution

If the -th UE decides to offload the task to the -th UAV in the -th time slot, then the Euclidean distance between them can be written as

(9)

where is the coordinate of the -th UE, and one has

(10)

where is the maximal horizontal coverage of the -th UAV. Then, the uplink data rate is given by

(11)

where the parameters are the bandwidth of each communication channel, the transmitting power of the -th UE, a channel-related constant set to 2.2846, the channel power gain at the reference distance of 1 m, and the noise power. Note that each user is assigned an orthogonal frequency division multiplexing (OFDM) channel, so there is no interference among them.
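Since the symbols in (11) are not reproduced above, the following block restates, for reference, a generic Shannon-type uplink rate of the form the text describes; the symbols used here (B for bandwidth, p_k for transmit power, g_{k,m}[n] for the distance-dependent channel power gain, sigma^2 for the noise power) are illustrative placeholders rather than the paper's exact notation.

r_{k,m}[n] = B \log_2\!\big(1 + p_k \, g_{k,m}[n] / \sigma^2 \big)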

If the -th UE decides to offload its task to the -th UAV in the -th time slot, the total task completion time is given by

(12)

where is the time to offload the data from the -th UE to the -th UAV in the -th time slot, given by

(13)

and is the time required to execute the task at the UAV as

(14)

where is the computation resource that the -th UAV can provide to the -th UE in the -th time slot.

Note that the time needed for returning the results from the UAV back to the UE is ignored, similar to [32]. The overall energy consumption of the -th UE when offloading to the -th UAV in the -th time slot is given by

(15)

If the UE decides to execute the task locally, the power consumption can be evaluated by the standard local-computing model, which involves the effective switched capacitance, an exponent typically set to 3, and the computation resource that the -th UE applies to execute the task. The overall time for local execution is given by

(16)

Thus, the total energy consumption for local execution equals

(17)
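For readability, the widely used local-computing model that this passage refers to can be sketched as follows; the symbols (kappa_k for the effective switched capacitance, f_k^{loc} for the local computation resource, F_k[n] for the required CPU cycles) are placeholders rather than the paper's notation.

p_k^{\mathrm{loc}} = \kappa_k \big(f_k^{\mathrm{loc}}\big)^{3}, \qquad
T_k^{\mathrm{loc}}[n] = F_k[n] / f_k^{\mathrm{loc}}, \qquad
E_k^{\mathrm{loc}}[n] = p_k^{\mathrm{loc}} \, T_k^{\mathrm{loc}}[n] = \kappa_k \big(f_k^{\mathrm{loc}}\big)^{2} F_k[n]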

To sum up, the overall energy consumption for task execution is given by

(18)

and the time to complete the task is expressed as

(19)

Without loss of generality, we assume that each task has to be completed within the maximal time duration of a slot, which is consistent with the maximal flying time in each time slot given by (7). Then, one has

(20)

In each time slot, since the computation resource that each UAV can provide is limited, we have

(21)

where is the maximal computation resource that the -th UAV can provide in the -th time slot. Next, we show our proposed problem formulation.

III-C Problem Formulation

Denote the sets of user association, resource allocation and UAV trajectory variables accordingly. Then, the energy minimization problem for all UEs is formulated as

(22a)
subject to:
(22b)
(22c)
(22d)
(22e)
(22f)
(22g)
(22h)
(22i)

One can see that the above problem is a mixed integer nonlinear program (MINLP), as it includes both integer and continuous variables, and it is very difficult to solve in general. We first propose a convex optimization based algorithm, CAT, to address it iteratively. Then, we propose a Deep Reinforcement Learning (DRL) based algorithm, RAT, to facilitate fast decision making, which can be applied in dynamic environments. Note that in practice, if the -th UE does not generate a task in the -th time slot, the corresponding task parameters can simply be set to zero.

IV Proposed CAT Algorithm

In this section, a convex optimization based algorithm, CAT, is proposed to solve the above problem. We first define a set of new variables to denote the trajectories of the UAVs in terms of their coordinates in each time slot. Thus, the optimization problem can be reformulated as

(23a)
(23b)
(23c)

In order to solve this problem, we divide it into two subproblems and apply the block coordinate descent (BCD) method. To this end, we first optimize the user association and resource allocation given the UAV trajectories. Then, we optimize the UAV trajectories given the user association and resource allocation. We solve the two subproblems iteratively until convergence is achieved.

IV-A User Association and Resource Allocation

Given the UAV trajectories, the subproblem deciding the user association and resource allocation can be formulated as

(24a)

One can see that (22h) can be written as

(25)

if the -th UE chooses to offload the task, and

(26)

if the -th UE decides to execute the task locally. It is easy to see that equality holds for both (25) and (26).

Then, (24) can be re-written as

(27a)
(27b)
(27c)

It is easy to see that (27) is a Multiple-Choice Multi-Dimensional 0-1 Knapsack Problem (MMKP), which is NP-hard in general. Fortunately, it can be solved by applying the Branch and Bound method via the standard Python package PULP [33].
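As an illustration of how such a 0-1 program can be handed to PULP's branch-and-bound (CBC) backend, the following minimal Python sketch solves a toy instance of the association step; the problem sizes, the energy matrix E and the task limit L_max are made-up placeholders, not the paper's data or notation.

import pulp

K, M = 4, 2                      # number of UEs and UAVs (toy sizes)
E = [[1.0, 0.4, 0.6], [0.8, 0.5, 0.9],
     [1.2, 0.7, 0.3], [0.9, 0.6, 0.5]]   # E[k][m]: energy if UE k uses place m (m = 0: local)
L_max = 2                        # maximal number of tasks a UAV accepts in a time slot

prob = pulp.LpProblem("user_association", pulp.LpMinimize)
a = [[pulp.LpVariable(f"a_{k}_{m}", cat="Binary") for m in range(M + 1)] for k in range(K)]

# objective: total energy consumption of all UEs
prob += pulp.lpSum(E[k][m] * a[k][m] for k in range(K) for m in range(M + 1))
# each UE is served by exactly one place (itself or one UAV)
for k in range(K):
    prob += pulp.lpSum(a[k][m] for m in range(M + 1)) == 1
# each UAV accepts at most L_max tasks
for m in range(1, M + 1):
    prob += pulp.lpSum(a[k][m] for k in range(K)) <= L_max

prob.solve(pulp.PULP_CBC_CMD(msg=False))
association = [[int(a[k][m].value()) for m in range(M + 1)] for k in range(K)]

In the full problem, an additional constraint analogous to (21) would also cap the total computation resource allocated by each UAV.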

IV-B UAV Trajectory Optimization

Given the user association and resource allocation from (27) and removing the constant terms, the problem can be simplified as

(28a)
(28b)

It is easy to see that the above optimization problem is non-convex with respect to the trajectory variables. Next, we introduce a set of auxiliary variables; then, problem (28) can be transformed into

(29a)
(29b)
(29c)

One observes that the expressions in (29b) and (29c) are convex with respect to the introduced variables; however, since they bound convex functions from the wrong side, (29b) and (29c) are non-convex constraints. Then, similar to [4, 5], we apply successive convex approximation (SCA) to solve this problem. Specifically, since the first-order Taylor expansion of a convex function is a global under-estimator, for any given local point one has the following inequality

(30)

where

(31)

and

(32)

Then, problem (29) can be written as

(33a)
(33b)
(33c)

The above problem is a convex quadratically constrained quadratic program (QCQP) and it can be solved by a standard Python package CVXPY [34].
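The following minimal CVXPY sketch illustrates a convex trajectory subproblem of the same flavour as (33): a quadratic objective over the UAV waypoints with per-slot maximal-flying-distance constraints. The waypoints w, the start point q0 and the bound d_max are illustrative placeholders, not the paper's data.

import cvxpy as cp
import numpy as np

N = 20                                   # number of time slots
w = np.random.rand(N, 2) * 100           # target points the UAV should stay close to
q0 = np.array([0.0, 0.0])                # take-off point
d_max = 10.0                             # maximal flying distance per slot

q = cp.Variable((N, 2))                  # UAV horizontal trajectory over the N slots
obj = cp.Minimize(cp.sum_squares(q - w))
cons = [cp.norm(q[0] - q0) <= d_max]
cons += [cp.norm(q[n] - q[n - 1]) <= d_max for n in range(1, N)]
prob = cp.Problem(obj, cons)
prob.solve()                             # dispatched to CVXPY's default conic solver

CVXPY recognizes this as a second-order cone program, so the quadratically constrained structure is handled without any manual reformulation.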

IV-C Overall Algorithm Design

In this section, the overall convex optimization based algorithm CAT is summarized, in which the user association and resource allocation subproblem and the UAV trajectory subproblem are optimized alternately until convergence is achieved. The pseudo code of the proposed CAT is described in Algorithm 1.

1:  Set , and initialize ;
2:  repeat  
3:  Solve Problem (27) by the Branch and Bound method for the given trajectory, and denote the optimal user association and resource allocation solution;  
4:  Solve Problem (33) for the given user association and resource allocation, and denote the resulting trajectory solution;  
5:  
6:  until convergence is achieved.  
Algorithm 1 CAT Algorithm
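The alternation in Algorithm 1 can be summarized by the following sketch; solve_association_mmkp, solve_trajectory_qcqp and total_energy are hypothetical placeholders standing for the MMKP solver of (27), the SCA/QCQP solver of (33) and the objective in (22a), respectively.

# Outer BCD loop of CAT (sketch). The three callables are hypothetical placeholders.
def cat(initial_trajectory, solve_association_mmkp, solve_trajectory_qcqp,
        total_energy, tol=1e-3, max_iters=50):
    Q = initial_trajectory                      # current UAV trajectories
    prev_obj = float("inf")
    for _ in range(max_iters):
        A, F = solve_association_mmkp(Q)        # user association / resource allocation step
        Q = solve_trajectory_qcqp(A, F, Q)      # SCA step around the current local point Q
        obj = total_energy(A, F, Q)
        if prev_obj - obj < tol:                # stop when the improvement is negligible
            break
        prev_obj = obj
    return A, F, Q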

Discussions: Algorithm 1 needs to be re-run whenever the initial take-off locations of the UAVs change. Moreover, the complexity of Algorithm 1 is high, as the solutions are obtained iteratively and each subproblem involves a huge number of optimization variables, especially when the total number of time slots is large. Hence, Algorithm 1 is not suitable for emergency scenarios (e.g., battlefields, earthquakes, large fires), where fast decision making is highly demanded. This motivates the DRL-based algorithm developed in the following section.

V Proposed RAT Algorithm

To facilitate fast decision making, the DRL-based RAT algorithm is proposed in this section. We first give some preliminaries as follows.

V-A Preliminaries

V-A1 DQN

In standard reinforcement learning, an agent interacts with the environment and selects the action that maximizes the accumulated reward. In [28], a Deep Q Network (DQN) structure developed by Google DeepMind integrates deep neural networks with traditional reinforcement learning. The DQN is used to estimate the well-known Q-value defined as

(34)

where the Q-value is a function of the state and action, the expectation is taken over the environment dynamics, and the reward in the -th time step (or time slot) is discounted by the discount factor. As the objective is to maximize the accumulated reward, a widely used policy is to select the action with the largest Q-value, where the Q-value is approximated by a deep neural network with trainable parameters. Then, the DQN can be trained by minimizing the loss function [28]. Also, since deep networks are known to be unstable and difficult to train to convergence, two effective techniques, i.e., the target network and experience replay, have been introduced in [28]. The target network has the same structure as the original DQN, but its parameters are updated more slowly. The experience replay buffer stores state transition samples, which helps the DQN converge. However, the DQN was originally designed for problems with discrete variables. Although we can adapt the DQN to continuous problems by discretizing the action space, this may unfortunately result in a huge search space and therefore become intractable.
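Since the expressions in (34) and the corresponding loss are not reproduced above, the conventional forms from [28] are restated below for reference; the notation (s, a, r_t, gamma, theta, theta^-) is the standard one rather than necessarily the paper's.

Q^{\pi}(s,a) = \mathbb{E}\Big[\textstyle\sum_{t \ge 0} \gamma^{t} r_t \,\Big|\, s_0 = s,\, a_0 = a\Big],
\qquad
L(\theta) = \mathbb{E}\Big[\big(r + \gamma \max_{a'} Q(s',a';\theta^{-}) - Q(s,a;\theta)\big)^{2}\Big]

Here theta^- denotes the parameters of the target network, which are only periodically copied from theta.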

V-A2 DDPG

To deal with problems with continuous variables, e.g., the trajectory control of UAVs, one may apply the actor-critic approach developed in [35]. DeepMind proposed the deep deterministic policy gradient (DDPG) approach [30] by integrating the actor-critic approach into DRL. DDPG includes two deep neural networks: the actor network, which generates an action for a given state, and the critic network, which produces the Q-value used to evaluate the action generated by the actor network. In order to improve learning stability, two additional target networks corresponding to the actor and critic networks, with their own parameters, are also applied.

Then, the critic network can be updated with the loss function given by

(35)

where in each time step we draw a mini-batch of samples from the experience replay buffer, and the temporal difference (TD) error [36] is given by

(36)

On the other hand, the actor network can be updated by applying the policy gradient described in [30] as

(37)
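For completeness, the standard DDPG critic loss, TD-error and deterministic policy gradient from [30], of which (35)-(37) are instances, can be written as follows; mu and Q denote the actor and critic, mu' and Q' their target copies, and I the mini-batch size (conventional notation).

\delta_i = r_i + \gamma\, Q'\!\big(s_{i+1}, \mu'(s_{i+1};\theta^{\mu'});\theta^{Q'}\big) - Q\big(s_i,a_i;\theta^{Q}\big),
\qquad
L(\theta^{Q}) = \frac{1}{I}\sum_{i=1}^{I} \delta_i^{2},
\qquad
\nabla_{\theta^{\mu}} J \approx \frac{1}{I}\sum_{i=1}^{I}
\nabla_{a} Q\big(s_i,a;\theta^{Q}\big)\big|_{a=\mu(s_i;\theta^{\mu})}\,
\nabla_{\theta^{\mu}} \mu\big(s_i;\theta^{\mu}\big)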

V-B The RAT Algorithm

In this section, we introduce the DRL based RAT algorithm, which includes the deep neural networks (i.e., the actor and critic networks) and a matching algorithm. In order to apply DRL, we first define the state, action and reward as follows:

  • State: the set of the coordinates of all UAVs.

  • Action: the set of the actions of all UAVs, including the flying direction and flying distance of each UAV. Since an absolute value operation is applied to the output activation function, the output values of the network lie within the interval [0, 1]. Thus, the flying direction and distance are recovered by rescaling these outputs (see the sketch after this list) as

    (38)

    and

    (39)

    Then, the action set can be defined as the collection of the directions and distances of all UAVs.

  • Reward: defined as the negative of the overall energy consumption of all the UEs in each time slot as

    (40)
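The rescaling in (38)-(39) and the reward in (40) can be sketched as follows, assuming the direction spans [0, 2*pi] and the distance spans [0, d_max] as suggested by (4) and (5); raw_out, d_max and total_energy are illustrative placeholder names, not the paper's notation.

import math

def decode_action(raw_out, d_max):
    """Map one UAV's two network outputs in [0, 1] to a flying direction and distance.

    raw_out: (direction_factor, distance_factor), both in [0, 1] (placeholder names).
    d_max:   maximal flying distance per time slot.
    """
    direction = 2.0 * math.pi * raw_out[0]   # assumed direction range [0, 2*pi]
    distance = d_max * raw_out[1]            # flying distance in [0, d_max]
    return direction, distance

def reward(total_energy):
    """Reward of a time slot: the negative of the overall energy consumed by all UEs."""
    return -total_energy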
Fig. 2: The networks applied in this paper.

The algorithm framework used in this paper is depicted in Fig. 2, where an agent, which could be deployed at the central control center of the base station, interacts with the environment. The actor network is applied to generate the action, which includes the flying direction and distance of each UAV. The critic network is used to obtain the Q-value of the action (i.e., to evaluate the actions generated by the actor network). In each time slot, the agent generates the actions for all the UAVs (including moving direction and distance). Then, each UE tries to associate with a UAV whose coverage constraint (10) is satisfied, by using the matching algorithm in Algorithm 3. More specifically, each UE tries to connect to the UAV with the least offloading energy. If the minimum offloading energy is larger than the energy of local execution, the UE decides to conduct the task locally. Note that RAT uses the same optimization strategy for resource allocation as CAT.

Also, each UAV selects the UEs based on the following criteria: 1) the UE should be in its coverage area; 2) UEs with smaller resource requirements are given higher priority when offloading to this UAV. The details of the proposed matching algorithm are introduced in Algorithm 3. After the matching step, the reward in (40) can be obtained.
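A condensed, single-round sketch of this two-sided matching is given below; the helper callables in_coverage, offload_energy, local_energy, required_resource, task_limit and resource_limit are hypothetical placeholders for the quantities in (10), (15), (17), (25), (8) and (21), and the paper's Algorithm 3 additionally repeats the admission step until every UE has been checked.

def match(ues, uavs, in_coverage, offload_energy, local_energy,
          required_resource, task_limit, resource_limit):
    """Greedy two-sided matching (sketch). Each UE prefers the covered UAV with the
    least offloading energy (and only offloads if that beats local execution); each
    UAV admits proposing UEs in order of increasing resource requirement, subject to
    its task-number and computation-resource limits."""
    association = {k: None for k in ues}               # None: UE k executes locally
    # UE side: each UE proposes to its least-energy covered UAV, if offloading helps.
    proposals = {m: [] for m in uavs}
    for k in ues:
        covered = [m for m in uavs if in_coverage(k, m)]
        if not covered:
            continue
        best = min(covered, key=lambda m: offload_energy(k, m))
        if offload_energy(k, best) < local_energy(k):
            proposals[best].append(k)
    # UAV side: admit proposers with smaller resource requirements first.
    for m in uavs:
        accepted, used = 0, 0.0
        for k in sorted(proposals[m], key=lambda k: required_resource(k, m)):
            need = required_resource(k, m)
            if accepted < task_limit(m) and used + need <= resource_limit(m):
                association[k] = m
                accepted += 1
                used += need
    return association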

We assume that there is an experience replay buffer for the agent to store the experience . Once the experience replay buffer is full, the learning procedure starts. A mini-batch with size can be obtained from the experience replay buffer to train the networks.

In classical DRL algorithms, such as Q-learning [37], SARSA [38] and DDPG [30], the mini-batch samples experiences uniformly from the experience replay buffer. However, since the TD-error in (36) is used to update the Q-value network, an experience with a high TD-error often indicates a successful attempt. Therefore, a better way to select experiences is to assign different weights to the samples. Schaul et al. [31] developed a prioritized experience replay scheme, in which the absolute TD-error is used to determine the sampling probability of each experience. Then, the probability of sampling the -th experience can be given by

(41)

where a small positive constant is added to avoid the edge case of a transition never being revisited when its TD-error is 0, and an exponent determines the degree of prioritization [31].
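Expression (41) instantiates the standard proportional-prioritization rule of [31], which in conventional notation reads

p_i = |\delta_i| + \epsilon, \qquad P(i) = \frac{p_i^{\alpha}}{\sum_{j} p_j^{\alpha}}

where epsilon is the small positive constant and alpha controls the degree of prioritization (alpha = 0 recovers uniform sampling).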

However, frequently sampling experiences with high TD-error can cause divergence and oscillation. To tackle this issue, the importance-sampling weight [39] is introduced to represent the importance of each sampled experience, which can be given by

(42)

where the size of the experience replay buffer is involved and the exponent is set to 0.4 [31]. Thus, the loss function in (35) can be updated as

(43)

which is used in our proposed RAT to train the networks. A compact sketch of this prioritized sampling step is given below, after which we describe the pseudo code of the overall RAT framework in Algorithm 2.
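The following minimal numpy sketch illustrates the prioritized sampling and weighting step; the buffer layout and the names (td_errors, alpha, beta) are illustrative assumptions rather than the implementation used in the paper.

import numpy as np

def sample_prioritized(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-6):
    """Sample a mini-batch proportionally to |TD-error|^alpha and return the
    importance-sampling weights used to scale the critic loss, cf. (41)-(43)."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()                  # sampling probabilities P(i)
    idx = np.random.choice(len(td_errors), size=batch_size, p=probs)
    weights = (len(td_errors) * probs[idx]) ** (-beta)     # importance-sampling weights
    weights /= weights.max()                               # normalize for stability
    return idx, weights

# Usage (toy): weighted squared-TD-error loss over the sampled mini-batch.
td = np.random.randn(1000)              # stand-in for the stored TD-errors
idx, w = sample_prioritized(td, batch_size=64)
loss = np.mean(w * td[idx] ** 2)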

1:  Initialize actor network with parameters and critic network with parameters ;  
2:  Initialize target networks with parameters and with parameters ;  
3:  Initialize experience replay buffer ;  
4:  for epoch = 1,…, do
5:     Initialize the state;  
6:     for time step = 1,…, do
7:        Generate the action from the actor network and add random exploration noise, which decays with the time step;  
8:        for UAV =1,…,  do
9:           Execute ;  
10:           Obtain ;  
11:        end for
12:        Obtain the user association with UAVs using matching algorithm proposed in Algorithm 3;  
13:        Obtain the reward from (40);  
14:        Store experience [] into ;  
15:        if  is full then
16:           for  = 1,…,  do
17:              Sample -th experience with probability from (41);  
18:              Calculate and from (36) and (42) respectively;  
19:           end for
20:           Update parameters of the critic network by minimizing its loss function according to (43);  
21:           Update parameters of the actor network by using policy gradient approach according to (37);  
22:           Update two target networks with the updating rate :  
23:        end if
24:     end for
25:  end for
Algorithm 2 RAT Algorithm

We first initialize the actor and critic networks, the two target networks, and the experience replay buffer in Lines 1-3. At each epoch, the take-off points of all UAVs are randomly generated in the square area covering the UEs. We add random noise to the action, where the noise follows a normal distribution whose scale is initially set to 3 and decays at a rate of 0.9995 in each time step. In Lines 8-11, each UAV flies according to the generated action and enters the next state. Then, we obtain the user association by using Algorithm 3. Next, the reward is obtained according to (40) (i.e., Line 13). The experience is also stored in the replay buffer. When the buffer is full, the mini-batch samples experiences by applying prioritized experience replay (i.e., Lines 16-19). Then, we update the critic and actor networks by using the loss function in (43) and the policy gradient in (37), respectively. Finally, we update the target networks (i.e., Line 22) as

(44)

and

(45)

where is the updating rate.
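Equations (44) and (45) correspond to the standard soft-update rule of [30], which in conventional notation, with tau denoting the updating rate, reads

\theta^{Q'} \leftarrow \tau\,\theta^{Q} + (1-\tau)\,\theta^{Q'}, \qquad
\theta^{\mu'} \leftarrow \tau\,\theta^{\mu} + (1-\tau)\,\theta^{\mu'}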

1:  Initialize and , , ;  
2:  for UAV = 1,…,  do
3:     for UE = 1,…,  do
4:        if (10) is met then
5:           Calculate , and ;  
6:           if  then
7:              Store into ;
8:           end if
9:        end if
10:     end for
11:     Sort the element in in ascending order with respect to ;  
12:  end for
13:  repeat  
14:  for UAV = 1,…,  do
15:     ;  
16:     if (8), (21) are met then
17:        if  or  then
18:           ;  
19:        end if
20:        ;  
21:     end if
22:  end for
23:  until Each UE in is checked.  
24:  Return  
Algorithm 3 Matching Algorithm

Next, we introduce the low-complexity matching algorithm that decides the user association and resource allocation given the UAVs' trajectories, as shown in Algorithm 3. First, we use a vector to record the user association between UEs and UAVs: one value means the -th UE matches with the -th UAV, while another value denotes that the -th UE is not matched yet and has to execute its task locally. In addition, we maintain a preference list for the -th UAV to record the UEs that can benefit from offloading. From Line 2 to 10, we generate the preference list for the -th UAV. Precisely, if constraint (10) is met, we obtain the local energy, the offloading energy and the required resource according to (17), (15), and (25), respectively, and the UEs that benefit from offloading are stored in the preference list. Since UAVs wish to accept as many UEs as possible, we sort the preference list in ascending order with respect to the required resource, as shown in Line 11. The UE that consumes less