I Introduction
In recent years, unmanned aerial vehicle (UAV) assisted wireless powered mobile-edge computing (MEC) networks have attracted increasing attention [1, 2, 3, 4, 5]. Thanks to technological advances, today's UAVs can be equipped with MEC servers with strong computing capabilities as well as energy transmitters. Such a UAV can perform not only computation offloading via MEC technology [6] but also wireless charging via wireless power transfer (WPT) [7] for mobile terminals, which need to run computation-intensive applications but have limited computing capacity and battery lifetime. The UAV is thus suitable for building temporary MEC systems for mobile terminals in special situations. For example, the UAV can serve mobile terminals in scenarios where the base station (BS) is damaged, in public meeting places that become traffic hotspots, or in remote fields lying in the coverage holes of wireless networks.
The UAV-assisted wireless powered MEC networks were previously investigated for ground terminals with fixed locations [1, 2, 3, 4, 5]. To maximize the network utility in terms of computation rate [1, 5] or minimize the energy consumption [2, 3, 4], previous works optimized the UAV trajectory, offloading decision, and resource allocation. Sometimes, the UAV is required to arrive at specified locations automatically, so that it can be utilized in places that are difficult for people to reach, thereby reducing labor costs [1, 4]. The previous works solved the optimization problems by offline algorithms such as successive convex approximation [1, 2, 4] and block coordinate descending [5]. The numerical results in [1, 2, 3, 4, 5] show that these algorithms work well in scenarios where the locations of terminals are fixed.
However, terminals in practice such as smartphones, tablets, wearable devices, and tracking collars carried by wildlife [8] are typically in motion, and their trajectories are likely to be stochastic. To serve mobile terminals, online algorithms are needed to make decisions based on real-time information. Unfortunately, the existing algorithms designed for terminals with fixed locations [1, 2, 3, 4, 5] are offline algorithms, and may not work well in these scenarios since all of them need environment information a priori [9].
In this paper, we study a UAV-assisted wireless powered MEC network where a flying UAV serves multiple mobile terminals. We aim to maximize the computation rate of all terminals while ensuring fairness among them; here, fairness is considered in order to balance the computation performance across different terminals. We demonstrate that this problem is a joint optimization and continuous control problem of the UAV trajectory and the resource allocation of terminals, under the condition that the trajectories of terminals are stochastic. Therefore, we propose a soft actor-critic (SAC) based deep-reinforcement-learning (DRL) algorithm for trajectory planning and resource allocation (SACTR). Since this problem is a complex high-dimensional DRL task, SACTR combines off-policy and maximum entropy reinforcement learning to ensure sampling efficiency and stabilize convergence at the same time. Taking the computation rate, fairness, and reaching the destination into consideration, we design the reward in SACTR as a heterogeneous function to satisfy multiple objectives simultaneously. The simulation results show that SACTR outperforms representative benchmarks in most cases.
The main contributions of this paper are highlighted as follows.

To the best of our knowledge, we are the first to provide an online algorithm for trajectory planning and resource allocation for mobile terminals in the UAV-assisted wireless powered MEC network.

By integrating a fairness index into the reward of SACTR, we guarantee the fairness among different terminals according to the needs of scenarios.

By using the progress estimate in the reward of SACTR, the UAV can reach the specified destination automatically and the convergence of SACTR is accelerated.

SACTR converges steadily and adapts fast to unexpected changes in the environment.
The rest of this paper is organized as follows. In Section II, we review the related works. We model the UAV-assisted wireless powered MEC network in Section III and formulate the trajectory planning and resource allocation problem in Section IV. In Section V, the detailed design of SACTR is given. We evaluate the performance of our algorithm by simulations in Section VI. Finally, the paper is concluded in Section VII.
II Related Works
The previous works related to our paper include those focusing on wireless powered MEC networks [10, 11, 12, 13, 14, 1, 2, 3, 4, 5], and UAV-assisted communication networks for mobile terminals [15, 16, 17].
II-A Wireless Powered MEC Network
The wireless powered MEC network can be divided into two types according to the carriers of the MEC servers and energy transmitters. The first type is the wireless powered MEC network where the BS is the carrier [10, 11, 12, 13, 14], while the second type is the UAV-assisted wireless powered MEC network where the UAV is the carrier [1, 2, 3, 4, 5].
In systems with the BS as the carrier, the works mainly focus on resource allocation and offloading decisions [10, 11, 12, 13, 14]. In [10] and [11], parameters such as the offloading decision, CPU frequency, and transmission time of terminals were optimized to minimize the energy consumption and to maximize the computation rate, respectively. Different from [10, 11], Mao et al. investigated the max-min energy efficiency optimization problem to guarantee the fairness of energy efficiency among different devices [12]. To exploit the advantages of DRL in handling problems with sophisticated state spaces and time-varying environments, Min et al. [13] proposed a deep Q network (DQN) based offloading policy for energy-harvesting MEC networks to improve the computation performance. Huang et al. [14] proposed a DRL-based online computation offloading (DROO) framework. Instead of solving for the hybrid integer-continuous solution altogether, DROO decomposes the optimization problem into a binary offloading decision subproblem and a continuous resource allocation subproblem, and tackles them separately by deep learning and traditional optimization methods, respectively.
In systems with the UAV as the carrier, the previous works only considered the case where the locations of terminals are fixed [1, 2, 3, 4, 5]. To maximize the weighted sum computation rate of terminals, Zhou et al. [1] jointly optimized the CPU frequencies, transmit powers, and offloading times of terminals as well as the UAV trajectory. Ref. [2] minimized the energy consumption of the UAV while guaranteeing the computation rate of all terminals. Ref. [3] proposed a time-division multiple access (TDMA) based workflow model, which allows parallel transmitting and computing. In particular, the UAV was arranged to hover over designated successive positions, and parameters such as the service sequence of terminals, computing resource allocation, and hovering time of the UAV were jointly optimized. To assist the service of the UAV, Liu et al. [4] utilized idle sensor devices to cooperate with the UAV to provide computation offloading service for busy sensor devices, and Hu et al. [5] utilized access points (APs) to offer wireless power and computation offloading services for the UAV. The offline algorithms in [1, 2, 3, 4, 5] require the system information a priori.
II-B UAV-assisted Communication Network for Mobile Terminals
The UAV-assisted communication network for mobile terminals is similar to the UAV-assisted MEC network for mobile terminals. The difference is that the UAV in the former carries out traffic offloading while that in the latter performs computation offloading. The major problem in the UAV-assisted communication network is resource allocation and UAV trajectory design for performance optimization. Ref. [15] proposed a deterministic policy gradient (DPG) based algorithm to maximize the expected uplink sum rate of terminals. Ref. [16] considered the scenario where a group of UAVs is employed to enhance the communication coverage area, and proposed an actor-critic (AC) based algorithm to optimize the UAV trajectory, such that the objectives of coverage expansion, fairness improvement, and power saving can be achieved. However, DPG [15] and AC [16] are hard to make converge when applied to a complex high-dimensional DRL task, e.g., the problem considered in this paper, which jointly optimizes multiple types of parameters. Ref. [17] aimed to maximize the throughput of a UAV-assisted cellular offloading network. It discretized the flight direction of the UAV and the transmit power of terminals and devised a value-based DRL algorithm. Since this algorithm has to search the action space exhaustively in each iteration, it cannot be used for problems with high-dimensional or continuous actions [18].
III UAV-assisted Wireless Powered MEC Network
Fig. 1 illustrates the UAV-assisted wireless powered MEC network considered in this paper. There are a set of mobile terminals, denoted by , and a UAV. All the terminals move on the ground at altitude 0, and the UAV flies at a fixed altitude, denoted by , so that it can avoid the frequent ascents and descents needed to evade obstacles on the ground. The UAV is equipped with MEC servers and energy transmitters and serves these mobile terminals, which have low battery lifetime and computing capacity. Each terminal has accumulated computation tasks, which can be divided into two parts: one part is executed locally by the mobile terminal, and the other part is offloaded to and executed at the UAV, which is known as the partial offloading mode. Meanwhile, the UAV broadcasts radio-frequency (RF) energy to all mobile terminals, and the terminals harvest the energy and store it in their rechargeable batteries. The UAV/terminal can perform energy transferring/harvesting, computing, and data exchange simultaneously [10, 11, 12, 1]. The UAV is required to arrive at a designated location at the end of the flight [1, 4].
III-A Computation Offloading
Mobile terminals adopt the TDMA protocol to communicate with the UAV, as illustrated in Fig. 2. The flight time of the UAV, denoted by , is discretized into slots. The duration of a time slot is short enough that the locations of the UAV and terminals, as well as the channel gain, remain almost unchanged within a slot. In each slot, the mobile terminals offload computation tasks to the UAV in a round-robin manner and download the computation results from it after completion.
III-A1 Computation Time and Data-exchange Time
We denote as the proportion of uploading time of the th terminal in slot . In general, the computing capacity of the MEC servers on the UAV is powerful and the size of computation result is quite small. Thus, we assume that
III-A2 Channel Condition
The data exchange between the UAV and the terminals is influenced by the wireless channel conditions. In our model, we assume that

the impact of the Doppler effect in data exchange due to the position changes of the UAV and mobile terminals can be perfectly compensated by the receivers, and
Employing three-dimensional (3D) Euclidean coordinates, we let be the horizontal plane coordinate of the th terminal and be that of the UAV in slot . Under the air-to-ground model, the path loss between the th terminal and the UAV in the th slot is given by [19] as
(1) 
where is the Euclidean norm, is the carrier frequency, is the speed of light,
is the probability that the link between the terminal and the UAV is a Line-of-Sight (LoS) link, and
and are the additional losses caused by the LoS and non-LoS (NLoS) links on top of the free-space path loss. The values of and are determined by the environment, such as urban or rural. According to [19], is given by (2)
where and are two constants determined by the environments [19]. Accordingly, the channel power gain between the th terminal and the UAV in slot is given by
(3) 
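As a concrete illustration, the air-to-ground channel model described above can be sketched as follows. This is a minimal sketch: the environment constants (`a`, `b`, `eta_los`, `eta_nlos`) and all variable names are illustrative placeholders, not the paper's elided symbols or values.

```python
import math

def path_loss_db(d, h, fc=3e9, a=9.61, b=0.16, eta_los=1.0, eta_nlos=20.0):
    """Average air-to-ground path loss in dB for 3D distance d (m) at UAV altitude h (m).

    a, b, eta_los, eta_nlos are illustrative environment constants,
    not the paper's values.
    """
    c = 3e8                                          # speed of light (m/s)
    theta = math.degrees(math.asin(h / d))           # elevation angle (degrees), requires d >= h
    p_los = 1.0 / (1.0 + a * math.exp(-b * (theta - a)))      # sigmoid LoS probability
    fspl = 20.0 * math.log10(4.0 * math.pi * fc * d / c)      # free-space path loss (dB)
    return fspl + p_los * eta_los + (1.0 - p_los) * eta_nlos  # add LoS/NLoS excess loss

def channel_gain(d, h):
    """Channel power gain: the inverse of the path loss on a linear scale."""
    return 10.0 ** (-path_loss_db(d, h) / 10.0)
```

A higher UAV-to-terminal distance lowers the elevation angle and hence the LoS probability, so the expected loss grows faster than free-space loss alone.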
III-A3 Task Offloading by Terminals
The computation tasks that the terminal uploads to the UAV include the raw data and the communication overhead such as the encryption and the packet header [11]. We assume that

each bit of raw data needs bits of upload data.
Recall that the th terminal offloads data to the UAV with duration in slot . The volume of raw data that terminal offloads to the UAV in slot is
(4) 
where is the transmit power of terminal in slot , is the offloading bandwidth, and is the noise power at the terminal. It follows that the energy consumption for offloading these data is .
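The offloaded raw bits in (4) follow the Shannon capacity of the uplink, scaled by the offloading-time proportion and divided by the per-raw-bit upload overhead. A minimal sketch, with variable names that are assumptions rather than the paper's elided symbols:

```python
import math

def offloaded_raw_bits(tau, slot_len, p, g, bandwidth, noise_power, mu=1.1):
    """Raw bits a terminal offloads to the UAV in one slot.

    tau: offloading-time proportion; slot_len: slot duration (s); p: transmit
    power (W); g: channel power gain; mu: upload bits needed per raw bit.
    """
    rate = bandwidth * math.log2(1.0 + p * g / noise_power)  # Shannon rate (bit/s)
    return tau * slot_len * rate / mu                        # strip the upload overhead

def offload_energy(tau, slot_len, p):
    """Energy spent transmitting for a fraction tau of the slot."""
    return p * tau * slot_len
```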
III-B Local Computation
Mobile terminals execute local computation tasks and adjust their CPU frequency by the dynamic voltage and frequency scaling technique in each slot [1, 11, 12]. Let be the CPU frequency (unit: cycle/s) of the th terminal in slot , and be the number of CPU cycles required for computing one bit of raw data. Then, the number of local computation bits of the th terminal in the th slot is
(5) 
Accordingly, the local energy consumption of the th terminal in the th slot is given by , where is the effective capacitance coefficient of the processor chip [11].
Let be the number of computation bits of the th terminal in the th slot, including both the local and the offloaded ones. is given by
(6) 
Thus, the total computation bits of the th terminal in the entire flight time are .
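The local-computation model in (5) and the cubic-in-frequency energy model can be sketched as below; the function and parameter names are illustrative stand-ins for the paper's elided symbols.

```python
def local_bits(f, slot_len, cycles_per_bit):
    """Bits computed locally at CPU frequency f (cycle/s) during one slot."""
    return f * slot_len / cycles_per_bit

def local_energy(f, slot_len, kappa):
    """Dynamic CPU energy over one slot: cubic in frequency, scaled by the
    effective capacitance coefficient kappa of the processor chip."""
    return kappa * (f ** 3) * slot_len
```

The cubic energy term is why lowering the CPU frequency saves far more energy than it loses in computed bits, which the learned resource allocation can exploit.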
III-C Wireless Power Transfer
The UAV broadcasts RF energy to all mobile terminals continuously during its flight time. We assume that

the energy of the UAV is sufficient, and

the transmit power of the UAV is a constant, .
The energy harvested by the th terminal in slot is , where is the energy conversion efficiency.
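Under the linear harvesting model implied above, the energy a terminal captures in a slot can be sketched as follows; names are illustrative, not the paper's notation.

```python
def harvested_energy(eta, p_uav, g, slot_len):
    """RF energy a terminal harvests in one slot under a linear harvesting model.

    eta: energy conversion efficiency; p_uav: UAV transmit power (W);
    g: channel power gain between the UAV and the terminal; slot_len: slot
    duration (s).
    """
    return eta * p_uav * g * slot_len
```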
IV Problem Formulation
To ensure the performance of each terminal, we aim to maximize the sum computation bits of all mobile terminals over the entire flight time , while guaranteeing the fairness of computation bits among different terminals. Based on Jain's fairness index [21], we define the fairness index () as
(7) 
Clearly, a larger indicates higher fairness. Accordingly, we define the objective function as a joint function of the computation bits and fairness, as , where is a nonnegative integer used to adjust the proportion of in the objective function.
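Jain's fairness index is a standard quantity, so the fairness term and the resulting objective can be sketched concretely; the function names and the multiplicative combination of sum bits with the exponentiated fairness index are assumptions consistent with the surrounding text, since the exact elided formula is not recoverable.

```python
def jain_fairness(bits):
    """Jain's fairness index over per-terminal computation bits; lies in (0, 1]
    and equals 1 when all terminals computed the same amount."""
    total = sum(bits)
    if total == 0:
        return 1.0  # degenerate all-zero allocation treated as perfectly fair
    return total ** 2 / (len(bits) * sum(b * b for b in bits))

def objective(bits, beta):
    """Sum computation bits weighted by the fairness index raised to a
    non-negative integer exponent beta (beta = 0 ignores fairness)."""
    return sum(bits) * jain_fairness(bits) ** beta
```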
We intend to optimize the UAV trajectory and the resource allocation of terminals during the flight time of the UAV. Let and be the flight speed and direction of the UAV in slot , respectively. The UAV trajectory is described by and , where . The resource allocation variables include the transmit powers, offloading times, and CPU frequencies in all the slots. In particular, the resource allocation variables are , , and . Note that, the transmit power and offloading time affect the offloading performance, and the CPU frequency decides the number of local computation bits. Consequently, to maximize the objective function, we should jointly optimize the flight speed and direction of the UAV, and the transmit powers, offloading times, and CPU frequencies of mobile terminals in each slot.
Our optimization problem is formulated as
(8a)  
s.t.  (8b)  
(8c)  
(8d)  
(8e)  
(8f)  
(8g)  
(8h) 
where is the initial energy of the th terminal, is the maximum horizontal flying speed of the UAV, and are respectively the locations of the UAV in the first slot and after the last slot, and and are the locations of the designated starting point and the destination.
indicates that the transmit powers and CPU frequencies of terminals should be nonnegative. restricts that, by each slot, the accumulated energy consumption of a terminal cannot exceed the sum of the initial energy and the energy harvested by this terminal. states that the sum of offloading time of all terminals in each slot cannot exceed the duration of a slot. and give the range of the flight speed and direction of the UAV. and restrict the starting point and the destination of the UAV.
Though problem
is a sequential decision problem that can be characterized by a Markov decision process (MDP), the moving trajectories of terminals may be unpredictable and cannot be known in advance. Also, it involves the joint optimization and continuous control of high-dimensional parameters. As a result, traditional optimization approaches fail to solve this problem. For example, offline algorithms such as dynamic programming, successive convex approximation, or block coordinate descending require the system information a priori; the DQN method [22] can only deal with problems with discrete or low-dimensional actions [18]; and it is challenging for the policy gradient method [23] or the actor-critic (AC) method [24] to maintain both high sample efficiency and stable convergence at the same time [25] when handling a complex high-dimensional DRL task like ours. We thus introduce the soft actor-critic (SAC) method [26] to solve this problem in the next section.
V SAC-based Algorithm for Trajectory Planning and Resource Allocation
In this section, we propose the SAC-based trajectory planning and resource allocation (SACTR) algorithm to solve problem . To deal with this complex high-dimensional DRL task, SACTR adopts the combination of off-policy and maximum entropy reinforcement learning in the SAC method, so as to increase sampling efficiency and stabilize convergence at the same time. Taking into consideration the computation rate, fairness, and reaching the destination, we design a heterogeneous reward function in SACTR. SACTR is introduced in the following three parts. We present the main design of SACTR in Section V-A and the heterogeneous reward function in Section V-B. Section V-C introduces maximum entropy reinforcement learning and gives the gradient descent formulas of the neural networks in SACTR.
V-A Design of SACTR
Fig. 3 plots the structure of SACTR, which consists of a policy function, denoted by , two Q-functions, denoted by and , two target networks, denoted by and , and an experience replay memory, where and are the environment state and the action in slot , respectively.
V-A1 Policy Function
The policy function performs as an “actor”. In slot , the policy collects the state information from the network. The state includes a 2-dimensional UAV location, a -dimensional terminal location, an -dimensional terminal battery energy, and a 1-dimensional current slot, which is formally defined by
(9) 
where is the battery energy of terminal in slot . According to state , the policy function takes an action defined by a -dimensional vector
(10) 
to adjust the horizontal flight speed and direction of the UAV, the transmit power, CPU frequency, and offloading time proportion of each terminal.
The policy function is implemented by a deep neural network (DNN), whose parameter is denoted by . The DNN has two output layers. During the training process, the DNN generates the mean and covariance of a Gaussian random variable at the two output layers. Sampling the Gaussian random variable and then restricting it via a tanh function, the policy function produces an action.
The actions generated by the policy function might not meet all the constraints of problem . To satisfy constraints and , we adjust the generated actions as follows. restricts the energy consumed by each terminal in each slot. If the generated action for a terminal does not satisfy this constraint, we set the transmit power and the CPU frequency of this terminal to zero in this slot. As a result, the computation bits of this terminal are also zero in this slot, which can be regarded as a penalty for this infeasible action. To satisfy , the offloading-time constraint, we normalize the offloading-time proportions of the generated action. Let be the proportion of offloading time of the th terminal generated by the policy in slot . If , constraint is met, thus ; otherwise, is normalized as follows
(11) 
After that, SACTR exports the adjusted action and obtains a reward, which is denoted by and will be defined in Section VB.
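The offloading-time adjustment above amounts to a proportional normalization whenever the generated proportions exceed the slot duration. A minimal sketch (list-of-floats representation is an assumption):

```python
def normalize_offload_times(taus):
    """Scale per-terminal offloading-time proportions so their sum is at most 1.

    If the generated proportions already satisfy the constraint, they are kept;
    otherwise each is divided by the total, preserving their relative ratios.
    """
    total = sum(taus)
    if total <= 1.0:
        return list(taus)              # constraint already satisfied
    return [t / total for t in taus]   # proportional normalization
```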
V-A2 Experience Replay Memory
After getting and the state of the next slot from the MEC network, SACTR combines , , , and as a sample and stores it in the experience replay memory. Once the memory is full, the newly generated sample replaces the oldest one. At fixed intervals, SACTR randomly selects a batch of samples from the memory and performs gradient descent on the neural networks of the policy function and the Q-functions.
V-A3 Q-function
Following the clipped double-Q trick [26], SACTR uses two Q-functions and as a “critic” in the gradient descent process of the DNN of the policy function, such that the positive bias in policy improvement can be reduced. and are implemented by two DNNs with parameters and . Both generate Q-values of a state-action pair, and SACTR selects the smaller of the two values.
V-A4 Target Network
The DNN of each Q-function is also updated by gradient descent, where the two target networks and are used to reduce the correlation between samples so as to stabilize training. As backups of the Q-functions, the initial structures and parameters of the two target networks are the same as those of the two Q-functions. They update their parameters using exponentially moving averages of the parameters of and , with a smoothing constant .
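The exponentially-moving-average (Polyak) update of the target networks can be sketched as follows; parameters are flat lists of floats here purely for illustration.

```python
def soft_update(target_params, source_params, tau=0.005):
    """Exponentially-moving-average update of target-network parameters:
    theta_target <- tau * theta_source + (1 - tau) * theta_target.

    tau is the smoothing constant; a small tau makes the targets track the
    Q-networks slowly, which stabilizes training.
    """
    return [tau * s + (1.0 - tau) * t for t, s in zip(target_params, source_params)]
```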
V-B Heterogeneous Reward Function
To meet different types of requirements, including the computation rate, fairness, and specified UAV destination, we customize the reward in SACTR as a heterogeneous function of a computation reward and an arrival reward. In particular, we design the computation reward based on the fairness index to maximize the objective of problem , and design the arrival reward based on progress estimate to meet the arrival constraint .
V-B1 Computation Reward with Fairness
We aim to maximize the computation bits of terminals while guaranteeing the fairness among them. On one hand, we include the incremental computation bits, i.e., in the reward to encourage the improvement of computation bits in slot . On the other hand, to make use of existing information to promote fairness in each slot, we define an indicator, called current fairness index, corresponding to the definition of fairness index in (7) as follows
(12) 
to measure the fairness among terminals in slot . Eq. (12) can be regarded as a slot-by-slot evolution of the fairness index. Clearly, there is .
Combining with the incremental computation bits, we design the computation reward so that it encourages both actions that generate more computation bits and actions that achieve high fairness, thereby promoting the final fairness. In the th slot, the computation reward is given by
(13) 
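Since the exact combination in (13) is elided from the extracted text, the sketch below shows one plausible form consistent with the description: the slot's incremental bits weighted by the current fairness index. Both function names and this multiplicative form are assumptions.

```python
def current_fairness(cum_bits):
    """Jain-style fairness over the bits each terminal has accumulated so far."""
    total = sum(cum_bits)
    if total == 0:
        return 1.0
    return total ** 2 / (len(cum_bits) * sum(b * b for b in cum_bits))

def computation_reward(inc_bits, cum_bits):
    """Plausible computation reward: incremental bits in this slot, weighted by
    the current fairness index so that unfair allocations earn less."""
    return sum(inc_bits) * current_fairness(cum_bits)
```

With equal incremental bits, a UAV action that keeps the accumulated bits balanced across terminals earns a strictly higher reward than one that lets a single terminal dominate.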
V-B2 Arrival Reward Based on Progress Estimate
It is important to set a proper arrival reward so that the UAV arrives at the designated destination at the end of the flight; otherwise, the UAV may take a long time to reach the destination, or never reach it. An example of an arrival reward is the sparse reward in [27], where a fixed reward is given when the UAV arrives at the destination and a fixed penalty when it does not. However, in our problem, the area of the destination is much smaller than the whole flight area, and thus samples in which the UAV arrives at the destination would be rare in the training process. As a result, if our algorithm employed the sparse reward, it would be difficult to converge.
Inspired by the progress estimate reported in [28], we design a distancebased arrival reward. The idea of the progress estimate is that, if the goal is not reached, an artificial progress estimator is given to accelerate convergence. Based on this idea, we define an arrival reward at the end of the flight according to the distance between the UAV and destination as follows
(14) 
where are constants. Clearly, decreases linearly with the distance between the destination and the final location of the UAV. In this way, the samples that the UAV fails to arrive at the destination can also be utilized in the training process to guide the algorithm.
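A distance-based arrival reward of this kind can be sketched as below. The linear form and the default constants (500 and 80, the values later quoted in the simulation settings) are assumptions about the elided Eq. (14), not a reproduction of it.

```python
def arrival_reward(final_xy, dest_xy, c1=500.0, c2=80.0):
    """Arrival reward that decreases linearly with the final UAV-to-destination
    horizontal distance, so near-miss episodes still carry a learning signal."""
    dx = final_xy[0] - dest_xy[0]
    dy = final_xy[1] - dest_xy[1]
    return c1 - c2 * (dx * dx + dy * dy) ** 0.5
```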
V-C Maximum Entropy Reinforcement Learning
SAC uses the entropy of the policy, given by [26], to indicate the randomness of the policy. The objective of SAC is to maximize the expectation of accumulated rewards together with the expected entropy of the policy, such that the policy can be trained with diverse, highly random samples. In this way, SAC can avoid falling into a local optimum. This objective is called the maximum entropy objective in SAC. To solve problem , SACTR defines the maximum entropy objective
(16) 
based on the reward function in (15), where is the temperature parameter that adjusts the importance of entropy against the reward and controls the stochasticity of policy.
At a fixed interval, SACTR performs gradient descent on the neural networks of the Q-functions and the policy function. The parameters of the Q-functions, , are updated by minimizing the soft Bellman residual [26]
(17) 
where is the distribution of sampled states and actions. The parameter of the policy function, , is updated by
(18) 
In (18), the reparameterization trick is employed as the solution for policy gradient [26], in which the policy is rewritten as , where is an independent noise vector, as shown in Fig.3.
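The two core computations of this update, the clipped double-Q soft Bellman target and the reparameterized action sampling, can be sketched with scalars as follows; the function names, the scalar simplification, and the default gamma/alpha values are illustrative.

```python
import math
import random

def soft_target(reward, q1_next, q2_next, log_prob_next, gamma=0.99, alpha=0.2):
    """Clipped double-Q soft Bellman target for one transition:
    y = r + gamma * (min(Q1', Q2') - alpha * log pi(a'|s')),
    where alpha is the temperature parameter weighting the entropy term."""
    return reward + gamma * (min(q1_next, q2_next) - alpha * log_prob_next)

def reparameterized_action(mean, std):
    """Reparameterization trick: sample independent standard-normal noise,
    shift/scale it by the policy's mean and std, and squash with tanh so the
    gradient can flow through mean and std."""
    eps = random.gauss(0.0, 1.0)        # independent noise (scalar here)
    return math.tanh(mean + std * eps)  # action bounded in (-1, 1)
```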
Before deployment, SACTR is trained until it converges; the training process is summarized in Algorithm 1. The well-trained algorithm is then carried by the UAV as an agent. At the beginning of each slot, the UAV collects the state information and makes a decision. During the flight time, SACTR can continue to be trained at a fixed interval if needed.
VI Performance Evaluation
In this section, we evaluate the performance of SACTR by simulations. In particular, we study the convergence, usability, and adaptability of SACTR, the effect of the exponent of the fairness index, and the optimal policy given by SACTR. We also compare SACTR with other benchmarks.
VI-A Simulation Settings
VI-A1 System Settings
In the simulation, we set the total flight time to seconds, which is discretized into slots, and the number of mobile terminals to . The maximum flight speed of the UAV is m/s, the data offloading bandwidth is MHz, the carrier frequency is GHz, and the receiver noise power is Watts [1]. The WPT energy conversion efficiency at each terminal is [29]. The effective capacitance coefficient of the terminal is , which depends on the chip architecture, and the number of CPU cycles per bit of raw data is cycles/bit [14, 30]. The upload data needed for each bit of raw data is . In remote areas, the parameters in (2) are [31]. A field with a horizontal area of m is considered, and the flight altitude of the UAV is m. The horizontal location of the starting point of the UAV is m, and the destination range is a sector with center m and radius m, as shown in Fig. 9.
VI-A2 Mobility Model of Terminals
Since the mobility model of a terminal may contain fixed components, randomness, and memory, we employ the Gauss-Markov random model (GMRM) to characterize it [32]. Assuming the speed and the direction of the th terminal in the th slot are and , respectively, they evolve in GMRM according to
(19a)  
(19b) 
Herein, represent the memory in the mobility model of the th terminal. and are the average speed and average direction of the th terminal. and are Gaussian distributed random variables, which reflect the randomness in the mobility model of the th terminal. In the simulation, we set , , and m/s, , the mean and covariance of as 0 and 2, and those of as 0 and 1, for . Note that SACTR can also be applied to other mobility models of terminals, including changeable or unknown models.
VI-A3 Simulation Platform
We execute SACTR in Python 3.7 with PyTorch 1.7. The neural networks in the policy function and the Q-functions are both fully connected networks, each of which has three hidden layers with 400 neurons per hidden layer. We adopt the Adam optimizer and utilize ReLU as the activation function. We set the discount to , and the algorithm is updated every 100 slots. The parameters and in the arrival reward are set to 500 and 80, respectively. It is pointed out in [25] that the SAC method performs well when the average reward in each slot is around dozens. Thus, we regulate the average reward into this range by setting different in different scenarios. For example, when the exponent of the fairness index is and the transmit power of the UAV is Watts, we select to make the return around 1200 (30 per slot) after convergence. Herein, the return is the accumulated reward in an episode, and an episode is an independent realization of an entire flight time.
VI-B Convergence of SACTR
In this part, we study the effect of the reward function and hyperparameters on the convergence of SACTR, where we set Watts and . The current fairness index is integrated into the computation reward to improve fairness. To investigate its effect, we compare SACTR to a variant without in the computation reward, as plotted in Fig. 4. The curves denote the moving average of the objective function over a window of 100 episodes. As we can see, the computation reward integrated with reaches a higher objective function than that without , since can guarantee fairness in almost all situations.
We also design an arrival reward based on the progress estimate to promote convergence. To examine its effect, we compare SACTR to a variant with a sparse arrival reward: if the UAV reaches the destination, a fixed reward is given; otherwise, no arrival reward is given. Fig. 5 shows the moving average of the objective function and the arrival ratio under different arrival rewards. Herein, the arrival ratio is the ratio at which the UAV arrives at the destination successfully over the last 100 episodes. Fig. 5 shows that the objective function converges to a stable value around the th episode under the reward based on the progress estimate, while it takes episodes to reach the same value under the sparse arrival reward. Fig. 5 also shows that the goal of reaching the destination is achieved more quickly and stably under the progress-estimate-based reward.
In Fig. 6, we investigate the effect of the temperature parameter on convergence, which is utilized to control the randomness of the policy. It is shown that the return can barely be improved when . This is because the policy entropy is not included in the objective of SACTR in this case, which causes low exploration by the algorithm. This indicates that SACTR cannot be fully improved when it only has the structure of the AC method without policy entropy in the DRL objective. In addition, a quite large (=0.4) can also lead to a local optimum, because an excessively large makes the algorithm pursue the improvement of randomness instead of the accumulated reward. Thus, we set in the following simulations.
In Fig. 7, we share our experience in setting hyperparameters. Fig. 7 plots the moving average of the return under different learning rates in the Adam optimizer. We observe that when is quite large (= 0.01), the algorithm converges to a local optimum, while a small stabilizes but decelerates convergence. Trading off performance against convergence speed, we set in the following simulations. At a fixed interval, SACTR randomly selects a batch of samples from the experience replay memory to train the neural networks. In Fig. 7 and 7, we respectively study the effect of the size of the experience replay memory and the training batch size on convergence. Fig. 7 indicates that either a too-small or a too-large memory size decelerates convergence. Therefore, we select a memory size of 100,000. Fig. 7 shows that the algorithm may converge to a local optimum when the batch size is too small (= 32), while the convergence speed slows down when the batch size is quite large ( 256). To maximize the convergence speed, we set batch size = 64 or 128 in simulations under different cases. In SACTR, the Q-functions update the parameters of the target networks by using the exponentially moving average with a parameter called the target smoothing coefficient, . In Fig. 7, we study the effect of on convergence. It shows that a too-large decelerates convergence, since an overly fast update of the target network destabilizes convergence, while a too-small also reduces the convergence speed. For fast convergence, we set in the following simulations.
VI-C Effect of Fairness Index Exponent
In Fig. 8, we investigate the effect of the exponent of the fairness index on the sum computation bits of all terminals and on the fairness index, since this exponent is used to adjust the importance of fairness in the objective function. As the exponent increases, the sum computation bits declines while the fairness index increases monotonically. This indicates that the preference of SACTR between sum computation bits and fairness can be effectively adjusted by the exponent. A large exponent can be set when quite high fairness is demanded. In contrast, if the aim is only to maximize the sum computation bits, the exponent should be set to zero.
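The fairness index used here is Jain's index [21]. A minimal sketch of how an exponent can weight fairness against the sum computation bits is given below; the combined objective `weighted_objective` is illustrative, not the paper's exact formula:

```python
def jain_index(x):
    """Jain's fairness index: ranges from 1/n (one user gets everything)
    to 1 (all users get equal shares)."""
    n = len(x)
    total = sum(x)
    sum_sq = sum(v * v for v in x)
    return (total * total) / (n * sum_sq) if sum_sq > 0 else 0.0


def weighted_objective(bits, beta):
    """Illustrative objective: sum computation bits scaled by the fairness
    index raised to an exponent beta; beta = 0 ignores fairness entirely."""
    return sum(bits) * jain_index(bits) ** beta
```

With `beta = 0` the objective reduces to the plain sum of computation bits, matching the behavior discussed above; a larger `beta` penalizes unfair allocations more strongly.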
VI-D Optimal Policy
The optimal policy given by SACTR includes the optimal UAV trajectory and the optimal resource allocation of the terminals. Fig. 9(a) shows an example of the optimal UAV trajectory under a nonzero fairness index exponent. During the entire flight time, the UAV first hovers over the four mobile terminals, which ensures fairness among the terminals, then flies towards the destination at high speed during the last few slots, and finally arrives at the destination.
Fig. 9(b) plots the UAV trajectory when the fairness index exponent is zero. Unlike the previous case, the UAV first flies quickly to terminal 1, then adjusts its trajectory so as to stay directly above terminal 1, and finally reaches the destination. The difference arises because, with a zero exponent, SACTR aims only to maximize the sum computation bits; the algorithm therefore minimizes the distance between the UAV and one of the terminals to enhance communication and energy harvesting. This reveals that with a zero exponent SACTR cannot guarantee fairness, and unfair behavior such as that in Fig. 9(b) may appear.
Fig. 10 plots an example of the optimal resource allocation of terminal 1 under a nonzero fairness index exponent. From top to bottom, Fig. 10 exhibits the transmit power, the proportion of offloading time, the CPU frequency, and the distance between the UAV and terminal 1 over the entire flight time. We observe that the resource allocation of the terminal cooperates with the UAV trajectory for good performance. In particular, when the UAV flies near terminal 1 (slots 13 to 19), the transmit power and the proportion of offloading time of this terminal increase, so as to exploit the high channel gain at this moment and offload more computation tasks. The CPU frequency is relatively uniform over the entire flight time.
VI-E Comparison with Benchmark Algorithms
To evaluate the performance of SACTR, we compare it with the following representative benchmarks.

(1) Hover–fly–hover (HFH) trajectory algorithm. The HFH trajectory of the UAV is widely used to serve terminals with fixed locations [3, 33, 34]. In HFH, the UAV flies to and hovers over a set of specified locations in turn. To serve mobile terminals, we make a simple adaptation to HFH: the UAV flies to and follows each mobile terminal in turn for an equal amount of time. While following a terminal, it stays directly above that terminal in each slot. After serving a terminal, it flies to the next terminal at the maximum speed. The UAV also reserves the minimum time needed to arrive at the destination.

(2) Straight trajectory algorithm. In this algorithm, the UAV flies straight from the starting point to the destination at constant speed.

(3) Greedy local algorithm. In this algorithm, each terminal exhausts all the energy in its battery on local computation in each slot.

(4) Greedy offloading algorithm. Each terminal spends all the energy in its battery on computation offloading in each slot, while the offloading times of all terminals are equal.

(5) Random algorithm. In this algorithm, the flight speed and direction of the UAV, as well as the transmit power, offloading time, and CPU frequency of each terminal, are picked randomly in each slot.
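As a concrete illustration of the simplest benchmark, the waypoints of the straight trajectory can be generated as below (a hypothetical sketch with 2-D coordinates and one waypoint per slot boundary; the paper's actual simulator and state representation are not reproduced):

```python
def straight_trajectory(start, dest, num_slots):
    """Waypoints of the straight-trajectory baseline: the UAV flies from
    `start` to `dest` at constant speed over `num_slots` slots."""
    (sx, sy), (dx, dy) = start, dest
    return [(sx + (dx - sx) * t / num_slots,
             sy + (dy - sy) * t / num_slots)
            for t in range(num_slots + 1)]
```

The HFH baseline differs only in that the per-slot waypoint tracks the current position of the terminal being followed rather than a fixed endpoint.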
For a fair comparison with SACTR, the resource allocation in the HFH trajectory algorithm and the straight trajectory algorithm is optimized by our algorithm. Likewise, the UAV trajectory in the greedy local algorithm and the greedy offloading algorithm is optimized by our algorithm.
In Fig. 11(a), we investigate the stable value of the objective function after convergence under SACTR and the above-mentioned algorithms, with the UAV transmit power varying from 0.1 to 21.1 Watts. As the transmit power increases, the mobile terminals can harvest more energy for local computation or computation offloading, so the objective functions of all algorithms improve. The greedy local algorithm is better than the greedy offloading algorithm when the transmit power is small; otherwise, the greedy offloading algorithm is better. The objective function of the HFH trajectory algorithm is always high, since the HFH trajectory guarantees high fairness while keeping the distance between the UAV and the terminals as small as possible to maximize the computation bits. The objective function of SACTR is always close to or exceeds those of the other benchmarks. SACTR is sometimes worse than the HFH trajectory algorithm, which is due to the error caused by the inconsistency between the objective function of the optimization problem and the objective of SACTR.
Different from Fig. 11(a), Fig. 11(b) considers a situation in which there is a greater degree of difference in the mobility of the terminals: the average speed of each terminal is different, and so are the memory parameters of its mobility model. As Fig. 11(b) shows, the objective function of SACTR still approaches or exceeds the other benchmarks in this situation. We observe that the advantage of the trajectory planning of our algorithm becomes more prominent here. For example, the superiority of SACTR and of the greedy offloading algorithm, whose UAV trajectory is optimized by our algorithm, is more evident than in Fig. 11(a). This is because the mobility models of the terminals differ substantially in this situation, and under the trajectory planning of our algorithm the UAV can approach each terminal with flexible timing and a flexible trajectory, whereas HFH still allocates equal flight time to follow each terminal with an inflexible trajectory, and the straight trajectory algorithm has a completely fixed UAV trajectory.
Figs. 11(a) and 11(b) reveal the advantage of the trajectory planning of SACTR through the comparison with the HFH and straight trajectories. The superiority of the resource allocation of SACTR is likewise highlighted by the comparison with the greedy local and greedy offloading algorithms; the reason is that SACTR makes overall arrangements over the entire flight time. Since benchmarks (1)-(4) have only part of their parameters optimized by our algorithm, these comparisons reflect the benefit of the joint optimization of all parameters in SACTR.
VI-F Usability and Adaptability
TABLE I: Latency of SACTR under different numbers of terminals.

| Terminal number   | 2    | 4    | 8    | 16   | 32   |
| Execution latency | 4.13 | 4.30 | 4.35 | 4.49 | 5.07 |
| Update latency    | 5.10 | 5.23 | 5.27 | 5.30 | 5.85 |
Finally, we evaluate the usability and adaptability of SACTR by simulations.
Since SACTR is executed at the beginning of each slot, its execution latency has a large impact on its usability. We therefore evaluate the execution latency on a desktop with an Intel Core i5-4590 3.3-GHz CPU and 8 GB of memory, and report the results in Table I. The reported single-execution latency is an average over repeated executions. Under different numbers of mobile terminals, the execution latency remains nearly constant and is much lower than the length of a slot, such as 0.1 s, so its impact on performance can be ignored.
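The latency measurement can be reproduced with a simple wall-clock average (a generic sketch; `fn` stands in for a single execution of the trained policy, which is not shown here):

```python
import time


def mean_latency(fn, repeats=1000):
    """Average wall-clock latency of a single call to `fn`, in seconds,
    measured over `repeats` consecutive executions."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats
```

Averaging over many repetitions smooths out scheduler jitter, which matters when the per-call latency is a small fraction of the slot length.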
In some situations, SACTR continues to be trained during use at a fixed interval so as to adapt to unpredictable changes of the environment. We therefore also evaluate the latency of a single update of SACTR in Table I, averaged over repeated updates with a batch size of 64. Under different numbers of terminals, the update latency is also nearly constant. It follows that SACTR should be updated at an interval larger than the update latency. During use, SACTR can be trained on the CPU of the UAV, or on a cloud server by transmitting data to it via satellite communication [20].
On the other hand, we examine the adaptability of SACTR to unexpected changes of the mobility model of the terminals. In Fig. 12, we investigate large-scale changes of the average speed of the terminals. At a certain slot, the average speeds of two terminals are both quadrupled, and they are later reduced to their original values. As we can see, SACTR adapts to these abrupt changes almost instantly. Later, the average speeds of all terminals are quadrupled; the performance of SACTR drops drastically in an instant but quickly converges to a stable value again. Finally, the average speeds of all terminals are reduced to a quarter of their values, and SACTR handles this change smoothly.
In Fig. 13, we study the impact of large-scale changes of the memory parameters of the mobility model of the terminals. Initially, the memory parameters of all terminals are 0.5. At four later episodes, they simultaneously change to 1, back to 0.5, then to 0, and back to 0.5 again. For the first three changes, SACTR adapts instantaneously. For the last change, the return first drops drastically and then quickly ascends to a stable value. Figs. 12 and 13 demonstrate that SACTR adapts well to unexpected changes of the environment. In reality, even if there is a certain degree of difference between the model of the training samples and the real environment, SACTR can handle it.
VII Conclusion
In this paper, we study a UAV-assisted wireless powered MEC system for mobile terminals. By combining the computation rate and a fairness index in our objective function, we aim to jointly optimize and continuously control the UAV trajectory and the resource allocation of the terminals. An SAC-based algorithm, named SACTR, is proposed for trajectory planning and resource allocation to solve this complex high-dimensional DRL task. In SACTR, the reward is designed as a heterogeneous function comprising a computation reward and an arrival reward. The computation reward integrates a fairness index to improve the computation rate while guaranteeing fairness, and the arrival reward, based on a progress estimate, guides the UAV to reach the specified destination and promotes convergence. Simulation results show that SACTR converges stably and adapts quickly to drastic changes of the environment. Compared to widely used benchmarks, such as the HFH trajectory, the straight trajectory, the greedy algorithms, and the random algorithm, the performance of SACTR exceeds or approaches them in various situations.
References
 [1] F. Zhou, Y. Wu, R. Q. Hu, and Y. Qian, "Computation rate maximization in UAV-enabled wireless-powered mobile-edge computing systems," IEEE J. Sel. Areas Commun., vol. 36, no. 9, pp. 1927–1941, 2018.
 [2] F. Zhou, Y. Wu, H. Sun, and Z. Chu, "UAV-enabled mobile edge computing: Offloading optimization and trajectory design," in Proc. IEEE Int. Conf. Commun., 2018, pp. 1–6.
 [3] Y. Du, K. Yang, K. Wang, G. Zhang, Y. Zhao, and D. Chen, "Joint resources and workflow scheduling in UAV-enabled wirelessly-powered MEC for IoT systems," IEEE Trans. Veh. Technol., vol. 68, no. 10, pp. 10187–10200, 2019.
 [4] Y. Liu, K. Xiong, Q. Ni, P. Fan, and K. B. Letaief, "UAV-assisted wireless powered cooperative mobile edge computing: Joint offloading, CPU control, and trajectory optimization," IEEE Internet Things J., vol. 7, no. 4, pp. 2777–2790, 2020.
 [5] X. Hu, K.-K. Wong, and Z. Zheng, "Wireless-powered mobile edge computing with cooperated UAV," in Proc. IEEE SPAWC, 2019, pp. 1–5.
 [6] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, "A survey on mobile edge computing: The communication perspective," IEEE Commun. Surveys Tuts., vol. 19, no. 4, pp. 2322–2358, 2017.
 [7] X. Lu, P. Wang, D. Niyato, D. I. Kim, and Z. Han, "Wireless networks with RF energy harvesting: A contemporary survey," IEEE Commun. Surveys Tuts., vol. 17, no. 2, pp. 757–789, 2015.
 [8] P. Juang, H. Oki, Y. Wang, M. Martonosi, L. S. Peh, and D. Rubenstein, "Energy-efficient computing for wildlife tracking: Design tradeoffs and early experiences with ZebraNet," SIGARCH Comput. Archit. News, vol. 30, no. 5, pp. 96–107, Oct. 2002.
 [9] Q. Liu, L. Shi, L. Sun, J. Li, M. Ding, and F. Shu, "Path planning for UAV-mounted mobile edge computing with deep reinforcement learning," IEEE Trans. Veh. Technol., vol. 69, no. 5, pp. 5723–5728, 2020.
 [10] C. You, K. Huang, and H. Chae, "Energy efficient mobile cloud computing powered by wireless energy transfer," IEEE J. Sel. Areas Commun., vol. 34, no. 5, pp. 1757–1771, 2016.
 [11] S. Bi and Y. J. Zhang, "Computation rate maximization for wireless powered mobile-edge computing with binary computation offloading," IEEE Trans. Wireless Commun., vol. 17, no. 6, pp. 4177–4190, 2018.
 [12] S. Mao, S. Leng, K. Yang, X. Huang, and Q. Zhao, "Fair energy-efficient scheduling in wireless powered full-duplex mobile-edge computing systems," in Proc. IEEE Global Commun. Conf., 2017, pp. 1–6.
 [13] M. Min, L. Xiao, Y. Chen, P. Cheng, D. Wu, and W. Zhuang, "Learning-based computation offloading for IoT devices with energy harvesting," IEEE Trans. Veh. Technol., vol. 68, no. 2, pp. 1930–1941, 2019.
 [14] L. Huang, S. Bi, and Y.-J. A. Zhang, "Deep reinforcement learning for online computation offloading in wireless powered mobile-edge computing networks," IEEE Trans. Mobile Comput., vol. 19, no. 11, pp. 2581–2593, 2020.
 [15] S. Yin, S. Zhao, Y. Zhao, and F. R. Yu, "Intelligent trajectory design in UAV-aided communications with reinforcement learning," IEEE Trans. Veh. Technol., vol. 68, no. 8, pp. 8227–8231, 2019.
 [16] C. H. Liu, Z. Chen, J. Tang, J. Xu, and C. Piao, "Energy-efficient UAV control for effective and fair communication coverage: A deep reinforcement learning approach," IEEE J. Sel. Areas Commun., vol. 36, no. 9, pp. 2059–2070, 2018.
 [17] R. Zhong, X. Liu, Y. Liu, and Y. Chen, "Multi-agent reinforcement learning in NOMA-aided UAV networks for cellular offloading," IEEE Trans. Wireless Commun., pp. 1–1, 2021.
 [18] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," 2019.
 [19] A. Al-Hourani, S. Kandeepan, and S. Lardner, "Optimal LAP altitude for maximum coverage," IEEE Wireless Commun. Lett., vol. 3, no. 6, pp. 569–572, 2014.
 [20] N. Cheng, F. Lyu, W. Quan, C. Zhou, H. He, W. Shi, and X. Shen, "Space/aerial-assisted computing offloading for IoT applications: A learning-based approach," IEEE J. Sel. Areas Commun., vol. 37, no. 5, pp. 1117–1129, 2019.
 [21] R. K. Jain, D.-M. Chiu, and W. R. Hawe, "A quantitative measure of fairness and discrimination for resource allocation in shared computer systems," Eastern Res. Lab., Tech. Rep. DEC-TR-301, 1984.
 [22] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, and G. Ostrovski, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
 [23] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 1998.
 [24] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in Proc. ICML, New York, NY, USA, 2016, pp. 1928–1937.
 [25] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," in Proc. ICML, 2018, pp. 1861–1870.
 [26] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, and S. Levine, "Soft actor-critic algorithms and applications," CoRR, vol. abs/1812.05905, 2018.
 [27] B. Zhang, C. H. Liu, J. Tang, Z. Xu, J. Ma, and W. Wang, "Learning-based energy-efficient data collection by unmanned vehicles in smart cities," IEEE Trans. Ind. Informat., vol. 14, no. 4, pp. 1666–1676, 2018.
 [28] M. J. Mataric, "Reward functions for accelerated learning," in Mach. Learn. Proc. 1994. San Francisco, CA, USA: Morgan Kaufmann, 1994, pp. 181–189.
 [29] H. Sun, Y.-X. Guo, M. He, and Z. Zhong, "Design of a high-efficiency 2.45-GHz rectenna for low-input-power energy harvesting," IEEE Antennas Wireless Propag. Lett., vol. 11, pp. 929–932, 2012.
 [30] Y. Wang, M. Sheng, X. Wang, L. Wang, and J. Li, "Mobile-edge computing: Partial computation offloading using dynamic voltage scaling," IEEE Trans. Commun., vol. 64, no. 10, pp. 4268–4282, 2016.
 [31] R. I. Bor-Yaliniz, A. El-Keyi, and H. Yanikomeroglu, "Efficient 3-D placement of an aerial base station in next generation cellular networks," in Proc. IEEE Int. Conf. Commun., 2016, pp. 1–5.
 [32] S. Batabyal and P. Bhaumik, "Mobility models, traces and impact of mobility on opportunistic routing algorithms: A survey," IEEE Commun. Surveys Tuts., vol. 17, no. 3, pp. 1679–1707, 2015.
 [33] J. Xu, Y. Zeng, and R. Zhang, "UAV-enabled wireless power transfer: Trajectory design and energy optimization," IEEE Trans. Wireless Commun., vol. 17, no. 8, pp. 5092–5106, 2018.
 [34] L. Xie, J. Xu, and R. Zhang, "Throughput maximization for UAV-enabled wireless powered communication networks," IEEE Internet Things J., vol. 6, no. 2, pp. 1690–1703, 2019.