In recent years, unmanned aerial vehicle (UAV)-assisted wireless powered mobile-edge computing (MEC) networks have attracted increasing attention [1, 2, 3, 4, 5]. Owing to technological advances, today’s UAVs can be equipped with energy transmitters and with MEC servers that have strong computing capabilities. Such a UAV can perform not only computation offloading via MEC technology but also wireless charging via wireless power transfer (WPT) for mobile terminals, which need to run computation-intensive applications but have limited computing capacity and battery lifetime. The UAV is thus suitable for building temporary MEC systems for mobile terminals in special situations. For example, the UAV can serve mobile terminals where the base station (BS) is damaged, in public meeting places where there is a traffic hotspot, or in remote fields where there is a coverage hole in the wireless network.
The UAV-assisted wireless powered MEC networks were previously investigated for ground terminals with fixed locations [1, 2, 3, 4, 5]. To maximize the network utility in terms of computation rate [1, 5] or to minimize the energy consumption [2, 3, 4], previous works optimized the UAV trajectory, offloading decisions, and resource allocation. Sometimes, the UAV is required to arrive at specified locations automatically, so that it can be used in places that are difficult for people to reach, thereby reducing labor costs [1, 4]. The previous works solved the optimization problems with offline algorithms such as successive convex approximation [1, 2, 4] and block coordinate descent. The numerical results in [1, 2, 3, 4, 5] show that these algorithms work well in scenarios where the locations of terminals are fixed.
However, terminals in practice, such as smartphones, tablets, wearable devices, and tracking collars carried by wildlife, are typically in motion, and their trajectories are likely to be stochastic. To serve mobile terminals, online algorithms are needed to make decisions based on real-time information. Unfortunately, the existing algorithms designed for terminals with fixed locations [1, 2, 3, 4, 5] are offline algorithms and may not work well in these scenarios, since all of them require environment information a priori.
In this paper, we study a UAV-assisted wireless powered MEC network where a flying UAV serves multiple mobile terminals. We aim to maximize the computation rate of all terminals while ensuring the fairness among them. Herein, considering fairness is to balance the computation performance of different terminals. We demonstrate that this problem is a joint optimization and continuous control problem of the UAV trajectory and the resource allocation of terminals, under the condition that the trajectories of terminals are stochastic. Therefore, we propose a soft actor-critic (SAC) based deep-reinforcement-learning (DRL) algorithm for trajectory planning and resource allocation (SAC-TR). Since this problem is a complex high-dimensional DRL task, SAC-TR combines off-policy and maximum entropy reinforcement learning to ensure sampling efficiency and stabilize convergence at the same time. Taking the computation rate, fairness, and reaching destination into consideration, we design the reward in SAC-TR as a heterogeneous function to satisfy multiple objectives simultaneously. The simulation results show that SAC-TR outperforms representative benchmarks in most cases.
The main contributions of this paper are highlighted as follows.
To the best of our knowledge, we are the first to provide an online algorithm for trajectory planning and resource allocation for mobile terminals in the UAV-assisted wireless powered MEC network.
By integrating a fairness index into the reward of SAC-TR, we guarantee the fairness among different terminals according to the needs of scenarios.
By using the progress estimate in the reward of SAC-TR, the UAV can reach the specified destination automatically and the convergence of SAC-TR is accelerated.
SAC-TR converges steadily and adapts quickly to unexpected changes in the environment.
The rest of this paper is organized as follows. In Section II, we review the related works. We model the UAV-assisted wireless powered MEC network in Section III and formulate the trajectory planning and resource allocation problem in Section IV. In Section V, the detailed design of SAC-TR is given. We evaluate the performance of our algorithm by simulations in Section VI. Finally, the paper is concluded in Section VII.
II Related Works
The previous works related to our paper include those focusing on wireless powered MEC networks [10, 11, 12, 13, 14, 1, 2, 3, 4, 5], and UAV-assisted communication networks for mobile terminals [15, 16, 17].
II-A Wireless Powered MEC Network
The wireless powered MEC network can be divided into two types according to the carriers of MEC servers and energy transmitters. The first type is the wireless powered MEC network, where the BS is the carrier [10, 11, 12, 13, 14], while the second type is the UAV-assisted wireless powered MEC network, where the UAV is the carrier [1, 2, 3, 4, 5].
In the system with the BS as the carrier, the works mainly focus on resource allocation and offloading decision [10, 11, 12, 13, 14]. In  and , the parameters such as offloading decision, CPU frequency, and transmission time of terminals were optimized to minimize the energy consumption and maximize the computation rate, respectively. Different from [10, 11], Mao et al. investigated the max-min energy efficiency optimization problem to guarantee the fairness of energy efficiency among different devices. To exploit the advantages of DRL in handling problems with a sophisticated state space and a time-varying environment, Min et al.  proposed a deep Q network (DQN) based offloading policy for energy-harvesting MEC networks to improve the computation performance. Huang et al.  proposed a DRL-based online computation offloading (DROO) framework. Instead of solving for the hybrid integer-continuous solution all at once, DROO decomposes the optimization problem into a binary offloading decision subproblem and a continuous resource allocation subproblem, and tackles them by deep learning and traditional optimization methods, respectively.
In the system with the UAV as the carrier, the previous works only considered the case where the locations of terminals are fixed [1, 2, 3, 4, 5]. To maximize the weighted sum computation rate of terminals, Zhou et al.  jointly optimized the CPU frequencies, transmit powers, and offloading times of terminals as well as the UAV trajectory. Ref.  minimized the energy consumption of the UAV while guaranteeing the computation rate of all terminals. Ref.  proposed a time-division multiple access (TDMA) based workflow model, which allows parallel transmitting and computing. In particular, the UAV was arranged to hover over designated successive positions, and the parameters such as the service sequence of terminals, computing resource allocation, and hovering time of UAV were jointly optimized. To assist the service of UAV, Liu et al.  utilized idle sensor devices to cooperate with the UAV to provide computation offloading service for busy sensor devices, and Hu et al.  utilized access points (APs) to offer wireless power and computation offloading services for the UAV. The offline algorithms in [1, 2, 3, 4, 5] require the system information a priori.
II-B UAV-assisted Communication Network for Mobile Terminals
The UAV-assisted communication network for mobile terminals is similar to the UAV-assisted MEC network for mobile terminals. The difference is that the UAV in the former case carries out traffic offloading, while that in the latter case performs computation offloading. The major problem in the UAV-assisted communication network is resource allocation and UAV trajectory design for performance optimization. Ref. proposed a deterministic policy gradient (DPG) based algorithm to maximize the expected uplink sum rate of terminals. Ref. considered the scenario where a group of UAVs is employed to enhance the communication coverage area. That work proposed an actor-critic (AC) based algorithm to optimize the UAV trajectory, such that objectives including coverage expansion, fairness improvement, and power saving can be achieved. However, DPG  and AC  converge poorly when applied to a complex high-dimensional DRL task such as the problem considered in this paper, which jointly optimizes multiple types of parameters. Ref. aimed to maximize the throughput of a UAV-assisted cellular offloading network. Ref. discretized the flight direction of the UAV and the transmit power of terminals and devised a value-based DRL algorithm. Since this algorithm has to search the action space exhaustively in each iteration, it cannot be used for problems with high-dimensional or continuous actions.
III UAV-assisted Wireless Powered MEC Network
Fig.1 illustrates the UAV-assisted wireless powered MEC network considered in this paper. There are a set of mobile terminals, denoted by and a UAV. All the terminals move on the ground with altitude 0, and the UAV flies at a fixed altitude, denoted by , such that it can avoid frequent ascent and descent to evade surface obstacles. The UAV is equipped with MEC servers and energy transmitters and serves these mobile terminals, which have limited battery lifetime and computing capacity. Each terminal has accumulated computation tasks, which can be divided into two parts. One part is executed locally by the mobile terminal and the other part is offloaded to and executed at the UAV, which is known as the partial offloading mode. Meanwhile, the UAV broadcasts radio-frequency (RF) energy to all mobile terminals, and the terminals harvest the energy and store it in a rechargeable battery. The UAV/terminal can perform energy transferring/harvesting, computing, and data exchange simultaneously [10, 11, 12, 1]. The UAV is required to arrive at a designated location at the end of the flight [1, 4].
III-A Computation Offloading
Mobile terminals adopt the TDMA protocol to communicate with the UAV, as illustrated in Fig.2. The flight time of the UAV, denoted by , is discretized into slots. Each time slot is short enough that the locations of the UAV and terminals and the channel gain remain almost unchanged within it. In each slot, the mobile terminals offload computation tasks to the UAV in a round-robin manner and download the computation results from it after completion.
III-A1 Computation Time and Data-exchange Time
We denote as the proportion of uploading time of the th terminal in slot . In general, the computing capacity of the MEC servers on the UAV is powerful and the size of computation result is quite small. Thus, we assume that
III-A2 Channel Condition
The data exchange between the UAV and the terminals is influenced by the wireless channel conditions. In our model, we assume that
the impact of the Doppler effect in data exchange due to the position changes of the UAV and mobile terminals can be perfectly compensated by the receivers, and
Employing the three-dimensional (3D) Euclidean coordinate, we let be the horizontal plane coordinate of the th terminal and be that of the UAV in slot . Under the air-to-ground model, the path loss between the th terminal and the UAV in the th slot is given by , as
where is the Euclidean norm, is the carrier frequency, is the speed of light, is the probability that the link between the terminal and the UAV is a Line-of-Sight (LoS) link, and and are the additional losses caused by the LoS and non-LoS (NLoS) links on top of the free-space path loss. The values of and are determined by the environment, such as urban or rural. According to , is given by
where and are two constants determined by the environments . Accordingly, the channel power gain between the th terminal and the UAV in slot is given by
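The channel model described above can be sketched as follows, assuming the widely used air-to-ground form in which the LoS probability is a sigmoid of the elevation angle (in degrees) and the path loss is the free-space loss plus a probability-weighted excess loss. The function names, argument names, and the sigmoid form are assumptions for illustration, since the equations themselves are not reproduced in this text.

```python
import math

C_LIGHT = 3.0e8  # speed of light in m/s


def los_probability(height, horizontal_dist, a, b):
    """LoS probability as a sigmoid of the elevation angle (assumed form).

    `a` and `b` stand for the two environment-dependent constants.
    """
    elevation_deg = math.degrees(math.atan2(height, horizontal_dist))
    return 1.0 / (1.0 + a * math.exp(-b * (elevation_deg - a)))


def path_loss_db(uav_xy, term_xy, height, fc, a, b, eta_los, eta_nlos):
    """Average path loss in dB between a ground terminal and the UAV."""
    horizontal = math.hypot(uav_xy[0] - term_xy[0], uav_xy[1] - term_xy[1])
    dist_3d = math.hypot(horizontal, height)
    p_los = los_probability(height, horizontal, a, b)
    free_space = 20.0 * math.log10(4.0 * math.pi * fc * dist_3d / C_LIGHT)
    return free_space + p_los * eta_los + (1.0 - p_los) * eta_nlos


def channel_gain(pl_db):
    """Convert a path loss in dB to a linear channel power gain."""
    return 10.0 ** (-pl_db / 10.0)
```

The gain shrinks as the terminal moves away from the point directly below the UAV, both because the distance grows and because the elevation angle (and hence the LoS probability) drops.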
III-A3 Task Offloading by Terminals
The computation tasks that the terminal uploads to the UAV include the raw data and the communication overhead, such as the encryption and the packet header . We assume that
each bit of raw data needs bits of upload data.
Recall that the th terminal offloads data to the UAV with duration in slot . The volume of raw data that terminal offloads to the UAV in slot is
where is the transmit power of terminal in slot , is the offloading bandwidth, and is the noise power at the terminal. It follows that the energy consumption for offloading these data is .
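As a rough illustration of the offloading model, the sketch below assumes that the upload rate follows the Shannon capacity formula and that the upload overhead simply divides the number of raw bits delivered. All names (`delta`, `slot_len`, `overhead`, and so on) are hypothetical stand-ins for the symbols omitted in this text.

```python
import math


def offloaded_raw_bits(delta, slot_len, bandwidth, power, gain, noise, overhead):
    """Raw bits offloaded by one terminal in one slot (assumed Shannon-rate form).

    delta    -- proportion of the slot used for uploading
    overhead -- upload bits needed per bit of raw data (assumed >= 1)
    """
    upload_time = delta * slot_len
    rate = bandwidth * math.log2(1.0 + gain * power / noise)
    return upload_time * rate / overhead


def offloading_energy(delta, slot_len, power):
    """Energy spent on transmission: transmit power times upload time."""
    return power * delta * slot_len
```

With these definitions, raising the transmit power increases the offloaded bits only logarithmically while the energy cost grows linearly, which is why power and offloading time must be tuned jointly.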
III-B Local Computation
Mobile terminals execute local computation tasks and adjust the CPU frequency using the dynamic voltage and frequency scaling technique in each slot [1, 11, 12]. Let be the CPU frequency (unit: cycle/s) of the th terminal in slot , and let be the number of CPU cycles required for computing one bit of raw data. Then, the number of local computation bits of the th terminal in the th slot is
Accordingly, the local energy consumption of the th terminal in the th slot is given by , where is the effective capacitance coefficient of the processor chip .
Let be the number of computation bits of the th terminal in the th slot, including both the local and the offloaded ones. is given by
Thus, the total computation bits of the th terminal in the entire flight time are .
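The local computation model can be sketched in a few lines, assuming the common dynamic-voltage-and-frequency-scaling form in which the bits computed scale linearly with the CPU frequency and the energy scales with its cube. The function and parameter names are hypothetical, and the cubic energy law is an assumption based on standard MEC models rather than a formula quoted from this text.

```python
def local_bits(cpu_freq, slot_len, cycles_per_bit):
    """Bits computed locally in one slot at the given CPU frequency."""
    return cpu_freq * slot_len / cycles_per_bit


def local_energy(cpu_freq, slot_len, kappa):
    """Local computing energy; kappa is the effective capacitance coefficient."""
    return kappa * cpu_freq ** 3 * slot_len


def total_bits(per_slot_local, per_slot_offloaded):
    """Total computation bits over the flight: local plus offloaded, summed over slots."""
    return sum(l + o for l, o in zip(per_slot_local, per_slot_offloaded))
```

Because energy grows cubically while bits grow linearly, spreading local computation evenly over the slots is cheaper than computing in bursts.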
III-C Wireless Power Transfer
The UAV broadcasts RF energy to all mobile terminals continuously during its flight time. We assume that
the energy of the UAV is sufficient, and
the transmit power of the UAV is a constant, .
The energy harvested by the th terminal in slot is , where is the energy conversion efficiency.
IV Problem Formulation
To ensure the performance of each terminal, we aim to maximize the sum computation bits of all mobile terminals in the entire flight time , while guaranteeing the fairness of computation bits among different terminals. Based on Jain’s fairness index , we define the fairness index () as
Clearly, a larger indicates higher fairness. Accordingly, we define the objective function as a joint function of the computation bits and fairness, as , where is a non-negative integer used to adjust the proportion of in the objective function.
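A small sketch may help to see how the fairness exponent trades off sum computation bits against fairness. It uses the standard definition of Jain's index and assumes that the objective is the sum of bits multiplied by the fairness index raised to the exponent; that product form, the name `omega`, and the all-zero convention are assumptions made for illustration.

```python
def jain_fairness(bits):
    """Jain's fairness index: (sum x)^2 / (N * sum x^2), in (0, 1]."""
    total = sum(bits)
    squares = sum(b * b for b in bits)
    if squares == 0:
        return 1.0  # convention chosen here: an all-zero allocation counts as fair
    return total * total / (len(bits) * squares)


def objective(bits, omega):
    """Assumed joint objective: fairness^omega times the sum of computation bits."""
    return jain_fairness(bits) ** omega * sum(bits)
```

For example, serving only one of four terminals gives a fairness index of 0.25, so with exponent 2 the objective keeps only 1/16 of the sum bits, whereas with exponent 0 fairness is ignored entirely.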
We intend to optimize the UAV trajectory and the resource allocation of terminals during the flight time of the UAV. Let and be the flight speed and direction of the UAV in slot , respectively. The UAV trajectory is described by and , where . The resource allocation variables include the transmit powers, offloading times, and CPU frequencies in all the slots. In particular, the resource allocation variables are , , and . Note that, the transmit power and offloading time affect the offloading performance, and the CPU frequency decides the number of local computation bits. Consequently, to maximize the objective function, we should jointly optimize the flight speed and direction of the UAV, and the transmit powers, offloading times, and CPU frequencies of mobile terminals in each slot.
Our optimization problem is formulated as
where is the initial energy of the th terminal, is the maximum horizontal flying speed of UAV, and are respectively the locations of UAV in the first slot and after the last slot, and are the locations of designed starting point and the destination.
indicates that the transmit powers and CPU frequencies of terminals should be non-negative. restricts that, by each slot, the accumulated energy consumption of a terminal cannot exceed the sum of the initial energy and the energy harvested by this terminal. states that the sum of offloading time of all terminals in each slot cannot exceed the duration of a slot. and give the range of the flight speed and direction of the UAV. and restrict the starting point and the destination of the UAV.
is a sequential decision problem that can be characterized by a Markov decision process (MDP). The moving trajectories of terminals may be unpredictable and cannot be known in advance. Also, the problem involves the joint optimization and continuous control of high-dimensional parameters. As a result, traditional optimization approaches fail to solve this problem. For example, offline algorithms such as dynamic programming, successive convex approximation, or block coordinate descent require the system information a priori; the DQN method can only deal with problems with discrete or low-dimensional actions ; also, it is a challenge for the policy gradient method  or the actor-critic (AC) method  to maintain both high sample efficiency and stable convergence at the same time , when they are employed to handle a -like complex high-dimensional DRL task. We thus introduce the soft actor-critic (SAC) method  to solve this problem in the next section.
V SAC-based Algorithm for Trajectory Planning and Resource Allocation
In this section, we propose an SAC-based trajectory planning and resource allocation (SAC-TR) algorithm to solve problem . To deal with this complex high-dimensional DRL task, SAC-TR adopts the combination of off-policy and maximum entropy reinforcement learning in the SAC method, so as to increase sampling efficiency and stabilize convergence at the same time. Taking into consideration the computation rate, fairness, and reaching of destination, we design a heterogeneous reward function in SAC-TR. SAC-TR is introduced in the following three parts. We present the main design of SAC-TR in Section V-A and the heterogeneous reward function in Section V-B. Section V-C introduces the maximum entropy reinforcement learning and gives the gradient descent formulas of the neural networks in SAC-TR.
V-A Design of SAC-TR
Fig.3 plots the structure of SAC-TR, which consists of a policy function, denoted by , two Q-functions, denoted by and , two target networks, denoted by and , and an experience replay memory, where and are the environment state and the action in slot , respectively.
V-A1 Policy Function
The policy function acts as the “actor”. In slot , the policy collects the state information from the network. The state includes a 2-dimensional UAV location, a -dimensional terminal location, an -dimensional terminal battery energy, and a 1-dimensional current slot, which is formally defined by
where is the battery energy of terminal in slot . According to state , the policy function takes an action defined by a
to adjust the horizontal flight speed and direction of the UAV, the transmit power, CPU frequency, and offloading time proportion of each terminal.
The policy function is implemented by a deep neural network (DNN), of which the parameter is denoted by . The DNN has two output layers. During the training process, the DNN generates the mean and covariance of a Gaussian random variable at the two output layers. By sampling the Gaussian random variable and then restricting it via a tanh function, the policy function produces an action.
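The sampling step can be illustrated without a neural network. The sketch below assumes a diagonal Gaussian (one standard deviation per action dimension) and an affine rescaling of the tanh output to each physical range; `sample_action` and `scale` are hypothetical helper names, and the rescaling is an assumption about how bounded quantities such as flight speed would be recovered.

```python
import math
import random


def sample_action(means, stds, rng=random):
    """Sample each dimension from N(mean, std^2) and squash it into (-1, 1) with tanh."""
    return [math.tanh(rng.gauss(m, s)) for m, s in zip(means, stds)]


def scale(u, low, high):
    """Map a squashed value u in [-1, 1] linearly onto [low, high] (assumed rescaling)."""
    return low + 0.5 * (u + 1.0) * (high - low)
```

For instance, `scale(u, 0.0, v_max)` would turn the first squashed component into a flight speed between zero and a maximum speed `v_max`.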
The actions generated by the policy function might not meet all the constraints of problem . To satisfy constraints and , we should adjust the generated actions as follows. restricts the energy consumed by each terminal in each slot. If the generated action for a terminal does not satisfy this constraint, we set the transmit power and the CPU frequency of this terminal to zero in this slot. As a result, the computation bit of this terminal is also zero in this slot, which can be regarded as a penalty for this infeasible action. To satisfy , the offloading-time constraint, we normalize the proportion of offloading time of the generated action. Let be the proportion of offloading time of the th terminal generated by the policy in slot . If , constraint is met, thus ; otherwise, is normalized as follows
After that, SAC-TR exports the adjusted action and obtains a reward, which is denoted by and will be defined in Section V-B.
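The two action adjustments described above can be sketched as plain functions. The normalization is assumed to divide each proportion by the sum of all proportions when that sum exceeds one, and the energy check is assumed to compare the energy an action would consume against the energy currently available; both function names are hypothetical.

```python
def normalize_offloading(deltas):
    """Keep the proportions if they fit in one slot, otherwise rescale them to sum to 1."""
    total = sum(deltas)
    if total <= 1.0:
        return list(deltas)
    return [d / total for d in deltas]


def enforce_energy(power, cpu_freq, energy_needed, energy_available):
    """Zero the transmit power and CPU frequency of a terminal whose action is infeasible."""
    if energy_needed > energy_available:
        return 0.0, 0.0  # acts as a penalty: the terminal computes nothing in this slot
    return power, cpu_freq
```

Rescaling preserves the relative shares chosen by the policy, while zeroing an infeasible action gives the agent a clear signal without aborting the episode.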
V-A2 Experience Replay Memory
After getting and the state of the next slot from the MEC network, SAC-TR combines , , , and as a sample and stores it in the experience replay memory. Once the memory is full, the newly generated sample will replace the oldest one. At fixed intervals, SAC-TR randomly selects a batch of samples from the memory and performs gradient descent on the neural networks of the policy function and the Q-functions.
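A minimal sketch of such a replay memory, assuming a fixed capacity with first-in-first-out replacement and uniform random sampling, is given below; the class and method names are hypothetical.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size store of (state, action, reward, next_state) samples."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest sample automatically when full.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        """Draw a batch uniformly at random, without replacement."""
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

Sampling uniformly from a large memory breaks the temporal correlation between consecutive slots, which is what makes the off-policy updates sample-efficient.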
V-A3 Q-Functions
Following the clipped double-Q trick , SAC-TR uses two Q-functions and as a “critic” in the gradient descent process of the DNN of the policy function, such that the positive bias in the policy improvement step can be reduced. and are implemented by two DNNs with parameters and . They both generate Q-values of a state-action pair. SAC-TR selects the smaller of the two Q-values.
V-A4 Target Network
The DNN of each Q-function is also updated by gradient descent, where two target networks and are used to reduce the correlation between samples so as to stabilize the training. As the backup of the Q-functions, the initial structure and the parameters of the two target networks are the same as those of the two Q-functions. They update their parameters using the exponentially moving averages of the parameters of and , with a smoothing constant .
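The two mechanisms used by the critic, taking the smaller of two value estimates and tracking the online parameters with an exponential moving average, can be sketched on plain lists of numbers. The names `clipped_q`, `soft_update`, and `tau` are hypothetical; a real implementation would apply the same arithmetic to network weight tensors.

```python
def clipped_q(q1_value, q2_value):
    """Clipped double-Q estimate: use the smaller of the two critics' values."""
    return min(q1_value, q2_value)


def soft_update(target_params, online_params, tau):
    """Exponential moving average: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]
```

A small `tau` makes the target change slowly and keeps the regression target stable, while `tau = 1` would copy the online parameters outright.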
V-B Heterogeneous Reward Function
To meet different types of requirements, including the computation rate, fairness, and specified UAV destination, we customize the reward in SAC-TR as a heterogeneous function of a computation reward and an arrival reward. In particular, we design the computation reward based on the fairness index to maximize the objective of problem , and design the arrival reward based on progress estimate to meet the arrival constraint .
V-B1 Computation Reward with Fairness
We aim to maximize the computation bits of terminals while guaranteeing the fairness among them. On one hand, we include the incremental computation bits, i.e., in the reward to encourage the improvement of computation bits in slot . On the other hand, to make use of existing information to promote fairness in each slot, we define an indicator, called current fairness index, corresponding to the definition of fairness index in (7) as follows
to measure the fairness among terminals in slot . Eq.12 can be regarded as an evolution of fairness index. Clearly, there is .
Combining with the incremental computation bits, we design the computation reward so that it can encourage actions that increase more computation bits and the actions that achieve high fairness, thereby promoting the final fairness. In the th slot, the computation reward is given by
V-B2 Arrival Reward Based on Progress Estimate
It is important to set a proper arrival reward to help the UAV arrive at the designated destination at the end of the flight. Otherwise, the UAV may take a long time to reach the designated destination, or may never reach it. An example of an arrival reward is the sparse reward in , where a fixed reward is given when the UAV arrives at the destination, or a fixed penalty when the UAV does not. However, in our problem, the area of the destination is much smaller than the whole flight area, and thus samples in which the UAV arrives at the destination would be rare in the training process. As a result, if our algorithm employed the sparse reward, it would be difficult to converge.
Inspired by the progress estimate reported in , we design a distance-based arrival reward. The idea of the progress estimate is that, if the goal is not reached, an artificial progress estimator is given to accelerate convergence. Based on this idea, we define an arrival reward at the end of the flight according to the distance between the UAV and destination as follows
where are constants. Clearly, decreases linearly with the distance between the destination and the final location of the UAV. In this way, the samples that the UAV fails to arrive at the destination can also be utilized in the training process to guide the algorithm.
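To make the idea concrete, the sketch below assumes the simplest linear form, a constant minus a slope times the distance between the UAV's final location and the destination. The exact expression is not reproduced in this text, so the form and the names `c1` and `c2` are assumptions for illustration only.

```python
import math


def arrival_reward(final_xy, dest_xy, c1, c2):
    """Assumed linear arrival reward: c1 - c2 * distance to the destination."""
    distance = math.hypot(final_xy[0] - dest_xy[0], final_xy[1] - dest_xy[1])
    return c1 - c2 * distance
```

Unlike a sparse reward, this gives every episode a graded signal: ending 1 m from the destination earns more than ending 5 m away, even if neither counts as an arrival.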
V-C Maximum Entropy Reinforcement Learning
SAC uses the entropy of the policy, given by , to indicate the randomness of the policy. The objective of SAC is to maximize both the expectation of accumulated rewards and the expected entropy of the policy, such that the policy can be trained with diverse, highly random samples. In this way, SAC can avoid falling into a local optimum. This objective is called the maximum entropy objective in SAC. To solve problem , SAC-TR defines the maximum entropy objective
based on the reward function in (15), where is the temperature parameter that adjusts the importance of entropy against the reward and controls the stochasticity of policy.
At a fixed interval, SAC-TR performs gradient descent on the neural networks of the Q-functions and the policy function. The parameters of Q-function , , are updated by minimizing the soft Bellman residual 
where is the distribution of sampled states and actions. The parameter of policy function, , is updated by
Before deployment, SAC-TR is trained until it converges; the training process is summarized in Algorithm 1. The well-trained algorithm is then carried by the UAV as an agent. At the beginning of each slot, the UAV collects the state information and makes a decision. During the flight time, SAC-TR can continue to be trained at a fixed interval if needed.
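For reference, the quantities discussed in this subsection take the following form in the standard SAC formulation with clipped double Q-functions. The notation is assumed here ($\alpha$ for the temperature, $\gamma$ for the discount, $\theta_i$ and $\bar{\theta}_i$ for the Q-function and target parameters, $\phi$ for the policy parameters, $\mathcal{D}$ for the replay memory) and may differ from the symbols and equation numbers used in this paper.

```latex
\begin{align}
\mathcal{H}\big(\pi(\cdot \mid s_t)\big)
  &= \mathbb{E}_{a \sim \pi}\big[-\log \pi(a \mid s_t)\big], \\
J(\pi)
  &= \sum_{t} \mathbb{E}\big[\, r_t + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \big], \\
J_Q(\theta_i)
  &= \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}
     \Big[ \tfrac{1}{2} \big( Q_{\theta_i}(s_t, a_t) - y_t \big)^2 \Big], \\
y_t
  &= r_t + \gamma\, \mathbb{E}_{a' \sim \pi_\phi}
     \Big[ \min_{j \in \{1,2\}} Q_{\bar{\theta}_j}(s_{t+1}, a')
           - \alpha \log \pi_\phi(a' \mid s_{t+1}) \Big], \\
J_\pi(\phi)
  &= \mathbb{E}_{s_t \sim \mathcal{D},\, a \sim \pi_\phi}
     \Big[ \alpha \log \pi_\phi(a \mid s_t)
           - \min_{j \in \{1,2\}} Q_{\theta_j}(s_t, a) \Big].
\end{align}
```

Setting $\alpha = 0$ removes the entropy terms and reduces these losses to those of an ordinary actor-critic method with two critics.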
VI Performance Evaluation
In this section, we evaluate the performance of SAC-TR by simulations. In particular, we study the convergence, usability, and adaptability of SAC-TR, the effect of the exponent of the fairness index, and the optimal policy given by SAC-TR. We also compare SAC-TR with other benchmarks.
VI-A Simulation Settings
VI-A1 System Settings
In the simulation, we set the total flight time seconds, which is discretized into slots, and the number of mobile terminals . The maximum flight speed of UAV m/s, the data offloading bandwidth MHz, the carrier frequency GHz, and the receiver noise power Watts . The WPT energy conversion efficiency at each terminal . The effective capacitance coefficient of the terminal , which depends on the chip architecture, and the CPU cycle of raw data cycles/bit [14, 30]. The upload data needed for each bit of raw data . In remote area, the parameters in (2) are . A field with a horizontal area of m is considered, and the flight altitude of UAV m. The horizontal location of the starting point of the UAV is m, and the destination range is a sector with the center of m and the radius of m, as shown in Fig. 9.
VI-A2 Mobility Model of Terminals
Since the mobility model of the terminal may contain fixed components, randomness, and memory, we employ the Gauss-Markov random model (GMRM) to characterize it . Assume the speed and the direction of the th terminal in the th slot are respectively and , they can be calculated in GMRM by
Herein, represent the memory in the mobility model of the th terminal. and are the average speed and average direction of the th terminal. and are Gaussian distributed random variables, which reflect the randomness in the mobility model of the th terminal. In the simulation, we set , , and m/s, , the mean and covariance of as 0 and 2, and that of as 0 and 1, for . Note that SAC-TR can also be applied to other mobility models of terminals, including changeable or unknown models.
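One step of such a mobility model can be sketched as below, assuming the standard Gauss-Markov update in which the new value is a memory-weighted blend of the previous value and the long-run mean plus scaled Gaussian noise. The function names and the `sqrt(1 - memory^2)` noise scaling are assumptions based on the usual form of the model, since the update equations are not reproduced in this text.

```python
import math
import random


def gauss_markov_step(prev, mean, memory, noise_std, rng):
    """One Gauss-Markov update of a speed or direction value.

    memory in [0, 1]: 1 keeps the previous value, 0 forgets it completely.
    """
    noise = rng.gauss(0.0, noise_std)
    return memory * prev + (1.0 - memory) * mean + math.sqrt(1.0 - memory ** 2) * noise


def move(xy, speed, direction, slot_len):
    """Advance a terminal for one slot along its current heading (radians)."""
    return (xy[0] + speed * slot_len * math.cos(direction),
            xy[1] + speed * slot_len * math.sin(direction))
```

Applying `gauss_markov_step` separately to the speed and the direction in every slot, followed by `move`, yields a trajectory that is smooth but not predictable in advance.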
VI-A3 Simulation Platform
We execute SAC-TR in Python 3.7 with PyTorch 1.7. The neural networks in the policy function and the Q-functions are all fully connected networks, each of which has three hidden layers with 400 neurons per hidden layer. We adopt the Adam optimizer and use ReLU as the activation function. We set the discount , and the algorithm is updated every 100 slots. The parameters and in the arrival reward are set to 500 and 80, respectively. It is pointed out in  that the SAC method performs well when the average reward in each slot is in the range of a few dozen. Thus, we regulate the average reward in this range by setting different in different scenarios. For example, when the exponent of the fairness index and the transmit power of the UAV Watts, we select to make the return around 1200 (30 per slot) after convergence. Herein, the return is the accumulated reward in an episode, and an episode is an independent realization of an entire flight time.
VI-B Convergence of SAC-TR
In this part, we study the effect of the reward function and hyperparameters on the convergence of SAC-TR, where we set Watts and .
The current fairness index is integrated into the computation reward to improve fairness. To investigate its effect, we compare SAC-TR to that without in the computation reward, as plotted in Fig.4. The curves denote the moving average of the objective function over a window of 100 episodes. As we can see, the computation reward integrated with reaches a higher objective value than that without , since can guarantee fairness in almost all situations.
We also design an arrival reward based on the progress estimate to promote convergence. To examine its effect, we compare SAC-TR to that with sparse arrival reward. With sparse arrival reward, if the UAV reaches the destination, a fixed reward will be given, otherwise, no arrival reward will be given. Fig.5 shows the moving average of the objective function and arrival ratio under different arrival rewards. Herein, the arrival ratio is the ratio that the UAV arrives at the destination successfully over the last 100 episodes. Fig.5 shows that the objective function converges to a stable value around the th episode under the reward based on the progress estimate, while it takes episodes to reach the same value under the sparse arrival reward. Fig.5 shows that we achieve the goal of reaching the destination more quickly and stably under the progress estimate-based reward.
In Fig.6, we investigate the effect of the temperature parameter , which controls the randomness of the policy, on convergence. The return barely improves when . This is because the policy entropy is then not included in the objective of SAC-TR, which leads to insufficient exploration. It indicates that SAC-TR cannot be fully trained when it has only the structure of the AC method without the policy entropy in the DRL objective. In addition, a quite large (=0.4) can also lead to a local optimum, because an overly large makes the algorithm pursue randomness excessively instead of the accumulated reward. Thus, we set in the following simulations.
Fig.7 summarizes our experience in setting the hyperparameters. Fig.7 plots the moving average of the return under different learning rates in the Adam optimizer. We observe that when is quite large (= 0.01), the algorithm converges to a local optimum, while a small stabilizes but decelerates convergence. Trading off performance against convergence speed, we set in the following simulations. At a fixed interval, SAC-TR randomly selects a batch of samples from the experience replay memory to train the neural networks. In Fig.7 and 7, we respectively study the effect of the size of the experience replay memory and the training batch size on convergence. Fig.7 indicates that either too small or too large a memory size decelerates convergence. Therefore, we select a memory size of 100,000. Fig.7 shows that the algorithm may converge to a local optimum when the batch size is too small (= 32), while the convergence speed slows down when the batch size is quite large ( 256). To maximize the convergence speed, we set the batch size to 64 or 128 in simulations under different cases. In SAC-TR, the Q-functions update the values of the target networks by using the exponentially moving average with a parameter called the target smoothing coefficient, . In Fig.7, we study the effect of on convergence. It shows that a too-large decelerates convergence, since an overly fast update of the target network destabilizes convergence, while a too-small also reduces the convergence speed. For fast convergence, we set in the following simulations.
VI-C Effect of Fairness Index Exponent
In Fig.8, we investigate the effect of the exponent of the fairness index, , on the sum computation bits of all terminals and the fairness index when Watts, since it is used to adjust the importance of fairness in the objective function. With the increase of , the sum computation bits decline while monotonically increases. It indicates that the preference of SAC-TR for sum computation bits and fairness can be effectively adjusted by . A large can be set when quite high fairness is demanded. In contrast, if the scenario aims to maximize the sum computation bits, we should have .
VI-D Optimal Policy
The optimal policy given by SAC-TR includes the optimal UAV trajectory and the optimal resource allocation of terminals. Fig. 9 shows an example of the optimal UAV trajectory when fairness is weighted in the objective. Over the entire flight time, the UAV first hovers over the four mobile terminals, which ensures fairness among terminals, then flies towards the destination at high speed during the last few slots, and finally arrives at the destination.
Fig. 9 also plots the UAV trajectory when the objective reduces to the sum computation bits. Unlike the previous case, the UAV first flies quickly to terminal 1, then adjusts its trajectory to stay directly above terminal 1, and finally reaches the destination. The difference arises because, in this case, SAC-TR aims only to maximize the sum computation bits. Therefore, the algorithm minimizes the distance between the UAV and one of the terminals to enhance communication and energy harvesting. This reveals that, without the fairness term, SAC-TR cannot guarantee fairness, and an unfair allocation such as the one in Fig. 9 may appear.
Fig. 10 plots an example of the optimal resource allocation of terminal 1. From top to bottom, Fig. 10 shows the transmit power, the proportion of offloading time, the CPU frequency, and the distance between the UAV and terminal 1 over the entire flight time. We observe that the resource allocation of the terminal is coordinated with the UAV trajectory for good performance. In particular, when the UAV flies near terminal 1 (slots 13 to 19), the transmit power and the proportion of offloading time of this terminal increase so as to exploit the high channel power gain at this moment and offload more computation tasks. The CPU frequency is relatively uniform over the entire flight time.
VI-E Comparison with Benchmark Algorithms
To evaluate the performance of SAC-TR, we compare it with other representative benchmarks as follows.
(1) Hover-fly-hover (HFH) trajectory algorithm. The HFH trajectory of the UAV is widely used to serve terminals with fixed locations [3, 33, 34]. In HFH, the UAV flies to and hovers over some specified locations in turn. To serve mobile terminals, we make simple adaptations to HFH: the UAV flies to and follows each mobile terminal in turn for an equal amount of time. While following a terminal, it stays directly above this terminal in each slot. After serving one terminal, it flies to the next terminal at maximum speed. In addition, the UAV reserves the minimum time needed to arrive at the destination.
(2) Straight trajectory algorithm. In this algorithm, the UAV flies straight from the starting point to the destination at constant speed.
(3) Greedy local algorithm. In this algorithm, each terminal exhausts all the energy in its battery for local computation in each slot.
(4) Greedy offloading algorithm. Each terminal spends all the energy in its battery on computation offloading in each slot, and all terminals are given equal offloading time.
(5) Random algorithm. In this algorithm, the flight speed and direction of the UAV, as well as the transmit power, offloading time, and CPU frequency of each terminal, are picked randomly in each slot.
To compare with SAC-TR, the resource allocation in the HFH trajectory algorithm and the straight trajectory algorithm is optimized by our algorithm. Also, the UAV trajectory in the greedy local algorithm and the greedy offloading algorithm is optimized by our algorithm.
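The two trajectory benchmarks can be sketched in a few lines. This is a simplified 2-D illustration under stated assumptions: positions are `(x, y)` tuples, `max_step` stands for the maximum distance the UAV can cover in one slot, and the helper names are hypothetical. The scheduling of which terminal to follow and the reserved return time are omitted.

```python
import math

def straight_trajectory(start, destination, num_slots):
    """UAV position at the end of each slot for a constant-speed straight flight."""
    (x0, y0), (x1, y1) = start, destination
    return [
        (x0 + (x1 - x0) * k / num_slots, y0 + (y1 - y0) * k / num_slots)
        for k in range(1, num_slots + 1)
    ]

def step_towards(uav, target, max_step):
    """One-slot move used when following a terminal in the adapted HFH scheme.

    The UAV moves towards `target` at maximum speed and lands exactly on
    top of it once the remaining distance is within reach.
    """
    dx, dy = target[0] - uav[0], target[1] - uav[1]
    dist = math.hypot(dx, dy)
    if dist <= max_step:
        return target
    scale = max_step / dist
    return (uav[0] + dx * scale, uav[1] + dy * scale)

# Hypothetical coordinates in metres.
path = straight_trajectory((0.0, 0.0), (10.0, 0.0), num_slots=5)
next_pos = step_towards((0.0, 0.0), (3.0, 4.0), max_step=1.0)
```

Calling `step_towards` once per slot with the current position of the followed terminal reproduces the "fly to, then stay on top of" behaviour, since the UAV snaps onto the terminal whenever the terminal moves less than `max_step` in a slot.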
In Fig. 11, we investigate the stable value of the objective function after convergence under SAC-TR and the above-mentioned algorithms. We vary the UAV transmit power from 0.1 to 21.1 Watts. As the transmit power increases, mobile terminals can harvest more energy for local computation or computation offloading, so the objective functions of all algorithms improve. The greedy local algorithm is better than the greedy offloading algorithm when the transmit power is small; otherwise, the greedy offloading algorithm is better. The objective function of the HFH trajectory algorithm is always high, since the HFH trajectory guarantees high fairness while keeping the distance between the UAV and the terminals as small as possible to maximize the computation bits. The objective function of SAC-TR is always close to or higher than those of the benchmarks. It is sometimes lower than that of the HFH trajectory algorithm, which is due to the mismatch between the objective function of the optimization problem and the objective of SAC-TR.
Fig. 11 also considers a situation in which the mobility of the terminals differs to a greater degree. In particular, both the average speed and the memory parameters of the mobility model differ across terminals. As Fig. 11 shows, the objective function of SAC-TR still approaches or exceeds those of the benchmarks in this situation. We observe that the advantage of the trajectory planning of our algorithm becomes more prominent here. For example, the superiority of SAC-TR and of the greedy offloading algorithm, whose UAV trajectory is optimized by our algorithm, is more evident than in the previous case. This is because the mobility models of the terminals are quite different in this situation: under the trajectory planning of our algorithm, the UAV can approach each terminal with flexible timing and a flexible trajectory, whereas HFH still allocates equal flight time to each terminal along an inflexible trajectory, and the straight trajectory algorithm has a completely fixed UAV trajectory.
Both comparisons in Fig. 11 reveal the advantage of the trajectory planning of SAC-TR over the HFH and straight trajectories. The superiority of the resource allocation of SAC-TR is also highlighted by the comparison with the greedy local and greedy offloading algorithms, because SAC-TR plans over the entire flight time rather than slot by slot. Since benchmarks (1) to (4) have only part of their parameters optimized by our algorithm, these comparisons reflect the benefit of the joint optimization of all parameters in SAC-TR.
VI-F Usability and Adaptability
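TABLE I: Execution latency and update latency of SAC-TR under different numbers of mobile terminals
| Execution latency | 4.13 | 4.30 | 4.35 | 4.49 | 5.07 |
| Update latency | 5.10 | 5.23 | 5.27 | 5.30 | 5.85 |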
Finally, we evaluate the usability and adaptability of SAC-TR by simulations under a fixed UAV transmit power and fairness index exponent.
Since SAC-TR is executed at the beginning of each slot, its execution latency has a large impact on its usability. Therefore, we evaluate the execution latency on a desktop configured with an Intel Core i5-4590 3.3 GHz CPU and 8 GB of memory, and report the results in Table I. Each execution latency is averaged over repeated executions. Under different numbers of mobile terminals, the execution latency remains much lower than the length of a slot, such as 0.1 s, so its impact on performance can be ignored.
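An averaged latency measurement of this kind can be sketched as follows. The snippet is only an illustration of the procedure: `dummy_policy` is a hypothetical stand-in for one forward pass of the trained policy, and the repeat count is an assumed value.

```python
import time

def average_latency(fn, repeats):
    """Average wall-clock time of one call to `fn`, in seconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats

def dummy_policy():
    """Hypothetical placeholder for computing one action from one state."""
    return sum(i * i for i in range(1000))

latency = average_latency(dummy_policy, repeats=1000)  # repeat count assumed
slot_length = 0.1  # seconds, the example slot length quoted in the text

# The decision is usable in real time only if it fits well within one slot.
print(f"average latency: {latency:.2e} s, fits in slot: {latency < slot_length}")
```

Averaging over many calls with `time.perf_counter` smooths out scheduler jitter, which matters when a single call takes far less than a millisecond.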
In some situations, we continue to train SAC-TR during use at a fixed interval so as to adapt to unpredictable changes of the environment. Therefore, we also report in Table I the latency of a single update of SAC-TR, averaged over repeated updates with a batch size of 64. The update latency varies only slightly with the number of terminals. It follows that SAC-TR should be updated at an interval larger than the update latency. During use, SAC-TR can be trained on the CPU of the UAV, or on a cloud server by transmitting data to it via satellite communication.
On the other hand, we examine the adaptability of SAC-TR to unexpected changes of the mobility model of terminals. In Fig. 12, we investigate large-scale changes of the average speed of terminals. First, the average speeds of two terminals are both quadrupled and later restored to their original values. SAC-TR adapts to these abrupt changes instantly. Next, the average speeds of all terminals are quadrupled; the performance of SAC-TR drops drastically at first but quickly converges to a stable value again. Finally, the average speed of all terminals is reduced to a quarter, and SAC-TR handles this change smoothly.
In Fig. 13, we study the impact of large-scale changes of the two parameters that represent the memory of the mobility model of terminals. Initially, both parameters equal 0.5 for all terminals. At four successive episodes, the two parameters of all terminals simultaneously change to 1, back to 0.5, to 0, and back to 0.5 again. For the first three changes, SAC-TR adapts instantaneously. For the last change, the return first drops drastically and then rises rapidly to a stable value. Fig. 12 and Fig. 13 demonstrate that SAC-TR adapts well to unexpected changes of the environment. In practice, even if the model that generated the training samples differs to a certain degree from the real environment, SAC-TR can still handle it.
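The role of a memory parameter can be illustrated with a Gauss-Markov style update, a common mobility model in which the parameter interpolates between a constant-velocity walk and a memoryless one. The sketch below assumes this model form; the mean speed, noise level, and function names are illustrative assumptions and not the settings of the simulations.

```python
import math
import random

def gauss_markov_step(value, mean, memory, sigma, rng):
    """One Gauss-Markov update of a speed (or direction) value.

    memory = 1 keeps the previous value unchanged (full memory);
    memory = 0 redraws around `mean`, independent of the past.
    """
    if not 0.0 <= memory <= 1.0:
        raise ValueError("memory must lie in [0, 1]")
    noise = rng.gauss(0.0, sigma)
    return (memory * value
            + (1.0 - memory) * mean
            + math.sqrt(1.0 - memory ** 2) * noise)

def simulate_speed(memory, steps, mean=1.0, sigma=0.5, seed=0):
    """Speed trace of one terminal; mean and sigma are assumed values.

    No clipping is applied, so the raw process may take negative values.
    """
    rng = random.Random(seed)
    speed = mean
    trace = []
    for _ in range(steps):
        speed = gauss_markov_step(speed, mean, memory, sigma, rng)
        trace.append(speed)
    return trace

# The three memory levels visited in the experiment above: 1, 0.5, and 0.
traces = {m: simulate_speed(m, steps=5) for m in (1.0, 0.5, 0.0)}
```

With full memory the trace stays constant, which makes the terminal perfectly predictable, while zero memory yields independent draws in every step.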
In this paper, we study a UAV-assisted wireless powered MEC system for mobile terminals. By combining the computation rate and a fairness index in our objective function, we aim to jointly optimize and continuously control the UAV trajectory and the resource allocation of terminals. An SAC-based algorithm, named SAC-TR, is proposed for trajectory planning and resource allocation to solve this complex high-dimensional DRL task. In SAC-TR, the reward is designed as a heterogeneous function that includes a computation reward and an arrival reward. The computation reward integrates a fairness index to improve the computation rate while guaranteeing fairness, and the arrival reward, based on the progress estimate, guides the UAV to reach the specified destination and promotes convergence. Simulation results show that SAC-TR converges stably and adapts quickly to drastic changes of the environment. Compared with widely used benchmarks, such as the HFH trajectory, the straight trajectory, the greedy algorithms, and the random algorithm, SAC-TR matches or exceeds their performance in various situations.
- [1] F. Zhou, Y. Wu, R. Q. Hu, and Y. Qian, “Computation rate maximization in UAV-enabled wireless-powered mobile-edge computing systems,” IEEE J. Sel. Areas Commun., vol. 36, no. 9, pp. 1927–1941, 2018.
- [2] F. Zhou, Y. Wu, H. Sun, and Z. Chu, “UAV-enabled mobile edge computing: Offloading optimization and trajectory design,” in Proc. IEEE Int. Conf. Commun., 2018, pp. 1–6.
- [3] Y. Du, K. Yang, K. Wang, G. Zhang, Y. Zhao, and D. Chen, “Joint resources and workflow scheduling in UAV-enabled wirelessly-powered MEC for IoT systems,” IEEE Trans. Veh. Technol., vol. 68, no. 10, pp. 10187–10200, 2019.
- [4] Y. Liu, K. Xiong, Q. Ni, P. Fan, and K. B. Letaief, “UAV-assisted wireless powered cooperative mobile edge computing: Joint offloading, CPU control, and trajectory optimization,” IEEE Internet Things J., vol. 7, no. 4, pp. 2777–2790, 2020.
- [5] X. Hu, K.-K. Wong, and Z. Zheng, “Wireless-powered mobile edge computing with cooperated UAV,” in Proc. IEEE SPAWC, 2019, pp. 1–5.
- [6] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, “A survey on mobile edge computing: The communication perspective,” IEEE Commun. Surveys Tuts., vol. 19, no. 4, pp. 2322–2358, 2017.
- [7] X. Lu, P. Wang, D. Niyato, D. I. Kim, and Z. Han, “Wireless networks with RF energy harvesting: A contemporary survey,” IEEE Commun. Surveys Tuts., vol. 17, no. 2, pp. 757–789, 2015.
- [8] P. Juang, H. Oki, Y. Wang, M. Martonosi, L. S. Peh, and D. Rubenstein, “Energy-efficient computing for wildlife tracking: Design tradeoffs and early experiences with ZebraNet,” SIGARCH Comput. Archit. News, vol. 30, no. 5, pp. 96–107, Oct. 2002.
- [9] Q. Liu, L. Shi, L. Sun, J. Li, M. Ding, and F. Shu, “Path planning for UAV-mounted mobile edge computing with deep reinforcement learning,” IEEE Trans. Veh. Technol., vol. 69, no. 5, pp. 5723–5728, 2020.
- [10] C. You, K. Huang, and H. Chae, “Energy efficient mobile cloud computing powered by wireless energy transfer,” IEEE J. Sel. Areas Commun., vol. 34, no. 5, pp. 1757–1771, 2016.
- [11] S. Bi and Y. J. Zhang, “Computation rate maximization for wireless powered mobile-edge computing with binary computation offloading,” IEEE Trans. Wireless Commun., vol. 17, no. 6, pp. 4177–4190, 2018.
- [12] S. Mao, S. Leng, K. Yang, X. Huang, and Q. Zhao, “Fair energy-efficient scheduling in wireless powered full-duplex mobile-edge computing systems,” in Proc. IEEE Global Commun. Conf., 2017, pp. 1–6.
- [13] M. Min, L. Xiao, Y. Chen, P. Cheng, D. Wu, and W. Zhuang, “Learning-based computation offloading for IoT devices with energy harvesting,” IEEE Trans. Veh. Technol., vol. 68, no. 2, pp. 1930–1941, 2019.
- [14] L. Huang, S. Bi, and Y.-J. A. Zhang, “Deep reinforcement learning for online computation offloading in wireless powered mobile-edge computing networks,” IEEE Trans. Mobile Comput., vol. 19, no. 11, pp. 2581–2593, 2020.
- [15] S. Yin, S. Zhao, Y. Zhao, and F. R. Yu, “Intelligent trajectory design in UAV-aided communications with reinforcement learning,” IEEE Trans. Veh. Technol., vol. 68, no. 8, pp. 8227–8231, 2019.
- [16] C. H. Liu, Z. Chen, J. Tang, J. Xu, and C. Piao, “Energy-efficient UAV control for effective and fair communication coverage: A deep reinforcement learning approach,” IEEE J. Sel. Areas Commun., vol. 36, no. 9, pp. 2059–2070, 2018.
- [17] R. Zhong, X. Liu, Y. Liu, and Y. Chen, “Multi-agent reinforcement learning in NOMA-aided UAV networks for cellular offloading,” IEEE Trans. Wireless Commun., pp. 1–1, 2021.
- [18] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” 2019.
- [19] A. Al-Hourani, S. Kandeepan, and S. Lardner, “Optimal LAP altitude for maximum coverage,” IEEE Wireless Commun. Lett., vol. 3, no. 6, pp. 569–572, 2014.
- [20] N. Cheng, F. Lyu, W. Quan, C. Zhou, H. He, W. Shi, and X. Shen, “Space/aerial-assisted computing offloading for IoT applications: A learning-based approach,” IEEE J. Sel. Areas Commun., vol. 37, no. 5, pp. 1117–1129, 2019.
- [21] R. K. Jain, D.-M. W. Chiu, and W. R. Hawe, “A quantitative measure of fairness and discrimination for resource allocation in shared computer systems,” Eastern Res. Lab., Tech. Rep. DEC-TR-301, 1984.
- [22] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, and G. Ostrovski, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
- [23] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 1998.
- [24] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in Proc. ICML, New York, USA, 2016, pp. 1928–1937.
- [25] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in Proc. ICML, 2018, pp. 1861–1870.
- [26] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, and S. Levine, “Soft actor-critic algorithms and applications,” CoRR, vol. abs/1812.05905, 2018.
- [27] B. Zhang, C. H. Liu, J. Tang, Z. Xu, J. Ma, and W. Wang, “Learning-based energy-efficient data collection by unmanned vehicles in smart cities,” IEEE Trans. Ind. Informat., vol. 14, no. 4, pp. 1666–1676, 2018.
- [28] M. J. Mataric, “Reward functions for accelerated learning,” in Mach. Learn. Proc. 1994. San Francisco, CA, USA: Morgan Kaufmann, 1994, pp. 181–189.
- [29] H. Sun, Y.-X. Guo, M. He, and Z. Zhong, “Design of a high-efficiency 2.45-GHz rectenna for low-input-power energy harvesting,” IEEE Antennas Wireless Propag. Lett., vol. 11, pp. 929–932, 2012.
- [30] Y. Wang, M. Sheng, X. Wang, L. Wang, and J. Li, “Mobile-edge computing: Partial computation offloading using dynamic voltage scaling,” IEEE Trans. Commun., vol. 64, no. 10, pp. 4268–4282, 2016.
- [31] R. I. Bor-Yaliniz, A. El-Keyi, and H. Yanikomeroglu, “Efficient 3-D placement of an aerial base station in next generation cellular networks,” in Proc. IEEE Int. Conf. Commun., 2016, pp. 1–5.
- [32] S. Batabyal and P. Bhaumik, “Mobility models, traces and impact of mobility on opportunistic routing algorithms: A survey,” IEEE Commun. Surveys Tuts., vol. 17, no. 3, pp. 1679–1707, 2015.
- [33] J. Xu, Y. Zeng, and R. Zhang, “UAV-enabled wireless power transfer: Trajectory design and energy optimization,” IEEE Trans. Wireless Commun., vol. 17, no. 8, pp. 5092–5106, 2018.
- [34] L. Xie, J. Xu, and R. Zhang, “Throughput maximization for UAV-enabled wireless powered communication networks,” IEEE Internet Things J., vol. 6, no. 2, pp. 1690–1703, 2019.