I. Introduction
Mobile edge computing (MEC) exploits computational power at the edge of cellular networks to flexibly and rapidly deploy innovative applications and services for mobile terminal users (TUs) [1]. In contrast to position-fixed edge servers, recent works on MEC have been devoted to mobile edge servers that can provide more flexible and cost-efficient computing services in hostile environments. As a moving cloudlet, the unmanned aerial vehicle (UAV) can be applied in MEC due to its reliable connectivity and affordable infrastructure investment [2]. For example, [3] proposed an adaptive UAV-mounted cloudlet-aided recommendation system in location-based social networks to provide active recommendations for mobile users. Recently, [4] proposed a distributed anti-coordination game based partially overlapping channel assignment algorithm in UAV-aided device-to-device networks to achieve good throughput and low signaling overhead. Later on, [5] developed a novel game-theoretic and reinforcement learning (RL) framework in UAV-enabled MEC networks to maximize each base station's long-term payoff by selecting a coalition and deciding its action.
Recent research mainly focuses on path planning in UAV-mounted MEC networks. For instance, [6] jointly optimized the UAV trajectory and bit allocation under latency and UAV energy constraints. Later on, [7] studied a fixed UAV trajectory with dynamic power allocation among the social internet of vehicles. On one hand, the UAV trajectories were designed offline in [6, 7, 8], assuming that the TU locations are invariant. However, the TU locations may change dynamically over time in practice. To ensure the quality-of-service (QoS) of each TU, the UAV needs to adjust its trajectory according to the time-varying TU locations. How to design the UAV trajectory to serve mobile TUs in MEC networks remains challenging and primarily motivates our work. On the other hand, the trajectory optimization relies on either dynamic programming [6] or the successive convex approximation method [7, 8]. A major concern is that the offline trajectory designs in [6, 7, 8] may not be feasible for mobile TUs in MEC networks.
Markov decision process (MDP) and RL algorithms have been applied in online UAV trajectory design to improve detection accuracy [9] and to detect locations of endangered species [10]. However, the dynamic change of TU locations inevitably leads to innumerable states in the MDP, making the path planning problem even more complex. In this context, the deep reinforcement learning (DRL) algorithm is more adequate than conventional RL methods to deal with the curse of huge state and action spaces induced by time-varying TU locations. Ref. [11] leveraged DRL for model-free UAV control to collect data from users in mobile crowd-sensing-based smart cities. Recently, [12] investigated a joint resource allocation and task scheduling approach in a space-air-ground integrated network based on policy gradient and actor-critic methods, where the UAVs provide near-user edge computing for static TUs. Moreover, [13] proposed a deterministic policy gradient algorithm to maximize the expected uplink sum rate in UAV-aided cellular networks with mobile TUs. Among the value-based DRL algorithms, [14] unveiled that the double deep Q-network (DDQN) addresses the overestimation problem of the deep Q-network (DQN) by decoupling the target Q-value and the predicted Q-value, and thus generates a more accurate state-action value function than DQN. A better state-action value function corresponds to a better policy, under which the agent chooses better actions to improve the system reward.
In this letter, we propose a DRL-based algorithm for the UAV to serve mobile TUs in a UAV-mounted MEC network, where the motion of each TU follows the Gauss-Markov random model (GMRM). Our goal is to optimize the UAV trajectory to maximize the long-term system reward subject to the limited energy of the UAV and the QoS constraint of each TU. Toward this goal, we formulate the optimization problem as an MDP. In particular, we develop a QoS-based $\epsilon$-greedy policy in our proposed algorithm to maximize the system reward while meeting the QoS constraint. Simulation results show that our proposed algorithm outperforms conventional RL and DQN algorithms in terms of convergence and throughput, and that the QoS-based $\epsilon$-greedy policy achieves a high QoS guarantee rate for each TU.
II. System Model
Fig. 1 shows that a UAV with limited energy provides computational services to $I$ TUs over a certain period. The operating period is discretized into $T$ time slots with nonuniform durations, indexed by $t \in \{1,\dots,T\}$, where $\tau_t$ denotes the duration of the $t$th slot. Suppose that the UAV can only serve a single TU in each time slot, referred to as the association between the UAV and the TU. In each time slot, the UAV can only hover over one of $M$ fixed perceptual access points (FPAPs) to form a direct connection with the associated TU and execute its offloaded tasks.
II-A. Movement Model of TUs
Consider that all TUs are initially randomly located. Assume that the TU locations do not change within a slot, i.e., during the duration between the $t$th and the $(t+1)$th time slots. Following the GMRM in [15], the velocity $v_i(t)$ and direction $\theta_i(t)$ of the $i$th TU in the $t$th time slot are updated as

$v_i(t) = \alpha_1 v_i(t-1) + (1-\alpha_1)\bar{v} + \sqrt{1-\alpha_1^2}\,\Phi_i$,   (1a)
$\theta_i(t) = \alpha_2 \theta_i(t-1) + (1-\alpha_2)\bar{\theta}_i + \sqrt{1-\alpha_2^2}\,\Psi_i$,   (1b)

where $\alpha_1, \alpha_2 \in [0,1)$ are utilized to adjust the effect of the previous state, $\bar{v}$ is the average velocity of all TUs, and $\bar{\theta}_i$ is the average direction of the $i$th TU. In particular, we consider that the average speed is the same for all TUs, while different TUs have distinct average directions. Also, $\Phi_i$ and $\Psi_i$ follow two independent Gaussian distributions with different mean-variance pairs $(\bar{\Phi}_i, \sigma_{\Phi_i}^2)$ and $(\bar{\Psi}_i, \sigma_{\Psi_i}^2)$ for the $i$th TU, both of which reflect the randomness in the movements of different TUs. Let $(x_i(t), y_i(t))$ denote the location of the $i$th TU in the $t$th time slot. Given (1a) and (1b), the TU location is updated as [15]

$x_i(t) = x_i(t-1) + v_i(t-1)\cos(\theta_i(t-1))\,\tau_{t-1}$,   (2a)
$y_i(t) = y_i(t-1) + v_i(t-1)\sin(\theta_i(t-1))\,\tau_{t-1}$.   (2b)
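As a concrete illustration, one GMRM step per TU per slot can be sketched in Python as follows. The function name and argument layout are our own; the update rule itself follows the model in [15] as given in (1a)-(2b):

```python
import math
import random

def gmrm_step(v, theta, x, y, alpha1, alpha2, v_bar, theta_bar,
              sigma_v, sigma_theta, tau, rng):
    """One Gauss-Markov random model step for a single TU.

    v, theta         : speed and direction in the previous slot
    v_bar, theta_bar : long-run average speed and direction
    alpha1, alpha2   : memory factors adjusting the effect of the previous state
    sigma_v, sigma_theta : std. dev. of the Gaussian perturbations
    tau              : duration of the previous slot
    """
    # Velocity and direction updates, cf. (1a)-(1b)
    v_next = (alpha1 * v + (1 - alpha1) * v_bar
              + math.sqrt(1 - alpha1 ** 2) * rng.gauss(0.0, sigma_v))
    theta_next = (alpha2 * theta + (1 - alpha2) * theta_bar
                  + math.sqrt(1 - alpha2 ** 2) * rng.gauss(0.0, sigma_theta))
    # Location update, cf. (2a)-(2b), driven by the previous speed/direction
    x_next = x + v * math.cos(theta) * tau
    y_next = y + v * math.sin(theta) * tau
    return v_next, theta_next, x_next, y_next
```

With memory factors close to 1 the TU keeps its heading; with factors close to 0 it reverts quickly to the long-run averages, which is what makes the TU locations both Markovian and unpredictable to the UAV.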
Also, the UAV location at the $m$th FPAP in the $t$th time slot is denoted by $\mathbf{u}_m(t) = (x_m, y_m, H)$, where $H$ is the fixed flying altitude of the UAV.
II-B. Energy Consumption of UAV
The energy consumption of the UAV falls into the following three categories:
(1) Flying Energy Consumption $E_{\rm fly}(t)$: Let $V$ and $P_{\rm fly}$ denote the UAV flying speed and the UAV flying power, respectively. Consider that $V$ is constant over time slots. Moreover, $P_{\rm fly} = P_p + P_i$, where $P_p$ and $P_i$ denote the parasitic power and the induced power to overcome the parasitic drag and the lift-induced drag, respectively [16]. Consequently, the flying energy consumed by the UAV flying from one FPAP in the $(t-1)$th time slot to another in the $t$th time slot is given by

$E_{\rm fly}(t) = P_{\rm fly}\,\dfrac{\|\mathbf{u}_{m}(t) - \mathbf{u}_{m'}(t-1)\|}{V}$.   (3)
(2) Hovering Energy Consumption $E_{\rm hov}(t)$: Considering the line-of-sight channel between the UAV and its associated TU, the uploading rate (bits/s/Hz) from the associated $i$th TU to the UAV at the $m$th FPAP in the $t$th time slot is given by

$r_{i,m}(t) = \log_2\left(1 + \dfrac{P_{\rm TU}\, h_{i,m}(t)}{\sigma^2}\right)$,   (4)

where $P_{\rm TU}$ is the transmission power of each TU, $\sigma^2$ is the Gaussian white noise power at the UAV, and $h_{i,m}(t) = \beta_0 / \left(H^2 + (x_m - x_i(t))^2 + (y_m - y_i(t))^2\right)$ denotes the channel gain between the $i$th TU and the $m$th FPAP, with $\beta_0$ being the path loss per meter and $H$ being the fixed flying altitude of the UAV. From (4), the hovering energy consumed by the UAV in the $t$th time slot is given by

$E_{\rm hov}(t) = P_{\rm hov}\,\dfrac{D_i(t)\, S}{r_{i,m}(t)}$,   (5)

where $P_{\rm hov}$ is the UAV hovering power, $D_i(t)$ is the amount of offloaded tasks from the $i$th TU in the $t$th time slot, and $S$ is the number of bits per task.
(3) Computing Energy Consumption $E_{\rm comp}(t)$: The computing energy for the offloaded tasks from the $i$th TU is $E_{\rm comp}(t) = \kappa\, C\, D_i(t)\, S\, f^2$, where $\kappa$ is the effective switched capacitance, $C$ is the number of CPU cycles for computing one bit, and $f$ is the CPU frequency [17].

Consequently, the total energy consumption of the UAV in the $t$th time slot is $E_{\rm tot}(t) = E_{\rm fly}(t) + E_{\rm hov}(t) + E_{\rm comp}(t)$, and the energy that can be used by the UAV in the $(t+1)$th time slot is

$E(t+1) = E(t) - E_{\rm tot}(t)$.   (6)
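The three energy terms in (3), (5), and the computing energy combine into the per-slot drain of (6). A minimal sketch, with illustrative parameter names (not the simulation values):

```python
def slot_energy(p_fly, fly_dist, v_uav, p_hov, rate, tasks, bits_per_task,
                kappa, cycles_per_bit, freq):
    """Total UAV energy drawn in one slot: flying + hovering + computing.

    fly_dist is the distance between the FPAPs of two consecutive slots,
    rate is the uploading rate of the associated TU (bits/s, cf. (4)).
    """
    e_fly = p_fly * fly_dist / v_uav           # (3): flying power x flight time
    hover_time = tasks * bits_per_task / rate  # time to upload the offloaded bits
    e_hov = p_hov * hover_time                 # (5): hovering power x hover time
    e_comp = kappa * cycles_per_bit * tasks * bits_per_task * freq ** 2
    return e_fly + e_hov + e_comp
```

The remaining budget of (6) is then simply `E_next = E - slot_energy(...)`, which is what makes the UAV energy a Markovian state component.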
III. MDP Modeling and Problem Formulation
From (2a), (2b) and (6), the locations of TUs and the UAV energy possess Markov characteristics. As such, we formulate the optimization problem of the UAV trajectory as an MDP. Our goal is to maximize the longterm system reward subject to the UAV energy and TUs’ QoS constraint.
III-A. State, Action, and Reward
The state space of the MDP is described as

$\mathbf{s}(t) = \{x_i(t), y_i(t), \forall i;\; \mathbf{u}_m(t);\; E(t)\}$,   (7)

which collects the TU locations, the UAV location, and the remaining UAV energy. Furthermore, the UAV chooses to serve one of the $I$ TUs at one of the $M$ FPAPs in each time slot. Overall, the action space in our system includes two kinds of actions, denoted by

$\mathbf{a}(t) = \{i(t), m(t)\}$,   (8)

where $i(t)$ represents that the UAV chooses the $i$th TU in the $t$th time slot and $m(t)$ represents that the UAV flies to the $m$th FPAP in the $t$th time slot.
Suppose that the UAV serves the $i$th TU in the $t$th time slot. In general, the system utility is closely related to the number of offloaded tasks $D_i(t)$. However, the correlation is not simply linear. With reference to [18], we adopt a sigmoidal-like function to describe the correlation as

$U(D_i(t)) = \dfrac{1}{1 + e^{-a(D_i(t) - b)}}$,   (9)

where the constants $a$ and $b$ are used to adjust the efficiency of $U(D_i(t))$. Note that the values of $a$ and $b$ vary as the range of $D_i(t)$ changes. From (9), the system utility first increases steeply as $D_i(t)$ rises and then becomes steady when $D_i(t)$ is sufficiently large. Therefore, the heuristic use of (9) prevents the UAV from serving any single TU over a long period while ignoring other TUs, which is consistent with the QoS constraint in (11b). In addition, the system reward takes the effect of the UAV energy consumption into account. As such, the system reward in the $t$th time slot induced by the current state $\mathbf{s}(t)$ and action $\mathbf{a}(t)$ is defined as

$R(t) = U(D_i(t)) - \rho\, E_{\rm tot}(t)$,   (10)

where $\rho$ is used to normalize and unify the units of $U(D_i(t))$ and $E_{\rm tot}(t)$.
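A small sketch of the reward computation, coupling the sigmoidal utility of (9) with the energy penalty of (10). The coefficients `a`, `b`, and `rho` below are placeholders, not the values used in the simulations:

```python
import math

def system_reward(tasks, energy, a=1.0, b=5.0, rho=0.01):
    """Per-slot reward: sigmoidal task utility minus weighted energy cost.

    tasks  : number of tasks offloaded by the served TU in this slot
    energy : total UAV energy consumed in this slot
    a, b   : steepness and midpoint of the sigmoidal utility (placeholders)
    rho    : weight that normalizes the energy term against the utility
    """
    utility = 1.0 / (1.0 + math.exp(-a * (tasks - b)))  # cf. (9)
    return utility - rho * energy                       # cf. (10)
```

Because the utility saturates for large `tasks`, piling more tasks onto one TU yields diminishing reward, which nudges the learned policy toward spreading service across TUs.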
III-B. Problem Formulation
From [14], the policy $\pi$ in RL corresponds to the probability of choosing the action $\mathbf{a}(t)$ according to the current state $\mathbf{s}(t)$. The optimal policy $\pi^*$ is the specific policy that contributes to the maximal long-term system reward. Our goal is to find $\pi^*$ to maximize the average long-term system reward as

$\max_{\pi}\; \dfrac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[R(t)\right]$   (11a)
$\text{s.t.}\;\; \sum_{t=1}^{T} E_{\rm tot}(t) \le E_{\max}, \quad \sum_{t=1}^{T} D_i(t)\,\mathbb{1}\{i(t)=i\} \ge D_{\min},\; \forall i$,   (11b)

where the first constraint represents that the total energy consumption over $T$ time slots cannot exceed the UAV battery capacity $E_{\max}$, and the second constraint (i.e., the QoS constraint) guarantees the minimum amount of offloaded tasks (i.e., $D_{\min}$) from each TU over $T$ time slots.
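As a sanity check, the feasibility of a candidate trajectory under the two constraints in (11b) can be verified per episode. This is an illustrative sketch; `e_max` and `d_min` stand for the battery capacity and the per-TU QoS threshold:

```python
def constraints_satisfied(slot_energies, offloaded_per_tu, e_max, d_min):
    """Check both constraints of problem (11) over one episode.

    slot_energies    : total UAV energy consumed in each slot
    offloaded_per_tu : accumulated tasks offloaded by each TU over the episode
    """
    energy_ok = sum(slot_energies) <= e_max            # battery budget
    qos_ok = all(d >= d_min for d in offloaded_per_tu)  # QoS of every TU
    return energy_ok and qos_ok
```

Note the coupling: serving a lagging TU late in the episode costs flying energy, so a policy can satisfy one constraint only by risking the other, which is exactly the tension the learned policy must resolve.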
IV. Proposed Algorithm
In this paper, we employ the RL algorithm to explore the unknown environment, where the UAV aims to maximize the long-term system reward by trying different actions, learning from the feedback, and then reinforcing the actions until they deliver the best result. Furthermore, we use the DDQN of the DRL family to address not only the overestimation problem of DQN, but also the massive state-action pairs induced by time-varying TU locations, which conventional RL algorithms cannot handle. Besides, we develop a QoS-based $\epsilon$-greedy policy in our proposed algorithm to further meet the QoS constraint in (11b).
IV-A. Deep Q-Network (DQN)
The state-action value function is $Q_\pi(\mathbf{s}(t), \mathbf{a}(t)) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty}\gamma^k R(t+k) \mid \mathbf{s}(t), \mathbf{a}(t)\right]$, where $\gamma \in [0,1)$ is the discount factor and $R(t+k)$ is the immediate reward in the $(t+k)$th time slot based on the corresponding state-action pair [14]. The purpose of $Q_\pi$ is to evaluate how good the action performed by the UAV in the state $\mathbf{s}(t)$ is. As illustrated in [14], DQN approximates the Q-value by using two deep neural networks (DNNs) with the same four fully connected layers but different parameters $\theta$ and $\theta^-$. One is the predicted network, whose input is the current state-action pair and whose output is the predicted value $Q(\mathbf{s}(t), \mathbf{a}(t); \theta)$. The other is the target network, whose input is the next state and whose output is the maximum Q-value of the next state-action pair. Given this output, the target value of DQN is $y(t) = R(t) + \gamma \max_{\mathbf{a}'} Q(\mathbf{s}(t+1), \mathbf{a}'; \theta^-)$, where $\mathbf{a}'$ is the candidate next action.

IV-B. DDQN with Proposed QoS-Based $\epsilon$-Greedy Policy
The DQN structure chooses $\max_{\mathbf{a}'} Q(\mathbf{s}(t+1), \mathbf{a}'; \theta^-)$ directly in the target network, whose parameter $\theta^-$ is not updated in a timely manner and may lead to the overestimation of the Q-value [14]. To address the overestimation problem, DDQN applies two independent estimators to approximate the Q-value. Fig. 2 shows the DDQN structure with the QoS-based $\epsilon$-greedy policy. The predicted network outputs $Q(\mathbf{s}(t), \mathbf{a}(t); \theta)$. For the target network, DDQN chooses the action for the next state that yields the maximum Q-value in the predicted network, and identifies the corresponding Q-value of the next state-action pair in the target network. Consequently, the target value in DDQN is defined as

$y(t) = R(t) + \gamma\, Q\big(\mathbf{s}(t+1), \arg\max_{\mathbf{a}'} Q(\mathbf{s}(t+1), \mathbf{a}'; \theta);\, \theta^-\big)$.   (12)
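The decoupling between the DQN and DDQN targets can be made concrete with a toy example. This is a plain-Python sketch; the two Q-value lists stand in for the outputs of the predicted and target networks over the candidate next actions:

```python
def dqn_target(reward, next_q_target, gamma):
    """DQN target: the target network both selects and evaluates
    the next action, which tends to overestimate the Q-value."""
    return reward + gamma * max(next_q_target)

def ddqn_target(reward, next_q_pred, next_q_target, gamma):
    """DDQN target, cf. (12): the predicted network selects the argmax
    action, and the target network evaluates that same action."""
    a_star = max(range(len(next_q_pred)), key=lambda a: next_q_pred[a])
    return reward + gamma * next_q_target[a_star]
```

When the two networks disagree on which next action is best, DDQN scores the predicted network's choice with the target network's (typically lower) estimate, which is exactly how the overestimation bias is reduced.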
The goal of the two DNNs is to approximate the Q-value in (12). Based on this Q-value, the UAV chooses an action $\mathbf{a}(t)$ according to the current state $\mathbf{s}(t)$ with the proposed QoS-based $\epsilon$-greedy policy, receives the reward $R(t)$, and then transfers to the next state $\mathbf{s}(t+1)$. At time slot $t$, a transition pair is defined as $(\mathbf{s}(t), \mathbf{a}(t), R(t), \mathbf{s}(t+1))$.
The description of the DDQN structure is given in Algorithm 1. From lines 11 to 19, the DNNs are trained by the transition pairs stored in the replay memory $\mathcal{D}$. In line 12, mini-batch samples are randomly extracted from $\mathcal{D}$ to update $\theta$. In line 16, the loss function is $L(\theta) = \frac{1}{N_b}\sum_{j=1}^{N_b}\left(y_j - Q(\mathbf{s}_j, \mathbf{a}_j; \theta)\right)^2$, where $y_j$ and $Q(\mathbf{s}_j, \mathbf{a}_j; \theta)$ represent the target and predicted values of the $j$th sample from the $N_b$ mini-batch samples, respectively. In line 17, the gradient descent method is applied to update $\theta$ of the predicted network as $\theta \leftarrow \theta - \eta \nabla_\theta L(\theta)$, where $\eta$ is the learning rate and $\nabla_\theta$ is the gradient with respect to $\theta$. Moreover, $\theta^-$ is updated as $\theta^- \leftarrow \theta$ after a fixed interval. To achieve a good tradeoff between exploration and exploitation, a decrement $\Delta\epsilon$ is subtracted from $\epsilon$ in line 20. The episode ends in the $t$th time slot if the remaining energy $E(t)$ is insufficient for the UAV to continue. Finally, the proposed algorithm produces the optimal policy $\pi^*$ in line 24.

For the current state $\mathbf{s}(t)$, the UAV uses the conventional $\epsilon$-greedy policy to select a random action with probability $\epsilon$ and $\arg\max_{\mathbf{a}} Q(\mathbf{s}(t), \mathbf{a}; \theta)$ with probability $1-\epsilon$, which is unable to guarantee the QoS constraint in (11b). Consider an arbitrary time slot $t$. To meet the QoS constraint, we develop a QoS-based $\epsilon$-greedy policy to choose the action from lines 5 to 8 in Algorithm 1 as follows:
Case I: All TUs satisfy the QoS constraint. In this case, the UAV chooses an action with the conventional $\epsilon$-greedy policy.
Case II: There exists at least one TU that does not yet meet the QoS constraint in the $t$th time slot. First, the UAV collects into a set $\mathcal{I}(t)$ the TUs whose accumulated offloaded tasks are still below $D_{\min}$. Then, the UAV chooses an action with the conventional $\epsilon$-greedy policy. The UAV accepts this action if the associated TU belongs to $\mathcal{I}(t)$. Otherwise, the UAV discards the action and chooses another one until the associated TU belongs to $\mathcal{I}(t)$.
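The two cases above amount to masking the action set before applying $\epsilon$-greedy selection. A hypothetical sketch (the function name, the flat action indexing, and the mapping from actions to TUs are our own illustrative choices):

```python
import random

def qos_greedy_action(q_values, tu_of_action, offloaded, d_min, epsilon, rng):
    """QoS-based epsilon-greedy selection.

    q_values     : predicted Q-value for each flat action index
    tu_of_action : TU served by each action
    offloaded    : accumulated offloaded tasks per TU so far
    d_min        : per-TU QoS threshold
    """
    lagging = {i for i, d in enumerate(offloaded) if d < d_min}
    actions = list(range(len(q_values)))
    if lagging:
        # Case II: only actions serving a lagging TU are admissible
        actions = [a for a in actions if tu_of_action[a] in lagging]
    # Case I (or the masked Case II set): conventional epsilon-greedy
    if rng.random() < epsilon:
        return rng.choice(actions)                      # explore
    return max(actions, key=lambda a: q_values[a])      # exploit
```

Rejecting and redrawing actions, as described in the text, converges to the same admissible set as this one-shot masking; masking just avoids the resampling loop.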
Note that Algorithm 1 describes the offline training process to find the optimal policy $\pi^*$. Then $\pi^*$ is used to instruct the UAV to serve the TUs with the maximal long-term system reward during the online testing process.
Remark 1: First, the Q-learning used in [19] is not well-suited to our complex environment with real-time mobile TUs, since the number of state-action pairs increases over time and the cost of managing the Q-table is unaffordable. Second, [20] employed the dueling DQN to optimize the UAV deployment in multi-UAV wireless networks, while our work uses the double DQN (DDQN) to optimize the UAV trajectory in UAV-mounted MEC networks. Third, different from the DQN-based UAV navigation in [21], we employ a DDQN-based algorithm to address the overestimation problem.
V. Simulations and Results
The simulation parameters include the number of FPAPs, the UAV battery capacity (in kJ), the UAV flying speed (in m/s), the channel gain and noise power (in dB), the effective switched capacitance (in F), the number of CPU cycles per bit, the CPU frequency (in GHz), the number of bits per task (in Mb), and the GMRM and utility coefficients, following [17]; the amount of offloaded tasks $D_i(t)$ randomly ranges between 0 and 10. The transmission power of each TU and the UAV flying and hovering powers are specified in W.
Fig. 3 depicts the average reward of the proposed algorithm, DQN, double Q-learning (DQL), and Q-learning (QL). First, our proposed algorithm achieves the fastest convergence and the largest average reward among all the algorithms. Second, a lower UAV altitude or fewer TUs contributes to a larger average reward. On one hand, a higher UAV altitude results in larger path loss and more UAV hovering energy. On the other hand, the UAV consumes more energy to meet the QoS constraint of each TU as the number of TUs goes up. Third, when the number of TUs is large, it is observed that QL and DQL can hardly be implemented because the construction of the Q-table with massive states and actions is unaffordable.
Fig. 4 shows the sum throughput per episode of the proposed algorithm and the DQN algorithm versus the number of TUs. We define the sum throughput per episode as the product of the number of tasks offloaded from all TUs per episode and the number of bits per task $S$. First, the proposed algorithm achieves the largest sum throughput per episode among all the algorithms for any number of TUs. Second, the sum throughput per episode reduces as the number of TUs increases. Third, the sum throughput per episode increases as the average TU speed $\bar{v}$ reduces for all the algorithms. For example, both the proposed algorithm and DQN achieve their respective largest sum throughput per episode at the lowest average TU speed. This is due to the fact that the path planning problem gradually reduces to the problem with static TUs as $\bar{v}$ decreases, for which the global optimal solution can be found directly.


Fig. 5 shows that our proposed algorithm is robust to different average speeds of TUs. Note that we only train the DNNs under a single average speed and then test the trained DNNs under different speeds. It is observed that the proposed algorithm can still converge under such speed variations.
Fig. 6 plots the UAV path planning over several consecutive time slots based on the proposed algorithm. The dashed and solid red triangles represent the initial and current locations of the UAV, respectively. The black points are the projections of the FPAPs. The dashed and solid circles are the current and previous locations of each TU, respectively. The dashed purple line links the UAV and its associated TU. The arrows indicate the UAV trajectory. It is shown that the UAV serves TU4, TU5, and TU2 in three successive time slots, respectively. To meet the QoS constraint, the UAV then flies back to serve TU4.
Table I presents the percentage of QoS satisfaction over 100,000 episodes for 15 TUs under the conventional $\epsilon$-greedy policy and the proposed QoS-based $\epsilon$-greedy policy, respectively. It is observed that the proposed policy significantly outperforms the conventional $\epsilon$-greedy policy.
VI. Conclusions
We optimized the UAV trajectory in a UAV-mounted MEC network, where the UAV was deployed as a mobile edge server to dynamically serve mobile TUs. We formulated the optimization problem as an MDP, assuming that the motion of each TU follows the GMRM. In particular, we developed a QoS-based $\epsilon$-greedy policy based on DDQN to maximize the long-term system reward and meet the QoS constraint. The simulation results demonstrated that the proposed algorithm not only outperforms DQN, DQL, and QL in terms of convergence and sum throughput, but also achieves a nearly complete QoS guarantee rate for each TU.
References
 [1] N. Abbas, Y. Zhang, A. Taherkordi, and T. Skeie, “Mobile edge computing: A survey,” IEEE Internet Things J., vol. 5, no. 1, pp. 450–465, Feb. 2018.
 [2] F. Zhou, Y. Wu, R. Q. Hu, and Y. Qian, “Computation rate maximization in UAV-enabled wireless-powered mobile-edge computing systems,” IEEE J. Sel. Areas Commun., vol. 36, no. 9, pp. 1927–1941, Sep. 2018.
 [3] F. Tang, Z. M. Fadlullah, B. Mao, N. Kato, F. Ono, and R. Miura, “On a novel adaptive UAV-mounted cloudlet-aided recommendation system for LBSNs,” IEEE Trans. Emerg. Topics Comput., vol. 7, no. 4, pp. 565–577, Nov. 2019.
 [4] F. Tang, Z. M. Fadlullah, N. Kato, F. Ono, and R. Miura, “AC-POCA: Anti-coordination game based partially overlapping channels assignment in combined UAV and D2D-based networks,” IEEE Trans. Veh. Technol., vol. 67, no. 2, pp. 1672–1683, Feb. 2018.
 [5] A. Asheralieva and D. Niyato, “Hierarchical game-theoretic and reinforcement learning framework for computational offloading in UAV-enabled mobile edge computing networks with multiple service providers,” IEEE Internet Things J., vol. 6, no. 5, pp. 9873–9769, Oct. 2019.
 [6] S. Jeong, O. Simeone, and J. Kang, “Mobile edge computing via a UAV-mounted cloudlet: Optimization of bit allocation and path planning,” IEEE Trans. Veh. Technol., vol. 67, no. 3, pp. 2049–2063, Mar. 2018.
 [7] L. Zhang, Z. Zhao, Q. Wu, H. Zhao, H. Xu, and X. Wu, “Energy-aware dynamic resource allocation in UAV-assisted mobile edge computing over social internet of vehicles,” IEEE Access, vol. 6, pp. 56700–56715, Oct. 2018.
 [8] Y. Qian, F. Wang, J. Li, L. Shi, K. Cai, and F. Shu, “User association and path planning for UAV-aided mobile edge computing with energy restriction,” IEEE Wireless Commun. Lett., vol. 8, no. 5, pp. 1312–1315, Oct. 2019.
 [9] X. Gao, Y. Fang, and Y. Wu, “Fuzzy Q-learning algorithm for dual-aircraft path planning to cooperatively detect targets by passive radars,” J. Syst. Eng. Electron., vol. 24, no. 5, pp. 800–810, Oct. 2013.
 [10] J. Xu, G. Solmaz, R. Rahmatizadeh, D. Turgut, and L. Bölöni, “Internet of things applications: Animal monitoring with unmanned aerial vehicle,” Comput. Sci., 2016. [Online]. Available: http://arxiv.org/abs/1610.05287.
 [11] B. Zhang, C. H. Liu, J. Tang, Z. Xu, J. Ma, and W. Wang, “Learning-based energy-efficient data collection by unmanned vehicles in smart cities,” IEEE Trans. Ind. Informat., vol. 14, no. 4, pp. 1666–1676, Apr. 2018.
 [12] N. Cheng, F. Lyu, W. Quan, C. Zhou, H. He, W. Shi, and X. Shen, “Space/aerial-assisted computing offloading for IoT applications: A learning-based approach,” IEEE J. Sel. Areas Commun., vol. 37, no. 5, pp. 1117–1129, May 2019.
 [13] S. Yin, S. Zhao, Y. Zhao, and F. R. Yu, “Intelligent trajectory design in UAV-aided communications with reinforcement learning,” IEEE Trans. Veh. Technol., vol. 68, no. 8, pp. 8227–8231, Aug. 2019.
 [14] H. V. Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Proc. AAAI-16, Feb. 2016.
 [15] S. Batabyal and P. Bhaumik, “Mobility models, traces and impact of mobility on opportunistic routing algorithms: A survey,” IEEE Commun. Surveys Tuts., vol. 17, no. 3, pp. 1679–1707, Sep. 2015.
 [16] M. Mozaffari, W. Saad, M. Bennis, and M. Debbah, “Mobile unmanned aerial vehicles (UAVs) for energy-efficient internet of things communications,” IEEE Trans. Wireless Commun., vol. 16, no. 11, pp. 7574–7589, Nov. 2017.
 [17] Q. Wu, Y. Zeng, and R. Zhang, “Joint trajectory and communication design for multi-UAV enabled wireless networks,” IEEE Trans. Wireless Commun., vol. 17, no. 3, pp. 2109–2121, Mar. 2018.
 [18] J. Lee, R. Mazumdar, and N. Shroff, “Non-convex optimization and rate control for multi-class services in the internet,” IEEE/ACM Trans. Netw., vol. 13, no. 4, pp. 827–840, Aug. 2005.
 [19] X. Liu, Y. Liu, Y. Chen, and L. Hanzo, “Trajectory design and power control for multi-UAV assisted wireless networks: A machine learning approach,” IEEE Trans. Veh. Technol., vol. 68, no. 8, pp. 7957–7969, Aug. 2019.
 [20] Q. Wang, W. Zhang, Y. Liu, and Y. Liu, “Multi-UAV dynamic wireless networking with deep reinforcement learning,” IEEE Commun. Lett., vol. 23, no. 12, pp. 2243–2246, Dec. 2019.
 [21] H. Huang, Y. Yang, H. Wang, Z. Ding, H. Sari, and F. Adachi, “Deep reinforcement learning for UAV navigation through massive MIMO technique,” IEEE Trans. Veh. Technol., early access, 2019.