Mobile edge computing (MEC) brings computational power to the edge of cellular networks so that innovative applications and services can be deployed flexibly and rapidly for mobile terminal users (TUs) [1]. In contrast to position-fixed edge servers, recent works on MEC have been devoted to mobile edge servers that can provide more flexible and cost-efficient computing services in hostile environments. As a moving cloudlet, the unmanned aerial vehicle (UAV) can be applied in MEC due to its reliable connectivity with affordable infrastructure investment [2]. For example, [3] proposed an adaptive UAV-mounted cloudlet-aided recommendation system in location-based social networks to provide active recommendations for mobile users. Recently, [4] proposed a distributed anticoordination-game-based partially overlapping channel assignment algorithm in UAV-aided device-to-device networks to achieve good throughput and low signaling overhead. Later on, [5] developed a novel game-theoretic and reinforcement learning (RL) framework in UAV-enabled MEC networks to maximize each base station’s long-term payoff by selecting a coalition and deciding its action.
Recent research mainly focuses on path planning in UAV-mounted MEC networks. For instance, [6] jointly optimized the UAV trajectory and bit allocation under latency and UAV energy constraints. Later on, [7] studied a fixed UAV trajectory with dynamic power allocation in the social internet of vehicles. On one hand, the UAV trajectories were designed offline in [6, 7, 8], assuming that the TU locations are invariant. However, the TU locations may change dynamically over time in practice. To ensure the quality-of-service (QoS) for each TU, the UAV needs to adjust its trajectory according to the time-varying TU locations. How to design the UAV trajectory to serve mobile TUs in MEC networks remains challenging and primarily motivates our work. On the other hand, the trajectory optimization relies on either dynamic programming or the successive convex approximation method. A major concern is that the optimization methods behind the offline trajectory designs in [6, 7, 8] may not be able to handle mobile TUs in MEC networks.
The Markov decision process (MDP) and RL algorithms have been applied in online UAV trajectory design to improve the detection accuracy [9] and to detect the locations of endangered species [10]. However, the dynamic change of TU locations inevitably leads to innumerable states in the MDP, making the path planning problem even more complex. In this context, deep reinforcement learning (DRL) is better suited than conventional RL methods to the huge state and action spaces induced by time-varying TU locations. Ref. [11] leveraged DRL to enable model-free UAV control for collecting data from users in mobile crowd sensing-based smart cities. Recently, [12] investigated a joint resource allocation and task scheduling approach in a space-air-ground integrated network based on policy gradient and actor-critic methods, where the UAVs provide near-user edge computing for static TUs. Moreover, [13] proposed the deterministic policy gradient algorithm to maximize the expected uplink sum rate in UAV-aided cellular networks with mobile TUs. Among the value-based DRL algorithms, [14] showed that the double deep Q-network (DDQN) addresses the overestimation problem of the deep Q-network (DQN) by decoupling the target Q-value from the predicted Q-value, and thus generates a more accurate state-action value function than DQN. A more accurate state-action value function in turn yields a better policy, under which the agent chooses better actions and improves the system reward.
In this letter, we propose a DRL-based algorithm for the UAV to serve the mobile TUs in the UAV-mounted MEC network, where the motion of each TU follows the Gauss-Markov random model (GMRM). Our goal is to optimize the UAV trajectory to maximize the long-term system reward subject to the limited energy of the UAV and the QoS constraint of each TU. Toward this goal, we formulate the optimization problem as an MDP. In particular, we develop a QoS-based $\epsilon$-greedy policy in our proposed algorithm to maximize the system reward and meet the QoS constraint. Simulation results show that our proposed algorithm outperforms conventional RL and DQN algorithms in terms of convergence and throughput, and that the QoS-based $\epsilon$-greedy policy improves the QoS guarantee rate of each TU.
II System Model
Fig. 1 shows that a UAV with limited energy provides computational services to TUs over a certain period. The operating period is discretized into time slots, each with non-uniform duration, indexed by . Suppose that the UAV can only serve a single TU in each time slot, referred to as the association between the UAV and the TU. In each time slot, the UAV can only hover over one of fixed perceptual access points (FPAPs) to form a direct connection with the associated TU and execute its offloaded tasks.
II-A Movement Model of TUs
Consider that all TUs are randomly located at . Assume that the TU locations do not change between the th and th time slots. Following the GMRM in [15], the velocity and direction of the th TU in the th time slot () are updated as
where are utilized to adjust the effect of the previous state, is the average velocity for all TUs, and is the average direction of the th TU. In particular, we consider that the average speed is the same for all TUs, whereas different TUs have distinct average directions. Also, and and for the th TU, both of which reflect the randomness in the movements of different TUs. Let denote the location of the th TU in the th time slot. Given (1a) and (1b), the TU location is updated as
Also, the UAV location at the th FPAP in the th time slot is
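The TU mobility update described above can be sketched in Python. This is a minimal illustration, not the letter's exact equations: it assumes the commonly used Gauss-Markov form in which a memory factor weights the previous value, the remainder weights the mean, and a scaled zero-mean Gaussian term adds randomness. The names `gauss_markov_step`, `move`, `alpha_v`, `alpha_theta`, `sigma_v`, and `sigma_theta` are hypothetical.

```python
import math
import random


def gauss_markov_step(v_prev, theta_prev, v_mean, theta_mean,
                      alpha_v, alpha_theta, sigma_v, sigma_theta, rng):
    """Return (velocity, direction) of one TU for the next time slot.

    alpha_* in [0, 1] weight the previous state (assumed memory factors):
    1 keeps the previous value unchanged, 0 makes the motion memoryless.
    """
    # Zero-mean Gaussian perturbations (assumed noise model).
    noise_v = rng.gauss(0.0, sigma_v)
    noise_theta = rng.gauss(0.0, sigma_theta)
    v = (alpha_v * v_prev + (1.0 - alpha_v) * v_mean
         + math.sqrt(1.0 - alpha_v ** 2) * noise_v)
    theta = (alpha_theta * theta_prev + (1.0 - alpha_theta) * theta_mean
             + math.sqrt(1.0 - alpha_theta ** 2) * noise_theta)
    return v, theta


def move(x, y, v, theta, duration):
    """Advance a TU location along direction theta (radians) for one slot."""
    return (x + v * math.cos(theta) * duration,
            y + v * math.sin(theta) * duration)
```

Passing an explicit `random.Random` instance as `rng` keeps a simulated trajectory reproducible across training runs.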
II-B Energy Consumption of UAV
The energy consumption of the UAV falls into the following three categories:
(1) Flying Energy Consumption : Let and denote the UAV flying speed and the UAV flying power respectively. Consider that is constant over time slots. Moreover, , where and denote the parasitic power and the induced power to overcome the parasitic drag and the lift-induced drag respectively. Consequently, the flying energy consumed by the UAV flying from one FPAP in the th time slot to another in the th time slot is given by
(2) Hovering Energy Consumption : Considering the line-of-sight channel between the UAV and its associated TU, the uploading rate (bits/s/Hz) from the associated th TU to the UAV at the th FPAP in the th time slot is given by
where is the transmission power of each TU, is the Gaussian white noise power at the UAV, and denotes the channel gain between the th TU and the th FPAP, with being the path loss per meter and being the fixed flying altitude of the UAV. From (4), the hovering energy consumed by the UAV in the th time slot is given by
where is the UAV hovering power, is the amount of offloaded tasks from the th TU in the th time slot, and is the number of bits per task.
(3) Computing Energy Consumption : The computing energy for the offloaded tasks from the th TU is where is the effective switched capacitance, is the number of CPU cycles for computing one bit, and is the CPU frequency .
Consequently, the total energy consumption of the UAV in the th time slot is , and the energy that can be used by the UAV in the th time slot is
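The three energy terms and the remaining-energy update can be sketched as follows. The formulas are assumptions based on standard models rather than the letter's exact expressions: flying energy as power times flight time, hovering energy as power times upload time (with the rate in bits/s/Hz treated as normalized to unit bandwidth), a free-space line-of-sight channel gain, and computing energy as capacitance times cycles per bit times squared CPU frequency times the number of bits. All function and parameter names are hypothetical.

```python
import math


def upload_rate(p_tx, noise_power, ref_gain, altitude, horiz_dist):
    """Spectral efficiency (bits/s/Hz) over a line-of-sight link (assumed model)."""
    gain = ref_gain / (altitude ** 2 + horiz_dist ** 2)
    return math.log2(1.0 + p_tx * gain / noise_power)


def flying_energy(p_fly, distance, speed):
    """Energy to fly `distance` metres at constant speed and power."""
    return p_fly * distance / speed


def hovering_energy(p_hover, tasks, bits_per_task, rate):
    """Energy spent hovering while the associated TU uploads its tasks."""
    return p_hover * tasks * bits_per_task / rate


def computing_energy(capacitance, cycles_per_bit, cpu_freq, tasks, bits_per_task):
    """Energy to process the offloaded bits on the UAV's CPU."""
    return capacitance * cycles_per_bit * cpu_freq ** 2 * tasks * bits_per_task


def remaining_energy(energy_now, e_fly, e_hover, e_comp):
    """Energy available in the next slot after this slot's consumption."""
    return energy_now - (e_fly + e_hover + e_comp)
```

An episode can then stop as soon as `remaining_energy` would become negative, which mirrors the battery constraint on the UAV.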
III MDP Modeling and Problem Formulation
From (2a), (2b) and (6), the locations of TUs and the UAV energy possess Markov characteristics. As such, we formulate the optimization problem of the UAV trajectory as an MDP. Our goal is to maximize the long-term system reward subject to the UAV energy and TUs’ QoS constraint.
III-A State, Action, and Reward
The state space of MDP is described as
Furthermore, the UAV chooses to serve one of the TUs from one of the FPAPs in each time slot. Overall, the action space in our system includes two kinds of actions, denoted by
where represents that the UAV chooses the th TU in the th time slot and represents that the UAV flies to the th FPAP in the th time slot.
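Because each action pairs a TU choice with an FPAP choice, a Q-network with one output per action needs a mapping between pairs and flat indices. The sketch below shows one possible encoding; the zero-based indexing and the names `encode_action`, `decode_action`, and `all_actions` are assumptions for illustration.

```python
def encode_action(tu, fpap, num_fpaps):
    """Map a (TU, FPAP) pair to a flat index, e.g. a network output unit."""
    return tu * num_fpaps + fpap


def decode_action(index, num_fpaps):
    """Inverse of encode_action: recover the (TU, FPAP) pair."""
    return divmod(index, num_fpaps)


def all_actions(num_tus, num_fpaps):
    """Enumerate the joint action space as (TU, FPAP) pairs."""
    return [(tu, fpap) for tu in range(num_tus) for fpap in range(num_fpaps)]
```

The joint action space has as many entries as the number of TUs times the number of FPAPs, so it grows linearly in each quantity.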
Suppose that the UAV serves the th TU in the th time slot. In general, the system utility is closely related to the number of offloaded tasks . However, the relationship is not simply linear. With reference to [18], we adopt a sigmoidal-like function to describe it as
where the constants and are used to adjust the efficiency of . Note that the values of and vary as the range of changes. From (9), the system utility first increases steeply as rises and then levels off when is sufficiently large. Therefore, the heuristic use of (9) prevents the UAV from serving any single TU over a long period while ignoring other TUs, which is consistent with the QoS constraint in (11b). In addition, the system reward takes the effect of UAV energy consumption into account. As such, the system reward in the th time slot induced by the current state and action is defined as
where is used to normalize and unify the unit of and .
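The shape of the utility and the energy-penalized reward can be illustrated with a short sketch. The saturating form used here, `1 - exp(-tasks**eta / (tasks + beta))`, is an assumed stand-in for the sigmoidal-like function in (9), and the reward is assumed to be the utility minus a scaled energy term. The names `eta`, `beta`, and `psi` are hypothetical.

```python
import math


def sigmoidal_utility(tasks, eta, beta):
    """Saturating utility of the offloaded tasks (assumed functional form)."""
    if tasks <= 0:
        return 0.0
    return 1.0 - math.exp(-(tasks ** eta) / (tasks + beta))


def reward(tasks, energy, eta, beta, psi):
    """Utility minus a normalized energy penalty (assumed combination)."""
    return sigmoidal_utility(tasks, eta, beta) - psi * energy
```

Because the utility saturates, serving the same TU with ever more tasks adds little reward while the energy penalty keeps growing, which pushes the agent toward serving other TUs.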
III-B Problem Formulation
The policy in RL corresponds to the probability of choosing the action according to the current state . The optimal policy is the specific policy that contributes to the maximal long-term system reward. Our goal is to find to maximize the average long-term system reward as
where the first constraint represents that the total energy consumption over time slots cannot exceed the UAV battery capacity and the second constraint (i.e., QoS constraint) guarantees the minimum amount of offloaded tasks (i.e., ) from each TU over time slots.
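The two constraints can be checked for a finished episode with a few lines of Python. This sketch assumes that the episode is logged as per-slot energy values and per-slot (TU index, offloaded tasks) pairs; the function name `is_feasible` and its parameters are hypothetical.

```python
def is_feasible(energy_per_slot, tasks_per_slot, battery_capacity,
                min_tasks, num_tus):
    """Check the energy constraint and the per-TU QoS constraint.

    tasks_per_slot holds one (tu_index, tasks) pair for every time slot.
    """
    # First constraint: total consumption must not exceed the battery.
    if sum(energy_per_slot) > battery_capacity:
        return False
    # Second constraint: every TU must offload at least min_tasks in total.
    totals = [0] * num_tus
    for tu, tasks in tasks_per_slot:
        totals[tu] += tasks
    return all(total >= min_tasks for total in totals)
```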
IV Proposed Algorithm
In this letter, we employ RL to explore the unknown environment: the UAV tries different actions, learns from the feedback, and reinforces the actions that maximize the long-term system reward. Furthermore, instead of a conventional RL algorithm, we use DDQN, a DRL algorithm, to address both the overestimation problem of DQN and the massive number of state-action pairs induced by time-varying TU locations. In addition, we develop a QoS-based $\epsilon$-greedy policy in our proposed algorithm to meet the second constraint in (11b).
IV-A Deep Q-Network (DQN)
The state-action value function is , where is the discount factor and is the immediate reward in the th time slot based on the state-action pair (). The role of is to evaluate how good the action performed by the UAV in the state is. DQN approximates the Q-value by using two deep neural networks (DNNs) with the same four fully connected layers but different parameters and . One is the predicted network, whose input is the current state-action pair and whose output is the predicted value, i.e., . The other is the target network, whose input is the next state and whose output is the maximum Q-value of the next state-action pair. Given this output, the target value of is , where is the candidate next action.
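The DQN target described above can be written as a one-line computation. The sketch assumes that the target network's Q-values for all candidate next actions are already available as a list; the `done` flag for terminal transitions and the name `dqn_target` are assumptions added for illustration.

```python
def dqn_target(reward, next_q_target, gamma, done=False):
    """Target value: immediate reward plus discounted best next Q-value.

    next_q_target lists the target network's Q-values for every
    candidate next action in the next state.
    """
    if done:
        # No future reward after a terminal transition (assumed handling).
        return reward
    return reward + gamma * max(next_q_target)
```

Here the same target network both selects and evaluates the best next action.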
IV-B DDQN with Proposed QoS-Based $\epsilon$-Greedy Policy
The DQN structure chooses directly in the target network, whose parameter is not updated in a timely manner, which may lead to overestimation of the Q-value [14]. To address the overestimation problem, DDQN applies two independent estimators to approximate the Q-value. Fig. 2 shows the DDQN structure with the QoS-based $\epsilon$-greedy policy. The predicted network outputs . For the target network, DDQN chooses the action for the next state that yields in the predicted network and identifies the corresponding Q-value of the next state-action pair in the target network, i.e., . Consequently, the target value in DDQN is defined as
The goal of the two DNNs is to approximate the Q-value in (12). Based on this Q-value, the UAV chooses an action according to the current state with the proposed QoS-based $\epsilon$-greedy policy, receives the reward , and then transfers to the next state . At time slot , a transition pair is defined as .
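The decoupling of action selection from action evaluation in the DDQN target can be sketched as follows. The sketch assumes that both networks' Q-values for the next state are given as equally ordered lists; the `done` flag and the name `ddqn_target` are assumptions for illustration.

```python
def ddqn_target(reward, next_q_pred, next_q_target, gamma, done=False):
    """Double-DQN target value.

    The predicted network selects the next action (argmax over
    next_q_pred); the target network evaluates that action.
    """
    if done:
        # No future reward after a terminal transition (assumed handling).
        return reward
    best = max(range(len(next_q_pred)), key=next_q_pred.__getitem__)
    return reward + gamma * next_q_target[best]
```

Whenever the two networks disagree on the best next action, this target is no larger than the one obtained by maximizing directly over the target network's outputs, which is how the decoupling curbs overestimation.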
The description of the DDQN structure is given in Algorithm 1. From lines 11 to 19, the DNNs are trained by the transition pairs stored in memory . In line 12, mini-batch samples are randomly extracted from to update . In line 16, the loss function is , where and represent the target and predicted values of the th sample from the mini-batch samples, respectively. In line 17, the gradient descent method is applied to update of the predicted network as , where is the learning rate and is the gradient function with respect to . Moreover, is updated as after a fixed interval. To achieve a good tradeoff between exploration and exploitation, a decrement is subtracted from $\epsilon$ in line 20. The episode ends in the th time slot if . Finally, the proposed algorithm produces the optimal policy in line 24.
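The bookkeeping around the training loop (replay memory, mini-batch loss, exploration decay, and periodic target-network update) can be sketched without a deep learning library. This is an illustration under assumptions: the loss is taken to be the mean squared error between target and predicted values, network parameters are represented as plain lists, and the names `ReplayMemory`, `mse_loss`, `decay_epsilon`, and `sync_target` are hypothetical rather than taken from Algorithm 1.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size store of transitions; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, rng=random):
        # Draw a random mini-batch without replacement.
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))


def mse_loss(targets, predictions):
    """Mean squared error over a mini-batch (assumed loss form)."""
    return sum((y - q) ** 2 for y, q in zip(targets, predictions)) / len(targets)


def decay_epsilon(epsilon, decrement, floor=0.0):
    """Subtract a fixed decrement from epsilon without going below floor."""
    return max(floor, epsilon - decrement)


def sync_target(step, interval, predicted_params, target_params):
    """Copy the predicted network's parameters every `interval` steps."""
    if step % interval == 0:
        return list(predicted_params)
    return target_params
```

In a full implementation the gradient step on the predicted network would be delegated to an automatic differentiation framework; only the surrounding control flow is shown here.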
For the current state , the conventional $\epsilon$-greedy policy selects a random action with probability $\epsilon$ and the action with the largest Q-value with probability $1-\epsilon$, which cannot guarantee the QoS constraint in (11b). Consider an arbitrary time slot . To meet the QoS constraint, we develop a QoS-based $\epsilon$-greedy policy to choose the action of from lines 5 to 8 in Algorithm 1 as follows:
Case I: In this case, all TUs satisfy the QoS constraint. Then the UAV chooses an action with the conventional $\epsilon$-greedy policy.
Case II: In this case, there exists at least one TU that does not meet the QoS constraint in the th time slot. First, the UAV collects the TUs that have not yet met the QoS constraint into a set. Then, the UAV draws an action with the conventional $\epsilon$-greedy policy. The UAV keeps this action if its associated TU belongs to this set. Otherwise, the UAV discards the action and draws another one until the associated TU belongs to the set.
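The two cases can be sketched as a small action-selection routine. The sketch makes several assumptions: actions are (TU, FPAP) pairs, Q-values are supplied as a dictionary, a TU counts as satisfied once its cumulative offloaded tasks reach a threshold, and a bounded number of redraws with a restricted greedy fallback replaces the open-ended redraw loop so that the routine always terminates. The fallback and all names are hypothetical additions.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Conventional policy: explore with probability epsilon, else exploit.

    q_values maps each action (tu, fpap) to its predicted Q-value.
    """
    if rng.random() < epsilon:
        return rng.choice(list(q_values))
    return max(q_values, key=q_values.get)


def qos_epsilon_greedy(q_values, epsilon, served_tasks, min_tasks, rng,
                       max_tries=100):
    """QoS-based policy: prefer TUs that have not yet met their QoS."""
    unsatisfied = {tu for tu, done in enumerate(served_tasks) if done < min_tasks}
    if not unsatisfied:
        # Case I: every TU already meets the QoS constraint.
        return epsilon_greedy(q_values, epsilon, rng)
    # Case II: redraw until the associated TU is an unsatisfied one.
    for _ in range(max_tries):
        action = epsilon_greedy(q_values, epsilon, rng)
        if action[0] in unsatisfied:
            return action
    # Assumed fallback: best action among those serving unsatisfied TUs.
    allowed = {a: q for a, q in q_values.items() if a[0] in unsatisfied}
    return max(allowed, key=allowed.get)
```

The fallback matters when $\epsilon$ is small: the greedy draw then returns the same action every time, so a pure redraw loop could spin indefinitely if that action serves an already satisfied TU.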
Note that Algorithm 1 describes the offline training process to find the optimal policy . Then is used to instruct the UAV to serve the TUs with the maximal long-term system reward during the online testing process.
Remark 1: First, the Q-learning used in [19] is not well-suited to our complex environment with real-time mobile TUs, since the number of state-action pairs increases over time and the cost of managing the Q-table is unaffordable. Second, [20] employed the dueling DQN to optimize the UAV deployment in multi-UAV wireless networks, while our work uses the double DQN (DDQN) to optimize the UAV trajectory in UAV-mounted MEC networks. Third, different from the DQN-based UAV navigation in [21], we employ the DDQN-based algorithm to address the overestimation problem.
V Simulations and Results
The simulation parameters are set as FPAPs, kJ, m/s, dB, dB, F, , GHz, Mb, , , , , , and randomly ranges between 0 and 10 . The powers of each TU transmission, UAV flying and hovering are W, W and W, respectively.
Fig. 3 depicts the average reward of the proposed algorithm, DQN, DQL (double Q-learning), and QL algorithms. First, our proposed algorithm achieves the fastest convergence and the largest average reward among all the algorithms. Second, a lower UAV altitude or fewer TUs leads to a larger average reward. On one hand, a higher UAV altitude results in larger path loss and more UAV hovering energy. On the other hand, the UAV consumes more energy to meet the QoS constraint of each TU as the number of TUs goes up. Third, when , it is observed that QL and DQL can hardly be implemented because constructing the Q-table with massive states and actions is unaffordable.
Fig. 4 shows the sum throughput per episode of the proposed algorithm and the DQN algorithm versus the number of TUs. We define the sum throughput per episode as the product of the offloaded tasks from all TUs per episode and the number of bits per task . First, the proposed algorithm achieves the largest sum throughput per episode among all the algorithms under any . Second, the sum throughput per episode decreases as increases. Third, the sum throughput per episode increases as decreases for all the algorithms. For example, both the proposed algorithm and DQN achieve their respective largest sum throughput per episode at m/s. This is due to the fact that the path planning problem gradually reduces to the problem with static TUs as decreases, for which the globally optimal solution can be found directly.
Fig. 5 shows that our proposed algorithm is robust under different average speeds of TUs. Note that we only train the DNNs under m/s and use the trained DNNs for m/s. It is observed that the proposed algorithm can converge under speed variations.
Fig. 6 plots the UAV path planning with TUs and m/s from to based on the proposed algorithm. The dashed and solid red triangles represent the initial and current locations of the UAV, respectively. The black points are the projection of FPAPs. The dashed and solid circles are the current and previous locations of each TU, respectively. The dashed purple line links the UAV and its associated TU. The arrows are the UAV trajectory. It is shown that the UAV serves TU4 with , TU5 with , and TU2 with in respectively. To meet the QoS constraint with , the UAV flies back to serve TU4 with in .
Table I presents the percentage of QoS satisfaction over 100000 episodes for 15 TUs under the conventional $\epsilon$-greedy policy and the proposed QoS-based $\epsilon$-greedy policy. It is observed that the proposed policy significantly outperforms the conventional $\epsilon$-greedy policy.
We optimized the UAV trajectory in the UAV-mounted MEC network, where the UAV was deployed as a mobile edge server to dynamically serve the mobile TUs. We formulated the optimization problem as an MDP, assuming that the motion of each TU follows the GMRM. In particular, we developed the QoS-based $\epsilon$-greedy policy based on DDQN to maximize the long-term system reward and meet the QoS constraint. The simulation results demonstrated that the proposed algorithm not only outperforms DQN, DQL and QL in terms of convergence and sum throughput, but also nearly guarantees the QoS of each TU.
- [1] N. Abbas, Y. Zhang, A. Taherkordi, and T. Skeie, “Mobile edge computing: A survey,” IEEE Internet Things J., vol. 5, no. 1, pp. 450–465, Feb. 2018.
- [2] F. Zhou, Y. Wu, R. Q. Hu, and Y. Qian, “Computation rate maximization in UAV-enabled wireless-powered mobile-edge computing systems,” IEEE J. Sel. Areas Commun., vol. 36, no. 9, pp. 1927–1941, Sep. 2018.
- [3] F. Tang, Z. M. Fadlullah, B. Mao, N. Kato, F. Ono, and R. Miura, “On a novel adaptive UAV-mounted cloudlet-aided recommendation system for LBSNs,” IEEE Trans. Emerg. Topics Comput., vol. 7, no. 4, pp. 565–577, Nov. 2019.
- [4] F. Tang, Z. M. Fadlullah, N. Kato, F. Ono, and R. Miura, “AC-POCA: Anticoordination game based partially overlapping channels assignment in combined UAV and D2D-based networks,” IEEE Trans. Veh. Technol., vol. 67, no. 2, pp. 1672–1683, Feb. 2018.
- [5] A. Asheralieva and D. Niyato, “Hierarchical game-theoretic and reinforcement learning framework for computational offloading in UAV-enabled mobile edge computing networks with multiple service providers,” IEEE Internet Things J., vol. 6, no. 5, pp. 9873–9769, Oct. 2019.
- [6] S. Jeong, O. Simeone, and J. Kang, “Mobile edge computing via a UAV-mounted cloudlet: Optimization of bit allocation and path planning,” IEEE Trans. Veh. Technol., vol. 67, no. 3, pp. 2049–2063, Mar. 2018.
- [7] L. Zhang, Z. Zhao, Q. Wu, H. Zhao, H. Xu, and X. Wu, “Energy-aware dynamic resource allocation in UAV-assisted mobile edge computing over social internet of vehicles,” IEEE Access, vol. 6, pp. 56700–56715, Oct. 2018.
- [8] Y. Qian, F. Wang, J. Li, L. Shi, K. Cai, and F. Shu, “User association and path planning for UAV-aided mobile edge computing with energy restriction,” IEEE Wireless Commun. Lett., vol. 8, no. 5, pp. 1312–1315, Oct. 2019.
- [9] X. Gao, Y. Fang, and Y. Wu, “Fuzzy Q learning algorithm for dual-aircraft path planning to cooperatively detect targets by passive radars,” J. Syst. Eng. Electron., vol. 24, no. 5, pp. 800–810, Oct. 2013.
- [10] J. Xu, G. Solmaz, R. Rahmatizadeh, D. Turgut, and L. Bölöni, “Internet of things applications: Animal monitoring with unmanned aerial vehicle,” Comput. Sci., 2016. [Online]. Available: http://arxiv.org/abs/1610.05287.
- [11] B. Zhang, C. H. Liu, J. Tang, Z. Xu, J. Ma, and W. Wang, “Learning-based energy-efficient data collection by unmanned vehicles in smart cities,” IEEE Trans. Veh. Technol., vol. 14, no. 4, pp. 1666–1676, Apr. 2018.
- [12] N. Cheng, F. Lyu, W. Quan, C. Zhou, H. He, W. Shi, and X. Shen, “Space/aerial-assisted computing offloading for IoT applications: A learning-based approach,” IEEE J. Sel. Areas Commun., vol. 37, no. 5, pp. 1117–1129, May 2019.
- [13] S. Yin, S. Zhao, Y. Zhao, and F. R. Yu, “Intelligent trajectory design in UAV-aided communications with reinforcement learning,” IEEE Trans. Veh. Technol., vol. 68, no. 8, pp. 8227–8231, Aug. 2019.
- [14] H. V. Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Proc. AAAI-16, Feb. 2016.
- [15] S. Batabyal and P. Bhaumik, “Mobility models, traces and impact of mobility on opportunistic routing algorithms: A survey,” IEEE Commun. Surveys Tuts., vol. 17, no. 3, pp. 1679–1707, Sep. 2015.
- [16] M. Mozaffari, W. Saad, M. Bennis, and M. Debbah, “Mobile unmanned aerial vehicles (UAVs) for energy-efficient internet of things communications,” IEEE Trans. Wireless Commun., vol. 16, no. 11, pp. 7574–7589, Nov. 2017.
- [17] Q. Wu, Y. Zeng, and R. Zhang, “Joint trajectory and communication design for multi-UAV enabled wireless networks,” IEEE Trans. Wireless Commun., vol. 17, no. 3, pp. 2109–2121, Mar. 2018.
- [18] J. Lee, R. Mazumdar, and N. Shroff, “Non-convex optimization and rate control for multi-class services in the internet,” IEEE/ACM Trans. Netw., vol. 13, no. 4, pp. 827–840, Aug. 2005.
- [19] X. Liu, Y. Liu, Y. Chen, and L. Hanzo, “Trajectory design and power control for multi-UAV assisted wireless networks: A machine learning approach,” IEEE Trans. Veh. Technol., vol. 68, no. 8, pp. 7957–7969, Aug. 2019.
- [20] Q. Wang, W. Zhang, Y. Liu, and Y. Liu, “Multi-UAV dynamic wireless networking with deep reinforcement learning,” IEEE Commun. Lett., vol. 23, no. 12, pp. 2243–2246, Dec. 2019.
- [21] H. Huang, Y. Yang, H. Wang, Z. Ding, H. Sari, and F. Adachi, “Deep reinforcement learning for UAV navigation through massive MIMO technique,” IEEE Trans. Veh. Technol., early access, 2019.