Mobile-edge computing (MEC), which provides computing capabilities within the radio access networks (RANs) in close proximity to the mobile users (MUs), is a promising paradigm to address the tension between computation-intensive applications and resource-constrained mobile devices. Offloading computation tasks to the resource-rich MEC cloud not only greatly improves the computation quality of service and experience, but also augments the capability of a mobile device to run a variety of resource-demanding applications. A number of recent works have designed computation offloading schemes. For example, Wang et al. proposed a Lagrangian duality method to minimize the total energy consumption in a computation-latency-constrained, wireless-powered multiuser MEC system. Liu et al. studied the power-delay tradeoff for a MEC system using the Lyapunov optimization technique. In our prior work, the infinite time-horizon Markov decision process (MDP) framework was used to model the problem of computation offloading for a MU in an ultra-dense RAN, and deep reinforcement learning (DRL) based schemes were proposed to solve for the optimal policies.
Offloading the input data of a task from the mobile device of a MU to the MEC cloud requires wireless transmissions, which are subject to the dynamics of the surrounding environment. In particular, the time-varying channel qualities caused by MU mobility in turn limit the computation performance. Owing to, among other things, their low deployment cost, flexibility and line-of-sight (LOS) connections, unmanned aerial vehicles (UAVs) are expected to play a significant role in advancing future wireless networks. Leveraging UAV technology in a MEC system has been shown to bring substantial benefits. Hu et al. put forward an alternating algorithm to minimize the weighted sum energy consumption for a UAV-assisted MEC system. Zhou et al. investigated a UAV-enabled wireless-powered MEC system and derived alternating algorithms to solve the computation rate maximization problems under both the partial and the binary computation offloading modes. However, most of the existing literature is based on a finite time-horizon.
In this paper, we concentrate on a three-dimensional UAV-assisted MEC system, in which a UAV is deployed as a complementary computing server flying in the air. That is, in addition to local computation execution, each MU in the system can also offload a computation task to the UAV or to the MEC cloud via one of the base stations (BSs) in the RAN. The UAV can co-execute the computation tasks of the MUs by creating isolated virtual machines (VMs). Sharing the same physical UAV platform causes I/O interference, which reduces the computation rate of each VM. In this context, the MUs compete to schedule local and remote task computations with awareness of the environmental dynamics. The aim of each MU is to maximize the expected long-term computation performance. The non-cooperative interactions among the MUs are modeled as a stochastic game. Solving for a Nash equilibrium (NE) of the stochastic game requires complete information exchange among the MUs, which is prohibitive in practice. Motivated by recent advances in recurrent and deep neural networks, we propose a proactive DRL scheme that enables each MU to behave at an approximated NE with only local information [10, 11]. Furthermore, we establish a digital twin of the MEC system to overcome the hurdle of training the neural networks. To the best of our knowledge, no comprehensive study exists on stochastic resource awareness among the non-cooperative MUs in a UAV-assisted MEC system.
II System Descriptions and Assumptions
As illustrated in Fig. 1, we focus on a three-dimensional scenario, in which a terrestrial MEC system is assisted by a UAV. The UAV hovers in the air at a fixed altitude (in meters). (This work assumes that the power of the UAV is supplied by laser charging. Hence the UAV is able to operate over the long run.) The terrestrial MEC system consists of a set of BSs, which are connected via wired links to the computing cloud at the edge. To ease analysis, we use a common finite set of locations (i.e., small two-dimensional non-overlapping areas) to denote both the terrestrial service region covered by the BSs and the region of the UAV mapped vertically from the air to the ground. (Each location or small area can be characterized by uniform wireless communication conditions.) In the system, a set of MUs coexist and generate sporadic computation tasks over the infinite time-horizon, which is discretized into decision epochs. Each epoch is of equal duration (in seconds) and is indexed by an integer.
II-A Mobility Model
We apply the smooth-turn mobility model with a reflecting boundary to simulate the UAV trajectory. In this model, the UAV maintains a constant forward speed but randomly changes the centripetal acceleration. Let be the mapped terrestrial location of the UAV during a decision epoch . With regard to the MUs, their movements are modelled using a boundary Gauss-Markov mobility model. Specifically, the location of each MU during each decision epoch is determined by both the location at epoch and the velocity during epoch , while the velocity of a MU during a decision epoch depends on the velocity during the previous epoch only.
II-B Task Model
The computation task arrivals at the MUs are assumed to be independent and identically distributed sequences of Bernoulli random variables with a common parameter. More specifically, we choose to be the task arrival indicator for a MU , that is, if a computation task is generated at MU at the end of epoch and otherwise, . Then, , , where denotes the probability of the occurrence of an event. We let (in bits) and represent, respectively, the input data size and the number of CPU cycles required to accomplish one input bit of a computation task. Tasks that have arrived but have not yet been processed are queued in the buffer of a MU. A computation task can be either computed locally at the device of the MU or executed remotely (at the UAV or the MEC cloud). We let and denote the local and remote computation task scheduling decisions of MU at each decision epoch . That is, if MU sends a computation task to the local CPU and otherwise, , while if MU offloads the computation task to the UAV, , or to the MEC cloud via one of the BSs, () and otherwise, . Hence the task queue dynamics of MU can be expressed as
where is the number of computation tasks in the task buffer of MU at the beginning of decision epoch and is an indicator function that equals if the condition is satisfied and , otherwise. In this work, we assume a large enough buffer capacity for a MU to avoid the buffer overflows.
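The task model above can be illustrated with a minimal Python sketch. The queue update used here (the buffer shrinks by one task whenever a queued task is scheduled locally or remotely, then grows by the Bernoulli arrival) is an assumed reading of the queue dynamics, and the names `step_queue`, `simulate` and the always-local policy are hypothetical.

```python
import random

def step_queue(queue_len, local_decision, remote_decision, arrival):
    """Assumed one-epoch queue update: departures first, then the new arrival."""
    # A task leaves the buffer only if one is queued and some decision schedules it.
    scheduled = 1 if queue_len > 0 and (local_decision > 0 or remote_decision > 0) else 0
    return queue_len - scheduled + arrival

def simulate(num_epochs, arrival_prob, seed=0):
    """Simulate the buffer under i.i.d. Bernoulli arrivals and an always-local policy."""
    rng = random.Random(seed)
    queue_len = 0
    history = []
    for _ in range(num_epochs):
        # Illustrative policy (an assumption): process locally whenever possible.
        local = 1 if queue_len > 0 else 0
        arrival = 1 if rng.random() < arrival_prob else 0
        queue_len = step_queue(queue_len, local, 0, arrival)
        history.append(queue_len)
    return history
```

With an arrival probability of one and a policy that always serves a queued task, the buffer settles at a single queued task, which matches the intuition that one task is served per arrival.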
II-C Computation Model
The UAV complements the terrestrial MEC system with the computation resource from the air. By strategically offloading the computation tasks to the UAV or the MEC cloud via one of the BSs for remote execution, the MUs can expect a significantly improved computation experience.
II-C1 Local Computation
When a computation task is scheduled for processing locally at the mobile device of a MU during a decision epoch , i.e., , the number of needed epochs can be calculated as , where means the ceiling function and we assume that the local CPU of a MU operates at frequency (in Hz). We describe the local processing state at a decision epoch using the number of remaining epochs to finish the computation task. For local computation during an epoch , the processing delay experienced by MU is given by
and the resulting energy consumed by the mobile device of MU is then
where is the effective switched capacitance that depends on the chip architecture of a mobile device .
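As a rough illustration of the local computation model, the following Python sketch computes the number of epochs a task occupies the local CPU and the energy spent in one epoch. The cubic dynamic-power model (power equal to the switched capacitance times the frequency cubed) is a commonly used assumption and may differ from the exact expression in this paper; all function and parameter names are hypothetical.

```python
import math

def local_epochs_needed(input_bits, cycles_per_bit, cpu_hz, epoch_s):
    """Ceiling of the total CPU time divided by the epoch duration."""
    total_cycles = input_bits * cycles_per_bit
    return math.ceil(total_cycles / (cpu_hz * epoch_s))

def local_delay_in_epoch(remaining_cycles, cpu_hz, epoch_s):
    """Processing delay within one epoch: a full epoch unless the task finishes early."""
    return min(remaining_cycles / cpu_hz, epoch_s)

def local_energy_in_epoch(switched_capacitance, cpu_hz, busy_s):
    """Assumed model: power = capacitance * f^3, energy = power * busy time."""
    return switched_capacitance * cpu_hz ** 3 * busy_s
```

For instance, a task of 2.5e9 CPU cycles on a 1 GHz CPU with one-second epochs spans three epochs, the last of which is only half used.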
II-C2 Remote Execution
For remote computation execution, a MU has to be first associated with a BS or the UAV until the task is accomplished. Let be the association state of each MU during a decision epoch , namely, if MU is associated with a BS and if MU is associated with the UAV, . Then
where , while and mean, respectively, logical OR and logical AND. When , which may happen only when , a handover among the BSs and the UAV is triggered. (If a MU does not offload a task at the beginning of a decision epoch , the association state remains unchanged, i.e., . In this case, no handover will be triggered.) We assume that the energy consumption during the occurrence of one handover is negligible at MU but the incurred delay is (in seconds). During a decision epoch , MU experiences the average channel power gains for the link between MU and BS and for the link between MU and the UAV, which are determined by the physical distances.
At the beginning of a decision epoch , if a MU lets the MEC cloud execute a computation task, all input data need to be offloaded via a BS , for which the achievable data rate can be written as , where is the frequency bandwidth exclusively allocated to a MU, is the transmit power and is the noise power spectral density. We use to denote the local transmission state of MU at the beginning of a decision epoch , which indicates the remaining amount of input data to be transmitted for the task. Hence the transmission delay (which includes the delay during the handover procedure) and the energy consumption during epoch are calculated as and . In this paper, we assume that the BSs are connected via wired links to the MEC cloud, which has rich computation resources. We ignore the round-trip delay between the BSs and the MEC cloud as well as the time consumed for processing a computation task at the MEC cloud. Further, the time consumed by the selected BS (or the UAV in the following) to send back the computation result is negligible, because the result is much smaller in size than the input data of a computation task.
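The offloading step can be sketched in Python as follows. The sketch assumes the standard Shannon-capacity form for the achievable rate (bandwidth, channel power gain, transmit power, noise power spectral density) and a per-epoch accounting in which a handover, if any, consumes part of the epoch before transmission starts; the paper's exact expressions may differ, and all names are hypothetical.

```python
import math

def achievable_rate(bandwidth_hz, channel_gain, tx_power_w, noise_psd_w_per_hz):
    """Assumed Shannon rate in bit/s over an exclusively allocated band."""
    snr = channel_gain * tx_power_w / (noise_psd_w_per_hz * bandwidth_hz)
    return bandwidth_hz * math.log2(1.0 + snr)

def transmit_one_epoch(remaining_bits, rate_bps, epoch_s, tx_power_w, handover_s=0.0):
    """Return (bits left, delay, energy) after one epoch of uplink transmission."""
    # Time left for transmission once the handover delay has been paid.
    airtime_budget = max(epoch_s - handover_s, 0.0)
    airtime = min(remaining_bits / rate_bps, airtime_budget)
    sent_bits = min(remaining_bits, rate_bps * airtime)
    delay = min(handover_s, epoch_s) + airtime
    energy = tx_power_w * airtime  # handover energy is neglected, as in the text
    return remaining_bits - sent_bits, delay, energy
```

The transmission state (the remaining input bits) returned by one call is the input to the call for the next epoch, until no bits remain.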
Similarly, if a MU offloads a computation task to the UAV for processing at a decision epoch , namely, , the time and the energy consumed during each decision epoch turn out to be and , respectively, where is the achievable data rate, while denotes the transmission state at a decision epoch . (After receiving all the input data of a computation task during a current decision epoch, the UAV starts processing from the subsequent decision epoch, since the VMs are created at the beginning of an epoch.) Let represent the subset of MUs whose computation tasks are being simultaneously processed by the corresponding VMs at the UAV during a decision epoch . Denoting by the computation service rate of a VM at the UAV when the task is run in isolation, we model the degraded computation rate of each MU as , where means the cardinality of a set and is a factor specifying the percentage of reduction in the computation rate of a VM when multiplexed with another VM. Accordingly, we obtain the remote processing delay of MU during decision epoch as with the remote processing state showing the amount of input data to be processed at the beginning of an epoch .
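To make the I/O interference effect concrete, the Python sketch below assumes a multiplicative degradation model in which each additional co-located VM divides the isolated service rate by one plus the reduction factor. This form is an assumption chosen to match the verbal description (a fixed percentage of reduction per additional VM) and is not necessarily the exact expression used here; the function names are hypothetical.

```python
def degraded_rate(isolated_rate, num_active_vms, reduction_factor):
    """Assumed model: rate = isolated_rate * (1 + factor) ** (1 - number of VMs)."""
    if num_active_vms < 1:
        raise ValueError("at least one VM must be active")
    return isolated_rate * (1.0 + reduction_factor) ** (1 - num_active_vms)

def remote_delay_in_epoch(remaining_bits, rate_bits_per_s, epoch_s):
    """Processing delay at the UAV within one epoch for the remaining input data."""
    return min(remaining_bits / rate_bits_per_s, epoch_s)
```

Under this assumed model a VM running alone keeps its isolated rate, and the per-VM rate decreases monotonically as more MUs offload to the UAV in the same epoch, which is the coupling that makes the MUs' decisions interdependent.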
III Problem Formulation and Game-Theoretic Solution
During each decision epoch , the local state of a MU can be described by , where is a common finite state space for all MUs. We use to represent the global system state with denoting all the other MUs in without the presence of a MU . Let be the stationary task scheduling policy employed by MU . When deploying , MU observes at the beginning of a decision epoch and accordingly, makes local as well as remote task scheduling decisions, that is, . We define an immediate utility function
to measure the satisfaction of experienced delay and consumed energy for each MU during each epoch , where is the weighting constant, is composed of not only the processing and transmission delay but also the task queueing delay, while constitutes the total local energy consumption. (To stabilize the training process of the proactive algorithm designed in this work, we choose an exponential function for the definition of an immediate utility, whose value does not diverge dramatically.)
Along with the discussions, it can be easily verified that the randomness lying in a sequence of the global system states over the time horizon is Markovian. Given a stationary task scheduling policy by each MU and an initial global state , we express the expected long-term discounted utility function of MU as
where is the discount factor and the expectation is taken over different decision makings under different global system states following a joint task scheduling policy across the decision epochs. When approaches , (6) approximates the expected long-term un-discounted utility as well . is also termed as the state value function in a global system state under a joint task scheduling policy .
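The relationship between the immediate utility and the long-term discounted utility can be sketched in Python. The exponential utility form below (a sum of decaying exponentials in delay and energy, combined by the weighting constant) is purely an assumed example of a bounded utility, and any normalization applied in (6) is omitted; the names are hypothetical.

```python
import math

def immediate_utility(delay_s, energy_j, weight):
    """Assumed bounded utility: larger delay or energy yields smaller satisfaction."""
    return math.exp(-delay_s) + weight * math.exp(-energy_j)

def discounted_utility(utilities, discount):
    """Discounted sum of a finite utility trajectory, evaluated backwards."""
    total = 0.0
    for utility in reversed(utilities):
        total = utility + discount * total
    return total
```

Because each exponential term lies between zero and one, the immediate utility stays within a fixed range regardless of how long the task queue grows, which is the property that keeps training targets from diverging.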
Due to the shared I/O resource at the UAV and the dynamic nature of the networking environment, we formulate the problem of resource awareness among multiple MUs across the decision epochs as a non-cooperative stochastic game, in which the MUs are the players and there is a set of global system states and a collection of task scheduling policies . The aim of each MU is to devise a best-response policy that maximizes , which can be formally formulated as , . A NE, which is a best-response task scheduling policy profile , describes the rational behaviours of the MUs in a stochastic game. To operate at the NE, a MU has to know the complete global system dynamics, which is prohibitive in a non-cooperative networking environment. Define as the optimal state-value function.
IV Proactive DRL with Local Observations
In this section, we shall develop a proactive DRL algorithm to approach the NE task scheduling policy.
IV-A Approximation from Local Observations
During the competitive interactions with other MUs in the stochastic game, it is challenging for a MU to obtain the global system state information. There still exists the possibility for each MU to acquire the side information, which is the partial observation , of during a decision epoch . In this work, the partial observation of MU at the beginning of a decision epoch indicates the remote processing delay at the UAV from the previous epoch , namely, . Therefore, can be approximated by (7),
where is the initial partial observation of . Each MU then switches to solve the following single-agent MDP,
A dynamic programming approach to (8) based on value or policy iteration requires complete a priori knowledge of the local state and observation transition statistics. Q-learning, in contrast, enables each MU to learn in an unknown MEC system. Define
as the Q-function, where and are the decision makings at a current decision epoch, and are the local state and the partial observation at the subsequent epoch, while . In turn, can be straightforwardly obtained from
with and denoting the local and remote computation task scheduling decisions under .
During the process of Q-learning, each MU in the network first observes , , at a current decision epoch as well as at the next epoch , and then updates the Q-function iteratively as
where is the learning rate. It has been well established that if: 1) the global system state transition probability under is time-invariant; 2) the sum of the learning rates over the decision epochs is infinite while the sum of their squares is finite; and 3) all state-decision pairs are visited infinitely often, the learning process converges towards .
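The tabular learning step described above corresponds to the textbook Q-learning update, which the following Python sketch illustrates. Here a "state" stands for the pair of local state and partial observation, an "action" for the pair of local and remote scheduling decisions, and any normalization constant in the paper's update is omitted; the function name and the table layout are hypothetical.

```python
from collections import defaultdict

def q_update(q_table, state, action, utility, next_state, actions,
             learning_rate, discount):
    """One textbook Q-learning step; returns the updated Q-value."""
    # Greedy value of the next (local state, partial observation) pair.
    best_next = max(q_table[(next_state, a)] for a in actions)
    target = utility + discount * best_next
    # Move the current estimate towards the bootstrapped target.
    q_table[(state, action)] += learning_rate * (target - q_table[(state, action)])
    return q_table[(state, action)]

# Unvisited entries default to zero.
q_table = defaultdict(float)
```

The table needs one entry per state-action pair, which is what becomes impractical once the joint space of local states and partial observations grows large.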
IV-B Proactive DRL for NE Control Policy
For the system model investigated in this paper, the joint space of local states and partial observations faced by each MU is extremely large, and the tabular representation of the Q-function values makes Q-learning impractical. Inspired by the widespread success of deep neural networks, we adopt a double deep Q-network (DQN) to model the Q-function of a MU . However, the accuracy of (8), which is based on partial observations of other MUs in the MEC system, can be, in general, arbitrarily poor. To overcome this challenge of partial observability, we propose a slight modification to the DQN architecture. That is, we replace the first fully-connected layer of the DQN with a long short-term memory (LSTM) layer, resulting in a deep recurrent Q-network (DRQN) [25, 10].
More specifically, for each MU in the MEC system, is replaced by , where
contains a vector of parameters associated with the DRQN while consists of the most recent local states and partial observations up to a current decision epoch , namely,
It is worth mentioning that is taken as an input to the LSTM layer of the DRQN of MU for a proactive and more precise prediction of the current global system state . Eventually, a MU learns the parameters of a DRQN, instead of finding the Q-function according to the tabular update rule of Section IV-A.
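The input fed to the LSTM layer is a sliding window over the most recent local states and partial observations. A minimal Python sketch of such a window is given below; the class name `ObservationHistory`, the padding value used before enough epochs have elapsed, and the tuple encoding of each entry are all assumptions for illustration.

```python
from collections import deque

class ObservationHistory:
    """Fixed-length window of (local state, partial observation) pairs."""

    def __init__(self, length, padding):
        # Pre-fill with a padding entry so the sequence length is constant.
        self._window = deque([padding] * length, maxlen=length)

    def push(self, local_state, partial_observation):
        """Append the newest pair; the oldest pair is dropped automatically."""
        self._window.append((local_state, partial_observation))

    def as_sequence(self):
        """Return the window, oldest first, as the recurrent network input."""
        return list(self._window)
```

Keeping the window length fixed lets every input sequence have the same shape, which simplifies batching when the recurrent network is trained.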
IV-C Offline Training by Digital Twin
Simply being equipped with an independent DRQN at each MU raises two new technical challenges:
the possibly asynchronous training of DRQNs at the MUs constrains the overall system performance; and
in practice, the limited computation capability at the mobile device of a MU hinders the feasibility of training a DRQN locally.
As a promising alternative, we set up a digital twin of the MEC system to offline train the DRQNs, the parameters of which can be preloaded to a MU during the network initiation. From the assumptions made in this paper and the definition of an identical utility function structure as in (5), the homogeneous behaviours in all MUs provide an opportunity for the digital twin to train a common DRQN with parameters . In other words, we derive for each MU , .
To implement the DRQN offline training at the digital twin, we maintain a replay memory to store the most recent experiences up to the beginning of each decision epoch , where an experience () is given as (14).
Meanwhile, a pool of latest local states and partial observations is kept to predict the global system state for task scheduling policy evaluation at epoch . Both and are refreshed over the decision epochs. We first randomly sample a mini-batch of size from , where each () is given by (15).
Then the set of parameters at epoch is updated by minimizing the accumulative loss function, which is defined as
where is the set of parameters of the target DRQN at a certain previous decision epoch before epoch .
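The training step above follows the usual double-DQN recipe: the online network selects the greedy decision for the next epoch, and the target network evaluates it. The Python sketch below illustrates the replay memory and the loss over a mini-batch with plain lists standing in for network outputs; the names, the tuple layout of a sample, and the use of a mean (rather than a sum) of squared errors are assumptions.

```python
import random
from collections import deque

class ReplayMemory:
    """Bounded store that keeps only the most recent experiences."""

    def __init__(self, capacity):
        self._buffer = deque(maxlen=capacity)

    def store(self, experience):
        self._buffer.append(experience)

    def sample(self, batch_size, rng=random):
        """Draw a mini-batch uniformly at random without replacement."""
        return rng.sample(list(self._buffer), batch_size)

    def __len__(self):
        return len(self._buffer)

def double_dqn_target(utility, next_q_online, next_q_target, discount):
    """Select the next decision with the online values, evaluate it with the target values."""
    best = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return utility + discount * next_q_target[best]

def minibatch_loss(batch, discount):
    """Mean squared error between double-DQN targets and the current Q-values."""
    total = 0.0
    for q_taken, utility, next_q_online, next_q_target in batch:
        target = double_dqn_target(utility, next_q_online, next_q_target, discount)
        total += (target - q_taken) ** 2
    return total / len(batch)
```

Decoupling decision selection from evaluation in this way is what distinguishes the double DQN from the original DQN, and it is commonly used to reduce overestimation of Q-values.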
V Numerical Experiments
In order to quantify the performance gain from the proposed proactive DRL scheme in a UAV-assisted MEC system, we conduct numerical experiments based on TensorFlow. For experimental purposes, we build a terrestrial MEC system with BSs in a km square area. The BSs are placed at equal distance apart, and the square area is divided into locations, each representing a small area of m. The channel model in  and the LOS model in  are assumed, respectively, for and , , and . We use the mobility configurations as in  for the MUs and the UAV. Regarding the DRQN, we design two fully connected layers after the LSTM layer, with each of the three layers containing neurons. ReLU is selected as the activation function and Adam as the optimizer. Other parameter values are listed in Table I.
For the performance comparisons, we design the following four baseline schemes as well.
Local Computation – Each MU processes locally all arriving computation tasks.
Cloud Execution – All arriving computation tasks at the MUs are offloaded to the MEC cloud for execution via the BS with the best channel gain.
UAV Execution – All queued computation tasks from the MUs are processed by the VMs at the UAV.
Greedy Processing – Each MU schedules the local task computation or offloads the computation to the UAV or the MEC cloud whenever possible.
In the experiments, the priority is to demonstrate the average utility performance per MU across the decision epochs from the proposed proactive DRL scheme and the four baselines under various computation task arrival probabilities. We assume MUs in the MEC system. The results are depicted in Fig. 2. It can be observed from the curves that the average utility performance of all five schemes (the proposed scheme, local computation, cloud execution, UAV execution and greedy processing) decreases as the computation task arrival probability increases, which agrees with the intuition that the per-MU task queue length grows. Due to the LOS wireless transmissions between the MUs and the UAV, the UAV execution scheme achieves better average utility performance than the cloud execution scheme. As the task arrival probability increases, more task computations are offloaded for UAV execution under the greedy processing scheme to avoid the possible handover delay, though the cloud execution scheme outperforms the local computation scheme. Among the four baselines, the greedy processing scheme exhibits the best performance under large task arrival probabilities. Last but not least, the results clearly show that the proposed scheme provides a significant performance gain compared with the four baselines.
In this work, we study the design of a stochastic local and remote computation scheduling policy for each MU in a UAV-assisted MEC system, taking into account the system dynamics originating from the UAV and MU mobilities as well as the time-varying computation task arrivals. The non-cooperative interactions among the MUs across the decision epochs are formulated as a stochastic game. To approach the NE, we derive a proactive DRL scheme, with which each MU schedules local and remote computations using only the local information. The homogeneity in the behaviours of MUs facilitates the use of a digital twin to offline train the proposed scheme. From numerical experiments, we find that compared with the four baselines, the proposed proactive DRL scheme achieves the best average utility performance.
-  Y. Mao et al., “A survey on mobile edge computing: The communication perspective,” IEEE Commun. Surveys Tuts., vol. 19, no. 4, pp. 2322–2358, Q4 2017.
-  F. Wang et al., “Joint offloading and computing optimization in wireless powered mobile-edge computing systems,” IEEE Trans. Wireless Commun., vol. 17, no. 3, pp. 1784–1797, Mar. 2018.
-  C.-F. Liu, M. Bennis, and H. V. Poor, “Latency and reliability-aware task offloading and resource allocation for mobile edge computing,” in Proc. IEEE GLOBECOM WKSHP, Singapore, Dec. 2017.
-  X. Chen et al., “Optimized computation offloading performance in virtual edge computing systems via deep reinforcement learning,” IEEE Internet Things J., vol. 6, no. 3, pp. 4005–4018, Jun. 2019.
-  X. Chen et al., “Multi-tenant cross-slice resource orchestration: A deep reinforcement learning approach,” IEEE J. Sel. Areas Commun., vol. 37, no. 10, pp. 2377–2392, Oct. 2019.
-  M. Mozaffari et al., “A tutorial on UAVs for wireless networks: Applications, challenges, and open problems,” IEEE Commun. Surveys Tuts., vol. 21, no. 3, Q3 2019.
-  X. Hu et al., “UAV-assisted relaying and edge computing: Scheduling and trajectory optimization,” IEEE Trans. Wireless Commun., vol. 18, no. 10, pp. 4738–4752, Oct. 2019.
-  F. Zhou et al., “Computation rate maximization in UAV-enabled wireless-powered mobile-edge computing systems,” IEEE J. Sel. Areas Commun., vol. 36, no. 9, pp. 1927–1941, Sep. 2018.
-  Z. Liang et al., “Multiuser computation offloading and downloading for edge computing with virtualization,” IEEE Trans. Wireless Commun., vol. 18, no. 9, pp. 4298–4311, Sep. 2019.
-  X. Chen et al., “Age of information-aware radio resource management in vehicular networks: A proactive deep reinforcement learning perspective,” 2019. Available: https://arxiv.org/pdf/1908.02047.pdf
-  O. Naparstek and K. Cohen, “Deep multi-user reinforcement learning for distributed dynamic spectrum access,” IEEE Trans. Wireless Commun., vol. 18, no. 1, pp. 310–323, Jan. 2019.
-  R. Dong et al., “Deep learning for hybrid 5G services in mobile edge computing systems: Learn from a digital twin,” IEEE Trans. Wireless Commun., vol. 18, no. 10, pp. 4692–4707, Oct. 2019.
-  X. Liu et al., “Trajectory design and power control for multi-UAV assisted wireless networks: A machine learning approach,” IEEE Trans. Veh. Technol., vol. 68, no. 8, pp. 7957–7969, Aug. 2019.
-  Y. Wan et al., “A smooth-turn mobility model for airborne networks,” IEEE Trans. Veh. Technol., vol. 62, no. 7, pp. 3359–3370, Sep. 2013.
-  X. Xi et al., “Efficient and fair network selection for integrated cellular and drone-cell networks,” IEEE Trans. Veh. Technol., vol. 68, no. 1, pp. 923–937, Jan. 2019.
-  T. D. Burd and R. W. Brodersen, “Processor design for portable systems,” J. VLSI Signal Process. Syst., vol. 13, no. 2–3, pp. 203–221, Aug. 1996.
-  X. Chen et al., “Efficient multi-user computation offloading for mobile-edge cloud computing,” IEEE/ACM Trans. Netw., vol. 24, no. 5, pp. 2795–2808, Oct. 2016.
-  D. Adelman and A. J. Mersereau, “Relaxations of weakly coupled stochastic dynamic programs,” Oper. Res., vol. 56, no. 3, pp. 712–727, Jan. 2008.
-  A. M. Fink, “Equilibrium in a stochastic n-person game,” J. Sci. Hiroshima Univ. Ser. A-I, vol. 28, pp. 89–93, 1964.
-  R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
-  M. Abadi et al., “Tensorflow: A system for large-scale machine learning,” in Proc. OSDI, Savannah, GA, Nov. 2016.
-  V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.
-  H. van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Proc. AAAI, Phoenix, AZ, Feb. 2016.
-  S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 9, pp. 1735–1780, Nov. 1997.
-  M. Hausknecht and P. Stone, “Deep recurrent Q-learning for partially observable MDPs,” in Proc. AAAI, Austin, TX, Jan. 2015.
-  Y. Zeng, R. Zhang, and T. J. Lim, “Throughput maximization for UAV-enabled mobile relaying systems,” IEEE Trans. Commun., vol. 64, no. 12, pp. 4983–4996, Dec. 2016.
-  D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” in Proc. ICLR, San Diego, CA, May 2015.