I. Introduction
Mobile-edge computing (MEC), which provides computing capabilities within the radio access networks (RANs) in close proximity to the mobile users (MUs), is a promising paradigm to address the tension between computation-intensive applications and resource-constrained mobile devices [1]. By offloading computation tasks to the resource-rich MEC cloud, not only can the computation quality of service and experience be greatly improved, but the capability of a mobile device can also be augmented for running a variety of resource-demanding applications. Recently, a number of related works have focused on designing computation offloading schemes. For example, in [2], Wang et al. proposed a Lagrangian duality method to minimize the total energy consumption in a computation-latency-constrained wireless powered multiuser MEC system. In [3], Liu et al. studied the power-delay tradeoff for a MEC system using the Lyapunov optimization technique. In our prior work [4], the infinite time-horizon Markov decision process (MDP) framework was used to model the problem of computation offloading for a MU in an ultra-dense RAN, and deep reinforcement learning (DRL) based schemes were proposed to solve for the optimal policies.
Offloading the input data of a task from the mobile device of a MU to the MEC cloud requires wireless transmissions, which are subject to the dynamics of the surrounding environment. In particular, the time-varying channel qualities due to MU mobility in turn limit the computation performance [5]. Because of, among other advantages, their low deployment cost, flexibility, and line-of-sight (LOS) connections, unmanned aerial vehicles (UAVs) are expected to play a significant role in advancing future wireless networks [6]. Leveraging the UAV technology in a MEC system has been shown to bring substantial benefits. In [7], Hu et al. put forward an alternating algorithm to minimize the weighted-sum energy consumption for a UAV-assisted MEC system. In [8], Zhou et al. investigated a UAV-enabled wireless-powered MEC system and derived alternating algorithms to solve the computation rate maximization problems under both the partial and the binary computation offloading modes. However, most of the existing literature is based on a finite time horizon.
In this paper, we concentrate on a three-dimensional UAV-assisted MEC system, in which a UAV is deployed as a complementary computing server flying in the air. That is, in addition to local computation execution, each MU in the system can also offload a computation task to the UAV or to the MEC cloud via one of the base stations (BSs) in the RAN. The UAV can co-execute the computation tasks of the MUs by creating isolated virtual machines (VMs) [9]. Sharing the same physical UAV platform causes I/O interference, leading to a computation rate reduction for each VM. In this context, the MUs compete to schedule local and remote task computations with awareness of the environmental dynamics. The aim of each MU is to maximize its expected long-term computation performance. The non-cooperative interactions among the MUs are modeled as a stochastic game. Solving for a Nash equilibrium (NE) of the stochastic game requires complete information exchange among the MUs, which is practically overwhelming. Motivated by recent advances in recurrent and deep neural networks, we propose a proactive DRL scheme, enabling each MU to behave at an approximated NE with only local information [10, 11]. Furthermore, we establish a digital twin of the MEC system to get over the hurdle of training the neural networks [12]. To the best of our knowledge, there does not exist a comprehensive study on stochastic resource awareness among non-cooperative MUs in a UAV-assisted MEC system.

II. System Descriptions and Assumptions
As illustrated in Fig. 1, we focus on a three-dimensional scenario, in which a terrestrial MEC system is assisted by a UAV. The UAV hovers in the air at a fixed altitude (in meters). (This work assumes that the power of the UAV is supplied by laser charging [13]; hence the UAV is able to operate over the long run.) The terrestrial MEC system consists of a set of BSs, which are connected via wired links to the computing cloud at the edge. To ease analysis, we use a common finite set of locations (i.e., small two-dimensional non-overlapping areas, each of which can be characterized by uniform wireless communication conditions [5]) to denote both the terrestrial service region covered by the BSs and the region of the UAV mapped vertically from the air to the ground. In the system, a set of MUs coexist and generate sporadic computation tasks over the infinite time horizon, which is discretized into decision epochs. Each epoch is assumed to be of equal duration (in seconds) and indexed by an integer.
II-A Mobility Model
We apply the smooth-turn mobility model with a reflecting boundary to simulate the UAV trajectory [14]. In this model, the UAV maintains a constant forward speed but randomly changes its centripetal acceleration. The mapped terrestrial location of the UAV during a decision epoch is determined accordingly. With regard to the MUs, their movements are modelled using a Gauss-Markov mobility model with a boundary [15]. Specifically, the location of each MU during each decision epoch is determined by both the location and the velocity during the previous epoch, while the velocity of a MU during a decision epoch depends only on the velocity during the previous epoch.
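As a minimal sketch of the MU mobility just described, the following one-dimensional Gauss-Markov step with a reflecting boundary can be used (the memory level `alpha`, noise scale `sigma`, and area bounds are illustrative assumptions, since the paper's exact parameter values are not given in this extract):

```python
import random

def gauss_markov_step(v, v_mean, alpha=0.8, sigma=1.0, rng=random):
    # Next velocity depends only on the previous velocity (Markov property):
    # a weighted blend of the old velocity, the long-run mean velocity,
    # and zero-mean Gaussian noise.
    return (alpha * v
            + (1.0 - alpha) * v_mean
            + sigma * (1.0 - alpha**2) ** 0.5 * rng.gauss(0.0, 1.0))

def move(loc, v, lo=0.0, hi=1000.0):
    # New location = old location + velocity over one epoch;
    # reflect at the service-region boundary.
    x = loc + v
    if x < lo:
        x = 2 * lo - x
    elif x > hi:
        x = 2 * hi - x
    return x
```

With `alpha` close to one the MU keeps its heading; with `alpha` close to zero its velocity reverts quickly to the mean, mirroring the memory property described in [15].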
II-B Task Model
The computation task arrivals at the MUs are assumed to be independent and identically distributed sequences of Bernoulli random variables with a common parameter. More specifically, we define a task arrival indicator for each MU, which equals one if a computation task is generated at the MU at the end of an epoch and zero otherwise, where the common parameter denotes the probability of the occurrence of an arrival event. We let (in bits) and the per-bit CPU requirement denote, respectively, the input data size of a computation task and the number of CPU cycles required to process one input bit. The arrived but not yet processed tasks are queued in the buffer of a MU. A computation task can be either computed locally at the device of the MU or executed remotely (at the UAV or the MEC cloud). We let the local and remote computation task scheduling decisions of a MU at each decision epoch indicate, respectively, whether the MU sends a computation task to the local CPU, and whether it offloads the computation task to the UAV or to the MEC cloud via one of the BSs. Hence the task queue dynamics of a MU can be expressed as
(1)
where the queue length is the number of computation tasks in the task buffer of the MU at the beginning of a decision epoch, and the indicator function equals one if its condition is satisfied and zero otherwise. In this work, we assume a sufficiently large buffer capacity at each MU to avoid buffer overflows.
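The queue dynamics in (1) can be sketched as follows (a hypothetical simplification in which at most one task is scheduled and at most one Bernoulli arrival occurs per epoch, consistent with the indicator-based description above):

```python
import random

def queue_step(q, scheduled, arrival_prob, rng=random):
    # One epoch of the task buffer: a scheduled task (local or remote)
    # leaves the queue if the buffer is non-empty, and a new task arrives
    # with Bernoulli probability arrival_prob.
    departure = 1 if (scheduled and q > 0) else 0
    arrival = 1 if rng.random() < arrival_prob else 0
    return q - departure + arrival
```

The `q > 0` guard plays the role of the indicator function in (1), keeping the queue length non-negative.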
II-C Computation Model
The UAV complements the terrestrial MEC system with the computation resource from the air. By strategically offloading the computation tasks to the UAV or the MEC cloud via one of the BSs for remote execution, the MUs can expect a significantly improved computation experience.
II-C1 Local Computation
When a computation task is scheduled for processing locally at the mobile device of a MU during a decision epoch, the number of needed epochs can be calculated with the ceiling function, assuming that the local CPU of a MU operates at a fixed frequency (in Hz). We describe the local processing state at a decision epoch by the number of remaining epochs needed to finish the computation task. For local computation during an epoch, the processing delay experienced by the MU is given by
(2) 
and the resulting energy consumed by the mobile device of the MU is then
(3) 
where is the effective switched capacitance that depends on the chip architecture of a mobile device [16].
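A minimal sketch of the local computation model above, using the standard dynamic-CMOS energy form (the default switched capacitance `kappa` and the sample numbers are illustrative assumptions, not the paper's values):

```python
import math

def local_epochs(input_bits, cycles_per_bit, cpu_hz, epoch_sec):
    # Number of whole epochs needed to run the task on the local CPU,
    # i.e., ceil(total required cycles / cycles executed per epoch).
    total_cycles = input_bits * cycles_per_bit
    return math.ceil(total_cycles / (cpu_hz * epoch_sec))

def local_energy_per_epoch(cpu_hz, epoch_sec, kappa=1e-27):
    # Energy per epoch: effective switched capacitance times frequency
    # squared per cycle, over the cycles executed in one epoch.
    cycles = cpu_hz * epoch_sec
    return kappa * cpu_hz**2 * cycles
```

For example, a task of 1000 bits at 1000 cycles/bit on a 500 Hz CPU with 1 s epochs needs 2000 epochs; a 1 GHz CPU at `kappa = 1e-27` consumes about 1 J per 1 s epoch.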
II-C2 Remote Execution
For remote computation execution, a MU first has to be associated with a BS or the UAV until the task is accomplished. Let the association state of each MU during a decision epoch indicate whether the MU is associated with a BS or with the UAV. Then
(4) 
where the two operators denote, respectively, logical OR and logical AND. When the association state changes, which may happen only when a task is offloaded (if a MU does not offload a task at the beginning of a decision epoch, the association state remains unchanged; in this case, no handover is triggered), a handover between the BSs and the UAV is triggered [4]. We assume that the energy consumption during one handover is negligible at a MU, but the incurred delay is nonzero (in seconds). During a decision epoch, a MU experiences average channel power gains for the link between the MU and a BS and for the link between the MU and the UAV, which are determined by the physical distances.
At the beginning of a decision epoch, if a MU lets the MEC cloud execute a computation task, all input data needs to be offloaded via a BS, for which the achievable data rate follows the Shannon capacity of the link, given the frequency bandwidth exclusively allocated to a MU, the transmit power, and the noise power spectral density. We use the local transmission state of a MU at the beginning of a decision epoch to indicate the remaining amount of input data to be transmitted for the task. The transmission delay (which includes the delay during the handover procedure) and the energy consumption during an epoch are then calculated accordingly. In this paper, we assume that the BSs are connected via wired links to the MEC cloud, which has abundant computation resources. We ignore the round-trip delay between the BSs and the MEC cloud as well as the time consumed for processing a computation task at the MEC cloud. Further, the time consumed by the selected BS (or the UAV in the following) to send back the computation result is negligible, since the result size is much smaller than the input data of a computation task [17].
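The achievable data rate referred to above can be sketched with the standard Shannon-capacity form (the symbol names are placeholders, since the paper's notation is not preserved in this extract):

```python
import math

def achievable_rate(bandwidth_hz, tx_power_w, channel_gain, noise_psd_w_per_hz):
    # Shannon capacity of the MU-to-BS (or MU-to-UAV) link:
    # R = W * log2(1 + P * g / (W * N0)), in bits per second.
    snr = tx_power_w * channel_gain / (bandwidth_hz * noise_psd_w_per_hz)
    return bandwidth_hz * math.log2(1.0 + snr)
```

The transmission delay during an epoch then follows by dividing the remaining input data by this rate.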
Similarly, if a MU offloads a computation task to the UAV for processing at a decision epoch, the time and the energy consumed during each decision epoch are calculated from the achievable data rate and the transmission state at that epoch. (After receiving all the input data of a computation task during the current decision epoch, the UAV starts processing from the subsequent decision epoch, since the VMs are created at the beginning of an epoch [9].) Let a subset of the MUs denote those whose computation tasks are being simultaneously processed by the corresponding VMs at the UAV during a decision epoch. Denoting by a constant the computation service rate of a VM at the UAV when its task runs in isolation, the degraded computation rate of each MU is modeled as a function of the cardinality of this subset, with a factor specifying the percentage of reduction in the computation rate of a VM when multiplexed with another VM. Accordingly, we obtain the remote processing delay of a MU during a decision epoch, with the remote processing state giving the amount of input data remaining to be processed at the beginning of an epoch.
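The multiplicative degradation model for co-hosted VMs described above can be sketched as follows (the default interference factor `epsilon` is an illustrative assumption):

```python
def degraded_vm_rate(isolated_rate, n_active_vms, epsilon=0.1):
    # Each extra co-hosted VM multiplies the computation rate by
    # (1 - epsilon), modeling I/O interference on the shared UAV platform:
    # rate = isolated_rate * (1 - epsilon)^(|I| - 1).
    assert n_active_vms >= 1
    return isolated_rate * (1.0 - epsilon) ** (n_active_vms - 1)
```

With `epsilon = 0.1`, three concurrently served MUs each see only 81% of the isolated service rate, which is what drives the competition among the MUs analyzed in the next section.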
III. Problem Formulation and Game-Theoretic Solution
During each decision epoch, the local state of a MU is described over a common finite state space for all MUs. We use the global system state to represent the joint state of a MU and all the other MUs. Let each MU employ a stationary task scheduling policy. When deploying this policy, a MU observes the global state at the beginning of a decision epoch and accordingly makes the local as well as the remote task scheduling decisions. We define an immediate utility function (to stabilize the training process of the proactive algorithm designed in this work, we choose an exponential function for the immediate utility, whose value does not dramatically diverge)
(5) 
to measure the satisfaction of the experienced delay and consumed energy for each MU during each epoch, where a weighting constant balances the two terms: the delay term is composed of not only the processing and transmission delay but also the task queueing delay, while the energy term constitutes the total local energy consumption.
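A minimal sketch of such a bounded exponential utility (the exact functional form and the default weight `w` are assumptions; the extract only states that an exponential function is used so that the value does not diverge):

```python
import math

def immediate_utility(delay, energy, w=0.5):
    # Exponential satisfaction measure: bounded in (0, 1], so its value
    # does not dramatically diverge during training; w trades off the
    # delay term against the energy term.
    return math.exp(-(w * delay + (1.0 - w) * energy))
```

Zero delay and zero energy give the maximum utility of one, and the utility decays smoothly as either cost grows.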
From the above discussions, it can be verified that the randomness lying in the sequence of global system states over the time horizon is Markovian. Given a stationary task scheduling policy of each MU and an initial global state, we express the expected long-term discounted utility function of a MU as
(6)  
where the discount factor weights future utilities and the expectation is taken over the decisions made under different global system states following the joint task scheduling policy across the decision epochs. As the discount factor approaches one, (6) also approximates the expected long-term undiscounted utility [18]. This function is also termed the state-value function in a global system state under a joint task scheduling policy [20].
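For a single sampled trajectory of immediate utilities, the discounted sum in (6) can be sketched as:

```python
def discounted_utility(utilities, gamma=0.99):
    # Expected long-term discounted utility for one sampled trajectory:
    # sum over epochs k of gamma^k * utility_k.
    return sum(gamma**k * u for k, u in enumerate(utilities))
```

Averaging this quantity over many sampled trajectories approximates the expectation in (6).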
Due to the shared I/O resource at the UAV and the dynamic nature of the networking environment, we formulate the problem of resource awareness among multiple MUs across the decision epochs as a non-cooperative stochastic game, in which the MUs are the players and there are a set of global system states and a collection of task scheduling policies. The aim of each MU is to devise a best-response policy that maximizes its expected long-term discounted utility. A NE, which is a best-response task scheduling policy profile, describes the rational behaviours of the MUs in a stochastic game [19]. In order to operate at the NE, a MU has to know the complete global system dynamics, which is prohibitive in a non-cooperative networking environment [5]. We define the optimal state-value function accordingly.
IV. Proactive DRL with Local Observations
In this section, we shall develop a proactive DRL algorithm to approach the NE task scheduling policy.
IV-A Approximation from Local Observations
During the competitive interactions with the other MUs in the stochastic game, it is challenging for a MU to obtain the global system state information. There still exists the possibility for each MU to acquire side information, namely a partial observation of the global state during a decision epoch. In this work, the partial observation of a MU at the beginning of a decision epoch is the remote processing delay at the UAV from the previous epoch. Therefore, the state-value function can be approximated by (7),
(7) 
where is the initial partial observation of . Each MU then switches to solve the following singleagent MDP,
(8) 
A dynamic programming approach to (8) based on value or policy iteration requires complete a priori knowledge of the local state and observation transition statistics [20]. Q-learning enables each MU to learn in an unknown MEC system. Define
(9) 
as the Q-function, where the decisions are made at a current decision epoch and the local state and the partial observation at the subsequent epoch are then observed. In turn, the approximated state-value function can be straightforwardly obtained from
(10) 
(11) 
with and denoting the local and remote computation task scheduling decisions under .
During the process of Q-learning, each MU in the network first observes its local state and partial observation at a current decision epoch as well as at the next epoch, and then updates the Q-function iteratively as in (12),
(12) 
where the learning rate controls the step size of each update. It has been well established that the learning process converges towards the optimal Q-function if: 1) the global system state transition probability is time-invariant; 2) the sum of the learning rates is infinite while the sum of their squares is finite; and 3) all state-action pairs are visited infinitely often [20].
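One iteration of the update rule in (12) can be sketched for the tabular case as follows (state and action encodings are illustrative; the paper's states are the joint local state and partial observation):

```python
from collections import defaultdict

def q_update(Q, state, action, utility, next_state, actions,
             alpha=0.1, gamma=0.9):
    # One Q-learning step: move Q(s, a) toward the observed immediate
    # utility plus the discounted value of the best action in the next state.
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = utility + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
```

Here `Q` is a table over (state, action) pairs; the next subsection replaces this table with a neural network precisely because the joint state space is too large to enumerate.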
IV-B Proactive DRL for NE Control Policy
For the system model investigated in this paper, the joint space of local states and partial observations faced by each MU is extremely large, and the tabular representation of the Q-function values makes Q-learning impractical. Inspired by the widespread success of deep neural networks [22], we adopt a double deep Q-network (DQN) to model the Q-function of a MU [23]. However, the accuracy of (8), which is based on partial observations of the other MUs in the MEC system, can in general be arbitrarily bad. In order to overcome this challenge from partial observability, we propose a slight modification to the DQN architecture. That is, we replace the first fully-connected layer of the DQN with a long short-term memory (LSTM) layer [24], resulting in a deep recurrent Q-network (DRQN) [25, 10].
More specifically, for each MU in the MEC system, the Q-function is replaced by a DRQN approximation, where a vector of parameters is associated with the DRQN and the network input consists of the most recent local states and partial observations up to a current decision epoch, namely,
(13)
It is worth mentioning that this input is fed to the LSTM layer of the DRQN of a MU for a proactive and more precise prediction of the current global system state. Eventually, a MU learns the parameters of its DRQN instead of finding the Q-function according to the rule in (12).
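The sliding-window input in (13) can be sketched as a fixed-length rolling history (the window length `J` and the zero padding for epochs before enough history exists are illustrative assumptions):

```python
from collections import deque

class History:
    # Rolling window of the J most recent (local state, partial observation)
    # pairs; this sequence is fed to the LSTM layer of the DRQN so the MU
    # can infer the hidden global system state from its own observations.
    def __init__(self, J):
        self.J = J
        self.buf = deque(maxlen=J)

    def push(self, local_state, observation):
        self.buf.append((local_state, observation))

    def as_input(self):
        # Pad with zeros until J pairs have been collected.
        pad = [(0, 0)] * (self.J - len(self.buf))
        return pad + list(self.buf)
```

Once more than `J` epochs have elapsed, the oldest pair is dropped automatically by the bounded deque, keeping the DRQN input length fixed.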
IV-C Offline Training by Digital Twin
Simply equipping each MU with an independent DRQN raises two new technical challenges: 1) the possibly asynchronous training of the DRQNs at the MUs constrains the overall system performance; and 2) in practice, the limited computation capability at the mobile device of a MU hinders the feasibility of training a DRQN locally.
As a promising alternative, we set up a digital twin of the MEC system to train the DRQNs offline; the resulting parameters can be preloaded to a MU during network initialization. From the assumptions made in this paper and the definition of an identical utility function structure as in (5), the homogeneous behaviours of all MUs provide an opportunity for the digital twin to train a single common DRQN, whose parameters are shared by every MU.
To implement the DRQN offline training at the digital twin, we maintain a replay memory to store the most recent experiences up to the beginning of each decision epoch, where an experience is given as in (14).
(14) 
Meanwhile, a pool of the latest local states and partial observations is kept to predict the global system state for task scheduling policy evaluation at each epoch. Both the replay memory and the pool are refreshed over the decision epochs. We first randomly sample a minibatch from the replay memory, where each sampled experience is given by (15).
(15) 
Then the set of parameters at each epoch is updated by minimizing the accumulative loss function, which is defined as in (16),
(16)
where the set of parameters of the target DRQN is taken from a certain previous decision epoch.
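The sampling and target construction steps above can be sketched as follows, using the double-DQN target in which the online network selects the next action and the frozen target network evaluates it (the Q-functions are passed as callables here for illustration; in the paper they are DRQNs):

```python
import random

def sample_minibatch(replay_memory, batch_size, rng=random):
    # Uniformly sample experiences (s, a, utility, s') from the replay memory.
    return rng.sample(replay_memory, batch_size)

def double_dqn_targets(batch, q_online, q_target, actions, gamma=0.99):
    # Double-DQN regression targets: the online network picks the next
    # action, the periodically frozen target network evaluates it.
    targets = []
    for (s, a, u, s_next) in batch:
        a_star = max(actions, key=lambda x: q_online(s_next, x))
        targets.append(u + gamma * q_target(s_next, a_star))
    return targets
```

The accumulative loss in (16) would then be the mean squared error between these targets and the online network's Q-values for the sampled (state, action) pairs.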
V. Numerical Experiments
In order to quantify the performance gain from the proposed proactive DRL scheme in a UAV-assisted MEC system, numerical experiments based on TensorFlow [21] are conducted. For experimental purposes, we build a terrestrial MEC system with BSs deployed in a square area. The BSs are placed at equal distances apart, and the square area is divided into locations, each representing a small area (in meters). The channel model in [5] and the LOS model in [26] are assumed for the MU-BS links and the MU-UAV links, respectively. We use the mobility configurations in [15] for the MUs and the UAV. Regarding the DRQN, we place two fully-connected layers after the LSTM layer, with each of the three layers containing the same number of neurons. ReLU is selected as the activation function [27] and Adam as the optimizer [28]. Other parameter values are listed in Table I.
For the performance comparisons, we design the following four baseline schemes as well:
1) Local Computation – each MU processes all arriving computation tasks locally;
2) Cloud Execution – all arriving computation tasks at the MUs are offloaded to the MEC cloud for execution via the BS with the best channel gain;
3) UAV Execution – all queued computation tasks from the MUs are processed by the VMs at the UAV;
4) Greedy Processing – each MU schedules the local task computation or offloads the computation to the UAV or the MEC cloud whenever possible.
In the experiments, the priority is to demonstrate the average utility performance per MU across the decision epochs from the proposed proactive DRL scheme and the four baselines under various computation task arrival probabilities. We assume a fixed number of MUs in the MEC system. The results are depicted in Fig. 2. It can be observed from the curves that the average utility performance of the proposed scheme, the local computation, the cloud execution, the UAV execution, and the greedy processing decreases as the computation task arrival probability increases, which accords with our intuition, owing to the surge in the per-MU task queue length. Due to the LOS wireless transmissions between the MUs and the UAV, the UAV execution scheme achieves better average utility performance than the cloud execution scheme. As the task arrival probability increases, more task computations are offloaded for UAV execution under the greedy processing scheme to avoid the possible handover delay, though the cloud execution scheme outperforms the local computation scheme. Among the four baselines, the greedy processing scheme exhibits the best performance under large task arrival probabilities. Last but not least, the results clearly show that the proposed scheme provides a significant performance gain compared with the four baselines.
VI. Conclusions
In this work, we study the design of a stochastic local and remote computation scheduling policy for each MU in a UAV-assisted MEC system, which takes into account the system dynamics originating from the UAV and MU mobilities as well as the time-varying computation task arrivals. The non-cooperative interactions among the MUs across the decision epochs are formulated as a stochastic game. To approach the NE, we derive a proactive DRL scheme, with which each MU schedules local and remote computations using only local information. The homogeneity in the behaviours of the MUs facilitates the use of a digital twin to train the proposed scheme offline. From the numerical experiments, we find that, compared with the four baselines, the proposed proactive DRL scheme achieves the best average utility performance.
References
 [1] Y. Mao et al., "A survey on mobile edge computing: The communication perspective," IEEE Commun. Surveys Tuts., vol. 19, no. 4, pp. 2322–2358, Q4 2017.
 [2] F. Wang et al., "Joint offloading and computing optimization in wireless powered mobile-edge computing systems," IEEE Trans. Wireless Commun., vol. 17, no. 3, pp. 1784–1797, Mar. 2018.
 [3] C.-F. Liu, M. Bennis, and H. V. Poor, "Latency and reliability-aware task offloading and resource allocation for mobile edge computing," in Proc. IEEE GLOBECOM WKSHP, Singapore, Dec. 2017.
 [4] X. Chen et al., "Optimized computation offloading performance in virtual edge computing systems via deep reinforcement learning," IEEE Internet Things J., vol. 6, no. 3, pp. 4005–4018, Jun. 2019.
 [5] X. Chen et al., "Multi-tenant cross-slice resource orchestration: A deep reinforcement learning approach," IEEE J. Sel. Areas Commun., vol. 37, no. 10, pp. 2377–2392, Oct. 2019.
 [6] M. Mozaffari et al., "A tutorial on UAVs for wireless networks: Applications, challenges, and open problems," IEEE Commun. Surveys Tuts., vol. 21, no. 3, Q3 2019.
 [7] X. Hu et al., "UAV-assisted relaying and edge computing: Scheduling and trajectory optimization," IEEE Trans. Wireless Commun., vol. 18, no. 10, pp. 4738–4752, Oct. 2019.
 [8] F. Zhou et al., "Computation rate maximization in UAV-enabled wireless-powered mobile-edge computing systems," IEEE J. Sel. Areas Commun., vol. 36, no. 9, pp. 1927–1941, Sep. 2018.
 [9] Z. Liang et al., "Multiuser computation offloading and downloading for edge computing with virtualization," IEEE Trans. Wireless Commun., vol. 18, no. 9, pp. 4298–4311, Sep. 2019.
 [10] X. Chen et al., "Age of information-aware radio resource management in vehicular networks: A proactive deep reinforcement learning perspective," 2019. Available: https://arxiv.org/pdf/1908.02047.pdf
 [11] O. Naparstek and K. Cohen, "Deep multi-user reinforcement learning for distributed dynamic spectrum access," IEEE Trans. Wireless Commun., vol. 18, no. 1, pp. 310–323, Jan. 2019.
 [12] R. Dong et al., "Deep learning for hybrid 5G services in mobile edge computing systems: Learn from a digital twin," IEEE Trans. Wireless Commun., vol. 18, no. 10, pp. 4692–4707, Oct. 2019.
 [13] X. Liu et al., "Trajectory design and power control for multi-UAV assisted wireless networks: A machine learning approach," IEEE Trans. Veh. Technol., vol. 68, no. 8, pp. 7957–7969, Aug. 2019.
 [14] Y. Wan et al., "A smooth-turn mobility model for airborne networks," IEEE Trans. Veh. Technol., vol. 62, no. 7, pp. 3359–3370, Sep. 2013.
 [15] X. Xi et al., "Efficient and fair network selection for integrated cellular and drone-cell networks," IEEE Trans. Veh. Technol., vol. 68, no. 1, pp. 923–937, Jan. 2019.
 [16] T. D. Burd and R. W. Brodersen, "Processor design for portable systems," J. VLSI Signal Process. Syst., vol. 13, no. 2–3, pp. 203–221, Aug. 1996.
 [17] X. Chen et al., "Efficient multi-user computation offloading for mobile-edge cloud computing," IEEE/ACM Trans. Netw., vol. 24, no. 5, pp. 2795–2808, Oct. 2016.
 [18] D. Adelman and A. J. Mersereau, "Relaxations of weakly coupled stochastic dynamic programs," Oper. Res., vol. 56, no. 3, pp. 712–727, Jan. 2008.
 [19] A. M. Fink, "Equilibrium in a stochastic n-person game," J. Sci. Hiroshima Univ. Ser. A-I, vol. 28, pp. 89–93, 1964.
 [20] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
 [21] M. Abadi et al., "TensorFlow: A system for large-scale machine learning," in Proc. OSDI, Savannah, GA, Nov. 2016.
 [22] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.
 [23] H. van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Proc. AAAI, Phoenix, AZ, Feb. 2016.
 [24] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
 [25] M. Hausknecht and P. Stone, "Deep recurrent Q-learning for partially observable MDPs," in Proc. AAAI, Austin, TX, Jan. 2015.
 [26] Y. Zeng, R. Zhang, and T. J. Lim, "Throughput maximization for UAV-enabled mobile relaying systems," IEEE Trans. Commun., vol. 64, no. 12, pp. 4983–4996, Dec. 2016.
 [27] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. ICML, Haifa, Israel, Jun. 2010.
 [28] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. ICLR, San Diego, CA, May 2015.