I Introduction
Nowadays, user equipments (UEs) such as smartphones, tablets, wearable devices and other smart Internet of Things devices are becoming increasingly popular and bring great convenience to our daily life. Moreover, many emerging mobile applications (e.g., augmented reality, smart navigation and interactive services) are receiving more and more attention, but most of them are resource intensive, which makes it very difficult for UEs to execute them, due to the limited battery and computation resources (e.g., CPU, storage or memory) of UEs.
Fortunately, mobile edge computing (MEC) has recently been proposed as a means to enable UEs with intensive computational tasks to offload them to the edge cloud, which can not only prolong the battery life of UEs but also increase their computational capacity. Offloading decision making and resource allocation have been studied in [1, 2], while MEC with Cloud Radio Access Network (CRAN) has been investigated in [3, 4, 5]. The above works either consider that there is only one MEC server (e.g., [1, 7]), or consider MEC servers with fixed locations (e.g., [8, 3]), which may not be practical in some scenarios. For instance, a single MEC server is normally resource-limited and may not be able to meet the requirements of all the UEs at the same time. Also, MEC with a fixed location lacks flexibility and may not be suitable for cases where the number and the requirements of UEs keep changing.
Unmanned aerial vehicles (UAVs), due to their low cost, high flexibility and ease of deployment, have recently attracted much attention in wireless communication, e.g., serving as base stations [9] or mobile relays [10]. UAV-enabled MEC (e.g., [6]) has been proposed by integrating MEC servers into UAVs (i.e., UAVE) to provide computing resources to ground UEs. Compared with traditional fixed-location MEC, UAVE is of particular interest in scenarios such as 1) temporary events (e.g., a large number of people gathering on the ground to celebrate a big event or watch a football match); 2) emergency situations (e.g., an earthquake, where the infrastructure may be destroyed or temporarily unavailable); or other on-demand services. However, the operation of UAVE faces many challenges, two of which are how to achieve 1) the association between multiple UEs and UAVs and 2) the resource allocation from the UAVs to the UEs, while meeting the quality of service (QoS) requirements and minimizing the overall energy consumption of all the UEs.
To address these challenges, we formulate the above problem as a mixed integer nonlinear program (MINLP), which is very difficult to solve in general, especially in large-scale scenarios (e.g., when a large number of UEs on the ground are waiting to be served). We then propose a Reinforcement Learning-based user Association and resource Allocation (RLAA) algorithm to deal with this problem efficiently and effectively. Numerical results show that the proposed RLAA achieves the same optimal performance as exhaustive search in small-scale cases, and has considerable performance gains over other typical algorithms in large-scale scenarios.
The rest of the paper is organized as follows. We present the system model and the optimization problem in Section II. Then, our proposed RLAA algorithm is introduced in Section III. Simulation results are given in Section IV, followed by concluding remarks in Section V.
II System Model
As shown in Fig. 1, we consider $K$ UEs, each of which has a computation-intensive task to be executed. Also, we consider $M$ UAVs deployed as the MEC platform, each flying in a circle with radius $r$. Define $\mathcal{M} = \{0, 1, \ldots, M\}$ to denote the possible places where the tasks from the ground UEs can be executed, in which $j = 0$ denotes that the UE conducts the task itself without offloading. Similar to [6], we assume that each UAV's flight period can be discretized into $T$ time slots. Define $\mathcal{T} = \{0, 1, \ldots, T\}$ to denote the possible time slots when the tasks from the ground UEs can be executed, in which $t = 0$ denotes that the UE conducts the task itself. Also, we assume that each UAV's location change within a time slot can be ignored, compared to the distances from the UAV to the UEs. Denote the coordinate of the $j$th UAV at the $t$th time slot as $(x_{j,t}, y_{j,t})$ and the coordinate of the $i$th UE as $(x_i, y_i)$.
Similar to [3], assume the $i$th UE has a computation-intensive task to be executed, characterized as

$U_i = (F_i, D_i),$  (1)

where $F_i$ denotes the total number of CPU cycles required to complete the task and $D_i$ denotes the amount of data that needs to be transmitted to a UAV if the UE decides to offload; $F_i$ and $D_i$ can be obtained by using the approaches provided in [11]. Assume that in one time slot each UE can decide either to execute the task locally or to offload it to one of the UAVs, and also assume that the task can be completed within this time slot. Similar to [1], we do not consider the time for returning the results from the UAV back to the UE. Thus, one can have
$a_{i,j,t} \in \{0, 1\}, \; \forall i, j, t,$  (2)

where $a_{i,j,t} = 1$, $j \in \{1, \ldots, M\}$, $t \in \{1, \ldots, T\}$ denotes that the $i$th UE chooses the $j$th UAV in the $t$th time slot to offload, while $a_{i,0,0} = 1$ denotes that the $i$th UE executes the task itself; otherwise, $a_{i,j,t} = 0$. Note that $j = 0$ if and only if $t = 0$.
Also, assume that the $j$th ($j \neq 0$) UAV can serve more than one UE in each time slot, and that each task has to be completed either via offloading or via local execution. Therefore, one can have

$\sum_{j=0}^{M} \sum_{t=0}^{T} a_{i,j,t} = 1, \; \forall i.$  (3)
II-A Task Offloading
In the offloading scenario, the horizontal distance between the $i$th UE and the $j$th UAV in the $t$th time slot is

$d_{i,j,t} = \sqrt{(x_{j,t} - x_i)^2 + (y_{j,t} - y_i)^2}.$  (4)
Then, the offloading data rate can be given by

$r_{i,j,t} = B \log_2 \left( 1 + \frac{p_i g_0}{(H^2 + d_{i,j,t}^2)\,\sigma^2} \right),$  (5)

where $B$ denotes the channel bandwidth, $p_i$ the transmission power of the $i$th UE, $H$ the flying height of the UAV, $g_0$ the channel power gain at the reference distance of 1 m (with the antenna-gain constant 2.2846 [12]), and $\sigma^2$ the noise power [12].
Also, the time to offload the data from the $i$th UE to the $j$th UAV in the $t$th time slot can be given as

$t_{i,j,t}^{off} = \frac{D_i}{r_{i,j,t}}.$  (6)
Also, the time to execute the task at the UAV can be expressed as

$t_{i,j,t}^{exe} = \frac{F_i}{f_{i,j,t}},$  (7)

where $f_{i,j,t}$ is the computation resource that the $j$th UAV provides to the $i$th UE. Then, the total time consumption is

$T_{i,j,t} = t_{i,j,t}^{off} + t_{i,j,t}^{exe}.$  (8)
Moreover, the total energy consumption of the $i$th UE when offloading to the $j$th UAV in the $t$th time slot can be given as

$E_{i,j,t}^{off} = p_i \, t_{i,j,t}^{off}.$  (9)
Similar to [1], we assume each UAV in every time slot can only accept a limited number of offloaded tasks. Then, one has

$\sum_{i=1}^{K} a_{i,j,t} \leq N_{max}, \; \forall j \neq 0, \forall t,$  (10)

where $N_{max}$ is the maximal number of UEs that each UAV can accept in each time slot.
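Under these definitions, the per-choice offloading cost of Eqs. (5)-(9) can be computed as in the following sketch (illustrative Python; the Shannon-rate channel model and all symbol names are assumptions based on the text, not the paper's exact notation):

```python
import math

def offload_energy(D, F, p, B, g0, H, d, sigma2, f_alloc):
    """Energy and completion time when a UE offloads a task of D data
    units and F CPU cycles to a UAV at height H and horizontal
    distance d, following the structure of Eqs. (5)-(9)."""
    # Eq. (5): achievable offloading rate (Shannon capacity with an
    # assumed free-space channel gain g0 / (H^2 + d^2))
    rate = B * math.log2(1 + p * g0 / ((H**2 + d**2) * sigma2))
    t_off = D / rate            # Eq. (6): transmission time
    t_exe = F / f_alloc         # Eq. (7): execution time at the UAV
    t_total = t_off + t_exe     # Eq. (8): total completion time
    e_off = p * t_off           # Eq. (9): UE transmit energy
    return e_off, t_total
```

Note that the UE's energy cost depends only on the transmission phase, since computation happens on the UAV; this is why nearby UAVs (higher rate, shorter transmission) are attractive choices.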
II-B Local Execution
If a UE decides to execute the task locally, the power consumption of the $i$th UE can be given by $p_i^{l} = \kappa_i (f_i^{l})^{\nu_i}$, where $f_i^{l}$ is the local computation resource of the $i$th UE, $\kappa_i \geq 0$ is the effective switched capacitance, and $\nu_i$ is normally set to 3 [1]. Then, the local execution time can be given by $t_i^{l} = F_i / f_i^{l}$, and the total energy consumption can be given as $E_i^{l} = p_i^{l} t_i^{l} = \kappa_i F_i (f_i^{l})^{\nu_i - 1}$.
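The local-execution model can likewise be sketched (illustrative Python; the default switched-capacitance value is a placeholder assumption, not a value from the paper):

```python
def local_energy(F, f_local, kappa=1e-27, nu=3):
    """Local-execution time and energy for a task of F CPU cycles
    executed at frequency f_local, using the switched-capacitance
    power model of Section II-B (power = kappa * f^nu, nu typically 3)."""
    t_local = F / f_local              # local execution time
    power = kappa * f_local ** nu      # CPU power model
    # energy = power * time = kappa * F * f_local^(nu - 1)
    return power * t_local, t_local
```

The closed form shows the usual trade-off: running the CPU faster shortens the execution time but increases energy super-linearly.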
II-C Problem Formulation
Then, one can have the energy consumption of each UE as

$E_i = \sum_{j=1}^{M} \sum_{t=1}^{T} a_{i,j,t} E_{i,j,t}^{off} + a_{i,0,0} E_i^{l}.$  (11)
Also, the total time spent to complete each task can be expressed as

$T_i = \sum_{j=1}^{M} \sum_{t=1}^{T} a_{i,j,t} T_{i,j,t} + a_{i,0,0} t_i^{l}.$  (12)
Assume that the maximal computation resource the $j$th UAV can provide is $F_j^{max}$. Then, one can have

$\sum_{i=1}^{K} a_{i,j,t} f_{i,j,t} \leq F_j^{max}, \; \forall j \neq 0, \forall t.$  (13)
Also, as a task normally has to be completed within a certain amount of time, we assume without loss of generality that all the transmission and computation for each task must be completed within one time interval $\tau$. Then, we have

$T_i \leq \tau, \; \forall i.$  (14)
Denote $\mathbf{A} = \{a_{i,j,t}\}$ and $\mathbf{f} = \{f_{i,j,t}\}$. Then, one can formulate the optimization problem as

$\min_{\mathbf{A}, \mathbf{f}} \; \sum_{i=1}^{K} E_i$  (15a)
subject to
$\sum_{j=0}^{M} \sum_{t=0}^{T} a_{i,j,t} = 1, \; \forall i,$  (15b)
$T_i \leq \tau, \; \forall i,$  (15c)
$\sum_{i=1}^{K} a_{i,j,t} f_{i,j,t} \leq F_j^{max}, \; \forall j \neq 0, \forall t,$  (15d)
$\sum_{i=1}^{K} a_{i,j,t} \leq N_{max}, \; \forall j \neq 0, \forall t,$  (15e)
$a_{i,j,t} \in \{0, 1\}, \; \forall i, j, t,$  (15f)
$f_{i,j,t} \geq 0, \; \forall i, j, t.$  (15g)
Note that the above problem is an MINLP, which is difficult to solve optimally in general. Existing algorithms such as exhaustive search or branch and bound may solve this problem, but with prohibitive complexity. Therefore, in this paper we aim to obtain an efficient solution. To this end, we propose the RLAA algorithm to deal with the problem effectively and efficiently.
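To make the complexity point concrete, the brute-force baseline can be sketched as follows (illustrative Python; the cost function `energy` is a placeholder, not the paper's exact model):

```python
from itertools import product

def exhaustive_search(n_ues, n_options, energy):
    """Brute-force search over all UE-to-option assignments, minimizing
    total energy. `energy(i, o)` returns the cost of UE i taking option
    o (use float('inf') for infeasible choices). The search space is
    n_options ** n_ues, which is why exhaustive search only works for
    the small-scale cases in Section IV."""
    best, best_cost = None, float('inf')
    for assign in product(range(n_options), repeat=n_ues):
        cost = sum(energy(i, o) for i, o in enumerate(assign))
        if cost < best_cost:
            best, best_cost = assign, cost
    return best, best_cost
```

With, say, 3 UAVs, 10 time slots and 100 UEs, the number of joint assignments already exceeds $31^{100}$, far beyond what can be enumerated.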
III Proposed Algorithm
In this section, we present our proposed RLAA algorithm. First, we introduce its three key elements (i.e., actions, states, and the reward function).

Actions: At each episode, each UE takes an action. If the $i$th UE decides to offload its task to the $j$th UAV in the $t$th time slot, the action is denoted as $a_{i,j,t} = 1$. If the UE decides to execute the task locally, the action is $a_{i,0,0} = 1$. Then, one can define the collection of actions as

$\mathcal{A}_i = \{a_{i,j,t} \mid j \in \{0, \ldots, M\}, \; t \in \{0, \ldots, T\}\}.$  (16)

For the above offloading action $a_{i,j,t} = 1$ ($j \neq 0$), the minimal computation resource of the $i$th UE is given by

$f_{i,j,t}^{min} = \frac{F_i}{\tau - D_i / r_{i,j,t}}.$  (17)

For the local execution action $a_{i,0,0} = 1$, the minimal computation resource of the $i$th UE is

$f_i^{min} = \frac{F_i}{\tau}.$  (18)

Note that not all actions can guarantee that the task is completed within one time interval, as the available computation resources may be less than the minimal computation resources (i.e., in (17) and (18)). Similarly, the communication resource may also not be guaranteed (i.e., in (15e)). Therefore, we may remove some actions from $\mathcal{A}_i$, resulting in the collection of feasible actions for the $i$th UE, denoted as $\mathcal{A}_i^{f}$.
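The feasibility pruning described above can be sketched as follows (illustrative Python; symbol names such as `tau` and the action labels are assumptions, not the paper's notation):

```python
def min_resource(F, D, tau, rate=None):
    """Minimal computation resource to finish within the slot length tau:
    Eq. (17) for offloading (rate given), Eq. (18) for local execution
    (rate is None). Returns None when offloading alone exceeds tau."""
    if rate is None:                  # local execution, Eq. (18)
        return F / tau
    t_off = D / rate                  # transmission eats into the deadline
    if t_off >= tau:                  # no time left to compute: infeasible
        return None
    return F / (tau - t_off)          # Eq. (17)

def feasible_actions(actions, F, D, tau, f_max):
    """Keep only actions whose minimal resource fits the budget f_max.
    `actions` maps an action label to its offloading rate (None = local)."""
    kept = []
    for a, rate in actions.items():
        f_min = min_resource(F, D, tau, rate)
        if f_min is not None and f_min <= f_max:
            kept.append(a)
    return kept
```

Pruning infeasible actions up front shrinks the action space the learner must explore, which matters in the large-scale scenarios of Section IV.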

States: Then, we define the states as

$\mathbf{s} = \{s_1, s_2, \ldots, s_K\},$  (19)

where $s_i$ represents the decision of the $i$th UE. Specifically, if the $i$th UE offloads the task to the $j$th UAV in the $t$th time slot, we assign action $a_{i,j,t}$ to state $s_i$. It is worth mentioning that if the $i$th UE decides to execute the task locally, we assign action $a_{i,0,0}$ to state $s_i$.

Reward Function: We define the reward function as

$R(\mathbf{s}, a) = -\sum_{i=1}^{K} E_i.$  (20)

The proposed reward function keeps reducing the energy consumption of each UE and may finally achieve the minimization of the energy consumption of all UEs.
Then, we present RLAA in Algorithm 1. In the beginning, the state $\mathbf{s}$ is initialized. The $Q$ table, which is used to record every state and action, is also initialized (i.e., line 1 in Algorithm 1). At each episode, we obtain the collection of feasible actions for the $i$th UE. Then, according to the $\epsilon$-greedy policy [13], the $i$th UE either chooses a random action with probability $\epsilon$ or follows the greedy policy with probability $1 - \epsilon$, which is expressed as

$a_i = \begin{cases} \text{a random action from } \mathcal{A}_i^{f}, & \text{if } \mathrm{rand}(0,1) < \epsilon, \\ \arg\max_{a} Q(\mathbf{s}, a), & \text{otherwise}, \end{cases}$  (21)

where $\mathrm{rand}(0,1)$ denotes a random number uniformly distributed over the interval [0,1] (i.e., lines 4-8 in Algorithm 1). Then, the resource allocation is conducted for the $i$th UE (i.e., line 9 in Algorithm 1). If the $i$th UE offloads the task to the $j$th UAV in the $t$th time slot, the minimal computation resource in (17) is allocated. If the $i$th UE executes the task locally, the minimal computation resource in (18) is allocated. Based on the proposed reward function in (20), the $i$th UE can then obtain a reward (i.e., line 10 in Algorithm 1).
Next, we update the $Q$ table (line 11), where the updating rule is given as

$Q(\mathbf{s}, a) \leftarrow Q(\mathbf{s}, a) + \alpha \left[ R(\mathbf{s}, a) + \gamma \max_{a'} Q(\mathbf{s}', a') - Q(\mathbf{s}, a) \right],$  (22)

where $\gamma$ is the reward decay over the interval [0,1], $\alpha$ is the learning rate over the interval [0,1], and $\mathbf{s}'$ is the next state. Also, the state $\mathbf{s}$ is updated based on action $a$. Specifically, we assign action $a_{i,j,t}$ to state $s_i$.
The above process is repeated until the maximum number of episodes is reached. Finally, each UE selects an action according to the $Q$ table (line 16). Specifically, for the $i$th UE, the action in $\mathcal{A}_i^{f}$ corresponding to the largest value in the $Q$ table is selected.
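The loop above can be sketched as standard tabular Q-learning (a minimal single-UE sketch under assumed interfaces; the actual RLAA operates jointly over all UEs, their feasible-action sets, and the resource allocation of lines 9-10):

```python
import random

def rlaa_sketch(actions, reward_fn, episodes=1000, eps=0.1,
                alpha=0.2, gamma=0.9):
    """Tabular Q-learning sketch of the RLAA loop for one UE, with
    epsilon-greedy exploration (cf. Eq. (21)) and the update rule of
    Eq. (22). `actions` is the feasible-action list and `reward_fn(a)`
    stands in for the reward of Eq. (20); both are placeholders."""
    Q = {a: 0.0 for a in actions}        # Q table (line 1 of Algorithm 1)
    for _ in range(episodes):
        if random.random() < eps:        # explore with probability eps
            a = random.choice(actions)
        else:                            # exploit: greedy on Q
            a = max(Q, key=Q.get)
        r = reward_fn(a)                 # reward (line 10)
        # Eq. (22): bootstrap on the best value of the next state
        Q[a] += alpha * (r + gamma * max(Q.values()) - Q[a])
    return max(Q, key=Q.get)             # final greedy choice (line 16)
```

The learning rate (0.2) and reward decay (0.9) defaults mirror the settings in Table I.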
IV Simulation Results
TABLE I: Simulation parameters.

Parameters                          Settings
Radius for all UAVs                 800 m
Flying height for all UAVs          350 m
Bandwidth                           1 MHz
Transmission power                  1 W
Noise variance                      dBm/Hz
–                                   2.2846
Channel power gain                  1.42
Data size                           [ ] KB
Execution task                      [ ] cycles
Time duration                       1 s
Location of UEs                     m
–                                   150 GHz
ε-greedy policy probability         –
Reward decay                        0.9
Learning rate                       0.2
– (for all UEs)                     –
– (for all UEs)                     3
–                                   10000
– (for all UAVs)                    12
In this section, simulations of the proposed multi-UAV enabled MEC system are conducted, with the test parameters shown in Table I: the channel bandwidth is set to 1 MHz, the noise variance (in dBm/Hz) and the channel power gain at the reference distance of 1 m are set as in [12], the transmission power is set to 1 W, and the time interval is set to 1 s. Also, we assume each UAV can support 150 UEs in one time slot. All UEs are assumed to be randomly distributed in a rectangular area. We randomly select the data size of each task and the required number of CPU cycles from the intervals listed in Table I.
In order to evaluate the performance of our proposed RLAA, the following four algorithms are used for comparison.

Exhaustive search (ES): We examine all the possibilities, with the objective of minimizing the overall energy consumption for all the UEs.

Local execution (LE): We assume all tasks are executed locally and there is no offloading.

Random offloading (RO): Each UE randomly selects the UAV and the time slot to offload its task.

Greedy offloading (GO): Each UE selects the nearest UAV to offload its task. If that UAV is overloaded (i.e., constraint (10) is violated), it selects the second nearest UAV, and so on.
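The GO baseline can be sketched as follows (illustrative Python; 2-D positions and a single per-UAV capacity bound are simplifying assumptions):

```python
def greedy_offloading(ue_pos, uav_pos, capacity):
    """Sketch of the GO baseline: each UE tries the nearest UAV first
    and falls back to the next nearest when that UAV's capacity
    (cf. constraint (10)) is already used up."""
    load = [0] * len(uav_pos)
    assignment = []
    for (x, y) in ue_pos:
        # UAV indices sorted by squared horizontal distance to this UE
        order = sorted(range(len(uav_pos)),
                       key=lambda j: (uav_pos[j][0] - x) ** 2
                                     + (uav_pos[j][1] - y) ** 2)
        chosen = next((j for j in order if load[j] < capacity), None)
        if chosen is not None:
            load[chosen] += 1
        assignment.append(chosen)   # None: no UAV had room for this UE
    return assignment
```

GO only considers distance, not the computation load or deadline of each task, which is why RLAA can outperform it in the experiments below.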
Firstly, we compare the performance of RLAA with the four comparison algorithms on a set of small-scale instances (i.e., the number of UEs ranges from 3 to 7). We assume that there are two UAVs flying in circles with the same radius but different center coordinates. From Fig. 2, one can see that RLAA achieves the same performance as ES, both attaining the minimal energy consumption. Also, one can see that GO achieves better performance than RO, whereas LE achieves the worst performance for all examined values. This is because our proposed RLAA chooses the most energy-efficient action for each UE according to its computation and communication requirements, while the others either make the UE execute the whole task locally (i.e., LE), randomly offload the tasks (i.e., RO), or just pick the nearest UAV (i.e., GO), resulting in worse performance.
Next, we compare the performance of RLAA with LE, RO and GO on a set of large-scale instances, where the number of UEs is increased from 100 to 1000. The number of UAVs is set to 3, with different center coordinates. Note that we do not examine ES here, due to its prohibitive complexity. From Fig. 3, one can see that our proposed RLAA still performs best, followed by GO, RO and LE, as expected.
In Fig. 4, we further increase the number of UAVs to 5, again with different center coordinates. One can see that our proposed RLAA still outperforms the other compared algorithms, saving a significant amount of energy for all the UEs.
V Conclusion
In this paper, we studied a multi-UAV enabled MEC system, in which the UAVs are assumed to fly in circles over the ground UEs to provide computation services. The considered problem is formulated as an MINLP, which is hard to solve in general. We proposed an RLAA algorithm to address it effectively. Simulation results show that RLAA achieves the same performance as exhaustive search in small-scale cases, whereas in large-scale scenarios RLAA still has considerable performance gains over other traditional approaches.
VI Acknowledgements
This work was supported in part by the Zhongshan City Team Project (Grant No. 180809162197874), National Natural Science Foundation of China (Grant No. 61620106011 and 61572389) and UK EPSRC NIRVANA project (Grant No. EP/L026031/1).
References
[1] X. Lyu, H. Tian, W. Ni, Y. Zhang, P. Zhang, and R. P. Liu, "Energy-Efficient Admission of Delay-Sensitive Tasks for Mobile Edge Computing," IEEE Transactions on Communications, vol. 66, no. 6, pp. 2603-2616, June 2018.
[2] K. Yang, S. Ou, and H. Chen, "On effective offloading services for resource-constrained mobile devices running heavier mobile internet applications," IEEE Communications Magazine, vol. 46, no. 1, pp. 56-63, January 2008.
[3] L. Zhang, K. Wang, D. Xuan, and K. Yang, "Optimal Task Allocation in Near-Far Computing Enhanced C-RAN for Wireless Big Data Processing," IEEE Wireless Communications, vol. 25, no. 1, pp. 50-55, February 2018.
[4] H. Mei, K. Wang, and K. Yang, "Multi-layer cloud-RAN with cooperative resource allocations for low-latency computing and communication services," IEEE Access, vol. 5, pp. 19023-19032, 2017.
[5] X. Wang, K. Wang, S. Wu, S. Di, K. Yang, and H. Jin, "Dynamic resource scheduling in cloud radio access network with mobile cloud computing," in 2016 IEEE/ACM 24th International Symposium on Quality of Service (IWQoS), June 2016, pp. 1-6.
[6] Q. Wu and R. Zhang, "Common throughput maximization in UAV-enabled OFDMA systems with delay consideration," IEEE Transactions on Communications, vol. 66, no. 12, pp. 6614-6627, Dec 2018.
[7] X. Chen, "Decentralized Computation Offloading Game for Mobile Cloud Computing," IEEE Transactions on Parallel and Distributed Systems, vol. 26, no. 4, pp. 974-983, April 2015.
[8] X. Wang, K. Wang, S. Wu, S. Di, H. Jin, K. Yang, and S. Ou, "Dynamic Resource Scheduling in Mobile Edge Cloud with Cloud Radio Access Network," IEEE Transactions on Parallel and Distributed Systems, vol. 29, no. 11, pp. 2429-2445, Nov 2018.
[9] J. Lyu, Y. Zeng, R. Zhang, and T. J. Lim, "Placement Optimization of UAV-Mounted Mobile Base Stations," IEEE Communications Letters, vol. 21, no. 3, pp. 604-607, March 2017.
[10] Y. Chen, N. Zhao, Z. Ding, and M. Alouini, "Multiple UAVs as Relays: Multi-Hop Single Link Versus Multiple Dual-Hop Links," IEEE Transactions on Wireless Communications, vol. 17, no. 9, pp. 6348-6359, Sept 2018.
[11] L. Yang, J. Cao, S. Tang, T. Li, and A. T. S. Chan, "A Framework for Partitioning and Execution of Data Stream Applications in Mobile Cloud Computing," in 2012 IEEE Fifth International Conference on Cloud Computing, June 2012, pp. 794-802.
[12] H. He, S. Zhang, Y. Zeng, and R. Zhang, "Joint Altitude and Beamwidth Optimization for UAV-Enabled Multiuser Communications," IEEE Communications Letters, vol. 22, no. 2, pp. 344-347, Feb 2018.
[13] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, 2015.