I Introduction
In recent years, the number of user equipments (UEs) and Internet of Things (IoT) devices has grown rapidly. Meanwhile, new resource-intensive applications, e.g., augmented reality (AR), virtual reality (VR) and real-time gaming, are constantly emerging. Mobile edge computing (MEC) has been proposed to enable UEs to offload such workloads to available resource-rich MEC servers [1]. The joint resource scheduling problem, which consists of offloading decision making and resource allocation, plays a key role in multi-user and multi-MEC systems [1, 2].
However, such a joint optimization problem is generally a mixed integer nonlinear programming (MINLP) problem, because the computation offloading decision is an integer variable while the power and resource allocations are continuous variables. Traditional methods have been proposed to solve this MINLP problem, such as dynamic programming [3], the branch-and-bound method [4] and game theory [5]. However, these methods have high computational complexity when implemented in large-scale MEC systems. In addition, heuristic local search [6] and convex relaxation [7] algorithms have been proposed to reduce the computational complexity; these algorithms apply iterative search to reach a satisfactory locally optimal solution.
Despite the recent research progress, the multi-user and multi-MEC system still faces several technical challenges. First, as the number of UEs increases, the resources allocated to each UE become fundamentally limited. Second, in a dynamic environment, the time-varying wireless channel strongly affects the optimal user association and computation offloading, so iterative search and traditional convex-optimization-based solutions are not suitable for making real-time decisions.
Fortunately, the above-mentioned challenges fall into the field of artificial intelligence (AI), which is considered a promising technique to address such issues through adaptive modelling and intelligent learning. Recently, several AI algorithms have been proposed and applied to MEC systems, such as DNN [8], LSTM [9], CNN [10], Q-learning [11], DQN [12] and DDPG [13]. However, on one hand, deep learning (DL)-based models (e.g., DNN, LSTM and CNN) have outstanding prediction and reasoning capabilities, but they require a considerable amount of labelled training data [14, 15]. On the other hand, when the scale of the MEC system grows, reinforcement learning (RL)-based models (e.g., Q-learning, DQN and DDPG) have difficulty converging and their final results are unstable [16, 17].
In this paper, we perform a comprehensive study on jointly optimizing computation offloading and resource allocation in a complex MEC system with multiple UEs and multiple MEC servers. We aim to obtain an online scheduling algorithm that minimizes the sum of weighted task latency over all the UEs. To this end, we propose a joint resource scheduling framework that combines a stacked auto encoder (SAE) with a deep reinforcement learning (DRL) model. Compared with the existing works, we make the following novel contributions:
First, we present an MEC system model with the aim of minimizing the sum of weighted task latency for all the UEs and formulate the problem as an MINLP problem, considering the dynamic environment. Then, we decompose the MINLP problem into a computation offloading decision making sub-problem and a resource allocation sub-problem, which avoids solving the original MINLP problem directly and guarantees that all the constraints are satisfied at the same time.
Second, we propose a related and regularized SAE (2r-SAE) with unsupervised learning to carry out data compression and representation for high-dimensional channel quality information (CQI) data. 2r-SAE can provide a compact data representation to the DRL model, which will reduce the state space and enhance the learning efficiency of the DRL. In addition, we add the relative error term of each UE to the error term of the loss function, which will consider the relative error and absolute error simultaneously and reduce the information loss of each UE in the feature extraction process. We also add a regularization term to the loss function to improve the generalization of SAE. Furthermore, the incremental learning is used to update the SAE for tracking the variations of the real scenarios.
Third, we introduce a novel DRL model to generate the computation offloading decision in real time, in which we present an adaptive simulated annealing (ASA) algorithm as the heuristic search method to find the optimal action for the corresponding state. In the ASA, we introduce two adaptive mechanisms: first, the subsequent solution is mutated adaptively according to the CQI; second, the iteration number is adjusted adaptively according to the loss decrease of the DRL. These two mechanisms enhance the efficiency of SA and reduce the number of times the convex optimization problem must be solved, without compromising the system performance.
Fourth, a preserved and prioritized experience replay (2p-ER) is used to train the deep neural network (DNN), whose parameters represent the offloading policy of DRL. In particular, we use a preserve strategy to protect the transitions which are close to the current offloading policy. We also adopt a priority strategy to select the transitions which have more contributions to the decrease of loss function. These two strategies can accelerate the convergence of the DRL.
The rest of this paper is organized as follows. In Section II, a review of related works is presented. We describe the system model and problem formulation in Section III. We introduce the detailed designs of the DRL framework in Section IV. Section V provides some numerical results, which is followed by the conclusions in Section VI.
II Related works
There are many previous contributions in the MEC systems using AI-based solutions. In the following, we review the related works from three aspects: DL-based methods, RL-based methods and other AI-based methods.
DL-based methods: In [8], a distributed DL algorithm was proposed to make offloading decisions for MEC systems, where several DNNs were trained in parallel and the offloading decisions were made cooperatively. In [9], a long short-term memory (LSTM) network was proposed to predict the traffic of small base stations (SBSs), and the cross-entropy loss function was applied to evaluate the LSTM and obtain the offloading strategy. In [18], a distributed deployment strategy for the multi-layer convolutional neural network was presented, which included two parts: the preprocessing part and the classification part. The preprocessing part was deployed on the edge server for feature extraction and data compression so as to reduce the data transmission between the edge and the cloud.
RL-based methods: In [11], a Q-learning-based mobile offloading strategy was proposed for the mobile offloading game. In [12], a DQN approach was applied to jointly optimize the networking, caching, and computing resources in next-generation vehicular networks. In [13], a DRL-based energy-efficient UAV control method was proposed to design the trajectory of the UAV by jointly considering communications coverage, fairness, energy consumption and connectivity.
Other AI-based methods: In [19], an energy-efficient computation offloading management scheme for the MEC system with small cell networks (SCNs) was proposed, and a hierarchical genetic algorithm (GA) and particle swarm optimization (PSO)-based heuristic algorithm were designed to solve this problem. In [20], a conceptor-based echo state network was proposed to predict the content request distribution of users and their mobility patterns when the network is available. Based on the prediction results, the optimal positions of UAVs and the content to cache at UAVs can be obtained.
However, none of the above methods considers online decision making with a large number of UEs in a dynamic environment. First, DL-based methods need prior knowledge and labelled samples, which makes them ill-suited to a dynamic environment. Second, RL-based methods are unstable and hard to converge when the search space is large because of a large number of UEs. Thus, more flexible and efficient AI methods should be designed.
In this paper, we consider a multi-user and multi-MEC system in a dynamic environment, which leads to a large-scale NP-hard problem with no prior knowledge. To solve this MINLP, we propose a DRL-based joint resource scheduling framework which combines the feature-extracting capability of the SAE with the decision-making capability of DRL, and which differs from the frameworks in the existing works.
III System model and problem formulation
III-A System model
As shown in Fig. 1, we consider there are UEs, each of which has a computation task to be executed. Also, we consider there are MEC enhanced base stations, which can enable UEs to offload their tasks. Define a new vector to denote the possible places where the tasks can be executed, where denotes that UE conducts task itself without offloading, therefore one has
| (1) |
where , denotes that the -th UE decides to offload the task to the -th MEC, while , denotes that the -th UE decides not to offload the task to the -th MEC, and , denotes UE conducts the task itself. Also, one has
| (2) |
which denotes that each task can only be or may not be able to execute in one place.
Similar to [21], we assume that the -th UE has a computation-intensive task to be executed as follows
| (3) |
where describes that the total number of the CPU cycles of to be computed, denotes the data size transmitting to the MEC if offloading action is decided. and can be obtained by using the approaches provided in [22].
Then, one can have the execution time as
| (4) |
where is the computation capacity of the -th MEC providing to the -th UE and means the UE executes the task itself.
Then, the time to offload the data is given by
| (5) |
where is the offloading data rate from the -th UE to the -th MEC.
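
The timing model of Eqs. (4) and (5) can be illustrated with a minimal sketch. It assumes that the execution time is the required number of CPU cycles divided by the allocated computation capacity, that the offloading time is the data size divided by the data rate, and that the latency of an offloaded task is the sum of the two. The function and parameter names are illustrative and do not come from the paper.

```python
def execution_time(cpu_cycles, capacity):
    """Execution time: required CPU cycles / allocated cycles per second."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return cpu_cycles / capacity


def offloading_time(data_bits, rate_bps):
    """Transmission time: task data size / offloading data rate."""
    if rate_bps <= 0:
        raise ValueError("rate must be positive")
    return data_bits / rate_bps


def task_latency(cpu_cycles, data_bits, capacity, rate_bps=None):
    """Total latency of one task (assumed model).

    rate_bps=None stands for local execution, where nothing is transmitted
    and `capacity` is the UE's own computation capacity.
    """
    latency = execution_time(cpu_cycles, capacity)
    if rate_bps is not None:
        latency += offloading_time(data_bits, rate_bps)
    return latency
```

Under this sketch, offloading pays off only when the faster remote execution outweighs the extra transmission time.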
The computing capacity for the UE is constrained by
| (6) |
where is local computational capability of the -th UE.
The power consumption of the UE is constrained by
| (7) |
where is the transmitting power from the -th UE to the -th MEC and is the execution power of the -th UE if UE conducts the task itself. Thus, can be given by
| (8) |
where is the effective switched capacitance and is the positive constant. To match the realistic measurements, we set and [23].
The computing capacity for the MEC is constrained by
| (9) |
where is the computational capability of the -th MEC.
Assume that the coordinate of the -th UE is as and the coordinate of the -th MEC is as . The horizontal distance between the -th UE and the -th MEC is as
| (10) |
Then, we can define CQI as
| (11) |
where denotes the channel power gain at the reference distance and describes the influence of small-scale fading.
Therefore, if UEs decide to offload to the MEC, the data rate can be given as
| (12) |
where is the channel bandwidth.
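
A hedged sketch of the channel model in Eqs. (10) to (12) follows. Because the exact expressions are not reproduced here, it assumes a distance-based path loss with exponent 2, a multiplicative small-scale fading factor, and a Shannon-type rate in which the CQI is normalized by the noise power. All names and default values are assumptions made for illustration.

```python
import math


def horizontal_distance(ue_xy, mec_xy):
    """Euclidean distance between a UE and an MEC in the horizontal plane."""
    return math.hypot(ue_xy[0] - mec_xy[0], ue_xy[1] - mec_xy[1])


def channel_quality(distance, ref_gain, fading=1.0, path_loss_exp=2.0):
    """Assumed CQI: reference gain times fading, attenuated by distance."""
    return ref_gain * fading / max(distance, 1e-9) ** path_loss_exp


def data_rate(bandwidth_hz, tx_power, cqi, noise_power=1.0):
    """Assumed Shannon-type rate: B * log2(1 + p * h / noise)."""
    return bandwidth_hz * math.log2(1.0 + tx_power * cqi / noise_power)
```

In this form the rate grows only logarithmically with the transmission power, which is why the power allocation interacts with the latency objective in a non-trivial way.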
III-B Problem Formulation
In order to minimize the weighted sum of task latency of all the tasks, we formulate the optimization problem as follows:
| (13) |
where , , are vectors for offloading decision, resource allocation and transmission power of each UE, respectively. This is a mixed integer nonlinear programming (MINLP) problem, as it includes both integer and continuous variables. Assume if UE conducts the tasks locally, the energy consumption is expressed as . Also assume that is the time-varying variable, whereas other parameters are fixed values.
One can see that Problem P0 is a non-convex, non-smooth and non-differentiable optimization problem. We first decompose P0 into two sub-problems, i.e., the offloading decision sub-problem (P1) and the transmission power and computation resource allocation sub-problem (P2). For P1, we assume that other variables are fixed, so it only includes the integer variable . Then, P1 is an integer optimization problem, which is difficult to solve in real time in a fast-changing environment. To solve this issue, we propose to apply a novel DRL to address this problem and obtain the decision . Once the computation offloading variable is obtained, P0 can be simplified as follows, with the integer variable fixed.
| (14) |
One can see that the variables can be set to its maximal value by applying . Then, P2 minimizes a sum of fractional functions and can be seen as a nonconvex sum-of-ratios optimization [21]. By applying
| (15) |
and combining Eq. (12) and Eq. (15), one can have
| (16) |
Then, Problem 2 can be written as
| (17) |
One can see that P2.1 is a convex problem, which can be solved by a standard convex optimization tool, e.g., the CVX toolbox.
IV The Online joint resource scheduling framework (OJRS)
Deep reinforcement learning (DRL) is a goal-oriented algorithm which can learn an optimal policy by using a DNN for offloading decision making [16]. In this paper, similarly, DRL is applied to predict the computation offloading decision, i.e., P1, while convex optimization is used to solve P2 and evaluate the reward of DRL, which guarantees that all the physical constraints are satisfied. However, in a large-scale MEC system, there are three challenges for DRL to be directly applied: (1) because of the large number of UEs, the state space of DRL is extremely large, which increases the difficulty of policy learning; (2) the action search is very difficult because of the complex MINLP, so it is hard for DRL to find the best action and the learning process is inefficient; (3) the experience replay is sensitive to the environment, especially in dynamic situations, where DRL is unstable and difficult to converge. These problems prevent DRL from being applied directly to the proposed problem [17]. To address the above challenges, we introduce an online joint resource scheduling (OJRS) framework which includes an SAE and a DRL model for dimensionality reduction and offloading decision making, respectively. Next, we outline the OJRS framework.
IV-A The framework outline
We show the OJRS framework in Fig. 2. There are three key improvements for solving the aforementioned problems in the OJRS framework: (1) the related and regularized stacked auto encoder (2r-SAE) is provided in Subsection-B as a feature extractor, which realizes adaptive dimensionality reduction and data compression of the input (i.e., channel quality) by deep learning and hierarchical representation. The extracted feature is considered as the current state of DRL; (2) an adaptive simulated annealing algorithm named ASA is presented in Subsection-D as the heuristic search to help the agent find better actions in DRL. The optimal offloading action is then obtained by maximizing the reward and is cached into the replay buffer of DRL; (3) a DNN is applied to devise the optimal offloading policy function , which is trained by a novel preserved and prioritized experience replay (2p-ER) described in Subsection-E. Finally, convex optimization techniques are applied to solve Problem 2.1 according to the given and therefore the transmission power and computation resource can be calculated efficiently. The OJRS framework combines the hierarchical representation ability of the deep autoencoder and the autonomous learning ability of DRL, which realizes end-to-end online joint resource scheduling for large-scale MEC systems in a dynamic environment. The OJRS framework greatly reduces the state space by applying the SAE. Meanwhile, the OJRS framework depends on no prior knowledge of the environment, and can provide online decision making without solving the original MINLP problem. In the following, we provide the details of each component of the OJRS framework.
Iv-B 2r-SAE
An auto-encoder (AE) is a special feedforward neural network that is trained by unsupervised learning to reproduce its input at its output. Considering the advantages of deep learning in feature extraction and representation learning, the SAE with multilayer encoder and decoder stacked by several AEs is shown in Fig. 2, which assumes a symmetrical structure. Suppose the input vector , and the new representation , the encoder with layers describes a mapping:
| (18) |
where is the output of the encoder through the iterative processing steps as follows:
| (19) |
where is the output of the -th layer, is the weight of the -th layer, is the threshold of the -th layer. The set of parameters for the -th layer is .
is the activation function which can be selected as sigmoid, tanh or ReLU [22]. Then the decoder with layers describes a mapping:
| (20) |
where is the reconstruction vector.
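
The layer-wise mappings of the encoder and decoder can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the widths 60, 45 and 30 are taken from the simulation section, with the first width treated as the input dimension; the sigmoid activation, the random initialization and all function names are choices made here, not details confirmed by the paper.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(x, weights, biases, act=sigmoid):
    """One layer: act(W x + b), with W stored as a list of rows."""
    return [act(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero thresholds (illustrative initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(x, layers):
    """Apply the layers in order, feeding each output to the next layer."""
    for weights, biases in layers:
        x = dense(x, weights, biases)
    return x


rng = random.Random(0)
sizes = [60, 45, 30]  # assumed: input width, hidden width, code width
encoder = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
mirrored = sizes[::-1]  # symmetrical decoder: 30 -> 45 -> 60
decoder = [init_layer(a, b, rng) for a, b in zip(mirrored, mirrored[1:])]

x = [rng.random() for _ in range(60)]  # stand-in for a flattened CQI matrix
code = forward(x, encoder)             # compressed representation
x_hat = forward(code, decoder)         # reconstruction of the input
```

Only the encoder output `code` would be passed on as the state; the decoder is needed solely to compute the reconstruction error during training.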
The SAE training aims to optimize the parameter set , minimizing the reconstruction error between and . The loss function of traditional SAE is always calculated as follows [23]:
| (21) |
where the mean square error (MSE) is usually used as the error term.
Gradient descent based methods are applied to tackle the loss minimization problem, i.e. iteratively updating the parameters according to the formula:
| (22) |
where is the learning rate, and is the iteration number.
SAE can be seen as a way to transform representation. When restricting the number of output nodes to be less than the number of original input nodes in the encoder, we can obtain a compressed representation of the input, which actually achieves desired dimensionality reduction. In large-scale MEC systems, the CQI matrix is taken as the input vector for offloading decision making, and the input dimensionality of the increases when the number of UEs and MECs are increased. Therefore, SAE can be used as a dimensionality reduction tool to hierarchically extract the key features of the original and obtain a compact representation as the input state of the DRL.
However, there are still two open problems in the design of SAE model for our problem: First, the error term of the loss function is MSE in SAE, which is an absolute error indicator for all UEs, but the relative CQI of each UE between different MECs provides key information for offloading decision. If we only consider absolute error in loss function, some UEs with small CQI values will have serious loss in the feature-extracting process. Second, the standard SAE only adopts MSE as the loss function, which is always prone to over-fitting and not suitable for online feature-extracting in our OJRS framework because of the poor generalization.
To address the above problems, we propose a novel related and regularized stacked auto encoder (2r-SAE) with an improved loss function, which can be implemented by
| (23) |
where is the CQI between the -th UE and the -th MEC, and the is the corresponding reconstruction output of SAE. In the loss function, the first term is the traditional absolute error term; the second term is the relative error term, which is used to maintain the relative size of for each UE; and the third term is the regularization term, which is applied to improve generalization for online data compression.
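
The three-term structure of such a loss can be sketched as below. Since Eq. (23) is not reproduced here, the concrete form is an assumption: the relative error is measured on each UE's CQI values normalized by their sum over the MECs, the regularizer is a squared L2 norm of the weights, and `rel_weight` and `reg_weight` are hypothetical coefficients.

```python
def two_r_loss(h, h_hat, weights, rel_weight=1.0, reg_weight=0.01):
    """Illustrative 2r-SAE-style loss (assumed form, not the paper's Eq. (23)).

    h, h_hat: one row per UE, one strictly positive CQI value per MEC.
    weights:  flat list of network weights used by the regularizer.
    """
    n = sum(len(row) for row in h)

    # Absolute error term: mean squared error over all entries.
    abs_term = sum((a - b) ** 2
                   for row, row_hat in zip(h, h_hat)
                   for a, b in zip(row, row_hat)) / n

    # Relative error term: compare each UE's CQI shares across the MECs.
    rel_term = 0.0
    for row, row_hat in zip(h, h_hat):
        total, total_hat = sum(row), sum(row_hat)
        rel_term += sum((a / total - b / total_hat) ** 2
                        for a, b in zip(row, row_hat))
    rel_term /= n

    # Regularization term: squared L2 norm of the weights.
    reg_term = sum(w * w for w in weights)

    return abs_term + rel_weight * rel_term + reg_weight * reg_term
```

With this form, a reconstruction that scales a UE's row uniformly is penalized only by the absolute term, while one that changes which MEC looks best for that UE is also penalized by the relative term.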
In summary, as shown in Fig. 2, the 2r-SAE is composed of two stages: (1) Offline incremental learning stage: In this stage, we introduce the SAE to preprocess the matrix of all UEs, and unsupervised learning is used to extract the potential features of the matrix and provide a compact state space for DRL, which improves the robustness and efficiency of DRL in the large-scale MEC system. In addition, incremental learning is used to train the SAE so that it tracks the variations of real scenarios [24]. The procedure of incremental learning is as follows. First, each is input to the SAE, and a reconstruction error is calculated. Then, an error check decides whether the current is put into the memory. In this paper, the error check is a simple threshold evaluation: if the reconstruction error is larger than the threshold, the current is put into the memory. The memory is a dynamic database of fixed size, and a first-in first-out (FIFO) policy is applied when the memory is full. Finally, the memory is used as the sample database to train the SAE. (2) Online data compression stage: The trained SAE is used for online feature extraction and information compression. The extracted feature is considered as the current state of the DRL algorithm. The detailed description of 2r-SAE algorithm is provided in .
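
The error check and the fixed-size FIFO memory described above can be sketched as follows. The class and method names are hypothetical, and the reconstruction error is assumed to be computed elsewhere by the SAE.

```python
from collections import deque


class IncrementalMemory:
    """Fixed-size sample memory with a threshold-based error check."""

    def __init__(self, capacity, threshold):
        # A deque with maxlen drops its oldest item when full (FIFO).
        self.samples = deque(maxlen=capacity)
        self.threshold = threshold

    def offer(self, sample, reconstruction_error):
        """Store the sample only if the SAE reconstructs it poorly."""
        if reconstruction_error > self.threshold:
            self.samples.append(sample)
            return True
        return False
```

Samples that the current SAE already reconstructs well are skipped, so the memory concentrates on inputs that reflect a change in the environment.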
Iv-C DRL with ASA and 2p-ER
We use another DNN to generate the optimal offloading action of Problem 1 in real time, which can be regarded as an unknown function mapping from the compressed to the optimal offloading action , namely:
| (24) |
However, it is challenging to collect a sufficient number of labelled samples for the DNN in practical MEC systems. Therefore DRL is better suited than supervised learning, as it can learn the offloading policy via the reward. By learning the offloading policy gradually from the interaction with the environment, the DNN can generate the best offloading decisions by maximizing the rewards. Nevertheless, the traditional DRL is not suitable for our problem for the following two reasons: first, unlike in the traditional DQN, the DNN in the OJRS framework is used to directly generate actions instead of Q values, and how to find the optimal action for improving the offloading policy remains unclear; second, because DRL is unstable and hard to converge in a dynamic environment, a robust and efficient learning algorithm should be designed.
Motivated by the above issues, we propose a novel DRL, in which an ASA algorithm is applied to enhance the action search process and a 2p-ER strategy is used to improve the learning process of the DNN. The schematic of the DRL is also illustrated in Fig. 2. In the novel DRL algorithm, the agent interacts with the system environment in discrete decision epochs. At each epoch , the agent carries out action according to the state , then the environment produces a reward according to the action . To improve the policy, a heuristic search is applied to search the optimal action , and then the state-action pairs are put into the experience replay (ER) for agent learning. Concretely, in our problem, the DNN can be seen as the agent; the is defined as the compressed which is preprocessed by the 2r-SAE and acquired as the DNN's inputs; the is defined as the offloading action which is regarded as the DNN's outputs; and the reward is deduced from the current . To realize the online decision-making process, we calculate directly by solving Problem 2.1 with a convex optimization method, which can be computed efficiently and rapidly in the fast-changing environment without considering the long-term reward. In addition, the reciprocal of the weighted task latency is defined as the reward of our DRL. The ASA is adopted as the heuristic search to find the optimal action for maximizing the reward, and 2p-ER is proposed as the enhanced ER for DNN training in a dynamic environment.
In addition, different from the SAE, the offloading decision making is a classification task, thus the regularized cross-entropy loss function of the DNN is selected as follows:
| (25) |
where is the sample set size; is the predicted offloading action from the DNN; is the labeled offloading action; and is the parameters of DNN at epoch which is updated by applying the Adam algorithm[25] until the loss value is below a required threshold. Regularized term is also used in the loss function and the reasons are as follows: (1) regularized restraint will increase the generalization of DNN[14]; (2) the L2-norm of will record the status of DNN at each epoch which will be applied to preserve transitions in replay buffer.
IV-D ASA
Action search plays a key role in our DRL. Some local search methods have been applied to find the best for improving the performance of the DNN and achieving the optimal offloading policy [16]. However, these local search methods are easily stuck in local minima, and the globally optimal offloading policy cannot be guaranteed. We introduce an adaptive simulated annealing (ASA) algorithm to carry out a global heuristic search for the best action and to acquire the optimal offloading policy in DRL. After the heuristic search, the newly generated state-action pairs are appended to the replay buffer as training transitions of the DNN.
Simulated annealing (SA) is a single-solution-based metaheuristic search inspired by annealing in metallurgy [26]. Due to its simplicity, few parameters and fast convergence, SA has been widely adopted for global search and optimization in recent years [27].
The traditional SA algorithm begins with an initial solution and a starting temperature , then an iterative search process is carried out. For each generation , a neighbor solution close to the current solution is generated randomly. The subsequent solution is selected by the Boltzmann probability distribution [26]:
| (26) |
where denotes the objective function of SA, which varies during the iterations because is the cooling factor. denotes a uniform random number in the range [0, 1].
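
The acceptance rule of the traditional SA can be sketched as follows, assuming a minimization objective, the usual Metropolis form of the Boltzmann criterion, and geometric cooling. The function names, default values and the small floor on the temperature are illustrative assumptions.

```python
import math
import random


def accept(current_cost, neighbour_cost, temperature, rng=random):
    """Accept a better neighbour always, a worse one with Boltzmann probability."""
    if neighbour_cost <= current_cost:
        return True
    delta = neighbour_cost - current_cost
    # Compare the acceptance probability with a uniform number in [0, 1).
    return rng.random() < math.exp(-delta / max(temperature, 1e-12))


def simulated_annealing(cost, neighbour, x0, t0=1.0, cooling=0.9,
                        iterations=100, rng=random):
    """Plain SA loop: propose, accept or reject, then cool down."""
    x, fx = x0, cost(x0)
    best, fbest = x, fx
    t = t0
    for _ in range(iterations):
        y = neighbour(x, rng)
        fy = cost(y)
        if accept(fx, fy, t, rng):
            x, fx = y, fy
            if fx < fbest:
                best, fbest = x, fx
        t *= cooling  # geometric cooling with the cooling factor
    return best, fbest
```

As the temperature falls, worse neighbours are accepted less often, so the search moves from exploration towards a greedy descent.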
However, the traditional SA algorithm has three drawbacks that prevent its direct application in our DRL algorithm. First, the SA algorithm often employs continuous real-valued encodings, but the offloading decision is a matrix with integer elements equal to 0 or 1. Second, the traditional SA generates neighbour solutions randomly and does not take advantage of the CQI information. Third, the iteration number of SA is always fixed, which leads to long computing time when the DRL finally converges. In this regard, we propose a new ASA algorithm to search for the optimal action efficiently.
First, we improve the coding of SA’s solution. In our ASA algorithm, the solution can be represented as:
| (27) |
where means that the -th UE decides to execute the task itself, and means that the -th UE decides to offload the task to the -th MEC, while . This representation transforms the offloading decision matrix to an integer coding for SA.
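
The conversion between the binary offloading decision matrix and the integer coding can be sketched as below. It assumes that the value 0 (and column 0 of the matrix) stands for local execution and that the value j stands for offloading to the j-th MEC; the function names are hypothetical.

```python
def matrix_to_code(decision):
    """Binary decision matrix -> integer coding, one integer per UE.

    decision[i][j] == 1 means UE i executes at place j (column 0 = local).
    """
    code = []
    for row in decision:
        if sum(row) != 1:
            raise ValueError("each UE must select exactly one place")
        code.append(row.index(1))
    return code


def code_to_matrix(code, num_mec):
    """Integer coding -> binary decision matrix with num_mec + 1 columns."""
    return [[1 if j == place else 0 for j in range(num_mec + 1)]
            for place in code]
```

For example, with two MECs the coding `[0, 2, 1]` says that the first UE computes locally, the second offloads to the second MEC and the third offloads to the first MEC.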
Second, channel quality provides the prior information for guiding neighbour solution generation. We introduce an adaptive h-mutation to obtain the neighbour solution. The mutation probability of the -th solution is given as:
| (28) |
The adaptive h-mutation strategy is given as
| (29) |
where is a randomly generated integer to make sure that the -th UE will offload the task to an MEC or execute the task itself. In the h-mutation strategy, the UE has a higher probability of offloading the task to the MEC whose channel quality is better, so this strategy is more effective than random neighbour generation.
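
A CQI-guided neighbour generation in the spirit of the h-mutation can be sketched as follows. Eqs. (28) and (29) are not reproduced here, so this is only an assumed variant: each UE is mutated with a fixed probability, and a mutated UE either switches to local execution or picks an MEC with probability proportional to its CQI. The parameters `mutation_prob` and `local_prob` are hypothetical.

```python
import random


def h_mutate(code, cqi, mutation_prob=0.2, local_prob=0.1, rng=random):
    """Return a neighbour of `code` (integer coding, 0 = local execution).

    cqi[i][j - 1] is the CQI between UE i and MEC j; each row must contain
    non-negative values with a positive sum.
    """
    new_code = list(code)
    for i, row in enumerate(cqi):
        if rng.random() >= mutation_prob:
            continue  # keep this UE's current decision
        if rng.random() < local_prob:
            new_code[i] = 0  # mutate to local execution
        else:
            # Draw the MEC index with probability proportional to the CQI.
            mec_ids = range(1, len(row) + 1)
            new_code[i] = rng.choices(mec_ids, weights=row, k=1)[0]
    return new_code
```

Compared with a uniform random mutation, this proposal spends fewer SA iterations on neighbours that route tasks over poor channels.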
Third, of the DNN at each epoch is also introduced to adjust the iteration number adaptively using the following equation:
| (30) |
where is a threshold. In the adaptive iteration strategy, the iteration number of SA will decrease continuously in the training process of the DNN, while will increase when the environment varies, therefore this strategy is suitable for action search in dynamic environment and has high search efficiency.
Fourth, the convex optimization is applied to solve Problem 2.1 for each solution in ASA and Eq. (17) is adopted as the objective function . The detailed description of ASA algorithm is provided in .
Iv-E 2p-ER
Experience replay (ER) is the other key technology in our DRL framework, because it has the following merits: (1) The random sampling can enhance stability of DRL by reducing the correlation between the samples in the buffer; (2) The reuse of history data can enhance the transition utilization and maintain the transition diversity, which will improve the performance of DNN[28]. The procedure of ER is as follows: the buffer is empty at the beginning of the first epoch, and then the new state-action pairs at the epoch are collected and added to the buffer. Next, the random batch sampling in the buffer is applied to train DNN, and new transitions will be collected from the trained DNN continually. When the buffer is full, FIFO scheduling policy is employed, and the oldest transitions will be discarded. However, traditional ER may discard some good transitions when the buffer is full because of the FIFO strategy, and the selection probability of all transitions is uniform. These traits limit the learning efficiency of DNN, especially in the dynamic environment. To address these obstacles, we propose a preserve strategy and a priority strategy in replay buffer whose details are described as follows:
(1) Preserve strategy: in replay buffer, we will preserve the transitions which are similar to the current offloading policy . During the training process, the offloading policy gradually shifts away from the previous status, and the samples whose offloading policy are different from the current offloading policy may not contribute to DNN’s outcomes. The difference between the offloading policy of the transition collected at epoch and current offloading policy can be measured as follows:
| (31) |
where is the L2-norm of at the current epoch , and is the L2-norm of at the epoch which is the transition collected epoch. Thus we compute a dissimilarity factor of each transition and define the reusable transition if with . The reusable transitions will be preserved and reused during the FIFO process.
(2) Priority strategy: in replay buffer, the transition which incurs obvious loss function decrease will be set with the higher selection probability, while the transition which cannot improve the performance of DNN obviously will be set with the lower selection probability. This strategy will increase the learning frequency of the valuable transitions and eliminate inefficiencies in the DRL process. The probability of sampling transitions is defined as:
| (32) |
where , is a small positive constant that guarantees all the transitions can be sampled, even if the variation of loss function at epoch [29]. is the set of all transitions in the replay buffer. is a probability factor to control how much priority is used.
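
The priority strategy can be illustrated with a short sketch in the style of prioritized experience replay. It assumes that the priority of a transition is the magnitude of its loss decrease plus a small constant, raised to a power that controls how much priority is used, and then normalized over the buffer. The names `alpha` and `eps` and their default values are assumptions.

```python
import random


def sampling_probabilities(loss_drops, alpha=0.6, eps=1e-3):
    """Probability of sampling each transition from its loss decrease."""
    priorities = [(abs(d) + eps) ** alpha for d in loss_drops]
    total = sum(priorities)
    return [p / total for p in priorities]


def sample_batch(transitions, loss_drops, batch_size, rng=random, **kwargs):
    """Draw a training batch (with replacement) according to the priorities."""
    probs = sampling_probabilities(loss_drops, **kwargs)
    return rng.choices(transitions, weights=probs, k=batch_size)
```

Setting `alpha` to 0 recovers uniform sampling, while larger values concentrate the training on the transitions that reduced the loss the most.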
To realize the preserve strategy and priority strategy in the replay buffer, we store two extra variables at epoch when we update the DNN. It is worth noting that these two strategies are easy to realize because and have already been calculated in the loss function, so they impose no further burden.
In summary, as shown in Fig. 2, the DRL with ASA and 2p-ER is composed of two alternating stages: (1) Offloading decision making stage: At epoch , the DNN whose parameters are represented as the offloading policy can be deployed for generating online offloading action according to , then the convex optimization algorithm is used to solve P2.1 and calculate and according to , which guarantees that all constraints are satisfied. Then the solutions for are output in real time; (2) Offloading policy updating stage: The computation offloading is set as the initial solution of the ASA search. Then the ASA search is introduced to improve the action and the best is selected as the new transition and appended to the replay buffer. After that, a batch of transitions is drawn from the buffer according to our preserve and priority strategies, the DNN is trained and the offloading policy is updated from to . Meanwhile the variable is recorded to update the and of the selected transitions. The new offloading policy is applied in the epoch to generate the offloading decision according to the new . These two stages are performed alternately and the offloading policy is gradually improved over the iterations. The detailed description of DRL with ASA and 2p-ER is provided in .
V Numerical results and discussion
V-A Simulation parameters setting
Our simulation parameters are given in TABLE I, unless otherwise specified. The parameters of the 2r-SAE are chosen as follows: we adopt a 3-layer fully-connected feedforward neural network as the encoder of the SAE, with 60, 45 and 30 neurons in the first, second and third layers, respectively, and =500, =0.5, =0.08. The parameters of the DRL are chosen as follows: we use a 4-layer fully-connected feedforward neural network as the DNN, with 30, 120, 80 and 30 neurons in the respective layers, and =0.02, =10000, =10. The parameters of the ASA are chosen as follows: =20, =0.02. The parameters of the 2p-ER are chosen as follows: =1.2, . We assume there are two MEC servers, with coordinates (10m, 10m) and (40m, 40m), located in an area of size 50m × 50m. We also assume there are 30 UEs randomly distributed in this area.
| Parameter | Value |
| Data size of task | 100kB |
| Required CPU cycles of task | cycles/s |
| Bandwidth | 1MHz |
| Local Computational Capability | cycles/s |
| Remote Computational Capability | cycles/s |
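The geometry of this setup can be reproduced with a few lines of code. The sketch below assumes that "randomly distributed" means uniformly distributed over the square area; the function names and the fixed seed are illustrative and do not come from the paper.

```python
import math
import random

AREA_SIDE_M = 50.0                            # 50m x 50m area
MEC_POSITIONS = [(10.0, 10.0), (40.0, 40.0)]  # two MEC servers
NUM_UES = 30

def place_ues(num_ues=NUM_UES, side=AREA_SIDE_M, seed=0):
    """Draw UE coordinates uniformly at random inside the square (assumption)."""
    rng = random.Random(seed)
    return [(rng.uniform(0.0, side), rng.uniform(0.0, side))
            for _ in range(num_ues)]

def distance_matrix(ues, mecs=MEC_POSITIONS):
    """Euclidean distance in meters from every UE (row) to every MEC (column)."""
    return [[math.dist(ue, mec) for mec in mecs] for ue in ues]
```

Calling `distance_matrix(place_ues())` gives a 30 × 2 matrix, i.e., one distance per UE and MEC pair, which is the kind of input a distance-dependent channel model would consume.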
V-B 2r-SAE performance evaluation
2r-SAE can provide a compact data representation to the DRL model. Fig. 3 characterizes the reconstruction accuracy of AE and SAE for the data compression and representation of the channel state in the MEC system with 2 MEC servers. The encoder of the AE is a simple 2-layer fully-connected feedforward neural network with 60 and 30 neurons in the first and second layers, respectively. It can be observed that the reconstruction accuracy of SAE is 92.73%, while that of AE is 87.55%. The SAE with 3 layers thus yields a more precise representation than the traditional AE with 2 layers. This is because the depth of the DNN directly affects the potential feature representation and extraction, which in turn directly affects the reconstruction accuracy. In addition, in Fig. 4, the training losses of AE and SAE both converge to about 0.0055 after about 80 episodes, and the same behavior can be observed in the testing loss curves. This means that the unsupervised learning of AE and SAE can be used successfully for CQI data preprocessing and compression, and that overfitting does not occur.
Fig. 5 and Fig. 6 characterize the absolute error distribution of all channel state data for 2r-SAE and standard SAE. We can see that the training error and testing error of 2r-SAE are more concentrated in the lowest error bin. There are two reasons for this phenomenon. First, the relative error loss term of each UE is added to the loss function, so that the SAE considers not only the MSE but also the relative error of each data point during training, which leads to a lower training error. Second, the regularization term ensures the generalization of the SAE, which leads to a lower testing error.
V-C DRL performance evaluation
ASA is a key element affecting the performance of our DRL. Fig. 7 characterizes the action search process using ASA and the traditional SA. It is observed that the ASA reaches the optimal action in fewer iterations, and hence more efficiently, than SA. This is because h-mutation is applied to guide the action search and prompt the ASA to find the optimal neighbor solution efficiently. Fig. 8 characterizes the adaptive iteration number of ASA during the DRL stage. We can see that the iteration number of ASA decreases to 1 with the decline of . At some DRL epochs, the iteration number of ASA increases because of the increase of . The adaptive iteration number reduces the number of times the convex optimization problem must be solved and further improves the computational efficiency of DRL.
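For reference, the traditional SA baseline used in this comparison can be sketched as follows. This is a generic simulated annealing loop over a discrete offloading vector, not the paper's ASA: h-mutation and the adaptive iteration number are not reproduced, and the neighbor move, cooling schedule, default values and the `reward_fn` callback are all assumptions made for illustration.

```python
import math
import random

def simulated_annealing(initial, reward_fn, num_choices, iterations=100,
                        temperature=1.0, cooling=0.9, seed=0):
    """Plain SA maximizing reward_fn over a discrete vector (illustrative).

    initial: starting decision, one integer in [0, num_choices) per UE.
    reward_fn: hypothetical callback that scores a decision vector.
    """
    rng = random.Random(seed)
    current = list(initial)
    current_reward = reward_fn(current)
    best, best_reward = list(current), current_reward
    for _ in range(iterations):
        # Neighbor: reassign one randomly chosen UE to a random option.
        candidate = list(current)
        candidate[rng.randrange(len(candidate))] = rng.randrange(num_choices)
        candidate_reward = reward_fn(candidate)
        delta = candidate_reward - current_reward
        # Accept improvements; accept worse moves with Boltzmann probability.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_reward = candidate, candidate_reward
        if current_reward > best_reward:
            best, best_reward = list(current), current_reward
        temperature *= cooling
    return best, best_reward
```

In the setting of this paper, every call to `reward_fn` would correspond to solving the convex resource allocation problem once, which is why reducing the number of SA iterations directly reduces the computational cost.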
2p-ER is another element affecting the performance of our DRL. Fig. 9 characterizes the reward and the loss value for our DRL with 2p-ER, while Fig. 10 characterizes the reward and the loss value for DRL with a traditional replay buffer. We can see that, for the offloading policies learned from both DRL with 2p-ER and traditional DRL, the reward of each epoch increases as the interaction between the DNN and the MEC system environment continues, which indicates that DRL can acquire efficient offloading policies without any prior knowledge of the environment. Moreover, the reward of our DRL becomes stable after about 2500 epochs, while the reward of traditional DRL becomes stable after about 7000 epochs. In addition, the loss of the DNN (offloading policy) learned from our DRL is always lower than that of traditional DRL. This is because the preserve strategy preserves the reusable transitions and enhances the correlation between the transitions and the current offloading policy, while the priority strategy gives a higher selection probability to the transitions that can lead to a decline of the loss function. Both strategies improve the performance of 2p-ER.
V-D OJRS framework performance evaluation
| Method | Computing time (s) | Task latency (s) | Reward |
| OJRS framework | 0.0174 | 20.6874 | 0.0483 |
| Greedy | 0.0139 | 25.2942 | 0.0395 |
| Random | 0.0057 | 36.4325 | 0.0275 |
| ASA | 0.2354 | 20.2385 | 0.0494 |
Finally, we evaluate the whole OJRS framework. TABLE II characterizes the performance of the proposed OJRS framework for online joint resource scheduling. Greedy, Random and ASA are used as the benchmarks. Random offloading (Random) means that the offloading admission is decided randomly for each UE; if the computational resource of the allocated MEC is insufficient, the UE executes the task locally. Greedy offloading (Greedy) means that all UEs offload their tasks to the nearest MEC; if the computational resources are insufficient, the UEs that need more computational resources execute their tasks locally. ASA denotes that the task offloading decision is optimized by the ASA method directly, without applying DRL. It can be observed that ASA achieves the highest reward (0.0494). The proposed method attains almost the same reward (0.0483), which is higher than that of Greedy (0.0395) and Random (0.0275). This is because the proposed method uses ASA to search the action space and constructs an optimal nonlinear offloading policy from compressed to offloading decision . Meanwhile, when the SAE and DRL are applied, the online decision making of the proposed method is far less complex than that of the ASA: its computing time is 0.0174 s, compared with 0.2354 s for ASA.
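The Greedy benchmark described above can be sketched as follows. The capacity model is an assumption made for illustration: each UE has a scalar demand, each MEC has a scalar capacity, and an overloaded MEC sends its most demanding UEs back to local execution until the remaining load fits. The names `distances`, `demands` and `capacities` are hypothetical.

```python
def greedy_offloading(distances, demands, capacities):
    """Nearest-MEC offloading with local fallback (illustrative).

    distances: distances[u][m] is the distance from UE u to MEC m.
    demands: computational demand of each UE (assumed scalar).
    capacities: computational capacity of each MEC (assumed scalar).
    Returns one decision per UE: the MEC index, or -1 for local execution.
    """
    # Every UE first picks its nearest MEC.
    decisions = [min(range(len(row)), key=row.__getitem__) for row in distances]
    for mec, capacity in enumerate(capacities):
        assigned = [ue for ue, d in enumerate(decisions) if d == mec]
        # Evict the most demanding UEs first until the load fits.
        assigned.sort(key=lambda ue: demands[ue], reverse=True)
        load = sum(demands[ue] for ue in assigned)
        for ue in assigned:
            if load <= capacity:
                break
            decisions[ue] = -1
            load -= demands[ue]
    return decisions
```

Because this rule needs only one pass over the UEs and no iterative search, it is cheap to evaluate, which is consistent with the short computing time reported for Greedy in TABLE II.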
TABLE III characterizes the performance of the proposed OJRS framework in a dynamic environment. We increase the number of MEC servers from one to five. The locations for the cases of 1 MEC, 2 MECs, 3 MECs, 4 MECs and 5 MECs are respectively assumed to be [(25m, 25m)]; [(10m, 10m), (40m, 40m)]; [(10m, 10m), (25m, 25m), (40m, 40m)]; [(10m, 10m), (10m, 40m), (40m, 10m), (40m, 40m)] and [(10m, 10m), (10m, 40m), (25m, 25m), (40m, 10m), (40m, 40m)]. We compare the accuracy and compression ratio of SAE for different numbers of MECs, and we also compare the best reward and average reward acquired from the DRL with varying weights. In particular, the number of output neurons in the SAE is fixed at 30 as the number of MECs changes, and a random weight variation is applied at the 5000th epoch to simulate the dynamic environment. In order to evaluate the performance of DRL in different scenarios, we define the normalized reward rate (NRR) as the inferred reward divided by the optimal reward. In NRR, the inferred reward in the numerator is obtained from the offloading decision of the DNN, and the optimal reward in the denominator is obtained from particle swarm optimization (PSO), which is suitable for solving large-scale MINLP problems and can normally achieve nearly globally optimal solutions, but with long computation time [19].
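The NRR defined above, together with its best and average values over a window of epochs, can be computed as in the following sketch. The function names are illustrative, and the check that the optimal reward is positive is an assumption added to keep the ratio meaningful.

```python
def normalized_reward_rate(inferred_reward, optimal_reward):
    """NRR: reward of the DNN's decision divided by the reference optimum."""
    if optimal_reward <= 0:
        raise ValueError("optimal_reward must be positive")
    return inferred_reward / optimal_reward

def summarize_nrr(inferred_rewards, optimal_rewards):
    """Return (best NRR, average NRR) over a window of epochs."""
    rates = [normalized_reward_rate(r, o)
             for r, o in zip(inferred_rewards, optimal_rewards)]
    return max(rates), sum(rates) / len(rates)
```

Applying `summarize_nrr` separately to the epochs before and after the weight change would give one best value and one average value for each of the two windows.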
The accuracy of SAE (Acc), the compression ratio of SAE (CR), the best NRR (F-Best) and the average NRR (F-Avg) of DRL before the 5000th epoch, and the best NRR (S-Best) and the average NRR (S-Avg) of DRL after the 5000th epoch are reported in Table III for detailed statistical analysis. It can be observed that the reconstruction accuracy of SAE decreases as the number of MEC servers increases, while the compression ratio of SAE increases. Therefore, if we are willing to accept some loss of reconstruction accuracy, we can obtain a larger compression ratio, especially for a large-scale MEC system.
It can also be inferred from the results that the NRR of DRL decreases when the number of MECs increases, because of the information loss of SAE. However, this loss is compensated by the large compression ratio of the state space, which leads to fast search and stable convergence of DRL. Moreover, the DRL achieves almost the same best NRR before and after the 5000th epoch (e.g., 0.9895 versus 0.9886 for 2 MECs), which means that the proposed DRL can adjust the offloading policy automatically and is suitable for making offloading decisions in a dynamic environment. The average NRR of the DRL after the 5000th epoch is higher than that before the 5000th epoch (S-Avg exceeds F-Avg in every row of Table III). A possible explanation of this phenomenon is that when the weights are changed, the DRL only needs to adjust the offloading policy to adapt to the new environment, which is easier than the original learning process without any prior information.
| MEC No. | SAE Acc | SAE CR | DRL F-Best | DRL F-Avg | DRL S-Best | DRL S-Avg |
| 1 | 1 | 0 | 0.9987 | 0.9462 | 0.9988 | 0.9764 |
| 2 | 0.9273 | 0.5 | 0.9895 | 0.9421 | 0.9886 | 0.9693 |
| 3 | 0.8994 | 0.67 | 0.9821 | 0.9362 | 0.9823 | 0.9612 |
| 4 | 0.8823 | 0.75 | 0.9732 | 0.9252 | 0.9733 | 0.9575 |
| 5 | 0.8782 | 0.80 | 0.9672 | 0.9197 | 0.9671 | 0.9488 |
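The CR values in Table III are consistent with a simple dimension count. The sketch below assumes, as an inference rather than a definition stated in this section, that the SAE input has one channel value per UE and MEC pair and that the compression ratio is one minus the ratio of code dimension to input dimension.

```python
NUM_UES = 30    # UEs in the simulated area
CODE_DIM = 30   # fixed number of SAE output neurons

def compression_ratio(num_mecs, num_ues=NUM_UES, code_dim=CODE_DIM):
    """CR = 1 - code_dim / input_dim, with input_dim = num_ues * num_mecs (assumed)."""
    input_dim = num_ues * num_mecs
    return 1.0 - code_dim / input_dim

for n in range(1, 6):
    print(n, round(compression_ratio(n), 2))
```

Under these assumptions the ratio simplifies to 1 - 1/N for N MEC servers, so it grows with the number of servers while each additional server contributes a smaller gain.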
VI Conclusion
In this paper, we have proposed a DRL-based online joint resource scheduling framework. This framework adopts an SAE and a DRL to optimize computation offloading, transmission power, and computation resource allocation in a large-scale MEC system. First, a novel 2r-SAE with unsupervised learning is presented to carry out data compression and representation for high-dimensional channel state data, which reduces the state space of DRL. Second, a novel DRL is proposed to make offloading decisions, in which an ASA is used to search for the optimal action and a 2p-ER is used to help the DRL train the DNN and find the optimal offloading policy. Specifically, the ASA uses adaptive h-mutation and an adaptive iteration number to enhance the action search and further improve the computing efficiency during the DRL process. The 2p-ER uses preserve and priority strategies to optimize the ER and improve the training process of the DNN. It is demonstrated that the proposed framework is capable of jointly optimizing the computation offloading, transmission power, and computation resource allocation with high accuracy, making real-time resource scheduling feasible for large-scale MEC systems.
References
- [1] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, “A survey on mobile edge computing: The communication perspective,” IEEE Communications Surveys & Tutorials, vol. 19, no. 4, pp. 2322–2358, 2017.
- [2] K. Wang, P. Huang, K. Yang, C. Pan, and J. Wang, “Unified offloading decision making and resource allocation in ME-RAN,” IEEE Transactions on Vehicular Technology, vol. 68, no. 8, pp. 8159–8172, Aug 2019.
- [3] D. P. Bertsekas, Dynamic programming and optimal control. Athena Scientific, Belmont, MA, 1995, vol. 1, no. 2.
- [4] P. M. Narendra and K. Fukunaga, “A branch and bound algorithm for feature subset selection,” IEEE Transactions on computers, no. 9, pp. 917–922, 1977.
- [5] D. Liu, L. Khoukhi, and A. Hafid, “Decentralized data offloading for mobile cloud computing based on game theory,” in 2017 Second International Conference on Fog and Mobile Edge Computing (FMEC). IEEE, 2017, pp. 20–24.
- [6] S. Bi and Y. J. Zhang, “Computation rate maximization for wireless powered mobile-edge computing with binary computation offloading,” IEEE Transactions on Wireless Communications, vol. 17, no. 6, pp. 4177–4190, 2018.
- [7] T. Q. Dinh, J. Tang, Q. D. La, and T. Q. Quek, “Offloading in mobile edge computing: Task allocation and computational frequency scaling,” IEEE Transactions on Communications, vol. 65, no. 8, pp. 3571–3584, 2017.
- [8] L. Huang, X. Feng, A. Feng, Y. Huang, and L. P. Qian, “Distributed deep learning-based offloading for mobile edge computing networks,” Mobile Networks and Applications, pp. 1–8, 2018.
- [9] H. Jiang, D. Peng, K. Yang, Y. Zeng, and Q. Chen, “Predicted mobile data offloading for mobile edge computing systems,” in International Conference on Smart Computing and Communication. Springer, 2018, pp. 153–162.
- [10] C. H. Liu, Z. Chen, and Y. Zhan, “Energy-efficient distributed mobile crowd sensing: A deep learning approach,” IEEE Journal on Selected Areas in Communications, vol. 37, no. 6, pp. 1262–1276, 2019.
- [11] L. Xiao, C. Xie, T. Chen, H. Dai, and H. V. Poor, “A mobile offloading game against smart attacks,” IEEE Access, vol. 4, pp. 2281–2291, 2016.
- [12] Y. He, N. Zhao, and H. Yin, “Integrated networking, caching, and computing for connected vehicles: A deep reinforcement learning approach,” IEEE Transactions on Vehicular Technology, vol. 67, no. 1, pp. 44–55, 2017.
- [13] C. H. Liu, Z. Chen, J. Tang, J. Xu, and C. Piao, “Energy-efficient UAV control for effective and fair communication coverage: A deep reinforcement learning approach,” IEEE Journal on Selected Areas in Communications, vol. 36, no. 9, pp. 2059–2070, 2018.
- [14] F. Jiang, L. Dong, and Q. Dai, “Electrical resistivity imaging inversion: An isfla trained kernel principal component wavelet neural network approach,” Neural Networks, vol. 104, pp. 114–123, 2018.
- [15] F. Jiang, L. Dong, Q. Dai, and D. C. Nobes, “Using wavelet packet denoising and anfis networks based on cosfla optimization for electrical resistivity imaging inversion,” Fuzzy Sets and Systems, vol. 337, pp. 93–112, 2018.
- [16] L. Huang, S. Bi, and Y. J. Zhang, “Deep reinforcement learning for online computation offloading in wireless powered mobile-edge computing networks,” IEEE Transactions on Mobile Computing, 2019.
- [17] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, “Deep reinforcement learning that matters,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- [18] H. Li, K. Ota, and M. Dong, “Learning IoT in edge: Deep learning for the Internet of Things with edge computing,” IEEE Network, vol. 32, no. 1, pp. 96–101, 2018.
- [19] F. Guo, H. Zhang, H. Ji, X. Li, and V. C. Leung, “An efficient computation offloading management scheme in the densely deployed small cell networks with mobile edge computing,” IEEE/ACM Transactions on Networking, vol. 26, no. 6, pp. 2651–2664, 2018.
- [20] M. Chen, M. Mozaffari, W. Saad, C. Yin, M. Debbah, and C. S. Hong, “Caching in the sky: Proactive deployment of cache-enabled unmanned aerial vehicles for optimized quality-of-experience,” IEEE Journal on Selected Areas in Communications, vol. 35, no. 5, pp. 1046–1061, 2017.
- [21] A. P. Miettinen and J. K. Nurminen, “Energy efficiency of mobile clients in cloud computing,” HotCloud, vol. 10, no. 4-4, p. 19, 2010.
- [22] F. Jiang, K. Wang, L. Dong, C. Pan, W. Xu, and K. Yang, “Deep learning based joint resource scheduling algorithms for hybrid mec networks,” IEEE Internet of Things Journal, pp. 1–14, 2019.
- [23] H. Shao, H. Jiang, H. Zhao, and F. Wang, “A novel deep autoencoder feature learning method for rotating machinery fault diagnosis,” Mechanical Systems and Signal Processing, vol. 95, pp. 187–204, 2017.
- [24] R. Elwell and R. Polikar, “Incremental learning of concept drift in nonstationary environments,” IEEE Transactions on Neural Networks, vol. 22, no. 10, pp. 1517–1531, 2011.
- [25] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- [26] W. Zhang, A. Maleki, M. A. Rosen, and J. Liu, “Optimization with a simulated annealing algorithm of a hybrid system for renewable energy including battery and hydrogen storage,” Energy, vol. 163, pp. 191–207, 2018.
- [27] B. Morales-Castañeda, D. Zaldívar, E. Cuevas, O. Maciel-Castillo, I. Aranguren, and F. Fausto, “An improved simulated annealing algorithm based on ancient metallurgy techniques,” Applied Soft Computing, vol. 84, p. 105761, 2019.
- [28] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
- [29] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” arXiv preprint arXiv:1511.05952, 2015.