Recent years have witnessed explosive growth of the Internet of Things (IoT) as a way to connect tens of billions of resource-limited wireless devices, such as sensors, mobile devices (MDs), and wearable devices, to the Internet through cellular networks. Due to small physical sizes and stringent production cost constraints, IoT devices often suffer from limited computation capabilities and finite battery lives. Perceived as a promising solution, mobile edge computing (MEC) [2, 3] has attracted significant attention. With MEC, computationally intensive tasks can be offloaded to nearby servers located at the edges of wireless networks. This avoids the long backhaul latency and high overhead of traditional mobile cloud computing.
Typically, there are two computation task offloading models for MEC: binary offloading and partial offloading. In the binary offloading model, each task is either executed locally or offloaded to the MEC server as a whole [4, 5, 6, 7, 8, 9]. In partial offloading, tasks can be arbitrarily divided into two parts that are executed by the device and the edge server, respectively [10, 11]. In practice, however, a mobile application usually has multiple components, and the dependency among them cannot be ignored since the outputs of some components are the inputs of others. In this regard, the task call graph has been proposed to model the sophisticated inter-dependency among different components in a mobile application. In this paper, we consider computation offloading with a general task call graph.
Due to the random variation of wireless channels, it is not always advantageous to offload all the tasks for edge execution. Instead, offloading computation tasks opportunistically according to the time-varying channel condition has shown significant performance advantages [4, 5, 6, 7, 8, 9, 10, 11]. Due to the mutual coupling constraints in a task call graph, offloading policy design becomes much more challenging [13, 14, 15, 16, 17, 18]. Specifically, one line of work considered a sequential task graph and derived an optimal one-climb policy, where the execution migrates at most once between the MD and the cloud server. This work was later extended to the general task graph case by applying partial critical path analysis to general task graph scheduling. In another approach, the offloading problem in a general task graph was formulated as a linear programming problem through convex relaxation. A further study modeled the task scheduling problem in a general task graph as an energy consumption minimization problem solved by a genetic algorithm. Note that general task graphs are considered much harder to deal with than task graphs with special structures (e.g., sequential task graphs), since it is hard to derive structural offloading properties (e.g., the one-climb policy for sequential task graphs) under general and complicated coupling among tasks.
On the other hand, recent work has considered joint optimization of radio/computing resource allocation and computation offloading. In particular, one study examined an energy-efficiency cost minimization problem that incorporates CPU frequency control and transmit power allocation into the MEC offloading decision. Another considered inter-user task dependency and proposed a reduced-complexity Gibbs sampling algorithm to obtain the optimal offloading decisions.
) or heuristic local search methods (e.g., in [13, 14, 16, 18]). However, both methods are likely to get stuck in a locally optimal solution that does not guarantee good performance. Moreover, the optimization problems need to be re-solved once the wireless channel conditions change or the available computing power of the edge server changes due to varying demands from background applications. The frequent re-calculation of offloading decisions renders the existing methods impractical.
In this paper, we endeavor to design an efficient optimal computation offloading algorithm in an MEC system with a general task graph, so that the optimal decision swiftly adapts to the time-varying wireless channels and available edge computing power with very low computational complexity. In particular, we propose a deep reinforcement learning (DRL) framework. The key idea of DRL is to use deep neural networks (DNNs) to learn the optimal mapping between the state space and the action space. There exist several works on DRL-based offloading methods for MEC systems [19, 20, 21]. In one of them, a deep Q-network (DQN) based offloading policy was proposed to optimize the computational performance of an MEC system with energy harvesting. For randomly arriving tasks, another work proposed a DQN to learn the optimal offloading decisions without a priori knowledge of network dynamics. To tackle the curse of dimensionality in DQN-based methods, a third work proposed a novel DRL framework that achieves near-optimal offloading actions by considering only a small subset of candidate offloading actions in each iteration. Notice that [19, 20, 21] all assume independent tasks among multiple users. Very recently, an approach that considers general task dependency proposed a recurrent neural network (RNN) based reinforcement learning method for the computation offloading problem. However, it neglected the system dynamics, such as wireless fading channels and time-varying edge server CPU frequency.
We consider an MEC system with a single access point (AP) and an MD as shown in Fig. 1. The MD has an application with a general task topology to execute under time-varying wireless fading channels and edge server CPU frequency. In particular, we propose a DRL framework to minimize the weighted sum of task execution time and energy consumption of the MD. The main contributions are summarized as follows:
We formulate a mixed integer optimization problem to jointly optimize the offloading decisions and local CPU frequencies of the MD to minimize the computation delay and energy consumption. The problem is challenging because of the combinatorial nature of the offloading decisions and the strong coupling among task executions under a general dependency model.
In order to solve the combinatorial optimization problem efficiently, we propose a DRL framework based on the actor-critic learning structure, where we periodically train a DNN in the actor network from past experiences to learn the optimal mapping between the states (i.e., wireless channels and edge CPU frequency) and actions (i.e., offloading decisions). Within the actor network, we devise a novel Gaussian noise-added order-preserving action generation method to balance the diversity and complexity in generating candidate binary offloading actions under a high-dimensional action space.
For the critic network, we simplify the problem according to the total loop-free paths in the general task graph and derive a closed-form solution for the optimal local CPU frequencies. Based on this, we propose an efficient algorithm for evaluating each offloading action. As such, unlike traditional actor-critic networks that use a DNN to predict the values of the actions in the critic network, our analysis allows fast and accurate calculation of the performance of each action generated by the actor network. In this way, the complexity and convergence of the actor-critic based DRL are greatly improved.
To further speed up the computation of the proposed DRL framework, we propose a heuristic in which the offloading decisions are limited to those that follow the one-climb offloading policy. The heuristic greatly reduces the number of performance evaluations for the actions in the critic network. The optimality of the one-climb policy is analyzed, and its advantageous performance over the conventional action generation method is verified through simulations.
Numerical results show that for various types of general task graphs, the proposed DRL-based algorithm achieves up to of the optimal energy and time cost. Meanwhile, our proposed method takes only around 1 second to generate an offloading action, which is more than one order of magnitude faster than the other representative benchmark methods.

In this paper, we formulate the joint optimization of offloading and resource allocation with a general task graph in MEC as a mixed integer non-linear programming (MINLP) problem, which is hard to solve with conventional optimization algorithms under time-varying wireless channels and stochastic edge computing capability. By exploring the special structure of the considered MINLP problem, we observe that for any given integer variables (offloading decisions), the remaining problem is convex. Therefore, the main difficulty lies in finding the optimal integer offloading decisions. Exploiting this property, we propose a DRL algorithm based on the actor-critic learning structure, where the actor network generates a set of integer offloading actions according to the time-varying parameters and the critic network scores each action output by the actor network through convex optimization. Then, we use the generated action-score pairs to make the current offloading decision and improve the performance of the actor network. It is worth mentioning that the key purpose of the critic is to evaluate the action quality, regardless of whether it uses a general neural network or a specialized algorithm. In this paper, as one of the major contributions, we propose an efficient low-complexity algorithm in the critic network to evaluate the actions generated by the actor network, which greatly reduces the training cost of the critic DNN and increases the accuracy of action evaluation.
The rest of the paper is organized as follows. In Section II, we present the system model and problem formulation. The optimal local CPU frequencies under fixed offloading decisions are studied in Section III. We introduce the detailed design for the DRL framework in Section IV. In Section V, simulation results are described. Finally, we conclude the paper in Section VI.
II System Model and Problem Formulation
As shown in Fig. 1, we consider an MEC system with one AP and one MD. The AP is the gateway of the edge cloud and has stable power supply. The MD has a computationally intensive mobile application consisting of dependent tasks. The input-output dependency of the tasks is represented by a directed acyclic task graph . As shown in Fig. 2, each vertex in represents a task and the associated parameter indicates the computing workload in terms of the total number of CPU cycles required for accomplishing the task. Besides, each edge in represents that a precedent task must be completed before starting to execute task . Additionally, we denote the size of data in bits transferred from task to by . For simplicity of exposition, we introduce two virtual tasks and as the entry and exit tasks, respectively. Specifically, we have . By forcing the two virtual tasks to be executed locally, we ensure that the application is initiated and terminated at the MD side. We denote the set of tasks in the task graph as .
Define an indicator variable such that means that task is executed locally and means that the MD offloads the computation of task to the edge side. Recall that the two virtual tasks and must be executed locally. That is, .
In addition, we assume that the MD is allocated a dedicated spectral resource block throughout its transmission, which can support concurrent transmissions for task offloading and downloading. We denote by and the channel gains when offloading and downloading the task data , respectively.
Besides, we assume additive white Gaussian noise (AWGN) with zero mean and equal variance at the receiver for all the tasks.
To characterize the task execution time and energy consumption for local and edge computing, respectively, we first define the finish time and ready time of each task.
Definition 1 (Finish Time). The finish time of task is the moment when all the workload has been executed. We denote and as the finish time of task when it is executed locally and at the edge server, respectively.
Definition 2 (Ready Time). The ready time of a task is the earliest time when the task has received all the necessary input data to commence the task computation. For instance, in Fig. 2, the ready time of the fifth task is the time when both the input data streams from the first and second tasks have arrived. We denote the ready time of task when computing locally and at the edge server as and , respectively.
II-A Local Computing
We assume that the MD is equipped with a -core CPU, where each CPU core can execute only one task at a time. That is, the MD can execute in total tasks simultaneously. Suppose that task is computed locally. We denote the local CPU frequency for computing the task as , which is upper bounded by . Thus, the local execution time of task is given by
and the corresponding energy consumption is 
where is the effective switched capacitance depending on the chip architecture. According to circuit theory, the power consumption of the CPU is approximately proportional to the product of , where is the circuit supply voltage. Besides, is approximately linearly proportional to the CPU frequency when the CPU works at the low voltage limits. Therefore, the energy consumption per CPU cycle is given by . It is worth mentioning that for the two virtual tasks and , we have and .
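As a rough numerical illustration of the local computing model, the sketch below assumes the standard DVFS relations described in the text: the execution time equals the workload in CPU cycles divided by the CPU frequency, and the energy per cycle equals the effective switched capacitance times the squared frequency. The function names, the parameter name `kappa`, and all numeric values are assumptions chosen for illustration, not values taken from the paper.

```python
def local_exec_time(workload_cycles: float, freq_hz: float) -> float:
    """Local execution time: CPU cycles divided by CPU frequency (cycles/s)."""
    return workload_cycles / freq_hz


def local_energy(workload_cycles: float, freq_hz: float, kappa: float) -> float:
    """Local energy: (energy per cycle = kappa * f^2) times the number of cycles."""
    return kappa * workload_cycles * freq_hz ** 2


# Hypothetical task: 50 Mcycles, hypothetical switched capacitance 1e-26.
workload = 50e6
kappa = 1e-26
for f in (0.5e9, 1.0e9):
    print(f, local_exec_time(workload, f), local_energy(workload, f, kappa))
```

Under these assumptions, halving the frequency doubles the execution time but cuts the energy to one quarter, which is the trade-off that the frequency optimization later balances.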
If a task preceding task is executed at the edge server, then the output data must be downloaded to the MD before task can be executed locally. Denote the fixed downlink transmit power of the AP by . Then, according to the Shannon-Hartley theorem, the downlink data rate from the AP to the MD is
The corresponding downlink transmission time for sending the data is
As such, the ready time of task is given by
where pred(i) denotes the set of immediate predecessors of task . Specifically, if for a task , the time until its output data is available at the MD for the execution of task is equal to its finish time at the edge side plus the downlink transmission time . Otherwise, if , the time until its output data is available at the MD is equal to its local finish time . When all needed data is available at the ready time , the MD locally computes task with the local execution time in (1), so that the finish time of task becomes
II-B Edge Computing
We denote the fixed transmit power of the MD by . Then, the uplink data rate for offloading the data to the AP is
and the corresponding uplink transmission time is
The transmission energy consumption is
We assume that the edge server has cores and can compute tasks in parallel. The execution time of task on the AP is given by
where is the fixed service rate of each CPU core. Similarly, we can calculate the ready time of task executed at the edge server as
and its finish time is
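The ready-time and finish-time recursions for local and edge execution can be sketched as a single pass over the task graph in topological order. This is a minimal sketch under stated assumptions: the data structures (`preds`, `a`, `t_local`, `t_edge`, `t_up`, `t_down`), the function names, and the toy diamond graph are hypothetical, and per-task execution times and per-edge transfer times are taken as already computed (for example, data size divided by a Shannon rate).

```python
import math


def shannon_rate(bandwidth_hz, tx_power_w, channel_gain, noise_power_w):
    """Achievable rate in bit/s according to the Shannon-Hartley theorem."""
    return bandwidth_hz * math.log2(1.0 + tx_power_w * channel_gain / noise_power_w)


def finish_times(order, preds, a, t_local, t_edge, t_up, t_down):
    """Finish time of every task under offloading decision a (0 = local, 1 = edge).

    order: task ids in topological order; preds: immediate predecessors per task.
    A task is ready once the outputs of all its predecessors are available at
    the side (MD or edge) where the task itself runs.
    """
    finish = {}
    for i in order:
        ready = 0.0
        for j in preds.get(i, []):
            available = finish[j]
            if a[j] == 1 and a[i] == 0:
                available += t_down[(j, i)]  # edge output downloaded to the MD
            elif a[j] == 0 and a[i] == 1:
                available += t_up[(j, i)]    # local output uploaded to the edge
            ready = max(ready, available)
        finish[i] = ready + (t_edge[i] if a[i] == 1 else t_local[i])
    return finish


# Hypothetical diamond graph 0 -> {1, 2} -> 3; tasks 0 and 3 are virtual.
order = [0, 1, 2, 3]
preds = {1: [0], 2: [0], 3: [1, 2]}
a = {0: 0, 1: 1, 2: 0, 3: 0}
t_local = {0: 0.0, 1: 0.4, 2: 0.3, 3: 0.0}
t_edge = {1: 0.1}
t_up = {(0, 1): 0.05}
t_down = {(1, 3): 0.02}
finish = finish_times(order, preds, a, t_local, t_edge, t_up, t_down)
print(finish[3])  # completion time = local finish time of the exit task
```

In this toy instance the offloaded branch finishes earlier than the local branch, so the exit task waits on the local one.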
II-C Problem Formulation
We assume that both the MD and MEC server have a lot more CPU cores than needed to execute the possibly concurrent tasks in the considered mobile application. As such, we can safely set . Besides, it is assumed that the number of available channels is sufficiently large to execute the possibly concurrent data transmissions in the task graph.
From the above discussion, the total time to complete all the tasks is equal to the local finish time of the auxiliary exit task , i.e., . Besides, we can calculate the total energy consumption of the MD by
which consists of energy consumed on local computation and task offloading.
In this paper, we consider the energy-time cost (ETC) as the performance metric, which is defined as the weighted sum of the total energy consumption and execution time, i.e.,
where and denote the weights of energy consumption and computation completion time of the MD, respectively. It is assumed that the weights are related by . We consider the weighted-sum approach [9, 17, 18] for a general multi-objective optimization problem. According to Proposition 3.9 of , for any given positive weights, we can reach an efficient solution of the multi-objective optimization problem by solving Problem (P1). A weakly efficient solution will be obtained if any of the weights is zero. Besides, in order to meet user-specific demands, we allow the MD to choose different weights. For instance, an MD with low battery energy prefers a larger for energy saving, while a delay-sensitive MD chooses a larger to reduce the execution time.
Evidently, a higher CPU frequency leads to shorter task execution time. Meanwhile, according to (2), the energy consumption per CPU cycle is a quadratic function of the CPU frequency, so the energy consumed in executing a task increases with the CPU frequency. Because the AP has a stable power supply, it can operate at a fixed maximum frequency to minimize the execution delay. However, since the MD is often energy-constrained, we apply the dynamic voltage and frequency scaling (DVFS) technique to tune the local CPU frequency and balance energy consumption against execution time. Denoting and , , we aim to minimize the ETC of the MD subject to the peak CPU frequency constraint of the MD, i.e.,
where we assume in this paper. In general, (P1) is non-convex due to the binary variables and the recursive structure of . In the following section, we first simplify (P1) by exploiting the property of the total task completion time . Then, we propose an efficient method to obtain the optimal CPU frequencies with a given .
III Optimal Resource Allocation Under Fixed Offloading Decisions
III-A Problem (P1) Simplification
We denote a path as an ordered sequence of task indices , that pass through the general task graph from the entry task to the exit task . Here, is the total number of real tasks in path . For instance, is a path in Fig. 2. There are three real tasks in the path. Besides, we denote the set of all loop-free paths as , which can be obtained by running the -shortest path routing algorithm on . Likewise, we denote by the total number of paths. Let denote the total execution time in the -th path excluding the waiting time for the data inputs from the other paths. Then, we have
which consists of the total computation and communication delay in path .
To simplify Problem (P1), we first have the following lemma on .
Lemma 3.1: holds given any .
Proof: Please refer to Appendix A. ∎
Lemma 3.1 indicates that the final completion time is equal to the largest total execution time of all the paths in . Note that although does not include the time spent on waiting for the task input data from other paths, the largest among all paths is the final completion time.
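The statement of Lemma 3.1 can be checked numerically on a small example. The sketch below assumes that, for a fixed offloading decision, each task has a known execution delay and each edge a known transfer delay (zero when both endpoints run on the same side). It enumerates all loop-free paths with a plain depth-first search, which is a stand-in for the shortest-path routing procedure mentioned in the text, and compares the largest path time with the completion time from the ready-time recursion. The graph, delays, and function names are hypothetical.

```python
def all_paths(succs, source, sink):
    """Enumerate every loop-free path from source to sink in a DAG (depth-first)."""
    stack = [[source]]
    paths = []
    while stack:
        path = stack.pop()
        node = path[-1]
        if node == sink:
            paths.append(path)
            continue
        for nxt in succs.get(node, []):
            stack.append(path + [nxt])
    return paths


def path_time(path, node_delay, edge_delay):
    """Total computation plus communication delay accumulated along one path."""
    total = sum(node_delay[i] for i in path)
    total += sum(edge_delay.get((u, v), 0.0) for u, v in zip(path, path[1:]))
    return total


def completion_time(order, succs, node_delay, edge_delay):
    """Finish time of the last task in `order` via the ready-time recursion."""
    ready = {i: 0.0 for i in order}
    finish = {}
    for u in order:  # `order` must be a topological order of the DAG
        finish[u] = ready[u] + node_delay[u]
        for v in succs.get(u, []):
            ready[v] = max(ready[v], finish[u] + edge_delay.get((u, v), 0.0))
    return finish[order[-1]]


# Hypothetical graph with virtual entry task 0 and exit task 5.
succs = {0: [1, 2], 1: [3], 2: [3, 4], 3: [5], 4: [5]}
order = [0, 1, 2, 3, 4, 5]
node_delay = {0: 0.0, 1: 0.2, 2: 0.1, 3: 0.3, 4: 0.25, 5: 0.0}
edge_delay = {(0, 1): 0.05, (3, 5): 0.02}  # only edges that cross MD/edge

paths = all_paths(succs, 0, 5)
longest = max(path_time(p, node_delay, edge_delay) for p in paths)
print(len(paths), longest, completion_time(order, succs, node_delay, edge_delay))
```

The number of paths can grow quickly with the graph size, so explicit enumeration like this is only practical for small task graphs such as the 8-task examples used later in the simulations.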
Due to the one-to-one mapping between and in (1), it is equivalent to optimize (P1) over the time allocation . By introducing an auxiliary variable , (P1) can be equivalently expressed as
Notice that (P2) is non-convex in general due to the binary variables . However, for any given , the remaining optimization over is a convex problem. In the following, we assume a fixed offloading decision and derive an efficient algorithm to obtain the optimal , or equivalently the optimal local CPU frequencies .
III-B Optimal Local CPU Frequencies
Suppose that is given. We express a partial Lagrangian of Problem (P2) as
where denotes the dual variables associated with the corresponding constraints. Let denote the optimal dual variables. Then, we derive the closed-form expressions for the optimal local CPU frequencies as follows.
Proposition 3.1: with , by denoting the index set of the paths that contain task as , the optimal CPU frequencies at the MD satisfy
Proof: Please refer to Appendix B. ∎
From Proposition 3.1, we observe that the optimal is determined by the dual variables corresponding to all the paths containing task . Besides, increasing leads to a lower optimal for energy saving.
Corollary 3.1: The summation of the optimal dual variables over all paths is equal to the constant . That is,
Then, if , according to the Proposition 3.1, the optimal local CPU frequency for task is
which is a constant regardless of the values of .
Proof: Please refer to Appendix C. ∎
The above corollary indicates that the optimal is a constant when the -th task is included in all the paths, i.e., .
Based on Proposition 3.1 and Corollary 3.1, we can apply the projected subgradient method  to search for the optimal dual variables . Specifically, we initialize satisfying (20). In the -th iteration, we first calculate using (16) and (19) and set . Then, the dual variables are updated to by using subgradients , i.e.,
where is a small learning rate. In order to guarantee the feasibility of dual variables, we need to project to the feasible region given in (20). The projection is calculated from the following convex problem,
which can be efficiently solved by general convex optimization techniques, e.g., the interior point method. After updating the dual variables, we can further obtain the updated optimal local CPU frequencies. Such iteration proceeds until a stopping criterion is met. The pseudo-code of the method is shown in Algorithm 1.
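One iteration of the projected subgradient update can be sketched as follows. Several assumptions are made here and are not confirmed by the text: the feasible region of the dual variables is taken to be the scaled simplex (non-negative entries summing to a constant `total`, in the spirit of Corollary 3.1); the projection uses the well-known sort-based Euclidean simplex projection instead of the interior point method named above; and the closed-form frequency is derived by assuming the per-task Lagrangian term is `beta_e * kappa * L * f**2 + dual_sum * L / f`. All names and numbers are hypothetical.

```python
def project_to_scaled_simplex(v, total):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = total} (sort-based)."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, uj in enumerate(u, start=1):
        cumulative += uj
        t = (cumulative - total) / j
        if uj - t > 0:  # holds for a prefix of the sorted entries; keep the last t
            theta = t
    return [max(x - theta, 0.0) for x in v]


def optimal_local_freq(dual_sum, beta_e, kappa, f_peak):
    """Stationary point of the assumed per-task term, capped at the peak frequency."""
    unconstrained = (dual_sum / (2.0 * beta_e * kappa)) ** (1.0 / 3.0)
    return min(unconstrained, f_peak)


def dual_step(duals, path_times, step, total):
    """Subgradient move (path time minus the largest path time), then projection."""
    t_max = max(path_times)
    moved = [lam + step * (tp - t_max) for lam, tp in zip(duals, path_times)]
    return project_to_scaled_simplex(moved, total)


# Two hypothetical paths; the first one is currently the critical path.
print(dual_step([0.5, 0.5], [1.0, 0.6], step=0.5, total=1.0))
```

In this toy step the dual weight shifts toward the critical path, which under the assumed closed form raises the CPU frequency of the tasks on that path in the next iteration.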
IV Deep Reinforcement Learning Based Task Offloading
In the last section, we efficiently obtain the optimal given the offloading decision . Intuitively, we could enumerate all feasible and choose the one that achieves the minimum objective of (P2). However, such a brute-force search is computationally prohibitive, especially when the problem needs to be frequently re-solved with time-varying channel gains and available server computing power. Besides, other search-based methods, such as branch-and-bound and Gibbs sampling algorithms, are also time-consuming when is large.
In this section, we propose a DRL-based algorithm to solve the joint optimization under time-varying channel gains and CPU frequency at the edge server. Our goal is to derive an offloading decision policy that can quickly predict an optimal offloading action of (P2) once the channel gain and the CPU frequency at the edge server are revealed at the beginning of the execution of the application (task graph). The offloading decision policy is denoted as
The algorithm structure is illustrated in Fig. 3. There are two stages in the DRL-based offloading algorithm: one is referred to as the actor-critic network based offloading action generation, and the other is offloading policy update, which are detailed as follows. Furthermore, we propose the one-climb policy to speed up the learning process.
IV-A Actor-Critic Network Based Offloading Action Generation
IV-A1 Actor Network
The offloading action is generated based on a DNN. We denote the embedded parameters of the DNN at the -th epoch as , where is randomly initialized following a zero-mean normal distribution. At the -th epoch, we take the channel gain and edge CPU frequency as the input of the DNN. Accordingly, the DNN outputs a relaxed offloading action , which is denoted by a mapping , i.e.,
where , and denotes the -th entry of .
Notice that each entry of is a continuous value between 0 and 1. To generate a feasible binary offloading decision, we first quantize into candidate binary offloading actions. Then, the critic network will evaluate the performance of the candidate actions, and the one with the lowest ETC will be selected as the output solution. Noticeably, for a good quantization method, we only need to generate few candidate actions to reduce the computational complexity. Meanwhile, the quantized actions based on the relaxed action should contain sufficient diversity to yield a lower ETC. In this paper, we propose a Gaussian noise-added order-preserving (GNOP) quantization method as shown in Fig. 4. We define the quantization function as
where is the generated candidate action set in the -th epoch.
The order-preserving quantization method was originally introduced in prior work to explore the output of the DNN. The key idea is to preserve the ordering of all the entries in a vector before and after quantization. In our proposed GNOP method, the first actions are generated by the traditional order-preserving method, where we assume that is an even number without loss of generality. Specifically, suppose that the output offloading action is . The generation rule for in the order-preserving method is as follows.
First, we obtain the offloading decision as
for . For the other offloading actions, we first order the entries of according to their distances to 0.5, i.e., , where is denoted as the -th order entry of . Then, the -th offloading action is obtained as
for and .
Compared to the traditional K-nearest neighbor (KNN) method, the order-preserving quantization method leads to a higher diversity in the offloading action space. However, the offloading actions produced by the conventional order-preserving quantization method are still closely placed around , which reduces the chance of finding a local optimum in a large action space. To better explore the action space, we introduce a Gaussian noise-added approach to generate the other half of the candidate actions. Specifically, we first add Gaussian noise to as
where is the sigmoid function that maps the original noise-added action to the interval (0, 1). Then, we apply the order-preserving method on to generate the offloading actions.
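The GNOP quantization step can be sketched in a few lines. This is a minimal sketch under stated assumptions: the order-preserving rule follows the commonly used threshold construction (first threshold at 0.5, then thresholds at the entries ranked by their distance to 0.5), the noise standard deviation `sigma` is a hypothetical parameter, and the entries for the two virtual tasks, which must stay local, are not handled here.

```python
import math
import random


def order_preserving(x_hat, num_actions):
    """Quantize a relaxed action in [0, 1]^N into at most num_actions binary actions."""
    actions = [[1 if x > 0.5 else 0 for x in x_hat]]
    # Entries ranked by their distance to 0.5 serve as the next thresholds.
    ranked = sorted(x_hat, key=lambda x: abs(x - 0.5))
    for thresh in ranked[: num_actions - 1]:
        if thresh <= 0.5:
            actions.append([1 if x >= thresh else 0 for x in x_hat])
        else:
            actions.append([1 if x > thresh else 0 for x in x_hat])
    return actions


def gnop(x_hat, num_actions, sigma=1.0, rng=None):
    """Half of the candidates from x_hat, half from a noisy, sigmoid-squashed copy."""
    rng = rng or random.Random()
    half = num_actions // 2  # num_actions is assumed to be even
    noisy = [1.0 / (1.0 + math.exp(-(x + rng.gauss(0.0, sigma)))) for x in x_hat]
    return order_preserving(x_hat, half) + order_preserving(noisy, half)


print(order_preserving([0.2, 0.4, 0.7, 0.9], 4))
print(gnop([0.2, 0.4, 0.7, 0.9], 4, rng=random.Random(0)))
```

Every candidate keeps the ordering of the relaxed entries: if one entry of the relaxed action is no smaller than another, the same holds for the corresponding binary entries.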
IV-A2 Critic Network
After generating the candidate offloading actions in the actor network, we evaluate the ETC performance of each action in the critic network. Instead of training a critic DNN as the conventional actor-critic method does, we can accurately and efficiently evaluate the ETC corresponding to each candidate using our analysis in Section III. In particular, we denote the ETC achieved by the candidate as by optimizing the local CPU frequencies as described in Algorithm 1. This greatly reduces the training cost of the critic DNN and increases the accuracy of ETC evaluation. Accordingly, we choose the best offloading action at the -th epoch as
Noticeably, , together with its corresponding optimal resource allocation constitutes the optimal solution to Problem (P1) (or equivalently, Problem (P2)).
IV-B Offloading Policy Update
The optimal actions learned in the offloading action generation stage are used to update the parameters of the DNN through the offloading policy update stage.
As illustrated in Fig. 3, we implement a replay memory to store the past state-action pairs, where the memory is of limited capacity. At the -th epoch, obtained in the actor-critic network based offloading action generation stage is added to the memory as a new training data sample. Note that the newly generated data sample will replace the oldest one if the memory is full.
The data samples stored in the memory are used to train the DNN. Specifically, in the -th epoch, we randomly select a batch of training data samples from the memory, where represents the set of chosen time indices. Then, we minimize the average cross-entropy loss through the Adam algorithm in order to update the parameters of the DNN, where
is the size of , the superscript denotes the transpose operator, and the log function is the element-wise logarithm operation for a vector. For brevity, the details of the Adam algorithm are omitted here. In practice, we start the training step when the number of samples is larger than half of the memory size and train the DNN every epochs in order to collect a sufficient number of new data samples in the memory.
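The replay memory and the averaged cross-entropy loss can be sketched without any deep learning library. In this minimal sketch the class name `ReplayMemory`, the `predict` callable standing in for the actor DNN, and the small constant `eps` guarding the logarithm are assumptions for illustration; the actual training step with Adam is not shown.

```python
import math
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity store of (state, best action) pairs; oldest entries drop out."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action):
        self.buffer.append((state, action))

    def sample(self, batch_size, rng=random):
        """Draw a random batch without replacement."""
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)


def avg_cross_entropy(batch, predict, eps=1e-12):
    """Average binary cross-entropy between stored actions and predicted outputs."""
    total = 0.0
    for state, action in batch:
        probs = predict(state)
        total -= sum(
            a * math.log(p + eps) + (1 - a) * math.log(1 - p + eps)
            for a, p in zip(action, probs)
        )
    return total / len(batch)


memory = ReplayMemory(capacity=2)
memory.add([0.1], [1, 0])
memory.add([0.2], [0, 1])
memory.add([0.3], [1, 1])  # replaces the oldest sample
print(len(memory), avg_cross_entropy(memory.sample(2), lambda s: [0.5, 0.5]))
```

Using `deque(maxlen=...)` gives the replace-the-oldest behavior described above without any explicit index bookkeeping.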
IV-C Low-Complexity Action Generation Method
Within the proposed DRL framework, we improve the GNOP quantization method to further reduce the complexity. The basic idea is to restrict our action selection only to those offloading decisions that satisfy the following one-climb policy.
Definition 3 (One-climb policy): The execution for the tasks in each path of the graph migrates at most once from the MD to the edge server.
Fig. 5 illustrates the two-time offloading and one-climb schemes in a path . We show in Appendix D that by converting the scheme from two-time offloading to the one-climb policy, the MD saves energy and time costs for the path . This, however, may increase the ETC of other paths that share tasks with path . We show that, if certain mild conditions hold, the minimum ETC is achieved when all the paths satisfy the one-climb policy. Please refer to Appendix D for the detailed analysis.
The one-climb policy is applied to reduce the number of offloading actions to be evaluated by the critic network. Suppose that is the set of actions obtained by the GNOP quantization method at the -th epoch. We remove the actions in that violate the one-climb policy. By using the one-climb policy in the quantization module, we efficiently reduce the number of calculations for Algorithm 1 at the actor-critic network based offloading action generation stage.
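The filtering step can be sketched as a simple check over every path. The sketch below interprets Definition 3 as allowing at most one transition from local execution (0) to edge execution (1) along each path; the function names and the example path are hypothetical.

```python
def satisfies_one_climb(action, paths):
    """True if, on every path, execution moves from the MD to the edge at most once."""
    for path in paths:
        climbs = sum(
            1 for u, v in zip(path, path[1:]) if action[u] == 0 and action[v] == 1
        )
        if climbs > 1:
            return False
    return True


def filter_one_climb(candidates, paths):
    """Keep only the candidate actions that respect the one-climb policy."""
    return [a for a in candidates if satisfies_one_climb(a, paths)]


# Single hypothetical path through tasks 0..4 (tasks 0 and 4 are virtual and local).
paths = [[0, 1, 2, 3, 4]]
candidates = [[0, 1, 0, 1, 0], [0, 1, 1, 0, 0], [0, 0, 0, 0, 0]]
print(filter_one_climb(candidates, paths))
```

Only the candidates that survive the filter need to be scored by Algorithm 1, which is where the reduction in critic evaluations comes from.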
V Numerical Results
In this section, we evaluate the performance of our proposed algorithm through numerical simulations. Consider three different task graphs in Fig. 6, each consisting of 8 actual tasks. Fig. 6(a) illustrates a mesh task graph including a set of linear chains, while a task graph with a tree-based structure is considered in Fig. 6(b). In Fig. 6(c), we consider a general task graph that combines the mesh and the tree. The input and output data sizes (KByte) of each task are shown in Fig. 6. We assume that the computing workload (Mcycles) for all the three task graphs. The transmit powers at the MD and the AP are fixed at 100 mW and 1 W, respectively. It is assumed that the edge server CPU frequency is time-varying and follows a uniform distribution between 2 GHz and 50 GHz. Besides, the peak computational frequency of the MD is equal to 0.01 GHz.
In the simulations, we assume that the average channel gain follows the free-space path loss model , where denotes the antenna gain, MHz denotes the carrier frequency, in meters denotes the distance between the MD and the AP, and denotes the path loss exponent. The time-varying fading channel follows an i.i.d. Rician distribution, where the LOS link power is equal to . Besides, following some classic uplink-downlink channel models, we assume that the random downlink channel is correlated with the uplink channel and set the correlation coefficient to 0.7 (the coefficient 0.7 has been used to model weakly correlated uplink and downlink channels; for highly correlated cases, the correlation coefficient is larger than 0.9). The noise power W. In addition, we set the computing efficiency parameter , and the bandwidth MHz. The priority weights of energy consumption and computation time of the MD are set as . The parameters used in the simulations are listed in Table I.
We consider a fully connected DNN consisting of one input layer, three hidden layers, and one output layer in the proposed DRL algorithm, where the first, second, and third hidden layers have 160, 120, and 80 hidden neurons, respectively. We implement the DRL algorithm in Python with TensorFlow and set the learning rate for the Adam optimizer as 0.01, the training batch size as 128, the memory size as 1024, and the training interval as 10.
V-A Convergence Performance
Without loss of generality, we first consider the tree task graph in Fig. 6(b) as an example to study the impact of the parameters on the convergence performance of the proposed DRL algorithm, including learning rates, batch sizes, memory sizes, and training intervals, in Fig. 7. Fig. 7(a) illustrates the impact of the learning rate in the Adam optimizer on the moving average of the training loss over moving windows of 15 epochs. It is observed that a learning rate that is too large (e.g., 0.1) or too small (e.g., 0.001) leads to worse convergence. Therefore, in the following simulations, we set the learning rate as 0.01. As for different batch sizes in Fig. 7(b), we observe that a large batch size (e.g., 1024) causes higher fluctuation in the moving average of the training loss, which is due to the frequent usage of the “old” training data in the memory. Besides, a large batch size consumes more time when training the DNN. Hence, the training batch size is set to 128 in the following simulations. In Fig. 7(c), the moving average of the training loss gradually decreases and stabilizes at around 0.01 for different memory sizes, which shows that the convergence performance is insensitive to the memory size. In Fig. 7(d), we investigate the convergence of our proposed DRL algorithm under different training intervals. It is observed that for different training intervals, the moving average of the training loss gradually decreases and becomes stable at around 0.02 after 400 training steps, which means that the convergence performance is insensitive to the training interval. In the following simulations, we set the training interval as 10.
Accordingly, Fig. 8 illustrates the convergence performance of the DRL algorithm for the three task graphs, where we set the learning rate as 0.01, the training batch size as 128, the memory size as 1024, and the training interval as 10. We observe that under different task graphs, the moving average of the training loss is below 0.1 after 300 training steps.
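The interplay of memory size, batch size, and training interval can be sketched as follows. This is only an illustration of the schedule implied by the three parameter values. The class and function names are hypothetical, the stored entries are placeholders, and the actual DNN update step is omitted.

```python
import random
from collections import deque

MEMORY_SIZE = 1024       # values stated in the text
BATCH_SIZE = 128
TRAINING_INTERVAL = 10


class ReplayMemory:
    """Fixed-size memory; the oldest entries are overwritten once it is full."""

    def __init__(self, capacity=MEMORY_SIZE, seed=0):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action):
        self.buffer.append((state, action))

    def sample(self, batch_size=BATCH_SIZE):
        # Sample without replacement; use everything if the memory is still small.
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)


def run_schedule(num_epochs, memory=None):
    """Return (epoch, batch length) for every epoch in which training would run."""
    if memory is None:
        memory = ReplayMemory()
    events = []
    for epoch in range(1, num_epochs + 1):
        memory.add(("state", epoch), ("action", epoch))  # placeholder entries
        if epoch % TRAINING_INTERVAL == 0:
            batch = memory.sample()
            events.append((epoch, len(batch)))  # a DNN update would use `batch` here
    return events
```

In this sketch a training step is triggered once every 10 epochs, and each step draws at most 128 entries from a memory that never holds more than 1024.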
In Fig. 9, we plot the moving average of the accuracy rates over training steps for the three task graphs, where the proposed DRL algorithm is tested in each training step using 50 independent realizations. The accuracy rate is defined from two quantities: the average optimal ETC obtained by the exhaustive search method over the 50 independent realizations, and the ratio of bias of the ETC achieved by the DRL algorithm compared to that optimum. We see that the moving average of the accuracy rates for the proposed DRL algorithm gradually converges as the training step increases. Specifically, for the mesh task graph, the achieved accuracy rate exceeds 0.99 after 800 training steps.
V-B Energy and Time Cost (ETC) Performance Evaluation
We now compare the energy and time cost (ETC) performance of the proposed methods with that of the following four representative benchmarks.
Gibbs sampling algorithm. The Gibbs sampling algorithm updates the offloading decision iteratively according to a designed probability distribution that depends on the objective values and a temperature parameter. It has been proved that a Gibbs sampling algorithm obtains the optimal solution when it converges.
Exhaustive search. We enumerate all feasible offloading decisions and choose the optimal one that yields the minimum ETC.
All edge computing. In this scheme, all the tasks of the MD are offloaded to the edge side for execution.
All local computing. In this scheme, all the tasks of the MD are executed locally.
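The four benchmarks can be illustrated on a toy problem. The sketch below is not the ETC model of this paper: `toy_etc` is a hypothetical stand-in objective for a six-task chain with made-up costs, and the Gibbs update rule (resampling one task's decision with a temperature-scaled probability and a geometric cooling schedule) is a generic textbook form assumed for illustration.

```python
import itertools
import math
import random

# Hypothetical per-task costs for a 6-task chain (illustrative numbers only).
LOCAL_COST = [1.0, 4.0, 3.5, 5.0, 2.0, 1.0]
EDGE_COST = [2.5, 1.0, 1.2, 1.5, 1.8, 2.5]
TRANSFER_COST = 1.0  # paid whenever consecutive tasks run on different sides


def toy_etc(decision):
    """Toy stand-in for the ETC objective; decision[i] = 1 offloads task i."""
    cost = sum(EDGE_COST[i] if a else LOCAL_COST[i] for i, a in enumerate(decision))
    switches = sum(1 for a, b in zip(decision, decision[1:]) if a != b)
    return cost + TRANSFER_COST * switches


def exhaustive_search(n):
    """Enumerate all 2^n binary decisions and return the cheapest one."""
    return min(itertools.product((0, 1), repeat=n), key=toy_etc)


def gibbs_sampling(n, iterations=2000, temperature=1.0, cooling=0.995, seed=0):
    """Generic Gibbs sampler over binary decisions; returns the best one visited."""
    rng = random.Random(seed)
    current = [rng.randint(0, 1) for _ in range(n)]
    best = tuple(current)
    for _ in range(iterations):
        i = rng.randrange(n)
        costs = []
        for bit in (0, 1):
            current[i] = bit
            costs.append(toy_etc(current))
        # Probability of offloading task i, favoring the lower-cost choice.
        delta = (costs[1] - costs[0]) / temperature
        delta = max(-50.0, min(50.0, delta))  # keep exp() within range
        p_offload = 1.0 / (1.0 + math.exp(delta))
        current[i] = 1 if rng.random() < p_offload else 0
        if toy_etc(current) < toy_etc(best):
            best = tuple(current)
        temperature = max(temperature * cooling, 1e-3)
    return best


all_edge = (1,) * 6   # all edge computing
all_local = (0,) * 6  # all local computing
```

Comparing `toy_etc(exhaustive_search(6))`, `toy_etc(gibbs_sampling(6))`, `toy_etc(all_edge)`, and `toy_etc(all_local)` mirrors the comparison made in this subsection, with exhaustive search serving as the reference optimum.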
In Fig. 10, we compare the ETC performance of the different offloading schemes under the three task topologies in Fig. 6. Each point in the figure is the average performance over 50 independent realizations. When evaluating the performance, we discard the first 20000 time epochs as a warm-up period so that the DRL algorithm has converged. We observe that for all three task graphs, the proposed DRL algorithm achieves near-optimal performance compared with the exhaustive search and Gibbs sampling algorithms. In addition, applying the one-climb policy heuristic in the GNOP quantization method hardly affects the ETC performance. Moreover, the DRL algorithm significantly outperforms the all-edge-computing and all-local-computing schemes. This demonstrates the benefit of adapting the offloading decisions to the wireless channel conditions and the edge CPU frequency.
Then, Table II illustrates the average accuracy rates of our proposed DRL algorithm. It is observed that on average the DRL algorithm achieves over of the optimal ETC. Specifically, for the general task graph shown in Fig. 6(c), accuracy rate with respect to the ETC objective is achieved.
V-C Complexity of the Proposed DRL Algorithm
Finally, we compare the computational complexity of the four algorithms for a given number of quantized offloading decisions per epoch in the DRL algorithm. We see from Table III that the DRL algorithm with one-climb policy based GNOP quantization significantly reduces the computation time compared with the DRL algorithm with the original GNOP method, achieving a lower average runtime in each of the mesh, tree, and general task graphs. Therefore, the one-climb policy heuristic achieves nearly the same performance as the original GNOP method while reducing the complexity of the proposed DRL algorithm. Specifically, in Fig. 11, we illustrate the computation time for each epoch of the DRL algorithm with one-climb policy based GNOP under the tree task graph. For some epochs, the DRL algorithm with one-climb policy based GNOP takes only around 0.3 seconds to output an offloading decision.
| Algorithm | Mesh task graph | Tree task graph | General task graph |
|---|---|---|---|
| DRL with one-climb policy based GNOP | 0.9240 s | 1.3421 s | 1.0464 s |
| DRL with GNOP | 1.4702 s | 1.4107 s | 1.5821 s |
| Gibbs sampling | 8.2039 s | 8.3046 s | 8.6101 s |
| Exhaustive search | 25.6690 s | 26.8181 s | 27.5185 s |
Furthermore, as shown in Table III, the DRL algorithm with one-climb policy based GNOP requires a much shorter runtime than the Gibbs sampling algorithm and the exhaustive search method. In particular, for the general task graph, it outputs an offloading decision in around 1 second per realization on average, whereas the runtimes of the Gibbs sampling and exhaustive search methods are around 8 times and 26 times longer, respectively.
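The runtime comparison can be reproduced directly from the entries of Table III. The short sketch below computes the relative runtime reduction of the one-climb variant over the original GNOP method, and the runtime ratios of the two benchmarks for the general task graph. It assumes that the three numeric columns of Table III correspond to the mesh, tree, and general task graphs in that order; this ordering is inferred from the surrounding text and is not stated in the table itself.

```python
# Average runtimes in seconds, copied from Table III.
# Column order (mesh, tree, general) is an assumption inferred from the text.
RUNTIME_S = {
    "one_climb_gnop": {"mesh": 0.9240, "tree": 1.3421, "general": 1.0464},
    "gnop": {"mesh": 1.4702, "tree": 1.4107, "general": 1.5821},
    "gibbs": {"mesh": 8.2039, "tree": 8.3046, "general": 8.6101},
    "exhaustive": {"mesh": 25.6690, "tree": 26.8181, "general": 27.5185},
}


def relative_reduction(baseline, improved):
    """Fraction of the baseline runtime that is saved."""
    return (baseline - improved) / baseline


def runtime_ratio(other, reference):
    """How many times longer `other` takes than `reference`."""
    return other / reference


reductions = {
    graph: relative_reduction(RUNTIME_S["gnop"][graph], RUNTIME_S["one_climb_gnop"][graph])
    for graph in ("mesh", "tree", "general")
}
gibbs_ratio = runtime_ratio(RUNTIME_S["gibbs"]["general"],
                            RUNTIME_S["one_climb_gnop"]["general"])
exhaustive_ratio = runtime_ratio(RUNTIME_S["exhaustive"]["general"],
                                 RUNTIME_S["one_climb_gnop"]["general"])

for graph, value in reductions.items():
    print(f"{graph}: {100 * value:.1f}% lower runtime with one-climb policy based GNOP")
print(f"general graph: Gibbs {gibbs_ratio:.1f}x, exhaustive search {exhaustive_ratio:.1f}x")
```

Under the assumed column order, the two ratios for the general task graph come out close to the factors of 8 and 26 quoted in the text, which is the main reason for adopting that ordering here.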
Considering a single-user MEC system with a general task graph, this paper has proposed a DRL framework to jointly optimize the offloading decisions and resource allocation, with the goal of minimizing the weighted sum of the MD's energy consumption and task execution time. The DRL framework uses a DNN to learn and improve the offloading policy from past experience, which completely removes the need to solve hard combinatorial optimization problems. In addition, we have devised a Gaussian noise-added order-preserving quantization method to efficiently generate offloading actions in the DRL framework. Meanwhile, a low-complexity algorithm has been proposed to accurately evaluate the ETC performance of each generated offloading decision. We have further proposed a one-climb policy to speed up the learning process. Simulation results have demonstrated that the proposed algorithm achieves near-optimal performance while significantly reducing the complexity compared to conventional optimization methods.
Appendix A Proof of Lemma 3.1
For the term in (A), we have
For the term in (A), we have
where is defined in (16).
Appendix B Proof of Proposition 3.1
The derivative of of (18) with respect to can be expressed as
where is a monotonically increasing function with . Thus, if , we have . Otherwise, we have