I Introduction
Recent years have witnessed explosive growth of the Internet of Things (IoT) as a way to connect tens of billions of resource-limited wireless devices, such as sensors, mobile devices (MDs), and wearable devices, to the Internet through cellular networks. Due to their small physical sizes and stringent production cost constraints, IoT devices often suffer from limited computation capabilities and battery lives. Perceived as a promising solution, mobile edge computing (MEC) [2, 3] has attracted significant attention. With MEC, computationally intensive tasks can be offloaded to nearby servers located at the edge of wireless networks, which effectively avoids the long backhaul latency and high overhead of traditional mobile cloud computing.
Typically, there are two computation task offloading models for MEC [2]: one is referred to as binary offloading, and the other as partial offloading. Under the binary offloading model, each task is either executed locally or offloaded to the MEC server as a whole [4, 5, 6, 7, 8, 9]. Under partial offloading, a task can be arbitrarily divided into two parts that are executed by the device and the edge server, respectively [10, 11]. In practice, however, a mobile application usually has multiple components, and the dependency among them cannot be ignored, since the outputs of some components are the inputs of others. In this regard, the task call graph [12] has been proposed to model the sophisticated interdependency among the components of a mobile application. In this paper, we consider computation offloading with a general task call graph.
Due to the random variation of wireless channels, it is not always advantageous to offload all tasks for edge execution. Instead, offloading computation tasks opportunistically according to the time-varying channel condition has shown significant performance advantages [4, 5, 6, 7, 8, 9, 10, 11]. Due to the mutual coupling constraints in a task call graph, however, offloading policy design becomes much more challenging [13, 14, 15, 16, 17, 18]. Specifically, [13] considered a sequential task graph and derived an optimal one-climb policy, where the execution migrates at most once between the MD and the cloud server. This work was extended to general task graphs in [14], where the authors applied partial critical path analysis for general task graph scheduling. In [15], the offloading problem in a general task graph was formulated as a linear programming problem through convex relaxation.
[16] modeled the task scheduling problem in a general task graph as an energy consumption minimization problem and solved it with a genetic algorithm. Note that general task graphs are considered much harder to deal with than task graphs with special structures (e.g., the sequential task graph), since it is difficult to derive structural offloading properties (e.g., the one-climb policy of the sequential task graph) under the general and complicated coupling among tasks.
On the other hand, recent work has considered the joint optimization of radio/computing resource allocation and computation offloading. In particular, [17] studied an energy-efficiency cost minimization problem by incorporating CPU frequency control and transmit power allocation into the MEC offloading decision. [18] considered inter-user task dependency and proposed a reduced-complexity Gibbs sampling algorithm to obtain the optimal offloading decisions.
The existing work on task offloading with general task graphs adopts either convex relaxation methods (e.g., [17, 15]) or heuristic local search methods (e.g., [13, 14, 16, 18]). However, both approaches are likely to get stuck in a locally optimal solution that does not guarantee good performance. Moreover, the optimization problem must be re-solved whenever the wireless channel conditions change or the available computing power of the edge server varies with the demands of background applications. The frequent recalculation of offloading decisions renders the existing methods impractical. In this paper, we endeavor to design an efficient optimal computation offloading algorithm for an MEC system with a general task graph, so that the optimal decision swiftly adapts to the time-varying wireless channels and available edge computing power with very low computational complexity. In particular, we propose a deep reinforcement learning (DRL) framework. The key idea of DRL is to utilize deep neural networks (DNNs) to learn the optimal mapping between the state space and the action space. There exist several works on DRL-based offloading methods for MEC systems
[19, 20, 21]. In [19], a deep Q-network (DQN) based offloading policy was proposed to optimize the computational performance of an MEC system with energy harvesting. For randomly arriving tasks, [20] proposed a DQN approach to learn the optimal offloading decisions without a priori knowledge of the network dynamics. To tackle the curse of dimensionality in DQN-based methods, [21] proposed a novel DRL framework that achieves near-optimal offloading actions by considering only a small subset of candidate offloading actions in each iteration. Notice that [19, 20, 21] all assume independent tasks among multiple users. Very recently, considering general task dependency, [22] proposed a recurrent neural network (RNN) based reinforcement learning method for the computation offloading problem. However, it neglects system dynamics such as wireless fading channels and the time-varying edge server CPU frequency.
We consider an MEC system with a single access point (AP) and an MD as shown in Fig. 1. The MD has an application with a general task topology to execute under time-varying wireless fading channels and edge server CPU frequency. In particular, we propose a DRL framework to minimize the weighted sum of the task execution time and the energy consumption of the MD. The main contributions are summarized as follows:

We formulate a mixed integer optimization problem to jointly optimize the offloading decisions and the local CPU frequencies of the MD to minimize the computation delay and energy consumption. The problem is challenging because of the combinatorial nature of the offloading decisions and the strong coupling among task executions under a general dependency model.

In order to solve the combinatorial optimization problem efficiently, we propose a DRL framework based on the actor-critic learning structure, where we periodically train a DNN in the actor network from past experiences to learn the optimal mapping between the states (i.e., wireless channels and edge CPU frequency) and the actions (i.e., offloading decisions). Within the actor network, we devise a novel Gaussian noise-added order-preserving action generation method to balance diversity and complexity when generating candidate binary offloading actions in a high-dimensional action space.

For the critic network, we simplify the problem based on the loop-free paths of the general task graph and derive a closed-form solution for the optimal local CPU frequencies, based on which we propose an efficient algorithm. As such, unlike traditional actor-critic networks that utilize a DNN in the critic network to predict the values of actions, our analysis allows fast and accurate evaluation of the performance of each action generated by the actor network. In this way, the complexity and convergence of the actor-critic based DRL are greatly improved.

To further speed up the computation of the proposed DRL framework, we propose a heuristic that restricts the offloading decisions to those that follow the one-climb policy. The heuristic greatly reduces the number of performance evaluations in the critic network. The optimality of the one-climb policy is analyzed, and its performance advantage over the conventional action generation method is verified through simulations.
Numerical results show that for various types of general task graphs, the proposed DRL-based algorithm achieves near-optimal energy and time cost. Meanwhile, our proposed method takes only around 1 second to generate an offloading action, which is more than one order of magnitude faster than the other representative benchmark methods. In this paper, we formulate the joint optimization of offloading and resource allocation with a general task graph in MEC as a mixed integer nonlinear programming (MINLP) problem, which is hard to solve with conventional optimization algorithms under time-varying wireless channels and stochastic edge computing capability. By exploring the special structure of the considered MINLP problem, we observe that for any given integer variables (offloading decisions), the remaining problem is convex. Therefore, the main difficulty lies in finding the optimal integer offloading decisions. With this property, we propose an actor-critic DRL algorithm, where the actor network generates a set of integer offloading actions according to the time-varying parameters and the critic network scores each action output by the actor network via convex optimization. Then, we utilize the generated action-score pairs to make the current offloading decision and to improve the performance of the actor network. It is worth mentioning that the key role of the critic is to evaluate action quality, regardless of whether a general neural network or a specialized algorithm is used [23]. In this paper, as one of the major contributions, we propose an efficient low-complexity algorithm in the critic network to evaluate the actions generated by the actor network, which greatly reduces the training cost of the critic and increases the accuracy of action evaluation.
The rest of the paper is organized as follows. In Section II, we present the system model and problem formulation. The optimal local CPU frequencies under fixed offloading decisions are derived in Section III. We introduce the detailed design of the DRL framework in Section IV. Simulation results are presented in Section V. Finally, we conclude the paper in Section VI.
II System Model and Problem Formulation
As shown in Fig. 1, we consider an MEC system with one AP and one MD. The AP is the gateway of the edge cloud and has a stable power supply. The MD has a computationally intensive mobile application consisting of dependent tasks. The input-output dependency of the tasks is represented by a directed acyclic task graph . As shown in Fig. 2, each vertex in represents a task, and the associated parameter indicates the computing workload in terms of the total number of CPU cycles required for accomplishing the task. Besides, each edge in indicates that a precedent task must be completed before task can start executing. Additionally, we denote the size of data in bits transferred from task to by . For ease of exposition, we introduce two virtual tasks and as the entry and exit tasks, respectively. Specifically, we have . By forcing the two virtual tasks to be executed locally, we ensure that the application is initiated and terminated at the MD side. We denote the set of tasks in the task graph as .
Define an indicator variable such that means that task is executed locally and means that the MD offloads the computation of task to the edge side. Recall that the two virtual tasks and must be executed locally. That is, .
In addition, we assume that the MD is allocated a dedicated spectral resource block throughout its transmission, which can support concurrent transmissions for task offloading and downloading. We denote by and the channel gains when offloading and downloading the task data , respectively.
Besides, we assume additive white Gaussian noise (AWGN) with zero mean and equal variance
at the receiver for all the tasks. To characterize the task execution time and energy consumption for local and edge computing, respectively, we first define the finish time and ready time of each task.
Definition 1 (Finish Time). The finish time of task is the moment when all of its workload has been executed. We denote and as the finish times of task when it is executed locally and at the edge server, respectively.
Definition 2 (Ready Time). The ready time of a task is the earliest time when the task has received all the necessary input data to commence its computation. For instance, in Fig. 2, the ready time of the fifth task is the time when both input data streams from the first and second tasks have arrived. We denote the ready times of task when computing locally and at the edge server as and , respectively.
II-A Local Computing
We assume that the MD is equipped with a multi-core CPU, where each core can execute only one task at a time, so that the MD can execute multiple tasks simultaneously. Suppose that task is computed locally. We denote the local CPU frequency for computing the task as , which is upper bounded by . Thus, the local execution time of task is given by
(1) 
and the corresponding energy consumption is [2]
(2) 
where is the effective switched capacitance depending on the chip architecture. According to circuit theory [24], the power consumption of the CPU is approximately proportional to the product of , where is the circuit supply voltage. Besides, is approximately linearly proportional to the CPU frequency when the CPU works at the low voltage limit [25]. Therefore, the energy consumption per CPU cycle is given by . It is worth mentioning that for the two virtual tasks and , we have and .
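The local execution time and energy model above can be sketched numerically as follows. This is a minimal illustration of Eqs. (1)-(2): time is workload over frequency, and energy is the per-cycle cost (switched capacitance times frequency squared) times the workload. The value of the capacitance constant below is an assumed placeholder, not a number from the paper.

```python
# Sketch of the local-computing model: execution time w_i / f_i (Eq. (1))
# and energy kappa * f_i^2 * w_i (Eq. (2)). KAPPA is an illustrative value.
KAPPA = 1e-26  # effective switched capacitance (assumed, chip-dependent)

def local_cost(workload_cycles, cpu_freq_hz):
    """Return (execution time [s], energy [J]) of one locally executed task."""
    exec_time = workload_cycles / cpu_freq_hz            # Eq. (1)
    energy = KAPPA * cpu_freq_hz ** 2 * workload_cycles  # kappa*f^2 per cycle
    return exec_time, energy
```

For example, a 1 Gcycle task at 1 GHz runs in 1 s, and raising the frequency trades that time against quadratically growing energy per cycle, which is exactly the tension the DVFS control in Section III exploits.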
If a task preceding task is executed at the edge server, then the output data must be downloaded to the MD before task can be executed locally. Denote the fixed downlink transmit power of the AP by . Then, according to the ShannonHartley theorem, the downlink data rate from the AP to the MD is
(3) 
The corresponding downlink transmission time for sending the data is
(4) 
As such, the ready time of task is given by
(5) 
where pred(i) denotes the set of immediate predecessors of task . Specifically, if for a task , the time until its output data is available at the MD for the execution of task is equal to its finish time at the edge side plus the downlink transmission time . Otherwise, if , the time until its output data is available at the MD is equal to its local finish time . When all needed data is available at the ready time , the MD locally computes task with the local execution time in (1), so that the finish time of task becomes
(6) 
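The ready-time and finish-time recursions above, together with their edge-side counterparts (11)-(12) in the next subsection, can be evaluated in one topological pass over the task graph. The sketch below assumes the per-task execution times and per-edge uplink/downlink transfer times have been precomputed; all identifiers are illustrative, not the paper's notation.

```python
# Sketch of the ready/finish-time recursions (Eqs. (5)-(6) and (11)-(12)),
# assuming per-task execution times and per-edge transfer times are given.
def finish_times(topo_order, pred, x, t_loc, t_edge, t_up, t_down):
    """topo_order: task ids in topological order; pred[i]: predecessors of i;
    x[i] = 1 if task i is offloaded to the edge, 0 if executed locally."""
    FT_loc, FT_edge = {}, {}
    for i in topo_order:
        # ready time: latest moment all input data is available at each side
        r_loc = max(((FT_edge[j] + t_down[(j, i)]) if x[j] else FT_loc[j])
                    for j in pred[i]) if pred[i] else 0.0
        r_edge = max((FT_edge[j] if x[j] else (FT_loc[j] + t_up[(j, i)]))
                     for j in pred[i]) if pred[i] else 0.0
        FT_loc[i] = r_loc + t_loc[i]     # Eq. (6)
        FT_edge[i] = r_edge + t_edge[i]  # Eq. (12)
    return FT_loc, FT_edge
```

Only the finish time on the side selected by x[i] is ultimately meaningful for task i; computing both sides keeps the pass simple.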
II-B Edge Computing
We denote the fixed transmit power of the MD by . Then, the uplink data rate for offloading the data to the AP is
(7) 
and the corresponding uplink transmission time is
(8) 
The transmission energy consumption is
(9) 
We assume that the edge server has cores and can compute tasks in parallel. The execution time of task on the AP is given by
(10) 
where is the fixed service rate of each CPU core. Similarly, we can calculate the ready time of task executed at the edge server as
(11) 
and its finish time is
(12) 
II-C Problem Formulation
We assume that both the MD and MEC server have a lot more CPU cores than needed to execute the possibly concurrent tasks in the considered mobile application. As such, we can safely set . Besides, it is assumed that the number of available channels is sufficiently large to execute the possibly concurrent data transmissions in the task graph.
From the above discussion, the total time to complete all the tasks is equal to the local finish time of the auxiliary exit task , i.e., . Besides, we can calculate the total energy consumption of the MD as
(13) 
which consists of energy consumed on local computation and task offloading.
In this paper, we consider the energy-time cost (ETC) as the performance metric, defined as the weighted sum of the total energy consumption and the execution time, i.e.,
(14) 
where and denote the weights of the energy consumption and the computation completion time of the MD, respectively. It is assumed that the weights satisfy . We adopt the weighted-sum approach [9, 17, 18] for the general multi-objective optimization problem. According to Proposition 3.9 of [26], for any given positive weights, solving Problem (P1) yields an efficient solution of the multi-objective optimization problem; a weakly efficient solution is obtained if any of the weights is zero. Besides, to meet user-specific demands, we allow the MD to choose different weights. For instance, an MD with low battery energy prefers a larger weight on energy for energy saving, while a delay-sensitive MD chooses a larger weight on time to reduce the execution time.
Evidently, a higher CPU frequency leads to a shorter task execution time. Meanwhile, according to (2), the energy consumption per CPU cycle is a quadratic function of the CPU frequency, and thus the energy consumption for executing a task increases with the CPU frequency. Because the AP has a stable power supply, it can operate at a fixed maximum frequency to minimize the execution delay. However, since the MD is often energy-constrained, we apply the dynamic voltage and frequency scaling (DVFS) technique to tune the local CPU frequency to balance energy consumption and execution time. Denoting and , , we aim to minimize the ETC of the MD subject to the peak CPU frequency constraint of the MD, i.e.,
(15)  
where we assume in this paper. In general, the problem is non-convex due to the binary variables and the recursive structure of . In the following section, we first simplify the problem by exploiting the property of the total task completion time . Then, we propose an efficient method to obtain the optimal CPU frequencies for a given offloading decision.
III Optimal Resource Allocation Under Fixed Offloading Decisions
III-A Problem (P1) Simplification
We define a path as an ordered sequence of task indices that traverses the general task graph from the entry task to the exit task , where is the total number of real tasks in path . For instance, is a path with three real tasks in Fig. 2. Besides, we denote the set of all loop-free paths by , which can be obtained by running a shortest-path routing algorithm on , and denote by the total number of paths. Let denote the total execution time along the th path, excluding the waiting time for data inputs from other paths. Then, we have
(16) 
which consists of the total computation and communication delay in path .
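The loop-free path set used in Eq. (16) can also be enumerated with a plain depth-first traversal of the DAG, which yields the same set as the routing-based construction mentioned above. The sketch below assumes the graph is given as a successor map; the names are illustrative.

```python
# Enumerate all loop-free entry-to-exit paths of a DAG by iterative DFS.
def all_paths(succ, entry, exit_):
    """succ[i]: list of immediate successors of task i in the task graph."""
    paths, stack = [], [(entry, [entry])]
    while stack:
        node, path = stack.pop()
        if node == exit_:
            paths.append(path)          # one complete entry-to-exit path
        else:
            for nxt in succ.get(node, []):
                stack.append((nxt, path + [nxt]))
    return paths
```

Because the graph is acyclic, no visited-set is needed: every path found is automatically loop-free.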
To simplify Problem (P1), we first have the following lemma on .
Lemma 3.1: holds given any .
Proof.
Please refer to Appendix A. ∎
Lemma 3.1 indicates that the final completion time equals the largest total execution time among all paths in . Note that although excludes the time spent waiting for task input data from other paths, the largest over all paths is still the final completion time.
Due to the one-to-one mapping between and in (1), it is equivalent to optimize (P1) over the time allocation . By introducing an auxiliary variable , (P1) can be equivalently expressed as
(17)  
Notice that (P2) is non-convex in general due to the binary variables . However, for any given offloading decision, the remaining optimization over is convex. In the following, we assume a fixed offloading decision and derive an efficient algorithm to obtain the optimal , or equivalently the optimal local CPU frequencies .
III-B Optimal Local CPU Frequencies
Suppose that is given. We express a partial Lagrangian of Problem (P2) as
(18) 
where denotes the dual variables associated with the corresponding constraints. Let denote the optimal dual variables. Then, we derive the closedform expressions for the optimal local CPU frequencies as follows.
Proposition 3.1: with , by denoting the index set of the paths that contain task as , the optimal CPU frequencies at the MD satisfy
(19) 
Proof.
Please refer to Appendix B. ∎
From Proposition 3.1, we observe that the optimal is determined by the dual variables corresponding to all the paths containing task . Besides, increasing leads to a lower optimal for energy saving.
Corollary 3.1: The summation of the optimal dual variables over all paths is equal to the constant . That is,
(20) 
Then, if , according to Proposition 3.1, the optimal local CPU frequency for task is
(21) 
which is a constant regardless of the values of .
Proof.
Please refer to Appendix C. ∎
The above corollary indicates that the optimal is a constant when the th task is included in all the paths, i.e., .
Based on Proposition 3.1 and Corollary 3.1, we can apply the projected subgradient method [27] to search for the optimal dual variables . Specifically, we initialize satisfying (20). In the th iteration, we first calculate using (16) and (19) and set . Then, the dual variables are updated to by using subgradients , i.e.,
(22) 
where is a small learning rate. To guarantee the feasibility of the dual variables, we project onto the feasible region given in (20). The projection is obtained from the following convex problem,
(23)  
which can be efficiently solved by general convex optimization techniques, e.g., the interior-point method [27]. After updating the dual variables, we obtain the updated optimal local CPU frequencies. The iteration proceeds until a stopping criterion is met. The pseudocode of the method is given in Algorithm 1.
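One iteration of this projected subgradient scheme can be sketched as follows. Instead of calling a generic solver for the projection problem (23), the sketch uses the classical closed-form Euclidean projection onto a simplex, which is equivalent for constraints of the form in (20). The total dual mass (the constant in (20)) is set to 1.0 purely for illustration, and all names are our own.

```python
# One projected-subgradient step in the spirit of Algorithm 1: move the
# dual variables along the subgradients (T_l - tau), then project back
# onto the simplex of Eq. (20) in closed form.
import numpy as np

def project_to_simplex(v, s=1.0):
    """Euclidean projection of v onto {x >= 0, sum(x) = s}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - s
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def dual_step(lam, path_times, tau, lr=0.01, total=1.0):
    """lam: current duals; path_times: T_l per path; tau: auxiliary variable."""
    lam_new = lam + lr * (np.asarray(path_times) - tau)  # subgradient move
    return project_to_simplex(lam_new, total)            # enforce Eq. (20)
```

Each step increases the dual weight of paths whose execution time exceeds the current completion-time estimate, which in turn raises the optimal CPU frequencies of the tasks on those paths via (19).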
IV Deep Reinforcement Learning Based Task Offloading
In the last section, we showed how to efficiently obtain the optimal resource allocation for a given offloading decision. Intuitively, we could enumerate all feasible offloading decisions and choose the one that achieves the minimum objective of (P2). However, such a brute-force search is computationally prohibitive, especially when the problem must be frequently re-solved under time-varying channel gains and available server computing power. Other search-based methods, such as branch-and-bound and Gibbs sampling, are also time-consuming for large task graphs.
In this section, we propose a DRL-based algorithm to solve the joint optimization under time-varying channel gains and edge server CPU frequency. Our goal is to derive an offloading decision policy that can quickly predict an optimal offloading action for (P2) once the channel gains and the edge CPU frequency are revealed at the beginning of the execution of the application (task graph). The offloading decision policy is denoted as
(24) 
The algorithm structure is illustrated in Fig. 3. The DRL-based offloading algorithm has two stages: actor-critic network based offloading action generation and offloading policy update, which are detailed as follows. Furthermore, we propose a one-climb heuristic to speed up the learning process.
IV-A Actor-Critic Network Based Offloading Action Generation
IV-A1 Actor Network
The offloading action is generated by a DNN. We denote the parameters of the DNN at the th epoch as , which are randomly initialized following a zero-mean normal distribution. At the th epoch, we take the channel gains and the edge CPU frequency as the input of the DNN. Accordingly, the DNN outputs a relaxed offloading action , denoted by a mapping , i.e.,
(25)
where , and denotes the th entry of .
Notice that each entry of is a continuous value between 0 and 1. To generate feasible binary offloading decisions, we first quantize into candidate binary offloading actions. Then, the critic network evaluates the performance of the candidate actions, and the one with the lowest ETC is selected as the output solution. Notably, a good quantization method needs to generate only a few candidate actions to keep the computational complexity low, while the quantized actions should contain sufficient diversity around the relaxed action to yield a low ETC. In this paper, we propose a Gaussian noise-added order-preserving (GNOP) quantization method, as shown in Fig. 4. We define the quantization function as
(26) 
where is the generated candidate action set in the th epoch.
The order-preserving quantization method was originally introduced in [21] to explore the output of the DNN. The key idea is to preserve the ordering of all entries of a vector before and after quantization. In our proposed GNOP method, the first half of the candidate actions are generated by the traditional order-preserving method, where we assume without loss of generality that the number of candidates is even. Specifically, suppose that the relaxed output offloading action is . The generation rule of the order-preserving method is as follows. First, we obtain the offloading decision as
(27) 
for . For the other offloading actions, we first order the entries of according to their distances to 0.5, i.e., , where denotes the th ordered entry of . Then, the th offloading action is obtained as
(28) 
for and .
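The quantization rule above, together with the Gaussian noise-added half described in the remainder of this subsection, can be sketched as follows. This is a simplified reading of Eqs. (27)-(29): the first action thresholds the relaxed output at 0.5, subsequent actions threshold at the entries closest to 0.5, and the noisy half reapplies the same routine to a sigmoid-mapped, noise-perturbed copy. The noise scale and seed are illustrative assumptions.

```python
# Sketch of the GNOP quantizer: order-preserving actions (Eqs. (27)-(28))
# plus a Gaussian noise-added half (Eq. (29)).
import numpy as np

def order_preserving(x_hat, m):
    """Quantize a relaxed vector x_hat in (0,1)^N into m binary actions."""
    x = np.asarray(x_hat, float)
    acts = [(x > 0.5).astype(int)]                # Eq. (27): threshold at 0.5
    order = np.argsort(np.abs(x - 0.5))           # entries closest to 0.5 first
    for k in range(1, m):
        th = x[order[k - 1]]                      # Eq. (28): k-th alternative threshold
        acts.append((x >= th).astype(int) if th <= 0.5 else (x > th).astype(int))
    return acts

def gnop(x_hat, m, sigma=0.5, seed=0):
    """m//2 order-preserving actions plus m//2 from a noise-perturbed copy."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x_hat, float)
    noisy = 1.0 / (1.0 + np.exp(-(x + rng.normal(0.0, sigma, x.size))))  # Eq. (29)
    return order_preserving(x, m // 2) + order_preserving(noisy, m - m // 2)
```

The noisy half scatters the candidate actions away from the immediate neighborhood of the relaxed output, which is exactly the diversity argument made for GNOP below.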
Compared to the traditional K-nearest-neighbor (KNN) method, the order-preserving quantization method leads to higher diversity in the offloading action space. However, the offloading actions produced by the conventional order-preserving method are still closely placed around the relaxed action, which reduces the chance of finding a better solution in a large action space. To better explore the action space, we introduce a Gaussian noise-added approach to generate the other half of the candidate actions. Specifically, we first add a Gaussian noise to as
(29)
where , and is the sigmoid function that maps the noise-added action back to . Then, we apply the order-preserving method to to generate the remaining offloading actions.
IV-A2 Critic Network
After generating the candidate offloading actions in the actor network, we evaluate the ETC performance of each action in the critic network. Instead of training a critic DNN as the conventional actor-critic method does, we accurately and efficiently evaluate the ETC of each candidate using the analysis in Section III. In particular, we denote the ETC achieved by a candidate, with the local CPU frequencies optimized as described in Algorithm 1, as . This greatly reduces the training cost and increases the accuracy of the ETC evaluation. Accordingly, we choose the best offloading action at the th epoch as
(30) 
Notably, , together with its corresponding optimal resource allocation, constitutes the optimal solution to Problem (P1) (or equivalently, Problem (P2)).
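The critic step of Eq. (30) reduces to scoring every candidate with the resource-allocation routine of Section III and keeping the minimizer; no critic DNN is trained. In the sketch below, that routine is abstracted as a caller-supplied `etc_of` function, which is an assumption of this illustration.

```python
# Critic selection (Eq. (30)): evaluate each candidate offloading action
# with an ETC oracle (Algorithm 1 in the paper) and return the best one.
def best_action(candidates, etc_of):
    """candidates: list of binary actions; etc_of(a): ETC achieved by a."""
    scores = [etc_of(a) for a in candidates]
    k = min(range(len(candidates)), key=scores.__getitem__)
    return candidates[k], scores[k]
```

Because each evaluation is an exact convex optimization rather than a learned estimate, the returned score is the true ETC of the chosen action.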
IV-B Offloading Policy Update
The optimal actions learned in the offloading action generation stage are used to update the parameters of the DNN through the offloading policy update stage.
As illustrated in Fig. 3, we implement a replay memory of limited capacity to store the past state-action pairs. At the th epoch, the pair obtained in the offloading action generation stage is added to the memory as a new training sample. Note that the newly generated sample replaces the oldest one when the memory is full.
The data samples stored in the memory are used to train the DNN. Specifically, in the th epoch, we randomly select a batch of training samples from the memory, where represents the set of chosen time indices. Then, we minimize the average cross-entropy loss via the Adam algorithm to update the parameters of the DNN, where
(31) 
is the size of , the superscript denotes the transpose operator, and the log function is the element-wise logarithm of a vector. For brevity, the details of the Adam algorithm are omitted here. In practice, we start training once the number of samples exceeds half of the memory size, and train the DNN every epochs so as to collect a sufficient number of new samples in the memory.
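The policy-update stage can be sketched as follows: a bounded replay memory that overwrites its oldest entry when full, and the cross-entropy objective of Eq. (31) between the DNN's relaxed outputs and the stored best binary actions. Plain NumPy stands in for the TensorFlow/Adam machinery used in the paper; batch sizes and shapes are illustrative.

```python
# Replay memory and the cross-entropy loss of Eq. (31), as a NumPy sketch.
import numpy as np

class ReplayMemory:
    def __init__(self, capacity):
        self.capacity, self.data, self.ptr = capacity, [], 0
    def add(self, sample):
        if len(self.data) < self.capacity:
            self.data.append(sample)
        else:
            self.data[self.ptr] = sample          # overwrite the oldest entry
        self.ptr = (self.ptr + 1) % self.capacity
    def sample(self, batch, rng):
        idx = rng.choice(len(self.data), size=min(batch, len(self.data)),
                         replace=False)
        return [self.data[i] for i in idx]

def cross_entropy_loss(x_hat_batch, action_batch, eps=1e-12):
    """Average cross-entropy between relaxed outputs and stored best actions."""
    x = np.clip(np.asarray(x_hat_batch, float), eps, 1 - eps)
    a = np.asarray(action_batch, float)
    per_sample = -(a * np.log(x) + (1 - a) * np.log(1 - x)).sum(axis=1)
    return per_sample.mean()
```

Minimizing this loss pushes the relaxed DNN output toward the binary actions the critic found best, which is how the actor improves over epochs.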
IV-C Low-Complexity Action Generation Method
Within the proposed DRL framework, we improve the GNOP quantization method to further reduce the complexity. The basic idea is to restrict the action selection to offloading decisions that satisfy the following one-climb policy.
Definition 3 (One-Climb Policy). The execution of the tasks in each path of the graph migrates at most once from the MD to the edge server.
Fig. 5 illustrates the two-time offloading and one-climb schemes for a path . We show in Appendix D that by converting a two-time offloading scheme into the one-climb policy, the MD saves energy and time costs for the path . This, however, may increase the ETC of other paths that share tasks with path . We show that, under certain mild conditions, the minimum ETC is achieved when all paths satisfy the one-climb policy. Please refer to Appendix D for the detailed analysis.
The one-climb policy is applied to reduce the number of offloading actions to be evaluated by the critic network. Suppose that is the set of actions obtained by the GNOP quantization method at the th epoch. We remove from the actions that violate the one-climb policy. By using the one-climb policy in the quantization module, we efficiently reduce the number of invocations of Algorithm 1 in the actor-critic based offloading action generation stage.
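The filtering step above amounts to a simple check per path: along each path, the offloading pattern may contain at most one contiguous run of edge-executed tasks, i.e., at most one local-to-edge transition. A sketch, with names of our own choosing:

```python
# Check Definition 3: along every path, the execution migrates from the MD
# (0) to the edge (1) at most once, i.e., at most one 0 -> 1 transition.
def satisfies_one_climb(x, paths):
    """x[i]: 1 if task i is offloaded; paths: list of task-index sequences."""
    for p in paths:
        bits = [x[i] for i in p]
        climbs = sum(1 for a, b in zip(bits, bits[1:]) if a == 0 and b == 1)
        if climbs > 1:                 # a second "climb" violates the policy
            return False
    return True
```

Dropping candidates that fail this check costs only a linear scan per path, while every retained candidate still requires a full run of Algorithm 1, which is where the savings come from.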
TABLE I: Simulation Parameters
V Numerical Results
In this section, we evaluate the performance of our proposed algorithm through numerical simulations. Consider the three task graphs in Fig. 6, each consisting of 8 actual tasks. Fig. 6(a) illustrates a mesh task graph composed of a set of linear chains, while a task graph with a tree-based structure is considered in Fig. 6(b). In Fig. 6(c), we consider a general task graph that combines the mesh and the tree. The input and output data sizes (KByte) of each task are shown in Fig. 6. We assume the same computing workload (Mcycles) for all three task graphs. The transmit powers of the MD and the AP are fixed as 100 mW and 1 W, respectively. The edge CPU frequency is time-varying and follows a uniform distribution between 2 GHz and 50 GHz. Besides, the peak computational frequency of the MD is 0.01 GHz.
In the simulations, we assume that the average channel gain follows the free-space path loss model , where denotes the antenna gain, MHz denotes the carrier frequency, in meters denotes the distance between the MD and the AP, and denotes the path loss exponent. The time-varying fading channel follows an i.i.d. Rician distribution, where the LOS link power equals . Besides, following classic uplink-downlink channel models, the random downlink channel is correlated with the uplink channel, and we set the correlation coefficient to 0.7 (the coefficient 0.7 is used in [28] to model weakly correlated uplink and downlink channels; in highly correlated cases, the correlation coefficient exceeds 0.9). The noise power is W. In addition, we set the computing efficiency parameter and the bandwidth MHz. The priority weights of the energy consumption and computation time of the MD are set as . The parameters used in the simulations are listed in Table I. We consider a fully connected DNN consisting of one input layer, three hidden layers, and one output layer in the proposed DRL algorithm, where the first, second, and third hidden layers have 160, 120, and 80 neurons, respectively. We implement the DRL algorithm in Python with TensorFlow and set the learning rate of the Adam optimizer to 0.01, the training batch size to 128, the memory size to 1024, and the training interval to 10.
V-A Convergence Performance
Without loss of generality, we first consider the tree task graph in Fig. 6(b) as an example to study the impact of the parameters on the convergence of the proposed DRL algorithm, including the learning rate, batch size, memory size, and training interval, in Fig. 7. Fig. 7(a) illustrates the impact of the learning rate of the Adam optimizer on the moving average of the training loss over windows of 15 epochs. It is observed that a too large (i.e., 0.1) or too small (i.e., 0.001) learning rate leads to worse convergence. Therefore, in the following simulations, we set the learning rate to 0.01. As for the batch sizes in Fig. 7(b), we observe that a large batch size (i.e., 1024) causes higher fluctuations in the moving average of the training loss due to the frequent reuse of "old" training data in the memory. Besides, a large batch size consumes more time per training step. Hence, the training batch size is set to 128 in the following simulations. In Fig. 7(c), the moving average of the training loss gradually decreases and stabilizes at around 0.01 for all tested memory sizes; the convergence performance is insensitive to the memory size. In Fig. 7(d), we investigate the convergence of the proposed DRL algorithm under different training intervals. For all tested intervals, the moving average of the training loss gradually decreases and stabilizes at around 0.02 after 400 training steps, indicating that the convergence performance is also insensitive to the training interval. In the following simulations, we set the training interval to 10.
Accordingly, Fig. 8 illustrates the convergence performance of the DRL algorithm for the three task graphs, where we set the learning rate as 0.01, the training batch size as 128, the memory size as 1024, and the training interval as 10. We observe that under different task graphs, the moving average of the training loss is below 0.1 after 300 training steps.
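As a concrete illustration, the DNN and training configuration described above (three hidden layers of 160, 120, and 80 neurons, Adam with learning rate 0.01) might be set up in TensorFlow as sketched below. The input/output dimensions and `NUM_TASKS` are hypothetical placeholders, since the exact state and action encodings are specified elsewhere in the paper:

```python
import tensorflow as tf

NUM_TASKS = 8  # hypothetical number of tasks in the call graph

# Fully connected DNN: one input layer, three hidden layers (160/120/80), one output layer.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(2 * NUM_TASKS,)),  # assumed input format (e.g., per-task channel gains)
    tf.keras.layers.Dense(160, activation="relu"),
    tf.keras.layers.Dense(120, activation="relu"),
    tf.keras.layers.Dense(80, activation="relu"),
    tf.keras.layers.Dense(NUM_TASKS, activation="sigmoid"),  # relaxed offloading decision in [0, 1]
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              loss="binary_crossentropy")
```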
In Fig. 9, we plot the moving average of the accuracy rate over the training steps for the three task graphs, where the proposed DRL algorithm is tested in each training step using 50 independent realizations. We define the accuracy rate in terms of the average optimal ETC obtained by the exhaustive search method under the 50 independent realizations and the bias of the ETC achieved by the DRL algorithm relative to this optimum. We see that the moving average of the accuracy rate of the proposed DRL algorithm gradually converges as the training step increases. Specifically, for the mesh task graph, the achieved accuracy rate exceeds 0.99 after 800 training steps.
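Since the exact symbols of the definition are elided in the extracted text, one consistent reading of this accuracy metric can be sketched as follows (the function name and the exact ratio form are our assumptions):

```python
def accuracy_rate(etc_drl, etc_opt):
    """Ratio of the average optimal ETC (exhaustive search) to the average ETC
    achieved by the DRL policy over independent realizations.
    Values close to 1 indicate a near-optimal policy; one plausible reading of
    the (elided) definition in the text."""
    assert len(etc_drl) == len(etc_opt) and len(etc_drl) > 0
    mean_drl = sum(etc_drl) / len(etc_drl)
    mean_opt = sum(etc_opt) / len(etc_opt)
    return mean_opt / mean_drl
```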
V-B Energy and Time Cost (ETC) Performance Evaluation
We now compare the energy and time cost (ETC) performance of the proposed methods with that of the following four representative benchmarks.
- Gibbs sampling algorithm. The Gibbs sampling algorithm updates the offloading decision iteratively according to a probability distribution determined by the objective values and a temperature parameter. According to the proof in [29], the Gibbs sampling algorithm obtains the optimal solution when it converges.
- Exhaustive search. We enumerate all feasible offloading decisions and choose the one that yields the minimum ETC.
- All edge computing. In this scheme, all the tasks of the MD are offloaded to the edge side for execution.
- All local computing. In this scheme, all the tasks of the MD are executed locally.
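The Gibbs sampling benchmark above can be sketched as follows. This is a generic illustration of the update rule (resampling one task's offloading bit at a time from a Boltzmann distribution over the ETC objective); `etc_of`, the temperature, and the iteration count are placeholders, not the exact settings used in the experiments:

```python
import math
import random

def gibbs_sampling(etc_of, num_tasks, temperature=1.0, iters=200, seed=0):
    """Gibbs sampler over binary offloading decisions.
    etc_of: callable mapping a decision tuple to its ETC objective (placeholder).
    Lower ETC -> higher sampling probability, controlled by the temperature."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(num_tasks)]
    for _ in range(iters):
        i = rng.randrange(num_tasks)          # pick one task uniformly at random
        costs = []
        for b in (0, 1):                       # evaluate both choices for task i
            y = list(x)
            y[i] = b
            costs.append(etc_of(tuple(y)))
        # Boltzmann probabilities favouring the lower-cost choice
        w = [math.exp(-c / temperature) for c in costs]
        x[i] = 1 if rng.random() < w[1] / (w[0] + w[1]) else 0
    return tuple(x)
```

With a low temperature, the sampler concentrates on low-ETC decisions, which matches the convergence-to-optimum property cited from [29].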
In Fig. 10, we compare the ETC performance of the different offloading schemes under the three task topologies in Fig. 6. Each point in the figure is the average performance over 50 independent realizations. When evaluating the performance, we neglect the first 20000 time epochs as a warm-up period, so that the DRL algorithm has converged. We observe that, for all three task graphs, the proposed DRL algorithm achieves near-optimal performance compared with the exhaustive search and Gibbs sampling algorithms. In addition, applying the one-climb policy heuristic in the GNOP quantization method hardly affects the ETC performance. Besides, the DRL algorithm significantly outperforms the all-edge-computing and all-local-computing schemes, which demonstrates the benefit of adapting the offloading decisions to different wireless channels and edge CPU frequencies.
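The GNOP quantization referenced above turns the DNN's relaxed output into K binary offloading candidates. Below is a minimal sketch of one possible Gaussian noise-added, order-preserving quantizer of this kind; the exact noise model and threshold rule of the paper may differ, and `gnop_quantize` and its parameters are illustrative:

```python
import random

def gnop_quantize(relaxed, K, noise_std=0.1, seed=0):
    """Sketch of a Gaussian noise-added order-preserving quantizer (assumed form):
    perturb the relaxed decision in [0,1]^n with Gaussian noise, then generate up
    to K binary candidates using order-preserving thresholds."""
    rng = random.Random(seed)
    n = len(relaxed)
    # Add Gaussian noise, clipped back into [0, 1]
    noisy = [min(1.0, max(0.0, v + rng.gauss(0.0, noise_std))) for v in relaxed]
    # First candidate: plain 0.5 thresholding
    candidates = [tuple(1 if v > 0.5 else 0 for v in noisy)]
    # Further candidates: threshold at the entries closest to 0.5 (order-preserving)
    order = sorted(range(n), key=lambda i: abs(noisy[i] - 0.5))
    for i in order[: max(0, K - 1)]:
        t = noisy[i]
        candidates.append(tuple(1 if v >= t else 0 for v in noisy))
    return candidates[:K]
```

Each candidate is then evaluated by the ETC objective, and the best one is kept, which is how the quantizer interacts with the per-decision evaluation step.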
Then, Table II lists the average accuracy rates of the proposed DRL algorithm. It is observed that, on average, the DRL algorithm attains an accuracy rate close to 1 with respect to the optimal ETC; in particular, a high accuracy rate with respect to the ETC objective is achieved for the general task graph shown in Fig. 6(c).
Table II. Average accuracy rates of the proposed DRL algorithm for the mesh, tree, and general task graphs.
V-C Complexity of the Proposed DRL Algorithm
Finally, we compare the computational complexity of the four algorithms, with the number of quantized offloading decisions per epoch in the DRL algorithm fixed as before. We see from Table III that the DRL algorithm with the one-climb policy based GNOP quantization significantly reduces the computation time compared with the DRL algorithm with the original GNOP method, achieving a lower average runtime in the mesh, tree, and general task graphs alike. Therefore, the one-climb policy heuristic achieves nearly the same performance as the original GNOP method while efficiently reducing the complexity of the proposed DRL algorithm. Specifically, Fig. 11 illustrates the computation time of each epoch of the DRL algorithm with the one-climb policy based GNOP method under the tree task graph. For some epochs, the DRL algorithm with the one-climb policy based GNOP consumes only around 0.3 seconds to obtain the offloading solution.
                                        Mesh        Tree        General
DRL with one-climb policy based GNOP    0.9240 s    1.3421 s    1.0464 s
DRL with GNOP                           1.4702 s    1.4107 s    1.5821 s
Gibbs sampling                          8.2039 s    8.3046 s    8.6101 s
Exhaustive search                       25.6690 s   26.8181 s   27.5185 s
Furthermore, as shown in Table III, the DRL algorithm with the one-climb policy based GNOP requires a much shorter runtime than the Gibbs sampling algorithm and the exhaustive search method. In particular, for the general task graph, it outputs an offloading decision in around 1 second per realization on average, whereas the Gibbs sampling and exhaustive search methods take about 8 and 26 times longer, respectively.
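The one-climb pruning behind these savings can be illustrated for a sequential execution path: only decisions in which the offloaded tasks form a single contiguous block are kept, shrinking the candidate set from 2^N to 1 + N(N+1)/2, i.e., O(N^2). A minimal sketch under that assumption (the function names are ours):

```python
from itertools import product

def satisfies_one_climb(decision):
    """True if the offloaded tasks (1s) along an execution path form at most one
    contiguous block, i.e., execution migrates to the edge at most once."""
    blocks = sum(1 for i in range(len(decision))
                 if decision[i] == 1 and (i == 0 or decision[i - 1] == 0))
    return blocks <= 1

def one_climb_candidates(num_tasks):
    """Enumerate only the binary decisions that satisfy the one-climb policy."""
    return [d for d in product((0, 1), repeat=num_tasks) if satisfies_one_climb(d)]
```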
VI Conclusions
Considering a single-user MEC system with a general task graph, this paper has proposed a DRL framework to jointly optimize the offloading decisions and resource allocation, with the goal of minimizing the weighted sum of the MD's energy consumption and task execution time. The DRL framework utilizes a DNN to learn and improve the offloading policy from past experiences, which completely removes the need to solve a hard combinatorial optimization problem. Besides, we have designed a Gaussian noise-added order-preserving (GNOP) quantization method to efficiently generate offloading actions in the DRL framework. Meanwhile, a low-complexity algorithm has been proposed to accurately evaluate the ETC performance of each generated offloading decision. We have further applied a one-climb policy heuristic to speed up the learning process. Simulation results have demonstrated that the proposed algorithm achieves near-optimal performance while significantly reducing the complexity compared to conventional optimization methods.
Appendix A Proof of Lemma 3.1
For the term in (A), we have
(33) 
For the term in (A), we have
(34) 
(35) 
where is defined in (16).
Appendix B Proof of Proposition 3.1
The derivative of of (18) with respect to can be expressed as
(36) 
where is a monotonically increasing function with . Thus, if , we have . Otherwise, we have