With the development of wireless communications, a growing number of applications place heavy computing pressure on mobile terminals (MTs). Because the computational resources at the MTs are limited, the traditional way to alleviate this burden is to offload partial tasks to remote cloud servers. Unfortunately, the cloud servers are usually far away from the MTs, so offloading may consume substantial communication resources and introduce extra delay. Mobile edge computing (MEC) has been considered an efficient solution to these problems. Compared with cloud computing, MEC deploys servers closer to the MTs, so partial task processing can be completed at nearby MEC servers with shorter delay.
To improve MEC processing performance, effective task offloading is crucial and has received great attention recently. For indivisible or highly integrated tasks, binary offloading strategies are generally adopted, where the task can only be computed locally or offloaded to the servers. In practice, offloading decisions can be more flexible: the computation task can be divided into two parts that are performed in parallel, with one part processed locally and the other offloaded to the MEC servers for processing. Prior work studied this task offloading problem, jointly investigating the offloading scheduling, latency, and energy saving by formulating it as a linear programming problem.
More recently, reinforcement learning (RL), a model-free machine learning approach that trains iteratively on its own observations, has been employed as a new solution to MEC offloading. In [4], the authors studied computation offloading in an Internet of Things (IoT) network and proposed a Q-learning based RL approach for an IoT device to select a proper device and to determine the proportion of the task to offload. The authors in [5] investigated resource allocation for vehicular networks by considering the vehicles' mobility, and developed a deep Q-learning based RL approach with a multi-timescale framework to solve the joint communication, caching, and computing control problem. In [6], the authors studied offloading for an energy harvesting MEC network; an after-state RL algorithm was proposed to address the large time complexity, and polynomial value function approximation was introduced to accelerate the learning process.
As seen from [4, 5, 6], the efficient offloading offered by RL based approaches can help improve performance. In this work, we study offloading optimization in a more complex dynamic environment. Different from existing studies, we consider a more practical scenario where the current channel state information (CSI) cannot be observed when the offloading decision is made. Moreover, the discrete-continuous hybrid offloading policy, which includes the local task splitting ratio, the transmission/computation power allocations, and the MEC server selection, is made according to the predicted CSI and task arrival rates under user and server energy constraints in a multi-user multi-MEC server network, with the aim of minimizing the long-term average delay cost. We first establish a low-complexity deep Q-learning network (DQN) based offloading framework in which the action includes only the discrete MEC server selection, while the remaining continuous variables are optimized by solving a convex optimization problem. Then we develop a deep deterministic policy gradient (DDPG) based framework that includes all discrete and continuous variables as actions. Numerical results demonstrate that both proposed strategies outperform the traditional schemes, and that the DDPG strategy is superior to the DQN strategy because it can learn all variables online, although at the cost of higher complexity.
II System Model
Consider an offloading-enabled wireless network that includes MEC servers and users as shown in Fig. 1. Let and be the sets of the MEC servers and the users, respectively. Suppose that the MEC server has stronger computing capability than the user, so partial tasks can be offloaded to the MEC server for processing. The system time is divided into consecutive time frames of equal duration , indexed by . At time , we denote the CSI between the -th MEC server and the -th user as and the size of the task at user as .
In the considered MEC network, we assume that communications operate in an orthogonal frequency division multiple access framework and that a dedicated subchannel with bandwidth is allocated to each user for the partial task offloading. Suppose that user communicates with MEC server ; the received signal at the MEC server is then represented as
where denotes the symbols transmitted from user with unit power, is the utilized power at user , and denotes the received additive Gaussian noise with power . Here we assume that the channel gains follow a finite-state Markov chain. The communication rate between MEC server and user is given by
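The two channel ingredients above can be illustrated with a minimal sketch. It assumes that the rate takes the standard Shannon form, bandwidth times log2(1 + transmit power × |channel gain|² / noise power), and that the finite-state Markov chain is described by a row-stochastic transition matrix. The source's own expressions are not reproduced here, and all function and parameter names are hypothetical.

```python
import math
import random


def uplink_rate(bandwidth_hz, tx_power_w, channel_gain, noise_power_w):
    """Achievable rate (bit/s) on one dedicated subchannel.

    Assumption: Shannon form B * log2(1 + p * |h|^2 / sigma^2).
    """
    snr = tx_power_w * abs(channel_gain) ** 2 / noise_power_w
    return bandwidth_hz * math.log2(1.0 + snr)


def next_channel_state(state, transition_matrix, rng):
    """Draw the next channel-state index of a finite-state Markov chain.

    transition_matrix[i][j] is the (assumed known) probability of moving
    from state i to state j; each row must sum to one.
    """
    states = range(len(transition_matrix))
    return rng.choices(states, weights=transition_matrix[state])[0]


if __name__ == "__main__":
    rng = random.Random(0)
    gains = [0.1, 0.5, 1.0]           # hypothetical channel-gain levels
    P = [[0.8, 0.2, 0.0],
         [0.1, 0.8, 0.1],
         [0.0, 0.2, 0.8]]             # hypothetical transition matrix
    s = 1
    for _ in range(3):
        s = next_channel_state(s, P, rng)
        print(s, uplink_rate(1e6, 0.1, gains[s], 1e-9))
```

Keeping the channel evolution separate from the rate computation mirrors the setting in which the agent acts on delayed CSI while the rate actually achieved depends on the current state.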
Suppose the task received at user at time needs to be processed during time interval . We denote the task splitting ratio as , which indicates that bits are executed at the user device and the remaining bits are offloaded to the MEC server for processing.
1) Local computing: The local CPU adopts the dynamic voltage and frequency scaling (DVFS) technique, and the performance of the CPU is controlled by the CPU-cycle frequency . Let denote the local processing power at user ; the user's computing speed (cycles per second) at the -th slot is given by . Let denote the number of CPU cycles required for user to process one task bit. Then the local computation rate for user at time slot is given by .
2) MEC computing: The task model for MEC offloading is the data-partition model, where the task-input bits are bit-wise independent and can be arbitrarily divided into different groups. At the beginning of each time slot, the user chooses which MEC server to connect to according to the CSI. Assume that the processing power allocated to the user by the MEC server is ; the computation rate at MEC server for user is given by , where is the number of CPU cycles required for MEC server to process one task bit, and denotes the CPU-cycle frequency at the MEC server. Since a MEC server simultaneously processes the tasks from multiple users, we assume that multiple applications can be executed in parallel with negligible processing latency. Moreover, the feedback time from the MEC server to the user is ignored because the computational output is small.
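As an illustration of the local computing model, the sketch below assumes the cubic DVFS power model that is common in the MEC literature, in which processing power equals an effective capacitance coefficient times the cube of the CPU frequency. The source's own expressions are not reproduced here, so this model, the coefficient `kappa`, and all function names are assumptions.

```python
def local_cpu_frequency(power_w, kappa):
    """CPU-cycle frequency (cycles/s) under the assumed model p = kappa * f**3."""
    return (power_w / kappa) ** (1.0 / 3.0)


def local_computation_rate(power_w, kappa, cycles_per_bit):
    """Local computation rate (bit/s): frequency divided by cycles per task bit."""
    return local_cpu_frequency(power_w, kappa) / cycles_per_bit


if __name__ == "__main__":
    # Hypothetical numbers: 1 W of processing power, kappa = 1e-27,
    # and 1000 CPU cycles per bit.
    print(local_cpu_frequency(1.0, 1e-27))
    print(local_computation_rate(1.0, 1e-27, 1000))
```

Under this assumed model, doubling the processing power raises the computation rate only by a factor of 2^(1/3), which is why the power allocation interacts non-trivially with the energy constraints.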
III DQN Based Offloading Design
In this section, we develop a DQN based offloading framework for minimizing the long-term delay cost. DQN is an extension of the Q-learning algorithm that is particularly suitable for high-dimensional state spaces and possesses fast convergence behavior. In the DQN offloading framework, there is a central agent node that observes the states, performs actions, and receives the fed-back reward. The central agent can be the cloud server or a MEC server. In the DQN paradigm, it is assumed that the instantaneous CSI is estimated at the MEC servers using training sequences and then delivered to the agent. Because of the channel estimation operations and the feedback delay, the CSI observed at the agent is a delayed version. Moreover, each MEC server acquires only its local CSI, that is, the CSI of the links connecting this MEC server to all users.
System State Space: In the considered DQN paradigm, the state space observed by the agent includes the CSI of the overall network and the received task size at time . At time , the agent observes only a delayed version of the CSI at time , i.e., . Denote , . The state space observed at time can be represented as .
System Action Space: With the observed state space , the agent takes certain actions to interact with the environment. As DQN can only handle discrete actions, the actions defined in the proposed DQN paradigm consist only of the MEC server selection. Denote the MEC server selection action as , where means that the user does not select the MEC server at the -th time slot, while indicates that the user selects the MEC server at the -th time slot.
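Since a DQN outputs one Q-value per discrete action, the joint MEC server selection of all users has to be enumerated as a flat index. The sketch below shows one possible encoding, assuming that every user picks at most one server and may also pick none. The encoding, the function name, and the matrix layout `selection[m][k]` are illustrative assumptions, not the scheme used in the source.

```python
def decode_action(index, num_users, num_servers):
    """Map a flat action index to a binary selection matrix.

    selection[m][k] == 1 means user k selects MEC server m. Each user has
    num_servers + 1 choices, where choice 0 stands for "no server"
    (an assumption), so there are (num_servers + 1) ** num_users actions.
    """
    base = num_servers + 1
    if not 0 <= index < base ** num_users:
        raise ValueError("action index out of range")
    selection = [[0] * num_users for _ in range(num_servers)]
    for k in range(num_users):
        index, choice = divmod(index, base)  # next base-(M+1) digit
        if choice > 0:
            selection[choice - 1][k] = 1
    return selection


if __name__ == "__main__":
    # Two users, two servers: 9 joint actions in total.
    for a in range(9):
        print(a, decode_action(a, num_users=2, num_servers=2))
```

The action count grows exponentially with the number of users, which is one reason to keep the DQN action limited to server selection.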
Reward Function: The reward is defined as the maximum time delay required to complete all the tasks at all users. After the actions are taken, each MEC server can calculate the time delays of the users that choose it for offloading, since every MEC server can observe its local CSI. Denote that user with offloads the tasks to MEC , where set defines the indexes of the users selecting MEC server to offload tasks. To minimize the time delays, the MEC server needs to formulate an optimization problem to find the optimal , , , and . It is worth noting that, as the MEC server knows the instantaneous CSI at time , the solution can be obtained based on , whereas the MEC server selection is made based on . For the users not offloading tasks to the MEC servers, the time delays required for local task processing are known by the users. Then, the agent collects all the time delays from the users and the MEC servers to obtain the final reward.
Next we detail how to compute the time delay for user assuming that it selects MEC server to offload. Denote the time consumption for completing the task processing at user by , where , , and denote the time for local task processing, task offloading transmission, and task processing at the MEC server, respectively, which can be represented as , , and .
To maximize the reward, it is necessary to minimize the time delay for each user under the total energy constraint at the users and MEC servers. To illustrate how to find the optimal , , , and for different types of MEC server selection, we next present two typical offloading scenarios. The solution can be extended to the case where a MEC server serves an arbitrary number of users.
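The delay computation can be sketched as follows, assuming that the local part and the offloaded part run in parallel, so the completion time is the larger of the local processing time and the sum of the transmission and MEC processing times. This form is consistent with the parallel execution described earlier, but the source's own expressions are not reproduced here, and all names are hypothetical.

```python
def user_delay(task_bits, local_ratio, local_rate, tx_rate, mec_rate):
    """Completion time (s) of one user's task under partial offloading.

    local_ratio is the fraction of bits processed locally; rates are in bit/s.
    Assumption: delay = max(t_local, t_tx + t_mec).
    """
    local_bits = local_ratio * task_bits
    offload_bits = task_bits - local_bits
    t_local = local_bits / local_rate if local_bits > 0 else 0.0
    if offload_bits <= 0:
        return t_local                      # purely local processing
    t_tx = offload_bits / tx_rate           # uplink transmission time
    t_mec = offload_bits / mec_rate         # processing time at the MEC server
    return max(t_local, t_tx + t_mec)


def worst_case_delay(delays):
    """Largest delay over all users, the quantity the reward is built on."""
    return max(delays)


if __name__ == "__main__":
    d1 = user_delay(1000, 0.5, 1000, 1000, 1000)
    d2 = user_delay(1000, 1.0, 500, 1000, 1000)
    print(d1, d2, worst_case_delay([d1, d2]))
```

Because the delay is a maximum of two branches, it is minimized when neither branch idles while the other is still running, which is the intuition behind balancing the two terms in the subproblems below.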
1) Scenario 1: one MEC server serves one user
The energy consumption at user , denoted by , can be represented as , where the first and the second terms are for local partial task processing and partial task transmission, respectively. The energy consumption at the MEC server for processing the partial task offloaded from user , denoted by , is represented as . The optimization problem for optimizing is given by
where and denote the maximum available energy at user and MEC server , respectively. To solve (2), we find that the optimal solution must activate constraint (2d), which produces . Substituting this into problem (2), we have
It is noted that (3) is non-convex. To find an efficient solution, we propose an alternating algorithm that solves for , and in separate subproblems.
In the first subproblem, we solve for given , and . To minimize the objective function, the optimal solution of should activate constraint (3c), that is, , which implies
In the second subproblem, we solve with given and . The corresponding problem is given by
Problem (4) is a convex optimization problem and can be solved efficiently by, for example, an interior-point algorithm.
In the third subproblem, is solved with given and . The corresponding optimization problem is given by
By denoting the two terms in the objective function as and , and , it is known that the optimal , denoted by , occurs in the following three cases, that is, , or . Note that here the solution of the third case can be obtained by solving a cubic equation. The final solution is given as
2) Scenario 2: one MEC server serves two users
Assume that MEC server serves two users, user and user , then the optimization problem is formulated as
The previously proposed iterative algorithm can still be applied here to solve , , and with . Here the only difference lies in solving and . The corresponding optimization problem can be formulated as
It is worth noting that the optimal solution must activate the constraints and make the two terms within the objective function equal to each other. Therefore, the optimal and can be obtained by solving the following equations
Hence, under an action , the system reward can be obtained as . The structure of the DQN-based offloading algorithm is illustrated in Fig. 2 and the pseudocode is presented in Algorithm 1.
IV DDPG Based Offloading Design
The DQN offloading design can only deal with discrete actions, and its reward acquisition mainly depends on solving the formulated optimization problems at the MEC servers, which adds computing burden at the MEC servers. In this section, we rely on DDPG to design the offloading policy, since DDPG can deal with both discrete and continuous actions. Different from DQN, DDPG uses an actor-critic network to improve the accuracy of the model. Here we directly regard , , and as actions instead of splitting the problem into two parts as in the DQN counterpart.
System State Space: In the DDPG offloading paradigm, the system state space is the same as in the DQN based offloading paradigm, which is given by . As in the DQN offloading paradigm, the agent can only observe the delayed CSI due to the channel estimation operations and feedback delay.
System Action Space: In the DDPG offloading paradigm, we utilize the value of to indicate the MEC server selection, that is, represents that there is no partial task at user offloaded to the MEC server . In other words, the MEC server is not chosen by user . If is not equal to , it means that the user decides to offload partial tasks to the MEC server . Since the user can only connect to one MEC server within a time slot, only one is not for a given . The action space of the DDPG offloading paradigm can be expressed as
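The single-server constraint can be enforced on a raw actor output in several ways. The sketch below shows one simple possibility, expressed in terms of hypothetical per-server offloaded fractions (the share of a user's bits sent to each server): values are clipped to [0, 1], and for every user only the largest entry is kept. This projection and its parametrization are assumptions made for illustration, and the source does not state how the constraint is enforced.

```python
def project_offloading(raw_fractions):
    """Keep at most one non-zero offloaded fraction per user.

    raw_fractions[k][m] is the raw actor output for user k and server m.
    Returns a matrix of the same shape in which, for each user, only the
    server with the largest clipped fraction keeps its value.
    """
    projected = []
    for row in raw_fractions:
        clipped = [min(max(x, 0.0), 1.0) for x in row]
        best = max(range(len(clipped)), key=clipped.__getitem__)
        projected.append(
            [value if m == best else 0.0 for m, value in enumerate(clipped)]
        )
    return projected


if __name__ == "__main__":
    raw = [[0.2, 0.7, 0.1],    # user 0 offloads 70% of its bits to server 1
           [0.0, 0.0, 0.0],    # user 1 processes everything locally
           [-0.5, 1.5, 0.3]]   # out-of-range outputs are clipped first
    print(project_offloading(raw))
```

A user whose row is all zeros after projection simply computes its whole task locally, so "no server selected" needs no separate action component in this encoding.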
System Reward Function: In the DDPG offloading algorithm, , , , and are obtained from a continuous action space. With these decisions, the agent tells each user the selected MEC server and delivers , to it to perform the offloading. Moreover, the agent needs to send to each server for the computing resource allocation. After that, the reward is obtained as in DQN by collecting observed at the MEC servers or users.
Compared with the DQN based offloading paradigm, the DDPG counterpart does not need the MEC servers to solve the optimization problems, which relieves the computation burden at the MEC servers. However, as the DDPG algorithm is generally more complex than the DQN algorithm, the computational complexity at the agent is unavoidably increased.
The structure of the DDPG-based offloading is illustrated in Fig. 3 and the pseudocode is given in Algorithm 2.
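One ingredient of standard DDPG that distinguishes it from a plain actor-critic update is the slowly tracking target network. The sketch below shows the usual soft (Polyak) update on flat parameter lists. Whether the source's Algorithm 2 uses exactly this step is an assumption, and the step size `tau` shown is a hypothetical value.

```python
def soft_update(target_params, online_params, tau=0.005):
    """Return new target parameters: (1 - tau) * target + tau * online.

    Both arguments are flat lists of floats of equal length. A small tau
    makes the target network change slowly, which stabilizes the critic's
    bootstrapped learning targets.
    """
    if len(target_params) != len(online_params):
        raise ValueError("parameter lists must have the same length")
    return [
        (1.0 - tau) * t + tau * o
        for t, o in zip(target_params, online_params)
    ]


if __name__ == "__main__":
    target = [0.0, 1.0]
    online = [1.0, 1.0]
    for _ in range(3):
        target = soft_update(target, online, tau=0.5)
        print(target)
```

In a full implementation the same update would be applied to both the target actor and the target critic after each training step.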
V Numerical Results
In the simulation, we assume that ms and MHz. Additionally, the required CPU cycles per bit are and . In the training process, the learning rate of the DQN network is ; the learning rates of the DDPG actor and critic networks are .
In Fig. 4, we plot the training process for the two RL algorithms. We see that the DDPG based design converges faster than the DQN counterpart and attains a lower latency. This indicates that the DDPG-based algorithm performs better than the DQN-based algorithm for our offloading problem.
In Fig. 5, the offloading delay is illustrated versus the task size. Three benchmarks, “random scheme”, “Local computing” and “MEC computing”, are chosen for comparison. Here “random scheme” means that the computing resources are allocated in a random manner; “Local computing” and “MEC computing” mean that the tasks are processed only at the users and only at the MEC servers, respectively. We see that as the amount of tasks grows, the required time delay increases. As the users have little local computing capacity, the computation delay of “Local computing” is the largest. “MEC computing” performs better than “random scheme” when the task arrival rate is greater than Mbps, which indicates that when the task arrival rate increases, offloading tasks to the MEC servers yields a lower time delay. The two proposed DQN and DDPG offloading paradigms perform better than the three benchmarks, which supports the effectiveness of the proposed designs. Moreover, the DDPG-based paradigm achieves lower latency than the DQN-based paradigm, which further verifies the advantage of the DDPG algorithm in dealing with high-dimensional continuous action-state space problems.
Fig. 6 shows the impact of computing capabilities on the processing delay. We fix the local computing capability at a constant value and increase the computing capacity of the MEC server, so the delay of “Local computing” does not change. We see that under different computing capabilities, the two proposed DQN and DDPG offloading paradigms achieve better performance than the three benchmarks, and the performance of the DDPG-based offloading paradigm is slightly better than that of the DQN-based offloading paradigm.
In this paper, we studied the offloading design for a multi-user MEC system based on deep RL. Two different deep RL algorithms, namely DQN and DDPG, were investigated to solve the formulated offloading optimization problem. The effectiveness and convergence of the proposed algorithms were verified through simulation results.
-  K. Kumar, J. Liu, Y.-H. Lu, and B. Bhargava, “A survey of computation offloading for mobile systems,” Mobile Networks Applicat., vol. 18, no. 1, pp. 129-140, Feb. 2013.
-  Y. Mao, J. Zhang and K. B. Letaief, “Dynamic computation offloading for mobile-edge computing with energy harvesting devices,” IEEE J. Sel. Areas Commun., vol. 34, no. 12, pp. 3590-3605, Dec. 2016.
-  S. E. Mahmoodi, R. N. Uma and K. P. Subbalakshmi, “Optimal joint scheduling and cloud offloading for mobile applications,” IEEE Trans. Cloud Comput., vol. 7, no. 2, pp. 301-313, Apr. 2019.
-  M. Min, L. Xiao, Y. Chen, P. Cheng, D. Wu and W. Zhuang, “Learning-based computation offloading for IoT devices with energy harvesting,” IEEE Trans. Veh. Technol., vol. 68, no. 2, pp. 1930-1941, Feb. 2019.
-  L. T. Tan and R. Q. Hu, “Mobility-aware edge caching and computing in vehicle networks: a deep reinforcement learning,” IEEE Trans. Veh. Technol., vol. 67, no. 11, pp. 10190-10203, Nov. 2018.
-  Z. Wei, B. Zhao, J. Su and X. Lu, “Dynamic edge computation offloading for internet of things with energy harvesting: a learning method,” IEEE Internet Things J., vol. 6, no. 3, pp. 4436-4447, June 2019.