I Introduction
Modern wireless networks have to accommodate surging traffic demands and diverse quality-provisioning requirements. This requires a strategic shift in network design toward sophisticated wireless technologies operating in a more decentralized, ad hoc, and diverse environment. As a result, network design problems become very challenging as their dimensionality and complexity rapidly increase, e.g., due to couplings among different network entities. Recently, deep reinforcement learning (DRL) has emerged as a breakthrough technology for learning the optimal control strategy in a dynamic network environment by continuously interacting with it [luong18]. DRL integrates deep neural networks (DNNs) with conventional reinforcement learning algorithms for autonomous decision making. It is capable of solving high-dimensional, nonconvex, and model-free network control problems, e.g., channel access and resource allocation in mobile edge computing (MEC) [yan18]. These problems are very difficult to handle with conventional techniques such as convex optimization and dynamic and stochastic programming, due to imprecise modeling, uncertain system dynamics, and huge variable spaces. Hence, the application of DRL in wireless networks is envisioned to revolutionize the network optimization paradigm.
In this article, we first provide in Section II an overview of the DRL framework and of its variants that improve stability and learning performance. In Section III, as a concrete example, we shift our focus to the performance optimization of emerging MEC applications, which is generally complicated by the resource competition and interactions among multiple wireless users, base stations, caching servers, and MEC servers [yan18]. We first build a general DRL framework to learn the optimal data offloading policy with uncertain network information, and then review existing applications of the DRL framework for MEC in different network scenarios. We observe that data offloading is not always preferred by low-power IoT devices due to the high energy consumption of wireless communications. Hence, in Section IV, we introduce a novel hybrid MEC offloading model to balance the energy consumption of offloading and computation. Besides local computation, the hybrid model allows data offloading to the MEC server via either active RF communications or passive wireless backscatter [ieeenetwork]. Our numerical results verify that hybrid MEC offloading can significantly improve the network performance by learning the optimal transmission scheduling and workload allocation among different offloading schemes. Finally, some open issues are discussed in Section V.
II An Overview of Deep Reinforcement Learning
In this section, we first review fundamentals of reinforcement learning and then discuss its extension to DRL, as well as various techniques to improve the learning efficiency and stability.
II-A Fundamentals of Reinforcement Learning
Reinforcement learning is an effective solution to Markov decision processes (MDPs), which are composed of the decision-making agent, system state, action, and reward [sutton1998reinforcement]. The agent is the entity that makes decisions through interactions with the environment. Based on its observation of the environment, referred to as the system state, the agent takes an action and then receives an immediate reward corresponding to the state-action pair. The action affects the environment and may cause a transition to a new system state. The immediate reward and the transition to a new state guide, i.e., reinforce, the adaptation of the agent’s policy, which defines the sequence of actions taken in each decision epoch as the system evolves. This learning process continues until the agent finds the optimal policy that maximizes the accumulated reward, which can be characterized by either the state-value or the action-value function. The state-value records the expected total reward starting from an initial system state, while the action-value, also referred to as the Q-value, maps each state-action pair to the accumulated reward.
There are mainly value-based and policy-based approaches for solving reinforcement learning problems [sutton1998reinforcement]. The value-based approach estimates the value function and takes actions to improve it directly in an iterative process. The estimation of value functions can be based on value iteration following the Bellman equation or on the Q-learning algorithm. A variant of the value-based approach relies on the estimate of an advantage-value, which can stabilize the learning process by subtracting a baseline from the estimate of the action-value. The policy-based approach improves the value function by updating a parametric policy with gradient-based methods. Reinforcement learning can also be classified into on-policy and off-policy approaches. On-policy learning relies on the sample trajectory induced by the current policy, i.e., all future actions are chosen according to the current policy. This may require more interactions with the environment to ensure unbiased policy updates, which can make it impractical for solving complicated problems. Off-policy learning can improve the sample efficiency by utilizing all historical sample trajectories. However, it requires more effort in hyperparameter tuning to ensure convergence.
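To make the value-based update concrete, the following is a minimal sketch of tabular Q-learning with an epsilon-greedy behavior policy. The two-state environment `toy_step`, the function names, and all hyperparameter values are illustrative assumptions and do not come from the network models discussed in this article.

```python
import random


def q_learning(n_states, n_actions, step, episodes=500, horizon=20,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning; `step(state, action)` returns (next_state, reward)."""
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = rng.randrange(n_states)
        for _ in range(horizon):
            # Epsilon-greedy behavior policy: explore with probability epsilon.
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda i: q[s][i])
            s_next, r = step(s, a)
            # Off-policy target: bootstrap from the greedy action in s_next.
            target = r + gamma * max(q[s_next])
            q[s][a] += alpha * (target - q[s][a])
            s = s_next
    return q


def toy_step(state, action):
    """Hypothetical 2-state MDP: the action selects the next state,
    and entering state 1 pays a reward of 1."""
    next_state = action
    reward = 1.0 if next_state == 1 else 0.0
    return next_state, reward


q_table = q_learning(n_states=2, n_actions=2, step=toy_step)
greedy_policy = [max(range(2), key=lambda a: q_table[s][a]) for s in range(2)]
print(greedy_policy)
```

The update is off-policy in the sense described above: actions are drawn from the exploratory epsilon-greedy policy, while the target always uses the greedy action in the next state.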
II-B Deep Reinforcement Learning Approaches
Reinforcement learning becomes unstable and may even fail to converge when the state and action spaces are large, as in complex wireless networks. DRL uses DNNs as function approximators for different components of reinforcement learning, including the value function, the policy, and the underlying system model, e.g., the state transition probability. In the following, we introduce the basics of DRL and recent advances that improve its learning performance.
II-B1 Deep Q-Network (DQN)
DQN extends the value-based Q-learning algorithm for MDPs by using a DNN as a parametric approximation of the action-value function [nature15]. The success of DQN and its variants relies on two key mechanisms, i.e., experience replay and the target network, to stabilize learning with large state and action spaces. Experience replay maintains a replay memory that buffers historical transition samples and randomly selects a subset of samples, i.e., a mini-batch, to train the DNN. This breaks the correlations among samples and enables more efficient training on independent samples. The training of the DNN in each step aims to minimize the temporal-difference (TD) error, i.e., the mean-squared difference between the Q-value estimated by the DNN and its target value. In practice, we can replay more frequently the transition samples that generate a higher expected reward. Hence, a prioritized experience replay (PER) scheme can potentially increase the learning speed [per]. A straightforward way to implement PER is to prioritize samples by their TD-errors, since a higher TD-error implies a larger potential for further optimization. The TD-error based PER can be further combined with random sampling to ensure that all transition samples have a chance to be selected for training [sutton1998reinforcement].
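The replay mechanism described above can be sketched as follows. This is a simplified illustration, not the exact scheme of [per]: the class name `ReplayMemory`, the `mix` parameter that blends priority-proportional and uniform sampling, and the small priority offset are all assumptions made for this example, and prioritized sampling here draws with replacement.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size replay memory with uniform and TD-error-prioritized sampling."""

    def __init__(self, capacity, seed=0):
        # deque(maxlen=...) silently evicts the oldest sample when full.
        self.buffer = deque(maxlen=capacity)
        self.priorities = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, transition, td_error=1.0):
        """Store a (state, action, reward, next_state) tuple and its priority."""
        self.buffer.append(transition)
        # Small offset keeps zero-error samples selectable.
        self.priorities.append(abs(td_error) + 1e-3)

    def sample_uniform(self, batch_size):
        """Plain experience replay: a random mini-batch without replacement."""
        return self.rng.sample(list(self.buffer), batch_size)

    def sample_prioritized(self, batch_size, mix=0.1):
        """Blend priority-proportional sampling with uniform sampling so that
        every stored transition keeps a nonzero selection probability."""
        n = len(self.buffer)
        total = sum(self.priorities)
        weights = [(1.0 - mix) * p / total + mix / n for p in self.priorities]
        return self.rng.choices(list(self.buffer), weights=weights, k=batch_size)


memory = ReplayMemory(capacity=3)
for i in range(5):
    # Hypothetical transitions; the TD-error grows with i.
    memory.push((f"s{i}", 0, 0.0, f"s{i + 1}"), td_error=float(i))
print(memory.sample_uniform(2))
```

With a capacity of 3, only the three most recent transitions remain in memory, and the one with the largest TD-error is drawn most often by `sample_prioritized`.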

II-B2 Double and Dueling DQN
DQN uses a separate network to generate the target value. The target network updates its parameters by copying them from the online network every few steps, as illustrated in Fig. 9(a). This makes the learning more stable than the Q-learning algorithm. The drawback of DQN is that it uses the same network both to select an action by the greedy policy and to evaluate that action [sutton1998reinforcement]. This may lead to an overoptimistic estimate of the Q-value. To correct this, Double DQN (DDQN) selects the action by the online network and then evaluates it by the target network [ddqn], as illustrated in Fig. 9(b). Another variant of DQN decomposes the Q-value into two streams, i.e., the state-value and the advantage-value [dueling], approximated by two independent DNNs in a dueling architecture. The two streams are then combined via an aggregating layer to produce the final estimate of the Q-value.
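The difference between the DQN and DDQN targets, and the aggregation step of the dueling architecture, can be illustrated with plain lists of Q-values. In this sketch the function names and the numerical values are hypothetical, and the mean-subtracted aggregation is one common choice rather than necessarily the exact layer used in every implementation.

```python
def dqn_target(reward, q_next_target, gamma=0.99):
    """DQN: the target network both selects and evaluates the next action."""
    return reward + gamma * max(q_next_target)


def double_dqn_target(reward, q_next_online, q_next_target, gamma=0.99):
    """DDQN: the online network selects the action, the target network evaluates it."""
    best_action = max(range(len(q_next_online)), key=lambda a: q_next_online[a])
    return reward + gamma * q_next_target[best_action]


def dueling_q_values(state_value, advantages):
    """Aggregate the state-value and advantage streams into Q-values,
    subtracting the mean advantage so the decomposition is identifiable."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + a - mean_advantage for a in advantages]


# Hypothetical Q-value estimates for the next state (three actions).
q_online = [1.0, 2.0, 0.5]
q_target = [3.0, 1.5, 0.0]

y_dqn = dqn_target(1.0, q_target, gamma=0.9)
y_ddqn = double_dqn_target(1.0, q_online, q_target, gamma=0.9)
q_dueling = dueling_q_values(2.0, [1.0, 0.0, -1.0])
print(y_dqn, y_ddqn, q_dueling)
```

In this example the DDQN target is smaller than the DQN target because the action preferred by the online network is not the one whose value the target network overestimates.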
II-B3 Deep Deterministic Policy Gradient
The policy-based and value-based approaches can be combined in the actor-critic framework [sutton1998reinforcement]. The critic function produces an estimate of the Q-value by minimizing the TD-error. The actor function then updates the policy parameters using the critic’s feedback. Two independent DNNs can be used as the parametric approximations of the critic and actor functions, respectively. The intuition behind the actor-critic framework stems from the policy gradient theorem, which builds the connection between the policy gradient and the Q-value. It decomposes the gradient computation into the evaluation of the Q-value and the gradient of the parametric policy, averaged over the whole state and action spaces. One recent development extends the policy gradient theorem to the deterministic policy gradient (DPG), which outputs a deterministic action instead of a distribution over the action space and thus makes it more efficient to estimate the policy gradient. The deep deterministic policy gradient (DDPG) algorithm combines DQN and DPG in the actor-critic framework, and makes the learning more stable and robust by using experience replay and a target network for DNN training [ddpg].
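The chain rule behind the deterministic policy gradient can be seen in a deliberately small example. The sketch below assumes a scalar linear actor a = theta * s and a fixed, known critic Q(s, a) = -(a - 2s)^2; both are hypothetical stand-ins for the actor and critic DNNs, and in DDPG the critic would itself be learned by minimizing the TD-error.

```python
import random


def critic_grad_wrt_action(state, action):
    """dQ/da for the hypothetical critic Q(s, a) = -(a - 2*s)**2."""
    return -2.0 * (action - 2.0 * state)


def train_actor(steps=2000, learning_rate=0.05, seed=0):
    """Gradient ascent on Q(s, mu(s)) for a linear deterministic actor."""
    rng = random.Random(seed)
    theta = 0.0  # actor parameter: mu(s) = theta * s
    for _ in range(steps):
        s = rng.uniform(-1.0, 1.0)       # sampled state
        a = theta * s                    # deterministic action
        # Chain rule: dQ/dtheta = (dQ/da) * (da/dtheta), with da/dtheta = s.
        theta += learning_rate * critic_grad_wrt_action(s, a) * s
    return theta


theta_learned = train_actor()
print(theta_learned)
```

Because the critic is maximized at a = 2s, the actor parameter approaches 2 under these assumptions, without ever sampling from a distribution over actions.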
A comparison of typical DRL approaches is listed in Fig. 9. In general, DQN and its variants are applicable to discrete action spaces and are natural extensions of the Q-learning algorithm for solving MDPs with large action and state spaces. The Rainbow algorithm in [rainbow] is an integrated design of different DQN variants, which is reported to achieve the best learning speed and the maximum reward. Continuous action spaces are better handled by DDPG [ddpg] and trust-region policy optimization (TRPO) [on_trpo]. To avoid large variance in gradient estimation, TRPO formulates a constrained optimization problem to search for a better policy that improves the value function. Besides, we observe that off-policy learning is more popular for DRL, as it can use all historical samples efficiently. Though TRPO is generally on-policy, it has been adapted in [on_trpo] to leverage a replay buffer and can thus achieve a better learning performance than DDPG.
As modern wireless networks become large-scale and complicated, network control problems face very diversified decision variables, including both discrete indicators and continuous variables for resource allocation. Thus, value-based and policy-based methods need to be used jointly for mixed decision-making problems. In the following, we focus on the applications of DRL in emerging MEC offloading scenarios, which typically involve interactions among multiple network entities and complicated optimization in both discrete and continuous domains.
III DRL-based Data Offloading for Mobile Edge Computing
MEC offloading allows IoT devices to offload data and computation-intensive workloads (e.g., compression and encryption) to resource-rich MEC servers. It can potentially reduce the processing delay, extend the battery lifetime, and even enhance security for IoT applications [yan18]. One of the critical design issues is to optimize the offloading rate and workload allocation and to choose the optimal MEC server, considering the time-varying channel conditions, user mobility, energy supply, dynamic workloads, and various resource constraints. A joint optimization of caching, offloading, networking, and transmission control is usually very complicated due to close couplings among multiple wireless users, base stations, and MEC servers. Such an optimization is also too inflexible to capture network dynamics with uncertain parameters, e.g., the fluctuating channel conditions and the time-varying workload and energy supplies.
III-A General DRL Framework for MEC Offloading
DRL avoids the above-mentioned difficulties by reformulating the network control problem in the MDP framework and enhancing the reinforcement learning solution with deep learning techniques. In the sequel, we propose a general DRL framework for MEC offloading that can be flexibly tailored to learn the offloading strategy under different network scenarios. As illustrated in Fig. 6, the DRL framework includes the following main components:
Network Profiling: The network environment contains very high-dimensional information. Dimension reduction is required to speed up the learning process. Network profiling helps to extract problem-dependent information closely related to the network control problems. This can be assisted by solving conventional model-based optimization problems.

State Orchestration: It aims to select the most salient or indicative state variables to minimize the state space without compromising the learning performance. The network performance depends on the demands and supplies of various resources. Hence, the system state can be set to reflect the real-time dynamics of resource consumption and regeneration.

Training Scheme: The training scheme can flexibly organize the value and policy networks to learn both discrete and continuous offloading decisions, e.g., the discrete indicators for base station (de)activation, channel assignment, user association, and routing selection, as well as the continuous variables for bandwidth allocation and beamforming optimization.

Reward Evaluation: The reward in each decision epoch drives the DRL agent to adjust its MEC offloading policy. In practice, the reward can be evaluated only after the workload is completed, which may take a few decision epochs or time slots. A model-based optimization can be deployed to estimate the instant reward based on the prediction of future network dynamics.

Action Generation: The DRL agent outputs a vector of actions for each system state, which is translated into control variables to execute the offloading decisions. Quantization or approximation may be required in some cases to project continuous variables onto discrete actions. Random noise can also be added to continuous actions for better exploration.
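The action generation component above can be sketched with two small helpers: one that projects a continuous output onto a discrete action set, and one that perturbs a continuous action for exploration. The function names, the candidate levels, and the use of clipped Gaussian noise are assumptions for illustration only.

```python
import random


def quantize(value, levels):
    """Project a continuous value onto the nearest discrete level."""
    return min(levels, key=lambda level: abs(level - value))


def explore(action, low, high, noise_std, rng):
    """Add zero-mean Gaussian noise to a continuous action and clip the
    result to the feasible range [low, high]."""
    noisy_action = action + rng.gauss(0.0, noise_std)
    return max(low, min(high, noisy_action))


# Hypothetical example: a fraction of the workload to offload.
offload_levels = [0.0, 0.25, 0.5, 0.75, 1.0]
discrete_action = quantize(0.62, offload_levels)

rng = random.Random(0)
noisy_actions = [explore(0.9, 0.0, 1.0, noise_std=0.2, rng=rng) for _ in range(5)]
print(discrete_action, noisy_actions)
```

Clipping keeps the perturbed action feasible, at the cost of concentrating some probability mass on the boundary of the action range.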
The general DRL framework can be applied to optimize MEC offloading policies under different network scenarios by customizing different components of the DRL framework to meet the performance requirements of various design problems. In the following, we provide a review of recent works on the applications of DRL for MEC offloading problems.
III-B Design Issues for DRL-based MEC Offloading
III-B1 Network Selection for Cost Minimization
In the simplest case with one wireless user and multiple access points, e.g., cellular base station and WLAN access point [zhangdeep], the MEC offloading is regarded as a network selection problem as illustrated in case (i) of Fig. 9. The wireless user can either access the cellular network or WLAN with different costs. To minimize the user’s energy consumption and cost for channel access, DQN can be constructed to learn the optimal selection scheme without knowing the user’s mobility pattern. The offloading decision is made based on the prediction of the user’s location and the remaining data size.
III-B2 Channel and Capacity Sharing
When multiple wireless users simultaneously request computation resources from a single MEC server, e.g., [xdwang_ddpg18], as shown in case (ii) of Fig. 9, spectrum and capacity sharing becomes a critical problem in minimizing the cost of delay and power consumption for all users. The system state can be the sum cost of all users and the remaining capacity of the MEC server. The DRL agent learns the continuous resource allocation for the wireless users and the binary offloading decisions, considering the limited capacity of the MEC server and the time-varying channel conditions.

III-B3 MEC Server and User Association
With multiple base stations or MEC servers, each wireless user’s computation offloading can be routed via different base stations, as shown in case (iii) of Fig. 9. To minimize the cost of processing delay, the authors in [chen2018optimized] employed DDQN to learn the optimal offloading policy, including the binary user association and transmit control strategies. The system state consists of the channel conditions between the wireless users and the base stations, as well as the statuses of the energy and data queues. Considering the low utilization of base stations, DDQN can also be used to control the (de)activation of base stations to reduce the total energy consumption while maintaining the same quality provisioning.
III-B4 Collaborative Data Offloading
Besides offloading to the MEC server, collaborative offloading among multiple wireless users can be envisioned, as in case (iv) of Fig. 9, i.e., each wireless user can offload its computation workload to nearby users via device-to-device communications, e.g., [duc2018deep]. The optimization of offloading decisions depends on the number of remaining tasks at each wireless user, the availability of computation resources, and the link quality between wireless users. DQN or DDQN can be customized to learn the optimal offloading policy in a mobile ad hoc network to maximize the resource utilization or minimize the total power consumption, subject to the user’s energy and delay requirements.
IV DRL Approach for a Hybrid MEC Offloading Model
One design objective of future wireless networks is to embrace the ubiquitous interconnection of low-power IoT devices, e.g., wearable wireless sensors for healthcare monitoring, which are either battery-powered or wirelessly powered via energy harvesting [xdwang_ddpg18]. For these low-power IoT devices, the energy consumed by data processing can be reduced significantly by offloading computation-intensive workloads to the MEC servers, e.g., [zhangdeep, xdwang_ddpg18, chen2018optimized, duc2018deep]. However, the energy saved on computation comes at the price of more energy consumed by computation offloading, which is generally performed by conventional RF communications. Due to the high energy consumption of RF communications, MEC offloading may not be affordable for these low-power IoT devices.
In this section, we tackle this problem by proposing a hybrid offloading strategy that can schedule data offloading over both high-rate RF communications and low-power backscatter communications [ieeenetwork]. The backscatter radio operates in the passive mode by reflecting the incident RF signals. It features extremely low power consumption and a low data rate, while the active radio in RF communications can transmit reliably at a higher data rate by adapting its transmissions to the channel fading. Hence, we aim to optimize the hybrid MEC offloading policy to balance the energy consumption of data offloading and computation. This can be achieved by exploiting the complementary operations of the passive and active radios. However, due to the coupling between the two radio modes, it becomes more complicated to optimize the MEC offloading policy by conventional model-based approaches. In the sequel, we employ the DRL framework to optimize the hybrid MEC offloading strategy with uncertain channel conditions, dynamic energy supply, and time-varying workloads at the IoT devices.
IV-A Hybrid MEC Offloading Model
We consider a set of edge users, e.g., wireless IoT sensors, that send backlogged workloads to a hybrid base station (HBS), which is co-located with the MEC server. The system model is illustrated in Fig. 10. The channel from the HBS to each edge user is modeled by a finite-state Markov model. The HBS allocates each edge user a time slot for MEC offloading, similar to the time-slotted structure in [xdwang_ddpg18]. Each edge user can harvest RF energy from the HBS and from ambient RF sources with random power density. The energy harvesting capability is illustrated in case (i) of Fig. 10. The edge user’s workload is uncertain due to the user’s mobility and the time-varying demand of upper-layer applications, e.g., the data sampling rate may vary according to the health conditions of the subject being monitored. The user’s workload needs to be processed locally or remotely at the MEC server before a time deadline. We assume that the MEC server can return the processed data to the edge user instantly via simultaneous power and information transfer, as illustrated in case (ii) of Fig. 10. The hybrid MEC offloading scheme allows each user to flexibly switch data offloading between passive backscatter communications and active RF communications, as illustrated in cases (iii) and (iv) of Fig. 10, respectively. To maintain a fixed offloading rate, the active radio’s transmit power has to be adapted to the time-varying channel conditions. This implies a dynamic process of the edge user’s energy buffer.
Clearly, scheduling the offloading between the two radio modes introduces an additional degree of freedom to improve the MEC performance in such a dynamic network environment. The DRL approach aims to learn the optimal hybrid MEC offloading policy from past experience. Given the channel conditions, energy status, and workload in each time slot, the edge user chooses its action (e.g., local computation, passive offloading, or active offloading) to maximize the reward function, which is defined as the energy efficiency, i.e., the successfully processed workload per unit energy. A workload outage happens when the edge user’s workload is not processed successfully before the delay bound. In this case, the instant reward is set to zero. To proceed, we divide each time slot into flexible sub-slots as illustrated in Fig. 10. The first sub-slot is reserved for RF energy harvesting. The following sub-slot is allocated to active offloading with a higher rate, and another sub-slot is used by passive offloading with a lower rate. The offloading schemes also differ in their power consumption. To achieve the maximum energy efficiency, the DRL agent is designed to learn a transmission scheduling policy that determines the optimal action in each system state, including the time and workload allocations among local computation, active offloading, and passive offloading, subject to the edge user’s energy budget constraint.
IV-B Numerical Evaluation
To evaluate the flexibility and performance gain of the hybrid MEC offloading scheme, we compare it with a conventional offloading scheme, namely, the Active-Offload scheme that only supports active RF communications, e.g., [xdwang_ddpg18]. We also compare it with typical greedy and random schemes. In Fig. 13, we show the performance of the different schemes and observe that the DRL-based Hybrid-Offload scheme achieves the highest reward and the lowest outage, as shown in Fig. 13(a) and Fig. 13(b), respectively. The Active-Offload scheme uses a DRL framework similar to that of the Hybrid-Offload scheme, but with a reduced action space, i.e., it only chooses between local computation and active offloading. Hence, it achieves a lower reward than the Hybrid-Offload scheme. The benchmark greedy scheme always chooses the myopic action that maximizes the instant reward in each time slot. It even performs better than the Active-Offload scheme due to its flexibility in radio mode switching.
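The energy-efficiency reward of the hybrid offloading model and the myopic choice made by the greedy benchmark can be sketched as follows. The function names, the dictionary layout, and all numbers are hypothetical illustrations and are not the simulation settings behind Fig. 13; returning zero for non-positive energy is an added guard against division by zero.

```python
def energy_efficiency_reward(processed_bits, energy_joules, deadline_met):
    """Successfully processed workload per unit energy; zero on workload outage."""
    if not deadline_met or energy_joules <= 0:
        return 0.0
    return processed_bits / energy_joules


def greedy_action(candidates):
    """Pick the action with the largest instant reward in the current slot.
    `candidates` maps an action name to (processed_bits, energy_joules, deadline_met)."""
    return max(candidates,
               key=lambda name: energy_efficiency_reward(*candidates[name]))


# Hypothetical outcomes of the three actions in one time slot.
slot_outcomes = {
    "local": (1000.0, 2.0, True),     # slow but feasible
    "passive": (400.0, 0.1, True),    # low rate, very low power
    "active": (2000.0, 1.0, False),   # high rate, but misses the deadline here
}

rewards = {name: energy_efficiency_reward(*outcome)
           for name, outcome in slot_outcomes.items()}
print(rewards, greedy_action(slot_outcomes))
```

Unlike this myopic rule, a DRL agent also accounts for how the current action changes the energy buffer and workload backlog seen in later time slots.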
In the next group of simulations, we equally divide each time slot into multiple sub-slots and assume that the edge user follows the same DRL framework to optimize its offloading decision independently in each sub-slot. In this way, we can flexibly allocate the workload and thus approximate the optimal workload allocation strategy among local computation, passive offloading, and active offloading. In Fig. 16(a), we show the performance of the Hybrid-Offload scheme for different numbers of sub-slots for MEC offloading. We observe that it generally achieves a higher reward with more sub-slots. We also show the performance of the DDPG algorithm for continuous control, which achieves the maximum reward. Such a performance gain is obtained from the increased flexibility in workload and time allocation.
In Fig. 16(b), we show the averaged workload allocation among the different computation schemes at the convergence of the Hybrid-Offload algorithm. The x-axis of Fig. 16(b) denotes the mean power density of the ambient RF environment. We observe that with a low energy supply, the passive offloading scheme is preferred due to its extremely low power consumption. With a higher energy density, the edge user generally harvests more RF energy and thus prefers the active offloading scheme. This provides a higher offloading rate and thus reduces the processing delay.
V Open Research Issues
Though DRL has been successfully applied to various network control problems, there still exist some challenges and open issues for MEC offloading in wireless networks.
V-1 Multi-agent DRL for MEC Offloading
MEC offloading involves multiple heterogeneous network entities, e.g., wireless users, base stations, and MEC servers, which may have totally different reward functions and control variables. Each user can customize its own DRL framework and make decisions based on local observations. However, this may destroy the Markovian property of the underlying system model and lead to divergent learning performance.
V-2 Model-based Reward Evaluation
The DRL agent requires real-time reward evaluation to drive the learning process. As the performance of an MEC offloading decision is usually not observable until the workload is completed, a more effective way of combining learning and model-based optimization is required to predict the reward with incomplete network information.
V-3 Hierarchical DRL for MEC Offloading
MEC offloading decisions generally involve both discrete and continuous control variables. To improve learning efficiency, a hierarchical or two-stage DRL framework can be implemented that learns the optimal resource allocation strategy with policy-based DRL approaches in the inner loop, and then updates the discrete user association or offloading decisions with DQN or its variants in the outer loop.
VI Conclusions
In this paper, we have first reviewed the DRL framework and its applications in MEC offloading with uncertain network information. Then, we have customized the DRL framework to realize a novel hybrid MEC offloading scheme that exploits the complementary transmissions of the passive and active radios. Numerical results demonstrate that it can significantly improve the offloading performance. Finally, we have outlined a few open research issues.