I Introduction
Recent years have witnessed a boom in emerging mobile applications such as augmented reality, virtual reality, and interactive gaming. These applications require intensive computing power for real-time processing, which often exceeds the limited computing and storage capabilities of mobile devices. To resolve this issue, Multi-access Edge Computing (MEC) [sabella2019developing], a key technology in the fifth-generation (5G) network, was proposed to meet the ever-increasing Quality-of-Service (QoS) demands of mobile applications. MEC provides abundant computing and storage resources at the network edge (close to users), which can effectively cut down application latency and improve QoS. Specifically, a mobile application empowered by MEC consists of a front-end component running on the mobile device and a back-end service that runs the tasks offloaded from the application on MEC servers [rejiba2019survey]. In this way, MEC enables mobile devices with limited processing power to run complex applications with satisfactory QoS.
When considering user mobility along with the limited coverage of MEC servers, the communications between a mobile user and the user's service running on an edge server may go through multiple hops, which can severely degrade the QoS. To address this problem, the service can be dynamically migrated to a more suitable MEC server so that the QoS is maintained. Unfortunately, finding an optimal migration policy for such a problem is non-trivial due to the complex system dynamics and user mobility. Many existing works [ouyang2018follow, wang2019delay, wu2020mobility, chen2019dynamic, wang2019dynamic] proposed service migration solutions based on the Markov Decision Process (MDP) or Lyapunov optimization under the assumption of knowing the complete system-level information (e.g., available computation resources of MEC servers, profiles of offloaded tasks, and backhaul network conditions). Accordingly, they designed centralized controllers that make migration decisions for all mobile users in the MEC system.
The aforementioned methods have two potential drawbacks: 1) in a real-world MEC system, gathering complete system-level information can be difficult and time-consuming; 2) the centralized control approach suffers from a scalability issue, since its time complexity rapidly increases with the number of mobile users. To address these issues, some works proposed decentralized service migration methods based on the contextual Multi-Armed Bandit (MAB) [sun2017emm, ouyang2019adaptive, sun2018learning], where migration decisions are made on the user side with partially observed information. However, they did not consider the intrinsically large state space and complex dynamics of the MEC system, which may lead to unsatisfactory performance. A recent work [yuan2020joint] modeled the joint optimization of service migration and path selection as a partially observable Markov decision process (POMDP) solved by independent Q-learning, which can be unstable and inefficient when handling an MEC environment with a continuous state space (e.g., data size, CPU cycles, workload) and complex system dynamics.
To address the above challenges, we propose the Deep Recurrent Actor-Critic based service Migration (DRACM) method, which is user-centric and learns to make online migration decisions with incomplete system-level information based on Deep Reinforcement Learning (DRL). DRL can solve complex decision-making problems in various areas, including robotics [gu2017deep], games [ye2020mastering], and networking [chinchali2018cellular], making it an attractive approach. Distinct from existing works, we model the service migration problem as a POMDP with a continuous state space and develop a tailored off-policy actor-critic algorithm to efficiently solve it. The main contributions of this work are as follows:

We model the service migration problem as a POMDP to capture the intrinsically complex system dynamics in MEC. We solve the POMDP with a novel off-policy actor-critic method, the DRACM. Specifically, our method is model-free and can quickly learn effective migration policies through end-to-end reinforcement learning (RL), where the agent makes online migration decisions based on raw data sampled from the MEC environment with minimal human expertise.

A new encoder network that combines a Long Short-Term Memory (LSTM) network and an embedding matrix is designed to effectively extract hidden information from the sampled histories. Moreover, a tailored off-policy actor-critic algorithm with a clipped surrogate objective function is developed to substantially stabilize training and improve performance.

We demonstrate how to implement the DRACM in an emerging MEC framework, where migration decisions are made online through inference of the policy network, while training of the policy network is performed offline, saving the cost of directly interacting with the MEC environment.

Extensive experiments are conducted to evaluate the performance of the DRACM using real-world mobility traces. The results demonstrate that the DRACM has a stable training process and high adaptivity to different scenarios, while outperforming the online baseline algorithms and achieving near-optimal results.
The remainder of this paper is organized as follows. Section II formulates the service migration problem. Section III presents the background on DRL, the POMDP model for service migration, the details of the DRACM algorithm, and the implementation of the DRACM in the emerging MEC system. In Section IV, we evaluate the performance of the DRACM and five baseline algorithms on two real-world mobility traces with various MEC scenarios. We then review the related works in Section V. Finally, Section VI draws conclusions.
II Problem Formulation of Service Migration
As shown in Fig. 1, we consider a typical scenario where mobile users move in a geographical area covered by a set of MEC servers, $\mathcal{M}$, each of which is co-located with a base station. In the MEC system, mobile users can offload their computation tasks to services provided by the MEC servers. We define the MEC server that runs the service of a mobile user as the user's serving node, and the MEC server that directly connects with the mobile user as the user's local server. In general, the MEC servers are interconnected via stable backhaul links; thus, the mobile user can still access its service via multi-hop communication among MEC servers when it is no longer directly connected to the serving node. To maintain satisfactory QoS, the service should be dynamically migrated among the MEC servers as the user moves. In this paper, we use latency, consisting of the migration, computation, and communication delays, as the measure of QoS.
We consider a time-slotted model, where a user's location may only change at the beginning of each time slot. The time-slotted model is widely used to address the service migration problem [wang2019dynamic, ouyang2019adaptive, wang2019delay] and can be regarded as a sampled version of a continuous-time model. When a mobile user changes location, the user makes the migration decision for the current service and then offloads computation tasks to the serving node for processing. Denote the migration decision at time slot $t$ as $a_t$ ($a_t \in \mathcal{M}$), where $a_t$ can be any of the MEC servers in this area. In general, the migration, computation, and communication delays are expressed as follows.
Migration delay: The migration delay is incurred when a service is moved out from the previous serving node. In general, the migration delay is a non-decreasing function of the hop distance [wang2019dynamic, ouyang2019adaptive], $D^{m}_t = \mu\, d(a_{t-1}, a_t)$, where $d(a_{t-1}, a_t)$ is the hop distance between the current serving node $a_t$ and the previous one $a_{t-1}$, and $\mu$ is the coefficient of migration delay. The migration delay captures the service interruption time during migration, which increases with the hop distance due to the propagation and switching delays involved in transferring the service data.
Computation delay: At each time slot, the mobile user may offload computation tasks to the serving node for processing. The computing resources of MEC servers are shared by multiple mobile users to process their applications. At time slot $t$, we denote the sum of the required CPU cycles for processing the offloaded tasks as $c_t$, the workload of the serving node as $w_t$, and the total computing capacity of the serving node as $f_{a_t}$. We consider a weighted resource allocation strategy on each MEC server, where tasks are allocated computation resources proportional to their required CPU cycles. Therefore, the computation delay of running the offloaded tasks at time slot $t$, $D^{comp}_t$, can be calculated as
$D^{comp}_t = \dfrac{c_t}{f_{a_t} \cdot c_t/(c_t + w_t)} = \dfrac{c_t + w_t}{f_{a_t}}$.  (1)
Communication delay: After migrating the service, the communication delay is incurred when the mobile user offloads computation tasks to the serving node. Generally, the communication delay consists of two parts: the access delay between the mobile user and the local server, and the backhaul delay between the local server and the serving node. The access delay is determined by the wireless environment and the data size of the offloaded tasks. At time slot $t$, we denote the data size of the offloaded tasks as $u_t$ and the average upload rate of the wireless channel as $r_t$. Hence, the access delay, $D^{ac}_t$, can be expressed as
$D^{ac}_t = u_t / r_t$.  (2)
The backhaul delay is incurred by data transmission, propagation, processing, and queuing between the serving node and the local server through backhaul networks; it mainly depends on the hop distance along the shortest communication path and the data size of the offloaded tasks [yuan2020joint, wang2019dynamic, ouyang2019adaptive]. We denote the local server at time slot $t$ as $l_t$ ($l_t \in \mathcal{M}$) and the hop distance between the serving node and the local server as $d(a_t, l_t)$. The bandwidth of the outgoing link of the local server is denoted as $B$. Generally, the transmission delay of the computation results can be ignored because of their small data size. Consequently, the backhaul delay, $D^{bh}_t$, can be given by
$D^{bh}_t = d(a_t, l_t) \left( u_t / B + \eta \right)$,  (3)
where $\eta$ is the coefficient of the backhaul delay [yuan2020joint]. In particular, when the serving node and the mobile user are directly connected ($a_t = l_t$, i.e., $d(a_t, l_t) = 0$), there is no backhaul cost. Overall, the total communication delay at time slot $t$ can be obtained by
$D^{comm}_t = D^{ac}_t + D^{bh}_t$.  (4)
Given a finite time horizon $T$, our objective for the service migration problem is to obtain optimal migration decisions, $\{a_1, a_2, \ldots, a_T\}$, so that the sum of all the above costs (i.e., the total latency) is minimal. Formally, the objective is expressed as:
$\min_{\{a_1, \ldots, a_T\}} \sum_{t=1}^{T} \left( D^{m}_t + D^{comp}_t + D^{comm}_t \right)$.  (5)
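To make the delay model concrete, the per-slot cost in the objective can be sketched in Python. The function names and argument layout below are illustrative, not the paper's implementation; the formulas follow the delay definitions of Eqs. (1)-(4) as reconstructed above.

```python
def migration_delay(d_prev, mu):
    # Non-decreasing in the hop distance d_prev between the previous
    # and current serving nodes; mu is the migration-delay coefficient.
    return mu * d_prev

def computation_delay(c, w, f):
    # Weighted resource allocation: the user's tasks receive
    # f * c / (c + w) cycles/s, so the delay reduces to (c + w) / f.
    return (c + w) / f

def communication_delay(u, r, d_local, B, eta):
    # Access delay u/r plus per-hop backhaul delay along the shortest
    # path between the local server and the serving node.
    return u / r + d_local * (u / B + eta)

def slot_cost(d_prev, mu, c, w, f, u, r, d_local, B, eta):
    """Per-slot latency: migration + computation + communication delays."""
    return (migration_delay(d_prev, mu)
            + computation_delay(c, w, f)
            + communication_delay(u, r, d_local, B, eta))
```

Summing `slot_cost` over all time slots gives the quantity minimized in Eq. (5); the offline optimum chooses the serving node per slot to minimize that sum.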
Obtaining the optimal solution for the above objective is challenging, as it requires the user mobility and complete system-level information over the entire time horizon. However, in real-world scenarios, it is impractical to gather all the relevant information in advance. To address this challenge, we propose a learning-based online service migration method that can make efficient migration decisions based on partially observed information. In the next section, we present our solution in detail.
III Online Service Migration with Incomplete Information
Service migration in MEC is intrinsically a sequential decisionmaking problem with a partially observable environment (i.e., with incomplete system information), which can be naturally modeled as a POMDP. We solve the POMDP with the proposed DRACM method to provide effective online migration decisions. Before presenting the details of our solution, we first introduce the necessary backgrounds.
III-A Background on RL and POMDP
Reinforcement learning: RL can solve sequential decision-making problems by learning from interaction with the environment. In general, RL uses the formal framework of MDP, defined by a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, to represent the interaction between a learning agent and its environment. Specifically, $\mathcal{S}$ is the state space, $\mathcal{A}$ denotes the action space, $\mathcal{P}$ is the environment dynamics, $\mathcal{R}$ represents the reward function, and $\gamma$ is the discount factor. The policy, $\pi(a \mid s)$, represents the distribution over actions given a state $s$. The return from state $s_t$, defined as $G_t = \sum_{i=0}^{\infty} \gamma^{i} R_{t+i}$, is the sum of discounted rewards along a trajectory $\tau$. The goal of RL is to find an optimal policy $\pi^{*}$ so that the expected return, $\mathbb{E}_{\pi}[G_t]$, is maximal.
The action-value function is defined as the expected return after taking action $a$ in state $s$ and thereafter following policy $\pi$: $Q^{\pi}(s, a) = \mathbb{E}_{\pi}[G_t \mid s_t = s, a_t = a]$. The optimal action-value function, defined as $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$, is the maximum action value achievable by any policy for state $s$ and action $a$. Value-based DRL methods (e.g., deep Q-learning (DQL) [mnih2015human]) use a deep neural network to approximate the optimal action-value function, $Q(s, a; \theta) \approx Q^{*}(s, a)$, where $\theta$ are the parameters of the deep neural network. They obtain the policy by greedily selecting the action with the maximal action value, $a^{*} = \arg\max_{a} Q(s, a; \theta)$. However, since DQL uses a deterministic target policy and an $\epsilon$-greedy strategy to handle the trade-off between exploring and exploiting, it can have unsatisfactory performance and convergence issues when handling environments with a continuous state space (e.g., the MEC environment). In contrast, policy-based methods (e.g., asynchronous actor-critic [mnih2016asynchronous]) provide good convergence properties when dealing with the complexity of a continuous state space. They directly parameterize a stochastic policy with a deep neural network rather than deriving a deterministic policy from the action-value function. The parameters of the policy network are updated by performing gradient ascent on the expected return. In this paper, we build the DRACM on policy-based methods and show the performance comparison between the DQL-based method and the DRACM in Section IV.

Partially Observable Markov Decision Process: MDP assumes that states include complete information for decision-making. However, in many real-world scenarios, observing such states is intractable. Therefore, the POMDP, an extension of MDP, was proposed as a general model for sequential decision-making problems with partially observable environments; it is defined by a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \Omega, \mathcal{O}, \gamma)$. Fig. 2 shows the graphical model of a POMDP. Specifically, the state $s_t$ is latent, and the observation $o_t \in \Omega$ contains partial information about the latent state $s_t$. $\mathcal{O}(o_{t+1} \mid s_{t+1}, a_t)$ represents the observation distribution, which gives the probability of observing $o_{t+1}$ if action $a_t$ is performed and the resulting state is $s_{t+1}$. Since the state is latent, the learning agent cannot choose its action directly based on the state. Instead, it has to consider the complete history of its past actions and observations to choose its current action. Specifically, the history up to time step $t$ is defined by $h_t = (o_1, a_1, o_2, a_2, \ldots, a_{t-1}, o_t)$. Therefore, the key for RL-based methods to solve a POMDP is how to effectively infer the latent state from the history. In the literature, some RL methods [hausknecht2015deep, zhu2018improving] treat the latent states as deterministic, encoding the whole history with an RNN and using the hidden state of the RNN as input to the policy. Other works [watter2015embed, igl2018deep, zhang2019solar] explicitly infer the belief state, defined as the distribution over latent states (i.e., a stochastic latent state) given the history, and sample a latent state from this distribution as input to the policy. We use an LSTM for latent information extraction, which achieves excellent performance and is much easier to implement in MEC scenarios than methods based on inferring the belief state.

III-B POMDP modeling for the service migration problem
Key factors affecting the migration decision of a mobile user at a time slot include the mobility of the user, the profiles of the offloaded tasks, the workloads of the edge servers, and the resource allocations of the edge servers. Ideally, the user could make optimal migration decisions if it knew the complete information related to the decision-making process. However, some of this information is hard to obtain on the user side. For example, at each time slot, the workloads of edge servers are determined by the task requests from their associated mobile users and the available computation resources of the edge servers, and it is unlikely that a mobile user can obtain such information. To make effective decisions based on partially observable information, the POMDP is a natural choice for modeling the problem, as it gives the agent the ability to effectively estimate the outcome of its actions even when it cannot exactly observe the state of its environment. In our POMDP model, the mobile user treats the unobserved information (e.g., workloads and resource allocations of MEC servers) as part of the latent state. Unlike simplified models such as the MAB, the POMDP does not ignore the intrinsically large state space and complex dynamics of the service migration problem; thus, solving the POMDP can yield more effective decisions.
The detailed POMDP model of service migration is defined as follows:

Observation: The observation contains the information that is accessible on the user side, which is defined as a tuple of the local server $l_t$, the transmission rate of the wireless network $r_t$, the required CPU cycles of the computation tasks $c_t$, and the size of the transmission data $u_t$:
$o_t = (l_t, r_t, c_t, u_t)$.  (6)
Note that the geographical location of the mobile user is an indirect factor that affects the migration decisions: it determines the local server associated with the mobile user and affects the transmission rate (both included in our definition of the observation, Eq. (6)). Therefore, we define the local server, rather than the geographical location of the mobile user, as a component of the observation.

Action: At each time slot, the service can be migrated to any of the MEC servers in the area. Therefore, an action is defined as $a_t \in \mathcal{M}$.

Reward: The reward at each time slot, $R_t$, is defined as the negative sum of the migration, computation, and communication delays, which is formally expressed as
$R_t = -\left( D^{m}_t + D^{comp}_t + D^{comm}_t \right)$.  (7)
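To illustrate how these POMDP pieces fit together, the following toy environment exposes only the partial observation $(l_t, r_t, c_t, u_t)$ while keeping the server workload latent, and returns the negative latency as the reward. Everything here (class name, the 1-D hop-distance geometry, the sampling ranges) is an assumed sketch for illustration, not the paper's simulator.

```python
import random

class ServiceMigrationEnv:
    """Toy POMDP sketch: the latent state includes server workloads that
    the user cannot observe; the observation exposes only (l_t, r_t, c_t, u_t)."""
    def __init__(self, num_servers=4, mu=1.0, f=128e9, B=500e6, eta=0.02):
        self.M = num_servers
        self.mu, self.f, self.B, self.eta = mu, f, B, eta
        self.serving = 0          # current serving node
        self._observe()           # draw the initial observation

    def _observe(self):
        # Partial observation: local server, upload rate (bps),
        # required CPU cycles, and offloaded data size (bits).
        self.local = random.randrange(self.M)            # user moves
        self.rate = random.choice([60e6, 48e6, 36e6, 24e6, 12e6])
        self.cycles = random.uniform(1e9, 5e9)
        self.data = random.uniform(0.05, 5) * 8e6        # MB -> bits
        return (self.local, self.rate, self.cycles, self.data)

    def step(self, action):
        # Latent workload of the chosen server: never shown to the agent.
        workload = random.uniform(0, 64e9)
        d_mig = abs(action - self.serving)    # 1-D hop distance for the toy
        d_local = abs(action - self.local)
        delay = (self.mu * d_mig
                 + (self.cycles + workload) / self.f
                 + self.data / self.rate
                 + d_local * (self.data / self.B + self.eta))
        self.serving = action
        return self._observe(), -delay        # reward = negative latency
```

A learning agent interacting with this interface sees only observation/reward pairs, which is exactly the setting the DRACM's history encoder is designed for.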
Solving the above POMDP is non-trivial due to the complex dynamics and continuous state space of the MEC environment. In the next subsection, we present our method, the DRACM, to solve this POMDP.
III-C Deep Recurrent Actor-Critic based service Migration (DRACM)
Fig. 3 shows the overall architecture of the DRACM, which follows an end-to-end principle with the raw history sampled from the environment as input and the migration decisions as output. The DRACM consists of two parts: the encoder network and the learning agent, where the encoder network learns to effectively represent the latent state of the POMDP based on the history, and the learning agent learns to make effective migration decisions. The encoder network contains an LSTM that encodes the history up to time slot $t$ into the hidden state $\hat{h}_t$:
$\hat{h}_t = \phi_{\theta_e}(h_t)$,  (8)
where $\phi$ and $\theta_e$ represent the inner process and the parameters of the encoder network, respectively.
To improve the representation ability of the discrete features $a_{t-1}$ and $l_t$, we convert them into embeddings by looking up an embedding matrix, where $d$ is the dimension of the embedding vectors. Subsequently, the action embedding, the user-location embedding, and the remaining components of the observation are concatenated into a vector, $x_t$, which is fed into the LSTM to produce the hidden state $\hat{h}_t$.

The learning agent is based on a standard actor-critic structure. Both the actor and the critic are parameterized by neural networks with the hidden state $\hat{h}_t$ as input. We denote by $\theta_a$ and $\theta_c$ the parameters of the actor and critic networks, respectively. The actor network approximates the policy, $\pi_{\theta_a}(a_t \mid \hat{h}_t)$, which outputs a distribution over the action space at time step $t$ given $\hat{h}_t$. Meanwhile, the critic network, $V_{\theta_c}(\hat{h}_t)$, approximates the value function, i.e., an estimate of the expected return when starting from $\hat{h}_t$ and following the policy thereafter.
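The embedding-lookup-plus-LSTM idea can be sketched with a single hand-written LSTM cell in NumPy. The dimensions, random initialization, and gate layout below are assumptions for illustration; the paper's encoder is a learned network, not these fixed random weights.

```python
import numpy as np

def lstm_cell(x, h, c, W, U, b):
    """One LSTM step; W, U, b stack the input/forget/output/candidate gates."""
    z = W @ x + U @ h + b
    H = h.size
    sig = lambda v: 1.0 / (1.0 + np.exp(-v))
    i, f, o = sig(z[:H]), sig(z[H:2*H]), sig(z[2*H:3*H])
    g = np.tanh(z[3*H:])
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
M, d, H = 8, 2, 16                  # servers, embedding dim, hidden units (assumed)
E_act = rng.normal(size=(M, d))     # embedding matrix for the previous action
E_loc = rng.normal(size=(M, d))     # embedding matrix for the local server

def encode_step(prev_action, obs, h, c, W, U, b):
    l_t, r_t, c_t, u_t = obs
    # Look up embeddings for the discrete features, then concatenate with
    # the continuous observation components to form the LSTM input x_t.
    x = np.concatenate([E_act[prev_action], E_loc[l_t],
                        np.array([r_t, c_t, u_t])])
    return lstm_cell(x, h, c, W, U, b)

x_dim = 2 * d + 3
W = rng.normal(size=(4 * H, x_dim)) * 0.1
U = rng.normal(size=(4 * H, H)) * 0.1
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
h, c = encode_step(3, (2, 0.6, 0.4, 0.8), h, c, W, U, b)   # one history step
```

Repeating `encode_step` over the whole history yields the hidden state that the actor and critic consume; in practice this would be a framework LSTM trained jointly with the agent.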
Denote a trajectory sampled from the environment following policy $\pi_{\theta_a}$ as $\tau = (o_1, a_1, R_1, \ldots, o_T, a_T, R_T)$. The critic network can be updated by minimizing the mean squared error of one-step temporal differences on the sampled trajectories, which is formally defined as
$L(\theta_c) = \mathbb{E}_{\tau}\left[ \left( y_t - V_{\theta_c}(\hat{h}_t) \right)^2 \right]$,  (9)
$y_t = R_t + \gamma V_{\theta_c}(\hat{h}_{t+1})$,  (10)
where $\hat{h}_t$ can be obtained from Eq. (8). The objective of the actor is to find an optimal policy that maximizes the accumulated reward, which can be formally expressed as
$J(\theta_a) = \mathbb{E}_{\tau \sim \pi_{\theta_a}}\left[ \sum_{t=1}^{T} \gamma^{t-1} R_t \right]$.  (11)
The optimal policy can then be obtained by gradient ascent through the policy gradient with one-step actor-critic [sutton2018reinforcement], where the gradient of the above objective function can be calculated by
$\nabla_{\theta_a} J(\theta_a) = \mathbb{E}\left[ \nabla_{\theta_a} \log \pi_{\theta_a}(a_t \mid \hat{h}_t) \left( R_t + \gamma V_{\theta_c}(\hat{h}_{t+1}) - V_{\theta_c}(\hat{h}_t) \right) \right]$.  (12)
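A tiny numerical illustration of the one-step actor-critic quantities (as reconstructed in Eqs. (9)-(12)): the TD target, the critic's squared error, and the advantage that weights the policy gradient. The numbers are arbitrary examples.

```python
gamma = 0.95
R_t = -4.2                       # reward: negative total delay at slot t
V_t, V_next = -50.0, -47.0       # critic estimates V(h_t) and V(h_{t+1})

y_t = R_t + gamma * V_next       # TD target, Eq. (10)
critic_loss = (y_t - V_t) ** 2   # squared one-step TD error, Eq. (9)
advantage = y_t - V_t            # weights grad log pi(a_t | h_t) in Eq. (12)
```

A positive advantage (here $1.15$) increases the log-probability of the taken action; a negative one decreases it.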
Algorithm 1: Training process of the DRACM
1: Initialize the parameters of the behavior policy $\theta_{b}$, behavior encoder network $\theta_{e'}$, target policy $\theta_a$, target encoder network $\theta_e$, and critic network $\theta_c$.
2: for each training loop do
3:  Synchronize the parameters: $\theta_{b} \leftarrow \theta_a$, $\theta_{e'} \leftarrow \theta_e$.
4:  Sample a set of trajectories $\{\tau_i\}$ by running the behavior policy $\pi_{\theta_b}$ in the environment.
5:  Compute the advantage estimator, $\hat{A}_t$, according to Eq. (15).
6:  Update the parameters of the encoder network $\theta_e$, target policy network $\theta_a$, and critic network $\theta_c$ by minibatch gradient updates on the collected trajectories with Adam.
7: end for
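The advantage estimator of Eq. (15) and the clipped surrogate loss of Eqs. (13)-(14) used in the update step of Algorithm 1 can be sketched numerically as follows. This is a minimal NumPy sketch (the entropy regularization term is omitted), not the paper's training code.

```python
import numpy as np

def gae(rewards, values, gamma=0.95, lam=0.99):
    """Generalized advantage estimator over one trajectory.
    `values` has length len(rewards) + 1 (bootstrap value appended)."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def clipped_surrogate(logp_target, logp_behavior, adv, eps=0.2):
    """PPO-style clipped objective (a value to be maximized)."""
    ratio = np.exp(logp_target - logp_behavior)   # importance sampling ratio
    return np.minimum(ratio * adv,
                      np.clip(ratio, 1 - eps, 1 + eps) * adv).mean()
```

With identical behavior and target log-probabilities the ratio is 1 and the objective reduces to the mean advantage; when the ratio drifts outside $[1-\epsilon, 1+\epsilon]$, the clip removes the incentive for further movement.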
However, directly applying the above on-policy objective (i.e., using the same policy for training and sampling) has some drawbacks when solving the service migration problem. First, we cannot train the policy network offline with minibatches using the on-policy objective. This leads to a severe sample-efficiency problem, since the learning agent needs to resample trajectories from the environment after each gradient update; in the MEC system in particular, frequently interacting with the environment to obtain training samples is costly. Besides, the on-policy objective has limited exploration ability, so the policy can easily get stuck in a local optimum. To mitigate these problems, we design an off-policy algorithm (i.e., one that trains a policy different from the one used to sample the data) that can train the policy with minibatches and reduce the interaction frequency with the environment. Inspired by previous works on RL [schulman2017proximal, haarnoja2018soft, schulman2016high], we introduce an off-policy training method with a surrogate objective as follows:
$L(\theta_a) = \mathbb{E}_{\tau \sim \pi_{\theta_b}}\left[ \min\left( \rho_t(\theta_a) \hat{A}_t,\ \mathrm{clip}\left( \rho_t(\theta_a), 1-\epsilon, 1+\epsilon \right) \hat{A}_t \right) + \beta H\left( \pi_{\theta_a}(\cdot \mid \hat{h}_t) \right) \right]$,  (13)
$\rho_t(\theta_a) = \dfrac{\pi_{\theta_a}(a_t \mid \hat{h}_t)}{\pi_{\theta_b}(a_t \mid \hat{h}_t)}$,  (14)
$\hat{A}_t = \sum_{l=0}^{T-t} (\gamma \lambda)^{l} \delta_{t+l}, \quad \delta_t = R_t + \gamma V_{\theta_c}(\hat{h}_{t+1}) - V_{\theta_c}(\hat{h}_t)$,  (15)
where $\pi_{\theta_b}$ is the behavior policy used to sample trajectories, which does not participate in gradient updates, and $\pi_{\theta_a}$ is the target policy being optimized. $\rho_t(\theta_a)$ is the importance sampling ratio, which corrects the distribution mismatch between the behavior and target policies. Besides, we introduce $H(\pi_{\theta_a}(\cdot \mid \hat{h}_t))$ as a regularization term to further encourage exploration during training, where $H(\cdot)$ denotes the entropy of the policy and $\beta$ is a coefficient. However, off-policy methods are known to be unstable and hard to converge. To address this issue, the clip function, $\mathrm{clip}(\rho_t(\theta_a), 1-\epsilon, 1+\epsilon)$, is used to limit the value of the importance sampling ratio by removing the incentive for moving the ratio outside of the interval $[1-\epsilon, 1+\epsilon]$; it thus prevents very large policy updates and stabilizes the training. To reduce the variance of the training objective, we utilize the generalized advantage estimator [schulman2016high], $\hat{A}_t$, as given by Eq. (15), where $\lambda$ controls the trade-off between bias and variance.

Algorithm 1 summarizes the training process of the DRACM. Each training loop consists of the sampling process and the target-policy updating process. In the sampling process, we first synchronize the parameters of the behavior and target networks (including the policy and encoder networks), and then sample a set of trajectories from the environment using the behavior encoder and policy networks. The advantage estimator, $\hat{A}_t$, can then be obtained from the sampled trajectories. Next, in the target-policy updating process, we conduct several training loops to update the parameters of the encoder network, policy network, and critic network via minibatch stochastic gradient descent with Adam [kingma2014adam]. After training, the target policy and encoder networks can be deployed to the end device for making online migration decisions by neural network inference, which has a linear time complexity of $O(T)$, where $T$ is the length of the history. In the next subsection, we discuss how to implement the DRACM in the emerging MEC system.

III-D The DRACM-empowered MEC framework
The emerging MEC system defined by ETSI consists of three levels: the user level, the edge level, and the remote level [sabella2019developing]. The user level includes various mobile devices such as smartphones and vehicles. The edge level consists of multiple edge servers, where each server provides services for processing the tasks offloaded by mobile users. The edge servers are connected through backhaul links, so services can be migrated among them. The remote level includes data centers with large storage and computing capacities. Fig. 4 shows the overall framework for integrating the DRACM into the three-level MEC system. Four key components (experience collector, migration decision maker, experience pool, and target policy trainer) of the DRACM are deployed at the user and remote levels:

At the user level, the experience collector is responsible for collecting the observations and rewards from the MEC environment (Step 1⃝). It sends the history to the migration decision maker for online decision-making (Step 2⃝), and the collected trajectories to the experience pool for target policy training (Step 4⃝). The migration decision maker includes the behavior policy and encoder networks. It downloads parameters from the target policy trainer as the initial values of the behavior policy and encoder networks (Step 5⃝), and decides the migration actions based on the observed history (Step 3⃝).

At the remote level, the experience pool stores the sampled trajectories from mobile users. The target policy trainer is in charge of training the target policy based on the sampled trajectories.
According to Algorithm 1, the target policy trainer conducts multiple training loops with minibatch gradient updates based on the collected trajectories in the experience pool. Note that the training can be performed offline, without directly interacting with the MEC environment. After training, the target policy trainer sends the updated parameters of the policy and encoder networks to the mobile users for the next round of the sampling process.
IV Experiments
In this section, we present a comprehensive evaluation of the DRACM. Our experiments demonstrate that: 1) the DRACM has a stable and efficient training process; 2) the DRACM can autonomously adapt to different MEC scenarios, including various user task arrival rates, application processing densities, and coefficients of migration delay. We first introduce the experiment settings based on a real-world MEC environment. Next, we present the baseline algorithms for comparison. Finally, we evaluate the performance of the DRACM and the baseline algorithms in different MEC scenarios.
IV-A Experiment settings
We evaluate the DRACM with two real-world mobility traces of cabs in Rome, Italy [romataxi20140717] and San Francisco, USA [piorkowski2009crawdad]. Specifically, we focus our analysis on the central parts of Rome and San Francisco, as shown in Fig. 5. We consider 64 MEC servers deployed in each area, where each MEC server covers a 1 km × 1 km grid with a computation capacity of $f = 128$ GHz (i.e., four 16-core servers with 2 GHz for each core). According to [narayanan2020first], the upload rate of real-world commercial 5G networks is generally less than 60 Mbps. Therefore, in our environment, the upload rate in each grid is set to 60, 48, 36, 24, and 12 Mbps from the proximal end to the distal end. The hop distances between two MEC servers are calculated by the Manhattan distance. The location of an MEC server is represented by a 2D vector with respect to a reference location. To calculate the propagation latency, we set the bandwidth of the backhaul network, $B$, as 500 Mbps [ma2019efficient] and the coefficient of backhaul delay, $\eta$, as 0.02 s/hop [yuan2020joint]. The migration delay varies with the service type and network conditions; e.g., the migration delay of Busybox (a type of service) ranges from 2.4 to 3.3 seconds [ma2019efficient] under different backhaul bandwidths. Following related work on MEC [ouyang2019adaptive, wang2019delay, ma2019efficient], we assume the coefficient of migration delay, $\mu$ (in s/hop), is uniformly distributed during training.

At each time slot, the tasks arriving at a mobile user and those arriving at an MEC server are sampled from Poisson distributions with rates $\lambda_u$ and $\lambda_m$, respectively. In our experiments, we show the performance of the DRACM under different task arrival rates of mobile users. According to current works [nguyen2020privacy, chen2015efficient, zhan2020mobility], the data size of an offloaded task in real-world mobile applications often varies from 50 KB (sensor data) [nguyen2020privacy] to 5 MB (image data) [chen2015efficient]. Therefore, we set the data size of each offloaded task to be uniformly distributed in [0.05, 5] MB. The required CPU cycles of each task are calculated as the product of the data size and the processing density, $\rho$, which is uniformly distributed over a range (in cycles/bit) covering a wide variety of tasks from low to high computation complexity [kwak2015dream]. We summarize the parameter settings of our simulation environment in Table I.

TABLE I: Parameter settings of the simulation environment
Parameter | Value
Computation capacity of an MEC server, $f$ | 128 GHz
Upload rate of wireless network, $r_t$ | {60, 48, 36, 24, 12} Mbps
Bandwidth of backhaul network, $B$ | 500 Mbps
Coefficient of backhaul delay, $\eta$ | 0.02 s/hop
Coefficient of migration delay, $\mu$ | uniform (s/hop)
Data size of each offloaded task, $u_t$ | [0.05, 5] MB
Processing density of an offloaded task, $\rho$ | uniform (cycles/bit)
User's task arrival rate, $\lambda_u$ | 2 tasks/slot
MEC server's task arrival rate, $\lambda_m$ | Poisson (tasks/slot)
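To make the traffic model concrete, one slot's offloaded tasks can be drawn under the stated distributions. The processing density passed in below is an assumed placeholder value, since the paper's exact range is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_slot_tasks(arrival_rate=2.0, size_range_mb=(0.05, 5.0),
                      proc_density=1000.0):
    """One slot's offloaded tasks. `proc_density` (cycles/bit) is an
    assumed placeholder; the number of arrivals is Poisson-distributed."""
    k = rng.poisson(arrival_rate)                      # task arrivals this slot
    sizes = rng.uniform(*size_range_mb, size=k) * 8e6  # MB -> bits
    return [{"data_bits": s, "cpu_cycles": s * proc_density} for s in sizes]
```

The per-slot quantities $u_t$ and $c_t$ in the delay model are then the sums of `data_bits` and `cpu_cycles` over the returned tasks.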
TABLE II: Hyperparameters for training the DRACM
Hyperparameter | Value
LSTM Hidd. Units | 256
Embedding Dim. | 2
Actor Layer Type | Dense
Actor Hidd. Units | 128
Critic Layer Type | Dense
Critic Hidd. Units | 128
Learning Rate | 0.0005
Optimizer | Adam
Discount $\gamma$ | 0.95
Discount $\lambda$ | 0.99
Entropy Coefficient $\beta$ | 0.01
Clipping Value $\epsilon$ | 0.2
IV-B Baseline algorithms
We compare the performance of the DRACM to that of five baseline algorithms:

Always migrate (AM): The mobile user always migrates the service to its nearest MEC server at each time slot.

Never migrate (NM): The service is placed on an MEC server and is never migrated during the time horizon.

Multi-armed Bandit with Thompson Sampling (MABTS): Some existing works [sun2017emm, ouyang2019adaptive] solve the service migration problem based on the MAB. Following [ouyang2019adaptive], MABTS uses a diagonal Gaussian distribution to approximate the posterior of the cost for each arm and applies Thompson sampling to handle the trade-off between exploring and exploiting.
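The core of the Thompson sampling idea can be sketched with independent Gaussian posteriors per arm (one arm per candidate server); this is a generic illustration, not the exact MABTS procedure of [ouyang2019adaptive].

```python
import numpy as np

class GaussianThompsonSampler:
    """Each arm keeps a Gaussian posterior over its mean cost; at each
    step, sample from every posterior and pick the arm with the smallest
    sampled cost (we minimize latency)."""
    def __init__(self, num_arms, prior_var=100.0, noise_var=1.0):
        self.mu = np.zeros(num_arms)           # posterior means
        self.var = np.full(num_arms, prior_var)
        self.noise_var = noise_var
        self.rng = np.random.default_rng(0)

    def select(self):
        samples = self.rng.normal(self.mu, np.sqrt(self.var))
        return int(np.argmin(samples))

    def update(self, arm, cost):
        # Conjugate Gaussian update of the chosen arm's posterior.
        prior_precision = 1.0 / self.var[arm]
        precision = prior_precision + 1.0 / self.noise_var
        self.var[arm] = 1.0 / precision
        self.mu[arm] = self.var[arm] * (self.mu[arm] * prior_precision
                                        + cost / self.noise_var)
```

Arms with few observations keep wide posteriors and are sampled optimistically, which is how Thompson sampling balances exploration against exploitation.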

DQL-based migrate (DQLM): Some recent works [wang2019delay, wu2020mobility, chen2019dynamic, yuan2020joint] adopt DQL to tackle the service migration problem. For a fair comparison, we use a neural network structure similar to the DRACM's to approximate the action-value function for DQLM, but use the objective function of the DQL method as the training target. Moreover, we use $\epsilon$-greedy to control the exploring-exploiting trade-off, as the above works do.

Optimal migrate (OPTIM): Assuming that the user mobility trace and the complete system-level information over the time horizon are known in advance, the service migration problem can be transformed into a shortest-path problem [ouyang2018follow, wang2019delay], which can be solved by the Dijkstra algorithm.
The NM, AM, MABTS, and DQLM algorithms run online, while OPTIM is an offline algorithm that defines the performance upper bound of service migration algorithms.
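The shortest-path transformation behind OPTIM can be sketched on a layered (time-expanded) graph: one node per (slot, server) pair, with edge weights combining the migration delay and the next slot's computation and communication cost. The sketch below uses Dijkstra as in the cited works, although for this layered DAG a simple dynamic program would also suffice; the cost interface is an illustrative assumption.

```python
import heapq

def optimal_migration(costs, mig_cost):
    """costs[t][m]: computation + communication delay if the service sits on
    server m at slot t; mig_cost(m_prev, m): migration delay between servers.
    Returns (minimal total latency, optimal placement per slot)."""
    T, M = len(costs), len(costs[0])
    dist = {(0, m): costs[0][m] for m in range(M)}   # layer-0 node costs
    prev = {}
    pq = [(d, node) for node, d in dist.items()]
    heapq.heapify(pq)
    while pq:
        d, (t, m) = heapq.heappop(pq)
        if d > dist.get((t, m), float("inf")):       # stale queue entry
            continue
        if t == T - 1:
            continue
        for m2 in range(M):                          # edges to the next layer
            nd = d + mig_cost(m, m2) + costs[t + 1][m2]
            if nd < dist.get((t + 1, m2), float("inf")):
                dist[(t + 1, m2)] = nd
                prev[(t + 1, m2)] = m
                heapq.heappush(pq, (nd, (t + 1, m2)))
    end = min(range(M), key=lambda m: dist[(T - 1, m)])
    total, plan, m = dist[(T - 1, end)], [end], end
    for t in range(T - 1, 0, -1):                    # backtrack the plan
        m = prev[(t, m)]
        plan.append(m)
    return total, plan[::-1]
```

With two servers, `costs = [[1, 5], [5, 1], [1, 5]]`, and a migration cost of 3 per move, staying on server 0 throughout (total latency 7) beats chasing the per-slot cheapest server.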
IV-C Evaluation of the DRACM and baseline algorithms
We first evaluate the training performance of the DRACM and DQLM on the two mobility trace datasets [romataxi20140717, piorkowski2009crawdad]. Each training dataset includes 100 randomly picked mobility traces, where each trace has 100 time slots of three minutes each. Table II lists the hyperparameters used in training. The neural network structure of the DQLM is similar to that of the DRACM, with the same encoder network. The difference is that, rather than using the actor-critic structure, the DQLM is based on a Q-network that includes a fully connected layer with 128 hidden units to approximate the action-value function, and it chooses the action with the largest action value at each time step. We train the DQLM and the DRACM with the same learning rate, minibatch size, and number of gradient update steps.
Figs. 6 and 7 show the training results of the DRACM and DQLM on the mobility traces of Rome and San Francisco, respectively. The other baseline algorithms do not involve training neural networks, so we show only their final performance. The network parameters of both the DRACM and DQLM are initialized with random values, so before training they select actions randomly to explore the environment and achieve the worst results among all algorithms. However, the DRACM quickly surpasses NM and AM after 500 gradient updates and keeps improving on both sets of mobility traces. After 1000 gradient updates, the average total reward of the DRACM remains stable, which shows its excellent convergence property. Moreover, the final stable results of the DRACM on both sets of mobility traces beat all baseline algorithms. Compared to the DQLM, the proposed DRACM has two main advantages: 1) The DQLM uses ε-greedy to control the tradeoff between exploration and exploitation and obtains a deterministic policy based on the learned Q-network. When handling an MEC environment with stochastic dynamics and a continuous state space, the training of the DQLM can be unstable and inefficient. In contrast, the DRACM directly learns a stochastic policy that handles the exploration-exploitation tradeoff, so it achieves faster and more stable learning. 2) The DRACM obtains better results: its off-policy objective helps alleviate the problem of getting stuck in local optima.
To evaluate the generalization ability of the DRACM, we test the trained target policy on the testing datasets of both mobility traces, where each testing dataset includes 30 randomly picked mobility traces that are not included in the training dataset. Figs. 8 and 9 present the average total latency of the DRACM and baseline algorithms on the Rome and San Francisco mobility traces, respectively. We find that the DRACM achieves the best performance among the online algorithms on both sets of mobility traces. Specifically, Fig. 8 shows that the DRACM outperforms the DQLM and MABTS by 18% and 13%, respectively, while Fig. 9 indicates that the DRACM surpasses the DQLM and MABTS by 44% and 23%, respectively. Furthermore, the DRACM achieves near-optimal results within 12% of the optimum on both sets of mobility traces.
We then test the DRACM and baseline algorithms with different task arrival rates on both sets of mobility traces. As shown in Figs. 10 and 11, the average total latencies of all evaluated algorithms increase as the user's task arrival rate rises, since the average number of offloaded tasks at each time slot increases. The evaluation results show that the DRACM adapts well to different task arrival rates, outperforming the DQLM and MABTS by up to 24% and 45%, respectively. Moreover, in all cases, the results of the DRACM are close to the optimal values.
Next, we investigate the performance of the DRACM with different processing densities. For a real-world mobile application, the higher the processing density, the more computing power is required to process the application. Figs. 12 and 13 depict the average total latency of the DRACM on the Rome and San Francisco mobility traces, respectively. We find that the DRACM adapts well to changes in processing density on both sets of mobility traces, where it outperforms all online baselines.
Migration delay is another important factor that influences the overall latency. To investigate its impact, we evaluate the DRACM and baseline algorithms on the testing datasets with different coefficients of migration delay. Intuitively, when the migration delay is high, a mobile user may choose not to migrate services frequently. As shown in Figs. 14 and 15, the NM algorithm maintains similar performance in all cases, while the performance of the other algorithms drops as the migration delay coefficient increases. This is because the NM does not involve the migration process and thus incurs no migration delay. In Fig. 14, we find that the MABTS suffers serious performance degradation as the coefficient increases. When the coefficient is low, the MABTS achieves results similar to those of the DRACM; however, as it grows, the performance of the MABTS becomes even worse than that of the DQLM. Compared to RL-based methods such as the DQLM and DRACM, the MABTS is "short-sighted," since it only considers the one-step reward rather than explicitly optimizing the total reward over the entire time horizon. Overall, the DRACM autonomously learns to adapt to scenarios with different migration delays, achieving the best performance among the online baselines (with up to 25% improvement over the MABTS and up to 42% improvement over the DQLM) and obtaining near-optimal results in our experiments.
The DRACM method has many advantages: 1) its learning-based nature makes it flexible across different scenarios with little human expertise; 2) its user-centric design scales with the increasing number of mobile users, where each mobile user makes effective online migration decisions based on incomplete system-level information; 3) its tailored off-policy training objective improves both the performance and stability of the training process; 4) its combination of online decision-making and offline policy training makes the DRACM practical for real-world MEC systems. Beyond the scope of service migration, the framework of the DRACM has the potential to solve further decision-making problems in MEC systems such as task offloading and resource allocation [mao2017survey].
V Related Work
Service migration in MEC has attracted intensive research interest in recent years. Rejiba et al. [rejiba2019survey] published a comprehensive survey on mobility-induced service migration in fog, edge, and related computing paradigms. We roughly classify the related work into the centralized control approach (the central cloud or MEC servers make service migration decisions for all mobile users) and the decentralized control approach (each mobile user makes its own migration decisions).
Centralized control approach: Plenty of works focus on making centralized migration decisions based on complete system-level information to minimize the total cost. Ouyang et al. [ouyang2018follow] converted the service migration problem into an online queue stability control problem and applied Lyapunov optimization to solve it. Xu et al. [xu2020path] formulated service migration as a multi-objective optimization framework and proposed a method to achieve a weak Pareto optimal solution. Wang et al. [wang2019dynamic] formulated service migration as a finite-state MDP and proposed an approximation of the underlying state space; they solved the finite-state MDP using a modified policy-iteration algorithm. Other recent works tackled the service migration problem with RL. Wang et al. [wang2019delay] proposed a Q-learning based microservice migration algorithm for mobile edge computing. Chen et al. [chen2019dynamic] built a practical platform for dynamic service migration and used a Q-learning based method to obtain the migration strategy. Wu et al. [wu2020mobility] considered jointly optimizing task offloading and service migration, and proposed a Q-learning based method combined with predicted user mobility. These works consider the case where the decision-making agent knows the complete system-level information. However, in a practical MEC system, collecting complete system-level information can be difficult and time-consuming. Moreover, the centralized control approach may suffer from scalability issues when facing a rapidly increasing number of mobile users.
Decentralized control approach: Some studies proposed making migration decisions on the user side based on incomplete system-level information. Ouyang et al. [ouyang2019adaptive] formulated the service migration problem as an MAB and proposed a Thompson-sampling based algorithm that explores the dynamic MEC environment to make adaptive service migration decisions. Sun et al. [sun2018learning] proposed an MAB-based service placement framework for vehicle cloud computing, which enables a vehicle to learn to select effective neighboring vehicles for its service. Sun et al. [sun2017emm] developed a user-centric service migration framework using MAB and Lyapunov optimization to minimize latency under energy consumption constraints. These methods simplify the system dynamics by modeling them with an MAB, which ignores the inherently large state space and the complex transitions among states in a real-world MEC system. Distinguished from the above works, our method models the service migration problem as a POMDP with a continuous state space and complex transitions between states. Moreover, our method is model-free and adaptive to different scenarios, and it can learn to make online service migration decisions with minimal expert knowledge. More recently, Yuan et al. [yuan2020joint] investigated the joint service migration and mobility optimization problem for vehicular edge computing. They modeled the MEC environment as a POMDP and proposed a multi-agent DRL method based on independent Q-learning to learn the policy. However, using a Q-learning based method in an environment with complex dynamics and a continuous state space can be unstable and inefficient. Our evaluation results show that our method achieves more stable training and better results than the DQL-based method.
VI Conclusion
In this paper, we proposed the DRACM, a new method for solving the service migration problem in MEC given incomplete system-level information. Our method is completely model-free and can learn to make online migration decisions through end-to-end RL training with minimal human expertise. Specifically, we modeled the service migration problem in MEC as a POMDP. To solve the POMDP, we designed an encoder network that combines an LSTM and an embedding matrix to effectively extract hidden information from sampled histories. Moreover, we proposed a tailored off-policy actor-critic algorithm with a clipped surrogate objective to improve training performance. We demonstrated the implementation of the DRACM in the emerging MEC framework, where migration decisions are made online on the user side and the policy is trained offline without directly interacting with the environment. We evaluated the DRACM and four online baseline algorithms on real-world datasets, demonstrating that the DRACM consistently outperforms the online baselines and achieves near-optimal results across a diverse set of scenarios.