Online Service Migration in Edge Computing with Incomplete Information: A Deep Recurrent Actor-Critic Method

12/16/2020 ∙ by Jin Wang, et al. ∙ University of Exeter

Multi-access Edge Computing (MEC) is a key technology in the fifth-generation (5G) network and beyond. MEC extends cloud computing to the network edge (e.g., base stations, MEC servers) to support emerging resource-intensive applications on mobile devices. As a crucial problem in MEC, service migration needs to decide where to migrate user services for maintaining high Quality-of-Service (QoS), when users roam between MEC servers with limited coverage and capacity. However, finding an optimal migration policy is intractable due to the highly dynamic MEC environment and user mobility. Many existing works make centralized migration decisions based on complete system-level information, which can be time-consuming and suffer from the scalability issue with the rapidly increasing number of mobile users. To address these challenges, we propose a new learning-driven method, namely Deep Recurrent Actor-Critic based service Migration (DRACM), which is user-centric and can make effective online migration decisions given incomplete system-level information. Specifically, the service migration problem is modeled as a Partially Observable Markov Decision Process (POMDP). To solve the POMDP, we design an encoder network that combines a Long Short-Term Memory (LSTM) and an embedding matrix for effective extraction of hidden information. We then propose a tailored off-policy actor-critic algorithm with a clipped surrogate objective for efficient training. Results from extensive experiments based on real-world mobility traces demonstrate that our method consistently outperforms both the heuristic and state-of-the-art learning-driven algorithms, and achieves near-optimal results on various MEC scenarios.


I Introduction

Recent years have witnessed a boom in emerging mobile applications such as augmented reality, virtual reality, and interactive gaming. These types of applications require intensive computing power for real-time processing, which often exceeds the limited computing and storage capabilities of mobile devices. To resolve this issue, Multi-access Edge Computing (MEC) [sabella2019developing], a key technology in the fifth-generation (5G) network, was proposed to meet the ever-increasing QoS demands of mobile applications. MEC provides computing and storage resources at the network edge (close to users), which can effectively cut down application latency and improve the Quality-of-Service (QoS). Specifically, a mobile application empowered by MEC consists of a front-end component running on mobile devices and a back-end service that runs the tasks offloaded from the application on MEC servers [rejiba2019survey]. In this way, MEC enables mobile devices with limited processing power to run complex applications with satisfactory QoS.

When considering user mobility along with the limited coverage of MEC servers, the communication between a mobile user and the user's service running on an edge server may go through multiple hops, which would severely affect the QoS. To address this problem, the service could be dynamically migrated to a more suitable MEC server so that the QoS is maintained. Unfortunately, finding an optimal migration policy for such a problem is non-trivial due to the complex system dynamics and user mobility. Many existing works [ouyang2018follow, wang2019delay, wu2020mobility, chen2019dynamic, wang2019dynamic] proposed service migration solutions based on the Markov Decision Process (MDP) or Lyapunov optimization under the assumption of knowing the complete system-level information (e.g., available computation resources of MEC servers, profiles of offloaded tasks, and backhaul network conditions). Accordingly, they designed centralized controllers that make migration decisions for all mobile users in the MEC system.

The aforementioned methods have two potential drawbacks: 1) in a real-world MEC system, gathering complete system-level information can be difficult and time-consuming; 2) the centralized control approach suffers from a scalability issue, since its time complexity increases rapidly with the number of mobile users. To address these issues, some works proposed decentralized service migration methods based on the contextual Multi-Armed Bandit (MAB) [sun2017emm, ouyang2019adaptive, sun2018learning], where migration decisions are made on the user side with partially observed information. However, they did not consider the intrinsically large state space and complex dynamics of the MEC system, which may lead to unsatisfactory performance. A recent work [yuan2020joint] modeled the joint optimization of service migration and path selection as a partially observable Markov decision process (POMDP) solved by independent Q-learning, which can be unstable and inefficient when handling an MEC environment with a continuous state space (e.g., data size, CPU cycles, workload) and complex system dynamics.

To address the above challenges, we propose a Deep Recurrent Actor-Critic based service Migration (DRACM) method, which is user-centric and can learn to make online migration decisions with incomplete system-level information based on Deep Reinforcement Learning (DRL). DRL is able to solve complex decision-making problems in various areas, including robotics [gu2017deep], games [ye2020mastering], and networks [chinchali2018cellular], making it an attractive approach. In contrast to existing works, we model the service migration problem as a POMDP with a continuous state space and develop a tailored off-policy actor-critic algorithm to efficiently solve the POMDP. The main contributions of this work are listed as follows:

  • We model the service migration problem as a POMDP to capture the intrinsically complex system dynamics in the MEC. We solve the POMDP by proposing a novel off-policy actor-critic method, DRACM. Specifically, our method is model-free and can quickly learn effective migration policies through end-to-end reinforcement learning (RL), where the agent makes online migration decisions based on the sampled raw data from the MEC environment with minimal human expertise.

  • A new encoder network that combines a Long Short-Term Memory (LSTM) and an embedding matrix is designed to effectively extract the hidden information from the sampled histories. Moreover, a tailored off-policy actor-critic algorithm with a clipped surrogate objective function is developed to substantially stabilize the training and improve the performance.

  • We demonstrate how to implement the DRACM in an emerging MEC framework, where the migration decisions can be made online through the inference of the policy network, while the training of the policy network can be offline, saving the cost of directly interacting with the MEC environment.

  • Extensive experiments are conducted to evaluate the performance of the DRACM using real-world mobility traces. The results demonstrate that the DRACM has a stable training process and high adaptivity to different scenarios, outperforms the online baseline algorithms, and achieves near-optimal results.

The remainder of this paper is organized as follows. Section II gives the problem formulation of service migration. Section III presents the DRL background, the POMDP model for service migration, the details of the DRACM algorithm, and the implementation of the DRACM in the emerging MEC system. In Section IV, we evaluate the performance of the DRACM and five baseline algorithms on two real-world mobility traces with various MEC scenarios. We then review the related work in Section V. Finally, Section VI concludes the paper.

II Problem Formulation of Service Migration

As shown in Fig. 1, we consider a typical scenario where mobile users move in a geographical area covered by a set of MEC servers, M, each of which is co-located with a base station. In the MEC system, mobile users can offload their computation tasks to the services provided by MEC servers. We define the MEC server that runs the service of a mobile user as the user's serving node, and the MEC server that directly connects with the mobile user as the user's local server. In general, the MEC servers are interconnected via stable backhaul links; thus a mobile user can still access its service via multi-hop communication among MEC servers when it is no longer directly connected to the serving node. To maintain satisfactory QoS, the service should be dynamically migrated among the MEC servers as the user moves. In this paper, we use latency, consisting of migration, computation, and communication delays, as the measure of QoS.

We consider a time-slotted model, where a user's location may only change at the beginning of each time slot. The time-slotted model is widely used to address the service migration problem [wang2019dynamic, ouyang2019adaptive, wang2019delay] and can be regarded as a sampled version of a continuous-time model. When a mobile user changes location, the user makes the migration decision for the current service and then offloads computation tasks to the serving node for processing. Denote the migration decision (i.e., the chosen serving node) at time slot t as a_t ∈ M, which can be any of the MEC servers in this area. In general, the migration, computation, and communication delays are expressed as follows.

Fig. 1: An example of service migration in MEC.

Migration delay: The migration delay is incurred when a service is moved out of the previous serving node. In general, the migration delay is a non-decreasing function of the hop distance [wang2019dynamic, ouyang2019adaptive], D_t^mig = η d(a_{t−1}, a_t), where d(a_{t−1}, a_t) is the hop distance between the current serving node a_t and the previous one a_{t−1}, and η is the coefficient of migration delay. The migration delay captures the service interruption time during migration, which increases with the hop distance due to the propagation and switching delays involved in transferring the service data.

Computation delay: At each time slot, the mobile user may offload computation tasks to the serving node for processing. The computing resources of an MEC server are shared by multiple mobile users to process their applications. At time slot t, we denote the sum of the required CPU cycles for processing the offloaded tasks as c_t, the workload (the CPU cycles required by other users' tasks) of the serving node as w_t, and the total computing capacity of the serving node as f. We consider a weighted resource allocation strategy on each MEC server, where tasks are allocated computation resources proportional to their required CPU cycles, i.e., the offloaded tasks receive f · c_t / (c_t + w_t) CPU cycles per second. Therefore, the computation delay of running the offloaded tasks at time slot t can be calculated as

D_t^comp = c_t / (f · c_t / (c_t + w_t)) = (c_t + w_t) / f.    (1)

Communication delay: After migrating the service, a communication delay is incurred when the mobile user offloads computation tasks to the serving node. Generally, the communication delay consists of two parts: the access delay between the mobile user and the local server, and the backhaul delay between the local server and the serving node. The access delay is determined by the wireless environment and the data size of the offloaded tasks. At time slot t, we denote the data size of the offloaded tasks as b_t and the average upload rate of the wireless channel as v_t. Hence, the access delay can be expressed as

D_t^acc = b_t / v_t.    (2)

The backhaul delay is incurred by the data transmission, propagation, processing, and queuing between the serving node and the local server through the backhaul network, and mainly depends on the hop distance along the shortest communication path and the data size of the offloaded tasks [yuan2020joint, wang2019dynamic, ouyang2019adaptive]. We denote the local server at time slot t as u_t ∈ M and the hop distance between the serving node and the local server as d(u_t, a_t). The bandwidth of the outgoing link of the local server is denoted as B. Generally, the transmission delay of the computation results can be ignored because of their small data size. Consequently, the backhaul delay can be given by

D_t^bh = (b_t / B + κ) d(u_t, a_t),    (3)

where κ is the coefficient of the backhaul delay [yuan2020joint]. In particular, when the serving node and the mobile user are directly connected (d(u_t, a_t) = 0), there is no backhaul cost. Overall, the total communication delay at time slot t can be obtained by

D_t^comm = D_t^acc + D_t^bh.    (4)
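To make the delay model above concrete, the following minimal Python sketch computes the three per-slot delay terms under the forms assumed in Eqs. (1)-(4) (linear migration delay, proportional resource sharing, and a per-hop backhaul term). All function and variable names (`eta`, `kappa`, `hops`, etc.) are illustrative rather than the authors' implementation.

```python
def migration_delay(prev_node, cur_node, hops, eta=1.0):
    """Migration delay: proportional to the hop distance between the
    previous and the current serving node (Eq. (1)-style linear form)."""
    return eta * hops(prev_node, cur_node)

def computation_delay(task_cycles, server_workload, capacity):
    """Proportional resource allocation: the user receives
    capacity * task_cycles / (task_cycles + server_workload) cycles/s,
    so the delay collapses to (task_cycles + server_workload) / capacity."""
    return (task_cycles + server_workload) / capacity

def communication_delay(data_bits, uplink_rate, backhaul_bw,
                        hops_to_serving, kappa=0.02):
    """Access delay (wireless uplink) plus a per-hop backhaul term that
    vanishes when the local server is also the serving node."""
    access = data_bits / uplink_rate
    backhaul = (data_bits / backhaul_bw + kappa) * hops_to_serving
    return access + backhaul

def slot_cost(prev_node, cur_node, local_node, hops, task_cycles,
              server_workload, capacity, data_bits, uplink_rate,
              backhaul_bw, eta=1.0, kappa=0.02):
    """Total latency of one time slot; the RL reward is its negative."""
    return (migration_delay(prev_node, cur_node, hops, eta)
            + computation_delay(task_cycles, server_workload, capacity)
            + communication_delay(data_bits, uplink_rate, backhaul_bw,
                                  hops(cur_node, local_node), kappa))
```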

Given a finite time horizon T, our objective for the service migration problem is to obtain migration decisions a_1, a_2, ..., a_T that minimize the sum of all the above costs (i.e., the total latency). Formally, the objective is expressed as:

min_{a_1, ..., a_T} Σ_{t=1}^{T} (D_t^mig + D_t^comp + D_t^comm).    (5)

Obtaining the optimal solution for the above objective is challenging, since it requires the user mobility and the complete system-level information over the entire time horizon. However, in real-world scenarios, it is impractical to gather all the relevant information in advance. To address this challenge, we propose a learning-based online service migration method that makes efficient migration decisions based on partially observed information. In the next section, we present our solution in detail.

III Online Service Migration with Incomplete Information

Service migration in MEC is intrinsically a sequential decision-making problem with a partially observable environment (i.e., with incomplete system information), which can be naturally modeled as a POMDP. We solve the POMDP with the proposed DRACM method to provide effective online migration decisions. Before presenting the details of our solution, we first introduce the necessary background.

Fig. 2: Graphical model of POMDP.

III-A Background of RL and POMDP

Reinforcement learning: RL solves sequential decision-making problems by learning from interaction with the environment. In general, RL uses the formal framework of the MDP, defined by a tuple (S, A, P, R, γ), to represent the interaction between a learning agent and its environment. Specifically, S is the state space, A denotes the action space, P is the environment dynamics, R represents the reward function, and γ ∈ [0, 1) is the discount factor. The policy, π(a|s), represents the distribution over actions given a state s. The return from state s_t, defined as G_t = Σ_{k=0}^{∞} γ^k r_{t+k}, is the sum of discounted rewards along a trajectory τ. The goal of RL is to find an optimal policy π* that maximizes the expected return E_{τ∼π}[G_0].

The action-value function is defined as the expected return after taking action a in state s and thereafter following policy π: Q^π(s, a) = E_π[G_t | s_t = s, a_t = a]. The optimal action-value function, Q*(s, a) = max_π Q^π(s, a), is the maximum action value achievable by any policy for state s and action a. Value-based DRL methods (e.g., deep Q-learning (DQL) [mnih2015human]) use a deep neural network to approximate the optimal action-value function, Q(s, a; θ) ≈ Q*(s, a), where θ are the parameters of the deep neural network. They obtain the policy by greedily selecting the action with the maximal action value, a = argmax_{a'} Q(s, a'; θ). However, since DQL uses a deterministic target policy and an ε-greedy strategy to handle the trade-off between exploration and exploitation, it can suffer from unsatisfactory performance and convergence issues when handling environments with a continuous state space (e.g., the MEC environment). In contrast, policy-based methods (e.g., asynchronous advantage actor-critic [mnih2016asynchronous]) provide better convergence properties for dealing with the complexity of a continuous state space. They directly parameterize a stochastic policy with a deep neural network rather than derive a deterministic policy from the action-value function. The parameters of the policy network are updated by performing gradient ascent on the expected return. In this paper, we build the DRACM on policy-based methods and show a performance comparison between a DQL-based method and the DRACM in Section IV.

Partially Observable Markov Decision Process: The MDP assumes that states contain complete information for decision-making. However, in many real-world scenarios, observing such states is intractable. The POMDP, an extension of the MDP, is therefore used as a general model for sequential decision-making with a partially observable environment; it is defined by a tuple (S, A, P, R, Ω, O, γ), where Ω is the observation space. Fig. 2 shows the graphical model of the POMDP. Specifically, the state s_t is latent and the observation o_t ∈ Ω contains only partial information about the latent state s_t. O(o_{t+1} | s_{t+1}, a_t) represents the observation distribution, which gives the probability of observing o_{t+1} if action a_t is performed and the resulting state is s_{t+1}. Since the state is latent, the learning agent cannot choose its action directly based on the state. Instead, it has to consider the complete history of its past actions and observations to choose its current action. Specifically, the history up to time step t is defined as H_t = {o_1, a_1, o_2, a_2, ..., a_{t−1}, o_t}. Therefore, the key for RL-based methods to solve a POMDP is how to effectively infer the latent state from the history. In the literature, some RL methods [hausknecht2015deep, zhu2018improving] treat the latent states as deterministic, encoding the whole history with an RNN and using the hidden state of the RNN as input to the policy. Other works [watter2015embed, igl2018deep, zhang2019solar] explicitly infer the belief state, defined as the distribution over latent states (i.e., a stochastic latent state) given the history, and sample a latent state from this distribution as input to the policy. We use an LSTM for latent information extraction, which achieves excellent performance and is much easier to implement in MEC scenarios compared to methods based on inferring the belief state. In the next subsection, we present the motivation for the POMDP modeling of the service migration problem and the detailed definition of the model.

III-B POMDP modeling for the service migration problem

Key factors that affect the migration decision of a mobile user at a time slot include the mobility of the user, the profiles of the offloaded tasks, the workloads of the edge servers, and the resource allocations of the edge servers. Ideally, the user could make optimal migration decisions if it knew the complete information related to the decision-making process. However, some of this information is hard to obtain on the user side. For example, at each time slot, the workloads of the edge servers are determined by the task requests from their associated mobile users and the available computation resources of the edge servers, and it is unlikely that a mobile user can obtain such information. To make effective decisions based on partially observable information, the POMDP is a natural choice for modeling the problem, since it gives the agent the ability to effectively estimate the outcome of its actions even when it cannot exactly observe the state of its environment. In our POMDP model, the mobile user treats the unobserved information (e.g., the workloads and resource allocations of MEC servers) as part of the latent state. Unlike simplified models such as the MAB, the POMDP does not ignore the intrinsically large state space and complex dynamics of the service migration problem; thus solving the POMDP can result in more effective decisions.

The detailed POMDP model of service migration is defined as follows:

  • Observation: The observation contains the information that is accessible on the user side, which is defined as a tuple of the local server u_t, the transmission rate of the wireless network v_t, the required CPU cycles of the computation tasks c_t, and the size of the transmission data b_t:

    o_t = (u_t, v_t, c_t, b_t).    (6)

    Note that the geographical location of the mobile user is an indirect factor that affects the migration decisions: it determines the local server associated with the mobile user and affects the transmission rate, both of which are included in our definition of the observation, Eq. (6). Therefore, we define the local server, rather than the geographical location of the mobile user, as a component of the observation.

  • Action: At each time slot, the service can be migrated to any of the MEC servers in the area. Therefore, an action is defined as a_t ∈ M.

  • Reward: The reward at each time slot is defined as the negative sum of the migration, computation, and communication delays (a minimal code sketch of the observation and reward is given after this list), which is formally expressed as

    r_t = −(D_t^mig + D_t^comp + D_t^comm).    (7)
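Putting the observation, action, and reward together, the following sketch (reusing the `slot_cost` helper from the Section II sketch) shows the user-centric POMDP interface. The field and argument names are illustrative, and the quantities passed to `reward` beyond the observation are exactly the latent information the agent never sees.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    local_server: int    # u_t: the MEC server the user is currently attached to
    uplink_rate: float   # v_t: average wireless upload rate (bits/s)
    task_cycles: float   # c_t: CPU cycles required by the offloaded tasks
    data_bits: float     # b_t: data size of the offloaded tasks (bits)

def reward(obs: Observation, action: int, prev_action: int, hops,
           server_workload, capacity, backhaul_bw):
    """r_t = -(migration + computation + communication delay).
    `server_workload`, `capacity`, and `backhaul_bw` belong to the latent
    state: the environment uses them, but the agent never observes them."""
    return -slot_cost(prev_action, action, obs.local_server, hops,
                      obs.task_cycles, server_workload, capacity,
                      obs.data_bits, obs.uplink_rate, backhaul_bw)
```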

Solving this POMDP is non-trivial due to the complex dynamics and continuous state space of the MEC environment. In the next subsection, we present our method, the DRACM, for solving it.

III-C Deep Recurrent Actor-Critic based service Migration (DRACM)

Fig. 3: The architecture of the DRACM.

Fig. 3 shows the overall architecture of the DRACM, which follows an end-to-end principle: raw histories sampled from the environment are the input and migration decisions are the output. The DRACM consists of two parts, the encoder network and the learning agent, where the encoder network learns to effectively represent the latent state of the POMDP based on the history, and the learning agent learns to make effective migration decisions. The encoder network includes an LSTM that encodes the history H_t up to time slot t into the hidden state h_t:

h_t = φ(h_{t−1}, a_{t−1}, o_t; θ_e),    (8)

where φ and θ_e represent the inner process and the parameters of the encoder network, respectively.

To improve the representational ability of the discrete features a_{t−1} and u_t, we convert them into embeddings by looking up an embedding matrix, where D is the dimension of the embedding vectors. Subsequently, the action embedding, the user-location embedding, and the remaining components of the observation are concatenated into a vector that is fed into the LSTM to produce the hidden state h_t.

The learning agent is based on a standard actor-critic structure. Both the actor and the critic are parameterized by neural networks that take the hidden state h_t as input. We denote θ_π and θ_v as the parameters of the actor and critic networks, respectively. The actor network approximates the policy, π_{θ_π}(a_t | h_t), which outputs a distribution over the action space at time step t given h_t. Meanwhile, the critic network, V_{θ_v}(h_t), approximates the value function, i.e., an estimate of the expected return when starting from h_t and following the policy thereafter.
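To illustrate the architecture just described, the following PyTorch-style sketch combines the embedding-plus-LSTM encoder with the actor and critic heads. The layer sizes mirror Table II, but the exact feature layout (two embedded discrete features plus three scalar observation features) and all names are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class EncoderActorCritic(nn.Module):
    """Sketch of the DRACM networks: an embedding + LSTM encoder mapping the
    history (a_{t-1}, o_t) to h_t, an actor head producing a distribution over
    the M servers, and a critic head estimating V(h_t)."""
    def __init__(self, num_servers, embed_dim=2, lstm_units=256, head_units=128):
        super().__init__()
        self.action_embed = nn.Embedding(num_servers, embed_dim)   # a_{t-1}
        self.server_embed = nn.Embedding(num_servers, embed_dim)   # u_t
        # remaining observation features: uplink rate, task cycles, data size
        self.lstm = nn.LSTM(2 * embed_dim + 3, lstm_units, batch_first=True)
        self.actor = nn.Sequential(nn.Linear(lstm_units, head_units), nn.ReLU(),
                                   nn.Linear(head_units, num_servers))
        self.critic = nn.Sequential(nn.Linear(lstm_units, head_units), nn.ReLU(),
                                    nn.Linear(head_units, 1))

    def forward(self, prev_actions, local_servers, scalar_obs, lstm_state=None):
        # prev_actions, local_servers: (batch, seq) int64; scalar_obs: (batch, seq, 3)
        x = torch.cat([self.action_embed(prev_actions),
                       self.server_embed(local_servers), scalar_obs], dim=-1)
        h, lstm_state = self.lstm(x, lstm_state)        # h: (batch, seq, lstm_units)
        logits = self.actor(h)                          # policy over the M servers
        value = self.critic(h).squeeze(-1)              # V(h_t)
        return torch.distributions.Categorical(logits=logits), value, lstm_state
```

In use, `dist, value, state = net(prev_actions, local_servers, scalar_obs)` yields a categorical migration policy to sample from and a value estimate for each step of the history.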

Denote a trajectory sampled from the environment following policy π_{θ_π} as τ = (o_1, a_1, r_1, ..., o_T, a_T, r_T). The critic network can be updated by minimizing the mean squared error of the one-step temporal differences over the sampled trajectories, which is formally defined as

L(θ_v) = (1/T) Σ_{t=1}^{T} (y_t − V_{θ_v}(h_t))^2,    (9)
y_t = r_t + γ V_{θ_v}(h_{t+1}),    (10)

where the hidden state h_t can be obtained by Eq. (8). The objective of the actor is to find an optimal policy that maximizes the accumulated reward, which can be formally expressed as

J(θ_π) = E_{τ∼π_{θ_π}} [ Σ_{t=1}^{T} γ^{t−1} r_t ].    (11)

The optimal policy can then be obtained by gradient ascent through the policy gradient with one-step actor-critic [sutton2018reinforcement], where the gradient of the above objective function can be calculated by

∇_{θ_π} J(θ_π) = E[ ∇_{θ_π} log π_{θ_π}(a_t | h_t) (r_t + γ V_{θ_v}(h_{t+1}) − V_{θ_v}(h_t)) ].    (12)

Initialize the parameters of the behavior policy θ'_π, the behavior encoder network θ'_e, the target policy θ_π, the target encoder network θ_e, and the critic network θ_v.

1: for each training loop do
2:     % Start sampling process %
3:     Synchronize the parameters: θ'_π ← θ_π, θ'_e ← θ_e.
4:     Sample a set of trajectories D = {τ_i} by running the behavior policy π_{θ'_π} in the environment, where τ_i = (o_1, a_1, r_1, ..., o_T, a_T, r_T).
5:     Compute the advantage estimates, Â_t, according to Eq. (15).
6:     % Start target policy updating process %
7:     for each of the K update epochs do
8:         Update the parameters of the encoder network θ_e, the target policy network θ_π, and the critic network θ_v by mini-batch gradient updates on the collected trajectories with Adam.
9:     end for
10: end for
Algorithm 1 Deep Recurrent Actor-Critic based service Migration (DRACM)

However, directly applying the above on-policy objective (i.e., using the same policy for training and sampling) has drawbacks for the service migration problem. First, we cannot train the policy network offline with mini-batches when using the on-policy objective. This leads to a severe sample-efficiency problem, since the learning agent needs to resample trajectories from the environment after each gradient update; in an MEC system, frequently interacting with the environment to obtain training samples is particularly costly. Besides, the on-policy objective has limited exploration ability, so the policy can easily get stuck in a local optimum. To mitigate these problems, we design an off-policy algorithm (i.e., training a policy different from the one used to sample the data) that can train the policy with mini-batches and reduce the interaction frequency with the environment. Inspired by previous works on RL [schulman2017proximal, haarnoja2018soft, schulman2016high], we introduce an off-policy training method with a surrogate objective as follows:

J^clip(θ_π) = Ê_t[ min(ξ_t(θ_π) Â_t, clip(ξ_t(θ_π), 1−ε, 1+ε) Â_t) + c H(π_{θ_π}(· | h_t)) ],    (13)

ξ_t(θ_π) = π_{θ_π}(a_t | h_t) / π_{θ'_π}(a_t | h_t),    (14)

Â_t = Σ_{l=0}^{T−t} (γλ)^l δ_{t+l},  where δ_t = r_t + γ V_{θ_v}(h_{t+1}) − V_{θ_v}(h_t),    (15)

where π_{θ'_π} is the behavior policy used for sampling trajectories, which does not participate in the gradient updates, and π_{θ_π} is the target policy to be optimized. ξ_t(θ_π) is the importance sampling ratio, which corrects the distribution mismatch caused by the difference between the behavior and target policies. Besides, we introduce c H(π_{θ_π}(· | h_t)) as a regularization term to further encourage exploration during training, where H(·) denotes the entropy of the policy and c is a coefficient. However, off-policy methods are known for being unstable and hard to converge. To address this issue, the clip function, clip(ξ_t(θ_π), 1−ε, 1+ε), limits the value of the importance sampling ratio by removing the incentive for moving the ratio outside of the interval [1−ε, 1+ε]; it thus prevents very large policy updates and stabilizes the training. To reduce the variance of the training objective, we utilize the generalized advantage estimator [schulman2016high], Â_t, as given by Eq. (15), where λ is used to control the trade-off between bias and variance.
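The clipped surrogate objective and GAE above map closely onto the following PyTorch-style sketch of the per-update losses. The defaults follow the hyperparameters in Table II under the symbol assignments assumed there, and the function and argument names are illustrative rather than the authors' implementation.

```python
import torch

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimator (Eq. (15)). `rewards` is a 1-D tensor of
    length T; `values` has length T+1 (a bootstrap value is appended)."""
    advantages = torch.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

def dracm_losses(new_logp, old_logp, advantages, values, td_targets,
                 entropy, clip_eps=0.2, ent_coef=0.01):
    """Clipped surrogate objective (Eqs. (13)-(14)) plus a critic MSE loss,
    e.g. with td_targets = r_t + gamma * V(h_{t+1}) as in Eqs. (9)-(10).
    `old_logp` comes from the behavior policy and is detached from the graph."""
    ratio = torch.exp(new_logp - old_logp)                  # importance ratio
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_obj = torch.min(ratio * advantages, clipped * advantages).mean()
    actor_loss = -(policy_obj + ent_coef * entropy.mean())  # maximize objective
    critic_loss = (td_targets - values).pow(2).mean()
    return actor_loss, critic_loss
```

During the K update epochs of Algorithm 1, `old_logp` stays fixed (computed once by the behavior policy), while `new_logp`, `values`, and `entropy` are recomputed from the target networks on each mini-batch.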

Algorithm 1 summarizes the training process of the DRACM. Each training loop consists of a sampling process and a target policy updating process. In the sampling process, we first synchronize the parameters of the behavior networks with the target networks (including the policy and encoder networks), and then sample a set of trajectories from the environment using the behavior encoder and policy networks. The advantage estimates, Â_t, can then be obtained from the sampled trajectories. Next, in the target policy updating process, we conduct K epochs of updates to the parameters of the encoder, policy, and critic networks via mini-batch stochastic gradient descent with Adam [kingma2014adam]. After training, the target policy and encoder networks can be deployed to the end device for making online migration decisions by neural network inference, which has a linear time complexity of O(T), where T is the length of the history. In the next subsection, we discuss how to implement the DRACM in the emerging MEC system.

III-D The DRACM empowered MEC framework

The emerging MEC system defined by ETSI consists of three levels: user level, edge level, and remote level [sabella2019developing]. The user level includes various mobile devices such as smartphones and vehicles. The edge level consists of multiple edge servers where each server provides services for processing tasks that are offloaded by mobile users. The edge servers are connected through backhaul links, thus the service can be migrated among them. The remote level includes data centers with large storage and computing capacity. Fig. 4 shows the overall framework of integrating the DRACM into the three-level MEC system. Four key components (experience collector, migration decision maker, experience pool, and target policy trainer) of the DRACM are deployed at the user and remote level:

  • At the user level, the experience collector is responsible for collecting the observations and rewards from the MEC environment (Step 1⃝). It sends the history to the migration decision maker for online decision-making (Step 2⃝), and the collected trajectories to the experience pool for target policy training (Step 4⃝). The migration decision maker includes the behavior policy and encoder networks. It downloads parameters from the target policy trainer as the initial values of the behavior policy and encoder networks (Step 5⃝), and decides the migration actions based on the observed history (Step 3⃝).

  • At the remote level, the experience pool stores the sampled trajectories from mobile users. The target policy trainer is in charge of training the target policy based on the sampled trajectories.

Fig. 4: The framework of the DRACM empowered MEC system. The data flows in this framework are: 1⃝ the observation and reward from the MEC environment, 2⃝ the history for migration decision-making, 3⃝ the migration action, a_t, made by the behavior policy, 4⃝ the collected trajectories uploaded to the experience pool, 5⃝ the parameters of the trained target policy and encoder networks for service migration.

According to Algorithm 1, the target policy trainer conducts multiple training loops with mini-batch gradient updates based on the collected trajectories in the experience pool. Note that the training can be done offline without directly interacting with the MEC environment. After training, the target policy trainer sends the updated parameters of the policy and encoder networks to the mobile users for the next round of the sampling process.
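The user-level data flow (Steps 1⃝-5⃝) can be summarized by the following sketch of one sampling episode on a mobile device. `env`, `decision_maker`, and `experience_uploader` are hypothetical interfaces introduced only for illustration of the framework, not components defined by the paper.

```python
def user_side_episode(env, decision_maker, experience_uploader, horizon):
    """Sketch of Steps 1-5 at the user level: observe, decide with the local
    behavior policy/encoder networks, act, and ship the trajectory to the
    experience pool for offline training."""
    decision_maker.load_parameters()          # Step 5: sync with the trainer
    trajectory, lstm_state, prev_action = [], None, 0
    obs = env.reset()
    for t in range(horizon):                  # one decision per time slot
        action, lstm_state = decision_maker.act(prev_action, obs, lstm_state)  # Steps 2-3
        next_obs, reward = env.step(action)   # Step 1: observation and reward
        trajectory.append((obs, action, reward))
        obs, prev_action = next_obs, action
    experience_uploader.upload(trajectory)    # Step 4: send to the experience pool
```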

IV Experiments

Fig. 5: The central areas of Rome, Italy (an 8 km × 8 km area bounded by the coordinate pairs [41.856, 12.442] and [41.928, 12.5387]) and San Francisco, USA (an 8 km × 8 km area bounded by the coordinate pairs [37.709, -122.483] and [37.781, -122.391]).

In this section, we present the comprehensive evaluation results of the DRACM in detail. Our experiments demonstrate that: 1) the DRACM has a stable and efficient training process; 2) the DRACM can autonomously adapt to different MEC scenarios, including various user task arriving rates, application processing densities, and coefficients of migration delay. We first introduce the experiment settings based on a real-world MEC environment. Next, we present the baseline algorithms for comparison. Finally, we evaluate the performance of the DRACM and the baseline algorithms in different MEC scenarios.

IV-A Experiment settings

We evaluate the DRACM with two real-world mobility traces of cabs in Rome, Italy [roma-taxi-20140717] and San Francisco, USA [piorkowski2009crawdad]. Specifically, we focus our analysis on the central parts of Rome and San Francisco, as shown in Fig. 5. We consider that 64 MEC servers are deployed in each area, where each MEC server covers a 1 km × 1 km grid with a computation capacity of 128 GHz (i.e., four 16-core servers with 2 GHz for each core). According to [narayanan2020first], the upload rate of real-world commercial 5G networks is generally less than 60 Mbps. Therefore, in our environment, the upload rate in each grid is set as 60, 48, 36, 24, and 12 Mbps from the proximal end to the distal end. The hop distance between two MEC servers is calculated by the Manhattan distance, and the location of an MEC server is represented by a 2-D vector with respect to a reference location. To calculate the propagation latency, we set the bandwidth of the backhaul network, B, to 500 Mbps [ma2019efficient] and the coefficient of the backhaul delay, κ, to 0.02 s/hop [yuan2020joint]. The migration delay varies with the service type and network conditions; e.g., the migration delay of Busybox (a type of service) ranges from 2.4 to 3.3 seconds [ma2019efficient] under different backhaul bandwidths. Following related work on MEC [ouyang2019adaptive, wang2019delay, ma2019efficient], we assume the coefficient of migration delay, η, is uniformly distributed (in s/hop) during our training.

At each time slot, the tasks arriving at a mobile user and those arriving at an MEC server are sampled from Poisson distributions with rates λ_u and λ_s, respectively. In our experiments, we show the performance of the DRACM under different task arriving rates of mobile users. According to recent works [nguyen2020privacy, chen2015efficient, zhan2020mobility], the data size of an offloaded task in real-world mobile applications often varies from 50 KB (sensor data) [nguyen2020privacy] to 5 MB (image data) [chen2015efficient]. Therefore, we set the data size of each offloaded task to be uniformly distributed in [0.05, 5] MB. The required CPU cycles of each task are calculated as the product of the data size and the processing density, which is uniformly distributed (in cycles/bit) over a range covering tasks from low to high computational complexity [kwak2015dream]. We summarize the parameter settings of our simulation environment in Table I.
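As a concrete illustration of the stochastic task model just described, the sketch below draws one slot of offloaded workload for a user, assuming the uniform data-size range implied by the 50 KB-5 MB figures above; the processing-density range is left as a parameter to be set by the experimenter, and all names are illustrative.

```python
import math
import random

def poisson(lam, rng=random):
    """Poisson sample via Knuth's algorithm; avoids a numpy dependency."""
    l, k, p = math.exp(-lam), 0, 1.0
    while True:
        k += 1
        p *= rng.random()
        if p <= l:
            return k - 1

def sample_slot_tasks(rng=random, user_rate=2.0, size_mb=(0.05, 5.0),
                      density_range=None):
    """One slot's offloaded workload: a Poisson number of tasks, each with a
    uniform data size; the processing density (cycles/bit) is drawn from a
    user-supplied uniform range."""
    num_tasks = poisson(user_rate, rng)
    sizes = [rng.uniform(*size_mb) for _ in range(num_tasks)]          # MB
    total_bits = sum(sizes) * 8e6                                      # bits
    density = rng.uniform(*density_range) if density_range else None   # cycles/bit
    cycles = total_bits * density if density else None                 # required CPU cycles
    return total_bits, cycles
```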

Parameter | Value
Computation capacity of an MEC server, f | 128 GHz
Upload rate of wireless network, v_t | {60, 48, 36, 24, 12} Mbps
Bandwidth of backhaul network, B | 500 Mbps
Coefficient of backhaul delay, κ | 0.02 s/hop
Coefficient of migration delay, η | uniformly distributed (s/hop)
Data size of each offloaded task, b_t | uniformly distributed in [0.05, 5] MB
Processing density of an offloaded task | uniformly distributed (cycles/bit)
User's task arriving rate, λ_u | 2 tasks/slot
MEC server's task arriving rate, λ_s | tasks/slot
TABLE I: Parameters of the Simulated Environment.
Hyperparameter | Value
LSTM hidden units | 256
Embedding dimension | 2
Actor layer type | Dense
Actor hidden units | 128
Critic layer type | Dense
Critic hidden units | 128
Learning rate | 0.0005
Optimizer | Adam
Discount λ | 0.95
Discount γ | 0.99
Entropy coefficient c | 0.01
Clipping value ε | 0.2
TABLE II: Hyperparameters of the DRACM.

IV-B Baseline algorithms

We compare the performance of the DRACM to that of five baseline algorithms:

  • Always migrate (AM): A mobile user always migrates its service to the nearest MEC server at each time slot.

  • Never migrate (NM): The service is placed on an MEC server and never migrated during the time horizon.

  • Multi-armed Bandit with Thompson Sampling (MABTS): Some existing works [sun2017emm, ouyang2019adaptive] solve the service migration problem based on the MAB. Following [ouyang2019adaptive], MABTS uses a diagonal Gaussian distribution to approximate the posterior of the cost for each arm and applies Thompson sampling to handle the trade-off between exploration and exploitation.

  • DQL-based migrate (DQLM): Some recent works [wang2019delay, wu2020mobility, chen2019dynamic, yuan2020joint] adopt DQL to tackle the service migration problem. For a fair comparison, we use a neural network structure similar to that of the DRACM to approximate the action-value function for DQLM, but use the objective function of the DQL method as the training target. Moreover, we use ε-greedy to control the exploration-exploitation trade-off, as the above works do.

  • Optimal migrate (OPTIM): Assuming the user mobility trace and the complete system-level information over the time horizon are known in advance, the service migration problem can be transformed into a shortest-path problem [ouyang2018follow, wang2019delay], which can be solved by the Dijkstra algorithm (a minimal sketch is given after this list).

The NM, AM, MABTS, and DQLM algorithms can run online, while OPTIM is an offline algorithm that defines the performance upper bound for service migration algorithms.
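To make the OPTIM baseline concrete, the following sketch computes the offline optimum by dynamic programming over a time-expanded graph, which is equivalent to the shortest-path formulation solvable by Dijkstra's algorithm. The `slot_cost(t, prev_node, node)` interface and the free choice of the initial placement are assumptions for illustration, not the paper's exact construction.

```python
def offline_optimum(num_servers, horizon, slot_cost):
    """Offline upper bound (OPTIM): with the whole trace and all system
    information known, pick the serving node per slot minimizing total latency.
    `slot_cost(t, prev_node, node)` returns the total latency of slot t when
    migrating from prev_node to node; the initial placement is free."""
    INF = float("inf")
    best = [0.0] * num_servers          # best path cost ending at each node
    parent = []                         # per-slot predecessor choices
    for t in range(horizon):
        new_best = [INF] * num_servers
        choice = [0] * num_servers
        for node in range(num_servers):
            for prev in range(num_servers):
                cost = best[prev] + slot_cost(t, prev, node)
                if cost < new_best[node]:
                    new_best[node] = cost
                    choice[node] = prev
        parent.append(choice)
        best = new_best
    # backtrack the optimal serving-node sequence a_1, ..., a_T
    node = min(range(num_servers), key=lambda m: best[m])
    total, plan = best[node], [node]
    for choice in reversed(parent):
        node = choice[node]
        plan.append(node)
    plan.reverse()
    return total, plan[1:]
```

Because each time-slot layer only connects to the next one, the graph is a DAG and plain dynamic programming is as effective as Dijkstra here, running in O(T·M²) time for M servers and T slots.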

IV-C Evaluation of the DRACM and baseline algorithms

Fig. 6: Average total reward of the DRACM and baseline algorithms with the mobility traces of Rome.
Fig. 7: Average total reward of the DRACM and baseline algorithms with the mobility traces of San Francisco.
Fig. 8: Average total latency (s) of service migration over the time horizon (250 minutes) on the testing dataset from mobility traces of Rome.
Fig. 9: Average total latency (s) of service migration over the time horizon (250 minutes) on the testing dataset from mobility traces of San Francisco.

We first evaluate the training performance of the DRACM and DQLM on the two mobility trace datasets [roma-taxi-20140717, piorkowski2009crawdad]. Each training dataset includes 100 randomly picked mobility traces, where each trace has 100 time slots, each three minutes long. Table II lists the hyperparameters used in training. The neural network structure of the DQLM is similar to that of the DRACM, with the same encoder network. The difference is that, rather than using the actor-critic structure, the DQLM is based on a Q-network that includes a fully connected layer with 128 hidden units to approximate the action-value function, and it chooses the action with the largest action value at each time step. We train the DQLM and the DRACM with the same learning rate, mini-batch size, and number of gradient update steps.

Figs. 6 and 7 show the training results of the DRACM and DQLM on the mobility traces of Rome and San Francisco, respectively. The other baseline algorithms do not involve training neural networks, so we show only their final performance. The network parameters of both the DRACM and DQLM are initialized with random values; thus, before training, they select actions randomly to explore the environment and achieve the worst results among the compared algorithms. However, the DRACM quickly surpasses NM and AM after 500 gradient updates and keeps improving on both mobility traces. After 1000 gradient updates, the average total reward of the DRACM remains stable, which shows the excellent convergence property of the DRACM. Besides, the final stable results of the DRACM on both mobility traces beat all the baseline algorithms. Compared to the DQLM, the proposed DRACM has two main advantages: 1) The DQLM uses ε-greedy to control the trade-off between exploration and exploitation and obtains a deterministic policy based on the learned Q-network. When handling an MEC environment with stochastic dynamics and a continuous state space, the training of the DQLM can be unstable and inefficient. In contrast, the DRACM directly learns a stochastic policy that handles the exploration-exploitation trade-off, so it achieves faster and more stable learning. 2) The DRACM obtains better results, since its off-policy objective helps alleviate the problem of getting stuck in local optima.

To evaluate the generalization ability of the DRACM, we test the trained target policy on the testing datasets of both mobility traces, where each testing dataset includes 30 randomly picked mobility traces that were not included in the training dataset. Figs. 8 and 9 present the average total latency of the DRACM and the baseline algorithms on the Rome and San Francisco mobility traces, respectively. The DRACM achieves the best performance among the online algorithms on both mobility traces. Specifically, Fig. 8 shows that the DRACM outperforms the DQLM and MABTS by 18% and 13%, respectively, while Fig. 9 indicates that the DRACM surpasses the DQLM and MABTS by 44% and 23%, respectively. Furthermore, the DRACM achieves near-optimal results within 12% of the optimum on both mobility traces.

Fig. 10: Average total latency (s) of service migration over the time horizon (250 minutes) with different task arriving rates of users (mobility traces of Rome).
Fig. 11: Average total latency (s) of service migration over the time horizon (250 minutes) with different task arriving rates of users (mobility traces of San Francisco).

We then test the DRACM and the baseline algorithms with different task arriving rates of users on both mobility traces. As shown in Figs. 10 and 11, the average total latency of every evaluated algorithm increases with the user's task arriving rate, since the average number of offloaded tasks per time slot grows. The evaluation results show that the DRACM adapts well across different task arriving rates, outperforming the DQLM and MABTS by up to 24% and 45%, respectively. Moreover, in all cases, the results of the DRACM are close to the optimal values.

Next, we investigate the performance of the DRACM with different processing densities. For a real-world mobile application, the higher the processing density, the more computing power is required to process the application. Figs. 12 and 13 depict the average total latency of the DRACM on the Rome and San Francisco mobility traces, respectively. We find that the DRACM adapts well to changes in processing density on both mobility traces and outperforms all online baselines.

Fig. 12: Average total latency (s) of service migration over the time horizon (250 minutes) with different processing densities (mobility traces of Rome).
Fig. 13: Average total latency (s) of service migration over the time horizon (250 minutes) with different processing densities (mobility traces of San Francisco).
Fig. 14: Average total latency (s) of service migration over the time horizon (250 minutes) with different coefficients of migration delay (mobility traces of Rome).
Fig. 15: Average total latency (s) of service migration over the time horizon (250 minutes) with different coefficients of migration delay (mobility traces of San Francisco).

Migration delay is another important factor that influences the overall latency. To investigate its impact, we evaluate the DRACM and the baseline algorithms on the testing datasets with different coefficients of migration delay, η. Intuitively, when the migration delay is high, a mobile user may not choose to migrate services frequently. As shown in Figs. 14 and 15, the NM algorithm keeps a similar performance in all cases, while the performance of the other algorithms drops as η increases. This is because the NM does not involve the migration process and thus incurs no migration delay. In Fig. 14, we find that MABTS suffers serious performance degradation as η increases. When η is low, MABTS achieves results similar to those of the DRACM; however, as η grows, the performance of MABTS becomes even worse than that of the DQLM. Compared to RL-based methods like the DQLM and DRACM, MABTS is "short-sighted", since it only considers the one-step reward rather than explicitly optimizing the total reward over the entire time horizon. Overall, the DRACM autonomously learns to adapt among scenarios with different migration delays, achieving the best performance among the online baselines (with up to 25% improvement over MABTS and up to 42% improvement over the DQLM) and obtaining near-optimal results in our experiments.

The DRACM method has several advantages: 1) the learning-based nature of the DRACM makes it flexible across different scenarios with little human expertise; 2) the user-centric design scales with the increasing number of mobile users, since each mobile user makes effective online migration decisions based on incomplete system information; 3) the tailored off-policy training objective improves both the performance and the stability of the training process; 4) the design of online decision-making and offline policy training makes the DRACM practical for real-world MEC systems. Beyond the scope of service migration, the framework of the DRACM has the potential to be applied to other decision-making problems in MEC systems, such as task offloading and resource allocation [mao2017survey].

V Related Work

Service migration in MEC has attracted intensive research interest in recent years. Rejiba et al. [rejiba2019survey] published a comprehensive survey on mobility-induced service migration in fog, edge, and related computing paradigms. We roughly classify the related work into the centralized control approach (the central cloud or MEC servers make service migration decisions for all mobile users) and the decentralized control approach (each mobile user makes its own migration decisions).

Centralized control approach: Plenty of works have focused on making centralized migration decisions based on complete system-level information to minimize the total cost. Ouyang et al. [ouyang2018follow] converted the service migration problem into an online queue-stability control problem and applied Lyapunov optimization to solve it. Xu et al. [xu2020path] formulated service migration as a multi-objective optimization framework and proposed a method to achieve a weakly Pareto-optimal solution. Wang et al. [wang2019dynamic] formulated service migration as a finite-state MDP, proposed an approximation of the underlying state space, and solved the finite-state MDP using a modified policy-iteration algorithm. Other recent works tackled the service migration problem with RL. Wang et al. [wang2019delay] proposed a Q-learning based micro-service migration algorithm in mobile edge computing. Chen et al. [chen2019dynamic] built a practical platform for dynamic service migration and used a Q-learning based method to obtain the migration strategy. Wu et al. [wu2020mobility] considered jointly optimizing task offloading and service migration, and proposed a Q-learning based method combined with predicted user mobility. These works consider the case where the decision-making agent knows the complete system-level information. However, in a practical MEC system, collecting complete system-level information can be difficult and time-consuming. Moreover, the centralized control approach may suffer from scalability issues when facing a rapidly increasing number of mobile users.

Decentralized control approach: Some studies proposed making migration decisions on the user side based on incomplete system-level information. Ouyang et al. [ouyang2019adaptive] formulated the service migration problem as an MAB and proposed a Thompson-sampling based algorithm that explores the dynamic MEC environment to make adaptive service migration decisions. Sun et al. [sun2018learning] proposed an MAB-based service placement framework for vehicle cloud computing, which enables a vehicle to learn to select effective neighboring vehicles for its service. Sun et al. [sun2017emm] developed a user-centric service migration framework using MAB and Lyapunov optimization to minimize the latency under energy-consumption constraints. These methods simplify the system dynamics by modeling it with an MAB, which ignores the inherently large state space and complex transitions among states in a real-world MEC system. In contrast to the above works, our method models the service migration problem as a POMDP with a continuous state space and complex transitions between states. Moreover, our method is model-free and adaptive to different scenarios, and it learns to make online service migration decisions with minimal expert knowledge. More recently, Yuan et al. [yuan2020joint] investigated the joint service migration and mobility optimization problem for vehicular edge computing. They modeled the MEC environment as a POMDP and proposed a multi-agent DRL method based on independent Q-learning to learn the policy. However, using a Q-learning based method to solve an environment with complex dynamics and a continuous state space can be unstable and inefficient. Our evaluation results show that our method achieves more stable training and better results than the DQL-based method.

VI Conclusion

In this paper, we proposed the DRACM, a new method for solving the service migration problem in MEC given incomplete system-level information. Our method is completely model-free and can learn to make online migration decisions through end-to-end RL training with minimal human expertise. Specifically, the service migration problem in MEC is modeled as a POMDP. To solve the POMDP, we designed an encoder network that combines an LSTM and an embedding matrix to effectively extract hidden information from sampled histories. Besides, we proposed a tailored off-policy actor-critic algorithm with a clipped surrogate objective to improve the training performance. We demonstrated the implementation of the DRACM in the emerging MEC framework, where migration decisions can be made online from the user side and the training for the policy can be offline without directly interacting with the environment. We evaluated the DRACM and four online baseline algorithms with real-world datasets and demonstrated that the DRACM consistently outperforms the online baselines and achieves near-optimal results on a diverse set of scenarios.

References