In the last decade, we have seen a shift in the computing paradigm from co-located datacenters and compute servers to cloud computing. By aggregating resources, cloud computing can deliver elastic computing power and storage to customers without the overhead of setting up expensive datacenters and networking infrastructure. It has especially attracted small and medium-sized businesses, which can leverage the cloud infrastructure with minimal setup costs. In recent years, the proliferation of Video-on-Demand (VoD) services, Internet-of-Things (IoT), real-time online gaming platforms, and Virtual Reality (VR) applications has led to a strong focus on the quality of experience of end users. The cloud paradigm is not the ideal candidate for such latency-sensitive applications owing to the delay between the end user and the cloud server.
This has led to a new trend in computing called Mobile Edge Computing (MEC) [19, 15], where compute capabilities are moved closer to the network edges. It represents an essential building block of the 5G vision of creating large distributed, pervasive, heterogeneous, and multi-domain environments. Harvesting the vast amount of idle computation power and storage space distributed at the network edges can yield sufficient capacity for performing the computation-intensive and latency-critical tasks requested by end users. However, it is not feasible to set up large, resource-rich edge clusters that mimic the capabilities of the cloud along all network edges, due to the sheer volume of resources that would be required and that would remain underutilized most of the time. Due to the limited resources at the edge nodes and fluctuations in the user requests, an edge cluster may not be capable of meeting the resource and service requirements of all the users it serves.
Computation offloading methods have gained popularity as they provide a simple solution to overcome the problems of edge and mobile computing. Data and computation offloading can potentially reduce processing delay, improve energy efficiency, and even enhance security for computation-intensive applications. The critical problem in computation offloading is to determine the amount of computational workload to offload and to choose the MEC server from all available servers. Various aspects of MEC from the point of view of the mobile user have been investigated in the literature. For example, the questions of when to offload to a mobile server, to which mobile server to offload, and how to offload have been studied extensively. See [14, 25, 23, 5, 27] and the references therein.
However, the design questions at the server side have not been investigated as extensively. When an edge server receives a large number of requests in a short period of time (for example, due to a sporting event), it can get overloaded, which can lead to service degradation or even node failure. When such service degradation occurs, edge servers are configured to offload requests to other nodes in the cluster in order to avoid a node crash. The crash of an edge node reduces the cluster capacity, which is a disaster for the platform operator as well as for the end users who are using its services or applications. However, performing this migration takes extra time and reduces resource availability for other services deployed in the cluster. Therefore, it is paramount to design proactive mechanisms that prevent a node from getting overloaded, using dynamic offloading policies that can adapt to service request dynamics.
The design of an offloading policy has to take into account the time-varying channel conditions, user mobility, energy supply, computation workload, and the computational capabilities of different MEC servers. The problem can be modeled as a Markov Decision Process (MDP) and solved using dynamic programming. However, solving a dynamic program requires knowledge of the system parameters, which are typically not known and may also vary with time. In such time-varying environments, the offloading policy must adapt to the environment. Reinforcement Learning (RL) is a natural choice for designing such adaptive policies, as RL agents do not need a model of the environment and can learn the optimal policy based on the observed per-step cost. RL has been successfully applied for designing adaptive offloading policies in edge and fog computing in [7, 9, 26, 13, 6, 24, 22]
to realize one or more objectives such as minimizing latency, minimizing power consumption, and associating users with base stations. Although RL has achieved considerable success in prior work, this success is generally achieved by using deep neural networks to model the policy and the value function. Such deep RL algorithms require considerable computational power and time to train, and are notoriously brittle to the choice of hyper-parameters. They may also not transfer well from simulation to the real world, and they output policies that are difficult to interpret. These features make them impractical to deploy on edge nodes to continuously adapt to changing network conditions.
In this work, we study the problem of node-overload protection for a single edge node. Our main contributions are as follows:
We present a mathematical model for designing an offloading policy for node-overload protection. The model incorporates practical considerations of server holding, processing, and offloading costs. In the simplest case, when the request arrival process is time-homogeneous, we model the system as a continuous-time MDP and use the uniformization technique [10, 8] to convert it to a discrete-time MDP, which can then be solved using standard dynamic programming algorithms.
We show that for time-homogeneous arrival process, the value function and the optimal policy are weakly increasing in the CPU utilization.
We design a node-overload protection scheme that uses a recently proposed low-complexity RL algorithm called Structure-Aware Learning for Multiple Thresholds (SALMUT). The original SALMUT algorithm was designed for average cost models. We extend the algorithm to the discounted cost setup and prove that SALMUT converges almost surely to a locally optimal policy.
We compare the performance of deep RL algorithms with SALMUT in a variety of scenarios, motivated by real-world deployments, in a simulated testbed. Our simulation experiments show that SALMUT performs close to state-of-the-art deep RL algorithms such as PPO and A2C, but requires an order of magnitude less time to train and provides optimal policies which are easy to interpret.
We develop a docker testbed where we run actual workloads and compare the performance of SALMUT with a baseline policy. Our results show that the SALMUT algorithm outperforms the baseline.
A preliminary version of this paper appeared earlier, where the monotonicity results of the optimal policy (Propositions 1 and 2) were stated without proof and the modified SALMUT algorithm was presented with a slightly different derivation. However, the convergence behavior of the algorithm (Theorem 2) was not analyzed there. A preliminary version of the comparison of SALMUT with state-of-the-art RL algorithms in computer simulations was also included there. However, the detailed behavioral analysis (Sec. V) and the results for the docker testbed (Sec. VI) are new.
The rest of the paper is organized as follows. We present the system model and problem formulation in Sec. II. In Sec. III, we present a dynamic programming decomposition for the case of time-homogeneous statistics of the arrival process. In Sec. IV, we adapt the structure-aware RL algorithm (SALMUT) to our model. In Sec. V, we conduct a detailed experimental study to compare the performance of SALMUT with other state-of-the-art RL algorithms using computer simulations. In Sec. VI, we compare the performance of SALMUT with the baseline algorithm on the real testbed. Finally, in Sec. VII, we provide the conclusion, the limitations of our model, and future directions.
II Model and Problem Formulation
|Queue length at time|
|CPU load of the system at time|
|Offloading action taken by agent at time|
|Number of cores in the edge node|
|CPU resources required by a request|
|PMF of the CPU resources required|
|Processing time of a single core in the edge node|
|Request arrival rate of user|
|Holding cost per unit time|
|Running cost per unit time|
|Penalty for offloading the packet|
|Cost function in the continuous MDP|
|Policy of the RL agent|
|Discount factor in continuous MDP|
|Performance of the policy|
|Transition probability function|
|Discount factor in discrete MDP|
|Cost function in the discrete MDP|
|Q-value for state and action|
|Optimal Threshold Policy for SALMUT|
|Performance of the SALMUT policy|
|Probability of accepting new request|
|Temperature of the sigmoid function|
|Occupancy measure on the states starting from|
|Fast timescale learning rate|
|Slow timescale learning rate|
|Number of requests offloaded by edge node|
|Number of times edge node enters an overloaded state|
II-A System model
A simplified MEC system consists of an edge server and several mobile users accessing that server (see Fig. 1). Mobile users independently generate service requests according to a Poisson process. The rate of requests and the number of users may change with time. The edge server consumes CPU resources to serve each request from the mobile users. When a new request arrives, the edge server has the option to serve it or to offload it to another healthy edge server in the cluster. A request is buffered in a queue before it is served. The mathematical model of the edge server and the mobile users is presented below.
II-A1 Edge server
Let denote the number of service requests buffered in the queue, where denotes the size of the buffer. Let denote the CPU load at the server where is the capacity of the CPU. We assume that the CPU has cores.
We assume that the requests arrive according to a (potentially time-varying) Poisson process with rate . If a new request arrives when the buffer is full, the request is offloaded to another server. If a new request arrives when the buffer is not full, the server has the option to either accept or offload the request.
The server can process up to a maximum of requests from the head of the queue. Processing each request requires CPU resources for the duration in which the request is being served. The required CPU resources are a random variable with probability mass function . The realization of is not revealed until the server starts working on the request. The duration of service is an exponentially distributed random variable with rate .
Let denote the action set. Here means that the server decides to offload the request while means that the server accepts the request.
II-A2 Traffic model for mobile users
We consider multiple models for traffic.
Scenario 1: All users generate requests according to the same rate and the rate does not change over time. Thus, the rate at which requests arrive is .
Scenario 2: In this scenario, we assume that all users generate requests according to rate , where is a global state which changes over time. Thus, the rate at which requests arrive in state , where , is .
Scenario 3: Each user has a state . When the user is in state , it generates requests according to rate . The state changes over time. Thus, the rate at which requests arrive at the server is .
Time-varying user set: In each of the scenarios above, we can consider the case when the number of users is not fixed and changes over time. We call them Scenarios 4, 5, and 6, respectively.
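Assuming each user's requests form an independent Poisson process, the aggregate arrival rate in the three base scenarios can be sketched as follows; the function names and rate tables are illustrative placeholders, not the paper's notation (the superposition of independent Poisson processes is itself Poisson with the summed rate):

```python
def total_rate_scenario1(n, lam):
    # Scenario 1: n users, each with the same fixed per-user rate lam.
    return n * lam

def total_rate_scenario2(n, rates, global_state):
    # Scenario 2: all users share a per-user rate that depends on a
    # slowly changing global state, e.g. rates = {0: 0.25, 1: 0.5}.
    return n * rates[global_state]

def total_rate_scenario3(user_states, rates):
    # Scenario 3: each user has its own state; the aggregate rate is
    # the sum of the individual state-dependent rates.
    return sum(rates[s] for s in user_states)
```

The time-varying variants (Scenarios 4-6) simply re-evaluate these aggregates whenever the user set or the user states change.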
II-A3 Cost and the optimization framework
The system incurs three types of cost:
a holding cost of per unit time when a request is buffered in the queue but is not being served.
a running cost of per unit time for running the CPU at a load of .
a penalty of for offloading a packet at CPU load .
We combine all these costs in a cost function
where denotes the action, is a short-hand for and is the indicator function. Note that to simplify the analysis, we assume that the server always serves requests. It is also assumed that and are increasing in .
Whenever a new request arrives, the server uses a memoryless policy to choose an action.
The performance of a policy starting from initial state is given by
where is the discount rate and the expectation is with respect to the arrival process, CPU utilization, and service completions.
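In standard notation for continuous-time discounted-cost problems (the symbols below are a plausible reconstruction, not necessarily the paper's exact ones: state process \(X_t\), action process \(U_t\), discount rate \(\beta\)), this performance measure takes the form

```latex
J^{\pi}(x) \;=\; \mathbb{E}^{\pi}\!\left[\,\int_{0}^{\infty} e^{-\beta t}\, c\bigl(X_t, U_t\bigr)\, dt \;\middle|\; X_0 = x \right].
```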
The objective is to minimize the performance (2) for the different traffic scenarios listed above. We are particularly interested in the scenarios where the arrival rate and potentially other components of the model such as the resource distribution are not known to the system designer and change during the operation of the system.
II-B Solution framework
When the model parameters are known and time-homogeneous, the optimal policy can be computed using dynamic programming. However, in a real system, these parameters may not be known, so we are interested in developing a RL algorithm which can learn the optimal policy based on the observed per-step cost.
In principle, when the model parameters are known, Scenarios 2 and 3 can also be solved using dynamic programming. However, the state of such dynamic programs will include the state of the system (for Scenario 2) or the states of all users (for Scenario 3). Typically, these states change at a slow time-scale. So, we will consider reinforcement learning algorithms which do not explicitly keep track of the states of the user and verify that the algorithm can adapt quickly whenever the arrival rates change.
III Dynamic programming to identify optimal admission control policy
When the arrival process is time-homogeneous, the process is a finite-state continuous-time MDP controlled through . To specify the controlled transition probability of this MDP, we consider the following two cases.
First, if there is a new arrival at time , then
We denote this transition function by . Note that the first term denotes the probability that the accepted request required CPU resources.
Second, if there is a departure at time ,
We denote this transition function by . Note that there is no decision to be taken at the completion of a request, so the above transition does not depend on the action. In general, the reduction in CPU utilization will correspond to the resources released after the client requests are served. However, keeping track of those resources would mean that we would need to expand the state and include as part of the state, where denotes the resources required by the request which is being processed by CPU . In order to avoid such an increase in state dimension, we assume that when a request is completed, CPU utilization reduces by amount with probability .
Let denote the uniform upper bound on the transition rate at the states. Then, using the uniformization technique [10, 8], we can convert the above continuous time discounted cost MDP into a discrete time discounted cost MDP with discount factor , transition probability matrix and per-step cost
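With a uniform transition-rate bound \(\Lambda\) and continuous-time discount rate \(\beta\) (symbols assumed here for illustration), the standard uniformization correspondence gives the discrete-time discount factor and per-step cost as

```latex
\alpha \;=\; \frac{\Lambda}{\beta + \Lambda},
\qquad
\bar{c}(x, u) \;=\; \frac{c(x, u)}{\beta + \Lambda},
```

which is consistent with absorbing the constant factor in front of the per-step cost into the cost function, as done in the sequel.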
Therefore, we have the following.
Consider the following dynamic program
where denotes .
Let denote the argmin of the right-hand side of (6). Then, the time-homogeneous policy is optimal for the original continuous-time optimization problem.
Thus, for all practical purposes, the decision maker has to solve a discrete-time MDP in which decisions are taken at the instants when a new request arrives. In the sequel, we ignore the term in front of the per-step cost and assume that it has been absorbed into the constant , and the functions , .
When the system parameters are known, the above dynamic program can be solved using standard techniques such as value iteration, policy iteration, or linear programming. However, in practice, the system parameters may slowly change over time. Therefore, instead of pursuing a planning solution, we consider reinforcement learning solutions which can adapt to time-varying environments.
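As an illustration of the planning step, here is a minimal value-iteration sketch for a generic finite discounted-cost MDP (the array shapes and names are our own, not the paper's):

```python
import numpy as np

def value_iteration(P, c, alpha, tol=1e-6):
    """Value iteration for a finite discounted-cost MDP.

    P     : array of shape (A, S, S); P[u, x, y] = Pr(next state y | state x, action u)
    c     : array of shape (S, A); per-step cost
    alpha : discrete-time discount factor in (0, 1)
    Returns the optimal value function and a greedy deterministic policy.
    """
    num_actions, num_states, _ = P.shape
    V = np.zeros(num_states)
    while True:
        # Q[x, u] = c(x, u) + alpha * sum_y P(y | x, u) * V(y)
        Q = c + alpha * np.einsum('uxy,y->xu', P, V)
        V_new = Q.min(axis=1)          # cost minimization
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmin(axis=1)
        V = V_new
```

Policy iteration or linear programming would work equally well on the same `(P, c, alpha)` description.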
IV Structure-aware reinforcement learning
Although, in principle, the optimal admission control problem formulated above can be solved using deep RL algorithms, such algorithms require significant computational resources to train, are brittle to the choice of hyperparameters, and generate policies which are difficult to interpret. For the aforementioned reasons, we investigate an alternate class of RL algorithms which circumvents these limitations.
IV-A Structure of the optimal policy
We first establish basic monotonicity properties of the value function and the optimal policy.
For a fixed queue length , the value function is weakly increasing in the CPU utilization .
The proof is present in Appendix A.
For a fixed queue length , if it is optimal to reject a request at CPU utilization , then it is optimal to reject a request at all CPU utilizations .
The proof is present in Appendix B.
IV-B The SALMUT algorithm
Proposition 2 shows that the optimal policy can be represented by a threshold vector , where is the smallest value of the CPU utilization such that it is optimal to accept the packet for CPU utilization less than or equal to and reject it for utilization greater than .
The SALMUT algorithm was originally proposed to exploit a similar structure in admission control for multi-class queues, in the average cost setting. We present a generalization to the discounted cost setting.
We use to denote a threshold-based policy with the parameters taking values in . The key idea behind SALMUT is that, instead of deterministic threshold-based policies, we consider randomized policies parameterized by parameters taking values in the compact set . Then, for any state , the randomized policy chooses action with probability and action with probability , where is any continuous function that is decreasing w.r.t. and differentiable in its first argument, e.g., the sigmoid function
where is a hyper-parameter (often called “temperature”).
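A minimal sketch of the randomized soft-threshold policy, assuming the sigmoid parameterization above and an encoding in which action 0 means accept and action 1 means offload (the paper's exact action encoding and argument order may differ):

```python
import math
import random

def accept_probability(cpu_load, threshold, temperature=1.0):
    # Soft-threshold acceptance probability: close to 1 well below the
    # threshold, close to 0 well above it, and differentiable in the
    # threshold (which is what the policy-gradient step needs).  The
    # sigmoid is one common choice of such a function.
    return 1.0 / (1.0 + math.exp((cpu_load - threshold) / temperature))

def choose_action(rng, cpu_load, threshold, temperature=1.0):
    # Action 0 = accept, action 1 = offload (our assumed encoding).
    p_accept = accept_probability(cpu_load, threshold, temperature)
    return 0 if rng.random() < p_accept else 1
```

As the temperature shrinks, the randomized policy approaches the deterministic threshold policy it replaces.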
Fix an initial state and let denote the performance of policy . Furthermore, let denote the transition function under policy , i.e.
Similarly, let denote the expected per-step reward under policy , i.e.
Let denote the gradient with respect to .
From Performance Derivative formula [4, Eq. 2.44], we know that
where is the occupancy measure on the states starting from the initial state .
From (8), we get that
Similarly, from (9), we get that
Therefore, when is sampled from the stationary distribution , an unbiased estimator of is proportional to .
Thus, we can use the standard two time-scale actor-critic algorithm to simultaneously learn the policy parameters and the action-value function as follows. We start with initial guesses and for the action-value function and the optimal policy parameters. Then, we update the action-value function using temporal difference learning:
where is a projection operator which clips the values to the interval and and are learning rates which satisfy the standard conditions on two time-scale learning: , , , and .
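One iteration of the two time-scale update can be sketched as follows; the state layout, sign conventions, action encoding (0 = accept, 1 = offload), and clipping interval are all assumptions for illustration, not the paper's exact algorithm:

```python
import math

def salmut_step(Q, theta, state, action, cost, next_state,
                alpha_n, beta_n, discount,
                temperature=1.0, theta_min=0.0, theta_max=10.0):
    """One two time-scale SALMUT-style update (a sketch; notation assumed).

    Q     : dict mapping state -> [value of accept, value of offload]
    theta : dict mapping queue length -> threshold on the CPU load
    state : (queue_length, cpu_load) tuple -- assumed state layout
    """
    x, cpu_load = state

    # Fast timescale: TD(0) update of the action-value function.
    td_target = cost + discount * min(Q[next_state])
    Q[state][action] += alpha_n * (td_target - Q[state][action])

    # Slow timescale: gradient step on the threshold parameter.
    p_accept = 1.0 / (1.0 + math.exp((cpu_load - theta[x]) / temperature))
    dp_dtheta = p_accept * (1.0 - p_accept) / temperature  # d p_accept / d theta
    # Raising the threshold shifts probability mass toward "accept", so the
    # cost gradient is proportional to dp_dtheta * (Q_accept - Q_offload).
    grad = dp_dtheta * (Q[state][0] - Q[state][1])
    theta[x] -= beta_n * grad
    # Projection keeps the threshold in the admissible interval.
    theta[x] = min(max(theta[x], theta_min), theta_max)
    return Q, theta
```

With learning rates satisfying the stated two time-scale conditions, the threshold moves slowly relative to the value estimates, which is what the convergence argument relies on.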
The two time-scale SALMUT algorithm described above converges almost surely and .
The proof is present in Appendix C.
The idea of replacing the "hard" threshold with a "soft" threshold is the same as in the original SALMUT algorithm. However, our simplification of the performance derivative (10), given by (13), is conceptually different from the simplification presented there, which is based on viewing the term in (10) as
where is an independent binary random variable and . In contrast, our simplification is based on a different algebraic manipulation that directly simplifies (10) without requiring any additional sampling.
V Numerical experiments - Computer Simulations
In this section, we present detailed numerical experiments to evaluate the proposed reinforcement learning algorithm on various scenarios described in Sec. II-A.
We consider an edge server with buffer size , CPU capacity , cores, service-rate for each core, holding cost . The CPU capacity is discretized into states for utilization , with corresponding to a state with CPU load , and so on.
The CPU running cost is modelled such that it incurs a positive reinforcement for being in the optimal CPU range, and a high cost for an overloaded system.
The offload penalty is modelled such that it incurs a fixed cost for offloading, so that offloading is encouraged only when the system is loaded, and a very high cost when the system is idle, to discourage offloading in such scenarios.
The probability mass function of resources requested per request is as follows
Rather than simulating the system in continuous time, we simulate the equivalent discrete-time MDP by generating the next event (arrival or departure) using a Bernoulli distribution with the probabilities and costs described in Sec. III. We assume that the parameter in (6) has been absorbed into the cost function, and that the discrete-time discount factor equals .
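The event generation follows the competing-exponentials view of the continuous-time chain: with total arrival rate and total service rate (called `arrival_rate` and `service_rate` below, our own names), the next event is an arrival with probability `arrival_rate / (arrival_rate + service_rate)`. A minimal sketch:

```python
import random

def next_event(rng, arrival_rate, service_rate):
    # Competing exponential clocks: at each event epoch, the event is an
    # arrival with probability arrival_rate / (arrival_rate + service_rate)
    # and a departure otherwise.
    p_arrival = arrival_rate / (arrival_rate + service_rate)
    return 'arrival' if rng.random() < p_arrival else 'departure'
```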
V-A Simulation scenarios
We consider a number of traffic scenarios of increasing complexity and closeness to real-world settings. Each scenario runs for a fixed horizon. The scenarios capture variations in the arrival rate and the number of users over time; the evolution of the arrival rate and the number of users for the more dynamic environments is shown in Fig. 4.
This scenario tests how the learning algorithms perform in the time-homogeneous setting. We consider a system with users with arrival rate . Thus, the overall arrival rate .
This scenario tests how the learning algorithms adapt to occasional but significant changes to arrival rates. We consider a system with users, where each user generates requests at rate for the interval , then generates requests at rate for the interval , and then generates requests at rate again for the interval .
This scenario tests how the learning algorithms adapt to frequent but small changes to the arrival rates. We consider a system with users, where each user generates requests according to rate where and . We assume that each user starts with a rate or with equal probability. At time intervals , each user toggles its transmission rate with probability .
This scenario tests how the learning algorithm adapts to change in the number of users. In particular, we consider a setting where the system starts with user. At every time steps, a user may leave the network, stay in the network or add another mobile device to the network with probabilities , , and , respectively. Each new user generates requests at rate .
This scenario tests how the learning algorithm adapts to large but occasional change in the arrival rates and small changes in the number of users. In particular, we consider the setup of Scenario 2, where the number of users change as in Scenario 4.
This scenario tests how the learning algorithm adapts to small but frequent change in the arrival rates and small changes in the number of users. In particular, we consider the setup of Scenario 3, where the number of users change as in Scenario 4.
V-B The RL algorithms
For each scenario, we compare the performance of the following policies:
Dynamic Programming (DP), which computes the optimal policy using Theorem 1.
SALMUT, as described in Sec. IV-B.
Q-Learning, using (14).
PPO, which belongs to the family of trust-region policy gradient methods and optimizes a surrogate objective function using stochastic gradient ascent.
A2C, which is a two-timescale learning algorithm in which the critic estimates the value function and the actor updates the policy distribution in the direction suggested by the critic.
Baseline, which is a fixed-threshold based policy, where the node accepts requests when (non-overloaded state) and offloads requests otherwise. Such static policies are currently deployed in many real-world systems.
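The baseline is the simplest of these policies; a sketch (the threshold value 70 is an illustrative placeholder for the paper's non-overloaded CPU range, not its actual setting):

```python
def baseline_policy(cpu_load, threshold=70):
    # Static fixed-threshold policy: accept while the CPU load is below
    # the threshold (non-overloaded state), offload otherwise.  Unlike
    # the learned policies, the threshold never adapts to traffic.
    return 'accept' if cpu_load < threshold else 'offload'
```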
We train SALMUT, Q-learning, PPO, and A2C for steps. The performance of each algorithm is evaluated every steps using independent rollouts of length for different random seeds. The experiment is repeated over different sample paths, and the median performance, with an uncertainty band from the first to the third quartile, is plotted in Fig. 5.
For Scenario 1, all RL algorithms (SALMUT, Q-learning, PPO, A2C) converge to a close-to-optimal policy and remain stable after convergence. Since all policies converge quickly, SALMUT, PPO, and A2C are also able to adapt quickly in Scenarios 2-6 and keep track of the time-varying arrival rates and number of users. There are small differences in the performance of the RL algorithms, but these are minor. In contrast, the Q-learning policy does not perform well when the request dynamics change drastically, whereas the baseline policy performs poorly when the server is overloaded.
The plots for Scenario 1 (Fig. 4(a)) show that PPO converges to the optimal policy in less than steps, SALMUT and A2C take around steps, whereas Q-learning takes around steps to converge. The policies of all the algorithms remain stable after convergence. Upon further analysis of the structure of the optimal policy, we observe that the structure of the optimal policy of SALMUT (Fig. 5(b)) differs from that of the optimal policy computed using DP (Fig. 5(a)). There is a slight difference in the structure of these policies when the buffer size (x) is low and the CPU load is high; this occurs because these states are reachable with very low probability, so SALMUT does not encounter them often enough in the simulation to learn the optimal policy there. The plots for Scenario 2 (Fig. 4(b)) show similar behavior while is constant. When changes significantly, we observe that all RL algorithms except Q-learning are able to adapt to drastic but stable changes in the environment. Once the load stabilizes, all the algorithms readjust and perform close to the optimal policy. The plots for Scenario 3 (Fig. 4(c)) show behavior similar to Scenario 1, i.e., small but frequent changes in the environment do not impact the learning performance of the reinforcement learning algorithms.
The plots for Scenarios 4-6 (Fig. 4(d)-4(f)) show consistent performance with a varying number of users. The RL algorithms, including Q-learning, show similar performance for most time steps, except in Scenario 5, which mirrors the behavior observed in Scenario 2. The Q-learning algorithm also performs poorly when the load changes suddenly in Scenarios 4 and 6. This could be because Q-learning takes longer to adjust to a more aggressive offloading policy.
V-D Analysis of Training Time and Policy Interpretability
The main difference among these four RL algorithms lies in the training time and the interpretability of the learned policies. We ran our experiments on a server with an Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz processor. The training times of all the RL algorithms are shown in Table II. The mean training time is computed for a single scenario over different runs and averaged across all six scenarios. SALMUT is about 28 times faster to train than PPO and 17 times faster than A2C. SALMUT does not require a non-linear function approximator such as a Neural Network (NN) to represent its policy, which makes its training very fast. We observe that Q-learning is around 1.5 times faster than SALMUT, as it does not need to update its policy parameters separately. Even though Q-learning is faster than SALMUT, it does not converge to an optimal policy when the request distribution changes.
|Algorithm|Mean Time (s)|Std-dev (s)|
By construction, SALMUT searches over (randomized) threshold-based policies. For example, for Scenario 1, SALMUT converges to the policy shown in Fig. 5(b). It is easy for a network operator to interpret such threshold-based strategies and decide whether or not to deploy them. In contrast, in deep RL algorithms such as PPO and A2C, the policy is parameterized using a neural network, and it is difficult to visualize the learned weights of such a policy and decide whether the resultant policy is reasonable. Thus, by leveraging the threshold structure of the optimal policy, SALMUT learns faster and at the same time provides threshold-based policies which are easier to interpret.
The policy of SALMUT is completely characterized by the threshold vector , making it storage-efficient as well. The threshold nature of the optimal policy computed by SALMUT can be easily interpreted by inspecting the threshold vector (see Fig. 5(b)), making it easy to debug and to predict the behavior of a system operating under such a policy. In contrast, the policies learned by A2C and PPO are encoded in the learned weights of a NN, which are hard to decipher and may occasionally lead to unpredictable results. It is very important that the performance of real-time systems be predictable and reliable, and this requirement has hindered the adoption of NNs in real-time deployments.
V-E Behavioral Analysis of Policies
We performed further analysis of the behavior of the learned policies by observing the number of times the system enters an overloaded state and offloads incoming requests. Let us define to be the number of times the system enters an overloaded state and to be the number of times the system offloads requests in every 1000 steps of the training iteration. We generated a set of random numbers between 0 and 1, denoted by , where is the step count, and use this set to fix the trajectory of events (arrival or departure) for all the experiments in this section. As in the experiments of the previous section, the number of users and the arrival rate are fixed for 1000 steps and evolve according to the scenarios described in Fig. 4. The event is set to arrival if is less than or equal to , and to departure otherwise. These experiments were carried out during training for 10 different seeds. We plot the median of the number of times the system goes into an overloaded state (Fig. 7) and the number of requests offloaded by the system (Fig. 8), along with the uncertainty band from the first to the third quartile, for every 1000 steps.
We observe in Fig. 7 that all the algorithms (SALMUT, Q-learning, PPO, A2C) learn not to enter the overloaded state. As seen in the case of the total discounted cost (Fig. 5), PPO learns this almost instantly, followed by SALMUT, Q-learning, and A2C. This observation holds for all the scenarios we tested. For Scenario 4, PPO enters the overloaded state at around the point where increases drastically (seen in Fig. 3(d)), and we also see the effect on the cost in Fig. 4(d) at that time. We also observe that SALMUT enters overloaded states when the request distribution changes drastically in Scenarios 2 and 5, but it recovers quickly and adapts its threshold policy to a more aggressive offloading policy. The baseline algorithm, on the other hand, enters the overloaded state quite often.
We observe in Fig. 8 that the algorithms (SALMUT, PPO, A2C) learn to adjust their offloading rate to avoid the overloaded state. The number of offloaded requests is directly proportional to the total arrival rate of all users at that time: when the arrival rate increases, the number of offloads in the interval also increases. Even though the RL algorithms offload more requests than the baseline algorithm in all scenarios and time steps, the difference is not significant, implying that the RL algorithms learn policies that offload at the right moments so as not to drive the system into an overloaded state. We perform further analysis of this behavior for the docker testbed (see Fig. 13); the results are similar for the simulations.
VI Testbed Implementation and Results
We test our proposed algorithm on a testbed resembling the MEC architecture in Fig. 1, but without the core network and backend cloud server, for simplicity. We consider an edge node that serves a single application. Both the edge nodes and the clients are implemented as containerized environments in a virtual machine. An overview of the testbed is shown in Fig. 9. The load generator generates requests for each client independently according to a time-varying Poisson process. The requests at the edge node are handled by the controller, which decides either to accept or to offload each request based on the policy for the current state of the edge node. If the action is "accept", the request is added to the request queue of the edge node; otherwise, the request is offloaded to another healthy edge node via the proxy network. The Key Performance Indicator (KPI) collector copies the KPI metrics into a database at regular intervals. The RL module uses these metrics to update its policy. The Subscriber/Notification (Sub/Notify) module notifies the controller about the updated policy, which the controller then uses to serve all future requests.
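The controller's accept/offload decision can be sketched as below. The state encoding (a discretized queue-length/CPU-utilization pair) and the default action for unseen states are assumptions for illustration, not the testbed's exact implementation:

```python
def controller_decide(policy, state):
    """Look up the action for the edge node's current state.

    `policy` maps a discretized state, e.g. (queue_len, cpu_level),
    to "accept" or "offload".
    """
    return policy.get(state, "offload")  # offload by default when unseen

def handle_request(policy, state, queue, proxy_queue):
    """Either enqueue the request locally or forward it via the proxy."""
    action = controller_decide(policy, state)
    if action == "accept":
        queue.append("req")        # add to the edge node's request queue
    else:
        proxy_queue.append("req")  # offload to another healthy edge node
    return action
```

The policy lookup is deliberately a table: a threshold policy over a small discretized state space needs no function approximation at serving time.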
In our implementation, the number of clients served by an edge node and the request rate of the clients are held constant for a fixed interval of time. We define a step to be the execution of the testbed for that interval. Each request runs a workload on the edge node and consumes a random amount of CPU resources. The states, actions, costs, and next states for each step are stored in a buffer on the edge node. After the completion of a step, the KPI collector copies these buffers into a database. The RL module is then invoked; it loads its most recent policy and other parameters, and trains on the new data to update its policy. Once the updated policy is generated, it is copied to the edge node and used by the controller for serving requests during the next step.
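The step/train/publish cycle described above can be sketched as follows. The step length, the transition layout, and the `rl_update` callback are illustrative assumptions standing in for the testbed's KPI collector and RL module:

```python
from collections import namedtuple

Transition = namedtuple("Transition", "state action cost next_state")

def run_step(env_step, policy, buffer, step_len=10):
    """Run the testbed for one step, recording transitions in a buffer.

    `env_step` stands in for executing the containerized testbed for one
    sampling interval under the current policy.
    """
    for _ in range(step_len):
        s, a, c, s2 = env_step(policy)
        buffer.append(Transition(s, a, c, s2))

def end_of_step_update(buffer, database, rl_update):
    """KPI collector copies the buffer to the database; the RL module
    then trains on the new data and returns an updated policy."""
    database.extend(buffer)
    new_policy = rl_update(buffer)
    buffer.clear()
    return new_policy  # pushed to the controller via Sub/Notify
```

The key design point is that training happens between steps, on the freshly collected buffer, so the controller always serves requests with a fixed policy within a step.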
We run our experiments for the same total number of steps as before, where the number of users and the arrival rate evolve according to Fig. 4 for the different scenarios, similar to the previous set of experiments. We consider an edge server with a finite buffer, a fixed CPU capacity and number of cores, a fixed service rate for each core, and a holding cost, configured as in the previous experiment. The CPU capacity is discretized into utilization states, also as in the previous experiment. The CPU running cost and the offload penalty are piecewise functions of the CPU utilization, and the discrete-time discount factor is the same as in the previous experiment.
We run the experiments for SALMUT and the baseline algorithm for a total of 1000 steps. We do not run PPO and A2C in our testbed, as these algorithms cannot be trained in real time: the time they require to process each sample exceeds the sampling interval. The performance of SALMUT and the baseline algorithm is evaluated at every step by computing the discounted total cost for that step using the cost buffers stored in the database. The experiment is repeated for multiple seeds, and the median performance, with an uncertainty band from the first to the third quartile, is plotted in Fig. 10 along with the total request arrival rate in gray dotted lines.
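The per-step evaluation metric is a standard discounted sum over the costs recorded in that step's buffer; a minimal sketch, with an illustrative discount factor:

```python
def discounted_step_cost(costs, gamma=0.95):
    """Discounted total cost for one step from the stored cost buffer.

    `costs` is the sequence of per-sample costs recorded during the
    step; gamma is an illustrative discount factor, not the paper's.
    """
    return sum((gamma ** t) * c for t, c in enumerate(costs))

# Example: two unit costs discounted at 0.5 give 1 + 0.5 = 1.5.
assert discounted_step_cost([1.0, 1.0], gamma=0.5) == 1.5
```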
For Scenario 1 (Fig. 10(a)), we observe that SALMUT outperforms the baseline algorithm right from the start, indicating that SALMUT updates its policy swiftly early on and then slowly converges towards optimal performance, whereas the baseline incurs a high cost throughout. Since SALMUT's policies converge towards optimal performance after some time, they are also able to adapt quickly in Scenarios 2-6 (Fig. 10(b)-10(f)) and track the time-varying arrival rates and numbers of users. SALMUT takes some time to learn a good policy, but once learned, the policy adjusts very well to frequent but small changes in the arrival rate and the number of users (see Fig. 10(c) and 10(d)). If the request rate changes drastically, performance drops a little (which is bound to happen, as the total number of requests to process is much larger than the server's capacity), but the magnitude of the drop is much smaller for SALMUT than for the baseline, as seen in Fig. 10(b), 10(e) and 10(f). This is because the baseline incurs a high overloading cost for these requests, whereas SALMUT incurs offloading costs for the same requests. Further analysis of this behavior is presented in the next subsection.
VI-B Behavioral Analysis
We perform a behavioral analysis of the learned policy by observing the number of times the system enters an overloaded state and the number of incoming requests offloaded by the edge node in a window of size 100. These plots are shown in Figs. 11 and 12.
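The windowed counting used for these plots can be sketched as follows; the per-step event encoding is an assumption for illustration:

```python
def windowed_counts(events, window=100):
    """Count overload entries and offloads in consecutive windows.

    `events` is a list of per-step flags, e.g. {"overload": bool,
    "offload": bool}.  Returns one (overload_count, offload_count)
    pair per window of `window` steps.
    """
    counts = []
    for start in range(0, len(events), window):
        chunk = events[start:start + window]
        counts.append((sum(e["overload"] for e in chunk),
                       sum(e["offload"] for e in chunk)))
    return counts
```

These window-level pairs are exactly the quantities plotted per window in Figs. 11 and 12, and later scattered against each other in Fig. 13.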
We observe from Fig. 11 that the edge node goes into an overloaded state much less often under the policy learned by SALMUT than under the baseline algorithm. Even when the system does enter an overloaded state, it recovers quickly and does not suffer from performance deterioration. From Fig. 11(b) and 11(e) we observe that in Scenarios 2 and 5, when the request load increases drastically (at around 340 steps), the overload count increases, and its effect is also visible in the overall discounted cost in Fig. 10(b) and 10(e) at around the same time. SALMUT adapts its policy and recovers quickly. We also observe in Fig. 12 that SALMUT offloads more often than the baseline algorithm.
A policy that offloads often and never enters an overloaded state does not necessarily minimize the total cost. We investigated further by visualizing a scatter plot (Fig. 13) of the overload count on the y-axis and the offload count on the x-axis, for both SALMUT and the baseline algorithm, for all the scenarios described in Fig. 4. We observe that SALMUT keeps the overload count much lower than the baseline algorithm, at the cost of an increased offload count. The plot for the baseline algorithms is close to linear because they offload reactively. SALMUT, on the other hand, learns a behavior analogous to proactive offloading: it benefits from its offloading actions by keeping the overload count to a minimum.
VII Conclusion and Limitations
In this paper, we considered an optimal single-node policy for overload protection on an edge server in a time-varying environment. We proposed an RL-based, adaptive, low-complexity admission control policy that exploits the structure of the optimal policy and produces policies that are easy to interpret. Our proposed algorithm performs as well as standard deep RL algorithms but has better computational and storage complexity, thereby significantly reducing the total training time. It is therefore more suitable for deployment in real systems for online training.
The results presented in this paper can be extended in several directions. In addition to CPU overload, one could consider other resource bottlenecks such as disk I/O and RAM utilization, and it may be desirable to consider multiple resource constraints simultaneously. Along similar lines, one could consider multiple applications with different resource requirements and different priorities. If it can be established that the optimal policy in these more sophisticated setups has a threshold structure similar to Proposition 2, then the framework developed in this paper applies.
The discussion in this paper was restricted to a single node. These results could also provide a foundation for investigating node overload protection in multi-node clusters, where additional challenges such as routing, link failures, and network topology must be considered.
Appendix A Proof of Proposition 1
Starting from the all-zero value function, define a sequence of value functions by repeatedly applying the dynamic programming operator of (6) to the previous iterate.
Note that these are the iterates of the value iteration algorithm and, by standard results on discounted Markov decision processes, they converge to the unique fixed point of (6).
For each iteration, and for each fixed value of the remaining arguments, the value function iterate is weakly increasing in the state.
We prove the result by induction. The initial all-zero value function is trivially weakly increasing in the state, which forms the basis of the induction. Now assume that the current iterate is weakly increasing in the state, and consider the next iteration. For any two states where the first is no larger than the second, we have
where the inequality follows from the fact that the per-step cost and the current iterate are weakly increasing in the state.
By a similar argument, we can show the corresponding inequality for the other action, which completes the induction step.
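The induction above can be checked numerically on a toy admission-control chain. The costs, transition probabilities, and state space below are illustrative assumptions, not the paper's model:

```python
def value_iteration(num_states=6, p_arrival=0.6, hold_cost=1.0,
                    reject_cost=5.0, gamma=0.9, iters=200):
    """Value iteration for a toy birth-death admission-control chain.

    State = queue length; on an arrival the controller may accept
    (queue grows) or reject (pay a penalty, queue unchanged); a
    departure shrinks the queue.  All numbers are illustrative.
    """
    V = [0.0] * num_states
    for _ in range(iters):
        newV = []
        for x in range(num_states):
            down = max(x - 1, 0)              # departure
            up = min(x + 1, num_states - 1)   # accepted arrival
            accept = hold_cost * x + gamma * (
                p_arrival * V[up] + (1 - p_arrival) * V[down])
            reject = hold_cost * x + reject_cost * p_arrival + gamma * (
                p_arrival * V[x] + (1 - p_arrival) * V[down])
            newV.append(min(accept, reject))
        V = newV
    return V

V = value_iteration()
# Each iterate, and hence the fixed point, is weakly increasing in x.
assert all(V[x] <= V[x + 1] for x in range(len(V) - 1))
```

Monotone holding costs and stochastically monotone transitions are what drive the induction, and the numerical fixed point inherits the weakly increasing structure.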
Appendix B Proof of Proposition 2
Consider the difference between the action value of accepting and the action value of rejecting a request.
For a fixed arrival rate, by Proposition 1, this difference is weakly decreasing in the state. Hence, if it is optimal to reject a request at some state, then the same inequality holds at any larger state; therefore, it is optimal to reject the request there as well.
Appendix C Proof of Optimality of SALMUT
The choice of learning rates implies a separation of timescales between the updates of (14) and (15). In particular, iteration (14) evolves on a faster timescale than iteration (15). Therefore, we first consider update (14) under the assumption that the policy, which updates on the slower timescale, is held constant.
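A minimal sketch of this timescale separation: the fast iterate tracks a quantity that depends on the slow one, while the slow iterate moves with much smaller steps. The recursions and step-size exponents below are illustrative, not the paper's updates (14)-(15):

```python
def two_timescale(num_iters=20000):
    """Fast iterate x tracks 2*theta while slow iterate theta drifts to 1.

    Step sizes b_n (fast) and a_n (slow) satisfy a_n / b_n -> 0, the
    condition that justifies analyzing the fast loop as if the slow
    variable were frozen.
    """
    x, theta = 0.0, 0.0
    for n in range(1, num_iters + 1):
        b = 1.0 / n ** 0.6          # fast step size
        a = 1.0 / n                 # slow step size, a/b -> 0
        x += b * (2.0 * theta - x)  # fast: chase equilibrium 2*theta
        theta += a * (1.0 - theta)  # slow: drift toward 1
    return x, theta

x, theta = two_timescale()
assert abs(theta - 1.0) < 0.05 and abs(x - 2.0 * theta) < 0.05
```

At convergence the fast variable sits at the equilibrium induced by the slow one, mirroring the structure of the proof: analyze (14) with the policy frozen, then analyze (15) against the converged fast iterate.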
We first provide a preliminary result.
The action-value function corresponding to the policy is Lipschitz continuous in the policy parameter.
This follows immediately from the Lipschitz continuity of the value function in the policy parameter.
Define the operator corresponding to the expected update of (14) as follows:
Then, the step-size conditions imply that, for a fixed policy, iteration (14) may be viewed as a noisy discretization of the ODE (ordinary differential equation):
Then we have the following:
The ODE (21) has a unique globally asymptotically stable equilibrium point.
We now consider the slower timescale. Recall that the performance of a policy is evaluated from the fixed initial state of the MDP, and consider the ODE limit of the slower-timescale iteration (15), which is given by
The equilibrium points of the ODE (23) are the same as the local optima of the performance objective. Moreover, these equilibrium points are locally asymptotically stable.
The equivalence between the stationary points of the ODE (23) and the local optima of the performance objective follows from the definition. Now consider the performance objective as a Lyapunov function. Observe that
as long as the gradient is nonzero. Thus, by the Lyapunov stability criterion, all local optima of (23) are locally asymptotically stable.
Now we have all the ingredients to prove convergence. Lemmas 2-4 imply assumptions (A1) and (A2) of the two-timescale stochastic approximation framework. Thus, iterations (14) and (15) converge almost surely to a limit point, provided that the iterates remain bounded.
Note that the slow-timescale iterates are bounded by construction. The boundedness of the fast-timescale iterates follows from considering the scaled version of (21):
It is easy to see that