I Introduction
The challenges introduced by new softwarized technologies, such as network function virtualization (NFV) [mijumbi2016network] and software-defined networking (SDN) [mckeown2009software], and new network architectures such as 5G, are driving network transformations that radically change the way operators manage their networks and orchestrate network services. These network architectures come with a diverse range of capabilities and requirements, including massive capacity (e.g., massive MIMO [larsson2017massive]), ultra-low latency (ULL), ultra-high reliability, and support for massive machine-to-machine (M2M) communications in the context of Industry 4.0 [wiki:industry4.0]. Networks are being transformed into programmable, software-driven, service-based, and holistically managed systems, accelerated via enablers such as NFV, SDN, and mobile edge computing. In order to consolidate multiple networks with varied requirements, the 5G architecture must exploit network virtualization and programmability. This introduces the problem of network slicing [5GSM2017], where a single network infrastructure is divided into multiple subnetworks, and each slice can be operated by a different party.
Network slicing can be modeled as a dynamic resource allocation problem. Once the network is sliced into multiple subnetworks with SLA requirements on latency, bandwidth, reliability, etc., the underlying infrastructure operator needs to ensure that the SLAs for each slice are guaranteed. These strict SLA guarantees need to be maintained despite variability in the slice request arrivals and in the distributions of requested resources. For example, if one of the slices is owned by a video streaming operator and requires low latency during a popular gaming event, then the slice resource allocation needs to be dynamically adjusted as a burst of requests arrives. While such network resource allocation problems have traditionally been solved using analytical queueing-theoretic and optimization methods [Neely:2010:SNO:1941130, Shakkottai:2007:NOC:1345033.1345034, Srikant:2014:CNO:2636796], given the complexity and scale of modern networks such methods are not feasible. We therefore resort to approximate black-box models using machine learning, namely Reinforcement Learning (RL)
[sutton2018reinforcement]. Reinforcement learning is a computational approach for goal-directed learning and decision making, with the goal being to select actions that maximize future rewards [sutton2018reinforcement]. The emphasis is on learning by an agent (with the capacity to act), where each action influences the agent's future state, through direct interaction with its environment, without the need for exemplary supervision or complete models of the environment. RL methods are self-customizable: they can adapt the mapping from states to actions to maximize expected rewards in response to changing environmental conditions, for example, the network environment (wireless vs. wired), traffic dynamics, and job size distributions. Recent developments in RL employ deep neural networks to learn complex patterns in the experienced state and can select high-reward actions (e.g., AlphaGo
[silver2017mastering]). Our hypothesis is that deep RL can be used to learn good network slicing strategies by learning over simulated and trace-driven workload data, and that such learned policies can be applied to real network slicing deployments. To satisfy a network slice resource request, a network operator needs to simultaneously allocate heterogeneous resources such as compute, bandwidth, and storage, adding further complexity to the problem. To summarize, we face three main challenges for efficient dynamic resource allocation for network slicing:

1) Unknown request arrival process and request resource requirements.

2) Heterogeneous resource requirements for each slice (e.g., CPU, bandwidth, memory, etc.).

3) Finite resource capacities.
While existing solutions deal with each challenge separately, we propose a unified solution using deep RL that can simultaneously deal with varying and unknown traffic arrival dynamics, heterogeneous resource requirements, and finite resource capacities. In our formulation, each network slice has both bandwidth and compute resource allocation requirements, where the distribution of request arrivals and the amounts of resources requested are not known a priori. Our main contributions are as follows:

a) Mathematical formulation of the network slicing resource allocation RL problem as a Markov Decision Process (MDP), where the constrained multi-resource optimization problem is formulated for both service upon arrival and batch service.

b) A policy-gradient method to solve this problem based on the popular REINFORCE algorithm [Williams:1992:SSG:139611.139614], where a deep neural network architecture is used as the function approximator for learning the optimal policy.

c) An experimental study with varying resource budgets, using both simulated and real datasets, to evaluate our proposed solution against an equal slicing strategy.
Our RL framework for network slicing creates new opportunities for network and infrastructure operators, as they can train network slicing models based on their target network dynamics, network heterogeneity, and SLA requirements. The models can be trained offline using simulated and workload data, and then deployed to make slicing resource allocation decisions in real time.
The remainder of the paper is organized as follows. Section II gives an overview of the network slicing architecture in 5G. Section III gives an introduction to RL, including Markov decision process (MDP) models and policy gradient algorithms. In Section IV we formulate the models for dynamic resource allocation for network slicing as an MDP, and give details of the policy gradient learning algorithms. Section V gives the details of the workload traces for CPU and bandwidth that are used in the experiments. Section VI presents the experimental results of our proposed models using both simulated and real datasets. Finally, we discuss related work in Section VII and conclude in Section VIII.
II Network Slicing
Network slicing in 5G is an emerging technology area and, hence, presents many opportunities as well as research challenges. In [7Li2017] the authors propose a framework for the technology and discuss challenges and future research directions, listing the efficient allocation of slice resources as a challenging problem to be addressed algorithmically.
The basic concept of network slicing is a virtual network architecture running multiple logical networks on a common shared physical infrastructure. Each network slice represents an independent virtualized end-to-end network customized to meet the specific needs of an application, service, device, customer, or operator [5GSM2017]. It comprises a set of logical (software) network functions that support the requirements of the particular use case, where each function is optimized to provide the resources and network topology for the specific service and traffic that will use the slice.
A key benefit of network slicing is that it provides an end-to-end virtual network encompassing compute, bandwidth, and storage functions. The objective is to allow a network operator to partition its network resources so that different types of users or tenants can multiplex over a single physical infrastructure. A typical example used in the 5G literature is the following: run Internet of Things (IoT), Mobile Broadband (MBB), and vehicular communications applications on the same network. IoT will typically have a large number of devices, each with low throughput; MBB will have a smaller number of devices with high-bandwidth content; and vehicular communications will have stringent low-latency requirements. The goal of network slicing is to enable partitioning of the physical network at an end-to-end level so as to allow optimal grouping of traffic, tenant isolation, and configuration of resources at a macro level. In Figure 1 we show an example of 5G network slices spanning the access, transport, and mobile packet core network domains [rost2017network].
III Background on RL and Deep RL
In this section, we give a brief introduction to RL, specifically the use of neural networks for RL function approximation and the policy gradient learning algorithm.
III-A Basic Reinforcement Learning Model
In RL, an agent interacts with an environment. The agent has a set of discrete or continuous actions to choose from, and the action can influence the next state of the environment. We consider the full RL model, where actions can influence state. This is unlike the bandit framework, where actions do not influence state, and, at every instance, the agent only chooses the best action independent of how it may impact the state.
At each time step t, the RL agent observes the environment's state s_t and selects an action a_t. At the next time step t+1, the agent observes a reward r_t, which represents the cost/reward for taking the last action, along with the next state s_{t+1}. We make the Markovian assumption that the future state s_{t+1} depends only on the current state s_t (and the chosen action a_t), and also that the dynamics are stationary and do not change over time. The agent's goal is to maximize the expected cumulative discounted return:

    E[ Σ_{t=0}^{∞} γ^t r_t ]    (1)

where the parameter γ ∈ [0, 1) is the discount factor: it weighs the immediate reward against the possibility of better rewards in the future, and it is also a mathematical device to ensure that cumulative rewards converge for infinite-horizon problems. Figure 2 summarizes the full RL model.
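As a small illustration of Equation (1) (our sketch, not from the paper), the discounted return of a finite reward sequence can be computed by folding the rewards from the end of the episode backwards:

```python
def discounted_return(rewards, gamma):
    """Cumulative discounted return: sum over t of gamma^t * r_t (Equation 1)."""
    g = 0.0
    # Accumulate backwards so each earlier step applies one more factor of gamma.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Three unit rewards with gamma = 0.9: 1 + 0.9 + 0.81 = 2.71.
g = discounted_return([1.0, 1.0, 1.0], 0.9)
```

With γ < 1 the geometric weighting keeps the sum bounded even for very long horizons, which is exactly the convergence role of the discount factor mentioned above.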
III-B Policy
The agent chooses actions based on a learned policy, which is a probability distribution over actions for any state: π(s, a) denotes the probability of taking action a in state s, with Σ_a π(s, a) = 1. In most practical scenarios, especially network environments, there are exponentially many possible state-action (s, a) pairs (see Section IV). Thus, modern scalable methods eschew tabular representations in favor of function approximators [sutton2018reinforcement]. A function approximator is parameterized by θ, and the corresponding policies are denoted by π_θ. Typically, policy approximations learn to cluster the behavior of similar states such that, given a new state, the approximator can find the action for the closest state seen so far. There is considerable research on various types of function approximators; any popular supervised learning framework, such as SVM, can be used. Recently, a popular choice is a deep neural network (DNN) as the function approximator [sutton2018reinforcement], which we also use in our models. An attractive feature of DNNs is that the model automatically learns the best representation of the feature space.
III-C Policy Gradient Methods
We utilize a class of policy learning algorithms called policy gradient methods, which try to learn an optimal policy by gradient ascent on the policy parameters [sutton2018reinforcement]. The objective is to maximize the expected cumulative reward (Equation 1); the gradient of this objective is given by:

    ∇_θ E[ Σ_t γ^t r_t ] = E_{π_θ}[ ∇_θ log π_θ(s, a) Q^{π_θ}(s, a) ]    (2)

where Q^{π_θ}(s, a) represents the expected cumulative discounted reward from selecting action a in state s and then following policy π_θ. This class of methods estimates the gradient by sampling trajectories of policy executions, obtaining a reward estimate v_t for each trajectory, and subsequently updating the policy parameters by gradient ascent using the following equation:

    θ ← θ + α Σ_t ∇_θ log π_θ(s_t, a_t) v_t    (3)
where α is the gradient ascent step size. This results in the REINFORCE algorithm [Williams:1992:SSG:139611.139614], which we used for learning in our system. The pseudocode of our implemented training algorithm is in Section IV.
An advantage of policy-gradient methods is that they directly search the policy space and can be quite generic, so that a general algorithm can be adapted to various scenarios without significant modification. Policy gradient techniques have found success in robotics, game playing, and cloud resource allocation domains. There are other popular learning algorithms, such as Q-learning, which could also be used. Below we briefly describe the differences between policy gradient algorithms such as REINFORCE and value iteration methods such as Q-learning, and indicate why we choose policy gradients for our system.
First, both classes of algorithms can solve general MDPs and can converge to optimal policies. However, their internal structures are different. The fundamental difference is in the approach to action selection, both while learning and in the output (the learned policy). In Q-learning, the goal is to learn a single deterministic action from a discrete set of actions by finding the maximum value. With policy gradients, and other direct policy searches, the goal is to learn a map from states to actions, which can be stochastic and can work in continuous action spaces. As a result, policy gradient methods can solve problems that value-based methods cannot, especially in scenarios with large or continuous action spaces and stochastic policies. In our RL formulation (details in Section IV), we deal with real-valued resource allocations with large action spaces, for example, compute and bandwidth. Due to these benefits we selected policy gradient methods for our RL agents, though our system is general enough to use alternative RL algorithms as well.
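As a concrete illustration of the score-function update in Equation (3), the following minimal sketch (ours, not the paper's implementation) applies REINFORCE with a softmax policy to a toy two-action problem with one-step episodes; all names here are of our choosing:

```python
import math
import random

def softmax(theta):
    """Numerically stable softmax over a parameter vector."""
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    s = sum(e)
    return [x / s for x in e]

def reinforce(rewards=(0.0, 1.0), alpha=0.1, episodes=2000, seed=0):
    """Toy REINFORCE: learn a softmax policy over two actions.

    Each episode is one step, so the return equals the immediate reward;
    the update is theta += alpha * grad(log pi(a)) * return (Equation 3).
    """
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(episodes):
        pi = softmax(theta)
        a = 0 if rng.random() < pi[0] else 1
        g = rewards[a]  # one-step episode: return = immediate reward
        # For a softmax policy, d(log pi(a))/d(theta_k) = 1[k == a] - pi[k].
        for k in range(2):
            theta[k] += alpha * ((1.0 if k == a else 0.0) - pi[k]) * g
    return softmax(theta)

pi = reinforce()  # probability mass shifts toward the rewarding action
```

In the paper's setting the policy is a deep network over request features rather than a two-entry parameter table, but the stochastic-policy structure and the gradient update are the same.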
IV Proposed Models
In this section we present models for allocating bandwidth and virtual machines (VMs) to network slices. We formulate the resource allocation problem and describe it in the RL setting. We present two models for different service types: service upon arrival and batch service.
IV-A Service upon arrival
For service upon arrival, consider a mobile network that receives resource requests for bandwidth and VMs from a set of classes over a time horizon T. It is assumed that, for each class, the bandwidth and VM requests share the same arrival process; however, the requested quantities of the two resources follow different and independent distributions. We assume infinite buffers for both resources to hold received requests. The controller determines resource allocations for each class (slice) for the whole network upon each arrival of a set of requests. Fig. 4 presents the request arrival and service processes for the service upon arrival formulation.
A set of bandwidth and VM requests for each class arrives according to a common arrival process. Each arrival brings different amounts of requested bandwidth and VMs from each class. For bandwidth, we track the requested amounts and the buffer levels, where a buffer level accumulates the requests that have not yet been served by past allocations. Similarly, for VMs, we track the requested amounts and buffer levels. Finally, a controller determines the bandwidth and computing resource allocations.
Our goal is to maximize the Quality of Service (QoS) with respect to bandwidth and VM requests and minimize resource usage costs. Measuring QoS as delays to process received requests, we achieve our goal by solving the following optimization problem:
    min E[ Σ_t γ^t (D_t + β C_t) ]    (4)

where D_t integrates the delays in processing bandwidth and VM requests, measured via the buffer levels, C_t is the cost of the allocated bandwidth and computing resources, and γ and β are the discount and balance factors, respectively.
We solve the above optimization problem with RL algorithms. TABLE I shows the state, action and reward for the RL formulation.
Deep neural networks are introduced as policy agents for determining the policy π_θ. We assign a separate agent to each of the resources, namely CPU and bandwidth. Each neural network agent is fed an input feature vector consisting of the amount of received resource requests, the buffer level, and the last request arrival time, and computes as output the slice resource allocation. We solve our RL models by applying a class of RL algorithms, policy gradient methods, that learn by performing gradient descent on the policy parameters [sutton2018reinforcement]. The algorithm, based on REINFORCE [Williams:1992:SSG:139611.139614], is shown in Figure 3. In this algorithm, the objective function is the expected cumulative discounted loss shown in Eq. 4, and a separate set of policy parameters is learned for each resource. Each neural agent consists of multiple hidden layers and one output layer. The leaky ReLU, f(x) = x for x > 0 and f(x) = εx otherwise, with a small constant ε > 0, is adopted as the activation function at the hidden and output layers. The leaky ReLU is selected because it produces positive allocations, and it resolves a difficulty of the ReLU when units are not active by allowing a small, positive gradient [Maas2013].
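As a small illustration (our sketch, not the paper's code), the leaky ReLU activation described above can be written as:

```python
def leaky_relu(x, eps=0.01):
    """Leaky ReLU: identity for positive inputs, small slope eps for
    negative inputs, so units never have an exactly-zero gradient
    (this avoids the 'dead unit' problem of the plain ReLU)."""
    return x if x > 0 else eps * x
```

The value eps=0.01 here is a common default and an assumption on our part; the paper only states that ε is a small constant.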
Table I: State, action, and reward of the RL formulation for the two service models.

         Service upon arrival                           Batch service
State    Resource request arrivals and buffer levels    Resource request arrivals and buffer levels
         since the last arrival time                    since the last service time
Action   Resource allocation to each slice              Resource allocation to each slice
Reward   -(Delays in processing requests +              -(Delays in processing requests +
         resource use costs)                            resource use costs)
IV-B Batch service
Next we formulate the resource allocation problem for a batch service strategy, where the slice orchestrator wakes up periodically and processes all the requests in the queue. Selecting between service upon arrival and batch service is a trade-off between increased delay (waiting time in the queue) and reduced scheduling overhead (resources are allocated only at scheduled times). In batch service, received resource requests are held in buffers, and resources are allocated for processing at predetermined times. The same arrival process and assumptions apply to this formulation as to the service upon arrival formulation. Fig. 5 presents an example request arrival and service process for the batch mode of operation.
The same optimization problem is solved for the batch formulation as for the service upon arrival formulation. The MDP formulation for this model is shown in Table I; however, a different neural network structure is proposed for the policy agent in this batch mode. Here, each per-class deep neural network agent uses statistical measures (mean, max, and standard deviation) of the received request amounts, the buffer levels, and the request arrival times, computed over the requests in a single batch. This allows the batch service to capture the dependencies among the requests in each batch, and prevents the allocation from simply summing up per-request allocations. The hidden and output layers of the neural network agents in this batch model are similar to those of the service upon arrival model.

We compare our proposed algorithms to an equal slicing policy. This baseline strategy divides the resources fairly among the slices. Such a policy only makes sense when there is a finite resource budget. Thus, in our RL algorithms, we also impose budget constraints for each resource type to make a fair comparison; however, our proposed algorithms are generic and also work when there is no bound on resource capacity.
We introduce budget constraints to Equation (4), with one budget for bandwidth and one for computing resources. Throughout the whole time horizon, we do not allow resource allocations larger than the budgets. To implement the budget constraints, whenever the sum of the allocations produced by the neural agents exceeds a budget, we project the allocations proportionally onto the budget, i.e., each allocation is scaled by the ratio of the budget to the total allocation. The budget constraints are incorporated in Algorithm 3.
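The proportional projection described above can be sketched as follows (a minimal illustration; the function and variable names are our own):

```python
def project_to_budget(allocations, budget):
    """Scale per-slice allocations proportionally so their sum never
    exceeds the budget; allocations already within budget are unchanged."""
    total = sum(allocations)
    if total <= budget:
        return list(allocations)
    scale = budget / total
    return [a * scale for a in allocations]
```

Note that the projection preserves the relative shares learned by the agents; only the overall scale is clipped to the budget.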
V Data for Models
In this section we briefly discuss the workload traces for CPU and bandwidth requests that are used for the evaluation of the algorithms (Section VI).
Facebook Workload Trace. SWIM [swim] is a workload generator for MapReduce cluster traces. The SWIM repository contains real workload traces from production clusters. We utilize the SWIM workload suite for Facebook MapReduce clusters. The traces contain both job arrival times and job sizes (map input bytes). The arrival data shows that the arrival process is generally bursty, and is thus more realistic than a Poisson arrival process for simulating job arrivals. Each workload covers 1 hour of production cluster usage. In our experiments we use 3 different traces, one for each slice. For each slice we run through the trace, simulating a job submission event at the corresponding arrival time from the trace, and use the job size at that time instance.
Bandwidth Data Trace. We use a 4G LTE trace of bandwidth data from [Hooft2016] for the experiments. Since the bandwidth trace is continuous, we sample the bandwidth only at specific instances.
Merging CPU and Bandwidth Traces. Since the traces for CPU and bandwidth were separate, we had to make some assumptions to combine the two traces for our experiments. For each arrival time in the Facebook trace, we take the maximum bandwidth value observed in the interval since the previous arrival time.
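A minimal sketch of this merging step (our own helper; the paper does not publish its merging code, and the interval boundary handling here is an assumption):

```python
import bisect

def merge_traces(arrival_times, bw_times, bw_values):
    """For each CPU-trace arrival time, return the maximum bandwidth
    sample observed since the previous arrival (interval (prev, t]).
    bw_times must be sorted ascending and aligned with bw_values."""
    merged = []
    prev = float("-inf")
    for t in arrival_times:
        lo = bisect.bisect_right(bw_times, prev)  # first sample after prev
        hi = bisect.bisect_right(bw_times, t)     # last sample at or before t
        window = bw_values[lo:hi]
        merged.append(max(window) if window else 0.0)
        prev = t
    return merged
```

For example, with bandwidth samples at times 1..5 and CPU arrivals at times 2 and 5, each arrival picks the peak bandwidth seen since the preceding arrival.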
VI Performance Evaluation
In this section we present the scenarios for experimentation and the results obtained with simulated and real data. Model implementations for all the experiments are done in Python TensorFlow, using a GeForce GTX 1080 GPU and an Intel(R) Core(TM) i7-6850K CPU @ 3.60GHz.
VI-A Scenarios
For analysis, we create scenarios with 4 levels of the resource budget: smaller, small, large, and larger. Each arriving request requires two resources: bandwidth and compute (expressed as a number of VMs). The budget size is determined using the means and standard deviations of the resource request distributions. For service upon arrival, the budget for class i is set to μ_i + kσ_i, where μ_i and σ_i are the mean and standard deviation, respectively, of the request distributions for bandwidth and VMs, and k is set to 0, 1, 2, and 3 for the smaller, small, large, and larger budget scenarios, respectively. Batch service requires a larger budget, since arriving requests wait in the buffer until the next batch service time. To this end, we multiply the service upon arrival budget by the service time interval and the request arrival rate, i.e., the batch budget for class i is τ_i λ_i (μ_i + kσ_i), where τ_i is the service time interval and λ_i is the arrival rate.
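The budget rule above can be sketched in a few lines (our illustration, assuming the μ + kσ reading of the rule; function names are ours):

```python
def arrival_budget(mean, std, k):
    """Per-class budget mu + k*sigma for service upon arrival;
    k in {0, 1, 2, 3} gives the smaller/small/large/larger scenarios."""
    return mean + k * std

def batch_budget(mean, std, k, service_interval, arrival_rate):
    """Batch service scales the arrival budget by the expected number
    of requests accumulated per service interval (tau * lambda)."""
    return service_interval * arrival_rate * arrival_budget(mean, std, k)
```

For instance, with τ = 10 and λ = 2 the batch budget is 20 times the per-arrival budget, reflecting the requests that pile up in the buffer between service times.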
VI-B Simulated Data
The models are first validated by conducting experiments on simulated data. The data is generated in the same format as the real data described in Section V. There are three classes of requests, each with different characteristics of bandwidth and VM requests. Request arrivals are assumed to follow a Poisson process with an average of 2 events per interval, and the requested resource amounts follow uniform distributions with the bounds [a, b] shown in Table II.

Table II: Uniform distribution bounds for request amounts.

           Bandwidth        VM
           a      b         a      b
Class 1    100    150       500    600
Class 2    100    200       1000   1500
Class 3    300    500       1000   2000
The bounds for the distributions are based on the assumption that each class has different characteristics in terms of mean and standard deviation; e.g., class 2 has a larger standard deviation than class 1 for bandwidth. For the hyperparameters, different settings were tested: learning rates from 0.1 to 0.001, 1 to 3 layers, and 500 or 1,000 units per layer. Based on the results, the following settings were chosen. For service upon arrival, we generate 1,000 episodes for training and 100 episodes for testing. For batch service, we generate 5,000 episodes for training and 100 episodes for testing. The service time interval is set to 10 for batch service, i.e., resources are allocated every 10 time units. We put the same weight on delays and resource use costs, i.e., the balance factor is set to 1. In both cases, each neural agent has 3 hidden layers with 1,000 units. The Adam optimizer [kingma2014adam], an algorithm for first-order gradient-based optimization of stochastic objective functions based on adaptive estimates of lower-order moments, is used for the gradient optimization with a learning rate of 0.001.
Using the simulated data, Table III shows the results of the proposed models (NN) compared with the equal slicing strategy (ES), in terms of the expected rewards for each class and resource, with the winners shown in bold. Note that a smaller loss means a larger reward. The results show that our models perform better in almost all scenarios compared to the ES strategy. Though the NN models may earn smaller rewards in individual cases, in total they earn larger rewards. For example, in the smaller budget scenario for service upon arrival, our model has a greater loss for bandwidth allocations in classes 1 and 2 compared to ES; however, it achieves a smaller loss in total. As formulated, our models learn the request amount distributions of each class and then allocate resources optimally to each class, such that the budgets are used efficiently.
Table III: Expected losses for each class and resource on the simulated data (smaller is better).

Service upon arrival
Budget    Res  Alg  C1         C2         C3         Total
Smaller   BW   NN   4.093E+02  4.935E+02  1.510E+03  2.413E+03
               ES   2.250E+02  2.250E+02  8.778E+04  8.823E+04
          VM   NN   1.958E+03  4.630E+03  5.358E+03  1.195E+04
               ES   1.100E+03  7.620E+04  2.011E+05  2.784E+05
Small     BW   NN   1.428E+02  1.727E+02  4.642E+02  7.796E+02
               ES   2.587E+02  2.587E+02  7.098E+04  7.150E+04
          VM   NN   6.169E+02  1.423E+03  1.746E+03  3.786E+03
               ES   1.254E+03  4.120E+03  1.244E+05  1.298E+05
Large     BW   NN   1.641E+02  2.107E+02  5.023E+02  8.771E+02
               ES   2.924E+02  2.924E+02  5.418E+04  5.477E+04
          VM   NN   6.691E+02  1.536E+03  2.018E+03  4.224E+03
               ES   1.408E+03  1.437E+03  4.819E+04  5.103E+04
Larger    BW   NN   1.775E+02  2.328E+02  5.679E+02  9.781E+02
               ES   3.260E+02  3.260E+02  3.739E+04  3.804E+04
          VM   NN   9.119E+02  1.594E+03  2.180E+03  4.686E+03
               ES   1.562E+03  1.562E+03  2.841E+03  5.965E+03

Batch service
Budget    Res  Alg  C1         C2         C3         Total
Smaller   BW   NN   6.886E+02  8.217E+02  2.200E+03  3.710E+03
               ES   1.125E+03  1.126E+03  2.419E+04  2.644E+04
          VM   NN   2.993E+03  6.880E+03  8.277E+03  1.815E+04
               ES   5.500E+03  7.707E+03  3.143E+04  4.464E+04
Small     BW   NN   7.371E+02  8.842E+02  2.357E+03  3.978E+03
               ES   1.293E+03  1.294E+03  1.620E+04  1.879E+04
          VM   NN   3.179E+03  7.352E+03  8.805E+03  1.934E+04
               ES   6.270E+03  6.878E+03  1.166E+04  2.481E+04
Large     BW   NN   8.529E+02  9.858E+02  2.577E+03  4.416E+03
               ES   1.462E+03  1.462E+03  8.587E+03  1.151E+04
          VM   NN   3.474E+03  8.125E+03  9.692E+03  2.129E+04
               ES   7.040E+03  7.262E+03  8.489E+03  2.279E+04
Larger    BW   NN   9.089E+02  1.107E+03  2.882E+03  4.898E+03
               ES   1.630E+03  1.630E+03  3.575E+03  6.836E+03
          VM   NN   3.870E+03  8.970E+03  1.064E+04  2.348E+04
               ES   7.809E+03  7.892E+03  8.389E+03  2.409E+04
VI-C Real Data
The models are further validated in real settings by conducting trace-driven experiments using a Facebook cluster compute workload trace for CPU and a 4G LTE trace for bandwidth (details in Section V). Since the models control both bandwidth and computing requests concurrently, the traces are combined as described in Section V. For the bandwidth trace, we used the three classes Bus, Car, and Train from the data. For the compute trace, we use the traces labeled FB20090, FB20091, and FB20100, one for each class. For bandwidth requests, it is assumed that the maximum request amount since the last arrival should be serviced. Also, given the limited number of traces, we assume that the requests are recurrent. To prevent divergence during training, we appropriately scaled down the original values for compute and bandwidth in the traces. The traces are split into two sets: 90% of the trace is used to train the model, and 10% is used for testing. The experimental setup is similar to the simulated data experiments, including the settings for the hyperparameters, the weights for delays and resource usage costs, and the service time interval for batch service. Each neural agent has three hidden layers with 1,000 units. Here we also use the Adam optimizer with a learning rate of 0.001.
Table IV shows the results of the trace-driven experiments using the real data, comparing the proposed models (NN) with the equal slicing strategy (ES) in terms of the expected rewards for each class and resource. We denote the winners in bold. As with the simulated data, the results show that our models perform better in almost all scenarios compared to the ES strategy. Though the NN models may earn smaller rewards in individual cases, in total they earn larger rewards. For example, in the smaller budget scenario for service upon arrival, our model has a greater loss for VM allocations in class 2 compared to the ES strategy; however, it achieves a smaller total loss, as it saves substantially in class 3. This can be explained by looking at the training phase shown in Figure 7: our models learn the request amount distributions of all classes and then allocate resources accordingly. During training, our models allocate resources differently to each class, and buffer levels also decrease.
Based on experimental results on both simulated and real data, we can conclude that our models can learn efficient resource allocation policies for the coupled dynamic resource allocation problem for network slicing and outperform the baseline strategy.
Table IV: Expected losses for each class and resource on the real data traces (smaller is better).

Service upon arrival
Budget    Res  Alg  C1         C2         C3         Total
Smaller   BW   NN   1.170E+05  2.933E+05  4.558E+04  4.559E+05
               ES   1.140E+05  2.935E+05  4.847E+04  4.559E+05
          VM   NN   4.679E+02  6.658E+02  6.776E+03  7.909E+03
               ES   9.514E+02  6.913E+01  1.367E+06  1.368E+06
Small     BW   NN   4.585E+02  1.878E+03  6.056E+02  2.942E+03
               ES   2.387E+01  1.318E+05  1.793E+01  1.319E+05
          VM   NN   9.702E+01  9.266E+01  5.609E+02  7.506E+02
               ES   1.718E+02  1.719E+02  9.897E+02  1.333E+03
Large     BW   NN   9.017E+00  1.379E+01  7.341E+00  3.015E+01
               ES   7.546E+00  1.418E+03  7.546E+00  1.433E+03
          VM   NN   9.510E+01  1.211E+02  8.374E+02  1.054E+03
               ES   3.231E+02  3.232E+02  7.204E+02  1.367E+03
Larger    BW   NN   7.521E+00  1.321E+01  7.910E+00  2.864E+01
               ES   9.548E+00  4.866E+01  9.548E+00  6.775E+01
          VM   NN   8.966E+01  1.095E+02  1.271E+03  1.470E+03
               ES   4.757E+02  4.757E+02  7.341E+02  1.685E+03

Batch service
Budget    Res  Alg  C1         C2         C3         Total
Smaller   BW   NN   1.663E+04  2.795E+04  1.312E+04  5.771E+04
               ES   1.584E+04  3.176E+04  1.129E+04  5.888E+04
          VM   NN   2.604E+03  9.454E+03  8.106E+04  9.311E+04
               ES   1.246E+02  1.361E+02  1.553E+05  1.556E+05
Small     BW   NN   1.283E+04  1.594E+04  9.770E+03  3.854E+04
               ES   1.108E+04  2.081E+04  6.741E+03  3.862E+04
          VM   NN   8.109E+02  6.816E+02  2.009E+03  3.501E+03
               ES   1.163E+03  1.163E+03  1.342E+03  3.669E+03
Large     BW   NN   7.517E+03  1.074E+04  6.762E+03  2.502E+04
               ES   6.588E+03  1.620E+04  2.260E+03  2.505E+04
          VM   NN   1.549E+03  9.145E+02  4.148E+03  6.612E+03
               ES   2.204E+03  2.204E+03  2.234E+03  6.642E+03
Larger    BW   NN   3.229E+03  5.697E+03  2.654E+03  1.158E+04
               ES   2.109E+03  1.169E+04  7.079E+01  1.387E+04
          VM   NN   2.670E+03  2.145E+03  4.919E+03  9.734E+03
               ES   3.245E+03  3.245E+03  3.252E+03  9.741E+03
VII Related Work
There is considerable literature on communication network resource allocation problems including network slicing using RL, and more recently, deep RL approaches; these are categorized as follows (TABLE V).
Table V: Categorization of the related work by technique (RL and deep RL) and problem (resource allocation and network slicing).
In [13Elwalid1995] the authors study the admissibility of variable bit rate (VBR) traffic and bandwidth allocation in buffered ATM networks. [17Nordstrom1995] presents an adaptive link allocation scheme for ATM networks formulated as semi-Markov Decision Problems (SMDPs). In [14Hetzer2006] the authors propose adaptive bandwidth planning using RL for optimal scheduling, considering QoS parameter patterns as feedback from the environment. Adaptive call admission control (CAC) in multimedia networks is addressed in [15Tong2000] via RL, formulating the problem as a constrained SMDP maximizing revenue while meeting packet-level and call-level QoS constraints. The work in [16Hui2003] uses RL for adaptive provisioning of differentiated services networks for per-hop behavior when the bandwidth required is not known at the time of connection admission. Cloud compute resource allocation is addressed in [18Jamshidi2015], where the authors propose learning adaptation rules by a cloud controller that learns and modifies fuzzy rules at runtime for scaling cloud resources. [19Benifa2018] addresses the same issue, basing their work on the RL-SARSA algorithm, which learns the environment dynamically in parallel and allocates the resources.
Network resource allocation using deep RL models has been considered in various domains, such as cognitive radio for smart cities [9He2017], cloud RAN [10Xu2017], and VANETs [11He2017]. In [9He2017] the authors propose an integrated framework for dynamic orchestration of networking, caching, and computing resources for smart city applications, where the algorithm performance is evaluated through simulations (real data is not used). In [2Bega2017] and [3Han2018], the authors propose algorithms for slicing admission strategy optimization in 5G networks using Q-learning and genetic algorithms; however, they consider a binary decision mechanism where declined requests are dropped. In [4Zhao2018], which is closest to our work, the authors formulate the network slicing problem with deep RL for two typical resource management scenarios: radio resource slicing for a base station with multiple services, and priority-based core network slicing with multiple service function chains requiring different compute resources and waiting times. However, our work differs from [4Zhao2018] in several critical aspects: our models address the issue of allocating multiple resources simultaneously, solve constrained problems by introducing buffers, and consider both service upon arrival and batch service. Furthermore, the performance of our approach is validated with both simulated and real data.
VIII Conclusions
In this paper we proposed a new deep RL framework for network slicing with heterogeneous resource requirements and finite capacity, which can deal with highly dynamic traffic demands from network users. Experiments using both synthetic and real workload-driven traces show that our system performs well compared to a baseline equal-slicing strategy. Our RL algorithm can be trained offline using the simulated and trace data, and the learned policies can be used in real time for end-to-end 5G slicing systems. Taking a more holistic view of network slicing, we plan to extend this work to lifecycle management of network slices, including creation, activation and deactivation, and elasticity of slices. We plan to improve our RL architecture by exploring different learning algorithms, such as Actor-Critic methods [Konda:2002:AA:936987] and DQN [mnihatari2013]. We also plan to deploy the algorithms in a real 5G network slicing testbed.