I Introduction
The promise of network slicing is to enable a high level of customization of network services in future networks (5G and beyond), leveraging virtualization and software-defined networking techniques. These key enablers transform telecommunications networks into programmable platforms capable of offering virtual networks, enriched by Virtual Network Functions (VNFs) and IT resources, tailored to the specific needs of certain customers (e.g., companies) or vertical markets (automotive, e-health, etc.) [1, 10].
From an optimization theory perspective, the Network Slice Placement problem can be viewed as a specific case of the Virtual Network Embedding (VNE) or VNF Forwarding Graph Embedding (VNF-FGE) problems [11]. It is then generally possible to formulate Integer Linear Programming (ILP) problems [9], which, however, turn out to be hard to solve [5], with very long convergence times. With regard to network management, network slicing has specific characteristics: slices are expected to share resources and coexist in a large and distributed infrastructure. Moreover, slices have a wide range of requirements in terms of resources, quality objectives, and lifetime. In practice, these characteristics bring additional complexity, as placement algorithms need to be highly scalable with low response times, even under varying network conditions.
As an alternative to exact optimization techniques and handcrafted heuristic methods, Deep Reinforcement Learning (DRL) has recently been used in the context of VNE and Network Slice Placement [25, 8, 26, 23, 24, 16, 19, 12]. DRL techniques are seen as very promising since they make it possible, at least theoretically, to learn optimal decision policies based solely on experience [22]. However, from a practical point of view, especially in the context of non-stationary environments, ensuring that a DRL agent converges to an optimal policy is still a challenge.

As a matter of fact, when the environment continually changes the rules, the algorithm has trouble using the acquired knowledge to find optimal solutions, and using a DRL algorithm in an online fashion can then become impractical. In fact, most of the existing works applying DRL to solve the Network Slice Placement or VNE problem assume a stationary environment, i.e., one with a static network load. However, traffic conditions in networks are inherently non-stationary, with daily and weekly variations, and are subject to drastic changes (e.g., a traffic storm due to an unpredictable event).
To cope with traffic changes, we propose in the present paper to extend the hybrid DRL-heuristic algorithm we recently introduced in [4] (namely, Heuristically Assisted DRL, HA-DRL) and to evaluate its performance under non-stationary network loads. We apply this strategy to a fully online learning scenario with time-varying network loads to show how it can be used to accelerate and stabilize the convergence of DRL techniques applied to the Network Slice Placement problem.
The contributions of the present paper are threefold:

We propose a network load model to capture network slice infrastructure conditions under time-varying network loads;

We propose a framework combining the Advantage Actor-Critic algorithm and a Graph Convolutional Network (GCN) for conceiving DRL-based placement algorithms adapted to the non-stationary case;

We show how the DRL learning process can be accelerated by using the proposed HA-DRL technique to control the algorithm's convergence.
The organization of this paper is as follows: In Section II, we review related work. In Section III, we describe the modeling of the Network Slice Placement problem. The learning framework for slice placement optimization is described in Section IV, and the adaptation of the pure-DRL approaches and their control by means of a heuristic are introduced in Section IV-B. Evaluation results are presented in Section V, and Section VI concludes the paper.
II Related Work Analysis
We review, in this section, recent studies on DRLbased approaches for network slice placement. The reader interested in a more detailed and comprehensive discussion of those works cited in the present paper may refer to [4].
II-A On Pure-DRL approaches for slice placement
There are only a few recent works on DRL for network slice placement and related VNE problems in the literature, and the majority of them are pure-DRL approaches [25, 8, 26, 23, 24, 16, 19, 12]. In those works, only the knowledge acquired by the learning agent via training is used as a basis for taking placement decisions. The drawback of this approach is that the learning agent needs extensive exploration of the state and action spaces to learn an appropriate decision-making policy. Such a process takes a lot of time, during which the agent takes bad placement decisions, leading to the rejection of slices and poor performance.
Furthermore, existing works consider static or fixed average network load regimes in which network slices always arrive at and leave the system at fixed rates. In reality, however, network load conditions vary over time, as the demand for network services depends on many factors (unpredictable events, day-night variations, etc.). In this kind of non-stationary environment, the learning agent may have trouble using the knowledge acquired via training, as the rules of the environment are constantly changing.
II-B On Hybrid DRL-heuristic approaches for slice placement
Two recent works on network slice placement and VNE combine DRL with heuristic methods to speed up the convergence and increase the reliability of DRL algorithms [17, 18]. However, these hybrid DRL-heuristic approaches have drawbacks, explained in more detail in [4]. In particular, the approach proposed in [17] adopts an infinite action space formulation that adds overhead and reduces the applicability of the algorithm, while the approach proposed in [18] strongly depends on the quality of the initial solution provided by a heuristic. Moreover, these two works do not consider the non-stationarity assumption discussed in the present paper.
II-C On AI/ML approaches considering dynamic network load
A recent body of research considers dynamic network load scenarios when applying AI/ML-based approaches to support the optimization of network slice life cycle management. For instance, the authors of [6] propose a deep-learning-based data analytics tool that predicts dynamic network slice traffic demands to help avoid SLA violations and network over-provisioning. The authors of [2] also adopt neural networks to predict network slice traffic demands but, in this case, to perform proactive resource provisioning and congestion control.
To the best of our knowledge, the only paper considering placement optimization using DRL in dynamic network load scenarios is [15].
The authors propose a Double Deep Q-Network (DDQN) algorithm for re-optimizing an initial VNF placement. They consider that the network load changes periodically over a time cycle, which they divide into time intervals, and they train a DDQN model to specifically take charge of VNF placement re-optimization in each time interval. Despite its originality, the approach proposed in [15] presents two drawbacks: 1) it depends on offline learning, which is not applicable to online optimization scenarios; 2) it does not use DRL to optimize the placement directly, as the DRL algorithm only selects the region of the network to re-optimize and delegates the optimization to a threshold-based policy procedure. The heuristic calculation of placement decisions can lead to suboptimal solutions. Contrary to [15], the present contribution is applicable to both offline and online learning and directly learns a placement optimization policy.
III Network Slice Placement Optimization Problem
This section presents the various elements composing the model for slice placement. Slices are placed on a substrate network, referred to as the Physical Substrate Network (PSN) and described in Section III-A. Slices give rise to Network Slice Placement Requests (Section III-B), generating a network load defined in Section III-C. The optimization problem is formulated in Section III-D.
III-A Physical Substrate Network Modeling
The Physical Substrate Network (PSN) is composed of the infrastructure resources, namely IT resources (CPU, RAM, disk, etc.) needed for supporting the Virtual Network Functions (VNFs) of network slices together with the transport network, in particular Virtual Links (VLs) for interconnecting the VNFs of slices.
The PSN is divided into three components: the Virtualized Infrastructure (VI) corresponding to IT resources, the Access Network (AN), and the Transport Network (TN).
The Virtualized Infrastructure (VI) hosting IT resources is the set of Data Centers (DCs) interconnected by network elements (switches and routers). We assume that data centers are either distributed in Points of Presence (PoPs) or centralized (e.g., in a large cloud platform). As in [21], we define three types of DCs with different capacities: Edge Data Centers (EDCs), close to end users but with small resource capacities; Core Data Centers (CDCs), regional DCs with medium resource capacities; and Central Cloud Platforms (CCPs), national DCs with large resource capacities.
We consider that slices are rooted so as to take into account the location of the users of a slice. We thus introduce an Access Network (AN) comprising User Access Points (UAPs), such as Wi-Fi APs or antennas of cellular networks, and Access Links. Users access slices via one UAP, which may change during the lifetime of a communication (e.g., because of user mobility).
The Transport Network (TN) is the set of routers and transmission links needed to interconnect the different DCs and the UAPs. The complete PSN is modeled as a weighted undirected graph whose parameters are described in Table I, comprising the set of physical nodes of the PSN and the set of substrate links. Each node has a type in the set {UAP, router, switch, server} and is characterized by its available CPU and RAM capacities; each link is characterized by its available bandwidth.
TABLE I: Physical Substrate Network parameters: the PSN graph; the set of network nodes; the set of servers; the set of data centers; the set of servers in each data center; the switch of each data center; the set of physical links; the bandwidth capacity of each link; the available and maximum CPU capacities of each server; the available and maximum RAM capacities of each server; and the maximum outgoing bandwidth from each server.
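To fix ideas, the PSN model of Table I can be sketched as a small data structure; the class and field names below are ours, not taken from the paper, and bandwidth is tracked per link only.

```python
from dataclasses import dataclass, field

@dataclass
class Server:
    """Capacities of one physical server (cf. Table I)."""
    cpu_max: int
    ram_max: int
    cpu_free: int
    ram_free: int

@dataclass
class PSN:
    """Minimal Physical Substrate Network: servers plus residual link bandwidth."""
    servers: dict = field(default_factory=dict)  # server id -> Server
    links: dict = field(default_factory=dict)    # (node u, node v) -> free bandwidth (Gbps)

    def can_host(self, server_id: str, cpu_req: int, ram_req: int) -> bool:
        """True if the server still has enough free CPU and RAM for one VNF."""
        s = self.servers[server_id]
        return s.cpu_free >= cpu_req and s.ram_free >= ram_req

# Server settings used later in the paper: 50 CPU units and 300 RAM units each
psn = PSN(servers={"edc1-s1": Server(50, 300, 50, 300)},
          links={("edc1-s1", "edc1-sw"): 10})
```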
III-B Network Slice Placement Requests Modeling
We consider that a slice is a chain of VNFs to be placed and connected over the PSN. The VNFs of a slice are grouped into a request, namely a Network Slice Placement Request (NSPR), which has to be placed on the PSN. An NSPR is represented as a weighted undirected graph whose parameters are described in Table II, comprising the set of VNFs of the NSPR and the set of VLs interconnecting them. Each VNF of an NSPR is characterized by its CPU and RAM requirements, and each VL by its required bandwidth.
TABLE II: Network Slice Placement Request parameters: the NSPR graph; the set of VNFs of the NSPR; the set of VLs of the NSPR; the CPU and RAM requirements of each VNF; and the bandwidth requirement of each VL.
III-C Network Load Modeling
We consider the case in which the load offered by NSPRs is time-varying. We specifically assume that there are several classes of NSPRs, split into two sets. A first set is composed of NSPR classes (referred to as static) with constant arrival rates, creating background traffic. A second set is composed of NSPR classes with time-varying arrival rates, so as to reflect some volatility in the NSPRs. These NSPRs are said to be dynamic.
III-C1 Network Load for Static NSPR Classes
Let R be the set of resources in the network (i.e., CPU, RAM, bandwidth) and let K_s be the set of static NSPR classes. As in [20], we compute the load generated by arrivals of NSPRs of class k in K_s for resource r in R as

ρ_{k,r} = (λ_k σ_k A_{k,r}) / C_r,    (1)

where C_r is the total capacity of resource r, A_{k,r} is the number of units of resource r requested by an NSPR of class k, λ_k is the average arrival rate of NSPRs of class k, and σ_k is the average lifetime of an NSPR of class k.
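As a sketch, the offered load of Eq. (1), i.e., arrival rate times mean lifetime times requested units, divided by total capacity, can be computed as follows (function and argument names are ours):

```python
def static_class_load(arrival_rate: float, mean_lifetime: float,
                      units_requested: float, total_capacity: float) -> float:
    """Offered load of one static NSPR class for one resource (Eq. (1)):
    the average number of busy resource units divided by the total capacity."""
    return arrival_rate * mean_lifetime * units_requested / total_capacity

# Example: 2 NSPRs per hour, 3-hour mean lifetime, 25 CPU units each,
# 1000 CPU units in total
load = static_class_load(2.0, 3.0, 25, 1000)  # -> 0.15
```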
III-C2 Network Load for Dynamic NSPR Classes
Let K_d be the set of dynamic NSPR classes. For each class k in K_d, we consider a periodic average arrival rate λ_k(t) given by

λ_k(t) = λ̄_k (1 + β_k sin(2π t / Λ_k)),    (2)

where λ̄_k is the base arrival rate of class k, Λ_k is the period of λ_k(t) in time units, and β_k is a parameter used to control the amplitude of λ_k(t). We then adapt Eq. (1) to compute the network load ρ_{k,r}(t) for dynamic NSPR class k and resource r by replacing λ_k with λ_k(t). It is worth noting that, to keep λ_k(t) nonnegative, β_k must be between 0 and 1.
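A sinusoidal modulation is one natural way to realize the periodic arrival rate of Eq. (2); the sketch below assumes that form (the function and argument names are ours, and the exact form used in the experiments may differ):

```python
import math

def dynamic_arrival_rate(t: float, base_rate: float,
                         amplitude: float, period: float) -> float:
    """Periodic average arrival rate of a dynamic NSPR class: a base rate
    modulated sinusoidally; amplitude in [0, 1] keeps the rate nonnegative."""
    return base_rate * (1.0 + amplitude * math.sin(2.0 * math.pi * t / period))
```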
III-C3 Global Network Load
Finally, we define the global network load for each resource r in R at simulated time instant t as the sum of the network loads generated by the static NSPR classes (K_s) and the dynamic NSPR classes (K_d), that is,

ρ_r(t) = Σ_{k in K_s} ρ_{k,r} + Σ_{k in K_d} ρ_{k,r}(t).    (3)
III-D Network Slice Placement Optimization Problem Statement

Given: an NSPR graph and a PSN graph,

Find: a mapping of each VNF of the NSPR onto a server of the PSN and of each VL of the NSPR onto a path of the PSN,

Subject to: the VNF CPU requirements, the VNF RAM requirements, the VL bandwidth requirements, the available CPU and RAM capacities of the servers, and the available bandwidth capacities of the physical links,

Objective: maximize the network slice placement request acceptance ratio, minimize the total resource consumption, and maximize load balancing.
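A minimal check of the node-capacity constraints above can be sketched as follows (bandwidth constraints are omitted for brevity; all names are ours):

```python
def feasible_mapping(vnf_cpu: dict, vnf_ram: dict, placement: dict,
                     server_cpu_free: dict, server_ram_free: dict) -> bool:
    """Check the server CPU/RAM constraints of a candidate VNF -> server mapping."""
    used_cpu, used_ram = {}, {}
    for vnf, srv in placement.items():
        used_cpu[srv] = used_cpu.get(srv, 0) + vnf_cpu[vnf]
        used_ram[srv] = used_ram.get(srv, 0) + vnf_ram[vnf]
    return all(used_cpu[s] <= server_cpu_free[s] and used_ram[s] <= server_ram_free[s]
               for s in used_cpu)

# eMBB values from Section V: each VNF needs 25 CPU / 150 RAM; a server offers
# 50 / 300, so at most two such VNFs fit on one server.
ok = feasible_mapping({0: 25, 1: 25}, {0: 150, 1: 150},
                      {0: "s1", 1: "s1"}, {"s1": 50}, {"s1": 300})  # -> True
```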
IV Learning Framework for Network Slice Placement Optimization
We describe in this section the machine learning framework used to solve the optimization problem formulated in Section III. We adopt the same approach as in [4] but, to cope with the non-stationary behavior of NSPR arrivals, we introduce an additional set of states describing the network load. Other methods could be considered to deal with the cyclic nature of the NSPR arrivals considered in this paper (e.g., LSTM techniques able to infer the periodic characteristics of the NSPR arrival process). However, our goal in this paper is to set up a method for dealing with non-stationary NSPR arrivals independently of underlying periodic structures. We test the proposed approach on periodic NSPR arrivals to validate it, keeping in mind that the method has to be robust against non-stationary traffic variations. This point will be addressed in further studies.
IV-A Learning framework
IV-A1 Policy
We reuse the framework introduced in [4]. We denote by A the set of possible actions (namely, placing VNFs on nodes) and by S the set of all states. We adopt a sequential placement strategy: at each step we choose a node on which to place the current VNF, the VNFs being placed one after the other from the first VNF of the NSPR to the last.

At each time step t, given a state s_t in S, the learning agent selects an action a_t in A with probability given by the Softmax distribution

π_θ(a_t | s_t) = exp(q(s_t, a_t)) / Σ_{a in A} exp(q(s_t, a)),    (4)

where the function q yields a real value for each state and action, calculated by a Deep Neural Network (DNN) as detailed in Section IV-B1. The notation π_θ indicates that the policy depends on the vector θ of weights of the DNN.
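The Softmax policy of Eq. (4) maps the DNN scores of all candidate actions to selection probabilities; a numerically stable sketch:

```python
import math

def softmax_policy(scores):
    """Turn per-action scores q(s, .) into the action distribution of Eq. (4)."""
    m = max(scores)                          # subtract the max to avoid overflow
    exps = [math.exp(v - m) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]
```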
IV-A2 State representation
As in [4], the PSN state is characterized by the resource occupancy of the servers, namely their available CPU, RAM, and bandwidth capacities. In addition, we keep track of the placement of the outstanding NSPR (the one under placement) via a vector whose entries give the number of VNFs of the current NSPR placed on each node. Accordingly, the NSPR state is a view of the current placement and is composed of four characteristics: the CPU, RAM, and bandwidth requirements of the current VNF to be placed (see Table II for the notation), and the number of VNFs of the outstanding NSPR still to be placed.

To deal with the non-stationary environment, we introduce the Load state, which represents the network load forecast used to learn the network load variations. It is defined, for each resource r in R, by a set of 100 features calculated with the network load formula of Eq. (3), namely the values of ρ_r over the 100 time instants following the simulated time instant at which the current NSPR arrives.
IV-A3 Reward function
We reuse the reward function introduced in [4]: the global reward (5) of a training episode aggregates, over the iterations of the episode, three per-step rewards, defined as follows:

An action may lead to a successful or unsuccessful placement. The Acceptance Reward (6) due to an action is positive in the case of a successful placement and negative otherwise.

The Resource Consumption Reward (7) for the placement of a VNF depends on the length of the path used to place the VL connecting it to the previously placed VNF: the shorter the path, the higher the reward, a maximum being given when two consecutive VNFs are placed on the same server.

The Load Balancing Reward (8) for the placement of a VNF favors placements that balance the load across the PSN resources.
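The three per-step rewards can be sketched as follows; the exact constants and functional forms are those of [4], so the choices below are purely illustrative:

```python
def acceptance_reward(success: bool) -> float:
    """Eq. (6): positive for a successful VNF placement, negative otherwise."""
    return 1.0 if success else -1.0

def resource_consumption_reward(path_length: int) -> float:
    """Eq. (7): the shorter the path carrying the VL between consecutive VNFs,
    the higher the reward; maximal when both VNFs share a server (length 0)."""
    return 1.0 / (path_length + 1)

def load_balancing_reward(cpu_free: float, cpu_max: float,
                          ram_free: float, ram_max: float) -> float:
    """Eq. (8): reward placements on servers with more residual capacity."""
    return (cpu_free / cpu_max + ram_free / ram_max) / 2.0
```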
IV-B Adaptation of DRL and Introduction of a Heuristic Function
IV-B1 Proposed Deep Reinforcement Learning Algorithm
As in [4], we use a single-thread version of the A3C algorithm introduced in [14]. This algorithm relies on two DNNs that are trained in parallel: i) the Actor Network, with parameter vector θ, which is used to generate the policy at each time step, and ii) the Critic Network, with parameter vector θ_v, which generates an estimate v(s_t; θ_v) of the state-value function V(s) = E[ Σ_{k ≥ 0} γ^k r_{t+k} | s_t = s ] for some discount parameter γ.
As depicted in Fig. 1, both the Actor and the Critic Networks have almost identical structures. As in [25], we use the GCN formulation proposed in [13] to automatically extract advanced characteristics of the PSN. The characteristics produced by the GCN represent semantics of the PSN topology, obtained by encoding and accumulating characteristics of neighbouring nodes in the PSN graph. The size of the neighbourhood is defined by the order-index parameter of the GCN. As in [25], we perform automatic extraction of 60 characteristics per PSN node. The NSPR state and the Network Load characteristics are separately transmitted to fully connected layers with 4 and 100 units, respectively. The characteristics extracted by these two layers and by the GCN layer are then combined into a single column vector and passed through a final fully connected layer.
In the Critic Network, the outputs are forwarded to a single neuron, which is used to calculate the state-value function estimate. In the Actor Network, the outputs represent the values of the function introduced in Section IV-A; they are injected into a Softmax layer that transforms them into the Softmax distribution corresponding to the policy.

During the training phase, at each time step, the A3C algorithm uses the Actor Network to calculate the policy, samples an action from it, and performs this action on the environment. The Critic Network is used to calculate the state-value function approximation. The learning agent then receives the reward and the next state from the environment, and the placement process continues until a terminal state is reached, that is, until the Actor Network returns an unsuccessful action or until the current NSPR is completely placed. At the end of the training episode, the A3C algorithm updates the Actor and Critic parameters by using the same rules as in [4].
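At the heart of the A3C updates are the discounted return regressed by the Critic and the one-step advantage weighting the Actor's policy gradient; both can be sketched as follows (function names are ours):

```python
def discounted_return(rewards, gamma: float) -> float:
    """Discounted return of an episode, used as the critic's regression target."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def advantage(reward: float, gamma: float, v_next: float, v_current: float) -> float:
    """One-step advantage estimate A = r + gamma * V(s') - V(s), used to scale
    the actor's policy-gradient update."""
    return reward + gamma * v_next - v_current
```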
IV-B2 Introduction of a Heuristic Function
To guide the learning process, we use, as in [4], the placement heuristic introduced in [3]. This yields the HA-DRL algorithm. More precisely, starting from the reference framework shown in Fig. 1, we propose to include in the Actor Network a Heuristic layer that calculates a Heuristic Function based on external information provided by the heuristic method, referred to as HEU.
Let q be the function computed by the fully connected layer of the Actor Network, mapping each state s and action a to a real value q(s, a) that is afterwards converted by the Softmax layer into the selection probability of the respective action (see Section IV-A). Let a_max be the action with the highest q value for state s, and let a_HEU be the action derived by the HEU method at the current time step, i.e., the preferred action to be chosen. The Heuristic Function H is shaped to allow the q value of a_HEU to become closer to that of a_max, the aim being to turn a_HEU into one of the likeliest actions to be chosen by the policy.

The Heuristic Function is then formulated as

H(s, a) = q(s, a_max) - q(s, a) + η   if a = a_HEU,   and 0 otherwise,    (9)

where η is a small real number. During the training process, the Heuristic layer updates the q values by using the following equation:

q(s, a) ← q(s, a) + ξ H(s, a)^ψ.    (10)

The Softmax layer then computes the policy using the modified q values; note that the action returned by HEU will thus have a higher probability of being chosen. The parameters ξ and ψ are used to control how much HEU influences the policy.
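When the two parameters controlling the influence of HEU are set to 1, the effect of Eqs. (9) and (10) reduces to raising the score of the HEU-preferred action just above the current best score, which the Softmax policy then turns into the highest selection probability; a sketch (all names are ours):

```python
def heuristic_boost(q_values, preferred_action: int, eta: float = 0.001):
    """Apply the Heuristic Function: lift the HEU-preferred action's score to
    max(q) + eta so that the Softmax policy makes it the likeliest action."""
    best = max(q_values)
    boosted = list(q_values)
    boosted[preferred_action] = best + eta
    return boosted
```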
IV-C Implementation Remarks
IV-C1 Algorithms considered
We consider four learning algorithms based on the reference framework presented in Fig. 1.

DRL: This is the pure-DRL algorithm we initially proposed in [4]. The state representation does not include the network load state and the Actor Network does not contain the Heuristic layer;

eDRL: This algorithm is an enhanced version of DRL in which the state representation includes the network load state, while the Actor Network still does not contain the Heuristic layer;

HA-DRL: This algorithm embeds the heuristic: the Actor Network contains the Heuristic layer, but the state representation does not include the network load state;

HA-eDRL: The state representation includes the network load state and the Actor Network contains the Heuristic layer.
IV-C2 Implementation details
All resource-related characteristics are normalized to lie in [0, 1]; this is done by dividing the available and required CPU, RAM, and bandwidth values by the corresponding maximum capacities. With regard to the DNNs, we have implemented the Actor and Critic as two independent neural networks, with a bias assigned to each neuron. We have used the hyperbolic tangent (tanh) activation for the non-output layers of the Actor Network and the Rectified Linear Unit (ReLU) activation for all layers of the Critic Network. We have also normalized the positive global rewards. During the training phase, we have treated the policy as a Categorical distribution and used it to sample actions randomly.

V Implementation and Evaluation Results
In this section, we present the implementation and experiments we conducted to evaluate the proposed algorithms.
V-A Implementation Details & Simulator Settings
V-A1 Experimental setting
We developed a simulator in Python containing: i) the elements of the Network Slice Placement Optimization problem (i.e., the PSN and the NSPRs); ii) the DRL, eDRL, HA-DRL, and HA-eDRL algorithms. We used the PyTorch framework to implement the DNNs. Experiments were run on a machine with 2×6 cores @ 2.95 GHz and 96 GB of RAM.
TABLE III: Network load model parameters for the two NSPR classes (Volatile: 5, 25, 20, 6300; Long term: 10, 25, 500, 0.02, 6300).
V-A2 Physical Substrate Network Settings
We consider a PSN that could reflect the infrastructure of a network operator, as discussed in [20]. In this network, the three types of DCs introduced in Section III-A are deployed. Each CDC is connected to three EDCs, which are 100 km apart. CDCs are interconnected and connected to one CCP that is 300 km away. We consider 15 EDCs, each with 4 servers; 5 CDCs, each with 10 servers; and 1 CCP with 16 servers. The CPU and RAM capacities of each server are 50 and 300 units, respectively. A bandwidth capacity of 100 Gbps is given to intra-data-center links inside CDCs and the CCP, and 10 Gbps to intra-data-center links inside EDCs. Transport links connected to EDCs have a bandwidth capacity of 10 Gbps. Transport links between CDCs, as well as those between CDCs and the CCP, have a bandwidth capacity of 100 Gbps.
V-A3 Network Slice Placement Requests Settings
We consider NSPRs to have the Enhanced Mobile Broadband (eMBB) setting described in [3]. Each NSPR is composed of 5 VNFs. Each VNF requires 25 units of CPU and 150 units of RAM. Each VL requires 2 Gbps of bandwidth.
V-B Algorithms & Experimental Setup
V-B1 Training Process & Hyperparameters
We consider a training process with a maximum duration of 55 hours for each algorithm. We perform seven independent runs of each algorithm to assess their average and maximal performance in terms of the metrics introduced below (see Section V-C). After performing a hyperparameter search, we set suitable learning rates for the Actor and Critic networks of the DRL and HA-DRL algorithms, and likewise for eDRL and HA-eDRL. We implement four versions of the HA-DRL and HA-eDRL agents, each with a different value of the parameter controlling the heuristic influence in the Heuristic Function formulation (see Section IV-B2).
V-B2 Network load calculation
We implement the network load model introduced in Section III-C, considering two NSPR classes: i) a volatile class (referred to as Volatile), with a dynamic arrival rate, and ii) a static class (referred to as Long term), with a static arrival rate. Table III presents the network load model parameters for both classes. We consider that each simulation time unit corresponds to 15 minutes of real time. We set the period of the network load function to 96 simulation time units, i.e., one day. The global network load then varies between 0.3 and 1.0.
V-C Evaluation Metrics
To characterize the performance of the placement algorithms, we consider two performance metrics:

Global Acceptance Ratio (GAR): the acceptance ratio of the different tested algorithms during training, computed after each arrival as the number of accepted NSPRs divided by the total number of NSPR arrivals so far. This metric is used to evaluate the accumulated acceptance ratio of the different algorithms as the learning process progresses.

Acceptance Ratio per training phase (TAR): the acceptance ratio obtained in each training phase, i.e., in each part of the training process corresponding to a fixed number of NSPR arrivals or episodes. It is computed as the number of NSPRs accepted during the phase divided by the number of arrivals in the phase. This metric allows us to better observe the evolution of algorithm performance over time, since it measures performance in independent parts (phases) of the training process without accumulating the performance of previous training phases.
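The two metrics can be sketched directly from their definitions (function names are ours):

```python
def global_acceptance_ratio(accepted_flags):
    """GAR after each arrival: accepted NSPRs so far / arrivals so far."""
    gar, accepted = [], 0
    for total, ok in enumerate(accepted_flags, start=1):
        accepted += int(ok)
        gar.append(accepted / total)
    return gar

def acceptance_ratio_per_phase(accepted_flags, phase_size: int):
    """TAR: acceptance ratio inside each fixed-size window of arrivals,
    without accumulating earlier phases."""
    return [sum(accepted_flags[i:i + phase_size]) / phase_size
            for i in range(0, len(accepted_flags), phase_size)]
```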
V-D Global Acceptance Ratio Evaluation
Figs. 2 and 3 show the progression of the Global Acceptance Ratio (GAR) over time for the considered algorithms. Figs. 2(a) and 3(a) show that, after 3 simulated days, the average and maximal GARs of HA-DRL and HA-eDRL with the strongest heuristic influence converge and exceed 80%, while for the other algorithms the GARs remain between 20% and 40% and are not stable. Figs. 2(b) and 3(b) exhibit the progression of the GARs over a longer horizon of 4500 simulated days.

We observe that, while the GARs of HA-DRL and HA-eDRL with the strongest heuristic influence remain stable after the convergence reached in the first 3 simulated days, for the other algorithms these performance metrics start to stabilize only after 2000 simulated days. Table IV shows that both the maximal and average final GARs, i.e., those achieved at the end of training, are close to 82% for both HA-DRL and HA-eDRL with the strongest heuristic influence. For the other algorithms, maximal final GARs are between 61% and 71% and average final GARs are between 52% and 65%, except for one HA-DRL and one HA-eDRL variant whose average final GARs fall below 30%. Table IV also shows that the maximal and average final GARs of the HA-DRL algorithms are generally higher than those of the equivalent HA-eDRL versions, the gap never exceeding 6%.
TABLE IV: Maximal and average final GAR and TAR (%); each HA-DRL and HA-eDRL row corresponds to a different setting of the heuristic influence parameter (see Section IV-B2).

Algorithm  Max. final GAR  Avg. final GAR  Max. final TAR  Avg. final TAR
DRL  65.90  52.15  77.69  55.82
HA-DRL  70.56  55.57  81.45  59.83
HA-DRL  69.80  60.12  73.56  65.33
HA-DRL  64.89  28.23  72.14  41.69
HA-DRL  81.84  81.62  83.91  82.21
eDRL  69.36  65.10  77.39  58.61
HA-eDRL  69.25  52.22  76.34  48.45
HA-eDRL  68.40  54.49  71.91  57.57
HA-eDRL  61.46  27.33  67.05  33.10
HA-eDRL  81.82  81.68  81.37  81.95
The above results show that HA-DRL and HA-eDRL with the strongest heuristic influence exhibit more robust performance, converge faster, and handle network load variations better than the other evaluated algorithms, notably the classical DRL approach.

Indeed, with this setting, the Heuristic Function computed on the basis of the HEU algorithm has a strong influence on the actions chosen by the agent. Since the HEU algorithm often indicates a good action, this strong influence of the Heuristic Function helps the algorithms become stable more quickly.

The addition of network-load-related features to the state observed by the eDRL algorithm helps improve the maximal and average final GAR when compared with DRL. However, this improvement is less significant than the one brought by the Heuristic Function acceleration, as the eDRL algorithm does not achieve the same fast convergence and robustness as HA-DRL and HA-eDRL with the strongest heuristic influence.
V-E Acceptance Ratio per Training Phase Evaluation
Figs. 4 and 5 show that all algorithms need at least 16 training phases to reach a convergent maximal and average Acceptance Ratio per Training Phase (TAR), except for HA-DRL and HA-eDRL with the strongest heuristic influence, which need only one training phase (see Figs. 4(a) and 5(a)). The fast convergence of these two variants is due to the strong influence of the Heuristic Function on the choice of actions, as explained in Section V-D. Among the other algorithms, some (including DRL) need around 16 phases to reach a convergent maximal TAR, while the others (including eDRL) need around 34 phases; a similar split between 16-phase and 34-phase groups is observed for the average TAR.
Table IV shows that the only algorithms reaching maximal and average final TARs (i.e., the maximal and average TARs achieved at the end of training) higher than 80% are HA-DRL and HA-eDRL with the strongest heuristic influence. One other HA-DRL variant also attains a maximal final TAR higher than 80%, but its average final TAR is below 60%; another HA-DRL variant has an average final TAR above 65%, but its maximal final TAR is only around 74%. In addition, Table IV shows that the HA-DRL algorithms generally have higher maximal final TARs than the equivalent HA-eDRL versions, the gap never exceeding 6%, and that eDRL and DRL have equivalent maximal final TARs, with eDRL's average final TAR being 3% higher. The per-class results also show that all algorithms achieve better maximal and average TARs on volatile NSPRs than on long-term NSPRs. This difference is arguably related to the number of arrivals of each request class and, therefore, to the amount of training performed on each class, since many more volatile requests than long-term requests arrive to be placed during the simulated period. These results reinforce the conclusions of Section V-D. The maximal TAR progression results (Fig. 4) show that, in the best case, all algorithms attain good performance if a large number of training phases is allowed.

However, the average TAR progression results (Fig. 5) show that only HA-DRL and HA-eDRL with the strongest heuristic influence achieve robust performance. These algorithms are also the only ones with quick convergence, making them, among those evaluated, the most suitable for practical use.
VI Conclusion
We have introduced a DRL-heuristic algorithm that supports variations of the network load in the placement of network slice requests, as a follow-up of our work in [4]. We have considered four families of algorithms: pure-DRL and eDRL techniques, and their variants combined with a heuristic whose efficiency was investigated in a previous work [3]. The influence of the heuristic on the DRL algorithm can be tuned by means of a parameter introduced in Section IV-B2. We assume that the network load exhibits periodic fluctuations. We have shown how introducing network load states into the DRL algorithm, together with the combination with the Heuristic Function, yields very good results in terms of GAR and TAR that are stable in time. This study shows that coupling DRL and heuristic functions yields good and stable results even under non-stationary conditions.
The next step is to study the behavior of the proposed algorithm in case of an unpredictable network load disruption.
Acknowledgment
This work has been performed in the framework of the 5G-PPP MonB5G project (www.monb5g.eu). The experiments were conducted using Grid'5000, a large-scale testbed by Inria and Sorbonne University (www.grid5000.fr).
References
 [1] (Dec. 2020) Management and orchestration; 5G Network Resource Model (NRM); Stage 2 and stage 3 (Release 17). Technical Specification (TS) 28.541, 3rd Generation Partnership Project (3GPP). Version 17.1.0. Cited by: §I.
 [2] (2018) An efficient and lightweight load forecasting for proactive scaling in 5G mobile networks. In Proc. 2018 IEEE Conf. Standards Commun. Netw. (CSCN), pp. 1–6. Cited by: §II-C.
 [3] (2020) Heuristic for edge-enabled network slicing optimization using the “power of two choices”. In Proc. 2020 IEEE 16th Int. Conf. Netw. Service Manag. (CNSM), pp. 1–9. Cited by: §IV-B2, §V-A3, §VI.
 [4] (2021) A heuristically assisted deep reinforcement learning approach for network slice placement. arXiv preprint arXiv:2105.06741. Cited by: §I, §II, §II-B, 1st item, §IV, §IV-A1, §IV-A2, §IV-A3, §IV-B1, §IV-B2, §VI.
 [5] (Jun. 2016) On the computational complexity of the virtual network embedding problem. Electron. Notes Discrete Math. 52, pp. 213–220. Cited by: §I.
 [6] (2019) DeepCog: cognitive network management in sliced 5G networks with deep learning. In Proc. IEEE INFOCOM 2019 - IEEE Conf. Comput. Commun. Workshops (INFOCOM WKSHPS), pp. 280–288. Cited by: §II-C.
 [7] (2008) Accelerating autonomous learning by using heuristic selection of actions. J. Heuristics 14 (2), pp. 135–168. Cited by: §II-B.
 [8] (2019) DeepViNE: virtual network embedding with deep reinforcement learning. In Proc. IEEE INFOCOM 2019 - IEEE Conf. Comput. Commun. Workshops (INFOCOM WKSHPS), pp. 879–885. Cited by: §I, §II-A.
 [9] (2020) Location-based data model for optimized network slice placement. In Proc. 2020 6th IEEE Conf. Netw. Softwarization (NetSoft), pp. 404–412. Cited by: §I.
 [10] (2017) Network Functions Virtualisation (NFV); Evolution and Ecosystem; Report on Network Slicing Support, ETSI Standard GR NFV-EVE 012 V3.1.1. Technical report, ETSI. Cited by: §I.
 [11] (Sep. 2016) Resource allocation in NFV: a comprehensive survey. IEEE Trans. Netw. Service Manag. 13 (3), pp. 518–532. Cited by: §I.
 [12] (2020) Accelerating virtual network embedding with graph neural networks. In Proc. 2020 IEEE 16th Int. Conf. Netw. Service Manag. (CNSM), pp. 1–9. Cited by: §I, §II-A.
 [13] (2017) Semi-supervised classification with graph convolutional networks. In Proc. 5th Int. Conf. Learn. Representations (ICLR), pp. 1–14. Cited by: §IV-B1.
 [14] (2016) Asynchronous methods for deep reinforcement learning. In Int. Conf. Mach. Learn., pp. 1928–1937. Cited by: §IV-B1.
 [15] (Feb. 2020) Optimal VNF placement via deep reinforcement learning in SDN/NFV-enabled networks. IEEE J. Sel. Areas Commun. 38 (2), pp. 263–278. Cited by: §II-C.
 [16] (2019) Multi-domain non-cooperative VNF-FG embedding: a deep reinforcement learning approach. In Proc. IEEE INFOCOM 2019 - IEEE Conf. Comput. Commun. Workshops (INFOCOM WKSHPS), pp. 886–891. Cited by: §I, §II-A.
 [17] (Dec. 2019) A deep reinforcement learning approach for VNF forwarding graph embedding. IEEE Trans. Netw. Service Manag. 16 (4), pp. 1318–1331. Cited by: §II-B.
 [18] (2021) Learn to improve: a novel deep reinforcement learning approach for beyond 5G network slicing. In Proc. 2021 IEEE 18th Annu. Consum. Commun. Netw. Conf. (CCNC), pp. 1–6. Cited by: §II-B.
 [19] (2020) Self-driving network and service coordination using deep reinforcement learning. In Proc. 2020 IEEE 16th Int. Conf. Netw. Service Manag. (CNSM), pp. 1–9. Cited by: §I, §II-A.
 [20] (2017) Towards a dynamic adaptive placement of virtual network functions under ONAP. In Proc. 2017 IEEE Conf. Netw. Function Virtualization Softw. Defined Netw. (NFV-SDN), pp. 210–215. Cited by: §III-C1, §V-A2.
 [21] (2018) CLOSE: a costless service offloading strategy for distributed edge cloud. In Proc. 2018 15th IEEE Annu. Consum. Commun. Netw. Conf. (CCNC), pp. 1–6. Cited by: §III-A.
 [22] (2015) Reinforcement learning: an introduction. MIT Press, Cambridge, MA, USA. Cited by: §I.
 [23] (Sep. 2019) Data-driven dynamic resource scheduling for network slicing: a deep reinforcement learning approach. Inf. Sci. 498, pp. 106–116. Cited by: §I, §II-A.
 [24] (2019) NFVdeep: adaptive online service function chain deployment with deep reinforcement learning. In Proc. 2019 IEEE/ACM 27th Int. Symp. Qual. Service (IWQoS), pp. 1–10. Cited by: §I, §II-A.
 [25] (Jun. 2020) Automatic virtual network embedding: a deep reinforcement learning approach with graph convolutional networks. IEEE J. Sel. Areas Commun. 38 (6), pp. 1040–1057. Cited by: §I, §II-A, §IV-B1.
 [26] (Apr. 2018) A novel reinforcement learning algorithm for virtual network embedding. Neurocomputing 284, pp. 1–9. Cited by: §I, §II-A.