DRL-based Slice Placement Under Non-Stationary Conditions

08/05/2021 ∙ by Jose Jurandir Alves Esteves, et al. ∙ Orange ∙ Laboratoire d'Informatique de Paris 6

We consider online learning for optimal network slice placement under the assumption that slice requests arrive according to a non-stationary Poisson process. We propose a framework based on Deep Reinforcement Learning (DRL) combined with a heuristic to design algorithms. We specifically design two pure-DRL algorithms and two families of hybrid DRL-heuristic algorithms. To validate their performance, we perform extensive simulations in the context of a large-scale operator infrastructure. The evaluation results show that the proposed hybrid DRL-heuristic algorithms require three orders of magnitude fewer learning episodes than pure-DRL to achieve convergence. This result indicates that the proposed hybrid DRL-heuristic approach is more reliable than pure-DRL in a real non-stationary network scenario.




I Introduction

The promise of network slicing is to enable a high level of customization of network services in future networks (5G and beyond), leveraging virtualization and software-defined networking techniques. These key enablers transform telecommunications networks into programmable platforms capable of offering virtual networks enriched by Virtual Network Functions (VNFs) and IT resources tailored to the specific needs of certain customers (e.g., companies) or vertical markets (automotive, e-health, etc.) [1, 10].

From an optimization theory perspective, the Network Slice Placement problem can be viewed as a specific case of the Virtual Network Embedding (VNE) or VNF Forwarding Graph Embedding (VNF-FGE) problems [11]. It is then generally possible to formulate Integer Linear Programming (ILP) problems [9], which, however, turn out to be $\mathcal{NP}$-hard [5] with very long convergence time.

With regard to network management, there are specific characteristics related to network slicing: slices are expected to share resources and coexist in a large and distributed infrastructure. Moreover, slices have a wide range of requirements in terms of resources, quality objectives and lifetime. In practice, these characteristics bring additional complexity as the placement algorithms need to be highly scalable with low response time even under varying network conditions.

As an alternative to optimization techniques and the development of heuristic methods, Deep Reinforcement Learning (DRL) has recently been used in the context of VNE and Network Slice Placement [25, 8, 26, 23, 24, 16, 19, 12].

DRL techniques are seen as very promising since they make it possible, at least in theory, to learn optimal decision policies based only on experience [22]. However, from a practical point of view, especially in the context of non-stationary environments, ensuring that a DRL agent converges to an optimal policy is still a challenge.

As a matter of fact, when the rules of the environment are continually changing, the algorithm has trouble using the acquired knowledge to find optimal solutions. Using the DRL algorithm in an online fashion can then become impractical. In fact, most of the existing works applying DRL to solve the Network Slice Placement or VNE problem assume a stationary environment, i.e., with static network load. However, traffic conditions in networks are inherently non-stationary, with daily and weekly variations, and are subject to drastic changes (e.g., a traffic storm due to an unpredictable event).

To cope with traffic changes, we propose in the present paper to extend a hybrid DRL-heuristic algorithm we recently introduced in [4] (namely Heuristically Assisted DRL, HA-DRL) and to evaluate its performance under non-stationary network loads. We apply this strategy to a fully online learning scenario with time-varying network loads to show how it can be used to accelerate and stabilize the convergence of DRL techniques when applied to the Network Slice Placement problem.

The contributions of the present paper are threefold:

  1. We propose a network load model to characterize network slice infrastructure conditions with time-varying network loads;

  2. We propose a framework combining Advantage Actor Critic and a Graph Convolutional Network (GCN) for conceiving DRL-based algorithms adapted to the non-stationary case;

  3. We show how the DRL learning process can be accelerated by using the proposed HA-DRL technique to control the algorithms' convergence.

The organization of this paper is as follows: In Section II, we review related work. In Section III, we describe the Network Slice Placement problem modeling. The learning framework for slice placement optimization is described in Section IV. The adaptation of the pure-DRL approaches and their control by means of a heuristic are introduced in Section IV-B. The experiments and evaluation results are presented in Section V, while conclusions and perspectives are presented in Section VI.

II Related Work Analysis

We review, in this section, recent studies on DRL-based approaches for network slice placement. The reader interested in a more detailed and comprehensive discussion of those works cited in the present paper may refer to [4].

II-A On Pure-DRL approaches for slice placement

There are only a few recent works on DRL for network slice placement and VNE related problems in the literature and the majority of them are pure-DRL approaches [25, 8, 26, 23, 24, 16, 19, 12]. In those works, only the knowledge acquired by the learning agent via training is used as a basis for taking placement decisions. The drawback of this approach is that the learning agent needs extensive exploration of the state and action spaces to learn an appropriate policy for decision making; such a process takes a lot of time during which the agent takes bad placement decisions; this leads to rejection of slices and bad performance.

Furthermore, existing works consider static or fixed average network load regimes in which network slices always arrive and exit the system at fixed rates. In reality, however, network load conditions vary over time as the demand for network services depends on many factors (unpredictable events, day-night variations, etc.). In this kind of non-stationary environment, the learning agent might have trouble using the knowledge acquired via training, as the rules of the environment are constantly changing.

II-B On Hybrid DRL-heuristic approaches for slice placement

Two recent works on network slice placement and VNE combine DRL with heuristic methods to speed up the convergence and to increase the reliability of DRL algorithms [17, 18]. However, these hybrid DRL-heuristic approaches have some drawbacks, explained in more detail in [4]. In particular, the approach proposed in [17] adopts an infinite action space formulation that adds some overhead, reducing the applicability of the algorithm, and the approach proposed in [18] strongly depends on the quality of the initial solution provided by a heuristic. Moreover, these two works do not consider the non-stationarity assumption discussed in the present paper.

In [4], we proposed to adapt heuristically accelerated reinforcement learning to the network slice placement problem. The approach is inspired by [7] and addresses the shortcomings of [17] and [18]. In this paper, we extend this approach by including the non-stationarity assumption.

II-C On AI/ML approaches considering dynamic network load

A recent body of research considers dynamic network load scenarios when applying AI/ML-based approaches to support optimization of network slice life cycle management. For instance, the authors of [6] propose a deep learning-based data analytics tool to predict dynamic network slice traffic demands to help avoid SLA violations and network over-provisioning. The paper [2] also adopts neural networks to predict network slice traffic demands but, in this case, to perform proactive resource provisioning and congestion control.

To the best of our knowledge, the only paper considering placement optimization using DRL in dynamic network load scenarios is [15].

The authors propose a Double Deep Q Network (DDQN) algorithm for re-optimizing an initial VNF placement. They consider that the network load changes periodically with a time cycle $T$. They then split the time cycle $T$ into time intervals and train a DDQN model to specifically take charge of VNF placement re-optimization in each time interval. Despite its originality, the approach proposed in [15] presents two drawbacks: 1) it depends on offline learning, which is not applicable to online optimization scenarios; 2) it does not use DRL to optimize placement directly, as the DRL algorithm selects the region of the network to re-optimize and delegates the optimization to a threshold policy procedure. The heuristic calculation of placement decisions can lead to sub-optimal solutions. Contrary to [15], the present contribution is applicable to both offline and online learning and directly learns a placement optimization policy.

III Network Slice Placement Optimization Problem

This section presents the various elements composing the model for slice placement. Slices are placed on a substrate network, referred to as the Physical Substrate Network (PSN) and described in Section III-A. Slices give rise to Network Slice Placement Requests (NSPRs, Section III-B), generating a network load defined in Section III-C. The optimization problem is formulated in Section III-D.

III-A Physical Substrate Network Modeling

The Physical Substrate Network (PSN) is composed of the infrastructure resources, namely IT resources (CPU, RAM, disk, etc.) needed for supporting the Virtual Network Functions (VNFs) of network slices together with the transport network, in particular Virtual Links (VLs) for interconnecting the VNFs of slices.

The PSN is divided into three components: the Virtualized Infrastructure (VI) corresponding to IT resources, the Access Network (AN), and the Transport Network (TN).

The Virtualized Infrastructure (VI) hosting IT resources is the set of Data Centers (DCs) interconnected by network elements (switches and routers). We assume that data centers are distributed in Points of Presence (PoPs) or centralized (e.g., in a big cloud platform). As in [21], we define three types of DCs with different capacities: Edge Data Centers (EDCs), close to end users but with small resource capacities; Core Data Centers (CDCs), regional DCs with medium resource capacities; and Central Cloud Platforms (CCPs), national DCs with large resource capacities.

We consider that slices are rooted so as to take into account the location of the users of a slice. We thus introduce an Access Network (AN) representing User Access Points (UAPs), such as Wi-Fi APs and antennas of cellular networks, and Access Links. Users access slices via one UAP, which may change during the lifetime of a communication (e.g., because of user mobility).

The Transport Network (TN) is the set of routers and transmission links needed to interconnect the different DCs and the UAPs. The complete PSN is modeled as a weighted undirected graph $G_s = (N, L)$ with parameters described in Table I, where $N$ is the set of physical nodes in the PSN and $L \subset \{(a,b) \in N \times N : a \neq b\}$ is the set of substrate links. Each node has a type in the set {UAP, router, switch, server}. The available CPU and RAM capacities on each node are defined as $cap^{cpu}_n \in \mathbb{R}$ and $cap^{ram}_n \in \mathbb{R}$ for all $n \in N$, respectively. The available bandwidth on the links is defined as $cap^{bw}_{(a,b)} \in \mathbb{R}$ for all $(a,b) \in L$.

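| Parameter | Description |
|---|---|
| $G_s = (N, L)$ | PSN graph |
| $N$ | Network nodes |
| $S \subset N$ | Set of servers |
| $DC$ | Set of data centers |
| $S_{dc} \subset S$, $\forall dc \in DC$ | Set of servers in data center $dc$ |
| $sw_{dc}$, $\forall dc \in DC$ | Switch of data center $dc$ |
| $L \subset \{(a,b) \in N \times N : a \neq b\}$ | Set of physical links |
| $cap^{bw}_{(a,b)}$, $\forall (a,b) \in L$ | Bandwidth capacity of link $(a,b)$ |
| $cap^{cpu}_s$, $\forall s \in S$ | Available CPU capacity on server $s$ |
| $M^{cpu}_s$, $\forall s \in S$ | Maximum CPU capacity of server $s$ |
| $cap^{ram}_s$, $\forall s \in S$ | Available RAM capacity on server $s$ |
| $M^{ram}_s$, $\forall s \in S$ | Maximum RAM capacity of server $s$ |
| $M^{bw}_{dc}$, $\forall dc \in DC$ | Maximum outgoing bandwidth from $dc$ |

TABLE I: PSN parameters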

III-B Network Slice Placement Requests Modeling

We consider that a slice is a chain of VNFs to be placed and connected over the PSN. The VNFs of a slice are grouped into a request, namely a Network Slice Placement Request (NSPR), which has to be placed on the PSN. An NSPR is represented as a weighted undirected graph $G_v = (V, E)$, with parameters described in Table II, where $V$ is the set of VNFs in the NSPR and $E \subset \{(\bar{a},\bar{b}) \in V \times V : \bar{a} \neq \bar{b}\}$ is the set of VLs interconnecting the VNFs of the slice. The CPU and RAM requirements of each VNF of an NSPR are defined as $req^{cpu}_v \in \mathbb{R}$ and $req^{ram}_v \in \mathbb{R}$ for all $v \in V$, respectively. The bandwidth required by each VL in an NSPR is given by $req^{bw}_{(\bar{a},\bar{b})} \in \mathbb{R}$ for all $(\bar{a},\bar{b}) \in E$.

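| Parameter | Description |
|---|---|
| $G_v = (V, E)$ | NSPR graph |
| $V$ | Set of VNFs of the NSPR |
| $E \subset \{(\bar{a},\bar{b}) \in V \times V : \bar{a} \neq \bar{b}\}$ | Set of VLs of the NSPR |
| $req^{cpu}_v$, $\forall v \in V$ | CPU requirement of VNF $v$ |
| $req^{ram}_v$, $\forall v \in V$ | RAM requirement of VNF $v$ |
| $req^{bw}_{(\bar{a},\bar{b})}$, $\forall (\bar{a},\bar{b}) \in E$ | Bandwidth requirement of VL $(\bar{a},\bar{b})$ |

TABLE II: NSPR parameters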

III-C Network Load Modeling

We consider the case where the load offered by NSPRs is time-varying. We specifically assume that there are several classes of NSPRs, grouped into two sets. The first set is composed of NSPR classes (referred to as static) with constant arrival rates, creating background traffic. The second set is composed of NSPR classes with time-varying arrival rates, so as to reflect some volatility in the NSPRs. Those NSPRs are said to be dynamic.

III-C1 Network Load for Static NSPR Classes

Let $J$ be the set of resources in the network (i.e., CPU, RAM, bandwidth). Let $K_{static}$ be the set of static NSPR classes. We compute the load generated by arrivals of NSPRs of class $k$ in $K_{static}$ for resource $j$ in $J$ as in [20]:

$$\rho^k_j = \frac{1}{C_j} \cdot \frac{\lambda^k}{\mu^k} \cdot A^k_j, \qquad (1)$$

where $C_j$ is the total capacity of resource $j$, $A^k_j$ is the number of resource units requested by an NSPR of class $k$, $\lambda^k$ is the average arrival rate for an NSPR of class $k$ and $1/\mu^k$ is the average lifetime of an NSPR of class $k$.
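As a hedged illustration of the static load computation, the sketch below assumes that the load of a class on a resource is the product of its average arrival rate, its average lifetime and the number of resource units it requests, divided by the total capacity of that resource. The function name `static_load` is hypothetical, and the numbers are read from the Long term row of Table III under the additional assumption that the CPU units requested per NSPR equal the number of VNFs times the CPU requested per VNF.

```python
def static_load(arrival_rate, lifetime, units_requested, total_capacity):
    """Offered load of one static NSPR class on one resource.

    Assumption: load = arrival rate x average lifetime x requested units
    / total capacity, as suggested by the definitions in the text.
    """
    if total_capacity <= 0:
        raise ValueError("total_capacity must be positive")
    return arrival_rate * lifetime * units_requested / total_capacity


# Values read from the Long term row of Table III (interpretation assumed):
# 10 VNFs x 25 CPU units per VNF, lifetime 500, arrival rate 0.02,
# total CPU capacity 6300.
cpu_units_per_nspr = 10 * 25
long_term_cpu_load = static_load(0.02, 500, cpu_units_per_nspr, 6300)
print(round(long_term_cpu_load, 3))
```

Under these assumptions, the long-term class alone would occupy roughly 40% of the total CPU capacity on average.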

III-C2 Network Load for Dynamic NSPR Classes

Let be the set of dynamic NSPR classes. We consider a periodic average arrival rate for class in given by


where is the period of in time units and is a parameter used to control the amplitude of . We then adapt Eq. (1) to compute the network load for dynamic NSPR class and resource as . It is worth noting that to preserve in the interval, must be between and .

III-C3 Global Network Load

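Finally, we define the global network load $\rho_j(t)$ for each resource $j$ at the simulated time instant $t$ as the sum of the network loads generated by the static NSPR classes ($k \in K_{static}$) and the dynamic NSPR classes ($k' \in K_{dynamic}$), that is,

$$\rho_j(t) = \sum_{k \in K_{static}} \rho^k_j + \sum_{k' \in K_{dynamic}} \rho^{k'}_j(t). \qquad (3)$$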
It is worth noting .
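The following sketch illustrates how a global load could be assembled as the sum of static and dynamic class loads. The periodic arrival rate used here (a sinusoid around a mean rate) is purely an assumption for illustration and is not the arrival-rate formula of the paper; the function names, dictionary keys and numeric values are likewise hypothetical.

```python
import math


def periodic_arrival_rate(t, mean_rate, amplitude, period):
    """Hypothetical periodic arrival rate: a sinusoid around mean_rate.

    Illustration only; the paper's own periodic rate may differ.
    """
    return mean_rate * (1.0 + amplitude * math.sin(2.0 * math.pi * t / period))


def class_load(arrival_rate, lifetime, units_requested, total_capacity):
    """Load of one class: arrival rate x lifetime x units / capacity."""
    return arrival_rate * lifetime * units_requested / total_capacity


def global_load(t, static_classes, dynamic_classes, total_capacity):
    """Sum of the static and dynamic class loads at simulated time t."""
    load = 0.0
    for c in static_classes:
        load += class_load(c["rate"], c["lifetime"], c["units"], total_capacity)
    for c in dynamic_classes:
        rate_t = periodic_arrival_rate(
            t, c["mean_rate"], c["amplitude"], c["period"]
        )
        load += class_load(rate_t, c["lifetime"], c["units"], total_capacity)
    return load


# Hypothetical classes: one static and one dynamic, on a 6300-unit resource.
static_classes = [{"rate": 0.02, "lifetime": 500, "units": 250}]
dynamic_classes = [
    {"mean_rate": 1.0, "amplitude": 0.5, "period": 96, "lifetime": 20, "units": 125}
]
load_start = global_load(0, static_classes, dynamic_classes, 6300)
load_peak = global_load(24, static_classes, dynamic_classes, 6300)
load_next_day = global_load(96, static_classes, dynamic_classes, 6300)
```

With a period of 96 time units, the load at `t = 96` coincides with the load at `t = 0`, while the peak of the assumed sinusoid is reached a quarter of a period later.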

III-D Network Slice Placement Optimization Problem Statement

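  • Given: an NSPR graph $G_v = (V, E)$ and a PSN graph $G_s = (N, L)$,

  • Find: a mapping $G_v \to \bar{G}_s = (\bar{N}, \bar{L})$, $\bar{N} \subset N$, $\bar{L} \subset L$,

  • Subject to: the VNF CPU requirements $req^{cpu}_v, \forall v \in V$, the VNF RAM requirements $req^{ram}_v, \forall v \in V$, the VL bandwidth requirements $req^{bw}_{(\bar{a},\bar{b})}, \forall (\bar{a},\bar{b}) \in E$, the server CPU available capacity $cap^{cpu}_s, \forall s \in S$, the server RAM available capacity $cap^{ram}_s, \forall s \in S$, the physical link bandwidth available capacity $cap^{bw}_{(a,b)}, \forall (a,b) \in L$.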

  • Objective: maximize the network slice placement request acceptance ratio, minimize the total resource consumption and maximize load balancing.
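The server-side constraints of the problem statement above can be sketched as a simple feasibility check. This is a minimal illustration, assuming a placement is given as a dictionary from VNFs to servers; the function name `is_feasible` and the data layout are hypothetical, and link bandwidth constraints are deliberately left out to keep the example short.

```python
def is_feasible(placement, vnf_req, server_cap):
    """Check the CPU and RAM constraints of a VNF-to-server mapping.

    placement:  dict VNF id -> server id
    vnf_req:    dict VNF id -> (cpu, ram) requirement
    server_cap: dict server id -> (cpu, ram) available capacity
    Bandwidth constraints on links are not checked in this sketch.
    """
    used = {}
    for vnf, server in placement.items():
        cpu, ram = vnf_req[vnf]
        used_cpu, used_ram = used.get(server, (0, 0))
        used[server] = (used_cpu + cpu, used_ram + ram)
    return all(
        cpu <= server_cap[s][0] and ram <= server_cap[s][1]
        for s, (cpu, ram) in used.items()
    )


# Hypothetical instance: servers with 50 CPU / 300 RAM units,
# VNFs requiring 25 CPU / 150 RAM units each.
server_cap = {"s1": (50, 300), "s2": (50, 300)}
vnf_req = {v: (25, 150) for v in ("v1", "v2", "v3")}
ok = is_feasible({"v1": "s1", "v2": "s1", "v3": "s2"}, vnf_req, server_cap)
overloaded = is_feasible({"v1": "s1", "v2": "s1", "v3": "s1"}, vnf_req, server_cap)
```

In this instance a server can host at most two VNFs, so the second mapping violates both the CPU and the RAM constraint.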

IV Learning framework for Network Slice Placement Optimization

We describe in this section the machine learning framework used to solve the optimization problem formulated in Section III. We adopt the same approach as in [4] but, to cope with the non-stationary behavior of NSPR arrivals, we introduce an additional set of states to describe the network load.

Other methods could be considered to deal with the cyclic nature of the NSPR arrivals considered in this paper (e.g., LSTM techniques able to infer the periodic characteristics of the NSPR arrival process). However, our goal in this paper is to set up a method for dealing with non-stationary NSPR arrivals, independently of underlying periodic structures. We test the proposed approach on periodic NSPR arrivals to validate it, keeping in mind that the method has to be robust against non-stationary traffic variations. This point will be addressed in further studies.

IV-A Learning framework

IV-A1 Policy

We reuse the framework introduced in [4]. We denote by $\mathcal{A}$ the set of possible actions (namely placing VNFs on nodes) and by $\mathcal{S}$ the set of all states. We adopt a sequential placement strategy so that, at each step, we choose a node $n$ of the PSN on which to place a specific VNF $v \in \{1, \ldots, |V|\}$, where $|V|$ is the number of VNFs of the NSPR. The VNFs are placed sequentially, so that placement starts with VNF $v = 1$ and ends with VNF $v = |V|$.

At each time step $t$, given a state $s_t$, the learning agent selects an action $a_t$ with probability given by the Softmax distribution

$$\pi_\theta(a_t = a \mid s_t) = \frac{e^{Z_\theta(s_t, a)}}{\sum_{b \in \mathcal{A}} e^{Z_\theta(s_t, b)}},$$

where the function $Z_\theta : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ yields a real value for each state and action, calculated by a Deep Neural Network (DNN) as detailed in Section IV-B1. The notation $\pi_\theta$ is used to indicate that the policy depends on $Z_\theta$. The control parameter $\theta$ represents the weights in the DNN.
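The Softmax policy described above can be sketched in a few lines of plain Python. The scores in `z` stand for the per-action real values produced by the DNN and are hypothetical; the helper names `softmax` and `sample_action` are introduced only for this illustration.

```python
import math
import random


def softmax(z_values):
    """Numerically stable Softmax over a list of real-valued scores."""
    z_max = max(z_values)
    exps = [math.exp(z - z_max) for z in z_values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(z_values, rng=random):
    """Sample an action index (a candidate node) from the Softmax policy."""
    probs = softmax(z_values)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


z = [1.0, 2.0, 0.5]  # hypothetical scores for three candidate nodes
policy = softmax(z)
action = sample_action(z, random.Random(0))
```

Subtracting the maximum score before exponentiating does not change the resulting distribution but avoids overflow for large scores.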

IV-A2 State representation

As in [4], the PSN state is characterized by the occupancy of the servers. In addition, we keep track of the placement of the outstanding NSPR (i.e., the NSPR under placement) via a vector $\chi$, where the entry $\chi_n$ is the number of VNFs of the current NSPR placed on node $n$.

Accordingly, the NSPR state is a view of the current placement and is composed of four characteristics. Three are related to the resource requirements of the current VNF to be placed (see Table II for the notation). The fourth is the number of VNFs of the outstanding NSPR still to be placed.

To deal with the non-stationary environment, we introduce the Load state, which represents the network load forecast used to learn the network load variations. It is defined by a set of 100 features for each resource, calculated using the network load formula given by Eq. (3) relative to the simulated time instant $t$ at which the current NSPR arrives.

IV-A3 Reward function

We reuse the reward function introduced in [4]. We precisely consider


where is the number of iterations of a training episode and where the rewards , , and are defined as follows:

  • An Action may lead to a successful or unsuccessful placement. We then define the Acceptance Reward value due to action as

  • The Resource Consumption Reward value for the placement of VNF via action is defined by


    where is the path used to place VL . Note that a maximum is given when , that is, when VNFs and are placed on the same server.

  • The Load Balancing Reward value for the placement of VNF via


IV-B Adaptation of DRL and Introduction of a Heuristic Function

IV-B1 Proposed Deep Reinforcement Learning Algorithm

As in [4], we use a single-thread version of the A3C algorithm introduced in [14]. This algorithm relies on two DNNs that are trained in parallel: i) the Actor Network with the parameter $\theta$, which is used to generate the policy $\pi_\theta$ at each time step, and ii) the Critic Network with the parameter $\theta_v$, which generates an estimate $\nu_{\theta_v}(s_t)$ of the state-value function defined by

$$\nu^{\pi_\theta}(s_t) = \mathbb{E}_{\pi_\theta}\left[\sum_{k \geq 0} \gamma^k r_{t+k+1} \,\middle|\, s_t\right]$$

for some discount parameter $\gamma$.
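To make the role of the discount parameter concrete, the sketch below computes the discounted return of every step of one episode and a simple advantage estimate (return minus Critic estimate). This is a generic actor-critic ingredient shown under the assumption of Monte Carlo returns; it is not the update rule of [4], and the function names and numeric values are hypothetical.

```python
def discounted_returns(rewards, gamma):
    """Discounted return of each step: rewards[t] + gamma * return of step t+1.

    rewards[t] is the reward received after the action taken at step t.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def advantages(rewards, values, gamma):
    """Advantage estimate: discounted return minus the Critic's value estimate."""
    return [g - v for g, v in zip(discounted_returns(rewards, gamma), values)]


rewards = [0.0, 0.0, 1.0]        # hypothetical episode: reward only at the end
critic_values = [0.2, 0.4, 0.9]  # hypothetical Critic estimates per step
returns = discounted_returns(rewards, gamma=0.5)
adv = advantages(rewards, critic_values, gamma=0.5)
```

A positive advantage indicates that the sampled action led to a better outcome than the Critic expected, which is the signal an actor-critic method uses to reinforce that action.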

As depicted in Fig. 1 both Actor and Critic Networks have almost identical structure. As in [25], we use the GCN formulation proposed by [13] to automatically extract advanced characteristics of the PSN. The characteristics produced by the GCN represent semantics of the PSN topology by encoding and accumulating characteristics of neighbour nodes in the PSN graph. The size of the neighbourhood is defined by the order-index parameter . As in [25], we consider in the following and perform automatic extraction of 60 characteristics per PSN node. Both the NSPR state and Network Load characteristics are separately transmitted to fully connected layers with 4 and 100 units, respectively.

Fig. 1: Reference framework for the proposed learning algorithms.

The characteristics extracted by both layers and the GCN layer are combined into a single column vector of size $60|N| + 104$, where $|N|$ is the number of PSN nodes, and passed through a fully connected layer with $|N|$ units.

In the Critic Network, the outputs are forwarded to a single neuron, which is used to calculate the state-value function estimate. In the Actor Network, the outputs represent the values of the function $Z_\theta$ introduced in Section IV-A. These values are injected into a Softmax layer that transforms them into a Softmax distribution corresponding to the policy $\pi_\theta$.

During the training phase, at each time step $t$, the A3C algorithm uses the Actor Network to calculate the policy $\pi_\theta(\cdot \mid s_t)$. An action $a_t$ is sampled using the policy and performed on the environment. The Critic Network is used to calculate the state-value function approximation for $s_t$. The learning agent then receives the reward $r_{t+1}$ and the next state $s_{t+1}$ from the environment, and the placement process continues until a terminal state is reached, that is, until the Actor Network returns an unsuccessful action or until the current NSPR is completely placed. At the end of the training episode, the A3C algorithm updates the Actor and Critic parameters by using the same rules as in [4].

IV-B2 Introduction of a Heuristic Function

To guide the learning process, we use as in [4] the placement heuristic introduced in [3]. This yields the HA-DRL algorithm. More precisely, starting from the reference framework shown in Fig. 1, we propose to include in the Actor Network a Heuristic layer that calculates a Heuristic Function $H : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ based on external information provided by the heuristic method, referred to as HEU.

Let $Z_\theta$ be the function computed by the fully connected layer of the Actor Network that maps each state and action to a real value, which is then converted by the Softmax layer into the selection probability of the respective action (see Section IV-A). Let $a^*_t = \operatorname{argmax}_{a} Z_\theta(s_t, a)$ be the action with the highest $Z_\theta$ value for state $s_t$. Let $\bar{a}_t$ be the action derived by the HEU method at time step $t$, i.e., the preferred action to be chosen. $H(s_t, a_t)$ is shaped to allow the value of $Z_\theta(s_t, \bar{a}_t)$ to become closer to the value of $Z_\theta(s_t, a^*_t)$. The aim is to turn $\bar{a}_t$ into one of the likeliest actions to be chosen by the policy.

The Heuristic Function is then formulated as

$$H(s_t, a_t) = \begin{cases} Z_\theta(s_t, a^*_t) - Z_\theta(s_t, a_t) + \eta & \text{if } a_t = \bar{a}_t, \\ 0 & \text{otherwise,} \end{cases}$$

where the parameter $\eta$ is a small real number. During the training process, the Heuristic layer calculates $H(s_t, \cdot)$ and updates the $Z_\theta(s_t, \cdot)$ values by using the following equation:

$$Z_\theta(s_t, \cdot) = Z_\theta(s_t, \cdot) + \xi H(s_t, \cdot)^{\beta}.$$

The Softmax layer then computes the policy using the modified $Z_\theta(s_t, \cdot)$. Note that the action returned by HEU will have a higher probability of being chosen. $\xi$ and $\beta$ are parameters used to control how much HEU influences the policy.
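The shaping mechanism described above can be sketched as follows. This is a minimal illustration that assumes the usual heuristically accelerated formulation: the heuristic term is non-zero only for the action suggested by HEU, where it equals the gap to the best score plus a small constant, and it is added to the scores after being raised to a power and scaled. The parameter names `eta`, `xi` and `beta`, the function names and all numeric values are assumptions made for this example.

```python
import math


def heuristic_values(z_values, heuristic_action, eta=0.01):
    """Heuristic term: non-zero only for the action suggested by HEU."""
    z_best = max(z_values)
    h = [0.0] * len(z_values)
    h[heuristic_action] = z_best - z_values[heuristic_action] + eta
    return h


def shaped_scores(z_values, heuristic_action, xi=1.0, beta=1.0, eta=0.01):
    """Add xi * H^beta to the Actor scores before the Softmax layer."""
    h = heuristic_values(z_values, heuristic_action, eta)
    return [z + xi * (hv ** beta) for z, hv in zip(z_values, h)]


def softmax(z_values):
    """Numerically stable Softmax over a list of real-valued scores."""
    z_max = max(z_values)
    exps = [math.exp(z - z_max) for z in z_values]
    total = sum(exps)
    return [e / total for e in exps]


z = [2.0, 0.5, 1.0]  # hypothetical Actor outputs for three nodes
heu_action = 1       # hypothetical action suggested by HEU
before = softmax(z)
after = softmax(shaped_scores(z, heu_action, xi=1.0, beta=1.0, eta=0.01))
```

With these illustrative settings, the score of the HEU action is lifted just above the previous best score, so it becomes the most likely action while the other actions keep a non-zero probability.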

IV-C Implementation Remarks

IV-C1 Algorithms considered

We consider four learning algorithms based on the reference framework presented in Fig. 1.

  • DRL: The pure-DRL algorithm we initially proposed in [4]. The state representation does not include the network load state and the Actor Network does not contain the Heuristic layer;

  • eDRL: An enhanced version of DRL in which the state representation includes the network load state while the Actor Network does not contain the Heuristic layer;

  • HA-DRL: This algorithm embeds the heuristic: the state representation does not include the network load state while the Actor Network contains the Heuristic layer;

  • HA-eDRL: The state representation includes the network load state and the Actor Network contains the Heuristic layer.

IV-C2 Implementation details

All resource-related characteristics are normalized to be in $[0, 1]$. This is done by dividing the capacities and requirements related to each resource $j \in \{cpu, ram, bw\}$ by the corresponding maximum capacity. With regard to the DNNs, we have implemented the Actor and Critic as two independent neural networks. Each neuron has a bias assigned. We have used the hyperbolic tangent (tanh) activation for the non-output layers of the Actor Network and the Rectified Linear Unit (ReLU) activation for all layers of the Critic Network. We have also normalized the positive global rewards. During the training phase, we have considered the policy as a Categorical distribution and used it to sample the actions randomly.

V Implementation and Evaluation Results

In this section, we present the implementation and experiments we conducted to evaluate the proposed algorithms.

V-A Implementation Details & Simulator Settings

V-A1 Experimental setting

We developed a simulator in Python containing: i) the elements of the Network Slice Placement Optimization problem (i.e., PSN and NSPR); ii) the DRL, eDRL, HA-DRL and HA-eDRL algorithms. We used the PyTorch framework to implement the DNNs. Experiments were run on a machine with 2x6 cores @ 2.95 GHz and 96 GB of RAM.

| Class | # of VNFs per NSPR | CPU requested per VNF | Lifetime | Arrival rate | Total CPU capacity | Load |
|---|---|---|---|---|---|---|
| Volatile | 5 | 25 | 20 | | 6300 | |
| Long term | 10 | 25 | 500 | 0.02 | 6300 | |

TABLE III: Network Load Calculation for both NSPR Classes

V-A2 Physical Substrate Network Settings

We consider a PSN that could reflect the infrastructure of an operator, as discussed in [20]. In this network, three types of DCs are introduced as in Section III. Each CDC is connected to three EDCs, which are 100 km apart. CDCs are interconnected and connected to one CCP that is 300 km away. We consider 15 EDCs with 4 servers each, 5 CDCs with 10 servers each and 1 CCP with 16 servers. The CPU and RAM capacities of each server are 50 and 300 units, respectively. Intra-data center links have a bandwidth capacity of 100 Gbps inside CDCs and the CCP, and of 10 Gbps inside EDCs. Transport links connected to EDCs have a bandwidth capacity of 10 Gbps. Transport links between CDCs, as well as those between CDCs and the CCP, have a bandwidth capacity of 100 Gbps.
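As a quick consistency check of these settings, the short sketch below derives the total number of servers and the total CPU and RAM capacities of the PSN from the per-DC figures given above. The variable names are hypothetical; the resulting total CPU capacity can be compared with the value listed in Table III.

```python
# (number of data centers, servers per data center) for each DC type
DATA_CENTERS = {"EDC": (15, 4), "CDC": (5, 10), "CCP": (1, 16)}
CPU_PER_SERVER = 50   # CPU units per server
RAM_PER_SERVER = 300  # RAM units per server

total_servers = sum(n_dc * n_srv for n_dc, n_srv in DATA_CENTERS.values())
total_cpu = total_servers * CPU_PER_SERVER
total_ram = total_servers * RAM_PER_SERVER
print(total_servers, total_cpu, total_ram)
```

The 126 servers yield 6300 CPU units in total, which matches the total CPU capacity reported in Table III.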

V-A3 Network Slice Placement Requests Settings

We consider NSPRs to have the Enhanced Mobile Broadband (eMBB) setting described in [3]. Each NSPR is composed of 5 VNFs. Each VNF requires 25 units of CPU and 150 units of RAM. Each VL requires 2 Gbps of bandwidth.

V-B Algorithms & Experimental Setup

V-B1 Training Process & Hyper-parameters

We consider a training process with maximum duration of 55 hours for the considered algorithms. We perform seven independent runs of each algorithm to assess their average and maximal performance in terms of metrics introduced below (see Section V-C). After performing Hyper-parameter search, we set the learning rates for the Actor and Critic networks of DRL and HA-DRL algorithms to and , respectively. For eDRL and HA-eDRL, we set the learning rates for the Actor and Critic networks to and , respectively. We implement four versions of HA-DRL and HA-eDRL agents, each with a different value for the parameter of the heuristic function formulation (see Section IV-B2). We set in addition the parameters and .

V-B2 Network load calculation

We implement the network load model introduced in Section III-C considering two NSPR classes: i) a volatile class (referred to as Volatile), with a dynamic arrival rate, and ii) a static class (referred to as Long term), with a static arrival rate. Table III presents the network load models proposed for both classes. We consider that each simulation time unit corresponds to 15 minutes in reality. We set the period of the network load function to 96 simulation time units, i.e., one day. The global network load then varies between 0.3 and 1.0.

V-C Evaluation Metrics

To characterize the performance of the placement algorithms, we consider two performance metrics:

  1. Global Acceptance Ratio (GAR): The acceptance ratio of the different tested algorithms during training, computed after each arrival as follows: $GAR = \frac{\#\,\text{accepted NSPRs}}{\#\,\text{arrived NSPRs}}$. This metric is used to evaluate the accumulated acceptance ratio of the different algorithms as the learning process progresses.

  2. Acceptance Ratio per training phase (TAR): The acceptance ratio obtained in each training phase, i.e., each part of the training process, corresponding to a given number of NSPR arrivals or episodes. It is calculated as follows: $TAR = \frac{\#\,\text{accepted NSPRs in the phase}}{\#\,\text{arrived NSPRs in the phase}}$. This metric allows us to better observe the evolution of algorithm performance over time, since it measures algorithm performance in independent parts (phases) of the training process without accumulating the performance of previous training phases.
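The difference between the two metrics can be illustrated with a short sketch, assuming the outcome of each arrival is recorded as a Boolean (accepted or rejected) and that a training phase is a block of a fixed number of consecutive arrivals. The function names, the phase size and the outcome sequence are hypothetical.

```python
def global_acceptance_ratio(outcomes):
    """Running GAR after each arrival: accepted so far / arrived so far."""
    ratios, accepted = [], 0
    for arrived, ok in enumerate(outcomes, start=1):
        accepted += 1 if ok else 0
        ratios.append(accepted / arrived)
    return ratios


def acceptance_ratio_per_phase(outcomes, phase_size):
    """TAR: acceptance ratio inside each consecutive block of arrivals."""
    if phase_size <= 0:
        raise ValueError("phase_size must be positive")
    ratios = []
    for start in range(0, len(outcomes), phase_size):
        phase = outcomes[start:start + phase_size]
        ratios.append(sum(1 for ok in phase if ok) / len(phase))
    return ratios


# Hypothetical learning run: mostly rejections early, acceptances later.
outcomes = [False, False, True, False, True, True, True, True]
gar = global_acceptance_ratio(outcomes)
tar = acceptance_ratio_per_phase(outcomes, phase_size=4)
```

In this toy run the final GAR is 0.625 because it still carries the early rejections, whereas the TAR of the second phase is 1.0, which is why the per-phase metric reflects recent performance more directly.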

(a) Maximal performance in 3 simulated days.
(b) Maximal performance in 4500 simulated days.
Fig. 2: Maximal Global Acceptance ratio results.
(a) Average performance in 3 simulated days.
(b) Average performance in 4500 simulated days.
Fig. 3: Average Global Acceptance ratio results.

V-D Global Acceptance Ratio Evaluation

Fig. 2 and Fig. 3 show the progression of the Global Acceptance Ratio (GAR) over time for the considered algorithms. Fig. 2(a) and Fig. 3(a) show that after 3 simulated days, the maximal and average GARs of HA-DRL and HA-eDRL with are convergent and higher than 80% while for the other algorithms, the GARs remain between 20% and 40% and are not stable. Fig. 2(b) and Fig. 3(b) exhibit the progression of the GARs over a longer simulated time horizon of 4500 simulated days.

We observe that while GARs of HA-DRL and HA-eDRL with remain stable after the convergence reached in the first 3 simulated days, for the other algorithms, these performance metrics start to stabilize only after 2000 simulated days. Table IV shows that both the maximal and average final GARs, i.e., maximal and average GARs achieved at the end of training, are close to 82% for both HA-DRL and HA-eDRL with . For the other algorithms, maximal final GARs are between 61% and 71% and average final GARs are between 52% and 65%, except for HA-DRL and HA-eDRL with . Table IV also shows that maximal and average final GARs of HA-DRL algorithms are generally higher than the equivalent HA-eDRL versions, the gap being never higher than 6%.

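| Algorithm | Max. final GAR (%) | Avg. final GAR (%) | Max. final TAR (%) | Avg. final TAR (%) |
|---|---|---|---|---|
| DRL | 65.90 | 52.15 | 77.69 | 55.82 |
| HA-DRL, | 70.56 | 55.57 | 81.45 | 59.83 |
| HA-DRL, | 69.80 | 60.12 | 73.56 | 65.33 |
| HA-DRL, | 64.89 | 28.23 | 72.14 | 41.69 |
| HA-DRL, | 81.84 | 81.62 | 83.91 | 82.21 |
| eDRL | 69.36 | 65.10 | 77.39 | 58.61 |
| HA-eDRL, | 69.25 | 52.22 | 76.34 | 48.45 |
| HA-eDRL, | 68.40 | 54.49 | 71.91 | 57.57 |
| HA-eDRL, | 61.46 | 27.33 | 67.05 | 33.10 |
| HA-eDRL, | 81.82 | 81.68 | 81.37 | 81.95 |

TABLE IV: Summary of evaluation results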

The above results show that HA-DRL and HA-eDRL with exhibit more robust performance with faster convergence and handle better network load variations than the other evaluated algorithms, notably when compared with the classical DRL approach.

Indeed, when setting , the Heuristic Function computed on the basis of the HEU algorithm has strong influence on the actions chosen by the agent. Since the HEU algorithm often indicates a good action, this strong influence of the heuristic function helps the algorithms to become stable more quickly.

The addition of network load related features to the state observed by the eDRL algorithm helps improve the maximal and average final GAR when compared with DRL. However, this improvement is less significant than the improvement brought by the Heuristic Function acceleration, as the eDRL algorithm does not achieve the same fast convergence and robustness as HA-DRL and HA-eDRL with .

(a) Training Phases 0–16.
(b) Training Phases 17–33
(c) Training Phases 34–50
Fig. 4: Maximal Acceptance Ratio per training phase results.
(a) Training Phases 0–16.
(b) Training Phases 17–33
(c) Training Phases 34–50
Fig. 5: Average Acceptance Ratio per training phase results.
(a) Maximal performance for volatile NSPRs.
(b) Maximal performance for longterm NSPRs.
Fig. 6: Acceptance Ratio per training phase results (per NSPR class).

V-E Acceptance Ratio per Training Phase Evaluation

Fig. 4 and Fig. 5 show that all algorithms need at least 16 training phases to reach a convergent maximal and average Acceptance Ratio per Training Phase (TAR), except for HA-DRL and HA-eDRL with which need only one training phase (see Fig. 4(a) and Fig. 5(a)). The fast convergence of HA-DRL and HA-eDRL with is due to the strong influence of the Heuristic Function in the choice of actions, as explained in Section V-D. Fig. 4(a) and Fig. 4(b) illustrate that algorithms HA-DRL and HA-eDRL with , HA-DRL with and DRL need around 16 phases to reach a convergent maximal TAR, while Fig. 4(b) and Fig. 4(c) demonstrate that eDRL, HA-eDRL with and HA-DRL with need around 34 phases. Finally, Fig. 5(a) and Fig. 5(b) show that HA-DRL and HA-eDRL with , eDRL, and DRL need around 16 phases to obtain a convergent average TAR, while Fig. 5(b) and Fig. 5(c) show that HA-DRL and HA-eDRL with need around 34 phases.

Table IV shows that the only algorithms that reach a maximal and average final TAR (i.e, maximal and average TAR achieved at the end of training) higher than 80% are HA-DRL and HA-eDRL with . HA-DRL with also attains a maximal final TAR higher than 80% but its average final TAR is less than 60%. HA-DRL with has average final TAR higher than 65%, but its maximal final TAR is around 74%. In addition, Table IV shows that: HA-DRL algorithms have maximal final TAR generally higher than the equivalent HA-eDRL versions, the gap being never higher than 6%; eDRL and DRL have equivalent maximal final TAR but eDRL has average final TAR 3% higher. Fig. (b)b and Fig. (a)a show that all algorithms have better maximal and average TAR on volatile NSPRs than on long-term NSPRs. This difference is arguably related to the number of arrivals of each requests and, therefore, the amount of training performed on each class of requests, since many more volatile requests arrive to be placed during the simulated period than long-term requests. These results reinforce the conclusion introduced in Section V-D. The maximal TAR progression results (i.e., Fig. 4) allow us to observe that in the best case, if a large number of training phases is given, all the algorithms attain a good performance.

However, the average TAR progression results (see Fig. 5) show that only HA-DRL and HA-eDRL with have robust performance. These algorithms are also the only ones that converge quickly, which makes them, among those evaluated, the best suited to practical use.
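As an illustration, the TAR statistics discussed above could be computed as in the following minimal sketch. It assumes that each algorithm is trained over several independent runs and that the maximal and average TAR are taken phase by phase across those runs. The function names and the tolerance-based convergence criterion are hypothetical and are not taken from the evaluation itself.

```python
from statistics import mean


def tar_per_phase(accepted, arrived):
    """Acceptance ratio for each training phase of a single run."""
    return [a / n if n else 0.0 for a, n in zip(accepted, arrived)]


def max_and_avg_tar(runs):
    """Phase-wise maximum and mean TAR across runs.

    `runs` is a list of equally long TAR series, one per run (assumption).
    """
    phases = list(zip(*runs))
    return [max(p) for p in phases], [mean(p) for p in phases]


def convergence_phase(series, tol=0.02):
    """First phase (1-indexed) from which the series stays within `tol`
    of its final value. This criterion is an illustrative assumption."""
    final = series[-1]
    phase = len(series)
    for i in range(len(series) - 1, -1, -1):
        if abs(series[i] - final) > tol:
            break
        phase = i + 1
    return phase


# Two hypothetical runs of three training phases, 10 arrivals per phase.
run_a = tar_per_phase([5, 8, 8], [10, 10, 10])
run_b = tar_per_phase([3, 6, 8], [10, 10, 10])
max_tar, avg_tar = max_and_avg_tar([run_a, run_b])
print(max_tar, avg_tar, convergence_phase(avg_tar))
```

Under this criterion, an algorithm whose average TAR is already within the tolerance of its final value in the first phase would be reported as converging in one training phase.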

VI Conclusion

We have introduced a DRL-heuristic algorithm that copes with variations of network load in the placement of network slice requests, as a follow-up of our work in [4]. We have considered four families of algorithms: pure-DRL and eDRL techniques, and their variants combined with a heuristic whose efficiency was investigated in a previous work [3]. The influence of the heuristic on the DRL algorithm can be tuned by means of a parameter introduced in Section IV-B2. We assume that network load has periodic fluctuations. We have shown that introducing network load states into the DRL algorithm, in combination with the heuristic function, yields very good results in terms of GAR and TAR, which are stable over time. This study aims to show that coupling DRL and heuristic functions yields good and stable results even under non-stationary conditions.
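To make the periodic-load assumption concrete, the sketch below samples arrival times from a non-stationary Poisson process using the standard thinning method. The sinusoidal rate profile, the parameter values, and the function names are illustrative assumptions and are not the load model used in the simulations.

```python
import math
import random


def periodic_rate(t, base=1.0, amplitude=0.5, period=24.0):
    """Hypothetical arrival rate with a sinusoidal periodic fluctuation."""
    return base * (1.0 + amplitude * math.sin(2.0 * math.pi * t / period))


def sample_arrivals(horizon, base=1.0, amplitude=0.5, period=24.0, seed=0):
    """Sample arrival times on (0, horizon] by thinning a homogeneous
    Poisson process whose rate is the maximum of the periodic rate."""
    if not 0.0 <= amplitude <= 1.0:
        raise ValueError("amplitude must lie in [0, 1] so the rate stays non-negative")
    rng = random.Random(seed)
    rate_max = base * (1.0 + amplitude)
    t, arrivals = 0.0, []
    while True:
        # Candidate arrival from the dominating homogeneous process.
        t += rng.expovariate(rate_max)
        if t > horizon:
            return arrivals
        # Keep the candidate with probability rate(t) / rate_max.
        if rng.random() * rate_max <= periodic_rate(t, base, amplitude, period):
            arrivals.append(t)


arrivals = sample_arrivals(horizon=240.0)
print(len(arrivals))
```

Thinning only requires an upper bound on the rate, so the same sketch applies to any bounded periodic profile by replacing `periodic_rate` and `rate_max`.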

The next step is to study the behavior of the proposed algorithm in case of an unpredictable network load disruption.


This work has been performed in the framework of the 5G PPP MON-B5G project (www.monb5g.eu). The experiments were conducted using Grid’5000, a large-scale testbed by Inria and Sorbonne University (www.grid5000.fr).


  • [1] 3GPP (2020-Dec.) Management and orchestration; 5G Network Resource Model (NRM); Stage 2 and stage 3 (Release 17). Technical Specification (TS) 28.541, 3rd Generation Partnership Project (3GPP). Version 17.1.0. Cited by: §I.
  • [2] I. Alawe, Y. Hadjadj-Aoul, A. Ksentini, P. Bertin, C. Viho, and D. Darche (2018) An efficient and lightweight load forecasting for proactive scaling in 5G mobile networks. In Proc. 2018 IEEE Conf. Standards Commun. Netw. (CSCN), pp. 1–6. Cited by: §II-C.
  • [3] J. J. Alves Esteves, A. Boubendir, F. Guillemin, and P. Sens (2020) Heuristic for edge-enabled network slicing optimization using the “power of two choices”. In Proc. 2020 IEEE 16th Int. Conf. Netw. Service Manag. (CNSM), pp. 1–9. Cited by: §IV-B2, §V-A3, §VI.
  • [4] J. J. Alves Esteves, A. Boubendir, F. Guillemin, and P. Sens (2021) A heuristically assisted deep reinforcement learning approach for network slice placement. arXiv preprint arXiv:2105.06741. Cited by: §I, §II-B, §II-B, §II, 1st item, §IV-A1, §IV-A2, §IV-A3, §IV-B1, §IV-B1, §IV-B2, §IV, §VI.
  • [5] E. Amaldi, S. Coniglio, A. M. Koster, and M. Tieves (2016-Jun.) On the computational complexity of the virtual network embedding problem. Electron. Notes Discrete Math. 52, pp. 213–220. Cited by: §I.
  • [6] D. Bega, M. Gramaglia, M. Fiore, A. Banchs, and X. Costa-Perez (2019) DeepCog: cognitive network management in sliced 5G networks with deep learning. In Proc. IEEE INFOCOM 2019 - IEEE Conf. Comput. Commun. Workshops (INFOCOM WKSHPS), pp. 280–288. Cited by: §II-C.
  • [7] R. A. Bianchi, C. H. Ribeiro, and A. H. Costa (2008) Accelerating autonomous learning by using heuristic selection of actions. J. Heuristics 14 (2), pp. 135–168. Cited by: §II-B.
  • [8] M. Dolati, S. B. Hassanpour, M. Ghaderi, and A. Khonsari (2019) DeepViNE: virtual network embedding with deep reinforcement learning. In Proc. IEEE INFOCOM 2019 - IEEE Conf. Comput. Commun. Workshops (INFOCOM WKSHPS), pp. 879–885. Cited by: §I, §II-A.
  • [9] J. J. A. Esteves, A. Boubendir, F. Guillemin, and P. Sens (2020) Location-based data model for optimized network slice placement. In Proc. 2020 6th IEEE Conf. Netw. Softwarization (NetSoft), pp. 404–412. Cited by: §I.
  • [10] ETSI NFV ISG (2017) Network Functions Virtualisation (NFV); Evolution and Ecosystem; Report on Network Slicing Support, ETSI Standard GR NFV-EVE 012 V3.1.1. Technical report, ETSI. Cited by: §I.
  • [11] J. Gil Herrera and J. F. Botero (2016-Sep.) Resource allocation in NFV: a comprehensive survey. IEEE Trans. Netw. Service Manag. 13 (3), pp. 518–532. Cited by: §I.
  • [12] F. Habibi, M. Dolati, A. Khonsari, and M. Ghaderi (2020) Accelerating virtual network embedding with graph neural networks. In Proc. 2020 IEEE 16th Int. Conf. Netw. Service Manag. (CNSM), pp. 1–9. Cited by: §I, §II-A.
  • [13] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In Proc. 5th Int. Conf. Learn. Representations (ICLR), pp. 1–14. Cited by: §IV-B1.
  • [14] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In Int. Conf. Mach. Learn., pp. 1928–1937. Cited by: §IV-B1.
  • [15] J. Pei, P. Hong, M. Pan, J. Liu, and J. Zhou (2020-Feb.) Optimal VNF placement via deep reinforcement learning in SDN/NFV-enabled networks. IEEE J. Sel. Areas Commun. 38 (2), pp. 263–278. Cited by: §II-C, §II-C.
  • [16] P. T. A. Quang, A. Bradai, K. D. Singh, and Y. Hadjadj-Aoul (2019) Multi-domain non-cooperative VNF-FG embedding: a deep reinforcement learning approach. In Proc. IEEE INFOCOM 2019 - IEEE Conf. Comput. Commun. Workshops (INFOCOM WKSHPS), pp. 886–891. Cited by: §I, §II-A.
  • [17] P. T. A. Quang, Y. Hadjadj-Aoul, and A. Outtagarts (2019-Dec.) A deep reinforcement learning approach for VNF forwarding graph embedding. IEEE Trans. Netw. Service Manag. 16 (4), pp. 1318–1331. Cited by: §II-B, §II-B.
  • [18] A. Rkhami, Y. Hadjadj-Aoul, and A. Outtagarts (2021) Learn to improve: a novel deep reinforcement learning approach for beyond 5G network slicing. In Proc. 2021 IEEE 18th Annu. Consum. Commun. Netw. Conf. (CCNC), pp. 1–6. Cited by: §II-B, §II-B.
  • [19] S. Schneider, A. Manzoor, H. Qarawlus, R. Schellenberg, H. Karl, R. Khalili, and A. Hecker (2020) Self-driving network and service coordination using deep reinforcement learning. In Proc. 2020 IEEE 16th Int. Conf. Netw. Service Manag. (CNSM), pp. 1–9. Cited by: §I, §II-A.
  • [20] F. Slim, F. Guillemin, A. Gravey, and Y. Hadjadj-Aoul (2017) Towards a dynamic adaptive placement of virtual network functions under ONAP. In Proc. 2017 IEEE Conf. on Netw. Function Virtualization Softw. Defined Netw. (NFV-SDN), pp. 210–215. Cited by: §III-C1, §V-A2.
  • [21] F. Slim, F. Guillemin, and Y. Hadjadj-Aoul (2018) CLOSE: a costless service offloading strategy for distributed edge cloud. In Proc. 2018 15th IEEE Annu. Cons. Commun. Netw. Conf. (CCNC), pp. 1–6. Cited by: §III-A.
  • [22] R. S. Sutton and A. G. Barto (2015) Reinforcement learning: an introduction. MIT press, Cambridge, MA, USA. Cited by: §I.
  • [23] H. Wang, Y. Wu, G. Min, J. Xu, and P. Tang (2019-Sep.) Data-driven dynamic resource scheduling for network slicing: a deep reinforcement learning approach. Inf. Sci. 498, pp. 106–116. Cited by: §I, §II-A.
  • [24] Y. Xiao, Q. Zhang, F. Liu, J. Wang, M. Zhao, Z. Zhang, and J. Zhang (2019) NFVdeep: adaptive online service function chain deployment with deep reinforcement learning. In Proc. 2019 IEEE/ACM 27th Int. Symp. Qual. Service (IWQoS), pp. 1–10. Cited by: §I, §II-A.
  • [25] Z. Yan, J. Ge, Y. Wu, L. Li, and T. Li (2020-Jun.) Automatic virtual network embedding: a deep reinforcement learning approach with graph convolutional networks. IEEE J. Sel. Areas Commun. 38 (6), pp. 1040–1057. Cited by: §I, §II-A, §IV-B1.
  • [26] H. Yao, X. Chen, M. Li, P. Zhang, and L. Wang (2018-Apr.) A novel reinforcement learning algorithm for virtual network embedding. Neurocomputing 284, pp. 1–9. ISSN 0925-2312. Cited by: §I, §II-A.