On the Robustness of Controlled Deep Reinforcement Learning for Slice Placement

08/05/2021 ∙ by Jose Jurandir Alves Esteves, et al. ∙ Orange ∙ Laboratoire d'Informatique de Paris 6

The evaluation of the impact of using Machine Learning in the management of softwarized networks is considered in multiple research works. Beyond that, we propose to evaluate the robustness of online learning for optimal network slice placement. A major assumption of this study is that slice request arrivals are non-stationary. In this context, we simulate unpredictable network load variations and compare two Deep Reinforcement Learning (DRL) algorithms: a pure DRL-based algorithm and a heuristically controlled DRL as a hybrid DRL-heuristic algorithm, to assess the impact of these unpredictable traffic load changes on the performance of the algorithms. We conduct extensive simulations of a large-scale operator infrastructure. The evaluation results show that the proposed hybrid DRL-heuristic approach is more robust and reliable than pure DRL in case of unpredictable network load changes, as it reduces the performance degradation. These results follow up on a series of recent studies we have performed showing that the proposed hybrid DRL-heuristic approach is efficient and more adapted to real network scenarios than pure DRL.




I Introduction

The promise of network slicing is to enable a high level of customization of network services in future networks (5G and beyond), leveraging virtualization and software defined networking techniques. These key enablers transform telecommunications networks into programmable platforms capable of offering virtual networks enriched by Virtual Network Functions (VNFs) and IT resources tailored to the specific needs of certain customers (e.g., companies) or vertical markets (automotive, e-health, etc.) [1, 11]. From an optimization theory perspective, the Network Slice Placement problem can be viewed as a specific case of the Virtual Network Embedding (VNE) or VNF Forwarding Graph Embedding (VNF-FGE) problems [12, 14]. It is then generally possible to formulate Integer Linear Programming (ILP) problems [9], which however turn out to be NP-hard [5] with very long convergence times.

With regard to network management, there are specific characteristics related to network slicing: slices are expected to share resources and coexist in a large and distributed infrastructure. Moreover, slices have a wide range of requirements in terms of resources, quality objectives and lifetime.

In practice, these characteristics bring additional complexity as the placement algorithms need to be highly scalable with low response time even under varying network conditions.

As an alternative to optimization techniques and the development of heuristic methods, Deep Reinforcement Learning (DRL) has recently been used in the context of VNE and Network Slice Placement [30, 8, 31, 28, 29, 19]. DRL techniques are considered very promising since they allow, at least theoretically, the determination of optimal decision policies based only on experience [27]. However, from a practical point of view, especially in the context of non-stationary environments, ensuring that a DRL agent converges to an optimal policy is still challenging.

As a matter of fact, when the environment is continually changing, the algorithm has trouble using the acquired knowledge to find optimal solutions. Using the DRL algorithm online can then become impractical. In fact, most existing works based on DRL to solve the Network Slice Placement or VNE problem assume a stationary environment, i.e., one with constant network load. However, traffic conditions in real networks are inherently non-stationary, with daily and weekly variations, and are subject to drastic changes (e.g., a traffic storm due to an unpredictable event).

To cope with traffic changes, this paper proposes a hybrid DRL-heuristic strategy called Heuristically Assisted DRL (HA-DRL) [4]. In [10], we applied this strategy in an online learning scenario with periodic network load variations to show how it can be used to accelerate and stabilize the convergence of DRL techniques in this type of non-stationary environment. As a follow-up of these two studies, we focus in the present paper on a different non-stationary scenario with stair-stepped network load changes. The goal of the paper is to evaluate and show the robustness of the proposed strategy in the case of sudden, stair-stepped traffic changes.

The contributions of the present paper are threefold:

  1. We propose a network load model to describe network slice demand and adapt it to unpredictable network load changes;

  2. We propose a framework combining Advantage Actor Critic and a Graph Convolutional Network (GCN) for conceiving DRL-based algorithms adapted to the non-stationary case;

  3. We show how the use of a heuristic function can control the DRL learning and improve its robustness to unpredictable network load changes.

The organization of this paper is as follows: In Section II, we review the related work. In Section III, we describe the Network Slice Placement problem modeling. The learning framework for slice placement optimization is described in Section IV. The adaptation of the pure DRL approach and its control by a heuristic are introduced in Section IV-B. The experiments and evaluation results are presented in Section V, while conclusions and perspectives are presented in Section VI.

II Related Work Analysis

We provide in Section II-A a summarized review of the existing DRL-based approaches for network slice placement. The interested reader may refer to [4, 10] for a more detailed and comprehensive discussion. In Section II-B we discuss recent works on robust slice placement algorithms.

II-A On DRL-based Approaches for Slice Placement

DRL has been recently applied to solving network slice placement and VNE problems. We divide these works into two categories on the basis of their algorithmic aspects: 1) pure DRL approaches [30, 8, 31, 28, 29, 19], in which only the knowledge acquired by the learning agent via training is used as a basis for taking placement decisions; and 2) hybrid DRL-heuristic approaches [4, 20, 22], in which the placement decision computation is assisted by a heuristic method.

The use of heuristics aims at increasing the reliability of DRL algorithms. However, most of these works are based on the assumption that the network load is static, i.e., that slice arrivals occur at a constant rate. To the best of our knowledge, the work we proposed in [10] is the first attempt to evaluate an online DRL-based approach in a non-stationary network load scenario, whereas [18] only considers offline learning.

In addition, both [10] and [18] assume that the network load has periodic fluctuations. In the present paper, we study the behavior of the algorithms proposed in [10] in the case of an unpredictable network load disruption.

II-B On Robustness of Slice Placement Approaches

The term robustness has different meanings depending on the field of application. In Robust Optimization (RO), robustness is related to the decision/solution itself: it is the capability of the algorithm's solution to cope with the worst case without losing feasibility [7].

In Machine Learning (ML), especially in Deep Learning (DL), robustness is related to the learned model: it is the property of the model (i.e., the Deep Neural Network (DNN)) that determines its integrity under varying operating conditions.

The authors of [2] are the first to discuss robustness in the DRL context. They propose to use a Genetic Algorithm to improve the robustness of a self-driving car application. Robustness is considered as the capacity to sustain a high accuracy on image classification even when the perceived images change, and it is measured by Neuron Coverage (NC), i.e., the ratio of activated neurons in the DNN.

There are only a few recent works on the robustness of slice placement procedures, most of them on RO [16, 15, 21, 6]. These works answer a question different from the one we are investigating as they evaluate the robustness of the decision whereas we want to evaluate the robustness of the learning process. Despite their originality, the above approaches present some drawbacks, such as the lack of scalability of ILP, the sub-optimality of heuristic solutions, the fact that they consider offline optimization in which all slices to be placed are known in advance, and the fact that they are single objective optimization approaches, mainly focusing on energy consumption minimization. In this work, we propose to rely on a DRL-based approach in order to overcome ILP and heuristic drawbacks and consider multiple-optimization objectives.

To the best of our knowledge, paper [26] is the only one to have proposed a DRL-based approach for slice placement and evaluated the learning robustness. However, its authors focus on evaluating the robustness of the DRL approach against random topology changes (e.g., node failures or the deployment of new nodes in the network topology). In this work, we focus on evaluating robustness against unpredictable network load variations. The present work is thus the first to perform such an evaluation.

III Network Slice Placement Optimization Problem

We present in this section the various elements composing the model for slice placement. Slices are placed on a substrate network, referred to as the Physical Substrate Network (PSN) and described in Section III-A. Slices give rise to Network Slice Placement Requests (Section III-B), generating a network load defined in Section III-C. The optimization problem is formulated in Section III-D.

III-A Physical Substrate Network Modeling

The Physical Substrate Network (PSN) is composed of the infrastructure resources, namely the IT resources (CPU, RAM, disk, etc.) needed for supporting the Virtual Network Functions (VNFs) of network slices, together with the transport network, in particular the Virtual Links (VLs) interconnecting the VNFs of slices. As depicted in Fig. 1, the PSN is divided into three components: the Virtualized Infrastructure (VI) corresponding to IT resources, the Access Network (AN), and the Transport Network (TN). The Virtualized Infrastructure (VI) hosting IT resources is the set of Data Centers (DCs) interconnected by network elements (switches and routers). We assume that data centers are either distributed in Points of Presence (PoPs) or centralized (e.g., in a big cloud platform). As in [25], we define three types of DCs with different capacities: Edge Data Centers (EDCs), close to end users but with small resource capacities; Core Data Centers (CDCs), regional DCs with medium resource capacities; and Central Cloud Platforms (CCPs), national DCs with large resource capacities. We consider that slices are rooted so as to take into account the location of the users of a slice. We thus introduce an Access Network (AN) representing User Access Points (UAPs), such as Wi-Fi APs, antennas of cellular networks, etc., and Access Links.

Users access slices via one UAP, which may change during the lifetime of a communication (e.g., because of user mobility). The Transport Network (TN) is the set of routers and transmission links needed to interconnect the different DCs and the UAPs.

The complete PSN is modeled as a weighted undirected graph G = (N, L) with parameters described in Table I, where N is the set of physical nodes in the PSN and L refers to the set of substrate links. Each node n ∈ N has a type in the set {UAP, router, switch, server}. The available CPU and RAM capacities on each node are defined as cap_n^{cpu} and cap_n^{ram} for all n ∈ N, respectively. The available bandwidth on the links is defined as cap_{(a,b)}^{bw} for all (a, b) ∈ L.

Parameter Description
G = (N, L) PSN graph
N Network nodes
S Set of servers
DC Set of data centers
S_dc, dc ∈ DC Set of servers in data center dc
sw_dc Switch of data center dc
L Set of physical links
cap_{(a,b)}^{bw} Bandwidth capacity of link (a, b)
cap_s^{cpu} Available CPU capacity on server s
cap_s^{cpu,max} Maximum CPU capacity of server s
cap_s^{ram} Available RAM capacity on server s
cap_s^{ram,max} Maximum RAM capacity of server s
cap_s^{bw,max} Maximum outgoing bandwidth from s
TABLE I: PSN parameters
Fig. 1: Physical Substrate Network example.

III-B Network Slice Placement Requests Modeling

We consider that a slice is a chain of VNFs to be placed and connected over the PSN. VNFs of a slice are grouped into a request, namely a Network Slice Placement Request (NSPR), which has to be placed on the PSN. A NSPR is represented as a weighted undirected graph G_v = (V, E), with parameters described in Table II, where V is the set of VNFs in the NSPR and E is the set of VLs interconnecting the VNFs of the slice. The CPU and RAM requirements of each VNF of a NSPR are defined as req_v^{cpu} and req_v^{ram} for all v ∈ V, respectively. The bandwidth required by each VL in a NSPR is given by req_{(v1,v2)}^{bw} for all (v1, v2) ∈ E.

We consider the existence of different NSPR classes characterizing different levels of resource requirements, lifespan and arrival rate, as described in Section III-C.

Parameter Description
G_v = (V, E) NSPR graph
V Set of VNFs of the NSPR
E Set of VLs of the NSPR
req_v^{cpu} CPU requirement of VNF v
req_v^{ram} RAM requirement of VNF v
req_{(v1,v2)}^{bw} Bandwidth requirement of VL (v1, v2)
TABLE II: NSPR parameters
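To make the two graph models concrete, they can be sketched as plain data structures (an illustrative layout; the class and field names are ours, not the paper's):

```python
from dataclasses import dataclass

@dataclass
class Server:
    # Available and maximum capacities: cap_s^{cpu}, cap_s^{ram}, and their maxima.
    cpu: float
    ram: float
    cpu_max: float
    ram_max: float

@dataclass
class PSN:
    # Weighted undirected graph G = (N, L).
    servers: dict   # node id -> Server
    links: dict     # (a, b) -> available bandwidth cap_{(a,b)}^{bw}

@dataclass
class NSPR:
    # Weighted undirected graph G_v = (V, E): a chain of VNFs.
    cpu_req: list   # req_v^{cpu} for each VNF v, in placement order
    ram_req: list   # req_v^{ram} for each VNF v
    bw_req: list    # req^{bw} for each VL of the chain

# Example: an eMBB-like NSPR with 5 VNFs (25 CPU, 150 RAM each, 2 Gbps VLs),
# matching the request settings used later in the evaluation.
nspr = NSPR(cpu_req=[25] * 5, ram_req=[150] * 5, bw_req=[2] * 4)
psn = PSN(servers={0: Server(cpu=50, ram=300, cpu_max=50, ram_max=300)}, links={})
```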

III-C Network Load Modeling

The network load model allows us to control the percentage of the total network resource capacity being used at a specific instant.

Let R be the set of resources in the network (i.e., CPU, RAM, bandwidth) and let K be the set of NSPR classes. We compute the load generated by arrivals of NSPRs of class k ∈ K for resource r ∈ R as in [24]:

ρ_{k,r} = (λ_k · τ_k · A_{k,r}) / C_r,

where C_r is the total capacity of resource r, A_{k,r} is the number of units of resource r requested by an NSPR of class k, λ_k is the average arrival rate of NSPRs of class k, and τ_k is the average lifetime of an NSPR of class k.

We define the global load for resource r as the sum

ρ_r = Σ_{k ∈ K} ρ_{k,r}.

If ρ_r < 1, the system is not overloaded for resource r; otherwise, the system is under overload conditions and the rejection rate of NSPRs may be high.
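A minimal sketch of this load computation (the class parameters below are illustrative, not the paper's settings):

```python
def class_load(arrival_rate, lifetime, units_requested, capacity):
    """Per-class load rho_{k,r} = lambda_k * tau_k * A_{k,r} / C_r."""
    return arrival_rate * lifetime * units_requested / capacity

def global_load(classes, capacity):
    """Global load rho_r: sum of per-class loads for one resource r."""
    return sum(class_load(lam, tau, units, capacity)
               for (lam, tau, units) in classes)

# Two illustrative classes sharing a CPU capacity of 1000 units:
# (lambda_k, tau_k, A_{k,cpu}) per class.
classes = [(0.5, 20, 25), (0.01, 500, 25)]
rho = global_load(classes, capacity=1000)  # 0.25 + 0.125 = 0.375 < 1: not overloaded
```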

III-D Network Slice Placement Optimization Problem Statement

The Network Slice Placement optimization problem is stated as follows:

  • Given: a NSPR graph G_v = (V, E) and a PSN graph G = (N, L),

  • Find: a mapping of each VNF v ∈ V to a server and of each VL (v1, v2) ∈ E to a physical path in the PSN,

  • Subject to: the VNF CPU requirements req_v^{cpu}, the VNF RAM requirements req_v^{ram}, the VL bandwidth requirements req_{(v1,v2)}^{bw}, the server CPU available capacities cap_s^{cpu}, the server RAM available capacities cap_s^{ram}, and the physical link bandwidth available capacities cap_{(a,b)}^{bw},

  • Objective: maximize the network slice placement request acceptance ratio, minimize the total resource consumption and maximize load balancing.

A complete mathematical formulation of this problem can be found in [4].
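For concreteness, the capacity constraints in the statement above can be checked mechanically for a candidate mapping. The sketch below is our own illustrative helper (not the paper's algorithm) and only validates server CPU/RAM capacities for a chain placement:

```python
def placement_feasible(cpu_req, ram_req, cpu_avail, ram_avail, mapping):
    """Check that placing VNF i on server mapping[i] respects CPU/RAM capacities.

    cpu_req/ram_req: per-VNF requirements; cpu_avail/ram_avail: per-server
    available capacities; mapping: list giving the server of each VNF.
    """
    used_cpu, used_ram = {}, {}
    for i, s in enumerate(mapping):
        used_cpu[s] = used_cpu.get(s, 0) + cpu_req[i]
        used_ram[s] = used_ram.get(s, 0) + ram_req[i]
    return (all(used_cpu[s] <= cpu_avail[s] for s in used_cpu) and
            all(used_ram[s] <= ram_avail[s] for s in used_ram))

# Two VNFs of 25 CPU / 150 RAM on one server with 50 CPU / 300 RAM: feasible.
ok = placement_feasible([25, 25], [150, 150], {0: 50}, {0: 300}, [0, 0])
```

A full feasibility check would also verify the bandwidth of the paths supporting the VLs, per the constraint list above.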

IV Learning Framework for Network Slice Placement Optimization

We describe in this section the DRL-based approach used to solve the optimization problem formulated in Section III. As mentioned, we adopt the same approach as in [4], but we focus here on evaluating the performance when an unpredictable network load change occurs.

IV-A Learning Framework

Fig. 2 presents an overview of the DRL framework. The state contains the features of the PSN and of the NSPR to be placed. A valid action is, for a given NSPR graph G_v, a subgraph of the PSN graph G on which the NSPR can be placed without violating the problem constraints described in Section III-D.

The reward evaluates how good the computed action is with respect to the optimization objectives described in Section III-D. DNNs are trained to calculate i) optimal actions for each state (i.e., placements with maximal rewards) and ii) the State-value function used in the learning process.

In the following sections we describe each one of the elements of this framework.

Fig. 2: DRL framework for Network Slice Placement Optimization

IV-A1 Policy

We reuse the framework introduced in [4]. We denote by A the set of possible actions (namely, placing VNFs on nodes) and by S the set of all states. We adopt a sequential placement strategy: at each step we choose a node n ∈ N where to place a specific VNF v. The VNFs are placed sequentially, starting with the first VNF of the chain and ending with the last.

At each time step t, given a state s_t ∈ S, the learning agent selects an action a ∈ A with probability given by the Softmax distribution

π_θ(a | s_t) = e^{Z_θ(s_t, a)} / Σ_{a' ∈ A} e^{Z_θ(s_t, a')},

where the function Z_θ yields a real value for each state and action, calculated by a Deep Neural Network (DNN) as detailed in Section IV-B1. The notation π_θ is used to indicate that the policy depends on the control parameter θ, which represents the weights of the DNN.
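The Softmax policy can be sketched in a few lines of pure Python (illustrative; in practice the per-action values would come from the Actor Network's output layer):

```python
import math

def softmax_policy(z_values):
    """pi(a|s) = exp(Z(s,a)) / sum_a' exp(Z(s,a')), with a max-shift for stability."""
    m = max(z_values)
    exps = [math.exp(z - m) for z in z_values]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax_policy([1.0, 2.0, 3.0])
# Probabilities sum to 1 and preserve the ordering of the Z values.
```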

IV-A2 State Representation

As in [4], the PSN state is characterized by the available server resources: cap_s^{cpu}, cap_s^{ram} and the available bandwidth. In addition, we keep track of the placement of the pending NSPR (i.e., the one being placed) via the vector (p_n)_{n ∈ N}, where p_n is the number of VNFs of the current NSPR placed on node n.

The NSPR state is a view of the current placement and is composed of four characteristics: three related to the resource requirements (see Table II for the notation) of the current VNF v to be placed, namely its CPU, RAM and bandwidth requirements, and the number of VNFs of the outstanding NSPR still to be placed.

IV-A3 Reward Function

We reuse the reward function introduced in [4], which aggregates, over the T iterations of a training episode, three reward components r_t^{acc}, r_t^{res} and r_t^{lb} defined as follows (their precise expressions are given in [4]):

  • An action a_t may lead to a successful or unsuccessful placement. We then define the Acceptance Reward value r_t^{acc} due to action a_t accordingly, rewarding successful placements and penalizing unsuccessful ones.

  • The Resource Consumption Reward value r_t^{res} for the placement of VNF v via action a_t decreases with the length of the path P_{(v−1,v)} used to place the VL (v−1, v). Note that the maximum reward is given when |P_{(v−1,v)}| = 0, that is, when VNFs v−1 and v are placed on the same server.

  • The Load Balancing Reward value r_t^{lb} for the placement of VNF v via action a_t favors placements on the least loaded resources.
IV-B Adaptation of DRL and Introduction of a Heuristic Function

IV-B1 Proposed Deep Reinforcement Learning Algorithm

As in [4], we use a single-thread version of the A3C algorithm introduced in [17]. This algorithm relies on two DNNs that are trained in parallel: i) the Actor Network with parameter θ, which is used to generate the policy π_θ at each time step, and ii) the Critic Network with parameter θ_v, which generates an estimate v(s_t; θ_v) of the State-value function defined by

V(s) = E[ Σ_{t ≥ 0} γ^t r_t | s_0 = s ],

for some discount parameter γ ∈ (0, 1).
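The State-value targets the Critic learns can be illustrated by computing discounted returns from a reward sequence (a generic RL sketch; γ and the rewards are illustrative):

```python
def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}: Monte Carlo targets for V(s_t)."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

g = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
# g[0] = 1 + 0.5 * (1 + 0.5 * 1) = 1.75
```

In A3C the Critic is trained by bootstrapping rather than with full Monte Carlo returns, but the discounted-return target above conveys what the state-value function estimates.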

As depicted in Fig. 3, both the Actor and Critic Networks have an almost identical structure. As in [30], we use the GCN formulation proposed by [13] to automatically extract advanced characteristics of the PSN. The characteristics produced by the GCN represent semantics of the PSN topology by encoding and accumulating characteristics of neighbour nodes in the PSN graph. The size of the neighbourhood is defined by the order-index parameter of the GCN.

Fig. 3: Reference framework for the proposed learning algorithms.

As in [30], we consider in the following the same order-index setting and perform automatic extraction of 60 characteristics per PSN node.

The NSPR state characteristics are separately transmitted to a fully connected layer with 4 units. The characteristics extracted by this layer and by the GCN layer are combined into a single column vector and passed through a fully connected layer.

In the Critic Network, the outputs are forwarded to a single neuron, which is used to calculate the state-value function estimation v(s_t; θ_v). In the Actor Network, the outputs represent the values of the function Z_θ introduced in Section IV-A. These values are injected into a Softmax layer that transforms them into the Softmax distribution corresponding to the policy π_θ.
During the training phase, at each time step t, the A3C algorithm uses the Actor Network to calculate the policy. An action is sampled from the policy and performed on the environment. The Critic Network is used to calculate the state-value function approximation. The learning agent then receives the reward and next state from the environment, and the placement process continues until a terminal state is reached, that is, until the Actor Network returns an unsuccessful action or until the current NSPR is fully placed. At the end of the training episode, the A3C algorithm updates the Actor and Critic parameters by using the same rules as in [4].

IV-B2 Introduction of a Heuristic Function

To guide the learning process, we use, as in [4], the placement heuristic introduced in [3]. This yields the HA-DRL algorithm. More precisely, starting from the reference framework shown in Fig. 3, we propose to include in the Actor Network a Heuristic layer that calculates a Heuristic Function based on external information provided by the heuristic method, referred to as HEU.

Let Z_θ be the function computed by the fully connected layer of the Actor Network that maps each state s and action a to a real value Z_θ(s, a), which is afterwards converted by the Softmax layer into the selection probability of the respective action (see Section IV-A).

Let a* = argmax_a Z_θ(s_t, a) be the action with the highest value for state s_t. Let a_H be the action derived by the HEU method at time step t, i.e., the preferred action to be chosen. The Heuristic Function H is shaped to allow the value of Z_θ(s_t, a_H) to become closer to the value of Z_θ(s_t, a*). The aim is to turn a_H into one of the likeliest actions to be chosen by the policy.

The Heuristic Function is then formulated as

H(s_t, a) = Z_θ(s_t, a*) − Z_θ(s_t, a_H) + η if a = a_H, and H(s_t, a) = 0 otherwise,

where the parameter η is a small real number.

During the training process, the Heuristic layer calculates H and updates the Z_θ values by using the following equation:

Z_θ(s_t, a) ← Z_θ(s_t, a) + ξ · H(s_t, a)^β.

The Softmax layer then computes the policy using the modified Z_θ values. Note that the action a_H returned by HEU will have a higher probability of being chosen. The ξ and β parameters are used to control how much HEU influences the policy.
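As a rough illustration of this mechanism, the sketch below boosts the logit of the heuristic action before applying the Softmax. The boost formula and the coefficient values (eta, xi, beta) are our assumptions in the spirit of heuristically accelerated RL, not necessarily the paper's exact update:

```python
import math

def heuristic_boosted_policy(z, a_heu, eta=0.1, xi=1.0, beta=2.0):
    """Boost the heuristic action a_heu so it becomes one of the likeliest.

    H(s, a) = max_a' Z(s, a') - Z(s, a_heu) + eta for a = a_heu, else 0;
    the policy is a softmax over Z(s, a) + xi * H(s, a) ** beta.
    """
    z_star = max(z)
    adjusted = list(z)
    h = z_star - z[a_heu] + eta
    adjusted[a_heu] += xi * h ** beta
    m = max(adjusted)
    exps = [math.exp(v - m) for v in adjusted]
    total = sum(exps)
    return [e / total for e in exps]

probs = heuristic_boosted_policy([3.0, 1.0, 2.0], a_heu=1)
# The heuristic action ends up with the highest selection probability.
```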

IV-C Implementation Remarks

All resource-related characteristics are normalized to be in [0, 1]. This is done by dividing the cap and req values of each resource (cpu, ram, bw) by the associated maximum capacity. With regard to the DNNs, we have implemented the Actor and Critic as two independent Neural Networks. Each neuron has a bias assigned. We have used the hyperbolic tangent (tanh) activation for the non-output layers of the Actor Network and the Rectified Linear Unit (ReLU) activation for all layers of the Critic Network. We have normalized positive global rewards to be in [0, 1]. During the training phase, we have considered the policy as a Categorical distribution and used it to sample actions randomly.

V Implementation and Evaluation Results

V-A Implementation Details & Simulator Settings

V-A1 Experimental setting

We developed a simulator in Python containing: i) the elements of the Network Slice Placement Optimization problem (i.e., PSN and NSPR); and ii) the DRL and HA-DRL algorithms. We used the PyTorch framework to implement the DNNs. Experiments were run on a machine with two 6-core CPUs at 2.95 GHz and 96 GB of RAM.

V-A2 Physical Substrate Network Settings

We consider a PSN that could reflect the infrastructure of an operator, as discussed in [24]. In this network, the three types of DCs introduced in Section III are deployed as follows: each CDC is connected to three EDCs located 100 km away; CDCs are interconnected and connected to one CCP that is 300 km away.

We consider 15 EDCs, each with 4 servers; 5 CDCs, each with 10 servers; and 1 CCP with 16 servers. The CPU and RAM capacities of each server are 50 and 300 units, respectively. A bandwidth capacity of 100 Gbps is given to intra-data center links inside CDCs and the CCP, while 10 Gbps is the bandwidth of intra-data center links inside EDCs. Transport links connected to EDCs have 10 Gbps of bandwidth capacity. Transport links between CDCs have 100 Gbps of bandwidth capacity, as do the ones between CDCs and the CCP.

V-A3 Network Slice Placement Requests Settings

We consider NSPRs to have the Enhanced Mobile Broadband (eMBB) setting described in [3]. Each NSPR is composed of 5 or 10 VNFs (see Section V-B2). Each VNF requires 25 units of CPU and 150 units of RAM. Each VL requires 2 Gbps of bandwidth.

V-B Algorithms & Experimental Setup

V-B1 Training Process & Hyper-parameters

We consider a training process with a maximum duration of 6 hours for each of the considered algorithms. We perform seven independent runs of each algorithm to assess its average performance in terms of the metrics introduced below (see Section V-C).

After performing a hyper-parameter search, we set appropriate learning rates for the Actor and Critic networks of the DRL and HA-DRL algorithms.

We program four versions of the HA-DRL agent, each with a different value of the β parameter of the heuristic function formulation (see Section IV-B2). We fix in addition the ξ and η parameters.

V-B2 Network load calculation

Network loads are calculated using the CPU resource, but the analysis could easily be applied to RAM; we use the network load model introduced in Section III-C. We consider two NSPR classes: i) a Volatile class and ii) a Long-term class.

The differences between the two classes relate to their resource requirements and lifespans: Volatile requests have 5 VNFs and a lifespan of 20 simulation time units, while Long-term requests have 10 VNFs and a lifespan of 500 simulation time units.

V-B3 Network load change scenarios

We consider that the network runs in a standard regime under a network load equal to 40% and that the NSPRs of each class generate half of the total load.

In each experiment, the learning agent is first trained for approximately 4 hours under this network load regime. Then a stair-stepped network load change occurs. We simulated eight different network load change levels, each characterized by the addition of a certain amount of extra network load ranging from 10% to 80% (the highest levels causing system overload).
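The stair-stepped change can be modeled as a simple load schedule (an illustrative sketch; the time units and the extra-load value are ours):

```python
def network_load(t, change_time, base=0.4, extra=0.3):
    """Stair-stepped load: base load before change_time, base + extra after."""
    return base if t < change_time else base + extra

# Baseline 40% load, then a sudden +30% step at t = 5.
loads = [network_load(t, change_time=5) for t in range(8)]
```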

V-C Evaluation Metrics

To characterize the performance of the placement algorithms, we consider a performance metric called the Acceptance Ratio per Training phase (TAR). This metric represents the Acceptance Ratio obtained in each training phase, i.e., each part of the training process corresponding to a fixed number of NSPR arrivals (episodes). It is calculated as the ratio of the number of accepted NSPRs to the number of NSPR arrivals in the phase. This metric allows us to better observe the evolution of algorithm performance over time since it measures algorithm performance in independent parts (phases) of the training process without accumulating the performance of previous training phases.

Based on this metric, we identify four other important metrics used in our results discussion:

  1. Rupture TAR: the TAR obtained in the training phase where the network load change occurs, i.e., the rupture phase;

  2. Last TAR: the TAR obtained in the training phase immediately prior to the rupture phase;

  3. Average TAR: the average of the TARs obtained in the 30 phases preceding the rupture phase;

  4. TAR standard deviation: the standard deviation of the TARs obtained in the 30 phases preceding the rupture phase.
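The TAR metric can be computed from per-arrival accept/reject outcomes as follows (a sketch; the phase length used here is an assumption):

```python
def tar_per_phase(outcomes, phase_len):
    """Acceptance Ratio per Training phase: accepted / arrivals in each phase.

    outcomes: list of booleans, one per NSPR arrival (True = accepted).
    """
    phases = []
    for start in range(0, len(outcomes), phase_len):
        chunk = outcomes[start:start + phase_len]
        phases.append(sum(chunk) / len(chunk))
    return phases

# Each phase of 4 arrivals accepts 3 requests -> TAR of 0.75 per phase.
tars = tar_per_phase([True, True, False, True] * 2, phase_len=4)
```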

V-D Evaluation of the impact of network load change

Fig. 4, 5 and 6 capture the impact of different network load change levels on the TARs obtained by the different evaluated algorithms. The rupture phase is identified by a blue vertical line in the various figures.

We can observe in Fig. 4, 5 and 6 that, with the reduced training time of 6 hours, the only algorithm that achieves near-optimal performance after 108 training phases is HA-DRL with the strongest heuristic influence (the largest β value evaluated). This is due to the fact that the strong influence of the Heuristic Function helps the algorithm to become stable more quickly, as discussed in [4] and [10].

(a) Addition of 10% of network load.
(b) Addition of 20% of network load.
(c) Addition of 30% of network load.
Fig. 4: Evaluation of impact of network load disruption on TAR: under-loaded scenarios
(a) Addition of 40% of network load.
(b) Addition of 50% of network load.
(c) Addition of 60% of network load.
Fig. 5: Evaluation of impact of network load disruption on TAR: critical scenarios
(a) Addition of 70% of network load.
(b) Addition of 80% of network load.
Fig. 6: Evaluation of impact of network load disruption on TAR: overloaded scenarios

We can also observe from the shape of the different curves in Fig. 4, 5, and 6 that, as expected, all the algorithms exhibit some variability in their performance during the training phases. In addition, these figures show that the performance of all the algorithms is affected to various degrees by the network load change and that, generally speaking, the higher the amount of extra network load added, the lower the TAR after the change.

Finally, we can also see that the only algorithm to keep a near-optimal performance even in the overloaded scenarios shown in Fig. 6 is HA-DRL with the largest β value.

Tables III, IV, V, VI, and VII present other performance metrics related to the various evaluated algorithms. The columns "Rupture TAR - Avg. TAR" and "Rupture TAR - Last TAR" indicate how much the performance of the algorithms drops in the rupture phase when compared with the Average TAR and the Last TAR, respectively.

The TAR Standard Deviation column indicates the TAR Standard Deviation metric described in Section V-C.

Network Load
Disruption Level (%)
Rupture TAR - Avg. TAR (%) Rupture TAR - Last TAR (%) TAR Standard Deviation (%)
+10 -3.37 -1.89 3.10
+20 -8.19 -7.37 3.09
+30 -11.89 -6.83 4.17
+40 -17.68 -13.8 4.12
+50 -17.00 -9.11 4.32
+60 -18.50 -10.20 4.35
+70 -20.46 -14.26 3.30
+80 -21.65 -15.86 3.27
TABLE III: DRL algorithm results
Network Load
Disruption Level (%)
Rupture TAR - Avg. TAR (%) Rupture TAR - Last TAR (%) TAR Standard Deviation (%)
+10 -4.13 -2.60 4.07
+20 -11.02 -8.91 3.51
+30 -16.00 -10.54 4.50
+40 -16.28 -9.83 4.13
+50 -18.66 -11.14 5.05
+60 -17.02 -9.80 3.99
+70 -25.13 -18.20 4.93
+80 -29.41 -21.31 4.85
TABLE IV: HA-DRL algorithm results
Network Load
Disruption Level (%)
Rupture TAR - Avg. TAR (%) Rupture TAR - Last TAR (%) TAR Standard Deviation (%)
+10 -4.55 -3.43 3.95
+20 -8.80 -9.37 4.21
+30 -12.78 -10.66 4.59
+40 -20.33 -15.94 4.61
+50 -21.24 -13.43 4.56
+60 -19.46 -10.46 5.08
+70 -24.28 -16.26 3.75
+80 -26.78 -20.71 3.88
TABLE V: HA-DRL algorithm results
Network Load
Disruption Level (%)
Rupture TAR - Avg. TAR (%) Rupture TAR - Last TAR (%) TAR Standard Deviation (%)
+10 -2.96 -1.11 2.37
+20 -4.94 -6.49 3.50
+30 -6.93 -4.71 2.37
+40 -7.67 -7.00 1.97
+50 -6.80 -4.77 1.72
+60 -8.95 -5.29 2.45
+70 -11.25 -8.00 1.73
+80 -13.62 -11.69 2.89
TABLE VI: HA-DRL algorithm results
Network Load
Disruption Level (%)
Rupture TAR - Avg. TAR (%) Rupture TAR - Last TAR (%) TAR Standard Deviation (%)
+10 -2.04 0.09 2.37
+20 -7.01 -5.09 3.50
+30 -7.15 -2.31 2.37
+40 -7.90 -4.69 1.97
+50 -12.13 -5.86 1.72
+60 -10.24 -4.94 2.45
+70 -18.83 -12.34 1.73
+80 -17.79 -11.69 2.89
TABLE VII: HA-DRL algorithm results

Those tables confirm that, in general, the performance gaps, i.e., the gaps between the Rupture TAR and the Average or Last TAR, grow with the level of disruption for all algorithms. For instance, at disruption level "+10", the performance gaps are never higher than 5%, whereas at level "+80" the performance gaps are never lower than 11%.

In all the evaluated cases, the difference between the Rupture TAR and the Average TAR is higher than the TAR standard deviation. For instance, for the DRL algorithm at a network load disruption level of +50%, the Rupture TAR is 17% lower than the Average TAR, which is 3.94 times the TAR standard deviation.

The algorithm with the lowest performance gaps is the HA-DRL variant of Table VI, as we can see in its "Rupture TAR - Avg. TAR" and "Rupture TAR - Last TAR" columns. We can state that this algorithm has significantly better robustness than all the others, as its performance gaps are significantly lower. However, this variant also has the worst TAR performance, as shown in Fig. 4, 5 and 6, which reduces its applicability.

The HA-DRL variant of Table VII has the second best robustness and DRL the third, as we can see in the "Rupture TAR - Avg. TAR" and "Rupture TAR - Last TAR" columns of Tables VII and III, respectively. Even though the usage of the Heuristic Function helped the HA-DRL variants of Tables IV and V to achieve significantly better TARs than DRL, the influence of the Heuristic Function in these algorithms was not sufficient to improve the robustness of the DRL algorithm against unpredictable network load disruptions (see Tables IV and V, respectively).

The HA-DRL variant of Table VII is nevertheless more robust against unpredictable network load changes than DRL: its performance gaps are significantly lower, as the "Rupture TAR - Avg. TAR" and "Rupture TAR - Last TAR" columns of Tables VII and III show, respectively. These results confirm that this variant is, among those evaluated, the best suited for practical use, as it combines the best TAR results and quick convergence with robust performance.

VI Conclusion

We have introduced two DRL-based algorithms and evaluated their performance in a non-stationary network load scenario with unpredictable changes.

In line with the conclusions of [4, 10], the numerical experiments performed in this paper show that coupling DRL with heuristic functions yields good and stable results even under non-stationary load conditions. We therefore believe that such an approach is relevant in real networks, which are subject to unpredictable network load changes.

As part of our future work, we plan to explore distributed and parallel computing techniques to solve the considered multi-objective optimization problem, using multi-agent or federated learning approaches to address slice placement in heterogeneous networks. This is especially relevant when the network is decomposed into several segments or technical domains, where the network abstraction introduced in this paper is no longer valid. Each segment should then have its own abstractions and data, and information must be shared between segments to take a global decision. Instead of exchanging complete network states, segments would exchange minimal information obtained via heuristics.


This work has been performed in the framework of the 5GPPP MON-B5G project (www.monb5g.eu). The experiments were conducted using Grid'5000, a large-scale testbed by Inria and Sorbonne University (www.grid5000.fr).


  • [1] 3GPP (2020-Dec.) Management and orchestration; 5G Network Resource Model (NRM); Stage 2 and stage 3 (Release 17). Technical Specification (TS) 28.541, 3rd Generation Partnership Project (3GPP). Note: Version 17.1.0 Cited by: §I.
  • [2] R. R. O. Al-Nima, T. Han, S. A. M. Al-Sumaidaee, T. Chen, and W. L. Woo (2021-Jul.) Robustness and performance of deep reinforcement learning. Applied Soft Comput. 105, pp. 107295. Cited by: §II-B.
  • [3] J. J. Alves Esteves, A. Boubendir, F. Guillemin, and P. Sens (2020) Heuristic for edge-enabled network slicing optimization using the “power of two choices”. In Proc. 2020 IEEE 16th Int. Conf. Netw. Service Manag. (CNSM), pp. 1–9. External Links: Document Cited by: §IV-B2, §V-A3.
  • [4] J. J. Alves Esteves, A. Boubendir, F. Guillemin, and P. Sens (2021) A heuristically assisted deep reinforcement learning approach for network slice placement. arXiv preprint arXiv:2105.06741. Cited by: §I, §II-A, §II, §III-D, §IV-A1, §IV-A2, §IV-A3, §IV-A, §IV-A, §IV-B1, §IV-B1, §IV-B2, §IV, §V-D, §VI.
  • [5] E. Amaldi, S. Coniglio, A. M. Koster, and M. Tieves (2016-Jun.) On the computational complexity of the virtual network embedding problem. Electron. Notes Discrete Math. 52, pp. 213–220. Cited by: §I.
  • [6] A. Baumgartner, T. Bauschert, A. A. Blzarour, and V. S. Reddy (2017) Network slice embedding under traffic uncertainties — a light robust approach. In 2017 13th Int. Conf on Netw. Service Manag. (CNSM), pp. 1–5. Cited by: §II-B.
  • [7] D. Bertsimas and A. Thiele (2006) Robust and data-driven optimization: modern decision making under uncertainty. In Models, methods, and applications for innovative decision making, pp. 95–122. Cited by: §II-B.
  • [8] M. Dolati, S. B. Hassanpour, M. Ghaderi, and A. Khonsari (2019) DeepViNE: virtual network embedding with deep reinforcement learning. In Proc. IEEE INFOCOM 2019 - IEEE Conf. Comput. Commun. Workshops (INFOCOM WKSHPS), pp. 879–885. External Links: Document Cited by: §I, §II-A.
  • [9] J. J. A. Esteves, A. Boubendir, F. Guillemin, and P. Sens (2020) Location-based data model for optimized network slice placement. In Proc. 2020 6th IEEE Conf. Netw. Softwarization (NetSoft), Vol. , pp. 404–412. External Links: Document Cited by: §I.
  • [10] J. J. Alves Esteves, A. Boubendir, F. Guillemin, and P. Sens (2021) DRL-based slice placement under non-stationary conditions. Submitted to IEEE 17th International Conference on Network and Service Management (CNSM). Cited by: §I, §II-A, §II-A, §II, §V-D, §VI.
  • [11] ETSI NFV ISG (2017) Network Functions Virtualisation (NFV); Evolution and Ecosystem; Report on Network Slicing Support, ETSI Standard GR NFV-EVE 012 V3.1.1. Technical report ETSI. External Links: Link Cited by: §I.
  • [12] J. Gil Herrera and J. F. Botero (2016-Sep.) Resource allocation in NFV: a comprehensive survey. IEEE Trans. Netw. Service Manag. 13 (3), pp. 518–532. External Links: Document Cited by: §I.
  • [13] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In Proc. 5th Int. Conf. Learn. Representations (ICLR), pp. 1–14. Cited by: §IV-B1.
  • [14] A. Laghrissi and T. Taleb (2019-2nd Quart.) A survey on the placement of virtual resources and virtual network functions. IEEE Commun. Surveys Tuts. 21 (2), pp. 1409–1434. Cited by: §I.
  • [15] A. Marotta, F. D'Andreagiovanni, A. Kassler, and E. Zola (2017-Apr.) On the energy cost of robustness for green virtual network function placement in 5G virtualized infrastructures. Comput. Netw. 125, pp. 64–75. Cited by: §II-B.
  • [16] A. Marotta, E. Zola, F. D’Andreagiovanni, and A. Kassler (2017-Jul.) A fast robust optimization-based heuristic for the deployment of green virtual network functions. J. Netw. Comput. Applications 95, pp. 42–53. Cited by: §II-B.
  • [17] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In Int. Conf. Mach. Learn., pp. 1928–1937. Cited by: §IV-B1.
  • [18] J. Pei, P. Hong, M. Pan, J. Liu, and J. Zhou (2020-Feb.) Optimal VNF placement via deep reinforcement learning in SDN/NFV-enabled networks. IEEE J. Sel. Areas Commun. 38 (2), pp. 263–278. External Links: Document Cited by: §II-A, §II-A.
  • [19] P. T. A. Quang, A. Bradai, K. D. Singh, and Y. Hadjadj-Aoul (2019) Multi-domain non-cooperative VNF-FG embedding: a deep reinforcement learning approach. In Proc. IEEE INFOCOM 2019 - IEEE Conf. Comput. Commun. Workshops (INFOCOM WKSHPS), pp. 886–891. Cited by: §I, §II-A.
  • [20] P. T. A. Quang, Y. Hadjadj-Aoul, and A. Outtagarts (2019-Dec.) A deep reinforcement learning approach for vnf forwarding graph embedding. IEEE Trans. Netw. Service Manag. 16 (4), pp. 1318–1331. Cited by: §II-A.
  • [21] V. S. Reddy, A. Baumgartner, and T. Bauschert (2016) Robust embedding of vnf/service chains with delay bounds. In 2016 IEEE Conf. Netw. Funct. Virtualization Softw. Defined Netw. (NFV-SDN), pp. 93–99. Cited by: §II-B.
  • [22] A. Rkhami, Y. Hadjadj-Aoul, and A. Outtagarts (2021) Learn to improve: a novel deep reinforcement learning approach for beyond 5G network slicing. In Proc. 2021 IEEE 18th Annu. Consum. Commun. Netw. Conf. (CCNC), pp. 1–6. External Links: Document Cited by: §II-A.
  • [23] M. Shafique, M. Naseer, T. Theocharides, C. Kyrkou, O. Mutlu, L. Orosa, and J. Choi (2020-Apr.) Robust machine learning systems: challenges, current trends, perspectives, and the road ahead. IEEE Design & Test 37 (2), pp. 30–57. Cited by: §II-B.
  • [24] F. Slim, F. Guillemin, A. Gravey, and Y. Hadjadj-Aoul (2017) Towards a dynamic adaptive placement of virtual network functions under ONAP. In Proc. 2017 IEEE Conf. on Netw. Function Virtualization Softw. Defined Netw. (NFV-SDN), pp. 210–215. External Links: Document Cited by: §III-C, §V-A2.
  • [25] F. Slim, F. Guillemin, and Y. Hadjadj-Aoul (2018) CLOSE: a costless service offloading strategy for distributed edge cloud. In Proc. 2018 15th IEEE Annu. Cons. Commun. Netw. Conf. (CCNC), pp. 1–6. External Links: Document Cited by: §III-A.
  • [26] P. Sun, J. Lan, J. Li, Z. Guo, and Y. Hu (2021-Jan.) Combining deep reinforcement learning with graph neural networks for optimal VNF placement. IEEE Commun. Letters 25 (1), pp. 176–180. Cited by: §II-B.
  • [27] R. S. Sutton and A. G. Barto (2015) Reinforcement learning: an introduction. MIT press, Cambridge, MA, USA. Cited by: §I.
  • [28] H. Wang, Y. Wu, G. Min, J. Xu, and P. Tang (2019-Sep.) Data-driven dynamic resource scheduling for network slicing: a deep reinforcement learning approach. Inf. Sci. 498, pp. 106–116. Cited by: §I, §II-A.
  • [29] Y. Xiao, Q. Zhang, F. Liu, J. Wang, M. Zhao, Z. Zhang, and J. Zhang (2019) NFVdeep: adaptive online service function chain deployment with deep reinforcement learning. In Proc. 2019 IEEE/ACM 27th Int. Symp. Qual. Service (IWQoS), pp. 1–10. Cited by: §I, §II-A.
  • [30] Z. Yan, J. Ge, Y. Wu, L. Li, and T. Li (2020-Jun.) Automatic virtual network embedding: a deep reinforcement learning approach with graph convolutional networks. IEEE J. Sel. Areas Commun. 38 (6), pp. 1040–1057. Cited by: §I, §II-A, §IV-B1, §IV-B1.
  • [31] H. Yao, X. Chen, M. Li, P. Zhang, and L. Wang (2018-Apr.) A novel reinforcement learning algorithm for virtual network embedding. Neurocomputing 284, pp. 1–9. External Links: ISSN 0925-2312 Cited by: §I, §II-A.