Cloud computing and caching capabilities at the edge, together with Artificial Intelligence (AI), are two of the key enablers of next-generation mobile wireless communication systems, namely Beyond 5G (B5G) networks.
One of the advantages of introducing intelligence at the edge is the ability to undertake and execute dynamic Radio Access Network (RAN) slicing. Network slicing is one of the major solutions for managing and integrating diverse applications with competing requirements. While slicing is already well established in the core segment of current 5G networks, RAN slicing is less mature and more challenging, mostly due to the nature of the wireless channel, which is unpredictably variable and prone to severe multi-user interference. The handling of ultra-reliable low-latency traffic, for instance, is a specific example of the challenges faced by RAN slicing. With a conventional approach, these challenges can be met only at the cost of over-provisioning RAN resources, which are generally precious and scarce, and at the risk of disrupting other types of traffic. In B5G networks, the Edge Intelligence (EI) paradigm is de facto required to meet the latency constraints, since not only will data be processed in Multi-access Edge Computing (MEC) servers, but strategic scheduling decisions will also be taken at the edge.
In this context, Reinforcement Learning (RL) and Deep Reinforcement Learning (DRL) appear particularly suited to RAN slicing and Radio Resource Management (RRM) optimization, two problems whose optimum is very difficult to find, due to the non-convex nature of resource allocation, and whose structure is only partially known. At the same time, the roles of the different stakeholders involved should be preserved. Specifically, the owner of the infrastructure should not be aware of third parties’ most valuable information, while the latter will likely have only a partial understanding of the underlying RAN information; e.g., they are not authorized to know in detail the procedures and algorithms implemented in the infrastructure or the internal functioning of the network equipment. In this paper, we envision a virtualized control platform in which both the owners of the network infrastructure and third parties interact to enforce network slices in the RAN. Finally, focusing on the autonomous driving use case, we demonstrate the feasibility, features, and performance of the proposed architecture through computer simulations.
The remainder of this paper is organized as follows. Section II briefly introduces the RAN slicing problem and provides the rationale for using a reinforcement learning approach in the considered setting. Next, Section III describes the proposed reference architecture enabling dynamic RAN slicing at the edge by means of a DRL algorithm. Section IV presents a performance evaluation, discussing results and comparisons with alternative approaches. Finally, conclusions are drawn in Section V.
II RAN Slicing and Reinforcement Learning
A prominent feature of EI is RAN slicing: by exploiting the functionalities and capabilities of MEC resources, it is possible to partition the radio infrastructure into orthogonal logical segments. Each network segment, or slice, provides a different service with its own Service Level Agreement (SLA) and the corresponding Quality of Service (QoS) requirements. Accordingly, the design of each slice is service-based, as it is driven by the requirements of a particular service. In the Core Network (CN), slicing allows the creation of segments with their own control- and data-plane functionalities, which are programmable and auto-configurable. Typically, the slice tenants (TNTs), i.e., customers from vertical industries, view the underlying infrastructure as a virtualized entity that they at least partially control and can configure and operate independently. In this model, the Infrastructure Provider (IP) owns the resources employed, while the TNT is allowed to use those resources, install its own applications, hold its own data, and enforce its preferred security policies. In the RAN, slicing is based on the virtualization of radio resources and leverages software-defined networking (SDN) and network function virtualization (NFV) to create different end-to-end virtual networks over the same physical infrastructure. Owing to the intrinsically shared and unpredictable nature of wireless resources, bringing to the RAN the same slicing attributes available in the CN is a complex task: RAN slicing is a less mature and more challenging practice, which involves several RRM functionalities, e.g., spectrum planning, interference coordination, packet scheduling, and admission control. The different solutions achieve different trade-offs between isolation and optimized resource utilization, and between static and dynamic slice creation.
For example, a static partitioning of the radio resources is not adequate to realize the full vision of RAN slicing, as shown in the work that introduced the concept of a 5G network slice broker, designed to enable new players to dynamically request and lease resources from infrastructure providers via well-defined interfaces.
The advent of SDN and NFV also promotes AI-based resource allocation, management, and orchestration, leading toward full network automation, since large-scale data acquisition has become considerably easier than before. To devise an efficient slice enforcement strategy at the RAN level, two different prediction problems must be jointly addressed: (i) prediction of the incoming aggregate per-slice traffic; (ii) prediction of the bandwidth efficiency at the radio link level. As for the first problem, solutions to anticipate future offered loads in mobile networks have been extensively studied in recent years, following different approaches [11, 12, 13, 14, 15, 16, 17]. For instance, one of these works presents a supervised deep learning approach for estimating the aggregate slice traffic at each data center and proposes an ad-hoc cost function that strikes a good trade-off between resource over-provisioning and service request denial. However, traffic estimation is only a partial step toward solving the optimal slice resource allocation problem. Indeed, even with perfect traffic estimation, finding the optimal RRM setting is very difficult owing to the random nature of the radio conditions. Optimal RRM is generally formulated as a non-convex optimization problem, whose solution requires optimization tools with unmanageable computational complexity. Alternatively, Machine Learning (ML) in general, and RL in particular, has recently been investigated as an effective, low-complexity solution for RRM in communication and computing systems. In the RL framework, an agent can generate (near-)optimal control actions on the basis of the immediate reward feedback obtained from its interactions with the environment. Rather than simply optimizing the current reward in a greedy manner, the RL agent can take a long-term goal into account, which is essential in time-varying dynamic systems. Accordingly, RL appears particularly suited to RRM problems when the following conditions hold:
The optimum is unknown or very difficult to know, and only a reward associated with a given policy is available.
The environment can be modeled as a Markov Decision Process (MDP), where, given a state that can be fully or partially observed by the agent, the reward depends deterministically or stochastically on the action taken by the agent.
The MDP model is not known, or is only partially known, by the agent; e.g., the reward/loss function cannot be expressed in closed form as a differentiable function of the allocation decisions, i.e., the actions.
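To make the long-term objective concrete, the minimal Python sketch below computes the discounted return that an RL agent maximizes instead of the immediate reward. The reward sequences and the discount factor are arbitrary illustrative values, not quantities used in this paper.

```python
def discounted_returns(rewards, gamma):
    """Discounted return G_t = sum_k gamma**k * r_{t+k} for every step t of an episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Sweep backwards so that each return reuses the next one: G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# A greedy agent compares only the first rewards; a far-sighted agent compares G_0.
greedy = [1.0, 0.0, 0.0]   # large immediate reward, nothing afterwards
patient = [0.5, 1.0, 1.0]  # smaller immediate reward, better future rewards
print(discounted_returns(greedy, 0.5)[0])   # 1.0
print(discounted_returns(patient, 0.5)[0])  # 1.25
```

With a discount factor of 0.5, the far-sighted sequence is preferred even though its first reward is smaller, whereas a greedy agent would pick the other one.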
All these conditions hold in the scenario considered in this paper, which highlights the potential of RL- and DRL-based EI for RAN slicing. Moreover, unlike the existing studies [11, 12, 18, 19], we specifically consider the openness of the network to third parties, thus encouraging TNTs to take (partial) control of the resources without deploying their own infrastructure.
III The Proposed Architecture for RAN Slicing with EI
We propose an architecture for RAN slicing that exploits EI to serve mission-critical applications. To this end, in line with the current state of the art, we consider a scenario with a single IP that leases part of its network resources to create and manage specific slices for a set of independent TNTs (e.g., mobile network operators) to realize advanced network services.
The IP determines the amount of resources that the TNTs can use for each slice. The TNTs, in turn, should adapt their requests in real time to their own users’ requirements, avoiding the expense of overbuying resources. Consequently, both the generation of slice requests, i.e., when a TNT communicates its required slice configuration to the IP, and the dynamic slice enforcement, i.e., the adaptation of the slice allocation policy to the time-varying RAN environment, result from local optimization problems that must be solved in real time and whose decisions must be executed immediately to minimize system latency.
Fig. 1 shows the reference architecture. The general network scenario consists of a RAN and a Core Network. In the RAN, multiple base stations (BSs) grouped in clusters provide coverage to the service area. A RAN controller for managing radio resources is located close to each of these clusters, together with a MEC platform, which provides the computational and storage resources required to implement functions with stringent real-time requirements.
The controller dynamically performs slicing operations at the RAN layer, i.e., it decides which slice creation requests can be admitted, computes a slicing policy to allocate the available resources to the admitted slices, and enforces the slicing policy on the underlying physical RAN. The network slices are instantiated through the interaction of two different entities, the IP Subsystem and the TNT Subsystem. The IP owns this virtualized control platform and dynamically leases computing and storage resources to TNTs to virtualize specific applications on the MEC. Moreover, the IP Subsystem is in charge of creating different RAN slices by leveraging the RAN controller, according to the directives coming from the TNT subsystems. To this end, specific APIs are provided for the submission of slice requests. Hence, the main role of the IP is to enforce each slice request by allocating the required resources at the RAN layer through commands executed in the involved BSs, e.g., bandwidth reservation, number of BSs, and interference coordination strategies among cells to ensure inter-slice isolation. The TNT subsystem essentially generates the slice requests, which include general information (e.g., the type of service to be provided and the duration of the slice) as well as high-level control information needed to meet the requirements of the related slice. Requests are then sent to the IP Subsystem through the provided APIs. Hence, the IP must decide in advance the amount of resources that will remain assigned to a slice until the next reallocation takes place.
Since in this paper we focus on a mission-critical scenario, it is reasonable to assume that the TNT slice resource allocation requests are always accepted by the IP, i.e., neither an admission control nor a resource allocation negotiation policy is enforced. Nevertheless, the IP can adopt pay-for-what-you-get pricing policies to discourage TNTs from over-provisioning. As for the specific RAN slicing enforcement strategy, we propose dynamic RAN slicing at the Inter-Cell Interference Coordination (ICIC) level, which has been shown to provide high radio-electrical and traffic isolation from the other slices, one of the fundamental requirements of mission-critical services. Specifically, each slice is assigned a given pool of radio resources across a cluster of interfering cells in a given service area. The number of resource blocks (RBs) is dynamically determined and requested by the TNT based on the information it has access to. This scenario is in line with the general ambition of network slicing to open new business models for all interested parties while preserving the roles of the different stakeholders involved. As a consequence, the interaction among the participants must be regulated by APIs, through which TNTs cannot access the internal functioning of the devices provided by the IP or the procedures implemented therein, while the IP should not be aware of the TNTs’ most valuable information.
The SLA between the latency-sensitive TNT and the IP specifies a unit cost for each bandwidth resource and a maximum amount of bandwidth that can be used in each cell. At the same time, based on collected statistics, the IP may choose to allocate the reserved bandwidth left unused by the TNT to outage-tolerant slices.
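The SLA terms and the no-admission-control assumption described above can be summarized with the following Python sketch of the IP-side handling of a request. The class and function names, the pricing unit, and the choice to return the spare bandwidth explicitly are hypothetical and serve only as illustration; they do not describe an actual IP interface.

```python
from dataclasses import dataclass


@dataclass
class SliceSLA:
    """Hypothetical per-cell SLA terms agreed between the latency-sensitive TNT and the IP."""
    max_bandwidth_hz: float  # maximum bandwidth the TNT may use in the cell
    unit_cost: float         # price per Hz of granted bandwidth (illustrative unit)


@dataclass
class Grant:
    tnt_bandwidth_hz: float    # bandwidth reserved for the mission-critical slice
    charge: float              # pay-for-what-you-get: only the granted part is billed
    spare_bandwidth_hz: float  # reserved bandwidth the IP may lend to outage-tolerant slices


def enforce_request(sla: SliceSLA, requested_fraction: float) -> Grant:
    """Admit the request unconditionally (no admission control or negotiation),
    cap it at the SLA maximum, and bill only the granted bandwidth."""
    fraction = min(max(requested_fraction, 0.0), 1.0)
    granted = fraction * sla.max_bandwidth_hz
    return Grant(
        tnt_bandwidth_hz=granted,
        charge=granted * sla.unit_cost,
        spare_bandwidth_hz=sla.max_bandwidth_hz - granted,
    )


grant = enforce_request(SliceSLA(max_bandwidth_hz=10e6, unit_cost=2.0), 0.5)
print(grant.tnt_bandwidth_hz, grant.spare_bandwidth_hz)  # 5000000.0 5000000.0
```

Because every request is accepted, the only incentive against over-requesting in this sketch is the charge, which grows linearly with the granted bandwidth.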
Given this architecture, the applicability of RL is evident. Indeed:
the scenario is a classical MDP in which the environment is the cellular system and the reward is the efficiency of resource utilization subject to QoS constraints; the reward depends on the action, i.e., the bandwidth allocated to mission-critical slices, as well as on the radio conditions of the nodes and the amount of incoming traffic, which together represent the state of the environment;
the optimal RRM solution cannot be known because of the non-convex nature of the problem and because the TNTs have only partial knowledge of the underlying RAN information;
the reward that is of interest for the TNT is often a QoS parameter, e.g., the latency and the packet loss ratio for mission-critical users, whose relationship with the allocation decision, e.g., the amount of allocated spectrum, is very hard to establish.
In this setting, the role of the RL agent of each TNT Subsystem is to reserve the minimum amount of bandwidth in each cell that satisfies its QoS requirements, so as to avoid resource over-provisioning. The TNT issues its bandwidth allocation requests, expressed as a fraction of the maximum available bandwidth, once per fixed allocation period (AP). It is worth noting that the details of the radio interface, e.g., the adopted numerology, the scheduling policy, the packet fragmentation rules, and so on, are entirely handled by the IP and are not known by the TNT agents, which have only limited knowledge of the radio link conditions of their users.
As illustrated in Fig. 2, a TNT agent is trained with the Deep Deterministic Policy Gradient (DDPG) algorithm, which is known to be suitable for continuous state and action spaces. As an actor-critic method, a DDPG agent concurrently learns a Q-function and a policy that maximizes the long-term reward. DDPG exploits the Bellman equation to learn the Q-function (critic) and then uses the Q-function to learn the policy (actor). It is a model-free algorithm, because the agent does not learn a model of the environment and thus cannot predict future states without taking actions. It is also an off-policy method, because the deterministic policy whose Q-function is being learned differs from the noisy behavior policy used to explore the environment.
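The DDPG ingredients named above can be sketched in a few NumPy functions: the replay buffer that makes learning off-policy, the Bellman target used to fit the critic, the soft (Polyak) update of the target networks, and a noisy behavior policy clipped to the action range [0.1, 0.9] adopted in the performance evaluation. This is a minimal sketch under stated assumptions, not the agent used in the paper: the actor and critic networks are omitted, and Gaussian exploration noise is assumed for simplicity (the original DDPG formulation used Ornstein-Uhlenbeck noise).

```python
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    """Fixed-size memory of past transitions. Training the critic on samples drawn
    from it, rather than only on the latest transition, lets DDPG learn off-policy."""

    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        batch = random.sample(list(self.buffer), batch_size)
        # Regroup the tuples column-wise: states, actions, rewards, next_states, dones.
        return tuple(np.array(column) for column in zip(*batch))


def critic_targets(rewards, next_q_target, dones, gamma):
    """Bellman targets y = r + gamma * (1 - done) * Q'(s', mu'(s')), where next_q_target
    holds the target critic evaluated at the target actor's action."""
    return rewards + gamma * (1.0 - dones) * next_q_target


def soft_update(target_params, online_params, tau):
    """Polyak averaging theta' <- tau * theta + (1 - tau) * theta' for the target networks."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online_params, target_params)]


def explore(actor_output, noise_std, rng, low=0.1, high=0.9):
    """Behavior policy: deterministic actor output plus Gaussian noise, clipped to [low, high]."""
    return float(np.clip(actor_output + rng.normal(0.0, noise_std), low, high))
```

In a complete agent, `next_q_target` would come from a target critic evaluated at the target actor's action, and `soft_update` would typically be applied to the network weights after each gradient step with a small `tau`.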
The action is the amount of bandwidth requested from the IP in every AP. The observations (or state) available at the TNT consist of RAN-related Key Performance Indicators (KPIs) as well as traffic information. Such observations can be either accessed directly by the TNT (e.g., the aggregate slice traffic) or passed by the IP through the TNT-IP APIs. Observations collected during a given AP are used to determine the bandwidth requested for the next AP. The reward should then account for both the bandwidth saved by the TNT with respect to the maximum bandwidth and some QoS indicators.
The amount of bandwidth is thus dynamically determined by the TNT based on the available observations (state), with the goal of maximizing the discounted future reward.
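The timing described above, in which observations collected during one AP determine the request for the next one, can be sketched as the following Python loop. The callables `policy` and `simulate_ap` are hypothetical placeholders for the trained agent and for the IP plus RAN, respectively.

```python
from typing import Callable, List, Sequence, Tuple


def run_episode(
    policy: Callable[[Sequence[float]], float],
    simulate_ap: Callable[[float], Tuple[Sequence[float], float]],
    initial_obs: Sequence[float],
    num_aps: int,
) -> Tuple[List[float], List[float]]:
    """Hypothetical TNT-side loop with one bandwidth request per allocation period (AP).

    The observation gathered during one AP drives the request for the next AP.
    `simulate_ap(fraction)` stands in for the IP and the RAN: it enforces the
    requested fraction for one AP and returns (KPIs observed in that AP, reward).
    """
    obs = initial_obs
    requests, rewards = [], []
    for _ in range(num_aps):
        fraction = policy(obs)               # action: fraction of the maximum bandwidth
        obs, reward = simulate_ap(fraction)  # becomes the state for the next request
        requests.append(fraction)
        rewards.append(reward)
    return requests, rewards
```

For example, calling `run_episode` with a hypothetical agent and simulator and `num_aps=60` would produce one minute of requests with the 1 s AP used in the performance evaluation.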
IV Performance Evaluation
We evaluate the performance of a DRL TNT agent in a realistic environment where the MDP is provided by a discrete-event simulator of a cellular system developed in MATLAB. A single TNT subsystem is considered, which is assumed to provide an autonomous driving service. Without loss of generality, we focus on a single-cell scenario, i.e., we do not consider the effect of inter-cell interference. Nevertheless, the proposed framework relies on the capability of the DRL agent to learn how the mutual interactions of the involved nodes determine the actual system performance; accordingly, it naturally extends to a multi-cell scenario, provided that the state variables include some interference-related parameters, e.g., the mutual positions of the nodes. In the following, we focus on the downlink; similar considerations and results can be obtained for the uplink.
Our scenario contains a single macro base station and a TNT subsystem that provides autonomous driving services. In the considered setting, vehicles use their own sensors (e.g., HD camera, LiDAR), as well as sensor information from other vehicles, to perceive the environment and obtain a 3D model of the world around them. The main QoS requirement of the slice is a maximum experienced packet delay of 5 ms, which is half the maximum latency envisioned for High Definition Sensor Sharing, one of the main autonomous driving use cases. The packet length is assumed to be fixed and equal to 32 bytes (as per the Urban Macro–URLLC usage scenario). The number of slice subscribers, i.e., the autonomous vehicles, is modeled according to real mobility traces, and on average approximately 25 different vehicles are present in the cell. Channel modeling considers path loss and lognormal shadowing. The data rate of each active link is derived from the Shannon capacity formula. The TNT is allocated a maximum bandwidth of 10 MHz, organized into slots of 1 ms, according to the 5G NR numerology with 15 kHz subcarrier spacing. The MAC scheduling strategy enforced by the IP is Throughput to Average (TTA) scheduling, which guarantees a minimum level of service to every user and hence achieves a high fairness index. The AP is 1 s, i.e., the DDPG agent performs an action every second.
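The link-level abstractions mentioned above can be sketched in Python as follows: a per-user SNR with log-distance path loss and lognormal shadowing, the Shannon rate, and a TTA allocation of resource blocks. The path-loss model and all numeric defaults (transmit power, noise level, shadowing standard deviation) are illustrative assumptions, not the settings of the MATLAB simulator. The TTA metric follows its usual definition, i.e., the achievable rate on a resource block divided by the user's achievable rate over the whole band, here approximated by the sum of its per-RB rates. TTA only differentiates users when rates vary across RBs: with a frequency-flat channel, every user's metric equals one over the number of RBs, which is why the example uses a small hand-written rate matrix.

```python
import numpy as np


def snr_linear(distance_m, rng, tx_power_dbm=46.0, noise_dbm=-104.0,
               pl_ref_db=128.1, pl_exp=3.76, shadow_std_db=8.0):
    """Received SNR with log-distance path loss and lognormal shadowing.
    All default parameters are illustrative assumptions, not the paper's settings."""
    distance_m = np.asarray(distance_m, dtype=float)
    path_loss_db = pl_ref_db + 10.0 * pl_exp * np.log10(distance_m / 1000.0)
    shadowing_db = rng.normal(0.0, shadow_std_db, size=distance_m.shape)  # Gaussian in dB
    snr_db = tx_power_dbm - path_loss_db - shadowing_db - noise_dbm
    return 10.0 ** (snr_db / 10.0)


def shannon_rate(bandwidth_hz, snr):
    """Shannon capacity B * log2(1 + SNR) in bit/s (snr in linear scale)."""
    return bandwidth_hz * np.log2(1.0 + snr)


def tta_schedule(rates_per_rb):
    """Throughput-to-Average (TTA) allocation.

    rates_per_rb[u, k] is the achievable rate of user u on resource block k. Each RB goes
    to the user maximizing (rate on that RB) / (achievable rate over the whole band), so
    users with poor channels still win the RBs that are relatively best for them.
    """
    wideband = rates_per_rb.sum(axis=1, keepdims=True)  # achievable rate over all RBs
    return np.argmax(rates_per_rb / wideband, axis=0)    # winning user index per RB


# Two users, two RBs: user 0 is stronger on both, yet user 1 gets its relatively best RB.
rates = np.array([[10.0, 8.0],
                  [1.0, 2.0]])
print(tta_schedule(rates))  # [0 1]
```

A pure max-rate scheduler would give both RBs to user 0 in this example; the normalization by the wideband rate is what lets TTA keep serving weaker users.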
The action, which is a continuous value between 0.1 and 0.9 (i.e., 10% and 90%), is the fraction of the maximum bandwidth requested from the IP in every AP. The state is characterized by the maximum buffer occupancy experienced by the users of the slice, the total aggregate traffic of the slice, and the list of the worst channel quality indicators (CQIs) of the served users, averaged over the AP. The reward is computed as:
where is the ratio between the amount of bandwidth the TNT requests and the maximum bandwidth of 10 MHz. In other words, the less the bandwidth requested by the TNT, the higher the reward.
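A possible way to assemble the observation described above and to keep the action within [0.1, 0.9] is sketched below in Python. The number of worst CQIs kept in the state and the sigmoid squashing of the actor output are assumptions made for illustration; the reward is not part of the sketch.

```python
import numpy as np


def build_state(buffer_occupancy, aggregate_traffic, cqi_samples, num_worst=3):
    """Assemble the TNT observation for one AP.

    buffer_occupancy:  per-user buffer occupancy observed during the AP
    aggregate_traffic: total traffic of the slice in the AP
    cqi_samples:       CQI reports during the AP, shape (num_reports, num_users)
    num_worst:         hypothetical number of worst users kept in the state
    """
    mean_cqi = np.asarray(cqi_samples, dtype=float).mean(axis=0)  # per-user CQI averaged over the AP
    worst = np.sort(mean_cqi)[:num_worst]                         # lowest averaged CQIs first
    return np.concatenate(([np.max(buffer_occupancy), aggregate_traffic], worst))


def to_action(raw_actor_output, low=0.1, high=0.9):
    """Squash an unbounded actor output into the admissible range [low, high]."""
    return low + (high - low) / (1.0 + np.exp(-raw_actor_output))
```

A fixed `num_worst` keeps the state vector length constant, which is what a fully connected actor network expects, as long as the cell serves at least that many users.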
As is customary in machine learning, the mobility traces used during the training phase differ from those used during the testing phase. This makes it possible to assess the ability of the proposed DRL approach to generalize its control strategy to unseen traffic and radio channel conditions.
Fig. 3 shows the running average (with a window length of 100 episodes) of the reward during the training process of the agent. The figure shows that the proposed DRL approach converges to a bandwidth occupancy of around 35%, leaving 65% of the bandwidth available for other uses. During an initial exploration phase, the agent fails to meet the QoS requirement, hence the low reward. After approximately 1200 training episodes, the reward begins to grow, as the agent has learned how to satisfy the latency constraint.
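The running average used in Fig. 3 can be computed with a simple moving-average filter; the Python sketch below assumes a plain list of per-episode rewards and returns one value per complete 100-episode window.

```python
import numpy as np


def running_average(episode_rewards, window=100):
    """Sliding-window mean of episode rewards; one value per full window."""
    rewards = np.asarray(episode_rewards, dtype=float)
    if rewards.size < window:
        return np.empty(0)  # not enough episodes for a single full window
    return np.convolve(rewards, np.ones(window) / window, mode="valid")
```

Using `mode="valid"` avoids the artificially low values that zero padding would produce at the start of training.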
The following figures are obtained by running the agent trained at the end of the learning phase on the test dataset.
Fig. 4 shows the probability density function of the bandwidth requested by the DRL agent of the TNT, computed over 10000 independent simulations. On the one hand, it demonstrates that the agent has learned to take a variety of actions, i.e., to adapt dynamically to the environment. On the other hand, it shows that the agent usually requests as little bandwidth as possible, which indicates a well-engineered reward function.
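The density discussed above can be estimated from the logged actions with a normalized histogram, as in the following Python sketch; the number of bins is an arbitrary assumption.

```python
import numpy as np


def empirical_pdf(requested_fractions, num_bins=40, low=0.1, high=0.9):
    """Normalized histogram of the requested bandwidth fractions (integrates to 1 over [low, high])."""
    density, edges = np.histogram(requested_fractions, bins=num_bins,
                                  range=(low, high), density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])  # bin midpoints for plotting
    return centers, density
```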
To better demonstrate the benefits of DRL, we compare the simulation results with the following methods:
Fixed Allocation, in which the TNT always requests the same amount of bandwidth;
Heuristic strategy, which relies on a perfect (i.e., ideal) prediction of the incoming traffic; at each step, the requested bandwidth is directly proportional to the incoming traffic;
Optimum allocation, in which, at each step, the minimum bandwidth that fulfills the slice QoS requirements is determined through iterative adjustment. Clearly, this approach is infeasible in a real system, although it can be easily simulated.
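The two non-trivial baselines can be sketched in Python as follows (the fixed allocation simply returns a constant). The proportionality constant of the heuristic and the QoS oracle of the optimum are hypothetical; bisection is used here as one possible form of the iterative adjustment and assumes that the QoS is monotone in the allocated bandwidth.

```python
from typing import Callable

LOW, HIGH = 0.1, 0.9  # admissible bandwidth fractions, as for the DRL agent


def heuristic_request(predicted_traffic: float, scale: float) -> float:
    """Request proportional to the (ideally predicted) incoming traffic.
    `scale` is a hypothetical proportionality constant, e.g., tuned offline."""
    return min(max(scale * predicted_traffic, LOW), HIGH)


def optimum_request(qos_satisfied: Callable[[float], bool], tol: float = 1e-3) -> float:
    """Smallest fraction meeting the slice QoS, found by bisection.

    qos_satisfied(fraction) is a hypothetical oracle that replays the AP in the simulator
    with that bandwidth, which is why this baseline cannot run in a real system.
    Bisection assumes the QoS check is monotone in the allocated bandwidth.
    """
    if qos_satisfied(LOW):
        return LOW
    if not qos_satisfied(HIGH):
        return HIGH  # even the largest admissible request violates the QoS
    lo, hi = LOW, HIGH  # invariant: QoS fails at lo and holds at hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if qos_satisfied(mid):
            hi = mid
        else:
            lo = mid
    return hi
```

Returning `hi` rather than the midpoint guarantees that the reported allocation always satisfies the QoS check, at the cost of at most `tol` extra bandwidth.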
Fig. 5 shows the bandwidth requested by the TNT during a representative test episode. It shows that the agent learned to request an amount of bandwidth close to the optimum using only the state variables, i.e., it dynamically adapts to the changes occurring in the environment. More importantly, the proposed DRL solution outperforms the heuristic approach: even an accurate prediction of the incoming traffic is not sufficient to guarantee an optimal bandwidth request, since such a decision must also take into account what actually happens in the RAN. For instance, the incoming traffic grows substantially after 60 s. However, it is reasonable to assume that the overall radio channel conditions improve as well, so it is not strictly necessary to request more bandwidth.
Fig. 6 shows the bandwidth requested by the TNT to ensure a certain level of QoS availability, i.e., the probability that the main QoS requirement is satisfied. Specifically, the actions taken by both the trained DRL agent and the heuristic are weighted with different factors to obtain different behaviors. The results are averaged over 10000 independent simulations. Most notably, the proposed DRL mechanism always outperforms the other strategies. Although the requested bandwidth always grows as the QoS availability requirement becomes more stringent, the DRL agent requests up to 50% less bandwidth than the fixed allocation. Moreover, the variations of the bandwidth requested by the TNT DRL agent are much smaller, confirming that the agent learned a near-optimal allocation strategy from the limited information available.
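The QoS availability and the weighting of the actions can be sketched in Python as below, assuming that availability is measured as the fraction of APs in which the maximum packet delay stays within the 5 ms budget and that the weighting is a multiplicative factor; both are interpretations made for illustration. Sweeping the weight and recording the resulting availability and mean requested bandwidth would trace one curve of the kind shown in Fig. 6.

```python
import numpy as np


def qos_availability(max_delays_ms, delay_budget_ms=5.0):
    """Empirical probability that the main QoS requirement (max packet delay) is met."""
    return float(np.mean(np.asarray(max_delays_ms, dtype=float) <= delay_budget_ms))


def weighted_request(action, weight, low=0.1, high=0.9):
    """Scale a request by a hypothetical multiplicative weight and keep it admissible."""
    return float(np.clip(weight * action, low, high))
```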
V Conclusions and Open Challenges
In this paper, we presented a novel architecture in which the Infrastructure Provider and tenants interact to enforce network slices in the next-generation RAN. Specifically, it exploits Deep Reinforcement Learning at the edge to support effective enforcement of RAN slicing, where tenants are encouraged to take control despite having only a partial understanding of the underlying RAN status. Furthermore, focusing on the autonomous-driving use case, we investigated the effectiveness of our proposal against baseline methodologies through computer simulations. Results confirm that predicting the incoming aggregate per-slice traffic is not sufficient for an effective RAN slice enforcement strategy and that resource over-provisioning is remarkably inefficient. Although the proposed solution proves successful, several open challenges remain to be addressed in future work. For instance, cutting-edge methodologies (e.g., transfer learning) should be used to develop flexible and interoperable software agents that guarantee reconfigurability and continuous deployment. Moreover, the privacy concerns of users and regulatory bodies should also be taken into account. Finally, providing sufficient computing resources for running AI at the edge in an economically sustainable way is one of the fundamental challenges to be addressed to properly prepare for the advent of EI.
- Z. Zhou, X. Chen, E. Li, L. Zeng, K. Luo, and J. Zhang, “Edge Intelligence: Paving the Last Mile of Artificial Intelligence with Edge Computing,” Proceedings of the IEEE, 2019.
- X. Foukas, G. Patounas, A. Elmokashfi, and M. K. Marina, “Network Slicing in 5G: Survey and Challenges,” IEEE Communications Magazine, vol. 55, no. 5, pp. 94-100, May 2017.
- S. Martiradonna, A. Abrardo, M. Moretti, G. Piro, and G. Boggia, “Architecting RAN Slicing for URLLC: Design Decisions and Open Issues,” in Proc. of IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications (DS-RT), Cosenza, Italy, Oct. 2019.
- S. E. Elayoubi, S. B. Jemaa, Z. Altman, and A. Galindo-Serrano, “5G RAN Slicing for Verticals: Enablers and Challenges,” IEEE Communications Magazine, vol. 57, no. 1, pp. 28-34, Jan. 2019, doi: 10.1109/MCOM.2018.1701319.
- R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: A Bradford Book, 2018.
- Zhou et al., “Network Slicing as a Service: Enabling Enterprises’ Own Software-Defined Cellular Networks,” IEEE Communications Magazine, vol. 54, no. 7, pp. 146-153, Jul. 2016.
- O. Sallent, J. Perez-Romero, R. Ferrus, and R. Agusti, “On Radio Access Network Slicing from a Radio Resource Management Perspective,” IEEE Wireless Communications, Oct. 2017.
- K. Samdanis, X. Costa-Perez, and V. Sciancalepore, “From Network Sharing to Multi-Tenancy: The 5G Network Slice Broker,” IEEE Communications Magazine, Jul. 2016.
- O. U. Akgul, I. Malanchini, and A. Capone, “Dynamic Resource Trading in Sliced Mobile Networks,” IEEE Transactions on Network and Service Management, vol. 16, no. 1, pp. 220-233, Mar. 2019.
- S. D’Oro, F. Restuccia, and T. Melodia, “Toward Operator-to-Waveform 5G Radio Access Network Slicing,” IEEE Communications Magazine, Apr. 2020.
- A. Y. Nikravesh et al., “An Experimental Investigation of Mobile Network Traffic Prediction Accuracy,” Services Transactions on Big Data, vol. 3, no. 1, pp. 1-16, Jan. 2016.
- D. Bega, M. Gramaglia, M. Fiore, A. Banchs, and X. Costa-Perez, “DeepCog: Optimizing Resource Provisioning in Network Slicing with AI-based Capacity Forecasting,” IEEE Journal on Selected Areas in Communications, Feb. 2020.
- F. Xu et al., “Big Data Driven Mobile Traffic Understanding and Forecasting: A Time Series Approach,” IEEE Transactions on Services Computing, vol. 9, no. 5, pp. 796-805, Sep. 2016.
- M. Zhang et al., “Understanding Urban Dynamics from Massive Mobile Traffic Data,” IEEE Transactions on Big Data, vol. 5, no. 2, pp. 266-278, Nov. 2017.
- R. Li et al., “The Prediction Analysis of Cellular Radio Access Network Traffic: From Entropy Theory to Networking Practice,” IEEE Communications Magazine, vol. 52, no. 6, pp. 234-240, Jun. 2014.
- C. Zhang et al., “Long-Term Mobile Traffic Forecasting Using Deep Spatio-Temporal Neural Networks,” in Proc. of ACM MobiHoc, Los Angeles, CA, USA, Jun. 2018, pp. 231-240.
- C. Gutterman et al., “RAN Resource Usage Prediction for a 5G Slice Broker,” in Proc. of ACM MobiHoc, Catania, Italy, Jul. 2019.
- H. Ye, G. Y. Li, and B.-H. F. Juang, “Deep Reinforcement Learning Based Resource Allocation for V2V Communications,” IEEE Transactions on Vehicular Technology, vol. 68, no. 4, Apr. 2019.
- F. B. Mismar, B. L. Evans, and A. Alkhateeb, “Deep Reinforcement Learning for 5G Networks: Joint Beamforming, Power Control, and Interference Coordination,” IEEE Transactions on Communications, vol. 68, no. 3, Mar. 2020.
- 5GAA Automotive Association, “C-V2X Use Cases: Methodology, Examples and Service Level Requirements,” 2019.
- International Telecommunication Union Radiocommunication Sector (ITU-R), “Guidelines for Evaluation of Radio Interface Technologies for IMT-2020,” Report M.2412, Nov. 2017. [Online]. Available: https://www.itu.int/pub/R-REP-M.2412
- M. Piorkowski, N. Sarafijanovic-Djukic, and M. Grossglauser, “CRAWDAD dataset epfl/mobility (v. 2009-02-24),” downloaded from https://crawdad.org/epfl/mobility/20090224, doi: 10.15783/C7J010, Feb. 2009.