Routing is one of the most challenging problems in IP packet networking; the general form of its optimization can be cast as the NP-complete multi-commodity flow problem [1]. The key factors contributing to routing complexity are the large number of concurrent demands, each with its own desired QoS constraints, and the limited shared resources, namely the number of links and their finite capacities. In large-scale networks with high-dimensional end-to-end QoS constraints, the routing algorithm must incorporate additional attributes such as delay, throughput, packet loss, network topology and other traffic engineering (TE) factors. On top of that, the recent transformation of IP networking towards a virtualized architecture and an autonomous control plane has further increased the routing challenge by introducing more dynamism into network operation. Traffic profiles in such scenarios change on much smaller time-scales, demanding significantly shorter reaction times. This raises the cost of path computation considerably and makes it harder for a central controller to compute all real-time routes in a large-scale network.
Today, source routing methods such as segment routing [2], in conjunction with a path computation element (PCE) [3], provide much-needed capabilities to resolve many of these routing challenges by allowing intermediate nodes to perform routing decisions, thus offloading the centralized path computation entities of the network. Reliance on central control nodes still stands as an open issue as the network scales, motivating us to study and find a balance between centralized and decentralized operation. In this paper, we propose a hierarchical routing scheme that leverages recent advances in deep reinforcement learning (DRL) to manage the complexities of the routing problem. In order to distribute the route computation load in large-scale networks while taking into account end-to-end performance, we develop a hierarchical QoS-aware per-flow route calculation algorithm. In our approach, the nodes in the network are grouped (clustered) based on specific criteria, such as the latency between nodes, with each group having a designated leader assigned either autonomously or predefined by the network operator. The end-to-end route from a source node to a destination is calculated by the source node with the assistance of group leaders at different hierarchical levels. The group leaders select links based on locally available information, such as link utilizations and delays, while taking into account the global view of the network through the feedback of the source nodes.
In our approach, the stateful computations are done at the source nodes, while the assisting group leaders behave as stateless functions for per-route computation threads. The dynamically computed routes are directly applicable to segment routing for the establishment of packet flows. The group leaders in this regard assist route calculation based on their local network conditions, while the source nodes maintain responsibility for assembling the end-to-end route.
The remainder of this paper is organized as follows. Section II provides a brief overview of RL and the related work applying RL to the routing problem. Section III includes our formulation of the hierarchical routing problem and our DRL-enabled routing algorithm. We evaluate the performance of the proposed approach in Section IV, and finally, concluding remarks are provided in Section V.
II. Background and Related Work
II-A. Background: Reinforcement Learning
Model-free methods enable an agent to learn from experience while interacting with the environment, without prior knowledge of the environment model. The agent at time step $t$ observes the environment state $s_t$, takes an action $a_t$, gets a reward $r_{t+1}$ at time $t+1$, and the state changes to $s_{t+1}$. In Monte Carlo approaches applied to episodic problems, where the experience is divided into episodes such that all episodes reach a terminal state irrespective of the taken actions, the value estimates are only updated when an episode terminates. While the values can be estimated accurately in these approaches, the learning rate can be dramatically slow. In contrast, in temporal-difference (TD) learning the value estimates are updated incrementally as the agent interacts with the environment, without waiting for a final outcome, by bootstrapping. Therefore, TD learning usually converges faster.
In TD learning, the goal of the agent is to find the optimal policy maximizing the cumulative discounted reward, which can be expressed as
$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1},$$
where $\gamma \in [0, 1)$ is the discount factor.
The Q-learning algorithm [4] is one of the off-policy TD control techniques, in which the agent learns the optimal action-value function independently of the policy being followed. In Q-learning, the action-value function is updated as follows:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \, \delta_t,$$
where $\alpha$ is the learning rate. We denote by $y_t$ the TD target, which is given by
$$y_t = r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a),$$
and we denote the TD error by $\delta_t$, which is expressed as follows:
$$\delta_t = y_t - Q(s_t, a_t),$$
which measures the difference between the old estimate and the TD target. Q-learning, however, suffers from an overestimation problem, as the maximization step leads to a significant positive bias that overestimates the actions. Double Q-learning [5] avoids this problem by decoupling the action selection from the action evaluation. The first $Q$-function finds the maximizing action and the second $Q$-function estimates the value of taking this action, which leads to an unbiased estimate of the action-values. Specifically, double Q-learning alternates between updating two $Q$-functions, $Q^A$ and $Q^B$, as follows:
$$Q^A(s_t, a_t) \leftarrow Q^A(s_t, a_t) + \alpha \Big[ r_{t+1} + \gamma \, Q^B\big(s_{t+1}, \arg\max_{a} Q^A(s_{t+1}, a)\big) - Q^A(s_t, a_t) \Big],$$
with the roles of $Q^A$ and $Q^B$ swapped at random between updates.
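As a concrete illustration, the alternating double Q-learning update can be sketched in a few lines of tabular Python; the states, actions and reward used in the demo call are illustrative assumptions, not part of the routing setup.

```python
import random
from collections import defaultdict

def double_q_update(QA, QB, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One double Q-learning step: one table selects the greedy action,
    the other evaluates it; roles swap with probability 0.5."""
    if random.random() < 0.5:
        QA, QB = QB, QA  # swap roles so both tables are updated equally often
    a_star = max(actions, key=lambda a2: QA[(s_next, a2)])  # selection by QA
    target = r + gamma * QB[(s_next, a_star)]               # evaluation by QB
    QA[(s, a)] += alpha * (target - QA[(s, a)])

# Demo: a single update on empty tables with a reward of 1.0.
QA, QB = defaultdict(float), defaultdict(float)
double_q_update(QA, QB, s=0, a=1, r=1.0, s_next=2, actions=[0, 1], alpha=0.5)
```

Because both tables start at zero, the target is 1.0 and whichever table was selected for the update moves halfway toward it.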
In large-scale RL problems, storing and learning the action-value function for all states and all actions is inefficient. In order to tackle this challenge, the action-value function is usually approximated by representing it as a function $Q(s, a; \mathbf{w})$ parameterized by a weight vector $\mathbf{w}$, instead of using a table. This approximated function can be linear in the features of the state, where $\mathbf{w}$ is the feature weights vector, or can be computed, for instance, using a deep neural network (DNN), where $\mathbf{w}$ represents the weights of the neural network. The DQN algorithm [6] extends the tabular Q-learning algorithm by approximating the action-value function and learning a parameterized action-value function instead. Specifically, an online neural network whose input is the state outputs the estimated action-value $Q(s, a; \theta_t)$ for each action $a$ in this state, where $\theta_t$ are the parameters of the neural network at time $t$. In DQN, a target neural network with outdated parameters $\theta^-_t$, copied from the online neural network every $C$ steps, is used to find the target of the RL algorithm. Deep learning algorithms commonly assume the data samples to be independent, and correlated data can significantly slow down the learning [7]. In RL, however, the samples are usually correlated. Therefore, an experience replay memory is used in DQN to break the correlations. Specifically, the replay memory stores the agent experience $e_t = (s_t, a_t, r_{t+1}, s_{t+1})$ at each time step, and a mini-batch of samples is drawn uniformly at random from the replay memory to train the neural network.
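Such a replay memory can be sketched as a plain ring buffer with uniform mini-batch sampling; the capacity and the transition format below are illustrative assumptions.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of (s, a, r, s_next) transitions; the oldest
    entries are evicted automatically once capacity is reached."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation between
        # consecutive interaction steps.
        return random.sample(self.buffer, batch_size)

memory = ReplayMemory(capacity=1000)
for t in range(50):
    memory.store((t, 0, 1.0, t + 1))  # dummy transitions for the demo
batch = memory.sample(8)
```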
In order to tackle the overestimation issue of DQN, the DDQN algorithm [8] extends the tabular double Q-learning algorithm by using an online neural network that finds the greedy action and a target neural network that evaluates this action. Since sampling uniformly from the memory is inefficient, a prioritized experience replay (PER) approach was developed in [9], such that the samples are drawn according to their priorities. In [10], a dueling network architecture was developed that estimates the state-value function and the state-dependent action advantage function separately. In [11], the Rainbow algorithm was developed by combining several of the recent advances in DRL, including DDQN, PER and dueling DDQN, among other approaches.
II-B. Related Work: Reinforcement Learning Based Routing
In a dynamic network, where the availability of resources and the demand change frequently, a routing algorithm needs to adapt and use the resources efficiently to satisfy the QoS constraints of the different users. The routing problem is therefore a natural fit for the application of RL.
There have been extensive research efforts in developing RL-based adaptive routing algorithms in the literature [12, 13, 14, 15]. In [12], a Q-learning based routing approach known as Q-routing was developed with the objective of minimizing the packet delay in a distributed model in which the nodes act autonomously. In this approach, a node $x$ estimates the time it takes to deliver a packet to a destination node $d$ based on the estimates received from each neighbor $y$, indicating the estimated remaining time for the packet to reach $d$ from $y$. This update is known as forward exploration. The Q-routing algorithm was shown, through simulations, to significantly outperform the non-adaptive shortest path algorithm. In [13], a dual reinforcement Q-routing (DRQ) approach was proposed that incorporates an additional update, known as the backward exploration update, which improves the speed of convergence of the algorithm. In order to address different QoS requirements, the routing algorithm needs to consider other factors besides the delay, such as packet loss and the utilization of the links. In [16], an RL-based QoS-aware routing protocol was developed for software-defined networks (SDN). Based on the transmission delays, queuing delays, packet losses and the utilization of the links, this RL routing protocol was shown through simulations to outperform the Q-routing protocol.
Taking high-dimensional factors into account makes the tabular Q-learning approaches intractable, and DRL techniques provide a promising alternative that can address this issue by approximating the action-value function efficiently. Recently, there has been much interest in using deep learning techniques to address the packet routing challenges [17, 18, 19, 20, 21, 22, 23, 24]. A DRL approach for routing was developed in [17] with the objective of minimizing the delay in the network. In this approach, a single controller finds the paths of all source-destination pairs given the demand in the network, which represents the bandwidth request of each source-destination pair. This approach, however, results in complex state and action spaces and does not scale to large networks, as it depends on a centralized controller. Moreover, in this approach, the state representation does not capture the network topology. Motivated by the high complexity of the state and action representations of the approaches proposed in [18, 17], a feature engineering approach has recently been proposed in [22] that only considers some candidate end-to-end paths for each routing request. This representation was shown to outperform the representations of the approaches proposed in [18, 17] in some use-cases. In an attempt to address the difficulties facing centralized approaches, a distributed multi-agent DQN-based per-packet routing algorithm was developed in [25], where the objective is to minimize the delay in a fully distributed fashion without exchanging control information.
III. Hierarchical Deep Reinforcement Learning Path Calculation
In this section, we describe our DRL-enabled approach to the packet routing problem. We detail our network model and provide a high-level overview of our path calculation approach in Section III-A. We formulate the path calculation problem as an RL problem in Section III-B. Finally, we present our DRL-based path calculation algorithm in Section III-C.
III-A. Network Model
We start with our notation. We use bold fonts for vectors. For a vector $\mathbf{v}$, we denote the $i$-th element by $v_i$ and the first $i$ elements by $\mathbf{v}_{1:i}$. In a graph $G = (V, E)$, where $V$ is the set of vertices and $E$ is the set of edges, we denote the two end points of an edge $e \in E$ by $e_s$ and $e_d$.
We model the network as a directed graph $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges representing the links between the nodes. The queuing delay of node $v$ at time $t$ is denoted by $q_v(t)$. A link $l$ has a capacity denoted by $c_l$. The utilization of a link $l$ at time $t$, denoted by $u_l(t)$, is the ratio between the current rate and the capacity of the link. The transmission delay of a link $l$ is denoted by $d_l$, and its packet loss probability is denoted by $p_l$. The nodes are clustered into groups at $L$ hierarchical levels based on certain criteria, such as the latency between the different nodes, as shown in Fig. 1. The 1st level is the lowest level in the hierarchy and the $L$-th level is the highest. We refer to a node by its identity and a group vector $\mathbf{g}$ representing how this node is located in the hierarchy. The group vector is of length $L$, and its elements from left to right represent the lowest to the highest level using the identifiers of the groups at those levels. Each group has a designated leader denoted by $\ell(\mathbf{g}, k)$, where $\mathbf{g}$ is the group vector of the group leader and $k$ is the group level.
A source node $s$ that issues a route request to a destination node $d$ can find a path with the assistance of different group leaders at different hierarchical levels. Specifically, $s$ sends a route request to its local group leader at the 1st level. This group leader compares its group vector with the group vector of the destination node and, starting from the highest level, searches for the highest level in the hierarchy at which they differ, denoted by $k$. This group leader then sends a route request to the group leader at level $k$, requesting a route segment between $s$ and $d$. The group leader at level $k$ then finds all possible links that can connect $s$ and $d$ according to the desired QoS constraints and chooses a link from these candidates according to the routing policy. This process repeats until a path from $s$ to $d$ is calculated.
In order for the source $s$ to initiate a routing request to $d$, it calls the RouteRequest procedure shown in Algorithm 1, which returns a path that connects $s$ and $d$. The RouteRequest algorithm depends on two procedures. The first procedure, FindLinks, finds all links that can connect $s$ and $d$ at level $k$ and returns those among them satisfying the desired QoS requirements, denoted by $\mathcal{L}$. The second procedure, ChooseLink, returns a link selected from $\mathcal{L}$. The ChooseLink procedure needs to adapt to the dynamic aspects of the routing network, such as the utilizations, delays, queue lengths and preferences of the source node. Hence, we design an adaptive ChooseLink procedure based on DRL in the next subsection.
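The control flow described above can be sketched as follows, with FindLinks and ChooseLink passed in as stubs; the `Node`/`Link` interfaces (a `group` vector, lowest level first, and a link `endpoint`) are our own illustrative assumptions, not the paper's exact data structures.

```python
from collections import namedtuple

Node = namedtuple("Node", "name group")   # group: vector, lowest level first
Link = namedtuple("Link", "name endpoint")

def divergence_level(gs, gd):
    """Highest hierarchy level at which two group vectors differ
    (levels indexed from 1; 0 means the nodes share the lowest-level group)."""
    for k in range(len(gs), 0, -1):       # scan from the highest level down
        if gs[k - 1] != gd[k - 1]:
            return k
    return 0

def route_request(src, dst, find_links, choose_link):
    """Assemble an end-to-end path by repeatedly bridging the highest
    level at which the current node and the destination still differ."""
    path, node = [src], src
    while True:
        k = divergence_level(node.group, dst.group)
        if k == 0:                        # same lowest-level group: done
            break
        links = find_links(node, dst, k)  # QoS-feasible candidate links at level k
        link = choose_link(links)         # policy decision (the DRL agent's role)
        path.append(link)
        node = link.endpoint              # continue from the far end of the link
    path.append(dst)
    return path
```

Here `choose_link` is the hook where the adaptive, DRL-based ChooseLink policy of the next subsection would plug in.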
III-B. Reinforcement Learning Formulation
In this subsection, we formulate the link selection problem as a model-free RL problem. Each group leader is a DRL agent that observes local information and the feedback of the source nodes, and acts accordingly. We now describe the state space $\mathcal{S}$, the action space $\mathcal{A}$ and our composite reward design approach.
State space. The state of a group leader at time $t$, $s_t$, consists of the partial group vectors of the source and destination, which represent the network topology, where the depth $m$ is a parameter that can be chosen based on the memory constraints of the routing nodes. The state also includes the utilizations and transmission delays of the set of candidate links $\mathcal{L}$. That is, the state is given by
$$s_t = \big(\mathbf{g}_s[1{:}m],\ \mathbf{g}_d[1{:}m],\ (u_l(t), d_l)_{l \in \mathcal{L}}\big).$$
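Assembling this observation is a straightforward flattening step; the sketch below uses hypothetical field names for the per-link statistics.

```python
def build_state(src_group, dst_group, links, m):
    """Flatten the agent observation: the first m levels of the source and
    destination group vectors, plus (utilization, delay) of each candidate link."""
    state = list(src_group[:m]) + list(dst_group[:m])
    for link in links:
        state += [link["utilization"], link["delay"]]
    return state

# Demo with two candidate links and partial group vectors of depth m = 2.
s = build_state((1, 2, 1), (3, 2, 1),
                [{"utilization": 0.4, "delay": 5.0},
                 {"utilization": 0.1, "delay": 8.0}], m=2)
```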
Action space. The action $a_t$ of the group leader represents the link $l \in \mathcal{L}$ that the group leader chooses for the routing request.
Reward. The group leader in state $s_t$ that takes an action $a_t$ gets a reward $r_{t+1}$. We design a composite reward [20] that captures both the global and the local aspects of the path calculation problem, as follows.
Global Reward. The group leader takes actions based purely on the locally available information, without knowing the effect of these actions on the end-to-end QoS constraints. Hence, we design a global reward control signal to address this issue, which is assigned to all group leaders involved in the routing of a certain flow without distinction. Specifically, a source node $s$ that issues a route request to a destination node $d$ sends a bounded control signal to the group leaders involved in the routing, indicating the satisfaction of the source with the selected path. We note that the global reward signal may not be instantaneous, as the source does not send it continuously. Instead, the source sends it from time to time, and we assume the group leaders receive it $\tau$ steps after choosing the path. That is, the global reward of the group leader that selects a link in state $s_t$ is the source's feedback signal scaled by $w_0$, the weight of the global reward.
Local Reward. The local reward is assigned individually to each group leader based on its individual contribution. Specifically, the local reward depends on the queuing delay, the transmission delay and the packet loss of the selected link, and on how well the group leader balances the load over the candidate links. It is a weighted sum of these terms, with weights that can be chosen by the network operator, a utilization threshold $u_{\mathrm{th}}$ that is undesirable to exceed, and the average utilization of the links in $\mathcal{L}$, given by
$$\bar{u}(t) = \frac{1}{|\mathcal{L}|} \sum_{l \in \mathcal{L}} u_l(t).$$
The composite reward is then the sum of the global and local rewards.
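A sketch of how such a composite reward could be computed is given below; the specific functional form, field names and weight values are our own illustrative assumptions, since the operator-chosen weights are free parameters.

```python
def local_reward(link, links, w, u_threshold):
    """Illustrative local reward: penalize delay and loss of the chosen link,
    exceeding the utilization threshold, and deviating from the average
    utilization of the candidate links (weights w are assumptions)."""
    avg_util = sum(l["utilization"] for l in links) / len(links)
    penalty = (w["delay"] * (link["queuing_delay"] + link["delay"])
               + w["loss"] * link["loss"]
               + w["threshold"] * max(0.0, link["utilization"] - u_threshold)
               + w["balance"] * abs(link["utilization"] - avg_util))
    return -penalty

def composite_reward(local, global_feedback, w_global):
    # global_feedback is the delayed satisfaction signal from the source node
    return local + w_global * global_feedback
```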
III-C. Deep Double Q-Routing with PER
We consider a model-free, off-policy algorithm based on the DDQN algorithm developed in [8], as depicted in Fig. 2 and given in Algorithm 2. As explained in Section II-A, the action selection uses an online neural network with weights $\theta_t$, referred to as the Q-network, to estimate the action-value function. The input of the neural network is the state of the group leader, and the outputs are the estimated action-values for each action in that state, where each output unit corresponds to a particular action (link). The action evaluation uses a target network with weights $\theta^-_t$, a copy of the online network weights that is updated every $C$ steps as $\theta^-_t \leftarrow \theta_t$. The target is expressed as follows:
$$y_t = r_{t+1} + \gamma \, Q\big(s_{t+1}, \arg\max_{a} Q(s_{t+1}, a; \theta_t);\ \theta^-_t\big),$$
and the parameters are updated as follows:
$$\theta_{t+1} = \theta_t + \alpha \big(y_t - Q(s_t, a_t; \theta_t)\big) \nabla_{\theta_t} Q(s_t, a_t; \theta_t).$$
We use an experience replay memory of size $N$ to store the experience of a group leader at each time step as $e_t = (s_t, a_t, r_{t+1}, s_{t+1})$. These experiences are replayed later at a rate based on their priority, where the priority depends on how surprising the experience was, as indicated by its TD error. Specifically, we use a proportional priority memory where the priority of transition $i$ is expressed as follows:
$$p_i = |\delta_i| + \epsilon,$$
where $\delta_i$ is the TD error of transition $i$ and $\epsilon$ is a small positive constant. The probability of sampling a transition $i$ from the memory is given by
$$P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}},$$
where $\alpha$ is a parameter that controls the level of prioritization (distinct from the learning rate), and $\alpha = 0$ corresponds to the uniform sampling case. Sampling experiences based on priority introduces a bias, which importance sampling (IS) can correct. In particular, the IS weight of transition $i$ is given by
$$w_i = \left( \frac{1}{N} \cdot \frac{1}{P(i)} \right)^{\beta},$$
where $\beta$ is defined by a schedule that starts with an initial value $\beta_0$ and reaches $1$ at the end of learning. These weights are normalized by the maximum weight $\max_j w_j$.
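The expressions above can be sketched compactly as follows; the list-based storage is for clarity only (efficient implementations use a sum-tree), and the demo transitions are illustrative.

```python
import random

class ProportionalReplay:
    """Sketch of proportional prioritized replay: p_i = |delta_i| + eps,
    P(i) proportional to p_i**alpha, IS weight w_i = (N * P(i))**(-beta),
    normalized by the maximum weight."""
    def __init__(self, alpha=0.6, eps=1e-2):
        self.alpha, self.eps = alpha, eps
        self.data, self.priorities = [], []

    def add(self, transition, td_error):
        self.data.append(transition)
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, batch_size, beta):
        total = sum(self.priorities)
        probs = [p / total for p in self.priorities]
        idxs = random.choices(range(len(self.data)), weights=probs, k=batch_size)
        n = len(self.data)
        weights = [(n * probs[i]) ** (-beta) for i in idxs]
        max_w = max((n * p) ** (-beta) for p in probs)  # normalize by max weight
        return [self.data[i] for i in idxs], [w / max_w for w in weights]

# Demo: the transition with the larger TD error gets the larger priority.
mem = ProportionalReplay(alpha=1.0)
mem.add(("s", "a", 1.0, "s2"), td_error=2.0)
mem.add(("s", "b", 0.0, "s2"), td_error=0.1)
batch, is_weights = mem.sample(batch_size=4, beta=0.5)
```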
IV. Performance Evaluation
Our Q-network is a fully connected neural network trained with the RMSprop optimizer and the Huber loss. The hidden layers use the ReLU activation function, and the output layer has a linear activation function. We selected our parameters as shown in Table I and Table II.
Table I: The hyperparameters used in our experiment.
| Replay memory size | |
| Importance sampling exponent | linearly annealed |
| Minimum exploration rate | |
| Maximum exploration rate | |
| Exploration rate exponent | |
Table II: The reward and traffic parameters used in our experiment.
| Global reward weights | |
| Utilization threshold weight | |
| Packet loss weight | |
| Load balancing weight | |
| Link capacity | flows per second |
| Minimum flow duration | s |
| Maximum flow duration | s |
| Requests inter-arrival time | |
As a showcase, we pick one source-destination pair and focus mainly on the utilization components of our reward. The source generates random flow requests with random durations, and the corresponding group leader selects one of three candidate links to connect the source group and the destination group. Therefore, based on which link is selected, we consider three paths between this source-destination pair. We assume that the source prefers the path through the first link, then the path through the second link, and then the path through the third link. Specifically, the source assigns the highest global reward to the first path, a lower global reward to the second path, and the lowest global reward to the third path.
In Fig. 5, we show the utilizations of the three links that the group leader manages. During the initial steps, the group leader fills the replay memory by selecting links at random, and after that it starts to learn. We observe that the group leader learns the preferences of the source node: first the first path, then the second, then the third. Moreover, the group leader selects links such that the utilization of the links does not exceed the predefined utilization threshold and, importantly, balances the load across these three links while considering the preferences of the source. Specifically, the group leader picks the first link more often than the other two links until the utilization of this link reaches the predefined threshold. The group leader then selects the second link more often than the third, but we observe that it still selects the third link for some of the requests, even before the utilization of the second link reaches the predefined threshold, in order to balance the load across the three links. In Fig. 6, we also show the neural network loss of the group leader during the online learning process, which decreases over time. Finally, in Fig. 7 we compare the total discounted reward of the DQN-based routing algorithm with that of the DDQN-based routing under the same traffic pattern. As expected, the DDQN-based routing achieves a higher total reward than the DQN-based routing.
V. Conclusion
In this paper, we presented our hierarchical approach to the packet routing problem based on the DDQN algorithm. Our approach scales to large networks, as the path calculation is hierarchically distributed over designated nodes in the network rather than performed by a centralized node calculating the paths for all nodes. Moreover, our path calculation algorithm adapts dynamically to rapid changes in the network and utilizes the resources based on a policy to determine real-time paths. Our future work includes assessment of our algorithm on large-scale network topologies and quantifying its advantages in terms of routing performance, memory requirements and communication efficiency.
References
[1] M. Di Ianni, "Efficient delay routing," Theoretical Computer Science, vol. 196, no. 1-2, pp. 131–151, 1998.
[2] C. Filsfils, N. K. Nainar, C. Pignataro, J. C. Cardona, and P. Francois, "The segment routing architecture," in 2015 IEEE Global Communications Conference (GLOBECOM), pp. 1–6.
[3] J. Ash and A. Farrel, "A path computation element (PCE)-based architecture," IETF, RFC 4655, August 2006.
[4] C. J. C. H. Watkins, "Learning from delayed rewards," PhD thesis, King's College, Cambridge, 1989.
[5] H. V. Hasselt, "Double Q-learning," in Advances in Neural Information Processing Systems, 2010, pp. 2613–2621.
[6] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[7] S. Halkjær and O. Winther, "The effect of correlated input data on the dynamics of learning," in Advances in Neural Information Processing Systems, 1997, pp. 169–175.
[8] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[9] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized experience replay," arXiv preprint arXiv:1511.05952, 2015.
[10] Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas, "Dueling network architectures for deep reinforcement learning," in International Conference on Machine Learning, 2016, pp. 1995–2003.
[11] M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver, "Rainbow: Combining improvements in deep reinforcement learning," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[12] J. A. Boyan and M. L. Littman, "Packet routing in dynamically changing networks: A reinforcement learning approach," in Advances in Neural Information Processing Systems, 1994, pp. 671–678.
[13] S. Kumar and R. Miikkulainen, "Dual reinforcement Q-routing: An on-line adaptive routing algorithm," in Proceedings of the Artificial Neural Networks in Engineering Conference, 1997, pp. 231–238.
[14] D. Subramanian, P. Druschel, and J. Chen, "Ants and reinforcement learning: A case study in routing in dynamic networks," in IJCAI (2), 1997, pp. 832–839.
[15] S. P. Choi and D.-Y. Yeung, "Predictive Q-routing: A memory-based reinforcement learning approach to adaptive traffic control," in Advances in Neural Information Processing Systems, 1996, pp. 945–951.
[16] S.-C. Lin, I. F. Akyildiz, P. Wang, and M. Luo, "QoS-aware adaptive routing in multi-layer hierarchical software defined networks: A reinforcement learning approach," in 2016 IEEE International Conference on Services Computing (SCC), 2016, pp. 25–33.
[17] G. Stampa, M. Arias, D. Sanchez-Charles, V. Muntés-Mulero, and A. Cabellos, "A deep-reinforcement learning approach for software-defined networking routing optimization," arXiv preprint arXiv:1709.07080, 2017.
[18] A. Valadarsky, M. Schapira, D. Shahaf, and A. Tamar, "Learning to route with deep RL," in NIPS Deep Reinforcement Learning Symposium, 2017.
[19] T. A. Q. Pham, Y. Hadjadj-Aoul, and A. Outtagarts, "Deep reinforcement learning based QoS-aware routing in knowledge-defined networking," in International Conference on Heterogeneous Networking for Quality, Reliability, Security and Robustness, Springer, 2018, pp. 14–26.
[20] H. Mao, Z. Gong, and Z. Xiao, "Reward design in cooperative multi-agent reinforcement learning for packet routing," 2018.
[21] Z. Xu, J. Tang, J. Meng, W. Zhang, Y. Wang, C. H. Liu, and D. Yang, "Experience-driven networking: A deep reinforcement learning based approach," in IEEE INFOCOM 2018 - IEEE Conference on Computer Communications, pp. 1871–1879.
[22] J. Suarez-Varela, A. Mestres, J. Yu, L. Kuang, H. Feng, P. Barlet-Ros, and A. Cabellos-Aparicio, "Feature engineering for deep reinforcement learning based routing," in 2019 IEEE International Conference on Communications (ICC), pp. 1–6.
[23] P. Sun, J. Li, Z. Guo, Y. Xu, J. Lan, and Y. Hu, "SINET: Enabling scalable network routing with deep reinforcement learning on partial nodes," in Proceedings of the ACM SIGCOMM 2019 Conference, 2019, pp. 88–89.
[24] H. Mao, Z. Gong, Z. Zhang, Z. Xiao, and Y. Ni, "Learning multi-agent communication under limited-bandwidth restriction for internet packet routing," arXiv preprint arXiv:1903.05561, 2019.
[25] X. You, X. Li, Y. Xu, H. Feng, and J. Zhao, "Toward packet routing with fully-distributed multi-agent deep reinforcement learning," IEEE RAWNET Workshop, WiOpt 2019, Avignon, France.
[26] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, "OpenAI Gym," arXiv preprint arXiv:1606.01540, 2016.
[27] G. Hinton, N. Srivastava, and K. Swersky, "RMSprop: Divide the gradient by a running average of its recent magnitude," lecture notes on Neural Networks for Machine Learning.