Hierarchical Deep Double Q-Routing

10/09/2019 ∙ by Ramy E. Ali, et al. ∙ Penn State University

This paper explores a deep reinforcement learning approach applied to the packet routing problem with high-dimensional constraints instigated by dynamic and autonomous communication networks. Our approach is motivated by the fact that centralized path calculation approaches are often not scalable, whereas the distributed approaches with locally acting nodes are not fully aware of the end-to-end performance. We instead hierarchically distribute the path calculation over designated nodes in the network while taking into account the end-to-end performance. Specifically, we develop a hierarchical cluster-oriented adaptive per-flow path calculation mechanism by leveraging the Deep Double Q-network (DDQN) algorithm, where the end-to-end paths are calculated by the source nodes with the assistance of cluster (group) leaders at different hierarchical levels. In our approach, a deferred composite reward is designed to capture the end-to-end performance through a feedback signal from the source nodes to the group leaders and captures the local network performance through the local resource assessments by the group leaders. This approach scales in large networks, adapts to the dynamic demand, utilizes the network resources efficiently and can be applied to segment routing.




I Introduction

Routing is one of the most challenging problems in IP packet networking, with the general form of its optimization being interpreted as the NP-complete multi-commodity flow problem [1]. The key factors contributing to the routing complexity are the large number of concurrent demands, each with specific desired QoS constraints, and limited shared resources in the form of the number of links and their limited capacities. In large-scale networks with high-dimensional end-to-end QoS constraints, the routing algorithm must incorporate additional attributes such as delay, throughput, packet loss, network topology and other traffic engineering (TE) factors. On top of that, the recent transformation of IP networking towards a virtualized architecture and an autonomous control plane has further increased the routing challenge by introducing more dynamism to the network operation. The traffic profiles in such scenarios change on much smaller time-scales, so a very short reaction time is required. This significantly increases the cost of path computation and makes it harder for a central controller to compute all real-time routes in a large-scale network.

Today, source routing methods such as segment routing [2] in conjunction with PCE [3] provide much-needed capabilities to resolve many of these routing challenges, by allowing intermediate nodes to perform routing decisions, thus offloading the centralized path computation entities of the network. The reliance on central control nodes nevertheless remains an open issue as the network scales, which motivates us to study the balance between centralized and decentralized operation. In this paper, we propose a hierarchical routing scheme that leverages recent advances in deep reinforcement learning (DRL) to manage the complexities of the routing problem. In order to distribute the route computation load in large-scale networks while taking into account the end-to-end performance, we develop a hierarchical QoS-aware per-flow route calculation algorithm. In our approach, the nodes in the network are grouped (clustered) based on specific criteria such as the latency between the nodes, with each group having a designated leader that is either assigned autonomously or predefined by the network operator. The end-to-end route from a source node to a destination is calculated by the source node with the assistance of group leaders at different hierarchical levels. The group leaders select links based on the local information available, such as link utilizations and delays, while taking into account the global view of the network through the feedback of the source nodes.

In our approach, while the stateful computations are done at the source nodes, the assisting group leaders behave as stateless functions for per-route computation threads. The dynamically computed routes are directly applicable to segment routing for the establishment of the packet flows. The group leaders in this regard provide assistance to route calculation based on their local network conditions, while the source nodes maintain responsibility for assembling the end-to-end route.

The remainder of this paper is organized as follows. Section II provides a brief overview of RL and the related work applying RL to the routing problem. Section III includes our formulation of the hierarchical routing problem and our DRL-enabled routing algorithm. We evaluate the performance of the proposed approach in Section IV, and finally, concluding remarks are provided in Section V.

II Background and Related Work

We start with a brief overview of the relevant aspects of RL in section II-A, and then review some of the recent approaches applying RL to the packet routing problem in section II-B.

II-A Background: Reinforcement Learning

Model-free methods enable an agent to learn from experience while interacting with the environment without prior knowledge of the environment model. The agent at time step $t$ observes the environment state $s_t$, takes an action $a_t$, gets a reward $r_{t+1}$ at time $t+1$ and the state changes to $s_{t+1}$. In Monte Carlo approaches applied to episodic problems, where the experience is divided into episodes such that all episodes reach a terminal state irrespective of the taken actions, the value estimates are only updated when an episode terminates. While the values can be estimated accurately in these approaches, learning can be dramatically slow. In contrast, in temporal-difference (TD) learning the value estimates are updated incrementally as the agent interacts with the environment, without waiting for a final outcome, by bootstrapping. Therefore, TD learning usually converges faster.

In TD learning, the goal of the agent is to find the optimal policy maximizing the cumulative discounted reward which can be expressed as follows

$$G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1},$$

where $\gamma \in [0, 1]$ is the discount factor.
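As a small worked illustration of the cumulative discounted reward, the following Python sketch evaluates it for a finite reward sequence. The function name `discounted_return` and the example rewards are assumptions made for illustration only.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence."""
    g = 0.0
    # Accumulate from the last reward backwards: g <- r + gamma * g.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three unit rewards with discount factor 0.5: 1 + 0.5 + 0.25.
print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1.75
```

A smaller discount factor makes the agent weigh immediate rewards more heavily than later ones.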

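The Q-learning algorithm [4] is one of the off-policy TD control techniques in which the agent learns the optimal action-value function independent of the policy being followed. In Q-learning, the action-value function is updated as follows

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right],$$

where $\alpha$ is the learning rate and $\gamma$ is the discount factor. We denote by $Y_t$ the TD target which is given by

$$Y_t = r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a),$$

and we denote the TD error by $\delta_t$ which is expressed as follows

$$\delta_t = Y_t - Q(s_t, a_t),$$

which measures the difference between the old estimate and the TD target. Q-learning however suffers from an overestimation problem, as the maximization step introduces a positive bias that overestimates the action values. Double Q-learning [5] avoids this problem by decoupling the action selection from the action evaluation. The first $Q$-function finds the maximizing action and the second $Q$-function estimates the value of taking this action, which removes the maximization bias from the action-value estimates. Specifically, double Q-learning alternates between updating two $Q$-functions $Q_1$ and $Q_2$ as follows

$$Q_1(s_t, a_t) \leftarrow Q_1(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma\, Q_2\big(s_{t+1}, \arg\max_{a} Q_1(s_{t+1}, a)\big) - Q_1(s_t, a_t) \right],$$

$$Q_2(s_t, a_t) \leftarrow Q_2(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma\, Q_1\big(s_{t+1}, \arg\max_{a} Q_2(s_{t+1}, a)\big) - Q_2(s_t, a_t) \right].$$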
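The decoupling of action selection from action evaluation in double Q-learning [5] can be illustrated with a minimal tabular sketch in Python. The function name `double_q_update`, the dictionary-based tables and the default values of the learning rate and the discount factor are assumptions made for illustration.

```python
import random
from collections import defaultdict


def double_q_update(q1, q2, s, a, r, s_next, actions,
                    alpha=0.1, gamma=0.9, rng=random):
    """Apply one tabular double Q-learning update in place and return the target."""
    # A fair coin decides which of the two tables is updated on this step.
    if rng.random() < 0.5:
        select, evaluate = q1, q2
    else:
        select, evaluate = q2, q1
    # The table being updated selects the greedy next action ...
    a_star = max(actions, key=lambda b: select[(s_next, b)])
    # ... and the other table evaluates that action.
    target = r + gamma * evaluate[(s_next, a_star)]
    select[(s, a)] += alpha * (target - select[(s, a)])
    return target


q1, q2 = defaultdict(float), defaultdict(float)
double_q_update(q1, q2, "s0", "left", 1.0, "s1", ["left", "right"])
```

Because the evaluating table did not pick the action, an accidental overestimate in one table is not automatically propagated into the target.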


In large-scale RL problems, storing and learning the action-value function for all states and all actions is inefficient. To tackle this challenge, the action-value function is usually approximated by a function $Q(s, a; \boldsymbol{\theta})$ parameterized by a weight vector $\boldsymbol{\theta}$ instead of using a table. This approximated function can be a linear function in the features of the state, where $\boldsymbol{\theta}$ is the feature weights vector, or can be computed for instance using a deep neural network (DNN), where $\boldsymbol{\theta}$ represents the weights of the neural network. The DQN algorithm [6] extends the tabular Q-learning algorithm by approximating the action-value function and learning a parameterized action-value function instead. Specifically, an online neural network whose input is the state $s_t$ outputs the estimated action-value function $Q(s_t, a; \boldsymbol{\theta}_t)$ for each action $a$ in this state, where $\boldsymbol{\theta}_t$ are the parameters of the neural network at time $t$. In DQN, a target neural network with outdated parameters $\boldsymbol{\theta}^-$, which are copied from the online neural network every $\tau$ steps, is used to find the target of the RL algorithm. Deep learning algorithms commonly assume the data samples to be independent, and correlated data can significantly slow down the learning [7]. However, in RL the samples are usually correlated. Therefore, an experience replay memory is used in DQN to break the correlations. Specifically, the replay memory stores the agent experience at each time step and a mini-batch of samples is drawn uniformly at random from the replay memory to train the neural network.
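A uniform experience replay memory of the kind used in DQN can be sketched in a few lines of Python. The class name `ReplayMemory` and its method names are hypothetical; the sketch only illustrates bounded storage and uniform mini-batch sampling.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity memory; the oldest experience is evicted first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        # Appending to a full deque silently drops the oldest entry.
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size, rng=random):
        # Draw without replacement, uniformly at random.
        k = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), k)


memory = ReplayMemory(capacity=3)
for step in range(5):
    memory.store(step, 0, 1.0, step + 1)
batch = memory.sample(2)
```

Sampling from the whole buffer, rather than training on consecutive transitions, is what breaks the temporal correlation between training samples.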

In order to tackle the overestimation issue of DQN, the DDQN algorithm [8] extends the tabular double Q-learning algorithm by using an online neural network that finds the greedy action and a target neural network that evaluates this action. Since sampling uniformly from the memory is inefficient, a prioritized experience replay (PER) approach was developed in [9] such that the samples are drawn according to their priorities. In [10], a dueling network architecture was developed that estimates the state-value function and the state-dependent action advantage function separately. In [11], the Rainbow algorithm was developed by combining some of the recent advances in DRL including DDQN, PER and dueling DDQN, among other approaches.

II-B Related Work: Reinforcement Learning Based Routing

In a dynamic network, where the availability of the resources and the demand change frequently, a routing algorithm needs to adapt and use the resources efficiently to satisfy the QoS constraints of the different users, therefore the routing problem is a natural fit for application of RL.

There have been extensive research efforts in developing RL-based adaptive routing algorithms in the literature [12, 13, 14, 15]. In [12], a Q-learning based routing approach known as Q-routing was developed with the objective of minimizing the packet delay in a distributed model in which the nodes act autonomously. In this approach, a node $x$ estimates the time it takes to deliver a packet to a destination node $d$ based on the estimates received from each neighbor $y$ indicating the estimated remaining time for the packet to reach $d$ from $y$. This update is known as the forward exploration. The Q-routing algorithm was shown, through simulations, to significantly outperform the non-adaptive shortest path algorithm. In [13], a dual reinforcement Q-routing (DRQ) approach was proposed that incorporates an additional update known as the backward exploration update, which improves the speed of convergence of the algorithm. In order to address the different QoS requirements, the routing algorithm needs to consider other factors such as packet loss and the utilization of the links besides the delay. In [16], an RL-based QoS-aware routing protocol was developed for software-defined networking (SDN). The protocol is based on the transmission delays, queuing delays, packet losses and the utilization of the links, and was shown through simulations to outperform the Q-routing protocol.
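The forward exploration update of Q-routing [12] can be illustrated with the following Python sketch. The function name, the dictionary layout keyed by (destination, neighbor) and the step size `lr` are assumptions made for illustration; the update moves the old delivery-time estimate towards the sum of the local queuing delay, the transmission delay and the neighbor's reported remaining time.

```python
def q_routing_update(q_x, dest, neighbor, queue_delay, tx_delay,
                     neighbor_estimate, lr=0.5):
    """Update node x's delivery-time estimate for sending to dest via neighbor."""
    old = q_x.get((dest, neighbor), 0.0)
    # New sample: time spent at x, time on the link, and the neighbor's
    # own best estimate of the remaining time to the destination.
    sample = queue_delay + tx_delay + neighbor_estimate
    q_x[(dest, neighbor)] = old + lr * (sample - old)
    return q_x[(dest, neighbor)]


table = {("d", "y"): 10.0}
print(q_routing_update(table, "d", "y", queue_delay=1.0, tx_delay=2.0,
                       neighbor_estimate=5.0))  # 9.0
```

A node then forwards a packet to the neighbor with the smallest estimate for the packet's destination.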

Taking high-dimensional factors into account makes the tabular Q-learning approaches intractable, and DRL techniques provide a promising alternative that can address this issue by approximating the action-value function efficiently. Recently, there has been much interest in using deep learning techniques to address the packet routing challenges [17, 18, 19, 20, 21, 22, 23, 24]. A DRL approach for routing was developed in [17] with the objective of minimizing the delay in the network. In this approach, a single controller finds all the paths of all source-destination pairs given the demand in the network, which represents the bandwidth request of each source-destination pair. This approach however results in complex state and action spaces and does not scale to large networks as it depends on a centralized controller. Moreover, in this approach, the state representation does not capture the network topology. Motivated by the high complexity of the state and action representations of the approaches proposed in [18, 17], a feature engineering approach has been recently proposed in [22] that only considers some candidate end-to-end paths for each routing request. This proposed representation was shown to outperform the representation of the approaches proposed in [18, 17] in some use-cases. In an attempt to address the difficulties facing the centralized approaches, a distributed multi-agent DQN-based per-packet routing algorithm was developed in [25], where the objective is to minimize the delay in a fully distributed fashion without exchanging control information.

III Hierarchical Deep Reinforcement Learning Path Calculation

In this section, we describe our DRL-enabled approach to the packet routing problem. We detail our network model and provide a high level overview of our path calculation approach in section III-A. We formulate the path calculation problem as an RL problem in section III-B. Finally, we present our DRL-based path calculation algorithm in section III-C.

III-A Network Model

We start with our notations. We use bold fonts for vectors. For a vector $\mathbf{x}$, we denote the $i$-th element by $x_i$ and the elements $x_i, \ldots, x_j$ by $\mathbf{x}_{i:j}$. In a graph $G = (V, E)$, where $V$ is the set of vertices and $E$ is the set of edges, $e_1$ and $e_2$ denote the two end points of the edge $e \in E$.

We model the network as a directed graph $\mathcal{G} = (\mathcal{N}, \mathcal{E})$, where $\mathcal{N}$ is the set of nodes and $\mathcal{E}$ is the set of edges representing the links between the nodes. The queuing delay of node $n$ at time $t$ is denoted by $q_n(t)$. A link $e$ has a capacity that is denoted by $c_e$. The utilization of a link $e$ at time $t$, denoted by $u_e(t)$, is the ratio between the current rate and the capacity of the link. The transmission delay of a link $e$ is denoted by $d_e$ and the packet loss probability is denoted by $p_e$. The nodes are clustered into groups at $L$ hierarchical levels based on certain criteria such as the latency between the different nodes, as shown in Fig. 1. The $1$-st level is the lowest level in the hierarchy and the $L$-th level is the highest. We refer to a node by its identity $n$ and a group vector $\mathbf{g}_n$ representing how this node is located in the hierarchy. The group vector is of length $L$ and the elements from left to right represent the lowest to the highest level using the identifiers of the group at those levels. Each group has a designated leader denoted by $\mathrm{GL}(\mathbf{g}, l)$, where $\mathbf{g}$ is the group vector of the group leader and $l$ is the group level.

Fig. 1: Hierarchical dynamic routing for the case where .

A source node $n_s$ that issues a route request to a destination node $n_d$ can find a path with the assistance of group leaders at different hierarchical levels. Specifically, $n_s$ sends a route request to its local group leader at the $1$-st level. This group leader compares its group vector with the group vector of the destination node and searches, starting from the highest level, for the highest level in the hierarchy in which they differ, denoted by $l$. This group leader then sends a route request to the group leader at level $l$ requesting a route segment between $n_s$ and $n_d$. The group leader then finds all possible links that can connect $n_s$ and $n_d$ according to the desired QoS constraints and chooses a link from these possible links according to the routing policy. This process repeats until a path from $n_s$ to $n_d$ is calculated.
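The group vector comparison performed by a group leader can be sketched as follows in Python. The function name `highest_differing_level` and the tuple encoding of group vectors are assumptions; levels are counted from 1 (lowest) to the vector length (highest), following the left-to-right convention of the network model.

```python
def highest_differing_level(gv_a, gv_b):
    """Return the highest 1-based level at which two group vectors differ.

    Returns 0 when the vectors are identical, that is, when both nodes
    belong to the same group at every level.
    """
    if len(gv_a) != len(gv_b):
        raise ValueError("group vectors must have the same length")
    # Scan from the highest level (last element) down to the lowest.
    for level in range(len(gv_a), 0, -1):
        if gv_a[level - 1] != gv_b[level - 1]:
            return level
    return 0


print(highest_differing_level((3, 1, 0), (5, 1, 0)))  # 1
print(highest_differing_level((3, 1, 0), (3, 2, 1)))  # 3
```

In the first call the two nodes only differ in their lowest-level group, so the request can be handled low in the hierarchy; in the second call the request has to be escalated to the top level.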

In order for the source $n_s$ to initiate a routing request to $n_d$, it calls the RouteRequest procedure shown in Algorithm 1 with the request as its input tuple, which returns a path that connects $n_s$ and $n_d$. The RouteRequest algorithm depends on two procedures. The first procedure, FindLinks, finds all links that can connect $n_s$ and $n_d$ at level $l$ and returns the links among them corresponding to the desired QoS requirements, denoted by $\mathcal{L}$. The second procedure, ChooseLink, returns a link selected from $\mathcal{L}$. The ChooseLink procedure needs to adapt to the dynamic aspects of the routing network such as the utilizations, delays, queue lengths and preferences of the source node. Hence, we design an adaptive ChooseLink procedure based on DRL in the next subsection.

for  do
     if  then
           add to the path
Algorithm 1 RouteRequest ()
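The interplay between FindLinks and ChooseLink at a single hierarchical level can be illustrated with the following Python sketch. It is not a transcription of Algorithm 1: the `Link` fields, the two QoS bounds and the least-utilized placeholder policy in `choose_link` are all assumptions made for illustration, and the DRL-based ChooseLink procedure would take the place of the placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    name: str
    src_group: int
    dst_group: int
    delay_ms: float
    loss: float
    utilization: float


def find_links(links, src_group, dst_group, max_delay_ms, max_loss):
    """Keep the links joining the two groups that satisfy the QoS bounds."""
    return [l for l in links
            if l.src_group == src_group and l.dst_group == dst_group
            and l.delay_ms <= max_delay_ms and l.loss <= max_loss]


def choose_link(candidates):
    """Placeholder policy (assumption): pick the least utilized candidate."""
    if not candidates:
        return None
    return min(candidates, key=lambda l: l.utilization)


links = [
    Link("A", 1, 2, delay_ms=5.0, loss=0.01, utilization=0.7),
    Link("B", 1, 2, delay_ms=50.0, loss=0.01, utilization=0.1),
    Link("C", 1, 2, delay_ms=8.0, loss=0.0, utilization=0.3),
]
candidates = find_links(links, 1, 2, max_delay_ms=10.0, max_loss=0.02)
print(choose_link(candidates).name)  # C
```

Link B is the least utilized overall but violates the delay bound, so it never reaches the selection step.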

III-B Reinforcement Learning Formulation

In this subsection, we formulate the link selection problem as a model-free RL problem. Each group leader is a DRL agent that observes local information and the feedback of the source nodes and acts accordingly. We now describe the state space $\mathcal{S}$, the action space $\mathcal{A}$ and our composite reward design approach.

State space. The state of a group leader at time , , consists of the partial group vectors which represent the network topology, where is a parameter that can be chosen based on memory constraints of the routing nodes. The state also includes utilization and transmission delays of the set of possible links . That is, the state is given by

Action space. The action of the group leader represents the link that the group leader chooses for the routing request.
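One possible encoding of the state and action described above is sketched below in Python. The function name `build_state`, the choice of keeping the `k` highest-level entries of each group vector, and the ordering of the features are assumptions made for illustration; the action is simply the index of the chosen link in the candidate list.

```python
def build_state(src_gv, dst_gv, link_features, k):
    """Flatten partial group vectors and per-link features into one list.

    link_features is a list of (utilization, delay) pairs, one per
    candidate link; action i refers to link_features[i].
    """
    # Assumption: the partial group vector keeps the k highest levels.
    state = list(src_gv[-k:]) + list(dst_gv[-k:])
    state += [u for u, _ in link_features]   # utilizations first
    state += [d for _, d in link_features]   # then transmission delays
    return state


state = build_state((3, 1, 0), (5, 2, 0), [(0.2, 1.0), (0.5, 2.0)], k=2)
print(state)  # [1, 0, 2, 0, 0.2, 0.5, 1.0, 2.0]
```

With a fixed number of candidate links and a fixed `k`, the state has a constant length, which is what a dense neural network input layer requires.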

Reward. The group leader at state that takes an action gets a reward . We design a composite reward [20] such that it captures both the global and the local aspects of the path calculation problem as follows.

  • Global Reward. The group leader takes actions based purely on the local information available, without knowing the effect of these actions on the end-to-end QoS-type constraints. Hence, we design a global reward control signal to address this issue that is assigned to all group leaders involved in the routing of a certain flow without distinction. Specifically, a source node that issues a route request to a destination node sends a control signal to the group leaders involved in the routing that is between and indicating the satisfaction of the source about the selected path. We note that the global reward signal may not be instantaneous as the source does not continuously send it. Instead, the source sends it from time to time and we assume the group leaders receive it steps after choosing the path. That is, the global reward of the group leader that selects a link in state is expressed as follows


    where is the weight of the global reward.

  • Local Reward. The local reward is assigned individually to each group leader based on its individual contribution. Specifically, the local reward depends on the queuing delay, transmission delay and packet loss of the selected link, and on how well the group leader balances the load over the possible links. The local reward is expressed as follows


    where are weights that can be chosen by the network operator, denotes a utilization threshold that it is undesirable to exceed and denotes the average utilization of links in given by


Therefore, the composite reward combining global and local rewards is given by
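To make the structure of the composite reward concrete, the following Python sketch combines a weighted global signal with a local penalty. The specific functional form is a hypothetical instantiation, not the expression used in this paper: it assumes linear penalties for exceeding the utilization threshold, for delay, for packet loss and for deviating from the average utilization, with one weight per term.

```python
def local_reward(queue_delay, tx_delay, loss, util, avg_util, util_threshold,
                 w_thr=1.0, w_delay=1.0, w_loss=1.0, w_bal=1.0):
    """Hypothetical local reward: a negative weighted sum of penalties."""
    over_threshold = max(0.0, util - util_threshold)
    imbalance = abs(util - avg_util)
    return -(w_thr * over_threshold
             + w_delay * (queue_delay + tx_delay)
             + w_loss * loss
             + w_bal * imbalance)


def composite_reward(global_signal, local, w_global=1.0):
    """Weighted global feedback from the source plus the local reward."""
    return w_global * global_signal + local


r_local = local_reward(queue_delay=0.1, tx_delay=0.2, loss=0.0,
                       util=0.9, avg_util=0.5, util_threshold=0.8)
r = composite_reward(global_signal=1.0, local=r_local, w_global=2.0)
```

Raising the global weight makes the group leader follow the source's path preference more closely, while raising the local weights favors load balancing and staying below the utilization threshold.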


III-C Deep Double Q-Routing with PER

Fig. 2: The learning architecture of the group leader.

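We consider a model-free, off-policy algorithm based on the DDQN algorithm developed in [8], as depicted in Fig. 2 and given in Algorithm 2. As explained in Section II-A, the action selection uses an online neural network with weights $\boldsymbol{\theta}$, referred to as the Q-network, to estimate the action-value function. The input of the neural network is the state of the group leader and the outputs are the estimated action values for each action in that state, where each output unit corresponds to a particular action (link). The action evaluation uses a target network with weights $\boldsymbol{\theta}^-$, which is a copy of the online network weights that is updated every $\tau$ steps as $\boldsymbol{\theta}^- \leftarrow \boldsymbol{\theta}$. The target is expressed as follows

$$Y_t = r_{t+1} + \gamma\, Q\big(s_{t+1}, \arg\max_{a} Q(s_{t+1}, a; \boldsymbol{\theta}_t); \boldsymbol{\theta}_t^-\big),$$

where $\gamma$ is the discount factor, and the parameters are updated as follows

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha \big(Y_t - Q(s_t, a_t; \boldsymbol{\theta}_t)\big) \nabla_{\boldsymbol{\theta}_t} Q(s_t, a_t; \boldsymbol{\theta}_t),$$

where $\alpha$ is the learning rate.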
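The difference between the DDQN target [8] and the DQN target [6] can be seen in a short Python sketch that operates on the action values produced by the two networks for the next state. The function names and the example values are assumptions made for illustration.

```python
def ddqn_target(reward, q_online_next, q_target_next, gamma):
    """Online network selects the action, target network evaluates it."""
    a_star = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[a_star]


def dqn_target(reward, q_target_next, gamma):
    """Target network both selects and evaluates the action."""
    return reward + gamma * max(q_target_next)


q_online_next = [1.0, 3.0, 2.0]   # online network prefers action 1
q_target_next = [5.0, 0.5, 4.0]   # target network values for the same state
print(ddqn_target(1.0, q_online_next, q_target_next, gamma=0.9))  # 1 + 0.9 * 0.5
print(dqn_target(1.0, q_target_next, gamma=0.9))                  # 1 + 0.9 * 5.0
```

In this example the DQN target takes the largest target-network value, whereas the DDQN target uses the target-network value of the action that the online network considers best, which yields a smaller target.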


We use an experience replay memory $\mathcal{M}$ of size $N$ to store the experience of a group leader at each time step $t$ as $(s_t, a_t, r_{t+1}, s_{t+1})$. These experiences are replayed later at a rate that is based on their priority, where the priority depends on how surprising the experience was, as indicated by its TD-error. Specifically, we use a proportional priority memory where the priority of transition $i$ is expressed as follows

$$p_i = |\delta_i| + \epsilon_p,$$

where $\delta_i$ is the TD-error of transition $i$ and $\epsilon_p$ is a small positive constant. The probability of sampling a transition $i$ from $\mathcal{M}$ is given by

$$P(i) = \frac{p_i^{\omega}}{\sum_{k} p_k^{\omega}},$$

where $\omega$ is a parameter that controls the level of prioritization and $\omega = 0$ corresponds to the uniform sampling case. Sampling experiences based on priority introduces a bias, which importance sampling (IS) can correct. In particular, the IS weight of transition $i$ is given by

$$w_i = \left( \frac{1}{N} \cdot \frac{1}{P(i)} \right)^{\beta},$$

where $\beta$ is defined by a schedule that starts with an initial value $\beta_0$ and reaches $1$ at the end of learning. These weights are normalized by the maximum weight $\max_i w_i$.
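Proportional prioritization and the importance sampling correction of PER [9] can be sketched as follows in Python. The function names, the parameter names `omega`, `beta` and `eps`, and the use of a plain list of TD-errors are assumptions made for illustration; practical implementations usually rely on a sum-tree for efficient sampling.

```python
import random


def sampling_probabilities(td_errors, omega, eps=0.01):
    """P(i) proportional to (|delta_i| + eps) ** omega."""
    scaled = [(abs(d) + eps) ** omega for d in td_errors]
    total = sum(scaled)
    return [x / total for x in scaled]


def is_weights(probs, beta):
    """IS weights (1 / (N * P(i))) ** beta, normalized by the maximum."""
    n = len(probs)
    raw = [(1.0 / (n * p)) ** beta for p in probs]
    w_max = max(raw)
    return [w / w_max for w in raw]


def sample_indices(probs, k, rng=random):
    """Draw k transition indices according to their probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=k)


probs = sampling_probabilities([1.0, 3.0], omega=1.0, eps=0.0)
print(probs)                         # [0.25, 0.75]
print(is_weights(probs, beta=1.0))   # rare transition gets the largest weight
batch = sample_indices(probs, k=4)
```

Setting `omega` to 0 recovers uniform sampling, in which case all IS weights equal 1 and no correction is applied.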

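Parameters: Replay memory size $N$, mini-batch size $k$, replay period $K$, exponents $\omega$ and $\beta$, discount factor $\gamma$, learning rate $\alpha$, exploration rate $\epsilon$, update period $\tau$
Observe current state $s_t$ and reward $r_t$
Store transition $(s_{t-1}, a_{t-1}, r_t, s_t)$ in the replay memory $\mathcal{M}$ with priority $p_t = \max_{i<t} p_i$
if $t \bmod \tau = 0$ then
     Update the target network weights $\boldsymbol{\theta}^- \leftarrow \boldsymbol{\theta}$
if $t \bmod K = 0$ then
     Initialize $\Delta = 0$, where $\Delta$ is used to update $\boldsymbol{\theta}$
     for $j = 1$ to $k$ do
          Sample transition $j \sim P(j)$
          Compute target $Y_j$
          Compute IS weight $w_j$
          Normalize IS weight $w_j \leftarrow w_j / \max_i w_i$
          Compute TD-error $\delta_j$
          Update priority $p_j$ from $|\delta_j|$
          Update the weight-change $\Delta \leftarrow \Delta + w_j \delta_j \nabla_{\boldsymbol{\theta}} Q(s_{j-1}, a_{j-1}; \boldsymbol{\theta})$
     Update online weights $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha \Delta$
Choose an action (link) $a_t$ from the possible links $\epsilon$-greedily: a random link with probability $\epsilon$, otherwise $\arg\max_{a} Q(s_t, a; \boldsymbol{\theta})$
return the selected link
Algorithm 2 ChooseLink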
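The final $\epsilon$-greedy step of the ChooseLink procedure, restricted to the links that passed the QoS filter, can be sketched in Python as follows. The function name and the representation of the feasible links as a list of output-unit indices are assumptions made for illustration.

```python
import random


def epsilon_greedy(q_values, feasible, epsilon, rng=random):
    """Pick a feasible link index: explore with probability epsilon."""
    if rng.random() < epsilon:
        # Exploration: any feasible link, uniformly at random.
        return rng.choice(feasible)
    # Exploitation: the feasible link with the highest estimated value.
    return max(feasible, key=lambda a: q_values[a])


q_values = [0.1, 0.9, 0.5]   # one output unit per link
feasible = [0, 2]            # link 1 does not satisfy the QoS constraints
print(epsilon_greedy(q_values, feasible, epsilon=0.0))  # 2
```

Restricting both the random and the greedy choice to the feasible set keeps the agent from selecting a link that violates the requested QoS constraints, even when that link has the highest estimated value.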

IV Performance Evaluation

In this section, we evaluate the performance of our approach on OpenAI Gym [26] considering the topology shown in Fig. 3 and Fig. 4. We consider a dense neural network for the agent with hidden layers, the RMSprop optimizer [27] and the Huber loss. The hidden layers have neurons each and have ReLU activation function. The output layer has a linear activation function. We have selected our parameters as shown in Table I and Table II.
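The building blocks of the agent network, namely ReLU hidden layers, a linear output layer and the Huber loss, can be illustrated with a dependency-free Python sketch. The layer sizes, the weights and the Huber threshold `delta` used below are arbitrary assumptions and do not correspond to the configuration of our experiments.

```python
def huber(error, delta=1.0):
    """Quadratic for small errors, linear for large ones."""
    a = abs(error)
    if a <= delta:
        return 0.5 * error * error
    return delta * (a - 0.5 * delta)


def forward(state, layers):
    """Dense forward pass: ReLU on hidden layers, linear output layer."""
    h = list(state)
    for i, (weights, bias) in enumerate(layers):
        z = [sum(w * x for w, x in zip(row, h)) + b
             for row, b in zip(weights, bias)]
        is_output = i == len(layers) - 1
        h = z if is_output else [max(0.0, v) for v in z]
    return h


layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),  # hidden layer, 2 units
    ([[1.0, 1.0]], [0.5]),                   # linear output, 1 unit
]
print(forward([1.0, -1.0], layers))  # [1.5]
print(huber(0.5), huber(3.0))        # 0.125 2.5
```

The Huber loss limits the gradient magnitude for large TD-errors, which makes training less sensitive to occasional outlier targets than the squared loss.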

Fig. 3: The topology considered in our experiment setup.
Fig. 4: A simplified schematic for the topology of our experiment setup.
Parameter Value
Replay memory size
Mini-batch size
Replay period
Update period
Prioritization exponent 0.5
Importance sampling exponent linearly annealed from to
Minimum exploration rate
Maximum exploration rate
Exploration rate exponent
Exploration rate
Prioritization constant

TABLE I: The hyperparameters used in our experiment.

Parameter Value
Global reward weights
Utilization threshold weight
Delay weight
Packet loss weight
Load balancing weight
Link capacity flow per second
Utilization threshold
Minimum flow duration s
Maximum flow duration s
Requests inter-arrival time
TABLE II: Network simulation parameters.

As a showcase, we pick the source-destination pair and focus mainly on the utilization components of our reward. The source generates random flow requests with random durations and the group leader of group selects a link from link , link and link to connect group and group . Therefore, based on which link is selected, we consider paths between this source-destination pair. We assume that the source prefers the path that involves link , then prefers the path that involves link , then the path through link . Specifically, the source assigns a global reward of to the first path, global reward of to the second path and a global reward of to the third path.

In Fig. 5, we show the utilizations of the three links that group leader manages. In the first steps, the group leader initializes the replay memory by selecting links at random and after that it starts to learn. We observe that the group leader learns the preferences of the source node: the first path, the second, then the third. Moreover, the group leader selects links such that the utilizations of the links do not exceed the predefined utilization threshold and, importantly, the group leader balances the load across these three links while considering the preferences of the source. Specifically, the group leader picks the first link more often than the other two links until the utilization of this link reaches the predefined threshold. The group leader then selects the second link more often than the third link, but we observe that the group leader still selects the third link for some of the requests, even before the utilization of the second link reaches the predefined threshold, to balance the load across these three links. In Fig. 6, we also show the neural network loss of the group leader during the online learning process, which indicates that the performance improves over time. Finally, we compare the total discounted reward of the DQN-based routing algorithm with that of the DDQN-based routing under the same traffic pattern in Fig. 7. As expected, we notice that the DDQN-based routing results in a higher total reward as compared with the DQN-based routing.

Fig. 5: The utilization of the three links of group leader .
Fig. 6: Loss as a function of time.
Fig. 7: Total discounted reward of the DDQN and the DQN-based routing.

V Conclusion

In this paper, we presented our hierarchical approach to the packet routing problem based on the DDQN algorithm. Our approach scales in large networks as the path calculation is hierarchically distributed over designated nodes in the network rather than a centralized node calculating the paths for all nodes in the network. Moreover, our path calculation algorithm adapts dynamically to rapid changes in the network and utilizes the resources based on a policy to determine real-time paths. Our future work includes assessment of our algorithm in large-scale network topologies and quantifying the advantages in terms of the routing performance, memory requirements and communication efficiency.


  • [1] M. Di Ianni, “Efficient delay routing,” Theoretical Computer Science, vol. 196, no. 1-2, pp. 131–151, 1998.
  • [2] C. Filsfils, N. K. Nainar, C. Pignataro, J. C. Cardona, and P. Francois, “The segment routing architecture,” in 2015 IEEE Global Communications Conference (GLOBECOM), pp. 1–6.
  • [3] J. Ash and A. Farrel, “A path computation element (PCE)-based architecture,” IETF, RFC4655, August 2006.
  • [4] C. J. C. H. Watkins, “Learning from delayed rewards,” PhD thesis, King’s College, Cambridge, 1989.
  • [5] H. V. Hasselt, “Double q-learning,” in Advances in Neural Information Processing Systems, 2010, pp. 2613–2621.
  • [6] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
  • [7] S. Halkjær and O. Winther, “The effect of correlated input data on the dynamics of learning,” in Advances in neural information processing systems, 1997, pp. 169–175.
  • [8] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double q-learning,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • [9] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” arXiv preprint arXiv:1511.05952, 2015.
  • [10] Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas, “Dueling network architectures for deep reinforcement learning,” in International Conference on Machine Learning, 2016, pp. 1995–2003.
  • [11] M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver, “Rainbow: Combining improvements in deep reinforcement learning,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [12] J. A. Boyan and M. L. Littman, “Packet routing in dynamically changing networks: A reinforcement learning approach,” in Advances in neural information processing systems, 1994, pp. 671–678.
  • [13] S. Kumar and R. Miikkulainen, “Dual reinforcement q-routing: An on-line adaptive routing algorithm,” in Proceedings of the artificial neural networks in engineering Conference, 1997, pp. 231–238.
  • [14] D. Subramanian, P. Druschel, and J. Chen, “Ants and reinforcement learning: A case study in routing in dynamic networks,” in IJCAI (2).   Citeseer, 1997, pp. 832–839.
  • [15] S. P. Choi and D.-Y. Yeung, “Predictive q-routing: A memory-based reinforcement learning approach to adaptive traffic control,” in Advances in Neural Information Processing Systems, 1996, pp. 945–951.
  • [16] S.-C. Lin, I. F. Akyildiz, P. Wang, and M. Luo, “Qos-aware adaptive routing in multi-layer hierarchical software defined networks: A reinforcement learning approach,” in 2016 IEEE International Conference on Services Computing (SCC).   IEEE, 2016, pp. 25–33.
  • [17] G. Stampa, M. Arias, D. Sanchez-Charles, V. Muntés-Mulero, and A. Cabellos, “A deep-reinforcement learning approach for software-defined networking routing optimization,” arXiv preprint arXiv:1709.07080, 2017.
  • [18] A. Valadarsky, M. Schapira, D. Shahaf, and A. Tamar, “Learning to route with deep rl,” in NIPS Deep Reinforcement Learning Symposium, 2017.
  • [19] T. A. Q. Pham, Y. Hadjadj-Aoul, and A. Outtagarts, “Deep reinforcement learning based qos-aware routing in knowledge-defined networking,” in International Conference on Heterogeneous Networking for Quality, Reliability, Security and Robustness.   Springer, 2018, pp. 14–26.
  • [20] H. Mao, Z. Gong, and Z. Xiao, “Reward design in cooperative multi-agent reinforcement learning for packet routing,” 2018.
  • [21] Z. Xu, J. Tang, J. Meng, W. Zhang, Y. Wang, C. H. Liu, and D. Yang, “Experience-driven networking: A deep reinforcement learning based approach,” in IEEE INFOCOM 2018-IEEE Conference on Computer Communications, pp. 1871–1879.
  • [22] J. Suarez-Varela, A. Mestres, J. Yu, L. Kuang, H. Feng, P. Barlet-Ros, and A. Cabellos-Aparicio, “Feature engineering for deep reinforcement learning based routing,” in 2019 IEEE International Conference on Communications (ICC), pp. 1–6.
  • [23] P. Sun, J. Li, Z. Guo, Y. Xu, J. Lan, and Y. Hu, “Sinet: Enabling scalable network routing with deep reinforcement learning on partial nodes,” in Proceedings of the ACM SIGCOMM 2019 Conference, 2019, pp. 88–89.
  • [24] H. Mao, Z. Gong, Z. Zhang, Z. Xiao, and Y. Ni, “Learning multi-agent communication under limited-bandwidth restriction for internet packet routing,” arXiv preprint arXiv:1903.05561, 2019.
  • [25] X. You, X. Li, Y. Xu, H. Feng, and J. Zhao, “Toward packet routing with fully-distributed multi-agent deep reinforcement learning,” IEEE RAWNET workshop, WiOpt 2019, Avignon, France.
  • [26] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “OpenAI Gym,” arXiv preprint arXiv:1606.01540, 2016.
  • [27] G. Hinton, N. Srivastava, and K. Swersky, “RMSprop: Divide the gradient by a running average of its recent magnitude,” Lecture notes on Neural networks for machine learning.