Towards Cognitive Routing based on Deep Reinforcement Learning

by Jiawei Wu, et al.

Routing is one of the key functions for the stable operation of network infrastructure. The rapid growth of network traffic volume and changing service requirements now call for more intelligent routing methods. To this end, we propose a definition of cognitive routing and an implementation approach based on Deep Reinforcement Learning (DRL). To facilitate research on DRL-based cognitive routing, we introduce a simulator named RL4Net for developing and simulating DRL-based routing algorithms. We then design and implement a DDPG-based routing algorithm. Simulation results on an example network topology show that the DDPG-based routing algorithm achieves better performance than the OSPF and random weight algorithms. These results demonstrate the preliminary feasibility and potential advantage of cognitive routing for future networks.


I Introduction

Routing, the process of selecting a path for packet transmission in a network, is a key function for the stable operation of network infrastructure. Routing technologies can be classified into two categories: non-quality-aware and quality-aware. The most widely used routing protocols and algorithms, such as RIP [1], IGRP [2] and OSPF [3], are non-quality-aware because they cannot use network and service quality information when making routing decisions. Although non-quality-aware routing protocols and algorithms are simple to implement on routers and have worked well for many years, they are challenged by the rapid growth of network traffic volume and changing service requirements. Therefore, a number of quality-aware routing protocols and algorithms have been proposed in recent years, which aim to choose better-performing paths by leveraging network quality metrics such as delay, jitter and loss [4, 5, 6]. However, they are not widely used because they demand more computation capability on routers and require expensive upgrades.

In recent years, with the rapid progress of technologies such as SDN and NFV, a number of works have argued that there is now a good opportunity to implement more complex routing decisions on powerful hardware [7]. For example, Google has shown that separating routing control from routing operation is feasible and achieves better quality assurance in software-defined networks [10]. Inspired by these works, we propose the concept of cognitive routing, which extends quality-aware routing by introducing three key capabilities into the routing decision component: inference, decision and learning. Beyond introducing the concept, we propose an implementation approach based on Deep Reinforcement Learning (DRL). To facilitate research on DRL-based cognitive routing, we develop a simulator named RL4Net for developing and simulating DRL-based routing algorithms. In addition, we design and implement a routing algorithm based on Deep Deterministic Policy Gradient (DDPG). To demonstrate the preliminary feasibility and potential advantage of cognitive routing, we compare the DDPG-based routing algorithm with the OSPF and random weight algorithms. The simulation results on an example network topology show that the DDPG-based routing algorithm achieves better performance.

In summary, the main contributions of our paper are as follows:

  • We introduce the concept of cognitive routing with an implementation approach based on deep reinforcement learning technology.

  • We design and implement a DDPG-based cognitive routing algorithm under the routing-oriented deep reinforcement learning theory framework.

  • We demonstrate the preliminary feasibility and potential advantage of cognitive routing through experiments on a self-developed simulator, which is also a powerful open-source tool for cognitive routing research.

The rest of the paper is organized as follows. In Section II, we introduce the definition of cognitive routing and related work. We then propose a routing-oriented deep reinforcement learning theory framework in Section III. Based on this framework, the design of a DDPG-based routing algorithm is described in Section IV. In Section V, we present the design of RL4Net and the implementation of the DDPG-based routing algorithm on RL4Net. Section VI describes the experiment design and evaluation results. Finally, we conclude and discuss future work in Section VII.

II Cognitive Routing and Related Work

Fig. 1: Cognitive Routing Framework and Reinforcement Learning

Basically, the software of a router is composed of three functional components connected by interfaces: the data plane, the control plane and the management plane. The control plane is responsible for exchanging routing protocol messages and managing routing tables. The data plane forwards data packets according to the routing tables produced by the control plane. For simplicity, we call the routing-table-managing components of the control plane the routing controller, and the data plane the routing operator. In this paper, we focus on the core of the routing controller, the routing algorithm. As mentioned before, routing algorithms can be classified into two categories: non-quality-aware and quality-aware. The most widely used routing algorithms, such as RIP [1], IGRP [2] and OSPF [3], are non-quality-aware. For example, the link state routing (LSR) algorithm used by OSPF chooses the shortest path by considering only link costs, which are usually related to bandwidth. This mechanism may cause congestion in a heavily loaded network. A number of variants, such as ECMP (Equal Cost Multiple Path) [11], attempt to reduce the likelihood of congestion by randomly choosing one of multiple paths with the same distance. However, the absence of network state information limits their improvement. To break this limitation, a number of researchers proposed introducing quality metrics such as delay, jitter and loss into the parameters of the routing algorithm [4, 5, 6]. To deal with the complex optimization problem caused by the enlarged state space, machine learning methods such as Q-learning [12, 13] and neural networks were used to calculate candidate paths for packet transmission [14, 15]. In this process, a rough concept of cognitive routing (as distinct from routing algorithms designed only for cognitive networks [16, 17]) was proposed in [18] and [19]. However, these works did not give a clear definition of cognitive routing. Inspired by them, we define cognitive routing as a mechanism, learned from historical data, for making optimal routing decisions by considering the inferred network quality state. From this definition, a cognitive routing controller must have three capabilities: (1) inferring the network state from monitored data, (2) making routing decisions by considering the network quality state, and (3) learning an optimal routing decision policy from historical data. The architecture of a cognitive-routing-enabled network is shown in Figure 1(a).
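To make the limitation of non-quality-aware routing concrete, the following minimal sketch computes shortest paths from static link costs only, and records equal-cost predecessors as ECMP candidates. The four-router topology, the cost values and the function name `shortest_path_costs` are illustrative assumptions and do not come from the paper.

```python
import heapq


def shortest_path_costs(graph, source):
    """Dijkstra over static link costs.

    Returns (dist, preds): the cost to each reachable node and, for each node,
    the list of predecessors that lie on some least-cost path.
    """
    dist = {source: 0}
    preds = {source: []}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, cost in graph[u].items():
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                preds[v] = [u]
                heapq.heappush(heap, (nd, v))
            elif nd == dist.get(v):
                preds[v].append(u)  # equal-cost alternative (ECMP candidate)
    return dist, preds


# Hypothetical topology: two equal-cost paths from A to D.
graph = {
    "A": {"B": 1, "C": 1},
    "B": {"D": 1},
    "C": {"D": 1},
    "D": {},
}
dist, preds = shortest_path_costs(graph, "A")
```

Nothing in this computation depends on measured delay, jitter or loss, so the selected paths stay the same however congested the links become.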

In Figure 1(a), if we regard the network as an environment and the cognitive routing controller(s) as intelligent agent(s), the architecture of a cognitive-routing-enabled network is similar to the reinforcement learning (RL) framework in Figure 1(b). Therefore, reinforcement learning is a promising underlying methodology for implementing a cognitive routing controller. We are not alone in thinking this way. Applying RL to the routing problem started in 1994 [20], and a number of RL-based routing algorithms have been proposed since [21]. However, these RL-based routing algorithms failed because tabular RL methods cannot handle the explosive space formed by the combination of network states and actions. In recent years, deep reinforcement learning (DRL) has proved to be a good methodology for solving complex optimal control problems. The authors of [22] were the first to use DRL in a routing algorithm. After this, a small number of DRL-based routing algorithms were proposed in [23, 24, 25, 26]. Although these initial works have shown the potential of DRL for routing optimization, there are still a number of problems to be solved to achieve cognitive routing for future networks.

III DRL Problem Definition of Cognitive Routing

Fig. 2: Sample Network Topology

We take the simple network topology shown in Figure 2 as an example to formulate the DRL problem of cognitive routing. Generally, a network can be denoted as $G = (V, E)$. $V$ is the set of nodes, which are routers in the physical network, such as the routers in Figure 2. $E$ is the set of directed links between nodes, which are optical fibers or copper cables between routers. If there is a directed link that can send packets from router $v_i$ to router $v_j$, we have $e_{ij} \in E$. Otherwise, we have $e_{ij} \notin E$. In a period of time $t$, a set of packets $P_t$ is transmitted between routers. Each packet $p \in P_t$ has an end-to-end delivery delay $d_p$, such as the delay of a packet sent from one router to another in Figure 2. Under this condition, if we have an intelligent agent (or a set of intelligent agents) that can observe the network environment and take actions on routers, we can define the factors of deep reinforcement learning as below:

  • state: Each packet $p \in P_t$ comes into the network via a source router and departs from the network via a destination router. For all packets in $P_t$, we have a traffic matrix $TM_t = [tm_{ij}]$, where $tm_{ij}$ is the sum of the sizes of the packets transmitted from $v_i$ to $v_j$ in time slot $t$. We define the state of the network environment $s_t$ ($s_t \in S$, where $S$ is the state space) as: $s_t = TM_t$.

  • action: An action represents how the agent changes the environment. In the routing context, the action of an intelligent controller is setting the routing tables of routers. Therefore, we define the action at time $t$ as the set of link weights of all nodes. Each node $v_i$ has a weight vector $W_i = [w_{ij}]$, where $w_{ij}$ is the weight of the link from $v_i$ to $v_j$. Then, we define the action at time $t$, $a_t$ ($a_t \in A$, where $A$ is the action space), as: $a_t = [W_1, W_2, \ldots, W_N]$, where $N$ is the number of nodes.

  • reward: The reward is the feedback information from the environment to the agent after the agent takes an action. Different rewards can be defined for different network optimization purposes. In this paper, we optimize the end-to-end delay of packet delivery. Therefore, we define the reward $r_t$ as the average delay of the packets in time slot $t$.

  • policy: The policy $\pi$ of the agent is represented by a conditional probability distribution: $\pi(a \mid s) = P(a_t = a \mid s_t = s)$.


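1: Initialize online actor network $\mu(s \mid \theta^{\mu})$ and online critic network $Q(s, a \mid \theta^{Q})$ with random parameters $\theta^{\mu}$ and $\theta^{Q}$, respectively.
2: Initialize target actor network $\mu'$ and target critic network $Q'$ with parameters $\theta^{\mu'} \leftarrow \theta^{\mu}$ and $\theta^{Q'} \leftarrow \theta^{Q}$, respectively.
3: Initialize a replay buffer $B$ with a capacity $C$ and a sample threshold $K$.
4: for episode = 1, …, M do
5:     $s_1$ = env.reset()
6:     for t = 1, …, T do
7:         $a_t = \mu(s_t \mid \theta^{\mu}) + \mathcal{N}_t$
8:         $s_{t+1}, r_t$ = env.execute($a_t$)
9:         $B$.push($(s_t, a_t, r_t, s_{t+1})$)
10:        if $B$.size > $K$ then
11:            $\{(s_i, a_i, r_i, s_{i+1})\}_{i=1}^{N}$ = $B$.sample($N$)
12:            $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'})$
13:            Update $\theta^{Q}$ by minimizing loss function:
14:            $L = \frac{1}{N} \sum_{i} \left(y_i - Q(s_i, a_i \mid \theta^{Q})\right)^2$
15:            Update $\theta^{\mu}$ by applying policy gradient:
16:            $\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i} \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s=s_i, a=\mu(s_i)} \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s=s_i}$
17:            Update target networks:
18:            $\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau) \theta^{Q'}$
19:            $\theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau) \theta^{\mu'}$
20:        end if
21:    end for
22: end for
Algorithm 1 Cognitive Routing Algorithm based on DDPG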

With the above definitions, we can formulate the DRL problem for cognitive routing as an optimization problem: how to find an optimized policy that maximizes the reward.
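The state and the delay statistic defined above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `Packet` record, its field names and the helper functions are hypothetical, routers are identified by integer indices, and the sketch stops at the average delay because the text does not say whether it is negated or otherwise transformed before being used as a reward to maximize.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    src: int      # index of the source router (hypothetical field)
    dst: int      # index of the destination router (hypothetical field)
    size: int     # packet size, in whatever unit the simulator reports
    delay: float  # end-to-end delivery delay


def traffic_matrix(packets, num_nodes):
    """State: entry [i][j] is the total size of packets sent from router i to router j."""
    tm = [[0] * num_nodes for _ in range(num_nodes)]
    for p in packets:
        tm[p.src][p.dst] += p.size
    return tm


def average_delay(packets):
    """Average end-to-end delay of the packets observed in one time slot."""
    return sum(p.delay for p in packets) / len(packets)


# Hypothetical packets observed in one time slot on a 3-router network.
packets = [
    Packet(src=0, dst=1, size=1024, delay=2.0),
    Packet(src=0, dst=1, size=1024, delay=4.0),
    Packet(src=1, dst=2, size=512, delay=3.0),
]
state = traffic_matrix(packets, num_nodes=3)
mean_delay = average_delay(packets)
```

An action in this setting would be one weight per directed link, for example a mapping from `(i, j)` pairs to floats.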

IV Design of DDPG-based Routing Algorithm

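The task of the DRL agent is to optimize its policy to maximize the reward. For a state $s_t$ at time slot $t$, we define a value function $V^{\pi}(s_t)$ to evaluate the value obtained by following policy $\pi$. We use a discount rate $\gamma \in [0, 1]$ to decay future rewards. $V^{\pi}(s_t)$ is evaluated by accumulating the discounted reward as follows:

$$V^{\pi}(s_t) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t\right]$$

We define a Q-function as:

$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t, a_t\right]$$

An optimal policy $\pi^{*}$ maximizes $Q^{\pi}(s_t, a_t)$: $\pi^{*} = \arg\max_{\pi} Q^{\pi}(s_t, a_t)$. Therefore, the optimization problem can be solved by iteratively updating the Q-function with the Temporal Difference (TD) between the target Q-value and the current Q-value for all state-action pairs: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\right]$, where $\alpha$ is a hyper-parameter of the training process named the learning rate.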
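The discounted accumulation and the temporal-difference update described above can be illustrated with a small tabular sketch. The function names, the dictionary-based Q-table and the default value of 0 for unseen state-action pairs are assumptions made for illustration. DDPG itself replaces the table with neural networks.

```python
def discounted_return(rewards, gamma):
    """Accumulate rewards from the end: g = r_t + gamma * g_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def td_update(q, state, action, reward, next_state, actions, alpha, gamma):
    """One tabular TD step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    target = reward + gamma * max(q.get((next_state, a), 0.0) for a in actions)
    current = q.get((state, action), 0.0)
    q[(state, action)] = current + alpha * (target - current)
    return q[(state, action)]


# Hypothetical usage with two actions and a partially filled Q-table.
q_table = {("s1", "a"): 2.0}
new_value = td_update(q_table, "s0", "a", reward=1.0, next_state="s1",
                      actions=["a", "b"], alpha=0.5, gamma=0.9)
```

Here the target is 1.0 + 0.9 * 2.0 = 2.8, and with a learning rate of 0.5 the entry for `("s0", "a")` moves halfway from 0 toward it.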

We choose the widely used Deep Deterministic Policy Gradient (DDPG) algorithm [27] to solve the optimization problem. The resulting DDPG-based cognitive routing algorithm is shown in Algorithm 1.

In Algorithm 1, line 5 resets the environment and gets the initial state of each episode. In line 7, $\mathcal{N}_t$ is the exploration noise, which is generated by an Ornstein-Uhlenbeck process (OUProcess) [28]. In line 11, we sample a batch of tuples $(s_i, a_i, r_i, s_{i+1})$ from the replay buffer $B$. Lines 13-19 update the online critic and actor networks and then the target networks of the critic and actor.
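Three supporting pieces of the algorithm (exploration noise, the replay buffer with a sample threshold, and the soft target update) can be sketched without any deep learning framework. This is a minimal sketch, not the authors' implementation: the class and function names are hypothetical, parameters are represented as flat lists of floats, and the defaults `theta=0.15` and `sigma=0.2` are commonly used values rather than the settings of this paper.

```python
import random
from collections import deque


class OUNoise:
    """Discretised Ornstein-Uhlenbeck process:
    x <- x + theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, dt=1.0, seed=None):
        self.size, self.mu, self.theta = size, mu, theta
        self.sigma, self.dt = sigma, dt
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        self.x = [self.mu] * self.size

    def sample(self):
        self.x = [
            x + self.theta * (self.mu - x) * self.dt
            + self.sigma * (self.dt ** 0.5) * self.rng.gauss(0.0, 1.0)
            for x in self.x
        ]
        return list(self.x)


class ReplayBuffer:
    """Fixed-capacity buffer; training starts once size exceeds the threshold."""

    def __init__(self, capacity, threshold):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first
        self.threshold = threshold

    def push(self, transition):
        self.buffer.append(transition)

    def ready(self):
        return len(self.buffer) > self.threshold

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)


def soft_update(target_params, online_params, tau):
    """Target parameters slowly track the online ones: tau * online + (1 - tau) * target."""
    return [tau * o + (1.0 - tau) * t for t, o in zip(target_params, online_params)]
```

A small `tau` makes the target networks change slowly, which is what keeps the TD target in line 12 of Algorithm 1 from chasing a rapidly moving critic.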

V RL4Net and Algorithm Implementation

Fig. 3: Architecture of RL4Net Simulator

Implementing a reinforcement learning environment and algorithms from scratch is a difficult task. Inspired by the work of [29], we develop a tool named RL4Net (Reinforcement Learning for Network) to facilitate the research and simulation of reinforcement-learning-based cognitive routing. Figure 3 shows the architecture of RL4Net, which is composed of two functional blocks:

  • Environment: The environment is built on the widely used ns3 network simulator [30]. We extend ns3 with six components: (1) Metric Extractor, for computing quality metrics such as delay and loss from ns3; (2) Computers, for translating quality metrics into the DRL state and reward; (3) Action Operator, for getting action commands from the agent; (4) Action Executor, for performing ns3 operations according to actions; (5) ns3Env, for transforming the ns3 object into a DRL environment; (6) envInterface, for translating between ns3 data and DRL factors.

  • Agent: The agent is a container for a DRL-based cognitive routing algorithm. An agent can be built on various deep learning frameworks such as PyTorch and TensorFlow. We implement our DDPG-based routing algorithm on PyTorch. The algorithm implementation is a Python program following the logic of Algorithm 1.
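The split between environment and agent can be illustrated with a toy interaction loop. Everything in this sketch is a stand-in: `ToyRoutingEnv`, `RandomAgent` and their methods are hypothetical and are not RL4Net's actual interface, the `reset`/`execute` names simply mirror the calls used in Algorithm 1, and the reward formula is an arbitrary toy cost rather than a simulated delay.

```python
import random


class ToyRoutingEnv:
    """Stand-in for a simulator-backed environment (not RL4Net's real API)."""

    def __init__(self, num_links, seed=0):
        self.num_links = num_links
        self.rng = random.Random(seed)
        self.state = [0.0] * num_links

    def reset(self):
        # Toy state: one random load value per link.
        self.state = [self.rng.random() for _ in range(self.num_links)]
        return self.state

    def execute(self, action):
        # Toy cost: load on each link multiplied by the weight chosen for it.
        cost = sum(load * w for load, w in zip(self.state, action))
        self.state = [self.rng.random() for _ in range(self.num_links)]
        return self.state, -cost


class RandomAgent:
    """Picks link weights at random, similar in spirit to a random weight baseline."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def act(self, state):
        return [self.rng.random() for _ in state]


def run_episode(env, agent, steps):
    """Generic agent-environment loop: observe, act, receive reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(steps):
        action = agent.act(state)
        state, reward = env.execute(action)
        total_reward += reward
    return total_reward
```

Because the loop only touches `reset`, `execute` and `act`, a learning agent could replace `RandomAgent` without changing the environment side.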


Specifically, we use fully connected neural networks to implement the actor and critic of DDPG. Each actor network has four layers: one input layer, two hidden layers and one output layer, and the neuron number of each layer is a hyper-parameter. To scale up the action output, we multiply the output of the softmax layer by a scale-up parameter. The critic network is composed of three layers: one input layer, one hidden layer and one output layer, whose neuron numbers are likewise hyper-parameters. Both actor and critic networks use ReLU as the activation function.
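The actor's forward pass described above (two ReLU hidden layers, a softmax output, then a scale-up factor) can be sketched in plain Python. The layer sizes 4, 8, 8 and 3, the scale of 10, the weight initialisation and all function names are illustrative assumptions, not the paper's settings. The paper's own implementation uses PyTorch.

```python
import math
import random


def relu(v):
    return [max(0.0, x) for x in v]


def softmax(v):
    m = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def linear(v, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]


def init_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def actor_forward(state, layers, scale):
    """ReLU on hidden layers, softmax on the output layer, then scale up."""
    h = state
    for weights, bias in layers[:-1]:
        h = relu(linear(h, weights, bias))
    weights, bias = layers[-1]
    return [scale * p for p in softmax(linear(h, weights, bias))]


# Hypothetical sizes: 4 inputs, two hidden layers of 8 neurons, 3 link weights.
rng = random.Random(0)
layers = [init_layer(4, 8, rng), init_layer(8, 8, rng), init_layer(8, 3, rng)]
link_weights = actor_forward([0.2, 0.0, 0.5, 0.1], layers, scale=10.0)
```

Since softmax outputs are positive and sum to one, the scaled outputs are positive and sum to the scale factor, which keeps every link weight strictly positive.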

VI Experimental Evaluation

Fig. 4: Loss of critic network
Fig. 5: Loss of actor network

VI-A Experiment Setup

In the experiment, we set the neuron numbers of DDPG actor and critic networks as , , , , , and . The scale-up parameter is set to . In addition, the learning rate of actor and critic, parameters and in Algorithm 1 are set to , , and , respectively. The exploration noise is generated by parameters of , and . The parameters of experience replay buffer are , and .

To evaluate the proposed DDPG-based routing algorithm, we configure an experimental network topology as in Figure 2. The bandwidth of every link is 5 Mbps. On this network, we generate a 4.636 Mbps UDP flow with a packet size of 1024 from a source router to a destination router, which makes the link work under heavy load. Under this setting, we compare the average end-to-end packet delivery delay with that of two other routing algorithms, OSPF and random weight. The random weight algorithm sets the weight vector of each router randomly.
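As a quick check of the load level in this setup, the offered rate can be compared with the link capacity. The calculation assumes that the whole flow is carried over a single path, which is an assumption made here for illustration:

$$\rho = \frac{4.636\ \text{Mbps}}{5\ \text{Mbps}} = 0.9272$$

A link carrying the entire flow would therefore be roughly 93% utilized, leaving about 0.364 Mbps of headroom.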

VI-B Experiment Results

Figure 4 shows the values of the loss function of the critic network. As mentioned before, the target Q-value is $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1}))$. The loss function is the mean squared TD-error between the Q-value and its target Q-value: $L = \frac{1}{N} \sum_{i} \left(y_i - Q(s_i, a_i)\right)^2$. We trained the DDPG model for 43,100 steps. The value of $L$ decreases as the number of steps increases, which means the TD-error between the Q-value and the target Q-value decreases. After 15,000 steps, the loss remains stable at a small value, which means the critic network is sufficiently well trained. Therefore, we only plot the values for steps 1-15,000.

Figure 5 shows the values of the loss function of the actor network. We set $\bar{Q}$ as the mean of the output of the critic network, $\bar{Q} = \frac{1}{N} \sum_{i} Q(s_i, \mu(s_i))$. The loss function of the actor network is $-\bar{Q}$. We can see that the value of $\bar{Q}$ improves gradually from 1 to 16 during steps 1-2,000. After that, $\bar{Q}$ remains stable from step 2,000 to step 15,000, which means the actor network is successfully trained.
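The two loss values plotted in Figures 4 and 5 can be computed from a sampled batch as sketched below. The function names and the plain-list batch representation are assumptions. In particular, taking the actor loss as the negated mean critic output follows common DDPG implementations and is an assumption about how the actor loss is formed here.

```python
def td_targets(rewards, next_q_values, gamma):
    """Target Q-value per sample: r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))."""
    return [r + gamma * q_next for r, q_next in zip(rewards, next_q_values)]


def critic_loss(q_values, targets):
    """Mean squared TD-error between current Q-values and their targets."""
    return sum((y - q) ** 2 for q, y in zip(q_values, targets)) / len(q_values)


def actor_loss(q_of_policy_actions):
    """Negated mean critic output for the actor's own actions (assumed form)."""
    return -sum(q_of_policy_actions) / len(q_of_policy_actions)


# Hypothetical batch of two transitions.
targets = td_targets(rewards=[1.0, 0.0], next_q_values=[2.0, 4.0], gamma=0.5)
loss_c = critic_loss(q_values=[1.0, 3.0], targets=targets)
loss_a = actor_loss([1.0, 3.0])
```

With these numbers both targets equal 2.0, the TD-errors are +1 and -1, so the critic loss is 1.0, and the actor loss is -2.0. Minimizing the actor loss is the same as increasing the mean Q-value of the actions the actor proposes.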

Figure 6 shows the average end-to-end packet delivery delay over every 100 steps. The delay decreases during steps 1-4,000. After that, it remains around 2.3 ms, which shows that the DDPG algorithm has converged to a stable low-delay policy.

Figure 7 shows the average delay of the DDPG-based, OSPF and random weight routing algorithms. After training, the proposed DDPG-based routing algorithm achieves the best performance, with the lowest end-to-end packet delivery delay.

Fig. 6: Average delay of training process
Fig. 7: Comparison of average delay

VII Conclusion and Future Work

In this paper, we introduced a definition of cognitive routing with inference, decision and learning capabilities. Based on this definition, we proposed a deep reinforcement learning (DRL) based cognitive routing framework by defining the DRL factors in the cognitive routing environment. To facilitate the research and evaluation of DRL-based routing, we designed and developed a tool named RL4Net. A DDPG-based cognitive routing algorithm was designed and implemented on RL4Net. The experimental evaluation results showed that the proposed DDPG-based routing algorithm performs better than the OSPF and random weight algorithms. Our work demonstrates the potential of DRL for achieving cognitive routing. In the future, we plan to extend RL4Net so that it can configure routers in a test network for algorithm evaluation. In addition, we will design and implement more algorithms in search of effective DRL-based cognitive routing algorithms that can be used in real networks.


  • [1] Malkin, G. "RIP Version 2." STD 56, RFC 2453, November 1998.
  • [2] Hedrick, C. L. "An Introduction to IGRP." Rutgers, The State University of New Jersey, Center for Computers and Information Services, Laboratory for Computer Science Research, 1991: 33.
  • [3] Moy, J. T. OSPF: Anatomy of an Internet Routing Protocol. Addison-Wesley Professional, 1998.
  • [4] Paul, Pragyansmita, and S. V. Raghavan. "Survey of QoS routing." Proceedings of the international conference on computer communication. Vol. 15. No. 1. 2002.
  • [5] Hanzo, Lajos, and Rahim Tafazolli. "A survey of QoS routing solutions for mobile ad hoc networks." IEEE Communications Surveys and Tutorials 9.2 (2007): 50-70.
  • [6] Chen, Lei, and Wendi B. Heinzelman. "A survey of routing protocols that support QoS in mobile ad hoc networks." IEEE Network 21.6 (2007): 30-38.
  • [7] Xie, J., Yu, F. R., Huang, T., Xie, R., Liu, J., Wang, C., Liu, Y. "A survey of machine learning techniques applied to software defined networking (SDN): Research issues and challenges." IEEE Communications Surveys and Tutorials 21.1 (2018): 393-430.
  • [8] Hu, Fei, Qi Hao, and Ke Bao. "A survey on software-defined network and openflow: From concept to implementation." IEEE Communications Surveys and Tutorials 16.4 (2014): 2181-2206.
  • [9] ETSI, Network Functions Virtualisation. "Network Functions Virtualisation (NFV)." Management and Orchestration 1 (2014): V1.
  • [10] Kabbani, Abdul, et al. "Flowbender: Flow-level adaptive routing for improved latency and throughput in datacenter networks." Proceedings of the 10th ACM International on Conference on emerging Networking Experiments and Technologies. 2014.
  • [11] Dzida, M., et al. "Optimization of the shortest-path routing with equal-cost multi-path load balancing." 2006 International Conference on Transparent Optical Networks. Vol. 3. IEEE, 2006.
  • [12] Santhi, G., et al. "Q-learning based adaptive QoS routing protocol for MANETs." 2011 International Conference on Recent Trends in Information Technology (ICRTIT). IEEE, 2011.
  • [13] Wu, Celimuge, Satoshi Ohzahata, and Toshihiko Kato. "Flexible, portable, and practicable solution for routing in VANETs: A fuzzy constraint Q-learning approach." IEEE Transactions on Vehicular Technology 62.9 (2013): 4251-4263.
  • [14] Azzouni, Abdelhadi, Raouf Boutaba, and Guy Pujolle. "NeuRoute: Predictive dynamic routing for software-defined networks." 2017 13th International Conference on Network and Service Management (CNSM). IEEE, 2017.
  • [15] Zhuang, Zirui, et al. "Toward Greater Intelligence in Route Planning: A Graph-Aware Deep Learning Approach." IEEE Systems Journal (2019).
  • [16] Cesana, Matteo, Francesca Cuomo, and Eylem Ekici. "Routing in cognitive radio networks: Challenges and solutions." Ad Hoc Networks 9.3 (2011): 228-248.
  • [17] Qadir, Junaid. "Artificial intelligence based cognitive routing for cognitive radio networks." Artificial Intelligence Review 45.1 (2016): 25-96.
  • [18] Jian, Cao, et al. "A data driven cognitive routing protocol for information-centric networking." Journal of Computer Research and Development 52.4 (2015): 798.
  • [19] Francois, Frederic, and Erol Gelenbe. "Optimizing secure SDN-enabled inter-data centre overlay networks through cognitive routing." 2016 IEEE 24th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS). IEEE, 2016.
  • [20] Boyan, Justin A., and Michael L. Littman. "Packet routing in dynamically changing networks: A reinforcement learning approach." Advances in Neural Information Processing Systems. 1994.
  • [21] Mammeri, Zoubir. "Reinforcement Learning Based Routing in Networks: Review and Classification of Approaches." IEEE Access 7 (2019): 55916-55950.
  • [22] G. Stampa, M. Arias, D. Sanchez-Charles, V. Muntes-Mulero and A. Cabellos, "A deep-reinforcement learning approach for software-defined networking routing optimization," arXiv preprint arXiv:1709.07080, 2017.
  • [23] H. Mao, Y. Ni, Z. Gong, W. Ke, C. Ma, Y. Xiao, Y. Wang, J. Wang, Q. Wang, X. Liu, Y. Song, Z. Zhang and Z. Xiao, "ACCNet: Actor-Coordinator-Critic Net for 'Learning-to-Communicate' with Deep Multi-agent Reinforcement Learning," arXiv:1706.03235v1, 2017.
  • [24] P. Quang, Y. Hadjadj-Aoul and A. Outtagarts, "Deep Reinforcement Learning based QoS-aware Routing in Knowledge-defined networking," Qshine 2018 - 14th EAI International Conference on Heterogeneous Networking for Quality, Reliability, Security and Robustness, Dec 2018, Ho Chi Minh City, Vietnam. pp. 1-13. hal-01933970.
  • [25] Z. Xu, J. Tang, J. Meng, W. Zhang, Y. Wang, C. Liu and D. Yang, "Experience-driven Networking: A Deep Reinforcement Learning based Approach," arXiv:1801.05757v1, 2018.
  • [26] R. Ding, Y. Xu, F. Gao, X. Shen, and W. Wu, "Deep Reinforcement Learning for Router Selection in Network With Heavy Traffic," IEEE Access, vol. 7, pp. 37109-37120, 2019.
  • [27] Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
  • [28] Uhlenbeck, George E., and Leonard S. Ornstein. "On the theory of the Brownian motion." Physical Review 36.5 (1930): 823.
  • [29] Gawłowicz, Piotr, and Anatolij Zubow. "ns3-gym: Extending openai gym for networking research." arXiv preprint arXiv:1810.03943 (2018).
  • [30] Riley, G. F., and T. R. Henderson. "The ns-3 Network Simulator." Modeling and Tools for Network Simulation. Springer, Berlin, Heidelberg, 2010.