Routing, the process of selecting a path for packet transmission in networks, is a key function for the stable operation of network infrastructure. Broadly, routing technologies fall into two categories: non-quality-aware and quality-aware. Most widely used routing protocols and algorithms, such as RIP, IGRP and OSPF, are non-quality-aware because they cannot make routing decisions using network and service quality information. Although non-quality-aware routing protocols and algorithms are simple to implement on routers and have worked well for many years, they are challenged by the rapid growth of network traffic volume and changing service requirements. Therefore, a number of quality-aware routing protocols and algorithms have been proposed in recent years, which aim to choose paths with better performance by leveraging network quality metrics like delay, jitter and loss [4, 5, 6]. However, they are not widely used because of the higher computation capability they require on routers and the high cost of upgrades.
In recent years, with the rapid progress of new technologies like SDN and NFV, a number of research works have shown that a good opportunity has arisen to implement more complex routing decisions on powerful hardware. For example, Google has demonstrated that separating routing control from routing operation is feasible for achieving better quality assurance in software defined networks. Inspired by these works, we propose the concept of cognitive routing, which extends quality-aware routing by introducing three key capabilities into the routing decision component: inference, decision and learning. Beyond the concept itself, we propose an implementation approach based on Deep Reinforcement Learning (DRL). To facilitate research on DRL-based cognitive routing, we developed a simulator named RL4Net for the development and simulation of DRL-based routing algorithms. In addition, we designed and implemented a Deep Deterministic Policy Gradient (DDPG) based routing algorithm. To demonstrate the preliminary feasibility and potential advantages of cognitive routing, we compare the DDPG-based routing algorithm with the OSPF and random weight algorithms. The simulation results on an example network topology show that the DDPG-based routing algorithm achieves better performance.
In summary, the main contributions of our paper are as follows:
We introduce the concept of cognitive routing with an implementation approach based on deep reinforcement learning technology.
We design and implement a DDPG-based cognitive routing algorithm under the routing-oriented deep reinforcement learning theory framework.
We demonstrate the preliminary feasibility and potential advantages of cognitive routing through experiments on a self-developed simulator, which is also a powerful open source tool for cognitive routing research.
The rest of our paper is organized as follows: In Section II, we introduce the definition of cognitive routing and related work. Then, we propose a routing-oriented deep reinforcement learning theory framework in Section III. Based on this framework, the design of a DDPG-based routing algorithm is described in Section IV. In Section V, we illustrate the design of RL4Net and the implementation of the DDPG-based routing algorithm on it. Section VI describes the experiment design and evaluation results. At last, we conclude our work and discuss future work in Section VII.
II Cognitive Routing and Related Work
Basically, the software of a router is composed of three functional components connected by interfaces: the data plane, the control plane and the management plane. The control plane is responsible for exchanging routing protocol messages and managing routing tables. The data plane forwards data packets following the routing tables produced by the control plane. For simplicity, we call the routing table managing component of the control plane the routing controller, and the data plane the routing operator. In this paper, we focus on the core of the routing controller: the routing algorithm. As mentioned before, we can classify routing algorithms into two categories, non-quality-aware and quality-aware. Most widely used routing algorithms like RIP, IGRP and OSPF are non-quality-aware. For example, the link state routing (LSR) algorithm used by OSPF chooses the shortest path considering only link costs, which are usually related to bandwidth. This mechanism may cause congestion in heavily loaded networks. A number of variants like ECMP (Equal Cost Multiple Path) attempt to decrease the possibility of congestion by randomly choosing a path from multiple paths with the same distance. However, the absence of network state information limits their improvement. To break this limitation, a number of researchers proposed to introduce quality metrics like delay, jitter and loss into the parameters of the routing algorithm [4, 5, 6]. To deal with the complex optimization problem posed by the increased state space, machine learning methods like Q-learning [12, 13] and neural networks were used to calculate candidate paths for packet transmission [14, 15]. In this process, a rough concept of cognitive routing (distinct from the routing algorithms for cognitive networks [16, 17]) has been proposed, but without a clear definition. Inspired by these works, we define cognitive routing as: a mechanism learned from historical data for making optimal routing decisions by considering the inferred network quality state. From this definition, we can see that a cognitive routing controller must have three capabilities: (1) inferring the network state from monitored data, (2) making routing decisions based on the network quality state, and (3) learning an optimal routing decision policy from historical data. The architecture of a cognitive routing enabled network is shown in Figure 1(a).
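As a point of contrast with the quality-aware approaches above, the shortest-path computation at the heart of link-state routing can be sketched in a few lines; the topology and link costs below are purely illustrative, and the point is that the costs are static, reflecting no current load information:

```python
import heapq

def dijkstra(graph, src):
    """Shortest-path distances from src over static link costs,
    in the spirit of link-state routing (e.g., OSPF)."""
    dist = {src: 0}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, cost in graph.get(u, {}).items():
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

# Toy topology: costs are fixed per link, regardless of congestion.
graph = {
    "A": {"B": 1, "C": 4},
    "B": {"C": 1, "D": 5},
    "C": {"D": 1},
    "D": {},
}
```

Because the costs never change with traffic, every A-to-D packet follows the same path even when that path is congested, which is exactly the defect quality-aware routing tries to fix.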
In Figure 1(a), if we regard the network as an environment and the cognitive routing controller(s) as intelligent agent(s), the architecture of a cognitive routing enabled network is similar to the reinforcement learning (RL) framework in Figure 1(b). Therefore, reinforcement learning is a promising underlying methodology for implementing a cognitive routing controller. Indeed, we are not alone in thinking this way. Applying RL to the routing problem started as early as 1994. Since then, a number of RL-based routing algorithms have been proposed. However, these early RL-based routing algorithms failed because tabular RL methods cannot handle the explosive space of combinations of network states and actions. In recent years, deep reinforcement learning (DRL) has been proved to be a good methodology for solving complex optimal control problems, and after DRL was first applied to routing, a small number of DRL-based routing algorithms were proposed [23, 24, 25, 26]. Although these initial works have demonstrated the potential of DRL for routing optimization, a number of problems remain to be solved to achieve cognitive routing for future networks.
III DRL Problem Definition of Cognitive Routing
We take the simple network topology shown in Figure 2 as an example to formulate the DRL problem of cognitive routing. Generally, a network can be denoted as $G = (V, E)$. $V$ is the set of nodes, which are the routers in the physical network of Figure 2. $E$ is the set of directed links between nodes, which are the optical fibers or copper cables between routers. If there is a directed link that can send packets from router $v_i$ to router $v_j$, we have $(v_i, v_j) \in E$; otherwise, $(v_i, v_j) \notin E$. In a period of time $T$, a set of packets $P$ is transmitted between routers. Each packet $p \in P$ has an end-to-end delivery delay $d_p$, such as the delay of a packet delivered between a source and a destination router in Figure 2. Under these conditions, if we have an intelligent agent (or a set of intelligent agents) that can observe the network environment and take actions on the routers, we can define the factors of deep reinforcement learning as below:
state: Each packet $p$ comes into the network via a source router and departs from the network via a destination router, as the packet in Figure 2 is sent from its source node to its destination node. For all packets in time slot $t$, we have a traffic matrix $TM_t$, whose entry $tm_{ij}$ is the sum of the sizes of the packets transmitted from $v_i$ to $v_j$ in time slot $t$. We define the state of the network environment ($s_t \in \mathcal{S}$, where $\mathcal{S}$ is the state space) as $s_t = TM_t$.
action: The action represents how the agent changes the environment. In the routing context, the action of an intelligent controller is setting the routing tables of the routers. Therefore, we define the action at time $t$ as the set of link weights of all nodes. Each node $v_i$ has a weight vector $w_i$, where $w_{ij}$ is the weight of the link from $v_i$ to $v_j$. Then, we define the action at time $t$ ($a_t \in \mathcal{A}$, where $\mathcal{A}$ is the action space) as $a_t = \{w_1, w_2, \ldots, w_{|V|}\}$.
reward: The reward is the feedback from the environment to the agent after the agent takes an action. Different rewards can be defined for different network optimization purposes. In this paper, we aim to optimize the end-to-end delay of packet delivery. Therefore, we define the reward from the average delay $\bar{d}_t$ of packets in time slot $t$ as $r_t = -\bar{d}_t$, so that maximizing the reward minimizes the delay.
policy: The policy $\pi$ of the agent is represented by a distribution of conditional probabilities $\pi(a_t \mid s_t)$.
With the above definitions, we can formulate the DRL problem of cognitive routing as an optimization problem: finding an optimal policy $\pi^*$ that maximizes the expected cumulative reward.
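To make the four factors concrete, here is a minimal sketch for a hypothetical 3-router network; the traffic sizes, link weights, and delays are invented for illustration, and the negative sign on the reward follows the convention above that makes reward maximization equivalent to delay minimization:

```python
# Hypothetical 3-router example of the DRL factors defined above.
n = 3

# state s_t: traffic matrix TM_t, tm[i][j] = total size of packets
# sent from router i to router j in time slot t
tm = [[0, 1200, 0],
      [0, 0, 800],
      [500, 0, 0]]
state = [cell for row in tm for cell in row]  # flattened for the agent

# action a_t: one weight vector per router; action[i][j] is the
# weight of the link from router i to router j
action = [[0.0, 0.7, 0.3],
          [0.2, 0.0, 0.8],
          [0.5, 0.5, 0.0]]

# reward r_t: negative average end-to-end delay in slot t (ms)
delays = [2.1, 2.4, 2.2, 2.5]
reward = -sum(delays) / len(delays)
```

An agent that raises `reward` over time is, by construction, driving the average delivery delay down.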
IV Design of DDPG-based Routing Algorithm
The task of the DRL agent is to optimize its policy to maximize the reward. For a state $s_t$ at time slot $t$, we define a value function $V^{\pi}(s_t)$ to evaluate the value obtained by following policy $\pi$. We use a discount rate $\gamma \in [0, 1]$ to decay future rewards. $V^{\pi}(s_t)$ is evaluated by accumulating discounted rewards as follows: $V^{\pi}(s_t) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k}\right]$.
We define a Q-function as $Q^{\pi}(s_t, a_t) = \mathbb{E}\left[r_t + \gamma V^{\pi}(s_{t+1})\right]$.
An optimal policy $\pi^*$ maximizes the Q-function: $\pi^* = \arg\max_{\pi} Q^{\pi}(s_t, a_t)$. Therefore, the optimization problem can be solved by updating the Q-function according to the Temporal Difference (TD) between the target Q-value and the current Q-value through an iterative process over all state-action pairs: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$, where $\alpha$ is a hyper-parameter named the learning rate in the training process.
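The TD update above can be sketched for a single state-action pair; the numeric values in the usage line are illustrative only:

```python
def td_update(q, alpha, reward, gamma, q_next):
    """One temporal-difference step: move Q(s,a) toward the
    target r + gamma * max_a' Q(s', a') by learning rate alpha."""
    target = reward + gamma * q_next
    return q + alpha * (target - q)

# Example: current Q = 0.0, alpha = 0.5, r = 1.0, gamma = 0.9,
# best next-state Q = 2.0 -> target is 2.8, Q moves halfway toward it.
q_new = td_update(0.0, 0.5, 1.0, 0.9, 2.0)
```

Repeating this update over all visited state-action pairs shrinks the TD-error, which is exactly the quantity the critic loss in Section VI tracks.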
In Algorithm 1, line 5 resets the environment and gets the initial state in each episode. In line 7, the exploration noise is generated by an Ornstein-Uhlenbeck process (OUProcess). In line 11, we sample a batch of tuples from the replay buffer. Lines 13-19 are the process of updating the target networks of the critic and the actor.
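The Ornstein-Uhlenbeck exploration noise of line 7 can be sketched as below; the parameter values are common defaults in the DDPG literature, not necessarily those used in our experiment:

```python
import random

class OUProcess:
    """Ornstein-Uhlenbeck exploration noise: temporally correlated,
    mean-reverting perturbations added to the actor's action output."""

    def __init__(self, theta=0.15, mu=0.0, sigma=0.2, dt=1.0):
        self.theta = theta    # pull strength toward the mean
        self.mu = mu          # long-run mean of the process
        self.sigma = sigma    # scale of the random shocks
        self.dt = dt
        self.x = mu           # current noise value

    def sample(self):
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * (self.dt ** 0.5) * random.gauss(0.0, 1.0))
        self.x += dx
        return self.x
```

Unlike independent Gaussian noise, consecutive samples are correlated, which helps the agent explore smoothly varying link-weight actions.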
V RL4Net and Algorithm Implementation
Implementing a reinforcement learning environment and algorithms from scratch is a difficult task. Inspired by prior work, we developed a tool named RL4Net (Reinforcement Learning for Network) to facilitate the research and simulation of reinforcement learning based cognitive routing. Figure 3 shows the architecture of RL4Net, which is composed of two functional blocks:
Environment: The environment is built on the widely used ns3 network simulator. We extend ns3 with six components: (1) a Metric Extractor for computing quality metrics like delay and loss from ns3; (2) Computers for translating quality metrics into the DRL state and reward; (3) an Action Operator to receive action commands from the agent; (4) an Action Executor for performing ns3 operations according to actions; (5) ns3Env for transforming the ns3 object into a DRL environment; (6) envInterface for translating between ns3 data and DRL factors.
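The surface that ns3Env and envInterface present to the agent can be sketched in a gym-like style; the class below is a stub with assumed method names and shapes, not the actual RL4Net code:

```python
class CognitiveRoutingEnv:
    """Stub of the gym-style interface an ns3-backed environment
    (such as RL4Net's ns3Env) could expose to a DRL agent."""

    def __init__(self, n_nodes):
        self.n_nodes = n_nodes

    def reset(self):
        # Return the initial traffic-matrix state (all zeros in this stub);
        # a real environment would restart the ns3 simulation here.
        return [0.0] * (self.n_nodes * self.n_nodes)

    def step(self, action):
        # A real environment would push the link weights into ns3 via the
        # Action Operator/Executor, simulate one slot, then read metrics
        # back through the Metric Extractor and Computers.
        assert len(action) == self.n_nodes * self.n_nodes
        next_state = [0.0] * (self.n_nodes * self.n_nodes)
        reward = 0.0  # e.g., negative average delay from the metrics
        done, info = False, {}
        return next_state, reward, done, info
```

Keeping this interface identical to the standard RL loop is what lets off-the-shelf agents train against the simulator unchanged.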
Specifically, we use fully connected neural networks to implement the actor and critic of DDPG. There are four layers in the actor network: one input layer, two hidden layers and one output layer. To scale up the action output, we multiply the output of the softmax layer by a scale parameter. The critic network is composed of three layers: one input layer, one hidden layer and one output layer.
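A sketch of the actor's forward pass with a scaled softmax output follows; the layer sizes, scale value, ReLU activations, and random weights here are assumptions for illustration, not the configuration of our experiment:

```python
import numpy as np

def actor_forward(state, weights, scale=10.0):
    """Fully connected actor: hidden ReLU layers, then a softmax output
    multiplied by a scale parameter so link weights span a useful range."""
    h = state
    for W, b in weights[:-1]:
        h = np.maximum(0.0, h @ W + b)   # hidden layers with ReLU
    W, b = weights[-1]
    logits = h @ W + b
    e = np.exp(logits - logits.max())    # numerically stable softmax
    return scale * e / e.sum()           # scaled-up action vector

# Illustrative sizes: 9-dim state in, two hidden layers, 9 link weights out.
rng = np.random.default_rng(0)
sizes = [9, 32, 32, 9]
weights = [(rng.standard_normal((a, b)) * 0.1, np.zeros(b))
           for a, b in zip(sizes[:-1], sizes[1:])]
action = actor_forward(rng.standard_normal(9), weights)
```

The softmax keeps every weight positive and the scale parameter stretches the outputs, so the routers always receive a valid, non-degenerate weight vector.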
VI Experimental Evaluation
VI-A Experiment Setup
In the experiment, we set the neuron numbers of DDPG actor and critic networks as , , , , , and . The scale-up parameter is set to . In addition, the learning rate of actor and critic, parameters and in Algorithm 1 are set to , , and , respectively. The exploration noise is generated by parameters of , and . The parameters of experience replay buffer are , and .
To evaluate our proposed DDPG-based routing algorithm, we configure an experimental network topology as shown in Figure 2. The bandwidth of all links is 5 Mbps. On this network, we generated a 4.636 Mbps UDP flow with a packet size of 1024 between a source and a destination router, which makes the traversed links work under a heavy-load condition. Under this setting, we compare the average end-to-end delivery delay of packets against two other routing algorithms: OSPF and random weight. The random weight algorithm sets the weight vector of each router randomly.
VI-B Experiment Results
Figure 4 shows the values of the loss function of the critic network. As mentioned before, the target Q-value is $y_t = r_t + \gamma Q'(s_{t+1}, \mu'(s_{t+1}))$, computed with the target networks. The loss function is the average squared TD-error between the Q-value and its target: $L = \frac{1}{N}\sum_{i}\left(y_i - Q(s_i, a_i)\right)^2$. We trained the DDPG model for 43,100 steps. The value of $L$ decreases as the number of steps increases, which means the TD-error between the Q-value and the target Q-value decreases. After 15,000 steps, the loss stably remains at a small value, which means the critic network is sufficiently optimized. Therefore, we only plot the values of steps 1-15,000.
Figure 5 shows the values of the objective function of the actor network. We set $J$ as the mean of the output of the critic network, $J = \frac{1}{N}\sum_{i} Q(s_i, \mu(s_i))$, so the loss function of the actor network is $-J$. The value of $J$ improves gradually from 1 to 16 during steps 1-2,000. After that, $J$ keeps stable from step 2,000 to step 15,000, which means the actor network is successfully trained.
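The two training quantities plotted in Figures 4 and 5 can be sketched as plain batch computations; the sample values below are illustrative:

```python
def critic_loss(q_values, rewards, q_next, gamma):
    """Mean squared TD-error between each Q-value and its
    target r + gamma * Q' (the quantity plotted in Figure 4)."""
    errs = [(r + gamma * qn - q)
            for q, r, qn in zip(q_values, rewards, q_next)]
    return sum(e * e for e in errs) / len(errs)

def actor_objective(q_values):
    """J: the mean critic output over the actor's actions
    (Figure 5); the actor loss minimized in training is -J."""
    return sum(q_values) / len(q_values)
```

As training proceeds, `critic_loss` should fall toward zero while `actor_objective` rises, matching the trends in the two figures.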
Figure 6 shows the average end-to-end delivery delay of packets for every 100 steps. The delay decreases during steps 1-4,000. After that, the delay remains around 2.3 ms, which shows that the DDPG algorithm has found an optimal policy.
Figure 7 shows the average delay of the DDPG-based, OSPF and random weight routing algorithms. The proposed DDPG-based routing algorithm achieves the best performance, with the lowest end-to-end packet delivery delay after training.
VII Conclusion and Future Work
In this paper, we introduced a definition of cognitive routing with inference, decision and learning capabilities. Based on this definition, we proposed a deep reinforcement learning (DRL) based cognitive routing framework by defining the DRL factors in the cognitive routing environment. To facilitate the research and evaluation of DRL-based routing, we designed and developed a tool named RL4Net. A DDPG-based cognitive routing algorithm has been designed and implemented on RL4Net. The experimental evaluation results showed that the proposed DDPG-based routing algorithm performs better than the OSPF and random weight algorithms. Our work demonstrates the potential of DRL for achieving cognitive routing. In the future, we plan to extend RL4Net to enable it to configure routers in a test network for algorithm evaluation. In addition, we will design and implement more algorithms to find effective DRL-based cognitive routing algorithms that can be used in real networks.
- G. Malkin, "RIP Version 2," STD 56, RFC 2453, Nov. 1998.
- C. L. Hedrick, "An Introduction to IGRP," Rutgers, The State University of New Jersey, Center for Computers and Information Services, Laboratory for Computer Science Research, 1991.
- J. T. Moy, OSPF: Anatomy of an Internet Routing Protocol. Addison-Wesley Professional, 1998.
- P. Paul and S. V. Raghavan, "Survey of QoS routing," in Proc. Int. Conf. on Computer Communication, vol. 15, no. 1, 2002.
- L. Hanzo and R. Tafazolli, "A survey of QoS routing solutions for mobile ad hoc networks," IEEE Communications Surveys and Tutorials, vol. 9, no. 2, pp. 50-70, 2007.
- L. Chen and W. B. Heinzelman, "A survey of routing protocols that support QoS in mobile ad hoc networks," IEEE Network, vol. 21, no. 6, pp. 30-38, 2007.
- J. Xie, F. R. Yu, T. Huang, R. Xie, J. Liu, C. Wang, and Y. Liu, "A survey of machine learning techniques applied to software defined networking (SDN): Research issues and challenges," IEEE Communications Surveys and Tutorials, vol. 21, no. 1, pp. 393-430, 2018.
- F. Hu, Q. Hao, and K. Bao, "A survey on software-defined network and OpenFlow: From concept to implementation," IEEE Communications Surveys and Tutorials, vol. 16, no. 4, pp. 2181-2206, 2014.
- J. Qadir, "Artificial intelligence based cognitive routing for cognitive radio networks," Artificial Intelligence Review, vol. 45, no. 1, pp. 25-96, 2016.