1 Introduction
Network providers leverage traffic engineering techniques to optimize performance over operational IP networks [awduche]. With the exponential growth in demand, researchers have experimented with new optimization algorithms that aim to balance utilization and availability of network resources [teavar] or optimize over multiple criteria such as the number of hops, availability, or path attributes [sobrinho2020routing]. As wide-area network (WAN) backbones become costly to maintain and upgrade, software-defined networking (SDN) has emerged as a promising method to maximize routing performance, but it must calculate and optimize routes globally across users [b4, swan].
Optimizing for network performance involves measuring bandwidth, jitter, or latency over resource links, where a poorly designed infrastructure can lead to slow performance and increased packet loss. Beyond the meticulously designed heuristics needed to develop an optimization algorithm, researchers would analyze offline models of the network topology and traffic demand matrix to infer the best paths between source-destination pairs. As a planning tool, this approach has significant limitations: (1) the optimization becomes difficult to complete within a few minutes as networks grow from tens to hundreds of routers, and (2) the dynamic traffic demand matrix requires recalculation every few days as some links become congested or fail.
Recent breakthroughs in deep learning techniques leveraging data-driven learning have produced successful and simple solutions to complex online decision-making problems such as playing games [alphago], resource management [mao2016resource], and learning inherent traffic patterns [greguric2020application]. In particular, deep reinforcement learning (RL) uses agents that learn their optimal actions from experience and feedback, through trial-and-error interactions with the environment. Agents can gradually modify their behavior through these interactions without an accurate mathematical model of the environment. RL has a natural application to path optimization problems: an agent explores different routing policies, gathers statistics about which policies maximize performance functions, and learns over time the best policy for which route to take. Prior work on RL-based path optimization follows two main approaches [valadarsky2017learning]: optimizing routing configurations by predicting future traffic conditions from past traffic patterns, or optimizing routing configurations over a set of feasible traffic scenarios to improve performance parameters. Modern communication networks have become very challenging mainly for two reasons [boutaba2018comprehensive]. First, communication networks have become complex and highly dynamic, which makes them hard to model and control. For example, in vehicular and ad hoc networks, nodes move frequently and link failures may occur during working hours, resulting in topology changes [govindan2016evolve, hong2018b4]. Second, as networks continue to grow in scale, a central controller may be costly to install, slow to configure, and vulnerable to malicious attacks [simplicio2010survey, al2015application]. Therefore, there is a need for innovative approaches in which traffic routing does not rely on accurate mathematical models and can be managed in a distributed manner.
Examples of distributed path planning such as ant colony optimization and swarm approaches have shown success in static environments, but still require additional learning for dynamic environments [Schanemann2007]. In this paper, we investigate the above issues and evaluate whether deep reinforcement learning can provide an optimal, adaptable, and distributed solution to path selection as network load increases, especially when the topology changes, such as a link failing or becoming congested. To design an optimal path selection for various network topologies and loads, we develop a deep policy-based meta-learning algorithm (MAMRL) and evaluate its performance on several simulated WAN topologies. MAMRL can optimize multiple objectives: packet loss and packet delivery time. Our preliminary results show that MAMRL outperforms shortest-path algorithms, especially as network load increases and congestion becomes likely. Modeling the network as a multi-agent complex system, with each router represented as an agent, we demonstrate that MAMRL can learn and perform at each router in a distributed manner, opening future work on online traffic engineering on devices.
1.1 Motivation and Contribution
In this work, we investigate the challenge of building an adaptive network routing controller that continues to provide optimal network performance even when the topology changes. We model this challenge as a path optimization problem that must adapt to varying network loads and topology changes, addressed via novel deep reinforcement learning. Modern communication networks are highly dynamic and hard to model and predict; therefore, we aim to develop a novel experience-driven algorithm that learns to select paths from experience rather than from an accurate mathematical model. Additionally, because gathering information from widely distributed routers is difficult, we design a distributed optimization framework that learns locally optimal strategies.
Our specific contributions are:

Policy-based RL performs well at high network load.
Using a policy-based deep reinforcement learning method, we train the model at a variety of network loads and save the optimal neural network. Once deployed, our trained neural network outperforms value-based RL at high network load.

Optimizing for multiple criteria. Our neural networks perform multi-objective optimization over both packet delivery time and packet loss on the network links. We design a reward (utility) function that captures the preferences of the network controller, minimizing both packet delivery time and packet loss when link failures occur.

Quick adaptation to link failures. Our proposed MAMRL framework aims to make good online decisions under the guidance of powerful Deep Neural Networks (DNNs). In addition, by leveraging the model-agnostic meta-learning technique [MAML], our neural networks can quickly switch to alternate paths to minimize both packet delivery time and packet loss when link failures occur.

Calculating optimal packet routes from limited local observations, enabling future on-device traffic engineering research.
Our neural network model is deployed as multiple agents, one per router, allowing each router agent to learn and optimize traffic routing based on its local information. To achieve this, we leverage a dynamic consensus estimator [consensus] to diffuse local information and estimate global rewards, while still achieving the best average packet delivery time.
We develop a fully distributed multi-agent meta-reinforcement learning (MAMRL) framework for the packet routing problem, where each router agent aims to find the right adjacent router to which to send its packets, so as to minimize the overall average packet delivery time and avoid packet loss. We demonstrate our results via extensive packet-level simulations representative of WAN network topologies (ATT, Geant, B4), showing that MAMRL significantly outperforms several baseline algorithms under high network load.
2 Background
2.1 Deep Reinforcement Learning
Reinforcement learning is concerned with how an intelligent agent learns a good strategy from experimental trials and the feedback it receives. With the optimal strategy, the agent can actively adapt to the environment to maximize cumulative rewards. Almost all deep RL problems can be framed as Markov Decision Processes (MDPs), which consist of four key elements: the state space $\mathcal{S}$, the action space $\mathcal{A}$, the state transition function $P$, and the reward function $R$. More specifically, at each decision epoch $t$, the intelligent agent occupies a state $s_t \in \mathcal{S}$ and chooses an action $a_t \in \mathcal{A}$ to switch from one state to another. The probability that the process moves into its new state $s_{t+1}$ is given by the state transition function $P(s_{t+1} \mid s_t, a_t)$. Once an action is taken, the environment delivers a reward $r_t$ as feedback. Figure 1 shows the general process of reinforcement learning (the definition of policy is given below). There are two key functions in RL: the policy function $\pi$ and the value function $V$. The policy $\pi(a \mid s)$ is a mapping from states to actions and tells the agent which action to take in state $s$. For example, in the path optimization problem, the policy is the router's strategy for finding the best adjacent router to which to send the current packets, given the current utilization state of the communication network. The state value function $V^{\pi}(s)$ measures how rewarding a state is under policy $\pi$ via a prediction of future reward. Similarly, the action-state value function $Q^{\pi}(s, a)$ tells, for a given policy, the expected cumulative reward of taking action $a$ in state $s$.
The goal of RL is to find the optimal policy $\pi^*$ that achieves the optimal value functions $V^{\pi^*}$ and $Q^{\pi^*}$. Traffic engineering is a natural application of RL: the agent explores different routing policies, gathers statistics about which policies maximize the utility function, and learns the best policy accordingly.
Value-based versus policy-based algorithms: Value-based RL algorithms learn a tabular or approximate representation of the state-action value function and select the action with the maximal value among all available actions for a given state. For example, the Q-routing algorithm lets each router store Q-values as estimates of the negative transmission time between that router and the others; to shorten the average packet delivery time, routers choose the action with the maximal Q-value. Policy-based RL algorithms instead learn the policy directly as a parameterized function $\pi_{\theta}$ and train the policy to maximize the expected cumulative reward. Policy-based algorithms can learn stochastic policies; note that the learned policy is stochastic only in the action-state pairs where stochasticity is beneficial. Value-based algorithms, which choose the actions with the maximal values, can usually only follow deterministic policies or stochastic policies with predetermined distributions, which is not the same as learning the truly optimal stochastic policy. Since current communication networks are highly dynamic and stochastic, we can expect policy-based RL algorithms to outperform value-based RL algorithms in scenarios where the optimal policy is stochastic. Section 2.2 presents the policy-based RL algorithms in more detail.
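As a concrete illustration of the value-based approach, the Q-routing update of [boyan1994packet] can be sketched as follows; the data-structure layout and parameter names are our own choices, not the paper's:

```python
def q_routing_update(Q, x, d, y, q_y, s_y, t_y, eta=0.5):
    """One Q-routing update at router x for a packet bound for destination d,
    forwarded to neighbor y.

    Q[x][(d, y)] estimates the remaining delivery time via y.
    q_y: queuing delay experienced at x; s_y: transmission time x -> y;
    t_y: y's own best estimate min over z of Q[y][(d, z)]; eta: learning rate.
    """
    old = Q[x][(d, y)]
    # Move the estimate toward the observed one-step cost plus the
    # neighbor's estimate of the remaining delivery time.
    Q[x][(d, y)] = old + eta * ((q_y + s_y + t_y) - old)
    return Q[x][(d, y)]
```

At decision time, router x then forwards to the neighbor y minimizing Q[x][(d, y)], which is the deterministic, value-based behavior contrasted with the stochastic policies discussed above.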
In this work, to enable high-dimensional state representations (such as action histories), we consider deep RL algorithms, which adopt deep neural networks to approximate the policy functions. Here the policy parameters $\theta$ are the weights of the deep neural networks.
2.2 Performance under Partial Observability
Figure 1 shows the general process of reinforcement learning, where the agent is able to observe the global information of the environment. As stated in Section 1, in this work we consider a path optimization problem in a distributed network environment, meaning that each router only has access to its own information and the information received from its adjacent routers. It follows that the path optimization problem can be modeled as a multi-agent partially observable Markov decision process (POMDP). A POMDP for $N$ routers, referred to as $\mathcal{M}$, is defined by a tuple $\langle \mathcal{S}, P, \mathcal{N}, \{\mathcal{O}^i\}, \{\mathcal{A}^i\}, \{r^i\} \rangle$, where $\mathcal{S}$ and $P$ carry the same meaning as in Section 2.1 and $\mathcal{N}$ denotes the set of all routers. $\mathcal{O}^i$, $\mathcal{A}^i$, and $r^i$ are the local observation space, local action space, and local reward function of router $i$, respectively. Then $\mathcal{A} = \prod_{i} \mathcal{A}^i$ is the joint action space of all routers. Each router only has access to a private local observation $o^i$ correlated with the state $s$. To choose actions, each router uses a stochastic parametric policy $\pi_{\theta_i}(a^i \mid o^i)$, which represents the probability of choosing action $a^i$ at observation $o^i$. Thus, the joint policy of all routers satisfies $\pi_{\theta}(a \mid s) = \prod_{i} \pi_{\theta_i}(a^i \mid o^i)$. For a given time horizon $T$, we define the trajectory $\tau$ as the collection of state-action pairs ending at time $T$. The probability distribution of the initial state is denoted by $\rho_0$. In the path optimization problem (a cooperative multi-agent problem), the collective objective of all the routers is to collaboratively find policies $\pi_{\theta_i}$ for all $i$ that maximize the globally expected trajectory reward over the whole network. The goal of all routers is as follows:

$$\max_{\theta} \; J(\theta) = \mathbb{E}_{\tau \sim p(\tau \mid \theta)} \Big[ \sum_{t=0}^{T} r_t \Big], \qquad (1)$$

where $r_t = \frac{1}{N} \sum_{i \in \mathcal{N}} r_t^i$ and $r_t^i$ denotes the reward received by router $i$ at time $t$. $r_t^i$ consists of two parts: 1) $r_{1,t}^i$ denotes the reward signal based solely on individual behavior, and 2) $r_{2,t}^i$ denotes the reward signal based on global behavior. Note that only $r_{1,t}^i$ and $o_t^i$ can be known by router $i$ in a partially observable environment.
As stated in Section 2.1, in this work we investigate how deep policy optimization algorithms work in path optimization problems. The main idea is to directly adjust the parameters $\theta_i$ of the policies in order to maximize the objective in (1) by taking steps in the direction of $\nabla_{\theta_i} J(\theta)$. For the POMDP, the gradient of the expected return for router $i$ can be written (the derivation of Equation (2) is provided in Appendix A) as

$$\nabla_{\theta_i} J(\theta) = \mathbb{E}_{\tau \sim p(\tau \mid \theta)} \Big[ \Big( \sum_{t=0}^{T} \nabla_{\theta_i} \log \pi_{\theta_i}(a_t^i \mid o_t^i) \Big) R(\tau) \Big], \qquad (2)$$

where $R(\tau) = \sum_{t=0}^{T} r_t$ is the trajectory reward. Note that with only local information, $R(\tau)$ cannot be well estimated, since the estimation requires the rewards of all routers. In this work, we propose to use a dynamic consensus algorithm to estimate $R(\tau)$ using only local information, as described in Section 3.
2.3 Model-agnostic Meta-learning
In this work, we consider path optimization in the presence of link failures. Once a link failure occurs, the state transition function of the environment changes accordingly, meaning that a new POMDP arises. Let $\mathcal{M}_0$ denote the Markov process modeled by the full network environment (no link failures) and $\mathcal{M}_k$, where $k = 1, \ldots, K$, denote the Markov processes modeled by the network environment under different link failure scenarios. Suppose that the distribution of all POMDPs follows $\rho(\mathcal{M})$. To make the routing algorithm adapt to link failures (different POMDPs) quickly, we incorporate model-agnostic meta-learning [MAML] into the policy optimization algorithms. Meta-reinforcement learning aims to learn an algorithm that can quickly find optimal policies in environments drawn from a distribution over a set of Markov decision processes. Our approach trains a well-generalized parametric policy initialization that is close to all the possible environments (POMDPs), such that it can quickly improve its performance in a new environment with one or a few vanilla policy gradient steps (see Figure 2). The meta-learning objective can be written as:
$$\max_{\theta} \; \mathbb{E}_{\mathcal{M}_k \sim \rho(\mathcal{M})} \, \mathbb{E}_{\tau \sim p(\tau \mid \theta_k')} \big[ R(\tau) \big], \quad \text{with} \quad \theta_k' = \theta + \alpha \nabla_{\theta} \, \mathbb{E}_{\tau \sim p(\tau \mid \theta)} \big[ R(\tau) \big], \qquad (3)$$

where $\alpha$ is the learning rate and $p(\tau \mid \theta)$ represents the distribution of trajectories given policy $\pi_{\theta}$. Model-agnostic meta-learning attempts to learn an initialization $\theta$ such that, for any environment $\mathcal{M}_k$, the policy attains maximum performance after a few policy gradient steps.
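To make the inner- and outer-loop structure concrete, the following toy sketch runs first-order MAML on synthetic quadratic "tasks" standing in for per-environment returns; the task construction, analytic gradients, and step sizes are illustrative assumptions, not the paper's actual training setup:

```python
import numpy as np

def first_order_maml(theta, tasks, alpha=0.1, beta=0.05, meta_steps=100):
    """First-order MAML on toy tasks: each task k wants theta near c_k.

    We take J_k(theta) = -||theta - c_k||^2 as a stand-in for expected
    return, so grad J_k(theta) = -2 (theta - c_k).
    Inner loop: one gradient-ascent step per task (the adaptation step).
    Outer loop: average the post-adaptation gradients, a first-order
    approximation of the MAML meta-gradient.
    """
    for _ in range(meta_steps):
        meta_grad = np.zeros_like(theta)
        for c in tasks:
            grad = -2.0 * (theta - c)            # inner gradient at theta
            theta_k = theta + alpha * grad        # adapted parameters theta'_k
            meta_grad += -2.0 * (theta_k - c)     # gradient after adaptation
        theta = theta + beta * meta_grad / len(tasks)
    return theta
```

With two tasks pulling toward +1 and -1, the learned initialization settles between them, so a single adaptation step moves it close to either task's optimum, which is exactly the behavior the meta-objective in (3) encourages.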
3 Design: MAMRL Approach
3.1 Model
In the packet routing problem, packets are transmitted from a source to its destination through intermediate routers and available links. The mathematical model is given below.
Environment. We consider a possibly time-varying communication network environment, characterized by an undirected graph $G_t = (\mathcal{V}, \mathcal{E}_t)$, where $\mathcal{V}$ is the set of routers and $\mathcal{E}_t$ is the set of transmission links between the routers at time $t$. The bandwidth of each link is limited, and packet loss may occur when the size of the packet to be transmitted is greater than the link's capacity. The communication network is possibly time-varying because link failures can happen during working hours; when a link fails, its capacity becomes zero. Each router $i$ has a set of neighbor routers denoted by $\mathcal{N}_i$.
Routing. Packets are introduced into the network with an origin node and a destination node. They travel to their destinations by hopping through intermediate nodes. Each router has a single local port/queue used to store traffic, and the queue follows the first-in-first-out (FIFO) discipline. A node can forward the top packet in its local queue to one of its neighbors. Once a packet reaches its destination, it is removed from the network.
Objective. The packet routing problem aims at finding the optimal transmission path between source and destination routers to minimize the average packet delivery time, which is the sum of queuing time and transmission time while preventing packet loss when link failures happen.
3.2 RL Formulation
Our standard RL setup consists of multiple router agents interacting with an environment (the communication network) in discrete decision epochs. We investigate a deep policy optimization algorithm to address packet routing in a partially observable network environment. To make the router controllers adapt to link failures more quickly, we leverage the model-agnostic meta-learning technique to learn a well-generalized policy initialization.
Figure 3 shows the MAMRL setup (training and testing process) per router. In the testing process, each router uses the deep policy optimization algorithm coupled with the dynamic consensus estimator to learn the optimal policy. To let the routers adapt to topology changes quickly, the policy of each router is initialized from the well-generalized policy initialization, which is the output of the training process. The training process follows the traditional model-agnostic meta-reinforcement learning framework. The basic idea is to let the network controller encounter multiple link failures during training; it can then use this experience to learn how to adapt if similar situations occur while deployed. In Figure 3, $\mathcal{M}_0$ denotes the Markov process modeled by the full network environment (no link failures) and $\mathcal{M}_k$, where $k = 1, \ldots, K$, denotes the Markov process modeled by the network environment with link failures. In the training process, the network controller collects data samples from all possible network environments according to the distribution $\rho(\mathcal{M})$. However, traditional model-agnostic meta-learning mainly targets single-agent (centralized) reinforcement learning problems; how to solve a multi-agent RL problem with model-agnostic meta-learning in a distributed manner is rarely studied. In this work, we aim to train and execute the network controller in a distributed manner. As shown in Figure 3, each router has an independent policy model represented by a deep neural network. The core of the proposed control framework is to let each router run a deep reinforcement learning algorithm to find the best action at each decision instant, using only local information and local interaction. Since the routers aim to minimize the average packet delivery time of the whole network, each router needs to feed the global packet delivery time into its policy model as feedback/reward.
To achieve this goal, we leverage the dynamic consensus algorithm to estimate the global reward function, as described below.
The policy optimization algorithms aim to find the best policy parameters that produce the highest long-term expected return using gradient ascent. The gradient of the long-term expected return with respect to the parameters of each router's policy is defined in Equation (2). However, with only local information, the trajectory reward cannot be well estimated, since the estimation requires the rewards of all routers. This motivates our consensus-based policy gradient algorithm, which leverages the communication network to diffuse local information, fostering collaboration among routers. We adapt the following dynamic consensus algorithm [consensus] into the policy optimization method:
$$\hat{d}^i(t+1) = \hat{d}^i(t) + \epsilon \sum_{j \in \mathcal{N}_i} \big( \hat{d}^j(t) - \hat{d}^i(t) \big) + d^i(t+1) - d^i(t), \qquad (4)$$

where $\epsilon$ is the control gain, $\hat{d}^i$ and $d^i$ are the local estimator and local signal of router $i$, and $\mathcal{N}_i$ denotes the neighbor set of router $i$. It can be proved that $\hat{d}^i$ converges to the vicinity of the network-wide average $\frac{1}{N} \sum_{j \in \mathcal{N}} d^j$ within a few time steps. It is worth mentioning that only local information is used in the designed estimator in Equation (4).
We develop the following policy optimization method for the POMDP:
$$\theta_i \leftarrow \theta_i + \beta \, \mathbb{E}_{\tau \sim p(\tau \mid \theta)} \Big[ \Big( \sum_{t=0}^{T} \nabla_{\theta_i} \log \pi_{\theta_i}(a_t^i \mid o_t^i) \Big) \hat{R}^i(\tau) \Big], \qquad (5)$$

where

$$\hat{R}^i(\tau) = \sum_{t=0}^{T} \big( r_{1,t}^i + \hat{d}^i(t) \big). \qquad (6)$$

Here, $\hat{R}^i$ denotes the accumulated sum of the local reward signal $r_{1,t}^i$ and the global reward estimate $\hat{d}^i(t)$, where $\hat{d}^i(t)$ is obtained by the dynamic consensus estimator designed in Equation (4). Note that both $r_{1,t}^i$ and $\hat{d}^i(t)$ can be obtained locally.
We build the deep neural network with one input layer, five hidden layers of size 128 with ReLU activations, and one output layer with Softmax (see Figure 3). As shown in Figure 3, at each decision epoch $t$, each router $i$ provides its local observation $o_t^i$ to the policy model and gets an action back. Router $i$ performs action $a_t^i$ and switches to a new state. Then router $i$ feeds the local reward $r_{1,t}^i$ and the global reward estimate, which is the output of the dynamic consensus estimator, to the policy model, and the policy model updates its weights with respect to the received reward estimate. It is worthwhile to mention that, to update the policy in the direction of greater cumulative reward using Equation (5), only local information is required. By integrating model-agnostic meta-learning with the proposed multi-agent policy optimization algorithm, we obtain MAMRL for the packet routing problem, in which both the training and execution processes are distributed. These are shown in Algorithms 1 and 2.
3.3 Design of State, Action Space and Rewards
We design the local observation $o^i$, local action $a^i$, and local estimate of the reward function below.

Observation of router $i$, $o^i$: 1) the destination router of the first packet in the local queue; 2) the last ten actions taken by router $i$; 3) the address of the router with the longest queue among all the neighbor routers of router $i$.

Action of router $i$, $a^i$: the next hop of the current packet in the queue.

Reward estimate of router $i$: the sum of $r_1^i$ and $r_2^i$, where $r_1^i$ is the negative number of packet losses that occurred at router $i$ and $r_2^i$ is the estimate, obtained using Equation (4), of the network-wide average of $d^j$. Here, $d^i$ denotes the negative average delivery time of all the packets delivered to router $i$.
Note that the design of the state space and reward is critical to the success of a deep reinforcement learning method. Our design of the state space captures key components of the network environment. In our design of the reward function, element $r_2^i$ is introduced to minimize the average packet delivery time of the whole network, and element $r_1^i$ is included to minimize the packet loss occurring at router $i$ in the presence of link failures. Note that in our design, $r_1^i$ is the negative number of packet losses occurring only at router $i$, while $r_2^i$ estimates the negative average delivery time over the whole network environment. The reasons are summarized below.
Optimizing for packet delivery time: To achieve this goal, all the routers need to collaboratively find the best paths to reroute the traffic, and the delivery time of the packets delivered to router $i$ is determined by the decisions of all the intermediate routers. That is, packet delivery time is a signal based on global behavior; it is not enough for router $i$ to know only the delivery time of the packets delivered to itself.
Optimizing for link failures: Although this goal also involves reducing packet loss in the whole network, link failures have little effect on the routers that are not directly connected to the failed links. Therefore, in our design, we only provide the packet loss that occurred at router $i$ to the policy as feedback.
4 Evaluation
We conduct extensive simulations to evaluate the performance of the proposed MAMRL framework in a path optimization problem with static topologies and topologies with possibly failed links.
Table 1: Evaluated network topologies.

Topology Name  Number of nodes  Number of edges
B4  12  19
Geant  21  32
ATT  25  56
We evaluate the results to:

Benchmark the RL techniques against standard path optimization approaches and other RL approaches,

Examine how robust the MAMRL approach is under link failures, and

Show how quickly MAMRL adapts and reroutes packets to alternate paths achieving better performance.
4.1 Experiment Settings
The simulation runs are performed on three network topologies: B4, Geant, and the ATT network. See Table 1 for a specification of network sizes. The B4 and ATT topologies (link capacities) and their traffic matrices (packet sizes) were obtained from the authors of Teavar [teavar]. Geant is the European research and education network providing connectivity to science experiments across Europe and US labs (www.geant.org).
To model packet arrivals, we developed a discrete-event network simulator based on OpenAI Gym (https://github.com/esnet/daphnepublic/tree/master/MAMRLTE). Packets are introduced into the network with an origin node and a destination node, arriving according to a Poisson process of rate $\lambda$. They travel to their destination nodes by hopping through intermediate nodes. Each router has a single local port/queue used to store traffic, which follows the FIFO discipline. In each time unit, a node forwards the top packet in its local queue to one of its neighbors. Once a packet reaches its destination, it is removed from the network environment. The bandwidth of each link is limited, and packet loss may occur when the size of the packet to be transmitted is greater than the link's capacity. When a link failure happens, the capacity of that link becomes zero.
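A stripped-down version of this discrete-event loop might look as follows; the graph encoding, the policy interface, and the per-step Bernoulli approximation of Poisson arrivals (valid for small rates) are our simplifications of the full simulator:

```python
import random
from collections import deque

def run_simulation(graph, policy, lam=0.3, steps=200, seed=1):
    """Toy discrete-event packet routing loop.

    graph: dict node -> list of neighbors. policy(node, dst) returns the
    next hop. At each time unit, a new (src, dst) packet may arrive with
    probability lam, each router holds one FIFO queue, and every router
    forwards at most one packet per step. Returns the average delivery
    time over delivered packets.
    """
    rng = random.Random(seed)
    nodes = list(graph)
    queues = {v: deque() for v in nodes}
    delivered, total_time = 0, 0
    for t in range(steps):
        if rng.random() < lam:                  # Bernoulli approx. of Poisson(lam)
            src, dst = rng.sample(nodes, 2)
            queues[src].append((dst, t))
        # Synchronous step: every router releases at most one packet.
        sends = [(v, queues[v].popleft()) for v in nodes if queues[v]]
        for v, (dst, born) in sends:
            nxt = policy(v, dst)
            if nxt == dst:
                delivered += 1
                total_time += t + 1 - born      # delivery time in steps
            else:
                queues[nxt].append((dst, born))
    return total_time / max(delivered, 1)
```

Plugging in a hand-written shortest-hop policy on a line graph gives a quick sanity check of the queuing and delivery-time accounting before any learning is involved.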
In the experiments, we use a fixed step size. In addition, we use trust-region policy optimization (TRPO) [trpo] as the meta-optimizer, and the standard linear feature baseline [baseline] is used.
4.2 Impact of Increasing Network Load
We first test the MAMRL algorithm on static topologies (no link failures). We compare against the classical shortest path algorithm and two existing RL-based routing algorithms:

Shortest path algorithm (SPA) [spa]: a traditional packet routing algorithm.

Q-routing [boyan1994packet]: a value-based multi-agent reinforcement learning algorithm.

Policy gradient (PG) [peshkin2002reinforcement]: a policy-based multi-agent reinforcement learning algorithm.
In the experiments, the episodes terminate at a fixed horizon. After 10000 training episodes, we restored the well-trained models to compare their performance in a new test environment where packets were generated at the corresponding network load level. Note that SPA does not need training and can be applied directly at test time. We tested the network on loads ranging from 0.005 to 0.5 and measured the average packet delivery time over several episodes in the testing process to compare against the above-mentioned baseline controllers. The load corresponds to the rate $\lambda$ of the Poisson arrival process, i.e., the average number of packets injected per unit time.
The average packet delivery time versus network load is shown in Figure 4. Under low load, for all three topologies, MAMRL is slightly inferior to Q-routing and SPA. As the load increases, MAMRL performs much better than the baseline algorithms: at high traffic load on each of the B4, Geant, and ATT topologies, MAMRL reduces the average packet delivery time compared to the SPA, Q-routing, and policy gradient algorithms. The reason is as follows. Under low load, there is no congestion along the route; therefore, the deterministic policy learned by Q-routing performs as well as SPA, which is the optimal routing policy under low load. The routing policy learned by MAMRL, however, is stochastic, meaning that not all packets are sent down the optimal link; this is why MAMRL is slightly inferior to Q-routing and SPA under low load. As the load increases, the routes become crowded and the queues grow longer. Due to the stochastic nature of the communication network environment, the optimal policy should be stochastic under high load, which explains why MAMRL performs much better than the Q-routing and SPA controllers in that regime. The results in [peshkin2002reinforcement] also show that policy-based reinforcement learning algorithms perform better than value-based algorithms, especially at high flow load. However, the work in [peshkin2002reinforcement] only considers a simple policy gradient algorithm for the packet routing problem. Instead, we investigate a deep policy optimization algorithm that can take much more information as input, enlarging the state-action space for better policy making. The results in Figure 4 indicate that our MAMRL algorithm achieves a shorter delivery time than a simple policy gradient algorithm.
4.3 Impact of Link Failures
We test the MAMRL algorithm in the presence of link failures at a fixed network load. We let the routers train on all possible network environments (link failure scenarios) according to the distribution $\rho(\mathcal{M})$ and return the policy parameters using Algorithm 1. We restored the well-trained models in a new test environment where links get disconnected according to the distribution $\rho(\mathcal{M})$ (we assume that only one link fails at a time). We compare the results of the following four controllers: (1) the policy tested from the initialization parameters obtained by MAMRL, (2) the policy tested from randomly initialized weights (called random in the following), (3) the shortest path algorithm (SPA) [spa], and (4) the Q-routing algorithm [boyan1994packet]. Figure 5 shows the packet loss versus episodes, and Figure 6 shows the average packet delivery time versus episodes. We also show the performance of the reinforcement learning routing algorithms (MAMRL, random, Q-routing) over the three network topologies during the online learning procedure in terms of reward; the corresponding simulation results are presented in Figure 7. We can make the following observations from these results.

In Figure 5, when there is a link failure, the model-based routing algorithm (i.e., SPA) suffers a huge packet loss. The reason is that the SPA algorithm relies on prior knowledge of the network topology to make decisions. Here we assume that as the network grows, it takes longer for the IS-IS/OSPF protocols to react to link failures and update the routing tables; both IS-IS and OSPF use the same Dijkstra algorithm for computing the best path through the network. The other learning algorithms (MAMRL, random, Q-routing) are model-free controllers. When a link failure happens, the packet loss sensor tells the RL routing controllers that many packets are being lost at the particular link. Based on our design, this packet loss hurts the reward of the RL routing controllers. To maximize the reward function, the RL routing controllers adjust their policies to improve the reward and hence reduce the packet loss accordingly.

Figure 7 shows how the reward value changes during online learning over the three network topologies. When there is a link failure on the B4 topology, Q-routing adapts to the failure (its reward values converge to stable states) after about 30 episodes, MAMRL (our algorithm) adapts after about 35 episodes, and the random algorithm (deep policy optimization with randomly initialized weights) adapts only after about 800 episodes. The results for the Geant and ATT topologies are shown in Table 2. Q-routing is based on a value-based Q-learning algorithm and is often much faster at learning a policy than policy optimization algorithms [nachum2017bridging]. In this work, we propose the MAMRL algorithm, which leverages model-agnostic meta-learning to help policy optimization adapt to link failures quickly. The basic idea of MAMRL is to let the network controller encounter all possible link failures in the training process; it can then use that experience to learn how to adapt. MAMRL aims to learn a well-generalized policy initialization that is close to all possible situations of the environment. Whenever there are continual packet losses at a particular link, the MAMRL controller re-initializes the policy models from the pre-trained well-generalized policy initialization. As seen in Figure 7, the MAMRL controller adapts to link failures at a speed comparable to the Q-routing algorithm, whereas the normal policy optimization controller adapts much more slowly.
Table 2: Number of episodes required to adapt to a link failure.

Topology   Q-routing   MAMRL   Random
B4             23        25       800
Geant          35        35       100
ATT            29        30      1000
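The meta-learning pre-training behind these adaptation speeds can be illustrated with a toy MAML loop. This is a sketch only: each "link failure task" is modeled as a one-dimensional quadratic loss with a task-specific optimum, standing in for the real policy-gradient objectives, and all names and step sizes are assumptions:

```python
import numpy as np

def maml_init(failure_tasks, theta, inner_lr=0.1, outer_lr=0.05, steps=100):
    """Toy MAML loop: each 'task' is one possible link failure, modeled
    here as the quadratic loss_i(x) = (x - target_i)^2. The outer loop
    moves theta toward an initialization from which a single inner
    gradient step performs well on every task."""
    for _ in range(steps):
        meta_grad = np.zeros_like(theta)
        for target in failure_tasks:
            grad = 2 * (theta - target)        # inner gradient at theta
            adapted = theta - inner_lr * grad  # one adaptation step
            # gradient of loss_i(adapted) w.r.t. theta (chain rule:
            # d(adapted)/d(theta) = 1 - 2 * inner_lr)
            meta_grad += 2 * (adapted - target) * (1 - 2 * inner_lr)
        theta = theta - outer_lr * meta_grad
    return theta
```

For two symmetric tasks the learned initialization sits between their optima, so one gradient step reaches either one quickly, mirroring how MAMRL's initialization is "close to" every possible failure scenario.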
4.4 Optimizing for Multiple Objectives
Policy optimization algorithms use gradient descent to solve an optimization problem, and traffic engineering aims to find a way to forward data traffic that maximizes a utility function. The utility function may combine several quantities. In our design, the objective is to minimize the packet delivery time and packet loss; therefore the utility function, which corresponds to the reward function in the RL algorithms, consists of a function of packet loss and a function of packet delivery time.
In future work, we can add further objectives, such as bandwidth utilization and latency, if we want the RL controller to optimize over multiple parameters simultaneously.
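One common way to combine such objectives is a weighted scalarization of the individual costs. The sketch below is illustrative only; the metric names and weights are assumptions, not the reward actually used in our experiments:

```python
def multi_objective_reward(metrics, weights):
    """Scalarize several traffic-engineering objectives into one reward.
    `metrics` holds measured costs (lower is better); `weights` encodes
    the operator's relative priorities. The negation turns the weighted
    cost into a reward to be maximized."""
    return -sum(weights[k] * metrics[k] for k in weights)
```

The weights let the operator trade objectives off against each other without changing the learning algorithm itself.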
5 Related work
Routing is the process of selecting a path for traffic in a network from a source node to a destination node. The choice of routing algorithm may depend on multiple criteria, for example, performance metrics (delay, link utilization), the ability to adapt to changes in topology and traffic, and scalability (the algorithm should support a large number of routers). In the following, we briefly review traditional routing algorithms and reinforcement learning routing algorithms against these criteria. Since this work proposes a multi-agent reinforcement learning algorithm to solve the packet routing problem, a literature review of multi-agent reinforcement learning algorithms is also included in this section.
5.1 Traditional Packet Routing Algorithms
There are multiple routing algorithms in the traditional packet routing literature [flooding, dijkstra, bellman]. Among them, the shortest path algorithm is the most commonly used [abolhasan2004review]. It aims to find the shortest path between the source and destination nodes and deliver the packet to the destination as quickly as possible. The shortest path algorithm is regarded as the best routing algorithm under low network load, since packets can be delivered in the least amount of time along the shortest path between two nodes, provided there is no congestion along the route. However, when the network load is high, the shortest path algorithm causes serious backlogs at busy routers. Another problem is that it relies on full knowledge of the network topology, and hence needs manual adjustment when topology or traffic changes occur.
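For concreteness, a minimal sketch of the shortest path computation (Dijkstra's algorithm, the basis of the SPA baseline) over a weighted adjacency map follows; the dict-of-dicts graph representation is an illustrative choice:

```python
import heapq

def shortest_path(graph, src, dst):
    """Dijkstra's algorithm over a dict-of-dicts adjacency map with
    non-negative link weights; returns (cost, path). Note that it needs
    the full topology up front, which is exactly the limitation
    discussed above for model-based routing."""
    pq, seen = [(0.0, src, [src])], set()
    while pq:
        cost, node, path = heapq.heappop(pq)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nbr, w in graph.get(node, {}).items():
            if nbr not in seen:
                heapq.heappush(pq, (cost + w, nbr, path + [nbr]))
    return float("inf"), []
```

If any link weight in `graph` changes (e.g., a failure modeled as infinite cost), the whole computation must be rerun, which motivates the adaptive approaches reviewed next.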
5.2 Reinforcement Learning for Routing
Using reinforcement learning for packet routing has attracted increasing interest recently. Various reinforcement learning methods have been proposed for this classical communication network problem and have achieved better performance than traditional routing methods.
Applications of traditional RL to packet routing problems started in the early 1990s with the seminal work [boyan1994packet], where Q-routing was proposed. Q-routing [boyan1994packet] is an adaptive routing approach based on the reinforcement learning algorithm known as Q-learning. Q-routing routes packets based on learned delivery times (Q values) and achieves a much smaller average delivery time than the benchmark shortest-path algorithm [spa]. Since then, several extensions of Q-routing have been proposed, e.g., Dual Q-routing [kumar1997dual], Predictive Q-routing [choi1996predictive], Full Echo Q-routing [kavalerov2017reinforcement], Hierarchical Q-routing [lopez2011simulated] and Ant-based Q-routing [subramanian1997ants]. However, Q-routing is a value-based RL algorithm; that is, it is a deterministic algorithm that might cause traffic congestion at high loads and does not distribute incoming traffic across the available links. Due to these drawbacks of value-based algorithms, some researchers began to consider policy-based RL algorithms for packet routing problems [tao2001multi, peshkin2002reinforcement]. Since policy-based RL algorithms can explore the class of stochastic policies, it is natural to expect them to be superior for certain types of network topologies and loads where the optimal policy is stochastic. In [peshkin2002reinforcement], the results show that policy-based RL algorithms perform better than value-based algorithms, especially under high flow load. These traditional reinforcement learning routing algorithms use tabular functions or simple algebraic functions to estimate the Q functions or policy functions. This is limiting when the number of states is large, and thus they cannot take full advantage of the network traffic history and dynamics. In this work, we investigate policy-based deep reinforcement learning and use deep neural networks to approximate the policy function.
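The Q-routing update of [boyan1994packet] can be sketched as follows; the nested-dict table layout and parameter names are illustrative, not the original paper's notation:

```python
def q_routing_update(Q, x, y, d, queue_time, trans_time, alpha=0.5):
    """Boyan-Littman Q-routing update: router x forwards a packet bound
    for destination d to neighbor y. y reports its best remaining
    delivery-time estimate, and x moves Q[x][d][y] toward the new
    sample (queueing delay + transmission delay + y's estimate)."""
    best_from_y = min(Q[y][d].values()) if Q[y].get(d) else 0.0
    sample = queue_time + trans_time + best_from_y
    Q[x][d][y] += alpha * (sample - Q[x][d][y])
    return Q[x][d][y]
```

Since each router always forwards along the neighbor with the smallest Q value, the resulting policy is deterministic, which is precisely the congestion-related drawback noted above.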
The combination of deep learning with reinforcement learning can learn useful representations for routing problems with high-dimensional raw input and thus achieve superior performance.
Deep reinforcement learning, the combination of reinforcement learning and deep learning, has been able to solve a wide range of complex decision-making tasks. However, few works investigate how deep reinforcement learning can be leveraged for packet routing, since the network is a multi-agent environment and is non-stationary from the perspective of any individual router. This prevents the straightforward use of experience replay, which is crucial for stabilizing deep Q-learning [MADDPG]. [mukhutdinov2019multi] combines Q-routing and deep Q-learning to solve the routing problem. However, the training process proposed in [mukhutdinov2019multi] is centralized (all routers need to share parameters), which might cause issues in real-world large-scale network environments. The authors in [xu2018experience] propose a deep actor-critic reinforcement learning algorithm to optimize the performance of the communication network; however, their training and testing processes are also centralized. Recently, distributed deep Q-routing has been proposed in [you2020toward], where a deep recurrent neural network (LSTM) is utilized to tackle the non-stationarity problem in multi-agent reinforcement learning. However, the assumptions in [you2020toward] differ from those in our work. In [you2020toward], it is assumed that the bandwidth of each link equals the packet size, in which case only a single packet can be transmitted at a time. In our work, by contrast, we assume that the capacity of each link is fixed, in which case many packets can be transmitted at a time. We believe that our assumption is more realistic.
5.3 Multi-agent Deep Reinforcement Learning Approaches
There is a huge body of literature on single-agent deep reinforcement learning algorithms, where the environment stays largely stationary. Unfortunately, traditional deep reinforcement learning algorithms are poorly suited to multi-agent environments, where the environment becomes non-stationary from the perspective of any individual agent. This can cause divergence for value-based reinforcement learning and very high variance for policy-based reinforcement learning algorithms. In the literature, researchers have proposed multiple methods for applying reinforcement learning in multi-agent settings: to name a few, centralized training with distributed execution [MADDPG, iqbal2019actor, li2019robust], distributed training and execution under fully observable environments [zhang2018networked], and independent Q-learning [hausknecht2015deep, matignon2007hysteretic, foerster2017stabilising]. However, independent Q-learning is a value-based reinforcement learning algorithm, and in this work we aim to investigate policy-based reinforcement learning routing algorithms. The ideas of centralized training and a fully observable state space work well when there is a small number of agents in the communication network. As the number of agents increases, the volume of information might overwhelm the capacity of a single unit. To tackle this problem, one effective idea is to remove the central unit and allow each agent to share information with only a subset of agents, reaching consensus over a variable with these agents (called neighbors) [wai2018multi, zhang2016data]. In this work, we also leverage a dynamic consensus algorithm to estimate the global reward function through interactive communication among routers. Moreover, we consider packet routing in the presence of link failures, meaning that not only is the local environment non-stationary from the perspective of any individual router, but the global environment also changes during working hours. We propose to use model-agnostic meta-learning to learn a well-generalized policy initialization that is close to all possible environments, so that the policy can be quickly adapted to different scenarios with a few gradient steps. To the best of our knowledge, this is the first time model-agnostic meta-learning has been applied to a multi-agent reinforcement learning problem.
6 Discussion and Conclusions
In this work, we propose MAMRL, a novel framework that utilizes deep policy optimization and meta-learning to produce a model-free network routing controller that performs better path optimization than standard approaches. Our experiments show that MAMRL can learn to control communication networks from experience rather than from an accurate mathematical model. Specifically, we use deep policy optimization techniques to find optimal paths in complex WAN topologies. To address the difficulty of gathering information from widely distributed routers, we design a consensus-based policy optimization algorithm that can learn the locally optimal strategy using only local information. Additionally, we consider path optimization in the presence of link failures and leverage the model-agnostic meta-learning algorithm to make the proposed network controller adapt to link failures more quickly. We demonstrate how MAMRL improves the learning efficiency of deep reinforcement learning in multi-agent packet routing in the presence of link failures. The experiments demonstrate the effectiveness and efficiency of MAMRL for the packet routing problem compared to several baseline controllers.
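The average-consensus idea underlying our reward estimation can be sketched as follows; this toy version only illustrates neighbor-local averaging, not the actual dynamic consensus update used in MAMRL, and the names and step size are assumptions:

```python
import numpy as np

def consensus_step(values, neighbors, step=0.5):
    """One round of average consensus: each router moves its local
    estimate toward the mean of its neighbors' estimates. Repeated
    rounds drive all estimates to the network-wide average without
    any central collector."""
    new = values.copy()
    for i, nbrs in neighbors.items():
        new[i] = values[i] + step * np.mean([values[j] - values[i] for j in nbrs])
    return new
```

Each router only ever exchanges a scalar with its direct neighbors, which is what makes the scheme attractive for widely distributed routers.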
The distributed nature of MAMRL lays the foundation for our future work, in which we will experiment with on-device traffic engineering in physical network setups to see how well the network adapts.
7 Acknowledgements
We would like to thank Dr. Manya Ghobadi for providing the data of the AT&T and B4 network topologies. We would like to express our gratitude to Dr. Bashir Mohammed for his useful suggestions and critiques of this research work. This work was supported by the U.S. Department of Energy, Office of Science Early Career Research Program for 'Large-scale Deep Learning for Intelligent Networks', Contract No. FP00006145.
References
Appendix A Appendix
Here, we provide the derivation of Equation (2). All notations carry the same meaning as in Section 2.2. The probability of a trajectory $\tau = (s_0, a_0, \ldots, s_{T+1})$ given that actions come from $\pi_\theta$ is

$$P(\tau \mid \theta) = \rho_0(s_0) \prod_{t=0}^{T} P(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t). \tag{7}$$

The log-probability of a trajectory is

$$\log P(\tau \mid \theta) = \log \rho_0(s_0) + \sum_{t=0}^{T} \Big( \log P(s_{t+1} \mid s_t, a_t) + \log \pi_\theta(a_t \mid s_t) \Big). \tag{8}$$

The gradient of the log-probability of a trajectory is

$$\nabla_\theta \log P(\tau \mid \theta) = \nabla_\theta \log \rho_0(s_0) + \sum_{t=0}^{T} \Big( \nabla_\theta \log P(s_{t+1} \mid s_t, a_t) + \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big), \tag{9}$$

and thus, since the initial-state distribution and the transition dynamics do not depend on $\theta$,

$$\nabla_\theta \log P(\tau \mid \theta) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t). \tag{10}$$

Putting the above equations together with the log-derivative trick, we have the following:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \right]. \tag{11}$$
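As a numerical sanity check of Equation (11), the score-function estimator can be verified on a one-step toy problem with a Bernoulli($\theta$) policy and reward $R(a) = a$, whose true gradient $\frac{d}{d\theta}\mathbb{E}[a] = 1$; all names below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def pg_estimate(theta, n=200_000):
    """Monte-Carlo score-function estimator of d/dtheta E[R(a)] for a
    one-step Bernoulli(theta) 'policy' with R(a) = a. Averages
    grad-log-prob times reward over sampled actions, as in the
    policy-gradient expression derived above."""
    a = (rng.random(n) < theta).astype(float)      # sample actions
    # grad log pi(a) for Bernoulli: a/theta - (1-a)/(1-theta)
    score = a / theta - (1 - a) / (1 - theta)
    return float(np.mean(score * a))               # E[grad log pi * R]
```

With enough samples the estimate concentrates near the analytic value of 1, confirming that the derivation produces an unbiased gradient estimator.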