MAMRL: Exploiting Multi-agent Meta Reinforcement Learning in WAN Traffic Engineering

11/30/2021
by   Shan Sun, et al.
University of California, Riverside

Traffic optimization challenges, such as load balancing, flow scheduling, and improving packet delivery time, are difficult online decision-making problems in wide area networks (WANs). Complex heuristics are needed, for instance, to find optimal paths that improve packet delivery time and minimize interruptions caused by link failures or congestion. The recent success of reinforcement learning (RL) algorithms offers a way to build robust systems that learn from experience in model-free settings. In this work, we consider a path optimization problem, specifically for packet routing, in large complex networks. We develop and evaluate a model-free approach, applying multi-agent meta reinforcement learning (MAMRL), that determines the next hop of each packet so that it is delivered to its destination with minimum overall time. Specifically, we propose to leverage and compare deep policy optimization RL algorithms for enabling distributed model-free control in communication networks, and we present a novel meta-learning-based framework, MAMRL, for enabling quick adaptation to topology changes. To evaluate the proposed framework, we simulate various WAN topologies. Our extensive packet-level simulation results show that, compared to the classical shortest path and traditional reinforcement learning approaches, MAMRL significantly reduces the average packet delivery time even when network demand increases; and compared to a non-meta deep policy optimization algorithm, it reduces packet loss in far fewer episodes when link failures occur while offering comparable average packet delivery time.


1 Introduction

Network providers leverage traffic engineering techniques to optimize performance over operational IP networks [awduche]. With the exponential growth in demand, researchers have experimented with new optimization algorithms that aim to balance utilization and availability of network resources [teavar] or optimize over multiple criteria such as the number of hops, availability, or path attributes [sobrinho2020routing]. As wide-area network (WAN) backbones become costly to maintain and upgrade, software-defined networking (SDN) has emerged as a promising method to maximize routing performance, but it requires calculating and optimizing routes globally across users [b4, swan].

Optimizing for network performance includes measuring bandwidth, jitter, or latency over resource links, where a poorly designed infrastructure can lead to slow performance and potentially increased packet loss. Apart from the meticulously designed heuristics needed to develop an optimization algorithm, researchers typically analyze offline models of the network topology and the traffic demand matrix to infer the best paths between source-destination pairs. This approach, as a planning tool, has several limitations: (1) it is difficult to optimize within a few minutes as networks grow from tens to hundreds of routers, and (2) the dynamic traffic demand matrix requires recalculation every few days as some links become congested and possibly fail.

Recent breakthroughs in deep learning techniques leveraging data-driven learning have identified successful and simple solutions to complex online decision-making problems such as playing games [alphago], resource management [mao2016resource], and learning inherent traffic patterns [greguric2020application]. In particular, deep reinforcement learning (RL) uses agents that learn their optimal actions from experience and feedback, through trial-and-error interactions with the environment. Agents can gradually modify their behavior through these interactions without knowing an accurate mathematical model of the environment. RL has a natural application to path optimization problems: exploring different routing policies, gathering statistics about which policies maximize performance functions, and learning over time the best policy for which route to take. Examples of similar techniques in path optimization include two main approaches [valadarsky2017learning]: optimizing routing configurations by predicting future traffic conditions from past traffic patterns, or optimizing routing configurations over a set of feasible traffic scenarios to improve performance parameters. Modern communication networks have become very challenging mainly for two reasons [boutaba2018comprehensive]. First, communication networks have become very complicated and highly dynamic, which makes them hard to model and control. For example, in vehicular and ad hoc networks, nodes frequently move and link failures might occur during working hours, which can result in topology changes [govindan2016evolve, hong2018b4]. Second, as the scale of networks continues to grow, a central controller may be costly to install, slow to configure, and difficult to keep robust to malicious attacks [simplicio2010survey, al2015application]. Therefore, there is a need to develop innovative approaches in which traffic routing does not rely on accurate mathematical models and can be managed in a distributed manner. Examples of distributed path planning, such as ant colony optimization and swarm approaches, have shown success in static environments but still require more learning for dynamic environments [Schanemann2007].

In this paper, we investigate the above issues and evaluate whether deep reinforcement learning can provide an optimal, adaptable, and distributed solution to path selection as network load increases, especially when the topology changes, such as links failing or becoming congested. To design an optimal path selection for various network topologies and network loads, we develop a deep policy-based meta-learning algorithm (MAMRL) and evaluate its performance on various simulated WAN topologies. MAMRL can optimize multiple objectives: packet loss and packet delivery time. Our preliminary results show that MAMRL can perform better than shortest-path algorithms, especially as network loads increase and congestion becomes likely. Modeling the network as a multi-agent complex system, representing each router as an agent, we demonstrate that MAMRL can learn and perform at each router in a distributed manner, enabling future work on online traffic engineering on devices.

1.1 Motivation and Contribution

In this work, we investigate the challenge of building an adaptive network routing controller that continues to provide optimal network performance even when the topology changes. We model this challenge as a path optimization problem that must adapt to varying network loads and topology changes, addressed via novel deep reinforcement learning. Modern communication networks are highly dynamic and hard to model and predict; therefore, we aim to develop a novel experience-driven algorithm that learns to select paths from its experience rather than from an accurate mathematical model. Additionally, due to the difficulty of gathering information from widely distributed routers, we design a distributed optimization framework that learns locally optimal strategies.

Our specific contributions are:

  • Policy-based RL performs well at high network load.

    Using a policy-based deep reinforcement learning method, we train the model at a variety of network loads and save the optimal neural network. Once deployed, our trained neural network performs better at high network load than value-based RL.

  • Optimizing for multiple criteria. Our neural networks are optimized for two objectives: packet delivery time and packet loss on the network links. We design an appropriate reward (utility) function, which represents the preferences of the network controller, to minimize both packet delivery time and packet loss when link failures occur.

  • Quick adaptation to link failures. Our proposed MAMRL framework aims to make good online decisions under the guidance of powerful deep neural networks (DNNs). In addition, by leveraging the model-agnostic meta-learning technique [MAML], our neural networks can quickly switch to alternate paths to minimize both packet delivery time and packet loss when link failures occur.

  • Calculate optimal packet routes based on limited local observation for future on-device traffic engineering research.

    Our neural network model is deployed per agent, with one agent representing each router, allowing each router agent to learn and optimize traffic routing based on its local information. To achieve this, we leverage a dynamic consensus estimator [consensus] to diffuse local information and estimate global rewards, while still achieving the best average packet delivery time.

We develop a fully distributed multi-agent meta-reinforcement learning (MAMRL) framework for a packet routing problem, where each router agent aims to find the correct adjacent router to send its packets to, minimizing the overall average packet delivery time and avoiding packet loss. We demonstrate our results via extensive packet-level simulations on representative WAN topologies (ATT, Geant, B4), showing that MAMRL significantly outperforms several baseline algorithms at high network load.

2 Background

2.1 Deep Reinforcement Learning

Reinforcement learning is concerned with how an intelligent agent learns a good strategy from experimental trials and the relative feedback received. With the optimal strategy, the agent is able to actively adapt to the environment to maximize cumulative rewards. Almost all deep RL problems can be framed as Markov decision processes (MDPs), which consist of four key elements (S, A, P, R). More specifically, at each decision epoch t, the intelligent agent occupies a state s_t that belongs to the state space S of the environment and chooses an action a_t that belongs to the action space A to switch from one state to another. The probability that the process moves into its new state s_{t+1} is given by the state transition function P(s_{t+1} | s_t, a_t). Once an action is taken, the environment delivers a reward r_t as feedback. Figure 1 shows the general process of reinforcement learning (the definition of policy is given below).

Figure 1: A global reinforcement learning agent learning network states.

There are two key functions in RL: the policy function π and the value function V. The policy π(a|s) is a mapping from states to actions and tells the agent which action a to take in state s. For example, in the path optimization problem, the policy is the router's strategy for finding the best adjacent router to send out the current packets given the current utilization state of the communication network. The state value function V^π(s) measures how rewarding a state is under policy π via a prediction of future reward. Similarly, the action-state value function Q^π(s, a) tells, for a given policy, what the expected cumulative reward of taking action a in state s is.

The goal of RL is to find the optimal policy π* that achieves the optimal value functions, π* = argmax_π V^π(s) for all states s. Traffic engineering is a natural application of RL: exploring different routing policies, gathering statistics about which policies maximize the utility function, and learning the best policy accordingly.

Value-based algorithms versus policy-based algorithms: Value-based RL algorithms attempt to learn a tabular or approximate representation of the state-action value Q(s, a) and select the action with the maximal value among all available actions in a given state. For example, the Q-routing algorithm lets routers store Q-values as estimates of the (negative) transmission time between that router and the others; to shorten the average packet delivery time, routers choose the action with the maximal Q-value. Policy-based RL algorithms instead learn the policy directly as a parameterized function π_θ(a|s) with parameters θ, and train the policy to maximize the expected cumulative reward. Policy-based algorithms can learn stochastic policies; here, stochastic means stochastic on the action-state pairs where randomization makes sense. Value-based algorithms, which choose the actions with the maximal values, can usually only follow deterministic policies or stochastic policies with predetermined distributions, which is not the same as learning a truly optimal stochastic policy. Since current communication networks are highly dynamic and stochastic, we expect policy-based RL algorithms to outperform value-based RL algorithms in scenarios where the optimal policy is stochastic. In Section 2.2, we introduce more details about policy-based RL algorithms.
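To make the value-based baseline concrete, the following is a minimal sketch of the classic Q-routing update of Boyan and Littman [boyan1994packet], written here in terms of positive delivery-time estimates that are minimized (equivalent to the negative-transmission-time view above); variable names and the learning-rate value are illustrative assumptions, not the implementation evaluated later.

    # Minimal Q-routing sketch (value-based baseline). Q[x][d][y] is router x's
    # estimate of the remaining delivery time to destination d via neighbor y.

    def q_routing_update(Q, x, y, d, queue_delay, link_delay, lr=0.5):
        # After x forwards a packet bound for d to neighbor y, y reports its best
        # remaining estimate; x nudges its own estimate toward the observed target.
        best_from_y = min(Q[y][d].values())
        target = queue_delay + link_delay + best_from_y
        Q[x][d][y] += lr * (target - Q[x][d][y])

    def greedy_next_hop(Q, x, d):
        # Deterministic greedy choice: the neighbor with the smallest estimate.
        return min(Q[x][d], key=Q[x][d].get)

The greedy, deterministic choice in the last line is precisely what limits value-based routing under high load, as discussed above.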

In this work, to enable high-dimensional state representations (such as action histories), we consider deep RL algorithms, which adopt deep neural networks to approximate the policy functions. Here the policy parameters are the weights of the deep neural networks.

2.2 Performance under Partial Observability

Figure 1 shows the general process of reinforcement learning, where the agent is able to observe the global information of the environment. As stated in Section 1, in this work we consider a path optimization problem in a distributed network environment, where each router only has access to its own information and the information received from its adjacent routers. It follows that the path optimization problem can be modeled as a multi-agent partially observable Markov decision process (POMDP). A POMDP for N routers is defined by a tuple (S, {O_i}, {A_i}, P, {R_i}, V), where S and P carry the same meaning as in Section 2.1 and V denotes the set of all routers. O_i, A_i, and R_i are the local observation space, local action space, and local reward function of router i, respectively, and A = A_1 × ... × A_N is the joint action space of all routers. Each router i only has access to a private local observation o_i correlated with the state s. To choose actions, each router uses a stochastic parametric policy π_{θ_i}(a_i | o_i), which gives the probability of choosing action a_i at observation o_i. The joint policy of all routers therefore satisfies π_θ(a | o) = Π_{i ∈ V} π_{θ_i}(a_i | o_i). For a given time horizon T, we define the trajectory τ as the collection of state-action pairs up to time T. The probability distribution of the initial state is denoted by ρ_0. In the path optimization problem (a cooperative multi-agent problem), the collective objective of all routers is to collaboratively find policies π_{θ_i} for all i ∈ V that maximize the globally expected trajectory reward over the whole network. The goal of all routers is as follows,

J(θ) = E_{τ ∼ p(τ|θ)} [ Σ_{t=0}^{T} Σ_{i ∈ V} r_t^i ],    (1)

where θ = (θ_1, ..., θ_N) collects the policy parameters of all routers and r_t^i denotes the reward received by router i at time t. The reward r_t^i consists of two parts: 1) a local component based solely on the individual behavior of router i, and 2) a global component based on the behavior of the whole network. Note that in a partially observable environment, router i can only observe its local reward component and its local observation o_t^i.

As stated in Section 2.1, in this work we investigate how deep policy optimization algorithms work in path optimization problems. The main idea is to directly adjust the parameters θ_i of the policies in order to maximize the objective in (1) by taking steps in the direction of ∇_{θ_i} J(θ). For the POMDP, the gradient of the expected return for router i can be written (the derivation of Equation (2) is provided in Appendix A) as,

∇_{θ_i} J(θ) = E_{τ ∼ p(τ|θ)} [ ( Σ_{t=0}^{T} ∇_{θ_i} log π_{θ_i}(a_t^i | o_t^i) ) R(τ) ],  with  R(τ) = Σ_{t=0}^{T} Σ_{j ∈ V} r_t^j.    (2)

Note that with only local information, the trajectory return R(τ) cannot be well estimated, since its estimation requires the rewards of all routers. In this work, we propose a dynamic consensus algorithm to estimate R(τ) using only local information, as described in Section 3.
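To make the estimator in Equation (2) concrete, the short sketch below computes a score-function gradient estimate for a single router from one sampled trajectory; the array shapes and names are assumptions for illustration. The point to notice is that the weighting term is the sum of all routers' rewards, which a single router cannot observe locally; the consensus estimator of Section 3 replaces it with a locally computable estimate.

    import numpy as np

    def local_policy_gradient(grad_log_pi, rewards_all_routers):
        # grad_log_pi:         (T, P) per-step gradients of log pi_i(a_t | o_t)
        # rewards_all_routers: (T, N) reward of every router at every step
        global_return = rewards_all_routers.sum()        # requires all routers
        return grad_log_pi.sum(axis=0) * global_return   # (P,) gradient estimate

    # toy usage: 5 decision epochs, 3 policy parameters, 4 routers
    g = local_policy_gradient(np.random.randn(5, 3), np.random.rand(5, 4))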

2.3 Model-agnostic Meta-learning

Figure 2: The framework of model-agnostic meta learning.

In this work, we consider path optimization in the presence of link failures. Once a link failure occurs, the state transition function of the environment changes accordingly, meaning that a new POMDP arises. Let T_0 denote the Markov process modeled by the full network environment (no link failures) and T_k, where k = 1, ..., K, denote the Markov processes modeled by the network environment under different link-failure scenarios. Suppose that the distribution over all POMDPs is ρ(T). To make the routing algorithm adapt quickly to link failures (different POMDPs), we incorporate model-agnostic meta-learning [MAML] into the policy optimization algorithms. Meta-reinforcement learning aims to learn an algorithm that can quickly learn optimal policies in environments drawn from a distribution over a set of Markov decision processes. Our approach trains a well-generalized parametric policy initialization that is close to all possible environments (POMDPs), such that the policy can quickly improve its performance in a new environment with one or a few vanilla policy gradient steps (see Figure 2). The meta-learning objective can be written as:

max_θ  Σ_{T_k ∼ ρ(T)}  E_{τ' ∼ p_{T_k}(τ'|θ')} [ R(τ') ],  with  θ' = θ + α ∇_θ E_{τ ∼ p_{T_k}(τ|θ)} [ R(τ) ],    (3)

where α is the learning rate and p_{T_k}(τ|θ) represents the distribution of trajectories in environment T_k under policy π_θ. Model-agnostic meta-learning attempts to learn an initialization θ such that, for any environment T_k, the policy attains maximum performance after a few policy gradient steps.
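Numerically, the objective in Equation (3) scores an initialization by the return it achieves after one inner adaptation step in each sampled environment. The sketch below expresses this with placeholder callables for the gradient and return estimates; it is a structural illustration under assumed interfaces, not the training code.

    import numpy as np

    def inner_adapt(theta, env, policy_grad, alpha=0.1):
        # One vanilla policy-gradient step in a sampled environment: theta' = theta + alpha * grad.
        return theta + alpha * policy_grad(theta, env)

    def meta_objective(theta, sampled_envs, policy_grad, expected_return, alpha=0.1):
        # Average post-adaptation return, the quantity maximized over theta in Equation (3).
        return np.mean([expected_return(inner_adapt(theta, env, policy_grad, alpha), env)
                        for env in sampled_envs])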

3 Design: MAMRL Approach

3.1 Model

In the packet routing problem, packets are transmitted from a source to its destination through intermediate routers and available links. The mathematical model is given below.

Environment. We consider a possibly time-varying communication network environment, characterized by an undirected graph G_t = (V, E_t), where V is the set of routers and E_t is the set of transmission links between the routers at time t. The bandwidth of each link is limited, and packet loss might occur when the size of the packet to be transmitted is greater than the link's capacity. The communication network is possibly time-varying since link failures might happen during working hours; when a link failure happens, the capacity of the link becomes zero. Each router i has a set of neighbor routers denoted by N_i.

Routing. Packets are introduced into the network with a node of origin and another node of destination. They travel to their destination nodes by hopping on intermediate nodes. Each router only has one local port/queue used to store traffic. The queue of routers follows the first-in-first-out (FIFO) criterion. The node can forward the top packet in its local queue to one of its neighbors. Once a packet reaches its destination, it is removed from the network.

Objective. The packet routing problem aims to find the optimal transmission path between source and destination routers that minimizes the average packet delivery time, which is the sum of queuing time and transmission time, while preventing packet loss when link failures happen.

3.2 RL Formulation

Our standard RL setup consists of multiple router agents interacting with an environment (communication networks) in discrete decision epochs. We investigate the deep policy optimization algorithm to address packet routing in a partially observable network environment. To make the router controllers adapt to link failures more quickly, we leverage the model-agnostic meta-learning technique to learn the well-generalized policy initialization.

Figure 3: MAMRL framework.

Figure 3 shows the MAMRL setup (training and testing process) per router. In the testing process, each router uses the deep policy optimization algorithm coupled with the dynamic consensus estimator to learn the optimal policy. In order to let the routers adapt to topology changes quickly, the policy of each router is initialized using the well-generalized policy initialization, which is the output of the training process. The training process follows the traditional model-agnostic meta-reinforcement learning framework. The basic idea is to let the network controller encounter multiple link failures in the training process; it can use this experience to learn how to adapt if similar situations occur while deployed. In Figure 3, T_0 denotes the Markov process modeled by the full network environment (no link failures) and T_k, where k = 1, ..., K, denotes the Markov process modeled by the network environment with link failures. In the training process, the network controller collects data samples from all possible network environments according to the distribution ρ(T). However, the traditional design of model-agnostic meta-learning mainly focuses on single-agent (centralized) reinforcement learning problems; how to solve a multi-agent RL problem using model-agnostic meta-learning in a distributed manner is rarely studied. In this work, we aim to train and execute the network controller in a distributed manner. As shown in Figure 3, each router has an independent policy model represented by a deep neural network. The core of the proposed control framework is letting each router run a deep reinforcement learning algorithm to find the best action at each decision time instant, using only local information and local interaction. Since the routers aim to minimize the average packet delivery time of the whole network, each router needs to feed the global packet delivery time into its policy model as feedback/reward. To achieve this goal, we leverage the dynamic consensus algorithm, described below, to estimate the global reward function.

The policy optimization algorithms aim to find the best policy parameters that produce the highest long-term expected return using gradient ascent. The gradient of the long-term expected return with respect to the parameters of each router's policy is defined in Equation (2). However, with only local information, the global return cannot be well estimated, since its estimation requires the rewards of all routers. This motivates our consensus-based policy gradient algorithm, which leverages the communication network to diffuse local information, fostering collaboration among routers. We adapt the following dynamic consensus algorithm [consensus] into the policy optimization method,

(4)

where ε is the control gain, the local estimators are maintained by each router using only its own reward and its neighbors' estimates, and N_i denotes the neighbor set of router i. It can be proved that each local estimate converges to the vicinity of the network-wide average reward within a few time steps. It is worth mentioning that only local information is used in the estimator designed in Equation (4).
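The exact estimator is taken from [consensus]; as a hedged illustration, the sketch below implements a standard first-order dynamic average consensus update with the ingredients named above (a control gain, local estimates, and neighbor sets): each router mixes its neighbors' estimates and injects the change in its own reward, so every estimate tracks the network-wide average using purely local exchanges.

    import numpy as np

    def consensus_step(estimates, r_new, r_old, adjacency, gain=0.2):
        # estimates: (N,) local estimates of the global average reward
        # r_new, r_old: (N,) local rewards at the current / previous epoch
        # adjacency: (N, N) symmetric 0/1 router graph; gain: control gain
        degree = adjacency.sum(axis=1)
        mixing = adjacency @ estimates - degree * estimates   # sum_j (est_j - est_i)
        return estimates + gain * mixing + (r_new - r_old)

    # toy usage on a 3-router line graph: estimates converge near mean(r) = 2.0
    A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
    est, r_old, r_new = np.zeros(3), np.zeros(3), np.array([1.0, 2.0, 3.0])
    for _ in range(50):
        est, r_old = consensus_step(est, r_new, r_old, A), r_new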

We develop the following policy optimization method for the POMDP,

∇_{θ_i} J(θ) ≈ E_{τ ∼ p(τ|θ)} [ ( Σ_{t=0}^{T} ∇_{θ_i} log π_{θ_i}(a_t^i | o_t^i) ) Σ_{t=0}^{T} r̃_t^i ],    (5)

where

r̃_t^i = r_{1,t}^i + r̂_{2,t}^i.    (6)

Here, r̃_t^i denotes the sum of the local reward signal r_{1,t}^i and the global reward estimate r̂_{2,t}^i, and r̂_{2,t}^i is obtained by the dynamic consensus estimator designed in Equation (4). Note that both r_{1,t}^i and r̂_{2,t}^i can be obtained locally.

We build the deep neural network with one input layer, five hidden layers of size 128 with ReLU activations, and one output layer with a softmax (see Figure 3). As shown in Figure 3, at each decision epoch t, each router i provides the local observation o_t^i to its policy model and gets the action a_t^i back. Router i performs action a_t^i and switches to a new state. Then router i feeds the local reward r_{1,t}^i and the global reward estimate r̂_{2,t}^i, which is the output of the dynamic consensus estimator, to the policy model, and the policy model updates its weights with respect to the received reward estimate r̃_t^i. It is worthwhile to mention that, to update the policy in the direction of greater cumulative reward using Equation (5), only the local information o_t^i, r_{1,t}^i, and r̂_{2,t}^i is required. By integrating model-agnostic meta-learning and the proposed multi-agent policy optimization algorithm, we obtain MAMRL for the packet routing problem, where both the training and execution processes are distributed. These are shown in Algorithms 1 and 2.
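As a concrete rendering of the per-router policy model just described (five hidden layers of width 128 with ReLU and a softmax over next hops), the sketch below uses PyTorch; the observation dimension, neighbor count, and sampling code are illustrative assumptions.

    import torch
    import torch.nn as nn

    class RouterPolicy(nn.Module):
        # One input layer, five hidden layers of size 128 with ReLU, softmax output.
        def __init__(self, obs_dim, num_neighbors, hidden=128, depth=5):
            super().__init__()
            layers, d = [], obs_dim
            for _ in range(depth):
                layers += [nn.Linear(d, hidden), nn.ReLU()]
                d = hidden
            layers += [nn.Linear(d, num_neighbors), nn.Softmax(dim=-1)]
            self.net = nn.Sequential(*layers)

        def forward(self, obs):
            return self.net(obs)   # probability of forwarding to each neighbor

    # sample a next hop and keep the log-probability used in the update of Equation (5)
    policy = RouterPolicy(obs_dim=16, num_neighbors=4)
    probs = policy(torch.randn(16))
    action = torch.multinomial(probs, 1).item()
    log_prob = torch.log(probs[action])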

Input: ρ(T): distribution over network environments;
Input: α, β: step size hyper-parameters;
randomly initialize θ_i for all routers i;
while not done do
       sample a batch of environments T_k ∼ ρ(T);
       for all T_k do
             for all routers i do
                   Sample trajectories D_{i,k} using π_{θ_i} and Equation (4) in T_k;
             end for
             for all routers i do
                   Evaluate ∇_{θ_i} J_{T_k}(θ_i) using D_{i,k} based on Equation (5);
                   Compute adapted parameters with gradient ascent: θ'_{i,k} = θ_i + α ∇_{θ_i} J_{T_k}(θ_i);
             end for
             for all routers i do
                   Sample trajectories D'_{i,k} using π_{θ'_{i,k}} and Equation (4) in T_k;
             end for
       end for
       for all routers i do
             Evaluate ∇_{θ_i} J_{T_k}(θ'_{i,k}) for each sampled T_k using D'_{i,k} based on Equation (5);
             Update with gradient ascent: θ_i ← θ_i + β ∇_{θ_i} Σ_{T_k ∼ ρ(T)} J_{T_k}(θ'_{i,k});
       end for
end while
Return θ_i for all routers i as the parameter initialization.
Algorithm 1 Multi-agent meta reinforcement learning algorithm (MAMRL train time)
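The structure of Algorithm 1 can be summarized in plain Python as below, using a first-order approximation of the meta-gradient and placeholder callables for environment sampling and the per-router gradient estimate of Equations (4)-(5); this is a structural sketch under assumed interfaces, not the released implementation.

    import numpy as np

    def mamrl_train(theta_init, sample_envs, policy_grad, alpha, beta, iters=100):
        # theta_init : list of per-router parameter vectors
        # sample_envs: () -> batch of environments (link-failure scenarios) from rho(T)
        # policy_grad: (theta_i, env) -> gradient estimate from sampled trajectories
        theta = [t.copy() for t in theta_init]
        for _ in range(iters):
            meta_grads = [np.zeros_like(t) for t in theta]
            for env in sample_envs():
                # inner adaptation: one policy-gradient step per router in this environment
                adapted = [t + alpha * policy_grad(t, env) for t in theta]
                # outer objective: gradient evaluated at the adapted parameters
                # (first-order approximation of the MAML meta-gradient)
                for g, t_ad in zip(meta_grads, adapted):
                    g += policy_grad(t_ad, env)
            theta = [t + beta * g for t, g in zip(theta, meta_grads)]
        return theta   # well-generalized per-router initialization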
Input: A dynamic network environment with possible task distribution ρ(T);
Input: α: step size hyper-parameter;
Input: Learned parameter initialization θ*_i for all routers i;
while not done do
       if Link failure is True then
             θ_i ← θ*_i for all routers i;
       end if
       for all routers i do
             Sample trajectories D_i using π_{θ_i} and Equation (4);
       end for
       Evaluate ∇_{θ_i} J(θ_i) for all routers i using D_i based on Equation (5);
       for all routers i do
             Compute adapted parameters with gradient ascent: θ_i ← θ_i + α ∇_{θ_i} J(θ_i);
       end for
end while
Algorithm 2 Multi-agent meta reinforcement learning algorithm (MAMRL test time)
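Its test-time counterpart, Algorithm 2, amounts to running the consensus-assisted policy gradient online and, whenever a link failure is detected, restarting each router's parameters from the meta-learned initialization before re-adapting. A minimal sketch with assumed interfaces:

    def mamrl_test(theta_meta, detect_failure, policy_grad, env, alpha, steps=1000):
        # theta_meta: meta-learned per-router initialization from Algorithm 1
        theta = [t.copy() for t in theta_meta]
        for _ in range(steps):
            if detect_failure(env):
                # continual packet loss observed: reset to the well-generalized
                # initialization so a few gradient steps suffice to re-adapt
                theta = [t.copy() for t in theta_meta]
            # one consensus-assisted policy-gradient step per router (Equation (5))
            theta = [t + alpha * policy_grad(t, env) for t in theta]
        return theta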

3.3 Design of State, Action space and Rewards

We design the local observation o_i, local action a_i, and local reward estimate r̃_i as follows,

  • Observation of router i, o_i: 1) the destination router of the first packet in the local queue; 2) the last ten actions taken by router i; 3) the address of the router with the longest queue among all the neighbor routers of router i.

  • Action of router i, a_i: the next hop of the current packet in the queue.

  • Reward estimate of router i, r̃_i: the sum of r_1 and r̂_2, where r_1 is the negative number of packet losses occurring at router i and r̂_2 is the estimate of r_2 obtained using Equation (4). Here, r_2 denotes the negative average delivery time of all the packets delivered to router i.

Note that the design of the state space and reward is critical to the success of a deep reinforcement learning method. Our design of the state space captures key components of the network environment. In our design of the reward function, the element r̂_2 is introduced to minimize the average packet delivery time of the whole network, and the element r_1 is included to minimize the packet loss occurring at router i in the presence of link failures. Note that in our design r̃_i = r_1 + r̂_2, where r_1 is the negative number of packet losses occurring only at router i, whereas r̂_2 is an estimate of the negative average delivery time over the whole network environment. The reasons are summarized below.

Optimizing for packet delivery time: To achieve this goal, all the routers need to collaboratively find the best paths to reroute the traffic, and the delivery time of the packets delivered to router i is determined by the decisions of all the intermediate routers. That is, since packet delivery time is a signal based on global behavior, it is not enough for router i to know only the delivery time of the packets delivered to itself.

Optimizing for link failures: Although this goal also involves packet loss across the whole network, link failures have little effect on routers that are not directly connected to the failed links. Therefore, in our design, we provide only the packet loss that occurred at router i itself to the policy as feedback.
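For illustration, the sketch below assembles the observation and the reward estimate as listed in this subsection; the one-hot encodings and helper names are our own assumptions rather than the paper's implementation.

    import numpy as np

    def build_observation(head_packet_dest, last_ten_actions, busiest_neighbor, num_nodes):
        # o_i: destination of the head-of-queue packet, the last ten actions,
        # and the neighbor with the longest queue (all one-hot encoded here).
        onehot = lambda k: np.eye(num_nodes)[k]
        history = np.concatenate([onehot(a) for a in last_ten_actions])
        return np.concatenate([onehot(head_packet_dest), history, onehot(busiest_neighbor)])

    def reward_estimate(local_packet_losses, global_delay_estimate):
        # r~_i = r1 + r2_hat: negative local packet loss plus the consensus
        # estimate (already negative) of the network-wide average delivery time.
        return -float(local_packet_losses) + global_delay_estimate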

4 Evaluation

We conduct extensive simulations to evaluate the performance of the proposed MAMRL framework in a path optimization problem with static topologies and topologies with possibly failed links.

Topology Name Number of nodes Number of edges
B4 12 19
Geant 21 32
ATT 25 56
Table 1: Network topologies used in our evaluations.

Our evaluation aims to:

  • Benchmark the RL techniques against standard path optimization approaches and other RL approaches,

  • Examine how robust the MAMRL approach is under link failures, and

  • Show how quickly MAMRL adapts and reroutes packets to alternate paths achieving better performance.

4.1 Experiment Settings

The simulation runs are performed on three network topologies, B4, Geant, and ATT network. See Table 1 for a specification of network sizes. The B4 and ATT topologies (link capacities) and their traffic matrices (packet size) were obtained from the authors of Teavar [teavar]. The Geant topology is the European Research network providing connectivity to science experiments across Europe and US labs (www.geant.org).

To model the packet arrivals, we developed a discrete-event network simulator based on OpenAI Gym (https://github.com/esnet/daphne-public/tree/master/MAMRL-TE). Packets are introduced into the network with a node of origin and a node of destination, arriving according to a Poisson process of rate λ. They travel to their destination node by hopping on intermediate nodes. Each router has only one local port/queue used to store traffic, and the queue follows the FIFO discipline. In each time unit, a node forwards the top packet in its local queue to one of its neighbors. Once a packet reaches its destination, it is removed from the network environment. The bandwidth of each link is limited, and packet loss might occur when the size of the packet to be transmitted is greater than the link's capacity. When a link failure happens, the capacity of the link becomes zero.
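The following is a toy, hedged sketch of the simulation dynamics just described (Poisson packet arrivals, one FIFO queue per router, one forward per time unit, removal on delivery); it is a simplified stand-in for the released simulator linked above, with illustrative names and a random routing policy in the usage example.

    import numpy as np
    from collections import deque

    class ToyNetworkSim:
        def __init__(self, neighbors, load, seed=0):
            self.neighbors = neighbors            # node -> list of neighbor nodes
            self.load = load                      # Poisson arrival rate per time unit
            self.rng = np.random.default_rng(seed)
            self.queues = {n: deque() for n in neighbors}
            self.delivery_times, self.t = [], 0

        def step(self, choose_next_hop):
            self.t += 1
            nodes = list(self.neighbors)
            # packet arrivals with random origin/destination pairs
            for _ in range(self.rng.poisson(self.load)):
                src, dst = self.rng.choice(nodes, size=2, replace=False)
                self.queues[int(src)].append((int(dst), self.t))
            # each router forwards its head-of-queue packet to one neighbor
            moves = []
            for node in nodes:
                if self.queues[node]:
                    dst, born = self.queues[node].popleft()
                    moves.append((choose_next_hop(node, dst), dst, born))
            for nxt, dst, born in moves:
                if nxt == dst:
                    self.delivery_times.append(self.t - born)   # delivered: remove
                else:
                    self.queues[nxt].append((dst, born))

    # usage: random routing on a 3-node ring at load 0.5
    sim = ToyNetworkSim({0: [1, 2], 1: [0, 2], 2: [0, 1]}, load=0.5)
    for _ in range(100):
        sim.step(lambda node, dst: int(sim.rng.choice(sim.neighbors[node])))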

In the experiments, we use a fixed step size, trust-region policy optimization (TRPO) [trpo] as the meta-optimizer, and the standard linear feature baseline [baseline].

4.2 Impact of Increasing Network Load

We first test the MAMRL algorithm with static topologies (no link failures). We compare with the classical shortest path algorithm and two existing RL-based routing algorithms:

  • Shortest path algorithm (SPA) [spa]: a traditional packet routing algorithm.

  • Q-routing [boyan1994packet]: a value-based multi-agent reinforcement learning algorithm.

  • Policy gradient (PG) [peshkin2002reinforcement]: a policy-based multi-agent reinforcement learning algorithm.

In the experiments, the episodes terminate at a fixed horizon. After 10,000 training episodes, we restored the well-trained models to compare their performance in a new test environment where packets were generated at the corresponding network load level. Note that SPA does not need training and can be applied directly at test time. We tested the network on loads ranging from 0.005 to 0.5 and measured the average packet delivery time over several episodes in the testing process to compare with the results given by the above-mentioned baseline controllers. The load corresponds to the rate λ of the Poisson arrival process, i.e., the average number of packets injected per unit time.

Figure 4: Comparing average packet delivery time as load increases: (a) B4, (b) Geant, (c) ATT.

The average packet delivery time results versus network load are shown in Figure 4. Under low load, for all three topologies, MAMRL is slightly inferior to Q-routing and SPA. As the load increases, MAMRL performs much better than the baseline algorithms. On the B4 topology, when the traffic load is high, MAMRL reduces the average packet delivery time compared to the SPA, Q-routing, and policy gradient algorithms; on the Geant topology, at high traffic load, MAMRL significantly reduces the average packet delivery time compared to the same baselines; and on the ATT topology, MAMRL likewise reduces the average packet delivery time compared to SPA, Q-routing, and policy gradient. The reason is as follows. Under low load, there is no congestion along the route, so the deterministic policy learned by Q-routing performs as well as SPA, which is the optimal routing policy under low load. However, the routing policy learned by MAMRL is stochastic, which means that not all packets are sent down the optimal link; this is why MAMRL is slightly inferior to Q-routing and SPA under low load. As the load increases, the routes become crowded and the queues grow longer. Due to the stochastic nature of the communication network environment, the optimal policy should be stochastic under high load, which explains why MAMRL performs much better than the Q-routing and SPA controllers in this regime. The results in [peshkin2002reinforcement] also show that policy-based reinforcement learning algorithms perform better than value-based algorithms, especially at high flow load. However, the work in [peshkin2002reinforcement] only considers a simple policy gradient algorithm for the packet routing problem. Instead, we investigate a deep policy optimization algorithm that can take much more information as input, enlarging the state-action space for better policy making. The results in Figure 4 indicate that our MAMRL algorithm achieves a shorter delivery time than a simple policy gradient algorithm.

Figure 5: Packet loss results in the presence of link failures: (a) B4, (b) Geant, (c) ATT.

4.3 Impact of Link Failures

We test the MAMRL algorithm in the presence of link failures under a fixed network load. We let the routers train on all possible network environments (link failure scenarios) according to the distribution ρ(T) and return the policy parameter initialization using Algorithm 1. We then restored the well-trained models in a new test environment where links get disconnected according to the distribution ρ(T) (we assume that only one link fails at a time). We compare the results of the following four controllers: (1) the policy initialized from the parameters obtained by MAMRL, (2) the policy initialized from random weights (called random in the following), (3) the shortest path algorithm (SPA) [spa], and (4) the Q-routing algorithm [boyan1994packet]. Figure 5 shows the packet loss versus episodes and Figure 6 shows the average packet delivery time versus episodes. We also show the performance of the reinforcement learning routing algorithms (MAMRL, random, Q-routing) over the three network topologies during the online learning procedure in terms of the reward; the corresponding simulation results are presented in Figure 7. We can make the following observations from these results.

  1. In Figure 5, when there is a link failure, the model-based routing algorithm (i.e., SPA) suffers a large packet loss. The reason is that the SPA algorithm relies on prior knowledge of the network topology to make decisions; here we assume that as the network grows, it takes longer for the ISIS/OSPF protocols to react to link failures and update the routing tables (both ISIS and OSPF use the same Dijkstra algorithm to compute the best path through the network). The other learning algorithms (MAMRL, random, Q-routing) are model-free controllers whose policies are learned from experience. When a link failure happens, the packet loss sensor tells the RL routing controllers that there is significant packet loss at the particular link. Based on our design, the packet loss hurts the reward of the RL routing controllers; to maximize the reward function, the RL routing controllers adjust their policies to improve the reward and hence reduce the packet loss accordingly.

  2. Figure 7 shows how the reward value changes during online learning over the three network topologies. When there is a link failure on the B4 topology, Q-routing adapts to the link failure (reward values converge to stable values) after about 30 episodes, MAMRL (our algorithm) adapts after about 35 episodes, and the random algorithm (deep policy optimization with randomly initialized weights) adapts after about 800 episodes. The results for the Geant and ATT topologies are shown in Table 2. Q-routing is based on a value-based Q-learning algorithm and is often much faster at learning a policy than policy optimization algorithms [nachum2017bridging]. In this work, we propose the MAMRL algorithm, which leverages model-agnostic meta-learning to help the policy optimization adapt to link failures quickly. The basic idea of MAMRL is to let the network controller encounter all possible link failures in the training process; it can then use that experience to learn how to adapt. MAMRL aims to learn a well-generalized policy initialization that is close to all possible situations of the environment. Whenever there are continual packet losses at a particular link, the MAMRL controller reinitializes the policy models from the pre-trained well-generalized policy initialization. As can be seen from Figure 7, the MAMRL controller adapts to link failures at a speed comparable to the Q-routing algorithm, whereas the normal policy optimization controller adapts to link failures much more slowly.

Topology Name Q-routing MAMRL Random
B4 23 25 800
Geant 35 35 100
ATT 29 30 1000
Table 2: Average number of episodes used to adapt to link failures.
Figure 6: Average packet delivery time results in the presence of link failures: (a) B4, (b) Geant, (c) ATT.

4.4 Optimizing for Multiple Objectives

Policy optimization algorithms use gradient-based updates to optimize an objective, and traffic engineering aims to find a way of forwarding the data traffic that maximizes a utility function. The utility function may involve a set of values. In our design, the objectives are to minimize packet delivery time and packet loss; therefore, the utility function, which corresponds to the reward function in the RL algorithms, consists of a function of packet loss and a function of packet delivery time.

In future work, additional objectives such as bandwidth utilization and latency can be added if we want the RL controller to optimize over a larger set of parameters.

Figure 7: Reward in the presence of link failures: (a) B4, (b) Geant, (c) ATT.

5 Related work

Routing is the process of selecting a path for traffic in a network from a source node to a destination node. Routing algorithm selection may depend on multiple criteria, for example, performance metrics (delay, link utilization), the ability to adapt to changes in topology and traffic, and scalability (the algorithm should support a large number of routers). In the following, we briefly review traditional routing algorithms and reinforcement learning routing algorithms against the above-mentioned criteria. Since in this work we propose a multi-agent reinforcement learning algorithm to solve the packet routing problem, a literature review of multi-agent reinforcement learning algorithms is also included in this section.

5.1 Traditional Packet Routing Algorithms

There are multiple routing algorithms in the traditional packet routing literature [flooding, dijkstra, bellman]. Among these, the shortest path algorithm is the most commonly used [abolhasan2004review]. The shortest path algorithm aims to find the shortest path between source and destination nodes and get the packet delivered to the destination as quickly as possible. It is regarded as the best routing algorithm under low network load, since packets can be delivered in the least amount of time along the shortest path between two nodes, provided there is no congestion along the route. However, when the network load is high, the shortest path algorithm causes serious backlogs at busy routers. Another problem is that it relies on full knowledge of the network topology and hence needs manual adjustment when topology or traffic changes happen.
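For reference, the SPA baseline discussed above can be realized with Dijkstra's algorithm; the sketch below returns the first hop on a shortest path for a static topology (illustrative code, not the paper's baseline implementation), and it must be recomputed whenever the topology or link weights change.

    import heapq

    def dijkstra_next_hop(adj, src, dst):
        # adj: node -> {neighbor: link_weight}. Returns the first hop on a
        # shortest path from src to dst (None if unreachable or src == dst).
        dist, first_hop, heap = {src: 0.0}, {src: None}, [(0.0, src)]
        while heap:
            d, u = heapq.heappop(heap)
            if u == dst:
                return first_hop[u]
            if d > dist.get(u, float("inf")):
                continue            # stale heap entry
            for v, w in adj[u].items():
                nd = d + w
                if nd < dist.get(v, float("inf")):
                    dist[v] = nd
                    first_hop[v] = v if u == src else first_hop[u]
                    heapq.heappush(heap, (nd, v))
        return None

    # usage on a 4-node example: next hop from node 0 toward node 3
    adj = {0: {1: 1, 2: 4}, 1: {0: 1, 2: 1, 3: 5}, 2: {0: 4, 1: 1, 3: 1}, 3: {1: 5, 2: 1}}
    print(dijkstra_next_hop(adj, 0, 3))   # -> 1 (path 0-1-2-3, cost 3)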

5.2 Reinforcement Learning for Routing

Using reinforcement learning for packet routing has attracted increasing interest recently. Various reinforcement learning methods have been proposed to deal with this classical communication network problem and have achieved better performance than traditional routing methods.

Applications of traditional RL to packet routing problems started in the early 1990s with the seminal work [boyan1994packet], where Q-routing was proposed. Q-routing [boyan1994packet] is an adaptive routing approach based on a reinforcement learning algorithm known as Q-learning. Q-routing routes packets based on the learned delivery times (Q-values) and achieves a much smaller average delivery time than the benchmark shortest-path algorithm [spa]. Since then, several extensions of Q-routing have been proposed, e.g., Dual Q-routing [kumar1997dual], Predictive Q-routing [choi1996predictive], Full Echo Q-routing [kavalerov2017reinforcement], Hierarchical Q-routing [lopez2011simulated], and Ant-based Q-routing [subramanian1997ants]. However, Q-routing is a value-based RL algorithm; that is, Q-routing is a deterministic algorithm that might cause traffic congestion at high loads and does not distribute incoming traffic across the available links. Due to the drawbacks of value-based algorithms, some researchers began to consider policy-based RL algorithms for packet routing problems [tao2001multi, peshkin2002reinforcement]. Since policy-based RL algorithms can explore the class of stochastic policies, it is natural to expect them to be superior for certain types of network topologies and loads, where the optimal policy is stochastic. In [peshkin2002reinforcement], the results show that policy-based RL algorithms perform better than value-based algorithms, especially at high flow load. These traditional reinforcement learning routing algorithms use tabular functions or simple algebraic functions to estimate the Q functions or policy functions. This is limiting for large state spaces and thus cannot take full advantage of the network traffic history and dynamics. In this work, we investigate a policy-based deep reinforcement learning algorithm and use deep neural networks to approximate the policy function. The combination of deep learning techniques with reinforcement learning methods can learn useful representations for routing problems with high-dimensional raw data input and thus achieve superior performance.

Deep reinforcement learning, the combination of reinforcement learning and deep learning, has been able to solve a wide range of complex decision-making tasks. However, few works investigate how deep reinforcement learning can be leveraged for packet routing problems, since the network is a multi-agent environment and is non-stationary from the perspective of any individual router. This prevents the straightforward use of experience replay, which is crucial for stabilizing deep Q-learning [MADDPG]. [mukhutdinov2019multi] combines Q-routing and deep Q-learning to solve the routing problem; however, the training process of the algorithm proposed in [mukhutdinov2019multi] is centralized (all routers need to share parameters), which might cause issues in real-world large-scale network environments. The authors in [xu2018experience] propose to use a deep actor-critic reinforcement learning algorithm to optimize the performance of the communication network, but the training and testing processes in [xu2018experience] are also centralized. Recently, distributed deep Q-routing has been proposed in [you2020toward], where a deep recurrent neural network (LSTM) is utilized to tackle the non-stationarity problem in multi-agent reinforcement learning. However, the assumptions in [you2020toward] differ from those in our work: in [you2020toward], it is assumed that the bandwidth of each link equals the packet size, in which case only a single packet can be transmitted at a time, whereas in our work we assume that the capacity of each link is fixed, so that many packets can be transmitted at a time. We believe that our assumption is more realistic.

5.3 Multi-agent Deep Reinforcement Learning Approaches

There is a huge body of literature on single-agent deep reinforcement learning algorithms, where the environment stays largely stationary. Unfortunately, traditional deep reinforcement learning algorithms are poorly suited to multi-agent environments, where the environment becomes non-stationary from the perspective of any individual agent; this can cause divergence for value-based reinforcement learning and very high variance for policy-based reinforcement learning algorithms. In the literature, researchers have proposed multiple methods for applying reinforcement learning algorithms in multi-agent settings, for example, centralized training with distributed execution [MADDPG, iqbal2019actor, li2019robust], distributed training and execution under fully observable environments [zhang2018networked], and independent Q-learning [hausknecht2015deep, matignon2007hysteretic, foerster2017stabilising]. However, independent Q-learning is a value-based reinforcement learning algorithm, and in this work we aim to investigate policy-based reinforcement learning routing algorithms. The ideas of centralized training and fully observable state spaces work well when there is a small number of agents in the communication network; as the number of agents increases, the volume of information might overwhelm the capacity of a single unit. To tackle this problem, one effective idea is to remove the central unit and allow each agent to share information with only a subset of agents (called neighbors) and reach a consensus over a variable with them [wai2018multi, zhang2016data]. In this work, we also leverage the dynamic consensus algorithm to estimate the global reward function through interactive communication among routers. Moreover, we consider packet routing in the presence of link failures, meaning that not only is the local environment non-stationary from the perspective of any individual router, but the global environment also changes during working hours. We propose to use model-agnostic meta-learning to learn a well-generalized policy initialization that is close to all possible environments, such that the policy can be quickly adapted to different scenarios with a few gradient steps. This is the first time in the literature that model-agnostic meta-learning has been applied to a multi-agent reinforcement learning problem of this kind.

6 Discussion and Conclusions

In this work, we propose a novel framework, MAMRL, that utilizes deep policy optimization and meta-learning to produce a model-free network routing controller that performs better path optimization than standard approaches. Our experiments show that MAMRL can learn to control communication networks from its experience rather than from an accurate mathematical model. Specifically, we use deep policy optimization techniques to find optimal paths in complex WAN topologies. In order to address the difficulty of gathering information from widely distributed routers, we design a consensus-based policy optimization algorithm that learns the locally optimal strategy using only local information. Additionally, we consider path optimization in the presence of link failures and leverage the model-agnostic meta-learning algorithm to make the proposed network controller adapt to link failures more quickly. We demonstrate how MAMRL improves the learning efficiency of deep reinforcement learning for multi-agent packet routing in the presence of link failures. The experiments demonstrate the effectiveness and efficiency of MAMRL for the packet routing problem compared to several baseline controllers.

The distributed nature of MAMRL lays the foundation for our future work, where we will experiment with on-device traffic engineering in physical network setups to see how well the network adapts.

7 Acknowledgements

We would like to thank Dr. Manya Ghobadi for providing the data for the AT&T and B4 network topologies. We would like to express our gratitude to Dr. Bashir Mohammed for his useful suggestions and critiques of this research work. This work was supported by the U.S. Department of Energy, Office of Science Early Career Research Program for 'Large-scale Deep Learning for Intelligent Networks', contract no. FP00006145.

References

Appendix A Appendix

Here, we provide the derivation of Equation (2). All notation carries the same meaning as in Section 2.2. The probability of a trajectory τ = (s_0, a_0, ..., s_T, a_T) given that actions come from the joint policy π_θ is

P(τ|θ) = ρ_0(s_0) Π_{t=0}^{T} P(s_{t+1} | s_t, a_t) Π_{i ∈ V} π_{θ_i}(a_t^i | o_t^i).    (7)

The log-probability of a trajectory is

log P(τ|θ) = log ρ_0(s_0) + Σ_{t=0}^{T} [ log P(s_{t+1} | s_t, a_t) + Σ_{i ∈ V} log π_{θ_i}(a_t^i | o_t^i) ].    (8)

The gradient of the log-probability of a trajectory with respect to θ_i is

∇_{θ_i} log P(τ|θ) = Σ_{t=0}^{T} ∇_{θ_i} log π_{θ_i}(a_t^i | o_t^i),    (9)

since the initial state distribution and the transition function do not depend on θ_i, and thus, using the log-derivative trick ∇_{θ_i} P(τ|θ) = P(τ|θ) ∇_{θ_i} log P(τ|θ),

∇_{θ_i} J(θ) = ∇_{θ_i} E_{τ ∼ p(τ|θ)} [ R(τ) ] = E_{τ ∼ p(τ|θ)} [ ∇_{θ_i} log P(τ|θ) R(τ) ].    (10)

Putting the above equations together, we have

∇_{θ_i} J(θ) = E_{τ ∼ p(τ|θ)} [ ( Σ_{t=0}^{T} ∇_{θ_i} log π_{θ_i}(a_t^i | o_t^i) ) R(τ) ],    (11)

which is Equation (2).