CFR-RL: Traffic Engineering with Reinforcement Learning in SDN

04/24/2020 · Junjie Zhang, et al. · NYU · Beijing Institute of Technology

Traditional Traffic Engineering (TE) solutions can achieve optimal or near-optimal performance by rerouting as many flows as possible. However, they do not usually consider the negative impacts, such as packet reordering, caused by frequently rerouting flows in the network. To mitigate the impact of network disturbance, one promising TE solution is forwarding the majority of traffic flows using Equal-Cost Multi-Path (ECMP) and selectively rerouting a few critical flows using Software-Defined Networking (SDN) to balance link utilization of the network. However, critical flow rerouting is not trivial because the solution space for critical flow selection is enormous. Moreover, it is impossible to design a heuristic algorithm for this problem based on fixed and simple rules, since rule-based heuristics are unable to adapt to changes in the traffic matrix and network dynamics. In this paper, we propose CFR-RL (Critical Flow Rerouting-Reinforcement Learning), a Reinforcement Learning-based scheme that learns a policy to select critical flows for each given traffic matrix automatically. CFR-RL then reroutes these selected critical flows to balance link utilization of the network by formulating and solving a simple Linear Programming (LP) problem. Extensive evaluations show that CFR-RL achieves near-optimal performance by rerouting only 10%-21.3% of total traffic.


I Introduction

The emerging Software-Defined Networking (SDN) provides new opportunities to improve network performance [mckeown2008openflow]. In SDN, the control plane can generate routing policies based on its global view of the network and deploy these policies in the network by installing and updating flow entries at the SDN switches.

Traffic Engineering (TE) is one of the important network features of SDN [agarwal2013traffic, guo2014traffic, zhang2014dynamic], and is usually implemented in the control plane. The goal of TE is to help Internet Service Providers (ISPs) optimize network performance and resource utilization by configuring the routing across their backbone networks to control traffic distribution [zhang2015load, guo2019joint]. Due to dynamic load fluctuation among the nodes, traditional TE [wang1999explicit, osborne2002traffic, fortz2002optimizing, holmberg2004optimization, chu2009optimal, zhang2012optimizing] reroutes many flows periodically to balance the load on each link and minimize the probability of network congestion, where a flow is defined as a source-destination pair. The flow routing problem is usually formulated as an optimization problem with a particular performance metric as the objective function. For a given traffic matrix, one often wants to route all the flows such that the maximum link utilization in the network is minimized.

Although traditional TE solutions can achieve optimal or near-optimal performance by rerouting as many flows as possible, they do not consider the negative impacts, such as packet reordering, of rerouting flows in the network. To reach optimal performance, TE solutions might reroute many traffic flows just to slightly reduce the utilization of the most congested link, leading to significant network disturbance and service disruption. For example, a flow between two nodes in a backbone network is an aggregate of many micro-flows (e.g., five-tuple TCP flows) of different applications. Changing the path of a flow could temporarily disrupt the normal operation of many TCP flows. Packet loss or reordering may cause duplicate ACK transmissions, triggering the sender to reduce its congestion window size and hence its sending rate, eventually increasing the flow's completion time and degrading its Quality of Service (QoS). In addition, rerouting all flows in the network could place a high burden on the SDN controller, which has to calculate and deploy new flow paths [zhang2014dynamic]. Because rerouting flows to reduce congestion in backbone networks could adversely affect the quality of users' experience, network operators are reluctant to deploy these traditional TE solutions unless reducing network disturbance is taken into consideration in the TE design.

To mitigate the impact of network disturbance, one promising TE solution is forwarding the majority of traffic flows using Equal-Cost Multi-Path (ECMP) and selectively rerouting a few critical flows using SDN to balance link utilization of the network, where a critical flow is defined as a flow with a dominant impact on network performance (e.g., a flow on the most congested link) [zhang2014dynamic, zhang2014hybrid_routing]. Existing works show that such critical flows exist in a given traffic matrix [zhang2014dynamic]. ECMP reduces the congestion probability by equally splitting traffic over equal-cost paths, while critical flow rerouting aims to achieve further performance improvement with low network disturbance.

The critical flow rerouting problem can be decoupled into two sub-problems: (1) identifying critical flows and (2) rerouting them to achieve good performance. Although sub-problem (2) is relatively easy to solve by formulating it as a Linear Programming (LP) optimization problem, solving sub-problem (1) is not trivial because the solution space is huge. For example, if we want to find 10 critical flows among 100 flows, the solution space contains C(100, 10) ≈ 1.7 × 10^13 combinations. Considering that the traffic matrix varies at the timescale of minutes, an efficient solution should be able to quickly and effectively identify the critical flows for each traffic matrix. Unfortunately, it is impossible to design a heuristic algorithm for this algorithmically hard problem based on fixed and simple rules, because rule-based heuristics are unable to adapt to changes in the traffic matrix and network dynamics and thus cannot guarantee their performance when their design assumptions are violated, as shown later in Section VI-B.
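As a quick back-of-the-envelope check of this combinatorial growth (a minimal illustration only, not part of CFR-RL), the number of ways to pick critical flows can be counted directly:

from math import comb

# Number of ways to choose 10 critical flows out of 100 candidate flows.
print(comb(100, 10))   # 17310309456440, i.e. about 1.7e13

# The space explodes with network size, e.g. 2352 flow pairs (cf. Table I):
print(comb(2352, 10))  # roughly 1.4e27 combinations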

In this paper, we propose CFR-RL (Critical Flow Rerouting-Reinforcement Learning), a Reinforcement Learning-based scheme that performs critical flow selection followed by rerouting with linear programming. CFR-RL learns a policy to select critical flows purely from observations, without any domain-specific rule-based heuristic. It starts from scratch without any prior knowledge and gradually learns to make better selections through reinforcement, in the form of reward signals that reflect network performance for past selections. By continuing to observe the actual performance of its past selections, CFR-RL optimizes its selection policy for various traffic matrices over time. Once training is done, CFR-RL efficiently and effectively selects a small set of critical flows for each given traffic matrix and reroutes them to balance link utilization of the network by formulating and solving a simple linear programming optimization problem.

The main contributions of this paper are summarized as follows:

  1. We consider the impact of flow rerouting on network disturbance in our TE design and propose an effective scheme that not only minimizes the maximum link utilization but also reroutes only a small number of flows to reduce network disturbance.

  2. We customize an RL approach to learn the critical flow selection policy, and utilize LP as a reward function to generate reward signals. This combined RL+LP approach turns out to be surprisingly powerful.

  3. We evaluate and compare CFR-RL with rule-based heuristic schemes by conducting extensive experiments on different topologies with both real and synthesized traffic. CFR-RL not only outperforms rule-based heuristic schemes by up to 12.2%, but also reroutes 11.4%-14.7% less traffic on average. Overall, CFR-RL is able to achieve near-optimal performance by rerouting only 10%-21.3% of total traffic. In addition, the evaluation results show that CFR-RL is able to generalize to unseen traffic matrices.

The remainder of this paper is organized as follows. Section II reviews related work. Section III presents the system design. Section IV discusses how to train the critical flow selection policy using an RL-based approach. Section V describes how to reroute the critical flows. Section VI evaluates the effectiveness of our scheme. Section VII concludes the paper and discusses future work.

II Related Work

II-A Traditional TE Solutions

In Multiprotocol Label Switching (MPLS) networks, the routing problem has been formulated as an optimization problem in which explicit routes are computed for each source-destination pair to distribute traffic flows [wang1999explicit, osborne2002traffic]. Using the Open Shortest Path First (OSPF) and ECMP protocols, [fortz2002optimizing, holmberg2004optimization, chu2009optimal] attempt to balance link utilization as evenly as possible by carefully tuning the link costs to adjust path selection in ECMP. OSPF-OMP (Optimized Multipath) [OSPF-OMP], a variation of OSPF, attempts to dynamically determine the optimal allocation of traffic among multiple equal-cost paths based on the exchange of special traffic-load control messages. Weighted ECMP [zhang2012optimizing] extends ECMP to allow weighted traffic splitting at each node and achieves significant performance improvement over ECMP. Two-phase routing optimizes routing performance by selecting a set of intermediate nodes and tuning the traffic split ratios to these nodes [kodialam2008oblivious, antic2009two]: in the first phase, each source sends traffic to the intermediate nodes based on predetermined split ratios, and in the second phase, the intermediate nodes deliver the traffic to the final destinations. This approach requires IP tunnels, optical-layer circuits, or label switched paths in each phase.

II-B SDN-Based TE Solutions

Thanks to the flexible routing policies enabled by SDN, dynamic hybrid routing [zhang2014dynamic] achieves load balancing for a wide range of traffic scenarios by dynamically rebalancing traffic to react to traffic fluctuations with a preconfigured routing policy. Agarwal et al. [agarwal2013traffic] consider a network with partially deployed SDN switches. They improve network utilization and reduce packet loss by strategically placing the controller and SDN switches. Guo et al. [guo2014traffic] propose a novel algorithm named SOTE to minimize the maximum link utilization in an SDN/OSPF hybrid network.

II-C Machine Learning-Based TE Solutions

Machine learning has been used to improve the performance of backbone networks and data center networks. For backbone networks, Geyer et al. [geyer2018learning] design an automatic network protocol using semi-supervised deep learning. Sun et al. [sun2019sinet] selectively control a set of nodes and use an RL-based policy to dynamically change the routing decisions of flows traversing the selected nodes. To minimize signaling delay in large SDNs, Lin et al. [lin2016qos] employ a distributed three-level control plane architecture coupled with an RL-based solution named QoS-aware Adaptive Routing. Xu et al. [xu2018experience] use RL to optimize throughput and delay in TE. AuTO [chen2018auto] optimizes traffic routing in data center networks with a two-level RL design: the Peripheral System, deployed on end hosts to route small flows, and the Central System, which collects global traffic information and routes large flows.

However, none of the above works considers mitigating the impact of network disturbance and service disruption caused by rerouting.

III System Design

In this section, we describe the design of CFR-RL, an RL-based scheme that learns a critical flow selection policy and reroutes the corresponding critical flows to balance link utilization of the network.

Fig. 1: An illustrative example of the CFR-RL rerouting procedure. Each link capacity is equal to 1. Best viewed in color.

We train CFR-RL to learn a selection policy over a rich variety of historical traffic matrices, where traffic matrices can be measured by SDN switches and collected by an SDN central controller periodically [xu2017minimizing]. CFR-RL represents the selection policy as a neural network that maps a "raw" observation (e.g., a traffic matrix) to a combination of critical flows. The neural network provides a scalable and expressive way to incorporate various traffic matrices into the selection policy. CFR-RL trains this neural network based on the REINFORCE algorithm [Williams1992] with some customizations, as detailed in Section IV.

Once training is done, CFR-RL applies the critical flow selection policy to each real-time traffic matrix provided by the SDN controller periodically, and a small number of critical flows (e.g., 10% of total flows) are selected. The evaluation results in Section VI-B1 show that selecting 10% of total flows as critical flows (roughly 11%-21% of total traffic) is sufficient for CFR-RL to achieve near-optimal performance, while network disturbance (i.e., the percentage of total rerouted traffic) is reduced by at least 78.7% compared to rerouting all flows as traditional TE does. The SDN controller then reroutes the selected critical flows by installing and updating corresponding flow entries at the switches, using the flow rerouting optimization method described in Section V. The remaining flows continue to be routed by the default ECMP routing. Note that the flow entries installed at the switches for the critical flows selected in the previous period will time out, and those flows will be routed by either default ECMP routing or newly installed flow entries in the current period. Figure 1 shows an illustrative example: CFR-RL reroutes the flow from S0 to S4 to balance link load by installing forwarding entries at the corresponding switches along the SDN path.
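A minimal sketch of this periodic control loop is given below. All helper names (measure_traffic_matrix, policy, solve_rerouting_lp, install_flow_entries) are hypothetical placeholders for the controller's measurement, inference, optimization, and rule-installation steps, and picking the K highest-probability flows at inference time is an assumption rather than the authors' stated procedure.

import numpy as np

def cfr_rl_period(policy, k, measure_traffic_matrix,
                  solve_rerouting_lp, install_flow_entries):
    # One control period of a (hypothetical) CFR-RL deployment loop.
    tm = measure_traffic_matrix()                 # traffic matrix from SDN switches
    probs = policy(tm)                            # probability over all N*(N-1) flows
    critical = np.argsort(probs)[-k:]             # pick the K most probable critical flows
    ratios, max_util = solve_rerouting_lp(tm, critical)   # LP of Section V
    install_flow_entries(critical, ratios)        # remaining flows stay on ECMP
    return max_util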

There are two reasons why we do not adopt RL for the flow rerouting sub-problem. First, since the set of critical flows is small, LP is an efficient and optimal method for solving the rerouting problem. Second, a routing solution consists of a split ratio (i.e., a traffic demand percentage) for each flow on each link. Given a network with |E| links, there are K·|E| split ratios in total in the routing solution, where K is the number of critical flows. Since split ratios are continuous numbers, we would have to adopt RL methods for continuous action domains [ddpg, John2015TRPO]. However, due to the high-dimensional, continuous action space, this type of RL method has been shown to lead to slow and ineffective learning when the number of output parameters (i.e., K·|E|) is large [xu2018experience, Valadarsky2017learning].
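For a rough sense of scale (a simple illustration using the topology sizes from Table I, with K assumed to be 10% of flows), the number of continuous split ratios such a rerouting policy would have to output is:

# Rough count of the continuous split ratios an RL agent would have to output
# if it rerouted the critical flows directly (K set to 10% of flows).
topologies = {                    # (flow pairs, directed links) from Table I
    "Abilene":    (132, 30),
    "EBONE":      (506, 74),
    "Sprintlink": (1892, 166),
    "Tiscali":    (2352, 172),
}
for name, (pairs, links) in topologies.items():
    k = round(0.1 * pairs)        # number of critical flows
    print(f"{name}: {k} flows x {links} links = {k * links} split ratios")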

IV Learning a Critical Flow Selection Policy

In this section, we describe how to learn a critical flow selection policy using a customized RL approach.

IV-A Reinforcement Learning Formulation

Input / State Space: The agent takes a state s_t as input, where s_t is the traffic matrix at time step t, containing the traffic demand of each flow. Typically, the network topology remains unchanged; thus, we do not include the topology information as part of the input. The results in Section VI-B show that CFR-RL is able to learn a good policy without prior knowledge of the network. It is worth noting that including additional information, such as link states, as part of the input might be beneficial for training the critical flow selection policy. We will investigate this in our future work.

Action Space: For each state s_t, CFR-RL selects K critical flows. Given that there are N·(N−1) candidate flows in a network with N nodes, this RL problem would require a large action space of size C(N·(N−1), K). Inspired by [Mao:2016:RMD:3005745.3005750], we instead define the action space as {0, 1, …, N·(N−1)−1} and allow the agent to sample K different actions (i.e., K distinct flows) in each time step.
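A minimal sketch of this sampling step, assuming the policy outputs a probability over all N(N−1) flows (the helper name and example sizes are illustrative):

import numpy as np

def sample_solution(action_probs, k, rng=np.random.default_rng()):
    # Sample K distinct flow indices (one "solution") from the policy's
    # probability distribution, i.e. sampling without replacement.
    return rng.choice(len(action_probs), size=k, replace=False, p=action_probs)

# Example: a 12-node network has 132 candidate flows; select K = 13 of them.
probs = np.full(132, 1.0 / 132)
print(sample_solution(probs, k=13))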

Reward: After sampling K different critical flows for a given state s_t, CFR-RL reroutes these critical flows and obtains the maximum link utilization U by solving the rerouting optimization problem (4a) (described in Section V). The reward is defined as r_t = 1/U, which reflects the network performance after rerouting the critical flows to balance link utilization. The smaller U (i.e., the greater the reward r_t), the better the performance. In other words, CFR-RL adopts LP as a reward function to produce reward signals for RL.

IV-B Training Algorithm

Fig. 2: Policy network architecture.

The critical flow selection policy is represented by a neural network. This policy network takes a state s_t as input, as described above, and outputs a probability distribution π_θ(a | s_t) over all available actions. Figure 2 shows the architecture of the policy network (details in Section VI-A1). Since K different actions are sampled for each state and their order does not matter, we define a solution c_t = {a_{t,1}, …, a_{t,K}} as the combination of the K sampled actions. The probability of selecting a solution c_t given a state s_t under a stochastic policy parameterized by θ can be approximated as follows (to obtain K distinct actions, we sample without replacement; the right-hand side of Eq. (1) is the solution probability when sampling with replacement, and we use it to approximate the probability of the solution given a state for simplicity):

π_θ(c_t | s_t) ≈ ∏_{k=1}^{K} π_θ(a_{t,k} | s_t)    (1)

The goal of training is to maximize the network performance over various traffic matrices, i.e., to maximize the expected reward E[r_t]. Thus, we optimize θ by gradient ascent, using the REINFORCE algorithm with a baseline b(s_t). The policy parameter θ is updated according to the following equation:

θ ← θ + α ∇_θ log π_θ(c_t | s_t) · (r_t − b(s_t))    (2)

where α is the learning rate for the policy network. A good baseline b(s_t) reduces gradient variance and thus increases the speed of learning. In this paper, we use the average reward observed for each state s_t as the baseline. The advantage (r_t − b(s_t)) indicates how much better a specific solution is compared to the "average solution" for a given state according to the policy. Intuitively, Eq. (2) can be explained as follows: if (r_t − b(s_t)) is positive, π_θ(c_t | s_t) (i.e., the probability of the solution c_t) is increased by updating the policy parameters θ in the direction ∇_θ log π_θ(c_t | s_t) with a step size of α·(r_t − b(s_t)); otherwise, the solution probability is decreased. The net effect of Eq. (2) is to reinforce actions that empirically lead to better rewards.

To ensure that the RL agent explores the action space adequately during training and discovers good policies, the entropy of the policy is added to Eq. (2). This technique improves exploration by discouraging premature convergence to suboptimal deterministic policies [A3C]. Eq. (2) is then modified as follows:

θ ← θ + α [ ∇_θ log π_θ(c_t | s_t) · (r_t − b(s_t)) + β ∇_θ H(π_θ(· | s_t)) ]    (3)

where H(π_θ(· | s_t)) is the entropy of the policy (the probability distribution over actions). The hyperparameter β controls the strength of the entropy regularization term. Algorithm 1 shows the pseudo-code of the training algorithm.

  Initialize θ; R(s) ← 0 (keeps track of the sum of rewards for each state s); N(s) ← 0 (keeps track of the visit count of each state s)
  for each iteration do
     Δθ ← 0
     Sample a batch of states {s_t} of size T
     for t = 1, …, T do
        Sample a solution c_t according to policy π_θ(· | s_t)
        Receive reward r_t
        if R(s_t) exists and N(s_t) > 0 then
            b_t ← R(s_t) / N(s_t) (average reward for state s_t)
        else
           b_t ← r_t, R(s_t) ← 0, N(s_t) ← 0
        end if
     end for
     for t = 1, …, T do
        R(s_t) ← R(s_t) + r_t
        N(s_t) ← N(s_t) + 1
        Δθ ← Δθ + α [ ∇_θ log π_θ(c_t | s_t) · (r_t − b_t) + β ∇_θ H(π_θ(· | s_t)) ]
     end for
     θ ← θ + Δθ
  end for
Algorithm 1 Training Algorithm
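A minimal TensorFlow sketch of the gradient step in Eq. (3) / Algorithm 1 is given below. It assumes the policy network outputs logits over all N(N−1) flows and treats the log-probability of a solution as the sum of the log-probabilities of its K sampled actions (Eq. (1)); the function and variable names are illustrative, not taken from the authors' code.

import tensorflow as tf

def policy_gradient_step(model, optimizer, state, solution, advantage, beta=0.1):
    # One REINFORCE update; `advantage` is (r - b) with the average-reward
    # baseline already subtracted, plus entropy regularization as in Eq. (3).
    with tf.GradientTape() as tape:
        logits = model(state)                               # shape [1, N*(N-1)]
        log_probs = tf.nn.log_softmax(logits)
        probs = tf.nn.softmax(logits)
        # log pi(solution | state) ~ sum of log-probs of the K sampled actions.
        solution_log_prob = tf.reduce_sum(tf.gather(log_probs[0], solution))
        entropy = -tf.reduce_sum(probs * log_probs)         # exploration bonus
        loss = -(advantage * solution_log_prob + beta * entropy)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))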

V Rerouting Critical Flows

In this section, we describe how to reroute the selected critical flows to balance link utilization of the network.

V-A Notations

G = (V, E): a network with |V| nodes and |E| directed links.
c_{ij}: the capacity of link (i, j) ∈ E.
l_{ij}: the traffic load on link (i, j) ∈ E.
D_{sd}: the traffic demand from source s to destination d (s, d ∈ V).
p^{sd}_{ij}: the percentage of the traffic demand from source s to destination d routed on link (i, j) ∈ E.

V-B Explicit Routing For Critical Flows

By default, traffic is distributed according to ECMP routing. We reroute the small set of selected critical flows, denoted f*, by conducting an explicit routing optimization for these flows.

The critical flow rerouting problem can be described as follows. Given a network G = (V, E), the set of traffic demands D_{sd} for the selected critical flows (s, d) ∈ f*, and the background link load contributed by the remaining flows under the default ECMP routing, our objective is to obtain the optimal explicit routing ratios for each critical flow so that the maximum link utilization U is minimized.

To search all possible under-utilized paths for the selected critical flows, we formulate the rerouting problem as the following optimization problem.

 

minimize   U + ε · Σ_{(s,d)∈f*} Σ_{(i,j)∈E} D_{sd} · p^{sd}_{ij}    (4a)
subject to
   l_{ij} = Σ_{(s,d)∈f*} D_{sd} · p^{sd}_{ij} + l^{ECMP}_{ij},  ∀(i,j) ∈ E    (4b)
   l_{ij} ≤ c_{ij} · U,  ∀(i,j) ∈ E    (4c)
   Σ_{j:(i,j)∈E} p^{sd}_{ij} − Σ_{j:(j,i)∈E} p^{sd}_{ji} = 1 if i = s, −1 if i = d, 0 otherwise,  ∀(s,d) ∈ f*, ∀i ∈ V    (4d)
   0 ≤ p^{sd}_{ij} ≤ 1,  ∀(s,d) ∈ f*, ∀(i,j) ∈ E    (4e)

where l^{ECMP}_{ij} denotes the background load on link (i, j) contributed by the remaining (non-critical) flows under ECMP.
The second term in (4a) is needed because otherwise the optimal solution may include unnecessarily long paths as long as they avoid the most congested link; ε is a sufficiently small positive constant that ensures the minimization of U takes higher priority [916782]. Constraint (4b) defines the traffic load on each link as the sum of the traffic demands routed by the explicit routing and the background traffic routed by the default ECMP routing. Constraint (4c) is the link capacity utilization constraint. Constraint (4d) is the flow conservation constraint for the selected critical flows.

By solving the above LP problem with an LP solver (such as Gurobi [gurobi]), we obtain the optimal explicit routing solution for the selected critical flows f*. The SDN controller then installs and updates flow entries at the switches accordingly.
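As an illustration, the sketch below builds this LP with PuLP, a generic open-source LP modeler used here purely as a stand-in for Gurobi; the data-structure layout (dicts keyed by links and flow pairs) and the constant eps are assumptions, and the notation follows Section V-A.

import pulp

def reroute_critical_flows(nodes, links, capacity, ecmp_load, demands, eps=1e-4):
    # Sketch of the rerouting LP (4a)-(4e).
    #   links:     list of directed edges (i, j)
    #   capacity:  {(i, j): c_ij}
    #   ecmp_load: {(i, j): background load of non-critical flows under ECMP}
    #   demands:   {(s, d): D_sd} for the selected critical flows f* only
    prob = pulp.LpProblem("critical_flow_rerouting", pulp.LpMinimize)
    U = pulp.LpVariable("U", lowBound=0)                      # max link utilization
    p = {(f, e): pulp.LpVariable(f"p_{f[0]}_{f[1]}_{e[0]}_{e[1]}",
                                 lowBound=0, upBound=1)       # (4e)
         for f in demands for e in links}

    # (4a): minimize U plus a small penalty that discourages needlessly long paths.
    prob += U + eps * pulp.lpSum(demands[f] * p[(f, e)] for f in demands for e in links)

    # (4b) + (4c): rerouted traffic plus ECMP background load stays within U * capacity.
    for e in links:
        load = ecmp_load[e] + pulp.lpSum(demands[f] * p[(f, e)] for f in demands)
        prob += load <= capacity[e] * U

    # (4d): flow conservation for every critical flow at every node.
    for (s, d) in demands:
        for v in nodes:
            out_flow = pulp.lpSum(p[((s, d), e)] for e in links if e[0] == v)
            in_flow = pulp.lpSum(p[((s, d), e)] for e in links if e[1] == v)
            rhs = 1 if v == s else (-1 if v == d else 0)
            prob += out_flow - in_flow == rhs

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return pulp.value(U), {key: var.value() for key, var in p.items()}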

VI Evaluation

In this section, a series of simulation experiments are conducted using real-world network topologies to evaluate the performance of CFR-RL and to show its effectiveness by comparing it with rule-based heuristic schemes.

VI-A Evaluation Setup

VI-A1 Implementation

The policy neural network consists of three layers. The first layer is a convolutional layer with 128 filters and a stride of 1. The second layer is a fully connected layer with 128 neurons. The activation function used for the first two layers is Leaky ReLU [leaky_relu]. The final layer is a fully connected linear layer (without an activation function) with N·(N−1) neurons, one per candidate critical flow. The softmax function is applied to the output of the final layer to generate the probabilities of all available actions. The learning rate is initially set to 0.001 and decays every 500 iterations with a base of 0.96 until it reaches the minimum value of 0.0001. The entropy factor β is set to 0.1. We found that this set of hyperparameters provides a good trade-off between performance and the computational complexity of the model (details in Section VI-B5), so we fixed it throughout our experiments. The results in the following experiments show that CFR-RL works well on different network topologies with this single set of fixed hyperparameters. The architecture is implemented using TensorFlow [tensorflow].
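A Keras sketch of this architecture is shown below. The layer widths, stride, and Leaky ReLU activations follow the description above, while treating the traffic matrix as a 2-D single-channel image and using a 3×3 kernel are assumptions (the kernel size is not stated in the text); the softmax of the final layer is applied to these logits when sampling actions, as in the training sketch of Section IV-B.

import tensorflow as tf

def build_policy_network(num_nodes, kernel_size=3):
    # Conv (128 filters, stride 1) -> FC (128) -> linear layer with one logit per flow.
    num_flows = num_nodes * (num_nodes - 1)
    tm_input = tf.keras.Input(shape=(num_nodes, num_nodes, 1))     # traffic matrix
    x = tf.keras.layers.Conv2D(128, kernel_size, strides=1, padding="same")(tm_input)
    x = tf.keras.layers.LeakyReLU()(x)
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(128)(x)
    x = tf.keras.layers.LeakyReLU()(x)
    logits = tf.keras.layers.Dense(num_flows)(x)                   # one unit per candidate flow
    return tf.keras.Model(inputs=tm_input, outputs=logits)

model = build_policy_network(num_nodes=12)   # e.g. Abilene: 132 candidate flows
model.summary()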

Topology Nodes Directed Links Pairs
Abilene 12 30 132
EBONE (Europe) 23 74 506
Sprintlink (US) 44 166 1892
Tiscali (Europe) 49 172 2352
TABLE I: ISP networks used in evaluation

VI-A2 Dataset

In our evaluation, we use four real-world network topologies: the Abilene network and three ISP networks collected by ROCKETFUEL [spring2002measuring]. The numbers of nodes and directed links of these networks are listed in Table I. For the Abilene network, the measured traffic matrices and network topology information (such as link connectivity, weights, and capacities) are available in [abilene_tm]. Since Abilene traffic matrices are measured every 5 minutes, there are 288 traffic matrices per day. To evaluate the performance of CFR-RL, we use the 2,016 traffic matrices of the first week (starting from Mar. 1st, 2004) as our dataset. For the ROCKETFUEL topologies, the link costs are given but the link capacities are not. Therefore, we infer the link capacities as the inverse of the link costs, following the default link cost setting in Cisco routers; in other words, the link costs are inversely proportional to the link capacities. This approach is commonly adopted in the literature [zhang2014dynamic, zhang2014hybrid_routing, kodialam2008oblivious]. In addition, since traffic matrices are unavailable for the ISP networks from ROCKETFUEL, we use a traffic matrix generation tool [TMgen] to generate 50 synthetic exponential traffic matrices and 50 synthetic uniform traffic matrices for each network. Unless otherwise noted, we use a random sample of 70% of our dataset as the training set for CFR-RL and the remaining 30% as the test set for evaluating all schemes.
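The two preprocessing steps described above can be sketched as follows; the scaling constant and data layout are assumptions, not values from the paper.

import numpy as np

def infer_capacities(link_costs, scale=1.0):
    # Cisco-style default: link capacity proportional to the inverse of its cost.
    return {link: scale / cost for link, cost in link_costs.items()}

def split_dataset(traffic_matrices, train_fraction=0.7, seed=0):
    # Random 70%/30% train/test split of the traffic matrices.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(traffic_matrices))
    cut = int(train_fraction * len(traffic_matrices))
    train = [traffic_matrices[i] for i in idx[:cut]]
    test = [traffic_matrices[i] for i in idx[cut:]]
    return train, test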

VI-A3 Parallel Training

To speed up training, we spawn multiple actor agents in parallel, as suggested by [A3C]. CFR-RL uses 20 actor agents by default. Each actor agent is configured to experience a different subset of the training set. These agents continually forward their (state, action, advantage (i.e., r_t − b_t)) tuples to a central learner agent, which aggregates them to train the policy network. The central learner agent performs a gradient update according to Eq. (3) using the received tuples and then sends the updated parameters of the policy network back to the actor agents. The whole process happens asynchronously among all agents. We use 21 CPU cores to train CFR-RL (i.e., one 2.6 GHz core per agent).

VI-A4 Metrics

(1) Load Balancing Performance Ratio: To demonstrate the load balancing performance of the proposed CFR-RL scheme, we define a load balancing performance ratio as follows:

PR_U = U* / U    (5)

where U is the maximum link utilization achieved by CFR-RL and U* is the maximum link utilization achieved by an optimal explicit routing for all flows (the corresponding LP formulation is similar to (4a), except that the objective is to obtain the optimal explicit routing ratios for all flows; the background link load is 0 in that problem). PR_U = 1 means that CFR-RL achieves load balancing as good as the optimal routing; a lower ratio indicates that the load balancing performance of CFR-RL is farther from that of the optimal routing.

(2) End-to-End Delay Performance Ratio: To model and measure end-to-end delay in the network, we define the overall end-to-end delay as Ω = Σ_{(i,j)∈E} l_{ij} / (c_{ij} − l_{ij}), as described in [zhang2012optimizing]. An end-to-end delay performance ratio is then defined as follows:

PR_Ω = Ω* / Ω    (6)

where Ω is the end-to-end delay achieved by CFR-RL and Ω* is the minimum end-to-end delay achieved by an optimal explicit routing for all flows whose objective is to minimize Ω. Note that the rerouting solution for the selected critical flows is still obtained by solving (4a). The higher PR_Ω, the better the end-to-end delay performance achieved by CFR-RL; PR_Ω = 1 means that CFR-RL achieves the same minimum end-to-end delay as the optimal routing.

(3) Rerouting Disturbance: To measure the disturbance caused by rerouting, we define the rerouting disturbance as the percentage of total traffic that is rerouted for a given traffic matrix (although part of a rerouted flow's traffic might still follow its original ECMP paths, updating the routing at the switches might cause packet drops or reordering, so we still count that flow's entire traffic as rerouted), i.e.,

RD = ( Σ_{(s,d)∈f*} D_{sd} / Σ_{(s,d)} D_{sd} ) × 100%    (7)

where the numerator is the total traffic of the selected critical flows that need to be rerouted and the denominator is the total traffic of all flows. The smaller RD, the less disturbance caused by rerouting.
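All three metrics can be computed directly from the routing results; a minimal sketch using the symbols defined above (the delay helper mirrors the Ω formula of metric (2)):

def end_to_end_delay(loads, capacities):
    # Overall delay model of metric (2): sum of l / (c - l) over all links.
    return sum(l / (c - l) for l, c in zip(loads, capacities))

def load_balancing_ratio(u_optimal, u_scheme):
    # Eq. (5): 1.0 means the scheme matches the optimal maximum link utilization.
    return u_optimal / u_scheme

def delay_ratio(delay_optimal, delay_scheme):
    # Eq. (6): ratio of the minimum achievable end-to-end delay to the scheme's.
    return delay_optimal / delay_scheme

def rerouting_disturbance(critical_demands, all_demands):
    # Eq. (7): fraction of total traffic that is rerouted.
    return sum(critical_demands) / sum(all_demands)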

VI-A5 Rule-Based Heuristics

For comparison, we also evaluate the following two rule-based heuristics (a minimal sketch of both selection rules follows the list):

  1. Top-K: selects the K largest flows from a given traffic matrix in terms of demand volume. This approach is based on the assumption that flows with larger traffic volumes have a dominant impact on network performance.

  2. Top-K Critical: similar to the Top-K approach, but selects the K largest flows traversing the most congested links. This approach is based on the assumption that flows traversing the most congested links have a dominant impact on network performance.
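Both baselines reduce to simple selection rules; the sketch below (with link/flow bookkeeping passed in as plain dicts, an assumption for illustration) shows one way to implement them.

import numpy as np

def top_k(traffic_matrix, k):
    # Top-K: the K flows with the largest demand volumes.
    n = len(traffic_matrix)
    demands = {(s, d): traffic_matrix[s][d]
               for s in range(n) for d in range(n) if s != d}
    return sorted(demands, key=demands.get, reverse=True)[:k]

def top_k_critical(traffic_matrix, k, link_utilization, flows_on_link):
    # Top-K Critical: the K largest flows among those traversing the most
    # congested links under ECMP (link-to-flow bookkeeping assumed given).
    selected = []
    for link in sorted(link_utilization, key=link_utilization.get, reverse=True):
        for flow in sorted(flows_on_link[link],
                           key=lambda f: traffic_matrix[f[0]][f[1]], reverse=True):
            if flow not in selected:
                selected.append(flow)
                if len(selected) == k:
                    return selected
    return selected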

VI-B Evaluation Results

VI-B1 Number of Critical Flows

Fig. 3: Average load balancing performance ratio of CFR-RL with increasing number of critical flows on the four networks.

We conduct a series of experiments with different numbers of selected critical flows K, fixing all other parameters throughout the experiments.

Figure 3 shows the average load balancing performance ratio achieved by CFR-RL as the number of critical flows K increases. The initial value at K = 0 represents the default ECMP routing. The results indicate that there is considerable room for improvement when flows are routed by ECMP alone. The sharp increase in the average load balancing performance ratio for all four networks shown in Fig. 3 indicates that CFR-RL is able to achieve near-optimal load balancing performance by rerouting only 10% of the flows. As a result, network disturbance is much reduced compared to rerouting all flows as in traditional TE. For the subsequent experiments, we set K to 10% of the total number of flows for each network.

VI-B2 Performance Comparison

Fig. 4: Comparison of average load balancing performance ratio, where error bars span one standard deviation from the average, on the entire test set of the four networks.

Fig. 5: Comparison of load balancing performance in the four networks on each test traffic matrix.
Fig. 6: Comparison of average end-to-end delay performance ratio where error bars span one standard deviation from the average on the entire test set of the four networks.
Fig. 7: Comparison of end-to-end delay performance in the four networks on each test traffic matrix.
Topology  CFR-RL Top-K Critical Top-K
Abilene 21.3% 32.7% 42.9%
EBONE (Exponential / Uniform) 11.2% / 10.0% 25.9% / 11.5% 32.9% / 11.7%
Sprintlink (Exponential / Uniform) 11.3% / 10.1% 23.6% / 13.8% 33.2% / 14.6%
Tiscali (Exponential / Uniform) 11.2% / 10.0% 24.5% / 12.0% 32.7% / 12.2%
TABLE II: Comparison of average rerouting disturbance
Fig. 8: Comparison of load balancing performance ratio in CDF with the traffic matrices from Tuesday, Wednesday, Friday and Saturday in week 2.
Fig. 9: Comparison of end-to-end delay performance ratio in CDF with the traffic matrices from Tuesday, Wednesday, Friday and Saturday in week 2.
Fig. 10: Comparison of load balancing performance ratio with the traffic matrices from Tuesday, Wednesday, Friday and Saturday in week 2.
Fig. 11: Comparison of end-to-end delay performance ratio with the traffic matrices from Tuesday, Wednesday, Friday and Saturday in week 2.

For comparison, we also calculate the performance ratios and rerouting disturbances for Top-K, Top-K Critical, and ECMP according to Eqs. (5), (6) and (7). Figure 4 shows the average load balancing performance ratio that each scheme achieves on the entire test set of the four networks, and Figure 5 shows the load balancing performance ratio on each individual test traffic matrix. Note that the first 15 traffic matrices in Figs. 5(b)-5(d) are generated by an exponential model and the remaining 15 traffic matrices are generated by a uniform model. CFR-RL performs remarkably well in all networks. For the Abilene network, CFR-RL improves load balancing performance by about 32.8% compared to ECMP and by roughly 7.4% compared to Top-K Critical. For the EBONE network, CFR-RL outperforms Top-K Critical with an average 12.2% load balancing performance improvement. For the Sprintlink and Tiscali networks, CFR-RL performs slightly better than Top-K Critical, by 1.3% and 3.5% on average, respectively.

Figure 6 shows the average end-to-end delay performance ratio that each scheme achieves on the entire test set of the four networks, and Figure 7 shows the end-to-end delay performance ratio on each test traffic matrix. It is worth noting that the rerouting solution for the selected critical flows is still obtained by solving (4a) (i.e., minimizing the maximum link utilization), even though the end-to-end delay performance is evaluated for each scheme. (For the Abilene network, the real traffic demands in the measured traffic matrices collected in [abilene_tm] are relatively small, and thus the corresponding end-to-end delays would be very small. To effectively compare the end-to-end delay performance of each scheme, we multiply each demand in a real traffic matrix by a factor inversely proportional to the maximum link utilization achieved by ECMP routing on that traffic matrix.) By effectively selecting and rerouting critical flows to balance link utilization of the network, CFR-RL outperforms the heuristic schemes and ECMP in terms of end-to-end delay in all networks except EBONE. In the EBONE network, the heuristic schemes perform better with the exponential traffic model. It is possible that rerouting the elephant flows selected by the heuristic schemes further balances the load on non-congested links and results in a smaller end-to-end delay.

In addition, Table II shows the average rerouting disturbance, i.e., the average percentage of total traffic rerouted by each scheme (except ECMP) for the four networks. CFR-RL greatly reduces network disturbance by rerouting at most 21.3%, 11.2%, 11.3%, and 11.2% of total traffic on average for the four networks, respectively. In contrast, Top-K Critical reroutes 11.4% more traffic for the Abilene network and 14.7%, 12.3%, and 13.3% more traffic for the EBONE, Sprintlink, and Tiscali networks (for exponential traffic matrices). Top-K performs even worse, rerouting more than 42% of total traffic on average for the Abilene network and 32%, 33%, and 32% of total traffic on average for the other three networks (for exponential traffic matrices). It is worth noting that there are no elephant flows in the uniform traffic matrices shown in Figs. 5(b)-5(d). Thus, all three schemes reroute similar amounts of traffic for uniform traffic matrices; however, CFR-RL still performs slightly better than the two rule-based heuristics.
Overall, the above results indicate that CFR-RL is able to achieve near-optimal load balancing performance and greatly reduce end-to-end delay and network disturbance by smartly selecting a small number of critical flows for each given traffic matrix and effectively rerouting the corresponding small amount of traffic.

As shown in Figs. 5(b)-5(d), Top-K Critical performs well with the exponential traffic model, but its performance degrades with the uniform traffic model. One possible reason is that all links in the network are relatively saturated under the uniform traffic model, so alternative underutilized paths are not available for the critical flows selected by Top-K Critical. In other words, there is not much room for rerouting performance improvement when only the elephant flows traversing the most congested links are considered. Thus, fixed-rule heuristics are unable to guarantee their performance when their design assumptions do not hold. In contrast, CFR-RL performs consistently well under various traffic models.

VI-B3 Generalization

In this series of experiments, we train CFR-RL on the traffic matrices of the first week (starting from Mar. 1st, 2004) and evaluate it on each day of the following week (starting from Mar. 8th, 2004) for the Abilene network. We only present the results for day 2, day 3, day 5, and day 6, since the results for the other days are similar. Figures 8 and 9 show the full CDFs of the two performance ratios for these 4 days, and Figures 10 and 11 show the load balancing and end-to-end delay performance ratios on each traffic matrix of these 4 days, respectively. The results show that CFR-RL still achieves above 95% of the optimal load balancing performance and an average 88.13% end-to-end delay performance, and thus outperforms the other schemes on almost all traffic matrices. The load balancing performance of CFR-RL degrades on several outlier traffic matrices in day 2. There are two possible reasons for this degradation: (1) the traffic patterns of these traffic matrices differ from what CFR-RL learned from the previous week; (2) selecting 10% of flows as critical flows is not enough for CFR-RL to achieve near-optimal performance on these outlier traffic matrices. Even so, CFR-RL still performs better than the other schemes. Overall, the results indicate that real traffic patterns are relatively stable and that CFR-RL generalizes well to unseen traffic matrices for which it was not explicitly trained.

VI-B4 Training and Inference Time

Training a policy for the Abilene network took approximately 10,000 iterations, with each iteration taking approximately 1 second, so the total training time for the Abilene network is approximately 3 hours. Since the EBONE network is larger, it took approximately 60,000 iterations to train a policy, for a total training time of approximately 16 hours. For larger networks like Sprintlink and Tiscali, the solution space is even larger; thus, more iterations (approximately 90,000 and 100,000, respectively) are needed to train a good policy, and each iteration takes approximately 2 seconds. Note that this cost is incurred offline, and training can be performed infrequently depending on environment stability. The policy neural network described in Section VI-A1 is relatively small; thus, the inference time is less than 1 second for the Abilene and EBONE networks and less than 2 seconds for the Sprintlink and Tiscali networks.

VI-B5 Hyperparameters

(a) filters / neurons = 128, β = 0.1
α = 0.01 (with decay): 0.761
α = 0.001 (with decay): 0.970
α = 0.0001*: 0.963
  * without decay, since the initial learning rate is equal to the minimum learning rate.

(b) α = 0.001 (with decay), β = 0.1
filters / neurons = 64: 0.928
filters / neurons = 128: 0.970
filters / neurons = 256: 0.837

(c) filters / neurons = 128, α = 0.001 (with decay)
β = 0.1: 0.970
smaller β: 0.958

TABLE III: Comparison of average load balancing performance ratio with different sets of hyperparameters

Table III shows how the hyperparameters affect the load balancing performance of CFR-RL in the Abilene network. For each set of hyperparameters, we trained a policy for the Abilene network for 10,000 iterations and then evaluated the average load balancing performance ratio over the whole test set. We only present the results for the Abilene network, since the results for the other network topologies are similar. In Table III(a), the number of filters in the convolutional layer and of neurons in the fully connected layer is fixed to 128 and the entropy factor β is fixed to 0.1, and we compare the performance with different learning rates α. The results show that training might become unstable if the initial learning rate is too large (e.g., 0.01), in which case it cannot converge to a good policy. In contrast, training with a smaller learning rate is more stable but might require a longer training time to further improve performance. As a result, we chose an initial learning rate of 0.001 with decay to encourage exploration in the early stage of training. We compare the performance with different numbers of filters and neurons in Table III(b). The results show that too few filters/neurons might restrict the representations that the neural network can learn and thus cause under-fitting, while too many neurons might cause over-fitting, so that the corresponding policy cannot generalize well to the test set; in addition, more training time is required for a larger neural network. In Table III(c), the results show that a larger entropy factor encourages exploration and leads to better performance. Overall, the set of hyperparameters we have chosen is a good trade-off between performance and the computational complexity of the model.

VII Conclusion and Future Work

With the objectives of minimizing the maximum link utilization in a network and reducing the network disturbance that causes service disruption, we proposed CFR-RL, a scheme that automatically learns a critical flow selection policy using reinforcement learning, without any domain-specific rule-based heuristic. CFR-RL selects critical flows for each given traffic matrix and reroutes them to balance link utilization of the network by solving a simple rerouting optimization problem. Extensive evaluations show that CFR-RL achieves near-optimal performance by rerouting only a small portion of total traffic. In addition, CFR-RL generalizes well to traffic matrices for which it was not explicitly trained.

Nevertheless, there are several aspects in which the proposed solution could be updated and improved. We discuss some of them below.

Objectives: CFR-RL could be formulated to achieve other objectives. For example, to minimize the overall end-to-end delay Ω in the network (described in Section VI-A4), we can define the reward as 1/Ω and reformulate the rerouting optimization problem (4a) to minimize Ω.

Table II shows an interesting finding. Although CFR-RL does not explicitly minimize rerouted traffic, it ends up rerouting much less traffic (i.e., 10.0%-21.3%) and performs better than the rule-based heuristic schemes by 1.3%-12.2%. This reveals that CFR-RL effectively searches the whole set of candidate flows to find the best critical flows for various traffic matrices, rather than simply considering the elephant flows on the most congested links or in the whole network, as the rule-based heuristic schemes do. We will consider minimizing rerouted traffic as one of our objectives and investigate the trade-off between maximizing performance and minimizing rerouted traffic.

Scalability: Scaling CFR-RL to larger networks is an important direction for future work. CFR-RL relies on LP to produce reward signals, and the LP problem becomes more complex as the number of critical flows and the size of the network increase. This would slow down policy training for larger networks (e.g., the Tiscali network in Section VI-B4), since the time consumed by each iteration would increase. Moreover, the solution space becomes enormous for larger networks, and RL has to take more iterations to converge to a good policy. To further speed up training, we can either spawn even more actor agents (e.g., 30) in parallel to allow the system to consume more data at each time step and thus improve exploration [A3C], or apply GA3C [GA3C], an alternative architecture to A3C that offloads training to a GPU and emphasizes efficient GPU utilization to increase the amount of training data generated and processed per second. Another possible design to mitigate the scalability issue is adopting an SDN multi-controller architecture, in which each controller takes care of a subset of routers in a large network and one CFR-RL agent runs on each SDN controller. The corresponding problem naturally falls into the realm of Multi-Agent Reinforcement Learning. We will evaluate whether a multi-controller architecture can provide additional improvement for our approach.

Retraining: In this paper, we mainly described the training of the RL-based critical flow selection policy as an offline task; in other words, once training is done, CFR-RL remains unmodified after being deployed in the network. However, CFR-RL can naturally accommodate future unseen traffic matrices by periodically updating the selection policy. This self-learning capability will enable CFR-RL to further adapt itself to dynamic network conditions after being deployed in real networks. CFR-RL can be retrained by including new traffic matrices. For example, the outlier traffic matrices (e.g., the 235th-240th traffic matrices in day 2) shown in Fig. 10 should be included for retraining, while the generalization results in Section VI-B3 suggest that frequent retraining might not be necessary. Techniques to determine when to retrain and which new/old traffic matrices should be included in (or excluded from) the training dataset should be further investigated.

The above examples are some key issues that are left for future work.

Acknowledgments

The authors would like to thank the editors and reviewers for providing many valuable comments and suggestions.

References