I Introduction
The emerging Software-Defined Networking (SDN) paradigm provides new opportunities to improve network performance [mckeown2008openflow]. In SDN, the control plane can generate routing policies based on its global view of the network and deploy these policies in the network by installing and updating flow entries at the SDN switches.
Traffic Engineering (TE) is one of the most important network features of SDN [agarwal2013traffic, guo2014traffic, zhang2014dynamic], and is usually implemented in the control plane. The goal of TE is to help Internet Service Providers (ISPs) optimize network performance and resource utilization by configuring the routing across their backbone networks to control traffic distribution [zhang2015load, guo2019joint]. Due to dynamic load fluctuation among the nodes, traditional TE [wang1999explicit, osborne2002traffic, fortz2002optimizing, holmberg2004optimization, chu2009optimal, zhang2012optimizing] reroutes many flows periodically to balance the load on each link and minimize network congestion probability, where a flow is defined as a source-destination pair. One usually formulates the flow routing problem with a particular performance metric as a specific objective function for optimization. For a given traffic matrix, one often wants to route all the flows in such a way that the maximum link utilization in the network is minimized.
Although traditional TE solutions can achieve optimal or near-optimal performance by rerouting as many flows as possible, they do not consider the negative impact, such as packet reordering, of rerouting the flows in the network. To reach optimal performance, TE solutions might reroute many traffic flows just to slightly reduce the link utilization on the most congested link, leading to significant network disturbance and service disruption. For example, a flow between two nodes in a backbone network is an aggregate of many micro-flows (e.g., five-tuple-based TCP flows) of different applications. Changing the path of a flow could temporarily disrupt many TCP flows' normal operation. Packet loss or out-of-order delivery may cause duplicated ACK transmissions, triggering the sender to react and reduce its congestion window size and hence decrease its sending rate, eventually increasing the flow's completion time and degrading the flow's Quality of Service (QoS). In addition, rerouting all flows in the network could incur a high burden on the SDN controller to calculate and deploy new flow paths [zhang2014dynamic]. Because rerouting flows to reduce congestion in backbone networks could adversely affect the quality of users' experience, network operators have no desire to deploy these traditional TE solutions in their networks unless reducing network disturbance is taken into consideration in designing the TE solutions.
To mitigate the impact of network disturbance, one promising TE solution is to forward the majority of traffic flows using Equal-Cost Multi-Path (ECMP) routing and selectively reroute a few critical flows using SDN to balance link utilization of the network, where a critical flow is defined as a flow with a dominant impact on network performance (e.g., a flow on the most congested link) [zhang2014dynamic, zhang2014hybrid_routing]. Existing works show that critical flows exist in a given traffic matrix [zhang2014dynamic]. ECMP reduces the congestion probability by equally splitting traffic across equal-cost paths, while critical flow rerouting aims to achieve further performance improvement with low network disturbance.
The critical flow rerouting problem can be decoupled into two subproblems: (1) identifying critical flows and (2) rerouting them to achieve good performance. Although subproblem (2) is relatively easy to solve by formulating it as a Linear Programming (LP) optimization problem, solving subproblem (1) is not trivial because the solution space is huge. For example, if we want to find 10 critical flows among 100 flows, the solution space contains about $1.7 \times 10^{13}$ combinations. Considering that the traffic matrix varies at the level of minutes, an efficient solution should be able to quickly and effectively identify the critical flows for each traffic matrix. Unfortunately, it is impossible to design a heuristic algorithm for the above algorithmically hard problem based on fixed and simple rules. This is because rule-based heuristics are unable to adapt to changes in the traffic matrix and network dynamics and thus unable to guarantee their performance when their design assumptions are violated, as later shown in Section VI-B.
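To make the size of this search space concrete, the number of ways to pick 10 critical flows out of 100 is a binomial coefficient, which can be checked directly (a minimal Python sketch using the flow counts from the example above):

```python
from math import comb

# Ways to choose 10 critical flows out of 100 candidate flows:
# C(100, 10) grows combinatorially, which is why exhaustive search
# over critical-flow subsets is infeasible.
n_flows, n_critical = 100, 10
solution_space = comb(n_flows, n_critical)
print(solution_space)  # 17310309456440, i.e., about 1.7 * 10^13
```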
In this paper, we propose CFR-RL (Critical Flow Rerouting-Reinforcement Learning), a Reinforcement Learning-based scheme that performs critical flow selection followed by rerouting with linear programming. CFR-RL learns a policy to select critical flows purely through observations, without any domain-specific rule-based heuristic. It starts from scratch without any prior knowledge, and gradually learns to make better selections through reinforcement, in the form of reward signals that reflect the network performance of past selections. By continuing to observe the actual performance of past selections, CFR-RL optimizes its selection policy for various traffic matrices over time. Once training is done, CFR-RL efficiently and effectively selects a small set of critical flows for each given traffic matrix, and reroutes them to balance link utilization of the network by formulating and solving a simple linear programming optimization problem.
The main contributions of this paper are summarized as follows:

We consider the impact of flow rerouting on network disturbance in our TE design and propose an effective scheme that not only minimizes the maximum link utilization but also reroutes only a small number of flows to reduce network disturbance.

We customize an RL approach to learn the critical flow selection policy, and utilize LP as a reward function to generate reward signals. This combined RL-LP approach turns out to be surprisingly powerful.

We evaluate and compare CFR-RL with rule-based heuristic schemes by conducting extensive experiments on different topologies with both real and synthesized traffic. CFR-RL not only outperforms rule-based heuristic schemes by up to 12.2%, but also reroutes 11.4%-14.7% less traffic on average. Overall, CFR-RL is able to achieve near-optimal performance by rerouting only 10%-21.3% of total traffic. In addition, the evaluation results show that CFR-RL is able to generalize to unseen traffic matrices.
The remainder of this paper is organized as follows. Section II describes the related works. Section III presents the system design. Section IV discusses how to train the critical flow selection policy using an RL-based approach. Section V describes how to reroute the critical flows. Section VI evaluates the effectiveness of our scheme. Section VII concludes the paper and discusses future work.
II Related Works
II-A Traditional TE Solutions
In Multiprotocol Label Switching (MPLS) networks, the routing problem has been formulated as an optimization problem where explicit routes are obtained for each source-destination pair to distribute traffic flows [wang1999explicit, osborne2002traffic]. Using the Open Shortest Path First (OSPF) and ECMP protocols, [fortz2002optimizing, holmberg2004optimization, chu2009optimal] attempt to balance link utilization as evenly as possible by carefully tuning the link costs to adjust path selection in ECMP. OSPF-OMP (Optimized Multipath) [OSPFOMP], a variation of OSPF, attempts to dynamically determine the optimal allocation of traffic among multiple equal-cost paths based on the exchange of special traffic-load control messages. Weighted ECMP [zhang2012optimizing] extends ECMP to allow weighted traffic splitting at each node and achieves significant performance improvement over ECMP. Two-phase routing optimizes routing performance by selecting a set of intermediate nodes and tuning the traffic split ratios to the nodes [kodialam2008oblivious, antic2009two]. In the first phase, each source sends traffic to the intermediate nodes based on predetermined split ratios; in the second phase, the intermediate nodes deliver the traffic to the final destinations. This approach requires IP tunnels, optical-layer circuits, or label switched paths in each phase.
II-B SDN-Based TE Solutions
Thanks to the flexible routing policy from the emerging SDN, dynamic hybrid routing [zhang2014dynamic] achieves load balancing for a wide range of traffic scenarios by dynamically rebalancing traffic to react to traffic fluctuations with a preconfigured routing policy. Agarwal et al. [agarwal2013traffic] consider a network with partially deployed SDN switches. They improve network utilization and reduce packet loss by strategically placing the controller and SDN switches. Guo et al. [guo2014traffic] propose a novel algorithm named SOTE to minimize the maximum link utilization in an SDN/OSPF hybrid network.
II-C Machine Learning-Based TE Solutions
Machine learning has been used to improve the performance of backbone networks and data center networks. For backbone networks, Geyer et al. [geyer2018learning] design an automatic network protocol using semi-supervised deep learning. Sun et al. [sun2019sinet] selectively control a set of nodes and use an RL-based policy to dynamically change the routing decisions of flows traversing the selected nodes. To minimize signaling delay in large SDNs, Lin et al. [lin2016qos] employ a distributed three-level control plane architecture coupled with an RL-based solution named QoS-aware Adaptive Routing. Xu et al. [xu2018experience] use RL to optimize the throughput and delay in TE. AuTO [chen2018auto] is developed to optimize traffic routing in data center networks with a two-layer RL system: one layer, the Peripheral System, is deployed on hosts and routes small flows, while the other, the Central System, collects global traffic information and routes large flows. However, none of the above works considers mitigating the impact of network disturbance and service disruption caused by rerouting.
III System Design
In this section, we describe the design of CFR-RL, an RL-based scheme that learns a critical flow selection policy and reroutes the corresponding critical flows to balance link utilization of the network.
We train CFR-RL to learn a selection policy over a rich variety of historical traffic matrices, where traffic matrices can be measured by SDN switches and collected by an SDN central controller periodically [xu2017minimizing]. CFR-RL represents the selection policy as a neural network that maps a "raw" observation (e.g., a traffic matrix) to a combination of critical flows. The neural network provides a scalable and expressive way to incorporate various traffic matrices into the selection policy. CFR-RL trains this neural network based on the REINFORCE algorithm [Williams1992] with some customizations, as detailed in Section IV. Once training is done, CFR-RL applies the critical flow selection policy to each real-time traffic matrix provided by the SDN controller periodically, where a small number of critical flows is selected. The evaluation results in Section VI-B1 show that selecting 10% of total flows as critical flows (roughly 11%-21% of total traffic) is sufficient for CFR-RL to achieve near-optimal performance, while network disturbance (i.e., the percentage of total rerouted traffic) is greatly reduced compared to rerouting all flows as in traditional TE. Then the SDN controller reroutes the selected critical flows by installing and updating corresponding flow entries at the switches using a flow rerouting optimization method described in Section V. The remaining flows continue to be routed by the default ECMP routing. Note that the flow entries at the switches for the critical flows selected in the previous period will time out, and those flows will be routed by either the default ECMP routing or newly installed flow entries in the current period. Figure 1 shows an illustrative example. CFR-RL reroutes the flow from S0 to S4 to balance link load by installing forwarding entries at the corresponding switches along the SDN path.
There are two reasons why we do not adopt RL for the flow rerouting problem itself. Firstly, since the set of critical flows is small, LP is an efficient and optimal method for solving the rerouting problem. Secondly, a routing solution consists of a split ratio (i.e., a traffic demand percentage) for each flow on each link. Given a network with $|E|$ links, there will be $K \cdot |E|$ split ratios in total in the routing solution, where $K$ is the number of critical flows. Since split ratios are continuous numbers, we would have to adopt RL methods for continuous action domains [ddpg, John2015TRPO]. However, due to the high-dimensional, continuous action spaces, it has been shown that this type of RL method leads to slow and ineffective learning when the number of output parameters (i.e., $K \cdot |E|$) is large [xu2018experience, Valadarsky2017learning].
IV Learning a Critical Flow Selection Policy
In this section, we describe how to learn a critical flow selection policy using a customized RL approach.
IV-A Reinforcement Learning Formulation
Input / State Space: The agent takes a state $s_t = TM_t$ as input, where $TM_t$ is the traffic matrix at time step $t$, containing the traffic demand of each flow. Typically, the network topology remains unchanged. Thus, we do not include the topology information as part of the input. The results in Section VI-B show that CFR-RL is able to learn a good policy without prior knowledge of the network. It is worth noting that including additional information such as link states as part of the input might be beneficial for training the critical flow selection policy. We will investigate this in our future work.
Action Space: For each state $s_t$, CFR-RL selects $K$ critical flows. Given that there are $N \cdot (N-1)$ total flows in a network with $N$ nodes, this RL problem would require a large action space of size $\binom{N \cdot (N-1)}{K}$. Inspired by [Mao:2016:RMD:3005745.3005750], we define the action space as $\{0, 1, \ldots, N \cdot (N-1) - 1\}$ and allow the agent to sample $K$ different actions in each time step $t$ (i.e., $a_{t,1}, \ldots, a_{t,K}$).
Reward: After sampling $K$ different critical flows (i.e., $a_{t,1}, \ldots, a_{t,K}$) for a given state $s_t$, CFR-RL reroutes these critical flows and obtains the maximum link utilization $U$ by solving the rerouting optimization problem (4a) (described in the following section). The reward $r$ is defined in terms of $U$ and is set to reflect the network performance after rerouting the critical flows to balance link utilization: the smaller $U$, the greater the reward $r$ and the better the performance. In other words, CFR-RL adopts LP as a reward function to produce reward signals for RL.
IV-B Training Algorithm
The critical flow selection policy is represented by a neural network. This policy network takes a state $s_t$ as input, as described above, and outputs a probability distribution $\pi_\theta(a \mid s_t)$ over all available actions. Figure 2 shows the architecture of the policy network (details in Section VI-A1). Since $K$ different actions are sampled for each state and their order does not matter, we define a solution $c_t$ as a combination of $K$ sampled actions. For selecting a solution $c_t$ with a given state $s_t$, a stochastic policy parameterized by $\theta$ can be approximated as follows^{1}:

$\pi_\theta(c_t \mid s_t) \approx \prod_{i=1}^{K} \pi_\theta(a_{t,i} \mid s_t) \qquad (1)$

^{1} To select $K$ distinct actions, we do the action sampling without replacement. The right side of Eq. (1) is the solution probability when sampling with replacement; we use Eq. (1) to approximate the probability of the solution given a state for simplicity.
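The sampling scheme behind this approximation can be illustrated with a short sketch (pure Python; the helper names are ours, and the policy network's raw outputs are assumed to be given as a list of logits): $K$ distinct actions are drawn without replacement, and the solution's log-probability is approximated by summing the log-probabilities of the individual actions.

```python
import math
import random

def softmax(logits):
    """Convert raw policy-network outputs into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_solution(probs, k, rng=random):
    """Sample k distinct actions (flow indices) without replacement,
    and approximate the solution log-probability as the sum of the
    individual action log-probabilities (the product in Eq. (1))."""
    actions = []
    remaining = list(range(len(probs)))
    weights = list(probs)
    for _ in range(k):
        a = rng.choices(remaining, weights=weights)[0]
        i = remaining.index(a)
        remaining.pop(i)   # remove the chosen action so samples stay distinct
        weights.pop(i)
        actions.append(a)
    log_prob = sum(math.log(probs[a]) for a in actions)
    return actions, log_prob

probs = softmax([0.5, 2.0, 1.0, -1.0, 0.0])
actions, logp = sample_solution(probs, k=2)
```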
The goal of training is to maximize the network performance over various traffic matrices, i.e., to maximize the expected reward $\mathbb{E}[r]$. Thus, we optimize $\theta$ by gradient ascent, using the REINFORCE algorithm with a baseline $b$. The policy parameter $\theta$ is updated according to the following equation:

$\theta \leftarrow \theta + \alpha \, \nabla_\theta \log \pi_\theta(c_t \mid s_t) \, (r_t - b_t) \qquad (2)$
where $\alpha$ is the learning rate for the policy network. A good baseline $b$ reduces gradient variance and thus increases the speed of learning. In this paper, we use the average reward for each state $s_t$ as the baseline $b_t$. The term $(r_t - b_t)$ indicates how much better a specific solution is compared to the "average solution" for a given state according to the policy. Intuitively, Eq. (2) can be explained as follows. If $(r_t - b_t)$ is positive, $\pi_\theta(c_t \mid s_t)$ (i.e., the probability of the solution $c_t$) is increased by updating the policy parameters $\theta$ in the direction $\nabla_\theta \log \pi_\theta(c_t \mid s_t)$ with a step size of $\alpha$. Otherwise, the solution probability is decreased. The net effect of Eq. (2) is to reinforce actions that empirically lead to better rewards. To ensure that the RL agent explores the action space adequately during training to discover good policies, the entropy of the policy is added to Eq. (2). This technique improves exploration by discouraging premature convergence to suboptimal deterministic policies [A3C]. Then, Eq. (2) is modified to the following equation:

$\theta \leftarrow \theta + \alpha \, \nabla_\theta \log \pi_\theta(c_t \mid s_t) \, (r_t - b_t) + \beta \, \nabla_\theta H(\pi_\theta(\cdot \mid s_t)) \qquad (3)$

where $H(\pi_\theta(\cdot \mid s_t))$ is the entropy of the policy (the probability distribution over actions). The hyperparameter $\beta$ controls the strength of the entropy regularization term. Algorithm 1 shows the pseudo-code for the training algorithm.

V Rerouting Critical Flows
In this section, we describe how to reroute the selected critical flows to balance link utilization of the network.
V-A Notations
$G(V, E)$:  the network with $|V|$ nodes and $|E|$ directed links.
$c_e$:  the capacity of link $e$ ($e \in E$).
$l_e$:  the traffic load on link $e$ ($e \in E$).
$D_{s,d}$:  the traffic demand from source $s$ to destination $d$ ($s, d \in V$, $s \neq d$).
$r_{s,d}^{e}$:  the percentage of the traffic demand from source $s$ to destination $d$ routed on link $e$ ($s, d \in V$, $e \in E$).
V-B Explicit Routing for Critical Flows
By default, traffic is distributed according to ECMP routing. We reroute the small set $F$ of critical flows selected by CFR-RL (i.e., $|F| = K$) by conducting explicit routing optimization for these flows.
The critical flow rerouting problem can be described as follows. Given a network $G(V, E)$ with the set of traffic demands $\{D_{s,d} \mid (s,d) \in F\}$ for the selected critical flows and the background link load contributed by the remaining flows using the default ECMP routing, our objective is to obtain the optimal explicit routing ratios for each critical flow, so that the maximum link utilization is minimized.
To search all possible underutilized paths for the selected critical flows, we formulate the rerouting problem as an optimization problem as follows.

minimize $\; U + \epsilon \sum_{e \in E} \sum_{(s,d) \in F} D_{s,d} \, r_{s,d}^{e} \qquad$ (4a)
subject to
$l_e = \sum_{(s,d) \in F} D_{s,d} \, r_{s,d}^{e} + l_e^{B}, \quad \forall e \in E \qquad$ (4b)
$l_e \le c_e \cdot U, \quad \forall e \in E \qquad$ (4c)
$\sum_{e \in \mathrm{out}(v)} r_{s,d}^{e} - \sum_{e \in \mathrm{in}(v)} r_{s,d}^{e} = \begin{cases} 1 & \text{if } v = s \\ -1 & \text{if } v = d \\ 0 & \text{otherwise} \end{cases}, \quad \forall (s,d) \in F, \; \forall v \in V \qquad$ (4d)
$0 \le r_{s,d}^{e} \le 1, \quad \forall (s,d) \in F, \; \forall e \in E \qquad$ (4e)

The second term in (4a) is needed because otherwise the optimal solution may include unnecessarily long paths as long as they avoid the most congested link, where $\epsilon$ ($\epsilon > 0$) is a sufficiently small constant to ensure that the minimization of the maximum link utilization $U$ takes higher priority [916782]. (4b) gives the traffic load $l_e$ on link $e$, contributed by the critical-flow traffic demands routed by the explicit routing and the background load $l_e^{B}$ contributed by the remaining traffic demands routed by the default ECMP routing. (4c) is the link capacity utilization constraint. (4d) is the flow conservation constraint for the selected critical flows.
By solving the above LP problem using LP solvers (such as Gurobi [gurobi]), we can obtain the optimal explicit routing solution for the selected critical flows. Then, the SDN controller installs and updates flow entries at the switches accordingly.
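As a toy illustration of the formulation above (not the paper's implementation), consider a single critical flow that can be split across two parallel links on top of a fixed ECMP background load; a grid search over the split ratio stands in for the LP solver, which a real deployment would replace with a solver such as Gurobi:

```python
# Toy instance of the rerouting problem: one critical flow of demand 10
# can be split across two parallel links; the ECMP background load and
# link capacities are fixed. We grid-search the split ratio r to
# minimize the maximum link utilization U, mimicking objective (4a)
# without the epsilon term (there are no longer paths to penalize here).
capacity = [20.0, 20.0]      # c_e for the two links
background = [12.0, 4.0]     # load contributed by default ECMP routing
demand = 10.0                # demand of the critical flow to reroute

def max_utilization(r):
    """Maximum link utilization U when a fraction r of the demand
    takes link 0 and the fraction (1 - r) takes link 1."""
    load = [background[0] + r * demand, background[1] + (1 - r) * demand]
    return max(load[e] / capacity[e] for e in range(2))

best_r = min((i / 1000 for i in range(1001)), key=max_utilization)
best_u = max_utilization(best_r)
print(best_r, best_u)  # 0.1 0.65: both links balanced at load 13/20
```

The optimum equalizes the two link loads (12 + 10r = 4 + 10(1 - r) gives r = 0.1), exactly the kind of balance the LP finds in the general multi-link, multi-flow case.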
VI Evaluation
In this section, a series of simulation experiments is conducted using real-world network topologies to evaluate the performance of CFR-RL and show its effectiveness by comparing it with rule-based heuristic schemes.
VI-A Evaluation Setup
VI-A1 Implementation
The policy neural network consists of three layers. The first layer is a convolutional layer with 128 filters, with the stride set to 1. The second layer is a fully connected layer with 128 neurons. The activation function used for the first two layers is Leaky ReLU [leaky_relu]. The final layer is a fully connected linear layer (without an activation function) with $N \cdot (N-1)$ neurons corresponding to all possible critical flows. The softmax function is applied to the output of the final layer to generate the probabilities of all available actions. The learning rate $\alpha$ is initially configured to 0.001 and decays every 500 iterations with a base of 0.96 until it reaches the minimum value 0.0001. Additionally, the entropy factor $\beta$ is configured to be 0.1. We found that the above set of hyperparameters is a good tradeoff between performance and computational complexity of the model (details in Section VI-B5). Thus, we fixed them throughout our experiments. The results in the following experiments show that CFR-RL works well on different network topologies with a single set of fixed hyperparameters. This architecture is implemented using TensorFlow [tensorflow].

Table I: Topology  Nodes  Directed Links  Pairs

Abilene  12  30  132 
EBONE (Europe)  23  74  506 
Sprintlink (US)  44  166  1892 
Tiscali (Europe)  49  172  2352 
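The staircase decay schedule described above (initial rate 0.001, multiplied by 0.96 every 500 iterations, floored at 0.0001) can be sketched as follows (the function name is ours, not from the paper):

```python
def learning_rate(step, initial=1e-3, base=0.96, every=500, minimum=1e-4):
    """Staircase exponential decay: multiply the rate by `base` once
    per `every` iterations, never dropping below `minimum`."""
    return max(initial * base ** (step // every), minimum)
```

For example, `learning_rate(0)` returns 0.001, each 500-iteration step shrinks the rate by 4%, and after roughly 28,500 iterations the rate bottoms out at 0.0001.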
VI-A2 Dataset
In our evaluation, we use four real-world network topologies: the Abilene network and 3 ISP networks collected by ROCKETFUEL [spring2002measuring]. The numbers of nodes and directed links of the networks are listed in Table I. For the Abilene network, the measured traffic matrices and network topology information (such as link connectivity, weights, and capacities) are available in [abilene_tm]. Since Abilene traffic matrices are measured every 5 minutes, there are a total of 288 traffic matrices each day. To evaluate the performance of CFR-RL, we choose a total of 2016 traffic matrices from the first week (starting from Mar. 1st 2004) as our dataset. For the ROCKETFUEL topologies, the link costs are given while the link capacities are not. Therefore, we infer the link capacities as the inverse of the link costs, which is based on the default link cost setting in Cisco routers. In other words, the link costs are inversely proportional to the link capacities. This approach is commonly adopted in the literature [zhang2014dynamic, zhang2014hybrid_routing, kodialam2008oblivious]. In addition, since traffic matrices are also unavailable for the ISP networks from ROCKETFUEL, we use a traffic matrix generation tool [TMgen] to generate 50 synthetic exponential traffic matrices and 50 synthetic uniform traffic matrices for each network. Unless otherwise noted, we use a random sample of 70% of our dataset as a training set for CFR-RL, and use the remaining 30% as a test set for testing all schemes.
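The capacity-inference convention above can be sketched as follows; `reference_bw` is a hypothetical scaling constant (Cisco's default OSPF cost is a reference bandwidth divided by the link bandwidth, so capacity is proportional to 1/cost):

```python
def infer_capacities(link_costs, reference_bw=1e5):
    """Infer link capacities as inversely proportional to OSPF link
    costs: capacity = reference_bw / cost. `reference_bw` is an
    assumed scaling constant; only the relative capacities matter
    for utilization-based objectives."""
    return {link: reference_bw / cost for link, cost in link_costs.items()}

# A link with cost 1 is inferred to have 10x the capacity of a link
# with cost 10.
caps = infer_capacities({("a", "b"): 1, ("b", "c"): 10})
```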
VI-A3 Parallel Training
To speed up training, we spawn multiple actor agents in parallel, as suggested by [A3C]. CFR-RL uses 20 actor agents by default. Each actor agent is configured to experience a different subset of the training set. These agents continually forward their (state, action, advantage (i.e., $r_t - b_t$)) tuples to a central learner agent, which aggregates them to train the policy neural network. The central learner agent performs a gradient update using Eq. (3) according to the received tuples, and then sends the updated parameters of the policy network back to the actor agents. The whole process can happen asynchronously among all agents. We use 21 CPU cores to train CFR-RL (i.e., one 2.6 GHz core for each agent).
VI-A4 Metrics
(1) Load Balancing Performance Ratio: To demonstrate the load balancing performance of the proposed CFR-RL scheme, a load balancing performance ratio is defined as follows:

$PR_{U} = \frac{U^{*}}{U} \qquad (5)$

where $U^{*}$ is the maximum link utilization achieved by an optimal explicit routing for all flows^{2} and $U$ is the maximum link utilization achieved by CFR-RL. $PR_{U} = 1$ means that the proposed CFR-RL achieves load balancing as good as the optimal routing. A lower ratio indicates that the load balancing performance of CFR-RL is farther away from that of the optimal routing.

^{2} The corresponding LP formulation is similar to (4a), except that the objective becomes obtaining the optimal explicit routing ratios for all flows. Note that the background link load would be 0 for this problem.
(2) End-to-end Delay Performance Ratio: To model and measure end-to-end delay in the network, we define the overall end-to-end delay in the network as $\Omega = \sum_{e \in E} \frac{l_e}{c_e - l_e}$, as described in [zhang2012optimizing]. Then, an end-to-end delay performance ratio is defined as follows:

$PR_{\Omega} = \frac{\Omega^{*}}{\Omega} \qquad (6)$

where $\Omega^{*}$ is the minimum end-to-end delay achieved by an optimal explicit routing for all flows with an objective^{3} to minimize the end-to-end delay $\Omega$. Note that the rerouting solution for the selected critical flows is still obtained by solving (4a). The higher $PR_{\Omega}$, the better the end-to-end delay performance achieved by CFR-RL. $PR_{\Omega} = 1$ means that the proposed CFR-RL achieves the same minimum end-to-end delay as the optimal routing.

^{3} The objective of this LP problem is to obtain the optimal explicit routing ratios for all flows, such that $\Omega$ is minimized.
(3) Rerouting Disturbance: To measure the disturbance caused by rerouting, we define rerouting disturbance as the percentage of total rerouted traffic^{4} for a given traffic matrix, i.e.,

$RD = \frac{\sum_{(s,d) \in F} D_{s,d}}{\sum_{s,d \in V} D_{s,d}} \qquad (7)$

where the numerator is the total traffic of the selected critical flows that need to be rerouted and the denominator is the total traffic of all flows. The smaller $RD$, the less disturbance caused by rerouting.

^{4} Although part of the rerouted traffic might still be routed along the original ECMP paths, updating routing at the switches might cause packet drops or reordering. Thus, we still count this amount of traffic as rerouted traffic.
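The three metrics reduce to simple ratios once the maximum link utilizations, delays, and traffic totals are known; a minimal sketch (function names ours, illustrative numbers only):

```python
def load_balancing_ratio(u_optimal, u_scheme):
    """Eq. (5): U* / U; 1.0 means as good as the optimal routing."""
    return u_optimal / u_scheme

def delay_ratio(delay_optimal, delay_scheme):
    """Eq. (6): minimum achievable end-to-end delay over the scheme's
    end-to-end delay; higher is better."""
    return delay_optimal / delay_scheme

def rerouting_disturbance(rerouted_traffic, total_traffic):
    """Eq. (7): fraction of total traffic that is rerouted."""
    return rerouted_traffic / total_traffic

# Toy numbers: a scheme within 4% of the optimal maximum link
# utilization that reroutes 10% of the total traffic.
pr_u = load_balancing_ratio(0.48, 0.50)   # 0.96
rd = rerouting_disturbance(10.0, 100.0)   # 0.1
```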
VI-A5 Rule-based Heuristics
For comparison, we also evaluate two rule-based heuristics, as follows:

Top-$K$: selects the $K$ largest flows from a given traffic matrix in terms of demand volume. This approach is based on the assumption that flows with larger traffic volumes have a dominant impact on network performance.

Top-$K$ Critical: similar to the Top-$K$ approach, but selects the $K$ largest flows from the most congested links. This approach is based on the assumption that flows traversing the most congested links have a dominant impact on network performance.
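Both baselines can be sketched with `heapq.nlargest` (a hypothetical sketch; it assumes the traffic matrix is given as a mapping from (source, destination) pairs to demands, and that the set of flows traversing the most congested links is already known):

```python
import heapq

def top_k(traffic_matrix, k):
    """Top-K: the k flows with the largest demand volumes."""
    return heapq.nlargest(k, traffic_matrix, key=traffic_matrix.get)

def top_k_critical(traffic_matrix, congested_flows, k):
    """Top-K Critical: the k largest flows among those traversing
    the most congested links."""
    candidates = {f: d for f, d in traffic_matrix.items()
                  if f in congested_flows}
    return heapq.nlargest(k, candidates, key=candidates.get)

tm = {("a", "d"): 9.0, ("b", "d"): 7.0, ("c", "d"): 5.0, ("a", "c"): 1.0}
picked = top_k(tm, 2)                                       # two largest flows
picked_critical = top_k_critical(tm, {("c", "d"), ("a", "c")}, 1)
```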
VI-B Evaluation
VI-B1 Number of Critical Flows
We conduct a series of experiments with different numbers of selected critical flows $K$, and fix the other parameters throughout the experiments.
Figure 3 shows the average load balancing performance ratio achieved by CFR-RL with an increasing number of critical flows $K$. The initial value with $K = 0$ represents the default ECMP routing. The results indicate that there is considerable room for further improvement when flows are routed by ECMP. The sharp increases in the average load balancing performance ratio for all four networks shown in Fig. 3 indicate that CFR-RL is able to achieve near-optimal load balancing performance by rerouting only 10% of the flows. As a result, network disturbance is much reduced compared to rerouting all flows as in traditional TE. For the subsequent experiments, we set $K$ to 10% of the total number of flows for each network.
VI-B2 Performance Comparison
Table II: Average rerouting disturbance.
Topology  CFR-RL  Top-K Critical  Top-K

Abilene  21.3%  32.7%  42.9%
EBONE (Exponential / Uniform)  11.2% / 10.0%  25.9% / 11.5%  32.9% / 11.7%
Sprintlink (Exponential / Uniform)  11.3% / 10.1%  23.6% / 13.8%  33.2% / 14.6%
Tiscali (Exponential / Uniform)  11.2% / 10.0%  24.5% / 12.0%  32.7% / 12.2%
For comparison, we also calculate the performance ratios and rerouting disturbances for Top-K, Top-K critical, and ECMP according to Eqs. (5), (6), and (7). Figure 4 shows the average load balancing performance ratio that each scheme achieves on the entire test set of the four networks. Figure 5 shows the load balancing performance ratio on each individual traffic matrix for the four networks. Note that the first 15 traffic matrices in Figs. 5(b)-5(d) are generated by an exponential model and the remaining 15 traffic matrices are generated by a uniform model. CFR-RL performs remarkably well in all networks. For example, for the Abilene network, CFR-RL improves load balancing performance by about 32.8% compared to ECMP, and by roughly 7.4% compared to Top-K critical. For the EBONE network, CFR-RL outperforms Top-K critical with an average 12.2% load balancing performance improvement. For the Sprintlink and Tiscali networks, CFR-RL performs slightly better than Top-K critical, by 1.3% and 3.5% on average, respectively. Moreover, Figure 6 shows the average end-to-end delay performance ratio that each scheme achieves on the entire test set of the four networks. Figure 7 shows the end-to-end delay performance ratio on each test traffic matrix for the four networks. It is worth noting that the rerouting solution for the selected critical flows is still obtained by solving (4a) (i.e., minimizing the maximum link utilization), even though the end-to-end delay performance is evaluated^{5} for each scheme.

^{5} For the Abilene network, the real traffic demands in the measured traffic matrices collected in [abilene_tm] are relatively small, and thus the corresponding end-to-end delay would be very small. To effectively compare the end-to-end delay performance of each scheme, we scale up each demand in a real traffic matrix in inverse proportion to the maximum link utilization achieved by ECMP routing on that traffic matrix.
By effectively selecting and rerouting critical flows to balance link utilization of the network, CFR-RL outperforms the heuristic schemes and ECMP in terms of end-to-end delay in all networks except the EBONE network. In the EBONE network, the heuristic schemes perform better with the exponential traffic model. It is possible that rerouting the elephant flows selected by the heuristic schemes further balances load on non-congested links and results in a smaller end-to-end delay. In addition, Tab. II shows the average rerouting disturbance, i.e., the average percentage of total traffic rerouted by each scheme (except ECMP) for the four networks. CFR-RL greatly reduces network disturbance by rerouting at most 21.3%, 11.2%, 11.3%, and 11.2% of total traffic on average for the four networks, respectively. In contrast, Top-K critical reroutes 11.4% more traffic for the Abilene network and 14.7%, 12.3%, and 13.3% more traffic for the EBONE, Sprintlink, and Tiscali networks (for exponential traffic matrices). Top-K performs even worse, rerouting more than 42% of total traffic on average for the Abilene network and 32%, 33%, and 32% of total traffic on average for the other three networks (for exponential traffic matrices). It is worth noting that there are no elephant flows in the uniform traffic matrices shown in Figs. 5(b)-5(d). Thus, all three schemes reroute similar amounts of traffic for uniform traffic matrices. However, CFR-RL still performs slightly better than the two rule-based heuristics. Overall, the above results indicate that CFR-RL is able to achieve near-optimal load balancing performance and greatly reduce end-to-end delay and network disturbance by smartly selecting a small number of critical flows for each given traffic matrix and effectively rerouting the corresponding small amount of traffic.
As shown in Figs. 5(b)-5(d), Top-K critical performs well with the exponential traffic model. However, its performance degrades with the uniform traffic model. One possible reason for the performance degradation of Top-K critical is that all links in the network are relatively saturated under the uniform traffic model, so alternative underutilized paths are not available for the critical flows selected by Top-K critical. In other words, there is not much room for rerouting performance improvement when only the elephant flows traversing the most congested links are considered. Thus, fixed-rule heuristics are unable to guarantee their performance when their design assumptions are invalid. In contrast, CFR-RL performs consistently well under various traffic models.
VI-B3 Generalization
In this series of experiments, we trained CFR-RL on the traffic matrices from the first week (starting from Mar. 1st 2004) and evaluated it for each day of the following week (starting from Mar. 8th 2004) for the Abilene network. We only present the results for day 2, day 3, day 5, and day 6, since the results for the other days are similar. Figures 8 and 9 show the full CDFs of the two types of performance ratio for these 4 days. Figures 10 and 11 show the load balancing and end-to-end delay performance ratios on each traffic matrix of these 4 days, respectively. The results show that CFR-RL still achieves above 95% of the optimal load balancing performance and an average 88.13% end-to-end delay performance, and thus outperforms the other schemes on almost all traffic matrices. The load balancing performance of CFR-RL degrades on several outlier traffic matrices in day 2. There are two possible reasons for the degradation: (1) The traffic patterns of these traffic matrices are different from what CFR-RL learned from the previous week. (2) Selecting only 10% of the flows is not enough for CFR-RL to achieve near-optimal performance on these outlier traffic matrices. However, CFR-RL still performs better than the other schemes. Overall, the results indicate that real traffic patterns are relatively stable and CFR-RL generalizes well to unseen traffic matrices for which it was not explicitly trained.

VI-B4 Training and Inference Time
Training a policy for the Abilene network took approximately 10,000 iterations, and each iteration takes approximately 1 second. As a result, the total training time for the Abilene network is approximately 3 hours. Since the EBONE network is relatively larger, it took approximately 60,000 iterations to train a policy; the total training time for the EBONE network is thus approximately 16 hours. For larger networks like Sprintlink and Tiscali, the solution space is even larger. Thus, more iterations (approximately 90,000 and 100,000, respectively) are needed to train a good policy, and each iteration takes approximately 2 seconds. Note that this cost is incurred offline, and training can be performed infrequently depending on environment stability. The policy neural network described in Section VI-A1 is relatively small. Thus, the inference time is less than 1 second for the Abilene and EBONE networks, and less than 2 seconds for the Sprintlink and Tiscali networks.
VI-B5 Hyperparameters
Table III(a): filters/neurons = 128, entropy factor = 0.1
  learning rate (with decay)  |  0.761
  learning rate (with decay)  |  0.970
  learning rate*              |  0.963
  * without decay, since the initial learning rate is equal to the minimum learning rate.

Table III(b): learning rate (with decay) fixed
  filters/neurons = 64   |  0.928
  filters/neurons = 128  |  0.970
  filters/neurons = 256  |  0.837

Table III(c): filters/neurons = 128, learning rate (with decay) fixed
  entropy factor  |  0.970
  entropy factor  |  0.958
Table III shows how hyperparameters affect the load balancing performance of CFR-RL in the Abilene network. For each set of hyperparameters, we trained a policy for the Abilene network for 10,000 iterations and then evaluated the average load balancing performance ratio over the whole test set. We only present the results for the Abilene network, since the results for the other network topologies are similar. In Tab. III(a), the number of filters in the convolutional layer and of neurons in the fully connected layer is fixed to 128, and the entropy factor is fixed to 0.1. We compare the performance with different learning rates. The results show that training might become unstable if the initial learning rate is too large (e.g., 0.01), and thus it cannot converge to a good policy. In contrast, training with a smaller learning rate is more stable but might require a longer training time to further improve the performance. As a result, we chose an initial learning rate with decay to encourage exploration in the early stage of training. We compared the performance with different numbers of filters and neurons in Tab. III(b). The results show that too few filters/neurons might restrict the representations that the neural network can learn and thus cause underfitting. Meanwhile, too many neurons might cause overfitting, so the corresponding policy cannot generalize well to the test set. In addition, more training time is required for a larger neural network. In Tab. III(c), the results show that a larger entropy factor encourages exploration and leads to better performance. Overall, the set of hyperparameters we chose is a good tradeoff between performance and computational complexity of the model.
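The learning-rate rows in Tab. III(a) include decayed and non-decayed variants. The paper does not spell out the exact schedule; the exponential decay clipped at a floor below is an assumed, common choice that reproduces the footnote's behavior: when the initial rate equals the minimum rate, the schedule degenerates to a constant, i.e., "without decay".

```python
def decayed_lr(initial_lr, min_lr, decay_rate, step, decay_steps):
    """Exponentially decay the learning rate every `decay_steps` steps,
    never going below `min_lr` (assumed schedule, not the paper's exact one)."""
    lr = initial_lr * (decay_rate ** (step / decay_steps))
    return max(lr, min_lr)

# With initial_lr == min_lr the rate stays constant ("without decay" case):
constant = decayed_lr(1e-4, 1e-4, 0.96, 5000, 1000)
```

A schedule like this gives the larger, exploratory updates early in training that the text argues for, while converging toward the stability of a small learning rate later on.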
VII Conclusion and Future Work
With the objective of minimizing the maximum link utilization in a network while reducing the network disturbance that causes service disruption, we proposed CFR-RL, a scheme that learns a critical flow selection policy automatically via reinforcement learning, without any domain-specific rule-based heuristic. CFR-RL selects critical flows for each given traffic matrix and reroutes them to balance link utilization of the network by solving a simple rerouting optimization problem. Extensive evaluations show that CFR-RL achieves near-optimal performance by rerouting only a limited portion of total traffic. In addition, CFR-RL generalizes well to traffic matrices for which it was not explicitly trained.
Yet, there are several aspects that may help improve the solution proposed in this contribution. Among them, we discuss below how CFR-RL can be updated and improved.
Objectives: CFR-RL could be formulated to achieve other objectives. For example, to minimize the overall end-to-end delay in the network, as described in Section VI-A4(2), we can define the reward in terms of the overall end-to-end delay and reformulate the rerouting optimization problem (4a) to minimize that delay.
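As one hypothetical shaping of such a reward (the exact form would depend on the reformulated problem), the agent can be paid the inverse of the overall end-to-end delay, so that routings with lower delay yield higher reward:

```python
def delay_reward(total_delay):
    """Hypothetical reward: inverse of the overall end-to-end delay
    (total_delay > 0), so reducing delay increases the RL agent's reward."""
    assert total_delay > 0, "delay must be positive"
    return 1.0 / total_delay

# a critical-flow selection that halves the overall delay doubles the reward
r_better = delay_reward(5.0)
r_worse = delay_reward(10.0)
```

Any monotone decreasing transform of the delay would serve the same purpose; the inverse is just one simple, bounded-gradient choice.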
Table II shows an interesting finding. Although CFR-RL does not explicitly minimize rerouted traffic, it ends up rerouting much less traffic (i.e., 10.0%-21.3%) and performs better than the rule-based heuristic schemes by 1.3%-12.2%. This reveals that CFR-RL effectively searches the whole set of candidate flows to find the best critical flows for various traffic matrices, rather than simply considering the elephant flows on the most congested links or in the whole network, as the rule-based heuristic schemes do. We will consider minimizing rerouted traffic as one of our objectives and investigate the tradeoff between maximizing performance and minimizing rerouted traffic.
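A natural way to explore that tradeoff is to scalarize the two goals into a single reward; the weight below is purely illustrative and not part of the paper's formulation:

```python
def combined_reward(perf_ratio, disturbance, beta=0.5):
    """Hypothetical multi-objective reward: credit load-balancing quality
    (perf_ratio in [0, 1], higher is better) and penalize the fraction of
    traffic rerouted (disturbance in [0, 1]). beta is an assumed weight."""
    return perf_ratio - beta * disturbance

# near-optimal performance at 10% disturbance beats slightly better
# performance bought with 40% disturbance, for this choice of beta
r_small = combined_reward(0.96, 0.10)
r_large = combined_reward(0.99, 0.40)
```

Sweeping `beta` would trace out the performance-versus-disturbance frontier mentioned above, from pure performance maximization (`beta = 0`) to strongly disturbance-averse policies.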
Scalability: Scaling CFR-RL to larger networks is an important direction of our future work. CFR-RL relies on LP to produce reward signals. The LP problem becomes more complex as the number of critical flows and the size of the network increase. This would slow down policy training for larger networks (e.g., the Tiscali network in Section VI-B4), since the time consumed by each iteration would increase. Moreover, the solution space becomes enormous for larger networks, and RL has to take more iterations to converge to a good policy. To further speed up training, we can either spawn even more actor agents (e.g., 30) in parallel to allow the system to consume more data at each time step and thus improve exploration [A3C], or apply GA3C [GA3C], an alternative architecture of A3C that offloads training to a GPU and emphasizes efficient GPU utilization to increase the number of training samples generated and processed per second. Another possible design to mitigate the scalability issue is adopting SDN multi-controller architectures. Each controller takes care of a subset of routers in a large network, and one CFR-RL agent runs on each SDN controller. The corresponding problem naturally falls into the realm of Multi-Agent Reinforcement Learning. We will evaluate whether a multi-SDN-controller architecture can provide additional improvement to our approach.
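The "more actors in parallel" idea can be sketched as follows; the names and the stubbed rollout are hypothetical, a real A3C actor would run the policy network and solve the rerouting LP, and threads stand in here for the separate actor processes an actual deployment would use:

```python
from concurrent.futures import ThreadPoolExecutor

def rollout(tm_id):
    """Stub for one actor step: select critical flows for traffic matrix
    tm_id, solve the rerouting LP, and return the resulting reward."""
    return tm_id, 1.0 / (1 + tm_id)  # dummy reward in lieu of the LP result

def parallel_rollouts(tm_ids, workers=8):
    """Run independent rollouts concurrently so the central learner
    consumes more experience per time step."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(rollout, tm_ids))

rewards = parallel_rollouts(range(4))
```

Because each traffic matrix is rolled out independently, throughput scales with the number of actors until the shared learner (or the LP solver) becomes the bottleneck, which is exactly the regime where GA3C's GPU offloading becomes attractive.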
Retraining: In this paper, we mainly described the RL-based critical flow selection policy training process as an offline task. In other words, once training is done, CFR-RL remains unmodified after being deployed in the network. However, CFR-RL can naturally accommodate future unseen traffic matrices by periodically updating the selection policy. This self-learning technique will enable CFR-RL to further adapt itself to the dynamic conditions of the network after being deployed in real networks. CFR-RL can be retrained by including new traffic matrices. For example, the outlier traffic matrices (e.g., the 235th-240th traffic matrices in day 2) presented in Fig. 10 should be included for retraining, although the generalization results shown in Section VI-B3 suggest that frequent retraining might not be necessary. Techniques to determine when to retrain, and which new/old traffic matrices should be included in or excluded from the training dataset, should be further investigated.
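One simple, hypothetical policy for the "when to retrain" question is to monitor the recent performance ratios and trigger retraining once too many traffic matrices fall below a target; the threshold values below are illustrative, not tuned:

```python
def needs_retraining(recent_ratios, target=0.95, tolerance=0.05):
    """Hypothetical trigger: retrain when more than `tolerance` of the
    recent traffic matrices score below the target performance ratio."""
    below = sum(1 for r in recent_ratios if r < target)
    return below / len(recent_ratios) > tolerance

# a stretch of near-optimal ratios with a few day-2-style outliers
ratios = [0.97] * 90 + [0.80] * 10   # 10% of matrices are outliers
trigger = needs_retraining(ratios)
```

The flagged outlier matrices themselves are then natural candidates to add to the training set, while the stable majority keeps retraining infrequent, consistent with the generalization results above.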
The above examples are some key issues that are left for future work.
Acknowledgments
The authors would like to thank the editors and reviewers for providing many valuable comments and suggestions.