Reinforcement Learning for Datacenter Congestion Control

02/18/2021 ∙ by Chen Tessler, et al. ∙ 0

We approach the task of network congestion control in datacenters using Reinforcement Learning (RL). Successful congestion control algorithms can dramatically improve latency and overall network throughput. Until today, no such learning-based algorithms have shown practical potential in this domain. Evidently, the most popular recent deployments rely on rule-based heuristics that are tested on a predetermined set of benchmarks. Consequently, these heuristics do not generalize well to newly-seen scenarios. Contrarily, we devise an RL-based algorithm with the aim of generalizing to different configurations of real-world datacenter networks. We overcome challenges such as partial-observability, non-stationarity, and multi-objectiveness. We further propose a policy gradient algorithm that leverages the analytical structure of the reward function to approximate its derivative and improve stability. We show that this scheme outperforms alternative popular RL approaches, and generalizes to scenarios that were not seen during training. Our experiments, conducted on a realistic simulator that emulates communication networks' behavior, exhibit improved performance concurrently on the multiple considered metrics compared to the popular algorithms deployed today in real datacenters. Our algorithm is being productized to replace heuristics in some of the largest datacenters in the world.




1 Introduction

Most RL algorithms were designed under the assumption that the world can be adequately modeled as a Markov Decision Process (MDP). Unfortunately, this is seldom the case in realistic applications. In this work, we study a real-world application of RL to datacenter congestion control. This application is highly challenging because of partial observability and its complex multi-objective nature. Moreover, there are multiple decision-makers that have to make concurrent decisions that can affect each other. We show that even though the problem at hand does not fit the standard model, we can nevertheless devise an RL framework that outperforms state-of-the-art tailored heuristics. Moreover, we show experimentally that the RL framework generalizes well.

We provide a full description of the Congestion Control (CC) problem in Section 3, but from a bird’s eye view, it is enough to think of CC as a multi-agent, multi-objective, partially observed problem where each decision maker receives a goal (target). The target makes it easy to tune the behavior to fit the requirements (i.e., how latency-sensitive the system is). We devise the target so that it leads to beneficial behavior in the multiple considered metrics, without having to tune coefficients of multiple reward components. We model the task of datacenter congestion control as a reinforcement learning problem. We observe that standard algorithms (mnih2015human; lillicrap2015continuous; schulman2017proximal) are incapable of solving this task. Thus, we introduce a novel on-policy deterministic-policy-gradient scheme that takes advantage of the structure of our target-based reward function. This method enjoys both the stability of deterministic algorithms and the ability to tackle partially observable problems.

To validate our claims, we develop an RL GYM (brockman2016openai) environment based on a realistic simulator and perform extensive experiments. The simulator is based on OMNeT++ (varga2002omnet++) and emulates the behavior of ConnectX-6 Dx network interface cards (state-of-the-art hardware deployed in current datacenters). To encourage additional research in this field, the code will be open-sourced after the review phase is over. Our experiments show that our method, Programmable CC-RL (PCC-RL), learns a robust policy, in the sense that it is competitive in all the evaluated scenarios, often outperforming the current state-of-the-art methods.

Our contributions are as follows. (i) We formulate the problem of datacenter congestion control as a partially-observable multi-agent multi-objective RL task. (ii) We present the challenges of this realistic formulation and propose a novel on-policy deterministic-policy-gradient method to solve it. (iii) We provide an RL training and evaluation suite for training and testing RL agents within a realistic simulator. Finally, (iv) we ensure our agent satisfies compute and memory constraints such that it can be easily deployed in future datacenter network devices.

Figure 1: Example of a simple many-to-one topology. Multiple senders transmit data through a single switch, the congestion point, towards a single receiver. As the source nodes can potentially transmit at a combined rate faster than the switch can process, its input buffer might fill up and the switch may become congested.

2 Networking Preliminaries

We start with a short introduction covering the most relevant concepts in networking.

In this work, we focus on datacenter networks. In datacenters, traffic contains multiple concurrent data streams transmitting at high rates. The servers, also known as hosts, are interconnected through a topology of switches. A directional connection between two hosts that continuously transmits data is called a flow. We assume, for simplicity, that the path of each flow is fixed.

Each host can hold multiple flows whose transmission rates are determined by a scheduler. The scheduler iterates in a cyclic manner between the flows, also known as round-robin scheduling. Once scheduled, the flow transmits a burst of data. The burst’s size generally depends on the requested transmission rate, the time it was last scheduled, and the maximal burst size limitation.
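The scheduling loop described above can be sketched as follows. This is a minimal illustration and not the simulator's scheduler: the `Flow` class, the fixed time slot per turn, and the burst rule (send what the requested rate has "earned" since the flow's last turn, capped at the maximal burst size) are assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Flow:
    name: str
    rate_gbps: float               # requested transmission rate
    last_scheduled_us: float = 0.0


def burst_bits(flow, now_us, max_burst_bits):
    """Assumed rule: bits earned at the requested rate since the last turn, capped."""
    elapsed_us = now_us - flow.last_scheduled_us
    earned = flow.rate_gbps * 1e3 * elapsed_us   # 1 Gbit/s == 1e3 bits per microsecond
    return min(earned, max_burst_bits)


def round_robin(flows, slot_us, n_slots, max_burst_bits):
    """Visit the flows cyclically, one per time slot; return bits sent per flow."""
    queue = deque(flows)
    sent = {f.name: 0.0 for f in flows}
    now_us = 0.0
    for _ in range(n_slots):
        now_us += slot_us
        flow = queue.popleft()
        sent[flow.name] += burst_bits(flow, now_us, max_burst_bits)
        flow.last_scheduled_us = now_us
        queue.append(flow)           # back of the line: cyclic order
    return sent


flows = [Flow("a", rate_gbps=10.0), Flow("b", rate_gbps=40.0)]
sent = round_robin(flows, slot_us=1.0, n_slots=4, max_burst_bits=64_000)
```

In this toy run, flow "b" is limited by the burst cap on each of its turns, while flow "a" is limited by its requested rate.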

A flow’s transmission is characterized by two primary values. Bandwidth: the average amount of data transmitted, measured in Gbit per second; and latency: the time it takes for a packet to reach its destination. Round-trip-time (RTT) measures the latency of source → destination → source. While the latency is often the metric of interest, most systems are only capable of measuring RTT. This is presented in Fig. 2.

3 Congestion Control

Figure 2: Schema of the packet flow. The sender transmits bursts of data packets and at the end transmits an RTT packet.

Congestion occurs when multiple flows cross paths, transmitting data through a single congestion point (switch or receiving server) at a rate faster than the congestion point can process. In this work, we assume that all connections have equal transmission rates, as typically occurs in most datacenters. Thus, a single flow can saturate an entire path by transmitting at the maximal rate.

Each congestion point in the network has an inbound buffer, see Fig. 1, enabling it to cope with short periods where the inbound rate is higher than it can process. As this buffer begins to fill, the time (latency) it takes for each packet to reach its destination increases. When the buffer is full, any additional arriving packets are dropped.
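To make the buffer dynamics concrete, here is a small discrete-time sketch of a single congestion point. It is a toy fluid model under assumed parameters (fixed time step, constant service rate, tail drop), not the behavior of any particular switch.

```python
def simulate_buffer(inbound_gbps, service_gbps, buffer_bits, step_us=1.0):
    """Return (queue_bits, queueing_delay_us, dropped_bits) after each time step."""
    queue = 0.0
    trace = []
    for rate in inbound_gbps:
        arrived = rate * 1e3 * step_us           # 1 Gbit/s == 1e3 bits per microsecond
        served = service_gbps * 1e3 * step_us
        queue = max(0.0, queue + arrived - served)
        dropped = max(0.0, queue - buffer_bits)  # overflow is dropped (tail drop)
        queue -= dropped
        delay_us = queue / (service_gbps * 1e3)  # time to drain what is queued
        trace.append((queue, delay_us, dropped))
    return trace


# Inbound traffic at 150 Gbit/s into a 100 Gbit/s port with a 120 kbit buffer.
trace = simulate_buffer([150.0] * 4, service_gbps=100.0, buffer_bits=120_000)
```

The queueing delay grows step by step until the buffer is full, after which the excess is dropped and the delay stays at its maximum.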

3.1 Congestion Indicators

There are various methods to measure or estimate the congestion within the network. The ECN protocol marks packets with an increasing probability as the buffer fills up. Network telemetry is an additional, advanced, congestion signal. As opposed to statistical information (ECN), a telemetry signal is a precise measurement provided directly from the switch, such as the switch’s buffer and port utilization.
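As an illustration of a marking probability that increases as the buffer fills, the sketch below uses a linear ramp between two queue thresholds. The ramp shape and the parameter names `k_min`, `k_max`, and `p_max` are assumptions chosen for the example; the text above does not specify how the switch computes the probability.

```python
def ecn_mark_probability(queue_bytes, k_min, k_max, p_max):
    """Hypothetical linear ramp: never mark below k_min, always mark above k_max."""
    if queue_bytes <= k_min:
        return 0.0
    if queue_bytes >= k_max:
        return 1.0
    return p_max * (queue_bytes - k_min) / (k_max - k_min)


# Probability at a few queue depths for illustrative thresholds.
depths = [0, 100_000, 250_000, 399_999, 400_000]
probs = [ecn_mark_probability(d, k_min=100_000, k_max=400_000, p_max=0.2) for d in depths]
```

Because the signal is probabilistic, a sender only learns the congestion level statistically, through the fraction of its packets that were marked.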

However, while the ECN and telemetry signals provide useful information, they require specialized hardware. An ideal solution is one that can be easily deployed within existing networks. Such solutions are based on RTT measurements. They measure congestion by comparing the RTT to that of an empty system.

3.2 Objective

CC can be seen as a multi-agent problem. Assuming there are N flows, this results in N CC algorithms (agents) operating simultaneously. Assuming all agents have an infinite amount of traffic to transmit, their goal is to optimize the following metrics:

  1. Switch bandwidth utilization – the % from maximal transmission rate.

  2. Packet latency – the amount of time it takes for a packet to travel from the source to its destination.

  3. Packet-loss – the amount of data (% of maximum transmission rate) dropped due to congestion.

  4. Fairness – a measure of similarity in the transmission rate between flows sharing a congested path. We consider .

The multi-objective problem of the CC agent is to maximize the bandwidth utilization and fairness, and minimize the latency and packet-loss. Thus, it may have a Pareto-front (liu2014multiobjective) for which optimality w.r.t. one objective may result in sub-optimality of another. However, while the metrics of interest are clear, the agent does not necessarily have access to signals representing them. For instance, fairness is a metric that involves all flows, yet the agent observes signals relevant only to the flow it controls. Hence, it is impossible for a flow to obtain an estimate of the current fairness in the system. Instead, we reach fairness by setting each flow’s individual target adaptively, based on known relations between its current RTT and rate. More details on this are given in Sec. 6.

This problem exhibits additional complexities. As the agent only observes information relevant to the flow it controls, this task is partially observable. The observation might lack sufficient statistics required to determine the optimal policy. Moreover, the network’s reaction to transmission rate changes is delayed by

4 Reinforcement Learning Preliminaries

We model the task of congestion control as a multi-agent partially-observable multi-objective MDP, where all agents share the same policy. Each agent observes statistics relevant to itself and does not observe the entire global state (e.g., the number of active flows in the network).

We consider an infinite-horizon Partially Observable Markov Decision Process (POMDP). A POMDP is defined as the tuple $(\mathcal{O}, \mathcal{S}, \mathcal{A}, P, R)$ (puterman1994markov; spaan2012partially). An agent interacting with the environment observes a state $o(s) \in \mathcal{O}$ and performs an action $a \in \mathcal{A}$. After performing an action, the environment transitions to a new state $s'$ based on the transition kernel $P(s' \mid s, a)$ and receives a reward $r(s, a) \in R$. In a POMDP, the observed state does not necessarily contain sufficient statistics for determining the optimal action.

We consider the average reward metric, defined as follows. We denote $\Pi$ as the set of stationary deterministic policies on $\mathcal{A}$, i.e., if $\pi \in \Pi$ then $\pi : \mathcal{O} \to \mathcal{A}$. Let $\rho^\pi$ be the gain of a policy $\pi$ defined in state $s$ as $\rho^\pi(s) = \lim_{T \to \infty} \frac{1}{T} \mathbb{E}^\pi \left[ \sum_{t=0}^{T} r(s_t, a_t) \mid s_0 = s \right]$, where $\mathbb{E}^\pi$ denotes the expectation w.r.t. the distribution induced by $\pi$.

The goal is to find a policy $\pi^*$ yielding the optimal gain $\rho^*$, i.e., for all $s \in \mathcal{S}$, $\pi^*(s) \in \arg\max_{\pi \in \Pi} \rho^\pi(s)$, and the optimal gain is $\rho^*(s) = \rho^{\pi^*}(s)$. A well known result (puterman1994markov)[Theorem 6.2.10], is that there always exists an optimal policy which is stationary and deterministic.

5 Related Work

Hand-tuned CC: While there has been a vast amount of work on congestion control, we focus on datacenter congestion control methods. Previous work tackled this problem from various angles. alizadeh2010data used a TCP-like protocol, which increases the transmission until congestion is sensed, and then dramatically decreases it. mittal2015timely; kumar2020swift directly used RTT to react quickly to changes. zhu2015congestion utilized the ECN protocol, a statistical signal provided by the switch, and li2019hpcc added telemetry information that requires specialized hardware yet proves to benefit greatly in terms of reaction times.

Optimization-based CC: Although most previous work has focused on hand-tuned algorithmic behavior, two notable mentions have taken an optimization-based approach. dong2018pcc presented the PCC-Vivace algorithm, which combines information from fixed time intervals such as bandwidth, latency inflation, and more. As it tackles the problem via online convex optimization, it is stateless; as such, it does not optimize for long-term behavior but rather focuses on the immediate reward (bandit setting). This work was then extended in the Aurora system (jay2019deep). Aurora provides the monitor interval, defined in PCC-Vivace, as a state for a PPO (schulman2017proximal) algorithm. Although these methods work in a naive single-agent setting, we observed their inability to converge to satisfying behavior in the realistic multi-agent setting.

Multi-objective RL: This task can also be cast as a multi-objective RL problem (mannor2004geometric) and solved by combining the various metrics into a scalar reward function (jay2019deep) (e.g., $r = a \cdot \text{bandwidth} - b \cdot \text{latency} - c \cdot \text{packet loss}$, where $a, b, c$ are positive constant coefficients). However, in practice, the exact coefficients are selected through a computationally-intensive process of hyper-parameter tuning. Previous work (leike2017ai; mania2018simple; tessler2018reward) has shown that these coefficients do not generalize: a coefficient that leads to a satisfying behavior on one domain may lead to catastrophic failure on the other. We propose an alternative approach, in which we present a reward function that is domain agnostic. Our reward has a single fixed point solution, which is optimal for any domain.

6 Reinforcement Learning for Congestion Control

The POMDP framework from Section 4 requires the definition of the four elements in $(\mathcal{O}, \mathcal{A}, P, R)$: observations, actions, transitions, and reward. The agent, a congestion control algorithm, runs from within the network-interface-card (NIC) and controls the rate of the flows passing through that NIC. At each decision point, the agent observes statistics correlated to the specific flow it controls. The agent then acts by determining a new transmission rate and observes the outcome of this action.

Observations. As the agent can only observe information relevant to the flow it controls, we consider: the flow’s transmission rate, RTT measurement, and number of CNP and NACK packets received. The CNP and NACK packets represent events occurring in the network. A CNP packet is transmitted to the source host once an ECN-marked packet reaches the destination. A NACK packet signals to the source host that packets have been dropped (e.g., due to congestion) and should be re-transmitted.

Actions. The optimal transmission rate depends on the number of agents simultaneously interacting in the network and on the network itself (bandwidth limitations and topology). As such, the optimal transmission rate will vary greatly across scenarios. Since it should be quickly adapted across different orders of magnitude, we define the action as a multiplication of the previous rate, i.e., $\text{rate}_{t+1} = a_t \cdot \text{rate}_t$.
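A multiplicative action lets the rate move across orders of magnitude in a handful of decisions, which an additive step of fixed size cannot do. The sketch below shows this; the function name and the clamping bounds `min_rate` and `max_rate` are illustrative assumptions, not values from the paper.

```python
def apply_action(rate_gbps, action, min_rate=1e-3, max_rate=100.0):
    """New rate = action * previous rate, clamped to assumed hardware limits."""
    return min(max_rate, max(min_rate, rate_gbps * action))


# Ten consecutive "halve the rate" decisions starting from line rate.
rate = 100.0
history = [rate]
for _ in range(10):
    rate = apply_action(rate, action=0.5)
    history.append(rate)
```

After ten decisions the rate has dropped by a factor of 1024, from 100 Gbit/s to roughly 0.1 Gbit/s.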

Transitions. The transition depends on the dynamics of the environment and on the frequency at which the agent is polled to provide an action. Here, the agent acts once an RTT packet is received. This is similar to the definition of a monitor interval by dong2018pcc, but while they considered fixed time intervals, we consider event-triggered (RTT) intervals.

Reward. As the task is a multi-agent partially observable problem, the reward must be designed such that there exists a single fixed-point equilibrium. Thus, we let

$$r_t = -\left( \text{target} - \frac{\text{RTT}^i_t}{\text{base-RTT}^i} \cdot \sqrt{\text{rate}^i_t} \right)^2,$$

where target is a constant value shared by all flows, $\text{base-RTT}^i$ is defined as the RTT of flow $i$ in an empty system, and $\text{RTT}^i_t$ and $\text{rate}^i_t$ are respectively the RTT and transmission rate of flow $i$ at time $t$. The ratio $\text{RTT}^i_t / \text{base-RTT}^i$ is also called the rtt inflation of agent $i$ at time $t$. The ideal reward is obtained when $\text{target} = \text{rtt-inflation} \cdot \sqrt{\text{rate}}$. Hence, when the target is larger, the ideal operation point is obtained when $\text{rtt-inflation} \cdot \sqrt{\text{rate}}$ is larger. The transmission rate has a direct correlation to the RTT, hence the two grow together. Such an operation point is less latency sensitive (RTT grows) but enjoys better utilization (higher rate).

Based on appenzeller2004sizing, a good approximation of the RTT inflation in a bursty system, where all flows transmit at the ideal rate, behaves like $\sqrt{N}$, where $N$ is the number of flows. As the system at the optimal point is on the verge of congestion, the major latency increase is due to the packets waiting in the congestion point. As such, we can assume that all flows sharing a congested path will observe a similar rtt inflation. As Proposition 1 shows, maximizing this reward results in a fair solution.

Proposition 1.

The fixed-point solution for all flows sharing a congested path is a transmission rate of $1/N$.

The proof is provided in Appendix B.
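The following sketch is a numerical sanity check, not a proof. It assumes the reward has the squared-error form $-(\text{target} - \text{rtt-inflation} \cdot \sqrt{\text{rate}})^2$ with rates normalized to the line rate, and that the RTT inflation at the ideal point behaves like $\sqrt{N}$. Under those assumptions, the quantity the reward compares against the target is the same for any number of flows when each flow sends at $1/N$.

```python
import math


def reward(target, rtt_us, base_rtt_us, rate):
    """Assumed reward form: squared distance of inflation * sqrt(rate) from the target."""
    inflation = rtt_us / base_rtt_us
    return -(target - inflation * math.sqrt(rate)) ** 2


def operating_point(n_flows):
    """inflation * sqrt(rate) when every flow sends at 1/N and inflation ~ sqrt(N)."""
    return math.sqrt(n_flows) * math.sqrt(1.0 / n_flows)


points = [operating_point(n) for n in (2, 128, 1024, 8192)]

# A flow at a quarter of line rate whose RTT is twice the empty-system RTT
# sits exactly on a target of 1; sending at full rate with the same RTT does not.
on_target = reward(1.0, rtt_us=20.0, base_rtt_us=10.0, rate=0.25)
off_target = reward(1.0, rtt_us=20.0, base_rtt_us=10.0, rate=1.0)
```

Since the operating point does not depend on the number of flows, a single shared target can serve scenarios of very different scale.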

6.1 Implementation

Striving for simplicity, we initially attempted to solve this task using standard RL techniques such as DQN (mnih2015human), DDPG (lillicrap2015continuous) and PPO (schulman2017proximal). Due to the challenges this task exhibits, namely partial observability and multi-agent optimization, these methods did not converge. Additionally, we experimented with Aurora (jay2019deep): an RL agent for CC, designed to solve a single-agent setup. As such, it did not converge in our more complex domain. All the above attempts are documented in detail in Appendix C.

Due to the partial observability, on-policy methods are the most suitable. And as the goal is to converge to a stable multi-agent equilibrium, and due to the high-sensitivity action choice, deterministic policies are easier to manage. Thus, we devise a novel on-policy deterministic policy-gradient (silver2014deterministic, DPG) method that directly relies on the structure of the reward function as given below. In DPG, the goal is to estimate $\nabla_\theta G^{\pi_\theta}$, the gradient of the value of the current policy, with respect to the policy’s parameters $\theta$. By taking a gradient step in this direction, the policy is improving and thus under standard assumptions will converge to the optimal policy.

As opposed to off-policy methods, on-policy learning does not demand a critic. We observed that due to the challenges in this task, learning a critic is not an easy feat. Hence, we focus on estimating $\nabla_\theta G^{\pi_\theta}$ from a sampled trajectory.


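Using the chain rule, we can estimate the gradient of the reward with respect to the action, $\nabla_a r$:

$$\nabla_a r(s_t, a_t) \propto \left( \text{target} - \text{rtt-inflation}_t \cdot \sqrt{\text{rate}_t} \right) \cdot \nabla_a \left( \text{rtt-inflation}_t \cdot \sqrt{\text{rate}_t} \right),$$

where $\text{rtt-inflation}_t$ is the ratio between the current RTT and the RTT of an empty system.

Notice that both $\text{rtt-inflation}_t$ and $\sqrt{\text{rate}_t}$ are monotonically increasing in $a$. The action is a scalar determining by how much to change the transmission rate. A faster transmission rate also leads to higher RTT inflation. Thus, the signs of $\nabla_a \text{rtt-inflation}_t$ and $\nabla_a \sqrt{\text{rate}_t}$ are identical and $\nabla_a \left( \text{rtt-inflation}_t \cdot \sqrt{\text{rate}_t} \right)$ is always non-negative.

However, estimating the exact value of this term is impossible given the complex dynamics of a datacenter network. Instead, as the sign is always non-negative, we approximate this gradient with a positive constant which can be absorbed into the learning rate:

$$\nabla_a r(s_t, a_t) \approx \text{target} - \text{rtt-inflation}_t \cdot \sqrt{\text{rate}_t}.$$
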
In layman’s terms: if $\text{rtt-inflation} \cdot \sqrt{\text{rate}}$ is above the target, the gradient will push the action towards decreasing the transmission rate, and vice versa. As all flows observe approximately the same rtt inflation, the objective drives them towards the fixed-point solution. As shown in Proposition 1, this occurs when all flows transmit at the same rate of $1/N$ and the system is slightly congested as proposed by kumar2020swift.

Finally, the true estimation of the gradient is obtained for $T \to \infty$. Our approximation for this gradient is by averaging over a finite, sufficiently long, $T$. In practice, $T$ is determined empirically.
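The intuition above can be illustrated with a toy simulation. This is not the training procedure: there is no neural network and no policy gradient here, only the sign-based signal $\text{target} - \text{rtt-inflation} \cdot \sqrt{\text{rate}}$ applied directly as a multiplicative rate update. The congestion model in `shared_inflation` is a hypothetical stand-in chosen so that the inflation equals $\sqrt{N}$ when the link is exactly saturated; rates are normalized to the line rate.

```python
import math


def shared_inflation(rates):
    """Hypothetical congestion model: inflation grows with total load and
    equals sqrt(N) when the link is exactly saturated (total load of 1)."""
    return math.sqrt(len(rates)) * sum(rates)


def step(rates, target, lr):
    inflation = shared_inflation(rates)      # every flow observes the same value
    new_rates = []
    for r in rates:
        signal = target - inflation * math.sqrt(r)    # > 0: speed up, < 0: slow down
        new_rates.append(r * math.exp(lr * signal))   # multiplicative rate change
    return new_rates


# Four flows start from very unequal rates.
rates = [0.9, 0.05, 0.3, 0.01]
for _ in range(2000):
    rates = step(rates, target=1.0, lr=0.1)
```

Because every flow compares the same inflation against the same target, the only point at which all signals vanish is the one where all rates are equal; in this toy model that is a rate of 1/4 per flow.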

7 The Challenges of Real-World Deployment

PCC-RL is currently undergoing productization at a large tech company (15K employees) and is to be deployed in its live datacenters. Thus, beyond training the RL agent, our goal is to provide a method that can run in real-time on-device (on a NIC). This requirement presents two limitations. (1) The problem is asynchronous. While the CC algorithm is determining the transmission rate limitation, the flows continue to transmit data. As such, decision making must be efficient and fast such that inference can be performed within . (2) Each NIC can control thousands of flows. As we require real-time reaction, the algorithm must utilize fast, yet small, memory. Hence, the amount of memory stored per flow must be minimal.

Thus, for the policy we choose a neural network that is composed of 2 fully connected layers followed by an LSTM and a final fully connected layer. Because information takes time to propagate through the network, the next state is also a function of the previous actions, and we observed that the LSTM was crucial. Such an architecture, combined with ReLU activation functions, enables fast inference using common deep learning accelerators (DLA).

While a small number of weights and a low memory footprint enables faster inference, to meet the strict requirements of running in real-time on-device, we also quantize a trained agent and analyze its performance. This is discussed in detail in Section 8.3.

Alg.    | 128 to 1       | 1024 to 1      | 4096 to 1        | 8192 to 1
        | SU  FR  QL  DR | SU  FR  QL  DR | SU  FR  QL   DR  | SU   FR  QL   DR
PCC-RL  | 92  95   8  0  | 90  70  15  0  | 91  44   26  0   |  92  29   42  0
DC2QCN  | 96  84   8  0  | 88  82  17  0  | 85  67  110  0.2 | 100  72  157  1.3
HPCC    | 83  96   5  0  | 59  48  27  0  | 73  13   79  0.2 |  86   8  125  0.9
SWIFT   | 98  99  40  0  | 91  98  66  0  | 90  56  120  0.1 |  92  50  123  0.2
Table 1: Many-to-one test results. Numerical comparison of PCC-RL (our method) with DC2QCN (unpublished followup to zhu2015congestion), HPCC (li2019hpcc) and SWIFT (kumar2020swift). The column legend is: SU Switch Utilization [%] (higher is better), FR Fairness (higher is better), QL Queue Latency [μsec] (lower is better), and DR Drop rate [Gbit/s] (lower is better). As the goal of a CC algorithm is to prevent congestion, we color the tests that failed (extensive periods of packet loss) in red. We mark the best performing methods in each test in bold. For 1024, we mark both DC2QCN and SWIFT as it is up to the end user to determine which outcome is preferred.
[Figure 3 panels: (a) PCC-RL, (b) DC2QCN, (c) HPCC in the top row; (e) PCC-RL, (f) DC2QCN, (g) HPCC in the bottom row]
Figure 3: Long Short test results. The goal is to recover fast, but also avoid packet loss. Higher buffer utilization means higher latency, and packets are dropped only when the buffer is fully utilized (100% utilization of the 5MB allocated). In these tests, none of the algorithms encountered packet loss. The top row (Figs. 3(a) to 3(d)) presents the results of a long-short test with 2 flows, and the bottom row (Figs. 3(e) to 3(h)) presents a test with 8 flows. We plot the bandwidth utilization of the long flow and the buffer utilization in the switch. Recovery time is measured as the time it takes the long flow to return to maximal utilization. As can be seen, in both scenarios, DC2QCN does not recover within a reasonable time. In addition, in HPCC, the long flow does not reach 100% utilization, even when there are no additional flows.

8 Experiments

To show our method generalizes to unseen scenarios, motivating the use in the real world, we split the scenarios to train and test sets. We train the agents only in the many-to-one domain (see below), on the scenarios: and . Evaluation is performed on many-to-one, all-to-all, and long-short scenarios. We provide an extensive overview of the training process, including the technical challenges of the asynchronous CC task, in Appendix A.

We compare to 3 baseline algorithms: DC2QCN that utilizes the ECN packet marking protocol, HPCC that focuses on network telemetry information, and SWIFT that can be deployed in any existing datacenter, as it relies only on RTT measurements. As these methods lack an official implementation, we use unofficial implementations. Although unofficial, these implementations are currently deployed and used daily in real datacenters. (A reference shall appear in the final version and is omitted here to guarantee anonymity.)

Many-to-one: Denoted by this scenario emulates senders transmitting data through a single switch to a single receiver as depicted in Fig. 1. We evaluate the agents on , for . The exact configuration (number of flows per host) is presented in Appendix A.

[Figure 4 panels: (a) 128 to 1, (b) 1024 to 1, (c) 4096 to 1, (d) 8192 to 1]
Figure 4: Radar plots comparing the various algorithms. Each color represents a single algorithm. The axes are normalized, where higher (closer to the edge) is better. While all methods are competitive at low scale (Fig. 4(a)), PCC-RL outperforms at a large scale (thousands of flows in parallel).

All-to-all: This scenario emulates multiple servers transmitting data to all other servers. In this case, given there are $N$ servers, there will also be $N$ congestion points. All data sent towards server $i$ routes through switch port $i$. This synchronized traffic causes a high system load. While the ‘many-to-one’ is a relatively clean setting, ‘all-to-all’ tests the ability of the various algorithms to cope in a complex and dynamic system.

Long-short: In addition to testing the ability of the algorithms to converge to a stable-point solution, the long-short scenario evaluates the ability of each agent to dynamically react to changes. A single flow (the ‘long’ flow) transmits an infinite amount of data, while several short flows randomly interrupt it with short data transmission. The goal is to test how fast the long flow reacts and reduces its transmission rate to enable the short flows to transmit their data. Once the short flows are finished, the long flow should recover quickly to full line rate. We follow the process from interruption until full recovery. We present an example of ideal behaviors in Fig. 5.

Figure 5: Long Short ideal behavior. The above plots depict two ideal behaviors. On the left, the long flow halts while the short flow is transmitting data. On the right, the long and short flows quickly converge to an equal fair transmission rate. Both solutions are ideal, and determining which solution is preferable depends on the datacenter requirements.

8.1 Baseline comparison

The results for the many-to-one tests are presented graphically in Fig. 4 and numerically in Table 1. The simulation settings (number of hosts and flows per host) are presented in Appendix A. We observe that while most methods are competitive at a small scale (a small number of flows with few interactions), at large-scale deployments, all methods aside from PCC-RL encounter extensive periods of packet loss. PCC-RL is the only method capable of maintaining high switch utilization, while keeping the latency low and fairness at a reasonable level.

Alg.    | 4 hosts      | 8 hosts
        | SU   FR  QL  | SU  FR  QL
PCC-RL  | 94   77   6  | 94  97   8
DC2QCN  | 90   91   5  | 91  89   6
HPCC    | 71   18   3  | 69  60   3
SWIFT   | 76  100  11  | 76  98  13
Table 2: All-to-all test results. Columns follow the legend of Table 1 (SU, FR, QL). In these tests, none of the algorithms exhibited packet loss. For 4 hosts, we mark both PCC-RL and DC2QCN as best since it is up to the end-user to determine which outcome is preferred.

In addition, we evaluate the various methods on an all-to-all setting. Here, there are hosts and flows running on each (a total of ) . At each host, flow constantly sends data towards host through switch . The results are presented in Table 2. Although SWIFT performed well at low scale many-to-one tests, when transitioning to the all-to-all setting it is incapable of retaining high switch utilization. Similarly to the many-to-one setting, PCC-RL performs competitively and outperforms at the higher scale setting.

Finally, we compared the methods on a long-short setting where the algorithms are tested on their ability to quickly react to changes, presented in Fig. 3 (numerical results in Appendix E). Although PCC-RL did not encounter this scenario during training, it is able to perform competitively. In this scenario, the fastest to react was HPCC, which also maintained minimal buffer utilization. We highlight, though, that HPCC was specifically designed to handle such long-short scenarios (li2019hpcc). Nonetheless, as opposed to HPCC, PCC-RL achieves 100% utilization before and after the interruption and recovers faster than both SWIFT and DC2QCN.

Summary: We observe that PCC-RL is capable of learning a robust policy. Although in certain tasks the baselines marginally outperformed it, PCC-RL always obtained competitive performance. In addition, in large scale scenarios, where thousands of flows interact in parallel, we observed that PCC-RL is the only method capable of avoiding packet loss and thus controlling network congestion. As PCC-RL was trained only on a low-scale scenario, this highlights the ability of our method to generalize and learn a robust behavior that we believe will perform well in a real datacenter.

8.2 Selecting the operation point

As we have seen in Tables 1 and 2 and Fig. 3, the various algorithms exhibit different behaviors. For instance, as seen in Table 1 under the 1024 to 1 evaluation, both DC2QCN and SWIFT obtain good results. It is not clear whether one is absolutely better than the other. The decision depends on the use case. Certain datacenters are very sensitive to latency and would thus prefer DC2QCN. Others might prefer behavior such as that of SWIFT, which provides higher fairness and bandwidth outputs.

PCC-RL is controlled by a single parameter, target. By changing the target, the CC behavior in the datacenter is adapted. When the required behavior is low latency, the target value should be set close to 0, whereas in a system that is less latency sensitive, improved fairness and utilization can be achieved by setting higher values.

Target    | 8192 to 1
          | SU  FR  QL  DR
Strict    | 51  99  12  0
Standard  | 92  29  42  0
Loose     | 92  35  55  0
Table 3: Many-to-one: Numerical comparison of PCC-RL with various selected operation points, where Strict, Standard, and Loose refer to targets of 1, 2, and 20, respectively. The ‘Standard’ results are those presented in the previous section. Here, there isn’t a single ‘best’ solution. The preferred solution depends on the datacenter requirements.
Target    | 4 hosts     | 8 hosts
          | SU  FR  QL  | SU  FR  QL
Strict    | 60  30   4  | 74  52  13
Standard  | 94  77   6  | 94  97   8
Loose     | 73  18   8  | 71  21   9
Table 4: All-to-all: Numerical comparison of PCC-RL with various operation points. The standard results are presented in the previous section.
Target \ Flows | 2     | 128   | 1024  | 2048
Strict         | 5e-2  | 5e-2  | -     | -
Standard       | 5e-3  | 5e-3  | -     | -
Loose          | 6e-7  | 8e-4  | 3e-2  | 3e-2
Table 5: Long-short: Numerical comparison of PCC-RL with various operation points. The ‘Loose’ results are those presented in the previous section. When scaling to above 1024 flows, both the strict and standard settings are incapable of recovering in a reasonable amount of time.

The results in Tables 5, 4 and 3 present an interesting image on how PCC-RL behaves when the operation point is changed. We compare three operation points: ‘strict’, ‘standard’, and ‘loose’, corresponding to targets 1, 2, and 20, respectively.

As expected, the strict target results in lower latency while the opposite occurs for the loose. As the agent attempts to maintain a stricter latency profile, it is required to keep the switch’s buffer utilization as low as possible. This results in a dramatic decrease in switch utilization; see Table 5. On the other hand, a looser profile suffers from higher latency but is capable of reacting faster to the appearance of new flows, which can be explained by the new flows joining faster due to the larger “allowance”, and attains better fairness across flows (Table 3).

8.3 Quantization

As stated in Section 7, PCC-RL is to be deployed in live datacenters. Although our architectural choices ensure a low computational and memory footprint, adhering to a inference time requires further reductions.

We follow the process in wu2020integer and quantize the trained PCC-RL agent. All neural network weights and arithmetic operations are reduced from float32 down to int8. This results in a memory reduction of $4\times$, from 9 KB down to 2.25 KB.
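The memory arithmetic can be checked with a short sketch of symmetric per-tensor int8 quantization. This is an illustration under stated assumptions, not the exact procedure applied to the agent: the weights are random stand-ins, only weight storage is quantized (not the arithmetic), and the parameter count of 2304 is inferred from 9 KB at 4 bytes per float32 weight, taking 1 KB as 1024 bytes.

```python
import random
from array import array


def quantize_int8(weights):
    """Symmetric per-tensor quantization: w ~ scale * q with q in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = array("b", (max(-127, min(127, round(w / scale))) for w in weights))
    return q, scale


def dequantize(q, scale):
    return [scale * v for v in q]


random.seed(0)
n_params = 2304                      # assumption: 9 KB / 4 bytes per float32 weight
w_fp32 = array("f", (random.uniform(-1.0, 1.0) for _ in range(n_params)))
w_int8, scale = quantize_int8(w_fp32)

fp32_bytes = len(w_fp32) * w_fp32.itemsize    # 4 bytes per weight
int8_bytes = len(w_int8) * w_int8.itemsize    # 1 byte per weight
max_err = max(abs(a - b) for a, b in zip(w_fp32, dequantize(w_int8, scale)))
```

Each weight is stored in one byte instead of four, and the rounding error per weight is bounded by half the quantization step `scale`.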

Empirical evidence shows that although the quantized agent requires vastly less computational capacity, it performs on par with the original agent. Results from wu2020integer suggest a speedup of up to 16x when deployed on specialized hardware. These results increase our confidence in the applicability of the method in live datacenters. The numerical results are presented in Appendix D.

9 Summary

In this work, we demonstrated the efficacy and generalization capabilities of RL, in contrast to the hand-crafted algorithms that currently dominate the field of CC. Our experiments utilized the realistic OMNeT++ network simulator that is commonly used to benchmark CC algorithms for deployment in real datacenters. While the various baselines exhibited outstanding performance in certain scenarios, there are others in which they catastrophically failed. Contrarily, PCC-RL learned a robust policy that performed well in all scenarios and often obtained the best results. In addition, we show that PCC-RL generalizes to unseen domains and is the only algorithm capable of operating at a large-scale without incurring packet loss.

PCC-RL was developed with the aim of easy deployment in real datacenters, and potentially even on-device training. We took memory and compute limitations into consideration. By restricting ourselves to relatively small and simple networks, combined with int8 quantization, we ensured that the method can run in real time on a NIC. The next steps involve full productization of the algorithm and adaptive deployment in datacenters, enabling customization to customers’ needs via the tunable target parameter and additional possible reward components.


Appendix A Simulator

The simulator attempts to model the network behavior as realistically as possible. The task of CC is a multi-agent problem: there are multiple flows running on each host (server), and each flow is unaware of the others. As such, each flow is a single agent, and 4096 flows imply 4096 concurrent agents.

Each agent is called by the CC algorithm to provide an action. The action, whether continuous or discrete, is mapped to a requested transmission rate. When the flow is rescheduled, it will attempt to transmit at the selected rate. Calling the agent (the triggering event) occurs each time an RTT packet arrives.

Agents are triggered by spontaneous events rather than at fixed time intervals; this makes the simulator asynchronous. Technically speaking, as the simulator exposes a single step function, certain agents might be called upon more times than others.

While the action $a_t$ sent to the simulator is for the current state $s_t$, in contrast to standard Gym environments, the next state $s_{t+1}$ is not necessarily from the same flow as $s_t$. Due to the asynchronous nature of the problem, the simulator provides the state that corresponds to the next agent that receives an RTT packet.

To overcome this, we propose ‘KeySeparatedTemporalReplay’, a replay memory that stores asynchronous rollouts separated by a key (flow). We utilize this memory for training our method and Aurora (PPO), both of which require gradient calculation over entire rollouts.
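The idea of a key-separated replay memory can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: the class name, method names, and the `(state, action, reward)` transition layout are all hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, Hashable, List, Tuple

Transition = Tuple[Any, Any, float]  # (state, action, reward); layout is an assumption


class KeySeparatedReplaySketch:
    """Stores transitions per key (flow) so each rollout stays temporally ordered."""

    def __init__(self) -> None:
        self._rollouts: Dict[Hashable, List[Transition]] = defaultdict(list)

    def add(self, key: Hashable, state: Any, action: Any, reward: float) -> None:
        # Transitions arrive interleaved across flows; appending under the
        # flow's key preserves the per-flow time order.
        self._rollouts[key].append((state, action, reward))

    def rollout(self, key: Hashable) -> List[Transition]:
        # Return a copy so callers cannot mutate the stored rollout.
        return list(self._rollouts.get(key, []))

    def keys(self) -> List[Hashable]:
        return list(self._rollouts.keys())

    def clear(self) -> None:
        # On-policy methods would typically discard data after each update.
        self._rollouts.clear()


memory = KeySeparatedReplaySketch()
# Interleaved events from two flows, as an asynchronous simulator might emit them.
events = [("flow_a", 0), ("flow_b", 0), ("flow_a", 1), ("flow_a", 2), ("flow_b", 1)]
for flow, step in events:
    memory.add(flow, state=step, action=0.0, reward=-1.0)
```

Each call to `memory.rollout(flow)` then yields a contiguous per-flow trajectory, even though one flow was triggered more often than the other.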

A.1 Many-to-one tests

In the many to one tests, each configuration combines a different number of hosts and flows. For completeness we provide the exact mapping below:

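| Total flows | Hosts | Flows per host |
|---|---|---|
| 2 | 2 | 1 |
| 4 | 4 | 1 |
| 16 | 16 | 1 |
| 32 | 32 | 1 |
| 64 | 64 | 1 |
| 128 | 64 | 2 |
| 256 | 32 | 8 |
| 512 | 64 | 8 |
| 1024 | 32 | 32 |
| 2048 | 64 | 32 |
| 4096 | 64 | 64 |
| 8192 | 64 | 128 |

Table 6: Many to one experiment mapping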
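In every configuration, the total number of flows (and hence of concurrent agents) is the number of hosts multiplied by the flows per host. The short check below copies the rows of Table 6 and verifies this relation; the variable and function names are illustrative.

```python
# (total flows, hosts, flows per host), copied from Table 6.
MANY_TO_ONE = [
    (2, 2, 1), (4, 4, 1), (16, 16, 1), (32, 32, 1), (64, 64, 1),
    (128, 64, 2), (256, 32, 8), (512, 64, 8), (1024, 32, 32),
    (2048, 64, 32), (4096, 64, 64), (8192, 64, 128),
]


def is_consistent(total: int, hosts: int, flows_per_host: int) -> bool:
    """True when the total flow count equals hosts times flows per host."""
    return total == hosts * flows_per_host


all_consistent = all(is_consistent(*row) for row in MANY_TO_ONE)
print("all rows consistent:", all_consistent)
```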

A.2 Computational Details

The agents were trained on a standard i7 CPU with 6 cores and a single GTX 2080. Training for 200k steps took 2-3 hours. The major bottleneck was evaluation time. We evaluated the agents in a many-to-one setting with a very high number of flows (up to 8k); the more flows, the longer the test. With the Python version on this system, evaluating the agent over 2 simulated seconds took approximately 2 days. With an optimized C implementation, the same evaluation took 20 minutes.

Appendix B Fixed-Point Proof

Proposition 2.

The fixed-point solution for all flows sharing a congested path is a transmission rate of .

The proof relies on the assumption that all flows sharing a congested path observe the same rtt inflation. Although the network topology affects the packet path and thus the latency, this latency is minimal when compared to the queue latency of a congested switch.


The maximal reward is obtained when all agents minimize the distance . There are two stationary solutions (1) or (2) .

As flows can always reduce the transmission rate (up to 0) and . A solution where is not stable.

We analyze both scenarios below and show that a stable solution at that point is also fair.

  1. . The value is below the target. Minimizing the distance to the target means maximizing the transmission rate. A stable solution below the target is obtained when the flows are transmitting at full line rate (the rate cannot exceed 100%) and yet the rtt-inflation is low (little or no congestion). As all flows are transmitting at 100%, this solution is fair.

  2. . For any sharing a congested path, we assume that , this is a reasonable assumption in congested systems as the RTT is mainly affected by the latency in the congestion point. As all flows observe , we conclude that if then .

Appendix C Training Curves

In this section of the appendix, we present training curves for the alternative methods that failed to learn, alongside those of our algorithm.

C.1 Aurora

We begin with Aurora (jay2019deep). Aurora is similar to PCC-RL in how it extracts statistics from the network. A monitor-interval (MI) is defined as a period of time over which the network collects statistics. These statistics, such as average drop rate, RTT inflation, and more, are combined into the state provided to the agent.
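As a rough illustration of how a monitor interval might be collapsed into a state, the sketch below aggregates per-packet samples into an average drop rate and an RTT inflation. The record layout, the function name, and the definition of RTT inflation as mean RTT divided by a base RTT are assumptions made for this example; they are not taken from Aurora or PCC-RL.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PacketSample:
    """One observation inside a monitor interval (hypothetical record layout)."""
    rtt_us: float   # measured round-trip time, in microseconds
    dropped: bool   # whether the packet was lost


def summarize_interval(samples: List[PacketSample], base_rtt_us: float) -> List[float]:
    """Collapse one monitor interval into [average drop rate, RTT inflation]."""
    if not samples:
        # Assumption: an empty interval is reported as no loss and no inflation.
        return [0.0, 1.0]
    drop_rate = sum(s.dropped for s in samples) / len(samples)
    delivered = [s.rtt_us for s in samples if not s.dropped]
    # Dropped packets carry no RTT, so only delivered packets are averaged.
    mean_rtt = sum(delivered) / len(delivered) if delivered else base_rtt_us
    return [drop_rate, mean_rtt / base_rtt_us]


interval = [
    PacketSample(20.0, False),
    PacketSample(40.0, False),
    PacketSample(0.0, True),
    PacketSample(30.0, False),
]
state = summarize_interval(interval, base_rtt_us=10.0)
```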

However, Aurora focused on the task of single-agent congestion control. As its authors considered Internet congestion control (as opposed to datacenter congestion control), their main challenge was handling jitter (random noise in the network that results in packet loss even when the network is under-utilized).

An additional difference is that Aurora defines a naive reward signal, inspired by PCC-Vivace (dong2018pcc):

Figure 6: Aurora training: two hosts with a single flow per host. The flows are unable to converge to a stable equilibrium.

We observe in Fig. 6 that Aurora is incapable of converging in the simulated environment. We believe this is due to the many challenges the real world exhibits, specifically partial observability and the non-stationarity of the opposing agents.

C.2 PPO

PCC-RL introduces a deterministic on-policy policy-gradient scheme that utilizes specific properties of the reward function.

It is not immediately clear why such a scheme is important. As such, we compare to PPO trained on our raw reward function

We present two versions of PPO. (1) A continuous action space represented as a stochastic Gaussian policy (as is common in continuous control tasks such as MuJoCo (todorov2012mujoco; schulman2017proximal)). (2) A discrete action space represented as a stochastic discrete policy (softmax) where .
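The two policy parameterizations differ only in how an action is sampled from the network output. The sketch below shows both sampling schemes using the standard library; it is not the PPO implementation used in the experiments. The discrete action set `HYPOTHETICAL_ACTIONS` is a placeholder chosen for illustration, not the set used in the paper, and all function names are our own.

```python
import math
import random
from typing import List, Sequence

# Placeholder action set (illustrative rate multipliers); NOT the paper's set.
HYPOTHETICAL_ACTIONS = [0.8, 0.95, 1.0, 1.05, 1.2]


def sample_gaussian_action(mean: float, log_std: float, rng: random.Random) -> float:
    """Continuous policy: draw an action from N(mean, exp(log_std)^2)."""
    return rng.gauss(mean, math.exp(log_std))


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_discrete_action(
    logits: Sequence[float], actions: Sequence[float], rng: random.Random
) -> float:
    """Discrete policy: draw one action with probability softmax(logits)."""
    probs = softmax(logits)
    return rng.choices(list(actions), weights=probs, k=1)[0]


rng = random.Random(0)
continuous_action = sample_gaussian_action(mean=0.0, log_std=-1.0, rng=rng)
discrete_action = sample_discrete_action([0.1, 0.2, 0.5, 0.2, 0.1], HYPOTHETICAL_ACTIONS, rng)
```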

Figure 7: PPO training, with panels (a) Continuous and (b) Discrete. Both the continuous (Fig. 7(a)) and discrete (Fig. 7(b)) versions of the PPO algorithm are unable to learn, even with our target-based reward signal.

C.3 PCC-RL

Finally, we present the training curves of PCC-RL. As can be seen in Fig. 8, PCC-RL quickly converges to a region of the fixed-point stable equilibrium.

Figure 8: PCC-RL training: the agents quickly converge to a region of the fair equilibrium.

In addition, below we present the performance of a trained PCC-RL agent during a 4 to 1 evaluation test.

Figure 9: Four PCC-RL agents controlling 4 flows, each running on a different host, and marked with a different line color. Only a single flow is active at the beginning and end of the simulation. The congestion point is limited to . As seen, the flows quickly converge to a region of the fair transmission rate ().

Appendix D Quantization

A major challenge in CC is the requirement for real-time decisions. Although the network itself is relatively small and efficient, performing all computations in the int8 data type can dramatically reduce the run time.

To this end, we test the performance of a trained PCC-RL agent after quantization and optimization in C. Following the methods in wu2020integer, we quantize the network to int8. Combining int8 operations requires a short transition to int32 (to avoid overflow), followed by a de-quantization and re-quantization step.
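The int8 pipeline described above can be illustrated on a single dot product. The sketch below uses a generic symmetric per-tensor scheme, which is an assumption on our part and may differ from the exact procedure in wu2020integer; all function names are illustrative. Python integers are unbounded, so the int32 accumulator is only emulated here.

```python
from typing import List, Tuple

INT8_MAX = 127


def clamp_int8(value: int) -> int:
    """Clamp an integer to the symmetric int8 range [-127, 127]."""
    return max(-INT8_MAX, min(INT8_MAX, value))


def quantize(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: returns int8 values and their scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / INT8_MAX if max_abs > 0 else 1.0
    return [clamp_int8(round(v / scale)) for v in values], scale


def int8_dot(q_weights: List[int], q_inputs: List[int]) -> int:
    """Multiply int8 values and accumulate; in C this accumulator would be int32."""
    return sum(w * x for w, x in zip(q_weights, q_inputs))


def dequantize(accumulator: int, w_scale: float, x_scale: float) -> float:
    """Map the integer accumulator back to a real value."""
    return accumulator * w_scale * x_scale


def requantize(value: float, out_scale: float) -> int:
    """Map a real value to int8 using the scale of the next layer's input."""
    return clamp_int8(round(value / out_scale))


weights = [0.5, -1.0, 0.25]
inputs = [1.0, 2.0, -4.0]
exact = sum(w * x for w, x in zip(weights, inputs))

q_w, s_w = quantize(weights)
q_x, s_x = quantize(inputs)
acc = int8_dot(q_w, q_x)
approx = dequantize(acc, s_w, s_x)
# Hypothetical output scale: assume the next layer expects values in [-4, 4].
q_out = requantize(approx, out_scale=4.0 / INT8_MAX)
```

The wider accumulator matters because a single product of two int8 values can already reach 127 × 127 = 16129, which does not fit in int8.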

We present the results in Table 7. The quantized agent performs similarly to an agent trained on a loose target (improved switch utilization at the expense of a slightly higher latency). These exciting results show that a quantized agent is capable of obtaining similar performance to the original agent. This is a major step forward towards deployment in live datacenters.

| Flows | Switch Utilization (Original) | Switch Utilization (Quantized) | Fairness (Original) | Fairness (Quantized) | Queue Latency (Original) | Queue Latency (Quantized) |
|---|---|---|---|---|---|---|
| 2 | 96.9 | 99.91 | 0.99 | 1.00 | 5.2 | 8.29 |
| 4 | 96.9 | 99.86 | 0.98 | 1.00 | 5.4 | 8.59 |
| 16 | 95.1 | 99.61 | 0.99 | 0.99 | 6.2 | 9.46 |
| 32 | 95.6 | 99.22 | 0.86 | 0.99 | 7.0 | 9.90 |
| 64 | 93.3 | 99.06 | 0.96 | 0.98 | 7.0 | 9.72 |
| 128 | 92.5 | 98.57 | 0.94 | 0.96 | 8.0 | 10.94 |
| 256 | 91.4 | 98.02 | 0.91 | 0.90 | 9.3 | 12.85 |
| 512 | 90.4 | 97.54 | 0.86 | 0.86 | 11.3 | 16.26 |
| 1024 | 90.2 | 97.10 | 0.74 | 0.75 | 14.7 | 20.92 |
| 2048 | 90.5 | 96.74 | 0.56 | 0.61 | 20.3 | 27.67 |
| 4096 | 91.3 | 96.65 | 0.46 | 0.43 | 27.7 | 36.87 |
| 8192 | 92.8 | 96.79 | 0.28 | 0.29 | 40.0 | 48.40 |

Table 7: Quantization many to one: Comparison of the original Python PCC-RL agent with an optimized and quantized C version.

Appendix E Long Short Details

| Algorithm | 2 flows (RT / DR / LBW) | 128 flows (RT / DR / LBW) | 1024 flows (RT / DR / LBW) | 2048 flows (RT / DR / LBW) |
|---|---|---|---|---|
| PCC-RL | 6e-7 / 0 / 97 | 8e-4 / 0 / 94 | 3e-2 / 1.1 / 62 | 3e-2 / 2.7 / 56 |
| DC2QCN | 3e-2 / 0 / 63 | 5e-2 / 0 / 40 | - | - |
| HPCC | 3e-5 / 0 / 90 | 1e-2 / 0.4 / 75 | 2e-2 / 1.1 / 72 | 3e-2 / 1.8 / 62 |
| SWIFT | 1e-3 / 0 / 97 | 1e-2 / 0.3 / 85 | 1e-2 / 1.2 / 83 | 2e-2 / 2.1 / 72 |

Table 8: Long-short: The results represent the time it takes from the moment the short flows interrupt and start transmitting until they finish and the long flow recovers to full line rate transmission. DC2QCN does not recover fast enough and has thus failed the high-scale recovery tests. We present the recovery time RT (sec), drop rate DR (Gbit/s), and the bandwidth utilization of the long flow LBW (%).