Most RL algorithms were designed under the assumption that the world can be adequately modeled as a Markov Decision Process (MDP). Unfortunately, this is seldom the case in realistic applications. In this work, we study a real-world application of RL to data-center congestion control. This application is highly challenging because of partial observability and complex multi-objective requirements. Moreover, multiple decision-makers must make concurrent decisions that can affect one another. We show that even though the problem at hand does not fit the standard model, we can nevertheless devise an RL framework that outperforms state-of-the-art tailored heuristics. Moreover, we show experimentally that the RL framework generalizes well.
We provide a full description of the Congestion Control (CC) problem in Section 3, but from a bird’s eye view, it is enough to think of CC as a multi-agent, multi-objective, partially observed problem where each decision maker receives a goal (target). The target makes it easy to tune the behavior to fit the requirements (i.e., how latency-sensitive the system is). We devise the target so that it leads to beneficial behavior on the multiple considered metrics, without having to tune coefficients of multiple reward components. We model the task of datacenter congestion control as a reinforcement learning problem. We observe that standard algorithms (mnih2015human; lillicrap2015continuous; schulman2017proximal) are incapable of solving this task. Thus, we introduce a novel on-policy deterministic-policy-gradient scheme that takes advantage of the structure of our target-based reward function. This method enjoys both the stability of deterministic algorithms and the ability to tackle partially observable problems.
To validate our claims, we develop an RL GYM (brockman2016openai) environment based on a realistic simulator and perform extensive experiments. The simulator is based on OMNeT++ (varga2002omnet++) and emulates the behavior of ConnectX-6Dx network interface cards (state-of-the-art hardware deployed in current datacenters). To encourage additional research in this field, the code will be open-sourced after the review phase is over. Our experiments show that our method, Programmable CC-RL (PCC-RL), learns a robust policy, in the sense that it is competitive in all the evaluated scenarios, often outperforming the current state-of-the-art methods.
Our contributions are as follows. (i) We formulate the problem of datacenter congestion control as a partially-observable multi-agent multi-objective RL task. (ii) We present the challenges of this realistic formulation and propose a novel on-policy deterministic-policy-gradient method to solve it. (iii) We provide a suite for training and evaluating RL agents within a realistic simulator. Finally, (iv) we ensure our agent satisfies compute and memory constraints such that it can be easily deployed in future datacenter network devices.
2 Networking Preliminaries
We start with a short introduction covering the most relevant concepts in networking.
In this work, we focus on datacenter networks. In datacenters, traffic contains multiple concurrent data streams transmitting at high rates. The servers, also known as hosts, are interconnected through a topology of switches. A directional connection between two hosts that continuously transmits data is called a flow. We assume, for simplicity, that the path of each flow is fixed.
Each host can hold multiple flows whose transmission rates are determined by a scheduler. The scheduler iterates in a cyclic manner between the flows, also known as round-robin scheduling. Once scheduled, the flow transmits a burst of data. The burst’s size generally depends on the requested transmission rate, the time it was last scheduled, and the maximal burst size limitation.
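The round-robin scheduling described above can be sketched as follows; the dictionary fields, units, and function name are illustrative, not the simulator's actual API:

```python
def schedule_bursts(flows, now, max_burst):
    """Round-robin over active flows: each flow, in turn, transmits a burst
    whose size depends on its requested rate, the time since it was last
    scheduled, and the maximal burst size limitation."""
    bursts = []
    for flow in flows:  # the caller rotates this list cyclically
        elapsed = now - flow["last_scheduled"]
        # data accumulated since this flow's last turn, capped by the burst limit
        burst = min(flow["rate"] * elapsed, max_burst)
        flow["last_scheduled"] = now
        bursts.append((flow["id"], burst))
    return bursts
```

A flow requesting 10 units/s that was last scheduled 2 s ago would accumulate 20 units but transmit only up to the burst cap.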
A flow’s transmission is characterized by two primary values. Bandwidth: the average amount of data transmitted, measured in Gbit per second; and latency: the time it takes for a packet to reach its destination. Round-trip-time (RTT) measures the latency of source→destination→source. While the latency is often the metric of interest, most systems are only capable of measuring RTT. This is presented in Fig. 2.
3 Congestion Control
Congestion occurs when multiple flows cross paths, transmitting data through a single congestion point (switch or receiving server) at a rate faster than the congestion point can process. In this work, we assume that all connections have equal transmission rates, as typically occurs in most datacenters. Thus, a single flow can saturate an entire path by transmitting at the maximal rate.
Each congestion point in the network has an inbound buffer, see Fig. 1, enabling it to cope with short periods where the inbound rate is higher than it can process. As this buffer begins to fill, the time (latency) it takes for each packet to reach its destination increases. When the buffer is full, any additional arriving packets are dropped.
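A minimal model of such a congestion point, assuming tail-drop when the buffer is full and a queueing delay equal to the time needed to drain the buffer (all names and units below are ours, for illustration only):

```python
def step_switch(buffer_bytes, inbound_bps, process_bps, dt, capacity_bytes):
    """One time-step of a congestion point with an inbound buffer.
    Returns (new_buffer_bytes, dropped_bytes, queueing_delay_s)."""
    arrived = inbound_bps * dt / 8    # bits -> bytes
    processed = process_bps * dt / 8
    buffer_bytes = buffer_bytes + arrived - processed
    dropped = max(0.0, buffer_bytes - capacity_bytes)  # tail drop when full
    buffer_bytes = min(max(buffer_bytes, 0.0), capacity_bytes)
    delay = buffer_bytes / (process_bps / 8)  # time to drain the queue
    return buffer_bytes, dropped, delay
```

As the buffer fills, the queueing delay grows; once the capacity is exceeded, additional bytes are dropped, matching the behavior described above.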
3.1 Congestion Indicators
There are various methods to measure or estimate the congestion within the network. The ECN protocol (ramakrishnan2001rfc3168) marks packets with an increasing probability as the buffer fills up. Network telemetry is an additional, more advanced, congestion signal. As opposed to the statistical information ECN provides, a telemetry signal is a precise measurement provided directly by the switch, such as the switch’s buffer and port utilization.
However, while the ECN and telemetry signals provide useful information, they require specialized hardware. An ideal solution is one that can be easily deployed within existing networks. Such solutions are based on RTT measurements. They measure congestion by comparing the RTT to that of an empty system.
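A sketch of ECN-style probabilistic marking in the RED spirit; the thresholds and linear ramp are illustrative assumptions, since the exact marking curve is switch-dependent:

```python
def ecn_mark_probability(queue, min_th, max_th, max_p=1.0):
    """Probability of marking a packet as a function of queue occupancy:
    zero below min_th, ramping linearly to max_p at max_th and above."""
    if queue <= min_th:
        return 0.0
    if queue >= max_th:
        return max_p
    return max_p * (queue - min_th) / (max_th - min_th)
```

The source host observes the marks (via CNP packets, see Section 6) only as a statistical signal, in contrast to exact telemetry.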
CC can be seen as a multi-agent problem. Assuming there are N flows, this results in N CC algorithms (agents) operating simultaneously. Assuming all agents have an infinite amount of traffic to transmit, their goal is to optimize the following metrics:
Switch bandwidth utilization – the % from maximal transmission rate.
Packet latency – the amount of time it takes for a packet to travel from the source to its destination.
Packet-loss – the amount of data (% of maximum transmission rate) dropped due to congestion.
Fairness – a measure of similarity in the transmission rate between flows sharing a congested path. We consider $\min_i \text{rate}^i / \max_i \text{rate}^i$.
The multi-objective problem of the CC agent is to maximize the bandwidth utilization and fairness, and minimize the latency and packet-loss. Thus, it may have a Pareto-front (liu2014multiobjective) for which optimality w.r.t. one objective may result in sub-optimality of another. However, while the metrics of interest are clear, the agent does not necessarily have access to signals representing them. For instance, fairness is a metric that involves all flows, yet the agent observes signals relevant only to the flow it controls. Hence, it is impossible for a flow to obtain an estimate of the current fairness in the system. Instead, we reach fairness by setting each flow’s individual target adaptively, based on known relations between its current RTT and rate. More details on this are given in Sec. 6.
This problem exhibits additional complexities. As the agent only observes information relevant to the flow it controls, this task is partially observable. The observation might lack sufficient statistics required to determine the optimal policy. Moreover, the network’s reaction to transmission rate changes is delayed by roughly one round-trip time.
4 Reinforcement Learning Preliminaries
We model the task of congestion control as a multi-agent partially-observable multi-objective MDP, where all agents share the same policy. Each agent observes statistics relevant to itself and does not observe the entire global state (e.g., the number of active flows in the network).
We consider an infinite-horizon Partially Observable Markov Decision Process (POMDP). A POMDP is defined as the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R})$ (puterman1994markov; spaan2012partially). An agent interacting with the environment observes a state $s \in \mathcal{S}$ and performs an action $a \in \mathcal{A}$. After performing an action, the environment transitions to a new state $s'$ based on the transition kernel $\mathcal{P}(s' \mid s, a)$, and the agent receives a reward $r(s, a) \in \mathcal{R}$. In a POMDP, the observed state does not necessarily contain sufficient statistics for determining the optimal action.
We consider the average-reward metric, defined as follows. We denote by $\Pi$ the set of stationary deterministic policies on $\mathcal{A}$, i.e., if $\pi \in \Pi$ then $\pi : \mathcal{S} \to \mathcal{A}$. Let $\rho^\pi$ be the gain of a policy $\pi$, defined in state $s$ as $\rho^\pi(s) = \lim_{T \to \infty} \frac{1}{T} \mathbb{E}^\pi \left[ \sum_{t=0}^{T-1} r(s_t, a_t) \mid s_0 = s \right]$, where $\mathbb{E}^\pi$ denotes the expectation w.r.t. the distribution induced by $\pi$.
The goal is to find a policy $\pi^*$ yielding the optimal gain, i.e., $\rho^{\pi^*}(s) \geq \rho^{\pi}(s)$ for all $s \in \mathcal{S}$ and all $\pi \in \Pi$, and the optimal gain is $\rho^*(s) = \rho^{\pi^*}(s)$. A well known result (puterman1994markov)[Theorem 6.2.10] is that there always exists an optimal policy which is stationary and deterministic.
5 Related Work
Hand-tuned CC: While there has been a vast amount of work on congestion control, we focus on datacenter congestion control methods. Previous work tackled this problem from various angles. alizadeh2010data used a TCP-like protocol, which increases the transmission until congestion is sensed, and then dramatically decreases it. mittal2015timely; kumar2020swift directly used RTT to react quickly to changes. zhu2015congestion utilized the ECN protocol, a statistical signal provided by the switch, and li2019hpcc added telemetry information that requires specialized hardware yet proves to benefit greatly in terms of reaction times.
Optimization-based CC: Although most previous work has focused on hand-tuned algorithmic behavior, two notable mentions have taken an optimization-based approach. dong2018pcc presented the PCC-Vivace algorithm, which combines information from fixed time intervals such as bandwidth, latency inflation, and more. As it tackles the problem via online convex optimization, it is stateless; as such, it does not optimize for long-term behavior but rather focuses on the immediate reward (bandit setting). This work was then extended in the Aurora system (jay2019deep). Aurora provides the monitor interval, defined in PCC-Vivace, as a state for a PPO (schulman2017proximal) algorithm. Although these methods work in a naive single-agent setting, we observed their inability to converge to satisfying behavior in the realistic multi-agent setting.
Multi-objective RL: This task can also be cast as a multi-objective RL problem (mannor2004geometric) and solved by combining the various metrics into a scalar reward function (jay2019deep) (e.g., $a \cdot \text{bandwidth} - b \cdot \text{latency} - c \cdot \text{loss}$, where $a, b, c$ are positive constant coefficients). However, in practice, the exact coefficients are selected through a computationally-intensive process of hyper-parameter tuning. Previous work (leike2017ai; mania2018simple; tessler2018reward) has shown that these coefficients do not generalize: a coefficient that leads to a satisfying behavior on one domain may lead to catastrophic failure on another. We propose an alternative approach, in which we present a reward function that is domain agnostic. Our reward has a single fixed-point solution, which is optimal for any domain.
6 Reinforcement Learning for Congestion Control
The POMDP framework from Section 4 requires the definition of the four elements in $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R})$. The agent, a congestion control algorithm, runs from within the network-interface-card (NIC) and controls the rate of the flows passing through that NIC. At each decision point, the agent observes statistics correlated to the specific flow it controls. The agent then acts by determining a new transmission rate and observes the outcome of this action.
Observations. As the agent can only observe information relevant to the flow it controls, we consider: the flow’s transmission rate, RTT measurement, and number of CNP and NACK packets received. The CNP and NACK packets represent events occurring in the network. A CNP packet is transmitted to the source host once an ECN-marked packet reaches the destination. A NACK packet signals to the source host that packets have been dropped (e.g., due to congestion) and should be re-transmitted.
Actions. The optimal transmission rate depends on the number of agents simultaneously interacting in the network and on the network itself (bandwidth limitations and topology). As such, the optimal transmission rate will vary greatly across scenarios. Since it should be quickly adapted across different orders of magnitude, we define the action as a multiplication of the previous rate, i.e., $\text{rate}_{t+1} = a_t \cdot \text{rate}_t$.
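The multiplicative update can be sketched as below; the clipping bounds are hypothetical, as this excerpt does not state the physical rate limits:

```python
def apply_action(rate_gbps, action, line_rate_gbps=100.0, min_rate_gbps=0.001):
    """Multiplicative rate update: rate_{t+1} = a_t * rate_t, clipped to
    [min_rate, line_rate]. Bounds are illustrative, not from the paper."""
    new_rate = rate_gbps * action
    return min(max(new_rate, min_rate_gbps), line_rate_gbps)
```

Because the action is a multiplier, the same policy output can adjust a 0.01 Gbps flow and an 80 Gbps flow across orders of magnitude.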
Transitions. The transition depends on the dynamics of the environment and on the frequency at which the agent is polled to provide an action. Here, the agent acts once an RTT packet is received. This is similar to the definition of a monitor interval by dong2018pcc, but while they considered fixed time intervals, we consider event-triggered (RTT) intervals.
Reward. As the task is a multi-agent partially observable problem, the reward must be designed such that there exists a single fixed-point equilibrium. Thus, we let
$$r_t = -\left(\text{target} - \frac{\text{rtt}^i_t}{\text{base-rtt}^i}\cdot\sqrt{\text{rate}^i_t}\right)^2,$$
where target is a constant value shared by all flows, $\text{base-rtt}^i$ is defined as the RTT of flow $i$ in an empty system, and $\text{rtt}^i_t$ and $\text{rate}^i_t$ are respectively the RTT and transmission rate of flow $i$ at time $t$. The ratio $\text{rtt}^i_t / \text{base-rtt}^i$ is also called the rtt-inflation of agent $i$ at time $t$. The ideal reward is obtained when $\frac{\text{rtt}^i_t}{\text{base-rtt}^i}\sqrt{\text{rate}^i_t} = \text{target}$. Hence, when the target is larger, the ideal operation point is obtained when $\frac{\text{rtt}^i_t}{\text{base-rtt}^i}\sqrt{\text{rate}^i_t}$ is larger. The transmission rate has a direct correlation to the RTT, hence the two grow together. Such an operation point is less latency-sensitive (RTT grows) but enjoys better utilization (higher rate).
Based on appenzeller2004sizing, a good approximation of the RTT inflation in a bursty system, where all flows transmit at the ideal rate, behaves like $\sqrt{N}$, where $N$ is the number of flows. As the system at the optimal point is on the verge of congestion, the major latency increase is due to the packets waiting in the congestion point. As such, we can assume that all flows sharing a congested path will observe a similar rtt-inflation. As Proposition 1 shows, maximizing this reward results in a fair solution.
Proposition 1. The fixed-point solution for all flows sharing a congested path is a transmission rate of $\left(\text{target}/\sqrt{N}\right)^2$.
The proof is provided in Appendix B.
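The target-based reward, as we read it from the text, can be sketched directly; variable names are ours:

```python
import math

def reward(target, rtt, base_rtt, rate):
    """r_t = -(target - (rtt / base_rtt) * sqrt(rate))^2.
    Maximal (zero) exactly when rtt-inflation * sqrt(rate) equals the target."""
    rtt_inflation = rtt / base_rtt
    return -(target - rtt_inflation * math.sqrt(rate)) ** 2
```

Note the single maximum: any deviation of rtt-inflation·√rate from the target, in either direction, is penalized quadratically.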
Striving for simplicity, we initially attempted to solve this task using standard RL techniques such as DQN (mnih2015human), DDPG (lillicrap2015continuous) and PPO (schulman2017proximal). Due to the challenges this task exhibits, namely partial observability and multi-agent optimization, these methods did not converge. Additionally, we experimented with Aurora (jay2019deep): an RL agent for CC, designed to solve a single-agent setup. As such, it did not converge in our more complex domain. All the above attempts are documented in detail in Appendix C.
Due to the partial observability, on-policy methods are the most suitable. As the goal is to converge to a stable multi-agent equilibrium, and due to the high sensitivity of the action choice, deterministic policies are easier to manage. Thus, we devise a novel on-policy deterministic policy-gradient (DPG; silver2014deterministic) method that directly relies on the structure of the reward function, as given below. In DPG, the goal is to estimate $\nabla_\theta v^{\pi_\theta}$, the gradient of the value of the current policy with respect to the policy’s parameters $\theta$. By taking a gradient step in this direction, the policy improves and thus, under standard assumptions, converges to the optimal policy.
As opposed to off-policy methods, on-policy learning does not require a critic. We observed that, due to the challenges in this task, learning a critic is not an easy feat. Hence, we focus on estimating $\nabla_\theta v^{\pi_\theta}$ from a sampled trajectory.
Using the chain rule, we can estimate the gradient of the reward:
$$\nabla_\theta r_t = 2\left(\text{target} - \frac{\text{rtt}^i_t}{\text{base-rtt}^i}\sqrt{\text{rate}^i_t}\right)\cdot \frac{\partial \left(\frac{\text{rtt}^i_t}{\text{base-rtt}^i}\sqrt{\text{rate}^i_t}\right)}{\partial a_t}\cdot \nabla_\theta \pi_\theta(s_t)\,.$$
Notice that both $\frac{\text{rtt}^i_t}{\text{base-rtt}^i}$ and $\sqrt{\text{rate}^i_t}$ are monotonically increasing in $a_t$. The action is a scalar determining by how much to change the transmission rate; a faster transmission rate also leads to higher RTT inflation. Thus, the signs of their partial derivatives w.r.t. $a_t$ are identical, and $\partial \left(\frac{\text{rtt}^i_t}{\text{base-rtt}^i}\sqrt{\text{rate}^i_t}\right) / \partial a_t$ is always non-negative.
However, computing the exact value of this derivative is impossible given the complex dynamics of a datacenter network. Instead, as its sign is always non-negative, we approximate it with a positive constant, which can be absorbed into the learning rate.
In layman’s terms: if $\frac{\text{rtt}^i_t}{\text{base-rtt}^i}\sqrt{\text{rate}^i_t}$ is above the target, the gradient pushes the action towards decreasing the transmission rate, and vice versa. As all flows observe approximately the same rtt-inflation, the objective drives them towards the fixed-point solution. As shown in Proposition 1, this occurs when all flows transmit at the same rate of $\left(\text{target}/\sqrt{N}\right)^2$ and the system is slightly congested, as proposed by kumar2020swift.
Finally, the true estimation of the gradient is obtained for $T \to \infty$. We approximate this gradient by averaging over a finite, sufficiently long, trajectory of length $T$. In practice, $T$ is determined empirically.
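A toy sketch of one update step under this sign-based approximation, using a hypothetical linear policy $a = \theta \cdot s$ in place of the paper's network (all names are ours):

```python
import math

def dpg_step(theta, rollout, target, lr=0.01):
    """One on-policy deterministic policy-gradient step. The unknown
    derivative d(rtt-inflation * sqrt(rate)) / da is replaced by a positive
    constant absorbed into lr, as described in the text.
    rollout: list of (state_vector, rtt_inflation, rate) tuples."""
    grad = [0.0] * len(theta)
    for s, infl, rate in rollout:
        # d r / d x = 2 (target - x), with x = rtt-inflation * sqrt(rate)
        err = target - infl * math.sqrt(rate)
        # d a / d theta_i = s_i for the toy linear policy
        for i, s_i in enumerate(s):
            grad[i] += 2.0 * err * s_i / len(rollout)
    # gradient ascent on the average reward over the rollout
    return [t + lr * g for t, g in zip(theta, grad)]
```

When rtt-inflation·√rate exceeds the target, the update decreases the policy output (and hence the rate), and vice versa, matching the intuition above.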
7 The Challenges of Real-World Deployment
PCC-RL currently undergoes a productization procedure in a large tech company (15K employees) and is to be deployed in its live datacenters. Thus, beyond training the RL agent, our goal is to provide a method that can run in real-time on-device (on a NIC). This requirement presents two limitations. (1) The problem is asynchronous. While the CC algorithm is determining the transmission rate limitation, the flows continue to transmit data. As such, decision making must be efficient and fast, such that inference can be performed within microseconds. (2) Each NIC can control thousands of flows. As we require real-time reaction, the algorithm must utilize fast, yet small, memory. Hence, the amount of memory stored per flow must be minimal.
Thus, for the policy we choose a neural network composed of two fully connected layers, followed by an LSTM and a final fully connected layer. As information takes time to propagate through the network, the next state is also a function of previous actions; we observed that the LSTM was crucial. Such an architecture, combined with ReLU activation functions, enables fast inference using common deep learning accelerators (DLAs).
While a small number of weights and a low memory footprint enables faster inference, to meet the strict requirements of running in real-time on-device, we also quantize a trained agent and analyze its performance. This is discussed in detail in Section 8.3.
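A back-of-the-envelope parameter and memory count for such an FC-FC-LSTM-FC stack; the layer sizes used in the test are hypothetical, as this excerpt does not state them:

```python
def param_count_fc(n_in, n_out):
    # weights plus bias
    return n_in * n_out + n_out

def param_count_lstm(input_size, hidden_size):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)

def model_bytes(sizes, lstm_hidden, n_actions, bytes_per_weight):
    """Memory footprint of FC -> FC -> LSTM -> FC.
    sizes = (input, fc1_out, fc2_out); bytes_per_weight: 4 for float32, 1 for int8."""
    total = (param_count_fc(sizes[0], sizes[1])
             + param_count_fc(sizes[1], sizes[2])
             + param_count_lstm(sizes[2], lstm_hidden)
             + param_count_fc(lstm_hidden, n_actions))
    return total * bytes_per_weight
```

Whatever the exact layer sizes, moving from float32 to int8 weights cuts the footprint by exactly 4x, consistent with the 9 KB to 2.25 KB reduction reported in Section 8.3.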
8 Experiments

Table 1: Many-to-one performance comparison (128 to 1, 1024 to 1, 4096 to 1, 8192 to 1).
To show our method generalizes to unseen scenarios, motivating use in the real world, we split the scenarios into train and test sets. We train the agents only in the many-to-one domain (see below), on small-scale configurations. Evaluation is performed on many-to-one, all-to-all, and long-short scenarios. We provide an extensive overview of the training process, including the technical challenges of the asynchronous CC task, in Appendix A.
We compare against 3 baseline algorithms: DC2QCN, which utilizes the ECN packet-marking protocol; HPCC, which relies on network telemetry information; and SWIFT, which can be deployed in any existing datacenter, as it relies only on RTT measurements. As these methods lack an official implementation, we use unofficial implementations. Although unofficial, these implementations are currently deployed and used daily in real datacenters (a reference shall appear in the final version and is omitted here to guarantee anonymity).
Many-to-one: Denoted by $N \to 1$, this scenario emulates $N$ senders transmitting data through a single switch to a single receiver, as depicted in Fig. 1. We evaluate the agents on $N \to 1$ for $N \in \{128, 1024, 4096, 8192\}$. The exact configuration (number of flows per host) is presented in Appendix A.
All-to-all: This scenario emulates multiple servers transmitting data to all other servers. In this case, given $N$ servers, there are also $N$ congestion points: all data sent towards server $i$ routes through switch port $i$. This synchronized traffic causes a high system load. While the ‘many-to-one’ is a relatively clean setting, ‘all-to-all’ tests the ability of the various algorithms to cope in a complex and dynamic system.
Long-short: In addition to testing the ability of the algorithms to converge to a stable-point solution, the long-short scenario evaluates the ability of each agent to dynamically react to changes. A single flow (the ‘long’ flow) transmits an infinite amount of data, while several short flows randomly interrupt it with short data transmission. The goal is to test how fast the long flow reacts and reduces its transmission rate to enable the short flows to transmit their data. Once the short flows are finished, the long flow should recover quickly to full line rate. We follow the process from interruption until full recovery. We present an example of ideal behaviors in Fig. 5.
8.1 Baseline comparison
The results for the many-to-one tests are presented graphically in Fig. 4 and numerically in Table 1. The simulation settings (number of hosts and flows per host) are presented in Appendix A. We observe that while most methods are competitive at a small scale (a small number of flows with few interactions), at large-scale deployments, all methods aside from PCC-RL encounter extensive periods of packet loss. PCC-RL is the only method capable of maintaining high switch utilization, while keeping the latency low and fairness at a reasonable level.
Table 2: All-to-all performance comparison (4 hosts, 8 hosts).
In addition, we evaluate the various methods in an all-to-all setting. Here, there are $N$ hosts, each running $N-1$ flows (a total of $N(N-1)$ flows). At each host $i$, flow $j$ constantly sends data towards host $j$ through switch port $j$. The results are presented in Table 2. Although SWIFT performed well in the low-scale many-to-one tests, when transitioning to the all-to-all setting it is incapable of retaining high switch utilization. Similarly to the many-to-one setting, PCC-RL performs competitively and outperforms the baselines at the larger scale.
Finally, we compared the methods on a long-short setting where the algorithms are tested on their ability to quickly react to changes, presented in Fig. 3 (numerical results in Appendix E). Although PCC-RL did not encounter this scenario during training, it is able to perform competitively. In this scenario, the fastest to react was HPCC, which also maintained minimal buffer utilization. We highlight, though, that HPCC was specifically designed to handle such long-short scenarios (li2019hpcc). Nonetheless, as opposed to HPCC, PCC-RL achieves 100% utilization before and after the interruption and recovers faster than both SWIFT and DC2QCN.
Summary: We observe that PCC-RL is capable of learning a robust policy. Although in certain tasks the baselines occasionally outperformed it by a small margin, PCC-RL always obtained competitive performance. In addition, in large-scale scenarios, where thousands of flows interact in parallel, PCC-RL is the only method capable of avoiding packet loss and thus controlling network congestion. As PCC-RL was trained only on a small-scale scenario, this highlights the ability of our method to generalize and learn a robust behavior that we believe will perform well in a real datacenter.
8.2 Selecting the operation point
As we have seen in Tables 1 and 2 and Fig. 3, the various algorithms exhibit different behaviors. For instance, as seen in Table 1 under the 1024 to 1 evaluation, both DC2QCN and SWIFT obtain good results. It is not clear whether one is absolutely better than the other; the decision depends on the use case. Certain datacenters are very sensitive to latency and would thus prefer DC2QCN. Others might prefer behavior such as that of SWIFT, which provides higher fairness and bandwidth.
PCC-RL is controlled by a single parameter, the target. By changing the target, the CC behavior in the datacenter is adapted. When the required behavior is low latency, the target value should be set close to 0; in a less latency-sensitive system, improved fairness and utilization can be achieved by setting higher values.
Table: PCC-RL operation points in the 8192 to 1 scenario (per target value).
Table: PCC-RL operation points in the all-to-all scenario (4 hosts, 8 hosts; per target value).
The results in Tables 3, 4 and 5 present an interesting picture of how PCC-RL behaves when the operation point is changed. We compare three operation points: ‘strict’, ‘standard’, and ‘loose’, corresponding to targets 1, 2, and 20, respectively.
As expected, the strict target results in lower latency while the opposite occurs for the loose. As the agent attempts to maintain a stricter latency profile, it is required to keep the switch’s buffer utilization as low as possible. This results in a dramatic decrease in switch utilization; see Table 5. On the other hand, a looser profile suffers from higher latency but is capable of reacting faster to the appearance of new flows, which can be explained by the new flows joining faster due to the larger “allowance”, and attains better fairness across flows (Table 3).
8.3 Quantization

As stated in Section 7, PCC-RL is to be deployed in live datacenters. Although our architectural choices ensure a low computational and memory footprint, adhering to a microsecond-scale inference time requires further reductions.
We follow the process in wu2020integer and quantize the trained PCC-RL agent. All neural network weights and arithmetic operations are reduced from float32 down to int8. This results in a 4x memory reduction, from 9 KB down to 2.25 KB.
Empirical evidence shows that although the quantized agent requires vastly less computational capacity, it performs on par with the original agent. Results from wu2020integer suggest a speedup of up to 16x when deployed on specialized hardware. These results increase our confidence in the applicability of the method in live datacenters. The numerical results are presented in Appendix D.
9 Conclusion

In this work, we demonstrated the efficacy and generalization capabilities of RL, in contrast to the hand-crafted algorithms that currently dominate the field of CC. Our experiments utilized the realistic OMNeT++ network simulator that is commonly used to benchmark CC algorithms for deployment in real datacenters. While the various baselines exhibited outstanding performance in certain scenarios, there are others in which they catastrophically failed. Contrarily, PCC-RL learned a robust policy that performed well in all scenarios and often obtained the best results. In addition, we show that PCC-RL generalizes to unseen domains and is the only algorithm capable of operating at a large-scale without incurring packet loss.
PCC-RL was developed with the aim of easy deployment in real datacenters, and potentially even on-device training. We took into consideration memory and compute limitations. By limiting ourselves to relatively small and simple networks, combined with int8 quantization, we ensured that the method can run in real-time on a NIC. The next steps involve full productization of the algorithm and adaptive deployment in datacenters – such that enables customization to customers’ needs via the tunable target parameter and additional possible reward components.
Appendix A Simulator
The simulator attempts to model the network behavior as realistically as possible. The task of CC is a multi-agent problem: there are multiple flows running on each host (server), and each flow is unaware of the others. As such, each flow is a single agent, and 4096 flows imply 4096 concurrent agents.
Each agent is called by the CC algorithm to provide an action. The action, whether continuous or discrete, is mapped to a requested transmission rate. When the flow is rescheduled, it will attempt to transmit at the selected rate. Calling the agent (the triggering event) occurs each time an RTT packet arrives.
Agents are triggered by spontaneous events rather than at fixed time intervals; this makes the simulator asynchronous. Technically speaking, as the simulator exposes a single step function, certain agents might be called upon more times than others.
While the action $a_t$ sent to the simulator corresponds to the current state $s_t$, in contrast to standard GYM environments, state $s_{t+1}$ is not necessarily from the same flow as $s_t$. Due to the asynchronous nature of the problem, the simulator provides the state corresponding to the next agent that receives an RTT packet.
To overcome this, we propose the ‘KeySeparatedTemporalReplay’, a replay memory that enables storing asynchronous rollouts separated by a key (flow). We utilize this memory for training both our method and Aurora (PPO), which require gradient calculation over entire rollouts.
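A minimal sketch of such a key-separated replay; the class name follows the text, but the implementation details are ours:

```python
from collections import defaultdict

class KeySeparatedTemporalReplay:
    """Stores transitions per key (flow id), so each flow's rollout stays
    temporally ordered even though the simulator interleaves flows
    asynchronously."""

    def __init__(self):
        self._rollouts = defaultdict(list)

    def push(self, key, transition):
        # append in arrival order; order is preserved per key
        self._rollouts[key].append(transition)

    def rollout(self, key):
        # the full temporally-ordered rollout for one flow
        return list(self._rollouts[key])

    def keys(self):
        return list(self._rollouts.keys())
```

Training then iterates over `keys()` and computes gradients over each per-flow rollout, sidestepping the interleaving described above.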
A.1 Many-to-one tests
In the many-to-one tests, each configuration combines a different number of hosts and flows. For completeness, we provide the exact mapping below:
Table: Total flows, hosts, and flows per host.
A.2 Computational Details
The agents were trained on a standard i7 CPU with 6 cores and a single RTX 2080 GPU. Training (200k steps) took 2-3 hours. The major bottleneck was evaluation time: we evaluated the agents in a many-to-one setting with a very high number of flows (up to 8k), and the more flows, the longer the test. In the Python version, evaluating the agent over 2 simulated seconds took approximately 2 days on this system; an optimized C implementation performed the same evaluation in 20 minutes.
Appendix B Fixed-Point Proof
Proposition 1. The fixed-point solution for all flows sharing a congested path is a transmission rate of $\left(\text{target}/\sqrt{N}\right)^2$.
The proof relies on the assumption that all flows sharing a congested path observe the same rtt inflation. Although the network topology affects the packet path and thus the latency, this latency is minimal when compared to the queue latency of a congested switch.
The maximal reward is obtained when all agents minimize the distance $\left|\text{target} - \frac{\text{rtt}^i_t}{\text{base-rtt}^i}\sqrt{\text{rate}^i_t}\right|$. There are two stationary solutions: (1) $\frac{\text{rtt}^i_t}{\text{base-rtt}^i}\sqrt{\text{rate}^i_t} < \text{target}$, or (2) $\frac{\text{rtt}^i_t}{\text{base-rtt}^i}\sqrt{\text{rate}^i_t} = \text{target}$.
As flows can always reduce the transmission rate (down to 0), and reducing the rate also reduces the rtt-inflation, a solution where $\frac{\text{rtt}^i_t}{\text{base-rtt}^i}\sqrt{\text{rate}^i_t} > \text{target}$ is not stable.
We analyze both scenarios below and show that a stable solution at that point is also fair.
(1) The value is below the target. Minimizing the distance to the target means maximizing the transmission rate. A stable solution below the target is obtained when the flows are transmitting at full line rate (the rate cannot exceed 100%) and yet the rtt-inflation is low (small or no congestion). As all flows are transmitting at 100%, this solution is fair.
(2) The value equals the target. For any flows $i, j$ sharing a congested path, we assume that $\frac{\text{rtt}^i_t}{\text{base-rtt}^i} \approx \frac{\text{rtt}^j_t}{\text{base-rtt}^j}$; this is a reasonable assumption in congested systems, as the RTT is mainly affected by the latency at the congestion point. As all flows observe the same rtt-inflation $g_t$, we conclude that if $g_t\sqrt{\text{rate}^i_t} = g_t\sqrt{\text{rate}^j_t} = \text{target}$, then $\text{rate}^i_t = \text{rate}^j_t$.
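Under the assumption that the rtt-inflation at the operating point behaves like √N (Section 6), the fixed point can be checked numerically; this is a sketch consistent with our reading of Proposition 1, not the paper's code:

```python
import math

def fixed_point(n_flows, target):
    """Returns (rate, residual): the per-flow rate (target / sqrt(N))^2
    and the resulting distance of rtt-inflation * sqrt(rate) from the target."""
    inflation = math.sqrt(n_flows)          # assumed rtt-inflation at the operating point
    rate = (target / inflation) ** 2        # candidate fixed-point rate
    residual = target - inflation * math.sqrt(rate)
    return rate, residual
```

The residual is zero for every flow, and since the expression is identical for all flows sharing the path, they all transmit at the same rate, i.e., the fixed point is fair.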
Appendix C Training Curves
In this section of the appendix, we expand on additional methods that did not perform as well as our algorithm.
We begin with Aurora (jay2019deep). Aurora is similar to PCC-RL in how it extracts statistics from the network. A monitor-interval (MI) is defined as a period of time over which the network collects statistics. These statistics, such as average drop rate, RTT inflation, and more, are combined into the state provided to the agent.
However, Aurora focused on the task of single-agent congestion control. As they considered internet congestion control (as opposed to datacenter congestion control), their main challenge was handling jitters (random noise in the network resulting in packet loss even at network under-utilization).
An additional difference is that Aurora defines a naive reward signal, inspired by PCC-Vivace (dong2018pcc): $r = 10 \cdot \text{throughput} - 1000 \cdot \text{latency} - 2000 \cdot \text{loss}$.
We observe in Fig. 6 that Aurora is incapable of converging in the simulated environment. We believe this is due to the many challenges the real world exhibits, specifically partial observability and the non-stationarity of the opposing agents.
PCC-RL introduces a deterministic on-policy policy-gradient scheme that utilizes specific properties of the reward function.
It is not immediately clear why such a scheme is important. As such, we compare against PPO trained on our raw reward function $r_t = -\left(\text{target} - \frac{\text{rtt}^i_t}{\text{base-rtt}^i}\sqrt{\text{rate}^i_t}\right)^2$.
We present two versions of PPO. (1) A continuous action space represented as a stochastic Gaussian policy (as is common in continuous control tasks such as MuJoCo (todorov2012mujoco; schulman2017proximal)). (2) A discrete action space represented as a stochastic discrete policy (softmax) over a fixed set of rate multipliers.
Finally, we present the training curves of PCC-RL. As can be seen, Fig. 8, PCC-RL quickly converges to a region of the fixed-point stable equilibrium.
In addition, below we present the performance of a trained PCC-RL agent during a 4 to 1 evaluation test.
Appendix D Quantization
A major challenge in CC is the requirement for real-time decisions. Although the network itself is relatively small and efficient, performing all computations in the int8 data type dramatically optimizes the run-time.
To this end, we test the performance of a trained PCC-RL agent after quantization and optimization in C. Following the methods in wu2020integer, we quantize the network to int8. Combining int8 operations requires a short transition to int32 (to avoid overflow), followed by a de-quantization and re-quantization step.
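A sketch of this int8-with-int32-accumulation scheme; the symmetric quantization and the scales used in the example are illustrative assumptions, not the deployed kernel:

```python
import numpy as np

def quantize(x, scale):
    """Symmetric int8 quantization: round to the nearest step and clip
    to the int8 range."""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def qmatmul(a_q, w_q, a_scale, w_scale):
    """int8 x int8 matmul accumulated in int32 (to avoid overflow),
    then de-quantized back to float via the product of scales."""
    acc = a_q.astype(np.int32) @ w_q.astype(np.int32)  # int32 accumulation
    return acc.astype(np.float32) * (a_scale * w_scale)
```

A following layer would re-quantize the float output with its own scale, giving the de-quantization/re-quantization round-trip the text describes.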
We present the results in Table 7. The quantized agent performs similarly to an agent trained on a loose target (improved switch utilization at the expense of a slightly higher latency). These exciting results show that a quantized agent is capable of obtaining similar performance to the original agent. This is a major step forward towards deployment in live datacenters.
Table 7: Quantized-agent performance (flows, switch utilization, fairness, queue latency).
Appendix E Long Short Details
Table: Long-short recovery results (2, 128, 1024, 2048 flows).
The results represent the time from the moment the short flows interrupt and start transmitting until they finish and the long flow recovers to full line-rate transmission. DC2QCN does not recover fast enough and thus fails the high-scale recovery tests. We present the recovery time RT (sec), drop rate DR (Gbit/s), and the bandwidth utilization of the long flow LBW (%).