Implementing Reinforcement Learning Datacenter Congestion Control in NVIDIA NICs

07/05/2022
by Benjamin Fuhrer, et al.

Cloud datacenters are growing exponentially in both number and size. This growth causes a surge in network activity that calls for better congestion avoidance. The resulting challenge is two-fold: (i) designing algorithms that can be custom-tuned to the complex traffic patterns of a given datacenter, while (ii) running them on low-level hardware with the low latency that effective Congestion Control (CC) requires. In this work, we present a Reinforcement Learning (RL) based CC solution that learns from certain traffic scenarios and successfully generalizes to others. We then distill the RL neural network policy into binary decision trees to achieve the μsec-scale decision latency required for real-time inference with RDMA. We deploy the distilled policy on NVIDIA NICs in a real network and demonstrate state-of-the-art performance, balancing all tested metrics simultaneously: bandwidth, latency, fairness, and packet drops.
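
The distillation step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the features, the network, or the tree-fitting procedure, so everything named here is an assumption. `teacher_policy` is a hypothetical stand-in for a trained RL policy network, the two state features (`rtt_ratio`, `rate`) are invented for illustration, and the tree is fitted with plain CART-style regression on the teacher's outputs.

```python
import math
import random

MIN_LEAF = 20  # smallest number of samples allowed in a leaf (assumed value)

def teacher_policy(state):
    """Hypothetical stand-in for a trained RL policy network.
    Maps (rtt_ratio, rate) to a multiplicative sending-rate adjustment."""
    rtt_ratio, rate = state
    h1 = math.tanh(2.0 * (rtt_ratio - 1.0))  # reacts to RTT above/below target
    h2 = math.tanh(rate - 0.5)               # reacts to the current rate
    return 1.0 - 0.4 * h1 - 0.1 * h2

def sse(values):
    """Sum of squared errors around the mean of a non-empty list."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def fit_tree(X, y, depth):
    """Fit a binary regression tree. A leaf is a float; an internal
    node is a tuple (feature_index, threshold, left, right)."""
    leaf = sum(y) / len(y)
    if depth == 0 or len(y) < 2 * MIN_LEAF:
        return leaf
    best = None  # (error, feature_index, threshold)
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        step = max(1, len(values) // 16)  # try roughly 16 thresholds per feature
        for t in values[step::step]:
            left = [yi for row, yi in zip(X, y) if row[f] < t]
            right = [yi for row, yi in zip(X, y) if row[f] >= t]
            if len(left) < MIN_LEAF or len(right) < MIN_LEAF:
                continue
            err = sse(left) + sse(right)
            if best is None or err < best[0]:
                best = (err, f, t)
    if best is None:
        return leaf
    _, f, t = best
    lo = [(row, yi) for row, yi in zip(X, y) if row[f] < t]
    hi = [(row, yi) for row, yi in zip(X, y) if row[f] >= t]
    return (f, t,
            fit_tree([r for r, _ in lo], [v for _, v in lo], depth - 1),
            fit_tree([r for r, _ in hi], [v for _, v in hi], depth - 1))

def predict(tree, x):
    """Inference is a short chain of comparisons, with no matrix multiplies."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if x[f] < t else right
    return tree

def sample_state(rng):
    return (rng.uniform(0.5, 2.0), rng.uniform(0.0, 1.0))

rng = random.Random(0)
train = [sample_state(rng) for _ in range(2000)]
labels = [teacher_policy(s) for s in train]   # query the teacher for targets
tree = fit_tree(train, labels, depth=4)       # distilled student policy

held_out = [sample_state(rng) for _ in range(500)]
mae = sum(abs(predict(tree, s) - teacher_policy(s)) for s in held_out) / len(held_out)
print(f"mean absolute error of the tree vs. the teacher: {mae:.4f}")
```

The point of the sketch is the shape of the trade-off: the student tree only approximates the teacher, but each decision costs at most `depth` comparisons, which is the kind of bounded work that suits low-level hardware.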

