SwitchAgg: A Further Step Towards In-Network Computation

03/28/2019
by Fan Yang, et al.

Many distributed applications adopt a partition/aggregation pattern to achieve high performance and scalability. The aggregation process, which usually takes a large portion of the overall execution time, incurs a large amount of network traffic and bottlenecks system performance. To reduce network traffic, some studies take advantage of network devices to perform in-network aggregation. However, these approaches rely on either special topologies or middleboxes, which cannot be easily deployed in current datacenters. The emerging programmable RMT switch brings new opportunities to implement in-network computation tasks. However, we argue that the RMT switch architecture is not well suited for in-network aggregation, since it is designed primarily for implementing traditional network functions. In this paper, we first give a detailed analysis of in-network aggregation and point out the key factor that affects the data reduction ratio. We then propose SwitchAgg, an in-network aggregation system that is compatible with current datacenter infrastructures. We also evaluate the performance improvement gained from SwitchAgg. Our results show that SwitchAgg can process data aggregation tasks at line rate and achieves a high data reduction rate, which cuts down network traffic and alleviates pressure on server CPUs. In the system performance test, job completion time can be reduced by as much as 50%.
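To make the data reduction ratio mentioned above concrete, here is a toy sketch (not taken from the paper; the function name and the sum-based reducer are illustrative assumptions) of why merging records inside the network shrinks traffic in a partition/aggregation workload: if values for the same key are combined before forwarding, the receiver sees one record per distinct key instead of one record per worker emission.

```python
# Toy illustration of in-network aggregation for a partition/aggregation
# workload. A switch-resident reducer (modeled here as a plain function)
# sums partial values per key, so downstream traffic carries one record
# per distinct key rather than one per worker record.
from collections import defaultdict

def aggregate_in_network(worker_records):
    """Merge (key, value) records by summing values per key."""
    merged = defaultdict(int)
    for key, value in worker_records:
        merged[key] += value
    return dict(merged)

# Four workers each emit partial counts for overlapping keys.
records = [("a", 1), ("b", 2), ("a", 3), ("c", 4),
           ("b", 5), ("a", 6), ("c", 7), ("b", 8)]

merged = aggregate_in_network(records)
reduction_ratio = 1 - len(merged) / len(records)
print(merged)           # {'a': 10, 'b': 15, 'c': 11}
print(reduction_ratio)  # 0.625
```

The reduction ratio depends on how many records share a key within the window the switch can buffer; with its limited per-stage memory, a real RMT pipeline can merge far fewer records at once, which is the constraint the paper analyzes.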

Related research

01/17/2022 - Efficient Data-Plane Memory Scheduling for In-Network Aggregation
As the scale of distributed training grows, communication becomes a bott...

06/29/2021 - Flare: Flexible In-Network Allreduce
The allreduce operation is one of the most commonly used communication r...

04/10/2020 - Cheetah: Accelerating Database Queries with Switch Pruning
Modern database systems are growing increasingly distributed and struggl...

05/11/2022 - Libra: In-network Gradient Aggregation for Speeding up Distributed Sparse Deep Training
Distributed sparse deep learning has been widely used in many internet-s...

09/21/2020 - NetReduce: RDMA-Compatible In-Network Reduction for Distributed DNN Training Acceleration
We present NetReduce, a novel RDMA-compatible in-network reduction archi...

05/10/2023 - P4SGD: Programmable Switch Enhanced Model-Parallel Training on Generalized Linear Models on Distributed FPGAs
Generalized linear models (GLMs) are a widely utilized family of machine...

08/12/2021 - SAFE: Secure Aggregation with Failover and Encryption
We propose and experimentally evaluate a novel secure aggregation algori...
