Efficient Data-Plane Memory Scheduling for In-Network Aggregation

01/17/2022
by Hao Wang, et al.

As the scale of distributed training grows, communication becomes a bottleneck. To accelerate communication, recent works introduce In-Network Aggregation (INA), which moves gradient summation into network middleboxes, e.g., programmable switches, to reduce traffic volume. However, switch memory is scarce compared to the volume of gradients transmitted in distributed training. Although prior work applies methods such as pool-based streaming or dynamic sharing to tackle this mismatch, switch memory is still a potential performance bottleneck. Furthermore, we observe that recent works under-utilize switch memory because aggregator deallocation requires synchronization. To improve switch memory utilization, we propose ESA, an Efficient Switch Memory Scheduler for In-Network Aggregation. At its core, ESA enforces a preemptive aggregator allocation primitive and introduces priority scheduling at the data plane, which improves switch memory utilization and average job completion time (JCT). Experiments show that ESA can improve the average JCT by up to 1.35×.
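
The abstract names preemptive aggregator allocation and priority scheduling but does not describe the mechanism, so the following is only a toy sketch of what such a primitive could look like, not ESA's actual data-plane logic. It assumes a fixed pool of aggregator slots indexed by a hash of (job, sequence number), a per-packet integer priority, and a fallback path (for example, a parameter server) that receives evicted partial sums and packets that lose the contest for a slot. The names `AggregatorPool`, `Slot`, `on_packet`, and the event tags `"done"`, `"evict"`, and `"forward"` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# (tag, job, seq, value, contributions): hypothetical event format
Event = Tuple[str, int, int, float, int]


@dataclass
class Slot:
    job: int
    seq: int
    priority: int
    total: float  # partial gradient sum held in the slot
    count: int    # number of worker contributions summed so far


class AggregatorPool:
    """Toy model of a fixed-size pool of switch aggregator slots."""

    def __init__(self, size: int) -> None:
        self.slots: List[Optional[Slot]] = [None] * size

    def on_packet(self, job: int, seq: int, priority: int,
                  value: float, workers: int) -> List[Event]:
        events: List[Event] = []
        idx = hash((job, seq)) % len(self.slots)
        slot = self.slots[idx]

        # The slot is held by a different aggregation task: contend on priority.
        if slot is not None and (slot.job, slot.seq) != (job, seq):
            if priority <= slot.priority:
                # Assumption: the losing packet bypasses the switch unaggregated.
                return [("forward", job, seq, value, 1)]
            # Assumption: preemption flushes the partial sum to the fallback path.
            events.append(("evict", slot.job, slot.seq, slot.total, slot.count))
            slot = None

        if slot is None:
            slot = Slot(job, seq, priority, 0.0, 0)
            self.slots[idx] = slot

        slot.total += value
        slot.count += 1
        if slot.count == workers:
            # All workers contributed: emit the sum and free the slot.
            self.slots[idx] = None
            events.append(("done", job, seq, slot.total, slot.count))
        return events
```

In this sketch a higher-priority task never waits for a lower-priority one to finish and release its slot, which is the intuition behind replacing synchronized deallocation with preemption. How ESA assigns priorities and how evicted partial results are completed are not stated in the abstract and are left open here.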


research
02/22/2019

Scaling Distributed Machine Learning with In-Network Aggregation

Training complex machine learning models in parallel is an increasingly ...
research
03/28/2019

SwitchAgg:A Further Step Towards In-Network Computation

Many distributed applications adopt a partition/aggregation pattern to a...
research
07/29/2021

P4COM: In-Network Computation with Programmable Switches

Traditionally, switches only provide forwarding services and have no cre...
research
06/29/2021

Flare: Flexible In-Network Allreduce

The allreduce operation is one of the most commonly used communication r...
research
09/14/2020

UniFuzz: Optimizing Distributed Fuzzing via Dynamic Centralized Task Scheduling

Fuzzing is one of the most efficient technology for vulnerability detect...
research
04/10/2020

Cheetah: Accelerating Database Queries with Switch Pruning

Modern database systems are growing increasingly distributed and struggl...
research
01/31/2023

On Memory Codelets: Prefetching, Recoding, Moving and Streaming Data

For decades, memory capabilities have scaled up much slower than compute...
