Scaling Distributed Machine Learning with In-Network Aggregation

02/22/2019
by Amedeo Sapio, et al.

Training complex machine learning models in parallel is an increasingly important workload. We accelerate distributed parallel training by designing a communication primitive that uses a programmable switch dataplane to execute a key step of the training process. Our approach, SwitchML, reduces the volume of exchanged data by aggregating the model updates from multiple workers in the network. We co-design the switch processing with the end-host protocols and ML frameworks to provide a robust, efficient solution that speeds up training by up to 300%.
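
The aggregation primitive the abstract describes can be pictured with a short sketch: each worker quantizes a chunk of its gradient to fixed-point integers, the switch sums the chunks element-wise in the dataplane, and every worker reads back the single aggregated result instead of exchanging per-worker updates. The Python sketch below simulates that step on the host, with the switch replaced by an ordinary function; the scaling factor, chunk size, and function names are illustrative assumptions, not values or code from the paper.

```python
# Minimal sketch (not the authors' code) of the aggregation step that
# SwitchML offloads to the switch dataplane. The switch is emulated by
# switch_aggregate(); SCALE and CHUNK_ELEMS are assumed values.
import numpy as np

SCALE = 1 << 16          # fixed-point scaling factor (assumption)
CHUNK_ELEMS = 64         # gradient elements aggregated per packet (assumption)

def quantize(grad_chunk: np.ndarray) -> np.ndarray:
    """Convert a float32 gradient chunk to int32 fixed point."""
    return np.round(grad_chunk * SCALE).astype(np.int32)

def dequantize(agg_chunk: np.ndarray) -> np.ndarray:
    """Convert an aggregated int32 chunk back to float32."""
    return agg_chunk.astype(np.float32) / SCALE

def switch_aggregate(chunks_from_workers: list) -> np.ndarray:
    """Element-wise integer sum, emulating the in-network aggregation."""
    return np.sum(np.stack(chunks_from_workers), axis=0, dtype=np.int64).astype(np.int32)

# Example: three workers each contribute one quantized chunk and receive
# the aggregated sum, which matches the float-domain sum up to rounding.
rng = np.random.default_rng(0)
worker_grads = [rng.standard_normal(CHUNK_ELEMS).astype(np.float32) for _ in range(3)]

aggregated = dequantize(switch_aggregate([quantize(g) for g in worker_grads]))
reference = np.sum(worker_grads, axis=0)
assert np.allclose(aggregated, reference, atol=1e-3)
```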

Related research

- P4SGD: Programmable Switch Enhanced Model-Parallel Training on Generalized Linear Models on Distributed FPGAs (05/10/2023)
  Generalized linear models (GLMs) are a widely utilized family of machine...
- Efficient Data-Plane Memory Scheduling for In-Network Aggregation (01/17/2022)
  As the scale of distributed training grows, communication becomes a bott...
- P4DB – The Case for In-Network OLTP (Extended Technical Report) (06/01/2022)
  In this paper we present a new approach for distributed DBMSs called P4D...
- IIsy: Practical In-Network Classification (05/17/2022)
  The rat race between user-generated data and data-processing systems is ...
- Dynamic backup workers for parallel machine learning (04/30/2020)
  The most popular framework for distributed training of machine learning ...
- DUNE: Improving Accuracy for Sketch-INT Network Measurement Systems (12/09/2022)
  In-band Network Telemetry (INT) and sketching algorithms are two promisi...
- Cheetah: Accelerating Database Queries with Switch Pruning (04/10/2020)
  Modern database systems are growing increasingly distributed and struggl...
