P4SGD: Programmable Switch Enhanced Model-Parallel Training on Generalized Linear Models on Distributed FPGAs

05/10/2023
by Hongjing Huang, et al.

Generalized linear models (GLMs) are a widely used family of machine learning models in real-world applications. As data sizes grow, efficient distributed training for these models becomes essential. However, existing distributed training systems incur high communication costs and often resort to large batch sizes to balance computation and communication, which hurts convergence. We therefore argue for an efficient distributed GLM training system that strives for linear scalability while keeping the batch size reasonably low. To this end, we propose P4SGD, a distributed heterogeneous training system that efficiently trains GLMs through model parallelism between distributed FPGAs and through forward-communication-backward pipeline parallelism within each FPGA. Moreover, we propose a lightweight, latency-centric in-switch aggregation protocol, powered by a programmable switch, that minimizes the latency of the AllReduce operation between distributed FPGAs. To our knowledge, P4SGD is thus the first solution to achieve almost linear scalability between distributed accelerators through model parallelism. We implement P4SGD on eight Xilinx U280 FPGAs and a Tofino P4 switch. Our experiments show that P4SGD converges up to 6.5X faster than its state-of-the-art GPU counterpart.
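To make the two forms of parallelism concrete, here is a minimal Python sketch of model-parallel SGD on a GLM (logistic regression): each simulated worker owns one slice of the model, the forward step computes a partial inner product per slice, an AllReduce merges the partials into the full dot product, and the backward step updates each slice locally. All names and parameters here (K, D, LR, allreduce_sum) are illustrative assumptions, not P4SGD's actual FPGA kernels.

import numpy as np

K = 4       # number of (simulated) workers, each owning one model slice
D = 16      # model dimension, split evenly across workers
LR = 0.1    # learning rate
STEPS = 200

rng = np.random.default_rng(0)
A = rng.normal(size=(256, D))                    # synthetic training samples
y = (A @ rng.normal(size=D) > 0).astype(float)   # synthetic binary labels

# Model parallelism: worker k owns slice x[k*D//K : (k+1)*D//K].
slices = [slice(k * D // K, (k + 1) * D // K) for k in range(K)]
x = np.zeros(D)

def allreduce_sum(partials):
    # Stand-in for the in-switch aggregation: sum one value per worker.
    return sum(partials)

for step in range(STEPS):
    a, label = A[step % len(A)], y[step % len(A)]
    # Forward: each worker computes a partial inner product on its slice.
    partials = [a[s] @ x[s] for s in slices]
    # Communication: AllReduce merges the partials into the full dot product.
    z = allreduce_sum(partials)
    # Backward: logistic-loss gradient, applied locally to each slice.
    g = 1.0 / (1.0 + np.exp(-z)) - label
    for s in slices:
        x[s] -= LR * g * a[s]

The in-switch aggregation can be modeled in the same hedged spirit: the switch keeps one accumulator and a contributor bitmap per slot, and multicasts the sum as soon as the last worker's packet arrives, so aggregation costs one switch traversal rather than a host round trip. The class names and packet fields below are hypothetical, not the actual P4SGD protocol.

from dataclasses import dataclass

@dataclass
class AggSlot:
    acc: float = 0.0   # running sum of contributions
    seen: int = 0      # bitmap of workers that have contributed

class SwitchAggregator:
    def __init__(self, num_workers: int, num_slots: int):
        self.full = (1 << num_workers) - 1
        self.slots = [AggSlot() for _ in range(num_slots)]

    def on_packet(self, slot_id: int, worker_id: int, value: float):
        """Process one worker's contribution; return the aggregated
        value to multicast once every worker has been heard from."""
        slot = self.slots[slot_id]
        bit = 1 << worker_id
        if slot.seen & bit:
            return None             # duplicate (e.g. retransmission): ignore
        slot.acc += value
        slot.seen |= bit
        if slot.seen == self.full:  # last contributor: release and reset
            result, slot.acc, slot.seen = slot.acc, 0.0, 0
            return result
        return None

# Usage: three workers push partial sums into slot 0.
sw = SwitchAggregator(num_workers=3, num_slots=8)
assert sw.on_packet(0, 0, 1.0) is None
assert sw.on_packet(0, 1, 2.0) is None
assert sw.on_packet(0, 2, 3.0) == 6.0   # multicast 6.0 back to all workers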

