Boosting Distributed Machine Learning Training Through Loss-tolerant Transmission Protocol

05/07/2023
by Zixuan Chen, et al.

Distributed Machine Learning (DML) systems are used to speed up model training in data centers (DCs) and edge nodes. The Parameter Server (PS) communication architecture is commonly employed, but it suffers from severe long-tail latency caused by many-to-one "incast" traffic patterns, which degrades training throughput. To address this challenge, we design the Loss-tolerant Transmission Protocol (LTP), which permits partial loss of gradients during synchronization to avoid unneeded retransmissions and thereby shortens each synchronization iteration. LTP realizes loss-tolerant transmission through out-of-order transmission and out-of-order Acknowledgments (ACKs). It employs Early Close to adjust the loss-tolerant threshold according to network conditions and Bubble Filling to correct the resulting gaps so that training accuracy is preserved. LTP is implemented in C++ and integrated into PyTorch. Evaluations on a testbed of eight worker nodes and one PS node show that LTP improves the throughput of DML training tasks by up to 30x compared with traditional TCP congestion controls, with no loss of final accuracy.
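To make the mechanisms above concrete, the C++ snippet below gives a minimal receiver-side sketch, not the paper's implementation: the chunk layout, the names (LossTolerantReceiver, on_chunk, can_close, finalize), the fixed threshold, and the zero-fill policy are all assumptions made here for illustration. It shows how out-of-order gradient chunks could be accepted as they arrive, how an iteration could be closed once a loss-tolerant fraction of chunks has been received (Early Close), and how never-delivered positions could be left filled with neutral values before aggregation (Bubble Filling).

```cpp
// Illustrative sketch only (not the authors' code): a receiver-side view of
// loss-tolerant gradient synchronization. Gradient chunks may arrive out of
// order; once the fraction of received chunks reaches the loss-tolerant
// threshold, the iteration can be closed early, and positions of lost chunks
// stay zero-filled so the aggregate remains usable. Chunk layout and the
// zero-fill policy are assumptions for this sketch.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

struct GradientChunk {
    std::size_t offset;        // start index of this chunk in the flat gradient
    std::vector<float> data;   // chunk payload
};

class LossTolerantReceiver {
public:
    LossTolerantReceiver(std::size_t gradient_size, std::size_t chunk_size,
                         double loss_tolerant_threshold)
        : gradient_(gradient_size, 0.0f),
          received_((gradient_size + chunk_size - 1) / chunk_size, false),
          chunk_size_(chunk_size),
          threshold_(loss_tolerant_threshold) {}

    // Accept a chunk regardless of arrival order (out-of-order reception).
    void on_chunk(const GradientChunk& chunk) {
        std::copy(chunk.data.begin(), chunk.data.end(),
                  gradient_.begin() + static_cast<std::ptrdiff_t>(chunk.offset));
        received_[chunk.offset / chunk_size_] = true;
    }

    // "Early Close": stop waiting once enough chunks have arrived.
    bool can_close() const {
        std::size_t got = 0;
        for (bool r : received_) got += r ? 1 : 0;
        return static_cast<double>(got) / static_cast<double>(received_.size())
               >= threshold_;
    }

    // "Bubble Filling": positions of lost chunks were pre-initialized to
    // zero, so the gradient can be handed to the aggregator as-is.
    const std::vector<float>& finalize() const { return gradient_; }

private:
    std::vector<float> gradient_;   // reassembled (and zero-filled) gradient
    std::vector<bool> received_;    // per-chunk arrival bitmap
    std::size_t chunk_size_;
    double threshold_;              // fraction of chunks required to close
};

int main() {
    // Toy run: 8 chunks of 4 floats each, tolerating up to 25% chunk loss.
    LossTolerantReceiver rx(/*gradient_size=*/32, /*chunk_size=*/4,
                            /*loss_tolerant_threshold=*/0.75);

    // Six of eight chunks arrive, out of order; chunks 4 and 6 are "lost".
    const std::size_t arrivals[] = {3, 0, 5, 1, 7, 2};
    for (std::size_t idx : arrivals) {
        rx.on_chunk({idx * 4, std::vector<float>(4, 1.0f)});
    }

    std::cout << "early close possible: " << std::boolalpha
              << rx.can_close() << "\n";   // true (6/8 >= 0.75)
    std::cout << "gradient length: " << rx.finalize().size() << "\n";
    return 0;
}
```

In the actual protocol the threshold is adapted dynamically by Early Close and the lost data corrected by Bubble Filling; this sketch collapses both into a fixed fraction and zero initialization to stay short.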


