Highly Available Data Parallel ML training on Mesh Networks

11/06/2020
by Sameer Kumar, et al.

Data-parallel ML models can take several days or weeks to train on several accelerators. Such long training runs depend on the cluster's resources remaining available for the entire duration of the job. On a mesh network this is challenging because failures create holes in the mesh, and packets must be routed around the failed chips to maintain full connectivity. In this paper, we present techniques to route gradient-summation allreduce traffic around failed chips on 2-D meshes. We evaluate the performance of our fault-tolerant allreduce techniques via the MLPerf-v0.7 ResNet-50 and BERT benchmarks. Performance results show minimal impact on training throughput on 512 and 1024 TPU-v3 chips.
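
To make the idea of routing around a hole in the mesh concrete, the following is a minimal Python sketch. It is not the paper's routing scheme: it assumes a plain 2-D mesh without wraparound links and uses an ordinary breadth-first search to find a shortest detour between two healthy chips. The function name `route_around_failures` and its parameters are hypothetical and chosen only for illustration.

```python
from collections import deque


def route_around_failures(rows, cols, failed, src, dst):
    """Return a shortest hop-by-hop path from src to dst on a rows x cols
    mesh that avoids every chip in `failed`, or None if no path exists.

    Assumptions (illustrative, not from the paper): chips are addressed as
    (row, col) tuples, links connect only the four direct neighbours, and
    there are no wraparound links.
    """
    if src in failed or dst in failed:
        return None

    parent = {src: None}  # doubles as the visited set
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk the parent pointers back to src to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        r, c = node
        for nxt in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            nr, nc = nxt
            in_mesh = 0 <= nr < rows and 0 <= nc < cols
            if in_mesh and nxt not in failed and nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None  # dst is cut off from src by the failures


if __name__ == "__main__":
    # 3 x 3 mesh whose centre chip has failed: the route must detour.
    print(route_around_failures(3, 3, {(1, 1)}, (1, 0), (1, 2)))
```

In this toy case the direct two-hop route through the centre chip is unavailable, so the search returns a four-hop detour along one edge of the mesh. A production allreduce would also have to balance load across the detour links, which this sketch does not attempt.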


research
11/22/2017

A Face Fairness Framework for 3D Meshes

In this paper, we present a face fairness framework for 3D meshes that p...
research
11/15/2022

Generation of Curved Meshes for the High-Lift Common Research Model

We answer the questions of the high-order technology focus group (HO-TFG...
research
06/27/2018

An Overview of Machine Learning Approaches in Wireless Mesh Networks

Wireless Mesh Networks (WMNs) have been extensively studied for nearly t...
research
10/06/2018

Towards Self-Tuning Parameter Servers

Recent years, many applications have been driven advances by the use of ...
research
05/10/2022

Incident duration prediction using a bi-level machine learning framework with outlier removal and intra-extra joint optimisation

Predicting the duration of traffic incidents is a challenging task due t...
research
02/25/2022

Hex-Mesh Generation and Processing: a Survey

In this article, we provide a detailed survey of techniques for hexahedr...
research
06/08/2021

Demystifying the Performance of Bluetooth Mesh: Experimental Evaluation and Optimization

Mesh connectivity is attractive for Internet-of-Things (IoT) applicatio...
