Distributed Optimization using Heterogeneous Compute Systems

10/03/2021
by Vineeth S, et al.

Hardware compute power has been growing at an unprecedented rate in recent years. Exploiting these advancements plays a key role in producing better results in less time, both in academia and industry. However, combining existing hardware with the latest hardware within the same ecosystem is challenging. One of the key difficulties is the variation in compute power across devices. In this paper, we consider the training of deep neural networks on a distributed system of workers with varying compute power. A naive implementation of synchronous distributed training results in the faster workers waiting for the slowest worker to finish processing. To mitigate this issue, we propose to dynamically adjust the data assigned to each worker during training. We assign each worker a partition of the total data proportional to its computing power. Our experiments show that dynamically adjusting the data partition improves the utilization of the system and significantly reduces the time taken for training. Code is available at the repository: <https://github.com/vineeths96/Heterogeneous-Systems>.
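The paper's implementation lives in the linked repository; the snippet below is only a minimal, framework-agnostic sketch of the core idea, assuming each worker's most recent per-step (or per-epoch) processing time is available. The function `proportional_partition`, its arguments, and the example timings are illustrative assumptions, not the authors' API.

```python
import numpy as np

def proportional_partition(num_samples, step_times):
    """Split `num_samples` data items across workers in proportion to their
    measured throughput (the inverse of per-step time), so that faster
    workers receive larger shards.

    step_times[i] is the most recent time worker i took to process its
    current shard (an assumed measurement, not the paper's exact signal).
    """
    throughput = 1.0 / np.asarray(step_times, dtype=np.float64)
    shares = throughput / throughput.sum()            # fraction of data per worker
    sizes = np.floor(shares * num_samples).astype(int)
    sizes[0] += num_samples - sizes.sum()             # hand the rounding remainder to one worker
    boundaries = np.cumsum(sizes)[:-1]
    starts = np.concatenate(([0], boundaries))
    ends = np.concatenate((boundaries, [num_samples]))
    # Return (start, end) index ranges describing each worker's shard.
    return list(zip(starts.tolist(), ends.tolist()))

# Example: three workers, the third roughly twice as fast as the first.
print(proportional_partition(10_000, step_times=[0.50, 0.40, 0.25]))
```

In a full training loop, a routine like this would be re-run between epochs so that partition sizes keep tracking each worker's observed speed; the paper's reported gains come from exactly this kind of adaptive re-balancing rather than a fixed equal split.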

Related research

- DropCompute: simple and more robust distributed synchronous training via compute variance reduction (06/18/2023). Background: Distributed training is essential for large scale training o...
- Auto-tuning of dynamic load balancing applied to 3D reverse time migration on multicore systems (05/16/2019). Reverse time migration (RTM) is an algorithm widely used in the oil and ...
- RelaySum for Decentralized Deep Learning on Heterogeneous Data (10/08/2021). In decentralized machine learning, workers compute model updates on thei...
- Democratizing Production-Scale Distributed Deep Learning (10/31/2018). The interest and demand for training deep neural networks have been expe...
- Dynamic backup workers for parallel machine learning (04/30/2020). The most popular framework for distributed training of machine learning ...
- Restructuring, Pruning, and Adjustment of Deep Models for Parallel Distributed Inference (08/19/2020). Using multiple nodes and parallel computing algorithms has become a prin...
- Network Adjustment: Channel Search Guided by FLOPs Utilization Ratio (04/06/2020). Automatic designing computationally efficient neural networks has receiv...
