FTPipeHD: A Fault-Tolerant Pipeline-Parallel Distributed Training Framework for Heterogeneous Edge Devices

10/06/2021
by   Yuhao Chen, et al.

With the increasing penetration and proliferation of Internet of Things (IoT) devices, there is a growing trend towards distributing the power of deep learning (DL) across edge devices rather than centralizing it in the cloud. This development enables better privacy preservation, real-time responses, and user-specific models. To deploy deep and complex models on edge devices with limited resources, partitioning of deep neural network (DNN) models is necessary and has been widely studied. However, most of the existing literature only considers distributing the inference model while still relying on centralized cloud infrastructure to generate this model through training. In this paper, we propose FTPipeHD, a novel DNN training framework that trains DNN models across distributed heterogeneous devices with a fault tolerance mechanism. To accelerate training under the time-varying computing power of each device, we optimize the partition points dynamically according to real-time computing capacities. We also propose a novel weight redistribution approach that periodically replicates the weights to both the neighboring nodes and the central node, which tolerates the failure of multiple devices during training while incurring limited communication cost. Our numerical results demonstrate that FTPipeHD trains 6.8x faster than the state-of-the-art method when the computing capacity of the best device is 10x that of the worst. We also show that the proposed method accelerates training even in the presence of device failures.
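The abstract names two mechanisms: choosing pipeline partition points in proportion to each device's real-time computing capacity, and periodically replicating each worker's weights to its pipeline neighbor and to a central node. The sketch below illustrates both ideas in a minimal form; the function names (`partition_layers`, `replication_targets`), the greedy proportional-split heuristic, and the ring-neighbor choice are illustrative assumptions, not the paper's actual algorithms.

```python
def partition_layers(layer_costs, capacities):
    """Greedily split per-layer compute costs into contiguous pipeline
    stages, one per device, so each stage's total cost is roughly
    proportional to that device's measured computing capacity.
    Returns the partition points (layer index where each new stage starts).
    NOTE: illustrative heuristic, not FTPipeHD's actual optimizer."""
    total_cost = sum(layer_costs)
    total_cap = sum(capacities)
    # Target compute load for each device's stage.
    targets = [total_cost * c / total_cap for c in capacities]
    points, acc, d = [], 0.0, 0
    for i, cost in enumerate(layer_costs):
        acc += cost
        # Close the current stage once its target load is reached,
        # keeping at least one layer for every remaining device.
        if acc >= targets[d] and d < len(capacities) - 1:
            points.append(i + 1)  # next stage starts after layer i
            acc, d = 0.0, d + 1
    return points


def replication_targets(node_id, num_nodes, central_id=0):
    """Nodes a worker periodically replicates its weights to: its
    successor in the pipeline ring plus the central node, so the
    failure of any single worker loses no training state."""
    targets = {(node_id + 1) % num_nodes}
    if node_id != central_id:
        targets.add(central_id)
    return sorted(targets)
```

For example, with ten equal-cost layers and two devices whose capacities differ 3:1, `partition_layers([1.0] * 10, [3, 1])` assigns the first eight layers to the faster device; re-running it as measured capacities drift is one simple way to realize dynamic repartitioning.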


