Failout: Achieving Failure-Resilient Inference in Distributed Neural Networks

02/18/2020
by   Ashkan Yousefpour, et al.

When a neural network is partitioned and distributed across physical nodes, the failure of a physical node causes the failure of the neural units placed on that node, which results in a significant performance drop. Current approaches focus on the resiliency of training in distributed neural networks; the resiliency of inference is less explored. We introduce ResiliNet, a scheme for making inference in distributed neural networks resilient to physical node failures. ResiliNet combines two concepts to provide resiliency: skip connections, as used in residual neural networks, and a novel technique called failout, which is introduced in this paper. Failout simulates physical node failure conditions during training using dropout, and is specifically designed to improve the resiliency of distributed neural networks. The results of experiments and ablation studies on three datasets confirm the ability of ResiliNet to provide inference resiliency for distributed neural networks.
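The interplay of the two ideas can be illustrated with a small NumPy sketch. It is one reading of the abstract, not the authors' implementation: the chain topology, the identity skip connections, the equal layer widths, the survival probabilities, and the helper names `sample_node_failures` and `forward` are all assumptions made for illustration. During training, each physical node is dropped as a whole with some probability, so the network learns to rely on the skip path when a node is missing.

```python
import numpy as np

def sample_node_failures(survival_probs, rng):
    """Return a boolean array that is True where a physical node stays alive.

    survival_probs are hypothetical per-node survival probabilities.
    """
    survival_probs = np.asarray(survival_probs, dtype=float)
    return rng.random(survival_probs.shape) < survival_probs

def forward(x, weights, alive):
    """Forward pass through a chain of nodes with identity skip connections.

    weights[i] is the (d, d) weight matrix of the units placed on node i.
    When alive[i] is False, node i has failed: its units output zeros and
    only the skip path carries the signal onward.
    """
    h = x
    for w, ok in zip(weights, alive):
        block = np.maximum(h @ w, 0.0) if ok else np.zeros_like(h)
        h = h + block  # skip connection bypasses a failed node
    return h

rng = np.random.default_rng(0)
d = 4
weights = [rng.normal(scale=0.1, size=(d, d)) for _ in range(3)]
x = rng.normal(size=(2, d))

# Training-time step (assumed): sample node failures, then run the forward pass.
alive = sample_node_failures([0.9, 0.8, 0.95], rng)
out_train = forward(x, weights, alive)

# Extreme case: every node has failed, so the input passes through unchanged.
out_all_failed = forward(x, weights, [False, False, False])
```

In a real training loop the failure mask would be resampled for every batch, and at inference time `alive` would reflect which physical nodes are actually reachable.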

research
09/03/2019

Guardians of the Deep Fog: Failure-Resilient DNN Inference from Edge to Cloud

Partitioning and distributing deep neural networks (DNNs) over physical ...
research
05/01/2018

Internal node bagging: an explicit ensemble learning method in neural network training

We introduce a novel view to understand how dropout works as an inexplic...
research
10/22/2018

RCanopus: Making Canopus Resilient to Failures and Byzantine Faults

Distributed consensus is a key enabler for many distributed systems incl...
research
02/09/2023

FLAC: A Robust Failure-Aware Atomic Commit Protocol for Distributed Transactions

In distributed transaction processing, atomic commit protocol (ACP) is u...
research
12/02/2018

Double and Triple Node-Erasure-Correcting Codes over Graphs

In this paper we study array-based codes over graphs for correcting mult...
research
07/25/2017

On The Robustness of a Neural Network

With the development of neural networks based machine learning and their...
research
09/19/2018

Stop, Think, and Roll: Online Gain Optimization for Resilient Multi-robot Topologies

Efficient networking of many-robot systems is considered one of the gran...
