Failout: Achieving Failure-Resilient Inference in Distributed Neural Networks

by Ashkan Yousefpour, et al.

When a neural network is partitioned and distributed across physical nodes, the failure of a physical node causes the failure of the neural units placed on it, which results in a significant performance drop. Current approaches focus on the resiliency of training in distributed neural networks; the resiliency of inference is less explored. We introduce ResiliNet, a scheme for making inference in distributed neural networks resilient to physical node failures. ResiliNet combines two concepts to provide resiliency: skip connections, as used in residual neural networks, and a novel technique called failout, which is introduced in this paper. Failout simulates physical node failure conditions during training using dropout and is specifically designed to improve the resiliency of distributed neural networks. The results of experiments and ablation studies on three datasets confirm the ability of ResiliNet to provide inference resiliency for distributed neural networks.
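
To make the idea concrete, the following is a minimal NumPy sketch of how node-level dropout and skip connections could interact. It is not the authors' implementation: the chain-of-nodes layout, the one-residual-block-per-node assumption, the identity skip path, and all names (`forward_with_failout`, `fail_prob`, `weights`) are hypothetical choices made for illustration only.

```python
import numpy as np


def forward_with_failout(x, weights, fail_prob, rng, training=True):
    """Run x through a chain of hypothetical 'physical nodes'.

    Assumption: each node holds one residual block, h + relu(h @ W),
    with square W so the identity skip path has matching shapes.
    """
    h = x
    failed = []
    for W, p in zip(weights, fail_prob):
        # Assumption: simulated failures are sampled only during training.
        node_failed = bool(training and rng.random() < p)
        failed.append(node_failed)
        if not node_failed:
            # Healthy node: skip path plus the node's computed branch.
            h = h + np.maximum(h @ W, 0.0)
        # Failed node: all of its units are silenced at once, so only
        # the skip path carries h unchanged to the next node.
    return h, failed


rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.1, size=(4, 4)) for _ in range(3)]
x = rng.normal(size=(2, 4))

# Force the middle node to fail on this pass (probability 1.0).
out, failed = forward_with_failout(x, weights, [0.0, 1.0, 0.0], rng)
```

Unlike standard dropout, which masks individual units independently, this sketch drops every unit on a node together, which is what a physical node failure would do. At inference time a real failure would be imposed by the environment rather than sampled, so the sampling step is disabled when `training=False`.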



