DETOX: A Redundancy-based Framework for Faster and More Robust Gradient Aggregation

07/29/2019
by Shashank Rajput, et al.

To improve the resilience of distributed training to worst-case, or Byzantine, node failures, several recent approaches have replaced gradient averaging with robust aggregation methods. Such techniques can have high computational costs, often quadratic in the number of compute nodes, and offer only limited robustness guarantees. Other methods have instead used redundancy to guarantee robustness, but can tolerate only a limited number of Byzantine failures. In this work, we present DETOX, a Byzantine-resilient distributed training framework that combines algorithmic redundancy with robust aggregation. DETOX operates in two steps: a filtering step that uses limited redundancy to significantly reduce the effect of Byzantine nodes, and a hierarchical aggregation step that can be used in tandem with any state-of-the-art robust aggregation method. We show theoretically that this leads to a substantial increase in robustness and a per-iteration runtime that can be nearly linear in the number of compute nodes. We provide extensive experiments on real distributed setups across a variety of large-scale machine learning tasks, showing that DETOX yields orders-of-magnitude improvements in accuracy and speed over many state-of-the-art Byzantine-resilient approaches.
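To make the two-step structure concrete, here is a minimal NumPy sketch, not the authors' implementation: the names detox_aggregate, robust_agg, and the partition count k are hypothetical choices for illustration. It assumes workers within a voting group were assigned identical mini-batches and compute deterministically, so honest replicas return bit-identical gradients and the majority vote reduces to an exact-match vote.

```python
import numpy as np

def detox_aggregate(worker_grads, group_size, robust_agg):
    """Sketch of DETOX-style two-step aggregation.

    worker_grads : list of p gradient vectors; workers are assumed to be
                   pre-partitioned into p // group_size voting groups, with
                   every worker in a group assigned the same mini-batch.
    group_size   : r, the replication factor (odd, so a majority exists).
    robust_agg   : any robust aggregator over a list of vectors,
                   e.g. coordinate-wise median.
    """
    p = len(worker_grads)
    assert p % group_size == 0

    # Step 1: filtering. Within each group of r replicated workers, take a
    # majority vote; with few Byzantine nodes per group, the honest gradient
    # wins. Here the vote is over exact (byte-identical) vectors.
    voted = []
    for g in range(p // group_size):
        group = worker_grads[g * group_size:(g + 1) * group_size]
        keys = [grad.tobytes() for grad in group]
        majority_key = max(set(keys), key=keys.count)
        voted.append(group[keys.index(majority_key)])

    # Step 2: hierarchical aggregation. Partition the voted gradients,
    # average within each partition, then apply the robust aggregator to
    # the partition means (a median-of-means-style construction).
    k = max(1, int(np.sqrt(len(voted))))  # illustrative partition count
    parts = np.array_split(np.stack(voted), k)
    means = [part.mean(axis=0) for part in parts]
    return robust_agg(means)

# One possible robust aggregator: coordinate-wise median.
coord_median = lambda vs: np.median(np.stack(vs), axis=0)

# Usage (hypothetical): gradients from 15 workers with replication r = 3.
# agg = detox_aggregate(grads, group_size=3, robust_agg=coord_median)
```

The median-of-means structure in step 2 is just one example of hierarchical aggregation; any off-the-shelf robust aggregator (geometric median, Multi-Krum, Bulyan) can be plugged in as robust_agg, which is what lets DETOX compose with existing methods.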

Related research

Utilizing Redundancy in Cost Functions for Resilience in Distributed Optimization and Learning (10/21/2021)
This paper considers the problem of resilient distributed optimization a...

Solon: Communication-efficient Byzantine-resilient Distributed Training via Redundant Gradients (10/04/2021)
There has been a growing need to provide Byzantine-resilience in distrib...

DRACO: Robust Distributed Training via Redundant Gradients (03/27/2018)
Distributed model training is vulnerable to worst-case system failures a...

Flag Aggregator: Scalable Distributed Training under Failures and Augmented Losses using Convex Optimization (02/12/2023)
Modern ML applications increasingly rely on complex deep learning models...

Distributed Robust Learning (09/21/2014)
We propose a framework for distributed robust statistical learning on b...

Aspis: A Robust Detection System for Distributed Learning (08/05/2021)
State of the art machine learning models are routinely trained on large ...

Efficient Detection and Filtering Systems for Distributed Training (08/17/2022)
A plethora of modern machine learning tasks requires the utilization of ...
