Aspis: A Robust Detection System for Distributed Learning

State-of-the-art machine learning models are routinely trained on large-scale distributed clusters. Crucially, such systems can be compromised when some of the computing devices exhibit abnormal (Byzantine) behavior and return arbitrary results to the parameter server (PS). This behavior may be attributed to a plethora of reasons, including system failures and orchestrated attacks. Existing work suggests robust aggregation and/or computational redundancy to alleviate the effect of distorted gradients. However, most of these schemes are ineffective when an adversary knows the task assignment and can judiciously choose the attacked workers to induce maximal damage. Our proposed method, Aspis, assigns gradient computations to worker nodes using a subset-based assignment that allows for multiple consistency checks on the behavior of each worker node. Examination of the calculated gradients and post-processing (clique-finding in an appropriately constructed graph) by the central node allows for efficient detection and subsequent exclusion of adversaries from the training process. We prove the Byzantine resilience and detection guarantees of Aspis under weak and strong attacks and extensively evaluate the system on various large-scale training scenarios. The main metric for our experiments is test accuracy, for which we demonstrate a significant improvement of about 30% over state-of-the-art approaches on the CIFAR-10 dataset. The corresponding reduction of the fraction of corrupted gradients is at least 16%.
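The abstract describes two mechanisms: redundant, subset-based assignment of gradient tasks to workers, and detection at the PS by checking pairwise agreement and finding a clique of mutually consistent workers. The sketch below illustrates this idea only; it is not the authors' implementation. The scalar "gradients", the dictionary keyed by (worker, task), and the use of networkx for clique enumeration are simplifying assumptions.

```python
# Minimal sketch of subset-based assignment + clique-based detection
# (assumptions: scalar gradients, exact redundancy r per task, networkx available).
from itertools import combinations
import numpy as np
import networkx as nx

def assign_tasks(num_workers: int, redundancy: int):
    """Each r-subset of workers is one task; every member computes the same gradient."""
    return list(combinations(range(num_workers), redundancy))

def detection_graph(tasks, returned, atol=1e-7):
    """Agreement graph: edge (i, j) iff workers i and j returned matching
    gradients on every task they share."""
    G = nx.Graph()
    G.add_nodes_from({w for t in tasks for w in t})
    for i, j in combinations(sorted(G.nodes), 2):
        shared = [t for t in tasks if i in t and j in t]
        if shared and all(np.allclose(returned[(i, t)], returned[(j, t)], atol=atol)
                          for t in shared):
            G.add_edge(i, j)
    return G

def honest_workers(G):
    """Honest workers agree with each other and thus form a clique; take the
    largest clique as the trusted set and exclude everyone else."""
    return max(nx.find_cliques(G), key=len)

if __name__ == "__main__":
    tasks = assign_tasks(num_workers=5, redundancy=3)
    true_grad = {t: np.float64(0.5) for t in tasks}            # stand-in for real gradients
    returned = {(w, t): true_grad[t] for t in tasks for w in t}
    for t in tasks:                                            # worker 4 acts Byzantine
        if 4 in t:
            returned[(4, t)] = true_grad[t] + 1.0
    G = detection_graph(tasks, returned)
    print(honest_workers(G))                                   # e.g. [0, 1, 2, 3]
```

In this toy run the distorting worker disagrees with every peer it shares a task with, so it is left out of the maximum clique and can be excluded from subsequent iterations, which is the detection behavior the abstract attributes to Aspis.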
