Flag Aggregator: Scalable Distributed Training under Failures and Augmented Losses using Convex Optimization

02/12/2023
by Hamidreza Almasi, et al.

Modern ML applications increasingly rely on complex deep learning models and large datasets, and the computation needed to train the largest models has grown exponentially. To scale computation and data, these models are therefore trained in a distributed manner across clusters of nodes, with worker updates aggregated before being applied to the model. However, a distributed setup is prone to Byzantine failures of individual nodes, components, and software. With data augmentation added to these settings, there is a critical need for robust and efficient aggregation systems. We extend current state-of-the-art aggregators and propose an optimization-based subspace estimator that models pairwise distances as quadratic functions, building on the recently introduced Flag Median problem. The estimator in our loss function favors pairs that preserve the norm of the difference vector. We show theoretically that our approach enhances the robustness of state-of-the-art Byzantine-resilient aggregators. We also evaluate our method on different tasks in a distributed setup with a parameter server architecture and show its communication efficiency while maintaining comparable accuracy. The code is publicly available at https://github.com/hamidralmasi/FlagAggregator
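At a high level, the aggregator replaces plain gradient averaging at the parameter server with a subspace estimate driven by pairwise distances between worker gradients. The following is a minimal NumPy sketch of that idea, not the authors' exact Flag Median optimization: the robust_aggregate function, its reweighting rule, and the iteration count are illustrative assumptions, meant only to show how projecting onto a subspace of mutually consistent gradients can suppress Byzantine updates.

# Hypothetical sketch of subspace-based robust aggregation (not the exact
# Flag Aggregator algorithm): estimate a low-dimensional subspace from
# distance-based worker weights, project gradients onto it, and average.
import numpy as np

def robust_aggregate(gradients, subspace_dim=1, n_iters=5):
    """gradients: list of 1-D numpy arrays (one flattened gradient per worker)."""
    G = np.stack(gradients)            # shape (n_workers, d)
    n = G.shape[0]
    w = np.ones(n)                     # per-worker weights, refined each iteration
    for _ in range(n_iters):
        # Weighted mean and weighted SVD favor workers whose gradients are
        # mutually consistent (small pairwise distances).
        mu = (w[:, None] * G).sum(0) / w.sum()
        C = (G - mu) * np.sqrt(w)[:, None]
        _, _, Vt = np.linalg.svd(C, full_matrices=False)
        U = Vt[:subspace_dim]          # top right-singular vectors span the subspace
        # Project each gradient onto the subspace (around the weighted mean).
        P = mu + (G - mu) @ U.T @ U
        # Down-weight workers far from their own projection, a crude stand-in
        # for the pairwise quadratic penalty described in the abstract.
        resid = np.linalg.norm(G - P, axis=1)
        w = 1.0 / (1e-8 + resid)
    return (w[:, None] * P).sum(0) / w.sum()

# Example: five honest workers plus one faulty worker sending a huge gradient.
grads = [np.random.randn(10) for _ in range(5)]
grads.append(100 * np.random.randn(10))
update = robust_aggregate(grads, subspace_dim=2)

In a parameter server deployment, each worker's flattened gradient would be collected at the server, passed to such an aggregator, and the returned vector applied as the model update; the real Flag Aggregator solves a convex formulation rather than this heuristic reweighting loop, so this sketch should be read only as an intuition for the approach.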

Related research

DETOX: A Redundancy-based Framework for Faster and More Robust Gradient Aggregation (07/29/2019)
Variance Reduced Median-of-Means Estimator for Byzantine-Robust Distributed Inference (03/04/2021)
Fast Machine Learning with Byzantine Workers and Servers (11/18/2019)
Communication-efficient Byzantine-robust distributed learning with statistical guarantee (02/28/2021)
DRACO: Robust Distributed Training via Redundant Gradients (03/27/2018)
Practical Homomorphic Aggregation for Byzantine ML (09/11/2023)
Aspis: A Robust Detection System for Distributed Learning (08/05/2021)
