
Critical Parameters for Scalable Distributed Learning with Large Batches and Asynchronous Updates
It has been experimentally observed that the efficiency of distributed t...
Consensus Control for Decentralized Deep Learning
Decentralized training of deep learning models enables on-device learnin...
Quasi-Global Momentum: Accelerating Decentralized Deep Learning on Heterogeneous Data
Decentralized training of deep learning models is a key element for enab...
A Linearly Convergent Algorithm for Decentralized Optimization: Sending Less Bits for Free!
Decentralized optimization methods enable on-device training of machine ...
On Communication Compression for Distributed Optimization on Heterogeneous Data
Lossy gradient compression, with either unbiased or biased compressors, ...
Mime: Mimicking Centralized Stochastic Algorithms in Federated Learning
Federated learning is a challenging optimization problem due to the hete...
Analysis of SGD with Biased Gradient Estimators
We analyze the complexity of biased stochastic gradient methods (SGD), w...
Dynamic Model Pruning with Feedback
Deep neural networks often have millions of parameters. This can hinder ...
Ensemble Distillation for Robust Model Fusion in Federated Learning
Federated Learning (FL) is a machine learning setting where many devices...
Extrapolation for Large-batch Training in Deep Learning
Deep learning networks are typically trained by Stochastic Gradient Desc...
A Unified Theory of Decentralized SGD with Changing Topology and Local Updates
Decentralized stochastic optimization methods have gained a lot of atten...
Is Local SGD Better than Minibatch SGD?
We study local SGD (also known as parallel SGD and federated averaging),...
Advances and Open Problems in Federated Learning
Federated learning (FL) is a machine learning setting where many clients...
SCAFFOLD: Stochastic Controlled Averaging for On-Device Federated Learning
Federated learning is a key scenario in modern large-scale machine learn...
The Error-Feedback Framework: Better Rates for SGD with Delayed Gradients and Compressed Communication
We analyze (stochastic) gradient descent (SGD) with delayed updates on s...
Decentralized Deep Learning with Arbitrary Communication Compression
Decentralized training of deep learning models is a key element for enab...
Unified Optimal Analysis of the (Stochastic) Gradient Method
In this note we give a simple proof for the convergence of stochastic gr...
Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication
We consider decentralized stochastic optimization with the objective fun...
Error Feedback Fixes SignSGD and other Gradient Compression Schemes
Sign-based algorithms (e.g. signSGD) have been proposed as a biased grad...
Efficient Greedy Coordinate Descent for Composite Problems
Coordinate descent with random coordinate selection is the current state...
Sparsified SGD with Memory
Huge scale machine learning problems are nowadays tackled by distributed...
Don't Use Large Mini-Batches, Use Local SGD
Mini-batch stochastic gradient methods are the current state of the art ...
Global linear convergence of Newton's method without strong-convexity or Lipschitz gradients
We show that Newton's method converges globally at a linear rate for obj...
Local SGD Converges Fast and Communicates Little
Mini-batch stochastic gradient descent (SGD) is the state of the art in ...
SVRG meets SAGA: k-SVRG - A Tale of Limited Memory
In recent years, many variance reduced algorithms for empirical risk min...
Revisiting First-Order Convex Optimization Over Linear Spaces
Two popular examples of first-order optimization methods over linear spa...
Sebastian U. Stich
