Smoothness Matrices Beat Smoothness Constants: Better Communication Compression Techniques for Distributed Optimization

by Mher Safaryan, et al.

Large-scale distributed optimization has become the default tool for training supervised machine learning models with a large number of parameters and large volumes of training data. Recent advances in the field provide several mechanisms for speeding up training, including compressed communication, variance reduction and acceleration. However, none of these methods is capable of exploiting the inherently rich data-dependent smoothness structure of the local losses beyond standard smoothness constants. In this paper, we argue that when training supervised models, smoothness matrices – information-rich generalizations of the ubiquitous smoothness constants – can and should be exploited for further dramatic gains, both in theory and in practice. To further alleviate the communication burden inherent in distributed optimization, we propose a novel communication sparsification strategy that can take full advantage of the smoothness matrices associated with the local losses. To showcase the power of this tool, we describe how our sparsification technique can be adapted to three distributed optimization algorithms – DCGD, DIANA and ADIANA – yielding significant savings in communication complexity. The new methods always outperform the baselines, often dramatically so.
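
For readers unfamiliar with the term, the contrast between a smoothness constant and a smoothness matrix can be written out as follows. This is a sketch of the standard definitions, not a formula quoted from the abstract; the symbols f, x, y, d, L and the bold matrix are notation assumed here for illustration.

```latex
% Standard smoothness with a scalar constant L > 0:
f(x) \le f(y) + \langle \nabla f(y),\, x - y \rangle + \frac{L}{2}\,\|x - y\|^2
\quad \text{for all } x, y \in \mathbb{R}^d .

% Matrix smoothness with a symmetric positive semidefinite matrix \mathbf{L}:
f(x) \le f(y) + \langle \nabla f(y),\, x - y \rangle
      + \frac{1}{2}\,(x - y)^\top \mathbf{L}\,(x - y)
\quad \text{for all } x, y \in \mathbb{R}^d .
```

Choosing the matrix as L times the identity recovers the scalar condition, and any scalar at least as large as the largest eigenvalue of the matrix is a valid smoothness constant. In that sense the matrix bound is never looser than the scalar one, and it can be much tighter along directions in which the loss curves only gently.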





Smoothness-Aware Quantization Techniques

Distributed machine learning has become an indispensable tool for traini...

A case for new neural network smoothness constraints

How sensitive should machine learning models be to input changes? We tac...

Communication-Compressed Adaptive Gradient Method for Distributed Nonconvex Optimization

Due to the explosion in the size of the training datasets, distributed l...

Distributed Fixed Point Methods with Compressed Iterates

We propose basic and natural assumptions under which iterative optimizat...

Training Faster with Compressed Gradient

Although the distributed machine learning methods show the potential for...

Escaping Saddle Points with Bias-Variance Reduced Local Perturbed SGD for Communication Efficient Nonconvex Distributed Learning

In recent centralized nonconvex distributed learning and federated learn...

Learning Representations from Temporally Smooth Data

Events in the real world are correlated across nearby points in time, an...