

by Konstantin Mishchenko, et al.
King Abdullah University of Science and Technology

It is well known that many optimization methods, including SGD, SAGA, and Accelerated SGD for over-parameterized models, do not scale linearly in the parallel setting. In this paper, we present a new version of block coordinate descent that solves this issue for a number of methods. The core idea is to make the sampling of coordinate blocks on each parallel unit independent of the others. Surprisingly, we prove that the optimal number of blocks to be updated by each of n units in every iteration is equal to m/n, where m is the total number of blocks. As an illustration, this means that when n=100 parallel units are used, 99% of the work is a waste of time. We demonstrate that with m/n blocks used by each unit, the iteration complexity often remains the same. Among other applications, this fact can be exploited in the setting of distributed optimization to break the communication bottleneck. Our claims are justified by numerical experiments which demonstrate an almost perfect match with our theory on a number of datasets.
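To make the core idea concrete, below is a minimal sketch (not the authors' implementation) of parallel block coordinate descent with independent block sampling, applied to a simple least-squares problem. Each of n units independently samples tau = m/n coordinates (blocks of size one here), computes the partial gradient on those coordinates only, and the server averages the resulting sparse updates. The function name ibcd_least_squares, the choice of objective, the number of units, and the step size are all illustrative assumptions; the loop simulates the parallel scheme sequentially rather than running distributed workers.

```python
# Sketch of independent block sampling for parallel coordinate descent
# (hypothetical names; a sequential simulation of the parallel scheme).
import numpy as np

rng = np.random.default_rng(0)

def ibcd_least_squares(A, b, n_units=4, n_iters=500, lr=None):
    """Minimize f(x) = 0.5 * ||Ax - b||^2 with n_units parallel units,
    each independently sampling tau = m / n_units coordinates per iteration."""
    m = A.shape[1]                                # total number of coordinates (blocks)
    tau = max(1, m // n_units)                    # blocks sampled by each unit: m/n
    if lr is None:
        lr = 1.0 / np.linalg.norm(A, 2) ** 2      # conservative step size for least squares
    x = np.zeros(m)
    for _ in range(n_iters):
        update = np.zeros(m)
        for _ in range(n_units):
            # each unit samples its block of coordinates independently of the others
            S = rng.choice(m, size=tau, replace=False)
            residual = A @ x - b
            update[S] += A[:, S].T @ residual     # partial gradient restricted to S
        x -= lr * update / n_units                # server averages the units' updates
    return x

# Usage: recover a random linear model from noiseless observations.
A = rng.standard_normal((200, 40))
x_true = rng.standard_normal(40)
b = A @ x_true
x_hat = ibcd_least_squares(A, b)
print("distance to solution:", np.linalg.norm(x_hat - x_true))
```

In this interpolation-style setting the sampling noise vanishes at the solution, which is why each unit can restrict itself to only m/n blocks without hurting convergence in the limit; this is the regime the abstract refers to when it says the iteration complexity often remains the same.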

