Decentralized Stochastic Proximal Gradient Descent with Variance Reduction over Time-varying Networks

In decentralized learning, a network of nodes cooperates to minimize an overall objective function that is usually a finite sum of the nodes' local objectives and that incorporates a non-smooth regularization term for better generalization. The decentralized stochastic proximal gradient (DSPG) method is commonly used to train this type of learning model, but its convergence rate is slowed by the variance of stochastic gradients. In this paper, we propose a novel algorithm, namely DPSVRG, to accelerate decentralized training by leveraging the variance reduction technique. The basic idea is to introduce an estimator in each node, which tracks the local full gradient periodically, to correct the stochastic gradient at each iteration. By transforming our decentralized algorithm into a centralized inexact proximal gradient algorithm with variance reduction, and controlling the bounds of the error sequences, we prove that DPSVRG converges at the rate of O(1/T) for general convex objectives plus a non-smooth term, where T is the number of iterations, while DSPG converges at the rate of O(1/√T). Our experiments on different applications, network topologies and learning models demonstrate that DPSVRG converges much faster than DSPG, and that the loss function of DPSVRG decreases smoothly over the training epochs.






1 Introduction

Decentralized algorithms for solving finite-sum minimization problems are crucial for training machine learning models when data samples are distributed across a network of nodes. Each node communicates with its one-hop neighbors instead of sending information to a centralized server. The driving forces toward decentralized machine learning are twofold. One is the ever-increasing concern about privacy [gaia, communication, privacy, privacy1], especially when personal data is collected by smartphones, cameras and wearable devices. For fear of privacy leakage, a node is inclined to keep and process its raw data locally and to communicate only with other trustworthy nodes. The other is the high cost of communication. Traditional distributed training with a centralized parameter server requires all nodes to push gradients to and pull parameters from the server, so the server's ingress and egress links can easily become a bottleneck. By removing this server and balancing the communication across the network, one can reduce the wall-clock time of model training severalfold [outperform].

Consensus-based gradient descent methods [DGD, network, consensus, ADMM] are widely used in decentralized learning problems because of their efficacy and simplicity. Each node updates its parameter with a gradient computed on its local data, exchanges this parameter with its neighbors, and then takes the weighted average as the starting point of the next training round. To avoid overfitting the data, ℓ1 or trace norms are introduced as an additional penalty on the original loss function, yet the new loss function is non-smooth at kinks. Decentralized Proximal Gradient (DPG) [first-order-optimization] leverages a proximal operator to cope with the non-differentiable part. Its stochastic version, Decentralized Stochastic Proximal Gradient (DSPG) [DSPG], reduces the per-iteration computational complexity by using a stochastic gradient rather than the full gradient.
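The round structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the exact update of [DPG] or [DSPG]: the function names, the ordering (gradient step, then neighbor averaging, then proximal step), and the choice of an ℓ1 penalty with weight lam are all hypothetical. For the ℓ1 norm, the proximal operator is the well-known coordinate-wise soft-thresholding map.

```python
def soft_threshold(x, tau):
    """Proximal operator of tau * ||x||_1, applied coordinate by coordinate."""
    out = []
    for v in x:
        if v > tau:
            out.append(v - tau)
        elif v < -tau:
            out.append(v + tau)
        else:
            out.append(0.0)  # small entries are set exactly to zero
    return out


def dpg_round(params, grads, weights, step, lam):
    """One illustrative round for all nodes (assumed ordering, see lead-in).

    params[i]     : parameter vector held by node i
    grads[i]      : (stochastic) gradient of node i's smooth loss at params[i]
    weights[i][j] : mixing weight node i assigns to node j (0 if not neighbors)
    step, lam     : step size and l1 penalty weight (hypothetical names)
    """
    n = len(params)
    dim = len(params[0])
    # 1. Local gradient step on the smooth part of the loss.
    stepped = [
        [params[i][k] - step * grads[i][k] for k in range(dim)]
        for i in range(n)
    ]
    # 2. Weighted average over neighbors (consensus step).
    mixed = [
        [sum(weights[i][j] * stepped[j][k] for j in range(n)) for k in range(dim)]
        for i in range(n)
    ]
    # 3. Proximal step handles the non-smooth l1 penalty.
    return [soft_threshold(x, step * lam) for x in mixed]
```

Passing a full local gradient in grads gives a DPG-style round, while passing a gradient computed on a sampled mini-batch gives a DSPG-style round; the mixing and proximal steps are unchanged.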

Stochastic gradient methods tend to have large variance, and their performance relies on tuning a decaying learning rate sequence. Several variance reduction algorithms have been proposed to solve this problem in the past decade, e.g., SAGA [SAGA], SVRG [SVRG], SCSG [SCSG] and SARAH [SARAH]. Variance reduction is deemed a breakthrough in first-order optimization that accelerates Stochastic Gradient Descent (SGD) to a linear convergence rate under smooth and strongly convex conditions. The variance reduction technique is even more important in decentralized machine learning [heu-vr] for two reasons. First, local averaging is ineffective at mitigating the noise of the models in decentralized topologies. Given the limited number of local models available at a node, the averaged model has a much larger variance than in the centralized setting. Second, intermittent network connectivity in the real world adds temporal variance to the stochastic gradients: the underlying network topology is time-varying, which introduces further randomness over time into the estimation of local gradients. Owing to these considerations, variance reduction is widely used in decentralized learning, including DSA [DSA], Network-SVRG/SARAH [network-svrg] and GT-SVRG [gt-svrg].
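To make the variance reduction idea concrete, the following minimal sketch shows the standard SVRG-style corrected gradient on a single machine. It is not the decentralized estimator of any of the cited methods; the function names and the scalar setting are assumptions made for illustration. The estimator replaces the raw stochastic gradient of sample i with that gradient minus its value at a stored snapshot point, plus the full gradient at the snapshot.

```python
def full_gradient(grad_fns, x):
    """Average of all per-sample gradients at x."""
    return sum(g(x) for g in grad_fns) / len(grad_fns)


def svrg_estimate(grad_fns, i, x, snapshot, snapshot_full_grad):
    """SVRG-style corrected stochastic gradient for sample i.

    grad_fns           : list of per-sample gradient functions
    snapshot           : point at which the full gradient was last computed
    snapshot_full_grad : full_gradient(grad_fns, snapshot), stored and reused
    """
    return grad_fns[i](x) - grad_fns[i](snapshot) + snapshot_full_grad


# Toy data (hypothetical): f_i(x) = 0.5 * c_i * (x - a_i)^2, gradient c_i * (x - a_i).
curvatures = [1.0, 2.0, 3.0]
centers = [1.0, 2.0, 3.0]
grad_fns = [lambda x, c=c, a=a: c * (x - a) for c, a in zip(curvatures, centers)]

snapshot = 1.5
mu = full_gradient(grad_fns, snapshot)  # recomputed only periodically
x = 2.0
estimates = [svrg_estimate(grad_fns, i, x, snapshot, mu) for i in range(3)]
print(estimates, full_gradient(grad_fns, x))
```

Averaged over a uniformly sampled index i, the corrected gradient equals the full gradient at x, so it is unbiased. When x coincides with the snapshot, every corrected gradient equals the stored full gradient, which is why the variance shrinks as the iterates and the snapshot approach each other.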

Our work also considers time-varying networks, where the link connectivity changes over time. Time-varying networks are ubiquitous in daily life [time-vary]. For instance, in mobile networks a transmission pair is interrupted if one of the nodes moves out of the mutual transmission range. Moreover, only a set of non-interfering wireless links can communicate simultaneously, so different sets of links have to be activated in a time-division mode. Time-varying graphs not only have a dynamic topology but also are not guaranteed to be connected at all times. Representative works on convex problems include DIGing [diging], PANDA [panda] and the time-varying AB/push-pull method [push-pull].
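A small sketch can illustrate why a time-varying graph need not be connected at every instant for the nodes to agree. The four-node example below is hypothetical: the two link sets, the equal pairwise weights, and the function names are assumptions chosen for illustration, not a setup taken from the cited works. Each link set alone splits the network into two isolated pairs, yet alternating between them lets information reach every node.

```python
def mix(values, weights):
    """One averaging round: node i takes a weighted average of all values."""
    n = len(values)
    return [sum(weights[i][j] * values[j] for j in range(n)) for i in range(n)]


def gossip(values, weight_schedule, rounds):
    """Apply `rounds` averaging rounds, cycling through the mixing matrices."""
    for t in range(rounds):
        values = mix(values, weight_schedule[t % len(weight_schedule)])
    return values


# Links active at even steps: 0-1 and 2-3 (graph is disconnected).
W_A = [
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.5, 0.5],
]
# Links active at odd steps: 0-3 and 1-2 (also disconnected).
W_B = [
    [0.5, 0.0, 0.0, 0.5],
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0, 0.5],
]

start = [0.0, 4.0, 8.0, 12.0]
print(gossip(start, [W_A], 10))       # static, disconnected topology
print(gossip(start, [W_A, W_B], 2))   # alternating topologies
```

With the static matrix W_A, the two pairs never exchange information, so the values stay split between the pairs. With the alternating schedule, the union of the two link sets is connected and all nodes reach the global average.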

In this paper, we propose the Decentralized Proximal Stochastic Variance Reduced Gradient descent (DPSVRG) algorithm to address a fundamental question: can stochastic decentralized learning over temporally changing networks with a general convex but non-smooth loss function be made faster and more robust? DPSVRG introduces a new variable at each node to help upper bound the local gradient variance, and updates this variable periodically. During training, this bound decreases continually, so the local gradient variance vanishes as well. In addition, DPSVRG leverages multi-consensus, which is applicable to both static and time-varying networks, to further speed up convergence. To date, the only decentralized stochastic proximal algorithm [pmgt-vr], by contrast, considers static graphs and strongly convex global objectives.

We rigorously prove that the convergence rate of DPSVRG is O(1/T) for general convex objective functions, in contrast to O(1/√T) for DSPG without variance reduction, where T refers to the number of iterations.