## 1 Introduction

Decentralized algorithms for finite-sum minimization problems are crucial to training machine learning models when data samples are distributed across a *network* of nodes. Each node communicates with its one-hop neighbors instead of sending information to a centralized server. The driving forces toward decentralized machine learning are twofold. One is the ever-increasing privacy concern [gaia, communication, privacy, privacy1], especially when personal data are collected by smartphones, cameras, and wearable devices. For fear of privacy leakage, a node is inclined to keep and process its raw data locally, communicating only with other trustworthy nodes. The other is the expense of communication. Traditional distributed training with a centralized parameter server requires that all nodes push gradients to and pull parameters from the server, so the server's ingress and egress links can easily throttle the traffic. By removing this server and balancing communication across the network, one can reduce the wall-clock time of model training severalfold [outperform].

Consensus-based gradient descent methods [DGD, network, consensus, ADMM] are widely used in decentralized learning because of their efficacy and simplicity. Each node computes a local gradient on its local data to update its parameters, exchanges these parameters with its neighbors, and then takes a weighted average as the starting point of the next training round. To avoid overfitting the data, regularizers such as the $\ell_1$ or trace norms are introduced as an additional penalty to the original loss function, yet the resulting loss function is *non-smooth* at kinks. Decentralized Proximal Gradient (DPG) [first-order-optimization] leverages a proximal operator to cope with the non-differentiable part. Its stochastic version, Decentralized Stochastic Proximal Gradient (DSPG) [DSPG], reduces the per-iteration computation complexity by using a stochastic gradient instead of the full gradient.
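As a concrete illustration, one round of a consensus-based proximal gradient update can be sketched as below. This is a minimal sketch rather than the exact DPG iteration from [first-order-optimization]; the $\ell_1$ penalty, the node count, and the mixing matrix `W` are illustrative assumptions.

```python
import numpy as np

def prox_l1(v, step, lam):
    """Proximal operator of lam * ||x||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - step * lam, 0.0)

def dpg_step(x, grads, W, step, lam):
    """One consensus-based proximal gradient round for all nodes.

    x:     (n, d) array, row i is node i's parameters
    grads: (n, d) array, row i is node i's local gradient at x[i]
    W:     (n, n) doubly stochastic mixing matrix
    """
    x_half = W @ x - step * grads      # average with neighbors, then descend
    return prox_l1(x_half, step, lam)  # prox handles the non-smooth penalty
```

Each node can evaluate its own row of this update using only the parameters of its one-hop neighbors, since row i of `W` is zero outside node i's neighborhood.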

Stochastic gradient methods tend to have large variance, and their performance relies on tuning a decaying learning-rate sequence. Several variance reduction algorithms have been proposed to address this problem in the past decade, e.g., SAGA [SAGA], SVRG [SVRG], SCSG [SCSG] and SARAH [SARAH]. Variance reduction is deemed a breakthrough in first-order optimization that accelerates Stochastic Gradient Descent (SGD) to a linear convergence rate under smooth and strongly convex conditions. The variance reduction technique is even more imperative in decentralized machine learning [heu-vr] for two reasons. Firstly, local averaging is inefficient at mitigating the noise of the models in decentralized topologies: given the limited number of local models available at a node, the averaged model has a much larger variance than in the centralized setting. Secondly, the intermittent network connectivity in the real world brings *temporal* variance to the stochastic gradients; the underlying network topology is time-varying, which introduces further randomness in the estimation of local gradients over time. Owing to the above considerations, variance reduction is widely used in decentralized learning, including DSA [DSA], Network-SVRG/SARAH [network-svrg] and GT-SVRG [gt-svrg].
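The core idea behind SVRG-style variance reduction can be illustrated on a toy finite sum; the quadratic component losses below are a hypothetical example, not the paper's model. The estimator stays unbiased while its variance shrinks as the iterate approaches the snapshot point.

```python
import numpy as np

# Toy finite sum: f(x) = (1/n) * sum_i 0.5 * a[i] * (x - b[i])**2
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([1.0, 2.0, 3.0, 4.0])

def grad_i(x, i):
    # Gradient of the i-th component loss
    return a[i] * (x - b[i])

def full_grad(x):
    # Full gradient, evaluated only at occasional snapshot points
    return np.mean(a * (x - b))

def svrg_grad(x, snapshot, i):
    """SVRG estimator: a single-sample gradient corrected by a snapshot term.
    Unbiased over i, with variance shrinking as x approaches the snapshot."""
    return grad_i(x, i) - grad_i(snapshot, i) + full_grad(snapshot)
```

Compared with the plain stochastic gradient `grad_i(x, i)`, the corrected estimator trades one extra component-gradient evaluation per step for a variance proportional to the distance between `x` and the snapshot.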

Our work also considers *time-varying networks*, where link connectivity changes over time. Time-varying networks are ubiquitous in daily life [time-vary]. For instance, in mobile networks a transmission pair is interrupted once one of the nodes moves out of the mutual transmission range. Besides, only a set of non-interfering wireless links can communicate simultaneously, so different sets of links have to be activated in a time-division mode. Time-varying graphs not only allow dynamic topologies, but are also not guaranteed to be connected at all times. Representative works on convex problems include DIGing [diging], PANDA [panda] and the time-varying push-pull method [push-pull].
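To make the connectivity remark concrete, consider a toy four-node example with two alternating mixing matrices; the matrices are illustrative assumptions, not taken from the paper. Neither graph is connected on its own, yet because their union is connected, alternating between them still drives every node to the network-wide average.

```python
import numpy as np

# Two doubly stochastic mixing matrices, each for a *disconnected* pairing
# of 4 nodes; the union of the two graphs is connected.
W_even = np.array([[.5, .5, 0, 0], [.5, .5, 0, 0],
                   [0, 0, .5, .5], [0, 0, .5, .5]])  # pairs {0,1}, {2,3}
W_odd  = np.array([[.5, 0, 0, .5], [0, .5, .5, 0],
                   [0, .5, .5, 0], [.5, 0, 0, .5]])  # pairs {1,2}, {3,0}

x = np.array([0.0, 4.0, 8.0, 12.0])   # initial local values, mean = 6
for t in range(4):
    x = (W_even if t % 2 == 0 else W_odd) @ x
# every node now holds the network-wide average
```

In this particular pairing, any initial vector reaches exact consensus after one even round followed by one odd round, since the second round averages values that are themselves pairwise averages spanning all four nodes.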

In this paper, we propose the Decentralized Proximal Stochastic Variance Reduced Gradient descent (DPSVRG) algorithm to address a fundamental question: *can stochastic decentralized learning over temporally changing networks with a general convex but non-smooth loss function be made faster and more robust?*
DPSVRG introduces a new variable for each node that upper bounds the local gradient variance, and updates this variable periodically. During training, this bound decreases continually, so the local gradient variance vanishes as well. In addition, DPSVRG leverages multi-consensus to further speed up the convergence rate, and it is applicable to both static and time-varying networks. To date, the only decentralized stochastic proximal algorithm [pmgt-vr] considers static graphs and strongly convex global objectives instead.
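Multi-consensus simply applies the mixing step several times per outer iteration. A minimal sketch, with an assumed ring topology and weights for illustration: for a doubly stochastic, connected `W`, running `k` gossip rounds contracts the deviation from the network average by roughly the k-th power of W's second-largest eigenvalue modulus.

```python
import numpy as np

def multi_consensus(x, W, k):
    """Run k gossip rounds at once; the deviation from the network average
    shrinks geometrically in k for a doubly stochastic, connected W."""
    return np.linalg.matrix_power(W, k) @ x

# Example: 4-node ring, self-weight 1/2 and neighbor weights 1/4
# (second-largest eigenvalue modulus is 0.5)
W = np.array([[.5, .25, 0, .25],
              [.25, .5, .25, 0],
              [0, .25, .5, .25],
              [.25, 0, .25, .5]])
```

The extra communication per iteration buys a smaller consensus error, which is what lets multi-consensus tighten the overall convergence rate.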

We rigorously prove that the convergence rate of DPSVRG is $O(1/T)$ for general convex objective functions, in contrast to $O(1/\sqrt{T})$ for DSPG without variance reduction, where $T$ refers to the number of iterations.
