 # Variance Reduction for Distributed Stochastic Gradient Descent

Variance reduction (VR) methods boost the performance of stochastic gradient descent (SGD) by enabling the use of larger, constant stepsizes and preserving linear convergence rates. However, current variance reduced SGD methods require either high memory usage or an exact gradient computation (using the entire dataset) at the end of each epoch. This limits the use of VR methods in practical distributed settings. In this paper, we propose a variance reduction method, called VR-lite, that does not require full gradient computations or extra storage. We explore distributed synchronous and asynchronous variants that are scalable and remain stable with low communication frequency. We empirically compare both the sequential and distributed algorithms to state-of-the-art stochastic optimization methods, and find that our proposed algorithms perform favorably to other stochastic methods.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Many problems in machine learning and statistics involve minimizations of the form

 minxf(x),f(x)=1nn∑i=1fi(x), (1)

where each

Problems of this form include logistic regression, support vector machines, matrix factorization, and others

. Such problems frequently arise when fitting a model to data, where each measures how well a model fits a particular data point. For example, if is a collection of feature vectors and is a collection of model outputs, a simple least-squares regression is obtained by setting

When large datasets are involved, problems of the form (1) are traditionally solved using stochastic gradient descent (SGD) updates of the form

 xk+1=xk−ηkgk,

where

is a “noisy” gradient estimate using a randomly chosen index

, and is a stepsize parameter . This crude approximate gradient is inexpensive and highly effective when the error is large, resulting in fast convergence to low accuracy solutions. Unfortunately, as the iterates approach the true solution, the gradient descent stepsize must vanish () to dampen the effects of noise, resulting in extremely slow convergence in the high accuracy regime.

Variance reduction (VR) schemes [9, 6, 14, 16, 7, 19, 18, 8] are able to maintain a large constant stepsize and achieve fast convergence to high accuracy. While the difference between the stochastic and exact gradient is potentially large, this difference is predictable – the gradient errors are highly correlated between different uses of the same function

VR methods exploit this property by approximating the gradient error at the most recent use of and subtracting this error from to obtain a more accurate solution. This results in approximate gradients of the form

 gk=∇fik(x)approximate % gradient−∇fik(y)+˜gyerror % correction term, (2)

where is an old iterate, and is an approximation of the true gradient The algorithm SVRG , for example, sets to be an iterate from the algorithm history and sets to be the exact gradient at . The more recent SAGA scheme  stores the most recent value of (for all ) and chooses to be the average of these values. This error correction term in (2) allows VR methods to use large (non-vanishing) stepsizes, preserve linear convergence rates (under strong convexity assumptions), and significantly outperform classical SGD.

Current VR methods suffer from a few caveats that limit their use in practical settings. First, some methods (e.g., SAGA) require storing a complete history of previous iterates, thus increasing memory requirements (although this requirement can be mitigated for some simple minimization problems ). Other methods (e.g., SVRG) require computation of the exact gradient over the entire dataset after each epoch. This cost is often significant in real-world scenarios where SGD methods typically converge after relatively few epochs, and also prevent its use in asynchronous distributed environments.

Another caveat of VR algorithms is that they have not been adequately studied in the distributed setting. Current methods are impractical in large scale settings as explained later in the paper, and only a couple of recent papers have tried to address this [4, 21]. For large-scale problems, asynchronous variants of distributed SGD are still widely used [5, 13, 1, 10, 17, 22, 23, 20], and most of the work has been focused on the parameter server model of computation [5, 13, 22, 1, 10], where updates are immediately communicated to the central server. Relatively little work has been done in the truly distributed case where communications costs are high [23, 20, 11], since infrequent communication in SGD-type methods usually leads to slower convergence and instability; a problem that could be mitigated using VR.

### 1.1 Contributions

This work has two main contributions. First, we present a new approach to variance reduction, VR-lite, that requires neither extra memory to store the algorithm history, nor the computation of exact gradients. For this reason, VR-lite has lower memory requirements and faster iterations than previous approaches. This is achieved by storing averages of previous iterates rather than a complete history. We show empirically that this approach converges much faster than competing methods.

Next, we study the application of VR-lite in a large scale distributed setting. We present synchronous and asynchronous variants of VR-lite in a fully distributed setting with high communication periods between nodes. We find that our algorithms are scalable and remain stable despite high communication periods. The proposed distributed algorithms outperform a number of other distributed SGD methods on a variety of classification and regression tasks.

## 2 Proposed Algorithm: VR-lite

Most VR schemes are divided into epochs, during which a pass is made over the entire dataset. Thus, updates take place during the -th epoch (one update per data record/feature vector). The iterates generated in the -th epoch can be written as

At the end of an epoch, the SVRG method sets , and , the exact gradient. These values of and are then used to perform corrected gradient updates of the form (2). This approach avoids the extra memory requirements of “epoch-free algorithms” such as SAGA , but requires an expensive gradient evaluation over the entire dataset on each iteration.

To speed computation, VR-lite accumulates the average gradient vector over an epoch. This average gradient is then used as thus avoiding costly loops over the entire (large) dataset. These accumulated averages can be computed cheaply “on the fly” as the algorithm runs without noticeable overhead.

Let denote a random permutation of the data indices , with denoting the data index chosen in the -th step in . Then, the update rule for VR-lite is given by update x_m+1 = x^k_m+1 - η(∇f_i_k (x^k_m+1) - ∇f_i_k (¯x_m) + ¯g_m ), where , and and denote the average over iterations and gradients, given by

 ¯¯¯xm=1nn∑j=1xjm, and ¯¯¯gm=1nn∑j=1∇fπjm(xjm). (3)

Note that for VR-lite, , unlike most other variance reduction methods. This makes it hard to theoretically prove convergence guarantees for VR-lite.

The values of and are initialized using a single epoch of vanilla SGD with no VR correction. This algorithm fits into the general VR framework (2) with and does not require any extra storage apart from a place to store and update these averages, and only requires 2 gradient computations per epoch. The full method is described in Algorithm 1.

## 3 Distributed Algorithms

We now consider a setting where the data is distributed across local nodes, and our goal is to minimize the global objective function. We consider that the local nodes can only communicate with a central server, i.e., a centralized setting. Let the set of data indices be decomposed into disjoint subsets with each representing the indices of data stored on server The objective (1) now has the form

 f(x)=1np∑s=1∑j∈Ψsfj(x). (4)

VR-lite is easily distributed in an asynchronous manner since it does not involve calculating the full gradient of the objective function as in SVRG. Moreover, since the algorithm needs to update the average of the gradients only at the end of an epoch, communication periods between the central node and the local nodes can be increased while still ensuring a fast and stable algorithm.

In this section, we propose fast and stable synchronous and asynchronous variants of VR-lite in the fully distributed setting with high communication latency between the central and local nodes.

### 3.1 Synchronous VR-lite

In a fully distributed setting, the objective is to decrease the frequency of (slow) communication with the central server so as to not impede (fast) gradient updates on each local node. This is achieved by updating local copies of on each remote client, and communicating with the central server periodically to report the change. Furthermore, the same average gradient term will be shared across all local nodes and used for variance reduction, ensuring that no local node drifts far away from the global solution even when communication gaps are large. This approach leads to a synchronous algorithm, Sync VR-lite, detailed in Algorithm 2.

In Sync VR-lite, each local node communicates with the central server only once in each epoch. Thus, during one full epoch over the dataset, Sync VR-lite only requires a total of communications with the central server. This is much less communication compared to the commonly used “parameter server” model of computation, where a communication phase is required for each update to a centrally stored iterate, leading to communications per epoch.

### 3.2 Asynchronous VR-lite

Sync VR-lite can be made asynchronous very easily, as demonstrated in Algorithm 3. The key idea for the asynchronous algorithm, Async VR-lite, is that only the change in , and , is sent from local node to the central server. This ensures that when updating the centrally stored , , and , the previous contribution to the average from that local worker is subtracted out before adding in the new value. Thus, the central and

always store unbiased estimates of the average of the distributed mean

and This is critical for ensuring that a fast working local node does not bias the solution, making the algorithm more robust to heterogeneous computing environments where local nodes work at drastically different speeds.

## 4 Experiments Figure 1: Sequential Results. Left to right: Logistic regression on toy dataset; Ridge regression on toy data; Logistic regression on IJCNN1 dataset; Ridge regression on MILLIONSONG dataset Figure 2: Distributed Results. Left to right: Logistic regression on SUSY over 250 local workers; Ridge regression on MILLIONSONG over 240 local workers; Time required for convergence as number of local workers is increased on SUSY and MILLIONSONG, respectively.

In this section, we present our empirical results. We test our algorithms on a binary classification problem with -regularized logistic regression and an

-regularized linear regression (ridge regression) problem. More formally, we are interested in the following two optimization problems:

 logistic regression: minx1nn∑i=1log(1+exp(biaTix))+λ∥x∥2 ridge regression: minx1nn∑i=1(aTix−bi)2+λ∥x∥2,

where each is a data sample with label , and . In all our experiments, we set the regularization parameter to .

### 4.1 Sequential Results

We first investigate the performance of VR-lite in the sequential, non-distributed setting. It is well-established by now that VR methods out-perform regular SGD by a wide margin on convex models. Thus, we compare our algorithm’s performance with two popular variance reduction methods, SVRG  and SAGA .

We tested VR-lite on two toy datasets as well as two real world datasets. For binary classification, we generated a random Gaussian data matrix for each class, where the first class had zero mean and the second class had mean 1 entries. The variances of the distributions were varied such that the dataset was not perfectly linearly separable. For ridge regression, we generated a random Gaussian matrix with observation vector , where denotes standard Gaussian noise. We built the toy datasets to have data samples with features, with 2500 points of each class for the binary classification case. We also tested our method on two standard real world datasets: IJCNN1  for binary classification and the MILLIONSONG  dataset for least squares regression.

Figure 1 shows results of our experiments. For all algorithms, we pick the constant learning rate which gives fastest convergence. Since a gradient computation is the most expensive step of VR methods, we compare the rate of convergence with the number of epochs used. It is important to note that comparing VR-lite to SAGA may not be a fair comparison since SAGA uses considerably more storage space. Despite this fact, we find that VR-lite converges faster than both SAGA and SVRG in all cases.

### 4.2 Distributed Results

We now present results of our distributed algorithms. All algorithms were implemented using Python and MPI, and run on an HPC cluster with 24 cores per node. Our asynchronous implementations are “locked”, i.e., only a single local node can update the parameters on the central server at a given time. However, all our implementations can be made faster in a simple extension to a lock-free setting. We compare our algorithms to the following popular asynchronous stochastic optimization methods:

• Hogwild!: A basic asynchronous variant of SGD  using a “parameter server” model.

• Elastic Averaging SGD (EASGD): A recent asynchronous SGD method 

that has been shown to accelerate the training of neural networks efficiently. We found the basic EASGD algorithm performed better than the momentum method (M-EASGD), so we present results only on EASGD. As recommended in

, we set the parameter . The communication period was varied as .

• Asynchronous SVRG: This is the first work to our knowledge to propose an asynchronous variant of VR algorithms . The authors show that the asynchronous SVRG algorithm enjoys the same benefits as its sequential counterpart over standard distributed stochastic optimization methods. As recommended in both  and , we set the epoch size to . Note that, after each epoch of size , a synchronization step is required for this algorithm to calculate the full gradient. Thus, this is not a fully asynchronous method.

Note that there is no accepted standard stochastic optimization algorithm for the distributed setting, and so we choose to compare VR-lite with the above recently developed algorithms.

For all algorithms and all experiments, we present results using the constant step size that achieves fastest convergence. We present results for binary classification on the dataset SUSY  and for least squares regression on the dataset MILLIONSONG . MILLIONSONG contains over 500,000 data samples while SUSY contains 5,000,000 samples.

Figure 2 shows results of our distributed experiments. The first two plots compare our algorithms with the other methods on SUSY with 250 local workers and MILLIONSONG with 240 local workers. We plot the relative norm of the gradient on the -axis, and wall clock time on the -axis. For both the binary classification and least squares regression problems, we see that both Sync VR-lite and Async VR-lite significantly outperform all other algorithms. We further notice that the methods are extremely stable inspite of the high communication latency with the central server.

The last two plots in Figure 2 show the scalability of our algorithms. We plot the wall clock time required for convergence on the -axis and increase the number of local workers on the -axis. We compare convergence rates of Sync VR-lite and Async VR-lite with 60, 120, 240 and 480 local workers for MILLIONSONG and with 50, 250, 500, 750 local workers for SUSY. We notice that for MILLIONSONG, there is a large drop in convergence rate initially with the gains leveling out as we keep increasing the number of local workers. The proposed algorithms take only about 10 seconds over hundreds of local workers to train the dataset and the convergence rates of the asynchronous algorithm can potentially be further improved by using a lock-free implementation. For the dataset SUSY, which is 10 times larger than MILLIONSONG, we see a consistent gain in convergence rates as we keep increasing the number of local workers, with our algorithms training the dataset in less than 5 seconds when distributed over 750 local workers.

## 5 Conclusion

In this paper, we presented VR-lite, a variance reduction SGD algorithm which, unlike other popular variance reduction schemes, does not require any full gradient computations or extra storage. We presented distributed variants of this algorithm, Sync VR-lite and Async VR-lite, for a setting with low frequency of communication between the local nodes and the central server. Sync VR-lite and Async VR-lite are scalable in the distributed setting, and perform favorably to other parallel and distributed SGD algorithms on standard models typically encountered in machine learning.

## References

•  Alekh Agarwal and John C Duchi. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pages 873–881, 2011.
•  Pierre Baldi, Peter Sadowski, and Daniel Whiteson.

Searching for exotic particles in high-energy physics with deep learning.

Nature communications, 5, 2014.
•  Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The million song dataset. In Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011), 2011.
•  Soham De and Tom Goldstein. Efficient distributed SGD with variance reduction. In 2016 IEEE International Conference on Data Mining. IEEE, 2016.
•  Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223–1231, 2012.
•  Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.
•  Aaron Defazio, Justin Domke, et al. Finito: A faster, permutable incremental gradient method for big data problems. In Proceedings of The 31st International Conference on Machine Learning, pages 1125–1133, 2014.
•  Reza Harikandeh, Mohamed Osama Ahmed, Alim Virani, Mark Schmidt, Jakub Konečnỳ, and Scott Sallinen. Stop wasting my gradients: Practical svrg. In Advances in Neural Information Processing Systems, pages 2242–2250, 2015.
•  Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
•  Mu Li, David Andersen, Alex Smola, and Kai Yu. Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems, pages 19–27, 2014.
•  Aryan Mokhtari and Alejandro Ribeiro. Dsa: Decentralized double stochastic averaging gradient algorithm. Journal of Machine Learning Research, 17(61):1–35, 2016.
•  Danil Prokhorov. Ijcnn 2001 neural network competition. Slide presentation in IJCNN, 1, 2001.
•  Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.
•  Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, and Alex J Smola. On variance reduction in stochastic gradient descent and its asynchronous variants. In Advances in Neural Information Processing Systems, pages 2629–2637, 2015.
•  Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
•  Nicolas L Roux, Mark Schmidt, and Francis R Bach. A stochastic gradient method with an exponential convergence _rate for finite training sets. In Advances in Neural Information Processing Systems, pages 2663–2671, 2012.
•  Ohad Shamir and Nathan Srebro. Distributed stochastic optimization and learning. In Communication, Control, and Computing (Allerton), 2014 52nd Annual Allerton Conference on, pages 850–857, 2014.
•  Chong Wang, Xi Chen, Alex J Smola, and Eric P Xing. Variance reduction for stochastic gradient optimization. In Advances in Neural Information Processing Systems, pages 181–189, 2013.
•  Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
•  Sixin Zhang, Anna E Choromanska, and Yann LeCun. Deep learning with elastic averaging sgd. In Advances in Neural Information Processing Systems, pages 685–693, 2015.
•  Shen-Yi Zhao, Ru Xiang, Ying-Hao Shi, Peng Gao, and Wu-Jun Li. Scope: Scalable composite optimization for learning on spark. arXiv preprint arXiv:1602.00133, 2016.
•  Martin Zinkevich, John Langford, and Alex J Smola. Slow learners are fast. In Advances in Neural Information Processing Systems, pages 2331–2339, 2009.
•  Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J Smola. Parallelized stochastic gradient descent. In Advances in neural information processing systems, pages 2595–2603, 2010.