1 Introduction
Assume we have a set of labeled instances $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where $\mathbf{x}_i \in \mathbb{R}^d$ is the feature vector for instance $i$, $d$ is the feature size, and $y_i \in \{-1, +1\}$ is the class label of $\mathbf{x}_i$. In machine learning, we often need to solve the following regularized empirical risk minimization problem:

$$\min_{\mathbf{w}} f(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n} f_i(\mathbf{w}), \tag{1}$$

where $\mathbf{w}$ is the parameter to learn and $f_i(\mathbf{w})$ is the loss function defined on instance $i$, often with a regularization term to avoid overfitting. For example, $f_i(\mathbf{w})$ can be $\log\big(1 + e^{-y_i \mathbf{x}_i^T \mathbf{w}}\big) + \frac{\lambda}{2}\|\mathbf{w}\|^2$, which is known as the (regularized) logistic loss, or $\max\big\{0,\, 1 - y_i \mathbf{x}_i^T \mathbf{w}\big\} + \frac{\lambda}{2}\|\mathbf{w}\|^2$, which is known as the hinge loss in support vector machines (SVM). The regularization term can be $\frac{\lambda}{2}\|\mathbf{w}\|^2$, $\lambda\|\mathbf{w}\|_1$, or some other form. Due to their efficiency and effectiveness, stochastic gradient descent (SGD) and its variants [11, 2, 9, 6, 4, 10, 7] have recently attracted much attention for solving machine learning problems like that in (1). Many works have shown that SGD and its variants can outperform traditional batch learning algorithms such as gradient descent or Newton methods in real applications.
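To make the objective concrete, here is a minimal NumPy sketch (our own illustration, not the paper's code) of the $L_2$-regularized logistic loss in (1) and its gradient; the function names are ours:

```python
import numpy as np

def logistic_loss(w, X, y, lam):
    """Regularized empirical risk:
    f(w) = (1/n) * sum_i log(1 + exp(-y_i x_i^T w)) + (lam/2) ||w||^2."""
    margins = y * (X @ w)                        # y_i * x_i^T w for every instance
    loss = np.mean(np.logaddexp(0.0, -margins))  # numerically stable log(1 + e^{-m})
    return loss + 0.5 * lam * (w @ w)

def logistic_grad(w, X, y, lam, i=None):
    """Full gradient of the objective, or the stochastic gradient
    f_i'(w) of a single instance i when i is given."""
    if i is not None:
        X, y = X[i:i+1], y[i:i+1]
    s = -y / (1.0 + np.exp(y * (X @ w)))         # per-instance derivative of the loss w.r.t. the margin
    return (X.T @ s) / len(y) + lam * w
```

By construction, the average of the per-instance stochastic gradients equals the full gradient, which is the property SGD-type methods rely on.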
In many real-world problems, the number of instances is typically very large. In this case, traditional sequential SGD methods might not be efficient enough to find the optimal solution for (1). On the other hand, clusters and multi-core systems have become popular in recent years. Hence, to handle large-scale problems, researchers have recently proposed several distributed SGD methods for clusters and parallel SGD methods for multi-core systems. Although distributed SGD methods for clusters like those in [12, 1, 13] are meaningful for very large-scale problems, many problems can still be solved by a single machine with multiple cores. Furthermore, even in distributed settings with clusters, each machine (node) of the cluster typically has multiple cores. Hence, how to design effective parallel SGD methods for multi-core systems has become a key issue for solving large-scale learning problems like that in (1).
Some parallel SGD methods for multi-core systems have already appeared. The round-robin scheme proposed in [12] orders the processors, and each processor then updates the variables in turn. Hogwild! [8] is a lock-free approach to parallel SGD. Experimental results in [8] have shown that Hogwild! can outperform the round-robin scheme in [12]. However, Hogwild! can only achieve a sublinear convergence rate, so it is not efficient (fast) enough to achieve satisfactory performance.
In this paper, we propose a fast asynchronous parallel SGD method, called AsySVRG, by designing an asynchronous strategy to parallelize the recently proposed SGD variant called stochastic variance reduced gradient (SVRG) [4]. The contributions of AsySVRG can be outlined as follows:

Two asynchronous schemes, consistent reading and inconsistent reading, are proposed to coordinate the different threads. Theoretical analysis shows that both schemes have a linear convergence rate, which is faster than that of Hogwild!.

The implementation of AsySVRG is simple.

Empirical results on real datasets show that AsySVRG can outperform Hogwild! in terms of computation cost.
2 Preliminary
We use $f(\mathbf{w})$ to denote the objective function in (1), which means $f(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n} f_i(\mathbf{w})$. In this paper, we use $\|\cdot\|$ to denote the $L_2$ norm and $\mathbf{w}^*$ to denote the optimal solution of the objective function.
Assumption 1.
The function $f_i(\cdot)$ in (1) is convex and $L$-smooth ($L > 0$), which means $\forall \mathbf{w}_1, \mathbf{w}_2$,

$$f_i(\mathbf{w}_1) \le f_i(\mathbf{w}_2) + \nabla f_i(\mathbf{w}_2)^T(\mathbf{w}_1 - \mathbf{w}_2) + \frac{L}{2}\|\mathbf{w}_1 - \mathbf{w}_2\|^2,$$

or equivalently

$$\|\nabla f_i(\mathbf{w}_1) - \nabla f_i(\mathbf{w}_2)\| \le L\|\mathbf{w}_1 - \mathbf{w}_2\|,$$

where $\nabla f_i(\cdot)$ denotes the gradient of $f_i(\cdot)$.
Assumption 2.
The objective function $f(\cdot)$ is $\mu$-strongly convex ($\mu > 0$), which means $\forall \mathbf{w}_1, \mathbf{w}_2$,

$$f(\mathbf{w}_1) \ge f(\mathbf{w}_2) + \nabla f(\mathbf{w}_2)^T(\mathbf{w}_1 - \mathbf{w}_2) + \frac{\mu}{2}\|\mathbf{w}_1 - \mathbf{w}_2\|^2,$$

or equivalently

$$\|\nabla f(\mathbf{w}_1) - \nabla f(\mathbf{w}_2)\| \ge \mu\|\mathbf{w}_1 - \mathbf{w}_2\|.$$
3 Algorithm
Assume that we have $p$ processors (threads) which can access a shared memory, and the parameter vector $\mathbf{u}$ is stored in this shared memory. Furthermore, we assume each thread has access to the shared data structure for the vector $\mathbf{u}$ and can randomly choose any instance $i$ to compute the gradient $\nabla f_i(\mathbf{u})$. We also assume consistent reading of $\mathbf{u}$, which means that all the elements of $\mathbf{u}$ read by a thread have the same "age" (time clock).

Our AsySVRG algorithm is presented in Algorithm 1. We can find that in each outer iteration, the threads complete the following operations:

All threads compute the full gradient $\nabla f(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(\mathbf{w})$ in parallel, where $\mathbf{w}$ is the current outer-loop iterate: each thread computes the gradients of a disjoint subset of the instances, and the union of these subsets covers all $n$ instances.

Run an inner loop in which each iteration reads the current $\mathbf{u}$ from the shared memory, randomly chooses an instance indexed by $i$, computes the gradient $\nabla f_i(\mathbf{u})$, and forms the variance-reduced vector

$$\hat{\mathbf{v}} = \nabla f_i(\mathbf{u}) - \nabla f_i(\mathbf{w}) + \nabla f(\mathbf{w}). \tag{2}$$

Then update the vector in the shared memory:

$$\mathbf{u} \leftarrow \mathbf{u} - \eta \hat{\mathbf{v}},$$

where $\eta$ is a step size.
Here, $t$ denotes the total number of updates on $\mathbf{u}$ from all threads, and $a(t)$ denotes the iteration at which the $t$-th applied update was calculated. Since each thread can compute an update and then change $\mathbf{u}$, we obviously have $a(t) \le t$. At the same time, we should guarantee that the update is not too old. Hence, we need $t - a(t) \le \tau$, where $\tau$ is a positive integer usually called the bounded delay. If $a(t) = t$ for all $t$ (no delay), the algorithm AsySVRG degenerates to the sequential (single-thread) version of SVRG.
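The inner loop above can be sketched with Python threads as follows. This is our own simplified illustration of the consistent reading scheme, not the paper's implementation: the names (`asysvrg_inner_loop`, `grad_i`, `full_grad`) are ours, and Python's GIL prevents genuine parallel speedup here, but the read-under-lock / update-under-lock structure matches the description:

```python
import threading
import numpy as np

def asysvrg_inner_loop(w, grad_i, full_grad, n, eta,
                       num_threads=4, iters_per_thread=50, seed=0):
    """Sketch of one outer iteration of AsySVRG (consistent reading).

    Each thread repeatedly: reads u under a lock, picks a random instance i,
    forms the variance-reduced direction
        v = grad_i(u) - grad_i(w) + full_grad,
    and applies u <- u - eta * v under the same lock.
    """
    u = w.copy()                      # shared parameter vector
    lock = threading.Lock()

    def worker(thread_seed):
        rng = np.random.default_rng(thread_seed)
        for _ in range(iters_per_thread):
            i = int(rng.integers(n))
            with lock:                # consistent read: all elements share one age
                u_snapshot = u.copy()
            v = grad_i(u_snapshot, i) - grad_i(w, i) + full_grad
            with lock:                # locked update of the shared vector
                u[:] = u - eta * v

    threads = [threading.Thread(target=worker, args=(seed + k,))
               for k in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return u
```

Here `grad_i(w, i)` returns $\nabla f_i(\mathbf{w})$ and `full_grad` is $\nabla f(\mathbf{w})$ computed at the start of the outer iteration.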
4 Convergence Analysis
Our convergence analysis is based on Option 2 in Algorithm 1. Please note that we have $p$ threads and let each thread calculate the same number of updates. Hence, the total number of updates on $\mathbf{u}$ in the shared memory is $p$ times the per-thread count, and obviously, the larger the per-thread count is, the larger the total number of updates will be.
4.1 Consistent Reading
Since $\mathbf{u}$ is a vector with several elements, it is typically impossible to complete an update of $\mathbf{u}$ in a single atomic operation. We have to protect this step with a lock. More specifically, a lock is needed whenever a thread tries to read or update $\mathbf{u}$ in the shared memory. This is called the consistent reading scheme.
First, we give some notations as follows:
(3)  
(4) 
It is easy to verify that, with these notations, the update of $\mathbf{u}$ can be written as follows:
(5) 
One key step in obtaining the convergence rate is the estimation of the variance of the stochastic update direction. We use the technique in [5] and get the following result:

Lemma 1.
There exists a constant such that .
Proof.
According to Lemma 1, if we want the variance of the update direction to be small enough, we need a small step size $\eta$. This is reasonable because $\mathbf{u}$ should change slowly if the gradient applied to update it is relatively old.
4.2 Inconsistent Reading
The consistent reading scheme can cost much waiting time because a lock is needed whenever a thread tries to read or update $\mathbf{u}$. In this subsection, we introduce an inconsistent reading scheme, in which a thread does not need a lock when reading the current $\mathbf{u}$ in the memory. For the update step, the thread still needs a lock. Please note that our inconsistent reading scheme is different from that in [3], which adopts an atomic update strategy. Since the update vector applied to $\mathbf{u}$ is usually dense, the atomic update strategy used in [3] is not applicable to our case.
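The contrast between the two schemes can be sketched as follows (our own minimal illustration, with hypothetical function names): the inconsistent reading scheme copies $\mathbf{u}$ element by element without holding the lock, so a concurrently running (locked) update may interleave with the read, and the copy can mix elements of different ages:

```python
import threading

# Shared state: a parameter vector guarded by a single update lock.
u_lock = threading.Lock()

def read_consistent(u):
    """Consistent reading: hold the lock while copying,
    so every element in the copy has the same age."""
    with u_lock:
        return list(u)

def read_inconsistent(u):
    """Inconsistent reading: no lock is held; if an update interleaves,
    the copied elements may have different ages (some old, some new)."""
    return [x for x in u]

def apply_update(u, delta):
    """Updates always take the lock, so at most one writer runs at a time."""
    with u_lock:
        for j in range(len(u)):
            u[j] -= delta[j]
```

When no writer is active, both reads return the same vector; the schemes differ only in what a reader may observe while an update is in flight.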
For convenience, we use to denote the vector set generated in the inner loop of our algorithm, and to denote the vector that one thread gets from the shared memory and uses to update . Then, we have
(9) 
We also need the following assumption:
Assumption 3.
All threads enjoy the same speed of reading operations and the same speed of updating operations, and any reading operation is faster than any updating operation; that is, for three scalars $a$, $b$, $c$, reading $a$ is faster than performing the update $b \leftarrow b + c$.
Since we do not use locks when a thread reads $\mathbf{u}$ from the shared memory, some elements of $\mathbf{u}$ which have not yet been read by one thread may be changed by other threads. If we call the number of updates an element has received its "age", the ages of the elements read by a thread may not all be the same. We use the smallest of these ages to measure how stale the read vector is, and of course, we expect this smallest age not to be too small; given a positive integer bound, we assume the smallest age stays within that bound. With Assumption 3 and the definitions above, we have
(10) 
where is a set that belongs to , is a diagonal matrix that only the diagonal position is , , and other elements of are .
Equation (10) holds because, with an update lock and Assumption 3, at most one thread is updating $\mathbf{u}$ at any time. If a thread begins to read $\mathbf{u}$, only two cases can happen. One is that no thread is updating $\mathbf{u}$, in which case all the elements it reads have the same age. The other is that one thread is updating $\mathbf{u}$, in which case the reading thread may get some new elements and some old elements.
Then, we can get the following results:

or .

, and , which means that ,
is an identity matrix.
We give a notation for any two integers , i.e., .
According to the proof of Lemma 1, we can get the following property:
(11) 
Lemma 2.
There exist a constant and a corresponding suitable step size such that:
Proof.
According to (11), we have
(12) 
According to (10), we have
In the first inequality, the former quantity may be less than the latter, but this does not affect the result of the second inequality. Summing (12) over the stated range, we can get
Taking expectation with respect to the random numbers selected by Algorithm 1, we obtain
When the parameters satisfy the following condition:
we have
∎
Lemma 3.
For the relation between and , we have the following result
where .
Proof.
In (11), if we take and , we can obtain
Summing from to , we obtain
(13)  
where the second inequality uses the fact that .
Similar to Theorem 1, we have the following result for the inconsistent reading scheme:
Theorem 2.
Remark: In our convergence analysis for both consistent reading and inconsistent reading schemes, there are a lot of parameters, such as . We can set . Since are determined by the objective function and are constants, we only need the step size to be small enough and to be large enough. Then, all the conditions in these lemmas and theorems will be satisfied.
5 Experiments
We choose logistic regression with an $L_2$-norm regularization term to evaluate our AsySVRG. Hence, $f_i(\mathbf{w})$ is defined as follows:

$$f_i(\mathbf{w}) = \log\big(1 + e^{-y_i \mathbf{x}_i^T \mathbf{w}}\big) + \frac{\lambda}{2}\|\mathbf{w}\|^2.$$

We choose Hogwild! as the baseline because Hogwild! has been proved to be a state-of-the-art parallel SGD method for multi-core systems [8]. The experiments are conducted on a server with 12 Intel cores and 64GB memory.
5.1 Dataset and Evaluation Metric
We choose three datasets for evaluation: rcv1, real-sim, and news20, which can be downloaded from the LibSVM website¹. Detailed information is shown in Table 1, where $\lambda$ is the hyper-parameter of the regularization term.

¹http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
dataset | #instances | #features | $\lambda$
rcv1 | 20,242 | 47,236 | 0.0001
real-sim | 72,309 | 20,958 | 0.0001
news20 | 19,996 | 1,355,191 | 0.0001
We adopt speedup and convergence rate as evaluation metrics. The definition of speedup is as follows:

$$\text{speedup} = \frac{\text{time taken by the sequential (one-thread) run}}{\text{time taken by the run with } p \text{ threads}}.$$

We obtain a suboptimal solution by stopping the algorithms when the gap between the training loss and the optimal value is less than a given threshold.
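As a trivial helper for the metric (our own illustration, not part of the paper's code), speedup is just the one-thread running time divided by the $p$-thread running time:

```python
def speedup(sequential_time, parallel_time):
    """Speedup of a p-thread run relative to the sequential (one-thread) run.
    E.g., a value of 1.94 means the parallel run is 1.94 times faster."""
    return sequential_time / parallel_time
```

For instance, the 10-thread AsySVRG-unlock entry in Table 2 (26s at 5.77x) implies a one-thread running time of about 5.77 × 26s ≈ 150s.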
We set the inner-loop size in Algorithm 1 so that, with one thread, it matches the setting in SVRG [4]; here the relevant quantities are the number of training instances and the number of threads. According to our theorems, the step size should be small. However, we can also get good performance with a relatively large step size in practice. For Hogwild!, in each epoch we run each thread for a fixed number of iterations. We use a constant step size and shrink it after every epoch. These settings are the same as those in the Hogwild! experiments [8]. In each epoch, our algorithm visits the whole dataset three times while Hogwild! visits it only once. To make a fair comparison of convergence rate, we study the change of the objective value versus the number of effective passes, where one effective pass means the whole dataset is visited once.

5.2 Results
In practice, we find that our AsySVRG algorithm without any lock strategy, denoted by AsySVRG-unlock, can achieve the best performance. Table 2 shows the running time and speedup of the consistent reading, inconsistent reading, and unlock schemes for AsySVRG on the dataset rcv1. Here, 77.15s denotes 77.15 seconds, and 1.94x means the speedup is 1.94, i.e., it is 1.94 times faster than the sequential (one-thread) algorithm.
threads | consistent reading | inconsistent reading | AsySVRG-unlock
2 | 77.15s/1.94x | 77.20s/1.94x | 137.55s/1.09x
4 | 62.20s/2.4x | 51.06s/2.93x | 58.07s/2.58x
8 | 63.05s/2.4x | 53.93s/2.78x | 30.49s/4.92x
10 | 64.76s/2.3x | 56.29s/2.66x | 26s/5.77x
We find that the consistent reading scheme has the worst performance. Hence, in the following experiments, we only report the results of the inconsistent reading scheme, denoted by AsySVRG-lock, and AsySVRG-unlock.
Table 3 compares the time cost between AsySVRG and Hogwild! to achieve a gap less than the given threshold with 10 threads. We can find that our AsySVRG is much faster than Hogwild!, either with locks or without.
dataset | AsySVRG-lock | AsySVRG-unlock | Hogwild!-lock | Hogwild!-unlock
rcv1 | 55.77 | 25.33 | 500 | 200
real-sim | 42.20 | 21.16 | 400 | 200
news20 | 909.93 | 514.50 | 4000 | 2000
Figure 1 shows the speedup and convergence rate on the three datasets. Here, AsySVRG-lock-10 denotes AsySVRG with the lock strategy on 10 threads. Similar notations are used for the other settings of AsySVRG and Hogwild!. We can find that the speedups of AsySVRG and Hogwild! are comparable. Combined with the results in Table 3, we can find that Hogwild! is slower than AsySVRG for different numbers of threads. From Figure 1, we can also find that the convergence rate of AsySVRG is much faster than that of Hogwild!.
6 Conclusion
In this paper, we have proposed a novel asynchronous parallel SGD method, called AsySVRG, for multicore systems. Both theoretical and empirical results show that AsySVRG can outperform other stateoftheart methods.
References
 [1] John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Distributed dual averaging in networks. In Advances in Neural Information Processing Systems, pages 550–558, 2010.
 [2] John C. Duchi and Yoram Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2899–2934, 2009.
 [3] Cho-Jui Hsieh, Hsiang-Fu Yu, and Inderjit S. Dhillon. PASSCoDe: Parallel asynchronous stochastic dual coordinate descent. arXiv preprint arXiv:1504.01365, 2015.
 [4] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
 [5] Ji Liu, Stephen J Wright, Christopher Ré, Victor Bittorf, and Srikrishna Sridhar. An asynchronous parallel stochastic coordinate descent algorithm. arXiv preprint arXiv:1311.1873, 2013.
 [6] Julien Mairal. Optimization with first-order surrogate functions. In Proceedings of the 30th International Conference on Machine Learning, pages 783–791, 2013.
 [7] Atsushi Nitanda. Stochastic proximal gradient descent with acceleration techniques. In Advances in Neural Information Processing Systems, pages 1574–1582, 2014.
 [8] Benjamin Recht, Christopher Ré, Stephen Wright, and Feng Niu. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.
 [9] Nicolas Le Roux, Mark W. Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2012.
 [10] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14(1):567–599, 2013.
 [11] Lin Xiao. Dual averaging method for regularized stochastic learning and online optimization. In Advances in Neural Information Processing Systems, pages 2116–2124, 2009.
 [12] Martin Zinkevich, Alexander J. Smola, and John Langford. Slow learners are fast. In Advances in Neural Information Processing Systems, pages 2331–2339, 2009.
 [13] Martin Zinkevich, Markus Weimer, Alexander J. Smola, and Lihong Li. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 2595–2603, 2010.
Appendix A Notations for Proof
We now give the proof of Theorem 1.
Proof.
According to (5), we obtain that
(14) 
For the old gradient, we have
(15) 
Since is smooth, we have
Substituting the above inequalities into (A), we obtain
(16) 
Since
Taking expectation and using Lemma 1, we obtain
The second inequality uses . In the first equality, we use the fact that . The last inequality uses the inequality
which has been proved in [4]. Then summing up from to , and taking or randomly choosing a to be , we can get
(17)  
Then, we have
Of course, we need . ∎
We now give the proof of Theorem 2.
Proof.
According to (4.2), we have
(18) 
Similar to the analysis of (15) in Theorem 1, we can get
The second inequality is the same as the analysis in (13). The third inequality uses Lemma 2. The fourth inequality uses Lemma 3.
For convenience, we use and sum the above inequality from to , and take . Then, we obtain
which means that