Fast Asynchronous Parallel Stochastic Gradient Descent

08/24/2015
by Shen-Yi Zhao, et al.

Stochastic gradient descent (SGD) and its variants have become more and more popular in machine learning due to their efficiency and effectiveness. To handle large-scale problems, researchers have recently proposed several parallel SGD methods for multicore systems. However, existing parallel SGD methods cannot achieve satisfactory performance in real applications. In this paper, we propose a fast asynchronous parallel SGD method, called AsySVRG, by designing an asynchronous strategy to parallelize the recently proposed SGD variant called stochastic variance reduced gradient (SVRG). Both theoretical and empirical results show that AsySVRG can outperform existing state-of-the-art parallel SGD methods like Hogwild! in terms of convergence rate and computation cost.


1 Introduction

Assume we have a set of labeled instances $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where $\mathbf{x}_i \in \mathbb{R}^d$ is the feature vector for instance $i$, $d$ is the feature size and $y_i \in \{+1, -1\}$ is the class label of $\mathbf{x}_i$. In machine learning, we often need to solve the following regularized empirical risk minimization problem:

$$\min_{\mathbf{w}} \; f(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n} f_i(\mathbf{w}), \qquad (1)$$

where $\mathbf{w}$ is the parameter to learn and $f_i(\mathbf{w})$ is the loss function defined on instance $i$, often with a regularization term to avoid overfitting. For example, $f_i(\mathbf{w})$ can be $\log\big(1 + e^{-y_i \mathbf{x}_i^T \mathbf{w}}\big) + \frac{\lambda}{2}\|\mathbf{w}\|^2$, which is known as the (regularized) logistic loss, or $\max\{0, 1 - y_i \mathbf{x}_i^T \mathbf{w}\} + \frac{\lambda}{2}\|\mathbf{w}\|^2$, which is known as the hinge loss in support vector machines (SVM). The regularization term can be $\frac{\lambda}{2}\|\mathbf{w}\|^2$, $\lambda\|\mathbf{w}\|_1$, or some other form.
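To make the objective concrete, here is a short Python sketch (ours, not from the paper) of the regularized logistic loss above and its gradient for a single instance; the function names and the NumPy dependency are illustrative choices, not part of the original work.

```python
import numpy as np

def logistic_loss_i(w, x_i, y_i, lam):
    """f_i(w): logistic loss on one instance plus L2 regularization."""
    return np.log(1.0 + np.exp(-y_i * x_i.dot(w))) + 0.5 * lam * w.dot(w)

def logistic_grad_i(w, x_i, y_i, lam):
    """Gradient of f_i(w) with respect to w."""
    coef = -y_i / (1.0 + np.exp(y_i * x_i.dot(w)))  # derivative of the log term
    return coef * x_i + lam * w

def objective(w, X, y, lam):
    """f(w) = (1/n) * sum_i f_i(w), averaged over all n instances."""
    return np.mean([logistic_loss_i(w, X[i], y[i], lam) for i in range(X.shape[0])])
```

SGD-type methods only ever evaluate `logistic_grad_i` on one randomly chosen instance per step, which is what makes them attractive when $n$ is large.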

Due to their efficiency and effectiveness, stochastic gradient descent (SGD) and its variants [11, 2, 9, 6, 4, 10, 7] have recently attracted much attention for solving machine learning problems like that in (1). Many works have proved that SGD and its variants can outperform traditional batch learning algorithms such as gradient descent or Newton methods in real applications.

In many real-world problems, the number of instances $n$ is typically very large. In this case, traditional sequential SGD methods might not be efficient enough to find the optimal solution of (1). On the other hand, clusters and multicore systems have become popular in recent years. Hence, to handle large-scale problems, researchers have recently proposed several distributed SGD methods for clusters and parallel SGD methods for multicore systems. Although distributed SGD methods for clusters like those in [12, 1, 13] are meaningful for handling very large-scale problems, there also exist many problems which can be solved by a single machine with multiple cores. Furthermore, even in distributed settings with clusters, each machine (node) of the cluster typically has multiple cores. Hence, how to design effective parallel SGD methods for multicore systems has become a key issue for solving large-scale learning problems like that in (1).

Some parallel SGD methods for multicore systems have already appeared. The round-robin scheme proposed in [12] orders the processors, and each processor then updates the variables in turn. Hogwild! [8] is a lock-free approach to parallel SGD. Experimental results in [8] have shown that Hogwild! can outperform the round-robin scheme in [12]. However, Hogwild! can only achieve a sub-linear convergence rate. Hence, Hogwild! is not efficient (fast) enough to achieve satisfactory performance.

In this paper, we propose a fast asynchronous parallel SGD method, called AsySVRG, by designing an asynchronous strategy to parallelize the recently proposed SGD variant called stochastic variance reduced gradient (SVRG) [4]. The contributions of AsySVRG can be outlined as follows:

  • Two asynchronous schemes, consistent reading and inconsistent reading, are proposed to coordinate different threads. Theoretical analysis is provided to show that both schemes have linear convergence rate, which is faster than that of Hogwild!

  • The implementation of AsySVRG is simple.

  • Empirical results on real datasets show that AsySVRG can outperform Hogwild! in terms of computation cost.

2 Preliminary

We use $f(\mathbf{w})$ to denote the objective function in (1), which means $f(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n} f_i(\mathbf{w})$. In this paper, we use $\|\cdot\|$ to denote the $L_2$-norm and $\mathbf{w}^*$ to denote the optimal solution of the objective function.

Assumption 1.

The function $f_i(\cdot)$ in (1) is convex and $L$-smooth, which means that $\forall \mathbf{w}_1, \mathbf{w}_2$,

$$f_i(\mathbf{w}_1) \le f_i(\mathbf{w}_2) + \nabla f_i(\mathbf{w}_2)^T(\mathbf{w}_1 - \mathbf{w}_2) + \frac{L}{2}\|\mathbf{w}_1 - \mathbf{w}_2\|^2,$$

or equivalently

$$\|\nabla f_i(\mathbf{w}_1) - \nabla f_i(\mathbf{w}_2)\| \le L\|\mathbf{w}_1 - \mathbf{w}_2\|,$$

where $\nabla f_i(\cdot)$ denotes the gradient of $f_i(\cdot)$.

Assumption 2.

The objective function $f(\cdot)$ is $\mu$-strongly convex, which means that $\forall \mathbf{w}_1, \mathbf{w}_2$,

$$f(\mathbf{w}_1) \ge f(\mathbf{w}_2) + \nabla f(\mathbf{w}_2)^T(\mathbf{w}_1 - \mathbf{w}_2) + \frac{\mu}{2}\|\mathbf{w}_1 - \mathbf{w}_2\|^2,$$

or equivalently

$$\big(\nabla f(\mathbf{w}_1) - \nabla f(\mathbf{w}_2)\big)^T(\mathbf{w}_1 - \mathbf{w}_2) \ge \mu\|\mathbf{w}_1 - \mathbf{w}_2\|^2.$$

3 Algorithm

Assume that we have $p$ processors (threads) which can access a shared memory, and the parameter $\mathbf{w}$ is stored in the shared memory. Furthermore, we assume that each thread has access to the shared data structure for the vector $\mathbf{w}$ and can choose any instance $i$ randomly to compute the gradient $\nabla f_i(\mathbf{w})$. We also assume consistent reading of $\mathbf{w}$, which means that all the elements of $\mathbf{w}$ read from the shared memory have the same “age” (time clock).

Our AsySVRG algorithm is presented in Algorithm 1. We can find that in the $t$-th outer iteration, the threads complete the following operations:

  • All threads parallelly compute the full gradient $\nabla f(\mathbf{u}_t) = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(\mathbf{u}_t)$ at the snapshot $\mathbf{u}_t$. Assume the gradients computed by thread $k$ form a subset $\phi_k$ of $\{\nabla f_1(\mathbf{u}_t), \ldots, \nabla f_n(\mathbf{u}_t)\}$. We have $\phi_k \cap \phi_{k'} = \emptyset$ if $k \ne k'$, and $\bigcup_{k=1}^{p} \phi_k = \{\nabla f_1(\mathbf{u}_t), \ldots, \nabla f_n(\mathbf{u}_t)\}$.

  • Each thread runs an inner loop in which each iteration randomly chooses an instance indexed by $i \in \{1, \ldots, n\}$, reads $\mathbf{w}$ from the shared memory, and computes the variance-reduced update vector

    $$\hat{\mathbf{v}} = \nabla f_i(\mathbf{w}) - \nabla f_i(\mathbf{u}_t) + \nabla f(\mathbf{u}_t). \qquad (2)$$

    Then it updates the shared vector

    $$\mathbf{w} \leftarrow \mathbf{w} - \eta\,\hat{\mathbf{v}},$$

    where $\eta$ is a step size.

Here, $m$ indexes the total number of updates on $\mathbf{w}$ from all threads, and the update applied at step $m$ was computed from a value of $\mathbf{w}$ read at some earlier step $a(m)$. Since each thread can compute an update while others change $\mathbf{w}$, obviously $a(m) \le m$. At the same time, we should guarantee that the update is not too old. Hence, we require $m - a(m) \le \tau$, where $\tau$ is a positive integer usually called the bounded delay. With a single thread there is no delay, and AsySVRG degenerates to the sequential (single-thread) version of SVRG.

  Initialization: $p$ threads, initialize $\mathbf{u}_0$ and the step size $\eta$;
  for $t = 0, 1, 2, \ldots$ do
     All threads parallelly compute the full gradient $\nabla f(\mathbf{u}_t) = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(\mathbf{u}_t)$;
     $\mathbf{w} \leftarrow \mathbf{u}_t$;
     For each thread, do:
     for $m = 0$ to $M-1$ do
        Pick up an $i$ randomly from $\{1, \ldots, n\}$;
        Compute the update vector $\hat{\mathbf{v}} = \nabla f_i(\mathbf{w}) - \nabla f_i(\mathbf{u}_t) + \nabla f(\mathbf{u}_t)$;
        $\mathbf{w} \leftarrow \mathbf{w} - \eta\,\hat{\mathbf{v}}$;
     end for
     Option 1: Take $\mathbf{u}_{t+1}$ to be the current $\mathbf{w}$ in the shared memory;
     Option 2: Take $\mathbf{u}_{t+1}$ to be the average of the iterates $\mathbf{w}$ generated by the inner loop;
  end for
Algorithm 1 AsySVRG
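The following Python sketch (our own illustration under stated assumptions, not the authors' code) mirrors the structure of Algorithm 1 with Option 1: each outer iteration computes the full gradient at the snapshot, then every thread runs an inner loop of variance-reduced updates on the shared vector. The names `asysvrg` and `grad_i`, the thread seeding, and the use of Python threads are all our own choices; because of Python's global interpreter lock this illustrates the control flow rather than real multicore speedup.

```python
import threading
import numpy as np

def asysvrg(grad_i, n, u0, eta, M, num_threads, num_outer, lock_updates=True):
    """Sketch of Algorithm 1 (Option 1). grad_i(w, i) returns the gradient of f_i at w."""
    w = u0.astype(float)               # shared parameter vector ("shared memory")
    lock = threading.Lock()

    for t in range(num_outer):
        u = w.copy()                   # snapshot u_t for this outer iteration
        # full gradient at the snapshot (computed in parallel in the paper; serial here)
        full_grad = np.mean([grad_i(u, i) for i in range(n)], axis=0)

        def worker(seed):
            nonlocal w
            rng = np.random.default_rng(seed)
            for _ in range(M):
                i = int(rng.integers(n))
                w_read = w.copy()      # read the shared vector
                # variance-reduced update vector, cf. Eq. (2)
                v = grad_i(w_read, i) - grad_i(u, i) + full_grad
                if lock_updates:
                    with lock:         # serialize updates, as discussed in Section 4
                        w -= eta * v
                else:
                    w -= eta * v       # fully lock-free variant (AsySVRG-unlock)

        threads = [threading.Thread(target=worker, args=(t * num_threads + k,))
                   for k in range(num_threads)]
        for th in threads:
            th.start()
        for th in threads:
            th.join()
        # Option 1: the next snapshot is simply the current w in the shared memory.
    return w
```

With the hypothetical `logistic_grad_i` from Section 1, a call might look like `asysvrg(lambda w, i: logistic_grad_i(w, X[i], y[i], lam), n, np.zeros(d), eta, M, p, T)`.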

4 Convergence Analysis

Our convergence analysis is based on Option 2 in Algorithm 1. Please note that we have $p$ threads and let each thread compute $M$ updates. Hence, the total number of updates on $\mathbf{w}$ in the shared memory, which is denoted by $\tilde{M}$, satisfies $\tilde{M} = pM$. Obviously, the larger $p$ is, the larger $\tilde{M}$ will be.

4.1 Consistent Reading

Since $\mathbf{w}$ is a vector with many elements, it is typically impossible to complete an update of $\mathbf{w}$ in one atomic operation. We therefore have to protect this step with a lock for each thread. More specifically, a thread needs the lock whenever it tries to read or update $\mathbf{w}$ in the shared memory. This is called the consistent reading scheme.
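As a minimal sketch of this scheme (our own illustration, with arbitrary dimension and function names), both the read and the update of the shared vector are guarded by the same lock, so a reading thread always sees a vector whose elements have the same age:

```python
import threading
import numpy as np

w = np.zeros(1000)          # shared parameter vector (dimension chosen arbitrarily)
w_lock = threading.Lock()   # a single lock guards both reads and updates

def read_w_consistent():
    # consistent reading: no update can interleave with the copy,
    # so all elements of the returned vector share the same "age"
    with w_lock:
        return w.copy()

def update_w(v, eta):
    global w
    with w_lock:            # updates are serialized by the same lock
        w -= eta * v
```

The drawback is visible here: every read and every update contends for the same lock, which is what motivates the inconsistent reading scheme of Section 4.2.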

First, we give some notations as follows:

(3)
(4)

It is easy to find that and the update of can be written as follows:

(5)

One key step in obtaining the convergence rate is to estimate the variance of the stochastic update vector. We use the technique in [5] and get the following result:

Lemma 1.

There exists a constant such that .

Proof.
(6)

The fourth inequality uses Assumption 1 and is a constant. Summing (4.1) from to , and taking expectation about , we have

(7)

We use and choose such that , then we can get . Please note that . Then, we obtain that

(8)

where satisfies and . ∎

According to Lemma 1, if we want this bound to be small enough, we need a small step size. This is reasonable because $\mathbf{w}$ should change slowly if the gradient applied to the update is relatively old.

Theorem 1.

With Assumptions 1 and 2, choosing a small step size and a large number of inner iterations $M$, we have the following result:

where and .

4.2 Inconsistent Reading

The consistent reading scheme can incur considerable waiting time because a lock is needed whenever a thread tries to read or update $\mathbf{w}$. In this subsection, we introduce an inconsistent reading scheme, in which a thread does not need a lock when reading the current $\mathbf{w}$ in the shared memory. For the update step, a thread still needs a lock. Please note that our inconsistent reading scheme is different from that in [3], which adopts an atomic update strategy. Since the update vector applied to $\mathbf{w}$ is usually dense, the atomic update strategy used in [3] is not applicable to our case.
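A minimal sketch of the inconsistent reading scheme (again our own illustration, with the same arbitrary naming as above): reads copy the shared vector without any lock, so the copy may mix elements of different ages, while updates still acquire a lock because the dense update vector cannot be applied atomically.

```python
import threading
import numpy as np

w = np.zeros(1000)              # shared parameter vector
update_lock = threading.Lock()  # only updates take the lock

def read_w_inconsistent():
    # no lock on reads: elements copied later may already reflect a concurrent
    # update, so the returned vector can mix elements of different "ages";
    # this is exactly what the analysis below has to bound
    return w.copy()

def update_w(v, eta):
    global w
    with update_lock:           # at most one thread writes w at any time
        w -= eta * v
```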

For convenience, we use to denote the vector set generated in the inner loop of our algorithm, and to denote the vector that one thread gets from the shared memory and uses to update . Then, we have

(9)

We also need the following assumption:

Assumption 3.

All threads have the same speed for a reading operation and the same speed for an updating operation, and any reading operation is faster than any updating operation, i.e., reading one scalar from the shared memory takes less time than updating one scalar in it.

Since we do not use locks when a thread reads in the shared memory, some elements in which have not been read by one thread may be changed by other threads. Usually, . If we call the age of each element of to be , the ages of elements of may not be the same. We use to denote the smallest age of the elements of . Of course, we expect that is not too small. Given a positive integer , we assume that . With Assumption 3, according to the definition of and , we have

(10)

where is a set that belongs to , is a diagonal matrix that only the diagonal position is , , and other elements of are .

Inequality (10) holds because, with an update lock and Assumption 3, at most one thread is updating $\mathbf{w}$ at any time. If a thread begins to read $\mathbf{w}$, only two cases can happen. One is that no thread is updating $\mathbf{w}$, so the reading thread gets a consistent copy. The other is that one thread is updating $\mathbf{w}$, so the reading thread may get some new elements together with some old elements; if it reads at a good pace, the elements it gets share the same age.

Then, we can get the following results:

  • or .

  • , and , which means that ,

    is an identity matrix.

Similar to (3) and (4), we give the following definitions:

We can find that and .

We give a notation for any two integers , i.e., .

According to the proof of Lemma 1, we can get the following property:

(11)
Lemma 2.

There exist a constant and a corresponding suitable step size such that:

Proof.

According to (11), we have

(12)

According to (10), we have

In the first inequality, may be less than , but it won’t impact the result of the second inequality. Summing from to for (12), we can get

Taking expectation with respect to the random indices selected by Algorithm 1, we obtain

When the parameters satisfy the following condition:

we have

Lemma 3.

For the relation between and , we have the following result

where .

Proof.

In (11), if we take and , we can obtain

Summing from to , we obtain

(13)

where the second inequality uses the fact that .

Taking expectation on both sides, and using Lemma 2, we get

which means that

Similar to Theorem 1, we have the following result for the inconsistent reading scheme:

Theorem 2.

With Assumptions 1, 2 and 3, a suitable step size which satisfies the conditions in Lemmas 2 and 3, and a large number of inner iterations, we can get our convergence result for the inconsistent reading scheme:

where .

Remark: Our convergence analysis for both the consistent reading and the inconsistent reading schemes involves a number of parameters. Since some of them are determined by the objective function and the others are constants, we only need the step size to be small enough and the number of inner iterations to be large enough; then all the conditions in these lemmas and theorems will be satisfied.

5 Experiments

We choose logistic regression with an $L_2$-norm regularization term to evaluate our AsySVRG. Hence, $f_i(\mathbf{w})$ is defined as follows:

$$f_i(\mathbf{w}) = \log\big(1 + e^{-y_i \mathbf{x}_i^T \mathbf{w}}\big) + \frac{\lambda}{2}\|\mathbf{w}\|^2.$$

We choose Hogwild! as the baseline because Hogwild! has been shown to be a state-of-the-art parallel SGD method for multicore systems [8]. The experiments are conducted on a server with 12 Intel cores and 64 GB of memory.

5.1 Dataset and Evaluation Metric

We choose three datasets for evaluation: rcv1, real-sim, and news20, which can be downloaded from the LibSVM website (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). Detailed information is shown in Table 1, where $\lambda$ is the regularization hyper-parameter.

dataset     #instances    #features     λ
rcv1        20,242        47,236        0.0001
real-sim    72,309        20,958        0.0001
news20      19,996        1,355,191     0.0001
Table 1: Datasets

We adopt the speedup and the convergence rate as evaluation metrics. The speedup is defined as follows:

$$\text{speedup} = \frac{\text{time taken by one thread}}{\text{time taken by } p \text{ threads}}.$$

We obtain a suboptimal solution by stopping the algorithms when the gap between the training loss and the optimal value falls below a fixed small threshold.

We set $M$ in Algorithm 1 according to $n$, the number of training instances, and $p$, the number of threads. When $p = 1$, the setting of $M$ is the same as that in SVRG [4]. According to our theorems, the step size should be small. However, we can also get good performance with a relatively large step size in practice. For Hogwild!, in each epoch we run each thread for a fixed number of iterations with a constant step size, which is decayed by a constant factor after every epoch; these settings are the same as those in the Hogwild! experiments [8]. In each epoch, our algorithm visits the whole dataset three times while Hogwild! visits the whole dataset only once. To make a fair comparison of convergence rate, we study the change of the objective value versus the number of effective passes. One effective pass over the dataset means the whole dataset is visited once.
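To make the comparison metric concrete, here is a small helper (our own sketch; the per-epoch pass counts follow the description above) that converts epochs into effective passes so that both algorithms can be plotted on the same x-axis:

```python
def effective_passes(num_epochs, algorithm):
    """Effective passes over the whole dataset after num_epochs epochs.

    As described above, AsySVRG visits the dataset three times per epoch
    (the full-gradient pass plus the inner loop), while Hogwild! visits it once.
    """
    passes_per_epoch = {"asysvrg": 3, "hogwild": 1}
    return num_epochs * passes_per_epoch[algorithm]

# Example: after 5 epochs, AsySVRG has made 15 effective passes and Hogwild! only 5,
# so comparing objective values per epoch would unfairly favor AsySVRG.
print(effective_passes(5, "asysvrg"), effective_passes(5, "hogwild"))
```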

5.2 Results

In practice, we find that our AsySVRG algorithm without any lock strategy, denoted by AsySVRG-unlock, achieves the best performance. Table 2 shows the running time and speedup of the consistent reading, inconsistent reading, and unlock schemes for AsySVRG on the dataset rcv1. Here, 77.15s denotes 77.15 seconds, and 1.94x means the speedup is 1.94, i.e., it is 1.94 times faster than the sequential (one-thread) algorithm.

threads consistent reading inconsistent reading AsySVRG-unlock
2 77.15s/1.94x 77.20s/1.94x 137.55s/1.09x
4 62.20s/2.4x 51.06s/2.93x 58.07s/2.58x
8 63.05s/2.4x 53.93s/2.78x 30.49s/4.92x
10 64.76s/2.3x 56.29s/2.66x 26s/5.77x
Table 2: Lock versus unlock on rcv1 (running time in seconds / speedup)

We find that the consistent reading scheme has the worst performance. Hence, in the following experiments, we only report the results of the inconsistent reading scheme, denoted by AsySVRG-lock, and of AsySVRG-unlock.

Table 3 compares the time cost of AsySVRG and Hogwild! to reduce the gap below the stopping threshold with 10 threads. We can find that our AsySVRG is much faster than Hogwild!, either with or without a lock.

AsySVRG-lock AsySVRG-unlock Hogwild!-lock Hogwild!-unlock
rcv1 55.77 25.33 500 200
real-sim 42.20 21.16 400 200
news20 909.93 514.50 4000 2000
Table 3: Time (in seconds) taken by 10 threads to reduce the gap below the stopping threshold

Figure 1 shows the speedup and convergence rate on the three datasets. Here, AsySVRG-lock-10 denotes AsySVRG with the lock strategy on 10 threads. Similar notations are used for the other settings of AsySVRG and Hogwild!. We can find that the speedups of AsySVRG and Hogwild! are comparable. Combined with the results in Table 3, this means that Hogwild! is slower than AsySVRG across different numbers of threads. From Figure 1, we can also find that the convergence rate of AsySVRG is much faster than that of Hogwild!.

Figure 1: Experimental results on rcv1, real-sim, and news20. Left: speedup; right: convergence rate. Please note that in (b), (d) and (f), some curves overlap.

6 Conclusion

In this paper, we have proposed a novel asynchronous parallel SGD method, called AsySVRG, for multicore systems. Both theoretical and empirical results show that AsySVRG can outperform other state-of-the-art methods.

References

  • [1] John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Distributed dual averaging in networks. In Advances in Neural Information Processing Systems, pages 550–558, 2010.
  • [2] John C. Duchi and Yoram Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2899–2934, 2009.
  • [3] Cho-Jui Hsieh, Hsiang-Fu Yu, and Inderjit S Dhillon. Passcode: Parallel asynchronous stochastic dual co-ordinate descent. arXiv preprint arXiv:1504.01365, 2015.
  • [4] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
  • [5] Ji Liu, Stephen J Wright, Christopher Ré, Victor Bittorf, and Srikrishna Sridhar. An asynchronous parallel stochastic coordinate descent algorithm. arXiv preprint arXiv:1311.1873, 2013.
  • [6] Julien Mairal. Optimization with first-order surrogate functions. In Proceedings of the 30th International Conference on Machine Learning, pages 783–791, 2013.
  • [7] Atsushi Nitanda. Stochastic proximal gradient descent with acceleration techniques. In Advances in Neural Information Processing Systems, pages 1574–1582, 2014.
  • [8] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.
  • [9] Nicolas Le Roux, Mark W. Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2012.
  • [10] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14(1):567–599, 2013.
  • [11] Lin Xiao. Dual averaging method for regularized stochastic learning and online optimization. In Advances in Neural Information Processing Systems, pages 2116–2124, 2009.
  • [12] Martin Zinkevich, Alexander J. Smola, and John Langford. Slow learners are fast. In Advances in Neural Information Processing Systems, pages 2331–2339, 2009.
  • [13] Martin Zinkevich, Markus Weimer, Alexander J. Smola, and Lihong Li. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 2595–2603, 2010.

Appendix A Notations for Proof

We first give the proof of Theorem 1.

Proof.

According to (5), we obtain that

(14)

For the old gradient, we have

(15)

By the $L$-smoothness in Assumption 1, we have

Substituting the above inequalities into (A), we obtain

(16)

Since

Taking expectation and using Lemma 1, we obtain

The second inequality uses . In the first equality, we use the fact that . The last inequality uses the inequality

which has been proved in [4]. Then summing up from to , and taking or randomly choosing a to be , we can get

(17)

Then, we have

Of course, we need . ∎

We next give the proof of Theorem 2.

Proof.

According to (4.2), we have

(18)

Similar to the analysis of (15) in Theorem 1, we can get

The second inequality follows from the same analysis as in (13). The third inequality uses Lemma 2. The fourth inequality uses Lemma 3.

For convenience, we use and sum the above inequality from to , and take . Then, we obtain

which means that