Asynchronous Parallel Stochastic Quasi-Newton Methods

11/02/2020 · Qianqian Tong, et al. · University of Connecticut; Baidu, Inc.

Although first-order stochastic algorithms, such as stochastic gradient descent (SGD), have been the main force behind scaling up machine learning models such as deep neural nets, second-order quasi-Newton methods have started to draw attention due to their effectiveness in dealing with ill-conditioned optimization problems. The L-BFGS method is one of the most widely used quasi-Newton methods. We propose an asynchronous parallel algorithm for the stochastic quasi-Newton (AsySQN) method. Unlike prior attempts, which parallelize only the gradient calculation or the two-loop recursion of L-BFGS, our algorithm is the first to truly parallelize L-BFGS with a convergence guarantee. By adopting the variance reduction technique, a prior stochastic L-BFGS, which was not designed for parallel computing, attains a linear convergence rate. We prove that our asynchronous parallel scheme maintains the same linear convergence rate while achieving significant speedup. Empirical evaluations on both simulations and benchmark datasets demonstrate the speedup over the non-parallel stochastic L-BFGS, as well as better performance than first-order methods in solving ill-conditioned problems.


1 Introduction

With the immense growth of data in modern life, developing parallel or distributed optimization algorithms has become a well-established strategy in machine learning, as exemplified by the widely used stochastic gradient descent (SGD) algorithm and its variants zinkevich2010parallelized ; recht2011hogwild ; duchi2011adaptive ; defazio2014saga ; kingma2014adam ; reddi2015variance ; schmidt2017minimizing . Because gradient-based algorithms have deficiencies (e.g., zigzagging) on ill-conditioned optimization problems (elongated curvature), second-order algorithms that utilize curvature information have drawn research attention. The most well-known second-order algorithms are the quasi-Newton methods, particularly the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method and its limited-memory version (L-BFGS) dennis1977quasi ; nocedal1980updating ; dembo1982inexact ; liu1989limited ; bordes2009sgd ; wang2017stochastic ; mokhtari2018iqn ; bottou2018optimization ; karimireddy2018global ; marteau2019globally ; gao2019quasi ; kovalev2020fast ; jin2020non . Second-order methods have several advantages over first-order methods, such as a fast rate of local convergence (typically superlinear) dennis1974characterization and affine invariance (insensitivity to the choice of coordinates).

Stochastic algorithms have been extensively studied and have substantially improved the scalability of machine learning models. In particular, stochastic first-order methods have been a big success, for which convergence is guaranteed when the expectation of the stochastic gradient is assumed to be the true gradient. In contrast, second-order algorithms cannot directly use stochastic sampling techniques without losing curvature information. Several schemes have been proposed to develop stochastic versions of quasi-Newton methods. The stochastic quasi-Newton method (SQN) byrd2016stochastic uses independent large batches for updating the Hessian inverse. Stochastic block BFGS gower2016stochastic calculates a subsampled Hessian-matrix product instead of a Hessian-vector product to preserve more curvature information. These methods achieve a sub-linear convergence rate in the strongly convex setting. Using the variance reduction (VR) technique proposed in johnson2013accelerating , the convergence rate can be lifted to linear in the latest attempts moritz2016linearly ; gower2016stochastic . Later, acceleration strategies zhao2017stochastic combined VR, non-uniform mini-batch subsampling, and momentum calculation to derive a fast and practical stochastic algorithm. Another line of stochastic quasi-Newton studies focuses on self-concordant objective functions, which places stronger requirements on the shape of the objective but can also reach a linear convergence rate. Stochastic adaptive quasi-Newton methods for self-concordant functions have been proposed in zhou2017stochastic , where the step size can be computed analytically using only local information and adapts itself to the local curvature. meng2020fast gives a global convergence rate for a stochastic variant of the BFGS method combined with stochastic gradient descent. Other researchers use randomized BFGS kovalev2020fast as a variant of stochastic block BFGS and prove a linear convergence rate for self-concordant functions. However, so far none of these stochastic methods has been designed for parallel computing.

QN method                                      | Parallel scheme                   | Convergence rate
byrd2016stochastic                             | —                                 | sublinear
moritz2016linearly                             | —                                 | linear
gower2016stochastic                            | —                                 | linear
zhao2017stochastic                             | —                                 | linear
chen2014large                                  | parallel two-loop recursion       | —
berahas2016multi                               | map-reduce for gradient           | sublinear
bollapragada2018progressive                    | map-reduce for gradient           | linear
eisen2016decentralized ; eisen2017decentralized | parallel calculation for gradient | —
soori2020dave                                  | parallel calculation for Hessian  | superlinear
AsySQN (ours)                                  | parallel model for L-BFGS         | linear
Table 1: Comparison of various quasi-Newton methods in terms of their stochastic, parallel, and asynchronous frameworks, their convergence rates, and whether they use a variance reduction technique and the limited-memory BFGS update that make them work well for high-dimensional problems. Our algorithm AsySQN parallelizes the whole L-BFGS method in shared memory.

Parallel quasi-Newton methods have been explored in several directions. Map-reduce (vector-free L-BFGS) chen2014large has been used to parallelize the two-loop recursion (see more discussion in Section 2) in a deterministic way. The distributed L-BFGS najafabadi2017large focuses on implementing L-BFGS on a high-performance computing cluster (HPCC) platform, e.g., how to distribute data such that a full gradient or the two-loop recursion can be computed quickly. As a successful attempt at an algorithm that is both stochastic and parallel, multi-batch L-BFGS berahas2016multi uses map-reduce to compute both gradients and the L-BFGS updating rules. The idea of using overlapping sets to evaluate curvature information helps when a node failure is encountered; however, when the overlap is large, reshuffling and redistributing data among nodes may require costly communication. Progressive batching L-BFGS bollapragada2018progressive gradually increases the batch size, instead of using the full batch of the multi-batch scheme, to improve computational efficiency. Another line of work explores decentralized quasi-Newton methods (decentralized BFGS) eisen2016decentralized ; eisen2017decentralized . Taking the physical network structure into consideration, the nodes communicate only with their neighbors and perform local computations. Recently, distributed averaged quasi-Newton methods and their adaptive distributed variants (DAve-QN) have been implemented in soori2020dave ; they can be very efficient for low-dimensional problems because they need to read and write the whole Hessian matrix during updating. Notice that the analysis of the DAve-QN method imposes a strong requirement on the condition number of the objective function in Lemma 2 of soori2020dave , which is hard to satisfy on most real datasets. Moreover, when dealing with machine learning problems on big datasets, or considering the possibility of node failures, deterministic algorithms like the decentralized ones perform no better than stochastic ones.

1.1 Contributions

In this paper, we present not only a parallel but an asynchronous regime of the SQN method with the VR technique, which we call AsySQN. The comparison of our method against existing methods is given in Table 1. AsySQN is designed as a stochastic and parallel process for every step of computing the search directions. To the best of our knowledge, this is the first effort to implement SQN methods in an asynchronous parallel way, rather than merely applying local acceleration techniques. The basic idea of asynchrony has been explored for first-order methods, but it is not straightforward to apply it to stochastic quasi-Newton methods, and it is not even clear whether such a procedure converges. We provide a theoretical guarantee of a linear convergence rate for the proposed algorithm. Note that although there is another commonly used way of lifting the convergence rate, namely exploiting self-concordant functions, we choose the VR technique so that we can handle more general objective functions in our experiments. In addition, we remove the gradient-sparsity requirement of previous methods recht2011hogwild ; reddi2015variance . Although we focus on the L-BFGS method, the proposed framework can be applied to other quasi-Newton methods, including BFGS, DFP, and the Broyden family wright1999numerical . In our empirical evaluation, we show that the proposed algorithm is more computationally efficient than its non-parallel counterpart. Additional experiments show that AsySQN retains the characteristic advantage of quasi-Newton methods: it obtains a high-precision solution faster on ill-conditioned problems than first-order methods, without zigzagging.

1.2 Notations

The Euclidean norm of a vector $x$ is denoted by $\|x\|_2$; without loss of generality, we omit the subscript and write $\|x\|$. We use $x^*$ to denote the global optimal solution of Eq. (2), and for simplicity we write $F^* = F(x^*)$ for the corresponding optimal value. The notation $\mathbb{E}[\cdot]$ means taking the expectation with respect to all random variables, and $I$ denotes the identity matrix. The condition number of a matrix $A$ is defined by $\kappa(A) = \sigma_{\max}(A)/\sigma_{\min}(A)$, where $\sigma_{\max}(A)$ and $\sigma_{\min}(A)$ denote, respectively, the largest and the smallest singular values of $A$. For square matrices $A$ and $B$ of the same size, $A \preceq B$ iff $B - A$ is positive semidefinite. Given two integers $a$ and $b$ and a positive integer $m$, we write $a \equiv b \ (\mathrm{mod}\ m)$ if there exists an integer $k$ satisfying $a - b = km$.

2 Algorithm

In this paper, we consider the following finite-sum problem:

$F(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x),$   (1)

where $\{z_i\}_{i=1}^{n}$ denote the training examples, $f$ is a loss function that is parametrized by a vector $x \in \mathbb{R}^d$ which is to be determined in the optimization process, and $f_i(x) = f(x; z_i)$ is the loss incurred on the example $z_i$. Most machine learning models (with model parameter $x$) can be obtained by solving the following minimization problem:

$\min_{x \in \mathbb{R}^d} F(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x).$   (2)

We aim to find an approximate solution $\hat{x}$ that satisfies:

$\mathbb{E}\left[F(\hat{x}) - F(x^*)\right] \le \epsilon,$   (3)

where $F(x^*)$ is the global minimum, attained at an optimal solution $x^*$ if it exists, $\epsilon$ is the targeted accuracy, and $\mathbb{E}[\cdot]$ means taking the expectation of $F(\hat{x}) - F(x^*)$. We require a solution of high precision, i.e., a small $\epsilon$.
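To make the setup concrete, here is a minimal sketch that instantiates the finite-sum objective (1) with an $\ell_2$-regularized logistic-regression loss; the loss choice, the function names, and the labels $b_i \in \{-1,+1\}$ are illustrative assumptions of ours, not specifics from the paper.

import numpy as np

def f_i(x, a_i, b_i, lam=1e-4):
    # Per-example loss: l2-regularized logistic loss (an illustrative choice of f_i).
    return np.log(1.0 + np.exp(-b_i * a_i.dot(x))) + 0.5 * lam * x.dot(x)

def grad_f_i(x, a_i, b_i, lam=1e-4):
    # Gradient of the per-example loss with respect to x (labels b_i in {-1, +1}).
    sigma = 1.0 / (1.0 + np.exp(-b_i * a_i.dot(x)))
    return -(1.0 - sigma) * b_i * a_i + lam * x

def F(x, A, b):
    # Finite-sum objective (1): the average loss over all n examples.
    return np.mean([f_i(x, A[i], b[i]) for i in range(len(b))])

def grad_F(x, A, b):
    # Full gradient of F; this is the anchor gradient used by the VR estimator later.
    return np.mean([grad_f_i(x, A[i], b[i]) for i in range(len(b))], axis=0)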

1:Input: Initial $x_0$, step size $\eta$, subsample sizes $|S|$ and $|T|$, parameters $M$ (memory size), $q$ (schedule period), $m$ (inner iterations per thread), and threads number $K$.
2:Initialize the correction-pair memory $\{(s_j, y_j)\} = \emptyset$
3:Initialize parameter $x = x_0$ in shared memory
4:for epoch $r = 0, 1, 2, \dots$ do
5:      Schedule-update (Algorithm 2)
6:     for each thread $k = 1, \dots, K$, in parallel do
7:         for $t = 1$ to $m$ do
8:               Sample a mini-batch $S \subseteq \{1, \dots, n\}$
9:              Read $x$ from shared memory as $\hat{x}$
10:              Compute $\nabla f_S(\hat{x})$
11:               Compute $\nabla f_S(\tilde{x})$
12:              Calculate the variance-reduced gradient: $v = \nabla f_S(\hat{x}) - \nabla f_S(\tilde{x}) + \nabla F(\tilde{x})$
13:              if $r == 0$ then
14:                  $\hat{x}^{+} = \hat{x} - \eta v$
15:              else
16:                  $\hat{x}^{+} = \hat{x} - \eta p$,
17:                  where the search direction $p$ is computed by Two-loop-recursion($v$, $\{s_j\}$, $\{y_j\}$, $M$) (Algorithm 3)
18:              end if
19:              Write $\hat{x}^{+}$ to $x$ in shared memory
20:         end for
21:     end for
22:      Sample $T \subseteq \{1, \dots, n\}$
23:     Set $u_r$ to the average (or the latest) of the iterates of this epoch
24:     $s_r = u_r - u_{r-1}$
25:      Option I: $y_r = \nabla f_T(u_r) - \nabla f_T(u_{r-1})$
26:      Option II: $y_r = \nabla^2 f_T(u_r)\, s_r$
27:end for
Algorithm 1 AsySQN
1:Input: epoch index $r$, schedule period $q$, current iterate $x$ from shared memory.
2:Output: snapshot $\tilde{x}$ and full gradient $\nabla F(\tilde{x})$
3:if $r$ mod($q$) == 0 then
4:     $\tilde{x} = x$
5:     $\nabla F(\tilde{x}) = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(\tilde{x})$
6:end if
Algorithm 2 Schedule-update
1:Input: variance-reduced gradient $v$, correction pairs $\{(s_j, y_j)\}_{j=r-M+1}^{r}$, memory size $M$, where $\rho_j = 1/(y_j^{\top} s_j)$.
2:Output: search direction $p \approx H_r v$
3:Initialize $p = v$
4:for $j = r, r-1, \dots, r-M+1$ do
5:     $\alpha_j = \rho_j\, s_j^{\top} p$
6:     $p = p - \alpha_j y_j$
7:end for
8:$p = \frac{s_r^{\top} y_r}{y_r^{\top} y_r}\, p$
9:for $j = r-M+1, \dots, r-1, r$ do
10:     $\beta = \rho_j\, y_j^{\top} p$
11:     $p = p + (\alpha_j - \beta)\, s_j$
12:end for
Algorithm 3 Two-loop-recursion ($v$, $\{s_j\}$, $\{y_j\}$, $M$)

2.1 Review of the stochastic quasi-Newton method

According to the classical L-BFGS method wright1999numerical , given the current and the last iterates $x_r$ and $x_{r-1}$ and their corresponding gradients, we define the correction pairs:

$s_r = x_r - x_{r-1}, \qquad y_r = \nabla F(x_r) - \nabla F(x_{r-1}),$   (4)

and define $\rho_r = 1/(y_r^{\top} s_r)$ and $V_r = I - \rho_r y_r s_r^{\top}$. Then the new Hessian inverse approximation $H_r$ can be uniquely updated based on the last estimate $H_{r-1}$:

$H_r = V_r^{\top} H_{r-1} V_r + \rho_r s_r s_r^{\top}.$   (5)

When datasets are high dimensional, i.e., $d$ is a large number, the Hessian and Hessian inverse matrices are $d \times d$ and possibly dense, so storing and directly updating $H_r$ may be computationally prohibitive. Instead of forming fully dense matrices, limited-memory methods only save the latest information (say, the most recent $M$ correction pairs) to build a Hessian inverse approximation. Given the initial guess $H_r^{0}$, which is usually set to a (scaled) identity matrix, the new Hessian inverse approximation at the $r$-th iteration can be obtained by repeated application of formula (5),

$H_r = (V_r^{\top} \cdots V_{r-M+1}^{\top})\, H_r^{0}\, (V_{r-M+1} \cdots V_r) + \rho_{r-M+1} (V_r^{\top} \cdots V_{r-M+2}^{\top})\, s_{r-M+1} s_{r-M+1}^{\top}\, (V_{r-M+2} \cdots V_r) + \cdots + \rho_r s_r s_r^{\top}.$   (6)

Based on this updating rule, a two-loop recursion procedure has been derived: the first for-loop multiplies all the left-hand-side vectors and the second for-loop multiplies the right-hand-side ones. In practice, if $v$ denotes the current variance-reduced gradient, we directly calculate the Hessian-vector product $H_r v$ to find the search direction rather than actually computing and storing $H_r$. Details are given in Algorithm 3.
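For reference, the two-loop recursion of Algorithm 3 can be sketched in a few lines of Python; the function name and the scaling choice $H_r^{0} = \frac{s_r^{\top} y_r}{y_r^{\top} y_r} I$ for the initial Hessian inverse guess are our own conventions (the scaling is a common default rather than something fixed here).

import numpy as np

def two_loop_recursion(v, s_list, y_list):
    # Approximate p = H_r v from the stored correction pairs (s_j, y_j)
    # without ever forming the d x d matrix H_r; a sketch, names illustrative.
    rho = [1.0 / y.dot(s) for s, y in zip(s_list, y_list)]
    alpha = [0.0] * len(s_list)
    p = v.copy()
    # First loop: newest pair to oldest, applying the left-hand-side factors.
    for j in reversed(range(len(s_list))):
        alpha[j] = rho[j] * s_list[j].dot(p)
        p -= alpha[j] * y_list[j]
    # Initial guess H_r^0 = (s^T y / y^T y) I, a common scaling choice.
    s, y = s_list[-1], y_list[-1]
    p *= s.dot(y) / y.dot(y)
    # Second loop: oldest pair to newest, applying the right-hand-side factors.
    for j in range(len(s_list)):
        beta = rho[j] * y_list[j].dot(p)
        p += (alpha[j] - beta) * s_list[j]
    return p  # search direction, applied as x <- x - eta * p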

2.2 The proposed quasi-Newton scheme

Algorithm 1 outlines the main steps of our AsySQN method, where $r$ indexes the epochs, $k$ indexes the processors, and $t$ indexes the iterations in each epoch within each processor. Following the same sampling strategy as in byrd2016stochastic , we use two separate subsets $S$ and $T$ of the data to estimate the gradient (Lines 8-11) and the Hessian inverse (Lines 22-24), respectively. At each iteration in an epoch, $S$ is used to update the gradient, whereas the Hessian inverse is updated once per epoch using a subsample $T$. In order to better preserve the curvature information, the size of the subsample $T$ should be relatively large, whereas the subsample $S$ should be smaller to reduce the overall computation.

Figure 1: The key asynchronous parallel steps are identified as follows: 1. Subsample $S$; 2. Read $x$ from the shared memory; 3. Compute the stochastic gradients $\nabla f_S(\hat{x})$ and $\nabla f_S(\tilde{x})$ and the variance-reduced gradient $v$; 4. Compute the two-loop recursion and update the iterate; 5. Write the iterate to the shared memory; 6. Subsample $T$ and update the correction pairs. Before these steps, there is another step that computes the full gradient for all threads, i.e., 0. Calculate the full gradient. The figure shows the asynchronous parallel updating of the threads, where $t$ is a global timer in the epoch recording the number of times that $x$ has been updated. For example, when one thread writes the update from $x^{t}$ to $x^{t+1}$ into the shared memory, the other threads may still be in the middle of their calculations using older states of $x$; the state that was actually read to produce the update at time $t$ is denoted $x^{d(t)}$, so the delayed search direction computed from $x^{d(t)}$ is used to update $x^{t}$ to $x^{t+1}$, and the delay $t - d(t)$ is bounded by the maximum delay $\tau$. A similar analysis applies to each state of $x$.

The algorithm requires several pre-defined parameters: the step size $\eta$, the limited memory size $M$, integers $m$ (the number of inner iterations per thread) and $q$ (the schedule period), the number of threads $K$, and the batch sizes $|S|$ and $|T|$. We need the parameter $q$ because we only compute the full gradient of the objective function at the snapshot iterate $\tilde{x}$ every $q$ epochs, which helps reduce the computational cost. Algorithm 1 can be split hierarchically into two parts: an inner loop (iterations) and an outer loop (epochs). The inner loop runs asynchronous parallel processes where each thread updates $x$ concurrently, and the epoch loop includes the inner loops and an update of the curvature information.
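The control flow just described can be pictured with the following serial (single-thread) skeleton, reusing grad_F, grad_f_i, and two_loop_recursion from the sketches above; parameter names and default values are illustrative, and the asynchronous inner loop is shown separately in Section 2.3.

import numpy as np

def sqn_vr_serial(x0, A, b, eta=0.1, M=10, m=50, q=1, num_epochs=20,
                  batch=16, hess_batch=256):
    # Serial skeleton of stochastic L-BFGS with variance reduction (no parallelism).
    n = len(b)
    x = x0.copy()
    s_list, y_list = [], []
    u_prev = x.copy()
    for r in range(num_epochs):
        if r % q == 0:                       # Schedule-update (Algorithm 2)
            x_tilde = x.copy()
            g_tilde = grad_F(x_tilde, A, b)
        for t in range(m):                   # inner loop; run by K threads in AsySQN
            S = np.random.choice(n, min(batch, n), replace=False)
            v = (np.mean([grad_f_i(x, A[i], b[i]) for i in S], axis=0)
                 - np.mean([grad_f_i(x_tilde, A[i], b[i]) for i in S], axis=0)
                 + g_tilde)
            p = v if (r == 0 or not s_list) else two_loop_recursion(v, s_list[-M:], y_list[-M:])
            x = x - eta * p
        T = np.random.choice(n, min(hess_batch, n), replace=False)  # curvature subsample
        u_r = x.copy()
        s_r = u_r - u_prev
        y_r = (np.mean([grad_f_i(u_r, A[i], b[i]) for i in T], axis=0)
               - np.mean([grad_f_i(u_prev, A[i], b[i]) for i in T], axis=0))  # Option I
        if y_r.dot(s_r) > 1e-10:             # keep only curvature-positive pairs
            s_list.append(s_r)
            y_list.append(y_r)
        u_prev = u_r
    return x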

In our implementation, at the $r$-th epoch, the vector $s_r$ (Line 24) is the difference between the two iterates obtained before and after the (parallel) inner iterations, where $u_r$ (Line 23) can be the average or the latest of the iterates within the epoch. The vector $y_r$ represents the difference between two gradients computed at $u_r$ and $u_{r-1}$ using the data subsample $T$: $y_r = \nabla f_T(u_r) - \nabla f_T(u_{r-1})$. By the mean value theorem, there exists a $\xi$ between $u_{r-1}$ and $u_r$ such that $y_r = \nabla^2 f_T(\xi)\, s_r$. The two options (Line 25 and Line 26) of computing $y_r$ are both frequently used in quasi-Newton methods. In our numerical experiments, we run both options for comparison and show that Option II (Line 26) is more stable when the length of $s_r$, $\|s_r\|$, becomes small.
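A hedged sketch of the two choices for the curvature vector $y_r$ on the subsample $T$: Option I differences subsampled gradients at $u_r$ and $u_{r-1}$, while Option II applies the subsampled Hessian at $u_r$ to $s_r$ as a Hessian-vector product. The code is written for the illustrative logistic loss above with labels in $\{-1,+1\}$; other losses would change the curvature weight.

import numpy as np

def y_option_I(u_r, u_prev, A, b, T):
    # Option I: y_r = grad f_T(u_r) - grad f_T(u_{r-1}).
    g_new = np.mean([grad_f_i(u_r, A[i], b[i]) for i in T], axis=0)
    g_old = np.mean([grad_f_i(u_prev, A[i], b[i]) for i in T], axis=0)
    return g_new - g_old

def y_option_II(u_r, s_r, A, b, T, lam=1e-4):
    # Option II: y_r = (subsampled Hessian at u_r) @ s_r, computed as a
    # Hessian-vector product without forming the d x d matrix.
    hv = np.zeros_like(s_r)
    for i in T:
        z = b[i] * A[i].dot(u_r)
        sigma = 1.0 / (1.0 + np.exp(-z))
        w = sigma * (1.0 - sigma)            # logistic-loss curvature weight
        hv += w * A[i].dot(s_r) * A[i]
    return hv / len(T) + lam * s_r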

Stochastic gradient descent methods generally have difficulty achieving a solution of high precision, but they can reduce the objective function quickly at the beginning. Thus, a desirable strategy is to quickly find the neighborhood of the optimum using a first-order (gradient-based) method and then switch to a second-order method. In practice, we warm start with SVRG (not shown in Algorithm 1), and after a certain number of iterations of quick descent, we switch to the SQN method to efficiently obtain a more accurate optimal value.
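The warm-start switch amounts to a few lines; in this sketch, svrg_epoch is a hypothetical SVRG helper (not defined in the paper), and the number of warm-up epochs is a tuning choice of ours.

def warm_start_then_sqn(x0, A, b, num_warmup=5, num_epochs=20):
    # Phase 1: a few first-order SVRG epochs for fast initial progress.
    x = x0.copy()
    for _ in range(num_warmup):
        x = svrg_epoch(x, A, b)          # hypothetical first-order helper
    # Phase 2: hand the iterate to the stochastic L-BFGS routine sketched earlier.
    return sqn_vr_serial(x, A, b, num_epochs=num_epochs - num_warmup)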

2.3 The proposed asynchronous parallel quasi-Newton

We propose a multi-threaded process in which each thread makes stochastic updates to a centrally stored vector $x$ (kept in shared memory) without waiting, similar to the processes used in asynchronous stochastic gradient descent (AsySGD) recht2011hogwild ; lian2015asynchronous , asynchronous stochastic coordinate descent (AsySCD) liu2015asynchronous ; hsieh2015passcode , and asynchronous SVRG (AsySVRG) reddi2015variance ; zhao2016fast . In this multi-threaded process, there are $K$ threads; each thread reads the latest $x$ from the shared memory, calculates the next iterate concurrently, and then writes the iterate back into the shared memory.

We use the example in Figure 1 to illustrate this parallel algorithm. At the beginning of each epoch, the initial $x^{0}$ (for notational convenience) is the current $x$ from the shared memory. When a thread finishes computing the next iterate, it immediately writes it into the shared memory. With modern memory techniques, locks guarantee that only one thread at a time can write to the shared memory. Let $t$ be a global timer that advances each time the shared memory is updated. In an asynchronous algorithm, when one thread is updating $x^{t}$ to $x^{t+1}$, another thread may still use an old state to compute its search direction and update. Let $x^{d(t)}$ be the particular state of $x$ held by a thread when the shared memory is updated at time $t$; specifically, $x^{d(t)}$ is the state that has been used to update $x^{t}$ to $x^{t+1}$. We assume that the maximum delay among the processes is bounded by $\tau$, that is, $t - d(t) \le \tau$. When a thread updates $x^{t}$ to $x^{t+1}$, the actual state of $x$ that has been used within this thread to calculate the search direction is $x^{d(t)}$. Hence the delay within the thread is $t - d(t)$, and obviously $t - d(t) \le \tau$. Considering that the maximum number of iterations in each epoch is $Km$, we have $\tau \le Km$.
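The shared-memory mechanics can be mimicked with the toy Python sketch below (reusing grad_f_i and two_loop_recursion from above): worker threads repeatedly read a possibly stale iterate, form the variance-reduced search direction, and write back under a lock. This is an illustration of the update pattern only, not the authors' implementation; a real speedup would require a runtime without Python's global interpreter lock.

import threading
import numpy as np

def asynchronous_inner_loop(x_shared, lock, A, b, x_tilde, g_tilde,
                            s_list, y_list, eta=0.1, m=50, batch=16, K=4):
    # Run m variance-reduced quasi-Newton steps on each of K threads without
    # synchronization; each thread may work from a stale iterate (bounded delay).
    n = len(b)

    def worker():
        for _ in range(m):
            with lock:                       # snapshot the current shared iterate
                x_hat = x_shared.copy()
            S = np.random.choice(n, min(batch, n), replace=False)
            v = (np.mean([grad_f_i(x_hat, A[i], b[i]) for i in S], axis=0)
                 - np.mean([grad_f_i(x_tilde, A[i], b[i]) for i in S], axis=0)
                 + g_tilde)
            p = two_loop_recursion(v, s_list, y_list) if s_list else v
            with lock:                       # only one writer at a time
                x_shared -= eta * p

    threads = [threading.Thread(target=worker) for _ in range(K)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return x_shared

Because each worker computes its step from the copy it read earlier, the write at the end may be based on a stale state of $x$, which is exactly the bounded-delay behavior analyzed in Section 4.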

In summary, we design an asynchronous parallel stochastic quasi-Newton method, named AsySQN, to improve computational efficiency. In our experiments (Section 5), AsySQN shows an obvious speedup over the original stochastic quasi-Newton method with variance reduction (SQN-VR) due to the asynchronous parallel setting. Meanwhile, it still enjoys the advantages of common second-order methods: stability and accuracy. Furthermore, in some machine learning problems it is not easy to normalize the data, or more elaborate transformations would be needed to change the ill-conditioned behavior of the objective function. In such cases, AsySQN can perform better than AsySVRG or other first-order algorithms.

3 Preliminaries

3.1 Definitions

Each epoch consists of $Km$ inner iterations. At the beginning of each epoch, the Schedule-update rule is invoked (refreshing the snapshot $\tilde{x}$ and the full gradient every $q$ epochs), and for brevity we use $t$ to index the iterations within the $r$-th epoch.

At the $t$-th iteration of the $r$-th epoch:

The updating formula is $x^{t+1} = x^{t} - \eta\, H_r\, v^{t}$ (with the delayed gradient $v^{d(t)}$ in place of $v^{t}$ in the asynchronous setting), where the step size $\eta$ determines how far $x$ moves; here we do not require a diminishing step size to guarantee the convergence rate.

Stochastic variance-reduced gradient: $v^{t} = \nabla f_S(x^{t}) - \nabla f_S(\tilde{x}) + \nabla F(\tilde{x})$, where $S$ is a subset randomly chosen from $\{1, \dots, n\}$. Easily, we get the expectation of the stochastic variance-reduced gradient: $\mathbb{E}[v^{t}] = \nabla F(x^{t})$.

The delayed stochastic variance-reduced (VR) gradient used in the asynchronous setting is $v^{d(t)} = \nabla f_S(x^{d(t)}) - \nabla f_S(\tilde{x}) + \nabla F(\tilde{x})$. As the full gradient $\nabla F(\tilde{x})$ has been computed beforehand, the gradient is estimated at $x^{d(t)}$, whereas the current iterate is $x^{t}$ due to the delay. Taking the expectation of the delayed gradient, we get $\mathbb{E}[v^{d(t)}] = \nabla F(x^{d(t)})$.

3.2 Assumptions

We make the same assumptions as the non-asynchronous version of the SQN method that also uses the VR technique moritz2016linearly .

Assumption 1.

Each loss function $f_i$ is strongly convex; that is, there exists $\mu > 0$ such that for all $x, y \in \mathbb{R}^d$,

$f_i(y) \ \ge\ f_i(x) + \nabla f_i(x)^{\top}(y - x) + \frac{\mu}{2}\|y - x\|^2.$

Assumption 2.

Each loss function $f_i$ has an $L$-Lipschitz continuous gradient: for all $x, y \in \mathbb{R}^d$,

$\|\nabla f_i(x) - \nabla f_i(y)\| \ \le\ L\,\|x - y\|.$

Since averaging the loss functions $f_i$ preserves continuity and convexity, the objective function $F$ also satisfies these two assumptions, and we can then easily derive the following lemmas.
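As a quick check (a standard argument, included only for completeness), the average inherits both bounds from its components:

$F(y) = \frac{1}{n}\sum_{i=1}^{n} f_i(y) \ \ge\ \frac{1}{n}\sum_{i=1}^{n}\Big[f_i(x) + \nabla f_i(x)^{\top}(y-x) + \tfrac{\mu}{2}\|y-x\|^2\Big] \ =\ F(x) + \nabla F(x)^{\top}(y-x) + \tfrac{\mu}{2}\|y-x\|^2,$

$\|\nabla F(x) - \nabla F(y)\| \ \le\ \frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(x) - \nabla f_i(y)\| \ \le\ L\,\|x-y\|,$

so $F$ is $\mu$-strongly convex with an $L$-Lipschitz gradient whenever every $f_i$ is.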

Lemma 3.1.

Suppose that Assumptions 1 and 2 hold. Then the Hessian matrix of $F$ is bounded by the two positive constants $\mu$ and $L$, such that

$\mu I \ \preceq\ \nabla^2 F(x) \ \preceq\ L I \quad \text{for all } x.$

The second-order information of the objective function is thus bounded by $\mu$ and $L$; therefore all Hessian matrices share the same bounds, and the condition number of the objective function is $\kappa = L/\mu$.

Lemma 3.2.

Suppose that Assumptions 1 and 2 hold, and let $H_r$ be the Hessian inverse approximation generated by the limited-memory update. Then for all $r$ there exist constants $0 < \gamma \le \Gamma$ such that $H_r$ satisfies

$\gamma I \ \preceq\ H_r \ \preceq\ \Gamma I,$

where $\gamma$ and $\Gamma$ depend on $\mu$, the Lipschitz constant $L$, the limited memory size $M$, and the length $d$ of $x$.

The proof can be found in moritz2016linearly . Similarly, all generated Hessian inverse matrices share the same condition number bound $\kappa(H_r) \le \Gamma/\gamma$. Obviously, the ill-conditioning of the Hessian inverse approximation can be amplified. For large-scale datasets, the shape of the objective function can be extremely ill-conditioned. Thus, how to achieve good performance in this ill-conditioned situation deserves attention. As a second-order method, our AsySQN still enjoys a fast convergence rate in ill-conditioned situations (see the main Theorem 4.6 and Corollary 4.6.1). We also conduct extensive experiments demonstrating this property in our simulation studies.

4 Convergence Analysis

In the asynchronous parallel setting, all working threads read and update $x$ without synchronization. This causes delayed updates for some threads; hence the stochastic variance-reduced gradients may not be computed at the latest iterate of $x$. Here, we use the delayed gradient $v^{d(t)}$ defined above, so each thread uses its own delayed gradient to update $x$.

Lemma 4.1.

In one epoch, the delay of the parameters read from the shared memory can be bounded by:

Proof.

From Lemma 3.2 and the updating formula,

Lemma 4.2.

Suppose that Assumption 2 holds. Then the delay in the stochastic VR gradient of the objective function satisfies:

Proof.

From Assumption 2 and Lemmas 3.2 and 4.1,

Lemma 4.3.

Suppose that Assumption 1 holds, and let $x^*$ denote the unique minimizer of the objective function $F$. Then for any $x$ we have

$\|\nabla F(x)\|^2 \ \ge\ 2\mu\,\big(F(x) - F(x^*)\big).$

Proof.

From Assumption 1, for any $x$ and $y$,

$F(y) \ \ge\ F(x) + \nabla F(x)^{\top}(y - x) + \frac{\mu}{2}\|y - x\|^2.$

Letting $y = x - \frac{1}{\mu}\nabla F(x)$, the quadratic function on the right-hand side achieves its minimum, which gives $F(x^*) \ge F(x) - \frac{1}{2\mu}\|\nabla F(x)\|^2$ and hence the claim. ∎

Lemma 4.4.

Suppose that Assumptions 1 and 2 hold, and let $F^*$ be the optimal value of Eq. (2). Considering the stochastic variance-reduced gradient, we have:

$\mathbb{E}\big[\|v^{t}\|^2\big] \ \le\ 4L\,\big(F(x^{t}) - F^* + F(\tilde{x}) - F^*\big).$

For the proof, see johnson2013accelerating .

Theorem 4.5.

Suppose that Assumptions 1 and 2 hold. Then the accumulated gradient delay in an epoch can be bounded in the following way:

Proof.

Summing over all iterations in the epoch, we get

By rearranging, we obtain the inequality above. ∎

Theorem 4.6 (Main theorem).

Suppose that Assumptions 1 and 2 hold, and that the step size $\eta$ and the epoch size are chosen such that the following condition holds:

where the contraction factor is specified at the end of the proof. Then for all $r$ we have

Proof.

Let’s start from the Lipschitz continuous condition:

Taking expectation on both sides,

From the above, we can derive the following. Thus,

The first inequality comes from the Lipschitz continuity condition and Lemma 4.3.

Substituting this in the inequality above, we get the following result:

The last inequality comes from the fact that

Summing over all iterations in the epoch, we get

The second inequality uses the assumption that the delay within one epoch is bounded. The third inequality uses the result of Theorem 4.5; hence all delayed stochastic VR gradients can be bounded by the (non-delayed) stochastic VR gradients. The fourth inequality follows from Lemma 4.4. In the epoch-based setting,

Thus the above inequality gives

Rearranging the above gives

Hence, we get the following result:

where

From Theorem 4.6 we can see that the step size $\eta$ should satisfy the stated condition to ensure that the result holds. Therefore, we obtain an upper bound on the step size. By substituting the result of Lemma 3.2, we can easily derive
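For orientation, a hedged sketch of the shape such a bound takes: SVRG-style analyses of this kind conclude with a per-epoch contraction of the form

$\mathbb{E}\big[F(\tilde{x}_{r+1}) - F^*\big] \ \le\ \rho\; \mathbb{E}\big[F(\tilde{x}_{r}) - F^*\big], \qquad 0 < \rho < 1,$

where the factor $\rho$ collects the dependence on the step size $\eta$, the epoch length, the maximum delay $\tau$, the eigenvalue bounds $\gamma, \Gamma$ of $H_r$, and the constants $\mu, L$; iterating over epochs yields the linear (geometric) convergence rate claimed for AsySQN.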