With the immense growth of data in modern life, developing parallel or distributed optimization algorithms has become a well-established strategy in machine learning, such as the widely used stochastic gradient descent (SGD) algorithm and its variants zinkevich2010parallelized ; recht2011hogwild ; duchi2011adaptive ; defazio2014saga ; kingma2014adam ; reddi2015variance ; schmidt2017minimizing . Because gradient-based algorithms have deficiencies (e.g., zigzagging) with ill-conditioned optimization problems (elongated curvature), second-order algorithms that utilize curvature information have drawn research attention. The most well-known second-order algorithm is a set of quasi-Newton methods, particularly the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method and its limited memory version (L-BFGS) dennis1977quasi ; nocedal1980updating ; dembo1982inexact ; liu1989limited ; bordes2009sgd ; wang2017stochastic ; mokhtari2018iqn ; bottou2018optimization ; karimireddy2018global ; marteau2019globally ; gao2019quasi ; kovalev2020fast ; jin2020non . Second-order methods have several advantages over first-order methods, such as fast rate of local convergence (typically superlinear) dennis1974characterization , and affine invariance (not sensitive to the choice of coordinates).
Stochastic algorithms have been extensively studied and substantially improved the scalability of machine learning models. Particularly, stochastic first-order methods have been a big success for which convergence is guaranteed when assuming the expectation of stochastic gradient is the true gradient. In contrast, second-order algorithms cannot directly use stochastic sampling techniques without losing the curvature information. Several schemes have been proposed to develop stochastic versions of quasi-Newton methods. Stochastic quasi-Newton method (SQN) byrd2016stochastic uses independent large batches for updating Hessian inverse. Stochastic block BFGS gower2016stochastic
calculates subsampled Hessian-matrix product instead of Hessian-vector product to preserve more curvature information. These methods achieve a sub-linear convergence rate in the strongly convex setting. Using the variance reduction (VR) technique proposed injohnson2013accelerating , the convergence rate can be lifted up to linear in the latest attempts moritz2016linearly ; gower2016stochastic . Later, acceleration strategies zhao2017stochastic are combined with VR, non-uniform mini-batch subsampling, momentum calculation to derive a fast and practical stochastic algorithm. Another line of stochastic quasi-Newton studies tries to focus on solving self-concordant functions, which requires more on the shape or property of objective functions, can reach a linear convergence rate. Stochastic adaptive quasi-Newton methods for self-concordant functions have been proposed in zhou2017stochastic , where the step size can be computed analytically, using only local information, and adapts itself to the local curvature. meng2020fast gives a global convergence rate for a stochastic variant of the BFGS method combined with stochastic gradient descent. Other researchers use randomized BFGS kovalev2020fast as a variant of stochastic block BFGS and prove the linear convergence rate under self-concordant functions. However, so far all these stochastic methods have not been designed for parallel computing.
|QN methods||Stochastic||Parallel||Asy||VR||High dimensional||Convergence|
|chen2014large||parallel two-loop recursion|
|berahas2016multi||map reduce for gradient||sublinear|
|bollapragada2018progressive||map reduce for gradient||linear|
|eisen2016decentralized ; eisen2017decentralized||parallel calculation for gradient|
|soori2020dave||parallel calculation for Hessian||superlinear|
|AsySQN||parallel model for L-BFGS||linear|
Parallel quasi-Newton methods have been explored in several directions: map-reduce (vector-free L-BFGS) chen2014large has been used to parallelize the two-loop recursion (See more discussion in Section 2) in a deterministic way; the distributed L-BFGS najafabadi2017large is focused on the implementation of L-BFGS over high performance computing cluster (HPCC) platform, e.g. how to distribute data such that a full gradient or the two-loop recursion can be calculated fast. As a successful trial to create both stochastic and parallel algorithms, multi-batch L-BFGS berahas2016multi uses map-reduce to compute both gradients and updating rules for L-BFGS. The idea of using overlapping sets to evaluate curvature information helps if a node failure is encountered. However, the size of overlapping is large, reshuffling and redistributing data among nodes may need costly communications. Progressive batching L-BFGS bollapragada2018progressive gradually increases the batch size instead of using the full batch in the multi-batch scheme to improve the computational efficiency. Another line of work explores the decentralized quasi-Newton methods (Decentralized BFGS) eisen2016decentralized ; eisen2017decentralized . Taking physical network structure into consideration, the nodes communicate only with their neighbors and perform local computations. Recently, the distributed averaged quasi-Newton methods and the adaptive distributed variants (DAve-QN) have been implemented in soori2020dave , which can be very efficient for low dimensional problems because that they need to read and write the whole Hessian matrix during updating. Notice that in the analysis of DAve-QN method, they have a strong requirement for condition number of the objective functions in Lemma 2 of soori2020dave , which is hard to satisfied in most real datasets. However, when dealing with machine learning problems under big datasets, or considering the possibility of node failures, deterministic algorithms like the decentralized ones perform no better than stochastic ones.
In this paper, we present not only parallel but an asynchronous regime of SQN method with the VR technique, which we call AsySQN. The comparison of our method against existing methods is illustrated in Table 1. AsySQN aims to design a stochastic and parallel process for every step during calculating the search directions. To the best of our knowledge, this is the first effort in implementing the SQN methods in an asynchronous parallel way instead of directly using some local acceleration techniques. The basic idea of asynchronous has been explored by first-order methods, but it is not straightforward to apply to stochastic quasi-Newton methods, and not even clear if such a procedure converges. We provide a theoretical guarantee of a linear convergence rate for the proposed algorithm. Notice that even though there is another commonly used way of lifting the convergence rate, by exploring the self-concordant functions, we choose to use the VR technique so that we can have more general objective functions when conducting experiments. In addition, we remove the requirement of gradients sparsity in previous methods recht2011hogwild ; reddi2015variance . Although we focus on the L-BFGS method, the proposed framework can be applied to other quasi-Newton methods including BFGS, DFP, and the Broyden family wright1999numerical . In our empirical evaluation, we show that the proposed algorithm is more computational efficient than its non-parallel counterpart. Additional experiments are conducted to show that AsySQN maintains the property of a quasi-Newton method that is to obtain a solution with high precision faster on ill-conditioned problems than first-order methods without zigzagging.
The Euclidean norm of vector is denoted by , and without loss of generality, we omit the subscript and use . We use to denote the global optimal solution of Eq.(2). For simplicity, we also use to be the corresponding optimal value . The notation
means taking the expectation in terms of all random variables, and
denotes the identity matrix. The condition number of a matrixis defined by , where and
denote, respectively, the largest and the smallest singular values of. For square matrices and of the same size, , iff is positive semidefinite. Given two numbers (integers), (positive intergers), , if there exists an integer satisfying .
In this paper, we consider the following finite-sum problem :
where denote the training examples,
is a loss function that is parametrized bywhich is to be determined in the optimization process, and is the loss occurred on . Most machine learning models (with model parameter ) can be obtained by solving the following minimization problem:
We aim to find an approximate solution, , if satisfies:
where is the global minimum if exists and is an optimal solution, is the targeted accuracy, and means to take the expectation of . We require a high precision solution, e.g., .
2.1 Review of the stochastic quasi-Newton method
According to the classical L-BFGS methods wright1999numerical , given the current and last iterates and and their corresponding gradients, we define the correction pairs:
and define ,
. Then the new Hessian inverse matrix can be uniquely updated based on the last estimate:
When datasets are high dimensional, i.e. is a large number, and the Hessian and Hessian inverse matrices are and maybe dense, storing and directly updating might be computationally prohibitive. Instead of exploring fully dense matrices, limited-memory methods only save the latest information (say, the recent ) of correction pairs to arrive a Hessian inverse approximation. Given the initial guess , which is usually set to an identity matrix, the new Hessian inverse at the -th iteration can actually be approximated by repeated application of the formula (5),
Based on this updating rule, a two-loop recursion procedure has been derived: the first for-loop multiplying all the left-hand side vectors and the second for-loop multiplying the right-hand side ones. In practice, if denotes the current variance-reduced gradient, we directly calculate Hessian-vector product to find the search direction rather than actually computing and storing . Details are referred to Algorithm 3.
2.2 The proposed quasi-Newton scheme
Algorithm 1 outlines the main steps of our AsySQN method where
indexes the epochs,indexes the processors, and indexes the iterations in each epoch within each processor. Following the same sampling strategy in byrd2016stochastic , we use two separate subsets and of data to estimate the gradient (Lines 8-11) and the Hessian inverse (Lines 22-24), respectively. At each iteration in an epoch, is used to update the gradient whereas the Hessian inverse is updated at each epoch using a subsample . In order to better preserve the curvature information, the size of subsample should be relatively large, whereas the subsample should be smaller to reduce the overall computation.
The algorithm requires several pre-defined parameters: step size , limited memory size , integers and , the number of threads and batch size , and . We need the parameter because we only compute the full gradient of the objective function at the iterate every epochs, which will help reduce the computational cost. Algorithm 1 can be split into hierarchically two parts: an inner loop (iteration) and an outer loop (epoch). The inner loop runs asynchronous parallel processes where each thread updates concurrently and the epoch loop includes the inner loops and an update on the curvature information.
In our implementation, at the -th epoch, the vector (Line 24) is the difference between the two iterates obtained before and after (parallel) inner iterations, where (Line 23) can be the average or the latest iterate in every iterations. The vector represents the difference between two gradients computed at and using the data subsample: . By the mid-point theorem, there exists a , such that . The two options (Line 25 and Line 26) of computing are both frequently used in quasi-Newton methods. In our numerical experiments, we run both options for comparisons to show that option II (Line 26) is more stable when the length of , , becomes small.
Stochastic gradient descent methods are generally difficult in achieving a solution of high precision, but they can reduce the objective function quickly in the beginning. Thus, a desirable way is to quickly find the optimal interval using a first-order (gradient-based) method and then switch to a second-order method. In practice, we take a warm start by the SVRG (not shown in Algorithm 1), and after certain iterations of quick descending, we then switch to the SQN method to efficiently get a more accurate optimal value.
2.3 The proposed asynchronous parallel quasi-Newton
We propose a multi-threaded process in which each thread makes stochastic updates to a centrally stored vector (stored in a shared memory) without waiting, which is similar to the processes used in the asynchronous stochastic gradient descent (AsySGD) recht2011hogwild ; lian2015asynchronous , asynchronous stochastic coordinate descent (AsySCD) liu2015asynchronous ; hsieh2015passcode and asynchronous SVRG (AsySVRG) reddi2015variance ; zhao2016fast . In multi-threaded process, there are threads, and each thread reads the latest from the shared memory, calculates the next iterate concurrently, and then write the iterate into the shared memory.
We use an example in Figure 1 to illustrate the process of this parallel algorithm. At the beginning of each epoch, the initial (for notational convenience) is the current from the shared memory. When a thread finishes computing the next iterate, it will immediately write into the shared memory. Using the modern memory techniques, there are locks to guarantee that each time only one thread can write to the shared memory. Let us use a global timer that records any time when the shared memory is updated. In an asynchronous algorithm, when one thread is updating to , another thread may still use an old state to compute the search direction and update . Let be the particular state of for a thread when the shared memory () is updated at time . Specifically, let be the state of that has been used to update to . We assume that the maximum delay among processes is bounded by , that is . When Thread updates to , within this thread, the actual state of that has been used to calculate the search direction is . Hence, we have , where the delay within Thread is , obviously . Considering that the maximum number of iterations in each epoch is , we have .
In summary, we design an asynchronous parallel stochastic quasi-Newton method, named AsySQN, to improve the computational efficiency. In our experiments (Section 5), AsySQN shows obvious speedup compared with the original stochastic quasi-Newton methods SQN-VR due to the asynchronous parallel setting. Meanwhile, it still enjoys the advantages of common second-order methods: stability and accuracy. Furthermore, in some machine learning problems, it is not easy to do data normalization or even need more transformation to change the ill-conditioning behavior of objective function. In this case, AsySQN can perform better than AsySVRG or other first-order algorithms.
Each epoch consists of iterations, at the beginning of each epoch, suppose that the ScheduleUpdate rule is triggered after each steps, and for brevity, we use to denote the iterations chosen at the -th epoch.
At the epoch iteration:
The updating formula is: , where step size determines how long should moves on, here we do not require diminishing way to guarantee convergence rate;
Stochastic Variance-Reduced gradient: , where is randomly chosen from a subset in . Easily, we get the expectation of stochastic Variance-Reduced gradient: ;
The delayed stochastic Variance-Reduced (VR) gradient used in asynchronous setting: . As the full gradient has been computed before, the gradient was estimated at whereas the current iteration should be due to the delay. Taking expectation of the delayed gradient, we get: .
We make the same assumptions as non-asynchronous version of the SQN method that also uses VR technique moritz2016linearly .
The loss function is strongly convex, that is,
The loss function has Lipschitz continuous gradients:
Since the average of loss function over preserves the continuity and convexity, objective function also satisfies these two hypotheses, then we can easily derive the following lemmas.
The second-order information of the objective function is bounded by and : , therefore all Hessian matrices share the same bounds and generally we have the condition number of the objective function: .
Proof can be referenced from moritz2016linearly . Similarly, all generated Hessian inverse matrices have the same following condition number: . Obviously, the ill-conditioning degree of Hessian inverse will surely be amplified. For large scale datasets, the shape of objective function can be extreme ill-conditioned. Thus, how to achieve a fair performance in this ill-conditioned situation deserves our attention. As a second-order method, our AsySQN still enjoys fast convergence rate in ill-conditioned situation (see the main theorem 4.6 and Corrollary 4.6.1). We also do extensive experiments showing this property in our simulation numerical analysis.
4 Convergence Analysis
In the asynchronous parallel setting, all working nodes read and update without synchronization. This will cause delayed updates for some other threads. Hence, the stochastic variance-reduced gradients may not be computed using the latest iterate of . Here, we use gradient defined above, then each thread uses each its own to update .
In one epoch, the delay of parameter reading from the shared memory can be bounded by:
From Lemma 3.2 and updating formula ,
Suppose that Assumption 2 holds, the delay in the stochastic VR gradient of the objective function
Suppose that Assumption 1 holds, denote to be the unique minimizer of objective function . Then for any , we have
From Assumption 1,
Letting when , the quadratic function can achieve its minimum. ∎
Proof reference johnson2013accelerating .
Summing from to , we get
By rearranging, we get the inequality above. ∎
Theorem 4.6 (Main theorem).
Let’s start from the Lipschitz continuous condition:
Taking expectation at both hands,
Given , we can derive: . Thus,
The first inequality comes from Lipschitz continuous condition and Lemma 4.3.
Substituting this in the inequality above, we get the following result:
The last inequality comes from the fact that
Summing from to , we get
The second inequality uses the assumption that we have a delay in one epoch iteration. The third inequality uses the results of Theorem 4.5, hence all delayed stochastic VR gradients can be bounded by the stochastic VR gradients. The fourth inequality holds for Lemma 4.4. In the epoch-based setting,
Thus the above inequality gives
By rearranging the above gives
Hence, we get the following result: