L-DQN: An Asynchronous Limited-Memory Distributed Quasi-Newton Method

08/20/2021 · Bugra Can, et al. · University of Toronto and Rutgers University

This work proposes a distributed algorithm for solving empirical risk minimization problems, called L-DQN, under the master/worker communication model. L-DQN is a distributed limited-memory quasi-Newton method that supports asynchronous computations among the worker nodes. Our method is efficient both in terms of storage and communication costs, i.e., in every iteration the master node and workers communicate vectors of size O(d), where d is the dimension of the decision variable, and the amount of memory required on each node is O(md), where m is an adjustable parameter. To our knowledge, this is the first distributed quasi-Newton method with provable global linear convergence guarantees in the asynchronous setting where delays between nodes are present. Numerical experiments are provided to illustrate the theory and the practical performance of our method.


1 Introduction

Due to the rapid increase in the size of datasets in the last decade, distributed algorithms that can parallelize computations across multiple nodes (computational units) connected over a communication network have become indispensable bertsekas1989parallel; recht2011hogwild. A common communication model in distributed machine learning is the master/worker model, in which the master keeps a copy of the global decision variable and shares it with the workers. Each worker operates locally on its own data and then communicates the results to the master to update the decision variable in a synchronous gurbuzbalaban2017convergence; le2012stochastic; defazio2014saga; defazio2014finito; mairal2015incremental; mokhtari2018surpassing; vanli2018global or asynchronous fashion xiao2019dscovr; asaga; arock; bianchi2015coordinate; zhang2014asynchronous; mansoori2017superlinearly; csimcsekli2018asynchronous. In the synchronous setting, the master waits to receive updates from all workers before updating the decision variable, which can lead to slow execution if the nodes and/or the network are heterogeneous kanrar2011performance. In the asynchronous setting, coordination amongst workers is not needed (or is more relaxed) and the master can proceed with updates without having to wait for slow worker nodes. As a result, the asynchronous setting can be more efficient than the synchronous one in heterogeneous computing environments wongpanich2020rethinking.

In this paper, we consider distributed algorithms for empirical risk minimization, i.e. for solving the finite-sum problem

$\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{n} \sum_{i=1}^n f_i(x),$    (1)

where $x \in \mathbb{R}^d$ is the decision variable and $f_i$ is the loss function of node $i$. We consider the master/worker communication model with asynchronous computations. In today's distributed computing environments, the cost of communication between nodes is considerably higher than the cost of computation, which makes sharing matrices of size $d \times d$ across nodes prohibitively expensive in many machine learning applications. Thus, inspired by prior work liu1989limited; mokhtari2015global; nash1991numerical; skajaa2010limited; bollapragada2018progressive; berahas2019quasi, we focus on algorithms that communicate between nodes only vectors of size (at most) $O(d)$. There are a number of distributed algorithms for empirical risk minimization that can support asynchronous computations; the most relevant to our work are the recently proposed DAve-RPG anonymous2018daverpg and DAve-QN SooriDaveQN algorithms.

DAve-RPG is a delay-tolerant proximal gradient method with linear convergence guarantees that also handles a non-smooth term in the objective. However, it is a first-order method that does not estimate the second-order information of the underlying objective, and therefore it can be slow for ill-conditioned problems. DAve-QN is a distributed quasi-Newton method with local superlinear convergence guarantees; however, it does not admit global convergence guarantees. Furthermore, it relies on BFGS updates on each node, which require $O(d^2)$ memory as well as $O(d^2)$ computations for updating the Hessian estimate at each node. For large $d$, this can be slow and DAve-QN loses its edge over first-order approaches SooriDaveQN; furthermore, its memory requirement can be impractical or prohibitively expensive when $d$ is large, say on the order of tens or hundreds of thousands.

Contributions. To remedy the shortcomings of the DAve-QN algorithm, we propose L-DQN, a distributed limited-memory quasi-Newton method that requires less memory and computational work per iteration. More specifically, the per-iteration, per-node storage and computation of L-DQN are both $O(md)$, where $m$, an adjustable parameter, is the number of vector pairs stored in memory at every iteration that contain information about the past gradients and iterates. Because of the reduced storage and computation costs, our proposed algorithm scales well to large datasets, and it is communication-efficient as it exchanges vectors of size $O(d)$ at every communication. When the number of nodes is large enough, with an appropriate stepsize, L-DQN has global linear convergence guarantees for strongly convex objectives, even though the computations are done in an asynchronous manner, as opposed to the DAve-QN method which does not provide global convergence guarantees. In practice, we have also observed that L-DQN works well even if the number of nodes is not large. To our knowledge, L-DQN is the first distributed quasi-Newton method with provable linear convergence guarantees, even in the presence of asynchronous computations.

Related work. The proposed method can be viewed as an asynchronous distributed variant of the traditional quasi-Newton and limited-memory BFGS methods that have been extensively studied in the optimization community (goldfarb1970family; broyden1973local; dennis1974characterization; powell1976some). L-DQN builds on the limited-memory BFGS method liu1989limited. Prior work has also investigated incremental gradient (ig-in; in-mathprog) and incremental aggregated gradient algorithms (le2012stochastic; defazio2014saga; defazio2014finito; mairal2015incremental; gurbuzbalaban2017convergence; mokhtari2018surpassing; vanli2018global; mokhtari2018iqn; blatt2007convergent), originally developed for centralized problems. These methods update the global decision variable by processing the gradients of the component functions in a deterministic fashion in a specific (e.g. cyclic) order. They are applicable to our setting in practice; however, they do not provide convergence guarantees in asynchronous settings. The Lazily Aggregated Gradient (LAG) method chen2018lag, which has a convergence rate similar to batch gradient descent in the strongly convex, convex, and nonconvex cases, as does its quantized version sun2019communication, is an exception; however, LAG is a first-order method that does not use second-order information. For synchronous settings, the distributed quasi-Newton algorithm proposed by lee2018distributed is globally linearly convergent and can handle non-smooth regularization terms; a convergence analysis for that algorithm does not exist in asynchronous settings. In this work, we use the star network topology where the nodes follow a master/slave hierarchy. However, there is another setting, known as the decentralized setting, which does not have a master node and where communication between the nodes is limited to a given fixed arbitrary network topology (nedic2009distributed; mansoori2017superlinearly). Amongst algorithms for this setting, eisen2017decentralized proposes a linearly convergent decentralized quasi-Newton method and mansoori2017superlinearly develops an asynchronous Newton-based approach that has local superlinear convergence guarantees to a neighborhood of the solution of problem (1). There are also distributed second-order methods developed for non-convex objectives. Among these, most relevant to our paper are csimcsekli2018asynchronous, which proposes a stochastic asynchronous-parallel L-BFGS method, and the DINGO method (crane2019dingo), which admits linear convergence guarantees to a local minimum for non-convex objectives that satisfy an invexity property.

Notation. Throughout the paper, we use $\|\cdot\|$ to denote the matrix 2-norm or the Euclidean norm, depending on the context. The Frobenius norm of a matrix $A$ is defined as $\|A\|_F = \big(\sum_{i,j} A_{ij}^2\big)^{1/2}$. The matrix $I$ denotes the identity matrix. A memory with capacity $m$, denoted as $\mathcal{M}$, is a set of tuples $(s, y)$ where $s, y \in \mathbb{R}^d$; the size of the memory satisfies $|\mathcal{M}| \le m$. A function $f$ is called $L$-smooth and $\mu$-strongly convex if for any vector $x$, the Hessian satisfies $\mu I \preceq \nabla^2 f(x) \preceq L I$.

2 Algorithm

2.1 Preliminaries

BFGS algorithm. In the following, we provide a brief summary of the BFGS algorithm; see nocedal_num_opt for more detail. Given a convex smooth function $f$, the BFGS algorithm consists of the iterations

$x_{t+1} = x_t - \eta_t B_t^{-1} \nabla f(x_t),$

where $\eta_t$ is a properly chosen stepsize and the matrix $B_t^{-1}$ is an estimate of the inverse Hessian matrix at $x_t$, with $B_{t+1}$ satisfying the secant equation:

$B_{t+1} s_t = y_t,$    (2)

where $s_t = x_{t+1} - x_t$ and $y_t = \nabla f(x_{t+1}) - \nabla f(x_t)$ are the differences of the iterates and the gradients, respectively. By Taylor's theorem, $y_t \approx \nabla^2 f(x_{t+1}) s_t$; therefore, for a small enough stepsize, any matrix solving the secant equation can be considered as an approximation to the Hessian of the function $f$. In fact, the secant equation (2) has infinitely many solutions, and quasi-Newton methods differ in how they choose a particular solution. BFGS chooses the matrix $B_{t+1}$ according to

$B_{t+1} = B_t - \frac{B_t s_t s_t^\top B_t}{s_t^\top B_t s_t} + \frac{y_t y_t^\top}{s_t^\top y_t}.$

The corresponding update for the inverse $H_t = B_t^{-1}$ is

$H_{t+1} = \big(I - \rho_t s_t y_t^\top\big) H_t \big(I - \rho_t y_t s_t^\top\big) + \rho_t s_t s_t^\top, \qquad \rho_t = \frac{1}{s_t^\top y_t}.$    (3)

If the function $f$ is strongly convex then $s_t^\top y_t > 0$, so that the denominators in these updates cannot be zero. Note that the two correction terms in the update of $B_t$ are both rank-one; therefore these updates require $O(d^2)$ operations. Even though the BFGS algorithm enjoys local superlinear convergence with an appropriate stepsize, its $O(d^2)$ memory requirement to store the matrix $B_t$ and the $O(d^2)$ computations required for the updates may be impractical or prohibitively expensive for machine learning problems when $d$ is large.
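To make the update concrete, below is a minimal numpy sketch of the rank-two BFGS update and the corresponding quasi-Newton step; the function and variable names are ours for illustration.

import numpy as np

def bfgs_update(B, s, y):
    """Rank-two BFGS update of the Hessian estimate B (the update for B_t above).

    s: iterate difference x_{t+1} - x_t
    y: gradient difference grad f(x_{t+1}) - grad f(x_t)
    For strongly convex f, s @ y > 0, so both denominators are positive.
    """
    q = B @ s
    return B - np.outer(q, q) / (s @ q) + np.outer(y, y) / (s @ y)

def bfgs_step(x, grad, B, eta):
    """One quasi-Newton step x_{t+1} = x_t - eta * B^{-1} grad f(x_t)."""
    return x - eta * np.linalg.solve(B, grad)

Both the update and the dense storage of B are Theta(d^2), which is exactly the cost that the limited-memory variant below avoids.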

Limited-memory BFGS (L-BFGS) algorithm. Limited-memory BFGS (L-BFGS) requires less memory than the BFGS algorithm. Instead of storing the whole matrix, L-BFGS stores up to $m$ pairs $(s_i, y_i)$ in memory and uses these vectors to approximate the Hessian. The parameter $m$ is adjustable, which results in a memory requirement of $O(md)$. At the start of iteration $t$, we have access to the pairs $(s_i, y_i)$ for $i = t-m, \dots, t-1$. Since the storage is full (in the beginning of the iterations, when the total number of gradients computed is less than $m$, the storage is not yet full, but we omit these details to keep the discussion simpler; see nocedal_num_opt), the oldest pair $(s_{t-m}, y_{t-m})$ is replaced by the latest pair $(s_t, y_t)$. The resulting L-BFGS algorithm has the updates

$x_{t+1} = x_t - \eta_t B_t^{-1} \nabla f(x_t),$

where the matrices $B_t$ are computed by applying $m$ BFGS-type updates to a scaled identity matrix: starting from $B_t^{(0)} = \gamma_t I$, for $j = 1, \dots, m$,

$B_t^{(j)} = B_t^{(j-1)} - \frac{B_t^{(j-1)} s_j s_j^\top B_t^{(j-1)}}{s_j^\top B_t^{(j-1)} s_j} + \frac{y_j y_j^\top}{s_j^\top y_j}, \qquad B_t = B_t^{(m)},$

where $(s_j, y_j)$ runs over the stored pairs from oldest to newest and $\gamma_t > 0$ is a scaling factor. We note that L-BFGS requires $O(md)$ memory, which is significantly less than the $O(d^2)$ of BFGS for large $d$.

DAve-QN algorithm. The DAve-QN algorithm SooriDaveQN is an asynchronous quasi-Newton method for solving the optimization problem (1) in master/slave communication models. Let $x^t$ be the variable kept at the master at time $t$ and let $x_i$ be the local copy that agent $i$ keeps after its last communication with the master. At time $t$, an agent $i$ communicates with the master and updates its local estimate $B_i$ of the local Hessian with a BFGS update:

$B_i^{+} = B_i - \frac{B_i s_i s_i^\top B_i}{s_i^\top B_i s_i} + \frac{y_i y_i^\top}{s_i^\top y_i},$    (4)

where the differences $s_i$ and $y_i$ are computed using the local copy $x_i$ and the iterate $x^t$ received from the master. Let $d_t^i$ be the delay between the information received and sent at agent $i$ at time $t$; after making the update (4), agent $i$ sends its updated information (a constant number of vectors of size $d$ together with the associated scalars) to the master. Consequently, the master updates the global decision variable with

$x^{t+1} = \Big(\sum_{i=1}^n B_i\Big)^{-1} \sum_{i=1}^n \big(B_i x_i - \nabla f_i(x_i)\big),$

where $B_i$ and $x_i$ denote the most recent Hessian estimate and local copy of each worker $i$.

In the next section, we introduce the L-DQN method, which is a limited-memory version of the DAve-QN algorithm. L-DQN will allow us to improve performance for high-dimensional problems. The basic idea is that each agent stores $m$-many tuples, requiring $O(md)$ memory, instead of storing the $d \times d$ matrix $B_i$, and carries out L-BFGS-type updates instead of the BFGS update (4) to compute the Hessian estimate $B_i$.
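For intuition, here is a naive numpy sketch of the master step under the aggregate form displayed above; it is for exposition only (the function name is ours), since the actual methods avoid the explicit $d \times d$ solve by maintaining the inverse of the aggregate matrix incrementally, as discussed in Section 2.2.

import numpy as np

def master_update(B_list, x_list, grad_list):
    """Naive sketch: x+ = (sum_i B_i)^{-1} sum_i (B_i x_i - grad f_i(x_i)).

    B_list:    list of local d x d Hessian estimates B_i
    x_list:    list of local copies x_i
    grad_list: list of local gradients grad f_i(x_i)
    """
    B_bar = sum(B_list)   # aggregate Hessian estimate
    u = sum(B @ x - g for B, x, g in zip(B_list, x_list, grad_list))
    return np.linalg.solve(B_bar, u)   # O(d^3) here; avoided in practice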

2.2 A Limited-Memory Distributed Quasi-Newton Method (L-DQN)

Figure 1: Asynchronous communication scheme used by the proposed algorithm.

In this section, we introduce the L-DQN algorithm in a master/slave communication setting that consists of $n$ workers connected to one master with a star topology (see Figure 1). Let $d_t^i$ be the delay in communication at time $t$ between the $i$-th worker and the master, and let $D_t^i = d_t^i + d_{t - d_t^i}^i$ denote the (penultimate) double delay in communication, i.e. the last exchange between the master and the worker was at time $t - d_t^i$, and before that the communication took place at $t - D_t^i$. For example, if the node communicated with the master at times $t_2 < t_1 \le t$ and at no time in between, we have $d_t^i = t - t_1$, $d_{t - d_t^i}^i = t_1 - t_2$, and $D_t^i = t - t_2$.

Let us introduce the historical times $\hat t = t - d_t^i$ and $\hat{\hat t} = t - D_t^i$, with the convention that both are zero before the first communication. We introduce the notation $s_i^t = x^{\hat t} - x^{\hat{\hat t}}$ and $y_i^t = \nabla f_i(x^{\hat t}) - \nabla f_i(x^{\hat{\hat t}})$, and explain the L-DQN updates on the worker and the master in detail:

Worker Updates: Each agent $i$ keeps $m$-many tuples $(s, y)$ in its local memory $\mathcal{M}_i$ at time $t$, and at the end of the $t$-th iteration the worker replaces the oldest tuple with the new one $(s_i^t, y_i^t)$. Suppose the master communicates with worker $i$ at the moment $t$ and sends the copy $x^t$; then, upon receiving $x^t$, the worker computes its Hessian estimate $B_i^t$ according to

$B_i^{(0)} = \gamma_i^t I, \qquad B_i^{(j)} = B_i^{(j-1)} - \frac{B_i^{(j-1)} s_j s_j^\top B_i^{(j-1)}}{s_j^\top B_i^{(j-1)} s_j} + \frac{y_j y_j^\top}{s_j^\top y_j}, \quad j = 1, \dots, m, \qquad B_i^t = B_i^{(m)},$    (5)

where the pairs $(s_j, y_j)$ are retrieved from the memory $\mathcal{M}_i$, and the scaling factor is chosen as $\gamma_i^t = \frac{(y_i^t)^\top y_i^t}{(y_i^t)^\top s_i^t}$.

A number of choices for $\gamma$ have been proposed in the literature nocedal_num_opt. The choice of $\gamma_i^t$ given above (which is also considered in mokhtari2015global) is an estimate of the largest eigenvalue of the Hessian $\nabla^2 f_i$ and works well in practice; therefore our analysis of the algorithm is based on this choice of $\gamma$. However, our analysis of the linear convergence of L-DQN can be extended to different choices of $\gamma$ as well.

Function: $B$ = LBFGS($\mathcal{M}$, $\gamma$)
Set $B^{(0)} = \gamma I$
for $i = 1, \dots, m$ do
       Retract $(s_i, y_i)$ from $\mathcal{M}$
       Set $q_i = B^{(i-1)} s_i$ and $B^{(i)} = B^{(i-1)} - \frac{q_i q_i^\top}{s_i^\top q_i} + \frac{y_i y_i^\top}{s_i^\top y_i}$
end for
Return $B = B^{(m)}$.
Algorithm 1 Compute $B = \mathrm{LBFGS}(\mathcal{M}, \gamma)$ given memory $\mathcal{M}$ and scaling factor $\gamma$
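For concreteness, a direct numpy transcription of Algorithm 1 is sketched below; the memory is taken to be a list of $(s, y)$ pairs ordered oldest first, and the names are ours.

import numpy as np

def lbfgs_matrix(memory, gamma, d):
    """Algorithm 1: build the Hessian estimate B from the memory,
    a list of up to m pairs (s, y) ordered oldest first."""
    B = gamma * np.eye(d)                  # B^(0) = gamma * I
    for s, y in memory:                    # i = 1, ..., m
        q = B @ s                          # q_i = B^(i-1) s_i
        B = B - np.outer(q, q) / (s @ q) + np.outer(y, y) / (s @ y)
    return B

With the scaling choice above, a worker would call this as lbfgs_matrix(list(M_i), (y @ y) / (y @ s), d) using its most recent pair (s, y).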

Worker $i$ calls Algorithm 1 to perform the update (5) locally based on its memory $\mathcal{M}_i$. Then, the worker sends the resulting vectors and scalars to the master.

Master Updates: Following its communication with worker $i$, the master receives the vectors and scalars sent by the worker and computes

(6)

where $\bar B^t$ and $u^t$ aggregate the workers' Hessian estimates and gradient information as defined below, and the stepsize $\eta$ is determined by the master. Soori et al. have shown in SooriDaveQN that the computation of $\bar B^t$ and $u^t$ can be done at the master locally by using only the vectors sent by the workers. In particular, if we define $\bar B^t = \sum_{i=1}^n B_i^{t - d_t^i}$ and $u^t = \sum_{i=1}^n \big( B_i^{t - d_t^i} x^{t - d_t^i} - \nabla f_i(x^{t - d_t^i}) \big)$, then the master can obtain $\bar B^t$ and $u^t$ from $\bar B^{t-1}$ and $u^{t-1}$ by replacing the contribution of worker $i$ with the newly received information. Hence the master only requires $(\bar B^t)^{-1}$ and $u^t$ to proceed to $x^{t+1}$. Let the resulting change in the aggregate matrix be written as a low-rank correction

$\bar B^t = \bar B^{t-1} + U V^\top,$    (7)

where the matrices $U, V \in \mathbb{R}^{d \times k}$, with $k = O(m)$ columns, are built from the vectors received from worker $i$; then the Sherman-Morrison-Woodbury formula implies

$(\bar B^t)^{-1} = (\bar B^{t-1})^{-1} - (\bar B^{t-1})^{-1} U \big( I + V^\top (\bar B^{t-1})^{-1} U \big)^{-1} V^\top (\bar B^{t-1})^{-1}.$    (8)

Thus, if the master already has $(\bar B^{t-1})^{-1}$, then $(\bar B^t)^{-1}$ can be computed using the matrices $U$ and $V$.
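A minimal numpy sketch of the update (8) follows; here U and V are whatever low-rank factors encode the change (7) in the aggregate matrix, and the helper name is ours.

import numpy as np

def smw_update_inverse(B_inv, U, V):
    """Given B_inv = inv(B) and the correction B_new = B + U @ V.T,
    return inv(B_new) via Sherman-Morrison-Woodbury.
    Costs O(d^2 k + k^3) for U, V of shape (d, k), instead of O(d^3)."""
    k = U.shape[1]
    BU = B_inv @ U                    # d x k
    S = np.eye(k) + V.T @ BU          # small k x k capacitance matrix
    return B_inv - BU @ np.linalg.solve(S, V.T @ B_inv)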

Algorithm 2: L-DQN (steps at the master and the worker nodes)

The steps for the master and worker nodes are provided in Algorithm 2. After receiving $x^t$ from the master, worker $i$ computes its estimate $B_i^t$ using the vector received from the master, then updates its memory $\mathcal{M}_i$ and sends the vectors together with the scalars back to the master. Based on (7) and (8), the master computes $(\bar B^t)^{-1}$ using the vectors received from worker $i$. We define the epochs recursively as follows: we set $T_0 = 0$ and define $T_{k+1}$ as the first time $t$ such that each machine makes at least 2 updates on the interval $(T_k, t]$. Epochs as defined above satisfy the following properties:

  • For any $k \ge 1$ and any $t \ge T_k$, one has $t - D_t^i \ge T_{k-1}$ for every worker $i$; that is, the information used at time $t$ is based on iterates no older than $T_{k-1}$.

  • If delays are uniformly bounded, i.e. there exists a constant $\tau$ such that $d_t^i \le \tau$ for all $i$ and $t$, then for all $k$ we have $T_{k+1} - T_k \le 2\tau$ and $T_k \le 2k\tau$.

  • If we define the average delays as $\bar d_t = \frac{1}{n}\sum_{i=1}^n d_t^i$, then the epoch lengths can also be bounded in terms of $\bar d_t$. Moreover, assuming the average delays are uniformly bounded for all $t$, the epoch lengths $T_{k+1} - T_k$ are uniformly bounded as well.

Notice that convergence to the optimum is not possible without visiting every function $f_i$, so measuring performance using epochs, over which every node has communicated with the master at least once, is more convenient for comparison than using the number of communications $t$.
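To make the epoch definition concrete, the following small helper (a sketch with our own naming) computes the boundaries $T_k$ from a log of which worker communicated at each time step:

def epoch_boundaries(comm_log, n):
    """comm_log[t-1] = index of the worker that updated at time t = 1, 2, ...
    Returns [T_0, T_1, ...] where T_{k+1} is the first time t such that
    every one of the n workers has updated at least twice on (T_k, t]."""
    T, counts = [0], [0] * n
    for t, i in enumerate(comm_log, start=1):
        counts[i] += 1
        if all(c >= 2 for c in counts):   # every machine made >= 2 updates
            T.append(t)
            counts = [0] * n
    return T

# e.g. with n = 2 workers: epoch_boundaries([0, 1, 0, 1, 1, 0, 0, 1], 2) == [0, 4, 8]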

3 Convergence Analysis

In this section, we study the linear convergence of the L-DQN algorithm with a constant stepsize $\eta$. First, we assume that the functions $f_i$ and the matrices $B_i^t$ satisfy the following conditions:

Assumption 1

The component functions $f_i$ are $L$-smooth and $\mu$-strongly convex, i.e. there exist positive constants $\mu \le L$ such that, for all $i$ and all $x \in \mathbb{R}^d$, $\mu I \preceq \nabla^2 f_i(x) \preceq L I$.

Assumption 2

There exist constants $c_2 \ge c_1 > 0$ such that the following bounds are satisfied for all $i$ and at any time $t$:

$c_1 \mu I \preceq B_i^t \preceq c_2 L I.$    (9)

Assumption 2 says that $B_i^t$ approximates the Hessian up to a constant factor. For example, if the objective is a quadratic function of the form $f_i(x) = \frac{1}{2} x^\top A x$ and $B_i^t = cA$ for some constant $c > 0$, then we would have $c_1 = c_2 = c$ and the ratio satisfies $c_2/c_1 = 1$. In fact, this ratio can be thought of as a measure of the accuracy of the Hessian approximation. In the special case when the Hessian approximations are accurate (when $B_i^t = \nabla^2 f_i$), we have $c_1 = c_2 = 1$ and $c_2/c_1 = 1$. Otherwise, we have $c_2/c_1 > 1$.

In particular, if the eigenvalues of $B_i^t$ stay in an interval $[\lambda_{\min}, \lambda_{\max}]$ with $\lambda_{\min} > 0$, then Assumption 2 holds with $c_1 = \lambda_{\min}/\mu$ and $c_2 = \lambda_{\max}/L$. We note that estimates for $\lambda_{\min}$ and $\lambda_{\max}$ exist in the literature nocedal_num_opt; mokhtari2015global. For example, it is known that if we choose the scaling factor $\gamma$ as above, then explicit bounds on $\lambda_{\min}$ and $\lambda_{\max}$ can be given in terms of $\mu$, $L$, $d$, and the memory/storage capacity $m$ (see mokhtari2015global for details). A shortcoming of these existing bounds efficientlycomputeeigen; LargeScaleTRS is that they are not tight, i.e. with increasing memory capacity $m$, the bounds get worse. In our experiments, we have also observed on real datasets that Assumption 2 holds, where we estimated the constants $c_1$ and $c_2$ (see the supplementary file for details). These numerical results show that Assumption 2 is a reasonable assumption to make in practice for analyzing L-DQN methods.
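Such an empirical check is straightforward; for instance, the sketch below (our own naming) returns the smallest constants for which a given symmetric estimate $B$ satisfies (9):

import numpy as np

def estimate_c1_c2(B, mu, L):
    """Smallest c1, c2 with c1 * mu * I <= B <= c2 * L * I (cf. (9)),
    computed from the extreme eigenvalues of the symmetric matrix B."""
    eigvals = np.linalg.eigvalsh(B)
    return eigvals[0] / mu, eigvals[-1] / L

The ratio of the two returned values estimates the accuracy measure $c_2/c_1$ discussed above.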

Before we provide a convergence result for L-DQN, we observe that the iterates of L-DQN provided in (6) satisfy the property

(10)

The next theorem uses the bounds (9) together with the equality (10) to find a condition on the fixed step size $\eta$ such that the L-DQN algorithm converges linearly over the epochs $T_k$. The proof can be found in the appendix.

Theorem 1

Suppose Assumptions 1-2 hold and the accuracy $c_2/c_1$ satisfies

(11)

where $\kappa = L/\mu$ is the condition number of the component functions. Let the stepsize $\eta$ be chosen accordingly (the explicit bound is given in the appendix); then the iterates generated by the L-DQN algorithm (Algorithm 2) converge linearly over the epochs $T_k$, i.e. there exists $\rho \in (0, 1)$ such that $\|x^{T_k} - x^*\| \le \rho^k \|x^{T_0} - x^*\|$ for all $k$.

Theorem 1 says that if the Hessian approximations are good enough, then the ratio $c_2/c_1$ is small enough and L-DQN admits a linear convergence rate. Even though the condition (11) may seem conservative regarding the accuracy of the Hessian estimates, to our knowledge there exists no other result that supports global linear convergence of asynchronous distributed second-order methods for distributed empirical risk minimization. Also, on real datasets, we observe that the L-DQN algorithm performs well even when the limited-memory updates fail to satisfy the accuracy condition (11).

Figure 2: Expected suboptimality versus time on (a) covtype, (b) SVHN, and (c) mnist8m. The first and second numbers adjacent to the dataset names are the number of samples $n$ and the dimension $d$, respectively.

4 Numerical Experiments

We tested our algorithm on the multi-class logistic regression problem with $\ell_2$ regularization, where the objective is

(12)

where $\lambda > 0$ is the regularization parameter, $a_j$ is a feature vector, and $b_j$ is the corresponding label. We worked with five datasets (SVHN, mnist8m, covtype, cifar10, rcv1) from the LIBSVM repository chang2011libsvm, where the covtype dataset is expanded based on the approach in wang2018giant for large-scale experiments.

We compare L-DQN with the following recent distributed optimization algorithms:

  • DAve-QN SooriDaveQN: An asynchronous distributed quasi-Newton method.

  • Distributed Average Repeated Proximal Gradient (DAve-RPG) mishchenko2018delay: A first-order asynchronous method that performs better than the incremental aggregated gradient method gurbuzbalaban2017convergence and synchronous proximal gradient methods.

  • Globally Improved Approximate Newton (GIANT) giant2018mahoney: A synchronous communication-efficient approximate Newton method; the numerical experiments in giant2018mahoney demonstrate that it outperforms DANE shamir2014dane and Accelerated Gradient Descent nesterov2013introductory.

  • Distributed Approximate Newton (DANE) shamir2014dane: A well-known second-order method that requires a synchronization step among all workers.

All experiments are conducted on the XSEDE Comet resources towns2014xsede, on nodes with 24 Intel Xeon E5-2680v3 2.5 GHz cores and 120 GB of RAM. For L-DQN, DAve-RPG, and DAve-QN, we use 17 processes: one as the master and 16 dedicated as workers. DANE and GIANT do not have a master; thus, we use 16 worker processes with no master. All datasets are normalized to [0,1] and randomly distributed so that the load is roughly balanced among workers. We use Intel MKL 11.1.2 for BLAS (sparse/dense) operations and MVAPICH2/2.1 for MPI programming, compiled with mpicc 14.0.2 for optimized communication. Each experiment is repeated five times, and the average and standard deviation are reported as error bars in our results.

Parameters: The recommended parameters for each method are used. The regularization parameter $\lambda$ is tuned to ensure convergence for all methods, with one value used for mnist8m and another for SVHN, cifar10, and covtype; other choices of $\lambda$ show similar performance. For DANE, SVRG johnson2013accelerating is used as a local solver, with parameters selected based on the experiments in shamir2014communication. DANE has two additional parameters, which are set to 1 and to the value recommended by the authors in shamir2014communication, respectively. For DAve-RPG, the number of passes on local data is set to 5 ($p = 5$) and its stepsize is selected using a standard backtracking line search schmidt2015non. For L-DQN, the memory capacity $m$ and the stepsize are tuned separately for each of covtype, mnist8m, cifar10, and SVHN.

Figure 2 shows the average suboptimality versus time for the datasets mnist8m, SVHN, and covtype. We observe that L-DQN converges at a rate similar to DAve-QN while using less memory. For larger datasets (such as rcv1 in Figure 3), DAve-QN was not able to run due to its memory requirement, whereas the other methods ran successfully. DAve-RPG demonstrates good performance at the beginning for SVHN compared to the other methods due to its cheaper iteration complexity. However, L-DQN eventually becomes faster and outperforms DAve-RPG.

Figure 3: Expected suboptimality versus time on rcv1 (left) and cifar10 (right)

The right panel of Figure 3 shows the suboptimality versus time for the cifar10 dataset, with the memory capacity, regularization parameter, and stepsize tuned for cifar10. DAve-RPG is the fastest on this dataset, whereas L-DQN is competitive with DAve-QN with smaller memory requirements. We conclude that when the underlying optimization problem is ill-conditioned (as is the case for the mnist8m dataset), L-DQN improves performance with respect to the other methods while remaining scalable to large datasets. For less ill-conditioned problems (such as SVHN and cifar10), first-order methods such as DAve-RPG are efficient, and second-order methods may not be necessary.

Figure 4: Suboptimality comparison without the strong convexity assumption on the datasets covtype (left) and cifar10 (right).

Figure 4 exhibits the suboptimality results of the algorithms on cifar10 and covtype without the regularization term, which makes the problems more ill-conditioned. Owing to its smaller memory requirement, the performance of the L-DQN algorithm on cifar10 is significantly better than that of the other distributed algorithms, including DAve-QN. L-DQN is competitive with DAve-QN and DANE on covtype as well.

Figure 5: Scaling comparisons of the algorithms on the covtype (left) and mnist8m (right) datasets.

In Figure 5, we also compare the strong scaling of the distributed algorithms for different numbers of workers on mnist8m and covtype. In particular, we look at the improvement in the time needed to achieve the same suboptimality as we increase the number of workers. We see that L-DQN shows a nearly linear speedup and slightly better scaling than DAve-QN. DAve-RPG scales better, but considering the total runtime, it is slower than L-DQN.

In addition to suboptimality and scaling, we also compared the performance of these algorithms for different levels of dataset sparsity. For the problem of interest (logistic regression), computing the gradient takes $O(Nd)$ time for dense and $O(\mathrm{nnz})$ time for sparse datasets, where $N$ is the number of local samples and nnz is the number of non-zeros in the dataset. Therefore, L-DQN has an iteration complexity of $O(md + \mathrm{nnz})$, while DAve-QN has an iteration complexity of $O(d^2)$. Similarly, DAve-RPG has a complexity of $O(p \cdot \mathrm{nnz})$, where $p$ is the number of passes over the local data. We observe that L-DQN has a cheaper iteration complexity than DAve-QN, while for very sparse datasets, DAve-RPG has a cheaper iteration complexity than L-DQN. This is illustrated on the dataset rcv1 in the left panel of Figure 3. The dataset rcv1 is quite sparse. For this dataset, DAve-QN fails as it requires more memory than the resources available. GIANT requires each worker to hold at least as many local data points as the dimension $d$; hence, GIANT diverges with 16 workers. We observe that DAve-RPG converges faster than DANE and L-DQN because of its cheap iteration complexity.
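This back-of-the-envelope comparison can be written down directly; the toy helper below is our own construction, using the complexity estimates above with illustrative (not measured) sizes.

def iteration_costs(d, m, nnz, p):
    """Dominant per-iteration flop counts (up to constants), per the estimates above."""
    return {
        "L-DQN":    m * d + nnz,  # limited-memory updates plus a sparse gradient
        "DAve-QN":  d * d,        # dense updates of a d x d Hessian estimate
        "DAve-RPG": p * nnz,      # p passes over the local (sparse) data
    }

# Illustrative sizes only: for very sparse data (nnz << d*d), DAve-QN's d^2
# term dominates, and for small p, DAve-RPG can be cheaper than L-DQN.
print(iteration_costs(d=50_000, m=10, nnz=100_000, p=5))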

Figure 6: The effect of dataset sparsity on the performance of distributed optimization methods.

In order to show the effect of sparsity on performance, we designed a synthetic dataset based on an approach similar to that of shamir2014communication. First, we generate i.i.d. Gaussian input samples whose diagonal covariance matrix has decaying entries. Then, we randomly choose some entries of all samples and set them to zero to add sparsity. The underlying predictor $x_\star$ is set to the vector of all ones. Finally, the labels $b$ are generated based on the probabilities $\mathbb{P}(b = 1 \mid a) = \sigma(a^\top x_\star)$, where $\sigma$ is the logistic function. The objective parameters are chosen as in the previous experiments. The time to reach a fixed accuracy is measured for all methods and normalized by the L-DQN timing. The results are shown in Figure 6. DAve-RPG and DANE perform poorly on fully dense datasets (sparsity = 0%); however, DAve-RPG and GIANT perform better than L-DQN as the dataset sparsity increases. We observe that when more than 90% of the entries are zero, DAve-RPG is the most efficient method, whereas for denser datasets GIANT and L-DQN are more efficient on the synthetic data.
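A hypothetical generator in the spirit of this construction is sketched below; the exact covariance decay, sparsification scheme, and sizes are our own assumptions for illustration, not the paper's settings.

import numpy as np

def make_sparse_logistic_data(n, d, sparsity, seed=0):
    """Gaussian features with a decaying diagonal covariance (assumed profile),
    labels drawn from a logistic model with the all-ones predictor,
    then a random fraction of entries zeroed out."""
    rng = np.random.default_rng(seed)
    cov_diag = 1.0 / np.arange(1, d + 1)              # assumed decay profile
    A = rng.standard_normal((n, d)) * np.sqrt(cov_diag)
    A[rng.random((n, d)) < sparsity] = 0.0            # e.g. sparsity = 0.9
    x_star = np.ones(d)
    prob = 1.0 / (1.0 + np.exp(-A @ x_star))          # logistic probabilities
    b = np.where(rng.random(n) < prob, 1, -1)
    return A, b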

5 Conclusion

We proposed L-DQN, an asynchronous distributed limited-memory BFGS method. We showed that, under some assumptions, L-DQN admits linear convergence over epochs despite asynchronous computations. Our numerical experiments show that L-DQN can lead to significant performance improvements in practice in terms of both memory requirements and running time.

References

Appendix

6 Proof of Theorem 1

Recall that the average Hessian $\bar H_t := \int_0^1 \nabla^2 f\big(x^* + \tau (x^t - x^*)\big)\, d\tau$ satisfies the equality $\nabla f(x^t) - \nabla f(x^*) = \bar H_t (x^t - x^*)$. Hence equation (10) implies that the iterates admit the bound

(13)

where $\bar B^t$ is as in (6) and $\|\cdot\|$ denotes the 2-norm of a matrix. Notice that, by its definition and from (9), $\bar B^t$ satisfies the bounds $n c_1 \mu I \preceq \bar B^t \preceq n c_2 L I$ and hence is positive definite. The function $A \mapsto A^{-1}$, defined from the set of symmetric positive-definite matrices to itself, is a matrix convex function [MatrixInequalities, see E.7.a]; that is, for any $\lambda \in [0, 1]$ and any symmetric positive-definite matrices $A_1$ and $A_2$, the following inequality holds:

$\big(\lambda A_1 + (1 - \lambda) A_2\big)^{-1} \preceq \lambda A_1^{-1} + (1 - \lambda) A_2^{-1}.$    (14)

In particular, if $C = \lambda A + (1 - \lambda) B$ for some positive-definite matrices $A$ and $B$, by matrix convexity we have

$C^{-1} \preceq \lambda A^{-1} + (1 - \lambda) B^{-1} \preceq \max(A^{-1}, B^{-1}),$    (15)

where the maximum on the right-hand side is in the sense of the Loewner ordering, i.e. $\max(A^{-1}, B^{-1}) = A^{-1}$ if $B^{-1} \preceq A^{-1}$ and equals $B^{-1}$ otherwise. From the bounds (9), we have