1 Introduction
Due to the rapid increase in the size of datasets over the last decade, distributed algorithms that can parallelize computations across multiple nodes (computational units) connected over a communication network have become indispensable bertsekas1989parallel; recht2011hogwild. A common communication model in distributed machine learning is the master/worker model, in which the master keeps a copy of the global decision variable and shares it with the workers. Each worker operates locally on its own data and then communicates the results to the master to update the decision variable, in a synchronous gurbuzbalaban2017convergence; le2012stochastic; defazio2014saga; defazio2014finito; mairal2015incremental; mokhtari2018surpassing; vanli2018global or asynchronous fashion xiao2019dscovr; asaga; arock; bianchi2015coordinate; zhang2014asynchronous; mansoori2017superlinearly; csimcsekli2018asynchronous. In the synchronous setting, the master waits to receive updates from all workers before updating the decision variable, which can lead to slow execution if the nodes and/or the network are heterogeneous kanrar2011performance. In the asynchronous setting, coordination amongst workers is not needed (or is more relaxed), and the master can proceed with updates without having to wait for slow worker nodes. As a result, the asynchronous setting can be more efficient than the synchronous one in heterogeneous computing environments wongpanich2020rethinking.
In this paper, we consider distributed algorithms for empirical risk minimization, i.e., for solving the finite-sum problem
(1)  min_{x ∈ ℝ^d} f(x) := (1/n) Σ_{i=1}^n f_i(x),
where n is the number of nodes and f_i : ℝ^d → ℝ is the loss function of node i
. We consider the master/worker communication model with asynchronous computations. In today's distributed computing environments, the cost of communication between nodes is considerably higher than the cost of computation, which makes sharing d × d matrices across nodes prohibitively expensive in many machine learning applications, where d is the problem dimension. Thus, inspired by prior work liu1989limited; mokhtari2015global; nash1991numerical; skajaa2010limited; bollapragada2018progressive; berahas2019quasi, we focus on algorithms that communicate between nodes only vectors of size O(d). There are a number of distributed algorithms for empirical risk minimization that can support asynchronous computations; the most relevant to our work are the recently proposed DAveRPG anonymous2018daverpg and DAveQN SooriDaveQN algorithms. DAveRPG is a delay-tolerant proximal gradient method with linear convergence guarantees that also handles a nonsmooth term in the objective. However, it is a first-order method that does not estimate the second-order information of the underlying objective; therefore, it can be slow for ill-conditioned problems. DAveQN is a distributed quasi-Newton method with local superlinear convergence guarantees; however, it does not admit global convergence guarantees. Furthermore, it relies on BFGS updates on each node, which require O(d²) memory as well as O(d²) computations for updating the Hessian estimate at each node. For large d, this can be slow, and DAveQN loses its edge over first-order approaches SooriDaveQN; furthermore, its memory requirement can be impractical or prohibitively expensive when d is large, say on the order of tens or hundreds of thousands.
Contributions. To remedy the shortcomings of the DAveQN algorithm, we propose LDQN, a distributed limited-memory quasi-Newton method that requires less memory and computational work per iteration. More specifically, the per-iteration, per-node storage and computation costs of LDQN are O(Md), where M is a configurable parameter that controls the number of vector pairs stored in memory at every iteration, containing information about past gradients and iterates. Because of the reduced storage and computation costs, our proposed algorithm scales well to large datasets, and it is communication-efficient as it exchanges vectors of size O(d) at every communication. When the number of nodes is large enough, with an appropriate stepsize, LDQN has global linear convergence guarantees for strongly convex objectives even though the computations are done in an asynchronous manner, as opposed to the DAveQN method, which does not provide global convergence guarantees. In practice, we have also observed that LDQN works well even if the number of nodes is not large. To our knowledge, LDQN is the first distributed quasi-Newton method with provable linear convergence guarantees, even in the presence of asynchronous computations.
Related work. The proposed method can be viewed as an asynchronous distributed variant of the traditional quasi-Newton and limited-memory BFGS methods that have been extensively studied in the optimization community (goldfarb1970family; broyden1973local; dennis1974characterization; powell1976some). LDQN builds on the limited-memory BFGS method liu1989limited. Prior work has also investigated incremental gradient (igin; inmathprog) and incremental aggregated gradient algorithms (le2012stochastic; defazio2014saga; defazio2014finito; mairal2015incremental; gurbuzbalaban2017convergence; mokhtari2018surpassing; vanli2018global; mokhtari2018iqn; blatt2007convergent), which were originally developed for centralized problems. These methods update the global decision variable by processing the gradients of the component functions in a deterministic fashion in a specific (e.g., cyclic) order. They are applicable to our setting in practice; however, these methods do not provide convergence guarantees in asynchronous settings. The Lazily Aggregated Gradient (LAG) method chen2018lag, which has a convergence rate similar to batch gradient descent in the strongly convex, convex, and nonconvex cases, as well as its quantized version sun2019communication, is an exception; however, LAG is a first-order method that does not use second-order information. For synchronous settings, the distributed quasi-Newton algorithm proposed by lee2018distributed is globally linearly convergent and can handle nonsmooth regularization terms; a convergence analysis for this algorithm does not exist for asynchronous settings. In this work, we use the star network topology, where the nodes follow a master/slave hierarchy. However, there is another setting known as the decentralized setting, which does not have a master node and in which communication between the nodes is limited to a given fixed arbitrary network topology (nedic2009distributed; mansoori2017superlinearly).
Amongst algorithms for this setting, eisen2017decentralized proposes a linearly convergent decentralized quasi-Newton method, and mansoori2017superlinearly develops an asynchronous Newton-based approach that has local superlinear convergence guarantees to a neighborhood of the solution of problem (1). There are also distributed second-order methods developed for nonconvex objectives. Among these, most relevant to our paper are csimcsekli2018asynchronous, which proposes a stochastic asynchronous-parallel LBFGS method, and the DINGO method (crane2019dingo), which admits linear convergence guarantees to a local minimum for nonconvex objectives that satisfy an invexity property.
Notation. Throughout the paper, we use ‖·‖ to denote the matrix 2-norm or the Euclidean norm, depending on the context. The Frobenius norm of a matrix A is defined as ‖A‖_F = (Σ_{i,j} A_{ij}²)^{1/2}. The matrix I denotes the identity matrix. A memory with capacity M, denoted 𝓜, is a set of tuples (s, y) where s, y ∈ ℝ^d; the size of the memory satisfies |𝓜| ≤ M. A function f is called L-smooth and μ-strongly convex if, for any vector x, the Hessian satisfies μ I ⪯ ∇²f(x) ⪯ L I.
2 Algorithm
2.1 Preliminaries
BFGS algorithm. In the following, we provide a brief summary of the BFGS algorithm; see nocedal_num_opt for more detail. Given a convex smooth function f, the BFGS algorithm consists of the iterations
x_{k+1} = x_k − α_k B_k^{-1} ∇f(x_k),
where α_k is a properly chosen stepsize and the matrix B_k is an estimate of the Hessian of f at x_k whose inverse satisfies the secant equation:
(2)  B_{k+1}^{-1} y_k = s_k
where s_k = x_{k+1} − x_k and y_k = ∇f(x_{k+1}) − ∇f(x_k) are the differences of the iterates and the gradients, respectively. By Taylor's theorem, y_k ≈ ∇²f(x_{k+1}) s_k; therefore, for a small enough stepsize, any matrix solving the secant equation can be considered an approximation to the inverse of the Hessian of f. In fact, the secant equation (2) has infinitely many solutions, and quasi-Newton methods differ in how they choose a particular solution. BFGS chooses the matrix according to
B_{k+1} = B_k + (y_k y_k^T)/(s_k^T y_k) − (B_k s_k s_k^T B_k)/(s_k^T B_k s_k).
The corresponding update for the inverse H_k := B_k^{-1} is
(3)  H_{k+1} = (I − ρ_k s_k y_k^T) H_k (I − ρ_k y_k s_k^T) + ρ_k s_k s_k^T,  where ρ_k = 1/(s_k^T y_k).
If the function f is strongly convex, then s_k^T y_k > 0, so the denominators in these updates cannot be zero. Note that the correction terms y_k y_k^T and B_k s_k s_k^T B_k are both rank-one; therefore, these updates require O(d²) operations. Even though the BFGS algorithm enjoys local superlinear convergence with an appropriate stepsize, the O(d²) memory required to store the matrix B_k and the O(d²) computations required for the updates may be impractical or prohibitively expensive for machine learning problems when the dimension d is large.
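To make the inverse update (3) concrete, here is a minimal sketch of one BFGS inverse-Hessian update; the function name and the use of dense NumPy arrays are our own illustration choices, not the paper's implementation.

```python
import numpy as np

def bfgs_inverse_update(H, s, y):
    """One BFGS update of the inverse-Hessian estimate H, using the
    iterate difference s = x_{k+1} - x_k and the gradient difference
    y = grad f(x_{k+1}) - grad f(x_k).  Costs O(d^2): only outer
    products and matrix products, no matrix inversion."""
    rho = 1.0 / (s @ y)                  # positive when f is strongly convex
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    # H_{k+1} = (I - rho s y') H (I - rho y s') + rho s s'
    return V @ H @ V.T + rho * np.outer(s, s)
```

By construction, the updated matrix satisfies the secant equation H_{k+1} y_k = s_k and stays symmetric.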
Limited-memory BFGS (LBFGS) algorithm. Limited-memory BFGS (LBFGS) requires less memory than the BFGS algorithm. Instead of storing the whole matrix, LBFGS stores up to M pairs (s_i, y_i) in memory and uses these vectors to approximate the Hessian. The parameter M is adjustable, which results in a memory requirement of O(Md). At the start of iteration k, we have access to the pairs (s_i, y_i) for i = k − M, …, k − 1. Since the storage is full (in the beginning of the iterations, when the total number of gradients computed is less than M, the storage is not yet full, but we omit these details to keep the discussion simple; see nocedal_num_opt), the oldest pair (s_{k−M}, y_{k−M}) is replaced by the latest pair (s_k, y_k). The resulting LBFGS algorithm has the updates
x_{k+1} = x_k − α_k H_k ∇f(x_k),
where the matrices H_k are computed from the stored pairs according to the following recursion:
H_k^{(0)} = γ_k I,   H_k^{(j+1)} = (I − ρ_i s_i y_i^T) H_k^{(j)} (I − ρ_i y_i s_i^T) + ρ_i s_i s_i^T,   i = k − M + j,   j = 0, …, M − 1,
with H_k = H_k^{(M)} and ρ_i = 1/(s_i^T y_i), where γ_k is a scaling factor. We note that LBFGS requires O(Md) memory, which is significantly less than the O(d²) of BFGS for large d.
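In implementations, H_k is never formed as a dense d × d matrix; the standard two-loop recursion applies it to a vector in O(Md) time. The sketch below is a generic L-BFGS two-loop recursion, not the paper's Algorithm 1: `pairs` holds the stored (s_i, y_i) tuples ordered oldest to newest, and the initial scaling γ = sᵀy/yᵀy is an assumed convention.

```python
import numpy as np

def lbfgs_direction(grad, pairs):
    """Two-loop recursion: return H_k @ grad, where H_k is the implicit
    L-BFGS inverse-Hessian estimate built from the stored (s_i, y_i)
    pairs (oldest to newest).  Cost: O(M d) instead of O(d^2)."""
    q = grad.copy()
    alphas = []
    for s, y in reversed(pairs):           # newest to oldest
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    s, y = pairs[-1]                       # newest pair sets the scaling
    gamma = (s @ y) / (y @ y)              # initial estimate H^(0) = gamma I
    r = gamma * q
    for (s, y), a in zip(pairs, reversed(alphas)):   # oldest to newest
        rho = 1.0 / (y @ s)
        b = rho * (y @ r)
        r += (a - b) * s
    return r
```

A useful sanity check: the implicit matrix satisfies the secant equation for the newest stored pair, so feeding in `grad = y_newest` returns `s_newest`.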
DAveQN algorithm. The DAveQN algorithm SooriDaveQN is an asynchronous quasi-Newton method for solving the optimization problem (1) in master/slave communication models. Let x^t be the decision variable kept at the master at time t, and let x_i be the local copy that agent i keeps after its last communication with the master. At time t, an agent i communicates with the master and updates its local estimate B_i of its local Hessian with a BFGS update:
(4)  B_i ← B_i + (y_i y_i^T)/(s_i^T y_i) − (B_i s_i s_i^T B_i)/(s_i^T B_i s_i),
where the differences s_i and y_i are computed using the local copy x_i and the iterate x^t. Let d_i^t be the delay between the information received and sent at agent i at time t; after making the update (4), agent i sends the resulting vectors and scalars to the master. Consequently, the master updates the global decision variable by aggregating the most recent information received from all agents.
In the next section, we introduce the LDQN method, which is a limited-memory version of the DAveQN algorithm. LDQN allows us to improve performance for high-dimensional problems. The basic idea is that each agent stores M tuples, requiring O(Md) memory instead of storing the d × d matrix B_i, and carries out LBFGS-type versions of the update (4) to compute its Hessian estimate.
2.2 A Limited-Memory Distributed Quasi-Newton Method (LDQN)
In this section, we introduce the LDQN algorithm in a master/slave communication setting that consists of n workers connected to one master with a star topology (see Figure 1). Let d_i^t denote the delay in communication between the i-th worker and the master at time t, and let D_i^t denote the (penultimate) double delay in communication: the last exchange between the master and the worker was at time t − d_i^t, and before that the communication took place at time t − D_i^t, where D_i^t > d_i^t. For example, if the node communicated with the master at times t₁ < t₂ ≤ t and at no time in between, we have d_i^t = t − t₂ and D_i^t = t − t₁.
Let us also introduce the corresponding historical times, with the convention that they are zero before the first communication. With this notation in place, we explain the LDQN updates on the workers and the master in detail:
Worker Updates: Each agent i keeps M tuples (s, y) in its local memory 𝓜_i at time t, and at the end of its update the worker replaces the oldest tuple with the newest one. Suppose the master communicates with worker i at time t and sends the copy x^t; then, upon receiving x^t, the worker computes its new difference pair and its Hessian estimate B_i, where B_i is computed according to
(5)
and the scaling factor is chosen as γ_i = (y_i^T y_i)/(s_i^T y_i).
A number of choices for γ_i have been proposed in the literature nocedal_num_opt. The choice given above (which is also considered in mokhtari2015global) is an estimate of the largest eigenvalue of the Hessian and works well in practice; therefore, our analysis of the algorithm is based on this choice of γ_i. However, our analysis of the linear convergence of LDQN can be extended to other choices of γ_i as well. Worker i calls Algorithm 1 to perform the update (5) locally based on its memory 𝓜_i. Then, the worker sends the resulting vectors and scalars back to the master.
Master Updates: Following its communication with worker i, the master receives the updated vectors and scalars and computes
(6) 
where B^t aggregates the most recently received local Hessian estimates of the workers and α is a stepsize determined by the master. Soori et al. have shown in SooriDaveQN that this computation can be carried out at the master locally by using only the vectors sent by the workers: the master maintains running aggregates and corrects them with the newly received vectors, so that only a small number of vector operations is required per update. Let
(7) 
then the Sherman–Morrison–Woodbury formula implies
(8) 
Thus, if the master already has the previous inverse, then the new inverse is computed using only the vectors received from worker i.
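The mechanism behind (7)–(8) is the Sherman–Morrison–Woodbury identity: when the aggregate matrix changes by a low-rank correction built from the handful of vectors a worker sends, its inverse can be refreshed with O(d²) work instead of an O(d³) factorization. Below is a minimal rank-one sketch with generic names; the paper's actual update involves the specific vectors exchanged with the worker.

```python
import numpy as np

def smw_rank1(Ainv, u, v):
    """Sherman-Morrison: given Ainv = A^{-1}, return (A + u v^T)^{-1}
    using only matrix-vector products and one outer product, i.e.
    O(d^2) work instead of a fresh O(d^3) inversion."""
    Au = Ainv @ u
    vA = v @ Ainv                        # row vector v^T A^{-1}
    return Ainv - np.outer(Au, vA) / (1.0 + v @ Au)
```

Chaining a few such updates is exactly how an inverse can track a matrix that changes by one low-rank correction per communication.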
The steps for the master and worker nodes are provided in Algorithm 2. After receiving x^t from the master, worker i computes its estimate using the received vector, then updates its memory and sends the resulting vectors together with the scalars back to the master. Based on (7) and (8), the master computes the new inverse using the vectors received from worker i. We define the epochs recursively as follows: we set T_0 = 0 and define T_{m+1} as the first time t such that each machine makes at least 2 updates on the interval (T_m, t]
. Epochs as defined above satisfy the properties:

For any and any one has

If the delays are uniformly bounded, i.e., there exists a constant d̄ such that d_i^t ≤ d̄ for all i and t, then the epoch lengths T_{m+1} − T_m are uniformly bounded as well.

If we define average delays as , then . Moreover, assuming that for all t, we get .
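The epoch construction can be made concrete in a few lines. The sketch below is a hypothetical helper (not part of LDQN): it assumes the master logs which worker's update is served at each time step, and applies the "at least two updates per worker" rule stated above.

```python
def epoch_boundaries(served, n_workers):
    """Given `served`, the worker ids in the order their updates reach
    the master (served[t-1] handled at time t), return the epoch
    boundaries T_1 < T_2 < ...: each T_{m+1} is the first time by
    which every worker has made at least 2 updates since T_m."""
    boundaries = []
    counts = [0] * n_workers
    for t, w in enumerate(served, start=1):
        counts[w] += 1
        if all(c >= 2 for c in counts):
            boundaries.append(t)         # close the current epoch
            counts = [0] * n_workers     # start counting the next one
    return boundaries
```

For a perfectly alternating schedule with two workers, every fourth communication closes an epoch, consistent with the bounded-delay property above.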
Notice that convergence to the optimum is not possible without visiting every function f_i, so measuring performance using epochs, within which every node has communicated with the master at least once, is more convenient for comparisons than the raw number of communications.
3 Convergence Analysis
In this section, we present theoretical results on the linear convergence of the LDQN algorithm with a constant stepsize α. First, we assume that the functions f_i and the matrices B_i satisfy the following conditions:
Assumption 1
The component functions f_i are L-smooth and μ-strongly convex, i.e., there exist positive constants μ ≤ L such that, for all i and all x ∈ ℝ^d, μ I ⪯ ∇²f_i(x) ⪯ L I.
Assumption 2
There exist constants c₂ ≥ c₁ > 0 such that the following bounds are satisfied for all i and at any time t:
(9)  c₁ ∇²f_i(x) ⪯ B_i ⪯ c₂ ∇²f_i(x).
Assumption 2 says that B_i approximates the Hessian up to a constant factor. For example, if the objective is a quadratic function of the form f_i(x) = (1/2) x^T A_i x and B_i = c A_i for some constant c > 0, then we would have c₁ = c₂ = c, and the ratio c₂/c₁ satisfies c₂/c₁ = 1. In fact, this ratio can be thought of as a measure of the accuracy of the Hessian approximation. In the special case when the Hessian approximations are exact (when B_i = ∇²f_i), we have c₁ = c₂ = 1 and c₂/c₁ = 1. Otherwise, we have c₂/c₁ > 1.
In particular, if the eigenvalues of B_i stay in an interval [λ_min, λ_max] with λ_min > 0, then Assumption 2 holds with c₁ = λ_min/L and c₂ = λ_max/μ. We note that estimates for such eigenvalue bounds exist in the literature nocedal_num_opt; mokhtari2015global. For example, it is known that if γ_i is chosen as above, then explicit values can be derived for any memory/storage capacity M (see mokhtari2015global for details). A shortcoming of these existing bounds efficientlycomputeeigen; LargeScaleTRS is that they are not tight, i.e., with increasing memory capacity M, the bounds get worse. In our experiments, we have also observed on real datasets that Assumption 2 holds, where we estimated the constants c₁ and c₂ (see the supplementary file for details). These numerical results show that Assumption 2 is a reasonable assumption to make in practice for analyzing LDQN methods.
Before we provide a convergence result for LDQN, we observe that the iterates of LDQN provided in (6) satisfy the property
(10) 
The next theorem uses the bounds (9) together with the equality (10) to find a condition on the fixed stepsize α under which the LDQN algorithm is linearly convergent over epochs. The proof can be found in the appendix.
Theorem 1
Theorem 1 says that if the Hessian approximations are good enough, then the ratio c₂/c₁ is close enough to one and LDQN admits a linear convergence rate. Even though condition (11) on the accuracy of the Hessian estimates seems conservative, to our knowledge there exists no other linear convergence result that establishes global convergence of asynchronous distributed second-order methods for distributed empirical risk minimization. Also, on real datasets, we observe that the LDQN algorithm performs well even when the limited-memory updates fail to satisfy the accuracy condition (11).
4 Numerical Experiments
We tested our algorithm on the multiclass logistic regression problem with ℓ₂ regularization, where the objective is
(12)
and λ is the regularization parameter, and each sample consists of a feature vector and its corresponding label. We worked with five datasets (SVHN, mnist8m, covtype, cifar10, rcv1) from the LIBSVM repository chang2011libsvm, where the covtype dataset is expanded based on the approach in wang2018giant for large-scale experiments.
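For reference, the regularized multiclass logistic objective (12) can be written out as follows. This is a generic sketch with assumed conventions (a class-by-dimension weight matrix W and a λ/2 scaling on the regularizer), not the code used in our experiments.

```python
import numpy as np

def multiclass_logistic_objective(W, X, y, lam):
    """l2-regularized multinomial logistic (softmax) loss.
    X: (n, d) feature matrix, y: (n,) integer labels in {0,...,C-1},
    W: (C, d) class weight matrix, lam: regularization parameter."""
    scores = X @ W.T                                   # (n, C) logits
    scores -= scores.max(axis=1, keepdims=True)        # stabilize exp
    log_norm = np.log(np.exp(scores).sum(axis=1))      # log-partition
    nll = (log_norm - scores[np.arange(len(y)), y]).mean()
    return nll + 0.5 * lam * np.sum(W ** 2)
```

At W = 0 the loss equals log C regardless of the data, a handy sanity check when tuning λ.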
We compare LDQN with the following other recent distributed optimization algorithms:

DAveQN SooriDaveQN: An asynchronous distributed quasi-Newton method.

Distributed Average Repeated Proximal Gradient (DAveRPG) mishchenko2018delay: A first-order asynchronous method that performs better than the incremental aggregated gradient gurbuzbalaban2017convergence and synchronous proximal gradient methods.

Globally Improved Approximate Newton (GIANT) giant2018mahoney: A synchronous communication-efficient approximate Newton method, for which the numerical experiments in giant2018mahoney demonstrate that it outperforms DANE shamir2014dane and Accelerated Gradient Descent nesterov2013introductory.

Distributed Approximate Newton (DANE) shamir2014dane: A well-known second-order method that requires a synchronization step among all workers.
All experiments are conducted on the XSEDE Comet resources towns2014xsede with 24 workers (on Intel Xeon E5-2680v3 2.5 GHz architectures) and with 120 GB of Random Access Memory (RAM). For LDQN, DAveRPG, and DAveQN, we use 17 processes: one as the master and 16 dedicated as workers. DANE and GIANT do not have a master; thus, we use 16 processes as workers with no master. All datasets are normalized to [0,1] and randomly distributed so that the load is roughly balanced among the workers. We use Intel MKL 11.1.2 and MVAPICH2/2.1 for BLAS (sparse/dense) operations and MPI programming, compiled with mpicc 14.0.2 for optimized communication. Each experiment is repeated five times, and the average and standard deviation are reported as error bars in our results.
Parameters: The recommended parameters for each method are used. The regularization parameter is tuned to ensure convergence for all methods; one value is used for mnist8m and another for SVHN, cifar10, and covtype, and other choices show similar performance. For DANE, SVRG johnson2013accelerating is used as a local solver, with parameters selected based on the experiments in shamir2014communication. DANE has two additional parameters, which are set to 1 and to the value recommended by the authors in shamir2014communication, respectively. For DAveRPG, the number of passes on local data is set to 5, and its stepsize is selected using a standard backtracking line-search algorithm schmidt2015non. For LDQN, the memory capacity M is set separately for covtype, mnist8m, and cifar10, with correspondingly tuned stepsizes; on SVHN, the memory capacity and stepsize of LDQN are chosen in the same manner.
Figure 2 shows the average suboptimality versus time for the datasets mnist8m, SVHN, and covtype. We observe that LDQN converges at a rate similar to that of DAveQN while using less memory. For larger datasets (such as rcv1 in Figure 3), DAveQN was not able to run due to its memory requirement, whereas the other methods ran successfully. DAveRPG demonstrates good performance at the beginning on SVHN compared to the other methods due to its cheaper iteration complexity; however, LDQN eventually becomes faster and outperforms DAveRPG.
The right panel of Figure 3 shows the suboptimality versus time for the dataset cifar10 with the corresponding parameter choices. DAveRPG is the fastest on this dataset, whereas LDQN is competitive with DAveQN while having smaller memory requirements. We conclude that when the underlying optimization problem is ill-conditioned (as is the case for the mnist8m dataset), LDQN improves performance with respect to the other methods while remaining scalable to large datasets. For less ill-conditioned problems (such as SVHN and cifar10), first-order methods such as DAveRPG are efficient, and second-order methods may not be necessary.
Figure 4 exhibits the suboptimality results of the algorithms on cifar10 and covtype without the regularization term, which makes the problems more ill-conditioned. Due to its smaller memory requirement, the performance of the LDQN algorithm on cifar10 is significantly better than that of the other distributed algorithms, including DAveQN. LDQN is competitive with DAveQN and DANE on covtype as well.
In Figure 5, we also compare the strong scaling of the distributed algorithms for different numbers of workers on mnist8m and covtype. In particular, we look at the improvement in time to achieve the same suboptimality as we increase the number of workers. We see that LDQN shows a nearly linear speedup and slightly better scaling than DAveQN. DAveRPG scales better, but considering the total runtime, it is slower than LDQN.
In addition to suboptimality and scaling, we also compared the performance of these algorithms at different levels of dataset sparsity. For the problem of interest (logistic regression), computing the gradient takes O(nd) operations for dense datasets and O(nnz) for sparse datasets, where nnz is the number of nonzeros in the dataset. As a result, LDQN has a cheaper iteration complexity than DAveQN, while in the case of very sparse datasets, DAveRPG has a cheaper iteration complexity than LDQN, as its cost scales with the number of passes on local data. This is illustrated on the dataset rcv1 in the left panel of Figure 3. The dataset rcv1 is quite sparse. For this dataset, DAveQN fails as it requires more memory than the resources available. GIANT requires the number of local data points on each worker to be sufficiently large relative to the dimension; hence, GIANT diverges with 16 workers. We observe that DAveRPG converges faster than DANE and LDQN because of its cheap iteration complexity.
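The O(nnz) versus O(nd) gradient cost discussed above can be seen directly in a CSR-format implementation. The sketch below is a hypothetical helper for the binary case (labels in {−1, +1}), not the experiment code: it touches each stored nonzero only a constant number of times.

```python
import numpy as np

def logistic_grad_csr(data, indices, indptr, y, w, lam):
    """Gradient of the l2-regularized binary logistic loss when the
    (n, d) design matrix is stored in CSR form (data, indices, indptr).
    Each stored nonzero is read twice, so the cost is O(nnz) + O(d)
    instead of O(n*d)."""
    n = len(y)
    grad = lam * w                                   # fresh array, safe to mutate
    for i in range(n):
        lo, hi = indptr[i], indptr[i + 1]
        zi = data[lo:hi] @ w[indices[lo:hi]]         # x_i^T w, O(nnz_i)
        coef = -y[i] / (1.0 + np.exp(y[i] * zi))     # d(log-loss)/dz
        grad[indices[lo:hi]] += (coef / n) * data[lo:hi]
    return grad
```

For a fully dense matrix this matches the usual dense formula, while for a 99%-sparse matrix it does roughly 1% of the arithmetic.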
In order to show the effect of sparsity on performance, we design a synthetic dataset based on an approach similar to that of shamir2014communication. First, we generate i.i.d. Gaussian input samples whose covariance matrix is diagonal. Then, we randomly choose some entries of all samples and set them to zero to add sparsity. The ground-truth model is the vector of all ones. Finally, labels are generated based on probabilities given by the logistic function. The parameters of the objective function are chosen accordingly for this experiment. The time to reach a fixed accuracy is measured for all methods and normalized by the LDQN timing. The results are shown in Figure 6. DAveRPG and DANE perform poorly on fully dense datasets (sparsity = 0%); however, DAveRPG and GIANT perform better than LDQN as the dataset sparsity increases. We observe that when more than 90% of the data is sparse, DAveRPG is the most efficient method, whereas for denser datasets GIANT and LDQN are more efficient on the synthetic data.
5 Conclusion
We proposed LDQN, an asynchronous limited-memory BFGS method. We showed that, under certain assumptions, LDQN admits linear convergence over epochs despite asynchronous computations. Our numerical experiments show that LDQN can lead to significant performance improvements in practice in terms of both memory requirements and running time.
References
Appendix
6 Proof of Theorem 1
Recall that, by definition, the average Hessian satisfies the equality stated above. Hence, equation (10) implies that the iterates admit the bound
(13) 
where the matrix involved is as in (6) and ‖·‖ denotes the norm of a matrix. Notice that, by its definition and from (9), this matrix admits two-sided bounds and hence is positive definite. So the map in question, defined from the set of symmetric positive-definite matrices to itself, is a matrix convex function [MatrixInequalities, see E.7.a]; that is, for any two such matrices and any convex combination, the following inequality holds
(14) 
In particular, if a matrix is a convex combination of some positive-definite matrices A and B, then by matrix convexity we have
(15) 
where the maximum on the right-hand side is in the sense of the Loewner ordering, i.e., max{A, B} equals A if B ⪯ A and equals B otherwise. From the bounds (9), we have