 # Efficient Distributed Hessian Free Algorithm for Large-scale Empirical Risk Minimization via Accumulating Sample Strategy

In this paper, we propose a Distributed Accumulated Newton Conjugate gradiEnt (DANCE) method in which the sample size gradually increases in order to quickly obtain a solution whose empirical loss is within a satisfactory statistical accuracy. Our proposed method is multistage: the solution of each stage serves as a warm start for the next stage, which contains more samples (including all samples from the previous stage). The multistage scheme reduces the number of passes over the data needed to achieve the statistical accuracy of the full training set. Moreover, our algorithm is naturally easy to distribute and exhibits the strong scaling property, meaning that acceleration can always be expected from using more computing nodes. Various iteration complexity results regarding descent direction computation, communication efficiency, and stopping criteria are analyzed in the convex setting. Our numerical results illustrate that the proposed method outperforms comparable methods on learning problems, including neural networks.


## 1 Introduction

In the field of machine learning, solving the expected risk minimization problem has received a lot of attention over the last decades. The problem has the form

$$\min_{w\in\mathbb{R}^d} L(w) = \min_{w\in\mathbb{R}^d} \mathbb{E}_z[f(w,z)], \qquad (1)$$

where $z$ is a random variable containing both the feature variables and a response variable, and $f(w,z)$ is a loss function with respect to $w$ for any fixed value of $z$.

In most practical problems, the distribution of $z$ is either unknown or makes evaluating the expected loss very difficult. One general idea is to estimate the expectation with a statistical average over a large number of independent and identically distributed samples of $z$, denoted by $z_1,\dots,z_N$, where $N$ is the total number of samples. Thus, the problem in (1) can be rewritten as the Empirical Risk Minimization (ERM) problem

$$\min_{w\in\mathbb{R}^d} L_N(w) = \min_{w\in\mathbb{R}^d} \frac{1}{N}\sum_{i=1}^{N} f_i(w), \qquad (2)$$

where $f_i(w) := f(w, z_i)$.
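As a toy illustration of (2), the empirical risk is simply an average of per-sample losses. The following minimal NumPy sketch (with a hypothetical squared-error loss standing in for a generic $f_i$) makes this concrete:

```python
import numpy as np

def empirical_risk(w, X, y, loss):
    # L_N(w) = (1/N) * sum_i f(w, z_i), where z_i = (x_i, y_i)
    return np.mean([loss(w, x_i, y_i) for x_i, y_i in zip(X, y)])

# hypothetical stand-in loss: squared error f(w, z) = 0.5 * (x^T w - y)^2
squared = lambda w, x, y: 0.5 * (x @ w - y) ** 2

X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
print(empirical_risk(np.zeros(2), X, y, squared))  # 0.5
```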

Many studies have developed optimization algorithms to find an optimal solution of the above problem under different settings. For example, the studies by Beck09fista; nesterov2013introductory; drusvyatskiy2016optimal; ma2017underestimate are gradient-based methods which require at least one pass over all data samples to evaluate the gradient $\nabla L_N(w)$. As the sample size $N$ becomes larger, these methods become less efficient compared to stochastic gradient methods, in which the gradient is approximated based on a small number of samples (Johnson13SVRG; roux2012stochastic; Defazio14saga; ShalevShwartz12sdca; Konecny17s2gd; nguyen2017sarah).

Second-order methods are well known to achieve faster convergence rates by utilizing Hessian information. Recently, several papers (Byrd15quasi; schraudolph2007stochastic; mokhtari2015global) have studied how to apply second-order methods to the ERM problem. However, evaluating the Hessian inverse or its approximation is computationally costly, which makes these methods difficult to apply to large-scale problems.

The above difficulty can be addressed by the idea of adaptive sample size methods from recent works (Mokhtari_firstOredr; Eisen17; Mokhtari_AdaNewton16), which is based on the following two facts. First, the empirical risk and the statistical loss have different minimizers, and there is no point in optimizing beyond the difference between the two objectives, which is called the statistical accuracy $V_n$. More importantly, if we increase the number of samples in the ERM problem, the solutions should not change significantly, since the samples are drawn from a fixed but unknown probability distribution. The key idea of adaptive sample size methods is to solve an ERM problem with a small number of samples up to its statistical accuracy and use the obtained solution as a warm start for the next ERM problem, which contains more samples. In particular, Mokhtari_AdaNewton16 reduced the complexity of Newton's method by incorporating the adaptive sample size idea; however, their approach still requires computing Hessian inverses, which is costly when the problem dimension $d$ is large. In order to decrease the cost of computing the Hessian inverse, Eisen17 proposed the $k$-Truncated Adaptive Newton ($k$-TAN) approach, in which the inverse of the Hessian is approximated using a truncation to the $k$ largest eigenvalues of the Hessian. The cost per iteration of this approach may still be unsatisfactory when $k$ is large or close to $d$.

In this paper, we propose an increasing sample size second-order method which solves the Newton step in ERM problems more efficiently. Our proposed algorithm, called Distributed Accumulated Newton Conjugate gradiEnt (DANCE), starts with a small number of samples and minimizes the corresponding ERM problem. This subproblem is solved up to a specified accuracy, and its solution is used as a warm start for the next stage, in which we solve an empirical risk with a larger number of samples that contains all the previous samples. This procedure repeats until either all samples have been included or further increasing the sample size becomes unnecessary. Our DANCE method combines the idea of increasing the sample size with the inexact damped Newton method of Disco15 and ma2016distributed. Instead of solving the Newton system directly, we apply the preconditioned conjugate gradient (PCG) method as the solver for each Newton step. Moreover, running first-order algorithms such as SGD and Adam (kingma2014adam) in a distributed fashion is always challenging. The DANCE method is designed to be easily parallelized and shares the strong scaling property, i.e., the linear speed-up property. Since gradient and Hessian-vector product computations can be split across machines, extra acceleration can be expected from increasing the number of computational nodes. We formally characterize the required number of communication rounds to reach the statistical accuracy of the full dataset. For the distributed setting, we show that DANCE is communication efficient in both theory and practice. In particular, Table 1 highlights the advantage of DANCE over other adaptive sample size methods, which is discussed in more detail in Section 4.

## 2 Problem Formulation

In this paper, we focus on finding the optimal solution of the problem in (1). As described earlier, due to the difficulties of expected risk minimization, we instead aim to find a solution of the empirical loss function $L_N(w)$, which is the empirical mean over the $N$ samples. Now, consider the empirical loss $L_n(w)$ associated with $n \le N$ samples. It has been shown (estError1_07; estError2_07) that the difference between the expected loss and the empirical loss is, with high probability (w.h.p.), upper bounded by the statistical accuracy $V_n$, i.e., w.h.p.

$$\sup_{w\in\mathbb{R}^d} |L(w) - L_n(w)| \le V_n. \qquad (3)$$

In other words, there exists a constant $\delta \in (0,1)$ such that inequality (3) holds with probability at least $1-\delta$. Generally speaking, the statistical accuracy $V_n$ depends on $n$ (it also depends on $\delta$, but for simplicity of notation we keep only the dependence on the sample size), and is of order $V_n = O(1/n^{\gamma})$ where $\gamma \in [0.5, 1]$ (Vapnik13; 1444; bartlett2006convexity).

For problem (2), if we find an approximate solution $w_N$ which satisfies $L_N(w_N) - L_N(\hat{w}_N) \le V_N$, where $\hat{w}_N$ is the true minimizer of $L_N$, it is not necessary to go further and find a better solution (a solution with smaller optimization error). The reason is that for a more accurate solution, the sum of the estimation and optimization errors does not become smaller than $V_N$. Therefore, when we say that $w_N$ is a $V_N$-suboptimal solution for the risk $L_N$, it means that $L_N(w_N) - L_N(\hat{w}_N) \le V_N$; in other words, $w_N$ solves problem (2) within its statistical accuracy.

It is crucial to note that if we add an additional term of order $V_n$ to the empirical loss $L_n$, the solution of the new problem remains within the same order of accuracy with respect to the expected loss $L$. Therefore, we can regularize the non-strongly convex loss function $L_n$ by $\frac{cV_n}{2}\|w\|^2$ and consider the following problem:

$$\min_{w\in\mathbb{R}^d} R_n(w) := \frac{1}{n}\sum_{i=1}^{n} f_i(w) + \frac{cV_n}{2}\|w\|^2. \qquad (4)$$

The noticeable feature of the new empirical risk is that $R_n$ is $cV_n$-strongly convex¹ (¹$c$ depends on the number of samples, the probability, and the VC dimension of the problem; for simplicity of notation we keep only the dependence on the number of samples.), where $c$ is a positive constant depending on the VC dimension of the problem. Thus, we can utilize any practitioner-favorite algorithm; specifically, we apply the inexact damped Newton method, discussed in the next section. Since a larger strong-convexity parameter leads to faster convergence, we can expect the first stages to converge quickly, as the values of $V_n$ in those stages are large (larger statistical accuracy), as discussed in Theorem 1. From now on, when we say $w_n$ is a $V_n$-suboptimal solution of the risk $R_n$, we mean that $R_n(w_n) - R_n(w_n^*) \le V_n$, where $w_n^*$ is the true minimizer of $R_n$. Our final aim is to find $w_N$, a $V_N$-suboptimal solution for the risk $R_N$, which is the risk over the whole dataset.
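The interplay between the sample size $n$, the statistical accuracy $V_n$, and the regularizer in (4) can be sketched as follows. This is a minimal illustration; the choice $V_n = n^{-\gamma}$ with all constants dropped is our simplification:

```python
import numpy as np

def stat_accuracy(n, gamma=0.5):
    # V_n is of order 1/n^gamma; constants are dropped for illustration
    return 1.0 / n ** gamma

def regularized_risk(w, sample_losses, c=1.0, gamma=0.5):
    # R_n(w) = (1/n) sum_i f_i(w) + (c * V_n / 2) * ||w||^2
    n = len(sample_losses)
    Vn = stat_accuracy(n, gamma)
    return np.mean(sample_losses) + 0.5 * c * Vn * np.dot(w, w)

# the regularizer (hence the strong-convexity parameter c * V_n)
# shrinks as the sample size grows
w = np.array([2.0, 0.0])
print(regularized_risk(w, [0.0] * 4))    # 0.5 * 1 * (1/2) * 4 = 1.0
print(regularized_risk(w, [0.0] * 100))  # 0.5 * 1 * (1/10) * 4 = 0.2
```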
In the rest of this section, we define self-concordant functions, which have the property that the third derivative can be controlled by the second derivative. Assuming the function $f$ has a continuous third derivative, we define self-concordance as follows.

###### Definition 1.

A convex function $f$ is $\rho$-self-concordant if for any $w$ and $u \in \mathbb{R}^d$,

$$|u^{T}(f'''(w)[u])u| \le \rho\,(u^{T}\nabla^2 f(w)\,u)^{3/2}, \qquad (5)$$

where $f'''(w)[u] := \lim_{t\to 0}\frac{1}{t}\left(\nabla^2 f(w+tu) - \nabla^2 f(w)\right)$. As discussed in (nesterov2013introductory), any self-concordant function with parameter $\rho$ can be rescaled to become standard self-concordant (with parameter 2). Well-known self-concordant empirical loss functions include linear regression, logistic regression, and the squared hinge loss. In order to prove our results, the following conditions are assumed in our analysis.
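For intuition, in one dimension inequality (5) reduces to $|f'''(w)| \le \rho\, f''(w)^{3/2}$. The classic barrier $f(w) = -\log w$ (our choice of example, not one from the text) satisfies it with equality for $\rho = 2$, which can be checked numerically:

```python
import numpy as np

# For f(w) = -log(w): f''(w) = 1/w^2 and f'''(w) = -2/w^3, so
# |f'''(w)| = 2 * (f''(w))^{3/2}: standard self-concordance (rho = 2).
def holds_with_rho_2(w):
    f2 = 1.0 / w ** 2
    f3 = -2.0 / w ** 3
    return abs(f3) <= 2.0 * f2 ** 1.5 + 1e-9

assert all(holds_with_rho_2(w) for w in np.linspace(0.1, 10.0, 100))
print("self-concordance inequality holds on the grid")
```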

###### Assumption 1.

The loss functions $f(w,z)$ are convex w.r.t. $w$ for all values of $z$. In addition, their gradients are Lipschitz continuous with constant $M$:

$$\|\nabla f(w,z) - \nabla f(w',z)\| \le M\|w - w'\|, \quad \forall z. \qquad (6)$$
###### Assumption 2.

The loss functions $f(w,z)$ are self-concordant w.r.t. $w$ for all values of $z$.

The immediate conclusion of Assumption 1 is that both $L_n$ and $R_n$ are convex and smooth; in particular, $R_n$ is $cV_n$-strongly convex and $(cV_n + M)$-smooth. Moreover, by Assumption 2, $R_n$ is also self-concordant.

## 3 Distributed Accumulated Newton Conjugate Gradient Method

The goal of the inexact damped Newton method, as discussed in (Disco15), is to compute the next iterate based on an approximate Newton-type update. It has two important differences compared to Newton's method. First, as the word "damped" indicates, the step size of the inexact damped Newton update is not $1$, since it depends on an approximation of the Newton decrement. Second, there is no need to compute the exact Newton direction (which is very expensive to calculate in one step); instead, an inexact Newton direction is computed by an iterative process, to a desired accuracy under suitable measures.

In order to utilize the important features of ERM, we combine the idea of increasing the sample size with the inexact damped Newton method. In our proposed method, we start by handling a small number of samples, say $m_0$ samples. We solve the corresponding ERM to within its statistical accuracy, i.e. we find a $V_{m_0}$-suboptimal solution, using the inexact damped Newton algorithm. In the next step, we increase the number of samples geometrically with rate $\alpha > 1$, i.e., to $\alpha m_0$ samples. The approximate solution of the previous ERM is used as a warm start for the new ERM. The sample size increases until it equals the total number of samples.

Consider the iterate $w_m$ within the statistical accuracy of the set with $m$ samples, i.e. $R_m(w_m) - R_m(w_m^*) \le V_m$ for the risk $R_m$. In DANCE, we increase the size of the training set to $n$ and use the inexact damped Newton method to find the iterate $w_n$ which is a $V_n$-suboptimal solution for the sample set with $n$ samples, i.e. $R_n(w_n) - R_n(w_n^*) \le V_n$, after $K_n$ iterations. To do so, we initialize $\tilde{w}_0 = w_m$ and update the iterates according to

$$\tilde{w}_{k+1} = \tilde{w}_k - \frac{1}{1+\delta_n(\tilde{w}_k)}\,v_k, \qquad (7)$$

where $v_k$ is an $\epsilon_k$-Newton direction. The outcome of applying (7) for $K_n$ iterations is the approximate solution $w_n = \tilde{w}_{K_n}$ for the risk $R_n$.

To properly define the approximate Newton direction $v_k$, first note that the gradient and Hessian of the risk $R_n$ can be evaluated as

$$\nabla R_n(w) = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(w) + cV_n w \qquad (8)$$

and

$$\nabla^2 R_n(w) = \frac{1}{n}\sum_{i=1}^{n} \nabla^2 f_i(w) + cV_n I, \qquad (9)$$

respectively. The ideal descent direction would be the Newton direction $\nabla^2 R_n(\tilde{w}_k)^{-1}\nabla R_n(\tilde{w}_k)$; however, the cost of computing this direction exactly is prohibitive. Therefore, we use $v_k$, an $\epsilon_k$-Newton direction satisfying the condition

$$\|\nabla^2 R_n(\tilde{w}_k)v_k - \nabla R_n(\tilde{w}_k)\| \le \epsilon_k. \qquad (10)$$

Since the descent direction $v_k$ is an approximation of the Newton step, we also redefine the Newton decrement based on this modification. To be more specific, we define $\delta_n(\tilde{w}_k) := (v_k^{T}\nabla^2 R_n(\tilde{w}_k)v_k)^{1/2}$ as the approximation of the (exact) Newton decrement $(\nabla R_n(\tilde{w}_k)^{T}\nabla^2 R_n(\tilde{w}_k)^{-1}\nabla R_n(\tilde{w}_k))^{1/2}$, and use it in the update (7).
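The damped update (7) can be sketched in a few lines. For clarity, this toy sketch solves the Newton system exactly with a dense solver, whereas DANCE obtains $v_k$ only approximately via PCG:

```python
import numpy as np

def damped_newton_step(w, grad, hess):
    """One damped Newton update as in (7), with v_k computed exactly here."""
    H, g = hess(w), grad(w)
    v = np.linalg.solve(H, g)       # stands in for the (inexact) PCG solve
    delta = np.sqrt(v @ H @ v)      # approximate Newton decrement
    return w - v / (1.0 + delta)    # damped step

# toy quadratic R(w) = 0.5 * ||w||^2: grad = w, hess = I
w = damped_newton_step(np.array([1.0]), lambda w: w, lambda w: np.eye(1))
print(w)  # [0.5]: step size 1/(1 + delta) with delta = 1
```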
In order to find $v_k$, an $\epsilon_k$-Newton direction, we use preconditioned conjugate gradient (PCG). As discussed in (Disco15; nocedal2006sequential), PCG is an efficient iterative method for solving the Newton system to the required accuracy. The preconditioner we consider has the form $P = H_S + \mu I$, where $H_S = \frac{1}{|S|}\sum_{i\in S}\nabla^2 f_i(\tilde{w}_k) + cV_n I$, $S \subseteq \{1,\dots,n\}$ is a subsample, and $\mu$ is a small regularization parameter. In this case, $v_k$ is an approximate solution of the system $P^{-1}\nabla^2 R_n(\tilde{w}_k)v_k = P^{-1}\nabla R_n(\tilde{w}_k)$. The reason for preconditioning is that the condition number of $P^{-1}\nabla^2 R_n(\tilde{w}_k)$ can be close to 1 when $H_S$ is close to $\nabla^2 R_n(\tilde{w}_k)$; consequently, PCG can be faster than CG. The PCG steps are summarized in Algorithm 2. In every iteration of Algorithm 2, a system with coefficient matrix $P$ needs to be solved in step 10. Due to the structure of the matrix $P$, and as discussed in (ma2016distributed), it can be viewed as a sequence of rank-1 updates on a diagonal matrix, so the Woodbury formula (Press03) provides a very efficient way to solve this system. The following lemma states the number of PCG iterations required to find an $\epsilon_k$-Newton direction, which is used in every stage of the DANCE algorithm.

###### Lemma 1.

(Lemma 4 in Disco15) Suppose Assumption 1 holds and $\|H_S - \nabla^2 R_n(\tilde{w}_k)\| \le \mu$. Then Algorithm 2, after $C_n(\epsilon_k)$ iterations, returns $v_k$ such that $\|\nabla^2 R_n(\tilde{w}_k)v_k - \nabla R_n(\tilde{w}_k)\| \le \epsilon_k$, where

$$C_n(\epsilon_k) = \left\lceil \sqrt{1+\frac{2\mu}{cV_n}}\;\log\!\left(\frac{2\sqrt{\frac{cV_n+M}{cV_n}}\,\|\nabla R_n(\tilde{w}_k)\|}{\epsilon_k}\right) \right\rceil. \qquad (11)$$

Note that $\epsilon_k$ has a crucial effect on the speed of the algorithm. When $\epsilon_k = 0$, $v_k$ is the exact Newton direction, and the update in (7) is the exact damped Newton step (which recovers the update of the Ada Newton algorithm in (Mokhtari_AdaNewton16) when the step length is 1). Furthermore, $K_n$ is the number of iterations of (7) needed to reach a $V_n$-suboptimal solution for the risk $R_n$, i.e. $R_n(\tilde{w}_{K_n}) - R_n(w_n^*) \le V_n$. Hence, if we start with the iterate $w_{m_0}$ corresponding to $m_0$ samples, then after the sum of the $K_n$ over all stages we reach $w_N$ with the statistical accuracy $V_N$ of the whole dataset. In Theorem 1, the number of communication rounds required to reach this statistical accuracy is characterized.
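The PCG iteration used to produce the $\epsilon_k$-Newton direction can be sketched generically as follows. This is a textbook PCG loop, not a reproduction of Algorithm 2; `P_solve` applies $P^{-1}$, which DANCE would implement via the Woodbury formula:

```python
import numpy as np

def pcg(H, g, P_solve, tol=1e-10, max_iter=100):
    """Solve H v = g by preconditioned conjugate gradient."""
    v = np.zeros_like(g)
    r = g.copy()                 # residual g - H v with v = 0
    s = P_solve(r)               # preconditioned residual
    p = s.copy()
    rs = r @ s
    for _ in range(max_iter):
        Hp = H @ p
        alpha = rs / (p @ Hp)
        v += alpha * p
        r -= alpha * Hp
        if np.linalg.norm(r) <= tol:
            break
        s = P_solve(r)
        rs, rs_old = r @ s, rs
        p = s + (rs / rs_old) * p
    return v

H = np.array([[4.0, 1.0], [1.0, 3.0]])
g = np.array([1.0, 2.0])
v = pcg(H, g, P_solve=lambda r: r / np.diag(H))  # Jacobi preconditioner
print(np.allclose(H @ v, g))  # True
```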

Our proposed method is summarized in Algorithm 1. We start with $m_0$ samples and an initial point $w_{m_0}$ which is a $V_{m_0}$-suboptimal solution for the risk $R_{m_0}$. In every iteration of the outer loop of Algorithm 1, we increase the sample size geometrically with rate $\alpha$ in step 4. In the inner loop of Algorithm 1, i.e. steps 6-10, we compute the approximate Newton direction and the approximate Newton decrement using the PCG method shown in Algorithm 2. This process repeats until we obtain a point within the statistical accuracy $V_N$ of the full training set.
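The outer accumulating-sample loop can be outlined as below. This is schematic only; `solve_stage` is a hypothetical placeholder for the inexact damped Newton inner loop of Algorithm 1:

```python
import numpy as np

def dance_outer_loop(X, y, solve_stage, m0=32, alpha=2, gamma=0.5):
    """Grow the sample geometrically; warm-start each stage from the last."""
    N = len(y)
    n = min(m0, N)
    w = np.zeros(X.shape[1])
    while True:
        Vn = 1.0 / n ** gamma                  # target statistical accuracy
        w = solve_stage(w, X[:n], y[:n], Vn)   # V_n-suboptimal solution
        if n == N:
            return w                           # full training set reached
        n = min(alpha * n, N)

# count the stages with a dummy inner solver
calls = []
stage = lambda w0, Xn, yn, Vn: (calls.append(len(yn)), w0)[1]
X, y = np.ones((100, 2)), np.ones(100)
dance_outer_loop(X, y, stage)
print(calls)  # [32, 64, 100]
```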

### Stopping Criteria

Here we discuss two stopping criteria for the inner loop of Algorithm 1. First, since $R_n(w_n^*)$ is unknown in practice, we can use the strong convexity inequality $R_n(\tilde{w}_k) - R_n(w_n^*) \le \frac{1}{2cV_n}\|\nabla R_n(\tilde{w}_k)\|^2$ to obtain a stopping criterion: stop whenever $\|\nabla R_n(\tilde{w}_k)\| \le \sqrt{2cV_n \cdot V_n}$, which guarantees $R_n(\tilde{w}_k) - R_n(w_n^*) \le V_n$. Another stopping criterion is discussed by Disco15 and uses the fact that the risk $R_n$ is self-concordant; see Section LABEL:pracStopCriterion for the proof. It bounds the suboptimality in terms of the approximate Newton decrement $\delta_n(\tilde{w}_k)$ and guarantees $R_n(\tilde{w}_k) - R_n(w_n^*) \le V_n$ whenever $\delta_n(\tilde{w}_k)$ falls below a threshold depending on $V_n$.

### Distributed Implementation

Similar to the algorithm in (Disco15), Algorithms 1 and 2 can be implemented in a distributed environment. Suppose the entire dataset is stored across $K$ machines, where machine $i$ stores $n_i$ data samples such that $\sum_{i=1}^{K} n_i = N$. Under this setting, each iteration of Algorithm 1 can be executed on the machines in parallel. To implement Algorithm 2 in a distributed manner, a broadcast operation is needed at each iteration to guarantee that every machine shares the same iterate. Moreover, the gradient and the Hessian-vector product can be computed locally and then reduced to the master machine. As the batch size increases, the computation on each machine grows while the amount of communication stays the same. As a consequence, the computation expense gradually dominates the communication expense before the algorithm terminates. Therefore the proposed algorithm can take advantage of more machines to shorten the running time of Algorithm 2.
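The split-and-reduce pattern described above can be simulated serially in a few lines. The shard list stands in for machines, and the least-squares gradient is our illustrative stand-in loss; in the actual implementation the sum would be an MPI reduce across nodes:

```python
import numpy as np

def distributed_gradient(w, shards, local_grad_sum):
    """Sum per-machine gradient contributions, then average (a reduce)."""
    partials = [local_grad_sum(w, Xi, yi) for Xi, yi in shards]  # in parallel
    n = sum(len(yi) for _, yi in shards)
    return np.sum(partials, axis=0) / n

# least-squares gradient sum as a stand-in: X^T (X w - y)
lsq = lambda w, X, y: X.T @ (X @ w - y)

rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(8, 3)), rng.normal(size=8), rng.normal(size=3)
shards = [(X[:3], y[:3]), (X[3:], y[3:])]
# sharded computation agrees with the single-machine gradient
print(np.allclose(distributed_gradient(w, shards, lsq), lsq(w, X, y) / 8))  # True
```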

## 4 Complexity Analysis

In this section, we study the convergence properties of our algorithm. To do so, we analyze the required number of communication rounds and total computational complexity of DANCE to solve every subproblem up to its statistical accuracy.

We analyze the case where $w_m$ is a $V_m$-suboptimal solution of the risk $R_m$, and we derive a bound on the number of communication rounds required to ensure that $w_n$ is a $V_n$-suboptimal solution for the risk $R_n$.

###### Theorem 1.

Suppose that Assumptions 1 and 2 hold. Consider $w_m$ which satisfies $R_m(w_m) - R_m(w_m^*) \le V_m$, and the risk $R_n$ corresponding to a sample set with $n = \alpha m$ samples. Set the parameter $\epsilon_k$ (the error in (10)) as follows³ (³It is shown in (Disco15) that with this tolerance the inexact damped Newton method has a linear convergence rate; therefore, every stage of DANCE converges linearly.):

$$\epsilon_k = \beta\left(\frac{cV_n}{M+cV_n}\right)^{1/2}\|\nabla R_n(\tilde{w}_k)\|, \qquad (12)$$

where $\beta \in (0,1)$ is a suitably chosen constant. Then, in order to find the variable $w_n$ which is a $V_n$-suboptimal solution for the risk $R_n$, i.e. $R_n(w_n) - R_n(w_n^*) \le V_n$, the number of communication rounds $T_n$ satisfies

$$T_n \le K_n\left(1 + C_n(\epsilon_k)\right), \quad \text{w.h.p.}, \qquad (13)$$

where $K_n$ is the number of inexact damped Newton iterations in the stage with $n$ samples. Here $\lceil x\rceil$ denotes the smallest nonnegative integer larger than or equal to $x$.

As a result, the update in (7) needs to be performed $K_n$ times in order to attain $w_n$, a $V_n$-suboptimal solution for the risk $R_n$. Also, based on the result in (13), when the strong-convexity parameter $cV_n$ of the risk $R_n$ is large, fewer iterations (communication rounds) are needed, or equivalently faster convergence is achieved, to reach a $V_n$-suboptimal iterate; this happens in the first stages.

###### Corollary 1.

Suppose that Assumptions 1 and 2 hold. Further, assume that $w_m$ is a $V_m$-suboptimal solution for the risk $R_m$, and consider $R_n$ as the risk corresponding to a sample set with $n = 2m$ samples. If we set the parameter $\epsilon_k$ (the error in (10)) as in (12), then with high probability

$$\tilde{T}_n \le \left(\left\lceil \left(3 + \left(1-\tfrac{1}{2^{\gamma}}\right)\left(2+\tfrac{c}{2}\|w^*\|^2\right)\right)\frac{V_m}{2\,\omega(1/6)} \right\rceil + \left\lceil \log_2\!\left(\frac{2\,\omega(1/6)}{V_n}\right)\right\rceil\right)\left(1 + \left\lceil\sqrt{1+\frac{2\mu}{cV_n}}\,\log_2\!\left(\frac{2(cV_n+M)}{\beta cV_n}\right)\right\rceil\right) \qquad (14)$$

communication rounds are needed to reach the point $w_n$ with statistical accuracy $V_n$ for the risk $R_n$.

Figure 1: Performance of different algorithms on a logistic regression problem with the rcv1 dataset. In the two left figures, the plot DANCE* is the training accuracy based on the entire training set, while the plot DANCE represents the training accuracy based on the current sample size.

By Corollary 1, it is shown⁴ (⁴The proof of this part is in Section LABEL:proofOfCo1.) that after $\tilde{T}$ rounds of communication we reach a point with the statistical accuracy $V_N$ of the full training set, where $\tilde{T}$ with high probability is bounded above by

$$\tilde{T} \le \left(2\log_2\frac{N}{m_0} + \log_2\frac{N}{m_0}\,\log_2\!\left(\frac{2\,\omega(1/6)}{V_N}\right)\right)\left(1 + \left\lceil\sqrt{1+\frac{2\mu}{cV_N}}\,\log_2\!\left(\frac{2(cV_N+M)}{\beta cV_N}\right)\right\rceil\right), \qquad (15)$$

where $m_0$ is the size of the initial training set. The result in (15) implies that the overall number of communication rounds needed to obtain the statistical accuracy of the full training set is $O\!\left(\log_2\frac{N}{m_0}\cdot\log_2\frac{1}{V_N}\right)$ up to the PCG factor; since $\log_2(1/V_N) = \gamma\log_2 N$, this is $O(\gamma(\log_2 N)^2)$ for $\gamma \in [0.5, 1]$. The communication rounds for DiSCO (Disco15)⁵ additionally depend on the suboptimality of the initial point (⁵For a fair comparison, the corresponding parameters are matched in their analysis, and constants are ignored in the communication complexity.). Comparing these bounds shows that the communication complexity of DANCE is independent of the choice of the initial variable and its suboptimality, while the overall communication complexity of DiSCO depends on the initial suboptimality. In addition, each iteration of DiSCO requires processing all samples in the dataset, while DANCE operates only on a growing subset of samples at each stage; therefore, the computational complexity of DANCE for achieving the statistical accuracy of the training set is also lower than that of DiSCO. Furthermore, by using the Woodbury formula (ma2016distributed; Press03), every PCG iteration can be performed at low cost⁶ (⁶The system in the Woodbury formula involves low-rank updates on a diagonal matrix; a small rank usually works well in our experiments.). Table 1 shows that the total complexity of the $k$-TAN method (Eisen17) is lower than that of AdaNewton (Mokhtari_AdaNewton16); further, the total complexity of DANCE is lower than both the AdaNewton and $k$-TAN methods.

## 5 Numerical Experiments

In this section, we present numerical experiments on several large real-world datasets to show that our restarting DANCE algorithm can outperform existing methods on both convex and non-convex problems. We also compare results obtained with different numbers of machines to demonstrate the strong scaling property of DANCE. All algorithms are implemented in Python with the PyTorch (paszke2017automatic) library, and we use MPI for Python (dalcin2011parallel) for the distributed environment⁷ (⁷All codes to reproduce these experimental results are available at an anonymous link.). In all plots in this section, vertical pink dashed lines represent restarts of our DANCE method.

Figure 2: Comparison between DANCE and SGD with various hyper-parameter settings on the Cifar10 dataset and the vgg11 network (simonyan2014very), a convolutional neural network (see details in Appendix LABEL:sec:_details_concering_experimental). The top and bottom figures show how loss values, training accuracy, and test accuracy change with respect to epochs and running time. Note that we force both algorithms to restart (double the training sample size) after the following numbers of epochs: 0.2, 0.8, 1.6, 3.2, 6.4, 12, 24, 48, 96. For SGD, we vary the learning rate over {0.01, 0.001, 0.0001} and the batch size over {128, 512}.

### Convex problems.

First, we compare DANCE with two algorithms, SGD (mini-batch)⁸ (⁸The batch size is 10 in our experiments.) and DiSCO (Disco15), for solving convex problems. The experiments in this section are performed on a cluster with 16 Xeon E5-2620 CPUs (2.40GHz).

We use a logistic regression model for two binary classification tasks based on the rcv1 and gisette (CC01a) datasets as our convex test problems. We use the logistic loss $f_i(w) = \log(1+\exp(-y_i x_i^{T} w))$, where $x_i$ is a data sample and $y_i \in \{-1, +1\}$ is the binary label corresponding to $x_i$. We then minimize the regularized empirical loss as in (4). Note that DiSCO and SGD use a fixed $\ell_2$-regularization parameter, while DANCE uses the stage-dependent coefficient $cV_n$ in (4) as its $\ell_2$-regularization parameter.

We run and compare the algorithms on the different datasets using several computing nodes. The starting batch size on each node for our DANCE algorithm is small, while the other two algorithms go over the whole dataset at each iteration. In the DANCE implementation, the number of samples used to form the new ERM loss is doubled from the previous stage after each restart.

In Figure 1, we consistently observe that DANCE performs better than the other two methods from the early stages. Both training and test accuracy for DANCE converge to optimality after processing a small number of samples. This observation suggests that DANCE finds a good initial solution and updates it over time. Compared with DiSCO, our restarting approach helps reduce the computational cost of the first iterations, where second-order methods usually perform less efficiently than first-order methods. The key difference comes from the idea of increasing the sample size: DANCE goes over a small number of samples, finds a suboptimal solution, and uses it as a warm start for the next stage. In this way, fewer passes over the data are needed in the beginning, yet with satisfactory accuracy. On the other hand, DiSCO uses the total number of samples from the beginning, so several passes over the data are needed to reach the neighborhood of the global solution. Therefore, DANCE behaves efficiently and reaches the optimal solution with fewer passes over the data.

### Non-convex problems.

Figure 3: Comparison between DANCE and Adam on the Mnist dataset and NaiveCNet. For DANCE, the initial batch size is 1024. For Adam, the learning rate is 10⁻⁴ and the batch size is either 64 or 128.

Even though the complexity analysis in Section 4 only covers the convex case, the DANCE algorithm is also able to handle non-convex problems efficiently. In this section, we compare our method with several stochastic first-order algorithms, namely stochastic gradient descent (SGD), SGD with momentum (SGDMom), and Adam (kingma2014adam), on training convolutional neural networks (CNNs) on two image classification datasets, Mnist and Cifar10. The details of the datasets and the CNN architecture applied to each dataset are presented in Appendix LABEL:sec:_details_concering_experimental. To perform a fair comparison with the first-order variants, we assume the data comes in an online streaming manner, e.g., only a few data samples can be accessed at the beginning, and new samples arrive at a fixed rate. Such a setting is common in industrial production, where business data is collected in a streaming fashion. We feed new data points to all algorithms only once the amount of new samples equals the number of existing samples. The experiments in this section are run on an AWS p2.xlarge instance with an NVIDIA K80 GPU.

In Figure 2, we compare DANCE with the built-in SGD optimizer in PyTorch on the Cifar10 dataset to train the Vgg11 architecture. Note that several hyper-parameters need to be tuned for SGD to reach its best performance, such as the batch size and the learning rate, which are not required for DANCE. Since we are in the online streaming data setting, we do not need to determine a restarting criterion. The results show that SGD is sensitive to hyper-parameter tuning, i.e., different combinations of hyper-parameters affect the performance of SGD considerably, and tuning them well to achieve the best performance can be painful. Our DANCE algorithm does not have this weakness, and its performance is comparable to SGD with the best parameter setting. We also observe that the DANCE algorithm leads to a faster decrease of the loss value, similar to our convex experiments; again, this is due to the fast convergence rate of second-order methods. Additional experiments comparing against SGD with momentum and Adam on Mnist with NaiveCNet can be found in Appendix LABEL:sec:_additional_plots.

In Figure 3, the performance of the built-in Adam optimizer and our DANCE algorithm are compared on the Mnist dataset and the NaiveCNet architecture (see details in Appendix LABEL:sec:_details_concering_experimental). In this experiment, we do not assume that the data samples arrive in an online streaming manner for Adam, i.e., the Adam algorithm does not restart and runs on the whole dataset directly. Also, this experiment is performed only on CPUs. We set the learning rate for Adam to 10⁻⁴ and vary the batch size between 64 and 128. The evolution of the loss, training accuracy, and test accuracy with respect to epochs and running time over the whole dataset is reported in Figure 3 for the different algorithms. One can observe that, for the same number of epochs, Adam eventually achieves better test accuracy, while in terms of running time our DANCE algorithm is faster due to its distributed implementation.

### Strong scaling

Finally, we demonstrate that our DANCE method has the strong scaling property. As shown in Figure 4, whenever we increase the number of nodes, we obtain acceleration towards optimality. We vary the starting batch size and report the speed-up compared to the serial run (1 node). The results indicate that as we increase the batch size, the speed-up approaches the ideal linear speed-up. Since our restarting approach increases the sample size along the training process, after several restarts we reach strong scaling performance asymptotically. The advantage of this setting is that it utilizes large batches over multiple nodes efficiently without sacrificing convergence performance.

Figure 4: Performance of the DANCE algorithm with different numbers of computing nodes.

## 6 Conclusion

We proposed DANCE, an efficient distributed Hessian-free algorithm with an increasing sample size strategy, to solve the empirical risk minimization problem. Our algorithm obtains a solution within the statistical accuracy of the ERM problem in very few epochs and can be implemented in a distributed environment. We analyzed the communication efficiency of DANCE, highlighting its advantage over DiSCO (Disco15) in terms of communication and over the AdaNewton and $k$-TAN methods in terms of total computational complexity. The presented numerical experiments demonstrated the fast convergence of DANCE for both convex and non-convex problems.