Adaptive First- and Second-Order Algorithms for Large-Scale Machine Learning

In this paper, we consider both first- and second-order techniques to address continuous optimization problems arising in machine learning. In the first-order case, we propose a framework of transition from deterministic or semi-deterministic to stochastic quadratic regularization methods. We leverage the two-phase nature of stochastic optimization to propose a novel first-order algorithm with adaptive sampling and adaptive step size. In the second-order case, we propose a novel stochastic damped L-BFGS method that improves on previous algorithms in the highly nonconvex context of deep learning. Both algorithms are evaluated on well-known deep learning datasets and exhibit promising performance.




1 Introduction and related work

In this paper, we explore promising research directions to improve upon existing first- and second-order optimization algorithms in both deterministic and stochastic settings, with special emphasis on solving problems that arise in the machine learning context.

We consider general unconstrained stochastic optimization problems of the form

$$\min_{x \in \mathbb{R}^n} \; f(x) = \mathbb{E}_{\xi}\left[F(x; \xi)\right], \qquad (1)$$

where $\xi$ denotes a random variable and $F(\cdot\,; \xi) \colon \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable and may be nonconvex. We introduce additional assumptions about $f$ as needed. We refer to (1) as an online optimization problem, which indicates that we typically sample data points during the optimization process so as to use new samples at each iteration, instead of using a fixed dataset that is known up-front. We refer to $f$ as the expected risk or the expected loss.

In machine learning, a special case of (1) is empirical risk minimization, which consists in solving

$$\min_{x \in \mathbb{R}^n} \; f(x) = \frac{1}{N} \sum_{i=1}^{N} f_i(x), \qquad (2)$$

where $f_i$ is the loss function that corresponds to the $i$-th element of our dataset and $N$ is the number of samples, i.e., the size of the dataset. We refer to (2) as a finite-sum problem and to $f$ as the empirical risk or the empirical loss.

Stochastic Gradient Descent (SGD) (Robbins and Monro, 1951; Bottou, 2010) and its variants (Polyak, 1964; Nesterov, 1983; Duchi et al., 2011; Tieleman and Hinton, 2012; Kingma and Ba, 2015), including variance-reduced algorithms (Johnson and Zhang, 2013; Nguyen et al., 2017; Fang et al., 2018; Wang et al., 2019), are among the most popular methods for (1) and (2) in machine learning. They are the typical benchmark with respect to developments in both first- and second-order methods.

In the context of first-order methods, we focus on the case where it is not realistic to evaluate $\nabla f(x_k)$ exactly; rather, we have access to an approximation $g_k \approx \nabla f(x_k)$. If we consider (2), a popular choice for $g_k$ is the mini-batch gradient

$$g_k = \frac{1}{|\mathcal{B}_k|} \sum_{i \in \mathcal{B}_k} \nabla f_i(x_k), \qquad (3)$$

where $\mathcal{B}_k$ is the subset of samples considered at iteration $k$, from a given set of realizations of $\xi$, $\xi_i$ is the $i$-th sample of $\xi$, and $|\mathcal{B}_k|$ is the batch size used at iteration $k$, i.e., the number of samples used to evaluate the gradient approximation. It follows that $f_i$ represents the loss function that corresponds to the sample $\xi_i$.
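The mini-batch estimate (3) can be sketched in a few lines. The least-squares loss, the helper name, and the toy data below are illustrative assumptions, not part of the paper.

```python
import numpy as np

def minibatch_gradient(grad_fi, x, batch):
    """Average per-sample gradients over the sampled batch, as in (3)."""
    return sum(grad_fi(x, i) for i in batch) / len(batch)

# Toy finite-sum least-squares problem: f_i(x) = 0.5 * (a_i^T x - b_i)^2
rng = np.random.default_rng(0)
A, b = rng.normal(size=(100, 5)), rng.normal(size=100)
grad_fi = lambda x, i: (A[i] @ x - b[i]) * A[i]

x = np.zeros(5)
batch = rng.choice(100, size=16, replace=False)
g = minibatch_gradient(grad_fi, x, batch)  # estimate of the full gradient
```

Averaging over a uniformly sampled batch yields an unbiased estimate of the full gradient; the batch size controls the variance of that estimate.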

As discussed previously, stochastic optimization methods are the default choice for solving large-scale optimization problems. However, the search for the most problem-adapted hyperparameters involves significant computational costs. Asi and Duchi (2019) highlighted the alarming computational and engineering energy used to find the best set of hyperparameters in training neural networks by citing three recent works (Real et al., 2019; Zoph and Le, 2016; Collins et al., 2016), where the time needed to tune the optimization algorithm and find the best neural structure can amount to a prohibitive number of central processing unit days (Collins et al., 2016). Therefore, we need to design optimization algorithms with adaptive parameters that automatically adjust to the nature of the problem without the need for a hyperparameter search. Larson and Billups (2016); Chen et al. (2018); Blanchet et al. (2019), and Curtis et al. (2019) proposed stochastic methods with adaptive step size using the trust-region framework. The use of an adaptive batch size throughout the optimization process is another promising direction towards adaptive algorithms. Existing works include Friedlander and Schmidt (2012); Byrd et al. (2012); Hashemi et al. (2014), and Bollapragada et al. (2018a). We propose a novel algorithm with both adaptive step size and batch size in Section 2.

Another potential area of improvement is in the context of solving machine learning problems that are highly nonconvex and ill-conditioned (Bottou et al., 2018), which are more effectively treated using (approximate) second-order information. Second-order algorithms are well studied in the deterministic case (Dennis Jr and Moré, 1974; Dembo et al., 1982; Dennis Jr and Schnabel, 1996; Amari, 1998) but there are many areas to explore in the stochastic context that go beyond existing works (Schraudolph et al., 2007; Bordes et al., 2009; Byrd et al., 2016; Moritz et al., 2016; Gower et al., 2016). Among these areas, the use of damping in L-BFGS is an interesting research direction to be leveraged in the stochastic case. Specifically, Wang et al. (2017) proposed a stochastic damped L-BFGS (SdLBFGS) algorithm and proved almost sure convergence to a stationary point. However, damping does not prevent the inverse Hessian approximation from being ill-conditioned (Chen et al., 2019). The convergence of SdLBFGS may be heavily affected if the Hessian approximation becomes nearly singular during the iterations. In order to remedy this issue, Chen et al. (2019) proposed to combine SdLBFGS with regularized BFGS (Mokhtari and Ribeiro, 2014). The approach we propose in Section 3 differs.


We use $\|\cdot\|$ to denote the Euclidean norm. Other types of norms will be introduced by the notation $\|\cdot\|_p$, where the value of $p$ will be specified as needed. Capital Latin letters such as $A$, $B$, and $H$ represent matrices, lowercase Latin letters such as $s$, $x$, and $y$ represent vectors in $\mathbb{R}^n$, and lowercase Greek letters such as $\alpha$, $\sigma$, and $\lambda$ represent scalars. The operators $\mathbb{E}_\xi[\cdot]$ and $\mathbb{V}_\xi[\cdot]$ represent the expectation and variance over the random variable $\xi$, respectively. For a symmetric matrix $A$, we use $A \succ 0$ and $A \succeq 0$ to indicate that $A$ is, respectively, positive definite or positive semidefinite, and $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ to denote its smallest and largest eigenvalues, respectively. The identity matrix of appropriate size is denoted by $I$. We use $\lceil \cdot \rceil$ to denote the ceiling function that maps a real number $x$ to the smallest integer greater than or equal to $x$.

1.1 Framework and definitions

We consider methods in which iterates are updated as follows:

$$x_{k+1} = x_k + \alpha_k d_k, \qquad d_k = -H_k g_k, \qquad (4)$$

where $\alpha_k > 0$ is the step size, $d_k$ is the search direction, $H_k$ is positive definite, and we set $H_0 = I$. Setting $H_k$ to a multiple of $I$ results in a first-order method.

A common practice in machine learning is to divide the dataset into a training set, that is used to train the chosen model, and a test set, that is used to evaluate the performance of the trained model on unseen data. The training and test sets are usually partitioned into batches, which are subsets of these sets. The term epoch refers to an optimization cycle over the entire training set, which means that all batches are used during this cycle.

We use the expression semi-deterministic approach to refer to an optimization algorithm that uses exact function values and inexact gradient values.

1.2 Contributions and organization

The paper is naturally divided into two parts.

Section 2 presents our contributions to first-order optimization methods, wherein we define a framework to transform a semi-deterministic quadratic regularization method, i.e., one that uses exact function values and inexact gradient estimates, into a fully stochastic optimization algorithm suitable for machine learning problems. We go a step further and propose a novel first-order optimization algorithm with adaptive step size and sampling that exploits the two-phase nature of stochastic optimization. The result is a generic framework to adapt quadratic regularization methods to the stochastic setting. We demonstrate that our implementation is competitive on machine learning problems.

Section 3 addresses second-order methods. We propose a novel stochastic damped L-BFGS method. An eigenvalue control mechanism ensures that our Hessian approximations are uniformly bounded using less restrictive assumptions than related works. We establish almost sure convergence to a stationary point, and demonstrate that our implementation is more robust than a previously proposed stochastic damped L-BFGS algorithm in the highly nonconvex case of deep learning.

Finally, Section 4 draws conclusions and outlines potential research directions.

2 Stochastic adaptive regularization with dynamic sampling

This section is devoted to first-order methods and provides a framework of transition from deterministic or semi-deterministic to fully stochastic quadratic regularization methods. In Section 2.1, we consider the generic unconstrained problem

$$\min_{x \in \mathbb{R}^n} \; f(x), \qquad (5)$$

where $f$ has Lipschitz-continuous gradient. We assume that the gradient estimate accuracy can be selected by the user, but do not require errors to converge to zero. We propose our Adaptive Regularization with Inexact Gradients (ARIG) algorithm and establish global convergence and optimal iteration complexity.

In Section 2.2, we specialize our findings to a context appropriate for machine learning, and consider online optimization problems in the form (1) and (2). We prove that we can use inexact function values in ARIG without losing its convergence guarantees.

We present a stochastic variant of ARIG, as well as a new stochastic adaptive regularization algorithm with dynamic sampling that leverages Pflug’s diagnostic in Section 2.3. We evaluate our implementation on machine learning problems in Section 2.4.

2.1 Adaptive regularization with inexact gradients (ARIG)

We consider (5) and assume that it is possible to obtain $g_k \approx \nabla f(x_k)$ using a user-specified relative error threshold $\omega_k \ge 0$, i.e.,

$$\|\nabla f(x_k) - g_k\| \le \omega_k \|g_k\|. \qquad (6)$$

Geometrically, (6) requires the exact gradient to lie within a ball centered at the gradient estimate $g_k$ with radius $\omega_k \|g_k\|$.

We define our assumptions as follows:

Assumption 1

The function $f$ is continuously differentiable over $\mathbb{R}^n$.

Assumption 2

The gradient of $f$ is Lipschitz continuous, i.e., there exists $L > 0$ such that $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$ for all $x, y \in \mathbb{R}^n$.

Formulation for ARIG

At iteration $k$, we consider the approximate first-order Taylor model based on the inexact gradient $g_k$ defined in (6):

$$T_k(x_k, s) = f(x_k) + g_k^T s. \qquad (7)$$

Finally, we define the approximate regularized Taylor model

$$m_k(x_k, s) = T_k(x_k, s) + \tfrac{1}{2} \sigma_k \|s\|^2, \qquad (8)$$

whose gradient with respect to $s$ is

$$\nabla_s m_k(x_k, s) = g_k + \sigma_k s, \qquad (9)$$

where $\sigma_k > 0$ is the iteration-dependent regularization parameter.

Algorithm 1 summarizes our Adaptive Regularization with Inexact Gradients method. The approach is strongly inspired by the framework laid out by Birgin et al. (2017) and the use of inexact gradients in the context of trust-region methods proposed by Carter (1991) and described by Conn et al. (2000, §8.4). The adaptive regularization property is apparent in step 5 of Algorithm 1, where $\sigma_k$ is updated based on the quality of the model $m_k$. The parameter $\sigma_k$ can be interpreted as the inverse of a trust-region radius or as an approximation of the Lipschitz constant $L$. The main insight is that the trial step $s_k$ is accepted if the ratio

$$\rho_k = \frac{f(x_k) - f(x_k + s_k)}{T_k(x_k, 0) - T_k(x_k, s_k)} \qquad (10)$$

of the actual objective decrease to that predicted by the model is sufficiently large. The regularization parameter is then updated using the following scheme:

  • If $\rho_k \ge \eta_2$, then the model quality is sufficiently high that we decrease $\sigma_k$ so as to promote a larger step at the next iteration;

  • if $\eta_1 \le \rho_k < \eta_2$, then the model quality is sufficient to accept the trial step $s_k$, but insufficient to decrease $\sigma_k$ (in our implementation, we increase $\sigma_k$);

  • if $\rho_k < \eta_1$, then the model quality is low and $s_k$ is rejected. We increase $\sigma_k$ more aggressively than in the previous case so as to compute a conservative step at the next iteration.
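The update scheme above can be sketched as follows; the thresholds and multipliers are illustrative placeholders, not the paper's tuned constants.

```python
def update_regularization(sigma, rho, eta1=0.1, eta2=0.75,
                          gamma1=0.5, gamma2=2.0, gamma3=10.0):
    """Return the next regularization parameter and whether the trial step
    is accepted, given the ratio rho of actual to predicted decrease."""
    if rho >= eta2:                 # high-quality model: relax regularization
        return gamma1 * sigma, True
    if rho >= eta1:                 # acceptable step, but tighten regularization
        return gamma2 * sigma, True
    return gamma3 * sigma, False    # rejected step: regularize aggressively
```

Small `sigma` promotes long steps; a rejected step multiplies `sigma` by the largest factor, producing a conservative step at the next iteration.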

1:   Choose a tolerance $\epsilon > 0$, the initial regularization parameter $\sigma_0 > 0$, and the constants $0 < \eta_1 \le \eta_2 < 1$ and $0 < \gamma_1 < 1 < \gamma_2 \le \gamma_3$. Set $k = 0$.
2:   Choose $\omega_k$ and compute $g_k$ such that (6) holds. If $\|g_k\| \le \epsilon$, terminate with the approximate solution $x_k$.
3:   Compute the step $s_k = -g_k / \sigma_k$.
4:   Evaluate $f(x_k + s_k)$ and define $\rho_k$ as in (10). If $\rho_k \ge \eta_1$, set $x_{k+1} = x_k + s_k$. Otherwise, set $x_{k+1} = x_k$.
5:   Set
$$\sigma_{k+1} \in \begin{cases} [\gamma_1 \sigma_k, \sigma_k] & \text{if } \rho_k \ge \eta_2, \\ [\sigma_k, \gamma_2 \sigma_k] & \text{if } \eta_1 \le \rho_k < \eta_2, \\ [\gamma_2 \sigma_k, \gamma_3 \sigma_k] & \text{if } \rho_k < \eta_1. \end{cases}$$
Increment $k$ by one and go to step 2 if $\rho_k \ge \eta_1$ or to step 3 otherwise.
Algorithm 1 Adaptive Regularization with Inexact Gradients—ARIG
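A minimal runnable sketch of an ARIG-style loop on a toy problem follows. The constants, the quadratic objective, and the 5% multiplicative gradient noise are assumptions made for illustration only, not the paper's implementation.

```python
import numpy as np

def arig(f, grad_inexact, x0, sigma0=1.0, eps=1e-5, max_iter=500,
         eta1=0.1, eta2=0.75, gamma1=0.5, gamma2=2.0, gamma3=10.0):
    """Sketch of Algorithm 1 with illustrative constants; grad_inexact
    returns an approximation g_k of the true gradient, as in (6)."""
    x, sigma = np.asarray(x0, dtype=float), sigma0
    g = grad_inexact(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) <= eps:
            break                            # approximate first-order point
        s = -g / sigma                       # minimizer of the regularized model
        pred = -(g @ s)                      # T_k(x,0) - T_k(x,s) = -g^T s > 0
        rho = (f(x) - f(x + s)) / pred
        if rho >= eta1:                      # accept the trial step
            x = x + s
            g = grad_inexact(x)
        sigma *= gamma1 if rho >= eta2 else (gamma2 if rho >= eta1 else gamma3)
    return x

# Toy strongly convex problem with a relative gradient error of at most 5%
f = lambda x: 0.5 * float(x @ x)
rng = np.random.default_rng(1)
grad_inexact = lambda x: x * (1 + 0.05 * rng.uniform(-1, 1))
x_star = arig(f, grad_inexact, x0=np.ones(3))
```

On a rejected step, the gradient estimate is reused and only the regularization parameter grows, mirroring the "go to step 3" branch of Algorithm 1.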

Convergence and complexity analysis

The condition (6) ensures that at each iteration $k$,

$$\|\nabla f(x_k)\| \le \|g_k\| + \|\nabla f(x_k) - g_k\| \le (1 + \omega_k) \|g_k\|.$$

Therefore, when termination occurs, the condition $\|g_k\| \le \epsilon$ implies $\|\nabla f(x_k)\| \le (1 + \omega_k)\epsilon$ and $x_k$ is an approximate first-order critical point to within the desired stopping condition.

The following analysis adapts the general properties presented by Birgin et al. (2017) to a first-order model with inexact gradients.

We first obtain an upper bound on $\sigma_k$. The following result is similar to Birgin et al. (2017, Lemma 2.2). All proofs may be found in Section 5.

Lemma 1

For all $k$, $\sigma_k \le \sigma_{\max}$, where $\sigma_{\max} > 0$ is a constant that depends only on $\sigma_0$, $\gamma_3$, $L$, $\eta_2$, and the upper bound on the error thresholds $\omega_k$.

We now bound the total number of iterations in terms of the number of successful iterations and state our main evaluation complexity result.

Theorem 2.1 (adapted from Birgin et al., 2017)

Let Assumptions 1 and 2 be satisfied. Assume that there exists $f_{\mathrm{low}}$ such that $f(x_k) \ge f_{\mathrm{low}}$ for all $k$. Assume also that $\omega_k \le \omega$ for all $k$. Then, given $\epsilon > 0$, Algorithm 1 needs at most $O\big((f(x_0) - f_{\mathrm{low}})\,\epsilon^{-2}\big)$ successful iterations (each involving one evaluation of $f$ and its approximate gradient) and at most $O\big((f(x_0) - f_{\mathrm{low}})\,\epsilon^{-2}\big)$ iterations in total to produce an iterate $x_\epsilon$ such that $\|\nabla f(x_\epsilon)\| \le (1 + \omega)\epsilon$, where the constants hidden in the $O(\cdot)$ notation depend on $\sigma_{\max}$ given by Lemma 1 and on $\eta_1$, $\gamma_2$, $\gamma_3$, and $\omega$.

Theorem 2.1 implies that $\liminf_{k \to \infty} \|\nabla f(x_k)\| = 0$ provided $f$ is bounded below. The following corollary results immediately.

Corollary 1

Under the assumptions of Theorem 2.1, either $f(x_k) \to -\infty$ or $\liminf_{k \to \infty} \|\nabla f(x_k)\| = 0$.

2.2 Stochastic adaptive regularization with dynamic sampling

Although ARIG enjoys strong theoretical guarantees under weak assumptions, it is not adapted to the context of machine learning, where we do not have access to exact objective values. Therefore, we propose to build upon ARIG in order to design a stochastic first-order optimization method with adaptive regularization and similar convergence guarantees.

Adaptive regularization with inexact gradient and function values

Conn et al. (2000, §10.6) provide guidelines to allow for inexact objective values in trust-region methods while preserving convergence properties. This section adapts their strategy to the context of Algorithm 1. Assume that we do not have access to direct evaluations of $f$, but that we can obtain an approximation $\tilde{f}$ depending on a user-specified error threshold $\delta_k \ge 0$, such that

$$|\tilde{f}(x) - f(x)| \le \delta_k.$$

We redefine the approximate Taylor model (7) using inexact gradient and function values:

$$\tilde{T}_k(x_k, s) = \tilde{f}(x_k) + g_k^T s.$$
If the approximations of and are not sufficiently accurate, the ratio defined in (10) loses its meaning, since it is supposed to quantify the ratio of the actual decrease of the function value to the predicted decrease by the model. To circumvent the difficulty, we have the following result whose proof is elementary.

Lemma 2

Let the constant $\eta_0$ be such that $0 < \eta_0 < \tfrac{1}{2}\eta_1$ and let $s_k$ be a step taken from $x_k$ such that $T_k(x_k, 0) - T_k(x_k, s_k) > 0$. If $\tilde{f}(x_k)$ and $\tilde{f}(x_k + s_k)$ satisfy

$$\max\big(|\tilde{f}(x_k) - f(x_k)|, \; |\tilde{f}(x_k + s_k) - f(x_k + s_k)|\big) \le \eta_0 \big(T_k(x_k, 0) - T_k(x_k, s_k)\big), \qquad (15)$$

then a sufficient decrease in the inexact function values,

$$\tilde{f}(x_k) - \tilde{f}(x_k + s_k) \ge \eta_1 \big(T_k(x_k, 0) - T_k(x_k, s_k)\big),$$

implies a sufficient decrease in the actual function values,

$$f(x_k) - f(x_k + s_k) \ge \tilde{\eta}_1 \big(T_k(x_k, 0) - T_k(x_k, s_k)\big),$$

where $\tilde{\eta}_1 = \eta_1 - 2\eta_0 > 0$.

Lemma 2 states that whenever (15) is satisfied, a sufficient decrease in the stochastic objective implies that the actual decrease in the exact objective is at least a fraction of the decrease predicted by the model. It also means that, as long as we can guarantee that (15) is satisfied at each iteration, the convergence and complexity results from the previous section continue to apply.

In the next section, we use adaptive sampling to satisfy (6) in expectation with respect to the batch used to obtain $g_k$, without evaluating the full gradient.

Stochastic ARIG with adaptive sampling

To use a stochastic version of ARIG in machine learning, we must compute a gradient approximation for the mini-batch $\mathcal{B}_k$,

$$g_k = \frac{1}{|\mathcal{B}_k|} \sum_{i \in \mathcal{B}_k} \nabla f_i(x_k), \qquad (16)$$

that satisfies

$$\mathbb{E}_{\mathcal{B}_k}\big[\|g_k - \nabla f(x_k)\|^2\big] \le \omega_k^2 \|g_k\|^2, \qquad (17)$$

where $\omega_k$ is an iteration-dependent error threshold.

Condition (17) is sometimes called the norm test. We can use the adaptive sampling strategy of Byrd et al. (2012) to keep this condition satisfied along the iterations; let us summarize their strategy. Since $\omega_k$ only needs to satisfy $\omega_k \le \omega$, let us fix it to $\omega_k = \omega$. This choice is natural since a higher error threshold requires fewer samples to compute $g_k$. Since $g_k$ is obtained as a sample average (16), it is an unbiased estimate of $\nabla f(x_k)$. Therefore,

$$\mathbb{E}_{\mathcal{B}_k}\big[\|g_k - \nabla f(x_k)\|^2\big] = \big\|\mathbb{V}_{\mathcal{B}_k}[g_k]\big\|_1, \qquad (18)$$

where $\mathbb{V}_{\mathcal{B}_k}[g_k]$ is a vector with components $\mathbb{V}_{\mathcal{B}_k}[(g_k)_j]$, $j = 1, \dots, n$. Freund and Walpole (1971, p. 183) establish that

$$\mathbb{V}_{\mathcal{B}_k}[g_k] = \frac{\mathbb{V}_i[\nabla f_i(x_k)]}{|\mathcal{B}_k|} \, \frac{N - |\mathcal{B}_k|}{N - 1}. \qquad (19)$$

The direct computation of the population variance $\mathbb{V}_i[\nabla f_i(x_k)]$ is expensive. Hence, Byrd et al. (2012) suggest to estimate it with the sample variance

$$\mathbb{V}_{i \in \mathcal{B}_k}[\nabla f_i(x_k)] = \frac{1}{|\mathcal{B}_k| - 1} \sum_{i \in \mathcal{B}_k} \big(\nabla f_i(x_k) - g_k\big)^2, \qquad (20)$$

where the square is applied elementwise. It follows from (18) and (19) that

$$\mathbb{E}_{\mathcal{B}_k}\big[\|g_k - \nabla f(x_k)\|^2\big] \approx \frac{\big\|\mathbb{V}_{i \in \mathcal{B}_k}[\nabla f_i(x_k)]\big\|_1}{|\mathcal{B}_k|} \, \frac{N - |\mathcal{B}_k|}{N - 1}. \qquad (21)$$

In the context of large-scale optimization, we let $N \to \infty$, and (17) implies

$$\frac{\big\|\mathbb{V}_{i \in \mathcal{B}_k}[\nabla f_i(x_k)]\big\|_1}{|\mathcal{B}_k|} \le \omega^2 \|g_k\|^2. \qquad (22)$$
Finally, our strategy to obtain $g_k$ that satisfies (17), with $\omega_k = \omega$, is as follows:

  1. at iteration $k$, sample a new batch $\mathcal{B}_k$ of size $|\mathcal{B}_{k-1}|$ and compute the sample variance $\mathbb{V}_{i \in \mathcal{B}_k}[\nabla f_i(x_k)]$ and the mini-batch gradient $g_k$;

  2. if (22) holds, use mini-batch $\mathcal{B}_k$ at iteration $k$. Otherwise, sample a new batch $\hat{\mathcal{B}}_k$ of size

$$|\hat{\mathcal{B}}_k| = \left\lceil \frac{\big\|\mathbb{V}_{i \in \mathcal{B}_k}[\nabla f_i(x_k)]\big\|_1}{\omega^2 \|g_k\|^2} \right\rceil. \qquad (23)$$

It is important to note that the strategy above relies on the assumption that the change in the batch size is gradual, so that the sample variance and gradient computed with $\mathcal{B}_k$ remain representative for the new batch. Therefore, we assume that

$$\frac{\big\|\mathbb{V}_{i \in \hat{\mathcal{B}}_k}[\nabla f_i(x_k)]\big\|_1}{|\hat{\mathcal{B}}_k|} \le \omega^2 \|g_k\|^2. \qquad (24)$$
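The two-step strategy above can be sketched directly; the synthetic per-sample gradients and the helper name are illustrative assumptions.

```python
import numpy as np

def norm_test_batch_size(per_sample_grads, g, omega):
    """Check the norm test (22) for the current batch; when it fails,
    suggest a larger batch size as in (23) (Byrd et al., 2012 style)."""
    b = len(per_sample_grads)
    var_l1 = np.sum(np.var(per_sample_grads, axis=0, ddof=1))  # l1 norm of (20)
    bound = omega**2 * float(g @ g)
    if var_l1 / b <= bound:
        return True, b                          # (22) holds: keep the batch
    return False, int(np.ceil(var_l1 / bound))  # grow the batch

rng = np.random.default_rng(2)
grads = rng.normal(loc=1.0, scale=2.0, size=(32, 4))  # 32 noisy sample gradients
g = grads.mean(axis=0)
ok, b_new = norm_test_batch_size(grads, g, omega=0.1)
```

With a tight threshold the test fails on this noisy toy data and a larger batch is suggested; a loose threshold keeps the current batch, matching the intuition that higher error tolerance requires fewer samples.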
Lotfi (2020) provides a comparison between stochastic ARIG with adaptive sampling and SGD on the MNIST dataset (LeCun et al., 2010).

Theoretically speaking, convergence requires that either the variance of the gradient estimates or the step size decay to zero. For the strategy defined above, the step size does not converge to zero. Moreover, the batch size would need to approach the full dataset for the gradient variance to converge to zero. This discussion motivates our work in Section 2.3, where we propose a new algorithm that leverages both adaptive regularization and adaptive sampling, but not simultaneously. Although we do not provide theoretical guarantees, it is conceivable that the new algorithm converges globally to a stationary point, since we choose a step size that converges to zero in the second phase of the optimization (Bottou et al., 2018).

2.3 Stochastic adaptive regularization with dynamic sampling and convergence diagnostics

In this section, we present a novel algorithm that incorporates both adaptive regularization and adaptive sampling by leveraging the two-phase nature of stochastic optimization.

Statistical testing for stochastic optimization

Stochastic optimization algorithms exhibit two phases: a transient phase and a stationary phase (Murata, 1998). The transient phase is characterized by the convergence of the algorithm to a region of interest that contains potentially a good local or global optimum. In the stationary phase, the algorithm oscillates around that optimum in the region of interest.

Several authors have aimed to derive a stationarity condition indicating when stochastic procedures leave the transient phase and enter the stationary phase. Some of the early works in this line of research are Pflug's (Pflug, 1983, 1990, 1992). In particular, Pflug's procedure was introduced in Pflug (1990), and it consists of keeping a running average of the inner product of successive gradients in order to detect the end of the transient phase. Pflug's idea is simple: when the running average becomes negative, consecutive stochastic gradients are pointing in different directions, which signals that the algorithm has entered the stationary phase, where it oscillates around a local optimum. More recently, Chee and Toulis (2018) developed a statistical diagnostic test to detect when Stochastic Gradient Descent (SGD) enters the stationary phase, by combining the statistical test from Pflug (1990) with the SGD update. They presented theoretical arguments and practical experiments showing that the test activates at the phase transition for SGD.

Algorithm 2 describes the Pflug diagnostic proposed by Chee and Toulis (2018) for convergence of SGD applied to (2), where $g_k$ represents an approximation to the full gradient $\nabla f(x_k)$. Note that in this case, $\mathcal{B}_k$ contains a single sample. Chee and Toulis prove that the running sum in Algorithm 2 eventually becomes negative with probability one, which implies that the algorithm terminates almost surely.

1:  Choose $x_0$, a step size $\alpha > 0$, and a positive integer $n_b$ named "burn-in".
2:  Set $S = 0$. Compute $g_0$ and $x_1 = x_0 - \alpha g_0$.
3:  for $t = 1, 2, \dots$ do
4:     Compute $g_t$ and $x_{t+1} = x_t - \alpha g_t$.
5:     Define $S = S + g_t^T g_{t-1}$.
6:     if $t > n_b$ and $S < 0$ then
7:        return $t$.
8:     end if
9:  end for
Algorithm 2 Pflug diagnostic for convergence of SGD
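Pflug's diagnostic can be paired with SGD in a few lines. The noisy quadratic objective, step size, and noise scale below are illustrative assumptions used only to exercise the test.

```python
import numpy as np

def sgd_with_pflug(grad, x0, alpha=0.1, burn_in=50, max_iter=5000, seed=3):
    """Run SGD and stop when the running sum of inner products of
    successive stochastic gradients turns negative (Pflug diagnostic)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    g_prev = grad(x, rng)
    x = x - alpha * g_prev
    S = 0.0
    for t in range(1, max_iter):
        g = grad(x, rng)
        x = x - alpha * g
        S += g @ g_prev              # running sum of g_t^T g_{t-1}
        g_prev = g
        if t > burn_in and S < 0:
            return x, t              # stationary phase detected
    return x, max_iter

# Noisy quadratic: stochastic gradient of 0.5 * ||x||^2 plus Gaussian noise
grad = lambda x, rng: x + rng.normal(scale=1.0, size=x.shape)
x_end, t_stop = sgd_with_pflug(grad, x0=np.ones(2))
```

During the transient phase successive gradients are positively correlated and the sum grows; around the optimum the inner products acquire a negative drift, so the sum eventually crosses zero and triggers termination.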

Adaptive regularization and sampling using Pflug diagnostic

To harness Pflug’s diagnostic, we discuss the bias-variance trade-off in machine learning. The bias of a model is the extent to which it approximates a global optimum, whereas its variance is the amount of variation in the solution if the training set changes. The generalization error in machine learning is the sum of the bias and the variance, hence the trade-off between the two quantities. Now notice that

  • in the transient phase, we would like to reduce the bias of the model by converging to a promising region of interest. Therefore, we propose to use adaptive regularization in order to adapt the step size automatically, taking into account the ratio $\rho_k$ defined in (10). We use a fixed batch size during this phase.

  • In the stationary phase, we would like to reduce the variance of the gradient estimates. Therefore, we use adaptive sampling as a variance reduction technique. Additionally, we choose a step size sequence that converges to zero to ensure global convergence to a stationary point.

We call this algorithm ARAS, for Adaptive Regularization and Adaptive Sampling. Algorithm 3 states the complete procedure. Note that we reduce the number of parameters initially introduced in ARIG, which simplifies the algorithm. Inspired by Curtis et al. (2019), we do not require the decrease to be sufficient in order to accept a step; instead, we accept the step in all cases. We also use a constant maximum batch size to ensure that batch sizes stay reasonable.

1:   Choose the initial regularization parameter $\sigma_0 > 0$, the initial batch size $b_0$, the maximum batch size $b_{\max}$, the total number of epochs $T$, the burn-in parameter $n_b$, and the constants $0 < \gamma_1 < 1 < \gamma_2$, $\eta \in (0, 1)$, and $\omega > 0$. Set $k = 0$, $t = 1$, $S = 0$, and Transient = True.
2:  for epoch $= 1, \dots, T$ do
3:     if Transient is True then
4:        Sample $\mathcal{B}_k$ and compute $g_k$.
5:        Compute $s_k = -g_k / \sigma_k$ and define $x_{k+1} = x_k + s_k$.
6:        Evaluate $\tilde{f}(x_k)$ and $\tilde{f}(x_k + s_k)$, and define $\rho_k = \dfrac{\tilde{f}(x_k) - \tilde{f}(x_k + s_k)}{T_k(x_k, 0) - T_k(x_k, s_k)}$.
7:        Set $\sigma_{k+1} = \gamma_1 \sigma_k$ if $\rho_k \ge \eta$, and $\sigma_{k+1} = \gamma_2 \sigma_k$ otherwise.
8:        Compute $S = S + g_k^T g_{k-1}$ and set $k = k + 1$.
9:        If ($k > n_b$ and $S < 0$), then set Transient = False.
10:     else
11:        Sample $\mathcal{B}_t$. Compute $g_t$ and $\mathbb{V}_{i \in \mathcal{B}_t}[\nabla f_i(x_t)]$.
12:        if (22) holds then
13:           Go to step 18.
14:        else
15:           Set $|\hat{\mathcal{B}}_t| = \min\left( \left\lceil \big\|\mathbb{V}_{i \in \mathcal{B}_t}[\nabla f_i(x_t)]\big\|_1 \big/ \big(\omega^2 \|g_t\|^2\big) \right\rceil, \; b_{\max} \right)$.
16:           Sample a new $\mathcal{B}_t$ of new size $|\hat{\mathcal{B}}_t|$ and compute $g_t$.
17:        end if
18:        Compute the step $s_t = -\alpha_t g_t$ and define $x_{t+1} = x_t + s_t$.
19:        Set $\alpha_{t+1}$ from a step size sequence converging to zero, and increment $t$ by one.
20:     end if
21:     Increment the iteration counter by one and go to step 3.
22:  end for
Algorithm 3 Adaptive Regularization and Sampling using Pflug diagnostic—ARAS
Algorithm 3 Adaptive Regularization and Sampling using Pflug diagnostic—ARAS
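To make the two-phase structure concrete, here is a compact, self-contained sketch of an ARAS-style loop on a toy least-squares problem. All constants, schedules, and the data model are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Toy finite-sum least-squares problem with noisy labels
rng = np.random.default_rng(4)
N, n = 512, 5
x_true = np.ones(n)
A = rng.normal(size=(N, n))
b = A @ x_true + 1.0 * rng.normal(size=N)
grad_i = lambda x, idx: (A[idx] @ x - b[idx])[:, None] * A[idx]  # per-sample grads
loss = lambda x, idx: 0.5 * np.mean((A[idx] @ x - b[idx]) ** 2)

x, sigma, bsz, omega = np.zeros(n), 1.0, 32, 0.5
S, g_prev, transient, t = 0.0, None, True, 1
for step in range(300):
    batch = rng.choice(N, size=min(bsz, N), replace=False)
    G = grad_i(x, batch)
    g = G.mean(axis=0)
    if transient:                                     # phase 1: adaptive sigma
        s = -g / sigma
        rho = (loss(x, batch) - loss(x + s, batch)) / (g @ g / sigma)
        sigma *= 0.5 if rho >= 0.75 else 2.0          # adaptive regularization
        x = x + s                                     # every step is accepted
        if g_prev is not None:
            S += g @ g_prev                           # Pflug running sum
        g_prev = g
        if step > 20 and S < 0:
            transient = False                         # diagnostic fires
    else:                                             # phase 2: adaptive batch
        var_l1 = np.sum(np.var(G, axis=0, ddof=1))
        if var_l1 / len(batch) > omega**2 * (g @ g):  # norm test (22) fails
            bsz = min(int(np.ceil(var_l1 / (omega**2 * (g @ g)))), N)
        x = x - (1.0 / t) * g                         # step size -> 0
        t += 1

initial_loss = loss(np.zeros(n), np.arange(N))
final_loss = loss(x, np.arange(N))
```

The first phase drives the iterate toward a region of interest with an adaptive regularization parameter; once the Pflug sum turns negative, the second phase reduces gradient variance by growing the batch and uses a vanishing step size.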

2.4 Experimental evaluation

We study the performance of ARAS in both convex and nonconvex settings. First, we compare it to SGD and SGD with momentum in the convex setting, where we consider a logistic regression problem. Then, we compare two versions of ARAS to SGD in the nonconvex setting, where we consider a nonconvex support vector machine problem with a sigmoid loss function.

Logistic regression with MNIST

To evaluate the performance of our algorithm in the convex setting, we compare it to SGD and SGD with momentum for solving a logistic regression problem on the MNIST dataset. All parameters were set using a grid-search and all algorithms were trained for 10 epochs each.

Figure 1: Evolution of the training loss (left) and the validation accuracy (right) for ARAS, SGD and SGD with momentum, solving a logistic regression problem on MNIST.

Figure 1 shows the performance of the three algorithms. ARAS outperforms SGD and SGD with momentum on both the training and test sets.

Nonconvex support vector machine with RCV1

In the nonconvex setting, we compare the performance of ARAS and SGD for solving the finite-sum version of a nonconvex support vector machine problem with a sigmoid loss function, a problem that was considered in Wang et al. (2017):

$$\min_{x \in \mathbb{R}^n} \; f(x) = \mathbb{E}_{u, v}\big[1 - \tanh(v \, x^T u)\big] + \lambda \|x\|_2^2, \qquad (27)$$

where $u \in \mathbb{R}^n$ represents the feature vector, $v \in \{-1, 1\}$ represents the label value, and $\lambda > 0$ is the regularization parameter.

The finite-sum version of (27) can be written as

$$\min_{x \in \mathbb{R}^n} \; f(x) = \frac{1}{N} \sum_{i=1}^{N} f_i(x), \qquad (28)$$

where $f_i(x) = 1 - \tanh(v_i \, x^T u_i) + \lambda \|x\|_2^2$ for $i = 1, \dots, N$.
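The objective (28) and its gradient can be written compactly; the function names and the random toy data below are illustrative, not tied to RCV1.

```python
import numpy as np

def svm_sigmoid_loss(x, U, v, lam):
    """Finite-sum sigmoid-loss SVM objective, as in (28)."""
    return np.mean(1.0 - np.tanh(v * (U @ x))) + lam * np.dot(x, x)

def svm_sigmoid_grad(x, U, v, lam):
    """Gradient of the objective; uses d/dz tanh(z) = 1 - tanh(z)^2."""
    z = v * (U @ x)
    coeff = -(1.0 - np.tanh(z) ** 2) * v          # per-sample chain rule
    return (coeff[:, None] * U).mean(axis=0) + 2.0 * lam * x

rng = np.random.default_rng(5)
U = rng.normal(size=(50, 4))                      # feature vectors u_i
v = rng.choice([-1.0, 1.0], size=50)              # labels v_i
x = rng.normal(size=4)
lam = 0.01
g = svm_sigmoid_grad(x, U, v, lam)
```

The loss saturates for confidently classified samples, which is what makes the problem nonconvex and a good stress test for adaptive methods.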

We use a subset of the Reuters Corpus Volume I (RCV1) dataset (Lewis et al., 2004), which is a collection of manually categorized newswire stories. The subset contains stories represented by features over their distinct words. There are four categories, which we reduce to binary categories: "MCAT" and "ECAT" correspond to one label value, and "C15" and "GCAT" correspond to the other. We then solve this binary classification problem.

Figure 2 reports our results; ARAS largely outperforms SGD both in terms of the training loss and test accuracy.

Figure 2: Evolution of the training loss (left) and the test accuracy (right) for SGD and ARAS solving a nonconvex SVM problem on RCV1.

3 Stochastic damped L-BFGS with controlled norm of the Hessian approximation

In this section, we consider second-order methods and assume that, at iteration $k$, we can obtain a stochastic approximation $g_k$ of $\nabla f(x_k)$ of the form (3). Iterates are updated according to (4), where this time $H_k$ is an approximation of the inverse Hessian.

The stochastic BFGS method (Schraudolph et al., 2007) computes an updated approximation $H_{k+1}$ according to

$$H_{k+1} = (I - \rho_k s_k y_k^T) H_k (I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T, \qquad \rho_k = \frac{1}{s_k^T y_k}, \qquad (29)$$

which ensures that the secant equation $H_{k+1} y_k = s_k$ is satisfied, where

$$s_k = x_{k+1} - x_k, \qquad y_k = g_{k+1} - g_k. \qquad (30)$$

If $H_k \succ 0$ and the curvature condition $s_k^T y_k > 0$ holds, then $H_{k+1} \succ 0$ (Fletcher, 1970).

Because storing $H_k$ and performing matrix-vector products with it is costly for large-scale problems, we use the limited-memory version of BFGS (L-BFGS) (Nocedal, 1980; Liu and Nocedal, 1989), in which $H_k$ only depends on the $p$ most recent iterations and an initial matrix $H_k^0$. The parameter $p$ is the memory. With $V_j = I - \rho_j y_j s_j^T$, the inverse Hessian update can be written as

$$H_k = \big(V_{k-1}^T \cdots V_{k-p}^T\big) H_k^0 \big(V_{k-p} \cdots V_{k-1}\big) + \sum_{j=k-p}^{k-1} \rho_j \big(V_{k-1}^T \cdots V_{j+1}^T\big) s_j s_j^T \big(V_{j+1} \cdots V_{k-1}\big). \qquad (31)$$
When $\alpha_k$ in (4) is not computed using a Wolfe line search (Wolfe, 1969, 1971), there is no guarantee that the curvature condition $s_k^T y_k > 0$ holds. A common strategy is to simply skip the update. By contrast, Powell (1978) proposed damping, which consists in updating $H_k$ using a modified $y_k$, denoted by $\bar{y}_k$, to benefit from information discovered at iteration $k$ while ensuring sufficient positive definiteness. We use

$$\bar{y}_k = \theta_k y_k + (1 - \theta_k) B_k^0 s_k, \qquad \theta_k = \begin{cases} \dfrac{0.75 \, s_k^T B_k^0 s_k}{s_k^T B_k^0 s_k - s_k^T y_k} & \text{if } s_k^T y_k < 0.25 \, s_k^T B_k^0 s_k, \\ 1 & \text{otherwise}, \end{cases} \qquad (32)$$

where $B_k^0 = (H_k^0)^{-1}$, which is inspired by Wang et al. (2017) and differs from the original proposal of Powell (1978), where

$$\bar{y}_k = \theta_k y_k + (1 - \theta_k) B_k s_k, \qquad (33)$$

with $B_k = H_k^{-1}$ and an analogous definition of $\theta_k$. The choice (32) ensures that the curvature condition

$$s_k^T \bar{y}_k \ge 0.25 \, s_k^T B_k^0 s_k > 0$$

is always satisfied since $B_k^0 \succ 0$. We obtain the damped L-BFGS update, which is (31) with each $y_j$ and $\rho_j$ replaced with $\bar{y}_j$ and $\bar{\rho}_j = 1 / (s_j^T \bar{y}_j)$.
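The damping rule (32) can be implemented directly. Below, $B_k^0 = \gamma I$ and the threshold $c = 0.25$ follow the choice above, while the function name is ours.

```python
import numpy as np

def damped_pair(s, y, gamma, c=0.25):
    """Blend y with B0*s (B0 = gamma*I, assumed positive) so that the
    curvature condition s^T y_bar >= c * s^T B0 s > 0 always holds."""
    sBs = gamma * np.dot(s, s)
    sy = np.dot(s, y)
    if sy < c * sBs:
        theta = (1.0 - c) * sBs / (sBs - sy)   # note sBs - sy > (1-c)*sBs > 0
        return theta * y + (1.0 - theta) * gamma * s
    return y                                   # curvature already sufficient
```

A short calculation shows that when the blend is active, $s^T \bar{y}$ equals exactly $c \, s^T B^0 s$, so positive definiteness of the update is preserved without a Wolfe line search.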

In the next section, we propose a new version of stochastic damped L-BFGS that maintains estimates of the smallest and largest eigenvalues of the inverse Hessian approximation. We show that the new algorithm requires less restrictive assumptions than the current literature, while being more robust to ill-conditioning, and that it maintains favorable convergence properties and complexity, as established in Section 3.2. Finally, Section 3.3 provides a detailed experimental evaluation.

3.1 A new stochastic damped L-BFGS with controlled Hessian norm

Our working assumptions are Assumptions 1 and 2. We begin by deriving bounds on the smallest and largest eigenvalues of the updated inverse Hessian approximation as functions of bounds on those of the current one. Proofs can be found in Appendix A.

Lemma 3

Let $H \succ 0$ with $\lambda I \preceq H \preceq \Lambda I$, where $0 < \lambda \le \Lambda$, and let $s, \bar{y} \in \mathbb{R}^n$ be such that $s^T \bar{y} > 0$. Let $H^+$ be obtained by applying one damped BFGS update (29) to $H$ with the pair $(s, \bar{y})$ and $\bar{\rho} = 1/(s^T \bar{y})$. Then the eigenvalues of $H^+$ are bounded below and above by computable constants depending only on $\lambda$, $\Lambda$, $\|s\|^2$, and $s^T \bar{y}$.

In order to use Lemma 3 to obtain bounds on the eigenvalues of $H_k$, we make the following assumption:

Assumption 3

There is $L_g > 0$ such that, for all $x, y \in \mathbb{R}^n$ and all realizations of $\xi$, $\|\nabla F(x; \xi) - \nabla F(y; \xi)\| \le L_g \|x - y\|$.

Assumption 3 is required to prove convergence and convergence rates for most recent stochastic quasi-Newton methods (Yousefian et al., 2017). It is less restrictive than requiring $F$ to be twice differentiable with respect to $x$ and the Hessian to exist and be bounded for all $x$ and $\xi$, as in Wang et al. (2017).

The next theorem shows that the eigenvalues of $H_k$ are bounded and bounded away from zero.

Theorem 3.1

Let Assumptions 1, 2, and 3 hold. If $H_k$ is obtained by applying $p$ times the damped BFGS update formula with inexact gradients to the initial matrix $H_k^0$, there exist easily computable constants $\lambda > 0$ and $\Lambda > 0$, depending on the memory $p$ and on the bounds of Lemma 3, such that $\lambda I \preceq H_k \preceq \Lambda I$.

The precise form of $\lambda$ and $\Lambda$ is given in (48) and (50) in Section 5.

A common choice for the initial matrix is $H_k^0 = \gamma_k^{-1} I$, where $\gamma_k = \bar{y}_{k-1}^T \bar{y}_{k-1} / (s_{k-1}^T \bar{y}_{k-1})$ is the scaling parameter. This choice ensures that the search direction is well scaled, which promotes large steps. To keep $H_k^0$ from becoming nearly singular or non positive definite, we define

$$\gamma_k = \min\left( \max\left( \underline{\gamma}_k, \; \frac{\bar{y}_{k-1}^T \bar{y}_{k-1}}{s_{k-1}^T \bar{y}_{k-1}} \right), \; \bar{\gamma}_k \right),$$

where $0 < \underline{\gamma}_k \le \bar{\gamma}_k$ can be constants or iteration dependent.

The product of $H_k$ with the gradient approximation, used to compute the search direction, can be obtained cheaply by exploiting a recursive algorithm (Nocedal, 1980; Lotfi et al., 2020).
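The recursion in question is the standard two-loop recursion; the helper name is ours, and $H^0 = (1/\gamma) I$ follows the scaling choice above.

```python
import numpy as np

def lbfgs_two_loop(g, pairs, gamma):
    """Return H*g for the L-BFGS inverse Hessian built from curvature
    pairs (s_j, y_j), oldest first, with H0 = (1/gamma) * I."""
    q = g.copy()
    stack = []
    for s, y in reversed(pairs):                # most recent pair first
        rho = 1.0 / np.dot(s, y)
        a = rho * np.dot(s, q)
        q -= a * y
        stack.append((a, rho, s, y))
    r = q / gamma                               # apply the initial matrix H0
    for a, rho, s, y in reversed(stack):        # oldest pair first
        beta = rho * np.dot(y, r)
        r += (a - beta) * s
    return r
```

Each application costs $O(pn)$ flops and avoids ever forming the $n \times n$ matrix, which is what makes the limited-memory variant viable in deep learning.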

Motivated by the success of recent methods combining variance reduction with stochastic L-BFGS (Gower et al., 2016; Moritz et al., 2016; Wang et al., 2017), we apply an SVRG-like type of variance reduction (Johnson and Zhang, 2013) to the update. Not only does this accelerate the convergence, since we can choose a constant step size, but it also improves the quality of the curvature approximation.
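The variance-reduced estimate we use has the standard SVRG form; the toy least-squares data and names below are illustrative.

```python
import numpy as np

def svrg_gradient(grad_fi, x, x_snap, full_grad_snap, batch):
    """SVRG estimate: batch gradient at x, recentred using the same batch
    at the snapshot point and the snapshot's full gradient."""
    g_x = np.mean([grad_fi(x, i) for i in batch], axis=0)
    g_snap = np.mean([grad_fi(x_snap, i) for i in batch], axis=0)
    return g_x - g_snap + full_grad_snap

# Toy finite-sum least-squares problem
rng = np.random.default_rng(6)
A, b = rng.normal(size=(64, 3)), rng.normal(size=64)
grad_fi = lambda x, i: (A[i] @ x - b[i]) * A[i]
full_grad = lambda x: A.T @ (A @ x - b) / len(b)

x_snap = np.zeros(3)
mu = full_grad(x_snap)
batch = rng.choice(64, size=8, replace=False)
g_vr = svrg_gradient(grad_fi, x_snap + 0.01, x_snap, mu, batch)
```

The estimate is unbiased, and its variance shrinks as the iterate approaches the snapshot, which is what allows a constant step size and improves the quality of the curvature pairs fed to L-BFGS.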

We summarize our complete algorithm, VAriance-Reduced stochastic damped L-BFGS with Controlled HEssian Norm (VARCHEN), as Algorithm 4.

1:  Choose $x_0$, a step size sequence $\{\alpha_t\}$, a batch size sequence $\{b_t\}$, eigenvalue limits $0 < \underline{\lambda} \le \bar{\Lambda}$, the memory parameter $p$, the total number of epochs $T$, and sequences $\{\underline{\gamma}_t\}$ and $\{\bar{\gamma}_t\}$ such that $0 < \underline{\gamma}_t \le \bar{\gamma}_t$ for every $t$. Set $t = 0$ and $H_0^0 = I$.
2:  for epoch $= 1, \dots, T$ do
3:     Define $\tilde{x} = x_t$ and compute the full gradient $\nabla f(\tilde{x})$. Set $k = 0$.
4:     while the epoch is not completed do
5:        Sample a batch $\mathcal{B}_t$ of size $b_t$ and compute the mini-batch gradients at $x_t$ and at $\tilde{x}$.
6:        Define the variance-reduced gradient $g_t = g_{\mathcal{B}_t}(x_t) - g_{\mathcal{B}_t}(\tilde{x}) + \nabla f(\tilde{x})$.
7:        Estimate $\lambda$ and $\Lambda$ in Theorem 3.1. If $\lambda < \underline{\lambda}$ or $\Lambda > \bar{\Lambda}$, delete the stored pairs $(s_j, \bar{y}_j)$.
8:        Compute $d_t = -H_t g_t$.
9:        Define $x_{t+1} = x_t + \alpha_t d_t$, and compute $s_t$ and $y_t$ as in (30), and $\bar{y}_t$ as in (32).
10:       Increment $t$ and $k$ by one and update $\gamma_t$.
11:    end while
12:  end for
Algorithm 4 Variance-Reduced Stochastic Damped L-BFGS with Controlled Hessian Norm

In step 7 of Algorithm 4, we compute estimates of the upper and lower bounds $\Lambda$ and $\lambda$ on the eigenvalues of the inverse Hessian approximation, respectively. The only unknown quantity in the expressions of $\lambda$ and $\Lambda$ in Theorem 3.1 is estimated from the available curvature information. When the estimates are not within the limits