Exploiting Local Convergence of Quasi-Newton Methods Globally: Adaptive Sample Size Approach

06/10/2021
by   Qiujiang Jin, et al.
The University of Texas at Austin

In this paper, we study the application of quasi-Newton methods to solving empirical risk minimization (ERM) problems defined over a large dataset. Traditional deterministic and stochastic quasi-Newton methods can be used to solve such problems; however, it is known that their global convergence rate may not be better than that of first-order methods, and their local superlinear convergence only appears toward the end of the learning process. In this paper, we use an adaptive sample size scheme that exploits the superlinear convergence of quasi-Newton methods globally, throughout the entire learning process. The main idea of the proposed adaptive sample size algorithms is to start with a small subset of data points, solve the corresponding ERM problem within its statistical accuracy, and then enlarge the sample size geometrically, using the optimal solution of the problem with the smaller sample set as an initial point for solving the subsequent ERM problem with more samples. We show that if the initial sample size is sufficiently large and we use quasi-Newton methods to solve each subproblem, the subproblems can be solved superlinearly fast (after at most three iterations), since we guarantee that the iterates always stay within a neighborhood in which quasi-Newton methods converge superlinearly. Numerical experiments on various datasets confirm our theoretical results and demonstrate the computational advantages of our method.


1 Introduction

In various machine learning problems, we aim to find an optimal model that minimizes the expected loss with respect to the probability distribution that generates the data points, a task often referred to as expected risk minimization. In realistic settings, the underlying probability distribution of the data is unknown and we only have access to a finite number of samples drawn independently according to the data distribution. As a result, we often settle for minimizing the sample average loss, which is known as empirical risk minimization (ERM). The gap between the expected risk and the empirical risk, known as the generalization error, is deeply studied in the statistical learning literature (vapnik1998statistical; bousquet2002concentration; bartlett2006convexity; bousquet2008tradeoffs; frostig15), and it is well known that this gap vanishes as the number of samples in the ERM problem becomes larger.

Several variance-reduced first-order methods have been developed to efficiently solve the ERM problem associated with a large dataset (roux2012stochastic; defazio2014saga; johnson2013accelerating; shalev2013stochastic; nguyen2017sarah; Allenzhu2017-katyusha; fang2018near; cutkosky2019momentum). However, a common shortcoming of these first-order methods is that their performance heavily depends on the problem condition number \kappa = L/\mu, where L is the smoothness parameter and \mu is the strong convexity constant of the risk function. As a result, these first-order methods suffer from slow convergence when the problem has a large condition number, which is often the case for datasets with thousands of features. In fact, the best-known bound achieved by first-order methods belongs to Katyusha (Allenzhu2017-katyusha), whose cost to reach the statistical accuracy of an ERM problem still grows with the condition number.
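For scale, recall the standard complexity of such accelerated variance-reduced methods: on a strongly convex ERM problem with N samples and condition number \kappa, Katyusha requires on the order of

\mathcal{O}\!\Big( \big( N + \sqrt{N\kappa}\, \big) \log\tfrac{1}{\varepsilon} \Big)

gradient computations to reach accuracy \varepsilon, so even accelerated first-order methods retain a \sqrt{\kappa} dependence.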

To resolve this issue, second-order methods such as Newton's method and quasi-Newton methods are often used, as they improve the curvature approximation by exploiting second-order information. Several efficient implementations of second-order methods for solving large-scale ERM problems have been proposed recently, including incremental Newton-type methods (gurbuzbalaban2015globally), sub-sampled Newton algorithms (erdogdu2015convergence), and stochastic quasi-Newton methods (Schraudolph; MoritzNJ16; mokhtari2017iqn). However, these methods still face two major issues. First, their global convergence requires either selecting a small step size, which slows down convergence, or implementing a line-search scheme, which is computationally prohibitive for large-scale ERM problems since it requires multiple passes over the data. Second, the local quadratic or superlinear convergence of these (deterministic and stochastic) second-order methods only appears in a small neighborhood of the optimal solution. This often means that, when we solve an ERM problem, the fast superlinear or quadratic rate of these methods kicks in only after the iterates have already reached the statistical accuracy of the problem, at which point the test accuracy no longer improves even if the ERM (training) error keeps decreasing.

To address these two shortcomings of second-order methods for solving large-scale ERM problems, the adaptive sample size Newton (Ada Newton) method was proposed in (mokhtari2016adaptive). In adaptive sample size methods, instead of solving the ERM corresponding to the full training set directly (as in deterministic methods) or using a subset of samples at each iteration (as in stochastic methods), we minimize a sequence of ERM problems defined over geometrically growing subsets of the data. Specifically, we solve the ERM problem corresponding to a small subset of samples within its statistical accuracy, and then use the resulting solution as an initial point for the next ERM problem, defined over a larger subset that contains the smaller one. The main principle behind this idea is that all training samples are drawn from the same distribution, and therefore the optimal solutions of the ERM problems over the smaller and larger subsets are close to each other. The main result of Ada Newton in (mokhtari2016adaptive) is that if one chooses the expansion factor properly, the solution of the problem over the smaller subset lies within the quadratic convergence neighborhood of Newton's method for the next ERM problem over the larger subset. Hence, each subproblem can be solved quickly by exploiting the quadratic convergence of Newton's method at every step of the learning process. This approach is promising since the overall number of gradient and Hessian computations for Ada Newton scales only with the size of the dataset and hides only absolute constants, so these complexities are independent of the problem condition number; however, it requires computing a Newton direction at every iteration. The main drawback of this approach is precisely the cost of these Newton direction computations, each of which can require O(d^3) arithmetic operations, where d is the problem dimension. It also requires Hessian evaluations, each costing O(d^2) arithmetic operations, which can be computationally costly.

A natural approach to reducing the computational cost of Newton's method is to use quasi-Newton algorithms, in which, instead of the exact Hessian computation and inversion required by the Newton update, we approximate the objective function's Hessian and its inverse using only first-order information (gradients). Since this approximation only requires matrix-vector products at each iteration, the per-iteration computational complexity of quasi-Newton methods is lower. It is known that quasi-Newton methods such as DFP and BFGS converge superlinearly in a neighborhood of the optimal solution, i.e., asymptotically the iterates approach the optimal solution faster than any linear rate (nocedal2006numerical). To exploit quasi-Newton methods within the described adaptive sample size scheme, however, an explicit superlinear convergence rate is needed to characterize the number of iterations required for solving each ERM subproblem. The non-asymptotic convergence rate of quasi-Newton methods was unknown until very recently, when a set of concurrent papers (rodomanov2020rates; rodomanov2020ratesnew) and (qiujiang2020quasinewton1) showed that, in a local neighborhood of the optimal solution, the iterates of DFP and BFGS converge to the optimal solution at an explicit superlinear rate.
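These explicit local rates have a superlinear form of roughly

\|w_k - w^{*}\| \;\le\; \Big(\tfrac{C}{k}\Big)^{k/2} \, \|w_0 - w^{*}\|,

where w_k denotes the k-th iterate, w^{*} the optimal solution, and C a problem-dependent constant whose precise value differs across the cited works. Such a rate decays faster than any linear rate, which is why only a handful of iterations are needed once the iterates enter this neighborhood.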

Algorithm / gradient comp. / Hessian comp. / matrix-vec product / matrix inversion
Ada Newton / — / — / — / —
AdaQN / — / — / — / 1
Table 1: Comparison of adaptive sample size methods in terms of the number of gradient computations, Hessian computations, matrix-vector product evaluations, and matrix inversions.

In this paper, we exploit the finite-time analysis of quasi-Newton methods in (qiujiang2020quasinewton1) to develop an adaptive sample size quasi-Newton (AdaQN) method that reaches the statistical accuracy of the ERM problem corresponding to the full training set while performing only a single matrix inversion, with gradient and matrix-vector product costs that match those of Ada Newton. In Table 1, we formally compare the computational cost of our proposed AdaQN algorithm and the Ada Newton method of (mokhtari2016adaptive). As shown in Table 1, the total numbers of gradient evaluations and matrix-vector product computations for AdaQN are the same as those of Ada Newton, while the number of Hessian computations for AdaQN is proportional to the size of the initial training set in the adaptive sample size scheme, which is substantially smaller than both the number of Hessian computations required by Ada Newton and the total number of samples. In our main theorem we show that the initial sample size must exceed a threshold determined by a parameter that depends on the loss function and the structure of the training data (see details in Remark 2). Hence, in this paper, we focus on the regime in which the total number of samples is larger than this threshold. However, our numerical experiments in Section 5 show that AdaQN often achieves small risk for initial sample sizes much smaller than this lower bound, so the effectiveness of AdaQN goes beyond this regime. Also, Ada Newton needs to compute the inverse of a square matrix (or solve a linear system) many times over the course of the algorithm, while implementing AdaQN requires only one matrix inversion. We should also add that the implementation of Ada Newton requires a backtracking mechanism for choosing the growth factor. This backtracking scheme is needed to ensure that the size of the training set does not grow too fast and that the iterates stay within the quadratic convergence neighborhood of Newton's method for the next ERM problem. Unlike Ada Newton, our proposed AdaQN method does not require such a backtracking scheme: as long as the initial training set is sufficiently large, we can simply double the size of the training set at the end of each phase.
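As a concrete illustration of this schedule (using the MNIST setup from Table 2, with an initial sample size of 1024 and a full training set of 11774 samples, and capping the last phase at the full set size), doubling at each phase gives active sample sizes 1024 → 2048 → 4096 → 8192 → 11774, i.e.,

\Big\lceil \log_{2} \tfrac{11774}{1024} \Big\rceil \;=\; \lceil 3.52 \rceil \;=\; 4

quasi-Newton phases after the initialization, each of which, by our theory, requires at most three BFGS updates.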

2 Problem formulation

In this paper, we focus on solving large-scale empirical risk minimization (ERM) problems. Consider a loss function f(w, z) that takes as inputs the decision variable w \in \mathbb{R}^d and a random variable z with probability distribution P. The expected risk minimization problem aims at minimizing the expected loss over all possible choices of w,

w^{*} := \operatorname{argmin}_{w \in \mathbb{R}^d} L(w), \qquad L(w) := \mathbb{E}_{z \sim P}\big[ f(w, z) \big]. \qquad (1)

Note that in this expression we defined the expected risk L(w) and an optimal solution w^{*} of the expected risk. In this paper we focus on loss functions that are strongly convex with respect to w, hence the optimal solution w^{*} is unique.

The probability distribution P is often unknown, and we only have access to a finite number of realizations of the random variable z, which we refer to as our training set S_N = \{z_1, \dots, z_N\}. These realizations are assumed to be drawn independently according to P. Hence, instead of minimizing the expected risk in (1), we attempt to minimize the empirical risk corresponding to the training set S_N. To formalize this, let S_m denote the subset of the training set that contains its first m elements, where without loss of generality we fix an ordering of the elements of S_N. The empirical risk corresponding to the set S_m is defined as R_m(w) := \frac{1}{m} \sum_{i=1}^{m} f(w, z_i), and its optimal solution is given by

w_m^{*} := \operatorname{argmin}_{w \in \mathbb{R}^d} R_m(w). \qquad (2)

Note that the ERM problem associated with the full training set S_N is the special case of (2) with m = N, and its unique optimal solution is denoted by w_N^{*}.

The gap between the optimal values of the empirical risk and the expected risk is well studied in the statistical learning literature, and here we assume that the following upper bound holds:

\mathbb{E}\big[\, | R_m(w_m^{*}) - L(w^{*}) | \,\big] \;\le\; V_m, \qquad (3)

where V_m is a function of the sample size m that approaches zero as m becomes large, and the expectation in (3) is taken with respect to the random training set S_m. Throughout the paper, all expectations are with respect to the corresponding training set. Classic results establish bounds of order V_m = O(1/\sqrt{m}) (vapnik1998statistical; lee1998importance), and other recent papers, including (bousquet2002concentration; bartlett2006convexity; bousquet2008tradeoffs; frostig15), show that under stronger regularity conditions, such as strong convexity, the bound improves to V_m = O(1/m). In this paper, we assume that the requirements for the faster bound are satisfied.

Since the gap between the optimal empirical and expected risks is bounded above by V_m, there is no point in reducing the optimization error of minimizing R_m beyond the statistical accuracy V_m. In other words, once we obtain a solution w_m such that \mathbb{E}[ R_m(w_m) - R_m(w_m^{*}) ] \le V_m, there is no benefit in further minimizing R_m. Based on this observation, we say that w_m solves the ERM problem in (2) within its statistical accuracy if \mathbb{E}[ R_m(w_m) - R_m(w_m^{*}) ] \le V_m. The ultimate goal is to efficiently find a solution w_N that reaches the statistical accuracy of the full training set S_N, i.e., \mathbb{E}[ R_N(w_N) - R_N(w_N^{*}) ] \le V_N. Next, we state our notation and assumptions.
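Since w_m^{*} is not available while the algorithm runs, one computable sufficient check, a standard strong-convexity bound used here only for illustration and not necessarily the criterion used in Algorithm 1, is

R_m(w) - R_m(w_m^{*}) \;\le\; \frac{\|\nabla R_m(w)\|^{2}}{2\mu} \;\le\; V_m,

so verifying \|\nabla R_m(w)\|^{2} \le 2\mu V_m certifies that w is within the statistical accuracy of R_m (here \mu is the strong convexity parameter introduced in Assumption 1 below).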

Assumption 1.

For all values of z, the loss function f(w, z) is twice differentiable and \mu-strongly convex with respect to w, and its gradient \nabla_w f(w, z) is Lipschitz continuous with parameter L.

Assumption 2.

For all values of z, the loss function f(w, z) is self-concordant with respect to w.

Assumptions 1 and 2 are customary in the analysis of quasi-Newton methods. These conditions also imply that the empirical risk R_m is self-concordant, strongly convex with parameter \mu, and has a gradient that is Lipschitz continuous with parameter L. Hence, the condition number of R_m is \kappa = L/\mu.
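The implications for the Hessian bounds follow simply because averaging preserves them: for every z,

\mu I \;\preceq\; \nabla^{2}_{w} f(w, z) \;\preceq\; L I \quad\Longrightarrow\quad \mu I \;\preceq\; \nabla^{2} R_m(w) = \frac{1}{m}\sum_{i=1}^{m} \nabla^{2}_{w} f(w, z_i) \;\preceq\; L I .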

Computational cost notation

We report the overall computational cost in terms of the following parameters: (i) the costs of computing one gradient of size d and one Hessian of size d \times d; (ii) the cost of computing the product of a square matrix of size d with a vector of size d; and (iii) the cost of computing the inverse of a square matrix of size d, or equivalently the cost of solving a linear system with d variables and d equations.
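Writing C_grad, C_Hess, C_mv, and C_inv for these costs (shorthand chosen here for readability), a generic dense d-dimensional problem has

C_{\mathrm{grad}} = O(d), \qquad C_{\mathrm{Hess}} = O(d^{2}), \qquad C_{\mathrm{mv}} = O(d^{2}), \qquad C_{\mathrm{inv}} = O(d^{3}),

which is why avoiding repeated Hessian computations and matrix inversions accounts for most of AdaQN's savings when d is large.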

3 Algorithm

Quasi-Newton methods

Before introducing our method, we briefly recap the update of quasi-Newton (QN) methods. Given the current iterate w_t, the QN update for the ERM problem in (2) is

w_{t+1} = w_t - \eta_t H_t \nabla R_m(w_t), \qquad (4)

where \eta_t is a step size and H_t is a symmetric positive definite matrix approximating the Hessian inverse \nabla^2 R_m(w_t)^{-1}. The main goal of quasi-Newton schemes is to ensure that the matrix H_t always stays close to the true Hessian inverse. There are several approaches for updating the Hessian inverse approximation matrix H_t, but the two most widely used are the DFP update, defined as

H_{t+1} = H_t - \frac{H_t y_t y_t^{\top} H_t}{y_t^{\top} H_t y_t} + \frac{s_t s_t^{\top}}{s_t^{\top} y_t}, \qquad (5)

and the BFGS update, defined as

H_{t+1} = \Big(I - \frac{s_t y_t^{\top}}{s_t^{\top} y_t}\Big) H_t \Big(I - \frac{y_t s_t^{\top}}{s_t^{\top} y_t}\Big) + \frac{s_t s_t^{\top}}{s_t^{\top} y_t}, \qquad (6)

where s_t := w_{t+1} - w_t is the variable variation and y_t := \nabla R_m(w_{t+1}) - \nabla R_m(w_t) is the gradient variation. If we follow the updates in (5) or (6), then finding the new Hessian inverse approximation, and consequently the new descent direction, only requires computing a few matrix-vector multiplications. Considering this point, and the fact that each step of BFGS or DFP requires evaluating the gradient of the empirical risk (i.e., one gradient evaluation per sample in the active set), the per-iteration cost of BFGS and DFP is dominated by these gradient evaluations and a few matrix-vector products.
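To make the cost structure concrete, here is a minimal Python/NumPy sketch of the BFGS inverse-Hessian update (6) and the quasi-Newton step (4); the function and variable names are chosen here for illustration and are not taken from the paper. The update is written in its expanded form so that it uses only one matrix-vector product plus rank-one outer products, i.e., O(d^2) arithmetic per iteration.

import numpy as np

def bfgs_inverse_update(H, s, y):
    """BFGS update (6) of the inverse-Hessian approximation H, given the
    variable variation s = w_new - w and the gradient variation y = g_new - g.
    Expanded form of (I - rho*s*y^T) H (I - rho*y*s^T) + rho*s*s^T,
    using a single matrix-vector product and rank-one outer products."""
    rho = 1.0 / (y @ s)
    Hy = H @ y  # the single O(d^2) matrix-vector product
    return (H
            - rho * (np.outer(s, Hy) + np.outer(Hy, s))
            + (rho ** 2 * (y @ Hy) + rho) * np.outer(s, s))

def quasi_newton_step(w, H, grad, step_size=1.0):
    """Quasi-Newton step (4): w_{t+1} = w_t - eta_t * H_t * grad."""
    return w - step_size * (H @ grad)

The DFP update (5) admits an analogous O(d^2) implementation; only the rank-one correction terms differ.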

3.1 Adaptive sample size quasi-Newton algorithm (AdaQN)

BFGS, DFP, or other quasi-Newton (QN) methods can be used to solve the ERM problem corresponding to the full training set S_N, but (i) the cost per iteration scales with the full sample size N, (ii) the superlinear convergence only appears toward the end of the learning process, when the iterates approach the optimal solution and the statistical accuracy has already been achieved, and (iii) they require a step size selection policy for global convergence. To resolve the first issue and reduce the high per-iteration cost, one could use stochastic or incremental QN methods that only use a subset of samples at each iteration; however, stochastic and incremental QN methods (similar to deterministic QN algorithms) outperform first-order methods only when the iterates are close to the optimal solution, and they also require line-search schemes for selecting the step size. Hence, none of these schemes is able to exploit the fast superlinear convergence rate of QN methods throughout the entire training process while always using a constant step size of 1.

Figure 1: The interplay between superlinear convergence of BFGS and statistical accuracy in AdaQN.

Next, we introduce the adaptive sample size quasi-Newton (AdaQN) method, which addresses these drawbacks. Specifically, (i) AdaQN does not require any line-search scheme and (ii) it exploits the superlinear rate of QN methods throughout the entire training process. In a nutshell, AdaQN leverages the interplay between the statistical accuracy of ERM problems and the superlinear convergence neighborhood of QN methods to solve ERM problems efficiently. It uses the fact that all samples are drawn from the same distribution, and therefore the solution of the ERM problem over a smaller subset of samples is not far from the solution of the ERM problem over a larger subset that contains it. Hence, the solution of the problem with fewer samples can be used as a warm start for the problem with more samples. Two points are worth highlighting here. First, if the two sample sizes are chosen properly, the solution of the ERM problem over the smaller set lies within the superlinear convergence neighborhood of the next ERM problem over the larger set, so each subproblem can be solved with a few iterations of a quasi-Newton method. Second, each subproblem only uses a subset of samples, so the cost of running QN updates on the subproblems is significantly smaller than the cost of QN updates on the full training set.

As shown in Figure 1, in AdaQN we need to ensure that the approximate solution of the ERM problem over the smaller set is within the superlinear convergence neighborhood of BFGS for the ERM problem over the larger set. Here the two problems are the empirical risks over nested sample sets S_m \subset S_n with m < n samples, with optimal solutions w_m^{*} and w_n^{*}, respectively. The statistical accuracy region of the smaller problem is denoted by a blue circle, the statistical accuracy region of the larger problem by a red circle, and the superlinear convergence neighborhood of BFGS for the larger problem by a dotted purple circle. As the figure suggests, any point within the statistical accuracy of the smaller problem lies within the superlinear convergence neighborhood of BFGS for the larger problem. Therefore, after a few BFGS steps (at most three, as we show in our theoretical results), we find a new solution that is within the statistical accuracy of the larger problem.

The steps of AdaQN are outlined in Algorithm 1. We start with a small subset of the full training set and solve its corresponding ERM problem within its statistical accuracy. This initial ERM problem can be solved using any iterative method, and its cost is negligible because it scales with the small initial sample size rather than with N. In the main loop (Steps 2-12) we implement AdaQN. Specifically, we first use the solution from the previous round as the initial iterate, while we set the initial Hessian inverse approximation to the one computed at the initialization stage (Step 3). Then, we double the size of the active training set (Step 4). In Steps 5-10 we run the BFGS update for minimizing the empirical risk of the enlarged training set, while updating the iterates and the Hessian inverse approximation. Once the convergence condition specified in Step 5 is satisfied, we output the current iterate as the one that minimizes the current empirical risk within its statistical accuracy. In Step 8 we use the BFGS update, but one can simply replace this step with the DFP update. As we ensure that the iterates always stay within the neighborhood in which BFGS (or DFP) converges superlinearly, the step size of these methods can be set to 1. We repeat this procedure until we reach the whole dataset with N samples. For the phase in which we minimize the empirical risk of the enlarged set (Steps 5-10), our goal is to find a solution within its statistical accuracy using as few QN updates as possible. Our theoretical result (Theorem 1) shows that, thanks to the fast convergence of QN methods, at most three iterations are required to solve each subproblem within its statistical accuracy.

Input: The initial sample size ; The initial argument within the statistical accuracy of ; The initial Hessian inverse ;

1:  Set ;
2:  while  do
3:     Set the initial argument , the initial matrix and ;
4:     Increase the sample size: ;
5:     while  do
6:        ;
7:        ;
8:        ;
9:        Set , and ;
10:     end while
11:     Set ;
12:  end while
Algorithm 1 AdaQN
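The following Python sketch mirrors the structure of Algorithm 1 under the notation of Section 2. The gradient and Hessian oracles, the statistical accuracy values V(m), and the inner stopping test, here the strong-convexity surrogate \|\nabla R_m(w)\|^2/(2\mu) \le V_m rather than the paper's exact Step 5 condition, are placeholders chosen for illustration.

import numpy as np

def adaqn(grad_fn, hess_fn, w0, m0, N, V, mu, max_inner=3):
    """Sketch of AdaQN (Algorithm 1).

    grad_fn(w, m): gradient of the empirical risk R_m over the first m samples.
    hess_fn(w, m): Hessian of R_m (used only once, on the initial m0 samples).
    V(m):          statistical accuracy of the ERM problem with m samples.
    mu:            strong convexity parameter (for the surrogate stopping test).
    w0:            a point within the statistical accuracy of R_{m0}.
    """
    # The only Hessian computation and the only matrix inversion in AdaQN.
    H0 = np.linalg.inv(hess_fn(w0, m0))
    w, m = w0, m0
    while m < N:
        m = min(2 * m, N)           # double the active sample size
        H = H0.copy()               # reuse the initial inverse-Hessian approximation
        for _ in range(max_inner):  # at most a few BFGS steps per phase
            g = grad_fn(w, m)
            if g @ g / (2.0 * mu) <= V(m):  # surrogate statistical-accuracy check
                break
            w_new = w - H @ g               # BFGS step with unit step size
            s, y = w_new - w, grad_fn(w_new, m) - g
            rho = 1.0 / (y @ s)
            Hy = H @ y
            H = (H - rho * (np.outer(s, Hy) + np.outer(Hy, s))
                   + (rho ** 2 * (y @ Hy) + rho) * np.outer(s, s))
            w = w_new
    return w

For the experiments of Section 5, grad_fn and hess_fn would be the gradient and Hessian of the regularized logistic loss restricted to the first m samples of the (fixed, ordered) training set.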

In Figure 2, we illustrate the sequential phases of AdaQN. In the initialization step, we solve the first ERM problem, defined over the initial subset of samples, up to its statistical accuracy to find an approximate solution, and we compute the Hessian of that initial empirical risk at this solution together with its inverse. Once the initialization step is done, we perform BFGS updates on the ERM problem with twice as many samples, starting from the initial solution and using the computed Hessian inverse as the initial Hessian inverse approximation. After (at most) three BFGS updates we obtain a point within the statistical accuracy of this larger problem. In the next stage, with twice as many samples again, we reuse the original Hessian inverse approximation together with the new iterate for the BFGS updates. We keep repeating this procedure until the size of the active training set reaches N. Throughout this process, we only perform Hessian computations on the initial subset and a single matrix inversion.

Figure 2: The phases of AdaQN.
Remark 1.

Note that a natural choice for the initial Hessian inverse approximation at each phase of AdaQN would be the inverse of the Hessian of the current empirical risk evaluated at the initial iterate of that phase, i.e., at the solution returned by the previous phase. However, with this initialization, each phase of AdaQN would require computing new Hessians over the current (larger) active set as well as an additional matrix inversion, which would substantially increase the computational cost of AdaQN. To avoid this issue, at every phase we always reuse the Hessian inverse approximation computed at the very first phase, over the initial subset of samples (Step 3). We show that even under this initialization the required condition for the superlinear convergence of DFP or BFGS is always satisfied, provided the initial sample size is large enough. By following this scheme, we only need to compute Hessians on the initial subset and perform a single matrix inversion to implement AdaQN. This is a significant gain compared to Ada Newton in (mokhtari2016adaptive), which requires computing Hessians and matrix inversions throughout the entire learning process.

4 Convergence analysis

In this section, we characterize the overall complexity of AdaQN. We only state the results for the BFGS update defined in (6), as the proof techniques and overall complexity bounds for the adaptive sample size versions of DFP and BFGS are similar. All proofs of the propositions and of the theorem are presented in Section A of the appendix.

First, we state an upper bound on the suboptimality, with respect to the empirical risk of the enlarged training set, of a point that has achieved the statistical accuracy of the previous empirical risk.

Proposition 1.

Consider two sample sets S_m \subset S_n, where S_m contains m samples and S_n contains n samples with m < n, and let R_m and R_n denote the corresponding empirical risk functions defined over S_m and S_n, respectively. Assume that w_m solves the ERM problem of R_m within its statistical accuracy, i.e., \mathbb{E}[ R_m(w_m) - R_m(w_m^{*}) ] \le V_m. Then we have

(7)

Proposition 1 characterizes the suboptimality of the point w_m with respect to the empirical risk R_n, which plays a fundamental role in our analysis. We use this bound later to show that w_m is close enough to w_n^{*} to lie in the superlinear convergence neighborhood of BFGS for R_n.

Next, we require a bound on the number of iterations needed by BFGS to solve a subproblem when the initial iterate is within the superlinear convergence neighborhood. We establish this bound by leveraging the result of (qiujiang2020quasinewton1), which provides a non-asymptotic superlinear convergence rate for BFGS.

Proposition 2.

Consider AdaQN in the phase in which the active training set contains n samples. If Assumptions 1-2 hold and the initial iterate and the initial Hessian inverse approximation of that phase satisfy

(8)

then the iterates of BFGS with unit step size converge superlinearly to the optimal solution of the subproblem.

Proposition 2 shows that if the initial Hessian approximation error is small and the initial iterate is close to the optimal solution, then the iterates of BFGS with unit step size converge to the optimal solution at an explicit superlinear rate within a few iterations. The inequalities in (8) identify the conditions required to ensure that the initial iterate of the phase is within the superlinear convergence neighborhood of BFGS for R_n.

In the next two propositions, we show that if the initial sample size is sufficiently large, both the initial iterate and the initial Hessian inverse approximation satisfy the conditions in (8) in expectation, which is required to achieve local superlinear convergence.

Proposition 3.

Consider Algorithm 1 in the phase where the active training set contains n samples with empirical risk R_n. Suppose Assumptions 1-2 hold, and further suppose we are given a point that is within the statistical accuracy of the empirical risk of the previous phase. If the sample size is sufficiently large, then the first condition in (8) holds in expectation.

Proposition 3 shows that if the training set size is sufficiently large, then the solution of the previous phase lies within the BFGS superlinear convergence neighborhood of the current problem, i.e., the first condition in (8) holds in expectation. This condition is indeed satisfied throughout the entire learning process as long as it holds for the first training set.

Next, we establish under what condition the Hessian inverse approximation used in the adaptive sample size method, namely the inverse of the Hessian evaluated on the initial subset of samples, satisfies the second condition in (8) in expectation, as required for the superlinear convergence of BFGS.

Proposition 4.

Consider AdaQN in the phase where the active training set contains n samples. If Assumptions 1-2 hold and the initial sample size satisfies a lower bound determined by a parameter that depends on the loss function and the training data, then for every such phase the second condition in (8) holds in expectation.

Based on Proposition 4, if the size of the initial training set is sufficiently large, then the second condition in (8) holds in expectation. By combining the results of Propositions 3 and 4, we obtain the conditions required for the superlinear convergence of BFGS in each phase. Moreover, Proposition 2 quantifies the maximum number of iterations required to solve each ERM subproblem to its corresponding statistical accuracy. In the following theorem, we exploit these results to characterize the overall computational complexity of AdaQN to reach the statistical accuracy of the full training set with N samples.

Theorem 1.

Consider AdaQN described in Algorithm 1 for the case that we use BFGS updates in (6). Suppose Assumptions 1-2 hold, and the initial sample size is lower bounded by

(9)

where the parameter in the bound is determined by the loss function and the structure of the training data (see Remark 2). Then, at each stage of AdaQN, the algorithm finds a point within the statistical accuracy of the current ERM problem after at most three BFGS iterations. Further, the computational cost of AdaQN to reach the statistical accuracy of the full training set is

(10)

Theorem 1 states that if the initial sample size is sufficiently large, the number of BFGS updates required to solve each subproblem is at most three. Further, it characterizes the overall computational cost of AdaQN.

Remark 2 (Parameter ).

The parameter in Theorem 1 is bounded in terms of the problem constants, and in the worst case the resulting lower bound on the initial sample size can be large. However, for many common classes of problems, including linear regression and logistic regression, and for training datasets with specific structures, this parameter is much smaller and the lower bound on the initial sample size improves accordingly; see Section B in the appendix for details. Our numerical experiments also verify this observation and show that the initial sample size can be chosen much smaller than the worst-case bound; see Section 5.

Remark 3 (Cost of initial phase).

In the above complexity bound, we neglect the cost of finding the initial solution, as the cost of finding an approximate solution of the first ERM problem, defined over the initial subset of samples, is negligible compared to the overall cost of AdaQN. For instance, if one solves the first ERM problem using the Katyusha algorithm (Allenzhu2017-katyusha), the cost of the initial stage scales with the size of the initial subset and the square root of its condition number, and this cost is dominated by the overall cost of AdaQN when the initial subset is sufficiently smaller than the full training set.

Remark 4 (Comparison with first-order methods).

If one uses Katyusha, which is an optimal first-order method, to solve an ERM problem with N samples directly, the overall gradient computation cost to reach its statistical accuracy grows with the square root of the condition number of the full problem. In contrast, the gradient computation cost of AdaQN, including the initialization step, involves the condition number only through the much smaller initial subproblem. Hence, when the size of the initial set is sufficiently smaller than the size of the full training set, the gradient complexity of AdaQN is smaller than the cost of Katyusha.

5 Numerical experiments

Dataset    sample size    dimension    initial sample size    regularization parameter
MNIST      11774          784          1024                   0.05
GISETTE    6000           5000         1024                   0.05
Orange     40000          14000        8192                   0.1
Epsilon    80000          2000         4096                   0.0001
Table 2: Dataset information: sample size, dimension, initial sample size, and regularization parameter.

Next, we evaluate the performance of AdaQN for solving large-scale ERM problems. We consider binary classification with an l2-regularized logistic regression loss, where the regularization parameter for each dataset is reported in Table 2. The logistic loss is convex, and therefore all objective functions in our experiments are strongly convex with parameter equal to the regularization parameter. We normalize all data points to unit norm so that the gradient of the loss is Lipschitz continuous, and the condition number is governed by the regularization parameter. Moreover, the regularized logistic regression loss is self-concordant. Thus, Assumptions 1-2 hold.
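As a concrete instance of the loss used in these experiments, a minimal NumPy sketch of the l2-regularized logistic loss, its gradient, and its Hessian is given below (labels in {-1, +1}, regularization weight lam; the exact scaling used in the paper may differ).

import numpy as np

def logistic_loss(w, X, y, lam):
    """Average l2-regularized logistic loss over samples (X, y), y in {-1, +1}."""
    z = y * (X @ w)
    return np.mean(np.log1p(np.exp(-z))) + 0.5 * lam * (w @ w)

def logistic_grad(w, X, y, lam):
    z = y * (X @ w)
    sig = 1.0 / (1.0 + np.exp(z))     # sigmoid(-z)
    return -(X.T @ (y * sig)) / X.shape[0] + lam * w

def logistic_hess(w, X, y, lam):
    z = y * (X @ w)
    p = 1.0 / (1.0 + np.exp(-z))      # sigmoid(z)
    curv = p * (1.0 - p)              # per-sample curvature, in (0, 1/4]
    # With unit-norm rows of X, the Hessian eigenvalues lie in [lam, lam + 1/4].
    return (X.T * curv) @ X / X.shape[0] + lam * np.eye(X.shape[1])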

We compare AdaQN with the adaptive sample size Newton (Ada Newton) method (mokhtari2016adaptive), the (deterministic) BFGS quasi-Newton method, the stochastic quasi-Newton (SQN) method proposed in (byrd2016stochastic), and three stochastic first-order methods: stochastic gradient descent (SGD), the Katyusha algorithm (Allenzhu2017-katyusha), and the variance-reduced method SAGA (defazio2014saga). In our experiments, we start from a fixed initial point and run gradient descent on the initial subproblem until the stopping condition guaranteeing that the iterate is within the statistical accuracy of the initial ERM problem is satisfied. We use this point as the initial point for all considered methods. We compare these methods on (i) the MNIST dataset of handwritten digits (lecun2010mnist), (ii) the Epsilon dataset from the PASCAL challenge 2008 (epsilon), (iii) the GISETTE handwritten digit classification dataset from the NIPS 2003 feature selection challenge (gisette), and (iv) the Orange dataset of customer relationship management from KDD Cup 2009 (orange). We use LIBSVM (libsvm) with license: https://www.csie.ntu.edu.tw/~cjlin/libsvm/COPYRIGHT. More details about these datasets are provided in Table 2.

Notice that in our experiments we observe that, even when the initial sample size is smaller than the threshold in (9), AdaQN performs well and converges superlinearly on each subproblem. This is because (9) is a sufficient condition that guarantees our theoretical superlinear rate, and in practice smaller choices of the initial sample size also work. For Ada Newton we use the same scheme described in Algorithm 1 and replace the QN update with Newton's method with unit step size. The step size of the (standard) BFGS method is determined by the Wolfe conditions (byrd1987global) to guarantee convergence on the whole dataset. All hyperparameters (step size, batch size, etc.) of SGD, SAGA, Katyusha, and the stochastic quasi-Newton method have been tuned to achieve the best possible performance on each dataset.

The convergence results are shown in Figures 3-6 for the considered datasets. We report both the training error, i.e., the suboptimality of the empirical risk on the full training set, and the test error for all algorithms, in terms of the number of effective passes over the dataset and in terms of runtime. In general, AdaQN outperforms the first-order optimization methods (SGD, Katyusha, and SAGA) in most settings. This is caused by the fact that the considered problems are highly ill-conditioned. For instance, on the Epsilon dataset, which has the smallest regularization parameter and hence the largest condition number, there is a substantial gap between the performance of AdaQN and the first-order methods.

Figure 3: Training error (first two plots) and test error (last two plots) in terms of number of passes over dataset and runtime for MNIST dataset.
Figure 4: Training error (first two plots) and test error (last two plots) in terms of number of passes over dataset and runtime for GISETTE dataset.
Figure 5: Training error (first two plots) and test error (last two plots) in terms of number of passes over dataset and runtime for Orange dataset.
Figure 6: Training error (first two plots) and test error (last two plots) in terms of number of passes over dataset and runtime for Epsilon dataset.

We also observe that AdaQN outperforms BFGS and SQN in all considered settings. As discussed earlier, this observation is expected since AdaQN is the only quasi-Newton algorithm among these three methods that exploits superlinear convergence of quasi-Newton methods throughout the entire training process. Moreover, AdaQN does not require a line search scheme, while to obtain the best performance of BFGS and SQN we used a line-search scheme which is often computationally costly.

Next, we compare Ada Newton and AdaQN. As shown in Figure 3, for the MNIST dataset with low dimension (784 features) and a moderate number of samples (11774), AdaQN performs very similarly to Ada Newton. However, in settings where either the problem dimension or the number of samples is very large, AdaQN outperforms Ada Newton significantly in terms of runtime. For instance, in Figures 4 and 5, which correspond to the GISETTE dataset with 6000 samples and 5000 features and the Orange dataset with 40000 samples and 14000 features, Ada Newton and AdaQN have almost identical convergence paths when compared in terms of the number of passes over the data, but in terms of runtime AdaQN is substantially faster than Ada Newton. This gap comes from the fact that in these two examples the problem dimension is large, and as a result the cost of inverting a matrix at each iteration, which is required by Ada Newton, is prohibitive.

For the Epsilon dataset, which has a relatively large number of samples (80000) and moderate dimension (2000), as shown in Figure 6, Ada Newton outperforms AdaQN in terms of the number of passes over the data for both training and test errors. However, their relative performance is reversed when we compare them in terms of runtime. In this case, the slow runtime convergence of Ada Newton is due to the fact that the sample size is very large: Ada Newton requires Hessian computations over the growing active set throughout the learning process, while AdaQN only needs Hessian evaluations on the much smaller initial subset.

6 Discussion and future work

We proposed AdaQN, an algorithm that leverages the interplay between the superlinear convergence of QN methods and the statistical accuracy of ERM problems to efficiently minimize the ERM problem corresponding to a large dataset. We showed that if the initial sample size is sufficiently large, then we can double the size of the training set at each phase and use QN methods to solve the ERM subproblems, while ensuring that the solution from the previous round stays within the superlinear convergence neighborhood of the next problem. As a result, each subproblem can be solved with at most three QN updates with unit step size. In comparison with Ada Newton, AdaQN has three major advantages. (I) Ada Newton requires a backtracking technique to determine the growth factor, while in AdaQN we can simply double the size of the training set provided the initial set is sufficiently large. (II) Ada Newton requires computing a Hessian inverse at each iteration, while AdaQN requires only a single matrix inversion. (III) Ada Newton requires Hessian computations throughout the entire learning process, while AdaQN only needs Hessian evaluations at the initialization step.

Our result shows that AdaQN provably outperforms the other benchmarks only in the regime where the total number of samples exceeds the threshold implied by the lower bound on the size of the initial training set. We believe this limitation of our analysis could be resolved by a tighter non-asymptotic analysis of the BFGS method. Moreover, our guarantees only hold in the convex setting, and extending our results to nonconvex settings, for reaching first-order or second-order stationary points, is a natural future direction.

Acknowledgment

This research is supported in part by NSF Grant 2007668, ARO Grant W911NF2110226, and the Machine Learning Laboratory at UT Austin.

Appendix

Appendix A Proof of propositions and the main theorem

We start the proofs by providing the following lemmas, which play a fundamental role in the proofs of all our propositions and of Theorem 1.

Lemma 1.

Matrix Bernstein inequality. Consider a finite sequence of independent random symmetric matrices of dimension d. Suppose that each matrix has zero mean and that its spectral norm is uniformly bounded. Then the norm of the sum of the sequence satisfies, in expectation,

(11)

where the right-hand side depends on the matrix variance statistic of the sum.

Proof.

Check Theorem 6.6.1 of (matrixconcentration). ∎

Lemma 2.

Suppose Assumptions 1-2 hold and the sample size is sufficiently large. Then the expectation of the difference between the Hessian of the empirical risk and the Hessian of the expected risk, measured in the Frobenius norm, is bounded above uniformly over the decision variable, i.e.,

(12)

where .

Proof.

We use Lemma 1 and define that and thus . Notice that and for all so we know that . Since each sample is drawn independently we get

By the Matrix Bernstein Inequality of (11) we know,

Notice that the sample size , thus we have that

By the definition of we know

thus we have

Since this holds for every point in the domain, the conclusion (12) follows. ∎

Lemma 3.

Suppose we are given two nested datasets, where the smaller contains m samples, the larger contains n samples, and the smaller is a subset of the larger. Consider the corresponding empirical risk functions defined on these two datasets. Then, for any point, the expectation of the difference between their Hessians is bounded above by

(13)

where .

Proof.

Using triangle inequality we have that

By (12) of Lemma 2 and , we have

Leveraging the above three inequalities we prove the conclusion (13). ∎

Lemma 4.

Assume that is a strictly convex and self-concordant function. Suppose that , then we have

(14)

where and for we have

(15)

Moreover, if satisfy that , then we have:

(16)
Proof.

Check Theorem 4.1.7 and Lemma 5.1.5 of (nesterov2004introductory) for (14) and (15). If satisfy that , by Theorem 4.1.6 of (nesterov2004introductory) we obtain that

(17)

Notice that for we have

(18)

Combining (17) and (18) we can obtain that (16) holds. ∎

Lemma 5.

Suppose satisfies the following condition

(19)

where and are two positive constants. Then we can obtain the following inequality

(20)
Proof.

Denote and based on (20) we have the following inequality

This is equivalent to

Taking the natural logarithm of both sides, we get

(21)

This shows that condition (21) is equivalent to condition (20). Notice that for any we have the following inequality

So if satisfies

(22)

we can derive that

And since the condition (22) is equivalent to

This is exactly condition (19). Hence, if condition (19) holds, which is equivalent to condition (22), then condition (21) is satisfied, which in turn is equivalent to condition (20). ∎

Now we present the proofs of Propositions 1-4.

A.1 Proof of Proposition 1

Note that the difference can be written as