Adaptive Sampling Strategies for Stochastic Optimization

10/30/2017
by Raghu Bollapragada, et al.

In this paper, we propose a stochastic optimization method that adaptively controls the sample size used in the computation of gradient approximations. Unlike other variance reduction techniques that either require additional storage or the regular computation of full gradients, the proposed method reduces variance by increasing the sample size as needed. The decision to increase the sample size is governed by an inner product test that ensures that search directions are descent directions with high probability. We show that the inner product test improves upon the well known norm test, and can be used as a basis for an algorithm that is globally convergent on nonconvex functions and enjoys a global linear rate of convergence on strongly convex functions. Numerical experiments on logistic regression problems illustrate the performance of the algorithm.



1 Introduction

This paper presents a first-order stochastic optimization method that progressively changes the size of the sample used in the gradient approximation with the aim of achieving overall efficiency. The algorithm starts by choosing a small sample, and increases it as needed so that the gradient approximation is accurate enough to yield a linear rate of convergence for strongly convex functions. Adaptive sampling methods of this type are appealing because they enjoy optimal complexity properties [6, 11] and have the potential of being effective on a wide range of applications. Theoretical guidelines for controlling the sample size have been established in the literature [6, 11, 21], but the design of practical implementations has proven to be difficult. For example, the mechanism studied in [6, 14, 8], although intuitively appealing, is often inefficient in practice for reasons discussed below.

The problem of interest is

    \min_{x \in \mathbb{R}^d} F(x) = \mathbb{E}[f(x; \xi)],

where $F: \mathbb{R}^d \to \mathbb{R}$ is a smooth function and $\xi$ is a random variable. A particular instance of this problem arises in machine learning, where it takes the form

    F(x) = \mathbb{E}[\ell(h(x; z), y)].        (1.1)

In this setting, $f(x; \xi) = \ell(h(x; z), y)$ is the composition of a prediction function $h$ (parametrized by a vector $x$) and a smooth loss function $\ell$, and $\xi = (z, y)$ are random input-output pairs with probability distribution $P(z, y)$. We call $F$ the expected risk.

Often, problem (1.1) cannot be tackled directly because the joint probability distribution $P(z, y)$ is unknown. In this case, one draws a data set $(z_i, y_i)$, $i = 1, \ldots, N$, from the distribution $P$, and minimizes the empirical risk

    R(x) = \frac{1}{N} \sum_{i=1}^{N} \ell(h(x; z_i), y_i).

We define $F_i(x) = \ell(h(x; z_i), y_i)$ so that the empirical risk can be written conveniently as

    R(x) = \frac{1}{N} \sum_{i=1}^{N} F_i(x).        (1.2)

One may view the optimization algorithm as being applied directly to the expected risk $F$ or to the empirical risk $R$. We state our algorithm and establish a convergence result in terms of the minimization of $F$. Later on, in Section 4, we discuss a practical implementation designed to minimize $R$.

An approximation to the gradient of $F$ can be obtained by sampling. At the iterate $x_k$, we define

    g_k = \frac{1}{|S_k|} \sum_{i \in S_k} \nabla F_i(x_k),        (1.3)

where the set $S_k$ indexes certain data points $(z_i, y_i)$. A first-order method based on this gradient approximation is then given by

    x_{k+1} = x_k - \alpha_k g_k.        (1.4)
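To make the notation concrete, here is a minimal Python sketch of the sampled gradient (1.3) and the update (1.4). It is an illustration only; the per-example gradient function grad_Fi is a hypothetical user-supplied callable, not something defined in the paper.

```python
import numpy as np

def sampled_gradient(grad_Fi, x, sample):
    """Sampled gradient g_k of (1.3): average of per-example gradients over S_k."""
    return np.mean([grad_Fi(x, i) for i in sample], axis=0)

def first_order_step(grad_Fi, x, sample, alpha):
    """One iteration of the first-order method (1.4): x_{k+1} = x_k - alpha_k * g_k."""
    g = sampled_gradient(grad_Fi, x, sample)
    return x - alpha * g
```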

In our approach, the sample $S_k$ changes at every iteration, and its size $|S_k|$ is determined by a mechanism described in the next section. It is based on an inner product test that ensures that the search direction in (1.4) is a descent direction with high probability. In contrast, the test studied in [6-8, 14], which we call the norm test, controls both the direction and length of the gradient approximation, promoting search directions that are close to the true gradient $\nabla F(x_k)$; the inner product test places more emphasis on generating descent directions and allows more freedom in their length.

The numerical results presented in Section 5 suggest that the inner product test is efficient in practice, but in order to establish a Q-linear convergence rate for strongly convex functions, we must reinforce it with an additional mechanism that prevents search directions from becoming nearly orthogonal to the true gradient $\nabla F(x_k)$. More precisely, we introduce an orthogonality test that ensures that the variance of the sampled gradients along the direction orthogonal to $\nabla F(x_k)$ is properly controlled. The orthogonality test is invoked infrequently in practice and should be regarded as a safeguard against rare difficult cases.

An important component of algorithm (1.4) is the selection of the steplength $\alpha_k$. One option is to use a fixed value that is selected for each problem after careful experimentation. An alternative that we explore in more depth is a backtracking line search that imposes sufficient decrease in the sampled function

    F_{S_k}(x) = \frac{1}{|S_k|} \sum_{i \in S_k} F_i(x),        (1.5)

and that is controlled by an adaptive estimate of the Lipschitz constant $L$ of the gradient. Similar strategies have been considered for deterministic problems (see, e.g., [1]), but the stochastic setting provides some challenges and opportunities that we explore in our line search procedure.

This paper is organized into six sections. A literature review and a summary of our notation are presented in the rest of this section. In Section 2, we describe the inner product test, and in Section 3 we introduce the orthogonality test and present a convergence analysis of an adaptive sampling algorithm that employs both tests. In Section 4, we discuss some practical implementation issues and present a full description of the algorithm. Numerical results are presented in Section 5, and in Section 6 we make some concluding remarks.

1.1 Literature Review

Optimization methods that progressively increase sample sizes have been studied in [15, 24, 11, 6, 21, 23, 22, 9]. Friedlander and Schmidt [11] consider the finite sum problem (1.2) and show linear convergence by increasing the sample size at a geometric rate. They also experiment with a quasi-Newton version of their algorithm. Byrd et al. [6] study the minimization of the expected risk (1.1), show linear convergence when the sample size grows geometrically, and provide computational complexity bounds. They propose the norm test as a practical procedure for controlling the sample size. Pasupathy et al. [21] study more generally the effect of sampling rates on the convergence and complexity of various optimization methods. Hashemi et al. [14] consider a test that is similar to the norm test, reinforced by a backup mechanism that ensures a geometric increase in the sample size. They motivate this approach from a stochastic simulation perspective, using variance-bias ratios. Cartis and Scheinberg [8] relax the norm test by allowing it to be violated with probability less than 0.5, which ensures that the search directions are successful descent directions more than half of the time. They use techniques from stochastic processes and analyze algorithms that perform a line search using the true function values. Bollapragada et al. [4] study methods that sample the gradient and Hessian, and establish conditions for global linear convergence. They also provide a superlinear convergence result in the case when the gradient samples are increased at rates faster than geometric and the Hessian samples are increased without bound (at any rate).

The adaptive sampling methods studied here can be regarded as variance reducing methods; see the survey [5]. Other noise reducing methods include stochastic aggregated gradient methods, such as SAG [25], SAGA [10], and SVRG [16]. These methods either compute the full gradient at regular intervals, as in SVRG, or require storage of the component gradients, as in SAG or SAGA. These methods have gained much popularity in recent years, as they are able to achieve a linear rate of convergence for the finite sum problem, with a very low iteration cost.

1.2 Notation

We denote the variables of the optimization problem by $x \in \mathbb{R}^d$, and a minimizer of the objective $F$ by $x^*$. Throughout the paper, $\|\cdot\|$ denotes the $\ell_2$ vector norm. The notation $A \preceq B$ means that $B - A$ is a symmetric and positive semi-definite matrix.

2 The Inner Product Test

Let us consider how to select the sample size in the first-order stochastic optimization method

    x_{k+1} = x_k - \alpha_k g_k.        (2.1)

Here $\alpha_k$ is the steplength parameter and the sampled gradient $g_k$ is defined in (1.3). We propose to determine the sample size $|S_k|$ at every iteration through the inner product test described below, which aims to ensure that the algorithm generates descent directions sufficiently often. We recall that the search direction $-g_k$ of algorithm (2.1) is a descent direction for $F$ at $x_k$ if

    g_k^T \nabla F(x_k) > 0.

This condition will not hold at every iteration of our algorithm, but if the sample $S_k$ is chosen uniformly at random, so that $\mathbb{E}_k[g_k] = \nabla F(x_k)$ (here and below, $\mathbb{E}_k[\cdot]$ denotes expectation conditioned on $x_k$), it will hold in expectation, i.e.,

    \mathbb{E}_k\!\left[g_k^T \nabla F(x_k)\right] = \|\nabla F(x_k)\|^2 > 0.        (2.2)

We must in addition control the variance of the term on the left hand side to guarantee that the iteration (2.1) is convergent. We do so by requiring that the sample size $|S_k|$ be large enough so that the following condition is satisfied, for some $\theta > 0$:

    \mathbb{E}_k\!\left[\left(g_k^T \nabla F(x_k) - \|\nabla F(x_k)\|^2\right)^2\right] \le \theta^2 \|\nabla F(x_k)\|^4.        (2.3)

The left hand side is difficult to compute but can be bounded by the true variance of individual gradients, i.e.,

    \mathbb{E}_k\!\left[\left(g_k^T \nabla F(x_k) - \|\nabla F(x_k)\|^2\right)^2\right] \le \frac{\mathbb{E}_i\!\left[\left(\nabla F_i(x_k)^T \nabla F(x_k) - \|\nabla F(x_k)\|^2\right)^2\right]}{|S_k|}.        (2.4)

Therefore, the following condition ensures (2.3):

    \frac{\mathbb{E}_i\!\left[\left(\nabla F_i(x_k)^T \nabla F(x_k) - \|\nabla F(x_k)\|^2\right)^2\right]}{|S_k|} \le \theta^2 \|\nabla F(x_k)\|^4.        (2.5)

We refer to (2.5) as the (exact variance) inner product test. In large-scale applications, the computation of $\nabla F(x_k)$ can be prohibitively expensive, but we can approximate the variance on the left side of (2.5) with the sample variance and the gradient on the right side with a sampled gradient, to obtain

    \frac{\mathrm{Var}_{i \in S_k}\!\left(\nabla F_i(x_k)^T g_k\right)}{|S_k|} \le \theta^2 \|g_k\|^4,        (2.6)

where

    \mathrm{Var}_{i \in S_k}\!\left(\nabla F_i(x_k)^T g_k\right) = \frac{1}{|S_k| - 1} \sum_{i \in S_k} \left(\nabla F_i(x_k)^T g_k - \|g_k\|^2\right)^2.

Condition (2.6) will be called the (approximate) inner product test. Whenever it is not satisfied, we increase the sample size $|S_k|$ to one that we predict will satisfy (2.6). An outline of this approach is given in Algorithm 1.

Input: Initial iterate $x_0$, initial sample $S_0$, and a constant $\theta > 0$.
Set $k \leftarrow 0$
Repeat until a convergence test is satisfied:

1: Compute $g_k = \frac{1}{|S_k|} \sum_{i \in S_k} \nabla F_i(x_k)$
2: Choose a steplength $\alpha_k > 0$
3: Compute new iterate: $x_{k+1} = x_k - \alpha_k g_k$
4: Set $k \leftarrow k + 1$
5: Choose a new sample $S_k$ such that the condition (2.6) is satisfied
Algorithm 1 Basic Version

In Section 4, we discuss how to implement the inner product test in practice, how to choose the parameter $\theta$ and the stepsize $\alpha_k$, as well as the strategy for increasing the size of a new sample $S_k$, when the algorithm calls for it.
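As an illustration of how the approximate test (2.6) might be checked in code, the Python sketch below computes the sample variance of the inner products and compares it with $\theta^2 \|g_k\|^4$. The array layout and the rule for predicting a larger sample size are our own assumptions; the paper's strategy for growing the sample is discussed in Section 4.

```python
import numpy as np

def inner_product_test(grads, theta):
    """Approximate inner product test (2.6).

    grads: array of shape (|S_k|, d) holding the per-example gradients
           grad F_i(x_k) for i in S_k.
    Returns (satisfied, suggested_size): whether (2.6) holds and, if not,
    a sample size predicted to satisfy it (a simple heuristic, not the
    paper's exact rule).
    """
    n = grads.shape[0]
    g = grads.mean(axis=0)                      # sampled gradient g_k
    inner = grads @ g                           # grad F_i(x_k)^T g_k
    sample_var = np.sum((inner - g @ g) ** 2) / (n - 1)
    rhs = theta ** 2 * (g @ g) ** 2             # theta^2 * ||g_k||^4
    satisfied = sample_var / n <= rhs
    suggested = n if satisfied else int(np.ceil(sample_var / rhs))
    return satisfied, suggested
```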

It is illuminating to compare the inner product test (we use this term to refer to (2.5) or (2.6) when the distinction is not important in the discussion) with a related rule studied in the literature [6, 14, 8] that we call the norm test. The comparison can be most simply and clearly seen in the deterministic setting with the gradient based method $x_{k+1} = x_k - \alpha_k g_k$, where $g_k$ is some approximation to the gradient $\nabla F(x_k)$. In this context, the deterministic analog of (2.3) is

    \left|\left(g_k - \nabla F(x_k)\right)^T \nabla F(x_k)\right| \le \theta \|\nabla F(x_k)\|^2.        (2.7)

In contrast, the norm test corresponds to

    \|g_k - \nabla F(x_k)\| \le \theta \|\nabla F(x_k)\|.        (2.8)

This rule was studied by Carter [7] in the context of trust region methods with inaccurate gradients. It is easy to see that (2.8) ensures that $-g_k$ is a descent direction, but it is not a necessary condition; in fact (2.8) is more restrictive than (2.7) because it requires approximate gradients $g_k$ to lie in a ball centered at the true gradient $\nabla F(x_k)$, whereas (2.7) allows gradients that are within an infinite band around the true gradient, as illustrated in Figure 2.1.

Figure 2.1 (panels: Norm Test; Inner Product Test): Deterministic setting. Given a gradient $\nabla F(x_k)$, the shaded areas denote the set of vectors $g_k$ satisfying (a) the norm condition (2.8); (b) the deterministic inner product condition (2.7).

In the stochastic setting, the norm condition (2.8) becomes

    \mathbb{E}_k\!\left[\|g_k - \nabla F(x_k)\|^2\right] \le \theta^2 \|\nabla F(x_k)\|^2.        (2.9)

Following the same reasoning as in (2.4), this condition will be satisfied if we impose instead

    \frac{\mathbb{E}_i\!\left[\|\nabla F_i(x_k) - \nabla F(x_k)\|^2\right]}{|S_k|} \le \theta^2 \|\nabla F(x_k)\|^2.        (2.10)

This norm test is used in [6] to control the sample size: if (2.10) is not satisfied, then the sample size $|S_k|$ is increased.

Numerical experience indicates that the norm test can be unduly restrictive, often leading to a very fast increase in the sample size, negating the benefits of adaptive sampling. An indication that the inner product test increases the sample size more slowly than the norm test can be seen through the following argument. Let $|S_k^{\mathrm{IP}}|$ and $|S_k^{\mathrm{N}}|$ represent the minimum number of samples required to satisfy the inner product test (2.5) and the norm test (2.10), respectively, at any given iterate $x_k$, using the same value of $\theta$. A simple computation (see Appendix A) shows that

    |S_k^{\mathrm{IP}}| = \beta_k\, |S_k^{\mathrm{N}}|, \qquad \beta_k \le 1,        (2.11)

where

    \beta_k = \frac{\mathbb{E}_i\!\left[\|\nabla F_i(x_k) - \nabla F(x_k)\|^2 \cos^2\phi_i\right]}{\mathbb{E}_i\!\left[\|\nabla F_i(x_k) - \nabla F(x_k)\|^2\right]},        (2.12)

and $\phi_i$ is the angle between $\nabla F_i(x_k) - \nabla F(x_k)$ and $\nabla F(x_k)$. The quantity $\beta_k$ is the ratio of the variance of the individual gradients along the true gradient direction to the total variance of the individual gradients. The numerical results presented in Section 5 are consistent with this observation and show that $\beta_k$ is often much less than 1.
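The ratio above can be estimated numerically. The following Python sketch, an illustration under the assumption that the per-example gradients and a reference full gradient are available, computes the ratio of the variance along the gradient direction to the total variance, which by (2.11)-(2.12) relates the minimal sample sizes of the two tests.

```python
import numpy as np

def sample_size_ratio(grads, full_grad):
    """Estimate the ratio relating the minimal sample sizes required by the
    inner product test (2.5) and the norm test (2.10).

    grads:      per-example gradients grad F_i(x_k), shape (n, d)
    full_grad:  the true (or best available) gradient grad F(x_k)

    Returns the ratio of the variance of the individual gradients along the
    true gradient direction to their total variance; values much below 1
    indicate that the inner product test admits much smaller samples.
    """
    diffs = grads - full_grad                         # grad F_i - grad F
    unit = full_grad / np.linalg.norm(full_grad)
    var_along = np.mean((diffs @ unit) ** 2)          # variance along grad F
    var_total = np.mean(np.sum(diffs ** 2, axis=1))   # total variance
    return var_along / var_total
```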

3 Analysis

In order to establish linear convergence for this method, it is necessary to introduce an additional condition that has only a slight effect on the algorithm in practice, but guarantees the quality of the search direction in difficult cases. In this section, we first describe this test, and in the second part we establish some results on convergence rates.

3.1 Orthogonality Test

Establishing a convergence rate usually involves showing that the step direction is bounded away from orthogonality to $\nabla F(x_k)$. However, iteration (2.1) with a sample satisfying the inner product condition (2.5) does not necessarily enjoy this property. The possible near orthogonality corresponds to the case where the ratio $\beta_k$ defined above is near zero, and occurs when the variance in the individual gradients is very large compared to the variance in the individual gradients along the true gradient direction. Although we have not observed very small values of $\beta_k$ in our numerical tests, this is harmful in principle, and to prove convergence we must be able to avoid this possibility. We propose a test that imposes a loose bound on the component of $g_k$ orthogonal to the true gradient. This test, together with (2.5), allows us to prove a linear convergence result for the adaptive sampling algorithm when $F$ is strongly convex.

To motivate the orthogonality test, we note that the component of $g_k$ orthogonal to $\nabla F(x_k)$ is 0 in expectation, i.e.,

    \mathbb{E}_k\!\left[g_k - \frac{g_k^T \nabla F(x_k)}{\|\nabla F(x_k)\|^2}\, \nabla F(x_k)\right] = 0,

but that is not sufficient. We must bound the variance of this orthogonal component, and to achieve this we require that the sample size be large enough to satisfy

    \mathbb{E}_k\!\left[\left\|g_k - \frac{g_k^T \nabla F(x_k)}{\|\nabla F(x_k)\|^2}\, \nabla F(x_k)\right\|^2\right] \le \nu^2 \|\nabla F(x_k)\|^2,        (3.1)

for some positive constant $\nu$ whose choice is discussed in Section 4. For a given sample size, this condition can be expressed, using the true variance of individual gradients, as

    \frac{\mathbb{E}_i\!\left[\left\|\nabla F_i(x_k) - \frac{\nabla F_i(x_k)^T \nabla F(x_k)}{\|\nabla F(x_k)\|^2}\, \nabla F(x_k)\right\|^2\right]}{|S_k|} \le \nu^2 \|\nabla F(x_k)\|^2.        (3.2)

The (exact variance) orthogonality test states that if this inequality is not satisfied, the sample size should be increased.

Reasoning as in (2.6), we can derive a variant of the orthogonality test based on sample approximations. This (approximate) orthogonality test is given by

    \frac{1}{|S_k|\,(|S_k| - 1)} \sum_{i \in S_k} \left\|\nabla F_i(x_k) - \frac{\nabla F_i(x_k)^T g_k}{\|g_k\|^2}\, g_k\right\|^2 \le \nu^2 \|g_k\|^2.        (3.3)

It is interesting to note that, since (2.3) implies that the root mean square of the component of the step along the gradient is bounded below by $\|\nabla F(x_k)\|$, imposition of (3.2) will tend to keep the tangent of the angle between $g_k$ and $\nabla F(x_k)$ below $\nu$, providing a limit on the near orthogonality of these two vectors.
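A possible implementation of the approximate orthogonality test (3.3) is sketched below in Python; combining it with the inner product check from the earlier sketch yields the combined test that Section 4.2 calls the augmented inner product test. The array layout is assumed, not prescribed by the paper.

```python
import numpy as np

def orthogonality_test(grads, nu):
    """Approximate orthogonality test (3.3).

    grads: per-example gradients grad F_i(x_k), shape (|S_k|, d).
    Checks that the sample variance of the components of grad F_i(x_k)
    orthogonal to g_k, divided by |S_k|, is at most nu^2 * ||g_k||^2.
    """
    n = grads.shape[0]
    g = grads.mean(axis=0)
    g_sq = g @ g
    proj = np.outer(grads @ g / g_sq, g)          # projections onto g_k
    orth = grads - proj                           # orthogonal components
    sample_var = np.sum(orth ** 2) / (n - 1)
    return sample_var / n <= nu ** 2 * g_sq

def augmented_test(grads, theta, nu):
    """Inner product test (2.6) together with orthogonality test (3.3)."""
    ok_ip, _ = inner_product_test(grads, theta)   # from the earlier sketch
    return ok_ip and orthogonality_test(grads, nu)
```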

3.2 Convergence Analysis

The orthogonality test, in conjunction with the inner product test, allows the algorithm to make sufficient progress at every iteration, in expectation. More precisely, we now establish three convergence results for the exact versions of these two tests, namely (2.5) and (3.2). Our results apply to iteration (2.1) with a fixed steplength. We start by establishing a technical lemma.

Lemma 3.1.

Suppose that $F$ is twice continuously differentiable and that there exists a constant $L > 0$ such that

    \nabla^2 F(x) \preceq L I, \qquad \forall x \in \mathbb{R}^d.        (3.4)

Let $\{x_k\}$ be the iterates generated by iteration (2.1) with any $x_0$, where $|S_k|$ is chosen such that the (exact variance) inner product test (2.5) and the (exact variance) orthogonality test (3.2) are satisfied at each iteration for any given constants $\theta > 0$ and $\nu > 0$. Then, for any $k$,

    \mathbb{E}_k\!\left[\|g_k\|^2\right] \le (1 + \theta^2 + \nu^2)\, \|\nabla F(x_k)\|^2.        (3.5)

Moreover, if the steplength satisfies

    \alpha_k = \alpha \le \frac{1}{(1 + \theta^2 + \nu^2)\, L},        (3.6)

we have that

    \mathbb{E}_k\!\left[F(x_{k+1})\right] \le F(x_k) - \frac{\alpha}{2}\, \|\nabla F(x_k)\|^2.        (3.7)
Proof.

Since (3.2) is satisfied, we have that (3.1) holds. Thus, recalling (2.2), we have that (3.1) can be written as

    \mathbb{E}_k\!\left[\|g_k\|^2\right] - \frac{\mathbb{E}_k\!\left[(g_k^T \nabla F(x_k))^2\right]}{\|\nabla F(x_k)\|^2} \le \nu^2 \|\nabla F(x_k)\|^2.

Therefore,

    \mathbb{E}_k\!\left[\|g_k\|^2\right] \le \frac{\mathbb{E}_k\!\left[(g_k^T \nabla F(x_k))^2\right]}{\|\nabla F(x_k)\|^2} + \nu^2 \|\nabla F(x_k)\|^2.        (3.8)

To bound the first term on the right side of this inequality, we use the inner product test. Since $S_k$ satisfies (2.5), the inequality (2.3) holds, and this in turn yields

    \mathbb{E}_k\!\left[(g_k^T \nabla F(x_k))^2\right] \le (1 + \theta^2)\, \|\nabla F(x_k)\|^4.

Substituting in (3.8), we get the following bound on the length of the search direction:

    \mathbb{E}_k\!\left[\|g_k\|^2\right] \le (1 + \theta^2 + \nu^2)\, \|\nabla F(x_k)\|^2,

which proves (3.5). Using this inequality, (2.1), (3.4), and (3.6) we have

    \mathbb{E}_k\!\left[F(x_{k+1})\right] \le F(x_k) - \alpha\, \|\nabla F(x_k)\|^2 + \frac{L \alpha^2}{2}\, \mathbb{E}_k\!\left[\|g_k\|^2\right] \le F(x_k) - \frac{\alpha}{2}\, \|\nabla F(x_k)\|^2. ∎

We now show that iteration (2.1), using a fixed steplength $\alpha$, is linearly convergent when $F$ is strongly convex. In the discussion that follows, $x^*$ denotes the minimizer of $F$.

Theorem 3.2.

(Strongly Convex Objective.) Suppose that $F$ is twice continuously differentiable and that there exist constants $0 < \mu \le L$ such that

    \mu I \preceq \nabla^2 F(x) \preceq L I, \qquad \forall x \in \mathbb{R}^d.        (3.9)

Let $\{x_k\}$ be the iterates generated by iteration (2.1) with any $x_0$, where $|S_k|$ is chosen such that the (exact variance) inner product test (2.5) and the (exact variance) orthogonality test (3.2) are satisfied at each iteration for any given constants $\theta > 0$ and $\nu > 0$. Then, if the steplength satisfies (3.6) we have that

    \mathbb{E}\!\left[F(x_k) - F(x^*)\right] \le \rho^k \left(F(x_0) - F(x^*)\right),        (3.10)

where

    \rho = 1 - \mu\alpha.        (3.11)

In particular, if $\alpha$ takes its maximum value in (3.6), i.e., $\alpha = 1/\big((1 + \theta^2 + \nu^2) L\big)$, we have

    \rho = 1 - \frac{\mu}{(1 + \theta^2 + \nu^2)\, L}.        (3.12)
Proof.

It is well known [2] that for strongly convex functions

    2\mu\,\big(F(x) - F(x^*)\big) \le \|\nabla F(x)\|^2, \qquad \forall x \in \mathbb{R}^d.

Substituting this in (3.7) and subtracting $F(x^*)$ from both sides we obtain

    \mathbb{E}_k\!\left[F(x_{k+1})\right] - F(x^*) \le (1 - \mu\alpha)\,\big(F(x_k) - F(x^*)\big),

from which the theorem follows. ∎

Note that when $\theta = \nu = 0$ we recover the classical result for the exact gradient method. We now consider the case when $F$ is convex, but not strongly convex.

Theorem 3.3.

(General Convex Objective.) Suppose that $F$ is twice continuously differentiable and convex, and that there exists a constant $L > 0$ such that

    \nabla^2 F(x) \preceq L I, \qquad \forall x \in \mathbb{R}^d.        (3.13)

Let $\{x_k\}$ be the iterates generated by iteration (2.1) with any $x_0$, where $|S_k|$ is chosen such that the exact variance inner product test (2.5) and orthogonality test (3.2) are satisfied at each iteration for any given constants $\theta > 0$ and $\nu > 0$. Then, if the steplength satisfies the strict version of (3.6), that is

    \alpha_k = \alpha < \frac{1}{(1 + \theta^2 + \nu^2)\, L},        (3.14)

we have for any positive integer $T$,

    \min_{0 \le k \le T-1} \mathbb{E}\!\left[F(x_k)\right] - F^* \le \frac{\|x_0 - x^*\|^2}{2\alpha\sigma T},

where $F^*$ is the optimal function value, the constant $\sigma$ is given by $\sigma = 1 - \alpha L (1 + \theta^2 + \nu^2) > 0$, and $x^*$ is any minimizer of $F$.

Proof.

From Lemma 3.1 we have that

    \mathbb{E}_k\!\left[\|g_k\|^2\right] \le (1 + \theta^2 + \nu^2)\, \|\nabla F(x_k)\|^2.

Using this inequality and (3.13), and considering any minimizer $x^*$ of $F$, we have

    \mathbb{E}_k\!\left[\|x_{k+1} - x^*\|^2\right] \le \|x_k - x^*\|^2 - 2\alpha\,\big(F(x_k) - F^*\big) + \alpha^2 (1 + \theta^2 + \nu^2)\, \|\nabla F(x_k)\|^2,        (3.15)

where the last inequality follows from convexity of $F$.

Now, for any function $F$ having Lipschitz continuous gradients,

    \|\nabla F(x)\|^2 \le 2L\,\big(F(x) - F^*\big).        (3.16)

This result is shown in [20], but for the sake of completeness we give here a proof. Since $F$ has a Lipschitz continuous gradient,

    F\!\left(x - \tfrac{1}{L}\nabla F(x)\right) \le F(x) - \frac{1}{2L}\,\|\nabla F(x)\|^2.

Recalling that $F^*$ is the optimal function value we have

    F^* \le F\!\left(x - \tfrac{1}{L}\nabla F(x)\right) \le F(x) - \frac{1}{2L}\,\|\nabla F(x)\|^2,

which proves (3.16).

Substituting (3.16) in (3.15) we obtain

    \mathbb{E}_k\!\left[\|x_{k+1} - x^*\|^2\right] \le \|x_k - x^*\|^2 - 2\alpha\sigma\,\big(F(x_k) - F^*\big),

by the definition of $\sigma$. We can write this inequality as

    2\alpha\sigma\, \mathbb{E}\!\left[F(x_k) - F^*\right] \le \mathbb{E}\!\left[\|x_k - x^*\|^2\right] - \mathbb{E}\!\left[\|x_{k+1} - x^*\|^2\right],

and summing over $k = 0, \ldots, T-1$, we obtain

    \min_{0 \le k \le T-1} \mathbb{E}\!\left[F(x_k)\right] - F^* \le \frac{1}{T}\sum_{k=0}^{T-1}\mathbb{E}\!\left[F(x_k) - F^*\right] \le \frac{\|x_0 - x^*\|^2}{2\alpha\sigma T}. ∎

This result establishes a sublinear rate of convergence in function values by referencing the best function value obtained after every $T$ iterates. We now consider the case when $F$ is nonconvex and bounded below.

Theorem 3.4.

(Nonconvex Objective.) Suppose that $F$ is twice continuously differentiable and bounded below, and that there exists a constant $L > 0$ such that

    \nabla^2 F(x) \preceq L I, \qquad \forall x \in \mathbb{R}^d.        (3.17)

Let $\{x_k\}$ be the iterates generated by iteration (2.1) with any $x_0$, where $|S_k|$ is chosen such that the (exact variance) inner product test (2.5) and the (exact variance) orthogonality test (3.2) are satisfied at each iteration for any given constants $\theta > 0$ and $\nu > 0$. Then, if the steplength satisfies

    \alpha_k = \alpha \le \frac{1}{(1 + \theta^2 + \nu^2)\, L},        (3.18)

then

    \lim_{k \to \infty} \mathbb{E}\!\left[\|\nabla F(x_k)\|^2\right] = 0.        (3.19)

Moreover, for any positive integer $T$ we have that

    \min_{0 \le k \le T-1} \mathbb{E}\!\left[\|\nabla F(x_k)\|^2\right] \le \frac{2}{\alpha T}\,\big(F(x_0) - F_{\min}\big),

where $F_{\min}$ is a lower bound on $F$ in $\mathbb{R}^d$.

Proof.

From Lemma 3.1 we have

    \mathbb{E}_k\!\left[F(x_{k+1})\right] \le F(x_k) - \frac{\alpha}{2}\,\|\nabla F(x_k)\|^2,

and hence

    \frac{\alpha}{2}\,\mathbb{E}\!\left[\|\nabla F(x_k)\|^2\right] \le \mathbb{E}\!\left[F(x_k)\right] - \mathbb{E}\!\left[F(x_{k+1})\right].

Summing both sides of this inequality from $k = 0$ to $T-1$, and since $F$ is bounded below by $F_{\min}$, we get

    \frac{\alpha}{2}\sum_{k=0}^{T-1}\mathbb{E}\!\left[\|\nabla F(x_k)\|^2\right] \le F(x_0) - \mathbb{E}\!\left[F(x_T)\right] \le F(x_0) - F_{\min}.

Taking limits, we obtain

    \sum_{k=0}^{\infty}\mathbb{E}\!\left[\|\nabla F(x_k)\|^2\right] < \infty,

which implies (3.19). We can also conclude that

    \min_{0 \le k \le T-1}\mathbb{E}\!\left[\|\nabla F(x_k)\|^2\right] \le \frac{1}{T}\sum_{k=0}^{T-1}\mathbb{E}\!\left[\|\nabla F(x_k)\|^2\right] \le \frac{2}{\alpha T}\,\big(F(x_0) - F_{\min}\big). ∎

This theorem shows that the sequence of gradients converges to zero, in expectation. It also establishes a global sublinear rate of convergence of the smallest gradients generated after every $T$ steps. Results with a similar flavor have been established by Ghadimi and Lan [12] in the context of nonconvex stochastic programming.

4 Practical Implementation

In this section, we describe our line search procedure, and present a technique for making the algorithm robust in the early stages of a run when sample variances and sample gradients are unreliable. We also discuss a heuristic that determines how much to increase the sample size $|S_k|$. We then present the complete algorithm, followed by a discussion of the choice of some important algorithmic parameters.

4.1 Line Search

We select the steplength parameter $\alpha_k$ by a backtracking line search based on the sampled function $F_{S_k}$ and an adaptive estimate $L_k$ of the Lipschitz constant of the gradient. The initial value of the steplength at the $k$-th iteration of the algorithm is given by $\alpha_k = 1/L_k$. If sufficient decrease in $F_{S_k}$ is not obtained, $L_k$ is increased by a constant factor until such a decrease is achieved.

Now, since overestimating the Lipschitz constant leads to unnecessarily small steps, at every outer iteration of the algorithm the initial value $L_k$ is set to a fraction of the previous estimate $L_{k-1}$. This reset strategy has been used in deterministic convex optimization, but requires some attention in the stochastic setting. Specifically, decreasing the Lipschitz constant by a fixed fraction at every iteration may result in an inadequate steplength. We propose a variance-based rule, described below, to compute a contraction factor $\gamma_k$ at every iteration. Our line search strategy is summarized in Algorithm 2.

Input: $x_k$, $g_k$, previous Lipschitz estimate $L_{k-1}$

1: Compute $\gamma_k$ as given in (4.3)
2: Set $L_k \leftarrow \gamma_k L_{k-1}$   (decrease the Lipschitz constant)
3: Compute $x_{k+1} = x_k - \frac{1}{L_k}\, g_k$
4: while $F_{S_k}(x_{k+1}) > F_{S_k}(x_k) - \frac{1}{2L_k}\,\|g_k\|^2$ do   (sufficient decrease)
5:     Set $L_k \leftarrow \eta L_k$   (increase the Lipschitz constant)
6:     Compute $x_{k+1} = x_k - \frac{1}{L_k}\, g_k$
7: end while
Algorithm 2 Backtracking Line Search

The expansion factor $\eta$ in Step 5 is set to a fixed constant (greater than 1) in our experiments. To determine the contraction factor $\gamma_k$, we reason as follows. From (2.1) we have

    \mathbb{E}_k\!\left[F(x_{k+1})\right] - F(x_k) \le -\alpha_k\, \|\nabla F(x_k)\|^2 + \frac{L\alpha_k^2}{2}\left(\|\nabla F(x_k)\|^2 + \sigma_k^2\right).

Thus we can guarantee an expected decrease in the true objective function if the right hand side is negative, i.e.,

    \alpha_k < \frac{2\,\|\nabla F(x_k)\|^2}{L\left(\|\nabla F(x_k)\|^2 + \sigma_k^2\right)},        (4.1)

where

    \sigma_k^2 = \mathbb{E}_k\!\left[\|g_k - \nabla F(x_k)\|^2\right].

This is the same variance as in the norm test (2.9). As was done in that context, we first note that (4.1) holds if

    \alpha_k < \frac{2\,\|\nabla F(x_k)\|^2}{L\left(\|\nabla F(x_k)\|^2 + \mathbb{E}_i\!\left[\|\nabla F_i(x_k) - \nabla F(x_k)\|^2\right]/|S_k|\right)}.

Next, we approximate the true gradient and true variance using a sampled gradient and a sampled variance to obtain

    \alpha_k < \frac{2\,\|g_k\|^2}{L\left(\|g_k\|^2 + \mathrm{Var}_{i \in S_k}\!\left(\nabla F_i(x_k)\right)/|S_k|\right)},        (4.2)

where

    \mathrm{Var}_{i \in S_k}\!\left(\nabla F_i(x_k)\right) = \frac{1}{|S_k| - 1} \sum_{i \in S_k} \|\nabla F_i(x_k) - g_k\|^2.

We wish to compute an appropriate value $L_k$ and set $\alpha_k = 1/L_k$. Therefore, since we can assume that $L_{k-1}$ is large enough so that $L_{k-1} \ge L$, inequality (4.2) gives

    \frac{1}{L_k} < \frac{2\,\|g_k\|^2}{L_{k-1}\left(\|g_k\|^2 + \mathrm{Var}_{i \in S_k}\!\left(\nabla F_i(x_k)\right)/|S_k|\right)},

or

    L_k > \frac{L_{k-1}}{2}\left(1 + \frac{\mathrm{Var}_{i \in S_k}\!\left(\nabla F_i(x_k)\right)}{|S_k|\,\|g_k\|^2}\right).

This indicates that when decreasing the Lipschitz estimate by the rule $L_k = \gamma_k L_{k-1}$ at the start of each iteration, $\gamma_k$ may be chosen as

    \gamma_k = \min\left\{1,\; \frac{1}{2}\left(1 + \frac{\mathrm{Var}_{i \in S_k}\!\left(\nabla F_i(x_k)\right)}{|S_k|\,\|g_k\|^2}\right)\right\}.        (4.3)

Therefore $\gamma_k \in [1/2, 1]$, meaning that $L_{k-1}$ could be left unchanged, or reduced by a factor of at most 2.

Algorithm 2 is similar in form to the one proposed in [1] for deterministic functions. However, our line search operates with sampled (i.e., inaccurate) function values, and the adaptive setting has allowed us to employ the variance-based rule described above for estimating the initial value of $L_k$ at every iteration. A line search based on sampled function values that is more closely related to ours is described in [25], but it differs from ours in that it uses a fixed contraction factor throughout the algorithm.
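The following Python sketch puts Algorithm 2 together with the contraction factor (4.3) as reconstructed above. It is illustrative only: the sufficient-decrease condition, the default expansion factor, and the interfaces (a callable sampled objective F_S and an array of per-example gradients) are our assumptions rather than the paper's exact choices.

```python
import numpy as np

def backtracking_line_search(F_S, x, g, grads, L_prev, eta=2.0):
    """Sketch of Algorithm 2 with a variance-based contraction factor.

    F_S:    sampled objective F_{S_k} (a callable of x)
    x:      current iterate x_k
    g:      sampled gradient g_k
    grads:  per-example gradients grad F_i(x_k), shape (|S_k|, d)
    L_prev: previous Lipschitz estimate L_{k-1}
    eta:    expansion factor used when sufficient decrease fails (assumed value)

    Returns (x_next, L_k).
    """
    n = grads.shape[0]
    g_sq = g @ g
    # Sample variance of the individual gradients, as used in (4.3).
    var = np.sum((grads - g) ** 2) / (n - 1)
    gamma = min(1.0, 0.5 * (1.0 + var / (n * g_sq)))   # contraction factor (4.3)
    L = gamma * L_prev                                  # tentative decrease of L
    x_next = x - g / L
    # Backtrack until sufficient decrease in the sampled function holds.
    while F_S(x_next) > F_S(x) - g_sq / (2.0 * L):
        L *= eta
        x_next = x - g / L
    return x_next, L
```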

4.2 Sample Control in the Noisy Regime

In the previous sections, we presented two forms of the inner product and orthogonality tests: one based on the population statistics and one based on samples. Since in many applications only sample statistics are available, our practical implementation of the algorithm imposes the inner product and orthogonality tests by verifying (2.6) instead of (2.5), and (3.3) instead of (3.2), respectively. The augmented inner product test consists of the inner product test (2.6) together with the orthogonality test (3.3).

Our numerical experience indicates that these sample approximations are sufficiently accurate, except if we choose to start the algorithm with a very small sample size, say 3, 5, or 10. In this highly noisy regime, our conditions may not control the sample size correctly because, for small samples, $\|g_k\|$ is often much larger than the true gradient norm $\|\nabla F(x_k)\|$, and therefore the tests (2.6) and (3.3) are too easily satisfied, preventing increases in the sample size. (These difficulties can also arise when using the norm test.)

To obtain a more accurate estimate of $\nabla F(x_k)$ to use in (2.6) and (3.3), we employ the following strategy. Whenever the sample size remains constant for a certain number of iterations, say $r$, we compute the running average of the $r$ most recent sample gradients:

    \bar{g}_k = \frac{1}{r} \sum_{j=k-r+1}^{k} g_j.        (4.4)

Ideally, $r$ should be chosen such that the iterates in this summation are close enough to provide a good approximation of the gradient at $x_k$, and so that there are enough samples for $\bar{g}_k$ to be meaningful. (A reasonable default value could be .) If the length of $\bar{g}_k$ is small compared with the length of