 # RES: Regularized Stochastic BFGS Algorithm

RES, a regularized stochastic version of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton method is proposed to solve convex optimization problems with stochastic objectives. The use of stochastic gradient descent algorithms is widespread, but the number of iterations required to approximate optimal arguments can be prohibitive in high dimensional problems. Application of second order methods, on the other hand, is impracticable because computation of objective function Hessian inverses incurs excessive computational cost. BFGS modifies gradient descent by introducing a Hessian approximation matrix computed from finite gradient differences. RES utilizes stochastic gradients in lieu of deterministic gradients for both, the determination of descent directions and the approximation of the objective function's curvature. Since stochastic gradients can be computed at manageable computational cost RES is realizable and retains the convergence rate advantages of its deterministic counterparts. Convergence results show that lower and upper bounds on the Hessian egeinvalues of the sample functions are sufficient to guarantee convergence to optimal arguments. Numerical experiments showcase reductions in convergence time relative to stochastic gradient descent algorithms and non-regularized stochastic versions of BFGS. An application of RES to the implementation of support vector machines is developed.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## I Introduction

Stochastic optimization algorithms are used to solve the problem of optimizing an objective function over a set of feasible values in situations where the objective function is defined as an expectation over a set of random functions. In particular, consider an optimization variable

and a random variable

that determines the choice of a function . The stochastic optimization problems considered in this paper entail determination of the argument that minimizes the expected value ,

 w∗ := \operatornamewithlimitsargminwEθ[f(w,θ)] := \operatornamewithlimitsargminwF(w). (1)

We refer to as the random or instantaneous functions and to as the average function. Problems having the form in (1

) are common in machine learning

[3, 4, 5] as well as in optimal resource allocation in wireless systems [6, 7, 8].

Since the objective function of (1) is convex, descent algorithms can be used for its minimization. However, conventional descent methods require exact determination of the gradient of the objective function

, which is intractable in general. Stochastic gradient descent (SGD) methods overcome this issue by using unbiased gradient estimates based on small subsamples of data and are the workhorse methodology used to solve large-scale stochastic optimization problems

[4, 9, 10, 11, 12]. Practical appeal of SGD remains limited, however, because they need large number of iterations to converge. This problem is most acute when the variable dimension is large as the condition number tends to increase with

. Developing stochastic Newton algorithms, on the other hand, is of little use because unbiased estimates of Newton steps are not easy to compute

.

Recourse to quasi-Newton methods then arises as a natural alternative. Indeed, quasi-Newton methods achieve superlinear convergence rates in deterministic settings while relying on gradients to compute curvature estimates [14, 15, 16, 17]. Since unbiased gradient estimates are computable at manageable cost, stochastic generalizations of quasi-Newton methods are not difficult to devise [18, 6, 19]. Numerical tests of these methods on simple quadratic objectives suggest that stochastic quasi-Newton methods retain the convergence rate advantages of their deterministic counterparts . The success of these preliminary experiments notwithstanding, stochastic quasi-Newton methods are prone to yield near singular curvature estimates that may result in erratic behavior (see Section V-A).

In this paper we introduce a stochastic regularized version of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton method to solve problems with the generic structure in (1). The proposed regularization avoids the near-singularity problems of more straightforward extensions and yields an algorithm with provable convergence guarantees when the functions are strongly convex.

We begin the paper with a brief discussion of SGD (Section II) and deterministic BFGS (Section II-A

). The fundamental idea of BFGS is to continuously satisfy a secant condition that captures information on the curvature of the function being minimized while staying close to previous curvature estimates. To regularize deterministic BFGS we retain the secant condition but modify the proximity condition so that eigenvalues of the Hessian approximation matrix stay above a given threshold (Section

II-A). This regularized version is leveraged to introduce the regularized stochastic BFGS algorithm (Section II-B). Regularized stochastic BFGS differs from standard BFGS in the use of a regularization to make a bound on the largest eigenvalue of the Hessian inverse approximation matrix and on the use of stochastic gradients in lieu of deterministic gradients for both, the determination of descent directions and the approximation of the objective function’s curvature. We abbreviate regularized stochastic BFGS as RES111The letters “R and “E” appear in “regularized” as well as in the names of Broyden, Fletcher, and Daniel Goldfarb; “S” is for “stochastic” and Shanno..

Convergence properties of RES are then analyzed (Section III). We prove that lower and upper bounds on the Hessians of the sample functions are sufficient to guarantee convergence to the optimal argument

with probability 1 over realizations of the sample functions (Theorem

1). We complement this result with a characterization of the convergence rate which is shown to be at least linear in expectation (Theorem 2). Linear expected convergence rates are typical of stochastic optimization algorithms and, in that sense, no better than SGD. Advantages of RES relative to SGD are nevertheless significant, as we establish in numerical results for the minimization of a family of quadratic objective functions of varying dimensionality and condition number (Section IV). As we vary the condition number we observe that for well conditioned objectives RES and SGD exhibit comparable performance, whereas for ill conditioned functions RES outperforms SGD by an order of magnitude (Section IV-A). As we vary problem dimension we observe that SGD becomes unworkable for large dimensional problems. RES however, exhibits manageable degradation as the number of iterations required for convergence doubles when the problem dimension increases by a factor of ten (Section IV-C).

An important example of a class of problems having the form in (1

) are support vector machines (SVMs) that reduce binary classification to the determination of a hyperplane that separates points in a given training set; see, e.g.,

[20, 4, 21]. We adapt RES for SVM problems (Section V) and show the improvement relative to SGD in convergence time, stability, and classification accuracy through numerical analysis (SectionV-A). We also compare RES to standard (non-regularized) stochastic BFGS. The regularization in RES is fundamental in guaranteeing convergence as standard (non-regularized) stochastic BFGS is observed to routinely fail in the computation of a separating hyperplane.

## Ii Algorithm definition

Recall the definitions of the sample functions and the average function . We assume the sample functions are strongly convex for all . This implies the objective function , being an average of the strongly convex sample functions, is also strongly convex. We can find the optimal argument in (1) with a gradient descent algorithm where gradients of are given by

 s(w):=∇F(w)=Eθ[∇f(w,θ)]. (2)

When the number of functions is large, as is the case in most problems of practical interest, exact evaluation of the gradient is impractical. This motivates the use of stochastic gradients in lieu of actual gradients. More precisely, consider a given set of realizations and define the stochastic gradient of at given samples as

 ^s(w,~θ):=1LL∑l=1∇f(w,θl). (3)

Introducing now a time index , an initial iterate , and a step size sequence , a stochastic gradient descent algorithm is defined by the iteration

 wt+1=wt−ϵt ^s(wt,~θt). (4)

To implement (4) we compute stochastic gradients using (3). In turn, this requires determination of the gradients of the random functions for each component of and their corresponding average. The computational cost is manageable for small values of .

The stochastic gradient in (3) is an unbiased estimate of the (average) gradient in (2) in the sense that . Thus, the iteration in (4) is such that, on average, iterates descend along a negative gradient direction. This intuitive observation can be formalized into a proof of convergence when the step size sequence is selected as nonsummable but square summable, i.e.,

 ∞∑t=0ϵt=∞,and∞∑t=0ϵ2t<∞. (5)

A customary step size choice for which (5) holds is to make , for given parameters and that control the initial step size and its speed of decrease, respectively. Convergence notwithstanding, the number of iterations required to approximate is very large in problems that don’t have small condition numbers. This motivates the alternative methods we discuss in subsequent sections.

### Ii-a Regularized BFGS

To speed up convergence of (4) resort to second order methods is of little use because evaluating Hessians of the objective function is computationally intensive. A better suited methodology is the use of quasi-Newton methods whereby gradient descent directions are premultiplied by a matrix ,

 wt+1=wt−ϵt B−1ts(wt). (6)

The idea is to select positive definite matrices close to the Hessian of the objective function . Various methods are known to select matrices , including those by Broyden e.g., ; Davidon, Feletcher, and Powell (DFP) ; and Broyden, Fletcher, Goldfarb, and Shanno (BFGS) e.g., [17, 16, 15]. We work here with the matrices used in BFGS since they have been observed to work best in practice .

In BFGS – and all other quasi-Newton methods for that matter – the function’s curvature is approximated by a finite difference. Specifically, define the variable and gradient variations at time as

 vt:=wt+1−wt,andrt:=s(wt+1)−s(wt), (7)

respectively, and select the matrix to be used in the next time step so that it satisfies the secant condition . The rationale for this selection is that the Hessian satisfies this condition for tending to . Notice however that the secant condition is not enough to completely specify . To resolve this indeterminacy, matrices in BFGS are also required to be as close as possible to in terms of the Gaussian differential entropy,

 Bt+1 = \operatornamewithlimitsargminZ tr[B−1tZ]−logdet[B−1tZ]−n, \operatornamewithlimitss.t. Zvt=rt,Z⪰0. (8)

The constraint in (II-A) restricts the feasible space to positive semidefinite matrices whereas the constraint requires to satisfy the secant condition. The objective

represents the differential entropy between random variables with zero-mean Gaussian distributions

and having covariance matrices and . The differential entropy is nonnegative and equal to zero if and only if . The solution of the semidefinite program in (II-A) is therefore closest to in the sense of minimizing the Gaussian differential entropy among all positive semidefinite matrices that satisfy the secant condition .

Strongly convex functions are such that the inner product of the gradient and variable variations is positive, i.e., . In that case the matrix in (II-A) is explicitly given by the update – see, e.g.,  and the proof of Lemma 1 –,

 Bt+1=Bt+rtrTtvTtrt−BtvtvTtBtvTtBtvt. (9)

In principle, the solution to (II-A) could be positive semidefinite but not positive definite, i.e., we can have but . However, through direct operation in (9) it is not difficult to conclude that stays positive definite if the matrix is positive definite. Thus, initializing the curvature estimate with a positive definite matrix guarantees for all subsequent times . Still, it is possible for the smallest eigenvalue of to become arbitrarily close to zero which means that the largest eigenvalue of can become arbitrarily large. This has been proven not to be an issue in BFGS implementations but is a more significant challenge in the stochastic version proposed here.

To avoid this problem we introduce a regularization of (II-A) to enforce the eigenvalues of to exceed a positive constant . Specifically, we redefine as the solution of the semidefinite program,

 Bt+1= \operatornamewithlimitsargminZ tr[B−1t(Z−δI)]−logdet[B−1t(Z−δI)]−n, \operatornamewithlimitss.t. Zvt=rt,Z⪰0. (10)

The curvature approximation matrix defined in (II-A) still satisfies the secant condition but has a different proximity requirement since instead of comparing and we compare and . While (II-A) does not ensure that all eigenvalues of exceed we can show that this will be the case under two minimally restrictive assumptions. We do so in the following proposition where we also give an explicit solution for (II-A) analogous to the expression in (9) that solves the non regularized problem in (II-A).

###### Proposition 1

Consider the semidefinite program in (II-A) where the matrix is positive definite and define the corrected gradient variation

 ~rt:=rt−δvt. (11)

If the inner product is positive, the solution of (II-A) is such that all eigenvalues of are larger than ,

 Bt+1⪰δI. (12)

Furthermore, is explicitly given by the expression

 Bt+1=Bt+~rt~rTtvTt~rt−BtvtvTtBtvTtBtvt+δI. (13)

Proof : See Appendix.

Comparing (9) and (13) it follows that the differences between BFGS and regularized BFGS are the replacement of the gradient variation in (7) by the corrected variation and the addition of the regularization term . We use (13) in the construction of the stochastic BFGS algorithm in the following section.

### Ii-B RES: Regularized Stochastic BFGS

As can be seen from (13) the regularized BFGS curvature estimate is obtained as a function of previous estimates , iterates and , and corresponding gradients and . We can then think of a method in which gradients are replaced by stochastic gradients in both, the curvature approximation update in (13) and the descent iteration in (6). Specifically, start at time with current iterate and let stand for the Hessian approximation computed by stochastic BFGS in the previous iteration. Obtain a batch of samples , determine the value of the stochastic gradient as per (3), and update the iterate as

 wt+1=wt−ϵt(^B−1t+ΓI)^s(wt,~θt), (14)

where we added the identity bias term for a given positive constant . Relative to SGD as defined by (4), RES as defined by (14) differs in the use of the matrix to account for the curvature of . Relative to (regularized or non regularized) BFGS as defined in (6) RES differs in the use of stochastic gradients instead of actual gradients and in the use of the curvature approximation in lieu of . Observe that in (14) we add a bias to the curvature approximation . This is necessary to ensure convergence by hedging against random variations in as we discuss in Section III.

To update the Hessian approximation matrix compute the stochastic gradient associated with the same set of samples used to compute the stochastic gradient . Define then the stochastic gradient variation at time as

 ^rt:=^s(wt+1,~θt)−^s(wt,~θt), (15)

and redefine so that it stands for the modified stochastic gradient variation

 ~rt:=^rt−δvt, (16)

by using instead of . The Hessian approximation for the next iteration is defined as the matrix that satisfies the stochastic secant condition and is closest to in the sense of (II-A). As per Proposition 1 we can compute explicitly as

 ^Bt+1=^Bt+~rt~rTtvTt~rt−^BtvtvTt^BtvTt^Btvt+δI. (17)

as long as . Conditions to guarantee that are introduced in Section III.

The resulting RES algorithm is summarized in Algorithm 1. The two core steps in each iteration are the descent in Step 4 and the update of the Hessian approximation in Step 8. Step 2 comprises the observation of samples that are required to compute the stochastic gradients in steps 3 and 5. The stochastic gradient in Step 3 is used in the descent iteration in Step 4. The stochastic gradient of Step 3 along with the stochastic gradient of Step 5 are used to compute the variations in steps 6 and 7 that permit carrying out the update of the Hessian approximation in Step 8. Iterations are initialized at arbitrary variable and positive definite matrix with the smallest eigenvalue larger than .

###### Remark 1.

One may think that the natural substitution of the gradient variation is the stochastic gradient variation instead of the variation in (15). This would have the advantage that is the stochastic gradient used to descend in iteration whereas is not and is just computed for the purposes of updating . Therefore, using the variation requires twice as many stochastic gradient evaluations as using the variation . However, the use of the variation is necessary to ensure that , which in turn is required for (17) to be true. This cannot be guaranteed if we use the variation – see Lemma 1 for details. The same observation holds true for the non-regularized version of stochastic BFGS introduced in .

## Iii Convergence

For the subsequent analysis it is convenient to define the instantaneous objective function associated with samples as

 ^f(w,~θ):=1LL∑l=1f(w,θl). (18)

The definition of the instantaneous objective function in association with the fact that implies

 F(w)=Eθ[^f(w,~θ)]. (19)

Our goal here is to show that as time progresses the sequence of variable iterates approaches the optimal argument . In proving this result we make the following assumptions.

###### Assumption 1.

The instantaneous functions are twice differentiable and the eigenvalues of the instantaneous Hessian are bounded between constants and for all random variables ,

 ~mI ⪯ ^H(w,~θ) ⪯ ~MI. (20)
###### Assumption 2.

The second moment of the norm of the stochastic gradient is bounded for all

. i.e., there exists a constant such that for all variables it holds

 Eθ[∥^s(wt,~θt)∥2]≤S2, (21)
###### Assumption 3.

The regularization constant is smaller than the smallest Hessian eigenvalue , i.e., .

As a consequence of Assumption 1 similar eigenvalue bounds hold for the (average) function . Indeed, it follows from the linearity of the expectation operator and the expression in (19) that the Hessian is . Combining this observation with the bounds in (20) it follows that there are constants and such that

 ~mI ⪯ mI ⪯ H(w) ⪯MI ⪯ ~MI. (22)

The bounds in (22) are customary in convergence proofs of descent methods. For the results here the stronger condition spelled in Assumption 1 is needed. The restriction imposed by Assumption 2 is typical of stochastic descent algorithms, its intent being to limit the random variation of stochastic gradients. Assumption 3 is necessary to guarantee that the inner product [cf. Proposition 1] is positive as we show in the following lemma.

###### Lemma 1

Consider the modified stochastic gradient variation defined in (16) and the variable variation defined in (7). Let Assumption 1 hold and recall the lower bound on the smallest eigenvalue of the instantaneous Hessians. Then, for all constants it holds

 ~rTtvt = (^rt−δvt)Tvt ≥(~m−δ)∥vt∥2 > 0. (23)

Proof : As per (20) in Assumption 1 the eigenvalues of the instantaneous Hessian are bounded by and . Thus, for any given vector it holds

 (24)

For given and define the mean instantaneous Hessian as the average Hessian value along the segment

 ^Gt=∫10^H(wt+τ(wt+1−wt),~θt)dτ. (25)

Consider now the instantaneous gradient evaluated at and observe that its derivative with respect to is . It then follows from the fundamental theorem of calculus that

 ∫10^H(wt+τ(wt+1−wt),~θt)(wt+1−wt) dτ= ^s(wt+1,~θt)−^s(wt,~θt). (26)

Using the definitions of the mean instantaneous Hessian in (25) as well as the definitions of the stochastic gradient variations and variable variations in (15) and (7) we can rewrite (III) as

 ^Gtvt=^rt. (27)

Invoking (24) for the integrand in (25), i.e., for , it follows that for all vectors the mean instantaneous Hessian satisfies

 ~m∥z∥2≤zT^Gtz≤~M∥z∥2. (28)

The claim in (23) follows from (27) and (28). Indeed, consider the ratio of inner products and use (27) and the first inequality in (28) to write

 (29)

Consider now the inner product in (23) and use the bound in (29) to write

 ~rTtvt = ^rTtvt−δvTtvt ≥ ~mvTtvt−δvTtvt (30)

Since we are selecting by hypothesis it follows that (23) is true for all times .

Initializing the curvature approximation matrix , which implies , and setting it follows from Lemma 1 that the hypotheses of Proposition 1 are satisfied for . Hence, the matrix computed from (17) is the solution of the semidefinite program in (II-A) and, more to the point, satisfies , which in turn implies . Proceeding recursively we can conclude that for all times . Equivalently, this implies that all the eigenvalues of are between and and that, as a consequence, the matrix is such that

 ΓI ⪯ ^B−1t+ΓI ⪯ (Γ+1δ)I. (31)

Having matrices that are strictly positive definite with eigenvalues uniformly upper bounded by leads to the conclusion that if is a descent direction, the same holds true of . The stochastic gradient is not a descent direction in general, but we know that this is true for its conditional expectation . Therefore, we conclude that is an average descent direction because . Stochastic optimization algorithms whose displacements are descent directions on average are expected to approach optimal arguments in a sense that we specify formally in the following lemma.

###### Lemma 2

Consider the RES algorithm as defined by (14)-(17). If assumptions 1, 2 and 3 hold true, the sequence of average function satisfies

 E[F(wt+1)∣∣wt]≤F(wt)−ϵtΓ∥∇F(wt)∥2+Kϵ2t (32)

where the constant .

Proof : As it follows from Assumption 1 the eigenvalues of the Hessian are bounded between and as stated in (22). Taking a Taylor’s expansion of the dual function around and using the upper bound in the Hessian eigenvalues we can write

 F(wt+1)≤F(wt)+∇F(wt)T(wt+1−wt)+M2∥wt+1−wt∥2 (33)

From the definition of the RES update in (14) we can write the difference of two consecutive variables as . Making this substitution in (33), taking expectation with given in both sides of the resulting inequality, and observing the fact that when is given the Hessian approximation is deterministic we can write

 E[F(wt+1)∣∣wt]≤ F(wt) (34) −ϵt∇F(wt)T(^B−1t+ΓI)E[^s(wt,~θt)∣∣wt] +ϵ2M2 E[∥∥(^B−1t+ΓI)^s(wt,~θt)∥∥2∣∣wt].

We proceed to bound the third term in the right hand side of (34). Start by observing that the 2-norm of a product is not larger than the product of the 2-norms and that, as noted above, with given the matrix is also given to write

 E[∥∥(^B−1t+ΓI)^s(wt,~θt)∥∥2∣∣wt] ≤∥∥^B−1t+ΓI∥∥2E[∥∥^s(wt,~θt)∥∥2∣∣wt]. (35)

Notice that, as stated in (31), is an upper bound for the eigenvalues of . Further observe that the second moment of the norm of the stochastic gradient is bounded by , as stated in Assumption 2. These two upper bounds substituted in (III) yield

 E[∥∥(^B−1t+ΓI)^s(wt,~θt)∥∥2∣∣wt]≤S2(1/δ+Γ)2. (36)

Substituting the upper bound in (36) for the third term of (34) and further using the fact that in the second term leads to

 E[F(wt+1)∣∣wt]≤ +ϵ2tMS22(1/δ+Γ)2. (37)

We now find a lower bound for the second term in the right hand side of (III). Since the Hessian approximation matrices are positive definite their inverses are positive semidefinite. In turn, this implies that all the eigenvalues of are not smaller than since increases all the eigenvalues of by . This lower bound for the eigenvalues of implies that

 (38)

Substituting the lower bound in (38) for the corresponding summand in (III) and further noting the definition of in the statement of the lemma, the result in (33) follows.

Setting aside the term for the sake of argument (32) defines a supermartingale relationship for the sequence of average functions . This implies that the sequence is almost surely summable which, given that the step sizes are nonsummable as per (5), further implies that the limit infimum of the gradient norm is almost surely null. This latter observation is equivalent to having with probability 1 over realizations of the random samples . The term is a relatively minor nuisance that can be taken care with a technical argument that we present in the proof of the following theorem.

###### Theorem 1

Consider the RES algorithm as defined by (14)-(17). If assumptions 1, 2 and 3 hold true and the sequence of stepsizes satisfies (5), the limit infimum of the squared Euclidean distance to optimality satisfies

 liminft→∞∥wt−w∗∥2=0a.s. (39)

over realizations of the random samples