1 Introduction

In this paper, we develop and analyze stochastic gradient-free descent (SGFD) methods for solving the following class of stochastic optimization problems:

 $$x^* = \arg\min_{x\in\mathbb{R}^d} F(x), \tag{1}$$

where the real-valued function $F$ is defined by

 $$F(x) := \mathbb{E}_\xi[f(x,\xi)] = \int_\Xi f(x,\xi)\,\mathrm{d}P(\xi), \tag{2}$$

and $\{f(\cdot,\xi)\}_{\xi\in\Xi}$ is a collection of real-valued functions with a given probability distribution $P$ over the index set $\Xi$. In the context of machine learning applications, $f(x,\xi)$ is often treated as the loss of a prediction function, incurred by the parameter vector $x$ with respect to a sample $\xi$ selected at random from a sample set; accordingly, $F(x)$ is treated as the empirical risk of the parameter vector $x$ with respect to the probability distribution $P$, which is usually unknown. The current popular methodology for such problems is the stochastic gradient (SG) method [19, 18, 3, 9, 14]. Specifically, with an initial point $x_1$, these methods are characterized by the iteration

 $$x_{k+1} = x_k - \alpha_k g(x_k,\xi_k), \tag{3}$$

where $\alpha_k$ is the stepsize and $g(x_k,\xi_k)$ is the stochastic gradient defined by

 $$g(x_k,\xi_k) = \begin{cases} \nabla f(x_k,\xi_k), \\ \frac{1}{n_k}\sum_{i=1}^{n_k}\nabla f(x_k,\xi_{k,i}), \end{cases} \tag{4}$$

which is an unbiased estimator of the so-called full gradient $\nabla F(x_k)$ [15, 4].

The SG method was originally developed by Robbins and Monro for smooth stochastic approximation problems. It has convergence guarantees [9, 19, 4] and has gained extensive empirical success in large-scale convex and nonconvex stochastic optimization [18, 3, 14, 5, 6]. However, there are still notable difficulties with the SG method, and some of them are related to the gradient itself. For example, it may suffer from vanishing and exploding gradients when training artificial neural networks [2, 11]; moreover, the gradient is sometimes very difficult or even impossible to obtain.

From this starting point, this work attempts to establish a method that avoids direct gradient evaluations but retains most of the main advantages of gradient-based methods, such as the sublinear convergence rate for strongly convex objectives with Lipschitz gradients and the global convergence for twice differentiable objectives [10, 4]. We achieve this goal by establishing an asymptotically unbiased estimator of the gradient. Such an approach is referred to as a gradient-free method because gradients are not evaluated and applied directly. However, one will see that the convergence analysis of gradient-free methods does not go beyond the analysis framework of stochastic gradient methods given in . Moreover, we also provide a theoretical analysis of the inclusion of momentum in stochastic settings. It is shown that the momentum term introduces extra bias but reduces the variance of the stochastic directions.

The remainder of the paper is organized as follows. The next section introduces the stochastic gradient-free descent methods. The gradient-free methods with momentum are then discussed in detail in Section 3. In Section 4, we state our main theorems guaranteeing convergence both for strongly convex objectives with Lipschitz gradients and for possibly nonconvex twice differentiable objectives. Finally, we draw some conclusions in Section 5.

2 Stochastic gradient-free descent

The fundamental idea of this work is not to evaluate and apply gradients directly, but to learn information about gradients indirectly through stochastic directions and the corresponding output feedback of the objective function. In the following, we first describe the gradient-free method and then analyze its convergence properties. Throughout, we consider differentiable objective functions with Lipschitz continuous gradients.

2.1 Assumptions of objectives

First, we state a basic smoothness assumption on the objective function. Such an assumption is essential for the convergence analysis of our methods, as well as of most gradient-based methods.

Assumption (Lipschitz-continuous gradients). The objective function $F$ is continuously differentiable and its gradient $\nabla F$ is Lipschitz continuous with Lipschitz constant $L > 0$, i.e.,

 $$\|\nabla F(x') - \nabla F(x)\|_2 \leqslant L\|x' - x\|_2 \quad \text{for all } x', x \in \mathbb{R}^d.$$

This assumption ensures that the gradient of the objective does not change arbitrarily quickly with respect to the parameter vector. As an important consequence, we note that

 $$\left|F(x') - F(x) - \nabla F(x)^T(x'-x)\right| \leqslant \frac{L}{2}\|x'-x\|_2^2 \quad \text{for all } x', x \in \mathbb{R}^d. \tag{5}$$

This inequality comes from

 $$F(x') - F(x) - \nabla F(x)^T(x'-x) = \int_0^1 \big(\nabla F(x+t(x'-x)) - \nabla F(x)\big)^T(x'-x)\,\mathrm{d}t$$

and the Lipschitz bound $\|\nabla F(x+t(x'-x)) - \nabla F(x)\|_2 \leqslant tL\|x'-x\|_2$. Moreover, it is trivial if $F$ is twice continuously differentiable with $\|\nabla^2 F(x)\|_2 \leqslant L$ for every $x\in\mathbb{R}^d$. From this inequality, we obtain a fundamental lemma for any iteration based on random steps, which is a slight generalization of Lemma 4.2 in .

Lemma 1

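Under the Lipschitz-gradient assumption of Section 2.1, if for every $k\in\mathbb{N}$, $\theta_k$ is any random vector independent of $x_k$ and $s(x_k,\theta_k)$ is a stochastic step that depends on $\theta_k$, then the iteration $x_{k+1} = x_k + s(x_k,\theta_k)$ satisfies the following inequality

 $$\mathbb{E}_{\theta_k}[F(x_{k+1})] - F(x_k) \leqslant \nabla F(x_k)^T\mathbb{E}_{\theta_k}[s(x_k,\theta_k)] + \frac{L}{2}\big\|\mathbb{E}_{\theta_k}[s(x_k,\theta_k)]\big\|_2^2 + \frac{L}{2}\mathbb{V}_{\theta_k}[s(x_k,\theta_k)],$$

where the variance of $s(x_k,\theta_k)$ is defined as

 $$\mathbb{V}_{\theta_k}[s(x_k,\theta_k)] := \mathbb{E}_{\theta_k}\big[\|s(x_k,\theta_k)\|_2^2\big] - \big\|\mathbb{E}_{\theta_k}[s(x_k,\theta_k)]\big\|_2^2. \tag{6}$$

Proof

By Eq. 5, which follows from the Lipschitz-gradient assumption, the iteration $x_{k+1} = x_k + s(x_k,\theta_k)$ satisfies

 $$F(x_{k+1}) - F(x_k) \leqslant \nabla F(x_k)^T(x_{k+1}-x_k) + \frac{L}{2}\|x_{k+1}-x_k\|_2^2 = \nabla F(x_k)^T s(x_k,\theta_k) + \frac{L}{2}\|s(x_k,\theta_k)\|_2^2.$$

Noting that $\theta_k$ is independent of $x_k$ and taking expectations with respect to the distribution of $\theta_k$, we obtain

 $$\mathbb{E}_{\theta_k}[F(x_{k+1})] - F(x_k) \leqslant \nabla F(x_k)^T\mathbb{E}_{\theta_k}[s(x_k,\theta_k)] + \frac{L}{2}\mathbb{E}_{\theta_k}\big[\|s(x_k,\theta_k)\|_2^2\big].$$

Recalling Eq. 6, we finally get the desired bound.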

Regardless of the states before $x_k$, the expected change in the objective function yielded by the $k$th stochastic step $s(x_k,\theta_k)$ is bounded above by a quantity involving (i) a positive definite quadratic form in the expectation of $s(x_k,\theta_k)$, namely,

 $$\nabla F(x_k)^T\mathbb{E}_{\theta_k}[s(x_k,\theta_k)] + \frac{L}{2}\big\|\mathbb{E}_{\theta_k}[s(x_k,\theta_k)]\big\|_2^2, \tag{7}$$

and (ii) the variance of $s(x_k,\theta_k)$. Hence, this lemma shows that a bound on the expected decrease can be obtained by analyzing the expectation and the variance of the step $s(x_k,\theta_k)$.

Assumption (Strong convexity). The objective function $F$ is strongly convex in that there exists a constant $l > 0$ such that

 $$F(x') \geqslant F(x) + \nabla F(x)^T(x'-x) + \frac{l}{2}\|x'-x\|_2^2 \quad \text{for all } x', x \in \mathbb{R}^d. \tag{8}$$

Hence, $F$ has a unique minimizer, denoted as $x^*$, with $F^* := F(x^*)$.

Notice that for any given $x\in\mathbb{R}^d$, the quadratic model $q(x') = F(x) + \nabla F(x)^T(x'-x) + \frac{l}{2}\|x'-x\|_2^2$ has the unique minimizer $x'_* = x - \frac{1}{l}\nabla F(x)$ with $q(x'_*) = F(x) - \frac{1}{2l}\|\nabla F(x)\|_2^2$; together with Eq. 8, one obtains

 $$F^* \geqslant F(x) + \nabla F(x)^T(x^*-x) + \frac{l}{2}\|x^*-x\|_2^2 \geqslant F(x) - \frac{1}{2l}\|\nabla F(x)\|_2^2,$$

that is, for a given point $x$, the gap between the value of the objective and its minimum can be bounded by the squared $\ell_2$-norm of the gradient of the objective:

 $$2l\,(F(x) - F^*) \leqslant \|\nabla F(x)\|_2^2. \tag{9}$$

This inequality is usually referred to as the Polyak-Łojasiewicz inequality. It is a sufficient condition for gradient descent to achieve a linear convergence rate and was originally introduced by Polyak; it is also a special case of the Łojasiewicz inequality proposed in the same year, which gives an upper bound for the distance of a point to the nearest zero of a given real analytic function.

Under the Lipschitz-gradient and strong-convexity assumptions of this section, it is easy to see that $l \leqslant L$. Furthermore, if $F$ is twice continuously differentiable, then these two assumptions imply that for every $x\in\mathbb{R}^d$, it holds that $lI \preceq \nabla^2 F(x) \preceq LI$.
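As a quick sanity check of Eq. 9, the following sketch evaluates the Polyak-Łojasiewicz gap on a diagonal quadratic. The test function, the curvature values in `a`, and all helper names are illustrative assumptions and do not come from the paper; for this quadratic the strong convexity constant is the smallest curvature and the Lipschitz constant is the largest.

```python
import random

def objective(x, a):
    """F(x) = 0.5 * sum_i a_i * x_i^2, minimized at x* = 0 with F* = 0."""
    return 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))

def gradient(x, a):
    """Gradient of the diagonal quadratic: (a_1 x_1, ..., a_d x_d)."""
    return [ai * xi for ai, xi in zip(a, x)]

def pl_gap(x, a):
    """Return ||grad F(x)||^2 - 2 l (F(x) - F*); Eq. 9 says this is >= 0."""
    l = min(a)  # strong convexity constant of this quadratic
    grad_sq = sum(g * g for g in gradient(x, a))
    return grad_sq - 2.0 * l * objective(x, a)

rng = random.Random(0)
a = [0.5, 2.0, 4.0]  # hypothetical curvatures, so l = 0.5 and L = 4.0
points = [[rng.uniform(-5.0, 5.0) for _ in a] for _ in range(1000)]
min_gap = min(pl_gap(x, a) for x in points)
print("smallest PL gap over sampled points:", min_gap)
```

The gap stays nonnegative at every sampled point, up to floating-point rounding, which is what Eq. 9 predicts.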

2.2 Methods

We now define our SGFD method as Algorithm 1. The random vector $\zeta_k$ here is referred to as the stochastic direction. Very similar to the stochastic gradient method, the algorithm presumes that three computational tools exist: (i) a mechanism for generating a realization of the random variables $\xi_k$ and $\zeta_k$ (with $\{\xi_k\}$ or $\{\zeta_k\}$ representing a sequence of jointly independent random variables); (ii) given an iteration number $k\in\mathbb{N}$, a mechanism for computing a scalar stepsize $\alpha_k > 0$; and (iii) given an iterate $x_k$ and the realizations of $\xi_k$ and $\zeta_k$, a mechanism for computing a stochastic step $s(x_k,\xi_k,\alpha_k,\zeta_k)$.

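We consider the following three choices of the stochastic step $s(x_k,\xi_k,\alpha_k,\zeta_k)$:

 $$s(x_k,\xi_k,\alpha_k,\zeta_k) = \begin{cases} \big(f(x_k,\xi_k) - f(x_k+\alpha_k\zeta_k,\xi_k)\big)\zeta_k, \\ \frac{1}{n_k}\sum_{i=1}^{n_k}\big(f(x_k,\xi_{k,i}) - f(x_k+\alpha_k\zeta_k,\xi_{k,i})\big)\zeta_k, \\ \frac{1}{n_k}\sum_{i=1}^{n_k}\big(f(x_k,\xi_{k,i}) - f(x_k+\alpha_k\zeta_{k,i},\xi_{k,i})\big)\zeta_{k,i}, \end{cases} \tag{10}$$

where the values of the random variables $\xi_k$ and $\zeta_k$ need only be viewed as a seed for generating a stochastic step, and the samples $\xi_{k,i}$ and the directions $\zeta_{k,i}$ within a mini-batch are each independent and identically distributed. The three choices in Eq. 10 are referred to as the stochastic gradient-free method, the quasi mini-batch gradient-free method, and the mini-batch gradient-free method, respectively. It follows from Eqs. 10 and 2 that

 $$\mathbb{E}_{\xi_k,\zeta_k}[s(x_k,\xi_k,\alpha_k,\zeta_k)] = \mathbb{E}_{\zeta_k}\big[(F(x_k) - F(x_k+\alpha_k\zeta_k))\zeta_k\big], \tag{11}$$

which is referred to as an average descent direction with respect to the stepsize $\alpha_k$ and the distribution of $\zeta_k$. We use $\mathbb{E}[\cdot]$ to denote an expected value taken with respect to the joint distribution of all random variables, that is, the total expectation operator is defined as

 $$\mathbb{E}[\cdot] := \mathbb{E}_{\xi_1,\zeta_1}\mathbb{E}_{\xi_2,\zeta_2}\cdots\mathbb{E}_{\xi_k,\zeta_k}[\cdot].$$

Correspondingly, when the objective function $F$ itself can be evaluated easily, we also consider the following two choices of the stochastic step:

 $$s(x_k,\alpha_k,\zeta_k) = \begin{cases} \big(F(x_k) - F(x_k+\alpha_k\zeta_k)\big)\zeta_k, \\ \frac{1}{n_k}\sum_{i=1}^{n_k}\big(F(x_k) - F(x_k+\alpha_k\zeta_{k,i})\big)\zeta_{k,i}, \end{cases} \tag{12}$$

which can be seen as special cases of Eq. 10.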
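The two step choices of Eq. 12 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1, which is not reproduced in this text: the update $x_{k+1} = x_k + s$ is taken from the iteration used in Lemma 1, the directions are drawn uniformly from $[-\sqrt{3},\sqrt{3}]^d$ (zero mean, unit covariance), and the names `sample_direction`, `single_step`, `minibatch_step`, `sgfd`, and the quadratic test objective are hypothetical.

```python
import math
import random

def sample_direction(d, rng):
    """Draw zeta with zero mean and unit covariance (uniform on [-sqrt(3), sqrt(3)]^d)."""
    r = math.sqrt(3.0)
    return [rng.uniform(-r, r) for _ in range(d)]

def single_step(F, x, alpha, rng):
    """First choice in Eq. 12: s = (F(x) - F(x + alpha * zeta)) * zeta."""
    zeta = sample_direction(len(x), rng)
    probe = [xi + alpha * zi for xi, zi in zip(x, zeta)]
    diff = F(x) - F(probe)
    return [diff * zi for zi in zeta]

def minibatch_step(F, x, alpha, n, rng):
    """Second choice in Eq. 12: average of n steps with independent directions."""
    total = [0.0] * len(x)
    for _ in range(n):
        s = single_step(F, x, alpha, rng)
        total = [t + si for t, si in zip(total, s)]
    return [t / n for t in total]

def sgfd(F, x0, stepsizes, n=1, seed=0):
    """Assumed form of the iteration: x_{k+1} = x_k + s(x_k, alpha_k, zeta_k)."""
    rng = random.Random(seed)
    x = list(x0)
    for alpha in stepsizes:
        s = minibatch_step(F, x, alpha, n, rng)
        x = [xi + si for xi, si in zip(x, s)]
    return x

def quadratic(x):
    """Illustrative objective F(x) = 0.5 * ||x||^2 with minimizer 0."""
    return 0.5 * sum(xi * xi for xi in x)

x0 = [2.0, -1.0]
x_final = sgfd(quadratic, x0, stepsizes=[0.1] * 500, n=4)
print("F(x0) =", quadratic(x0), " F(x_final) =", quadratic(x_final))
```

Only function values are queried; no gradient of `quadratic` is ever computed. With a fixed stepsize the iterates settle into a small neighborhood of the minimizer rather than converging exactly, which is consistent with the variance term in the analysis.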

Lemma 2

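Under the Lipschitz-gradient assumption of Section 2.1, the expectation of the stochastic steps in Eq. 10 satisfies the following: for every $k\in\mathbb{N}$ and $\alpha_k$, there is a $C_{x_k,\alpha_k\zeta_k}\in[-L,L]$, which depends on $x_k$ and $\alpha_k\zeta_k$ for every $\zeta_k$, such that

 $$\mathbb{E}_{\xi_k,\zeta_k}[s(x_k,\xi_k,\alpha_k,\zeta_k)] = -\alpha_k\mathbb{E}_{\zeta_k}\big[(\nabla F(x_k)^T\zeta_k)\zeta_k\big] - \frac{\alpha_k^2}{2}\mathbb{E}_{\zeta_k}\big[C_{x_k,\alpha_k\zeta_k}(\zeta_k^T\zeta_k)\zeta_k\big].$$

Proof

According to Eq. 5, there is a $C_{x_k,\alpha_k\zeta_k}\in[-L,L]$ for every $\zeta_k$ such that

 $$\big(F(x_k) - F(x_k+\alpha_k\zeta_k)\big)\zeta_k = -\alpha_k(\nabla F(x_k)^T\zeta_k)\zeta_k - \frac{\alpha_k^2}{2}C_{x_k,\alpha_k\zeta_k}(\zeta_k^T\zeta_k)\zeta_k;$$

taking expectations with respect to the distribution of $\zeta_k$ and recalling Eq. 11, one obtains

 $$\mathbb{E}_{\xi_k,\zeta_k}[s(x_k,\xi_k,\alpha_k,\zeta_k)] = \mathbb{E}_{\zeta_k}\big[(F(x_k) - F(x_k+\alpha_k\zeta_k))\zeta_k\big] = -\alpha_k\mathbb{E}_{\zeta_k}\big[(\nabla F(x_k)^T\zeta_k)\zeta_k\big] - \frac{\alpha_k^2}{2}\mathbb{E}_{\zeta_k}\big[C_{x_k,\alpha_k\zeta_k}(\zeta_k^T\zeta_k)\zeta_k\big],$$

as desired.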

2.3 Distribution of random directions

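We now formalize an assumption on the distribution of the random directions. This will be an important basis for the subsequent analysis.

Assumption (Distribution of random directions). The $d$-dimensional random vectors $\{\zeta_k\}$ are independent and identically distributed and, for every $k\in\mathbb{N}$, satisfy (i) the mean of each component is $0$, i.e.,

 $$\mathbb{E}\big[\zeta_k^{(i)}\big] = 0 \quad \text{for all } i\in\{1,\cdots,d\}; \tag{13}$$

(ii) the covariance matrix is the identity matrix, i.e.,

 $$\mathbb{E}\big[\zeta_k^{(i)}\zeta_k^{(j)}\big] = \begin{cases} 1, & i = j, \\ 0, & i \neq j; \end{cases} \tag{14}$$

and (iii) every component is bounded or has a finite nonzero fourth moment, i.e., for all $i\in\{1,\cdots,d\}$, it holds that

 $$\big|\zeta_k^{(i)}\big| \leqslant r_\zeta \quad \text{or} \quad \mathbb{E}\big[(\zeta_k^{(i)})^4\big] \leqslant m_\zeta^{(4)}, \tag{15}$$

where $\zeta_k^{(i)}$ is the $i$th element of the vector $\zeta_k$.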

One typical choice for the distribution of $\zeta_k$ is, of course, the $d$-dimensional standard normal distribution with zero mean and identity covariance matrix, each component of which is unbounded but has a finite fourth moment, namely, $\mathbb{E}[(\zeta_k^{(i)})^4] = 3$ for all $i\in\{1,\cdots,d\}$; another typical choice is the $d$-dimensional uniform distribution on $[-\sqrt{3},\sqrt{3}]^d$, each component of which is bounded and has a finite fourth moment.
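The moment conditions (13) to (15) are easy to check empirically for these two choices. The sketch below draws scalar components from a standard normal and from a uniform distribution on $[-\sqrt{3},\sqrt{3}]$ (the interval is the one that gives unit variance, stated here as an assumption) and estimates the mean, second moment, and fourth moment. The function names and sample size are illustrative.

```python
import math
import random

def sample_normal(rng):
    """One component of a standard normal direction."""
    return rng.gauss(0.0, 1.0)

def sample_uniform(rng):
    """One component of a uniform direction on [-sqrt(3), sqrt(3)]."""
    r = math.sqrt(3.0)
    return rng.uniform(-r, r)

def empirical_moments(sampler, n, seed=0):
    """Estimate the mean, second moment, and fourth moment from n draws."""
    rng = random.Random(seed)
    draws = [sampler(rng) for _ in range(n)]
    mean = sum(draws) / n
    second = sum(z * z for z in draws) / n
    fourth = sum(z ** 4 for z in draws) / n
    return mean, second, fourth

normal_moments = empirical_moments(sample_normal, 200_000)
uniform_moments = empirical_moments(sample_uniform, 200_000)
print("normal  (mean, 2nd, 4th):", normal_moments)
print("uniform (mean, 2nd, 4th):", uniform_moments)
```

The exact values are $(0, 1, 3)$ for the normal component and $(0, 1, 9/5)$ for the uniform one; the estimates should land close to these, up to Monte Carlo error.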

Under the assumption of this section alone, we obtain the following lemma. This result is essential for the convergence analysis of all our methods. In fact, although it is not so intuitive, this lemma is the direct source of the basic idea of this work.

Lemma 3

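Under the assumption on random directions in Section 2.3, for every vector $\omega\in\mathbb{R}^d$ independent of $\zeta_k$, it follows that

 $$\omega = \mathbb{E}_{\zeta_k}\big[(\omega^T\zeta_k)\zeta_k\big] \tag{16}$$

and

 $$\big\|\mathbb{E}_{\zeta_k}\big[(\zeta_k^T\zeta_k)\zeta_k\big]\big\|_2 \leqslant d^{\frac{3}{2}}D_\zeta, \tag{17}$$

where the $\zeta_k$ are independent and identically distributed,

 $$D_\zeta = \min\left(r_\zeta,\ \sqrt{\frac{d + m_\zeta^{(4)} - 1}{d}}\right),$$

and $m_\zeta^{(4)}$ is the fourth moment of $\zeta_k^{(i)}$.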

Proof

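Let $v = \mathbb{E}_{\zeta_k}[(\omega^T\zeta_k)\zeta_k]$, then

 $$v = \mathbb{E}_{\zeta_k}\left[\left(\sum_{i=1}^d\omega^{(i)}\zeta_k^{(i)}\right)\zeta_k\right] = \sum_{i=1}^d\omega^{(i)}\mathbb{E}_{\zeta_k}\big[\zeta_k^{(i)}\zeta_k\big];$$

from Eq. 14 we further observe that for all $j\in\{1,\cdots,d\}$,

 $$v^{(j)} = \sum_{i=1}^d\omega^{(i)}\mathbb{E}_{\zeta_k}\big[\zeta_k^{(i)}\zeta_k^{(j)}\big] = \omega^{(j)}, \quad \text{that is,} \quad \omega = \mathbb{E}_{\zeta_k}\big[(\omega^T\zeta_k)\zeta_k\big].$$

Secondly, from Eq. 15, if every component of $\zeta_k$ is bounded, i.e., $|\zeta_k^{(j)}| \leqslant r_\zeta$, then one obtains

 $$\big|\mathbb{E}_{\zeta_k}\big[(\zeta_k^T\zeta_k)\zeta_k^{(j)}\big]\big| \leqslant r_\zeta\,\mathbb{E}_{\zeta_k}\big[\zeta_k^T\zeta_k\big] = d\,r_\zeta; \tag{18}$$

or, if every component of $\zeta_k$ has a finite nonzero fourth moment $m_\zeta^{(4)}$, then according to the Cauchy-Schwarz inequality, it follows that

 $$\big|\mathbb{E}_{\zeta_k}\big[(\zeta_k^T\zeta_k)\zeta_k^{(j)}\big]\big| \leqslant \sqrt{\mathbb{E}_{\zeta_k}\big[(\zeta_k^T\zeta_k)^2\big]\,\mathbb{E}_{\zeta_k}\big[(\zeta_k^{(j)})^2\big]} \leqslant \sqrt{d\big(d + m_\zeta^{(4)} - 1\big)}; \tag{19}$$

then it follows from Eqs. 19 and 18 that

 $$\big\|\mathbb{E}_{\zeta_k}\big[(\zeta_k^T\zeta_k)\zeta_k\big]\big\|_\infty \leqslant d\,D_\zeta,$$

and further, one also obtains

 $$\big\|\mathbb{E}_{\zeta_k}\big[(\zeta_k^T\zeta_k)\zeta_k\big]\big\|_2 \leqslant d^{\frac{1}{2}}\big\|\mathbb{E}_{\zeta_k}\big[(\zeta_k^T\zeta_k)\zeta_k\big]\big\|_\infty \leqslant d^{\frac{3}{2}}D_\zeta,$$

and the proof is complete.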
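The identity $\omega = \mathbb{E}_{\zeta_k}[(\omega^T\zeta_k)\zeta_k]$ established in the first part of this proof can be checked by simulation. The sketch below is a Monte Carlo estimate under the assumption that the directions are uniform on $[-\sqrt{3},\sqrt{3}]^d$; the vector `omega`, the sample size, and the helper names are illustrative. It also evaluates the constant $D_\zeta$ from its definition for this distribution, using $r_\zeta = \sqrt{3}$ and fourth moment $9/5$.

```python
import math
import random

def estimate_identity(omega, n, seed=0):
    """Monte Carlo estimate of E[(omega^T zeta) zeta] for uniform zeta."""
    rng = random.Random(seed)
    r = math.sqrt(3.0)
    d = len(omega)
    acc = [0.0] * d
    for _ in range(n):
        zeta = [rng.uniform(-r, r) for _ in range(d)]
        proj = sum(w * z for w, z in zip(omega, zeta))
        for j in range(d):
            acc[j] += proj * zeta[j]
    return [a / n for a in acc]

def d_zeta(d, r_zeta, m4):
    """D_zeta = min(r_zeta, sqrt((d + m4 - 1) / d)), as defined in Lemma 3."""
    return min(r_zeta, math.sqrt((d + m4 - 1.0) / d))

omega = [1.0, -2.0, 0.5]
estimate = estimate_identity(omega, 200_000)
max_error = max(abs(e - w) for e, w in zip(estimate, omega))
D = d_zeta(d=3, r_zeta=math.sqrt(3.0), m4=9.0 / 5.0)
print("estimate of E[(w^T z) z]:", estimate)
print("largest componentwise error:", max_error, " D_zeta:", D)
```

The estimate recovers `omega` up to sampling noise, which is the mechanism that lets a function-value difference along a random direction stand in for the gradient.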

2.4 Expectation analysis

The important role of the assumption in Section 2.3 is to ensure that $s(x_k,\xi_k,\alpha_k,\zeta_k)/\alpha_k$ is an asymptotically unbiased estimator of $-\nabla F(x_k)$. It also allows us to bound the quadratic form in Eq. 7 in terms of the expectation of the stochastic directions.

Theorem 2.1

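Under the Lipschitz-gradient assumption of Section 2.1 and the assumption on random directions in Section 2.3, the stochastic steps of SGFD (Algorithm 1) satisfy the asymptotic unbiasedness

 $$\lim_{\alpha_k\to 0}\frac{\mathbb{E}_{\xi_k,\zeta_k}[s(x_k,\xi_k,\alpha_k,\zeta_k)]}{\alpha_k} = -\nabla F(x_k), \tag{20}$$

and for every $k\in\mathbb{N}$, it follows that

 $$\big\|\mathbb{E}_{\xi_k,\zeta_k}[s(x_k,\xi_k,\alpha_k,\zeta_k)]\big\|_2 \leqslant \alpha_k\|\nabla F(x_k)\|_2 + \frac{\alpha_k^2Ld^{\frac{3}{2}}D_\zeta}{2} \quad \text{and} \tag{21}$$

 $$\nabla F(x_k)^T\mathbb{E}_{\xi_k,\zeta_k}[s(x_k,\xi_k,\alpha_k,\zeta_k)] \leqslant -\alpha_k\|\nabla F(x_k)\|_2^2 + \frac{\alpha_k^2Ld^{\frac{3}{2}}D_\zeta}{2}\|\nabla F(x_k)\|_2, \tag{22}$$

where the constant $D_\zeta$ comes from Lemma 3.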

Proof

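According to Lemma 2, there are $C_{x_k,\alpha_k\zeta_k}\in[-L,L]$ such that

 $$\mathbb{E}_{\xi_k,\zeta_k}[s(x_k,\xi_k,\alpha_k,\zeta_k)] = -\alpha_k\mathbb{E}_{\zeta_k}\big[(\nabla F(x_k)^T\zeta_k)\zeta_k\big] - \frac{\alpha_k^2}{2}\mathbb{E}_{\zeta_k}\big[C_{x_k,\alpha_k\zeta_k}(\zeta_k^T\zeta_k)\zeta_k\big].$$

First, it follows from Lemma 3 that

 $$\nabla F(x_k) = \mathbb{E}_{\zeta_k}\big[(\nabla F(x_k)^T\zeta_k)\zeta_k\big];$$

and then, let $R_k = \mathbb{E}_{\zeta_k}[C_{x_k,\alpha_k\zeta_k}(\zeta_k^T\zeta_k)\zeta_k]$; from Lemma 3 and the assumption of Section 2.3, $R_k$ satisfies $\|R_k\|_2 \leqslant Ld^{\frac{3}{2}}D_\zeta$ and

 $$\mathbb{E}_{\xi_k,\zeta_k}[s(x_k,\xi_k,\alpha_k,\zeta_k)] = -\alpha_k\nabla F(x_k) - \frac{\alpha_k^2}{2}R_k.$$

Thus, one obtains Eqs. 21 and 20 by noting that

 $$\left\|\frac{\mathbb{E}_{\xi_k,\zeta_k}[s(x_k,\xi_k,\alpha_k,\zeta_k)]}{\alpha_k} + \nabla F(x_k)\right\|_\infty \leqslant \frac{\alpha_k}{2}\|R_k\|_\infty \quad \text{and} \quad \big\|\mathbb{E}_{\xi_k,\zeta_k}[s(x_k,\xi_k,\alpha_k,\zeta_k)]\big\|_2 \leqslant \alpha_k\|\nabla F(x_k)\|_2 + \frac{\alpha_k^2}{2}\|R_k\|_2;$$

and further obtains Eq. 22 by noting that

 $$\nabla F(x_k)^T\mathbb{E}_{\xi_k,\zeta_k}[s(x_k,\xi_k,\alpha_k,\zeta_k)] = -\alpha_k\|\nabla F(x_k)\|_2^2 - \frac{\alpha_k^2}{2}\nabla F(x_k)^TR_k \leqslant -\alpha_k\|\nabla F(x_k)\|_2^2 + \frac{\alpha_k^2}{2}\|\nabla F(x_k)\|_2\|R_k\|_2,$$

and the proof is complete.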
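The asymptotic unbiasedness in Eq. 20 and the bound in Eq. 21 can be verified exactly, without sampling noise, if the directions take finitely many values. The sketch below assumes Rademacher directions (each component is $\pm 1$ with equal probability), which satisfy conditions (13) to (15) with $r_\zeta = 1$ and fourth moment $1$, so $D_\zeta = 1$. The test objective $F(x) = \sum_i \log\cosh(x_i)$, whose gradient is $\tanh$ with Lipschitz constant $L = 1$, and all function names are illustrative choices, not taken from the paper.

```python
import itertools
import math

def F(x):
    """Smooth test objective sum_i log(cosh(x_i)); its gradient is tanh, so L = 1."""
    return sum(math.log(math.cosh(xi)) for xi in x)

def grad_F(x):
    return [math.tanh(xi) for xi in x]

def expected_step(x, alpha):
    """Exact E[(F(x) - F(x + alpha * zeta)) zeta] over all Rademacher directions."""
    d = len(x)
    signs = list(itertools.product([-1.0, 1.0], repeat=d))
    acc = [0.0] * d
    for zeta in signs:
        probe = [xi + alpha * zi for xi, zi in zip(x, zeta)]
        diff = F(x) - F(probe)
        for j in range(d):
            acc[j] += diff * zeta[j]
    return [a / len(signs) for a in acc]

def norm2(v):
    return math.sqrt(sum(vi * vi for vi in v))

def bias(x, alpha):
    """|| E[s] / alpha + grad F(x) ||_2, which Eq. 20 says vanishes as alpha -> 0."""
    s = expected_step(x, alpha)
    return norm2([si / alpha + gi for si, gi in zip(s, grad_F(x))])

x = [0.5, -1.0, 2.0]
alphas = (1.0, 0.1, 0.01)
biases = [bias(x, a) for a in alphas]

L_const, D_zeta, d = 1.0, 1.0, len(x)  # constants for this illustrative setup
eq21_holds = all(
    norm2(expected_step(x, a))
    <= a * norm2(grad_F(x)) + a * a * L_const * d ** 1.5 * D_zeta / 2.0
    for a in alphas
)
print("bias for alpha = 1, 0.1, 0.01:", biases)
print("Eq. 21 holds at these stepsizes:", eq21_holds)
```

For this separable objective the expected step reduces componentwise to a central difference, so the bias shrinks roughly quadratically in the stepsize, well within the linear rate that the proof guarantees.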

2.5 Variance assumption

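Assumption. The objective function and SGFD (Algorithm 1) satisfy the following: there exist scalars $M \geqslant 0$ and $M_V \geqslant 0$ such that, for all $k\in\mathbb{N}$,

 $$\mathbb{V}_{\xi_k,\zeta_k}[s(x_k,\xi_k,\alpha_k,\zeta_k)] \leqslant \alpha_k^2M + \alpha_k^2M_V\|\nabla F(x_k)\|_2^2.$$

Lemma 4

Under the Lipschitz-gradient assumption of Section 2.1 and the assumptions of Sections 2.3 and 2.5, suppose that the SGFD method (Algorithm 1) is run with $0 < \alpha_k \leqslant \frac{1}{L}$ for every $k\in\mathbb{N}$; then

 $$\mathbb{E}_{\xi_k,\zeta_k}[F(x_{k+1})] - F(x_k) \leqslant -\alpha_k\|\nabla F(x_k)\|_2^2 + \frac{\alpha_k^2LM_G}{2}\|\nabla F(x_k)\|_2^2 + \frac{\alpha_k^2LM_d}{2},$$

where $M_G = M_V + 2$, $M_d = M + \frac{5}{4}d^3D_\zeta^2$, and the constant $D_\zeta$ comes from Lemma 3.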

Proof

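By Eq. 21 and $\alpha_kL \leqslant 1$, it follows that

 $$\big\|\mathbb{E}_{\xi_k,\zeta_k}[s(x_k,\xi_k,\alpha_k,\zeta_k)]\big\|_2^2 \leqslant \left(\alpha_k\|\nabla F(x_k)\|_2 + \frac{\alpha_k^2Ld^{\frac{3}{2}}D_\zeta}{2}\right)^2 = \alpha_k^2\|\nabla F(x_k)\|_2^2 + \alpha_k^3Ld^{\frac{3}{2}}D_\zeta\|\nabla F(x_k)\|_2 + \frac{\alpha_k^4L^2d^3D_\zeta^2}{4} \leqslant \alpha_k^2\|\nabla F(x_k)\|_2^2 + \alpha_k^2d^{\frac{3}{2}}D_\zeta\|\nabla F(x_k)\|_2 + \frac{\alpha_k^2d^3D_\zeta^2}{4};$$

together with the arithmetic mean-geometric mean inequality, i.e.,

 $$d^{\frac{3}{2}}D_\zeta\|\nabla F(x_k)\|_2 \leqslant \frac{d^3D_\zeta^2}{2} + \frac{1}{2}\|\nabla F(x_k)\|_2^2,$$

one obtains

 $$\big\|\mathbb{E}_{\xi_k,\zeta_k}[s(x_k,\xi_k,\alpha_k,\zeta_k)]\big\|_2^2 \leqslant \frac{3\alpha_k^2}{2}\|\nabla F(x_k)\|_2^2 + \frac{3\alpha_k^2d^3D_\zeta^2}{4}.$$

Similarly, by Eq. 22 and the arithmetic mean-geometric mean inequality, it holds that

 $$\nabla F(x_k)^T\mathbb{E}_{\xi_k,\zeta_k}[s(x_k,\xi_k,\alpha_k,\zeta_k)] \leqslant -\alpha_k\|\nabla F(x_k)\|_2^2 + \frac{\alpha_k^2L}{2}d^{\frac{3}{2}}D_\zeta\|\nabla F(x_k)\|_2 \leqslant -\alpha_k\|\nabla F(x_k)\|_2^2 + \frac{\alpha_k^2L}{4}\|\nabla F(x_k)\|_2^2 + \frac{\alpha_k^2Ld^3D_\zeta^2}{4}.$$

Finally, by Lemma 1 and the variance assumption of Section 2.5, the iterates generated by SGFD satisfy

 $$\mathbb{E}_{\xi_k,\zeta_k}[F(x_{k+1})] - F(x_k) \leqslant \nabla F(x_k)^T\mathbb{E}_{\xi_k,\zeta_k}[s(x_k,\xi_k,\alpha_k,\zeta_k)] + \frac{L}{2}\big\|\mathbb{E}_{\xi_k,\zeta_k}[s(x_k,\xi_k,\alpha_k,\zeta_k)]\big\|_2^2 + \frac{L}{2}\mathbb{V}_{\xi_k,\zeta_k}[s(x_k,\xi_k,\alpha_k,\zeta_k)] \leqslant -\alpha_k\|\nabla F(x_k)\|_2^2 + \frac{\alpha_k^2LM_G}{2}\|\nabla F(x_k)\|_2^2 + \frac{\alpha_k^2LM_d}{2},$$

and the proof is complete.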

2.6 Average behavior of iterations

Here we mainly focus on the strongly convex objective functions. According to Lemma 4, it is easy to analyze the average behavior of the SGFD iterations for strongly convex objectives.

Theorem 2.2

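Under the Lipschitz-gradient and strong-convexity assumptions of Section 2.1 and the assumptions of Sections 2.3 and 2.5, suppose that the SGFD method (Algorithm 1) is run with $0 < \alpha_k \leqslant \frac{1}{LM_G}$ for every $k\in\mathbb{N}$, with $M_G$ and $M_d$ as in Lemma 4; then

 $$\mathbb{E}[F(x_{k+1}) - F^*] \leqslant \left(\prod_{i=1}^k(1-\alpha_il)\right)[F(x_1) - F^*] + \frac{LM_d}{2}\sum_{i=1}^k\alpha_i^2\prod_{j=i+1}^k(1-\alpha_jl).$$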

Proof

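According to Lemma 4 and $\alpha_kLM_G \leqslant 1$, we have

 $$\mathbb{E}_{\xi_k,\zeta_k}[F(x_{k+1})] \leqslant F(x_k) - \alpha_k\|\nabla F(x_k)\|_2^2 + \frac{\alpha_k^2LM_G}{2}\|\nabla F(x_k)\|_2^2 + \frac{\alpha_k^2LM_d}{2} \leqslant F(x_k) - \frac{\alpha_k}{2}\|\nabla F(x_k)\|_2^2 + \frac{\alpha_k^2LM_d}{2}.$$

Subtracting $F^*$ from both sides and applying Eq. 9, this yields

 $$\mathbb{E}_{\xi_k,\zeta_k}[F(x_{k+1})] - F^* \leqslant F(x_k) - F^* - \frac{\alpha_k}{2}\|\nabla F(x_k)\|_2^2 + \frac{\alpha_k^2LM_d}{2} \leqslant F(x_k) - F^* - \alpha_kl(F(x_k) - F^*) + \frac{\alpha_k^2LM_d}{2} = (1-\alpha_kl)(F(x_k) - F^*) + \frac{\alpha_k^2LM_d}{2},$$

and it follows from taking total expectations that

 $$\mathbb{E}[F(x_{k+1}) - F^*] \leqslant (1-\alpha_kl)\,\mathbb{E}[F(x_k) - F^*] + \frac{\alpha_k^2LM_d}{2}.$$

Thus, the desired result follows by applying this inequality repeatedly through the iterations from $1$ to $k$.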

Therefore, under strong convexity, the convergence of the SGFD methods is closely related to the limits

 $$\prod_{i=1}^\infty(1-\alpha_il) \quad \text{and} \quad \sum_{i=1}^\infty\alpha_i^2\prod_{j=i+1}^\infty(1-\alpha_jl).$$

Before we return to this issue in Section 4, we shall consider the SGFD methods with momentum in the next section.
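To see how the stepsize sequence shapes the bound of Theorem 2.2, the sketch below evaluates its right-hand side in two ways: by the one-step recursion from the proof and by the product-plus-sum form of the theorem. The constants `l`, `L`, `M_d`, the initial gap, and both stepsize sequences are hypothetical values chosen only for illustration (they respect $\alpha_k \leqslant 1/(LM_G)$ when $M_G = 2$).

```python
def bound_recursive(gap0, alphas, l, L, M_d):
    """Iterate b <- (1 - alpha_k l) b + alpha_k^2 L M_d / 2, starting from F(x_1) - F*."""
    b = gap0
    for a in alphas:
        b = (1.0 - a * l) * b + a * a * L * M_d / 2.0
    return b

def bound_closed_form(gap0, alphas, l, L, M_d):
    """Right-hand side of Theorem 2.2: a product term plus a weighted sum."""
    k = len(alphas)
    prod_all = 1.0
    for a in alphas:
        prod_all *= 1.0 - a * l
    total = 0.0
    for i in range(k):
        tail = 1.0
        for j in range(i + 1, k):
            tail *= 1.0 - alphas[j] * l
        total += alphas[i] ** 2 * tail
    return prod_all * gap0 + L * M_d / 2.0 * total

l, L, M_d = 1.0, 2.0, 1.0                              # hypothetical constants
constant = [0.1] * 200                                 # fixed stepsize
diminishing = [1.0 / (k + 4) for k in range(1, 201)]   # alpha_k = 1 / (k + 4)

b_const = bound_recursive(1.0, constant, l, L, M_d)
b_dimin = bound_recursive(1.0, diminishing, l, L, M_d)
b_const_closed = bound_closed_form(1.0, constant, l, L, M_d)
print("bound after 200 steps, constant stepsize:   ", b_const)
print("bound after 200 steps, diminishing stepsize:", b_dimin)
```

With a fixed stepsize $\alpha$ the bound levels off near $\alpha LM_d/(2l)$, whereas a diminishing stepsize keeps driving it down, which is why the two limits above govern convergence.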

3 Stochastic gradient-free descent with momentum

In the community working on training DNNs, stochastic gradient methods with momentum are very popular because of their practical performance [7, 16]. We now add a momentum term to our gradient-free methods. In particular, we provide a theoretical analysis of the inclusion of momentum in stochastic settings, and show that the momentum term introduces extra bias but reduces the variance of the stochastic directions.

3.1 Methods

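Gradient-free methods with momentum are procedures in which each step is chosen as a weighted average of all historical stochastic directions. Specifically, with an initial point $x_1$, these methods are characterized by the iteration

 $$x_{k+1} \leftarrow x_k + \alpha_k\frac{v_k}{\sum_{j=1}^k\gamma^{k-j}}, \tag{23}$$

where $\gamma$ is a fixed scalar and the momentum term $v_k$ is recursively defined, starting from $v_0 = 0$, as

 $$v_k = \gamma v_{k-1} + \frac{1}{\alpha_k}s(x_k,\xi_k,\alpha_k,\zeta_k);$$

or equivalently,

 $$v_k = \sum_{j=1}^k\frac{\gamma^{k-j}}{\alpha_j}s(x_j,\xi_j,\alpha_j,\zeta_j).$$

Thus, these methods can be rewritten as the iteration

 $$x_{k+1} \leftarrow x_k + \alpha_km_k, \tag{24}$$

with the momentum direction

 $$m_k = \frac{v_k}{\sum_{j=1}^k\gamma^{k-j}} = \frac{1}{\sum_{i=1}^k\gamma^{k-i}}\sum_{j=1}^k\frac{\gamma^{k-j}}{\alpha_j}s(x_k+\delta_j,\xi_j,\alpha_j,\zeta_j), \tag{25}$$

where $\delta_j = x_j - x_k$ for every $j\in\{1,\cdots,k\}$.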
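The recursive and the explicit form of the momentum direction can be compared numerically. In the sketch below the stochastic steps are replaced by stand-in random vectors, since only the algebra of the weighting is being checked; the value `gamma = 0.9`, the stepsize sequence, and the function names are illustrative assumptions.

```python
import random

def momentum_directions(steps, alphas, gamma):
    """Return m_k = v_k / sum_j gamma^(k-j), with v_k = gamma * v_{k-1} + s_k / alpha_k, v_0 = 0."""
    d = len(steps[0])
    v = [0.0] * d
    weight = 0.0  # running value of sum_{j <= k} gamma^(k-j)
    out = []
    for s, a in zip(steps, alphas):
        v = [gamma * vi + si / a for vi, si in zip(v, s)]
        weight = gamma * weight + 1.0
        out.append([vi / weight for vi in v])
    return out

def momentum_direction_explicit(steps, alphas, gamma, k):
    """Eq. 25 evaluated directly as a weighted average of the first k scaled steps."""
    d = len(steps[0])
    weights = [gamma ** (k - j) for j in range(1, k + 1)]
    acc = [0.0] * d
    for w, s, a in zip(weights, steps, alphas):
        acc = [ai + w * si / a for ai, si in zip(acc, s)]
    return [ai / sum(weights) for ai in acc]

rng = random.Random(0)
alphas = [1.0 / (k + 1) for k in range(1, 21)]
# Stand-in steps of magnitude O(alpha_k); real steps would come from Eq. 10.
steps = [[rng.uniform(-1.0, 1.0) * a for _ in range(3)] for a in alphas]

m = momentum_directions(steps, alphas, gamma=0.9)
m_explicit = momentum_direction_explicit(steps, alphas, gamma=0.9, k=20)
gap = max(abs(u - w) for u, w in zip(m[-1], m_explicit))
print("difference between recursive and explicit m_20:", gap)
```

The recursion needs only the previous momentum term and a running normalizer, so the weighted average over the whole history in Eq. 25 never has to be stored.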

3.2 Expectation analysis

Theorem 3.1

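Under the Lipschitz-gradient assumption of Section 2.1 and the assumption on random directions in Section 2.3, suppose that the sequence of iterates $\{x_k\}$ is generated by Eq. 24 with a stepsize sequence $\{\alpha_k\}$ and a fixed scalar $\gamma$ satisfying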

 ⎧⎪ ⎪ ⎪⎨⎪ ⎪ ⎪⎩1∑ki=1γk−i∑kj=1γk−j∥δj∥2αk⩽Dγ2<∞,1∑ki=1γk−i∑kj=1γk−jαjαk⩽τγ<∞, (26)

where