## 1 Introduction

In this paper, we develop and analyze stochastic *gradient-free* descent (SGFD) methods for solving the following class of stochastic optimization problems:

(1) |

where the real-valued function is defined by

(2) |

and

be a collection of real-valued functions with a given probability distribution

over the index set. In the context of machine learning applications,

is often treated as the loss function of a prediction function

incurred by the parameter vector

with respect to the randomly selected sample from a sample set , i.e., ; accordingly, is also treated as the empirical risk given a parameter vector with respect to the probability distribution, which is usually unknown. The currently popular methodology for such problems is the stochastic gradient (SG) method [19, 18, 3, 9, 14]. Specifically, with an initial point , these methods are characterized by the iteration

(3) |

where is the stepsize and is the stochastic gradient defined by

(4) |

which is an unbiased estimator of the so-called full gradient [15, 4].

The SG method was originally developed by Robbins and Monro [13] for smooth stochastic approximation problems. It has convergence guarantees [9, 19, 4] and has achieved extensive empirical success in large-scale convex and nonconvex stochastic optimization [18, 3, 14, 5, 6]. However, there are still notable difficulties with the SG method [1], and some of them are related to the gradient itself. For example, vanishing and exploding gradients can arise in training artificial neural networks [2, 11]; moreover, the gradient is sometimes very difficult or even impossible to obtain.

From this starting point, this work attempts to establish a method that avoids direct gradient evaluations but retains most of the main advantages of gradient-based methods, such as the sublinear convergence rate for strongly convex objectives with Lipschitz gradients [4] and the global convergence for twice differentiable objectives [10, 4]. This goal is achieved by establishing an asymptotically unbiased estimator of the gradient. Such an approach is referred to as a gradient-free method because gradients are not evaluated or applied directly. However, one will see that the convergence analysis of gradient-free methods stays within the analysis framework of stochastic gradient methods given in [4]. Moreover, we provide a theoretical analysis of the inclusion of momentum in stochastic settings, which shows that the momentum term introduces extra bias but reduces the variance of the stochastic directions.

The remainder of the paper is organized as follows. The next section introduces the stochastic gradient-free descent methods. Gradient-free methods with momentum are then discussed in detail in Section 3. In Section 4, we state our main theorems guaranteeing convergence for both strongly convex objectives with Lipschitz gradients and possibly nonconvex twice differentiable objectives. Finally, we draw some conclusions in Section 5.

## 2 Stochastic gradient-free descent

The fundamental idea of this work is not to evaluate and apply gradients directly but to learn information about gradients indirectly through *stochastic directions* and the corresponding output feedback of the objective function. In the following, we first describe the gradient-free method and then analyze its convergence properties. Throughout, we consider differentiable objective functions with Lipschitz continuous gradients.

### 2.1 Assumptions of objectives

First, let us begin with a basic assumption on the smoothness of the objective function. Such an assumption is essential for the convergence analyses of our methods, as well as of most gradient-based methods [4].

[Lipschitz-continuous gradients] The objective function is continuously differentiable and its gradient function is Lipschitz continuous with Lipschitz constant , i.e.,

The assumption of Section 2.1 ensures that the gradient of the objective does not change arbitrarily quickly with respect to the parameter vector. As an important consequence of this assumption, we note that

(5) |

This inequality comes from

and . Moreover, it is trivial if is twice continuously differentiable with for every . From this inequality, we can get a fundamental lemma for any iteration based on random steps, which is a slight generalization of Lemma 4.2 in [4].
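For readers who want the omitted step spelled out, the standard argument behind a bound of this type can be sketched as follows. The symbols $F$, $L$, $x$, and $y$ are notational assumptions made here, and the display is the textbook quadratic upper bound implied by a Lipschitz continuous gradient, which may differ in form from Eq. 5.

```latex
\begin{align*}
F(y) - F(x) - \nabla F(x)^{\mathsf T}(y-x)
  &= \int_0^1 \bigl(\nabla F(x+t(y-x)) - \nabla F(x)\bigr)^{\mathsf T}(y-x)\,\mathrm{d}t \\
  &\le \int_0^1 L\,t\,\|y-x\|_2^2\,\mathrm{d}t
   = \frac{L}{2}\,\|y-x\|_2^2 .
\end{align*}
```

The first line is the fundamental theorem of calculus applied along the segment from $x$ to $y$; the second uses the Cauchy-Schwarz inequality together with the Lipschitz condition.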

###### Lemma 1

Under Section 2.1, if for every , is any random vector independent of and is a stochastic step that depends on , then the iteration satisfies the following inequality

where the variance of is defined as

(6) |

###### Proof

By Section 2.1, the iteration satisfies

Noting that is independent of and taking expectations in these inequalities with respect to the distribution of , we obtain

Recalling Eq. 6, we finally get the desired bound.

Regardless of the states before , the expected decrease in the objective function yielded by the th stochastic step could be bounded above by a quantity involving (i) a positive definite quadratic form in the expectation of , say,

(7) |

and (ii) the variance of . Hence, this lemma shows that a bound on the expected decrease can be obtained by analyzing the expectation and variance of the step .

[Strong convexity] The objective function is strongly convex in that there exists a constant such that

(8) |

Hence, has a unique minimizer, denoted as with .

Notice that for any given , the quadratic model has the unique minimizer with ; together with Eq. 8, one obtains

that is, for a given point , the gap between the value of the objective and its minimum can be bounded by the squared $\ell_2$-norm of the gradient of the objective:

(9) |

This inequality is usually referred to as the Polyak-Łojasiewicz inequality. It is a sufficient condition for gradient descent to achieve a linear convergence rate and was originally introduced by Polyak [12]; it is also a special case of the Łojasiewicz inequality proposed in the same year, which gives an upper bound for the distance of a point to the nearest zero of a given real analytic function [8].
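The step from strong convexity to a bound of this kind can be written out as follows. The constant $c$, the minimizer $x_*$, and the function name $F$ are assumed notation, and the display is the textbook derivation rather than a quotation of Eqs. 8 and 9.

```latex
\begin{align*}
F(x_*) &\ge \min_{y}\Bigl\{F(x) + \nabla F(x)^{\mathsf T}(y-x) + \tfrac{c}{2}\,\|y-x\|_2^2\Bigr\}
        = F(x) - \frac{1}{2c}\,\|\nabla F(x)\|_2^2, \\
\text{hence}\quad 2c\,\bigl(F(x) - F(x_*)\bigr) &\le \|\nabla F(x)\|_2^2 .
\end{align*}
```

The minimum of the quadratic model is attained at $y = x - \tfrac{1}{c}\nabla F(x)$, which gives the closed-form value on the right-hand side of the first line.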

Under the Lipschitz-gradient and strong-convexity assumptions of Section 2.1, it is easy to see that . Furthermore, if is twice continuously differentiable, then these two assumptions imply that for every , it holds that .

### 2.2 Methods

We now define our SGFD method as Algorithm 1. The random vector here is referred to as the stochastic direction. Much like the stochastic gradient method [4], the algorithm presumes that three computational tools exist: (i) a mechanism for generating a realization of the random variables and (with or representing a sequence of jointly independent random variables); (ii) given an iteration number , a mechanism for computing a scalar stepsize ; and (iii) given an iterate and the realizations of and , a mechanism for computing a stochastic step .

We consider the following three choices of the stochastic step :

(10) |

where the values of the random variables and need only be viewed as seeds for generating a stochastic step, and both and are independent and identically distributed for every . The three possible choices in Eq. 10 are referred to as the stochastic gradient-free method, the quasi mini-batch gradient-free method, and the mini-batch gradient-free method, respectively. It follows from Eqs. 10 and 2 that

(11) |

which is referred to as an *average descent direction* with respect to the stepsize and the distribution of . We use

to denote an expected value taken with respect to the joint distribution of all random variables, that is, the total expectation operator can be defined as

Correspondingly, when the objective function itself can be evaluated easily, we also consider the following two choices of the stochastic step:

(12) |

which can be seen as special cases of Eq. 10.
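To make the idea of a step built only from function values concrete, here is a minimal sketch. It uses a generic random-direction finite-difference estimator, which is an assumption and not necessarily the form given in Eq. 10 or Eq. 12; the names `random_direction_step`, `gradient_free_descent`, the smoothing parameter `h`, the constant stepsize, and the quadratic test objective are all hypothetical.

```python
import numpy as np

def random_direction_step(f, x, h, rng):
    """Estimate a descent direction from two function values.

    Assumed form (not taken from Eq. 10): ((f(x + h v) - f(x)) / h) * v,
    with v drawn from the standard normal distribution.
    """
    v = rng.standard_normal(x.shape)
    return (f(x + h * v) - f(x)) / h * v

def gradient_free_descent(f, x0, stepsize=0.05, h=1e-3, n_iters=500, seed=0):
    """Iterate x <- x - stepsize * step without ever evaluating a gradient."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_iters):
        x -= stepsize * random_direction_step(f, x, h, rng)
    return x

def quadratic(x):
    # Hypothetical strongly convex test objective with minimizer at (1, ..., 1).
    return 0.5 * np.sum((x - 1.0) ** 2)

x_final = gradient_free_descent(quadratic, np.zeros(5))
print(np.linalg.norm(x_final - 1.0))  # distance to the minimizer
```

A constant stepsize is used only to keep the sketch short; a diminishing stepsize sequence would be the natural choice when the objective is noisy.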

###### Lemma 2

Under Section 2.1, the expectations of the stochastic steps in Eq. 10 satisfy the following: for every and , there is a , which depends on and for every , such that

### 2.3 Distribution of random directions

We now formalize an assumption on the distribution of the random directions, which is an important basis for the subsequent analysis. The -dimensional random vectors are independent and identically distributed and, for every , satisfy: (i) the mean of each component is zero, i.e.,

(13) |

(ii) the covariance matrix is a unit matrix, i.e.,

(14) |

and (iii) every component is bounded or has a finite nonzero fourth moment, i.e., for all , it holds that

(15) |

where is the th element in vector .

One of the typical choices for the distribution of is, of course, the -dimensional standard normal distribution with zero mean and unit covariance matrix, each component of which is unbounded but has a finite fourth moment, say, for all ; another typical choice is the -dimensional uniform distribution on , each component of which is bounded and has a finite fourth moment.

Under Section 2.3 alone, we obtain the following lemma. This result is essential for the convergence analyses of all our methods. In fact, although not so intuitive, this lemma is the direct source of the basic idea of this work.

###### Lemma 3

Under Section 2.3, for every vector independent of , it follows that

(16) |

and

(17) |

where are independent and identically distributed,

, and is the fourth moment of .

###### Proof

Secondly, from Section 2.3, if every component of is bounded, i.e., , then one obtains

(18) |

or, if every component of has a finite nonzero fourth moment , then according to the Cauchy-Schwarz inequality, it follows that

(19) |

then it follows from Eqs. 19 and 18 that

and further, one also obtains

and the proof is complete.
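A quick numerical check can make the role of the zero-mean, unit-covariance conditions more tangible. The sketch below tests the identity $\mathbb{E}[(v^{\mathsf T} g)\,v] = g$, which follows from those two conditions alone and appears to be the kind of statement Lemma 3 builds on; the exact form of Eqs. 16 and 17 is not reproduced here. The function names, the test vector `g`, and the interval $[-\sqrt{3}, \sqrt{3}]$ for the uniform case (chosen so that each component has unit variance) are assumptions.

```python
import numpy as np

def directional_average(g, sampler, n_samples, rng):
    """Monte Carlo estimate of E[(v . g) v] for directions v drawn by `sampler`."""
    v = sampler(rng, (n_samples, g.size))      # one direction per row
    return ((v @ g)[:, None] * v).mean(axis=0)

def normal_sampler(rng, shape):
    # Standard normal components: zero mean, unit variance, unbounded.
    return rng.standard_normal(shape)

def uniform_sampler(rng, shape):
    # Uniform on [-sqrt(3), sqrt(3)]: zero mean, unit variance, bounded.
    return rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=shape)

rng = np.random.default_rng(0)
g = np.array([1.0, -2.0, 0.5, 3.0, 0.0])       # hypothetical fixed vector
est_normal = directional_average(g, normal_sampler, 200_000, rng)
est_uniform = directional_average(g, uniform_sampler, 200_000, rng)

print(np.abs(est_normal - g).max(), np.abs(est_uniform - g).max())
```

Both estimates should agree with `g` up to Monte Carlo error, regardless of which of the two distributions generates the directions.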

### 2.4 Expectation analysis

The important role of Section 2.3 is to ensure that the stochastic step is an asymptotically unbiased estimator of the gradient. It also allows us to bound the quadratic form Eq. 7 in the expectation of the stochastic directions.

###### Theorem 2.1

Under Sections 2.3 and 2.1, the stochastic steps of SGFD (Algorithm 1) satisfy the asymptotic unbiasedness

(20) |

and for every , it follows that

(21) |

(22) |

where the constant comes from Lemma 3.

### 2.5 Variance assumption

The objective function and SGFD (Algorithm 1) satisfy the following condition: for all , there exist scalars and such that

###### Lemma 4

Under Sections 2.5, 2.3 and 2.1, suppose that the SGFD method (Algorithm 1) is run with for every . Then

where , , and the constant comes from Lemma 3.

###### Proof

By Eq. 21 and , it follows that

together with the arithmetic mean-geometric mean (AM-GM) inequality, i.e.,

one obtains

Similarly, by Eq. 22 and the AM-GM inequality, it holds that

Finally, by Lemma 1 and Section 2.5, the iterates generated by SGFD satisfy

and the proof is complete.

### 2.6 Average behavior of iterations

Here we mainly focus on strongly convex objective functions. With Lemma 4 in hand, it is easy to analyze the average behavior of the SGFD iterations for such objectives.

###### Theorem 2.2

Under the assumptions of Sections 2.1, 2.3 and 2.5, suppose that the SGFD method (Algorithm 1) is run with for every . Then

###### Proof

Therefore, under strong convexity, the convergence of the SGFD methods is closely related to the limits

Before we return to this issue in Section 4, we shall consider the SGFD methods with momentum in the next section.

## 3 Stochastic gradient-free descent with momentum

In the community working on training DNNs, stochastic gradient methods with momentum are very popular because of their practical performance [7, 16]. We now add a momentum term to our gradient-free methods. In particular, we provide a theoretical analysis of the inclusion of momentum in stochastic settings and show that the momentum term introduces extra bias but reduces the variance of the stochastic directions.

### 3.1 Methods

Gradient-free methods with momentum are procedures in which each step is chosen as a weighted average of all historical stochastic directions. Specifically, with an initial point , these methods are characterized by the iteration

(23) |

where and the momentum term is recursively defined as

or equivalently,

Thus, these methods can be rewritten by the iteration

(24) |

with the momentum direction

(25) |

where for every .
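The equivalence between a recursive momentum term and an explicit weighted combination of all past stochastic directions can be checked numerically. The sketch below assumes an exponential-moving-average recursion, `m_k = beta * m_{k-1} + (1 - beta) * s_k` with `m` initialized to zero; this specific form, the function names, and the value `beta = 0.9` are assumptions and may differ from the recursion behind Eqs. 23 to 25.

```python
import numpy as np

def momentum_recursive(steps, beta):
    """Momentum directions from the assumed recursion m_k = beta*m_{k-1} + (1-beta)*s_k."""
    m = np.zeros_like(steps[0])
    out = []
    for s in steps:
        m = beta * m + (1.0 - beta) * s
        out.append(m)
    return np.array(out)

def momentum_explicit(steps, beta):
    """Same directions written as weighted sums over all historical steps."""
    out = []
    for k in range(len(steps)):
        # Step i (0 <= i <= k) receives weight (1 - beta) * beta**(k - i).
        weights = (1.0 - beta) * beta ** np.arange(k, -1, -1)
        out.append(weights @ steps[: k + 1])
    return np.array(out)

rng = np.random.default_rng(0)
steps = rng.standard_normal((20, 3))   # 20 hypothetical stochastic directions in R^3
recursive = momentum_recursive(steps, 0.9)
explicit = momentum_explicit(steps, 0.9)

print(np.allclose(recursive, explicit))
```

Under this assumed recursion the weights decay geometrically with the age of a direction, which is the mechanism by which averaging can trade additional bias for lower variance.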

### 3.2 Expectation analysis

###### Theorem 3.1

Under Sections 2.3 and 2.1, suppose that the sequence of iterates is generated by Eq. 24 with a stepsize sequence and a fixed scalar satisfying

(26) |

where
