Analysis of gradient descent methods with non-diminishing, bounded errors

04/01/2016 ∙ by Arunselvan Ramaswamy, et al. ∙ Indian Institute of Science

The main aim of this paper is to provide an analysis of gradient descent (GD) algorithms with gradient errors that do not necessarily vanish asymptotically. In particular, sufficient conditions are presented for both stability (almost sure boundedness of the iterates) and convergence of GD with bounded, (possibly) non-diminishing gradient errors. In addition to ensuring stability, such an algorithm is shown to converge to a small neighborhood of the minimum set, which depends on the gradient errors. It is worth noting that the main result of this paper can be used to show that GD with asymptotically vanishing errors indeed converges to the minimum set. The results presented herein are not only more general when compared to previous results, but our analysis of GD with errors is, to the best of our knowledge, new to the literature. Our work extends the contributions of Mangasarian & Solodov, Bertsekas & Tsitsiklis and Tadić & Doucet. Using our framework, a simple yet effective implementation of GD using simultaneous perturbation stochastic approximation (SPSA), with constant sensitivity parameters, is presented. Another important improvement over many previous results is that no `additional' restrictions are imposed on the step-sizes. In machine learning applications, where step-sizes are related to learning rates, our assumptions, unlike those of other papers, do not affect these learning rates. Finally, we present experimental results to validate our theory.


1 Introduction

Let us suppose that we are interested in finding a minimum (local/global) of a continuously differentiable function $f : \mathbb{R}^d \to \mathbb{R}$. The following gradient descent (GD) method is often employed to find such a minimum:

$x_{n+1} = x_n - a(n)\,\nabla f(x_n). \qquad (1)$

In the above equation, $\{a(n)\}_{n \ge 0}$ is the given step-size sequence and $\nabla f : \mathbb{R}^d \to \mathbb{R}^d$ is a continuous map; the precise conditions imposed on the step-sizes and on $\nabla f$ are listed in Section 3.1.
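For concreteness, the following minimal sketch implements recursion (1); the quadratic objective, the step-size schedule $a(n) = 1/(n+1)$ and the iteration count are illustrative choices, not prescribed by the paper.

```python
import numpy as np

def gradient_descent(grad_f, x0, n_iters=1000):
    """Plain GD, recursion (1): x_{n+1} = x_n - a(n) * grad_f(x_n)."""
    x = np.asarray(x0, dtype=float)
    for n in range(n_iters):
        a_n = 1.0 / (n + 1)  # illustrative step-size sequence
        x = x - a_n * grad_f(x)
    return x

# Example: minimize f(x) = ||x - 1||^2, whose gradient is 2 (x - 1).
x_min = gradient_descent(lambda x: 2.0 * (x - 1.0), x0=np.zeros(3))
print(x_min)  # approximately [1, 1, 1]
```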

GD is a popular tool for implementing many machine learning algorithms. For example, the backpropagation algorithm for training neural networks employs GD due to its effectiveness and ease of implementation.

When implementing (1), one often uses gradient estimators such as the Kiefer-Wolfowitz estimator [8], simultaneous perturbation stochastic approximation (SPSA) [10], etc., to obtain estimates of the true gradient at each stage, which in turn results in estimation errors ($\epsilon_n$ in (2)). This is particularly true when the form of $f$ or $\nabla f$ is unknown. Previously in the literature, convergence of GD with errors was studied in [5]. However, their analysis requires the errors to go to zero at the rate of the step-size (i.e., to vanish asymptotically at a prescribed rate). Such assumptions are difficult to enforce and may adversely affect the learning rate when employed to implement machine learning algorithms; see Chapter 4.4 of [6]. In this paper, we present sufficient conditions for both stability (almost sure boundedness) and convergence (to a small neighborhood of the minimum set) of GD with bounded errors, for which the recursion is given by

$x_{n+1} = x_n - a(n)\left(\nabla f(x_n) + \epsilon_n\right). \qquad (2)$

In the above equation, $\epsilon_n$ is the estimation error at stage $n$ such that $\|\epsilon_n\| \le \epsilon$ (a.s. in the case of stochastic errors) for a fixed $\epsilon > 0$. As an example, consider the problem of estimating the average waiting time of a customer in a queue. The objective function $f$ for this problem has the form $f(x) = \mathbb{E}[W(x)]$, where $W(x)$ is the “waiting time” random variable with distribution $F_x$, with $x$ being the underlying parameter (say the arrival or the service rate). In order to define $f$ at every $x$, one would need to know the entire family of distributions $\{F_x\}$ exactly. In such scenarios, one often works with approximate definitions of $f$, which in turn lead to approximate gradients, i.e., gradients with errors. More generally, the gradient errors could be inherent to the problem at hand or due to extraneous noise. In such cases, there is no reason to believe that these errors will vanish asymptotically. To the best of our knowledge, this is the first analysis of GD with biased/unbiased, stochastic/deterministic errors that are not necessarily diminishing, and that imposes no ‘additional’ restrictions on step-sizes beyond the usual standard assumptions; see (A2) in Section 3.1.
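A minimal sketch of recursion (2) is given below; the error at each stage is drawn from the $\epsilon$-ball around the true gradient. The objective, the error model and the parameter values are illustrative assumptions, not part of the paper.

```python
import numpy as np

def gd_with_bounded_errors(grad_f, x0, eps=0.1, n_iters=5000, seed=0):
    """GD with bounded errors, recursion (2):
    x_{n+1} = x_n - a(n) * (grad_f(x_n) + e_n), with ||e_n|| <= eps."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for n in range(n_iters):
        a_n = 1.0 / (n + 1)                               # standard step-size, cf. (A2)
        e_n = rng.normal(size=x.shape)
        e_n *= eps * rng.uniform() / np.linalg.norm(e_n)  # scale so that ||e_n|| <= eps
        x = x - a_n * (grad_f(x) + e_n)
    return x

# With non-vanishing errors the iterates settle in a small neighborhood of the
# minimum set rather than at the exact minimum (cf. Theorem 2).
x_final = gd_with_bounded_errors(lambda x: 2.0 * (x - 1.0), x0=np.zeros(3), eps=0.1)
print(np.linalg.norm(x_final - 1.0))  # small, comparable to eps
```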

Our assumptions, see Section 3.1, not only guarantee stability but also guarantee convergence of the algorithm to a small neighborhood of the minimum set, where the neighborhood is a function of the gradient errors. If $\epsilon_n \to 0$ as $n \to \infty$, then it follows from our main result (Theorem 2) that the algorithm converges to an arbitrarily small neighborhood of the minimum set; in other words, the algorithm indeed converges to the minimum set. It may be noted that we do not impose any restrictions on the noise sequence $\{\epsilon_n\}$, except that $\|\epsilon_n\| \le \epsilon$ almost surely for all $n$, for some fixed $\epsilon > 0$. Our analysis uses techniques developed in the field of viability theory by [1], [2] and [3]. Experimental results supporting the analyses in this paper are presented in Section 5.

1.1 Our contributions

(1) Previous literature such as [5] requires $\epsilon_n \to 0$ as $n \to \infty$ for its analysis to work. Further, both [5] and [9] provide conditions that guarantee one of two things: the iterate sequence either diverges almost surely or converges to the minimum set almost surely. On the other hand, we only require $\|\epsilon_n\| \le \epsilon$, where $\epsilon > 0$ is fixed a priori. Also, we present conditions under which GD with bounded errors is stable (bounded almost surely) and converges to an arbitrarily small neighborhood of the minimum set almost surely. Note that our analysis works regardless of whether or not $\epsilon_n$ tends to zero. For more detailed comparisons with [5] and [9], see Section 3.2.
(2) The analyses presented herein go through even when the gradient errors are only “asymptotically bounded” almost surely, i.e., $\|\epsilon_n\| \le \epsilon$ for all $n \ge N$ almost surely. Here $N$ may be sample path dependent.
(3) Previously, convergence analysis of GD required severe restrictions on the step-sizes, see [5], [10]. In our paper we do not impose any such restrictions on the step-sizes. See Section 3.2 (specifically points (1) and (3)) for more details.
(4) Informally, the main result of our paper, Theorem 2, states the following. One wishes to run GD with gradient errors that are not guaranteed to vanish over time. As a consequence of allowing non-diminishing errors, we show the following: there exists $\epsilon_0 > 0$ such that the iterates are stable and converge to a $\delta$-neighborhood of the minimum set ($\delta$ being chosen by the simulator) as long as the error bound $\epsilon$ satisfies $\epsilon \le \epsilon_0$.
(5) In Section 4.2 we discuss how our framework can be exploited to undertake convenient yet effective implementations of GD. Specifically, we present an implementation using SPSA, although other implementations can be similarly undertaken; a sketch of this idea is given below.
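The sketch below illustrates the kind of implementation meant here: a two-sided SPSA gradient estimate with a constant sensitivity parameter $c$, plugged into recursion (2). The function names, the Rademacher perturbations and the value $c = 0.1$ are illustrative assumptions; the paper's own implementation is described in Section 4.2.

```python
import numpy as np

def spsa_gradient(f, x, c, rng):
    """Two-sided SPSA estimate of grad f(x) with a *constant* sensitivity parameter c.
    Keeping c fixed (instead of c_n -> 0) leaves a bounded, non-diminishing
    estimation error, which is exactly the setting analyzed in this paper."""
    delta = rng.choice([-1.0, 1.0], size=x.shape)  # Rademacher perturbation
    return (f(x + c * delta) - f(x - c * delta)) / (2.0 * c) * (1.0 / delta)

def gd_spsa(f, x0, n_iters=5000, c=0.1, seed=0):
    """GD driven by SPSA gradient estimates, i.e., recursion (2) with errors
    induced by the constant-sensitivity estimator above."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for n in range(n_iters):
        a_n = 1.0 / (n + 1)
        x = x - a_n * spsa_gradient(f, x, c, rng)
    return x

# Example: f(x) = ||x - 1||^2, minimized using only function evaluations.
print(gd_spsa(lambda x: float(np.sum((x - 1.0) ** 2)), x0=np.zeros(3)))
```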

2 Definitions used in this paper

[Minimum set of a function] This set consists of all global and local minima of the given function.
[Upper-semicontinuous map] We say that a set-valued map $H : \mathbb{R}^d \to \{\text{subsets of } \mathbb{R}^d\}$ is upper-semicontinuous if, given sequences $\{x_n\}$ (in $\mathbb{R}^d$) and $\{y_n\}$ (in $\mathbb{R}^d$) with $x_n \to x$, $y_n \to y$ and $y_n \in H(x_n)$ for all $n$, we have $y \in H(x)$.
[Marchaud Map] A set-valued map $H : \mathbb{R}^d \to \{\text{subsets of } \mathbb{R}^d\}$ is called Marchaud if it satisfies the following properties: (i) for each $x \in \mathbb{R}^d$, $H(x)$ is convex and compact; (ii) (point-wise boundedness) for each $x \in \mathbb{R}^d$, $\sup_{y \in H(x)} \|y\| \le K(1 + \|x\|)$ for some $K > 0$; (iii) $H$ is upper-semicontinuous.
Let $H$ be a Marchaud map on $\mathbb{R}^d$. The differential inclusion (DI) given by

$\dot{x}(t) \in H(x(t)) \qquad (3)$

is guaranteed to have at least one solution that is absolutely continuous. The reader is referred to [1] for more details. We say that $\mathbf{x} \in \Sigma$ if $\mathbf{x}$ is an absolutely continuous map that satisfies (3). The set-valued semiflow $\Phi$ associated with (3) is defined on $[0, +\infty) \times \mathbb{R}^d$ as $\Phi_t(x) := \{\mathbf{x}(t) : \mathbf{x} \in \Sigma,\ \mathbf{x}(0) = x\}$. For $B \times M \subset [0, +\infty) \times \mathbb{R}^d$, define $\Phi_B(M) := \bigcup_{t \in B,\, x \in M} \Phi_t(x)$.
[Limit set of a solution] The limit set of a solution $\mathbf{x}$ with $\mathbf{x}(0) = x$ is given by $L(\mathbf{x}) := \bigcap_{t \ge 0} \overline{\mathbf{x}([t, +\infty))}$.
[Invariant set] $M \subseteq \mathbb{R}^d$ is invariant if for every $x \in M$ there exists a trajectory, $\mathbf{x}$, entirely in $M$ with $\mathbf{x}(0) = x$ and $\dot{\mathbf{x}}(t) \in H(\mathbf{x}(t))$ for all $t \ge 0$.
[Open and closed neighborhoods of a set] Let $A \subseteq \mathbb{R}^d$ and $\delta > 0$; then $d(x, A) := \inf\{\|x - a\| : a \in A\}$. We define the $\delta$-open neighborhood of $A$ by $N^\delta(A) := \{x : d(x, A) < \delta\}$. The $\delta$-closed neighborhood of $A$ is defined by $\overline{N}^\delta(A) := \{x : d(x, A) \le \delta\}$.
[Open and closed balls] The open ball of radius $r$ around the origin is represented by $B_r(0)$, while the closed ball is represented by $\overline{B}_r(0)$. In other words, $B_r(0) := \{x : \|x\| < r\}$ and $\overline{B}_r(0) := \{x : \|x\| \le r\}$.
[Internally chain transitive set] $M \subseteq \mathbb{R}^d$ is said to be internally chain transitive if $M$ is compact and for every $x, y \in M$, $\epsilon > 0$ and $T > 0$ we have the following: there exist $n \ge 1$, solutions $\Phi^1, \ldots, \Phi^n$ to the differential inclusion, points $x_1 = x, x_2, \ldots, x_{n+1} = y$ in $M$ and real numbers $t_1, \ldots, t_n$ greater than $T$ such that $\Phi^i(t_i) \in N^\epsilon(x_{i+1})$ and $\Phi^i(0) = x_i$ for $1 \le i \le n$. The sequence $(x_1, \ldots, x_{n+1})$ is called an $(\epsilon, T)$ chain in $M$ from $x$ to $y$. If the above property only holds for $x = y$, then $M$ is called chain recurrent.
[Attracting set & fundamental neighborhood] $A \subseteq \mathbb{R}^d$ is attracting if it is compact and there exists a neighborhood $U$ such that for any $\epsilon > 0$ there exists $T(\epsilon) \ge 0$ with $\Phi_{[T(\epsilon), +\infty)}(U) \subseteq N^\epsilon(A)$. Such a $U$ is called a fundamental neighborhood of $A$.
[Attractor set] An attracting set that is also invariant is called an attractor set. The basin of attraction of $A$ is given by $B(A) := \{x : \omega_\Phi(x) \subseteq A\}$, where $\omega_\Phi(x) := \bigcap_{t \ge 0} \overline{\Phi_{[t, +\infty)}(x)}$ is the limit set of $x$ under the semiflow.
[Lyapunov stable] The above set $A$ is Lyapunov stable if for all $\delta > 0$ there exists $\epsilon > 0$ such that $\Phi_{[0, +\infty)}(N^\epsilon(A)) \subseteq N^\delta(A)$.
[Upper-limit of a sequence of sets, Limsup] Let $\{K_n\}_{n \ge 1}$ be a sequence of sets in $\mathbb{R}^d$. The upper-limit of $\{K_n\}$ is given by $\limsup_{n \to \infty} K_n := \{y : \liminf_{n \to \infty} d(y, K_n) = 0\}$.
We may interpret that the lower-limit collects the limit points of $\{K_n\}$ while the upper-limit collects its accumulation points.
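As an illustrative example (not from the paper), take $K_n = \{(-1)^n\} \subset \mathbb{R}$. Then $\limsup_{n \to \infty} K_n = \{-1, +1\}$, since $\liminf_{n \to \infty} d(y, K_n) = 0$ both for $y = -1$ (along odd $n$) and for $y = +1$ (along even $n$); neither point is the limit of the whole sequence, but both are accumulation points.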

3 Assumptions and comparison to previous literature

3.1 Assumptions

Recall that GD with bounded errors is given by the following recursion:

$x_{n+1} = x_n - a(n)\left(\nabla f(x_n) + \epsilon_n\right), \qquad (4)$

where $\|\epsilon_n\| \le \epsilon$ for all $n \ge 0$, for some fixed $\epsilon > 0$. In other words, the gradient estimate at stage $n$, $\nabla f(x_n) + \epsilon_n$, belongs to the $\epsilon$-ball around the true gradient at stage $n$. Note that (4) is consistent with (2) of Section 1. Our assumptions, (A1)–(A4), are listed below.

  • (A1) $\|\epsilon_n\| \le \epsilon$ for all $n$, for some fixed $\epsilon > 0$. $\nabla f$ is a continuous function such that $\|\nabla f(x)\| \le K(1 + \|x\|)$ for all $x \in \mathbb{R}^d$, for some $K > 0$.

  • (A2) $\{a(n)\}_{n \ge 0}$ is the step-size (learning rate) sequence such that $a(n) > 0$ for all $n$, $\sum_{n} a(n) = \infty$ and $\sum_{n} a(n)^2 < \infty$ (e.g., $a(n) = 1/(n+1)$). Without loss of generality we let $a(n) \le 1$ for all $n$.

Note that $x \mapsto \nabla f(x) + \overline{B}_\epsilon(0)$ is an upper-semicontinuous map, since $\nabla f$ is continuous and point-wise bounded. For each $x \in \mathbb{R}^d$, we define $G(x) := \overline{co}\left(\nabla f(x) + \overline{B}_\epsilon(0)\right)$; see Section 2 for the definition of $\overline{B}_\epsilon(0)$. Given $A \subseteq \mathbb{R}^d$, the convex closure of $A$, denoted by $\overline{co}(A)$, is the closure of the convex hull of $A$. It is worth noting that $G(x)$ is non-empty for every $x \in \mathbb{R}^d$. Further, we show that $G$ is a Marchaud map in Lemma 1. In other words, the DI $\dot{x}(t) \in -G(x(t))$ has at least one solution that is absolutely continuous, see [1]. Here $-G(x)$ is used to denote the set $\{-y : y \in G(x)\}$.

  • (A3) The DI $\dot{x}(t) \in -G(x(t))$ has an attractor set $\mathcal{A}$ such that $\mathcal{A} \subseteq B_a(0)$ for some $a > 0$ and $\overline{B}_a(0)$ is a fundamental neighborhood of $\mathcal{A}$.

Since is compact, we have that . Let us fix the following sequence of real numbers: .

  • Let be an increasing sequence of integers such that as . Further, let and as , such that , , then .

It is worth noting that the existence of a global Lyapunov function for the above DI is sufficient to guarantee that (A3) holds. Further, (A4) is satisfied when $\nabla f$ is Lipschitz continuous.

Lemma 1.

$G$ is a Marchaud map.

Proof.

From the definition of and we have that is convex, compact and for every . It is left to show that is an upper-semicontinuous map. Let , and , for all . We need to show that . We present a proof by contradiction. Since is convex and compact, implies that there exists a linear functional on , say , such that and , for some and . Since , there exists such that for all , . In other words, for all . We use the notation to denote the set . For the sake of convenience let us denote the set by , where . We claim that for all . We prove this claim later, for now we assume that the claim is true and proceed. Pick for each . It can be shown that is norm bounded and hence contains a convergent subsequence, . Let . Since , such that , where . We choose the sequence such that for each .
We have the following: , , and , for all . It follows from assumption that . Since and for each , we have that . This contradicts the earlier conclusion that .
It remains to prove that for all . If this were not true, then such that for all . It follows that for each . Since , such that for all , . This is a contradiction. ∎

3.2 Relevance of our results

(1) Gradient algorithms with errors have been previously studied by Bertsekas and Tsitsiklis [5]. They impose the following restriction on the estimation errors: $\|\epsilon_n\| \le a(n)\left(p + q\|\nabla f(x_n)\|\right)$, where $p, q > 0$. If the iterates are stable, then this forces $\epsilon_n \to 0$. In order to satisfy the aforementioned assumption, the choice of step-sizes may be restricted, thereby affecting the learning rate (when used within the framework of a learning algorithm). In this paper we analyze the more general and practical case of bounded $\epsilon_n$ that does not necessarily go to zero. Further, none of the assumptions used in our paper imposes additional restrictions on the step-sizes, other than the standard requirements; see (A2).
(2) The main result of Bertsekas and Tsitsiklis [5] states that GD with errors either diverges almost surely or converges to the minimum set almost surely. An older study by Mangasarian and Solodov [9] shows the exact same result as [5] but for GD without estimation errors ($\epsilon_n \equiv 0$). The main results of our paper, Theorems 1 & 2, show that if the GD under consideration satisfies (A1)–(A4), then the iterates are stable (bounded almost surely). Further, the algorithm is guaranteed to converge to a given small neighborhood of the minimum set provided the estimation errors are bounded by a constant that is a function of the neighborhood size. To summarize, under the more restrictive settings of [5] and [9] the GD is not guaranteed to be stable, see the aforementioned references, while the assumptions used in our paper are less restrictive and guarantee stability in the more general setting of bounded errors. It may also be noted that $\nabla f$ is assumed to be Lipschitz continuous by [5]. This turns out to be sufficient (but not necessary) for (A1) & (A4) to be satisfied.
(3) The analysis of Spall [10] can be used to analyze a variant of GD that uses SPSA as the gradient estimator. Spall introduces a gradient sensitivity parameter $c_n$ in order to control the estimation error at stage $n$. It is assumed that $c_n \to 0$ and $\sum_n \left(a(n)/c_n\right)^2 < \infty$, see A1, Section III, [10]. Again, this restricts the choice of step-sizes and affects the learning rate. In this setting our analysis works for the more practical scenario where $c_n = c$ for all $n$, i.e., a constant, see Section 4.2.
(4) The important advancements of this paper are the following: (i) our framework is more general and practical since the errors are not required to go to zero; (ii) we provide an easily verifiable, non-restrictive set of assumptions that ensure almost sure boundedness and convergence of GD; and (iii) our assumptions (A1)–(A4) do not affect the choice of step-sizes.
(5) Tadić and Doucet [11] showed that GD with bounded non-diminishing errors converges to a small neighborhood of the minimum set. They make the following key assumption: (A) There exists , such that for every compact set and every , , where and .

Note that $m(\cdot)$ above denotes the Lebesgue measure of a set. The above assumption holds if $f$ is $d$ times differentiable, where $d$ is the dimension of the domain; see [11] for details. In comparison, we only require that the chain recurrent set of the gradient flow $\dot{x}(t) = -\nabla f(x(t))$ be a subset of the minimum set of $f$. One sufficient condition for this is given in Proposition 4 of Hurley [7].

Remark 1.

Suppose the minimum set of $f$ contains the chain recurrent set of the gradient flow $\dot{x}(t) = -\nabla f(x(t))$; then it can be shown that GD without errors ($\epsilon_n \equiv 0$ in (4)) will converge to the minimum set almost surely, see [4]. On the other hand, suppose there are chain recurrent points outside the minimum set; then GD may converge to this subset (of the chain recurrent set) outside the minimum set. In Theorem 2, we will use the upper-semicontinuity of chain recurrent sets (Theorem 3.1 of Benaïm, Hofbauer and Sorin [3]) to show that GD with errors converges to a small neighborhood of the limiting set of the “corresponding GD without errors”. In other words, GD with errors converges to a small neighborhood of the minimum set provided the corresponding GD without errors converges to the minimum set. This will trivially happen if the chain recurrent set of the gradient flow is a subset of the minimum set of $f$, which we implicitly assume is true. Suppose GD without errors does not converge to the minimum set; then it is reasonable to expect that GD with errors may not converge to a small neighborhood of the minimum set.

Suppose $f$ is continuously differentiable and its regular values (i.e., values $y$ for which $\nabla f(x) \neq 0$ whenever $f(x) = y$) are dense in $\mathbb{R}$; then the chain recurrent set of the gradient flow is a subset of the minimum set of $f$, see Proposition 4 of Hurley [7]. We implicitly assume that an assumption of this kind is satisfied.

4 Proof of stability and convergence

We use (4) to construct the linearly interpolated trajectory $\overline{x}(t)$ for $t \ge 0$. First, define $t(0) := 0$ and $t(n) := \sum_{k=0}^{n-1} a(k)$ for $n \ge 1$. Then, define $\overline{x}(t(n)) := x_n$ and, for $t \in (t(n), t(n+1))$, let $\overline{x}(t)$ be the continuous linear interpolation of $\overline{x}(t(n))$ and $\overline{x}(t(n+1))$. We also construct the following piece-wise constant trajectory: for $t \in [t(n), t(n+1))$, it takes the value $x_n$.
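The following minimal sketch illustrates this interpolation in code; the function names and the array handling are illustrative, not part of the paper.

```python
import numpy as np

def interpolated_trajectory(iterates, step_sizes):
    """Linearly interpolated trajectory: t(0) = 0, t(n) = a(0) + ... + a(n-1),
    x_bar(t(n)) = x_n, with linear interpolation in between."""
    iterates = np.asarray(iterates, dtype=float)            # shape (N, d)
    t = np.concatenate(([0.0], np.cumsum(step_sizes)))[:len(iterates)]

    def x_bar(s):
        # component-wise linear interpolation on the time grid {t(n)}
        return np.array([np.interp(s, t, iterates[:, i])
                         for i in range(iterates.shape[1])])

    return x_bar, t

def piecewise_constant(iterates, t, s):
    """Piece-wise constant variant: equals x_n for s in [t(n), t(n+1))."""
    n = np.searchsorted(t, s, side="right") - 1
    return np.asarray(iterates, dtype=float)[max(n, 0)]
```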

We need to divide the time axis $[0, \infty)$ into intervals of length $T$. Here $T$ is chosen such that any solution of the DI, started in the fundamental neighborhood of the attractor, lies within a small neighborhood of the attractor after time $T$; note that such a $T$ is independent of the initial condition, see Section 2 for more details. Dividing time is done as follows: define $T_0 := 0$ and $T_{n+1} := \min\{t(m) : t(m) \ge T_n + T\}$, $n \ge 0$. Clearly, there exists a subsequence $\{t(m_n)\}$ of $\{t(n)\}$ such that $T_n = t(m_n)$ for all $n$. In what follows we use $T_n$ and $t(m_n)$ interchangeably.

To show stability, we use a projective scheme where the iterates are projected periodically, with period , onto the closed ball of radius around the origin, . Here, the radius is given by . This projective scheme gives rise to the following rescaled trajectories and . First, we construct , : Let for some , then , where ( is defined in ). Also, let , . The ‘rescaled iterates’ are given by .

Let , be the solution (up to time ) to , with the initial condition , recall the definition of from the beginning of Section 4. Clearly, we have

(5)

We begin with a simple lemma which essentially claims that = . The proof is a direct consequence of the definition of and is hence omitted.

Lemma 2.

For all , we have , where .

It directly follows from Lemma 2 that the two families of $T$-length trajectories involved are really one and the same. When viewed as a subset of $C([0, T], \mathbb{R}^d)$, this family is equi-continuous and point-wise bounded. Further, from the Arzelà-Ascoli theorem we conclude that it is relatively compact in $C([0, T], \mathbb{R}^d)$.

Lemma 3.

Let , then any limit point of is of the form , where is a measurable function and , .

Proof.

For , define . Observe that for any , we have and , since is a Marchaud map. Since is the rescaled trajectory obtained by periodically projecting the original iterates onto a compact set, it follows that is bounded a.s. i.e., It now follows from the observation made earlier that

Thus, we may deduce that there exists a sub-sequence of , say , such that in and weakly in . From Lemma 2 it follows that in . Letting in

we get for . Since we have .

Since weakly in , there exists such that

Further, there exists such that

Let us fix , then

Since is convex and compact (Proposition 1), to show that it is enough to show Suppose this is not true and and such that . Since is norm bounded, it follows that there is a convergent sub-sequence. For convenience, assume , for some . Since and , it follows from assumption that . This leads to a contradiction. ∎

Note that in the statement of Lemma 3 we can replace ‘’ by ‘’, where is a subsequence of . Specifically we can conclude that any limit point of in , conditioned on , is of the form , where for . It should be noted that may be sample path dependent (if is stochastic then is a random variable). Recall that = (see the sentence following in Section 3.1). The following is an immediate corollary of Lemma 3.

Corollary 1.

such that , , where and is a solution (up to time ) of such that . The form of is as given by Lemma 3.

Proof.

Assume to the contrary that such that is at least away from any solution to the . It follows from Lemma 3 that there exists a subsequence of guaranteed to converge, in , to a solution of such that . This is a contradiction. ∎

Remark 2.

It is worth noting that may be sample path dependent. Since we get for all such that .

4.1 Main Results

We are now ready to prove the two main results of this paper. We begin by showing that (4) is stable (bounded a.s.), i.e., that $\sup_n \|x_n\| < \infty$ a.s. Once we show that the iterates are stable, we use the main results of Benaïm, Hofbauer and Sorin to conclude that the iterates converge to a closed, connected, internally chain transitive and invariant set of the DI $\dot{x}(t) \in -G(x(t))$.

Theorem 1.

Under assumptions (A1)–(A4), the iterates given by (4) are stable, i.e., $\sup_n \|x_n\| < \infty$ a.s. Further, they converge to a closed, connected, internally chain transitive and invariant set of the DI $\dot{x}(t) \in -G(x(t))$.

Proof.

First, we show that the iterates are stable. To do this we start by assuming the negation i.e., . Clearly, there exists such that . Recall that and that .

We have since is a solution, up to time , to the given by and . Since the rescaled trajectory is obtained by projecting onto a compact set, it follows that the trajectory is bounded. In other words, , where could be sample path dependent. Now, we observe that there exists such that all of the following happen:
(i) . [since ]
(ii) . [since and Remark 2]
(iii) . [since ]

We have (see the sentence following in Section 3.1 for more details). Let and for some . If then , else if then . We proceed assuming that