1 Introduction
Let us suppose that we are interested in finding a minimum (local/global) of a continuously differentiable function $J : \mathbb{R}^d \to \mathbb{R}$. The following gradient descent method (GD) is often employed to find such a minimum:

(1) $x_{n+1} = x_n - a(n)\nabla J(x_n)$.

In the above equation, $\{a(n)\}_{n \ge 0}$ is the given stepsize sequence, and $\nabla J$ is a continuous map such that $\|\nabla J(x)\| \le K(1 + \|x\|)$ for all $x \in \mathbb{R}^d$, for some $K > 0$.
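As a minimal illustration (purely our own toy setup, not from the paper: a quadratic objective and the stepsize choice $a(n) = 1/(n+2)$), recursion (1) can be simulated as follows:

```python
import numpy as np

def gradient_descent(grad, x0, steps=5000):
    """Iterate x_{n+1} = x_n - a(n) * grad(x_n) with stepsizes a(n) = 1/(n+2)."""
    x = np.asarray(x0, dtype=float)
    for n in range(steps):
        a_n = 1.0 / (n + 2)  # a(n) > 0, sum a(n) diverges, sum a(n)^2 converges
        x = x - a_n * grad(x)
    return x

# Toy objective J(x) = ||x - c||^2 / 2, whose gradient is x - c and minimum is at c.
c = np.array([1.0, -2.0])
x_final = gradient_descent(lambda x: x - c, x0=[5.0, 5.0])
```

With this stepsize the usual divergent-sum/square-summable conditions hold, and the iterates approach the minimum $c$.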
GD is a popular tool for implementing many machine learning algorithms. For example, the backpropagation algorithm for training neural networks employs GD due to its effectiveness and ease of implementation. When implementing (1), one often uses gradient estimators such as the Kiefer-Wolfowitz estimator [8], simultaneous perturbation stochastic approximation (SPSA) [10], etc., to obtain estimates of the true gradient at each stage, which in turn results in estimation errors ($\epsilon(n)$ in (2)). This is particularly true when the form of $J$ or $\nabla J$ is unknown. Previously in the literature, convergence of GD with errors was studied in [5]. However, their analysis required the errors to go to zero at the rate of the stepsize (i.e., to vanish asymptotically at a prescribed rate). Such assumptions are difficult to enforce and may adversely affect the learning rate when employed to implement machine learning algorithms; see Chapter 4.4 of [6]. In this paper, we present sufficient conditions for both stability (almost sure boundedness) and convergence (to a small neighborhood of the minimum set) of GD with bounded errors, for which the recursion is given by

(2) $x_{n+1} = x_n - a(n)\left(\nabla J(x_n) + \epsilon(n)\right)$.
In the above equation, $\epsilon(n)$ is the estimation error at stage $n$ such that $\|\epsilon(n)\| \le \epsilon$ (a.s. in the case of stochastic errors) for a fixed $\epsilon > 0$. As an example, consider the problem of estimating the average waiting time of a customer in a queue. The objective function $J$ for this problem has the form $J(\theta) = E[W(\theta)]$, where $W(\theta)$ is the “waiting time” random variable with distribution $F_\theta$, with $\theta$ being the underlying parameter (say the arrival or the service rate). In order to define $J$ at every $\theta$, one would need to know the entire family of distributions $\{F_\theta\}$ exactly. In such scenarios, one often works with approximate definitions of $J$, which in turn lead to approximate gradients, i.e., gradients with errors. More generally, the gradient errors could be inherent to the problem at hand or due to extraneous noise. In such cases, there is no reason to believe that these errors will vanish asymptotically. To the best of our knowledge, this is the first time an analysis is done for GD with biased/unbiased, stochastic/deterministic errors that are not necessarily diminishing, and without imposing ‘additional’ restrictions on the stepsizes over the usual standard assumptions; see (A2) in Section 3.1. Our assumptions, see Section 3.1, not only guarantee stability but also guarantee convergence of the algorithm to a small neighborhood of the minimum set, where the neighborhood size is a function of the gradient errors. If $\epsilon \to 0$, then it follows from our main result (Theorem 2) that the algorithm converges to an arbitrarily small neighborhood of the minimum set; in other words, the algorithm indeed converges to the minimum set. It may be noted that we do not impose any restrictions on the noise sequence $\{\epsilon(n)\}$, except that $\|\epsilon(n)\| \le \epsilon$ almost surely for all $n$, for some fixed $\epsilon > 0$. Our analysis uses techniques developed in the field of viability theory by [1], [2] and [3]. Experimental results supporting the analyses in this paper are presented in Section 5.
1.1 Our contributions
(1) Previous literature such as [5] requires $\epsilon(n) \to 0$ as $n \to \infty$ for its analysis to work. Further, both [5] and [9] provide conditions that guarantee one of two things: GD diverges almost surely, or it converges to the minimum set almost surely. On the other hand, we only require $\|\epsilon(n)\| \le \epsilon$, where $\epsilon > 0$ is fixed a priori. Also, we present conditions under which GD with bounded errors is stable (bounded almost surely) and converges to an arbitrarily small neighborhood of the minimum set almost surely. Note that our analysis works regardless of whether or not $\epsilon(n)$ tends to zero. For more detailed comparisons with [5] and [9], see Section 3.2.
(2) The analyses presented herein will go through even when the gradient errors are only “asymptotically bounded” almost surely, i.e., the bound $\|\epsilon(n)\| \le \epsilon$ only holds for all $n \ge N$ almost surely. Here $N$ may be sample path dependent.
(3) Previously, convergence analysis of required severe restrictions on the stepsize,
see [5], [10]. However, in our paper we do not impose any
such restrictions on the stepsize. See Section 3.2
(specifically points and ) for more details.
(4) Informally, the main result of our paper, Theorem 2, states the following.
One wishes to simulate with gradient errors that are not
guaranteed to vanish over time. As a consequence of allowing nondiminishing errors,
we show the following: There exists such that the iterates
are stable and converge to the neighborhood of the minimum set ( being chosen by the simulator)
as long as .
(5) In Section 4.2 we discuss how our framework can be exploited to undertake
convenient yet effective implementations of . Specifically, we present an implementation using ,
although other implementations can be similarly undertaken.
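The informal statement in point (4) can be checked on a toy run (our own construction, not from the paper: a quadratic objective with a constant bias of norm $\epsilon$ standing in for the non-diminishing error); the iterates settle at distance roughly $\epsilon$ from the minimum, i.e., in a small neighborhood rather than at the minimum itself:

```python
import numpy as np

c = np.array([1.0, -2.0])     # minimum of J(x) = ||x - c||^2 / 2
eps = 0.05                    # bound on the gradient error
u = np.array([1.0, 0.0])      # fixed unit vector: a worst-case constant bias

x = np.array([5.0, 5.0])
for n in range(20000):
    a_n = 1.0 / (n + 2)
    x = x - a_n * ((x - c) + eps * u)   # GD with non-diminishing bounded error

distance = np.linalg.norm(x - c)        # settles near eps, not near 0
```

Shrinking `eps` shrinks the limiting neighborhood, mirroring the role of the error bound in Theorem 2.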
2 Definitions used in this paper
[Minimum set of a function] This set consists of all global and local minima of
the given function.
[Upper-semicontinuous map] We say that is upper-semicontinuous,
if given sequences (in ) and
(in ) with
, and , ,
then .
[Marchaud Map] A set-valued map is called Marchaud if it satisfies
the following properties:
(i) for each , is convex and compact;
(ii) (pointwise boundedness) for each ,
for some ;
(iii) is upper-semicontinuous.
Let be a Marchaud map on .
The differential inclusion (DI) given by
(3) 
is guaranteed to have at least one solution that is absolutely continuous.
The reader is referred to [1] for more details.
We say that if x
is an absolutely continuous map that satisfies (3).
The set-valued semiflow
associated with (3) is defined on as:
. Let
and define
[Limit set of a solution] The limit set of a solution x
with is given by
.
[Invariant set]
is invariant if for every there exists
a trajectory, , entirely in
with , ,
for all .
[Open and closed neighborhoods of a set]
Let and , then
. We define the open neighborhood
of by . The
closed neighborhood of
is defined by .
[Open and closed balls of radius $r$] The open ball of radius $r$ around the origin is represented by $B_r(0)$, while the closed ball is represented by $\bar{B}_r(0)$. In other words, $B_r(0) = \{x \in \mathbb{R}^d : \|x\| < r\}$ and $\bar{B}_r(0) = \{x \in \mathbb{R}^d : \|x\| \le r\}$.
[Internally chain transitive set]
is said to be
internally chain transitive if is compact and for every ,
and we have the following: There exists and that
are solutions to the differential inclusion ,
points
and real numbers
greater than such that: and
for . The sequence
is called an chain in from to . If the above property only holds for all ,
then is called chain recurrent.
[Attracting set & fundamental neighborhood]
is attracting if it is compact
and there exists a neighborhood such that for any ,
with . Such a is called the fundamental neighborhood of .
[Attractor set]
An attracting set that is also invariant
is called an attractor set.
The basin
of attraction of is given by .
[Lyapunov stable] The above set is Lyapunov stable
if for all , such that
.
[Upper-limit of a sequence of sets, Limsup]
Let $\{K_n\}_{n \ge 1}$ be a sequence of sets in $\mathbb{R}^d$.
The upper-limit of $\{K_n\}$ is given by
$\limsup_{n} K_n := \{y \ : \ \liminf_{n} d(y, K_n) = 0\}$, where $d(y, K) := \inf\{\|y - z\| : z \in K\}$.
We may interpret that the lower-limit collects the limit points of $\{K_n\}$
while the upper-limit collects its accumulation points.
3 Assumptions and comparison to previous literature
3.1 Assumptions
Recall that GD with bounded errors is given by the following recursion:

(4) $x_{n+1} = x_n - a(n)\left(\nabla J(x_n) + \epsilon(n)\right)$,

where $\|\epsilon(n)\| \le \epsilon$ and $n \ge 0$. In other words, the gradient estimate at stage $n$, $\nabla J(x_n) + \epsilon(n)$, belongs to an $\epsilon$-ball around the true gradient at stage $n$. Note that (4) is consistent with (2) of Section 1. Our assumptions, (A1)-(A4), are listed below.

(A1) $\|\epsilon(n)\| \le \epsilon$ for some fixed $\epsilon > 0$; further, $\nabla J$ is a continuous function such that $\|\nabla J(x)\| \le K(1 + \|x\|)$ for all $x \in \mathbb{R}^d$, for some $K > 0$.

(A2) $\{a(n)\}_{n \ge 0}$ is the stepsize (learning rate) sequence such that: $a(n) > 0$ for all $n$, $\sum_{n} a(n) = \infty$ and $\sum_{n} a(n)^2 < \infty$. Without loss of generality we let $\sup_n a(n) \le 1$.
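As a quick numerical sanity check (our own illustration), the canonical choice $a(n) = 1/(n+1)$ meets both requirements in (A2): the partial sums of $a(n)$ grow without bound while the partial sums of $a(n)^2$ stay bounded.

```python
N = 10 ** 6
sum_a = sum(1.0 / (n + 1) for n in range(N))          # harmonic sum ~ log N: diverges
sum_a_sq = sum(1.0 / (n + 1) ** 2 for n in range(N))  # converges to pi^2 / 6 ~ 1.645
```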
Note that is an upper-semicontinuous map since is continuous and pointwise bounded. For each , we define . Define , see Section 2 for the definition of . Given , the convex closure of , denoted by , is the closure of the convex hull of . It is worth noting that is nonempty for every . Further, we show that is a Marchaud map in Lemma 1. In other words, has at least one solution that is absolutely continuous, see [1]. Here is used to denote the set .

(A3) has an attractor set such that for some and is a fundamental neighborhood of .
Since is compact, we have that . Let us fix the following sequence of real numbers: .

(A4) Let be an increasing sequence of integers such that as . Further, let and as , such that , , then .
It is worth noting that the existence of a global Lyapunov function for is sufficient to guarantee that holds. Further, is satisfied when is Lipschitz continuous.
Lemma 1.
is a Marchaud map.
Proof.
From the definition of and we have that is convex, compact
and for every .
It is left to show that is an upper-semicontinuous map.
Let , and ,
for all .
We need to show that . We present a proof by contradiction.
Since is convex and compact,
implies that there exists a linear functional on , say , such that
and , for some
and . Since , there exists
such that for all , . In other
words, for
all . We use the notation to denote the set
. For the sake of convenience let us denote the set
by , where .
We claim that
for all . We prove this claim later,
for now we assume that the claim
is true and proceed. Pick
for each . It can be shown that is norm bounded
and hence contains a convergent subsequence,
.
Let .
Since ,
such that , where .
We choose the sequence
such that for each .
We have the following: , ,
and , for all .
It follows
from assumption that . Since
and for each , we have that
. This contradicts the earlier conclusion that
.
It remains to prove that
for all . If this were not true, then
such that
for all . It follows that
for each .
Since , such that for all ,
. This is a contradiction.
∎
3.2 Relevance of our results
(1) Gradient algorithms with errors have been previously studied by
Bertsekas and Tsitsiklis [5]. They impose the following restriction on the
estimation errors:
, where . If the iterates are stable then . In order to satisfy the aforementioned assumption the choice of stepsize
may be restricted, thereby affecting the learning rate (when used within the framework of a learning algorithm).
In this paper we analyze the more general and practical case of bounded errors that do not necessarily go to zero. Further, none of the assumptions used in our paper imposes restrictions on the stepsize beyond the standard requirements; see (A2).
(2) The main result of Bertsekas and Tsitsiklis [5] states that GD with errors either diverges almost surely or converges to the minimum set almost surely.
An older study by Mangasarian and Solodov [9]
shows the exact same result as [5], but for GD without estimation errors ($\epsilon(n) = 0$). The main results of our paper, Theorems 1 & 2, show that if the GD under consideration satisfies (A1)-(A4), then the iterates are stable (bounded almost surely).
Further, the algorithm is guaranteed to converge to a given
small neighborhood of the minimum set provided the estimation errors are bounded by
a constant that is a function of the neighborhood size.
To summarize, under the more restrictive settings of [5] and [9] the GD is not guaranteed to be stable, see the aforementioned references, while the assumptions used in our paper are less restrictive and guarantee stability under the more general setting of bounded errors. It may also be noted that $\nabla J$ is assumed to be Lipschitz continuous by [5]. This turns out to be sufficient (but not necessary) for (A1) & (A4) to be satisfied.
(3) The analysis of Spall [10] can be used to analyze a variant of GD that uses SPSA as the gradient estimator. Spall introduces a gradient sensitivity parameter $c_n$ in order to control the estimation error at stage $n$. It is assumed that $c_n \to 0$ and $\sum_n \left(a(n)/c_n\right)^2 < \infty$; see A1, Section III, [10]. Again, this restricts the choice of stepsize and affects the learning rate. In this setting our analysis works for the more practical scenario where $c_n = c$ for all $n$, i.e., a constant; see Section 4.2.
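This scenario can be sketched as follows (our own toy code; the quadratic objective is purely illustrative). With a constant perturbation size `c` the two-measurement SPSA estimate has a bounded, non-vanishing per-stage error:

```python
import numpy as np

rng = np.random.default_rng(0)

def spsa_gradient(J, x, c=0.1):
    """Two-measurement SPSA estimate of grad J(x) with constant c.
    Delta is a Rademacher (+-1) perturbation; since Delta_i is +-1,
    1 / Delta_i equals Delta_i, so we multiply rather than divide."""
    Delta = rng.choice([-1.0, 1.0], size=x.shape)
    return (J(x + c * Delta) - J(x - c * Delta)) / (2.0 * c) * Delta

# For the quadratic J(x) = ||x||^2 / 2 the estimate is unbiased:
# averaging many estimates at a fixed point recovers the true gradient x.
x0 = np.array([1.0, 2.0])
avg = np.mean([spsa_gradient(lambda x: 0.5 * np.dot(x, x), x0)
               for _ in range(40000)], axis=0)
```

For non-quadratic $J$, holding `c` constant leaves a bias of order $c^2$ at every stage, which is precisely the bounded, non-diminishing error regime covered by our analysis.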
(4) The important advancements of this paper are the following: (i) our framework is more general and practical, since the errors are not required to go to zero; (ii) we provide an easily verifiable, non-restrictive set of assumptions that ensures almost sure boundedness and convergence of GD; and (iii) our assumptions (A1)-(A4) do not affect the choice of stepsize.
(5)
Tadić and Doucet [11] showed that GD with bounded
nondiminishing errors converges to a small neighborhood of the minimum set.
They make the following key assumption:
(A) There exists , such that for every compact set
and every , , where
and .
Note that is the Lebesgue measure of the set . The above assumption holds if is times differentiable, where , see [11] for details. In comparison, we only require that the chain recurrent set of be a subset of its minimum set. One sufficient condition for this is given in Proposition 4 of Hurley [7].
Remark 1.
Suppose the minimum set of $J$ contains the chain recurrent set; then it can be shown that GD without errors ($\epsilon(n) = 0$ in (4)) will converge to the minimum set almost surely, see [4]. On the other hand, suppose there are chain recurrent points outside the minimum set; then GD may converge to this subset (of the chain recurrent set) outside the minimum set. In Theorem 2, we will use the upper-semicontinuity of chain recurrent sets (Theorem 3.1 of Benaïm, Hofbauer and Sorin [3]) to show that GD with errors will converge to a small neighborhood of the limiting set of the “corresponding GD without errors”. In other words, GD with errors converges to a small neighborhood of the minimum set provided the corresponding GD without errors converges to the minimum set. This will trivially happen if the chain recurrent set is a subset of the minimum set of $J$, which we implicitly assume to be true. Suppose GD without errors does not converge to the minimum set; then it is reasonable to expect that GD with errors may not converge to a small neighborhood of the minimum set.
Suppose is continuously differentiable and its regular values (i.e., for which ) are dense in , then the chain recurrent set of is a subset of its minimum set, see Proposition 4 of Hurley [7]. We implicitly assume that an assumption of this kind is satisfied.
4 Proof of stability and convergence
We use (4) to construct the linearly interpolated trajectory, for . First, define and for . Then, define and for , is the continuous linear interpolation of and . We also construct the following piecewise constant trajectory , as follows: for , . We need to divide time, , into intervals of length , where . Note that is such that for , where denotes the solution to at time with initial condition and . Note that is independent of the initial condition , see Section 2 for more details. Dividing time is done as follows: define and , . Clearly, there exists a subsequence of such that . In what follows we use and interchangeably.
To show stability, we use a projective scheme where the iterates are projected periodically, with period , onto the closed ball of radius around the origin, . Here, the radius is given by . This projective scheme gives rise to the following rescaled trajectories and . First, we construct , : Let for some , then , where ( is defined in ). Also, let , . The ‘rescaled iterates’ are given by .
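The core operation of this projective scheme, projecting a point onto the closed ball of a given radius around the origin, can be sketched as follows (a minimal rendering; the actual construction applies this periodically, once every interval of the time division, to the interpolated trajectory):

```python
import numpy as np

def project_to_ball(x, radius):
    """Return x if ||x|| <= radius; otherwise scale x back onto the
    sphere of that radius, preserving its direction."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x
```

For example, an iterate of norm 5 with radius 1 is scaled down to unit norm while its direction is preserved, which is what keeps the rescaled trajectories pointwise bounded.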
Let , be the solution (up to time ) to , with the initial condition , recall the definition of from the beginning of Section 4. Clearly, we have
(5) 
We begin with a simple lemma which essentially claims that = . The proof is a direct consequence of the definition of and is hence omitted.
Lemma 2.
For all , we have , where .
It directly follows from Lemma 2 that = . In other words, the two families of length trajectories, and , are really one and the same. When viewed as a subset of , is equicontinuous and pointwise bounded. Further, from the Arzelà-Ascoli theorem we conclude that it is relatively compact. In other words, is relatively compact in .
Lemma 3.
Let , then any limit point of is of the form , where is a measurable function and , .
Proof.
For , define . Observe that for any , we have and , since is a Marchaud map. Since is the rescaled trajectory obtained by periodically projecting the original iterates onto a compact set, it follows that it is bounded a.s. It now follows from the observation made earlier that
Thus, we may deduce that there exists a subsequence of , say , such that in and weakly in . From Lemma 2 it follows that in . Letting in
we get for . Since we have .
Since weakly in , there exists such that
Further, there exists such that
Let us fix , then
Since is convex and compact (Proposition 1), to show that it is enough to show Suppose this is not true and and such that . Since is norm bounded, it follows that there is a convergent subsequence. For convenience, assume , for some . Since and , it follows from assumption that . This leads to a contradiction. ∎
Note that in the statement of Lemma 3 we can replace ‘’ by ‘’, where is a subsequence of . Specifically we can conclude that any limit point of in , conditioned on , is of the form , where for . It should be noted that may be sample path dependent (if is stochastic then is a random variable). Recall that = (see the sentence following in Section 3.1). The following is an immediate corollary of Lemma 3.
Corollary 1.
such that , , where and is a solution (up to time ) of such that . The form of is as given by Lemma 3.
Proof.
Assume to the contrary that such that is at least away from any solution to the . It follows from Lemma 3 that there exists a subsequence of guaranteed to converge, in , to a solution of such that . This is a contradiction. ∎
Remark 2.
It is worth noting that may be sample path dependent. Since we get for all such that .
4.1 Main Results
We are now ready to prove the two main results of this paper. We begin by showing that (4) is stable (bounded a.s.). In other words, we show that a.s. Once we show that the iterates are stable we use the main results of Benaïm, Hofbauer and Sorin to conclude that the iterates converge to a closed, connected, internally chain transitive and invariant set of .
Theorem 1.
Under assumptions , the iterates given by (4) are stable i.e., a.s. Further, they converge to a closed, connected, internally chain transitive and invariant set of .
Proof.
First, we show that the iterates are stable. To do this we start by assuming the negation, i.e., . Clearly, there exists such that . Recall that and that .
We have since is a solution, up to time ,
to the given by and .
Since the rescaled trajectory is obtained by projecting onto a compact set, it follows that
the trajectory is bounded. In other words, ,
where could be sample path dependent. Now,
we observe that there exists such that all of the following happen:
(i) . [since ]
(ii)
.
[since and Remark 2]
(iii)
. [since ]
We have (see the sentence following in Section 3.1 for more details). Let and for some . If then , else if then . We proceed assuming that