Let us suppose that we are interested in finding a minimum (local/global) of a continuously differentiable function . The following gradient descent method () is often employed to find such a minimum, due to its effectiveness and ease of implementation:
In the above equation, is the given step-size sequence and is a continuous map such that , and .
When implementing (1), one often uses gradient estimators such as the Kiefer–Wolfowitz estimator, simultaneous perturbation stochastic approximation (), etc., to obtain estimates of the true gradient at each stage, which in turn results in estimation errors ( in (2)). This is particularly true when the form of or is unknown. Previously in the literature, convergence of with errors was studied in . However, those analyses required the errors to vanish asymptotically at the rate of the step-size (i.e., at a prescribed rate). Such assumptions are difficult to enforce and may adversely affect the learning rate when employed to implement machine learning algorithms; see Chapter 4.4 of . In this paper, we present sufficient conditions for both stability (almost sure boundedness) and convergence (to a small neighborhood of the minimum set) of with bounded errors, for which the recursion is given by
In the above equation, is the estimation error at stage , such that (a.s. in the case of stochastic errors) for a fixed (positive real). As an example, consider the problem of estimating the average waiting time of a customer in a queue. The objective function for this problem has the following form: where
is the “waiting time” random variable with distribution , with being the underlying parameter (say the arrival or the service rate). In order to define at every , one would need to know the entire family of distributions, , exactly. In such scenarios, one often works with approximate definitions of , which in turn lead to approximate gradients, i.e., gradients with errors. More generally, the gradient errors could be inherent to the problem at hand or due to extraneous noise. In such cases, there is no reason to believe that these errors will vanish asymptotically. To the best of our knowledge, this is the first time an analysis is carried out for with biased/unbiased, stochastic/deterministic errors that are not necessarily diminishing, and without imposing ‘additional’ restrictions on step-sizes beyond the usual standard assumptions; see (A2) in Section 3.1.
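To make the queueing example concrete, the sketch below estimates such a gradient by two-sided finite differences on simulated waiting times. Everything here is an illustrative assumption rather than the paper's method: the exponential waiting-time model standing in for the unknown distribution, the sample size, and the perturbation `delta`. The point is that simulation noise makes the resulting gradient estimate carry a non-vanishing error.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulated_mean_wait(mu, n_samples=20000):
    # Illustrative stand-in for the objective J(mu): sample exponential
    # waiting times whose mean 1/mu depends on the service-rate parameter mu.
    return rng.exponential(1.0 / mu, size=n_samples).mean()

def noisy_gradient(mu, delta=0.05):
    # Two-sided finite difference on simulated values; the simulation
    # noise makes this a gradient estimate with a non-vanishing error.
    return (simulated_mean_wait(mu + delta)
            - simulated_mean_wait(mu - delta)) / (2 * delta)

g = noisy_gradient(2.0)
# True derivative of mu -> 1/mu at mu = 2 is -1/mu^2 = -0.25.
print(abs(g + 0.25) < 0.2)
```

With a finite sample budget the estimation error is bounded but does not go to zero over iterations, which is precisely the error regime analyzed in this paper.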
Our assumptions, see Section 3.1, not only guarantee stability but also guarantee convergence of the algorithm to a small neighborhood of the minimum set, where the neighborhood is a function of the gradient errors. If as , then it follows from our main result (Theorem 2) that the algorithm converges to an arbitrarily small neighborhood of the minimum set. In other words, the algorithm indeed converges to the minimum set. It may be noted that we do not impose any restrictions on the noise-sequence , except that almost surely for all for some fixed . Our analysis uses techniques developed in the field of viability theory by ,  and . Experimental results supporting the analyses in this paper are presented in Section 5.
1.1 Our contributions
(1) Previous literature such as  requires as for its analysis to work. Further, both  and  provide conditions that guarantee one of two outcomes: either diverges almost surely or converges to the minimum set almost surely. On the other hand, we only require , where is fixed a priori. Also, we present conditions under which with bounded errors is stable (bounded almost surely) and converges to an arbitrarily small neighborhood of the minimum set almost surely. Note that our analysis works regardless of whether or not the errors tend to zero. For more detailed comparisons with  and , see Section 3.2.
(2) The analyses presented herein will go through even when the gradient errors are “asymptotically bounded” almost surely. In other words, for all almost surely. Here may be sample path dependent.
(3) Previously, convergence analysis of required severe restrictions on the step-size, see , . However, in our paper we do not impose any such restrictions on the step-size. See Section 3.2 (specifically points and ) for more details.
(4) Informally, the main result of our paper, Theorem 2, states the following. One wishes to simulate with gradient errors that are not guaranteed to vanish over time. As a consequence of allowing non-diminishing errors, we show the following: There exists such that the iterates are stable and converge to the -neighborhood of the minimum set ( being chosen by the simulator) as long as .
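To illustrate this statement informally, consider the toy sketch below: gradient descent on a simple quadratic with a constant (worst-case) gradient error of norm `epsilon`. The quadratic objective, the fixed bias direction, and the step sizes `1/n` are all assumptions made for illustration, not part of the paper's framework.

```python
import numpy as np

def run_gd(epsilon, steps=10000):
    # GD on f(x) = ||x||^2 / 2 with a constant gradient error of norm
    # epsilon (a worst-case bounded error; the direction is arbitrary).
    x = np.array([3.0, 3.0])
    bias = epsilon * np.array([1.0, 0.0])
    for n in range(1, steps + 1):
        x = x - (1.0 / n) * (x + bias)   # noisy gradient = true grad + bias
    return np.linalg.norm(x)

# The limiting distance to the minimum set {0} scales with the error bound:
d_small, d_large = run_gd(0.01), run_gd(1.0)
print(d_small < d_large)
```

Consistent with the theorem's message, shrinking the error bound shrinks the neighborhood of the minimum set that the iterates settle in, while the iterates remain bounded for either bound.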
(5) In Section 4.2 we discuss how our framework can be exploited for convenient yet effective implementations of . Specifically, we present an implementation using , although other implementations can be carried out along similar lines.
2 Definitions used in this paper
[Minimum set of a function] This set consists of all global and local minima of the given function.
[Upper-semicontinuous map] We say that is upper-semicontinuous, if given sequences (in ) and (in ) with , and , , then .
[Marchaud Map] A set-valued map is called Marchaud if it satisfies the following properties: (i) for each , is convex and compact; (ii) (point-wise boundedness) for each , for some ; (iii) is upper-semicontinuous.
Let be a Marchaud map on . The differential inclusion (DI) given by
is guaranteed to have at least one solution that is absolutely continuous.
The reader is referred to  for more details.
We say that if x is an absolutely continuous map that satisfies (3).
The set-valued semiflow associated with (3) is defined on as: . Let and define
[Limit set of a solution] The limit set of a solution x with is given by .
[Invariant set] is invariant if for every there exists a trajectory, , entirely in with , , for all .
[Open and closed neighborhoods of a set] Let and , then . We define the -open neighborhood of by . The -closed neighborhood of is defined by .
[ and ] The open ball of radius around the origin is represented by , while the closed ball is represented by . In other words, and .
[Internally chain transitive set] is said to be internally chain transitive if is compact and for every , and we have the following: There exists and that are solutions to the differential inclusion , points and real numbers greater than such that: and for . The sequence is called an chain in from to . If the above property only holds for all , then is called chain recurrent.
[Attracting set & fundamental neighborhood] is attracting if it is compact and there exists a neighborhood such that for any , with . Such a is called the fundamental neighborhood of .
[Attractor set] An attracting set that is also invariant is called an attractor set. The basin of attraction of is given by .
[Lyapunov stable] The above set is Lyapunov stable if for all , such that .
[Upper-limit of a sequence of sets, Limsup] Let be a sequence of sets in . The upper-limit of is given by, .
We may interpret that the lower-limit collects the limit points of while the upper-limit collects its accumulation points.
3 Assumptions and comparison to previous literature
Recall that with bounded errors is given by the following recursion:
where and , . In other words, the gradient estimate at stage , , belongs to an -ball around the true gradient at stage . Note that (4) is consistent with (2) of Section 1. Our assumptions, - are listed below.
for some fixed . is a continuous function such that for all , for some .
is the step-size (learning rate) sequence such that: , and . Without loss of generality we let .
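A standard choice satisfying these conditions is a(n) = 1/(n+1). The sketch below numerically illustrates (it obviously does not prove) that the partial sums of the steps diverge while the partial sums of their squares converge:

```python
import math

# Standard step-size sequence a(n) = 1/(n+1): the partial sums grow like
# log N (divergence), while the partial sums of squares approach pi^2/6.
N = 10**6
s1 = sum(1.0 / (n + 1) for n in range(N))
s2 = sum(1.0 / (n + 1) ** 2 for n in range(N))
print(s1 > 13.0)                        # harmonic sum ~ log(10^6) + gamma
print(abs(s2 - math.pi**2 / 6) < 1e-5)  # tail of sum of squares is ~ 1/N
```

The first condition guarantees that the iterates can travel arbitrarily far if needed, while the second tames the accumulated noise; no further restriction on the step sizes is imposed in this paper.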
Note that is an upper-semicontinuous map since is continuous and point-wise bounded. For each , we define . Define , see Section 2 for the definition of . Given , the convex closure of , denoted by , is the closure of the convex hull of . It is worth noting that is non-empty for every . Further, we show that is a Marchaud map in Lemma 1. In other words, has at least one solution that is absolutely continuous, see . Here is used to denote the set .
has an attractor set such that for some and is a fundamental neighborhood of .
Since is compact, we have that . Let us fix the following sequence of real numbers: .
Let be an increasing sequence of integers such that as . Further, let and as , such that , , then .
It is worth noting that the existence of a global Lyapunov function for is sufficient to guarantee that holds. Further, is satisfied when is Lipschitz continuous.
is a Marchaud map.
From the definition of and we have that is convex and compact, and that for every . It is left to show that is an upper-semicontinuous map. Let , and , for all . We need to show that . We present a proof by contradiction.
Since is convex and compact, implies that there exists a linear functional on , say , such that and , for some and . Since , there exists such that for all , . In other words, for all . We use the notation to denote the set . For the sake of convenience, let us denote the set by , where .
We claim that for all . We prove this claim later; for now we assume that the claim is true and proceed. Pick for each . It can be shown that is norm bounded and hence contains a convergent subsequence, say , such that , where . We choose the sequence such that for each .
We have the following: , , and , for all . It follows from assumption that . Since and for each , we have that . This contradicts the earlier conclusion that .
It remains to prove that for all . If this were not true, then such that for all . It follows that for each . Since , such that for all , . This is a contradiction. ∎
3.2 Relevance of our results
(1) Gradient algorithms with errors have been previously studied by Bertsekas and Tsitsiklis . They impose the following restriction on the errors: , where . If the iterates are stable, then . In order to satisfy the aforementioned assumption, the choice of step-size may be restricted, thereby affecting the learning rate (when used within the framework of a learning algorithm). In this paper, we analyze the more general and practical case of bounded errors that do not necessarily go to zero. Further, none of the assumptions used in our paper imposes additional restrictions on the step-size beyond the standard requirements; see .
(2) The main result of Bertsekas and Tsitsiklis  states that the with errors either diverges almost surely or converges to the minimum set almost surely. An older study by Mangasarian and Solodov  shows the exact same result as  but for without estimation errors (). The main results of our paper, Theorems 1 & 2 show that if the under consideration satisfies - then the iterates are stable (bounded almost surely). Further, the algorithm is guaranteed to converge to a given small neighborhood of the minimum set provided the estimation errors are bounded by a constant that is a function of the neighborhood size. To summarize, under the more restrictive setting of  and  the is not guaranteed to be stable, see the aforementioned references, while the assumptions used in our paper are less restrictive and guarantee stability under the more general setting of bounded error . It may also be noted that is assumed to be Lipschitz continuous by . This turns out to be sufficient (but not necessary) for & to be satisfied.
(3) The analysis of Spall  can be used to analyze a variant of that uses as the gradient estimator. Spall introduces a gradient sensitivity parameter in order to control the estimation error at stage . It is assumed that and ; see A1, Section III of . Again, this restricts the choice of step-size and affects the learning rate. In this setting, our analysis works for the more practical scenario where for all , i.e., a constant; see Section 4.2.
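The sketch below illustrates this scenario: a simultaneous-perturbation (SPSA-style) gradient estimate with a fixed sensitivity parameter, driving the gradient descent recursion. The quadratic objective, the step sizes `1/n`, and the constant `c = 0.1` are assumptions made for illustration; with a fixed `c` the estimation error is bounded but, for general objectives, not vanishing.

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    # Illustrative objective; SPSA itself needs only function evaluations.
    return 0.5 * np.dot(x, x)

def spsa_gradient(x, c=0.1):
    # Simultaneous perturbation: one Rademacher direction, two evaluations.
    # For Rademacher entries, 1/delta_i = delta_i, hence the final "* delta".
    # Keeping c fixed (rather than c_n -> 0) keeps the estimation error
    # bounded but generally non-vanishing, matching this paper's setting.
    delta = rng.choice([-1.0, 1.0], size=x.shape)
    return (f(x + c * delta) - f(x - c * delta)) / (2 * c) * delta

x = np.array([4.0, -2.0])
for n in range(1, 3001):
    x = x - (1.0 / n) * spsa_gradient(x)

# The iterates remain bounded and settle near the minimum set {0}.
print(np.linalg.norm(x) < 0.5)
```

For this particular quadratic the fixed-`c` estimate happens to be unbiased; for non-quadratic objectives a constant `c` induces a persistent bias of order `c^2`, which is exactly the kind of bounded, non-diminishing error covered by the analysis here.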
(4) The important advancements of this paper are the following: (i) our framework is more general and practical, since the errors are not required to go to zero; (ii) we provide an easily verifiable, non-restrictive set of assumptions that ensures almost sure boundedness and convergence of ; and (iii) our assumptions - do not affect the choice of step-size.
(5) Tadić and Doucet  showed that GD with bounded non-diminishing errors converges to a small neighborhood of the minimum set. They make the following key assumption: (A) There exists , such that for every compact set and every , , where and .
Note that is the Lebesgue measure of the set . The above assumption holds if is times differentiable, where ; see  for details. In comparison, we only require that the chain recurrent set of be a subset of its minimum set. One sufficient condition for this is given in Proposition 4 of Hurley .
Suppose the minimum set of , , contains the chain recurrent set of ; then it can be shown that GD without errors ( in (4)) will converge to  almost surely, see . On the other hand, if there are chain recurrent points outside , GD may converge to this subset (of the chain recurrent set) outside . In Theorem 2, we will use the upper-semicontinuity of chain recurrent sets (Theorem 3.1 of Benaïm, Hofbauer and Sorin ) to show that GD with errors converges to a small neighborhood of the limiting set of the “corresponding GD without errors”. In other words, GD with errors converges to a small neighborhood of the minimum set provided the corresponding GD without errors converges to the minimum set. This trivially happens when the chain recurrent set of is a subset of the minimum set of , which we implicitly assume to be true. If GD without errors does not converge to the minimum set, then it is reasonable to expect that GD with errors may not converge to a small neighborhood of the minimum set.
Suppose is continuously differentiable and its regular values (i.e., for which ) are dense in ; then the chain recurrent set of is a subset of its minimum set, see Proposition 4 of Hurley . We implicitly assume that an assumption of this kind is satisfied.
4 Proof of stability and convergence
We use (4) to construct the linearly interpolated trajectory for . First, define and for . Then, define and for ; is the continuous linear interpolation of and . We also construct the following piecewise constant trajectory : for , .
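The construction of the interpolated trajectory can be sketched as follows; the step sizes `1/n` and the placeholder iterates are illustrative assumptions, since the actual iterates come from the recursion (4):

```python
import numpy as np

# Construct the linearly interpolated trajectory from iterates x_0, x_1, ...:
# t(0) = 0, t(n) = a(1) + ... + a(n); xbar(t(n)) = x_n, linear in between.
a = [1.0 / n for n in range(1, 101)]        # step sizes a(1), ..., a(100)
t = np.concatenate(([0.0], np.cumsum(a)))   # interpolation grid t(0..100)
x = np.cos(np.arange(101))                  # placeholder iterates (assumption)

def xbar(s):
    # Piecewise-linear interpolation of the iterates on the t-grid.
    return np.interp(s, t, x)

# The interpolated trajectory agrees with the iterates at the grid points.
print(abs(xbar(t[5]) - x[5]) < 1e-12)
```

The piecewise constant trajectory is obtained analogously by holding the value x_n on each interval [t(n), t(n+1)).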
We need to divide time, , into intervals of length , where . Note that is such that for , where denotes the solution to at time with initial condition and . Note that is independent of the initial condition ; see Section 2 for more details. Time is divided as follows: define and , . Clearly, there exists a subsequence of such that . In what follows we use and interchangeably.
To show stability, we use a projective scheme where the iterates are projected periodically, with period , onto the closed ball of radius around the origin, . Here, the radius is given by . This projective scheme gives rise to the following rescaled trajectories and . First, we construct , : Let for some , then , where ( is defined in ). Also, let , . The ‘rescaled iterates’ are given by .
Let , be the solution (up to time ) to , with the initial condition ; recall the definition of from the beginning of Section 4. Clearly, we have
We begin with a simple lemma which essentially claims that = . The proof is a direct consequence of the definition of and is hence omitted.
For all , we have , where .
It directly follows from Lemma 2 that = . In other words, the two families of -length trajectories, and , are really one and the same. When viewed as a subset of , is equicontinuous and point-wise bounded. Further, from the Arzelà–Ascoli theorem we conclude that it is relatively compact. In other words, is relatively compact in .
Let , then any limit point of is of the form , where is a measurable function and , .
For , define . Observe that for any , we have and , since is a Marchaud map. Since is the rescaled trajectory obtained by periodically projecting the original iterates onto a compact set, it follows that is bounded a.s., i.e., . It now follows from the observation made earlier that
Thus, we may deduce that there exists a sub-sequence of , say , such that in and weakly in . From Lemma 2 it follows that in . Letting in
we get for . Since we have .
Since weakly in , there exists such that
Further, there exists such that
Let us fix , then
Since is convex and compact (Proposition 1), to show that , it is enough to show . Suppose this is not true; then there exist and such that . Since is norm bounded, it follows that there is a convergent subsequence. For convenience, assume , for some . Since and , it follows from assumption that . This leads to a contradiction. ∎
Note that in the statement of Lemma 3 we can replace ‘’ by ‘’, where is a subsequence of . Specifically we can conclude that any limit point of in , conditioned on , is of the form , where for . It should be noted that may be sample path dependent (if is stochastic then is a random variable). Recall that = (see the sentence following in Section 3.1). The following is an immediate corollary of Lemma 3.
such that , , where and is a solution (up to time ) of such that . The form of is as given by Lemma 3.
Assume to the contrary that such that is at least away from any solution to the . It follows from Lemma 3 that there exists a subsequence of guaranteed to converge, in , to a solution of such that . This is a contradiction. ∎
It is worth noting that may be sample path dependent. Since we get for all such that .
4.1 Main Results
We are now ready to prove the two main results of this paper. We begin by showing that (4) is stable (bounded a.s.). In other words, we show that a.s. Once we show that the iterates are stable we use the main results of Benaïm, Hofbauer and Sorin to conclude that the iterates converge to a closed, connected, internally chain transitive and invariant set of .
Under assumptions , the iterates given by (4) are stable i.e., a.s. Further, they converge to a closed, connected, internally chain transitive and invariant set of .
First, we show that the iterates are stable. To do this, we start by assuming the negation, i.e., . Clearly, there exists such that . Recall that and that .
We have , since is a solution, up to time , to the given by and . Since the rescaled trajectory is obtained by projecting onto a compact set, it follows that the trajectory is bounded. In other words, , where could be sample path dependent. Now, we observe that there exists such that all of the following happen:
(i) . [since ]
(ii) . [since and Remark 2]
(iii) . [since ]
We have (see the sentence following in Section 3.1 for more details). Let and for some . If then , else if then . We proceed assuming that