Asymptotic Bias of Stochastic Gradient Search

08/30/2017 ∙ by Vladislav B. Tadic, et al. ∙ University of Oxford, University of Bristol

The asymptotic behavior of the stochastic gradient algorithm with a biased gradient estimator is analyzed. Relying on arguments based on dynamical systems theory (chain-recurrence) and differential geometry (the Yomdin theorem and the Lojasiewicz inequality), tight bounds on the asymptotic bias of the iterates generated by such an algorithm are derived. The obtained results hold under mild conditions and cover a broad class of high-dimensional nonlinear algorithms. Using these results, the asymptotic properties of policy-gradient (reinforcement) learning and adaptive population Monte Carlo sampling are studied. Relying on the same results, the asymptotic behavior of recursive maximum split-likelihood estimation in hidden Markov models is also analyzed.


1 Introduction

Many problems in automatic control, system identification, signal processing, machine learning, operations research and statistics can be posed as a stochastic optimization problem, i.e., as a minimization (or maximization) of an unknown objective function whose values are available only through noisy observations. Such a problem can efficiently be solved by stochastic gradient search (also known as the stochastic gradient algorithm). Stochastic gradient search is a procedure of the stochastic approximation type which iteratively approximates the minima of the objective function using a statistical or Monte Carlo estimator of the gradient (of the objective function). Often, the estimator is biased, since consistent gradient estimation is usually computationally expensive or not available at all. As a result of the biased gradient estimation, the stochastic gradient search is biased, too, i.e., the corresponding algorithm does not converge to the minima, but to their vicinity. In order to interpret the results produced by such an algorithm and to tune the algorithm’s parameters (e.g., to achieve a better bias/variance balance and a better convergence rate), knowledge of the asymptotic behavior and the asymptotic bias of the algorithm iterates is crucial.

Despite its practical and theoretical importance, the asymptotic behavior of the stochastic gradient search with biased gradient estimation (also referred to as the biased stochastic gradient search) has not attracted much attention in the literature on stochastic optimization and stochastic approximation. To the best of the present authors’ knowledge, the asymptotic properties of the biased stochastic gradient search (and the biased stochastic approximation) have only been analyzed in [14], [19], [20] and [21]. Although the results of [14], [19], [20], [21] provide a good insight into the asymptotic behavior of the biased gradient search, they hold under restrictive conditions which are very hard to verify for complex stochastic gradient algorithms. Moreover, unless the objective function is of a simple form (e.g., convex or polynomial), none of [14], [19], [20], [21] offers explicit bounds on the asymptotic bias of the algorithm iterates.

In this paper, we study the asymptotic behavior of the biased gradient search. Using arguments based on dynamical systems theory (chain-recurrence) and differential geometry (the Yomdin theorem and Lojasiewicz inequalities), we prove that the algorithm iterates converge to a vicinity of the set of minima. Relying on the same arguments, we also derive relatively tight bounds on the radius of the vicinity, i.e., on the asymptotic bias of the algorithm iterates. The obtained results hold under mild and easily verifiable conditions and cover a broad class of complex stochastic gradient algorithms. We show how the obtained results can be applied to the asymptotic analysis of policy-gradient (reinforcement) learning and adaptive population Monte Carlo sampling. We also demonstrate how the obtained results can be used to assess the asymptotic bias of the recursive maximum split-likelihood estimation in hidden Markov models.

The paper is organized as follows. The main results are presented in Section 2, where the stochastic gradient search with additive noise is analyzed. In Section 3, the asymptotic bias of the stochastic gradient search with Markovian dynamics is studied. Sections 4–6 provide examples of the results of Sections 2 and 3. In Section 4, the policy-gradient (reinforcement) learning is considered, while the adaptive population Monte Carlo sampling is analyzed in Section 5. Section 6 is devoted to the recursive maximum split-likelihood estimation in hidden Markov models. The results of Sections 2–6 are proved in Sections 7–12.

2 Main Results

In this section, the asymptotic behavior of the following algorithm is analyzed:

(1) $\theta_{n+1} = \theta_{n} - \alpha_{n}\bigl(\nabla f(\theta_{n}) + \xi_{n}\bigr), \qquad n \geq 0.$

Here, $f : \mathbb{R}^{d_{\theta}} \to \mathbb{R}$ is a differentiable function, while $\{\alpha_n\}_{n \geq 0}$ is a sequence of positive real numbers. $\theta_0$ is an $\mathbb{R}^{d_{\theta}}$-valued random variable defined on a probability space $(\Omega, \mathcal{F}, P)$, while $\{\xi_n\}_{n \geq 0}$ is an $\mathbb{R}^{d_{\theta}}$-valued stochastic process defined on the same probability space. To allow more generality, we assume that for each $n \geq 0$, $\xi_n$ is a random function of $\theta_n$. In the area of stochastic optimization, recursion (1) is known as a stochastic gradient search (or stochastic gradient algorithm). The recursion minimizes the function $f$, which is usually referred to as the objective function. The term $\nabla f(\theta_n) + \xi_n$ is interpreted as a gradient estimator (i.e., an estimator of $\nabla f(\theta_n)$), while $\xi_n$ represents the estimator’s noise (or error). For further details, see [39], [47] and references given therein.
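As a simple illustration of recursion (1), the sketch below runs the biased stochastic gradient recursion on a toy problem; the quadratic objective, the Gaussian noise and the constant bias are hypothetical choices made only to make the example concrete.

import numpy as np

def biased_sgd(grad_estimator, theta0, step_size, n_iter):
    """Run theta_{n+1} = theta_n - alpha_n * (grad f(theta_n) + xi_n),
    where grad_estimator returns a noisy, possibly biased gradient estimate."""
    theta = np.array(theta0, dtype=float)
    for n in range(n_iter):
        theta = theta - step_size(n) * grad_estimator(theta)
    return theta

# Hypothetical example: f(theta) = 0.5 * ||theta||^2, so grad f(theta) = theta.
# The estimator adds zero-mean Gaussian noise and a fixed bias of norm eta.
rng = np.random.default_rng(0)
eta = 0.05
bias = eta * np.array([1.0, 0.0])
estimator = lambda theta: theta + rng.normal(scale=0.1, size=2) + bias
theta_final = biased_sgd(estimator, theta0=[1.0, -1.0],
                         step_size=lambda n: 1.0 / (n + 1), n_iter=100000)
print(theta_final)   # settles in a vicinity of the minimizer 0 whose radius is governed by eta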

Throughout the paper, the following notation is used. $\|\cdot\|$ and $d(\cdot,\cdot)$ stand for the Euclidean norm and the distance induced by the Euclidean norm (respectively). For and , is the integer defined as

$S$ and $f(S)$ are the sets of stationary points and critical values of $f$, i.e.,

(2) $S = \{\theta \in \mathbb{R}^{d_{\theta}} : \nabla f(\theta) = 0\}, \qquad f(S) = \{f(\theta) : \theta \in S\}.$

For $\theta \in \mathbb{R}^{d_{\theta}}$, $\Phi(t;\theta)$ is the solution to the ODE $d\theta/dt = -\nabla f(\theta)$ satisfying $\Phi(0;\theta) = \theta$. $R$ denotes the set of chain-recurrent points of this ODE, i.e., $\theta \in R$ if and only if for any $\delta, t \in (0,\infty)$, there exist an integer $N \geq 1$, real numbers $t_1, \dots, t_N \geq t$ and vectors $\vartheta_1, \dots, \vartheta_N \in \mathbb{R}^{d_{\theta}}$ (each of which can depend on $\delta$, $t$, $\theta$) such that

(3) $\|\vartheta_1 - \theta\| \leq \delta, \qquad \|\vartheta_{k+1} - \Phi(t_k; \vartheta_k)\| \leq \delta, \qquad \|\theta - \Phi(t_N; \vartheta_N)\| \leq \delta$

for $1 \leq k \leq N - 1$.

Elements of $R$ can be considered as limits of slightly perturbed solutions to the ODE $d\theta/dt = -\nabla f(\theta)$. As the piecewise linear interpolation of the sequence $\{\theta_n\}_{n \geq 0}$ falls into the category of such solutions, the concept of chain-recurrence is tightly connected to the asymptotic behavior of stochastic gradient search. In [5], [6], it has been shown that for unbiased gradient estimates, all limit points of $\{\theta_n\}_{n \geq 0}$ belong to $R$ and that each element of $R$ can potentially be a limit point of $\{\theta_n\}_{n \geq 0}$ with a non-zero probability.

If $f$ is Lipschitz continuously differentiable, it can be established that $S \subseteq R$. If additionally $f(S)$ is of a zero Lebesgue measure (which holds when $S$ is discrete or when $f$ is $d_{\theta}$-times continuously differentiable), then $R = S$. However, if $f$ is only Lipschitz continuously differentiable, then it is possible to have $R \neq S$ (see [28, Section 4]). Hence, in general, a limit point of $\{\theta_n\}_{n \geq 0}$ is in $R$ but not necessarily in $S$. For more details on chain-recurrence, see [5], [6], [14] and references therein. Given these results, it will prove useful to involve both $R$ and $S$ in the asymptotic analysis of biased stochastic gradient search.

The algorithm (1) is analyzed under the following assumptions:

Assumption 2.1.

$\lim_{n \to \infty} \alpha_n = 0$ and $\sum_{n=0}^{\infty} \alpha_n = \infty$.

Assumption 2.2.

$\{\xi_n\}_{n \geq 0}$ admits the decomposition $\xi_n = \zeta_n + \eta_n$ for each $n \geq 0$, where $\{\zeta_n\}_{n \geq 0}$ and $\{\eta_n\}_{n \geq 0}$ are $\mathbb{R}^{d_{\theta}}$-valued stochastic processes (defined on $(\Omega, \mathcal{F}, P)$) satisfying

(4)

almost surely on for any .

Assumption 2.3.a.

$\nabla f$ is locally Lipschitz continuous on $\mathbb{R}^{d_{\theta}}$.

Assumption 2.3.b.

$f$ is $p$-times differentiable on $\mathbb{R}^{d_{\theta}}$, where $p > d_{\theta}$.

Assumption 2.3.c.

$f$ is real-analytic on $\mathbb{R}^{d_{\theta}}$.

Remark 2.1.

Due to Assumption 2.1, is well-defined, finite and satisfies

(5)

for all , . Consequently, Assumption 2.1 yields

(6)

for each .

Assumption 2.1 corresponds to the step-size sequence $\{\alpha_n\}_{n \geq 0}$ and is commonly used in the asymptotic analysis of stochastic gradient and stochastic approximation algorithms. In this or a similar form, it is an ingredient of practically any asymptotic analysis of stochastic gradient search and stochastic approximation. Assumption 2.1 is satisfied if $\alpha_n = 1/n^{a}$ for $n \geq 1$, where $a \in (0, 1]$.

Assumption 2.2 is a noise condition. It can be interpreted as a decomposition of the gradient estimator’s noise $\{\xi_n\}_{n \geq 0}$ into a zero-mean sequence $\{\zeta_n\}_{n \geq 0}$ (which is averaged out by the step sizes $\{\alpha_n\}_{n \geq 0}$) and the estimator’s bias $\{\eta_n\}_{n \geq 0}$. Assumption 2.2 is satisfied if $\{\zeta_n\}_{n \geq 0}$ is a martingale-difference or mixingale sequence, and if $\{\eta_n\}_{n \geq 0}$ are continuous functions of $\{\theta_n\}_{n \geq 0}$. It also holds for gradient search with Markovian dynamics (see Section 3). If the gradient estimator is unbiased (i.e., $\eta_n = 0$ almost surely), Assumption 2.2 reduces to the well-known Kushner-Clark condition, the weakest noise assumption under which the almost sure convergence of (1) can be demonstrated.
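To see numerically why the zero-mean component is harmless while the bias component is not, one can compare the step-size-weighted sums of the two components over a long tail of iterations; the particular distributions and constants below are assumptions made purely for illustration.

import numpy as np

rng = np.random.default_rng(1)
N = 200000
alpha = 1.0 / np.arange(1, N + 1)          # step sizes alpha_n = 1/n

zeta = rng.normal(size=N)                  # zero-mean (martingale-difference) component
eta_seq = 0.05 * np.ones(N)                # persistent bias component of norm 0.05

tail = slice(N // 2, N)
print(abs(np.sum(alpha[tail] * zeta[tail])))   # roughly sqrt(sum alpha_i^2) ~ 0.002: averaged out
print(np.sum(alpha[tail] * eta_seq[tail]))     # roughly 0.05 * log(2) ~ 0.035: does not vanish

Per unit of accumulated step size, the bias component contributes a drift of magnitude about 0.05, whereas the contribution of the zero-mean component vanishes as the tail grows; this is the mechanism that makes the asymptotic bias of (1) scale with the bias level rather than with the raw noise level.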

Assumptions 2.3.a, 2.3.b and 2.3.c are related to the objective function and its analytical properties. Assumption 2.3.a is involved in practically any asymptotic result for stochastic gradient search (as well as in many other asymptotic and non-asymptotic results for stochastic and deterministic optimization). Although much more restrictive than Assumption 2.3.a, Assumptions 2.3.b and 2.3.c hold for a number of algorithms routinely used in engineering, statistics, machine learning and operations research. In Sections 4–6, Assumptions 2.3.b and 2.3.c are shown for policy-gradient (reinforcement) learning, adaptive population Monte Carlo sampling and recursive maximum split-likelihood estimation in hidden Markov models. In [50], Assumption 2.3.c (which is a special case of Assumption 2.3.b) has been demonstrated for recursive maximum (full) likelihood estimation in hidden Markov models. In [51], the same assumption has also been demonstrated for supervised and temporal-difference learning, online principal component analysis, Monte Carlo optimization of controlled Markov chains and recursive parameter estimation in linear stochastic systems. In [52], we show Assumptions 2.3.b and 2.3.c for sequential Monte Carlo methods for the parameter estimation in non-linear non-Gaussian state-space models. It is also worth mentioning that the objective functions associated with online principal and independent component analysis (as well as with many other adaptive signal processing algorithms) are often polynomial or rational, and hence, smooth and analytic, too (see e.g., [23] and references cited therein).

As opposed to Assumption 2.3.a, Assumptions 2.3.b and 2.3.c allow some sophisticated results from differential geometry to be applied to the asymptotic analysis of stochastic gradient search. More specifically, the Yomdin theorem (a quantitative version of the Morse-Sard theorem; see [53] and Proposition 8.1 in Section 8) can be applied to functions satisfying Assumption 2.3.b, while the Lojasiewicz inequalities (see [35], [36]; see also [12], [32] and Proposition 8.2 in Section 8) hold for functions fulfilling Assumption 2.3.c. Using the Yomdin theorem and the Lojasiewicz inequalities, a more precise characterization of the asymptotic bias of stochastic gradient search can be obtained (see Parts (ii) and (iii) of Theorem 2.1).
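For reference, the classical Lojasiewicz gradient inequality for a real-analytic function $f$ reads as follows; this is the standard textbook statement, and the constants appearing in Proposition 8.2 may be arranged differently.

\[
\forall\, a \in \mathbb{R}^{d_{\theta}} \;\; \exists\, \delta > 0,\; M \in [1,\infty),\; \mu \in (1,2] :
\qquad
|f(\theta) - f(a)| \;\le\; M\, \|\nabla f(\theta)\|^{\mu}
\quad \text{whenever } \|\theta - a\| \le \delta .
\]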

In order to state the main results of this section, we need some further notation. is the asymptotic magnitude of the gradient estimator’s bias , i.e.,

(7)

For a compact set $Q \subset \mathbb{R}^{d_{\theta}}$, $\Lambda_Q$ denotes the event

(8)

With this notation, our main result on the asymptotic bias of the recursion (1) can be stated as follows.

Theorem 2.1.

Suppose that Assumptions 2.1 and 2.2 hold. Let $Q \subset \mathbb{R}^{d_{\theta}}$ be any compact set. Then, the following is true:

  1. If satisfies Assumption 2.3.a, there exists a (deterministic) non-decreasing function (independent of and depending only on ) such that and

    (9)

    almost surely on $\Lambda_Q$.

  2. If satisfies Assumption 2.3.b, there exists a real number (independent of and depending only on ) such that

    (10)

    almost surely on $\Lambda_Q$, where .

  3. If satisfies Assumption 2.3.c, there exist real numbers , (independent of and depending only on ) such that

    (11)

    almost surely on $\Lambda_Q$.

Theorem 2.1 is proved in Sections 7 and 8, while its global version is provided in Appendix 12.

Remark.

If Assumption 2.3.b (or Assumption 2.3.c) is satisfied, then $R = S$. Hence, under Assumption 2.3.b, (9) still holds if $R$ is replaced with $S$.

Remark 2.2.

Function depends on in the following two ways. First, depends on through the chain-recurrent set and its geometric properties. In addition to this, depends on through upper bounds of and Lipschitz constants of . An explicit construction of is provided in the proof of Part (i) of Theorem 2.1 (Section 7).

Remark 2.3.

As , constants and depend on through upper bounds of and Lipschitz constants of . and also depend on through the Yomdin and Lojasiewicz constants (quantities , , specified in Propositions 8.1, 8.2). Explicit formulas for and are included in the proof of Parts (ii) and (iii) of Theorem 2.1 (Section 8).

According to the literature on stochastic optimization and stochastic approximation, stochastic gradient search with unbiased gradient estimates (the case when $\eta = 0$) exhibits the following asymptotic behavior. Under mild conditions, sequences $\{\theta_n\}_{n \geq 0}$ and $\{f(\theta_n)\}_{n \geq 0}$ converge to $R$ and $f(S)$ (respectively), i.e.,

(12)

almost surely on $\Lambda_Q$ (see [6, Proposition 4.1, Theorem 5.7] which hold under Assumptions 2.1, 2.2, 2.3.a). Under more restrictive conditions, sequences $\{\theta_n\}_{n \geq 0}$ and $\{f(\theta_n)\}_{n \geq 0}$ converge to $S$ and a point in $f(S)$ (respectively), i.e.,

(13)

almost surely on $\Lambda_Q$ (see [6, Corollary 6.7] which holds under Assumptions 2.1, 2.2, 2.3.b). The same asymptotic behavior occurs when Assumptions 2.1, 2.3.a hold and $\{\xi_n\}_{n \geq 0}$ is a martingale-difference sequence (see [11, Proposition 1]). When the gradient estimator is biased (the case where $\eta > 0$), this is not true any more. Now, the quantities

(14)

are strictly positive and depend on $\eta$ (it is reasonable to expect these quantities to decrease in $\eta$ and to tend to zero as $\eta \to 0$). Hence, the quantities (14) and their dependence on $\eta$ can be considered as a sensible characterization of the asymptotic bias of gradient search with biased gradient estimation (i.e., these quantities describe how biased stochastic gradient search deviates from the nominal behavior). In the case of algorithm (1), such a characterization is provided by Theorem 2.1. The theorem includes tight, explicit bounds on the quantities (14) in terms of the gradient estimator’s bias $\eta$ and the analytical properties of $f$.
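In the notation introduced above (with $\theta_n$ denoting the iterates of (1), $R$ the chain-recurrent set and $S$ the set of stationary points), natural quantities of this kind are, plausibly,

\[
\limsup_{n \to \infty} d(\theta_n, R), \qquad
\limsup_{n \to \infty} \|\nabla f(\theta_n)\|, \qquad
\limsup_{n \to \infty} d\bigl(f(\theta_n), f(S)\bigr),
\]

each of which vanishes in the unbiased case and, in the biased case, is bounded in terms of $\eta$ by Theorem 2.1.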

The results of Theorem 2.1 are of a local nature. They hold only on the event $\Lambda_Q$ where algorithm (1) is stable (i.e., where the sequence $\{\theta_n\}_{n \geq 0}$ belongs to a compact set $Q$). Stating results on the asymptotic bias of stochastic gradient search in such a local form is quite sensible for the following reasons. The stability of stochastic gradient search is based on well-understood arguments which are rather different from the arguments used here to analyze the asymptotic bias. Moreover and more importantly, as demonstrated in Appendix 12, it is relatively easy to get a global version of Theorem 2.1 by combining the theorem with methods for verifying or ensuring stability (e.g., with the results of [13] and [21]). It is also worth mentioning that local asymptotic results are quite common in the areas of stochastic optimization and stochastic approximation (e.g., most of the results of [9, Part II], similarly to Theorem 2.1, hold only on the event $\Lambda_Q$).

Gradient algorithms with biased gradient estimation are extensively used in system identification [2], [25], [26], [29], [34], discrete-event system optimization [22], [27], [43], [44], machine learning [4], [10], [15], [31], [41], and statistics [1], [16], [25], [40], [46]. To interpret the results obtained by such an algorithm and to tune the algorithm parameters (e.g., to achieve a better bias/variance balance and convergence rate), it is crucially important to understand the asymptotic properties of the biased stochastic gradient search. Despite its importance, the asymptotic behavior of the stochastic gradient search with biased gradient estimation has not received much attention in the literature on stochastic optimization and stochastic approximation. To the best of the present authors’ knowledge, the asymptotic properties of the biased stochastic gradient search and biased stochastic approximation have been studied only in [14, Section 5.3], [19], [20], [21, Section 2.7]. Although these results provide good insight into the asymptotic behavior of the biased gradient search, they are based on restrictive conditions. More specifically, the results of [14, Section 5.3], [19], [20], [21, Section 2.7] hold only if $f$ is unimodal or if $\{\theta_n\}_{n \geq 0}$ belongs to the domain of an asymptotically stable attractor of the ODE $d\theta/dt = -\nabla f(\theta)$. In addition to this, the results of [14, Section 5.3], [19], [20], [21, Section 2.7] do not provide any explicit bound on the asymptotic bias of the stochastic gradient search unless $f$ is of a simple form (e.g., convex or polynomial). Unfortunately, in the case of complex stochastic gradient algorithms (such as those studied in Sections 4–6), $f$ is usually multimodal with a lot of non-isolated local extrema and saddle points. For such algorithms, not only is it hard to verify the assumptions adopted in [14, Section 5.3], [19], [20], [21, Section 2.7], but these assumptions are likely not to hold at all.

Relying on chain-recurrence, the Yomdin theorem and the Lojasiewicz inequalities, Theorem 2.1 overcomes the described difficulties. The theorem allows the objective function to be multimodal (with manifolds of non-isolated extrema and saddle points) and does not require the ODE $d\theta/dt = -\nabla f(\theta)$ to have an asymptotically stable attractor which is infinitely often visited by $\{\theta_n\}_{n \geq 0}$. In addition to this, Theorem 2.1 provides relatively tight explicit bounds on the asymptotic bias of algorithm (1). Furthermore, as demonstrated in Sections 4–6 and [52], the theorem covers a broad class of stochastic gradient algorithms used in machine learning, Monte Carlo sampling and system identification.

3 Stochastic Gradient Search with Markovian Dynamics

In order to illustrate the results of Section 2 and to set up a framework for the analysis carried out in Sections 4–6, we apply Theorem 2.1 to stochastic gradient algorithms with Markovian dynamics. These algorithms are defined by the following difference equation:

(15)

In this recursion, is a Borel-measurable function, while is a sequence of positive real numbers. is an -valued random variable defined on a probability space . is an -valued stochastic process defined on , while is an -valued stochastic process defined on the same probability space. is a Markov process controlled by , i.e., there exists a family of transition probability kernels defined on such that

(16)

almost surely for any Borel-measurable set and . are random functions of , i.e., is a random function of for each . In the context of stochastic gradient search, represents a gradient estimator (i.e., an estimator of $\nabla f(\theta_n)$).
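A minimal sketch of a recursion of this type, written in the generic form theta_{n+1} = theta_n - alpha_n F(theta_n, Z_{n+1}) with {Z_n} a parameter-controlled Markov chain; the chain, the function F and all numerical values below are hypothetical stand-ins rather than the paper's notation.

import numpy as np

rng = np.random.default_rng(2)

def next_state(theta, z):
    """Hypothetical controlled Markov transition (cf. (16)): an AR(1) chain
    whose stationary mean equals the current parameter theta."""
    return 0.8 * z + 0.2 * theta + rng.normal(scale=0.1)

def F(theta, z):
    """Hypothetical gradient estimate built from the chain state; its average
    under the stationary law of the chain equals theta, i.e. it plays the role
    of grad f(theta) for f(theta) = theta**2 / 2 in this toy setting."""
    return z

theta, z = 1.0, 0.0
for n in range(200000):
    z = next_state(theta, z)                          # one step of the controlled chain
    theta = theta - (1.0 / (n + 1)) * F(theta, z)     # update of the form (15)
print(theta)   # approaches the minimizer 0 of the toy objective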

The algorithm (15) is analyzed under the following assumptions.

Assumption 3.1.

, and .

Assumption 3.2.

There exist a differentiable function and a Borel-measurable function such that is locally Lipschitz continuous and

(17)

for each , , where .

Assumption 3.3.

For any compact set , there exists a Borel-measurable function such that

for all , . Moreover,

for all , , where is the stopping time defined by .

Assumption 3.4.

almost surely on .

Let $S$ and $f(S)$ have the same meaning as in (2) ($f$ is now specified in Assumption 3.2), while $R$ is the set of chain-recurrent points of the ODE $d\theta/dt = -\nabla f(\theta)$ (for details on chain-recurrence, see Section 2). Moreover, let $\eta$ have the same meaning as in (7). Then, our results on the asymptotic behavior of the recursion (15) read as follows.

Theorem 3.1.

Suppose that Assumptions 3.1–3.4 hold. Let $Q$ be any compact set. Then, the following is true:

  1. If $f$ (specified in Assumption 3.2) satisfies Assumption 2.3.a, Part (i) of Theorem 2.1 holds.

  2. If $f$ (specified in Assumption 3.2) satisfies Assumption 2.3.b, Part (ii) of Theorem 2.1 holds.

  3. If $f$ (specified in Assumption 3.2) satisfies Assumption 2.3.c, Part (iii) of Theorem 2.1 holds.

Theorem 3.1 is proved in Section 9, while its global version is provided in Appendix 1.

Assumption 3.1 is related to the sequence $\{\alpha_n\}_{n \geq 0}$. It is satisfied if for , where is a constant. Assumptions 3.2 and 3.3 correspond to the stochastic process and are standard for the asymptotic analysis of stochastic approximation algorithms with Markovian dynamics. Basically, Assumptions 3.2 and 3.3 require the Poisson equation associated with algorithm (15) to have a solution which is Lipschitz continuous in $\theta$. They hold if the following is satisfied: (i) is geometrically ergodic for each , (ii) the convergence rate of is locally uniform in , and (iii) is locally Lipschitz continuous in on (for further details, see [9, Chapter II.2], [38, Chapter 17] and references cited therein). Assumptions 3.2 and 3.3 have been introduced by Métivier and Priouret in [37] (see also [9, Part II]), and later generalized by Kushner and his co-workers (see [33] and references cited therein). However, none of these results cover the scenario where biased gradient estimates are used. Theorem 3.1 fills this gap in the literature on stochastic optimization and stochastic approximation.
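For concreteness, in the Métivier-Priouret framework the Poisson equation mentioned here is usually written in the following form; the symbols $F$, $\Pi_{\theta}$ and $\nu_{\theta}$ (the gradient estimator of (15), the transition kernel of (16) and its invariant probability measure) are introduced here for illustration and the paper's exact notation may differ.

\[
\tilde F(\theta, z) - \bigl(\Pi_{\theta} \tilde F(\theta, \cdot)\bigr)(z)
\;=\; F(\theta, z) - h(\theta),
\qquad
h(\theta) \;=\; \int F(\theta, z')\, \nu_{\theta}(dz'),
\]

with the solution $\tilde F$ required to be Lipschitz continuous in $\theta$, locally uniformly in $z$.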

Regarding Theorem 3.1, the following note is in order. As already mentioned at the beginning of the section, the purpose of the theorem is to illustrate the results of Section 2 and to provide a framework for studying the examples presented in the next few sections. Since these examples fit perfectly into the framework developed by Métivier and Priouret, the more general assumptions and settings of [33] are not considered here, in order to keep the exposition as concise as possible.

4 Example 1: Reinforcement Learning

In this section, Theorems 2.1 and 3.1 are applied to the asymptotic analysis of policy-gradient search for average-cost Markov decision problems. Policy-gradient search is one of the most important classes of reinforcement learning algorithms (for further details, see e.g., [10], [41]).

In order to define controlled Markov chains with parametrized randomized control and to formulate the corresponding average-cost decision problems, we use the following notation. , , are integers, while , are the sets

is a non-negative (real-valued) function of . and are non-negative (real-valued) functions of with the following properties: is differentiable in for each , , , and

for the same , , . For , is an -valued Markov chain which is defined on a (canonical) probability space and which admits

for each , . is a function defined by

(18)

for . With this notation, an average-cost Markov decision problem with parameterized randomized control can be defined as the minimization of . In the literature on reinforcement learning and operations research, are referred to as a controlled Markov chain, while are called control actions. is referred to as the (chain) transition probability, while is called the (control) action probability.

is a parameter indexing the action probability. For further details on Markov decision processes, see

[10], [41], and references cited therein.
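Written out explicitly, an average-cost objective of the kind defined in (18) typically takes the following form, where X_i denotes the controlled chain run under the parameter θ and c the cost function; the symbols are illustrative (and the cost is taken to depend on the state only for simplicity), since the original notation is not reproduced above.

\[
f(\theta) \;=\; \lim_{n \to \infty} \frac{1}{n}\,
E_{\theta}\!\left[\, \sum_{i=1}^{n} c(X_i) \right],
\]

so that minimizing $f$ over $\theta$ amounts to finding the randomized control with the smallest long-run average cost.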

Since $f$ and its gradient rarely admit a closed-form expression, $f$ is minimized using methods based on stochastic gradient search and Monte Carlo gradient estimation. Such a method can be derived as follows. Let

for , , . If is geometrically ergodic, we have

(see the proof of Lemma 10.2 and in particular (56)). Hence, quantity

is an asymptotically consistent estimator of . To reduce its variance (which is usually very large for ), term is ‘discounted’ by , where is a constant referred to as the discounting factor. This leads to the following gradient estimator:

(19)

Gradient estimator (19) is biased and its bias is of the order when (see Lemma 10.2). Combining gradient search with estimator (19), we get the policy-gradient algorithm proposed in [4]. This algorithm is defined by the following difference equations:

(20)

In the recursion (20), is a sequence of positive reals, while are any (deterministic) vectors. and are and -valued stochastic processes (respectively) generated through the following Monte Carlo simulations:

(21)

where , are deterministic quantities. (In (21), is simulated from independently of , while is simulated from independently of .) Hence, satisfies

for all , , .
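The following sketch illustrates a policy-gradient recursion with a 'discounted' gradient estimator in the spirit of (19)-(20); the toy environment, the softmax (exponential) parameterization of the action probabilities and all numerical constants are hypothetical, and the update theta_{n+1} = theta_n - alpha_n c(X_{n+1}) z_{n+1}, z_{n+1} = lambda z_n + grad log q_theta(A_n | X_n) is one common concrete form of such an algorithm rather than a verbatim transcription of (20).

import numpy as np

rng = np.random.default_rng(3)
N_STATES, N_ACTIONS = 3, 2                      # hypothetical toy MDP sizes
DIM = N_STATES * N_ACTIONS                      # dimension of the policy parameter

def action_probs(theta, x):
    """Softmax action probabilities q_theta(. | x) -- an 'exponential' parameterization."""
    prefs = theta.reshape(N_STATES, N_ACTIONS)[x]
    p = np.exp(prefs - prefs.max())
    return p / p.sum()

def log_policy_grad(theta, x, a):
    """Gradient of log q_theta(a | x) with respect to theta."""
    g = np.zeros((N_STATES, N_ACTIONS))
    g[x] = -action_probs(theta, x)
    g[x, a] += 1.0
    return g.ravel()

def step_env(x, a):
    """Hypothetical controlled chain p(. | x, a) and state cost c(.)."""
    x_next = int(rng.integers(N_STATES)) if a == 0 else (x + 1) % N_STATES
    return x_next, float(x_next == 0)           # unit cost in state 0, zero elsewhere

theta = np.zeros(DIM)
z = np.zeros(DIM)                               # 'discounted' eligibility vector
lam = 0.9                                       # discounting factor
x = 0
for n in range(200000):
    a = int(rng.choice(N_ACTIONS, p=action_probs(theta, x)))   # A_n ~ q_theta(. | X_n)
    x_next, cost = step_env(x, a)                               # X_{n+1}, c(X_{n+1})
    z = lam * z + log_policy_grad(theta, x, a)                  # discounted score accumulation
    theta = theta - 0.01 / (1.0 + n / 1000.0) * cost * z        # biased gradient step
    x = x_next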

Algorithm (20) is analyzed under the following assumptions.

Assumption 4.1.

For all , is irreducible and aperiodic.

Assumption 4.2.

For all , , , is well-defined (and finite). Moreover, for each , , is locally Lipschitz continuous in on .

Assumption 4.3.a.

For each , , is -times differentiable in on , where .

Assumption 4.3.b.

For each , , is real-analytic in on .

Assumption 4.1 is related to the stability of the controlled Markov chain . In this or similar form, it is often involved in the asymptotic analysis of reinforcement learning algorithms (see e.g., [10], [41]). Assumptions 4.2, 4.3.a and 4.3.b correspond to the parameterization of the action probabilities . They are satisfied for many commonly used parameterizations (such as natural, exponential and trigonometric).

Let $S$ and $f(S)$ have the same meaning as in (2) ($f$ is now defined in (18)), while $R$ is the set of chain-recurrent points of the ODE $d\theta/dt = -\nabla f(\theta)$ (for details on chain-recurrence, see Section 2). Moreover, for a compact set $Q$, let $\Lambda_Q$ have the same meaning as in (8). Then, our results on the asymptotic behavior of the recursion (20) read as follows.

Theorem 4.1.

Suppose that Assumptions 3.1, 4.1 and 4.2 hold. Let $Q$ be any compact set. Then, the following is true:

  1. There exists a (deterministic) non-decreasing function (independent of and depending only on , , ) such that and

    almost surely on $\Lambda_Q$.

  2. If (in addition to Assumptions 3.1, 4.1 and 4.2) Assumption 4.3.a is satisfied, there exists a real number (independent of and depending only on , , ) such that

    almost surely on $\Lambda_Q$, where .

  3. If (in addition to Assumptions 3.1, 4.1 and 4.2) Assumption 4.3.b is satisfied, there exist real numbers , (independent of and depending only on , , ) such that

    almost surely on $\Lambda_Q$.

Theorem 4.1 is proved in Section 10.

Remark.

Function depends on , , through function (defined in (18)) and its properties (see Remark 2.2 for details). Function also depends on , through the ergodicity properties of (see Lemma 10.1). In addition to this, depends on , through upper bounds of , . Further details can be found in the proofs of Lemmas 10.1, 10.2 and Theorem 4.1 (Section 10).

Remark.

As , constants and depend on , , through function (defined in (18)) and its properties (see Remark 2.3 for details). and also depend on , , through the ergodicity properties of . In addition to this, and depend on , , through upper bounds of , . For further details, see the proofs of Lemmas 10.1, 10.2 and Theorem 4.1 (Section 10).

Although gradient search with ‘discounted’ gradient estimation (19) is widely used in reinforcement learning (besides policy-gradient search, temporal-difference and actor-critic learning also rely on the same approach), the available literature does not give a satisfactory answer to the problem of its asymptotic behavior. To the best of the present authors’ knowledge, the existing results do not even guarantee that the asymptotic bias of recursion (20) goes to zero as (i.e., that converges to a vicinity of whose radius tends to zero as ). (Paper [31] can be considered as the strongest result on the asymptotic behavior of reinforcement learning with ‘discounted’ gradient estimation. However, [31] only claims that a subsequence of converges to a vicinity of whose radius goes to zero as .) The main difficulty stems from the fact that reinforcement learning algorithms are so complex that the existing asymptotic results for biased stochastic gradient search and biased stochastic approximation [14, Section 5.3], [19], [20], [21, Section 2.7] cannot be applied. Relying on the results presented in Sections 2 and 3, Theorem 4.1 overcomes these difficulties. Under mild and easily verifiable conditions, Theorem 4.1 guarantees that the asymptotic bias of algorithm (20) converges to zero as (Part (i)). Theorem 4.1 also provides relatively tight polynomial bounds on the rate at which the bias goes to zero (Parts (ii), (iii)). In addition to this, Theorem 4.1 can be extended to other reinforcement learning algorithms such as temporal-difference and actor-critic learning.

5 Example 2: Adaptive Monte Carlo Sampling

In this section, Theorems 2.1 and 3.1 are used to analyze the asymptotic behavior of adaptive population Monte Carlo methods.

In order to describe the population Monte Carlo methods and explain how their performance can adaptively be improved, we use the following notation. , , are integers. is an open set, while is a Borel-set. is a probability density on , while is a non-negative function proportional to (i.e., , , for all ). is a non-negative (real-valued) function of which satisfies for all , (notice that is a transition density on ). is the function defined by

for , . is the transition density on defined as

for , , .
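Although the description above is only the setup, the flavour of such adaptive schemes can be conveyed by a minimal sketch in which the parameter of a Gaussian importance-sampling proposal is tuned by a stochastic gradient step computed from the current population; the target, the proposal family and the adaptation criterion below are hypothetical stand-ins rather than the scheme analyzed in this section.

import numpy as np

rng = np.random.default_rng(4)

def p_unnorm(x):
    """Unnormalized target density (hypothetical): a two-component Gaussian mixture."""
    return 0.3 * np.exp(-0.5 * (x - 2.0) ** 2) + 0.7 * np.exp(-0.5 * (x + 1.0) ** 2)

theta, M = 0.0, 500                     # proposal mean, population size per iteration
for n in range(2000):
    xs = theta + rng.normal(size=M)     # population sampled from q_theta = N(theta, 1)
    w = p_unnorm(xs) / np.exp(-0.5 * (xs - theta) ** 2)   # importance weights (up to constants)
    w = w / w.sum()                                       # self-normalization
    grad_est = np.sum(w * (xs - theta))  # biased SNIS estimate of -d/dtheta KL(pi || q_theta)
    theta = theta + (1.0 / (n + 1)) * grad_est            # stochastic gradient step
print(theta)   # approaches the target mean E_pi[X] = 0.3*2 + 0.7*(-1) = -0.1

The self-normalized importance-sampling gradient estimate is biased for any finite population size, which is precisely the kind of biased gradient estimation that Theorems 2.1 and 3.1 are designed to handle.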