Asynchronous Stochastic Approximation with Differential Inclusions

12/10/2011 ∙ by Steven Perkins, et al. ∙ University of Bristol

The asymptotic pseudo-trajectory approach to stochastic approximation of Benaïm, Hofbauer and Sorin is extended for asynchronous stochastic approximations with a set-valued mean field. The asynchronicity of the process is incorporated into the mean field to produce convergence results which remain similar to those of an equivalent synchronous process. In addition, this allows many of the restrictive assumptions previously associated with asynchronous stochastic approximation to be removed. The framework is extended for a coupled asynchronous stochastic approximation process with set-valued mean fields. Two-timescale arguments are used here in a similar manner to the original work in this area by Borkar. The applicability of this approach is demonstrated through learning in a Markov decision process.


1 Introduction

Many learning algorithms include a stochastic updating schedule, often based on a Markov chain. Studying the performance of these processes can be carried out using the asynchronous stochastic approximation framework. However, the previous work in this area has focused on continuous, single-valued updates (see for example [8], [13], [16], [17], [23]). Furthermore, some of the assumptions which are typically used are challenging to verify. In this work we expand the asymptotic pseudo-trajectory approach of Benaïm, Hofbauer and Sorin [4] to asynchronous stochastic approximations with set-valued mean fields. We incorporate the asynchronicity into the mean field to give a differential inclusion which characterises the limiting behaviour of the associated learning process.

Consider an iterative process θ_n ∈ ℝ^{|I|} and denote the i-th component of θ_n as θ^i_n, where i ∈ I and I is finite. A typical stochastic approximation (SA) is of the form

θ_{n+1} = θ_n + α_{n+1} [F(θ_n) + b_{n+1} + M_{n+1}],   (1.1)

where {α_n} is a positive, decreasing sequence, {M_n} is a zero-mean martingale noise sequence, {b_n} is a bounded sequence which converges to zero and F is a Lipschitz continuous mean field. Standard arguments (e.g. [3]) are then used to show that the limiting behaviour of the iterative process in (1.1) can be studied through the ordinary differential equation (ODE)

dθ/dt = F(θ).   (1.2)

Commonly known as the ODE method of stochastic approximation, originally proposed by Ljung [19], this technique has been extended by numerous authors, for example Benaïm [2], Benaïm, Hofbauer and Sorin [4], Borkar [10], Kushner and Clark [15] and Kushner and Yin [16], [17]. In particular Benaïm, Hofbauer and Sorin [4] have developed the approach so that under some weak criteria θ_n can be updated via a set-valued mean field, F. This allows the limiting behaviour to be studied using the associated differential inclusion.
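The ODE method is easy to see numerically. Below is a minimal sketch, not taken from the paper, of a scalar iteration of the form (1.1) under illustrative assumptions: step sizes α_n = 1/n, Gaussian martingale noise, and the Lipschitz mean field F(θ) = 1 − θ, whose ODE dθ/dt = 1 − θ has a globally attracting rest point at θ* = 1.

```python
import numpy as np

def robbins_monro(F, theta0, n_steps, noise_sd=0.5, rng=None):
    """Run the scalar iteration
    theta_{n+1} = theta_n + alpha_{n+1} * (F(theta_n) + M_{n+1})
    with step sizes alpha_n = 1/n and zero-mean Gaussian noise M_n."""
    rng = np.random.default_rng(0) if rng is None else rng
    theta = float(theta0)
    for n in range(1, n_steps + 1):
        alpha = 1.0 / n                      # positive, decreasing step sizes
        noise = rng.normal(0.0, noise_sd)    # zero-mean martingale noise
        theta += alpha * (F(theta) + noise)
    return theta

# F(theta) = 1 - theta: the ODE dtheta/dt = 1 - theta has a globally
# attracting rest point at theta* = 1, so the iterate should settle there.
theta_rm = robbins_monro(lambda t: 1.0 - t, theta0=5.0, n_steps=20000)
```

With decreasing step sizes the noise averages out, and the long-run behaviour of the iterate mirrors that of the ODE.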

Standard stochastic approximations are not always applicable; an example which we examine in this paper is learning action values in a Markov decision process (MDP), as also discussed by Konda and Tsitsiklis [14], Tsitsiklis [23] and Singh et al. [21]. In an MDP updates are made to a single random component at each iteration. Therefore we have a stochastic, asynchronous updating pattern, in which a subset of an iterative process similar to (1.1) can be updated many times before the remaining components are selected for a single update. Based on this idea, extensions to the standard theory have been examined, such as those by Kushner and Yin [16], [17]. Here, however, we follow the extension to asynchronous stochastic approximation provided by Borkar [8] and Konda and Borkar [13]. They show that when the iterative updates have a Lipschitz continuous mean field then, similarly to a standard stochastic approximation, the limiting behaviour can be studied via the associated differential equation,
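To make the asynchronous updating pattern concrete, here is a hypothetical sketch in which a single randomly chosen component is updated per iteration, and each component uses a step size driven by its own counter (the number of times it has been updated). The mean field F(θ) = b − θ and step sizes 1/m are illustrative choices of ours, not taken from the paper.

```python
import numpy as np

def async_sa(F, theta0, n_steps, noise_sd=0.5, rng=None):
    """Asynchronous stochastic approximation: at each iteration a single
    random component i is updated, with a step size driven by that
    component's own counter (its number of updates so far)."""
    rng = np.random.default_rng(1) if rng is None else rng
    theta = np.array(theta0, dtype=float)
    counts = np.zeros(len(theta), dtype=int)       # per-component counters
    for _ in range(n_steps):
        i = rng.integers(len(theta))               # random component
        counts[i] += 1
        alpha = 1.0 / counts[i]                    # local-clock step size
        theta[i] += alpha * (F(theta)[i] + rng.normal(0.0, noise_sd))
    return theta

# Mean field F(theta) = b - theta: each component approaches b_i even
# though components are updated at random, one at a time.
b = np.array([1.0, -2.0, 0.5])
theta_async = async_sa(lambda t: b - t, np.zeros(3), n_steps=60000)
```

Note that each component runs on its own clock, which is precisely the complication the asynchronous framework must handle.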

dθ/dt = M(t) F(θ),   (1.3)

where M(t) is a diagonal matrix and the diagonal elements of M(t) lie in the set [0,1] for all t ≥ 0. This early work on asynchronous stochastic approximations has certain restrictions which limit its usability. In particular, many of the assumptions made in the work of Borkar [8] and Konda and Borkar [13] are given in implicit form and are difficult to verify in specific situations.

As with the initial results for a standard stochastic approximation, the results of Borkar [8] are limited to the case when the mean field, F, is a Lipschitz continuous function. The subsequent work by Benaïm, Hofbauer and Sorin [4] on set-valued mean fields leaves the natural question of whether similar results are possible for asynchronous stochastic approximations when a set-valued mean field is used. In addition, the ODE in (1.3) is non-autonomous and the scaling matrix M(t) is not explicitly defined. This makes the limiting behaviour more difficult to study, although some methods for verifying global convergence are outlined by Borkar [10].

Borkar [7] originally extended the stochastic approximation framework to two timescales. Since then Leslie and Collins [18] have extended this idea to multiple timescales, and Konda and Borkar [13] provided a first venture into two-timescale asynchronous stochastic approximation. However, all of these works only consider stochastic approximations whose mean field is Lipschitz continuous.

The aim of this work is to combine and generalise the results of Borkar [7], [8], Konda and Borkar [13] and Benaïm, Hofbauer and Sorin [4] to create a framework for single and two-timescale asynchronous stochastic approximations which is straightforward to use in practical applications. In this paper we show that, under a set of verifiable assumptions, the diagonal elements of M(t) lie in the closed set [η, 1], for some η > 0. This set can be combined with the mean field F to form a set-valued mean field, 𝓕, whose limiting behaviour can be studied via the associated differential inclusion using the results of Benaïm, Hofbauer and Sorin [4]. A natural benefit of using the differential inclusion framework is that F can itself be set-valued, as this does not alter the analysis.

This paper is organised in the following manner: Section 2 reviews some previous results on stochastic approximation with differential inclusions and asynchronous stochastic approximations. In Section 3 we focus on the single-timescale asynchronous stochastic approximation. We state the main theorem before presenting the weak convergence results required for the proof. Section 4 examines the extension to a two-timescale asynchronous stochastic approximation process. Large parts of this section follow directly from the results in Section 3. In Section 5 we present an example of a learning algorithm for discounted reward Markov decision processes and obtain convergence results by applying the method shown in Section 4. This illustrates the ease with which this framework can be used. Finally, the paper concludes with a summary of the work. Throughout this paper many of the proofs are omitted from the main flow of text and are instead presented in an appendix.

2 Background

Throughout this paper we use two main ideas from the stochastic approximation literature. The first relates to the work by Benaïm, Hofbauer and Sorin [4] on stochastic approximation with differential inclusions, and the second concerns the asynchronous stochastic approximation framework introduced by Borkar [8]. We take the opportunity to review the pertinent features of their work in this section.

In what is to follow we use the standard concept of set multiplication: if 𝒜 is a set of matrices and ℬ is a closed, convex set then let the multiplication of these sets be defined as,

𝒜 · ℬ := {Ab : A ∈ 𝒜, b ∈ ℬ}.

Note that 𝒜 · ℬ is also closed and convex. This definition is still used if either or both of the sets 𝒜 and ℬ are single valued. We also use the same concept when multiplying a constant by a set. That is, if c is a constant then define

c · ℬ := {cb : b ∈ ℬ}.

However, in this latter case we often drop the ‘·’ notation for convenience.

2.1 Stochastic Approximation with Differential Inclusions

We begin by outlining the current convergence results for stochastic approximations with set-valued maps proved by Benaïm, Hofbauer and Sorin [4]. These results are heavily used in Section 3, most notably to prove our main result. Initially we provide a definition which outlines the class of set-valued mean fields we are able to use for stochastic approximation. These criteria are taken directly from the original work on stochastic approximations with differential inclusions by Benaïm et al. [4].

Definition 2.1.

Call F a stochastic approximation map if it satisfies the following:

  • F is a closed set-valued map. That is,

    Graph(F) := {(θ, y) : y ∈ F(θ)}

    is a closed set. Equivalently, F is an upper semi-continuous set-valued map.

  • For all θ ∈ ℝ^{|I|}, F(θ) is a non-empty, compact, convex subset of ℝ^{|I|}.

  • There exists a c > 0 such that for all θ ∈ ℝ^{|I|},

    sup_{z ∈ F(θ)} ‖z‖ ≤ c(1 + ‖θ‖).

Take F as a stochastic approximation map; a typical differential inclusion is in the form,

dθ/dt ∈ F(θ),   (2.1)

and a solution to (2.1) is an absolutely continuous mapping x : ℝ → ℝ^{|I|} such that x(0) = x₀ and for almost every t ∈ ℝ,

dx(t)/dt ∈ F(x(t)).

The flow induced by (2.1) is defined by,

Φ_t(x₀) := {x(t) : x is a solution to (2.1) with x(0) = x₀}.

Definition 2.2 (Benaïm et al. [4]).

A continuous function z : ℝ₊ → ℝ^{|I|} is an asymptotic pseudo-trajectory to Φ if

lim_{t→∞} sup_{0 ≤ s ≤ T} d(z(t+s), Φ_s(z(t))) = 0

for any T > 0, where d(·,·) is a distance measure on ℝ^{|I|}.

Many important properties of a dynamical system and the asymptotic pseudo-trajectories of the systems are discussed by Benaïm, Hofbauer and Sorin [4]. Most important for the work here is that an asymptotic pseudo-trajectory to (2.1) will behave in a similar manner to the solutions of the differential inclusion and hence the limiting behaviour will be closely related.
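As a concrete illustration of a differential inclusion of the form (2.1), consider the toy example dθ/dt ∈ −Sgn(θ), where Sgn(0) = [−1, 1] is set-valued; this example and the forward-Euler selection scheme below are our own sketch, not taken from the paper. Trajectories reach the rest point 0 in finite time, and a naive discretisation chatters within one step size of it:

```python
def euler_inclusion(theta0, dt, n_steps):
    """Forward-Euler 'selection' scheme for the differential inclusion
    dtheta/dt in -Sgn(theta), where Sgn(0) = [-1, 1] is set-valued.
    At each step one element of the right-hand side is selected."""
    theta = float(theta0)
    path = [theta]
    for _ in range(n_steps):
        if theta > 0:
            v = -1.0
        elif theta < 0:
            v = 1.0
        else:
            v = 0.0          # 0 is a valid selection from [-1, 1]
        theta += dt * v
        path.append(theta)
    return path

# Starting from theta = 1 the trajectory decreases at unit rate, reaches
# a neighbourhood of 0 after about 100 steps, and then chatters within
# one step size of the rest point.
path = euler_inclusion(theta0=1.0, dt=0.01, n_steps=200)
```

The bounded chattering illustrates why limiting behaviour is described through solutions of the inclusion (to which the interpolated process is an asymptotic pseudo-trajectory) rather than through any single discretisation.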

We conclude this section by considering a standard iterative process in the form of (1.1) where the mean field F is a stochastic approximation map. The following theorem states that under four assumptions a linear interpolation of θ_n (a function defined precisely in Section 3.1) is an asymptotic pseudo-trajectory to the differential inclusion (2.1). Hence the limiting behaviour of θ_n can be studied via this differential inclusion.

Theorem 2.3.

Assume that

  (i) For all T > 0,

      lim_{n→∞} sup_{k : τ_k ≤ τ_n + T} ‖ Σ_{i=n}^{k−1} α_{i+1} M_{i+1} ‖ = 0,   (2.2)

      where τ_0 := 0 and τ_n := Σ_{i=1}^{n} α_i,

  (ii) sup_n ‖θ_n‖ < ∞,

  (iii) F is a stochastic approximation map,

  (iv) ‖b_n‖ → 0 as n → ∞ and {b_n} is bounded.

Then a linear interpolation of the iterative process given by (1.1) is an asymptotic pseudo-trajectory of the differential inclusion (2.1).

This is a slight modification of a result stated by Benaïm, Hofbauer and Sorin [4, Proposition 1.3] to include the b_n terms. It is trivial to verify that this does not alter any of the asymptotic results of the original work.

2.2 Asynchronous Stochastic Approximations

Now we fully introduce the asynchronous stochastic approximation notation used. A typical asynchronous stochastic approximation such as those studied by Borkar [8] fits the following framework. If 2^I is the power set of all possible updating combinations in I then let S_n ⊆ I be the components of the iterative process updated at iteration n. Using a counter for state i,

ν^i(n) := Σ_{m=1}^{n} 𝟙{i ∈ S_m},

we consider processes in which no component, i ∈ I, in the asynchronous process needs to know the global counter, n, merely its own counter, ν^i(n). Let θ^i_n, M^i_n, b^i_n and F^i be the i-th component of θ_n, M_n, b_n and F respectively, for i ∈ I. We directly extend the notation used in (1.1) for an asynchronous stochastic approximation; let F be a stochastic approximation map, then for i ∈ I let

θ^i_{n+1} = θ^i_n + α_{ν^i(n+1)} 𝟙{i ∈ S_{n+1}} [F^i(θ_n) + b^i_{n+1} + M^i_{n+1}].   (2.3)

Define the asynchronous step sizes, ᾱ^i_n, and the relative step sizes, μ^i_n, to be

ᾱ^i_n := α_{ν^i(n)} 𝟙{i ∈ S_n},   ᾱ_n := max_{i ∈ I} ᾱ^i_n,   μ^i_n := ᾱ^i_n / ᾱ_n.

The asynchronous step sizes, ᾱ^i_n, are random step sizes (in contrast with the deterministic α_n terms) whilst the relative step sizes, μ^i_n, are zero whenever the i-th component of the iterative process is not updated. Clearly μ^i_n ∈ [0,1]. By letting D_n be the diagonal matrix of the μ^i_n terms we can express the previous asynchronous stochastic approximation (2.3) in the more concise form

θ_{n+1} = θ_n + ᾱ_{n+1} D_{n+1} [F(θ_n) + b_{n+1} + M_{n+1}].   (2.4)

This is a more familiar form for a stochastic approximation with a set-valued mean field. If F is a stochastic approximation map in (1.1) then (2.4) differs from (1.1) only in that the step sizes in (2.4) are random and in the addition of the D_{n+1} coefficient. Instead of thinking of D_{n+1} as a coefficient of the step sizes we combine it with the mean field. Convergence of the error term, b_n, will be unaffected and, under a set of assumptions in Section 3.2, the noise term will still satisfy the Kushner and Clark noise condition (2.2). Combining the D_{n+1} term and the mean field term into a single set provides an intuitive method of rephrasing the stochastic approximation and leads to a set-valued mean field.

Proceeding with this intuition is not immediately straightforward since D_n is time varying and its diagonal entries can be zero infinitely often, in which case the mean field could be zero even when the original update term, F(θ_n), is not. This would mean the limiting behaviour of the differential inclusion in the asynchronous stochastic approximation could be different to the synchronous case, where ultimately we wish to say that the two behave in the same manner in the limit. To avoid this scenario we follow Borkar [8] and consider the weak limits of the interpolations of the relative step sizes, which will always be bounded away from zero under some verifiable assumptions, given in Section 3.2.
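The relative step sizes are easy to compute for a concrete updating pattern. The sketch below is our own illustration, assuming step sizes 1/m on each component's local clock: it computes, at each iteration, each component's asynchronous step size divided by the largest one, for two components where component 0 is updated every iteration and component 1 every other iteration. Although the relative step size of component 1 is zero at every other iteration, its time average stays bounded away from zero, which is exactly the property exploited later.

```python
import numpy as np

def relative_step_sizes(update_sets, K):
    """Given the sequence of updated component sets S_1, S_2, ..., compute
    the asynchronous step sizes abar_n^i = (1 / nu_i(n)) * 1{i in S_n}
    (local-clock step sizes 1/m) and the relative step sizes
    mu_n^i = abar_n^i / max_j abar_n^j."""
    counts = np.zeros(K, dtype=int)            # local counters nu_i(n)
    mus = []
    for S in update_sets:
        abar = np.zeros(K)
        for i in S:
            counts[i] += 1
            abar[i] = 1.0 / counts[i]          # step size on i's own clock
        mu = abar / abar.max() if abar.max() > 0 else abar
        mus.append(mu)
    return np.array(mus)

# Component 0 is updated every iteration, component 1 every other one.
sets = [{0, 1} if n % 2 == 0 else {0} for n in range(1000)]
mus = relative_step_sizes(sets, K=2)
```

Here component 1's relative step size alternates between 1 and 0, so its long-run average is 1/2: zero infinitely often, but bounded away from zero on average.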

3 Asynchronous SA with Differential Inclusions

We begin by presenting the main result of this paper which concerns the limiting behaviour of the asynchronous stochastic approximation in (2.4), before outlining the results required to prove this in the remainder of the section.

3.1 Main Result

Assume that F is a stochastic approximation map and for all n define f_n ∈ ℝ^{|I|} by its component parts, f^i_n, i ∈ I, such that

f^i_n ∈ F^i(θ_n)  and  θ^i_{n+1} = θ^i_n + ᾱ^i_{n+1} [f^i_n + b^i_{n+1} + M^i_{n+1}].   (3.1)

Notice that if i ∉ S_{n+1} then we can select any f^i_n ∈ F^i(θ_n). Then we can write the iterative process in (2.4) as

θ_{n+1} = θ_n + ᾱ_{n+1} D_{n+1} [f_n + b_{n+1} + M_{n+1}].   (3.2)

For some fixed η ∈ (0,1], let {D̄_n} be a series of diagonal matrices with entries in the set [η,1], for all n, to be defined in Section 3.3. We can again rewrite the iterative process in (3.2) as

θ_{n+1} = θ_n + ᾱ_{n+1} [D̄_{n+1} f_n + D_{n+1} b_{n+1} + D_{n+1} M_{n+1} + (D_{n+1} − D̄_{n+1}) f_n].

Now by letting M̃_{n+1} := D_{n+1} M_{n+1} and b̃_{n+1} := D_{n+1} b_{n+1} we get,

θ_{n+1} = θ_n + ᾱ_{n+1} [D̄_{n+1} f_n + b̃_{n+1} + M̃_{n+1} + (D_{n+1} − D̄_{n+1}) f_n].   (3.3)

For general θ ∈ ℝ^{|I|} let

𝒟 := {D : D a diagonal matrix with D_{ii} ∈ [η,1] for all i ∈ I}   (3.4)

and define

𝓕(θ) := 𝒟 · F(θ) = {Df : D ∈ 𝒟, f ∈ F(θ)}.   (3.5)

If F is Lipschitz continuous, direct comparisons can be made between the mean field, 𝓕, and the analogous mean field from (1.3) which is used by Borkar [8] and Konda and Borkar [13]. This provides the key insight into the new approach we take. Under the assumptions used in Section 3.2 the equivalent M(t) values almost surely lie in 𝒟. By combining this with F we produce a differential inclusion which is more straightforward to study than a non-autonomous differential equation and which naturally fits the stochastic approximation framework of Benaïm, Hofbauer and Sorin [4]. In addition, this idea naturally lends itself to examining a similar process for a set-valued mean field, as we proceed to do in this paper.

Equation (3.3) can be expressed in the form of a stochastic approximation with a set-valued mean field as in [4]:

θ_{n+1} = θ_n + ᾱ_{n+1} [g_n + b̃_{n+1} + Ũ_{n+1}],   (3.6)

where g_n := D̄_{n+1} f_n ∈ 𝓕(θ_n) and Ũ_{n+1} := M̃_{n+1} + (D_{n+1} − D̄_{n+1}) f_n. Let τ̄_0 := 0, τ̄_n := Σ_{i=1}^{n} ᾱ_i be the timescale for the asynchronous updates. To allow the process to be analysed in continuous time, consider a linearly interpolated version of the stochastic approximation (3.6),

z(τ̄_n + s) := θ_n + s (θ_{n+1} − θ_n)/ᾱ_{n+1},   s ∈ [0, ᾱ_{n+1}).   (3.7)

Under the assumptions (A1)-(A5), presented in Section 3.2, we show in Section 3.3 that a sequence, {D̄_n}, can be defined such that the terms b̃_n and Ũ_n satisfy the Kushner and Clark noise condition in (2.2). By invoking Theorem 2.3 we obtain our main result, which is proved in Section 3.4.

Theorem 3.1.

Under the assumptions (A1)-(A5), with probability 1, the interpolation z(·) is an asymptotic pseudo-trajectory to the differential inclusion,

dθ/dt ∈ 𝓕(θ).   (3.8)

Directly from Theorem 3.1 and [4, Proposition 3.27] we get the key result concerning the convergence of an asynchronous stochastic approximation process.

Corollary 3.2.

If there is a globally attracting set, A, for the differential inclusion (3.8), and assumptions (A1)-(A5) are satisfied, then the iterative process (2.4) will converge to A.

3.2 Assumptions

Throughout this section we study the convergence properties of the iterative process (2.4). We make reference to the following assumptions, (A1)-(A5), all of which are either standard requirements for a stochastic approximation or can be verified prior to running the asynchronous stochastic approximation process. This is in contrast with the previous work on asynchronous stochastic approximations by Borkar [8] and Konda and Borkar [13].

  (A1) (a) For a compact set C ⊂ ℝ^{|I|}, θ_n ∈ C for all n.

       (b) {b_n} is a bounded sequence such that ‖b_n‖ → 0 as n → ∞.

  (A2) Let {α_n} satisfy the following criteria:

       (a) Σ_n α_n = ∞ and α_n → 0 as n → ∞,

       (b) For x ∈ (0,1), sup_n α_{⌊xn⌋}/α_n < ∞, where ⌊xn⌋ means the integer part of xn. In addition, for all n, α_{n+1} ≤ α_n.

  (A3) F is a stochastic approximation map.

(A1)(a) is a slight strengthening of the standard stochastic approximation boundedness assumption; however this is still a relatively mild condition. Methods to ensure that it is satisfied are discussed elsewhere, for example in [12], [13] or [23]. A basic restriction is placed on {b_n} in (A1)(b); in this form the sequence does not affect the asymptotic behaviour of the process. (A2)(a) is a standard assumption required for stochastic approximation, and (A2)(b) is a mild technical condition required to deal with the asynchronicity, which is also used by Borkar [8]. We have dropped an additional restriction on the step sizes used by Borkar which severely restricts the possible choices of {α_n}. (A3) ensures that we can use the convergence results presented in Theorem 2.3 and is a standard assumption for stochastic approximations with a set-valued mean field.

Define 𝒮 ⊆ 2^I as the set of all the possible updating combinations which have positive probability of occurring. As an example, if every element of I gets updated and it is known that S_n is a singleton for each n, then 𝒮 = {{i} : i ∈ I}.

Let ℱ_n be a sigma algebra containing all the information up to and including the n-th iteration. That is, ℱ_n := σ(θ_m, S_m, M_m, b_m ; m ≤ n).

  (A4) (a) For all s ∈ 2^I and n ∈ ℕ,

       P(S_{n+1} = s | ℱ_n) = P(S_{n+1} = s | S_n, θ_n).

       Let Q_θ(s, s′) := P(S_{n+1} = s′ | S_n = s, θ_n = θ).

       (b) For all θ the transition probabilities Q_θ(·,·) form an aperiodic, irreducible, positive recurrent Markov chain over 𝒮, and for all i ∈ I there exists an s ∈ 𝒮 such that i ∈ s.

       (c) The map θ ↦ Q_θ is Lipschitz continuous.

(A4)(a) assumes that the transitions between the updated elements in I are part of a controlled Markov chain. (A4)(b) is a straightforward assumption on this controlled Markov chain which can be verified prior to implementation, and which allows us to negate the need for some of the original technical assumptions made by Konda and Borkar [13]. In this previous work Konda and Borkar assume that every state is updated at a comparable rate in the limit, which cannot directly be verified prior to running the process. (A4)(c) is a condition which is required later to use a result from Ma et al. [20] on the convergence of stochastic approximation with Markovian noise.
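In simple cases the behaviour required by (A4) can be checked by direct simulation. The following sketch, an illustration of ours with made-up transition probabilities, runs an aperiodic, irreducible Markov chain over the update sets {{0}, {1}, {0,1}}; since every component belongs to some reachable set, each component is updated at a long-run rate bounded away from zero.

```python
import numpy as np

def update_rates(P, sets, n_steps, rng=None):
    """Simulate a Markov chain over a finite collection of update sets
    and return, for each component, the fraction of iterations in which
    that component was updated."""
    rng = np.random.default_rng(2) if rng is None else rng
    K = 1 + max(max(s) for s in sets)
    state = 0
    hits = np.zeros(K)
    for _ in range(n_steps):
        state = rng.choice(len(sets), p=P[state])   # Markov transition
        for i in sets[state]:
            hits[i] += 1
    return hits / n_steps

# An aperiodic, irreducible chain over the update sets {{0}, {1}, {0,1}}.
P = np.array([[0.1, 0.6, 0.3],
              [0.5, 0.2, 0.3],
              [0.4, 0.4, 0.2]])
rates = update_rates(P, [{0}, {1}, {0, 1}], n_steps=20000)
```

The empirical update rates approximate the stationary probabilities of the sets containing each component, which is the quantity the later weak convergence arguments bound away from zero.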

  (A5) (a) For some q ≥ 2,

       sup_n E[‖M_{n+1}‖^q | ℱ_n] < ∞  and  Σ_n ᾱ_n^{1+q/2} < ∞.

       (b) Take M_{n+1} independent of M_m for m ≤ n and independent of S_{n+1} given ℱ_n, for all n. Then there exists a positive Γ such that for all n and all λ ∈ ℝ^{|I|},

       E[exp(⟨λ, M_{n+1}⟩) | ℱ_n] ≤ exp(Γ‖λ‖²/2)

       and

       Σ_n exp(−c/ᾱ_n) < ∞

       for each c > 0.

  We say that (A5) holds if either (A5)(a) or (A5)(b) is true.

An assumption similar to (A5) is used by Benaïm, Hofbauer and Sorin [4] to verify a condition for noise convergence, and is similar to that used by Kushner and Clark [15]. We use this assumption only to show that the noise term still satisfies the Kushner-Clark condition, with the convergence given in Lemma 3.3; the proof is presented via two lemmas in Appendix A.2.

Lemma 3.3.

Assume that (A2)(b), (A4) and (A5) hold. Then with probability 1, for all T > 0,

lim_{n→∞} sup_{k : τ̄_k ≤ τ̄_n + T} ‖ Σ_{i=n}^{k−1} ᾱ_{i+1} M̃_{i+1} ‖ = 0,

where τ̄_0 := 0, τ̄_n := Σ_{i=1}^{n} ᾱ_i and M̃_{i+1} = D_{i+1} M_{i+1}.

Note that if Lemma 3.3 can be verified directly without (A5) then this assumption is redundant, and hence we only require (A1)-(A4) and Lemma 3.3 to hold. This approach is used in Section 4.

3.3 Weak Convergence of Asynchronous Updates

As discussed in the introduction, the key issue with asynchronous stochastic approximations is how to handle the interaction of the relative step sizes and the mean field. It is important to be able to bound the limit of the relative step sizes, μ^i_n, away from zero for all i ∈ I in order to produce an asynchronous stochastic approximation mean field which will behave similarly to the synchronous mean field, F. However, the relative step size of a state is zero whenever that state is not updated, hence it is not immediately clear that this is even possible. Despite this, it is sufficient that for any i ∈ I an ‘average’ of μ^i over intervals of length T in the continuous time interpolation converges to a value which is bounded away from zero. In this section we prove that under (A2)(b) and (A4) this is indeed the case.

For T > 0 the space L₂([0,T]; ℝ^{|I|}) is the set of measurable functions g : [0,T] → ℝ^{|I|} such that,

∫₀ᵀ ‖g(t)‖² dt < ∞.

Following the method used in [10] and [13], define 𝒰 to be the space of measurable maps u : [0,∞) → [0,1]^{|I|} with the coarsest topology which for all T > 0 leaves continuous the map,

u ↦ ∫₀ᵀ ⟨u(t), g(t)⟩ dt,

for all g ∈ L₂([0,T]; ℝ^{|I|}). Hence 𝒰 is a space of trajectories. This means that for any map defined on 𝒰, convergence to a limit point will be in the weak sense, along a subsequence. That is, a sequence of maps {u_n} with u_n ∈ 𝒰 for all n is said to possess a limit point u ∈ 𝒰 if for fixed T > 0 there exists a subsequence {u_{n_k}} such that for any g ∈ L₂([0,T]; ℝ^{|I|}),

lim_{k→∞} ∫₀ᵀ ⟨u_{n_k}(t) − u(t), g(t)⟩ dt = 0.   (3.9)

Many authors provide a more detailed discussion on weak convergence; for example [6], [10, Appendix A] or [11].

Now we extend the relative step sizes, μ^i_n, to continuous time; for all i ∈ I let μ^i(t) := μ^i_n for t ∈ [τ̄_{n−1}, τ̄_n) and let μ(t) := (μ^1(t), …, μ^{|I|}(t)). For all s ≥ 0 and t ≥ 0 define μ_s(t) := μ(s + t).

Lemma 3.4.

Under (A2)(b) and (A4), for all i ∈ I and for any sequence s_n → ∞, μ_{s_n} converges along a subsequence to a limit point u ∈ 𝒰 such that for some η > 0 and any t ≥ 0, T > 0,

(1/T) ∫_t^{t+T} u^i(s) ds ≥ η.   (3.10)
Proof.

See Appendix A.3. ∎

Corollary 3.5.

For any sequence s_n → ∞ let u be a limit point of μ_{s_n} in 𝒰; then under (A2)(b) and (A4) there exists an η > 0 such that for all i ∈ I and any t ≥ 0, T > 0,

(1/T) ∫_t^{t+T} u^i(s) ds ≥ η.

Proof.

See Appendix A.3. ∎

We now expand upon the discussions in Sections 2.2 and 3.1 on producing a sequence of matrices {D̄_n}. In order to use the differential inclusions framework described in Section 2.1 we need to define a sequence of diagonal matrices, {D̄_n}, with diagonal entries which are always in the set [η,1], for some η > 0, and such that the D̄_n terms converge to the same limit as the terms of D_n. Recall that D_n is a diagonal matrix containing the μ^i_n terms.

Fix η taken from Lemma 3.4 and define a new function μ̄ : [0,∞) → [η,1]^{|I|} such that

μ̄^i(t) := max{μ^i(t), η}.

For all n let D̄_n := diag(μ̄^1(τ̄_{n−1}), …, μ̄^{|I|}(τ̄_{n−1})). Corollary 3.5 shows that, with respect to the topology of 𝒰, in the limit μ̄^i(t) = μ^i(t) for almost every t, and similarly for all i ∈ I. From this it is clear that μ_{s_n} and μ̄_{s_n} have the same limit points in 𝒰. That is, if u is a limit point of μ_{s_n} then it is also a limit point of μ̄_{s_n}. Hence for any s_n → ∞ there exists a subsequence {s_{n_k}} such that for g ∈ L₂([0,T]; ℝ^{|I|}),

lim_{k→∞} ∫₀ᵀ ⟨μ̄_{s_{n_k}}(t) − u(t), g(t)⟩ dt = 0.

However, the key interest here is in the convergence of μ_{s_n} and μ̄_{s_n} themselves. Following the reasoning of Borkar [8] and Konda and Borkar [13], it does not matter whether μ_{s_n} and μ̄_{s_n} converge directly or via a subsequence, as this does not affect the convergence of the associated continuous processes. Hence we can say that μ_{s_n}, μ̄_{s_n} converge weakly to a limit point u, or equivalently, if g is any bounded, continuous function then for all T > 0,

lim_{n→∞} ∫₀ᵀ ⟨μ̄_{s_n}(t) − u(t), g(t)⟩ dt = 0.   (3.11)

Define D̄(t) to be the diagonal matrix of the μ̄^i(t) terms and let D(t) := diag(μ^1(t), …, μ^{|I|}(t)).

Lemma 3.6.

Almost surely under assumptions (A2)(b) and (A4), for all T > 0,

lim_{n→∞} sup_{k : τ̄_k ≤ τ̄_n + T} ‖ Σ_{i=n}^{k−1} ᾱ_{i+1} (D_{i+1} − D̄_{i+1}) f_i ‖ = 0.

Proof.

See Appendix A.4. The proof relies on (3.11). ∎

3.4 Proof of Theorem 3.1

We must verify that the four conditions of Theorem 2.3 hold for the stochastic approximation process in (3.6) to ascertain that z(·) is an asymptotic pseudo-trajectory of (3.8).

Fix T > 0. Then consider,

sup_{k : τ̄_k ≤ τ̄_n + T} ‖ Σ_{i=n}^{k−1} ᾱ_{i+1} M̃_{i+1} ‖   (3.12)

sup_{k : τ̄_k ≤ τ̄_n + T} ‖ Σ_{i=n}^{k−1} ᾱ_{i+1} (D_{i+1} − D̄_{i+1}) f_i ‖.   (3.13)

Using Lemma 3.3 and Lemma 3.6 immediately gives that (3.12) and (3.13) converge to zero a.s. as n → ∞, and hence this verifies that property (i) holds. Assumption (A1)(a) directly gives that (ii) holds. Lastly, it is straightforward to verify that, under (A1)-(A5), 𝓕 is a stochastic approximation map, which verifies condition (iii), and (A1)(b) is equivalent to (iv).

4 Two-timescale Asynchronous Stochastic Approximation

A useful extension of standard stochastic approximations is to two timescales. This concept was originally introduced by Borkar [7] and has since been used by Leslie and Collins [18] for multiple timescales and by Konda and Borkar [13] for two-timescale asynchronous stochastic approximation. If we have a coupled pair of stochastic approximations in which one system updates more aggressively than the other, then in the limit the aggressive (‘fast’) process is always fully calibrated to the current value of the other (‘slow’) process. This is all controlled through the user’s choice of step sizes in the stochastic approximation. The main result in this section is Corollary 4.8, which comes from combining Theorem 3.1 with the previous work of Konda and Borkar [13].
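The two-timescale mechanism can be sketched numerically. In the toy example below, our own illustration rather than the paper's algorithm, the fast variable has mean field θ − ψ, so it tracks the fixed point ψ*(θ) = θ, while the slow variable has mean field 1 − ψ; with step sizes 1/n (slow) and n^{−0.6} (fast) the ratio of slow to fast step sizes vanishes, and both variables settle near 1.

```python
import numpy as np

def two_timescale(n_steps, rng=None):
    """Coupled two-timescale iteration: the fast variable psi (step size
    beta_n = n**-0.6) tracks the fixed point psi*(theta) = theta of its
    mean field, while the slow variable theta (step size alpha_n = 1/n)
    effectively sees psi already equilibrated."""
    rng = np.random.default_rng(3) if rng is None else rng
    theta, psi = 4.0, 0.0
    for n in range(1, n_steps + 1):
        alpha = 1.0 / n              # slow step size
        beta = 1.0 / n ** 0.6        # fast step size: alpha_n/beta_n -> 0
        psi += beta * ((theta - psi) + rng.normal(0.0, 0.1))
        theta += alpha * ((1.0 - psi) + rng.normal(0.0, 0.1))
    return theta, psi

# With psi tracking theta, the slow dynamics reduce to dtheta/dt = 1 - theta,
# so both variables settle near 1.
theta_tt, psi_tt = two_timescale(50000)
```

Because the fast variable equilibrates between effective moves of the slow one, the slow variable behaves as if the fast system were already at its fixed point, which is the essence of the two-timescale argument.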

4.1 Notation

In what is to follow we consider the extension of Theorem 3.1 to the two-timescale setting, with updates θ_n and ψ_n on different timescales. Let I be the set of individual elements of the θ process as in Section 3, and define J similarly for the ψ process. Let |I| and |J| be the sizes of these sets, so that for all n, θ_n ∈ ℝ^{|I|} and ψ_n ∈ ℝ^{|J|}. As in Section 3 let 𝒮 be the set containing all combinations of elements in I which have a positive probability of being part of the asynchronous update, and define 𝒯 in the same manner for the ψ process. At iteration n let S_n ∈ 𝒮 and T_n ∈ 𝒯 be the updated components of each timescale respectively. Let each component of the two processes have a counter for the number of times it has been selected to be updated, defined by,

ν^i(n) := Σ_{m=1}^{n} 𝟙{i ∈ S_m},   φ^j(n) := Σ_{m=1}^{n} 𝟙{j ∈ T_m}.

Here ν^i(n) is as in Section 3 and φ^j(n) has an analogous definition for the ψ process. Let M_n, N_n be martingale noise processes defined on ℝ^{|I|} and ℝ^{|J|} respectively, and let b_n, c_n be similarly defined error sequences on ℝ^{|I|} and ℝ^{|J|} respectively. Let M^i_n, b^i_n be the i-th components of M_n and b_n, and similarly let N^j_n, c^j_n be the j-th components of N_n and c_n. As in the previous sections {α_n}, and now {β_n}, are positive, decreasing sequences of step sizes. Similar restrictions to those in (A2) will be placed on {β_n}, with an additional requirement for the two-timescale arguments to be valid; this will be made precise in Section 4.2. Finally, F : ℝ^{|I|} × ℝ^{|J|} ⇉ ℝ^{|I|} and G : ℝ^{|I|} × ℝ^{|J|} ⇉ ℝ^{|J|} are set-valued maps, where F^i(θ, ψ) is the i-th component of F(θ, ψ) and similarly for G^j(θ, ψ). For all i ∈ I and j ∈ J consider the following coupled process,

θ^i_{n+1} = θ^i_n + α_{ν^i(n+1)} 𝟙{i ∈ S_{n+1}} [F^i(θ_n, ψ_n) + b^i_{n+1} + M^i_{n+1}],
ψ^j_{n+1} = ψ^j_n + β_{φ^j(n+1)} 𝟙{j ∈ T_{n+1}} [G^j(θ_n, ψ_n) + c^j_{n+1} + N^j_{n+1}].   (4.1)

Notice that the only change to the first process from Sections 2 and 3 is that the mean field now depends on ψ_n as well as θ_n. It follows that the asynchronous and relative step sizes retain the same form. Recall these definitions and extend them for the ψ process:

ᾱ^i_n := α_{ν^i(n)} 𝟙{i ∈ S_n},   ᾱ_n := max_{i ∈ I} ᾱ^i_n,   μ^i_n := ᾱ^i_n / ᾱ_n,
β̄^j_n := β_{φ^j(n)} 𝟙{j ∈ T_n},   β̄_n := max_{j ∈ J} β̄^j_n,   λ^j_n := β̄^j_n / β̄_n.

As in Section 2.2 let D_n be the diagonal matrix of the μ^i_n terms, and similarly let E_n be the diagonal matrix of the λ^j_n terms. The coupled stochastic process (4.1) can be written more concisely as,

θ_{n+1} = θ_n + ᾱ_{n+1} D_{n+1} [F(θ_n, ψ_n) + b_{n+1} + M_{n+1}],
ψ_{n+1} = ψ_n + β̄_{n+1} E_{n+1} [G(θ_n, ψ_n) + c_{n+1} + N_{n+1}].   (4.2)

Finally, define the two timescales; let τ̄_0 := 0, τ̄_n := Σ_{i=1}^{n} ᾱ_i, σ̄_0 := 0 and σ̄_n := Σ_{i=1}^{n} β̄_i. The division of time on the ‘slow’ timescale is given by the increments ᾱ_n, and similarly β̄_n gives the division of time on the ‘fast’ timescale. In a similar manner to the previous sections let z(·) and w(·) be the linear interpolations of θ_n and ψ_n on their respective timescales.

4.2 Assumptions

We state the assumptions (B1)-(B6) used for the convergence results of the two-timescale algorithm (4.2). These are exactly analogous to (A1)-(A5) and are simply extended to accommodate the two-timescale framework. The exceptions to this are (B2)(c) and (B6) and the slight adaptations to (B3), which are in line with those used by Borkar [7] and Konda and Borkar [13]. In (B4) we have produced a single combined Markov chain instead of one for each of the θ and ψ processes, to present a clearer assumption.

  (B1) (a) For compact sets, C_θ ⊂ ℝ^{|I|} and C_ψ ⊂ ℝ^{|J|}, θ_n ∈ C_θ and ψ_n ∈ C_ψ for all n.

       (b) {b_n} and {c_n} are bounded sequences such that ‖b_n‖, ‖c_n‖ → 0 as n → ∞.

  (B2) The following must be true for {α_n} and {β_n}:

       (a) Σ_n α_n = ∞, Σ_n β_n = ∞ and α_n, β_n → 0 as n → ∞,

       (b) For x ∈ (0,1), sup_n α_{⌊xn⌋}/α_n < ∞ and sup_n β_{⌊xn⌋}/β_n < ∞. In addition, for all n, α_{n+1} ≤ α_n and β_{n+1} ≤ β_n,

       (c) α_n / β_n → 0 as n → ∞.

  (B3) (a) For all θ ∈ ℝ^{|I|}, ψ ⇉ F(θ, ψ) is upper semi-continuous, and for all ψ ∈ ℝ^{|J|}, F(·, ψ) is a stochastic approximation map.

       (b) For all θ ∈ ℝ^{|I|}, G(θ, ·) is a stochastic approximation map.

The first and second assumptions are direct extensions of (A1) and (A2) to two timescales, with the addition of (B2)(c), which is a standard two-timescale assumption used by Borkar [7]. Condition (B3)(a) is similar to (A3) for the ‘slow’ timescale, however it must hold for all values of the ‘fast’ timescale. (B3)(b) is a similar condition for the ‘fast’ timescale.

Define 𝒲 ⊆ 𝒮 × 𝒯 such that if s ∈ 𝒮 and t ∈ 𝒯 then (s, t) ∈ 𝒲 if and only if s and t have a positive probability of occurring simultaneously (at the same iteration). This means that 𝒲 is the set of combinations of elements across I and J which have positive probability of being updated at any particular iteration. At iteration n, W_n := (S_n, T_n) is taken to be the updated components across I and J. In addition, let ℱ_n be a sigma algebra containing all the information up to and including the n-th iteration. That is