Adaptive Bases for Reinforcement Learning

05/02/2010 ∙ by Dotan Di Castro, et al. ∙ 0

We consider the problem of reinforcement learning using function approximation, where the approximating basis can change dynamically while interacting with the environment. A motivation for such an approach is maximizing the value function fitness to the problem faced. Three errors are considered: approximation square error, Bellman residual, and projected Bellman residual. Algorithms under the actor-critic framework are presented, and shown to converge. The advantage of such an adaptive basis is demonstrated in simulations.


1 Introduction

Reinforcement Learning (RL) [4] is an approach for solving Markov Decision Processes (MDPs), when interacting with an unknown environment. One of the main obstacles in applying RL methods is how to cope with a large state space. In general, the underlying methods are based on dynamic programming, and include adaptive schemes that mimic either value iteration, such as Q-learning, or policy iteration, such as Actor-Critic (AC) methods. While the former attempt to directly learn the optimal value function, the latter are based on quickly learning the value of the currently used policy, followed by a slower policy improvement step. In this paper we focus on AC methods.

There are two major problems when solving MDPs with a large state space. The first is the storage problem, i.e., it is impractical to store the value function and the optimal action explicitly for each state. The second is generalization: some notion of similarity between states is needed since most states are not visited or are visited only a few times. These issues are addressed by the Function Approximation (FA) approach [4], which involves approximating the value function by functional approximators with a smaller number of parameters in comparison to the original number of states. The success of this approach rests mainly on selecting appropriate features, and on a proper choice of the approximation architecture. In a linear approximation architecture, the value of a state is determined by a linear combination of the entries of a low-dimensional feature vector. In the RL context, linear architectures enjoy convergence results and performance guarantees (e.g., [4]).

The approximation quality depends on the choice of the basis functions. In this paper we consider the possibility of tuning the basis functions on-line, under the AC framework. As mentioned before, an agent interacting with the environment is composed of two sub-systems. The first is a critic, which estimates the value function for the states encountered. This sub-system acts on a fast time scale. The second is an actor that, based on the critic output, and mainly the temporal-difference (TD) signal, improves the agent’s policy using gradient methods. The actor operates on a second time scale, slower than the time scale of the critic. Bhatnagar et al. [5] proved that such an algorithm, with an appropriate relation between the time scales, converges.

We propose adding a third time scale that is slower than both the critic and the actor, minimizing some error criterion while adapting the critic’s basis functions to better fit the problem. Convergence of the value function, the policy, and the basis is guaranteed in such an architecture, and simulations show that a dramatic improvement can be achieved using basis adaptation.

Using multiple time scales may, at first sight, appear to slow convergence. Two observations mitigate this concern. First, a recent work of Mokkadem and Pelletier [12], based on previous research by Polyak [13] and others, has demonstrated that combining the algorithm iterates with the averaging method of [13] leads to a convergence rate in distribution that is the same as the optimal rate. Second, in multiple time scales the ratio between the step sizes of the slower and faster time scales should converge to zero. Thus, step-size sequences that are close to one another, that operate near the fast time scale, and that satisfy the condition above are easy to find for any practical need.

There are several works in the area of adaptive bases. These works do not address the problem of policy improvement with adaptive bases. We mention here two notable works which are similar in spirit to our work. The first is that of Menache et al. [11]. Two algorithms were suggested for adaptive bases by the authors: one algorithm is based on gradient methods for least-squares TD (LSTD) of Bradtke and Barto [2], and the other algorithm is based on the cross entropy method. Both algorithms were demonstrated in simulations to achieve better performance than their fixed basis counterparts, but no convergence guarantees were supplied. Yu and Bertsekas [19] suggested several algorithms for two main problem classes: policy evaluation and optimal stopping. The former is closer to our work than the latter, so we focus on this class. Three target functions were considered in that work: mean TD error, Bellman error, and projected Bellman error. The main difference between [19] and our work (besides the policy improvement) is the following. The algorithmic variants suggested in [19] are in the flavor of LSTD and LSPE algorithms [3], while in our work the algorithms are TD based; thus, in our work no matrix inversion is involved. Also, we demonstrate the effectiveness of the algorithms in the current work.

The paper is organized as follows. In Section 2 we define some preliminaries and outline the framework. In Section 3 we introduce the algorithms suggested for adaptive bases. In Section 4 we show the convergence of the algorithms suggested, while in Section 5 we demonstrate the algorithms in simulations. In Section 6 we discuss the results.

2 Preliminaries

In this section, we introduce the framework, review actor-critic algorithms, overview multiple time scales stochastic approximation (MTS-SA), and state a related theorem which will be used later in proving the main results.

2.1 The Framework

We consider an agent interacting with an unknown environment that is modeled by a Markov Decision Process (MDP) [14] in discrete time with a finite state set $\mathcal{X}$ and an action set $\mathcal{U}$, where $N = |\mathcal{X}|$. Each selected action $u \in \mathcal{U}$ of the agent determines a stochastic transition matrix $P_u = [P_u(y|x)]_{x,y \in \mathcal{X}}$, where $y$ is the state that follows the state $x$.

For each state $x \in \mathcal{X}$ the agent receives a corresponding reward $r(x)$ that depends only on the current state. (Generalizing the results presented here to state-action rewards is straightforward.) The agent maintains a parameterized policy function, which is a probabilistic function, denoted by $\mu_\theta(u|x)$, mapping an observation $x \in \mathcal{X}$ into a probability distribution over the controls $\mathcal{U}$. The parameter $\theta$ is a tunable parameter vector, where $\mu_\theta(u|x)$ is a differentiable function w.r.t. $\theta$. We note that for different $\theta$’s, different probability distributions over $\mathcal{U}$ may be associated with each $x \in \mathcal{X}$. We denote by $x_0, u_0, r_0, x_1, u_1, r_1, \ldots$ a state-action-reward trajectory, where the subindex specifies time.

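Under each policy induced by $\mu_\theta(u|x)$, the environment and the agent induce together a Markovian transition function, denoted by $P_\theta(y|x)$, satisfying $P_\theta(y|x) = \sum_{u \in \mathcal{U}} \mu_\theta(u|x) P_u(y|x)$. The Markovian transition function induces a stationary distribution over the state space $\mathcal{X}$, denoted by $\pi_\theta$. This distribution induces a natural norm, denoted by $\|\cdot\|_{\pi_\theta}$, which is a weighted $\ell_2$ norm defined by $\|v\|_{\pi_\theta}^2 = \sum_{x \in \mathcal{X}} \pi_\theta(x) v(x)^2$. Note that when the parameter $\theta$ changes, the norm changes as well. We denote by $E_\theta[\cdot]$ the expectation operator w.r.t. the measures $P_\theta$ and $\pi_\theta$. There are several performance criteria investigated in the RL literature that differ mainly in their time horizon and the treatment of future rewards [4]. In this work we focus on the average reward criterion, defined by

$$\eta_\theta = \sum_{x \in \mathcal{X}} \pi_\theta(x)\, r(x). \qquad (1)$$

The agent’s goal is to find the parameter $\theta$ that maximizes $\eta_\theta$. Similarly, define the (differential) value function as

$$h_\theta(x) = E_\theta\Big[\sum_{n=0}^{\tau-1} \big(r(x_n) - \eta_\theta\big) \,\Big|\, x_0 = x\Big], \qquad (2)$$

where $\tau = \min\{k > 0 : x_k = x^*\}$ and $x^*$ is some recurrent state for all policies, which we assume to exist. Define the Bellman operator as $T_\theta h(x) = r(x) - \eta_\theta + \sum_{y \in \mathcal{X}} P_\theta(y|x) h(y)$. Based on (2), it is easy to show the following connection between the average reward and the value function under a given policy [3]:

$$h_\theta(x) = r(x) - \eta_\theta + \sum_{y \in \mathcal{X}} P_\theta(y|x)\, h_\theta(y). \qquad (3)$$

For later use, we denote by $h_\theta$ and $r$ the column-vector representations of $h_\theta(x)$ and $r(x)$, respectively.

We define the Temporal Difference (TD) [4, 16] of the state $x$ followed by the state $y$ as $d(x,y) = r(x) - \eta_\theta + h_\theta(y) - h_\theta(x)$, where for a specific time $n$ we abbreviate $d(x_n, x_{n+1})$ as $d_n$. Based on (3) we can see that

$$E_\theta\big[d(x,y) \mid x\big] = 0. \qquad (4)$$

Based on this property, a wide family of algorithms known as TD algorithms exists [4]; common to all of these algorithms is solving (4) iteratively.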
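
The zero-mean property of the TD signal can be checked numerically on a small chain. The following is a minimal sketch, assuming a fixed policy so that only one transition matrix is needed; the three-state matrix `P`, the rewards `r`, and the helper names are hypothetical and chosen only for illustration. It computes the stationary distribution, the average reward, and a differential value function normalized to zero at state 0, and then evaluates the conditional mean of the TD signal in each state.

```python
def stationary(P, iters=2000):
    """Stationary distribution of a small ergodic chain via power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[x] * P[x][y] for x in range(n)) for y in range(n)]
    return pi


def differential_value(P, r, eta, iters=2000):
    """Iterate h <- r - eta + P h, then normalize so that h[0] = 0."""
    n = len(P)
    h = [0.0] * n
    for _ in range(iters):
        h = [r[x] - eta + sum(P[x][y] * h[y] for y in range(n)) for x in range(n)]
    return [v - h[0] for v in h]


# Hypothetical three-state chain and state rewards (illustration only).
P = [[0.5, 0.5, 0.0],
     [0.1, 0.6, 0.3],
     [0.2, 0.0, 0.8]]
r = [1.0, 0.0, 2.0]

pi = stationary(P)
eta = sum(p * rew for p, rew in zip(pi, r))      # average reward
h = differential_value(P, r, eta)

# Conditional mean of the TD signal r(x) - eta + h(y) - h(x) given x.
mean_td = [r[x] - eta + sum(P[x][y] * h[y] for y in range(3)) - h[x]
           for x in range(3)]
```

Each entry of `mean_td` should be zero up to numerical error, which is the property that TD algorithms exploit.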

Notational comment: from now on, we omit the dependency on whenever it is clear from the context.

2.2 Actor-Critic Algorithms

A well known class of RL approaches is the so called actor-critic (AC) algorithms, where the agent is divided into two components, an actor and a critic. The critic functions as a state value estimator using the so called TD-learning algorithm, whereas the actor attempts to select actions based on the TD signal estimated by the critic. These two components solve their own optimization problems separately interacting with each other.

The critic typically uses a function approximator which approximates the value function in a subspace of a reduced dimension . Define the basis matrix

(5)

where its columns span the subspace . Thus, the approximation to the value function is , where is the solution of the following quadratic program . This solution yields the linear projection operator,

(6)

that satisfies

(7)

where is the vector representation of . Abusing notation, we define the (state dependent) projection operator on as .
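
The weighted least-squares projection described above can be illustrated on a tiny example. This is a minimal sketch, assuming a two-column basis so that the normal equations can be solved in closed form; the basis `Phi`, the weights `pi`, and the target `h` are hypothetical numbers, not values from the paper.

```python
def weighted_projection(Phi, pi, h):
    """Weights minimizing sum_x pi[x] * (h[x] - Phi[x] . w)^2 for two basis columns."""
    a = sum(p * f[0] * f[0] for p, f in zip(pi, Phi))
    b = sum(p * f[0] * f[1] for p, f in zip(pi, Phi))
    c = sum(p * f[1] * f[1] for p, f in zip(pi, Phi))
    g0 = sum(p * f[0] * v for p, f, v in zip(pi, Phi, h))
    g1 = sum(p * f[1] * v for p, f, v in zip(pi, Phi, h))
    det = a * c - b * b          # non-zero when the columns are linearly independent
    return [(c * g0 - b * g1) / det, (a * g1 - b * g0) / det]


# Hypothetical data: four states, basis columns (1, x), stationary weights pi.
Phi = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
pi = [0.4, 0.3, 0.2, 0.1]
h = [0.0, 1.0, 0.5, 3.0]

w = weighted_projection(Phi, pi, h)
h_hat = [f[0] * w[0] + f[1] * w[1] for f in Phi]     # projected value function
residual = [v - a for v, a in zip(h, h_hat)]
```

The residual is orthogonal to both basis columns under the `pi`-weighted inner product, which is the defining property of the projection.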

As mentioned above, the actor receives the TD signal from the critic and, based on this signal, tries to select the optimal action. As described in Section 2.1, the actor maintains a policy function $\mu_\theta(u|x)$. In the following, we state a lemma that serves as the foundation for the policy gradient algorithm described later. The lemma relates the gradient w.r.t. $\theta$ of the average reward, $\nabla_\theta \eta_\theta$, to the TD signal, $d(x,y)$. Define the likelihood ratio derivative as $\psi(x,u) = \nabla_\theta \mu_\theta(u|x) / \mu_\theta(u|x)$. We omit the dependency of $\psi$ on $x$, $u$, and $\theta$ throughout the paper. The following assumption states that $\psi$ is bounded.

Assumption 1

For all $x \in \mathcal{X}$, $u \in \mathcal{U}$, and $\theta$, there exists a positive constant, $B_\psi$, such that $\|\psi(x,u)\| \le B_\psi$.

Based on this, we present the following lemma that relates the gradient of $\eta_\theta$ to the TD signal [5].

Lemma 2

The gradient of the average reward (w.r.t. $\theta$) can be expressed as $\nabla_\theta \eta_\theta = E_\theta[\psi(x,u)\, d(x,y)]$.
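
To make the likelihood ratio derivative concrete, here is a minimal sketch that assumes a softmax (Gibbs) policy over per-action feature vectors. The softmax form, the feature values, and the function names are assumptions for illustration; the text above only requires that the policy be differentiable. For such a policy the likelihood ratio derivative is the action's feature vector minus the policy-averaged feature vector.

```python
import math


def softmax_policy(theta, feats):
    """Action probabilities; feats[u] is the feature vector of action u in the current state."""
    scores = [sum(t * f for t, f in zip(theta, fu)) for fu in feats]
    m = max(scores)                                  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def score(theta, feats, u):
    """Likelihood ratio derivative: gradient of log policy(u) w.r.t. theta."""
    mu = softmax_policy(theta, feats)
    mean = [sum(mu[a] * feats[a][k] for a in range(len(feats)))
            for k in range(len(theta))]
    return [feats[u][k] - mean[k] for k in range(len(theta))]


# Hypothetical parameter and action features (illustration only).
theta = [0.3, -0.2]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mu = softmax_policy(theta, feats)

# A single-sample gradient estimate is score * TD signal, here with a made-up TD value.
d_sample = 0.5
grad_sample = [d_sample * g for g in score(theta, feats, 2)]
```

The policy-weighted average of the score over actions is zero, so subtracting any state-dependent baseline from the TD signal leaves the expected gradient unchanged.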

2.3 Multiple Time Scales Stochastic Approximation

Stochastic approximation (SA), and in particular the ODE approach [9], is a widely used method for investigating the asymptotic behavior of stochastic iterates. For example, consider the following stochastic iterate

where is some random process and are step sizes that form a positive series satisfying conditions to be defined later. The key idea of the technique is the following. Suppose that the iterate can be decomposed into a mean function, denoted by , and a noise term (martingale difference noise), denoted by ,

(8)

and suppose that the effect of the noise weakens due to repeated averaging. Consider the following ODE which is a continuous version of and

(9)

where the dot above a variable stands for a time derivative. Then, a typical result of the ODE method in the SA theory suggests that the asymptotic limit of (8) and (9) are identical.
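
The ODE viewpoint can be illustrated with a scalar Robbins-Monro style iterate. This is a minimal sketch under assumed choices: the mean function is $-(x - 2)$, the noise is standard Gaussian, and the step size is $1/n$; none of these come from the paper. The associated ODE has a single stable point at 2, and the noisy iterate should settle near it.

```python
import random


def robbins_monro(steps=20000, target=2.0, seed=0):
    """Noisy iterate x <- x + gamma_n * (f(x) + noise) with f(x) = -(x - target)."""
    rng = random.Random(seed)
    x = 0.0
    for n in range(1, steps + 1):
        gamma = 1.0 / n                   # sum diverges, sum of squares converges
        mean_field = -(x - target)        # mean function; ODE: dx/dt = -(x - target)
        noise = rng.gauss(0.0, 1.0)       # zero-mean noise term
        x += gamma * (mean_field + noise)
    return x


x_final = robbins_monro()
```

With these particular choices the iterate is a running average of noisy observations of the target, so the noise is averaged out over time.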

The classical theory of SA considers an iterate, which may be in some finite dimensional Euclidean space. Sometimes, we need to deal with several multidimensional iterates that depend on one another, where each iterate operates on a different time scale. Surprisingly, this type of SA, called multiple time scale SA (MTS-SA), is sometimes easier to analyze than the same iterates operating on a single time scale. The first analysis of two time-scale SA algorithms was given by Borkar in [6] and later expanded to MTS by Leslie and Collins in [10]. In the following we describe the problem of MTS-SA, state the related ODEs, and finally state the conditions under which MTS-SA iterates converge. We follow the definitions of [10].

Consider dependent SA iterates as the following

(10)

where , and . The following assumption contains a standard requirement for MTS-SA step size.

Assumption 3

(MTS-SA step size assumptions)

  1. For , we have

  2. For , we have
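
Step sizes of the kind used in multiple time scale schemes can be sketched as follows. This is an illustration under assumptions: polynomially decaying step sizes with exponents in $(0.5, 1]$, which satisfy the usual Robbins-Monro summability conditions, and a ratio between a slower and a faster step size that tends to zero. The exponents and function names are hypothetical.

```python
def step_sizes(n, exponents=(1.0, 0.8, 0.6)):
    """Return (gamma_1, gamma_2, gamma_3) at step n >= 1; index 3 is the fastest."""
    return tuple(n ** -a for a in exponents)


def ratios(n):
    """Ratios slower/faster for consecutive time scales; both should tend to zero."""
    g1, g2, g3 = step_sizes(n)
    return g1 / g2, g2 / g3


early, late = ratios(10), ratios(10 ** 6)
```

Because the exponents differ by a fixed amount, each ratio decays polynomially in `n`; exponents that are closer together give time scales that separate more slowly.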

We interpret the second requirement in the following way: the higher the index of an iterate, the faster the time scale on which it operates. This is because there exists some $n_0$ such that for all $n > n_0$ the step size of the $i$-th iterate is uniformly larger than the step sizes of the iterates $1, \ldots, i-1$. Thus, the $i$-th iterate advances more than any of the iterates $1, \ldots, i-1$; in other words, it operates on a faster time scale. The following assumption aggregates the main requirements for the MTS-SA iterates.

Assumption 4

(MTS-SA iterate assumptions)

  1. The mean functions in (10) are globally Lipschitz continuous,

  2. For , we have .

  3. For , converges a.s.

  4. (The ODEs requirements) Remark: this requirement is defined recursively where requirement (a) below is the initial requirement related to the -th ODE, and requirement (b) below describes the -th ODE system that is recursively based on the -th ODE system, going from to . Denote .

    1. Define the -th ODE system to be

      (11)

      and suppose the initial condition . Then, there exists a Lipschitz continuous function such that the ODE system (11) converges to the point .

    2. Define the -th ODE system, , to be

      (12)

      where is determined by the -th ODE system, and suppose the initial condition . Then, there exists a Lipschitz continuous function such that the ODE system (12) converges to the point .

The first two requirements are common conditions for SA iterates to converge. The third requirement ensures the noise term asymptotically vanishes. The fourth requirement ensures (using a recursive definition) that for each time scale , where the slower time scales are static and where for the faster time scales there exists a function (which is the solution of the ODE system), there exists a Lipschitz convergent function. Based on these requirements, we cite the following theorem due to Leslie and Collins [10].

Theorem 5

Consider the iterates (10) and suppose Assumptions 3 and 4 hold. Then, the iterates (10) converge asymptotically to the invariant set of the dynamic system

(13)

where is determined by requirement 4 of Assumption 4.

3 Main Results

In this section we present the main theoretical results of the work. We start by introducing adaptive bases and show the algorithms that are derived from choosing different approximating schemes.

3.1 Adaptive Bases

The motivation for adaptive bases is the following. Consider an agent that chooses a basis for the critic in order to approximate the value function. The basis which one chooses with no prior knowledge might not be suitable for the problem at hand. A poor subspace where the actual value function is poorly supported may be chosen. Thus, one might prefer to choose a parameterized basis that has additional flexibility by changing a small set of parameters.

We propose to consider a basis that is linear in some of the parameters but has several other parameters that allow greater flexibility. In other words, we consider bases that are linear with respect to some of the terms (related to the fast time scale), and nonlinear with respect to the rest (related to the slow time scale). The idea is that most probably one does not lose from such an approach in general if it fails, but in many cases it is possible to obtain better fitness and thus a better performance, due to this additional flexibility. Mathematically,

(14)

where is a linear parameter related to the fast time scale, and is the non-linear parameter related to the slow time scale. In the view of (5), we note that from now on the matrix depends on , i.e., , and in matrix form we have , but for ease of exposition we drop the dependency on . The following assumption is needed for proving later results.

Assumption 6

The columns of the matrix are linearly independent, , and , where is a vector of ’s. Moreover, the functions and for are Lipschitz in with a coefficient , and bounded with coefficient .

Notation comment: for ease of exposition, we drop the dependency on , e.g., , . Denote , (where as in Section 2.1, is the state followed the state ), , , and . Thus, and .
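
A basis that is linear in one set of parameters and non-linear in another can be sketched with radial basis functions, the family used later in the mountain car experiment. This is a minimal illustration, assuming one-dimensional states and Gaussian bumps; the names `w`, `centers`, and `widths`, and all numeric values, are hypothetical. The value estimate is linear in `w`, so its gradient w.r.t. `w` is the feature vector itself, while the gradient w.r.t. the centers follows from the chain rule.

```python
import math


def rbf_features(x, centers, widths):
    """Gaussian basis functions evaluated at a scalar state x."""
    return [math.exp(-((x - c) ** 2) / (2.0 * s ** 2))
            for c, s in zip(centers, widths)]


def value(x, w, centers, widths):
    """Value estimate: linear in w, non-linear in centers and widths."""
    return sum(wk * fk for wk, fk in zip(w, rbf_features(x, centers, widths)))


def grad_centers(x, w, centers, widths):
    """Gradient of the value estimate w.r.t. each center."""
    phi = rbf_features(x, centers, widths)
    return [wk * fk * (x - c) / s ** 2
            for wk, fk, c, s in zip(w, phi, centers, widths)]


# Hypothetical parameters (illustration only).
w, centers, widths = [1.0, -0.5], [0.0, 2.0], [1.0, 0.5]
g = grad_centers(0.7, w, centers, widths)
```

Moving a center changes where a basis function is concentrated without changing the number of linear weights, which is the extra flexibility the adaptive basis is meant to provide.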

3.2 Minimum Square Error and TD

Assume a basis parameterized as in (14). The minimum square error (MSE) is defined as

The gradient with respect to is

(15)

where in the approximation we use the bootstrapping method (see [16] for a discussion) in order to get the well known TD algorithm (i.e., substituting ). On top of the above TD algorithm, we take a derivative with respect to , , yielding

(16)

where again we use the bootstrapping method. Note that this equation gives the non-linear TD procedure for the basis parameters. We use SA in order to solve the stochastic equations (15) and (16), which together with Lemma 2 is the basis for the following algorithm. For technical reasons, we add a requirement that the iterates for and are bounded, which in practice is not constraining (see [9] for a discussion of constrained SA).

Algorithm 7

Adaptive basis TD (ABTD).

(17)
(18)
(19)
(20)

where and are projection operators into a non-empty open constraints set whenever and , respectively, and the step size series for satisfy Assumption 3.
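
The structure of such a scheme can be sketched in a simplified setting. The following is not a reproduction of iterates (17)-(20): it is a minimal illustration, assuming a fixed policy (so the actor update is omitted), a symmetric random walk on a ring of five states, Gaussian basis functions with unit width, and clipping as the projection. All names, step-size exponents, and bounds are hypothetical. The average-reward estimate and the linear weights move on the fast time scale, and the basis centers move on the slow one, each driven by the same TD signal.

```python
import math
import random


def features(x, centers):
    """Unit-width Gaussian basis functions at state x."""
    return [math.exp(-((x - c) ** 2) / 2.0) for c in centers]


def clip(v, bound):
    """Projection onto the interval [-bound, bound]."""
    return max(-bound, min(bound, v))


def adaptive_basis_td(steps=20000, n_states=5, seed=0):
    rng = random.Random(seed)
    centers = [0.0, 4.0]      # non-linear basis parameters (slow time scale)
    w = [0.0, 0.0]            # linear critic weights (fast time scale)
    eta, x = 0.0, 0
    for n in range(1, steps + 1):
        fast, slow = n ** -0.6, n ** -1.0            # slow / fast -> 0
        y = (x + rng.choice([-1, 1])) % n_states     # symmetric walk on a ring
        r = float(x)                                 # state-dependent reward
        phi_x, phi_y = features(x, centers), features(y, centers)
        eta += fast * (r - eta)                      # average-reward estimate
        v_x = sum(a * b for a, b in zip(w, phi_x))
        v_y = sum(a * b for a, b in zip(w, phi_y))
        d = r - eta + v_y - v_x                      # TD signal
        grad_c = [w[k] * phi_x[k] * (x - centers[k]) for k in range(2)]
        w = [clip(w[k] + fast * d * phi_x[k], 50.0) for k in range(2)]
        centers = [clip(centers[k] + slow * d * grad_c[k], 10.0) for k in range(2)]
        x = y
    return eta, w, centers


eta, w, centers = adaptive_basis_td()
```

In a full actor-critic version, a policy parameter update driven by the same TD signal would sit on an intermediate time scale between the two shown here.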

We note that this algorithm is an AC algorithm with three time scales: the usual two time scales (holding the basis parameter fixed yields Algorithm 1 of [5]), and a third iterate, the slowest, which is added for the basis adaptation.

3.3 Minimum Square Bellman Error

The Minimum Square Bellman Error (MSBE) is defined as

The gradient with respect to is

where the derivative with respect to , , is

Based on this we have the following SA algorithm, that is similar to Algorithm 7 except for the iterates for and .

Algorithm 8

- Adaptive Basis for Bellman Error (ABBE). Consider the iterates for and in Algorithm 7. The iterates for and are

3.4 Minimum Square Projected Bellman Error

The Minimum Square Projected Bellman Error (MSPBE) is defined as

where the projection operator is defined in (6) and where the second equality was proved by Sutton et al. [17], Section 4. We note that the projection operator is independent of but depend on the basis parameter . Define . Thus, is the solution to the equation , which yields . Define similar to [4] section 6.3.3 , where and . Define to be the -th column of . For later use, we give here the gradient of with respect to and in implicit form

Denote by , ,,, , and the estimators at time of , , , , , and , respectively. Define to be the -th column of . Thus, the SA iterations for these estimators are

where satisfies Assumption 3. Next, we compute the gradient of the objective function MSPBE with respect to and and suggest a gradient descent algorithm to find the optimal value. Thus,

The following algorithm gives the SA iterates for and , where the iterates for and are the same as in Algorithms 7 and 8 and are therefore omitted. This algorithm has four time scales. The fastest time scale, related to the step sizes , is the estimators' time scale, i.e., the estimators for , , , , , and . The linear parameters of the critic, i.e., and , related to the step sizes , are estimated on the second fastest time scale. The actor parameter , related to the step sizes , is estimated on the second slowest time scale. Finally, the critic non-linear parameter , related to the step sizes , is estimated on the slowest time scale. We note that a version where the two fastest time scales operate on a joint single fastest time scale is possible, but it introduces additional technical difficulties in the convergence proof.

Algorithm 9

- Adaptive Basis for PBE (ABPBE). Consider the iterates for and in Algorithm 7. The iterates for and are

4 Analysis

In this section we prove the convergence of Algorithms 7 and 8 of the previous section. We omit the convergence proof of Algorithm 9, which is similar to the convergence proof of Algorithm 8.

4.1 Convergence of ABTD

We begin by stating a theorem regarding the ABTD convergence. Due to space limitations, we give only a proof sketch based on the convergence proof of Theorem 2 of Bhatnagar et al. [5]. The self-contained proof under more general conditions is left to the long version of this work.

Theorem 10

Consider Algorithm 7 and suppose Assumptions 1, 3, and 6 hold. Then, the iterates (17)-(20) of Algorithm 7 converge w.p. 1 to a point that locally maximizes and solves the equation .

Proof

(Sketch) There are three time-scales in (17)-(20), therefore, we wish to use Theorem 5, i.e., we need to prove that the requirements of Assumption 4 are valid w.r.t. to all iterations, i.e., , , , and .

Requirements 1-4 w.r.t. iterates , , . Bhatnagar et al. proved in [5] that (17)-(19) converge for a specific . Assumption 6 implies that requirements 1-4 of Assumption 4 are valid regarding the iterates of , and uniformly for all . Therefore, it is sufficient to prove that on top of (17)-(19) the iterate (20) also converges, i.e., that requirements 1-4 of Assumption 4 are valid w.r.t. .

Requirement 1 w.r.t. iterate . Define the -algebra , and define , , , , and . Thus, (20) can be expressed as

(21)

Trivially, using Assumption 6, , , and are Lipschitz, with respect to , with coefficients , , and , respectively. Also, is Lipschitz with respect to , , and with coefficients , , and , respectively. Thus, requirement 1 of Assumption 4 is valid.

Requirements 2 and 3 w.r.t. iterate . By construction, the iterate is bounded. Requirement 3 of Assumption 4 is valid using the boundedness of the martingale difference noise that implies, using the martingale convergence theorem [4], that the martingale converges.

Requirement 4 w.r.t. iterate . Using the result of Bhatnagar et al. [5], the fast time scales converge w.r.t. the slow time scale. Thus, Requirement 4 is valid based on the fact that the iterates (17)-(19) converge.∎

4.2 Convergence of Adaptive Basis for Bellman Error

We begin by stating the theorem and then we prove it.

Theorem 11

Consider Algorithm 8 and suppose that Assumptions 1, 3, and 6 hold. Then, Algorithm 8 converges w.p. 1 to a point that locally maximizes and locally minimizes .

Proof

(Sketch) To use Theorem 5 we need to check that Assumption 4 is valid. Define the -algebra , and define , , , , , , , and .

On the fast time scale (which is related to ), as in Theorem 10, converges to . On the same time scale we need to show that the iterate for converges. Using the above definitions, we can write the iteration as

(22)

We use Theorem 2.2 of Borkar and Meyn [7] to achieve this. Briefly, this theorem states that given an iteration as (22), this iteration is bounded w.p.1 if

(A1)

The process is Lipschitz, the function is Lipschitz, and is asymptotically stable in the origin.

(A2)

The sequence is a martingale difference noise and for some

Trivially, the function is Lipschitz continuous, and we have

Thus, it is easy to show, using Assumption 6, that the ODE has a unique global asymptotically stable point at the origin and (A1) is valid. For (A2) we have

where the first inequality results from the inequality , and the second inequality results from the uniform boundedness of the involved variables. We note that the related ODE for this iteration is given by , and the related Lyapunov function is given by . Next, we need show that under the convergence of the fast time scales for and , the slower iterate for converges. The proof of this is identical to that of Theorem 2 of [5] and is therefore omitted. We are left with proving that if the fast timescales converge, i.e., the iterates , , and , then the iterate converge as well. The proof follows similar lines as of the proof for in the proof of Theorem 10, whereas here the iterate converge to the stable point of the ODE . ∎

5 Simulations

In this section we report empirical results applying the algorithms on two types of problems: Garnet problems [1] and the mountain car problem.

5.1 Garnet problems

The garnet problems [1, 5] (garnet is short for Generic Average Reward Non-stationary Environment Test-bench) are a class of randomly constructed finite MDPs serving as a test-bench for RL algorithms. A garnet problem is characterized by four parameters, written in the form garnet(30,5,5,0.1): in order, the number of states, the number of actions, the branching factor, and the variance of each transition reward. When constructing such a problem, we generate for each state a reward, distributed according to . For each state-action the reward is distributed according to . The transition matrix for each action is composed of non-zero terms. We consider the same garnet problems as those simulated by [5]. For the critic’s feature vector, we use the basis functions , where , , , and are i.i.d. uniform random phases. Note that only one parameter in this simulation controls the basis functions. The actor’s feature vectors are of size , and are constructed as

The policy function is . Bhatnagar et al. [5] reported simulation results for two garnet problems: garnet and garnet. We based our simulations on these results where the time steps are identical to those of [5]. The garnet problem (Fig. 1 left pane) was simulated for (two lower graphs) and (two upper graphs), where each graph is an average of repeats. The garnet problem (Fig. 1 right pane) was simulated for (two lower graphs) and (two upper graphs), where each graph is an average of repeats. We can see that in such problems there is an evident advantage to an adaptive base, which can achieve additional fitness to the problem, and thus even for low dimensional problems the adaptation may be crucial.
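
A garnet-style random MDP can be generated along the following lines. This is a minimal sketch under stated assumptions: each state-action pair gets a fixed number of randomly chosen successor states (the branching factor) with random probabilities, and each state gets a mean reward drawn from a standard normal distribution. The exact reward distributions used in the experiments are not recoverable from the text, so the reward model here, like the function name `make_garnet`, is hypothetical.

```python
import random


def make_garnet(n_states, n_actions, branching, seed=0):
    """Random MDP: P[u][x][y] is the probability of moving from x to y under action u."""
    rng = random.Random(seed)
    P = []
    for _ in range(n_actions):
        rows = []
        for _ in range(n_states):
            successors = rng.sample(range(n_states), branching)
            # Split [0, 1] at branching - 1 random cut points to get probabilities.
            cuts = sorted(rng.random() for _ in range(branching - 1))
            probs = [b - a for a, b in zip([0.0] + cuts, cuts + [1.0])]
            row = [0.0] * n_states
            for s, p in zip(successors, probs):
                row[s] = p
            rows.append(row)
        P.append(rows)
    # Assumed reward model: one standard-normal mean reward per state.
    rewards = [rng.gauss(0.0, 1.0) for _ in range(n_states)]
    return P, rewards


P, rewards = make_garnet(n_states=30, n_actions=5, branching=5)
```

Fixing the seed makes a generated problem reproducible, which is convenient when averaging learning curves over repeated runs on the same MDP.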

Figure 1: Results for garnet (left pane) and garnet (right pane) where circled graphs are for adaptive bases. In each graph the lower two graphs are for and the upper graphs are for . See text for detail.

5.2 The Mountain Car

The mountain car task (see [15] or [16] for details) is a physical problem where a car is positioned randomly between two mountains (see Fig. 2 left pane) and needs to climb the right mountain, but the engine of the car is too weak for a direct climb. Thus, the car needs to accumulate sufficient gravitational energy, by applying back and forth actions, in order to succeed.

We applied the adaptive basis TD algorithm on this problem. We chose the critic basis functions to be radial basis functions (RBF) (see [8]), where the value function is represented by . The centers of the RBFs are parameterized by while the variance is represented by . In the right pane of Fig. 2 we present simulation results for 4 cases: SARSA (blue dash), which is based on the implementation of [15]; AC (red dash-dot) with 64 basis functions uniformly distributed on the parameter space; ABTD with 64 basis functions (magenta dotted), where both the location and the variance of the basis functions can adapt; and ABAC with 16 basis functions (black solid) with the same adaptation. We see that the adaptive basis gives a significant advantage in performance. Moreover, we see that performance does not degrade even with the smaller number of basis functions. In the middle pane, the dynamics of a realization of the basis functions is presented, where the dots and circles are the initial positions and final positions of the basis functions, respectively. The circle sizes are proportional to the standard deviations of the basis functions.

Figure 2: (left pane) illustration of the mountain car task. (middle pane) Realization of ABTD with 16 basis functions where the red dots are the basis functions initial position and the circles are their final position. The radii are proportional to the variance. The rectangle represents the bounded parameter set of the car. (right pane) Simulation result for the mountain car problem with solutions of SARSA (blue dash) AC (red dash-dot) AB-AC with 64 basis functions (magenta dotted) AB-AC with 16 basis functions (black solid).

5.3 The Performance of Multiple Time Scales vs. Single Time Scale

In this section we discuss the differences in performance between the MTS algorithms and the STS algorithms. Contrary to a common misconception, neither MTS algorithms nor STS algorithms have an inherent advantage in terms of convergence. The difference comes from the fact that the two methods perform the gradient algorithm differently, and thus they may produce different trajectories. In Fig. 3 we can see a case on a garnet(30,5,5,0.1) problem where the MTS ABTD algorithm (upper red diamond graph) has an advantage over the STS ABTD algorithms and the MTS static basis AC algorithm as in [5] (rest of the graphs). We note that this is not always the case, and it depends on the problem parameters and the initial conditions.

Figure 3: Results for garnet for . The upper red diamond graph is the MTS ABTD algorithm, the circled green graph is STS ABTD acting on the slow time scale, the blue crossed line is the MTS static basis AC algorithm as in [5], and the black starred line is STS ABTD acting on the fast time scale. Each graph is an average of 100 simulation runs.

6 Discussion

We introduced three new AC based algorithms where the critic’s basis is adaptive. Convergence proofs, in the average reward case, were provided. We note that the algorithms can be easily transformed to the discounted reward case. When considering other target functions, more AC algorithms with adaptive basis can be devised, e.g., considering the objective function yields ATD and GTD(0) algorithms [18]. Also, mixing the different algorithms introduced here can yield new algorithms with some desired properties. For example, we can devise an algorithm where the linear part is updated similarly to (18) and the non-linear part is updated similarly to (21). Convergence of such algorithms will follow the same lines of proof as introduced here.

The advantage of adaptive bases is evident: they relieve the domain expert from the task of carefully designing the basis. Instead, the expert may choose a flexible basis and use algorithms such as those introduced here to adapt the basis to the problem at hand. From a methodological point of view, the method we introduced in this paper demonstrates how to easily transform an existing RL algorithm into an adaptive basis algorithm. The analysis of the original problem is used to show convergence of the faster time scales, and the slow time scale is used for modifying the basis, analogously to the “code reuse” concept in software engineering.

References

  • [1] Archibald, T., McKinnon, K., Thomas, L.: On the generation of Markov decision processes. Journal of the Operational Research Society 46 (1995) 354–361
  • [2] Bradtke, S. J., Barto, A. G.: Linear least-squares algorithms for temporal difference learning. Machine Learning 22 (1996) 33–57
  • [3] Bertsekas, D.: Dynamic programming and optimal control, 3rd ed. Athena Scientific (2007)
  • [4] Bertsekas, D., Tsitsiklis, J.: Neuro-dynamic programming. Athena Scientific (1996)
  • [5] Bhatnagar, S., Sutton, R., Ghavamzadeh, M., Lee, M.: Natural actor–critic algorithms. Technical report, Univ. of Alberta (2007)
  • [6] Borkar, V.: Stochastic approximation with two time scales. Systems & Control Letters 29 (1997) 291–294
  • [7] Borkar, V., Meyn, S.: The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization 38 (2000) 447–469
  • [8] Haykin, S.: Neural networks: a comprehensive foundation. Prentice Hall (2008)
  • [9] Kushner, H., Yin, G.: Stochastic approximation and recursive algorithms and applications. Springer Verlag (2003)
  • [10] Leslie, D., Collins, E.: Convergent multiple-timescales reinforcement learning algorithms in normal form games. The Annals of Applied Probability 13 (2003) 1231–1251
  • [11] Menache, I., Mannor, S., Shimkin, N.: Basis function adaptation in temporal difference reinforcement learning. Annals of Operations Research 134 (2006) 215–238
  • [12] Mokkadem, A., Pelletier, M.: Convergence rate and averaging of nonlinear two-time-scale stochastic approximation algorithms. The Annals of Applied Probability 16, 1671
  • [13] Polyak, B.: New method of stochastic approximation type. Automat. Remote Control 51 (1990) 937–946
  • [14] Puterman, M.: Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons Inc (1994)
  • [15] Singh, S., Sutton, R.: Reinforcement learning with replacing eligibility traces. Machine Learning 22 (1996) 123–158
  • [16] Sutton, R. S., Barto, A. G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA (1998)
  • [17] Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., Wiewiora, E.: Fast gradient-descent methods for temporal-difference learning with linear function approximation. Proceedings of the 26th Annual International Conference on Machine Learning (2009)
  • [18] Sutton, R. S., Szepesvári, C., Maei, H. R.: A convergent temporal-difference algorithm for off-policy learning with linear function approximation. Advances in Neural Information Processing Systems 21 (2009) 1609–1616
  • [19] Yu, H., Bertsekas, D.: Basis function adaptation methods for cost approximation in MDP. Proc. of IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning, Nashville, TN (2009)