# On the Sample Complexity of Actor-Critic Method for Reinforcement Learning with Function Approximation

Reinforcement learning, mathematically described by Markov Decision Problems, may be approached either through dynamic programming or policy search. Actor-critic algorithms combine the merits of both approaches by alternating between steps to estimate the value function and policy gradient updates. Because the updates exhibit correlated noise and biased gradient estimates, only the asymptotic behavior of actor-critic is known, obtained by connecting its behavior to dynamical systems. This work puts forth a new variant of actor-critic that employs Monte Carlo rollouts during the policy search updates, which results in controllable bias that depends on the number of critic evaluations. As a result, we are able to provide for the first time the convergence rate of actor-critic algorithms when the policy search step employs policy gradient, agnostic to the choice of policy evaluation technique. In particular, we establish conditions under which the sample complexity is comparable to that of the stochastic gradient method for non-convex problems, or slower as a result of the critic estimation error, which is the main complexity bottleneck. These results hold in continuous state and action spaces with linear function approximation for the value function. We then specialize these conceptual results to the case where the critic is estimated by Temporal Difference, Gradient Temporal Difference, and Accelerated Gradient Temporal Difference. These learning rates are then corroborated on a navigation problem involving an obstacle, which suggests that learning more slowly may lead to improved limit points, providing insight into the interplay between optimization and generalization in reinforcement learning.


## 1 Introduction

Reinforcement learning (RL) is a form of adaptive control where the system model is unknown, and one seeks to estimate parameters of a controller through repeated interaction with the environment [5, 40]. This framework gained attention recently for its ability to express problems that exhibit complicated dependences between action selection and environmental response, i.e., when the cost function or system dynamics are difficult to express analytically. This is the case in supply chain management [21], power systems [22], robotic manipulation [23], and games of various kinds [46, 38, 14]. Although the expressive capability of RL continues to motivate new and diverse applications, its computational challenges remain doggedly persistent.

More specifically, RL is defined by a Markov Decision Process [37]: at each time, an agent, starting from one state, selects an action and then transitions to a new state according to a distribution that is Markov in the current state and action. Then, the environment reveals a reward informing the quality of that decision. The goal of the agent is to select an action sequence which yields the largest expected accumulation of rewards, defined as the value. Two dominant approaches to RL have emerged since its original conception by Bellman [4]. The first, dynamic programming [50], writes the value as the expected one-step reward plus all subsequent rewards (Bellman equations), and then proceeds by stochastic fixed point iterations [48]. Combining dynamic programming approaches with nonlinear function parameterizations, as in [47], may cause instability.

On the other hand, the alternative approach, policy search [43], hypothesizes that actions are chosen according to a parameterized distribution. It then repeatedly revises those parameters according to stochastic search directions. Policy search has gained popularity due to its ability to scale to large (or continuous) spaces and exhibit global convergence, although its variance, and hence sample complexity, may be impractically large. Also worth mentioning is Monte Carlo search (“guess and check”) [2, 20], which is essential to reducing large spaces to only viable hypotheses.

In this work, we focus on methods that operate in the intersection of dynamic programming and policy search called actor-critic [25, 24]. Actor-critic is an online form of policy iteration [5] that inherits the ability of policy search to scale to large or continuous spaces, while reducing its number of queries to the environment. In particular, the policy gradient method repeatedly revises policy parameter estimates through policy gradient steps. Owing to the Policy Gradient Theorem [43], the policy gradient is the product of two factors: the score function and the $Q$ function. One may employ Monte Carlo rollouts to acquire the $Q$-estimates, which, under careful choice of the rollout horizon, may be shown to be unbiased [34, 52]. Doing so, however, requires an inordinate amount of querying to the environment in order to generate trajectory data.

Actor-critic replaces Monte Carlo rollouts for the $Q$-value by stochastic approximations of solutions to Bellman equations, i.e., temporal difference (TD) [45] or gradient temporal difference (GTD) [44] steps. Intuitively, this weaving together of the merits of dynamic programming and policy search yields comparable scalability properties to policy search while reducing its sample complexity. However, the iteration (and sample) complexity of actor-critic is noticeably absent from the literature, which is striking due to its foundational role in modern reinforcement learning systems [30, 38], and the fact that efforts to improve upon it also only establish asymptotics [8]. This absence is due to the fact that actor-critic algorithms exhibit two technical challenges: (i) their sample path is dependent (non i.i.d.), and (ii) their search directions are biased.

In this work, we mitigate these challenges through (i) the use of a Monte Carlo rollout scheme to estimate the policy gradient, given a value function estimate; and (ii) employing recently established sample complexity results of policy evaluation under linear basis expansion. Doing so permits us to characterize for the first time the complexity of actor-critic algorithms under a few canonical settings and schemes for the critic (policy evaluation) update. Our results hinge upon viewing policy search as a form of stochastic gradient method for maximizing a non-convex function, where the ascent directions are biased. Moreover, the magnitude of this bias is determined by the number of critic steps. This perspective treats actor-critic as a form of two time-scale algorithm [9], whose asymptotic stability is well-known via dynamical systems tools [26, 10]. To wield these approaches to establish finite-time performance, however, concentration probabilities and geometric ergodicity assumptions of the Markov dynamics are required; see [10]. To obviate these complications and exploit recent unbiased sampling procedures [34, 52], we focus on the case where independent trajectory samples are acquirable through querying the environment.

Our main result establishes that actor-critic, independent of any critic method, exhibits convergence to stationary points of the value function that are comparable to stochastic gradient ascent in the non-convex regime. We note that a key distinguishing feature from standard non-convex stochastic programming is that the rates are inherently tied to the bias of the search direction which is determined by the choice of critic scheme. In fact, our methodology is such that a rate for actor-critic can be derived for any critic only method for which a convergence rate in expectation on the parameters can be expressed. In particular, we characterize the rates for actor-critic with temporal difference (TD) and gradient TD (GTD) critic steps. Furthermore, we propose an Accelerated GTD (A-GTD) method derived from accelerations of stochastic compositional (quasi-) gradient descent [49], which converges faster than TD and GTD.

In summary, for the continuous spaces, we establish that GTD and A-GTD converge faster than TD. In particular, this introduces a trade off between the smoothness assumptions and the rates derived (see Table 1). TD has no additional smoothness assumptions, and it achieves a rate of . This rate is analogous to the non-convex analysis of stochastic compositional gradient descent. Adding a smoothness assumption, GTD achieves the faster rate of . By requiring an additional strong convexity assumption, we find that A-GTD achieves the fastest convergence rate of . For the case of finite state action space, actor critic achieves a convergence rate of . Overall, the contribution in terms of sample complexities of different actor-critic algorithms may be found in Table 1.

We evaluate actor-critic with TD, GTD, and A-GTD critic updates on a navigation problem. We find that indeed A-GTD converges faster than both GTD and TD. Interestingly, the stationary point it reaches is worse than GTD or TD. This suggests that the choice of critic scheme illuminates an interplay between optimization and generalization that is less-well understood in reinforcement learning [13, 12]. Surprisingly, we find that TD converges faster than GTD which we postulate is an artifact of the selection of the feature space parametrization. A detailed discussion on the results and implications can be found in section 7. The remainder of the paper is organized as follows. Section 2 describes the problem of Reinforcement Learning and characterizes common assumptions which we use in our analysis. In section 3 we derive a generic actor-critic algorithm from an optimization perspective and describe how the algorithm would be amended given different policy evaluation methods. The derivation of the convergence rate for generic actor-critic is presented in section 4, and the specific analysis for Gradient, Accelerated Gradient, and vanilla Temporal difference are characterized in sections 5 and 6.

## 2 Reinforcement Learning

In reinforcement learning (RL), an agent moves through a state space $\mathcal{S}$ and takes actions that belong to some action set $\mathcal{A}$, where the state and action spaces are assumed to be continuous compact subsets of Euclidean space. Every time an action is taken, the agent transitions to its next state, which depends on its current state and action. Moreover, a reward is revealed by the environment. In this situation, the agent would like to accumulate as much reward as possible in the long term, which is referred to as value. Mathematically this problem definition may be encapsulated as a Markov decision process (MDP), which is a tuple $(\mathcal{S},\mathcal{A},\mathbb{P},r,\gamma)$ with Markov transition density $\mathbb{P}(s'\mid s,a)$ that determines the probability of moving to state $s'$ from state $s$ under action $a$. Here, $\gamma\in(0,1)$ is the discount factor that parameterizes the value of a given sequence of actions, which we will define shortly.

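At each time $t$, the agent executes an action $a_t\in\mathcal{A}$ given the current state $s_t\in\mathcal{S}$, following a possibly stochastic policy $\pi$, i.e., $a_t\sim\pi(\cdot\mid s_t)$. Then, given the state-action pair $(s_t,a_t)$, the agent observes a (deterministic) reward $r_t$ and transitions to a new state $s_{t+1}\sim\mathbb{P}(\cdot\mid s_t,a_t)$ according to a transition density that is Markov. For any policy $\pi$ mapping states to actions, define the value function $V_{\pi}$ as

$$V_{\pi}(s)=\mathbb{E}_{a_t\sim\pi(\cdot\mid s_t),\,s_{t+1}\sim\mathbb{P}(\cdot\mid s_t,a_t)}\left(\sum_{t=0}^{\infty}\gamma^{t}r_t\,\middle|\,s_0=s\right), \tag{1}$$

which is a measure of the long term average reward accumulation discounted by $\gamma$. We can further define the value conditioned on a given initial action as the action-value, or Q-function, as $Q_{\pi}(s,a)=\mathbb{E}\left(\sum_{t=0}^{\infty}\gamma^{t}r_t\mid s_0=s,a_0=a\right)$. Given any initial state $s_0$, the goal of the agent is to find the optimal policy $\pi$ that maximizes the long-term return $V_{\pi}(s_0)$, i.e., to solve the following optimization problem

$$\max_{\pi\in\Pi}\; J(\pi):=V_{\pi}(s_0). \tag{2}$$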

In this work, we investigate actor-critic methods to solve (2), a hybrid RL approach that fuses key properties of policy search and approximate dynamic programming. To ground the discussion, we first derive the canonical policy search technique called the policy gradient method, and explain how actor-critic augments policy gradient. Begin by noting that to address (2), one must search over an arbitrarily complicated function class $\Pi$, which may include policies that are unbounded and discontinuous. To mitigate this issue, we parameterize the policy $\pi$ by a vector $\theta$, i.e., $\pi=\pi_\theta$, yielding RL tools called policy gradient methods [24, 8, 15]. Under this specification, the search over the arbitrarily complicated function class $\Pi$ in (2) may be reduced to a search over Euclidean space, i.e., a vector-valued optimization $\max_{\theta}J(\theta):=V_{\pi_\theta}(s_0)$. Subsequently, we denote $J(\pi_\theta)$ by $J(\theta)$ for notational convenience. We first make the following standard assumption on the regularity of the MDP problem and the parameterized policy $\pi_\theta$, which are the same conditions as [51].

###### Assumption 1.

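Suppose the reward function $r$ and the parameterized policy $\pi_\theta$ satisfy the following conditions:

1. The absolute value of the reward is bounded uniformly by $U_R$, i.e., $|r(s,a)|\le U_R$ for any $(s,a)\in\mathcal{S}\times\mathcal{A}$.

2. The policy $\pi_\theta$ is differentiable with respect to $\theta$, and the score function $\nabla\log\pi_\theta(a\mid s)$ is $L_\Theta$-Lipschitz and has bounded norm, i.e., for any $(s,a)\in\mathcal{S}\times\mathcal{A}$,

$$\|\nabla\log\pi_{\theta_1}(a\mid s)-\nabla\log\pi_{\theta_2}(a\mid s)\|\le L_\Theta\cdot\|\theta_1-\theta_2\|,\quad\text{for any }\theta_1,\theta_2, \tag{3}$$

$$\|\nabla\log\pi_{\theta}(a\mid s)\|\le B_\Theta,\quad\text{for any }\theta. \tag{4}$$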

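Note that the boundedness of the reward function in Assumption 1 is standard in policy search algorithms [7, 8, 15, 53]. Observe that with $|r(s,a)|\le U_R$, the Q-function is bounded in absolute value by $U_R/(1-\gamma)$, since by definition

$$|Q_{\pi_\theta}(s,a)|\le\sum_{t=0}^{\infty}\gamma^{t}\cdot U_R=\frac{U_R}{1-\gamma},\quad\text{for any }(s,a)\in\mathcal{S}\times\mathcal{A}. \tag{5}$$

The same bound also applies to $V_{\pi_\theta}(s)$ for any $\pi_\theta$ and $s\in\mathcal{S}$, and thus to the objective $J(\theta)$, which is defined as $V_{\pi_\theta}(s_0)$, i.e.,

$$|V_{\pi_\theta}(s)|\le\frac{U_R}{1-\gamma},\quad\text{for any }s\in\mathcal{S},\qquad |J(\theta)|\le\frac{U_R}{1-\gamma}. \tag{6}$$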
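As a quick numerical illustration of (5) and (6), the sketch below computes a truncated discounted return and compares it with the bound $U_R/(1-\gamma)$. The function names `discounted_return` and `return_bound` and the example reward sequence are illustrative assumptions and do not come from the paper.

```python
def discounted_return(rewards, gamma):
    """Truncated discounted sum: sum_t gamma**t * rewards[t]."""
    total = 0.0
    # Walk backwards so that each step is a single multiply-add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def return_bound(u_r, gamma):
    """Uniform bound U_R / (1 - gamma) from (5)-(6), valid for 0 <= gamma < 1."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    return u_r / (1.0 - gamma)


if __name__ == "__main__":
    rewards = [1.0, -1.0, 0.5, 1.0]  # hypothetical rewards with |r_t| <= 1
    g = discounted_return(rewards, gamma=0.9)
    print(abs(g), "<=", return_bound(1.0, gamma=0.9))
```

Any reward sequence with $|r_t|\le U_R$ should satisfy the printed inequality, regardless of its length.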

We note that the conditions (3) and (4) have appeared in recent analyses of policy search [15, 35, 32], and are satisfied by canonical policy parameterizations such as the Boltzmann policy [25] and the Gaussian policy [18]. For example, for a Gaussian policy in continuous spaces (in practice the action space $\mathcal{A}$ is bounded, which requires a truncated Gaussian policy to be used over $\mathcal{A}$, as in [32]), $\pi_\theta(\cdot\mid s)=\mathcal{N}(\phi(s)^\top\theta,\sigma^2)$, where $\mathcal{N}(\mu,\sigma^2)$ denotes the Gaussian distribution with mean $\mu$ and variance $\sigma^2$. Then the score function has the form $[a-\phi(s)^\top\theta]\,\phi(s)/\sigma^2$, which satisfies (3) and (4) if the feature vectors $\phi(s)$ have bounded norm, the parameter $\theta$ lies in some bounded set, and the action $a$ is bounded.
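The Gaussian example can be made concrete with a small sketch. It assumes a policy whose mean is linear in a state feature vector, $\phi(s)^\top\theta$, with a fixed standard deviation `sigma`; these modeling choices and the function names are assumptions made for illustration. Under them, the score is $(a-\phi(s)^\top\theta)\,\phi(s)/\sigma^2$.

```python
import math


def gaussian_log_prob(a, s_feat, theta, sigma):
    """log N(a; s_feat . theta, sigma^2) for a scalar action a."""
    mean = sum(f * t for f, t in zip(s_feat, theta))
    return -0.5 * ((a - mean) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))


def gaussian_score(a, s_feat, theta, sigma):
    """Gradient of gaussian_log_prob with respect to theta."""
    mean = sum(f * t for f, t in zip(s_feat, theta))
    coeff = (a - mean) / sigma ** 2
    return [coeff * f for f in s_feat]


if __name__ == "__main__":
    # Hypothetical values, chosen only to exercise the functions.
    print(gaussian_score(0.7, [0.5, -1.0], [0.2, 0.3], 0.8))
```

The norm of the score is at most $|a-\phi(s)^\top\theta|\cdot\|\phi(s)\|/\sigma^2$, which is why bounded features, parameters, and actions are enough for (4).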

We now state the following assumptions which are required for the base results. Additional assumptions on smoothness are specific to the critic only method used, and are detailed in Table 1.

###### Assumption 2.

The random tuples $(s_t,a_t,s_t',a_t')$ are drawn from the stationary distribution of the Markov reward process independently across time.

As pointed out by [17], the i.i.d. assumption does not hold in practice, but it is standard when dealing with convergence bounds in reinforcement learning. Although there has been an effort to characterize the rate of convergence in expectation for critic only methods with Markov noise [6], these results do not yield a bound of the form required by Assumption 5. For this reason, we study specifically the case where Assumption 2 holds, and leave the study of Markov noise for future work. The general conditions for stability of trajectories with Markov dependence, i.e., negative Lyapunov exponents for mixing rates, may be found in [29].

###### Assumption 3.

For any state-action pair $(s,a)\in\mathcal{S}\times\mathcal{A}$, the norm of the feature representation $\varphi(s,a)$ is bounded by a constant $C_2$.

Assumption 3 is easily implementable in practice by normalizing the feature representation. Because the score function is bounded by $B_\Theta$ (c.f. Assumption 1) and the reward function is bounded, we have that for some constant $C_3$

$$\|\nabla J(\theta_k)\|\le C_3 \tag{7}$$

for all $k$.

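Additionally, we assume that the second moment of the gradient estimate, conditioned on the filtration $\mathcal{F}_k$, is bounded by a finite constant.

###### Assumption 4.

Let $\hat{\nabla}J(\theta)$ be a possibly biased estimate of $\nabla J(\theta)$. There exists a finite $\sigma^2$ such that

$$\mathbb{E}\left(\|\hat{\nabla}J(\theta)\|^2\mid\mathcal{F}_k\right)\le\sigma^2. \tag{8}$$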

Generally, the value function $J(\theta)$ is nonconvex with respect to the parameter $\theta$, meaning that obtaining a globally optimal solution to (2) is out of reach unless the problem has additional structured properties, as in phase retrieval [39], matrix factorization [27], and tensor decomposition [19], among others. Thus, our goal is to design actor-critic algorithms to attain stationary points of the value function $J(\theta)$. Moreover, we characterize the sample complexity of actor-critic, a noticeable gap in the literature for an algorithmic tool decades old [25] at the heart of the recent innovations of artificial intelligence architectures [38].

## 3 From Policy Gradient to Actor-Critic

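In this section, we derive the actor-critic method [25] from an optimization perspective: we view actor-critic as a way of doing stochastic gradient ascent with biased ascent directions, and the magnitude of this bias is determined by the number of critic evaluations done in the inner loop of the algorithm. The building block of actor-critic is called the policy gradient method, a type of direct policy search, based on stochastic gradient ascent. Begin by noting that the gradient of the objective $J(\theta)$ with respect to policy parameters $\theta$, owing to the Policy Gradient Theorem [43], has the following form:

$$\begin{aligned}
\nabla J(\theta) &=\int_{s\in\mathcal{S},a\in\mathcal{A}}\sum_{t=0}^{\infty}\gamma^{t}\cdot p(s_t=s\mid s_0,\pi_\theta)\cdot\nabla\pi_\theta(a\mid s)\cdot Q_{\pi_\theta}(s,a)\,ds\,da \qquad (9)\\
&=\frac{1}{1-\gamma}\int_{s\in\mathcal{S},a\in\mathcal{A}}(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\cdot p(s_t=s\mid s_0,\pi_\theta)\cdot\nabla\pi_\theta(a\mid s)\cdot Q_{\pi_\theta}(s,a)\,ds\,da\\
&=\frac{1}{1-\gamma}\int_{s\in\mathcal{S},a\in\mathcal{A}}\rho_{\pi_\theta}(s)\cdot\pi_\theta(a\mid s)\cdot\nabla\log[\pi_\theta(a\mid s)]\cdot Q_{\pi_\theta}(s,a)\,ds\,da\\
&=\frac{1}{1-\gamma}\cdot\mathbb{E}_{(s,a)\sim\rho_\theta(\cdot,\cdot)}\left[\nabla\log\pi_\theta(a\mid s)\cdot Q_{\pi_\theta}(s,a)\right]. \qquad (10)
\end{aligned}$$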

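In the preceding expression, $p(s_t=s\mid s_0,\pi_\theta)$ denotes the probability that state $s_t$ equals $s$ given initial state $s_0$ and policy $\pi_\theta$, which is occasionally referred to as the occupancy measure, or the Markov chain transition density induced by policy $\pi_\theta$. Moreover, $\rho_{\pi_\theta}(s)=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}p(s_t=s\mid s_0,\pi_\theta)$ is the ergodic distribution associated with the MDP for fixed policy, which is shown to be a valid distribution [43]. For future reference, we define $\rho_\theta(s,a)=\rho_{\pi_\theta}(s)\cdot\pi_\theta(a\mid s)$. The derivative of the logarithm of the policy is usually referred to as the score function corresponding to the probability distribution $\pi_\theta(\cdot\mid s)$ for any $s\in\mathcal{S}$.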
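The passage from the second to the third line of (9)-(10) rests on two facts, written out below in the notation of those equations: the log-derivative identity for the policy, and the definition of the discounted state distribution as it is read off from (9)-(10). The final line checks that the weights $(1-\gamma)\gamma^{t}$ sum to one, which is what makes $\rho_{\pi_\theta}$ a valid distribution.

```latex
\nabla \pi_\theta(a \mid s) = \pi_\theta(a \mid s)\, \nabla \log \pi_\theta(a \mid s),
\qquad
\rho_{\pi_\theta}(s) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^{t}\, p(s_t = s \mid s_0, \pi_\theta),
\qquad
(1-\gamma) \sum_{t=0}^{\infty} \gamma^{t} = 1 .
```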

Next, we discuss how (10) can be used to develop stochastic methods to address (2). Unbiased samples of the gradient are required to perform the stochastic gradient ascent, which hopefully converges to a stationary solution of the nonconvex maximization. One way to obtain an estimate of the gradient is to evaluate the score function and $Q$ function at the end of a rollout whose length $T$ is drawn from a geometric distribution with parameter $1-\gamma$ [51, Theorem 4.3]. If the $Q$ function evaluation is unbiased, then the stochastic estimate of the gradient is unbiased as well. We therefore define the stochastic estimate by

$$\hat{\nabla}J(\theta)=\frac{1}{1-\gamma}\hat{Q}_{\pi_\theta}(s_T,a_T)\nabla\log\pi_\theta(a_T\mid s_T). \tag{11}$$
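The role of the random rollout length can be illustrated with a short sketch. It assumes the horizon is drawn with $P(T=t)=(1-\gamma)\gamma^{t}$ for $t=0,1,2,\dots$, which matches the weights $(1-\gamma)\gamma^{t}$ appearing in (9)-(10); the function names are hypothetical. Averaging any quantity $f(T)$ over such horizons then estimates $(1-\gamma)\sum_{t}\gamma^{t}f(t)$ without summing over time.

```python
import random


def sample_horizon(gamma, rng):
    """Draw T with P(T = t) = (1 - gamma) * gamma**t for t = 0, 1, 2, ..."""
    t = 0
    # Continue the rollout with probability gamma, stop with probability 1 - gamma.
    while rng.random() < gamma:
        t += 1
    return t


def geometric_weighted_average(f, gamma, n_samples, rng):
    """Monte Carlo estimate of (1 - gamma) * sum_t gamma**t * f(t)."""
    return sum(f(sample_horizon(gamma, rng)) for _ in range(n_samples)) / n_samples


if __name__ == "__main__":
    rng = random.Random(0)
    # With f(t) = t the target value is gamma / (1 - gamma).
    print(geometric_weighted_average(lambda t: t, 0.9, 20000, rng))
```

In the gradient estimate (11), $f$ would be the product of the $Q$ estimate and the score evaluated at the state-action pair reached at time $T$.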

We consider the case where the $Q$ function admits a linear parametrization of the form $\hat{Q}_{\pi_\theta}(s,a)=\xi^\top\varphi(s,a)$, which in the literature on policy search is referred to as the critic [25], as it “criticizes” the performance of actions chosen according to policy $\pi_\theta$. Here $\xi$ is the critic parameter vector and $\varphi$ is a (possibly nonlinear) feature map of matching dimension, such as a network of radial basis functions or an auto-encoder. Moreover, we estimate the parameter $\xi$ that defines the $Q$ function from a policy evaluation (critic only) method after some $T_C(k)$ iterations, where $k$ denotes the number of policy gradient updates. Thus, we may write the stochastic gradient estimate as

$$\hat{\nabla}J(\theta)=\frac{1}{1-\gamma}\xi_k^\top\varphi(s_T,a_T)\nabla\log\pi_\theta(a_T\mid s_T). \tag{12}$$

If the estimate of the $Q$ function is unbiased, i.e., $\mathbb{E}[\hat{Q}_{\pi_\theta}(s,a)]=Q_{\pi_\theta}(s,a)$, then $\mathbb{E}[\hat{\nabla}J(\theta)]=\nabla J(\theta)$ (c.f. [51, Theorem 4.3]). Typically, critic only methods do not give unbiased estimates of the $Q$ function; however, in expectation the rate at which their bias decays is proportional to the number of estimation steps. In particular, denote by $\xi_*$ the parameter for which the estimate is unbiased:

$$\mathbb{E}[\xi_*^\top\varphi(s,a)]=\mathbb{E}[\hat{Q}_{\pi_\theta}(s,a)]=Q(s,a). \tag{13}$$

Hence, by adding and subtracting the true estimate of the parametrized $Q$ function to (12), we arrive at the fact that the policy search direction admits the following decomposition:

$$\hat{\nabla}J(\theta)=\frac{1}{1-\gamma}(\xi_k-\xi_*)^\top\varphi(s_T,a_T)\nabla\log\pi_\theta(a_T\mid s_T)+\frac{1}{1-\gamma}(\xi_*)^\top\varphi(s_T,a_T)\nabla\log\pi_\theta(a_T\mid s_T). \tag{14}$$

The second term in (14) is the unbiased estimate of the gradient $\nabla J(\theta)$, whereas the first involves the difference between the critic parameter $\xi_k$ at iteration $k$ and the true estimate $\xi_*$. For linear parameterizations of the $Q$ function, policy evaluation methods establish convergence in mean of the bias

$$\mathbb{E}[\|\xi_k-\xi_*\|]\le g(k), \tag{15}$$

where $g(k)$ is some decreasing function. We address cases where the critic bias decays at rate $k^{-b}$ for $b\in(0,1]$, due to the fact that several state of the art works on policy evaluation may be mapped to the form (15) for this specification [49, 17]. We formalize this with the following assumption.

###### Assumption 5.

The expected error of the critic parameter is bounded by $O(k^{-b})$ for some $b\in(0,1]$, i.e., there exists a constant $L_1>0$ such that

$$\mathbb{E}[\|\xi_k-\xi_*\|]\le L_1k^{-b}. \tag{16}$$

Recently, alternate rates have been established as ; however, they concede that rates may be possible [6, 54]. Thus, we subsume recent sample complexity characterizations of policy evaluation as (15) in Assumption 5.

As such, (14) is nearly a valid ascent direction: it is approximately an unbiased estimate of the gradient $\nabla J(\theta)$, since the first term becomes negligible as the number of critic estimation steps increases. Based upon this observation, we propose the following variant of the actor-critic method [25]: run a critic estimator (policy evaluator) for $T_C(k)$ steps, whose output is the critic parameters $\xi_{T_C(k)}$. We treat the critic estimator as a map which returns the parameter $\xi_{T_C(k)}$ after $T_C(k)$ iterations. Then, simulate a trajectory of length $T_k$, where $T_k$ is geometrically distributed with parameter $1-\gamma$, and update the actor (policy) parameters $\theta$ as:

$$\theta_{k+1}=\theta_k+\eta_k\hat{\nabla}J(\theta_k)=\theta_k+\eta_k\frac{1}{1-\gamma}\xi_{T_C(k)}^\top\varphi(s_{T_k},a_{T_k})\nabla\log\pi_{\theta_k}(s_{T_k},a_{T_k}\mid\theta_k). \tag{17}$$

We summarize the aforementioned procedure, which is agnostic to particular choice of critic estimator, as Algorithm 1.
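The procedure can be sketched as a loop that is agnostic to the critic. This is a minimal sketch under stated assumptions, not a transcription of Algorithm 1: the callables `critic`, `rollout`, `features`, and `score` are hypothetical interfaces, the rollout length is drawn with $P(T_k=t)=(1-\gamma)\gamma^{t}$, and the default schedules `step_size` and `critic_steps` are placeholder choices rather than values taken from the paper.

```python
import random


def actor_critic(theta0, critic, rollout, features, score, gamma, num_iters,
                 step_size=lambda k: (k + 1) ** -0.5,   # placeholder schedule
                 critic_steps=lambda k: k + 1,          # placeholder schedule
                 rng=None):
    """Generic actor-critic loop following the actor update (17).

    critic(theta, n)      -> critic weights xi after n policy-evaluation steps
    rollout(theta, T)     -> (s, a) reached after T steps under the policy
    features(s, a)        -> feature vector for the linear critic
    score(theta, s, a)    -> gradient of log pi_theta at (s, a)
    """
    rng = rng if rng is not None else random.Random(0)
    theta = list(theta0)
    for k in range(num_iters):
        xi = critic(theta, critic_steps(k))            # policy evaluation
        horizon = 0
        while rng.random() < gamma:                    # geometric rollout length
            horizon += 1
        s, a = rollout(theta, horizon)
        q_hat = sum(w * f for w, f in zip(xi, features(s, a)))  # xi^T phi(s, a)
        g = score(theta, s, a)
        eta = step_size(k)
        # Biased stochastic ascent step: eta / (1 - gamma) * q_hat * score.
        theta = [t + eta * q_hat * gi / (1.0 - gamma) for t, gi in zip(theta, g)]
    return theta
```

Passing the critic as a callable mirrors the point of the text: the actor update only needs the critic's output, so any policy evaluation routine with a known error rate can be plugged in.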

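Examples of Critic Updates. We note that the critic estimator admits two canonical forms: temporal difference (TD) [45] and gradient temporal difference (GTD)-based estimators [44]. The TD update for the critic is given as

$$\delta_t=r_t+\gamma\xi_t^\top\varphi(s_t',a_t')-\xi_t^\top\varphi(s_t,a_t),\qquad\xi_{t+1}=\xi_t+\alpha_t\delta_t\varphi(s_t,a_t), \tag{18}$$

whereas for the GTD-based estimator for the critic, we consider the update

$$\begin{aligned}
\delta_t&=r_t+\xi_t^\top\left(\gamma\varphi(s_t',a_t')-\varphi(s_t,a_t)\right),\qquad z_{t+1}=(1-\beta_t)z_t+\beta_t\delta_t,\\
\xi_{t+1}&=(1-\lambda\alpha_t)\xi_t-2\alpha_tz_{t+1}\left[\gamma\varphi(s_t',a_t')-\varphi(s_t,a_t)\right]. \qquad (19)
\end{aligned}$$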

We further analyze a modification of GTD updates proposed by [49] that incorporates an extrapolation technique to reduce bias in the estimates and improve error dependency, which is distinct from accelerated stochastic approximation with Nesterov Smoothing[31]. With and defined for , the accelerated GTD (A-GTD) update becomes

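$$\begin{aligned}
\xi_{t+1}&=\xi_t-2\alpha_t\left(\gamma\varphi(s',a')-\varphi(s,a)\right)y_t \qquad (20)\\
z_{t+1}&=-\left(\frac{1}{\beta_t}-1\right)\xi_t+\frac{1}{\beta_t}\xi_{t+1}\\
y_{t+1}&=(1-\beta_t)y_t+\beta_t\left(r(s,a)+z_{t+1}^\top\left(\gamma\varphi(s',a')-\varphi(s,a)\right)\right)
\end{aligned}$$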
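For concreteness, the three critic updates can be written as single-step functions on plain Python lists. This is a sketch under assumptions: the TD step uses the standard TD(0) error $r+\gamma\xi^\top\varphi(s',a')-\xi^\top\varphi(s,a)$, the GTD and A-GTD steps follow (19) and (20) as stated, step sizes are passed in as plain numbers, and all function names are hypothetical.

```python
def _dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def _td_direction(phi, phi_next, gamma):
    """Vector gamma * phi(s', a') - phi(s, a) shared by the three updates."""
    return [gamma * fn - f for f, fn in zip(phi, phi_next)]


def td_step(xi, phi, phi_next, reward, gamma, alpha):
    """TD(0) update of the critic weights; returns the new weights."""
    delta = reward + _dot(xi, _td_direction(phi, phi_next, gamma))
    return [w + alpha * delta * f for w, f in zip(xi, phi)]


def gtd_step(xi, z, phi, phi_next, reward, gamma, alpha, beta, lam):
    """GTD-style update (19); z is a running average of the TD error."""
    d = _td_direction(phi, phi_next, gamma)
    delta = reward + _dot(xi, d)
    z_new = (1.0 - beta) * z + beta * delta
    xi_new = [(1.0 - lam * alpha) * w - 2.0 * alpha * z_new * di
              for w, di in zip(xi, d)]
    return xi_new, z_new


def agtd_step(xi, y, phi, phi_next, reward, gamma, alpha, beta):
    """Accelerated GTD update (20); z is an extrapolated copy of the weights."""
    d = _td_direction(phi, phi_next, gamma)
    xi_new = [w - 2.0 * alpha * y * di for w, di in zip(xi, d)]
    z_new = [-(1.0 / beta - 1.0) * w + wn / beta for w, wn in zip(xi, xi_new)]
    y_new = (1.0 - beta) * y + beta * (reward + _dot(z_new, d))
    return xi_new, y_new
```

All three steps share the direction $\gamma\varphi(s',a')-\varphi(s,a)$; they differ in whether the TD error is used directly, averaged through $z$, or averaged at an extrapolated point through $y$.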

Subsequently, we shift focus to characterizing the mean convergence of actor-critic method given any policy evaluation method satisfying (15) in Section 4. Then, we specialize the sample complexity of actor-critic to the cases associated with critic updates (18) - (20), which we respectively call Classic (Algorithm 4), Gradient (Algorithm 2), and Accelerated Actor-Critic (Algorithm 3).

## 4 Convergence Rate of Generic Actor-Critic

In this section, we derive the rate of convergence in expectation for the variant of actor-critic defined in Algorithm 1, which is agnostic to the particular choice of policy evaluation method used to estimate the $Q$ function used in the actor update. Unsurprisingly, we establish that the rate of convergence in expectation for actor-critic depends on the critic update used. Therefore, we present the main result in this paper for any generic critic method. Thereafter, we specialize this result to the two well-known choices of policy evaluation previously described, (18) and (19), as well as a new variant that employs acceleration (20).

We begin by noting that under Assumption 1, one may establish Lipschitz continuity of the policy gradient [51][Lemma 4.2].

###### Lemma 1 (Lipschitz-Continuity of Policy Gradient).

The policy gradient $\nabla J(\theta)$ is Lipschitz continuous with some constant $L>0$, i.e., for any $\theta_1,\theta_2$,

$$\|\nabla J(\theta_1)-\nabla J(\theta_2)\|\le L\cdot\|\theta_1-\theta_2\|. \tag{21}$$

This lemma allows us to establish an approximate ascent lemma for a random variable $W_k$ defined by

$$W_k=J(\theta_k)-L\sigma^2\sum_{j=k}^{\infty}\eta_j^2, \tag{22}$$

where $J(\theta)$ is defined in (2), $\sigma^2$ is defined in Assumption 4, and $L$ is the Lipschitz constant of the gradient from Lemma 1. Unless otherwise stated, to alleviate notation, we denote $\xi_k$ as short-hand for $\xi_{T_C(k)}$.

###### Lemma 2.

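Consider the actor parameter sequence $\{\theta_k\}$ defined by Algorithm 1. The sequence $W_k$ defined in (22) satisfies the inequality

$$\mathbb{E}[W_{k+1}\mid\mathcal{F}_k]\ge W_k+\eta_k\|\nabla J(\theta_k)\|^2-\eta_kC\,\mathbb{E}[\|\xi_k-\xi_*\|\mid\mathcal{F}_k], \tag{23}$$

where $C=B_\Theta C_2C_3/(1-\gamma)$, with $B_\Theta$ the bound on the score function as in Assumption 1, $C_2$ the bound on the feature map in Assumption 3, and $C_3$ the bound on the policy gradient in (7).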

###### Proof.

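By definition of $W_k$, we write the expression for $W_{k+1}$:

$$W_{k+1}=J(\theta_{k+1})-L\sigma^2\sum_{j=k+1}^{\infty}\eta_j^2. \tag{24}$$

By the Mean Value Theorem, there exists $\tilde{\theta}_k=\lambda\theta_k+(1-\lambda)\theta_{k+1}$ for some $\lambda\in[0,1]$ such that

$$J(\theta_{k+1})=J(\theta_k)+(\theta_{k+1}-\theta_k)^\top\nabla J(\tilde{\theta}_k). \tag{25}$$

Substitute this expression for $J(\theta_{k+1})$ in (24):

$$W_{k+1}=J(\theta_k)+(\theta_{k+1}-\theta_k)^\top\nabla J(\tilde{\theta}_k)-L\sigma^2\sum_{j=k+1}^{\infty}\eta_j^2. \tag{26}$$

Add and subtract $(\theta_{k+1}-\theta_k)^\top\nabla J(\theta_k)$ on the right-hand side of (26) to obtain

$$W_{k+1}=J(\theta_k)+(\theta_{k+1}-\theta_k)^\top\left(\nabla J(\tilde{\theta}_k)-\nabla J(\theta_k)\right)+(\theta_{k+1}-\theta_k)^\top\nabla J(\theta_k)-L\sigma^2\sum_{j=k+1}^{\infty}\eta_j^2. \tag{27}$$

By the Cauchy-Schwarz inequality, we know $(\theta_{k+1}-\theta_k)^\top(\nabla J(\tilde{\theta}_k)-\nabla J(\theta_k))\ge-\|\theta_{k+1}-\theta_k\|\cdot\|\nabla J(\tilde{\theta}_k)-\nabla J(\theta_k)\|$. Further, by the Lipschitz continuity of the gradient, we know $\|\nabla J(\tilde{\theta}_k)-\nabla J(\theta_k)\|\le L\cdot\|\tilde{\theta}_k-\theta_k\|$. Therefore, we have

$$(\theta_{k+1}-\theta_k)^\top\left(\nabla J(\tilde{\theta}_k)-\nabla J(\theta_k)\right)\ge-L\|\tilde{\theta}_k-\theta_k\|\cdot\|\theta_{k+1}-\theta_k\|\ge-L\|\theta_{k+1}-\theta_k\|^2, \tag{28}$$

where the second inequality comes from substituting $\tilde{\theta}_k=\lambda\theta_k+(1-\lambda)\theta_{k+1}$, which gives $\|\tilde{\theta}_k-\theta_k\|=(1-\lambda)\|\theta_{k+1}-\theta_k\|\le\|\theta_{k+1}-\theta_k\|$. We substitute this expression into the definition of $W_{k+1}$ in (27) to obtain

$$W_{k+1}\ge J(\theta_k)+(\theta_{k+1}-\theta_k)^\top\nabla J(\theta_k)-L\|\theta_{k+1}-\theta_k\|^2-L\sigma^2\sum_{j=k+1}^{\infty}\eta_j^2. \tag{29}$$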

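Take the expectation with respect to the filtration $\mathcal{F}_k$ and substitute the definition of the actor update (17):

$$\mathbb{E}[W_{k+1}\mid\mathcal{F}_k]\ge J(\theta_k)+\mathbb{E}[\theta_{k+1}-\theta_k\mid\mathcal{F}_k]^\top\nabla J(\theta_k)-L\,\mathbb{E}\left[\|\eta_k\hat{\nabla}J(\theta_k)\|^2\mid\mathcal{F}_k\right]-L\sigma^2\sum_{j=k+1}^{\infty}\eta_j^2. \tag{30}$$

Together with the fact that this update is the step-size times the estimate of the gradient, which is bounded as in (8) (Assumption 4), we obtain

$$\mathbb{E}[W_{k+1}\mid\mathcal{F}_k]\ge J(\theta_k)+\mathbb{E}[\theta_{k+1}-\theta_k\mid\mathcal{F}_k]^\top\nabla J(\theta_k)-L\sigma^2\eta_k^2-L\sigma^2\sum_{j=k+1}^{\infty}\eta_j^2. \tag{31}$$

The terms on the right-hand side outside the expectation may be identified as $W_k$ [cf. (22)] by definition, which allows us to write

$$\mathbb{E}[W_{k+1}\mid\mathcal{F}_k]\ge W_k+\mathbb{E}[\theta_{k+1}-\theta_k\mid\mathcal{F}_k]^\top\nabla J(\theta_k). \tag{32}$$

Therefore, we are left to show that the last term on the right-hand side of the preceding expression is “nearly” an ascent direction, i.e.,

$$\mathbb{E}[\theta_{k+1}-\theta_k\mid\mathcal{F}_k]^\top\nabla J(\theta_k)\ge\eta_k\|\nabla J(\theta_k)\|^2-\eta_kC\,\mathbb{E}[\|\xi_k-\xi_*\|\mid\mathcal{F}_k], \tag{33}$$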

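and how far from an ascent direction it is depends on the critic estimate bias $\|\xi_k-\xi_*\|$. From Algorithm 1, the actor parameter update may be written as

$$\theta_{k+1}-\theta_k=\frac{1}{1-\gamma}\eta_k\xi_k^\top\varphi(s_{T_k},a_{T_k})\nabla\log\pi(s_{T_k},a_{T_k}\mid\theta_k). \tag{34}$$

Add and subtract $\frac{1}{1-\gamma}\eta_k\xi_*^\top\varphi(s_{T_k},a_{T_k})\nabla\log\pi(s_{T_k},a_{T_k}\mid\theta_k)$ in (34), where $\xi_*$ is such that the estimate $\xi_*^\top\varphi(s,a)$ is unbiased. Hence, $\|\xi_k-\xi_*\|$ represents the distance between the critic parameters corresponding to the biased estimate after $T_C(k)$ critic only steps and the true estimate of the $Q$ function:

$$\theta_{k+1}-\theta_k=\frac{1}{1-\gamma}\eta_k(\xi_k-\xi_*)^\top\varphi(s_{T_k},a_{T_k})\nabla\log\pi(s_{T_k},a_{T_k}\mid\theta_k)+\frac{1}{1-\gamma}\eta_k\xi_*^\top\varphi(s_{T_k},a_{T_k})\nabla\log\pi(s_{T_k},a_{T_k}\mid\theta_k). \tag{35}$$

Here we recall (12) and (13) from the derivation of the algorithm, namely that the stochastic gradient estimate evaluated at $\xi_*$ is unbiased. Therefore, by taking the expectation of (35) with respect to the filtration $\mathcal{F}_k$, we obtain

$$\mathbb{E}[\theta_{k+1}-\theta_k\mid\mathcal{F}_k]=\frac{1}{1-\gamma}\eta_k\,\mathbb{E}\left[(\xi_k-\xi_*)^\top\varphi(s_{T_k},a_{T_k})\nabla\log\pi(s_{T_k},a_{T_k}\mid\theta_k)\mid\mathcal{F}_k\right]+\eta_k\nabla J(\theta_k). \tag{36}$$

Take the inner product with $\nabla J(\theta_k)$ on both sides:

$$\mathbb{E}[\theta_{k+1}-\theta_k\mid\mathcal{F}_k]^\top\nabla J(\theta_k)=\frac{1}{1-\gamma}\eta_k\,\mathbb{E}\left[(\xi_k-\xi_*)^\top\varphi(s_{T_k},a_{T_k})\nabla\log\pi(s_{T_k},a_{T_k}\mid\theta_k)\mid\mathcal{F}_k\right]^\top\nabla J(\theta_k)+\eta_k\|\nabla J(\theta_k)\|^2. \tag{37}$$

The first term on the right-hand side is lower-bounded by the negative of its absolute value, i.e.,

$$\mathbb{E}[\theta_{k+1}-\theta_k\mid\mathcal{F}_k]^\top\nabla J(\theta_k)\ge-\frac{1}{1-\gamma}\eta_k\left|\mathbb{E}\left[(\xi_k-\xi_*)^\top\varphi(s_{T_k},a_{T_k})\nabla\log\pi(s_{T_k},a_{T_k}\mid\theta_k)\mid\mathcal{F}_k\right]^\top\nabla J(\theta_k)\right|+\eta_k\|\nabla J(\theta_k)\|^2. \tag{38}$$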

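Next, we apply the Cauchy-Schwarz inequality to the first term on the right-hand side of the previous expression, followed by Jensen’s inequality, and then Cauchy-Schwarz again, to obtain

$$\begin{aligned}
\mathbb{E}[\theta_{k+1}-\theta_k\mid\mathcal{F}_k]^\top\nabla J(\theta_k)
&\ge-\frac{1}{1-\gamma}\eta_k\left\|\mathbb{E}\left[(\xi_k-\xi_*)^\top\varphi(s_{T_k},a_{T_k})\nabla\log\pi(s_{T_k},a_{T_k}\mid\theta_k)\mid\mathcal{F}_k\right]\right\|\cdot\|\nabla J(\theta_k)\|+\eta_k\|\nabla J(\theta_k)\|^2\\
&\ge-\frac{1}{1-\gamma}\eta_k\,\mathbb{E}\left[\left|(\xi_k-\xi_*)^\top\varphi(s_{T_k},a_{T_k})\right|\cdot\|\nabla\log\pi(s_{T_k},a_{T_k}\mid\theta_k)\|\mid\mathcal{F}_k\right]\cdot\|\nabla J(\theta_k)\|+\eta_k\|\nabla J(\theta_k)\|^2\\
&\ge-\frac{1}{1-\gamma}\eta_k\,\mathbb{E}\left[\|\xi_k-\xi_*\|\cdot\|\varphi(s_{T_k},a_{T_k})\|\cdot\|\nabla\log\pi(s_{T_k},a_{T_k}\mid\theta_k)\|\mid\mathcal{F}_k\right]\cdot\|\nabla J(\theta_k)\|+\eta_k\|\nabla J(\theta_k)\|^2. \qquad (39)
\end{aligned}$$

Because the score function ($B_\Theta$), the feature map ($C_2$), and the gradient ($C_3$) are bounded, we define $C$ to be the product of these constants with $1/(1-\gamma)$:

$$\frac{1}{1-\gamma}\|\varphi(s_{T_k},a_{T_k})\|\cdot\|\nabla\log\pi(s_{T_k},a_{T_k}\mid\theta_k)\|\cdot\|\nabla J(\theta_k)\|\le\frac{B_\Theta C_2C_3}{1-\gamma}=:C, \tag{40}$$

which may be substituted into (39) to write

$$\mathbb{E}[\theta_{k+1}-\theta_k\mid\mathcal{F}_k]^\top\nabla J(\theta_k)\ge-C\eta_k\,\mathbb{E}[\|\xi_k-\xi_*\|\mid\mathcal{F}_k]+\eta_k\|\nabla J(\theta_k)\|^2. \tag{41}$$

Now, we can express this relationship in terms of $W_k$ by substituting back into (32):

$$\mathbb{E}[W_{k+1}\mid\mathcal{F}_k]\ge W_k-C\eta_k\,\mathbb{E}[\|\xi_k-\xi_*\|\mid\mathcal{F}_k]+\eta_k\|\nabla J(\theta_k)\|^2, \tag{42}$$

which is as stated in (23). ∎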

From (23) (Lemma 2), consider taking the total expectation

$$\mathbb{E}[W_{k+1}]\ge\mathbb{E}[W_k]+\eta_k\,\mathbb{E}[\|\nabla J(\theta_k)\|^2]-\eta_kC\,\mathbb{E}[\|\xi_k-\xi_*\|]. \tag{43}$$

This almost describes an ascent of the variable $W_k$. Because the norm of the gradient is non-negative, if the term $\eta_kC\,\mathbb{E}[\|\xi_k-\xi_*\|]$ were removed, an argument could be constructed to show that in expectation, the gradient converges to zero. Unfortunately, the term of the error in critic estimate complicates the picture. However, by Assumption 5 (which is not really an assumption but rather a fundamental property of most common policy evaluation schemes), we know that the error goes to zero in expectation as the number of critic steps increases. Thus, we leverage this property to derive the sample complexity of actor-critic (Algorithm 1).

We now present our main result, which is the convergence rate of the actor-critic method when the algorithm remains agnostic to the particular choice of critic scheme. We characterize the rate of convergence by the smallest number of actor updates $k$ required to attain a squared norm of the value function gradient smaller than $\epsilon$, i.e.,

$$K_\epsilon=\min\left\{k:\inf_{0\le m\le k}\|\nabla J(\theta_m)\|^2<\epsilon\right\}. \tag{44}$$

###### Theorem 1.

Suppose the step-size satisfies for and the critic update satisfies Assumption 5. When the critic bias converges to null as , i.e., in (15), then critic updates occur per actor update. Alternatively, if the critic bias converges to null more slowly then critic updates per actor update are chosen. Then the actor sequence defined by Algorithm 1 satisfies

$$K_\epsilon\le O\left(\epsilon^{-1/\ell}\right),\quad\text{where }\ell=\min\{a,1-a,b\}. \tag{45}$$

Minimizing the bound (45) over $a$ yields $a=1/2$ for the actor step-size, so that $\ell=\min\{1/2,b\}$. Moreover, depending on the rate of attenuation of the critic bias [cf. (15)], the resulting sample complexity is:

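$$K_\epsilon\le\begin{cases}O\left(\epsilon^{-1/b}\right)&\text{if }b\in(0,1/2),\\O\left(\epsilon^{-2}\right)&\text{if }b\in(1/2,1].\end{cases} \tag{46}$$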
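The exponent in (45)-(46) and the stopping index (44) are simple to compute, which helps when comparing critics with different decay rates $b$. The helpers below are a sketch; the names `rate_exponent` and `first_hitting_index`, and the restriction of `a` to the open interval $(0,1)$, are assumptions made so that $\ell$ is positive.

```python
def rate_exponent(a, b):
    """Exponent 1/l in K_eps = O(eps**(-1/l)) with l = min(a, 1 - a, b), as in (45)."""
    if not (0.0 < a < 1.0 and 0.0 < b <= 1.0):
        raise ValueError("expected 0 < a < 1 and 0 < b <= 1")
    return 1.0 / min(a, 1.0 - a, b)


def first_hitting_index(grad_sq_norms, eps):
    """K_eps from (44): first k whose running minimum of ||grad J||^2 is below eps.

    Returns None if the threshold is never reached in the given sequence.
    """
    for k, g in enumerate(grad_sq_norms):
        if g < eps:
            return k
    return None


if __name__ == "__main__":
    # With a = 1/2, the exponent depends on the critic only when b < 1/2.
    for b in (0.25, 0.5, 1.0):
        print(b, rate_exponent(0.5, b))
```

Since the running infimum in (44) first drops below $\epsilon$ exactly when some individual term does, scanning for the first term below $\epsilon$ is enough.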
###### Proof.

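Begin by substituting the definition of $W_k$ [cf. (22)] into Lemma 2, i.e., (23), to write

$$\mathbb{E}[J(\theta_{k+1})\mid\mathcal{F}_k]-L\sigma^2\sum_{j=k+1}^{\infty}\eta_j^2\ge J(\theta_k)-L\sigma^2\sum_{j=k}^{\infty}\eta_j^2+\eta_k\|\nabla J(\theta_k)\|^2-\eta_kC\,\mathbb{E}[\|\xi_k-\xi_*\|\mid\mathcal{F}_k]. \tag{47}$$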

The term