Reinforcement learning (RL) is a form of adaptive control where the system model is unknown, and one seeks to estimate parameters of a controller through repeated interaction with the environment [5, 40]. This framework gained attention recently for its ability to express problems that exhibit complicated dependencies between action selection and environmental response, i.e., when the cost function or system dynamics are difficult to express analytically. This is the case in supply chain management, power systems, robotic manipulation, and games of various kinds [46, 38, 14]. Although the expressive capability of RL continues to motivate new and diverse applications, its computational challenges remain doggedly persistent.
More specifically, RL is defined by a Markov decision process: at each time, an agent in a given state selects an action and then transitions to a new state according to a distribution that is Markov in the current state and action. Then, the environment reveals a reward informing the quality of that decision. The goal of the agent is to select an action sequence which yields the largest expected accumulation of rewards, defined as the value. Two dominant approaches to RL have emerged since its original conception by Bellman. The first, dynamic programming, writes the value as the expected one-step reward plus all subsequent rewards (the Bellman equations), and then proceeds by stochastic fixed-point iterations. Combining dynamic programming approaches with nonlinear function parameterizations, however, may cause instability.
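To make the dynamic-programming route concrete, the following minimal sketch runs the Bellman fixed-point (value) iteration on a tiny MDP; the transition probabilities and rewards are invented purely for illustration:

```python
import numpy as np

# Toy 2-state, 2-action MDP (numbers invented for illustration only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[s, a, s']: transition probabilities
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],                   # R[s, a]: one-step rewards
              [0.0, 2.0]])
gamma = 0.9                                 # discount factor

# Fixed-point iteration on the Bellman optimality equation:
#   V(s) <- max_a [ R(s, a) + gamma * sum_s' P(s, a, s') V(s') ]
V = np.zeros(2)
for _ in range(500):
    V = np.max(R + gamma * P @ V, axis=1)
```

Each sweep is a contraction with modulus gamma, so the iterates converge geometrically to the optimal value function of this toy problem.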
On the other hand, the alternative approach, policy search, hypothesizes that actions are chosen according to a parameterized distribution, and then repeatedly revises those parameters according to stochastic search directions. Policy search has gained popularity due to its ability to scale to large (or continuous) spaces and exhibit global convergence, although its variance, and hence its sample complexity, may be impractically large. Also worth mentioning is Monte Carlo search (“guess and check”) [2, 20], which is essential to reducing large spaces to only viable hypotheses.
In this work, we focus on methods that operate in the intersection of dynamic programming and policy search, called actor-critic [25, 24]. Actor-critic is an online form of policy iteration that inherits the ability of policy search to scale to large or continuous spaces, while reducing its number of queries to the environment. In particular, the policy gradient method repeatedly revises policy parameter estimates through policy gradient steps. Owing to the Policy Gradient Theorem, the policy gradient is the product of two factors: the score function and the Q-function. One may employ Monte Carlo rollouts to acquire the Q-estimates, which under careful choice of the rollout horizon may be shown to be unbiased [34, 52]. Doing so, however, requires an inordinate amount of querying of the environment in order to generate trajectory data.
Actor-critic replaces Monte Carlo rollouts for the Q-value by stochastic approximations of solutions to the Bellman equations, i.e., temporal difference (TD) or gradient temporal difference (GTD) steps. Intuitively, this weaving together of the merits of dynamic programming and policy search yields scalability comparable to policy search while reducing its sample complexity. However, the iteration (and sample) complexity of actor-critic is noticeably absent from the literature, which is striking given its foundational role in modern reinforcement learning systems [30, 38], and the fact that efforts to improve upon it also only establish asymptotics. This absence is due to the fact that actor-critic algorithms exhibit two technical challenges: (i) their sample path is dependent (non-i.i.d.), and (ii) their search directions are biased.
In this work, we mitigate these challenges through (i) the use of a Monte Carlo rollout scheme to estimate the policy gradient, given a value function estimate; and (ii) employing recently established sample complexity results for policy evaluation under linear basis expansion. Doing so permits us to characterize for the first time the complexity of actor-critic algorithms under a few canonical settings and schemes for the critic (policy evaluation) update. Our results hinge upon viewing policy search as a form of stochastic gradient method for maximizing a non-convex function, where the ascent directions are biased. Moreover, the magnitude of this bias is determined by the number of critic steps. This perspective treats actor-critic as a form of two-time-scale algorithm, whose asymptotic stability is well-known via dynamical systems tools [26, 10]. To wield these approaches to establish finite-time performance, however, concentration probabilities and geometric ergodicity assumptions on the Markov dynamics are required. To obviate these complications and exploit recent unbiased sampling procedures [34, 52], we focus on the case where independent trajectory samples are acquirable through querying the environment.
| Critic Method | Convergence Rate | State-Action Space | Smoothness Assumptions | Algorithm |
| --- | --- | --- | --- | --- |
| GTD (SCGD) | | Continuous | Assumption 6 | Alg. 2 |
| GTD (A-SCGD) | | Continuous | Assumptions 6 and 7 | Alg. 3 |
Our main result establishes that actor-critic, independent of any critic method, exhibits convergence to stationary points of the value function at rates comparable to stochastic gradient ascent in the non-convex regime. We note that a key distinguishing feature from standard non-convex stochastic programming is that the rates are inherently tied to the bias of the search direction, which is determined by the choice of critic scheme. In fact, our methodology is such that a rate for actor-critic can be derived for any critic-only method for which a convergence rate in expectation on the parameters can be expressed. In particular, we characterize the rates for actor-critic with temporal difference (TD) and gradient TD (GTD) critic steps. Furthermore, we propose an Accelerated GTD (A-GTD) method derived from accelerations of stochastic compositional (quasi-)gradient descent, which converges faster than TD and GTD.
In summary, for continuous spaces, we establish that GTD and A-GTD converge faster than TD. In particular, this introduces a trade-off between the smoothness assumptions required and the rates derived (see Table 1). TD requires no additional smoothness assumptions, and achieves a rate analogous to the non-convex analysis of stochastic compositional gradient descent. Adding a smoothness assumption, GTD achieves a faster rate. By requiring an additional strong convexity assumption, we find that A-GTD achieves the fastest convergence rate. A convergence rate is also established for the case of a finite state-action space. Overall, the contributions in terms of sample complexities of the different actor-critic algorithms may be found in Table 1.
We evaluate actor-critic with TD, GTD, and A-GTD critic updates on a navigation problem. We find that indeed A-GTD converges faster than both GTD and TD. Interestingly, the stationary point it reaches is worse than that of GTD or TD. This suggests that the choice of critic scheme illuminates an interplay between optimization and generalization that is less well understood in reinforcement learning [13, 12]. Surprisingly, we find that TD converges faster than GTD, which we postulate is an artifact of the selection of the feature space parametrization. A detailed discussion of the results and their implications can be found in Section 7. The remainder of the paper is organized as follows. Section 2 describes the problem of reinforcement learning and characterizes the common assumptions which we use in our analysis. In Section 3 we derive a generic actor-critic algorithm from an optimization perspective and describe how the algorithm is amended given different policy evaluation methods. The derivation of the convergence rate for generic actor-critic is presented in Section 4, and the specific analyses for gradient, accelerated gradient, and vanilla temporal difference critics are characterized in Sections 5 and 6.
2 Reinforcement Learning
In reinforcement learning (RL), an agent moves through a state space and takes actions that belong to some action set, where the state and action spaces are assumed to be continuous compact subsets of Euclidean space. Every time an action is taken, the agent transitions to its next state, which depends on its current state and action. Moreover, a reward is revealed by the environment. In this situation, the agent would like to accumulate as much reward as possible in the long term, which is referred to as the value. Mathematically, this problem definition may be encapsulated as a Markov decision process (MDP): a tuple with a Markov transition density that determines the probability of moving to the next state. Here, the discount factor parameterizes the value of a given sequence of actions, which we will define shortly.
At each time, the agent executes an action given the current state, following a possibly stochastic policy. Then, given the state-action pair, the agent observes a (deterministic) reward and transitions to a new state according to a transition density that is Markov. For any policy mapping states to actions, define the value function as
which is a measure of the long-term reward accumulation discounted by the factor above. We can further define the value conditioned on a given initial action as the action-value, or Q-function. Given any initial state, the goal of the agent is to find the optimal policy that maximizes the long-term return, i.e., to solve the following optimization problem
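The displayed definitions elided above take the standard forms; we reconstruct them here under the usual notation (γ is the discount factor, r the reward, and the expectation is over trajectories generated by the policy π):

```latex
V^{\pi}(s) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\middle|\, s_0 = s\right],
\qquad
Q^{\pi}(s,a) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\middle|\, s_0 = s,\; a_0 = a\right],
```

and the optimization problem reads $\max_{\pi} V^{\pi}(s_0)$ for a given initial state $s_0$.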
In this work, we investigate actor-critic methods to solve (2); actor-critic is a hybrid RL method that fuses key properties of policy search and approximate dynamic programming. To ground the discussion, we first derive the canonical policy search technique called the policy gradient method, and explain how actor-critic augments policy gradient. Begin by noting that to address (2), one must search over an arbitrarily complicated function class, which may include functions that are unbounded and discontinuous. To mitigate this issue, we parameterize the policy
by a vector, yielding the family of RL tools called policy gradient methods [24, 8, 15]. Under this specification, the search over an arbitrarily complicated function class in (2) may be reduced to a vector-valued optimization over Euclidean space. Subsequently, we adopt a shorthand for the objective for notational convenience. We first make the following standard assumption on the regularity of the MDP and the parameterized policy, which are standard conditions in the literature.
Suppose the reward function and the parameterized policy satisfy the following conditions:
The absolute value of the reward is bounded uniformly by , i.e., for any .
The policy is differentiable with respect to , and the score function is -Lipschitz and has bounded norm, i.e., for any ,
Note that the boundedness of the reward function in Assumption 1 is standard in policy search algorithms [7, 8, 15, 53]. Observe that, with a bounded reward, the Q-function is absolutely upper bounded, since by definition
The same bound also applies to the value function for any state and policy, and thus to the objective defined above, i.e.,
We note that the conditions (3) and (4) have appeared in recent analyses of policy search [15, 35, 32], and are satisfied by canonical policy parameterizations such as the Boltzmann policy and the Gaussian policy. For example, consider a Gaussian policy in continuous spaces (we observe that in practice the action space is bounded, which requires a truncated Gaussian policy to be used over it), whose density is Gaussian with a parameterized mean and fixed variance. Then the score function has a closed form which satisfies (3) and (4), provided the feature vectors have bounded norm, the parameter lies in some bounded set, and the action is bounded.
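For the mean-parameterized Gaussian policy above, the score function has the closed form (a − θᵀφ(s)) φ(s)/σ², which can be verified numerically against a finite-difference derivative of log π. The following is a sketch; the parameter vector, feature vector, action, and σ are placeholder values, not quantities from the paper:

```python
import numpy as np

def gaussian_score(theta, phi_s, a, sigma):
    """Score of pi_theta(a|s) = N(theta^T phi(s), sigma^2):
    grad_theta log pi = (a - theta^T phi(s)) * phi(s) / sigma^2."""
    return (a - theta @ phi_s) * phi_s / sigma**2

def log_pi(theta, phi_s, a, sigma):
    """Log-density of the Gaussian policy at action a."""
    mu = theta @ phi_s
    return -0.5 * ((a - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

# Placeholder values for an illustrative check.
theta = np.array([0.3, -0.5])
phi_s = np.array([1.0, 2.0])
a, sigma, eps = 0.7, 0.4, 1e-6

# Central finite differences of log pi with respect to theta.
fd = np.array([(log_pi(theta + eps * e, phi_s, a, sigma) -
                log_pi(theta - eps * e, phi_s, a, sigma)) / (2 * eps)
               for e in np.eye(2)])
```

Since log π is quadratic in θ for this policy class, the central difference matches the analytic score essentially to machine precision.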
We now state the following assumptions, which are required for the base results. Additional smoothness assumptions are specific to the critic-only method used, and are detailed in Table 1.
The random tuples are drawn from the stationary distribution of the Markov reward process independently across time.
As pointed out in prior work, the i.i.d. assumption does not hold in practice, but it is standard when establishing convergence bounds in reinforcement learning. Although there has been an effort to characterize the rate of convergence in expectation for critic-only methods with Markov noise, these results do not yield a bound of the desired form in Assumption 5. For this reason, we study specifically the case where Assumption 2 holds, and leave the study of Markov noise for future work. The general conditions for stability of trajectories with Markov dependence, i.e., negative Lyapunov exponents for mixing rates, may be found in the literature.
For any state action pair , the norm of the feature representation is bounded by a constant .
Assumption 3 is easily satisfied in practice by normalizing the feature representation. Because the score function is bounded (cf. Assumption 1) and the reward function is bounded, we have that, for some constant,
for all .
Additionally, we assume that the estimate of the gradient, conditioned on the filtration, has bounded variance.
Consider a possibly biased estimate of the gradient. There exists a finite constant such that
Generally, the value function is non-convex with respect to the policy parameter, meaning that obtaining a globally optimal solution to (2) is out of reach unless the problem has additional structural properties, as in phase retrieval, matrix factorization, and tensor decomposition, among others. Thus, our goal is to design actor-critic algorithms that attain stationary points of the value function. Moreover, we characterize the sample complexity of actor-critic, a noticeable gap in the literature for an algorithmic tool that is decades old and at the heart of recent innovations in artificial intelligence architectures.
3 From Policy Gradient to Actor-Critic
In this section, we derive the actor-critic method from an optimization perspective: we view actor-critic as a way of performing stochastic gradient ascent with biased ascent directions, where the magnitude of this bias is determined by the number of critic evaluations in the inner loop of the algorithm. The building block of actor-critic is the policy gradient method, a type of direct policy search based on stochastic gradient ascent. Begin by noting that the gradient of the objective with respect to the policy parameters, owing to the Policy Gradient Theorem, has the following form:
In the preceding expression, the first factor denotes the probability that the state takes a given value given the initial state and policy, which is occasionally referred to as the occupancy measure, or the Markov chain transition density induced by the policy. Moreover, the ergodic distribution associated with the MDP for a fixed policy is shown to be a valid distribution, and for future reference we define a shorthand for it. The derivative of the logarithm of the policy is usually referred to as the score function corresponding to the policy distribution.
Next, we discuss how (10) can be used to develop stochastic methods to address (2). Unbiased samples of the gradient are required to perform stochastic gradient ascent, which hopefully converges to a stationary solution of the non-convex maximization. One way to obtain an estimate of the gradient is to evaluate the score function and Q-function at the end of a rollout whose length is drawn from a geometric distribution [Theorem 4.3]. If the Q-function evaluation is unbiased, then the stochastic estimate of the gradient is unbiased as well. We therefore define the stochastic estimate by
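A minimal sketch of this geometric-horizon estimator follows. The callables for the environment step, policy sampler, score function, and Q-estimate are hypothetical interfaces introduced for illustration, and the 1/(1 − γ) scaling reflects the usual normalization of the discounted occupancy measure:

```python
import numpy as np

def pg_estimate(env_step, policy_sample, score, q_est, s0, gamma, rng):
    """One stochastic policy-gradient sample via a geometric-horizon rollout:
    roll the policy forward T ~ Geom(1 - gamma) steps, then return
    score(s_T, a_T) * Q(s_T, a_T) / (1 - gamma)."""
    T = rng.geometric(1.0 - gamma) - 1        # horizon with support {0, 1, 2, ...}
    s = s0
    for _ in range(T):                        # advance T steps under the policy
        s = env_step(s, policy_sample(s))
    a = policy_sample(s)                      # final state-action pair
    return score(s, a) * q_est(s, a) / (1.0 - gamma)
```

Averaging many such samples approximates the policy gradient whenever the Q-estimate itself is unbiased, which is precisely the premise discussed above.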
We consider the case where the Q-function admits a linear parametrization in terms of a feature map, which in the literature on policy search is referred to as the critic, as it “criticizes” the performance of actions chosen according to the policy. Here, the feature map may be nonlinear, such as a network of radial basis functions or an auto-encoder. Moreover, we estimate the parameter that defines the Q-function from a policy evaluation (critic-only) method after some number of iterations at each policy gradient update. Thus, we may write the stochastic gradient estimate as
If the estimate of the Q-function is unbiased, then so is the gradient estimate (cf. [Theorem 4.3]). Typically, critic-only methods do not give unbiased estimates of the Q-function; however, in expectation the rate at which their bias decays is proportional to the number of estimation steps. In particular, denote the critic parameter for which the estimate is unbiased:
Hence, by adding and subtracting the true estimate of the parametrized Q-function in (12), we arrive at the fact that the policy search direction admits the following decomposition:
The second term is the unbiased estimate of the gradient, whereas the first captures the difference between the critic parameter at the current iteration and the true estimate. For linear parameterizations of the Q-function, policy evaluation methods establish convergence in mean of the bias
where the bound is some decreasing function of the number of critic steps. We address cases where the critic bias decays at such a rate, due to the fact that several state-of-the-art works on policy evaluation may be mapped to the form (15) under this specification [49, 17]. We formalize this with the following assumption.
The expected error of the critic parameter is bounded by for some , i.e., there exists constant such that
Recently, alternate rates have been established; however, those works concede that rates of the form in (15) may be possible [6, 54]. Thus, we subsume recent sample complexity characterizations of policy evaluation as (15) in Assumption 5.
As such, (14) is nearly a valid ascent direction: it is approximately an unbiased estimate of the gradient, since the first term becomes negligible as the number of critic estimation steps increases. Based upon this observation, we propose the following variant of the actor-critic method: run a critic estimator (policy evaluator) for a prescribed number of steps, whose output is the critic parameters; we denote by the critic estimator the map which returns these parameters after the prescribed iterations. Then, simulate a trajectory whose length is drawn from a geometric distribution, and update the actor (policy) parameters as:
We summarize the aforementioned procedure, which is agnostic to the particular choice of critic estimator, as Algorithm 1.
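The procedure above can be sketched as the following loop. The `critic` and `pg_sample` callables are hypothetical interfaces standing in for the critic estimator and the geometric-rollout gradient estimate, respectively; this is a sketch of the generic structure, not a definitive implementation of Algorithm 1:

```python
import numpy as np

def actor_critic(theta0, critic, pg_sample, num_actor_steps, eta, rng):
    """Generic actor-critic loop (sketch, hypothetical interfaces).

    critic(theta, k)        -> critic parameters after the k-th inner evaluation loop
    pg_sample(theta, xi, rng) -> stochastic policy-gradient estimate given the critic
    eta(k)                  -> actor step-size at iteration k
    """
    theta = np.asarray(theta0, dtype=float)
    for k in range(num_actor_steps):
        xi = critic(theta, k)                # inner loop: policy evaluation
        ghat = pg_sample(theta, xi, rng)     # geometric-horizon rollout estimate
        theta = theta + eta(k) * ghat        # stochastic gradient *ascent* step
    return theta
```

The bias of `ghat` shrinks as the inner critic loop runs longer, which is exactly the trade-off the subsequent analysis quantifies.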
Examples of Critic Updates. We note that the critic update admits two canonical forms: temporal difference (TD) and gradient temporal difference (GTD)-based estimators. The TD update for the critic is given as
whereas for the GTD-based estimator for the critic, we consider the update
We further analyze a modification of the GTD update that incorporates an extrapolation technique to reduce bias in the estimates and improve the error dependency, which is distinct from accelerated stochastic approximation with Nesterov smoothing. With the requisite auxiliary sequences defined, the accelerated GTD (A-GTD) update becomes
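As an illustrative sketch of the TD critic update with a linear parametrization, the semi-gradient TD(0) rule can be written as follows; the step-size, features, and scalar example are placeholders, not the paper's exact notation:

```python
import numpy as np

def td0_critic_step(xi, phi_s, phi_s_next, reward, gamma, alpha):
    """One semi-gradient TD(0) update for a linear critic Q ~ xi^T phi:

        delta = r + gamma * xi^T phi' - xi^T phi   (TD error)
        xi   <- xi + alpha * delta * phi           (parameter update)
    """
    delta = reward + gamma * xi @ phi_s_next - xi @ phi_s
    return xi + alpha * delta * phi_s
```

For a single recurring state with reward 1 and discount 0.9, repeated application drives the critic parameter toward the fixed point 1/(1 − 0.9) = 10, illustrating the bias-decay property assumed of critic-only methods.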
Subsequently, we shift focus to characterizing the mean convergence of the actor-critic method given any policy evaluation method satisfying (15) in Section 4. Then, we specialize the sample complexity of actor-critic to the cases associated with the critic updates (18)–(20), which we respectively call Classic (Algorithm 4), Gradient (Algorithm 2), and Accelerated Actor-Critic (Algorithm 3).
4 Convergence Rate of Generic Actor-Critic
In this section, we derive the rate of convergence in expectation for the variant of actor-critic defined in Algorithm 1, which is agnostic to the particular choice of policy evaluation method used to estimate the Q-function in the actor update. Unsurprisingly, we establish that the rate of convergence in expectation for actor-critic depends on the critic update used. Therefore, we present the main result of this paper for any generic critic method. Thereafter, we specialize this result to the two well-known choices of policy evaluation previously described, (18) and (19), as well as a new variant that employs acceleration (20).
Lemma 1 (Lipschitz-Continuity of Policy Gradient).
The policy gradient is Lipschitz continuous with some constant , i.e., for any
This lemma allows us to establish an approximate ascent lemma for a random variable defined by
By definition of , we write the expression for
By the Mean Value Theorem, there exists for some such that
Substitute this expression for in (24)
Add and subtract to the right hand side of (26) to obtain
By the Cauchy–Schwarz inequality, we know . Further, by the Lipschitz continuity of the gradient, we know . Therefore, we have
where the second inequality comes from substituting . We substitute this expression into the definition of in (27) to obtain
Take the expectation with respect to the filtration and substitute the definition of the actor update (17):
The terms on the right hand side outside the expectation may be identified as [cf. (22)] by definition, which allows us to write
Therefore, we are left to show that the last term on the right-hand side of the preceding expression is “nearly” an ascent direction, i.e.,
and how far from an ascent direction it is depends on the critic estimate bias . From Algorithm 1, the actor parameter update may be written as
Add and subtract, in (34), the critic parameter for which the estimate is unbiased. This term represents the distance between the critic parameters corresponding to the biased estimate after the critic-only steps and the true estimate of the Q-function.
Here we recall (12) and (13) from the derivation of the algorithm, namely that the expected value of the stochastic estimate, conditioned on the critic parameter, is unbiased. Therefore, by taking the expectation of (35) with respect to the filtration, we obtain
Take the inner product with on both sides
The first term on the right-hand side is lower-bounded by the negative of its absolute value, i.e.,
Next, we apply the Cauchy–Schwarz inequality to the first term on the right-hand side of the previous expression, followed by Jensen's inequality, and then Cauchy–Schwarz again, to obtain
Because the score function, the feature map, and the gradient are bounded, we define a constant as the product of these bounds:
which may be substituted into (39) to write
Now, we can express this relationship in terms of by substituting back into (32):
which is as stated in (23). ∎
This almost describes an ascent of the variable defined in (22). Because the norm of the gradient is non-negative, if the error term were removed, an argument could be constructed to show that, in expectation, the gradient converges to zero. Unfortunately, the critic estimation error complicates the picture. However, by Assumption 5 (which is not really an assumption but rather a fundamental property of most common policy evaluation schemes), we know that this error goes to zero in expectation as the number of critic steps increases. Thus, we leverage this property to derive the sample complexity of actor-critic (Algorithm 1).
We now present our main result, which is the convergence rate of the actor-critic method when the algorithm remains agnostic to the particular choice of critic scheme. We characterize the rate of convergence by the smallest number of actor updates required to attain a value function gradient smaller than a given threshold, i.e.,
Suppose the step-size satisfies the stated decay condition and the critic update satisfies Assumption 5. When the critic bias converges to zero at the rate in (15), a corresponding number of critic updates occurs per actor update; alternatively, if the critic bias converges to zero more slowly, then more critic updates per actor update are chosen. Then the actor sequence defined by Algorithm 1 satisfies
Minimizing over yields actor step-size . Moreover, depending on the rate of attenuation of the critic bias [cf. (15)], the resulting sample complexity is: