1 Overview
Reinforcement learning methods navigate an environment and seek to maximize their reward (Sutton and Barto, 2018). A key tension is the tradeoff between exploration and exploitation: does a learner (also called an agent or policy) explore for new high-reward states, or does it exploit the best states it has already found? This is a sensitive part of RL algorithm design, as it is easy for methods to become blind to parts of the state space; to combat this, many methods have an explicit exploration component, for instance the ε-greedy method, which forces exploration in all states with probability ε (Sutton and Barto, 2018; Tokic, 2010). Similarly, many methods must use projections and regularization to smooth their estimates (Williams and Peng, 1991; Mnih et al., 2016; Cen et al., 2020).
This work considers actor-critic methods, where a policy (or actor) is updated via the suggestions of a critic. In this setting, prior work invokes a combination of explicit regularization and exploration to avoid getting stuck, and makes various fast mixing assumptions to help accurate exploration. For example, recent work with a single trajectory in the tabular case used both an explicit ε-greedy component and uniform mixing assumptions (Khodadadian et al., 2021), neural actor-critic methods use a combination of projections and regularization together with various assumptions on mixing and on the path followed through policy space (Cai et al., 2019; Wang et al., 2019), and even direct analyses of the TD subroutine in our linear MDP setting make use of both projection steps and an assumption of starting from the stationary distribution (Bhandari et al., 2018).
Contribution. This work shows that a simple linear actor-critic (cf. Algorithm 1) in a linear MDP (cf. Section 1.1) with a finite but non-tabular state space (cf. Section 1.1) finds an optimal policy in samples, without any explicit exploration or projections in the algorithm and without any uniform mixing assumptions on the policy space (cf. Section 1.1). The algorithm and analysis avoid both via an implicit bias towards high entropy policies: the actor-critic policy path never leaves a Kullback-Leibler (KL) divergence ball of the maximum entropy optimal policy, and this firstly ensures implicit exploration, and secondly ensures fast mixing. In more detail:

Actor analysis via mirror descent. We write the actor update as an explicit mirror descent. While on the surface this does not change the method (e.g., in the tabular case, the method is identical to natural policy gradient (Agarwal et al., 2021)), it gives a clean optimization guarantee which carries a KL-based implicit bias consequence for free, and decouples concerns between the actor and critic.

Critic analysis via projection-free sampling tools within KL balls. The preceding mirror descent component guarantees that we stay within a small KL ball, if the statistical error of the critic is controlled. Concordantly, our sampling tools guarantee this statistical error is small, if we stay within a small KL ball. Concretely, we provide useful lemmas that every policy in a KL ball around the high entropy policy has uniformly upper bounded mixing times, and separately give a projection-free (implicitly regularized!) analysis of the standard temporal-difference (TD) update from any starting state (Sutton, 1988), whereas the closest TD analysis in the literature uses projections and requires the sampling process to be started from the stationary distribution (Bhandari et al., 2018). The mixing assumptions here contrast in general with prior work, which either makes explicit use of stationary distributions (Cai et al., 2019; Wang et al., 2019; Bhandari et al., 2018), or makes uniform mixing assumptions on all policies (Xu et al., 2020b; Khodadadian et al., 2021).
Our final proof has the above actor and critic components feeding off of each other: we use an induction to simultaneously establish that since the actor stayed in a small KL ball in previous iterations, then the new critic update is accurate, and similarly the accuracy of the previous critic updates ensures the actor continues to stay in a small KL ball. We feel these tools will be useful in other work.
1.1 Setting and main results
We will now give the setting, main result, and algorithm in full. Further details on MDPs can be found in Section 1.3, and the actor-critic method appears in Algorithm 1. To start, the environment and policies are as follows.
The Markov Decision Process (MDP) has states and finitely many actions , and finite rewards . States are observed in some feature encoding , but the state space is assumed finite: .
Policies are linear softmax policies: a policy is given by a weight matrix , and given a state , uses a per-state softmax to sample a new action :
(1) 
Let denote the set of optimal actions. It is assumed that is nonempty for every ; equivalently, for every state, there exists an optimal policy whose stationary distribution exists and places positive mass on that state.
The choice of linear policies simplifies presentation and analysis, but the tools here should be applicable to other settings. This choice also allows direct comparison to the widely-studied implicit bias of gradient descent in linear classification settings (Soudry et al., 2017; Ji and Telgarsky, 2018), as will be discussed further in Section 1.2. The choice of finite state space is to remove measure-theoretic concerns and to allow a simple characterization of the maximum entropy optimal policy.
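As an illustration of the linear softmax policy described above, here is a minimal sketch. The shapes are assumptions made for the example (a weight matrix `W` of shape `(d, k)`, state features `s` of length `d`, and actions identified with the `k` standard basis vectors, so that action `i` has logit `s @ W[:, i]`), and the function name `linear_softmax_policy` is hypothetical.

```python
import numpy as np

def linear_softmax_policy(W, s):
    """Action probabilities for state features s under weight matrix W.

    Assumed shapes (illustrative only): W is (d, k), s is (d,).
    Action i receives the logit s @ W[:, i].
    """
    logits = s @ W
    logits = logits - logits.max()   # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(0)
d, k = 4, 3
W = rng.normal(size=(d, k))
s = rng.normal(size=d)
s /= np.linalg.norm(s)               # unit-norm features, an assumption

probs = linear_softmax_policy(W, s)
action = rng.choice(k, p=probs)      # sample an action index from the policy

# Zero weights give the uniform (maximum entropy) policy in every state.
uniform = linear_softmax_policy(np.zeros((d, k)), s)

# Adding the same vector to every column of W shifts all logits equally,
# so the induced policy is unchanged.
shifted = linear_softmax_policy(W + np.outer(rng.normal(size=d), np.ones(k)), s)
```

The last two lines highlight two properties of the parameterization: the zero matrix corresponds to the uniform policy, and the weights are only identified up to a common shift of the columns.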
[simplification of Appendix A] Under Section 1.1, there exists a unique maximum entropy policy , which satisfies for every state .
To round out this introductory presentation of the actor, the last component is the update: and . This is explicitly a mirror descent or dual averaging representation of the policy, where we use a mirror mapping to obtain the policy from pre-softmax values . As mentioned before, this update appears in prior work in the tabular setting with natural policy gradient and actor-critic (Agarwal et al., 2021; Khodadadian et al., 2021). We will motivate this choice in our more general non-tabular setting in Section 2.
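The dual averaging view of the actor can be sketched in the tabular special case. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1: the pre-softmax values are stored as a table `W` of shape `(n_states, k)`, the critic's estimate `Q_hat` is a random placeholder, and the names `actor_step` and `softmax_rows` are hypothetical. It checks that adding a scaled estimate to the pre-softmax values matches the multiplicative form of the tabular natural policy gradient update.

```python
import numpy as np

def softmax_rows(Z):
    """Mirror mapping: apply a softmax to each row of pre-softmax values."""
    Z = Z - Z.max(axis=1, keepdims=True)
    P = np.exp(Z)
    return P / P.sum(axis=1, keepdims=True)

def actor_step(W, Q_hat, theta):
    """Dual averaging step: accumulate the critic's estimate in the dual."""
    return W + theta * Q_hat

rng = np.random.default_rng(1)
n_states, k, theta = 5, 3, 0.1
W = np.zeros((n_states, k))                  # start from the uniform policy
pi = softmax_rows(W)
Q_hat = rng.uniform(size=(n_states, k))      # placeholder critic output

W_next = actor_step(W, Q_hat, theta)
pi_next = softmax_rows(W_next)

# Equivalent multiplicative-weights form: pi_next is proportional to
# pi * exp(theta * Q_hat) in every state.
mw = pi * np.exp(theta * Q_hat)
mw /= mw.sum(axis=1, keepdims=True)

# Subtracting a per-state constant from Q_hat (as an advantage would)
# leaves the policy unchanged, since the softmax normalizes it out.
centered = Q_hat - Q_hat.mean(axis=1, keepdims=True)
pi_centered = softmax_rows(actor_step(W, centered, theta))

# Per-state KL divergence between consecutive policies.
kl = np.sum(pi_next * np.log(pi_next / pi), axis=1)
```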
The final assumption and description of the critic are as follows. As will be discussed in Section 2, the policy becomes optimal if is an accurate estimate of the true function. We employ a standard TD update with no projections or constraints. To guarantee that this linear model of is accurate, we make a standard linear MDP assumption (Bradtke and Barto, 1996; Melo and Ribeiro, 2007; Jin et al., 2020).
In words, the linear MDP assumption is that the MDP rewards and transitions are modeled by linear functions. In more detail, for convenience first fix a canonical vector form for state/action pairs : let denote the vector obtained via unrolling the matrix row-wise (whereby vector inner products with match matrix inner products with ). The linear MDP assumption is then that there exists a fixed vector and a fixed matrix so that for any state/action pair and any subsequent state ,
Lastly, suppose for all .
Though this is a strong assumption, it is common; moreover, since TD must continually interact with the MDP, it would have little hope of accuracy if it could not model short-term MDP dynamics. Indeed, as is shown in Section C.2 (but appears in various forms throughout the literature), Section 1.1 implies that the fixed point of the TD update is the true function.
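The fixed-point remark can be checked numerically in the simplest special case. The following is a minimal sketch under several assumptions that are not from the paper: a small random tabular MDP, one-hot features for state/action pairs (a trivial linear model), a uniform weighting `rho` over pairs in place of the stationary distribution, and a deterministic expected update in place of sampled TD. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, k, gamma = 4, 2, 0.9
n_pairs = n_states * k

# Random transition kernel P[s, a, s'], rewards r[s, a], and policy pi[s, a].
P = rng.uniform(size=(n_states, k, n_states))
P /= P.sum(axis=2, keepdims=True)
r = rng.uniform(size=(n_states, k))
pi = rng.uniform(size=(n_states, k))
pi /= pi.sum(axis=1, keepdims=True)

# Transition matrix on state/action pairs: (s, a) -> (s', a') has
# probability P[s, a, s'] * pi[s', a'].
P_pairs = np.einsum("sat,tb->satb", P, pi).reshape(n_pairs, n_pairs)

# True action-value function of pi, from the Bellman equation.
q_true = np.linalg.solve(np.eye(n_pairs) - gamma * P_pairs, r.reshape(-1))

def expected_td_direction(u, rho):
    """Expected TD(0) update direction with one-hot features, pairs weighted by rho."""
    td_error = r.reshape(-1) + gamma * P_pairs @ u - u
    return rho * td_error

rho = np.full(n_pairs, 1.0 / n_pairs)   # uniform weighting, an assumption

# The true action values are a fixed point: the expected update vanishes there.
direction_at_truth = expected_td_direction(q_true, rho)

# Iterating the expected update from zero approaches the same point.
u = np.zeros(n_pairs)
for _ in range(5000):
    u = u + 0.5 * expected_td_direction(u, rho)
```

No projection or clipping is applied to `u` during the iteration. This toy check says nothing about the sampled, single-trajectory setting analyzed in the paper.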
We now state our main result, which bounds not just the value function (cf. Section 1.3) but also the KL divergence , where is the visitation distribution of the maximum entropy optimal policy when run from state (cf. Section 1.3).
Suppose Sections 1.1 and 1.1 (which imply the (unique) maximum entropy optimal policy is well-defined, and also irreducible and aperiodic). Fix an iteration count and confidence , and choose parameters
where the constants hidden in and depend only on and the MDP, and not on . With these parameters in place, invoke Algorithm 1, and let be the resulting sequence of policies. Then with probability at least , simultaneously for every state and every ,
[Discussion of Section 1.1]

Implicit bias. Since is optimal, the second term can be deleted, and the bound implies
since this holds for all , it controls the optimization path. This term is a direct consequence of our mirror descent setup, and is used to control the TD errors at every iteration.

Mixing time constants. The critic iterations and step size hide mixing time constants; these mixing time constants depend only on the KL bound , and in particular there is no hidden growth in these terms with . That is to say, mixing times are uniformly controlled over a fixed KL ball that does not depend on ; prior work by contrast makes strong mixing assumptions (Wang et al., 2019; Xu et al., 2020b; Khodadadian et al., 2021).

Single trajectory. A single trajectory through the MDP is used to remove the option of the algorithm escaping from poor choices with resets; only the implicit bias can save it.

Rate. Since the actor’s step size is , to obtain error , then actor iterations are needed, and the total number of samples is , which is slower than the given in the only other single-trajectory analysis in the literature (Khodadadian et al., 2021); by contrast, that work makes uniform mixing assumptions (cf. Khodadadian et al., 2021, Lemma C.1), requires the tabular setting, and uses ε-greedy for explicit exploration in each iteration.
The organization of the remainder of this work is as follows. The rest of this introduction gives further related work, and some notation and MDP background. Section 2 presents and discusses the mirror descent framework which provides optimization and implicit bias guarantees on the actor for free. Section 3 presents the sampling lemmas we use to control the TD error. Section 4 concludes with some discussion and open problems, and the appendices contain the full proofs.
1.2 Further related work
For the standard background in reinforcement learning, see Sutton and Barto (2018).
Natural and regular policy gradient (PG & NPG).
As mentioned before, the actor update here agrees with the natural policy gradient update in the tabular setting (Kakade, 2001); see also Agarwal et al. (2021) for a well-known analysis of natural and regular policy gradient methods. These methods are widespread in theory and practice (Williams, 1992; Sutton et al., 2000; Bagnell and Schneider, 2003; Liu et al., 2020; Fazel et al., 2018).
Natural and regular actor-critic (AC & NAC).
The study of regular and natural actor-critic methods started with Konda and Tsitsiklis (2000) and Peters and Schaal (2008) respectively. These methods are very common both in theory and practice, and there are many variants and improvements to both the actor component and the critic component (Xu et al., 2020a, b; Wu et al., 2020; Bhatnagar et al., 2009).
Regularization and constraints.
It is standard with neural policies to explicitly maintain a constraint on the network weights (Wang et al., 2019; Cai et al., 2019). Relatedly, many works both in theory and practice use explicit entropy regularization to prevent small probabilities (Williams and Peng, 1991; Mnih et al., 2016; Abdolmaleki et al., 2018), and which can seem to yield convergence rate improvements (Cen et al., 2020).
NPG and mirror descent.
The original and recent analyses of NPG had a mirror descent flavor, though mirror descent and its analysis were not explicitly invoked as a black box (Kakade, 2001; Agarwal et al., 2021). Further connections to mirror descent have appeared many times (Geist et al., 2019; Shani et al., 2020), though with a focus on the design of new algorithms, and not for any implicit regularization effect or proof. Mirror descent is used heavily throughout the online learning literature (Shalev-Shwartz, 2011), and in work handling adversarial MDP settings (Zimin and Neu, 2013).
Temporal-difference update (TD).
As discussed before, the TD update, originally presented by Sutton (1988), is standard in the actor-critic literature (Cai et al., 2019; Wang et al., 2019), and also appears in many other works cited in this section. As was mentioned, prior work requires various projections and mixing assumptions (Bhandari et al., 2018).
Implicit regularization in supervised learning.
A pervasive topic in supervised learning is the implicit regularization effect of common descent methods; concretely, standard descent methods prefer low or even minimum norm solutions, which can be converted into generalization bounds. The present work makes use of a weak implicit bias, which only prefers smaller norms and does not necessarily lead to minimal norms; arguably this idea was used in the classical perceptron method (Novikoff, 1962), but was then shown in linear and shallow network cases of SGD applied to logistic regression (Ji and Telgarsky, 2018, 2019), which was then generalized to other losses (Shamir, 2020), and also applied to other settings (Chen et al., 2019). The more well-known strong implicit bias, namely the convergence to minimum norm solutions, has been observed with exponentially-tailed losses together with coordinate descent with linear predictors (Zhang and Yu, 2005; Telgarsky, 2013), gradient descent with linear predictors (Soudry et al., 2017; Ji and Telgarsky, 2018), and deep learning in various settings (Lyu and Li, 2019; Chizat and Bach, 2020), just to name a few.
1.3 Notation
This brief notation section summarizes various concepts and notation used throughout; modulo a few inventions, the presentation mostly matches standard ones in RL (Sutton and Barto, 2018) and policy gradient (Agarwal et al., 2021). A policy maps state-action pairs to reals, and will always be a probability distribution. Given a state, the agent samples an action from , the environment returns some random reward (which has a fixed distribution conditioned on the observed pair), and then uses a transition kernel to choose a new state given .
Taking to denote a random trajectory followed by a policy interacting with the MDP from an arbitrary initial state distribution , the value and functions are respectively
where the simplified notation for Dirac distribution on state will often be used, as well as the shorthand and . Additionally, let denote the advantage function; note that the natural policy gradient update could interchangeably use or since they only differ by an action-independent constant, namely , which the softmax normalizes out. As in Section 1.1, the state space is finite but a subset of , specifically , and the action space is just the standard basis vectors . The other MDP assumption, namely of a linear MDP (cf. Section 1.1), will be used whenever TD guarantees are needed. Lastly, the discount factor has not been highlighted, but is standard in the RL literature, and will be treated as given and fixed throughout the present work.
A common tool in RL is the performance difference lemma (Kakade and Langford, 2002): letting denote the visitation distribution corresponding to policy starting from , meaning
the performance difference lemma can be written as
(2) 
where the final notation will often be employed for convenience.
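The performance difference lemma can be verified numerically on a small example. The sketch below rests on assumptions of its own: a random tabular MDP, the visitation distribution normalized to sum to one (so that a factor `1 / (1 - gamma)` appears explicitly, which may differ from the normalization used in this paper), and the advantage form of the lemma. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, k, gamma = 5, 3, 0.9

P = rng.uniform(size=(n_states, k, n_states))   # P[s, a, s']
P /= P.sum(axis=2, keepdims=True)
r = rng.uniform(size=(n_states, k))             # r[s, a]

def random_policy():
    pi = rng.uniform(size=(n_states, k))
    return pi / pi.sum(axis=1, keepdims=True)

def evaluate(pi):
    """Return V, Q, and the state-to-state kernel induced by pi."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = (pi * r).sum(axis=1)
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    Q = r + gamma * (P @ V)
    return V, Q, P_pi

def visitation(P_pi, s0):
    """Normalized discounted visitation distribution when starting from s0."""
    e = np.zeros(n_states)
    e[s0] = 1.0
    return (1 - gamma) * np.linalg.solve((np.eye(n_states) - gamma * P_pi).T, e)

pi, pi_ref = random_policy(), random_policy()
V, _, P_pi = evaluate(pi)
V_ref, Q_ref, _ = evaluate(pi_ref)
A_ref = Q_ref - V_ref[:, None]                  # advantage of the reference policy

# Left side: value gap in each start state. Right side: the reference
# advantage averaged over pi's actions and pi's visitation distribution.
lhs = V - V_ref
rhs = np.array([
    visitation(P_pi, s0) @ (pi * A_ref).sum(axis=1) / (1 - gamma)
    for s0 in range(n_states)
])
```

Because the reference policy's advantage averages to zero under its own action distribution, the inner sum can equally be written with the difference of the two policies against the reference action values.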
As mentioned above, will denote the stationary distribution of a policy whenever it exists. The only relevant assumption we make here is that the maximum entropy optimal policy is aperiodic and irreducible, which implies it has a stationary distribution with positive mass on every state (Levin et al., 2006, Chapter 1). Via Section 3, it follows that all policies in a KL ball around also have stationary distributions with positive mass on every state.
The max entropy optimal policy is complemented by a (unique) optimal function and optimal advantage function . The optimal function dominates all other functions, meaning for any policy ; cf. Appendix A.
In a few places, we need the Markov chain on states, , which is induced by a policy : that is, the chain where given a state , we sample , and then transition to , where the latter sampling is via the MDP’s transition kernel.
We use to denote the total variation distance. This distance is pervasive in mixing time analyses (Levin et al., 2006).
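To make the induced chain and the total variation distance concrete, here is a minimal sketch. It assumes a random tabular MDP whose transition probabilities are all at least 0.1 (obtained by mixing with the uniform kernel, purely so that mixing is fast in the example), and uses hypothetical names throughout.

```python
import numpy as np

rng = np.random.default_rng(4)
n_states, k = 5, 3

# Random kernel P[s, a, s'], mixed with the uniform kernel so that every
# transition probability is at least 0.5 / n_states = 0.1 (an assumption).
P = rng.uniform(size=(n_states, k, n_states))
P /= P.sum(axis=2, keepdims=True)
P = 0.5 * P + 0.5 / n_states

pi = rng.uniform(size=(n_states, k))
pi /= pi.sum(axis=1, keepdims=True)

# Chain on states induced by the policy: sample an action, then transition.
P_pi = np.einsum("sa,sat->st", pi, P)

def stationary(P_pi):
    """Solve mu @ P_pi = mu with sum(mu) = 1 via least squares."""
    n = P_pi.shape[0]
    A = np.vstack([P_pi.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

def tv(mu, nu):
    """Total variation distance between two distributions on states."""
    return 0.5 * np.abs(mu - nu).sum()

stat = stationary(P_pi)

# Distance to stationarity when the chain starts from state 0.
mu = np.zeros(n_states)
mu[0] = 1.0
tv_path = []
for _ in range(40):
    tv_path.append(tv(mu, stat))
    mu = mu @ P_pi
```

In this toy chain the distance to stationarity shrinks geometrically; the mixing constants in the paper's analysis quantify this kind of decay uniformly over a set of policies.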
2 Mirror descent tools
To see how nicely mirror descent and its guarantees fit with the NPG/NAC setup, first recall our updates: , and (e.g., matching NPG in the tabular case (Kakade, 2001; Agarwal et al., 2021)). In the online learning literature (Shalev-Shwartz, 2011; Lattimore and Szepesvári, 2020), the basic mirror ascent (or dual averaging) guarantee is of the form
where notably does not need to mean anything, it can just be an element of a vector space. The most common results are stated when is the gradient of some convex function, but here instead we can use the performance difference lemma: recalling the inner product and visitation distribution notation from Section 1.3,
The term is exactly what we will control with the TD analysis, and thus the mirror descent approach has neatly decoupled concerns into an actor term, and a critic term.
In order to apply the mirror descent framework, we need to choose a mirror mapping. Rather than using , we use a choice which bakes the measure from the above inner product into the dual object ! This may seem strange, but it does not change the induced policy, and thus is a degree of freedom, and allows us to state guarantees for all possible starting distributions for free.
Our full mirror descent setup is detailed in Appendix B, but culminates in the following guarantee.
Consider step size , any reference policy , and two treatments of the error .

(Simplified bound.) For any starting measure ,

(Refined bound.) Define . For any starting measure ,
and additionally and are approximately monotone: for any state and action ,
[Regarding the mirror descent setup, Section 2]

Two rates. For the refined bound, it is most natural to set , which requires iterations to reach accuracy ; by contrast, the simplified guarantee requires iterations for the same . We used the simplified form to prove Section 1.1, since its TD error term is less stringent; indeed, the TD analysis we provide in Section 3 will not be able to give the uniform control needed for the refined bound. Still, we feel the refined bound is promising, and include it for sake of completeness, future work, and comparison to prior work.

Comparison to standard rates. Comparing the refined bound (with all terms set to zero) to the standard NPG rate in the literature (Agarwal et al., 2021), the rate is exactly recovered; as such, this mirror descent setup at the very least has not paid a price in rates.

Implicit regularization term. A conspicuous difference between these bounds and both the standard NPG bounds (cf. (Agarwal et al., 2021, Theorem 5.3)), but also many mirror descent treatments, is the term ; one could argue that this term is nonnegative and moreover we care more about the value function, so why not drop it, as is usual? It is precisely this term that gives our implicit regularization effect: instead, we can drop the value function term and uniformly upper bound the right hand side to get , which is how we control the entropy of the policy path, and prove Section 1.1.
3 Sampling tools
Via Section 2 above, our mirror descent black box analysis gives us a KL bound and a value function bound: what remains, and is the job of this section, is to control the function estimation error, namely terms of the form .
Our analysis here has two parts. The first part, as follows immediately, is that any bounded KL ball in policy space has uniformly controlled mixing times; the second part, which follows immediately thereafter, is our TD guarantees.
Let policy be given, and suppose the induced transition kernel on states is irreducible and aperiodic (Levin et al., 2006, Section 1.3). Then has a stationary distribution , and moreover for any and any measure which is positive on all states and a corresponding set of policies
there exist constants so that mixing is uniform over , meaning for any , and any with induced transition probabilities ,
and for any state and any , and any action with ,
[Implicit vs explicit exploration] On the surface, Section 3 might seem quite nice. Worrying about it a little more, and especially after inspecting the proof, it is clear that the constants , , and can be quite bad. On the one hand, one may argue that this is inherent to implicit exploration, and something like ε-greedy is preferable, as it arguably gives an explicit control on all these quantities.
Some aspects of this situation are unavoidable, however. Consider a combination lock MDP, where a precise, hard to find sequence of actions must be followed to arrive at some good reward. Suppose this sequence has length and we have a reference policy which takes each of these good actions with probability , whereby the probability of the sequence is ; a policy with can drop the probability of this good sequence of actions all the way down to !
Next we present our TD analysis. As discussed in Section 1, by contrast with prior work, this analysis handles starting from an arbitrary state, and does not make use of any projections. The following guarantee is specialized to Algorithm 1; it is a corollary of a more general TD guarantee, given in Appendix C, which is stated without reference to Algorithm 1, and can be applied in other settings.
[See also Section C.2] Suppose the MDP and linear MDP assumptions (cf. Sections 1.1 and 1.1). Consider a policy in some iteration of Algorithm 1, and suppose there exist mixing constants and so that the induced transition kernel on satisfies
Suppose the TD iterations and step size satisfy
Then the average TD iterate satisfies
where is the minimum norm fixed point of the expected TD iteration at stationarity (cf. Section C.2), and thus for any .
The proof is intricate owing mainly to issues of statistical dependency. It is not merely an issue that the chain is not started from the stationary distribution; notice that are all statistically dependent. Indeed, even if is sampled from the stationary distribution (which also means is distributed according to the stationary distribution as well), the conditional distribution of given is not in general stationary! To deal with such issues, the proof chooses a very small step size which ensures the TD estimate evolves much more slowly than the mixing time of the chain, and within the proof gaps are introduced in the chain so that rather than considering inner products of the form , the proof only considers . That said, many details need to be checked for this to go through.
A second component of the proof, which removes projection steps from prior work (Bhandari et al., 2018), is an implicit bias of TD, detailed as follows. Mirroring the MD statement in Section 2, the left hand side here has not only a term as promised, but also a norm control ; in fact, this norm control holds for all intermediate TD iterations, and is used throughout the proof to control many error terms. Just like in the MD analysis, this term is an implicit regularization, and is how this work avoids the projection step needed in prior work (Bhandari et al., 2018).
All the pieces are now in place to sketch the proof of Section 1.1, which is presented in full in Appendix D. To start, instantiate Section 3 with KL divergence upper bound , which gives the various mixing constants used throughout the proof (which we need to instantiate now, before seeing the sequence of policies, to avoid any dependence). With that out of the way, consider some iteration , and suppose that for all iterations , we have a handle both on the TD error, and also a guarantee that we are in a small KL ball around (specifically, of radius as in Section 1.1). The right hand side of the simplified mirror descent bound in Section 2 only needs a control on all previous TD errors, therefore it implies both a bound on and on . But this KL control on means that the mixing and other constants we assumed at the start will hold for , and thus we can invoke Section 3 to bound the error on , which we will use in the next loop of the induction. In this way, the actor and critic analysis complement each other and work together in each step of the induction.
There was one issue overlooked in the preceding paragraph. Notice that Section C.2 only grants an error control on average over pairs sampled from the stationary distribution of (which we mix towards thanks to Section 3). To control the error in Section 2, superficially we need something closer to a uniform error over pairs; within the proof, however, the only actions we need to consider end up being sampled from or from , and in the latter case we know explicitly that either the probability of some is large (since is the maximum entropy optimal policy), or it is and the error term vanishes. This reasoning is only sufficient for the simplified mirror descent bound in Section 2, and more sophisticated error controls would be needed to apply the refined bound.
4 Discussion and open problems
This work, in contrast to prior work in natural actor-critic and natural policy gradient methods, dropped many assumptions from the analysis and many components from the algorithms. The analysis was meant to be fairly general purpose and unoptimized. As such, there are many open problems.
Faster rates.
How much can this analysis be squeezed? Moreover, does it suggest any algorithmic improvements?
Implicit vs explicit regularization/exploration.
What are some situations where one is better than the other, and vice versa? The analysis here only says you can get away with doing everything implicitly, but not necessarily that this is the best option.
More general settings.
The paper here is for linear MDPs, linear softmax policies, and finite state and action spaces. How much does the implicit bias phenomenon (and this analysis) help in more general settings?
Tightening the TD and MD coupling.
The proof of Section 1.1 here relied on a very tight coupling of the actor (mirror descent) and the critic (temporal difference). But perhaps the coupling can be made even tighter, both in the algorithm and the analysis?
Acknowledgments
MT thanks Nan Jiang and Tor Lattimore for helpful discussions in earlier phases of this work, and is grateful to the NSF for support under grant IIS-1750051.
References
 Abdolmaleki et al. (2018) Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920, 2018.

 Agarwal et al. (2021) Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. Journal of Machine Learning Research, 22(98):1–76, 2021.
 Bagnell and Schneider (2003) J Andrew Bagnell and Jeff Schneider. Covariant policy search. 2003.
 Bhandari et al. (2018) Jalaj Bhandari, Daniel Russo, and Raghav Singal. A finite time analysis of temporal difference learning with linear function approximation. arXiv preprint arXiv:1806.02450, 2018.
 Bhatnagar et al. (2009) Shalabh Bhatnagar, Richard S Sutton, Mohammad Ghavamzadeh, and Mark Lee. Natural actor–critic algorithms. Automatica, 45(11):2471–2482, 2009.
 Bradtke and Barto (1996) Steven J Bradtke and Andrew G Barto. Linear least-squares algorithms for temporal difference learning. Machine learning, 22(1):33–57, 1996.
 Bubeck (2015) Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 2015.
 Cai et al. (2019) Qi Cai, Zhuoran Yang, Jason D Lee, and Zhaoran Wang. Neural temporal-difference and Q-learning provably converge to global optima. arXiv preprint arXiv:1905.10027, 2019.
 Cen et al. (2020) Shicong Cen, Chen Cheng, Yuxin Chen, Yuting Wei, and Yuejie Chi. Fast global convergence of natural policy gradient methods with entropy regularization. arXiv preprint arXiv:2007.06558, 2020.
 Chen et al. (2019) Zixiang Chen, Yuan Cao, Difan Zou, and Quanquan Gu. How much overparameterization is sufficient to learn deep relu networks? arXiv preprint arXiv:1911.12360, 2019.
 Chizat and Bach (2020) Lenaic Chizat and Francis Bach. Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. arXiv preprint arXiv:2002.04486, 2020.
 Fazel et al. (2018) Maryam Fazel, Rong Ge, Sham Kakade, and Mehran Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning, pages 1467–1476. PMLR, 2018.
 Geist et al. (2019) Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A theory of regularized markov decision processes. In International Conference on Machine Learning, pages 2160–2169. PMLR, 2019.
 Jerrum and Sinclair (1988) Mark Jerrum and Alistair Sinclair. Conductance and the rapid mixing property for markov chains: The approximation of permanent resolved. In STOC, pages 235–244, 1988.
 Ji and Telgarsky (2018) Ziwei Ji and Matus Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300v2, 2018.
 Ji and Telgarsky (2019) Ziwei Ji and Matus Telgarsky. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks. 2019. arXiv:1909.12292 [cs.LG].
 Jin et al. (2020) Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
 Kakade (2001) Sham M Kakade. A natural policy gradient. Advances in neural information processing systems, 14, 2001.
 Kakade and Langford (2002) Sham M. Kakade and John Langford. Approximately optimal approximate reinforcement learning. In ICML, 2002.
 Khodadadian et al. (2021) Sajad Khodadadian, Thinh T Doan, Siva Theja Maguluri, and Justin Romberg. Finite sample analysis of two-timescale natural actor-critic algorithm. arXiv preprint arXiv:2101.10506, 2021.
 Konda and Tsitsiklis (2000) Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in neural information processing systems, pages 1008–1014, 2000.
 Lattimore and Szepesvári (2020) Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020. doi: 10.1017/9781108571401.
 Levin et al. (2006) David A. Levin, Yuval Peres, and Elizabeth L. Wilmer. Markov chains and mixing times. American Mathematical Society, 2006.

 Liu et al. (2020) Yanli Liu, Kaiqing Zhang, Tamer Basar, and Wotao Yin. An improved analysis of (variance-reduced) policy gradient and natural policy gradient methods. In NeurIPS, 2020.
 Lovasz and Simonovits (1990) L. Lovasz and M. Simonovits. The mixing rate of markov chains, an isoperimetric inequality, and computing the volume. In FOCS, pages 346–354, 1990.
 Lyu and Li (2019) Kaifeng Lyu and Jian Li. Gradient descent maximizes the margin of homogeneous neural networks. arXiv preprint arXiv:1906.05890, 2019.

 Melo and Ribeiro (2007) Francisco S Melo and M Isabel Ribeiro. Q-learning with linear function approximation. In International Conference on Computational Learning Theory, pages 308–322. Springer, 2007.
 Mnih et al. (2016) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937. PMLR, 2016.
 Novikoff (1962) Albert B.J. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, 12:615–622, 1962.
 Peters and Schaal (2008) Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomputing, 71(7-9):1180–1190, 2008.
 Shalev-Shwartz (2011) Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and trends in Machine Learning, 4(2):107–194, 2011.
 Shamir (2020) Ohad Shamir. Gradient methods never overfit on separable data. arXiv:2007.00028 [cs.LG], 2020.

 Shani et al. (2020) Lior Shani, Yonathan Efroni, and Shie Mannor. Adaptive trust region policy optimization: Global convergence and faster rates for regularized mdps. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 5668–5675, 2020.
 Soudry et al. (2017) Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. arXiv preprint arXiv:1710.10345, 2017.
 Sutton (1988) Richard S Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44, 1988.
 Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
 Sutton et al. (2000) Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
 Telgarsky (2013) Matus Telgarsky. Margins, shrinkage, and boosting. In ICML, 2013.
 Tokic (2010) Michel Tokic. Adaptive ε-greedy exploration in reinforcement learning based on value differences. In Annual Conference on Artificial Intelligence, pages 203–210. Springer, 2010.
 Vempala (2005) Santosh Vempala. Geometric random walks: A survey. Combinatorial and Computational Geometry, 2005. URL https://www.cc.gatech.edu/~vempala/papers/survey.pdf.
 Wang et al. (2019) Lingxiao Wang, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural policy gradient methods: Global optimality and rates of convergence. arXiv preprint arXiv:1909.01150, 2019.
 Williams (1992) Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3):229–256, 1992.
 Williams and Peng (1991) Ronald J Williams and Jing Peng. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268, 1991.
 Wu et al. (2020) Yue Wu, Weitong Zhang, Pan Xu, and Quanquan Gu. A finite time analysis of two timescale actor critic methods. arXiv preprint arXiv:2005.01350, 2020.
 Xu et al. (2020a) Tengyu Xu, Zhe Wang, and Yingbin Liang. Improving sample complexity bounds for actor-critic algorithms. arXiv preprint arXiv:2004.12956, 2020a.
 Xu et al. (2020b) Tengyu Xu, Zhe Wang, and Yingbin Liang. Non-asymptotic convergence analysis of two time-scale (natural) actor-critic algorithms. arXiv preprint arXiv:2005.03557, 2020b.
 Zhang and Yu (2005) Tong Zhang and Bin Yu. Boosting with early stopping: Convergence and consistency. The Annals of Statistics, 33:1538–1579, 2005.
 Zimin and Neu (2013) Alexander Zimin and Gergely Neu. Online learning in episodic markovian decision processes by relative entropy policy search. In NIPS, 2013.
Appendix A Background proof: existence of the maximum entropy optimal policy
This section contains only the expanded version of Algorithm 1, which gives the unique maximum entropy optimal policy together with some of its key properties.
There exists a unique maximum margin optimal policy and corresponding and which satisfy the following properties.

For any state , let denote the set of actions taken by optimal policies; define , which is unique; then is also an optimal policy, and let and denote its advantage and functions.

For every state and every action , then , where the maximum is taken over all policies. Moreover, .

.
Proof of Appendices A and 1.

We provide an iterative construction of . Start with equal to any optimal deterministic policy (which must exist as usual for MDPs), and consider any enumeration of the set of states. The construction produces from , and will assume by induction that is an optimal policy which for every state with satisfies . The base case was handled directly, thus consider constructing . Since Markov chains have no memory, the behavior in state is independent of the behavior in all prior and subsequent states; therefore we can safely define for and , and is still an optimal policy. To complete the construction, set , and let and correspond to it.

For any with corresponding function and value function , and any , then
It follows that , and since , then the supremum is a maximum and the inequality is an equality.

By the previous point, for any state and any , then whereas for any , then . It follows that
∎
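To make the construction concrete, here is a minimal sketch on a hypothetical two-state, three-action tabular MDP (the transition table, rewards, and discount factor are invented for illustration and do not come from the paper). It assumes that the optimal actions in each state are the maximizers of the optimal Q function, and that the maximum entropy optimal policy is the uniform distribution over those actions; both are readings of the statement above rather than a transcription of it. Instead of the state-by-state induction used in the proof, the sketch computes the optimal action sets directly by value iteration.

```python
# Hypothetical toy MDP (not from the paper): 2 states, 3 actions, discount 0.9.
# P[s][a] lists next-state probabilities; R[s][a] is the immediate reward.
GAMMA = 0.9
P = [
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]],
]
R = [
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 0.5],
]


def q_values(V):
    """One-step lookahead Q(s, a) = R(s, a) + gamma * E[V(next state)]."""
    return [
        [R[s][a] + GAMMA * sum(p * v for p, v in zip(P[s][a], V))
         for a in range(len(R[s]))]
        for s in range(len(R))
    ]


def optimal_q(iters=2000):
    """Plain value iteration; returns an approximation of the optimal Q."""
    V = [0.0] * len(R)
    for _ in range(iters):
        V = [max(row) for row in q_values(V)]
    return q_values(V)


def max_entropy_optimal_policy(tol=1e-8):
    """Uniform distribution over the (near-)maximizers of Q* in each state."""
    policy = []
    for row in optimal_q():
        best = max(row)
        optimal_actions = {a for a, q in enumerate(row) if best - q <= tol}
        policy.append([
            1.0 / len(optimal_actions) if a in optimal_actions else 0.0
            for a in range(len(row))
        ])
    return policy


pi_bar = max_entropy_optimal_policy()
print(pi_bar)
```

In this toy example the first state has two optimal actions and the second state has one, so the resulting policy randomizes only in the first state. The tolerance `tol` is a numerical convenience for comparing floating point Q values and has no counterpart in the proof.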
Appendix B Full mirror descent setup and proofs
This section first gives a basic mirror descent / dual averaging setup. This characterization is mostly standard (Bubeck, 2015), though the main guarantees are given with some flexibility to allow for various natural policy setups.
First, here is the basic notation (which, unlike the paper body, will allow for step size to differ between iterations):
One nonstandard choice here is that the Bregman divergence bakes in a conjugate element, rather than using and ; this gives an easy way to handle certain settings (like the boundary of the simplex) which run into nonuniqueness issues. Secondly, is just some bilinear form and need not be interpreted as a standard inner product.
The standard Bregman identities used in mirror descent proofs are as follows:
(3)  
(4)  
(5)  
(6) 
With these in hand, the core mirror descent guarantee is as follows. The bound is written with equalities to allow for careful handling of error terms. Note that this version of mirror descent does not interpret the “gradient” in any way, and treats it as a vector and no more.
Suppose . For any and where ,
Moreover, for any , .
Proof.
For any fixed iterate , by eqs. 5, 4 and 3,
and additionally note
The first equalities now follow by applying to both sides and telescoping.
For the second part, for any , by convexity of ,
which rearranges to give since . ∎
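As an illustration of the kind of update this guarantee covers, the following is a minimal sketch of mirror descent on the probability simplex with the negative entropy mirror map (also known as exponentiated gradient or Hedge). It is not the paper's exact setup: the loss vectors are random numbers in [0, 1], the names `entropic_mirror_descent`, `eta`, and `losses` are hypothetical, and the bound checked at the end is the standard textbook regret bound for this special case rather than the equality-based statement above. As in the text, the "gradients" are treated as plain vectors with no further interpretation.

```python
import math
import random


def softmax(theta):
    """Map dual weights to a point on the simplex (numerically stabilized)."""
    m = max(theta)
    w = [math.exp(x - m) for x in theta]
    z = sum(w)
    return [x / z for x in w]


def entropic_mirror_descent(losses, eta):
    """Return the iterates p_0, ..., p_{T-1} for a sequence of loss vectors.

    The dual variable accumulates -eta * g_t, and the primal iterate is its
    softmax; with the entropy mirror map this is the mirror descent update.
    """
    theta = [0.0] * len(losses[0])  # uniform starting point
    iterates = []
    for g in losses:
        iterates.append(softmax(theta))
        theta = [th - eta * gi for th, gi in zip(theta, g)]
    return iterates


# Hypothetical problem instance: T rounds, k coordinates, losses in [0, 1].
rng = random.Random(0)
T, k, eta = 200, 4, 0.1
losses = [[rng.random() for _ in range(k)] for _ in range(T)]

iterates = entropic_mirror_descent(losses, eta)
incurred = sum(sum(p_i * g_i for p_i, g_i in zip(p, g))
               for p, g in zip(iterates, losses))
best_fixed = min(sum(g[a] for g in losses) for a in range(k))
regret = incurred - best_fixed

# Standard (loose) regret bound for losses in [0, 1]: ln(k)/eta + eta*T/2.
bound = math.log(k) / eta + eta * T / 2
print(regret, bound)
```

The comparator here is the best fixed coordinate in hindsight, which plays the role of the fixed comparison point in the guarantee; the step size is held constant for simplicity even though the appendix allows it to vary between iterations.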
All that remains is to instantiate the various mirror descent objects to match Algorithm 1, and control the various resulting terms. This culminates in Section 2; its proof is as follows.
Proof of Section 2.
The core of both parts of the proof is to apply the mirror descent guarantees from Appendix B, using the following choices:
A key consequence of these constructions is that , treated for any fixed as an unnormalized policy, agrees with after normalization; that is to say, it gives the same policy, and the choice of baked into the definition is not needed by the algorithm and is only used in the analysis. The “gradient” makes no use of it.
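One way to read this remark, assuming a softmax parameterization of the policy, is that adding a per-state constant to the logits rescales the unnormalized weights but leaves the normalized policy unchanged. The short sketch below checks this numerically; the logits and the offset are hypothetical values chosen only for illustration.

```python
import math


def softmax(logits):
    """Normalize exponentiated logits into a probability vector."""
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


logits = [0.3, -1.2, 2.0]  # hypothetical action scores in one state
offset = 5.0               # hypothetical per-state constant

policy = softmax(logits)
policy_shifted = softmax([x + offset for x in logits])

# Largest coordinate-wise difference between the two normalized policies.
max_gap = max(abs(p - q) for p, q in zip(policy, policy_shifted))
print(max_gap)
```

Because the offset cancels in the normalization, an algorithm that only ever samples from or evaluates the normalized policy never needs to know it, which is consistent with the offset appearing only in the analysis.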
Plugging this notation into Appendix B but making use of two of its equalities, and the performance difference lemma eq. 2, then for any
(7) 
The proof now splits into the two different settings.

(Simplified bound.) By the above definitions and the first equality in Appendix B,
where the last term may be bounded in a way common in the online learning literature (Shalev-Shwartz, 2011): since when , setting for convenience (whereby as needed by the preceding inequality),