# Actor-critic is implicitly biased towards high entropy optimal policies

We show that the simplest actor-critic method – a linear softmax policy updated with TD through interaction with a linear MDP, but featuring no explicit regularization or exploration – does not merely find an optimal policy, but moreover prefers high entropy optimal policies. To demonstrate the strength of this bias, the algorithm not only has no regularization, no projections, and no exploration like ϵ-greedy, but is moreover trained on a single trajectory with no resets. The key consequence of the high entropy bias is that uniform mixing assumptions on the MDP, which exist in some form in all prior work, can be dropped: the implicit regularization of the high entropy bias is enough to ensure that all chains mix and an optimal policy is reached with high probability. As auxiliary contributions, this work decouples concerns between the actor and critic by writing the actor update as an explicit mirror descent, provides tools to uniformly bound mixing times within KL balls of policy space, and provides a projection-free TD analysis with its own implicit bias which can be run from an unmixed starting distribution.

## 1 Overview

Reinforcement learning methods navigate an environment and seek to maximize their reward (Sutton and Barto, 2018). A key tension is the tradeoff between exploration and exploitation: does a learner (also called an agent or policy) explore for new high-reward states, or does it exploit the best states it has already found? This is a sensitive part of RL algorithm design, as it is easy for methods to become blind to parts of the state space; to combat this, many methods have an explicit exploration component, for instance the ϵ-greedy method, which forces exploration in all states with probability ϵ (Sutton and Barto, 2018; Tokic, 2010). Similarly, many methods must use projections and regularization to smooth their estimates (Williams and Peng, 1991; Mnih et al., 2016; Cen et al., 2020).

This work considers actor-critic methods, where a policy (or actor) is updated via the suggestions of a critic. In this setting, prior work invokes a combination of explicit regularization and exploration to avoid getting stuck, and makes various fast mixing assumptions to help accurate exploration. For example, recent work with a single trajectory in the tabular case used both an explicit ϵ-greedy component and uniform mixing assumptions (Khodadadian et al., 2021), neural actor-critic methods use a combination of projections and regularization together with various assumptions on mixing and on the path followed through policy space (Cai et al., 2019; Wang et al., 2019), and even direct analyses of the TD subroutine in our linear MDP setting make use of both projection steps and an assumption of starting from the stationary distribution (Bhandari et al., 2018).

Contribution. This work shows that a simple linear actor-critic (cf. Algorithm 1) in a linear MDP (cf. Section 1.1) with a finite but non-tabular state space (cf. Section 1.1) finds an ϵ-optimal policy in polynomially many samples, without any explicit exploration or projections in the algorithm and without any uniform mixing assumptions on the policy space (cf. Section 1.1). The algorithm and analysis avoid both via an implicit bias towards high entropy policies: the actor-critic policy path never leaves a Kullback-Leibler (KL) divergence ball around the maximum entropy optimal policy; this firstly ensures implicit exploration, and secondly ensures fast mixing. In more detail:

1. Actor analysis via mirror descent. We write the actor update as an explicit mirror descent. While on the surface this does not change the method (e.g., in the tabular case, the method is identical to natural policy gradient (Agarwal et al., 2021)), it gives a clean optimization guarantee which carries a KL-based implicit bias consequence for free, and decouples concerns between the actor and critic.

2. Critic analysis via projection-free sampling tools within KL balls. The preceding mirror descent component guarantees that we stay within a small KL ball, if the statistical error of the critic is controlled. Concordantly, our sampling tools guarantee this statistical error is small, if we stay within a small KL ball. Concretely, we provide useful lemmas that every policy in a KL ball around the high entropy policy has uniformly upper bounded mixing times, and separately give a projection-free (implicitly regularized!) analysis of the standard temporal-difference (TD) update from any starting state (Sutton, 1988), whereas the closest TD analysis in the literature uses projections and requires the sampling process to be started from the stationary distribution (Bhandari et al., 2018). The mixing assumptions here contrast in general with prior work, which either makes explicit use of stationary distributions (Cai et al., 2019; Wang et al., 2019; Bhandari et al., 2018), or makes uniform mixing assumptions on all policies (Xu et al., 2020b; Khodadadian et al., 2021).

Our final proof has the above actor and critic components feeding off of each other: we use an induction to simultaneously establish that since the actor stayed in a small KL ball in previous iterations, then the new critic update is accurate, and similarly the accuracy of the previous critic updates ensures the actor continues to stay in a small KL ball. We feel these tools will be useful in other work.

### 1.1 Setting and main results

We will now give the setting, main result, and algorithm in full. Further details on MDPs can be found in Section 1.3, but the actor-critic method appears in Algorithm 1. To start, the environment and policies are as follows.

The Markov Decision Process (MDP) has finitely many states and actions, and bounded rewards. States are observed in some feature encoding, but the state space itself is assumed finite.

Policies are linear softmax policies: a policy is given by a weight matrix $W$, and given a state $s$, uses a per-state softmax to sample a new action $a$:

$$ a \sim \phi(s^{\mathsf{T}} W \cdot), \qquad \text{where } \phi(s^{\mathsf{T}} W a) = \frac{\exp(s^{\mathsf{T}} W a)}{\sum_{b\in\mathcal{A}} \exp(s^{\mathsf{T}} W b)}. \tag{1} $$

Let $\mathcal{A}(s)$ denote the set of optimal actions at state $s$. It is assumed that $\mathcal{A}(s)$ is nonempty for every state $s$; equivalently, for every state, there exists an optimal policy whose stationary distribution exists and places positive mass on that state.

The choice of linear policies simplifies presentation and analysis, but the tools here should be applicable to other settings. This choice also allows direct comparison to the widely-studied implicit bias of gradient descent in linear classification settings (Soudry et al., 2017; Ji and Telgarsky, 2018), as will be discussed further in Section 1.2. The choice of finite state space is to remove measure-theoretic concerns and to allow a simple characterization of the maximum entropy optimal policy.
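As a concrete reference point, the per-state softmax of eq. (1) can be sketched in a few lines of code; the dimensions and features below are hypothetical, purely for illustration.

```python
import numpy as np

def softmax_policy(W, s):
    """Per-state linear softmax of eq. (1): action a gets logit s^T W e_a."""
    logits = s @ W                   # one logit per action (actions are basis vectors)
    logits = logits - logits.max()   # shift for numerical stability; policy unchanged
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(0)
W = np.zeros((3, 4))                 # hypothetical: d = 3 features, k = 4 actions
s = rng.normal(size=3)
pi_s = softmax_policy(W, s)          # with W = 0, this is uniform over the k actions
```

With zero weights the policy is uniform, the highest entropy policy of all; the analysis below is about how far the iterates drift from such high entropy behavior.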

[Simplification of Appendix A] Under the assumptions of Section 1.1, there exists a unique maximum entropy optimal policy $\bar\pi$, which satisfies $\bar\pi(s,\cdot) = \mathrm{Uniform}(\mathcal{A}(s))$ for every state $s$.

To round out this introductory presentation of the actor, the last component is the update: $p_{i+1} := p_i + \theta\hat{\mathcal{Q}}_i$, with $\pi_{i+1}$ obtained by applying the per-state softmax to $p_{i+1}$. This is explicitly a mirror descent or dual averaging representation of the policy, where we use a mirror mapping to obtain the policy from the pre-softmax values $p_{i+1}$. As mentioned before, this update appears in prior work in the tabular setting with natural policy gradient and actor-critic (Agarwal et al., 2021; Khodadadian et al., 2021). We will motivate this choice in our more general non-tabular setting in Section 2.
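In code, this dual averaging update is a couple of lines; the sketch below uses a fixed, made-up critic estimate standing in for the critic's output, and two hypothetical states, simply to show how accumulating critic values in pre-softmax space concentrates the policy on high-value actions.

```python
import numpy as np

def softmax_rows(p):
    z = p - p.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

theta = 0.5
p = np.zeros((2, 3))                      # pre-softmax values: 2 states x 3 actions
Q_hat = np.array([[1.0, 0.0, 0.0],        # hypothetical critic estimates
                  [0.0, 1.0, 0.0]])
for _ in range(50):
    p = p + theta * Q_hat                 # dual averaging / NPG-style actor update
policy = softmax_rows(p)                  # mass concentrates on argmax of Q_hat per state
```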

The final assumption and description of the critic are as follows. As will be discussed in Section 2, the policy becomes optimal if $\hat{\mathcal{Q}}$ is an accurate estimate of the true $\mathcal{Q}$ function. We employ a standard TD update with no projections or constraints. To guarantee that this linear model of $\mathcal{Q}$ is accurate, we make a standard linear MDP assumption (Bradtke and Barto, 1996; Melo and Ribeiro, 2007; Jin et al., 2020).

In words, the linear MDP assumption is that the MDP rewards and transitions are modeled by linear functions. In more detail, for convenience first fix a canonical vector form for state/action pairs: let $x_{sa}$ denote the vector obtained by unrolling the matrix $sa^{\mathsf{T}}$ row-wise (whereby vector inner products with $x_{sa}$ match matrix inner products with $sa^{\mathsf{T}}$). The linear MDP assumption is then that there exists a fixed vector $y$ and a fixed matrix $M$ so that for any state/action pair $(s,a)$ and any subsequent state $s'$,

$$ \mathbb{E}[r \mid (s,a)] = x_{sa}^{\mathsf{T}} y, \qquad \mathbb{E}[s' \mid (s,a)] = M x_{sa}. $$

Lastly, suppose $\|x_{sa}\| \le 1$ for all state/action pairs $(s,a)$.

Though a strong assumption, it is not only common, but note also that since TD must continually interact with the MDP, it would have little hope of accuracy if it could not model short-term MDP dynamics. Indeed, as is shown in Section C.2 (but appears in various forms throughout the literature), the assumption implies that the fixed point of the TD update is the true $\mathcal{Q}$ function.
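To make the vectorization concrete, here is a small sketch of the canonical form of a state/action pair and the linear reward/transition means; the parameter names ($y$, $M$) and the dimensions are illustrative placeholders, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 3, 2                          # hypothetical feature and action dimensions
y = rng.normal(size=d * k)           # linear reward parameter
M = rng.normal(size=(d, d * k))      # linear transition parameter

def x_vec(s, a_onehot):
    """Row-wise unrolling of the matrix s a^T into a vector of length d*k."""
    return np.outer(s, a_onehot).reshape(-1)

s = rng.normal(size=d)
a = np.eye(k)[0]                     # actions are standard basis vectors
x = x_vec(s, a)
mean_reward = x @ y                  # models E[r | (s, a)] = x_sa^T y
mean_next_state = M @ x              # models E[s' | (s, a)] = M x_sa
```

Because the unrolling is bilinear in $(s,a)$, linear functions of $x_{sa}$ are exactly bilinear functions of the state and action, which is what lets a linear TD model represent the $\mathcal{Q}$ function.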

We now state our main result, which bounds not just the value function (cf. Section 1.3) but also the KL divergence $K_{v_s^{\bar\pi}}(\bar\pi, \pi_i)$, where $v_s^{\bar\pi}$ is the visitation distribution of the maximum entropy optimal policy $\bar\pi$ when run from state $s$ (cf. Section 1.3).

Suppose the assumptions of Section 1.1 (which imply the (unique) maximum entropy optimal policy $\bar\pi$ is well-defined, and also irreducible and aperiodic). Fix an iteration count $t$ and confidence $\delta\in(0,1)$, and choose parameters

$$ \theta = \frac{1}{\sqrt{t}}, \qquad N = \mathcal{O}\Bigl(\frac{t^4}{\delta^2}\Bigr), \qquad \frac{1}{\eta} = \mathcal{O}\bigl(\sqrt{N}\ln N\bigr), $$

where the constants hidden in $\mathcal{O}$ depend only on $\bar\pi$ and the MDP, and not on $t$. With these parameters in place, invoke Algorithm 1, and let $(\pi_i)_{i\le t}$ be the resulting sequence of policies. Then with probability at least $1-\delta$, simultaneously for every state $s$ and every $i \le t$,

$$ K_{v_s^{\bar\pi}}(\bar\pi, \pi_i) + \theta(1-\gamma)\sum_{j<i}\bigl(\mathcal{V}^{\bar\pi}(s) - \mathcal{V}^{\pi_j}(s)\bigr) \le \ln k + \frac{2}{(1-\gamma)^2}. $$

[Discussion of Section 1.1]

1. Implicit bias. Since $\bar\pi$ is optimal, the second term is nonnegative and can be deleted, and the bound implies

$$ \max_{i\le t}\,\max_{s\in\mathcal{S}} K_{v_s^{\bar\pi}}(\bar\pi, \pi_i) \le \ln k + \frac{2}{(1-\gamma)^2}; $$

since this holds for all $i \le t$, it controls the entire optimization path. This term is a direct consequence of our mirror descent setup, and is used to control the TD errors at every iteration.

2. Mixing time constants. The critic iteration count $N$ and step size $\eta$ hide mixing time constants; these mixing time constants depend only on the KL bound $\ln k + 2/(1-\gamma)^2$, and in particular there is no hidden growth in these terms with $t$. That is to say, mixing times are uniformly controlled over a fixed KL ball that does not depend on $t$; prior work by contrast makes strong mixing assumptions (Wang et al., 2019; Xu et al., 2020b; Khodadadian et al., 2021).

3. Single trajectory. A single trajectory through the MDP is used to remove the option of the algorithm escaping from poor choices with resets; only the implicit bias can save it.

4. Rate. Since the actor's step size is $\theta = 1/\sqrt{t}$, obtaining error $\epsilon$ requires $\Theta(1/\epsilon^2)$ actor iterations, and the total number of samples $tN$ is polynomial in $1/\epsilon$; this is slower than the rate given in the only other single-trajectory analysis in the literature (Khodadadian et al., 2021), but by contrast that work makes uniform mixing assumptions (cf. Khodadadian et al. (2021, Lemma C.1)), requires the tabular setting, and uses ϵ-greedy for explicit exploration in each iteration.

The organization of the remainder of this work is as follows. The rest of this introduction gives further related work, and some notation and MDP background. Section 2 presents and discusses the mirror descent framework which provides optimization and implicit bias guarantees on the actor for free. Section 3 presents the sampling lemmas we use to control the TD error. Section 4 concludes with some discussion and open problems, and the appendices contain the full proofs.

### 1.2 Further related work

For the standard background in reinforcement learning, see Sutton and Barto (2018).

##### Natural and regular policy gradient (PG & NPG).

As mentioned before, the actor update here agrees with the natural policy gradient update in the tabular setting (Kakade, 2001); see also (Agarwal et al., 2021) for a well-known analysis of natural and regular policy gradient methods. These methods are widespread in theory and practice (Williams, 1992; Sutton et al., 2000; Bagnell and Schneider, 2003; Liu et al., 2020; Fazel et al., 2018).

##### Natural and regular actor-critic (AC & NAC).

The study of regular and natural actor-critic methods started with Konda and Tsitsiklis (2000) and Peters and Schaal (2008) respectively. These methods are very common both in theory and practice, and there are many variants and improvements to both the actor component and the critic component (Xu et al., 2020a, b; Wu et al., 2020; Bhatnagar et al., 2009).

##### Regularization and constraints.

It is standard with neural policies to explicitly maintain a constraint on the network weights (Wang et al., 2019; Cai et al., 2019). Relatedly, many works both in theory and practice use explicit entropy regularization to prevent vanishing probabilities (Williams and Peng, 1991; Mnih et al., 2016; Abdolmaleki et al., 2018), which can even yield convergence rate improvements (Cen et al., 2020).

##### NPG and mirror descent.

The original and recent analyses of NPG had a mirror descent flavor, though mirror descent and its analysis were not explicitly invoked as a black box (Kakade, 2001; Agarwal et al., 2021). Further connections to mirror descent have appeared many times (Geist et al., 2019; Shani et al., 2020), though with a focus on the design of new algorithms, and not for any implicit regularization effect or proof. Mirror descent is used heavily throughout the online learning literature (Shalev-Shwartz, 2011), and in work handling adversarial MDP settings (Zimin and Neu, 2013).

##### Temporal-difference update (TD).

As discussed before, the TD update, originally presented by Sutton (1988), is standard in the actor-critic literature (Cai et al., 2019; Wang et al., 2019), and also appears in many other works cited in this section. As was mentioned, prior work requires various projections and mixing assumptions (Bhandari et al., 2018).

##### Implicit regularization in supervised learning.

A pervasive topic in supervised learning is the implicit regularization effect of common descent methods; concretely, standard descent methods prefer low or even minimum norm solutions, which can be converted into generalization bounds. The present work makes use of a weak implicit bias, which only prefers smaller norms and does not necessarily lead to minimal norms; arguably this idea was used in the classical perceptron method (Novikoff, 1962), and was then shown in linear and shallow network cases of SGD applied to logistic regression (Ji and Telgarsky, 2018, 2019), generalized to other losses (Shamir, 2020), and also applied to other settings (Chen et al., 2019). The more well-known strong implicit bias, namely the convergence to minimum norm solutions, has been observed with exponentially-tailed losses together with coordinate descent with linear predictors (Zhang and Yu, 2005; Telgarsky, 2013), gradient descent with linear predictors (Soudry et al., 2017; Ji and Telgarsky, 2018), and deep learning in various settings (Lyu and Li, 2019; Chizat and Bach, 2020), just to name a few.

### 1.3 Notation

This brief notation section summarizes various concepts and notation used throughout; modulo a few inventions, the presentation mostly matches standard ones in RL (Sutton and Barto, 2018) and policy gradient (Agarwal et al., 2021). A policy $\pi$ maps state-action pairs to reals, and $\pi(s,\cdot)$ will always be a probability distribution. Given a state $s$, the agent samples an action $a$ from $\pi(s,\cdot)$, the environment returns some random reward $r$ (which has a fixed distribution conditioned on the observed pair $(s,a)$), and then uses a transition kernel to choose a new state $s'$ given $(s,a)$.

Taking $\tau = (s_0, a_0, r_0, s_1, \ldots)$ to denote a random trajectory followed by a policy $\pi$ interacting with the MDP from an arbitrary initial state distribution $\mu$, the value and $\mathcal{Q}$ functions are respectively

$$ \mathcal{V}^\pi(\mu) := \mathbb{E}_{s_0\sim\mu}\sum_{t\ge0}\gamma^t r_t, \qquad \mathcal{Q}^\pi(s,a) := \mathbb{E}_{s_0=s,\,a_0=a}\sum_{t\ge0}\gamma^t r_t = \mathbb{E}_{r_0\sim(s,a)}\bigl[r_0\bigr] + \gamma\,\mathbb{E}_{s_1\sim(s,a)}\bigl[\mathcal{V}^\pi(s_1)\bigr], $$

where the simplified notation $\mathcal{V}^\pi(s)$ for the Dirac distribution on state $s$ will often be used, as well as the shorthands $\mathcal{V}_i := \mathcal{V}^{\pi_i}$ and $\mathcal{Q}_i := \mathcal{Q}^{\pi_i}$. Additionally, let $\mathcal{A}^\pi := \mathcal{Q}^\pi - \mathcal{V}^\pi$ denote the advantage function; note that the natural policy gradient update could interchangeably use $\mathcal{Q}^\pi$ or $\mathcal{A}^\pi$ since they only differ by an action-independent constant, namely $\mathcal{V}^\pi(s)$, which the softmax normalizes out. As in Section 1.1, the state space is finite but non-tabular, and the action space is identified with the $k$ standard basis vectors. The other MDP assumption, namely of a linear MDP (cf. Section 1.1), will be used whenever TD guarantees are needed. Lastly, the discount factor $\gamma\in(0,1)$ has not been highlighted, but is standard in the RL literature, and will be treated as given and fixed throughout the present work.

A common tool in RL is the performance difference lemma (Kakade and Langford, 2002): letting $v_\mu^\pi$ denote the visitation distribution corresponding to policy $\pi$ starting from $\mu$, meaning

$$ v_\mu^\pi(s) := (1-\gamma)\,\mathbb{E}_{s'\sim\mu}\sum_{t\ge0}\gamma^t \Pr[s_t = s \mid s_0 = s'], $$

the performance difference lemma can be written as

$$ \mathcal{V}^\pi(\mu) - \mathcal{V}^{\pi'}(\mu) = \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim v_\mu^{\pi'}}\sum_a \mathcal{Q}^\pi(s,a)\bigl(\pi(s,a) - \pi'(s,a)\bigr) = \frac{1}{1-\gamma}\bigl\langle \mathcal{Q}^\pi,\, \pi - \pi'\bigr\rangle_{v_\mu^{\pi'}}, \tag{2} $$

where the final inner product notation will often be employed for convenience.
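The lemma is easy to sanity-check numerically. The sketch below builds a random small MDP (all sizes and parameters hypothetical) and verifies the identity with exact linear-algebra solves, using a normalized visitation distribution.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 3, 2, 0.9
P = rng.random((nS, nA, nS)); P /= P.sum(-1, keepdims=True)   # transition kernel
R = rng.random((nS, nA))                                      # mean rewards
mu = np.ones(nS) / nS                                         # start distribution

def value(pi):
    Ppi = np.einsum('sa,sat->st', pi, P)       # state-to-state kernel under pi
    Rpi = (pi * R).sum(-1)
    V = np.linalg.solve(np.eye(nS) - gamma * Ppi, Rpi)
    Q = R + gamma * P @ V
    return V, Q, Ppi

def visitation(Ppi):
    # normalized visitation: v(s) = (1-gamma) * sum_t gamma^t Pr[s_t = s]
    return (1 - gamma) * np.linalg.solve(np.eye(nS) - gamma * Ppi.T, mu)

pi  = rng.random((nS, nA)); pi  /= pi.sum(-1, keepdims=True)
pi2 = rng.random((nS, nA)); pi2 /= pi2.sum(-1, keepdims=True)

V1, Q1, _ = value(pi)
V2, _, Ppi2 = value(pi2)
v = visitation(Ppi2)                           # visitation of the *second* policy
lhs = mu @ V1 - mu @ V2
rhs = (v[:, None] * Q1 * (pi - pi2)).sum() / (1 - gamma)
# lhs and rhs agree, matching the performance difference lemma
```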

As mentioned above, $p_\pi$ will denote the stationary distribution of a policy $\pi$ whenever it exists. The only relevant assumption we make here is that the maximum entropy optimal policy $\bar\pi$ is aperiodic and irreducible, which implies it has a stationary distribution with positive mass on every state (Levin et al., 2006, Chapter 1). Via Section 3, it follows that all policies in a KL ball around $\bar\pi$ also have stationary distributions with positive mass on every state.

The max entropy optimal policy $\bar\pi$ is complemented by a (unique) optimal $\mathcal{Q}$ function $\bar{\mathcal{Q}}$ and optimal advantage function $\bar{\mathcal{A}}$. The optimal $\mathcal{Q}$ function dominates all other $\mathcal{Q}$ functions, meaning $\bar{\mathcal{Q}} \ge \mathcal{Q}^\pi$ for any policy $\pi$; cf. Appendix A.

In a few places, we need the Markov chain on states, $P_\pi$, which is induced by a policy $\pi$: that is, the chain where given a state $s$, we sample $a \sim \pi(s,\cdot)$, and then transition to $s'$, where the latter sampling is via the MDP's transition kernel.

We use $\|\cdot\|_{\mathrm{tv}}$ to denote the total variation distance. This distance is pervasive in mixing time analyses (Levin et al., 2006).

## 2 Mirror descent tools

To see how nicely mirror descent and its guarantees fit with the NPG/NAC setup, first recall our updates: $p_{i+1} := p_i + \theta\hat{\mathcal{Q}}_i$, with $\pi_{i+1}$ the per-state softmax of $p_{i+1}$ (e.g., matching NPG in the tabular case (Kakade, 2001; Agarwal et al., 2021)). In the online learning literature (Shalev-Shwartz, 2011; Lattimore and Szepesvári, 2020), the basic mirror ascent (or dual averaging) guarantee is of the form

$$ \sum_{i<t}\bigl\langle g_i,\, \pi - \pi_i \bigr\rangle \le \frac{K(\pi, \pi_0)}{\theta} + \frac{\theta}{2}\sum_{i<t}\|g_i\|^2, $$

where notably $g_i$ does not need to mean anything; it can just be an element of a vector space. The most common results are stated when $g_i$ is the gradient of some convex function, but here instead we can use the performance difference lemma: recalling the inner product and visitation distribution notation from Section 1.3,

$$ \bigl\langle \hat{\mathcal{Q}}_i,\, \pi_i - \pi\bigr\rangle_{v_\mu^\pi} = \bigl\langle \mathcal{Q}_i,\, \pi_i - \pi\bigr\rangle_{v_\mu^\pi} + \bigl\langle \hat{\mathcal{Q}}_i - \mathcal{Q}_i,\, \pi_i - \pi\bigr\rangle_{v_\mu^\pi} = (1-\gamma)\bigl(\mathcal{V}_i(\mu) - \mathcal{V}^\pi(\mu)\bigr) + \bigl\langle \hat{\mathcal{Q}}_i - \mathcal{Q}_i,\, \pi_i - \pi\bigr\rangle_{v_\mu^\pi}. $$

The term $\langle \hat{\mathcal{Q}}_i - \mathcal{Q}_i, \pi_i - \pi\rangle_{v_\mu^\pi}$ is exactly what we will control with the TD analysis, and thus the mirror descent approach has neatly decoupled concerns into an actor term and a critic term.

In order to apply the mirror descent framework, we need to choose a mirror mapping. Rather than a plain softmax mapping, we use a choice which bakes the measure $v_\mu^\pi$ from the above inner product into the dual object! This may seem strange, but it does not change the induced policy, and thus is a degree of freedom, and allows us to state guarantees for an arbitrary comparator policy $\pi$.

Our full mirror descent setup is detailed in Appendix B, but culminates in the following guarantee.

Consider step size $\theta > 0$, any reference policy $\pi$, and two treatments of the critic error.

1. (Simplified bound.) For any starting measure $\mu$,

$$ K_{v_\mu^\pi}(\pi, \pi_t) + \theta(1-\gamma)\sum_{i<t}\bigl(\mathcal{V}^\pi(\mu) - \mathcal{V}_i(\mu)\bigr) \le K_{v_\mu^\pi}(\pi, \pi_0) + \frac{\theta^2 t}{(1-\gamma)^2} + 2\theta\sum_{i<t}\epsilon_i, $$

where $\epsilon_i$ denotes the critic error at iteration $i$.

2. (Refined bound.) Define the refined critic errors $\hat\epsilon_i$ (cf. Appendix B). For any starting measure $\mu$, an analogous bound holds with a sharper dependence on $\theta$ (detailed in Appendix B), and additionally $\mathcal{V}_i$ and $\mathcal{Q}_i$ are approximately monotone: for any state $s$ and action $a$,

$$ \mathcal{V}_{i+1}(s) \ge \mathcal{V}_i(s) - \frac{2\hat\epsilon_i}{1-\gamma} \qquad\text{and}\qquad \mathcal{Q}_{i+1}(s,a) \ge \mathcal{Q}_i(s,a) - \frac{2\gamma\hat\epsilon_i}{1-\gamma} - \hat\epsilon_i - \hat\epsilon_{i+1}. $$

[Regarding the mirror descent setup, Section 2]

1. Two rates. For the refined bound, it is most natural to set $\theta$ to a constant, which requires only $\Theta(1/\epsilon)$ iterations to reach accuracy $\epsilon$; by contrast, the simplified guarantee requires $\Theta(1/\epsilon^2)$ iterations for the same $\epsilon$. We used the simplified form to prove the main theorem of Section 1.1, since its TD error term is less stringent; indeed, the TD analysis we provide in Section 3 will not be able to give the uniform control needed for the refined bound. Still, we feel the refined bound is promising, and include it for the sake of completeness, future work, and comparison to prior work.

2. Comparison to standard rates. Comparing the refined bound (with all error terms set to zero) to the standard NPG rate in the literature (Agarwal et al., 2021), the $1/t$ rate is exactly recovered; as such, this mirror descent setup at the very least has not paid a price in rates.

3. Implicit regularization term. A conspicuous difference between these bounds and both the standard NPG bounds (cf. (Agarwal et al., 2021, Theorem 5.3)) and many mirror descent treatments is the term $K_{v_\mu^\pi}(\pi, \pi_t)$; one could argue that this term is nonnegative and moreover we care more about the value function, so why not drop it, as is usual? It is precisely this term that gives our implicit regularization effect: instead, we can drop the value function term and uniformly upper bound the right hand side to get a KL bound holding along the entire optimization path, which is how we control the entropy of the policy path and prove the main theorem of Section 1.1.
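As a toy illustration of this KL control (with entirely made-up critic values in a single-state problem), the dual averaging path below stays within a bounded KL distance of the uniform-over-argmax comparator for the whole run:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

k, t = 4, 400
theta = 1.0 / np.sqrt(t)
q_hat = np.array([1.0, 1.0, 0.2, 0.0])    # made-up critic values: two optimal actions
pi_bar = np.array([0.5, 0.5, 0.0, 0.0])   # high entropy comparator: uniform on argmax

p = np.zeros(k)
kls = []
for _ in range(t):
    p = p + theta * q_hat                 # dual averaging actor update
    pi = softmax(p)
    mask = pi_bar > 0
    kls.append(float((pi_bar[mask] * np.log(pi_bar[mask] / pi[mask])).sum()))
# KL(pi_bar, pi_i) stays below ln k along the entire path, and shrinks over time
```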

## 3 Sampling tools

Via Section 2 above, our mirror descent black box analysis gives us a KL bound and a value function bound: what remains, and is the job of this section, is to control the $\mathcal{Q}$ function estimation error, namely terms of the form $\langle \hat{\mathcal{Q}}_i - \mathcal{Q}_i, \pi_i - \pi\rangle_{v_\mu^\pi}$.

Our analysis here has two parts. The first part, given immediately below, is that any bounded KL ball in policy space has uniformly controlled mixing times; the second part, which follows thereafter, gives our TD guarantees.

Let policy $\tilde\pi$ be given, and suppose the induced transition kernel $P_{\tilde\pi}$ on states is irreducible and aperiodic (Levin et al., 2006, Section 1.3). Then $\tilde\pi$ has a stationary distribution $p_{\tilde\pi}$, and moreover for any $c > 0$, any measure $\nu$ which is positive on all states, and a corresponding set of policies

$$ \mathcal{P}_c := \bigl\{\pi : K_\nu(\tilde\pi, \pi) \le c\bigr\}, $$

there exist constants $m_1, m_2, C$ so that mixing is uniform over $\mathcal{P}_c$, meaning for any $\pi\in\mathcal{P}_c$ and any $t$, with induced transition probabilities $P_\pi^t$,

$$ \sup_s \bigl\|P_\pi^t(s,\cdot) - p_\pi\bigr\|_{\mathrm{tv}} \le m_1 e^{-m_2 t}, $$

and for any state $s$, any $\pi\in\mathcal{P}_c$, and any action $a$ with $\tilde\pi(s,a) > 0$,

$$ \frac{1}{C} \le \frac{\tilde\pi(s,a)}{\pi(s,a)} \le C \qquad\text{and}\qquad \frac{1}{C} \le \frac{p_{\tilde\pi}(s)}{p_\pi(s)} \le C. $$
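Geometric mixing of the form above is easy to visualize numerically; the chain below is a hypothetical lazy walk on three states (irreducible and aperiodic), and its worst-case TV distance to stationarity decays geometrically, i.e. like $m_1 e^{-m_2 t}$.

```python
import numpy as np

def tv_mixing_curve(Ppi, T):
    """sup_s || Ppi^t(s, .) - stationary ||_tv for t = 1, ..., T."""
    evals, evecs = np.linalg.eig(Ppi.T)
    stat = np.real(evecs[:, np.argmax(np.real(evals))])
    stat = stat / stat.sum()               # stationary distribution
    Pt = np.eye(Ppi.shape[0])
    curve = []
    for _ in range(T):
        Pt = Pt @ Ppi
        curve.append(0.5 * np.abs(Pt - stat).sum(axis=1).max())
    return np.array(curve)

# hypothetical lazy walk: every row keeps probability 1/2 on the current state
Ppi = np.array([[0.50, 0.50, 0.00],
                [0.25, 0.50, 0.25],
                [0.00, 0.50, 0.50]])
curve = tv_mixing_curve(Ppi, 30)           # geometric decay to stationarity
```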

[Implicit vs explicit exploration] On the surface, the preceding lemma might seem quite nice. Worrying about it a little more, and especially after inspecting the proof, it is clear that the constants $m_1$, $m_2$, and $C$ can be quite bad. On the one hand, one may argue that this is inherent to implicit exploration, and something like ϵ-greedy is preferable, as it arguably gives explicit control on all these quantities.

Some aspects of this situation are unavoidable, however. Consider a combination lock MDP, where a precise, hard to find sequence of actions must be followed to arrive at some good reward. Suppose this sequence has length $H$ and we have a reference policy which takes each of these good actions with probability $1/2$, whereby the probability of the sequence is $2^{-H}$; a policy within a KL ball of radius $c$ around the reference can drop the probability of this good sequence of actions exponentially lower still!
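The arithmetic can be sketched concretely; the numbers below are illustrative only, not the paper's constants.

```python
import math

H, p_ref, p_new = 20, 0.5, 0.25
# per-state KL of the reference action choice vs the perturbed one
# (Bernoulli distributions on taking the good action)
kl_state = p_ref * math.log(p_ref / p_new) \
    + (1 - p_ref) * math.log((1 - p_ref) / (1 - p_new))
kl_total = H * kl_state            # under 3 nats along the whole lock
succ_ref = p_ref ** H              # about 1e-6
succ_new = p_new ** H              # about 9e-13
ratio = succ_ref / succ_new        # success probability collapses by a factor 2^H
```

A KL budget of a few nats thus already permits the success probability of the lock to shrink by six orders of magnitude, which is why the constants in the lemma cannot be dimension-free in general.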

Next we present our TD analysis. As discussed in Section 1, by contrast with prior work, this analysis handles starting from an arbitrary state, and does not make use of any projections. The following guarantee is specialized to Algorithm 1; it is a corollary of a more general TD guarantee, given in Appendix C, which is stated without reference to Algorithm 1, and can be applied in other settings.

[See also Section C.2] Suppose the MDP and linear MDP assumptions (cf. Section 1.1). Consider a policy $\pi_i$ in some iteration $i$ of Algorithm 1, and suppose there exist mixing constants $m$ and $c$ so that the induced transition kernel on states satisfies

$$ \sup_s \bigl\|P_{\pi_i}^t(s,\cdot) - p_{\pi_i}\bigr\|_{\mathrm{tv}} \le m e^{-ct}. $$

Suppose the TD iteration count $N$ and step size $\eta$ satisfy

$$ N \ge k, \qquad \eta \le \frac{1}{400\sqrt{kN}}, \qquad\text{where } k = \Bigl\lceil \frac{\ln N + \ln m}{c} \Bigr\rceil. $$

Then the averaged TD iterate $\hat U_i$ satisfies

$$ \mathbb{E}_{\vec x, \vec r}\bigl\|\hat U_i - \bar U_i\bigr\|^2 + \eta N\, \mathbb{E}_{\vec x, \vec r}\, \mathbb{E}_{(s,a)\sim(p_{\pi_i},\pi_i)} \bigl\langle x_{sa},\, \hat U_i - \bar U_i \bigr\rangle^2 \le \frac{54}{(1-\gamma)^2}, $$

where $\bar U_i$ is the minimum norm fixed point of the expected TD iteration at stationarity (cf. Section C.2), and thus $\langle x_{sa}, \bar U_i\rangle = \mathcal{Q}_i(s,a)$ for any pair $(s,a)$.

The proof is intricate owing mainly to issues of statistical dependency. It is not merely an issue that the chain is not started from the stationary distribution; notice that the successive state/action pairs along the trajectory are all statistically dependent. Indeed, even if $s_j$ is sampled from the stationary distribution (which also means $s_{j+1}$ is distributed according to the stationary distribution as well), the conditional distribution of $s_{j+1}$ given $s_j$ is not in general stationary! To deal with such issues, the proof chooses a very small step size which ensures the TD estimate evolves much more slowly than the mixing time of the chain, and within the proof gaps are introduced in the chain so that inner products are only taken between iterates separated by roughly a mixing time, rather than between adjacent iterates. That said, many details need to be checked for this to go through.
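A bare-bones TD sketch in this spirit (tabular one-hot features, so the linear MDP condition holds trivially; all sizes and constants are illustrative): a single unbroken trajectory, a small constant step size, no projections, and an averaged iterate compared against the exact solution.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 2, 2, 0.8
P = rng.random((nS, nA, nS)); P /= P.sum(-1, keepdims=True)   # transition kernel
R = rng.random((nS, nA))                                      # deterministic rewards
pi = np.full((nS, nA), 0.5)                                   # policy being evaluated

# exact Q^pi for reference: Q = R + gamma * P Pi Q, solved as a linear system
Ppi = np.einsum('sat,tb->satb', P, pi).reshape(nS * nA, nS * nA)
Q_true = np.linalg.solve(np.eye(nS * nA) - gamma * Ppi, R.reshape(-1)).reshape(nS, nA)

def feat(s, a):
    x = np.zeros(nS * nA)          # one-hot x_{sa}: tabular, so linearity is exact
    x[s * nA + a] = 1.0
    return x

eta, N = 0.02, 60_000
U = np.zeros(nS * nA)
U_sum = np.zeros_like(U)
s = 0
a = rng.choice(nA, p=pi[s])
for _ in range(N):                 # one unbroken trajectory: no resets, no projections
    s2 = rng.choice(nS, p=P[s, a])
    a2 = rng.choice(nA, p=pi[s2])
    x, x2 = feat(s, a), feat(s2, a2)
    U += eta * (R[s, a] + gamma * (x2 @ U) - x @ U) * x       # TD(0) update
    U_sum += U
    s, a = s2, a2
Q_td = (U_sum / N).reshape(nS, nA) # averaged iterate approaches Q_true
```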

A second component of the proof, which removes projection steps from prior work (Bhandari et al., 2018), is an implicit bias of TD, detailed as follows. Mirroring the mirror descent statement in Section 2, the left hand side above has not only the promised error term, but also a norm control on the iterates; in fact, this norm control holds for all intermediate TD iterations, and is used throughout the proof to control many error terms. Just like in the MD analysis, this term is an implicit regularization, and is how this work avoids the projection step needed in prior work (Bhandari et al., 2018).

All the pieces are now in place to sketch the proof of the main theorem in Section 1.1, which is presented in full in Appendix D. To start, instantiate the mixing lemma of Section 3 with the KL divergence upper bound $\ln k + 2/(1-\gamma)^2$, which gives the various mixing constants used throughout the proof (which we need to instantiate now, before seeing the sequence of policies, to avoid any dependence). With that out of the way, consider some iteration $i$, and suppose that for all iterations $j < i$, we have a handle both on the TD error, and also a guarantee that we are in a small KL ball around $\bar\pi$ (specifically, of the radius appearing in Section 1.1). The right hand side of the simplified mirror descent bound in Section 2 only needs a control on all previous TD errors, therefore it implies both a bound on the KL divergence and on the value functions at iteration $i$. But this KL control on $\pi_i$ means that the mixing and other constants we assumed at the start will hold for $\pi_i$, and thus we can invoke the TD guarantee of Section 3 to bound the error of $\hat{\mathcal{Q}}_i$, which we will use in the next loop of the induction. In this way, the actor and critic analyses complement each other and work together in each step of the induction.

There was one issue overlooked in the preceding paragraph. Notice that Section C.2 only grants an error control on average over pairs $(s,a)$ sampled from the stationary distribution of $\pi_i$ (which we mix towards thanks to Section 3). To control the error in Section 2, superficially we need something closer to a uniform error over pairs; within the proof, however, the only actions we need to consider end up being sampled from $\pi_i$ or from $\bar\pi$, and in the latter case we know explicitly that either the probability of an action is large (since $\bar\pi$ is the maximum entropy optimal policy), or it is $0$ and the error term vanishes. This reasoning is only sufficient for the simplified mirror descent bound in Section 2, and more sophisticated error controls would be needed to apply the refined bound.

## 4 Discussion and open problems

This work, in contrast to prior work in natural actor-critic and natural policy gradient methods, dropped many assumptions from the analysis, and components of the algorithms. The analysis was meant to be fairly general purpose and unoptimized. As such, there are many open problems.

##### Faster rates.

How much can this analysis be squeezed? Moreover, does it suggest any algorithmic improvements?

##### Implicit vs explicit regularization/exploration.

What are some situations where one is better than the other, and vice versa? The analysis here only says you can get away with doing everything implicitly, but not necessarily that this is the best option.

##### More general settings.

The paper here is for linear MDPs, linear softmax policies, finite state and action spaces. How much does the implicit bias phenomenon (and this analysis) help in more general settings?

##### Tightening the TD and MD coupling.

The proof of Section 1.1 here relied on a very tight coupling of the actor (mirror descent) and the critic (temporal difference). But perhaps the coupling can be made even tighter, both in the algorithm and the analysis?

#### Acknowledgments

MT thanks Nan Jiang and Tor Lattimore for helpful discussions in earlier phases of this work, and is grateful to the NSF for support under grant IIS-1750051.

## Appendix A Background proof: existence of $\bar\pi$

The only thing in this section is the expanded version of the claim from Section 1.1, namely giving the unique maximum entropy optimal policy $\bar\pi$, and some key properties.

There exists a unique maximum entropy optimal policy $\bar\pi$ and corresponding $\bar{\mathcal{Q}}$ and $\bar{\mathcal{A}}$ which satisfy the following properties.

1. For any state $s$, let $\mathcal{A}(s)$ denote the set of actions taken by optimal policies; define $\bar\pi(s,\cdot) := \mathrm{Uniform}(\mathcal{A}(s))$, which is unique; then $\bar\pi$ is also an optimal policy, and let $\bar{\mathcal{A}}$ and $\bar{\mathcal{Q}}$ denote its advantage and $\mathcal{Q}$ functions.

2. For every state $s$ and every action $a$, $\bar{\mathcal{Q}}(s,a) = \max_\pi \mathcal{Q}^\pi(s,a)$, where the maximum is taken over all policies. Moreover, $\bar{\mathcal{V}} = \max_\pi \mathcal{V}^\pi$.

3. $\lim_{r\to\infty} \phi\bigl(r\,\bar{\mathcal{A}}(s,\cdot)\bigr) = \bar\pi(s,\cdot)$ for every state $s$.

###### Proof.
1. We provide an iterative construction of $\bar\pi$. Start with $\pi_0$ equal to any optimal deterministic policy (which must exist as usual for MDPs), and consider any enumeration $(s_1, s_2, \ldots)$ of the set of states. The construction produces $\pi_j$ from $\pi_{j-1}$, and will assume by induction that $\pi_{j-1}$ is an optimal policy which for every state $s_l$ with $l < j$ satisfies $\pi_{j-1}(s_l,\cdot) = \mathrm{Uniform}(\mathcal{A}(s_l))$. The base case $\pi_0$ was handled directly, thus consider constructing $\pi_j$. Since Markov chains have no memory, the behavior in state $s_j$ is independent of the behavior in all prior and subsequent states; therefore we can safely define $\pi_j(s_j,\cdot) := \mathrm{Uniform}(\mathcal{A}(s_j))$ and $\pi_j(s_l,\cdot) := \pi_{j-1}(s_l,\cdot)$ for $l \neq j$, and $\pi_j$ is still an optimal policy. To complete the construction, set $\bar\pi$ to the resulting policy, and let $\bar{\mathcal{Q}}$ and $\bar{\mathcal{A}}$ correspond to it.

2. For any $\pi$ with corresponding $\mathcal{Q}$ function $\mathcal{Q}^\pi$ and value function $\mathcal{V}^\pi$, and any pair $(s,a)$, then

$$ \bar{\mathcal{Q}}(s,a) - \mathcal{Q}^\pi(s,a) = \mathbb{E}_{r,s'\sim(s,a)}\bigl[r + \gamma\bar{\mathcal{V}}(s') - r - \gamma\mathcal{V}^\pi(s')\bigr] = \gamma\,\mathbb{E}_{s'\sim(s,a)}\bigl[\bar{\mathcal{V}}(s') - \mathcal{V}^\pi(s')\bigr] \ge 0. $$

It follows that $\bar{\mathcal{Q}}(s,a) \ge \sup_\pi \mathcal{Q}^\pi(s,a)$, and since $\bar\pi$ is itself a policy, the supremum is a maximum and the inequality is an equality.

3. By the previous point, for any state $s$ and any $a\in\mathcal{A}(s)$, then $\bar{\mathcal{A}}(s,a) = 0$, whereas for any $a\notin\mathcal{A}(s)$, then $\bar{\mathcal{A}}(s,a) < 0$. It follows that

$$ \lim_{r\to\infty}\phi\bigl(r\,\bar{\mathcal{A}}(s,\cdot)\bigr) = \mathrm{Uniform}\bigl(\mathcal{A}(s)\bigr) = \bar\pi(s,\cdot). $$
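This limit is easy to check numerically; the advantage values below are hypothetical, with two optimal actions (advantage zero) and two suboptimal ones.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

A_bar = np.array([0.0, 0.0, -0.3, -1.2])   # optimal actions have zero advantage
p_small = softmax(1.0 * A_bar)             # far from uniform-over-argmax
p_large = softmax(1000.0 * A_bar)          # approaches Uniform({0, 1})
```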

## Appendix B Full mirror descent setup and proofs

This section first gives a basic mirror descent / dual averaging setup. This characterization is mostly standard (Bubeck, 2015), though the main guarantees are given with some flexibility to allow for various natural policy setups.

First, here is the basic notation (which, unlike the paper body, will allow the step size to differ between iterations):

$$ p_{i+1} := p_i - \theta_i g_i \text{ with } \theta_i > 0, \qquad q_i := \nabla\psi(p_i) \text{ for a closed proper convex } \psi, $$

together with a bilinear pairing $\langle\cdot,\cdot\rangle$, a primal Bregman divergence $D(p,p_i)$, and a dual Bregman divergence $D^*(q,q_i)$.

One nonstandard choice here is that the Bregman divergence bakes in a conjugate element, rather than using only primal points and gradients of $\psi$; this gives an easy way to handle certain settings (like the boundary of the simplex) which run into non-uniqueness issues. Secondly, $\langle\cdot,\cdot\rangle$ is just some bilinear form and need not be interpreted as a standard inner product.

The standard Bregman identities used in mirror descent proofs are as follows:

$$ \theta_i\bigl\langle g_i,\, q_{i+1} - q \bigr\rangle = D^*(q, q_i) - D^*(q, q_{i+1}) - D^*(q_{i+1}, q_i), \tag{3} $$

$$ D^*(q_{i+1}, q_i) = D(p_i, p_{i+1}), \tag{4} $$

$$ D^*(q_{i+1}, q_i) + D^*(q_i, q_{i+1}) = \theta_i\bigl\langle g_i,\, q_i - q_{i+1} \bigr\rangle, \tag{5} $$

$$ D^*(q, q_i) \ge 0. \tag{6} $$

With these in hand, the core mirror descent guarantee is as follows. The bound is written with equalities to allow for careful handling of error terms. Note that this version of mirror descent does not interpret the "gradient" $g_i$ in any way, and treats it as a vector and no more.

Suppose $\theta_i > 0$ for every $i$. For any $t$ and any $q$ in the dual space,

$$ \sum_{i<t}\theta_i\bigl\langle g_i,\, q_i - q\bigr\rangle = D^*(q,q_0) - D^*(q,q_t) + \sum_{i<t}\Bigl(\theta_i\bigl\langle g_i,\, q_i - q_{i+1}\bigr\rangle - D(p_i, p_{i+1})\Bigr). $$

Moreover, for any $i$, the summand $\theta_i\langle g_i, q_i - q_{i+1}\rangle - D(p_i, p_{i+1})$ is nonnegative.

###### Proof.

For any fixed iterate $i$, combining eqs. (3), (4) and (5),

$$ \theta_i\bigl\langle g_i,\, q_i - q\bigr\rangle = D^*(q,q_i) - D^*(q,q_{i+1}) + \theta_i\bigl\langle g_i,\, q_i - q_{i+1}\bigr\rangle - D(p_i, p_{i+1}). $$

The first equalities now follow by applying $\sum_{i<t}$ to both sides and telescoping.

For the second part, for any $i$, convexity of $\psi$ yields an inequality which rearranges to give the claim since $\theta_i > 0$. ∎

All that remains is to instantiate the various mirror descent objects to match Algorithm 1, and control the various resulting terms. This culminates in Section 2; its proof is as follows.

###### Proof of Section 2.

The core of both parts of the proof is to apply the mirror descent guarantees from Appendix B, using the following choices:

$$ g_i := -\hat{\mathcal{Q}}_i, \qquad p_{i+1} := p_i - \theta g_i = p_i + \theta\hat{\mathcal{Q}}_i, \qquad \langle p, q\rangle := \sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}} p(s,a)\, q(s,a), $$

$$ \psi(p) := \mathbb{E}_{s\sim v_\mu^\pi}\ln\sum_{a\in\mathcal{A}}\exp(p(s,a)) = \sum_{s\in\mathcal{S}} v_\mu^\pi(s)\ln\sum_{a\in\mathcal{A}}\exp(p(s,a)), $$

$$ \nabla\psi(p)(s,a) = v_\mu^\pi(s)\,\frac{\exp(p(s,a))}{\sum_{b\in\mathcal{A}}\exp(p(s,b))}, \qquad q_i := \nabla\psi(p_i), \qquad q_\pi(s,a) := v_\mu^\pi(s)\,\pi(s,a), $$

with $D(p,p_i)$, $\psi^*(q)$, and $D^*(q,q_i)$ the induced primal Bregman divergence, convex conjugate, and dual Bregman divergence.

A key consequence of these constructions is that $q_i$, treated for any fixed $s$ as an unnormalized policy, agrees with $\pi_i$ after normalization; that is to say, it gives the same policy, and the choice of $v_\mu^\pi$ baked into the definition is not needed by the algorithm and is only used in the analysis. The "gradient" $g_i$ makes no use of it.

Plugging this notation into Appendix B while making use of two of its equalities, and the performance difference lemma eq. (2), then for any comparator $\pi$,

 θ(1−γ)∑i

The proof now splits into the two different settings.

1. (Simplified bound.) By the above definitions and the first equality in Appendix B,

 =D∗(q,qt)−D∗(q,q0)−∑i

where the last term may be bounded in a way common in the online learning literature (Shalev-Shwartz, 2011): since when , setting for convenience (whereby as needed by the preceding inequality),

 D(pi+1,pi) =\bbEs∼vμπ\delln∑aexp(pi+1(s,a))−ln∑aexp(pi(s,a))−∑aπi(s,a)(pi+1(s,a)−pi(s,a)) =\bbEs∼vμπ\delln\del∑aπi(s,a)exp\delθˆ\cQi(s,