# On Sample Complexity of Projection-Free Primal-Dual Methods for Learning Mixture Policies in Markov Decision Processes

We study the problem of learning policy of an infinite-horizon, discounted cost, Markov decision process (MDP) with a large number of states. We compute the actions of a policy that is nearly as good as a policy chosen by a suitable oracle from a given mixture policy class characterized by the convex hull of a set of known base policies. To learn the coefficients of the mixture model, we recast the problem as an approximate linear programming (ALP) formulation for MDPs, where the feature vectors correspond to the occupation measures of the base policies defined on the state-action space. We then propose a projection-free stochastic primal-dual method with the Bregman divergence to solve the characterized ALP. Furthermore, we analyze the probably approximately correct (PAC) sample complexity of the proposed stochastic algorithm, namely the number of queries required to achieve near optimal objective value. We also propose a modification of our proposed algorithm with the polytope constraint sampling for the smoothed ALP, where the restriction to lower bounding approximations are relaxed. In addition, we apply the proposed algorithms to a queuing problem, and compare their performance with a penalty function algorithm. The numerical results illustrates that the primal-dual achieves better efficiency and low variance across different trials compared to the penalty function method.

• 5 publications
• 5 publications
• 4 publications
• 31 publications
10/17/2017

### Primal-Dual π Learning: Sample Complexity and Sublinear Run Time for Ergodic Markov Decision Problems

Consider the problem of approximating the optimal policy of a Markov dec...
02/27/2020

### Cautious Reinforcement Learning via Distributional Risk in the Dual Domain

We study the estimation of risk-sensitive policies in reinforcement lear...
02/26/2018

### Optimizing over a Restricted Policy Class in Markov Decision Processes

We address the problem of finding an optimal policy in a Markov decision...
04/27/2018

### Scalable Bilinear π Learning Using State and Action Features

Approximate linear programming (ALP) represents one of the major algorit...
02/27/2014

### Linear Programming for Large-Scale Markov Decision Problems

We consider the problem of controlling a Markov decision process (MDP) w...
05/07/2020

### A Gradient-Aware Search Algorithm for Constrained Markov Decision Processes

The canonical solution methodology for finite constrained Markov decisio...
02/21/2022

### Accelerating Primal-dual Methods for Regularized Markov Decision Processes

Entropy regularized Markov decision processes have been widely used in r...

## 1. Introduction

This manuscript is the (extended) conference paper.

M

arkov decision processes are mathematical models for sequential decision making when outcomes are uncertain. The Markov decision process model consists of decision epochs, states, actions, transition probabilities, and costs. Choosing an action in a state generates a cost and determines the state at the next decision epoch through a transition probability function. Policies or strategies are prescriptions of which action to take in a given state to minimize the cost. Given a MDP, the main objective is to compute (near-) optimal policies that (approximately) attain the minimum long-term average cost.

In most practical applications, the underlying MDP is compactly represented and the number of states scale exponentially with the size of the representation of the MDP. In addition, for such applications, various hardness results often indicate that computing actions of optimal policies is intractable in the sense that polynomial-time algorithms to compute such control policies are unlikely (unless come complexity class collapse) or simply don’t have guarantees for searching policies within a constant multiplicative or additive factor of the optimal; see, e.g., [1]. In view of those negative results, it is natural to pursue a more modest objective which is to compute the actions of a policy that is nearly as good as a policy chosen by an oracle from a given restricted policy class. Following the work of [1], in this paper we consider the policy class to be the convex-hull of a set of known base policies. Such base policies are often known in practice for certain applications. For instance, a myopic and a look ahead policy in queuing networks can be combined to achieve a mixture policy with better performance.

### 1.1. Main Contributions

The main contributions of this paper are summarized as follows:

• We formulate the optimization over the restricted class of mixture policies as an approximate linear programming (ALP) for MDPs, where the feature vectors are given by the occupation measures of the base policies.

• We propose a novel projection-free

stochastic primal-dual (SPD) algorithm for the reinforcement learning of efficient mixture policies in Markov decision processes.

• We analyze the constraint violation of the solutions of SPD, and prove that such constraint violation diminishes in the asymptotic of many rounds.

• We analyze the sample complexity of the proposed algorithm, i.e., the number of queries required from a sampling oracle to achieve near optimal cost function.

• We numerically compare the performance of the proposed primal-dual algorithm with that of the penalty function method for a queuing problem, and we show that the solutions obtained by the proposed method in this paper has a smaller variance across different trials compared to the penalty function method.

### 1.2. Connections to Prior Works

The ALP as a framework to find a “low-dimensional” representation of “high-dimensional” functions on a state (action) space has a long history in decision theory; see, e.g., [2, 3, 4, 5, 6] and references therein. The seminal work of De Farias and Von Roy [3]

studies an ALP for stochastic control problems as a means of alleviating the curse of dimensionality. The main challenge in using the proposed ALP is that while it has a relatively small number of variables, it has an intractable number of constraints. To address this issue, the same authors proposed a constraint sampling scheme in a separate work

[7]. In the same line of work, a dual approximate linear programming for the MDP is considered in [5], where the optimization variable is a stationary distribution over state-action pairs. A neighborhood of a low-dimensional subset of the set of stationary distributions defined in terms of state-action features is considered as the comparison class. In a similar line of work, a -learning algorithm is proposed in [4] which leverages state and action features to reduce the dimensionality of the state and action spaces. In addition, the sample complexity of the proposed -learning algorithm is analyzed.

Our work is also closely related to the recent study of Banijamali, et al. [1]. Therein, the authors propose a stochastic gradient decent in conjunction with a penalty function method to optimize over a set of mixtures of policies. The main challenge in using a penalty function method is that its performance is often sensitive to the choice of the penalty factor in addition to the learning rate. Moreover, it yields a large constraint violation as is observed in [8, Thm. 1]. Furthermore, to optimize the regret performance , the authors in [1] propose a penalty factor that depends on the amount of violation of constraints, which is unknown in practice. In this paper, we propose a stochastic primal-dual method whose only hyper-parameter is the learning rate.

We also mention the recent work of Chen and Wang [4], where a primal-dual algorithm with the Bregman divergence is considered in conjunction with the approximate linear programming formulation of MDPs. The work of [4] deals with the sample complexity of the stochastic primal-dual algorithm, namely the number of queries to the oracle needed to attain near optimal value function, whereas in this paper we are concerned with the efficiency in the dual space as well as the constraint violation of the primal-dual solutions. In addition, the algorithm we propose differs from [4] in several key points. First, unlike the algorithm of Chen and Wang, our approach does not require the realizability assumption (cf. [4, Def. 1]). The realizability condition requires the spanning vectors of features in ALP to be non-negative, which in turn allows one to simplifies the projection onto the simplex section of a hyper-plane. In particular, projection onto the simplex can be implemented efficiently using the Kullback-Leibler (KL) divergence as the proximal function. In our algorithm, we provide a direct method of projection onto the hyper-plane which obviates the realizability condition and provides a more expressive representation. In addition, we present a randomized policy based on the primal-dual solutions that differs from the policy proposed in [4]. Second, our proposed algorithm solves an optimization problem in the dual space, where feature vectors are the occupation measures of a set of base policies. This allows us to compute a randomized policy directly from the solution of the underlying dual optimization problem. Lastly, the role of the Bregman divergence in our algorithm is to implicitly enforce the constraints due to the size of the policy class. In contrast, the Bregman divergence in [4] is used as a mean to attain better sample complexities via adaptation to the geometry of the feasible set.

### 1.3. Paper Outline

The rest of this paper is organized as follows. In Section 2, we present preliminaries related to MDPs. In addition, we review the approximate linear programming formulation of MDPs based on the linear representation of large state-spaces. In Section 3, we present a stochastic regularized primal-dual algorithm to compute the coefficients of the mixture policy. We also present the main theorem concerning the performance of the proposed algorithm. In Section 5 we present the proof of the main theorem, while deferring technical lemmas to the appendices. Lastly, in Section 7, we conclude this paper.

## 2. Preliminaries

In this section, we first present the notations and technical definitions that we need to establish our theoretical results. We then review relevant definitions regarding infinite horizon discounted Markov decision processes. We also review Bellman’s equations as well as linear programming dual formulation of Bellman’s equations.

Notations and Definitions. We denote vectors by the lower case letters (e.g.

), and random variables and matrices with the upper case letters (

e.g. ). The dual norm conjugate to the norm is defined by . We denote the Euclidean projection by . We use the standard asymptotic notation with the following definition: If are two functions from to , then if there exists a constant such that for every sufficiently large and that if . For positive integer , we use the shorthand to denote the set . For a matrix , let denote the subordinate norm for

. We denote the largest singular value by

. Further, we use the shorthand notations , , and . A function is -Lipschitz with respect to the norm over iff

 |f(x)−f(y)|≤L∥x−y∥,for all x,y∈X.

A function is -smooth with respect to the norm over iff

 ∥∇f(x)−∇f(y)∥∗≤β∥x−y∥,for all x,y∈X.

A function is -strongly convex with respect to the norm over iff

 f(x)+⟨g,y−x⟩+μ2∥x−y∥2≥f(y),

for all . The effective domain of a function is the following set . The sub-differential set of a function at the point is defined as follows

 (2.1) ∂f(x0)def={g:f(x)−f(x0)≥⟨g,x−x0⟩,∀x∈dom(f)}.

The relative interior of a convex set , abbreviated , is defined as , where denotes the affine hull of the set , and is a ball of radius centered on .

The Fenchel conjugate of the function is defined as follows

 (2.2) f∗(y)=supx∈X{⟨x,y⟩−f(x)}.
###### Definition 2.1.

(Orlicz Norm) The Young-Orlicz modulus is a convex non-decreasing function such that and when . Accordingly, the Orlicz norm of an integrable random variable with respect to the modulus is defined as

 (2.3) ∥X∥ψdef=inf{β>0:IE[ψ(|X|−IE[|X|]/β)]≤1}.

In the sequel, we consider the Orlicz modulus . Accordingly, the cases of and norms are called the sub-Gaussian and the sub-exponential norms and have the following alternative definitions:

###### Definition 2.2.

(Sub-Gaussian Norm) The sub-Gaussian norm of a random variable , denoted by , is defined as

 (2.4) ∥Z∥ψ2=supq≥1q−1/2(IE|Z|q)1/q.

For a random vector , its sub-Gaussian norm is defined as

 (2.5) ∥Z∥ψ2=supx∈Sn−1∥⟨x,Z⟩∥ψ2.
###### Definition 2.3.

(Sub-exponential Norm) The sub-exponential norm of a random variable , denoted by , is defined as follows

 (2.6) ∥Z∥ψ1=supq≥1q−1(IE[|Z|q])1/q.

For a random vector , its sub-exponential norm is defined as

 (2.7) ∥Z∥ψ1=supx∈Sn−1∥⟨Z,x⟩∥ψ1.
###### Definition 2.4.

(Legendre Function) A function , is called a Legendre function (a.k.a. essentially smooth functions) if it satisfies the following conditions:

• and and is convex.

• is strictly convex.

• partial derivatives of exists and are continuous for all .

• Any sequence converging to a boundary point of satisfies

 limn→∞∥∇ϕ(xn)∥=∞.
###### Definition 2.5.

(Bregman Divergence) Suppose , is a Legendre function. The Bregman divergence is defined as follows

 Dϕ(x;y)def=ϕ(x)−ϕ(y)−⟨x−y,∇ϕ(y)⟩,

where is the gradient vector evaluated at .

### 2.1. Markov Decision Processes (MDPs)

In this paper, we consider MDPs with high dimensional state and action spaces. The first definition formalizes MDPs:

###### Definition 2.6.

(Markov Decision Process) A Markov decision process (MDP) is a 6-tuple consists of:

• Decision epochs: The set represents the set of times at which decisions are to be made. If is finite, then the MDP is said to be a finite horizon MDP with -epochs. If , the MDP is said to be an infinite horizon MDP.

• States: We assume that the state-space is finite.

• Actions: We assume that the action set is also finite.

• Transition model: The transition model for each and

, is the probability distribution

on . The element represents the probability of transitioning to the state after performing the action in the state . We define the matrices and .

• Cost function: For each , is the cost function. Let , and .

• Discount factor: The discount factor reflects the intertemporal preferences.

We consider discounted infinite horizon MDP characterized by the tuple . For such MDPs, we compute randomized policies which we formally define below:

###### Definition 2.7.

(Randomized Policy) A randomized policy is the sequence of distributions, where is a probability distribution on the action space , conditioned on the state . The value represents the probability of taking the action in the state . Let denotes the set of all such randomized policies.

The objective of discounted infinite horizon MDP is to find a policy such that the infinite-horizon sum of discounted costs is minimized regardless of the initial state , i.e.,

 minπ∈Πvπ(s)=liminfT→∞IEπ[T∑t=0γtcat(st)∣∣∣s0=s],

where and are the realizations of state transitions and actions, respectively, generated by the Markov decision process under a given policy , and the expectation is taken over the entire trajectory.

Define the Bellman operator for all . From the theory of dynamic programming, a vector is the optimal value function to the MDP if and only if it satisfies the following Bellman fixed point equation [9]

 (2.8) Sv∗=v∗,

where is the difference-of-value vector. A stationary policy is an optimal policy of the MDP if it attains the element-wise maximization in the Bellman equation (2.8), i.e., .

Alternatively, the Bellman equation (2.8) can be recast as the following linear programming problem (cf. [10]),

 (2.9a) maxv∈IRn+ αTv (2.9b) s.t.:(In−γPa)v−ca⪰0,∀a∈A,

where , and is the initial distribution over the states, and is the simplex of the probability measures on the set of states .

To characterize the dual problem associated with the primal problem (2.9), Let denotes the stationary distribution under the policy and initial distribution of states . In particular, let denotes the transition probability matrix induced by a fixed policy whose matrix elements are defined as . Alternatively, let be a matrix that encodes the policy , i.e., let , where other entries of the matrix are zero. Then, .

Furthermore, let

denotes the stationary distribution of the Markov chain induced by the policy

, i.e., , where

 (2.10) μπα =(1−γ)α∞∑t=0γt(Pπ)t (2.11) =(1−γ)α(In−γPπ)−1.

The measure captures the expected frequency of visits to each state when the underlying policy is , conditioned on the initial state being distributed according to . Future visits are discounted by the factor .

We define the occupation measure as the vector defined by , where is the Hadamard (element-wise) vector multiplication.. Then,

 π∗ =argminπ∈Π∑s∈Sμπα(s)∑a∈Aπa(s)ca(s) =argminπ∈Π∑s∈S∑a∈Aξπa(s)ca(s) (2.12) =argminπ∈ΠcTξπ.

Thus, the dual problem associated with the primal problem in Eq. (2.9a) has the following form

 (2.13a) minξ∈IRmncTξ (2.13b) s.t.:ξT(P−Q)=0, (2.13c) 0⪯ξ,ξT1=1.

where is a binary matrix such that the -th column has ones in rows to , and . In the optimization problem in Eq. (2.13), the constraint (2.13c) ensures that is a distribution, and the constraint (2.13b) guarantees that is stationary.

Let denotes the optimal solution of the linear programming problem in Eq. (2.13). An optimal randomized optimal policy can be characterized as follows

 (2.14) π∗a(s)=ξ∗a(s)μπ∗α(s)=ξ∗a(s)∑a∈Aξ∗a(s),∀a∈A,∀s∈S.

Furthermore, the optimal objective value of the linear programming problem in Eq. (2.13) corresponds to the optimal discounted-cost value under the optimal policy and the initial distribution of the states.

### 2.2. Approximate Linear Programming (ALP) for the Linear Representation of Large State Spaces

It is well-known that the sample complexity as well as computational complexity of solving the linear programming dual problem in Eq. (2.13) scales (at least) linearly with the size of the state space , rendering exact representation intractable in the face of problems of practical scale; see, e.g., [11]. Consequently, to deal with MDPs with a large state space, it is practical to map the state space to a low dimensional space, using the linear span of a small number of features.

Specifically, let denotes a matrix whose column vectors represent feature vectors. In the feature space, the distribution is spanned by the linear combination of features , namely, . Here, the radius and the general norm determine the size and geometry of the policy class , respectively. The dual optimization problem in Eq. (2.13) can thus be reformulated in terms of features

 (2.15a) minθ∈ΘRcTΨθ (2.15b) s.t.:θTΨT(P−Q)=0, (2.15c) 0⪯Ψθ,θTΨT1=1.

Designing the feature matrix for Markov decision processes is a challenging problem. In particular, we wish to design a feature matrix such that the linear expansion is expressive, but does not lead to over-fitting. In this paper, we focus on the set of features associated with a set of known base policies. Formally, we consider the set of -base policies and define the subset as the convex hull of base policies,

 (2.16) Πddef={πω:πω=d∑i=1ωiπi,d∑i=1ωi=1,ωi≥0,i=1,2,⋯,d}.

Corresponding to each base policy , a stationary distribution is defined according to Eq. (2.11). The dual space of in Eq. (2.16) is then defined as the linear combinations of occupation measures

 (2.17) Ξddef={ξθ:πω=d∑i=1ωiπi,d∑i=1ωi=1,i=1,2,⋯,d}.

For all the state-action distribution is defined as for all . With this choice of the feature vectors, we have and as the columns of the matrix are stationary probability distributions. Therefore, the dual optimization (2.15) takes the following form

 (2.18a) minθ∈ΘRcTΨθ (2.18b) s.t.:0⪯Ψθ,θT1=1.

Let and . The feasible set of the dual optimization in Eq. (2.18) is the intersection of the following sets

 (2.19) Θd=Θd1∩Θd2∩ΘdR.

Let denotes an approximate solution of the optimization problem in Eq. (2.15) generated by an (arbitrary) optimization algorithm after rounds. When is a feasible solution of Eq. (2.15), then defines a valid occupation measure. However, in this paper we permit the feature vectors to violate the non-negativity constraint defined by in the following sense: let denotes a function that quantifies the amount by which the vector violates the non-negativity constraints. In particular, iff . After rounds of the algorithm we propose in the next section (cf. Algorithm 1

), it outputs an estimate

that may potentially violate the constraints defined by , and thus . Nevertheless, we impose the constraint that the estimate generated by the algorithm satisfies the constraints in the asymptotic of many rounds .

By allowing such constraint violation, we devise an efficient algorithm whose only projection is onto the hyper-plane section of the feasible set , and the constraint due to the size of the policy class is enforced implicitly using a Bregman divergence whose domain is subsumed by the set . As we discuss in the next section, the Bregman projection onto the hyper-plane has an efficient implementation.

Notice, however, that due to the (potential) constraint violation of solutions , the vector may no longer be a valid occupancy measure. Nonetheless, it defines an admissible randomized policy via the following relation

 (2.20) πˆθTa(s)=[(ΨˆθT)a(s)]+∑a∈A[(ΨˆθT)a(s)]+,∀a∈A,∀s∈S.

In the case that for all pairs of action-state , we define

to be the uniform distribution. Let

denotes the occupancy measure induced by the policy defined in Eq. (2.20), i.e., .

### 2.3. Expanded Efficiency

Equipped with the dual optimization problem in Eq. (2.15), we now describe the notion of efficiency of an algorithm to solve (2.15). The following definition is adapted from [5]:

###### Definition 2.8.

(Efficient Large Scale Dual ALP [5]) For an MDP specified by the cost matrix , probability transition matrix , a feature matrix , the efficient large-scale dual ALP problem is to produce an estimate such that

 (2.21) cTξˆθ≤minθ∈ΘdcTξθ+O(ε),

in time polynomial in and under the model of computation in (A.3).

As described by Definition 2.8, the computational complexity depends on the number of features only, and not the size of state-space .

The preceding definition is a special case of the following generalized definition that allows for the constraint violation:

###### Definition 2.9.

(Expanded Efficient Large Scale Dual ALP [5]) Let be some violation function that measures how far is from a valid stationary distribution. In particular, iff is a feasible point for the dual ALP in Eq. (2.13). Then,

 (2.22) cTξˆθ≤minθ∈Θd2∩ΘdR[cTξθ+1εV(θ)]+O(ε),

in time polynomial in and , under the model of computation in (A.3).

Clearly, a guarantee for (2.22) implies a guarantee for Eq. (2.21). In addition, the expanded problem has a larger feasible set. Therefore, even if many feature vectors may not admit any feasible points in the feasible set and the dual problem is trivial, solving the expanded problem is still meaningful.

### 2.4. Assumptions

To establish our theoretical results, we need the following fast mixing assumption on underlying MDPs. This assumption implies that any policy quickly converges to its stationary distribution:

• (Fast Mixing Markov Chain) For , the Markov decision process specified by the tuple is -mixing in the sense that

 (2.23) τmix(ε)def=maxπ∈Πmin{t≥1:∥(Pπ)t(s,⋅)−μπα∥TV≤ε,∀s∈S},

for all , where for two given Borel probability measures , is the total variation norm.

The fast mixing condition (2.23) is a standard assumption for Markov decision processes; see, e.g., [5], [12]. The fast mixing condition (2.23) implies that for any policy , there exists a constant such that for all the distributions over the state-action space,

 (2.24) ∥νPπ−ˆνPπ∥TV≤e−1τmix(ε)∥ν−ˆν∥TV.

The following assumption is also standard in the literature; see, e.g., [12]:

• (Uniformly Bounded Ergodicity) The Markov decision process is ergodic under any stationary policy , and there exists such that

 (2.25) 1n√κ1≤μπα≤√κn1.

where we recall from Section 3.1 that is the stationary distribution over the state space of the MDP under the policy , and with the initial distribution on the states.

Under the condition (2.25) of (A.2), it is well-known that the underlying MDP is unichain [13], i.e., a Markov chain that contains a single recurrent class and possibly, some transient states. Moreover, the factor determines the amount of variation of stationary distributions associated with different policies, and thus can be sought of as a form of measuring the complexity of a MDP. Notice that in the case that some policies induce transient states (so the stationary distribution is not bounded away from zero), their mixture with an ergodic policy guarantee ergodicity.

Lastly, we consider the setup of the reinforcement learning in which the cost function is unknown. But, the agent can interact with its environment and receives feedbacks from a sampling oracle:

• (Model-Free Reinforcement Learning) We consider the following model of computation:

1. [leftmargin=*]

2. The state space , the action spaces , the reward upper bound and lower bounds and , and the discount factor are known.

3. The cost function is unknown.

4. There is a sampling oracle that takes input and generates a new state with probabilities and returns the cost .

## 3. Stochastic Primal-Dual Proximal Algorithm

In this section, we first describe a stochastic primal-dual proximal method to compute the coefficients of the mixture model in the dual optimization problem in Eq. (2.18). We then analyze the efficiency as well as the sample complexity of the proposed algorithm.

### 3.1. Stochastic Primal-Dual Algorithm

To apply the stochastic primal-dual method, we recast the optimization problem (2.18) as the following MinMax optimization problem

 (3.1) minθ∈Θd2∩ΘdRmaxλ∈IRnm+LΨ(θ,λ)def=cTΨθ−λTΨθ.

The following definition characterizes the notion of the saddle point of a MinMax problem:

###### Definition 3.1.

(Saddle Point of MinMax Optimization Problem) Let denotes a saddle point of the MinMax problem in Eq. (3.1), i.e., a point that satisfies the inequalities

 (3.2) LΨ(θ∗,λ)≤LΨ(θ∗,λ∗)≤LΨ(θ,λ∗),

for all and .

The primal optimal point of the MinMax problem (3.1) is an optimal solution for the dual problem (2.18).

Applying the stochastic primal-dual method to the MinMax problem (3.1) is challenging as the the Lagrange multipliers may take arbitrarily large values during the iterations of the primal-dual algorithm, resulting in a large sub-gradients for the Lagrangian function and instability in the performance. The following lemma, due to Chen and Wang [14, Lemma 1], provides an upper bound on the norm of the optimal Lagrange multipliers:

###### Lemma 3.2.

(Upper Bound on the Lagrange Multipliers, [14, Lemma 1]) Suppose is a Lagrange multiplier vector satisfying the MinMax condition in Eq. (3.2). Then, , and for all .

Now, define the norm ball , where . Here, is a free parameter of the algorithm that will be determined later.

Lemma 3.2 suggests that the vector of optimal Lagrange multipliers belongs to the compact set . Therefore, instead of the MinMax problem (3.1), we can consider the following equivalent problem

 (3.3) minθ∈Θ2∩ΘRmaxλ∈ΛLΨ(θ,λ),

where the Lagrange multipliers are maximized over the compact set instead of entire non-negative orthant . As we alluded earlier, the set of saddle points of the MinMax problems in Eqs. (3.3) and (3.1) coincide.

Algorithm 1 describes a procedure to solve the MinMax problem in Eq. (3.3). At each iteration of the stochastic primal-dual method, we compute the following deterministic gradient

 (3.4) ∇λLΨ(θt,λt)=Ψθt.

Furthermore, we draw a random index , and subsequently sample the state-action . Then, we compute the stochastic gradient as follows

 (3.5a) ∇θˆLΨ(θt,λt) =cat(st)Ψat(st)−λt,at(st)Ψat(st)1m∑di=1ψiat(st),

where is a row vector, corresponding to the row of the feature matrix . We notice that

is an unbiased estimator of the gradients of the Lagrangian function

. Formally,

 IE[∇θˆLΨ(θt,λt)] =∑(s,a)∈S×Acat(st)Ψat(st)−λt,at(st)Ψat(st)1m∑di=1ψiat(st)IP[st=s,at=a] =∑(s,a)∈S×Aca(s)Ψa(s)−λt,a(s)Ψa(s) (3.6) =cTΨ−λTtΨ=∇θLΨ(θt,λt).

Algorithm 1 describes the primal-dual steps to update . To implicitly control the size of the policy class as characterized by the constraint , in Algorithm 1 we consider a Bregman divergence whose modulus has the domain . For example, when the policy class is defined by a Euclidean ball , the following convex function can be employed (see [15])

 (3.7) ϕ(θ)=−√R2−∥θ∥22.

The associated convex conjugate of the modulus (3.7) is

 (3.8) ϕ∗(θ)=R√1+∥θ∥22.

Moreover, the modulus (3.7) yields the Hellinger-like divergence

 (3.9) Dϕ(θ1,θ2)=R2−⟨θ1,θ2⟩√R2−∥θ2∥22−√R2−∥θ1∥22.

Alternatively, when the policy class is defined by a hyper-cube , a function similar to the symmetric logistic function yields the divergence with the desired domain. In particular, let

 (3.10) ϕ(θ)=d∑i=1[(R+θi)log(R+θi)+(R−θi)log(R−θi)].

The convex conjuagte of is the following function

 (3.11) ϕ∗(θ)=−d∑i=1Rθi−d∑i=12Rlog(2Reθi+1).

The Bregman divergence associated with the modulus in Eq. (3.10) is

 (3.12) Dϕ(θ1,θ2)=d∑i=1[(θi1+R)log(θi1+R)(θi2+R)+(R−θi1)log(R−θi1)(R−θi2)].

The steps (3.25a)-(3.25c) of Algorithm 1 are reminiscent of the so-called Mirror Descent Algorithm for online convex optimization [16],

 (3.13)

However, the role of the Bregman divergence in Algorithm 1 is different from that of the Mirror Descent algorithm. In the Mirror Descent algorithm, the Bregman divergence is typically used to adapt to the geometry of the feasible set and achieve better dependency on the dimension of the embedding space of the feasible set. In contrast, the Bregman divergence in Algorithm 1 is employed to implicitly enforce the constraints due to the size of the policy class and eliminate the projection step.

To see this equivalence, first notice that the update rule in Eq. (3.13) can be rewritten as below

 (3.14a) ˜θt =∇ϕ∗(∇ϕ(θt)−ηt∇θˆLΨ(θt,λt)), (3.14b) θt+1 =argminθ∈Θ2Dϕ(θ;˜θt),

where we also recall . Now, consider the Bregman projection (3.14b) onto the hyper-plane . As shown in [15], the Bregman projection in Equation (3.14b) can be alternatively computed using Eqs. (3.25b)-(3.25c). To demonstrate this, first invoke the necessary and sufficient KKT conditions for the optimality of which requires the existence of an optimal Lagrange multiplier that satisfies the following equation

 (3.15) ∇θDϕ(θ;˜θt)|θ=θt+1=zt∇θ(θT1−1).

From Eq. (3.15), we establish that

 (3.16) ∇ϕ(θt+1)=zt1+∇ϕ(˜θt).

Alternatively, since the gradient of a Legendre function is a bijection from to and its inverse is the gradient of the conjugate (cf. [17, Thm. 26.5]), we have

 (3.17) θt+1=∇ϕ