1. Introduction
^{†}^{†}This manuscript is the extended version of the conference paper.
Markov decision processes are mathematical models for sequential decision making when outcomes are uncertain. The Markov decision process model consists of decision epochs, states, actions, transition probabilities, and costs. Choosing an action in a state generates a cost and determines the state at the next decision epoch through a transition probability function. Policies or strategies are prescriptions of which action to take in a given state to minimize the cost. Given an MDP, the main objective is to compute (near-)optimal policies that (approximately) attain the minimum long-term average cost.
In most practical applications, the underlying MDP is compactly represented, and the number of states scales exponentially with the size of the representation of the MDP. In addition, for such applications, various hardness results indicate that computing actions of optimal policies is intractable, in the sense that polynomial-time algorithms to compute such control policies are unlikely to exist (unless certain complexity classes collapse), or come with no guarantee of finding a policy within a constant multiplicative or additive factor of the optimal; see, e.g., [1]. In view of these negative results, it is natural to pursue a more modest objective, namely to compute the actions of a policy that is nearly as good as a policy chosen by an oracle from a given restricted policy class. Following the work of [1], in this paper we consider the policy class consisting of the convex hull of a set of known base policies. Such base policies are often known in practice for certain applications. For instance, a myopic and a look-ahead policy in queueing networks can be combined to achieve a mixture policy with better performance.
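As a toy illustration of such a mixture (all sizes, the two policies, and the mixing coefficient below are hypothetical placeholders, not taken from the paper), a convex combination of two row-stochastic policy matrices is again a valid randomized policy:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_policy(n_states, n_actions, rng):
    """Draw a random row-stochastic policy matrix pi[s, a] = Pr(a | s)."""
    p = rng.random((n_states, n_actions))
    return p / p.sum(axis=1, keepdims=True)

n_states, n_actions = 4, 3
pi_myopic = random_policy(n_states, n_actions, rng)     # stand-in for a myopic policy
pi_lookahead = random_policy(n_states, n_actions, rng)  # stand-in for a look-ahead policy

theta = 0.3  # hypothetical mixture coefficient
pi_mix = theta * pi_myopic + (1.0 - theta) * pi_lookahead

# A convex combination of row-stochastic matrices is row-stochastic,
# so pi_mix is itself a valid randomized policy.
assert np.allclose(pi_mix.sum(axis=1), 1.0)
```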
1.1. Main Contributions
The main contributions of this paper are summarized as follows:

We formulate the optimization over the restricted class of mixture policies as an approximate linear program (ALP) for MDPs, where the feature vectors are given by the occupation measures of the base policies.

We propose a novel projection-free stochastic primal-dual (SPD) algorithm for the reinforcement learning of efficient mixture policies in Markov decision processes.

We analyze the constraint violation of the solutions of SPD, and prove that this constraint violation vanishes asymptotically in the number of rounds.

We analyze the sample complexity of the proposed algorithm, i.e., the number of queries to a sampling oracle required to achieve a near-optimal cost.

We numerically compare the performance of the proposed primal-dual algorithm with that of the penalty function method on a queueing problem, and we show that the solutions obtained by the proposed method have a smaller variance across different trials than those of the penalty function method.
1.2. Connections to Prior Works
The ALP as a framework to find a "low-dimensional" representation of "high-dimensional" functions on a state (action) space has a long history in decision theory; see, e.g., [2, 3, 4, 5, 6] and references therein. The seminal work of de Farias and Van Roy [3]
studies an ALP for stochastic control problems as a means of alleviating the curse of dimensionality. The main challenge in using the proposed ALP is that while it has a relatively small number of variables, it has an intractable number of constraints. To address this issue, the same authors proposed a constraint sampling scheme in a separate work
[7]. In the same line of work, a dual approximate linear program for the MDP is considered in [5], where the optimization variable is a stationary distribution over state-action pairs. A neighborhood of a low-dimensional subset of the set of stationary distributions, defined in terms of state-action features, is considered as the comparison class. In a similar line of work, a learning algorithm is proposed in [4] which leverages state and action features to reduce the dimensionality of the state and action spaces. In addition, the sample complexity of the proposed learning algorithm is analyzed. Our work is also closely related to the recent study of Banijamali et al. [1]. Therein, the authors propose a stochastic gradient descent method in conjunction with a penalty function method to optimize over a set of mixtures of policies. The main challenge in using a penalty function method is that its performance is often sensitive to the choice of the penalty factor in addition to the learning rate. Moreover, it yields a large constraint violation, as observed in [8, Thm. 1]. Furthermore, to optimize the regret performance, the authors in [1] propose a penalty factor that depends on the amount of constraint violation, which is unknown in practice. In this paper, we propose a stochastic primal-dual method whose only hyperparameter is the learning rate.
We also mention the recent work of Chen and Wang [4], where a primal-dual algorithm with a Bregman divergence is considered in conjunction with the approximate linear programming formulation of MDPs. The work of [4] deals with the sample complexity of the stochastic primal-dual algorithm, namely the number of queries to the oracle needed to attain a near-optimal value function, whereas in this paper we are concerned with the efficiency in the dual space as well as the constraint violation of the primal-dual solutions. In addition, the algorithm we propose differs from that of [4] in several key points. First, unlike the algorithm of Chen and Wang, our approach does not require the realizability assumption (cf. [4, Def. 1]). The realizability condition requires the spanning feature vectors in the ALP to be non-negative, which in turn allows one to simplify the projection onto the simplex section of a hyperplane. In particular, the projection onto the simplex can then be implemented efficiently using the Kullback-Leibler (KL) divergence as the proximal function. In our algorithm, we provide a direct method of projection onto the hyperplane which obviates the realizability condition and provides a more expressive representation. In addition, we present a randomized policy based on the primal-dual solutions that differs from the policy proposed in [4]. Second, our proposed algorithm solves an optimization problem in the dual space, where the feature vectors are the occupation measures of a set of base policies. This allows us to compute a randomized policy directly from the solution of the underlying dual optimization problem. Lastly, the role of the Bregman divergence in our algorithm is to implicitly enforce the constraint due to the size of the policy class. In contrast, the Bregman divergence in [4] is used as a means to attain better sample complexities via adaptation to the geometry of the feasible set.
1.3. Paper Outline
The rest of this paper is organized as follows. In Section 2, we present preliminaries related to MDPs; in addition, we review the approximate linear programming formulation of MDPs based on the linear representation of large state spaces. In Section 3, we present a stochastic regularized primal-dual algorithm to compute the coefficients of the mixture policy, together with the main theorem concerning the performance of the proposed algorithm. In Section 5, we present the proof of the main theorem, while deferring technical lemmas to the appendices. Lastly, in Section 7, we conclude the paper.
2. Preliminaries
In this section, we first present the notation and technical definitions needed to establish our theoretical results. We then review relevant definitions regarding infinite horizon discounted Markov decision processes. We also review Bellman's equations as well as the dual linear programming formulation of Bellman's equations.
Notations and Definitions. We denote vectors by lower case letters (e.g. ) and random variables and matrices by upper case letters (e.g. ). The dual norm conjugate to the norm is defined by . We denote the Euclidean projection by . We use the standard asymptotic notation with the following definition: if are two functions from to , then if there exists a constant such that for all sufficiently large , and if . For a positive integer , we use the shorthand to denote the set . For a matrix , let denote the subordinate norm. We denote the largest singular value by . Further, we use the shorthand notations , , and .
A function is Lipschitz with respect to the norm over iff . A function is smooth with respect to the norm over iff . A function is strongly convex with respect to the norm over iff
for all . The effective domain of a function is the set . The subdifferential of a function at the point is defined as follows
(2.1) 
The relative interior of a convex set , abbreviated , is defined as , where denotes the affine hull of the set , and is a ball of radius centered on .
The Fenchel conjugate of the function is defined as follows
(2.2) 
Definition 2.1.
(Orlicz Norm) The Young-Orlicz modulus is a convex non-decreasing function such that and when . Accordingly, the Orlicz norm of an integrable random variable with respect to the modulus is defined as
(2.3) 
In the sequel, we consider the Orlicz modulus . Accordingly, the cases of the and norms are called the sub-Gaussian and sub-exponential norms, respectively, and have the following alternative definitions:
Definition 2.2.
(Sub-Gaussian Norm) The sub-Gaussian norm of a random variable , denoted by , is defined as
(2.4) 
For a random vector , its sub-Gaussian norm is defined as
(2.5) 
Definition 2.3.
(Sub-exponential Norm) The sub-exponential norm of a random variable , denoted by , is defined as follows
(2.6) 
For a random vector , its sub-exponential norm is defined as
(2.7) 
Definition 2.4.
(Legendre Function) A function is called a Legendre function (a.k.a. an essentially smooth function) if it satisfies the following conditions:

and and is convex.

is strictly convex.

The partial derivatives of exist and are continuous for all .

Any sequence converging to a boundary point of satisfies
Definition 2.5.
(Bregman Divergence) Suppose is a Legendre function. The Bregman divergence is defined as follows
where is the gradient vector evaluated at .
2.1. Markov Decision Processes (MDPs)
In this paper, we consider MDPs with high dimensional state and action spaces. The first definition formalizes MDPs:
Definition 2.6.
(Markov Decision Process) A Markov decision process (MDP) is a 6-tuple consisting of:

Decision epochs: The set represents the set of times at which decisions are to be made. If is finite, then the MDP is said to be a finite horizon MDP with epochs. If , the MDP is said to be an infinite horizon MDP.

States: We assume that the statespace is finite.

Actions: We assume that the action set is also finite.

Transition model: The transition model for each and
, is the probability distribution
on . The element represents the probability of transitioning to the state after performing the action in the state . We define the matrices and . 
Cost function: For each , is the cost function. Let , and .

Discount factor: The discount factor reflects the intertemporal preferences.
We consider discounted infinite horizon MDPs characterized by the tuple . For such MDPs, we compute randomized policies, which we formally define below:
Definition 2.7.
(Randomized Policy) A randomized policy is a sequence of distributions, where is a probability distribution on the action space conditioned on the state . The value represents the probability of taking the action in the state . Let denote the set of all such randomized policies.
The objective of a discounted infinite horizon MDP is to find a policy such that the infinite horizon sum of discounted costs is minimized regardless of the initial state , i.e.,
where and are the realizations of state transitions and actions, respectively, generated by the Markov decision process under a given policy , and the expectation is taken over the entire trajectory.
Define the Bellman operator for all . From the theory of dynamic programming, a vector is the optimal value function of the MDP if and only if it satisfies the following Bellman fixed-point equation [9]
(2.8) 
where is the difference-of-value vector. A stationary policy is an optimal policy of the MDP if it attains the element-wise minimization in the Bellman equation (2.8), i.e., .
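Although the algorithm developed later avoids dynamic programming, the Bellman fixed point itself can be illustrated by plain value iteration on a small synthetic MDP (all sizes, transition probabilities, and costs below are made-up placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 5, 3, 0.9  # hypothetical sizes and discount factor

# Synthetic model: P[a] is an nS x nS row-stochastic matrix, c[s, a] a cost.
P = rng.random((nA, nS, nS))
P /= P.sum(axis=2, keepdims=True)
c = rng.random((nS, nA))

def bellman(v):
    """(T v)(s) = min_a [ c(s, a) + gamma * sum_{s'} P(s' | s, a) * v(s') ]."""
    w = np.tensordot(P, v, axes=([2], [0]))  # w[a, s] = E[v(s') | s, a]
    return (c + gamma * w.T).min(axis=1)

v = np.zeros(nS)
for _ in range(2000):
    v_new = bellman(v)
    if np.max(np.abs(v_new - v)) < 1e-12:
        v = v_new
        break
    v = v_new

# v now satisfies the Bellman fixed-point equation up to numerical precision.
assert np.allclose(bellman(v), v, atol=1e-9)
```

Since the Bellman operator is a gamma-contraction, the iteration converges geometrically to the unique fixed point.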
Alternatively, the Bellman equation (2.8) can be recast as the following linear programming problem (cf. [10]),
(2.9a)  
(2.9b) 
where , and is the initial distribution over the states, and is the simplex of probability measures on the set of states .
To characterize the dual problem associated with the primal problem (2.9), let denote the stationary distribution under the policy and the initial distribution of states . In particular, let denote the transition probability matrix induced by a fixed policy , whose matrix elements are defined as . Alternatively, let be a matrix that encodes the policy , i.e., let , where all other entries of the matrix are zero. Then, .
Furthermore, let denote the stationary distribution of the Markov chain induced by the policy , i.e., , where
(2.10)
(2.11)
The measure captures the expected frequency of visits to each state when the underlying policy is , conditioned on the initial state being distributed according to . Future visits are discounted by the factor .
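For a finite chain this discounted visitation distribution has the standard closed form mu^T = (1 - gamma) nu^T (I - gamma P_pi)^{-1}. The sketch below (with a random placeholder transition matrix) computes it and checks the discounted stationarity identity:

```python
import numpy as np

rng = np.random.default_rng(2)
nS, gamma = 6, 0.85  # hypothetical state count and discount factor

P_pi = rng.random((nS, nS))
P_pi /= P_pi.sum(axis=1, keepdims=True)  # chain induced by some fixed policy
nu = np.full(nS, 1.0 / nS)               # initial state distribution

# mu^T = (1 - gamma) * nu^T (I - gamma * P_pi)^{-1}
mu = (1 - gamma) * np.linalg.solve(np.eye(nS) - gamma * P_pi.T, nu)

assert np.isclose(mu.sum(), 1.0)                                # a probability vector
assert np.allclose(mu, (1 - gamma) * nu + gamma * P_pi.T @ mu)  # discounted stationarity
```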
We define the occupation measure as the vector given by , where denotes the Hadamard (element-wise) product. Then,
(2.12) 
Thus, the dual problem associated with the primal problem in Eq. (2.9a) has the following form
(2.13a)  
(2.13b)  
(2.13c) 
where is a binary matrix such that the th column has ones in rows to , and . In the optimization problem in Eq. (2.13), the constraint (2.13c) ensures that is a distribution, and the constraint (2.13b) guarantees that is stationary.
Let denote the optimal solution of the linear programming problem in Eq. (2.13). An optimal randomized policy can then be characterized as follows
(2.14) 
Furthermore, the optimal objective value of the linear programming problem in Eq. (2.13) corresponds to the optimal discounted-cost value under the optimal policy and the initial distribution of the states.
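The normalization step in Eq. (2.14) can be sketched as follows; the clipping guard and the uniform fallback for zero-mass states are defensive choices of this sketch (anticipating the approximate solutions discussed later), not part of Eq. (2.14) itself:

```python
import numpy as np

def policy_from_occupation(mu_sa):
    """Recover pi(a | s) = mu(s, a) / sum_a' mu(s, a') from an occupation measure.

    mu_sa has shape (nS, nA). States with no positive mass (possible for
    approximate solutions) fall back to the uniform distribution."""
    nS, nA = mu_sa.shape
    mu_pos = np.clip(mu_sa, 0.0, None)  # guard against small negative entries
    row = mu_pos.sum(axis=1, keepdims=True)
    return np.where(row > 0.0, mu_pos / np.maximum(row, 1e-300), 1.0 / nA)

rng = np.random.default_rng(3)
mu = rng.random((4, 3))
mu /= mu.sum()
pi = policy_from_occupation(mu)
assert np.allclose(pi.sum(axis=1), 1.0)  # each row is a distribution over actions
```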
2.2. Approximate Linear Programming (ALP) for the Linear Representation of Large State Spaces
It is well-known that the sample complexity as well as the computational complexity of solving the dual linear programming problem in Eq. (2.13) scales (at least) linearly with the size of the state space , rendering exact representation intractable for problems of practical scale; see, e.g., [11]. Consequently, to deal with MDPs with a large state space, it is practical to map the state space to a low-dimensional space using the linear span of a small number of features.
Specifically, let denote a matrix whose column vectors represent feature vectors. In the feature space, the distribution is spanned by the linear combination of features , namely . Here, the radius and the general norm determine the size and geometry of the policy class , respectively. The dual optimization problem in Eq. (2.13) can thus be reformulated in terms of the features
(2.15a)  
(2.15b)  
(2.15c) 
Designing the feature matrix for Markov decision processes is a challenging problem. In particular, we wish to design a feature matrix such that the linear expansion is expressive, but does not lead to overfitting. In this paper, we focus on the set of features associated with a set of known base policies. Formally, we consider the set of base policies and define the subset as the convex hull of base policies,
(2.16) 
Corresponding to each base policy , a stationary distribution is defined according to Eq. (2.11). The dual space of in Eq. (2.16) is then defined as the linear combinations of occupation measures
(2.17) 
For all , the state-action distribution is defined as for all . With this choice of the feature vectors, we have and , as the columns of the matrix are stationary probability distributions. Therefore, the dual optimization problem (2.15) takes the following form
(2.18a)  
(2.18b) 
Let and . The feasible set of the dual optimization in Eq. (2.18) is the intersection of the following sets
(2.19) 
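The construction above can be sketched end to end: build the feature matrix whose columns are the occupation measures of K base policies, and note that any convex combination of those columns is again a valid state-action distribution (the model, the base policies, and K below are synthetic placeholders, not the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(4)
nS, nA, gamma, K = 5, 3, 0.9, 4  # hypothetical sizes; K base policies

P = rng.random((nA, nS, nS))
P /= P.sum(axis=2, keepdims=True)
nu = np.full(nS, 1.0 / nS)

def random_policy():
    p = rng.random((nS, nA))
    return p / p.sum(axis=1, keepdims=True)

def occupation_measure(pi):
    """State-action occupation measure mu(s, a) = mu_state(s) * pi(a | s)."""
    P_pi = np.einsum("sa,ast->st", pi, P)  # chain induced by pi
    mu_state = (1 - gamma) * np.linalg.solve(np.eye(nS) - gamma * P_pi.T, nu)
    return (mu_state[:, None] * pi).ravel()

# Columns of the feature matrix are the base policies' occupation measures.
Phi = np.column_stack([occupation_measure(random_policy()) for _ in range(K)])
assert np.allclose(Phi.sum(axis=0), 1.0)

# A convex combination of the columns is again a valid state-action distribution.
theta = rng.random(K)
theta /= theta.sum()
mu_mix = Phi @ theta
assert np.isclose(mu_mix.sum(), 1.0) and (mu_mix >= 0).all()
```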
Let denote an approximate solution of the optimization problem in Eq. (2.15) generated by an (arbitrary) optimization algorithm after rounds. When is a feasible solution of Eq. (2.15), then defines a valid occupation measure. However, in this paper we permit the feature vectors to violate the non-negativity constraint defined by in the following sense: let denote a function that quantifies the amount by which the vector violates the non-negativity constraints; in particular, iff . After rounds, the algorithm we propose in the next section (cf. Algorithm 1) outputs an estimate that may violate the constraints defined by , so that . Nevertheless, we require that the estimates generated by the algorithm satisfy the constraints asymptotically as . By allowing such constraint violation, we devise an efficient algorithm whose only projection is onto the hyperplane section of the feasible set , while the constraint due to the size of the policy class is enforced implicitly using a Bregman divergence whose domain is subsumed by the set . As we discuss in the next section, the Bregman projection onto the hyperplane admits an efficient implementation.
Notice, however, that due to the (potential) constraint violation of the solutions , the vector may no longer be a valid occupation measure. Nonetheless, it defines an admissible randomized policy via the following relation
(2.20) 
In the case that for all state-action pairs , we define to be the uniform distribution. Let denote the occupation measure induced by the policy defined in Eq. (2.20), i.e., .
2.3. Expanded Efficiency
Equipped with the dual optimization problem in Eq. (2.15), we now describe the notion of efficiency of an algorithm to solve (2.15). The following definition is adapted from [5]:
Definition 2.8.
(Efficient Large-Scale Dual ALP [5]) For an MDP specified by the cost matrix , probability transition matrix , and feature matrix , the efficient large-scale dual ALP problem is to produce an estimate such that
(2.21) 
in time polynomial in and under the model of computation in (A.3).
As described by Definition 2.8, the computational complexity depends only on the number of features , and not on the size of the state space .
The preceding definition is a special case of the following generalized definition that allows for the constraint violation:
Definition 2.9.
Clearly, a guarantee for Eq. (2.22) implies a guarantee for Eq. (2.21). In addition, the expanded problem has a larger feasible set. Therefore, even when a feature matrix admits no feasible points in the original feasible set and the dual problem is trivial, solving the expanded problem is still meaningful.
2.4. Assumptions
To establish our theoretical results, we need the following fast mixing assumption on underlying MDPs. This assumption implies that any policy quickly converges to its stationary distribution:

(Fast Mixing Markov Chain) For , the Markov decision process specified by the tuple is mixing in the sense that
(2.23) for all , where for two given Borel probability measures , is the total variation norm.
The fast mixing condition (2.23) is a standard assumption for Markov decision processes; see, e.g., [5], [12]. The fast mixing condition (2.23) implies that for any policy , there exists a constant such that for all distributions over the state-action space,
(2.24) 
The following assumption is also standard in the literature; see, e.g., [12]:

(Uniformly Bounded Ergodicity) The Markov decision process is ergodic under any stationary policy , and there exists such that
(2.25) where we recall from Section 3.1 that is the stationary distribution over the state space of the MDP under the policy , and with the initial distribution on the states.
Under the condition (2.25) of (A.2), it is well-known that the underlying MDP is unichain [13], i.e., a Markov chain that contains a single recurrent class and, possibly, some transient states. Moreover, the factor determines the amount of variation of the stationary distributions associated with different policies, and thus can be thought of as a measure of the complexity of an MDP. Notice that in the case that some policies induce transient states (so the stationary distribution is not bounded away from zero), their mixture with an ergodic policy guarantees ergodicity.
Lastly, we consider the reinforcement learning setting in which the cost function is unknown, but the agent can interact with its environment and receive feedback from a sampling oracle:

(ModelFree Reinforcement Learning) We consider the following model of computation:


The state space , the action space , the upper and lower cost bounds and , and the discount factor are known.

The cost function is unknown.

There is a sampling oracle that takes an input , generates a new state with probability , and returns the cost .
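This model of computation can be sketched as a tiny simulator (the class name and the synthetic model below are illustrative, not part of the paper):

```python
import numpy as np

class SamplingOracle:
    """Model-free oracle: given (s, a), draws s' ~ P(. | s, a) and reveals c(s, a)."""

    def __init__(self, P, c, seed=0):
        self.P, self.c = P, c  # P has shape (nA, nS, nS); c has shape (nS, nA)
        self.rng = np.random.default_rng(seed)

    def query(self, s, a):
        s_next = self.rng.choice(self.P.shape[2], p=self.P[a, s])
        return s_next, self.c[s, a]

rng = np.random.default_rng(5)
nS, nA = 4, 2
P = rng.random((nA, nS, nS))
P /= P.sum(axis=2, keepdims=True)
c = rng.random((nS, nA))

oracle = SamplingOracle(P, c)
s_next, cost = oracle.query(0, 1)
assert 0 <= s_next < nS and cost == c[0, 1]
```

The learner never reads `P` or `c` directly; it only observes the sampled transitions and the returned costs.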

3. Stochastic PrimalDual Proximal Algorithm
In this section, we first describe a stochastic primaldual proximal method to compute the coefficients of the mixture model in the dual optimization problem in Eq. (2.18). We then analyze the efficiency as well as the sample complexity of the proposed algorithm.
3.1. Stochastic PrimalDual Algorithm
To apply the stochastic primal-dual method, we recast the optimization problem (2.18) as the following min-max optimization problem
(3.1) 
The following definition characterizes the notion of a saddle point of a min-max problem:
Definition 3.1.
(Saddle Point of a Min-Max Optimization Problem) Let denote a saddle point of the min-max problem in Eq. (3.1), i.e., a point that satisfies the inequalities
(3.2) 
for all and .
The primal optimal point of the min-max problem (3.1) is an optimal solution of the dual problem (2.18).
Applying the stochastic primal-dual method to the min-max problem (3.1) is challenging, as the Lagrange multipliers may take arbitrarily large values during the iterations of the primal-dual algorithm, resulting in large subgradients of the Lagrangian function and instability in the performance. The following lemma, due to Chen and Wang [14, Lemma 1], provides an upper bound on the norm of the optimal Lagrange multipliers:
Lemma 3.2.
Now, define the norm ball , where . Here, is a free parameter of the algorithm that will be determined later.
Lemma 3.2 suggests that the vector of optimal Lagrange multipliers belongs to the compact set . Therefore, instead of the min-max problem (3.1), we can consider the following equivalent problem
(3.3) 
where the Lagrange multipliers are maximized over the compact set instead of the entire non-negative orthant . As we alluded to earlier, the sets of saddle points of the min-max problems in Eqs. (3.3) and (3.1) coincide.
Algorithm 1 describes a procedure to solve the min-max problem in Eq. (3.3). At each iteration of the stochastic primal-dual method, we compute the following deterministic gradient
(3.4) 
Furthermore, we draw a random index and subsequently sample the state-action pair . Then, we compute the stochastic gradient as follows
(3.5a) 
where is a row vector corresponding to the row of the feature matrix . We notice that is an unbiased estimator of the gradient of the Lagrangian function . Formally,
(3.6)
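The unbiasedness claim in Eq. (3.6) rests on standard importance-weighted row sampling: drawing a row index i with probability p_i and rescaling by 1/p_i gives an unbiased estimate of the full matrix-vector product. A generic numerical check of this mechanism (random placeholder data, not the paper's exact gradient expressions):

```python
import numpy as np

rng = np.random.default_rng(6)
m, K = 50, 4
Phi = rng.random((m, K))  # rows indexed by state-action pairs
lam = rng.random(m)       # stand-in Lagrange multipliers

# Exact quantity: g = Phi^T lam. One-sample estimator: draw row i with
# probability p_i and return Phi_i * lam_i / p_i. It is unbiased because
# E[Phi_i * lam_i / p_i] = sum_i p_i * (Phi_i * lam_i / p_i) = Phi^T lam.
p = np.full(m, 1.0 / m)
n = 1_000_000
idx = rng.choice(m, size=n, p=p)
est = (Phi[idx] * (lam[idx] / p[idx])[:, None]).mean(axis=0)

assert np.allclose(est, Phi.T @ lam, atol=0.1)
```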
Algorithm 1 describes the primal-dual steps to update . To implicitly control the size of the policy class, as characterized by the constraint , in Algorithm 1 we consider a Bregman divergence whose modulus has the domain . For example, when the policy class is defined by a Euclidean ball , the following convex function can be employed (see [15])
(3.7) 
The associated convex conjugate of the modulus (3.7) is
(3.8) 
Moreover, the modulus (3.7) yields the Hellinger-like divergence
(3.9) 
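Assuming the modulus takes the form omega(x) = -sqrt(R^2 - ||x||_2^2) (the closed form is an assumption of this sketch, chosen to be consistent with the Euclidean-ball domain described above), its gradient blows up at the boundary ||x||_2 = R, which is what enforces the ball constraint implicitly:

```python
import numpy as np

R = 1.0  # hypothetical radius of the Euclidean-ball policy class

def omega(x):
    """Assumed Hellinger-like Legendre modulus with domain {x : ||x||_2 < R}."""
    return -np.sqrt(R**2 - x @ x)

def grad_omega(x):
    return x / np.sqrt(R**2 - x @ x)

def bregman(x, y):
    """Bregman divergence D(x, y) = omega(x) - omega(y) - <grad omega(y), x - y>."""
    return omega(x) - omega(y) - grad_omega(y) @ (x - y)

rng = np.random.default_rng(7)
x = rng.normal(size=3)
x *= 0.5 * R / np.linalg.norm(x)
y = rng.normal(size=3)
y *= 0.3 * R / np.linalg.norm(y)

assert abs(bregman(x, x)) < 1e-12
assert bregman(x, y) >= 0.0  # non-negativity from convexity of omega
# The gradient norm grows near the boundary, so iterates that track
# grad_omega are kept strictly inside the ball:
z = x * (0.999 * R / np.linalg.norm(x))
assert np.linalg.norm(grad_omega(z)) > np.linalg.norm(grad_omega(x))
```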
Alternatively, when the policy class is defined by a hypercube , a function similar to the symmetric logistic function yields the divergence with the desired domain. In particular, let
(3.10) 
The convex conjugate of is the following function
(3.11) 
The Bregman divergence associated with the modulus in Eq. (3.10) is
(3.12) 
The steps (3.25a)-(3.25c) of Algorithm 1 are reminiscent of the so-called Mirror Descent algorithm for online convex optimization [16],
(3.13) 
However, the role of the Bregman divergence in Algorithm 1 is different from that in the Mirror Descent algorithm. In the Mirror Descent algorithm, the Bregman divergence is typically used to adapt to the geometry of the feasible set and achieve a better dependency on the dimension of the embedding space of the feasible set. In contrast, the Bregman divergence in Algorithm 1 is employed to implicitly enforce the constraint due to the size of the policy class and to eliminate the projection step.
To see this equivalence, first notice that the update rule in Eq. (3.13) can be rewritten as below
(3.14a)  
(3.14b) 
where we also recall . Now, consider the Bregman projection (3.14b) onto the hyperplane . As shown in [15], the Bregman projection in Eq. (3.14b) can alternatively be computed using Eqs. (3.25b)-(3.25c). To demonstrate this, first invoke the necessary and sufficient KKT conditions for the optimality of , which require the existence of an optimal Lagrange multiplier that satisfies the following equation
(3.15) 
From Eq. (3.15), we establish that
(3.16) 
Alternatively, since the gradient of a Legendre function is a bijection from to and its inverse is the gradient of the conjugate (cf. [17, Thm. 26.5]), we have
(3.17) 
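This projection can be sketched concretely under two assumptions of the sketch (not fixed by the text above): the modulus is omega(x) = -sqrt(R^2 - ||x||_2^2), whose inverse gradient map is grad omega*(z) = R z / sqrt(1 + ||z||^2), and the hyperplane is {theta : 1^T theta = 1}. The scalar multiplier is then found by bisection, since nu -> 1^T grad omega*(z0 - nu * 1) is strictly decreasing:

```python
import numpy as np

R, K = 2.0, 4  # hypothetical radius and number of base policies (needs R * sqrt(K) > 1)

def grad_omega_star(z):
    """Inverse gradient map for omega(x) = -sqrt(R^2 - ||x||_2^2)."""
    return R * z / np.sqrt(1.0 + z @ z)

def bregman_project_hyperplane(z0, lo=-50.0, hi=50.0, iters=100):
    """Find nu so that theta = grad_omega_star(z0 - nu * 1) satisfies 1^T theta = 1.

    By Cauchy-Schwarz, nu -> 1^T grad_omega_star(z0 - nu * 1) is strictly
    decreasing, so plain bisection converges."""
    ones = np.ones_like(z0)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if grad_omega_star(z0 - mid * ones).sum() > 1.0:
            lo = mid
        else:
            hi = mid
    return grad_omega_star(z0 - 0.5 * (lo + hi) * ones)

rng = np.random.default_rng(8)
z0 = rng.normal(size=K)  # stand-in for grad_omega(theta_t) minus a gradient step
theta = bregman_project_hyperplane(z0)
assert abs(theta.sum() - 1.0) < 1e-8
assert np.linalg.norm(theta) < R  # the projected point stays strictly inside the ball
```

The whole projection thus costs one scalar root-finding problem per iteration, independent of the size of the state space.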