DeepAI

# Accelerating Primal-dual Methods for Regularized Markov Decision Processes

Entropy regularized Markov decision processes have been widely used in reinforcement learning. This paper is concerned with the primal-dual formulation of the entropy regularized problems. Standard first-order methods suffer from slow convergence due to the lack of strict convexity and concavity. To address this issue, we first introduce a new quadratically convexified primal-dual formulation. The natural gradient ascent descent of the new formulation enjoys global convergence guarantee and exponential convergence rate. We also propose a new interpolating metric that further accelerates the convergence significantly. Numerical results are provided to demonstrate the performance of the proposed methods under multiple settings.

• 6 publications
• 20 publications
• 64 publications
• 18 publications
10/17/2021

### A Dual Approach to Constrained Markov Decision Processes with Entropy Regularization

We study entropy-regularized constrained Markov decision processes (CMDP...
10/16/2012

### Sparse Q-learning with Mirror Descent

This paper explores a new framework for reinforcement learning based on ...
05/22/2017

### A unified view of entropy-regularized Markov decision processes

We propose a general framework for entropy-regularized average-reward re...
12/22/2021

### Entropy-Regularized Partially Observed Markov Decision Processes

We investigate partially observed Markov decision processes (POMDPs) wit...
10/20/2021

### Faster Algorithm and Sharper Analysis for Constrained Markov Decision Process

The problem of constrained Markov decision process (CMDP) is investigate...
03/27/2020

### Distributed and time-varying primal-dual dynamics via contraction analysis

In this paper, we provide a holistic analysis of the primal-dual dynamic...
12/28/2021

### Efficient Performance Bounds for Primal-Dual Reinforcement Learning from Demonstrations

We consider large-scale Markov decision processes with an unknown cost f...

## 1 Introduction

### 1.1 Setup

Consider an infinite-horizon Markov decision process (MDP) [bellman2015applied, sutton2018reinforcement, puterman2014markov] , where

is a set of states of the Markov chain and

is a set of actions.

is a transition probability tensor with

being the probability of transitioning from state to state when taking action , is a reward matrix with being the reward obtained when taking action at state , and is the discount factor. In this paper, we assume that the state space and the action space are finite.

A policy is a randomized strategy over the actions at each state, i.e., for each state , is the probability of choosing action at . For a given policy, the value function

is a vector defined as

 (vπ)s:=E∞∑k=0(γkrskak∣s0=s), (1)

where the expectation is taken over all possible trajectories starting from following the policy . The value function satisfies the well-known Bellman equation [bellman2015applied]

 (I−γPπ)vπ=rπ, (2)

where , , and is the identity operator. In a Markov decision problem, the goal is to find the optimal policy such that

 vπ∗(s)≥vπ(s),∀s∈S,

for any other policy . The corresponding optimal value function will also be referred as in this paper. The existence of and is guaranteed by the theory of MDP [puterman2014markov].

In recent studies, entropy regularization has been widely used in MDP problems to encourage exploration and enhance robustness [neu2017unified, dai2018sbeed, geist2019theory, ahmed2019understanding, agarwal2020optimality, mei2020global, cen2020fast, zhan2021policy]. With the entropy regularization, the value function is defined by

 (vπ)s:=E∞∑k=0(γk(rskak−τlogπskak)∣s0=s), (3)

where is the regularization coefficient. satisfies the regularized Bellman equation

 (I−γPπ)vπ=rπ−τhπ, (4)

where is a vector in with each entry given by the negative Shannon entropy of

 (hπ)s=∑a∈Aπsalogπsa.

Here we overload the notation for the regularized value function and for the rest of the paper shall always denote the regularized value function (3) unless otherwise specified. For the entropy regularized MDP (see [geist2019theory]), there exists a unique optimal policy , such that

 v∗(s):=vπ∗(s)≥vπ(s),∀s∈S, (5)

for any other policy .

Without loss of generality, the reward is assumed to be nonnegative throughout this paper. This can be guaranteed by adding to the rewards a sufficiently large constant . Note that such a uniform shift keeps the optimal policy unchanged and shifts by a constant .

### 1.2 Primal-dual formulation

Entropy regularized MDPs enjoy linear programming formulations, in the primal, dual, and primal-dual forms. In this paper, we are concerned with the primal-dual formulation (see for example

[neu2017unified, ying2020note]):

 minv∈R|S|maxu∈R|S|×|A|∑s∈Sesvs+∑s∈S,a∈Ausa(rsa−((I−γPa)v)s)−τ∑s∈S,a∈Ausalog(usa/~us),

where . The policy is related to via the relationship

 πsa=usa/~us.

The main advantage of working with the primal-dual formulation is that the transition matrix

appears linearly in the objective function of the primal-dual problem. This linearity brings an important benefit when a stochastic gradient method is used to solve the primal-dual formulation: an unbiased estimator of the transition matrix

guarantees an unbiased estimator for the gradient-based update rule. This avoids the famous double-sampling problem

[sutton2018reinforcement] that affects any formulation that performs a nonlinear operation to the transition matrix . Examples of these affected formulations include the primal formulation, where a nonlinear max or exponentiation operator is applied to and the dual formulation, where the inverse of is needed. From this perspective, the primal-dual formulation is convenient in the model-free setting, where the transition probability tensor can only be estimated from samples and is thus inherently noisy.

In what follows, we shall simplify the notation by denoting and . With this simplification, the primal-dual problem can be rewritten more compactly as

 minvmaxu∑sesvs+∑sausa(rsa−(Kav)s)−τ∑sausalog(usa/~us). (6)

Though theoretically appealing, the primal-dual formulation (6) often poses computational challenges because it is a minimax optimization. Newton-type methods are often impractical to apply because either is only accessible via samples or its size is too large for practical inversion. A close look at the objective function of (6) suggests that it is linear with respect to both the value function and the dual variable in the radial direction. This lack of strict convexity/concavity makes the first-order methods hard to converge.

### 1.3 Contributions

To overcome this difficulty, this paper proposes a quadratically convexified reformulation of (6) that shares the same solution with (6) and an interpolating natural gradient ascent descent method that significantly speeds up the convergence. More specifically, the main contributions of this paper are listed as follows:

• We propose a new quadratically convexified primal-dual formulation in which the linear weighted sum of (6) is replaced with a quadratic term . The surprising feature is that the solution

is unchanged and is independent of the hyperparameter

. We prove that the vanilla natural gradient ascent descent (NGAD) of this quadratically convexified problem enjoys a Lyapunov function [lyapunov1992general] and converges linearly. To the best of our knowledge, this is the first quadratically convexified primal-dual formulation of Markov decision problems.

• We propose an interpolating natural gradient ascent descent (INGAD) by introducing a new interpolating metric for the variable. The corresponding Lyapunov function is constructed and the convergence of the new dynamics is proved. The acceleration is verified by numerical tests under multiple settings.

### 1.4 Related work

Regarding the primal-dual formulation, the first primal-dual learning algorithm is given in [wang2016online]. A follow-up work [wang2020randomized] leverages the binary-tree data structure and adaptive importance sampling techniques to reduce the complexity. The convergence result for these two papers is however only for the average of all the policies rather than the policy obtained in the last iteration. In these papers, no regularization is used in the formulation and no preconditioner is used in the iterative update scheme. As a comparison, the current paper proves a last-iteration convergence result with the help of Lyapunov method and entropy regularization, and derives an interpolating metric that accelerates the algorithm. Various studies have been carried out following the primal-dual formulation in [wang2016online]. For example, a modified form with the -function is proposed in [lee2018stochastic] and the corresponding primal-dual type algorithm is derived. An extension to the infinite-horizon average-reward setting is provided in [wang2017primal], but only average-case convergence result is given. A later work [cho2017deep] further extended this method to the function approximation setting. A comprehensive review of the primal-dual methods in the average reward setting is given in a recent thesis [gong2021primal] and a generalization to the general utility maximization formulation is provided. The primal-dual method has also been used to find risk-sensitive policies, for example, in [zhang2021cautious], where a risk function is integrated into the primal-dual objective through the dual variable. In the optimization literature, the primal-dual formulation is often called the saddle point problem: for example [serrano2020faster] considers a linear relaxation version of the saddle-point problem in [wang2017primal] to address large-scale problems. However, it is worth noting that no (entropy) regularization is used in the papers mentioned above, which is able to make the landscape of the optimization problem smoother and is thus a crucial element of recent fast convergence results [cen2020fast, li2021quasinewton]. Without regularization, gradient-type methods can take exponential time to converge [li2021softmax].

Besides the primal-dual formulations, the discussion below briefly touches on the primal and the dual formulations. For the entropy regularized Markov decision process, the primal formulation [ying2020note] takes the form

 vs=τlog(∑aexp(rsa+γ∑s′Pass′vs′τ,)), (7)

which leads to a value iteration algorithm. Let be the fixed-point map such that , then by calculating the derivative matrix we have

Hence is a contraction map and converges to a fixed-point, which is the solution to (7) at a linear rate , where is the number of iterations. After obtaining the optimal value function , the corresponding policy is given by [ying2020note]:

 πsa=exp(τ−1(rsa+γ∑s′∈SPass′vs′))∑aexp(τ−1(rsa+γ∑s′∈SPass′vs′))=exp(τ−1(rsa−∑s′(I−γPa)ss′vs′)). (8)

As a result of the aforementioned double-sampling problem, the value-iteration algorithm based on (7) is mainly used in the model-based setting, but due to the nice properties of , it appears as an important ingredient in various other algorithms. For example, in [asadi2017alternative] and [rawlik2013stochastic], the authors use the function as an alternative softmax operator and form a -learning type algorithm, and in [nachum2017trust], the function

appears as a result of inner optimization of an entropy regularized trust region type formulation and is used to form the loss function. In

[dai2018sbeed], the mean squared regularized Bellman error is employed to establish the optimization problem.

An alternative way to solve a regularized Markov decision problem in the model-based setting is the dual formulation [ying2020note], in which one seeks a policy that solves the following optimization problem:

 maxπeTvπ:=eT(I−γPπ)−1(rπ−τhπ), (9)

where is a weight vector. By the existence and uniqueness of the optimal value function and optimal policy and the optimality (5), it is clear that any choice of leads to the optimal policy and the optimal value function. A variety of policy gradient algorithms can be used to solve the dual problem. Examples include [williams1992simple, sutton1999policy, kakade2001natural, schulman2015high, schulman2015trust, schulman2017proximal], to mention only a few. Recently, [li2021quasinewton] proposes a quasi-Newton policy gradient algorithm, where an approximate Hessian of the objective function in (9) is used as a preconditioner for the gradient, resulting in a quadratic convergence rate by better fitting the problem geometry.

The word primal-dual also appears in other types of formulations where the dual variables do not represent the policy. For example, in [ding2020natural], the authors apply the natural policy gradient method to constrained MDPs (CMDPs), where the dual variables are the multipliers of the constraints. Similarly, in [chow2018lyapunov], the dual variables come from the constraints in CMDPs. In this paper, the Lyapunov method is used to give theoretical analysis for the natural gradient flow of the method we propose. The idea of Lyapunov methods have also been applied to discrete time control problems [kalman1960control, kalman1959control] and to discrete Markovian systems [meyn2012markov]. Recently it has also been used to address the safety problem, where safety usually appears as additional constraints in the model [perkins2002lyapunov, chow2018lyapunov, berkenkamp2017safe], and the Lyapunov function is usually defined on the state space and is used explicitly in the policy iteration or in finding the controller.

### 1.5 Notations

For a vector , denotes a diagonal matrix with size and the -th diagonal element being , . For , we denote the -th element as . While denotes the vector in with the -th element being , denotes the vector in with the -th element being . The states of the MDP are typically referred with , and while the actions are referred by and . The vector with length and all elements equal to is denoted by , and the subscript is often omitted when there is no ambiguity. The -by-identity matrix is denoted by , again with the subscript often omitted when there is no ambiguity. For a matrix , denotes its Hermitian transpose. If a scalar function is applied to a vector, then the result is defined element-wise unless otherwise specified, e.g., for , with for .

### 1.6 Contents

The rest of the paper is organized as follows. Section 2 derives the quadratically convexified primal-dual formulation, proves its equivalence with (6), and shows that the vanilla NGAD of the new formulation converges linearly using a Lyapunov function method. Section 3 introduces an interpolating metric by leveraging the flexibility of underlying metric described by the block diagonal part of the Hessian. The convergence rate of the INGAD based on this new interpolating metric is significantly improved. We also provide a Lyapunov style proof for global convergence and an analysis of the exponential convergence rate in the last-iterate sense. Finally, Section 4 demonstrates the numerical performance of this proposed natural gradient methods.

## 2 Quadratically convexified primal-dual formulation

### 2.1 Formulation

In what follows, we use to denote the objective of the standard entropy regularized primal-dual formulation

 minvmaxuE0(v,u):=∑sesvs+∑sausa(rsa−(Kav)s)−τ∑sausalogusa~us. (10)

Since it is linear in and linear along the radial direction of , first order optimization methods typically experience slow convergence. To address the issue in the variable, we propose a quadratically convexified primal-dual formulation:

 minvmaxuE(v,u):=α2∑sv2s+∑sausa(rsa−(Kav)s)−τ∑sausalogusa~us. (11)

Though these two formulations look quite different, surprisingly when they are equivalent in the following sense.

• They share the same optimal value function .

• The optimal dual variable differs only by an -dependent scaling factor. This implies that the optimal policy are the same.

One geometric way to see this equivalence is to go through the associated primal formulations

 minveTv, s.t. ∀s,vs≥τlog(∑a∈Aexp(rsa+γ∑s′Pass′vs′τ)), (12)

and

 minvα2∥v∥2, s.t. ∀s,vs≥τlog(∑a∈Aexp(rsa+γ∑s′Pass′vs′τ)). (13)

Fig. 1 illustrates the primal formulations of a randomly generated MDP with , where the yellow region represents the feasible set and the red dot represents the optimal value . Due to the key assumption , the feasible set lies in the first quadrant. From the contour plots of the objective function and shown by the dotted curves, it is clear that both of them are minimized at when constrained to the feasible set.

The following theorem states this equivalence precisely and its proof is given in Section 6.

###### Theorem 1

For an infinite-horizon discounted MDP with finite state space , finite action space and nonnegative reward , we have the following properties:

1. There is a unique solution to the primal-dual problem:

 minvmaxuE0(v,u)=∑sesvs+∑sausa(rsa−∑s′Kass′vs′)−τ∑sausalogusa~us,

where is the optimal value function defined by (5) and gives the optimal policy .

2. There is a unique solution to the quadratically convexified problem:

 minvmaxuE(v,u)=α2∑sv2s+∑sausa(rsa−∑s′Kass′vs′)−τ∑sausalogusa~us,

where is the optimal value function, and coincides with the optimal policy .

### 2.2 Natural gradient ascent descent

As mentioned earlier, the gradient-based methods for the primal-dual formulation (10) suffer from slow convergence, partly due to the linearity of in . Since the quadratically convexified scheme (11) gives the same value function and policy as the original primal-dual problem (10), we work instead with (11) and propose an NGAD algorithm.

The first-order derivatives of the new objective function are

 ∂E∂vs′=αvs′−∑saKass′usa,s′∈S,∂E∂usa=(rsa−∑s′Kass′vs′)−τlogusa~us,(s,a)∈S×A. (14)

The diagonal blocks of the second-order derivatives and are

 ∂2E∂vs∂vs′=αδss′,(s,s′)∈S×S∂2E∂usa∂us′a′=−τδss′(δaa′usa−1~us),(s,s′,a,a′)∈S2×A2. (15)

Of the two diagonal blocks above, is easy to invert since it is diagonal with positive diagonal entries, whereas is the sum of a diagonal part and a low-rank part. In the natural gradient dynamics below, we only keep the first part of , namely (or more compactly in the matrix form). The resulting NGAD flow is:

 dvs′dt =−1α(αvs′−∑saKass′usa),s′∈S, dusadt =−1τusa(τlogusa~us−(rsa−∑s′Kass′vs′)),(s,a)∈S×A,

or equivalently,

To analyze its convergence, we start by identifying a Lyapunov function of this dynamics. By  Theorem 1 there is a unique solution to problem (11). Based on the solution , define

 L(v,u)=α2∑s∈S|vs−v∗s|2+τ∑s∈S,a∈A(u∗salogu∗sausa+usa−u∗sa). (17)

The following lemma summarizes some key properties of .

###### Lemma 1

is strictly convex, and the unique minimum is , which satisfies . In addition, any sublevel set of is bounded.

The next lemma states that is a Lyapunov function of (16).

###### Lemma 2

is a Lyapunov function for the dynamics (16), i.e., when and are defined in (16), and the only trajectory of the dynamics (16) satisfying is .

The proofs of these two lemmas are given in Section 6.

###### Theorem 2

The dynamics of (16) converges globally to .

###### Proof

By Lemma 1, Lemma 2 and the Barbashin-Krasovskii-LaSalle theorem [haddad2011nonlinear], the dynamics of (16) is globally asymptotically stable, which means the NGAD dynamics converges globally to .

To show the exponential convergence of (16), we follow Lyapunov’s indirect method, i.e., analyzing the linearization of (16) at

and demonstrating that the real part of the eigenvalues of the corresponding matrix is negative. This result is the content of

Theorem 3, with the proof given in Section 6.

###### Theorem 3

The dynamics of (16) converges at rate to for some .

Below we discuss the implementation of (16). By introducing , (16) can be rewritten as

With a learning rate , this leads to the update rule

 vs′←(1−η)vs′+ηα∑saKass′exp(θsa),s′∈S,θsa←(1−η)θsa+ηlog(∑aexp(θsa))+ητ(rsa−∑s′Kass′vs′),(s,a)∈S×A. (19)

The details of the algorithm are summarized in Algorithm 1.

## 3 Interpolating natural gradient method

In Section 2.2, NGAD is introduced using the diagonal part of . A natural question is whether the whole matrix can be used. Under the matrix notation, in (15) takes the form

 (20)

Since the Hessian matrix describes the local geometry of the problem, the standard NGAD in Section Section 2.2 can be viewed as approximating the Hessian diagonally

 ∂2E∂u2≈⎡⎢ ⎢ ⎢⎣diag((u1⋅)−1)⋱diag((u|S|⋅)−1)⎤⎥ ⎥ ⎥⎦

and using its inverse

 ⎡⎢ ⎢ ⎢⎣diag(u1⋅)⋱diag(u|S|⋅)⎤⎥ ⎥ ⎥⎦≡⎡⎢ ⎢ ⎢⎣~u1(diag(π1⋅))⋱~u|S|(diag(π|S|⋅))⎤⎥ ⎥ ⎥⎦

to precondition the gradient. However, is in fact singular with and its pseudoinverse reads

 ⎡⎢ ⎢ ⎢⎣~u1(diag(π1⋅)−π1⋅πT1⋅)⋱~u|S|(diag(π|S|⋅)−π|S|⋅πT|S|⋅).⎤⎥ ⎥ ⎥⎦

If we had constructed the natural gradient method with this pseudoinverse, the direction in the gradient would not have been updated in the dynamics.

The key idea is that one can interpolate between these two extreme cases, i.e., we propose to use

 ⎡⎢ ⎢ ⎢⎣~u1(diag(π1⋅)−cπ1⋅πT1⋅)⋱~u|S|(diag(π|S|⋅)−cπ|S|⋅πT|S|⋅)⎤⎥ ⎥ ⎥⎦ (21)

for to precondition the gradient.

Under this interpolating metric (21), the new interpolating NGAD (INGAD) is given by

 dvs′dt=−(vs′−1α∑saKass′usa),s′∈S,dus⋅dt=−~us(diag(πs⋅)−cπs⋅πTs⋅)(logus⋅~us−1τ(rs⋅−∑s′K⋅ss′vs′)),s∈S, (22)

where . When , this dynamics reduces to (16).

A Lyapunov function of this dynamics can also be identified. Using the unique solution to (11), we define

 Lc(v,u)=α2∑s|vs−v∗s|2+τ(∑sa(u∗salogu∗sausa+usa−u∗sa)+c1−c∑s(~u∗slog~u∗s~us+~us−~u∗s)), (23)

where the subscript denotes the hyperparameter in the function. Some key properties of are summarized in the following lemma.

###### Lemma 3

is convex and the unique minimum is . The sublevel sets of are bounded.

The next lemma states that is a Lyapunov function for (22).

###### Lemma 4

is a Lyapunov function for the dynamics (22), i.e., when and are defined by (22), and the only trajectory of the dynamics (22) satisfying is .

The proofs of these two lemmas can be found again in Section 6.

###### Theorem 4

The dynamics of (22) converges globally to .

###### Proof

Similar with Theorem 2, by Lemma 3, Lemma 4 and the Barbashin-Krasovskii-LaSalle theorem [haddad2011nonlinear], the dynamics of (22) is globally asymptotically stable, which means the INGAD dynamics converges globally to .

The local exponential convergence of (22) can also be shown with Lyapunov’s indirect method. This result is stated in Theorem 5.

###### Theorem 5

The dynamics of (22) converges at rate to for some .

Finally we discuss the implementation of (16). By letting , (22) can be written as

 dvs′dt=−(vs′−1α∑saKass′exp(θsa)),s′∈S,dθs⋅dt=−(I−c1exp(θs⋅)T1Texp(θs⋅))(θs⋅−log∑aexp(θsa)1−1τ(rs⋅−∑s′K⋅ss′vs′)), s∈S. (24)

With a learning rate , this becomes

 vs′←(1−η)vs′+ηα∑saKass′exp(θsa),s′∈S,θs⋅←θs⋅−η(I−c1exp(θs⋅)T1Texp(θs⋅))(θs⋅−log∑aexp(θsa)1−1τ(rs⋅−∑s′K⋅ss′vs′)), s∈S. (25)

The details of the algorithm can be found in Algorithm 2 below.

## 4 Numerical results

In this section, we examine the performance of Algorithm 1 and Algorithm 2 on several MDP problems. Section 4.1 compares Algorithm 1 and Algorithm 2 in a complete-information case where the transition probabilities and the rewards are known exactly. Section 4.2 is concerned with the problem where the rewards are given with different levels of Gaussian noises. Finally, a sample-based example is studied in Section 4.3.

### 4.1 Experiment I – Complete information

Here we test the numerical performance of the standard natural gradient in Algorithm 1 and the interpolating natural gradient in Algorithm 2 in a complete information situation. The MDP used is from [zhan2021policy], where , , and the transition probabilities and rewards are randomly generated. More specifically, the transition probabilities are set as for any , where is a uniformly randomly chosen subset of such that , and the reward for , where and are independently uniformly sampled from .

A comparison of Algorithm 1 and Algorithm 2 is carried out using the same discount rate and hyperparameters . Since both algorithms are explicit discretizations of the corresponding flow, a sufficiently small learning rate is needed to ensure convergence. In the tests, the learning rates are set as for Algorithm 1 and for Algorithm 2, which are both manually tuned to be close to the largest learning rates such that convergence is achieved. For Algorithm 2, we set .

As a result, Algorithm 1 takes iterations to converge while Algorithm 2 takes iterations, demonstrating that the interpolating metric introduced in Section 3 gives rise to an acceleration of more than magnitude. Plotted in Figure 2(a) and Figure 2(c) are the errors of the value and policy with respect to the ground truth in the training process, which verifies that Algorithm 2 achieves the same precision more than a magnitude faster than Algorithm 1. Moreover, it can be observed from Figure 2(b) and Figure 2(d) that in both cases the Lyapunov function decreases monotonically, confirming the theoretical analyses in Section 2 and Section 3.

### 4.2 Experiment II – Gaussian noise

Next, we consider the case where the transitions probabilities are known, but noises are included in the rewards. More specifically, in iteration , the reward is replaced with , where the noises are independent for , and is the maximum of number of iterations. Since the expectation of is

, the standard deviation is set as

or for the experiment, such that the noises have an appropriate scale. Apart from the additional noises, the MDP is the same as in Section 4.1. Only Algorithm 2 is implemented for this problem since it has much better performance than Algorithm 1. We take .