1 Introduction
1.1 Setup
Consider an infinite-horizon Markov decision process (MDP) $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$ [bellman2015applied, sutton2018reinforcement, puterman2014markov], where $\mathcal{S}$ is a set of states of the Markov chain and $\mathcal{A}$ is a set of actions. $P$ is a transition probability tensor with $P(s'|s,a)$ being the probability of transitioning from state $s$ to state $s'$ when taking action $a$, $r$ is a reward matrix with $r(s,a)$ being the reward obtained when taking action $a$ at state $s$, and $\gamma \in (0,1)$ is the discount factor. In this paper, we assume that the state space and the action space are finite. A policy $\pi$ is a randomized strategy over the actions at each state, i.e., for each state $s$, $\pi(a|s)$ is the probability of choosing action $a$ at $s$. For a given policy $\pi$, the value function $V^{\pi} \in \mathbb{R}^{|\mathcal{S}|}$ is a vector defined as
$V^{\pi}(s) = \mathbb{E}\big[\, \textstyle\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\big|\, s_0 = s \,\big]$ (1)
where the expectation is taken over all possible trajectories starting from $s$ following the policy $\pi$. The value function satisfies the well-known Bellman equation [bellman2015applied]
$(I - \gamma P^{\pi})\, V^{\pi} = r^{\pi}$ (2)
where $r^{\pi}(s) = \sum_{a} \pi(a|s)\, r(s,a)$, $P^{\pi}(s'|s) = \sum_{a} \pi(a|s)\, P(s'|s,a)$, and $I$ is the identity operator. In a Markov decision problem, the goal is to find the optimal policy $\pi^{*}$ such that $V^{\pi^{*}}(s) \geq V^{\pi}(s)$ for all $s \in \mathcal{S}$
for any other policy $\pi$. The corresponding optimal value function $V^{\pi^{*}}$ will also be referred to as $V^{*}$ in this paper. The existence of $\pi^{*}$ and $V^{*}$ is guaranteed by the theory of MDPs [puterman2014markov].
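For finite state and action spaces, the Bellman equation (2) is a linear system in the value function and can be solved directly. The following sketch illustrates this for a small randomly generated MDP; all sizes and variable names here are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9

# Random transition tensor P[a, s, s'] (rows sum to one) and reward matrix r[s, a].
P = rng.random((nA, nS, nS))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((nS, nA))

# A uniform random policy pi[s, a].
pi = np.full((nS, nA), 1.0 / nA)

# Policy-averaged reward r_pi(s) = sum_a pi(a|s) r(s, a)
# and transition matrix P_pi(s, s') = sum_a pi(a|s) P(s'|s, a).
r_pi = (pi * r).sum(axis=1)
P_pi = np.einsum('sa,ast->st', pi, P)

# Solve the Bellman equation (I - gamma * P_pi) V = r_pi.
V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

# V is indeed the fixed point of the Bellman operator.
assert np.allclose(V, r_pi + gamma * P_pi @ V)
```

Since $\gamma < 1$ and $P^{\pi}$ is a stochastic matrix, $I - \gamma P^{\pi}$ is always invertible, so the solve above never fails.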
In recent studies, entropy regularization has been widely used in MDP problems to encourage exploration and enhance robustness [neu2017unified, dai2018sbeed, geist2019theory, ahmed2019understanding, agarwal2020optimality, mei2020global, cen2020fast, zhan2021policy]. With the entropy regularization, the value function $V^{\pi}$ is defined by
$V^{\pi}(s) = \mathbb{E}\big[\, \textstyle\sum_{t=0}^{\infty} \gamma^{t} \big( r(s_t, a_t) - \tau \log \pi(a_t|s_t) \big) \,\big|\, s_0 = s \,\big]$ (3)
where $\tau > 0$ is the regularization coefficient. $V^{\pi}$ satisfies the regularized Bellman equation
$(I - \gamma P^{\pi})\, V^{\pi} = r^{\pi} - \tau H^{\pi}$ (4)
where $H^{\pi}$ is a vector in $\mathbb{R}^{|\mathcal{S}|}$ with each entry $H^{\pi}(s) = \sum_{a} \pi(a|s) \log \pi(a|s)$ given by the negative Shannon entropy of $\pi(\cdot|s)$.
Here we overload the notation $V^{\pi}$ for the regularized value function, and for the rest of the paper $V^{\pi}$ shall always denote the regularized value function (3) unless otherwise specified. For the entropy-regularized MDP (see [geist2019theory]), there exists a unique optimal policy $\pi^{*}$ such that
$V^{\pi^{*}}(s) \geq V^{\pi}(s) \quad \text{for all } s \in \mathcal{S}$ (5)
for any other policy $\pi$.
Without loss of generality, the reward is assumed to be nonnegative throughout this paper. This can be guaranteed by adding to the rewards a sufficiently large constant $c$. Note that such a uniform shift keeps the optimal policy unchanged and shifts $V^{\pi}$ by the constant $c/(1-\gamma)$.
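The effect of such a uniform shift is easy to check numerically. The sketch below (a random MDP with illustrative sizes, possibly negative rewards) evaluates a fixed policy before and after adding a constant $c$ to every reward and confirms that the value function shifts by exactly $c/(1-\gamma)$.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma, c = 5, 2, 0.95, 10.0

P = rng.random((nA, nS, nS))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((nS, nA)) - 0.5          # possibly negative rewards
pi = rng.random((nS, nA))
pi /= pi.sum(axis=1, keepdims=True)

def evaluate(rew):
    """Solve the Bellman equation (I - gamma * P_pi) V = r_pi."""
    r_pi = (pi * rew).sum(axis=1)
    P_pi = np.einsum('sa,ast->st', pi, P)
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

V, V_shift = evaluate(r), evaluate(r + c)

# The shift adds sum_t gamma^t * c = c / (1 - gamma) to every entry of V,
# so value differences between policies (and hence the argmax) are unchanged.
assert np.allclose(V_shift - V, c / (1 - gamma))
```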
1.2 Primal-dual formulation
Entropy-regularized MDPs enjoy linear programming formulations, in the primal, dual, and primal-dual forms. In this paper, we are concerned with the primal-dual formulation (see for example [neu2017unified, ying2020note]), whose minimax objective involves the value function $V$ and a dual variable $\mu$ defined on state-action pairs. The policy $\pi$ is related to $\mu$ via the relationship $\pi(a|s) = \mu(s,a) / \sum_{a'} \mu(s,a')$.
The main advantage of working with the primal-dual formulation is that the transition tensor $P$ appears linearly in the objective function of the primal-dual problem. This linearity brings an important benefit when a stochastic gradient method is used to solve the primal-dual formulation: an unbiased estimator of the transition tensor guarantees an unbiased estimator for the gradient-based update rule. This avoids the famous double-sampling problem [sutton2018reinforcement] that affects any formulation performing a nonlinear operation on the transition tensor $P$. Examples of these affected formulations include the primal formulation, where a nonlinear max or exponentiation operator is applied to $P$, and the dual formulation, where the inverse of an operator involving $P$ is needed. From this perspective, the primal-dual formulation is convenient in the model-free setting, where the transition probability tensor can only be estimated from samples and is thus inherently noisy. In what follows, we simplify the notation, after which the primal-dual problem can be rewritten more compactly as
(6) 
Though theoretically appealing, the primal-dual formulation (6) often poses computational challenges because it is a minimax optimization. Newton-type methods are often impractical because either the Hessian is only accessible via samples or its size is too large for practical inversion. A closer look at the objective function of (6) shows that it is linear in the value function $V$ and linear along the radial direction of the dual variable $\mu$. This lack of strict convexity/concavity makes first-order methods slow to converge.
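The double-sampling issue avoided by this linearity can be illustrated with a toy one-sample estimator. For a fixed state-action pair, drawing $s' \sim P(\cdot|s,a)$ gives an unbiased estimate of the linear functional $(PV)(s,a)$, whereas the same single-sample scheme targets the wrong quantity for a nonlinear function of $(PV)(s,a)$, e.g. its square. A minimal sketch (the distribution and value vector below are synthetic):

```python
import numpy as np

rng = np.random.default_rng(2)
nS = 6
p = rng.random(nS)
p /= p.sum()                        # one row P(.|s,a) of the transition tensor
V = rng.random(nS)                  # a value-function estimate

exact = p @ V                       # the linear functional (PV)(s,a)
s_next = rng.choice(nS, size=200_000, p=p)

est_linear = V[s_next].mean()       # single samples: unbiased for (PV)(s,a)
est_nonlin = (V[s_next] ** 2).mean()

# The single-sample average of V(s')^2 estimates E[V(s')^2] = p @ V**2,
# which differs from the nonlinear target ((PV)(s,a))^2 by the variance of V(s').
assert abs(est_linear - exact) < 1e-2
assert p @ V**2 > (p @ V) ** 2      # strict unless V is constant on the support
```

Estimating the square (or a max, exponential, or inverse) without bias would require two independent draws of $s'$ from the same $(s,a)$, which is exactly the double-sampling requirement.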
1.3 Contributions
To overcome this difficulty, this paper proposes a quadratically convexified reformulation of (6) that shares the same solution with (6) and an interpolating natural gradient ascent descent method that significantly speeds up the convergence. More specifically, the main contributions of this paper are listed as follows:

We propose a new quadratically convexified primal-dual formulation in which the linear weighted sum in (6) is replaced with a quadratic term. The surprising feature is that the solution is unchanged and is independent of the hyperparameter introduced by the quadratic term. We prove that the vanilla natural gradient ascent descent (NGAD) of this quadratically convexified problem admits a Lyapunov function [lyapunov1992general] and converges linearly. To the best of our knowledge, this is the first quadratically convexified primal-dual formulation of Markov decision problems.
We propose an interpolating natural gradient ascent descent (INGAD) by introducing a new interpolating metric for the dual variable. The corresponding Lyapunov function is constructed and the convergence of the new dynamics is proved. The acceleration is verified by numerical tests under multiple settings.
1.4 Related work
Regarding the primal-dual formulation, the first primal-dual learning algorithm is given in [wang2016online]. A follow-up work [wang2020randomized] leverages a binary-tree data structure and adaptive importance sampling techniques to reduce the complexity. The convergence results in these two papers, however, hold only for the average of all the policies rather than for the policy obtained in the last iteration. In these papers, no regularization is used in the formulation and no preconditioner is used in the iterative update scheme. As a comparison, the current paper proves a last-iterate convergence result with the help of a Lyapunov method and entropy regularization, and derives an interpolating metric that accelerates the algorithm. Various studies have followed the primal-dual formulation in [wang2016online]. For example, a modified form is proposed in [lee2018stochastic] and the corresponding primal-dual type algorithm is derived. An extension to the infinite-horizon average-reward setting is provided in [wang2017primal], but only an average-case convergence result is given. A later work [cho2017deep] further extended this method to the function approximation setting. A comprehensive review of the primal-dual methods in the average-reward setting is given in a recent thesis [gong2021primal], along with a generalization to the general utility maximization formulation. The primal-dual method has also been used to find risk-sensitive policies, for example in [zhang2021cautious], where a risk function is integrated into the primal-dual objective through the dual variable. In the optimization literature, the primal-dual formulation is often called the saddle-point problem: for example, [serrano2020faster] considers a linear relaxation of the saddle-point problem in [wang2017primal] to address large-scale problems.
However, it is worth noting that none of the papers mentioned above uses (entropy) regularization, which smooths the landscape of the optimization problem and is thus a crucial element of recent fast convergence results [cen2020fast, li2021quasinewton]. Without regularization, gradient-type methods can take exponential time to converge [li2021softmax].
Besides the primal-dual formulation, the discussion below briefly touches on the primal and the dual formulations. For the entropy-regularized Markov decision process, the primal formulation [ying2020note] takes the form
$\min_{V}\ \mathbf{1}^{\mathsf{T}} V \quad \text{s.t.} \quad V(s) \geq \tau \log \sum_{a} \exp\big( (r(s,a) + \gamma (PV)(s,a))/\tau \big) \ \ \forall s \in \mathcal{S}$ (7)
which leads to a value iteration algorithm. Let $F$ be the fixed-point map of this value iteration, so that $F(V^{*}) = V^{*}$; then by calculating the derivative matrix we have $\| \partial F / \partial V \|_{\infty} \leq \gamma < 1$.
Hence $F$ is a contraction map, and the iteration $V_{k+1} = F(V_k)$ converges to the fixed point, which is the solution to (7), at a linear rate $O(\gamma^{k})$, where $k$ is the number of iterations. After obtaining the optimal value function $V^{*}$, the corresponding policy $\pi^{*}$ is given by [ying2020note]:
$\pi^{*}(a|s) = \exp\big( (r(s,a) + \gamma (PV^{*})(s,a) - V^{*}(s))/\tau \big)$ (8)
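Assuming $F$ is the standard soft Bellman operator of entropy-regularized MDPs, $F(V)(s) = \tau \log \sum_a \exp\big((r(s,a) + \gamma (PV)(s,a))/\tau\big)$ (the usual form in this literature; the paper's exact definition follows [ying2020note]), the value iteration and the softmax policy extraction can be sketched as follows, with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(3)
nS, nA, gamma, tau = 4, 3, 0.9, 0.1

P = rng.random((nA, nS, nS))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((nS, nA))

def F(V):
    """Soft Bellman operator: F(V)(s) = tau * log sum_a exp((r(s,a) + gamma*(PV)(s,a)) / tau)."""
    Q = r + gamma * np.einsum('ast,t->sa', P, V)   # Q(s,a) = r(s,a) + gamma * (PV)(s,a)
    m = Q.max(axis=1, keepdims=True)               # stabilized log-sum-exp
    return (m + tau * np.log(np.exp((Q - m) / tau).sum(axis=1, keepdims=True))).ravel()

# Fixed-point (value) iteration: a gamma-contraction in the sup norm.
V = np.zeros(nS)
for _ in range(500):
    V = F(V)

# Softmax policy extracted from the converged value function, cf. the formula above.
Q = r + gamma * np.einsum('ast,t->sa', P, V)
pi = np.exp((Q - V[:, None]) / tau)

assert np.allclose(V, F(V), atol=1e-8)       # converged to the fixed point
assert np.allclose(pi.sum(axis=1), 1.0)      # each row is a probability distribution
```

Note that the policy rows sum to one automatically, since $V^{*}(s)$ is exactly the log-partition function of the row $Q(s,\cdot)/\tau$ scaled by $\tau$.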
As a result of the aforementioned double-sampling problem, the value-iteration algorithm based on (7) is mainly used in the model-based setting; however, due to the nice properties of $F$, it appears as an important ingredient in various other algorithms. For example, in [asadi2017alternative] and [rawlik2013stochastic], the authors use $F$ as an alternative softmax operator to form a learning-type algorithm, and in [nachum2017trust], the same map appears as a result of the inner optimization of an entropy-regularized trust-region formulation and is used to form the loss function. In [dai2018sbeed], the mean squared regularized Bellman error is employed to set up the optimization problem. An alternative way to solve a regularized Markov decision problem in the model-based setting is the dual formulation [ying2020note], in which one seeks a policy that solves the following optimization problem:
$\max_{\pi}\ \rho^{\mathsf{T}} V^{\pi}$ (9)
where $\rho > 0$ is a weight vector. By the existence and uniqueness of the optimal value function and optimal policy and the optimality (5), it is clear that any choice of $\rho > 0$ leads to the optimal policy and the optimal value function. A variety of policy gradient algorithms can be used to solve the dual problem; examples include [williams1992simple, sutton1999policy, kakade2001natural, schulman2015high, schulman2015trust, schulman2017proximal], to mention only a few. Recently, [li2021quasinewton] proposed a quasi-Newton policy gradient algorithm, where an approximate Hessian of the objective function in (9) is used as a preconditioner for the gradient, resulting in a quadratic convergence rate by better fitting the problem geometry.
The term primal-dual also appears in other types of formulations where the dual variables do not represent the policy. For example, in [ding2020natural], the authors apply the natural policy gradient method to constrained MDPs (CMDPs), where the dual variables are the multipliers of the constraints. Similarly, in [chow2018lyapunov], the dual variables come from the constraints in CMDPs. In this paper, the Lyapunov method is used to provide a theoretical analysis of the natural gradient flow of the proposed method. The idea of Lyapunov methods has also been applied to discrete-time control problems [kalman1960control, kalman1959control] and to discrete Markovian systems [meyn2012markov]. Recently, it has also been used to address the safety problem, where safety usually appears as additional constraints in the model [perkins2002lyapunov, chow2018lyapunov, berkenkamp2017safe]; there, the Lyapunov function is usually defined on the state space and is used explicitly in the policy iteration or in finding the controller.
1.5 Notations
For a vector $v \in \mathbb{R}^{n}$, $\operatorname{diag}(v)$ denotes the $n \times n$ diagonal matrix with the $i$th diagonal element being $v_i$, $1 \leq i \leq n$. For $v \in \mathbb{R}^{n}$, we denote the $i$th element as $v_i$. While denotes the vector in with the th element being , denotes the vector in with the th element being . The states of the MDP are typically denoted by $s$ and its variants, while the actions are denoted by $a$ and its variants. The vector of length $n$ with all elements equal to $1$ is denoted by $\mathbf{1}_{n}$, and the subscript is often omitted when there is no ambiguity. The $n$-by-$n$ identity matrix is denoted by $I_{n}$, again with the subscript often omitted when there is no ambiguity. For a matrix $A$, $A^{*}$ denotes its Hermitian transpose. If a scalar function is applied to a vector, the result is defined elementwise unless otherwise specified, e.g., for $v \in \mathbb{R}^{n}$, $\exp(v) \in \mathbb{R}^{n}$ with $(\exp(v))_i = \exp(v_i)$ for $1 \leq i \leq n$.
1.6 Contents
The rest of the paper is organized as follows. Section 2 derives the quadratically convexified primal-dual formulation, proves its equivalence with (6), and shows that the vanilla NGAD of the new formulation converges linearly using a Lyapunov function method. Section 3 introduces an interpolating metric by leveraging the flexibility of the underlying metric described by the block-diagonal part of the Hessian. The convergence rate of the INGAD based on this new interpolating metric is significantly improved. We also provide a Lyapunov-style proof for global convergence and an analysis of the exponential convergence rate in the last-iterate sense. Finally, Section 4 demonstrates the numerical performance of the proposed natural gradient methods.
2 Quadratically convexified primaldual formulation
2.1 Formulation
In what follows, we write the objective of the standard entropy-regularized primal-dual formulation as
(10) 
Since this objective is linear in the value function $V$ and linear along the radial direction of the dual variable $\mu$, first-order optimization methods typically experience slow convergence. To address the issue in the $V$ variable, we propose a quadratically convexified primal-dual formulation:
(11) 
Though these two formulations look quite different, surprisingly they are equivalent in the following sense.

They share the same optimal value function $V^{*}$.

The optimal dual variables differ only by a scaling factor that depends on the hyperparameter. This implies that the optimal policies are the same.
One geometric way to see this equivalence is to go through the associated primal formulations
(12) 
and
(13) 
Fig. 1 illustrates the primal formulations for a randomly generated MDP, where the yellow region represents the feasible set and the red dot represents the optimal value $V^{*}$. Due to the key assumption that the rewards are nonnegative, the feasible set lies in the first quadrant. From the contour plots of the two objective functions shown by the dotted curves, it is clear that both of them are minimized at $V^{*}$ when constrained to the feasible set.
The following theorem states this equivalence precisely and its proof is given in Section 6.
Theorem 1
For an infinite-horizon discounted MDP with finite state space $\mathcal{S}$, finite action space $\mathcal{A}$, and nonnegative reward $r$, we have the following properties:

There is a unique solution $(V^{*}, \mu^{*})$ to the primal-dual problem (6),
where $V^{*}$ is the optimal value function defined by (5) and $\mu^{*}$ gives the optimal policy $\pi^{*}$.

There is a unique solution to the quadratically convexified problem (11),
whose value-function component is the optimal value function $V^{*}$ and whose dual component induces the optimal policy $\pi^{*}$.
2.2 Natural gradient ascent descent
As mentioned earlier, gradient-based methods for the primal-dual formulation (10) suffer from slow convergence, partly due to the linearity of the objective in $V$. Since the quadratically convexified scheme (11) gives the same value function and policy as the original primal-dual problem (10), we work instead with (11) and propose an NGAD algorithm.
The firstorder derivatives of the new objective function are
(14) 
The diagonal blocks of the second-order derivatives with respect to $V$ and $\mu$ are
(15) 
Of the two diagonal blocks above, the block with respect to $V$ is easy to invert since it is diagonal with positive diagonal entries, whereas the block with respect to $\mu$ is the sum of a diagonal part and a low-rank part. In the natural gradient dynamics below, we keep only the diagonal part of the latter block. The resulting NGAD flow is:
or equivalently,
(16) 
To analyze its convergence, we start by identifying a Lyapunov function for this dynamics. By Theorem 1, there is a unique solution to problem (11). Based on this solution, define
(17) 
The following lemma summarizes some key properties of this function.
Lemma 1
The function defined in (17) is strictly convex, and its unique minimum is the solution of (11), at which the function value vanishes. In addition, any sublevel set of the function is bounded.
The next lemma states that the function (17) is a Lyapunov function of (16).
Lemma 2
The proofs of these two lemmas are given in Section 6.
Theorem 2
The dynamics of (16) converges globally to the unique solution of (11).
Proof
To show the exponential convergence of (16), we follow Lyapunov's indirect method, i.e., we analyze the linearization of (16) at the unique solution and demonstrate that the real parts of the eigenvalues of the corresponding matrix are negative. This result is the content of Theorem 3, with the proof given in Section 6.
Theorem 3
The dynamics of (16) converges locally at an exponential rate $O(e^{-ct})$ to the solution of (11) for some constant $c > 0$.
Below we discuss the implementation of (16). After a change of variables, (16) can be rewritten as
(18) 
With a learning rate $\eta$, this leads to the update rule
(19) 
The details of the algorithm are summarized in Algorithm 1.
3 Interpolating natural gradient method
In Section 2.2, NGAD is introduced using only the diagonal part of the Hessian block with respect to $\mu$. A natural question is whether the whole block can be used. In matrix notation, this block in (15) takes the form
(20) 
Since the Hessian matrix describes the local geometry of the problem, the standard NGAD in Section 2.2 can be viewed as approximating this block by its diagonal part and using the inverse of the diagonal approximation to precondition the gradient. However, the full block is in fact singular, with a nontrivial kernel, so only its pseudo-inverse is available. If we had constructed the natural gradient method with this pseudo-inverse, the component of the gradient along the kernel direction would never be updated in the dynamics.
The key idea is that one can interpolate between these two extreme cases, i.e., we propose to use
(21) 
with an interpolation parameter between the two extremes, to precondition the gradient.
Under this interpolating metric (21), the new interpolating NGAD (INGAD) is given by
(22) 
where the preconditioner is the interpolating metric (21). At the endpoint of the interpolation corresponding to the diagonal approximation, this dynamics reduces to (16).
A Lyapunov function of this dynamics can also be identified. Using the unique solution to (11), we define
(23) 
where the subscript denotes the interpolation hyperparameter. Some key properties of this function are summarized in the following lemma.
Lemma 3
The function defined in (23) is convex and its unique minimum is the solution of (11). The sublevel sets of the function are bounded.
The next lemma states that the function (23) is a Lyapunov function for (22).
Lemma 4
The proofs of these two lemmas can again be found in Section 6.
Theorem 4
The dynamics of (22) converges globally to the unique solution of (11).
Proof
The local exponential convergence of (22) can also be shown with Lyapunov’s indirect method. This result is stated in Theorem 5.
Theorem 5
The dynamics of (22) converges locally at an exponential rate $O(e^{-ct})$ to the solution of (11) for some constant $c > 0$.
Finally, we discuss the implementation of (22). After a change of variables, (22) can be written as
(24) 
With a learning rate $\eta$, this becomes
(25) 
The details of the algorithm can be found in Algorithm 2 below.
4 Numerical results
In this section, we examine the performance of Algorithm 1 and Algorithm 2 on several MDP problems. Section 4.1 compares Algorithm 1 and Algorithm 2 in a complete-information case where the transition probabilities and the rewards are known exactly. Section 4.2 is concerned with the problem where the rewards are corrupted by Gaussian noise of various levels. Finally, a sample-based example is studied in Section 4.3.
4.1 Experiment I – Complete information
Here we test the numerical performance of the standard natural gradient in Algorithm 1 and the interpolating natural gradient in Algorithm 2 in a complete-information situation. The MDP used is from [zhan2021policy], where the transition probabilities and rewards are randomly generated. More specifically, for each state-action pair, the transition probability is uniform over a uniformly randomly chosen subset of the states, and each reward entry is sampled uniformly at random.
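A generator in the spirit of this construction can be sketched as follows; the state, action, and support sizes below are placeholders (the paper's exact values are not reproduced here), and the reward is simplified to a single uniform draw per entry.

```python
import numpy as np

rng = np.random.default_rng(4)
nS, nA, k = 100, 20, 10   # illustrative sizes, not the paper's values

# For each (s, a), the next state is uniform over a random size-k support set.
P = np.zeros((nA, nS, nS))
for a in range(nA):
    for s in range(nS):
        support = rng.choice(nS, size=k, replace=False)
        P[a, s, support] = 1.0 / k

# Rewards sampled i.i.d. uniformly on [0, 1] (nonnegative, as assumed in Section 1.1).
r = rng.random((nS, nA))

assert np.allclose(P.sum(axis=2), 1.0)               # valid transition probabilities
assert (np.count_nonzero(P, axis=2) == k).all()      # each row supported on k states
```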
A comparison of Algorithm 1 and Algorithm 2 is carried out using the same discount rate and hyperparameters. Since both algorithms are explicit discretizations of the corresponding flows, a sufficiently small learning rate is needed to ensure convergence. In the tests, the learning rates of the two algorithms are manually tuned to be close to the largest values for which convergence is achieved. For Algorithm 2, the interpolation parameter is also fixed.
As a result, Algorithm 2 converges in over an order of magnitude fewer iterations than Algorithm 1, demonstrating that the interpolating metric introduced in Section 3 gives rise to a substantial acceleration. Plotted in Figure 2(a) and Figure 2(c) are the errors of the value function and the policy with respect to the ground truth during training, which verify that Algorithm 2 achieves the same precision more than an order of magnitude faster than Algorithm 1. Moreover, it can be observed from Figure 2(b) and Figure 2(d) that in both cases the Lyapunov function decreases monotonically, confirming the theoretical analyses in Section 2 and Section 3.
4.2 Experiment II – Gaussian noise
Next, we consider the case where the transition probabilities are known, but the rewards are corrupted by noise. More specifically, in each iteration, the reward is replaced with a noisy version, where the Gaussian noises are independent across iterations of the training process. Since the expected scale of the reward is known, the standard deviation of the noise is set at one of two levels for the experiment, such that the noise has an appropriate scale. Apart from the additional noise, the MDP is the same as in Section 4.1. Only Algorithm 2 is implemented for this problem since it has much better performance than Algorithm 1; the interpolation parameter is fixed as before.