## 1 Introduction

Reinforcement learning studies how agents explore and exploit an environment and take actions to maximize long-term reward. It has broad applications in robot control and game playing (Mnih et al., 2015; Silver et al., 2016; Argall et al., 2009; Silver et al., 2017). Value iteration and policy gradient methods are the mainstream approaches to reinforcement learning (Sutton and Barto, 2018; Li, 2017).

Policy gradient methods learn an optimal policy directly from past experience or on the fly. They maximize the expected discounted reward through a parametrized policy whose parameters are updated by gradient ascent. Traditional policy gradient methods suffer from three well-known obstacles: high variance, sample inefficiency, and difficulty in tuning the learning rate. To make the learning algorithm more robust and scalable to large datasets, Schulman et al. proposed the trust region policy optimization algorithm (TRPO) (Schulman et al., 2015). TRPO searches for the optimal policy by maximizing a surrogate function subject to a constraint on the KL divergence between the old and new policy distributions, which guarantees monotonic improvement. To further improve data efficiency and reliability, the proximal policy optimization algorithm (PPO) was proposed, which uses first-order optimization and a clipped probability ratio between the new and old policies (Schulman et al., 2017). TRPO was also extended to constrained reinforcement learning: Achiam et al. proposed constrained policy optimization (CPO), which guarantees near-constraint satisfaction at each iteration (Achiam et al., 2017).

Although TRPO, PPO and CPO have shown promising performance on complex decision-making problems, such as continuous-control tasks and playing Atari, like other neural-network-based models they face two typical challenges: a lack of interpretability, and difficulty in converging because of non-convex optimization in a high-dimensional parameter space. In many real applications, data lying in a high-dimensional ambient space have a much lower intrinsic dimension, so it may be easier to optimize the policy function on a low-dimensional manifold.

In recent years, many optimization methods have been generalized from Euclidean space to Riemannian space, motivated by the manifold structures present in many machine learning problems (Absil et al., 2007, 2009; Vandereycken, 2013; Huang et al., 2015; Zhang et al., 2016). In this paper, we combine the merits of TRPO, PPO, and CPO and propose a new algorithm called Riemannian proximal policy optimization (RPPO), which takes manifold learning into account for policy optimization. To estimate the policy, we need a density-estimation function. Options include kernel density estimation, neural networks, and the Gaussian mixture model (GMM). In this study we choose GMM because of its good analytical characteristics, universal representation power, and low computational cost compared with neural networks. It is well known that the covariance matrices of a GMM lie in a Riemannian manifold of positive semidefinite matrices.

More specifically, we first model policy functions using GMM. Second, to optimize the GMM and learn the optimal policy functions efficiently, we formulate the task as a non-convex optimization problem in Riemannian space. In this way, our method gains advantages in both interpretability and speed of convergence. Note that the RPPO algorithm can be extended to other, non-GMM density estimators as long as their parameter space is Riemannian. GMM has previously been applied to reinforcement learning by embedding it in the Q-learning framework (Agostini and Celaya, 2010). That approach therefore inherits a known weakness of Q-learning: it can hardly handle problems with large continuous state-action spaces.

## 2 Preliminaries

### 2.1 Reinforcement learning

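In this study, we consider a Markov decision process (MDP) defined as a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the transition probability function, $r: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the reward function, and $\gamma$ is the discount factor, which balances future rewards against immediate ones.

To make optimal decisions in MDP problems, reinforcement learning learns an optimal value function or policy. A value function is the expected discounted cumulative reward of a state or state-action pair obtained by following a policy $\pi$. We define the state value function as $V_\pi(s) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t, s_{t+1}) \mid s_0 = s\right]$, where $\tau = (s_0, a_0, s_1, \dots)$ denotes a trajectory generated by playing policy $\pi$, $a_t \sim \pi(\cdot \mid s_t)$, and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$. Similarly, we define the state-action value function as $Q_\pi(s, a) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t, s_{t+1}) \mid s_0 = s, a_0 = a\right]$. We also define the advantage function as $A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s)$.

In reinforcement learning, we try to find or learn an optimal policy $\pi$ which maximizes a given performance metric $J(\pi)$. The infinite-horizon discounted cumulative return is widely used to evaluate a given policy and is defined as $J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t, s_{t+1})\right]$, where $r(s_t, a_t, s_{t+1})$ is the reward for moving from $s_t$ to $s_{t+1}$ by taking action $a_t$. Note that the expectation is taken over the distribution of trajectories.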
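As a small illustration of the discounted return defined above, the following sketch estimates the performance of a policy from finite sampled reward sequences. The function names and the toy rollouts are illustrative assumptions, and truncating the infinite horizon at the end of each rollout is a simplification.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * r_t for one finite reward sequence."""
    total = 0.0
    # Accumulate backwards so that each step computes G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def estimate_performance(reward_sequences, gamma):
    """Monte Carlo estimate of the expected discounted return."""
    returns = [discounted_return(seq, gamma) for seq in reward_sequences]
    return sum(returns) / len(returns)


# Two hypothetical rollouts, each given as its list of per-step rewards.
rollouts = [[1.0, 1.0, 1.0], [0.0, 2.0]]
estimate = estimate_performance(rollouts, gamma=0.5)
print(estimate)
```

Averaging over more rollouts reduces the variance of the estimate, which is the quantity policy gradient methods try to increase.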

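In the work of (Kakade and Langford, 2002; Schulman et al., 2017), for two given policies $\pi$ and $\pi'$, their expected cumulative returns over an infinite horizon are linked by the advantage function: $J(\pi') = J(\pi) + \mathbb{E}_{\tau \sim \pi'}\left[\sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t)\right]$. By introducing discounted visit frequencies (Schulman et al., 2015; Achiam et al., 2017) $\rho_\pi(s) = \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s \mid \pi)$, where $s_0 \sim \rho_0$ and $\rho_0$ is the distribution of the initial state $s_0$, we have $J(\pi') = J(\pi) + \sum_s \rho_{\pi'}(s) \sum_a \pi'(a \mid s) A_\pi(s, a)$. To reduce the complexity of searching for a new policy $\pi'$ which increases $J$, we replace the discounted visit frequencies $\rho_{\pi'}$ to be optimized with the old discounted visit frequencies $\rho_\pi$: $L_\pi(\pi') = J(\pi) + \sum_s \rho_\pi(s) \sum_a \pi'(a \mid s) A_\pi(s, a)$.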
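The relation between the exact performance difference and its surrogate can be checked numerically on a small tabular MDP. The sketch below is illustrative only: the random MDP, the array layout and the function names are assumptions, and the quantities are computed exactly by solving linear systems rather than by sampling.

```python
import numpy as np


def policy_quantities(P, R, pi, gamma, rho0):
    """Exact V, Q, A and discounted visit frequencies for a tabular MDP.

    P[s, a, t]: transition probabilities, R[s, a, t]: rewards,
    pi[s, a]: policy probabilities, rho0[s]: initial state distribution.
    """
    n_states = P.shape[0]
    r_sa = np.einsum("sat,sat->sa", P, R)   # expected one-step reward
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state kernel under pi
    r_pi = np.einsum("sa,sa->s", pi, r_sa)  # expected reward per state under pi
    M = np.eye(n_states) - gamma * P_pi
    V = np.linalg.solve(M, r_pi)            # Bellman equation V = r_pi + gamma P_pi V
    Q = r_sa + gamma * (P @ V)
    A = Q - V[:, None]
    # rho(s) = sum_t gamma^t Pr(s_t = s), i.e. rho^T = rho0^T (I - gamma P_pi)^{-1}
    rho = np.linalg.solve(M.T, rho0)
    return V, Q, A, rho


def advantage_gain(rho, A_old, pi_new):
    """sum_s rho(s) sum_a pi_new(a|s) A_old(s, a)."""
    return float(np.sum(rho[:, None] * pi_new * A_old))


rng = np.random.default_rng(0)
P = rng.random((3, 2, 3))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((3, 2, 3))
rho0 = np.full(3, 1.0 / 3.0)
gamma = 0.9
pi_old = np.full((3, 2), 0.5)
pi_new = rng.random((3, 2))
pi_new /= pi_new.sum(axis=1, keepdims=True)

V_old, _, A_old, rho_old = policy_quantities(P, R, pi_old, gamma, rho0)
V_new, _, _, rho_new = policy_quantities(P, R, pi_new, gamma, rho0)

exact_gain = float(rho0 @ V_new - rho0 @ V_old)          # true performance difference
identity_gain = advantage_gain(rho_new, A_old, pi_new)   # uses new visit frequencies
surrogate_gain = advantage_gain(rho_old, A_old, pi_new)  # uses old visit frequencies
print(exact_gain, identity_gain, surrogate_gain)
```

With the new visit frequencies the advantage-weighted sum reproduces the exact difference; with the old ones it is only an approximation, which is why the update has to stay close to the old policy.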

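Assume that the policy functions $\pi$ are parametrized by a vector $\theta$, $\pi(a \mid s) = \pi_\theta(a \mid s)$. Searching for a new policy is then equivalent to searching for new parameters $\theta'$ in the parameter space. So we have $L_\theta(\theta') = J(\theta) + \sum_s \rho_\theta(s) \sum_a \pi_{\theta'}(a \mid s) A_\theta(s, a)$.

### 2.2 Riemannian space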

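Here we give a brief introduction to Riemannian space; for more details see (Eisenhart, 2016).

Let $\mathcal{M}$ be a connected and finite-dimensional manifold of dimension $m$. We denote by $T_p\mathcal{M}$ the tangent space of $\mathcal{M}$ at $p$. Let $\mathcal{M}$ be endowed with a Riemannian metric $\langle \cdot, \cdot \rangle$, with corresponding norm denoted by $\|\cdot\|$, so that $\mathcal{M}$ is now a Riemannian manifold (Eisenhart, 2016). We use $l(c) = \int_a^b \|c'(t)\| \, dt$ to denote the length of a piecewise smooth curve $c: [a, b] \to \mathcal{M}$ joining $p'$ to $p$, i.e., such that $c(a) = p'$ and $c(b) = p$. Minimizing this length functional over the set of all piecewise smooth curves joining $p'$ and $p$, we get a Riemannian distance $d(p', p)$ which induces the original topology on $\mathcal{M}$.

Take $p \in \mathcal{M}$. The exponential map $\exp_p: T_p\mathcal{M} \to \mathcal{M}$ is defined by $\exp_p v = c_v(1)$, where $c_v$ is the geodesic starting at $p$ with initial velocity $v$; it maps a tangent vector $v$ at $p$ to $\mathcal{M}$ along the curve $c_v$. For any $p' \in \mathcal{M}$ we define the exponential inverse map $\exp_p^{-1}: \mathcal{M} \to T_p\mathcal{M}$, which is $C^\infty$ and maps a point $p'$ on $\mathcal{M}$ to a tangent vector at $p$ with $d(p', p) = \|\exp_p^{-1} p'\|$. We assume $(\mathcal{M}, d)$ is a complete metric space, bounded, and all closed subsets of $\mathcal{M}$ are compact.

For a given convex function $f: \mathcal{M} \to \mathbb{R}$ at $p' \in \mathcal{M}$, a vector $s \in T_{p'}\mathcal{M}$ is called a subgradient of $f$ at $p'$ if $f(p) \ge f(p') + \langle s, \exp_{p'}^{-1} p \rangle$ for all $p \in \mathcal{M}$. The set of all subgradients of $f$ at $p'$ is called the subdifferential of $f$ at $p'$, denoted by $\partial f(p')$. If $\mathcal{M}$ is a Hadamard manifold, i.e. complete, simply connected and with everywhere non-positive sectional curvature, the subdifferential of $f$ at any point of $\mathcal{M}$ is non-empty (Ferreira and Oliveira, 2002).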
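To make the exponential map, its inverse and the Riemannian distance concrete, the sketch below evaluates them on the manifold of symmetric positive definite matrices. The affine-invariant metric used here is an assumption chosen for illustration (the text above does not fix a metric), and all function names are hypothetical.

```python
import numpy as np


def sym_fun(S, fun):
    """Apply a scalar function to a symmetric matrix through its eigenvalues."""
    S = 0.5 * (S + S.T)  # remove round-off asymmetry
    w, V = np.linalg.eigh(S)
    return (V * fun(w)) @ V.T


def whiten(P, X):
    """Return P^{-1/2} X P^{-1/2}."""
    P_ihalf = sym_fun(P, lambda w: 1.0 / np.sqrt(w))
    return P_ihalf @ X @ P_ihalf


def spd_exp(P, V):
    """Exponential map at P applied to the symmetric tangent vector V."""
    P_half = sym_fun(P, np.sqrt)
    return P_half @ sym_fun(whiten(P, V), np.exp) @ P_half


def spd_log(P, Q):
    """Inverse exponential map: the tangent vector at P that points to Q."""
    P_half = sym_fun(P, np.sqrt)
    return P_half @ sym_fun(whiten(P, Q), np.log) @ P_half


def tangent_norm(P, V):
    """Norm of V in the tangent space at P under the affine-invariant metric."""
    return float(np.linalg.norm(whiten(P, V), "fro"))


def spd_distance(P, Q):
    """Riemannian distance between two SPD matrices."""
    return float(np.linalg.norm(sym_fun(whiten(P, Q), np.log), "fro"))


P = np.array([[2.0, 0.5], [0.5, 1.0]])
V = np.array([[0.3, -0.2], [-0.2, 0.1]])
Q = spd_exp(P, V)
print(spd_distance(P, Q), tangent_norm(P, V))
```

Under this metric the SPD matrices form a Hadamard manifold, so the distance between `P` and `spd_exp(P, V)` equals the norm of `V`, which mirrors the relation between the distance and the inverse exponential map stated above.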

## 3 Modeling policy function using Gaussian mixture model

To model policy functions, we employ the Gaussian mixture model, a widely used and statistically mature method for clustering and density estimation. The policy function can be modeled as $\pi(a \mid s) = \sum_{i=1}^{K} \alpha_i(s)\, \mathcal{N}(a; \mu_i(s), S_i(s))$, where $\mathcal{N}$ is a (multivariate) Gaussian distribution with mean $\mu_i$ and covariance matrix $S_i$, $K$ is the number of components in the mixture model, and $\alpha_i$ are the mixture component weights, which sum to 1. In the following, we drop $s$ in the GMM notation for simplicity; the parameters of the GMM still depend on the state variable $s$ implicitly.

We define $\theta = \{\alpha_i, \mu_i, S_i\}_{i=1}^{K}$, which parametrizes the policy function. For a given policy function $\pi_\theta$, we would like to find a new policy $\pi_{\theta'}$ that has higher performance, evaluated by $J$, within close proximity of the old policy to avoid dramatic policy updates.
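A GMM policy of this kind can be evaluated and sampled with a few lines of NumPy. This is a minimal sketch for one fixed state: the parameter values and function names are illustrative assumptions, and the dependence of the parameters on the state is left out.

```python
import numpy as np


def gaussian_pdf(a, mean, cov):
    """Density of a multivariate Gaussian at the action vector a."""
    d = mean.shape[0]
    diff = a - mean
    maha = diff @ np.linalg.solve(cov, diff)  # squared Mahalanobis distance
    log_det = np.linalg.slogdet(cov)[1]
    return float(np.exp(-0.5 * (maha + log_det + d * np.log(2.0 * np.pi))))


def gmm_policy_pdf(a, weights, means, covs):
    """Mixture density: weighted sum of the component densities."""
    return sum(w * gaussian_pdf(a, m, S) for w, m, S in zip(weights, means, covs))


def sample_action(rng, weights, means, covs):
    """Pick a component by its weight, then sample from that Gaussian."""
    k = rng.choice(len(weights), p=weights)
    return rng.multivariate_normal(means[k], covs[k])


# Hypothetical one-dimensional policy with two components.
weights = np.array([0.3, 0.7])
means = [np.array([-1.0]), np.array([2.0])]
covs = [np.array([[0.5]]), np.array([[1.5]])]

rng = np.random.default_rng(0)
action = sample_action(rng, weights, means, covs)
density = gmm_policy_pdf(action, weights, means, covs)
print(action, density)
```

Because each component is an explicit Gaussian, the learned means, covariances and weights can be inspected directly, which is the source of the interpretability argued for in the introduction.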

Define , and which searches new near the proximity of the old parameter . is a similarity metric which can be the Euclidean distance, KL divergence, J-S divergence or the 2nd Wasserstein distance (Arjovsky et al., 2017).

We would like to optimize the following problem with corresponding constraints from GMMs:

(1) |

s.t. , , .

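We employ a reparametrization method to make the Gaussian distributions zero-centered. For a component with mean $\mu$ and covariance matrix $S$, we augment the action variable by 1 and define a new variable vector $\tilde{a} = [a^\top, 1]^\top$ with new covariance matrix $\tilde{S} = \begin{bmatrix} S + \mu\mu^\top & \mu \\ \mu^\top & 1 \end{bmatrix}$ (Hosseini and Sra, 2015).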
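The effect of this reparametrization can be verified numerically: the original density equals the zero-mean density of the augmented vector up to the constant factor $\sqrt{2\pi e}$. The sketch below assumes the augmented covariance of Hosseini and Sra (2015), built from a component's mean and covariance; the function names and the test values are illustrative.

```python
import numpy as np


def gaussian_pdf(x, mean, cov):
    """Density of a multivariate Gaussian at x."""
    d = mean.shape[0]
    diff = x - mean
    maha = diff @ np.linalg.solve(cov, diff)
    log_det = np.linalg.slogdet(cov)[1]
    return float(np.exp(-0.5 * (maha + log_det + d * np.log(2.0 * np.pi))))


def augment_covariance(mean, cov):
    """Build [[cov + mean mean^T, mean], [mean^T, 1]]."""
    d = mean.shape[0]
    S = np.empty((d + 1, d + 1))
    S[:d, :d] = cov + np.outer(mean, mean)
    S[:d, d] = mean
    S[d, :d] = mean
    S[d, d] = 1.0
    return S


mean = np.array([0.5, -1.0])
cov = np.array([[1.0, 0.3], [0.3, 2.0]])
a = np.array([0.2, 0.1])

a_aug = np.append(a, 1.0)                 # augmented action [a, 1]
S_aug = augment_covariance(mean, cov)
original = gaussian_pdf(a, mean, cov)
zero_centered = gaussian_pdf(a_aug, np.zeros(3), S_aug)
print(original, np.sqrt(2.0 * np.pi * np.e) * zero_centered)
```

After the change of variables, the mean and covariance of each component are carried by a single positive definite matrix, so only matrix-valued parameters and the mixture weights remain to be optimized.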

In Optimization Problem (1), there is a simplex constraint $\sum_{i=1}^{K} \alpha_i = 1$. To convert it to an unconstrained problem, we first define $\eta_i = \log(\alpha_i / \alpha_K)$ for $i = 1, \dots, K-1$, and let $\eta_K = 0$ be a constant (Hosseini and Sra, 2015). Then we have the following unconstrained optimization problem:

(2) |
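The change of variables that removes the simplex constraint, taking the log-ratio of each weight to the last weight as in Hosseini and Sra (2015), can be sketched as follows. The function names are illustrative assumptions; the inverse map is a softmax with the last entry pinned to zero.

```python
import numpy as np


def weights_to_eta(alpha):
    """Map mixture weights to log-ratios relative to the last weight."""
    alpha = np.asarray(alpha, dtype=float)
    return np.log(alpha / alpha[-1])  # the last entry is log(1) = 0


def eta_to_weights(eta):
    """Map unconstrained log-ratios back to weights on the simplex."""
    eta = np.asarray(eta, dtype=float)
    z = np.exp(eta - eta.max())       # shift by the maximum for numerical stability
    return z / z.sum()


alpha = np.array([0.2, 0.5, 0.3])
eta = weights_to_eta(alpha)
recovered = eta_to_weights(eta)
print(eta, recovered)
```

Any real values of the first $K-1$ log-ratios map back to valid weights, so gradient steps on them never leave the simplex.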

## 4 Riemannian proximal method for policy optimization

### 4.1 Riemannian proximal method for a class of non-convex problems

In this section, following the work of Khamaru and Wainwright (2018), we tackle a more general class of functions of the form $f = g - h + \varphi$. We assume the following:

Assumption 1:

(a) The function $g$ is continuously differentiable and its gradient vector field is Lipschitz continuous with constant $L_g$.

(b) The function $h$ is continuous and convex.

(c) The function $\varphi$ is proper, convex and lower semi-continuous.

(d) The function $f$ is bounded below over a complete Riemannian manifold $\mathcal{M}$ of dimension $m$.

(e) The solution set of $\min_{x \in \mathcal{M}} f(x)$ is non-empty and its optimum value is denoted by $f^*$.

Lemma 1. Under Assumption 1, assume and are subgradients of the convex functions and , respectively. We have , and , where is the step size.

Proof of Lemma 1 can be found in the Appendix.

Theorem 1. Under Assumption 1, the following statements hold for any sequence generated by Algorithm 1:

(a) Any limit point of the sequence is a critical point, and the sequence of function values is strictly decreasing and convergent.

(b) For , we have . In addition, if the function is -smooth: where is a constant, then

(3) |

Proof of Theorem 1 can be found in the Appendix.

Algorithm 1:

1. Given an initial point and choose step size , .

2. For , do

Choose subgradient , where denotes subdifferential (or subderivatives) of the convex function at the point . Update

(4) |
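For intuition, the structure of a proximal update for an objective made of a smooth part, a subtracted convex part and a convex regularizer can be sketched in flat Euclidean space, where the exponential map reduces to vector addition. This is the Euclidean prox-type scheme in the spirit of Khamaru and Wainwright (2018) and is an assumption about the general shape of the method, not the Riemannian update itself; the toy objective, function names and step size are illustrative.

```python
import numpy as np


def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def prox_descent(x0, grad_g, subgrad_h, prox_phi, step, n_iter):
    """Iterate x <- prox_{step * phi}(x - step * (grad g(x) - u)), u in dh(x)."""
    x = np.asarray(x0, dtype=float)
    history = [x.copy()]
    for _ in range(n_iter):
        u = subgrad_h(x)  # subgradient of the subtracted convex part
        x = prox_phi(x - step * (grad_g(x) - u), step)
        history.append(x.copy())
    return x, history


# Toy problem: g(x) = 0.5||x - b||^2, h(x) = 0.25||x||^2, phi(x) = lam * ||x||_1.
b = np.array([2.0, 0.1, -1.0])
lam = 0.5


def objective(x):
    return 0.5 * np.sum((x - b) ** 2) - 0.25 * np.sum(x ** 2) + lam * np.sum(np.abs(x))


x_final, history = prox_descent(
    x0=np.zeros(3),
    grad_g=lambda x: x - b,
    subgrad_h=lambda x: 0.5 * x,
    prox_phi=lambda z, t: soft_threshold(z, lam * t),
    step=0.5,  # not larger than 1 / L_g, where L_g = 1 for this g
    n_iter=200,
)
values = [objective(x) for x in history]
print(x_final, values[0], values[-1])
```

In this toy run the objective values decrease along the iterates, which is the behaviour that Theorem 1 (a) states for the sequence of function values.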

### 4.2 Lower bound of policy improvement

Assume we have two policy functions and parameterized by GMMs with parameters and , we would like to bound the performance improvement of over under limitation of the proximal operator.

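In this study, we choose the Wasserstein distance to measure the discrepancy between the policy functions $\pi_\theta$ and $\pi_{\theta'}$ because of its robustness. For two distributions $\nu_0$ and $\nu_1$ of dimension $n$, the Wasserstein distance is defined as

$$W_2(\nu_0, \nu_1)^2 = \inf_{\omega \in \Pi(\nu_0, \nu_1)} \int_{\mathbb{R}^n \times \mathbb{R}^n} \|x - y\|^2 \, d\omega(x, y),$$

which seeks a joint probability distribution $\omega$ in $\mathbb{R}^{2n}$ whose marginals along coordinates $x$ and $y$ coincide with $\nu_0$ and $\nu_1$, respectively. For two Gaussian distributions $\mathcal{N}(m_0, \Sigma_0)$ and $\mathcal{N}(m_1, \Sigma_1)$, the Wasserstein distance is $W_2^2 = \|m_0 - m_1\|^2 + \mathrm{tr}\left(\Sigma_0 + \Sigma_1 - 2\left(\Sigma_0^{1/2} \Sigma_1 \Sigma_0^{1/2}\right)^{1/2}\right)$.

First we have the following lemma: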

Lemma 2. Given two policies parametrized by GMMs and , let , where , defines Wasserstein distance between two GMMs. Then exist , and .

Lemma 2 can be simply proved by applying Theorem 1.
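The Wasserstein distance between two Gaussian distributions has a well-known closed form (squared distance between the means plus a trace term in the covariances), which the sketch below evaluates. The function names are illustrative assumptions, and the matrix square roots are taken through an eigendecomposition.

```python
import numpy as np


def psd_sqrt(S):
    """Symmetric square root of a positive semidefinite matrix."""
    S = 0.5 * (S + S.T)
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T


def gaussian_w2_squared(m0, S0, m1, S1):
    """Squared 2-Wasserstein distance between N(m0, S0) and N(m1, S1)."""
    root0 = psd_sqrt(S0)
    cross = psd_sqrt(root0 @ S1 @ root0)
    mean_term = float(np.sum((m0 - m1) ** 2))
    cov_term = float(np.trace(S0 + S1 - 2.0 * cross))
    return mean_term + cov_term


m0, S0 = np.array([0.0, 0.0]), np.diag([4.0, 9.0])
m1, S1 = np.array([3.0, 0.0]), np.eye(2)
value = gaussian_w2_squared(m0, S0, m1, S1)
print(value)
```

For the diagonal covariances used here the trace term reduces to the sum of squared differences of the standard deviations, which gives an easy sanity check.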

To reduce computational complexity, we employ the discrete Wasserstein distance obtained by embedding GMMs in a manifold of probability densities with Gaussian mixture structure, as proposed by Chen et al. (2019). More specifically, the discrete Wasserstein distance between two GMMs and is:

(5) |

where .

Lemma 3. Given two policies parametrized by GMMs and , their total variation distance can be bounded as follows:

(6) |

where for Gaussian distributions and (Devroye et al., 2018), and . Please note the bound actually is the Wasserstein distance between the discrete distributions and with pairwise cost defined by the bound of total variation distance between two Gaussian distributions , .

With Lemma 2 and Lemma 3, we have the following theorem:

Theorem 2. Given two policy functions and parametrized by GMMs and , assume policy is parametrized by as shown in Lemma 2, then we have the following bound for any within proximity of :

,

where , , and ( and follow the bound definition in Lemma 3), .

Proofs of Lemma 3 and Theorem 2 are shown in the appendix.

### 4.3 Implementation of the Riemannian proximal policy optimization method

Recall that in the optimization problem (2), we are trying to optimize the following objective function: .

1) *Riemannian Gradient*

,

, .

Let , ,

.

For Euclidean distance , we have

, . For discrete Wasserstein distance , we have

, where

, .

2) *Retraction*

With and shown above at iteration , we would like to calculate using retraction. From (Cheng, 2013), for any tangent vector , where is a point in Riemannian space , its retraction . For our case , where and are the -th eigenvalue and eigenvector of matrix . , are updated using the standard gradient descent method in Euclidean space. The calculation and retraction shown above are repeated until converges.
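One common eigenvalue-based way to map an updated matrix back onto the set of positive definite matrices is sketched below. Whether it matches the retraction formula used here is an assumption: the sketch adds the tangent step to the current matrix and floors the eigenvalues of the result at a small positive value. The function name and the floor `eps` are hypothetical.

```python
import numpy as np


def retract_spd(S, xi, eps=1e-8):
    """Return S + xi with its eigenvalues floored at eps."""
    M = S + xi
    M = 0.5 * (M + M.T)  # keep the matrix symmetric
    w, V = np.linalg.eigh(M)
    return (V * np.maximum(w, eps)) @ V.T


S = np.eye(2)
small_step = np.array([[0.1, 0.05], [0.05, -0.2]])
large_step = np.diag([-2.0, 0.5])

inside = retract_spd(S, small_step)            # S + xi is already positive definite
clipped = retract_spd(S, large_step, eps=1e-6) # one eigenvalue had to be floored
print(np.linalg.eigvalsh(inside), np.linalg.eigvalsh(clipped))
```

For small steps the map coincides with plain addition, and for large steps the flooring keeps every covariance matrix valid throughout the iterations.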

## 5 Experimental results

### 5.1 Simulation environments and baseline methods

We choose TRPO and PPO, which are well known for excelling at continuous-control tasks, as baseline algorithms. Each algorithm runs on the following three environments in the OpenAI Gym MuJoCo simulator (Todorov et al., 2012): InvertedPendulum-v2, Hopper-v2, and Walker2d-v2, listed in order of increasing task complexity in terms of the size of the state and action spaces. For each run, we compute the average reward over every 50 episodes and report the mean reward curve and parameter statistics for comparison.

### 5.2 Preliminary results

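Table 1: Number of parameters of each algorithm and dimensions of the state and action spaces for each environment.

Environments | Number of parameters (RPPO) | Number of parameters (TRPO) | Number of parameters (PPO) | Dim. of states | Dim. of actions |
---|---|---|---|---|---|
InvertedPendulum-v2 | 104 | 124,026 | 124,026 | 4 | 1 |
Hopper-v2 | 599 | 5,281,434 | 5,281,434 | 11 | 3 |
Walker2d-v2 | 599 | 40,404,522 | 40,404,522 | 17 | 6 |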

In Fig. 1 we show the mean reward (column 1) for the PPO, RPPO and TRPO algorithms on three MuJoCo environments, together with screenshots (column 2) and the probability density of the GMM (column 3) for RPPO on each environment. The learning curves show that as the state-action dimension of the environment increases (see Table 1), both the convergence speed and the reward improvement slow down, because the optimization task becomes harder in higher-dimensional environments. In the GMM plots, S and A represent the state and action dimensions respectively, and the probability density is shown on the z axis. As the environment complexity increases, the density pattern becomes more diverse, and the off-diagonal matrix terms also become important. The probability density of the GMM shows that RPPO learns meaningful structure in the policy functions.

TRPO and PPO are pure neural-network-based models with numerous parameters, which makes them highly sensitive to overfitting, poor network architecture design, and hyper-parameter tuning. RPPO achieves better robustness by having far fewer parameters. Table 1 compares the number of parameters of each algorithm on each environment: the GMM has orders of magnitude fewer parameters than TRPO and PPO.

## 6 Conclusion

We proposed a general Riemannian proximal optimization algorithm with guaranteed convergence to solve Markov decision process (MDP) problems. To model policy functions in MDP, we employed the Gaussian mixture model (GMM) and formulated it as a non-convex optimization problem in the Riemannian space of positive semidefinite matrices. Preliminary experiments on benchmark tasks in OpenAI Gym MuJoCo (Todorov et al., 2012) show the efficacy of the proposed RPPO algorithm.

Algorithm 1, proposed in Sec. 4.1, can optimize a general class of non-convex functions of the form . Because of the page limit, in this study we focus on as shown in Optimization Problem (2). In the future, it would be interesting to incorporate constraints into MDP problems, as in constrained policy optimization (Achiam et al., 2017), and encode them as a concave function in our RPPO algorithm.

## Appendix

### 1. Proof of Lemma 1:

First let’s define a convex majorant of the function as follows:

, where . Note that minimizer of is the same as . The optimality condition of guarantees that there exists subgradient satisfying the following equation: . Let , we have .

From convexity of the function , for any and we have , . To prove the second inequality in Lemma 1, we have

Since , we have

.

Recall that is a majorant for the function , so

.

### 2. Proof of Theorem 1:

First we would like to prove the convergence of function value. Since the sequence is bounded below, if for some , the convergence of the sequence is trivial. Let’s assume that for all . Under the above assumption, Lemma 1 ensures that . Consequently, there must exist some scalar which is the limit of , .

Due to page limit, the proof of stationarity of limit points is omitted.

Now let’s establish the bound. Since is finite, by utilizing Lemma 1, we have

.

Note that , , so .

Recall that the function is -smooth,

So

### 3. Proof of Lemma 3:

For two Gaussian distributions and with the same mean, their total variation distance is bounded by (Devroye et al., 2018):