1 Introduction
Policy Gradient (PG) methods are a popular class of policy optimization methods for Reinforcement Learning (RL), and have achieved significant success in many challenging applications (Li, 2017) such as robot manipulation (Deisenroth et al., 2013), the game of Go (Silver et al., 2017), and autonomous driving (Shalev-Shwartz et al., 2016). In general, PG methods directly search for the optimal policy by maximizing the expected total reward of the Markov Decision Processes (MDPs) involved in RL, where an agent takes actions dictated by a policy in an unknown dynamic environment over a sequence of time steps. Since policy gradients are generally estimated by Monte Carlo sampling, vanilla PG methods usually suffer from very high variance, which results in slow convergence and instability. Thus, many fast PG methods have recently been proposed to reduce the variance of the vanilla stochastic PG. For example, Sutton et al. (2000) introduced a baseline to reduce the variance of the stochastic PG. Konda and Tsitsiklis (2000) proposed an efficient actor-critic algorithm that estimates the value function to reduce the effect of large variance. Schulman et al. (2015b) proposed generalized advantage estimation (GAE) to control both the bias and the variance of the policy gradient. More recently, some faster variance-reduced PG methods (Papini et al., 2018; Xu et al., 2019a; Shen et al., 2019; Liu et al., 2020) have been developed based on variance-reduction techniques from stochastic optimization.
Alternatively, some successful PG algorithms (Schulman et al., 2015a, 2017) improve the convergence rate and robustness of vanilla PG methods by using penalties such as the Kullback-Leibler (KL) divergence penalty. For example, trust-region policy optimization (TRPO) (Schulman et al., 2015a) ensures that the newly selected policy stays near the old one by using a KL-divergence constraint, while proximal policy optimization (PPO) (Schulman et al., 2017) clips the weighted likelihood ratio to implicitly reach this goal. Subsequently, Shani et al. (2020) analyzed the global convergence properties of TRPO in tabular RL based on the convex mirror descent algorithm. Liu et al. (2019) also studied the global convergence properties of PPO and TRPO equipped with overparametrized neural networks based on mirror descent iterations. At the same time, Yang et al. (2019) proposed PG methods based on the mirror descent algorithm. More recently, mirror descent policy optimization (MDPO) (Tomar et al., 2020) iteratively updates the policy beyond tabular RL by approximately solving a trust-region problem based on the convex mirror descent algorithm. In addition, Agarwal et al. (2019) and Cen et al. (2020) studied natural PG methods for regularized RL. However, Agarwal et al. (2019) mainly focus on tabular policies and on log-linear and neural policy classes, while Cen et al. (2020) mainly focus on the softmax policy class.

Algorithm  Reference  Complexity  Batch Size
TRPO  Shani et al. (2020)  
Regularized TRPO  Shani et al. (2020)  
TRPO/PPO  Liu et al. (2019)  
VRMPO  Yang et al. (2019)  
MDPO  Tomar et al. (2020)  Unknown  Unknown
BGPO  Ours  
VR-BGPO  Ours 
Although these specific PG methods based on mirror descent iteration have recently been studied, the results are scattered across empirical and theoretical work, and a universal framework for these PG methods that does not rely on specific RL tasks is still lacking. In particular, there is still no convergence analysis of PG methods based on the mirror descent algorithm under the nonconvex setting. Since mirror descent iteration adjusts gradient updates to fit the problem geometry and is useful in regularized RL (Geist et al., 2019), an important question remains to be addressed:
Could we design a universal policy optimization framework based on the mirror descent algorithm, and provide its convergence guarantee under the nonconvex setting?
In this paper, we answer the above challenging question affirmatively and propose an efficient Bregman gradient policy optimization framework based on Bregman divergences and momentum techniques. In particular, we provide a convergence analysis framework for PG methods based on mirror descent iteration under the nonconvex setting. Our main contributions are as follows:

We propose an effective Bregman gradient policy optimization (BGPO) algorithm based on the basic momentum technique, which achieves a sample complexity of $\tilde{O}(\epsilon^{-4})$ for finding an $\epsilon$-stationary point while requiring only one trajectory at each iteration.

We further propose an accelerated Bregman gradient policy optimization (VR-BGPO) algorithm based on the momentum variance-reduced technique. Moreover, we prove that VR-BGPO reaches the best known sample complexity of $\tilde{O}(\epsilon^{-3})$ under the nonconvex setting.

We design a unified policy optimization framework based on mirror descent iteration and momentum techniques, and provide its convergence analysis under the nonconvex setting.
Table 1 shows the sample complexities of the representative PG algorithms based on the mirror descent algorithm. Shani et al. (2020) and Liu et al. (2019) established global convergence of mirror descent variants of PG under some prespecified settings, such as overparameterized networks (Liu et al., 2019), by exploiting the hidden convex nature of these specific problems. Without these special structures, global convergence of these methods cannot be achieved. In contrast, our framework does not rely on any specific policy class, and our convergence analysis builds only on the general nonconvex setting. Thus, we only prove that our methods converge to stationary points.
Geist et al. (2019), Jin and Sidford (2020), Lan (2021), and Zhan et al. (2021) studied a general theory of regularized MDPs based on the policy function space, which is generally discontinuous. Since both the state and action spaces are generally very large in practice, the policy function space is large. In contrast, our methods build on the policy parameter space, which is generally a continuous Euclidean space and relatively small. Thus, our methods and theoretical results are more practical than the results in (Geist et al., 2019; Jin and Sidford, 2020; Lan, 2021; Zhan et al., 2021). Tomar et al. (2020) also propose a mirror descent PG framework based on the policy parameter space, but they do not provide any theoretical results and only consider a Bregman divergence in the form of the KL divergence. Our framework can work with any form of Bregman divergence. At the same time, our framework can also flexibly use momentum and variance-reduced techniques.
2 Related Works
In this section, we review related work on the mirror descent algorithm in RL and on (variance-reduced) PG methods, respectively.
2.1 Mirror Descent Algorithm in RL
Because it can easily handle regularization terms, the mirror descent algorithm has shown significant success in regularized RL. For example, Neu et al. (2017) showed that both dynamic policy programming (Azar et al., 2012) and TRPO (Schulman et al., 2015a) are approximate variants of the mirror descent algorithm. Subsequently, Geist et al. (2019) introduced a general theory of regularized MDPs based on the convex mirror descent algorithm. More recently, Liu et al. (2019) studied the global convergence properties of PPO and TRPO equipped with overparametrized neural networks based on mirror descent iterations. At the same time, Shani et al. (2020) analyzed the global convergence properties of TRPO with tabular policies based on the convex mirror descent algorithm. Wang et al. (2019) proposed divergence-augmented policy optimization for off-policy learning based on the mirror descent algorithm. MDPO (Tomar et al., 2020) iteratively updates the policy beyond tabular RL by approximately solving a trust-region problem based on the convex mirror descent algorithm.
2.2 (Variance-Reduced) PG Methods
PG methods have been widely studied due to their stability and incremental nature in policy optimization. For example, the global convergence properties of the vanilla policy gradient method in infinite-horizon MDPs have recently been studied in (Zhang et al., 2019). Subsequently, Zhang et al. (2020) studied the asymptotic global convergence properties of REINFORCE (Williams, 1992), whose policy gradient is approximated by using a single trajectory or a fixed-size mini-batch of trajectories under softmax parametrization and log-barrier regularization. To accelerate these vanilla PG methods, some faster variance-reduced PG methods have been proposed based on the variance-reduction techniques of SVRG (Johnson and Zhang, 2013), SPIDER (Fang et al., 2018), and STORM (Cutkosky and Orabona, 2019) in stochastic convex and nonconvex optimization. For example, the fast SVRPG algorithm (Papini et al., 2018; Xu et al., 2019a) was proposed based on SVRG. The fast HAPG (Shen et al., 2019) and SRVR-PG (Xu et al., 2019b) algorithms were presented using the SPIDER technique. More recently, momentum-based PG methods (i.e., IS-MBPG and HA-MBPG) (Huang et al., 2020) were developed based on the variance-reduced technique of STORM.
3 Preliminaries
In this section, we review some preliminaries on Markov decision processes and policy gradients.
3.1 Notations
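Let $[n] = \{1, 2, \cdots, n\}$ for all $n \in \mathbb{N}_+$. For a vector $x \in \mathbb{R}^d$, let $\|x\|$ denote the $\ell_2$ norm of $x$, and $\|x\|_p = \big(\sum_{i=1}^d |x_i|^p\big)^{1/p}$ $(p \geq 1)$ denotes the $p$-norm of $x$. For two sequences $\{a_k\}$ and $\{b_k\}$, we denote $a_k = O(b_k)$ if $a_k \leq C b_k$ for some constant $C > 0$, and $\tilde{O}(\cdot)$ hides logarithmic factors. $\mathbb{E}[X]$ and $\mathbb{V}[X]$ denote the expectation and variance of random variable $X$, respectively.
3.2 Markov Decision Process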
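Reinforcement learning generally involves a discrete-time discounted Markov Decision Process (MDP) defined by a tuple $\{\mathcal{S}, \mathcal{A}, \mathbb{P}, r, \gamma, \rho_0\}$. $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces of the agent, respectively. $\mathbb{P}(s'|s,a)$ is the Markov kernel that determines the transition probability from the state $s$ to $s'$ under taking an action $a \in \mathcal{A}$. $r(s,a)$ is the reward function of $s$ and $a$, and $\rho_0$ denotes the initial state distribution. $\gamma \in (0,1)$ is the discount factor. Let $\pi: \mathcal{S} \to \Delta(\mathcal{A})$ be a stationary policy, where $\Delta(\mathcal{A})$ is the set of probability distributions on $\mathcal{A}$.
Given the current state $s_t \in \mathcal{S}$, the agent executes an action $a_t \in \mathcal{A}$ following a conditional probability distribution $\pi(a_t|s_t)$, and then the agent obtains a reward $r_t = r(s_t,a_t)$. At each time $t$, we can define the state-action value function $Q^\pi(s_t,a_t)$ and the state value function $V^\pi(s_t)$ as follows:
$$Q^\pi(s_t,a_t) = \mathbb{E}_{s_{t+1},a_{t+1},\ldots}\Big[\sum_{l=0}^{\infty}\gamma^l r_{t+l}\Big], \quad V^\pi(s_t) = \mathbb{E}_{a_t,s_{t+1},\ldots}\Big[\sum_{l=0}^{\infty}\gamma^l r_{t+l}\Big]. \tag{1}$$
We also define the advantage function $A^\pi(s_t,a_t) = Q^\pi(s_t,a_t) - V^\pi(s_t)$. The goal of the agent is to find the optimal policy by maximizing the expected discounted reward
$$\max_{\pi} J(\pi) := \mathbb{E}_{s_0 \sim \rho_0, \pi}\Big[\sum_{t=0}^{\infty}\gamma^t r(s_t,a_t)\Big]. \tag{2}$$
Consider a time horizon $H$. The agent collects a trajectory $\tau = \{s_t, a_t\}_{t=0}^{H-1}$ under any stationary policy. Then the agent obtains a cumulative discounted reward $r(\tau) = \sum_{t=0}^{H-1}\gamma^t r(s_t,a_t)$. Since the state and action spaces $\mathcal{S}$ and $\mathcal{A}$ are generally very large, directly solving problem (2) is difficult. Thus, we let the policy $\pi$ be parametrized as $\pi_\theta$ for the parameter $\theta \in \Theta \subseteq \mathbb{R}^d$. Given the initial distribution $\rho_0 = p(s_0)$, the probability distribution over the trajectory $\tau$ can be obtained as
$$p(\tau|\theta) = p(s_0)\prod_{t=0}^{H-1}\mathbb{P}(s_{t+1}|s_t,a_t)\,\pi_\theta(a_t|s_t). \tag{3}$$
Thus, problem (2) is equivalent to maximizing the expected discounted trajectory reward:
$$\max_{\theta \in \Theta} J(\theta) := \mathbb{E}_{\tau \sim p(\tau|\theta)}\big[r(\tau)\big]. \tag{4}$$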
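To make the cumulative discounted reward of a trajectory and the trajectory-based objective concrete, the following minimal Python sketch computes the discounted return of one trajectory and a Monte Carlo estimate of the expected return over several sampled trajectories. The function names and the representation of a trajectory as a plain list of per-step rewards are illustrative assumptions, not taken from any particular implementation.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Cumulative discounted reward of one trajectory: sum_t gamma^t * r_t."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))

def estimate_objective(trajectory_rewards, gamma):
    """Monte Carlo estimate of the expected discounted trajectory reward.

    trajectory_rewards: list of reward sequences, one per sampled trajectory.
    """
    returns = [discounted_return(r, gamma) for r in trajectory_rewards]
    return float(np.mean(returns))
```

For instance, with rewards (1, 1, 1) and a discount factor of 0.5, `discounted_return` gives 1 + 0.5 + 0.25 = 1.75.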
3.3 Policy Gradients
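The policy gradient methods (Williams, 1992; Sutton et al., 2000) are a class of effective policy-based methods for solving the above RL problem (4). Specifically, the gradient of $J(\theta)$ with respect to $\theta$ is given as follows:
$$\nabla J(\theta) = \int_{\tau} r(\tau)\,\nabla p(\tau|\theta)\,d\tau = \mathbb{E}_{\tau \sim p(\tau|\theta)}\big[\nabla \log p(\tau|\theta)\, r(\tau)\big]. \tag{5}$$
Given a mini-batch of trajectories $\mathcal{B} = \{\tau_i\}_{i=1}^{n}$ sampled from the distribution $p(\tau|\theta)$, the standard stochastic policy gradient ascent update at the $(k+1)$-th step is defined as
$$\theta_{k+1} = \theta_k + \eta\,\nabla J_{\mathcal{B}}(\theta_k), \tag{6}$$
where $\eta > 0$ is the learning rate, and $\nabla J_{\mathcal{B}}(\theta_k) = \frac{1}{n}\sum_{i=1}^{n} g(\tau_i|\theta_k)$ is the stochastic policy gradient. Given the horizon $H$ chosen as in (Zhang et al., 2019; Shani et al., 2020), $g(\tau|\theta)$ is an unbiased stochastic policy gradient of $J(\theta)$, i.e., $\mathbb{E}[g(\tau|\theta)] = \nabla J(\theta)$, where
$$g(\tau|\theta) = \Big(\sum_{t=0}^{H-1}\nabla_\theta \log \pi_\theta(a_t|s_t)\Big)\Big(\sum_{t=0}^{H-1}\gamma^t r(s_t,a_t)\Big). \tag{7}$$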
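To make the single-trajectory gradient estimator concrete, here is a minimal sketch for a tabular softmax policy. It assumes the common form in which the sum of the score functions $\nabla_\theta \log \pi_\theta(a_t|s_t)$ over one trajectory is multiplied by that trajectory's discounted return. The tabular softmax parametrization and all function names are assumptions made for illustration only.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    weights = np.exp(shifted)
    return weights / weights.sum()

def grad_log_softmax_policy(theta, state, action):
    """Gradient of log pi_theta(action | state) for a tabular softmax policy.

    theta has shape (num_states, num_actions); only the row of `state` is nonzero.
    """
    grad = np.zeros_like(theta)
    grad[state] = -softmax(theta[state])
    grad[state, action] += 1.0
    return grad

def trajectory_gradient(theta, states, actions, rewards, gamma):
    """(sum_t grad log pi(a_t|s_t)) * (sum_t gamma^t r_t) for one trajectory."""
    score = sum(grad_log_softmax_policy(theta, s, a) for s, a in zip(states, actions))
    ret = float(np.dot(gamma ** np.arange(len(rewards)), rewards))
    return score * ret
```

Averaging `trajectory_gradient` over a mini-batch of sampled trajectories gives a stochastic policy gradient that can be plugged into a gradient ascent step.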
Based on the gradient estimator in (7), we can obtain the existing well-known policy gradient estimators such as REINFORCE (Williams, 1992) and PGT (Sutton et al., 2000). Specifically, REINFORCE obtains a policy gradient estimator by adding a baseline , defined as
The PGT is a version of the REINFORCE, defined as
Alternatively, based on the policy gradient theorem (Sutton et al., 2000; Zhang et al., 2019), we also have the following policy gradient form
(8) 
where and denotes a valid probability measure over the state . Here is the probability that state given initial state and policy parameter . Since any function is independent of action , the policy gradient (8) is equivalent to
where is a baseline function. When we choose the state-value function as the baseline function , we can obtain the advantage-based policy gradient
(9) 
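The claim that a state-dependent baseline leaves the policy gradient unchanged can be checked exactly for a softmax policy over a single state, since the expectation over actions is then a finite sum. The sketch below is only an illustration under that assumption; the helper names are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    weights = np.exp(shifted)
    return weights / weights.sum()

def score_matrix(logits):
    """Row a holds grad_logits log pi(a) = e_a - pi for a softmax policy."""
    probs = softmax(logits)
    return np.eye(len(logits)) - probs

def weighted_score(logits, weights):
    """E_{a ~ pi}[ grad log pi(a) * weights[a] ], computed exactly."""
    probs = softmax(logits)
    return (probs * weights) @ score_matrix(logits)

def advantage(logits, q_values):
    """A(a) = Q(a) - V, with V = E_{a ~ pi}[Q(a)] used as the baseline."""
    return q_values - softmax(logits) @ q_values
```

Passing a constant vector as `weights` returns the zero vector, so subtracting any action-independent baseline from the action values does not change the expected gradient; in particular, using `advantage(logits, q_values)` in place of `q_values` gives the same result.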
4 Bregman Gradient Policy Optimization
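In this section, we propose a novel Bregman gradient policy optimization framework based on Bregman divergences and momentum techniques. We first let $f(\theta) = -J(\theta)$, so that the goal of policy-based RL is to solve the following problem:
$$\min_{\theta \in \Theta} f(\theta) := \mathbb{E}_{\tau \sim p(\tau|\theta)}\big[-r(\tau)\big]. \tag{10}$$
So we have $\nabla f(\theta) = -\nabla J(\theta)$.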
(Zhang and He, 2018) Given a function defined on a closed convex set , and the Bregman distance : , then we define a proximal operator of function ,
(11) 
where , is a continuously differentiable and strongly convex function, i.e., . Based on the proximal operator of as in (Zhang and He, 2018; Ghadimi et al., 2016), we define the Bregman gradient as follows:
(12) 
If and , is a stationary point of if and only if . Thus, this Bregman gradient can be regarded as a generalized gradient.
4.1 BGPO Algorithm
In this subsection, we propose a Bregman gradient policy optimization (BGPO) algorithm based on the basic momentum technique. The pseudo code of the BGPO algorithm is provided in Algorithm 1.
In Algorithm 1, step 5 uses the stochastic Bregman gradient descent (a.k.a. stochastic mirror descent) to update the parameter . Let be the first-order approximation of function at , where is a stochastic gradient of function at . By step 5 of Algorithm 1 and the above equality (12), we have
(13) 
where . Then by the step 6 of Algorithm 1, we have
(14) 
where . Due to the convexity of set and , we choose the parameter to ensure the updated sequence in the set .
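The role of convexity in keeping the iterates feasible can be checked on a toy problem. The sketch below assumes a basic momentum estimator of the form $u_t = \beta g_t + (1-\beta) u_{t-1}$, a Euclidean mirror step (which for a box constraint is a clipped gradient step), and an interpolation $\theta_{t+1} = \theta_t + \eta(\tilde{\theta}_{t+1} - \theta_t)$ with $\eta \in (0, 1]$. The box constraint, the noisy quadratic loss, the parameter values, and the function names are hypothetical choices for illustration and are not copied from Algorithm 1.

```python
import numpy as np

def momentum_estimate(u_prev, g, beta):
    """Basic momentum: exponential moving average of stochastic gradients."""
    return beta * g + (1.0 - beta) * u_prev

def euclidean_mirror_step(theta, u, lam, low, high):
    """argmin over the box of <u, z> + ||z - theta||^2 / (2 * lam)."""
    return np.clip(theta - lam * u, low, high)

def interpolate(theta, theta_tilde, eta):
    """Convex combination of two feasible points, feasible for eta in (0, 1]."""
    return theta + eta * (theta_tilde - theta)

def run(theta0, grad_fn, steps, lam=0.5, eta=0.5, beta=0.3,
        low=-1.0, high=1.0, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    u = np.zeros_like(theta)
    for _ in range(steps):
        # Noisy gradient of the loss being minimized.
        g = grad_fn(theta) + 0.1 * rng.standard_normal(theta.shape)
        u = momentum_estimate(u, g, beta)
        theta_tilde = euclidean_mirror_step(theta, u, lam, low, high)
        theta = interpolate(theta, theta_tilde, eta)
    return theta
```

Because both the current iterate and the mirror-step output lie in the convex box, every interpolated iterate stays inside it, whatever the gradient noise.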
In fact, our algorithm unifies many popular policy optimization algorithms. When the mirror mappings for , then the update (14) reduces to classic policy gradient algorithms (Sutton et al., 2000; Zhang et al., 2019, 2020). In other words, given and , we have and
(15) 
When the mirror mappings with for , the update (14) is reduced to natural policy gradient algorithms (Kakade, 2001; Liu et al., 2020). Similarly, given and , we have and
(16) 
where denotes the Moore-Penrose pseudoinverse of the Fisher information matrix .
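The two special cases above can be illustrated with a quadratic mirror mapping. The sketch assumes an unconstrained parameter space and a mirror mapping of the form $\frac{1}{2}\theta^\top H \theta$ with a symmetric positive semidefinite matrix $H$, for which the mirror step has a closed form. With $H$ equal to the identity it is a plain gradient step, and with $H$ equal to a Fisher information matrix it is a natural-gradient-style step that uses the pseudoinverse. The function names and the single-state softmax Fisher matrix are assumptions for illustration.

```python
import numpy as np

def quadratic_mirror_step(theta, u, lam, H=None):
    """Minimizer of <u, z> + (z - theta)^T H (z - theta) / (2 * lam).

    H=None stands for the identity. For a singular H the pseudoinverse
    picks the minimum-norm step.
    """
    if H is None:
        return theta - lam * u
    return theta - lam * (np.linalg.pinv(H) @ u)

def softmax_fisher(probs):
    """Fisher information of a single-state softmax policy w.r.t. its logits."""
    return np.diag(probs) - np.outer(probs, probs)
```

Here `u` stands for a gradient estimate of the loss being minimized, so the same helper covers both the Euclidean and the Fisher-preconditioned case by swapping `H`.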
4.2 VR-BGPO Algorithm
In this subsection, we propose a faster variance-reduced Bregman gradient policy optimization (VR-BGPO) algorithm based on a variance-reduced technique. The pseudo code of the VR-BGPO algorithm is provided in Algorithm 2.
Since problem (4) is non-oblivious, i.e., the distribution $p(\tau|\theta)$ depends on the variable $\theta$, which varies throughout the optimization procedure, we apply the importance sampling weight (Papini et al., 2018; Xu et al., 2019a) in estimating our policy gradient , defined as
Except for the different stochastic policy gradients and tuning parameters used in Algorithms 1 and 2, steps 5 and 6 in these algorithms for updating the parameter are the same. Interestingly, when choosing mirror mapping , our VR-BGPO algorithm reduces to a non-adaptive version of the IS-MBPG algorithm (Huang et al., 2020).
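A minimal sketch of how an importance sampling weight can enter a momentum-based variance-reduced estimator is given below. It assumes that the weight is the ratio of the trajectory likelihood under the parameters at which the old gradient is evaluated to the likelihood under the parameters that generated the trajectory, so that the transition probabilities cancel and only per-step policy log-probabilities are needed. The STORM-style recursion and all names here are assumptions for illustration, not a transcription of Algorithm 2.

```python
import numpy as np

def importance_weight(logp_numerator, logp_denominator):
    """prod_t pi_num(a_t|s_t) / pi_den(a_t|s_t), from per-step log-probabilities."""
    log_w = np.sum(np.asarray(logp_numerator) - np.asarray(logp_denominator))
    return float(np.exp(log_w))

def storm_style_estimate(u_prev, g_new, g_old, weight, beta):
    """beta * g_new + (1 - beta) * (u_prev + g_new - weight * g_old).

    g_new: gradient estimate at the current parameters.
    g_old: gradient estimate at the previous parameters on the same trajectory.
    weight: importance weight correcting g_old for the sampling distribution.
    """
    return beta * g_new + (1.0 - beta) * (u_prev + g_new - weight * g_old)
```

With `beta = 1` the recursion falls back to the plain stochastic gradient, and when the two parameter vectors coincide (weight 1 and equal gradients) it reduces to basic momentum.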
5 Convergence Analysis
In this section, we will analyze the convergence properties of the proposed algorithms. All related proofs are provided in the Appendix.
5.1 Convergence Metric
In this subsection, we give a reasonable metric to measure the convergence of our algorithms, defined as
(17) 
where depends on the strongly convex function . This form of convergence metric is also used in the SUPER-ADAM algorithm (Huang et al., 2021). When , we have and . Thus, we have
(18) 
In fact, our metric generalizes the metric used in (Zhang and He, 2018; Yang et al., 2019). In particular, when and , we have and
(19) 
When , we can obtain .
5.2 Some Mild Assumptions
In this subsection, we first give some standard assumptions.
Assumption 1
For function , its gradient and Hessian matrix are bounded, i.e., there exist constants such that
(20) 
Assumption 2
Variance of stochastic gradient is bounded, i.e., there exists a constant , for all such that .
Assumption 3
For importance sampling weight , its variance is bounded, i.e., there exists a constant , it follows for any and .
Assumption 4
The function has an upper bound in , i.e., .
Assumptions 1 and 2 are standard in PG algorithms (Papini et al., 2018; Xu et al., 2019a, b). Assumption 3 is widely used in the study of variance-reduced PG algorithms (Papini et al., 2018; Xu et al., 2019a). In fact, the bounded importance sampling weight assumption might be violated in some cases, such as when neural networks are used as the policy. Thus, we can clip the importance sampling weights to guarantee the effectiveness of our algorithms, as in (Papini et al., 2018). Assumption 4 guarantees the feasibility of problem (4). Next, we provide some useful lemmas based on the above assumptions.
5.3 Convergence Analysis of BGPO Algorithm
In this subsection, we provide convergence properties of the BGPO algorithm. The detailed proof is provided in Appendix A1.
Assume the sequence is generated by Algorithm 1. Let and for all , , , , and , we have
where .
Without loss of generality, let , and , we have . Theorem 5.3 shows that the BGPO algorithm has a convergence rate of $\tilde{O}(1/T^{1/4})$. Requiring this rate to be at most $\epsilon$, we have $T = \tilde{O}(\epsilon^{-4})$. Since the BGPO algorithm only needs one trajectory to estimate the stochastic policy gradient at each iteration and runs $T$ iterations, it has a sample complexity of $\tilde{O}(\epsilon^{-4})$ for finding an $\epsilon$-stationary point.
5.4 Convergence Analysis of VR-BGPO Algorithm
In this subsection, we give convergence properties of the VR-BGPO algorithm. The detailed proof is provided in Appendix A2.
Suppose the sequence is generated by Algorithm 2. Let and for all , , , and , we have
(22) 
where and .
Without loss of generality, let , and , we have . Theorem 5.4 shows that the VR-BGPO algorithm has a convergence rate of $\tilde{O}(1/T^{1/3})$. Requiring this rate to be at most $\epsilon$, we have $T = \tilde{O}(\epsilon^{-3})$. Since the VR-BGPO algorithm only needs one trajectory to estimate the stochastic policy gradient at each iteration and runs $T$ iterations, it reaches a lower sample complexity of $\tilde{O}(\epsilon^{-3})$ for finding an $\epsilon$-stationary point.
6 Experiments
In this section, we conduct experiments on several RL tasks to verify the effectiveness of our methods. We first study the effect of different choices of Bregman divergences on our algorithms (BGPO and VR-BGPO), and then we compare our VR-BGPO algorithm with other state-of-the-art methods such as TRPO (Schulman et al., 2015a), PPO (Schulman et al., 2017), VRMPO (Yang et al., 2019), and MDPO (Tomar et al., 2020).
6.1 Effects of Bregman Divergences
In this subsection, we examine how different Bregman divergences affect the performance of our algorithms. In the first setting, we let mirror mapping with different $p$ to test the performance of our algorithms. Let be the conjugate mapping of , where . According to (Beck and Teboulle, 2003), when , the update of in our algorithms can be calculated by , where and are $p$-norm link functions, and , , and is the coordinate index of and . In the second setting, we apply a diagonal term to the mirror mapping , where is a diagonal matrix with positive values. In the experiments, we generate , , and , as in the SUPER-ADAM algorithm (Huang et al., 2021). Then we have . Under this setting, the update of can also be solved analytically .
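The closed-form update through $p$-norm link functions can be sketched as follows. The code assumes the standard choice of mirror mapping $\frac{1}{2}\|x\|_p^2$ with $p > 1$, whose conjugate is $\frac{1}{2}\|y\|_q^2$ with $1/p + 1/q = 1$, an unconstrained parameter space, and a minimization step of the form "map to the dual space, take a gradient step, map back". These choices, the step sign, and the function names are assumptions for illustration and may differ from the exact setup used in our experiments.

```python
import numpy as np

def grad_half_pnorm_sq(x, p):
    """Gradient of 0.5 * ||x||_p^2, i.e. the p-norm link function (p > 1)."""
    norm = np.linalg.norm(x, ord=p)
    if norm == 0.0:
        return np.zeros_like(x)
    return np.sign(x) * np.abs(x) ** (p - 1) / norm ** (p - 2)

def pnorm_mirror_step(theta, u, lam, p):
    """Unconstrained mirror descent step with mirror map 0.5 * ||.||_p^2."""
    q = p / (p - 1.0)                            # conjugate exponent
    dual = grad_half_pnorm_sq(theta, p) - lam * u  # step in the dual space
    return grad_half_pnorm_sq(dual, q)            # map back with the conjugate link
```

The two link functions are inverse to each other, and for `p = 2` the step is the usual gradient step `theta - lam * u`.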
To test the effectiveness of the two different Bregman divergences, we evaluate them on three classic control environments from gym (Brockman et al., 2016): CartPole-v1, Acrobot-v1, and MountainCarContinuous-v0. In the experiment, a categorical policy is used for the CartPole and Acrobot environments, and a Gaussian policy is used for MountainCar. Gaussian value functions are used in all settings. All policies and value functions are parameterized by multilayer perceptrons (MLPs). For a fair comparison, all settings use the same initialization for policies. We run each setting five times and plot the mean and variance of average returns.
For the norm mapping, we test three different values of $p$. For the diagonal mapping, we set and . We set hyperparameters to be the same. still needs to be tuned for different $p$ to achieve relatively good performance. For simplicity, we use BGPO-Diag to represent BGPO with the diagonal mapping, and we use BGPO-$l_p$ to represent BGPO with the norm mapping. Details about the setup of environments and hyperparameters are provided in Appendix B.
From Fig. 1, we can see that BGPO-Diag largely outperforms BGPO-$l_p$ with different choices of $p$. The parameter tuning of BGPO-$l_p$ is much more difficult than that of BGPO-Diag because each $p$ requires an individual to achieve the desired performance. To reach the best possible performance may also need to be tuned. In contrast, the parameter tuning of BGPO-Diag is much simpler, and the performance is also more stable.
6.2 Comparison between BGPO and VR-BGPO
To understand the effectiveness of the variance-reduced technique used in our VR-BGPO algorithm, we compare BGPO and VR-BGPO using the same settings introduced in Section 6.1. Both algorithms use the diagonal mapping as the mirror mapping, since it performs much better than the norm mapping.
From Fig. 2, we can see that VR-BGPO outperforms BGPO in all three environments. In CartPole, both algorithms converge very fast and have similar performance, but VR-BGPO is more stable than BGPO. The advantage of VR-BGPO becomes larger in the Acrobot and MountainCar environments, probably because these tasks are more difficult than CartPole.
6.3 Comparison with Other Methods
In this subsection, we compare our VR-BGPO algorithm with other methods, since it performs better than BGPO as shown in Section 6.2. Specifically, we compare VR-BGPO with two popular policy optimization algorithms: PPO (Schulman et al., 2017) and TRPO (Schulman et al., 2015a). We also compare our algorithm with recently proposed mirror-descent-based policy optimization algorithms: VRMPO (Yang et al., 2019) and MDPO (Tomar et al., 2020). For VR-BGPO, we continue to use the diagonal mapping as the mirror mapping. For VRMPO, we follow their implementation and use the norm mapping. For MDPO, the mirror mapping is the negative Shannon entropy, and the Bregman divergence becomes the KL divergence.
To evaluate the performance of these algorithms, we test them on six gym (Brockman et al., 2016) environments with continuous control tasks: InvertedPendulum-v2, InvertedDoublePendulum-v2, Walker2d-v2, Reacher-v2, Swimmer-v2, and HalfCheetah-v2. We use Gaussian policies and Gaussian value functions for all environments, and both of them are parameterized by MLPs. To ensure a fair comparison, all policies use the same initialization. For TRPO and PPO, we use the implementations provided by garage (garage contributors, 2019). We carefully implement MDPO and VRMPO following the descriptions provided in the original papers. All methods, including ours, are implemented with garage (garage contributors, 2019) and PyTorch (Paszke et al., 2019). We run all algorithms ten times on each environment and report the mean and variance of average returns. Details about the setup of environments and hyperparameters are also provided in Appendix B.
From Fig. 3, we can see that VR-BGPO consistently outperforms all the other methods on the six environments, demonstrating the effectiveness of our algorithm. Specifically, our method achieves the best mean average return in InvertedDoublePendulum, Swimmer, and HalfCheetah. Our method converges faster than other methods in all environments, especially in InvertedPendulum, InvertedDoublePendulum, and Reacher. MDPO can achieve good results in some environments, but it cannot outperform PPO or TRPO in Swimmer and InvertedDoublePendulum. VRMPO only outperforms PPO and TRPO in Reacher and InvertedDoublePendulum. The undesirable performance of VRMPO is probably because it uses the norm mapping as its mirror mapping, which requires careful tuning of the learning rate.
7 Conclusion
In this paper, we proposed a novel Bregman gradient policy optimization framework for reinforcement learning based on mirror descent iteration and momentum techniques. Moreover, we studied the convergence properties of the proposed policy optimization methods under the nonconvex setting.
We thank the IT Help Desk at University of Pittsburgh. This work was partially supported by NSF IIS 1836945, IIS 1836938, IIS 1845666, IIS 1852606, IIS 1838627, IIS 1837956.
References
Agarwal et al. (2019) Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. arXiv preprint arXiv:1908.00261, 2019.
Azar et al. (2012) Mohammad Gheshlaghi Azar, Vicenç Gómez, and Hilbert J Kappen. Dynamic policy programming. The Journal of Machine Learning Research, 13(1):3207–3245, 2012.
Beck and Teboulle (2003) Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.
Cen et al. (2020) Shicong Cen, Chen Cheng, Yuxin Chen, Yuting Wei, and Yuejie Chi. Fast global convergence of natural policy gradient methods with entropy regularization. arXiv preprint arXiv:2007.06558, 2020.
Cortes et al. (2010) Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Learning bounds for importance weighting. In Advances in Neural Information Processing Systems, pages 442–450, 2010.
Cutkosky and Orabona (2019) Ashok Cutkosky and Francesco Orabona. Momentum-based variance reduction in non-convex SGD. In Advances in Neural Information Processing Systems, pages 15210–15219, 2019.
Deisenroth et al. (2013) Marc Peter Deisenroth, Gerhard Neumann, Jan Peters, et al. A survey on policy search for robotics. Foundations and Trends® in Robotics, 2(1–2):1–142, 2013.
Fang et al. (2018) Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pages 689–699, 2018.
garage contributors (2019) The garage contributors. Garage: A toolkit for reproducible reinforcement learning research. https://github.com/rlworkgroup/garage, 2019.
Geist et al. (2019) Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A theory of regularized Markov decision processes. In Thirty-sixth International Conference on Machine Learning, 2019.
Ghadimi et al. (2016) Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1–2):267–305, 2016.
Huang et al. (2020) Feihu Huang, Shangqian Gao, Jian Pei, and Heng Huang. Momentum-based policy gradient methods. In International Conference on Machine Learning, pages 4422–4433. PMLR, 2020.
Huang et al. (2021) Feihu Huang, Junyi Li, and Heng Huang. SUPER-ADAM: Faster and universal framework of adaptive gradients. arXiv preprint arXiv:2106.08208, 2021.
Jin and Sidford (2020) Yujia Jin and Aaron Sidford. Efficiently solving MDPs with stochastic mirror descent. In International Conference on Machine Learning, pages 4890–4900. PMLR, 2020.
Johnson and Zhang (2013) Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pages 315–323, 2013.
Kakade (2001) Sham M Kakade. A natural policy gradient. Advances in Neural Information Processing Systems, 14:1531–1538, 2001.
Konda and Tsitsiklis (2000) Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in Neural Information Processing Systems, pages 1008–1014, 2000.
Lan (2021) Guanghui Lan. Policy mirror descent for reinforcement learning: Linear convergence, new sampling complexity, and generalized problem classes. arXiv preprint arXiv:2102.00135, 2021.
Li (2017) Yuxi Li. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274, 2017.
Liu et al. (2019) Boyi Liu, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural proximal/trust region policy optimization attains globally optimal policy. arXiv preprint arXiv:1906.10306, 2019.
Liu et al. (2020) Yanli Liu, Kaiqing Zhang, Tamer Basar, and Wotao Yin. An improved analysis of (variance-reduced) policy gradient and natural policy gradient methods. Advances in Neural Information Processing Systems, 33, 2020.
Neu et al. (2017) Gergely Neu, Anders Jonsson, and Vicenç Gómez. A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798, 2017.
Papini et al. (2018) Matteo Papini, Damiano Binaghi, Giuseppe Canonaco, Matteo Pirotta, and Marcello Restelli. Stochastic variance-reduced policy gradient. In 35th International Conference on Machine Learning, volume 80, pages 4026–4035, 2018.
Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035, 2019.
Schulman et al. (2015a) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015a.
Schulman et al. (2015b) John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015b.
Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Shalev-Shwartz et al. (2016) Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295, 2016.
Shani et al. (2020) Lior Shani, Yonathan Efroni, and Shie Mannor. Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 5668–5675, 2020.
Shannon (1948) Claude E Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948.
Shen et al. (2019) Zebang Shen, Alejandro Ribeiro, Hamed Hassani, Hui Qian, and Chao Mi. Hessian aided policy gradient. In International Conference on Machine Learning, pages 5729–5738, 2019.
Silver et al. (2017) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.
Sutton et al. (2000) Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.
Tomar et al. (2020) Manan Tomar, Lior Shani, Yonathan Efroni, and Mohammad Ghavamzadeh. Mirror descent policy optimization. arXiv preprint arXiv:2005.09814, 2020.
Wang et al. (2019) Qing Wang, Yingru Li, Jiechao Xiong, and Tong Zhang. Divergence-augmented policy optimization. In Advances in Neural Information Processing Systems, pages 6099–6110, 2019.
Williams (1992) Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256, 1992.
Xu et al. (2019a) Pan Xu, Felicia Gao, and Quanquan Gu. An improved convergence analysis of stochastic variance-reduced policy gradient. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, page 191, 2019a.
Xu et al. (2019b) Pan Xu, Felicia Gao, and Quanquan Gu. Sample efficient policy gradient methods with recursive variance reduction. arXiv preprint arXiv:1909.08610, 2019b.
Yang et al. (2019) Long Yang, Gang Zheng, Haotian Zhang, Yu Zhang, Qian Zheng, Jun Wen, and Gang Pan. Policy optimization with stochastic mirror descent. arXiv preprint arXiv:1906.10462, 2019.
Zhan et al. (2021) Wenhao Zhan, Shicong Cen, Baihe Huang, Yuxin Chen, Jason D Lee, and Yuejie Chi. Policy mirror descent for regularized reinforcement learning: A generalized framework with linear convergence. arXiv preprint arXiv:2105.11066, 2021.
Zhang et al. (2020) Junzi Zhang, Jongho Kim, Brendan O’Donoghue, and Stephen Boyd. Sample efficient reinforcement learning with REINFORCE. arXiv preprint arXiv:2010.11364, 2020.
Zhang et al. (2019) Kaiqing Zhang, Alec Koppel, Hao Zhu, and Tamer Başar. Global convergence of policy gradient methods to (almost) locally optimal policies. arXiv preprint arXiv:1906.08383, 2019.
Zhang and He (2018) Siqi Zhang and Niao He. On the convergence rate of stochastic mirror descent for nonsmooth nonconvex optimization. arXiv preprint arXiv:1806.04781, 2018.