Policy Gradient (PG) methods are a popular class of policy optimization methods for Reinforcement Learning (RL), and have achieved significant successes in many challenging applications (Li, 2017) such as robot manipulation (Deisenroth et al., 2013), the game of Go (Silver et al., 2017) and autonomous driving (Shalev-Shwartz et al., 2016). In general, PG methods directly search for the optimal policy by maximizing the expected total reward of the Markov Decision Processes (MDPs) involved in RL, where an agent takes actions dictated by a policy in an unknown dynamic environment over a sequence of time steps. Since PGs are generally estimated by Monte-Carlo sampling, vanilla PG methods usually suffer from very high variance, resulting in slow convergence and instability. Thus, many fast PG methods have recently been proposed to reduce the variance of the vanilla stochastic PG. For example, Sutton et al. (2000) introduced a baseline to reduce the variance of the stochastic PG. Konda and Tsitsiklis (2000) proposed an efficient actor-critic algorithm that estimates the value function to mitigate the effects of large variance. Schulman et al. (2015b) proposed generalized advantage estimation (GAE) to control both the bias and the variance of the policy gradient. More recently, some faster variance-reduced PG methods (Papini et al., 2018; Xu et al., 2019a; Shen et al., 2019; Liu et al., 2020) have been developed based on variance-reduction techniques from stochastic optimization.
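As a hedged illustration of why a baseline reduces the variance of Monte-Carlo policy gradient estimates (a toy sketch of the general idea, not any of the cited algorithms), consider a one-step softmax policy over three actions; all function names and the reward values here are ours:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def score_gradient(theta, a):
    """Score function of a softmax policy: grad of log pi(a) equals e_a - pi."""
    g = -softmax(theta)
    g[a] += 1.0
    return g

rng = np.random.default_rng(0)
theta = np.zeros(3)
rewards = np.array([1.0, 2.0, 3.0])   # deterministic reward per action
pi = softmax(theta)

# Monte-Carlo policy gradient estimates with and without a baseline.
n = 20000
acts = rng.choice(3, size=n, p=pi)
b = rewards @ pi                      # expected reward as the baseline
g_plain = np.stack([score_gradient(theta, a) * rewards[a] for a in acts])
g_base = np.stack([score_gradient(theta, a) * (rewards[a] - b) for a in acts])

# Both estimators target the same gradient (the baseline term has zero mean),
# but subtracting the baseline shrinks the per-sample variance.
var_plain = g_plain.var(axis=0).sum()
var_base = g_base.var(axis=0).sum()
print(var_base < var_plain)  # True: the baseline reduces total variance
```

With the expected reward as the baseline, both estimators remain unbiased for the true policy gradient, while the variance of each coordinate drops.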
Alternatively, some successful PG algorithms (Schulman et al., 2015a, 2017) improve the convergence rate and robustness of vanilla PG methods by using penalties such as the Kullback-Leibler (KL) divergence. For example, trust-region policy optimization (TRPO) (Schulman et al., 2015a) ensures that the newly selected policy stays close to the old one by using a KL-divergence constraint, while proximal policy optimization (PPO) (Schulman et al., 2017) clips the weighted likelihood ratio to implicitly reach this goal. Subsequently, Shani et al. (2020) analyzed the global convergence properties of TRPO in tabular RL based on the convex mirror descent algorithm. Liu et al. (2019) studied the global convergence properties of PPO and TRPO equipped with overparametrized neural networks based on mirror descent iterations. At the same time, Yang et al. (2019) proposed PG methods based on the mirror descent algorithm. More recently, mirror descent policy optimization (MDPO) (Tomar et al., 2020) iteratively updates the policy beyond tabular RL by approximately solving a trust region problem based on the convex mirror descent algorithm. In addition, Agarwal et al. (2019) and Cen et al. (2020) studied natural PG methods for regularized RL. However, Agarwal et al. (2019) mainly focus on the tabular, log-linear, and neural policy classes, while Cen et al. (2020) mainly focus on the softmax policy class.
Table 1: Sample complexities of representative PG algorithms based on the mirror descent algorithm.

| Algorithm | Reference | Sample Complexity | Convergence |
| --- | --- | --- | --- |
| TRPO | Shani et al. (2020) | — | — |
| Regularized TRPO | Shani et al. (2020) | — | — |
| TRPO/PPO | Liu et al. (2019) | — | — |
| VRMPO | Yang et al. (2019) | — | — |
| MDPO | Tomar et al. (2020) | Unknown | Unknown |
Although these specific PG methods based on mirror descent iterations have recently been studied, the results are scattered across empirical and theoretical works, and a universal framework that does not rely on specific RL tasks is still lacking. In particular, there is still no convergence analysis of mirror-descent-based PG methods under the nonconvex setting. Since mirror descent adjusts gradient updates to fit the problem geometry and is useful in regularized RL (Geist et al., 2019), there exists an important problem to be addressed:
Can we design a universal policy optimization framework based on the mirror descent algorithm, and provide a convergence guarantee for it under the nonconvex setting?
In this paper, we answer the above challenging question affirmatively and propose an efficient Bregman gradient policy optimization framework based on Bregman divergences and momentum techniques. In particular, we provide a convergence analysis framework for PG methods based on mirror descent iterations under the nonconvex setting. Our main contributions are as follows:
We propose an effective Bregman gradient policy optimization (BGPO) algorithm based on a basic momentum technique, which achieves a sample complexity of $O(\epsilon^{-4})$ for finding an $\epsilon$-stationary point while requiring only one trajectory at each iteration.
We further propose an accelerated Bregman gradient policy optimization (VR-BGPO) algorithm based on a momentum-based variance-reduction technique. Moreover, we prove that VR-BGPO reaches the best known sample complexity of $O(\epsilon^{-3})$ under the nonconvex setting.
We design a unified policy optimization framework based on mirror descent iterations and momentum techniques, and provide its convergence analysis under the nonconvex setting.
Table 1 shows the sample complexities of representative PG algorithms based on the mirror descent algorithm. Shani et al. (2020) and Liu et al. (2019) established the global convergence of mirror descent variants of PG under some pre-specified settings, such as over-parameterized networks (Liu et al., 2019), by exploiting the hidden convex nature of these specific problems. Without these special structures, the global convergence of these methods cannot be achieved. In contrast, our framework does not rely on any specific policy class, and our convergence analysis builds only on the general nonconvex setting. Thus, we only prove that our methods converge to stationary points.
Geist et al. (2019); Jin and Sidford (2020); Lan (2021); Zhan et al. (2021) studied a general theory of regularized MDPs based on the policy function space, which is generally discontinuous. Since both the state and action spaces are generally very large in practice, the policy function space is large. In contrast, our methods build on the policy parameter space, which is generally a continuous Euclidean space and relatively small. Hence, our methods and theoretical results are more practical than the results in (Geist et al., 2019; Jin and Sidford, 2020; Lan, 2021; Zhan et al., 2021). Tomar et al. (2020) also propose a mirror descent PG framework based on the policy parameter space, but they do not provide any theoretical results and only consider the Bregman divergence taking the form of the KL divergence. Our framework can work with any form of Bregman divergence. At the same time, our framework can also flexibly use momentum and variance-reduction techniques.
2 Related Works
In this section, we review related work on the mirror descent algorithm in RL and on (variance-reduced) PG methods, respectively.
2.1 Mirror Descent Algorithm in RL
Since it easily handles regularization terms, the mirror descent algorithm has shown significant successes in regularized RL. For example, Neu et al. (2017) showed that both the dynamic policy programming (Azar et al., 2012) and TRPO (Schulman et al., 2015a) algorithms are approximate variants of the mirror descent algorithm. Subsequently, Geist et al. (2019) introduced a general theory of regularized MDPs based on the convex mirror descent algorithm. More recently, Liu et al. (2019) studied the global convergence properties of PPO and TRPO equipped with overparametrized neural networks based on mirror descent iterations. At the same time, Shani et al. (2020) analyzed the global convergence properties of TRPO with tabular policies based on the convex mirror descent algorithm. Wang et al. (2019) proposed divergence-augmented policy optimization for off-policy learning based on the mirror descent algorithm. MDPO (Tomar et al., 2020) iteratively updates the policy beyond tabular RL by approximately solving a trust region problem based on the convex mirror descent algorithm.
2.2 (Variance-Reduced) PG Methods
PG methods have been widely studied due to their stability and incremental nature in policy optimization. For example, the global convergence properties of the vanilla policy gradient method in infinite-horizon MDPs were recently studied in (Zhang et al., 2019). Subsequently, Zhang et al. (2020) studied the asymptotic global convergence properties of REINFORCE (Williams, 1992), whose policy gradient is approximated using a single trajectory or a fixed-size mini-batch of trajectories under softmax parametrization and log-barrier regularization. To accelerate these vanilla PG methods, some faster variance-reduced PG methods have been proposed based on the variance-reduction techniques of SVRG (Johnson and Zhang, 2013), SPIDER (Fang et al., 2018), and STORM (Cutkosky and Orabona, 2019) from stochastic convex and nonconvex optimization. For example, the fast SVRPG algorithm (Papini et al., 2018; Xu et al., 2019a) was proposed based on SVRG. The fast HAPG (Shen et al., 2019) and SRVR-PG (Xu et al., 2019b) algorithms were presented using the SPIDER technique. More recently, momentum-based PG methods (i.e., IS-MBPG and HA-MBPG) (Huang et al., 2020) have been developed based on the variance-reduction technique of STORM.
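The STORM-style momentum estimator underlying these momentum-based methods can be sketched on a simple stochastic quadratic problem (a minimal sketch under our own toy setup, not the cited methods' implementations):

```python
import numpy as np

rng = np.random.default_rng(1)

def stoch_grad(x, xi):
    # Stochastic gradient of f(x) = E[(x - xi)^2] / 2 with xi ~ N(0, 1),
    # so f is minimized at x = E[xi] = 0.
    return x - xi

x, beta, lr = 5.0, 0.2, 0.1
xi = rng.normal()
d = stoch_grad(x, xi)                 # initialize with a plain stochastic gradient
for _ in range(500):
    x_new = x - lr * d
    xi = rng.normal()                 # ONE fresh sample, evaluated at BOTH iterates
    # STORM update: momentum plus a variance-cancelling correction term.
    d = stoch_grad(x_new, xi) + (1 - beta) * (d - stoch_grad(x, xi))
    x = x_new
# x is now close to the minimizer 0.
```

The key point is that each fresh sample is evaluated at both the new and the old iterate, so the correction term cancels most of the gradient noise without ever needing a large batch.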
3 Preliminaries
In this section, we review some preliminaries of Markov decision processes and policy gradients.
3.2 Markov Decision Process
Reinforcement learning generally involves a discrete-time discounted Markov Decision Process (MDP) defined by a tuple $\{\mathcal{S}, \mathcal{A}, \mathbb{P}, r, \rho_0, \gamma\}$. $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces of the agent, respectively. $\mathbb{P}(s'|s, a)$ is the Markov kernel that determines the transition probability from the state $s$ to $s'$ under taking an action $a$. $r(s, a)$ is the reward function of $s$ and $a$, and $\rho_0$ denotes the initial state distribution. $\gamma \in (0, 1)$ is the discount factor. Let $\pi : \mathcal{S} \rightarrow \mathcal{P}(\mathcal{A})$ be a stationary policy, where $\mathcal{P}(\mathcal{A})$ is the set of probability distributions on the action space $\mathcal{A}$.
Given the current state $s_t$, the agent executes an action $a_t$ following the conditional probability distribution $\pi(\cdot|s_t)$, and then the agent obtains a reward $r(s_t, a_t)$. At each time $t$, we can define the state-action value function and the state value function as follows:
$$Q^{\pi}(s, a) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, s_0 = s,\, a_0 = a,\, \pi\Big], \qquad V^{\pi}(s) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, s_0 = s,\, \pi\Big].$$
We also define the advantage function $A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$. The goal of the agent is to find the optimal policy by maximizing the expected discounted reward
$$\max_{\pi}\; \mathbb{E}_{s \sim \rho_0}\big[V^{\pi}(s)\big]. \qquad (2)$$
Consider a finite time horizon $H$; the agent collects a trajectory $\tau = \{s_0, a_0, \ldots, s_{H-1}, a_{H-1}\}$ under any stationary policy. Then the agent obtains a cumulative discounted reward $R(\tau) = \sum_{t=0}^{H-1} \gamma^t r(s_t, a_t)$. Since the state and action spaces $\mathcal{S}$ and $\mathcal{A}$ are generally very large, directly solving the problem (2) is difficult. Thus, we let the policy be parametrized as $\pi_{\theta}$ for the parameter $\theta \in \mathbb{R}^d$. Given the initial distribution $\rho_0$, the probability distribution over trajectory $\tau$ can be obtained as
$$p(\tau|\theta) = \rho_0(s_0) \prod_{t=0}^{H-1} \mathbb{P}(s_{t+1}|s_t, a_t)\, \pi_{\theta}(a_t|s_t).$$
Thus, the problem (2) will be equivalent to maximizing the expected discounted trajectory reward:
$$\max_{\theta \in \mathbb{R}^d}\; J(\theta) := \mathbb{E}_{\tau \sim p(\tau|\theta)}\big[R(\tau)\big]. \qquad (4)$$
3.3 Policy Gradients
The policy gradient methods (Williams, 1992; Sutton et al., 2000) are a class of effective policy-based methods for solving the above RL problem (4). Specifically, the gradient of $J(\theta)$ with respect to $\theta$ is given as follows:
$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim p(\tau|\theta)}\big[\nabla_{\theta} \log p(\tau|\theta)\, R(\tau)\big] = \mathbb{E}_{\tau \sim p(\tau|\theta)}\Big[\sum_{t=0}^{H-1} \nabla_{\theta} \log \pi_{\theta}(a_t|s_t)\, R(\tau)\Big].$$
Given a mini-batch of trajectories $\{\tau_i\}_{i=1}^{B}$ sampled from the distribution $p(\tau|\theta)$, the standard stochastic policy gradient ascent update at the $t$-th step is defined as
$$\theta_{t+1} = \theta_t + \eta\, \frac{1}{B} \sum_{i=1}^{B} g(\tau_i|\theta_t), \qquad (7)$$
where $g(\tau|\theta)$ is a stochastic estimator of $\nabla_{\theta} J(\theta)$.
Based on the gradient estimator in (7), we can obtain the existing well-known policy gradient estimators such as REINFORCE (Williams, 1992) and PGT (Sutton et al., 2000). Specifically, REINFORCE obtains a policy gradient estimator by adding a baseline $b$, defined as
$$g(\tau|\theta) = \sum_{h=0}^{H-1} \nabla_{\theta} \log \pi_{\theta}(a_h|s_h) \Big(\sum_{t=0}^{H-1} \gamma^t r(s_t, a_t) - b\Big). \qquad (8)$$
The PGT is a version of REINFORCE, defined as
$$g(\tau|\theta) = \sum_{h=0}^{H-1} \nabla_{\theta} \log \pi_{\theta}(a_h|s_h) \Big(\sum_{t=h}^{H-1} \gamma^t r(s_t, a_t) - b\Big).$$
Here $d^{\pi_{\theta}}_{\rho_0}(s) := (1-\gamma)\sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s \mid \rho_0, \theta)$ denotes a valid probability measure over the states $s \in \mathcal{S}$, where $\Pr(s_t = s \mid \rho_0, \theta)$ is the probability of reaching state $s$ at time $t$ given the initial state distribution $\rho_0$ and the policy parameter $\theta$. Since any function $b(s)$ of the state is independent of the action $a$, the policy gradient (8) is equivalent in expectation to
$$\nabla_{\theta} J(\theta) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi_{\theta}}_{\rho_0},\, a \sim \pi_{\theta}(\cdot|s)}\big[\nabla_{\theta} \log \pi_{\theta}(a|s)\, \big(Q^{\pi_{\theta}}(s, a) - b(s)\big)\big],$$
where $b(s)$ is a baseline function. When we choose the state-value function $V^{\pi_{\theta}}(s)$ as the baseline function $b(s)$, we can obtain the advantage-based policy gradient
$$\nabla_{\theta} J(\theta) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi_{\theta}}_{\rho_0},\, a \sim \pi_{\theta}(\cdot|s)}\big[\nabla_{\theta} \log \pi_{\theta}(a|s)\, A^{\pi_{\theta}}(s, a)\big].$$
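For concreteness, the REINFORCE and PGT estimators above can be sketched for a tabular softmax policy; the helper names and the toy trajectory below are ours, not the paper's:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_log_pi(theta, s, a):
    """Score of a tabular softmax policy theta[s] over actions: e_a - pi."""
    g = np.zeros_like(theta)
    g[s] = -softmax(theta[s])
    g[s, a] += 1.0
    return g

def reinforce_grad(theta, traj, gamma, b=0.0):
    # g(tau) = sum_h grad log pi(a_h|s_h) * (sum_t gamma^t r_t - b)
    ret = sum(gamma**t * r for t, (_, _, r) in enumerate(traj)) - b
    return sum(grad_log_pi(theta, s, a) for (s, a, _) in traj) * ret

def pgt_grad(theta, traj, gamma, b=0.0):
    # g(tau) = sum_h grad log pi(a_h|s_h) * (sum_{t >= h} gamma^t r_t - b)
    g = np.zeros_like(theta)
    for h, (s, a, _) in enumerate(traj):
        ret = sum(gamma**t * r for t, (_, _, r) in enumerate(traj) if t >= h) - b
        g += grad_log_pi(theta, s, a) * ret
    return g

theta = np.zeros((2, 2))                  # 2 states, 2 actions
traj = [(0, 1, 1.0), (1, 0, 0.5)]         # (state, action, reward) triples
g1 = reinforce_grad(theta, traj, gamma=0.9)
g2 = pgt_grad(theta, traj, gamma=0.9)
```

PGT discards the rewards earned before step $h$, which leaves the estimator unbiased while reducing its variance relative to REINFORCE.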
4 Bregman Gradient Policy Optimization
In this section, we propose a novel Bregman gradient policy optimization framework based on Bregman divergences and momentum techniques. We first let $f(\theta) := -J(\theta) = -\mathbb{E}_{\tau \sim p(\tau|\theta)}[R(\tau)]$; then the goal of policy-based RL is to solve the following problem:
$$\min_{\theta \in \Theta}\; f(\theta).$$
So we have $\nabla f(\theta) = -\nabla J(\theta)$.
Definition (Zhang and He, 2018). Given a function $f$ defined on a closed convex set $\Theta$, and the Bregman distance $D_{\psi}(x, y) := \psi(x) - \psi(y) - \langle \nabla \psi(y), x - y \rangle$, we define a proximal operator of the function $f$:
$$\theta^{+} = \arg\min_{\theta \in \Theta}\Big\{\langle g, \theta \rangle + \frac{1}{\gamma} D_{\psi}(\theta, \theta_t)\Big\},$$
where $\gamma > 0$, $g$ is a (stochastic) gradient of $f$ at $\theta_t$, and $\psi$ is a continuously differentiable and $\rho$-strongly convex function, i.e., $\psi(x) \ge \psi(y) + \langle \nabla \psi(y), x - y \rangle + \frac{\rho}{2}\|x - y\|^2$. Based on the proximal operator of $f$ as in Zhang and He (2018); Ghadimi et al. (2016), we define the Bregman gradient as follows:
$$\mathcal{G}_{\psi}(\theta_t, g, \gamma) := \frac{1}{\gamma}\big(\theta_t - \theta^{+}\big).$$
If $\psi(x) = \frac{1}{2}\|x\|^2$ and $\Theta = \mathbb{R}^d$, then $\mathcal{G}_{\psi}(\theta, \nabla f(\theta), \gamma) = \nabla f(\theta)$, and $\theta$ is a stationary point of $f$ if and only if $\mathcal{G}_{\psi}(\theta, \nabla f(\theta), \gamma) = 0$. Thus, this Bregman gradient can be regarded as a generalized gradient.
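Assuming the common choice $\psi(x) = \frac{1}{2} x^{\top}\,\mathrm{diag}(h)\, x$ with positive $h$ and $\Theta = \mathbb{R}^d$ (our assumption for illustration, not the only choice in the framework), the proximal operator and the induced Bregman gradient have simple closed forms:

```python
import numpy as np

def bregman_prox(theta_t, g, gamma, h_diag):
    """argmin_theta <g, theta> + (1/gamma) * D_psi(theta, theta_t)
    with psi(x) = 0.5 * x^T diag(h_diag) x over Theta = R^d (closed form)."""
    return theta_t - gamma * g / h_diag

def bregman_gradient(theta_t, g, gamma, h_diag):
    theta_plus = bregman_prox(theta_t, g, gamma, h_diag)
    return (theta_t - theta_plus) / gamma   # equals g / h_diag

theta = np.array([1.0, -2.0])
g = np.array([0.5, 1.0])
# With h_diag = 1 (psi = 0.5 * ||x||^2) the Bregman gradient is the plain gradient.
G_euclid = bregman_gradient(theta, g, 0.1, np.ones(2))
# A non-trivial diagonal geometry rescales each coordinate of the gradient.
G_diag = bregman_gradient(theta, g, 0.1, np.array([2.0, 4.0]))
```

With $h = \mathbf{1}$ this recovers the plain gradient, matching the remark above that the Bregman gradient generalizes the usual gradient.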
4.1 BGPO Algorithm
In this subsection, we propose a Bregman gradient policy optimization (BGPO) algorithm based on a basic momentum technique. The pseudocode of BGPO is provided in Algorithm 1.
In Algorithm 1, step 5 uses stochastic Bregman gradient descent (a.k.a., stochastic mirror descent) to update the parameter $\theta$. Let $\hat{f}$ be the first-order approximation of the function $f$ at the current iterate, built from a stochastic gradient of $f$ at that iterate. By step 5 of Algorithm 1 and the above equality (12), the auxiliary iterate is obtained from the Bregman proximal update. Then, by step 6 of Algorithm 1, the new parameter is a convex combination of the old parameter and this auxiliary iterate. Due to the convexity of the set $\Theta$, we choose the combination parameter in $(0, 1]$ to ensure that the updated sequence remains in the set $\Theta$.
In fact, our algorithm unifies many popular policy optimization algorithms. When the mirror mappings $\psi_t(x) = \frac{1}{2}\|x\|^2$ for all $t$, the update (14) reduces to the classic policy gradient algorithms (Sutton et al., 2000; Zhang et al., 2019, 2020). In other words, given the mirror mapping $\psi_t(x) = \frac{1}{2} x^{\top} F(\theta_t)\, x$, we have $D_{\psi_t}(x, \theta_t) = \frac{1}{2}(x - \theta_t)^{\top} F(\theta_t)(x - \theta_t)$ and the update recovers the natural policy gradient (Kakade, 2001)
$$\theta_{t+1} = \theta_t + \gamma\, F(\theta_t)^{+}\, \nabla_{\theta} J(\theta_t),$$
where $F(\theta_t)^{+}$ denotes the Moore-Penrose pseudoinverse of the Fisher information matrix $F(\theta_t)$.
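The natural policy gradient special case can be sketched with a synthetic Fisher matrix (a toy illustration; the matrix, gradient, and learning rate are ours, not estimated from a policy):

```python
import numpy as np

def natural_pg_step(theta, grad_j, fisher, lr):
    """theta <- theta + lr * pinv(F) @ grad_J, the natural PG update."""
    return theta + lr * np.linalg.pinv(fisher) @ grad_j

theta = np.zeros(2)
grad_j = np.array([1.0, 1.0])
# A rank-deficient Fisher matrix: the pseudoinverse ignores the null direction.
fisher = np.array([[2.0, 0.0], [0.0, 0.0]])
theta_new = natural_pg_step(theta, grad_j, fisher, lr=1.0)
```

The pseudoinverse simply drops the directions in the null space of $F$, which is why it appears in place of an ordinary inverse above.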
4.2 VR-BGPO Algorithm
In this subsection, we propose a faster variance-reduced Bregman gradient policy optimization (VR-BGPO) algorithm based on a variance-reduction technique. The pseudocode of VR-BGPO is provided in Algorithm 2.
Since the problem (4) is non-oblivious, i.e., the distribution $p(\tau|\theta)$ depends on the variable $\theta$, which varies throughout the optimization procedure, we apply the importance sampling weight (Papini et al., 2018; Xu et al., 2019a) in estimating our policy gradient, defined as
$$w(\tau|\theta_{t-1}, \theta_t) = \frac{p(\tau|\theta_{t-1})}{p(\tau|\theta_t)} = \prod_{h=0}^{H-1} \frac{\pi_{\theta_{t-1}}(a_h|s_h)}{\pi_{\theta_t}(a_h|s_h)},$$
where the unknown transition probabilities cancel out in the ratio.
Except for the different stochastic policy gradients and tuning parameters used in Algorithms 1 and 2, steps 5 and 6 of these algorithms for updating the parameter $\theta$ are the same. Interestingly, when choosing the mirror mapping $\psi(x) = \frac{1}{2}\|x\|^2$, our VR-BGPO algorithm reduces to a non-adaptive version of the IS-MBPG algorithm (Huang et al., 2020).
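Assuming a tabular softmax policy (our illustrative setup, with helper names of our choosing), the importance sampling weight can be computed from per-step log-probabilities, since the transition kernel cancels in the ratio:

```python
import numpy as np

def log_pi(theta, s, a):
    """Log-probability of action a under a tabular softmax policy theta[s]."""
    z = theta[s] - theta[s].max()
    return z[a] - np.log(np.exp(z).sum())

def is_weight(theta_old, theta_new, traj):
    """w(tau | theta_old, theta_new) = p(tau|theta_old) / p(tau|theta_new):
    a product of policy-probability ratios, computed stably in log space."""
    log_w = sum(log_pi(theta_old, s, a) - log_pi(theta_new, s, a)
                for (s, a) in traj)
    return np.exp(log_w)

theta_old = np.zeros((2, 2))                       # uniform policy
theta_new = np.array([[0.0, np.log(3.0)], [0.0, 0.0]])
traj = [(0, 1), (1, 0)]                            # (state, action) pairs
w = is_weight(theta_old, theta_new, traj)          # ratio of trajectory probs
```

In practice this weight may be clipped to keep its variance bounded, as discussed with the assumptions in Section 5.2.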
5 Convergence Analysis
In this section, we will analyze the convergence properties of the proposed algorithms. All related proofs are provided in the Appendix.
5.1 Convergence Metric
In this subsection, we give a reasonable metric to measure the convergence of our algorithms, defined as the norm of the Bregman gradient at each iterate, which depends on the $\rho$-strongly convex function $\psi_t$. This convergence metric form is also used in the SUPER-ADAM algorithm (Huang et al., 2021). When $\psi_t(x) = \frac{1}{2}\|x\|^2$ and $\Theta = \mathbb{R}^d$, the metric reduces to the standard gradient norm $\|\nabla f(\theta_t)\|$.
5.2 Some Mild Assumptions
In this subsection, we first give some standard assumptions.
Assumption 1. For the function $\log \pi_{\theta}(a|s)$, its gradient and Hessian matrix are bounded, i.e., there exist constants $G, M > 0$ such that $\|\nabla_{\theta} \log \pi_{\theta}(a|s)\| \le G$ and $\|\nabla_{\theta}^2 \log \pi_{\theta}(a|s)\| \le M$.
Assumption 2. The variance of the stochastic gradient is bounded, i.e., there exists a constant $\sigma > 0$ such that, for all policies $\pi_{\theta}$, $\mathbb{V}\big[g(\tau|\theta)\big] \le \sigma^2$.
Assumption 3. For the importance sampling weight $w(\tau|\theta_1, \theta_2)$, its variance is bounded, i.e., there exists a constant $W > 0$ such that $\mathbb{V}\big[w(\tau|\theta_1, \theta_2)\big] \le W$ for any $\theta_1, \theta_2 \in \mathbb{R}^d$.
Assumption 4. The function $J(\theta)$ has an upper bound, i.e., $J^{*} := \sup_{\theta} J(\theta) < +\infty$.
Assumptions 1 and 2 are standard in PG algorithms (Papini et al., 2018; Xu et al., 2019a, b). Assumption 3 is widely used in the study of variance-reduced PG algorithms (Papini et al., 2018; Xu et al., 2019a). In fact, the bounded importance sampling weight assumption might be violated in some cases, such as when using neural networks as the policy. Thus, we can clip the importance sampling weights to guarantee the effectiveness of our algorithms, as in (Papini et al., 2018). Assumption 4 guarantees the feasibility of the problem (4). Next, we provide some useful lemmas based on the above assumptions.
5.3 Convergence Analysis of BGPO Algorithm
In this subsection, we provide convergence properties of the BGPO algorithm. The detailed proof is provided in Appendix A1. Assume the sequence $\{\theta_t\}_{t=1}^{T}$ is generated by Algorithm 1; then, under suitable choices of the step sizes and momentum parameters for all $t$, we obtain the convergence bound stated in Theorem 5.3.
Without loss of generality, choosing the parameters as above, Theorem 5.3 shows that the BGPO algorithm has a convergence rate of $O(1/T^{1/4})$. Letting $1/T^{1/4} \le \epsilon$, we have $T \ge \epsilon^{-4}$. Since the BGPO algorithm only needs one trajectory to estimate the stochastic policy gradient at each iteration and runs $T$ iterations, it has a sample complexity of $O(\epsilon^{-4})$ for finding an $\epsilon$-stationary point.
5.4 Convergence Analysis of VR-BGPO Algorithm
In this subsection, we give convergence properties of the VR-BGPO algorithm. The detailed proof is provided in Appendix A2. Suppose the sequence $\{\theta_t\}_{t=1}^{T}$ is generated by Algorithm 2; then, under suitable choices of the step sizes and momentum parameters for all $t$, we obtain the convergence bound stated in Theorem 5.4.
Without loss of generality, choosing the parameters as above, Theorem 5.4 shows that the VR-BGPO algorithm has a convergence rate of $O(1/T^{1/3})$. Letting $1/T^{1/3} \le \epsilon$, we have $T \ge \epsilon^{-3}$. Since the VR-BGPO algorithm only needs one trajectory to estimate the stochastic policy gradient at each iteration and runs $T$ iterations, it reaches a lower sample complexity of $O(\epsilon^{-3})$ for finding an $\epsilon$-stationary point.
6 Experiments
In this section, we conduct experiments on several RL tasks to verify the effectiveness of our methods. We first study the effect of different choices of Bregman divergences in our algorithms (BGPO and VR-BGPO), and then we compare our VR-BGPO algorithm with other state-of-the-art methods such as TRPO (Schulman et al., 2015a), PPO (Schulman et al., 2017), VRMPO (Yang et al., 2019), and MDPO (Tomar et al., 2020).
6.1 Effects of Bregman Divergences
In this subsection, we examine how different Bregman divergences affect the performance of our algorithms. In the first setting, we let the mirror mapping be $\psi(x) = \frac{1}{2}\|x\|_p^2$ with different values of $p$ to test the performance of our algorithms. Let $\psi^{*}$ be the conjugate mapping of $\psi$, where $\psi^{*}(x) = \frac{1}{2}\|x\|_q^2$ with $\frac{1}{p} + \frac{1}{q} = 1$. According to (Beck and Teboulle, 2003), when $p \in (1, 2]$, the update of $\theta$ in our algorithms can be calculated coordinate-wise via the $p$-norm link functions $\nabla\psi$ and $\nabla\psi^{*}$. In the second setting, we apply a diagonal term to the mirror mapping, $\psi(x) = \frac{1}{2} x^{\top} H x$, where $H$ is a diagonal matrix with positive values. In the experiments, we generate $H$ from accumulated stochastic gradient information, as in the SUPER-ADAM algorithm (Huang et al., 2021). Under this setting, the update of $\theta$ can also be solved analytically.
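Under a diagonal mapping $\psi(x) = \frac{1}{2} x^{\top}\,\mathrm{diag}(b)\, x$, the analytic update is a coordinate-wise rescaled gradient step; the particular construction of $b$ below is our assumption, loosely modeled on adaptive-gradient scalings rather than the exact recipe used in our experiments:

```python
import numpy as np

def diag_mirror_step(theta, grad_f, b_t, gamma):
    """Mirror descent step with psi(x) = 0.5 * x^T diag(b_t) x:
    theta+ = argmin <grad_f, x> + (1/gamma) * D_psi(x, theta)  (closed form)."""
    return theta - gamma * grad_f / b_t

# Illustrative diagonal built from accumulated gradient magnitudes
# (an assumption in the spirit of adaptive methods, not the exact recipe).
rho = 1e-3                                     # keeps the diagonal positive
grads = [np.array([3.0, 0.0]), np.array([0.0, 4.0])]
b = np.sqrt(sum(g**2 for g in grads)) + rho    # coordinate-wise scale
theta = diag_mirror_step(np.zeros(2), np.array([3.0, 4.0]), b, gamma=1.0)
```

Each coordinate of the step is divided by its own accumulated scale, which is why no line search or per-$p$ link function is needed in this setting.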
To test the effectiveness of the two different Bregman divergences, we evaluate them on three classic control environments from gym (Brockman et al., 2016): CartPole-v1, Acrobot-v1, and MountainCarContinuous-v0. In the experiments, a categorical policy is used for the CartPole and Acrobot environments, and a Gaussian policy is used for MountainCar. Gaussian value functions are used in all settings. All policies and value functions are parameterized by multilayer perceptrons (MLPs). For a fair comparison, all settings use the same initialization for policies. We run each setting five times and plot the mean and variance of the average returns.
For the $p$-norm mapping, we test three different values of $p$. For the diagonal mapping, we use the setting described above. We set the shared hyperparameters to be the same, while the learning rate still needs to be tuned for each $p$ to achieve relatively good performance. For simplicity, we use BGPO-Diag to denote BGPO with the diagonal mapping, and BGPO-$p$ to denote BGPO with the $p$-norm mapping. Details about the setup of environments and hyperparameters are provided in Appendix B.
From Fig. 1, we can see that BGPO-Diag largely outperforms BGPO-$p$ for different choices of $p$. The parameter tuning of BGPO-$p$ is much more difficult than that of BGPO-Diag, because each $p$ requires an individual learning rate to achieve the desired performance; to reach the best possible performance, $p$ may also need to be tuned. In contrast, the parameter tuning of BGPO-Diag is much simpler, and its performance is also more stable.
6.2 Comparison between BGPO and VR-BGPO
To understand the effectiveness of the variance-reduction technique used in our VR-BGPO algorithm, we compare BGPO and VR-BGPO using the same settings introduced in Section 6.1. Both algorithms use the diagonal mapping for $\psi$, since it performs much better than the $p$-norm mapping.
From Fig. 2, we can see that VR-BGPO outperforms BGPO in all three environments. In CartPole, both algorithms converge very fast and have similar performance, while VR-BGPO is more stable than BGPO. The advantage of VR-BGPO becomes larger in the Acrobot and MountainCar environments, probably because these tasks are more difficult than CartPole.
6.3 Comparison with Other Methods
In this subsection, we compare our VR-BGPO algorithm with the other methods, since it shows the better performance in Section 6.2. Specifically, we compare VR-BGPO with two popular policy optimization algorithms: PPO (Schulman et al., 2017) and TRPO (Schulman et al., 2015a). We also compare our algorithm with the recently proposed mirror-descent-based policy optimization algorithms VRMPO (Yang et al., 2019) and MDPO (Tomar et al., 2020). For VR-BGPO, we continue to use the diagonal mapping for $\psi$. For VRMPO, we follow their implementation and use a $p$-norm mapping for $\psi$. For MDPO, $\psi$ is the negative Shannon entropy, and the Bregman divergence becomes the KL-divergence.
To evaluate the performance of these algorithms, we test them on six gym (Brockman et al., 2016) environments with continuous control tasks: InvertedPendulum-v2, InvertedDoublePendulum-v2, Walker2d-v2, Reacher-v2, Swimmer-v2, and HalfCheetah-v2. We use Gaussian policies and Gaussian value functions for all environments, and both are parameterized by MLPs. To ensure a fair comparison, all policies use the same initialization. For TRPO and PPO, we use the implementations provided by garage (garage contributors, 2019). We carefully implement MDPO and VRMPO following the descriptions provided in the original papers. All methods, including ours, are implemented with garage (garage contributors, 2019) and PyTorch (Paszke et al., 2019). We run all algorithms ten times on each environment and report the mean and variance of the average returns. Details about the setup of environments and hyperparameters are also provided in Appendix B.
From Fig. 3, we can see that VR-BGPO consistently outperforms all the other methods on the six environments, demonstrating the effectiveness of our algorithm. Specifically, our method achieves the best mean average return in InvertedDoublePendulum, Swimmer, and HalfCheetah. Our method converges faster than the other methods in all environments, especially in InvertedPendulum, InvertedDoublePendulum, and Reacher. MDPO can achieve good results in some environments, but it cannot outperform PPO or TRPO in Swimmer and InvertedDoublePendulum. VRMPO only outperforms PPO and TRPO in Reacher and InvertedDoublePendulum. The undesirable performance of VRMPO is probably because it uses a $p$-norm mapping for $\psi$, which requires careful tuning of the learning rate.
In this paper, we proposed a novel Bregman gradient policy optimization framework for reinforcement learning based on mirror descent iterations and momentum techniques. Moreover, we studied the convergence properties of the proposed policy optimization methods under the nonconvex setting.
We thank the IT Help Desk at University of Pittsburgh. This work was partially supported by NSF IIS 1836945, IIS 1836938, IIS 1845666, IIS 1852606, IIS 1838627, IIS 1837956.
References
- Agarwal et al. (2019) Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. arXiv preprint arXiv:1908.00261, 2019.
- Azar et al. (2012) Mohammad Gheshlaghi Azar, Vicenç Gómez, and Hilbert J Kappen. Dynamic policy programming. The Journal of Machine Learning Research, 13(1):3207–3245, 2012.
- Beck and Teboulle (2003) Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
- Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym, 2016.
- Cen et al. (2020) Shicong Cen, Chen Cheng, Yuxin Chen, Yuting Wei, and Yuejie Chi. Fast global convergence of natural policy gradient methods with entropy regularization. arXiv preprint arXiv:2007.06558, 2020.
- Cortes et al. (2010) Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Learning bounds for importance weighting. In Advances in neural information processing systems, pages 442–450, 2010.
- Cutkosky and Orabona (2019) Ashok Cutkosky and Francesco Orabona. Momentum-based variance reduction in non-convex sgd. In Advances in Neural Information Processing Systems, pages 15210–15219, 2019.
- Deisenroth et al. (2013) Marc Peter Deisenroth, Gerhard Neumann, Jan Peters, et al. A survey on policy search for robotics. Foundations and Trends® in Robotics, 2(1–2):1–142, 2013.
- Fang et al. (2018) Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pages 689–699, 2018.
- garage contributors (2019) The garage contributors. Garage: A toolkit for reproducible reinforcement learning research. https://github.com/rlworkgroup/garage, 2019.
- Geist et al. (2019) Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A theory of regularized markov decision processes. In Thirty-sixth International Conference on Machine Learning, 2019.
- Ghadimi et al. (2016) Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267–305, 2016.
- Huang et al. (2020) Feihu Huang, Shangqian Gao, Jian Pei, and Heng Huang. Momentum-based policy gradient methods. In International Conference on Machine Learning, pages 4422–4433. PMLR, 2020.
- Huang et al. (2021) Feihu Huang, Junyi Li, and Heng Huang. Super-adam: Faster and universal framework of adaptive gradients. arXiv preprint arXiv:2106.08208, 2021.
- Jin and Sidford (2020) Yujia Jin and Aaron Sidford. Efficiently solving mdps with stochastic mirror descent. In International Conference on Machine Learning, pages 4890–4900. PMLR, 2020.
- Johnson and Zhang (2013) Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pages 315–323, 2013.
- Kakade (2001) Sham M Kakade. A natural policy gradient. Advances in neural information processing systems, 14:1531–1538, 2001.
- Konda and Tsitsiklis (2000) Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in neural information processing systems, pages 1008–1014, 2000.
- Lan (2021) Guanghui Lan. Policy mirror descent for reinforcement learning: Linear convergence, new sampling complexity, and generalized problem classes. arXiv preprint arXiv:2102.00135, 2021.
- Li (2017) Yuxi Li. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274, 2017.
- Liu et al. (2019) Boyi Liu, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural proximal/trust region policy optimization attains globally optimal policy. arXiv preprint arXiv:1906.10306, 2019.
- Liu et al. (2020) Yanli Liu, Kaiqing Zhang, Tamer Basar, and Wotao Yin. An improved analysis of (variance-reduced) policy gradient and natural policy gradient methods. Advances in Neural Information Processing Systems, 33, 2020.
- Neu et al. (2017) Gergely Neu, Anders Jonsson, and Vicenç Gómez. A unified view of entropy-regularized markov decision processes. arXiv preprint arXiv:1705.07798, 2017.
- Papini et al. (2018) Matteo Papini, Damiano Binaghi, Giuseppe Canonaco, Matteo Pirotta, and Marcello Restelli. Stochastic variance-reduced policy gradient. In 35th International Conference on Machine Learning, volume 80, pages 4026–4035, 2018.
- Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035, 2019.
- Schulman et al. (2015a) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International conference on machine learning, pages 1889–1897, 2015a.
- Schulman et al. (2015b) John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015b.
- Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Shalev-Shwartz et al. (2016) Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295, 2016.
- Shani et al. (2020) Lior Shani, Yonathan Efroni, and Shie Mannor. Adaptive trust region policy optimization: Global convergence and faster rates for regularized mdps. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 5668–5675, 2020.
- Shannon (1948) Claude E Shannon. A mathematical theory of communication. The Bell system technical journal, 27(3):379–423, 1948.
- Shen et al. (2019) Zebang Shen, Alejandro Ribeiro, Hamed Hassani, Hui Qian, and Chao Mi. Hessian aided policy gradient. In International Conference on Machine Learning, pages 5729–5738, 2019.
- Silver et al. (2017) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. nature, 550(7676):354–359, 2017.
- Sutton et al. (2000) Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
- Tomar et al. (2020) Manan Tomar, Lior Shani, Yonathan Efroni, and Mohammad Ghavamzadeh. Mirror descent policy optimization. arXiv preprint arXiv:2005.09814, 2020.
- Wang et al. (2019) Qing Wang, Yingru Li, Jiechao Xiong, and Tong Zhang. Divergence-augmented policy optimization. In Advances in Neural Information Processing Systems, pages 6099–6110, 2019.
- Williams (1992) Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
- Xu et al. (2019a) Pan Xu, Felicia Gao, and Quanquan Gu. An improved convergence analysis of stochastic variance-reduced policy gradient. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, page 191, 2019a.
- Xu et al. (2019b) Pan Xu, Felicia Gao, and Quanquan Gu. Sample efficient policy gradient methods with recursive variance reduction. arXiv preprint arXiv:1909.08610, 2019b.
- Yang et al. (2019) Long Yang, Gang Zheng, Haotian Zhang, Yu Zhang, Qian Zheng, Jun Wen, and Gang Pan. Policy optimization with stochastic mirror descent. arXiv preprint arXiv:1906.10462, 2019.
- Zhan et al. (2021) Wenhao Zhan, Shicong Cen, Baihe Huang, Yuxin Chen, Jason D Lee, and Yuejie Chi. Policy mirror descent for regularized reinforcement learning: A generalized framework with linear convergence. arXiv preprint arXiv:2105.11066, 2021.
- Zhang et al. (2020) Junzi Zhang, Jongho Kim, Brendan O’Donoghue, and Stephen Boyd. Sample efficient reinforcement learning with reinforce. arXiv preprint arXiv:2010.11364, 2020.
- Zhang et al. (2019) Kaiqing Zhang, Alec Koppel, Hao Zhu, and Tamer Başar. Global convergence of policy gradient methods to (almost) locally optimal policies. arXiv preprint arXiv:1906.08383, 2019.
- Zhang and He (2018) Siqi Zhang and Niao He. On the convergence rate of stochastic mirror descent for nonsmooth nonconvex optimization. arXiv preprint arXiv:1806.04781, 2018.