Bregman Gradient Policy Optimization

06/23/2021 ∙ by Feihu Huang, et al. ∙ University of Pittsburgh

In this paper, we design a novel Bregman gradient policy optimization framework for reinforcement learning based on Bregman divergences and momentum techniques. Specifically, we propose a Bregman gradient policy optimization (BGPO) algorithm based on the basic momentum technique and mirror descent iteration. We also present an accelerated Bregman gradient policy optimization (VR-BGPO) algorithm based on a momentum variance-reduced technique. Moreover, we introduce a convergence analysis framework for our Bregman gradient policy optimization under the nonconvex setting. Specifically, we prove that BGPO achieves a sample complexity of Õ(ϵ^-4) for finding an ϵ-stationary point while requiring only one trajectory at each iteration, and that VR-BGPO reaches the best-known sample complexity of Õ(ϵ^-3) for finding an ϵ-stationary point, also requiring only one trajectory at each iteration. In particular, by using different Bregman divergences, our methods unify many existing policy optimization algorithms and their new variants, such as the existing (variance-reduced) policy gradient algorithms and (variance-reduced) natural policy gradient algorithms. Extensive experimental results on multiple reinforcement learning tasks demonstrate the efficiency of our new algorithms.


1 Introduction

Policy Gradient (PG) methods are a class of popular policy optimization methods for Reinforcement Learning (RL), and have achieved significant success in many challenging applications (Li, 2017) such as robot manipulation (Deisenroth et al., 2013), the game of Go (Silver et al., 2017), and autonomous driving (Shalev-Shwartz et al., 2016). In general, PG methods directly search for the optimal policy by maximizing the expected total reward of the Markov Decision Processes (MDPs) involved in RL, where an agent takes actions dictated by a policy in an unknown dynamic environment over a sequence of time steps. Since PGs are generally estimated by Monte-Carlo sampling, vanilla PG methods usually suffer from very high variance, which results in slow convergence and instability. Thus, many fast PG methods have recently been proposed to reduce the variance of the vanilla stochastic PG. For example, Sutton et al. (2000) introduced a baseline to reduce the variance of the stochastic PG. Konda and Tsitsiklis (2000) proposed an efficient actor-critic algorithm that estimates the value function to reduce the effect of large variance. Schulman et al. (2015b) proposed generalized advantage estimation (GAE) to control both the bias and variance of the policy gradient. More recently, some faster variance-reduced PG methods (Papini et al., 2018; Xu et al., 2019a; Shen et al., 2019; Liu et al., 2020) have been developed based on variance-reduction techniques from stochastic optimization.

Alternatively, some successful PG algorithms (Schulman et al., 2015a, 2017) improve the convergence rate and robustness of vanilla PG methods by using penalties such as a Kullback-Leibler (KL) divergence penalty. For example, trust-region policy optimization (TRPO) (Schulman et al., 2015a) ensures that the newly selected policy stays close to the old one by using a KL-divergence constraint, while proximal policy optimization (PPO) (Schulman et al., 2017) clips the weighted likelihood ratio to implicitly reach this goal. Subsequently, Shani et al. (2020) analyzed the global convergence properties of TRPO in tabular RL based on the convex mirror descent algorithm. Liu et al. (2019) studied the global convergence properties of PPO and TRPO equipped with overparametrized neural networks based on mirror descent iterations. At the same time, Yang et al. (2019) proposed PG methods based on the mirror descent algorithm. More recently, mirror descent policy optimization (MDPO) (Tomar et al., 2020) iteratively updates the policy beyond the tabular setting by approximately solving a trust-region problem with the convex mirror descent algorithm. In addition, Agarwal et al. (2019) and Cen et al. (2020) have studied natural PG methods for regularized RL. However, Agarwal et al. (2019) mainly focuses on tabular, log-linear, and neural policy classes, while Cen et al. (2020) mainly focuses on the softmax policy class.

Algorithm Reference Complexity Batch Size
TRPO Shani et al. (2020)
Regularized TRPO Shani et al. (2020)
TRPO/PPO Liu et al. (2019)
VRMPO Yang et al. (2019)
MDPO Tomar et al. (2020) Unknown Unknown
BGPO Ours Õ(ϵ^-4) 1
VR-BGPO Ours Õ(ϵ^-3) 1
Table 1: Sample complexities of representative PG algorithms based on the mirror descent algorithm for finding an ϵ-stationary point of the nonconcave performance function. Although Liu et al. (2019) and Shani et al. (2020) provide global convergence of TRPO and PPO under some specific policy classes based on convex mirror descent, without such structure one only obtains a stationary point of the nonconcave performance function. Note that our convergence analysis does not rely on any specific policy class.

Although these specific PG methods based on mirror descent iteration have recently been studied, the existing results are scattered across empirical and theoretical works, and a universal framework that does not rely on specific RL tasks is still lacking. In particular, there is still no convergence analysis of PG methods based on the mirror descent algorithm under the nonconvex setting. Since mirror descent iteration adjusts gradient updates to fit the problem geometry and is useful in regularized RL (Geist et al., 2019), an important problem remains to be addressed:

Can we design a universal policy optimization framework based on the mirror descent algorithm and provide its convergence guarantee under the nonconvex setting?

In this paper, we answer the above challenging question affirmatively and propose an efficient Bregman gradient policy optimization framework based on Bregman divergences and momentum techniques. In particular, we provide a convergence analysis framework for PG methods based on mirror descent iteration under the nonconvex setting. Our main contributions are summarized as follows:

  • We propose an effective Bregman gradient policy optimization (BGPO) algorithm based on the basic momentum technique, which achieves a sample complexity of Õ(ϵ^-4) for finding an ϵ-stationary point while requiring only one trajectory at each iteration.

  • We further propose an accelerated Bregman gradient policy optimization (VR-BGPO) algorithm based on the momentum variance-reduced technique. Moreover, we prove that VR-BGPO reaches the best-known sample complexity of Õ(ϵ^-3) under the nonconvex setting, also requiring only one trajectory at each iteration.

  • We design a unified policy optimization framework based on mirror descent iteration and momentum techniques, and provide its convergence analysis under the nonconvex setting.

Table 1 shows the sample complexities of representative PG algorithms based on the mirror descent algorithm. Shani et al. (2020) and Liu et al. (2019) established global convergence of mirror descent variants of PG under some pre-specified settings, such as over-parameterized networks (Liu et al., 2019), by exploiting the hidden convex nature of these specific problems. Without these special structures, global convergence of these methods cannot be achieved. In contrast, our framework does not rely on any specific policy class, and our convergence analysis builds only on the general nonconvex setting. Thus, we only prove that our methods converge to stationary points.

Geist et al. (2019); Jin and Sidford (2020); Lan (2021); Zhan et al. (2021) studied a general theory of regularized MDPs in the policy function space, which is generally discontinuous. Since the state and action spaces are usually very large in practice, the policy function space is also large. In contrast, our methods build on the policy parameter space, which is a continuous Euclidean space and relatively small. Hence, our methods and theoretical results are more practical than those in (Geist et al., 2019; Jin and Sidford, 2020; Lan, 2021; Zhan et al., 2021). Tomar et al. (2020) also propose a mirror descent PG framework based on the policy parameter space, but they do not provide any theoretical results and only focus on the Bregman divergence in the form of a KL divergence. Our framework can work with any form of Bregman divergence, and it can also flexibly use momentum and variance-reduction techniques.

2 Related Works

In this section, we review related work on the mirror descent algorithm in RL and on (variance-reduced) PG methods, respectively.

2.1 Mirror Descent Algorithm in RL

Because it easily handles regularization terms, the mirror descent algorithm has shown significant success in regularized RL. For example, Neu et al. (2017) have shown that both the dynamic policy programming (Azar et al., 2012) and TRPO (Schulman et al., 2015a) algorithms are approximate variants of the mirror descent algorithm. Subsequently, Geist et al. (2019) introduced a general theory of regularized MDPs based on the convex mirror descent algorithm. More recently, Liu et al. (2019) studied the global convergence properties of PPO and TRPO equipped with overparametrized neural networks based on mirror descent iterations. At the same time, Shani et al. (2020) analyzed the global convergence properties of TRPO in the tabular setting based on the convex mirror descent algorithm. Wang et al. (2019) proposed divergence-augmented policy optimization for off-policy learning based on the mirror descent algorithm. MDPO (Tomar et al., 2020) iteratively updates the policy beyond the tabular setting by approximately solving a trust-region problem with the convex mirror descent algorithm.

2.2 (Variance-Reduced) PG Methods

PG methods have been widely studied due to their stability and incremental nature in policy optimization. For example, the global convergence properties of the vanilla policy gradient method in infinite-horizon MDPs have recently been studied in (Zhang et al., 2019). Subsequently, Zhang et al. (2020) studied the asymptotic global convergence properties of REINFORCE (Williams, 1992), whose policy gradient is approximated by using a single trajectory or a fixed-size mini-batch of trajectories under softmax parametrization and log-barrier regularization. To accelerate these vanilla PG methods, some faster variance-reduced PG methods have been proposed based on the variance-reduction techniques of SVRG (Johnson and Zhang, 2013), SPIDER (Fang et al., 2018), and STORM (Cutkosky and Orabona, 2019) from stochastic convex and nonconvex optimization. For example, the fast SVRPG algorithm (Papini et al., 2018; Xu et al., 2019a) was proposed based on SVRG. The fast HAPG (Shen et al., 2019) and SRVR-PG (Xu et al., 2019a) algorithms were presented by using the SPIDER technique. More recently, the momentum-based PG methods IS-MBPG and HA-MBPG (Huang et al., 2020) were developed based on the variance-reduction technique of STORM.

3 Preliminaries

In this section, we review some preliminaries of Markov decision processes and policy gradients.

3.1 Notations

Let [T] = {1, 2, …, T} for all T ≥ 1. For a vector x, let ‖x‖ denote the ℓ₂-norm of x, and ‖x‖_p the ℓ_p-norm of x. For two sequences {a_n} and {b_n}, we write a_n = O(b_n) if a_n ≤ C·b_n for some constant C > 0, and Õ(·) hides logarithmic factors. E[X] and Var[X] denote the expectation and variance of a random variable X, respectively.

3.2 Markov Decision Process

Reinforcement learning generally involves a discrete-time discounted Markov Decision Process (MDP) defined by a tuple M = (S, A, P, r, ρ, γ). S and A denote the state and action spaces of the agent, respectively. P(s'|s, a) is the Markov kernel that determines the transition probability from the state s to s' under taking an action a. r(s, a) is the reward function of s and a, and ρ denotes the initial state distribution. γ ∈ (0, 1) is the discount factor. Let π(·|s) be a stationary policy, i.e., a map from S to Δ(A), where Δ(A) is the set of probability distributions on the action space A.

Given the current state s_t, the agent executes an action a_t following a conditional probability distribution π(·|s_t), and then obtains a reward r(s_t, a_t). At each time t, we can define the state-action value function and the state value function as follows:

(1)  Q^π(s_t, a_t) = E[ ∑_{i=0}^∞ γ^i r(s_{t+i}, a_{t+i}) | s_t, a_t, π ],   V^π(s_t) = E_{a_t∼π(·|s_t)}[ Q^π(s_t, a_t) ].

We also define the advantage function A^π(s, a) = Q^π(s, a) − V^π(s). The goal of the agent is to find the optimal policy by maximizing the expected discounted reward

(2)  max_π J(π) := E_{s_0∼ρ, a_t∼π(·|s_t), s_{t+1}∼P(·|s_t,a_t)}[ ∑_{t=0}^∞ γ^t r(s_t, a_t) ].

Consider a finite time horizon H: the agent collects a trajectory τ = {s_t, a_t}_{t=0}^{H−1} under a stationary policy and obtains the cumulative discounted reward R(τ) = ∑_{t=0}^{H−1} γ^t r(s_t, a_t). Since the state and action spaces S and A are generally very large, directly solving the problem (2) is difficult. Thus, we parametrize the policy as π_θ(a|s) with parameter θ ∈ ℝ^d. Given the initial distribution ρ, the probability distribution over a trajectory τ is

(3)  p(τ|θ) = ρ(s_0) ∏_{t=0}^{H−1} π_θ(a_t|s_t) P(s_{t+1}|s_t, a_t).

Thus, the problem (2) is equivalent to maximizing the expected discounted trajectory reward:

(4)  max_{θ∈ℝ^d} J(θ) := E_{τ∼p(·|θ)}[ R(τ) ].
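To make the objective (4) concrete, here is a minimal sketch, assuming a classic-API gym environment and a hypothetical `policy(state)` callable that samples an action from π_θ(·|s); it collects one trajectory of horizon H and computes the cumulative discounted reward R(τ).

```python
import numpy as np

def sample_trajectory(env, policy, horizon, gamma):
    """Collect one trajectory under `policy` and return it with its discounted reward R(tau).

    `env` is assumed to be a classic-API gym environment and `policy(state)` is assumed
    to return an action sampled from pi_theta(.|state).
    """
    states, actions, rewards = [], [], []
    state = env.reset()
    for t in range(horizon):
        action = policy(state)
        next_state, reward, done, _ = env.step(action)
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        state = next_state
        if done:
            break
    # Cumulative discounted reward R(tau) = sum_t gamma^t * r(s_t, a_t).
    discounts = gamma ** np.arange(len(rewards))
    R_tau = float(np.sum(discounts * np.array(rewards)))
    return states, actions, rewards, R_tau

# Example usage (hypothetical random policy on CartPole-v1):
# env = gym.make("CartPole-v1")
# policy = lambda s: env.action_space.sample()
# _, _, _, R = sample_trajectory(env, policy, horizon=500, gamma=0.99)
```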

3.3 Policy Gradients

The policy gradient methods (Williams, 1992; Sutton et al., 2000) are a class of effective policy-based methods for solving the above RL problem (4). Specifically, the gradient of J(θ) with respect to θ is given as follows:

(5)  ∇_θ J(θ) = E_{τ∼p(·|θ)}[ ∇_θ log p(τ|θ) R(τ) ].

Given a mini-batch of trajectories {τ_i}_{i=1}^B sampled from the distribution p(·|θ), the standard stochastic policy gradient ascent update at the k-th step is defined as

(6)  θ_{k+1} = θ_k + η · (1/B) ∑_{i=1}^B ĝ(τ_i|θ_k),

where η > 0 is the learning rate and ĝ(τ|θ_k) is the stochastic policy gradient. As in (Zhang et al., 2019; Shani et al., 2020), ĝ(τ|θ) is an unbiased stochastic policy gradient of J(θ), i.e., E_{τ∼p(·|θ)}[ĝ(τ|θ)] = ∇_θ J(θ), where

(7)  ĝ(τ|θ) = ( ∑_{t=0}^{H−1} ∇_θ log π_θ(a_t|s_t) ) R(τ).

Based on the gradient estimator in (7), we can recover the well-known policy gradient estimators such as REINFORCE (Williams, 1992) and PGT (Sutton et al., 2000). Specifically, REINFORCE obtains a policy gradient estimator by adding a baseline b, defined as

ĝ(τ|θ) = ∑_{t=0}^{H−1} ∇_θ log π_θ(a_t|s_t) ( ∑_{h=0}^{H−1} γ^h r(s_h, a_h) − b ).

The PGT estimator is a variant of REINFORCE that only accumulates the rewards obtained after each time step, defined as

ĝ(τ|θ) = ∑_{t=0}^{H−1} ∇_θ log π_θ(a_t|s_t) ( ∑_{h=t}^{H−1} γ^h r(s_h, a_h) − b_t ).
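As an illustration of the PGT-style estimator above, the following sketch assumes a PyTorch policy object with a hypothetical `log_prob(state, action)` method returning the differentiable log-probability log π_θ(a_t|s_t); it weights each score function by the discounted reward-to-go minus a baseline.

```python
import torch

def pgt_gradient(policy, states, actions, rewards, gamma, baseline=0.0):
    """PGT-style estimator: sum_t grad log pi(a_t|s_t) * (sum_{h>=t} gamma^h r_h - b_t).

    `policy.log_prob(s, a)` is assumed to return log pi_theta(a|s) as a differentiable
    torch scalar connected to the policy parameters.
    """
    H = len(rewards)
    # Discounted reward-to-go: rewards_to_go[t] = sum_{h=t}^{H-1} gamma^h * r_h.
    rewards_to_go = torch.zeros(H)
    running = 0.0
    for t in reversed(range(H)):
        running = (gamma ** t) * float(rewards[t]) + running
        rewards_to_go[t] = running
    # Surrogate loss whose gradient equals the PGT policy gradient estimate.
    surrogate = torch.zeros(())
    for t in range(H):
        surrogate = surrogate + policy.log_prob(states[t], actions[t]) * (rewards_to_go[t] - baseline)
    grads = torch.autograd.grad(surrogate, list(policy.parameters()))
    return [g.detach() for g in grads]
```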

Alternatively, based on the policy gradient theorem (Sutton et al., 2000; Zhang et al., 2019), we also have the following policy gradient form

(8)  ∇_θ J(θ) = 1/(1−γ) · E_{s∼d^{π_θ}_ρ, a∼π_θ(·|s)}[ ∇_θ log π_θ(a|s) Q^{π_θ}(s, a) ],

where d^{π_θ}_ρ(s) = (1−γ) ∑_{t=0}^∞ γ^t Pr(s_t = s | s_0∼ρ, π_θ) denotes a valid probability measure over the states. Here Pr(s_t = s | s_0∼ρ, π_θ) is the probability of being in state s at time t given the initial distribution ρ and the policy parameter θ. Since any function b(s) is independent of the action a, we have E_{a∼π_θ(·|s)}[∇_θ log π_θ(a|s) b(s)] = 0, so the policy gradient (8) is equivalent to

∇_θ J(θ) = 1/(1−γ) · E_{s∼d^{π_θ}_ρ, a∼π_θ(·|s)}[ ∇_θ log π_θ(a|s) ( Q^{π_θ}(s, a) − b(s) ) ],

where b(s) is a baseline function. When we choose the state value function V^{π_θ}(s) as the baseline b(s), we obtain the advantage-based policy gradient

(9)  ∇_θ J(θ) = 1/(1−γ) · E_{s∼d^{π_θ}_ρ, a∼π_θ(·|s)}[ ∇_θ log π_θ(a|s) A^{π_θ}(s, a) ].

4 Bregman Gradient Policy Optimization

In this section, we propose a novel Bregman gradient policy optimization framework based on Bregman divergences and momentum techniques. We first let f(θ) = −J(θ); the goal of policy-based RL is then to solve the following problem:

(10)  min_{θ∈Θ} f(θ) := −E_{τ∼p(·|θ)}[ R(τ) ],

where Θ ⊆ ℝ^d is a closed convex set. So we have ∇f(θ) = −∇_θ J(θ).

(Zhang and He, 2018) Given a function ψ defined on a closed convex set Θ ⊆ ℝ^d, and the Bregman distance D_ψ(x, y) = ψ(x) − ψ(y) − ⟨∇ψ(y), x − y⟩, we define a proximal operator of the function f(θ),

(11)  P_{γ,ψ}(θ, g) := arg min_{u∈Θ} { ⟨g, u⟩ + (1/γ) D_ψ(u, θ) },

where γ > 0, and ψ is a continuously differentiable and μ-strongly convex function, i.e., ψ(x) ≥ ψ(y) + ⟨∇ψ(y), x − y⟩ + (μ/2)‖x − y‖². Based on the proximal operator of f, as in Zhang and He (2018); Ghadimi et al. (2016), we define the Bregman gradient as follows:

(12)  G_{γ,ψ}(θ, g) := (1/γ)( θ − P_{γ,ψ}(θ, g) ).

If ψ(θ) = ½‖θ‖² and Θ = ℝ^d, we have G_{γ,ψ}(θ, ∇f(θ)) = ∇f(θ). Moreover, θ is a stationary point of f(θ) if and only if G_{γ,ψ}(θ, ∇f(θ)) = 0. Thus, this Bregman gradient can be regarded as a generalized gradient.
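The sketch below illustrates the proximal operator (11) and the Bregman gradient (12) in the simple case of a diagonal quadratic mirror map ψ(θ) = ½ θᵀdiag(h)θ on Θ = ℝ^d, where the argmin has a closed form; the function names are ours and chosen for illustration only.

```python
import numpy as np

def bregman_prox_diag(theta, grad, gamma, h):
    """Closed-form argmin of <grad, u> + (1/gamma) * D_psi(u, theta)
    for psi(u) = 0.5 * u^T diag(h) u on Theta = R^d, where
    D_psi(u, theta) = 0.5 * (u - theta)^T diag(h) (u - theta)."""
    return theta - gamma * grad / h

def bregman_gradient(theta, grad, gamma, h):
    """Generalized (Bregman) gradient G = (theta - theta_plus) / gamma."""
    theta_plus = bregman_prox_diag(theta, grad, gamma, h)
    return (theta - theta_plus) / gamma  # equals grad / h in this diagonal case

# With h = 1 (Euclidean mirror map) the Bregman gradient reduces to the plain gradient.
theta = np.array([0.5, -1.0])
g = np.array([0.2, 0.3])
print(bregman_gradient(theta, g, gamma=0.1, h=np.ones(2)))  # -> [0.2, 0.3]
```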

4.1 BGPO Algorithm

In this subsection, we propose a Bregman gradient policy optimization (BGPO) algorithm based on the basic momentum technique. The pseudo-code of BGPO is provided in Algorithm 1.

1:  Input: Total iterations T, tuning parameters, and μ-strongly convex mirror mappings {ψ_k}_{k=1}^T;
2:  Initialize: θ_1 ∈ Θ; sample a trajectory τ_1 from p(·|θ_1), and compute the corresponding stochastic policy gradient;
3:  for k = 1, 2, …, T do
4:     Compute ;
5:     Update ;
6:     Update ;
7:     Update ;
8:     Sample a trajectory τ_{k+1} from p(·|θ_{k+1}), and compute the corresponding stochastic policy gradient;
9:  end for
10:  Output: θ chosen uniformly at random from {θ_k}_{k=1}^T.
Algorithm 1 BGPO Algorithm

In Algorithm 1, step 5 uses the stochastic Bregman gradient descent (a.k.a. stochastic mirror descent) method to update the parameter θ. Let ⟨θ, u_k⟩ be the first-order approximation of the function f(θ) at θ_k, where u_k is a stochastic gradient estimate of f(θ) at θ_k. By step 5 of Algorithm 1 and the above equality (12), we have

(13)  θ̃_{k+1} = arg min_{θ∈Θ} { ⟨θ, u_k⟩ + (1/γ_k) D_{ψ_k}(θ, θ_k) } = θ_k − γ_k G_{γ_k,ψ_k}(θ_k, u_k),

where γ_k > 0 is the mirror-descent step size. Then by step 6 of Algorithm 1, we have

(14)  θ_{k+1} = θ_k + η_k ( θ̃_{k+1} − θ_k ),

where η_k ∈ (0, 1]. Due to the convexity of the set Θ and the fact that θ̃_{k+1}, θ_k ∈ Θ, choosing the parameter η_k ∈ (0, 1] ensures that the updated sequence {θ_k} stays in the set Θ.
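Under our reading of Eqs. (13)-(14), one BGPO-style iteration with a diagonal quadratic mirror map and a basic momentum estimator u_k = (1−β_k)u_{k−1} + β_k ĝ(τ_k|θ_k) can be sketched as follows; the estimator form, the variable names, and the closed-form mirror step are assumptions rather than the exact content of Algorithm 1.

```python
import numpy as np

def bgpo_step(theta, u_prev, g_hat, beta, gamma_k, eta_k, h):
    """One BGPO-style iteration (our reading of steps 4-7 of Algorithm 1).

    theta   : current policy parameter theta_k
    u_prev  : previous momentum gradient estimate u_{k-1}
    g_hat   : stochastic gradient of f(theta) = -J(theta) at theta_k
    beta    : momentum weight beta_k in (0, 1]
    gamma_k : mirror-descent step size
    eta_k   : interpolation weight eta_k in (0, 1]
    h       : positive diagonal of the mirror map psi_k(x) = 0.5 * x^T diag(h) x
    """
    # Basic momentum gradient estimator (assumed form).
    u = (1.0 - beta) * u_prev + beta * g_hat
    # Mirror-descent step (13): closed form for the diagonal quadratic mirror map.
    theta_tilde = theta - gamma_k * u / h
    # Interpolation step (14): theta_{k+1} = theta_k + eta_k * (theta_tilde - theta_k).
    theta_next = theta + eta_k * (theta_tilde - theta)
    return theta_next, u
```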

In fact, our algorithm unifies many popular policy optimization algorithms. When the mirror mappings are ψ_k(θ) = ½‖θ‖² for all k ≥ 1, the update (14) reduces to the classic policy gradient algorithms (Sutton et al., 2000; Zhang et al., 2019, 2020). In other words, given Θ = ℝ^d and ψ_k(θ) = ½‖θ‖², we have D_{ψ_k}(θ, θ_k) = ½‖θ − θ_k‖² and

(15)  θ_{k+1} = θ_k − η_k γ_k u_k.

When the mirror mappings are ψ_k(θ) = ½ θᵀ F(θ_k) θ with the Fisher information matrix F(θ_k) = E[ ∇_θ log π_θ(a|s) ∇_θ log π_θ(a|s)ᵀ ] for all k ≥ 1, the update (14) reduces to natural policy gradient algorithms (Kakade, 2001; Liu et al., 2020). Similarly, given Θ = ℝ^d and ψ_k(θ) = ½ θᵀ F(θ_k) θ, we have D_{ψ_k}(θ, θ_k) = ½ (θ − θ_k)ᵀ F(θ_k) (θ − θ_k) and

(16)  θ_{k+1} = θ_k − η_k γ_k F(θ_k)^† u_k,

where F(θ_k)^† denotes the Moore-Penrose pseudoinverse of the Fisher information matrix F(θ_k).
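As a sanity check on these two special cases, the following numpy sketch contrasts the Euclidean update (15) with the natural-gradient update (16), estimating the Fisher information matrix from sampled score vectors and applying its Moore-Penrose pseudoinverse; this is an illustrative sketch, not the paper's implementation.

```python
import numpy as np

def vanilla_pg_update(theta, grad_J, step):
    """Update (15): Euclidean mirror map, i.e., plain gradient ascent on J
    (equivalently, descent on f = -J)."""
    return theta + step * grad_J

def natural_pg_update(theta, grad_J, score_samples, step):
    """Update (16): mirror map built from the Fisher information matrix.

    score_samples: array of shape (n, d) whose rows are grad_theta log pi_theta(a|s)
    for sampled state-action pairs; the empirical Fisher is their outer-product average.
    """
    fisher = score_samples.T @ score_samples / score_samples.shape[0]
    return theta + step * np.linalg.pinv(fisher) @ grad_J

theta = np.zeros(3)
grad_J = np.array([1.0, 0.5, -0.2])
scores = np.random.randn(128, 3)
print(vanilla_pg_update(theta, grad_J, 0.1))
print(natural_pg_update(theta, grad_J, scores, 0.1))
```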

When the mirror mapping ψ is the Boltzmann-Shannon entropy function (Shannon, 1948), i.e., the Bregman distance D_ψ is the KL divergence, the update (14) reduces to the mirror descent policy optimization (MDPO) (Tomar et al., 2020) update.

4.2 VR-BGPO Algorithm

In this subsection, we propose a faster variance-reduced Bregman gradient policy optimization (VR-BGPO) algorithm based on a momentum variance-reduced technique. The pseudo-code of VR-BGPO is provided in Algorithm 2.

Since the problem (4) is non-oblivious, i.e., the distribution p(τ|θ) depends on the variable θ, which changes over the course of the optimization procedure, we apply the importance sampling weight (Papini et al., 2018; Xu et al., 2019a) in estimating our policy gradient, defined as

w(τ | θ_{k−1}, θ_k) = p(τ | θ_{k−1}) / p(τ | θ_k) = ∏_{t=0}^{H−1} π_{θ_{k−1}}(a_t|s_t) / π_{θ_k}(a_t|s_t).

Except for the different stochastic policy gradient estimators and tuning parameters used in Algorithms 1 and 2, steps 5 and 6 of these algorithms for updating the parameter θ are the same. Interestingly, when choosing the mirror mapping ψ_k(θ) = ½‖θ‖², our VR-BGPO algorithm reduces to a non-adaptive version of the IS-MBPG algorithm (Huang et al., 2020).
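The sketch below illustrates the trajectory importance weight together with a STORM-style momentum variance-reduced gradient estimate of the kind VR-BGPO builds on; the exact estimator used in Algorithm 2 may differ in its details, so this should be read as an assumption-laden illustration.

```python
import numpy as np

def importance_weight(logp_old, logp_new):
    """w(tau | theta_old, theta_new) = p(tau|theta_old) / p(tau|theta_new),
    computed from per-step log-probabilities log pi_theta(a_t|s_t) of one trajectory."""
    return float(np.exp(np.sum(logp_old) - np.sum(logp_new)))

def storm_vr_gradient(u_prev, g_new, g_old, w, beta):
    """STORM-style momentum variance-reduced estimator (assumed form):
    u_k = g_new + (1 - beta) * (u_prev - w * g_old),
    where g_new and g_old are stochastic gradients at theta_k and theta_{k-1}
    evaluated on the same trajectory, and w is the importance weight above."""
    return g_new + (1.0 - beta) * (u_prev - w * g_old)
```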

1:  Input: Total iterations T, tuning parameters, and μ-strongly convex mirror mappings {ψ_k}_{k=1}^T;
2:  Initialize: θ_1 ∈ Θ; sample a trajectory τ_1 from p(·|θ_1), and compute the corresponding stochastic policy gradient;
3:  for k = 1, 2, …, T do
4:     Compute ;
5:     Update ;
6:     Update ;
7:     Update ;
8:     Sample a trajectory τ_{k+1} from p(·|θ_{k+1}), and compute the corresponding stochastic policy gradient;
9:  end for
10:  Output: θ chosen uniformly at random from {θ_k}_{k=1}^T.
Algorithm 2 VR-BGPO Algorithm
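Putting these pieces together, the loop below sketches our reading of Algorithm 2 for a diagonal mirror map, combining the importance-weighted STORM-style estimator with the mirror-descent and interpolation steps; the sampling routine `sample_grad`, the estimator form, and the constant parameter schedules are placeholders rather than the paper's exact choices.

```python
import numpy as np

def vr_bgpo(theta0, sample_grad, T, gamma_k=0.1, eta_k=0.5, beta_k=0.2, h=None):
    """Sketch of a VR-BGPO-style loop (our reading of Algorithm 2).

    sample_grad(theta, theta_old) is assumed to sample one trajectory under theta and
    return (g_new, g_old, w): stochastic gradients of f = -J at theta and theta_old on
    that trajectory, and the importance weight w(tau | theta_old, theta).
    """
    d = theta0.shape[0]
    h = np.ones(d) if h is None else h
    theta = theta0.copy()
    g_new, _, _ = sample_grad(theta, theta)
    u = g_new                                             # initial gradient estimate
    for k in range(T):
        theta_tilde = theta - gamma_k * u / h             # mirror-descent step (13)
        theta_next = theta + eta_k * (theta_tilde - theta)  # interpolation step (14)
        g_new, g_old, w = sample_grad(theta_next, theta)  # one trajectory per iteration
        u = g_new + (1.0 - beta_k) * (u - w * g_old)      # STORM-style variance-reduced estimate
        theta = theta_next
    return theta
```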

5 Convergence Analysis

In this section, we will analyze the convergence properties of the proposed algorithms. All related proofs are provided in the Appendix.

5.1 Convergence Metric

In this subsection, we give a reasonable metric to measure the convergence of our algorithms. The metric is built from the Bregman gradient (12) and depends on the μ-strongly convex mirror maps ψ_k; this convergence metric is also used in the SUPER-ADAM algorithm (Huang et al., 2021). When ψ_k(θ) = ½‖θ‖², the Bregman gradient reduces to the ordinary gradient, so the metric recovers a standard gradient-norm-based stationarity measure. In fact, our metric is a generalization of the metric used in (Zhang and He, 2018; Yang et al., 2019).

5.2 Some Mild Assumptions

In this subsection, we first state some standard assumptions.

Assumption 1

For the function log π_θ(a|s), its gradient and Hessian matrix are bounded, i.e., there exist constants G, M > 0 such that

(20)  ‖∇_θ log π_θ(a|s)‖ ≤ G,   ‖∇²_θ log π_θ(a|s)‖ ≤ M,   for all θ, s, and a.
Assumption 2

The variance of the stochastic policy gradient is bounded, i.e., there exists a constant σ > 0 such that Var(ĝ(τ|θ)) ≤ σ² for all θ.

Assumption 3

For the importance sampling weight w(τ | θ₁, θ₂), its variance is bounded, i.e., there exists a constant W > 0 such that Var(w(τ | θ₁, θ₂)) ≤ W for any θ₁, θ₂ ∈ Θ and τ ∼ p(·|θ₂).

Assumption 4

The function J(θ) has an upper bound on Θ, i.e., J* = sup_{θ∈Θ} J(θ) < +∞.

Assumptions 1 and 2 are standard in PG algorithms (Papini et al., 2018; Xu et al., 2019a, b). Assumption 3 is widely used in the study of variance-reduced PG algorithms (Papini et al., 2018; Xu et al., 2019a). In fact, the bounded importance sampling weight might be violated in some cases, such as when using neural networks as the policy. In that case, we can clip the importance sampling weights to guarantee the effectiveness of our algorithms, as in (Papini et al., 2018). Assumption 4 guarantees the feasibility of the problem (4). Next, we provide some useful lemmas based on the above assumptions.
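For instance, a minimal way to keep the importance weights bounded in practice is to clip them to a fixed interval, as in the sketch below; the clipping range here is a hypothetical choice.

```python
import numpy as np

def clipped_importance_weight(logp_old, logp_new, w_min=0.1, w_max=10.0):
    """Clip the trajectory importance weight into [w_min, w_max] so that its
    variance stays bounded, in the spirit of Assumption 3."""
    w = np.exp(np.sum(logp_old) - np.sum(logp_new))
    return float(np.clip(w, w_min, w_max))
```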

(Proposition 4.2 in Xu et al. (2019b)) Suppose ĝ(τ|θ) is the PGT estimator. Under Assumption 1, we have

  • ĝ(τ|θ) is Lipschitz continuous, i.e., ‖ĝ(τ|θ₁) − ĝ(τ|θ₂)‖ ≤ L_g‖θ₁ − θ₂‖ for all θ₁, θ₂, where L_g is a constant depending on G, M, the horizon H, and the discount factor γ;

  • J(θ) is L-smooth, i.e., ‖∇J(θ₁) − ∇J(θ₂)‖ ≤ L‖θ₁ − θ₂‖;

  • ĝ(τ|θ) is bounded, i.e., ‖ĝ(τ|θ)‖ ≤ G_g for all θ, with G_g a constant depending on G, H, and γ.

(Lemma 6.1 in Xu et al. (2019a)) Under Assumptions 1 and 3, let w(τ | θ₁, θ₂) = p(τ|θ₁)/p(τ|θ₂); then we have

(21)  Var( w(τ | θ₁, θ₂) ) ≤ C_w ‖θ₁ − θ₂‖²,

where C_w is a constant depending on G, M, W, and the horizon H.

5.3 Convergence Analysis of BGPO Algorithm

In this subsection, we provide the convergence properties of the BGPO algorithm. The detailed proof is provided in Appendix A1.

Theorem 5.3. Assume the sequence {θ_k}_{k=1}^T is generated by Algorithm 1. Under appropriate choices of the tuning parameters γ_k and η_k, the expected convergence metric decays at the rate Õ(1/T^{1/4}).

Theorem 5.3 shows that the BGPO algorithm has a convergence rate of Õ(1/T^{1/4}). Setting 1/T^{1/4} ≤ ϵ yields T = Õ(ϵ^-4). Since the BGPO algorithm only needs one trajectory to estimate the stochastic policy gradient at each iteration and runs T iterations, it has a sample complexity of Õ(ϵ^-4) for finding an ϵ-stationary point.

5.4 Convergence Analysis of VR-BGPO Algorithm

In this subsection, we give the convergence properties of the VR-BGPO algorithm. The detailed proof is provided in Appendix A2.

Theorem 5.4. Suppose the sequence {θ_k}_{k=1}^T is generated by Algorithm 2. Under appropriate choices of the tuning parameters γ_k and η_k, the expected convergence metric decays at the rate Õ(1/T^{1/3}).

Theorem 5.4 shows that the VR-BGPO algorithm has a convergence rate of Õ(1/T^{1/3}). Setting 1/T^{1/3} ≤ ϵ yields T = Õ(ϵ^-3). Since the VR-BGPO algorithm only needs one trajectory to estimate the stochastic policy gradient at each iteration and runs T iterations, it reaches a lower sample complexity of Õ(ϵ^-3) for finding an ϵ-stationary point.
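For completeness, the sample-complexity arithmetic behind these two remarks, under one trajectory per iteration, is simply:

```latex
\underbrace{\tfrac{1}{T^{1/4}} \le \epsilon}_{\text{BGPO rate}}
  \;\Longrightarrow\; T \ge \epsilon^{-4}
  \;\Longrightarrow\; \#\text{trajectories} = T \cdot 1 = \tilde{O}(\epsilon^{-4}),
\qquad
\underbrace{\tfrac{1}{T^{1/3}} \le \epsilon}_{\text{VR-BGPO rate}}
  \;\Longrightarrow\; T \ge \epsilon^{-3}
  \;\Longrightarrow\; \#\text{trajectories} = T \cdot 1 = \tilde{O}(\epsilon^{-3}).
```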

6 Experiments

In this section, we conduct experiments on several RL tasks to verify the effectiveness of our methods. We first study the effect of different choices of Bregman divergences in our algorithms (BGPO and VR-BGPO), and then we compare our VR-BGPO algorithm with other state-of-the-art methods: TRPO (Schulman et al., 2015a), PPO (Schulman et al., 2017), VRMPO (Yang et al., 2019), and MDPO (Tomar et al., 2020).

6.1 Effects of Bregman Divergences

Figure 1: Effects of two Bregman divergences, the p-norm mapping and the diagonal term (Diag), on CartPole-v1, Acrobot-v1, and MountainCarContinuous-v0.

Figure 2: Comparison between BGPO and VR-BGPO on CartPole-v1, Acrobot-v1, and MountainCarContinuous-v0.

Figure 3: Experimental results of our VR-BGPO and other baseline algorithms on six environments: InvertedPendulum-v2, InvertedDoublePendulum-v2, Walker2d-v2, Swimmer-v2, Reacher-v2, and HalfCheetah-v2.

In this subsection, we examine how different Bregman divergences affect the performance of our algorithms. In the first setting, we let the mirror mapping be ψ(θ) = ½‖θ‖²_p with different values of p. Let ψ* be the conjugate mapping of ψ, i.e., ψ*(θ) = ½‖θ‖²_q with 1/p + 1/q = 1. According to (Beck and Teboulle, 2003), the update of θ in our algorithms can then be computed by θ̃_{k+1} = ∇ψ*(∇ψ(θ_k) − γ_k u_k), where ∇ψ and ∇ψ* are the p-norm and q-norm link functions, [∇ψ(θ)]_j = sign(θ_j)|θ_j|^{p−1}/‖θ‖_p^{p−2} and [∇ψ*(θ)]_j = sign(θ_j)|θ_j|^{q−1}/‖θ‖_q^{q−2}, and j is the coordinate index of θ. In the second setting, we apply a diagonal term to the mirror mapping, i.e., ψ_k(θ) = ½ θᵀ H_k θ, where H_k is a diagonal matrix with positive entries. In the experiments, we generate the diagonal entries of H_k from a running average of the squared stochastic gradients, as in the SUPER-ADAM algorithm (Huang et al., 2021). Then we have D_{ψ_k}(θ, θ_k) = ½ (θ − θ_k)ᵀ H_k (θ − θ_k). Under this setting, the update of θ can also be solved analytically.
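The two updates used in this comparison can be sketched as follows: the p-norm link-function update θ̃_{k+1} = ∇ψ*(∇ψ(θ_k) − γ_k u_k) and the closed-form step for the diagonal mirror map; the link-function formulas follow the standard p-norm/q-norm duality, and the diagonal entries are left as inputs rather than the exact SUPER-ADAM construction.

```python
import numpy as np

def pnorm_link(x, p):
    """Gradient of psi(x) = 0.5 * ||x||_p^2 (the p-norm link function), p > 1."""
    norm = np.linalg.norm(x, ord=p)
    if norm == 0.0:
        return np.zeros_like(x)
    return np.sign(x) * np.abs(x) ** (p - 1) / norm ** (p - 2)

def pnorm_mirror_update(theta, u, gamma, p):
    """theta_tilde = grad psi*(grad psi(theta) - gamma * u), with psi* = 0.5 * ||.||_q^2
    and 1/p + 1/q = 1 (Beck and Teboulle, 2003)."""
    q = p / (p - 1.0)
    return pnorm_link(pnorm_link(theta, p) - gamma * u, q)

def diag_mirror_update(theta, u, gamma, h):
    """Closed-form mirror step for psi(x) = 0.5 * x^T diag(h) x with positive h."""
    return theta - gamma * u / h
```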

To test the effectiveness of the two Bregman divergences, we evaluate them on three classic control environments from gym (Brockman et al., 2016): CartPole-v1, Acrobot-v1, and MountainCarContinuous-v0. In the experiments, a categorical policy is used for the CartPole and Acrobot environments, and a Gaussian policy is used for MountainCar. Gaussian value functions are used in all settings. All policies and value functions are parameterized by multilayer perceptrons (MLPs). For a fair comparison, all settings use the same initialization for the policies. We run each setting five times and plot the mean and variance of the average returns.

For the p-norm mapping, we test three different values of p. For the diagonal mapping, we set H_k as described above. We keep the remaining hyperparameters the same across settings; the step size still needs to be tuned for each p to achieve relatively good performance. For simplicity, we use BGPO-Diag to denote BGPO with the diagonal mapping and BGPO-p to denote BGPO with the p-norm mapping. Details about the setup of the environments and hyperparameters are provided in Appendix B.

From Fig. 1, we find that BGPO-Diag largely outperforms BGPO-p for all tested choices of p. Parameter tuning for BGPO-p is also much more difficult than for BGPO-Diag, because each p requires an individually tuned step size to achieve the desired performance, and p itself may need to be tuned to reach the best possible performance. In contrast, the parameter tuning of BGPO-Diag is much simpler, and its performance is also more stable.

6.2 Comparison between BGPO and VR-BGPO

To understand the effectiveness of the variance-reduction technique used in our VR-BGPO algorithm, we compare BGPO and VR-BGPO using the same settings introduced in Section 6.1. Both algorithms use the diagonal mapping, since it performs much better than the p-norm mapping.

From Fig. 2, we can see that VR-BGPO outperforms BGPO in all three environments. In CartPole, both algorithms converge very fast and have similar performance, while VR-BGPO is more stable than BGPO. The advantage of VR-BGPO becomes larger in the Acrobot and MountainCar environments, probably because these tasks are more difficult than CartPole.

6.3 Comparison with Other Methods

In this subsection, we compare our VR-BGPO algorithm with other methods, since it performs better than BGPO as shown in Section 6.2. Specifically, we compare VR-BGPO with two popular policy optimization algorithms, PPO (Schulman et al., 2017) and TRPO (Schulman et al., 2015a), as well as with two recently proposed mirror-descent-based policy optimization algorithms, VRMPO (Yang et al., 2019) and MDPO (Tomar et al., 2020). For VR-BGPO, we continue to use the diagonal mapping. For VRMPO, we follow their implementation and use the norm mapping specified in their paper. For MDPO, the mirror mapping is the negative Shannon entropy, so the Bregman divergence becomes the KL divergence.

To evaluate the performance of these algorithms, we test them on six gym (Brockman et al., 2016) environments with continuous control tasks: InvertedPendulum-v2, InvertedDoublePendulum-v2, Walker2d-v2, Reacher-v2, Swimmer-v2, and HalfCheetah-v2. We use Gaussian policies and Gaussian value functions for all environments, and both are parameterized by MLPs. To ensure a fair comparison, all policies use the same initialization. For TRPO and PPO, we use the implementations provided by garage (garage contributors, 2019). We carefully implement MDPO and VRMPO following the descriptions in the original papers. All methods, including ours, are implemented with garage (garage contributors, 2019) and PyTorch (Paszke et al., 2019). We run all algorithms ten times on each environment and report the mean and variance of the average returns. Details about the setup of the environments and hyperparameters are also provided in Appendix B.

From Fig. 3, we find that VR-BGPO consistently outperforms all the other methods on the six environments, demonstrating the effectiveness of our algorithm. Specifically, our method achieves the best mean average return in InvertedDoublePendulum, Swimmer, and HalfCheetah, and it converges faster than the other methods in all environments, especially in InvertedPendulum, InvertedDoublePendulum, and Reacher. MDPO achieves good results in some environments, but it cannot outperform PPO or TRPO in Swimmer and InvertedDoublePendulum. VRMPO only outperforms PPO and TRPO in Reacher and InvertedDoublePendulum. The undesirable performance of VRMPO is probably because it uses a norm mirror mapping, which requires careful tuning of the learning rate.

7 Conclusion

In this paper, we proposed a novel Bregman gradient policy optimization framework for reinforcement learning based on mirror descent iteration and momentum techniques. Moreover, we studied the convergence properties of the proposed policy optimization methods under the nonconvex setting.

We thank the IT Help Desk at the University of Pittsburgh. This work was partially supported by NSF IIS 1836945, IIS 1836938, IIS 1845666, IIS 1852606, IIS 1838627, IIS 1837956.


References