A Hybrid Stochastic Policy Gradient Algorithm for Reinforcement Learning

03/01/2020 ∙ by Nhan H. Pham, et al.

We propose a novel hybrid stochastic policy gradient estimator by combining an unbiased policy gradient estimator, the REINFORCE estimator, with a biased one, an adapted SARAH estimator, for policy optimization. The hybrid policy gradient estimator is shown to be biased but to have a variance-reduction property. Using this estimator, we develop a new Proximal Hybrid Stochastic Policy Gradient Algorithm (ProxHSPGA) to solve a composite policy optimization problem that allows us to handle constraints or regularizers on the policy parameters. We first propose a single-loop algorithm, then introduce a more practical restarting variant. We prove that both algorithms achieve the best-known trajectory complexity O(ε^-3) to attain a first-order stationary point of the composite problem, which is better than the existing REINFORCE/GPOMDP complexity O(ε^-4) and the SVRPG complexity O(ε^-10/3) in the non-composite setting. We evaluate the performance of our algorithm on several well-known examples in reinforcement learning. Numerical results show that our algorithm outperforms two existing methods on these examples. Moreover, the composite settings indeed have some advantages compared to the non-composite ones on certain problems.




1 Introduction

Recently, research on reinforcement learning (RL) (Sutton and Barto, 2018)

, an area of machine learning concerned with learning how to make a sequence of decisions while interacting with the underlying environment, has been immensely active. Unlike supervised learning, reinforcement learning agents often have limited or no knowledge about the environment, and the rewards of taking certain actions might not be immediately observed, making these problems more challenging to solve. Over the past decade, a large body of research has developed and used reinforcement learning to solve emerging problems. Notable reinforcement learning agents include, but are not limited to, AlphaGo and AlphaZero

(Silver et al., 2016, 2018), OpenAIFive (OpenAI, 2018), and AlphaStar (DeepMind, 2019).

In modern RL tasks, the environment is often not known beforehand, so the agent has to learn the environment while simultaneously making appropriate decisions. One approach is to estimate the value function or the action-value function, e.g., Q-learning (Watkins and Dayan, 1992) and its variants such as Deep Q-learning (DQN) (Mnih et al., 2013, 2015), Dueling DQN (Wang et al., 2016), and double Q-learning (Hasselt et al., 2016).

It has been observed that learning the value function is not efficient when the action space is large or even infinite. In such cases, policy gradient methods learn the policy directly with a parameterized function. Silver et al. (2014) present a framework for deterministic policy gradient algorithms, which can be estimated more efficiently than their stochastic counterparts, whereas DDPG (Lillicrap et al., 2016) adapts the idea of deep Q-learning to continuous-action RL tasks. TRPO (Schulman et al., 2015) uses a constraint on the KL divergence between the new and old policies to improve the robustness of each update. PPO (Schulman et al., 2017) is an extension of TRPO which uses a clipped surrogate objective, resulting in a simpler implementation. Other policy gradient methods utilize the actor-critic paradigm, including ACER (Wang et al., 2017), A3C (Mnih et al., 2016) and its synchronous variant A2C, ACKTR (Wu et al., 2017), and SAC (Haarnoja et al., 2018).

REINFORCE (Williams, 1992) is perhaps the classical method most closely related to our work. It uses an estimator of the policy gradient and applies a gradient ascent step to update the policy. Nevertheless, the REINFORCE estimator is known to have high variance, leading to several weaknesses. Improvements to reduce the variance, such as adding baselines (Sutton and Barto, 2018; Zhao et al., 2011) or discarding some rewards in the so-called GPOMDP estimator (Baxter and Bartlett, 2001), have been proposed. While the REINFORCE estimator is an unbiased policy gradient estimator, GPOMDP is shown to be biased (Baxter and Bartlett, 2001), making theoretical analysis harder.

The REINFORCE algorithm is closely related to stochastic gradient descent (SGD) (Robbins and Monro, 1951) in stochastic nonconvex optimization. In particular, the standard SGD estimator is also known to have a non-vanishing variance, which is often high. On the one hand, there are algorithms trying to reduce the oscillation (Tieleman and Hinton, 2012) or to introduce momentum or adaptive updates (Allen-Zhu, 2017, 2018; Kingma and Ba, 2014) for SGD methods to accelerate performance. On the other hand, other researchers have searched for new gradient estimators. One approach is the SAGA estimator proposed by Defazio et al. (2014). Another well-known estimator is the SVRG estimator (Johnson and Zhang, 2013), which has been intensively studied in recent works, e.g., in Allen-Zhu and Yuan (2016); Li and Li (2018); Reddi et al. (2016); Zhou et al. (2018). This estimator not only overcomes the storage issue of SAGA but also possesses a variance-reduction property, i.e., the variance of the estimator decreases over epochs. Methods based on SVRG estimators have recently been developed for reinforcement learning, e.g., SVRPG (Papini et al., 2018). Xu et al. (2019a) refine the analysis of SVRPG to achieve an improved trajectory complexity of O(ε^-10/3). Shen et al. (2019) also adopt the SVRG estimator for policy gradient and achieve a trajectory complexity of O(ε^-3) with the use of a second-order estimator.

While the SGD, SAGA, and SVRG estimators are unbiased, there have been algorithms developed based on a biased gradient estimator named SARAH (Nguyen et al., 2017b). Such algorithms include SARAH (Nguyen et al., 2017a, 2019), SPIDER (Fang et al., 2018), SpiderBoost (Wang et al., 2018), and ProxSARAH (Pham et al., 2019). Similar to SVRG, all these methods can potentially be extended to reinforcement learning. A recent attempt is SARAPO (Yuan et al., 2019), which combines SARAH (Nguyen et al., 2019) with the TRPO (Schulman et al., 2015) algorithm, but no theoretical guarantee is provided. Yang and Zhang (2019) propose the Mirror Policy Optimization (MPO) algorithm, which covers the classical policy gradient and the natural policy gradient as special cases. They also introduce a variance-reduction variant, called VRMPO, which achieves an O(ε^-3) trajectory complexity. Another notable work is SRVR-PG (Xu et al., 2019b), whose policy gradient estimator is an adapted version of the SARAH estimator for reinforcement learning. Note that Yang and Zhang (2019) and Xu et al. (2019b) achieve the same O(ε^-3) trajectory complexity as ours. However, our algorithm is essentially different. Xu et al. (2019b) and Yang and Zhang (2019) use two different adaptations of the SARAH estimator for policy gradient: Xu et al. (2019b) use an importance weight in their estimator to handle distribution shift, while Yang and Zhang (2019) remove it, as in Shen et al. (2019). Meanwhile, we introduce a new policy gradient estimator which can also be computed recursively. The new estimator is fundamentally different from the other two since it combines the adapted SARAH estimator of Xu et al. (2019b) with the classical REINFORCE estimator. In addition, our analysis shows that the best-known convergence rate and complexity can be achieved by our single-loop algorithm (Algorithm 1), while SRVR-PG and VRMPO require double loops to achieve the same oracle complexity. Moreover, Xu et al. (2019b); Yang and Zhang (2019) do not consider the composite setting that includes constraints or regularizers on the policy parameters as we do.

Algorithms | Complexity | Composite | Single-loop
REINFORCE (Williams, 1992) | O(ε^-4) | ✗ | ✓
GPOMDP (Baxter and Bartlett, 2001) | O(ε^-4) | ✗ | ✓
SVRPG (Papini et al., 2018) | O(ε^-4) | ✗ | ✗
SVRPG (Xu et al., 2019a) | O(ε^-10/3) | ✗ | ✗
HAPG (Shen et al., 2019) | O(ε^-3) | ✗ | ✗
VRMPO (Yang and Zhang, 2019) | O(ε^-3) | ✗ | ✗
SRVR-PG (Xu et al., 2019b) | O(ε^-3) | ✗ | ✗
This work | O(ε^-3) | ✓ | ✓
Table 1: A comparison between different methods for the non-composite setting (1) of the composite problem (2).

Our approach:

Our approach lies in the stochastic variance-reduction avenue, but uses a completely new hybrid approach, leading to a novel estimator compared to existing methods in reinforcement learning. We build our estimator by taking a convex combination of the adapted SARAH estimator (Nguyen et al., 2017b) and REINFORCE (Williams, 1992), a classical unbiased policy gradient estimator. This hybrid estimator not only allows us to trade off the bias and variance between these two estimators but also possesses useful properties for developing new algorithms. Note that the idea of combining stochastic estimators was first proposed for stochastic optimization in our recent works (Tran-Dinh et al., 2019b, a).

Unlike existing policy gradient methods, our algorithm first samples a large batch of trajectories to establish a good search direction. After that, it iteratively updates the policy parameters using our hybrid estimator, leading to a single-loop method without any snapshot loop as in SVRG or SARAH variants. In addition, as regularization techniques have shown their effectiveness in deep learning (Neyshabur et al., 2017; Zhang et al., 2017), they may have great potential in reinforcement learning algorithms too. A recent study (Liu et al., 2019) shows that regularization on the policy parameters can greatly improve the performance of policy gradient algorithms. Motivated by these facts, we directly consider a new composite setting (2), as presented in Section 3. For this new composite model, it is not clear whether existing algorithms remain convergent by simply adding a projection step onto the constraint set, while our method does guarantee convergence.

Our contribution:

To this end, our contribution in this paper can be summarized as follows:

  • We introduce a novel hybrid stochastic policy gradient estimator by combining the existing REINFORCE estimator with the adapted SARAH estimator for policy gradient. We investigate some key properties of our estimator that can be used for algorithmic development.

  • We propose a new algorithm to solve a composite maximization problem for policy optimization in reinforcement learning. Our model not only covers existing settings but also handles constraints and convex regularizers on policy parameters.

  • We provide a convergence analysis as the first theoretical result for composite optimization in reinforcement learning, estimate the trajectory complexity of our algorithm, and show that it achieves the best-known complexity among existing first-order methods (see Table 1).

Our algorithm has only one loop, like REINFORCE or GPOMDP, which is fundamentally different from SVRPG, SVRG-adapted, and other SARAH-based algorithms for RL. It can work with a single sample or a mini-batch and has two steps: a proximal gradient step and an averaging step with different step-sizes. This makes the algorithm more flexible in using different step-sizes without sacrificing the overall complexity.

Paper outline:

The rest of this paper is organized as follows. Section 2 describes the problem of interest and gives an overview of policy gradient methods. Section 3 introduces our new hybrid estimator for policy gradient and develops the main algorithm. The complexity analysis is presented in Section 4, while Section 5 provides several numerical examples. All technical proofs and experimental details are given in the Supplementary Document (Supp. Doc.).

2 Model and Problem Statement


We consider a Markov Decision Process (MDP) (Sutton and Barto, 2018) equipped with components $\{\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma, \rho\}$, where $\mathcal{S}$, $\mathcal{A}$ are the state and action spaces, $\mathcal{P}$ denotes the set of transition probabilities when taking certain actions, $\mathcal{R}$ is the reward function which characterizes the immediate reward earned by taking a certain action, $\gamma \in (0, 1)$ is a discount factor, and $\rho$ is the initial state distribution.

Let $\pi(a|s)$ be a density function over $\mathcal{A}$ when the current state is $s$, and let $\pi_{\theta}$ be a policy parameterized by a parameter $\theta$. A trajectory $\tau = \{s_0, a_0, \dots, s_{H-1}, a_{H-1}\}$ with effective length $H$ is a collection of states and actions sampled from a stationary policy. Denote by $p(\tau|\theta)$ the density induced by the policy $\pi_{\theta}$ over all possible trajectories, so that $p(\tau|\theta)$ is the probability of observing a trajectory $\tau$. Also, let $\mathcal{R}(\tau) := \sum_{t=0}^{H-1}\gamma^{t}\mathcal{R}(s_t, a_t)$ be the total discounted reward for a trajectory $\tau$. Solving an MDP is equivalent to finding the parameter $\theta$ that maximizes the expected cumulative discounted reward.

Classical policy gradient methods:

Policy gradient methods seek a differentiable parameterized policy $\pi_{\theta}$ that maximizes the expected cumulative discounted reward as

$\max_{\theta \in \mathbb{R}^{d}} \Big\{ J(\theta) := \mathbb{E}_{\tau \sim p(\cdot|\theta)}\big[\mathcal{R}(\tau)\big] \Big\}, \qquad (1)$

where $d$ is the parameter dimension. The policy gradient theorem (Sutton et al., 1999) shows that

$\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim p(\cdot|\theta)}\Big[\textstyle\sum_{t=0}^{H-1}\nabla_{\theta}\log \pi_{\theta}(a_t|s_t)\,\mathcal{R}(\tau)\Big],$

where the policy gradient does not depend on the gradient of the state distribution, despite the fact that the state distribution depends on the policy parameters (Silver et al., 2014).

This policy gradient can be used in gradient ascent algorithms to update the parameter $\theta$. However, we cannot calculate the full gradient at each update since we only get a finite number of samples at each iteration. Consequently, the policy gradient is often estimated by its sample average. At each iteration, a batch $\mathcal{B}$ of trajectories is sampled from the environment to estimate the policy gradient as

$\widehat{\nabla}_{\theta} J(\theta) := \frac{1}{|\mathcal{B}|}\sum_{\tau \in \mathcal{B}} g(\tau|\theta),$

where $g(\tau|\theta)$ is a sample estimator of $\nabla_{\theta} J(\theta)$. We call $g(\tau|\theta)$ a stochastic policy gradient (SPG) estimator. This estimator has been exploited in the two well-known methods REINFORCE (Williams, 1992) and GPOMDP (Baxter and Bartlett, 2001). The main step of policy gradient ascent methods is to update the parameters as

$\theta_{t+1} := \theta_{t} + \eta_{t}\,\widehat{\nabla}_{\theta} J(\theta_{t}),$

where $\eta_{t} > 0$ is an appropriate learning rate, which can be fixed or varied over $t$. Since the policy changes after each update, the density $p(\cdot|\theta)$ also changes and creates non-stationarity in the problem, which will be handled by an importance weight in Section 3.
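To make the sample-average estimate and the ascent step above concrete, here is a minimal sketch on a hypothetical 3-armed bandit with a softmax policy (all names, reward values, and step-sizes are illustrative, not from the paper):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_grad(theta, actions, rewards):
    """Sample-average REINFORCE estimator for a softmax bandit policy:
    (1/|B|) * sum_i grad_theta log pi_theta(a_i) * R_i."""
    g = np.zeros_like(theta)
    for a, r in zip(actions, rewards):
        score = -softmax(theta)
        score[a] += 1.0            # grad_theta log softmax(theta)[a]
        g += score * r
    return g / len(actions)

rng = np.random.default_rng(0)
true_rewards = np.array([1.0, 0.2, 0.1])   # hypothetical expected rewards
theta = np.zeros(3)
for t in range(2000):
    a = rng.choice(3, p=softmax(theta))
    r = true_rewards[a] + 0.1 * rng.standard_normal()
    theta += 0.1 * reinforce_grad(theta, [a], [r])   # gradient *ascent* step

print(softmax(theta).argmax())
```

With the clear reward gap, the policy concentrates on the best arm; the high variance of the single-sample estimator is exactly the issue the hybrid estimator of Section 3 targets.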

3 A New Hybrid Stochastic Policy Gradient Algorithm

In this section, we first introduce a composite model for policy optimization. Next, we extend the hybrid gradient idea from Tran-Dinh et al. (2019a) to policy gradient estimators. Finally, we develop a new proximal policy gradient algorithm and its restart variant to solve the composite policy optimization problem and analyze their trajectory complexity.

3.1 Composite Policy Optimization Model

While the objective function in (1) is standard in most policy gradient methods, it is natural to impose constraints or regularizers on the policy parameters. In addition, adding constraints can prevent the explosion of parameters in highly nonlinear models, as often seen in deep learning (Srivastava et al., 2014). Adopting the idea of composite nonconvex optimization (Pham et al., 2019), we are interested in the following more general optimization problem in reinforcement learning:

$\max_{\theta \in \mathbb{R}^{d}} \Big\{ F(\theta) := J(\theta) - \phi(\theta) \Big\}, \qquad (2)$

where $\phi$ is a proper, closed, and convex function acting as a regularizer, which can be the indicator function of a convex set representing the constraints on the parameters or a standard regularizer such as the $\ell_1$-norm or $\ell_2$-norm. If there is no regularizer $\phi$, problem (2) reduces to the standard formulation (1).

3.2 Assumptions

Let $F := J - \phi$ be the total objective function. We impose the following assumptions for our convergence analysis, which are commonly used in practice.

Assumption 3.1.

The regularizer $\phi$ is a proper, closed, and convex function. We also assume that the domain of $F$ is nonempty and that $F$ admits a finite upper bound $F^{\star} := \sup_{\theta \in \mathbb{R}^{d}} F(\theta) < +\infty$.
Assumption 3.2.

The immediate reward function is bounded, i.e., there exists $\mathcal{R}_{\max} > 0$ such that $|\mathcal{R}(s, a)| \le \mathcal{R}_{\max}$ for all $s \in \mathcal{S}$, $a \in \mathcal{A}$.

Assumption 3.3.

Let $\pi_{\theta}(a|s)$ be the policy for a given state-action pair $(s, a)$. Then, there exist two positive constants $G$ and $M$ such that

$\|\nabla_{\theta}\log\pi_{\theta}(a|s)\| \le G \quad \text{and} \quad \|\nabla_{\theta}^{2}\log\pi_{\theta}(a|s)\| \le M,$

for any $\theta$, where $\|\cdot\|$ is the $\ell_2$-norm.

This assumption leads to useful results about the smoothness of $J$ and $g$ and an upper bound on the variance of the policy gradient estimator.

Lemma 3.1 ((Papini et al., 2018; Shen et al., 2019; Xu et al., 2019a)).

Under Assumptions 3.2 and 3.3, for all $\theta, \theta_1, \theta_2 \in \mathbb{R}^{d}$, we have

  • $\|g(\tau|\theta)\| \le C_g$;

  • $\|g(\tau|\theta_1) - g(\tau|\theta_2)\| \le L_g\|\theta_1 - \theta_2\|$;

  • $\|\nabla_{\theta} J(\theta_1) - \nabla_{\theta} J(\theta_2)\| \le L\|\theta_1 - \theta_2\|$; and

  • $\mathbb{E}_{\tau \sim p(\cdot|\theta)}\big[\|g(\tau|\theta) - \nabla_{\theta} J(\theta)\|^{2}\big] \le \sigma^{2}$,

where $g(\tau|\theta)$ is the REINFORCE estimator and $C_g$, $L_g$, $L$, and $\sigma$ are constants depending only on $G$, $M$, $\mathcal{R}_{\max}$, $H$, $\gamma$, and the baseline $b$.

For more details about these constants and the proofs of Lemma 3.1, we refer, e.g., to Papini et al. (2018); Shen et al. (2019); Xu et al. (2019a).

Assumption 3.4.

There exists a constant $W > 0$ such that, for each pair of policies $\pi_{\theta_1}$ and $\pi_{\theta_2}$ encountered in Algorithm 1, the following holds:

$\mathrm{Var}\big[\omega(\tau|\theta_1, \theta_2)\big] \le W,$

where $\omega(\tau|\theta_1, \theta_2) := p(\tau|\theta_1)/p(\tau|\theta_2)$ is the importance weight between $p(\cdot|\theta_1)$ and $p(\cdot|\theta_2)$.

Since the importance weight introduces another source of variance, we require this assumption for our convergence analysis as used in previous works, e.g., in Papini et al. (2018); Xu et al. (2019a).

Remark 3.1.

Cortes et al. (2010) show that if $\sigma_P^2$ and $\sigma_Q^2$ are the variances of two Gaussian distributions $P$ and $Q$, and $\sigma_Q^2 > \tfrac{1}{2}\sigma_P^2$, then the variance of the importance weights is bounded, i.e., Assumption 3.4 holds for Gaussian policies, which are commonly used to represent the policy in continuous control tasks.

3.3 Optimality Condition

Associated with problem (2), we define

$\mathcal{G}_{\eta}(\theta) := \frac{1}{\eta}\Big(\mathrm{prox}_{\eta\phi}\big(\theta + \eta\nabla_{\theta} J(\theta)\big) - \theta\Big), \qquad (3)$

for some $\eta > 0$, as the gradient mapping of $F$ (Nesterov, 2014), where $\mathrm{prox}_{\eta\phi}$ denotes the proximal operator of $\eta\phi$ (see, e.g., Parikh and Boyd (2014) for more details).

A point $\theta^{\star}$ is called a stationary point of (2) if $\|\mathcal{G}_{\eta}(\theta^{\star})\| = 0$.

Our goal is to design an iterative method that produces an $\varepsilon$-approximate stationary point $\widetilde{\theta}_{T}$ of (2) after at most $T$ iterations, defined as

$\mathbb{E}\big[\|\mathcal{G}_{\eta}(\widetilde{\theta}_{T})\|^{2}\big] \le \varepsilon^{2},$

where $\varepsilon > 0$ is a desired tolerance, and the expectation is taken over all the randomness up to $T$ iterations.
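To illustrate the gradient-mapping notion of stationarity, here is a minimal numpy sketch using an ℓ1 regularizer (whose proximal operator is soft-thresholding) and a hypothetical concave quadratic standing in for $J$; the objective and all parameter values are illustrative only:

```python
import numpy as np

def soft_threshold(x, t):
    # proximal operator of t * ||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def gradient_mapping(theta, grad_J, eta, lam):
    """G_eta(theta) = (1/eta) * (prox_{eta*phi}(theta + eta*grad J(theta)) - theta),
    here with phi = lam * ||.||_1 as the regularizer."""
    return (soft_threshold(theta + eta * grad_J(theta), eta * lam) - theta) / eta

# Hypothetical concave objective J(theta) = -0.5 * ||theta - c||^2.
c = np.array([2.0, 0.3, -1.0])
lam, eta = 0.5, 0.1
grad_J = lambda th: c - th

theta_star = soft_threshold(c, lam)   # maximizer of J(theta) - lam*||theta||_1
print(np.linalg.norm(gradient_mapping(theta_star, grad_J, eta, lam)))  # ~0
print(np.linalg.norm(gradient_mapping(np.zeros(3), grad_J, eta, lam)))  # clearly > 0
```

At the composite maximizer the mapping vanishes, while at a non-stationary point it does not, matching the definition above.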

3.4 Novel Hybrid SPG Estimator

Unbiased estimator:

Recall that, given a trajectory $\tau = \{s_0, a_0, \dots, s_{H-1}, a_{H-1}\}$, the REINFORCE (SPG) estimator is defined as

$g(\tau|\theta) := \Big(\textstyle\sum_{t=0}^{H-1}\nabla_{\theta}\log\pi_{\theta}(a_t|s_t)\Big)\mathcal{R}(\tau),$

where $\mathcal{R}(\tau) := \sum_{t=0}^{H-1}\gamma^{t}\mathcal{R}(s_t, a_t)$.

Note that the REINFORCE estimator is unbiased, i.e., $\mathbb{E}[g(\tau|\theta)] = \nabla_{\theta} J(\theta)$. In order to reduce the variance of this estimator, a baseline is normally added while maintaining unbiasedness (Sutton and Barto, 2018; Zhao et al., 2011). From now on, we will refer to $g(\tau|\theta)$ as the baseline-added version defined as

$g(\tau|\theta) := \textstyle\sum_{t=0}^{H-1}\nabla_{\theta}\log\pi_{\theta}(a_t|s_t)\big(\mathcal{R}(\tau) - b_t\big),$

where $b_t$ is a baseline, possibly depending only on the state $s_t$.
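The reason a baseline leaves the estimator unbiased is that the score function has zero mean under the policy. This can be checked exactly by enumerating the actions of a small softmax policy (all numbers below are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def score(theta, a):
    # grad_theta log softmax(theta)[a] = e_a - pi(theta)
    s = -softmax(theta)
    s[a] += 1.0
    return s

theta = np.array([0.5, -0.2, 1.0])   # hypothetical policy parameters
R = np.array([3.0, 1.0, 2.0])        # hypothetical per-action returns
b = 2.0                              # any baseline independent of the action
pi = softmax(theta)

g_plain    = sum(pi[a] * score(theta, a) * R[a] for a in range(3))
g_baseline = sum(pi[a] * score(theta, a) * (R[a] - b) for a in range(3))

# Exact expectations coincide: E[score * b] = b * sum_a grad pi(a) = 0.
print(np.allclose(g_plain, g_baseline))  # True
```

The baseline therefore changes only the variance of the estimate, not its mean.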

Hybrid SPG estimator:

In order to reduce the number of trajectories sampled, we extend our idea in Tran-Dinh et al. (2019a) for stochastic optimization to develop a new hybrid stochastic policy gradient (HSPG) estimator that helps balance the bias-variance trade-off. The estimator is formed by taking a convex combination of two other estimators: one is an unbiased estimator, which can be the REINFORCE estimator, and the other is the adapted SARAH estimator (Nguyen et al., 2017b) for policy gradient, which is biased.

More precisely, if $\mathcal{B}_t$ and $\widehat{\mathcal{B}}_t$ are two random batches of trajectories with sizes $B$ and $\widehat{B}$, respectively, sampled from $p(\cdot|\theta_t)$, the hybrid stochastic policy gradient estimator at the $t$-th iteration can be expressed as

$v_t := \beta v_{t-1} + \frac{\beta}{B}\sum_{\tau \in \mathcal{B}_t}\big[g(\tau|\theta_t) - \omega(\tau|\theta_{t-1}, \theta_t)\, g(\tau|\theta_{t-1})\big] + \frac{1-\beta}{\widehat{B}}\sum_{\widehat{\tau} \in \widehat{\mathcal{B}}_t} g(\widehat{\tau}|\theta_t), \qquad (4)$

where $\beta \in [0, 1]$, and $v_0 := \frac{1}{N}\sum_{\tau \in \mathcal{N}} g(\tau|\theta_0)$ with $\mathcal{N}$ a batch of trajectories collected at the beginning. Note that $\omega(\tau|\theta_{t-1}, \theta_t)$ is an importance weight added to account for the distribution shift, since the trajectories are sampled from $p(\cdot|\theta_t)$ but not from $p(\cdot|\theta_{t-1})$. Note also that $v_t$ in (4) is different from the momentum SARAH estimator recently proposed in Cutkosky and Orabona (2019).
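A schematic sketch of the hybrid combination may clarify its structure; the gradient oracle, importance-weight function, and all toy values below are hypothetical stand-ins, not the paper's actual estimator implementation:

```python
import numpy as np

def hybrid_estimator(v_prev, beta, g, theta_prev, theta, batch, batch_hat, weight):
    """Schematic hybrid SPG estimator: a convex combination of a SARAH-style
    recursive term (with importance weights) and a fresh REINFORCE-style
    mini-batch term."""
    sarah = np.mean([g(tau, theta) - weight(tau, theta_prev, theta) * g(tau, theta_prev)
                     for tau in batch], axis=0)
    reinforce = np.mean([g(tau, theta) for tau in batch_hat], axis=0)
    return beta * (v_prev + sarah) + (1.0 - beta) * reinforce

# Toy gradient oracle g(tau, theta) = tau * theta, importance weight == 1.
g = lambda tau, th: tau * th
w1 = lambda tau, th_prev, th: 1.0
th = np.array([1.0, -2.0])
v_prev = np.array([0.3, 0.3])

# beta = 1 with unchanged parameters keeps the recursive estimate unchanged,
v_a = hybrid_estimator(v_prev, 1.0, g, th, th, [1.0, 2.0], [3.0], w1)
# while beta = 0 reduces to a plain REINFORCE mini-batch average.
v_b = hybrid_estimator(v_prev, 0.0, g, th, th, [1.0], [3.0, 4.0], w1)
print(np.allclose(v_a, v_prev), np.allclose(v_b, 3.5 * th))  # True True
```

The two extreme choices of the mixing weight recover the pure recursive estimator and the pure unbiased estimator; intermediate values trade off their bias and variance.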

3.5 The Complete Algorithm

The novel Proximal Hybrid Stochastic Policy Gradient Algorithm (abbreviated by ProxHSPGA) to solve (2) is presented in detail in Algorithm 1.

1: Initialization: An initial point $\theta_0$ and positive parameters $N$, $B$, $\widehat{B}$, $\beta$, $\eta$, $\gamma$, and $m$ (specified later).
2: Sample a batch $\mathcal{N}$ of trajectories of size $N$ from $p(\cdot|\theta_0)$.
3: Calculate $v_0 := \frac{1}{N}\sum_{\tau \in \mathcal{N}} g(\tau|\theta_0)$.
4: Update $\theta_1 := \mathrm{prox}_{\eta\phi}(\theta_0 + \eta v_0)$.
5: For $t = 1, \dots, m$ do
6:    Generate independent batches of trajectories $\mathcal{B}_t$ and $\widehat{\mathcal{B}}_t$ with sizes $B$ and $\widehat{B}$ from $p(\cdot|\theta_t)$.
7:    Evaluate the hybrid estimator $v_t$ as in (4).
8:    Update $\widehat{\theta}_{t+1} := \mathrm{prox}_{\eta\phi}(\theta_t + \eta v_t)$ and $\theta_{t+1} := (1-\gamma)\theta_t + \gamma\widehat{\theta}_{t+1}$.
9: EndFor
10: Choose $\widetilde{\theta}$ uniformly at random from $\{\theta_1, \dots, \theta_{m+1}\}$.
Algorithm 1 (ProxHSPGA)

Unlike SVRPG (Papini et al., 2018; Xu et al., 2019a) and HAPG (Shen et al., 2019), Algorithm 1 has only one loop, like REINFORCE or GPOMDP. Moreover, Algorithm 1 does not use an estimator of the policy Hessian as in HAPG. At the initial stage, a batch of trajectories is sampled using the initial policy to estimate an initial policy gradient, which provides a good initial search direction. At the $t$-th iteration, two independent batches of trajectories are sampled from $p(\cdot|\theta_t)$ to evaluate the hybrid stochastic policy gradient estimator. After that, a proximal step followed by an averaging step is performed, inspired by Pham et al. (2019). Note that the batches of trajectories at each iteration are sampled from the current distribution $p(\cdot|\theta_t)$, which changes after each update. Therefore, the importance weight is introduced to account for the non-stationarity of the sampling distribution. As a result, we still have $\mathbb{E}\big[\omega(\tau|\theta_{t-1}, \theta_t)\, g(\tau|\theta_{t-1})\big] = \nabla_{\theta} J(\theta_{t-1})$.
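The two-step structure of the inner update (proximal step, then averaging) can be sketched on a toy composite problem; here a deterministic gradient of a hypothetical concave quadratic stands in for the stochastic hybrid estimator, and all constants are illustrative:

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

# Toy stand-in: maximize J(theta) - lam*||theta||_1 with
# J(theta) = -0.5 * ||theta - c||^2, whose exact gradient replaces v_t.
c = np.array([2.0, 0.3, -1.0])       # hypothetical problem data
lam, eta, gamma = 0.5, 0.5, 0.8      # step-size eta, averaging weight gamma
grad_J = lambda th: c - th

theta = np.zeros(3)
for t in range(200):
    v = grad_J(theta)                                        # would be the hybrid estimator
    theta_hat = soft_threshold(theta + eta * v, eta * lam)   # proximal step
    theta = (1 - gamma) * theta + gamma * theta_hat          # averaging step

print(theta)  # approaches soft_threshold(c, lam) = [1.5, 0, -0.5]
```

With exact gradients the iteration converges to the composite maximizer; in the algorithm itself the averaging weight provides the extra flexibility in step-sizes mentioned in the introduction.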

3.6 Restarting variant

While Algorithm 1 has the best-known theoretical complexity, as shown in Section 4, its practical performance may be affected by the constant step-size, which depends on the number of iterations $m$. As will be shown later, the step-size is inversely proportional to the number of iterations, and it is natural to have $\beta$ close to 1 to take advantage of the newly computed information. To increase the practical performance of our algorithm without sacrificing its complexity, we inject a simple restarting strategy by repeatedly running Algorithm 1 for multiple stages, as in Algorithm 2.

1: Initialization: Input an initial point $\widetilde{\theta}^{(0)}$.
2: For $s = 1, \dots, S$ do
3:    Run Algorithm 1 with $\theta_0 := \widetilde{\theta}^{(s-1)}$.
4:    Output $\widetilde{\theta}^{(s)} := \theta_{m+1}$.
5: EndFor
6: Choose $\widetilde{\theta}$ uniformly at random from $\{\widetilde{\theta}^{(1)}, \dots, \widetilde{\theta}^{(S)}\}$.
Algorithm 2 (Restarting ProxHSPGA)

We emphasize that without this restarting strategy, Algorithm 1 still converges and the restarting loop in Algorithm 2 does not sacrifice the best-known complexity as stated in the next section.

4 Convergence Analysis

This section presents key properties of the hybrid stochastic policy gradient estimators as well as the theoretical convergence analysis and complexity estimate.

4.1 Properties of the hybrid SPG estimator

Let $\mathcal{F}_t$ be the $\sigma$-field generated by all trajectories sampled up to the $t$-th iteration. For the sake of simplicity, we assume $\widehat{B} = B$, but our analysis can easily be extended to the case $\widehat{B} \neq B$. Then the hybrid SPG estimator has the following properties.

Lemma 4.1 (Key properties).

Let be defined as in (4) and . Then


If $\beta < 1$, then $v_t$ is a biased estimator. In addition, we have


where is a given constant.

The proof of Lemma 4.1 and the explicit constants are given in Supp. Doc. A.1 due to space limits.

4.2 Complexity Estimates

The following lemma presents a key estimate for our convergence results.

Lemma 4.2 (One-iteration analysis).

Under Assumptions 3.2, 3.3, and 3.4, let $\{\theta_t\}$ be the sequence generated by Algorithm 1 and $\mathcal{G}_{\eta}$ be the gradient mapping defined in (3). Then


where and provided that and .

The following theorem summarizes the convergence analysis of Algorithm 1.

Theorem 4.1.

Under Assumptions 3.1, 3.2, 3.3, and 3.4, let $\{\theta_t\}$ be the sequence generated by Algorithm 1 with


where and are given constants. If $\widetilde{\theta}$ is chosen uniformly at random from $\{\theta_1, \dots, \theta_{m+1}\}$, then the following estimate holds


Consequently, the trajectory complexity is presented in the following corollary.

Corollary 4.1.

For both Algorithm 1 and Algorithm 2, let us fix and set for some in Theorem 4.1. If we also choose in Algorithm 1 such that and choose in Algorithm 2 such that for some constant , then the number of trajectories to achieve such that for any is at most

where is chosen uniformly at random from if using Algorithm 2.

The proof of Theorem 4.1 and Corollary 4.1 are given in Supp. Doc. A.3, and A.4, respectively.

Comparing our complexity bound with those of the existing methods in Table 1, we can see that we improve by a factor of $\mathcal{O}(\varepsilon^{-1/3})$ over SVRPG in Xu et al. (2019a), while matching the best-known complexity without needing the policy Hessian estimator used by HAPG in Shen et al. (2019).

5 Numerical Experiments

In this section, we present three examples comparing the performance of HSPGA with other related policy gradient methods. We also provide an example to illustrate the effect of the regularizer in our model (2). More examples can be found in Supp. Doc. C. All experiments are run on a MacBook Pro with a 2.3 GHz quad-core CPU and 8 GB of RAM.

Figure 1: The performance of three algorithms on the CartPole-v0 environment.
Figure 2: The performance of three algorithms on the Acrobot-v1 environment.

We implement our restarting algorithm, Algorithm 2, on top of the rllab library (Duan et al., 2016), available at https://github.com/rll/rllab. The source code is available at https://github.com/unc-optimization/ProxHSPGA. We compare our algorithm with two other methods: SVRPG (Papini et al., 2018; Xu et al., 2019a) and GPOMDP (Baxter and Bartlett, 2001). Although REINFORCE and GPOMDP have the same trajectory complexity, GPOMDP often performs better than REINFORCE, as observed in Papini et al. (2018), so we only implement GPOMDP in our experiments. Since SVRPG and GPOMDP solve the non-composite problem (1), we set $\phi = 0$ in the first three examples and adjust our algorithm, denoted HSPGA, accordingly. We compare our algorithm with the fixed-epoch-length variant of SVRPG as reported in Papini et al. (2018); Xu et al. (2019a). For the implementation of SVRPG and GPOMDP, we reuse the code of Papini et al., available at https://github.com/Dam930/rllab. We test these algorithms on three well-studied reinforcement learning tasks: Cart Pole, Acrobot, and Mountain Car, which are available in OpenAI Gym (Brockman et al., 2016), a well-known toolkit for developing and comparing reinforcement learning algorithms. We also test these algorithms on continuous control tasks using other simulators such as Roboschool (Klimov and Schulman, 2017) and MuJoCo (Todorov et al., 2012).

For each environment, we initialize the policy randomly and use it as the initial policy for all runs of all algorithms. The performance measure, i.e., the mean reward, is computed by averaging the final rewards of the trajectories sampled by the current policy. We then compute the mean and 90% confidence interval of these performance measures across runs at different time points. In all plots, the solid lines represent the mean and the shaded areas are the confidence band of the mean rewards. In addition, detailed configurations of the policy networks and parameters can be found in Supp. Doc. B.
Cart Pole-v0 environment:

For the CartPole-v0 environment, we use a deep softmax policy network (Bridle, 1990; Levine, 2017; Sutton and Barto, 2018) with one hidden layer. Figure 1 depicts the results, where we run each algorithm multiple times and compute the mean and confidence intervals.

From Figure 1, we can see that HSPGA outperforms the other algorithms, while SVRPG works better than GPOMDP, as expected. HSPGA is able to reach the maximum reward of 200 in fewer episodes than the other methods.

Figure 3: The performance of three algorithms on the Mountain Car-v0 environment.

Acrobot environment:

Next, we evaluate the three algorithms on the Acrobot-v1 environment. Here, we use a deep softmax policy with one hidden layer of 16 neurons. The performance of the three algorithms is illustrated in Figure 2.

We observe results similar to the previous example, where HSPGA has the best performance among the three candidates. SVRPG is still better than GPOMDP in this example.

Mountain Car environment:

For the MountainCar-v0 environment, we use a deep Gaussian policy (Sutton and Barto, 2018) where the mean is the output of a neural network containing one hidden layer and the standard deviation is fixed. The results of the three algorithms are presented in Figure 3.

Figure 3 shows that HSPGA significantly outperforms the other two algorithms. Again, SVRPG remains better than GPOMDP, as expected.

The effect of regularizers:

We test the effect of the regularizer by adding a Tikhonov one, $\phi(\theta) := \frac{\lambda}{2}\|\theta\|^{2}$, where $\lambda > 0$ is a penalty parameter. This model was intensively studied in Liu et al. (2019).
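One practical attraction of the Tikhonov regularizer is that its proximal operator has a simple closed form, so ProxHSPGA's proximal step remains cheap. A minimal sketch (the penalty value below is a hypothetical illustration, not the paper's experimental setting):

```python
import numpy as np

# Proximal operator of phi(theta) = (lam/2)*||theta||^2 has the closed form
# prox_{eta*phi}(x) = x / (1 + eta*lam), i.e., a simple shrinkage.
def prox_tikhonov(x, eta, lam):
    return x / (1.0 + eta * lam)

x = np.array([2.0, -4.0])
p = prox_tikhonov(x, eta=0.1, lam=0.005)   # lam = 0.005 is a sample penalty value

# Verify the prox optimality condition for
# p = argmin_z { eta*(lam/2)*||z||^2 + 0.5*||z - x||^2 }:  eta*lam*p + (p - x) = 0.
print(np.allclose(0.1 * 0.005 * p + (p - x), 0.0))  # True
```

The check confirms the closed form satisfies the first-order optimality condition of the proximal subproblem.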

We also compare all non-composite algorithms with ProxHSPGA in the Roboschool Inverted Pendulum-v1 environment. In this experiment, we set the penalty parameter for ProxHSPGA. The results are depicted in Figure 4 and more information about the configuration of each algorithm is in Supp. Document B.

Figure 4: The performance of composite vs. non-composite algorithms on the Roboschool Inverted Pendulum-v1 environment.

From Figure 4, among the non-composite algorithms, HSPGA is the best, followed by SVRPG and then GPOMDP. Furthermore, ProxHSPGA shows its advantage by reaching the maximum reward of 1000 faster than HSPGA.

6 Conclusion

We have presented a novel policy gradient algorithm to solve regularized reinforcement learning models. Our algorithm uses a novel policy gradient estimator which is a combination of an unbiased estimator, the REINFORCE estimator, and a biased estimator adapted from the SARAH estimator for policy gradient. Theoretical results show that our algorithm achieves the best-known trajectory complexity $\mathcal{O}(\varepsilon^{-3})$ to attain an $\varepsilon$-approximate first-order solution of the problem under standard assumptions. In addition, our numerical experiments not only confirm the benefit of our algorithm compared to other closely related policy gradient methods but also verify the effectiveness of regularization in policy gradient methods.

Q. Tran-Dinh has partly been supported by the National Science Foundation (NSF), grant no. DMS-1619884 and the Office of Naval Research (ONR), grant no. N00014-20-1-2088 (2020-2023). Q. Tran-Dinh and N. H. Pham are partly supported by The Statistical and Applied Mathematical Sciences Institute (SAMSI).


  • Z. Allen-Zhu and Y. Yuan (2016) Improved svrg for non-strongly-convex or sum-of-non-convex objectives. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, pp. 1080–1089. Cited by: §1.
  • Z. Allen-Zhu (2017) Katyusha: the first direct acceleration of stochastic gradient methods. In

    Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing

    New York, NY, USA, pp. 1200–1205. Cited by: §1.
  • Z. Allen-Zhu (2018) Natasha 2: faster non-convex optimization than sgd. In Advances in Neural Information Processing Systems 31, pp. 2675–2686. Cited by: §1.
  • J. Baxter and P. L. Bartlett (2001) Infinite-horizon policy-gradient estimation. J. Artif. Int. Res. 15 (1), pp. 319–350. Cited by: Table 1, §1, §2, §5.
  • J. S. Bridle (1990) Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters. In Advances in Neural Information Processing Systems 2, pp. 211–217. Cited by: §5.
  • G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI gym. External Links: arXiv:1606.01540 Cited by: §5.
  • C. Cortes, Y. Mansour, and M. Mohri (2010) Learning bounds for importance weighting. In Advances in Neural Information Processing Systems 23, pp. 442–450. Cited by: §A.1, Remark 3.1.
  • A. Cutkosky and F. Orabona (2019) Momentum-based variance reduction in non-convex sgd. In Advances in Neural Information Processing Systems, pp. 15210–15219. Cited by: §3.4.
  • DeepMind (2019) AlphaStar: Mastering the Real-Time Strategy Game StarCraft II. Note: https://deepmind.com/blog Cited by: §1.
  • A. Defazio, F. Bach, and S. Lacoste-Julien (2014) SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, Cambridge, MA, USA, pp. 1646–1654. Cited by: §1.
  • Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel (2016) Benchmarking deep reinforcement learning for continuous control. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, pp. 1329–1338. Cited by: §5.
  • C. Fang, C. J. Li, Z. Lin, and T. Zhang (2018) SPIDER: near-optimal non-convex optimization via stochastic path-integrated differential estimator. In NeurIPS, Cited by: §1.
  • T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 1861–1870. Cited by: §1.
  • H. v. Hasselt, A. Guez, and D. Silver (2016) Deep reinforcement learning with double q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 2094–2100. Cited by: §1.
  • R. Johnson and T. Zhang (2013) Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems (NIPS), pp. 315–323. Cited by: §1.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. CoRR abs/1412.6980. Cited by: §1.
  • O. Klimov and J. Schulman (2017) Roboschool. Note: https://openai.com/blog/roboschool/ Cited by: §5.
  • S. Levine (2017) CS 294-112: deep reinforcement learning lecture notes. Cited by: §5.
  • Z. Li and J. Li (2018) A simple proximal stochastic gradient method for nonsmooth nonconvex optimization. In Proceedings of the 32Nd International Conference on Neural Information Processing Systems, USA, pp. 5569–5579. Cited by: §1.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2016) Continuous control with deep reinforcement learning. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Cited by: §1.
  • Z. Liu, X. Li, B. Kang, and T. Darrell (2019) Regularization matters in policy optimization. arXiv preprint arXiv:1910.09191. Cited by: Appendix C, §1, §5.
  • V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, Vol. 48, pp. 1928–1937. Cited by: §1.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller (2013) Playing atari with deep reinforcement learning. ArXiv abs/1312.5602. Cited by: §1.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis (2015) Human-level control through deep reinforcement learning. Nature 518, pp. 529–533. Cited by: §1.
  • Y. Nesterov (2014) Introductory lectures on convex optimization: a basic course. 1 edition, Springer Publishing Company, Incorporated. Cited by: §3.3.
  • B. Neyshabur, S. Bhojanapalli, D. Mcallester, and N. Srebro (2017) Exploring generalization in deep learning. In Advances in Neural Information Processing Systems 30, pp. 5947–5956. Cited by: §1.
  • L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč (2017a) Stochastic recursive gradient algorithm for nonconvex optimization. CoRR abs/1705.07261. Cited by: §1.
  • L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč (2017b) SARAH: a novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning, pp. 2613–2621. Cited by: §1, §1, §3.4.
  • L. M. Nguyen, M. van Dijk, D. T. Phan, P. H. Nguyen, T.-W. Weng, and J. R. Kalagnanam (2019) Finite-sum smooth optimization with sarah. arXiv preprint arXiv:1901.07648. Cited by: §1.
  • OpenAI (2018) OpenAI Five. Note: https://blog.openai.com/openai-five/ Cited by: §1.
  • M. Papini, D. Binaghi, G. Canonaco, M. Pirotta, and M. Restelli (2018) Stochastic variance-reduced policy gradient. In Proceedings of the 35th International Conference on Machine Learning, Vol. 80, pp. 4026–4035. Cited by: Table 1, §1, §3.2, §3.2, §3.5, Lemma 3.1, §5.
  • N. Parikh and S. Boyd (2014) Proximal algorithms. Found. Trends Optim. 1 (3), pp. 127–239. Cited by: §3.3.
  • N. H. Pham, L. M. Nguyen, D. T. Phan, and Q. Tran-Dinh (2019) ProxSARAH: an efficient algorithmic framework for stochastic composite nonconvex optimization. ArXiv abs/1902.05679. Cited by: §1, §3.1, §3.5.
  • S. J. Reddi, A. Hefny, S. Sra, B. Póczós, and A. Smola (2016) Stochastic variance reduction for nonconvex optimization. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, pp. 314–323. Cited by: §1.
  • H. Robbins and S. Monro (1951) A stochastic approximation method. Ann. Math. Statist. 22 (3), pp. 400–407. Cited by: §1.
  • J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015) Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, Vol. 37, Lille, France, pp. 1889–1897. Cited by: §1, §1.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. ArXiv abs/1707.06347. Cited by: §1.
  • Z. Shen, A. Ribeiro, H. Hassani, H. Qian, and C. Mi (2019) Hessian aided policy gradient. In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 5729–5738. Cited by: Table 1, §1, §1, §3.2, §3.5, Lemma 3.1, §4.2.
  • D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. v. d. Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis (2016) Mastering the game of go with deep neural networks and tree search. Nature 529, pp. 484–503. Cited by: §1.
  • D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis (2018) A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science 362 (6419), pp. 1140–1144. Cited by: §1.
  • D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller (2014) Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, pp. I–387–I–395. Cited by: §1, §2.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §3.1.
  • R. S. Sutton and A. G. Barto (2018) Introduction to reinforcement learning, 2nd edition. MIT Press. Cited by: §1, §1, §2, §3.4, §5, §5.
  • R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour (1999) Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, pp. 1057–1063. Cited by: §2.
  • T. Tieleman and G. Hinton (2012) Lecture 6.5—RMSProp: divide the gradient by a running average of its recent magnitude. Note: COURSERA: Neural Networks for Machine Learning Cited by: §1.
  • E. Todorov, T. Erez, and Y. Tassa (2012) MuJoCo: a physics engine for model-based control. 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §5.
  • Q. Tran-Dinh, N. H. Pham, D. T. Phan, and L. M. Nguyen (2019a) A hybrid stochastic optimization framework for stochastic composite nonconvex optimization. arXiv preprint arXiv:1907.03793. Cited by: §A.1, §A.2, Appendix A, §1, §3.4, §3.
  • Q. Tran-Dinh, N. H. Pham, D. T. Phan, and L. M. Nguyen (2019b) Hybrid stochastic gradient descent algorithms for stochastic nonconvex optimization. arXiv preprint arXiv:1905.05920. Cited by: §1.
  • Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. d. Freitas (2017) Sample efficient actor-critic with experience replay. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, Cited by: §1.
  • Z. Wang, K. Ji, Y. Zhou, Y. Liang, and V. Tarokh (2018) SpiderBoost: a class of faster variance-reduced algorithms for nonconvex optimization. ArXiv abs/1810.10690. Cited by: §1.
  • Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas (2016) Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, pp. 1995–2003. Cited by: §1.
  • C. J. C. H. Watkins and P. Dayan (1992) Q-learning. Machine Learning 8 (3), pp. 279–292. Cited by: §1.
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3), pp. 229–256. Cited by: §1, Table 1, §1, §2.
  • Y. Wu, E. Mansimov, S. Liao, R. Grosse, and J. Ba (2017) Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, USA, pp. 5285–5294. Cited by: §1.
  • P. Xu, F. Gao, and Q. Gu (2019a) An improved convergence analysis of stochastic variance-reduced policy gradient. Conference on Uncertainty in Artificial Intelligence. Cited by: §A.1, Table 1, §1, §3.2, §3.2, §3.5, Lemma 3.1, §4.2, §5.
  • P. Xu, F. Gao, and Q. Gu (2019b) Sample efficient policy gradient methods with recursive variance reduction. ArXiv abs/1909.08610. Cited by: Table 1, §1.
  • L. Yang and Y. Zhang (2019) Policy optimization with stochastic mirror descent. CoRR abs/1906.10462. External Links: 1906.10462 Cited by: Table 1, §1.
  • H. Yuan, C. J. Li, Y. Tang, and Y. Zhou (2019) Policy optimization via stochastic recursive gradient algorithm. External Links: Link Cited by: §1.
  • C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2017) Understanding deep learning requires rethinking generalization. External Links: Link Cited by: §1.
  • T. Zhao, H. Hachiya, G. Niu, and M. Sugiyama (2011) Analysis and improvement of policy gradient estimation. In Advances in Neural Information Processing Systems 24, pp. 262–270. Cited by: §1, §3.4.
  • D. Zhou, P. Xu, and Q. Gu (2018) Stochastic nested variance reduction for nonconvex optimization. ArXiv abs/1806.07811. Cited by: §1.

Appendix A Convergence Analysis

We note that the original idea of using hybrid estimators was proposed in our working paper (Tran-Dinh et al., 2019a). In this work, we extend that idea, as well as the proof techniques for stochastic optimization in Tran-Dinh et al. (2019a), to reinforcement learning settings. We now provide the full analysis of Algorithms 1 and 2. We first prove a key property of our new hybrid policy gradient estimator, and then provide the proofs of Theorem 4.1 and Corollary 4.1.
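For concreteness, the hybrid estimator analyzed below is a convex combination of a SARAH-style recursive (biased) term and an unbiased (REINFORCE-style) term. A minimal sketch of the update, assuming the gradients have already been estimated from sampled trajectories (the function and argument names are illustrative, not from the paper's code):

```python
import numpy as np

def hybrid_estimator(v_prev, grad_new, grad_old, grad_unbiased, beta):
    """Hybrid stochastic gradient estimator: a convex combination of a
    SARAH-style recursive term and an unbiased term,

        v_t = beta * (v_{t-1} + g_new - g_old) + (1 - beta) * u_t.
    """
    sarah_term = v_prev + grad_new - grad_old   # recursive, biased
    return beta * sarah_term + (1.0 - beta) * grad_unbiased
```

Setting beta = 1 recovers a pure SARAH-style recursion, while beta = 0 recovers the plain unbiased estimator; intermediate values trade bias for variance reduction.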

A.1 Proof of Lemma 4.1: Bound on the Variance of the Hybrid SPG Estimator

Part of this proof comes from the proof of Lemma 1 in Tran-Dinh et al. (2019a). Let 𝔼[·] denote the total expectation. Using the independence of the underlying samples and taking the total expectation on (4), we obtain

which is the same as (5).

To prove (6), we first define and . We have

Taking the total expectation and noting that and , we get


where the first inequality follows from the triangle inequality, and the second follows by ignoring the non-negative terms.

Additionally, Lemma 6.1 in Xu et al. (2019a) shows that


where .

Using (11), we have

where , , and is a baseline reward. Here, comes from Lemma 3.1 and is from Lemma 1 in Cortes et al. (2010).
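The bound above relies on the moments of trajectory importance weights, which Lemma 1 of Cortes et al. (2010) controls. As an illustrative sketch (not from the paper's code), such a weight is the product of per-step policy ratios along a trajectory, which is best computed in log space for numerical stability:

```python
import numpy as np

def trajectory_importance_weight(logp_target, logp_behavior):
    """Importance weight of one trajectory: the ratio of its probability
    under the target policy to that under the behavior policy,
        w = prod_t pi_target(a_t | s_t) / pi_behavior(a_t | s_t).
    Summing log-probabilities avoids under/overflow from long products.
    """
    logp_target = np.asarray(logp_target, dtype=float)
    logp_behavior = np.asarray(logp_behavior, dtype=float)
    return float(np.exp(logp_target.sum() - logp_behavior.sum()))
```

When the two policies agree on every step, the weight is exactly 1; the variance of these weights is what the Cortes et al. (2010) bound quantifies.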

Plugging the last estimate into (10) yields


which is (6), where .
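To build intuition for the variance-reduced property established in (6), consider a toy illustration (purely illustrative; the setup and parameter names are not from the paper). When the parameter is held fixed, the SARAH correction term cancels and the hybrid recursion reduces to an exponential average of i.i.d. noise, whose stationary variance is (1 − β)/(1 + β) times that of the plain unbiased estimate:

```python
import numpy as np

# Fixed-parameter toy model: plain estimate g_t = xi_t with xi_t ~ N(0, 1);
# the hybrid recursion collapses to v_t = beta * v_{t-1} + (1 - beta) * xi_t,
# with stationary variance (1 - beta) / (1 + beta) < 1.
rng = np.random.default_rng(0)
beta, n_chains, n_steps = 0.9, 20000, 200

noise = rng.standard_normal((n_chains, n_steps))
v = np.zeros(n_chains)
for t in range(n_steps):
    v = beta * v + (1.0 - beta) * noise[:, t]

var_plain = noise[:, -1].var()   # empirical variance of the plain estimate
var_hybrid = v.var()             # empirical variance of the hybrid estimate
```

With beta = 0.9, the hybrid variance settles near (1 − 0.9)/(1 + 0.9) ≈ 0.053, roughly a twenty-fold reduction over the plain estimate.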

A.2 Proof of Lemma 4.2: Key Estimate of Algorithm 1

Similarly to the proof of Lemma 5 in Tran-Dinh et al. (2019a), from the update in Algorithm 1, we have , which leads to . Combining this expression with the -smoothness of in Lemma 3.1, we have


From the convexity of , we have


where is a subgradient of at .

By the optimality condition of , we can show that for some where is the subdifferential of Q at . Plugging this into (14), we get


Subtracting (15) from (13), we obtain


Using the fact that

and ignoring the non-negative term , we can rewrite (16) as

Taking the total expectation over the entire history , we obtain


From the definition of the gradient mapping (3), we have
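The gradient mapping in (3) is the standard object for composite problems (Nesterov, 2014; Parikh and Boyd, 2014): it replaces the plain gradient as the stationarity measure when a regularizer is present. A minimal sketch, using an ℓ1 regularizer chosen purely for illustration (the paper's regularizer may differ):

```python
import numpy as np

def prox_l1(x, lam):
    """Proximal operator of lam * ||.||_1, i.e. soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def gradient_mapping(theta, grad, eta, lam):
    """G_eta(theta) = (theta - prox_{eta*phi}(theta - eta*grad)) / eta,
    here with phi = lam * ||.||_1 as an illustrative regularizer."""
    theta_plus = prox_l1(theta - eta * grad, eta * lam)
    return (theta - theta_plus) / eta
```

With lam = 0 the proximal operator is the identity and the gradient mapping reduces to the gradient itself, recovering the non-composite stationarity measure.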

Applying the triangle inequality, we can derive

Taking the full expectation over the entire history