1 Introduction
Recently, research on reinforcement learning (RL) (Sutton and Barto, 2018), an area of machine learning concerned with learning how to make a series of decisions while interacting with an underlying environment, has been immensely active. Unlike supervised learning, reinforcement learning agents often have limited or no knowledge about the environment, and the rewards of taking certain actions might not be immediately observed, making these problems more challenging to solve. Over the past decade, there has been a large number of research works developing and using reinforcement learning to solve emerging problems. Notable reinforcement learning agents include, but are not limited to, AlphaGo and AlphaZero
(Silver et al., 2016, 2018), OpenAI Five (OpenAI, 2018), and AlphaStar (DeepMind, 2019). In modern RL tasks, the environment is often not known beforehand, so the agent has to learn the environment while simultaneously making appropriate decisions. One approach is to estimate the value function or the action-value function, e.g., Q-learning (Watkins and Dayan, 1992) and its variants such as Deep Q-learning (DQN) (Mnih et al., 2013, 2015), Dueling DQN (Wang et al., 2016), and double Q-learning (Hasselt et al., 2016).
It has been observed that learning the state-value function is not efficient when the action space is large or even infinite. In that case, policy gradient methods learn the policy directly with a parameterized function. Silver et al. (2014) present a framework for deterministic policy gradient algorithms which can be estimated more efficiently than their stochastic counterparts, whereas DDPG (Lillicrap et al., 2016) adapts the idea of deep Q-learning to continuous-action tasks in RL. TRPO (Schulman et al., 2015) uses a constraint on the KL divergence between the new and old policies to improve the robustness of each update. PPO (Schulman et al., 2017) is an extension of TRPO which uses a clipped surrogate objective, resulting in a simpler implementation. Other policy gradient methods utilize the actor-critic paradigm, including ACER (Wang et al., 2017), A3C (Mnih et al., 2016) and its synchronous variant A2C, ACKTR (Wu et al., 2017), and SAC (Haarnoja et al., 2018).
REINFORCE (Williams, 1992) is perhaps the classical method most closely related to our work here. It uses an estimator of the policy gradient and applies a gradient ascent step to update the policy. Nevertheless, the REINFORCE estimator is known to have high variance, leading to several weaknesses. Improvements to reduce the variance, such as adding baselines (Sutton and Barto, 2018; Zhao et al., 2011) and discarding some rewards in the so-called GPOMDP estimator (Baxter and Bartlett, 2001), have been proposed. While the REINFORCE estimator is an unbiased policy gradient estimator, GPOMDP is shown to be biased (Baxter and Bartlett, 2001), making theoretical analysis harder.
The REINFORCE algorithm is closely related to stochastic gradient descent (SGD) (Robbins and Monro, 1951) in stochastic non-convex optimization. In particular, the standard SGD estimator is also known to have a fixed variance, which is often high. On the one hand, there are algorithms that try to reduce the oscillation (Tieleman and Hinton, 2012) or introduce momentum or adaptive updates (Allen-Zhu, 2017, 2018; Kingma and Ba, 2014) for SGD methods to accelerate performance. On the other hand, other researchers have searched for new gradient estimators. One approach is the SAGA estimator proposed by Defazio et al. (2014). Another well-known estimator is the SVRG estimator (Johnson and Zhang, 2013), which has been intensively studied in recent works, e.g., in Allen-Zhu and Yuan (2016); Li and Li (2018); Reddi et al. (2016); Zhou et al. (2018). This estimator not only overcomes the storage issue of SAGA but also possesses a variance-reduction property, i.e., the variance of the estimator decreases over epochs. Methods based on SVRG estimators have recently been developed for reinforcement learning, e.g., SVRPG
(Papini et al., 2018). Xu et al. (2019a) refine the analysis of SVRPG to achieve an improved trajectory complexity of $\mathcal{O}(\epsilon^{-10/3})$. Shen et al. (2019) also adopt the SVRG estimator into policy gradient and achieve a trajectory oracle complexity of $\mathcal{O}(\epsilon^{-3})$ with the use of a second-order estimator. While the SGD, SAGA, and SVRG estimators are unbiased, there have been algorithms developed based on a biased gradient estimator named SARAH (Nguyen et al., 2017b). Such algorithms include SARAH (Nguyen et al., 2017a, 2019), SPIDER (Fang et al., 2018), SpiderBoost (Wang et al., 2018), and ProxSARAH (Pham et al., 2019). Similar to SVRG, all these methods can potentially be extended to reinforcement learning. A recent attempt is SARAPO (Yuan et al., 2019), which combines SARAH (Nguyen et al., 2019) with the TRPO (Schulman et al., 2015) algorithm, but no theoretical guarantee is provided. Yang and Zhang (2019) propose the Mirror Policy Optimization (MPO) algorithm, which covers the classical policy gradient and the natural policy gradient as special cases. They also introduce a variance-reduced variant, called VRMPO, which achieves an $\mathcal{O}(\epsilon^{-3})$ trajectory complexity. Another notable work is SRVRPG (Xu et al., 2019b), where the policy gradient estimator is an adapted version of the SARAH estimator for reinforcement learning. Note that Yang and Zhang (2019) and Xu et al. (2019b) achieve the same trajectory complexity of $\mathcal{O}(\epsilon^{-3})$ as ours. However, our algorithm is essentially different. Xu et al. (2019b) and Yang and Zhang (2019) use two different adaptations of the SARAH estimator for policy gradient: Xu et al. (2019b) use an importance weight in their estimator to handle distribution shift, while Yang and Zhang (2019) remove it, as in Shen et al. (2019). Meanwhile, we introduce a new policy gradient estimator which can also be calculated recursively. The new estimator is fundamentally different from the other two since it combines the adapted SARAH estimator as in Xu et al. (2019b) with the classical REINFORCE estimator.
In addition, our analysis shows that the best-known convergence rate and complexity can be achieved by our single-loop algorithm (Algorithm 1), while SRVRPG and VRMPO require double loops to achieve the same oracle complexity. Moreover, Xu et al. (2019b); Yang and Zhang (2019) do not consider the composite setting that includes constraints or regularizers on the policy parameters, as we do.
Table 1: Comparison of related policy gradient methods: trajectory complexity, support for the composite setting, and single-loop structure.

Algorithms | Complexity | Composite | Single-loop
REINFORCE (Williams, 1992) | $\mathcal{O}(\epsilon^{-4})$ | ✗ | ✓
GPOMDP (Baxter and Bartlett, 2001) | $\mathcal{O}(\epsilon^{-4})$ | ✗ | ✓
SVRPG (Papini et al., 2018) | $\mathcal{O}(\epsilon^{-4})$ | ✗ | ✗
SVRPG (Xu et al., 2019a) | $\mathcal{O}(\epsilon^{-10/3})$ | ✗ | ✗
HAPG (Shen et al., 2019) | $\mathcal{O}(\epsilon^{-3})$ | ✗ | ✗
VRMPO (Yang and Zhang, 2019) | $\mathcal{O}(\epsilon^{-3})$ | ✗ | ✗
SRVRPG (Xu et al., 2019b) | $\mathcal{O}(\epsilon^{-3})$ | ✗ | ✗
This work | $\mathcal{O}(\epsilon^{-3})$ | ✓ | ✓
Our approach:
Our approach lies in the stochastic variance-reduction avenue but uses a completely new hybrid scheme, leading to a novel estimator compared to existing methods in reinforcement learning. We build our estimator by taking a convex combination of the adapted SARAH estimator (Nguyen et al., 2017b) and REINFORCE (Williams, 1992), a classical unbiased policy gradient estimator. This hybrid estimator not only allows us to trade off the bias and variance between these two estimators but also possesses useful properties for developing new algorithms. Note that the idea of combining stochastic estimators was first proposed for stochastic optimization in our recent works (Tran-Dinh et al., 2019a, b). Unlike existing policy gradient methods, our algorithm first samples a large batch of trajectories to establish a good search direction. After that, it iteratively updates the policy parameters using our hybrid estimator, leading to a single-loop method without any snapshot loop as in SVRG or SARAH variants. In addition, as regularization techniques have shown their effectiveness in deep learning (Neyshabur et al., 2017; Zhang et al., 2017), they possibly have great potential in reinforcement learning algorithms too. A recent study (Liu et al., 2019) shows that regularization on the policy parameters can greatly improve the performance of policy gradient algorithms. Motivated by these facts, we directly consider a new composite setting (2), as presented in Section 3. For this new composite model, it is not clear whether existing algorithms remain convergent by simply adding a projection step onto the constraint set, while our method does guarantee convergence.
Our contribution:
To this end, our contribution in this paper can be summarized as follows:

We introduce a novel hybrid stochastic policy gradient estimator by combining the existing REINFORCE estimator with the adapted SARAH estimator for policy gradient. We investigate some key properties of our estimator that are useful for algorithmic development.

We propose a new algorithm to solve a composite maximization problem for policy optimization in reinforcement learning. Our model not only covers existing settings but also handles constraints and convex regularizers on policy parameters.

We provide a convergence analysis, which is the first theoretical result for composite optimization in reinforcement learning, estimate the trajectory complexity of our algorithm, and show that it achieves the best-known complexity among existing first-order methods (see Table 1).
Our algorithm has only one loop, like REINFORCE or GPOMDP, which is fundamentally different from SVRPG, SVRG-adapted, and other SARAH-based algorithms for RL. It can work with a single sample or mini-batches and consists of two steps: a proximal gradient step and an averaging step with different step-sizes. This makes the algorithm more flexible in using different step-sizes without sacrificing the overall complexity.
Paper outline:
The rest of this paper is organized as follows. Section 2 describes the problem of interest and gives an overview of policy gradient methods. Section 3 introduces our new hybrid estimator for policy gradient and develops the main algorithm. The complexity analysis is presented in Section 4, while Section 5 provides several numerical examples. All technical proofs and experimental details are given in the Supplementary Document (Supp. Doc.).
2 Model and Problem Statement
Model:
We consider a Markov Decision Process (MDP) (Sutton and Barto, 2018) equipped with components $\{\mathcal{S}, \mathcal{A}, \mathbb{P}, r, \gamma, \rho\}$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces, $\mathbb{P}(s'|s,a)$ denotes the set of transition probabilities when taking certain actions, $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function which characterizes the immediate reward earned by taking a certain action, $\gamma \in (0,1)$ is a discount factor, and $\rho$ is the initial state distribution. Let $\pi(a|s)$ be a density function over $\mathcal{A}$ when the current state is $s$, and let $\pi_{\theta}$ be a policy parameterized by the parameter $\theta$. A trajectory $\tau := \{s_0, a_0, \dots, s_{H-1}, a_{H-1}\}$ with effective length $H$ is a collection of states and actions sampled from a stationary policy. Denote by $p(\tau|\theta)$ the density induced by the policy $\pi_{\theta}$ over all possible trajectories, i.e., $p(\tau|\theta)$ is the probability of observing a trajectory $\tau$. Also, let $\mathcal{R}(\tau) := \sum_{t=0}^{H-1} \gamma^{t} r(s_t, a_t)$ be the total discounted reward for a trajectory $\tau$. Solving an MDP is equivalent to finding a policy parameter that maximizes the expected cumulative discounted reward.
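As a concrete illustration of the total discounted reward defined above, $\mathcal{R}(\tau) = \sum_{t} \gamma^{t} r(s_t, a_t)$ can be computed as follows (a minimal sketch; the function name and the representation of a trajectory as a reward list are our own illustration, not the paper's notation):

```python
# Discounted return R(tau) = sum_t gamma^t * r_t for one trajectory,
# represented here simply by its per-step reward sequence.
def discounted_return(rewards, gamma):
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# Example: three steps of reward 1 with gamma = 0.5 give 1 + 0.5 + 0.25 = 1.75.
R = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```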
Classical policy gradient methods:
Policy gradient methods seek a differentiable parameterized policy $\pi_{\theta}$ that maximizes the expected cumulative discounted reward:

\max_{\theta \in \mathbb{R}^q} \Big\{ J(\theta) := \mathbb{E}_{\tau \sim p(\cdot|\theta)}\big[\mathcal{R}(\tau)\big] \Big\},   (1)

where $q$ is the parameter dimension. The policy gradient theorem (Sutton et al., 1999) shows that

\nabla J(\theta) = \mathbb{E}_{\tau \sim p(\cdot|\theta)}\big[\nabla_{\theta} \log p(\tau|\theta)\, \mathcal{R}(\tau)\big],

where the policy gradient does not depend on the gradient of the state distribution despite the fact that the state distribution depends on the policy parameters (Silver et al., 2014).
This policy gradient can be used in gradient ascent algorithms to update the parameter $\theta$. However, we cannot calculate the full gradient at each update as we only get a finite number of samples at each iteration. Consequently, the policy gradient is often estimated by its sample average. At each iteration, a batch $\mathcal{B}$ of trajectories is sampled from the environment to estimate the policy gradient as

\widehat{\nabla} J(\theta) := \frac{1}{|\mathcal{B}|} \sum_{\tau \in \mathcal{B}} g(\tau|\theta),

where $g(\tau|\theta)$ is a sample estimator of $\nabla J(\theta)$. We call $g(\tau|\theta)$ a stochastic policy gradient (SPG) estimator. This estimator has been exploited in the two well-known methods REINFORCE (Williams, 1992) and GPOMDP (Baxter and Bartlett, 2001). The main step of policy gradient ascent methods is to update the parameters as

\theta_{t+1} := \theta_t + \eta_t \widehat{\nabla} J(\theta_t),

where $\eta_t > 0$ is some appropriate learning rate, which can be fixed or vary over $t$. Since the policy changes after each update, the density $p(\cdot|\theta)$ also changes, creating non-stationarity in the problem, which will be handled by an importance weight in Section 3.
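To make the SPG update concrete, the following toy sketch runs REINFORCE-style gradient ascent on a one-step, two-action problem with a softmax policy. The setup and all names are our own illustration, not the paper's implementation:

```python
import math
import random

# One-step "bandit" MDP with two actions and a softmax policy pi_theta.
# The SPG estimator for one sample is grad_theta log pi_theta(a) * R,
# averaged over a batch, followed by a plain gradient-ascent step on theta.
def softmax_probs(theta):
    z = [math.exp(t) for t in theta]
    s = sum(z)
    return [p / s for p in z]

def grad_log_pi(theta, a):
    # Gradient of log softmax: indicator(a) - pi_theta.
    p = softmax_probs(theta)
    return [(1.0 if i == a else 0.0) - p[i] for i in range(len(theta))]

def spg_step(theta, rewards, eta, batch_size, rng):
    p = softmax_probs(theta)
    grad = [0.0] * len(theta)
    for _ in range(batch_size):
        a = rng.choices(range(len(theta)), weights=p)[0]
        g = grad_log_pi(theta, a)
        for i in range(len(theta)):
            grad[i] += g[i] * rewards[a] / batch_size
    # Gradient *ascent*: theta <- theta + eta * estimated policy gradient.
    return [theta[i] + eta * grad[i] for i in range(len(theta))]

rng = random.Random(0)
theta = [0.0, 0.0]
for _ in range(200):
    theta = spg_step(theta, rewards=[0.0, 1.0], eta=0.5, batch_size=8, rng=rng)
# After training, the policy should strongly prefer the rewarding action (index 1).
```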
3 A New Hybrid Stochastic Policy Gradient Algorithm
In this section, we first introduce a composite model for policy optimization. Next, we extend the hybrid gradient idea from Tran-Dinh et al. (2019a) to policy gradient estimators. Finally, we develop a new proximal policy gradient algorithm and its restart variant to solve the composite policy optimization problem and analyze their trajectory complexity.
3.1 Composite Policy Optimization Model
While the objective function in (1) is standard in most policy gradient methods, it is natural to have constraints or regularizers on the policy parameters. In addition, adding constraints can prevent the explosion of parameters in highly non-linear models, as often seen in deep learning (Srivastava et al., 2014). Adopting the idea of composite non-convex optimization (Pham et al., 2019), we are interested in the following more general optimization problem in reinforcement learning:
\max_{\theta \in \mathbb{R}^q} \Big\{ F(\theta) := J(\theta) - \psi(\theta) \Big\},   (2)

where $\psi$ is a proper, closed, and convex function acting as a regularizer, which can be the indicator function of a convex set representing constraints on the parameters, or a standard regularizer such as the $\ell_1$-norm or $\ell_2$-norm. If there is no regularizer $\psi$ (i.e., $\psi \equiv 0$), problem (2) reduces to the standard one in (1).
3.2 Assumptions
Let $F(\theta) := J(\theta) - \psi(\theta)$ be the total objective function. We impose the following assumptions for our convergence analysis, which are often satisfied in practice.
Assumption 3.1.
The regularizer $\psi$ is a proper, closed, and convex function. We also assume that the domain of $F$ is nonempty and that $F$ admits a finite upper bound over its domain.
Assumption 3.2.
The immediate reward function is bounded, i.e., there exists a constant $R_{\max} > 0$ such that $|r(s, a)| \le R_{\max}$ for all $s \in \mathcal{S}$, $a \in \mathcal{A}$.
Assumption 3.3.
Let $\pi_{\theta}(a|s)$ be the policy for a given state-action pair $(s, a)$. Then, there exist two positive constants $G$ and $M$ such that

\|\nabla_{\theta} \log \pi_{\theta}(a|s)\| \le G \quad \text{and} \quad \|\nabla^2_{\theta} \log \pi_{\theta}(a|s)\| \le M,

for any $\theta$, where $\|\cdot\|$ is the $\ell_2$-norm.
This assumption leads to useful results about the smoothness of the objective and the upper bound on the variance of the policy gradient estimator.
For more details about these constants and the proofs of Lemma 3.1, we refer, e.g., to Papini et al. (2018); Shen et al. (2019); Xu et al. (2019a).
Assumption 3.4.
There exists a constant $W > 0$ such that, for each pair of policies $\pi_{\theta_1}$ and $\pi_{\theta_2}$ encountered in Algorithm 1, the following holds:

\mathrm{Var}\big[\omega(\tau|\theta_1, \theta_2)\big] \le W,

where $\omega(\tau|\theta_1, \theta_2) := p(\tau|\theta_1)/p(\tau|\theta_2)$ is the importance weight between $p(\cdot|\theta_1)$ and $p(\cdot|\theta_2)$.
Since the importance weight introduces another source of variance, we require this assumption for our convergence analysis as used in previous works, e.g., in Papini et al. (2018); Xu et al. (2019a).
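The importance weight above is the likelihood ratio of a trajectory under two policies; since the (unknown) transition probabilities cancel in the ratio, it reduces to a product of per-step action-probability ratios. A minimal sketch (our own illustrative code, computed in log space for numerical stability):

```python
import math

# Importance weight between two policies for one trajectory:
# w = p(tau|theta_1) / p(tau|theta_2) = prod_t pi_1(a_t|s_t) / pi_2(a_t|s_t),
# because the environment transition probabilities cancel in the ratio.
def importance_weight(logp_1, logp_2):
    # logp_*: per-step action log-probabilities under each policy.
    return math.exp(sum(logp_1) - sum(logp_2))

# Identical policies give weight 1 regardless of the trajectory.
w_same = importance_weight([-1.2, -0.3], [-1.2, -0.3])
# A trajectory twice as likely under the first policy gives weight 2.
w_double = importance_weight([math.log(0.5)], [math.log(0.25)])
```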
Remark 3.1.
Cortes et al. (2010) show that if $\sigma_P$ and $\sigma_Q$ are the variances of two Gaussian distributions P and Q, and $\sigma_Q$ is sufficiently large relative to $\sigma_P$, then the variance of the importance weights is bounded, i.e., Assumption 3.4 holds for Gaussian policies, which are commonly used to represent the policy in continuous control tasks.
3.3 Optimality Condition
Associated with problem (2), we define

\mathcal{G}_{\eta}(\theta) := \frac{1}{\eta}\big(\mathrm{prox}_{\eta\psi}(\theta + \eta \nabla J(\theta)) - \theta\big),   (3)

for some $\eta > 0$, as the gradient mapping of $F$ (Nesterov, 2014), where $\mathrm{prox}_{\eta\psi}$ denotes the proximal operator of $\eta\psi$ (see, e.g., Parikh and Boyd (2014) for more details).
A point $\theta^*$ is called a stationary point of (2) if $\mathcal{G}_{\eta}(\theta^*) = 0$.
Our goal is to design an iterative method to produce an approximate stationary point $\tilde{\theta}_T$ of (2) after at most $T$ iterations, defined as

\mathbb{E}\big[\|\mathcal{G}_{\eta}(\tilde{\theta}_T)\|^2\big] \le \varepsilon^2,

where $\varepsilon > 0$ is a desired tolerance, and the expectation is taken over all the randomness up to $T$ iterations.
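To illustrate the gradient mapping, the sketch below instantiates it for an $\ell_1$ regularizer, whose proximal operator is soft-thresholding. The choice of $\psi$ and all names are our own illustration; the paper's framework allows any proper, closed, convex regularizer:

```python
# Gradient mapping G_eta(theta) = (prox_{eta*psi}(theta + eta*grad_J) - theta)/eta
# for the composite maximization problem, with psi = lam * ||theta||_1,
# whose proximal operator is the soft-thresholding map.
def prox_l1(z, t):
    return [max(abs(v) - t, 0.0) * (1.0 if v >= 0 else -1.0) for v in z]

def gradient_mapping(theta, grad_J, eta, lam):
    step = [theta[i] + eta * grad_J[i] for i in range(len(theta))]  # ascent step
    prox = prox_l1(step, eta * lam)                                 # proximal step
    return [(prox[i] - theta[i]) / eta for i in range(len(theta))]

# At theta = 0 with a small gradient (inside lam times the l1 subdifferential),
# the gradient mapping vanishes, certifying stationarity of the composite model.
G = gradient_mapping(theta=[0.0, 0.0], grad_J=[0.05, -0.05], eta=0.1, lam=0.1)
```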
3.4 Novel Hybrid SPG Estimator
Unbiased estimator:
Recall that given a trajectory $\tau$, the REINFORCE (SPG) estimator is defined as

g(\tau|\theta) := \Big(\sum_{t=0}^{H-1} \nabla_{\theta} \log \pi_{\theta}(a_t|s_t)\Big)\, \mathcal{R}(\tau),

where $\mathcal{R}(\tau) = \sum_{t=0}^{H-1} \gamma^t r(s_t, a_t)$.
Note that the REINFORCE estimator is unbiased, i.e., $\mathbb{E}_{\tau \sim p(\cdot|\theta)}[g(\tau|\theta)] = \nabla J(\theta)$. In order to reduce the variance of this estimator, a baseline is normally added while maintaining its unbiasedness (Sutton and Barto, 2018; Zhao et al., 2011). From now on, we will refer to $g(\tau|\theta)$ as the baseline-added version defined as

g(\tau|\theta) := \sum_{t=0}^{H-1} \nabla_{\theta} \log \pi_{\theta}(a_t|s_t)\big(\mathcal{R}(\tau) - b_t\big),

where $b_t$ is a baseline, possibly depending only on the states.
Hybrid SPG estimator:
In order to reduce the number of trajectories sampled, we extend our idea in Tran-Dinh et al. (2019a) for stochastic optimization to develop a new hybrid stochastic policy gradient (HSPG) estimator that helps balance the bias-variance trade-off. The estimator is formed by taking a convex combination of two other estimators: one is an unbiased estimator, which can be the REINFORCE estimator, and the other is the adapted SARAH estimator (Nguyen et al., 2017b) for policy gradient, which is biased. More precisely, if $\mathcal{B}_t$ and $\hat{\mathcal{B}}_t$ are two random batches of trajectories with sizes $b$ and $\hat{b}$, respectively, sampled from $p(\cdot|\theta_t)$, the hybrid stochastic policy gradient estimator at the $t$-th iteration can be expressed as

v_t := \beta u_t + (1 - \beta)\hat{u}_t, \quad \beta \in [0, 1],   (4)

where

u_t := v_{t-1} + \frac{1}{b}\sum_{\tau \in \mathcal{B}_t}\big[g(\tau|\theta_t) - \omega(\tau|\theta_{t-1}, \theta_t)\, g(\tau|\theta_{t-1})\big]

is the adapted SARAH estimator and

\hat{u}_t := \frac{1}{\hat{b}}\sum_{\tau \in \hat{\mathcal{B}}_t} g(\tau|\theta_t)

is the REINFORCE estimator, with $v_0$ evaluated on a batch of trajectories collected at the beginning. Note that $\omega(\tau|\theta_{t-1}, \theta_t)$ is an importance weight added to account for the distribution shift, since the trajectories are sampled from $p(\cdot|\theta_t)$ but not from $p(\cdot|\theta_{t-1})$. Note also that $v_t$ in (4) is also different from the momentum SARAH estimator recently proposed in Cutkosky and Orabona (2019).
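A scalar toy version of the hybrid update can be sketched as follows (our own simplified code: the importance weight is set to 1 and synthetic numbers stand in for the stochastic policy gradients, so this only illustrates the combination mechanism):

```python
# Hybrid SPG estimator (toy, scalar version): a convex combination of the
# adapted SARAH estimator and a plain unbiased REINFORCE-style estimator,
#   v_t = beta * (v_{t-1} + mean_B[g_cur - g_prev]) + (1 - beta) * mean_Bhat[g_cur],
# with the importance weight taken as 1 for simplicity.
def hybrid_estimator(v_prev, g_cur_batch, g_prev_batch, g_hat_batch, beta):
    b = len(g_cur_batch)
    sarah = v_prev + sum(gc - gp for gc, gp in zip(g_cur_batch, g_prev_batch)) / b
    reinforce = sum(g_hat_batch) / len(g_hat_batch)
    return beta * sarah + (1.0 - beta) * reinforce

# beta = 1 recovers the pure SARAH recursion; beta = 0 recovers plain REINFORCE;
# intermediate beta trades off the bias and variance of the two estimators.
v_sarah = hybrid_estimator(2.0, [1.5, 0.5], [1.0, 1.0], [7.0], beta=1.0)
v_reinf = hybrid_estimator(2.0, [1.5, 0.5], [1.0, 1.0], [7.0], beta=0.0)
v_mixed = hybrid_estimator(2.0, [1.5, 0.5], [1.0, 1.0], [7.0], beta=0.5)
```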
3.5 The Complete Algorithm
The novel Proximal Hybrid Stochastic Policy Gradient Algorithm (abbreviated as ProxHSPGA) for solving (2) is presented in detail in Algorithm 1.
Unlike SVRPG (Papini et al., 2018; Xu et al., 2019a) and HAPG (Shen et al., 2019), Algorithm 1 has only one loop, like REINFORCE or GPOMDP. Moreover, Algorithm 1 does not use an estimator of the policy Hessian as in HAPG. At the initial stage, a batch of trajectories is sampled to estimate an initial policy gradient, which provides a good initial search direction. At the $t$-th iteration, two independent batches of trajectories are sampled from $p(\cdot|\theta_t)$ to evaluate the hybrid stochastic policy gradient estimator. After that, a proximal step followed by an averaging step is performed, inspired by Pham et al. (2019). Note that the batches of trajectories at each iteration are sampled from the current distribution, which changes after each update. Therefore, the importance weight is introduced to account for the non-stationarity of the sampling distribution. As a result, we still have $\mathbb{E}_{\tau \sim p(\cdot|\theta_t)}\big[\omega(\tau|\theta_{t-1}, \theta_t)\, g(\tau|\theta_{t-1})\big] = \nabla J(\theta_{t-1})$.
3.6 Restarting variant
While Algorithm 1 has the best-known theoretical complexity as shown in Section 4, its practical performance may be affected by the constant step-size, which depends on the number of iterations $T$. As will be shown later, the step-size is inversely proportional to $T$, and it is natural to keep $T$ small to take advantage of the newly computed information. To increase the practical performance of our algorithm without sacrificing its complexity, we propose to inject a simple restarting strategy by repeatedly running Algorithm 1 for multiple stages, as in Algorithm 2.
4 Convergence Analysis
This section presents key properties of the hybrid stochastic policy gradient estimators as well as the theoretical convergence analysis and complexity estimate.
4.1 Properties of the hybrid SPG estimator
Let $\mathcal{F}_t$ be the $\sigma$-field generated by all trajectories sampled up to the $t$-th iteration. For the sake of simplicity, we assume that the two batch sizes $b$ and $\hat{b}$ are equal, but our analysis can be easily extended to the case $b \neq \hat{b}$. Then the hybrid SPG estimator has the following properties.
Lemma 4.1 (Key properties).
Let $v_t$ be defined as in (4). Then
(5) 
If $\beta \neq 0$, then $v_t$ is a biased estimator. In addition, we have
(6) 
where is a given constant.
4.2 Complexity Estimates
The following lemma presents a key estimate for our convergence results.
Lemma 4.2 (Oneiteration analysis).
The following theorem summarizes the convergence analysis of Algorithm 1.
Theorem 4.1.
Consequently, the trajectory complexity is presented in the following corollary.
Corollary 4.1.
For both Algorithm 1 and Algorithm 2, let us fix and set for some in Theorem 4.1. If we also choose in Algorithm 1 such that and choose in Algorithm 2 such that for some constant , then the number of trajectories to achieve such that for any is at most
where is chosen uniformly at random from if using Algorithm 2.
5 Numerical Experiments
In this section, we present three examples to compare the performance of HSPGA with other related policy gradient methods. We also provide an example to illustrate the effect of the regularizer in our model (2). More examples can be found in Supp. Doc. C. All experiments are run on a MacBook Pro with a 2.3 GHz Quad-Core CPU and 8 GB of RAM.
We implement our restarting algorithm, Algorithm 2, on top of the rllab library (Duan et al., 2016), available at https://github.com/rll/rllab. The source code is available at https://github.com/unc-optimization/ProxHSPGA. We compare our algorithm with two other methods: SVRPG (Papini et al., 2018; Xu et al., 2019a) and GPOMDP (Baxter and Bartlett, 2001). Although REINFORCE and GPOMDP have the same trajectory complexity, as observed in Papini et al. (2018), GPOMDP often performs better than REINFORCE, so we only implement GPOMDP in our experiments. Since SVRPG and GPOMDP solve the non-composite problem (1), we set $\psi \equiv 0$ in the first three examples and adjust our algorithm, denoted as HSPGA, accordingly. We compare our algorithm with the fixed-epoch-length variant of SVRPG, as reported in Papini et al. (2018); Xu et al. (2019a). For the implementation of SVRPG and GPOMDP, we reuse the implementation of Papini et al., available at https://github.com/Dam930/rllab. We test these algorithms on three well-studied reinforcement learning tasks: Cart Pole, Acrobot, and Mountain Car, which are available in OpenAI Gym (Brockman et al., 2016), a well-known toolkit for developing and comparing reinforcement learning algorithms. We also test these algorithms on continuous control tasks using other simulators such as Roboschool (Klimov and Schulman, 2017) and MuJoCo (Todorov et al., 2012).
For each environment, we initialize the policy randomly and use it as the initial policy for all runs of all algorithms. The performance measure, i.e., the mean reward, is computed by averaging the final rewards of trajectories sampled by the current policy. We then compute the mean and 90% confidence interval across runs of these performance measures at different time points. In all plots, the solid lines represent the mean and the shaded areas are the confidence band of the mean rewards. In addition, detailed configurations of the policy networks and parameters, including the notation used for the network architectures, can be found in Supp. Doc. B.
Cart Pole-v0 environment:
For the Cart Pole environment, we use a deep softmax policy network (Bridle, 1990; Levine, 2017; Sutton and Barto, 2018) with one hidden layer. Figure 1 depicts the results, where we run each algorithm multiple times and compute the mean and confidence intervals.
From Figure 1, we can see that HSPGA outperforms the other algorithms, while SVRPG works better than GPOMDP, as expected. HSPGA is able to reach the maximum reward in fewer episodes than the other methods.
Acrobot environment:
Next, we evaluate the three algorithms on the Acrobot-v1 environment. Here, we use a deep softmax policy with one hidden layer of 16 neurons. The performance of these three algorithms is illustrated in Figure 2.
We observe similar results as in the previous example, where HSPGA performs best among the three candidates. SVRPG is again better than GPOMDP in this example.
Mountain Car environment:
For the MountainCar-v0 environment, we use a deep Gaussian policy (Sutton and Barto, 2018), where the mean is the output of a neural network containing one hidden layer and the standard deviation is fixed. The results of the three algorithms are presented in Figure 3.
Figure 3 shows that HSPGA significantly outperforms the other two algorithms. Again, SVRPG remains better than GPOMDP, as expected.
The effect of regularizers:
We test the effect of the regularizer by adding a Tikhonov regularizer $\psi(\theta) := \frac{\lambda}{2}\|\theta\|^2$ to the objective.
This model was intensively studied in Liu et al. (2019).
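For this choice of regularizer, the proximal step in ProxHSPGA has a simple closed form, sketched below under the assumption that the Tikhonov regularizer takes the standard quadratic form $\psi(\theta) = \frac{\lambda}{2}\|\theta\|^2$ (the code and variable names are our own illustration):

```python
# Proximal operator of the Tikhonov regularizer psi(theta) = (lam/2)*||theta||^2:
#   prox_{eta*psi}(z) = argmin_x { eta*(lam/2)*||x||^2 + (1/2)*||x - z||^2 }
#                     = z / (1 + eta*lam),
# i.e., the proximal step is a simple shrinkage of the parameters toward zero.
def prox_tikhonov(z, eta, lam):
    return [v / (1.0 + eta * lam) for v in z]

p = prox_tikhonov([2.0, -4.0], eta=0.5, lam=2.0)  # shrink by 1/(1 + 1) = 1/2
```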
We also compare all non-composite algorithms with ProxHSPGA on the Roboschool InvertedPendulum-v1 environment. In this experiment, we set the penalty parameter $\lambda$ for ProxHSPGA (see Supp. Doc. B for its value). The results are depicted in Figure 4, and more information about the configuration of each algorithm is given in Supp. Doc. B.
From Figure 4, in terms of non-composite algorithms, HSPGA performs best, followed by SVRPG and then GPOMDP. Furthermore, ProxHSPGA shows its advantage by reaching the maximum reward of 1000 faster than HSPGA.
6 Conclusion
We have presented a novel policy gradient algorithm to solve regularized reinforcement learning models. Our algorithm uses a novel policy gradient estimator, which is a combination of an unbiased estimator, i.e., the REINFORCE estimator, and a biased estimator adapted from the SARAH estimator for policy gradient. Theoretical results show that our algorithm achieves the best-known trajectory complexity to attain an approximate first-order solution under standard assumptions. In addition, our numerical experiments not only confirm the benefit of our algorithm compared to other closely related policy gradient methods but also verify the effectiveness of regularization in policy gradient methods.
Q. Tran-Dinh has partly been supported by the National Science Foundation (NSF), grant no. DMS-1619884, and the Office of Naval Research (ONR), grant no. N00014-20-1-2088 (2020-2023). Q. Tran-Dinh and N. H. Pham are partly supported by The Statistical and Applied Mathematical Sciences Institute (SAMSI).
References
 Improved svrg for nonstronglyconvex or sumofnonconvex objectives. In Proceedings of the 33rd International Conference on International Conference on Machine Learning  Volume 48, pp. 1080–1089. Cited by: §1.

Katyusha: the first direct acceleration of stochastic gradient methods.
In
Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing
, New York, NY, USA, pp. 1200–1205. Cited by: §1.  Natasha 2: faster nonconvex optimization than sgd. In Advances in Neural Information Processing Systems 31, pp. 2675–2686. Cited by: §1.
 Infinitehorizon policygradient estimation. J. Artif. Int. Res. 15 (1), pp. 319–350. Cited by: Table 1, §1, §2, §5.
 Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters. In Advances in Neural Information Processing Systems 2, pp. 211–217. Cited by: §5.
 OpenAI gym. External Links: arXiv:1606.01540 Cited by: §5.
 Learning bounds for importance weighting. In Advances in Neural Information Processing Systems 23, pp. 442–450. Cited by: §A.1, Remark 3.1.
 Momentumbased variance reduction in nonconvex sgd. In Advances in Neural Information Processing Systems, pp. 15210–15219. Cited by: §3.4.
 AlphaStar: Mastering the RealTime Strategy Game StarCraft II. Note: https://deepmind.com/blog Cited by: §1.
 SAGA: a fast incremental gradient method with support for nonstrongly convex composite objectives. In Proceedings of the 27th International Conference on Neural Information Processing Systems  Volume 1, Cambridge, MA, USA, pp. 1646–1654. Cited by: §1.
 Benchmarking deep reinforcement learning for continuous control. In Proceedings of the 33rd International Conference on International Conference on Machine Learning  Volume 48, pp. 1329–1338. Cited by: §5.
 SPIDER: nearoptimal nonconvex optimization via stochastic pathintegrated differential estimator. In NeurIPS, Cited by: §1.
 Soft actorcritic: offpolicy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 1861–1870. Cited by: §1.

Deep reinforcement learning with double qlearning.
In
Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence
, pp. 2094–2100. Cited by: §1.  Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems (NIPS), pp. 315–323. Cited by: §1.
 Adam: a method for stochastic optimization. CoRR abs/1412.6980. Cited by: §1.
 Roboschool. Note: https://openai.com/blog/roboschool/ Cited by: §5.
 CS 294112: deep reinforcement learning lecture notes. Cited by: §5.
 A simple proximal stochastic gradient method for nonsmooth nonconvex optimization. In Proceedings of the 32Nd International Conference on Neural Information Processing Systems, USA, pp. 5569–5579. Cited by: §1.
 Continuous control with deep reinforcement learning. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 24, 2016, Conference Track Proceedings, Cited by: §1.
 Regularization matters in policy optimization. arXiv preprint arXiv:1910.09191. Cited by: Appendix C, §1, §5.
 Asynchronous methods for deep reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, Vol. 48, pp. 1928–1937. Cited by: §1.
 Playing atari with deep reinforcement learning. ArXiv abs/1312.5602. Cited by: §1.
 Humanlevel control through deep reinforcement learning. Nature 518, pp. 529–533. Cited by: §1.
 Introductory lectures on convex optimization: a basic course. 1 edition, Springer Publishing Company, Incorporated. Cited by: §3.3.
 Exploring generalization in deep learning. In Advances in Neural Information Processing Systems 30, pp. 5947–5956. Cited by: §1.
 Stochastic recursive gradient algorithm for nonconvex optimization. CoRR abs/1705.07261. Cited by: §1.
 SARAH: a novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning, pp. 2613–2621. Cited by: §1, §1, §3.4.
 Finitesum smooth optimization with sarah. arXiv preprint arXiv:1901.07648. Cited by: §1.
 OpenAI Five. Note: https://blog.openai.com/openaifive/ Cited by: §1.
 Stochastic variancereduced policy gradient. In Proceedings of the 35th International Conference on Machine Learning, Vol. 80, pp. 4026–4035. Cited by: Table 1, §1, §3.2, §3.2, §3.5, Lemma 3.1, §5.
 Proximal algorithms. Found. Trends Optim. 1 (3), pp. 127–239. Cited by: §3.3.
 ProxSARAH: an efficient algorithmic framework for stochastic composite nonconvex optimization. ArXiv abs/1902.05679. Cited by: §1, §3.1, §3.5.
 Stochastic variance reduction for nonconvex optimization. In Proceedings of the 33rd International Conference on International Conference on Machine Learning  Volume 48, pp. 314–323. Cited by: §1.
 A stochastic approximation method. Ann. Math. Statist. 22 (3), pp. 400–407. Cited by: §1.
 Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, Vol. 37, Lille, France, pp. 1889–1897. Cited by: §1, §1.
 Proximal policy optimization algorithms. ArXiv abs/1707.06347. Cited by: §1.
 Hessian aided policy gradient. In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 5729–5738. Cited by: Table 1, §1, §1, §3.2, §3.5, Lemma 3.1, §4.2.
 Mastering the game of go with deep neural networks and tree search. Nature 529, pp. 484–503. Cited by: §1.
 A general reinforcement learning algorithm that masters chess, shogi, and go through selfplay. Science 362 (6419), pp. 1140–1144. Cited by: §1.
 Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on International Conference on Machine Learning  Volume 32, pp. I–387–I–395. Cited by: §1, §2.
 Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §3.1.
 Introduction to reinforcement learning, 2nd edition. MIT Press. Cited by: §1, §1, §2, §3.4, §5, §5.
 Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, pp. 1057–1063. Cited by: §2.

 Lecture 6.5 - RMSProp: divide the gradient by a running average of its recent magnitude. Note: COURSERA: Neural Networks for Machine Learning. Cited by: §1.
 MuJoCo: a physics engine for model-based control. 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §5.
 A hybrid stochastic optimization framework for stochastic composite nonconvex optimization. arXiv preprint arXiv:1907.03793. Cited by: §A.1, §A.2, Appendix A, §1, §3.4, §3.
 Hybrid stochastic gradient descent algorithms for stochastic nonconvex optimization. arXiv preprint arXiv:1905.05920. Cited by: §1.
 Sample efficient actor-critic with experience replay. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. Cited by: §1.
 SpiderBoost: a class of faster variance-reduced algorithms for nonconvex optimization. ArXiv abs/1810.10690. Cited by: §1.
 Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, pp. 1995–2003. Cited by: §1.
 Q-learning. Machine Learning 8 (3), pp. 279–292. Cited by: §1.
 Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3), pp. 229–256. Cited by: §1, Table 1, §1, §2.
 Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, USA, pp. 5285–5294. Cited by: §1.
 An improved convergence analysis of stochastic variance-reduced policy gradient. Conference on Uncertainty in Artificial Intelligence. Cited by: §A.1, Table 1, §1, §3.2, §3.2, §3.5, Lemma 3.1, §4.2, §5.
 Sample efficient policy gradient methods with recursive variance reduction. ArXiv abs/1909.08610. Cited by: Table 1, §1.
 Policy optimization with stochastic mirror descent. CoRR abs/1906.10462. Cited by: Table 1, §1.
 Policy optimization via stochastic recursive gradient algorithm. Cited by: §1.
 Understanding deep learning requires rethinking generalization. Cited by: §1.
 Analysis and improvement of policy gradient estimation. In Advances in Neural Information Processing Systems 24, pp. 262–270. Cited by: §1, §3.4.
 Stochastic nested variance reduction for nonconvex optimization. ArXiv abs/1806.07811. Cited by: §1.
Appendix A Convergence Analysis
We note that the original idea of using hybrid estimators was proposed in our working paper (Tran-Dinh et al., 2019a). In this work, we extend this idea, as well as the proof techniques for stochastic optimization in Tran-Dinh et al. (2019a), to reinforcement learning settings. We now provide the full analysis of Algorithms 1 and 2. We first prove a key property of our new hybrid estimator for the policy gradient. Then, we provide the proofs of Theorem 4.1 and Corollary 4.1.
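As a schematic reminder of that construction (the symbols $x_t$, $f$, $\xi_t$, $\zeta_t$, $\beta_t$ below are generic placeholders and may differ from the paper's notation), the hybrid estimator of Tran-Dinh et al. (2019a) blends a SARAH-type recursive difference term with an independently sampled unbiased stochastic gradient:

```latex
v_t \;=\; \beta_t\, v_{t-1}
      \;+\; \beta_t \bigl( \nabla f(x_t;\xi_t) - \nabla f(x_{t-1};\xi_t) \bigr)
      \;+\; (1-\beta_t)\, \nabla f(x_t;\zeta_t),
\qquad \beta_t \in [0,1],
```

where $\xi_t$ and $\zeta_t$ are drawn independently. Setting $\beta_t = 1$ recovers a SARAH-type estimator, while $\beta_t = 0$ recovers plain SGD; the independence of the two samples is what the variance bound of Lemma 4.1 exploits.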
A.1 Proof of Lemma 4.1: Bound on the Variance of the Hybrid SPG Estimator
Part of this proof follows the proof of Lemma 1 in Tran-Dinh et al. (2019a). Let $\mathbb{E}[\cdot]$ denote the total expectation. Using the independence of the two samples and taking the total expectation of (4), we obtain
which is the same as (5).
To prove (6), we first define and . We have
Taking the total expectation and noting that and , we get
(10) 
where the first inequality comes from the triangle inequality, and the second follows by ignoring the nonnegative terms.
A.2 Proof of Lemma 4.2: Key Estimate of Algorithm 1
Similar to the proof of Lemma 5 in Tran-Dinh et al. (2019a), from the update in Algorithm 1, we have , which leads to . Combining this expression with the smoothness of in Lemma 3.1, we have
(13) 
From the convexity of , we have
(14) 
where is a subgradient of at .
By the optimality condition of , we can show that for some , where is the subdifferential of Q at . Plugging this into (14), we get
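For concreteness, the standard form of this optimality condition for a proximal (ascent) step, written here with a generic step size $\eta$, regularizer $\varphi$, and iterate $\theta_t$ (these symbols are ours and need not match the paper's notation), is:

```latex
\theta_{t+1} \;=\; \mathrm{prox}_{\eta \varphi}\!\bigl(\theta_t + \eta\, v_t\bigr)
\quad\Longleftrightarrow\quad
\frac{1}{\eta}\bigl(\theta_t - \theta_{t+1}\bigr) + v_t \;\in\; \partial \varphi(\theta_{t+1}),
```

which follows from writing the proximal operator as $\mathrm{prox}_{\eta\varphi}(z) = \arg\min_\theta \{\varphi(\theta) + \tfrac{1}{2\eta}\|\theta - z\|^2\}$ and setting the subgradient of the objective at $\theta_{t+1}$ to contain zero.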
(15) 
Subtracting (15) from (13), we obtain
(16) 
Using the fact that
and ignoring the nonnegative term , we can rewrite (16) as
Taking the total expectation over the entire history , we obtain
(17) 
From the definition of the gradient mapping (3), we have
Applying the triangle inequality, we can derive
Taking the full expectation over the entire history
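For reference, a standard definition of the composite gradient mapping used in this type of analysis (again with generic symbols: step size $\eta$, objective $J$, and regularizer $\varphi$; these are ours, not necessarily the paper's) is:

```latex
\mathcal{G}_\eta(\theta) \;:=\; \frac{1}{\eta}\Bigl( \mathrm{prox}_{\eta\varphi}\!\bigl(\theta + \eta \nabla J(\theta)\bigr) - \theta \Bigr),
```

which reduces to $\nabla J(\theta)$ when $\varphi \equiv 0$ and vanishes exactly at stationary points of the composite problem, so its norm is the natural stationarity measure in this setting (cf. the proximal-algorithms monograph cited above).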