Recently, research on reinforcement learning (RL) (Sutton and Barto, 2018), an area of machine learning concerned with learning how to make a series of decisions while interacting with an underlying environment, has been immensely active. Unlike supervised learning, reinforcement learning agents often have limited or no knowledge about the environment, and the rewards of taking certain actions might not be immediately observed, making these problems more challenging to solve. Over the past decade, a large number of research works have developed and used reinforcement learning to solve emerging problems. Notable reinforcement learning agents include, but are not limited to, AlphaGo and AlphaZero (Silver et al., 2016, 2018), OpenAI Five (OpenAI, 2018), and AlphaStar (DeepMind, 2019).
In modern RL tasks, the environment is often not known beforehand, so the agent has to learn the environment while simultaneously making appropriate decisions. One approach is to estimate the value function or the action-value function, e.g., Q-learning (Watkins and Dayan, 1992) and its variants such as Deep Q-learning (DQN) (Mnih et al., 2013, 2015), Dueling DQN (Wang et al., 2016), and double Q-learning (Hasselt et al., 2016).
It has been observed that learning value functions is not efficient when the action space is large or even infinite. In that case, policy gradient methods learn the policy directly with a parameterized function. Silver et al. (2014) present a framework for deterministic policy gradient algorithms, which can be estimated more efficiently than their stochastic counterparts, whereas DDPG (Lillicrap et al., 2016) adapts the idea of deep Q-learning to continuous-action tasks in RL. TRPO (Schulman et al., 2015) uses a constraint on the KL divergence between the new and old policies to improve the robustness of each update. PPO (Schulman et al., 2017) is an extension of TRPO which uses a clipped surrogate objective, resulting in a simpler implementation. Other policy gradient methods utilize the actor-critic paradigm, including ACER (Wang et al., 2017), A3C (Mnih et al., 2016) and its synchronous variant A2C, ACKTR (Wu et al., 2017), and SAC (Haarnoja et al., 2018).
REINFORCE (Williams, 1992) is perhaps the classical method most closely related to our work here. It uses an estimator of the policy gradient and applies a gradient ascent step to update the policy. Nevertheless, the REINFORCE estimator is known to have high variance, leading to several weaknesses. Improvements to reduce the variance have been proposed, such as adding baselines (Sutton and Barto, 2018; Zhao et al., 2011) or discarding some rewards as in the so-called GPOMDP estimator (Baxter and Bartlett, 2001). While the REINFORCE estimator is an unbiased policy gradient estimator, GPOMDP is shown to be biased (Baxter and Bartlett, 2001), making theoretical analysis harder.
The REINFORCE algorithm appears to be closely related to stochastic gradient descent (SGD) (Robbins and Monro, 1951) in stochastic nonconvex optimization. In particular, the standard SGD estimator is known to have a non-vanishing variance, which is often high. On the one hand, there are algorithms that try to reduce the oscillation (Tieleman and Hinton, 2012) or introduce momentum or adaptive updates (Allen-Zhu, 2017, 2018; Kingma and Ba, 2014) to accelerate SGD methods. On the other hand, other researchers are searching for new gradient estimators. One approach is the SAGA estimator proposed by Defazio et al. (2014). Another well-known estimator is the SVRG estimator (Johnson and Zhang, 2013), which has been intensively studied in recent works, e.g., in Allen-Zhu and Yuan (2016); Li and Li (2018); Reddi et al. (2016); Zhou et al. (2018). This estimator not only overcomes the storage issue of SAGA but also possesses a variance-reduction property, i.e., the variance of the estimator decreases over epochs. Methods based on SVRG estimators have recently been developed for reinforcement learning, e.g., SVRPG (Papini et al., 2018). Xu et al. (2019a) refines the analysis of SVRPG to achieve an improved trajectory complexity. Shen et al. (2019) also adopts the SVRG estimator into policy gradient and achieves a further improved trajectory oracle complexity with the use of a second-order estimator.
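To make the variance-reduction idea concrete, the following is a minimal numpy sketch of the SVRG estimator on a toy least-squares problem; the objective, step size, and all names are illustrative and not taken from the paper.

```python
import numpy as np

# Toy finite-sum problem: f(w) = (1/n) * sum_i 0.5 * (a_i @ w - b_i)**2.
rng = np.random.default_rng(0)
A, b = rng.normal(size=(50, 3)), rng.normal(size=50)

def grad_i(w, i):
    # Gradient of the i-th component function.
    return (A[i] @ w - b[i]) * A[i]

def full_grad(w):
    return A.T @ (A @ w - b) / len(b)

def svrg_estimator(w, w_snap, g_snap, i):
    # SVRG estimator: unbiased, and its variance shrinks as w approaches
    # the snapshot point w_snap (no per-sample storage, unlike SAGA).
    return grad_i(w, i) - grad_i(w_snap, i) + g_snap

w_snap = np.zeros(3)          # snapshot point of the current epoch
g_snap = full_grad(w_snap)    # full gradient computed once at the snapshot
w = w_snap.copy()
for _ in range(200):
    i = rng.integers(len(b))
    w -= 0.01 * svrg_estimator(w, w_snap, g_snap, i)
```

In a full SVRG method, the snapshot `w_snap` and `g_snap` are refreshed at the start of every epoch, which is what produces the double-loop structure mentioned later in the paper.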
While the SGD, SAGA, and SVRG estimators are unbiased, there have been algorithms developed based on a biased gradient estimator named SARAH (Nguyen et al., 2017b). Such algorithms include SARAH (Nguyen et al., 2017a, 2019), SPIDER (Fang et al., 2018), SpiderBoost (Wang et al., 2018), and ProxSARAH (Pham et al., 2019). Similar to SVRG, all these methods can potentially be extended to reinforcement learning. A recent attempt is SARAPO (Yuan et al., 2019), which combines SARAH (Nguyen et al., 2019) with the TRPO (Schulman et al., 2015) algorithm, but no theoretical guarantee is provided. Yang and Zhang (2019) propose the Mirror Policy Optimization (MPO) algorithm, which covers the classical policy gradient and the natural policy gradient as special cases. They also introduce a variance-reduction variant, called VRMPO, with an improved trajectory complexity. Another notable work is SRVR-PG (Xu et al., 2019b), where the policy gradient estimator is an adapted version of the SARAH estimator for reinforcement learning. Note that Yang and Zhang (2019) and Xu et al. (2019b) achieve the same trajectory complexity as ours. However, our algorithm is essentially different. Xu et al. (2019b) and Yang and Zhang (2019) use two different adaptations of the SARAH estimator for policy gradient: Xu et al. (2019b) use an importance weight in their estimator to handle the distribution shift, while Yang and Zhang (2019) remove it, as in Shen et al. (2019). Meanwhile, we introduce a new policy gradient estimator which can also be calculated recursively. The new estimator is fundamentally different from the other two since it combines the adapted SARAH estimator as in Xu et al. (2019b) with the classical REINFORCE estimator. In addition, our analysis shows that the best-known convergence rate and complexity can be achieved by our single-loop algorithm (Algorithm 1), while SRVR-PG and VRMPO require double loops to achieve the same oracle complexity. Moreover, Xu et al. (2019b) and Yang and Zhang (2019) do not consider the composite setting that includes constraints or regularizers on the policy parameters as we do.
| Algorithm | Composite | Single-loop |
| --- | --- | --- |
| REINFORCE (Williams, 1992) | ✗ | ✓ |
| GPOMDP (Baxter and Bartlett, 2001) | ✗ | ✓ |
| SVRPG (Papini et al., 2018) | ✗ | ✗ |
| SVRPG (Xu et al., 2019a) | ✗ | ✗ |
| HAPG (Shen et al., 2019) | ✗ | ✗ |
| VRMPO (Yang and Zhang, 2019) | ✗ | ✗ |
| SRVR-PG (Xu et al., 2019b) | ✗ | ✗ |
Our approach lies in the stochastic variance-reduction avenue, but uses a completely new hybrid approach, leading to a novel estimator compared to existing methods in reinforcement learning. We build our estimator by taking a convex combination of the adapted SARAH estimator (Nguyen et al., 2017b) and the REINFORCE estimator (Williams, 1992), a classical unbiased policy gradient estimator. This hybrid estimator not only allows us to trade off the bias and variance between these two estimators but also possesses useful properties for developing new algorithms. Note that the idea of combining stochastic estimators was first proposed for stochastic optimization in our recent works (Tran-Dinh et al., 2019a, b). Unlike existing policy gradient methods, our algorithm first samples a large batch of trajectories to establish a good search direction. After that, it iteratively updates the policy parameters using our hybrid estimator, leading to a single-loop method without any snapshot loop as in SVRG or SARAH variants. In addition, as regularization techniques have shown their effectiveness in deep learning (Neyshabur et al., 2017; Zhang et al., 2017), they may have great potential in reinforcement learning algorithms too. A recent study (Liu et al., 2019) shows that regularization on the policy parameters can greatly improve the performance of policy gradient algorithms. Motivated by these facts, we directly consider a new composite setting (2) as presented in Section 3. For this new composite model, it is not clear whether existing algorithms remain convergent by simply adding a projection step onto the constraint set, while our method does guarantee convergence.
To this end, our contributions in this paper can be summarized as follows:
We introduce a novel hybrid stochastic policy gradient estimator by combining the existing REINFORCE estimator with the adapted SARAH estimator for policy gradient. We investigate some key properties of our estimator that can be used for algorithmic development.
We propose a new algorithm to solve a composite maximization problem for policy optimization in reinforcement learning. Our model not only covers existing settings but also handles constraints and convex regularizers on policy parameters.
We provide a convergence analysis, the first theoretical result for composite optimization in reinforcement learning, estimate the trajectory complexity of our algorithm, and show that it achieves the best-known complexity among existing first-order methods (see Table 1).
Our algorithm has only one loop, like REINFORCE and GPOMDP, which is fundamentally different from SVRPG, SVRG-adapted, and other SARAH-based algorithms for RL. It can work with a single sample or a mini-batch and has two steps: a proximal gradient step and an averaging step with different step sizes. This makes the algorithm more flexible in the use of different step sizes without sacrificing the overall complexity.
The rest of this paper is organized as follows. Section 2 describes the problem of interest and gives an overview of policy gradient methods. Section 3 introduces our new hybrid estimator for policy gradient and develops the main algorithm. The complexity analysis is presented in Section 4, while Section 5 provides several numerical examples. All technical proofs and experimental details are given in the Supplementary Document (Supp. Doc.).
2 Model and Problem Statement
We consider a Markov Decision Process (MDP) (Sutton and Barto, 2018) equipped with the standard components: the state and action spaces, the set of transition probabilities when taking certain actions, the reward function characterizing the immediate reward earned by taking a certain action, a discount factor, and the initial state distribution.
A policy is a density function over actions given the current state, parameterized by a parameter vector. A trajectory with a given effective length is a collection of states and actions sampled from a stationary policy. Denote the density induced by the policy over all possible trajectories, i.e., the probability of observing a given trajectory. Also, let the total discounted reward of a trajectory be the sum of its discounted immediate rewards. Solving an MDP is equivalent to finding a policy parameter that maximizes the expected cumulative discounted reward.
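Since the displayed formulas did not survive extraction here, the following standard expressions sketch the two quantities just described; the symbols (trajectory τ, parameter θ, horizon H, discount γ, initial distribution ρ, transition kernel P, reward r) are assumed, not reproduced from the paper.

```latex
p(\tau \mid \theta) = \rho(s_0) \prod_{t=0}^{H-1} \pi_{\theta}(a_t \mid s_t)\, \mathbb{P}(s_{t+1} \mid s_t, a_t),
\qquad
\mathcal{R}(\tau) = \sum_{t=0}^{H-1} \gamma^{t}\, r(s_t, a_t).
```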
Classical policy gradient methods:
Policy gradient methods seek a differentiable parameterized policy that maximizes the expected cumulative discounted rewards as
where is the parameter dimension. The policy gradient theorem (Sutton et al., 1999) shows that
where the policy gradient does not depend on the gradient of the state distribution despite the fact that the state distribution depends on the policy parameters (Silver et al., 2014).
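In the REINFORCE form (with the same assumed symbols as above, since the paper's display is missing), the policy gradient theorem reads:

```latex
\nabla_{\theta} J(\theta)
= \mathbb{E}_{\tau \sim p(\cdot \mid \theta)}\!\left[ \sum_{t=0}^{H-1} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, \mathcal{R}(\tau) \right],
```

which indeed involves only the gradient of the log-policy, not the gradient of the state distribution.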
This policy gradient can be used in gradient ascent algorithms to update the parameter. However, we cannot calculate the full gradient at each update since we only get a finite number of samples at each iteration. Consequently, the policy gradient is often estimated by its sample average. At each iteration, a batch of trajectories is sampled from the environment to estimate the policy gradient as
where each term is a sample estimator of the policy gradient. We call this a stochastic policy gradient (SPG) estimator. This estimator has been exploited in the two well-known REINFORCE (Williams, 1992) and GPOMDP (Baxter and Bartlett, 2001) methods. The main step of policy gradient ascent methods is to update the parameters as
where the learning rate is some appropriate step size, which can be fixed or varied over iterations. Since the policy changes after each update, the induced trajectory density also changes, creating non-stationarity in the problem, which will be handled by an importance weight in Section 3.
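The sampled score-function update can be sketched on a deliberately tiny single-step example; a two-armed bandit with a softmax policy stands in for a full trajectory here, and all values (arm means, learning rate) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.0, 1.0])    # hypothetical rewards; arm 1 is better

def softmax(z):
    z = z - z.max()                  # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

theta = np.zeros(2)                  # policy parameters (logits)
eta = 0.1                            # fixed learning rate
for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)                 # sample an action
    reward = true_means[a] + 0.1 * rng.normal()
    grad_log_pi = -probs                       # d/dtheta log softmax(theta)[a]
    grad_log_pi[a] += 1.0
    theta += eta * reward * grad_log_pi        # stochastic gradient ascent
```

After training, the policy should put most of its probability mass on the better arm; the score-function estimate `reward * grad_log_pi` is the single-sample analogue of the batch SPG estimator above.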
3 A New Hybrid Stochastic Policy Gradient Algorithm
In this section, we first introduce a composite model for policy optimization. Next, we extend the hybrid gradient idea from Tran-Dinh et al. (2019a) to policy gradient estimators. Finally, we develop a new proximal policy gradient algorithm and its restart variant to solve the composite policy optimization problem and analyze their trajectory complexity.
3.1 Composite Policy Optimization Model
While the objective function in (1) is standard in most policy gradient methods, it is natural to have some constraints or regularizers on the policy parameters. In addition, adding constraints can prevent the explosion of parameters in highly nonlinear models, as often seen in deep learning (Srivastava et al., 2014). Adopting the idea of composite nonconvex optimization (Pham et al., 2019), we are interested in the following more general optimization problem in reinforcement learning:
where the regularizer is a proper, closed, and convex function; it can be the indicator function of a convex set representing constraints on the parameters, or a standard norm penalty. If there is no regularizer, problem (2) reduces to the standard one in (1).
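Two common choices of the regularizer can be illustrated through their proximal operators; these are the standard closed-form prox formulas, shown as a sketch rather than this paper's implementation.

```python
import numpy as np

def prox_l1(theta, lam):
    # prox of lam * ||.||_1: componentwise soft-thresholding.
    return np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)

def prox_box(theta, radius):
    # prox of the indicator of the box {|theta_i| <= radius}: projection.
    return np.clip(theta, -radius, radius)
```

A proximal gradient method for the composite problem simply follows each (stochastic) gradient step with one of these prox evaluations.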
Let be the total objective function. We impose the following assumptions for our convergence analysis, which are often used in practice.
The regularizer is a proper, closed, and convex function. We also assume that the domain of is nonempty and there exists a finite upper bound
The immediate reward function is bounded, i.e., there exists such that for all , ,
Let be the policy for a given state-action pair . Then, there exist two positive constants and such that
for any , where is the -norm.
This assumption leads to useful results about the smoothness of and and the upper bound on the variance of the policy gradient estimator.
There exists a constant such that, for each pair of policies encountered in Algorithm 1, the following holds
where is the importance weight between and .
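Because transition probabilities cancel in the ratio of trajectory densities, the importance weight depends only on the policy densities of the taken actions. A sketch with hypothetical Gaussian policies (all names and the fixed standard deviation are illustrative):

```python
import numpy as np

def log_prob_traj(actions, means, sigma=1.0):
    # Sum of Gaussian log-densities of the taken actions along a trajectory.
    return float(np.sum(-0.5 * ((actions - means) / sigma) ** 2
                        - np.log(sigma * np.sqrt(2.0 * np.pi))))

def importance_weight(actions, means_new, means_old):
    # w = p(tau | theta_new) / p(tau | theta_old); the transition terms
    # cancel, so only the policy densities along the trajectory remain.
    return float(np.exp(log_prob_traj(actions, means_new)
                        - log_prob_traj(actions, means_old)))

acts = np.array([0.1, -0.3, 0.2])
```

When the two policies coincide the weight is exactly one, and it grows when the new policy assigns the observed actions higher density than the old one; the assumption above bounds the variance of this ratio.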
3.3 Optimality Condition
Associated with problem (2), we define
A point is called a stationary point of (2) if
Our goal is to design an iterative method to produce an -approximate stationary point of (2) after at most iterations defined as
where is a desired tolerance, and the expectation is taken over all the randomness up to that iteration.
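The stationarity notion above is standard in composite optimization; since the paper's displays are missing here, the following is the usual gradient-mapping definition with assumed symbols (step size η, regularizer R, objective J, output iterate θ̃), not a verbatim reproduction:

```latex
\mathcal{G}_{\eta}(\theta) := \frac{1}{\eta}\Big( \mathrm{prox}_{\eta R}\big(\theta + \eta \nabla J(\theta)\big) - \theta \Big),
\qquad
\mathbb{E}\big[ \|\mathcal{G}_{\eta}(\widetilde{\theta})\|^{2} \big] \le \varepsilon^{2},
```

where the plus sign in the prox argument reflects that (2) is a maximization problem; when R vanishes, the gradient mapping reduces to the policy gradient itself.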
3.4 Novel Hybrid SPG Estimator
Recall that given a trajectory , the REINFORCE (SPG) estimator is defined as
Note that the REINFORCE estimator is unbiased. In order to reduce the variance of these estimators, a baseline is normally added while maintaining the unbiasedness of the estimators (Sutton and Barto, 2018; Zhao et al., 2011). From now on, we will refer to the baseline-added version defined as
where with being a baseline and possibly depending only on .
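Why a baseline preserves unbiasedness while cutting variance can be seen on a one-dimensional toy: with a Gaussian policy at mean zero, the score is the sample itself, so subtracting a constant from the reward leaves the mean of the estimator unchanged but can shrink its variance dramatically. Everything below is an illustrative sketch, not the paper's estimator.

```python
import numpy as np

rng = np.random.default_rng(1)

def spg_samples(n, baseline, reward_mean=10.0):
    # Toy policy pi_theta = N(theta, 1) at theta = 0, so grad log pi(x) = x;
    # the reward is a constant plus noise, independent of the action.
    x = rng.normal(size=n)
    r = reward_mean + rng.normal(size=n)
    return (r - baseline) * x

g_plain = spg_samples(100_000, baseline=0.0)     # no baseline
g_base = spg_samples(100_000, baseline=10.0)     # baseline = mean reward
```

Both estimators have mean zero (the true gradient here), but the baselined one has variance roughly 1 instead of roughly 101 in this toy.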
Hybrid SPG estimator:
In order to reduce the number of trajectories sampled, we extend our idea in Tran-Dinh et al. (2019a) for stochastic optimization to develop a new hybrid stochastic policy gradient (HSPG) estimator that helps balance the bias-variance trade-off. The estimator is formed by taking a convex combination of two other estimators: one is an unbiased estimator, which can be the REINFORCE estimator, and the other is the adapted SARAH estimator (Nguyen et al., 2017b) for policy gradient, which is biased.
More precisely, if two random batches of trajectories are sampled from the current policy at each iteration, the hybrid stochastic policy gradient estimator at the t-th iteration can be expressed as
where the initial estimate is built from a batch of trajectories collected at the beginning. Note that an importance weight is added to account for the distribution shift, since the trajectories were sampled from the previous policy rather than the current one. Note also that the estimator in (4) is different from the momentum SARAH estimator recently proposed in Cutkosky and Orabona (2019).
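Structurally, the hybrid estimator is a convex combination of an unbiased estimate and a SARAH-style recursive correction. The sketch below shows only that recursion; the batch averages, importance weights, and the exact placement of the convex weight β follow equation (4), which is not reproduced here, so the signature is illustrative.

```python
import numpy as np

def hybrid_estimator(v_prev, beta, u_unbiased, g_new, g_old_weighted):
    # v_t = beta * u_t + (1 - beta) * (v_{t-1} + g_new - g_old_weighted):
    # beta = 1 recovers the unbiased (REINFORCE-style) estimate,
    # beta = 0 recovers the SARAH-style recursion; intermediate beta
    # trades off bias against variance.
    return beta * u_unbiased + (1.0 - beta) * (v_prev + g_new - g_old_weighted)
```

Here `g_old_weighted` stands in for the importance-weighted gradient evaluated at the previous parameters, so the difference term plays the role of the SARAH correction.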
3.5 The Complete Algorithm
Unlike SVRPG (Papini et al., 2018; Xu et al., 2019a) and HAPG (Shen et al., 2019), Algorithm 1 has only one loop, like REINFORCE and GPOMDP. Moreover, Algorithm 1 does not use an estimator of the policy Hessian as in HAPG. At the initial stage, a batch of trajectories is sampled to estimate an initial policy gradient, which provides a good initial search direction. At each iteration, two independent batches of trajectories are sampled from the current policy to evaluate the hybrid stochastic policy gradient estimator. After that, a proximal step followed by an averaging step is performed, inspired by Pham et al. (2019). Note that the batches of trajectories at each iteration are sampled from the current distribution, which changes after each update. Therefore, the importance weight is introduced to account for the non-stationarity of the sampling distribution, so that the desired property of the estimator is preserved.
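The two-step update structure (a proximal gradient step followed by an averaging step with its own step size) can be sketched on a toy problem; here an exact gradient of a concave objective J(θ) = -||θ - c||² with an ℓ1 regularizer replaces the hybrid estimator, so only the update structure reflects the algorithm, and all step sizes are illustrative.

```python
import numpy as np

def prox_l1(x, lam):
    # Soft-thresholding: prox of lam * ||.||_1.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

c = np.array([2.0, -1.0, 0.0])       # toy optimum of the unregularized part
theta = np.zeros(3)
eta, gamma, lam = 0.5, 0.9, 0.1      # prox step, averaging step, l1 weight
for _ in range(100):
    v = -2.0 * (theta - c)           # stand-in for the hybrid estimator v_t
    theta_hat = prox_l1(theta + eta * v, eta * lam)    # proximal gradient step
    theta = (1.0 - gamma) * theta + gamma * theta_hat  # averaging step
```

Decoupling the two step sizes is what gives the single-loop scheme its flexibility: the prox step can stay aggressive while the averaging step damps the stochastic noise.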
3.6 Restarting variant
While Algorithm 1 has the best-known theoretical complexity, as shown in Section 4, its practical performance may be affected by its constant step size. As will be shown later, the step size is inversely proportional to the number of iterations, and it is natural to keep it large to take advantage of newly computed information. To improve the practical performance of our algorithm without sacrificing its complexity, we inject a simple restarting strategy by repeatedly running Algorithm 1 for multiple stages, as in Algorithm 2.
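The restart wrapper itself is trivial: run the single-loop method for a fixed number of iterations per stage, then warm-start the next stage from the last iterate. Names below are illustrative, and `single_loop_stage` is a placeholder standing in for Algorithm 1 (here a plain gradient step on a toy objective).

```python
def single_loop_stage(theta, num_iters):
    # Placeholder stage: gradient steps toward the toy optimum 3.0.
    for _ in range(num_iters):
        theta = theta - 0.1 * 2.0 * (theta - 3.0)
    return theta

def restarted(theta0, num_stages, iters_per_stage):
    # Algorithm-2-style restarting: each stage resumes from the previous
    # stage's output, so step-size schedules reset but progress is kept.
    theta = theta0
    for _ in range(num_stages):
        theta = single_loop_stage(theta, iters_per_stage)
    return theta
```

Because each stage is a fresh run of the inner method, the per-stage complexity bound simply multiplies by the number of stages, which is why the restart does not degrade the overall trajectory complexity.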
4 Convergence Analysis
This section presents key properties of the hybrid stochastic policy gradient estimators as well as the theoretical convergence analysis and complexity estimate.
4.1 Properties of the hybrid SPG estimator
Let be the σ-field generated by all trajectories sampled up to the -th iteration. For the sake of simplicity, we assume that , but our analysis can easily be extended to the general case. Then the hybrid SPG estimator has the following properties.
Lemma 4.1 (Key properties).
Let be defined as in (4) and . Then
If , then is a biased estimator. In addition, we have
where is a given constant.
4.2 Complexity Estimates
The following lemma presents a key estimate for our convergence results.
Lemma 4.2 (One-iteration analysis).
The following theorem summarizes the convergence analysis of Algorithm 1.
Consequently, the trajectory complexity is presented in the following corollary.
For both Algorithm 1 and Algorithm 2, let us fix and set for some in Theorem 4.1. If we also choose in Algorithm 1 such that and choose in Algorithm 2 such that for some constant , then the number of trajectories to achieve such that for any is at most
where is chosen uniformly at random from if using Algorithm 2.
5 Numerical Experiments
In this section, we present three examples to compare the performance of HSPGA with other related policy gradient methods. We also provide an example to illustrate the effect of the regularizer on our model (2). More examples can be found in Supp. Doc. C. All experiments are run on a MacBook Pro with a 2.3 GHz quad-core CPU and 8 GB of RAM.
We implement our restarting algorithm, Algorithm 2, on top of the rllab library (Duan et al., 2016), available at https://github.com/rll/rllab. The source code is available at https://github.com/unc-optimization/ProxHSPGA. We compare our algorithm with two other methods: SVRPG (Papini et al., 2018; Xu et al., 2019a) and GPOMDP (Baxter and Bartlett, 2001). Although REINFORCE and GPOMDP have the same trajectory complexity, as observed in Papini et al. (2018), GPOMDP often performs better than REINFORCE, so we only implement GPOMDP in our experiments. Since SVRPG and GPOMDP solve the non-composite problem (1), we set the regularizer to zero in the first three examples and adjust our algorithm, denoted as HSPGA, accordingly. We compare our algorithm with the fixed epoch length variant of SVRPG as reported in Papini et al. (2018); Xu et al. (2019a). For the implementation of SVRPG and GPOMDP, we reuse the implementation of Papini et al., available at https://github.com/Dam930/rllab. We test these algorithms on three well-studied reinforcement learning tasks: Cart Pole, Acrobot, and Mountain Car, which are available in OpenAI Gym (Brockman et al., 2016), a well-known toolkit for developing and comparing reinforcement learning algorithms. We also test these algorithms on continuous control tasks using other simulators such as Roboschool (Klimov and Schulman, 2017) and MuJoCo (Todorov et al., 2012).
For each environment, we initialize the policy randomly and use it as the initial policy for all runs of all algorithms. The performance measure, i.e., the mean reward, is computed by averaging the final rewards of trajectories sampled by the current policy. We then compute the mean and 90% confidence interval across runs of these performance measures at different time points. In all plots, the solid lines represent the mean and the shaded areas are the confidence band of the mean rewards. In addition, detailed configurations of the policy network and parameters can be found in Supp. Doc. B. We note that the architecture of each neural network is denoted by its layer sizes.
Cart Pole-v0 environment:
For the Cart Pole environment, we use a deep soft-max policy network (Bridle, 1990; Levine, 2017; Sutton and Barto, 2018) with one hidden layer. Figure 1 depicts the results, where each algorithm is run repeatedly and we compute the mean and confidence intervals.
From Figure 1, we can see that HSPGA outperforms the other algorithms, while SVRPG works better than GPOMDP, as expected. HSPGA reaches the maximum reward the fastest.
Acrobot environment:
Next, we evaluate the three algorithms on the Acrobot-v1 environment. Here, we use a deep soft-max policy with one hidden layer of 16 neurons. The performance of these three algorithms is illustrated in Figure 2.
We observe results similar to the previous example, where HSPGA performs best among the three candidates. SVRPG is again better than GPOMDP in this example.
Mountain Car environment:
For the MountainCar-v0 environment, we use a deep Gaussian policy (Sutton and Barto, 2018) where the mean is the output of a neural network containing one hidden layer and the standard deviation is fixed. The results of the three algorithms are presented in Figure 3.
Figure 3 shows that HSPGA significantly outperforms the other two algorithms. Again, SVRPG remains better than GPOMDP, as expected.
The effect of regularizers:
We test the effect of the regularizer by adding a Tikhonov one as
This model was intensively studied in Liu et al. (2019).
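With a Tikhonov (squared ℓ2) regularizer, the proximal step in the composite method has a simple closed form: shrinkage of the iterate. This is the standard formula, shown as a sketch with illustrative names, not ProxHSPGA's actual code.

```python
import numpy as np

def prox_tikhonov(theta, eta, lam):
    # argmin_x { lam * ||x||^2 + (1/(2*eta)) * ||x - theta||^2 }
    #        = theta / (1 + 2 * eta * lam).
    return theta / (1.0 + 2.0 * eta * lam)
```

Because the prox is a cheap componentwise scaling, adding this regularizer costs essentially nothing per iteration.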
We also compare all non-composite algorithms with ProxHSPGA in the Roboschool Inverted Pendulum-v1 environment. In this experiment, we set the penalty parameter for ProxHSPGA. The results are depicted in Figure 4, and more information about the configuration of each algorithm is in Supp. Doc. B.
From Figure 4, among the non-composite algorithms, HSPGA performs best, followed by SVRPG and then GPOMDP. Furthermore, ProxHSPGA shows its advantage by reaching the maximum reward of 1000 faster than HSPGA.
We have presented a novel policy gradient algorithm to solve regularized reinforcement learning models. Our algorithm uses a novel policy gradient estimator, a combination of an unbiased estimator, i.e., the REINFORCE estimator, and a biased estimator adapted from the SARAH estimator for policy gradient. Theoretical results show that our algorithm achieves the best-known trajectory complexity to attain an approximate first-order solution under standard assumptions. In addition, our numerical experiments not only confirm the benefit of our algorithm compared to other closely related policy gradient methods but also verify the effectiveness of regularization in policy gradient methods.
Q. Tran-Dinh has partly been supported by the National Science Foundation (NSF), grant no. DMS-1619884 and the Office of Naval Research (ONR), grant no. N00014-20-1-2088 (2020-2023). Q. Tran-Dinh and N. H. Pham are partly supported by The Statistical and Applied Mathematical Sciences Institute (SAMSI).
- Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. In Proceedings of the 33rd International Conference on Machine Learning, Vol. 48, pp. 1080–1089.
- Katyusha: the first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, New York, NY, USA, pp. 1200–1205.
- Natasha 2: faster non-convex optimization than SGD. In Advances in Neural Information Processing Systems 31, pp. 2675–2686.
- Infinite-horizon policy-gradient estimation. J. Artif. Int. Res. 15 (1), pp. 319–350.
- Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters. In Advances in Neural Information Processing Systems 2, pp. 211–217.
- OpenAI Gym.
- Learning bounds for importance weighting. In Advances in Neural Information Processing Systems 23, pp. 442–450.
- Momentum-based variance reduction in non-convex SGD. In Advances in Neural Information Processing Systems, pp. 15210–15219.
- AlphaStar: mastering the real-time strategy game StarCraft II. https://deepmind.com/blog
- SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Vol. 1, Cambridge, MA, USA, pp. 1646–1654.
- Benchmarking deep reinforcement learning for continuous control. In Proceedings of the 33rd International Conference on Machine Learning, Vol. 48, pp. 1329–1338.
- SPIDER: near-optimal non-convex optimization via stochastic path-integrated differential estimator. In NeurIPS.
- Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, PMLR, Vol. 80, Stockholm, Sweden, pp. 1861–1870.
- Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 2094–2100.
- Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems (NIPS), pp. 315–323.
- Adam: a method for stochastic optimization. arXiv:1412.6980.
- Roboschool. https://openai.com/blog/roboschool/
- CS 294-112: deep reinforcement learning lecture notes.
- A simple proximal stochastic gradient method for nonsmooth nonconvex optimization. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, USA, pp. 5569–5579.
- Continuous control with deep reinforcement learning. In 4th International Conference on Learning Representations (ICLR 2016), San Juan, Puerto Rico.
- Regularization matters in policy optimization. arXiv:1910.09191.
- Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, Vol. 48, pp. 1928–1937.
- Playing Atari with deep reinforcement learning. arXiv:1312.5602.
- Human-level control through deep reinforcement learning. Nature 518, pp. 529–533.
- Introductory lectures on convex optimization: a basic course. 1st edition, Springer.
- Exploring generalization in deep learning. In Advances in Neural Information Processing Systems 30, pp. 5947–5956.
- Stochastic recursive gradient algorithm for nonconvex optimization. arXiv:1705.07261.
- SARAH: a novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning, pp. 2613–2621.
- Finite-sum smooth optimization with SARAH. arXiv:1901.07648.
- OpenAI Five. https://blog.openai.com/openai-five/
- Stochastic variance-reduced policy gradient. In Proceedings of the 35th International Conference on Machine Learning, Vol. 80, pp. 4026–4035.
- Proximal algorithms. Found. Trends Optim. 1 (3), pp. 127–239.
- ProxSARAH: an efficient algorithmic framework for stochastic composite nonconvex optimization. arXiv:1902.05679.
- Stochastic variance reduction for nonconvex optimization. In Proceedings of the 33rd International Conference on Machine Learning, Vol. 48, pp. 314–323.
- A stochastic approximation method. Ann. Math. Statist. 22 (3), pp. 400–407.
- Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, Vol. 37, Lille, France, pp. 1889–1897.
- Proximal policy optimization algorithms. arXiv:1707.06347.
- Hessian aided policy gradient. In Proceedings of the 36th International Conference on Machine Learning, PMLR, Vol. 97, Long Beach, CA, USA, pp. 5729–5738.
- Mastering the game of Go with deep neural networks and tree search. Nature 529, pp. 484–503.
- A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362 (6419), pp. 1140–1144.
- Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning, Vol. 32, pp. 387–395.
- Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (1), pp. 1929–1958.
- Introduction to reinforcement learning, 2nd edition. MIT Press.
- Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, pp. 1057–1063.
- Lecture 6.5, RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
- MuJoCo: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033.
- A hybrid stochastic optimization framework for stochastic composite nonconvex optimization. arXiv:1907.03793.
- Hybrid stochastic gradient descent algorithms for stochastic nonconvex optimization. arXiv:1905.05920.
- Sample efficient actor-critic with experience replay. In 5th International Conference on Learning Representations (ICLR 2017), Toulon, France.
- SpiderBoost: a class of faster variance-reduced algorithms for nonconvex optimization. arXiv:1810.10690.
- Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, Vol. 48, pp. 1995–2003.
- Q-learning. Machine Learning 8 (3), pp. 279–292.
- Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3), pp. 229–256.
- Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17), USA, pp. 5285–5294.
- An improved convergence analysis of stochastic variance-reduced policy gradient. In Conference on Uncertainty in Artificial Intelligence.
- Sample efficient policy gradient methods with recursive variance reduction. arXiv:1909.08610.
- Policy optimization with stochastic mirror descent. arXiv:1906.10462.
- Policy optimization via stochastic recursive gradient algorithm.
- Understanding deep learning requires rethinking generalization.
- Analysis and improvement of policy gradient estimation. In Advances in Neural Information Processing Systems 24, pp. 262–270.
- Stochastic nested variance reduction for nonconvex optimization. arXiv:1806.07811.
Appendix A Convergence Analysis
We note that the original idea of using hybrid estimators was proposed in our working paper (Tran-Dinh et al., 2019a). In this work, we have extended this idea, as well as the proof techniques for stochastic optimization in Tran-Dinh et al. (2019a), to reinforcement learning settings. We now provide the full analysis of Algorithms 1 and 2. We first prove a key property of our new hybrid estimator for the policy gradient. Then, we provide the proofs of Theorem 4.1 and Corollary 4.1.
A.1 Proof of Lemma 4.1: Bound on the Variance of the Hybrid SPG Estimator
which is the same as (5).
To prove (6), we first define and . We have
Taking the total expectation and noting that and , we get
where the first inequality comes from the triangle inequality, and the second follows by dropping the non-negative terms.
Additionally, Lemma 6.1 in Xu et al. (2019a) shows that
Using (11) we have
From the convexity of , we have
where is a subgradient of at .
By the optimality condition of , we can show that for some where is the subdifferential of Q at . Plugging this into (14), we get
Using the fact that
and ignoring the non-negative term , we can rewrite (16) as
Taking the total expectation over the entire history , we obtain
From the definition of the gradient mapping (3), we have
Applying the triangle inequality, we can derive
Taking the full expectation over the entire history