1 Introduction
Reinforcement Learning (RL) (Sutton et al., 1998) is a dynamic learning approach in which an agent interacts with the environment and executes actions according to the current state, so that a particular measure of cumulative reward is maximized. Model-free deep reinforcement learning algorithms (LeCun et al., 2015) have achieved remarkable performance in a range of challenging tasks, including stochastic control (Munos and Bourgine, 1998), autonomous driving (Shalev-Shwartz et al., 2016), games (Mnih et al., 2013; Silver et al., 2016), continuous robot control (Schulman et al., 2015), etc.
Generally, there are two families of methods for solving a model-free RL problem: value-based methods such as Q-Learning (Tesauro, 1995) and SARSA (Rummery and Niranjan, 1994), and policy-based methods such as the Policy Gradient (PG) algorithm (Sutton et al., 1999). The PG algorithm models the state-to-action transition probabilities as a parameterized family, so that the cumulative reward can be regarded as a function of the parameters. Thus, the policy-gradient-based problem shares a formulation analogous to the traditional stochastic optimization problem.
One critical challenge of reinforcement learning compared to traditional gradient-based algorithms lies in the issue of distribution shift: the data sample distribution encounters distributional changes throughout the learning dynamics (Papini et al., 2018). To correct this, (an off-policy version of) the Policy Gradient (PG) method (Sutton et al., 1999) and the Trust Region Policy Optimization (TRPO) method (Schulman et al., 2015) have been proposed as general off-policy algorithms that optimize policy parameters using gradient-based methods. (In the reinforcement learning literature, on-policy algorithms use the samples rolled out by the current policy only once and hence suffer from high sample complexity. In contrast, off-policy algorithms in earlier work enjoy reduced sample complexity since they reuse past trajectory samples (Mnih et al., 2015; Lillicrap et al., 2015).) Nevertheless, these methods are often brittle and sensitive to hyperparameters and hence suffer from reproducibility issues (Henderson et al., 2018). The PG method directly optimizes the policy parameters via gradient-based algorithms, and it dates back to the introduction of the REINFORCE (Williams, 1992) and GPOMDP (Baxter and Bartlett, 2001) estimators that our algorithm is built upon. The problem of high sample complexity arises frequently in policy-gradient-based methods due to a combined effect of high variance incurred during the training phase (Henderson et al., 2018; Duan et al., 2016) and distribution shift, limiting the applicability of model-free deep reinforcement learning algorithms. Such a combined effect signals the potential need for variance-reduced gradient estimators (Johnson and Zhang, 2013; Nguyen et al., 2017; Zhou et al., 2018; Fang et al., 2018) to accelerate off-policy algorithms. Recently proposed variance-reduced policy gradient methods, including SVRPG (Papini et al., 2018; Xu et al., 2019a) and SRVR-PG (Xu et al., 2019b), theoretically improve the sample efficiency over PG. This is corroborated by empirical findings: the variance-reduced alternatives SVRPG and SRVR-PG accelerate and stabilize the training process, mainly because they accommodate larger step sizes with reduced variances (Papini et al., 2018; Xu et al., 2017).
Nevertheless, compared to the vanilla PG method, one major drawback of the aforementioned variance-reduced policy gradient methods is their alternation between large and small batches of trajectory samples, referred to as the restarting mechanism, through which the variance is controlled. In this paper, we circumvent such a restarting mechanism by introducing a new algorithm named STOchastic Recursive Momentum Policy Gradient (STORM-PG), which utilizes the idea of a recently proposed variance-reduced gradient method, STORM (Cutkosky and Orabona, 2019), and blends it with policy gradient methods. STORM is an online variance-reduced gradient method that adopts an exponential moving average mechanism to persistently discount the accumulated variance. In the non-convex smooth stochastic optimization setting, STORM achieves an O(ε^{-3}) query complexity that ties with online SARAH/SPIDER and matches the lower bound for finding an ε-approximate first-order stationary point (Arjevani et al., 2019). As a closely related variant, SARAH/SPIDER-based stochastic variance-reduced compositional gradient methods also achieve an O(ε^{-3}) complexity under a different set of assumptions (Hu et al., 2019; Zhang and Xiao, 2019). Our proposed STORM-PG algorithm blends such a state-of-the-art variance-reduced gradient estimator with the PG algorithm. Instead of the restarting mechanism of concurrent variance-reduced policy gradient methods, STORM-PG guarantees variance stability by adopting the exponential moving average mechanism featured in STORM. In our experiments, we observe that this variance stability allows STORM-PG to achieve a (perhaps surprising) improvement in overall mean rewards on reinforcement learning tasks.
Our Contributions
We have designed a novel policy gradient method that enjoys several benign properties, such as an exponential moving average mechanism, instead of a restarting mechanism, to reduce the variance of our gradient estimator. Theoretically, we prove a state-of-the-art convergence rate for our proposed STORM-PG algorithm in our setting. Experimentally, our STORM-PG algorithm delivers strikingly strong performance on many reinforcement learning tasks.
Notational Conventions
Throughout the paper, a fixed set of problem parameters is treated as global constants. Let h denote the index of steps that the agent takes to interact with the environment, and let H be the maximum length of an episode. Let ‖·‖ denote the Euclidean norm of a vector or the operator norm of a matrix induced by the Euclidean norm. For fixed t, let B_t denote the batch of samples chosen at the t-th iteration; F_t is the σ-algebra generated by the samples generated up to the t-th iteration, and E[· | F_t] is the corresponding conditional expectation. Other notations are explained at their first appearance.

Organization
The rest of our paper is organized as follows. Section 2 introduces the background and preliminaries of the policy gradient algorithm. Section 3 formally introduces our STORM-PG algorithm design. Section 4 introduces the necessary definitions and assumptions. Section 5 presents the convergence rate analysis, whose corresponding proof is provided in Section 6. Section 7 presents experimental comparisons on continuous control tasks, and Section 8 concludes our results.
Algorithms  Complexity  Restarting
PGT (Sutton et al., 2000)  O(ε^{-4})  N
REINFORCE (Williams, 1992)  O(ε^{-4})  N
GPOMDP (Baxter and Bartlett, 2001)  O(ε^{-4})  N
SVRPO (Xu et al., 2017)  N/A  Y
SVRPG (Xu et al., 2019a)  O(ε^{-10/3})  Y
SRVR-PG (Xu et al., 2019b)  O(ε^{-3})  Y
STORM-PG (This paper)  O(ε^{-3})  N
2 Policy Gradient Preliminaries
In this section we introduce the background of policy gradient and the objective function that our algorithm is based on. The basic operation of the PG algorithm is similar to the gradient ascent algorithm, with some RL-specific gradient estimators. In Section 2.1 we introduce the REINFORCE estimator, which is the basis of many follow-up PG works. In Section 2.2 we introduce the GPOMDP estimator, which further reduces the variance and is the foundation of our algorithm. Finally, in Section 2.3 we formulate the probability induced by the policy as a Gaussian distribution, a special case adopted in our experiments.
2.1 REINFORCE Estimator
We consider the standard reinforcement learning setting of solving a discrete-time finite-horizon Markov Decision Process (MDP), which models the behavior of an agent interacting with a given environment. Let S be the space of states in the environment, A be the space of actions that the agent can take, P(s' | s, a) be the transition probability from s to s' given action a, r(s, a) be the reward of taking action a at state s, γ be the discount factor that assigns smaller weights to rewards in the more distant future, and ρ be the initial state distribution. In this paper we mainly focus on the policy gradient setting, where there is a policy π(a | s) giving the probability of taking action a at state s; the policy models the agent's behavior upon experiencing the environment's state s. Given finite state and action spaces, the policy can be stored in a table. However, when the state/action space is large or countably infinite, we adopt a probability mass function class {π_θ : θ ∈ R^d}, parameterized by θ, as an approximating class of functions for such a table. Given a policy π_θ, the probability of a trajectory τ can be expressed in terms of the transition probability and the policy:
p(τ | θ) = ρ(s_0) ∏_{h=0}^{H-1} π_θ(a_h | s_h) P(s_{h+1} | s_h, a_h),   (1)

where the trajectory τ = (s_0, a_0, s_1, a_1, …, s_{H-1}, a_{H-1}) is the sequence that alternates between states and actions, and H is the maximum length of an episode.
Policy gradient algorithms aim to maximize the expected sum of discounted rewards over trajectories τ:

J(θ) = E_{τ∼p(·|θ)} [ ∑_{h=0}^{H-1} γ^h r(s_h, a_h) ],   (2)

where the expectation is taken over the parameterized trajectory distribution p(· | θ) with parameter θ, as defined in (1). The standard algorithm for maximizing (2) is gradient ascent, which updates in the direction of the objective gradient with a fixed learning rate η:

θ_{t+1} = θ_t + η ∇J(θ_t),

where the gradient can be calculated as follows by combining (1) and (2) (the unknown transition probabilities drop out of ∇_θ log p(τ | θ)):

∇J(θ) = E_{τ∼p(·|θ)} [ ( ∑_{h=0}^{H-1} ∇_θ log π_θ(a_h | s_h) ) ( ∑_{h=0}^{H-1} γ^h r(s_h, a_h) ) ].   (3)
To avoid the costly (or, in the case of infinite spaces, infeasible) full gradient computation, which requires sampling all possible trajectories, we adopt its Monte Carlo estimator:

ĝ(θ) = (1/N) ∑_{i=1}^{N} ( ∑_{h=0}^{H-1} ∇_θ log π_θ(a_h^i | s_h^i) ) ( ∑_{h=0}^{H-1} γ^h r(s_h^i, a_h^i) ),   (4)

where the trajectories τ_1, …, τ_N are generated according to the trajectory distribution p(· | θ). The above estimator is known in policy gradient as the REINFORCE estimator (Williams, 1992).
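The REINFORCE estimator above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's implementation; the function name and input conventions (per-trajectory score arrays and reward arrays) are our own assumptions.

```python
import numpy as np

def reinforce_gradient(score_fns, rewards, gamma):
    """Monte Carlo REINFORCE estimator, Eq. (4) -- a hypothetical sketch.

    score_fns: list of N arrays, each (H, d) -- per-step score vectors
               grad_theta log pi_theta(a_h | s_h) for one trajectory.
    rewards:   list of N arrays, each (H,)   -- per-step rewards.
    gamma:     discount factor.
    Returns the d-dimensional gradient estimate averaged over trajectories.
    """
    grads = []
    for scores, r in zip(score_fns, rewards):
        H = len(r)
        # Discounted return of the whole episode.
        discounted_return = np.sum(gamma ** np.arange(H) * r)
        # Sum of score vectors over the episode, weighted by the full return.
        grads.append(scores.sum(axis=0) * discounted_return)
    return np.mean(grads, axis=0)
```

Note that every per-step score is multiplied by the full-episode return, which is the source of the excessive variance addressed by GPOMDP in the next subsection.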
2.2 GPOMDP Estimator
One disadvantage of the REINFORCE estimator lies in the excessive variance contributed by trajectory terms through the end of the episode. Using the simple fact that E_{a∼π_θ(·|s)}[∇_θ log π_θ(a | s)] = 0, so that subtracting any constant from the reward leaves the estimator unbiased, together with the observation that rewards obtained before step h are unaffected by actions taken after step h, the REINFORCE estimator (4) can be replaced by the following unbiased GPOMDP estimator (Baxter and Bartlett, 2001), which uses per-step constant baselines b_h to reduce the variance. Throughout this paper, we use g(τ | θ) to denote the unbiased GPOMDP estimator of ∇J(θ):

g(τ | θ) = ∑_{h=0}^{H-1} ( ∑_{t=0}^{h} ∇_θ log π_θ(a_t | s_t) ) ( γ^h r(s_h, a_h) − b_h ),   (5)

where (a_t, s_t) are the action-state pairs along the trajectory τ. We adopt a variance-reduced version of the GPOMDP estimator throughout the rest of this paper.
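Following the same (hypothetical) input conventions as the REINFORCE sketch, the GPOMDP estimator can be sketched as follows; the key difference is that the reward at step h is paired only with the scores of steps t ≤ h.

```python
import numpy as np

def gpomdp_gradient(score_fns, rewards, gamma, baselines=None):
    """GPOMDP estimator, Eq. (5) -- an illustrative sketch.

    score_fns: list of N arrays, each (H, d); rewards: list of (H,) arrays;
    baselines: optional (H,) array of per-step constants b_h.
    """
    grads = []
    for scores, r in zip(score_fns, rewards):
        H, d = scores.shape
        b = np.zeros(H) if baselines is None else baselines
        cum_scores = np.cumsum(scores, axis=0)      # row h holds sum_{t<=h}
        weights = gamma ** np.arange(H) * (r - b)   # per-step reward terms
        grads.append((cum_scores * weights[:, None]).sum(axis=0))
    return np.mean(grads, axis=0)
```

Compared to REINFORCE, the future-score terms multiplying early rewards are dropped, which removes variance without introducing bias.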
2.3 Gaussian Policy
Finally, we introduce the Gaussian policy setting. In control tasks, where the state and action spaces can be continuous, one choice of the policy function class is the Gaussian family:

π_θ(a | s) = (1 / √(2πσ²)) exp( −(a − θ^⊤ φ(s))² / (2σ²) ),

where σ² is a fixed variance parameter and φ is a bounded feature map from the state space to R^d. As the reader will see, the Gaussian policy satisfies all assumptions in Section 4; more detailed discussions can be found in Xu et al. (2019a), Xu et al. (2019b) and Papini et al. (2018).
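For the Gaussian policy, the score function has a simple closed form, ∇_θ log π_θ(a | s) = (a − θ^⊤ φ(s)) φ(s) / σ², which plugs directly into the REINFORCE/GPOMDP estimators above. A minimal sketch (the feature map φ is assumed to be evaluated by the caller):

```python
import numpy as np

def gaussian_policy_score(theta, phi_s, action, sigma):
    """Score function grad_theta log pi_theta(a | s) for the Gaussian policy
    pi_theta(a | s) = N(theta^T phi(s), sigma^2).  Sketch only; phi_s is the
    feature vector phi(s) already evaluated at the state."""
    mean = theta @ phi_s
    # Derivative of -(a - mean)^2 / (2 sigma^2) with respect to theta.
    return (action - mean) * phi_s / sigma ** 2
```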
3 STORM-PG Algorithm
Recall that our goal is to solve the general policy optimization problem

max_θ J(θ),   (6)

where g(τ | θ) defined in (5) is an unbiased estimator of the true gradient ∇J(θ). The simplest algorithm, stochastic gradient ascent, updates the iterates as

θ_{t+1} = θ_t + η g(τ_t | θ_t),

where τ_t is chosen randomly from trajectories sampled from the current distribution p(· | θ_t).
To remedy the distribution shift issue in reinforcement learning tasks, we introduce an importance sampling weight between the trajectory distribution induced by one parameter θ' and trajectories generated by another parameter θ:

ω(τ | θ', θ) = p(τ | θ') / p(τ | θ) = ∏_{h=0}^{H-1} π_{θ'}(a_h | s_h) / π_θ(a_h | s_h),

in which the unknown transition probabilities cancel,
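Because the transition probabilities cancel, the importance weight can be computed from per-step policy log-probabilities alone; summing in log space before exponentiating avoids numerical underflow over long horizons. A small illustrative helper (names are ours):

```python
import numpy as np

def importance_weight(logp_new, logp_old):
    """Importance sampling weight omega between two policies: the ratio of
    trajectory probabilities. Transition probabilities cancel, so only the
    per-step policy log-probs (arrays of shape (H,)) are needed."""
    return np.exp(np.sum(logp_new) - np.sum(logp_old))
```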
where the weight for a trajectory truncated at time h multiplies only the first h factors. To further reduce the variance introduced by the randomness in τ, SVRG introduces a variance-reduced estimator of ∇J(θ_t):

v_t = g̃ + g(τ_t | θ_t) − ω(τ_t | θ̃, θ_t) g(τ_t | θ̃),   (7)

where θ̃ is a reference point refreshed once every m steps and g̃ is a fixed (large-batch) estimate of the gradient at θ̃. Instead of the aforementioned SVRG-type estimator, which was adopted by Papini et al. (2018), Xu et al. (2019b) adopts a recursive estimator

v_t = v_{t-1} + g(τ_t | θ_t) − ω(τ_t | θ_{t-1}, θ_t) g(τ_t | θ_{t-1})   (8)

to track the gradient at each iteration. In the above, v_0 is refreshed once every m iterations as a large-batch estimated gradient.
3.1 STORM-PG Estimator
In this paper, we propose to use the STORM estimator introduced in Cutkosky and Orabona (2019), which is essentially an exponential moving average of SARAH estimators:

v_t = a · g(τ_t | θ_t) + (1 − a) ( v_{t-1} + g(τ_t | θ_t) − ω(τ_t | θ_{t-1}, θ_t) g(τ_t | θ_{t-1}) ).   (9)

When a = 1, the STORM-PG estimator reduces to the vanilla stochastic gradient estimator, and when a = 0, it reduces to the SARAH estimator. As our a is chosen in (0, 1), the estimator is a convex combination of a variance-reduced biased estimator and an unbiased estimator. In addition, (9) can be unrolled as an exponentially decaying average of past correction terms with decay factor (1 − a). We will see in the proof of the convergence rate that the estimation error can be controlled by a proper choice of a, whereas in the SARAH case the batch size or the learning rate has to be tuned accordingly to control the convergence speed. This allows us to run a single-loop algorithm instead of a double-loop one: we only need one large batch to estimate the gradient once, and then perform mini-batch or single-sample updates until the end of the algorithm. This estimator hinders the accumulation of estimation error across rounds.
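A single STORM-PG estimator update, written as a convex combination of the vanilla stochastic gradient and the SARAH correction, can be sketched as follows (a hypothetical helper; the gradients are assumed to be computed elsewhere, with the importance weight already applied to the old-iterate gradient):

```python
import numpy as np

def storm_pg_update(v_prev, grad_new, grad_old_corrected, a):
    """One STORM-PG estimator update, Eq. (9) -- a sketch.

    grad_new:            g(tau_t | theta_t), minibatch gradient at the
                         current iterate.
    grad_old_corrected:  omega * g(tau_t | theta_{t-1}), importance-weighted
                         gradient of the same trajectories at the old iterate.
    a in [0, 1]:         momentum factor; a=1 recovers vanilla stochastic
                         gradient, a=0 recovers SARAH.
    """
    return a * grad_new + (1.0 - a) * (v_prev + grad_new - grad_old_corrected)
```

The two limiting cases are easy to check: a = 1 returns grad_new exactly, while a = 0 returns the SARAH recursion v_prev + grad_new − grad_old_corrected.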
We describe our STORM-PG method in Algorithm 1.
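The single-loop structure of Algorithm 1 can be sketched end to end: one large batch initializes the estimator, and every subsequent step uses only a small batch. All names and signatures below are illustrative, not the paper's exact pseudocode; `sample_trajs` and `grad_est` stand for the environment rollout and the (importance-weighted) GPOMDP gradient.

```python
def storm_pg(sample_trajs, grad_est, theta0, eta, a, N_init, B, T):
    """Single-loop STORM-PG sketch (illustrative names and signatures).

    sample_trajs(theta, n): roll out n trajectories under pi_theta.
    grad_est(trajs, theta): minibatch gradient estimate; importance weights
                            are assumed to be handled inside when theta
                            differs from the sampling policy.
    """
    theta = theta0
    trajs = sample_trajs(theta, N_init)            # one large batch, once
    v = grad_est(trajs, theta)
    for _ in range(T):
        theta_old, theta = theta, theta + eta * v  # gradient *ascent* step
        trajs = sample_trajs(theta, B)             # small batch each step
        g_new = grad_est(trajs, theta)
        g_old = grad_est(trajs, theta_old)         # importance-weighted
        v = a * g_new + (1 - a) * (v + g_new - g_old)  # Eq. (9)
    return theta
```

With exact gradients the estimator tracks the true gradient exactly, so the loop reduces to plain gradient ascent; the interest of the scheme is that with noisy minibatch gradients the moving average keeps the tracking error bounded without any restart.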
4 Definitions and Assumptions
In this section, we give several definitions and assumptions necessary for analyzing the convergence of the STORM-PG algorithm. First of all, we define an accurate solution of a policy gradient algorithm:

Definition 1 (ε-accurate solution).

We call θ an ε-accurate solution if and only if ‖∇J(θ)‖ ≤ ε.

We say that a stochastic policy-gradient-based algorithm reaches an ε-accurate solution if and only if E‖∇J(θ_T)‖ ≤ ε,

where θ_T is the output after the algorithm's iteration number T, and the expectation is taken over the randomness of the samples drawn at each iteration.
To bound the norm of the gradient estimator g(τ | θ), we need assumptions on the rewards and on the gradient of the log-policy, as follows:

Assumption 2 (Boundedness).

We assume that the reward and the gradient of log π_θ are bounded: there exist constants R > 0 and G > 0 such that

|r(s, a)| ≤ R   and   ‖∇_θ log π_θ(a | s)‖ ≤ G   (12)

for any s, a and θ.
Assumption 3 (Smoothness).

There exists a constant M > 0 such that for any θ and any (s, a):

‖∇²_θ log π_θ(a | s)‖ ≤ M.   (13)

Assumption 4 (Finite variance).

There exists a σ² < ∞ such that:

E_{τ∼p(·|θ)} ‖g(τ | θ) − ∇J(θ)‖² ≤ σ².   (14)

Assumption 5 (Finite IS variance).

For parameters θ_1 and θ_2, use ω(τ | θ_1, θ_2) to denote the importance sampling weight p(τ | θ_1) / p(τ | θ_2). Then there exists a constant W < ∞ such that:

Var( ω(τ | θ_1, θ_2) ) ≤ W,   (15)

where the variance is taken over τ ∼ p(· | θ_2).
5 Convergence Analysis
In this section, we introduce the lemmas necessary for proving the convergence results of our STORM-PG algorithm and finally state our main theorem of convergence. Recall that our goal is to reach an ε-accurate solution of the function J, whose gradient can be estimated without bias by g(τ | θ). First of all, given Assumptions 2 and 3, we can derive the boundedness and Lipschitzness of g(τ | θ) and the smoothness of J, which are standard conditions for proving convergence of non-convex stochastic optimization problems. From the definition in Equation (5), g(τ | θ) can be written as a linear combination of the score vectors ∇_θ log π_θ(a_h | s_h):

(16)

Similarly, ∇g(τ | θ) can be written as a linear combination of the Hessians ∇²_θ log π_θ(a_h | s_h). Using the fact that γ ≤ 1 and the bounds derived in Assumptions 2 and 3, it is straightforward to see that

(17)

Equation (17) implies that g(τ | θ) is bounded and Lipschitz, and hence that J is smooth. With the boundedness and smoothness results, we further estimate the accumulated estimation error of v_t against ∇J(θ_t). In Lemma 6 below we state the variance bound on the importance sampling weight:
Lemma 6 (Lemma A.1 in Xu et al. (2019b)).

The proof of Lemma 6 can be found in Xu et al. (2019b). Combining Lemma 6 and Equation (16), we obtain the following bound on the difference between two consecutive gradient estimates:

Lemma 7.

where the constant depends on the problem parameters.

Lemma 7 shows that the expected squared error between the current gradient estimate and the importance-weighted previous one is bounded by the squared distance between θ_t and θ_{t-1} times a constant that depends on the problem parameters but is independent of t. The specific choice of the constant and the proof of Lemma 7 can be found in Appendix A.3.
To estimate the estimation error of v_t against ∇J(θ_t), we recursively relate it to the estimation error of the previous iteration by substituting the recursive definition of v_t in Equation (9). The result is shown in Lemma 8 below:

Lemma 8.

The above lemma shows that the estimation error between v_t and ∇J(θ_t) can be bounded by (1 − a)² times the estimation error of the previous iteration, plus a term proportional to the squared distance between consecutive iterates, plus a variance-controlling term.
Lemma 9.
Remark 10.

Hence, to control the growth of the function value, the step size has to be chosen small relative to the length of the recursion. As the number of iterations grows without bound, the step size would have to be chosen infinitely small. The SARAH/SPIDER algorithm uses a restart mechanism to remedy this problem. In our STORM-PG algorithm, however, the exponential moving average brings in a shrinkage factor on the accumulation speed of the estimation error, relaxing the required order of the step size.

For the momentum parameter, we only need to control a so that the error bound is no longer coupled with the iteration count. This allows us to train continuously without restarting the iterations.
Next we come to our main theorem in this paper, which concludes that after T iterations, the expected gradient norm satisfies the bound described below:

Theorem 11.

In the theorem, the initialization term can be controlled by choosing the initial batch size proportionally to the target accuracy, and the remaining terms are controlled by the choices of the step size and the momentum parameter. With these choices, after sufficiently many iterations the algorithm reaches a point with expected gradient norm of order ε, matching the guarantees compared with Papini et al. (2018) and Xu et al. (2019b). However, the per-iteration sample requirement in Xu et al. (2019b) is larger than in our algorithm, which makes our algorithm converge faster.
The detailed analysis of the convergence rate is given in the next section. Corollary 12 is a direct consequence of Theorem 11. By controlling the estimated gradient to lie in a neighborhood of 0 and minimizing the total number of samples, we obtain the IFO complexity bound of the STORM-PG algorithm:
Corollary 12.
6 Proof of Main Results
In this section, we prove the main results in this paper. More auxiliary proofs are located in the supplementary section.
6.1 Proof of Theorem 11
6.2 Proof of Corollary 12
Proof of Corollary 12.
To choose parameters with the correct dependency on ε, by Equation (23), one requires:

(28)

and we recall that previously we derived a lower bound on the step size; so finally we choose

(29)

Substituting Equation (29) into Equation (28), we obtain two lower bounds that must hold to reach an ε-accurate solution:

and also

Our goal is to minimize the IFO complexity

which is approximately equivalent to minimizing the sum of these two terms. The best choice of the remaining parameter is then immediate, so the IFO complexity of reaching an ε-accurate solution is
IFO  
∎
7 Experiments
In this section, we design a set of experiments to validate the superiority of our STORM-PG algorithm. Our implementation is based on the rllab library (https://github.com/Dam930/rllab) and the initial implementation of Papini et al. (2018) (https://github.com/rll/rllab). We test the performance of our algorithm as well as the baseline algorithms on the CartPole environment (https://github.com/openai/gym/wiki/CartPole-v0) and the MountainCar environment.
For baseline algorithms, we choose GPOMDP (Baxter and Bartlett, 2001) and two variance-reduced policy gradient algorithms, SVRPG (Papini et al., 2018) and SRVR-PG (Xu et al., 2019b). The results and detailed experimental design are described as follows:
7.1 Comparison of Different Algorithms
In SRVR-PG (Xu et al., 2019b) and SVRPG (Papini et al., 2018), adjustable parameters include the large batch size, the mini-batch size, the inner iteration number and the learning rate. In the STORM-PG algorithm, we have to tune the large batch size, the momentum factor and the learning rate. Notice that we do not tune the mini-batch size in STORM-PG; we fix it to the same value as the best-tuned one for SVRPG, as suggested by the theory.
We adaptively choose the learning rate using the Adam optimizer together with learning rate decay, with the initial learning rate and decay discount tuned per task. The environment-related parameters, the discount factor and the horizon, vary according to tasks. We list the specific choices, together with the initial batch size and the inner batch size, in the supplementary materials.
We use a Gaussian policy with a neural network with one hidden layer of size 64. For each algorithm in each environment, we choose the ten best independent runs to collect the rewards and plot the confidence interval together with the average reward at each iteration of the training process.
CartPole environment
The CartPole environment describes the interaction of a pendulum pole attached to a cart. By pushing the cart leftward or rightward, a reward of +1 is obtained by keeping the pole upright and the episode ends when the cart or the pole is too far away from a given center.
Under this environment setting, Figure 1 shows the growth of the average return against the number of training trajectories.

From Figure 1, we see that our STORM-PG algorithm outperforms the other variance-reduced policy gradient methods in convergence speed. It reaches the maximum value at approximately 500 trajectories, while SRVR-PG and SVRPG reach the maximum value at approximately 1500 trajectories. GPOMDP converges at about 3000 trajectories.
MountainCar Environment
We use the Mountain Car environment provided in rllab. The task is to push a car to a certain position on a hill. The agent takes continuous actions to move leftward or rightward and receives a reward according to its current position and height; every step incurs a penalty of 1, and the episode ends when a target position is reached.
Figure 2 shows the growth of the average return against the number of training trajectories. The GPOMDP algorithm does not converge well in the MountainCar environment; for illustrative purposes we only present the plot of the STORM-PG algorithm and the two variance-reduced baselines.

From Figure 2, we see that the STORM-PG algorithm outperforms the other baselines within the first 200 trajectories and reaches a stable zone within 600 trajectories, while the other algorithms take at least 1000 trajectories to reach a comparable result. Figures 1 and 2 verify our theory that STORM-PG brings a significant improvement to policy gradient training.
Specifically, as mentioned at the beginning of Section 7, previous variance-reduced policy gradient methods require careful tuning of the inner-loop iteration number. SVRPG (Papini et al., 2018) uses an adaptive number of inner iterations, while SRVR-PG (Xu et al., 2019b), after tuning, fixes a very small number of inner-loop iterations.

By contrast, we do not tune the mini-batch size: in practice, we fix both the initial batch size and the mini-batch size. The high stability with respect to hyperparameters saves a great deal of effort during training, and this tolerance to parameter choices allows us to design a user-friendly yet efficient policy gradient algorithm.
8 Final Remarks
In this paper, we propose a new STORM-PG algorithm that adopts a recently proposed variance-reduced gradient method called STORM. STORM-PG enjoys advantages both theoretically and experimentally. In the final experimental results, our STORM-PG algorithm is significantly better than all other baseline methods, both in training stability and in parameter tuning (the user time for tuning STORM-PG is much shorter). The superiority of STORM-PG over SVRPG in our experiments breaks the curse that stochastic recursive gradient methods, namely SARAH, often fail to outperform SVRG in practice even though they have better theoretical convergence rates. Future work includes proving lower bounds for our algorithm and further improving the experimental performance on other statistical learning tasks. We hope this work can inspire both the reinforcement learning and optimization communities in future explorations.
References
Arjevani et al. (2019). Lower bounds for non-convex stochastic optimization. arXiv preprint arXiv:1912.02365.

Baxter and Bartlett (2001). Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research 15, pp. 319–350.

Cutkosky and Orabona (2019). Momentum-based variance reduction in non-convex SGD. In Advances in Neural Information Processing Systems, pp. 15210–15219.

Duan et al. (2016). Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329–1338.

Fang et al. (2018). SPIDER: near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pp. 686–696.

Henderson et al. (2018). Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence.

Hu et al. (2019). Efficient smooth non-convex stochastic compositional optimization via stochastic recursive gradient descent. In Advances in Neural Information Processing Systems, pp. 6926–6935.

Johnson and Zhang (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp. 315–323.

LeCun et al. (2015). Deep learning. Nature 521(7553), pp. 436–444.

Lillicrap et al. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.

Mnih et al. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

Mnih et al. (2015). Human-level control through deep reinforcement learning. Nature 518(7540), pp. 529–533.

Munos and Bourgine (1998). Reinforcement learning for continuous stochastic control problems. In Advances in Neural Information Processing Systems, pp. 1029–1035.

Nguyen et al. (2017). SARAH: a novel method for machine learning problems using stochastic recursive gradient. In International Conference on Machine Learning, pp. 2613–2621.

Papini et al. (2018). Stochastic variance-reduced policy gradient. In Proceedings of the 35th International Conference on Machine Learning, PMLR 80, pp. 4026–4035.

Rummery and Niranjan (1994). On-line Q-learning using connectionist systems. Technical report, University of Cambridge, Department of Engineering.

Schulman et al. (2015). Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897.

Shalev-Shwartz et al. (2016). Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295.

Silver et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature 529(7587), pp. 484–489.

Sutton et al. (1998). Reinforcement learning: an introduction. MIT Press.

Sutton et al. (1999). Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems (NIPS'99), pp. 1057–1063.

Sutton et al. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063.

Tesauro (1995). Temporal difference learning and TD-Gammon. Communications of the ACM 38(3), pp. 58–68.

Williams (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3–4), pp. 229–256.

Xu et al. (2017). Stochastic variance reduction for policy gradient estimation. CoRR abs/1710.06034.

Xu et al. (2019a). An improved convergence analysis of stochastic variance-reduced policy gradient. arXiv preprint arXiv:1905.12615.

Xu et al. (2019b). Sample efficient policy gradient methods with recursive variance reduction. arXiv preprint arXiv:1909.08610.

Zhang and Xiao (2019). Multi-level composite stochastic optimization via nested variance reduction. arXiv preprint arXiv:1908.11468.

Zhou et al. (2018). Stochastic nested variance reduction for non-convex optimization. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 3925–3936.