1 Introduction
On a very general level, artificial intelligence addresses the problem of an agent that must select the right actions to solve a task. The approach of Reinforcement Learning (RL) (Sutton & Barto, 1998) is to learn the best actions by direct interaction with the environment and evaluation of the performance in the form of a reward signal. This makes RL fundamentally different from Supervised Learning (SL), where correct actions are explicitly prescribed by a human teacher (e.g., for classification, in the form of class labels). However, the two approaches share many challenges and tools. The problem of estimating a model from samples, which is at the core of SL, is equally fundamental in RL, whether we choose to model the environment, a value function, or directly a policy defining the agent's behaviour. Furthermore, when the tasks are characterized by large or continuous state-action spaces, RL needs the powerful function approximators (e.g., neural networks) that are the main subject of study of SL.
In a typical SL setting, a performance function $J(\theta)$ has to be optimized w.r.t. model parameters $\theta$. The set of data that are available for training is often a subset of all the cases of interest, which may even be infinite, leading to optimization of finite sums that approximate the expected performance over an unknown data distribution. When generalization to the complete dataset is not taken into consideration, we talk about Empirical Risk Minimization (ERM). Even in this case, stochastic optimization is often used for reasons of efficiency. The idea of stochastic gradient (SG) ascent (Nesterov, 2013) is to iteratively focus on a random subset of the available data to obtain an approximate improvement direction. At the level of the single iteration, this can be much less expensive than taking into account all the data. However, the subsampling of data is a source of variance that can potentially compromise convergence, so that per-iteration efficiency and convergence rate must be traded off with proper handling of meta-parameters. Variance-reduced gradient algorithms such as SAG (Roux et al., 2012), SVRG (Johnson & Zhang, 2013) and SAGA (Defazio et al., 2014a) offer better ways of solving this trade-off, with significant results both in theory and practice. Although designed explicitly for ERM, these algorithms address a problem that affects more general machine learning problems.
In RL, stochastic optimization is rarely a matter of choice, since data must be actively sampled by interacting with an initially unknown environment. In this scenario, limiting the variance of the estimates is a necessity that cannot be avoided, which makes variance-reduced algorithms very interesting. Among RL approaches, policy gradient (Sutton et al., 2000) is the one that bears the closest similarity to SL solutions. The fundamental principle of these methods is to optimize a parametric policy through stochastic gradient ascent.
Compared to other applications of SG, the cost of collecting samples can be very high since it requires interacting with the environment. This makes SVRG-like methods potentially much more efficient than, e.g., batch learning. Unfortunately, RL has a series of difficulties that are not present in ERM. First, in SL the objective can often be designed to be strongly concave (we aim to maximize). This is not the case for RL, so we have to deal with non-concave objective functions. Second, as mentioned before, the dataset is not initially available and may even be infinite, which makes approximations unavoidable. This rules out SAG and SAGA because of their storage requirements, which leaves SVRG as the most promising choice. Finally, the distribution used to sample data is not under direct control of the algorithm designer, but is a function of policy parameters that change over time as the policy is optimized, which is a form of non-stationarity. SVRG has been used in RL as an efficient technique for optimizing the per-iteration problem in Trust-Region Policy Optimization (Xu et al., 2017) or for policy evaluation (Du et al., 2017). In both cases, the optimization problems faced resemble the SL scenario and are not affected by all the previously mentioned issues.
After providing background on policy gradient and SVRG in Section 2, we propose SVRPG, a variant of SVRG for the policy gradient framework, addressing all the difficulties mentioned above (see Section 3). In Section 4 we provide convergence guarantees for our algorithm, and we show a convergence rate that has an $O(1/T)$ dependence on the number $T$ of iterations. In Section 5.2 we suggest how to set the meta-parameters of SVRPG, while in Section 5.3 we discuss some practical variants of the algorithm. Finally, in Section 7 we empirically evaluate the performance of our method on popular continuous RL tasks.
2 Preliminaries
In this section, we provide the essential background on policy gradient methods and stochastic variance-reduced gradient methods for finite-sum optimization.
2.1 Policy Gradient
A Reinforcement Learning task (Sutton & Barto, 1998) can be modelled with a discrete-time continuous Markov Decision Process (MDP) $M = \{\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma, \rho\}$, where $\mathcal{S}$ is a continuous state space; $\mathcal{A}$ is a continuous action space; $\mathcal{P}$ is a Markovian transition model, where $\mathcal{P}(s'|s,a)$ defines the transition density from state $s$ to $s'$ under action $a$; $\mathcal{R}$ is the reward function, where $\mathcal{R}(s,a)$ is the expected reward for state-action pair $(s,a)$; $\gamma \in [0,1)$ is the discount factor; and $\rho$ is the initial state distribution. The agent's behaviour is modelled as a policy $\pi$, where $\pi(\cdot|s)$ is the density distribution over $\mathcal{A}$ in state $s$. We consider episodic MDPs with effective horizon $H$. (The episode duration is a random variable, but the optimal policy can reach the target state, i.e., the absorbing state, in fewer than $H$ steps. This is not to be confused with a finite-horizon problem, where the optimal policy is non-stationary.) In this setting, we can limit our attention to trajectories of length $H$. A trajectory $\tau$ is a sequence of states and actions $(s_0, a_0, s_1, a_1, \dots, s_{H-1}, a_{H-1})$ observed by following a stationary policy, where $s_0 \sim \rho$. We denote with $p(\tau|\pi)$ the density distribution induced by policy $\pi$ on the set $\mathcal{T}$ of all possible trajectories (see Appendix A for the definition), and with $R(\tau)$ the total discounted reward provided by trajectory $\tau$:
$$R(\tau) = \sum_{t=0}^{H-1} \gamma^t \mathcal{R}(s_t, a_t).$$
Policies can be ranked based on their expected total reward: $J(\pi) = \mathbb{E}_{\tau \sim p(\cdot|\pi)}[R(\tau)]$. Solving an MDP $M$ means finding $\pi^* \in \arg\max_\pi \{J(\pi)\}$.
Policy gradient methods restrict the search for the best performing policy to a class of parametrized policies $\Pi_\theta = \{\pi_\theta : \theta \in \mathbb{R}^d\}$, with the only constraint that $\pi_\theta$ is differentiable w.r.t. $\theta$. For the sake of brevity, we will denote the performance of a parametric policy with $J(\theta)$ and the probability of a trajectory $\tau$ with $p(\tau|\theta)$ (on some occasions, $p(\tau|\theta)$ will be replaced by $p_\theta(\tau)$ for the sake of readability). The search for a locally optimal policy is performed through gradient ascent, where the policy gradient is (Sutton et al., 2000; Peters & Schaal, 2008a):
$$\nabla J(\theta) = \mathbb{E}_{\tau \sim p(\cdot|\theta)}\left[\nabla \log p_\theta(\tau)\, R(\tau)\right]. \qquad (1)$$
Notice that the distribution defining the gradient is induced by the current policy. This aspect introduces a non-stationarity in the sampling process. Since the underlying distribution changes over time, it is necessary to resample at each update or use weighting techniques such as importance sampling. Here, we consider the online learning scenario, where trajectories are sampled by interacting with the environment at each policy change. In this setting, stochastic gradient ascent is typically employed. At each iteration $k$, a batch $\mathcal{D}_N^k = \{\tau_i\}_{i=1}^{N}$ of $N > 0$ trajectories is collected using policy $\pi_{\theta_k}$. The policy is then updated as $\theta_{k+1} = \theta_k + \alpha \hat{\nabla}_N J(\theta_k)$, where $\alpha$ is a step size and $\hat{\nabla}_N J(\theta)$ is an estimate of Eq. (1) using $\mathcal{D}_N$. The most common policy gradient estimators (e.g., REINFORCE (Williams, 1992) and G(PO)MDP (Baxter & Bartlett, 2001)) can be expressed as follows:
$$\hat{\nabla}_N J(\theta) = \frac{1}{N} \sum_{i=1}^{N} g(\tau_i|\theta), \quad \tau_i \in \mathcal{D}_N, \qquad (2)$$
where $g(\tau_i|\theta)$ is an estimate of $\nabla \log p_\theta(\tau_i) R(\tau_i)$. Although the REINFORCE definition is simpler than the G(PO)MDP one, the latter is usually preferred due to its lower variance. We refer the reader to Appendix A for details and a formal definition of $g$.
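As a minimal sketch of the two estimators named above, the following NumPy code computes the per-trajectory term $g(\tau|\theta)$ for REINFORCE and G(PO)MDP without a baseline and averages it over a batch as in Eq. (2). The function names and the input layout (an array `scores` holding $\nabla_\theta \log \pi_\theta(a_t|s_t)$ for each step) are assumptions made for illustration; they are not taken from the authors' code.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Total discounted reward of one trajectory: sum_t gamma^t * r_t."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

def reinforce_g(scores, rewards, gamma):
    """REINFORCE term for one trajectory (no baseline).

    scores has shape (H, d); scores[t] is grad_theta log pi_theta(a_t | s_t).
    The summed score is multiplied by the whole discounted return.
    """
    scores = np.asarray(scores, dtype=float)
    return scores.sum(axis=0) * discounted_return(rewards, gamma)

def gpomdp_g(scores, rewards, gamma):
    """G(PO)MDP term for one trajectory (no baseline).

    The reward at step t is weighted only by the scores of the
    actions taken up to and including step t.
    """
    scores = np.asarray(scores, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    discounted = gamma ** np.arange(len(rewards)) * rewards  # shape (H,)
    cumulative_scores = np.cumsum(scores, axis=0)            # shape (H, d)
    return (cumulative_scores * discounted[:, None]).sum(axis=0)

def batch_gradient(per_trajectory_terms):
    """Eq. (2): average of g(tau_i | theta) over the batch."""
    return np.mean(np.asarray(per_trajectory_terms, dtype=float), axis=0)
```

For a one-step trajectory the two terms coincide. They differ on longer trajectories because G(PO)MDP drops the products between a reward and the scores of later actions, which have zero mean and only add variance.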
The main limitation of plain policy gradient is the high variance of these estimators. The naïve approach of increasing the batch size is not an option in RL due to the high cost of collecting samples, i.e., of interacting with the environment. For this reason, the literature has focused on the introduction of baselines (i.e., functions $b : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$) aiming to reduce the variance (e.g., Williams, 1992; Peters & Schaal, 2008a; Thomas & Brunskill, 2017; Wu et al., 2018); see Appendix A for a formal definition of $b$. These baselines are usually designed to minimize the variance of the gradient estimate, but even they need to be estimated from data, partially reducing their effectiveness. On the other hand, there has been a surge of recent interest in variance reduction techniques for gradient optimization in supervised learning (SL). Although these techniques have been mainly derived for finite-sum problems, we will show in Section 3 how they can be used in RL. In particular, we will show that the proposed SVRPG algorithm can take the best of both worlds (i.e., SL and RL) since it can be plugged into a policy gradient estimate using baselines. The next section describes variance reduction techniques for finite-sum problems. In particular, we will present the SVRG algorithm that is at the core of this work.
2.2 Stochastic Variance-Reduced Gradient
Finite-sum optimization is the problem of maximizing an objective function $f(\theta)$ which can be decomposed into the sum or average of a finite number of functions $z_i(\theta)$:
$$\max_\theta \left\{ f(\theta) = \frac{1}{n} \sum_{i=1}^{n} z_i(\theta) \right\}.$$
This kind of optimization is very common in machine learning, where each $z_i$ may correspond to a data sample $x_i$ from a dataset $\mathcal{D}_n$ of size $n$ (i.e., $z_i(\theta) = z(x_i|\theta)$). A common requirement is that $z$ must be smooth and concave in $\theta$. (Note that we are considering a maximization problem instead of the classical minimization one.) Under this hypothesis, full gradient (FG) ascent (Cauchy, 1847) with a constant step size achieves a linear convergence rate in the number $T$ of iterations (i.e., parameter updates) (Nesterov, 2013). However, each iteration requires $n$ gradient computations, which can be too expensive for large values of $n$. Stochastic Gradient (SG) ascent (e.g., Robbins & Monro, 1951; Bottou & LeCun, 2004) overcomes this problem by sampling a single sample per iteration, but a vanishing step size is required to control the variance introduced by sampling. As a consequence, the lower per-iteration cost is paid for with a worse, sub-linear convergence rate (Nemirovskii et al., 1983). Starting from SAG, a series of variations to SG have been proposed to achieve a better trade-off between convergence speed and cost per iteration: e.g., SAG (Roux et al., 2012), SVRG (Johnson & Zhang, 2013), SAGA (Defazio et al., 2014a), Finito (Defazio et al., 2014b), and MISO (Mairal, 2015). The common idea is to reuse past gradient computations to reduce the variance of the current estimate. In particular, Stochastic Variance-Reduced Gradient (SVRG) is often preferred to other similar methods for its limited storage requirements, which is a significant advantage when deep and/or wide neural networks are employed.
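The idea of SVRG (Algorithm 1) is to alternate full and stochastic gradient updates. Every $m$ iterations, a snapshot $\tilde{\theta}$ of the current parameter is saved together with its full gradient $\nabla f(\tilde{\theta}) = \frac{1}{n} \sum_{i} \nabla z_i(\tilde{\theta})$. Between snapshots, the parameter is updated with $\blacktriangledown f(\theta)$, a gradient estimate corrected using stochastic gradient. For any $t \in \{0, \dots, m-1\}$:
$$\blacktriangledown f(\theta_t) := v_t = \nabla f(\tilde{\theta}) + \nabla z(x|\theta_t) - \nabla z(x|\tilde{\theta}), \qquad (3)$$
where $x$ is sampled uniformly at random from $\mathcal{D}_n$ (i.e., $x \sim \mathcal{U}(\mathcal{D}_n)$). Note that $t = 0$ corresponds to a FG step (i.e., $\blacktriangledown f(\theta_0) = \nabla f(\tilde{\theta})$) since $\theta_0 := \tilde{\theta}$. The corrected gradient $\blacktriangledown f(\theta)$ is an unbiased estimate of $\nabla f(\theta)$, and it is able to control the variance introduced by sampling even with a fixed step size, achieving a linear convergence rate without resorting to a plain full gradient.
More recently, some extensions of variance reduction algorithms to non-concave objectives have been proposed (e.g., Allen-Zhu & Hazan, 2016; Reddi et al., 2016a, b). In this scenario, $f$ is typically required to be $L$-smooth, i.e., $\|\nabla f(\theta') - \nabla f(\theta)\|_2 \le L \|\theta' - \theta\|_2$ for each $\theta, \theta' \in \mathbb{R}^d$ and for some Lipschitz constant $L$. Under this hypothesis, the convergence rate of SG is $O(1/\sqrt{T})$ (Ghadimi & Lan, 2013), i.e., $T = O(1/\epsilon^2)$ iterations are required to get $\|\nabla f(\theta)\|_2^2 \le \epsilon$. Again, SVRG achieves the same rate as FG (Reddi et al., 2016a), which is $O(1/T)$ in this case (Nesterov, 2013). The only additional requirement is to select the returned parameter uniformly at random among all the iterates $\theta_k$ instead of simply setting it to the final value ($k$ being the iterations).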
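The alternation between snapshots and corrected stochastic steps can be sketched on a toy concave finite-sum problem. The example below is an illustration under stated assumptions, not Algorithm 1 verbatim: the objective $z_i(\theta) = -\frac{1}{2}(x_i^\top \theta - y_i)^2$, the function name `svrg_ascent`, and the default step size and epoch length are all chosen here for demonstration.

```python
import numpy as np

def svrg_ascent(X, y, alpha=0.02, epochs=60, m=None, seed=0):
    """SVRG ascent on f(theta) = (1/n) sum_i z_i(theta),
    with z_i(theta) = -0.5 * (x_i . theta - y_i)^2 (a concave toy objective)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = 2 * n if m is None else m  # epoch length (assumed default)
    theta = np.zeros(d)
    for _ in range(epochs):
        snapshot = theta.copy()
        # Full gradient at the snapshot: (1/n) sum_i grad z_i(snapshot).
        full_grad = -X.T @ (X @ snapshot - y) / n
        for _ in range(m):
            i = rng.integers(n)  # uniform sample from the dataset
            grad_current = -X[i] * (X[i] @ theta - y[i])
            grad_snapshot = -X[i] * (X[i] @ snapshot - y[i])
            # Corrected gradient of Eq. (3); at the first sub-iteration
            # theta equals the snapshot, so this is a full-gradient step.
            v = full_grad + grad_current - grad_snapshot
            theta = theta + alpha * v
    return theta

# Usage on synthetic, noise-free data.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
theta_star = np.array([1.0, -2.0, 0.5])
y = X @ theta_star
theta_hat = svrg_ascent(X, y)
```

Only the snapshot and its full gradient are stored between sub-iterations, which is the limited storage requirement mentioned above; SAG and SAGA would instead keep one gradient per sample.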
3 SVRG in Reinforcement Learning
In online RL problems, the usual approach is to tune the batch size of SG to find the optimal trade-off between variance and speed. Recall that, compared to SL, the samples are not fixed in advance but we need to collect them at each policy change. Since this operation may be costly, we would like to minimize the number of interactions with the environment. For these reasons, we would like to apply SVRG to RL problems in order to limit the variance introduced by sampling trajectories, which would ultimately lead to faster convergence. However, a direct application of SVRG to RL is not possible due to the following issues:
Non-concavity: the objective function $J(\theta)$ is typically non-concave.
Infinite dataset: the RL optimization cannot be expressed as a finite-sum problem. The objective function is an expected value over the trajectory density $p_\theta(\tau)$ of the total discounted reward, for which we would need an infinite dataset.
Non-stationarity: the distribution of the samples changes over time. In particular, the value of the policy parameter $\theta$ influences the sampling process.
To deal with non-concavity, we require $J(\theta)$ to be $L$-smooth, which is a reasonable assumption for common policy classes such as Gaussian (see Appendix C for more details on the Gaussian policy case) and softmax (e.g., Furmston & Barber, 2012; Pirotta et al., 2015). Because of the infinite dataset, we can only rely on an estimate of the full gradient. Harikandeh et al. (2015) analysed this scenario under the assumption of $z$ being concave, showing that SVRG is robust to an inexact computation of the full gradient. In particular, it is still possible to recover the original convergence rate if the error decreases at an appropriate rate. Bietti & Mairal (2017) performed a similar analysis on MISO. In Section 4 we will show how the estimation accuracy impacts the convergence results with a non-concave objective. Finally, the non-stationarity of the optimization problem introduces a bias into the SVRG estimator in Eq. (3). To overcome this limitation we employ importance weighting (e.g., Rubinstein, 1981; Precup, 2000) to correct the distribution shift.
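We can now introduce Stochastic Variance-Reduced Policy Gradient (SVRPG) for a generic policy gradient estimator $g$. Pseudo-code is provided in Algorithm 2. The overall structure is the same as Algorithm 1, but the snapshot gradient is not exact and the gradient estimate used between snapshots is corrected using importance weighting (note that $g$ can be any unbiased estimator, with or without baseline; the unbiasedness is required for theoretical results, e.g., Appendix A):
$$\blacktriangledown J(\theta_t) = \hat{\nabla}_N J(\tilde{\theta}) + g(\tau|\theta_t) - \omega(\tau|\theta_t, \tilde{\theta})\, g(\tau|\tilde{\theta})$$
for any $t \in \{0, \dots, m-1\}$, where $\hat{\nabla}_N J(\tilde{\theta})$ is as in Eq. (2) where $\mathcal{D}_N$ is sampled using the snapshot policy $\pi_{\tilde{\theta}}$, $\tau$ is sampled from the current policy $\pi_{\theta_t}$, and $\omega(\tau|\theta_t, \tilde{\theta}) = \frac{p(\tau|\tilde{\theta})}{p(\tau|\theta_t)}$ is an importance weight from $\pi_{\theta_t}$ to the snapshot policy $\pi_{\tilde{\theta}}$. Similarly to SVRG, we have that $\theta_0 := \tilde{\theta}$, and the update is a FG step. Our update is still fundamentally on-policy since the weighting concerns only the correction term. However, this partial "off-policyness" represents an additional source of variance. This is a well-known issue of importance sampling (e.g., Thomas et al., 2015). To mitigate it, we use mini-batches of trajectories of size $B \ll N$ to average the correction, i.e.,
$$\blacktriangledown J(\theta_t) := v_t = \hat{\nabla}_N J(\tilde{\theta}) + \frac{1}{B} \sum_{i=0}^{B-1} \left[ g(\tau_i|\theta_t) - \omega(\tau_i|\theta_t, \tilde{\theta})\, g(\tau_i|\tilde{\theta}) \right]. \qquad (4)$$
It is worth noting that the full gradient and the correction term have the same expected value: $\mathbb{E}_{\tau_i \sim p(\cdot|\theta_t)}\left[\frac{1}{B} \sum_{i=0}^{B-1} \omega(\tau_i|\theta_t, \tilde{\theta})\, g(\tau_i|\tilde{\theta})\right] = \nabla J(\tilde{\theta})$. (The reader can refer to Appendix A for off-policy gradients and variants of REINFORCE and G(PO)MDP.) This property will be used to prove Lemma 3.1. The use of mini-batches is also common practice in SVRG since it can yield a performance improvement even in the supervised case (Harikandeh et al., 2015; Konečnỳ et al., 2016). It is easy to show that the SVRPG estimator has the following desirable properties: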
Lemma 3.1.
The previous results hold for both REINFORCE and G(PO)MDP. In particular, the latter result suggests that an SVRG-like algorithm using $\blacktriangledown J(\theta)$ can achieve faster convergence, by performing many more parameter updates with the same data without introducing additional variance (at least asymptotically). Note that the randomized return value of Algorithm 2 does not affect online learning at all, but will be used as a theoretical tool in the next section.
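The mini-batch estimator of Eq. (4) can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function names and array layouts are assumptions, and the per-trajectory terms $g(\tau_i|\cdot)$ are taken as precomputed inputs. The importance weight only needs policy log-probabilities, because the initial-state and transition densities are identical under both policies and cancel in the ratio.

```python
import numpy as np

def importance_weights(logp_snapshot, logp_current):
    """omega_i = p(tau_i | snapshot) / p(tau_i | current).

    Both inputs have shape (B, H) and hold log pi(a_t | s_t) for every
    step of every trajectory. Working in log space avoids underflow of
    the product over time steps.
    """
    log_ratio = np.sum(np.asarray(logp_snapshot) - np.asarray(logp_current), axis=1)
    return np.exp(log_ratio)

def svrpg_gradient(snapshot_grad, g_current, g_snapshot, weights):
    """Eq. (4): snapshot gradient plus the averaged, importance-weighted correction.

    snapshot_grad: (d,)   estimate from N trajectories of the snapshot policy
    g_current:     (B, d) g(tau_i | theta_t), trajectories from the current policy
    g_snapshot:    (B, d) g(tau_i | snapshot) on the same trajectories
    weights:       (B,)   importance weights omega_i
    """
    g_current = np.asarray(g_current, dtype=float)
    g_snapshot = np.asarray(g_snapshot, dtype=float)
    weights = np.asarray(weights, dtype=float)
    correction = g_current - weights[:, None] * g_snapshot
    return np.asarray(snapshot_grad, dtype=float) + correction.mean(axis=0)
```

When the current policy coincides with the snapshot, every weight is 1 and the two per-trajectory terms are equal, so the correction vanishes and the function returns the snapshot gradient, matching the FG step described above.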
4 Convergence Guarantees of SVRPG
In this section, we state the convergence guarantees for SVRPG with the REINFORCE or G(PO)MDP gradient estimator. We mainly leverage the recent analysis of non-concave SVRG (Reddi et al., 2016a; Allen-Zhu & Hazan, 2016). Each of the three challenges presented at the beginning of Section 3 can potentially prevent convergence, so we need additional assumptions. In Appendix C we show how Gaussian policies satisfy these assumptions.
1) Non-concavity. A common assumption, in this case, is that the objective function is $L$-smooth. However, in RL we can consider the following assumption, which is sufficient for the smoothness of the objective (see Lemma B.2).
Assumption 4.1 (On policy derivatives).
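For each state-action pair $(s,a)$, any value of $\theta$, and all parameter components $i, j$ there exist constants $0 \le G, F < \infty$ such that:
$$\left|\nabla_{\theta_i} \log \pi_\theta(a|s)\right| \le G, \qquad \left|\frac{\partial^2}{\partial \theta_i \partial \theta_j} \log \pi_\theta(a|s)\right| \le F.$$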
2) FG Approximation. Since we cannot compute an exact full gradient, we require the variance of the estimator to be bounded. This assumption is similar in spirit to the one in (Harikandeh et al., 2015).
Assumption 4.2 (On the variance of the gradient estimator).
There is a constant $V < \infty$ such that, for any policy $\pi_\theta$:
$$\mathrm{Var}\left[g(\cdot|\theta)\right] \le V.$$
3) Non-stationarity. Similarly to what is done in SL (Cortes et al., 2010), we require the variance of the importance weight to be bounded.
Assumption 4.3 (On the variance of importance weights).
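There is a constant $W < \infty$ such that, for each pair of policies encountered in Algorithm 2 and for each trajectory,
$$\mathrm{Var}\left[\omega(\tau|\theta_1, \theta_2)\right] \le W, \quad \forall \theta_1, \theta_2 \in \mathbb{R}^d,\ \tau \sim p(\cdot|\theta_1).$$
Differently from Assumptions 4.1 and 4.2, Assumption 4.3 must be enforced by a proper handling of the epoch size $m$.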
We can now state the convergence guarantees for SVRPG.
Theorem 4.4 (Convergence of the SVRPG algorithm).
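Assume the REINFORCE or the G(PO)MDP gradient estimator is used in SVRPG (see Equation (4)). Under Assumptions 4.1, 4.2 and 4.3, the parameter vector $\theta_A$ returned by Algorithm 2 after $T$ iterations has, for some positive constants $\psi, \zeta, \xi$ and for a proper choice of the step size $\alpha$ and the epoch size $m$, the following property:
$$\mathbb{E}\left[\|\nabla J(\theta_A)\|_2^2\right] \le \frac{J(\theta^*) - J(\theta_0)}{\psi T} + \frac{\zeta}{N} + \frac{\xi}{B},$$
where $\theta^*$ is a global optimum and $\psi, \zeta, \xi$ depend only on $G, F, V, W, \alpha$ and $m$.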
Refer to Appendix B for a detailed proof involving the definition of the constants and the meta-parameter constraints. By analysing the upper bound in Theorem 4.4 we observe that: I) the $O(1/T)$ term is coherent with results on non-concave SVRG (e.g., Reddi et al., 2016a); II) the $O(1/N)$ term is due to the FG approximation and is analogous to the one in (Harikandeh et al., 2015); III) the $O(1/B)$ term is due to importance weighting. To achieve asymptotic convergence, the batch size $N$ and the mini-batch size $B$ should increase over time. In practice, it is enough to choose $N$ and $B$ large enough to make the second and the third term negligible, i.e., to mitigate the variance introduced by FG approximation and importance sampling, respectively. Once the last two terms can be neglected, the number of trajectories needed to achieve $\|\nabla J(\theta)\|_2^2 \le \epsilon$ is $O\left(\frac{B + N/m}{\epsilon}\right)$, since each of the $O(1/\epsilon)$ iterations uses $B$ trajectories and a snapshot of $N$ trajectories is taken every $m$ iterations. In this sense, an advantage over batch gradient ascent can be achieved with properly selected meta-parameters. In Section 5.2 we propose a joint selection of step size $\alpha$ and epoch size $m$. Finally, from the return statement of Algorithm 2, it is worth noting that $J(\theta_A)$ can be seen as the average performance of all the policies tried by the algorithm. This is particularly meaningful in the context of online learning that we are considering in this paper.
5 Remarks on SVRPG
The convergence guarantees presented in the previous section come with requirements on the meta-parameters (i.e., $\alpha$ and $m$) that may be too conservative for practical applications. Here we provide a practical and automatic way to choose the step size $\alpha$ and the number of sub-iterations $m$ performed between snapshots. Additionally, we provide a variant of SVRPG exploiting a variance-reduction technique for importance weights. Despite lacking theoretical guarantees, we will show in Section 7 that this method can outperform the baseline SVRPG (Algorithm 2).
5.1 Full Gradient Update
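As noted in Section 3, the update performed at the beginning of each epoch is equivalent to a full-gradient update. In our setting, where collecting samples is particularly expensive, collecting $B$ trajectories with the snapshot policy $\pi_{\theta_0}$, $\theta_0 := \tilde{\theta}$, feels like a waste of data (the correction term $g(\tau_i|\theta_0) - \omega(\tau_i|\theta_0, \tilde{\theta})\, g(\tau_i|\tilde{\theta})$ is zero since $\theta_0 = \tilde{\theta}$). In practice, we just perform an approximate full gradient update using the $N$ trajectories sampled to compute $\hat{\nabla}_N J(\tilde{\theta})$, i.e.,
$$\theta_1 = \tilde{\theta} + \alpha \hat{\nabla}_N J(\tilde{\theta}), \qquad \theta_{t+1} = \theta_t + \alpha \blacktriangledown J(\theta_t) \quad \text{for } t = 1, \dots, m-1.$$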
In the following, we will always use this practical variant.
5.2 MetaParameter Selection
The step size $\alpha$ is crucial to balance variance reduction and efficiency, while the epoch length $m$ influences the variance introduced by the importance weights. Low values of $m$ are associated with small variance but increase the frequency of snapshot points (which means many FG computations). High values of $m$ may move the policy $\pi_{\theta_t}$ far away from the snapshot policy $\pi_{\tilde{\theta}}$, causing large variance in the importance weights. We will jointly set the two meta-parameters.
Adaptive step size.
A standard way to deal with noisy gradients is to use adaptive strategies to compute the step size. ADAptive Moment estimation (ADAM) (Kingma & Ba, 2014) stabilizes the parameter update by computing learning rates for each parameter based on an incremental estimate of the gradient variance. Due to this feature, we would like to incorporate ADAM in the structure of the SVRPG update. Recall that SVRPG performs two different updates of the parameters $\theta$: I) FG update in the snapshot; II) corrected gradient update in the sub-iterations. Given this structure, we suggest using two separate ADAM estimators, with step sizes $\alpha^{FG}$ and $\alpha^{SI}$, where $\alpha^{FG}$ is associated with the snapshot and $\alpha^{SI}$ with the sub-iterations (see Appendix D for details). By doing so, we decouple the contribution of the variance due to the approximate FG from the one introduced by the sub-iterations. Note that these two terms have different orders of magnitude since they are estimated with a different number of trajectories ($N$ for the snapshot and $B$ for the sub-iterations) and the estimator in the snapshot does not require importance weights. The use of two ADAM estimators makes it possible to capture and exploit this property.
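A minimal sketch of the two-estimator idea follows, assuming the standard ADAM update rule with bias correction. The class name, the ascent sign convention and the toy gradients are assumptions for illustration; the authors' exact variant is in their Appendix D and may differ.

```python
import numpy as np

class Adam:
    """Standard ADAM moment estimates, returning an ascent increment."""

    def __init__(self, dim, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.alpha, self.beta1, self.beta2, self.eps = alpha, beta1, beta2, eps
        self.m = np.zeros(dim)  # running mean of gradients
        self.v = np.zeros(dim)  # running mean of squared gradients
        self.t = 0

    def step(self, grad):
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
        m_hat = self.m / (1 - self.beta1 ** self.t)  # bias correction
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return self.alpha * m_hat / (np.sqrt(v_hat) + self.eps)

# One estimator per kind of update, so the noisier sub-iteration
# gradients never contaminate the statistics of the snapshot gradients.
adam_fg = Adam(dim=2, alpha=0.1)  # snapshot (approximate FG) updates
adam_si = Adam(dim=2, alpha=0.1)  # corrected-gradient sub-iterations

theta = np.zeros(2)
theta = theta + adam_fg.step(np.array([1.0, -2.0]))  # hypothetical snapshot gradient
theta = theta + adam_si.step(np.array([0.5, 0.5]))   # hypothetical corrected gradient
```

Keeping the moment estimates separate is the design choice described above: each estimator tracks the scale of one gradient source only.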
Adaptive epoch length. It is easy to imagine that a predefined schedule (e.g., $m$ fixed in advance or changed with a policy-independent process) may perform poorly due to the high variability of the updates. In particular, given a fixed number of sub-iterations $m$, the variance of the updates in the sub-iterations depends on the snapshot policy and the sampled trajectories. Since the ADAM estimate partly captures such variability, we propose to take a new snapshot (i.e., interrupt the sub-iterations) whenever the step size proposed by ADAM for the sub-iterations is smaller than the one for the FG (i.e., $\alpha^{SI} < \alpha^{FG}$). If the latter condition is verified, it amounts to saying that the noise in the corrected gradient has overcome the information of the FG. Formally, the stopping condition is as follows: take a new snapshot if
$$\frac{\alpha^{FG}}{N} > \frac{\alpha^{SI}}{B},$$
where we have introduced $N$ and $B$ to take into account the trajectory efficiency (i.e., weighted advantage). The fewer trajectories used to update the policy, the better. Including the batch sizes in the stopping condition allows us to optimize the trade-off between the quality of the updates and the cost of performing them.
5.3 Normalized Importance Sampling
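As mentioned in Section 5.2, importance weights are an additional source of variance. A standard way to cope with this issue is self-normalization (e.g., Precup, 2000; Owen, 2013). This technique can reduce the variance of the importance weights at the cost of introducing some bias (Owen, 2013, Chapter 9). Whether the trade-off is advantageous depends on the specific task. Introducing self-normalization in the context of our algorithm, we switch from Eq. (4) to:
$$\blacktriangledown J(\theta_t) = \hat{\nabla}_N J(\tilde{\theta}) + \frac{1}{B} \sum_{i=0}^{B-1} g(\tau_i|\theta_t) - \frac{1}{\Omega} \sum_{i=0}^{B-1} \omega(\tau_i|\theta_t, \tilde{\theta})\, g(\tau_i|\tilde{\theta}),$$
where $\Omega = \sum_{i=0}^{B-1} \omega(\tau_i|\theta_t, \tilde{\theta})$. In Section 7 we show that self-normalization can provide a performance improvement.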
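A sketch of the self-normalized variant, assuming the standard form of self-normalized importance sampling in which the weights are divided by their sum rather than by the mini-batch size. The function name and array layout are hypothetical, mirroring the plain estimator of Eq. (4).

```python
import numpy as np

def svrpg_gradient_self_normalized(snapshot_grad, g_current, g_snapshot, weights):
    """Self-normalized correction: weights are rescaled to sum to one.

    snapshot_grad: (d,), g_current and g_snapshot: (B, d), weights: (B,).
    """
    g_current = np.asarray(g_current, dtype=float)
    g_snapshot = np.asarray(g_snapshot, dtype=float)
    weights = np.asarray(weights, dtype=float)
    normalized = weights / weights.sum()  # omega_i / Omega
    weighted_snapshot = (normalized[:, None] * g_snapshot).sum(axis=0)
    return np.asarray(snapshot_grad, dtype=float) + g_current.mean(axis=0) - weighted_snapshot
```

Because the weights are rescaled, the result is invariant to a common multiplicative factor in the weights, which is what bounds the influence of a few very large ratios; the price is the bias mentioned above.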
6 Related Work
Despite the considerable interest received in SL, variance-reduced gradient approaches have not attracted the RL community. As far as we know, there are just two applications of SVRG in RL. The first approach (Du et al., 2017) aims to exploit SVRG for policy evaluation. The policy evaluation problem is more straightforward than the one faced in this paper (control problem). In particular, since the goal is to evaluate just the performance of a predefined policy, the optimization problem is stationary. The setting considered in that paper is policy evaluation by minimizing the empirical mean squared projected Bellman error (MSPBE) with a linear approximation of the value function. Du et al. (2017) showed that this problem can be equivalently reformulated as a convex-concave saddle-point problem that is characterized by a finite-sum structure. This problem can be solved using a variant of SVRG (Palaniappan & Bach, 2016) for which convergence guarantees have been provided. The second approach (Xu et al., 2017) uses SVRG as a practical method to solve the optimization problem faced by Trust Region Policy Optimization (TRPO) at each iteration. This is just a direct application of SVRG to a problem having finite-sum structure since no specific structure of the RL problem is exploited. It is worth mentioning that, for practical reasons, the authors proposed to use a Newton conjugate gradient method with SVRG.
In the recent past, there has been a surge of studies investigating variance reduction techniques for policy gradient methods. The specific structure of the policy gradient allows incorporating a baseline (i.e., a function $b : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$) without affecting the unbiasedness of the gradient (e.g., Williams, 1992; Weaver & Tao, 2001; Peters & Schaal, 2008b; Thomas & Brunskill, 2017; Wu et al., 2018). Although the baseline can be arbitrarily selected, the literature often refers to the optimal baseline as the one minimizing the variance of the estimate. Nevertheless, even the baseline needs to be estimated from data. This fact may partially reduce its effectiveness by introducing variance. Even if these approaches share the same goal as SVRG, they are substantially different. In particular, the proposed SVRPG does not make explicit use of the structure of the policy gradient framework, and it is independent of the underlying gradient estimate (i.e., with or without baseline). This suggests that it would be possible to integrate an ad hoc SVRPG baseline to further reduce the variance of the estimate. Since this paper is about the applicability of the SVRG technique to RL, we consider this topic as future work. Additionally, the experiments show that SVRPG has an advantage over G(PO)MDP even when the baseline is used (see the Half-Cheetah domain in Section 7).
Concerning importance weighting techniques, RL has made extensive use of them for off-policy problems (e.g., Precup, 2000; Thomas et al., 2015). However, as mentioned before, SVRPG cannot be compared to such methods since it is in all respects an on-policy algorithm. Here, importance weighting is just a statistical tool used to preserve the unbiasedness of the corrected gradient.
7 Experiments
In this section, we evaluate the performance of SVRPG and compare it with policy gradient (PG) on well-known continuous RL tasks: Cart-pole balancing and Swimmer (e.g., Duan et al., 2016). We consider G(PO)MDP since it has a smaller variance than REINFORCE. For our algorithm, we use a batch size $N$, a mini-batch size $B$, and the jointly adaptive step size $\alpha$ and epoch length $m$ proposed in Section 5.2. Since the aim of this comparison is to show the improvement that SVRG-flavored variance reduction brings to SG in the policy gradient framework, we set the batch size of the baseline policy gradient algorithm to $B$. In this sense, we measure the improvement yielded by computing snapshot gradients and using them to adjust parameter updates. Since we evaluate online performance over the number of sampled trajectories, the cost of computing such snapshot gradients is automatically taken into consideration. To make the comparison fair, we also use Adam in the baseline PG algorithm, which we will denote simply as G(PO)MDP in the following. In all the experiments, we use deep Gaussian policies with adaptive standard deviation (details on network architecture in Appendix E). Each experiment is repeated with a random policy initialization and seed, but this initialization is shared among the algorithms under comparison. The length of the experiment, i.e., the total number of trajectories, is fixed for each task. Performance is evaluated by using test trajectories on a subset of the policies considered during the learning process. We provide average performance with 90% bootstrap confidence intervals. Task implementations are from the rllab library (Duan et al., 2016), on which our agents are also based (code available at github.com/Dam930/rllab). More details on meta-parameters and exhaustive task descriptions are provided in Appendix E.
Figure 0(a) compares SVRPG with G(PO)MDP on a continuous variant of the classical Cart-pole task, which is a 2D balancing task. Despite using more trajectories on average for each parameter update, our algorithm shows faster convergence, which can be ascribed to the better quality of updates due to variance reduction.
The Swimmer task is a 3D continuous-control locomotion task. This task is more difficult than Cart-pole. In particular, the longer horizon and the more complex dynamics can have a dangerous impact on the variance of importance weights. In this case, the self-normalization technique proposed in Section 5.3 brings an improvement (even if not statistically significant), as shown in Figure 0(b). Figure 0(c) shows self-normalized SVRPG against G(PO)MDP. Our algorithm outperforms G(PO)MDP for almost the entire learning process. Here too, we note an increase in speed in early iterations, and, toward the end of the learning process, the improvement becomes statistically significant.
Preliminary results on actor-critic. Another variance-reduction technique in policy gradient consists of using baselines or critics. This tool is orthogonal to the methods described in this paper, and the theoretical results of Section 4 are general in this sense. In the experiments described so far, we compared against the so-called actor-only G(PO)MDP, i.e., without the baseline. To move towards a more general understanding of the variance issue in policy gradient, we also test SVRPG in an actor-critic scenario. To do so, we consider the more challenging MuJoCo (Todorov et al., 2012) Half-Cheetah task, a 3D locomotion task that has a larger state-action space than Swimmer. Figure 0(d) compares self-normalized SVRPG and G(PO)MDP on Half-Cheetah, using the critic suggested in (Duan et al., 2016) for both algorithms. Results are promising, showing that a combination of the baseline usage and SVRG-like variance reduction can yield an improvement that the two techniques alone are not able to achieve. Moreover, SVRPG presents a noticeably lower variance. The performance of actor-critic G(PO)MDP on Half-Cheetah is coherent with the one reported in (Duan et al., 2016). (Duan et al. (2016) report results on REINFORCE. However, inspection of the rllab code and documentation reveals that it is actually PGT (Sutton et al., 2000), which is equivalent to G(PO)MDP, as shown by Peters & Schaal, 2008b. Using the name REINFORCE in a general way is inaccurate, but widespread.) Other results are not comparable since we did not use the critic.
8 Conclusion
In this paper, we introduced SVRPG, a variant of SVRG designed explicitly for RL problems. The control problem considered in the paper has a series of difficulties that are not common in SL. Among them, non-concavity and approximate estimates of the FG have been analysed independently in SL (e.g., Allen-Zhu & Hazan, 2016; Reddi et al., 2016a; Harikandeh et al., 2015) but never combined. Nevertheless, the main issue in RL is the non-stationarity of the sampling process since the distribution underlying the objective function is policy-dependent. We have shown that by exploiting importance weighting techniques, it is possible to overcome this issue and preserve the unbiasedness of the corrected gradient. We have additionally shown that, under mild assumptions that are often verified in RL applications, it is possible to derive convergence guarantees for SVRPG. Finally, we have empirically shown that practical variants of the theoretical SVRPG version can outperform classical actor-only approaches on benchmark tasks. Preliminary results support the effectiveness of SVRPG also with a commonly used baseline for the policy gradient. Beyond that, we believe that it will be possible to derive a baseline designed explicitly for SVRPG to exploit the RL structure and the SVRG idea jointly. Another possible improvement would be to employ the natural gradient (Kakade, 2002) to better control the effects of parameter updates on the variance of importance weights. Future work should also focus on making the batch sizes $N$ and $B$ adaptive, as suggested in (Papini et al., 2017).
Acknowledgments
This research was supported in part by the French Ministry of Higher Education and Research, the Nord-Pas-de-Calais Regional Council, and the French National Research Agency (ANR) under project ExTra-Learn (n. ANR-14-CE24-0010-01).
References
 Allen-Zhu & Hazan (2016) Allen-Zhu, Zeyuan and Hazan, Elad. Variance reduction for faster non-convex optimization. In International Conference on Machine Learning, pp. 699–707, 2016.
 Baxter & Bartlett (2001) Baxter, Jonathan and Bartlett, Peter L. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.
 Bietti & Mairal (2017) Bietti, Alberto and Mairal, Julien. Stochastic optimization with variance reduction for infinite datasets with finite sum structure. In Advances in Neural Information Processing Systems, pp. 1622–1632, 2017.
 Bottou & LeCun (2004) Bottou, Léon and LeCun, Yann. Large scale online learning. In Advances in neural information processing systems, pp. 217–224, 2004.
 Cauchy (1847) Cauchy, Augustin. Méthode générale pour la résolution des systèmes d’équations simultanées. Comp. Rend. Sci. Paris, 25(1847):536–538, 1847.
 Cortes et al. (2010) Cortes, Corinna, Mansour, Yishay, and Mohri, Mehryar. Learning bounds for importance weighting. In Advances in neural information processing systems, pp. 442–450, 2010.
 Defazio et al. (2014a) Defazio, Aaron, Bach, Francis, and Lacoste-Julien, Simon. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pp. 1646–1654, 2014a.
 Defazio et al. (2014b) Defazio, Aaron, Domke, Justin, et al. Finito: A faster, permutable incremental gradient method for big data problems. In International Conference on Machine Learning, pp. 1125–1133, 2014b.
 Du et al. (2017) Du, Simon S., Chen, Jianshu, Li, Lihong, Xiao, Lin, and Zhou, Dengyong. Stochastic variance reduction methods for policy evaluation. In ICML, volume 70 of Proceedings of Machine Learning Research, pp. 1049–1058. PMLR, 2017.
 Duan et al. (2016) Duan, Yan, Chen, Xi, Houthooft, Rein, Schulman, John, and Abbeel, Pieter. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329–1338, 2016.
 Furmston & Barber (2012) Furmston, Thomas and Barber, David. A unifying perspective of parametric policy search methods for Markov decision processes. In NIPS, pp. 2726–2734, 2012.
 Ghadimi & Lan (2013) Ghadimi, Saeed and Lan, Guanghui. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
 Harikandeh et al. (2015) Harikandeh, Reza, Ahmed, Mohamed Osama, Virani, Alim, Schmidt, Mark, Konečnỳ, Jakub, and Sallinen, Scott. Stop wasting my gradients: Practical SVRG. In Advances in Neural Information Processing Systems, pp. 2251–2259, 2015.
 Johnson & Zhang (2013) Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pp. 315–323, 2013.
 Jurčíček (2012) Jurčíček, Filip. Reinforcement learning for spoken dialogue systems using off-policy natural gradient method. In Spoken Language Technology Workshop (SLT), 2012 IEEE, pp. 7–12. IEEE, 2012.
 Kakade (2002) Kakade, Sham M. A natural policy gradient. In Advances in neural information processing systems, pp. 1531–1538, 2002.
 Kingma & Ba (2014) Kingma, Diederik P and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Konečnỳ et al. (2016) Konečnỳ, Jakub, Liu, Jie, Richtárik, Peter, and Takáč, Martin. Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE Journal of Selected Topics in Signal Processing, 10(2):242–255, 2016.
 Mairal (2015) Mairal, Julien. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2):829–855, 2015.
 Nemirovskii et al. (1983) Nemirovskii, Arkadii, Yudin, David Borisovich, and Dawson, Edgar Ronald. Problem complexity and method efficiency in optimization. 1983.
 Nesterov (2013) Nesterov, Yurii. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
 Owen (2013) Owen, Art B. Monte Carlo theory, methods and examples. 2013.
 Palaniappan & Bach (2016) Palaniappan, Balamurugan and Bach, Francis. Stochastic variance reduction methods for saddle-point problems. In NIPS, pp. 1408–1416, 2016.
 Papini et al. (2017) Papini, Matteo, Pirotta, Matteo, and Restelli, Marcello. Adaptive batch size for safe policy gradients. In Advances in Neural Information Processing Systems, pp. 3594–3603, 2017.
 Peters & Schaal (2008a) Peters, Jan and Schaal, Stefan. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697, 2008a.
 Peters & Schaal (2008b) Peters, Jan and Schaal, Stefan. Reinforcement learning of motor skills with policy gradients. Neural networks, 21(4):682–697, 2008b.
 Pirotta et al. (2013) Pirotta, Matteo, Restelli, Marcello, and Bascetta, Luca. Adaptive step-size for policy gradient methods. In Advances in Neural Information Processing Systems, pp. 1394–1402, 2013.
 Pirotta et al. (2015) Pirotta, Matteo, Restelli, Marcello, and Bascetta, Luca. Policy gradient in Lipschitz Markov decision processes. Machine Learning, 100(2):255–283, 2015. ISSN 1573-0565. doi: 10.1007/s10994-015-5484-1.
 Precup (2000) Precup, Doina. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, pp. 80, 2000.
 Reddi et al. (2016a) Reddi, Sashank J, Hefny, Ahmed, Sra, Suvrit, Poczos, Barnabas, and Smola, Alex. Stochastic variance reduction for nonconvex optimization. In International conference on machine learning, pp. 314–323, 2016a.
 Reddi et al. (2016b) Reddi, Sashank J, Sra, Suvrit, Póczos, Barnabás, and Smola, Alex. Fast incremental method for nonconvex optimization. arXiv preprint arXiv:1603.06159, 2016b.
 Robbins & Monro (1951) Robbins, Herbert and Monro, Sutton. A stochastic approximation method. The annals of mathematical statistics, pp. 400–407, 1951.
 Roux et al. (2012) Roux, Nicolas L, Schmidt, Mark, and Bach, Francis R. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pp. 2663–2671, 2012.
 Rubinstein (1981) Rubinstein, Reuven Y. Simulation and the Monte Carlo method. Technical report, 1981.
 Sutton & Barto (1998) Sutton, Richard S and Barto, Andrew G. Reinforcement learning: An introduction. MIT press Cambridge, 1998.
 Sutton et al. (2000) Sutton, Richard S, McAllester, David A, Singh, Satinder P, and Mansour, Yishay. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063, 2000.
 Thomas & Brunskill (2017) Thomas, Philip S. and Brunskill, Emma. Policy gradient methods for reinforcement learning with function approximation and action-dependent baselines. CoRR, abs/1706.06643, 2017.
 Thomas et al. (2015) Thomas, Philip S, Theocharous, Georgios, and Ghavamzadeh, Mohammad. High-confidence off-policy evaluation. In AAAI, pp. 3000–3006, 2015.
 Todorov et al. (2012) Todorov, Emanuel, Erez, Tom, and Tassa, Yuval. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012.
 Weaver & Tao (2001) Weaver, Lex and Tao, Nigel. The optimal reward baseline for gradient-based reinforcement learning. In Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence, pp. 538–545. Morgan Kaufmann Publishers Inc., 2001.
 Williams (1992) Williams, Ronald J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
 Wu et al. (2018) Wu, Cathy, Rajeswaran, Aravind, Duan, Yan, Kumar, Vikash, Bayen, Alexandre M, Kakade, Sham, Mordatch, Igor, and Abbeel, Pieter. Variance reduction for policy gradient with action-dependent factorized baselines. International Conference on Learning Representations, 2018. Accepted as oral presentation.
 Xu et al. (2017) Xu, Tianbing, Liu, Qiang, and Peng, Jian. Stochastic variance reduction for policy gradient estimation. CoRR, abs/1710.06034, 2017.
 Zhao et al. (2011) Zhao, Tingting, Hachiya, Hirotaka, Niu, Gang, and Sugiyama, Masashi. Analysis and improvement of policy gradient estimation. In Advances in Neural Information Processing Systems, pp. 262–270, 2011.
Appendix A Policy Gradient Estimators
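We present a brief overview of the two most widespread gradient estimators, REINFORCE (Williams, 1992) and G(PO)MDP (Baxter & Bartlett, 2001), in both the on-policy and off-policy settings. Let $\tau = (s_0, a_0, \dots, s_{H-1}, a_{H-1}, s_H)$ be an $H$-step trajectory. Since $\tau$ depends on the MDP and on the current policy $\pi_\theta$, the trajectory is said to be drawn from the density $p(\tau \mid \theta)$, defined as:
$$p(\tau \mid \theta) = \rho(s_0) \prod_{t=0}^{H-1} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t),$$
where $\rho$ is the initial-state distribution and $P$ is the transition kernel of the MDP. We can now recall the definition of policy performance
$$J(\theta) = \mathbb{E}_{\tau \sim p(\cdot \mid \theta)}\left[ R(\tau) \right],$$
where $R(\tau) = \sum_{t=0}^{H-1} \gamma^t r(s_t, a_t)$. The policy gradient is
$$\nabla J(\theta) = \mathbb{E}_{\tau \sim p(\cdot \mid \theta)}\left[ \nabla \log p(\tau \mid \theta)\, R(\tau) \right]. \tag{7}$$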
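On-policy setting. Consider a policy $\pi_\theta$ and let $\mathcal{D} = \{\tau_n\}_{n=1}^{N}$ be a dataset collected using policy $\pi_\theta$. The REINFORCE gradient estimator (Williams, 1992) provides a simple, unbiased way of estimating the gradient:
$$\hat{\nabla}_N J(\theta) = \frac{1}{N} \sum_{n=1}^{N} \left( \sum_{h=0}^{H-1} \nabla \log \pi_\theta(a_h^n \mid s_h^n) \right) \left( \sum_{h=0}^{H-1} \gamma^h r_h^n - b \right),$$
where subscripts denote the time step, superscripts denote the trajectory, $r_h^n$ is the reward actually collected at time $h$ from trajectory $\tau_n$, and $b$ is a baseline (e.g., Thomas & Brunskill, 2017). The G(PO)MDP gradient estimator (Baxter & Bartlett, 2001) is a refinement of REINFORCE that is subject to less variance (Zhao et al., 2011) while preserving unbiasedness:
$$\hat{\nabla}_N J(\theta) = \frac{1}{N} \sum_{n=1}^{N} \sum_{h=0}^{H-1} \left( \sum_{k=0}^{h} \nabla \log \pi_\theta(a_k^n \mid s_k^n) \right) \left( \gamma^h r_h^n - b_h \right),$$
where $b_h$ is a (possibly time-dependent) baseline. G(PO)MDP can be seen as a more efficient implementation of the REINFORCE algorithm. In fact, the latter does not perform an optimal credit assignment, since it ignores that the reward at time $h$ does not depend on the actions performed after time $h$. G(PO)MDP overcomes this issue by taking into account the causality of rewards in the REINFORCE definition of the policy gradient.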
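The difference in credit assignment between the two estimators can be illustrated with a minimal sketch. It assumes that the per-step score vectors $\nabla \log \pi_\theta(a_h \mid s_h)$ and the rewards of one trajectory are already available as plain Python lists; the function names `reinforce_term`, `gpomdp_term` and `batch_average` are hypothetical and not taken from the paper or from any library.

```python
from typing import List, Optional, Sequence

Vector = List[float]


def reinforce_term(scores: Sequence[Vector], rewards: Sequence[float],
                   gamma: float, baseline: float = 0.0) -> Vector:
    """Single-trajectory REINFORCE term: (sum of scores) * (discounted return - b)."""
    dim = len(scores[0])
    score_sum = [sum(s[j] for s in scores) for j in range(dim)]
    ret = sum(gamma ** h * r for h, r in enumerate(rewards)) - baseline
    return [ret * x for x in score_sum]


def gpomdp_term(scores: Sequence[Vector], rewards: Sequence[float],
                gamma: float,
                baselines: Optional[Sequence[float]] = None) -> Vector:
    """Single-trajectory G(PO)MDP term: the reward at step h is paired
    only with the scores of actions taken up to step h (causality)."""
    dim = len(scores[0])
    cum_score = [0.0] * dim
    grad = [0.0] * dim
    for h, (score, reward) in enumerate(zip(scores, rewards)):
        cum_score = [c + s for c, s in zip(cum_score, score)]
        b_h = 0.0 if baselines is None else baselines[h]
        weight = gamma ** h * reward - b_h
        grad = [g + weight * c for g, c in zip(grad, cum_score)]
    return grad


def batch_average(terms: Sequence[Vector]) -> Vector:
    """Empirical average of per-trajectory terms over a batch."""
    n = len(terms)
    return [sum(t[j] for t in terms) / n for j in range(len(terms[0]))]


# Toy one-dimensional trajectory with two steps (made-up numbers).
scores = [[1.0], [2.0]]
rewards = [1.0, 1.0]
print(reinforce_term(scores, rewards, gamma=0.5))  # [4.5]
print(gpomdp_term(scores, rewards, gamma=0.5))     # [2.5]
```

In this toy case the two values differ by 2.0, which is exactly the product of the score at step 1 and the reward at step 0: the non-causal term that G(PO)MDP drops and that contributes variance but no signal in expectation.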
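Off-policy setting. In the off-policy setting two policies are involved, called behavioural ($\pi_B$) and target ($\pi_T$). The first is used to select actions for the interaction with the system, while the second is used to evaluate the agent's performance and is improved in each update. Suppose now that we aim to estimate the performance of the target policy $\pi_T$ but we have samples collected using policy $\pi_B$. We can use importance weighting to correct the shift in the distribution and obtain an unbiased estimate of $J(\pi_T)$:
$$\hat{J}(\pi_T) = \frac{1}{N} \sum_{n=1}^{N} \omega(\tau_n)\, R(\tau_n),$$
where $\tau_n \sim p(\cdot \mid \pi_B)$ and
$$\omega(\tau) = \frac{p(\tau \mid \pi_T)}{p(\tau \mid \pi_B)} = \prod_{t=0}^{H-1} \frac{\pi_T(a_t \mid s_t)}{\pi_B(a_t \mid s_t)}.$$
The off-policy version of (7) is (e.g., Jurčíček, 2012)
$$\nabla J(\pi_T) = \mathbb{E}_{\tau \sim p(\cdot \mid \pi_B)}\left[ \omega(\tau)\, \nabla \log p(\tau \mid \pi_T)\, R(\tau) \right]. \tag{8}$$
For (8) to be well defined, the behavioural policy must have nonzero probability of selecting any action in every state, i.e., $\pi_B(a \mid s) > 0$ for any $(s, a)$. Equation 8 is important for proving Theorem 4.4 since it provides a common representation of REINFORCE and G(PO)MDP.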
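As an illustration of the importance weight correction, the following minimal sketch computes the trajectory-level weight as a product of per-step probability ratios and uses it to average returns collected under the behavioural policy. The inputs (per-step action probabilities under each policy) and the function names are assumptions made for the example; the transition and initial-state terms are omitted because they cancel in the ratio.

```python
from typing import Sequence, Tuple


def importance_weight(target_probs: Sequence[float],
                      behaviour_probs: Sequence[float]) -> float:
    """Product over time steps of pi_T(a_t|s_t) / pi_B(a_t|s_t)."""
    weight = 1.0
    for p_t, p_b in zip(target_probs, behaviour_probs):
        if p_b <= 0.0:
            # The ratio is undefined if the behavioural policy could not
            # have selected the sampled action.
            raise ValueError("behavioural probability must be positive")
        weight *= p_t / p_b
    return weight


def off_policy_return(batch: Sequence[Tuple[float, float]]) -> float:
    """Average of weight * return over (weight, return) pairs
    sampled with the behavioural policy."""
    return sum(w * r for w, r in batch) / len(batch)


# Made-up probabilities for a two-step trajectory.
w = importance_weight([0.5, 0.5], [0.25, 0.5])
print(w)  # 2.0
print(off_policy_return([(2.0, 1.0), (0.5, 4.0)]))  # 2.0
```

For long horizons the product of ratios can overflow or underflow, so a practical implementation would more likely accumulate log-probabilities; the direct product is kept here only for readability.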
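The off-policy version of REINFORCE is easily obtained by taking the empirical average of (8):
$$\hat{\nabla}_N J(\pi_T) = \frac{1}{N} \sum_{n=1}^{N} \omega(\tau_n) \left( \sum_{h=0}^{H-1} \nabla \log \pi_T(a_h^n \mid s_h^n) \right) \left( \sum_{h=0}^{H-1} \gamma^h r_h^n - b \right).$$
The G(PO)MDP off-policy estimator is defined as follows:
$$\hat{\nabla}_N J(\pi_T) = \frac{1}{N} \sum_{n=1}^{N} \omega(\tau_n) \sum_{h=0}^{H-1} \left( \sum_{k=0}^{h} \nabla \log \pi_T(a_k^n \mid s_k^n) \right) \left( \gamma^h r_h^n - b_h \right),$$
where the trajectories $\tau_n$ are collected with the behavioural policy $\pi_B$ and $\omega(\tau) = p(\tau \mid \pi_T) / p(\tau \mid \pi_B)$ is the importance weight.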
Appendix B Proofs
In this section, we prove all the claims made in the paper, with the primary objective of proving Theorem 4.4. Our proof is adapted from the one of Theorem 2 from (Reddi et al., 2016a) and has a very similar structure, but with all the additional challenges and assumptions described in Section 4.
Note that in the following we will make wide use of the following properties.
Assumption B.1.
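We consider an estimate $\hat{\nabla}_N J(\theta) = \frac{1}{N} \sum_{n=1}^{N} g(\tau_n \mid \theta)$ as in Eq. 2 such that the following two properties hold.
On-policy unbiased estimator:
$$\mathbb{E}_{\tau \sim p(\cdot \mid \theta)}\left[ g(\tau \mid \theta) \right] = \nabla J(\theta).$$
Off-policy unbiased estimator:
$$\mathbb{E}_{\tau \sim p(\cdot \mid \theta)}\left[ \omega(\tau \mid \theta, \theta')\, g(\tau \mid \theta') \right] = \nabla J(\theta'), \qquad \omega(\tau \mid \theta, \theta') = \frac{p(\tau \mid \theta')}{p(\tau \mid \theta)}.$$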
Note that these assumptions are verified by REINFORCE and G(PO)MDP.
Definitions
We give some additional definitions which will be useful in the proofs.
Definition B.1.
We introduce the notation to denote the th trajectory collected using policy where will be clear from the context.
Definition B.2.
For a random variable :
where the sequence is defined in Algorithm 2, , and . To avoid inconsistencies, we also define .
Intuitively, the operator computes the expected value with respect to the sampling of trajectories from the snapshot up to the th iteration included. Note that the order in which expected values are taken is important since each is function of previously sampled trajectories and is used to sample new ones.
Definition B.3.
For random vectors X, Y:
where denotes the trace of a matrix. From the linearity of expected value we have the following:
(9) 
and are defined in the same way from .
Definition B.4.
The full gradient estimation error is:
Definition B.5.
The ideal SVRPG gradient estimate is:
Basic Lemmas
We prove two basic properties of the SVRPG update.
See 3.1
Proof.
Note that the importance weight is necessary to guarantee unbiasedness, since the are sampled from . As , also . Hence, by continuity of :
Note that it is important that the trajectories used in the second and the third term are the same for the variance to vanish. ∎
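The role of the shared trajectories can be made concrete with a small sketch. It assumes the SVRG-style form of the estimate suggested by the proof, namely a snapshot gradient plus a mini-batch average of $g(\tau_i \mid \theta_t) - \omega_i\, g(\tau_i \mid \tilde{\theta})$, with all quantities precomputed as plain lists; the function name `svrpg_estimate` and its arguments are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def svrpg_estimate(snapshot_grad: Vector,
                   grads_current: Sequence[Vector],
                   grads_snapshot: Sequence[Vector],
                   weights: Sequence[float]) -> Vector:
    """Snapshot gradient plus the averaged correction
    g(tau_i | theta_t) - w_i * g(tau_i | theta_snapshot).

    Both gradient lists must be evaluated on the SAME trajectories,
    sampled with the current policy; w_i is the importance weight that
    re-targets trajectory i to the snapshot policy.
    """
    batch = len(weights)
    dim = len(snapshot_grad)
    correction = [0.0] * dim
    for g_cur, g_snap, w in zip(grads_current, grads_snapshot, weights):
        for j in range(dim):
            correction[j] += (g_cur[j] - w * g_snap[j]) / batch
    return [s + c for s, c in zip(snapshot_grad, correction)]


# When the current parameters coincide with the snapshot, the weights are 1
# and the per-trajectory gradients are identical, so the correction cancels.
same = [[0.5, 0.25], [1.5, -0.75]]
print(svrpg_estimate([1.0, -2.0], same, same, [1.0, 1.0]))  # [1.0, -2.0]
```

The cancellation in the usage example only occurs because the second and third terms are computed on the same trajectories; with independently sampled batches the difference would be a nonzero random quantity even at the snapshot.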
Ancillary Lemmas
Before addressing the main convergence theorem, we prove some useful lemmas.
Lemma B.2.
Under Assumption 4.1, $J(\theta)$ is $L$-smooth for some positive Lipschitz constant $L_J$.
Proof.
Lemma B.3.
Under Assumption 4.1, whether we use the REINFORCE or the G(PO)MDP gradient estimator, is Lipschitz continuous with Lipschitz constant , i.e., for any trajectory :
Proof.
For both REINFORCE and G(PO)MDP, is a linear combination of terms of the kind (Peters & Schaal, 2008b). These terms have bounded gradient from the second inequality of Assumption 4.1 and the fact that . If a baseline is used in REINFORCE or G(PO)MDP, we only need the additional assumption that said baseline is bounded. Bounded gradient implies Lipschitz continuity. Finally, the linear combination of Lipschitz continuous functions is Lipschitz continuous. ∎
Lemma B.4.
Under Assumption 4.1, whether we use the REINFORCE or the G(PO)MDP gradient estimator, for every and , there is a positive constant such that:
Proof.
For REINFORCE we have, from Assumption 4.1:
For G(PO)MDP, we do not have a compact expression for , but since it is derived from REINFORCE by neglecting some terms of the kind (Baxter & Bartlett, 2001; Peters & Schaal, 2008b), the above bound still holds. If a baseline is used in REINFORCE or G(PO)MDP, we only need the additional assumption that said baseline is bounded. ∎
Lemma B.5.
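For any random vector $X$, the variance (as defined in Definition B.3) can be bounded as follows:
$$\mathrm{Var}[X] \le \mathbb{E}\left[ \|X\|^2 \right].$$
Proof.
By using basic properties of expected value and scalar variance, and writing $X_i$ for the $i$-th component of $X$:
$$\mathrm{Var}[X] = \sum_i \mathrm{Var}[X_i] = \sum_i \left( \mathbb{E}[X_i^2] - \mathbb{E}[X_i]^2 \right) \le \sum_i \mathbb{E}[X_i^2] = \mathbb{E}\left[ \|X\|^2 \right].$$
∎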
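As a quick numerical sanity check of Lemma B.5, the sketch below compares the empirical trace-variance of a small set of vectors with their empirical second moment. It assumes the bound reads $\mathrm{Var}[X] \le \mathbb{E}[\|X\|^2]$ with the variance taken as the trace of the covariance matrix; the sample values and function names are made up for illustration.

```python
from typing import List, Sequence

Vector = List[float]


def trace_variance(samples: Sequence[Vector]) -> float:
    """Empirical E[||X - E[X]||^2], i.e. the trace of the covariance."""
    n = len(samples)
    dim = len(samples[0])
    mean = [sum(x[j] for x in samples) / n for j in range(dim)]
    return sum(sum((x[j] - mean[j]) ** 2 for j in range(dim))
               for x in samples) / n


def second_moment(samples: Sequence[Vector]) -> float:
    """Empirical E[||X||^2]."""
    return sum(sum(v * v for v in x) for x in samples) / len(samples)


samples = [[1.0, 0.0], [3.0, 4.0]]
print(trace_variance(samples))  # 5.0
print(second_moment(samples))   # 13.0
```

The gap between the two numbers, 8.0, is the squared norm of the sample mean (2.0, 2.0), which is the nonnegative term dropped in the proof.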
Lemma B.6.
Under Assumption 4.1, the expected squared norm of the SVRPG gradient can be bounded as follows: