1 Introduction
†Equal contribution. Affiliations: Mila, Université de Montréal; Mila, McGill University; Facebook AI Research.
In reinforcement learning (RL), an agent continuously interacts with an environment by choosing actions as prescribed by a way of behaving called a policy. The agent observes its current state and performs an action based on its current policy (a probability distribution over actions conditioned on the state), then reaches a new state and obtains a reward. The goal of the agent is to improve its policy, and a key requirement in this process is the ability to evaluate the expected long-term return of the current policy, called the value function. After evaluating the policy, the policy can be updated so that more valuable states are visited more often. Performing policy evaluation efficiently is thus imperative to the success of training an RL agent.
Temporal difference (TD) learning (Sutton, 1988) is a classic method for policy evaluation, which uses the Bellman equation to bootstrap the estimation process and continually update the value function. The Least Squares Temporal Difference (LSTD) method (Bradtke and Barto, 1996; Boyan, 2002) is a more data-efficient approach which uses the data to construct a linear system approximating the original problem, then solves this system. It also has the advantage of not requiring a learning rate parameter. However, LSTD is not computationally feasible when the number of features $d$ is large, because it requires inverting a matrix of size $d \times d$. When $d$ is large, stochastic gradient based approaches, such as GTD (Sutton et al., 2008), GTD2 and TDC (Sutton et al., 2009), are preferred because the amount of computation and storage during each update is linear in $d$. Compared to classical TD, these algorithms truly compute a gradient (instead of performing a fixed-point approximation, which is in fact not a gradient update); as a result, they enjoy better theoretical guarantees, especially in the case of off-policy learning, in which the policy of interest for the evaluation is different from the policy generating the agent's experience.
Convex problems with large $n$ (number of data samples) and $d$ (number of features) appear often in machine learning, and there are many efficient stochastic gradient methods for finding solutions (e.g. SAG (Roux et al., 2012), SVRG (Johnson and Zhang, 2013), SAGA (Defazio et al., 2014)). In the problem of interest here, policy evaluation with linear function approximation, the objective function is a saddle-point formulation of the empirical Mean Squared Projected Bellman Error (MSPBE). It is convex-concave and not strongly convex in the primal variable, so existing powerful convex optimization methods do not directly apply.
Despite this problem, Du et al. (2017) showed that SVRG and SAGA can be applied to solve the saddle-point version of MSPBE with linear convergence rates, leading to fast, convergent methods for policy evaluation. An important and computationally heavy step of SVRG is to compute a full gradient at the beginning of every epoch. Subsequent stochastic gradient updates use this full gradient so that the variance of the update directions is reduced. In this paper, we address the computational bottleneck of SVRG by extending two methods, Batching SVRG (Harikandeh et al., 2015) and SCSG (Lei and Jordan, 2017), to policy evaluation. These methods were originally proposed to make SVRG computationally efficient when solving strongly convex problems, so they do not directly apply to our problem, a convex-concave function without strong convexity in the primal variable.
In this work, we make the following key contributions:

We show that both Batching SVRG and SCSG achieve a linear convergence rate for policy evaluation while saving a considerable amount of gradient computations. To the best of our knowledge, this is the first such result for Batching SVRG and SCSG in the saddle-point setting.

While our analysis builds on the ideas of Lei and Jordan (2017), our proofs end up quite different and also a lot simpler because we exploit the structure of our problem.

Our experimental results demonstrate that, given the same amount of data, Batching SVRG and SCSG achieve better empirical performance than vanilla SVRG on some standard benchmarks.
2 Background
In RL, a Markov Decision Process (MDP) is typically used to model the interaction between an agent and its environment. An MDP is defined by a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ is the set of possible states, $\mathcal{A}$ is the set of actions, and the transition probability function $P$ maps state-action pairs to distributions over next states. $r$ denotes the reward function $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, which returns the immediate reward that an agent receives after performing an action at a state, and $\gamma \in [0, 1)$ is the discount factor used to discount rewards received farther in the future. For simplicity, we assume $\mathcal{S}$ and $\mathcal{A}$ are finite.
A policy $\pi$ is a mapping from states to distributions over actions. The value function for policy $\pi$, denoted $V^\pi$, represents the expected sum of discounted rewards along the trajectories induced by the policy in the MDP: $V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, \pi\right]$. $V^\pi$ can be obtained as the fixed point of the Bellman operator $T^\pi$, defined by $T^\pi V = r^\pi + \gamma P^\pi V$, where $r^\pi$ is the expected immediate reward and $P^\pi$ is defined as $P^\pi(s' \mid s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) P(s' \mid s, a)$.
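As a small worked example of this fixed-point characterization: in a finite MDP the Bellman equation is linear, so the value function solves $(I - \gamma P^\pi) V = r^\pi$. The sketch below is only illustrative; the two-state MDP, the uniform policy, and the function name `policy_value` are assumptions introduced here and do not come from the paper.

```python
import numpy as np

def policy_value(P, R, pi, gamma):
    """Exact value function of policy `pi` in a finite MDP.

    P:  (S, A, S) array, P[s, a, s2] = probability of moving s -> s2 under a
    R:  (S, A) array of immediate rewards
    pi: (S, A) array, pi[s, a] = probability of action a in state s
    """
    r_pi = np.sum(pi * R, axis=1)            # expected immediate reward per state
    P_pi = np.einsum("sa,sat->st", pi, P)    # state-to-state transitions under pi
    n_states = P.shape[0]
    # V = r_pi + gamma * P_pi V  <=>  (I - gamma * P_pi) V = r_pi
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    return V, r_pi, P_pi

# Hypothetical two-state, two-action MDP used only for illustration.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.full((2, 2), 0.5)                    # uniform random policy
gamma = 0.9

V, r_pi, P_pi = policy_value(P, R, pi, gamma)
print(V)
```

Solving the linear system directly is only feasible for small state spaces, which is why the rest of the paper works with function approximation and sampled transitions.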
In this paper, we are concerned with the policy evaluation problem (Sutton and Barto, 1998), i.e., the estimation of $V^\pi$ for a given policy $\pi$. In order to obtain generalization between different states, $V^\pi$ should be represented in a functional form. In this paper, we focus on linear function approximation of the form $V_\theta(s) = \theta^\top \phi(s)$, where $\theta \in \mathbb{R}^d$ is a weight vector and $\phi: \mathcal{S} \to \mathbb{R}^d$ is a feature map from states to a given $d$-dimensional feature space.
3 Objective Functions
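We assume that the Markov chain induced by the policy $\pi$ is ergodic and admits a unique stationary distribution over states, denoted by $\xi$. We write $\Xi$ for the diagonal matrix whose diagonal entries are $(\xi(s))_{s \in \mathcal{S}}$.
If $\Phi$ denotes the matrix obtained by stacking the state feature vectors row by row, then it is known (Bertsekas, 2011) that $\Phi\theta^\star$ is the fixed point of the projected Bellman operator $\Pi T^\pi$:
$$\Phi\theta^\star = \Pi T^\pi (\Phi\theta^\star) \tag{1}$$
where $\Pi = \Phi(\Phi^\top \Xi \Phi)^{-1}\Phi^\top \Xi$ is the orthogonal projection onto the space $\{\Phi\theta \mid \theta \in \mathbb{R}^d\}$ with respect to the weighted Euclidean norm $\|x\|_\Xi^2 = x^\top \Xi x$. Rather than computing a sequence of iterates given by the projected Bellman operator, another approach for finding $\theta^\star$ is to directly minimize (Sutton et al., 2009; Liu et al., 2015) the Mean Squared Projected Bellman Error (MSPBE):
$$\mathrm{MSPBE}(\theta) = \frac{1}{2}\left\|V_\theta - \Pi T^\pi V_\theta\right\|_\Xi^2 \tag{2}$$
By substituting the definition of $\Pi$ into (2), we can write the MSPBE as a standard weighted least-squares problem (see Sutton et al. (2009) for a complete derivation):
$$\mathrm{MSPBE}(\theta) = \frac{1}{2}\left\|A\theta - b\right\|_{C^{-1}}^2 \tag{3}$$
where $A$, $b$ and $C$ are defined as follows: $A = \mathbb{E}\left[\phi(s)(\phi(s) - \gamma\phi(s'))^\top\right]$, $b = \mathbb{E}\left[r(s,a)\phi(s)\right]$ and $C = \mathbb{E}\left[\phi(s)\phi(s)^\top\right]$, where the expectations are taken with respect to the stationary distribution.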
Empirical MSPBE:
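We focus here on the batch setting where we collect a dataset of $n$ transitions $\{(s_t, a_t, r_t, s'_t)\}_{t=1}^{n}$ generated by the policy $\pi$. We replace the quantities $A$, $b$ and $C$ in (3) by their empirical estimates:
$$\hat{A} = \frac{1}{n}\sum_{t=1}^{n} A_t, \qquad \hat{b} = \frac{1}{n}\sum_{t=1}^{n} b_t, \qquad \hat{C} = \frac{1}{n}\sum_{t=1}^{n} C_t \tag{4}$$
where for all $t \in \{1, \dots, n\}$, for a given transition $(s_t, a_t, r_t, s'_t)$:
$$A_t = \phi(s_t)\left(\phi(s_t) - \gamma\phi(s'_t)\right)^\top, \qquad b_t = r_t\,\phi(s_t), \qquad C_t = \phi(s_t)\phi(s_t)^\top \tag{5}$$
Therefore we consider the empirical MSPBE defined as follows:
$$\widehat{\mathrm{MSPBE}}(\theta) = \frac{1}{2}\left\|\hat{A}\theta - \hat{b}\right\|_{\hat{C}^{-1}}^2 \tag{6}$$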
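To make the empirical quantities concrete, the following sketch builds $\hat{A}$, $\hat{b}$ and $\hat{C}$ from arrays of features and rewards and evaluates the empirical MSPBE $\frac{1}{2}\|\hat{A}\theta - \hat{b}\|^2_{\hat{C}^{-1}}$. It is a minimal illustration under stated assumptions: the random features and rewards stand in for a real dataset of transitions, and the function names are hypothetical.

```python
import numpy as np

def empirical_quantities(phi, phi_next, rewards, gamma):
    """phi, phi_next: (n, d) features of s_t and s'_t; rewards: (n,)."""
    n = phi.shape[0]
    A_hat = phi.T @ (phi - gamma * phi_next) / n   # mean of phi_t (phi_t - gamma phi'_t)^T
    b_hat = phi.T @ rewards / n                    # mean of r_t phi_t
    C_hat = phi.T @ phi / n                        # mean of phi_t phi_t^T
    return A_hat, b_hat, C_hat

def empirical_mspbe(theta, A_hat, b_hat, C_hat):
    residual = A_hat @ theta - b_hat
    # squared norm weighted by the inverse of C_hat, without forming the inverse
    return 0.5 * residual @ np.linalg.solve(C_hat, residual)

# Synthetic stand-in for a dataset of transitions (assumption for illustration).
rng = np.random.default_rng(0)
n, d, gamma = 200, 4, 0.95
phi = rng.normal(size=(n, d))
phi_next = rng.normal(size=(n, d))
rewards = rng.normal(size=n)

A_hat, b_hat, C_hat = empirical_quantities(phi, phi_next, rewards, gamma)
theta_star = np.linalg.solve(A_hat, b_hat)         # minimizer when A_hat is nonsingular
print(empirical_mspbe(np.zeros(d), A_hat, b_hat, C_hat))
print(empirical_mspbe(theta_star, A_hat, b_hat, C_hat))
```

The direct solve in the last lines costs on the order of $d^3$ operations, which is the cost the stochastic methods discussed next are meant to avoid.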
Finite sum structure:
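We aim at applying stochastic variance-reduction techniques to our problem. These methods are designed for problems with the following finite sum structure:
$$\min_{x} \; \frac{1}{n}\sum_{t=1}^{n} f_t(x) \tag{7}$$
Unfortunately, even after replacing the quantities $A$, $b$ and $C$ by their finite-sample estimates, the empirical objective in (6) cannot be written in the form (7). However, Du et al. (2017) convert the empirical MSPBE minimization in (6) into a convex-concave saddle-point problem which presents a finite sum structure. To this end, they use the convex-conjugate trick. Recall that the convex conjugate of a real-valued function $f$ is defined as:
$$f^{*}(y) = \sup_{x}\left(\langle y, x\rangle - f(x)\right) \tag{8}$$
and, if $f$ is convex, we have $f^{**} = f$. Also, if $f(x) = \frac{1}{2}\|x\|_{\hat{C}}^2$, then $f^{*}(y) = \frac{1}{2}\|y\|_{\hat{C}^{-1}}^2$. Thanks to the latter relation, the empirical MSPBE minimization is equivalent to:
$$\min_{\theta}\max_{\omega}\left(\omega^\top(\hat{b} - \hat{A}\theta) - \frac{1}{2}\omega^\top\hat{C}\omega\right) \tag{9}$$
The objective in (9), which we denote by $f(\theta, \omega)$, can be written as $f(\theta, \omega) = \frac{1}{n}\sum_{t=1}^{n} f_t(\theta, \omega)$ where $f_t(\theta, \omega) = \omega^\top(b_t - A_t\theta) - \frac{1}{2}\omega^\top C_t\omega$.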
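The equivalence between the least-squares objective and the saddle-point objective can be checked numerically: for a fixed $\theta$, the inner maximization over $\omega$ is attained at $\omega^\star = C^{-1}(b - A\theta)$ and its value equals $\frac{1}{2}\|A\theta - b\|^2_{C^{-1}}$. The sketch below verifies this on random matrices; the matrices and function names are assumptions chosen for illustration, not data from the paper.

```python
import numpy as np

def saddle_objective(theta, omega, A, b, C):
    # omega^T (b - A theta) - 1/2 omega^T C omega
    return omega @ (b - A @ theta) - 0.5 * omega @ C @ omega

def mspbe(theta, A, b, C):
    residual = A @ theta - b
    return 0.5 * residual @ np.linalg.solve(C, residual)

rng = np.random.default_rng(0)
d = 3
A = rng.normal(size=(d, d))
b = rng.normal(size=d)
M = rng.normal(size=(d, d))
C = M @ M.T + np.eye(d)                  # symmetric positive definite
theta = rng.normal(size=d)

# Maximizer of the concave quadratic in omega, for this fixed theta.
omega_star = np.linalg.solve(C, b - A @ theta)
inner_max = saddle_objective(theta, omega_star, A, b, C)
print(inner_max, mspbe(theta, A, b, C))  # the two values agree

# Any other omega gives a smaller value, since the objective is concave in omega.
perturbed = saddle_objective(theta, omega_star + rng.normal(size=d), A, b, C)
print(perturbed <= inner_max)
```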
4 Existing Optimization Algorithms
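Before presenting our new methods, we briefly review existing algorithms that solve the saddle-point problem (9). Writing $f(\theta, \omega)$ for the objective of (9), let us define the vector $F(\theta, \omega)$ obtained by stacking the primal and negative dual gradients:
$$F(\theta, \omega) = \begin{pmatrix} \nabla_\theta f(\theta, \omega) \\ -\nabla_\omega f(\theta, \omega) \end{pmatrix} = \begin{pmatrix} -\hat{A}^\top\omega \\ \hat{A}\theta - \hat{b} + \hat{C}\omega \end{pmatrix} \tag{10}$$
We have $F(\theta, \omega) = \frac{1}{n}\sum_{t=1}^{n} F_t(\theta, \omega)$ where $F_t(\theta, \omega) = \begin{pmatrix} -A_t^\top\omega \\ A_t\theta - b_t + C_t\omega \end{pmatrix}$.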
Gradient temporal difference:
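The GTD2 algorithm (Sutton et al., 2009), when applied to the batch setting, consists of the following update: for a uniformly sampled $t \in \{1, \dots, n\}$:
$$\theta \leftarrow \theta + \sigma_\theta A_t^\top\omega, \qquad \omega \leftarrow \omega + \sigma_\omega\left(b_t - A_t\theta - C_t\omega\right) \tag{11}$$
where $\sigma_\theta$ and $\sigma_\omega$ are step sizes on $\theta$ and $\omega$. GTD2 has a low computation cost per iteration but only a sublinear convergence rate (Touati et al., 2018).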
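A single GTD2 step can be written so that it never forms the $d \times d$ matrices $A_t$ and $C_t$, which is what keeps the per-iteration cost linear in $d$. The sketch below assumes $A_t = \phi_t(\phi_t - \gamma\phi'_t)^\top$, $b_t = r_t\phi_t$ and $C_t = \phi_t\phi_t^\top$; the function name and the random inputs are hypothetical, and the explicit-matrix computation at the end is included only as a cross-check.

```python
import numpy as np

def gtd2_step(theta, omega, phi, phi_next, reward, gamma, sigma_theta, sigma_omega):
    """One GTD2 update on a single transition, using O(d) operations."""
    delta_phi = phi - gamma * phi_next
    # primal gradient: -A_t^T omega = -(phi - gamma phi') (phi . omega)
    grad_theta = -delta_phi * (phi @ omega)
    # negative dual gradient: A_t theta - b_t + C_t omega
    neg_grad_omega = phi * (delta_phi @ theta - reward + phi @ omega)
    return theta - sigma_theta * grad_theta, omega - sigma_omega * neg_grad_omega

rng = np.random.default_rng(0)
d, gamma = 5, 0.9
phi, phi_next = rng.normal(size=d), rng.normal(size=d)
reward = 1.0
theta, omega = rng.normal(size=d), rng.normal(size=d)
sigma_theta, sigma_omega = 0.01, 0.05

new_theta, new_omega = gtd2_step(theta, omega, phi, phi_next, reward,
                                 gamma, sigma_theta, sigma_omega)

# Cross-check against the explicit matrix form of the same update.
A_t = np.outer(phi, phi - gamma * phi_next)
b_t = reward * phi
C_t = np.outer(phi, phi)
expected_theta = theta + sigma_theta * (A_t.T @ omega)
expected_omega = omega + sigma_omega * (b_t - A_t @ theta - C_t @ omega)
print(np.allclose(new_theta, expected_theta), np.allclose(new_omega, expected_omega))
```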
SVRG for policy evaluation:
Du et al. (2017) applied SVRG to solve the saddle-point problem (9). The idea is to alternate between full and stochastic gradient updates in two layers of loops. In the outer loop, a snapshot $(\tilde{\theta}, \tilde{\omega})$ of the current variables is saved together with its full gradient vector $F(\tilde{\theta}, \tilde{\omega})$. Between snapshots, the variables are updated with a stochastic gradient corrected by the full gradient at the snapshot:
$$\begin{pmatrix}\theta \\ \omega\end{pmatrix} \leftarrow \begin{pmatrix}\theta \\ \omega\end{pmatrix} - \begin{pmatrix}\sigma_\theta I & 0 \\ 0 & \sigma_\omega I\end{pmatrix}\left(F_t(\theta, \omega) + F(\tilde{\theta}, \tilde{\omega}) - F_t(\tilde{\theta}, \tilde{\omega})\right) \tag{12}$$
where $t$ is uniformly sampled in $\{1, \dots, n\}$ and $F_t$ denotes the stacked primal and negative dual gradient on sample $t$. Du et al. (2017) showed that the algorithm has a linear convergence rate although the objective (9) is not strongly convex in the primal variable $\theta$. However, the algorithm remains inefficient in terms of computation as it requires computing a full gradient over the entire dataset in the outer loop. In the rest of the paper, an "epoch" means an iteration of the outer loop. In the sequel, we introduce two variants of SVRG for policy evaluation that alleviate this computational bottleneck while preserving the linear convergence rate.
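The structure of one SVRG epoch can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes per-sample gradients of the form $F_t(\theta, \omega) = (-A_t^\top\omega;\; A_t\theta - b_t + C_t\omega)$, uses synthetic data, and for brevity stores all per-sample snapshot gradients in memory (a real SVRG implementation would recompute them on demand). All function names are hypothetical.

```python
import numpy as np

def sample_gradients(theta, omega, phi, phi_next, rewards, gamma):
    """Return the (n, 2d) array whose t-th row is F_t(theta, omega)."""
    delta_phi = phi - gamma * phi_next
    phi_omega = phi @ omega
    top = -delta_phi * phi_omega[:, None]                              # -A_t^T omega
    bottom = phi * (delta_phi @ theta - rewards + phi_omega)[:, None]  # A_t theta - b_t + C_t omega
    return np.hstack([top, bottom])

def svrg_epoch(theta, omega, phi, phi_next, rewards, gamma,
               sigma_theta, sigma_omega, n_inner, rng):
    n, d = phi.shape
    theta_snap, omega_snap = theta.copy(), omega.copy()
    snap_grads = sample_gradients(theta_snap, omega_snap, phi, phi_next, rewards, gamma)
    full_grad = snap_grads.mean(axis=0)            # full pass over the dataset
    for _ in range(n_inner):
        t = rng.integers(n)
        row = slice(t, t + 1)
        g_t = sample_gradients(theta, omega, phi[row], phi_next[row], rewards[row], gamma)[0]
        v = g_t + full_grad - snap_grads[t]        # variance-reduced direction
        theta = theta - sigma_theta * v[:d]
        omega = omega - sigma_omega * v[d:]
    return theta, omega

rng = np.random.default_rng(0)
n, d, gamma = 100, 3, 0.9
phi, phi_next = rng.normal(size=(n, d)), rng.normal(size=(n, d))
rewards = rng.normal(size=n)
theta0, omega0 = np.zeros(d), np.zeros(d)
theta1, omega1 = svrg_epoch(theta0, omega0, phi, phi_next, rewards, gamma,
                            sigma_theta=0.01, sigma_omega=0.01, n_inner=50, rng=rng)
print(theta1, omega1)
```

The correction term has zero mean over $t$, so the direction `v` remains an unbiased estimate of the full gradient at the current iterate while its variance shrinks as the iterate approaches the snapshot.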
5 Proposed Methods
5.1 Batching SVRG for Policy Evaluation
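Algorithm 1 presents Batching SVRG for policy evaluation. It applies Batching SVRG (Harikandeh et al., 2015) to the convex-concave formulation of the empirical MSPBE.
Harikandeh et al. (2015) show that SVRG is robust to an inexact computation of the full gradient. In order to speed up the algorithm, we propose Algorithm 1, which, similarly to Harikandeh et al. (2015), estimates the full gradient in each epoch $m$ using only a subset $\mathcal{B}_m$ (a minibatch) of $B_m = |\mathcal{B}_m|$ training examples:
$$\mu_m = \frac{1}{B_m}\sum_{t \in \mathcal{B}_m} F_t(\tilde{\theta}, \tilde{\omega})$$
In each iteration of the inner loop, Algorithm 1 uses a direction $v$ to update $\theta$ and $\omega$. The direction $v$ is the usual SVRG update, except that the full gradient $F(\tilde{\theta}, \tilde{\omega})$ is replaced with the minibatch gradient $\mu_m$:
$$v = F_t(\theta, \omega) + \mu_m - F_t(\tilde{\theta}, \tilde{\omega})$$
where $t$ is sampled uniformly in $\{1, \dots, n\}$, $(\tilde{\theta}, \tilde{\omega})$ is the snapshot taken at the beginning of the epoch, and $F_t$ is the stacked primal and negative dual gradient on sample $t$.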
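Algorithm 1 Batching SVRG for policy evaluation
Input: initial point $(\theta_0, \omega_0)$, step sizes $\sigma_\theta$, $\sigma_\omega$, and numbers of iterations $M$ and $K$
Output: $(\theta_M, \omega_M)$
1: for $m = 0$ to $M - 1$ do
2: Set $(\tilde{\theta}, \tilde{\omega})$ and $(\theta_{m,0}, \omega_{m,0})$ to $(\theta_m, \omega_m)$.
3: Choose a minibatch size $B_m$
4: Sample a set $\mathcal{B}_m$ with $B_m$ elements uniformly from $\{1, \dots, n\}$
5: Compute $\mu_m = \frac{1}{B_m}\sum_{t \in \mathcal{B}_m} F_t(\tilde{\theta}, \tilde{\omega})$
6: for $j = 0$ to $K - 1$ do
7: Sample $t_j$ uniformly randomly from $\{1, \dots, n\}$
8: $v_{m,j} = F_{t_j}(\theta_{m,j}, \omega_{m,j}) + \mu_m - F_{t_j}(\tilde{\theta}, \tilde{\omega})$
9: $(\theta_{m,j+1}; \omega_{m,j+1}) = (\theta_{m,j}; \omega_{m,j}) - \mathrm{diag}(\sigma_\theta I, \sigma_\omega I)\, v_{m,j}$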
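The batching idea can be sketched in a few lines: the only change with respect to SVRG is that the snapshot gradient is averaged over a sampled minibatch rather than over the whole dataset. The code below is a hedged illustration under several assumptions that are not stated in the text above: per-sample gradients have the form $(-A_t^\top\omega;\; A_t\theta - b_t + C_t\omega)$, the minibatch is drawn without replacement, the next epoch starts from the last inner iterate, and the data are synthetic. Function names and the particular batch sizes are hypothetical.

```python
import numpy as np

def sample_gradients(theta, omega, phi, phi_next, rewards, gamma):
    """Rows are the per-sample stacked gradients (primal; negative dual)."""
    delta_phi = phi - gamma * phi_next
    phi_omega = phi @ omega
    top = -delta_phi * phi_omega[:, None]
    bottom = phi * (delta_phi @ theta - rewards + phi_omega)[:, None]
    return np.hstack([top, bottom])

def batching_svrg(phi, phi_next, rewards, gamma, sigma_theta, sigma_omega,
                  batch_sizes, n_inner, seed=0):
    rng = np.random.default_rng(seed)
    n, d = phi.shape
    theta, omega = np.zeros(d), np.zeros(d)
    for batch_size in batch_sizes:                 # one entry per epoch
        theta_snap, omega_snap = theta.copy(), omega.copy()
        batch = rng.choice(n, size=min(batch_size, n), replace=False)
        # minibatch estimate of the full gradient at the snapshot
        mu = sample_gradients(theta_snap, omega_snap, phi[batch],
                              phi_next[batch], rewards[batch], gamma).mean(axis=0)
        for _ in range(n_inner):
            row = slice(t := rng.integers(n), t + 1)
            g_cur = sample_gradients(theta, omega, phi[row], phi_next[row],
                                     rewards[row], gamma)[0]
            g_snap = sample_gradients(theta_snap, omega_snap, phi[row],
                                      phi_next[row], rewards[row], gamma)[0]
            v = g_cur + mu - g_snap                # SVRG direction with inexact snapshot gradient
            theta = theta - sigma_theta * v[:d]
            omega = omega - sigma_omega * v[d:]
    return theta, omega

rng = np.random.default_rng(1)
n, d, gamma = 200, 4, 0.9
phi, phi_next = rng.normal(size=(n, d)), rng.normal(size=(n, d))
rewards = rng.normal(size=n)
theta, omega = batching_svrg(phi, phi_next, rewards, gamma, 0.01, 0.01,
                             batch_sizes=[25, 50, 100, 200, 200], n_inner=50)
print(theta, omega)
```

When the batch size reaches `n`, the minibatch gradient coincides with the full gradient and the epoch reduces to a plain SVRG epoch.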
5.2 Stochastically Controlled Stochastic Gradient (SCSG) for Policy Evaluation
Algorithm 2 presents Stochastically Controlled Stochastic Gradient (SCSG) for policy evaluation. SCSG was initially introduced for convex minimization by Lei and Jordan (2017). Here, we apply it to our convex-concave saddle-point problem. Similarly to Batching SVRG for policy evaluation in Algorithm 1, Algorithm 2 computes the gradient on a subset of training examples at each epoch, but the minibatch size $B$ is fixed in advance rather than varying. Moreover, instead of being fixed, the number of inner-loop iterations $K_m$ in Algorithm 2 is sampled from a geometrically distributed random variable:
$$K_m \sim \mathrm{Geom}\left(\frac{B}{B+1}\right)$$
for each epoch $m$.
6 Convergence Analysis
6.1 Notations and Preliminary
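In order to characterize the convergence rates of the proposed Algorithms 1 and 2, we need to introduce some new notation and state new assumptions.
We denote by $\|A\|$ the spectral norm of the matrix $A$ and by $\kappa(A) = \|A\|\,\|A^{-1}\|$ its condition number. If the eigenvalues of a matrix $A$ are real, we use $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ to denote respectively the largest and the smallest eigenvalue.
If we set $\sigma_\omega = \beta\sigma_\theta$ for a positive constant $\beta$, it is possible to write the inner loop update (line 9 in both algorithms) as an update for the vector $z = (\theta; \omega/\sqrt{\beta})$ as follows:
$$z_{m,j+1} = z_{m,j} - \sigma_\theta\left(G_{t_j}\left(z_{m,j} - \tilde{z}_m\right) + G_{\mathcal{B}_m}\tilde{z}_m - g_{\mathcal{B}_m}\right)$$
where:
$$G_t = \begin{pmatrix} 0 & -\sqrt{\beta}A_t^\top \\ \sqrt{\beta}A_t & \beta C_t \end{pmatrix}, \qquad g_t = \begin{pmatrix} 0 \\ \sqrt{\beta}\,b_t \end{pmatrix}$$
and their corresponding averages over the minibatch $\mathcal{B}_m$:
$$G_{\mathcal{B}_m} = \frac{1}{B_m}\sum_{t \in \mathcal{B}_m} G_t, \qquad g_{\mathcal{B}_m} = \frac{1}{B_m}\sum_{t \in \mathcal{B}_m} g_t$$
Let us now define the matrix $G$ (the vector $g$) as the average of the matrices $G_t$ (vectors $g_t$) over the entire dataset:
$$G = \frac{1}{n}\sum_{t=1}^{n} G_t, \qquad g = \frac{1}{n}\sum_{t=1}^{n} g_t$$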
To simplify notations, we overload the notation . Another important quantity that characterizes smoothness of our problem is defined below as:
(13) 
The matrix $G$ (the dataset average of the per-sample update matrices) will play a key role in the convergence analysis of both Algorithms 1 and 2. Du et al. (2017) have already studied the spectral properties of $G$, as it was critical for the convergence of SVRG for policy evaluation. The following lemma, restated from Du et al. (2017), shows the condition that the step-size ratio $\beta$ should satisfy so that $G$ is diagonalizable with all its eigenvalues real and positive.
Assumption 1.
nonsingular and is definite positive. This implies that the saddlepoint problem admits a unique solution and we define .
Lemma 1.
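If the assumptions of Lemma 1 hold, we can write $G$ as $G = Q\Lambda Q^{-1}$, where $\Lambda$ is a diagonal matrix whose diagonal entries are the eigenvalues of $G$, and $Q$ consists of its eigenvectors as columns. We define the residual vector $\Delta = z - z^\star$, where $z^\star$ is the saddle point expressed in the same coordinates as the iterate $z$. To study the behaviour of our algorithms, we use the potential function $\|Q^{-1}\Delta\|^2$. As $\|\Delta\|^2 \le \|Q\|^2\,\|Q^{-1}\Delta\|^2$, the convergence of $\|Q^{-1}\Delta\|^2$ implies the convergence of $\|\Delta\|^2$.
6.2 Convergence of Batching SVRG for Policy Evaluation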
In order to study the behavior of Algorithm 1, we define the error incurred at epoch $m$. This error comes from computing the gradients over a minibatch instead of the entire dataset.
(14) 
The stochastic update of the inner loop could be written as follows:
(15) 
Note that if the minibatch is the entire dataset, the error is zero and we recover the convergence rate of SVRG in Theorem 1. Moreover, we can still maintain the linear convergence rate if the error term vanishes at an appropriate rate. In particular, the corollary below provides a possible batching strategy to control the error term.
Corollary 1.
Suppose that the assumptions of Theorem 1 hold. If the sample variance of the norms of the vectors is bounded by a constant : and we set for some constants and then we obtain:
(17) 
We conclude that an exponentially increasing schedule of minibatch sizes achieves a linear convergence rate for Batching SVRG. Moreover, this batching strategy saves many gradient computations in the early stages of the algorithm compared to vanilla SVRG.
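The saving from an exponentially increasing schedule can be illustrated with simple arithmetic. The sketch below is an assumption-laden example rather than the schedule of Corollary 1: it uses a doubling schedule capped at the dataset size, and it counts one gradient evaluation per minibatch sample plus two per inner iteration. The function names and all numerical values are hypothetical.

```python
import math

def exponential_batch_sizes(n, initial_size, growth, n_epochs):
    """Minibatch sizes that grow geometrically and are capped at n."""
    return [min(n, math.ceil(initial_size * growth ** m)) for m in range(n_epochs)]

def gradient_evaluations(batch_sizes, n_inner):
    # Each epoch: one gradient per minibatch sample for the snapshot estimate,
    # plus two per inner iteration (current iterate and snapshot).
    return sum(size + 2 * n_inner for size in batch_sizes)

n, n_epochs, n_inner = 10_000, 10, 1_000
sizes = exponential_batch_sizes(n, initial_size=100, growth=2.0, n_epochs=n_epochs)
batching_cost = gradient_evaluations(sizes, n_inner)
svrg_cost = gradient_evaluations([n] * n_epochs, n_inner)   # full pass every epoch

print(sizes)
print(batching_cost, svrg_cost)
```

Under these assumed values, most of the saving comes from the early epochs, where the minibatch is a small fraction of the dataset.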
6.3 Convergence of SCSG for Policy Evaluation
Algorithm 2 considers a fixed minibatch size $B$ instead of a varying size as in Algorithm 1. Moreover, the number of inner-loop iterations is sampled from a geometric distribution, i.e. $K_m \sim \mathrm{Geom}\left(\frac{B}{B+1}\right)$, which implies that the number of iterations is equal in expectation to $B$.
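Sampling the inner-loop length requires some care with conventions. The sketch below assumes the convention of Lei and Jordan (2017), where the geometric variable takes values in $\{0, 1, 2, \dots\}$ with $P(K = k) \propto \left(\frac{B}{B+1}\right)^k$, so that its mean is $B$. NumPy's `Generator.geometric` is supported on $\{1, 2, \dots\}$, hence the shift by one. The function name is hypothetical.

```python
import numpy as np

def sample_inner_loop_lengths(batch_size, rng, size=None):
    """Draw inner-loop lengths with support {0, 1, ...} and mean `batch_size`."""
    # NumPy's geometric distribution counts trials up to the first success,
    # so it has support {1, 2, ...} and mean 1/p. With p = 1 / (B + 1),
    # subtracting one gives mean (B + 1) - 1 = B.
    p = 1.0 / (batch_size + 1)
    return rng.geometric(p, size=size) - 1

rng = np.random.default_rng(0)
batch_size = 20
lengths = sample_inner_loop_lengths(batch_size, rng, size=200_000)
print(lengths.min(), lengths.mean())   # the empirical mean is close to batch_size
```

Because the inner loop has about $B$ iterations on average, an SCSG epoch touches on the order of $B$ samples in total, independently of the dataset size $n$.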
Before stating the convergence result, we introduce the complexity measure defined as follows:
(18) 
This quantity is equivalent to the complexity measure introduced by Lei and Jordan (2017) to motivate and analyze SCSG for convex finite-sum minimization problems, which is defined as:
(19) 
Theorem 2.
Suppose assumption 1 holds. We set and . We choose , so that . Assume that the dataset size is large enough: , then we obtain
(20) 
Corollary 2.
Technical detail
In the proof of Theorem 2, we set the value of the step size early in the proof to simplify our derivation. In fact, we could continue the derivation, at the price of more complicated expressions, and then set the step size to depend on the batch size $B$. We conjecture that, by doing this, the computational cost would be: . In particular, we could drop the assumption that $n$ is large enough.
In Table 1, GTD2 is the cheapest computationally but it has a sublinear convergence rate. Both SVRG and SCSG achieve a linear convergence rate. When the sample size $n$ is small, SVRG and SCSG have an equivalent computational cost. However, when $n$ is large and the required accuracy is low, SCSG saves unnecessary computations and is able to achieve the target accuracy with potentially less than a single pass through the dataset.
7 Related Works
Stochastic gradient methods (Robbins and Monro, 1951) are the most popular methods for optimizing convex problems with a finite sum structure, but they have a slow convergence rate due to their inherent variance. Later, various works showed that a faster convergence rate is possible provided that the objective function is strongly convex and smooth. Some representative ones are SAG, SVRG and SAGA (Roux et al., 2012; Johnson and Zhang, 2013; Defazio et al., 2014). Among these methods, SVRG has no memory requirements but requires a lot of computation. There have been attempts to make SVRG computationally efficient for minimizing convex problems (Harikandeh et al., 2015; Lei and Jordan, 2017), but they do not directly apply to our problem of interest, a convex-concave saddle-point problem without strong convexity in the primal variable. A general convex-concave saddle-point problem can be solved with a linear convergence rate (Balamurugan and Bach, 2016), but their method requires strong convexity in the primal variable, and the proximal mappings of the variables in our problem are difficult to compute (Du et al., 2017).
Many existing works study policy evaluation with linear function approximation. Gradient based approaches (Baird, 1995; Sutton et al., 2008, 2009; Liu et al., 2015) choose different objective functions, and the parameters of the value function are optimized toward the solutions of these objective functions. Least-squares approaches (Bradtke and Barto, 1996; Boyan, 2002) directly compute the closed-form solutions and have high computation costs because they need to compute matrix inverses. The idea of SVRG has been applied to policy evaluation (Korda and L.A., 2015; Du et al., 2017). In this work, we extend SVRG for policy evaluation, proposed in Du et al. (2017), and show that the amount of computation can be reduced while keeping linear convergence guarantees.
In the control case, Papini et al. (2018) adapt SVRG to policy gradient and use minibatches to approximate the full gradient, similarly to our work. However, their problem is a non-convex minimization and they obtain a sublinear convergence rate.
8 Experiments
We compare the empirical performance of our proposed algorithms with SVRG on two benchmarks: Random MDP and Mountain Car. Details of the two environments, along with other experimental details, are given in the supplementary material. Figure 1 demonstrates that Batching SVRG is able to achieve the same performance as SVRG while using significantly less data. The two algorithms have identical empirical performance in the Random MDP environment. In the Mountain Car environment, Batching SVRG's performance is worse than SVRG's in early epochs, but it later reaches the same level of objective values and has the same convergence speed as SVRG. This is expected because our theoretical result suggests that having an approximation error will not affect the overall convergence rate if the error decreases at an appropriate rate. Figure 2 shows the performance of SCSG and SVRG. We plot our results against two metrics: the number of epochs and the number of times a method has used data samples. Since SVRG uses the entire dataset to evaluate the full gradient in every epoch, its performance is not as good as that of SCSG in terms of the amount of data used. We also observe that SVRG is better than SCSG in terms of the number of epochs. This is not surprising as SCSG samples its number of inner-loop iterations from a geometric distribution, so an epoch of SCSG is significantly shorter than an epoch of SVRG.
9 Conclusion and future work
In this paper, we show that Batching SVRG and SCSG converge linearly when solving the saddle-point formulation of MSPBE. This problem is convex-concave and is not strongly convex in the primal variable, so it is very different from the original objective functions that Batching SVRG and SCSG were designed to solve. Our algorithms are practical because they require fewer gradient evaluations than vanilla SVRG for policy evaluation. It would be useful in the future to obtain more empirical evaluations of the proposed algorithms. In general, we think that there is a lot of room for applying more efficient optimization algorithms to problems in reinforcement learning, in order to obtain better theoretical guarantees and to improve sample and computational efficiency.
References
Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. In International Conference on Machine Learning.
Balamurugan, P. and Bach, F. (2016). Stochastic variance reduction methods for saddle-point problems. In Advances in Neural Information Processing Systems.
Bertsekas, D. P. (2011). Temporal difference methods for general projected equations. IEEE Transactions on Automatic Control.
Boyan, J. (2002). Technical update: Least-squares temporal difference learning. Machine Learning, 49(2):233–246.
Bradtke, S. J. and Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33–57.
Defazio, A., Bach, F., and Lacoste-Julien, S. (2014). SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems.
Du, S. S., Chen, J., Li, L., Xiao, L., and Zhou, D. (2017). Stochastic variance reduction methods for policy evaluation. In International Conference on Machine Learning.
Harikandeh, R., Ahmed, M. O., Virani, A., Schmidt, M., Konečný, J., and Sallinen, S. (2015). Stop wasting my gradients: Practical SVRG. In Advances in Neural Information Processing Systems.
Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems.
Korda, N. and L.A., P. (2015). On TD(0) with function approximation: Concentration bounds and a centered variant with exponential convergence. In International Conference on Machine Learning.
Lei, L. and Jordan, M. I. (2017). Less than a single pass: Stochastically controlled stochastic gradient method. In International Conference on Artificial Intelligence and Statistics.
Liu, B., Liu, J., Ghavamzadeh, M., Mahadevan, S., and Petrik, M. (2015). Finite-sample analysis of proximal gradient TD algorithms. In Conference on Uncertainty in Artificial Intelligence.
Papini, M., Binaghi, D., Canonaco, G., Pirotta, M., and Restelli, M. (2018). Stochastic variance-reduced policy gradient. In International Conference on Machine Learning.
Robbins, H. and Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, pages 400–407.
Roux, N. L., Schmidt, M., and Bach, F. (2012). Minimizing finite sums with the stochastic average gradient. In Advances in Neural Information Processing Systems.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44.
Sutton, R. S. and Barto, A. G. (1998). Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition.
Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., and Wiewiora, E. (2009). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In International Conference on Machine Learning, pages 993–1000.
Sutton, R. S., Szepesvári, C., and Maei, H. R. (2008). A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems.
Touati, A., Bacon, P.-L., Precup, D., and Vincent, P. (2018). Convergent tree backup and retrace with function approximation. In International Conference on Machine Learning, pages 4962–4971.
Appendix A Proof of Theorem 1
Proof.
Define the residual vector and as:
(22) 
and are and at the beginning of epoch . and are and at epoch and iteration of the inner loop . and are optimal solutions of (9). From the first order optimality condition, we know that
The above equality is obtained by setting (10) to a zero vector.
By writing out Algorithm 1’s update, we have:
(23) 
We define the error coming from using a minibatch to compute the gradients at epoch $m$:
(24) 
We then obtain:
(25) 
Subtract both sides by and use the first order optimality condition. We obtain:
(26) 
We set $\beta$ so that $G$ is diagonalizable by Lemma 1. Let $G = Q\Lambda Q^{-1}$ where $Q$ contains the eigenvectors and $\Lambda$ contains the eigenvalues of $G$. Multiply both sides of the previous equation by $Q^{-1}$, then take the squared 2-norm and the expectation. Set . We get:
(27) 
The cross term in the second equality is simplified by using and is independent with , and
. We use in the last inequality that same independence and that the variance of a random variable is less than its second moment.
We borrow the following useful inequalities from appendix C of Du et al. (2017).
(28) 
(29) 
Now we bound the cross term in (A):
(30) 
The first inequality is obtained by the Cauchy-Schwarz inequality. The last inequality follows from the fact that for any and we select in order for the inequality to hold.
(31) 
If we choose , then and are smaller than which implies that:
(32) 
Note that , because of the following inequalities cited from Appendix C in Du et al. (2017):
(33) 
Now unrolling the above inequality from to , we obtain: