Reinforcement learning (RL) (Sutton and Barto, 1998) is intrinsically linked to fixed-point computation: the optimal value function is the fixed point of the (nonlinear) Bellman optimality operator, and the value function of a given policy is the fixed point of the related (linear) Bellman evaluation operator. Most of the time, these fixed points are computed recursively, by repeatedly applying the operator of interest. Notable exceptions are the evaluation step of policy iteration and the least-squares temporal differences (LSTD) algorithm (Bradtke and Barto, 1996). (Footnote 1: In the realm of deep RL, as far as we know, all fixed points are computed iteratively; there is no LSTD.)
Anderson (1965) acceleration (also known as Anderson mixing, Pulay mixing, direct inversion on the iterative subspace or DIIS, among others) is a method that allows speeding up the computation of such fixed points. (Footnote 2: Anderson acceleration and variations have been rediscovered a number of times in various communities in the last 50 years. Walker and Ni (2011) provide a brief overview of these methods, and a close approach has been recently proposed in the machine learning community (Scieur et al., 2016).) The classic fixed-point iteration repeatedly applies the operator to the last estimate. Anderson acceleration considers the $m$ previous estimates. Then, it searches for the point that has minimal residual within the subspace spanned by these estimates, and applies the operator to it. This approach has been successfully applied to fields such as electronic structure computation or computational chemistry, but it has never been applied to dynamic programming or reinforcement learning, as far as we know. For more about Anderson acceleration, refer to Walker and Ni (2011), for example.
2 Anderson Acceleration for Value Iteration
In this section, we briefly review Markov decision processes, value iteration, and show how Anderson acceleration can be used to speed up convergence.
2.1 Value Iteration
Let $\Delta_X$ be the set of probability distributions over a finite set $X$ and $Y^X$ the set of maps from $X$ to the set $Y$. By convention, all vectors are column vectors. A Markov Decision Process (MDP) is a tuple $\{S, A, P, R, \gamma\}$, where $S$ is the finite state space, $A$ is the finite action space, $P \in \Delta_S^{S \times A}$ is the Markovian transition kernel ($P(s'|s,a)$ denotes the probability of transiting to $s'$ when action $a$ is applied in state $s$), $R \in \mathbb{R}^{S \times A}$ is the bounded reward function ($R(s,a)$ represents the local benefit of doing action $a$ in state $s$) and $\gamma \in (0,1)$ is the discount factor.
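A stochastic policy $\pi \in \Delta_A^S$ associates a distribution over actions to each state (deterministic policies being a special case of this). The policy-induced reward and transition kernels, $R_\pi \in \mathbb{R}^S$ and $P_\pi \in \Delta_S^S$, are defined as
$$R_\pi(s) = \mathbb{E}_{a \sim \pi(\cdot|s)}[R(s,a)] \quad \text{and} \quad P_\pi(s'|s) = \mathbb{E}_{a \sim \pi(\cdot|s)}[P(s'|s,a)].$$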
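The quality of a policy is quantified by the associated value function $v_\pi \in \mathbb{R}^S$:
$$v_\pi(s) = \mathbb{E}\Big[\sum_{t \ge 0} \gamma^t R(s_t, a_t) \,\Big|\, s_0 = s,\ a_t \sim \pi(\cdot|s_t),\ s_{t+1} \sim P(\cdot|s_t, a_t)\Big].$$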
The value $v_\pi$ is the unique fixed point of the Bellman operator $T_\pi$, defined as $T_\pi v = R_\pi + \gamma P_\pi v$ for any $v \in \mathbb{R}^S$.
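The fixed-point characterization can be checked numerically. The sketch below is only an illustration under stated assumptions: the two-state MDP, the uniform policy, and the function names `policy_kernels` and `evaluate_policy` are made up for the example. It builds the policy-induced reward and kernel, solves the linear system $(I - \gamma P_\pi) v = R_\pi$ directly, and prints the Bellman residual of the result.

```python
import numpy as np

def policy_kernels(P, R, pi):
    """Return (R_pi, P_pi) for P of shape (S, A, S), R and pi of shape (S, A)."""
    R_pi = np.einsum("sa,sa->s", pi, R)      # R_pi(s) = sum_a pi(a|s) R(s, a)
    P_pi = np.einsum("sa,sat->st", pi, P)    # P_pi(t|s) = sum_a pi(a|s) P(t|s, a)
    return R_pi, P_pi

def evaluate_policy(P, R, gamma, pi):
    """Solve v = R_pi + gamma * P_pi v directly (v is the fixed point of T_pi)."""
    R_pi, P_pi = policy_kernels(P, R, pi)
    return np.linalg.solve(np.eye(R_pi.size) - gamma * P_pi, R_pi)

# Hypothetical 2-state, 2-action MDP (numbers chosen arbitrarily).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [0.5, 2.0]])
gamma = 0.9
pi = np.full((2, 2), 0.5)                    # uniform stochastic policy

v_pi = evaluate_policy(P, R, gamma, pi)
R_pi, P_pi = policy_kernels(P, R, pi)
# Bellman residual of v_pi under T_pi; it should be at rounding level.
print(np.max(np.abs(R_pi + gamma * P_pi @ v_pi - v_pi)))
```

The direct solve is affordable here because the state space is tiny; it corresponds to the evaluation step of policy iteration mentioned in the introduction.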
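Define the second Bellman operator $T_*$ (the Bellman optimality operator) as, for any $v \in \mathbb{R}^S$, $T_* v = \max_\pi T_\pi v$, the maximum being taken componentwise. This operator is a $\gamma$-contraction (in supremum norm), so the iteration
$$v_{k+1} = T_* v_k$$
converges to its unique fixed point $v_*$, for any $v_0 \in \mathbb{R}^S$. This is the value iteration algorithm.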
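As an illustration, value iteration on a tabular MDP fits in a few lines of NumPy. This is a minimal sketch, not the authors' code: the array layout (`P` of shape `(S, A, S)`, `R` of shape `(S, A)`), the stopping tolerance, and the toy MDP are assumptions made for the example.

```python
import numpy as np

def bellman_optimality(P, R, gamma, v):
    """Apply T_* to v: (T_* v)(s) = max_a [R(s, a) + gamma * sum_s' P(s'|s, a) v(s')]."""
    q = R + gamma * P @ v            # q has shape (S, A)
    return q.max(axis=1)

def value_iteration(P, R, gamma, n_iter=1000, tol=1e-10):
    """Iterate v <- T_* v from the null vector until the update is below tol."""
    v = np.zeros(R.shape[0])
    for _ in range(n_iter):
        v_next = bellman_optimality(P, R, gamma, v)
        if np.max(np.abs(v_next - v)) < tol:
            return v_next
        v = v_next
    return v

# Hypothetical 2-state, 2-action MDP (numbers chosen arbitrarily).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [0.5, 2.0]])
gamma = 0.9

v_star = value_iteration(P, R, gamma)
print(v_star)
```

Because $T_*$ is a $\gamma$-contraction, the number of iterations needed grows roughly like $1/(1-\gamma)$, which is what motivates acceleration when $\gamma$ is close to one.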
2.2 Accelerated Value Iteration
Anderson acceleration is a method that aims at accelerating the computation of the fixed point of any operator. Here, we describe it considering the Bellman operator $T_*$, which provides an accelerated value iteration algorithm.
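Assume that estimates have been computed up to iteration $k$, and that in addition to $v_k$ the $m$ previous estimates $v_{k-1}, \dots, v_{k-m}$ are known. The coefficient vector $\alpha^{k+1} \in \mathbb{R}^{m+1}$ is defined as follows:
$$\alpha^{k+1} = \operatorname*{argmin}_{\alpha \in \mathbb{R}^{m+1}} \Big\| \sum_{i=0}^{m} \alpha_i \big(T_* v_{k-m+i} - v_{k-m+i}\big) \Big\| \quad \text{s.t.} \quad \sum_{i=0}^{m} \alpha_i = 1.$$
Notice that we don’t impose a positivity condition on the coefficients. In practice, we consider the $\ell_2$-norm for this problem, but it could be a different norm (for example $\ell_1$ or $\ell_\infty$, in which case the optimization problem is a linear program). Then, the new estimate is given by:
$$v_{k+1} = \sum_{i=0}^{m} \alpha_i^{k+1}\, T_* v_{k-m+i}.$$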
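The resulting Anderson accelerated value iteration is summarized in Alg. 1. Notice that the solution to the optimization problem can be obtained analytically for the $\ell_2$-norm, using the Karush-Kuhn-Tucker conditions. With the notations of Alg. 1 and writing $\mathbf{1} \in \mathbb{R}^{m+1}$ the vector with all components equal to one, it is
$$\alpha^{k+1} = \frac{(\Delta_k^\top \Delta_k)^{-1} \mathbf{1}}{\mathbf{1}^\top (\Delta_k^\top \Delta_k)^{-1} \mathbf{1}}, \tag{6}$$
where $\Delta_k = [\delta_{k-m}, \dots, \delta_k]$ is the matrix whose columns are the residuals $\delta_i = T_* v_i - v_i$. This can be regularized to avoid ill-conditioning.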
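A minimal NumPy sketch of the accelerated scheme described above, under stated assumptions. The function names, the relative Tikhonov term `reg`, the stopping tolerance, and the toy MDP are illustrative choices, not values from the text. While fewer than $m+1$ estimates are available, the sketch simply uses those it has.

```python
import numpy as np

def bellman_optimality(P, R, gamma, v):
    """T_* v for P of shape (S, A, S) and R of shape (S, A)."""
    return (R + gamma * P @ v).max(axis=1)

def anderson_coefficients(Delta, reg=1e-10):
    """l2 solution alpha = G^-1 1 / (1^T G^-1 1), with G = Delta^T Delta.

    A small Tikhonov term, scaled by the mean diagonal of G, guards against
    ill-conditioning; `reg` is an arbitrary choice for this sketch.
    """
    G = Delta.T @ Delta
    n = G.shape[0]
    G = G + reg * (np.trace(G) / n) * np.eye(n)
    x = np.linalg.solve(G, np.ones(n))
    return x / x.sum()

def anderson_value_iteration(P, R, gamma, m=2, n_iter=500, tol=1e-8):
    v = np.zeros(R.shape[0])
    values, images = [], []              # last (at most) m + 1 pairs (v_i, T_* v_i)
    for _ in range(n_iter):
        Tv = bellman_optimality(P, R, gamma, v)
        if np.max(np.abs(Tv - v)) < tol:
            break                        # v is (numerically) a fixed point
        values.append(v)
        images.append(Tv)
        values, images = values[-(m + 1):], images[-(m + 1):]
        # Columns of Delta are the residuals T_* v_i - v_i.
        Delta = np.stack([t - u for t, u in zip(images, values)], axis=1)
        alpha = anderson_coefficients(Delta)
        v = np.stack(images, axis=1) @ alpha   # v_{k+1} = sum_i alpha_i T_* v_i
    return v

# Hypothetical 2-state, 2-action MDP (numbers chosen arbitrarily).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [0.5, 2.0]])
gamma = 0.9

v_aa = anderson_value_iteration(P, R, gamma, m=2)
print(np.max(np.abs(bellman_optimality(P, R, gamma, v_aa) - v_aa)))
```

With `m=0` the coefficient vector reduces to `[1.0]` and the loop is exactly plain value iteration, which gives a convenient baseline for comparison.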
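The rationale of this acceleration scheme is better understood with an affine operator. We consider here the Bellman evaluation operator $T_\pi$. Given the current and the $m$ previous estimates, define
$$\tilde v_{k+1} = \sum_{i=0}^{m} \alpha_i v_{k-m+i} \quad \text{with} \quad \sum_{i=0}^{m} \alpha_i = 1.$$
Thanks to this constraint, for an affine operator (here $T_\pi$), we have that
$$T_\pi \tilde v_{k+1} = \sum_{i=0}^{m} \alpha_i T_\pi v_{k-m+i}.$$
Then, one searches for a vector $\tilde v_{k+1}$ (satisfying the constraint) that minimizes the residual
$$\big\| \tilde v_{k+1} - T_\pi \tilde v_{k+1} \big\| = \Big\| \sum_{i=0}^{m} \alpha_i \big(v_{k-m+i} - T_\pi v_{k-m+i}\big) \Big\|.$$
Eventually, the new estimate is obtained by applying the operator to the vector of minimal residual: $v_{k+1} = T_\pi \tilde v_{k+1}$.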
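To make the role of the constraint explicit, the intermediate step can be written out. The derivation below uses $T_\pi v = R_\pi + \gamma P_\pi v$ and writes $v_i$ for the estimates $v_{k-m+i}$, a shorthand introduced only for this derivation.

```latex
\begin{align*}
T_\pi\Big(\sum_{i=0}^{m} \alpha_i v_i\Big)
  &= R_\pi + \gamma P_\pi \sum_{i=0}^{m} \alpha_i v_i \\
  &= \Big(1 - \sum_{i=0}^{m} \alpha_i\Big) R_\pi
     + \sum_{i=0}^{m} \alpha_i \big(R_\pi + \gamma P_\pi v_i\big) \\
  &= \Big(1 - \sum_{i=0}^{m} \alpha_i\Big) R_\pi
     + \sum_{i=0}^{m} \alpha_i T_\pi v_i .
\end{align*}
```

The first term vanishes exactly when $\sum_i \alpha_i = 1$. In that case the residual of the combination is the same combination of the individual residuals, which is why the minimization over $\alpha$ only needs the residuals that have already been computed.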
2.3 Preliminary Experimental Results
We consider Garnet problems (Archibald et al., 1995; Bhatnagar et al., 2009). They are a class of randomly built MDPs meant to be totally abstract while remaining representative of the problems that might be encountered in practice. Here, a Garnet is specified by the number of states, the number of actions and the branching factor. For each state-action couple, as many distinct next states as the branching factor are chosen randomly, and the associated probabilities are set by randomly partitioning the unit interval. The reward is null, except for a fraction of the states, where it is set to a random value drawn uniformly from a fixed interval.
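The construction can be sketched as follows. This is an illustrative generator, not the one used for the experiments. The function name `make_garnet`, all parameter values in the usage line, and the choice to make the reward depend on the state only (copied across actions) are assumptions.

```python
import numpy as np

def make_garnet(n_states, n_actions, branching, reward_fraction,
                reward_low, reward_high, seed=0):
    """Build a random Garnet MDP: P of shape (S, A, S), R of shape (S, A)."""
    rng = np.random.default_rng(seed)
    P = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            # Pick `branching` distinct next states.
            next_states = rng.choice(n_states, size=branching, replace=False)
            # Randomly partition the unit interval with branching - 1 cut points.
            cuts = np.sort(rng.uniform(size=branching - 1))
            probs = np.diff(np.concatenate(([0.0], cuts, [1.0])))
            P[s, a, next_states] = probs
    # Reward is zero except on a random subset of states.
    n_rewarded = max(1, int(round(reward_fraction * n_states)))
    rewarded = rng.choice(n_states, size=n_rewarded, replace=False)
    r = np.zeros(n_states)
    r[rewarded] = rng.uniform(reward_low, reward_high, size=n_rewarded)
    R = np.repeat(r[:, None], n_actions, axis=1)
    return P, R

# Hypothetical sizes, chosen only to keep the example small.
P, R = make_garnet(n_states=20, n_actions=3, branching=4,
                   reward_fraction=0.1, reward_low=0.0, reward_high=1.0)
print(P.shape, R.shape)
```

Fixing the seed makes each generated MDP reproducible, which helps when several algorithms must be compared on the same set of problems.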
We generate 100 random MDPs with a fixed discount factor $\gamma$. For each MDP, we apply value iteration (denoted as $m=0$ in the graphics) and Anderson accelerated value iteration for $m$ ranging from 1 to 9. The initial value function is always the null vector. We run all algorithms for the same number of iterations, and measure the normalized error for algorithm alg at iteration $k$, $\|v^{\text{alg}}_k - v_*\| / \|v_*\|$, where $v_*$ stands for the optimal value function of the considered MDP.
Fig. 1.a shows the evolution of this error ($m=0$ stands for classic value iteration). Shaded areas correspond to standard deviations and lines to means (the variability is due to the randomness of the MDPs, the algorithms being deterministic given the fixed initial value function). Fig. 1.b and 1.c show respectively the mean and the standard deviation of these errors, in a logarithmic scale. One can observe that Anderson acceleration consistently offers a significant speed-up compared to value iteration, and that rather small values of $m$ seem to be enough.
2.4 Nuancing the Acceleration
We must highlight that the optimal policy is the object of interest, the value function being only a proxy to it. Regarding the value function, its level is not that important, but its relative differences are. This is addressed by the relative value iteration algorithm (Puterman, 1994, Ch. 6.6). For a given state $s_0$, it iterates as $\tilde v_{k+1} = T_* v_k$, $v_{k+1} = \tilde v_{k+1} - \tilde v_{k+1}(s_0) \mathbf{1}$. It usually converges much faster than value iteration (towards $v_* - v_*(s_0)\mathbf{1}$), but the greedy policies with respect to each iterate’s estimated values are the same for both algorithms. This scheme can also be easily accelerated with Anderson’s approach.
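The claim that both algorithms yield the same greedy policies can be illustrated numerically. The sketch below assumes the common form of relative value iteration in which the backed-up value at the reference state is subtracted after each application of the operator. The toy MDP, the reference state `s0=0`, and the function names are made up for the example.

```python
import numpy as np

def bellman_optimality(P, R, gamma, v):
    return (R + gamma * P @ v).max(axis=1)

def greedy(P, R, gamma, v):
    """Greedy (deterministic) policy with respect to v, as an array of actions."""
    return (R + gamma * P @ v).argmax(axis=1)

def value_iteration(P, R, gamma, n_iter):
    v = np.zeros(R.shape[0])
    for _ in range(n_iter):
        v = bellman_optimality(P, R, gamma, v)
    return v

def relative_value_iteration(P, R, gamma, n_iter, s0=0):
    v = np.zeros(R.shape[0])
    for _ in range(n_iter):
        Tv = bellman_optimality(P, R, gamma, v)
        v = Tv - Tv[s0]              # remove the level, keep relative differences
    return v

# Hypothetical 2-state, 2-action MDP (numbers chosen arbitrarily).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [0.5, 2.0]])
gamma = 0.9

v_vi = value_iteration(P, R, gamma, n_iter=30)
v_rvi = relative_value_iteration(P, R, gamma, n_iter=30)
# The two iterates differ by a constant shift, so their greedy policies agree.
print(v_vi - v_rvi, greedy(P, R, gamma, v_vi), greedy(P, R, gamma, v_rvi))
```

The shift property follows from $T_*(v + c\mathbf{1}) = T_* v + \gamma c \mathbf{1}$: adding a constant to a value function changes neither the argmax over actions nor the relative differences between states.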
We provide additional results in Fig. 2 (for the same MDPs as previously). Fig. 2.a shows the error of the greedy policy, that is $\|v_* - v_{\pi_k}\| / \|v_*\|$, with $\pi_k$ being greedy with respect to $v_k$, for the first 10 iterations (same data as for Fig. 1). This is what we’re really interested in. One can observe that value iteration reaches good policies more quickly than Anderson acceleration. This is due to the fact that, while the level of the value function converges slowly, its relative differences converge more quickly.
So, we compare relative value iteration and its accelerated counterpart in Fig. 2.b (normalized error of the estimate, not of the greedy policy), to be compared to Fig. 1.b. There is still an acceleration with Anderson, at least at the beginning, but the speed-up is much smaller than in Fig. 1. We compare the error on greedy policies for the same setting in Fig. 2.c, and all approaches perform equally well.
3 Anderson Acceleration for Reinforcement Learning
So, the advantage of Anderson acceleration applied to exact value iteration on simple Garnet problems is not that clear. Yet, it could still be interesting for policy evaluation or in the approximate setting. We discuss briefly its possible applications to (deep) RL.
3.1 Approximate Dynamic Programming
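Anderson acceleration could be applied to approximate dynamic programming and related methods. For example, the well-known DQN algorithm (Mnih et al., 2015) is nothing else than a (very smart) approximate value iteration approach. A state-action value function $Q$ is estimated (rather than a value function), and this function is represented as a neural network $Q_\theta$. A target network $Q_{\theta^-}$ is maintained, and the Q-function is estimated by solving the least-squares problem (for the memory buffer $\mathcal{D} = \{(s_j, a_j, r_j, s'_j)\}_{1 \le j \le n}$)
$$\min_\theta \frac{1}{n} \sum_{j=1}^{n} \Big(r_j + \gamma \max_{a'} Q_{\theta^-}(s'_j, a') - Q_\theta(s_j, a_j)\Big)^2.$$
Anderson acceleration can be applied directly as follows. Assume that the $m$ previous target networks are maintained, and write $\theta^-_0, \dots, \theta^-_m$ for the parameters of these networks and of the current one. Define for $0 \le i \le m$ the residual vector $\delta_i \in \mathbb{R}^n$ with components
$$[\delta_i]_j = r_j + \gamma \max_{a'} Q_{\theta^-_i}(s'_j, a') - Q_{\theta^-_i}(s_j, a_j)$$
and $\Delta = [\delta_0, \dots, \delta_m]$. Solve for $\alpha$ as in Eq. (6) and define for all $1 \le j \le n$ the regression target
$$y_j = \sum_{i=0}^{m} \alpha_i \Big(r_j + \gamma \max_{a'} Q_{\theta^-_i}(s'_j, a')\Big),$$
which replaces the usual target $r_j + \gamma \max_{a'} Q_{\theta^-}(s'_j, a')$ in the least-squares problem.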
So, Anderson acceleration would modify the targets in the regression problem, the necessary coefficients being obtained with a cheap least-squares solve (provided $m$ is small enough, as suggested by our preliminary experiments). Notice that the estimate of $\alpha$ is biased, since it is the solution to a residual minimization problem built from sampled transitions. However, if this is a problem, it could probably be handled with instrumental variables, giving an LSTD-like algorithm (Bradtke and Barto, 1996). Variations of this general scheme could also be envisioned, for example by computing the vector $\alpha$ on a subset of the replay memory or even on the current mini-batch, or by considering variations of Anderson acceleration such as the one of Henderson and Varadhan (2018).
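The target modification can be sketched without any deep learning library. In the illustration below, each "target network" is stood in for by a plain Q-table of shape `(S, A)`, the batch is random, terminal transitions are ignored, and `anderson_targets` is a hypothetical helper name. All of these are simplifying assumptions rather than part of the proposed scheme.

```python
import numpy as np

def anderson_targets(q_targets, batch, gamma, reg=1e-8):
    """Mix the Bellman backups of several target Q-functions.

    q_targets: list of m + 1 arrays of shape (S, A), oldest first.
    batch: tuple (s, a, r, s_next) of 1-D arrays of equal length.
    Returns the mixed regression targets and the coefficient vector alpha.
    """
    s, a, r, s_next = batch
    # Column i: r + gamma * max_a' Q_i(s', a') on the batch.
    backups = np.stack([r + gamma * Q[s_next].max(axis=1) for Q in q_targets], axis=1)
    # Column i: Q_i(s, a) on the batch.
    current = np.stack([Q[s, a] for Q in q_targets], axis=1)
    Delta = backups - current                      # sampled Bellman residuals
    G = Delta.T @ Delta
    n = G.shape[0]
    # Tikhonov term (arbitrary size) keeps the small system well conditioned.
    G = G + reg * (np.trace(G) / n + 1e-12) * np.eye(n)
    x = np.linalg.solve(G, np.ones(n))
    alpha = x / x.sum()
    return backups @ alpha, alpha

rng = np.random.default_rng(0)
n_states, n_actions, n_samples = 6, 3, 50
q_tables = [rng.normal(size=(n_states, n_actions)) for _ in range(3)]   # m = 2
batch = (rng.integers(n_states, size=n_samples),
         rng.integers(n_actions, size=n_samples),
         rng.normal(size=n_samples),
         rng.integers(n_states, size=n_samples))

targets, alpha = anderson_targets(q_tables, batch, gamma=0.99)
print(alpha, targets[:3])
```

With a single target function the coefficient vector is `[1.0]` and the usual DQN target is recovered, so the sketch degrades gracefully to the standard regression problem.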
This acceleration scheme could be more generally applied to approximate modified policy iteration, or AMPI (Scherrer et al., 2015), which generalizes both approximate policy and value iterations. Modified policy iteration is similar to policy iteration, except that instead of computing the fixed point in the evaluation step, the Bellman evaluation operator is applied $n$ times ($n=1$ gives value iteration, $n=\infty$ policy iteration), the improvement step (computing the greedy policy) being the same (up to possible approximation). In the approximate setting, the evaluation step is usually performed by regressing $n$-step returns, but it could be done by repeatedly applying the evaluation operator, this being combined with Anderson acceleration (much like DQN, but with $T_\pi$ instead of $T_*$).
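For reference, exact (tabular) modified policy iteration can be sketched as below. The function name, the parameter `n_eval` for the number of applications of the evaluation operator, and the toy MDP are assumptions of this illustration. The approximate and accelerated variants discussed in the text are not implemented here.

```python
import numpy as np

def modified_policy_iteration(P, R, gamma, n_eval, n_iter=200):
    """Alternate a greedy improvement step with n_eval applications of T_pi."""
    n_states = R.shape[0]
    states = np.arange(n_states)
    v = np.zeros(n_states)
    for _ in range(n_iter):
        pi = (R + gamma * P @ v).argmax(axis=1)   # improvement: greedy policy
        R_pi = R[states, pi]                      # shape (S,)
        P_pi = P[states, pi]                      # shape (S, S)
        for _step in range(n_eval):               # partial evaluation: v <- T_pi v
            v = R_pi + gamma * P_pi @ v
    return v

# Hypothetical 2-state, 2-action MDP (numbers chosen arbitrarily).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [0.5, 2.0]])
gamma = 0.9

v_vi_like = modified_policy_iteration(P, R, gamma, n_eval=1)   # value iteration
v_mpi = modified_policy_iteration(P, R, gamma, n_eval=5)
print(v_vi_like, v_mpi)
```

With `n_eval=1` each outer iteration applies the greedy policy's evaluation operator once, which coincides with one application of $T_*$. Larger values move the scheme towards policy iteration.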
3.2 Policy Optimization
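Another popular approach in reinforcement learning is policy optimization, or direct policy search (Deisenroth et al., 2013), which maximizes $J(\theta) = \mathbb{E}_{s \sim \nu}[v_{\pi_\theta}(s)]$ (or a proxy), for a user-defined state distribution $\nu$, over a class of parameterized policies $\pi_\theta$. This is classically done by performing a gradient ascent:
$$\theta_{k+1} = \theta_k + \eta \nabla_\theta J(\theta)\big|_{\theta = \theta_k}. \tag{13}$$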
This gradient is given by $\nabla_\theta J(\theta) = \frac{1}{1-\gamma} \mathbb{E}_{s \sim d_{\nu,\pi_\theta},\, a \sim \pi_\theta(\cdot|s)}[Q_{\pi_\theta}(s,a) \nabla_\theta \ln \pi_\theta(a|s)]$, where $d_{\nu,\pi_\theta}$ is the discounted state occupancy induced by $\pi_\theta$ when starting from $\nu$. Thus, it depends on the state-action value function of the current policy. This gradient can be estimated with rollouts, but it is quite common to estimate the Q-function itself. Related approaches are known as actor-critic methods (the actor being the policy, and the critic the Q-function). It is quite common to estimate the critic using a SARSA-like approach, especially in deep RL. In other words, the critic is estimated by repeatedly applying the Bellman evaluation operator. Therefore, Anderson acceleration could be applied, in the same spirit as what we described for DQN.
Yet, Anderson acceleration could also be used to speed up the convergence of the policy. Consider the gradient ascent in Eq. (13). This can be seen as a fixed-point iteration to solve $\theta = \theta + \eta \nabla_\theta J(\theta)$. Anderson acceleration could thus be used to speed it up. Seeing gradient descent as a fixed-point iteration is not new (Jung, 2017), nor is applying Anderson acceleration to speed it up (Scieur et al., 2016; Xie et al., 2018). Yet, it has never been applied to policy optimization, as far as we know.
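The fixed-point view of gradient ascent can be illustrated on a toy problem. The sketch below is not a policy optimization example: it assumes a hypothetical concave quadratic objective $J(\theta) = b^\top \theta - \frac{1}{2} \theta^\top A \theta$, for which the ascent step is an affine map. The helper `anderson_fixed_point`, the step size, and the regularization are arbitrary illustrative choices.

```python
import numpy as np

def anderson_fixed_point(g, x0, m=2, n_iter=1000, tol=1e-10, reg=1e-10):
    """Anderson-accelerated iteration for x = g(x), keeping the last m + 1 iterates."""
    x = np.asarray(x0, dtype=float)
    xs, gs = [], []
    for _ in range(n_iter):
        gx = g(x)
        if np.max(np.abs(gx - x)) < tol:
            break
        xs.append(x)
        gs.append(gx)
        xs, gs = xs[-(m + 1):], gs[-(m + 1):]
        Delta = np.stack([gi - xi for xi, gi in zip(xs, gs)], axis=1)  # residuals
        G = Delta.T @ Delta
        n = G.shape[0]
        y = np.linalg.solve(G + reg * (np.trace(G) / n) * np.eye(n), np.ones(n))
        alpha = y / y.sum()                     # coefficients sum to one
        x = np.stack(gs, axis=1) @ alpha        # x_{k+1} = sum_i alpha_i g(x_i)
    return x

# Hypothetical concave quadratic: grad J(theta) = b - A theta.
A = np.diag([1.0, 10.0])
b = np.array([1.0, 1.0])
eta = 0.05

def ascent_step(theta):
    """One gradient ascent step, seen as the map theta -> theta + eta * grad J(theta)."""
    return theta + eta * (b - A @ theta)

theta_plain = anderson_fixed_point(ascent_step, np.zeros(2), m=0)   # plain ascent
theta_aa = anderson_fixed_point(ascent_step, np.zeros(2), m=2)      # accelerated
print(theta_plain, theta_aa)
```

Both runs should approach the maximizer $A^{-1} b$. For a real policy gradient the map is neither affine nor noise-free, so this only illustrates the mechanics, not the expected gain.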
- Anderson (1965) Donald G Anderson. Iterative procedures for nonlinear integral equations. Journal of the ACM (JACM), 12(4):547–560, 1965.
- Archibald et al. (1995) TW Archibald, KIM McKinnon, and LC Thomas. On the generation of Markov decision processes. Journal of the Operational Research Society, pages 354–361, 1995.
- Bhatnagar et al. (2009) Shalabh Bhatnagar, Richard S Sutton, Mohammad Ghavamzadeh, and Mark Lee. Natural actor-critic algorithms. Automatica, 45(11):2471–2482, 2009.
- Bradtke and Barto (1996) Steven J. Bradtke and Andrew G. Barto. Linear Least-Squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33–57, 1996.
- Deisenroth et al. (2013) Marc Peter Deisenroth, Gerhard Neumann, Jan Peters, et al. A survey on policy search for robotics. Foundations and Trends® in Robotics, 2(1–2):1–142, 2013.
- Fang and Saad (2009) Haw-ren Fang and Yousef Saad. Two classes of multisecant methods for nonlinear acceleration. Numerical Linear Algebra with Applications, 16(3):197–221, 2009.
- Henderson and Varadhan (2018) Nicholas C Henderson and Ravi Varadhan. Damped Anderson acceleration with restarts and monotonicity control for accelerating EM and EM-like algorithms. arXiv preprint arXiv:1803.06673, 2018.
- Jung (2017) Alexander Jung. A fixed-point of view on gradient methods for big data. Frontiers in Applied Mathematics and Statistics, 3:18, 2017.
- Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
- Puterman (1994) Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience, 1994.
- Scherrer et al. (2015) Bruno Scherrer, Mohammad Ghavamzadeh, Victor Gabillon, Boris Lesner, and Matthieu Geist. Approximate modified policy iteration and its application to the game of Tetris. Journal of Machine Learning Research, 16:1629–1676, 2015.
- Scieur et al. (2016) Damien Scieur, Alexandre d’Aspremont, and Francis Bach. Regularized nonlinear acceleration. In Advances In Neural Information Processing Systems, pages 712–720, 2016.
- Sutton and Barto (1998) Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press Cambridge, 1998.
- Toth and Kelley (2015) Alex Toth and CT Kelley. Convergence analysis for Anderson acceleration. SIAM Journal on Numerical Analysis, 53(2):805–819, 2015.
- Walker and Ni (2011) Homer F Walker and Peng Ni. Anderson acceleration for fixed-point iterations. SIAM Journal on Numerical Analysis, 49(4):1715–1735, 2011.
- Xie et al. (2018) Guangzeng Xie, Yitan Wang, Shuchang Zhou, and Zhihua Zhang. Interpolatron: Interpolation or extrapolation schemes to accelerate optimization for deep neural networks. arXiv preprint arXiv:1805.06753, 2018.