For the general optimization problem $\min_{x\in\mathbb{R}^d} f(x)$, there exist several techniques to accelerate standard gradient descent, e.g., Nesterov momentum [17] and Katyusha momentum [2]. There are also various vector sequence acceleration methods developed in the numerical analysis literature, e.g., [5, 26, 27, 6, 7]. Roughly speaking, if a vector sequence converges very slowly to its limit, then one may apply such methods to accelerate its convergence. Taking gradient descent as an example, the vector sequence $\{x_t\}$ is generated by $x_{t+1} = x_t - \eta\nabla f(x_t)$, where the limit is the fixed point $x^*$ (i.e., $\nabla f(x^*) = 0$). One notable advantage of such acceleration methods is that they usually do not require knowledge of how the vector sequence is actually generated. Thus these methods are very widely applicable.
Recently, Scieur et al. [23] used the minimal polynomial extrapolation (MPE) method for convergence acceleration. This is a nice example of applying sequence acceleration methods to optimization problems. In this paper, we are interested in another classical sequence acceleration method called Anderson acceleration (or Anderson mixing), which was proposed by Anderson in 1965 [3]. The method is known to be quite efficient in a variety of applications [9, 20, 14, 16]. The idea of Anderson mixing is to maintain $m$ recent iterations for determining the next iteration point, where $m$ is a parameter (typically a very small constant). Thus, it can be viewed as an extension of the existing momentum methods, which usually use only the last and current points to determine the next iteration point. Anderson mixing with slight modifications is formally described in Algorithm 1.
Note that the step in Line 7 of Algorithm 1 can be transformed into an equivalent unconstrained least-squares problem, referred to as problem (1). Using QR decomposition, (1) can be solved in time $O(m^2 d)$, where $d$ is the dimension. Moreover, the QR decomposition of (1) at iteration $t$ can be efficiently obtained from that at iteration $t-1$ in $O(md)$ time (see, e.g., [12]). The constant $m$ is usually very small (see the numerical experiments in Section 5). Hence, each iteration of Anderson mixing can be implemented quite efficiently.
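As a concrete illustration of the least-squares step, the following minimal Python sketch runs Anderson mixing with history size $m = 1$ on a small diagonal quadratic. It assumes the standard update $x_{t+1} = \sum_i \alpha_i x_i + \beta \sum_i \alpha_i F_i$ with residual $F_t = -\nabla f(x_t)$ and a fixed mixing parameter $\beta = 1/L$, which may differ in details from Algorithm 1. The function name `anderson_m1` and the test problem are hypothetical. For $m = 1$ the constrained problem has a single free coefficient, so its unconstrained form has a closed-form solution and no QR decomposition is needed.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def anderson_m1(grad, x0, beta, iters):
    """Anderson mixing with one stored past iterate (m = 1).

    Returns the final point and the residual norms ||F_0||, ..., ||F_iters||,
    where F_t = -grad(x_t).
    """
    x_prev = list(x0)
    f_prev = [-g for g in grad(x_prev)]
    norms = [math.sqrt(dot(f_prev, f_prev))]
    # The first step is a plain gradient step because no history exists yet.
    x = [a + beta * b for a, b in zip(x_prev, f_prev)]
    for _ in range(iters):
        f = [-g for g in grad(x)]
        norms.append(math.sqrt(dot(f, f)))
        # Minimise ||(1 - theta) * F_t + theta * F_{t-1}|| over theta;
        # the weights (theta, 1 - theta) sum to one by construction.
        d = [a - b for a, b in zip(f_prev, f)]
        dd = dot(d, d)
        theta = -dot(f, d) / dd if dd > 0.0 else 0.0
        x_mix = [(1.0 - theta) * a + theta * b for a, b in zip(x, x_prev)]
        f_mix = [(1.0 - theta) * a + theta * b for a, b in zip(f, f_prev)]
        x_prev, f_prev = x, f
        x = [a + beta * b for a, b in zip(x_mix, f_mix)]
    return x, norms

# Assumed test problem: f(x) = 0.5 * sum(lam_j * x_j^2), so mu = 1 and L = 100.
eigs = [1.0, 4.0, 9.0, 25.0, 60.0, 100.0]
grad = lambda x: [lam * xi for lam, xi in zip(eigs, x)]
_, am_norms = anderson_m1(grad, [1.0] * len(eigs), beta=1.0 / max(eigs), iters=50)
am_ratio = am_norms[-1] / am_norms[0]
print(f"||F_50|| / ||F_0|| = {am_ratio:.3e}")
```

On this linear problem the new residual equals $(I - \beta A)$ applied to the mixed residual, so each step satisfies $\|F_{t+1}\| \le (1 - \mu/L)\,\|F_t\|$ and the printed ratio is at most $0.99^{50} \approx 0.61$. Larger $m$ would replace the scalar `theta` by the solution of a small least-squares problem.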
Many studies have shown relations between Anderson mixing and other optimization methods. In particular, for the quadratic case (linear problems), Walker and Ni [29] showed that it is related to the well-known Krylov subspace method GMRES (generalized minimal residual algorithm) [22]. Furthermore, Potra and Engler [19] showed that GMRES is equivalent to Anderson mixing with any mixing parameters under $m = \infty$ (see Line 5 of Algorithm 1) for linear problems. Moreover, Toth and Kelley [28] proved the first linear convergence rate $O(\kappa\ln\frac{1}{\epsilon})$ for linear problems with a fixed parameter $\beta$, where $\kappa$ is the condition number. Besides, Eyert [10], and Fang and Saad [11] showed that Anderson mixing is related to the multisecant quasi-Newton methods (more concretely, the generalized Broyden’s second method). Despite the above results, the convergence results for this efficient method are still limited (especially for general nonlinear functions and the case where $m$ is small).
1.1 Our Contributions
We prove the optimal convergence rate $O(\sqrt{\kappa}\ln\frac{1}{\epsilon})$ of the proposed Anderson-Chebyshev mixing method (Anderson mixing with Chebyshev polynomial parameters) for minimizing quadratic functions (see Theorem 2.1). Our result improves the previous result (i.e., $O(\kappa\ln\frac{1}{\epsilon})$) using fixed parameters given by Toth and Kelley [28] and matches the lower bound (i.e., $\Omega(\sqrt{\kappa}\ln\frac{1}{\epsilon})$) provided by Nesterov [17].
Then, we prove the linear-quadratic convergence of Anderson mixing for minimizing general nonlinear problems under some reasonable assumptions (see Theorem 3.1). Compared with Newton-like methods, it is more attractive since it does not require computing (or approximating) Hessians or Hessian-vector products.
Besides, we propose a guessing algorithm (Algorithm 2) for the case when the hyperparameters (e.g., the Lipschitz smoothness parameter $L$) are not available. We prove that it achieves a similar convergence rate (see Theorem 4.1). This guessing algorithm can also be combined with other algorithms, e.g., Gradient Descent (GD) and Nesterov’s Accelerated GD (NAGD). The experimental results (see Section 5.1) show that these algorithms combined with the proposed guessing algorithm achieve much better performance.
Finally, the experimental results on the real-world UCI datasets and synthetic datasets demonstrate that Anderson mixing methods converge significantly faster than other algorithms (see Section 5). This validates that Anderson mixing methods (especially Anderson-Chebyshev mixing method) are efficient both in theory and practice.
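To give a feel for the gap between the fixed-parameter rate $O(\kappa\ln\frac{1}{\epsilon})$ and the optimal rate $O(\sqrt{\kappa}\ln\frac{1}{\epsilon})$, here is a worked example with hypothetical values $\kappa = 10^4$ and $\epsilon = 10^{-6}$. The constants hidden by the $O(\cdot)$ notation are ignored, so the numbers are only indicative.

```latex
\ln\frac{1}{\epsilon} = 6\ln 10 \approx 13.8,
\qquad
\kappa\ln\frac{1}{\epsilon} \approx 10^{4}\times 13.8 = 1.38\times 10^{5},
\qquad
\sqrt{\kappa}\ln\frac{1}{\epsilon} \approx 10^{2}\times 13.8 = 1.38\times 10^{3}.
```

For this ill-conditioned instance the two iteration counts differ by a factor of $\sqrt{\kappa} = 100$.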
1.2 Related Work
As mentioned above, Anderson mixing can be viewed as an extension of the momentum methods (e.g., NAGD) and a potential extension of Krylov subspace methods (e.g., GMRES) to nonlinear problems. In particular, GD is the special case of Anderson mixing with $m = 0$, and to some extent NAGD can be viewed as the case $m = 1$. We also review the equivalence of GMRES and Anderson mixing without truncation (i.e., $m = \infty$) in Appendix 0.A. Besides, Eyert [10], and Fang and Saad [11] showed that Anderson mixing is related to the multisecant quasi-Newton methods. Note that Anderson mixing has an advantage over Newton-like methods since it does not require computing or approximating Hessians or Hessian-vector products.
There are many sequence acceleration methods in the numerical analysis literature. In particular, the well-known Aitken’s process [1] accelerates the convergence of a sequence that is converging linearly. Shanks generalized the Aitken extrapolation, which became known as the Shanks transformation [25]. Recently, Brezinski et al. [7] proposed a general framework for Shanks sequence transformations which includes many vector sequence acceleration methods. One fundamental difference between Anderson mixing and other sequence acceleration methods (such as MPE, RRE (reduced rank extrapolation) [26, 27], etc.) is that Anderson mixing is a fully dynamic method. Here dynamic means that all iterations are in the same sequence and the procedure does not need to be restarted. It can be seen from Algorithm 1 that all iterations are applied to the same sequence $\{x_t\}$. In fact, in Capehart’s PhD thesis [9], several experiments were conducted to demonstrate the superior performance of Anderson mixing over semi-dynamic methods such as MPE and RRE (semi-dynamic means that the algorithm maintains more than one sequence or needs to restart several times).
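For readers unfamiliar with scalar sequence acceleration, the sketch below illustrates Aitken's delta-squared process on a linearly convergent fixed-point iteration. The test sequence $x_{n+1} = \cos(x_n)$ and the helper name `aitken` are choices made here for illustration and do not come from the paper.

```python
import math

def aitken(x0, x1, x2):
    """Aitken's delta-squared estimate of the limit from three consecutive terms."""
    denom = x2 - 2.0 * x1 + x0
    if denom == 0.0:
        return x2  # the sequence is already (numerically) stationary
    return x0 - (x1 - x0) ** 2 / denom

# Assumed example: x_{n+1} = cos(x_n), which converges linearly to its fixed point.
seq = [1.0]
for _ in range(10):
    seq.append(math.cos(seq[-1]))

limit = 0.7390851332151607  # fixed point of cos, used only to measure the errors
plain_error = abs(seq[-1] - limit)
aitken_error = abs(aitken(seq[-3], seq[-2], seq[-1]) - limit)
print(plain_error, aitken_error)
```

The extrapolated value only uses three consecutive terms of the sequence and no information about how they were generated, which is the property emphasized above for sequence acceleration methods in general.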
2 The Quadratic Case
In this section, we consider the problem of minimizing a quadratic function (also called least squares, or ridge regression [4, 15]). The formulation of the problem is
$$\min_{x\in\mathbb{R}^d} f(x) = \frac{1}{2}x^\top A x - b^\top x, \qquad (2)$$
where $\mu I \preceq A \preceq L I$. Note that $\mu$ and $L$ are usually called the strong convexity parameter and the Lipschitz continuous gradient parameter, respectively. There are many algorithms for optimizing this type of function. We analyze the problem of minimizing a general function in Section 3.
We prove that Anderson mixing with Chebyshev polynomial parameters achieves the optimal convergence rate. The convergence result is stated in the following Theorem 2.1.
Theorem 2.1. The Anderson-Chebyshev mixing method achieves the optimal convergence rate $O(\sqrt{\kappa}\ln\frac{1}{\epsilon})$ for obtaining an $\epsilon$-approximation solution of problem (2) for any $m$, where $\kappa$ is the condition number, the Chebyshev polynomials are defined in Definition 1, and this method combines Anderson mixing (Algorithm 1) with the Chebyshev polynomial parameters $\beta_t$.
Remark: In this quadratic case, we mention that Toth and Kelley [28] proved the first convergence rate $O(\kappa\ln\frac{1}{\epsilon})$ for a fixed parameter $\beta$. Here we use the Chebyshev polynomials to improve the result to the optimal one, i.e., $O(\sqrt{\kappa}\ln\frac{1}{\epsilon})$. Also note that in practice the constant $m$ is usually very small; small values of $m$ already achieve remarkable performance in our experiments (see Figures 2–5 in Section 5).
Definition 1. The Chebyshev polynomials are polynomials $T_k(x)$, where $k \ge 0$, defined by the recursive relation:
$$T_0(x) = 1, \qquad T_1(x) = x, \qquad T_{k+1}(x) = 2xT_k(x) - T_{k-1}(x).$$
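The key property is that $T_k$ has minimal deviation from $0$ on $[-1, 1]$ among all polynomials $Q_k$ with degree $k$ and leading coefficient $2^{k-1}$, i.e.,
$$\max_{x\in[-1,1]} |T_k(x)| \le \max_{x\in[-1,1]} |Q_k(x)|.$$
In particular, for $|x| \le 1$, Chebyshev polynomials can be written in an equivalent way:
$$T_k(x) = \cos(k \arccos x).$$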
To demonstrate it more clearly, we provide an example for $T_4$ (W-shape curve) in Figure 1. Since $k = 4$ in this polynomial, the first root is $x_1 = \cos(\pi/8)$. The remaining three roots $x_i$ for $i = 2, 3, 4$ can be easily computed too.
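The recursion and the trigonometric form of the Chebyshev polynomials can be checked numerically. The short sketch below evaluates $T_k$ by the standard three-term recursion and lists the roots of $T_4$ using the classical formula $\cos\frac{(2i-1)\pi}{2k}$; the function names are hypothetical.

```python
import math

def chebyshev(k, x):
    """Evaluate T_k(x) with the three-term recursion."""
    t_prev, t_cur = 1.0, x  # T_0 and T_1
    if k == 0:
        return t_prev
    for _ in range(k - 1):
        t_prev, t_cur = t_cur, 2.0 * x * t_cur - t_prev
    return t_cur

def chebyshev_roots(k):
    """Roots of T_k in (-1, 1): cos((2i - 1) * pi / (2k)) for i = 1..k."""
    return [math.cos((2 * i - 1) * math.pi / (2 * k)) for i in range(1, k + 1)]

roots_t4 = chebyshev_roots(4)
print(roots_t4)  # the first entry is cos(pi/8)
```

Evaluating `chebyshev(4, r)` at each listed root returns a value at rounding-error level, and on $[-1, 1]$ the recursion agrees with $\cos(4\arccos x)$.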
Proof of Theorem 2.1. For iteration , the residual can be deduced as follows:
where (9) uses .
To bound (i.e.,
), we first obtain the following lemma by using Singular Value Decomposition (SVD) to solve the least squares problem (1) and then using several transformations. We defer the proof of Lemma 1 to Appendix 0.B.2.
Let and , then
where is a degree polynomial.
According to Lemma 1, to bound , it is sufficient to bound the right-hand-side (RHS) of (11) (i.e., ). In order to bound this, we first transform into . Let , where . We have the following equalities:
where (13) uses (12), and (14) uses (see (5)). According to (8), it is not hard to see that is defined by the mixing parameters , where ). Note that the roots of standard Chebyshev polynomials (i.e., (8)) can be obtained from many textbooks, e.g., Section 1.2 of . Now, we only need to bound . First, we need to transform the form (5) of Chebyshev polynomials as follows:
Let , we get . So we have
Now, the RHS of (11) can be bounded as
Thus the Anderson-Chebyshev mixing method achieves the optimal convergence rate $O(\sqrt{\kappa}\ln\frac{1}{\epsilon})$ for obtaining an $\epsilon$-approximation solution.
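The effect of Chebyshev-based parameters can be seen already in the simplest setting. The sketch below is not the Anderson-Chebyshev mixing method itself: it applies plain gradient steps (no history) to a diagonal quadratic and compares a fixed step size $1/L$ with step sizes $1/\lambda_i$, where the $\lambda_i$ are the roots of $T_k$ mapped to $[\mu, L]$. This classical choice of step sizes, the test problem, and all names are assumptions made for illustration; the parameters in Theorem 2.1 may be defined differently.

```python
import math

def chebyshev_steps(mu, L, k):
    """Step sizes 1/lambda_i, with lambda_i the roots of T_k mapped to [mu, L]."""
    mid, half = (L + mu) / 2.0, (L - mu) / 2.0
    return [1.0 / (mid + half * math.cos((2 * i - 1) * math.pi / (2 * k)))
            for i in range(1, k + 1)]

def run(eigs, steps):
    """Gradient steps on f(x) = 0.5 * sum(lam_j * x_j^2) from x = (1, ..., 1).

    Returns the worst per-coordinate contraction max_j |x_j| after all steps.
    """
    x = [1.0] * len(eigs)
    for beta in steps:
        x = [xi - beta * lam * xi for lam, xi in zip(eigs, x)]
    return max(abs(xi) for xi in x)

mu, L, k = 1.0, 100.0, 8
eigs = [1.0, 10.0, 25.0, 50.0, 75.0, 100.0]
cheb = run(eigs, chebyshev_steps(mu, L, k))
fixed = run(eigs, [1.0 / L] * k)
# Worst case over [mu, L] of the Chebyshev residual polynomial:
# 1 / T_k((L + mu) / (L - mu)), evaluated with the cosh form valid outside [-1, 1].
bound = 1.0 / math.cosh(k * math.acosh((L + mu) / (L - mu)))
print(cheb, fixed, bound)
```

After $k$ steps the error in each eigen-direction is multiplied by a degree-$k$ polynomial in the eigenvalue. With the Chebyshev step sizes this polynomial is the rescaled $T_k$, whose maximum over $[\mu, L]$ is `bound`, whereas the fixed step size leaves the smallest eigen-direction contracted only by $(1 - \mu/L)^k$.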
3 The General Case
In this section, we analyze the Anderson mixing (Algorithm 1) in the general nonlinear case:
We prove that Anderson mixing method achieves the linear-quadratic convergence rate under the following standard Assumptions 1 and 2, where denotes the Euclidean norm. Let denote the small matrix of the least-square problem in Line 7 of Algorithm 1, i.e., (see problem (1)). Then, we define its condition number , where denotes the least non-zero singular value of . We further define .
The Hessian satisfies , where .
The Hessian is -Lipschitz continuous, i.e.,
The constant is usually very small. Particularly, we use and for the numerical experiments in Section 5. Hence is very small and also decreases as the iteration increasing.
Besides, one can also use instead of in the RHS of (20) according to the property of , i.e., and
Note that the first two terms in RHS of (20) converge quadratically and the last term converges linearly. Due to the fully dynamic property of Anderson mixing as we discussed at the end of Section 1.2, it turns out the exact convergence rate of Anderson mixing in the general case is not easy to obtain. But we note that the convergence rate is roughly since the first two quadratic terms converge much faster than the last linear term. In particular, if is a quadratic function, then and thus in (20). Only the last linear term remained, thus it converges linearly (see the following corollary).
If is a quadratic function, let step-size and . Then the convergence rate of Anderson Mixing is linear, i.e., , where is the condition number.
Note that this corollary recovers the previous result (i.e., $O(\kappa\ln\frac{1}{\epsilon})$) obtained by [28], and we use Chebyshev polynomials to improve this result to the optimal convergence rate in Section 2 (see Theorem 2.1).
Proof Sketch of Theorem 3.1: For the iteration , we have according to . First, we demonstrate several useful forms of as follows:
Then, to bound (i.e., ), we deduce as follows:
After some non-trivial calculations (details can be found in Appendix 0.B.1), we obtain
where denotes the Euclidean norm of . Then, according to the problem (1) and the definition of , we have . Finally, we bound using QR decomposition of problem (1) and recall to finish the proof of Theorem 3.1.
4 Guessing Algorithm
In this section, we provide a Guessing Algorithm (described in Algorithm 2) which guesses the parameters (e.g., ) dynamically. Intuitively, we guess the parameter and the condition number in a doubling way. Note that in general these parameters are not available, since the time for computing these parameters is almost the same as (or even longer than) solving the original problem. Also note that the condition in Line 14 of Algorithm 2 depends on the algorithm used in Line 12.
Remark: We provide a simple example to show why this guessing algorithm is useful. Note that algorithms usually need such parameters to set the step size, whether or not they are combined with Algorithm 2. Thus, we need to approximate these parameters once at the beginning. Without guessing them dynamically, one fixes the approximated values all the time, and the convergence rate cannot be better than the one determined by these approximated values. However, according to our Theorem 4.1, guessing the parameters dynamically yields a better rate.
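The doubling idea can be illustrated on plain gradient descent with an unknown smoothness parameter. The sketch below is not Algorithm 2: it is the classical backtracking-style scheme that starts from an underestimate of $L$ and doubles the guess whenever a sufficient-decrease test fails. The test condition, the function name `gd_with_doubling`, and the toy problem are all assumptions made for illustration.

```python
def gd_with_doubling(f, grad, x0, L_guess, iters):
    """Gradient descent that doubles its guess of L when a step does not decrease f enough."""
    x, L_hat = list(x0), L_guess
    for _ in range(iters):
        g = grad(x)
        g2 = sum(v * v for v in g)
        x_new = [xi - gi / L_hat for xi, gi in zip(x, g)]
        # For an L-smooth f, this sufficient-decrease test passes whenever L_hat >= L.
        if f(x_new) <= f(x) - g2 / (2.0 * L_hat):
            x = x_new
        else:
            L_hat *= 2.0  # the guess was too small: double it and retry
    return x, L_hat

eigs = [1.0, 10.0, 100.0]  # true L = 100, not given to the method
f = lambda x: 0.5 * sum(lam * xi * xi for lam, xi in zip(eigs, x))
grad = lambda x: [lam * xi for lam, xi in zip(eigs, x)]
x0 = [1.0, 1.0, 1.0]
x_final, L_final = gd_with_doubling(f, grad, x0, L_guess=1.0, iters=3000)
print(L_final, f(x_final))
```

Because the guess only grows by factors of two and stops growing once it exceeds the true value, it overshoots $L$ by at most a factor of two, at the cost of a logarithmic number of rejected steps.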
5 Experiments
In this section, we conduct numerical experiments on the real-world UCI datasets and synthetic datasets (the UCI datasets can be downloaded from https://archive.ics.uci.edu/ml/datasets.html). We compare the performance of five algorithms: the Anderson-Chebyshev mixing method (AM-Cheby), the Anderson mixing method (AM), vanilla Gradient Descent (GD), Nesterov’s Accelerated Gradient Descent (NAGD) [17], and Regularized Minimal Polynomial Extrapolation (RMPE), configured as in [23].
Concretely, Figure 2 demonstrates the convergence performance in the general nonlinear case, Figures 3–5 demonstrate the convergence performance of these algorithms in the quadratic case, and Figure 6 demonstrates the convergence performance of these algorithms combined with our guessing algorithm (Algorithm 2). The values of $m$ in the figure captions denote the mix parameter of the Anderson mixing algorithms (see Line 5 of Algorithm 1).
Figure 2 demonstrates the convergence performance of these algorithms in the general nonlinear case. Concretely, we use the negative log-likelihood as the loss function (logistic regression). We run these five algorithms on the real-world diabetes and cancer datasets, which are standard UCI datasets. The x-axis and y-axis represent the number of iterations and the norm of the gradient of the loss function, respectively.
Figures 3–5 demonstrate the convergence performance of these algorithms in the quadratic case, where $f(x) = \frac{1}{2}x^\top A x - b^\top x$. Concretely, we compare the convergence performance among these algorithms as the condition number $\kappa$ and the mix parameter $m$ are varied (e.g., the left figure in Figure 3 shows one such setting), where $m$ is the parameter of the Anderson mixing algorithms (see Line 5 of Algorithm 1). We run these five algorithms on synthetic datasets in which we randomly generate $A$ and $b$ for the quadratic function. Note that, to obtain a randomly generated $A$ with the required property, we randomly generate a matrix $B$ instead and let $A = B^\top B$.
In conclusion, Anderson mixing-type methods converge the fastest in all of our experiments, regardless of whether the problem is linear or nonlinear. The efficient Anderson mixing methods can be viewed as an extension of the momentum methods (e.g., NAGD) and a potential extension of Krylov subspace methods (e.g., GMRES) to nonlinear problems, because GD is the special case of Anderson mixing with $m = 0$, and to some extent NAGD can be viewed as the case $m = 1$. Note that the performance gap between the Anderson mixing methods and (GD, NAGD) in Figure 5 is somewhat larger than that in Figure 3. Regarding the Krylov extension, Anderson mixing without truncation is equivalent to the well-known Krylov subspace method GMRES (see Appendix 0.A), and we prove the optimal convergence rate in this quadratic case (Theorem 2.1), which matches Nesterov’s lower bound. For the general case, we prove the linear-quadratic convergence (Theorem 3.1). Compared with Newton-like methods, Anderson mixing is more attractive since it does not require computing (or approximating) Hessians or Hessian-vector products.
5.1 Experiments for Guessing Algorithm
In this subsection, we conduct the experiments for guessing the hyperparameters dynamically using Algorithm 2.
In Figure 6, we separately consider these algorithms. For each of them, we compare its convergence performance between its original version and the one combined with our guessing algorithm (Algorithm 2). The experimental results show that all these four algorithms combined with our guessing algorithm achieve much better performance than their original versions.
6 Conclusion
In this paper, we show that the Anderson-Chebyshev mixing method (Anderson mixing with Chebyshev polynomial parameters) can achieve the optimal convergence rate, which improves the previous result provided by [28]. Furthermore, we prove the linear-quadratic convergence of Anderson mixing for minimizing general nonlinear problems. Besides, if the hyperparameters (e.g., the Lipschitz smoothness parameter $L$) are not available, we propose a guessing algorithm for guessing them dynamically and also prove a similar convergence rate. Finally, the experimental results demonstrate that the efficient Anderson-Chebyshev mixing method converges significantly faster than other algorithms. This validates that the Anderson-Chebyshev mixing method is efficient both in theory and in practice.
The authors would like to thank Claude Brezinski, Rong Ge, Damien Scieur and Le Zhang for useful discussions and suggestions.
References
[1] Aitken, A.: On Bernoulli’s numerical solution of algebraic equations. Proceedings of the Royal Society of Edinburgh 46, 289–305 (1926)
[2] Allen-Zhu, Z.: Katyusha: the first direct acceleration of stochastic gradient methods. In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing. pp. 1200–1205. ACM (2017)
[3] Anderson, D.G.: Iterative procedures for nonlinear integral equations. Journal of the ACM 12(4), 547–560 (1965)
[4] Boyd, S., Vandenberghe, L.: Convex optimization. Cambridge University Press (2004)
[5] Brezinski, C.: Convergence acceleration during the 20th century. Journal of Computational and Applied Mathematics 122, 1–21 (2000)
[6] Brezinski, C., Redivo Zaglia, M.: Extrapolation methods: theory and practice (1991)
[7] Brezinski, C., Redivo-Zaglia, M., Saad, Y.: Shanks sequence transformations and Anderson acceleration. SIAM Review 60(3), 646–669 (2018)
[8] Bubeck, S.: Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning 8(3-4), 231–357 (2015)
[9] Capehart, S.R.: Techniques for accelerating iterative methods for the solution of mathematical problems. Ph.D. thesis, Oklahoma State University (1989)
[10] Eyert, V.: A comparative study on methods for convergence acceleration of iterative vector sequences. Journal of Computational Physics 124(2), 271–285 (1996)
[11] Fang, H.r., Saad, Y.: Two classes of multisecant methods for nonlinear acceleration. Numerical Linear Algebra with Applications 16(3), 197–221 (2009)
[12] Golub, G., Van Loan, C.: Matrix computations. 3rd ed., The Johns Hopkins University Press, Baltimore, MD (1996)
[13] Hageman, L.A., Young, D.M.: Applied iterative methods. Courier Corporation (2012)
[14] Higham, N.J., Strabić, N.: Anderson acceleration of the alternating projections method for computing the nearest correlation matrix. Numerical Algorithms 72(4), 1021–1042 (2016)
[15] Hoerl, A.E., Kennard, R.W.: Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12(1), 55–67 (1970)
[16] Loffeld, J., Woodward, C.S.: Considerations on the implementation and use of Anderson acceleration on distributed memory and GPU-based parallel computers. Advances in the Mathematical Sciences p. 417 (2016)
[17] Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer (2004)
[18] Olshanskii, M.A., Tyrtyshnikov, E.E.: Iterative methods for linear systems: theory and applications. SIAM (2014)
[19] Potra, F.A., Engler, H.: A characterization of the behavior of the Anderson acceleration on linear problems. Linear Algebra and its Applications 438(3), 1002–1011 (2013)
[20] Pratapa, P.P., Suryanarayana, P., Pask, J.E.: Anderson acceleration of the Jacobi iterative method: An efficient alternative to Krylov methods for large, sparse linear systems. Journal of Computational Physics 306, 43–54 (2016)
[21] Rivlin, T.J.: The Chebyshev polynomials. Wiley (1974)
[22] Saad, Y., Schultz, M.H.: GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing 7(3), 856–869 (1986)
[23] Scieur, D., d’Aspremont, A., Bach, F.: Regularized nonlinear acceleration. In: Advances in Neural Information Processing Systems. pp. 712–720 (2016)
[24] Scieur, D., Oyallon, E., d’Aspremont, A., Bach, F.: Nonlinear acceleration of deep neural networks. arXiv preprint arXiv:1805.09639 (2018)
[25] Shanks, D.: Non-linear transformations of divergent and slowly convergent sequences. Studies in Applied Mathematics 34(1-4), 1–42 (1955)
[26] Sidi, A., Ford, W.F., Smith, D.A.: Acceleration of convergence of vector sequences. SIAM Journal on Numerical Analysis 23(1), 178–196 (1986)
[27] Smith, D.A., Ford, W.F., Sidi, A.: Extrapolation methods for vector sequences. SIAM Review 29(2), 199–233 (1987)
[28] Toth, A., Kelley, C.: Convergence analysis for Anderson acceleration. SIAM Journal on Numerical Analysis 53(2), 805–819 (2015)
[29] Walker, H.F., Ni, P.: Anderson acceleration for fixed-point iterations. SIAM Journal on Numerical Analysis 49(4), 1715–1735 (2011)
Appendix 0.A GMRES vs. Anderson Mixing ($m = \infty$)
In this appendix, in order to better understand this efficient Anderson mixing method, we review the equivalence between the well-known Krylov subspace method GMRES [22] and Anderson mixing without truncation (i.e., $m = \infty$ or $m$ large enough in Line 5 of Algorithm 1) in the linear case. We emphasize that in this paper we focus on the more general hard cases where $m$ is small (since $m$ is usually finite and not very large in practice) and also on the nonlinear case.
Consider the problem of solving the linear system $Ax = b$ with a nonsingular matrix $A$. This is equivalent to solving the fixed-point problem $x = G(x)$, where $G(x) = (I - A)x + b$. Let $F(x)$ denote the residual at the point $x$, i.e., $F(x) = G(x) - x = b - Ax$. The GMRES method is an effective iterative method for linear systems which minimizes the norm of the residual vector over a Krylov subspace at every step.
Note that the Krylov subspace $\mathcal{K}_t$ is the linear span of the first $t$ gradients, and $\mathcal{K}_d$ can span the whole space $\mathbb{R}^d$. Hence the method arrives at the exact solution after $d$ iterations. It is also theoretically equivalent to the Generalized Conjugate Residual method (GCR).
Now we show that to indicate the equivalence, under the assumption for . and denote the -th GMRES iterative point and -th Anderson mixing iterative point, respectively. Let mixing parameters for all . Then, we deduce the as follows:
The equals to . Note that and . Replacing these equations into (29), we have