1 Introduction
For the general optimization problem , there exist several techniques to accelerate the standard gradient descent, e.g., Nesterov momentum [17] and Katyusha momentum [2]. There are also various vector sequence acceleration methods developed in the numerical analysis literature, e.g., [5, 26, 27, 6, 7]. Roughly speaking, if a vector sequence converges very slowly to its limit, then one may apply such methods to accelerate the convergence of this sequence. Taking gradient descent as an example, the vector sequence is generated by , where the limit is the fixed point (i.e. ). One notable advantage of such acceleration methods is that they usually do not require knowing how the vector sequence is actually generated. Thus the applicability of those methods is very wide.

Recently, Scieur et al. [23] used the minimal polynomial extrapolation (MPE) method [27] for convergence acceleration. This is a nice example of applying sequence acceleration methods to optimization problems. In this paper, we are interested in another classical sequence acceleration method called Anderson acceleration (or Anderson mixing), which was proposed by Anderson in 1965 [3]. The method is known to be quite efficient in a variety of applications [9, 20, 14, 16]. The idea of Anderson mixing is to maintain recent iterations for determining the next iteration point, where is a parameter (typically a very small constant). Thus, it can be viewed as an extension of the existing momentum methods which usually use the last and current points to determine the next iteration point. Anderson mixing with slight modifications is formally described in Algorithm 1.
Note that the step in Line 7 of Algorithm 1 can be transformed to an equivalent unconstrained least-squares problem:
(1) 
then let . Using QR decomposition, (1) can be solved in time , where is the dimension. Moreover, the QR decomposition of (1) at iteration can be efficiently obtained from that of at iteration in (see, e.g. [12]). The constant is usually very small. We use and for the numerical experiments in Section 5. Hence, each iteration of Anderson mixing can be implemented quite efficiently.

Many studies showed the relations between Anderson mixing and other optimization methods. In particular, for the quadratic case (linear problems), Walker and Ni [29] showed that it is related to the well-known Krylov subspace method GMRES (generalized minimal residual algorithm) [22]. Furthermore, Potra and Engler [19] showed that GMRES is equivalent to Anderson mixing with any mixing parameters under (see Line 5 of Algorithm 1) for linear problems. Concretely, Toth and Kelley [28] proved the first linear convergence rate for linear problems with fixed parameter , where is the condition number. Besides, Eyert [10], and Fang and Saad [11] showed that Anderson mixing is related to the multisecant quasi-Newton methods (more concretely, the generalized Broyden’s second method). Despite the above results, the convergence results for this efficient method are still limited (especially for general nonlinear functions and the case where is small).
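To make the least-squares step above concrete, the following is a minimal NumPy sketch of a common textbook form of Anderson mixing, applied to the gradient descent map on a quadratic. It is not a transcription of Algorithm 1: the function name `anderson_mixing`, the memory size `m`, the step size `eta`, the mixing parameter `beta` and the test problem are all illustrative assumptions, and the small problem is solved with `np.linalg.lstsq` instead of an updated QR factorization.

```python
import numpy as np

def anderson_mixing(grad, x0, eta, m=3, beta=1.0, iters=50):
    """Anderson mixing for the fixed-point map g(x) = x - eta * grad(x).

    Keeps the last m + 1 iterates and residuals (hence at most m differences);
    working with differences turns the constrained mixing problem into an
    unconstrained least-squares problem. All names here are illustrative.
    """
    x = np.asarray(x0, dtype=float)
    xs, rs = [], []                          # histories of iterates and residuals
    for _ in range(iters):
        r = -eta * grad(x)                   # residual r = g(x) - x
        xs.append(x.copy())
        rs.append(r.copy())
        xs, rs = xs[-(m + 1):], rs[-(m + 1):]
        if np.linalg.norm(r) < 1e-14:        # converged; skip a degenerate solve
            break
        if len(rs) == 1:
            x = x + beta * r                 # plain fixed-point step (this is GD)
            continue
        # Columns are differences of consecutive iterates and residuals.
        dX = np.column_stack([xs[i + 1] - xs[i] for i in range(len(xs) - 1)])
        dR = np.column_stack([rs[i + 1] - rs[i] for i in range(len(rs) - 1)])
        gamma, *_ = np.linalg.lstsq(dR, r, rcond=None)   # min ||r - dR @ gamma||
        x = x + beta * r - (dX + beta * dR) @ gamma
    return x

# Hypothetical test problem: f(x) = 0.5 * x^T A x - b^T x, condition number 100.
A = np.diag(np.linspace(1.0, 100.0, 50))
b = np.ones(50)

def grad_f(x):
    return A @ x - b

x_am = anderson_mixing(grad_f, np.zeros(50), eta=0.01, m=3, iters=100)
x_gd = anderson_mixing(grad_f, np.zeros(50), eta=0.01, m=0, iters=100)  # m = 0: plain GD
print(np.linalg.norm(grad_f(x_am)), np.linalg.norm(grad_f(x_gd)))
```

With `m=0` the history never holds more than one residual, so the routine reduces to gradient descent, which mirrors the remark that momentum-type methods are special cases with a very short memory.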
1.1 Our Contributions
There has been a growing number of applications of the Anderson mixing method [20, 14, 16, 24]. Towards a better understanding of this efficient method, we make the following technical contributions:

We prove the optimal convergence rate of the proposed Anderson-Chebyshev mixing method (Anderson mixing with Chebyshev polynomial parameters) for minimizing quadratic functions (see Theorem 2.1). Our result improves the previous result (i.e., ) using fixed parameters given by Toth and Kelley [28] and matches the lower bound (i.e., ) provided by Nesterov [17].

Then, we prove the linear-quadratic convergence of Anderson mixing for minimizing general nonlinear problems under some reasonable assumptions (see Theorem 3.1). Compared with Newton-like methods, it is more attractive since it does not require computing (or approximating) Hessians or Hessian-vector products.

Besides, we propose a guessing algorithm (Algorithm 2) for the case when the hyperparameters (e.g., ) are not available. We prove that it achieves a similar convergence rate (see Theorem 4.1). This guessing algorithm can also be combined with other algorithms, e.g., Gradient Descent (GD), Nesterov’s Accelerated GD (NAGD). The experimental results (see Section 5.1) show that these algorithms combined with the proposed guessing algorithm achieve much better performance.

Finally, the experimental results on the real-world UCI datasets and synthetic datasets demonstrate that Anderson mixing methods converge significantly faster than other algorithms (see Section 5). This validates that Anderson mixing methods (especially the Anderson-Chebyshev mixing method) are efficient both in theory and practice.
1.2 Related Work
As mentioned above, Anderson mixing can be viewed as the extension of the momentum methods (e.g., NAGD) and the potential extension of Krylov subspace methods (e.g., GMRES) for nonlinear problems. In particular, GD is the special case of Anderson mixing with , and to some extent NAGD can be viewed as . We also review the equivalence of GMRES and Anderson mixing without truncation (i.e., ) in Appendix 0.A. Besides, Eyert [10], and Fang and Saad [11] showed that Anderson mixing is related to the multisecant quasi-Newton methods. Note that Anderson mixing has an advantage over the Newton-like methods since it does not require computing or approximating Hessians or Hessian-vector products.
There are many sequence acceleration methods in the numerical analysis literature. In particular, the well-known Aitken’s Δ² process [1] accelerates the convergence of a sequence that is converging linearly. Shanks generalized the Aitken extrapolation to what is now known as the Shanks transformation [25]. Recently, Brezinski et al. [7] proposed a general framework for Shanks sequence transformations which includes many vector sequence acceleration methods. One fundamental difference between Anderson mixing and other sequence acceleration methods (such as MPE, RRE (reduced rank extrapolation) [26, 27], etc.) is that Anderson mixing is a fully dynamic method [9]. Here dynamic means that all iterations are in the same sequence and the procedure does not need to be restarted. It can be seen from Algorithm 1 that all iterations are applied to the same sequence . In fact, in Capehart’s PhD thesis [9], several experiments were conducted to demonstrate the superior performance of Anderson mixing over other semi-dynamic methods such as MPE and RRE (semi-dynamic means that the algorithm maintains more than one sequence or needs to restart several times).
2 The Quadratic Case
In this section, we consider the problem of minimizing a quadratic function (also called least squares, or ridge regression [4, 15]). The formulation of the problem is
(2) 
where . Note that and are usually called the strongly convex parameter and Lipschitz continuous gradient parameter, respectively (e.g. [17]). There are many algorithms for optimizing this type of functions. See e.g. [8] for more details. We analyze the problem of minimizing a general function in the next Section 3.
We prove that Anderson mixing with Chebyshev polynomial parameters achieves the optimal convergence rate. The convergence result is stated in the following Theorem 2.1.
Theorem 2.1
The Anderson-Chebyshev mixing method achieves the optimal convergence rate for obtaining an approximation solution of problem (2) for any , where is the condition number, is defined in Definition 1 and this method combines Anderson mixing (Algorithm 1) with the Chebyshev polynomial parameters , for .
Remark: In this quadratic case, we mention that Toth and Kelley [28] proved the first convergence rate for fixed parameter . Here we use the Chebyshev polynomials to improve the result to the optimal one, i.e., . Also note that in practice the constant is usually very small. Particularly, has already achieved a remarkable performance from our experimental results (see Figures 2–5 in Section 5).
Before proving Theorem 2.1, we first define and then briefly review some properties of the Chebyshev polynomials. We refer to [21, 18, 13] for more details of Chebyshev polynomials.
Definition 1
Let ’s be the unit eigenvectors of , where is defined in (2). Consider a unit vector and let , where denotes the projection to the orthogonal complement of the column space of . Let denote the maximum integer such that for any . Obviously, since due to and .

The Chebyshev polynomials are polynomials , where , , which are defined by the recursive relation:
(3) 
The key property is that has minimal deviation from on among all polynomials with and , i.e.,
(4) 
In particular, for , Chebyshev polynomials can be written in an equivalent way:
(5) 
In our proof, we use this equivalent form (5) instead of (3). The equivalence can be verified as follows:
(6)  
(7) 
where (6) and (7) use the transformation due to . According to (5), and the roots of are as follows:
(8) 
To demonstrate it more clearly, we provide an example for (W-shaped curve) in Figure 1. Since in this polynomial , the first root . The remaining three roots for can be easily computed too.
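The relationship between the recursive form, the trigonometric form and the roots can be checked numerically. The sketch below assumes the standard textbook definitions of the Chebyshev polynomials of the first kind (recurrence T_0 = 1, T_1 = x, T_{k+1} = 2x T_k - T_{k-1}; trigonometric form cos(k arccos x) on [-1, 1]; roots cos((2i - 1)π / (2k))); the function names are illustrative and the degree-4 case corresponds to the W-shaped example.

```python
import math

def cheb_recursive(k, x):
    """T_k(x) from the three-term recurrence T_{j+1} = 2 x T_j - T_{j-1}."""
    t_prev, t_cur = 1.0, x               # T_0 and T_1
    if k == 0:
        return t_prev
    for _ in range(k - 1):
        t_prev, t_cur = t_cur, 2.0 * x * t_cur - t_prev
    return t_cur

def cheb_trig(k, x):
    """T_k(x) = cos(k * arccos(x)), valid only for |x| <= 1."""
    return math.cos(k * math.acos(x))

def cheb_roots(k):
    """Roots x_i = cos((2i - 1) * pi / (2k)) for i = 1, ..., k."""
    return [math.cos((2 * i - 1) * math.pi / (2 * k)) for i in range(1, k + 1)]

# The two forms agree on [-1, 1] ...
for x in [-1.0, -0.3, 0.0, 0.5, 1.0]:
    assert abs(cheb_recursive(4, x) - cheb_trig(4, x)) < 1e-12

# ... and the closed-form roots are zeros of the degree-4 polynomial.
for root in cheb_roots(4):
    assert abs(cheb_recursive(4, root)) < 1e-12

print(cheb_roots(4))   # first entry is cos(pi / 8)
```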
Proof of Theorem 2.1. For iteration , the residual can be deduced as follows:
(9)  
(10) 
where (9) uses .
To bound (i.e., ), we first obtain the following lemma by using Singular Value Decomposition (SVD) to solve the least squares problem (1) and then using several transformations. We defer the proof of Lemma 1 to Appendix 0.B.2.
Lemma 1
Let and , then
(11) 
where is a degree polynomial.
According to Lemma 1, to bound , it is sufficient to bound the right-hand side (RHS) of (11) (i.e., ). In order to bound this, we first transform into . Let , where . We have the following equalities:
(12) 
According to (4) (the optimal property of standard Chebyshev polynomials), when (note that here ), the RHS of (11) can be bounded as follows:
(13)  
(14) 
where (13) uses (12), and (14) uses (see (5)). According to (8), it is not hard to see that is defined by the mixing parameters , where ). Note that the roots of standard Chebyshev polynomials (i.e., (8)) can be obtained from many textbooks, e.g., Section 1.2 of [21]. Now, we only need to bound . First, we need to transform the form (5) of Chebyshev polynomials as follows:
Let , we get . So we have
(15) 
Now, the RHS of (11) can be bounded as
(16)  
(17) 
where (16) follows from (14), and (17) follows from (15). Then, according to (11), the gradient norm is bounded as , where . Note that if the number of iterations , then
Thus the Anderson-Chebyshev mixing method achieves the optimal convergence rate for obtaining an approximation solution.
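The square-root dependence on the condition number in this argument comes from evaluating a Chebyshev polynomial outside [-1, 1]. The sketch below checks the standard bound 1 / T_k((κ + 1) / (κ - 1)) ≤ 2 ((√κ - 1) / (√κ + 1))^k numerically. It is an illustration of that classical inequality under the usual definitions, not a reproduction of (15)–(17), whose exact constants may differ; all function names are hypothetical.

```python
import math

def cheb_outside(k, x):
    """T_k(x) for x >= 1, using the hyperbolic form cosh(k * arccosh(x))."""
    return math.cosh(k * math.acosh(x))

def chebyshev_factor(kappa, k):
    """Worst-case reduction on [mu, L] of the scaled degree-k Chebyshev polynomial.

    Requires kappa = L / mu > 1.
    """
    return 1.0 / cheb_outside(k, (kappa + 1.0) / (kappa - 1.0))

def sqrt_kappa_bound(kappa, k):
    """Classical upper bound 2 * rho^k with rho = (sqrt(kappa) - 1) / (sqrt(kappa) + 1)."""
    rho = (math.sqrt(kappa) - 1.0) / (math.sqrt(kappa) + 1.0)
    return 2.0 * rho ** k

def iterations_needed(kappa, eps):
    """Smallest k for which the bound 2 * rho^k drops below eps."""
    rho = (math.sqrt(kappa) - 1.0) / (math.sqrt(kappa) + 1.0)
    return math.ceil(math.log(2.0 / eps) / -math.log(rho))

for kappa in (10.0, 100.0, 1000.0):
    for k in (1, 5, 20):
        assert chebyshev_factor(kappa, k) <= sqrt_kappa_bound(kappa, k) * (1.0 + 1e-12)

# The iteration count grows like sqrt(kappa) * log(1 / eps).
print(iterations_needed(100.0, 1e-6), iterations_needed(10000.0, 1e-6))
```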
3 The General Case
In this section, we analyze the Anderson mixing (Algorithm 1) in the general nonlinear case:
(18) 
We prove that the Anderson mixing method achieves the linear-quadratic convergence rate under the following standard Assumptions 1 and 2, where denotes the Euclidean norm. Let denote the small matrix of the least-squares problem in Line 7 of Algorithm 1, i.e., (see problem (1)). Then, we define its condition number , where denotes the least nonzero singular value of . We further define .
Assumption 1
The Hessian satisfies , where .
Assumption 2
The Hessian is Lipschitz continuous, i.e.,
(19) 
Theorem 3.1
Remark:

The constant is usually very small. Particularly, we use and for the numerical experiments in Section 5. Hence is very small and also decreases as the number of iterations increases.

Besides, one can also use instead of in the RHS of (20) according to the property of , i.e., and

Note that the first two terms in the RHS of (20) converge quadratically and the last term converges linearly. Due to the fully dynamic property of Anderson mixing, as discussed at the end of Section 1.2, it turns out that the exact convergence rate of Anderson mixing in the general case is not easy to obtain. But we note that the convergence rate is roughly since the first two quadratic terms converge much faster than the last linear term. In particular, if is a quadratic function, then and thus in (20). Only the last linear term remains, thus it converges linearly (see the following corollary).
Corollary 1
If is a quadratic function, let stepsize and . Then the convergence rate of Anderson Mixing is linear, i.e., , where is the condition number.
Note that this corollary recovers the previous result (i.e., ) obtained by [28], and we use Chebyshev polynomials to improve this result to obtain the optimal convergence rate in Section 2 (see Theorem 2.1).
Proof Sketch of Theorem 3.1: For the iteration , we have according to . First, we demonstrate several useful forms of as follows:
(21)  
(22) 
where (21) holds due to the definition , and (22) holds since .
Then, to bound (i.e., ), we deduce as follows:
(23) 
where (23) uses the definition . Now, we bound the first two terms of (23) as follows:
(24) 
where (24) is obtained by using (22) to replace . To bound (24), we use Assumptions 1, 2, and the equation
After some nontrivial calculations (details can be found in Appendix 0.B.1), we obtain
where denotes the Euclidean norm of . Then, according to the problem (1) and the definition of , we have . Finally, we bound using QR decomposition of problem (1) and recall to finish the proof of Theorem 3.1.
4 Guessing Algorithm
In this section, we provide a Guessing Algorithm (described in Algorithm 2) which guesses the parameters (e.g., ) dynamically. Intuitively, we guess the parameter and the condition number in a doubling way. Note that in general these parameters are not available, since the time for computing these parameters is almost the same as (or even longer than) solving the original problem. Also note that the condition in Line 14 of Algorithm 2 depends on the algorithm used in Line 12.
The convergence result of Algorithm 2 is stated in the following Theorem 4.1. The detailed proof is deferred to Appendix 0.B.3.
Theorem 4.1
Without knowing the parameters and , Algorithm 2 achieves convergence rate for obtaining an approximation solution of problem (2), where , and can be any number as long as the eigenvalue spectrum belongs to .

Remark: We provide a simple example to show why this guessing algorithm is useful. Note that algorithms usually need the parameters and to set the step size, whether or not they are combined with Algorithm 2. Thus, we need to approximate these parameters once at the beginning. Let and denote the approximated values, where . Without guessing them dynamically, one fixes and all the time and its convergence rate cannot be better than . However, according to our Theorem 4.1, the rate is due to .
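The idea of guessing a parameter in a doubling way can be illustrated on plain gradient descent. The sketch below is not Algorithm 2: it is a hypothetical backtracking-style stand-in that starts from a deliberately poor guess of the smoothness parameter and doubles it whenever a sufficient-decrease test fails. The function name `gd_with_doubling`, the acceptance test and the toy quadratic are all assumptions made for illustration.

```python
def gd_with_doubling(f, grad, x0, L_guess=1.0, iters=2000):
    """Gradient descent with step 1 / L_guess, doubling L_guess when a step fails.

    A step is accepted when f decreases by at least ||g||^2 / (2 * L_guess),
    which always holds once L_guess is at least the true smoothness parameter.
    """
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        g2 = sum(gi * gi for gi in g)
        if g2 == 0.0:                    # exact stationary point
            break
        while True:
            x_new = [xi - gi / L_guess for xi, gi in zip(x, g)]
            if f(x_new) <= f(x) - g2 / (2.0 * L_guess):
                break                    # sufficient decrease: accept the step
            L_guess *= 2.0               # guess was too small: double it
        x = x_new
    return x, L_guess

# Toy quadratic with eigenvalues 1 and 100, so the true smoothness parameter is 100.
def f(x):
    return 0.5 * (x[0] ** 2 + 100.0 * x[1] ** 2)

def grad_f(x):
    return [x[0], 100.0 * x[1]]

x_final, L_final = gd_with_doubling(f, grad_f, [1.0, 1.0], L_guess=1.0)
print(L_final, f(x_final))
```

Because the guess only doubles on failure, it can overshoot the true parameter by at most a factor of two, which is the property that makes doubling schemes cheap compared with estimating the parameter exactly.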
5 Experiments
In this section, we conduct numerical experiments on the real-world UCI datasets and synthetic datasets (the UCI datasets can be downloaded from https://archive.ics.uci.edu/ml/datasets.html). We compare the performance of these five algorithms: the Anderson-Chebyshev mixing method (AM-Cheby), the Anderson mixing method (AM), vanilla Gradient Descent (GD), Nesterov’s Accelerated Gradient Descent (NAGD) [17] and Regularized Minimal Polynomial Extrapolation (RMPE) with (same as [23]).
Concretely, Figure 2 demonstrates the convergence performance in the general nonlinear case, Figures 3–5 demonstrate the convergence performance of these algorithms in the quadratic case, and Figure 6 demonstrates the convergence performance of these algorithms combined with our guessing algorithm (Algorithm 2). The values of in the captions of the figures denote the mix parameter of the Anderson mixing algorithms (see Line 5 of Algorithm 1).
Figure 2 demonstrates the convergence performance of these algorithms in the general nonlinear case. Concretely, we use the negative log-likelihood as the loss function (logistic regression), i.e., , where . We run these five algorithms on the real-world diabetes and cancer datasets, which are standard UCI datasets. The x-axis and y-axis represent the number of iterations and the norm of the gradient of the loss function, respectively.

Figures 3–5 demonstrate the convergence performance of these algorithms in the quadratic case, where . Concretely, we compared the convergence performance among these algorithms when the condition number and the mix parameter are varied, e.g., the left figure in Figure 3 is the case and , where is the parameter for Anderson mixing algorithms (see Line 5 of Algorithm 1). We run these five algorithms on the synthetic datasets in which we randomly generate the and for the quadratic function . Note that for randomly generated satisfying the property of , we randomly generate instead and let .
In conclusion, Anderson mixing-type methods converge the fastest in all of our experiments, for both linear and nonlinear problems. The efficient Anderson mixing methods can be viewed as the extension of the momentum methods (e.g., NAGD) and the potential extension of Krylov subspace methods (e.g., GMRES) for nonlinear problems, because GD is the special case of Anderson mixing with , and to some extent NAGD can be viewed as . Note that the performance gap between Anderson mixing methods and (GD, NAGD) in Figure 5 (i.e. ) is somewhat larger than that in Figure 3 (i.e. ). Regarding the Krylov extension, Anderson mixing without truncation is equivalent to the well-known Krylov subspace method GMRES (see Appendix 0.A), and we prove the optimal convergence rate in this quadratic case (Theorem 2.1), which matches Nesterov’s lower bound. For the general case, we prove the linear-quadratic convergence (Theorem 3.1). Compared with Newton-like methods, it is more attractive since it does not require computing (or approximating) Hessians or Hessian-vector products.
5.1 Experiments for Guessing Algorithm
In this subsection, we conduct the experiments for guessing the hyperparameters (i.e., ) dynamically using Algorithm 2.
In Figure 6, we separately consider these algorithms. For each of them, we compare its convergence performance between its original version and the one combined with our guessing algorithm (Algorithm 2). The experimental results show that all these four algorithms combined with our guessing algorithm achieve much better performance than their original versions.
6 Conclusion
In this paper, we show that the Anderson-Chebyshev mixing method (Anderson mixing with Chebyshev polynomial parameters) can achieve the optimal convergence rate, which improves the previous result provided by [28]. Furthermore, we prove the linear-quadratic convergence of Anderson mixing for minimizing general nonlinear problems. Besides, if the hyperparameters (e.g., the Lipschitz smooth parameter ) are not available, we propose a guessing algorithm for guessing them dynamically and also prove a similar convergence rate. Finally, the experimental results demonstrate that the efficient Anderson-Chebyshev mixing method converges significantly faster than other algorithms. This validates that the Anderson-Chebyshev mixing method is efficient both in theory and practice.
Acknowledgments
The authors would like to thank Claude Brezinski, Rong Ge, Damien Scieur and Le Zhang for useful discussions and suggestions.
References
[1] Aitken, A.: On Bernoulli’s numerical solution of algebraic equations. Proceedings of the Royal Society of Edinburgh 46, 289–305 (1926)
[2] Allen-Zhu, Z.: Katyusha: the first direct acceleration of stochastic gradient methods. In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing. pp. 1200–1205. ACM (2017)
[3] Anderson, D.G.: Iterative procedures for nonlinear integral equations. Journal of the ACM 12(4), 547–560 (1965)
[4] Boyd, S., Vandenberghe, L.: Convex optimization. Cambridge University Press (2004)
[5] Brezinski, C.: Convergence acceleration during the 20th century. Journal of Computational and Applied Mathematics 122, 1–21 (2000)
[6] Brezinski, C., Redivo-Zaglia, M.: Extrapolation methods: theory and practice (1991)
[7] Brezinski, C., Redivo-Zaglia, M., Saad, Y.: Shanks sequence transformations and Anderson acceleration. SIAM Review 60(3), 646–669 (2018)
[8] Bubeck, S.: Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning 8(3–4), 231–357 (2015)
[9] Capehart, S.R.: Techniques for accelerating iterative methods for the solution of mathematical problems. Ph.D. thesis, Oklahoma State University (1989)
[10] Eyert, V.: A comparative study on methods for convergence acceleration of iterative vector sequences. Journal of Computational Physics 124(2), 271–285 (1996)
[11] Fang, H.-r., Saad, Y.: Two classes of multisecant methods for nonlinear acceleration. Numerical Linear Algebra with Applications 16(3), 197–221 (2009)
[12] Golub, G., Van Loan, C.: Matrix computations. 3rd ed., The Johns Hopkins University Press, Baltimore, MD (1996)
[13] Hageman, L.A., Young, D.M.: Applied iterative methods. Courier Corporation (2012)
[14] Higham, N.J., Strabić, N.: Anderson acceleration of the alternating projections method for computing the nearest correlation matrix. Numerical Algorithms 72(4), 1021–1042 (2016)
[15] Hoerl, A.E., Kennard, R.W.: Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12(1), 55–67 (1970)
[16] Loffeld, J., Woodward, C.S.: Considerations on the implementation and use of Anderson acceleration on distributed memory and GPU-based parallel computers. Advances in the Mathematical Sciences p. 417 (2016)
[17] Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer (2004)
[18] Olshanskii, M.A., Tyrtyshnikov, E.E.: Iterative methods for linear systems: theory and applications. SIAM (2014)
[19] Potra, F.A., Engler, H.: A characterization of the behavior of the Anderson acceleration on linear problems. Linear Algebra and its Applications 438(3), 1002–1011 (2013)
[20] Pratapa, P.P., Suryanarayana, P., Pask, J.E.: Anderson acceleration of the Jacobi iterative method: An efficient alternative to Krylov methods for large, sparse linear systems. Journal of Computational Physics 306, 43–54 (2016)
[21] Rivlin, T.J.: The Chebyshev polynomials. Wiley (1974)
[22] Saad, Y., Schultz, M.H.: GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing 7(3), 856–869 (1986)
[23] Scieur, D., d’Aspremont, A., Bach, F.: Regularized nonlinear acceleration. In: Advances in Neural Information Processing Systems. pp. 712–720 (2016)
[24] Scieur, D., Oyallon, E., d’Aspremont, A., Bach, F.: Nonlinear acceleration of deep neural networks. arXiv preprint arXiv:1805.09639 (2018)
[25] Shanks, D.: Nonlinear transformations of divergent and slowly convergent sequences. Studies in Applied Mathematics 34(1–4), 1–42 (1955)
[26] Sidi, A., Ford, W.F., Smith, D.A.: Acceleration of convergence of vector sequences. SIAM Journal on Numerical Analysis 23(1), 178–196 (1986)
[27] Smith, D.A., Ford, W.F., Sidi, A.: Extrapolation methods for vector sequences. SIAM Review 29(2), 199–233 (1987)
[28] Toth, A., Kelley, C.: Convergence analysis for Anderson acceleration. SIAM Journal on Numerical Analysis 53(2), 805–819 (2015)
[29] Walker, H.F., Ni, P.: Anderson acceleration for fixed-point iterations. SIAM Journal on Numerical Analysis 49(4), 1715–1735 (2011)
Appendix 0.A GMRES vs. Anderson Mixing ()
In this appendix, in order to better understand this efficient Anderson mixing method, we review the equivalence between the well-known Krylov subspace method GMRES [22] and Anderson mixing without truncation (i.e., or large enough in Line 5 of Algorithm 1) in the linear case. We emphasize that in this paper we focus on the more general hard cases where is small (since usually is finite and not very large in practice) and also on the nonlinear case.
Consider the problem of solving the linear system , with a nonsingular matrix . This is equivalent to solving the fixed point , where . Let denote the residual in the point , i.e., . The GMRES method is an effective iterative method for linear system which has the property of minimizing the norm of the residual vector over a Krylov subspace at every step.
(25) 
Note that the Krylov space is the linear span of the first gradients and can span the whole space . Hence the method arrives at the exact solution after iterations. It is also theoretically equivalent to the Generalized Conjugate Residual method (GCR).
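The minimal-residual property over a Krylov subspace can be sketched directly with a dense least-squares solve. The code below is a didactic stand-in and not a production GMRES (which would use the Arnoldi process and Givens rotations): it builds the Krylov basis explicitly, orthonormalizes it with QR, and minimizes the residual norm. The function name `krylov_min_residual` and the 5-by-5 test matrix are assumptions made for illustration.

```python
import numpy as np

def krylov_min_residual(A, b, x0, k):
    """Minimize ||b - A x|| over x in x0 + span{r0, A r0, ..., A^(k-1) r0}."""
    r0 = b - A @ x0
    basis = [r0]
    for _ in range(k - 1):
        basis.append(A @ basis[-1])      # next Krylov vector
    K = np.column_stack(basis)
    Q, _ = np.linalg.qr(K)               # orthonormal basis, better conditioned than K
    y, *_ = np.linalg.lstsq(A @ Q, r0, rcond=None)   # min ||r0 - A Q y||
    return x0 + Q @ y

# Hypothetical nonsingular test system of dimension 5.
A = np.diag([1.0, 2.0, 3.0, 4.0, 5.0]) + 0.1 * np.ones((5, 5))
b = np.arange(1.0, 6.0)
x0 = np.zeros(5)

residuals = [np.linalg.norm(b - A @ krylov_min_residual(A, b, x0, k))
             for k in range(1, 6)]
print(residuals)   # non-increasing; essentially zero once k equals the dimension
```

Because the subspaces are nested, the residual norm cannot increase with k, and once the subspace has full dimension the system is solved up to rounding error.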
Now we show that to indicate the equivalence, under the assumption for . and denote the th GMRES iterative point and th Anderson mixing iterative point, respectively. Let mixing parameters for all . Then, we deduce the as follows:
(26)  
(27)  
(28) 
Note that the second term in (28) is the same as we minimized in Line 7 of Algorithm 1. This step also can be transformed to an unconstrained version as follows:
(29) 
The equals to . Note that and . Replacing these equations into (29), we have
(30)  