1 Introduction
For the general optimization problem $\min_{x \in \mathbb{R}^d} f(x)$, there exist several techniques to accelerate standard gradient descent, e.g., Nesterov momentum (Nesterov, 2004) and Katyusha momentum (Allen-Zhu, 2017). There are also various vector sequence acceleration methods developed in the numerical analysis literature, e.g., (Brezinski, 2000; Sidi et al., 1986; Smith et al., 1987; Brezinski and Redivo Zaglia, 1991; Brezinski et al., 2018). Roughly speaking, if a vector sequence converges very slowly to its limit, then one may apply such methods to accelerate its convergence. Taking gradient descent as an example, the vector sequence is generated by $x_{t+1} = x_t - \eta \nabla f(x_t)$, where the limit is the fixed point $x^*$ (i.e., $\nabla f(x^*) = 0$). One notable advantage of such acceleration methods is that they usually do not need to know how the vector sequence is actually generated. Thus these methods are very widely applicable.
Recently, Scieur et al. (2016) used the minimal polynomial extrapolation (MPE) method (Smith et al., 1987) for convergence acceleration. This is a nice example of applying sequence acceleration methods to optimization problems. In this paper, we are interested in another classical sequence acceleration method called Anderson acceleration (or Anderson mixing), which was proposed by Anderson in 1965 (Anderson, 1965). The method is known to be quite efficient in a variety of applications (Capehart, 1989; Pratapa et al., 2016; Higham and Strabić, 2016; Loffeld and Woodward, 2016). The idea of Anderson mixing is to maintain $m$ recent iterations for determining the next iteration point, where $m$ is a parameter (typically a very small constant). Thus, it can be viewed as an extension of the existing momentum methods, which usually use the last and current points to determine the next iteration point. Anderson mixing with slight modifications is formally described in Algorithm 1.
Note that the step in Line 7 of Algorithm 1 can be transformed into an equivalent unconstrained least-squares problem:
(1) 
then let
. Using QR decomposition, (1) can be solved in time , where is the dimension. Moreover, the QR decomposition of (1) at iteration can be efficiently obtained from that of at iteration in (see, e.g., (Golub and Van Loan, 1996)). The constant $m$ (the number of recent iterations maintained) is usually very small. For the numerical experiments in Section 5, we use and . Hence, each iteration of Anderson mixing can be implemented quite efficiently.
Many studies have shown relations between Anderson mixing and other optimization methods. In particular, for the quadratic case (linear problems), Walker and Ni (2011) showed that it is related to the well-known Krylov subspace method GMRES (generalized minimal residual algorithm) (Saad and Schultz, 1986). Furthermore, Potra and Engler (2013) showed that GMRES is equivalent to Anderson mixing with any mixing parameters under (see Line 5 of Algorithm 1) for linear problems. Concretely, Toth and Kelley (2015) proved the first linear convergence rate for linear problems with fixed parameter , where is the condition number. Besides, Eyert (1996) and Fang and Saad (2009) showed that Anderson mixing is related to the multisecant quasi-Newton methods (more concretely, the generalized Broyden’s second method). Despite the above results, the convergence results for this efficient method are still limited (especially for general nonlinear functions and the case where $m$ is small).
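The update and the unconstrained least-squares reformulation described above can be sketched in Python as follows. This is a minimal illustration of a common form of Anderson mixing, not a reproduction of Algorithm 1: the function name `anderson_mixing`, its arguments (`grad`, `beta`, `m`, `iters`, `tol`), and the use of one fixed mixing parameter are assumptions, and the subproblem is solved with `numpy.linalg.lstsq` rather than with an updated QR factorization.

```python
import numpy as np

def anderson_mixing(grad, x0, beta, m=3, iters=100, tol=1e-10):
    """Sketch of Anderson mixing with residuals F_t = -grad(x_t).

    Assumed form (hypothetical, not the paper's Algorithm 1 verbatim):
    x_{t+1} = sum_i alpha_i * x_i + beta * sum_i alpha_i * F_i,
    where alpha minimizes ||sum_i alpha_i F_i|| subject to sum(alpha) = 1.
    """
    x = np.asarray(x0, dtype=float)
    xs, fs = [x], [-grad(x)]
    for _ in range(iters):
        if np.linalg.norm(fs[-1]) < tol:
            break
        mt = min(m, len(xs) - 1)                 # history actually available
        X = np.stack(xs[-(mt + 1):], axis=1)     # d x (mt + 1) iterates
        F = np.stack(fs[-(mt + 1):], axis=1)     # d x (mt + 1) residuals
        alpha = np.zeros(mt + 1)
        alpha[-1] = 1.0
        if mt > 0:
            # Remove the constraint sum(alpha) = 1 by working with differences:
            # minimize ||F[:, -1] - dF @ gamma|| over unconstrained gamma.
            dF = F[:, 1:] - F[:, :-1]
            gamma, *_ = np.linalg.lstsq(dF, F[:, -1], rcond=None)
            alpha[1:] -= gamma
            alpha[:-1] += gamma
        x = X @ alpha + beta * (F @ alpha)
        xs.append(x)
        fs.append(-grad(x))
    return xs[-1]

# Usage on a small quadratic f(x) = 0.5 x^T A x - b^T x (illustrative data).
A = np.diag(np.arange(1.0, 11.0))
b = np.ones(10)
x_am = anderson_mixing(lambda x: A @ x - b, np.zeros(10), beta=2.0 / 11.0, m=3)
```

With `m=0` the history is empty and the sketch reduces to plain gradient descent with step size `beta`, which is one way to see Anderson mixing as an extension of gradient-type methods.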
1.1 Our Contributions
There has been a growing number of applications of the Anderson mixing method (Pratapa et al., 2016; Higham and Strabić, 2016; Loffeld and Woodward, 2016; Scieur et al., 2018). Towards a better understanding of this efficient method, we make the following technical contributions:

We prove the optimal convergence rate of the proposed Anderson-Chebyshev mixing method (Anderson mixing with Chebyshev polynomial parameters) for minimizing quadratic functions (see Theorem 1). Our result improves the previous result (i.e., ) using fixed parameters given by Toth and Kelley (2015) and matches the lower bound (i.e., ) provided by Nesterov (2004).

Then, we prove the linear-quadratic convergence of Anderson mixing for minimizing general nonlinear problems under some reasonable assumptions (see Theorem 2). Compared with Newton-like methods, it is more attractive since it does not require computing (or approximating) Hessians or Hessian-vector products.

Besides, we propose a Guessing Algorithm for the case when the hyperparameters (e.g., ) are not available. We prove that it achieves a similar convergence rate (see Theorem 3). This guessing algorithm can also be combined with other algorithms, e.g., Gradient Descent (GD), Nesterov’s Accelerated GD (NAGD). The experimental results (see Section 5) show that these algorithms combined with the guessing algorithm achieve much better performance.

Finally, the experimental results on the real-world UCI datasets and synthetic datasets demonstrate that Anderson mixing methods converge significantly faster than other algorithms (see Section 5). This validates that Anderson mixing methods (especially the Anderson-Chebyshev mixing method) are efficient both in theory and in practice.
1.2 Related Work
As mentioned above, Anderson mixing can be viewed as an extension of the momentum methods (e.g., NAGD) and a potential extension of Krylov subspace methods (e.g., GMRES) to nonlinear problems. In particular, GD is the special case of Anderson mixing with $m = 0$, and to some extent NAGD can be viewed as $m = 1$. We also review the equivalence of GMRES and Anderson mixing without truncation (i.e., $m = \infty$) in Appendix A. Besides, Eyert (1996) and Fang and Saad (2009) showed that Anderson mixing is related to the multisecant quasi-Newton methods. Note that Anderson mixing has an advantage over Newton-like methods since it does not require computing or approximating Hessians or Hessian-vector products.
There are many sequence acceleration methods in the numerical analysis literature. In particular, the well-known Aitken’s $\Delta^2$ process (Aitken, 1926) accelerates the convergence of a linearly converging sequence. Shanks generalized the Aitken extrapolation into what is known as the Shanks transformation (Shanks, 1955). Recently, Brezinski et al. (2018) proposed a general framework for Shanks sequence transformations which includes many vector sequence acceleration methods. One fundamental difference between Anderson mixing and other sequence acceleration methods (such as MPE and RRE (reduced rank extrapolation) (Sidi et al., 1986; Smith et al., 1987)) is that Anderson mixing is a fully dynamic method (Capehart, 1989). Here dynamic means that all iterations are in the same sequence and the procedure does not need to be restarted. It can be seen from Algorithm 1 that all iterations are applied to the same sequence . In fact, in Capehart’s PhD thesis (Capehart, 1989), several experiments were conducted to demonstrate the superior performance of Anderson mixing over semi-dynamic methods such as MPE and RRE (semi-dynamic means that the algorithm maintains more than one sequence or needs to restart several times).
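As a small illustration of the scalar sequence acceleration mentioned above, the following sketch applies the classical Aitken delta-squared formula to a linearly converging sequence. The function name `aitken` and the test sequence are illustrative choices, not taken from the cited works.

```python
def aitken(seq):
    """Aitken's delta-squared transform of a scalar sequence.

    For consecutive terms (x0, x1, x2) it returns
    x0 - (x1 - x0)^2 / (x2 - 2*x1 + x0), falling back to x2
    when the second difference vanishes.
    """
    out = []
    for x0, x1, x2 in zip(seq, seq[1:], seq[2:]):
        denom = x2 - 2.0 * x1 + x0
        out.append(x2 if denom == 0.0 else x0 - (x1 - x0) ** 2 / denom)
    return out

# A sequence converging linearly to 1 with ratio 0.5.
seq = [1.0 + 0.5 ** n for n in range(8)]
acc = aitken(seq)
```

For an exactly geometric error term, as in this example, the transform recovers the limit from any three consecutive terms, whereas the original sequence is still about 0.008 away from the limit at its last term.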
2 The Quadratic Case
In this section, we consider the problem of minimizing a quadratic function (also called least squares, or ridge regression (Boyd and Vandenberghe, 2004; Hoerl and Kennard, 1970)). The formulation of the problem is
(2) 
where . Note that $\mu$ and $L$ are usually called the strong convexity parameter and the Lipschitz continuous gradient parameter, respectively (e.g., (Nesterov, 2004)). There are many algorithms for optimizing this type of function; see, e.g., (Bubeck, 2015) for more details. We analyze the problem of minimizing a general function in Section 3.
We prove that Anderson mixing with Chebyshev polynomial parameters achieves the optimal convergence rate. The convergence result is stated in the following Theorem 1. Let denote the maximum integer such that for any . The and ’s are defined as follows: , , where is a unit vector, ’s are the unit eigenvectors of , and denotes the projection onto the orthogonal complement of the column space of . Obviously, since (then ), and since is a projection operator.
Theorem 1
Remark: In this quadratic case, we mention that Toth and Kelley (2015) proved the first convergence rate for fixed parameter . Here we use the Chebyshev polynomials to improve the result to the optimal one, i.e., . Besides, the constant is usually very small. In particular, already achieves remarkable performance in our experiments (see Figures 2–4).
Before proving Theorem 1, some properties of the Chebyshev polynomials are briefly reviewed. We refer to Rivlin (1974); Olshanskii and Tyrtyshnikov (2014); Hageman and Young (2012) for more details of Chebyshev polynomials.
The Chebyshev polynomials are polynomials $T_k(x)$, where $T_0(x) = 1$, $T_1(x) = x$, which are defined by the recursive relation:
(3) $T_{k+1}(x) = 2x\,T_k(x) - T_{k-1}(x)$
The key property is that has minimal deviation from on among all polynomials with and , i.e.,
(4) 
In particular, for $x \in [-1, 1]$, Chebyshev polynomials can be written in an equivalent way:
(5) $T_k(x) = \cos(k \arccos x)$
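In our proof, we use this equivalent form (5) instead of (3). The equivalence can be verified as follows:
(6) $T_{k+1}(x) + T_{k-1}(x) = \cos((k+1)\theta) + \cos((k-1)\theta)$
(7) $= 2\cos\theta \cos(k\theta) = 2x\,T_k(x)$
where (6) and (7) use the transformation $x = \cos\theta$ due to $x \in [-1, 1]$. According to (5), $|T_k(x)| \le 1$ for $x \in [-1, 1]$, and the roots of $T_k$ are as follows:
(8) $x_i = \cos\left(\frac{(2i-1)\pi}{2k}\right), \quad i = 1, \ldots, k$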
To demonstrate it more clearly, we provide an example for $T_4(x)$ (W-shaped curve) in Figure 1. Since $k = 4$ in this polynomial, the first root is $x_1 = \cos(\pi/8)$. The remaining three roots for $i = 2, 3, 4$ can be easily computed too.
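The properties reviewed above can be checked numerically with a short sketch. It uses the standard three-term recurrence $T_{k+1}(x) = 2xT_k(x) - T_{k-1}(x)$, the cosine form $T_k(x) = \cos(k \arccos x)$ on $[-1, 1]$, and the standard root formula $\cos((2i-1)\pi/(2k))$; the function names `chebyshev_T` and `chebyshev_roots` are illustrative.

```python
import math

def chebyshev_T(k, x):
    """Evaluate T_k(x) with the three-term recurrence, T_0 = 1 and T_1 = x."""
    t_prev, t_cur = 1.0, x
    if k == 0:
        return t_prev
    for _ in range(k - 1):
        t_prev, t_cur = t_cur, 2.0 * x * t_cur - t_prev
    return t_cur

def chebyshev_roots(k):
    """Roots of T_k in decreasing order: cos((2i - 1) * pi / (2k)), i = 1..k."""
    return [math.cos((2 * i - 1) * math.pi / (2 * k)) for i in range(1, k + 1)]

roots_4 = chebyshev_roots(4)                  # first root is cos(pi / 8)
values_at_roots = [chebyshev_T(4, r) for r in roots_4]
cos_form = math.cos(4 * math.acos(0.3))       # cosine form evaluated at x = 0.3
```

For $k = 4$ the polynomial is $8x^4 - 8x^2 + 1$, which equals 1 at $x = -1, 0, 1$ and dips to $-1$ in between, giving the W shape.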
Proof of Theorem 1. For iteration , the residual can be deduced as follows:
(9)  
(10) 
where (9) uses .
To bound (i.e., ), we first obtain the following lemma by using Singular Value Decomposition (SVD) to solve the least-squares problem (1) and then using several transformations. We defer the proof of Lemma 1 to Appendix B.2.
Lemma 1
(11) 
where is a degree polynomial.
According to Lemma 1, to bound , it is sufficient to bound the right-hand side (RHS) of (11) (i.e., ). To do so, we first transform into . Let , where . We have the following equalities:
(12) 
According to (4) (the optimal property of standard Chebyshev polynomials), when (note that here ), the RHS of (11) can be bounded as follows:
(13)  
(14) 
where (13) uses (12), and (14) uses (see (5)). According to (8), it is not hard to see that is defined by the mixing parameters , where ). Note that the roots of standard Chebyshev polynomials (i.e., (8)) can be obtained from many textbooks, e.g., Section 1.2 of Rivlin (1974). Now, we only need to bound . First, we need to transform the form (5) of Chebyshev polynomials as follows:
Let , we get . So we have
(15) 
Now, the RHS of (11) can be bounded as
(16)  
(17) 
where (16) follows from (14), and (17) follows from (15). Then, according to (11), the gradient norm is bounded as , where . Note that if the number of iterations , then
Thus the Anderson-Chebyshev mixing method achieves the optimal convergence rate for obtaining an approximate solution.
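To make the role of the Chebyshev roots concrete, the following sketch runs gradient steps on a quadratic whose step sizes are the reciprocals of the roots of $T_k$ mapped from $[-1, 1]$ to $[\mu, L]$. This is the classical Chebyshev (Richardson) iteration, used here as an assumption about how Chebyshev-based parameters can be chosen; it is not the Anderson-Chebyshev method itself, and the names `chebyshev_step_sizes` and `gradient_steps` and the test problem are hypothetical.

```python
import numpy as np

def chebyshev_step_sizes(mu, L, k):
    """Reciprocals of the roots of T_k mapped from [-1, 1] to [mu, L]."""
    i = np.arange(1, k + 1)
    roots = np.cos((2 * i - 1) * np.pi / (2 * k))
    return 1.0 / (0.5 * (L + mu) + 0.5 * (L - mu) * roots)

def gradient_steps(A, b, x0, step_sizes):
    """Apply x <- x - beta * (A x - b) once for each beta in step_sizes."""
    x = np.array(x0, dtype=float)
    for beta in step_sizes:
        x = x - beta * (A @ x - b)
    return x

mu, L, k = 1.0, 100.0, 16
A = np.diag(np.linspace(mu, L, 20))       # spectrum contained in [mu, L]
b = np.ones(20)
x0 = np.zeros(20)

x_cheb = gradient_steps(A, b, x0, chebyshev_step_sizes(mu, L, k))
x_gd = gradient_steps(A, b, x0, np.full(k, 2.0 / (L + mu)))

r0 = np.linalg.norm(A @ x0 - b)
ratio_cheb = np.linalg.norm(A @ x_cheb - b) / r0
ratio_gd = np.linalg.norm(A @ x_gd - b) / r0
kappa = L / mu
bound = 2.0 * ((np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)) ** k
```

After `k` steps the residual polynomial is a scaled and shifted $T_k$, so `ratio_cheb` stays below the classical bound $2\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^k$, while the fixed step size $2/(L+\mu)$ contracts only at the rate $\frac{\kappa-1}{\kappa+1}$ per step.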
3 The General Case
In this section, we analyze the Anderson mixing (Algorithm 1) in the general nonlinear case:
(18) 
We prove that the Anderson mixing method achieves the linear-quadratic convergence rate under the following standard Assumptions 1 and 2, where denotes the Euclidean norm. Let denote the small matrix of the least-squares problem in Line 7 of Algorithm 1, i.e., (see problem (1)). Then, we define its condition number , where denotes the least nonzero singular value of . We further define .
Assumption 1
The Hessian satisfies , where .
Assumption 2
The Hessian is Lipschitz continuous, i.e.,
(19) 
Theorem 2
Remark:

The constant is usually very small. In particular, we use and for the numerical experiments in Section 5. Hence is very small and also decreases as the number of iterations increases.

Besides, one can also use instead of in the RHS of (20) according to the property of , i.e., and

Note that the first two terms in the RHS of (20) converge quadratically and the last term converges linearly. Due to the fully dynamic property of Anderson mixing discussed at the end of Section 1.2, the exact convergence rate of Anderson mixing in the general case is not easy to obtain. But we note that the convergence rate is roughly since the first two quadratic terms converge much faster than the last linear term. In particular, if is a quadratic function, then and thus in (20). Only the last linear term remains, so the method converges linearly (see the following corollary).
Corollary 1
If is a quadratic function, let stepsize and . Then the convergence rate of Anderson Mixing is linear, i.e., , where is the condition number.
Note that this corollary recovers the previous result (i.e., ) obtained by Toth and Kelley (2015), and we use Chebyshev polynomials to improve this result and obtain the optimal convergence rate in Section 2 (see Theorem 1).
Proof Sketch of Theorem 2: For the iteration , we have according to . First, we demonstrate several useful forms of as follows:
(21)  
(22) 
where (21) holds due to the definition , and (22) holds since .
Then, to bound (i.e., ), we deduce as follows:
(23) 
where (23) uses the definition . Now, we bound the first two terms of (23) as follows:
(24) 
where (24) is obtained by using (22) to replace . To bound (24), we use Assumptions 1, 2, and the equation
After some nontrivial calculations (details can be found in Appendix B.1), we obtain
where denotes the Euclidean norm of . Then, according to the problem (1) and the definition of , we have . Finally, we bound using QR decomposition of problem (1) and recall to finish the proof of Theorem 2.
4 Guessing Algorithm
In this section, we provide a Guessing Algorithm (described in Algorithm 2) which guesses the parameters (e.g., ) dynamically. Intuitively, we guess the parameter and the condition number in a doubling way. Note that in general these parameters are not available, since the time for computing these parameters is almost the same as (or even longer than) solving the original problem. Note that the condition in Line 14 of Algorithm 2 depends on the algorithm used in Line 12.
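The doubling idea can be illustrated with a deliberately simplified sketch. It is not Algorithm 2: it guesses only a smoothness constant for plain gradient descent, runs a block of steps with step size `1 / L_guess`, keeps the block if the gradient norm decreased, and otherwise discards it and doubles the guess. The function name `gd_with_doubling`, its parameters, and the acceptance test are all hypothetical choices made for illustration.

```python
import numpy as np

def gd_with_doubling(grad, x0, L_guess=1.0, inner=10, rounds=40, tol=1e-8):
    """Gradient descent with a doubling guess of the smoothness constant.

    Hypothetical illustration only: each round runs `inner` steps with
    step size 1 / L_guess, accepts them if the gradient norm decreased,
    and otherwise ignores the failed steps and doubles L_guess.
    """
    x = np.asarray(x0, dtype=float)
    g0 = np.linalg.norm(grad(x))
    for _ in range(rounds):
        if g0 <= tol:
            break
        y = x.copy()
        for _ in range(inner):
            y = y - grad(y) / L_guess
        g1 = np.linalg.norm(grad(y))
        if np.isfinite(g1) and g1 < g0:
            x, g0 = y, g1          # accept the block of iterations
        else:
            L_guess *= 2.0         # ignore the failed block, double the guess
    return x, L_guess

# Quadratic with eigenvalues 1, 10, 50; the true smoothness constant is 50.
A = np.diag([1.0, 10.0, 50.0])
b = np.ones(3)
x_hat, L_hat = gd_with_doubling(lambda x: A @ x - b, np.zeros(3))
```

On this example the guesses 1, 2, 4, 8, and 16 are rejected because the component along the largest eigenvalue grows, and the guess 32 is kept, so only a logarithmic number of blocks is wasted before a workable value is found.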
The convergence result is stated in the following Theorem 3.
Theorem 3
Without knowing the true parameters, the guessing algorithm achieves convergence rate for quadratic functions, where , and can be any number as long as the eigenvalue spectrum belongs to , assuming that .
Example: We provide a simple example to show why this guessing algorithm is useful. Note that many algorithms need these parameters to set the step size, whether or not they are combined with Algorithm 2. Thus, we need to approximate these parameters once at the beginning. Let and denote the approximated values, where . Without guessing them dynamically, one fixes and throughout, and the convergence rate cannot be better than . However, according to Theorem 3, the rate is if it is combined with Algorithm 2.
Before proving Theorem 3, we need the following lemmas, whose proofs are provided in Appendix B.3.
Lemma 2
If , then .
Lemma 3
Let , where , then is satisfied.
Lemma 4
Proof of Theorem 3. According to Lemma 4, (in Line 3 of Algorithm 2) is less than and is less than . The inner loop (in Line 5) is obviously less than . Let and denote the number of executions of the do-while loop (Lines 7–14) in each loop iteration (Lines 5–16). Thus, the total number of iterations (corresponding to ) is in each loop iteration. These iterations satisfy the do-while condition, i.e., . We combine the conditions to obtain . Finally, this guessing algorithm satisfies the following Inequality (25).
Note that Lines 15 and 16 of Algorithm 2 ignore the failed iterations. Also, this ignored step can be executed at most once in each loop iteration (Lines 5–16). Let denote the total number of iterations of Algorithm 2. Then .
(25) 
As is less than and . To prove the convergence rate, we need the RHS of (25) ; it is sufficient to satisfy the following inequality
i.e.,
(26) 
By applying Lemma 2 and ignoring the constant, we can transform (26) to (27). Recall that and .
(27) 
This is exactly the same as Lemma 3. Now is bounded by .
5 Experiments
In this section, we conduct numerical experiments on the real-world UCI datasets and synthetic datasets (the UCI datasets can be downloaded from https://archive.ics.uci.edu/ml/datasets.html). We compare the performance of the following five algorithms: the Anderson-Chebyshev mixing method (AM-Cheby), the vanilla Anderson mixing method (AM), vanilla Gradient Descent (GD), Nesterov’s Accelerated Gradient Descent (NAGD) (Nesterov, 2004) and Regularized Minimal Polynomial Extrapolation (RMPE) with (same as (Scieur et al., 2016)). Concretely, Figures 3–4 demonstrate the convergence performance of these algorithms in the quadratic case, Figure 2 demonstrates the convergence performance in the general case, and Figure 5 demonstrates the convergence performance of these algorithms combined with our guessing algorithm (Algorithm 2). The values of in the figure captions denote the mix parameter of the Anderson mixing algorithms (see Line 5 of Algorithm 1).
Figure 2 demonstrates the convergence performance of these algorithms in the general nonlinear case. Concretely, we use the negative log-likelihood as the loss function (logistic regression), i.e., , where . We run these five algorithms on the real-world diabetes and cancer datasets, which are standard UCI datasets.
Figures 3–4 demonstrate the convergence performance of these algorithms in the quadratic case, where . Concretely, we compare the convergence performance of these algorithms when the condition number and the mix parameter are varied, e.g., the left figure in Figure 4 is the case and , where is the parameter for the Anderson mixing algorithms (see Line 5 of Algorithm 1). We run these five algorithms on synthetic datasets in which we randomly generate and for the quadratic function . Note that to randomly generate satisfying the property of , we randomly generate instead and let .
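One way to generate such a synthetic quadratic instance is sketched below, assuming that the auxiliary random matrix (called `B` here, a hypothetical name) has standard Gaussian entries and that the system matrix is formed as `B.T @ B`, which is symmetric positive semidefinite by construction. The function name `random_quadratic`, the dimension, and the seed are illustrative and are not taken from the experiments.

```python
import numpy as np

def random_quadratic(d, seed=0):
    """Draw a random instance of f(x) = 0.5 * x^T A x - b^T x.

    Assumption: A = B^T B with Gaussian B, so A is symmetric positive
    semidefinite (and positive definite with probability one).
    Returns A, b, and the extreme eigenvalues of A.
    """
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((d, d))
    A = B.T @ B
    b = rng.standard_normal(d)
    eigenvalues = np.linalg.eigvalsh(A)        # ascending order
    return A, b, eigenvalues[0], eigenvalues[-1]

A_syn, b_syn, mu_syn, L_syn = random_quadratic(5)
```

The returned extreme eigenvalues play the role of the strong convexity and smoothness parameters of the instance, and their ratio gives its condition number.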
In conclusion, Anderson mixing-type methods converge the fastest in all of our experiments, for both linear and nonlinear problems. The efficient Anderson mixing methods can be viewed as an extension of the momentum methods (e.g., NAGD) and a potential extension of Krylov subspace methods (e.g., GMRES) to nonlinear problems, because GD is the special case of Anderson mixing with $m = 0$, and to some extent NAGD can be viewed as $m = 1$. Note that the performance gap between Anderson mixing methods and (GD, NAGD) in Figure 4 (i.e. ) is somewhat larger than that in Figure 4 (i.e. ). Regarding the Krylov extension, Anderson mixing without truncation is equivalent to the well-known Krylov subspace method GMRES (see Appendix A), and we prove the optimal convergence rate in the quadratic case (Theorem 1), which matches Nesterov’s lower bound. For the general case, we prove the linear-quadratic convergence (Theorem 2). Compared with Newton-like methods, it is more attractive since it does not require computing (or approximating) Hessians or Hessian-vector products.
5.1 Experiments for Guessing Algorithm
In this subsection, we conduct experiments on guessing the hyperparameters (i.e., ) using the proposed guessing algorithm (Algorithm 2). Note that we compute the hyperparameters in advance for our previous experiments (e.g., Figures 3–4) to better compare the convergence performance of these algorithms.
Now, we separately consider these algorithms in Figure 5. For each of them, we compare its convergence performance between its original version and the one combined with our guessing algorithm. The experimental results show that all these algorithms combined with our guessing algorithm achieve much better performance than their original versions.
6 Conclusion
In this paper, we show that the Anderson-Chebyshev mixing method (Anderson mixing with Chebyshev polynomial parameters) can achieve the optimal convergence rate, which improves the previous result provided by Toth and Kelley (2015). Furthermore, we prove the linear-quadratic convergence of Anderson mixing for minimizing general nonlinear problems. Besides, if the hyperparameters (e.g., the Lipschitz smoothness parameter $L$) are not available, we propose a guessing algorithm for guessing them dynamically and also prove a similar convergence rate. Finally, the experimental results demonstrate that this efficient Anderson-Chebyshev mixing method converges significantly faster than other algorithms. This validates that the Anderson-Chebyshev mixing method is efficient both in theory and in practice.
Acknowledgments
The authors would like to thank Claude Brezinski, Rong Ge, Damien Scieur and Le Zhang for useful discussions and suggestions.
References
Aitken [1926] A Aitken. On Bernoulli’s numerical solution of algebraic equations. Proceedings of the Royal Society of Edinburgh, 46:289–305, 1926.
Allen-Zhu [2017] Zeyuan Allen-Zhu. Katyusha: the first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1200–1205. ACM, 2017.
Anderson [1965] Donald G Anderson. Iterative procedures for nonlinear integral equations. Journal of the ACM, 12(4):547–560, 1965.
Boyd and Vandenberghe [2004] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.
Brezinski [2000] Claude Brezinski. Convergence acceleration during the 20th century. Journal of Computational and Applied Mathematics, 122:1–21, 2000.
Brezinski and Redivo Zaglia [1991] Claude Brezinski and M Redivo Zaglia. Extrapolation methods: theory and practice. 1991.
Brezinski et al. [2018] Claude Brezinski, Michela Redivo-Zaglia, and Yousef Saad. Shanks sequence transformations and Anderson acceleration. SIAM Review, 60(3):646–669, 2018.
Bubeck [2015] Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015.
Capehart [1989] Steven Russell Capehart. Techniques for accelerating iterative methods for the solution of mathematical problems. PhD thesis, Oklahoma State University, 1989.
Eyert [1996] V Eyert. A comparative study on methods for convergence acceleration of iterative vector sequences. Journal of Computational Physics, 124(2):271–285, 1996.
Fang and Saad [2009] Haw-ren Fang and Yousef Saad. Two classes of multisecant methods for nonlinear acceleration. Numerical Linear Algebra with Applications, 16(3):197–221, 2009.
Golub and Van Loan [1996] GH Golub and CF Van Loan. Matrix computations. 3rd ed., The Johns Hopkins University Press, Baltimore, MD, 1996.
Hageman and Young [2012] Louis A Hageman and David M Young. Applied iterative methods. Courier Corporation, 2012.
Higham and Strabić [2016] Nicholas J Higham and Nataša Strabić. Anderson acceleration of the alternating projections method for computing the nearest correlation matrix. Numerical Algorithms, 72(4):1021–1042, 2016.
Hoerl and Kennard [1970] Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
Loffeld and Woodward [2016] John Loffeld and Carol S Woodward. Considerations on the implementation and use of Anderson acceleration on distributed memory and GPU-based parallel computers. Advances in the Mathematical Sciences, page 417, 2016.
Nesterov [2004] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.
Olshanskii and Tyrtyshnikov [2014] Maxim A Olshanskii and Eugene E Tyrtyshnikov. Iterative methods for linear systems: theory and applications. SIAM, 2014.
Potra and Engler [2013] Florian A Potra and Hans Engler. A characterization of the behavior of the Anderson acceleration on linear problems. Linear Algebra and its Applications, 438(3):1002–1011, 2013.
Pratapa et al. [2016] Phanisri P Pratapa, Phanish Suryanarayana, and John E Pask. Anderson acceleration of the Jacobi iterative method: An efficient alternative to Krylov methods for large, sparse linear systems. Journal of Computational Physics, 306:43–54, 2016.
Rivlin [1974] Theodore J Rivlin. The Chebyshev polynomials. Wiley, 1974.
Saad and Schultz [1986] Youcef Saad and Martin H Schultz. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing, 7(3):856–869, 1986.
Scieur et al. [2016] Damien Scieur, Alexandre d’Aspremont, and Francis Bach. Regularized nonlinear acceleration. In Advances in Neural Information Processing Systems, pages 712–720, 2016.
 Scieur et al. [2018]