An Anderson-Chebyshev Mixing Method for Nonlinear Optimization

09/07/2018 · by Zhize Li, et al. · Tsinghua University

Anderson mixing (or Anderson acceleration) is an efficient acceleration method for fixed-point iterations (i.e., x_{t+1} = G(x_t)); e.g., gradient descent can be viewed as iteratively applying the operation G(x) = x − α∇f(x). It is known that Anderson mixing is quite efficient in practice and can be viewed as an extension of Krylov subspace methods to nonlinear problems. First, we show that Anderson mixing with Chebyshev polynomial parameters can achieve the optimal convergence rate O(√κ log(1/ϵ)), which improves the previous result O(κ log(1/ϵ)) provided by [Toth and Kelley, 2015] for quadratic functions. Then, we provide a convergence analysis for minimizing general nonlinear problems. Besides, if the hyperparameters (e.g., the Lipschitz smooth parameter L) are not available, we propose a Guessing Algorithm for guessing them dynamically and also prove a similar convergence rate. Finally, the experimental results demonstrate that the proposed Anderson-Chebyshev mixing method converges significantly faster than other algorithms, e.g., vanilla gradient descent (GD) and Nesterov's Accelerated GD (NAGD). Moreover, these algorithms combined with the proposed guessing algorithm (guessing the hyperparameters dynamically) achieve much better performance.


1 Introduction

For the general optimization problem min_x f(x), there exist several techniques to accelerate the standard gradient descent, e.g., Nesterov momentum (Nesterov, 2004) and Katyusha momentum (Allen-Zhu, 2017). There are also various vector sequence acceleration methods developed in the numerical analysis literature, e.g., (Brezinski, 2000; Sidi et al., 1986; Smith et al., 1987; Brezinski and Redivo Zaglia, 1991; Brezinski et al., 2018). Roughly speaking, if a vector sequence converges very slowly to its limit, then one may apply such methods to accelerate its convergence. Taking gradient descent as an example, the vector sequence is generated by x_{t+1} = G(x_t) = x_t − α∇f(x_t), where the limit x* is the fixed point (i.e., x* = G(x*), equivalently ∇f(x*) = 0). One notable advantage of such acceleration methods is that they usually do not require knowing how the vector sequence is actually generated. Thus the applicability of these methods is very wide.
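To make the fixed-point view concrete, here is a minimal sketch in Python/NumPy (our own illustration, with a hand-picked quadratic objective and step size α) that iterates x_{t+1} = G(x_t) = x_t − α∇f(x_t) and stops when the residual ‖G(x_t) − x_t‖ is small; any sequence acceleration method can then be applied directly to the iterates x_t.

import numpy as np

def fixed_point_gd(grad, x0, alpha, tol=1e-8, max_iter=10000):
    # Gradient descent viewed as the fixed-point iteration x_{t+1} = G(x_t).
    x = x0.astype(float)
    for _ in range(max_iter):
        g = x - alpha * grad(x)          # G(x) = x - alpha * grad f(x)
        if np.linalg.norm(g - x) < tol:  # residual ||G(x) - x|| = alpha * ||grad f(x)||
            return g
        x = g
    return x

# Illustrative quadratic: f(x) = 0.5 x^T A x - b^T x, so grad f(x) = A x - b.
A = np.diag([1.0, 10.0])
b = np.array([1.0, 1.0])
x_star = fixed_point_gd(lambda x: A @ x - b, x0=np.zeros(2), alpha=0.1)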

Recently, Scieur et al. (2016) used the minimal polynomial extrapolation (MPE) method (Smith et al., 1987) for convergence acceleration. This is a nice example of applying sequence acceleration methods to optimization problems. In this paper, we are interested in another classical sequence acceleration method called Anderson acceleration (or Anderson mixing), which was proposed by Anderson in 1965 (Anderson, 1965). The method is known to be quite efficient in a variety of applications (Capehart, 1989; Pratapa et al., 2016; Higham and Strabić, 2016; Loffeld and Woodward, 2016). The idea of Anderson mixing is to maintain the m most recent iterates for determining the next iterate, where m is a parameter (typically a very small constant). Thus, it can be viewed as an extension of the existing momentum methods, which usually use only the last and current points to determine the next iterate. Anderson mixing with slight modifications is formally described in Algorithm 1.

1 input: x_0, window size m ≥ 1, mixing parameters {β_t};
2 Define the residual r(x) := −∇f(x);
3 r_0 = r(x_0), x_1 = x_0 + β_0 r_0;
4 for t = 1, 2, …, T do
5       choose the mixing parameter β_t;
6       m_t = min(m, t), r_t = r(x_t);
7       Solve min_{α^t} ‖∑_{i=0}^{m_t} α_i^t r_{t−m_t+i}‖ subject to ∑_{i=0}^{m_t} α_i^t = 1;
8       x_{t+1} = ∑_{i=0}^{m_t} α_i^t (x_{t−m_t+i} + β_t r_{t−m_t+i});
9      end for
return x_{T+1}
Algorithm 1 Anderson Mixing (window size m, mixing parameters β_t)

Note that the step in Line 7 of Algorithm 1 can be transformed into an equivalent unconstrained least-squares problem:

(1)   min_{γ^t ∈ R^{m_t}}  ‖ r_t − ∑_{j=1}^{m_t} γ_j^t (r_{t−m_t+j} − r_{t−m_t+j−1}) ‖,

then let α_0^t = γ_1^t, α_j^t = γ_{j+1}^t − γ_j^t for 1 ≤ j ≤ m_t − 1, and α_{m_t}^t = 1 − γ_{m_t}^t. Using QR decomposition, (1) can be solved in O(m_t^2 d) time, where d is the dimension. Moreover, the QR decomposition of (1) at iteration t can be efficiently obtained from that at iteration t − 1 (see, e.g., (Golub and Van Loan, 1996)). The constant m is usually very small; for the numerical experiments in Section 5, we use small values of m. Hence, each iteration of Anderson mixing can be implemented quite efficiently.
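For concreteness, the following NumPy sketch (ours, not the authors' code) follows the description above: it keeps the m most recent residuals, solves the constrained least-squares step of Line 7 through the unconstrained reformulation (1) with numpy.linalg.lstsq (a QR update would be cheaper, as noted), and mixes points and residuals with the parameter β_t. Taking the residual as r(x) = −∇f(x) and reusing the last β when the schedule runs out are assumptions made for illustration.

import numpy as np

def anderson_mixing(grad, x0, betas, m=5, max_iter=200, tol=1e-10):
    # Sketch of Algorithm 1: residual r(x) = -grad f(x), window size m,
    # per-iteration mixing parameters beta_t.
    xs = [x0.astype(float)]
    rs = [-grad(xs[0])]
    xs.append(xs[0] + betas[0] * rs[0])            # x_1 = x_0 + beta_0 * r_0
    for t in range(1, max_iter):
        rs.append(-grad(xs[t]))
        if np.linalg.norm(rs[t]) < tol:
            break
        mt = min(m, t)
        X = np.stack(xs[t - mt:t + 1], axis=1)     # d x (mt+1) recent iterates
        R = np.stack(rs[t - mt:t + 1], axis=1)     # d x (mt+1) recent residuals
        # Line 7 via the unconstrained form (1): min_gamma ||r_t - dR gamma||,
        # where dR holds successive residual differences.
        dR = np.diff(R, axis=1)
        gamma, *_ = np.linalg.lstsq(dR, rs[t], rcond=None)
        a = np.zeros(mt + 1)                       # recover alpha with sum(a) = 1
        a[0] = gamma[0]
        a[1:mt] = np.diff(gamma)
        a[mt] = 1.0 - gamma[-1]
        beta = betas[min(t, len(betas) - 1)]
        xs.append((X + beta * R) @ a)              # x_{t+1} = sum_i a_i (x_i + beta * r_i)
    return xs[-1]

For example, anderson_mixing(lambda x: A @ x - b, np.zeros(2), betas=np.full(200, 0.1), m=5) runs the sketch on the quadratic example given earlier.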

Many studies have shown the relations between Anderson mixing and other optimization methods. In particular, for the quadratic case (linear problems), Walker and Ni (2011) showed that it is related to the well-known Krylov subspace method GMRES (generalized minimal residual algorithm) (Saad and Schultz, 1986). Furthermore, Potra and Engler (2013) showed that GMRES is equivalent to Anderson mixing without truncation for any mixing parameters β_t (see Line 5 of Algorithm 1) on linear problems. Toth and Kelley (2015) proved the first linear convergence rate O(κ log(1/ϵ)) for linear problems with a fixed mixing parameter β, where κ is the condition number. Besides, Eyert (1996) and Fang and Saad (2009) showed that Anderson mixing is related to multisecant quasi-Newton methods (more concretely, the generalized Broyden's second method). Despite the above results, the convergence results for this efficient method are still limited (especially for general nonlinear functions and the case where m is small).

1.1 Our Contributions

There has been a growing number of applications of the Anderson mixing method (Pratapa et al., 2016; Higham and Strabić, 2016; Loffeld and Woodward, 2016; Scieur et al., 2018). Towards a better understanding of this efficient method, we make the following technical contributions:

  1. We prove the optimal convergence rate O(√κ log(1/ϵ)) of the proposed Anderson-Chebyshev mixing method (Anderson mixing with Chebyshev polynomial parameters) for minimizing quadratic functions (see Theorem 1). Our result improves the previous result O(κ log(1/ϵ)) obtained with fixed parameters by Toth and Kelley (2015) and matches the lower bound provided by Nesterov (2004).

  2. Then, we prove the linear-quadratic convergence of Anderson mixing for minimizing general nonlinear problems under some reasonable assumptions (see Theorem 2). Compared with Newton-like methods, it is more attractive since it does not require computing (or approximating) Hessians or Hessian-vector products.

  3. Besides, we propose a Guessing Algorithm for the case when the hyperparameters (e.g., the Lipschitz smooth parameter L) are not available, and we prove that it achieves a similar convergence rate (see Theorem 3). This guessing algorithm can also be combined with other algorithms, e.g., Gradient Descent (GD) and Nesterov's Accelerated GD (NAGD). The experimental results (see Section 5) show that these algorithms combined with the guessing algorithm achieve much better performance.

  4. Finally, the experimental results on the real-world UCI datasets and synthetic datasets demonstrate that Anderson mixing methods converge significantly faster than other algorithms (see Section 5). This validates that Anderson mixing methods (especially Anderson-Chebyshev mixing method) are efficient both in theory and practice.

1.2 Related Work

As aforementioned, Anderson mixing can be viewed as an extension of the momentum methods (e.g., NAGD) and a potential extension of Krylov subspace methods (e.g., GMRES) to nonlinear problems. In particular, GD is the special case of Anderson mixing with m = 0, and to some extent NAGD can be viewed as the case m = 1. We also review the equivalence of GMRES and Anderson mixing without truncation (i.e., m = ∞) in Appendix A. Besides, Eyert (1996) and Fang and Saad (2009) showed that Anderson mixing is related to multisecant quasi-Newton methods. Note that Anderson mixing has an advantage over Newton-like methods since it does not require the computation of Hessians, approximations of Hessians, or Hessian-vector products.

There are many sequence acceleration methods in the numerical analysis literature. In particular, the well-known Aitken's Δ² process (Aitken, 1926) accelerates the convergence of sequences that converge linearly. Shanks generalized the Aitken extrapolation, which is known as the Shanks transformation (Shanks, 1955). Recently, Brezinski et al. (2018) proposed a general framework for Shanks sequence transformations which includes many vector sequence acceleration methods. One fundamental difference between Anderson mixing and other sequence acceleration methods (such as MPE and RRE (reduced rank extrapolation) (Sidi et al., 1986; Smith et al., 1987)) is that Anderson mixing is a fully dynamic method (Capehart, 1989). Here dynamic means that all iterations are applied to the same sequence and the procedure never needs to be restarted, as can be seen from Algorithm 1. In fact, in Capehart's PhD thesis (Capehart, 1989), several experiments were conducted to demonstrate the superior performance of Anderson mixing over semi-dynamic methods such as MPE and RRE (semi-dynamic means that the algorithm maintains more than one sequence or needs to restart several times).

2 The Quadratic Case

In this section, we consider the problem of minimizing a quadratic function (also called least squares, or ridge regression (Boyd and Vandenberghe, 2004; Hoerl and Kennard, 1970)). The formulation of the problem is

(2)   min_x f(x) = (1/2) x^T A x − b^T x,

where μI ⪯ A ⪯ LI. Note that μ and L are usually called the strongly convex parameter and the Lipschitz continuous gradient parameter, respectively (e.g., (Nesterov, 2004)). There are many algorithms for optimizing this type of function; see, e.g., (Bubeck, 2015) for more details. We analyze the problem of minimizing a general function in Section 3.
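As a concrete instance of (2) (with illustrative, hand-chosen A and b), the gradient ∇f(x) = Ax − b serves as the residual map for Algorithm 1, and μ, L, and κ can be read off from the spectrum of A:

import numpy as np

A = np.diag([0.5, 2.0, 10.0])     # eigenvalues give mu = 0.5 and L = 10
b = np.array([1.0, -1.0, 0.5])

def f(x):
    return 0.5 * x @ (A @ x) - b @ x

def grad(x):
    return A @ x - b              # residual map used by Anderson mixing

eigs = np.linalg.eigvalsh(A)
mu, L = eigs.min(), eigs.max()
kappa = L / mu                    # condition number kappa = L / mu = 20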

We prove that Anderson mixing with Chebyshev polynomial parameters achieves the optimal convergence rate. The convergence result is stated in the following Theorem 1. Let denote the maximum integer such that for any . The and 's are defined as follows: , , where is a unit vector, 's are the unit eigenvectors of , and denotes the projection to the orthogonal complement of the column space of . Obviously, since (then ), and since is a projection operator.

Theorem 1

The Anderson-Chebyshev mixing method achieves the optimal convergence rate O(√κ log(1/ϵ)) for problem (2), where κ = L/μ is the condition number. This method combines Anderson Mixing (Algorithm 1) with the Chebyshev polynomial parameters β_t, t = 1, …, T, derived from the roots of the degree-T Chebyshev polynomial (see (8)).

Remark: In this quadratic case, we mention that Toth and Kelley (2015) proved the first convergence rate O(κ log(1/ϵ)) for a fixed parameter β. Here we use the Chebyshev polynomials to improve the result to the optimal one, i.e., O(√κ log(1/ϵ)). Besides, the constant m is usually very small; a small m has already achieved remarkable performance in our experimental results (see Figures 2-4).
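As an illustration (an assumption on our part, not the exact scaling from the proof of Theorem 1), one natural instantiation maps the roots (8) of the degree-T Chebyshev polynomial from [−1, 1] onto the spectrum interval [μ, L] and takes reciprocals, as in classical Chebyshev iteration:

import numpy as np

def chebyshev_mixing_parameters(mu, L, T):
    # Candidate beta_t built from the T roots of the Chebyshev polynomial,
    # mapped affinely from [-1, 1] to [mu, L] (illustrative choice).
    k = np.arange(1, T + 1)
    roots = np.cos((2 * k - 1) * np.pi / (2 * T))       # roots of T_T, see (8)
    mapped = 0.5 * (L + mu) + 0.5 * (L - mu) * roots    # affine map to [mu, L]
    return 1.0 / mapped

betas = chebyshev_mixing_parameters(mu=0.5, L=10.0, T=50)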

Before proving Theorem 1, some properties of the Chebyshev polynomials are briefly reviewed. We refer to Rivlin (1974); Olshanskii and Tyrtyshnikov (2014); Hageman and Young (2012) for more details of Chebyshev polynomials.

The Chebyshev polynomials are polynomials T_n(x), n = 0, 1, 2, …, defined by the recursive relation:

(3)   T_0(x) = 1,  T_1(x) = x,  T_{n+1}(x) = 2x T_n(x) − T_{n−1}(x).

The key property is that 2^{1−n} T_n(x) has minimal deviation from 0 on [−1, 1] among all polynomials of degree n with leading coefficient 1, i.e.,

(4)   max_{x ∈ [−1,1]} |2^{1−n} T_n(x)| = min_{p: deg p = n, leading coefficient 1} max_{x ∈ [−1,1]} |p(x)|.

In particular, for |x| ≤ 1, the Chebyshev polynomials can be written in an equivalent way:

(5)   T_n(x) = cos(n arccos x).

In our proof, we use this equivalent form (5) instead of (3). The equivalence can be verified as follows:

(6)   T_{n+1}(x) = cos((n+1)θ) = 2 cos θ cos(nθ) − cos((n−1)θ)
(7)              = 2x T_n(x) − T_{n−1}(x),

where (6) and (7) use the transformation x = cos θ, valid since |x| ≤ 1. According to (5), |T_n(x)| ≤ 1 for |x| ≤ 1, and the roots of T_n are as follows:

(8)   x_k = cos((2k − 1)π / (2n)),   k = 1, 2, …, n.

To demonstrate this more clearly, we provide an example for T_4 (W-shape curve) in Figure 1. Since n = 4 in this polynomial, the first root is x_1 = cos(π/8). The remaining three roots x_k = cos((2k − 1)π/8) for k = 2, 3, 4 can be easily computed too.

Figure 1: The Chebyshev polynomial T_4(x)
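As a quick check of (8) for this example, the four roots of T_4 evaluate to:

x_1 = \cos\tfrac{\pi}{8} \approx 0.924, \quad
x_2 = \cos\tfrac{3\pi}{8} \approx 0.383, \quad
x_3 = \cos\tfrac{5\pi}{8} \approx -0.383, \quad
x_4 = \cos\tfrac{7\pi}{8} \approx -0.924.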

Proof of Theorem 1. For iteration , the residual can be deduced as follows:

(9)
(10)

where (9) uses .

To bound the residual, we first obtain the following lemma by using the Singular Value Decomposition (SVD) to solve the least-squares problem (1) and then using several transformations. We defer the proof of Lemma 1 to Appendix B.2.

Lemma 1
(11)

where is a degree polynomial.

According to Lemma 1, to bound , it is sufficient to bound the right-hand-side (RHS) of (11) (i.e., ). In order to bound this, we first transform into . Let , where . We have the following equalities:

(12)

According to (4) (the optimal property of standard Chebyshev polynomials), when (note that here ), the RHS of (11) can be bounded as follows:

(13)
(14)

where (13) uses (12), and (14) uses (see (5)). According to (8), it is not hard to see that is defined by the mixing parameters , where ). Note that the roots of standard Chebyshev polynomials (i.e., (8)) can be obtained from many textbooks, e.g., Section 1.2 of Rivlin (1974). Now, we only need to bound . First, we need to transform the form (5) of Chebyshev polynomials as follows:

Let , we get . So we have

(15)

Now, the RHS of (11) can be bounded as

(16)
(17)

where (16) follows from (14), and (17) follows from (15). Then, according to (11), the gradient norm is bounded by the corresponding Chebyshev quantity. Note that if the number of iterations T = O(√κ log(1/ϵ)), then this bound drops below ϵ. Thus the Anderson-Chebyshev mixing method achieves the optimal convergence rate O(√κ log(1/ϵ)) for obtaining an ϵ-approximate solution.
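To spell out the last step (a standard Chebyshev calculation, written here for completeness; the constants may differ slightly from those in the paper), the Chebyshev contraction factor satisfies

\frac{1}{T_T\!\left(\frac{L+\mu}{L-\mu}\right)}
\;\le\; 2\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{T}
\;\le\; 2\,e^{-2T/(\sqrt{\kappa}+1)}
\;\le\; \epsilon
\qquad \text{whenever} \qquad
T \;\ge\; \frac{\sqrt{\kappa}+1}{2}\,\log\frac{2}{\epsilon}
\;=\; O\!\left(\sqrt{\kappa}\,\log\tfrac{1}{\epsilon}\right).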

3 The General Case

In this section, we analyze Anderson mixing (Algorithm 1) in the general nonlinear case:

(18)   min_x f(x).

We prove that Anderson mixing method achieves the linear-quadratic convergence rate under the following standard Assumptions 1 and 2, where denotes the Euclidean norm. Let denote the small matrix of the least-square problem in Line 7 of Algorithm 1, i.e., (see problem (1)). Then, we define its condition number , where denotes the least non-zero singular value of . We further define .

Assumption 1

The Hessian satisfies μI ⪯ ∇²f(x) ⪯ LI for all x, where 0 < μ ≤ L.

Assumption 2

The Hessian is ρ-Lipschitz continuous (we use ρ to denote its Lipschitz constant), i.e.,

(19)   ‖∇²f(x) − ∇²f(y)‖ ≤ ρ ‖x − y‖   for all x, y.
Theorem 2

Suppose Assumptions 1 and 2 hold and the step-size is suitably chosen. Then the convergence rate of Anderson Mixing (Algorithm 1) is linear-quadratic for (18), i.e.,

(20)

where ,  ,   and .

Remark:

  1. The constant m is usually very small; in particular, we use small values of m for the numerical experiments in Section 5. Hence the corresponding term in (20) is very small and also decreases as the iteration count increases.

  2. Besides, one can also use instead of in the RHS of (20) according to the property of , i.e., and

  3. Note that the first two terms on the RHS of (20) converge quadratically and the last term converges linearly. Due to the fully dynamic property of Anderson mixing discussed at the end of Section 1.2, the exact convergence rate of Anderson mixing in the general case is not easy to obtain. But we note that the convergence rate is roughly governed by the last term, since the first two quadratic terms converge much faster than the last linear term. In particular, if f is a quadratic function, then the Hessian is constant and the first two terms vanish in (20). Only the last linear term remains; thus the method converges linearly (see the following corollary).

Corollary 1

If f is a quadratic function, then with a suitable step-size the convergence rate of Anderson Mixing is linear, i.e., O(κ log(1/ϵ)), where κ is the condition number.

Note that this corollary recovers the previous result (i.e., O(κ log(1/ϵ))) obtained by (Toth and Kelley, 2015), and we use Chebyshev polynomials to improve this result to the optimal convergence rate O(√κ log(1/ϵ)) in Section 2 (see Theorem 1).

Now, we provide a proof sketch for Theorem 2. The detailed proof can be found in Appendix B.1.

Proof Sketch of Theorem 2: For the iteration , we have according to . First, we demonstrate several useful forms of as follows:

(21)
(22)

where (21) holds due to the definition , and (22) holds since .

Then, to bound (i.e., ), we deduce as follows:

(23)

where (23) uses the definition . Now, we bound the first two terms of (23) as follows:

(24)

where (24) is obtained by using (22) to replace . To bound (24), we use Assumptions 1, 2, and the equation

After some non-trivial calculations (details can be found in Appendix B.1), we obtain

where denotes the Euclidean norm of . Then, according to the problem (1) and the definition of , we have . Finally, we bound using QR decomposition of problem (1) and recall to finish the proof of Theorem 2.

4 Guessing Algorithm

In this section, we provide a Guessing Algorithm (described in Algorithm 2) which guesses the parameters (e.g., the Lipschitz smooth parameter L) dynamically. Intuitively, we guess the unknown parameter and the condition number in a doubling way. Note that in general these parameters are not available, since the time for computing them is almost the same as (or even longer than) that for solving the original problem. Note that the condition in Line 14 of Algorithm 2 depends on the algorithm used in Line 12.

The convergence result is stated in the following Theorem 3.

Theorem 3

Without knowing the true parameters, the guessing algorithm achieves a similar convergence rate for quadratic functions, where the guessed bound can be any number as long as the eigenvalue spectrum belongs to the guessed range.

Example: We provide a simple example to show why this guessing algorithm is useful. Note that many algorithms need these parameters to set the step size, whether or not they are combined with Algorithm 2. Thus, we need to approximate these parameters once at the beginning, and the approximated values may be looser than the true ones. Without guessing them dynamically, one fixes these approximate values all the time, and the convergence rate can be no better than the rate determined by the loose approximations. However, according to our Theorem 3, a better rate is obtained if the method is combined with Algorithm 2.

1 input:
2 Let ;
3 for  do
4       ;
5       for  do
6             ;
7             do
8                   ;
9                   if  then
10                         break;
11                        
12                  ;
13                   Anderson Mixing()  // it can be replaced by other algorithms;
14                   ;
15                  
16            while  ;
17            if  then
18                   ;
19                  
20            
21      
return
Algorithm 2 Guessing Algorithm
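Since the bookkeeping in Algorithm 2 is somewhat involved, the following simplified Python sketch (our own, not a transcription of Algorithm 2) conveys the doubling idea: guess a smoothness bound, run the inner solver for a fixed budget, and double the guess whenever a progress check fails. The solver interface and the gradient-norm progress test are assumptions for illustration.

import numpy as np

def doubling_guess_solver(solver, grad, x0, L0=1.0, max_rounds=20,
                          budget=100, decrease=0.9):
    # Simplified doubling-guess wrapper: any inner method that needs a
    # smoothness guess L (GD with step 1/L, NAGD, Anderson mixing with
    # Chebyshev parameters, ...) can be plugged in as `solver`.
    x, L = x0.astype(float), L0
    for _ in range(max_rounds):
        x_new = solver(grad, x, L, budget)
        if np.linalg.norm(grad(x_new)) <= decrease * np.linalg.norm(grad(x)):
            x = x_new                  # enough progress: keep the guess
        else:
            L *= 2.0                   # too little progress: double the guess
    return x

def gd_solver(grad, x, L, budget):
    # Example inner solver: gradient descent with step size 1/L.
    for _ in range(budget):
        x = x - grad(x) / L
    return x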

Before proving Theorem 3, we need the following lemmas; their proofs are provided in Appendix B.3.

Lemma 2

If , then .

Lemma 3

Let , where , then is satisfied.

Lemma 4

The condition number (in Line 4 of Algorithm 2) is always less than , where is the true condition number. Equivalently, (in Line 3 of Algorithm 2) is always less than .

Proof of Theorem 3. According to Lemma 4, the counter in Line 3 of Algorithm 2 is bounded, and so is the guess in Line 4. The inner loop (Line 5) is obviously bounded as well. Consider the number of executions of the do-while loop (Lines 7-14) in each loop iteration (Lines 5-16); the total number of iterations in each loop iteration follows, and these iterations satisfy the do-while condition. Combining these conditions, the guessing algorithm satisfies the following inequality (25).

Note that Lines 15 and 16 of Algorithm 2 ignore the failed iterations; this ignored step can be executed at most once in each loop iteration (Lines 5-16). The total number of iterations of Algorithm 2 is counted accordingly.

(25)

Since these quantities are bounded as above, in order to prove the convergence rate we need the RHS of (25) to be at most ϵ; it is sufficient to satisfy the following inequality

i.e.,

(26)

By applying Lemma 2 and ignoring the constant, we can transform (26) to (27). Recall that and .

(27)

This is exactly the same as Lemma 3. Now is bounded by .

5 Experiments

In this section, we conduct numerical experiments on real-world UCI datasets and synthetic datasets (the UCI datasets can be downloaded from https://archive.ics.uci.edu/ml/datasets.html). We compare the performance of five algorithms: the Anderson-Chebyshev mixing method (AM-Cheby), the vanilla Anderson mixing method (AM), vanilla Gradient Descent (GD), Nesterov's Accelerated Gradient Descent (NAGD) (Nesterov, 2004), and Regularized Minimal Polynomial Extrapolation (RMPE) with the same setting as (Scieur et al., 2016). Concretely, Figures 3 and 4 demonstrate the convergence performance of these algorithms in the quadratic case, Figure 2 demonstrates the convergence performance in the general case, and Figure 5 demonstrates the convergence performance of these algorithms combined with our guessing algorithm (Algorithm 2). The values given in the figure captions denote the mixing parameter of the Anderson mixing algorithms (see Line 5 of Algorithm 1).


Figure 2: Logistic regression (diabetes and cancer datasets)

Figure 2 demonstrates the convergence performance of these algorithms in the general nonlinear case. Concretely, we use the negative log-likelihood as the loss function (logistic regression). We run these five algorithms on the real-world diabetes and cancer datasets, which are standard UCI datasets.
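For reference, the snippet below uses the standard binary logistic negative log-likelihood and its gradient (our assumed form of the loss; labels y in {0, 1}); this gradient is what the methods above iterate on, e.g., through the map G(w) = w − α∇f(w) or as the residual in Anderson mixing.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_nll(w, X, y):
    # Negative log-likelihood for labels y in {0, 1} and features X (n x d).
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

def logistic_grad(w, X, y):
    return X.T @ (sigmoid(X @ w) - y) / len(y)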

Figure 3 and Figure 4: Convergence in the quadratic case for different condition numbers and mixing parameters

Figures 3 and 4 demonstrate the convergence performance of these algorithms in the quadratic case (2). Concretely, we compare the convergence performance of these algorithms as the condition number and the mixing parameter are varied; e.g., the left plot in Figure 4 corresponds to one such setting of the condition number and the mixing parameter (see Line 5 of Algorithm 1). We run these five algorithms on synthetic datasets in which we randomly generate A and b for the quadratic function f. Note that, to obtain a random A with the required spectral property, we randomly generate an auxiliary matrix instead and construct A from it.
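One common recipe (an assumption here, not necessarily the authors' exact construction) for such synthetic quadratics is to draw a random orthogonal basis and place the eigenvalues of A between μ and L, which fixes the condition number exactly:

import numpy as np

def random_quadratic(d, mu, L, seed=0):
    # Random f(x) = 0.5 x^T A x - b^T x with the spectrum of A inside [mu, L].
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal basis
    eigs = np.linspace(mu, L, d)                       # eigenvalues spanning [mu, L]
    A = Q @ np.diag(eigs) @ Q.T
    b = rng.standard_normal(d)
    return A, b

A, b = random_quadratic(d=100, mu=1.0, L=1e4)          # condition number kappa = 1e4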

In conclusion, Anderson mixing-type methods converge the fastest in all of our experiments, regardless of whether the problem is linear or nonlinear. The efficient Anderson mixing methods can be viewed as an extension of the momentum methods (e.g., NAGD) and a potential extension of Krylov subspace methods (e.g., GMRES) to nonlinear problems, because GD is the special case of Anderson mixing with m = 0, and to some extent NAGD can be viewed as the case m = 1. Note that the performance gap between the Anderson mixing methods and (GD, NAGD) is somewhat larger in some of the settings of Figures 3 and 4 than in others. Regarding the Krylov extension, Anderson mixing without truncation is equivalent to the well-known Krylov subspace method GMRES (see Appendix A), and we prove the optimal convergence rate O(√κ log(1/ϵ)) in the quadratic case (Theorem 1), which matches Nesterov's lower bound. For the general case, we prove linear-quadratic convergence (Theorem 2). Compared with Newton-like methods, Anderson mixing is more attractive since it does not require computing (or approximating) Hessians or Hessian-vector products.

5.1 Experiments for Guessing Algorithm

In this subsection, we conduct experiments on guessing the hyperparameters (e.g., μ and L) using the proposed guessing algorithm (Algorithm 2). Note that we computed the hyperparameters in advance for our previous experiments (e.g., Figures 3 and 4) in order to better compare the convergence performance of the algorithms.

Now, we consider these algorithms separately in Figure 5. For each of them, we compare the convergence performance of its original version and the version combined with our guessing algorithm. The experimental results show that all of these algorithms combined with our guessing algorithm achieve much better performance than their original versions.

(a) Gradient Descent
(b) Nesterov’s AGD

(c) Anderson Mixing
(d) Anderson-Chebyshev
Figure 5: Algorithms with/without guessing algorithm

6 Conclusion

In this paper, we show that the Anderson-Chebyshev mixing method (Anderson mixing with Chebyshev polynomial parameters) can achieve the optimal convergence rate O(√κ log(1/ϵ)), which improves the previous result O(κ log(1/ϵ)) provided by (Toth and Kelley, 2015). Furthermore, we prove the linear-quadratic convergence of Anderson mixing for minimizing general nonlinear problems. Besides, if the hyperparameters (e.g., the Lipschitz smooth parameter L) are not available, we propose a guessing algorithm for guessing them dynamically and also prove a similar convergence rate. Finally, the experimental results demonstrate that this efficient Anderson-Chebyshev mixing method converges significantly faster than other algorithms. This validates that the Anderson-Chebyshev mixing method is efficient both in theory and in practice.

Acknowledgments

The authors would like to thank Claude Brezinski, Rong Ge, Damien Scieur and Le Zhang for useful discussions and suggestions.

References

  • Aitken [1926] A. Aitken. On Bernoulli's numerical solution of algebraic equations. Proceedings of the Royal Society of Edinburgh, 46:289–305, 1926.
  • Allen-Zhu [2017] Zeyuan Allen-Zhu. Katyusha: the first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1200–1205. ACM, 2017.
  • Anderson [1965] Donald G Anderson. Iterative procedures for nonlinear integral equations. Journal of the ACM, 12(4):547–560, 1965.
  • Boyd and Vandenberghe [2004] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
  • Brezinski [2000] Claude Brezinski. Convergence acceleration during the 20th century. Journal of Computational and Applied Mathematics, 122:1–21, 2000.
  • Brezinski and Redivo Zaglia [1991] Claude Brezinski and M Redivo Zaglia. Extrapolation methods: theory and practice. 1991.
  • Brezinski et al. [2018] Claude Brezinski, Michela Redivo-Zaglia, and Yousef Saad. Shanks sequence transformations and Anderson acceleration. SIAM Review, 60(3):646–669, 2018.
  • Bubeck [2015] Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.
  • Capehart [1989] Steven Russell Capehart. Techniques for accelerating iterative methods for the solution of mathematical problems. PhD thesis, Oklahoma State University, 1989.
  • Eyert [1996] V Eyert. A comparative study on methods for convergence acceleration of iterative vector sequences. Journal of Computational Physics, 124(2):271–285, 1996.
  • Fang and Saad [2009] Haw-ren Fang and Yousef Saad. Two classes of multisecant methods for nonlinear acceleration. Numerical Linear Algebra with Applications, 16(3):197–221, 2009.
  • Golub and Van Loan [1996] GH Golub and