Quasi-Newton Methods for Saddle Point Problems

11/04/2021
by Chengchang Liu, et al.
Fudan University

This paper studies quasi-Newton methods for solving strongly-convex-strongly-concave saddle point problems (SPP). We propose a variant of the general greedy Broyden family update for SPP, which has an explicit local superlinear convergence rate of 𝒪((1 − 1/(nκ²))^{k(k−1)/2}), where n is the dimension of the problem, κ is the condition number and k is the number of iterations. The design and analysis of the proposed algorithm are based on estimating the square of the indefinite Hessian matrix, which is different from classical quasi-Newton methods in convex optimization. We also present two specific Broyden family algorithms with BFGS-type and SR1-type updates, which enjoy the faster local convergence rate of 𝒪((1 − 1/n)^{k(k−1)/2}).

1 Introduction

In this paper, we focus on the following smooth saddle point problem

(1)    min_x max_y f(x, y),

where f(x, y) is strongly convex in x and strongly concave in y. We aim to find the saddle point (x*, y*), which satisfies

f(x*, y) ≤ f(x*, y*) ≤ f(x, y*)

for all x and y. This formulation covers many scenarios, including game theory [2, 36], AUC maximization [14, 40], robust optimization [3, 13, 33], empirical risk minimization [42] and reinforcement learning [11].

There are a great number of first-order optimization algorithms for solving problem (1), including the extragradient method [18, 35], optimistic gradient descent ascent [8], the proximal point method [28] and dual extrapolation [23]. These algorithms iterate with a first-order oracle and achieve linear convergence. Lin et al. [21] and Wang and Li [37] used Catalyst acceleration to reduce the complexity for unbalanced saddle point problems, nearly matching the lower bound for first-order algorithms [25, 41] under specific assumptions. Compared with first-order methods, second-order methods usually enjoy superior convergence in numerical optimization. Huang et al. [16] extended the cubic regularized Newton (CRN) method [23, 22] to solve saddle point problem (1), which has quadratic local convergence. However, each iteration of CRN requires access to the exact Hessian matrix and the solution of the corresponding linear system, which costs 𝒪(n³) time and is too expensive for high-dimensional problems.

Quasi-Newton methods [6, 5, 34, 4, 9] are popular approaches for avoiding the exact second-order information required by standard Newton methods. They approximate the Hessian matrix based on the Broyden family updating formulas [4], which significantly reduces the computational cost. These algorithms are well studied for convex optimization. The famous quasi-Newton methods, including the Davidon-Fletcher-Powell (DFP) method [9, 12], the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method [6, 5, 34] and the symmetric rank-1 (SR1) method [4, 9], enjoy local superlinear convergence [27, 7, 10] when the objective function is strongly convex. Recently, Rodomanov and Nesterov [29, 30, 31] proposed greedy variants of quasi-Newton methods, which are the first to achieve non-asymptotic superlinear convergence rates. Later, Lin et al. [20] established a better convergence rate which is condition-number-free. Jin and Mokhtari [17] and Ye et al. [39] showed that non-asymptotic superlinear convergence also holds for the classical DFP, BFGS and SR1 methods.

In this paper, we study quasi-Newton methods for saddle point problem (1). Since the Hessian matrix of our objective function is indefinite, the existing Broyden family update formulas and their convergence analysis cannot be applied directly. To overcome this issue, we propose a variant of the greedy quasi-Newton framework for saddle point problems, which approximates the square of the Hessian matrix during the iterations. Our theoretical analysis characterizes the convergence rate by the Euclidean distance to the saddle point, rather than by the weighted gradient norm used in convex optimization [29, 30, 31, 20, 39, 17]. We summarize the theoretical results for the proposed algorithms in Table 1. The local convergence of all the algorithms has two periods: the first period consists of finitely many iterations with a linear convergence rate, and the second one enjoys superlinear convergence:

  • For the general Broyden family method, we have the explicit rate 𝒪((1 − 1/(nκ²))^{k(k−1)/2}).

  • For the BFGS and SR1 methods, we have the faster explicit rate 𝒪((1 − 1/n)^{k(k−1)/2}), which is condition-number-free.

Additionally, our ideas can also be used for solving general nonlinear equations.

Paper Organization

In Section 2, we introduce the notation and preliminaries used throughout this paper. In Section 3, we first propose greedy quasi-Newton methods for the quadratic saddle point problem, which enjoy local superlinear convergence, and then extend them to general strongly-convex-strongly-concave saddle point problems. In Section 4, we show that our theory can also be applied to more general nonlinear equations and give the corresponding convergence analysis. We conclude our work in Section 5. All proofs are deferred to the appendix.

Algorithms                      Upper bound on the Euclidean distance to the saddle point
Broyden family (Algorithm 5)    𝒪((1 − 1/(nκ²))^{k(k−1)/2})
BFGS/SR1 (Algorithm 6/7)        𝒪((1 − 1/n)^{k(k−1)/2})
Table 1: We summarize the convergence behaviors of the three proposed algorithms for solving the saddle point problem in terms of the Euclidean distance to the saddle point after k iterations, where n is the dimension of the problem and κ is the condition number. The results come from Corollary 3.17.1.

2 Notation and Preliminaries

We use ‖·‖ to denote the spectral norm of a matrix and the Euclidean norm of a vector, respectively. We denote the standard basis of ℝ^n by e_1, …, e_n and let I be the identity matrix. The trace of a square matrix A is denoted by tr(A). Given two positive definite matrices A and B, we define their inner product as ⟨A, B⟩ := tr(AB).

We introduce the following notation to measure how well a positive definite matrix G approximates a positive definite matrix A:

(2)    σ_A(G) := ⟨A⁻¹, G − A⟩ = tr(A⁻¹(G − A)).

If we further suppose A ⪯ G, it holds that

(3)    A ⪯ G ⪯ (1 + σ_A(G)) A,

as shown by Rodomanov and Nesterov [29].
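For concreteness, here is a minimal NumPy sketch of this measure, assuming the definition in (2); the function name `sigma` is ours, not the paper's.

```python
import numpy as np

def sigma(A, G):
    """Approximation measure sigma_A(G) = <A^{-1}, G - A> = tr(A^{-1}(G - A)).

    Intended for symmetric positive definite A and G with A <= G, in which
    case the value is non-negative and equals zero only when G = A.
    """
    return np.trace(np.linalg.solve(A, G - A))
```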

Using the notation of problem (1), we write z := (x, y) and denote the gradient and Hessian matrix of f at z by g(z) := ∇f(z) and H(z) := ∇²f(z).

We suppose the saddle point problem (1) satisfies the following assumptions.

Assumption 2.1.

The objective function f is twice differentiable and has an L-Lipschitz continuous gradient and a ρ-Lipschitz continuous Hessian, i.e., there exist constants L and ρ such that

(4)    ‖∇f(z) − ∇f(z′)‖ ≤ L‖z − z′‖

and

(5)    ‖∇²f(z) − ∇²f(z′)‖ ≤ ρ‖z − z′‖

for any z and z′.

Assumption 2.2.

The objective function f(x, y) is twice differentiable, μ-strongly convex in x and μ-strongly concave in y, i.e., there exists μ > 0 such that ∇²_{xx} f(x, y) ⪰ μI and ∇²_{yy} f(x, y) ⪯ −μI for any x and y.

Note that inequality (4) means the spectral norm of the Hessian matrix is upper bounded, that is,

(6)    ‖∇²f(z)‖ ≤ L

for all z. Additionally, the condition number of the objective function is defined as κ := L/μ.

3 Quasi-Newton Methods for Saddle Point Problems

The update rule of standard Newton's method for solving problem (1) can be written as

(7)    z_{k+1} = z_k − H(z_k)⁻¹ g(z_k).

This iteration scheme has quadratic local convergence, but solving the linear system in (7) takes 𝒪(n³) time. For convex minimization, quasi-Newton methods including BFGS/SR1 [6, 5, 34, 4, 9] and their variants [20, 39, 29, 30] focus on approximating the Hessian and reduce the computational cost to 𝒪(n²) per round. However, all of these algorithms and the related convergence analysis are based on the assumption that the Hessian matrix is positive definite, which is not suitable for our saddle point problem since H(z) is indefinite.

We introduce the auxiliary matrix J(z), defined as the square of the Hessian:

J(z) := H(z)².

The following lemma shows that J(z) is always positive definite.

Lemma 3.1.

Under Assumptions 2.1 and 2.2, we have

(8)    μ²I ⪯ J(z) ⪯ L²I

for all z = (x, y).
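Lemma 3.1 is easy to check numerically. The following sketch, with arbitrary dimensions and constants of our own choosing, builds a random Hessian with the block structure implied by Assumption 2.2 and verifies that H is indefinite while J = H² has all eigenvalues at least μ².

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_y, mu, L = 3, 2, 0.5, 4.0

def random_spd(d):
    """Random symmetric matrix with eigenvalues in [mu, L]."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return Q @ np.diag(rng.uniform(mu, L, d)) @ Q.T

# Indefinite saddle-point Hessian: positive definite x-block, negative definite y-block.
H_xy = rng.standard_normal((n_x, n_y))
H = np.block([[random_spd(n_x), H_xy], [H_xy.T, -random_spd(n_y)]])
J = H @ H   # auxiliary matrix

print(np.linalg.eigvalsh(H))            # both positive and negative eigenvalues
print(np.linalg.eigvalsh(J) >= mu**2)   # all True: J is positive definite
```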

Hence, we can reformulate the update of Newton's method (7) as

(9)    z_{k+1} = z_k − J(z_k)⁻¹ H(z_k) g(z_k).

It is then natural to characterize the second-order information by estimating the auxiliary matrix J(z), rather than the indefinite Hessian H(z). If we have obtained a symmetric positive definite matrix G as an estimator of J(z), the update rule (9) can be approximated by

(10)    z_{k+1} = z_k − G⁻¹ H(z_k) g(z_k).

The remainder of this section introduces several strategies for constructing G, resulting in quasi-Newton methods for the saddle point problem with local superlinear convergence. We should point out that implementing iteration (10) does not require constructing the Hessian matrix explicitly, since we are only interested in the Hessian-vector product H(z_k) g(z_k), which can be computed efficiently [26, 32].
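To make this concrete, here is a minimal sketch of one step of (10) given only a gradient oracle and a Hessian-vector-product oracle. The names `grad`, `hvp` and `quasi_newton_step` are ours; the finite-difference helper is just one generic way to obtain Hessian-vector products and is not the paper's implementation.

```python
import numpy as np

def hvp_fd(grad, z, v, eps=1e-6):
    """Generic finite-difference approximation of the Hessian-vector product H(z) v."""
    return (grad(z + eps * v) - grad(z - eps * v)) / (2.0 * eps)

def quasi_newton_step(z, grad, hvp, G):
    """One step of update (10): z+ = z - G^{-1} H(z) g(z).

    Only a single Hessian-vector product is needed; the Hessian itself is never
    formed.  In practice one would reuse a factorization of G instead of calling
    a dense solve in every iteration.
    """
    g = grad(z)
    Hg = hvp(z, g)
    return z - np.linalg.solve(G, Hg)
```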

3.1 The Broyden Family Updates

We first introduce the Broyden family [24, Section 6.3] of quasi-Newton updates for approximating a positive definite matrix A by using the information of the current estimator G.

Definition 3.2.

Suppose two positive definite matrices A and G satisfy A ⪯ G. For any u ∈ ℝ^n, if Gu = Au, we define Broyd_τ(G, A, u) := G. Otherwise, for a parameter τ, we define

(11)

Different choices of the parameter τ in formula (11) recover several popular quasi-Newton updates (a code sketch of the two special cases follows this list):

  • One choice of τ corresponds to the BFGS update

    (12)    BFGS(G, A, u) := G − (Guu^⊤G)/(u^⊤Gu) + (Auu^⊤A)/(u^⊤Au).

  • Another choice of τ corresponds to the SR1 update

    (13)    SR1(G, A, u) := G − ((G − A)uu^⊤(G − A))/(u^⊤(G − A)u).
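A direct NumPy transcription of these two updates in the direction-based form above (a sketch; the names `bfgs_update` and `sr1_update` are ours):

```python
import numpy as np

def bfgs_update(G, A, u):
    """BFGS update (12) of the estimator G toward the target matrix A along u."""
    Gu, Au = G @ u, A @ u
    return G - np.outer(Gu, Gu) / (u @ Gu) + np.outer(Au, Au) / (u @ Au)

def sr1_update(G, A, u, tol=1e-12):
    """SR1 update (13); returns G unchanged when (G - A) u is (numerically) zero."""
    r = (G - A) @ u
    if abs(u @ r) <= tol:
        return G
    return G - np.outer(r, r) / (u @ r)
```

Both updates keep the new estimator symmetric, and when the input satisfies A ⪯ G they preserve this ordering (cf. Lemma 3.3).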

The general Broyden family update of Definition 3.2 has the following properties.

Lemma 3.3 ([29, Lemma 2.1 and Lemma 2.2]).

Suppose two positive definite matrices A and G satisfy A ⪯ G ⪯ ηA for some η ≥ 1. Then for any u ∈ ℝ^n and τ ∈ [0, 1], we have

A ⪯ Broyd_τ(G, A, u) ⪯ ηA.

Additionally, for any τ ∈ [0, 1], we have

(14)

We first introduce the greedy update method [29], which chooses the direction u as

(15)    u(A, G) := argmax_{u ∈ {e_1, …, e_n}} (u^⊤Gu)/(u^⊤Au).

The following lemma shows that applying update rule (11) with the greedy choice (15) leads to a new estimator with a tighter error bound in the measure σ_A(·).

Lemma 3.4 ([29, Theorem 2.5]).

Suppose two positive definite matrices A and G satisfy A ⪯ G, and let u = u(A, G) be chosen by (15). Then for any τ ∈ [0, 1], we have

(16)
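In code, the greedy choice (15) amounts to scanning the diagonals of G and A; combined with the `bfgs_update` sketch above, one greedy refinement step looks as follows (again a sketch rather than the paper's exact implementation):

```python
import numpy as np

def greedy_direction(A, G):
    """Greedy choice (15): the basis vector e_i maximizing (e_i^T G e_i) / (e_i^T A e_i)."""
    u = np.zeros(A.shape[0])
    u[np.argmax(np.diag(G) / np.diag(A))] = 1.0
    return u

# One greedy refinement step, reusing bfgs_update from the sketch above:
#   u = greedy_direction(A, G)
#   G = bfgs_update(G, A, u)
```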

For the specific Broyden family updates BFGS and SR1 shown in (12) and (13), we can replace (15) by a scaled greedy direction [20], which leads to a better convergence result. Concretely, for the BFGS method, we first find an upper triangular matrix L_G such that G = L_G^⊤L_G. This step can be implemented with 𝒪(n²) complexity [20, Proposition 15]. We present the subroutine for maintaining this factorization in Algorithm 1 and give its detailed implementation in Appendix B.

Then we use the scaled direction for the BFGS update, where

(17)

For the SR1 method, we choose the direction by

(18)

Applying the BFGS update rule (12) with formula (17), we obtain a condition-number-free result as follows.

Lemma 3.5 ([20, Theorem 2.6]).

Suppose two positive definite matrices A and G satisfy A ⪯ G. Let u be the scaled greedy direction defined in (17), where L_G is the upper triangular matrix such that G = L_G^⊤L_G. Then we have

(19)

The effect of the SR1 update can be characterized by the following measure:

Applying the SR1 update rule (13) with formula (18), we also obtain a condition-number-free result.

Lemma 3.6 ([20, Theorem 2.3]).

Suppose two positive definite matrices A and G satisfy A ⪯ G. Let u be the greedy direction defined in (18); then we have

(20)
1:  Input: positive definite matrix , upper triangular matrix , greedy direction
2:  
3:  
4:  
5:  Output:
Algorithm 1 Fast-Cholesky
1:  Input: and
2:  for
3:   
4:   
5:   
6:  end for
Algorithm 2 Greedy Broyden Class Method for Quadratic Problems
1:  Input: and
2:  for
3:   
4:   
5:   
6:    Fast-Cholesky
7:  end for
Algorithm 3 Greedy BFGS Method for Quadratic Problems
1:  Input: and
2:  for
3:   
4:   
5:   
6:  end for
Algorithm 4 Greedy SR1 Method for Quadratic Problems

3.2 Algorithms for Quadratic Saddle Point Problems

Then we consider solving the quadratic saddle point problem of the form

(21)    min_x max_y  f(x, y) = (1/2) z^⊤Az + b^⊤z,    z = (x, y),

where A ∈ ℝ^{n×n} is symmetric and b ∈ ℝ^n. We suppose A can be partitioned as

A = [ A_{xx}    A_{xy}
      A_{xy}^⊤  −A_{yy} ],

where the sub-matrices A_{xx}, A_{xy} and A_{yy} satisfy μI ⪯ A_{xx} ⪯ LI, μI ⪯ A_{yy} ⪯ LI and ‖A_{xy}‖ ≤ L. Recall that the condition number is defined as κ = L/μ. Using the notation introduced in Section 2, we have g(z) = Az + b, H(z) = A and J(z) = A² for this problem.

We present the detailed procedures of the greedy quasi-Newton methods for the quadratic saddle point problem using the Broyden family update, the BFGS update and the SR1 update in Algorithms 2, 3 and 4, respectively. For our convergence analysis, we measure progress by the Euclidean distance from the k-th iterate z_k to the saddle point z*.

This measure is different from the one used in convex optimization [29, 20] (in a later section, we will see that it is well suited to the convergence analysis of quasi-Newton methods for saddle point problems), but it satisfies a similar property, as shown below.

Lemma 3.7.

Assume we have z_k and G_k such that J ⪯ G_k ⪯ ηJ for Algorithms 2, 3 and 4; then we have

The next theorem states that the assumptions of Lemma 3.7 always hold, which means the distance to the saddle point converges to 0 linearly.

Theorem 3.8.

For all k ≥ 0, Algorithms 2, 3 and 4 satisfy

(22)

and

(23)

Lemma 3.7 also implies that superlinear convergence can be obtained if the corresponding ratio converges to 1. Applying Lemmas 3.4, 3.5 and 3.6, we can show that this holds for the proposed algorithms.

Theorem 3.9.

Solving the quadratic saddle point problem (21) by the proposed greedy quasi-Newton algorithms, we have the following results:

  1. For greedy Broyden family method (Algorithm 2), we have

  2. For greedy BFGS method (Algorithm 3), we have

  3. For greedy SR1 method (Algorithm 4), we have

Combining the results of Theorems 3.8 and 3.9, we obtain a two-stage convergence behavior, that is, the algorithm has global linear convergence and local superlinear convergence. The formal description is summarized as follows.

Corollary 3.9.1.

Solving the quadratic saddle point problem (21) by the proposed greedy quasi-Newton algorithms, we have the following results:

  1. Using greedy Broyden family method (Algorithm 2), we have

    (24)
  2. Using greedy BFGS method (Algorithm 3), we have

    (25)
  3. Using greedy SR1 method (Algorithm 4), we have

    (26)

    where .
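To make the quadratic case concrete, the following sketch runs a greedy-BFGS-style iteration of the form (10) on a random instance of problem (21). It is only an illustration of the mechanism and not a faithful rendering of Algorithm 3: it uses the plain greedy direction (15) instead of the scaled direction (17) and re-solves the linear system from scratch instead of maintaining the Fast-Cholesky factorization (Algorithm 1), so it does not reproduce the per-iteration cost or the exact constants; the dimensions and constants are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(1)
n_x, n_y, mu, L = 3, 2, 1.0, 2.0
n = n_x + n_y

def random_spd(d):
    """Random symmetric matrix with eigenvalues in [mu, L]."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return Q @ np.diag(rng.uniform(mu, L, d)) @ Q.T

# Random instance of problem (21): gradient g(z) = A z + b with indefinite symmetric A.
A_xy = rng.standard_normal((n_x, n_y))
A = np.block([[random_spd(n_x), A_xy], [A_xy.T, -random_spd(n_y)]])
b = rng.standard_normal(n)
z_star = np.linalg.solve(A, -b)            # saddle point, where g(z*) = 0

J = A @ A                                  # auxiliary matrix J = A^2
G = np.linalg.norm(A, 2) ** 2 * np.eye(n)  # initial estimator with G >= J
z = np.zeros(n)

for k in range(30):
    g = A @ z + b
    z = z - np.linalg.solve(G, A @ g)      # step (10): z+ = z - G^{-1} H g(z)
    u = np.zeros(n)                        # greedy direction (15) for target J
    u[np.argmax(np.diag(G) / np.diag(J))] = 1.0
    Gu, Ju = G @ u, J @ u                  # BFGS update (12) of G toward J
    G = G - np.outer(Gu, Gu) / (u @ Gu) + np.outer(Ju, Ju) / (u @ Ju)
    print(k, np.linalg.norm(z - z_star))   # Euclidean distance to the saddle point
```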

3.3 Algorithms for General Saddle Point Problems

In this section, we consider the general saddle point problem

min_x max_y f(x, y),

where f satisfies Assumptions 2.1 and 2.2. We aim to propose quasi-Newton methods that solve this problem with local superlinear convergence and 𝒪(n²) time complexity per iteration.

3.3.1 Algorithms

The key idea in designing quasi-Newton methods for saddle point problems is to characterize the second-order information by approximating the auxiliary matrix J(z). Note that we have assumed the Hessian of f is Lipschitz continuous and bounded in Assumptions 2.1 and 2.2, which means the auxiliary matrix operator J(·) is also Lipschitz continuous.

Lemma 3.10.

Under Assumptions 2.1 and 2.2, the auxiliary matrix operator J(·) is 2Lρ-Lipschitz continuous, that is,

(27)    ‖J(z) − J(z′)‖ ≤ 2Lρ‖z − z′‖    for all z, z′.

Combining Lemmas 3.1 and 3.10, we obtain the following property of J(·), which is analogous to strong self-concordance in convex optimization [29].

Lemma 3.11.

Under Assumptions 2.1 and 2.2, the auxiliary matrix operator satisfies

(28)    J(z) − J(z′) ⪯ M‖z − z′‖ · J(w)

for all z, z′ and w, where M := 2Lρ/μ².

Corollary 3.11.1.

Let z, z′ ∈ ℝ^n and r := ‖z − z′‖. Suppose the objective function satisfies Assumptions 2.1 and 2.2; then the auxiliary matrix operator satisfies

(29)    (1 + Mr)⁻¹ J(z′) ⪯ J(z) ⪯ (1 + Mr) J(z′),

where M := 2Lρ/μ².

Unlike the quadratic case, the auxiliary matrix is not fixed for the general saddle point problem. Based on the smoothness of J(·), we apply Corollary 3.11.1 to generalize Lemma 3.3 as follows.

Lemma 3.12.

Let and be a positive definite matrix such that

(30)

for some . We additionally define and , then we have

(31)

and

(32)

for all , and .

The relationship (32) implies that it is reasonable to establish the algorithms with the update rule

(33)

with suitable choices of the parameters. Similar guarantees can also be achieved for the specific BFGS and SR1 updates. Combining iteration (10) with (33), we propose quasi-Newton methods for general strongly-convex-strongly-concave saddle point problems. The details are shown in Algorithms 5, 6 and 7 for the greedy Broyden family, BFGS and SR1 updates, respectively.

1:  Input: , and .
2:  for
3:   
4:   
5:   
6:   
7:   
8:  end for
Algorithm 5 Greedy Broyden Method for General Case
1:  Input: and .
2:  
3:  for
4:   
5:   
6:   
7:   
8:   
9:   
10:    Fast-Cholesky
11:  end for
Algorithm 6 Greedy BFGS Method for General Case
1:  Input: and
2:  for
3:   
4:   
5:   
6:   
7:   
8:  end for
Algorithm 7 Greedy SR1 Method for General Case
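Since the listings above are schematic, the following heavily simplified, self-contained sketch shows how iteration (10) and a greedy update of the estimator fit together in the general case. It forms the auxiliary matrix J(z) explicitly from Hessian-vector products for readability (costing n products per iteration) and applies a plain greedy BFGS update instead of the precise update rule (33), so it illustrates the mechanism rather than faithfully rendering Algorithms 5-7; `grad` and `hvp` are assumed user-supplied oracles for g(z) and v ↦ H(z)v.

```python
import numpy as np

def greedy_bfgs_saddle_sketch(z0, grad, hvp, num_iters=50):
    """Simplified greedy BFGS scheme for the general saddle point problem.

    Step: z+ = z - G^{-1} H(z) g(z) as in (10); then G is refined toward the
    current auxiliary matrix J(z+) = H(z+)^2 by a greedy BFGS update.  This is
    a sketch of the mechanism only, not the paper's Algorithms 5-7.
    """
    z = np.asarray(z0, dtype=float).copy()
    n = z.size
    # Conservative initialization: any G with G >= J(z0) works in principle;
    # here we use a crude spectral-norm bound built from the initial J(z0).
    J = np.column_stack([hvp(z, hvp(z, e)) for e in np.eye(n)])
    G = np.linalg.norm(J, 2) * np.eye(n)
    for _ in range(num_iters):
        g = grad(z)
        z = z - np.linalg.solve(G, hvp(z, g))          # step (10)
        # Form J(z) explicitly for this sketch (n Hessian-vector products).
        J = np.column_stack([hvp(z, hvp(z, e)) for e in np.eye(n)])
        u = np.zeros(n)
        u[np.argmax(np.diag(G) / np.diag(J))] = 1.0    # greedy direction (15)
        Gu, Ju = G @ u, J @ u                          # BFGS update (12) toward J
        G = G - np.outer(Gu, Gu) / (u @ Gu) + np.outer(Ju, Ju) / (u @ Ju)
    return z
```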

3.3.2 Convergence Analysis

We now turn to building the convergence guarantees for the algorithms proposed in Section 3.3.1. We start with the following lemma, which is useful for the subsequent analysis.

Lemma 3.13.

Suppose the objective function f satisfies Assumptions 2.1 and 2.2 and z* is its saddle point; then we have

(34)

for all z.

We introduce the following notation to simplify the presentation. We let {z_k} be the sequence generated by Algorithm 5, 6 or 7 and denote

We still use the Euclidean distance to the saddle point for the analysis and establish the relationship between the distances of successive iterates, which is shown in Lemma 3.14.

Lemma 3.14.

Using Algorithms 5, 6 and 7, suppose we have

(35)

for some . Then we have

(36)
(37)

where .

Rodomanov and Nesterov [29, Lemma 4.3] derive a result similar to Lemma 3.14 for minimizing a strongly-convex function, but their result depends on a different measure. (The original notation of Rodomanov and Nesterov [29] is for minimizing a strongly-convex function; to avoid ambiguity, we use our own notation when describing their work in this paper.) Note that our algorithms are based on the iteration rule

z_{k+1} = z_k − G_k⁻¹ H(z_k) g(z_k).

Compared with quasi-Newton methods for convex optimization, there is an additional Hessian factor H(z_k) between the estimator G_k⁻¹ and the gradient g(z_k), which makes a convergence analysis based on the weighted gradient norm difficult. Fortunately, we find that directly using the Euclidean distance to the saddle point makes the analysis go through smoothly.

For further analysis, we also denote

(38)

and

(39)

Then we establish linear convergence for the first period of iterations, which can be viewed as a generalization of Theorem 3.8. Note that the following result holds for all of Algorithms 5, 6 and 7.

Theorem 3.15.

Using Algorithms 5, 6 and 7 with suitable parameter choices, we suppose the initial point z_0 is sufficiently close to the saddle point z* such that

(40)

where . Let , then for all , we have

(41)
(42)

and

(43)

Then we analyze how the approximation measure changes after one iteration, to establish the local superlinear convergence of the greedy Broyden family method (Algorithm 5) and the greedy BFGS method (Algorithm 6). Recall that the quantity defined in (39) measures how well the matrix G_k approximates the auxiliary matrix J(z_k).

Lemma 3.16.

Solving the general strongly-convex-strongly-concave saddle point problem (1) under Assumptions 2.1 and 2.2 by the proposed greedy quasi-Newton algorithms, we have the following results:

  1. For the greedy Broyden family method (Algorithm 5), we have

    (44)
  2. For the greedy BFGS method (Algorithm 6), we have

    (45)

The analysis for the greedy SR1 method is based on constructing a measure such that