# Quasi-Newton Methods for Saddle Point Problems

This paper studies quasi-Newton methods for solving strongly-convex-strongly-concave saddle point problems (SPP). We propose a variant of the general greedy Broyden family update for SPPs, which has an explicit local superlinear convergence rate of 𝒪((1−1/(nκ²))^{k(k−1)/2}), where n is the dimension of the problem, κ is the condition number and k is the number of iterations. The design and analysis of the proposed algorithm are based on estimating the square of the indefinite Hessian matrix, which differs from classical quasi-Newton methods in convex optimization. We also present two specific Broyden family algorithms with BFGS-type and SR1-type updates, which enjoy the faster local convergence rate of 𝒪((1−1/n)^{k(k−1)/2}).


## 1 Introduction

In this paper, we focus on the following smooth saddle point problem

 min_{x∈R^{n_x}} max_{y∈R^{n_y}} f(x, y), (1)

where f(x, y) is strongly-convex in x and strongly-concave in y. We target to find the saddle point (x*, y*), which satisfies

 f(x*, y) ≤ f(x*, y*) ≤ f(x, y*)

for all x ∈ R^{n_x} and y ∈ R^{n_y}. This formulation covers many scenarios, including game theory [2, 36], AUC maximization [14, 40], robust optimization [3, 13, 33] and empirical risk minimization [42, 11].

There are a great number of first-order optimization algorithms for solving problem (1), including the extragradient method [18, 35], optimistic gradient descent ascent [8], the proximal point method [28] and dual extrapolation [23]. These algorithms iterate with a first-order oracle and achieve linear convergence. Lin et al. [21], Wang and Li [37] used Catalyst acceleration to reduce the complexity for unbalanced saddle point problems, nearly matching the lower bound of first-order algorithms [25, 41] under specific assumptions. Compared with first-order methods, second-order methods usually enjoy superior convergence in numerical optimization. Huang et al. [16] extended the cubic regularized Newton (CRN) method [23, 22] to solve saddle point problem (1), which has quadratic local convergence. However, each iteration of CRN requires accessing the exact Hessian matrix and solving the corresponding linear systems. These steps incur 𝒪(n³) time complexity, which is too expensive for high-dimensional problems.

Quasi-Newton methods [6, 5, 34, 4, 9] are popular ways to avoid accessing the exact second-order information required by standard Newton methods. They approximate the Hessian matrix based on the Broyden family updating formulas [4], which significantly reduces the computational cost. These algorithms are well studied for convex optimization. The famous quasi-Newton methods, including the Davidon-Fletcher-Powell (DFP) method [9, 12], the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method [6, 5, 34] and the symmetric rank 1 (SR1) method [4, 9], enjoy local superlinear convergence [27, 7, 10] when the objective function is strongly convex. Recently, Rodomanov and Nesterov [29, 30, 31] proposed greedy variants of quasi-Newton methods, which first achieved non-asymptotic superlinear convergence. Later, Lin et al. [20] established a better convergence rate which is condition-number-free. Jin and Mokhtari [17], Ye et al. [39] showed the non-asymptotic superlinear convergence rate also holds for the classical DFP, BFGS and SR1 methods.

In this paper, we study quasi-Newton methods for saddle point problem (1). Since the Hessian matrix of our objective function is indefinite, the existing Broyden family update formulas and their convergence analysis cannot be applied directly. To overcome this issue, we propose a variant framework of greedy quasi-Newton methods for saddle point problems, which approximates the square of the Hessian matrix during the iteration. Our theoretical analysis characterizes the convergence rate by the Euclidean distance to the saddle point, rather than the weighted norm of the gradient used in convex optimization [29, 30, 31, 20, 39, 17]. We summarize the theoretical results for the proposed algorithms in Table 1. The local convergence behavior of all the algorithms has two periods. The first period has a linear convergence rate. The second one enjoys superlinear convergence:

• For general Broyden family methods, we have the explicit rate 𝒪((1−1/(nκ²))^{k(k−1)/2}).

• For the BFGS method and SR1 method, we have the faster explicit rate 𝒪((1−1/n)^{k(k−1)/2}), which is condition-number-free.

Additionally, our ideas can also be used for solving general non-linear equations.

##### Paper Organization

In Section 2, we introduce the notation and preliminaries used throughout this paper. In Section 3, we first propose greedy quasi-Newton methods for the quadratic saddle point problem, which enjoy local superlinear convergence. Then we extend them to solve general strongly-convex-strongly-concave saddle point problems. In Section 4, we show our theory can also be applied to solve more general non-linear equations and give the corresponding convergence analysis. We conclude our work in Section 5. All proofs are deferred to the appendix.

## 2 Notation and Preliminaries

We use ∥·∥ to denote the spectral norm of a matrix and the Euclidean norm of a vector, respectively. We denote the standard basis for R^n by {e₁, …, e_n} and let I be the identity matrix. The trace of a square matrix A is denoted by tr(A). Given two positive definite matrices G and H, we define their inner product as

 ⟨G, H⟩ def= tr(GH).

We introduce the following notation to measure how well a matrix G approximates a matrix H:

 σ_H(G) def= ⟨H^{−1}, G − H⟩ = ⟨H^{−1}, G⟩ − n. (2)

If we further suppose G ⪰ H, it holds that

 G − H ⪯ ⟨H^{−1}, G − H⟩H = σ_H(G)H (3)

by Rodomanov and Nesterov [29].

Using the notation of problem (1), we let z = [x; y] ∈ R^n with n = n_x + n_y, and denote the gradient and Hessian matrix of f at z as

 g(z) def= [∇_x f(x,y); ∇_y f(x,y)] ∈ R^n and Ĥ(z) def= ∇²f(x,y) = [∇²_xx f(x,y), ∇²_xy f(x,y); ∇²_yx f(x,y), ∇²_yy f(x,y)] ∈ R^{n×n}.
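As a concrete illustration, the sketch below builds g(z) and the indefinite Ĥ(z) for a made-up quadratic-coupled instance (the matrices P, Q, B and all sizes are hypothetical choices, not from the paper):

```python
import numpy as np

# Hypothetical instance: f(x, y) = 0.5 x'P x + x'B y - 0.5 y'Q y with P, Q
# positive definite, so Assumption 2.2 holds and the Hessian is indefinite.
rng = np.random.default_rng(0)
nx, ny = 3, 2
P = 0.1 * rng.standard_normal((nx, nx)); P = (P + P.T) / 2 + nx * np.eye(nx)
Q = 0.1 * rng.standard_normal((ny, ny)); Q = (Q + Q.T) / 2 + ny * np.eye(ny)
B = rng.standard_normal((nx, ny))

def grad(x, y):
    """g(z): the gradients in x and y stacked into one vector of length n = nx + ny."""
    return np.concatenate([P @ x + B @ y, B.T @ x - Q @ y])

def hess(x, y):
    """Indefinite Hessian: positive definite xx-block, negative definite yy-block."""
    return np.block([[P, B], [B.T, -Q]])

x, y = rng.standard_normal(nx), rng.standard_normal(ny)
H_hat = hess(x, y)
eigs = np.linalg.eigvalsh(H_hat)
print(eigs)  # mixed signs: the Hessian is indefinite
```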

We suppose the saddle point problem (1) satisfies the following assumptions.

###### Assumption 2.1.

The objective function f is twice differentiable and has an L-Lipschitz continuous gradient and an L₂-Lipschitz continuous Hessian, i.e., there exist constants L > 0 and L₂ > 0 such that

 ∥g(z) − g(z′)∥ ≤ L∥z − z′∥ (4)

and

 ∥Ĥ(z) − Ĥ(z′)∥ ≤ L₂∥z − z′∥ (5)

for any z, z′ ∈ R^n.

###### Assumption 2.2.

The objective function f is twice differentiable, μ-strongly-convex in x and μ-strongly-concave in y, i.e., there exists μ > 0 such that ∇²_xx f(x,y) ⪰ μI and ∇²_yy f(x,y) ⪯ −μI for any x ∈ R^{n_x} and y ∈ R^{n_y}.

Note that inequality (4) means the spectral norm of the Hessian matrix can be upper bounded, that is,

 ∥Ĥ(z)∥ ≤ L (6)

for all z ∈ R^n. Additionally, the condition number of the objective function is defined as

 κ def= L/μ.

## 3 Quasi-Newton Methods for Saddle Point Problems

The update rule of standard Newton’s method for solving problem (1) can be written as

 z⁺ = z − (Ĥ(z))^{−1} g(z). (7)

This iteration scheme has quadratic local convergence, but solving the linear system in (7) takes 𝒪(n³) time. For convex minimization, quasi-Newton methods including BFGS/SR1 [6, 5, 34, 4, 9] and their variants [20, 39, 29, 30] focus on approximating the Hessian and reduce the computational cost to 𝒪(n²) per iteration. However, all of these algorithms and their convergence analyses are based on the assumption that the Hessian matrix is positive definite, which is not suitable for our saddle point problems since Ĥ(z) is indefinite.

We introduce the auxiliary matrix H(z), defined as the square of the Hessian:

 H(z) def= (Ĥ(z))².

The following lemma shows H(z) is always positive definite.

###### Lemma 3.1.

Under Assumptions 2.1 and 2.2, we have

 μ²I ⪯ H(z) ⪯ L²I (8)

for all z ∈ R^n.
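The bounds in (8) are easy to check numerically. The sketch below builds a random Hessian-like matrix whose blocks satisfy Assumption 2.2 (all instance data are made up) and verifies that the eigenvalues of its square lie in [μ², L²]:

```python
import numpy as np

# Random instance satisfying Assumption 2.2 with mu = 0.5 (all data made up).
rng = np.random.default_rng(1)
nx = ny = 4
mu = 0.5

def spd(m):
    X = rng.standard_normal((m, m))
    return X @ X.T + mu * np.eye(m)      # symmetric, eigenvalues >= mu

Axx, Ayy = spd(nx), -spd(ny)             # xx-block >= mu*I, yy-block <= -mu*I
Axy = rng.standard_normal((nx, ny))
H_hat = np.block([[Axx, Axy], [Axy.T, Ayy]])

L = np.linalg.norm(H_hat, 2)             # spectral norm, so ||H_hat|| <= L trivially
H = H_hat @ H_hat                        # auxiliary matrix H(z) = H_hat(z)^2
eigs = np.linalg.eigvalsh(H)
print(eigs.min(), eigs.max())            # lie inside [mu^2, L^2] as Lemma 3.1 claims
```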

Hence, we can reformulate the update of Newton's method (7) as

 z⁺ = z − (H(z))^{−1} Ĥ(z) g(z). (9)

Then it is natural to characterize the second-order information by estimating the auxiliary matrix H(z), rather than the indefinite Hessian Ĥ(z). If we have obtained a symmetric positive definite matrix G as an estimator of H(z), the update rule (9) can be approximated by

 z⁺ = z − G^{−1} Ĥ(z) g(z). (10)

The remainder of this section introduces several strategies to construct G, resulting in quasi-Newton methods for saddle point problems with local superlinear convergence. We point out that implementing iteration (10) does not require constructing the Hessian matrix explicitly, since we are only interested in the Hessian-vector product Ĥ(z)g(z), which can be computed efficiently [26, 32].
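For instance, when only a gradient oracle is available, the product Ĥ(z)v can be approximated by a central finite difference of gradients; the quadratic toy oracle below is a hypothetical stand-in for the real one:

```python
import numpy as np

# Hypothetical gradient oracle of a quadratic toy problem: g(z) = A z - b.
A = np.array([[2.0, 1.0], [1.0, -2.0]])  # symmetric indefinite "Hessian"
b = np.array([1.0, -1.0])
g = lambda z: A @ z - b

def hvp(g, z, v, eps=1e-6):
    """Approximate the product H_hat(z) v with two gradient calls, no Hessian formed."""
    return (g(z + eps * v) - g(z - eps * v)) / (2 * eps)

z = np.array([0.3, 0.7])
v = g(z)                                  # iteration (10) needs H_hat(z) g(z)
print(np.allclose(hvp(g, z, v), A @ v, atol=1e-5))
```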

### 3.1 The Broyden Family Updates

We first introduce the Broyden family [24, Section 6.3] of quasi-Newton updates for approximating a positive definite matrix H by using the information of the current estimator G.

###### Definition 3.2.

Suppose two positive definite matrices G, H ∈ R^{n×n} satisfy G ⪰ H. For any u ∈ R^n and τ ∈ [0, 1], if (G − H)u = 0, we define Broyd_τ(G, H, u) def= G. Otherwise, we define

 Broyd_τ(G, H, u) def= τ[G − (Huu^⊤G + Guu^⊤H)/(u^⊤Hu) + (u^⊤Gu/(u^⊤Hu) + 1)·(Huu^⊤H)/(u^⊤Hu)] (11)
  + (1 − τ)[G − ((G − H)uu^⊤(G − H))/(u^⊤(G − H)u)].

Different choices of the parameter τ in formula (11) recover several popular quasi-Newton updates:

• For τ = (u^⊤Hu)/(u^⊤Gu), it corresponds to the BFGS update

 BFGS(G, H, u) def= G − (Guu^⊤G)/(u^⊤Gu) + (Huu^⊤H)/(u^⊤Hu). (12)

• For τ = 0, it corresponds to the SR1 update

 SR1(G, H, u) def= G − ((G − H)uu^⊤(G − H))/(u^⊤(G − H)u). (13)

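The updates (11)-(13) translate directly into code. In the sketch below, the first bracket of (11) is implemented as the DFP-type term and the second as the SR1 term; the choice τ = (u^⊤Hu)/(u^⊤Gu) recovers the BFGS formula (12), which the made-up test pair G ⪰ H confirms numerically:

```python
import numpy as np

def broyd(tau, G, H, u):
    """Broyden family update (11): tau * (DFP-type term) + (1 - tau) * (SR1 term)."""
    Gu, Hu = G @ u, H @ u
    if np.allclose(Gu, Hu):                  # (G - H)u = 0: keep G unchanged
        return G.copy()
    a, b = u @ Gu, u @ Hu
    dfp = G - (np.outer(Hu, Gu) + np.outer(Gu, Hu)) / b \
            + (a / b + 1.0) * np.outer(Hu, Hu) / b
    w = Gu - Hu
    sr1 = G - np.outer(w, w) / (u @ w)
    return tau * dfp + (1.0 - tau) * sr1

# Deterministic toy pair with G >= H (instance data made up).
n = 5
rng = np.random.default_rng(2)
X = rng.standard_normal((n, n))
H = X @ X.T + np.eye(n)
v = np.arange(1.0, n + 1.0)
G = H + np.outer(v, v)
u = np.zeros(n); u[0] = 1.0

tau_bfgs = (u @ H @ u) / (u @ G @ u)         # this choice of tau recovers BFGS (12)
bfgs = G - np.outer(G @ u, G @ u) / (u @ G @ u) + np.outer(H @ u, H @ u) / (u @ H @ u)
print(np.allclose(broyd(tau_bfgs, G, H, u), bfgs))
```

Every member of the family satisfies the secant-type property Broyd_τ(G, H, u)u = Hu, which is a quick sanity check for any implementation.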
The general Broyden family update in Definition 3.2 has the following properties.

###### Lemma 3.3 ([29, Lemma 2.1 and Lemma 2.2]).

Suppose two positive definite matrices G, H ∈ R^{n×n} satisfy H ⪯ G ⪯ ηH for some η ≥ 1. Then for any 0 ≤ τ₁ ≤ τ₂ ≤ 1 and u ∈ R^n, we have

 Broyd_{τ₁}(G, H, u) ⪯ Broyd_{τ₂}(G, H, u).

Additionally, for any τ ∈ [0, 1], we have

 H ⪯ Broyd_τ(G, H, u) ⪯ ηH. (14)

We first introduce the greedy update method [29], which chooses the direction u as follows:

 u_H(G) def= argmax_{u∈{e₁,…,e_n}} (u^⊤(G − H)u)/(u^⊤Hu) = argmax_{u∈{e₁,…,e_n}} (u^⊤Gu)/(u^⊤Hu). (15)

The following lemma shows that applying update rule (11) with formula (15) leads to a new estimator with a tighter error bound in the measure σ_H(·).

###### Lemma 3.4 ([29, Theorem 2.5]).

Suppose two positive definite matrices G, H ∈ R^{n×n} with μ²I ⪯ H ⪯ L²I satisfy H ⪯ G. Let G⁺ = Broyd_τ(G, H, u_H(G)). Then for any τ ∈ [0, 1], we have

 σ_H(G⁺) ≤ (1 − 1/(nκ²)) σ_H(G). (16)
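A minimal numerical check of the greedy mechanism: the sketch below computes the greedy direction (15) via diagonal ratios, applies a BFGS update, and confirms that the measure σ_H from (2) does not increase (the synthetic pair H ⪯ G is made up):

```python
import numpy as np

def sigma(H, G):
    """sigma_H(G) = <H^{-1}, G> - n from (2)."""
    return np.trace(np.linalg.solve(H, G)) - H.shape[0]

def greedy_u(G, H):
    """Greedy direction (15): the basis vector maximizing (u'Gu)/(u'Hu)."""
    i = np.argmax(np.diag(G) / np.diag(H))
    u = np.zeros(G.shape[0]); u[i] = 1.0
    return u

def bfgs(G, H, u):
    Gu, Hu = G @ u, H @ u
    return G - np.outer(Gu, Gu) / (u @ Gu) + np.outer(Hu, Hu) / (u @ Hu)

rng = np.random.default_rng(3)
n = 6
X = rng.standard_normal((n, n)); H = X @ X.T + np.eye(n)
V = rng.standard_normal((n, n)); G = H + V @ V.T / n      # G >= H
s0 = sigma(H, G)
G1 = bfgs(G, H, greedy_u(G, H))
print(sigma(H, G1) <= s0)
```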

For the specific Broyden family updates BFGS and SR1 shown in (12) and (13), we can replace (15) by a scaled greedy direction [20], which leads to a better convergence result. Concretely, for the BFGS method, we first find an upper triangular matrix L such that G = L^⊤L. This step can be implemented with 𝒪(n²) complexity [20, Proposition 15]. We present the subroutine for factorizing G in Algorithm 1 and give its detailed implementation in Appendix B.

Then the BFGS update uses the direction induced by ũ_H(L), where

 ũ_H(L) def= argmax_{u∈{e₁,…,e_n}} u^⊤L^{−⊤}H^{−1}L^{−1}u. (17)

For the SR1 method, we choose the direction by

 ū_H(G) def= argmax_{u∈{e₁,…,e_n}} (u^⊤(G − H)²u)/(u^⊤(G − H)u). (18)

Applying the BFGS update rule (12) with formula (17), we obtain a condition-number-free result as follows.

###### Lemma 3.5 ([20, Theorem 2.6]).

Suppose two positive definite matrices G, H ∈ R^{n×n} satisfy H ⪯ G. Let G⁺ be the BFGS update (12) of G with the direction induced by ũ_H(L), where L is an upper triangular matrix such that G = L^⊤L. Then we have

 σ_H(G⁺) ≤ (1 − 1/n) σ_H(G). (19)

The effect of the SR1 update can be characterized by the following measure:

 τ_H(G) def= tr(G − H).

Applying the SR1 update rule (13) with formula (18), we also obtain a condition-number-free result.

###### Lemma 3.6 ([20, Theorem 2.3]).

Suppose two positive definite matrices G, H ∈ R^{n×n} satisfy H ⪯ G. Let G⁺ = SR1(G, H, ū_H(G)). Then we have

 τ_H(G⁺) ≤ (1 − 1/n) τ_H(G). (20)

### 3.2 Algorithms for Quadratic Saddle Point Problems

We now consider solving the quadratic saddle point problem of the form

 min_{x∈R^{n_x}} max_{y∈R^{n_y}} f(x, y) def= (1/2)[x^⊤ y^⊤]A[x; y] − b^⊤[x; y], (21)

where A ∈ R^{n×n} is symmetric and b ∈ R^n. We suppose A can be partitioned as

 A = [A_xx, A_xy; A_yx, A_yy] ∈ R^{n×n},

where the sub-matrices satisfy A_xx ⪰ μI, A_yy ⪯ −μI and ∥A∥ ≤ L. Recall that the condition number is defined as κ = L/μ. Using the notation introduced in Section 2, we have

 z = [x; y], g(z) = Az − b, Ĥ def= Ĥ(z) = A and H def= H(z) = A².

We present the detailed procedures of greedy quasi-Newton methods for the quadratic saddle point problem using the Broyden family update, BFGS update and SR1 update in Algorithms 2, 3 and 4, respectively. For our convergence analysis, we define λ_k as the Euclidean distance from z_k to the saddle point z*, that is,

 λ_k def= ∥z_k − z*∥.

The definition of λ_k in this paper is different from the weighted gradient norm used in convex optimization [29, 20]¹, but it satisfies a similar property, as follows.

¹ In later sections, we will see the measure λ_k is suitable for the convergence analysis of quasi-Newton methods for saddle point problems.

###### Lemma 3.7.

Assume we have H ⪯ G_k ⪯ η_k H for some η_k ≥ 1 in Algorithms 2, 3 and 4. Then we have

 λ_{k+1} ≤ (1 − 1/η_k) λ_k.

The next theorem states that the assumption of Lemma 3.7 always holds with η_k = κ², which means λ_k converges to 0 linearly.

###### Theorem 3.8.

For all k ≥ 0, Algorithms 2, 3 and 4 satisfy

 H ⪯ G_k ⪯ κ²H (22)

and

 λ_k ≤ (1 − 1/κ²)^k λ₀. (23)

Lemma 3.7 also implies that superlinear convergence can be obtained if there exists η_k converging to 1. Applying Lemmas 3.4, 3.5 and 3.6, we can show this holds for the proposed algorithms.

###### Theorem 3.9.

Solving quadratic saddle point problem (21) by proposed greedy quasi-Newton algorithms, we have the following results:

1. For the greedy Broyden family method (Algorithm 2), we have

 H ⪯ G_k ⪯ (1 + (1 − 1/(nκ²))^k nκ²) H and λ_{k+1} ≤ (1 − 1/(nκ²))^k nκ² λ_k for all k ≥ 0.

2. For the greedy BFGS method (Algorithm 3), we have the analogous bounds with the condition-number-free factor (1 − 1/n)^k in place of (1 − 1/(nκ²))^k.

3. For the greedy SR1 method (Algorithm 4), we have a similar condition-number-free guarantee in terms of the measure τ_H(·).

Combining the results of Theorems 3.8 and 3.9, we obtain the two-stage convergence behavior, that is, the algorithm has global linear convergence and local superlinear convergence. The formal description is summarized as follows.

###### Corollary 3.9.1.

Solving quadratic saddle point problem (21) by proposed greedy quasi-Newton algorithms, we have the following results:

1. Using the greedy Broyden family method (Algorithm 2), we have

 λ_{k₀+k} ≤ (1 − 1/(nκ²))^{k(k−1)/2} (1/2)^k (1 − 1/κ²)^{k₀} λ₀ for all k > 0 and k₀ = 𝒪(nκ² ln(nκ²)). (24)

2. Using the greedy BFGS method (Algorithm 3), we have

 λ_{k₀+k} ≤ (1 − 1/n)^{k(k−1)/2} (1/2)^k (1 − 1/κ²)^{k₀} λ₀ for all k > 0 and k₀ = 𝒪(n ln(nκ²)). (25)

3. Using the greedy SR1 method (Algorithm 4), we have

 λ_k ≤ (1 − 1/κ²)^k λ₀ for 0 ≤ k ≤ k₀, and λ_k ≤ ((n−1)!/(n − k + k₀ − 1)!) (1/(2n))^{k−k₀} (1 − 1/κ²)^{k₀} λ₀ for k₀ < k, (26)

where k₀ = 𝒪(n ln(nκ²)).
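The quadratic-case algorithms can be sketched in a few lines. The instance below (a made-up symmetric A with A_xx ⪰ I and A_yy ⪯ −I) is solved by iteration (10) with Ĥ = A, H = A², G₀ = L²I and greedy BFGS updates of G; this is an illustrative simplification of Algorithms 2 and 3, not the paper's exact pseudocode:

```python
import numpy as np

rng = np.random.default_rng(4)
nx = ny = 3
n = nx + ny

def spd(m):
    X = rng.standard_normal((m, m))
    return X @ X.T / (4 * m) + np.eye(m)        # eigenvalues roughly in [1, 2]

B = 0.1 * rng.standard_normal((nx, ny))
A = np.block([[spd(nx), B], [B.T, -spd(ny)]])   # A_xx >= I, A_yy <= -I
b = rng.standard_normal(n)
z_star = np.linalg.solve(A, b)                  # exact saddle point of (21)

H = A @ A                                       # fixed auxiliary matrix A^2
L = np.linalg.norm(A, 2)
G = L**2 * np.eye(n)                            # G_0 = L^2 I >= H by Lemma 3.1
z = np.zeros(n)

def bfgs(G, H, u):
    Gu, Hu = G @ u, H @ u
    return G - np.outer(Gu, Gu) / (u @ Gu) + np.outer(Hu, Hu) / (u @ Hu)

for _ in range(200):
    z = z - np.linalg.solve(G, A @ (A @ z - b)) # iteration (10): G^{-1} H_hat g
    i = np.argmax(np.diag(G) / np.diag(H))      # greedy direction (15)
    u = np.zeros(n); u[i] = 1.0
    G = bfgs(G, H, u)

print(np.linalg.norm(z - z_star))               # distance to the saddle point
```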

### 3.3 Algorithms for General Saddle Point Problems

In this section, we consider the general saddle point problem

 min_{x∈R^{n_x}} max_{y∈R^{n_y}} f(x, y),

where f satisfies Assumptions 2.1 and 2.2. We aim to propose quasi-Newton methods that solve the above problem with local superlinear convergence and 𝒪(n²) time complexity per iteration.

#### 3.3.1 Algorithms

The key idea of designing quasi-Newton methods for saddle point problems is to characterize the second-order information by approximating the auxiliary matrix H(z). Note that we have assumed the Hessian of f is Lipschitz continuous and bounded in Assumptions 2.1 and 2.2, which means the auxiliary matrix operator H(·) is also Lipschitz continuous.

###### Lemma 3.10.

Under Assumptions 2.1 and 2.2, the operator H(·) is 2L₂L-Lipschitz continuous, that is,

 ∥H(z) − H(z′)∥ ≤ 2L₂L∥z − z′∥. (27)
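The constant 2L₂L comes from a one-line factorization of the difference of squares, combined with (5) and (6):

```latex
% Difference-of-squares factorization behind (27):
% add and subtract \hat{H}(z)\hat{H}(z'), then apply (5) and (6).
\begin{aligned}
H(z) - H(z')
  &= \hat{H}(z)^2 - \hat{H}(z')^2 \\
  &= \hat{H}(z)\big(\hat{H}(z)-\hat{H}(z')\big)
   + \big(\hat{H}(z)-\hat{H}(z')\big)\hat{H}(z'),\\
\|H(z) - H(z')\|
  &\le \big(\|\hat{H}(z)\| + \|\hat{H}(z')\|\big)\,\|\hat{H}(z)-\hat{H}(z')\|
  \le 2L\,L_2\,\|z-z'\|.
\end{aligned}
```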

Combining Lemmas 3.1 and 3.10, we obtain the following property of H(·), which is analogous to strong self-concordance in convex optimization [29].

###### Lemma 3.11.

Under Assumptions 2.1 and 2.2, the auxiliary matrix operator satisfies

 H(z) − H(z′) ⪯ M∥z − z′∥H(w) (28)

for all z, z′, w ∈ R^n, where M def= 2L₂L/μ².

###### Corollary 3.11.1.

Let z, z′ ∈ R^n and r def= ∥z − z′∥. Suppose the objective function satisfies Assumptions 2.1 and 2.2; then the auxiliary matrix operator satisfies

 H(z)/(1 + Mr) ⪯ H(z′) ⪯ (1 + Mr)H(z) (29)

where M def= 2L₂L/μ².

Different from the quadratic case, the auxiliary matrix H(z) is not fixed for the general saddle point problem. Based on the smoothness of H(·), we apply Corollary 3.11.1 to generalize Lemma 3.3 as follows.

###### Lemma 3.12.

Let z, z⁺ ∈ R^n and G be a positive definite matrix such that

 H(z) ⪯ G ⪯ ηH(z) (30)

for some η ≥ 1. We additionally define r def= ∥z⁺ − z∥ and G̃ def= (1 + Mr)G; then we have

 G̃ ⪰ H(z⁺) (31)

and

 H(z⁺) ⪯ Broyd_τ(G̃, H(z⁺), u) ⪯ (1 + Mr)²ηH(z⁺) (32)

for all τ ∈ [0, 1] and u ∈ R^n.

The relationship (32) implies it is reasonable to establish the algorithms with the update rule

 G_{k+1} = Broyd_{τ_k}(G̃_k, H_{k+1}, u_k) (33)

with G̃_k def= (1 + Mr_k)G_k and H_{k+1} def= H(z_{k+1}). Similarly, we can also use such G̃_k in the specific BFGS and SR1 updates. Combining iteration (10) with (33), we propose quasi-Newton methods for general strongly-convex-strongly-concave saddle point problems. The details are shown in Algorithms 5, 6 and 7 for the greedy Broyden family, BFGS and SR1 updates, respectively.
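The scheme can be sketched on a tiny nonquadratic instance. The toy objective f(x, y) = x²/2 + xy − y²/2 + 0.05 sin x, the crude bound L and the constant M below are all made-up illustrations of the structure (correction factor (1 + Mr_k), then a greedy BFGS update against H(z_{k+1})), not the paper's exact pseudocode:

```python
import numpy as np

# Toy instance: f(x, y) = x^2/2 + xy - y^2/2 + 0.05*sin(x). The bound L = 2
# and constant M = 1 are deliberately conservative stand-ins for the
# theoretical quantities (M = 2*L2*L/mu^2 per Lemma 3.11).
grad = lambda z: np.array([z[0] + z[1] + 0.05 * np.cos(z[0]), z[0] - z[1]])
hess = lambda z: np.array([[1 - 0.05 * np.sin(z[0]), 1.0], [1.0, -1.0]])

def bfgs(G, H, u):
    Gu, Hu = G @ u, H @ u
    return G - np.outer(Gu, Gu) / (u @ Gu) + np.outer(Hu, Hu) / (u @ Hu)

M, L = 1.0, 2.0
z = np.array([0.3, -0.2])
G = L**2 * np.eye(2)                                    # G_0 >= H(z) everywhere (Lemma 3.1)

for _ in range(100):
    z_new = z - np.linalg.solve(G, hess(z) @ grad(z))   # iteration (10)
    r = np.linalg.norm(z_new - z)
    G_tilde = (1 + M * r) * G                           # correction factor, cf. (31)
    Hh = hess(z_new); H_new = Hh @ Hh                   # auxiliary matrix at z_{k+1}
    i = np.argmax(np.diag(G_tilde) / np.diag(H_new))    # greedy direction (15)
    u = np.zeros(2); u[i] = 1.0
    G = bfgs(G_tilde, H_new, u)                         # BFGS-type instance of (33)
    z = z_new

print(np.linalg.norm(grad(z)))                          # gradient norm at the final iterate
```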

#### 3.3.2 Convergence Analysis

We now turn to the convergence guarantees for the algorithms proposed in Section 3.3.1. We start with the following lemma, which is useful for the further analysis.

###### Lemma 3.13.

Suppose the objective function satisfies Assumptions 2.1 and 2.2 and z* is its saddle point. Then we have

 ∥g(z) − Ĥ(z)(z − z*)∥ ≤ (L₂/2)∥z − z*∥² (34)

for all z ∈ R^n.

We introduce the following notation to simplify the presentation. We let {z_k} be the sequence generated by Algorithm 5, 6 or 7 and denote

 Ĥ_k def= Ĥ(z_k), H_k def= (Ĥ(z_k))², g_k def= g(z_k) and r_k def= ∥z_{k+1} − z_k∥.

We still use the Euclidean distance λ_k for our analysis and establish the relationship between λ_{k+1} and λ_k, which is shown in Lemma 3.14.

###### Lemma 3.14.

Using Algorithm 5, 6 or 7, suppose we have

 H_k ⪯ G_k ⪯ η_k H_k (35)

for some η_k ≥ 1. Then we have

 λ_{k+1} ≤ (1 − 1/η_k)λ_k + βλ_k² (36)

and

 r_k ≤ λ_k + λ_{k+1}, (37)

where β > 0 is a constant depending only on the problem parameters.

Rodomanov and Nesterov [29, Lemma 4.3] derived a result similar to Lemma 3.14 for minimizing a strongly-convex function, but it depends on a different, gradient-based measure.² Note that our algorithms are based on the iteration rule

 z_{k+1} = z_k − G_k^{−1}Ĥ_k g_k.

Compared with quasi-Newton methods for convex optimization, there is an additional term Ĥ_k between G_k^{−1} and g_k, which makes the convergence analysis based on the gradient measure difficult. Fortunately, we find that directly using the Euclidean distance makes everything go smoothly.

² The original notation of Rodomanov and Nesterov [29] is for minimizing a strongly-convex function; to avoid ambiguity, we restate their results with our notation in this paper.

For further analysis, we also denote

 ρ_k def= 2Mλ_k (38)

and

 σ_k def= σ_{H_k}(G_k) = ⟨H_k^{−1}, G_k⟩ − n. (39)

We then establish linear convergence for the first period of iterations, which can be viewed as a generalization of Theorem 3.8. Note that the following result holds for all of Algorithms 5, 6 and 7.

###### Theorem 3.15.

Using Algorithm 5, 6 or 7, we suppose the initial point z₀ is sufficiently close to the saddle point such that

 Mλ₀ ≤ (ln b)/(8bκ²) (40)

for some constant b > 1. Then for all k ≥ 0, we have

 λ_{k+1} ≤ λ_k, (41)

 H_k ⪯ G_k ⪯ exp(2∑_{i=0}^{k−1} ρ_i)κ²H_k ⪯ bκ²H_k (42)

and

 λ_k ≤ (1 − 1/(2bκ²))^k λ₀. (43)

We then analyze how σ_k changes after one iteration to show the local superlinear convergence of the greedy Broyden family method (Algorithm 5) and the greedy BFGS method (Algorithm 6). Recall that σ_k is defined in (39) to measure how well the matrix G_k approximates the auxiliary matrix H_k.

###### Lemma 3.16.

Solving the general strongly-convex-strongly-concave saddle point problem (1) under Assumptions 2.1 and 2.2 by the proposed greedy quasi-Newton algorithms, we have the following results for σ_k:

1. For the greedy Broyden family method (Algorithm 5) with τ_k ∈ [0, 1], we have

 σ_{k+1} ≤ (1 − 1/(nκ²))(1 + Mr_k)²(σ_k + 2nMr_k/(1 + Mr_k)) for all k ≥ 0. (44)

2. For the greedy BFGS method (Algorithm 6), we have

 σ_{k+1} ≤ (1 − 1/n)(1 + Mr_k)²(σ_k + 2nMr_k/(1 + Mr_k)) for all k ≥ 0. (45)

The analysis for the greedy SR1 method is based on constructing η_k such that

 tr(Gk−Hk)=ηktr(H