# Quasi-Newton Methods for Saddle Point Problems

This paper studies quasi-Newton methods for solving strongly-convex-strongly-concave saddle point problems (SPP). We propose a variant of the general greedy Broyden family update for SPPs, which has an explicit local superlinear convergence rate of 𝒪((1−1/(nκ²))^{k(k−1)/2}), where n is the dimension of the problem, κ is the condition number and k is the number of iterations. The design and analysis of the proposed algorithm are based on estimating the square of the indefinite Hessian matrix, which differs from classical quasi-Newton methods in convex optimization. We also present two specific Broyden family algorithms with BFGS-type and SR1-type updates, which enjoy the faster local convergence rate of 𝒪((1−1/n)^{k(k−1)/2}).


## 1 Introduction

In this paper, we focus on the following smooth saddle point problem

 min_{x∈R^{n_x}} max_{y∈R^{n_y}} f(x, y), (1)

where f(x, y) is strongly-convex in x and strongly-concave in y. We target to find the saddle point (x*, y*), which satisfies

 f(x*, y) ≤ f(x*, y*) ≤ f(x, y*)

for all x ∈ R^{n_x} and y ∈ R^{n_y}. This formulation covers many scenarios, including game theory [2, 36], AUC maximization [14, 40], robust optimization [3, 13, 33] and empirical risk minimization [42, 11].

There are a great number of first-order optimization algorithms for solving problem (1), including the extragradient method [18, 35], optimistic gradient descent ascent [8], the proximal point method [28] and dual extrapolation [23]. These algorithms iterate with a first-order oracle and achieve linear convergence. Lin et al. [21], Wang and Li [37] used Catalyst acceleration to reduce the complexity for unbalanced saddle point problems, nearly matching the lower bound of first-order algorithms [25, 41] under specific assumptions. Compared with first-order methods, second-order methods usually enjoy superior convergence in numerical optimization. Huang et al. [16] extended the cubic regularized Newton (CRN) method [23, 22] to solve saddle point problem (1), which has quadratic local convergence. However, each iteration of CRN requires accessing the exact Hessian matrix and solving the corresponding linear systems. These steps incur 𝒪(n³) time complexity, which is too expensive for high-dimensional problems.

Quasi-Newton methods [6, 5, 34, 4, 9] are popular ways to avoid accessing the exact second-order information required by standard Newton methods. They approximate the Hessian matrix based on the Broyden family updating formulas [4], which significantly reduces the computational cost. These algorithms are well studied for convex optimization. The famous quasi-Newton methods, including the Davidon-Fletcher-Powell (DFP) method [9, 12], the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method [6, 5, 34] and the symmetric rank 1 (SR1) method [4, 9], enjoy local superlinear convergence [27, 7, 10] when the objective function is strongly convex. Recently, Rodomanov and Nesterov [29, 30, 31] proposed greedy variants of quasi-Newton methods, which first achieved non-asymptotic superlinear convergence. Later, Lin et al. [20] established a better convergence rate which is condition-number-free. Jin and Mokhtari [17], Ye et al. [39] showed the non-asymptotic superlinear convergence rate also holds for the classical DFP, BFGS and SR1 methods.

In this paper, we study quasi-Newton methods for saddle point problem (1). Since the Hessian matrix of our objective function is indefinite, the existing Broyden family update formulas and their convergence analysis cannot be applied directly. To overcome this issue, we propose a variant framework of greedy quasi-Newton methods for saddle point problems, which approximates the square of the Hessian matrix during the iteration. Our theoretical analysis characterizes the convergence rate by the Euclidean distance to the saddle point, rather than the weighted norm of the gradient used in convex optimization [29, 30, 31, 20, 39, 17]. We summarize the theoretical results for the proposed algorithms in Table 1. The local convergence behavior of all the algorithms has two periods. The first period has a linear convergence rate. The second one enjoys superlinear convergence:

• For general Broyden family methods, we have the explicit rate 𝒪((1−1/(nκ²))^{k(k−1)/2}).

• For the BFGS method and SR1 method, we have the faster explicit rate 𝒪((1−1/n)^{k(k−1)/2}), which is condition-number-free.

Additionally, our ideas can also be used for solving general non-linear equations.

##### Paper Organization

In Section 2, we introduce the notation and preliminaries used throughout this paper. In Section 3, we first propose greedy quasi-Newton methods for the quadratic saddle point problem, which enjoy local superlinear convergence. Then we extend them to solve general strongly-convex-strongly-concave saddle point problems. In Section 4, we show our theory can also be applied to solve more general non-linear equations and give the corresponding convergence analysis. We conclude our work in Section 5. All proofs are deferred to the appendix.

## 2 Notation and Preliminaries

We use ∥·∥ to denote the spectral norm of a matrix and the Euclidean norm of a vector, respectively. We denote the standard basis for R^n by {e₁, …, e_n} and let I be the identity matrix. The trace of a square matrix A is denoted by tr(A). Given two positive definite matrices G and H, we define their inner product as

 ⟨G, H⟩ def= tr(GH).

We introduce the following notation to measure how well a matrix G approximates a matrix H:

 σ_H(G) def= ⟨H^{−1}, G − H⟩ = ⟨H^{−1}, G⟩ − n. (2)

If we further suppose G ⪰ H, it holds that

 G − H ⪯ ⟨H^{−1}, G − H⟩H = σ_H(G)H (3)

by Rodomanov and Nesterov [29].

Using the notation of problem (1), we let z = [x; y] ∈ R^n with n = n_x + n_y, and denote the gradient and Hessian matrix of f at z as

 g(z) def= [∇_x f(x,y); ∇_y f(x,y)] ∈ R^n and Ĥ(z) def= ∇²f(x,y) = [∇²_xx f(x,y), ∇²_xy f(x,y); ∇²_yx f(x,y), ∇²_yy f(x,y)] ∈ R^{n×n}.
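As a concrete illustration, the sketch below builds g(z) and the indefinite Ĥ(z) for a made-up quadratic-coupled instance (the matrices P, Q, B and all sizes are hypothetical choices, not from the paper):

```python
import numpy as np

# Hypothetical instance: f(x, y) = 0.5 x'P x + x'B y - 0.5 y'Q y with P, Q
# positive definite, so Assumption 2.2 holds and the Hessian is indefinite.
rng = np.random.default_rng(0)
nx, ny = 3, 2
P = 0.1 * rng.standard_normal((nx, nx)); P = (P + P.T) / 2 + nx * np.eye(nx)
Q = 0.1 * rng.standard_normal((ny, ny)); Q = (Q + Q.T) / 2 + ny * np.eye(ny)
B = rng.standard_normal((nx, ny))

def grad(x, y):
    """g(z): the gradients in x and y stacked into one vector of length n = nx + ny."""
    return np.concatenate([P @ x + B @ y, B.T @ x - Q @ y])

def hess(x, y):
    """Indefinite Hessian: positive definite xx-block, negative definite yy-block."""
    return np.block([[P, B], [B.T, -Q]])

x, y = rng.standard_normal(nx), rng.standard_normal(ny)
H_hat = hess(x, y)
eigs = np.linalg.eigvalsh(H_hat)
print(eigs)  # mixed signs: the Hessian is indefinite
```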

We suppose the saddle point problem (1) satisfies the following assumptions.

###### Assumption 2.1.

The objective function f is twice differentiable and has an L-Lipschitz continuous gradient and an L₂-Lipschitz continuous Hessian, i.e., there exist constants L > 0 and L₂ > 0 such that

 ∥g(z) − g(z′)∥ ≤ L∥z − z′∥ (4)

and

 ∥Ĥ(z) − Ĥ(z′)∥ ≤ L₂∥z − z′∥ (5)

for any z, z′ ∈ R^n.

###### Assumption 2.2.

The objective function f is twice differentiable, μ-strongly-convex in x and μ-strongly-concave in y, i.e., there exists μ > 0 such that ∇²_xx f(x,y) ⪰ μI and ∇²_yy f(x,y) ⪯ −μI for any x ∈ R^{n_x} and y ∈ R^{n_y}.

Note that inequality (4) means the spectral norm of the Hessian matrix can be upper bounded, that is,

 ∥Ĥ(z)∥ ≤ L (6)

for all z ∈ R^n. Additionally, the condition number of the objective function is defined as

 κ def= L/μ.

## 3 Quasi-Newton Methods for Saddle Point Problems

The update rule of standard Newton’s method for solving problem (1) can be written as

 z⁺ = z − (Ĥ(z))^{−1} g(z). (7)

This iteration scheme has quadratic local convergence, but solving the linear system in (7) takes 𝒪(n³) time. For convex minimization, quasi-Newton methods including BFGS/SR1 [6, 5, 34, 4, 9] and their variants [20, 39, 29, 30] focus on approximating the Hessian and reduce the computational cost to 𝒪(n²) per iteration. However, all of these algorithms and their convergence analyses are based on the assumption that the Hessian matrix is positive definite, which is not suitable for our saddle point problems since Ĥ(z) is indefinite.

We introduce the auxiliary matrix H(z), defined as the square of the Hessian:

 H(z) def= (Ĥ(z))².

The following lemma shows H(z) is always positive definite.

###### Lemma 3.1.

Under Assumptions 2.1 and 2.2, we have

 μ²I ⪯ H(z) ⪯ L²I (8)

for all z ∈ R^n.
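The bounds in (8) are easy to check numerically. The sketch below builds a random Hessian-like matrix whose blocks satisfy Assumption 2.2 (all instance data are made up) and verifies that the eigenvalues of its square lie in [μ², L²]:

```python
import numpy as np

# Random instance satisfying Assumption 2.2 with mu = 0.5 (all data made up).
rng = np.random.default_rng(1)
nx = ny = 4
mu = 0.5

def spd(m):
    X = rng.standard_normal((m, m))
    return X @ X.T + mu * np.eye(m)      # symmetric, eigenvalues >= mu

Axx, Ayy = spd(nx), -spd(ny)             # xx-block >= mu*I, yy-block <= -mu*I
Axy = rng.standard_normal((nx, ny))
H_hat = np.block([[Axx, Axy], [Axy.T, Ayy]])

L = np.linalg.norm(H_hat, 2)             # spectral norm, so ||H_hat|| <= L trivially
H = H_hat @ H_hat                        # auxiliary matrix H(z) = H_hat(z)^2
eigs = np.linalg.eigvalsh(H)
print(eigs.min(), eigs.max())            # lie inside [mu^2, L^2] as Lemma 3.1 claims
```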

Hence, we can reformulate the update of Newton's method (7) as

 z⁺ = z − (H(z))^{−1} Ĥ(z) g(z). (9)

Then it is natural to characterize the second-order information by estimating the auxiliary matrix H(z), rather than the indefinite Hessian Ĥ(z). If we have obtained a symmetric positive definite matrix G as an estimator of H(z), the update rule (9) can be approximated by

 z⁺ = z − G^{−1} Ĥ(z) g(z). (10)

The remainder of this section introduces several strategies to construct G, resulting in quasi-Newton methods for saddle point problems with local superlinear convergence. We point out that implementing iteration (10) does not require constructing the Hessian matrix explicitly, since we are only interested in the Hessian-vector product Ĥ(z)g(z), which can be computed efficiently [26, 32].
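For instance, when only a gradient oracle is available, the product Ĥ(z)v can be approximated by a central finite difference of gradients; the quadratic toy oracle below is a hypothetical stand-in for the real one:

```python
import numpy as np

# Hypothetical gradient oracle of a quadratic toy problem: g(z) = A z - b.
A = np.array([[2.0, 1.0], [1.0, -2.0]])  # symmetric indefinite "Hessian"
b = np.array([1.0, -1.0])
g = lambda z: A @ z - b

def hvp(g, z, v, eps=1e-6):
    """Approximate the product H_hat(z) v with two gradient calls, no Hessian formed."""
    return (g(z + eps * v) - g(z - eps * v)) / (2 * eps)

z = np.array([0.3, 0.7])
v = g(z)                                  # iteration (10) needs H_hat(z) g(z)
print(np.allclose(hvp(g, z, v), A @ v, atol=1e-5))
```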

### 3.1 The Broyden Family Updates

We first introduce the Broyden family [24, Section 6.3] of quasi-Newton updates for approximating a positive definite matrix H by using the information of the current estimator G.

###### Definition 3.2.

Suppose two positive definite matrices G, H ∈ R^{n×n} satisfy G ⪰ H. For any u ∈ R^n and τ ∈ [0, 1], if (G − H)u = 0, we define Broyd_τ(G, H, u) def= G. Otherwise, we define

 Broyd_τ(G, H, u) def= τ[G − (Huu^⊤G + Guu^⊤H)/(u^⊤Hu) + (u^⊤Gu/(u^⊤Hu) + 1)·(Huu^⊤H)/(u^⊤Hu)] (11)
  + (1 − τ)[G − ((G − H)uu^⊤(G − H))/(u^⊤(G − H)u)].

Different choices of the parameter τ in formula (11) recover several popular quasi-Newton updates:

• For τ = (u^⊤Hu)/(u^⊤Gu), it corresponds to the BFGS update

 BFGS(G, H, u) def= G − (Guu^⊤G)/(u^⊤Gu) + (Huu^⊤H)/(u^⊤Hu). (12)

• For τ = 0, it corresponds to the SR1 update

 SR1(G, H, u) def= G − ((G − H)uu^⊤(G − H))/(u^⊤(G − H)u). (13)

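The updates (11)-(13) translate directly into code. In the sketch below, the first bracket of (11) is implemented as the DFP-type term and the second as the SR1 term; the choice τ = (u^⊤Hu)/(u^⊤Gu) recovers the BFGS formula (12), which the made-up test pair G ⪰ H confirms numerically:

```python
import numpy as np

def broyd(tau, G, H, u):
    """Broyden family update (11): tau * (DFP-type term) + (1 - tau) * (SR1 term)."""
    Gu, Hu = G @ u, H @ u
    if np.allclose(Gu, Hu):                  # (G - H)u = 0: keep G unchanged
        return G.copy()
    a, b = u @ Gu, u @ Hu
    dfp = G - (np.outer(Hu, Gu) + np.outer(Gu, Hu)) / b \
            + (a / b + 1.0) * np.outer(Hu, Hu) / b
    w = Gu - Hu
    sr1 = G - np.outer(w, w) / (u @ w)
    return tau * dfp + (1.0 - tau) * sr1

# Deterministic toy pair with G >= H (instance data made up).
n = 5
rng = np.random.default_rng(2)
X = rng.standard_normal((n, n))
H = X @ X.T + np.eye(n)
v = np.arange(1.0, n + 1.0)
G = H + np.outer(v, v)
u = np.zeros(n); u[0] = 1.0

tau_bfgs = (u @ H @ u) / (u @ G @ u)         # this choice of tau recovers BFGS (12)
bfgs = G - np.outer(G @ u, G @ u) / (u @ G @ u) + np.outer(H @ u, H @ u) / (u @ H @ u)
print(np.allclose(broyd(tau_bfgs, G, H, u), bfgs))
```

Every member of the family satisfies the secant-type property Broyd_τ(G, H, u)u = Hu, which is a quick sanity check for any implementation.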
The general Broyden family update in Definition 3.2 has the following properties.

###### Lemma 3.3 ([29, Lemma 2.1 and Lemma 2.2]).

Suppose two positive definite matrices G, H ∈ R^{n×n} satisfy H ⪯ G ⪯ ηH for some η ≥ 1. Then for any 0 ≤ τ₁ ≤ τ₂ ≤ 1 and u ∈ R^n, we have

 Broyd_{τ₁}(G, H, u) ⪯ Broyd_{τ₂}(G, H, u).

Additionally, for any τ ∈ [0, 1], we have

 H ⪯ Broyd_τ(G, H, u) ⪯ ηH. (14)

We first introduce the greedy update method [29], which chooses the direction u as follows:

 u_H(G) def= argmax_{u∈{e₁,…,e_n}} (u^⊤(G − H)u)/(u^⊤Hu) = argmax_{u∈{e₁,…,e_n}} (u^⊤Gu)/(u^⊤Hu). (15)

The following lemma shows that applying update rule (11) with formula (15) leads to a new estimator with a tighter error bound in the measure σ_H(·).

###### Lemma 3.4 ([29, Theorem 2.5]).

Suppose two positive definite matrices G, H ∈ R^{n×n} with μ²I ⪯ H ⪯ L²I satisfy H ⪯ G. Let G⁺ = Broyd_τ(G, H, u_H(G)). Then for any τ ∈ [0, 1], we have

 σ_H(G⁺) ≤ (1 − 1/(nκ²)) σ_H(G). (16)
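A minimal numerical check of the greedy mechanism: the sketch below computes the greedy direction (15) via diagonal ratios, applies a BFGS update, and confirms that the measure σ_H from (2) does not increase (the synthetic pair H ⪯ G is made up):

```python
import numpy as np

def sigma(H, G):
    """sigma_H(G) = <H^{-1}, G> - n from (2)."""
    return np.trace(np.linalg.solve(H, G)) - H.shape[0]

def greedy_u(G, H):
    """Greedy direction (15): the basis vector maximizing (u'Gu)/(u'Hu)."""
    i = np.argmax(np.diag(G) / np.diag(H))
    u = np.zeros(G.shape[0]); u[i] = 1.0
    return u

def bfgs(G, H, u):
    Gu, Hu = G @ u, H @ u
    return G - np.outer(Gu, Gu) / (u @ Gu) + np.outer(Hu, Hu) / (u @ Hu)

rng = np.random.default_rng(3)
n = 6
X = rng.standard_normal((n, n)); H = X @ X.T + np.eye(n)
V = rng.standard_normal((n, n)); G = H + V @ V.T / n      # G >= H
s0 = sigma(H, G)
G1 = bfgs(G, H, greedy_u(G, H))
print(sigma(H, G1) <= s0)
```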

For the specific Broyden family updates BFGS and SR1 shown in (12) and (13), we can replace (15) by a scaled greedy direction [20], which leads to a better convergence result. Concretely, for the BFGS method, we first find an upper triangular matrix L such that G = L^⊤L. This step can be implemented with 𝒪(n²) complexity [20, Proposition 15]. We present the subroutine for factorizing G in Algorithm 1 and give its detailed implementation in Appendix B.

Then the BFGS update uses the direction induced by ũ_H(L), where

 ũ_H(L) def= argmax_{u∈{e₁,…,e_n}} u^⊤L^{−⊤}H^{−1}L^{−1}u. (17)

For the SR1 method, we choose the direction by

 ū_H(G) def= argmax_{u∈{e₁,…,e_n}} (u^⊤(G − H)²u)/(u^⊤(G − H)u). (18)

Applying the BFGS update rule (12) with formula (17), we obtain a condition-number-free result as follows.

###### Lemma 3.5 ([20, Theorem 2.6]).

Suppose two positive definite matrices G, H ∈ R^{n×n} satisfy H ⪯ G. Let G⁺ be the BFGS update (12) of G with the direction induced by ũ_H(L), where L is an upper triangular matrix such that G = L^⊤L. Then we have

 σ_H(G⁺) ≤ (1 − 1/n) σ_H(G). (19)

The effect of the SR1 update can be characterized by the following measure:

 τ_H(G) def= tr(G − H).

Applying the SR1 update rule (13) with formula (18), we also obtain a condition-number-free result.

###### Lemma 3.6 ([20, Theorem 2.3]).

Suppose two positive definite matrices G, H ∈ R^{n×n} satisfy H ⪯ G. Let G⁺ = SR1(G, H, ū_H(G)). Then we have

 τ_H(G⁺) ≤ (1 − 1/n) τ_H(G). (20)

### 3.2 Algorithms for Quadratic Saddle Point Problems

We now consider solving the quadratic saddle point problem of the form

 min_{x∈R^{n_x}} max_{y∈R^{n_y}} f(x, y) def= (1/2)[x^⊤ y^⊤]A[x; y] − b^⊤[x; y], (21)

where A ∈ R^{n×n} is symmetric and b ∈ R^n. We suppose A can be partitioned as

 A = [A_xx, A_xy; A_yx, A_yy] ∈ R^{n×n},

where the sub-matrices satisfy A_xx ⪰ μI, A_yy ⪯ −μI and ∥A∥ ≤ L. Recall that the condition number is defined as κ = L/μ. Using the notation introduced in Section 2, we have

 z = [x; y], g(z) = Az − b, Ĥ def= Ĥ(z) = A and H def= H(z) = A².

We present the detailed procedures of greedy quasi-Newton methods for the quadratic saddle point problem using the Broyden family update, BFGS update and SR1 update in Algorithms 2, 3 and 4, respectively. For our convergence analysis, we define λ_k as the Euclidean distance from z_k to the saddle point z*, that is,

 λ_k def= ∥z_k − z*∥.

The definition of λ_k in this paper is different from the weighted gradient norm used in convex optimization [29, 20]¹, but it satisfies a similar property, as follows.

¹ In later sections, we will see the measure λ_k is suitable for the convergence analysis of quasi-Newton methods for saddle point problems.

###### Lemma 3.7.

Assume we have H ⪯ G_k ⪯ η_k H for some η_k ≥ 1 in Algorithms 2, 3 and 4. Then we have

 λ_{k+1} ≤ (1 − 1/η_k) λ_k.

The next theorem states that the assumption of Lemma 3.7 always holds with η_k = κ², which means λ_k converges to 0 linearly.

###### Theorem 3.8.

For all k ≥ 0, Algorithms 2, 3 and 4 satisfy

 H ⪯ G_k ⪯ κ²H (22)

and

 λ_k ≤ (1 − 1/κ²)^k λ₀. (23)

Lemma 3.7 also implies that superlinear convergence can be obtained if there exists η_k converging to 1. Applying Lemmas 3.4, 3.5 and 3.6, we can show this holds for the proposed algorithms.

###### Theorem 3.9.

Solving quadratic saddle point problem (21) by proposed greedy quasi-Newton algorithms, we have the following results:

1. For the greedy Broyden family method (Algorithm 2), we have

 H ⪯ G_k ⪯ (1 + (1 − 1/(nκ²))^k nκ²) H and λ_{k+1} ≤ (1 − 1/(nκ²))^k nκ² λ_k for all k ≥ 0.

2. For the greedy BFGS method (Algorithm 3), we have the analogous bounds with the condition-number-free factor (1 − 1/n)^k in place of (1 − 1/(nκ²))^k.

3. For the greedy SR1 method (Algorithm 4), we have a similar condition-number-free guarantee in terms of the measure τ_H(·).

Combining the results of Theorems 3.8 and 3.9, we obtain the two-stage convergence behavior, that is, the algorithm has global linear convergence and local superlinear convergence. The formal description is summarized as follows.

###### Corollary 3.9.1.

Solving quadratic saddle point problem (21) by proposed greedy quasi-Newton algorithms, we have the following results:

1. Using the greedy Broyden family method (Algorithm 2), we have

 λ_{k₀+k} ≤ (1 − 1/(nκ²))^{k(k−1)/2} (1/2)^k (1 − 1/κ²)^{k₀} λ₀ for all k > 0 and k₀ = 𝒪(nκ² ln(nκ²)). (24)

2. Using the greedy BFGS method (Algorithm 3), we have

 λ_{k₀+k} ≤ (1 − 1/n)^{k(k−1)/2} (1/2)^k (1 − 1/κ²)^{k₀} λ₀ for all k > 0 and k₀ = 𝒪(n ln(nκ²)). (25)

3. Using the greedy SR1 method (Algorithm 4), we have

 λ_k ≤ (1 − 1/κ²)^k λ₀ for 0 ≤ k ≤ k₀, and λ_k ≤ ((n−1)!/(n − k + k₀ − 1)!) (1/(2n))^{k−k₀} (1 − 1/κ²)^{k₀} λ₀ for k₀ < k, (26)

where k₀ = 𝒪(n ln(nκ²)).
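The quadratic-case algorithms can be sketched in a few lines. The instance below (a made-up symmetric A with A_xx ⪰ I and A_yy ⪯ −I) is solved by iteration (10) with Ĥ = A, H = A², G₀ = L²I and greedy BFGS updates of G; this is an illustrative simplification of Algorithms 2 and 3, not the paper's exact pseudocode:

```python
import numpy as np

rng = np.random.default_rng(4)
nx = ny = 3
n = nx + ny

def spd(m):
    X = rng.standard_normal((m, m))
    return X @ X.T / (4 * m) + np.eye(m)        # eigenvalues roughly in [1, 2]

B = 0.1 * rng.standard_normal((nx, ny))
A = np.block([[spd(nx), B], [B.T, -spd(ny)]])   # A_xx >= I, A_yy <= -I
b = rng.standard_normal(n)
z_star = np.linalg.solve(A, b)                  # exact saddle point of (21)

H = A @ A                                       # fixed auxiliary matrix A^2
L = np.linalg.norm(A, 2)
G = L**2 * np.eye(n)                            # G_0 = L^2 I >= H by Lemma 3.1
z = np.zeros(n)

def bfgs(G, H, u):
    Gu, Hu = G @ u, H @ u
    return G - np.outer(Gu, Gu) / (u @ Gu) + np.outer(Hu, Hu) / (u @ Hu)

for _ in range(200):
    z = z - np.linalg.solve(G, A @ (A @ z - b)) # iteration (10): G^{-1} H_hat g
    i = np.argmax(np.diag(G) / np.diag(H))      # greedy direction (15)
    u = np.zeros(n); u[i] = 1.0
    G = bfgs(G, H, u)

print(np.linalg.norm(z - z_star))               # distance to the saddle point
```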

### 3.3 Algorithms for General Saddle Point Problems

In this section, we consider the general saddle point problem

 min_{x∈R^{n_x}} max_{y∈R^{n_y}} f(x, y),

where f satisfies Assumptions 2.1 and 2.2. We aim to propose quasi-Newton methods that solve the above problem with local superlinear convergence and 𝒪(n²) time complexity per iteration.

#### 3.3.1 Algorithms

The key idea of designing quasi-Newton methods for saddle point problems is to characterize the second-order information by approximating the auxiliary matrix H(z). Note that we have assumed the Hessian of f is Lipschitz continuous and bounded in Assumptions 2.1 and 2.2, which means the auxiliary matrix operator H(·) is also Lipschitz continuous.

###### Lemma 3.10.

Under Assumptions 2.1 and 2.2, the operator H(·) is 2L₂L-Lipschitz continuous, that is,

 ∥H(z) − H(z′)∥ ≤ 2L₂L∥z − z′∥. (27)
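The constant 2L₂L comes from a one-line factorization of the difference of squares, combined with (5) and (6):

```latex
% Difference-of-squares factorization behind (27):
% add and subtract \hat{H}(z)\hat{H}(z'), then apply (5) and (6).
\begin{aligned}
H(z) - H(z')
  &= \hat{H}(z)^2 - \hat{H}(z')^2 \\
  &= \hat{H}(z)\big(\hat{H}(z)-\hat{H}(z')\big)
   + \big(\hat{H}(z)-\hat{H}(z')\big)\hat{H}(z'),\\
\|H(z) - H(z')\|
  &\le \big(\|\hat{H}(z)\| + \|\hat{H}(z')\|\big)\,\|\hat{H}(z)-\hat{H}(z')\|
  \le 2L\,L_2\,\|z-z'\|.
\end{aligned}
```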

Combining Lemmas 3.1 and 3.10, we obtain the following property of H(·), which is analogous to strong self-concordance in convex optimization [29].

###### Lemma 3.11.

Under Assumptions 2.1 and 2.2, the auxiliary matrix operator satisfies

 H(z) − H(z′) ⪯ M∥z − z′∥H(w) (28)

for all z, z′, w ∈ R^n, where M def= 2L₂L/μ².

###### Corollary 3.11.1.

Let z, z′ ∈ R^n and r def= ∥z − z′∥. Suppose the objective function satisfies Assumptions 2.1 and 2.2; then the auxiliary matrix operator satisfies

 H(z)/(1 + Mr) ⪯ H(z′) ⪯ (1 + Mr)H(z) (29)

where M def= 2L₂L/μ².

Different from the quadratic case, the auxiliary matrix H(z) is not fixed for the general saddle point problem. Based on the smoothness of H(·), we apply Corollary 3.11.1 to generalize Lemma 3.3 as follows.

###### Lemma 3.12.

Let z, z⁺ ∈ R^n and G be a positive definite matrix such that

 H(z) ⪯ G ⪯ ηH(z) (30)

for some η ≥ 1. We additionally define r def= ∥z⁺ − z∥ and G̃ def= (1 + Mr)G; then we have

 G̃ ⪰ H(z⁺) (31)

and

 H(z⁺) ⪯ Broyd_τ(G̃, H(z⁺), u) ⪯ (1 + Mr)²ηH(z⁺) (32)

for all τ ∈ [0, 1] and u ∈ R^n.

The relationship (32) implies it is reasonable to establish the algorithms with the update rule

 G_{k+1} = Broyd_{τ_k}(G̃_k, H_{k+1}, u_k) (33)

with G̃_k def= (1 + Mr_k)G_k and H_{k+1} def= H(z_{k+1}). Similarly, we can also use such G̃_k in the specific BFGS and SR1 updates. Combining iteration (10) with (33), we propose quasi-Newton methods for general strongly-convex-strongly-concave saddle point problems. The details are shown in Algorithms 5, 6 and 7 for the greedy Broyden family, BFGS and SR1 updates, respectively.
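The scheme can be sketched on a tiny nonquadratic instance. The toy objective f(x, y) = x²/2 + xy − y²/2 + 0.05 sin x, the crude bound L and the constant M below are all made-up illustrations of the structure (correction factor (1 + Mr_k), then a greedy BFGS update against H(z_{k+1})), not the paper's exact pseudocode:

```python
import numpy as np

# Toy instance: f(x, y) = x^2/2 + xy - y^2/2 + 0.05*sin(x). The bound L = 2
# and constant M = 1 are deliberately conservative stand-ins for the
# theoretical quantities (M = 2*L2*L/mu^2 per Lemma 3.11).
grad = lambda z: np.array([z[0] + z[1] + 0.05 * np.cos(z[0]), z[0] - z[1]])
hess = lambda z: np.array([[1 - 0.05 * np.sin(z[0]), 1.0], [1.0, -1.0]])

def bfgs(G, H, u):
    Gu, Hu = G @ u, H @ u
    return G - np.outer(Gu, Gu) / (u @ Gu) + np.outer(Hu, Hu) / (u @ Hu)

M, L = 1.0, 2.0
z = np.array([0.3, -0.2])
G = L**2 * np.eye(2)                                    # G_0 >= H(z) everywhere (Lemma 3.1)

for _ in range(100):
    z_new = z - np.linalg.solve(G, hess(z) @ grad(z))   # iteration (10)
    r = np.linalg.norm(z_new - z)
    G_tilde = (1 + M * r) * G                           # correction factor, cf. (31)
    Hh = hess(z_new); H_new = Hh @ Hh                   # auxiliary matrix at z_{k+1}
    i = np.argmax(np.diag(G_tilde) / np.diag(H_new))    # greedy direction (15)
    u = np.zeros(2); u[i] = 1.0
    G = bfgs(G_tilde, H_new, u)                         # BFGS-type instance of (33)
    z = z_new

print(np.linalg.norm(grad(z)))                          # gradient norm at the final iterate
```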

#### 3.3.2 Convergence Analysis

We now turn to the convergence guarantees for the algorithms proposed in Section 3.3.1. We start with the following lemma, which is useful for the further analysis.

###### Lemma 3.13.

Suppose the objective function satisfies Assumptions 2.1 and 2.2 and z* is its saddle point. Then we have

 ∥g(z) − Ĥ(z)(z − z*)∥ ≤ (L₂/2)∥z − z*∥² (34)

for all z ∈ R^n.

We introduce the following notation to simplify the presentation. We let {z_k} be the sequence generated by Algorithm 5, 6 or 7 and denote

 Ĥ_k def= Ĥ(z_k), H_k def= (Ĥ(z_k))², g_k def= g(z_k) and r_k def= ∥z_{k+1} − z_k∥.

We still use the Euclidean distance λ_k for our analysis and establish the relationship between λ_{k+1} and λ_k, which is shown in Lemma 3.14.

###### Lemma 3.14.

Using Algorithm 5, 6 or 7, suppose we have

 H_k ⪯ G_k ⪯ η_k H_k (35)

for some η_k ≥ 1. Then we have

 λ_{k+1} ≤ (1 − 1/η_k)λ_k + βλ_k² (36)

and

 r_k ≤ λ_k + λ_{k+1}, (37)

where β > 0 is a constant depending only on the problem parameters.

Rodomanov and Nesterov [29, Lemma 4.3] derived a result similar to Lemma 3.14 for minimizing a strongly-convex function, but it depends on a different, gradient-based measure.² Note that our algorithms are based on the iteration rule

 z_{k+1} = z_k − G_k^{−1}Ĥ_k g_k.

Compared with quasi-Newton methods for convex optimization, there is an additional term Ĥ_k between G_k^{−1} and g_k, which makes the convergence analysis based on the gradient measure difficult. Fortunately, we find that directly using the Euclidean distance makes everything go smoothly.

² The original notation of Rodomanov and Nesterov [29] is for minimizing a strongly-convex function; to avoid ambiguity, we restate their results with our notation in this paper.

For further analysis, we also denote

 ρ_k def= 2Mλ_k (38)

and

 σ_k def= σ_{H_k}(G_k) = ⟨H_k^{−1}, G_k⟩ − n. (39)

We then establish linear convergence for the first period of iterations, which can be viewed as a generalization of Theorem 3.8. Note that the following result holds for all of Algorithms 5, 6 and 7.

###### Theorem 3.15.

Using Algorithm 5, 6 or 7, we suppose the initial point z₀ is sufficiently close to the saddle point such that

 Mλ₀ ≤ (ln b)/(8bκ²) (40)

for some constant b > 1. Then for all k ≥ 0, we have

 λ_{k+1} ≤ λ_k, (41)

 H_k ⪯ G_k ⪯ exp(2∑_{i=0}^{k−1} ρ_i)κ²H_k ⪯ bκ²H_k (42)

and

 λ_k ≤ (1 − 1/(2bκ²))^k λ₀. (43)

We then analyze how σ_k changes after one iteration to show the local superlinear convergence of the greedy Broyden family method (Algorithm 5) and the greedy BFGS method (Algorithm 6). Recall that σ_k is defined in (39) to measure how well the matrix G_k approximates the auxiliary matrix H_k.

###### Lemma 3.16.

Solving the general strongly-convex-strongly-concave saddle point problem (1) under Assumptions 2.1 and 2.2 by the proposed greedy quasi-Newton algorithms, we have the following results for σ_k:

1. For the greedy Broyden family method (Algorithm 5) with τ_k ∈ [0, 1], we have

 σ_{k+1} ≤ (1 − 1/(nκ²))(1 + Mr_k)²(σ_k + 2nMr_k/(1 + Mr_k)) for all k ≥ 0. (44)

2. For the greedy BFGS method (Algorithm 6), we have

 σ_{k+1} ≤ (1 − 1/n)(1 + Mr_k)²(σ_k + 2nMr_k/(1 + Mr_k)) for all k ≥ 0. (45)

The analysis for the greedy SR1 method is based on constructing η_k such that

 tr(Gk−Hk)=ηktr(H