We revisit the convex-concave saddle point problem of the form
$$\min_x \max_y \; f(x) + \langle y, Ax \rangle - g(y), \qquad (1)$$
where both $f$ and $g$ are convex functions and $A$ is a coupling matrix. This formulation has a wide range of applications, including supervised learning (Zhang and Lin, 2015; Bach et al., 2008), robust optimization (Ben-Tal et al., 2009), PID control (Hast et al., 2013), etc. See Section 1.2 for some concrete examples.
When the problem dimension is large, the most widely used and sometimes the only scalable methods to solve Problem (1) are first-order methods. Arguably the simplest first-order algorithm is the primal-dual gradient method (Algorithm 1), a natural generalization of the gradient descent algorithm, which simultaneously performs gradient descent on the primal variable $x$ and gradient ascent on the dual variable $y$.
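The update rule just described can be sketched as follows (a minimal NumPy illustration on a toy quadratic instance; the function name `primal_dual_gradient`, the step sizes, and the test problem are ours, not the paper's notation):

```python
import numpy as np

def primal_dual_gradient(grad_f, grad_g, A, x0, y0, eta, sigma, iters):
    """Sketch of Algorithm 1 for min_x max_y f(x) + <y, A x> - g(y):
    simultaneous gradient descent in x and gradient ascent in y."""
    x, y = x0.astype(float), y0.astype(float)
    for _ in range(iters):
        gx = grad_f(x) + A.T @ y      # partial gradient of the saddle objective in x
        gy = A @ x - grad_g(y)        # partial gradient of the saddle objective in y
        x, y = x - eta * gx, y + sigma * gy   # simultaneous update
    return x, y

# Toy instance: f(x) = 0.5||x||^2, g(y) = 0.5||y||^2, A = I; the saddle point is (0, 0).
A = np.eye(2)
x, y = primal_dual_gradient(lambda x: x, lambda y: y, A,
                            np.array([1.0, 0.0]), np.array([0.0, 1.0]),
                            eta=0.1, sigma=0.1, iters=500)
assert np.linalg.norm(x) + np.linalg.norm(y) < 1e-8
```

On this strongly convex toy problem the iterates spiral into the saddle point at a linear rate, which is the behavior the paper studies under weaker assumptions.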
There has been extensive research on analyzing the convergence rate of Algorithm 1 and its variants. It is known that if both $f$ and $g$ are strongly convex and admit efficient proximal mappings, then the proximal primal-dual gradient method converges to the optimal solution at a linear rate (Bauschke and Combettes, 2011; Palaniappan and Bach, 2016; Chen and Rockafellar, 1997), i.e., it only requires $O(\log(1/\epsilon))$ iterations to obtain a solution that is $\epsilon$-close to the optimum.
In many applications, however, we only have strong convexity in $g$ but no strong convexity in $f$. This motivates the following question:
Does the primal-dual gradient method converge linearly to the optimal solution if $f$ is not strongly convex?
Intuitively, a linear convergence rate is plausible. Consider the corresponding primal problem of (1):
where $g^*$ is the conjugate function of $g$. Because $g^*$ is smooth and strongly convex, as long as $A$ has full column rank, Problem (2) has a smooth and strongly convex objective, and thus vanilla gradient descent achieves linear convergence. Therefore, one might expect a linearly convergent first-order algorithm for Problem (1) as well. However, whether the vanilla primal-dual gradient method (Algorithm 1) has linear convergence turns out to be a nontrivial question.
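In symbols, the reduction from the saddle point problem to the primal problem is the standard one-line identity (writing the coupling term as $\langle y, Ax \rangle$):

```latex
\min_x \max_y \; f(x) + \langle y, Ax \rangle - g(y)
  \;=\; \min_x \; f(x) + \sup_y \bigl\{ \langle Ax, y \rangle - g(y) \bigr\}
  \;=\; \min_x \; f(x) + g^*(Ax),
```

so any convergence statement about gradient descent on the primal objective $f(x) + g^*(Ax)$ is a statement about Problem (2).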
Two recent results verified this conceptual experiment with additional assumptions: Du et al. (2017) required both $f$ and $g$ to be quadratic functions, and Wang and Xiao (2017) required both $f$ and $g$ to have efficient proximal mappings and used a proximal primal-dual gradient method. In this paper, we give an affirmative answer to this question with minimal assumptions. Our main contributions are summarized below.
1.1 Our Contributions
Linear Convergence of the Primal-Dual Gradient Method.
We show that as long as $f$ and $g$ are smooth, $f$ is convex, $g$ is strongly convex, and the coupling matrix $A$ has full column rank, Algorithm 1 converges to the optimal solution at a linear rate. See Section 3 for a precise statement of our result. This result significantly generalizes previous ones, which rely on stronger assumptions. Note that all the assumptions are necessary for linear convergence: if any of them is dropped, the primal problem (2) requires at least $\Omega(1/\sqrt{\epsilon})$ iterations to obtain an $\epsilon$-close solution (Nesterov, 2013), so there is no hope of linear convergence for Problem (1).
New Analysis Technique.
To analyze the convergence of an optimization algorithm, a common approach is to construct a potential function (also called a Lyapunov function in the literature) which decreases after each iteration. For example, for the primal problem (2), a natural potential function is $\|x_t - x^*\|$, the distance between the current iterate and the optimal solution. However, for the primal-dual gradient method, it is difficult to show that similar potential functions decrease, because the two sequences $\{x_t\}$ and $\{y_t\}$ are coupled with each other.
In this paper, we develop a novel method for analyzing the convergence rate of the primal-dual gradient method. The key idea is to consider a “ghost” sequence. For example, in our setting, the “ghost” sequence comes from a gradient descent step for Problem (2). Then we relate the sequence generated by Algorithm 1 to this “ghost” sequence and show they are close in a certain way. See Section 3 for details. We believe this technique is applicable to other problems where we need to analyze multiple sequences.
Extension to Primal-Dual Stochastic Variance Reduced Gradient Method.
Many optimization problems in machine learning have a finite-sum structure, and randomized algorithms have been proposed to exploit this structure and speed up convergence. There has been extensive research in recent years on developing more efficient stochastic algorithms in such settings (Le Roux et al., 2012; Johnson and Zhang, 2013; Defazio et al., 2014; Xiao and Zhang, 2014; Shalev-Shwartz and Zhang, 2013; Richtárik and Takáč, 2014; Lin et al., 2015; Zhang and Lin, 2015; Allen-Zhu, 2017). Among them, the stochastic variance reduced gradient (SVRG) algorithm (Johnson and Zhang, 2013) is a popular one with computational complexity $O((n + \kappa) d \log(1/\epsilon))$ for smooth and strongly convex objectives, where $n$ is the number of component functions, $d$ is the dimension of the variable, and $\kappa$ is a condition number that only depends on problem-dependent parameters like smoothness and strong convexity but not $n$. Variants of SVRG for saddle point problems have been studied recently by Palaniappan and Bach (2016); Wang and Xiao (2017); Du et al. (2017) and can achieve similar running time.[1] However, these results all require additional assumptions. In this paper, we use our analysis technique developed for Algorithm 1 to show that the primal-dual SVRG method also admits $O((n + \kappa) d \log(1/\epsilon))$-type computational complexity.
[1] $\kappa$ may be different in the primal and the primal-dual settings.
1.2 Motivating Examples
In this subsection we list some machine learning applications that naturally lead to convex-concave saddle point problems.
Policy Evaluation.
For the policy evaluation task in reinforcement learning, we have data $\{(s_t, r_t, s_{t+1})\}$ generated by a policy $\pi$, where $s_t$ is the state at the $t$-th time step, $r_t$ is the reward, and $s_{t+1}$ is the state at the $(t+1)$-th time step. We also have a discount factor $\gamma \in (0, 1)$ and a feature function $\phi$
which maps a state to a feature vector. Our goal is to learn a linear value function $V(s) \approx \theta^\top \phi(s)$, which represents the long-term expected reward starting from state $s$ under the policy $\pi$. A common way to estimate $\theta$ is to minimize the empirical mean squared projected Bellman error (MSPBE):
The gradient of this saddle point reformulation can be computed more efficiently than that of the original formulation (3), and it has a finite-sum structure.
Empirical Risk Minimization.
Consider the classical supervised learning problem of learning a linear predictor $x$ given $n$ data points $\{(a_i, b_i)\}_{i=1}^n$. Denote by $A$ the data matrix whose $i$-th row is $a_i^\top$. Then the empirical risk minimization (ERM) problem amounts to solving
where the first term is induced by some loss function and the second term is a regularizer; both are convex functions. Equivalently, we can solve the dual problem or the saddle point problem. The saddle point formulation is favorable in many scenarios, e.g., when it admits a finite-sum structure (Zhang and Lin, 2015; Wang and Xiao, 2017), reduces communication complexity in the distributed setting (Xiao et al., 2017), or exploits sparsity structure (Lei et al., 2017).
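As a concrete instance, for the squared loss the conjugate reformulation behind the saddle point problem can be checked numerically (a toy sketch; the name `phi_conj` and the grid maximization are our illustrative choices, not the paper's construction):

```python
import numpy as np

# For the squared loss phi(z) = 0.5*(z - b)^2, the conjugate is
# phi*(y) = 0.5*y^2 + b*y, and the inner maximization over y recovers the loss:
#   max_y [ y*z - phi*(y) ] = 0.5*(z - b)^2, attained at y = z - b.
b, z = 1.5, 4.0
phi_conj = lambda y: 0.5 * y**2 + b * y
y_grid = np.linspace(-10.0, 10.0, 200001)      # brute-force maximization over a grid
inner_max = np.max(y_grid * z - phi_conj(y_grid))
assert abs(inner_max - 0.5 * (z - b)**2) < 1e-6
```

This is exactly the mechanism by which the ERM problem turns into a bilinear saddle point problem: each loss term is replaced by a maximization over its own dual variable.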
Robust Optimization.
The robust optimization framework (Ben-Tal et al., 2009) aims at minimizing an objective function under data uncertainty, which naturally leads to a saddle point problem, often of the following form:
1.3 Comparison with Previous Results
| Paper | $f$ smooth | $f$ s.c. | $g$ smooth | $g$ s.c. | $A$ full column rank | Other Assumptions |
|---|---|---|---|---|---|---|
| (Chen and Rockafellar, 1997) | \ | Yes | \ | Yes | No | Prox maps for $f$ and $g$ |
| (Du et al., 2017) | Yes | No | Yes | Yes | Yes | $f$ and $g$ are quadratic |
| (Wang and Xiao, 2017) | \ | No | \ | Yes | Yes | Prox maps for $f$ and $g$ |
| This paper | Yes | No | Yes | Yes | Yes | None |
There have been many attempts to analyze the primal-dual gradient method or its variants. In particular, Chen and Rockafellar (1997); Chambolle and Pock (2011); Palaniappan and Bach (2016) showed that if both $f$ and $g$ are strongly convex and have efficient proximal mappings, then the proximal primal-dual gradient method achieves a linear convergence rate.[2] In fact, even without proximal mappings, as long as both $f$ and $g$ are smooth and strongly convex, Algorithm 1 achieves a linear convergence rate. In Appendix B we give a simple proof of this fact.
[2] Chen and Rockafellar (1997); Palaniappan and Bach (2016) considered a more general formulation than Problem (1). Here we specialize to the bilinear saddle point problem.
Two recent papers show that it is possible to achieve linear convergence even without strong convexity in $f$. The key is the additional assumption that $A$ has full column rank, which helps “transfer” the strong convexity of $g$ to the primal problem. Du et al. (2017) considered the case where both $f$ and $g$ are quadratic functions, i.e., where Problem (1) has the following special form:
Note that the quadratic term in $f$ does not have to be positive definite (but the one in $g$ does), and thus strong convexity is not necessary in the primal variable. Their analysis is based on writing the gradient updates as a linear dynamical system (cf. Equation (41) in (Du et al., 2017)):
$$z_{t+1} - z^* = M (z_t - z^*), \qquad (5)$$
where $z_t = (x_t, y_t)$, $z^* = (x^*, y^*)$, and $M$ is a fixed matrix that depends on the problem data and the step sizes. It then suffices to bound the spectral norm of $M$ (which can be made strictly less than $1$) to show that $z_t$ converges to $z^*$ at a linear rate. However, it is difficult to generalize this approach to the general saddle point problem (1), since only when $f$ and $g$ are quadratic do we have the linear form (5).
Wang and Xiao (2017) considered the proximal primal-dual gradient method. They constructed a potential function (cf. Page 15 in (Wang and Xiao, 2017)) and showed that it decreases at a linear rate. However, this potential function relies heavily on the proximal mappings, so it is difficult to use this technique to analyze Algorithm 1.
In Table 1, we summarize different assumptions sufficient for linear convergence used in different papers.
1.4 Paper Organization
The rest of the paper is organized as follows. We give necessary definitions in Section 2. In Section 3, we present our main result for the primal-dual gradient method and its proof. In Section 4, we extend our analysis to the primal-dual stochastic variance reduced gradient method. In Section 5, we use some preliminary experiments to verify our theory. We conclude in Section 6 and put omitted proofs in the appendix.
Let $\|\cdot\|$ denote the Euclidean ($\ell_2$) norm of a vector, and let $\langle \cdot, \cdot \rangle$ denote the standard Euclidean inner product between two vectors. For a matrix $A$, let $\sigma_i(A)$ be its $i$-th largest singular value, and let $\sigma_{\max}(A)$ and $\sigma_{\min}(A)$ be the largest and the smallest singular values of $A$, respectively. For a differentiable function $f$, we use $\nabla f$ to denote its gradient. Denote $[n] = \{1, 2, \ldots, n\}$. Let $I_d$ be the identity matrix in $\mathbb{R}^{d \times d}$.
The smoothness and the strong convexity of a function are defined as follows:
For a differentiable function $h$, we say
$h$ is $L$-smooth if $\|\nabla h(u) - \nabla h(v)\| \le L \|u - v\|$ for all $u, v$;
$h$ is $\mu$-strongly convex if $h(u) \ge h(v) + \langle \nabla h(v), u - v \rangle + \frac{\mu}{2}\|u - v\|^2$ for all $u, v$.
We also need the definition of the conjugate function:
The conjugate of a function $g$ is defined as $g^*(v) = \sup_y \{\langle v, y \rangle - g(y)\}$.
It is well known that if $g$ is closed and convex, then $(g^*)^* = g$. If $g$ is $L_g$-smooth and $\mu$-strongly convex, its conjugate $g^*$ has the following properties: (i) $g^*$ is $(1/\mu)$-smooth and $(1/L_g)$-strongly convex; (ii) $\nabla g^*(v) = \arg\max_y \{\langle v, y \rangle - g(y)\}$.
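These properties can be sanity-checked on a quadratic example (a sketch under the assumption $g(y) = \frac{1}{2} y^\top H y$ with $H \succ 0$, for which $g^*(v) = \frac{1}{2} v^\top H^{-1} v$ and $\nabla g^*(v) = H^{-1} v$):

```python
import numpy as np

H = np.diag([1.0, 4.0])          # g(y) = 0.5 * y^T H y: 1-strongly convex, 4-smooth
def conj_grad(v):
    # nabla g*(v) = argmax_y <v, y> - g(y) = H^{-1} v for this quadratic g
    return np.linalg.solve(H, v)

v = np.array([2.0, 2.0])
y_star = conj_grad(v)
# The maximizer of <v, y> - g(y) satisfies the stationarity condition nabla g(y) = v.
assert np.allclose(H @ y_star, v)
```

Here $g^*$ is $(1/1)$-smooth and $(1/4)$-strongly convex, matching the reciprocal relationship stated above.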
3 Linear Convergence of the Primal-Dual Gradient Method
Throughout, we make the following assumptions:
$f$ is convex and $L_f$-smooth ($L_f > 0$).
$g$ is $L_g$-smooth and $\mu$-strongly convex ($\mu > 0$).
The matrix $A$ satisfies $\sigma_{\min}(A) > 0$, i.e., $A$ has full column rank.
While the first two assumptions on $f$ and $g$ are standard in the convex optimization literature, the third one is important for ensuring linear convergence of Problem (1). Note, for example, that if $A$ is the all-zero matrix, then there is no interaction between $x$ and $y$, and to solve the convex optimization problem in $x$ we need at least $\Omega(1/\sqrt{\epsilon})$ iterations (Nesterov, 2013) instead of $O(\log(1/\epsilon))$.
Denote by $(x^*, y^*)$ the optimal solution to Problem (1). For simplicity, we let and .
Recall the first-order optimality condition:
In the setting of Algorithm 1, define and . Let and . If we choose and , then we have
for some absolute constant .
In this theorem, we use a combination of the primal and dual error terms as the potential function and show that this function shrinks at a geometric rate. Note that from (6) and Fact 2.1 (ii) we have $y^* = \nabla g^*(Ax^*)$. Then we have upper bounds on $\|x_t - x^*\|$ and $\|y_t - y^*\|$, which imply that if the potential function is small then $(x_t, y_t)$ will be close to the optimal solution $(x^*, y^*)$. Therefore, a direct corollary of Theorem 3.1 is:
For any $\epsilon > 0$, after $O(\log(1/\epsilon))$ iterations, we have $\|x_t - x^*\| \le \epsilon$ and $\|y_t - y^*\| \le \epsilon$, where $O(\cdot)$ hides polynomial factors in the problem-dependent parameters.
We remark that our theorem suggests step sizes that depend on problem parameters which may be unknown. In practice, we may use a small amount of data to estimate them first, or use the adaptive tuning heuristic introduced in (Wang and Xiao, 2017).
3.1 Proof of Theorem 3.1
Now we present the proof of Theorem 3.1.
First recall the standard linear convergence guarantee of gradient descent on a smooth and strongly convex objective. See Theorem 3.12 in (Bubeck, 2015) for a proof.
Suppose $h$ is $L$-smooth and $\mu$-strongly convex, and let $w^* = \arg\min_w h(w)$. For any $w$ and any step size $0 < \eta \le 1/L$, letting $w' = w - \eta \nabla h(w)$, we have $\|w' - w^*\|^2 \le (1 - \eta\mu)\|w - w^*\|^2$.
Step 1: Bounding the Decrease of $\|x_t - x^*\|$ via a One-Step “Ghost” Algorithm.[3]
[3] $\|x_t - x^*\|$ may not decrease as $t$ increases. Here what we mean is to upper bound $\|x_{t+1} - x^*\|$ using $\|x_t - x^*\|$ and an error term.
Our technique is to consider the following one-step “ghost” algorithm for the primal variable, which corresponds to a gradient descent step for the primal problem (2). We define an auxiliary variable $\tilde x_{t+1}$: given $x_t$, let
$$\tilde x_{t+1} = x_t - \eta \nabla P(x_t), \qquad (7)$$
where $P(x) = f(x) + g^*(Ax)$ is the objective of the primal problem (2). Note that $\tilde x_{t+1}$ is defined only for the purpose of the proof. Our main idea is to use this “ghost” algorithm as a reference and bound the distance between the primal-dual gradient iterate $x_{t+1}$ and this “ghost” variable $\tilde x_{t+1}$. We first prove that with this “ghost” algorithm, the distance between the primal variable and the optimum decreases at a geometric rate.
If $\eta \le 1/\left(L_f + \sigma_{\max}(A)^2/\mu\right)$, then $\|\tilde x_{t+1} - x^*\|^2 \le \left(1 - \eta\,\sigma_{\min}(A)^2/L_g\right)\|x_t - x^*\|^2$.
Since (7) is a gradient descent step for the primal problem (2), whose objective is $P(x) = f(x) + g^*(Ax)$, it suffices to show that $P$ is smooth and strongly convex in order to apply Lemma 3.1. Note that $g^*$ is $(1/\mu)$-smooth and $(1/L_g)$-strongly convex according to Fact 2.1.
We have $\nabla P(x) = \nabla f(x) + A^\top \nabla g^*(Ax)$. Then for any $x, x'$ we have
$$\|\nabla P(x) - \nabla P(x')\| \le \left(L_f + \sigma_{\max}(A)^2/\mu\right)\|x - x'\|,$$
where we have used the $L_f$-smoothness of $f$, the $(1/\mu)$-smoothness of $g^*$, and the bound $\|A\|_2 = \sigma_{\max}(A)$. Therefore $P$ is $\left(L_f + \sigma_{\max}(A)^2/\mu\right)$-smooth.
On the other hand, for any $x, x'$ we have
$$P(x') \ge P(x) + \langle \nabla P(x), x' - x \rangle + \frac{\sigma_{\min}(A)^2}{2 L_g}\|x' - x\|^2,$$
where we have used the convexity of $f$, the $(1/L_g)$-strong convexity of $g^*$, and that $A$ has full column rank. Therefore $P$ is $\left(\sigma_{\min}(A)^2/L_g\right)$-strongly convex.
With the smoothness and the strong convexity of $P$, the proof is completed by applying Lemma 3.1. ∎
Proposition 3.1 suggests that if we use the “ghost” algorithm (7), we have the desired linear convergence property. The following proposition gives an upper bound on $\|x_{t+1} - x^*\|$ by bounding the distance between $x_{t+1}$ and $\tilde x_{t+1}$.
If $\eta \le 1/\left(L_f + \sigma_{\max}(A)^2/\mu\right)$, then $\|x_{t+1} - x^*\| \le \sqrt{1 - \eta\,\sigma_{\min}(A)^2/L_g}\,\|x_t - x^*\| + \eta\,\sigma_{\max}(A)\,\|\nabla g^*(Ax_t) - y_t\|$.
We have $x_{t+1} - \tilde x_{t+1} = \eta\,A^\top\left(\nabla g^*(Ax_t) - y_t\right)$, which implies
$$\|x_{t+1} - \tilde x_{t+1}\| \le \eta\,\sigma_{\max}(A)\,\|\nabla g^*(Ax_t) - y_t\|.$$
Then the proposition follows by applying the triangle inequality and Proposition 3.1. ∎
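The identity at the heart of this proof can be verified numerically (a sketch that instantiates $f(x) = \frac{1}{2}\|x\|^2$ and $g(y) = \frac{1}{2}\|y\|^2$, so that $\nabla f$ and $\nabla g^*$ are both the identity map; these choices are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 2))
x, y = rng.standard_normal(2), rng.standard_normal(3)
eta = 0.1
grad_f = lambda x: x               # f(x) = 0.5||x||^2 (toy choice)
grad_g_conj = lambda v: v          # g(y) = 0.5||y||^2, so nabla g*(v) = v

x_next  = x - eta * (grad_f(x) + A.T @ y)                    # Algorithm 1 primal step
x_ghost = x - eta * (grad_f(x) + A.T @ grad_g_conj(A @ x))   # ghost gradient step on P
# identity: x_next - x_ghost = eta * A^T (nabla g*(A x) - y)
assert np.allclose(x_next - x_ghost, eta * A.T @ (grad_g_conj(A @ x) - y))
```

The primal-dual iterate deviates from the ghost iterate exactly by the dual error term, which is what Step 2 then controls.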
Step 2: Bounding the Decrease of $\|\nabla g^*(Ax_t) - y_t\|$.
One may want to show the decrease of $\|y_t - y^*\|$ similarly using a “ghost” update for the dual variable. However, the objective function in the dual problem might be non-smooth, which means we cannot obtain a result similar to Proposition 3.1. Instead, we show that the error term $\|\nabla g^*(Ax_t) - y_t\|$ decreases geometrically up to an additive error.
Using the gradient update formula of the primal variable, we have
If , then
For fixed $x_t$, the dual update rule is a gradient descent step for the objective function $y \mapsto g(y) - \langle y, Ax_t \rangle$, which is also $L_g$-smooth and $\mu$-strongly convex. By the first-order optimality condition, its minimizer $\hat y_t$ satisfies $\nabla g(\hat y_t) = Ax_t$, i.e., $\hat y_t = \nabla g^*(Ax_t)$. Then from Lemma 3.1 we know that $\|y_{t+1} - \nabla g^*(Ax_t)\|^2 \le (1 - \sigma\mu)\|y_t - \nabla g^*(Ax_t)\|^2$.
Step 3: Putting Things Together.
To prove the convergence of the sequences $\{x_t\}$ and $\{y_t\}$ to the optimum, we consider a linear combination of the two error terms with a free parameter to be determined. Combining (11) and (12), with some routine calculations, we can show that the choices of the step sizes and the combination parameter given in Theorem 3.1 ensure that this potential function decreases by a constant factor at every iteration, as desired. We give the remaining details in Appendix A.1.
4 Extension to Primal-Dual SVRG
In this section we consider the case where the saddle point problem (1) admits a finite-sum structure:[4]
[4] For ease of presentation we assume $f$, $g$ and $A$ can be split into $n$ terms. It is not hard to generalize our analysis to the case where $f$, $g$ and $A$ are split into different numbers of terms.
where $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$, $g(y) = \frac{1}{n}\sum_{i=1}^n g_i(y)$ and $A = \frac{1}{n}\sum_{i=1}^n A_i$. Optimization problems with a finite-sum structure are ubiquitous in machine learning, because loss functions can often be written as a sum of individual loss terms corresponding to individual observations.
In this section, we make the following assumptions:
Each $f_i$ is $L_f$-smooth ($L_f > 0$), and $f$ is convex.
Each $g_i$ is $L_g$-smooth, and $g$ is $\mu$-strongly convex ($\mu > 0$).
Each $A_i$ satisfies $\sigma_{\max}(A_i) \le \bar\sigma$, and $A$ has full column rank.
Given the finite-sum structure (13), we denote by $G_i$ the individual gradient computed from the $i$-th component, and by $G = \frac{1}{n}\sum_{i=1}^n G_i$ the full gradient. A naive computation of $G_i$ or $G$ takes $O(d^2)$ or $O(n d^2)$ time, respectively, when the $A_i$ are dense. However, in many applications like policy evaluation (Du et al., 2017) and empirical risk minimization, each $A_i$ is given as the outer product of two vectors (i.e., a rank-$1$ matrix), which makes $A_i x$ and $A_i^\top y$ computable in only $O(d)$ time. In this case, computing an individual gradient takes $O(d)$ time while computing the full gradient takes $O(nd)$ time.
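The rank-1 speedup can be sketched in a few lines (illustrative NumPy code; `u` and `v` stand for the factors of a hypothetical $A_i = uv^\top$):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000
u, v, x = rng.standard_normal(d), rng.standard_normal(d), rng.standard_normal(d)

# A_i = u v^T is rank-1: materializing it costs O(d^2) time and memory,
# but A_i @ x = u * (v @ x) needs only O(d) time and no d-by-d matrix.
fast = u * (v @ x)
assert np.allclose(fast, np.outer(u, v) @ x)
```

The associativity of the product, $(uv^\top)x = u(v^\top x)$, is all that is being exploited here.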
We adapt the stochastic variance reduced gradient (SVRG) method (Johnson and Zhang, 2013) to solve Problem (13). The algorithm uses two layers of loops. In an outer loop, the algorithm first computes a full gradient at a “snapshot” point, and then it executes $m$ inner loops, where $m$ is a parameter to be chosen. In each inner loop, the algorithm randomly samples an index from $[n]$ and updates the current iterate using a variance-reduced stochastic gradient:
Here, the first term is the stochastic gradient at the current iterate computed using the random index, and the correction term based on the snapshot point is used to reduce the variance while keeping the update an unbiased estimate of the full gradient. The full details of the algorithm are provided in Algorithm 2. For clarity, we denote by $(\tilde x_s, \tilde y_s)$ the snapshot point in the $s$-th epoch (outer loop), and denote by $(x_s^t, y_s^t)$, $t = 0, \ldots, m$, all the intermediate iterates within this epoch.
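The unbiasedness of the variance-reduced gradient can be checked on a toy finite sum (a sketch; the linear component gradients are an arbitrary stand-in for the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 3
As = rng.standard_normal((n, d, d))          # toy component matrices
grad_i = lambda i, x: As[i] @ x              # gradient of the i-th component
full_grad = lambda x: As.mean(axis=0) @ x    # full (averaged) gradient

snap = rng.standard_normal(d)                # snapshot point
mu = full_grad(snap)                         # full gradient at the snapshot
x = rng.standard_normal(d)                   # current iterate

# SVRG-style estimate for each index i: grad_i(x) - grad_i(snap) + mu.
vr = np.array([grad_i(i, x) - grad_i(i, snap) + mu for i in range(n)])
# Averaging over the uniformly random index recovers the full gradient at x.
assert np.allclose(vr.mean(axis=0), full_grad(x))
```

The correction term cancels in expectation, so the estimator is unbiased while its variance shrinks as the iterate and the snapshot approach the optimum.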
The following theorem establishes the linear convergence guarantee of Algorithm 2.
Since computing a full gradient takes $O(nd)$ time and each inner loop takes $O(d)$ time, each epoch takes $O((n + m)d)$ time in total. Therefore, the total running time of Algorithm 2 to reach an $\epsilon$-close solution is $O((n + \kappa) d \log(1/\epsilon))$, which is the desired running time of SVRG (note that $\kappa$ does not depend on $n$).
5 Preliminary Empirical Evaluation
We perform preliminary empirical evaluation for the following purposes: (i) to verify that both the primal-dual gradient method (Algorithm 1) and the primal-dual SVRG method (Algorithm 2) can indeed achieve linear convergence, (ii) to investigate the convergence rates of Algorithms 1 and 2, in comparison with their primal-only counterparts (i.e., the usual gradient descent and SVRG algorithms for the primal problem), and (iii) to compare the convergence rates of Algorithms 1 and 2.
We consider the linear regression problem with smoothed-$\ell_1$ regularization, formulated as
where $A$ is the data matrix, $b$ is the response vector, and $r_\alpha$ is the smoothed $\ell_1$ regularization (Schmidt et al., 2007) (when $\alpha$ is large, $r_\alpha(x) \approx \|x\|_1$). Note that $r_\alpha$ is smooth but not strongly convex, and it does not have a closed-form proximal mapping. As discussed in Section 1.2, Problem (15) admits a saddle point formulation: