1 Introduction
We revisit convex-concave saddle point problems of the form
$$\min_{x \in \mathbb{R}^{d_1}} \max_{y \in \mathbb{R}^{d_2}} L(x, y) = f(x) + y^\top A x - g(y), \tag{1}$$
where both $f$ and $g$ are convex functions and $A \in \mathbb{R}^{d_2 \times d_1}$ is a coupling matrix. This formulation has a wide range of applications, including supervised learning (Zhang and Lin, 2015), unsupervised learning (Xu et al., 2005; Bach et al., 2008), reinforcement learning (Du et al., 2017), robust optimization (Ben-Tal et al., 2009), PID control (Hast et al., 2013), etc. See Section 1.2 for some concrete examples.
When the problem dimension is large, the most widely used and sometimes the only scalable methods to solve Problem (1) are first-order methods. Arguably the simplest first-order algorithm is the primal-dual gradient method (Algorithm 1), a natural generalization of the gradient descent algorithm, which simultaneously performs gradient descent on the primal variable $x$ and gradient ascent on the dual variable $y$.
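The simultaneous descent/ascent update described above can be sketched in a few lines. This is a minimal illustration rather than the paper's Algorithm 1 verbatim: the function names, the step sizes `eta1` and `eta2`, and the toy instance (a linear $f$ and $g(y) = \frac{1}{2}\|y\|^2$) are all assumptions made for the example.

```python
def matvec(A, x):
    """Return A x for a dense matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def rmatvec(A, y):
    """Return A^T y."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]


def primal_dual_gradient(grad_f, grad_g, A, x0, y0, eta1, eta2, num_iters):
    """Gradient descent on x and gradient ascent on y for
    L(x, y) = f(x) + y^T A x - g(y), performed simultaneously."""
    x, y = list(x0), list(y0)
    for _ in range(num_iters):
        # Both partial gradients are evaluated at the current pair (x, y).
        grad_x = [u + v for u, v in zip(grad_f(x), rmatvec(A, y))]  # grad f(x) + A^T y
        grad_y = [u - v for u, v in zip(matvec(A, x), grad_g(y))]   # A x - grad g(y)
        x = [xi - eta1 * gi for xi, gi in zip(x, grad_x)]
        y = [yi + eta2 * gi for yi, gi in zip(y, grad_y)]
    return x, y


# Toy instance (hypothetical): f(x) = c^T x is convex but not strongly convex,
# g(y) = 0.5 * ||y||^2 is smooth and strongly convex, A has full column rank.
A = [[2.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
c = [1.0, 1.0]
x_hat, y_hat = primal_dual_gradient(
    grad_f=lambda x: list(c),
    grad_g=lambda y: list(y),
    A=A, x0=[0.0, 0.0], y0=[0.0, 0.0, 0.0],
    eta1=0.05, eta2=0.2, num_iters=2000,
)
# Closed-form saddle point of this instance: x* = -(A^T A)^{-1} c, y* = A x*.
x_star, y_star = [-0.25, -1.0], [-0.5, -1.0, 0.0]
```

On this instance the iterates approach the saddle point even though $f$ has no curvature at all, which is the regime the paper studies; the step sizes here were picked by hand and are not the ones prescribed by the theory.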
There has been extensive research on analyzing the convergence rate of Algorithm 1 and its variants. It is known that if both $f$ and $g$ are strongly convex and admit efficient proximal mappings, then the proximal primal-dual gradient method converges to the optimal solution at a linear rate (Bauschke and Combettes, 2011; Palaniappan and Bach, 2016; Chen and Rockafellar, 1997), i.e., it only requires $O(\log(1/\epsilon))$ iterations to obtain a solution that is $\epsilon$-close to the optimum.
In many applications, however, we only have strong convexity in $g$ but no strong convexity in $f$. This motivates the following question:
Does the primal-dual gradient method converge linearly to the optimal solution if $f$ is not strongly convex?
Intuitively, a linear convergence rate is plausible. Consider the corresponding primal problem of (1):
$$\min_{x \in \mathbb{R}^{d_1}} P(x) = g^*(Ax) + f(x), \tag{2}$$
where $g^*$ is the conjugate function of $g$. Because $g$ is smooth and strongly convex (and hence so is $g^*$), as long as $A$ has full column rank, Problem (2) has a smooth and strongly convex objective and thus vanilla gradient descent achieves linear convergence. Therefore, one might expect a linearly convergent first-order algorithm for Problem (1) as well. However, whether the vanilla primal-dual gradient method (Algorithm 1) has linear convergence turns out to be a nontrivial question.
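The step from the saddle point problem to its primal problem follows directly from the definition of the conjugate. The quadratic choice of $g$ in the second display is only an illustrative assumption, not an example taken from the paper.

```latex
\[
\max_{y}\,\bigl\{ y^\top A x - g(y) \bigr\} = g^*(Ax)
\quad\Longrightarrow\quad
\min_{x}\max_{y}\,\bigl\{ f(x) + y^\top A x - g(y) \bigr\}
= \min_{x}\,\bigl\{ f(x) + g^*(Ax) \bigr\}.
\]
\[
g(y) = \tfrac{1}{2}\|y\|^2
\;\Longrightarrow\;
g^*(z) = \tfrac{1}{2}\|z\|^2,
\qquad
P(x) = f(x) + \tfrac{1}{2}\|Ax\|^2,
\qquad
\nabla^2\bigl(\tfrac{1}{2}\|Ax\|^2\bigr) = A^\top A \succeq \sigma_{\min}(A)^2 I .
\]
```

In this special case the quadratic term is strongly convex exactly when $\sigma_{\min}(A) > 0$, i.e., when $A$ has full column rank, which is why that assumption can compensate for a non-strongly-convex $f$.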
Two recent results verified this conceptual experiment with additional assumptions: Du et al. (2017) required both $f$ and $g$ to be quadratic functions, and Wang and Xiao (2017) required both $f$ and $g$ to have efficient proximal mappings and used a proximal primal-dual gradient method. In this paper, we give an affirmative answer to this question with minimal assumptions. Our main contributions are summarized below.
1.1 Our Contributions
Linear Convergence of the Primal-Dual Gradient Method.
We show that as long as $f$ and $g$ are smooth, $f$ is convex, $g$ is strongly convex and the coupling matrix $A$ has full column rank, Algorithm 1 converges to the optimal solution at a linear rate. See Section 3 for a precise statement of our result. This result significantly generalizes previous ones which rely on stronger assumptions. Note that all the assumptions are necessary for linear convergence: without any one of them, the primal problem (2) requires at least $\Omega(1/\sqrt{\epsilon})$ iterations to obtain an $\epsilon$-close solution (Nesterov, 2013), so there is no hope of linear convergence for Problem (1).
New Analysis Technique.
To analyze the convergence of an optimization algorithm, a common approach is to construct a potential function (also called a Lyapunov function in the literature) which decreases after each iteration. For example, for the primal problem (2), a natural potential function is $\|x_t - x^*\|$, the distance between the current iterate and the optimal solution. However, for the primal-dual gradient method, it is difficult to show that similar potential functions like $\|x_t - x^*\| + \|y_t - y^*\|$ decrease because the two sequences, $\{x_t\}$ and $\{y_t\}$, are related to each other.
In this paper, we develop a novel method for analyzing the convergence rate of the primal-dual gradient method. The key idea is to consider a “ghost” sequence. For example, in our setting, the “ghost” sequence comes from a gradient descent step for Problem (2). Then we relate the sequence generated by Algorithm 1 to this “ghost” sequence and show they are close in a certain way. See Section 3 for details. We believe this technique is applicable to other problems where we need to analyze multiple sequences.
Extension to Primal-Dual Stochastic Variance Reduced Gradient Method.
Many optimization problems in machine learning have a finite-sum structure, and randomized algorithms have been proposed to exploit this structure and to speed up the convergence. There has been extensive research in recent years on developing more efficient stochastic algorithms in this setting (Le Roux et al., 2012; Johnson and Zhang, 2013; Defazio et al., 2014; Xiao and Zhang, 2014; Shalev-Shwartz and Zhang, 2013; Richtárik and Takáč, 2014; Lin et al., 2015; Zhang and Lin, 2015; Allen-Zhu, 2017). Among them, the stochastic variance reduced gradient (SVRG) algorithm (Johnson and Zhang, 2013) is a popular one with $O((n + \kappa)\, d \log(1/\epsilon))$ computational complexity for smooth and strongly convex objectives, where $n$ is the number of component functions, $d$ is the dimension of the variable, and $\kappa$ is a condition number that only depends on problem-dependent parameters like smoothness and strong convexity but not $n$. Variants of SVRG for saddle point problems have been recently studied by Palaniappan and Bach (2016); Wang and Xiao (2017); Du et al. (2017) and can achieve similar running time ($\kappa$ may be different in the primal and the primal-dual settings). However, these results all require additional assumptions. In this paper, we use our analysis technique developed for Algorithm 1 to show that the primal-dual SVRG method also admits this type of computational complexity.
1.2 Motivating Examples
In this subsection we list some machine learning applications that naturally lead to convex-concave saddle point problems.
Reinforcement Learning.
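For the policy evaluation task in reinforcement learning, we have data $\{(s_t, r_t, s_{t+1})\}_{t=1}^{n}$ generated by a policy $\pi$, where $s_t$ is the state at the $t$-th time step, $r_t$ is the reward and $s_{t+1}$ is the state at the $(t+1)$-th step. We also have a discount factor $\gamma \in (0, 1)$ and a feature function $\phi(\cdot)$ which maps a state to a feature vector. Our goal is to learn a linear value function $V^{\pi}(s) \approx \phi(s)^\top x$ which represents the long term expected reward starting from state $s$ using the policy $\pi$. A common way to estimate $x$ is to minimize the empirical mean squared projected Bellman error (MSPBE):
$$\min_{x} \frac{1}{2}\|Ax - b\|_{C^{-1}}^2, \tag{3}$$
where $A = \frac{1}{n}\sum_{t=1}^{n} \phi(s_t)\left(\phi(s_t) - \gamma\phi(s_{t+1})\right)^\top$, $b = \frac{1}{n}\sum_{t=1}^{n} r_t\,\phi(s_t)$ and $C = \frac{1}{n}\sum_{t=1}^{n} \phi(s_t)\phi(s_t)^\top$, with $\|v\|_M^2 := v^\top M v$. Note that directly using gradient descent to solve problem (3) is expensive because we need to invert a matrix $C$. Du et al. (2017) considered the equivalent saddle point formulation:
$$\min_{x} \max_{y} L(x, y) = -y^\top A x - \frac{1}{2}\|y\|_C^2 + b^\top y.$$
The gradient of $L$ can be computed more efficiently than the original formulation (3), and has a finite-sum structure.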
Empirical Risk Minimization.
Consider the classical supervised learning problem of learning a linear predictor $x \in \mathbb{R}^d$ given $n$ data points $(a_i, b_i) \in \mathbb{R}^d \times \mathbb{R}$. Denote by $A \in \mathbb{R}^{n \times d}$ the data matrix whose $i$-th row is $a_i^\top$. Then the empirical risk minimization (ERM) problem amounts to solving
$$\min_{x \in \mathbb{R}^d} \ell(Ax) + f(x),$$
where $\ell$ is induced by some loss function and $f$ is a regularizer; both $\ell$ and $f$ are convex functions. Equivalently, we can solve the dual problem $\max_{y} \{-\ell^*(y) - f^*(-A^\top y)\}$ or the saddle point problem $\min_{x} \max_{y} \{y^\top A x - \ell^*(y) + f(x)\}$. The saddle point formulation is favorable in many scenarios, e.g., when such a formulation admits a finite-sum structure (Zhang and Lin, 2015; Wang and Xiao, 2017), reduces communication complexity in the distributed setting (Xiao et al., 2017) or exploits sparsity structure (Lei et al., 2017).
Robust Optimization.
The robust optimization framework (Ben-Tal et al., 2009) aims at minimizing an objective function with uncertain data, which naturally leads to a saddle point problem, often with the following form:
$$\min_{x} \max_{y} \; \mathbb{E}_{\xi \sim P(y)}\left[f(x, \xi)\right], \tag{4}$$
where $f$ is some loss function we want to minimize and the distribution $P(y)$ of the data $\xi$ is parametrized by $y$. For certain special cases (Liu et al., 2017), Problem (4) has the bilinear form as in (1).
1.3 Comparison with Previous Results
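Table 1: Assumptions sufficient for linear convergence used in different papers ("s.c." stands for strongly convex).
| Paper | $f$ smooth | $f$ s.c. | $g$ smooth | $g$ s.c. | $A$ full column rank | Other Assumptions |
|---|---|---|---|---|---|---|
| (Chen and Rockafellar, 1997) | \ | Yes | \ | Yes | No | Prox maps for $f$ and $g$ |
| (Du et al., 2017) | Yes | No | Yes | Yes | Yes | $f$ and $g$ are quadratic |
| (Wang and Xiao, 2017) | \ | No | \ | Yes | Yes | Prox maps for $f$ and $g$ |
| Folklore | Yes | Yes | Yes | Yes | No | No |
| This Paper | Yes | No | Yes | Yes | Yes | No |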
There have been many attempts to analyze the primal-dual gradient method or its variants. In particular, Chen and Rockafellar (1997); Chambolle and Pock (2011); Palaniappan and Bach (2016) showed that if both $f$ and $g$ are strongly convex and have efficient proximal mappings, then the proximal primal-dual gradient method achieves a linear convergence rate (Chen and Rockafellar (1997) and Palaniappan and Bach (2016) considered a more general formulation than Problem (1); here we specialize to the bilinear saddle point problem). In fact, even without proximal mappings, as long as both $f$ and $g$ are smooth and strongly convex, Algorithm 1 achieves a linear convergence rate. In Appendix B we give a simple proof of this fact.
Two recent papers show that it is possible to achieve linear convergence even without strong convexity in $f$. The key is the additional assumption that $A$ has full column rank, which helps “transfer” $g$'s strong convexity to the primal variable $x$. Du et al. (2017) considered the case when both $f$ and $g$ are quadratic functions, i.e., when Problem (1) is a quadratic saddle point problem.
Note that the Hessian of $f$ does not have to be positive definite (but that of $g$ has to be), and thus strong convexity is not necessary in the primal variable. Their analysis is based on writing the gradient updates as a linear dynamic system (cf. Equation (41) in (Du et al., 2017)):
$$\begin{pmatrix} x_{t+1} - x^* \\ y_{t+1} - y^* \end{pmatrix} = G \begin{pmatrix} x_t - x^* \\ y_t - y^* \end{pmatrix}, \tag{5}$$
where $G$ is some fixed matrix that depends on the coefficients of the quadratic problem and the step sizes. Next, it suffices to bound the spectral norm of $G$ (which can be made strictly less than $1$) to show that $(x_t, y_t)$ converges to $(x^*, y^*)$ at a linear rate. However, it is difficult to generalize this approach to the general saddle point problem (1) since only when $f$ and $g$ are quadratic do we have the linear form (5).
Wang and Xiao (2017) considered the proximal primal-dual gradient method. They construct a potential function (cf. Page 15 in (Wang and Xiao, 2017)) and show it decreases at a linear rate. However, this potential function heavily relies on the proximal mappings so it is difficult to use this technique to analyze Algorithm 1.
In Table 1, we summarize different assumptions sufficient for linear convergence used in different papers.
1.4 Paper Organization
The rest of the paper is organized as follows. We give necessary definitions in Section 2. In Section 3, we present our main result for the primal-dual gradient method and its proof. In Section 4, we extend our analysis to the primal-dual stochastic variance reduced gradient method. In Section 5, we use some preliminary experiments to verify our theory. We conclude in Section 6 and put omitted proofs in the appendix.
2 Preliminaries
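Let $\|\cdot\|$ denote the Euclidean ($\ell_2$) norm of a vector, and let $\langle \cdot, \cdot \rangle$ denote the standard Euclidean inner product between two vectors. For a matrix $A \in \mathbb{R}^{m \times n}$, let $\sigma_i(A)$ be its $i$-th largest singular value, and let $\sigma_{\max}(A) := \sigma_1(A)$ and $\sigma_{\min}(A) := \sigma_{\min\{m, n\}}(A)$ be the largest and the smallest singular values of $A$, respectively. For a function $f$, we use $\nabla f$ to denote its gradient. Denote $[n] := \{1, 2, \ldots, n\}$. Let $I_d$ be the identity matrix in $\mathbb{R}^{d \times d}$.
The smoothness and the strong convexity of a function are defined as follows: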
Definition 2.1.
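For a differentiable function $\phi: \mathbb{R}^d \to \mathbb{R}$, we say
- $\phi$ is $\beta$-smooth if $\|\nabla\phi(u) - \nabla\phi(v)\| \le \beta\|u - v\|$ for all $u, v \in \mathbb{R}^d$;
- $\phi$ is $\alpha$-strongly convex if $\phi(v) \ge \phi(u) + \langle \nabla\phi(u), v - u\rangle + \frac{\alpha}{2}\|u - v\|^2$ for all $u, v \in \mathbb{R}^d$.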
We also need the definition of conjugate function:
Definition 2.2.
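The conjugate of a function $\phi: \mathbb{R}^d \to \mathbb{R}$ is defined as
$$\phi^*(y) := \sup_{x \in \mathbb{R}^d} \left\{\langle x, y\rangle - \phi(x)\right\}.$$
It is well-known that if $\phi$ is closed and convex, then $\phi^{**} = \phi$. If $\phi$ is $\beta$-smooth and $\alpha$-strongly convex, its conjugate has the following properties:
Fact 2.1.
(i) $\phi^*$ is $\frac{1}{\alpha}$-smooth and $\frac{1}{\beta}$-strongly convex; (ii) $\nabla\phi^* = (\nabla\phi)^{-1}$, i.e., $\nabla\phi^*(\nabla\phi(x)) = x$ for all $x$.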
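A quadratic function gives a quick sanity check of these conjugate properties. The matrix $Q$ below is a hypothetical symmetric matrix introduced only for this illustration.

```latex
\[
\phi(x) = \tfrac{1}{2}\, x^\top Q x, \quad \alpha I \preceq Q \preceq \beta I
\;\;\Longrightarrow\;\;
\phi^*(y) = \sup_{x}\bigl\{ \langle x, y\rangle - \tfrac{1}{2} x^\top Q x \bigr\}
= \tfrac{1}{2}\, y^\top Q^{-1} y ,
\]
\[
\nabla\phi(x) = Qx, \qquad
\nabla\phi^*(y) = Q^{-1} y = (\nabla\phi)^{-1}(y), \qquad
\tfrac{1}{\beta} I \preceq \nabla^2\phi^* = Q^{-1} \preceq \tfrac{1}{\alpha} I .
\]
```

So the smoothness and strong convexity parameters swap roles and are inverted under conjugation: a $\beta$-smooth, $\alpha$-strongly convex quadratic has a $\frac{1}{\alpha}$-smooth, $\frac{1}{\beta}$-strongly convex conjugate.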
3 Linear Convergence of the Primal-Dual Gradient Method
In this section we show the linear convergence of Algorithm 1 on Problem (1) under the following assumptions:
Assumption 3.1.
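$f$ is convex and $\rho$-smooth ($\rho \ge 0$).
Assumption 3.2.
$g$ is $\beta$-smooth and $\alpha$-strongly convex ($\beta \ge \alpha > 0$).
Assumption 3.3.
The matrix $A \in \mathbb{R}^{d_2 \times d_1}$ satisfies $\mathrm{rank}(A) = d_1$.
While the first two assumptions on $f$ and $g$ are standard in the convex optimization literature, the third one is important for ensuring linear convergence on Problem (1). Note, for example, that if $A$ is the all-zero matrix, then there is no interaction between $x$ and $y$, and to solve the convex optimization problem on $x$ we need at least $\Omega(1/\sqrt{\epsilon})$ iterations (Nesterov, 2013) instead of $O(\log(1/\epsilon))$.
Denote by $(x^*, y^*)$ the optimal solution to Problem (1). For simplicity, we let $\sigma_{\max} := \sigma_{\max}(A)$ and $\sigma_{\min} := \sigma_{\min}(A)$.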
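Recall the first-order optimality condition:
$$\nabla_x L(x^*, y^*) = \nabla f(x^*) + A^\top y^* = 0, \qquad \nabla_y L(x^*, y^*) = A x^* - \nabla g(y^*) = 0. \tag{6}$$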
Theorem 3.1.
In the setting of Algorithm 1, define and . Let and . If we choose and , then we have
for some absolute constant .
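In this theorem, we use $a_t + \lambda b_t$ as the potential function (with $a_t = \|x_t - x^*\|$, $b_t = \|y_t - \nabla g^*(A x_t)\|$ and the weight $\lambda > 0$ from the theorem) and show that this function shrinks at a geometric rate. Note that from (6) and Fact 2.1 (ii) we have $y^* = \nabla g^*(A x^*)$. Then we have the bounds $\|x_t - x^*\| = a_t$ and $\|y_t - y^*\| \le \|y_t - \nabla g^*(A x_t)\| + \|\nabla g^*(A x_t) - \nabla g^*(A x^*)\| \le b_t + \frac{\sigma_{\max}}{\alpha} a_t$, which imply that if $a_t + \lambda b_t$ is small then $(x_t, y_t)$ will be close to the optimal solution $(x^*, y^*)$. Therefore a direct corollary of Theorem 3.1 is: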
Corollary 3.1.
For any , after iterations, we have and , where hides polynomial factors in and .
We remark that our theorem suggests that step sizes depend on problem parameters which may be unknown. In practice, we may try to use a small amount of data to estimate them first or use the adaptive tuning heuristic introduced in (Wang and Xiao, 2017).
3.1 Proof of Theorem 3.1
Now we present the proof of Theorem 3.1.
First recall the standard linear convergence guarantee of gradient descent on a smooth and strongly convex objective. See Theorem 3.12 in (Bubeck, 2015) for a proof.
Lemma 3.1.
Suppose is smooth and strongly convex, and let . For any , , letting , we have
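The exact constant of the lemma is not reproduced in this copy, so the following sketch only checks a standard special case: for a quadratic objective with curvatures between $\alpha$ and $\beta$ and step size $\eta = 2/(\alpha + \beta)$, one gradient step contracts the distance to the minimizer by at least $(\beta - \alpha)/(\beta + \alpha)$. The diagonal quadratic and the starting point are assumptions chosen for illustration.

```python
import math


def gd_step(grad, x, eta):
    """One gradient descent step x - eta * grad(x)."""
    return [xi - eta * gi for xi, gi in zip(x, grad(x))]


def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))


# phi(x) = 0.5 * sum_i q_i * x_i^2 has minimizer x* = 0,
# is beta-smooth with beta = max(q) and alpha-strongly convex with alpha = min(q).
q = [1.0, 4.0, 10.0]
alpha, beta = min(q), max(q)
eta = 2.0 / (alpha + beta)


def grad_phi(x):
    return [qi * xi for qi, xi in zip(q, x)]


x = [3.0, -2.0, 1.0]
x_next = gd_step(grad_phi, x, eta)

ratio = norm(x_next) / norm(x)            # observed one-step contraction
bound = (beta - alpha) / (beta + alpha)   # contraction factor for quadratics
```

Each coordinate is multiplied by $1 - \eta q_i$, whose magnitude is at most `bound`, so `ratio` never exceeds `bound` for this family of objectives.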
Step 1: Bounding the Decrease of $a_t$ via a One-Step “Ghost” Algorithm. (Here $a_t = \|x_t - x^*\|$ may not decrease as $t$ increases; what we mean is to upper bound $a_{t+1}$ using $a_t$ and an error term.)
Our technique is to consider the following one-step “ghost” algorithm for the primal variable, which corresponds to a gradient descent step for the primal problem (2). We define an auxiliary variable $\tilde{x}_{t+1}$: given $x_t$, let
$$\tilde{x}_{t+1} = x_t - \eta_1\left(\nabla f(x_t) + A^\top \tilde{y}_t\right), \tag{7}$$
where $\tilde{y}_t := \nabla g^*(A x_t)$ and $\eta_1$ is the primal step size, so that the right-hand side equals $x_t - \eta_1 \nabla P(x_t)$. Note that $\tilde{x}_{t+1}$ is defined only for the purpose of the proof. Our main idea is to use this “ghost” algorithm as a reference and bound the distance between the primal-dual gradient iterate $x_{t+1}$ and this “ghost” variable $\tilde{x}_{t+1}$. We first prove that with this “ghost” algorithm, the distance between the primal variable and the optimum decreases at a geometric rate.
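To make the comparison between the actual primal step and the “ghost” step concrete, the sketch below computes both from the same point and checks that their gap is controlled by how far the dual iterate is from $\nabla g^*(A x_t)$. The instance ($f(x) = c^\top x$, $g(y) = \frac{1}{2}\|y\|^2$, so that $\nabla g^*(z) = z$), the step size and the test points are assumptions for illustration, and the primal update $x_{t+1} = x_t - \eta_1(\nabla f(x_t) + A^\top y_t)$ is the plain gradient step assumed for Algorithm 1.

```python
import math


def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def rmatvec(A, y):
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]


def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))


A = [[2.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # singular values 2 and 1
sigma_max = 2.0
c = [1.0, 1.0]                            # f(x) = c^T x, so grad f(x) = c
eta1 = 0.05
x_t = [0.3, -0.7]
y_t = [0.1, 0.4, -0.2]

# With g(y) = 0.5 * ||y||^2 we have grad g^*(z) = z, hence y_tilde = A x_t.
y_tilde = matvec(A, x_t)

# Actual primal step uses the current dual iterate y_t ...
x_next = [xi - eta1 * (ci + gi) for xi, ci, gi in zip(x_t, c, rmatvec(A, y_t))]
# ... while the ghost step uses the best-response dual point y_tilde.
x_ghost = [xi - eta1 * (ci + gi) for xi, ci, gi in zip(x_t, c, rmatvec(A, y_tilde))]

gap = norm([u - v for u, v in zip(x_next, x_ghost)])
b_t = norm([u - v for u, v in zip(y_t, y_tilde)])
```

Since the two steps differ only through $A^\top(y_t - \tilde{y}_t)$, the gap is at most $\eta_1 \sigma_{\max} b_t$, which is the kind of error term the analysis has to absorb.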
Proposition 3.1.
If , then
Proof.
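Since (7) is a gradient descent step for the primal problem (2) whose objective is $P(x) = g^*(Ax) + f(x)$, it suffices to show that $P$ is smooth and strongly convex in order to apply Lemma 3.1. Note that $g^*$ is $\frac{1}{\alpha}$-smooth and $\frac{1}{\beta}$-strongly convex according to Fact 2.1.
We have $\nabla P(x) = \nabla f(x) + A^\top \nabla g^*(Ax)$. Then for any $x, x'$ we have
$$\|\nabla P(x) - \nabla P(x')\| \le \|\nabla f(x) - \nabla f(x')\| + \|A^\top(\nabla g^*(Ax) - \nabla g^*(Ax'))\| \le \rho\|x - x'\| + \frac{\sigma_{\max}}{\alpha}\|Ax - Ax'\| \le \left(\rho + \frac{\sigma_{\max}^2}{\alpha}\right)\|x - x'\|,$$
where we have used the $\rho$-smoothness of $f$, the $\frac{1}{\alpha}$-smoothness of $g^*$, and the bound $\|A\| \le \sigma_{\max}$. Therefore $P$ is $\left(\rho + \frac{\sigma_{\max}^2}{\alpha}\right)$-smooth.
On the other hand, for any $x, x'$ we have
$$\langle \nabla P(x) - \nabla P(x'), x - x'\rangle = \langle \nabla f(x) - \nabla f(x'), x - x'\rangle + \langle \nabla g^*(Ax) - \nabla g^*(Ax'), Ax - Ax'\rangle \ge \frac{1}{\beta}\|Ax - Ax'\|^2 \ge \frac{\sigma_{\min}^2}{\beta}\|x - x'\|^2,$$
where we have used the convexity of $f$, the $\frac{1}{\beta}$-strong convexity of $g^*$, and that $A$ has full column rank. Therefore $P$ is $\frac{\sigma_{\min}^2}{\beta}$-strongly convex.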
With the smoothness and the strong convexity of , the proof is completed by applying Lemma 3.1. ∎
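The smoothness and strong convexity of the primal objective can be checked numerically on a small instance. The sketch below assumes $f(x) = \sum_j \log(1 + e^{x_j})$ (convex and $\frac{1}{4}$-smooth), a diagonal quadratic $g$, and a hand-picked $A$; the constants tested, $\rho + \sigma_{\max}^2/\alpha$ and $\sigma_{\min}^2/\beta$, are the ones obtained by chaining the smoothness of $f$ and $g^*$ with the singular value bounds on $A$.

```python
import math
import random


def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def rmatvec(A, y):
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


A = [[2.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # sigma_max = 2, sigma_min = 1
q = [1.0, 2.0, 4.0]                        # g(y) = 0.5 * sum_i q_i * y_i^2
alpha, beta = min(q), max(q)               # g is beta-smooth, alpha-strongly convex
rho, sigma_max, sigma_min = 0.25, 2.0, 1.0


def grad_P(x):
    """grad P(x) = grad f(x) + A^T grad g^*(A x), with grad g^*(z) = z / q."""
    dual = [zi / qi for zi, qi in zip(matvec(A, x), q)]
    return [1.0 / (1.0 + math.exp(-xj)) + tj for xj, tj in zip(x, rmatvec(A, dual))]


smooth_const = rho + sigma_max ** 2 / alpha   # claimed smoothness of P
sc_const = sigma_min ** 2 / beta              # claimed strong convexity of P

rng = random.Random(0)
constants_hold = True
for _ in range(1000):
    u = [rng.uniform(-3.0, 3.0) for _ in range(2)]
    v = [rng.uniform(-3.0, 3.0) for _ in range(2)]
    d = [a - b for a, b in zip(u, v)]
    gd = [a - b for a, b in zip(grad_P(u), grad_P(v))]
    if math.sqrt(dot(gd, gd)) > smooth_const * math.sqrt(dot(d, d)) + 1e-12:
        constants_hold = False   # gradient Lipschitz bound violated
    if dot(gd, d) < sc_const * dot(d, d) - 1e-12:
        constants_hold = False   # strong monotonicity bound violated
```

A random check like this cannot prove the bounds, but a violation would immediately flag a wrong constant.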
Proposition 3.1 suggests that if we use the “ghost” algorithm (7), we have the desired linear convergence property. The following proposition gives an upper bound on $a_{t+1} = \|x_{t+1} - x^*\|$ by bounding the distance between $x_{t+1}$ and $\tilde{x}_{t+1}$.
Proposition 3.2.
If , then
(8)  
Proof.
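We have $x_{t+1} - \tilde{x}_{t+1} = -\eta_1 A^\top\left(y_t - \nabla g^*(A x_t)\right)$, which implies $\|x_{t+1} - \tilde{x}_{t+1}\| \le \eta_1 \sigma_{\max} \|y_t - \nabla g^*(A x_t)\| = \eta_1 \sigma_{\max} b_t$.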
Then the proposition follows by applying the triangle inequality and Proposition 3.1. ∎
Step 2: Bounding the Decrease of $b_t$.
One may want to show the decrease of $\|y_t - y^*\|$ similarly using a “ghost” update for the dual variable. However, the objective function in the dual problem might be non-smooth, which means we cannot obtain a result similar to Proposition 3.1. Instead, we show that $b_t = \|y_t - \nabla g^*(A x_t)\|$ decreases geometrically up to an error term.
Proposition 3.3.
We have
Proof.
Using the gradient update formula of the primal variable, we have
(9)  
Proposition 3.4.
If , then
Proof.
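For fixed $x_t$, the update rule $y_{t+1} = y_t - \eta_2\left(\nabla g(y_t) - A x_t\right)$ is a gradient descent step for the objective function $y \mapsto g(y) - y^\top A x_t$, which is also $\beta$-smooth and $\alpha$-strongly convex. By the optimality condition, the minimizer $\tilde{y}_t$ of this function satisfies $\nabla g(\tilde{y}_t) = A x_t$, i.e., $\tilde{y}_t = \nabla g^*(A x_t)$. Then from Lemma 3.1 we know that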
(10) 
Step 3: Putting Things Together.
Now we are ready to finish the proof of Theorem 3.1. From Propositions 3.2 and 3.4 we have
(11) 
(12)  
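To prove the convergence of the sequences $\{a_t\}$ and $\{b_t\}$ to $0$, we consider a linear combination $a_t + \lambda b_t$ with a free parameter $\lambda > 0$ to be determined. Combining (11) and (12), with some routine calculations, we can show that our choices of $\eta_1$, $\eta_2$ and $\lambda$ given in Theorem 3.1 can ensure $a_{t+1} + \lambda b_{t+1} \le \theta\,(a_t + \lambda b_t)$ for some $\theta \in (0, 1)$, as desired. We give the remaining details in Appendix A.1.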
4 Extension to Primal-Dual SVRG
In this section we consider the case where the saddle point problem (1) admits a finite-sum structure (for ease of presentation we assume $f$, $g$ and $A$ can each be split into $n$ terms; it is not hard to generalize our analysis to the case where $f$, $g$ and $A$ can be split into different numbers of terms):
$$\min_{x \in \mathbb{R}^{d_1}} \max_{y \in \mathbb{R}^{d_2}} L(x, y) = \frac{1}{n}\sum_{i=1}^{n} L_i(x, y), \tag{13}$$
where $L_i(x, y) = f_i(x) + y^\top A_i x - g_i(y)$. Optimization problems with finite-sum structure are ubiquitous in machine learning, because loss functions can often be written as a sum of individual loss terms corresponding to individual observations.
In this section, we make the following assumptions:
Assumption 4.1.
Each is smooth (), and is convex.
Assumption 4.2.
Each is smooth, and is strongly convex ().
Assumption 4.3.
Each satisfies , and has rank .
Note that we only require the component functions $f_i$ and $g_i$ to be smooth; they are not necessarily convex. However, the overall objective function $L(x, y)$ still has to satisfy Assumptions 3.1-3.3.
Given the finitesum structure (13), we denote the individual gradient of each as
and the full gradient of as
A naive computation of $A_i x$ or $A_i^\top y$ takes $O(d_1 d_2)$ time. However, in many applications like policy evaluation (Du et al., 2017) and empirical risk minimization, each $A_i$ is given as the outer product of two vectors (i.e., a rank-$1$ matrix), which makes $A_i x$ and $A_i^\top y$ computable in only $O(d)$ time, where $d = \max\{d_1, d_2\}$. In this case, computing an individual gradient takes $O(d)$ time while computing the full gradient takes $O(nd)$ time.
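The saving from the rank-one structure is easy to see in code: with $A_i = u v^\top$, both products reduce to one inner product followed by a scaling. The vectors `u`, `v` and the helper names below are hypothetical and serve only as an illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def rank1_matvec(u, v, x):
    """Compute (u v^T) x = u * (v . x) without forming the matrix."""
    s = dot(v, x)
    return [ui * s for ui in u]


def rank1_rmatvec(u, v, y):
    """Compute (u v^T)^T y = v * (u . y) without forming the matrix."""
    s = dot(u, y)
    return [vi * s for vi in v]


u = [1.0, -2.0, 0.5]   # length d2
v = [3.0, 4.0]         # length d1
x = [0.5, -1.0]
y = [1.0, 0.0, 2.0]

fast_Ax = rank1_matvec(u, v, x)
fast_ATy = rank1_rmatvec(u, v, y)

# Dense reference computation, which costs O(d1 * d2) instead of O(d1 + d2).
dense = [[ui * vj for vj in v] for ui in u]
dense_Ax = [dot(row, x) for row in dense]
dense_ATy = [sum(dense[i][j] * y[i] for i in range(len(u))) for j in range(len(v))]
```

Both routes give the same vectors; only the structured one avoids touching all $d_1 d_2$ entries.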
We adapt the stochastic variance reduced gradient (SVRG) method (Johnson and Zhang, 2013) to solve Problem (13). The algorithm uses two layers of loops. In an outer loop, the algorithm first computes a full gradient using a “snapshot” point $(\tilde{x}, \tilde{y})$, and then the algorithm executes $N$ inner loops, where $N$ is a parameter to be chosen. In each inner loop, the algorithm randomly samples an index $i$ from $[n]$ and updates the current iterate $(x, y)$ using a variance-reduced stochastic gradient:
(14) 
Here, is the stochastic gradient at computed using the random index , and is a term used to reduce the variance in while keeping
an unbiased estimate of
. The full details of the algorithm are provided in Algorithm 2. For clarity, we denote by $(\tilde{x}_t, \tilde{y}_t)$ the snapshot point in the $t$-th epoch (outer loop), and denote by $(x_{t,j}, y_{t,j})$ ($j = 0, 1, \ldots, N$) all the intermediate iterates within this epoch.
The following theorem establishes the linear convergence guarantee of Algorithm 2.
Theorem 4.1.
Since computing a full gradient takes time and each inner loop takes time, each epoch takes time in total. Therefore, the total running time of Algorithm 2 is in order to reach an close solution, which is the desired running time of SVRG (note that does not depend on ).
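The variance-reduced estimator used in the inner loop can be sketched generically: a component gradient at the current point is corrected by the difference between the full and the component gradient at the snapshot. The sketch stacks the primal and dual variables into one vector `z` and uses made-up component gradients; it illustrates the standard SVRG correction, not the paper's exact Algorithm 2.

```python
def svrg_estimate(component_grads, full_grad_snapshot, z, z_snapshot, i):
    """Return B_i(z) + (B(z_snapshot) - B_i(z_snapshot)), coordinate-wise."""
    g_i = component_grads[i](z)
    g_i_snap = component_grads[i](z_snapshot)
    return [a + b - c for a, b, c in zip(g_i, full_grad_snapshot, g_i_snap)]


# Hypothetical component gradients B_i(z) on a 2-dimensional stacked variable.
params = [(1.0, 0.5), (2.0, -1.0), (0.5, 2.0), (3.0, 0.0)]
component_grads = [
    (lambda z, w=w, s=s: [w * z[0] + s, w * z[1] ** 3]) for w, s in params
]
n = len(component_grads)


def full_grad(z):
    """B(z) = (1/n) * sum_i B_i(z)."""
    return [sum(g(z)[k] for g in component_grads) / n for k in range(2)]


z_snapshot = [1.0, -1.0]
z = [0.8, -0.5]
snap_full = full_grad(z_snapshot)   # computed once per epoch

# Averaging the estimator over a uniformly random index recovers B(z).
mean_estimate = [
    sum(svrg_estimate(component_grads, snap_full, z, z_snapshot, i)[k] for i in range(n)) / n
    for k in range(2)
]
target = full_grad(z)

# At the snapshot itself every index returns the full gradient (zero variance).
at_snapshot = [svrg_estimate(component_grads, snap_full, z_snapshot, z_snapshot, i) for i in range(n)]
```

The estimator is unbiased for every `z`, and its variance vanishes as `z` approaches the snapshot, which is what allows constant step sizes in the inner loop.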
5 Preliminary Empirical Evaluation
We perform a preliminary empirical evaluation for the following purposes: (i) to verify that both the primal-dual gradient method (Algorithm 1) and the primal-dual SVRG method (Algorithm 2) can indeed achieve linear convergence, (ii) to investigate the convergence rates of Algorithms 1 and 2, in comparison with their primal-only counterparts (i.e., the usual gradient descent and SVRG algorithms for the primal problem), and (iii) to compare the convergence rates of Algorithms 1 and 2.
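We consider the linear regression problem with smoothed-$L_1$ regularization, formulated as
$$\min_{x \in \mathbb{R}^d} \frac{1}{2n}\|Ax - b\|^2 + \lambda R_a(x), \tag{15}$$
where $A \in \mathbb{R}^{n \times d}$, $b \in \mathbb{R}^n$, and $R_a(x) := \sum_{i=1}^{d} \frac{1}{a}\left(\log(1 + e^{a x_i}) + \log(1 + e^{-a x_i})\right)$ is the smoothed-$L_1$ regularization (Schmidt et al., 2007) (when $a > 0$ is large we have $R_a(x) \approx \|x\|_1$ for all $x$). Note that $R_a$ is smooth but not strongly convex, and does not have a closed-form proximal mapping. As discussed in Section 1.2, Problem (15) admits a saddle point formulation: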