 # Linear Convergence of the Primal-Dual Gradient Method for Convex-Concave Saddle Point Problems without Strong Convexity

We consider the convex-concave saddle point problem $\min_x \max_y f(x) + y^\top A x - g(y)$, where $f$ is smooth and convex and $g$ is smooth and strongly convex. We prove that if the coupling matrix $A$ has full column rank, the vanilla primal-dual gradient method can achieve linear convergence even if $f$ is not strongly convex. Our result generalizes previous work which either requires $f$ and $g$ to be quadratic functions or requires proximal mappings for both $f$ and $g$. We adopt a novel analysis technique that in each iteration uses a "ghost" update as a reference, and show that the iterates in the primal-dual gradient method converge to this "ghost" sequence. Using the same technique we further give an analysis for the primal-dual stochastic variance reduced gradient (SVRG) method for convex-concave saddle point problems with a finite-sum structure.


## 1 Introduction

We revisit the convex-concave saddle point problems of the form

$$\min_{x\in\mathbb{R}^{d_1}} \max_{y\in\mathbb{R}^{d_2}} L(x,y) = f(x) + y^\top A x - g(y), \tag{1}$$

where both $f$ and $g$ are convex functions and $A \in \mathbb{R}^{d_2\times d_1}$ is a coupling matrix. This formulation has a wide range of applications, including supervised learning (Zhang and Lin, 2015), unsupervised learning (Xu et al., 2005; Bach et al., 2008), reinforcement learning (Du et al., 2017), robust optimization (Ben-Tal et al., 2009), PID control (Hast et al., 2013), etc. See Section 1.2 for some concrete examples.

When the problem dimension is large, the most widely used and sometimes the only scalable methods to solve Problem (1) are first-order methods. Arguably the simplest first-order algorithm is the primal-dual gradient method (Algorithm 1), a natural generalization of the gradient descent algorithm, which simultaneously performs gradient descent on the primal variable $x$ and gradient ascent on the dual variable $y$.
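As a minimal sketch of these simultaneous updates, $x_{t+1} = x_t - \eta_1(\nabla f(x_t) + A^\top y_t)$ and $y_{t+1} = y_t + \eta_2(Ax_t - \nabla g(y_t))$, the following NumPy code runs the method on a small quadratic instance. The function name `primal_dual_gradient`, the toy data `A`, `B`, `c` and the step sizes are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def primal_dual_gradient(grad_f, grad_g, A, x0, y0, eta1, eta2, num_iters):
    """Gradient descent on x and gradient ascent on y for
    L(x, y) = f(x) + y^T A x - g(y)."""
    x, y = x0.copy(), y0.copy()
    for _ in range(num_iters):
        grad_x = grad_f(x) + A.T @ y      # gradient of L with respect to x
        grad_y = A @ x - grad_g(y)        # gradient of L with respect to y
        # Both updates use the previous iterate (x_t, y_t).
        x, y = x - eta1 * grad_x, y + eta2 * grad_y
    return x, y

# Toy instance (hypothetical): f is convex but NOT strongly convex.
A = np.array([[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]])   # full column rank
B = np.array([[1.0, 1.0], [1.0, 1.0]])               # f(x) = 0.5 x^T B x, B has rank one
c = np.array([1.0, 1.0, 1.0])                        # g(y) = 0.5 ||y||^2 - c^T y

x_hat, y_hat = primal_dual_gradient(
    grad_f=lambda x: B @ x,
    grad_g=lambda y: y - c,
    A=A, x0=np.zeros(2), y0=np.zeros(3),
    eta1=0.01, eta2=0.5, num_iters=10000,            # assumed, hand-picked step sizes
)

# Closed-form saddle point of this quadratic instance, for comparison.
x_star = np.linalg.solve(B + A.T @ A, -A.T @ c)
y_star = A @ x_star + c
print(np.linalg.norm(x_hat - x_star), np.linalg.norm(y_hat - y_star))
```

Both printed distances should be close to machine precision on this instance, even though $f$ has a singular Hessian.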

There has been extensive research on analyzing the convergence rate of Algorithm 1 and its variants. It is known that if both $f$ and $g$ are strongly convex and admit efficient proximal mappings, then the proximal primal-dual gradient method converges to the optimal solution at a linear rate (Bauschke and Combettes, 2011; Palaniappan and Bach, 2016; Chen and Rockafellar, 1997), i.e., it only requires $O(\log(1/\epsilon))$ iterations to obtain a solution that is $\epsilon$-close to the optimum.

In many applications, however, we only have strong convexity in $g$ but no strong convexity in $f$. This motivates the following question:

Does the primal-dual gradient method converge linearly to the optimal solution if $f$ is not strongly convex?

Intuitively, a linear convergence rate is plausible. Consider the corresponding primal problem of (1):

$$\min_{x\in\mathbb{R}^{d_1}} P(x) = g^*(Ax) + f(x), \tag{2}$$

where $g^*$ is the conjugate function of $g$. Because $g$ is smooth and strongly convex, so is $g^*$; hence, as long as $A$ has full column rank, Problem (2) has a smooth and strongly convex objective and thus vanilla gradient descent achieves linear convergence. Therefore, one might expect a linearly convergent first-order algorithm for Problem (1) as well. However, whether the vanilla primal-dual gradient method (Algorithm 1) has linear convergence turns out to be a nontrivial question.
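For reference, the primal problem is obtained from the saddle point problem by maximizing over $y$ in closed form. The short derivation below uses only the definition of the conjugate, $g^*(z) = \sup_y \{\langle z, y\rangle - g(y)\}$.

```latex
\begin{aligned}
\min_{x}\max_{y}\; L(x,y)
  &= \min_{x}\Big\{ f(x) + \max_{y}\big\{ \langle Ax, y\rangle - g(y) \big\} \Big\} \\
  &= \min_{x}\big\{ f(x) + g^*(Ax) \big\}
   \;=\; \min_{x} P(x).
\end{aligned}
```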

Two recent results confirmed this intuition under additional assumptions: Du et al. (2017) required both $f$ and $g$ to be quadratic functions, and Wang and Xiao (2017) required both $f$ and $g$ to have efficient proximal mappings and used a proximal primal-dual gradient method. In this paper, we give an affirmative answer to this question with minimal assumptions. Our main contributions are summarized below.

### 1.1 Our Contributions

##### Linear Convergence of the Primal-Dual Gradient Method.

We show that as long as $f$ and $g$ are smooth, $f$ is convex, $g$ is strongly convex and the coupling matrix $A$ has full column rank, Algorithm 1 converges to the optimal solution at a linear rate. See Section 3 for a precise statement of our result. This result significantly generalizes previous ones which rely on stronger assumptions. Note that all the assumptions are necessary for linear convergence: without any of them, the primal problem (2) requires at least $\Omega(1/\sqrt{\epsilon})$ iterations to obtain an $\epsilon$-close solution (Nesterov, 2013), so there is no hope of linear convergence for Problem (1).

##### New Analysis Technique.

To analyze the convergence of an optimization algorithm, a common approach is to construct a potential function (also called a Lyapunov function in the literature) which decreases after each iteration. For example, for the primal problem (2), a natural potential function is $\|x_t - x^*\|$, the distance between the current iterate and the optimal solution. However, for the primal-dual gradient method, it is difficult to show that similar potential functions like $\|x_t - x^*\| + \|y_t - y^*\|$ decrease, because the two sequences, $\{x_t\}$ and $\{y_t\}$, are related to each other.

In this paper, we develop a novel method for analyzing the convergence rate of the primal-dual gradient method. The key idea is to consider a “ghost” sequence. For example, in our setting, the “ghost” sequence comes from a gradient descent step for Problem (2). Then we relate the sequence generated by Algorithm 1 to this “ghost” sequence and show they are close in a certain way. See Section 3 for details. We believe this technique is applicable to other problems where we need to analyze multiple sequences.

##### Extension to Primal-Dual Stochastic Variance Reduced Gradient Method.

Many optimization problems in machine learning have a finite-sum structure, and randomized algorithms have been proposed to exploit this structure and to speed up the convergence. There has been extensive research in recent years on developing more efficient stochastic algorithms in this setting (Le Roux et al., 2012; Johnson and Zhang, 2013; Defazio et al., 2014; Xiao and Zhang, 2014; Shalev-Shwartz and Zhang, 2013; Richtárik and Takáč, 2014; Lin et al., 2015; Zhang and Lin, 2015; Allen-Zhu, 2017). Among them, the stochastic variance reduced gradient (SVRG) algorithm (Johnson and Zhang, 2013) is a popular one with $O((n+\kappa)d\log(1/\epsilon))$ computational complexity for smooth and strongly convex objectives, where $n$ is the number of component functions, $d$ is the dimension of the variable, and $\kappa$ is a condition number that only depends on problem-dependent parameters like smoothness and strong convexity but not on $n$. Variants of SVRG for saddle point problems have been recently studied by Palaniappan and Bach (2016), Wang and Xiao (2017) and Du et al. (2017), and can achieve similar running time. (Note that $\kappa$ may be different in the primal and the primal-dual settings.) However, these results all require additional assumptions. In this paper, we use our analysis technique developed for Algorithm 1 to show that the primal-dual SVRG method also admits an $O((n+\kappa)d\log(1/\epsilon))$ type computational complexity.

### 1.2 Motivating Examples

In this subsection we list some machine learning applications that naturally lead to convex-concave saddle point problems.

##### Reinforcement Learning.

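For the policy evaluation task in reinforcement learning, we have data $\{(s_t, r_t, s_{t+1})\}_{t=1}^n$ generated by a policy $\pi$, where $s_t$ is the state at the $t$-th time step, $r_t$ is the reward and $s_{t+1}$ is the state at the $(t+1)$-th step. We also have a discount factor $0 < \gamma < 1$ and a feature function $\phi(\cdot)$ which maps a state to a feature vector. Our goal is to learn a linear value function $V^\pi(s) \approx x^\top \phi(s)$ which represents the long term expected reward starting from state $s$ using the policy $\pi$. A common way to estimate $x$ is to minimize the empirical mean squared projected Bellman error (MSPBE):

$$\min_x\; (Ax-b)^\top C^{-1} (Ax-b), \tag{3}$$

where $A = \frac{1}{n}\sum_{t=1}^n \phi(s_t)\big(\phi(s_t) - \gamma\phi(s_{t+1})\big)^\top$, $b = \frac{1}{n}\sum_{t=1}^n r_t\,\phi(s_t)$ and $C = \frac{1}{n}\sum_{t=1}^n \phi(s_t)\phi(s_t)^\top$. Note that directly using gradient descent to solve problem (3) is expensive because we need to invert the matrix $C$. Du et al. (2017) considered the equivalent saddle point formulation: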

$$\min_x \max_y\; L(x,y) = -y^\top A x - \frac{1}{2} y^\top C y + b^\top y.$$

The gradient of $L$ can be computed more efficiently than that of the original formulation (3), and $L$ has a finite-sum structure.
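The equivalence can be checked numerically: for fixed $x$, the inner maximization over $y$ has the closed-form solution $y = C^{-1}(b - Ax)$, and the maximal value is one half of the objective in (3). The sketch below uses randomly generated stand-ins for $A$, $b$ and $C$ (assumptions for illustration, not real policy-evaluation data).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
A = rng.standard_normal((d, d))
b = rng.standard_normal(d)
R = rng.standard_normal((d, d))
C = R @ R.T + np.eye(d)          # symmetric positive definite stand-in for C
x = rng.standard_normal(d)

def saddle_L(x, y):
    """Saddle point objective L(x, y) = -y^T A x - 0.5 y^T C y + b^T y."""
    return -y @ A @ x - 0.5 * y @ C @ y + b @ y

# Inner maximizer: grad_y L = -A x - C y + b = 0  =>  y = C^{-1}(b - A x).
y_max = np.linalg.solve(C, b - A @ x)
inner_max = saddle_L(x, y_max)

residual = A @ x - b
mspbe = residual @ np.linalg.solve(C, residual)   # objective of (3), via a linear solve
print(inner_max, 0.5 * mspbe)                     # the two values should agree
```

The constant factor $1/2$ does not change the minimizer, and evaluating the gradient of $L$ only needs matrix-vector products with $A$ and $C$, with no linear solve.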

##### Empirical Risk Minimization.

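Consider the classical supervised learning problem of learning a linear predictor $x \in \mathbb{R}^d$ given $n$ data points $(a_i, b_i) \in \mathbb{R}^d \times \mathbb{R}$. Denote by $A \in \mathbb{R}^{n\times d}$ the data matrix whose $i$-th row is $a_i^\top$. Then the empirical risk minimization (ERM) problem amounts to solving

$$\min_{x\in\mathbb{R}^d}\; \ell(Ax) + f(x),$$

where $\ell: \mathbb{R}^n \to \mathbb{R}$ is induced by some loss function and $f: \mathbb{R}^d \to \mathbb{R}$ is a regularizer; both $\ell$ and $f$ are convex functions. Equivalently, we can solve the dual problem $\max_{y\in\mathbb{R}^n} \{-\ell^*(y) - f^*(-A^\top y)\}$ or the saddle point problem $\min_{x\in\mathbb{R}^d}\max_{y\in\mathbb{R}^n} \{y^\top A x - \ell^*(y) + f(x)\}$. The saddle point formulation is favorable in many scenarios, e.g., when such a formulation admits a finite-sum structure (Zhang and Lin, 2015; Wang and Xiao, 2017), reduces communication complexity in the distributed setting (Xiao et al., 2017) or exploits sparsity structure (Lei et al., 2017).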

##### Robust Optimization.

The robust optimization framework (Ben-Tal et al., 2009) aims at minimizing an objective function with uncertain data, which naturally leads to a saddle point problem, often with the following form:

$$\min_x \max_y\; \mathbb{E}_{\xi\sim P(y)}\left[f(x,\xi)\right], \tag{4}$$

where $f$ is some loss function we want to minimize and the distribution of the data is parametrized by $P(y)$. For certain special cases (Liu et al., 2017), Problem (4) has the bilinear form as in (1).

### 1.3 Comparison with Previous Results

There have been many attempts to analyze the primal-dual gradient method or its variants. In particular, Chen and Rockafellar (1997), Chambolle and Pock (2011) and Palaniappan and Bach (2016) showed that if both $f$ and $g$ are strongly convex and have efficient proximal mappings, then the proximal primal-dual gradient method achieves a linear convergence rate. (Chen and Rockafellar (1997) and Palaniappan and Bach (2016) considered a more general formulation than Problem (1); here we specialize to the bilinear saddle point problem.) In fact, even without proximal mappings, as long as both $f$ and $g$ are smooth and strongly convex, Algorithm 1 achieves a linear convergence rate. In Appendix B we give a simple proof of this fact.

Two recent papers show that it is possible to achieve linear convergence even without strong convexity in $f$. The key is the additional assumption that $A$ has full column rank, which helps “transfer” $g$’s strong convexity to $x$. Du et al. (2017) considered the case when both $f$ and $g$ are quadratic functions, i.e., when Problem (1) has the following special form:

$$L(x,y) = x^\top B x + b^\top x + y^\top A x - y^\top C y + c^\top y.$$

Note that $B$ does not have to be positive definite (but $C$ has to be), and thus strong convexity is not necessary in the primal variable. Their analysis is based on writing the gradient updates as a linear dynamical system (cf. Equation (41) in Du et al. (2017)):

$$\begin{bmatrix} x_{t+1}-x^* \\ \sqrt{\frac{\eta_1}{\eta_2}}\,(y_{t+1}-y^*) \end{bmatrix} = (I-G)\begin{bmatrix} x_t-x^* \\ \sqrt{\frac{\eta_1}{\eta_2}}\,(y_t-y^*) \end{bmatrix}, \tag{5}$$

where $G$ is some fixed matrix that depends on $A$, $B$, $C$ and the step sizes. Next, it suffices to bound the spectral norm of $I-G$ (which can be made strictly less than $1$) to show that $(x_t, y_t)$ converges to $(x^*, y^*)$ at a linear rate. However, it is difficult to generalize this approach to the general saddle point problem (1), since only when $f$ and $g$ are quadratic do we have the linear form (5).

Wang and Xiao (2017) considered the proximal primal-dual gradient method. They construct a potential function (cf. Page 15 in Wang and Xiao (2017)) and show that it decreases at a linear rate. However, this potential function relies heavily on the proximal mappings, so it is difficult to use this technique to analyze Algorithm 1.

In Table 1, we summarize different assumptions sufficient for linear convergence used in different papers.

### 1.4 Paper Organization

The rest of the paper is organized as follows. We give necessary definitions in Section 2. In Section 3, we present our main result for the primal-dual gradient method and its proof. In Section 4, we extend our analysis to the primal-dual stochastic variance reduced gradient method. In Section 5, we use some preliminary experiments to verify our theory. We conclude in Section 6 and put omitted proofs in the appendix.

## 2 Preliminaries

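Let $\|\cdot\|$ denote the Euclidean ($\ell_2$) norm of a vector, and let $\langle\cdot,\cdot\rangle$ denote the standard Euclidean inner product between two vectors. For a matrix $A \in \mathbb{R}^{m\times n}$, let $\sigma_i(A)$ be its $i$-th largest singular value, and let $\sigma_{\max}(A) := \sigma_1(A)$ and $\sigma_{\min}(A) := \sigma_{\min\{m,n\}}(A)$ be the largest and the smallest singular values of $A$, respectively. For a function $f$, we use $\nabla f$ to denote its gradient. Denote $[n] := \{1, 2, \ldots, n\}$. Let $I_d$ be the identity matrix in $\mathbb{R}^{d\times d}$.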

The smoothness and the strong convexity of a function are defined as follows:

###### Definition 2.1.

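For a differentiable function $\phi: \mathbb{R}^d \to \mathbb{R}$, we say

• $\phi$ is $\gamma$-smooth if $\|\nabla\phi(u) - \nabla\phi(v)\| \le \gamma\|u-v\|$ for all $u, v \in \mathbb{R}^d$;

• $\phi$ is $\delta$-strongly convex if $\phi(v) \ge \phi(u) + \langle\nabla\phi(u), v-u\rangle + \frac{\delta}{2}\|u-v\|^2$ for all $u, v \in \mathbb{R}^d$.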

We also need the definition of conjugate function:

###### Definition 2.2.

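The conjugate of a function $\phi: \mathbb{R}^d \to \mathbb{R}$ is defined as

$$\phi^*(y) := \sup_{x\in\mathbb{R}^d}\,\{\langle x, y\rangle - \phi(x)\}.$$

It is well-known that if $\phi$ is closed and convex, then $\phi^{**} = \phi$. If $\phi$ is smooth and strongly convex, its conjugate $\phi^*$ has the following properties: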

###### Fact 2.1.

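If $\phi$ is $\gamma$-smooth and $\delta$-strongly convex ($\delta > 0$), then

1. (Kakade et al., 2009) $\phi^*$ is $\frac{1}{\delta}$-smooth and $\frac{1}{\gamma}$-strongly convex.

2. (Rockafellar, 1970) The gradient mappings $\nabla\phi$ and $\nabla\phi^*$ are inverses of each other.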
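A quadratic function gives a quick sanity check of both properties. The worked example below takes $\phi(y) = \frac{1}{2}y^\top C y$ with a symmetric matrix $C$ satisfying $\delta I \preceq C \preceq \gamma I$; this choice of $\phi$ is an illustrative assumption, not part of the original text.

```latex
\begin{aligned}
\phi(y) &= \tfrac{1}{2}\, y^\top C y,
  & \nabla\phi(y) &= C y,
  & \delta I \preceq \nabla^2\phi &= C \preceq \gamma I, \\
\phi^*(z) &= \sup_{y}\big\{\langle z, y\rangle - \tfrac{1}{2}\, y^\top C y\big\}
           = \tfrac{1}{2}\, z^\top C^{-1} z,
  & \nabla\phi^*(z) &= C^{-1} z,
  & \tfrac{1}{\gamma} I \preceq \nabla^2\phi^* &= C^{-1} \preceq \tfrac{1}{\delta} I, \\
\nabla\phi^*\big(\nabla\phi(y)\big) &= C^{-1} C y = y.
\end{aligned}
```

The eigenvalue bounds on $C^{-1}$ give the $\frac{1}{\delta}$-smoothness and $\frac{1}{\gamma}$-strong convexity of $\phi^*$, and the last line shows that the two gradient mappings invert each other.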

## 3 Linear Convergence of the Primal-Dual Gradient Method

In this section we show the linear convergence of Algorithm 1 on Problem (1) under the following assumptions:

###### Assumption 3.1.

$f$ is convex and $\rho$-smooth ($\rho \ge 0$).

###### Assumption 3.2.

$g$ is $\beta$-smooth and $\alpha$-strongly convex ($\beta \ge \alpha > 0$).

###### Assumption 3.3.

The matrix $A \in \mathbb{R}^{d_2\times d_1}$ satisfies $\mathrm{rank}(A) = d_1$.

While the first two assumptions on $f$ and $g$ are standard in the convex optimization literature, the third one is important for ensuring linear convergence on Problem (1). Note, for example, that if $A$ is the all-zero matrix, then there is no interaction between $x$ and $y$, and to solve the convex optimization problem on $x$ we need at least $\Omega(1/\sqrt{\epsilon})$ iterations (Nesterov, 2013) instead of $O(\log(1/\epsilon))$.

Denote by $(x^*, y^*) \in \mathbb{R}^{d_1}\times\mathbb{R}^{d_2}$ the optimal solution to Problem (1). For simplicity, we let $\sigma_{\max} := \sigma_{\max}(A)$ and $\sigma_{\min} := \sigma_{\min}(A)$.

Recall the first-order optimality condition:

$$\begin{cases} \nabla_x L(x^*,y^*) = \nabla f(x^*) + A^\top y^* = 0, \\ \nabla_y L(x^*,y^*) = -\nabla g(y^*) + A x^* = 0. \end{cases} \tag{6}$$

###### Theorem 3.1.

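In the setting of Algorithm 1, define $a_t := \|x_t - x^*\|$ and $b_t := \|y_t - \nabla g^*(Ax_t)\|$. Let $\lambda > 0$ be a suitably chosen weight and $P_t := \lambda a_t + b_t$. If we choose the step sizes $\eta_1$ and $\eta_2$ appropriately, then we have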

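$$P_{t+1} \le \left(1 - C\cdot\frac{\alpha^2\sigma_{\min}^4}{\beta^3\sigma_{\max}^2\cdot\left(\rho + \frac{\sigma_{\max}^2}{\alpha}\right)}\right) P_t$$

for some absolute constant $C > 0$.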

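In this theorem, we use $P_t$ as the potential function and show that this function shrinks at a geometric rate. Note that from (6) and the second part of Fact 2.1 we have $y^* = \nabla g^*(Ax^*)$. Then we have the upper bounds $\|x_t - x^*\| \le P_t/\lambda$ and $\|y_t - y^*\| \le \|y_t - \nabla g^*(Ax_t)\| + \|\nabla g^*(Ax_t) - \nabla g^*(Ax^*)\| \le b_t + \frac{\sigma_{\max}}{\alpha}a_t$, which imply that if $P_t$ is small then $(x_t, y_t)$ will be close to the optimal solution $(x^*, y^*)$. Therefore a direct corollary of Theorem 3.1 is: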

###### Corollary 3.1.

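For any $\epsilon > 0$, after $O(\log(P_0/\epsilon))$ iterations, we have $\|x_t - x^*\| \le \epsilon$ and $\|y_t - y^*\| \le \epsilon$, where $O(\cdot)$ hides polynomial factors in $\beta$, $1/\alpha$, $\rho$, $\sigma_{\max}$ and $1/\sigma_{\min}$.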
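The step from a per-iteration contraction to a logarithmic iteration count is the usual one. Writing $c \in (0,1)$ for the contraction constant subtracted from $1$ in Theorem 3.1 (the symbol $c$ and the target accuracy $\epsilon'$ are shorthand introduced here), we have:

```latex
P_t \;\le\; (1-c)^t P_0 \;\le\; e^{-ct} P_0 \;\le\; \epsilon'
\qquad\text{whenever}\qquad
t \;\ge\; \frac{1}{c}\,\log\frac{P_0}{\epsilon'} .
```

Since $1/c$ is polynomial in the problem parameters, the number of iterations scales as $\log(1/\epsilon')$.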

We remark that our theorem suggests step sizes that depend on problem parameters which may be unknown. In practice, we may try to use a small amount of data to estimate them first, or use the adaptive tuning heuristic introduced by Wang and Xiao (2017).

### 3.1 Proof of Theorem 3.1

Now we present the proof of Theorem 3.1.

First recall the standard linear convergence guarantee of gradient descent on a smooth and strongly convex objective. See Theorem 3.12 in (Bubeck, 2015) for a proof.

###### Lemma 3.1.

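Suppose $\phi$ is $\gamma$-smooth and $\delta$-strongly convex, and let $\bar{x} := \arg\min_{x}\phi(x)$. For any $0 < \eta \le \frac{2}{\gamma+\delta}$ and $x \in \mathbb{R}^d$, letting $\tilde{x} := x - \eta\nabla\phi(x)$, we have

$$\|\tilde{x} - \bar{x}\| \le (1-\delta\eta)\,\|x - \bar{x}\|.$$
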
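The contraction in this lemma is easy to observe on a quadratic objective. The sketch below builds a random quadratic whose Hessian has eigenvalues between $\delta = 0.5$ and $\gamma = 4$ (hypothetical values chosen for illustration) and compares the observed per-step ratio with $1 - \delta\eta$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
Q_basis, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal basis
eigs = np.linspace(0.5, 4.0, d)                          # delta = 0.5, gamma = 4.0
H = Q_basis @ np.diag(eigs) @ Q_basis.T                  # Hessian of phi
x_bar = rng.standard_normal(d)                           # minimizer of phi
delta, gamma = eigs[0], eigs[-1]
eta = 2.0 / (gamma + delta)                              # largest step size allowed

# phi(x) = 0.5 (x - x_bar)^T H (x - x_bar), so grad phi(x) = H (x - x_bar).
x = rng.standard_normal(d)
ratios = []
for _ in range(20):
    x_new = x - eta * (H @ (x - x_bar))
    ratios.append(np.linalg.norm(x_new - x_bar) / np.linalg.norm(x - x_bar))
    x = x_new

print(max(ratios), 1 - delta * eta)   # observed worst ratio vs. the guaranteed factor
```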
##### Step 1: Bounding the Decrease of $\|x_t - x^*\|$ via a One-Step “Ghost” Algorithm.

(Note that $\|x_t - x^*\|$ may not decrease as $t$ increases. What we mean here is to upper bound $\|x_{t+1} - x^*\|$ using $\|x_t - x^*\|$ and an error term.)

Our technique is to consider the following one-step “ghost” algorithm for the primal variable, which corresponds to a gradient descent step for the primal problem (2). We define an auxiliary variable $\tilde{x}_{t+1}$: given $(x_t, y_t)$, let

$$\tilde{x}_{t+1} := x_t - \eta_1\left(\nabla f(x_t) + \nabla h(x_t)\right), \tag{7}$$

where $h(x) := g^*(Ax)$, so that $\nabla h(x) = A^\top\nabla g^*(Ax)$. Note that $\tilde{x}_{t+1}$ is defined only for the purpose of the proof. Our main idea is to use this “ghost” algorithm as a reference and bound the distance between the primal-dual gradient iterate $x_{t+1}$ and this “ghost” variable $\tilde{x}_{t+1}$. We first prove that with this “ghost” algorithm, the distance between the primal variable and the optimum decreases at a geometric rate.

###### Proposition 3.1.

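If $\eta_1 \le \frac{2}{\rho + \sigma_{\max}^2/\alpha + \sigma_{\min}^2/\beta}$, then

$$\|\tilde{x}_{t+1} - x^*\| \le \left(1 - \frac{\sigma_{\min}^2}{\beta}\eta_1\right)\|x_t - x^*\|.$$
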
###### Proof.

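Since (7) is a gradient descent step for the primal problem (2) whose objective is $P(x) = f(x) + h(x)$ where $h(x) = g^*(Ax)$, it suffices to show that $P$ is smooth and strongly convex in order to apply Lemma 3.1. Note that $g^*$ is $\frac{1}{\alpha}$-smooth and $\frac{1}{\beta}$-strongly convex according to Fact 2.1.

We have $\nabla P(x) = \nabla f(x) + A^\top\nabla g^*(Ax)$. Then for any $x, x' \in \mathbb{R}^{d_1}$ we have

$$\begin{aligned}
\|\nabla P(x) - \nabla P(x')\| &\le \|\nabla f(x) - \nabla f(x')\| + \|A^\top\nabla g^*(Ax) - A^\top\nabla g^*(Ax')\| \\
&\le \rho\|x - x'\| + \sigma_{\max}\|\nabla g^*(Ax) - \nabla g^*(Ax')\| \\
&\le \rho\|x - x'\| + \frac{\sigma_{\max}}{\alpha}\|Ax - Ax'\| \\
&\le \rho\|x - x'\| + \frac{\sigma_{\max}^2}{\alpha}\|x - x'\| \\
&= \left(\rho + \sigma_{\max}^2/\alpha\right)\|x - x'\|,
\end{aligned}$$

where we have used the $\rho$-smoothness of $f$, the $\frac{1}{\alpha}$-smoothness of $g^*$, and the bound $\|A\| \le \sigma_{\max}$. Therefore $P$ is $(\rho + \sigma_{\max}^2/\alpha)$-smooth.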

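On the other hand, for any $x, x' \in \mathbb{R}^{d_1}$ we have

$$\begin{aligned}
P(x') - P(x) &= f(x') - f(x) + g^*(Ax') - g^*(Ax) \\
&\ge \langle\nabla f(x), x' - x\rangle + \langle\nabla g^*(Ax), Ax' - Ax\rangle + \frac{1}{2\beta}\|Ax' - Ax\|^2 \\
&= \langle\nabla f(x) + A^\top\nabla g^*(Ax), x' - x\rangle + \frac{1}{2\beta}\|Ax' - Ax\|^2 \\
&\ge \langle\nabla P(x), x' - x\rangle + \frac{\sigma_{\min}^2}{2\beta}\|x' - x\|^2,
\end{aligned}$$

where we have used the convexity of $f$, the $\frac{1}{\beta}$-strong convexity of $g^*$, and that $A$ has full column rank (so that $\|A(x'-x)\| \ge \sigma_{\min}\|x'-x\|$). Therefore $P$ is $\frac{\sigma_{\min}^2}{\beta}$-strongly convex.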

With the smoothness and the strong convexity of , the proof is completed by applying Lemma 3.1. ∎
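For a quadratic instance the two constants derived in this proof can be compared with the exact spectrum of the Hessian of $P$. The sketch below assumes $f(x) = \frac{1}{2}x^\top B x$ with a rank-one $B$ and $g(y) = \frac{1}{2}y^\top C y$ with diagonal $C$; these choices and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d1, d2 = 3, 6
A = rng.standard_normal((d2, d1))            # full column rank with probability one
u = rng.standard_normal(d1)
B = np.outer(u, u)                           # f(x) = 0.5 x^T B x: convex, rank one
C = np.diag(np.linspace(1.0, 3.0, d2))       # g(y) = 0.5 y^T C y: alpha = 1, beta = 3

alpha, beta = 1.0, 3.0
rho = np.linalg.eigvalsh(B)[-1]              # smoothness constant of f
sing = np.linalg.svd(A, compute_uv=False)    # singular values in descending order
sigma_max, sigma_min = sing[0], sing[-1]

# Here g*(z) = 0.5 z^T C^{-1} z, so P(x) = f(x) + g*(A x) has a constant Hessian.
hess_P = B + A.T @ np.linalg.solve(C, A)
eig_P = np.linalg.eigvalsh(hess_P)           # eigenvalues in ascending order

print("strong convexity:", eig_P[0], ">=", sigma_min**2 / beta)
print("smoothness:      ", eig_P[-1], "<=", rho + sigma_max**2 / alpha)
```

Although $f$ alone has a singular Hessian, the smallest eigenvalue of the Hessian of $P$ stays bounded away from zero because $A$ has full column rank.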

Proposition 3.1 suggests that if we use the “ghost” algorithm (7), we have the desired linear convergence property. The following proposition gives an upper bound on $\|x_{t+1} - x^*\|$ by bounding the distance between $x_{t+1}$ and $\tilde{x}_{t+1}$.

###### Proposition 3.2.

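If $\eta_1 \le \frac{2}{\rho + \sigma_{\max}^2/\alpha + \sigma_{\min}^2/\beta}$, then

$$\|x_{t+1} - x^*\| \le \left(1 - \frac{\sigma_{\min}^2}{\beta}\eta_1\right)\|x_t - x^*\| + \sigma_{\max}\eta_1\|y_t - \nabla g^*(Ax_t)\|. \tag{8}$$
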
###### Proof.

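We have $\tilde{x}_{t+1} - x_{t+1} = \eta_1 A^\top\left(y_t - \nabla g^*(Ax_t)\right)$, which implies

$$\|\tilde{x}_{t+1} - x_{t+1}\| \le \eta_1\sigma_{\max}\|y_t - \nabla g^*(Ax_t)\|.$$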

Then the proposition follows by applying the triangle inequality and Proposition 3.1. ∎

##### Step 2: Bounding the Decrease of $\|y_t - \nabla g^*(Ax_t)\|$.

One may want to show the decrease of $\|y_t - y^*\|$ similarly using a “ghost” update for the dual variable. However, the objective function in the dual problem might be non-smooth, which means we cannot obtain a result similar to Proposition 3.1. Instead, we show that $\|y_t - \nabla g^*(Ax_t)\|$ decreases geometrically up to an error term.

###### Proposition 3.3.

We have

$$\|x_{t+1} - x_t\| \le \left(\rho + \frac{\sigma_{\max}^2}{\alpha}\right)\eta_1\|x_t - x^*\| + \sigma_{\max}\eta_1\|y_t - \nabla g^*(Ax_t)\|.$$

###### Proof.

Using the gradient update formula of the primal variable, we have

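$$\begin{aligned}
\frac{1}{\eta_1}\|x_{t+1} - x_t\| &= \|\nabla f(x_t) + A^\top y_t\| \\
&\le \|\nabla f(x_t) + A^\top\nabla g^*(Ax_t)\| + \|A^\top(y_t - \nabla g^*(Ax_t))\| \\
&\le \|\nabla f(x_t) + A^\top\nabla g^*(Ax_t)\| + \sigma_{\max}\|y_t - \nabla g^*(Ax_t)\|.
\end{aligned} \tag{9}$$

Recall that the primal objective function $P$ is $(\rho + \sigma_{\max}^2/\alpha)$-smooth (see the proof of Proposition 3.1) and that $\nabla P(x^*) = 0$. So we have

$$\|\nabla f(x_t) + A^\top\nabla g^*(Ax_t)\| = \|\nabla P(x_t) - \nabla P(x^*)\| \le \left(\rho + \sigma_{\max}^2/\alpha\right)\|x_t - x^*\|.$$

Plugging this back into (9) we obtain the desired result. ∎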

###### Proposition 3.4.

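If $\eta_2 \le \frac{2}{\alpha+\beta}$, then

$$\|y_{t+1} - \nabla g^*(Ax_{t+1})\| \le \left(1 - \alpha\eta_2 + \frac{\sigma_{\max}^2}{\alpha}\eta_1\right)\|y_t - \nabla g^*(Ax_t)\| + \frac{\sigma_{\max}}{\alpha}\left(\rho + \frac{\sigma_{\max}^2}{\alpha}\right)\eta_1\|x_t - x^*\|.$$
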
###### Proof.

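For fixed $x_t$, the update rule $y_{t+1} = y_t - \eta_2\left(\nabla g(y_t) - Ax_t\right)$ is a gradient descent step for the objective function $\tilde{g}(y) := g(y) - y^\top A x_t$, which is also $\beta$-smooth and $\alpha$-strongly convex. By the optimality condition, the minimizer $\tilde{y}^* = \arg\min_y \tilde{g}(y)$ satisfies $\nabla g(\tilde{y}^*) = Ax_t$, i.e., $\tilde{y}^* = \nabla g^*(Ax_t)$. Then from Lemma 3.1 we know that

$$\|y_{t+1} - \nabla g^*(Ax_t)\| \le (1 - \alpha\eta_2)\|y_t - \nabla g^*(Ax_t)\|. \tag{10}$$

Since we want to upper bound $\|y_{t+1} - \nabla g^*(Ax_{t+1})\|$, we need to take into account the difference between $\nabla g^*(Ax_{t+1})$ and $\nabla g^*(Ax_t)$, which is controlled by $\|x_{t+1} - x_t\|$; an upper bound on this quantity is given in Proposition 3.3. Using Proposition 3.3 and (10), we have

$$\begin{aligned}
\|y_{t+1} - \nabla g^*(Ax_{t+1})\| &\le \|y_{t+1} - \nabla g^*(Ax_t)\| + \|\nabla g^*(Ax_{t+1}) - \nabla g^*(Ax_t)\| \\
&\le \|y_{t+1} - \nabla g^*(Ax_t)\| + \frac{\sigma_{\max}}{\alpha}\|x_{t+1} - x_t\| \\
&\le (1 - \alpha\eta_2)\|y_t - \nabla g^*(Ax_t)\| + \frac{\sigma_{\max}}{\alpha}\left(\rho + \frac{\sigma_{\max}^2}{\alpha}\right)\eta_1\|x_t - x^*\| + \frac{\sigma_{\max}^2}{\alpha}\eta_1\|y_t - \nabla g^*(Ax_t)\|.
\end{aligned}$$

∎

Note that the upper bound on $\|x_{t+1} - x_t\|$ given in Proposition 3.3 is proportional to $\eta_1$, not to $\eta_2$. This allows us to choose a relatively small $\eta_1$ to ensure that the factor $1 - \alpha\eta_2 + \frac{\sigma_{\max}^2}{\alpha}\eta_1$ in Proposition 3.4 is indeed less than $1$, i.e., $\|y_t - \nabla g^*(Ax_t)\|$ is approximately decreasing.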
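The one-step bounds of Propositions 3.2 and 3.4 can be checked along an actual trajectory. The sketch below does so on a quadratic instance; the data, the step sizes `eta1` and `eta2`, and all variable names are assumptions made for illustration (the step sizes satisfy $\eta_1 \le 2/(\rho + \sigma_{\max}^2/\alpha + \sigma_{\min}^2/\beta)$ and $\eta_2 \le 2/(\alpha+\beta)$ but are not the paper's choices).

```python
import numpy as np

rng = np.random.default_rng(2)
d1, d2 = 3, 6
A = rng.standard_normal((d2, d1))
u = rng.standard_normal(d1)
B = np.outer(u, u)                         # f(x) = 0.5 x^T B x
C = np.diag(np.linspace(1.0, 3.0, d2))     # g(y) = 0.5 y^T C y - c^T y
c = rng.standard_normal(d2)
alpha, beta = 1.0, 3.0
rho = np.linalg.eigvalsh(B)[-1]
sing = np.linalg.svd(A, compute_uv=False)
s_max, s_min = sing[0], sing[-1]

def grad_g_conj(z):
    """Inverse of grad g(y) = C y - c."""
    return np.linalg.solve(C, z + c)

# x* solves grad f(x) + A^T grad g*(A x) = 0.
x_star = np.linalg.solve(B + A.T @ np.linalg.solve(C, A),
                         -A.T @ np.linalg.solve(C, c))

gamma = rho + s_max**2 / alpha             # smoothness constant of P
eta1, eta2 = 0.1 / gamma, 2.0 / (alpha + beta)

def a(x):
    return np.linalg.norm(x - x_star)

def b(x, y):
    return np.linalg.norm(y - grad_g_conj(A @ x))

x, y = rng.standard_normal(d1), rng.standard_normal(d2)
holds = []
for _ in range(200):
    x_new = x - eta1 * (B @ x + A.T @ y)
    y_new = y + eta2 * (A @ x - (C @ y - c))
    bound_x = (1 - s_min**2 / beta * eta1) * a(x) + s_max * eta1 * b(x, y)
    bound_y = ((1 - alpha * eta2 + s_max**2 / alpha * eta1) * b(x, y)
               + s_max / alpha * gamma * eta1 * a(x))
    tol = 1e-9 * (1 + a(x) + b(x, y))      # slack for floating-point error
    holds.append(bool(a(x_new) <= bound_x + tol and b(x_new, y_new) <= bound_y + tol))
    x, y = x_new, y_new

print(all(holds))
```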

##### Step 3: Putting Things Together.

Now we are ready to finish the proof of Theorem 3.1. From Propositions 3.2 and 3.4 we have

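$$a_{t+1} \le \left(1 - \frac{\sigma_{\min}^2}{\beta}\eta_1\right)a_t + \sigma_{\max}\eta_1 b_t, \tag{11}$$

$$b_{t+1} \le \frac{\sigma_{\max}}{\alpha}\left(\rho + \frac{\sigma_{\max}^2}{\alpha}\right)\eta_1 a_t + \left(1 - \alpha\eta_2 + \frac{\sigma_{\max}^2}{\alpha}\eta_1\right)b_t, \tag{12}$$

where $a_t = \|x_t - x^*\|$ and $b_t = \|y_t - \nabla g^*(Ax_t)\|$.

To prove the convergence of the sequences $\{a_t\}$ and $\{b_t\}$ to $0$, we consider a linear combination $P_t = \lambda a_t + b_t$ with a free parameter $\lambda > 0$ to be determined. Combining (11) and (12), with some routine calculations, we can show that the choices of $\lambda$, $\eta_1$ and $\eta_2$ in Theorem 3.1 ensure $P_{t+1} \le (1-c)P_t$ for some $0 < c < 1$, as desired. We give the remaining details in Appendix A.1.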
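Inequalities (11) and (12) say that $(a_{t+1}, b_{t+1})$ is bounded entrywise by a nonnegative $2\times 2$ matrix applied to $(a_t, b_t)$, so it suffices that this matrix has spectral radius below $1$. The sketch below evaluates it for hypothetical problem constants; the step size rule for `eta1` is an assumption chosen so that $\det(I - M) > 0$, and is not the choice made in Theorem 3.1.

```python
import numpy as np

# Hypothetical problem constants (not taken from the paper).
alpha, beta, rho = 1.0, 3.0, 2.0
sigma_max, sigma_min = 2.0, 1.0

s = sigma_min**2 / beta                               # strong convexity of P
k = sigma_max / alpha * (rho + sigma_max**2 / alpha)  # coupling constant in (12)

eta2 = 2.0 / (alpha + beta)
# Assumed rule: half of the threshold at which det(I - M) changes sign.
eta1 = 0.5 * s * alpha * eta2 / (s * sigma_max**2 / alpha + sigma_max * k)

# (a_{t+1}, b_{t+1}) <= M (a_t, b_t) entrywise, from (11) and (12).
M = np.array([
    [1 - s * eta1, sigma_max * eta1],
    [k * eta1,     1 - alpha * eta2 + sigma_max**2 / alpha * eta1],
])
spectral_radius = max(abs(np.linalg.eigvals(M)))

# Iterate the worst-case recursion starting from a_0 = b_0 = 1.
v = np.array([1.0, 1.0])
for _ in range(20000):
    v = M @ v

print(spectral_radius, v)
```

Since the spectral radius is below one, both worst-case sequences decay geometrically. The rate here is slow because it is a worst-case bound under a conservative step size; the weight $\lambda$ in $P_t$ plays the role of a left eigenvector-like weighting of this matrix.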

## 4 Extension to Primal-Dual SVRG

In this section we consider the case where the saddle point problem (1) admits a finite-sum structure:

$$\min_{x\in\mathbb{R}^{d_1}}\max_{y\in\mathbb{R}^{d_2}} L(x,y) = \frac{1}{n}\sum_{i=1}^n L_i(x,y), \tag{13}$$

where $L_i(x,y) := f_i(x) + y^\top A_i x - g_i(y)$. (For ease of presentation we assume $f$, $g$ and $A$ can each be split into $n$ terms. It is not hard to generalize our analysis to the case where $f$, $g$ and $A$ can be split into different numbers of terms.) Optimization problems with finite-sum structure are ubiquitous in machine learning, because loss functions can often be written as a sum of individual loss terms corresponding to individual observations.

In this section, we make the following assumptions:

###### Assumption 4.1.

Each $f_i$ is $\rho$-smooth ($\rho \ge 0$), and $f = \frac{1}{n}\sum_{i=1}^n f_i$ is convex.

###### Assumption 4.2.

Each $g_i$ is $\beta$-smooth, and $g = \frac{1}{n}\sum_{i=1}^n g_i$ is $\alpha$-strongly convex ($\beta \ge \alpha > 0$).

###### Assumption 4.3.

Each $A_i$ satisfies $\sigma_{\max}(A_i) \le M$, and $A = \frac{1}{n}\sum_{i=1}^n A_i$ has rank $d_1$.

Note that we only require the component functions $f_i$ and $g_i$ to be smooth; they are not necessarily convex. However, the overall objective function $L(x,y)$ still has to satisfy Assumptions 3.1-3.3.

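Given the finite-sum structure (13), we denote the individual gradient of each $L_i$ as

$$B_i(x,y) := \begin{bmatrix} \nabla_x L_i(x,y) \\ \nabla_y L_i(x,y) \end{bmatrix} = \begin{bmatrix} \nabla f_i(x) + A_i^\top y \\ A_i x - \nabla g_i(y) \end{bmatrix},$$

and the full gradient of $L$ as

$$B(x,y) := \frac{1}{n}\sum_{i=1}^n B_i(x,y).$$

A naive computation of $A_i x$ or $A_i^\top y$ takes $O(d_1 d_2)$ time. However, in many applications like policy evaluation (Du et al., 2017) and empirical risk minimization, each $A_i$ is given as the outer product of two vectors (i.e., a rank-$1$ matrix), which makes $A_i x$ and $A_i^\top y$ computable in only $O(d)$ time, where $d = \max\{d_1, d_2\}$. In this case, computing an individual gradient takes $O(d)$ time while computing the full gradient takes $O(nd)$ time.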
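The saving from the rank-one structure comes from never forming the matrix. A minimal sketch, assuming a component of the form $A_i = u v^\top$ (the names `u` and `v` are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2 = 4, 6
u = rng.standard_normal(d2)     # A_i = u v^T is stored only through u and v
v = rng.standard_normal(d1)
x = rng.standard_normal(d1)
y = rng.standard_normal(d2)

# O(d) products that never form the d2-by-d1 matrix.
Ai_x = u * (v @ x)              # A_i x   = u (v^T x)
AiT_y = v * (u @ y)             # A_i^T y = v (u^T y)

# O(d1 * d2) reference computation with the explicit matrix.
A_i = np.outer(u, v)
print(np.allclose(Ai_x, A_i @ x), np.allclose(AiT_y, A_i.T @ y))
```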

We adapt the stochastic variance reduced gradient (SVRG) method (Johnson and Zhang, 2013) to solve Problem (13). The algorithm uses two layers of loops. In an outer loop, the algorithm first computes a full gradient using a “snapshot” point $(\tilde{x}, \tilde{y})$, and then executes $N$ inner loops, where $N$ is a parameter to be chosen. In each inner loop, the algorithm randomly samples an index $i$ from $[n]$ and updates the current iterate $(x, y)$ using a variance-reduced stochastic gradient:

$$B_i(x,y,\tilde{x},\tilde{y}) = B_i(x,y) + B(\tilde{x},\tilde{y}) - B_i(\tilde{x},\tilde{y}). \tag{14}$$

Here, $B_i(x,y)$ is the stochastic gradient at $(x,y)$ computed using the random index $i$, and $B(\tilde{x},\tilde{y}) - B_i(\tilde{x},\tilde{y})$ is a term used to reduce the variance in $B_i(x,y)$ while keeping $\mathbb{E}_i\left[B_i(x,y,\tilde{x},\tilde{y})\right] = B(x,y)$. The full details of the algorithm are provided in Algorithm 2. For clarity, we denote by $(\tilde{x}_t, \tilde{y}_t)$ the snapshot point in the $t$-th epoch (outer loop), and denote by $(x_{t,j}, y_{t,j})$, $j = 0, 1, \ldots, N$, all the intermediate iterates within this epoch.
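The two-loop structure described above can be sketched as follows. This is a minimal illustration and not a reproduction of Algorithm 2: the function names, the rule of taking the last inner iterate as the next snapshot, the toy components and the step sizes are all assumptions.

```python
import numpy as np

def B_i(i, x, y, grad_fs, As, grad_gs):
    """Individual gradient pair (grad_x L_i, grad_y L_i)."""
    return grad_fs[i](x) + As[i].T @ y, As[i] @ x - grad_gs[i](y)

def B_full(x, y, grad_fs, As, grad_gs):
    """Full gradient: the average of all individual gradients."""
    n = len(As)
    parts = [B_i(i, x, y, grad_fs, As, grad_gs) for i in range(n)]
    return sum(p[0] for p in parts) / n, sum(p[1] for p in parts) / n

def B_vr(i, x, y, x_snap, y_snap, full_snap, grad_fs, As, grad_gs):
    """Variance-reduced stochastic gradient, as in (14)."""
    gx, gy = B_i(i, x, y, grad_fs, As, grad_gs)
    sx, sy = B_i(i, x_snap, y_snap, grad_fs, As, grad_gs)
    return gx + full_snap[0] - sx, gy + full_snap[1] - sy

def primal_dual_svrg(grad_fs, As, grad_gs, x0, y0, eta1, eta2, num_epochs, N, rng):
    n = len(As)
    x_snap, y_snap = x0.copy(), y0.copy()
    for _ in range(num_epochs):
        full_snap = B_full(x_snap, y_snap, grad_fs, As, grad_gs)  # one full pass
        x, y = x_snap.copy(), y_snap.copy()
        for _ in range(N):
            i = rng.integers(n)
            vx, vy = B_vr(i, x, y, x_snap, y_snap, full_snap, grad_fs, As, grad_gs)
            x, y = x - eta1 * vx, y + eta2 * vy   # descent in x, ascent in y
        x_snap, y_snap = x, y   # assumed snapshot rule: last inner iterate
    return x_snap, y_snap

# Toy finite-sum instance (hypothetical data).
rng = np.random.default_rng(0)
n, d1, d2 = 5, 2, 3
As = [rng.standard_normal((d2, d1)) for _ in range(n)]
us = [rng.standard_normal(d1) for _ in range(n)]
cs = [rng.standard_normal(d2) for _ in range(n)]
grad_fs = [lambda x, u=u: u * (u @ x) for u in us]   # f_i(x) = 0.5 (u_i^T x)^2
grad_gs = [lambda y, c=c: y - c for c in cs]         # g_i(y) = 0.5 ||y||^2 - c_i^T y

# Unbiasedness: averaging (14) over all indices recovers the full gradient.
x, y = rng.standard_normal(d1), rng.standard_normal(d2)
x_snap, y_snap = rng.standard_normal(d1), rng.standard_normal(d2)
full_snap = B_full(x_snap, y_snap, grad_fs, As, grad_gs)
vr = [B_vr(i, x, y, x_snap, y_snap, full_snap, grad_fs, As, grad_gs) for i in range(n)]
mean_vr_x = sum(g[0] for g in vr) / n
mean_vr_y = sum(g[1] for g in vr) / n
full_x, full_y = B_full(x, y, grad_fs, As, grad_gs)
print(np.allclose(mean_vr_x, full_x), np.allclose(mean_vr_y, full_y))

x_out, y_out = primal_dual_svrg(grad_fs, As, grad_gs, np.zeros(d1), np.zeros(d2),
                                eta1=0.01, eta2=0.01, num_epochs=5, N=20, rng=rng)
```

The check in the middle confirms that the variance-reduced gradient is an unbiased estimate of the full gradient, which is the property the correction term is designed to preserve.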

The following theorem establishes the linear convergence guarantee of Algorithm 2.

###### Theorem 4.1.

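There exists a choice of parameters $\eta_1$, $\eta_2$ and $N$ in Algorithm 2, as well as another number $\mu$, such that for a potential function $Q_t$ defined from the snapshot point $(\tilde{x}_t, \tilde{y}_t)$ and $\mu$, Algorithm 2 guarantees that $Q_t$ decreases by a constant factor in every epoch, for all $t$.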

Since computing a full gradient takes $O(nd)$ time and each inner loop takes $O(d)$ time, each epoch takes $O(nd + Nd)$ time in total. Therefore, the total running time of Algorithm 2 is $O((n+N)d\log(1/\epsilon))$ in order to reach an $\epsilon$-close solution, which is the desired running time of SVRG (note that $N$ does not depend on $n$).

The proof of Theorem 4.1 is given in Appendix A.2. It relies on the same proof idea as in Section 3 as well as the standard analysis technique for SVRG by Johnson and Zhang (2013).