 # Stochastic Primal-Dual Algorithms with Faster Convergence than O(1/√(T)) for Problems without Bilinear Structure

Previous studies on stochastic primal-dual algorithms with faster convergence for min-max problems rely heavily on the bilinear structure of the problem, which restricts their applicability to a narrow range of problems. The main contribution of this paper is the design and analysis of new stochastic primal-dual algorithms that use a mixture of stochastic gradient updates and a logarithmic number of deterministic dual updates for solving a family of convex-concave problems with no bilinear structure assumed. Convergence rates faster than O(1/√(T)), with T being the number of stochastic gradient updates, are established under some mild conditions on the involved functions of the primal and the dual variable. For example, for a family of problems that enjoy a weak strong convexity in terms of the primal variable and have a strongly concave function of the dual variable, the convergence rate of the proposed algorithm is O(1/T). We also investigate the effectiveness of the proposed algorithms for learning robust models and empirical AUC maximization.



## 1 Introduction

This paper is motivated by solving the following convex-concave problem:

$$\min_{x\in X}\ \max_{y\in \mathrm{dom}(\phi^*)}\ y^\top \ell(x) - \phi^*(y) + g(x) \tag{1}$$

where $X$ is a closed convex set, $\ell(x)$ is a lower-semicontinuous mapping whose component functions $\ell_i(x)$ are lower-semicontinuous and convex, $\phi^*$ is a convex function whose convex conjugate is denoted by $\phi$, and $g(x)$ is a lower-semicontinuous convex function. To ensure the convexity of the problem, it is assumed that every $y \in \mathrm{dom}(\phi^*)$ is nonnegative if $\ell$ is not an affine function. By using the convex conjugate $\phi$, the problem (1) is equivalent to the following convex minimization problem:

$$\min_{x\in X}\ P(x) := \phi(\ell(x)) + g(x). \tag{2}$$

A particular family of the min-max problem (1) and its minimization form (2) that has been considered extensively in the literature Zhang and Lin (2015); Yu et al. (2015); Tan et al. (2018); Shalev-Shwartz and Zhang (2013); Lin et al. (2014) is the case in which $\ell(x) = Ax + b$ is an affine function and $\phi$ is decomposable, i.e., $\phi(s) = \frac{1}{n}\sum_{i=1}^n \phi_i(s_i)$. In this case, the problem (2) is known as the (regularized) empirical risk minimization problem in machine learning:

$$\min_{x\in X}\ \frac{1}{n}\sum_{i=1}^{n} \phi_i(a_i^\top x + b_i) + g(x), \tag{3}$$

where $a_i^\top$ is the $i$-th row of $A$ and $b_i$ is the $i$-th element of $b$.

However, stochastic optimization algorithms with fast convergence rates are still under-explored for a more challenging family of problems of the form (1) and (2), where $\ell$ is not necessarily an affine or smooth function and $\phi$ is not necessarily decomposable. It is our goal to design new stochastic primal-dual algorithms for solving these problems with a fast convergence rate. A key motivating example of the considered problem is the following distributionally robust optimization problem:

$$\min_{x\in X}\ \max_{y\in\Delta_n}\ \sum_{i=1}^{n} y_i \ell_i(x) - V(y, y_0) + g(x), \tag{4}$$

where $\Delta_n$ is a simplex, and $V(y, y_0)$ denotes a divergence measure (e.g., a $\phi$-divergence) between two sets of probabilities $y$ and $y_0$. In machine learning, with $\ell_i(x)$ denoting the loss of a model $x$ on the $i$-th example, the above problem corresponds to the robust risk minimization paradigm, which can achieve variance-based regularization for learning a predictive model from $n$ examples Namkoong and Duchi (2017). Other examples of the considered challenging problems can be found in robust learning from multiple perturbed distributions Chen et al. (2017a), in which $\ell_i(x)$ corresponds to the loss from the $i$-th perturbed distribution, and in minimizing non-decomposable loss functions Fan et al. (2017); Dekel and Singer (2006).

With stochastic (sub)gradients computed for $x$ and $y$, one can employ the conventional primal-dual stochastic gradient method or its variants Nemirovski et al. (2009); Juditsky et al. (2011) for solving the problem (1). Under appropriate basic assumptions, one can derive the standard convergence rate of $O(1/\sqrt{T})$, with $T$ being the number of stochastic updates. However, the $O(1/\sqrt{T})$ rate is known to be slow, and it is always desirable to design optimization algorithms with faster convergence. Nonetheless, to the best of our knowledge, stochastic primal-dual algorithms with a fast convergence rate of $O(1/T)$ in terms of minimizing $P(x)$ remain unknown in general, even under the strong convexity of $\phi^*$ and $g$. In contrast, if $\phi$ is decomposable and $g$ is strongly convex, the standard stochastic gradient method for solving (2) with an appropriate step-size scheme has a convergence rate of $O(1/T)$ Hazan et al. (2007); Hazan and Kale (2011a). A direct extension of the algorithms and analysis for stochastic strongly convex minimization to stochastic convex-concave optimization does not give a satisfactory convergence rate. (Footnote 1: One may obtain a dimensionality-dependent convergence rate by following conventional analysis, but it is not the standard dimensionality-independent rate that we aim to achieve.) It is still an open problem whether there exists a stochastic primal-dual algorithm for solving the convex-concave problem (1) that enjoys a fast rate of $O(1/T)$ in terms of minimizing $P(x)$.

The major contribution of this paper is to fill this gap by developing stochastic primal-dual algorithms for solving (1) that enjoy faster convergence than $O(1/\sqrt{T})$ in terms of the primal objective gap. In particular, under the assumptions that $\nabla\phi$ is Lipschitz continuous, the $\ell_i$ are Lipschitz continuous, and the minimization problem (2) satisfies the strong convexity condition, the proposed algorithms enjoy an iteration complexity of $O(1/\epsilon)$ for finding a solution $x$ such that $\mathrm{E}[P(x) - P_*] \le \epsilon$, where $P_*$ denotes the optimal value of (2). This corresponds to a faster convergence rate of $O(1/T)$. The key difference between the proposed algorithms and the traditional stochastic primal-dual algorithm is that they require a logarithmic number of deterministic updates for $y$ of the following form:

$$A(x) = \arg\max_{y\in \mathrm{dom}(\phi^*)}\ y^\top \ell(x) - \phi^*(y), \tag{5}$$

which can be usually solved in time complexity. It would be worth noting that (See Appendix A). When is a moderate number, the proposed algorithms could converge faster than the traditional primal-dual stochastic gradient method. It is also important to note that we do not assume the proximal mapping of and can be easily computed. Instead, our algorithms only require (stochastic) sub-gradients of and , which make them applicable and efficient for solving more challenging problems where is an empirical sum of individual functions.
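As an illustration of the deterministic dual update (5), consider the distributionally robust problem (4) with a specific divergence. The sketch below assumes $V(y, y_0) = \lambda\,\mathrm{KL}(y \,\|\, y_0)$ with a hypothetical weight $\lambda > 0$; this choice and the function names are illustrative assumptions, not taken from the paper. Under that assumption the maximizer over the simplex is $y_i \propto y_{0,i}\exp(\ell_i/\lambda)$ and the optimal value is $\lambda \log \sum_i y_{0,i}\exp(\ell_i/\lambda)$, so the update costs one pass over the loss vector.

```python
import numpy as np

def dual_update_kl(losses, y0, lam):
    """Maximizer over the simplex of y^T losses - lam * KL(y || y0).

    Assumption for illustration: V(y, y0) = lam * KL(y || y0).
    """
    z = losses / lam + np.log(y0)
    z = z - z.max()              # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

def dual_objective(y, losses, y0, lam):
    """Value of y^T losses - lam * KL(y || y0); zero entries of y add 0 to KL."""
    mask = y > 0
    kl = np.sum(y[mask] * np.log(y[mask] / y0[mask]))
    return y @ losses - lam * kl

def phi_closed_form(losses, y0, lam):
    """Optimal value: lam * log(sum_i y0_i * exp(losses_i / lam))."""
    z = losses / lam
    m = z.max()
    return lam * (m + np.log(np.sum(y0 * np.exp(z - m))))
```

With this choice the deterministic update is a softmax-type reweighting of the examples, which puts more mass on examples with larger loss.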

In addition, the proposed algorithms and theories can be easily extended to the case in which $\nabla\phi$ is Hölder continuous and the minimization problem (2) satisfies a more general local error bound condition as defined later, with intermediate faster rates established.

## 2 Related Work

Stochastic primal-dual gradient method and its variant were first analyzed by Nemirovski et al. (2009) for solving a more general problem . Under the standard bounded stochastic (sub)-gradient assumption, a convergence rate of was established for a primal-dual gap, which implies a convergence rate of for minimizing the primal objective . Later, there are couple of studies that aim to strengthen this convergence rate by leveraging the smoothness of or the involved function when there is a special structure of the objective function Juditsky et al. (2011); Chen et al. (2014, 2017b). However, the worst-case convergence rate of these later algorithms is still dominated by . Without smoothness assumption on or a bilinear structure, these later algorithms are not directly applicable to solving (1). In addition, Frank Wolfe algorithms are analyzed for saddle point problems in Gidel et al. (2016), which could also achieve a convergence rate of in terms of primal-dual gap under the smoothness condition.

Recently, there emerge several algorithms with faster convergence for solving (1) by leveraging the bilinear structure and strong convexity of and . For example, Zhang and Lin (2015) proposed a stochastic primal-dual coordinate (SPDC) method for solving (3) under the condition that is strongly convex. When is also a strongly convex function, SPDC enjoys a linear convergence for the primal-dual gap. Other variants of SPDC have been considered in (Yu et al., 2015; Tan et al., 2018) for solving (1) with bilinear structure. Palaniappan and Bach (2016) proposed stochastic variance reduction methods for solving a family of saddle-point problems. When applied to (1), they require is either an affine function or a smooth mapping. If additionally and are strongly convex, their algorithms also enjoy a linear convergence for finding a solution that is -close to the optimal solution in squared Euclidean distance. Du and Hu (2018) established a similar linear convergence of a primal-dual SVRG algorithm for solving (1) when is an affine function with a full column rank for , is smooth, and is smooth and strongly convex, which are stronger assumptions than ours. All of these algorithms except (Du and Hu, 2018) also need to compute the proximal mapping of and at each iteration. In contrast, the present work is complementary to these studies aiming to solve a more challenging family of problems. In particular, the proposed algorithms do not require the bilinear structure or the smoothness of , and the smoothness and strong convexity of and are also not necessary. In addition, we do not assume that and have an efficient proximal mapping.

Several recent studies have been devoted to stochastic AUC optimization based on a min-max formulation that has a bilinear structure (Liu et al., 2018a; Natole et al., 2018), aiming to derive a faster convergence rate of $O(1/T)$. The differences from the present work are that (i) (Liu et al., 2018a)’s analysis is restricted to the online setting for AUC optimization; (ii) (Natole et al., 2018) only proves a convergence rate of $O(1/T)$ in terms of the squared distance of the found primal solution to the optimal solution under the strong convexity of the regularizer on the primal variable, which is weaker than our results on the convergence of the primal objective gap. To the best of our knowledge, the present work is the first to establish a convergence rate of $O(1/T)$ in terms of minimizing $P(x)$ for stochastic primal-dual methods that solve a general convex-concave problem (1) without bilinear structure or a smoothness assumption on $\ell$, under (weakly local) strong convexity.

Restart schemes have recently been considered to obtain improved convergence rates under some conditions. In Roulet and d’Aspremont (2017), a restart scheme is analyzed for smooth convex problems under the sharpness and Hölder continuity conditions. In Dvurechensky et al. (2018), a universal algorithm is proposed for variational inequalities under a Hölder continuity condition where the Hölder parameters are unknown. Stochastic algorithms are proposed for strongly convex stochastic composite problems in Ghadimi and Lan (2012, 2013).

Finally, we would like to mention that our algorithms and techniques share many similarities with those proposed in Xu et al. (2017) for solving stochastic convex minimization problems under the local error bound condition. However, their algorithms are not directly applicable to the convex-concave problem (1) or to the problem (2) with a non-decomposable function $\phi$. The novelty of this work is the design and analysis of new algorithms that can leverage the weak local strong convexity, or a more general local error bound condition, of the primal minimization problem (2) through solving the convex-concave problem (1) to enjoy faster convergence.

## 3 Preliminaries

Recall the problem of interest:

$$\min_{x\in X}\Big\{ P(x) = \phi(\ell(x)) + g(x) = \max_{y\in Y}\ \underbrace{y^\top \ell(x) - \phi^*(y) + g(x)}_{f(x,y)} \Big\}, \tag{6}$$

where $Y = \mathrm{dom}(\phi^*)$. Let $X_*$ denote the optimal set of the primal variable for the above problem, let $P_*$ denote the optimal primal objective value, and let $x_*$ be the optimal solution closest to $x$, where closeness is measured in the Euclidean norm $\|\cdot\|$.

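Let $\Pi_{\Omega}[\cdot]$ denote the projection onto the set $\Omega$. Denote by $\mathcal{L}_\epsilon$ and $\mathcal{S}_\epsilon$ the $\epsilon$-level set and $\epsilon$-sublevel set of the primal problem, respectively. A function $f$ is $L$-smooth if it is differentiable and its gradient is $L$-Lipschitz continuous, i.e., $\|\nabla f(x_1) - \nabla f(x_2)\| \le L\|x_1 - x_2\|$. A differentiable function $f$ is said to have an $(L, v)$-Hölder continuous gradient iff $\|\nabla f(x_1) - \nabla f(x_2)\| \le L\|x_1 - x_2\|^{v}$. When $v = 1$, a Hölder continuous gradient reduces to a Lipschitz continuous gradient. A function $f$ is called $\lambda$-strongly convex if there exists $\lambda > 0$ such that for any $x_1, x_2$

$$f(x_1) \ge f(x_2) + \partial f(x_2)^\top (x_1 - x_2) + \frac{\lambda}{2}\|x_1 - x_2\|^2,$$

where $\partial f(x_2)$ denotes any subgradient of $f$ at $x_2$. A more general definition is uniform convexity: $f$ is uniformly convex with degree $p$ if there exists $\lambda > 0$ such that for any $x_1, x_2$

$$f(x_1) \ge f(x_2) + \partial f(x_2)^\top (x_1 - x_2) + \frac{\lambda}{2}\|x_1 - x_2\|^{p}.$$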

For the analysis of the proposed algorithms, we need a few basic notions about the convex conjugate. For an extended real-valued convex function $h$, the convex conjugate of $h$ is defined as

$$h^*(y) = \max_{x}\ y^\top x - h(x).$$

The convex conjugate of $h^*$ is $h$. Due to convex duality, if $h$ is $\lambda$-strongly convex then $h^*$ is differentiable and $(1/\lambda)$-smooth. More generally, if $h$ is uniformly convex with degree $p$ then $h^*$ is differentiable and its gradient is $(L, v)$-Hölder continuous, where $v = 1/(p-1)$ and $L$ is a constant determined by $\lambda$ and $p$ (Nesterov, 2015).
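A short worked example may help make the strong convexity and smoothness duality concrete. The quadratic below is chosen purely for illustration and is not an example from the paper.

```latex
% Illustrative example: a lambda-strongly convex quadratic and its conjugate.
\[
  h(x) = \frac{\lambda}{2}\|x\|^2
  \quad\Longrightarrow\quad
  h^*(y) = \max_{x}\; y^\top x - \frac{\lambda}{2}\|x\|^2
         = \frac{1}{2\lambda}\|y\|^2 ,
\]
% The maximum is attained at x = y / lambda, and the gradient of h^* is
\[
  \nabla h^*(y) = \frac{y}{\lambda},
  \qquad
  \|\nabla h^*(y_1) - \nabla h^*(y_2)\| = \frac{1}{\lambda}\|y_1 - y_2\| ,
\]
% so h^* is (1/lambda)-smooth, matching the general statement.
```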

One of the conditions that allows us to derive a fast rate of $O(1/T)$ for a stochastic algorithm is that both $\phi^*$ and $g$ are strongly convex, which implies that $f(x, y)$ is strongly convex in terms of $x$ and strongly concave in terms of $y$. One might regard this as a trivial task given the result for stochastic strongly convex minimization, where a stochastic gradient is available for the objective function to be minimized Hazan et al. (2007); Hazan and Kale (2011a). However, the analysis for stochastic strongly convex minimization is not directly applicable to stochastic primal-dual algorithms, as briefly explained later when we present our results.

Moreover, the strong convexity of $g$ can be relaxed to a weak strong convexity of $P(x)$ to derive a similar order of convergence rate, i.e., for any $x \in X$, we have

$$\mathrm{dist}(x, X_*) \le c\,(P(x) - P_*)^{1/2},$$

where $\mathrm{dist}(x, X_*)$ is the distance between $x$ and the optimal set $X_*$. More generally, we can consider a setting in which $P(x)$ satisfies a local error bound (or local growth) condition as defined below.

###### Definition 1.

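A function $P(x)$ is said to satisfy the local error bound (LEB) condition if for any $x$ in the $\epsilon$-sublevel set $\mathcal{S}_\epsilon$ of $P$,

$$\mathrm{dist}(x, X_*) \le c\,(P(x) - P_*)^{\theta}, \tag{7}$$

where $c > 0$ is a constant and $\theta$ is a parameter.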

This condition was recently studied in Yang and Lin (2018) for developing a faster subgradient method than the standard subgradient method, and was later considered in Xu et al. (2017) for stochastic convex optimization. A global version of the above condition (known as the global error bound condition) has a long history in mathematical programming (Pang, 1997). However, exploiting this condition for developing stochastic primal-dual algorithms seems to be new. When $\theta = 1/2$, the above condition is also referred to as weakly local strong convexity. When $\theta = 0$, it can capture general convex functions as long as $\mathrm{dist}(x, X_*)$ is upper bounded for $x \in \mathcal{S}_\epsilon$, which is true if $X_*$ is compact or $X$ is compact.
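To make the role of the exponent in the error bound concrete, here is a one-dimensional worked example. The function is chosen for illustration only and is not taken from the paper.

```latex
% Illustrative example: P(x) = |x|^q on X = R with q >= 1.
% The optimal set is X_* = {0} and the optimal value is P_* = 0, so
\[
  \mathrm{dist}(x, X_*) = |x| = \bigl(|x|^{q}\bigr)^{1/q}
                        = \bigl(P(x) - P_*\bigr)^{1/q},
\]
% i.e. the error bound (7) holds with c = 1 and theta = 1/q.
% q = 2 (a quadratic) gives theta = 1/2, the weakly strongly convex case;
% q = 1 (the absolute value) gives theta = 1, a sharp minimum.
```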

In parallel with the relaxed condition on , we can also relax the smoothness condition on or strong convexity condition on to Hölder continuous gradient condition on or a uniformly convexity condition on . Under the local error bound condition of and the Hölder continuous gradient condition of , we are able to develop stochastic primal-dual algorithms with intermediate complexity depending on and , which varies from to .

Formally, we will develop stochastic primal-dual algorithms for solving (6) under the following assumptions.

###### Assumption 1.

For Problem (6), we assume

1. There exist $x_0 \in X$ and $\epsilon_0 \ge 0$ such that $P(x_0) - P_* \le \epsilon_0$;

2. Let and denote the stochastic subgradient of w.r.t. and , respectively. There exists constants and such that and .

3. is -uniformly convex with such that has -Hölder continuous gradient where and .

4. $\ell_i$ is $G$-Lipschitz continuous for every component $i$ of $\ell$.

5. One of the following conditions hold: (i) is -strongly convex; (ii) satisfies the LEB condition for and .

Remark. Assumption 1 (1) assumes that there is a lower bound of $P_*$, which is usually satisfied in machine learning problems. Assumption 1 (2) is a common assumption made in existing stochastic methods. Note that we do not assume that $\phi^*$ and $g$ have efficient proximal mappings. Instead, we only require stochastic subgradients with respect to $x$ and $y$. Assumption 1 (3) is a general condition that unifies both smooth and non-smooth assumptions on $\phi$. When $v = 1$, $\phi$ satisfies the classical smoothness condition with parameter $L$. When $v = 0$, it is the classical non-smooth assumption on the boundedness of the subgradients. We will state our convergence results in terms of $v$ and $L$ instead of $p$ and $\lambda$. Assumption 1 (4) on the Lipschitz continuity of $\ell$ is more general than assuming a bilinear form. Finally, we note that assuming the strong convexity of $P$ allows us to develop a stochastic primal-dual algorithm with simpler updates.

## 4 Main Results

In this section, we present our main results for solving (6). Our development is divided into three parts. First, we present a stochastic primal-dual algorithm and its convergence result when the primal objective function $P$ is strongly convex and $\phi^*$ is also strongly convex. Then we extend the result to a more general case, in which $P$ satisfies the LEB condition and $\phi^*$ is uniformly convex. Lastly, we propose an adaptive variant with the same order of convergence for the case in which the parameters of the LEB condition are unknown.

### 4.1 Restarted Stochastic Primal-Dual Algorithm for Strongly Convex P

The detailed updates of the proposed stochastic algorithm for strongly convex $P$ are presented in Algorithm 1, which we refer to as the restarted stochastic primal-dual algorithm, or RSPD for short. The algorithm is based on a restarting idea that has been used widely in existing studies Hazan and Kale (2011b); Ghadimi and Lan (2013); Xu et al. (2017); Yang and Lin (2018). It runs epoch-wise and has two loops. The steps 3-7 are the standard updates of the stochastic primal-dual subgradient method Nemirovski et al. (2009). However, the key difference from these previous studies is that the restarted solution $y_0^{(s+1)}$ for the dual variable for the next epoch is computed based on the averaged primal variable $\bar{x}_s$ of the $s$-th epoch. It is this step that exploits the strong convexity of $\phi^*$, which together with the restarting scheme allows us to exploit the strong convexity of $P$ to derive a fast convergence rate of $O(1/T)$, with $T$ being the total number of iterations.
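The epoch structure described above can be sketched in a few lines. Algorithm 1 itself is not reproduced in this text, so the following is only a minimal illustration under stated assumptions, not the authors' exact method: the epoch length doubles and the step size halves after each epoch (a hypothetical schedule), the stochastic oracle is the exact (sub)gradient plus Gaussian noise, and the test problem uses $\ell_i(x) = |a_i^\top x - b_i|$, $\phi^*(y) = \frac{1}{2}\|y\|^2$ on $y \ge 0$, and $g(x) = \frac{\mu}{2}\|x\|^2$ with $X = \mathbb{R}^d$. For this instance the deterministic dual restart is $A(x) = \ell(x)$ and $P(x) = \frac{1}{2}\|Ax - b\|^2 + \frac{\mu}{2}\|x\|^2$.

```python
import numpy as np

def primal_objective(x, A, b, mu):
    # P(x) = phi(l(x)) + g(x) with phi(s) = 0.5*||max(s, 0)||^2 and l_i(x) = |a_i^T x - b_i|
    return 0.5 * np.sum((A @ x - b) ** 2) + 0.5 * mu * (x @ x)

def dual_restart(x, A, b):
    # A(x) = argmax_{y >= 0} y^T l(x) - 0.5*||y||^2 = l(x), because l(x) >= 0
    return np.abs(A @ x - b)

def rspd_sketch(A, b, mu, n_epochs=6, T0=100, eta0=0.05, noise=0.1, seed=0):
    """Restarted stochastic primal-dual sketch (schedule and oracle are assumptions)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x0 = np.zeros(d)
    y0 = dual_restart(x0, A, b)
    eta, T = eta0, T0
    for _ in range(n_epochs):
        x, y = x0.copy(), y0.copy()
        x_sum = np.zeros(d)
        for _ in range(T):
            x_sum += x
            r = A @ x - b
            # noisy, unbiased (sub)gradients of f(x, y) in x and in y
            gx = A.T @ (y * np.sign(r)) + mu * x + noise * rng.normal(size=d)
            gy = np.abs(r) - y + noise * rng.normal(size=n)
            x = x - eta * gx                     # primal descent step (X = R^d here)
            y = np.maximum(y + eta * gy, 0.0)    # dual ascent step, projected onto y >= 0
        x0 = x_sum / T                  # restart the primal variable at the epoch average
        y0 = dual_restart(x0, A, b)     # deterministic dual restart at the averaged primal
        eta, T = eta / 2, 2 * T         # hypothetical schedule: halve step, double epoch
    return x0
```

The only non-stochastic work is the call to `dual_restart` once per epoch, which is the logarithmic number of deterministic dual updates that the text refers to.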

Below, we briefly discuss the path for proving the fast convergence rate of RSPD. We first show why the standard analysis for strongly convex minimization cannot be generalized to the stochastic convex-concave problem to derive the fast convergence rate of $O(1/T)$. Let $\nabla_{x,t}$ denote the stochastic subgradient with respect to $x$ used at iteration $t$, and similarly $\nabla_{y,t}$ for $y$. A standard convergence analysis for the inner loop (steps 3-6) of Algorithm 1 usually starts from the following inequalities.

###### Lemma 1.

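For the updates in Steps 4 and 5, omitting the epoch index $s$, the following holds for any $x \in X$ and $y \in Y$:

$$\nabla_{x,t}^\top (x_t - x) \le \frac{\|x_t - x\|^2 - \|x_{t+1} - x\|^2}{2\eta_x} + \frac{\eta_x M^2}{2}, \tag{8}$$

$$\nabla_{y,t}^\top (y - y_t) \le \frac{\|y_t - y\|^2 - \|y_{t+1} - y\|^2}{2\eta_y} + \frac{\eta_y B^2}{2}. \tag{9}$$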
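Inequality (8) can be checked numerically for a single projected subgradient step. The sketch below assumes that the primal update has the form $x_{t+1} = \Pi_X[x_t - \eta_x \nabla_{x,t}]$ with $X$ a Euclidean ball (the ball, its radius, and the function names are assumptions for illustration), and it uses $M = \|\nabla_{x,t}\|$ as the bound on the subgradient norm.

```python
import numpy as np

def project_ball(z, radius):
    """Euclidean projection onto the ball {x : ||x|| <= radius}."""
    norm = np.linalg.norm(z)
    return z if norm <= radius else z * (radius / norm)

def lemma1_slack(x_t, x, g, eta, radius):
    """Right-hand side minus left-hand side of (8) for one projected step.

    Uses M = ||g||; the result should be nonnegative whenever x lies in the ball.
    """
    x_next = project_ball(x_t - eta * g, radius)
    lhs = g @ (x_t - x)
    rhs = (np.sum((x_t - x) ** 2) - np.sum((x_next - x) ** 2)) / (2 * eta) \
        + eta * (g @ g) / 2
    return rhs - lhs
```

The slack is nonnegative because the projection is nonexpansive, so $\|x_{t+1} - x\|^2 \le \|x_t - \eta_x \nabla_{x,t} - x\|^2$, and expanding the square on the right gives (8).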

For stochastic strongly convex minimization problems, in which $y$ is absent in the above inequalities, one can take expectation over (8) and then apply the $\lambda$-strong convexity of $f$ to get the following inequality:

$$\mathrm{E}[f(x_t) - f(x)] \le \frac{\|x_t - x\|^2 - \|x_{t+1} - x\|^2}{2\eta_x} + \frac{\eta_x M^2}{2} - \frac{\lambda\|x_t - x\|^2}{2}.$$

Based on the above inequalities for all $t$, one can design a particular step-size scheme that yields a fast convergence rate. However, such an analysis cannot be extended to the primal-dual case.

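A naive approach would be to take expectation of both (8) and (9) for fixed $x$ and $y$ and to apply the $\lambda_x$-strong convexity (resp. $\lambda_y$-strong concavity) of $f$ in terms of $x$ (resp. $y$), which yields the following inequalities:

$$\mathrm{E}[f(x_t, y_t) - f(x, y_t)] \le \frac{\|x_t - x\|^2 - \|x_{t+1} - x\|^2}{2\eta_x} + \frac{\eta_x M^2}{2} - \frac{\lambda_x\|x_t - x\|^2}{2},$$

$$\mathrm{E}[f(x_t, y) - f(x_t, y_t)] \le \frac{\|y_t - y\|^2 - \|y_{t+1} - y\|^2}{2\eta_y} + \frac{\eta_y B^2}{2} - \frac{\lambda_y\|y_t - y\|^2}{2}.$$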

It is notable that in deriving the above inequalities, and have to be independent of .

By adding the above inequalities together and applying the same analysis for the R.H.S with and , we can obtain the following inequalities for any fixed and independent of :

$$\mathrm{E}\big[f(\hat{x}_T, y) - f(x, \hat{y}_T)\big] \le O\!\left(\frac{\log T}{T}\right), \tag{10}$$

where and . However, the above inequality does not imply the convergence for the standard definition of primal-dual gap of or even the primal objective gap . The main obstacle is that we cannot set which will make depend on and hence make the expectional analysis fail. It would be worth noting that following Gidel et al. (2016), one could derive the upper bound of primal-dual gap of by (see Equation (5), (13) and (14) therein), where can be upper bounded by a constant and . Even if one sets and in (10), the convergence rate of primal-dual gap is only of , which is not what we pursue.

Another approach that gets around of the issue introduced by taking the expectation is by using high probability analysis. To this end, one can use concentration inequalities to bound the martingale difference sequence and for a fixed and  (Kakade and Tewari, 2008). However, in order to prove the primal objective gap one has to bound the later martingale difference sequence for any possible so that one can get from . A standard approach for achieving this high probability bound is by using a covering number argument for the set . However, this will inevitably introduce dependence on the dimensionality of . For example, an -cover of a bounded ball of radius in has cardinality of , and of a simplex in has cardinality of .

To tackle the aforementioned challenges for both the expectational analysis and the high probability analysis, we develop a different analysis for the proposed RSPD algorithm in order to achieve a faster convergence rate of $O(1/T)$ without explicit dependence on the dimensionality of $y$. In this subsection, we focus on the convergence result in expectation, which will be extended to high probability convergence in the next subsection. Our analysis in expectation is built on the following lemma, which is used to derive the convergence rate in the literature (Nemirovski et al., 2009).

###### Lemma 2.

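Let Lines 4 and 5 of Algorithm 1 run for $T$ iterations with fixed step sizes $\eta_x$ and $\eta_y$. Then

$$\mathrm{E}\Big[\max_{y\in Y} f(\bar{x}_T, y) - f(x_*, \bar{y}_T)\Big] \le \frac{\mathrm{E}[\|x_* - x_0\|^2]}{\eta_x T} + \frac{\mathrm{E}[\|\hat{y}_T - y_0\|^2]}{\eta_y T} + \frac{5\eta_x M^2}{2} + \frac{5\eta_y B^2}{2}, \tag{11}$$

where $\bar{x}_T$ and $\bar{y}_T$ are the averages of the primal and dual iterates, respectively, and $\hat{y}_T = \arg\max_{y\in Y} f(\bar{x}_T, y)$.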

Remark: A nice property of the above result is that the max over in the L.H.S is taken before expectation.

Nevertheless, a simple approach for setting the step size as still yields a convergence rate of by assuming the size of is bounded (Nemirovski et al., 2009). The proposed RSPD algorithm has the special design of computing the restarted solutions and setting the step sizes, which together allows us to achieve convergence rate as stated in the following theorem. The key idea is that by using as a restarted point for the dual variable, we are able to connect to by using the strong convexity of and of . The convergence result of RSPD is presented below.

###### Theorem 2.

Suppose that Assumption 1 holds with and being -strongly convex. By setting and , then Algorithm 1 guarantees that . The total number of iterations is .

Remark. The equivalent convergence rate of the above result is $O(1/T)$ given a total number of iterations $T$. This matches the state-of-the-art convergence result for stochastic strongly convex minimization (Hazan and Kale, 2011b). Our algorithm can be applied to solving (2) for non-decomposable $\phi$. In contrast to the standard stochastic primal-dual subgradient method, the additional computational overhead in RSPD is introduced by computing the restarted points $y_0^{(s)}$. However, such computation only happens a logarithmic number of times, on the order of $\log(\epsilon_0/\epsilon)$. We defer the discussion on the total time complexity of RSPD to the next section for some particular applications.

###### Proof.

To prove Theorem 2, we first need Lemma 2. Its proof will be given in Appendix B.

Let , by the setting of Algorithm 1, we know , , and for . We will show by induction for . It is easy to verify for a sufficiently large according to Assumption 1. Next, we need to show that conditional on , then we have

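$$\mathrm{E}\big[P(x_0^{(s+1)})\big] - P_* \le \epsilon_s.$$

Consider the update of the $s$-th stage. By Lemma 2 applied to the $s$-th stage, we have

$$\mathrm{E}\big[f(\bar{x}_s, \hat{y}(\bar{x}_s)) - f(x_*, \bar{y}_s)\big] \le \frac{\mathrm{E}[\|x_* - x_0^{(s)}\|^2]}{\eta_{x,s} T_s} + \frac{\mathrm{E}[\|\hat{y}(\bar{x}_s) - y_0^{(s)}\|^2]}{\eta_{y,s} T_s} + \frac{5\eta_{x,s} M^2}{2} + \frac{5\eta_{y,s} B^2}{2}.$$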

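Since $f(\bar{x}_s, \hat{y}(\bar{x}_s)) = \max_{y \in Y} f(\bar{x}_s, y) = P(\bar{x}_s)$ and $f(x_*, \bar{y}_s) \le \max_{y \in Y} f(x_*, y) = P_*$, we have

$$\mathrm{E}[P(\bar{x}_s) - P_*] \le \frac{\mathrm{E}[\|x_* - x_0^{(s)}\|^2]}{\eta_{x,s} T_s} + \frac{\mathrm{E}[\|\hat{y}(\bar{x}_s) - y_0^{(s)}\|^2]}{\eta_{y,s} T_s} + \frac{5\eta_{x,s} M^2}{2} + \frac{5\eta_{y,s} B^2}{2}. \tag{12}$$

For the first term on the RHS of (12), by the strong convexity of $P$ and the condition $\mathrm{E}[P(x_0^{(s)}) - P_*] \le \epsilon_{s-1}$, we have

$$\mathrm{E}[\|x_* - x_0^{(s)}\|^2] \le \mathrm{E}\Big[\frac{2}{\mu}\big(P(x_0^{(s)}) - P_*\big)\Big] \le \frac{2\epsilon_{s-1}}{\mu}.$$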

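For the second term on the RHS of (12),

$$\|\hat{y}(\bar{x}_s) - y_0^{(s)}\|^2 = \|\nabla\phi(\ell(\bar{x}_s)) - \nabla\phi(\ell(x_0^{(s)}))\|^2 \le L^2\|\ell(\bar{x}_s) - \ell(x_0^{(s)})\|^{2v} = L^2\|\ell(\bar{x}_s) - \ell(x_0^{(s)})\|^{2} \le L^2G^2\|\bar{x}_s - x_0^{(s)}\|^2,$$

where the first equality is due to the setup of the algorithm and Lemma 5, the first inequality is due to the Hölder continuity of $\nabla\phi$, the second equality holds because $\phi$ is smooth ($v = 1$), and the last inequality is due to the $G$-Lipschitz continuity of $\ell$. Since $P$ is strongly convex with parameter $\mu$, its optimal solution $x_*$ is unique, and we have

$$\mathrm{E}[\|\hat{y}(\bar{x}_s) - y_0^{(s)}\|^2] \le 2L^2G^2\big(\mathrm{E}[\|\bar{x}_s - x_*\|^2] + \mathrm{E}[\|x_* - x_0^{(s)}\|^2]\big) \le \frac{4L^2G^2}{\mu}\big(\mathrm{E}[P(\bar{x}_s) - P_*] + \mathrm{E}[P(x_0^{(s)}) - P_*]\big) \le \frac{4L^2G^2}{\mu}\big(\mathrm{E}[P(\bar{x}_s) - P_*] + \epsilon_{s-1}\big).$$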

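Then the inequality (12) becomes

$$\mathrm{E}[P(\bar{x}_s) - P_*] \le \frac{2\epsilon_{s-1}}{\mu\eta_{x,s}T_s} + \frac{4L^2G^2\big(\mathrm{E}[P(\bar{x}_s) - P_*] + \epsilon_{s-1}\big)}{\mu\eta_{y,s}T_s} + \frac{5\eta_{x,s}M^2}{2} + \frac{5\eta_{y,s}B^2}{2}.$$

By the setting of $\eta_{x,s}$, $\eta_{y,s}$, and $T_s$, we know $\frac{4L^2G^2}{\mu\eta_{y,s}T_s} \le \frac{1}{9}$, then

$$\mathrm{E}[P(\bar{x}_s) - P_*] \le \frac{9\epsilon_{s-1}}{4\eta_{x,s}\mu T_s} + \frac{\epsilon_{s-1}}{8} + \frac{45\eta_{x,s}M^2}{16} + \frac{45\eta_{y,s}B^2}{16} \le \frac{\epsilon_{s-1}}{2} = \epsilon_s.$$

Therefore, by induction, after running $S$ stages, we have

$$\mathrm{E}[P(\bar{x}_S) - P_*] \le \epsilon_S = \epsilon.$$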

The total iteration complexity is . ∎
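The constants $9/4$, $1/8$, and $45/16$ in the last bound of the proof can be traced with a short calculation. The sketch below assumes that the parameters are set so that $\frac{4L^2G^2}{\mu\eta_{y,s}T_s} \le \frac{1}{9}$, which is the condition consistent with those constants; the shorthand $\Delta_s$ is introduced here only for illustration.

```latex
% Shorthand (introduced for this sketch): Delta_s = E[P(\bar{x}_s) - P_*] >= 0.
% Starting from the recursive bound and using 4 L^2 G^2 / (mu eta_{y,s} T_s) <= 1/9:
\[
  \Delta_s \le \frac{2\epsilon_{s-1}}{\mu\eta_{x,s}T_s}
            + \frac{1}{9}\bigl(\Delta_s + \epsilon_{s-1}\bigr)
            + \frac{5\eta_{x,s}M^2}{2} + \frac{5\eta_{y,s}B^2}{2}.
\]
% Move Delta_s / 9 to the left-hand side and multiply both sides by 9/8:
\[
  \Delta_s \le \frac{9}{8}\cdot\frac{2\epsilon_{s-1}}{\mu\eta_{x,s}T_s}
            + \frac{9}{8}\cdot\frac{\epsilon_{s-1}}{9}
            + \frac{9}{8}\cdot\frac{5\eta_{x,s}M^2}{2}
            + \frac{9}{8}\cdot\frac{5\eta_{y,s}B^2}{2}
          = \frac{9\epsilon_{s-1}}{4\mu\eta_{x,s}T_s}
            + \frac{\epsilon_{s-1}}{8}
            + \frac{45\eta_{x,s}M^2}{16}
            + \frac{45\eta_{y,s}B^2}{16}.
\]
```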

### 4.2 RSPD Algorithm under the LEB condition

In the previous subsection, we introduce the RSPD algorithm for solving problem (1) when the objective function is strongly convex and is -smooth. However, these conditions are sometimes too strong for many machine learning problems. In this subsection, we will relax these strong conditions by assuming that satisfies the LEB condition (7) and has -Hölder continuous gradient with . We will develop a different variant of RSPD that also has high probability convergence guarantee.

Denote by a ball centered at with a radius intersected with , and similarly by a ball centered at with a radius intersected with . The second variant of the RSPD algorithm for solving problem (1) is summarized in Algorithm 2, which is similar to the RSPD algorithm except that the iterates are projected to bounded balls centered at the initial solutions of each epoch. This complication on the updates is introduced for the purpose of high-probability analysis, which also allows us to tackle problems that satisfies the LEB condition with . After each epoch, the proposed RSPD algorithm reduces the radius of the Euclidean ball. It is notable that this ball shrinkage technique is not new and has already used in Epoch-SGD method Hazan and Kale (2011b) for high probability bound analysis. We set the same value of initial radius for primal variable and dual variable in RSPD algorithm for the convenience of analysis. However, one can use different values but the same order of convergence result will be obtained by changing the analysis slightly. Another feature of RSPD that is different from RSPD is that RSPD uses a constant number of iterations in the inner loop in order to accommodate the local error bound condition.