Given a convex optimization problem and its dual, there are many possible first-order algorithms. In this paper, we show the equivalence between mirror descent algorithms and algorithms generalizing the conditional gradient method. This is done through convex duality, and implies notably that for certain problems, such as supervised machine learning problems with non-smooth losses or problems regularized by non-smooth regularizers, the primal subgradient method and the dual conditional gradient method are formally equivalent. The dual interpretation leads to a form of line search for mirror descent, as well as guarantees of convergence for primal-dual certificates.

1 Introduction

Many problems in machine learning, statistics and signal processing may be cast as convex optimization problems. In large-scale situations, simple gradient-based algorithms with potentially many cheap iterations are often preferred over methods, such as Newton's method or interior-point methods, that rely on fewer but more expensive iterations. The choice of a first-order method depends on the structure of the problem, in particular (a) the smoothness and/or strong convexity of the objective function, and (b) the computational efficiency of certain operations related to the non-smooth parts of the objective function, when it is decomposable into a smooth and a non-smooth part.

In this paper, we consider two classical algorithms, namely (a) subgradient descent and its mirror descent extension [29, 24, 4], and (b) conditional gradient algorithms, sometimes referred to as Frank-Wolfe algorithms [16, 13, 15, 14, 19].

Subgradient algorithms are adapted to non-smooth unstructured situations, and after $t$ steps have a convergence rate of $O(1/\sqrt{t})$ in terms of objective values. This convergence rate improves to $O(1/t)$ when the objective function is strongly convex [22]. Conditional-gradient algorithms are tailored to the optimization of smooth functions on a compact convex set, for which minimizing linear functions is easy (but where orthogonal projections would be hard, so that proximal methods [26, 5] cannot be used efficiently). They also have a convergence rate of $O(1/t)$ [15]. The main results of this paper are (a) to show that for common situations in practice, these two sets of methods are in fact equivalent by convex duality, (b) to recover a previously proposed extension of the conditional gradient method which is more generally applicable [10], and (c) to provide explicit convergence rates for primal and dual iterates. We also review in Appendix A the non-strongly convex case and show that both primal and dual suboptimalities then converge at rate $O(1/\sqrt{t})$.

More precisely, we consider a convex function $f$ defined on $\mathbb{R}^n$, a convex function $h$ defined on $\mathbb{R}^p$, both potentially taking the value $+\infty$, and a matrix $A \in \mathbb{R}^{n \times p}$. We consider the following minimization problem, which we refer to as the primal problem:

$$\min_{x \in \mathbb{R}^p} \; h(x) + f(Ax). \tag{1}$$

Throughout this paper, we make the following assumptions regarding the problem:

• $f$ is Lipschitz-continuous and finite on $\mathbb{R}^n$, i.e., there exists a constant $B$ such that for all $x, y \in \mathbb{R}^n$, $|f(x) - f(y)| \le B \|x - y\|$, where $\|\cdot\|$ denotes the Euclidean norm. Note that this implies that the domain of the Fenchel conjugate $f^*$ is bounded. We denote by $C$ the bounded domain of $f^*$. Thus, for all $z \in \mathbb{R}^n$, $f(z) = \max_{y \in C} y^\top z - f^*(y)$. In many situations, $C$ is also closed, but this is not always the case (in particular, when $f^*(y)$ tends to infinity as $y$ tends to the boundary of $C$).

Note that the boundedness of the domain of $f^*$ is crucial and allows for simpler proof techniques with explicit constants (see a generalization in [10]).

• $h$ is lower-semicontinuous and $\mu$-strongly convex on $\mathbb{R}^p$. This implies that $h^*$ is defined on $\mathbb{R}^p$ and differentiable with $(1/\mu)$-Lipschitz continuous gradient [8, 28]. Note that the domain $K$ of $h$ may be strictly included in $\mathbb{R}^p$.

Moreover, we assume that the following quantities may be computed efficiently:

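• Subgradient of $f$: for any $z \in \mathbb{R}^n$, a subgradient of $f$ at $z$ is any maximizer $y$ of $\max_{y \in C} y^\top z - f^*(y)$.

• Gradient of $h^*$: for any $z \in \mathbb{R}^p$, $(h^*)'(z)$ may be computed and is equal to the unique maximizer $x$ of $\max_{x \in \mathbb{R}^p} x^\top z - h(x)$.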

The values of the functions $f$, $h$, $f^*$ and $h^*$ will be useful to compute duality gaps but are not needed to run the algorithms. As shown in Section 2, there are many examples of pairs of functions with the computational constraints described above. If other operations are possible, in particular $\max_{y \in C} y^\top z - f^*(y) - \frac{\varepsilon}{2}\|y\|^2$, then proximal methods [5, 26] applied to the dual problem converge at rate $O(1/t^2)$. If $f$ and $h$ are smooth, then gradient methods (accelerated [25, Section 2.2] or not) have linear convergence rates.

We denote by $g_{\mathrm{primal}}(x) = h(x) + f(Ax)$ the primal objective in Eq. (1). It is the sum of a Lipschitz-continuous convex function and a strongly convex function, potentially on a restricted domain $K$. It is thus well adapted to the subgradient method [29].

We have the following primal/dual relationships (obtained from Fenchel duality [8]):

$$\begin{aligned}
\min_{x \in \mathbb{R}^p} h(x) + f(Ax) &= \min_{x \in \mathbb{R}^p} \max_{y \in C} \; h(x) + y^\top (Ax) - f^*(y) \\
&= \max_{y \in C} \Big\{ \min_{x \in \mathbb{R}^p} h(x) + x^\top A^\top y \Big\} - f^*(y) \\
&= \max_{y \in C} \; -h^*(-A^\top y) - f^*(y).
\end{aligned}$$

This leads to the dual maximization problem:

$$\max_{y \in C} \; -h^*(-A^\top y) - f^*(y). \tag{2}$$

We denote by $g_{\mathrm{dual}}(y) = -h^*(-A^\top y) - f^*(y)$ the dual objective. It has a smooth part $-h^*(-A^\top y)$ defined on $\mathbb{R}^n$ and a potentially non-smooth part $-f^*(y)$, and the problem is restricted to a bounded set $C$. When $f^*$ is linear (and more generally smooth) on its support, we are exactly in the situation where conditional gradient algorithms may be used [16, 13].
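As a hedged illustration of this primal/dual pair (a worked instance chosen here for concreteness, not taken from the paper), consider the quadratic regularizer with an $\ell_1$ data-fitting term, where $b \in \mathbb{R}^n$ is a hypothetical target vector. The two conjugates are available in closed form:

$$h(x) = \frac{\mu}{2}\|x\|^2 \;\Rightarrow\; h^*(w) = \max_{x} \; x^\top w - \frac{\mu}{2}\|x\|^2 = \frac{1}{2\mu}\|w\|^2, \qquad (h^*)'(w) = \frac{w}{\mu},$$

$$f(z) = \|z - b\|_1 \;\Rightarrow\; f^*(y) = b^\top y + I_{\{\|y\|_\infty \le 1\}}(y), \qquad C = [-1, 1]^n.$$

Under these assumptions, the dual problem reads

$$\max_{\|y\|_\infty \le 1} \; -\frac{1}{2\mu}\|A^\top y\|^2 - b^\top y,$$

so that $f^*$ is linear on its domain and the smooth part is a concave quadratic: the setting in which the classical conditional gradient method applies.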

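Given a pair of primal-dual candidates $(x, y) \in K \times C$, we denote by $\mathrm{gap}(x, y)$ the duality gap:

$$\mathrm{gap}(x, y) = g_{\mathrm{primal}}(x) - g_{\mathrm{dual}}(y) = \big[ h(x) + h^*(-A^\top y) + y^\top A x \big] + \big[ f(Ax) + f^*(y) - y^\top A x \big].$$

It is equal to zero if and only if (a) $(x, -A^\top y)$ is a Fenchel-dual pair for $h$ and (b) $(Ax, y)$ is a Fenchel-dual pair for $f$. This quantity serves as a certificate of optimality, as

$$\mathrm{gap}(x, y) = \Big[ g_{\mathrm{primal}}(x) - \min_{x' \in K} g_{\mathrm{primal}}(x') \Big] + \Big[ \max_{y' \in C} g_{\mathrm{dual}}(y') - g_{\mathrm{dual}}(y) \Big].$$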
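The duality gap can be evaluated directly once the four functions are available. The sketch below is a minimal illustration under assumptions that are not in the paper: it takes the hypothetical instance $h(x) = \frac{\mu}{2}\|x\|^2$ and $f(z) = \|z - b\|_1$, for which $h^*(w) = \|w\|^2 / (2\mu)$ and $f^*(y) = b^\top y$ on $C = [-1, 1]^n$. The helper names and the small data set are invented for the example.

```python
# Hypothetical toy instance (an assumption, not from the paper):
#   h(x) = (mu/2) * ||x||^2   and   f(z) = ||z - b||_1,
# so that h*(w) = ||w||^2 / (2*mu) and f*(y) = b.y on C = [-1, 1]^n.

def matvec(A, x):
    """Compute A x for A stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def rmatvec(A, y):
    """Compute A^T y for A stored as a list of rows."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]

def g_primal(A, b, mu, x):
    z = matvec(A, x)
    return 0.5 * mu * sum(v * v for v in x) + sum(abs(zi - bi) for zi, bi in zip(z, b))

def g_dual(A, b, mu, y):
    # The dual objective is only finite for y in C = [-1, 1]^n.
    assert max(abs(v) for v in y) <= 1.0
    w = rmatvec(A, y)
    return -sum(v * v for v in w) / (2.0 * mu) - sum(bi * yi for bi, yi in zip(b, y))

def gap(A, b, mu, x, y):
    """Duality gap g_primal(x) - g_dual(y); nonnegative by weak duality."""
    return g_primal(A, b, mu, x) - g_dual(A, b, mu, y)

A = [[1.0, 2.0], [0.0, 1.0], [-1.0, 1.0]]
b = [1.0, -1.0, 0.5]
mu = 2.0
x = [0.1, -0.2]          # arbitrary primal candidate
y = [0.5, -1.0, 0.0]     # arbitrary dual candidate in C
print(gap(A, b, mu, x, y))
```

Any pair of candidates yields an upper bound on both the primal and the dual suboptimality, which is what makes the gap usable as a stopping criterion.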

The goal of this paper is to show that for certain problems ($f^*$ linear and $h$ quadratic), the subgradient method applied to the primal problem in Eq. (1) is equivalent to the conditional gradient applied to the dual problem in Eq. (2); when relaxing the assumptions above, this equivalence is then between mirror descent methods and generalized conditional gradient algorithms.

2 Examples

The non-smooth strongly convex optimization problem defined in Eq. (1) occurs in many applications in machine learning and signal processing, either because the problem is formulated directly in this format, or because its dual in Eq. (2) is (i.e., the original problem is the minimization of a smooth function over a compact set).

2.1 Direct formulations

Typical cases for $h$ (often the regularizer in machine learning and signal processing) are the following:

• Squared Euclidean norm: $h(x) = \frac{\mu}{2}\|x\|^2$, which is $\mu$-strongly convex.

• Squared Euclidean norm with convex constraints: $h(x) = \frac{\mu}{2}\|x\|^2 + I_K(x)$, with $I_K$ the indicator function of a closed convex set $K$, which is $\mu$-strongly convex.

• Negative entropy: $h(x) = \sum_{i=1}^p x_i \log x_i + I_K(x)$, where $K = \{x \in \mathbb{R}^p,\ x \ge 0,\ \sum_{i=1}^p x_i = 1\}$, which is $1$-strongly convex. More generally, many barrier functions of convex sets may be used (see examples in [4, 9], in particular for problems on matrices).

Typical cases for $f$ (often the data fitting terms in machine learning and signal processing) are functions of the form $f(z) = \frac{1}{n}\sum_{i=1}^n \ell_i(z_i)$:

• Least-absolute-deviation: $\ell_i(z_i) = |z_i - y_i|$, with $y_i \in \mathbb{R}$. Note that the square loss is not Lipschitz-continuous on $\mathbb{R}$ (although it is Lipschitz-continuous when restricted to a bounded set).

• Logistic regression: $\ell_i(z_i) = \log(1 + \exp(-z_i y_i))$, with $y_i \in \{-1, 1\}$. Here $f^*$ is not linear on its support, and $f^*$ is not smooth, since it is a sum of negative entropies (and the second-order derivative is not bounded). This extends to any "log-sum-exp" functions which occur as a negative log-likelihood from the exponential family (see, e.g., [32] and references therein). Note that $f$ is then smooth and proximal methods with an exponential convergence rate may be used (which correspond to a constant step size in the algorithms presented below, instead of a decaying step size) [26, 5].

• Support vector machine: $\ell_i(z_i) = \max\{1 - y_i z_i, 0\}$, with $y_i \in \{-1, 1\}$. Here $f^*$ is linear on its domain (this is a situation where subgradient and conditional gradient methods are exactly equivalent). This extends to more general "max-margin" formulations [31, 30]: in these situations, a combinatorial object (such as a full chain, a graph, a matching or vertices of the hypercube) is estimated (rather than an element of $\{-1, 1\}$) and this leads to functions $z_i \mapsto \ell_i(z_i)$ whose Fenchel-conjugates are linear and have domains which are related to the polytopes associated to the linear programming relaxations of the corresponding combinatorial optimization problems. For these polytopes, often, only linear functions can be maximized, i.e., we can compute a subgradient of $\ell_i$ but typically nothing more.

Other examples may be found in signal processing; for example, total-variation denoising, where the loss is strongly convex but the regularizer is non-smooth [11], or submodular function minimization cast through separable optimization problems [2]. Moreover, many proximal operators for non-smooth regularizers are of this form, with $h(x) = \frac{1}{2}\|x - x_0\|^2$ and $f$ a norm (or more generally a gauge function).

2.2 Dual formulations

Another interesting set of examples for machine learning are more naturally described from the dual formulation in Eq. (2): given a smooth loss term $h^*(-A^\top y)$ (this could be least-squares or logistic regression), a typically non-smooth penalization or constraint is added, often through a norm $\Omega$. Thus, this corresponds to functions $f^*$ of the form $f^*(y) = \varphi(\Omega(y))$, where $\varphi$ is a convex non-decreasing function ($f^*$ is then convex).

Our main assumption is that a subgradient of $f$ may be easily computed. This is equivalent to being able to maximize functions of the form $z^\top y - f^*(y) = z^\top y - \varphi(\Omega(y))$ for $z \in \mathbb{R}^n$. If one can compute the dual norm of $z$, $\Omega^*(z) = \max_{\Omega(y) \le 1} z^\top y$, and in particular a maximizer $y$ in the unit-ball of $\Omega$, then one can simply compute the subgradient of $f$. Only being able to compute the dual norm efficiently is a common situation in machine learning and signal processing, for example, for structured regularizers based on submodularity [2], all atomic norms [12], and norms based on matrix decompositions [1]. See additional examples in [19].

Our assumption regarding the compact domain of $f^*$ translates to the assumption that $\varphi$ has compact domain. This includes indicator functions $\varphi = I_{[0, \omega_0]}$, which correspond to the constraint $\Omega(y) \le \omega_0$. We may also consider $\varphi(\omega) = \lambda \omega + I_{[0, \omega_0]}(\omega)$, which corresponds to jointly penalizing and constraining the norm; in practice, $\omega_0$ may be chosen so that the constraint $\Omega(y) \le \omega_0$ is not active at the optimum and we get the solution of the penalized problem $\max_{y \in \mathbb{R}^n} -h^*(-A^\top y) - \lambda \Omega(y)$. See [17, 34, 1] for alternative approaches.
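To make the dual-norm oracle concrete, here is a minimal sketch for one illustrative choice that is an assumption of this example rather than something fixed by the paper: the norm is the $\ell_1$-norm, whose dual norm is the $\ell_\infty$-norm, and $\varphi$ is the indicator of $[0, \omega_0]$, so that $f(z) = \omega_0 \|z\|_\infty$. The function names are hypothetical.

```python
def l1_ball_maximizer(z):
    """Return (y, value) where y maximizes z.y over the unit l1-ball.

    The maximum is attained at a signed canonical basis vector, and the
    optimal value is the dual norm ||z||_inf.
    """
    i = max(range(len(z)), key=lambda k: abs(z[k]))
    y = [0.0] * len(z)
    if z[i] != 0:
        y[i] = 1.0 if z[i] > 0 else -1.0
    return y, abs(z[i])

def subgradient_f(z, omega0):
    """A subgradient of f(z) = omega0 * ||z||_inf.

    This f is the conjugate of f* = indicator of {y : ||y||_1 <= omega0},
    so a subgradient is omega0 times a maximizer over the unit ball.
    """
    y, _ = l1_ball_maximizer(z)
    return [omega0 * v for v in y]

z = [0.5, -3.0, 2.0]
y, dual_norm = l1_ball_maximizer(z)
g = subgradient_f(z, omega0=2.0)
print(y, dual_norm, g)
```

Only a linear maximization over the unit ball is used here; no projection onto the ball is ever required, which is the computational regime the section describes.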

3 Mirror descent for strongly convex problems

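We first assume that the function $h$ is essentially smooth (i.e., differentiable at any point in the interior of $K$, and such that the norm of gradients converges to $+\infty$ when approaching the boundary of $K$); then $h'$ is a bijection from $\mathrm{int}(K)$ to $\mathbb{R}^p$, where $K$ is the domain of $h$ (see, e.g., [28, 18]). We consider the Bregman divergence

$$D(x_1, x_2) = h(x_1) - h(x_2) - (x_1 - x_2)^\top h'(x_2).$$

It is always defined on $K \times \mathrm{int}(K)$, and is nonnegative. If $x_1, x_2 \in \mathrm{int}(K)$, then $D(x_1, x_2) = 0$ if and only if $x_1 = x_2$. Moreover, since $h$ is assumed $\mu$-strongly convex, we have $D(x_1, x_2) \ge \frac{\mu}{2}\|x_1 - x_2\|^2$. See more details in [4]. For example, when $h(x) = \frac{\mu}{2}\|x\|^2$, we have $D(x_1, x_2) = \frac{\mu}{2}\|x_1 - x_2\|^2$.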
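The definition of the Bregman divergence can be checked numerically on the two regularizers listed in Section 2.1. The following is a small sketch with hypothetical helper names and arbitrary test points; it assumes the points lie in the interior of the simplex for the negative entropy, in which case the divergence reduces to the Kullback-Leibler divergence.

```python
import math

def bregman(h, grad_h, x1, x2):
    """D(x1, x2) = h(x1) - h(x2) - (x1 - x2)^T h'(x2)."""
    g = grad_h(x2)
    return h(x1) - h(x2) - sum((a - b) * gi for a, b, gi in zip(x1, x2, g))

mu = 3.0  # arbitrary strong-convexity constant for the example

def sq(x):
    return 0.5 * mu * sum(v * v for v in x)

def sq_grad(x):
    return [mu * v for v in x]

def negent(x):
    return sum(v * math.log(v) for v in x)

def negent_grad(x):
    return [math.log(v) + 1.0 for v in x]

# Two arbitrary points in the interior of the simplex.
x1 = [0.2, 0.3, 0.5]
x2 = [0.6, 0.3, 0.1]

d_sq = bregman(sq, sq_grad, x1, x2)
d_ent = bregman(negent, negent_grad, x1, x2)
kl = sum(a * math.log(a / b) for a, b in zip(x1, x2))
half_sq_dist = 0.5 * sum((a - b) ** 2 for a, b in zip(x1, x2))
print(d_sq, d_ent, kl)
```

For the quadratic case the divergence equals $\mu$ times `half_sq_dist`, and for the entropy case it matches `kl` and dominates `half_sq_dist`, in line with the strong-convexity lower bound.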

Subgradient descent for square Bregman divergence

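We first consider the common situation where $h(x) = \frac{\mu}{2}\|x\|^2$; the primal problem then becomes:

$$\min_{x \in K} \; f(Ax) + \frac{\mu}{2}\|x\|^2.$$

The projected subgradient method starts from any $x_0 \in \mathbb{R}^p$, and iterates the following recursion:

$$x_t = x_{t-1} - \frac{\rho_t}{\mu}\big[ A^\top f'(A x_{t-1}) + \mu x_{t-1} \big],$$

where $f'(A x_{t-1})$ is any subgradient of $f$ at $A x_{t-1}$. The step size is $\rho_t / \mu$.

The recursion may be rewritten as

$$\mu x_t = \mu x_{t-1} - \rho_t \big[ A^\top f'(A x_{t-1}) + \mu x_{t-1} \big],$$

which is equivalent to $x_t$ being the unique minimizer of

$$(x - x_{t-1})^\top \big[ A^\top \bar{y}_{t-1} + \mu x_{t-1} \big] + \frac{\mu}{2\rho_t}\|x - x_{t-1}\|^2, \tag{3}$$

with $\bar{y}_{t-1} = f'(A x_{t-1})$, which is the traditional proximal step, with step size $\rho_t / \mu$.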
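The equivalence between the recursion and the minimization of Eq. (3) follows from a one-line computation, spelled out here as a worked step. The objective in Eq. (3) is a strongly convex quadratic in $x$, so its unique minimizer is obtained by setting the gradient to zero:

$$A^\top \bar{y}_{t-1} + \mu x_{t-1} + \frac{\mu}{\rho_t}(x - x_{t-1}) = 0 \quad\Longleftrightarrow\quad x = x_{t-1} - \frac{\rho_t}{\mu}\big[ A^\top \bar{y}_{t-1} + \mu x_{t-1} \big] = (1 - \rho_t)\, x_{t-1} - \frac{\rho_t}{\mu} A^\top \bar{y}_{t-1}.$$

The last form shows that each iterate is a convex combination (for $\rho_t \in [0, 1]$) of the previous iterate and the point $-\frac{1}{\mu} A^\top \bar{y}_{t-1}$.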

Mirror descent

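We may interpret the last formulation in Eq. (3) for the square regularizer $h(x) = \frac{\mu}{2}\|x\|^2$ as the minimization of

$$(x - x_{t-1})^\top g'_{\mathrm{primal}}(x_{t-1}) + \frac{1}{\rho_t} D(x, x_{t-1}),$$

with solution defined through (note that $h'$ is a bijection from $\mathrm{int}(K)$ to $\mathbb{R}^p$):

$$\begin{aligned}
h'(x_t) &= h'(x_{t-1}) - \rho_t \big[ A^\top f'(A x_{t-1}) + h'(x_{t-1}) \big] \\
&= (1 - \rho_t)\, h'(x_{t-1}) - \rho_t A^\top f'(A x_{t-1}).
\end{aligned}$$

This leads to the following definition of the mirror descent recursion:

$$\begin{cases}
\bar{y}_{t-1} \in \arg\max_{y \in C} \; y^\top A x_{t-1} - f^*(y), \\
x_t = \arg\min_{x \in \mathbb{R}^p} \; h(x) - (1 - \rho_t)\, x^\top h'(x_{t-1}) + \rho_t\, x^\top A^\top \bar{y}_{t-1}.
\end{cases} \tag{4}$$

The following proposition proves the convergence of mirror descent in the strongly convex case with rate $O(1/t)$; previous results considered the convex case, with convergence rate $O(1/\sqrt{t})$ [24, 4].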

Proposition 1 (Convergence of mirror descent in the strongly convex case)

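Assume that (a) $f$ is Lipschitz-continuous and finite on $\mathbb{R}^n$, with $C$ the domain of $f^*$, (b) $h$ is essentially smooth and $\mu$-strongly convex. Consider $\rho_t = \frac{2}{t+1}$ and $R^2 = \max_{y, y' \in C} \|A^\top (y - y')\|^2$. Denoting by $x_*$ the unique minimizer of $g_{\mathrm{primal}}$, after $t$ iterations of the mirror descent recursion of Eq. (4), we have:

$$\begin{aligned}
g_{\mathrm{primal}}\Big( \frac{2}{t(t+1)} \sum_{u=1}^t u\, x_{u-1} \Big) - g_{\mathrm{primal}}(x_*) &\le \frac{R^2}{\mu (t+1)}, \\
\min_{u \in \{0, \dots, t-1\}} \big\{ g_{\mathrm{primal}}(x_u) - g_{\mathrm{primal}}(x_*) \big\} &\le \frac{R^2}{\mu (t+1)}, \\
D(x_*, x_t) &\le \frac{R^2}{\mu (t+1)}.
\end{aligned}$$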

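Proof  We follow the proof of [4] and adapt it to the strongly convex case. We have, by reordering terms and using the optimality condition $h'(x_t) = (1 - \rho_t)\, h'(x_{t-1}) - \rho_t A^\top f'(A x_{t-1})$:

$$\begin{aligned}
D(x_*, x_t) - D(x_*, x_{t-1}) &= h(x_{t-1}) - h(x_t) - (x_* - x_t)^\top h'(x_t) + (x_* - x_{t-1})^\top h'(x_{t-1}) \\
&= h(x_{t-1}) - h(x_t) - (x_* - x_t)^\top \big[ (1 - \rho_t)\, h'(x_{t-1}) - \rho_t A^\top f'(A x_{t-1}) \big] + (x_* - x_{t-1})^\top h'(x_{t-1}) \\
&= h(x_{t-1}) - h(x_t) - (x_{t-1} - x_t)^\top h'(x_{t-1}) + \rho_t (x_* - x_t)^\top g'_{\mathrm{primal}}(x_{t-1}) \\
&= \big[ -D(x_t, x_{t-1}) + \rho_t (x_{t-1} - x_t)^\top g'_{\mathrm{primal}}(x_{t-1}) \big] + \big[ \rho_t (x_* - x_{t-1})^\top g'_{\mathrm{primal}}(x_{t-1}) \big].
\end{aligned} \tag{5}$$

In order to upper-bound the two bracketed terms in Eq. (5), we first consider the following bound (obtained by convexity of $f$ and the definition of $D$):

$$f(A x_*) + h(x_*) \ge f(A x_{t-1}) + h(x_{t-1}) + (x_* - x_{t-1})^\top \big[ A^\top \bar{y}_{t-1} + h'(x_{t-1}) \big] + D(x_*, x_{t-1}),$$

which may be rewritten as:

$$g_{\mathrm{primal}}(x_{t-1}) - g_{\mathrm{primal}}(x_*) \le -D(x_*, x_{t-1}) + (x_{t-1} - x_*)^\top g'_{\mathrm{primal}}(x_{t-1}),$$

which implies

$$\rho_t (x_* - x_{t-1})^\top g'_{\mathrm{primal}}(x_{t-1}) \le -\rho_t D(x_*, x_{t-1}) - \rho_t \big[ g_{\mathrm{primal}}(x_{t-1}) - g_{\mathrm{primal}}(x_*) \big]. \tag{6}$$

Moreover, by definition of $x_t$,

$$-D(x_t, x_{t-1}) + \rho_t (x_{t-1} - x_t)^\top g'_{\mathrm{primal}}(x_{t-1}) = \max_{x \in \mathbb{R}^p} \; -D(x, x_{t-1}) + \rho_t (x_{t-1} - x)^\top z = \varphi(z),$$

with $z = g'_{\mathrm{primal}}(x_{t-1})$. The function $x \mapsto D(x, x_{t-1})$ is $\mu$-strongly convex, and its Fenchel conjugate is thus $(1/\mu)$-smooth. This implies that $\varphi$ is $(\rho_t^2/\mu)$-smooth. Since $\varphi(0) = 0$ and $\varphi'(0) = 0$, $\varphi(z) \le \frac{\rho_t^2}{2\mu}\|z\|^2$. Moreover, $z = A^\top f'(A x_{t-1}) + h'(x_{t-1})$. Since $h'(x_{t-1}) \in -A^\top C$ (because $h'(x_{t-1})$ is a convex combination of such elements), then $\|z\|^2 \le R^2$.

Overall, combining Eq. (6) and $\varphi(z) \le \frac{\rho_t^2 R^2}{2\mu}$ into Eq. (5), this implies that

$$D(x_*, x_t) - D(x_*, x_{t-1}) \le \frac{\rho_t^2}{2\mu} R^2 - \rho_t D(x_*, x_{t-1}) - \rho_t \big[ g_{\mathrm{primal}}(x_{t-1}) - g_{\mathrm{primal}}(x_*) \big],$$

that is,

$$g_{\mathrm{primal}}(x_{t-1}) - g_{\mathrm{primal}}(x_*) \le \frac{\rho_t R^2}{2\mu} + (\rho_t^{-1} - 1)\, D(x_*, x_{t-1}) - \rho_t^{-1} D(x_*, x_t).$$

With $\rho_t = \frac{2}{t+1}$, we obtain

$$t \big[ g_{\mathrm{primal}}(x_{t-1}) - g_{\mathrm{primal}}(x_*) \big] \le \frac{R^2 t}{\mu (t+1)} + \frac{(t-1)\,t}{2} D(x_*, x_{t-1}) - \frac{t (t+1)}{2} D(x_*, x_t).$$

Thus, by summing from $u = 1$ to $u = t$, we obtain

$$\sum_{u=1}^t u \big[ g_{\mathrm{primal}}(x_{u-1}) - g_{\mathrm{primal}}(x_*) \big] \le \frac{R^2}{\mu}\, t - \frac{t (t+1)}{2} D(x_*, x_t),$$

that is,

$$D(x_*, x_t) + \frac{2}{t(t+1)} \sum_{u=1}^t u \big[ g_{\mathrm{primal}}(x_{u-1}) - g_{\mathrm{primal}}(x_*) \big] \le \frac{R^2}{\mu (t+1)}.$$

This implies that $D(x_*, x_t) \le \frac{R^2}{\mu (t+1)}$, i.e., the iterates converge. Moreover, using the convexity of $g_{\mathrm{primal}}$,

$$g_{\mathrm{primal}}\Big( \frac{2}{t(t+1)} \sum_{u=1}^t u\, x_{u-1} \Big) - g_{\mathrm{primal}}(x_*) \le \frac{2}{t(t+1)} \sum_{u=1}^t u \big[ g_{\mathrm{primal}}(x_{u-1}) - g_{\mathrm{primal}}(x_*) \big] \le \frac{R^2}{\mu (t+1)},$$

i.e., the objective value at an averaged iterate converges, and

$$\min_{u \in \{0, \dots, t-1\}} g_{\mathrm{primal}}(x_u) - g_{\mathrm{primal}}(x_*) \le \frac{R^2}{\mu (t+1)},$$

i.e., one of the iterates has an objective that converges.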

Averaging

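Note that with the step size $\rho_t = \frac{2}{t+1}$, we have

$$h'(x_t) = \frac{t-1}{t+1}\, h'(x_{t-1}) - \frac{2}{t+1} A^\top f'(A x_{t-1}),$$

which implies

$$t (t+1)\, h'(x_t) = (t-1)\, t\, h'(x_{t-1}) - 2 t\, A^\top f'(A x_{t-1}).$$

By summing these equalities, we obtain $t (t+1)\, h'(x_t) = -2 \sum_{u=1}^t u\, A^\top f'(A x_{u-1})$, i.e.,

$$h'(x_t) = \frac{2}{t(t+1)} \sum_{u=1}^t u \big[ -A^\top f'(A x_{u-1}) \big],$$

that is, $h'(x_t)$ is a weighted average of negative subgradients (with more weight on later iterates).

For $\rho_t = 1/t$, with the same techniques, we would obtain a convergence rate proportional to $\frac{R^2}{\mu t} \log t$ for the average iterate $\frac{1}{t} \sum_{u=1}^t x_{u-1}$, thus with an additional $\log t$ factor (see a similar situation in the stochastic case in [20]). We would then have $h'(x_t) = \frac{1}{t} \sum_{u=1}^t \big[ -A^\top f'(A x_{u-1}) \big]$, and this is exactly a form of dual averaging method [27], which also comes with primal-dual guarantees.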

Generalization to h non-smooth

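The previous result does not require $h$ to be essentially smooth, i.e., it may be applied to $h(x) = \frac{\mu}{2}\|x\|^2 + I_K(x)$ where $K$ is a closed convex set strictly included in $\mathbb{R}^p$. In the mirror descent recursion,

$$\begin{cases}
\bar{y}_{t-1} \in \arg\max_{y \in C} \; y^\top A x_{t-1} - f^*(y), \\
x_t = \arg\min_{x \in \mathbb{R}^p} \; h(x) - (1 - \rho_t)\, x^\top h'(x_{t-1}) + \rho_t\, x^\top A^\top \bar{y}_{t-1},
\end{cases}$$

there may then be multiple choices for $h'(x_{t-1})$. If we choose for $h'(x_{t-1})$ at iteration $t$ the subgradient of $h$ obtained at the previous iteration, i.e., such that $h'(x_{t-1}) = (1 - \rho_{t-1})\, h'(x_{t-2}) - \rho_{t-1} A^\top \bar{y}_{t-2}$, then the proof of Prop. 1 above holds.

Note that when $h(x) = \frac{\mu}{2}\|x\|^2 + I_K(x)$, the algorithm above is not equivalent to classical projected gradient descent. Indeed, the classical algorithm has the iteration

$$x_t = \Pi_K\Big( x_{t-1} - \frac{\rho_t}{\mu} \big[ \mu x_{t-1} + A^\top f'(A x_{t-1}) \big] \Big) = \Pi_K\Big( (1 - \rho_t)\, x_{t-1} + \rho_t \Big[ -\frac{1}{\mu} A^\top f'(A x_{t-1}) \Big] \Big),$$

and corresponds to the choice $h'(x_{t-1}) = \mu x_{t-1}$ in the mirror descent recursion, which, when $x_{t-1}$ is on the boundary of $K$, is not the choice that we need for the equivalence in Section 4.

However, when $h$ is assumed to be differentiable on its closed domain $K$, the bound of Prop. 1 still holds because the optimality condition $h'(x_t) = (1 - \rho_t)\, h'(x_{t-1}) - \rho_t A^\top f'(A x_{t-1})$ may now be replaced by $(x - x_t)^\top \big[ h'(x_t) - (1 - \rho_t)\, h'(x_{t-1}) + \rho_t A^\top f'(A x_{t-1}) \big] \ge 0$ for all $x \in K$, which still yields the decomposition of $D(x_*, x_t) - D(x_*, x_{t-1})$ used at the start of the proof of Prop. 1, with the final equality replaced by an inequality.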

4 Conditional gradient method and extensions

In this section, we first review the classical conditional gradient algorithm, which corresponds to the extra assumption that $f^*$ is linear on its domain.

Given a maximization problem of the following form (i.e., where $f^*$ is linear on its domain, or equal to zero by a simple change of variable):

$$\max_{y \in C} \; -h^*(-A^\top y),$$

the conditional gradient algorithm consists in the following iteration (note that below $A x_{t-1} = A (h^*)'(-A^\top y_{t-1})$ is the gradient of the objective function and that we are maximizing the first-order Taylor expansion to obtain a candidate $\bar{y}_{t-1}$ towards which we make a small step):

$$\begin{aligned}
x_{t-1} &= \arg\min_{x \in \mathbb{R}^p} \; h(x) + x^\top A^\top y_{t-1}, \\
\bar{y}_{t-1} &\in \arg\max_{y \in C} \; y^\top A x_{t-1}, \\
y_t &= (1 - \rho_t)\, y_{t-1} + \rho_t\, \bar{y}_{t-1}.
\end{aligned}$$

It corresponds to a linearization of $-h^*(-A^\top y)$ and its maximization over the bounded convex set $C$. As we show later, the choice of $\rho_t$ may be done in different ways, through a fixed step size or by (approximate) line search.

Generalization

Following [10], the conditional gradient method can be generalized to problems of the form

$$\max_{y \in C} \; -h^*(-A^\top y) - f^*(y),$$

with the following iteration:

$$\begin{cases}
x_{t-1} = \arg\min_{x \in \mathbb{R}^p} \; h(x) + x^\top A^\top y_{t-1} = (h^*)'(-A^\top y_{t-1}), \\
\bar{y}_{t-1} \in \arg\max_{y \in C} \; y^\top A x_{t-1} - f^*(y), \\
y_t = (1 - \rho_t)\, y_{t-1} + \rho_t\, \bar{y}_{t-1}.
\end{cases} \tag{7}$$

The previous algorithm may be interpreted as follows: (a) perform a first-order Taylor expansion of the smooth part $-h^*(-A^\top y)$, while leaving the other part $-f^*(y)$ intact, (b) maximize the approximation, and (c) perform a small step towards the maximizer. Note the similarity (and dissimilarity) with proximal methods, which would add a proximal term proportional to $\|y - y_{t-1}\|^2$, leading to faster convergence, but with the extra requirement of solving the proximal step [26, 5].

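Note that here $y_t$ may be expressed as a convex combination of all $\bar{y}_{u-1}$, $u \in \{1, \dots, t\}$ (when $\rho_1 = 1$, so that the contribution of $y_0$ vanishes):

$$y_t = \sum_{u=1}^t \Big( \rho_u \prod_{s=u+1}^t (1 - \rho_s) \Big) \bar{y}_{u-1},$$

and that when we choose $\rho_t = \frac{2}{t+1}$, it simplifies to:

$$y_t = \frac{2}{t(t+1)} \sum_{u=1}^t u\, \bar{y}_{u-1}.$$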
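The simplification of the weights can be verified in exact arithmetic. Unrolling the recursion $y_t = (1 - \rho_t) y_{t-1} + \rho_t \bar{y}_{t-1}$ gives the weight $\rho_u \prod_{s=u+1}^t (1 - \rho_s)$ on $\bar{y}_{u-1}$; the sketch below (with a hypothetical helper name and an arbitrary horizon) compares these weights with $\frac{2u}{t(t+1)}$ for the step size $\rho_t = \frac{2}{t+1}$.

```python
from fractions import Fraction

def combination_weights(rhos):
    """Weights w_u, u = 1..t, such that y_t = sum_u w_u * ybar_{u-1}.

    Assumes rhos[0] == 1 so that the initial point y_0 receives zero weight.
    """
    t = len(rhos)
    weights = []
    for u in range(1, t + 1):
        w = rhos[u - 1]                 # rho_u
        for s in range(u + 1, t + 1):
            w *= 1 - rhos[s - 1]        # product of (1 - rho_s) for s > u
        weights.append(w)
    return weights

t = 6  # arbitrary number of iterations for the check
rhos = [Fraction(2, s + 1) for s in range(1, t + 1)]
weights = combination_weights(rhos)
closed_form = [Fraction(2 * u, t * (t + 1)) for u in range(1, t + 1)]
print(weights == closed_form, sum(weights))
```

The weights grow linearly with $u$ and sum to one, which is the sense in which later oracle outputs count more in the average.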

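When $h$ is essentially smooth (and thus $h^*$ is essentially strictly convex), it can be reformulated with $h'(x_t) = -A^\top y_t$ as follows:

$$\begin{aligned}
h'(x_t) &= (1 - \rho_t)\, h'(x_{t-1}) - \rho_t A^\top \arg\max_{y \in C} \big\{ y^\top A x_{t-1} - f^*(y) \big\} \\
&= (1 - \rho_t)\, h'(x_{t-1}) - \rho_t A^\top f'(A x_{t-1}),
\end{aligned}$$

which is exactly the mirror descent algorithm described in Eq. (4). This leads to the following proposition: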

Proposition 2 (Equivalence between mirror descent and generalized conditional gradient)

Assume that (a) $f$ is Lipschitz-continuous and finite on $\mathbb{R}^n$, with $C$ the domain of $f^*$, (b) $h$ is $\mu$-strongly convex and essentially smooth. The mirror descent recursion in Eq. (4), started from $x_0 = (h^*)'(-A^\top y_0)$, is equivalent to the generalized conditional gradient recursion in Eq. (7), started from $y_0 \in C$.
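The equivalence can be observed numerically on a small instance. The following sketch is an illustration under assumptions that are not part of the proposition: it uses $h(x) = \frac{\mu}{2}\|x\|^2$ (so $h'(x) = \mu x$ and $(h^*)'(w) = w/\mu$), $f(z) = \|z - b\|_1$ (so $C = [-1, 1]^n$ and a subgradient is a sign vector), the step size $\rho_t = \frac{2}{t+1}$, and exact rational arithmetic. Function names and data are hypothetical. Both recursions call the same subgradient oracle, so ties at zero are broken identically.

```python
from fractions import Fraction

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def rmatvec(A, y):
    """Compute A^T y for A stored as a list of rows."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]

def sign(v):
    return (v > 0) - (v < 0)

def oracle(A, b, x):
    """ybar in argmax over [-1,1]^n of y^T(Ax) - b^T y: a subgradient of f at Ax."""
    return [Fraction(sign(zi - bi)) for zi, bi in zip(matvec(A, x), b)]

def mirror_descent(A, b, mu, x0, steps):
    """Recursion of Eq. (4) specialized to h(x) = (mu/2)||x||^2."""
    x, xs = x0, [x0]
    for t in range(1, steps + 1):
        rho = Fraction(2, t + 1)
        g = rmatvec(A, oracle(A, b, x))
        x = [(1 - rho) * xi - rho * gi / mu for xi, gi in zip(x, g)]
        xs.append(x)
    return xs

def generalized_cg(A, b, mu, y0, steps):
    """Recursion of Eq. (7) with (h*)'(w) = w / mu."""
    y, ys = y0, [y0]
    for t in range(1, steps + 1):
        rho = Fraction(2, t + 1)
        x = [-w / mu for w in rmatvec(A, y)]      # x_{t-1} = (h*)'(-A^T y_{t-1})
        ybar = oracle(A, b, x)
        y = [(1 - rho) * yi + rho * yb for yi, yb in zip(y, ybar)]
        ys.append(y)
    return ys

F = Fraction
A = [[F(1), F(2)], [F(0), F(1)], [F(-1), F(1)]]
b = [F(1), F(-1), F(1, 2)]
mu = F(2)
y0 = [F(1, 2), F(-1), F(0)]                       # arbitrary starting point in C
x0 = [-w / mu for w in rmatvec(A, y0)]            # matching primal start

xs = mirror_descent(A, b, mu, x0, 20)
ys = generalized_cg(A, b, mu, y0, 20)
primal_from_dual = [[-w / mu for w in rmatvec(A, y)] for y in ys]
print(xs == primal_from_dual)
```

In exact arithmetic the two sequences coincide at every iteration, since both satisfy the same linear recursion driven by the same oracle outputs; with floating-point numbers one would expect agreement only up to rounding.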

When $h$ is not essentially smooth, then with a particular choice of subgradient (see end of Section 3), the two algorithms are also equivalent. We now provide convergence proofs for the two versions (with adaptive and non-adaptive step sizes); similar rates may be obtained without the boundedness assumptions [10], but our results provide explicit constants and primal-dual guarantees. We first have the following convergence proof for generalized conditional gradient with no line search (the proof of dual convergence uses standard arguments from [13, 15], while the convergence of gaps is due to [19] for the regular conditional gradient):

Proposition 3 (Convergence of extended conditional gradient - no line search)

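Assume that (a) $f$ is Lipschitz-continuous and finite on $\mathbb{R}^n$, with $C$ the domain of $f^*$, (b) $h$ is $\mu$-strongly convex. Consider $\rho_t = \frac{2}{t+1}$ and $R^2 = \max_{y, y' \in C} \|A^\top (y - y')\|^2$. Denoting by $y_*$ any maximizer of $g_{\mathrm{dual}}$ on $C$, after $t$ iterations of the generalized conditional gradient recursion of Eq. (7), we have:

$$g_{\mathrm{dual}}(y_*) - g_{\mathrm{dual}}(y_t) \le \frac{2 R^2}{\mu (t+1)}, \qquad \min_{u \in \{0, \dots, t-1\}} \mathrm{gap}(x_u, y_u) \le \frac{8 R^2}{\mu (t+1)}.$$

Proof  We have (using convexity of $f^*$ and $(1/\mu)$-smoothness of $h^*$):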

 gdual(yt) = −h∗(−A⊤yt)−f∗(yt) ⩾ [−h∗(−A⊤yt−1)+(yt−yt−1)⊤Axt−1−R2ρ2t2μ]−[(1−ρt)f∗(yt−1)+ρtf∗(¯yt−1)] = −h∗(−A⊤yt−1)+ρt(¯yt−1−yt−1)⊤Axt−1−R2ρ2t2μ−(1−ρt)f∗(yt−1)−ρtf∗(¯yt−1) = gdual(yt−1)+ρt(¯yt−1−yt−1