Many problems in machine learning, statistics and signal processing may be cast as convex optimization problems. In large-scale situations, simple gradient-based algorithms with potentially many cheap iterations are often preferred over methods, such as Newton’s method or interior-point methods, that rely on fewer but more expensive iterations. The choice of a first-order method depends on the structure of the problem, in particular (a) the smoothness and/or strong convexity of the objective function, and (b) the computational efficiency of certain operations related to the non-smooth parts of the objective function, when it is decomposable into a smooth and a non-smooth part.
In this paper, we consider two classical algorithms, namely (a) subgradient descent and its mirror descent extension [29, 24, 4], and (b) conditional gradient algorithms, sometimes referred to as Frank-Wolfe algorithms [16, 13, 15, 14, 19].
Subgradient algorithms are adapted to non-smooth unstructured situations, and after $t$ steps have a convergence rate of $O(1/\sqrt{t})$ in terms of objective values. This convergence rate improves to $O(1/t)$ when the objective function is strongly convex. Conditional-gradient algorithms are tailored to the optimization of smooth functions on a compact convex set, for which minimizing linear functions is easy (but where orthogonal projections would be hard, so that proximal methods [26, 5] cannot be used efficiently). They also have a convergence rate of $O(1/t)$. The main results of this paper are (a) to show that for common situations in practice, these two sets of methods are in fact equivalent by convex duality, (b) to recover a previously proposed extension of the conditional gradient method which is more generally applicable, and (c) to provide explicit convergence rates for primal and dual iterates. We also review in Appendix A the non-strongly convex case and show that both primal and dual suboptimalities then converge at rate $O(1/\sqrt{t})$.
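More precisely, we consider a convex function $f$ defined on $\mathbb{R}^n$, a convex function $h$ defined on $\mathbb{R}^p$, both potentially taking the value $+\infty$, and a matrix $A \in \mathbb{R}^{n \times p}$. We consider the following minimization problem, which we refer to as the primal problem:
$$\min_{x \in \mathbb{R}^p} \; h(x) + f(Ax). \qquad (1)$$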
Throughout this paper, we make the following assumptions regarding the problem:
$f$ is Lipschitz-continuous and finite on $\mathbb{R}^n$, i.e., there exists a constant $B$ such that for all $x, y \in \mathbb{R}^n$, $|f(x) - f(y)| \leq B \|x - y\|$, where $\|\cdot\|$ denotes the Euclidean norm. Note that this implies that the domain of the Fenchel conjugate $f^*$ is bounded. We denote by $C$ the bounded domain of $f^*$. Thus, for all $z \in \mathbb{R}^n$, $f(z) = \max_{y \in C} y^\top z - f^*(y)$. In many situations, $C$ is also closed but this is not always the case (in particular, when $f^*(y)$ tends to infinity when $y$ tends to the boundary of $C$).
Note that the boundedness of the domain of $f^*$ is crucial and allows for simpler proof techniques with explicit constants (see a generalization in ).
Moreover, we assume that the following quantities may be computed efficiently:
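Subgradient of $f$: for any $z \in \mathbb{R}^n$, a subgradient of $f$ at $z$ is any maximizer $y$ of $\max_{y \in C} y^\top z - f^*(y)$.
Gradient of $h^*$: for any $z \in \mathbb{R}^p$, $(h^*)'(z)$ may be computed and is equal to the unique maximizer $x$ of $\max_{x \in \mathbb{R}^p} x^\top z - h(x)$.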
The values of the functions $f$, $h$, $f^*$ and $h^*$ will be useful to compute duality gaps but are not needed to run the algorithms. As shown in Section 2, there are many examples of pairs of functions with the computational constraints described above. If other operations are possible, in particular $\max_{y \in C} y^\top z - f^*(y) - \frac{\varepsilon}{2}\|y\|^2$, then proximal methods [5, 26] applied to the dual problem converge at rate $O(1/t^2)$. If $f$ and $h$ are smooth, then gradient methods (accelerated [25, Section 2.2] or not) have linear convergence rates.
We denote by $g_{\rm primal}(x) = h(x) + f(Ax)$ the primal objective in Eq. (1). It is the sum of a Lipschitz-continuous convex function and a strongly convex function, potentially on a restricted domain $K$ (the domain of $h$). It is thus well adapted to the subgradient method.
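We have the following primal/dual relationships (obtained from Fenchel duality):
$$\min_{x \in \mathbb{R}^p} h(x) + f(Ax) = \min_{x \in \mathbb{R}^p} \max_{y \in C} \; h(x) + y^\top A x - f^*(y) = \max_{y \in C} \Big\{ \min_{x \in \mathbb{R}^p} h(x) + x^\top A^\top y \Big\} - f^*(y) = \max_{y \in C} \; -h^*(-A^\top y) - f^*(y).$$
This leads to the dual maximization problem:
$$\max_{y \in C} \; -h^*(-A^\top y) - f^*(y). \qquad (2)$$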
We denote by $g_{\rm dual}(y) = -h^*(-A^\top y) - f^*(y)$ the dual objective. It has a smooth part $-h^*(-A^\top y)$ defined on $\mathbb{R}^n$ and a potentially non-smooth part $-f^*(y)$, and the problem is restricted onto a bounded set $C$. When $f^*$ is linear (and more generally smooth) on its support, then we are exactly in the situation where conditional gradient algorithms may be used [16, 13].
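Given a pair of primal-dual candidates $(x, y) \in K \times C$, we denote by $\mathrm{gap}(x, y)$ the duality gap:
$$\mathrm{gap}(x, y) = g_{\rm primal}(x) - g_{\rm dual}(y) = \big[ h(x) + h^*(-A^\top y) + y^\top A x \big] + \big[ f(Ax) + f^*(y) - y^\top A x \big].$$
Both bracketed terms are nonnegative by the Fenchel-Young inequality. The gap is equal to zero if and only if (a) $(x, -A^\top y)$ is a Fenchel-dual pair for $h$ and (b) $(Ax, y)$ is a Fenchel-dual pair for $f$. This quantity serves as a certificate of optimality, as
$$\mathrm{gap}(x, y) = \Big[ g_{\rm primal}(x) - \min_{x' \in K} g_{\rm primal}(x') \Big] + \Big[ \max_{y' \in C} g_{\rm dual}(y') - g_{\rm dual}(y) \Big].$$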
The goal of this paper is to show that for certain problems ($f^*$ linear and $h$ quadratic), the subgradient method applied to the primal problem in Eq. (1) is equivalent to the conditional gradient applied to the dual problem in Eq. (2); when relaxing the assumptions above, this equivalence is then between mirror descent methods and generalized conditional gradient algorithms.
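As a hedged illustration of the duality gap as a certificate of optimality, the sketch below evaluates primal and dual objectives on one concrete instance chosen here for illustration (it is not taken from the paper): the quadratic regularizer $h(x) = \frac{\mu}{2}\|x\|^2$ together with the Lipschitz loss $f(z) = \|z - b\|_1$, whose Fenchel conjugate is the linear function $y \mapsto b^\top y$ on the box $[-1, 1]^n$. The function names and the random data are assumptions.

```python
import numpy as np

def primal_objective(x, A, b, mu):
    # h(x) + f(Ax) with h(x) = (mu/2)||x||^2 and f(z) = ||z - b||_1 (assumed instance)
    return 0.5 * mu * (x @ x) + np.abs(A @ x - b).sum()

def dual_objective(y, A, b, mu):
    # -h*(-A^T y) - f*(y) with h*(w) = ||w||^2 / (2 mu) and
    # f*(y) = b^T y on the box [-1, 1]^n, +infinity outside of it
    if np.max(np.abs(y)) > 1.0 + 1e-12:
        return -np.inf
    w = A.T @ y
    return -(w @ w) / (2.0 * mu) - b @ y

def duality_gap(x, y, A, b, mu):
    # Nonnegative for any x and any feasible y (weak duality)
    return primal_objective(x, A, b, mu) - dual_objective(y, A, b, mu)

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))
b = rng.standard_normal(6)
mu = 1.0
x = rng.standard_normal(3)               # arbitrary primal candidate
y = rng.uniform(-1.0, 1.0, size=6)       # arbitrary dual candidate in the box
print("gap at arbitrary candidates:", duality_gap(x, y, A, b, mu))
```

In the scalar case $A = 1$, $b = 3$, $\mu = 1$, the pair $x = 1$, $y = -1$ is primal-dual optimal and both objectives equal $2.5$, so the gap vanishes.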
The non-smooth strongly convex optimization problem defined in Eq. (1) occurs in many applications in machine learning and signal processing, either because these applications are formulated directly in this format, or because their dual in Eq. (2) is (i.e., the original problem is the minimization of a smooth function over a compact set).
2.1 Direct formulations
Typical cases for $h$ (often the regularizer in machine learning and signal processing) are the following:
Squared Euclidean norm: $h(x) = \frac{\mu}{2}\|x\|^2$, which is $\mu$-strongly convex.
Squared Euclidean norm with convex constraints: $h(x) = \frac{\mu}{2}\|x\|^2 + I_K(x)$, with $I_K$ the indicator function for $K$ a closed convex set, which is $\mu$-strongly convex.
Typical cases for $f$ (often the data fitting terms in machine learning and signal processing) are functions of the form $f(z) = \frac{1}{n} \sum_{i=1}^n \ell_i(z_i)$:
Least-absolute-deviation: $\ell_i(z_i) = |z_i - y_i|$, with $y_i \in \mathbb{R}$. Note that the square loss is not Lipschitz-continuous on $\mathbb{R}$ (although it is Lipschitz-continuous when restricted to a bounded set).
Logistic regression: $\ell_i(z_i) = \log(1 + \exp(-z_i y_i))$, with $y_i \in \{-1, 1\}$. Here $f^*$ is not linear in its support, and $f^*$ is not smooth, since it is a sum of negative entropies (and the second-order derivative is not bounded). This extends to any “log-sum-exp” functions which occur as a negative log-likelihood from the exponential family (see, e.g.,  and references therein). Note that $f$ is then smooth and proximal methods with an exponential convergence rate may be used (which correspond to a constant step size in the algorithms presented below, instead of a decaying step size) [26, 5].
Support vector machine: $\ell_i(z_i) = \max\{1 - y_i z_i, 0\}$, with $y_i \in \{-1, 1\}$. Here $f^*$ is linear on its domain (this is a situation where subgradient and conditional gradient methods are exactly equivalent). This extends to more general “max-margin” formulations [31, 30]: in these situations, a combinatorial object (such as a full chain, a graph, a matching or vertices of the hypercube) is estimated (rather than an element of $\{-1, 1\}$) and this leads to functions $z_i \mapsto \ell_i(z_i)$ whose Fenchel-conjugates are linear and have domains which are related to the polytopes associated to the linear programming relaxations of the corresponding combinatorial optimization problems. For these polytopes, often, only linear functions can be maximized, i.e., we can compute a subgradient of $\ell_i$ but typically nothing more.
Other examples may be found in signal processing; for example, total-variation denoising, where the loss is strongly convex but the regularizer is non-smooth, or submodular function minimization cast through separable optimization problems. Moreover, many proximal operators for non-smooth regularizers are of this form, with $h(x) = \frac{1}{2}\|x - x_0\|^2$ for a given point $x_0$ and $f$ a norm (or more generally a gauge function).
2.2 Dual formulations
Another interesting set of examples for machine learning are more naturally described from the dual formulation in Eq. (2): given a smooth loss term $h^*(-A^\top y)$ (this could be least-squares or logistic regression), a typically non-smooth penalization or constraint is added, often through a norm $\Omega$. Thus, this corresponds to functions $f^*$ of the form $f^*(y) = \varphi(\Omega(y))$, where $\varphi$ is a convex non-decreasing function ($f^*$ is then convex).
Our main assumption is that a subgradient of $f$ may be easily computed. This is equivalent to being able to maximize functions of the form $z^\top y - f^*(y) = z^\top y - \varphi(\Omega(y))$ for $z \in \mathbb{R}^n$. If one can compute the dual norm of $z$, $\Omega^*(z) = \max_{\Omega(y) \leq 1} z^\top y$, and in particular a maximizer $y$ in the unit-ball of $\Omega$, then one can compute simply the subgradient of $f$. Only being able to compute the dual norm efficiently is a common situation in machine learning and signal processing, for example, for structured regularizers based on submodularity, all atomic norms, and norms based on matrix decompositions. See additional examples in .
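To make the oracle concrete, here is a minimal sketch for one hypothetical choice of norm, the $\ell_1$-norm, whose dual norm is the $\ell_\infty$-norm. The function name `l1_ball_oracle` and the `radius` parameter are assumptions introduced for illustration. The returned maximizer of $z^\top y$ over the $\ell_1$-ball of the given radius is a subgradient at $z$ of the support function of that ball, i.e., of $z \mapsto \text{radius} \cdot \|z\|_\infty$.

```python
import numpy as np

def l1_ball_oracle(z, radius=1.0):
    """Return (dual_norm, y): the l_infinity norm of z and a maximizer y
    of z^T y over the l1-ball {y : ||y||_1 <= radius}."""
    z = np.asarray(z, dtype=float)
    i = int(np.argmax(np.abs(z)))        # coordinate with largest magnitude
    y = np.zeros_like(z)
    y[i] = radius * np.sign(z[i])        # signed, scaled canonical basis vector
    dual_norm = float(np.abs(z[i]))
    return dual_norm, y

z = np.array([0.5, -2.0, 1.0])
dual_norm, y = l1_ball_oracle(z, radius=3.0)
print(dual_norm, y, z @ y)               # z @ y equals radius * dual_norm
```

Only one coordinate is touched, so the oracle costs a single pass over $z$, whereas a Euclidean projection onto the same ball requires a sort or a more careful selection procedure.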
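Our assumption regarding the compact domain of $f^*$ translates to the assumption that $\varphi$ has compact domain. This includes indicator functions $\varphi = I_{[0, \omega_0]}$, which correspond to the constraint $\Omega(y) \leq \omega_0$. We may also consider $\varphi(\omega) = \lambda \omega + I_{[0, \omega_0]}(\omega)$, which corresponds to jointly penalizing and constraining the norm; in practice, $\omega_0$ may be chosen so that the constraint $\Omega(y) \leq \omega_0$ is not active at the optimum and we get the solution of the penalized problem $\max_{y \in \mathbb{R}^n} -h^*(-A^\top y) - \lambda \Omega(y)$. See [17, 34, 1] for alternative approaches.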
3 Mirror descent for strongly convex problems
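We first assume that the function $h$ is essentially smooth (i.e., differentiable at any point in the interior of $K$, and so that the norm of gradients converges to $+\infty$ when approaching the boundary of $K$); then $h'$ is a bijection from $\mathrm{int}(K)$ to $\mathbb{R}^p$, where $K$ is the domain of $h$ (see, e.g., [28, 18]). We consider the Bregman divergence
$$D(x_1, x_2) = h(x_1) - h(x_2) - (x_1 - x_2)^\top h'(x_2).$$
It is always defined on $K \times \mathrm{int}(K)$, and is nonnegative. If $x_1, x_2 \in \mathrm{int}(K)$, then $D(x_1, x_2) = 0$ if and only if $x_1 = x_2$. Moreover, since $h$ is assumed $\mu$-strongly convex, we have $D(x_1, x_2) \geq \frac{\mu}{2}\|x_1 - x_2\|^2$. See more details in . For example, when $h(x) = \frac{\mu}{2}\|x\|^2$, we have $D(x_1, x_2) = \frac{\mu}{2}\|x_1 - x_2\|^2$.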
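As a small numerical check of the quadratic example, the sketch below evaluates the standard Bregman divergence $D(x_1, x_2) = h(x_1) - h(x_2) - (x_1 - x_2)^\top h'(x_2)$ for $h(x) = \frac{\mu}{2}\|x\|^2$ and compares it with $\frac{\mu}{2}\|x_1 - x_2\|^2$. The helper name `bregman` and the test vectors are assumptions made for illustration.

```python
import numpy as np

def bregman(h, grad_h, x1, x2):
    # D(x1, x2) = h(x1) - h(x2) - <x1 - x2, h'(x2)>
    return h(x1) - h(x2) - (x1 - x2) @ grad_h(x2)

mu = 2.0

def h_quad(x):
    return 0.5 * mu * (x @ x)

def grad_quad(x):
    return mu * x

x1 = np.array([1.0, -2.0, 0.5])
x2 = np.array([0.0, 1.0, 1.5])
d = bregman(h_quad, grad_quad, x1, x2)
reference = 0.5 * mu * np.sum((x1 - x2) ** 2)
print(d, reference)                      # the two values coincide for the quadratic h
```

The same helper accepts any differentiable $h$ with its gradient, so other choices of $h$ can be tried by swapping the two callables.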
Subgradient descent for square Bregman divergence
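We first consider the common situation where $h(x) = \frac{\mu}{2}\|x\|^2$; the primal problem then becomes:
$$\min_{x \in \mathbb{R}^p} \; f(Ax) + \frac{\mu}{2}\|x\|^2.$$
The projected subgradient method starts from any $x_0 \in \mathbb{R}^p$, and iterates the following recursion:
$$x_t = x_{t-1} - \frac{\rho_t}{\mu} \big[ A^\top f'(A x_{t-1}) + \mu x_{t-1} \big],$$
where $f'(A x_{t-1})$ is any subgradient of $f$ at $A x_{t-1}$. The step size is $\rho_t / \mu$.
The recursion may be rewritten as
$$\mu x_t = \mu x_{t-1} - \rho_t \big[ A^\top f'(A x_{t-1}) + \mu x_{t-1} \big],$$
which is equivalent to $x_t$ being the unique minimizer of
$$(x - x_{t-1})^\top \big[ A^\top f'(A x_{t-1}) + \mu x_{t-1} \big] + \frac{\mu}{2 \rho_t} \|x - x_{t-1}\|^2, \qquad (3)$$
which is the traditional proximal step, with step size $\rho_t / \mu$.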
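The recursion can be sketched in a few lines. The instance below is an assumption chosen for illustration and is not prescribed by the text: the loss is $f(z) = \|z - b\|_1$, for which the sign vector of $Ax - b$ is a valid subgradient, and the step-size sequence is $\rho_t = 2/(t+1)$. The function name is hypothetical.

```python
import numpy as np

def subgradient_method(A, b, mu, x0, n_iter):
    """Subgradient method for min_x (mu/2)||x||^2 + ||A x - b||_1,
    with step size rho_t / mu and rho_t = 2 / (t + 1) (assumed schedule)."""
    x = np.array(x0, dtype=float)
    iterates = [x.copy()]
    for t in range(1, n_iter + 1):
        rho = 2.0 / (t + 1)
        s = np.sign(A @ x - b)                     # a subgradient of f at A x
        x = x - (rho / mu) * (A.T @ s + mu * x)    # step along the full subgradient
        iterates.append(x.copy())
    return iterates

def objective(x, A, b, mu):
    return 0.5 * mu * (x @ x) + np.abs(A @ x - b).sum()

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
iterates = subgradient_method(A, b, mu=1.0, x0=np.zeros(5), n_iter=200)
best = min(objective(x, A, b, 1.0) for x in iterates)
print("best objective value among iterates:", best)
```

Because the method is not a descent method, the sketch reports the best value over all iterates rather than the value at the last one.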
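We may interpret the last formulation in Eq. (3) for the square regularizer $h(x) = \frac{\mu}{2}\|x\|^2$ as the minimization of
$$(x - x_{t-1})^\top \big[ A^\top f'(A x_{t-1}) + h'(x_{t-1}) \big] + \frac{1}{\rho_t} D(x, x_{t-1}),$$
with solution defined through (note that $h'$ is a bijection from $\mathrm{int}(K)$ to $\mathbb{R}^p$):
$$h'(x_t) = h'(x_{t-1}) - \rho_t \big[ A^\top f'(A x_{t-1}) + h'(x_{t-1}) \big] = (1 - \rho_t)\, h'(x_{t-1}) - \rho_t A^\top f'(A x_{t-1}).$$
This leads to the following definition of the mirror descent recursion:
$$y_{t-1} \in \arg\max_{y \in C} \; y^\top A x_{t-1} - f^*(y), \qquad x_t = \arg\min_{x \in \mathbb{R}^p} \; h(x) - (1 - \rho_t)\, x^\top h'(x_{t-1}) + \rho_t\, x^\top A^\top y_{t-1}. \qquad (4)$$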
Proposition 1 (Convergence of mirror descent in the strongly convex case)
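Assume that (a) $f$ is Lipschitz-continuous and finite on $\mathbb{R}^n$, with $C$ the domain of $f^*$, (b) $h$ is essentially smooth and $\mu$-strongly convex. Consider $\rho_t = 2/(t+1)$ and $R^2 = \max_{y, y' \in C} \|A^\top (y - y')\|^2$. Denoting by $x_*$ the unique minimizer of $g_{\rm primal}$, after $t$ iterations of the mirror descent recursion of Eq. (4), we have: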
Proof We follow the proof of  and adapt it to the strongly convex case. We have, by reordering terms and using the optimality condition :
In order to upper-bound the two terms in Eq. (3), we first consider the following bound (obtained by convexity of and the definition of ):
which may be rewritten as:
Moreover, by definition of ,
with . The function is -strongly convex, and its Fenchel conjugate is thus -smooth. This implies that is -smooth. Since and , . Moreover, . Since (because is a convex combination of such elements), then .
With , we obtain
Thus, by summing from to , we obtain
This implies that , i.e., the iterates converge. Moreover, using the convexity of $g_{\rm primal}$,
i.e., the objective value at an averaged iterate converges, and
i.e., one of the iterates has an objective that converges.
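Note that with the step size $\rho_t = 2/(t+1)$, we have
$$h'(x_t) = \frac{t-1}{t+1}\, h'(x_{t-1}) - \frac{2}{t+1}\, A^\top y_{t-1}, \quad \text{i.e.,} \quad t(t+1)\, h'(x_t) = (t-1)t\, h'(x_{t-1}) - 2t\, A^\top y_{t-1}.$$
By summing these equalities, we obtain $t(t+1)\, h'(x_t) = -2 \sum_{u=1}^t u\, A^\top y_{u-1}$, i.e.,
$$h'(x_t) = -\frac{2}{t(t+1)} \sum_{u=1}^t u\, A^\top y_{u-1},$$
that is, $h'(x_t)$ is (up to sign) a weighted average of the subgradients $A^\top y_{u-1}$ (with more weights on later iterates).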
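The weighting can be checked numerically. The sketch below assumes the step size $\rho_t = 2/(t+1)$ and unrolls a generic averaging recursion $v_t = (1 - \rho_t) v_{t-1} + \rho_t g_t$ (with $v$ and $g$ as placeholder names), then compares the resulting weights with the closed form $2u / (t(t+1))$. Since $\rho_1 = 1$, the initial point receives zero weight.

```python
import numpy as np

def recursion_weights(t_max):
    """Weights w[u-1] such that v_t = sum_u w[u-1] * g_u after running
    v_t = (1 - rho_t) v_{t-1} + rho_t g_t with rho_t = 2/(t+1), t = 1..t_max."""
    w = np.zeros(t_max)
    for t in range(1, t_max + 1):
        rho = 2.0 / (t + 1)
        w[: t - 1] *= 1.0 - rho          # older terms are shrunk by (1 - rho_t)
        w[t - 1] = rho                   # the newest term enters with weight rho_t
    return w

t_max = 10
w = recursion_weights(t_max)
closed_form = 2.0 * np.arange(1, t_max + 1) / (t_max * (t_max + 1))
print(np.max(np.abs(w - closed_form)))   # agreement up to rounding
```

The weights sum to one and grow linearly in $u$, which is the sense in which later iterates count more.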
For $\rho_t = 1/t$, with the same techniques, we would obtain a convergence rate proportional to $\frac{\log t}{t}$ for the average iterate $\frac{1}{t}\sum_{u=1}^t x_{u-1}$, thus with an additional $\log t$ factor (see a similar situation in the stochastic case in ). We would then have $h'(x_t) = -\frac{1}{t}\sum_{u=1}^t A^\top y_{u-1}$, and this is exactly a form of dual averaging method, which also comes with primal-dual guarantees.
Generalization to non-smooth
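The previous result does not require $h$ to be essentially smooth, i.e., it may be applied to $h(x) = \frac{\mu}{2}\|x\|^2 + I_K(x)$ where $K$ is a closed convex set strictly included in $\mathbb{R}^p$. In the mirror descent recursion,
$$x_t = \arg\min_{x \in \mathbb{R}^p} \; h(x) - (1 - \rho_t)\, x^\top h'(x_{t-1}) + \rho_t\, x^\top A^\top y_{t-1},$$
there may then be multiple choices for $h'(x_{t-1})$. If we choose for $h'(x_{t-1})$ at iteration $t$, the subgradient of $h$ obtained at the previous iteration, i.e., such that $h'(x_{t-1}) = (1 - \rho_{t-1})\, h'(x_{t-2}) - \rho_{t-1} A^\top y_{t-2}$, then the proof of Prop. 1 above holds.
Note that when $h(x) = \frac{\mu}{2}\|x\|^2 + I_K(x)$, the algorithm above is not equivalent to classical projected gradient descent. Indeed, the classical algorithm has the iteration
$$x_t = \Pi_K\Big( x_{t-1} - \frac{\rho_t}{\mu} \big[ \mu x_{t-1} + A^\top f'(A x_{t-1}) \big] \Big),$$
where $\Pi_K$ denotes the orthogonal projection onto $K$, and corresponds to the choice $h'(x_{t-1}) = \mu x_{t-1}$ in the mirror descent recursion, which, when $x_{t-1}$ is on the boundary of $K$, is not the choice that we need for the equivalence in Section 4.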
4 Conditional gradient method and extensions
In this section, we first review the classical conditional gradient algorithm, which corresponds to the extra assumption that $f^*$ is linear in its domain.
Conditional gradient method
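Given a maximization problem of the following form (i.e., where $f^*$ is linear on its domain, or equal to zero by a simple change of variable):
$$\max_{y \in C} \; -h^*(-A^\top y),$$
the conditional gradient algorithm consists in the following iteration (note that below $A x_{t-1} = A (h^*)'(-A^\top y_{t-1})$ is the gradient of the objective function and that we are maximizing the first-order Taylor expansion to obtain a candidate $\bar{y}_{t-1}$ towards which we make a small step):
$$x_{t-1} = \arg\min_{x \in \mathbb{R}^p} \; h(x) + x^\top A^\top y_{t-1}, \qquad \bar{y}_{t-1} \in \arg\max_{y \in C} \; y^\top A x_{t-1}, \qquad y_t = (1 - \rho_t)\, y_{t-1} + \rho_t\, \bar{y}_{t-1}.$$
It corresponds to a linearization of $-h^*(-A^\top y)$ and its maximization over the bounded convex set $C$. As we show later, the choice of $\rho_t$ may be done in different ways, through a fixed step size or by (approximate) line search.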
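Following , the conditional gradient method can be generalized to problems of the form
$$\max_{y \in C} \; -h^*(-A^\top y) - f^*(y),$$
with the following iteration:
$$x_{t-1} = \arg\min_{x \in \mathbb{R}^p} \; h(x) + x^\top A^\top y_{t-1}, \qquad \bar{y}_{t-1} \in \arg\max_{y \in C} \; y^\top A x_{t-1} - f^*(y), \qquad y_t = (1 - \rho_t)\, y_{t-1} + \rho_t\, \bar{y}_{t-1}. \qquad (7)$$
The previous algorithm may be interpreted as follows: (a) perform a first-order Taylor expansion of the smooth part $-h^*(-A^\top y)$, while leaving the other part $-f^*(y)$ intact, (b) maximize the approximation, and (c) perform a small step towards the maximizer. Note the similarity (and dissimilarity) with proximal methods which would add a proximal term proportional to $\|y - y_{t-1}\|^2$, leading to faster convergence, but with the extra requirement of solving the proximal step [26, 5].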
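The three steps can be sketched on one concrete instance, chosen here as an assumption for illustration: $h(x) = \frac{\mu}{2}\|x\|^2$ and $f(z) = \|z - b\|_1$, so that $f^*(y) = b^\top y$ on the box $[-1, 1]^n$ and the dual is $\max_{y \in [-1,1]^n} -\frac{1}{2\mu}\|A^\top y\|^2 - b^\top y$. The sketch also runs the primal subgradient recursion from the matching starting point $x_0 = -A^\top y_0 / \mu$ and reports how far the two sequences are from satisfying $x_t = -A^\top y_t / \mu$. Function names, data and the schedule $\rho_t = 2/(t+1)$ are hypothetical choices.

```python
import numpy as np

def subgradient_method(A, b, mu, x0, n_iter):
    # Primal recursion: x_t = (1 - rho_t) x_{t-1} - (rho_t / mu) A^T s_{t-1}
    x = np.array(x0, dtype=float)
    xs = [x.copy()]
    for t in range(1, n_iter + 1):
        rho = 2.0 / (t + 1)
        s = np.sign(A @ x - b)               # subgradient of ||. - b||_1 at A x
        x = (1.0 - rho) * x - (rho / mu) * (A.T @ s)
        xs.append(x.copy())
    return xs

def conditional_gradient(A, b, mu, y0, n_iter):
    # Dual recursion on max_{y in [-1,1]^n} -||A^T y||^2 / (2 mu) - b^T y
    y = np.array(y0, dtype=float)
    ys = [y.copy()]
    for t in range(1, n_iter + 1):
        rho = 2.0 / (t + 1)
        x = -(A.T @ y) / mu                  # x_{t-1} = argmin_x h(x) + x^T A^T y_{t-1}
        y_bar = np.sign(A @ x - b)           # maximizes y^T (A x - b) over the box
        y = (1.0 - rho) * y + rho * y_bar    # small step towards the maximizer
        ys.append(y.copy())
    return ys

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3))
b = rng.standard_normal(8)
mu = 1.0
y0 = np.zeros(8)
x0 = -(A.T @ y0) / mu                        # matching primal starting point
xs = subgradient_method(A, b, mu, x0, 30)
ys = conditional_gradient(A, b, mu, y0, 30)
max_diff = max(np.max(np.abs(x + (A.T @ y) / mu)) for x, y in zip(xs, ys))
print("largest deviation from x_t = -A^T y_t / mu:", max_diff)
```

Both loops use the same tie-breaking rule (`np.sign`) for the linear maximization, which is what makes the two sequences comparable iterate by iterate; a different selection among maximizers would give a different, equally valid, run of each method.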
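Note that here $y_t$ may be expressed as a convex combination of $y_0$ and all $\bar{y}_{u-1}$, $u \in \{1, \dots, t\}$:
$$y_t = \Big( \prod_{s=1}^t (1 - \rho_s) \Big) y_0 + \sum_{u=1}^t \rho_u \Big( \prod_{s=u+1}^t (1 - \rho_s) \Big) \bar{y}_{u-1},$$
and that when we choose $\rho_t = 2/(t+1)$, it simplifies to:
$$y_t = \frac{2}{t(t+1)} \sum_{u=1}^t u\, \bar{y}_{u-1}.$$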
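When $h$ is essentially smooth (and thus $h^*$ is essentially strictly convex), it can be reformulated with $h'(x_t) = -A^\top y_t$ as follows:
$$h'(x_t) = (1 - \rho_t)\, h'(x_{t-1}) - \rho_t\, A^\top \bar{y}_{t-1}, \qquad \bar{y}_{t-1} \in \arg\max_{y \in C} \; -f^*(y) + y^\top A x_{t-1},$$
which is exactly the mirror descent algorithm described in Eq. (4). This leads to the following proposition: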
Proposition 2 (Equivalence between mirror descent and generalized conditional gradient)
When $h$ is not essentially smooth, then with a particular choice of subgradient (see end of Section 3), the two algorithms are also equivalent. We now provide convergence proofs for the two versions (with adaptive and non-adaptive step sizes); similar rates may be obtained without the boundedness assumptions, but our results provide explicit constants and primal-dual guarantees. We first have the following convergence proof for generalized conditional gradient with no line search (the proof of dual convergence uses standard arguments from [13, 15], while the convergence of gaps is due to  for the regular conditional gradient):
Proposition 3 (Convergence of extended conditional gradient - no line search)
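Assume that (a) $f$ is Lipschitz-continuous and finite on $\mathbb{R}^n$, with $C$ the domain of $f^*$, (b) $h$ is $\mu$-strongly convex. Consider $\rho_t = 2/(t+1)$ and $R^2 = \max_{y, y' \in C} \|A^\top (y - y')\|^2$. Denoting by $y_*$ any maximizer of $g_{\rm dual}$ on $C$, after $t$ iterations of the generalized conditional gradient recursion of Eq. (7), we have: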
Proof We have (using convexity of $f^*$ and $\frac{1}{\mu}$-smoothness of $h^*$):