Conditional Gradient Algorithms for Norm-Regularized Smooth Convex Optimization

Motivated by some applications in signal processing and machine learning, we consider two convex optimization problems where, given a cone $K$, a norm $\|\cdot\|$ and a smooth convex function $f$, we want either 1) to minimize the norm over the intersection of the cone and a level set of $f$, or 2) to minimize over the cone the sum of $f$ and a multiple of the norm. We focus on the case where (a) the dimension of the problem is too large to allow for interior point algorithms, and (b) $\|\cdot\|$ is "too complicated" to allow for the computationally cheap Bregman projections required in first-order proximal gradient algorithms. On the other hand, we assume that it is relatively easy to minimize linear forms over the intersection of $K$ and the unit $\|\cdot\|$-ball. Motivating examples are given by the nuclear norm with $K$ being the entire space of matrices, or the positive semidefinite cone in the space of symmetric matrices, and the Total Variation norm on the space of 2D images. We discuss versions of the Conditional Gradient algorithm capable of handling our problems of interest, provide the related theoretical efficiency estimates and outline some applications.


1 Introduction

We consider two norm-regularized convex optimization problems as follows:

[norm minimization] (1)
[penalized minimization] (2)

where $f$ is a convex function with Lipschitz continuous gradient, $K$ is a closed convex cone in a Euclidean space, $\|\cdot\|$ is some norm, and the scalars appearing in (1) and (2) are positive parameters. Problems such as (1) and (2) are of definite interest for signal processing and machine learning. In these applications,

quantifies the discrepancy between the observed noisy output of some parametric model and the output of the model with candidate vector

of parameters. Most notably, is the quadratic penalty: , where

is the “true” output of the linear regression model

, and , where is the vector of true parameters, is the observation error, and is an a priori upper bound on . The cone sums up a priori information on the parameter vectors (e.g., – no a priori information at all, or , or , the space of symmetric matrices, and , the cone of positive semidefinite matrices, as is the case of covariance matrices recovery). Finally, is a regularizing norm “promoting” a desired property of the recovery, e.g., the sparsity-promoting norm on , or the low rank promoting nuclear norm on , or the Total Variation (TV) norm, as in image reconstruction.

In the large-scale case, first-order algorithms of proximal-gradient type are popular to tackle such problems, see [30] for a recent overview. Among them, the celebrated Nesterov optimal gradient methods for smooth and composite minimization [22, 23, 24], and their stochastic approximation counterparts [18], are now state-of-the-art in compressive sensing and machine learning. These algorithms enjoy the best theoretical efficiency estimates known so far (and in some cases, these estimates are the best possible for first-order algorithms). For instance, Nesterov’s algorithm for penalized minimization [23, 24] solves (2) to accuracy $\epsilon$ in $O(D_0\sqrt{L/\epsilon})$ iterations, where $L$ is the properly defined Lipschitz constant of the gradient of $f$, and $D_0$ is the initial distance to the optimal set, measured in the norm $\|\cdot\|$. However, applicability and efficiency of proximal-gradient algorithms in the large-scale case require the problem to possess “favorable geometry” (for details, see [24, Section A.6]). To be more specific, consider a proximal-gradient algorithm for convex minimization problems of the form

(3)

The comments to follow, with slight modifications, are applicable to problems such as (1) and (2) as well. In this case, a proximal-gradient algorithm operates with a “distance generating function” (d.g.f.) defined on the domain of the problem and strongly convex w.r.t. the norm $\|\cdot\|$. Each step of the algorithm requires minimizing the sum of the d.g.f. and a linear form. The efficiency estimate of the algorithm depends on the variation of the d.g.f. on the domain and on the regularity of $f$ w.r.t. $\|\cdot\|$ (i.e., the Lipschitz constant of $f$ w.r.t. $\|\cdot\|$ in the nonsmooth case, or the Lipschitz constant of the gradient mapping w.r.t. the norm $\|\cdot\|$ on the argument space and the conjugate of this norm on the image space in the smooth case). As a result, in order for a proximal-gradient algorithm to be practical in the large-scale case, two “favorable geometry” conditions should be met: (a) the outlined sub-problems should be easy to solve, and (b) the variation of the d.g.f. on the domain of the problem should grow slowly (if at all) with the problem’s dimension. Both these conditions indeed are met in many applications; see, e.g., [2, 17] for examples. This explains the recent popularity of this family of algorithms.

However, sometimes conditions (a) and/or (b) are violated, and application of proximal algorithms becomes questionable. For example, for the case of , (b) is violated for the usual -norm on or, more generally, for norm on the space of matrices given by

where denotes the -th row of . Here the variation of (any) d.g.f. on problem’s domain is at least . As a result, in the case in question the theoretical iteration complexity of a proximal algorithm grows rapidly with the dimension . Furthermore, for some high-dimensional problems which do satisfy (b), solving the sub-problem can be computationally challenging. Examples of such problems include nuclear-norm-based matrix completion, Total Variation-based image reconstruction, and multi-task learning with a large number of tasks and features. This corresponds to in  (1) or  (2) being the nuclear norm [10] or the TV-norm.

These limitations recently motivated alternative approaches, which do not rely upon favorable geometry of the problem domain and/or do not require solving hard sub-problems at each iteration, and triggered a renewed interest in the Conditional Gradient (CndG) algorithm. This algorithm, also known as the Frank-Wolfe algorithm, which is historically the first method for smooth constrained convex optimization, originates from [8], and was extensively studied in the 1970s (see, e.g., [5, 7, 25] and references therein). CndG algorithms work by minimizing a linear form on the problem domain at each iteration; this auxiliary problem clearly is easier, and in many cases significantly easier, than the auxiliary problem arising in proximal-gradient algorithms. Conditional gradient algorithms for collaborative filtering were studied recently [15, 14]; some variants and extensions were studied in [6, 29, 10]. Those works consider constrained formulations of machine learning or signal processing problems, i.e., minimizing the discrepancy under a constraint on the norm of the solution, as in (3). On the other hand, CndG algorithms for other learning formulations, such as norm minimization (1) or penalized minimization (2), remain open issues. An exception is the work of [6, 10], where a Conditional Gradient algorithm for penalized minimization was studied, although the efficiency estimates obtained there were suboptimal. In this paper, we present CndG-type algorithms aimed at solving norm minimization and penalized norm minimization problems and provide theoretical efficiency guarantees for these algorithms.

The main body of the paper is organized as follows. In Section 2, we present the detailed setting of problems (1), (2) along with basic assumptions on the “computational environment” required by the CndG-based algorithms we are developing. These algorithms and their efficiency bounds are presented in Sections 4 (problem (1)) and 5 (problem (2)). In Section 6 we outline some applications, and in Section 7 present preliminary numerical results. All proofs are relegated to the appendix.

2 Problem statement

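Throughout the paper, we shall assume that $K$ is a closed convex cone in a Euclidean space $E$; we lose nothing by assuming that $K$ linearly spans $E$. We assume, further, that $\|\cdot\|$ is a norm on $E$, and $f$ is a convex function on $K$ with Lipschitz continuous gradient, so that

$$\|f'(x) - f'(y)\|_* \le L_f \|x - y\| \quad \forall x, y \in K,$$

where $\|\cdot\|_*$ denotes the norm dual to $\|\cdot\|$, whence

$$\forall x, y \in K: \quad f(y) \le f(x) + \langle f'(x), y - x\rangle + \frac{L_f}{2}\|y - x\|^2. \qquad (4)$$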

We consider two kinds of problems, detailed below.

Norm-minimization.

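Such problems correspond to

$$\rho_* = \min_x \{\|x\| : x \in K,\ f(x) \le 0\}. \qquad (5)$$

To tackle (5), we consider the following parametric family of problems

$$\mathrm{Opt}(\rho) = \min_x \{f(x) : \|x\| \le \rho,\ x \in K\}. \qquad (6)$$

Note that whenever (5) is feasible, which we assume from now on, we have

$$\rho_* = \min\{\rho \ge 0 : \mathrm{Opt}(\rho) \le 0\}, \qquad (7)$$

and both problems (5), (7) can be solved.

Given a tolerance $\epsilon > 0$, we want to find an $\epsilon$-solution to the problem, that is, a pair $\rho_\epsilon$, $x_\epsilon \in K$ such that

$$\rho_\epsilon \le \rho_* \quad \text{and} \quad x_\epsilon \in X_{\rho_\epsilon},\ f(x_\epsilon) \le \epsilon, \qquad (8)$$

where $X_\rho := \{x \in K : \|x\| \le \rho\}$. Getting back to the problem of interest (5), $x_\epsilon$ is then “super-optimal” and $\epsilon$-feasible:

$$\|x_\epsilon\| \le \rho_\epsilon \le \rho_*, \quad f(x_\epsilon) \le \epsilon.$$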

Penalized norm minimization.

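These problems are written as

$$\mathrm{Opt} = \min_x \{f(x) + \kappa\|x\| : x \in K\}. \qquad (9)$$

An equivalent formulation is

$$\mathrm{Opt} = \min_{x, r} \{F([x; r]) := \kappa r + f(x) : x \in K,\ \|x\| \le r\}. \qquad (10)$$

We shall refer to (10) as the problem of composite optimization (CO). Given a tolerance $\epsilon > 0$, our goal is to find an $\epsilon$-solution to (10), defined as a feasible solution $(x_\epsilon, r_\epsilon)$ to the problem satisfying $F([x_\epsilon; r_\epsilon]) - \mathrm{Opt} \le \epsilon$. Note that in this case $x_\epsilon$ is an $\epsilon$-solution, in the similar sense, to (9).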

Special case.

In many applications where Problems (5) and (9) arise, the function $f$ enjoys a special structure:

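$$f(x) = \phi(Ax - b),$$

where $x \mapsto Ax - b$ is an affine mapping from $E$ to $\mathbf{R}^m$, and $\phi(\cdot)$ is a convex function on $\mathbf{R}^m$ with Lipschitz continuous gradient; we shall refer to this situation as the special case. In this case, the Lipschitz constant $L_f$ of the gradient of $f$ can be bounded as follows. Let $\pi(\cdot)$ be some norm on $\mathbf{R}^m$, $\pi_*(\cdot)$ be the conjugate norm, and $\|A\|_{\|\cdot\|,\pi}$ be the norm of the linear mapping $x \mapsto Ax$ induced by the norms $\|\cdot\|$, $\pi(\cdot)$ on the argument and the image spaces:

$$\|A\|_{\|\cdot\|,\pi} = \max_x \{\pi(Ax) : \|x\| \le 1\}.$$

Let also $L_\pi[\phi]$ be the Lipschitz constant of the gradient of $\phi$ induced by the norm $\pi(\cdot)$, so that

$$\pi_*(\phi'(u) - \phi'(u')) \le L_\pi[\phi]\,\pi(u - u') \quad \forall u, u' \in \mathbf{R}^m.$$

Then, one can take as $L_f$ the quantity

$$L_f = L_\pi[\phi]\,\|A\|_{\|\cdot\|,\pi}^2. \qquad (11)$$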

Example 1: quadratic fit. In many applications, we are interested in -discrepancy between and ; the related choice of is . Specifying as , we get .

Example 2: smoothed fit. When interested in discrepancy between and , we can use as the function , where . Taking as , we get

Note that

so that for and large enough (specifically, such that ), is within absolute constant factor of . The latter situation can be interpreted as behaving as ). At the same time, with , grows with logarithmically.
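The two-sided bound discussed in Example 2 can be checked numerically. The sketch below assumes that the smoothed fit is $\phi(u) = \frac{1}{2}\|u\|_\beta^2$ on $\mathbf{R}^m$ and that $\beta = 2\ln m$; both choices, as well as the helper name `phi` and the test vector, are illustrative assumptions rather than something stated in the text. Under these assumptions $\phi(u)$ lies between $\frac{1}{2}\|u\|_\infty^2$ and $\frac{1}{2}m^{2/\beta}\|u\|_\infty^2$, and $m^{2/\beta} = e$.

```python
import math


def phi(u, beta):
    """Assumed smoothed fit: 0.5 * ||u||_beta^2."""
    scale = max(abs(v) for v in u)
    if scale == 0.0:
        return 0.0
    # Factor out the largest entry so that every power stays in [0, 1].
    s = sum((abs(v) / scale) ** beta for v in u)
    return 0.5 * (scale * s ** (1.0 / beta)) ** 2


m = 1000
u = [math.sin(i) for i in range(m)]       # hypothetical residual vector
beta = 2.0 * math.log(m)                  # assumed choice beta = 2 ln m

lower = 0.5 * max(abs(v) for v in u) ** 2  # 0.5 * ||u||_inf^2
upper = lower * m ** (2.0 / beta)          # m**(2/beta) equals e here
value = phi(u, beta)
print(lower, value, upper)
```

With this choice of $\beta$ the smooth surrogate stays within the constant factor $e$ of the nonsmooth quantity $\frac{1}{2}\|u\|_\infty^2$, whatever the dimension $m$.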

Another widely used choice of for this type of discrepancy is “logistic” function

For we easily compute and .

Note that in some applications we are interested in “one-sided” discrepancies quantifying the magnitude of the vector rather than the magnitude of the vector itself. Here, instead of using in the context of examples 1 and 2, one can use the functions . In this case the bounds on are exactly the same as the above bounds on . The obvious substitute for the two-sided logistic function is its “one-sided version:” which obeys the same bound for as its two-sided analogue.

First-order and Linear Optimization oracles.

We assume that $f$ is represented by a first-order oracle, a routine which, given on input a point $x \in K$, returns the value $f(x)$ and the gradient $f'(x)$ of $f$ at $x$. As for $K$ and $\|\cdot\|$, we assume that they are given by a Linear Optimization (LO) oracle which, given on input a linear form $\langle \eta, \cdot\rangle$ on $E$, returns a minimizer $x[\eta]$ of this linear form on the set $\{x \in K : \|x\| \le 1\}$. We assume w.l.o.g. that for every $\eta$, $x[\eta]$ is either zero, or is a vector of $\|\cdot\|$-norm equal to 1. To ensure this property, it suffices to compute $\langle \eta, x[\eta]\rangle$ for $x[\eta]$ given by the oracle; if this inner product is 0, we can reset $x[\eta] = 0$, otherwise $\|x[\eta]\|$ is automatically equal to 1.

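Note that an LO oracle for $K$ and $\|\cdot\|$ allows one to find a minimizer of a linear form of $[x; r] \in E \times \mathbf{R}$ on a set of the form $\{[x; r] : x \in K,\ \|x\| \le r \le \rho\}$ due to the following observation:

Lemma 1.

Let $\rho \ge 0$ and $K^+[\rho] = \{[x; r] : x \in K,\ \|x\| \le r \le \rho\}$. Consider the linear form $\ell(z) = \langle \eta, x\rangle + \sigma r$ of $z = [x; r]$, and let

$$z^+ = \begin{cases} \rho\,[x[\eta]; 1], & \langle \eta, x[\eta]\rangle + \sigma \le 0, \\ 0, & \text{otherwise.} \end{cases}$$

Then $z^+$ is a minimizer of $\ell(z)$ over $K^+[\rho]$. When $\sigma = 0$, one has $z^+ = \rho\,[x[\eta]; 1]$.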

Indeed, let be a minimizer of over . Since due to , we have due to , and due to the definition of . We conclude that any minimizer of over the segment is also a minimizer of over . It remains to note that the vector indicated in Lemma clearly is a minimizer of on the above segment.  
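The construction behind Lemma 1 can be sketched in a few lines. The sketch assumes, as far as can be inferred from the surrounding text, that the set in question is $\{(x, r) : x \in K,\ \|x\| \le r \le \rho\}$ and that the linear form is $\langle \eta, x\rangle + \sigma r$; for concreteness it further assumes $K = \mathbf{R}^n$ and the $\ell_1$-norm. The names `lo_oracle_l1` and `lifted_lo_oracle` are hypothetical.

```python
def lo_oracle_l1(eta):
    """LO oracle for K = R^n and the l1-norm: minimizes <eta, x> over ||x||_1 <= 1."""
    i = max(range(len(eta)), key=lambda j: abs(eta[j]))
    x = [0.0] * len(eta)
    if eta[i] != 0.0:
        # Unit-norm vertex of the l1-ball; the answer is zero when eta == 0.
        x[i] = -1.0 if eta[i] > 0 else 1.0
    return x


def lifted_lo_oracle(eta, sigma, rho, lo_oracle=lo_oracle_l1):
    """Minimize <eta, x> + sigma * r over {(x, r): x in K, ||x|| <= r <= rho}."""
    x_hat = lo_oracle(eta)
    slope = sum(e * v for e, v in zip(eta, x_hat)) + sigma
    # The form is linear along the segment {s * (x_hat, 1): 0 <= s <= rho},
    # so its minimum over the segment is attained at one of the two endpoints.
    if slope <= 0.0:
        return [rho * v for v in x_hat], rho
    return [0.0] * len(eta), 0.0


x_plus, r_plus = lifted_lo_oracle([0.5, -2.0, 1.0], sigma=1.5, rho=3.0)
x_zero, r_zero = lifted_lo_oracle([0.5, -2.0, 1.0], sigma=2.5, rho=3.0)
print(x_plus, r_plus)
print(x_zero, r_zero)
```

In the first call the penalty on $r$ is smaller than the largest entry of $\eta$ in absolute value, so the minimizer is the far endpoint of the segment; in the second call it is larger, and the origin is returned.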

3 Conditional Gradient algorithm

In this section, we present an overview of the properties of the standard Conditional Gradient algorithm, and highlight some memory-based extensions. These properties are not new. However, since they are key for the design of our proposed algorithms in the next sections, we present them for further reference.

3.1 Conditional gradient algorithm

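Let $E$ be a Euclidean space and $X$ be a closed and bounded convex set in $E$ which linearly spans $E$. Assume that $X$ is given by an LO oracle, a routine which, given on input $\eta \in E$, returns an optimal solution $x_X[\eta]$ to the optimization problem

$$\min_{x \in X} \langle \eta, x\rangle$$

(cf. Section 2). Let $f$ be a convex differentiable function on $X$ with Lipschitz continuous gradient $f'(x)$, so that

$$\forall x, y \in X: \quad f(y) \le f(x) + \langle f'(x), y - x\rangle + \frac{L}{2}\|y - x\|_X^2, \qquad (12)$$

where $\|\cdot\|_X$ is the norm on $E$ with the unit ball $X - X$. We intend to solve the problem

$$f_* = \min_{x \in X} f(x). \qquad (13)$$

A generic CndG algorithm is a recurrence which builds iterates $x_t \in X$, $t = 1, 2, \ldots$, in such a way that

$$f(x_{t+1}) \le f(\tilde{x}_{t+1}), \qquad (14)$$

where

$$\tilde{x}_{t+1} = x_t + \gamma_t [x_t^+ - x_t], \quad x_t^+ = x_X[f'(x_t)], \quad \gamma_t = \frac{2}{t+1}. \qquad (15)$$

Basic implementations of a generic CndG algorithm are given by

$$\text{(a)}\ \ x_{t+1} = x_t + \gamma_t [x_t^+ - x_t],\ \gamma_t = \frac{2}{t+1}; \qquad \text{(b)}\ \ x_{t+1} \in \mathop{\mathrm{Argmin}}_{x \in [x_t, x_t^+]} f(x); \qquad (16)$$

in the sequel, we refer to them as CndGa and CndGb, respectively. As a byproduct of running generic CndG, after $t$ steps we have at our disposal the quantities

$$f_{*,\tau} = \min_{x \in X} \left[f(x_\tau) + \langle f'(x_\tau), x - x_\tau\rangle\right] = f(x_\tau) - \langle f'(x_\tau), x_\tau - x_\tau^+\rangle, \quad 1 \le \tau \le t, \qquad (17)$$

which, by convexity of $f$, are lower bounds on $f_*$. Consequently, at the end of step $t$ we have at our disposal a lower bound

$$f_*^t := \max_{1 \le \tau \le t} f_{*,\tau} \qquad (18)$$

on $f_*$.

Finally, we define the approximate solution $\bar{x}_t$ found in the course of $t$ steps as the best (with the smallest value of $f$) of the points $x_1, \ldots, x_t$. Note that $\bar{x}_t \in X$.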

The following statement summarizes the well-known properties of CndG (to make the presentation self-contained, we provide the proof in the Appendix).

Theorem 1.

For a generic CndG algorithm, in particular, for both CndGa, CndGb, we have

(19)

and

(20)

Some remarks regarding the conditional gradient algorithm are in order.

Certifying quality of approximate solutions. An attractive property of CndG is the presence of the online lower bound (18) on the optimal value, which certifies the theoretical rate of convergence of the algorithm, see (20). This accuracy certificate, first established in [14], also provides a valuable stopping criterion when running the algorithm in practice.
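A minimal sketch of the simplest CndG variant together with its accuracy certificate is given below. It assumes the classical step size $2/(t+1)$, takes the unit $\ell_1$-ball as the domain, and uses a toy quadratic objective; the function names `cndg` and `lo_oracle_l1` and the data vector `b` are hypothetical and serve only as an illustration.

```python
def lo_oracle_l1(eta):
    """LO oracle for the unit l1-ball: a minimizer of <eta, x> over ||x||_1 <= 1."""
    i = max(range(len(eta)), key=lambda j: abs(eta[j]))
    x = [0.0] * len(eta)
    if eta[i] != 0.0:
        x[i] = -1.0 if eta[i] > 0 else 1.0
    return x


def cndg(f, grad, lo_oracle, x0, steps):
    """Conditional Gradient with step 2/(t+1); returns best point, its value, lower bound."""
    x = list(x0)
    best_x, best_f = list(x), f(x)
    lower = float("-inf")
    for t in range(1, steps + 1):
        g = grad(x)
        s = lo_oracle(g)
        # By convexity, f(x) + <g, s - x> underestimates the optimal value;
        # the certificate is the best such bound seen so far.
        lower = max(lower, f(x) + sum(gi * (si - xi) for gi, si, xi in zip(g, s, x)))
        gamma = 2.0 / (t + 1)
        x = [xi + gamma * (si - xi) for xi, si in zip(x, s)]
        fx = f(x)
        if fx < best_f:
            best_x, best_f = list(x), fx
    return best_x, best_f, lower


b = [0.8, -0.5, 0.3]  # hypothetical data; the l1-projection of b is (0.6, -0.3, 0.1)


def f(x):
    return 0.5 * sum((xi - bi) ** 2 for xi, bi in zip(x, b))


def grad(x):
    return [xi - bi for xi, bi in zip(x, b)]


x_best, f_best, f_lower = cndg(f, grad, lo_oracle_l1, [0.0, 0.0, 0.0], 500)
print(f_best, f_lower, f_best - f_lower)
```

The difference `f_best - f_lower` is an observable upper bound on the inaccuracy of `x_best`, so it can serve as a stopping criterion without knowing the optimal value (0.06 for this toy instance).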

CndG algorithm with memory. When computing the next search point, the simplest CndG algorithm, CndGa, only uses the latest answer of the LO oracle. Meanwhile, algorithm CndGb can be modified to make use of information supplied by previous oracle calls; we refer to this modification as CndG with memory (CndGM). (In the context of the “classical” Frank-Wolfe algorithm, i.e., minimization of a smooth function over a polyhedral set, such a modification is referred to as Restricted Simplicial Decomposition [13, 12, 32].) Assume that we have already carried out $t-1$ steps of the algorithm and have at our disposal the current iterate $x_t \in X$ (with $x_1$ selected as an arbitrary point of $X$) along with the previous iterates $x_\tau$, $\tau < t$, the gradients $f'(x_\tau)$, and the corresponding LO oracle answers $x_\tau^+$. At the step, we compute $f'(x_t)$ and the oracle answer $x_t^+$. Thus, at this point in time we have at our disposal $2t$ points $x_\tau$, $x_\tau^+$, $1 \le \tau \le t$, which belong to $X$. Let $X_t$ be a subset of these points, with the only restriction that the points $x_t$, $x_t^+$ are selected, and let us define the next iterate $x_{t+1}$ as

(21)

that is,

(22)

Clearly, it is again a generic CndG algorithm, so that conclusions in Theorem 1 are fully applicable to CndGM. Note that CndGb per se is nothing but CndGM with and for all .

CndGM: implementation issues. Assume that the cardinalities of the sets $X_t$ in CndGM are bounded by some $M$. In this case, implementation of the method requires solving at every step an auxiliary problem (22) of minimizing over the standard simplex of dimension at most $M-1$ a smooth convex function given by a first-order oracle induced by the first-order oracle for $f$. When $M$ is a once for ever fixed small integer, the arithmetic cost of solving this problem within machine accuracy by, say, the Ellipsoid algorithm is dominated by the arithmetic cost of just $O(1)$ calls to the first-order oracle for $f$. Thus, CndGM with small $M$ can be considered as implementable. (Assuming the possibility to solve (22) exactly, while being an idealization, is basically as “tolerable” as the assumption, standard in continuous optimization, that one can use exact real arithmetic or compute exactly eigenvalues/eigenvectors of symmetric matrices. The outlined “real life” considerations can be replaced with rigorous error analysis which shows that in order to maintain the efficiency estimates from Theorem 1, it suffices to solve the $t$-th auxiliary problem within properly selected positive inaccuracy, and this can be achieved in $O(\ln t)$ computations of $f$ and $f'$.)

Note that in the special case (Section 2), where , assuming and easy to compute, as is the case in most of the applications, the first-order oracle for the auxiliary problems arising in CndGM becomes cheap (cf. [34]). Indeed, in this case (22) reads

It follows that all we need to get a computationally cheap access to the first-order information on for all values of is to have at our disposal the matrix-vector products , . With our construction of , the only two “new” elements in (those which were not available at preceding iterations) are and , so that the only two new matrix-vector products we need to compute at iteration are (which usually is a byproduct of computing ) and . Thus, we can say that the “computational overhead,” as compared to computing and , needed to get easy access to the first-order information on reduces to computing the single matrix-vector product .

4 Conditional gradient algorithm for parametric optimization

In this section, we describe a multi-stage algorithm to solve the parametric optimization problem (6), (7), using the conditional gradient algorithm to solve inner sub-problems. The idea, originating from [19] (see also [22, 16, 24]), is to use a Newton-type method for approximating from below the positive root $\rho_*$ of $\mathrm{Opt}(\rho)$, with (inexact) first-order information on $\mathrm{Opt}(\cdot)$ yielded by approximately solving the optimization problems defining $\mathrm{Opt}(\cdot)$; the difference with the outlined references is that now we solve these problems with the CndG algorithm.

Our algorithm works stagewise. At the beginning of stage , we have at hand a lower bound on , with defined as follows:

We compute , and . If or , we are done — the pair (, ) is an -solution to (7) in the first case, and is an optimal solution to the problem in the second case (since in the latter case is a minimizer of on , and (7) is feasible). Assume from now on that the above options do not take place (“nontrivial case”), and let

Due to the origin of , is positive, and for all , which implies that

At stage we apply a generic CndG algorithm (e.g., CndGa,CndGb, or CndGM; in the sequel, we refer to the algorithm we use as to CndG) to the auxiliary problem

(23)

Note that the LO oracle for , induces an LO oracle for ; specifically, for every , the point is a minimizer of the linear form over , see Lemma 1. is exactly the LO oracle utilized by CndG as applied to (23).

As explained above, after $t$ steps of CndG as applied to (23), the iterates being $x_\tau$, $1 \le \tau \le t$, we have at our disposal the current approximate solution $\bar{x}_t$ such that $f(\bar{x}_t) = \min_{1 \le \tau \le t} f(x_\tau)$ along with a lower bound on the optimal value of (23). (The iterates $x_\tau$, same as the other quantities indexed by $t$ participating in the description of the algorithm, in fact depend on both $t$ and the stage number $s$. To avoid cumbersome notation when speaking about a particular stage, we suppress $s$ in the notation.) Our policy is as follows.

  1. When , we terminate the solution process and output and ;

  2. When the above option is not met and , we specify according to the description of CndG and pass to step of stage ;

  3. Finally, when neither one of the above options takes place, we terminate stage and pass to stage , specifying as follows:
    We are in the situation and . Now, for the quantities , and define affine function of

    By Lemma 1 we have for every

    where the inequality is due to the convexity of . Thus, is an affine in lower bound on , and we lose nothing by assuming that all these univariate affine functions are memorized when running CndG on (23). Note that by construction of the lower bound (see (17), (18) and take into account that we are in the case of , ) we have

    Note that is a lower bound on , so that for , while is positive. It follows that

    is well defined and satisfies . We compute (which is easy) and pass to stage , setting and selecting, as the first iterate of the new stage, any point known to belong to (e.g., the origin, or ). The first iterate of the first stage is .

The description of the algorithm is complete.
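The outer loop of the scheme, approaching the root of the optimal-value function from below by following affine lower bounds, can be illustrated on a toy instance. The sketch assumes $K = \mathbf{R}^2$, the Euclidean norm, and $f(x) = \frac{1}{2}\|x - b\|_2^2 - \delta$, so that the optimal value over the ball of radius $\rho$ is available in closed form; it also uses exact tangents instead of the inexact bounds produced by CndG, which is a deliberate simplification. All names and data (`B_NORM`, `DELTA`, `opt`, `root_from_below`) are hypothetical.

```python
import math

B_NORM = 5.0   # ||b||_2 for the hypothetical data b = (3, 4)
DELTA = 0.5


def opt(rho):
    """Closed form of min{0.5*||x - b||_2^2 - DELTA : ||x||_2 <= rho}."""
    return 0.5 * max(B_NORM - rho, 0.0) ** 2 - DELTA


def opt_slope(rho):
    """Derivative of opt at rho (nonpositive, since opt is nonincreasing)."""
    return -max(B_NORM - rho, 0.0)


def root_from_below(eps, rho=0.0, max_stages=50):
    """Newton-type iteration: move to the root of the tangent, stage by stage."""
    history = [rho]
    for _ in range(max_stages):
        if opt(rho) <= eps:   # eps-solution reached
            break
        # The tangent is an affine lower bound on the convex function opt,
        # so its root cannot overshoot the true root.
        rho = rho - opt(rho) / opt_slope(rho)
        history.append(rho)
    return rho, history


rho_eps, history = root_from_below(eps=1e-9)
rho_star = B_NORM - math.sqrt(2.0 * DELTA)   # exact root, equal to 4.0
print(history, rho_star)
```

The iterates increase monotonically and never exceed the exact root, which mirrors the “super-optimality” requirement in the definition of an approximate solution to (7).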

The complexity properties of the algorithm are given by the following proposition.

Theorem 2.

When solving a PO problem  (6),  (7) by the outlined algorithm,

(i) the algorithm terminates with an -solution, as defined in Section 2 (cf.  (8));

(ii) The number of steps at every stage of the method admits the bound

(iii) The number of stages before termination does not exceed the quantity

5 Conditional Gradient algorithm for Composite Optimization

In this section, we present a modification of the CndG algorithm capable of solving the composite minimization problem (10). We assume in the sequel that $K$, $\|\cdot\|$ are represented by an LO oracle for the set $\{x \in K : \|x\| \le 1\}$, and $f$ is given by a first-order oracle. In order to apply CndG to the composite optimization problem (10), we make the following assumption:

Assumption A: There exists such that together with , , imply that .

We define as the minimal value of satisfying Assumption A, and assume that we have at our disposal a finite upper bound on . An important property of the algorithm we are about to develop is that its efficiency estimate depends on the induced by problem’s data quantity , and is independent of our a priori upper bound on this quantity, see Theorem 3 below.

The algorithm.

We are about to present an algorithm for solving  (10). Let , and . From now on, for a point we set and . Given , let us consider the segment

and the linear form

Observe that by Lemma 1, for every , the minimum of this form on is attained at a point of (either at or at the origin). A generic Conditional Gradient algorithm for composite optimization (COCndG) is a recurrence which builds the points , , in such a way that

(24)

Let be an optimal solution to  (10) (which under Assumption A clearly exists), and let (i.e., is nothing but , see (9)).

Theorem 3.

A generic COCndG algorithm  (24) maintains the inclusions and is a descent algorithm: for all . Besides this, we have

(25)

COCndG with memory.

The simplest implementation of a generic COCndG algorithm is given by the recurrence

(26)

Denoting , the recurrence can be written

(27)

As for the CndG algorithm in Section 3, the recurrence (26) admits a version with memory, COCndGM, still obeying (24) and thus satisfying the conclusion of Theorem 3. Specifically, assume that we already have built iterates , , with , along with the gradients and the points . Then we have at our disposal a number of points from , namely, the iterates , , and the points . Let us select a subset of the set , with the only restriction that contains the points , and set

(28)

Since , we have , whence the procedure we have outlined is an implementation of generic COCndG algorithm. Note that the basic COCndG algorithm is the particular case of the COCndGM corresponding to the case where for all . The discussion of implementability of CndGM in section 3 fully applies to COCndGM.

Let us outline several options which can be implemented in COCndGM; while preserving the theoretical efficiency estimates stated in Theorem 3 they can improve the practical performance of the algorithm. For the sake of definiteness, let us focus on the case of quadratic : , with ; extensions to a more general case are straightforward.

  1. We lose nothing (and potentially gain) when extending in (28) to the conic hull

    of . When , we can go further and replace (28) with

    (29)

    Note that the preceding “conic case” is obtained from (29) by adding to the constraints of the right hand side problem the inequalities . Finally, when is easy to compute, we can improve (29) to

    (30)

    (the definition of assumes that , otherwise the constraints of the problem specifying should be augmented by the inequalities ).

  2. In the case of quadratic and moderate cardinality of , optimization problems arising in (29) (with or without added constraints ) are explicitly given low-dimensional “nearly quadratic” convex problems which can be solved to high accuracy “in no time” by interior point solvers. With this in mind, we could solve these problems for the given value of the penalty parameter and also for several other values of the parameter. Thus, at every iteration we get feasible approximate solution to several instances of (9) for different values of the penalty parameter. Assume that we keep in memory, for every value of the penalty parameter in question, the best, in terms of the respective objective, of the related approximate solutions found so far. Then upon termination we will have at our disposal, along with the feasible approximate solution associated with the given value of the penalty parameter, provably obeying the efficiency estimates of Theorem 3, a set of feasible approximate solutions to the instances of (9) corresponding to other values of the penalty.

  3. In the above description, was assumed to be a subset of the set containing and . Under the latter restriction, we lose nothing when allowing for to contain points from as well. For instance, when and is easy to compute, we can add to the point . Assume, e.g., that we fix in advance the cardinality of and define as follows: to get from , we eliminate from the latter set several (the less, the better) points to get a set of cardinality , and then add to the resulting set the points , and . Eliminating the points according to the rule “first in – first out,” the projection of the feasible set of the optimization problem in (30) onto the space of -variables will be a linear subspace of containing, starting with step , at least (here stands for the largest integer not larger than ) of gradients of taken at the latest iterates, so that the method, modulo the influence of the penalty term, becomes a “truncated” version of the Conjugate Gradient algorithm for quadratic minimization. Due to nice convergence properties of Conjugate Gradient in the quadratic case, one can hope that a modification of this type will improve significantly the practical performance of COCndGM.

6 Application examples

In this section, we detail how the proposed conditional gradient algorithms apply to several examples. In particular, we detail the corresponding LO oracles, and how one could implement these oracles efficiently.

6.1 Regularization by nuclear/trace norm

The first example where the proposed algorithms seem to be more attractive than the proximal methods are large-scale problems (5), (9) on the space of matrices associated with the nuclear norm of a matrix , where

is the vector of singular values of a

matrix . Problems of this type with arise in various versions of matrix completion, where the goal is to recover a matrix from its noisy linear image , so that , with some smooth and convex discrepancy measure , most notably, . In this case, minimization/penalization is aimed at getting a recovery of low rank ([31, 3, 4, 9, 15, 26, 27, 33, 20, 29] and references therein). Another series of applications relates to the case when is the space of symmetric matrices, and is the cone of positive semidefinite matrices, with and as above; this setup corresponds to the situation when one wants to recover a covariance (and thus positive semidefinite symmetric) matrix from experimental data. Restricted from onto , the nuclear norm becomes the trace norm , where is the vector of eigenvalues of a symmetric matrix , and regularization by this norm is, as above, aimed at building a low rank recovery.

With the nuclear (or trace) norm in the role of

, all known proximal algorithms require, at least in theory, computing at every iteration the complete singular value decomposition of

matrix (resp., complete eigenvalue decomposition of a symmetric matrix ), which for large may become prohibitively time consuming. In contrast to this, with and , LO oracle for only requires computing the leading right singular vector of a matrix (i.e., the leading eigenvector of ): , where