The Frank-Wolfe method (aka conditional gradient, see Algorithm 1 below), originally due to Frank and Wolfe, is a classical first-order method for minimizing a smooth and convex function over a convex and compact set [8, 22, 19]. It regained significant interest in the machine learning, optimization and statistics communities in recent years, mainly for two reasons: i) in terms of the feasible set, the method only requires access to an oracle for minimizing a linear function over the set. Such an oracle can be implemented very efficiently for many feasible sets that arise in applications, as opposed to most standard first-order methods, which usually require solving non-linear problems over the feasible set (e.g., Euclidean projection onto the set), which can be much less efficient (e.g., see detailed examples in [19, 18]), and ii) when the number of iterations is not too large, the method naturally produces sparse solutions, in the sense that they are given explicitly as a convex combination of a small number of extreme points of the feasible set, which in many cases (e.g., optimization with sparse vectors/low-rank matrices) is much desired ([19, 6]).
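To make the two points above concrete, the following is a minimal sketch of Frank-Wolfe with exact line-search, assuming the feasible set is the unit simplex and the objective is the quadratic $\frac{1}{2}\|\mathbf{x}-\mathbf{b}\|^2$. Both choices, and the names `lmo_simplex`, `frank_wolfe_simplex`, `objective` and `nnz`, are illustrative assumptions and are not taken from the paper's Algorithm 1. The linear optimization oracle reduces to finding the smallest gradient entry, and each iteration adds at most one new vertex to the support of the iterate.

```python
def lmo_simplex(grad):
    """Linear minimization oracle for the unit simplex (illustrative).

    The minimum of <v, grad> over the simplex is attained at a standard
    basis vector, so we return the index of the smallest gradient entry.
    """
    return min(range(len(grad)), key=lambda i: grad[i])


def objective(x, b):
    """Assumed objective f(x) = 0.5 * ||x - b||^2."""
    return 0.5 * sum((xi - bi) ** 2 for xi, bi in zip(x, b))


def nnz(x):
    """Number of nonzero entries of x."""
    return sum(1 for xi in x if xi != 0.0)


def frank_wolfe_simplex(b, num_iters):
    """Frank-Wolfe with exact line-search on the assumed quadratic."""
    d = len(b)
    x = [0.0] * d
    x[0] = 1.0  # initialize at the vertex e_0
    for _ in range(num_iters):
        grad = [x[i] - b[i] for i in range(d)]
        j = lmo_simplex(grad)  # single oracle call per iteration
        # Frank-Wolfe gap <grad, x - e_j>, nonnegative by optimality of e_j.
        gap = sum(grad[i] * x[i] for i in range(d)) - grad[j]
        # Squared norm of the direction e_j - x.
        norm_sq = sum(xi * xi for xi in x) - 2.0 * x[j] + 1.0
        if gap <= 0.0 or norm_sq == 0.0:
            break
        eta = min(1.0, gap / norm_sq)  # exact line-search for a quadratic
        x = [(1.0 - eta) * xi for xi in x]
        x[j] += eta
    return x
```

For instance, with `b` supported on two coordinates, the iterates stay feasible and their support grows by at most one coordinate per iteration, so after `T` iterations `nnz(x)` is at most `T + 1`.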
The convergence rate of the method is of the order $O(1/t)$, where $t$ is the iteration counter. This rate is known to be tight and does not improve even when the objective function is strongly convex, a property that, when combined with smoothness, is well known to yield a linear convergence rate, i.e., $\exp(-\Theta(t))$, for standard first-order methods such as the proximal/projected gradient methods. For optimization over convex and compact polytopes, in his classical book , Wolfe himself suggested a simple variant of the method that not only adds new vertices to the solution using the linear optimization oracle, but also moves away more aggressively from previously found vertices of the polytope, a step typically referred to as an away step. Wolfe conjectured that with the addition of these away steps, and assuming strong convexity of the objective and an additional strict complementary condition w.r.t. the optimal face of the polytope (see Assumption 2 in the sequel), a linear convergence rate can be proved. Later, Guélat and Marcotte  proved this result rigorously, but without giving an explicit rate or complexity analysis. Also, their convergence rate depends on the distance of the optimal solution from the boundary of the optimal face of the polytope, which can be arbitrarily bad. Their technique for proving the linear rate is also related to techniques used in [3, 11].
In recent years Garber and Hazan [10, 12] and then Lacoste-Julien and Jaggi  presented variants of the Frank-Wolfe method that utilize away steps alongside new analyses, which resulted in provable and explicit linear rates without requiring strict complementary conditions and without dependence on the location of the optimal solution. These results have encouraged much follow-up theoretical and empirical work, e.g., [2, 24, 23, 14, 25, 13, 26, 16, 5, 15, 1, 4, 21, 7], to name a few. However, the linear convergence rates in [10, 12, 20] and follow-up works depend explicitly on the dimension of the problem (at least linear dependence, i.e., the convergence rate is of the form $\exp(-\Theta(t/d))$, where $d$ is the dimension). (While in [10, 12] the dependence on the dimension is explicit in the convergence rate presented, in  it comes from the so-called pyramidal-width parameter, which already for the simplest polytopes such as the unit simplex or the hypercube causes the worst-case rate to depend linearly on the dimension.)
Unfortunately, the explicit dependence on the dimension in all such works fails to explain and support the good empirical performance of these away-steps-based variants for large-scale problems. In particular, the examples constructed to show that explicit dependence on the dimension is mandatory in general (see for instance ) have focused on the case that the optimal solution lies on a high-dimensional face of the polytope (this is not surprising, since when initialized with a vertex of the polytope, these methods increase the dimension of the active face, i.e., the face in which the current iterate lies, by at most one on each iteration). However, this leaves open the natural question:
Can explicit dependence on the dimension be avoided when the set of optimal solutions lies on a low-dimensional face of the polytope?
Indeed, models in which the optimal solution is sparse/low-rank are extremely common and important in statistics and machine learning. In this respect, the solution lying on a low-dimensional face is analogous to sparsity when the feasible set is a polytope, since it implies that the solution can be expressed as a convex combination of a small number of extreme points of the polytope.
In this work we begin by answering the above question in the negative, at least in the worst case. We give a construction of a very simple problem for which the optimal solution is a vertex of the polytope (i.e., lies on a face of dimension $0$), but for which all Frank-Wolfe-type methods (including those which use away steps) that apply to arbitrary polytopes require a number of steps that depends explicitly on the dimension. We then revisit the strict complementary condition assumed in the works of Wolfe  and Guélat and Marcotte  (but not in the more modern works such as Garber and Hazan [10, 12] and Lacoste-Julien and Jaggi ). We first motivate this condition by showing how it implies a robustness-to-noise property of optimal solutions. That is, under this condition, if the optimal solutions lie on a low-dimensional face of the polytope, then the optimal solutions to a slightly perturbed version of the problem must also lie on this face. We then use this condition to give a new analysis of the Frank-Wolfe method with away steps and line-search, showing that it converges with a linear rate that depends explicitly only on the dimension of the optimal face, and not on the dimension of the problem. In terms of techniques, we use the original algorithm used in the works of Guélat and Marcotte  and Lacoste-Julien and Jaggi  (Algorithm 2 below), but with a new complexity analysis that is mostly inspired by that of Garber and Hazan .
Finally, it is important to note that while Garber and Meshi  gave a Frank-Wolfe variant for polytopes with linear rate that depends only on the dimension of the optimal face, their result can be efficiently implemented only for a very restrictive family of polytopes, and hence is far from generic. See also a follow-up work by Bashiri and Zhang . Here we do not impose any additional structural assumption on the feasible polytope.
Throughout this work we let $\|\cdot\|$ denote the standard Euclidean norm for vectors in $\mathbb{R}^d$ and the spectral norm (i.e., largest singular value) for matrices in $\mathbb{R}^{m\times d}$. We use lower-case boldface letters to denote vectors and upper-case boldface letters to denote matrices. For a matrix $\mathbf{A}$ we let $\mathbf{A}(i)$ denote the $i$th row of $\mathbf{A}$.
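Throughout this work we consider the following convex optimization problem:
$$\min_{\mathbf{x}\in\mathcal{P}} f(\mathbf{x}), \tag{1}$$
where $\mathcal{P}\subset\mathbb{R}^d$ is a convex and compact polytope in the form $\mathcal{P} = \{\mathbf{x}\in\mathbb{R}^d ~|~ \mathbf{A}_1\mathbf{x} = \mathbf{b}_1,~ \mathbf{A}_2\mathbf{x}\le\mathbf{b}_2\}$, $\mathbf{A}_1\in\mathbb{R}^{m_1\times d}$, $\mathbf{A}_2\in\mathbb{R}^{m_2\times d}$, and $f$ is convex and $\beta$-smooth (Lipschitz gradient). We let $\mathcal{V}$ denote the set of vertices of $\mathcal{P}$. We let $f^*$ denote the optimal value of Problem (1) and we let $\mathcal{X}^*$ denote the set of optimal solutions.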
For a face of we define:
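We let $\mathcal{F}^*$ denote the lowest-dimensional face of $\mathcal{P}$ containing the set of optimal solutions, i.e., $\mathcal{X}^*\subseteq\mathcal{F}^*$. In the following we write $\mathcal{F}^* = \{\mathbf{x}\in\mathbb{R}^d ~|~ \mathbf{A}_1^*\mathbf{x}=\mathbf{b}_1^*,~\mathbf{A}_2^*\mathbf{x}\le\mathbf{b}_2^*\}$. Observe that the rows of $\mathbf{A}_1^*$ are exactly the rows of the equality-constraint matrix $\mathbf{A}_1$ plus the rows of the inequality-constraint matrix $\mathbf{A}_2$ which correspond to inequality constraints that are tight for all points in $\mathcal{F}^*$, and the vector $\mathbf{b}_1^*$ is defined accordingly. The rows of the matrix $\mathbf{A}_2^*$ are exactly the rows of $\mathbf{A}_2$ which correspond to inequality constraints that hold with equality for some of the points in $\mathcal{F}^*$ but not for others, and the vector $\mathbf{b}_2^*$ is defined accordingly. In particular, it follows that $\dim\mathcal{F}^* = d - \textrm{rank}(\mathbf{A}_1^*)$.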
We let denote the set of all matrices whose rows are linearly independent rows chosen from the rows of . Similarly to , we define the following quantities: and (note here they are only defined w.r.t. the optimal face ). We denote by and the Euclidean diameter of and , respectively.
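Given a set $\mathcal{S}\subset\mathbb{R}^d$ we let $\textrm{conv}(\mathcal{S})$ denote the convex hull of the points in $\mathcal{S}$, we let $\textrm{nnz}(\cdot)$ denote the number of nonzero entries in a given vector, and for any positive integer $n$, we let $\Delta_n$ denote the unit simplex in $\mathbb{R}^n$. Given a point $\mathbf{x}$ and a closed set $\mathcal{S}$ we denote $\textrm{dist}(\mathbf{x},\mathcal{S}) = \min_{\mathbf{y}\in\mathcal{S}}\|\mathbf{x}-\mathbf{y}\|$.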
Throughout this paper, unless stated otherwise, we assume the objective function satisfies the quadratic growth property, which is a weaker assumption than strong convexity and is common to almost all linearly-converging Frank-Wolfe variants previously studied.
Assumption 1 (quadratic growth).
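The objective function $f$ in (1) satisfies the quadratic growth property with parameter $\alpha > 0$ w.r.t. the polytope $\mathcal{P}$, i.e., for all $\mathbf{x}\in\mathcal{P}$: $\textrm{dist}(\mathbf{x},\mathcal{X}^*)^2 \le \frac{2}{\alpha}\left(f(\mathbf{x}) - f^*\right)$.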
In particular, the highly important case of , where is not necessarily full row-rank, satisfies the quadratic growth property w.r.t. any convex and compact polytope.
2.1 Lower bound for Frank-Wolfe-type methods
We now prove our claim that already for very simple problems, and even when the (unique) optimal solution is a vertex of the polytope (i.e., lies on a face of dimension 0), any Frank-Wolfe-type method (which we define next), even with away steps, must exhibit at least linear dependence on the dimension in the worst case.
Definition 1 (Frank-Wolfe-type method).
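An iterative algorithm for Problem (1) is a Frank-Wolfe-type method if on each iteration $t$, it performs a single call to the linear optimization oracle of $\mathcal{P}$ w.r.t. the point $\nabla f(\mathbf{x}_t)$, i.e., computes some $\mathbf{v}_t\in\arg\min_{\mathbf{v}\in\mathcal{V}}\mathbf{v}^{\top}\nabla f(\mathbf{x}_t)$, where $\mathbf{x}_t$ is the current iterate, and produces the next iterate $\mathbf{x}_{t+1}$ by taking some convex combination of the points in $\{\mathbf{x}_1,\mathbf{v}_1,\dots,\mathbf{v}_t\}$, where $\mathbf{x}_1$ is the initialization point and $\{\mathbf{v}_1,\dots,\mathbf{v}_t\}$ is the entire history of outputs of the linear optimization oracle.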
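In the following we let $\mathcal{S}_d^{\downarrow}$ denote the down-closed unit simplex in $\mathbb{R}^d$, i.e., $\mathcal{S}_d^{\downarrow} = \{\mathbf{x}\in\mathbb{R}^d ~|~ \mathbf{x}\ge\mathbf{0},~ \sum_{i=1}^d\mathbf{x}(i)\le 1\}$.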
Consider the optimization problem . Then, any Frank-Wolfe-type method (see Definition 1) when initialized with some standard basis vector , , must perform in worst case calls to the linear optimization oracle to obtain approximation error lower than .
Clearly, the unique optimal solution is and . Consider now the iterates of some Frank-Wolfe-type method and recall that for some . Observe now that for any iteration for which it holds that it follows that a valid output for the linear optimization oracle is a standard basis vector such that . Thus, before making calls to linear optimization oracle, all iterates must lie in and hence for all we have . ∎
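The mechanism in the proof above can be sketched numerically. The sketch below assumes the objective is $\frac{1}{2}\|\mathbf{x}\|^2$ over the down-closed unit simplex; the objective and the function names are assumptions made here for illustration and are not quoted from the theorem. With this objective the gradient at $\mathbf{x}$ is $\mathbf{x}$ itself, so any basis vector $\mathbf{e}_j$ with $\mathbf{x}(j)=0$ is a valid oracle output, and an adversarial oracle can keep returning such vectors. Any convex combination of $k$ distinct basis vectors has objective value at least $\frac{1}{2k}$, which is what the run below checks for Frank-Wolfe with exact line-search.

```python
def adversarial_oracle(x):
    """A valid oracle answer for grad f(x) = x on the down-closed simplex.

    Any e_j with x[j] == 0 attains the minimal value 0 of <v, x>.
    Returns such an index j, or None (the origin vertex) if x has no
    zero coordinate.
    """
    for j, xj in enumerate(x):
        if xj == 0.0:
            return j
    return None


def run_lower_bound_instance(d, num_calls):
    """Frank-Wolfe with exact line-search on the assumed f(x) = 0.5*||x||^2.

    Starts at e_0 and returns the objective value after each oracle call.
    """
    x = [0.0] * d
    x[0] = 1.0
    errors = []
    for _ in range(num_calls):
        j = adversarial_oracle(x)
        v = [0.0] * d
        if j is not None:
            v[j] = 1.0
        direction = [v[i] - x[i] for i in range(d)]
        norm_sq = sum(c * c for c in direction)
        if norm_sq == 0.0:
            break
        # Exact minimizer of 0.5*||x + eta*direction||^2 over eta in [0, 1].
        inner = sum(x[i] * direction[i] for i in range(d))
        eta = min(1.0, max(0.0, -inner / norm_sq))
        x = [x[i] + eta * direction[i] for i in range(d)]
        errors.append(0.5 * sum(c * c for c in x))
    return errors
```

In this sketch the iterate after $t < d$ oracle calls is the uniform average of $t+1$ basis vectors, and only once every coordinate is nonzero does the oracle have to return the origin, which is the optimal solution.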
2.2 Strict complementary condition
Assumption 2 (strict complementary).
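There exists $\delta > 0$ such that for every optimal solution $\mathbf{x}^*\in\mathcal{X}^*$ and every vertex $\mathbf{v}\in\mathcal{V}\setminus\mathcal{F}^*$: $(\mathbf{v}-\mathbf{x}^*)^{\top}\nabla f(\mathbf{x}^*)\ge\delta$.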
To motivate Assumption 2 in the context of optimization with sparse/low-dimensional models under noisy data, we present the following theorem, which states that if the strict complementary condition holds, then even if instead of directly optimizing $f$ over the polytope $\mathcal{P}$ we only optimize a noisy version of it, $\tilde{f}$, then as long as the noise level is controlled by the strict complementary parameter $\delta$, the optimal face is preserved. That is, the optimal solutions to the perturbed problem all lie within the optimal face $\mathcal{F}^*$ w.r.t. the original objective $f$.
Let be two -smooth, convex functions with the quadratic growth property with parameter over the polytope . Suppose also that for all , . Let and be the optimal faces w.r.t. the objective functions and , respectively, and suppose that the strict complementary condition (Assumption 2) holds w.r.t. the face with parameter . If then .
Let and denote the sets of optimal solutions w.r.t. and , respectively. Let and let be the point in closest in Euclidean distance to . From the convexity of we have that
where the last inequality follows from the optimality of w.r.t. and the Cauchy-Schwarz inequality. Using the above inequality and the quadratic growth of we have that
Thus, we have that . It thus follows that for any vertex ,
Thus, we have that whenever it must hold that . Otherwise, due to the differentiability of , moving arbitrarily small positive mass from a vertex in the convex decomposition of , to the point will reduce the objective value w.r.t. , hence contradicting the optimality of . Thus, . ∎
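The robustness phenomenon can be illustrated on a deliberately simplified instance: a linear objective $\mathbf{c}^{\top}\mathbf{x}$ over the unit simplex, where the optimal face is the vertex with the smallest cost and the gap between the best and second-best costs plays the role of the strict complementary parameter. This toy setting, the sufficient condition used (every cost entry perturbed by less than half the gap) and the function names are assumptions for illustration only; they are not the theorem's actual hypotheses or threshold.

```python
def optimal_vertex(c):
    """Index of the simplex vertex minimizing the linear objective <c, x>."""
    return min(range(len(c)), key=lambda i: c[i])


def complementarity_gap(c):
    """Gap between the best and second-best vertex values.

    In this toy setting it stands in for the strict complementary
    parameter (an illustrative assumption).
    """
    i_star = optimal_vertex(c)
    return min(c[i] - c[i_star] for i in range(len(c)) if i != i_star)


def perturbed_solution_stays(c, noise):
    """Check whether the perturbed objective has the same optimal vertex."""
    c_tilde = [ci + ni for ci, ni in zip(c, noise)]
    return optimal_vertex(c_tilde) == optimal_vertex(c)
```

If every entry of `noise` is smaller in absolute value than half of `complementarity_gap(c)`, the perturbed costs preserve the strict ordering between the optimal vertex and all others, so the optimal vertex cannot change; larger noise can move the solution to a different vertex.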
We first begin with a very simple result proving that if the optimal solution is simply a vertex and the strict complementary condition holds, then the standard Frank-Wolfe method with line-search (Algorithm 1) finds the optimal solution within a finite number of iterations, without even requiring the objective to satisfy the quadratic growth property. Such a result was essentially already proved by Guélat and Marcotte , though they did assume strong convexity of the objective, and did not give explicit complexity analysis (i.e., only proved finiteness).
Suppose Algorithm 1 runs for iterations and that the final iterate satisfies . In particular it follows that . Thus, from the convexity of and Assumption 2 it follows that . However, from the standard convergence result for the Frank-Wolfe method (see for instance ), it follows that after iterations, . Thus, we have arrived at a contradiction. ∎
We now turn to present and prove our main result. For this result we use the Frank-Wolfe variant with away steps already suggested in  and revisited in  without further change. Only the analysis is new and based mostly on the ideas of  rather than those of [17, 20].
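For orientation, here is a minimal sketch of a Frank-Wolfe variant with away steps and exact line-search, specialized to the unit simplex and to the quadratic $\frac{1}{2}\|\mathbf{x}-\mathbf{b}\|^2$. It follows the commonly described away-step scheme (compare the Frank-Wolfe gap with the away gap, then take the better direction); the feasible set, the objective, the tolerance `tol` and all names are assumptions for illustration, and the sketch is not claimed to match Algorithm 2 line by line. On the simplex, the active vertices are simply the coordinates in the support of the iterate.

```python
def away_step_fw_simplex(b, num_iters, tol=1e-12):
    """Away-step Frank-Wolfe sketch for f(x) = 0.5*||x - b||^2 on the simplex.

    Returns the final iterate and the number of drop steps taken.
    """
    d = len(b)
    x = [0.0] * d
    x[0] = 1.0  # initialize at the vertex e_0
    drop_steps = 0
    for _ in range(num_iters):
        grad = [x[i] - b[i] for i in range(d)]
        gx = sum(grad[i] * x[i] for i in range(d))
        s = min(range(d), key=lambda i: grad[i])       # Frank-Wolfe vertex
        support = [i for i in range(d) if x[i] > 0.0]  # active vertices
        a = max(support, key=lambda i: grad[i])        # away vertex
        gap_fw = gx - grad[s]    # <-grad, e_s - x>
        gap_away = grad[a] - gx  # <-grad, x - e_a>
        if gap_fw <= tol:
            break
        if gap_fw >= gap_away:
            direction = [-xi for xi in x]
            direction[s] += 1.0
            eta_max, gap, is_away = 1.0, gap_fw, False
        else:
            direction = list(x)
            direction[a] -= 1.0
            # Largest step that keeps the weight of e_a nonnegative.
            eta_max, gap, is_away = x[a] / (1.0 - x[a]), gap_away, True
        norm_sq = sum(c * c for c in direction)
        eta = min(eta_max, gap / norm_sq)  # exact line-search for a quadratic
        x = [x[i] + eta * direction[i] for i in range(d)]
        if is_away and eta == eta_max:
            x[a] = 0.0  # drop step: vertex e_a leaves the decomposition
            drop_steps += 1
    return x, drop_steps
```

When the away direction is taken with the maximal step, the away vertex is removed from the convex decomposition, which is the event the analysis refers to as a drop step.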
[Main Theorem] Let be the sequence of iterates produced by Algorithm 2 and for all denote . Then,
such that the iterates all lie inside the optimal face , and
Note that the linear rates in (3), (4) depend explicitly only on the dimension of the optimal face - (through the parameter ), but not on the dimension of the polytope - . That is, treating all other quantities as constants, the linear rate is of the form and not as in the previous works [10, 12, 20]. Moreover, the rate in (4) depends only on the diameter of the optimal face and not on that of the entire polytope. Such improved dependence can be significant since for many polytopes, the diameter of a face scales with the square root of its dimension (e.g., the hypercube ).
The proof of Theorem 5 mainly combines ideas from the works  and  (mainly for the proof of (4)). We reiterate that while  proved linear convergence, they did not give proper complexity analysis (i.e., how the rate depends on the different parameters of the problem).
Before proving the theorem we will need a simple observation and two lemmas. Following the terminology of  we refer to each step of Algorithm 2 on which the away direction was chosen and also as a drop step, since in such a case one of the vertices in the decomposition of the current iterate is removed from the decomposition. We denote by the number of iterations up to (and including) iteration that are drop steps. The following simple observation is highly important for the analysis of Algorithm 2 and was made in .
Let be given by an explicit convex combination of vertices and suppose that starting with the point , iterations of Algorithm 2 have been executed. Then, on these iterations it holds that .
Algorithm 2 satisfies that for all , .
Fix some iteration of Algorithm 2 on which the away direction was chosen but it is not a drop step (i.e., ). Due to the use of line-search and the convexity of it in particular follows that . Thus, we have that on such iteration,
where (a) follows from the smoothness of $f$ and (b) follows since the away direction was chosen and not the FW direction. The above bound is the standard error-reduction analysis for the standard Frank-Wolfe method with line-search (Algorithm 1). Thus, any iteration of Algorithm 2 which is not a drop step reduces the error by at least the amount by which the Frank-Wolfe method with line-search reduces it in the worst case. Since drop steps do not increase the function value, the lemma follows directly from the convergence rate of the standard Frank-Wolfe method (i.e., , see for instance ) and Observation 1. ∎
Let and write as a convex combination of points in , i.e., such that for all . Let be the optimal solution closest in Euclidean distance to . Then, can be written as a convex combination , for some , , and .
Let us write as , where and . Since , clearly it must hold that for all , , and .
We begin by upper-bounding . From the convexity of it holds that
where (a) follows from the optimality of and (b) follows from the strict complementary assumption (Assumption 2). Since for all we have we obtain the bound .
We now turn to upper-bound . For this we use a refinement of the argument introduced in . Applying Lemma 5.3 from  we have that there is always a choice for and such that for all , if then there must exist an index such that and .
Let . Let further be such that is a basis for . Note in particular that since and , we have that and thus, .
Let be the matrix after deleting each row . It holds that
where (a) follows since and thus, for any there must exist some such that the constraint is not tight for (see also Lemma 5.4 in ). Thus, using the quadratic growth of we have that . ∎
Proof of Theorem 5.
Result (2) follows immediately from Lemma 1. From this result it also follows that for some it holds that for all , , for . Throughout the rest of the proof, for every iteration we let denote the point in closest in Euclidean distance to the iterate .
Consider now some iteration and write the convex decomposition of as . Suppose without loss of generality that are ordered such that . Let be the bound in Lemma 2 when applied w.r.t. the point . Let be the smallest integer such that and consider the point . Since is obtained by replacing vertices in the decomposition of with highest inner product with , with the point that minimizes the inner-product among all points in , overall shifting the distribution mass which corresponds to the bound in Lemma 2 , we have that (see also Lemma 5.6 in ) . On the other hand, taking for all , , and for all , we have that
Thus, we have that
where the last inequality follows from the convexity of . In particular, it follows that for any , whenever and either the FW direction was chosen or the away direction was chosen together with (i.e., not a drop step) that,
where (a) follows from the use of line-search and the convexity of , (b) follows from the smoothness of , and (c) follows from plugging Eq. (5), the bound on from Lemma 2 (note for all , and thus ), and the Euclidean diameter of . Thus, for by subtracting from both sides of (3), we get that for any step which is not a drop step, .
From Observation 1 we have that since the convex decomposition of is supported on at most vertices and since on any iteration the approximation error never increases, from the above analysis we have that for all ,
This proves the rate in (3).