We consider the optimization problem:
$$\min_{x \in \mathcal{M}} f(x), \tag{1}$$
where $f$ is a continuously differentiable function over the domain $\mathcal{M}$ that is convex and compact, but $f$ is potentially non-convex. The Frank-Wolfe (FW) optimization algorithm proposed by Frank and Wolfe (1956) (also known as the conditional gradient method (Demyanov and Rubinov, 1970)) is a popular first-order method to solve (1) while only requiring access to a linear minimization oracle over $\mathcal{M}$, i.e., the ability to compute efficiently $\mathrm{LMO}(r) := \arg\min_{s \in \mathcal{M}} \langle s, r \rangle$ (ties can be broken arbitrarily in this paper). It has recently enjoyed a surge in popularity thanks to its ability to cheaply exploit the structured constraint sets appearing in machine learning applications, see Jaggi (2013); Lacoste-Julien and Jaggi (2015) and references therein. See also Lan (2013) for a related survey.
We give the Frank-Wolfe algorithm with adaptive step sizes in Algorithm 1 (either with line-search or with a step size that minimizes an affine invariant quadratic upper bound). As $\mathcal{M}$ is convex, the iterates $x^{(t)}$ stay in the feasible set during the algorithm. For a convex function $f$ with Lipschitz continuous gradient, the FW algorithm obtains a global suboptimality smaller than $\frac{2C}{t+2}$ after $t$ iterations (Jaggi, 2013, Theorem 1), where $C \geq C_f$ is the constant used in Algorithm 1 for the adaptive step size, and $C_f$ is called the curvature constant of $f$ (defined in (4) below). On the other hand, we are not aware of any rates proven for Algorithm 1 in the case where $f$ is non-convex. Examples of recent applications where the FW algorithm is run on a non-convex objective include multiple sequence alignment (Alayrac et al., 2016, Appendix B) and multi-object tracking (Chari et al., 2015, Section 5.1). To talk about rates in the non-convex setting, we need to define a measure of non-stationarity for our iterates.
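As a minimal sketch of the iteration described above (Algorithm 1 itself is not reproduced in this text), the following Python code runs FW with the step size that minimizes the quadratic upper bound. The box domain $[lo, hi]^n$, the function names `lmo_box` and `frank_wolfe`, and the stopping rule are illustrative assumptions, not part of the original algorithm statement.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def lmo_box(r: Vector, lo: float, hi: float) -> Vector:
    """Linear minimization oracle over the box [lo, hi]^n: argmin_s <s, r>."""
    # The linear objective separates over coordinates; ties (r_i == 0) go to lo.
    return [hi if ri < 0 else lo for ri in r]


def frank_wolfe(grad: Callable[[Vector], Vector], x0: Vector, C: float,
                lo: float, hi: float, num_iters: int) -> Tuple[Vector, List[float]]:
    """FW with step size min(g_t / C, 1); returns the last iterate and the gaps."""
    x = list(x0)
    gaps: List[float] = []
    for _ in range(num_iters):
        r = grad(x)
        s = lmo_box(r, lo, hi)
        d = [si - xi for si, xi in zip(s, x)]        # FW direction s - x
        g = -sum(ri * di for ri, di in zip(r, d))    # FW gap <-grad f(x), d>
        gaps.append(g)
        if g <= 0.0:                                 # stationary point (assumed stopping rule)
            break
        gamma = min(g / C, 1.0)                      # minimizes the quadratic upper bound
        x = [xi + gamma * di for xi, di in zip(x, d)]
    return x, gaps
```

For instance, with the (hypothetical) non-convex objective $f(x) = \sum_i (x_i^4/4 - x_i^2/2)$ on $[-2, 2]^2$, one could call `frank_wolfe(lambda x: [xi**3 - xi for xi in x], [2.0, 0.5], 352.0, -2.0, 2.0, 50)`, where $352 = 11 \cdot 32$ is the Lipschitz constant of the gradient on the box times the squared $\ell_2$-diameter.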
Consider the “Frank-Wolfe gap” of $f$ at $x^{(t)}$:
$$g_t := \max_{s \in \mathcal{M}} \langle s - x^{(t)}, -\nabla f(x^{(t)}) \rangle. \tag{2}$$
This quantity is a standard one appearing in the analysis of FW algorithms, and is computed for free during the FW algorithm (see Line 5 in Algorithm 1). A point $x^{(t)}$ is a stationary point for the constrained optimization problem (1) if and only if $g_t = 0$. Moreover, we always have $g_t \geq 0$. The FW gap is thus a meaningful measure of non-stationarity, generalizing the more standard $\|\nabla f(x^{(t)})\|$ that is used for unconstrained optimization. An appealing property of the FW gap is that it is affine invariant (Jaggi, 2013), that is, it is invariant to an affine transformation of the domain $\mathcal{M}$ in problem (1) and is not tied to any specific choice of norm, unlike the criterion $\|\nabla f(x^{(t)})\|$. As the FW algorithm is also affine invariant (Jaggi, 2013), it is important that we state our convergence results in terms of affine invariant quantities. In this paper, we show in Theorem 1 below that the minimal FW gap encountered during the FW algorithm is $O(1/\sqrt{t})$ after $t$ iterations, that is:
$$\min_{0 \leq k \leq t} g_k \leq \frac{\max\{2 h_0, C\}}{\sqrt{t+1}} \quad \text{for } t \geq 0, \tag{3}$$
where $h_0 := f(x^{(0)}) - \min_{x \in \mathcal{M}} f(x)$ is the initial global suboptimality.
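Another nice property of the FW gap is the following local suboptimality property. If $x^{(t)}$ lies in a convex subset $\mathcal{C} \subseteq \mathcal{M}$ on which $f$ is convex, then $g_t$ upper bounds the suboptimality with respect to the constrained minimum on $\mathcal{C}$, that is, $g_t \geq f(x^{(t)}) - \min_{x \in \mathcal{C}} f(x)$ (by convexity).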
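To make these properties concrete, here is a small sketch on a one-dimensional example of our own choosing (not from the original text): $f(x) = x^4/4 - x^2/2$ on the interval $[-2, 2]$, which is non-convex but convex on the subset $[0.6, 2]$ since $f''(x) = 3x^2 - 1$. The names `f`, `grad_f` and `fw_gap` are hypothetical.

```python
def f(x: float) -> float:
    """Illustrative non-convex objective with stationary points at -1, 0 and 1."""
    return x ** 4 / 4.0 - x ** 2 / 2.0


def grad_f(x: float) -> float:
    return x ** 3 - x


def fw_gap(x: float, lo: float, hi: float) -> float:
    """FW gap max_{s in [lo, hi]} (s - x) * (-f'(x)) on an interval."""
    r = grad_f(x)
    s = hi if r < 0 else lo      # linear minimization oracle on the interval
    return (s - x) * (-r)


# The gap vanishes at stationary points (a local minimum and a local maximum).
print(fw_gap(1.0, -2.0, 2.0), fw_gap(0.0, -2.0, 2.0))
# On the convex piece [0.6, 2], the gap upper bounds f(x) - min f, with min at x = 1.
print(fw_gap(1.5, -2.0, 2.0) >= f(1.5) - f(1.0))
```

Note that the gap is zero at $x = 0$ even though this point is a local maximum: the FW gap certifies stationarity only, not local optimality.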
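Before stating our convergence result, we review the usual affine invariant constant appearing in the convergence rates for FW methods. The curvature constant $C_f$ of a continuously differentiable function $f$, with respect to a compact domain $\mathcal{M}$, is defined as:
$$C_f := \sup_{\substack{x, s \in \mathcal{M},\ \gamma \in (0, 1], \\ y = x + \gamma (s - x)}} \frac{2}{\gamma^2} \Big( f(y) - f(x) - \langle \nabla f(x), y - x \rangle \Big). \tag{4}$$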
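The assumption of bounded curvature closely corresponds to a Lipschitz assumption on the gradient of $f$. More precisely, if $\nabla f$ is $L$-Lipschitz continuous on $\mathcal{M}$ with respect to some arbitrarily chosen norm $\|\cdot\|$ in dual pairing, i.e. $\|\nabla f(x) - \nabla f(y)\|_* \leq L \|x - y\|$, then
$$C_f \leq L \, \mathrm{diam}_{\|\cdot\|}(\mathcal{M})^2, \tag{5}$$
where $\mathrm{diam}_{\|\cdot\|}(\cdot)$ denotes the $\|\cdot\|$-diameter, see (Jaggi, 2013, Lemma 7). These quantities were normally defined in the context of convex optimization, but these bounds did not use convexity anywhere.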
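As a worked example of these definitions (our own illustration, assuming a quadratic objective $f(x) = \tfrac{1}{2} x^\top A x + b^\top x$ with a symmetric and possibly indefinite matrix $A$, so that $f$ may be non-convex), the curvature constant can be computed in closed form and compared with the Lipschitz bound:

```latex
% Assumption: f(x) = (1/2) x^T A x + b^T x with A symmetric (possibly indefinite).
\begin{align*}
f(y) - f(x) - \langle \nabla f(x), y - x \rangle
  &= \tfrac{1}{2} (y - x)^\top A (y - x)
   = \tfrac{\gamma^2}{2} (s - x)^\top A (s - x)
   \quad \text{for } y = x + \gamma (s - x), \\
C_f
  &= \sup_{x, s \in \mathcal{M}} (s - x)^\top A (s - x)
   \;\leq\; \max\{\lambda_{\max}(A), 0\} \, \mathrm{diam}_{\|\cdot\|_2}(\mathcal{M})^2 .
\end{align*}
```

Since the gradient $x \mapsto A x + b$ is Lipschitz with constant $\max_i |\lambda_i(A)| \geq \max\{\lambda_{\max}(A), 0\}$ in the $\ell_2$ norm, this is consistent with the Lipschitz bound on the curvature constant, and no convexity of $f$ is needed.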
Theorem 1 (Convergence of FW on non-convex objectives).
Consider the problem (1) where $f$ is a continuously differentiable function that is potentially non-convex, but has a finite curvature constant $C_f$ as defined by (4) over the compact convex domain $\mathcal{M}$. Consider running the Frank-Wolfe algorithm 1 with line-search (option I; then take $C := C_f$ below) or with the step size that minimizes a quadratic upper bound (option II), for any $C \geq C_f$. Then the minimal FW gap $\tilde{g}_t := \min_{0 \leq k \leq t} g_k$ encountered by the iterates during the algorithm after $t$ iterations satisfies:
$$\tilde{g}_t \leq \frac{\max\{2 h_0, C\}}{\sqrt{t+1}} \quad \text{for } t \geq 0, \tag{6}$$
where $h_0 := f(x^{(0)}) - \min_{x \in \mathcal{M}} f(x)$ is the initial global suboptimality. It thus takes at most $O(1/\epsilon^2)$ iterations to find an approximate stationary point with gap smaller than $\epsilon$.
The main idea of the proof is fairly simple and follows the spirit of the ones used for the gradient descent method. Basically, during FW, the objective is decreased at each iteration by a quantity related to the gap. As the total progress is bounded by the distance from the initial objective value to the global minimum of $f$ on $\mathcal{M}$, the gap cannot always stay big, and the initial suboptimality controls how big the gap can stay.
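Let $x_\gamma := x^{(t)} + \gamma d_t$ be the point obtained by moving with step size $\gamma \in [0, 1]$ in direction $d_t$, where $d_t := s^{(t)} - x^{(t)}$ is the FW direction as defined by Algorithm 1. By using $s := s^{(t)}$, $x := x^{(t)}$ and $y := x_\gamma$ in the definition of the curvature constant (4), and solving for $f(x_\gamma)$, we get an affine invariant version of the standard descent lemma (see e.g. (1.2.5) in Nesterov, 2004):
$$f(x_\gamma) \leq f(x^{(t)}) + \gamma \langle \nabla f(x^{(t)}), d_t \rangle + \frac{\gamma^2}{2} C_f \quad \text{for all } \gamma \in [0, 1]. \tag{7}$$
Replacing the value of the FW gap $g_t = \langle -\nabla f(x^{(t)}), d_t \rangle$ in the above equation and substituting $C \geq C_f$, we get:
$$f(x_\gamma) \leq f(x^{(t)}) - \gamma g_t + \frac{\gamma^2}{2} C \quad \text{for all } \gamma \in [0, 1]. \tag{8}$$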
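We consider the best feasible step size that minimizes the quadratic upper bound on the RHS of (8): $\gamma^* := \min\{g_t / C, 1\}$. This is the same step size as used in option II of the algorithm ($\gamma_t = \gamma^*$). In option I, the step size is obtained by line-search, and so $f(x^{(t+1)}) = \min_{\gamma \in [0, 1]} f(x_\gamma)$ and thus $f(x^{(t+1)}) \leq f(x_{\gamma^*})$. In both cases, we thus have:
$$f(x^{(t+1)}) \leq f(x^{(t)}) - \min\left\{ \frac{g_t^2}{2C},\ g_t - \frac{C}{2} \mathbb{1}_{\{g_t > C\}} \right\}, \tag{9}$$
where $\mathbb{1}_{\{\cdot\}}$ is an indicator function used to consider both possibilities of $\gamma^*$ in the same equation: the first argument is the decrease obtained when $\gamma^* = g_t / C$ (i.e. $g_t \leq C$), and the second is the one obtained when $\gamma^* = 1$ (i.e. $g_t > C$). By recursively applying (9), we get:
$$f(x^{(t+1)}) \leq f(x^{(0)}) - \sum_{k=0}^{t} \min\left\{ \frac{g_k^2}{2C},\ g_k - \frac{C}{2} \mathbb{1}_{\{g_k > C\}} \right\}. \tag{10}$$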
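Now let $\tilde{g}_t := \min_{0 \leq k \leq t} g_k$ be the minimal gap seen so far. As each term of the sum in (10) is non-decreasing in $g_k$, inequality (10) then becomes:
$$f(x^{(t+1)}) \leq f(x^{(0)}) - (t+1) \min\left\{ \frac{\tilde{g}_t^2}{2C},\ \tilde{g}_t - \frac{C}{2} \mathbb{1}_{\{\tilde{g}_t > C\}} \right\}. \tag{11}$$
We consider the two possibilities for the result of the $\min$. In both cases, we use the fact that $f(x^{(0)}) - f(x^{(t+1)}) \leq h_0$ by definition and solve for $\tilde{g}_t$ in (11). In case that $\tilde{g}_t \leq C$, the first argument of the $\min$ is smaller and we get the claimed rate on $\tilde{g}_t$:
$$\tilde{g}_t \leq \sqrt{\frac{2 h_0 C}{t+1}}. \tag{12}$$
In case that $\tilde{g}_t > C$ (in the first few iterations), we get that the initial condition is forgotten at a faster rate for $\tilde{g}_t$:
$$\tilde{g}_t \leq \frac{h_0}{t+1} + \frac{C}{2}. \tag{13}$$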
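We note that this case is only relevant when $h_0 > C/2$; neither of the bounds (12) and (13) then dominates the other, as the inequality (13) has a faster rate but a worse constant. We can also show that $\tilde{g}_t \leq C$ (for any $t$) when $h_0 \leq C/2$. Indeed, as we assumed that $\tilde{g}_t > C$ to get (13), we have that (13) then implies:
$$C < \frac{h_0}{t+1} + \frac{C}{2} \quad \Longrightarrow \quad t + 1 < \frac{2 h_0}{C}. \tag{14}$$
If $h_0 \leq C/2$, (14) then yields a contradiction as $t \geq 0$, implying that $\tilde{g}_t \leq C$ for all $t$ in this case.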
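The per-iteration decrease that drives this case analysis can be checked numerically. The sketch below (with hypothetical function names of our own) computes the step size minimizing the quadratic upper bound $-\gamma g + \gamma^2 C / 2$ over $[0, 1]$ and compares the resulting decrease with the closed form $\min\{g^2/(2C),\ g - (C/2)\mathbb{1}_{\{g > C\}}\}$; it assumes $g \geq 0$ and $C > 0$.

```python
def best_step(g: float, C: float) -> float:
    """Minimizer over [0, 1] of the quadratic bound -gamma * g + gamma^2 * C / 2."""
    return min(g / C, 1.0)


def guaranteed_decrease(g: float, C: float) -> float:
    """Decrease of the quadratic upper bound at the best feasible step size."""
    gamma = best_step(g, C)
    return gamma * g - 0.5 * gamma * gamma * C


def decrease_closed_form(g: float, C: float) -> float:
    """Closed form min{g^2 / (2C), g - (C/2) * 1[g > C]} of the same quantity."""
    indicator = 1.0 if g > C else 0.0
    return min(g * g / (2.0 * C), g - 0.5 * C * indicator)


for g in [0.0, 0.5, 2.0, 3.0, 10.0]:
    print(g, guaranteed_decrease(g, 2.0), decrease_closed_form(g, 2.0))
```

For $g \leq C$ both expressions equal $g^2/(2C)$, and for $g > C$ both equal $g - C/2$, which are the two regimes treated separately in the proof.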
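From this analysis, we can summarize the bounds as:
$$\tilde{g}_t \leq \begin{cases} \dfrac{h_0}{t+1} + \dfrac{C}{2} & \text{if } t + 1 \leq \dfrac{2 h_0}{C}, \\[2ex] \sqrt{\dfrac{2 h_0 C}{t+1}} & \text{otherwise.} \end{cases} \tag{15}$$
We obtain the theorem statement by simplifying the first option in (15) by using that it only happens when $h_0 \geq C/2$ and $t + 1 \leq 2 h_0 / C$, and thus:
$$\frac{h_0}{t+1} + \frac{C}{2} = \frac{h_0}{\sqrt{t+1}} \left( \frac{1}{\sqrt{t+1}} + \frac{C}{2 h_0} \sqrt{t+1} \right) \leq \frac{h_0}{\sqrt{t+1}} \left( \frac{1}{\sqrt{t+1}} + \sqrt{\frac{C}{2 h_0}} \right) \leq \frac{2 h_0}{\sqrt{t+1}}.$$
By using $\sqrt{2 h_0 C} \leq \max\{2 h_0, C\}$ for the second option, we get the theorem statement. ∎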
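As a sanity check of the theorem on a toy instance (our own example, not from the original text), the sketch below runs FW with the quadratic-bound step size on $f(x) = \sum_i (x_i^4/4 - x_i^2/2)$ over the box $[-2, 2]^2$. We assume the constant $C = L \, \mathrm{diam}^2$ with $L = 11$ (the largest value of $|3 x_i^2 - 1|$ on the box) and use the known global minimum $-n/4$ to compute $h_0$; all names are hypothetical.

```python
import math
from typing import List, Tuple


def f(x: List[float]) -> float:
    """Non-convex separable objective; global minimum -len(x)/4 at x_i = +-1."""
    return sum(xi ** 4 / 4.0 - xi ** 2 / 2.0 for xi in x)


def grad(x: List[float]) -> List[float]:
    return [xi ** 3 - xi for xi in x]


def run_fw(x0: List[float], C: float, lo: float, hi: float,
           num_iters: int) -> Tuple[List[float], List[float]]:
    """Run FW on the box [lo, hi]^n; return the FW gaps and objective values."""
    x = list(x0)
    gaps, fvals = [], [f(x)]
    for _ in range(num_iters):
        r = grad(x)
        s = [hi if ri < 0 else lo for ri in r]              # LMO over the box
        d = [si - xi for si, xi in zip(s, x)]
        g = max(0.0, -sum(ri * di for ri, di in zip(r, d)))  # FW gap (clipped for rounding)
        gaps.append(g)
        gamma = min(g / C, 1.0)                              # quadratic-bound step size
        x = [xi + gamma * di for xi, di in zip(x, d)]
        fvals.append(f(x))
    return gaps, fvals


def bound_holds(gaps: List[float], h0: float, C: float) -> bool:
    """Check min_{k <= t} g_k <= max(2 h0, C) / sqrt(t + 1) for every t."""
    best = math.inf
    for t, g in enumerate(gaps):
        best = min(best, g)
        if best > max(2.0 * h0, C) / math.sqrt(t + 1) + 1e-12:
            return False
    return True


n, lo, hi = 2, -2.0, 2.0
C = 11.0 * n * (hi - lo) ** 2      # L * squared l2-diameter of the box
x0 = [2.0, 0.5]
h0 = f(x0) - (-n / 4.0)            # initial global suboptimality
gaps, fvals = run_fw(x0, C, lo, hi, 200)
print(bound_holds(gaps, h0, C))
```

The Lipschitz-based constant is quite conservative here, so the bound is loose on this instance; the point of the sketch is only to show how $h_0$, $C$ and the running minimum of the gaps enter the statement.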
3 Related work
The only convergence rate for a FW-type algorithm on non-convex objectives that we are aware of is given in Theorem 7 of Yu et al. (2014), but they only cover non-adaptive step size versions (which do not apply to Algorithm 1) and they can only obtain slower rates than $O(1/\sqrt{t})$. (Footnote: In this paper, they generalize the Frank-Wolfe algorithm to handle the (unconstrained) minimization of the sum of $f$ and a non-smooth convex function for which a generalized linear minimization step can be efficiently computed. The standard FW setup is recovered when this non-smooth function is the characteristic function of a convex set, but they can also handle other types of functions, such as norms for example.)
Bertsekas (1999, Section 2.2) shows that any limit point of the sequence of iterates for the standard FW algorithm is a stationary point (though no rates are given). His proof only requires the function to be continuously differentiable. (Footnote: We note that the gradient function being continuous on a compact set implies that it is uniformly continuous, with a modulus of continuity function that characterizes its level of continuity. Different rates are obtained by assuming different levels of uniform continuity of the gradient function. The more standard one is assuming Lipschitz-continuity of the gradient, but other (slower) rates could be derived by using various growth levels of the modulus of continuity function.) He basically shows that the sequence of directions obtained by the FW algorithm is gradient related, and then gets the stationary point convergence guarantees by using (Bertsekas, 1999, Proposition 2.2.1).
Dunn (1979, Note 5.5) generalizes the standard rates for the FW method (in terms of global suboptimality) when running it on a class of quasi-convex functions of the form $f(x) := \phi(h(x))$, where $\phi$ is a strictly increasing real function with a continuous derivative, and $h$ is a convex function with Lipschitz continuous gradient. These functions are quite special though: they are invex, that is, all their stationary points are also global optima.
Unconstrained gradient methods.
Our rate is analogous to the ones derived for projected gradient methods. In the unconstrained setting, Nesterov (2004, Inequality (1.2.15)) showed that the gradient descent method with line-search or a fixed step size of $1/L$, where $L$ is the Lipschitz constant of the gradient function, had the following convergence rate to a stationary point:
$$\min_{0 \leq k \leq t} \|\nabla f(x^{(k)})\| \leq \sqrt{\frac{2 h_0 L}{t+1}} \quad \text{for } t \geq 0. \tag{16}$$
We see that this rate is very similar to the one we give in (12) in the proof of Theorem 1. Cartis et al. (2010) also showed that the $O(1/\sqrt{t})$ rate was tight for the gradient descent method for an unconstrained objective. It is unclear though whether their example could be adapted to also show a lower bound for the FW method in the constrained setting, as their unidimensional example has a stationary point only at $+\infty$, which thus does not apply to a compact domain.
Constrained gradient methods.
In the constrained setting, several measures of non-stationarity have been considered for projected gradient methods. Cartis et al. (2012) consider the first-order criticality measure of $f$ at $x$, which is similar to the FW gap (2), but replacing the maximization over $s \in \mathcal{M}$ in its definition with the more local $s \in \mathcal{M} \cap \mathcal{B}(x)$, where $\mathcal{B}(x)$ is the unit ball around $x$. This measure appears standard in the trust region method literature (Conn et al., 1993). Cartis et al. (2012) present an algorithm that gives a rate on this measure. Ghadimi et al. (2016) consider instead the norm of the gradient mapping as a measure of non-stationarity. (Footnote: The gradient mapping is defined for the more general proximal optimization setting, but we consider it here for the simple projected gradient setup. For a step size $\gamma$, the gradient mapping is defined as $P_{\mathcal{M}}(x, \gamma) := \frac{1}{\gamma}(x - x^+)$ where $x^+ := \mathrm{Proj}_{\mathcal{M}}(x - \gamma \nabla f(x))$. If we let $\gamma \to 0$, then the gradient mapping becomes simply the negative of the projection of $-\nabla f(x)$ on the solid tangent cone to $\mathcal{M}$ at $x$. When $\mathcal{M}$ is the full space (unconstrained setting), then the gradient mapping becomes simply $\nabla f(x)$. We use the “gradient mapping” terminology from (Nesterov, 2004, Definition 2.2.3) but with the notation from (Ghadimi et al., 2016).) They show in Ghadimi et al. (2016, Corollary 1) that the simple projected gradient method with step size $1/L$ gives the same rate as given by (16) in the unconstrained setting, but using the norm of the gradient mapping on the LHS instead. They also later showed in Ghadimi and Lan (2016) that the accelerated projected gradient method of Nesterov also gives the same rate, but with a slightly better dependence on the Lipschitz constant.
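For comparison with the FW gap, the gradient mapping can be sketched as follows for the special case of a box domain, where the Euclidean projection is a coordinate-wise clipping. The box domain and the names `project_box` and `gradient_mapping` are assumptions made for this illustration only.

```python
from typing import List


def project_box(x: List[float], lo: float, hi: float) -> List[float]:
    """Euclidean projection onto the box [lo, hi]^n (coordinate-wise clipping)."""
    return [min(max(xi, lo), hi) for xi in x]


def gradient_mapping(x: List[float], grad_x: List[float], gamma: float,
                     lo: float, hi: float) -> List[float]:
    """Gradient mapping (x - x_plus) / gamma with x_plus = Proj(x - gamma * grad)."""
    x_plus = project_box([xi - gamma * gi for xi, gi in zip(x, grad_x)], lo, hi)
    return [(xi - xpi) / gamma for xi, xpi in zip(x, x_plus)]


# Interior point: the projection is inactive and the mapping equals the gradient.
print(gradient_mapping([0.5], [-0.375], 0.1, -2.0, 2.0))
# Boundary point with the negative gradient pointing outward: the mapping is zero.
print(gradient_mapping([2.0], [-1.0], 0.1, -2.0, 2.0))
```

Unlike the FW gap, this quantity depends on the step size $\gamma$ and on the choice of norm used for the projection.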
- Alayrac et al. (2016) J.-B. Alayrac, P. Bojanowski, N. Agrawal, I. Laptev, J. Sivic, and S. Lacoste-Julien. Unsupervised learning from narrated instruction videos. In CVPR, 2016.
- Bertsekas (1999) D. P. Bertsekas. Nonlinear programming. Athena Scientific, second edition, 1999.
- Cartis et al. (2010) C. Cartis, N. I. M. Gould, and P. L. Toint. On the complexity of steepest descent, Newton’s and regularized Newton’s methods for nonconvex unconstrained optimization problems. SIAM Journal on Optimization, 20(6):2833–2852, 2010.
- Cartis et al. (2012) C. Cartis, N. I. M. Gould, and P. L. Toint. An adaptive cubic regularization algorithm for nonconvex optimization with convex constraints and its function-evaluation complexity. IMA Journal of Numerical Analysis, 32(4):1662–1695, 2012.
- Chari et al. (2015) V. Chari, S. Lacoste-Julien, I. Laptev, and J. Sivic. On pairwise costs for network flow multi-object tracking. In CVPR, 2015.
- Conn et al. (1993) A. R. Conn, N. Gould, A. Sartenaer, and P. L. Toint. Global convergence of a class of trust region algorithms for optimization using inexact projections on convex constraints. SIAM Journal on Optimization, 3(1):164–221, 1993.
- Demyanov and Rubinov (1970) V. F. Demyanov and A. M. Rubinov. Approximate methods in optimization problems. Elsevier, 1970.
- Dunn (1979) J. C. Dunn. Rates of convergence for conditional gradient algorithms near singular and nonsingular extremals. SIAM Journal on Control and Optimization, 17(2):187–211, 1979.
- Frank and Wolfe (1956) M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3:95–110, 1956.
- Ghadimi and Lan (2016) S. Ghadimi and G. Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1):59–99, 2016.
- Ghadimi et al. (2016) S. Ghadimi, G. Lan, and H. Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1):267–305, 2016.
- Jaggi (2013) M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML, 2013.
- Lacoste-Julien and Jaggi (2015) S. Lacoste-Julien and M. Jaggi. On the global linear convergence of Frank-Wolfe optimization variants. In NIPS, 2015.
- Lan (2013) G. Lan. The complexity of large-scale convex programming under a linear optimization oracle. arXiv:1309.5550v2, 2013.
- Nesterov (2004) Y. Nesterov. Introductory Lectures on Convex Optimization. Kluwer Academic Publishers, 2004.
- Yu et al. (2014) Y. Yu, X. Zhang, and D. Schuurmans. Generalized conditional gradient for sparse estimation. arXiv:1410.4828v1, 2014.