# A Novel Frank-Wolfe Algorithm. Analysis and Applications to Large-Scale SVM Training

Recently, there has been renewed interest in the machine learning community in variants of a sparse greedy approximation procedure for concave optimization known as the Frank-Wolfe (FW) method. In particular, this procedure has been successfully applied to train large-scale instances of non-linear Support Vector Machines (SVMs). Specializing FW to SVM training has yielded not only efficient algorithms but also important theoretical results, including convergence analyses of training algorithms and new characterizations of model sparsity. In this paper, we present and analyze a novel variant of the FW method based on a new way to perform away steps, a classic strategy used to accelerate the convergence of the basic FW procedure. Our formulation and analysis focus on a general concave maximization problem on the simplex. However, the specialization of our algorithm to quadratic forms is strongly related to some classic methods in computational geometry, namely the Gilbert and MDM algorithms. On the theoretical side, we demonstrate that the method matches the guarantees in terms of convergence rate and number of iterations obtained by using classic away steps. In particular, the method enjoys a linear rate of convergence, a result that has been recently proved for MDM on quadratic forms. On the practical side, we provide experiments on several classification datasets, and evaluate the results using statistical tests. Experiments show that our method is faster than the FW method with classic away steps, and works well even in the cases in which classic away steps slow down the algorithm. Furthermore, these improvements are obtained without sacrificing the predictive accuracy of the obtained SVM model.


## 1 Introduction

In this paper we present a novel variant of the Frank-Wolfe (hereafter FW) method [23, 33], designed to deal with large-scale instances of the following problem:

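$$\underset{\alpha}{\text{maximize}}\ \ g(\alpha) \quad \text{subject to} \quad \alpha \in S := \Big\{\alpha \in \mathbb{R}^m : \sum_i \alpha_i = 1,\ \alpha_i \ge 0\Big\}. \tag{1}$$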

This problem encompasses several models used in machine learning [13, 28], including hard-margin Support Vector Machines (SVMs) [36] and $L_2$-loss SVMs ($L_2$-SVMs) for binary classification, regression and novelty detection [58, 59].

#### FW Methods and Focus of this Paper

It has been noted by researchers in different fields that approximate solutions to problem (1) can be obtained using quite simple iterative procedures. In [64], for instance, Yildirim presents two iterative algorithms for the task of approximating the Minimum Enclosing Ball (MEB) of a set of points. In [1], Ahipasaoglu et al. propose similar methods to solve Minimum Volume Enclosing Ellipsoid problems. In [66], Zhang studies similar techniques for convex approximation and estimation of mixture models. All these methods are nowadays identified as variants of a general approximation procedure for maximizing a differentiable concave function on the simplex, which traces back to Frank and Wolfe [23, 62, 27] and has been recently analyzed by Clarkson [13] and Jaggi [33] under a modern perspective.

In a nutshell, each iteration of the FW method moves the solution along the feasible direction in which the linearized objective function increases most rapidly. The procedure is related to the idea of coreset, coined in the context of computational geometry and denoting a subset of data which suffices to obtain an approximation to the solution on the whole dataset up to a given precision $\varepsilon$. Clarkson’s framework unifies diverse results regarding the existence of small coresets for different instances of problem (1). These ideas were used in [28] to characterize the sparsity of SVMs and the convergence properties of training algorithms for geometric formulations of the problem.

The algorithm studied in this paper is obtained by incorporating a new type of away step into the basic FW method. Loosely speaking, instead of moving the solution towards a direction along which the linearized objective function increases, an away step moves the solution away from a direction along which the linearized objective function decreases. This strategy was suggested by Wolfe in [62] to improve the convergence rate of the FW method, leading to a variant of the original algorithm called the Modified Frank-Wolfe method (hereafter MFW). It has been demonstrated that MFW is linearly convergent under some general assumptions on the properties of problem (1). However, we found in [20] that classic away steps do not significantly improve the running times of the FW method on machine learning problems. A similar conclusion was reached by Ouyang and Gray in [44]. In contrast, our approach experimentally improves on other FW methods and offers theoretical guarantees (e.g. convergence rate) at least as good as those of MFW.

#### Applications to SVM Learning

Training non-linear SVMs on large datasets is challenging [16]. Effective Interior Point Methods can be devised under some special circumstances, such as kernels which admit low-rank factorizations [18, 63]. However, these methods are not suitable for large-scale problems in a general scenario, mainly due to memory constraints: a general interior point method needs $O(m^2)$ memory and $O(m^3)$ time for matrix inversions, and both are prohibitive even for medium-scale problems. Among the traditional methods devised to cope with this problem, Active Set methods [49, 34, 50] and Sequential Minimal Optimization (SMO) [45, 16] are well-known alternatives among practitioners. Indeed, these are the algorithms of choice in the widely known libraries SVMLight [35] and LIBSVM [12], respectively. For the linear kernel case, Stochastic Gradient Descent (SGD) [7, 8], specialized sub-gradient methods like Pegasos [54] and Stochastic Dual Coordinate Ascent (SDCA) [31, 55] have lately gained popularity in the community as approximate but efficient alternatives to the classic solutions on large-scale problems [65].

In the non-linear case, effective methods to deal with large datasets have recently been devised by focusing on formulations which fit problem (1) and then applying FW methods. The first work to specialize a variant of the FW method to SVM training is probably due to Tsang et al. [58]. Given a labelled set of examples $\{(x_i, y_i)\}_{i \in \mathcal{I}}$, where the $x_i$ belong to the input space and $\mathcal{I}$ is an index set, they adopt the so-called $L_2$-SVM formulation, where the model is built by solving the following optimization problem

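$$\underset{\alpha}{\text{maximize}}\ \ g(\alpha) = -\alpha^T K \alpha \quad \text{subject to} \quad \sum_i \alpha_i = 1,\ \alpha_i \ge 0, \tag{2}$$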

where $K_{ij} = y_i y_j\, k(x_i, x_j) + y_i y_j + \delta_{ij}/C$ for examples $x_i$ with labels $y_i$, $k$ is the kernel function used in the SVM model and $C$ is the regularization parameter [58, 19, 20]. Problem (2) clearly fits problem (1). This formulation is adopted mainly because of efficiency: by using the functional of Eqn. (2), it is possible to exploit the framework introduced in [58], and further developed in [13], to solve the learning problem more easily. (Strictly speaking, [58] is a special case of a FW method which does not address the general form of problem (2), as a normalization constraint on the quadratic form is required; see Sections 2.2 and 2.4.) Note also that $K$ in problem (2) is positive definite for any finite $C > 0$, and thus $g$ is strictly concave. (This is easily seen by writing $K$ as the sum of two positive semi-definite matrices and a multiple of the identity, $K = (y y^T) \circ G + y y^T + \tfrac{1}{C} I$, where $y$ is the column vector whose components are the labels $y_i$, $G$ is the Gram matrix with entries $k(x_i, x_j)$ and $\circ$ is the Hadamard or componentwise product.)

Borrowing a coreset-based algorithm from computational geometry [10], the authors show that the total number of iterations needed to identify a coreset, i.e. an approximation to the $L_2$-SVM model up to an arbitrary precision $\varepsilon$, is bounded by $O(1/\varepsilon)$, independently of the size of the dataset. From the iterative structure of the algorithm, it follows easily that the size of the coreset is also bounded by $O(1/\varepsilon)$. A similar result regarding linear SVMs trained with SDCA has been recently demonstrated in [55].

The latter properties imply in particular that the size of the set of examples required to represent the (approximate) solution, i.e. the number of support vectors in the model, is also independent of the size of the dataset, an improvement on previous lower bounds for the support set size, such as those in [56], where the bound grows linearly in the size of the dataset. The obtained training algorithm also exhibits linear running times in the number of examples. These are remarkable results in the context of non-linear SVM models, where the support set needs to be explicitly stored in memory to implement predictions and determines the cost of a classification decision in terms of testing time. In addition, combining this procedure with certain sampling techniques yields sub-linear time approximation algorithms [58, 19]. (To be rigorous, the probability of identifying a good point in a given iteration depends on the size of the sample. A sub-linear procedure guaranteeing a constant success probability is studied in [14], though it seems that results on the non-linear case are provided only for some kernels.) In practice, the method was found to be competitive with most well-known SVM software using non-linear kernels [58, 59].

Several other papers have recently stressed the efficiency of FW and coreset-based methods in machine learning. In [19] and [21] the authors investigate the direct application of the FW method to large-scale non-linear SVM training, demonstrating that the running times of [58] can be significantly improved as long as a minor loss in accuracy is acceptable. Variations of the algorithm based on geometrical reformulations of the learning problem [38, 28], stochastic variants of the method [44], and applications to SVM training on data streams [61, 47] and structural SVMs [39] have also been proposed.

#### Contributions

We present a FW method endowed with a new type of optimization step devised to overcome the difficulties observed with the classic MFW approach, while preserving the intuition and benefits behind the introduction of away steps. On the theoretical side, we formulate and analyze the algorithm for the general case of problem (1), demonstrating that the method matches the guarantees in terms of convergence rate and number of iterations obtained by using classic away steps. In particular, we show that the method converges linearly to the optimal value of the objective function, and achieves a predetermined accuracy $\varepsilon$ (primal-dual gap) in at most $O(1/\varepsilon)$ iterations. Focusing on quadratic objectives, it turns out that the method is strongly related to the Gilbert and Mitchell-Dem'yanov-Malozemov (MDM) algorithms, two classic methods in computational geometry [26, 42]. Such methods are well-known in machine learning and their properties, in particular their rate of convergence, have been the focus of recent research [40, 41].

On the practical side, we specialize the algorithm to SVM training and perform detailed experiments on several classification problems. We conclude that our algorithm improves the running times of existing FW approaches without any statistically significant difference in terms of prediction accuracy. In particular, we show that the method is faster than the FW and MFW methods, while MFW is not statistically faster than FW. Moreover, the method is at least as fast as the FW method when MFW is significantly slower, i.e. when classic away steps fail. Conversely, the method is competitive with MFW when FW is significantly slower, i.e. when classic away steps work, our algorithm works as well. Thus, the method represents a robust alternative to implement away steps, enjoying strong theoretical guarantees and providing significant improvements in practice.

#### Organization

The paper is organized as follows. In Section 2 we give an overview of FW methods and introduce the basic concepts required for their analysis. In Section 3 we present the new method, including a minor variant, and provide some details about its specialization to SVMs. The analysis of convergence is provided in Section 4. In Section 5, we discuss the relation of the proposed method to some classic geometric approaches for a quadratic objective. Experiments on SVM problems are presented in Section 6. Finally, Section 7 closes the paper with some concluding remarks. In addition, some technical results required for the proofs of Section 4 are reported in the Appendix.

#### Notation

An optimal solution for problem (1) is denoted $\alpha^*$. A sequence of approximations to a solution of problem (1) is abbreviated $\{\alpha_k\}$. The set of indices $\{1, \dots, m\}$ is denoted $\mathcal{I}$. The face of the unit simplex corresponding to a set of indices $I \subseteq \mathcal{I}$ is the subset $S_I$ of points $\alpha \in S$ such that $\alpha_i = 0$ for all $i \notin I$. The term active face indicates the face corresponding to the non-zero indices, $I_k = \{i : \alpha_{k,i} > 0\}$, of the current solution $\alpha_k$. The term optimal face, denoted by $S^*$, indicates the face corresponding to an optimal solution $\alpha^*$. The vector $e_i$ denotes the $i$-th vector of the canonical basis.

## 2 Frank-Wolfe Methods

The FW method computes a sequence of approximations $\{\alpha_k\}$ to a solution of problem (1) by iterating the following steps until convergence. First, a linear approximation of $g$ at the current iterate $\alpha_k$ is built in order to find an ascent direction $d_k = u_k - \alpha_k$, with

$$u_k \in \arg\max_{u \in S} \psi_k(u) := g(\alpha_k) + (u - \alpha_k)^T \nabla g(\alpha_k). \tag{3}$$

Since $u$ ranges over the simplex $S$ and $\psi_k$ is linear, the maximum is attained at a vertex, so the linear approximation step reduces to $u_k = e_{i^*}$, where $i^*$ is the index of the largest coordinate of the gradient, i.e. $i^* \in \arg\max_i \nabla g(\alpha_k)_i$. The iterate $\alpha_k$ is then moved towards $u_k$, seeking the best feasible improvement of the objective function. The procedure is summarized in Algorithm 1. In the rest of this paper we refer to $u_k$ as the ascent vertex used by the method.

As discussed below, the procedure can be stopped when $\alpha_k$ is “close enough” to the optimum.
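As an illustration of the iteration described above, the following is a minimal Python sketch under stated assumptions: the objective is the quadratic form $g(\alpha) = -\alpha^T K \alpha$ of problem (2) with a symmetric matrix `K`, the step size comes from an exact line search clipped to $[0, 1]$, and a fixed iteration budget stands in for a stopping test. The names `grad`, `frank_wolfe` and `n_iter` are illustrative and are not taken from the paper.

```python
def grad(K, alpha):
    """Gradient of g(alpha) = -alpha^T K alpha for a symmetric K, i.e. -2 K alpha."""
    m = len(alpha)
    return [-2.0 * sum(K[i][j] * alpha[j] for j in range(m)) for i in range(m)]


def frank_wolfe(K, alpha, n_iter=100):
    """Basic FW iteration on the simplex (illustrative sketch, quadratic objective)."""
    m = len(alpha)
    alpha = list(alpha)
    for _ in range(n_iter):
        gr = grad(K, alpha)
        # Linear approximation step: the ascent vertex is e_{i*}.
        i_star = max(range(m), key=lambda i: gr[i])
        # Ascent direction d = e_{i*} - alpha.
        d = [-a for a in alpha]
        d[i_star] += 1.0
        slope = sum(di * gi for di, gi in zip(d, gr))  # d^T grad g(alpha)
        Kd = [sum(K[i][j] * d[j] for j in range(m)) for i in range(m)]
        curv = sum(di * kdi for di, kdi in zip(d, Kd))  # d^T K d
        # g(alpha + lam d) = g(alpha) + lam * slope - lam^2 * curv, which is
        # maximised at lam = slope / (2 curv); clip to [0, 1] for feasibility.
        lam = 1.0 if curv <= 0.0 else max(0.0, min(1.0, slope / (2.0 * curv)))
        alpha = [a + lam * di for a, di in zip(alpha, d)]
    return alpha


# Toy problem: K = identity, so the maximiser is the barycentre of the simplex.
K = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
alpha = frank_wolfe(K, [1.0, 0.0, 0.0])
```

Starting from a vertex, each iteration activates at most one new coordinate, which is the source of the sparsity discussed in Section 2.2.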

### 2.1 Optimality Measures and Stopping Condition

It can be shown that the FW method is globally convergent under rather weak assumptions on the properties of the objective function [27, 23], which are guaranteed to hold for the SVM problem (2) [64, 20]. In addition, it can be shown that the iterates of this procedure satisfy

$$\Delta_p(\alpha_k) := g(\alpha^*) - g(\alpha_k) \le \frac{4 C_g}{k+3}, \tag{4}$$

where $C_g$ is a constant related to the second derivative of $g$ [13]. This convergence rate is slow compared to other methods. However, the simplicity of the procedure implies that the amount of computation per iteration is usually very small. This kind of tradeoff can be favorable for large-scale applications, as testified for example by the widespread adoption of the SMO method in the context of SVMs [45, 16].

When $g$ is continuously differentiable, the Wolfe dual of problem (1) is

$$\underset{\alpha}{\text{minimize}}\ \ w(\alpha), \quad \text{with} \quad w(\alpha) = g(\alpha) + \max_i \nabla g(\alpha)_i - \alpha^T \nabla g(\alpha). \tag{5}$$

As shown in [13], the strong duality condition

$$g(\alpha) \le g(\alpha^*) = w(\alpha^*) \le w(\alpha) \tag{6}$$

holds for any feasible $\alpha$. Thus, another reasonable measure of optimality for the Frank-Wolfe iterates is the so-called primal-dual gap

$$\Delta_d(\alpha) := w(\alpha) - g(\alpha) = \max_i \nabla g(\alpha)_i - \alpha^T \nabla g(\alpha). \tag{7}$$

Up to a multiplicative constant, the primal-dual gap in Eqn. (7) and the primal measure of approximation in Eqn. (4) are the metrics employed in [13] to analyze the convergence of Algorithm 1. The advantage of $\Delta_d$ with respect to $\Delta_p$ is that the former does not depend on the optimal value of the objective function. Therefore, $\Delta_d$ can be explicitly monitored during the execution of the algorithm and can be adopted to implement a stopping condition for Algorithm 1. In this paper, we adopt this measure to stop the FW method and any of its variants. That is, the algorithms are terminated when

$$\Delta_d(\alpha_k) = \max_i \nabla g(\alpha_k)_i - \alpha_k^T \nabla g(\alpha_k) \le \varepsilon, \tag{8}$$

where $\varepsilon$ is a given tolerance parameter. Note that the strong duality condition implies $\Delta_p(\alpha) \le \Delta_d(\alpha)$. Therefore, if the algorithm stops at iteration $k$ we also have $\Delta_p(\alpha_k) \le \varepsilon$.

Note also that Eqn. (4) implies that the FW method finds a solution fulfilling $\Delta_p(\alpha_k) \le \varepsilon$ in at most $O(1/\varepsilon)$ iterations. Clarkson has recently shown that we also have $\Delta_d(\alpha_k) \le \varepsilon$ after at most $O(1/\varepsilon)$ iterates [13]. Thus, the solution found by the FW method using the stopping condition (8) is guaranteed to be “close” to the optimum both primally and dually after $O(1/\varepsilon)$ iterations.
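The stopping rule of Eqn. (8) needs only the current iterate and its gradient. The sketch below shows one possible way to compute it in Python; the helper names `primal_dual_gap` and `should_stop` are hypothetical, and the toy objective $g(\alpha) = -\alpha^T\alpha$ on the 2-simplex is chosen purely for illustration.

```python
def primal_dual_gap(alpha, grad):
    """Delta_d(alpha) = max_i grad_i - alpha^T grad, as in Eqn. (7)."""
    return max(grad) - sum(a * g for a, g in zip(alpha, grad))


def should_stop(alpha, grad, eps):
    """Stopping condition of Eqn. (8)."""
    return primal_dual_gap(alpha, grad) <= eps


# Toy check with g(alpha) = -alpha^T alpha (K = identity), whose gradient is
# -2 alpha and whose maximiser on the 2-simplex is alpha* = (0.5, 0.5).
def g(a):
    return -sum(x * x for x in a)


alpha = [1.0, 0.0]
grad_alpha = [-2.0 * x for x in alpha]
gap = primal_dual_gap(alpha, grad_alpha)   # dual measure Delta_d
primal = g([0.5, 0.5]) - g(alpha)          # primal measure Delta_p
```

On this toy instance the primal measure is bounded by the gap, as strong duality requires, and the gap vanishes at the maximiser.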

In the analysis presented in this paper, we make use of the following notion of approximation quality introduced in [1].

###### Definition 1.

A feasible solution $\alpha$ to problem (1) is said to be a $\Delta$-approximate solution if

$$\Delta_d(\alpha) \le \Delta \tag{9}$$

and

$$\Delta_{s_i}(\alpha) := \nabla g(\alpha)_i - \alpha^T \nabla g(\alpha) \ge -\Delta, \quad \forall i : \alpha_i > 0. \tag{10}$$

The first condition guarantees that a $\Delta$-approximate solution is “close” to the optimum both primally and dually. In addition, the second condition ensures that $\Delta_{s_i}(\alpha) \ge -\Delta$ on the active face, that is, the gap computed on each active coordinate is not far from the largest gap computed among all the coordinates of the gradient. This implies also that the solution is “almost” optimal in the face of the simplex defined by the non-zero indices.
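Conditions (9) and (10) can be checked directly from the gradient. The following Python sketch is one way to do so; the function name `is_delta_approximate` and the example vectors are made up for illustration, and the input is assumed to be feasible.

```python
def is_delta_approximate(alpha, grad, delta):
    """Check conditions (9) and (10) of Definition 1 for a feasible alpha."""
    inner = sum(a * g for a, g in zip(alpha, grad))   # alpha^T grad g(alpha)
    dual_gap_ok = max(grad) - inner <= delta           # condition (9)
    active_ok = all(g - inner >= -delta                # condition (10)
                    for a, g in zip(alpha, grad) if a > 0.0)
    return dual_gap_ok and active_ok


# Condition (9) holds but condition (10) fails: the second active coordinate
# has a gradient far below the weighted average alpha^T grad.
only_first = is_delta_approximate([0.9, 0.1, 0.0], [-1.0, -3.0, -5.0], 0.5)
# Both conditions hold for a slightly larger tolerance on a milder example.
both = is_delta_approximate([0.5, 0.5, 0.0], [-1.0, -2.0, -5.0], 0.6)
```

The first example shows why condition (10) is not redundant: a small primal-dual gap does not by itself rule out an active coordinate that should be removed.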

### 2.2 Sparsity of the FW solutions and Coresets

One of the main points of interest of the FW method is the sparsity of the solutions it finds. It should be observed that, in contrast to other methods such as projected or reduced gradient methods, Algorithm 1 introduces at most one new non-zero coordinate at each step. If the starting solution $\alpha_0$ has $s$ non-zero coordinates, the iterate $\alpha_k$ has at most $s + k$ non-zero entries. Therefore, our previous remarks about the convergence of the FW method show that there exist solutions with space-complexity $O(1/\varepsilon)$ that are good approximations for problem (1), even if $m$ (the dimensionality of the feasible space and the number of data points in SVM problems) is much larger.

The above properties are essential in the context of training non-linear SVMs. In this case, each non-zero coordinate in $\alpha_k$ represents a support training example (a document, image or protein profile) that needs to be explicitly stored in memory during the execution of the algorithm. In addition, the complexity of non-linear SVMs is proportional to the number of non-zero coordinates in $\alpha_k$, which determines the cost of each iteration at training time and the cost of a classification decision at testing time.

Existence of sparse approximate solutions for problem (1) can be linked to the idea of $\varepsilon$-coreset, first described for the MEB and other geometrical problems [64]. For a set of points $P$, an $\varepsilon$-coreset $C \subseteq P$ has the property that if the smallest ball containing $C$ is expanded by a factor of $(1 + \varepsilon)$, then the resulting ball contains $P$. That is, if the problem is solved on $C$, the solution is “close” to the solution on $P$. The existence of $\varepsilon$-coresets of size $O(1/\varepsilon)$ for the MEB problem was first demonstrated by Bădoiu and Clarkson in [9, 10]. Note that in large-scale applications $O(1/\varepsilon)$ can be much smaller than the cardinality of $P$.

In [13], Clarkson provides a definition of coreset that applies in the general setting of problem (1). Basically, an $\varepsilon$-coreset for problem (1) is a subset of indices spanning a face of $S$ on which we can compute a good approximate solution. The existence of small $\varepsilon$-coresets implies the existence of sparse solutions which are optimal in their respective active faces. The practical consequence of this result would be the possibility of solving large instances of (1) working with a small set of variables of the original problem.

###### Definition 2.

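An $\varepsilon$-coreset for problem (1) is a set of indices $I$ such that the solution $\alpha_I$ to the reduced problem

$$\underset{\alpha}{\text{maximize}}\ \ g(\alpha) \quad \text{subject to} \quad \alpha \in S_I := \{\alpha \in S : \alpha_i = 0,\ \forall i \notin I\} \tag{11}$$

satisfies $\Delta_d(\alpha_I) \le \varepsilon$.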

As discussed in [13], the FW method is not guaranteed to find an $\varepsilon$-coreset after $O(1/\varepsilon)$ iterations for problem (1). It has been demonstrated that FW is able to find such a coreset in some special cases, e.g., in polytope distance problems [28]. However, in general, more iterations may be required. Instead, the computationally intensive modification presented in Algorithm 2, generally known as the fully corrective variant of FW, does the job.

Note that Algorithm 2 needs to solve an optimization problem of increasing size at each iteration. This can be considered a generalized version of the well-known Bădoiu-Clarkson (BC) method to compute MEBs in computational geometry and, to our knowledge, corresponds to the first variant of the FW method applied to SVM problems [58].

### 2.3 Boosting the Convergence using Away-steps

It is well-known that the FW method often exhibits a tendency to stagnate near the solution $\alpha^*$, resulting in a slow convergence rate [27]. As discussed in [64, 20], this problem can be explained geometrically. Near the solution, the gradient at $\alpha_k$ has a tendency to become nearly orthogonal to the face of the simplex spanned by $I_k$ (the non-zero coordinates of $\alpha_k$). Therefore, very little improvement can be achieved by moving towards the ascent vertex $u_k$. However, since the solution is not optimal, it is reasonable to think that the solution can be improved by working on the face spanned by $I_k$. Actually, Algorithm 2 works on this face until approximate optimality before exploring the next ascent direction.

It can be shown that the convergence of the FW method can be boosted by introducing a new type of optimization step. In short the idea is that, instead of moving towards the point $u_k$ maximizing the local linear approximation $\psi_k$ of $g$, we can move away from the point $v_k$ of the current face minimizing $\psi_k$. At each iteration, a choice between these two options is made by determining which of the directions (moving towards $u_k$ or moving away from $v_k$) is more promising.

Since the point $v_k$ must lie in the current active face, it is easy to see that the linear approximation step reduces to $v_k = e_{j^*}$, where $j^*$ is the index of the smallest active coordinate of the gradient, i.e., $j^* \in \arg\min_{j \in I_k} \nabla g(\alpha_k)_j$. The whole procedure, known as the Modified Frank-Wolfe (MFW) method, is summarized in Algorithm 3. In the rest of this paper, we refer to $v_k$ and $\alpha_k - v_k$ as the descent vertex and the away direction used by the method respectively.
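To make the choice between the two directions concrete, here is a hedged Python sketch of a single MFW iteration. It rests on assumptions that are not stated in the text above: the quadratic objective $g(\alpha) = -\alpha^T K \alpha$, a first-order comparison of the two slopes to pick the step, an exact line search, and the standard feasibility bound $\lambda \le \alpha_{j^*}/(1 - \alpha_{j^*})$ for the away step. The function name `mfw_step` is illustrative.

```python
def mfw_step(K, alpha):
    """One MFW iteration for g(alpha) = -alpha^T K alpha (illustrative sketch)."""
    m = len(alpha)
    gr = [-2.0 * sum(K[i][j] * alpha[j] for j in range(m)) for i in range(m)]
    inner = sum(a * g for a, g in zip(alpha, gr))
    active = [j for j in range(m) if alpha[j] > 0.0]
    i_star = max(range(m), key=lambda i: gr[i])   # ascent vertex index
    j_star = min(active, key=lambda j: gr[j])     # descent vertex index
    toward_gain = gr[i_star] - inner              # (e_i* - alpha)^T grad
    away_gain = inner - gr[j_star]                # (alpha - e_j*)^T grad
    if away_gain > toward_gain and alpha[j_star] < 1.0:
        kind = "away"
        d = list(alpha)
        d[j_star] -= 1.0
        slope = away_gain
        lam_max = alpha[j_star] / (1.0 - alpha[j_star])  # keeps alpha_j* >= 0
    else:
        kind = "toward"
        d = [-a for a in alpha]
        d[i_star] += 1.0
        slope = toward_gain
        lam_max = 1.0
    Kd = [sum(K[i][j] * d[j] for j in range(m)) for i in range(m)]
    curv = sum(di * kdi for di, kdi in zip(d, Kd))
    # Exact line search for the quadratic objective, clipped to [0, lam_max].
    lam = lam_max if curv <= 0.0 else max(0.0, min(lam_max, slope / (2.0 * curv)))
    return [a + lam * di for a, di in zip(alpha, d)], kind


# The third coordinate is heavily penalised, so moving away from it pays off.
K = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 100.0]]
alpha = [0.4, 0.4, 0.2]
new_alpha, kind = mfw_step(K, alpha)
```

In this example the away step shrinks the third weight and rescales the two remaining active weights by the same factor, since $\alpha + \lambda(\alpha - e_{j^*}) = (1 + \lambda)\alpha - \lambda e_{j^*}$.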

In contrast to the FW method, for which only a sub-linear rate of convergence can be expected in general [27, 64], it has been shown that MFW asymptotically exhibits linear convergence to the solution of problem (1) under some assumptions on the form of the objective function and the feasible set [27, 64, 1]. In addition, the MFW algorithm has the potential to compute sparser solutions in practice, since in contrast to the FW method it allows reducing the coordinates of $\alpha_k$ at each step.

In the context of SVM learning, the work of Tsang et al. in [58] was arguably the first to point out the properties of the algorithms that can be obtained by applying FW methods to formulations fitting problem (1). Their work relies on the equivalence between the SVM problem (2) and a MEB problem, which holds under a normalization assumption on the kernel function employed in the model [58, 59]. Exploiting this equivalence, and adapting the Bădoiu-Clarkson algorithm for computing a MEB to the problem of training non-linear SVMs, an algorithm called Core Vector Machine (CVM) is obtained, which enjoys remarkable theoretical properties and competitive performance in practice [58].

First, the number of support vectors of the model obtained by the CVM is bounded by a constant multiple of $1/\varepsilon$, where $\varepsilon$ is the tolerance parameter of the method. Therefore, the space complexity of the model is independent of the size and dimensionality of the training set. Second, the number of iterations of the algorithm before termination is also $O(1/\varepsilon)$, independent of the size and dimensionality of the training set. To determine the overall time complexity of this method, we note that Algorithm 2 requires a search for the point representing the best ascent direction in the current approximation of the objective function, an operation that is also performed by the FW and MFW methods. Searching among all of the $m$ training points requires a number of kernel evaluations of order $m\,|I_k|$, where $|I_k|$ is the cardinality of the active index set $I_k$. Since the cardinality of $I_k$ is bounded as $O(1/\varepsilon)$ (the worst-case number of iterations), we obtain that the CVM has an overall time complexity (measured as the total number of kernel evaluations) of $O(m/\varepsilon^2)$, linear in the number of examples, improving on the super-linear time complexity reported empirically for popular methods like SMO to train SVMs [45, 16].

If $m$ is very large, however, the complexity per iteration can still become prohibitive in practice. A sampling technique, called probabilistic speedup, was proposed in [52] to overcome this obstacle. This technique was also used to implement the CVM in [58, 57], leading to SVM training algorithms with an overall time complexity which is independent of the number of training examples. In practice, the index $i^*$ is computed only on a random subset of coordinates whose size is much smaller than $m$. The overall complexity per iteration is thereby reduced to the order of the subset size times $|I_k|$, a major improvement on the previous estimate. Refer to [51] or [58] for details about this speed-up technique.
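The sampling idea can be sketched in a few lines of Python. This is only an illustration of restricting the search for the ascent index to a random subset: the function name `sampled_ascent_index`, the callback `grad_coord` and the sample size used in the example are assumptions, not values prescribed by the cited works.

```python
import random


def sampled_ascent_index(grad_coord, m, sample_size, rng):
    """Approximate argmax_i grad g(alpha)_i using a random subset of coordinates.

    grad_coord(i) returns the i-th gradient coordinate; only sample_size of
    the m coordinates are evaluated, which is where the saving comes from.
    """
    candidates = rng.sample(range(m), min(sample_size, m))
    return max(candidates, key=grad_coord)


# Hypothetical gradient values; in SVM training each coordinate would cost
# one pass over the active set in kernel evaluations.
values = [0.1, -0.3, 0.7, 0.2, -1.0, 0.5, 0.0, 0.4]
evaluated = []


def grad_coord(i):
    evaluated.append(i)
    return values[i]


approx_index = sampled_ascent_index(grad_coord, len(values), 3, random.Random(0))
```

Only three of the eight coordinates are evaluated here, so the returned index is the best among the sampled ones rather than the global maximiser.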

More recently, several authors have explored the adaptation of the original FW methods to the task of training SVMs. The advantage of Algorithms 1 and 3 over Algorithm 2 is that they rely only on analytical steps. As a result, each training iteration becomes significantly cheaper than a CVM iteration and does not depend on any external numerical solver. In practice, the training algorithm may require more iterations in order to obtain a solution within the predefined tolerance criterion $\varepsilon$, but the work per iteration is significantly smaller. Such a trade-off has been shown to be worthwhile when dealing with large-scale applications [45, 16, 19].

In [19, 20] the authors show that, by adopting Algorithms 1 and 3, the running times of [58] can be significantly improved as long as a minor loss in accuracy is acceptable. From the analysis presented in [13], it is possible to conclude that this approach enjoys similar theoretical guarantees, namely, linear time in the number of examples and a number of iterations which is independent of the number of examples. The sampling technique introduced above to speed up the computation of $i^*$ can be used with these methods as well, in order to obtain overall time complexities which are independent of the number of training patterns.

In a closely related work [38], Kumar and Yildirim present a specialization of the MFW method to SVM problems, adopting the geometrical formulation studied in [6]. This approach reformulates the SVM problem as a minimum polytope distance problem. The obtained method and its properties are also strongly related to the work of Gärtner and Jaggi [28], in which the authors show (theoretically) that the FW method as well as the coreset framework introduced in [13] can be applied to all the currently most used hard and soft margin SVM variants, with arbitrary kernels, to obtain approximate training algorithms needing a number of iterations independent of the number of attributes and training examples. In [44], Ouyang and Gray propose a stochastic variant of FW methods for online learning of $L_2$-SVMs, obtaining comparable and sometimes better accuracies than state-of-the-art batch and online algorithms for training SVMs. A similar technique has recently been proposed in [29] to allow smooth and general online convex optimization with sub-linear regret bounds [53]. Variants of the method proposed in [58] have been introduced in [61] and [47] for training SVMs on data streams. In [39] the authors adapted the FW method to train SVMs with structured outputs like graphs and other combinatorial objects [60, 3], obtaining an algorithm which outperforms competing structural SVM solvers. (To be precise, the block-coordinate FW in [39], when applied on the binary SVM as a special case of the structured SVM, becomes a variant of dual coordinate ascent [31].)

## 3 The SWAP Method

We have described in the previous sections how the basic FW method can be modified in order to avoid stagnation near a solution, in this way obtaining an algorithm with a guaranteed rate of convergence. Our previous remarks about the MFW method suggest that this algorithm should terminate faster and find sparser solutions. In practice, however, the MFW method is not always as fast as one could expect from the theory. For instance, the experimental results reported in [64] and [1] for the MEB and Minimum Volume Enclosing Ellipsoid problems respectively show that very slight improvements, if any, are obtained using the enhanced method (MFW) with respect to the basic approach. As concerns the problem of training SVMs, results in [20] confirm using statistical tests that MFW is not systematically better than FW. Indeed it may sometimes be slower. Similarly, the authors of [44] argue that the use of away steps does not provide a clear advantage with respect to the standard FW method.

A possible interpretation of these results can be given by looking at the way in which MFW implements the away steps to keep feasibility, i.e., to ensure that the iterates remain in the simplex $S$. The basic idea in the MFW approach is to include the alternative of getting away from a descent vertex $e_{j^*}$ of the current face, decreasing the $j^*$-th weight in $\alpha_k$, instead of going toward an ascent vertex $e_{i^*}$, which would increase the $i^*$-th weight in $\alpha_k$. The choice is mutually exclusive. If the algorithm decides to work around $e_{j^*}$, it may lose the opportunity to explore a promising direction of the feasible space, and vice-versa.

On the other hand, if an away step is performed, the weights of the remaining active vertices are uniformly scaled by $(1 + \lambda)$ to keep feasibility. This scheme not only perturbs the current approximation considerably, since all the weights are modified, but, more importantly, can increase the weights of vertices which do not belong to the optimal face. Away steps in the MFW method are thus prone to increase the need for further away steps to eliminate such “spurious points” ($\alpha_{k,i} > 0$, but $\alpha^*_i = 0$).

Here, we introduce a new type of away step devised to circumvent these problems while preserving the advantages of MFW. We discuss two variants of the method, obtained by using first and second order approximations of the objective function at each iteration, respectively.

### 3.1 Main Construction

Our method is obtained as follows. As in the previous FW methods, we find, at each iteration, an ascent vertex $e_{i^*}$, as

$$i^* \in \arg\max_i \nabla g(\alpha_k)_i, \tag{12}$$

and a descent vertex $e_{j^*}$ on the face spanned by the current solution $\alpha_k$, as

$$j^* \in \arg\max_{j \in I_k} -\nabla g(\alpha_k)_j = \arg\min_{j \in I_k} \nabla g(\alpha_k)_j. \tag{13}$$

However, instead of considering the update $\alpha_{k+1} = \alpha_k + \lambda(\alpha_k - e_{j^*})$ for the away step, we propose a step of the form

$$\alpha_{k+1} = \alpha_k + \lambda (e_{i^*} - e_{j^*}), \tag{14}$$

where $\lambda$ is determined by a line-search. That is, instead of exploring the away direction $\alpha_k - e_{j^*}$, our algorithm explores the direction $e_{i^*} - e_{j^*}$. A sketch is included in Figure (1). This scheme for implementing away steps provides the following conceptual advantages.

1. This away step perturbs the current solution only locally, in the sense that the weight of any vertex other than $e_{i^*}$ and $e_{j^*}$ is preserved.

2. This away step does not increase the weight of vertices of the active face corresponding to descent vertices. These points may correspond to spurious points that need to be removed from the active face to reach the optimal face of the problem.

3. This away step moves the current solution in the away direction and simultaneously in the direction of a toward step. That is, it moves away from the descent vertex $e_{j^*}$, but also gets closer to the ascent vertex $e_{i^*}$ in the same iteration. The step (14) can actually be written as the superposition of two separate steps,

$$\alpha_{k+1} = \underbrace{\tfrac{1}{2}\bigl(\alpha_k + \lambda\,(e_{i^*} - \alpha_k)\bigr)}_{\text{toward step}} + \underbrace{\tfrac{1}{2}\bigl(\alpha_k + \lambda\,(\alpha_k - e_{j^*})\bigr)}_{\text{away step}} , \tag{15}$$

where the first term of the right-hand side represents the standard toward step in the FW method and the second term the away step considered in the MFW approach. Note that the term $\lambda\alpha_k$ cancels in the sum, which equals $\alpha_k + \tfrac{\lambda}{2}(e_{i^*} - e_{j^*})$, i.e. a step of the form (14) with a rescaled step size. Only the components corresponding to $e_{i^*}$ and $e_{j^*}$ are therefore updated, leaving the rest of the current solution unchanged.
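The locality property in points 1 to 3 can be illustrated with a small numerical sketch. The function names `swap_step` and `mfw_away_step` and the weight vector below are hypothetical, chosen only for illustration; the step size is fixed by hand rather than obtained by a line-search.

```python
def swap_step(alpha, i_star, j_star, lam):
    """SWAP update (14): move weight lam from vertex j_star to vertex i_star."""
    new = list(alpha)
    new[i_star] += lam
    new[j_star] -= lam
    return new


def mfw_away_step(alpha, j_star, lam):
    """Classic away step alpha + lam * (alpha - e_j): every weight is rescaled."""
    new = [(1.0 + lam) * a for a in alpha]
    new[j_star] -= lam
    return new


# Illustrative weights on the simplex (assumed values, not from the paper).
alpha = [0.5, 0.3, 0.2, 0.0]
swapped = swap_step(alpha, i_star=3, j_star=2, lam=0.1)
away = mfw_away_step(alpha, j_star=2, lam=0.1)

print(swapped)  # only entries 2 and 3 differ from alpha
print(away)     # entries 0 and 1 have grown as well
```

Both updates stay on the simplex, but only the SWAP update leaves the weights of the remaining vertices untouched.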

The new type of away step is called a SWAP step and substitutes the MFW away steps in Algorithm 3. The procedure is summarized in Algorithm 4. Note that we deliberately include some steps which do not represent computational tasks but definitions which simplify the convergence analysis of the next section.

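The MFW criterion cannot be employed in our method to choose the type of step to perform. The MFW method employs a first-order approximation of $g$ at the current iterate $\alpha_k$ to predict the value of the objective function at the next iterate. That is, if $d$ denotes the search direction,

$$\psi_k(\alpha_k + \lambda d) = g(\alpha_k) + \lambda\, d^T \nabla g(\alpha_k) \tag{16}$$

is computed, and the step which gives the largest value of $\psi_k$ is selected. However, a SWAP step always gives a value of $\psi_k$ at least as large as the value obtained using a toward step. Indeed, the value of $\psi_k$ using a SWAP step is

$$\begin{aligned} \psi_k(\alpha_k + \lambda d_{\text{swap}}) &= g(\alpha_k) + \lambda\,(e_{i^*} - e_{j^*})^T \nabla g(\alpha_k) \\ &= g(\alpha_k) + \lambda \nabla g(\alpha_k)_{i^*} - \lambda \nabla g(\alpha_k)_{j^*} . \end{aligned} \tag{17}$$

The value of $\psi_k$ using a toward step is

$$\begin{aligned} \psi_k(\alpha_k + \lambda d_{\text{fw}}) &= g(\alpha_k) + \lambda\,(e_{i^*} - \alpha_k)^T \nabla g(\alpha_k) \\ &= g(\alpha_k) + \lambda \nabla g(\alpha_k)_{i^*} - \lambda\, \alpha_k^T \nabla g(\alpha_k) . \end{aligned} \tag{18}$$

Since $-\nabla g(\alpha_k)_{j^*}$ is never smaller than $-\alpha_k^T \nabla g(\alpha_k)$ (the latter is a convex combination of the entries $-\nabla g(\alpha_k)_j$, $j \in I_k$, whose maximum is attained at $j^*$ by (13)), a SWAP step would always be preferred if first-order information were used to predict the objective function value.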
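A quick numerical check of the comparison between (17) and (18) is sketched below. The gradient values and the helper name `first_order_gain` are assumptions made for illustration only; the helper returns the per-unit-step first-order gain $d^T \nabla g(\alpha_k)$ for each type of direction.

```python
def first_order_gain(grad, alpha, i_star, j_star=None):
    """Return d^T grad for a toward step (j_star is None) or a SWAP step."""
    if j_star is None:
        # Toward direction d = e_i - alpha, as in (18).
        return grad[i_star] - sum(a * g for a, g in zip(alpha, grad))
    # SWAP direction d = e_i - e_j, as in (17).
    return grad[i_star] - grad[j_star]


grad = [-1.0, -0.4, -2.5, 0.3]   # hypothetical gradient entries
alpha = [0.5, 0.3, 0.2, 0.0]     # hypothetical current iterate
active = [i for i, a in enumerate(alpha) if a > 0]

i_star = max(range(len(grad)), key=lambda i: grad[i])   # rule (12)
j_star = min(active, key=lambda j: grad[j])             # rule (13)

gain_fw = first_order_gain(grad, alpha, i_star)
gain_swap = first_order_gain(grad, alpha, i_star, j_star)
print(gain_fw, gain_swap)
```

For any iterate on the simplex the SWAP gain is at least the toward gain, which is why a purely first-order criterion cannot arbitrate between the two steps.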

To address this problem, we observe that the MFW method computes an exact line-search for the search direction selected using $\psi_k$. We thus formulate our method so that the line-search is computed before deciding the type of step to perform. This design requires performing two line-searches instead of one. However, the estimate of the objective function value at the next iterate is more accurate.

As we will discuss in the section regarding the adaptation of the procedure to the SVM problem, this computation is particularly simple for the objective function in problem (2): all the computations are analytical. Furthermore, the exact computation of the improvement obtained by each type of step involves terms already computed in the line-searches and therefore does not represent an additional overhead for the algorithm.

### 3.2 A Variant using Second Order Information

All the FW methods introduced previously make use of first-order approximations of the objective function in order to determine the direction toward which the current iterate should be moved. Here, we consider the possibility of using second-order information. If we assume that the objective function is twice differentiable, the second-order Taylor approximation of $g$ in a neighborhood of $\alpha_k$ is

$$g(\alpha_k + \lambda d) \approx g(\alpha_k) + \lambda \nabla g(\alpha_k)^T d + \tfrac{1}{2} \lambda^2 d^T \nabla^2 g(\alpha_k)\, d , \tag{20}$$

where the Hessian matrix $\nabla^2 g(\alpha_k)$ is negative semi-definite. Finding the best ascent direction would thus require the computation of the quadratic form $d^T \nabla^2 g(\alpha_k)\, d$. Since the Hessian may be highly dense, which is usually the case in SVM applications, employing a first-order relaxation as in Frank-Wolfe methods makes sense in order to obtain lighter iterations. However, we note that the search direction for a SWAP step yields a particularly simple expression

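$$\begin{aligned} g(\alpha_k + \lambda d_{\text{swap}}) &\approx g(\alpha_k) + \lambda \nabla g(\alpha_k)^T (e_{i^*} - e_{j^*}) + \tfrac{1}{2} \lambda^2 (e_{i^*} - e_{j^*})^T \nabla^2 g(\alpha_k) (e_{i^*} - e_{j^*}) \\ &= g(\alpha_k) + \lambda \bigl( \nabla g(\alpha_k)_{i^*} - \nabla g(\alpha_k)_{j^*} \bigr) + \tfrac{1}{2} \lambda^2 \bigl( \nabla^2_{i^*,i^*} - 2 \nabla^2_{i^*,j^*} + \nabla^2_{j^*,j^*} \bigr) , \end{aligned} \tag{21}$$

where $\nabla^2_{i,j}$ denotes the $(i,j)$ entry of $\nabla^2 g(\alpha_k)$.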

In order to determine the best pair $(i^*, j^*)$ we thus need to evaluate three entries of the Hessian matrix. However, this is still a computationally hard task for each iteration, since we would need to consider all candidate pairs of points in order to take a step. We thus adopt the strategy used in the second-order version of SMO proposed in [16]. We fix the ascent index $i^*$ as in the first-order SWAP, and search for the index $j^*$ in the active set which maximizes the improvement of the second-order approximation (21). We call the obtained procedure second-order SWAP and we denote it as SWAP-2o in the following sections.

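It is worth noting that this approximation is exact for quadratic objective functions, which is the case for the SVM problem (2). Note also that in this case the line-search along the ascent direction defined by $i^*$ and $j^*$ has a closed-form solution. Indeed,

$$\lambda^* = \frac{\nabla g(\alpha_k)_{i^*} - \nabla g(\alpha_k)_{j^*}}{-\bigl( \nabla^2_{i^*,i^*} - 2 \nabla^2_{i^*,j^*} + \nabla^2_{j^*,j^*} \bigr)} . \tag{22}$$

From the negative semi-definiteness of $\nabla^2 g(\alpha_k)$, together with $\nabla g(\alpha_k)_{i^*} \ge \nabla g(\alpha_k)_{j^*}$, it follows that $\lambda^*$ is non-negative. Substituting this step-size in (21), the improvement in the objective function becomes

$$g(\alpha_k + \lambda^* d_{\text{swap}}) - g(\alpha_k) = \frac{\bigl( \nabla g(\alpha_k)_{i^*} - \nabla g(\alpha_k)_{j^*} \bigr)^2}{-2 \bigl( \nabla^2_{i^*,i^*} - 2 \nabla^2_{i^*,j^*} + \nabla^2_{j^*,j^*} \bigr)} , \tag{23}$$

which again, from the negative semi-definiteness of $\nabla^2 g(\alpha_k)$, is non-negative. Naturally, we need to restrict the value of $\lambda^*$ to the interval $[0, (\alpha_k)_{j^*}]$ in order to obtain a feasible solution for the next step. We thus modify Algorithm 4 as specified in Algorithm 5.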
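The selection rule of the second-order variant can be sketched as follows. This is a minimal illustration under stated assumptions, not a transcription of Algorithm 5: the function name `select_swap2o`, the handling of zero curvature, and the toy data (Hessian $-2I$, i.e. $g(\alpha) = -\alpha^T \alpha$) are all hypothetical.

```python
def select_swap2o(grad, hess, alpha, i_star, active):
    """Pick the descent index j maximizing the second-order gain (21)
    along e_i - e_j, with the step clipped to [0, alpha[j]]."""
    best = None
    for j in active:
        if j == i_star:
            continue
        slope = grad[i_star] - grad[j]
        curv = hess[i_star][i_star] - 2.0 * hess[i_star][j] + hess[j][j]  # <= 0
        if slope <= 0.0:
            continue  # no ascent along this pair
        # Unconstrained maximizer (22), then clipping for feasibility.
        lam = alpha[j] if curv == 0.0 else min(slope / -curv, alpha[j])
        gain = lam * slope + 0.5 * lam * lam * curv
        if best is None or gain > best[2]:
            best = (j, lam, gain)
    return best


# Toy quadratic: g(alpha) = -alpha^T alpha, so grad = -2 alpha and hess = -2 I.
alpha = [0.6, 0.4, 0.0]
grad = [-2.0 * a for a in alpha]
hess = [[-2.0 if r == c else 0.0 for c in range(3)] for r in range(3)]
active = [i for i, a in enumerate(alpha) if a > 0]
i_star = max(range(3), key=lambda i: grad[i])

best = select_swap2o(grad, hess, alpha, i_star, active)
print(best)  # (descent index, step size, predicted improvement)
```

Because the toy objective is quadratic, the predicted gain coincides with the true improvement of the objective whenever the step is not clipped.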

### 3.3 Notes on the Adaptation to SVM Training

Here we provide analytical expressions for all the computations required by Algorithm 4 and Algorithm 5 applied to the SVM problem (2). Similar expressions follow for any quadratic objective function.

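For problem (2), the gradient and Hessian at a given iterate $\alpha_k$ take particularly simple expressions:

$$\nabla g(\alpha_k) = -2 K \alpha_k , \qquad \nabla^2 g(\alpha_k) = -2 K . \tag{24}$$

Notice that $\alpha_k^T \nabla g(\alpha_k) = 2 g(\alpha_k)$. Therefore, the line-searches in Algorithm 4 or Algorithm 5 can be performed analytically as follows. For FW steps,

$$\lambda_{\text{fw}} = \frac{\nabla g(\alpha_k)_{i^*} - 2 g(\alpha_k)}{2 \bigl( K_{i^*,i^*} + \nabla g(\alpha_k)_{i^*} - g(\alpha_k) \bigr)} . \tag{25}$$

Note that the quantity $\nabla g(\alpha_k)_{i^*}$ has already been computed to find the ascent vertex $e_{i^*}$. For SWAP steps,

$$\lambda_{\text{swap}} = \frac{\nabla g(\alpha_k)_{i^*} - \nabla g(\alpha_k)_{j^*}}{2 \bigl( K_{i^*,i^*} - 2 K_{i^*,j^*} + K_{j^*,j^*} \bigr)} . \tag{26}$$

Again, the quantity $\nabla g(\alpha_k)_{j^*}$ has already been computed to choose the descent vertex $e_{j^*}$.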
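For completeness, the two step sizes can be recovered from a single expansion. The sketch below assumes $g(\alpha) = -\alpha^T K \alpha$, which is the form consistent with (24) and with the identity $\alpha_k^T \nabla g(\alpha_k) = 2 g(\alpha_k)$; the symbol $\lambda^\star$ for the unconstrained maximizer is introduced here only for illustration.

$$\begin{aligned} g(\alpha_k + \lambda d) &= g(\alpha_k) + \lambda\, d^T \nabla g(\alpha_k) - \lambda^2\, d^T K d , \qquad \lambda^\star = \frac{d^T \nabla g(\alpha_k)}{2\, d^T K d} , \\ d = e_{i^*} - \alpha_k :\quad & d^T \nabla g(\alpha_k) = \nabla g(\alpha_k)_{i^*} - 2 g(\alpha_k) , \quad d^T K d = K_{i^*,i^*} + \nabla g(\alpha_k)_{i^*} - g(\alpha_k) , \\ d = e_{i^*} - e_{j^*} :\quad & d^T \nabla g(\alpha_k) = \nabla g(\alpha_k)_{i^*} - \nabla g(\alpha_k)_{j^*} , \quad d^T K d = K_{i^*,i^*} - 2 K_{i^*,j^*} + K_{j^*,j^*} . \end{aligned}$$

The second line uses $(K \alpha_k)_{i^*} = -\tfrac{1}{2} \nabla g(\alpha_k)_{i^*}$ and $\alpha_k^T K \alpha_k = -g(\alpha_k)$. Substituting $\lambda^\star$ back into the expansion gives an improvement of $\bigl(d^T \nabla g(\alpha_k)\bigr)^2 / \bigl(4\, d^T K d\bigr)$ for an unclipped step.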

The improvement in the objective function can also be calculated analytically. For FW steps,

$$\delta_{\text{fw}} = \frac{\bigl( \nabla g(\alpha_k)_{i^*} - 2 g(\alpha_k) \bigr)^2}{4 \bigl( K_{i^*,i^*} + \nabla g(\alpha_k)_{i^*} - g(\alpha_k) \bigr)} . \tag{27}$$

All the terms involved here have already been computed to perform the line-search. Similarly, for SWAP steps,

$$\delta_{\text{swap}} = \frac{\bigl( \nabla g(\alpha_k)_{i^*} - \nabla g(\alpha_k)_{j^*} \bigr)^2}{4 \bigl( K_{i^*,i^*} - 2 K_{i^*,j^*} + K_{j^*,j^*} \bigr)} . \tag{28}$$

With the exception of the term $K_{i^*,j^*}$, all the quantities involved are already available from the previous computations and from the choice of the descent vertex $e_{j^*}$. We conclude that, compared with the MFW procedure, the SWAP method adapted for problem (2) involves the computation of just one additional term, which is an entry of the kernel matrix defining the SVM problem.
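The closed forms (25) to (28) can be checked numerically with a short sketch. The kernel matrix, the iterate and the function names below are assumptions made for illustration; the sketch takes $g(\alpha) = -\alpha^T K \alpha$, consistent with (24), and does not handle clipped (drop) steps, for which $\delta$ would have to be recomputed.

```python
def quad_obj(K, alpha):
    """g(alpha) = -alpha^T K alpha."""
    n = len(alpha)
    return -sum(alpha[r] * K[r][c] * alpha[c] for r in range(n) for c in range(n))


def gradient(K, alpha):
    """Gradient -2 K alpha, as in (24)."""
    n = len(alpha)
    return [-2.0 * sum(K[r][c] * alpha[c] for c in range(n)) for r in range(n)]


def fw_line_search(K, g_val, grad, i):
    """Step size (25) and improvement (27) for the toward direction."""
    num = grad[i] - 2.0 * g_val
    den = K[i][i] + grad[i] - g_val
    return num / (2.0 * den), num * num / (4.0 * den)


def swap_line_search(K, grad, i, j):
    """Step size (26) and improvement (28) for the SWAP direction."""
    num = grad[i] - grad[j]
    den = K[i][i] - 2.0 * K[i][j] + K[j][j]
    return num / (2.0 * den), num * num / (4.0 * den)


# Hypothetical positive definite kernel matrix and iterate.
K = [[2.0, 0.5, 0.0], [0.5, 1.0, 0.2], [0.0, 0.2, 1.5]]
alpha = [0.7, 0.3, 0.0]
g_val = quad_obj(K, alpha)
grad = gradient(K, alpha)
active = [i for i, a in enumerate(alpha) if a > 0]
i_star = max(range(len(alpha)), key=lambda i: grad[i])
j_star = min(active, key=lambda j: grad[j])

lam_fw, delta_fw = fw_line_search(K, g_val, grad, i_star)
lam_swap, delta_swap = swap_line_search(K, grad, i_star, j_star)
# Both line-searches are done first; the step with the larger improvement wins.
step = "swap" if delta_swap > delta_fw else "fw"
print(step, lam_fw, delta_fw, lam_swap, delta_swap)
```

Comparing the two exact improvements, rather than the first-order predictions, is what allows a toward step to be selected when it is genuinely better.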

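The objective function value can be computed recursively from the relationship $g(\alpha_{k+1}) = g(\alpha_k) + \delta$, where $\delta$ is the improvement ($\delta_{\text{fw}}$ or $\delta_{\text{swap}}$) of the step actually performed; a similar recursive equation can be derived to handle the case of SWAP-drop steps. Finally, we observe that the stopping criterion of Eqn. (8) takes the form

$$\Delta_d(\alpha_k) = \nabla g(\alpha_k)_{i^*} - 2 g(\alpha_k) \le \varepsilon , \tag{29}$$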

which involves the same already computed terms.
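The recursive bookkeeping and the stopping test (29) can be put together in a minimal loop. For brevity the sketch below performs toward steps only (it is not Algorithm 4), assumes $g(\alpha) = -\alpha^T K \alpha$ with a positive definite $K$, and uses a hypothetical function name and toy kernel.

```python
def toward_only_fw(K, alpha, eps, max_iter=10000):
    """Frank-Wolfe with toward steps only, updating g and its gradient recursively."""
    n = len(alpha)
    alpha = list(alpha)
    grad = [-2.0 * sum(K[r][c] * alpha[c] for c in range(n)) for r in range(n)]
    g_val = 0.5 * sum(a * gr for a, gr in zip(alpha, grad))  # alpha^T grad = 2 g
    iters = 0
    while iters < max_iter:
        i = max(range(n), key=lambda r: grad[r])
        gap = grad[i] - 2.0 * g_val            # stopping quantity in (29)
        if gap <= eps:
            break
        curv = K[i][i] + grad[i] - g_val       # d^T K d for d = e_i - alpha
        lam = min(1.0, gap / (2.0 * curv))     # step (25), clipped to [0, 1]
        g_val += lam * gap - lam * lam * curv  # exact improvement for this lam
        # grad(alpha + lam d) = (1 - lam) grad - 2 lam K[:, i]
        grad = [(1.0 - lam) * grad[r] - 2.0 * lam * K[r][i] for r in range(n)]
        alpha = [(1.0 - lam) * a for a in alpha]
        alpha[i] += lam
        iters += 1
    return alpha, g_val, iters


# Toy kernel (identity): the maximizer is the barycenter of the simplex.
K = [[1.0 if r == c else 0.0 for c in range(3)] for r in range(3)]
alpha, g_val, iters = toward_only_fw(K, [1.0, 0.0, 0.0], eps=1e-6)
print(alpha, g_val, iters)
```

No objective evaluation from scratch is needed inside the loop: the gap, the step size and the new objective value all reuse the gradient entry selected by the vertex search.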

## 4 Convergence Analysis of the SWAP Method

In this section we study the convergence of the SWAP method on problem (1), of which the SVM problem (2) is a particular instance.

We start by demonstrating the global convergence of the SWAP method. Then, we analyze its rate of convergence towards the optimum. For this purpose we adapt the analysis presented by Ahipasaoglu et al. in [1]. Using this framework and a set of observations concerning the improvement in the objective function after an iteration of the SWAP method, we prove that the algorithm converges linearly to the optimal value of the objective function. From a theoretical point of view, these results show that the SWAP method enjoys the same mathematical properties as the MFW method. Finally, we provide bounds on the number of iterations required to fulfill the stopping condition of Eqn. (8). We demonstrate that this number is bounded independently of the number of variables, which coincides with the number of training examples in the SVM problem (2).

Here we only provide proofs for the first-order SWAP method, described in Algorithm 4. However all the convergence results follow easily for the second-order variant as well. The statements and proofs of some of the technical results used in this section can be found in the Appendix.

We develop our analysis under the following assumptions:

• $g$ is twice continuously differentiable;

• There is an optimal solution $\alpha^*$ of the optimization problem satisfying the strong sufficient condition of Robinson in [48].

This is the same set of hypotheses imposed by Yildirim in [64] and Ahipasaoglu et al. in [1] to study the convergence of Frank-Wolfe methods for the MEB problem and the Minimum Volume Enclosing Ellipsoid problem, respectively.

###### Remark 1.

Robinson’s condition is a general version of the classical second-order sufficient condition for a solution to be an isolated local extremum, i.e. locally unique [17]. Referring to the case of a constrained maximization problem for a concave objective $g$, this result requires two conditions to be fulfilled:

• $\alpha^*$ is a KKT point [17].

• The Hessian of the Lagrangian function at $\alpha^*$ behaves as a negative definite matrix (positive definite for minimization problems) along the directions belonging to the critical cone of the KKT point [43]. Specialized to a quadratic problem on the simplex, i.e. a problem with the form of (2), this condition assumes the form:

 yTKy>0

for all such that , , (where is the vector of Lagrange multipliers at corresponding to inequality constraints, which is unique since the constraints are linear and linearly independent).

The additional analysis in [48], which plays a key role in our convergence analysis, essentially describes the conditions under which the stationary points of a small perturbation of the problem lie in a neighborhood of the solution of the original problem. This is also the key step in the proofs of linear convergence provided in [64] and [1] for the MFW method.

In [27], Guélat and Marcotte analyzed the convergence properties of FW and MFW methods under the following alternative hypotheses:

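• $\nabla g$ is Lipschitz-continuous on the feasible set;

• $g$ is strongly concave;

• Let $\alpha^*$ be optimal for (1) and $T^*$ be the smallest face of the feasible set containing $\alpha^*$. Then

$$(\alpha - \alpha^*)^T \nabla g(\alpha^*) = 0 \iff \alpha \in T^* \quad \text{(strict complementarity)}.$$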

However, this set of assumptions can be difficult to satisfy in practice. In particular, A is a quite strong assumption and cannot be guaranteed in general.

Note that assumption B1 implies A1, as from the mean value theorem it follows that

 ∥∇g(x)−∇g(y)