Support Vector Machines (SVMs) are currently one of the most effective methods to approach classification and other machine learning problems, improving on more traditional techniques like decision trees and neural networks in a number of applications [16, 33]. SVMs are defined by optimizing a regularized risk functional on the training data, which in most cases leads to classifiers with an outstanding generalization performance [39, 33]. This optimization problem is usually formulated as a large convex quadratic programming problem (QP), for which a naive implementation requires $O(m^2)$ space and $O(m^3)$ time in the number of examples $m$, complexities that are prohibitively expensive for large scale problems [33, 37]. Major research efforts have hence been directed towards scaling up SVM algorithms to large datasets.
Due to the typically dense structure of the Hessian matrices involved in the QP, traditional optimization methods cannot be directly applied to train an SVM on large datasets. The problem is usually addressed using an active set method where at each iteration only a small number of variables are allowed to change [32, 18, 30]. In non-linear SVM problems, this is essentially equivalent to selecting a subset of training examples called a working set. The most prominent example in this category of methods is Sequential Minimal Optimization (SMO), where only two variables are selected for optimization each time [8, 30]. The main disadvantage of these methods is that they generally exhibit a slow local rate of convergence, that is, the closer one gets to a solution, the more slowly one approaches that solution. Moreover, performance results are in practice very sensitive to the size of the active set, the way the active variables are selected and other implementation details like the caching strategy used to avoid repetitive computations of the kernel function on which the model is based. Other attempts to scale up SVM methods consist in adapting interior point methods to some classes of the SVM QP. For large-scale problems, however, the resulting rank of the kernel matrix can still be too high to be handled efficiently. The reformulation of the SVM objective function, the use of sampling methods to reduce the number of variables in the problem, and the combination of small SVMs using ensemble methods have also been explored.
Looking for more efficient methods, a new approach has been proposed: the task of learning the classifier from data can be transformed to the problem of computing a minimal enclosing ball (MEB), that is, the ball of smallest radius containing a set of points. This equivalence is obtained by adopting a slightly different penalty term in the objective function and imposing some mild conditions on the kernel used by the SVM. Recent advances in computational geometry have demonstrated that there are algorithms capable of approximating a MEB with any degree of accuracy $\epsilon$ in $O(1/\epsilon)$ iterations, independently of the number of points and the dimensionality of the space in which the ball is built. Adopting one of these algorithms, Tsang and colleagues devised the Core Vector Machine (CVM), demonstrating that the new method compares favorably with most traditional SVM software, including for example software based on SMO [8, 30].
CVMs start by solving the optimization problem on a small subset of data and then proceed iteratively. At each iteration the algorithm looks for a point outside the approximation of the MEB obtained so far. If this point exists, it is added to the previous subset of data to define a larger optimization problem, which is solved to obtain a new approximation to the MEB. The process is repeated until no points outside the current approximating ball are found within a prescribed tolerance. CVMs hence require the solution of a sequence of optimization problems of increasing complexity using an external numerical solver. In order to be efficient, the solver should be able to solve each problem from a warm start and to avoid the full storage of the corresponding Gram matrix. To this end, the experiments in Ref. 37 employ a variant of second-order SMO.
In this paper, we study two novel algorithms that exploit the formalism of CVMs but do not require the solution of a sequence of QPs. These algorithms are based on the Frank-Wolfe (FW) optimization framework, which has recently been studied as a method to approximate the solution of the MEB problem and other convex optimization problems defined on the unit simplex. Both algorithms can be used to obtain a solution arbitrarily close to the optimum, but at the same time are considerably simpler than CVMs. The key idea is to replace the nested optimization problem to be solved at each iteration of the CVM approach by a linearization of the objective function at the current feasible solution and an exact line search in the direction obtained from the linearization. Consequently, each iteration becomes considerably cheaper than a CVM iteration and does not require any external numerical solver.
Similar to CVMs, both algorithms incrementally discover the examples which become support vectors in the SVM model, looking for the optimal set of weights in the process. However, the second of the proposed algorithms is also endowed with the ability to explicitly remove examples from the working set used at each iteration of the procedure and thus has the potential to compute smaller models. On the theoretical side, both algorithms are guaranteed to succeed in $O(1/\epsilon)$ iterations for an arbitrary $\epsilon > 0$. In addition, the second algorithm exhibits an asymptotically linear rate of convergence.
This research was originally motivated by the use of the MEB framework and computational geometry optimization for the problem of training an SVM. However, a major advantage of the proposed methods over the CVM approach is the possibility to employ kernels which do not satisfy the conditions required to obtain the equivalence between the SVM and MEB optimization problems. For example, the popular polynomial kernel does not allow the use of CVMs as a training method. Since the optimal kernel for a given application cannot be specified a priori, the capability of a training method to work with any valid kernel function is an important feature. Adaptations of the CVM to handle more general kernels have recently been proposed but, in contrast, our algorithms can be used with any Mercer kernel without changes to the theory or the implementation.
The effectiveness of the proposed methods is evaluated on several classification datasets, most of them already used to show the improvements of CVMs over second-order SMO. Our experimental results suggest that, as long as a minor loss in accuracy is acceptable, our algorithms significantly improve on the actual running times of CVMs. Statistical tests are conducted to assess the significance of these conclusions. In addition, our experiments confirm that effective classifiers are also obtained with kernels that do not fulfill the conditions required by CVMs.
The article is organized as follows. Section 2 presents a brief overview of SVMs and the way in which the problem of computing an SVM can be treated as a MEB problem. Section 3 describes the CVM approach. In Section 4 we introduce the proposed methods. Section 5 presents the experimental setting and our numerical results. Section 6 closes the article with a discussion of the main conclusions of this research.
2 Support Vector Machines and the MEB Equivalence
In this section we present an overview on Support Vector Machines (SVMs), and discuss the conditions under which the problem of building these models can be treated as a Minimal Enclosing Ball (MEB) problem in a feature space.
2.1 The Pattern Classification Problem
Consider a set of training data $S = \{x_i : i \in I\}$ with $x_i \in \mathcal{X}$, $I = \{1, \ldots, m\}$. The set $\mathcal{X}$, often coinciding with $\mathbb{R}^n$, is called the input space, and each instance is associated with a given category in the set $\mathcal{C} = \{c_1, \ldots, c_K\}$. A pattern classification problem consists of inferring from $S$ a prediction mechanism $f : \mathcal{X} \to \mathcal{C}$, termed hypothesis, to associate new instances $x \in \mathcal{X}$ with the correct category. When $K = 2$ the problem described above is called binary classification. This problem can be addressed by defining a set of candidate models $\mathcal{H}$, a risk functional $R(f, S)$ assessing the ability of $f \in \mathcal{H}$ to correctly predict the category of the instances in $S$, and a procedure $\mathcal{A}$ by which a dataset $S$ is mapped to a given hypothesis $f = \mathcal{A}(S) \in \mathcal{H}$ achieving a low risk. In the context of machine learning, $\mathcal{A}$ is called the learning algorithm, $\mathcal{H}$ the hypothesis space and $R$ the induction principle.
In the rest of this paper we focus on the problem of computing a model designed for binary classification problems. The extension of these models to handle multiple categories can be accomplished in several ways. One possible approach is to use several binary classifiers, separately trained and joined into a multi-category decision function. Well known approaches of this type are one-versus-the-rest (OVR), where one classifier is trained to separate each class from the rest; one-versus-one (OVO), where different binary SVMs are used to separate each possible pair of classes; and DDAG, where one-versus-one classifiers are organized in a directed acyclic graph decision structure. Previous experiments with SVMs show that OVO frequently obtains a better performance both in terms of accuracy and training time. Another type of extension consists in reformulating the optimization problem underlying the method to directly address multiple categories.
2.2 Linear Classifiers and Kernels
Support Vector Machines implement the decision mechanism by using simple linear functions. Since in realistic problems the configuration of the data can be highly non-linear, SVMs build a linear model not in the original space $\mathcal{X}$, but in a high-dimensional dot product feature space $\mathcal{F}$ where the original data is embedded through the mapping $\phi : \mathcal{X} \to \mathcal{F}$, $x_i \mapsto \phi(x_i)$ for each $x_i \in S$. In this space, it is expected that an accurate decision function can be linearly represented. The feature space is related to $\mathcal{X}$ by means of a so-called kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, which allows dot products in $\mathcal{F}$ to be computed directly from the input space. More precisely, for each $x_i$, $x_j \in \mathcal{X}$, we have $\phi(x_i)^T \phi(x_j) = k(x_i, x_j)$. The explicit computation of the mapping $\phi$, which would be computationally infeasible, is thus avoided. For binary classification problems, the most common approach is to associate a positive label $y_i = +1$ to the examples of the first class, and a negative label $y_i = -1$ to the examples belonging to the other class. This approach allows the use of real valued hypotheses $h : \mathcal{X} \to \mathbb{R}$, whose output is passed through a sign threshold to yield the classification label $f(x) = \mathrm{sgn}(h(x))$. Since $h$ is a linear function in $\mathcal{F}$, the final prediction mechanism takes the form
\[ f(x) = \mathrm{sgn}\left( w^T \phi(x) + b \right), \]
with $w \in \mathcal{F}$ and $b \in \mathbb{R}$. This gives a classification rule whose decision boundary is a hyperplane with normal vector $w$ and position parameter $b$.
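The role of the kernel function can be illustrated with a small numerical check. The sketch below is only an illustration under stated assumptions: it uses the homogeneous polynomial kernel of degree 2 on two-dimensional inputs, a case chosen here because its feature map can be written explicitly, and the function names are hypothetical. It compares the kernel value with the dot product of the explicit feature vectors.

```python
import math

def dot(u, v):
    """Euclidean dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel k(x, z) = (x . z)^2 (illustrative choice)."""
    return dot(x, z) ** 2

def poly2_features(x):
    """Explicit feature map for 2-D inputs such that phi(x) . phi(z) = (x . z)^2."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

x, z = (1.0, 2.0), (3.0, -1.0)
k_implicit = poly2_kernel(x, z)                           # computed in the input space
k_explicit = dot(poly2_features(x), poly2_features(z))    # computed in the feature space
print(k_implicit, k_explicit)
```

Both quantities agree up to rounding, while the kernel evaluation never builds the feature vectors; for higher degrees or for Gaussian kernels the explicit map becomes very large or infinite-dimensional, which is why only kernel evaluations are used in practice.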
2.3 Large Margin Classifiers
It should be noted that a decision function that predicts the training data well does not necessarily classify unseen examples well. Hence, minimizing the training error (or empirical risk)
\[ R_{\mathrm{emp}} = \frac{1}{m} \sum_{i \in I} \mathbb{1}\left[ f(x_i) \neq y_i \right] \]
does not necessarily imply a small test error. The implementation of an induction principle guaranteeing a good classification performance on new instances of the problem is addressed in SVMs by building on the concept of margin. For a given training pair $(x_i, y_i)$, the margin is defined as $\rho_i = y_i \left( w^T \phi(x_i) + b \right)$ and is expected to estimate how reliable the prediction of the model on this pattern is. Note that the example $x_i$ is misclassified if and only if $\rho_i < 0$. Note also that a large margin of the pattern suggests a more robust decision with respect to changes in the parameters of the decision function $f$, which are to be estimated from the training sample $S$. The margin attained by a given prediction mechanism on the full training set $S$ is defined as the minimum margin over the whole sample, that is $\rho = \min_{i \in I} \rho_i$. This implements a measure of the worst classification performance on the training set, since $\rho_i \geq \rho$ for all $i \in I$. Under some regularity conditions, a large margin leads to theoretical guarantees of good performance on new decision instances. The decision function maximizing the margin on the training data is thus obtained by solving
\[ \max_{w, b} \; \min_{i \in I} \; y_i \left( w^T \phi(x_i) + b \right). \tag{3} \]
However, without some constraint on the size of $w$, the solution to this maximin problem does not exist [35, 14]. On the other hand, even if we fix the norm of $w$, a separating hyperplane guaranteeing a positive margin on each training pattern need not exist. This is the case, for example, if a high noise level causes a large overlap of the classes. In this case, the hyperplane maximizing (3) performs poorly because the prediction mechanism is determined entirely by misclassified examples and the theoretical results guaranteeing a good classification accuracy on unseen patterns no longer hold. A standard approach to deal with noisy training patterns is to allow for the possibility of examples violating the margin constraint and to compute the margin on a subset of training examples. The exact way in which SVMs address these problems gives rise to specific formulations, called soft-margin SVMs.
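The margin notions of this section can be made concrete with a toy computation. The sketch below assumes a linear model acting directly on the input space (that is, an identity feature map) and uses made-up data; it computes the margin of each pattern, the indices of the misclassified patterns, and the margin attained on the whole sample.

```python
def margins(w, b, X, y):
    """Margin y_i * (w . x_i + b) of each labelled example under a linear model."""
    return [yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            for xi, yi in zip(X, y)]

# Toy data (hypothetical): the last example lies on the wrong side of the hyperplane.
X = [(2.0, 0.0), (3.0, 1.0), (-1.0, 0.0), (0.5, 0.0)]
y = [1, 1, -1, -1]
w, b = (1.0, 0.0), 0.0

rho_i = margins(w, b, X, y)
misclassified = [i for i, r in enumerate(rho_i) if r < 0]  # negative margin = error
rho = min(rho_i)                                           # margin on the full sample
print(rho_i, misclassified, rho)
```

A single noisy pattern is enough to make the sample margin negative, which is the situation that soft-margin formulations are designed to handle.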
2.4 Soft-Margin SVM Formulations
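In $L_1$-SVMs (see e.g. [5, 33, 14]), degeneracy of problem (3) is addressed by rescaling the margin constraints and adding a normalization constraint, so that the problem now takes the form of the quadratic programming problem
\[ \min_{w, b} \; \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y_i \left( w^T \phi(x_i) + b \right) \geq 1, \; i \in I. \tag{5} \]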
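Noisy training examples are handled by incorporating slack variables $\xi_i \geq 0$ into the constraints in (5) and by penalizing them in the objective function:
\[ \min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i \in I} \xi_i \quad \text{s.t.} \quad y_i \left( w^T \phi(x_i) + b \right) \geq 1 - \xi_i, \; \xi_i \geq 0, \; i \in I. \]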
This leads to the so-called soft-margin $L_1$-SVM. In this formulation, the parameter $C$ controls the trade-off between margin maximization and violation of the margin constraints.
Several other reformulations of problem (3) can be found in the literature. In particular, in some formulations the two-norm of $\xi$ is penalized instead of the one-norm. In this article, we are particularly interested in the soft-margin $L_2$-SVM proposed by Lee and Mangasarian. In this formulation, the margin constraints in (3) are preserved, the margin variable $\rho$ is explicitly incorporated in the objective function and degeneracy is addressed by penalizing the squared norms of both $w$ and $b$,
\[ \min_{w, b, \rho, \xi} \; \frac{1}{2} \left( \|w\|^2 + b^2 + C \sum_{i \in I} \xi_i^2 \right) - \rho \quad \text{s.t.} \quad y_i \left( w^T \phi(x_i) + b \right) \geq \rho - \xi_i, \; i \in I. \tag{7} \]
2.5 The Target QP
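In this paper we focus on the $L_2$-SVM model as described above. The use of this formulation is mainly motivated by efficiency: by adopting the slightly modified functional of Eqn. 7, we can exploit the MEB framework and solve the learning problem more easily, as we will explain in the next subsection. As a drawback, the constraints of problem (7) explicitly depend on the images $\phi(x_i)$ of the training examples under the mapping $\phi$. In practice, to avoid the explicit computation of the mapping, it is convenient to derive the Wolfe dual of the problem by incorporating multipliers $\alpha_i \geq 0$, $i \in I$, and considering its Lagrangian
\[ L(w, b, \rho, \xi, \alpha) = \frac{1}{2} \left( \|w\|^2 + b^2 + C \sum_{i \in I} \xi_i^2 \right) - \rho - \sum_{i \in I} \alpha_i \left( y_i \left( w^T \phi(x_i) + b \right) - \rho + \xi_i \right). \]
Setting the derivatives of $L$ with respect to the primal variables to zero gives
\[ w = \sum_{i \in I} \alpha_i y_i \phi(x_i), \qquad b = \sum_{i \in I} \alpha_i y_i, \qquad \xi_i = \frac{\alpha_i}{C}, \qquad \sum_{i \in I} \alpha_i = 1. \]
Plugging into the Lagrangian, we have
\[ \max_{\alpha} \; -\frac{1}{2} \sum_{i, j \in I} \alpha_i \alpha_j \left( y_i y_j \left( \phi(x_i)^T \phi(x_j) + 1 \right) + \frac{\delta_{ij}}{C} \right) \quad \text{s.t.} \quad \sum_{i \in I} \alpha_i = 1, \; \alpha_i \geq 0, \; i \in I, \]
where $\delta_{ij}$ is equal to 1 if $i = j$, and 0 otherwise. In contrast to (7), the problem above depends on the images of the training examples only through the dot products $\phi(x_i)^T \phi(x_j)$. By using the kernel function we can hence obtain a problem defined entirely on the original data
\[ \min_{\alpha} \; \sum_{i, j \in I} \alpha_i \alpha_j \left( y_i y_j \left( k(x_i, x_j) + 1 \right) + \frac{\delta_{ij}}{C} \right) \quad \text{s.t.} \quad \sum_{i \in I} \alpha_i = 1, \; \alpha_i \geq 0, \; i \in I. \tag{12} \]
From the stationarity conditions, the decision function takes the form $f(x) = \mathrm{sgn}\left( \sum_{i \in I} \alpha_i y_i \left( k(x_i, x) + 1 \right) \right)$.
Note that the decision function above depends only on the subset of training examples for which $\alpha_i > 0$. These examples are usually called the support vectors of the model. The set of support vectors is often considerably smaller than the original training set.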
2.6 Computing SVMs as Minimal Enclosing Balls (MEBs)
Now we explain why the $L_2$-SVM formulation introduced in the previous paragraphs can lead to efficient algorithms to extract SVM classifiers from data. As has been pointed out and later generalized in the literature, the $L_2$-SVM can be equivalently formulated as a MEB problem in a certain feature space, that is, as the computation of the ball of smallest radius containing the image of the dataset under a mapping into a dot product space.
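Consider the image of the training set $S$ under a mapping $\varphi$, that is, $\varphi(S) = \{z_i = \varphi(x_i) : i \in I\}$. Suppose now that there exists a kernel function $\tilde{k}$ such that $\tilde{k}(x_i, x_j) = z_i^T z_j$ for all $i, j \in I$. Denote the closed ball of center $c$ and radius $r$ as $B(c, r)$. The MEB $B(c^*, r^*)$ of $\varphi(S)$ can be defined as the solution of the following optimization problem
\[ \min_{r^2, c} \; r^2 \quad \text{s.t.} \quad \|z_i - c\|^2 \leq r^2, \; i \in I. \]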
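By using the kernel function $\tilde{k}$ to implement dot products in the feature space, the following Wolfe dual of the MEB problem is obtained:
\[ \max_{\alpha} \; \Phi(\alpha) := \sum_{i \in I} \alpha_i \tilde{k}(x_i, x_i) - \sum_{i, j \in I} \alpha_i \alpha_j \tilde{k}(x_i, x_j) \quad \text{s.t.} \quad \sum_{i \in I} \alpha_i = 1, \; \alpha_i \geq 0, \; i \in I. \tag{15} \]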
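If we denote by $\alpha^*$ the solution of (15), formulas for the center $c^*$ and the squared radius $r^{*2}$ of the MEB follow from strong duality:
\[ c^* = \sum_{i \in I} \alpha_i^* z_i, \qquad r^{*2} = \Phi(\alpha^*) = \sum_{i \in I} \alpha_i^* \tilde{k}(x_i, x_i) - \sum_{i, j \in I} \alpha_i^* \alpha_j^* \tilde{k}(x_i, x_j). \tag{16} \]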
Note that the MEB depends only on the subset of points $S_C = \{z_i : \alpha_i^* > 0\}$, that is, the points for which $\alpha_i^* > 0$. It can be shown that computing the MEB on $S_C$ is equivalent to computing the MEB on the entire dataset $\varphi(S)$. This set is frequently called a coreset of $\varphi(S)$, a concept we are going to explore further in the next sections.
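The formulas for the center and the squared radius can be checked on a small example. The sketch below assumes points given explicitly in the plane (a linear kernel, so the center can be formed directly) and a hand-picked vector of dual weights; the helper name is hypothetical. It recovers the ball from the weights and verifies that a point with zero weight lies inside it.

```python
def dot(u, v):
    """Euclidean dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def meb_from_dual_weights(alpha, Z):
    """Center sum_i alpha_i z_i and squared radius sum_i alpha_i |z_i|^2 - |c|^2."""
    dim = len(Z[0])
    center = [sum(a * z[d] for a, z in zip(alpha, Z)) for d in range(dim)]
    linear_term = sum(a * dot(z, z) for a, z in zip(alpha, Z))
    sq_radius = linear_term - dot(center, center)   # alpha^T K alpha = |c|^2 here
    return center, sq_radius

Z = [(0.0, 0.0), (2.0, 0.0), (1.0, 0.5)]
alpha = [0.5, 0.5, 0.0]          # weights assumed optimal for this toy set

center, sq_radius = meb_from_dual_weights(alpha, Z)
coreset = [i for i, a in enumerate(alpha) if a > 0.0]
sq_dists = [sum((zd - cd) ** 2 for zd, cd in zip(z, center)) for z in Z]
print(center, sq_radius, coreset, sq_dists)
```

The two points with positive weight lie on the boundary of the ball, while the third point is strictly inside and could be removed without changing the result.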
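We immediately notice a deep similarity between problems (12) and (15), the only difference being the presence of a linear term in the objective function of the latter. This linear term can be neglected under mild conditions on the kernel function $\tilde{k}$. Suppose $\tilde{k}$ fulfills the following normalization condition:
\[ \tilde{k}(x_i, x_i) = \Delta^2 = \text{constant}, \quad i \in I. \tag{17} \]
Since $\sum_{i \in I} \alpha_i = 1$, the linear term in (15) is then equal to the constant $\Delta^2$ and can be dropped, so that (15) reduces to the minimization of $\sum_{i, j \in I} \alpha_i \alpha_j \tilde{k}(x_i, x_j)$ over the unit simplex. This problem coincides with (12) if we set
\[ \tilde{k}(x_i, x_j) = y_i y_j \left( k(x_i, x_j) + 1 \right) + \frac{\delta_{ij}}{C}, \]
where $k$ is the kernel function used within the SVM classifier. Therefore, computing an SVM for a set of labelled data is equivalent to computing the MEB of the set of feature points $\varphi(S) = \{z_i = \varphi(x_i) : i \in I\}$, where the mapping $\varphi$ satisfies the condition $\varphi(x_i)^T \varphi(x_j) = \tilde{k}(x_i, x_j)$. A possible implementation of such a mapping is $\varphi(x_i) = \left( y_i \phi(x_i), \; y_i, \; \frac{1}{\sqrt{C}} e_i \right)$, where $e_i$ is the $i$-th vector of the canonical basis and $\phi$ is in turn the mapping associated with the original Mercer kernel $k$ used by the SVM.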
Note that the previous equivalence between the MEB and the SVM problems holds if and only if the kernel $\tilde{k}$ fulfills assumption (17). If, for example, the SVM classifier implements the well-known $p$-th order polynomial kernel $k(x_i, x_j) = (x_i^T x_j + 1)^p$, we have that $\tilde{k}(x_i, x_i)$ is no longer a constant, and thus the MEB equivalence no longer holds. Complex constructions are required to extend the MEB optimization framework to SVMs using different kernel functions.
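The normalization condition can be checked numerically for a given kernel. The sketch below assumes that the modified kernel has the form $y_i y_j (k(x_i, x_j) + 1) + \delta_{ij}/C$, as used in the CVM literature, and compares a Gaussian kernel with a polynomial kernel on made-up data; the function names and the parameter values are illustrative only.

```python
import math

def rbf(x, z, gamma=0.5):
    """Gaussian kernel; k(x, x) = 1 for every x."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def poly(x, z, p=2):
    """Inhomogeneous polynomial kernel (x . z + 1)^p; k(x, x) depends on |x|."""
    return (sum(a * b for a, b in zip(x, z)) + 1.0) ** p

def modified_gram(kernel, X, y, C):
    """Gram matrix of the assumed modified kernel y_i y_j (k + 1) + delta_ij / C."""
    m = len(X)
    return [[y[i] * y[j] * (kernel(X[i], X[j]) + 1.0) + (1.0 / C if i == j else 0.0)
             for j in range(m)] for i in range(m)]

def has_constant_diagonal(K, tol=1e-12):
    """True when all diagonal entries coincide, i.e. the normalization condition holds."""
    diag = [K[i][i] for i in range(len(K))]
    return max(diag) - min(diag) <= tol

X = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
y = [1, -1, 1]
C = 10.0

rbf_ok = has_constant_diagonal(modified_gram(rbf, X, y, C))
poly_ok = has_constant_diagonal(modified_gram(poly, X, y, C))
print(rbf_ok, poly_ok)
```

For the Gaussian kernel every diagonal entry equals $2 + 1/C$, so the linear term of the MEB dual is constant; for the polynomial kernel the diagonal varies with the norm of each example and the reduction does not apply.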
3 Bădoiu-Clarkson Algorithm and Core Vector Machines
Problem (15) is in general a large and dense QP. Obtaining a numerical solution when $m$ is large is very expensive, no matter which kind of numerical method one decides to employ. Taking into account that in practice we can only approximate the solution within a given tolerance, it is convenient to modify our objective a priori: instead of the MEB, we can try to compute an approximate MEB in the sense specified by the following definition.
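Let MEB$(\varphi(S)) = B(c^*, r^*)$ and $\epsilon > 0$ be a given tolerance. Then, a $(1+\epsilon)$-MEB of $\varphi(S)$ is a ball $B(c, r)$ such that $r \leq r^*$ and $\varphi(S) \subset B(c, (1+\epsilon) r)$.
A set $S_C \subset \varphi(S)$ is an $\epsilon$-coreset of $\varphi(S)$ if MEB$(S_C)$ is a $(1+\epsilon)$-MEB of $\varphi(S)$.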
Algorithms to compute $(1+\epsilon)$-MEBs that scale independently of the dimension of the feature space and the cardinality of $\varphi(S)$ have been provided in the computational geometry literature. In particular, the Bădoiu-Clarkson (BC) algorithm is able to provide an $\epsilon$-coreset of $\varphi(S)$ in no more than $O(1/\epsilon)$ iterations. We denote with $S_k$ the coreset approximation obtained at the $k$-th iteration and its MEB as $B(c_k, r_k)$. Starting from a given $S_0$, at each iteration $S_{k+1}$ is defined as the union of $S_k$ and the point of $\varphi(S)$ furthest from $c_k$. The algorithm then computes $B(c_{k+1}, r_{k+1})$ and stops if $B(c_{k+1}, (1+\epsilon) r_{k+1})$ contains $\varphi(S)$.
Exploiting these ideas, Tsang and colleagues introduced the CVM (Core Vector Machine) for training SVMs supporting a reduction to a MEB problem. The CVM is described in Algorithm 1, where each $S_k$ is identified by the index set $I_k \subset I$. The elements included in $S_k$ are called the core vectors. Their role is exactly analogous to that of support vectors in a classical SVM model.
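The expression for the radius $r_k$ follows easily from (16). Moreover, it is easy to show that step 14 exactly looks for the point $x_{i^*}$ whose image $z_{i^*}$ is the furthest from $c_k$. In fact, by using the expressions $c_k = \sum_{j \in I_k} \alpha_{k,j} z_j$ and $z_i^T z_j = \tilde{k}(x_i, x_j)$, we obtain:
\[ \|z_i - c_k\|^2 = \sum_{j, l \in I_k} \alpha_{k,j} \alpha_{k,l} \tilde{k}(x_j, x_l) - 2 \sum_{j \in I_k} \alpha_{k,j} \tilde{k}(x_j, x_i) + \tilde{k}(x_i, x_i). \tag{20} \]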
Note how this computation can be performed by means of kernel evaluations, in spite of the lack of an explicit representation of $c_k$ and $z_i$. Once $i^*$ has been found, it is included in the index set. Finally, the reduced QP corresponding to the MEB of the new approximate coreset is solved.
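The search for the furthest point using only kernel evaluations can be sketched as follows. This is a minimal illustration rather than the CVM implementation: it assumes a precomputed Gram matrix `K` of the kernel defining the MEB problem (a real solver would evaluate and cache only the needed entries), and the function names are hypothetical.

```python
def gram(Z, kernel):
    """Full Gram matrix; only for small illustrative examples."""
    return [[kernel(zi, zj) for zj in Z] for zi in Z]

def sq_dist_to_center(K, alpha, i):
    """Squared distance between point i and the center sum_j alpha_j z_j, via kernels."""
    support = [j for j, a in enumerate(alpha) if a > 0.0]
    quad = sum(alpha[j] * alpha[l] * K[j][l] for j in support for l in support)
    cross = sum(alpha[j] * K[j][i] for j in support)
    return quad - 2.0 * cross + K[i][i]

def furthest_point(K, alpha):
    """Index of the point whose image is furthest from the current center."""
    return max(range(len(K)), key=lambda i: sq_dist_to_center(K, alpha, i))

# Check against explicit geometry with a linear kernel on toy data.
linear = lambda u, v: sum(a * b for a, b in zip(u, v))
Z = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
alpha = [0.5, 0.5, 0.0]                  # center is (1, 0)
K = gram(Z, linear)
d2 = [sq_dist_to_center(K, alpha, i) for i in range(len(Z))]
i_star = furthest_point(K, alpha)
print(d2, i_star)
```

The cost of the search is dominated by the cross term, which requires one kernel evaluation per pair of core vector and candidate point; the quadratic term is the same for every candidate and can be computed once per iteration.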
Algorithm 1 has two main sources of computational overhead: the computation of the furthest point from $c_k$, which is linear in $m$, and the solution of the optimization subproblem in step 11. The complexity of the former step can be made constant and independent of $m$ by suitable sampling techniques, an issue to which we will return later. As regards the optimization step, CVMs adopt an SMO method, where only two variables are selected for optimization at each iteration [8, 30]. It is known that the cost of each SMO iteration is not too high, but the method can require a large number of iterations in order to satisfy reasonable stopping criteria.
As regards the initialization, that is, the computation of $S_0$ and $\alpha_0$, a simple choice consists in taking $S_0 = \{z_a, z_b\}$, where $z_a$ is an arbitrary point in $\varphi(S)$ and $z_b$ is the farthest point from $z_a$. Obviously, in this case the center and radius of $B(c_0, r_0)$ are $c_0 = \frac{1}{2}(z_a + z_b)$ and $r_0 = \frac{1}{2}\|z_a - z_b\|$, respectively. That is, we initialize $\alpha_{0,a} = \alpha_{0,b} = \frac{1}{2}$, and $\alpha_{0,i} = 0$ for $i \notin \{a, b\}$. A more efficient strategy, implemented for example in the code LIBCVM, is the following. The procedure consists in determining the MEB of a subset of $p$ training points, where the set of indices is randomly chosen and $p$ is small. This MEB is approximated by running an SMO solver. In practice, a small value of $p$ is suggested to be enough, but one can also try larger initial guesses, as long as SMO can rapidly compute the initial MEB. $S_0$ is then defined as the set of points gaining a strictly positive dual weight in the process, and $I_0$ as the set of the corresponding indices.
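The simple two-point initialization can be sketched in a few lines. The example below assumes explicit points in the plane and takes the first point as the arbitrary starting one; the helper name `two_point_init` is hypothetical, and the LIBCVM strategy based on a random subset is not reproduced here.

```python
import math

def sq_dist(u, v):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def two_point_init(Z, a=0):
    """Initial coreset {z_a, z_b}, with z_b the point farthest from z_a."""
    b = max(range(len(Z)), key=lambda i: sq_dist(Z[a], Z[i]))
    alpha = [0.0] * len(Z)
    alpha[a] += 0.5                      # equal weights on the two selected points
    alpha[b] += 0.5
    center = [0.5 * (p + q) for p, q in zip(Z[a], Z[b])]
    radius = 0.5 * math.sqrt(sq_dist(Z[a], Z[b]))
    return alpha, sorted({a, b}), center, radius

Z = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0), (2.0, 2.0)]
alpha0, I0, c0, r0 = two_point_init(Z)
print(alpha0, I0, c0, r0)
```

In a kernelized setting the same selection would be made with squared distances computed from kernel values, since the feature points are not available explicitly.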
4 Frank-Wolfe Methods for the MEB-SVM Problem
4.1 Overview of the Frank-Wolfe Algorithm
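The Frank-Wolfe algorithm (FW) is designed to solve optimization problems of the form
\[ \max_{\alpha \in \Sigma} f(\alpha), \tag{22} \]
where $f$ is a concave function, and $\Sigma$ a bounded convex polyhedron.
In the case of the MEB dual problem, the objective function is quadratic and $\Sigma$ coincides with the unit simplex. Given the current iterate $\alpha_k$, a standard Frank-Wolfe iteration consists of the following steps:
1. Find a point $u_k \in \Sigma$ maximizing the local linear approximation $\psi_k(u) = f(\alpha_k) + (u - \alpha_k)^T \nabla f(\alpha_k)$, and define $d_k = u_k - \alpha_k$.
2. Perform a line-search $\lambda_k = \arg\max_{\lambda \in [0, 1]} f(\alpha_k + \lambda d_k)$.
3. Update the iterate by $\alpha_{k+1} = \alpha_k + \lambda_k d_k = (1 - \lambda_k) \alpha_k + \lambda_k u_k$.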
The algorithm is usually stopped when the objective function is sufficiently close to its optimal value, according to a suitable proximity measure.
Since $\psi_k$ is a linear function and $\Sigma$ is a bounded polyhedron, the search directions $d_k$ are always directed towards an extreme point of $\Sigma$. That is, $u_k$ is a vertex of the feasible set. The constraint $\lambda \in [0, 1]$ ensures feasibility at each iteration. It is easy to show that in the case of the MEB problem $u_k = e_{i^*}$, where $e_i$ denotes the $i$-th vector of the canonical basis, and $i^*$ is the index corresponding to the largest component of $\nabla f(\alpha_k)$. The updating step therefore assumes the form
\[ \alpha_{k+1} = (1 - \lambda_k) \alpha_k + \lambda_k e_{i^*}. \tag{24} \]
It can be proved that the above procedure converges globally. As a drawback, however, it often exhibits a tendency to stagnate near a solution. Intuitively, suppose that solutions $\alpha^*$ of (22) lie on the boundary of $\Sigma$ (this is often true in practice, and holds in particular for the MEB problem). In this case, as $\alpha_k$ gets close to a solution $\alpha^*$, the directions $d_k$ become more and more orthogonal to $\nabla f(\alpha_k)$. As a consequence, $\alpha_k$ possibly never reaches the face of $\Sigma$ containing $\alpha^*$, resulting in a sublinear convergence rate.
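The generic iteration described in this subsection can be sketched for a concave quadratic objective on the unit simplex. The code below is a minimal illustration under stated assumptions, not the implementation evaluated in this paper: it assumes an objective of the form $f(\alpha) = b^T \alpha - \alpha^T Q \alpha$ with $Q$ symmetric positive semidefinite (the MEB dual is of this form, with $Q$ the Gram matrix and $b$ its diagonal), uses a fixed iteration budget instead of a proximity-based stopping rule, and all names are hypothetical.

```python
def objective(Q, b, alpha):
    """f(alpha) = b . alpha - alpha^T Q alpha."""
    m = len(b)
    quad = sum(alpha[i] * Q[i][j] * alpha[j] for i in range(m) for j in range(m))
    return sum(bi * ai for bi, ai in zip(b, alpha)) - quad

def frank_wolfe_simplex(Q, b, alpha, iters=100, tol=1e-12):
    """Frank-Wolfe with exact line search for a concave quadratic on the simplex."""
    m = len(b)
    alpha = list(alpha)
    for _ in range(iters):
        Qa = [sum(Q[i][j] * alpha[j] for j in range(m)) for i in range(m)]
        grad = [b[i] - 2.0 * Qa[i] for i in range(m)]
        i_star = max(range(m), key=lambda i: grad[i])    # vertex maximizing the linearization
        d = [-a for a in alpha]
        d[i_star] += 1.0                                 # d = e_{i*} - alpha
        slope = sum(g * di for g, di in zip(grad, d))    # directional derivative along d
        if slope <= tol:                                 # no ascent direction left
            break
        curv = sum(d[i] * Q[i][j] * d[j] for i in range(m) for j in range(m))
        # f(alpha + lam d) = f(alpha) + lam * slope - lam^2 * curv, maximized on [0, 1]
        lam = 1.0 if curv <= 0.0 else min(1.0, slope / (2.0 * curv))
        alpha = [a + lam * di for a, di in zip(alpha, d)]
    return alpha

# Toy MEB dual: corners of a square plus one interior point (linear kernel).
dot = lambda u, v: sum(p * q for p, q in zip(u, v))
Z = [(1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-1.0, -1.0), (0.2, 0.1)]
Q = [[dot(zi, zj) for zj in Z] for zi in Z]
b = [Q[i][i] for i in range(len(Z))]
alpha = frank_wolfe_simplex(Q, b, [0.0, 0.0, 0.0, 0.0, 1.0], iters=1000)
value = objective(Q, b, alpha)
print(value)
```

On this toy set the optimal value is the squared radius of the enclosing ball, equal to 2. Started from the interior point, the iterates approach this value from below, but the weight of the interior point is only shrunk by the factor $1 - \lambda_k$ at each step and never becomes exactly zero, which illustrates the stagnation discussed above.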
4.2 The Modified Frank-Wolfe Algorithm
We now describe an improvement over the general Frank-Wolfe procedure, based on a modification proposed and later detailed in the optimization literature. This improvement can be quantified in terms of the rate of convergence of the algorithm and thus of the number of iterations in which it can be expected to fulfill the stopping conditions.
In practice, the tendency of FW to stagnate near a solution can lead to later iterations wasting computational resources while making minimal progress towards the optimal function value. It would thus be desirable to obtain a stronger result on the convergence rate, which guarantees that the speed of the algorithm does not deteriorate when approaching a solution. This paragraph describes a technique geared precisely towards this aim.
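Essentially, the previous algorithm is enhanced by introducing alternative search directions known as away steps. The basic idea is that, instead of moving towards the vertex of $\Sigma$ maximizing a linear approximation $\psi_k$ of $f$ in $\alpha_k$, we can move away from the vertex minimizing $\psi_k$. At each iteration, a choice between these two options is made by choosing the best ascent direction. The whole procedure, known as Modified Frank-Wolfe algorithm (MFW), can be sketched as follows:
1. Find $u_k$ and define $d_k^{FW} = u_k - \alpha_k$ as in the standard FW algorithm.
2. Find a vertex $v_k$ of $\Sigma$ by minimizing $\psi_k(v)$, s.t. $v_j = 0$ if $\alpha_{k,j} = 0$. Define $d_k^{A} = \alpha_k - v_k$.
3. If $\nabla f(\alpha_k)^T d_k^{FW} \geq \nabla f(\alpha_k)^T d_k^{A}$, then $d_k = d_k^{FW}$, else $d_k = d_k^{A}$.
4. Perform a line-search $\lambda_k = \arg\max_{\lambda \in [0, \bar{\lambda}]} f(\alpha_k + \lambda d_k)$, where $\bar{\lambda} = 1$ if $d_k = d_k^{FW}$ and $\bar{\lambda} = \max \{ \lambda \geq 0 : \alpha_k + \lambda d_k^{A} \in \Sigma \}$ otherwise.
5. Update the iterate by $\alpha_{k+1} = \alpha_k + \lambda_k d_k$.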
It is easy to show that both $d_k^{FW}$ and $d_k^{A}$ are feasible ascent directions, unless $\alpha_k$ is already a stationary point.
In the case of the MEB problem, step 2 corresponds to finding the basis vector $e_{j^*}$ corresponding to the smallest component of $\nabla f(\alpha_k)$ among the indices $j$ with $\alpha_{k,j} > 0$. Note that a face of $\Sigma$ of lower dimensionality is reached whenever an away step with maximal stepsize $\bar{\lambda}$ is performed. Imposing the constraint in step 2 is tantamount to ruling out away steps with zero stepsize. That is, an away step from $e_j$ cannot be taken if $\alpha_{k,j}$ is already zero.
Linear convergence of $f(\alpha_k)$ to $f(\alpha^*)$ has been proved assuming Lipschitz continuity of $\nabla f$, strong concavity of $f$, and strict complementarity at the solution. A proof of the same result has also been provided for the MEB problem, under weaker assumptions. It is important to note that such assumptions are in particular satisfied for the MEB formulation of the $L_2$-SVM, and that as such the aforementioned linear convergence property holds for all problems considered in this paper. In particular, uniqueness of the solution, which is implied if we ask for strong (or just strict) concavity, is not required. The gist is essentially that, in a small neighborhood of a solution $\alpha^*$, MFW is forced to perform away steps until the face of $\Sigma$ containing $\alpha^*$ is reached, which happens after a finite number of iterations. Starting from that point, the algorithm behaves as an unconstrained optimization method, and it can be proved that $f(\alpha_k)$ converges to $f(\alpha^*)$ linearly.
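The away-step mechanism can be sketched for the same class of problems. The code below is an illustration under stated assumptions and not the implementation used in the experiments: it assumes a concave quadratic $f(\alpha) = b^T \alpha - \alpha^T Q \alpha$ on the unit simplex, for which the maximal away stepsize from vertex $e_j$ is $\alpha_j / (1 - \alpha_j)$, and all names are hypothetical.

```python
def objective(Q, b, alpha):
    """f(alpha) = b . alpha - alpha^T Q alpha."""
    m = len(b)
    quad = sum(alpha[i] * Q[i][j] * alpha[j] for i in range(m) for j in range(m))
    return sum(bi * ai for bi, ai in zip(b, alpha)) - quad

def modified_frank_wolfe(Q, b, alpha, iters=1000, tol=1e-12):
    """Frank-Wolfe with away steps and exact line search on the unit simplex."""
    m = len(b)
    alpha = list(alpha)
    for _ in range(iters):
        Qa = [sum(Q[i][j] * alpha[j] for j in range(m)) for i in range(m)]
        grad = [b[i] - 2.0 * Qa[i] for i in range(m)]
        g_alpha = sum(g * a for g, a in zip(grad, alpha))
        i_fw = max(range(m), key=lambda i: grad[i])
        support = [j for j in range(m) if alpha[j] > 0.0]
        j_aw = min(support, key=lambda j: grad[j])   # away vertex restricted to the support
        slope_fw = grad[i_fw] - g_alpha              # grad . (e_i - alpha)
        slope_aw = g_alpha - grad[j_aw]              # grad . (alpha - e_j)
        if max(slope_fw, slope_aw) <= tol:           # stationary point
            break
        away = slope_aw > slope_fw
        if away:
            d = list(alpha)
            d[j_aw] -= 1.0
            slope, lam_max = slope_aw, alpha[j_aw] / (1.0 - alpha[j_aw])
        else:
            d = [-a for a in alpha]
            d[i_fw] += 1.0
            slope, lam_max = slope_fw, 1.0
        curv = sum(d[i] * Q[i][j] * d[j] for i in range(m) for j in range(m))
        lam = lam_max if curv <= 0.0 else min(lam_max, slope / (2.0 * curv))
        alpha = [a + lam * di for a, di in zip(alpha, d)]
        if away and lam == lam_max:
            alpha[j_aw] = 0.0                        # maximal away step: vertex is dropped
    return alpha

dot = lambda u, v: sum(p * q for p, q in zip(u, v))
Z = [(1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-1.0, -1.0), (0.2, 0.1)]
Q = [[dot(zi, zj) for zj in Z] for zi in Z]
b = [Q[i][i] for i in range(len(Z))]
alpha = modified_frank_wolfe(Q, b, [0.0, 0.0, 0.0, 0.0, 1.0], iters=2000)
value = objective(Q, b, alpha)
print(value, [j for j, a in enumerate(alpha) if a > 0.0])
```

The only structural difference from the standard iteration is the comparison of the two slopes and the bounded stepsize for away steps; when an away step hits its bound, the corresponding weight is set exactly to zero, which is how the method can shrink the working set.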
4.3 The FW and MFW Algorithms for MEB-SVMs
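If the FW method is applied to the MEB dual problem, the structure of the objective function can be exploited in order to obtain explicit formulas for steps 1 and 2 of the generic procedure. Indeed, the components of $\nabla \Phi(\alpha_k)$ are given by
\[ \nabla \Phi(\alpha_k)_i = \tilde{k}(x_i, x_i) - 2 \sum_{j \in I_k} \alpha_{k,j} \tilde{k}(x_i, x_j) = \|z_i - c_k\|^2 - \|c_k\|^2, \]
and therefore, since $\|c_k\|^2$ does not depend on $i$,
\[ i^* = \arg\max_{i \in I} \nabla \Phi(\alpha_k)_i = \arg\max_{i \in I} \|z_i - c_k\|^2. \]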
In practice, step 1 selects the index of the input point maximizing the distance from $c_k$, exactly as done in the CVM procedure. Computation of distances can be carried out as in CVMs, using (20). As regards step 2, it can be shown (see [4, 41]) that
\[ \lambda_k = \frac{1}{2} \left( 1 - \frac{r_k^2}{\|z_{i^*} - c_k\|^2} \right), \qquad r_k^2 = \Phi(\alpha_k). \]
The whole procedure is sketched in Algorithm 2, where at each iteration we associate to $\alpha_k$ the index set $I_k = \{i \in I : \alpha_{k,i} > 0\}$.
As regards the initialization, $\alpha_0$ and $I_0$ can be defined exactly as in the CVM procedure. At subsequent iterations, the formula to update $I_k$ immediately follows from the updating (24) for $\alpha_k$; indeed, the indices of the strictly positive components of $\alpha_{k+1}$ are the same as those of $\alpha_k$, plus $i^*$ if $\alpha_{k,i^*} = 0$ (which means that $z_{i^*}$ was not already included in the current coreset). The introduction of the sequence $\{I_k\}$ in Algorithm 2 makes it evident that structure and output of Algorithm 1 are preserved.
It has been proved that $\{\Phi(\alpha_k)\}$ is a monotonically increasing sequence with $r^{*2}$ as an upper bound. Therefore, since the same stopping criterion of the BC algorithm is used, $I_k$ identifies an $\epsilon$-coreset of $\varphi(S)$, and the last $B(c_k, r_k)$ is a $(1+\epsilon)$-MEB of $\varphi(S)$. However, the MEB-approximating procedure differs from that of BC in that the value of $r_k^2 = \Phi(\alpha_k)$ is not equal to the squared radius of MEB$(S_k)$, but tends to the correct value as $\alpha_k$ gets near the optimal solution (see Fig. 1).
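The FW procedure specialized to the MEB problem can be sketched with kernel values only. The code below is a minimal illustration in the spirit of Algorithm 2, under stated assumptions: it takes a precomputed Gram matrix `K` of the kernel defining the MEB problem, assumes at least two distinct feature points, uses the two-point initialization and the stepsize $\lambda_k = \frac{1}{2}(1 - r_k^2 / \|z_{i^*} - c_k\|^2)$, and stops when the ball of radius $(1+\epsilon) r_k$ contains every point. The function name and the default parameters are hypothetical.

```python
def fw_meb(K, eps=1e-2, max_iter=100000):
    """Frank-Wolfe for the MEB dual; returns weights, coreset indices, squared radius."""
    m = len(K)
    a = 0
    b = max(range(m), key=lambda i: K[a][a] - 2.0 * K[a][i] + K[i][i])
    alpha = [0.0] * m
    alpha[a] += 0.5
    alpha[b] += 0.5
    core = {a, b}                                    # index set of the current coreset
    r2 = 0.0
    for _ in range(max_iter):
        Ka = [sum(alpha[j] * K[i][j] for j in core) for i in range(m)]
        c_norm2 = sum(alpha[j] * Ka[j] for j in core)             # squared norm of the center
        r2 = sum(alpha[j] * K[j][j] for j in core) - c_norm2      # dual objective value
        dist2 = [K[i][i] - 2.0 * Ka[i] + c_norm2 for i in range(m)]
        i_star = max(range(m), key=lambda i: dist2[i])            # furthest point
        if dist2[i_star] <= (1.0 + eps) ** 2 * r2:                # all points are enclosed
            break
        lam = 0.5 * (1.0 - r2 / dist2[i_star])                    # exact line search
        alpha = [(1.0 - lam) * x for x in alpha]
        alpha[i_star] += lam
        core.add(i_star)
    return alpha, sorted(core), r2

# Toy example with a linear kernel: an interior point followed by the corners of a square.
dot = lambda u, v: sum(p * q for p, q in zip(u, v))
Z = [(0.2, 0.1), (1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-1.0, -1.0)]
K = [[dot(zi, zj) for zj in Z] for zi in Z]
alpha, core, r2 = fw_meb(K, eps=1e-2)
print(core, r2)
```

Because the initial point stays in the index set with a small positive weight, the coreset returned here can contain points that a method with away steps would discard, which is consistent with the remark that $r_k^2$ only tends to the squared radius of the MEB of the current coreset as the iterates approach the solution.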