Supervised classification is of great importance in many domains and research areas and has enjoyed widespread application. In the vast majority of cases, it is assumed that each pattern or instance (during training and testing) belongs to only one class and is therefore assigned a single label. While this kind of single label classification has been widely studied, many situations do not fit into the single label framework because the classification problems are inherently multilabel. In the multilabel setting, an instance can simultaneously belong to multiple classes and therefore more than one label needs to be assigned to each instance. Consider, for instance, the image tagging problem. Here, each image contains multiple objects, each possessing a different label. The task at hand is to assign to each image all relevant labels, i.e., the set of unique objects in the scene. Alternatively, consider the problem of identifying each instrument at every time instant of a song. Since multiple instruments (guitar, bass and drums, for example) can be heard simultaneously at any given moment, instrument recognition is inherently a multilabel classification problem.
Multilabel classification is a generalization of the well-known multiclass problem in supervised learning and is consequently more challenging. Some of the challenges in the multilabel setting include, but are not limited to, exponential growth of the cardinality of possible label sets, correlation among labels, structured label spaces and severely unbalanced datasets. Multilabel classification has already seen many real-world applications such as scene labeling [boutell2004SceneSVM], functional genomics analysis [zhang2006multilabel, lippert2010gene] and text categorization [joachims1998text, mccallum1999multi, kazawa2005maximal, rousu2006kerneltext].
The rising number of applications has led to the development of multilabel learning as a new supervised learning paradigm [gibaja2015tutorial]. Many approaches have been proposed in the literature to tackle multilabel classification problems within the support vector machine (SVM) framework. Despite the plethora of approaches and algorithms, we assert that no straightforward (natural) extension of the SVM has hitherto been proposed for multilabel classification. The SVM was first introduced as a two-class, single label (binary) classifier based on maximum margin geometry (and regularized error minimization) [boser1992training, cortes1995support, vapnik1998statistical]. Since then, the binary SVM has been studied extensively and applied successfully in many different domains. However, its extension to multiclass and multilabel cases is still an ongoing research problem since there is neither consensus nor a majoritarian approach.
In this work, we first attempt to justify the need to revisit the two-class (binary) SVM framework. Subsequently, we investigate the multilabel problem from a unified large margin perspective. We propose a novel formulation based on a geometric reinterpretation of the binary SVM which has a natural extension resulting in a unified multiclass and multilabel classification framework. Our proposed method is also capable of dealing with the aforementioned class imbalance problem in classification tasks.
Multilabel classification approaches mainly fall into the two categories of problem transformation and algorithm adaptation. An intuitive way to handle multilabel classification tasks is to transform them into a series of binary classification problems and then use—as a method of solution—one of the existing binary classifiers (including binary SVM) to solve them. Algorithm adaptation approaches, on the other hand, solve the problem directly by extending binary classification methods in such a way that all classes are considered at once. Our method falls into the category of algorithm adaptation for multilabel (and multiclass) classification—within the SVM framework. In this paper, we therefore focus mainly on SVM-based approaches to multiclass and multilabel problems and mention other approaches (available in the literature) only as required.
1.2 Outline of the basic idea
Below, we describe the basic idea behind our multilabel SVM formulation. A more formal description is elaborated in Section 3. The main reason for presenting a précis is to quickly get to the (simple) idea behind the formulation.
If the dataset is assumed to be linearly separable, then there are infinitely many hyperplanes which can separate the patterns into two classes. The main idea of the binary SVM is to choose a hyperplane which is optimal in the sense of maximizing the margin between the two sets of patterns while being somewhat robust to outliers. Given the finite set of patterns
subset is linearly separable from subset by hyperplane if there exist both a unit vector and constant c such that for all training instances, . Vapnik [vapnik1998statistical] shows that the optimal solution to the following optimization problem is a unit normal vector of the hyperplane with maximum margin:
where the assumption is that the patterns of these two classes are linearly separable, i.e., for some vector and constant . The optimal unit vector along with constant
determine the maximum margin hyperplane. It is also proved that the maximum margin (optimal) hyperplane is unique [vapnik1998statistical]. Furthermore, the optimization problem in (1) can be equivalently formulated as finding a pair, a weight vector and a bias (threshold), such that the weight vector has the smallest norm and the margin constraints are satisfied for all training patterns [vapnik1998statistical]:
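For reference, the hard margin problem referred to as (2) is usually written as follows. The notation is the common textbook one and is an assumption here, since the original symbols are not shown: patterns $\mathbf{x}_i$, labels $y_i \in \{-1, +1\}$, weight vector $\mathbf{w}$, bias $b$ and $N$ training patterns.

```latex
\min_{\mathbf{w},\, b} \;\; \frac{1}{2}\,\|\mathbf{w}\|^{2}
\quad \text{subject to} \quad
y_i\left(\mathbf{w}^{T}\mathbf{x}_i + b\right) \geq 1,
\qquad i = 1, \dots, N.
```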
Patterns for which the inequality constraints in (2) become active (equality constraints) are of particular importance. These points locate the optimal hyperplane and are known as support vectors and hence the name support vector machine. Cortes and Vapnik [cortes1995support] extended this problem to the case where the two classes are not linearly separable by relaxing the hard margin constraints in (2) and letting some patterns violate the margin:
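In the same assumed textbook notation (patterns $\mathbf{x}_i$, labels $y_i \in \{-1, +1\}$, weight vector $\mathbf{w}$, bias $b$), with slack variables $\xi_i$ and a regularization parameter $C > 0$ that are likewise naming assumptions, the relaxed problem referred to as (3) commonly takes the form:

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;\;
\frac{1}{2}\,\|\mathbf{w}\|^{2} + C \sum_{i=1}^{N} \xi_i
\quad \text{subject to} \quad
y_i\left(\mathbf{w}^{T}\mathbf{x}_i + b\right) \geq 1 - \xi_i,
\quad \xi_i \geq 0,
\qquad i = 1, \dots, N.
```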
The optimization problem in (3) is known as the soft margin SVM. Solving for the slack variables in terms of the other variables, we get the following hinge loss in the objective function, which is a convex upper bound on the zero-one misclassification loss function:
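In the usual notation (assumed here: label $y \in \{-1, +1\}$ and decision value $f$), the hinge loss is $\max(0, 1 - y f)$. The following minimal Python sketch, with hypothetical function names, illustrates the upper bound property by comparing the two losses pointwise; counting a zero decision value as an error is a convention chosen for the sketch.

```python
def zero_one_loss(y, f):
    """Return 1.0 when the sign of decision value f disagrees with label y.

    A decision value of exactly zero is counted as an error (a convention
    assumed for this sketch).
    """
    return 1.0 if y * f <= 0 else 0.0


def hinge_loss(y, f):
    """Hinge loss max(0, 1 - y*f): zero only when the margin y*f is at least 1."""
    return max(0.0, 1.0 - y * f)


# Tabulate both losses for a positive pattern over a few decision values.
table = [(f, zero_one_loss(1, f), hinge_loss(1, f))
         for f in (-2.0, -0.5, 0.0, 0.5, 1.0, 2.0)]
```

Each row of `table` holds a decision value followed by the zero-one loss and the hinge loss; the hinge value is never smaller than the zero-one value, and it keeps penalizing correctly classified patterns that lie inside the margin.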
The essential idea behind our multilabel (and multiclass) formulation was inspired by the optimization problem in (1) which can be reformulated as
and furthermore, equivalently written as
We have introduced an origin and wish to point out that reconfiguring the training set with respect to this origin does not change the optimal solution to the problem in (1). In fact, Vapnik’s original formulation in (1) is relative. Therefore, adding the same origin-dependent quantity to both terms in the optimization problem in (1) does not change the optimal solution. In Vapnik’s equivalent formulation (2), the origin is implicitly introduced via a bias term by transitioning from the relative formulation in (1). Furthermore, we are effectively considering a separate hyperplane for each class; with the addition of a sum-to-zero constraint on the normal vectors, the two hyperplanes become parallel in this formulation. As we will see in Section 3, the sum-to-zero constraint on the normal vectors (of the hyperplanes) is not necessary. When it is relaxed, the formulation yields nonparallel separating hyperplanes in two-class problems. While it may seem odd to have twin, nonparallel hyperplanes in a two-class problem, as we later show, this becomes a useful construct in the multilabel setting.
Having presented the basic idea (nonparallel hyperplanes), we turn to a quick description of related work in order to forestall impressionistic notions of similarity to previous work. The twin SVM (TWSVM) also departs from parallel hyperplanes for the two-class problem [khemchandani2007twin]. However, it is neither a maximum margin formulation nor an all-in-one machine since it solves two separate optimization problems for binary classification. Mangasarian and Wild [mangasarian2005multisurface] proposed a generalized eigenvalue proximal SVM (GEPSVM) which deviates from a previous proximal SVM (PSVM) [Mangasarian2005multicategory] with parallel hyperplanes. This approach also solves two separate optimization problems for obtaining the separating hyperplane of each class in binary classification. The normal vector of each of the two nonparallel proximal hyperplanes is the eigenvector corresponding to the smallest eigenvalue of a generalized eigenvalue problem.
Our approach is also different from the one-class SVM which is an unsupervised algorithm for outlier (novelty) detection [scholkopf2000support, tax2004support]. In the one-class SVM, training data is available for only one class and the goal is to decide whether a new pattern belongs to this class. The one-class SVM introduced in [scholkopf2000support] is a natural extension of SVM to the case of unsupervised learning and all the patterns are separated from the origin of the feature space by a hyperplane with maximum margin. Depending on which side of the hyperplane a point falls, it is classified either as a member (of that class) or as an outlier. Whether or not this problem could be formulated for a two-class problem with outliers without the oppositional framework has not been investigated. This approach does not consider the multilabel problem (or even multiclass for that matter). Our model simultaneously addresses multiclass and multilabel problems with the origin implicitly determined by the support vectors of all classes in feature space. In particular, our problem formulation is based on a unique and novel geometry for simultaneously maximizing the margins of all classes in the multiclass and multilabel settings. In order to better demonstrate the advantages of our approach, we mainly focus on large margin formulations in the literature.
The outline of the paper is as follows. In Section 2, we review previous work focused on multiclass and multilabel problems, mostly within the SVM framework. In Section 3, the new formulation is presented and a quadratic majorizer for the hinge loss is used to solve the optimization problem in the primal. In Section 4, the nonlinear transformation of the input space to infinite dimensional spaces via reproducing kernel Hilbert spaces (RKHS) is discussed and the majorization algorithm is extended to solve the primal kernel formulation. In Section 5, we present illustrative experiments and the results of our formulation on multiclass and multilabel datasets. We also discuss a scenario which widely used approaches are not able to handle but which is easily handled in our formulation. Section 6 presents the conclusions and speculates on future possibilities and extensions.
2 Related Work
In our discussion of previous work, we mainly confine ourselves to SVM-based approaches, specifically all-in-one (single) machines for multiclass and multilabel problems. All-in-one approaches consider information of all classes simultaneously and set up a single optimization problem for learning all classifiers (discriminators) at once. The margin concept is key in binary SVM and its extensions. SVM approaches differ mainly in how they define the margin and how margin violations are formulated in the loss function. There are two notions of margin, namely, relative margin and absolute margin which basically correspond respectively to cases where the margins of the classifiers are optimized relative to each other or otherwise [dogan2016unified].
Vapnik [vapnik1998statistical] proposed the first all-in-one framework for the multiclass SVM based on relative margins. [weston1999support] and [bredensteiner1999multicategory] also used this framework albeit with slightly different formulations. It can be shown that all three formulations are equivalent. In this framework, the relative margin of each pattern is maximized against all other classes to which it does not belong with a slack variable utilized for each of these pairs. Crammer and Singer [crammer2001algorithmic] proposed a similar formulation for multiclass SVM with a single slack variable per pattern, essentially a penalty for the worst class. Their formulation did not take the biases into account. Hsu and Lin [hsu2002comparison] subsequently added biases into the Crammer and Singer formulation.
Lee et al. [lee2004multicategory] proposed another formulation for multiclass SVM with absolute margins which (while extending the Bayes decision rule for multiclass in the same fashion as for the binary case) ensures that the solution directly targets the Bayes decision rule, a property also known as Fisher consistency. However, setting up the framework for compatibility with the Bayes decision rule eliminates the notion of support vectors for each class since the hinge loss term corresponding to the correct class gets eliminated. In other words, this approach penalizes a pattern if it gets within margin distance of the $K-1$ other classifiers (where $K$ is the number of classes) without encouraging the pattern to get close to its own classifier. The formulations proposed in [lee2004multicategory] and [weston1999support] are theoretically sound but an efficient training algorithm has not yet been proposed [dogan2016unified]. Van Den Burg and Groenen [van2016gensvm] introduced a single machine, the generalized multiclass SVM (GenSVM), which uses a simplex encoding to cast the multiclass problem in $K-1$ dimensions. GenSVM generalizes the aforementioned methods of [lee2004multicategory] and [crammer2001algorithmic]. Liu and Yuan [liu2011reinforced] proposed a new type of multiclass hinge loss function called the reinforced hinge loss and proved that, under some conditions, it is Fisher consistent. Szedmak et al. [szedmak2006learning] introduced the multiclass maximum margin regression (MMR) framework with computational complexity no more than that of a binary SVM. They extend the idea of support vector regression [vapnik1998statistical] to vector label learning in an arbitrary Hilbert space. Dogan et al. [dogan2016unified] showed that most of the proposed all-in-one multiclass SVMs differ mainly in their choice of margin and margin-based loss function and presented a unified formulation which accommodates existing single machine approaches for multiclass classification. Based on this unified view, they proposed two new multiclass formulations.
Zhang and Jordan [zhangJordan2012bayesian] proposed a Bayesian multiclass SVM by extending the method proposed in [sollich2002bayesian]. [Mangasarian2005multicategory] proposed multicategory proximal support vector machines (MPSVM) where, instead of working in the feature space and maximizing the margin, they concatenate the bias to the weight vector and try to maximize this extended margin as in [Fung2001ProximalSV]. MPSVM is a problem transformation scheme which solves $K$ independent nonsingular systems of linear equations to obtain the weight vectors. It is not clear why merging the bias with the weight vector and maximizing this extended margin should improve classification performance.
The one versus the rest (OvR) a.k.a. one versus all (OvA) approach is based on the idea of separately training a binary classifier for each class. The patterns belonging to one class form a group and all other patterns form the second group. Therefore, in the OvR approach for $K$-class problems, $K$ binary classifiers are trained [vapnik1998statistical]. For a test pattern, each classifier outputs a real value which reflects the level of certainty of the pattern belonging to its corresponding class. Naturally, the test pattern is assigned to the class with the maximum discriminant function value or confidence score (winner-take-all decision criterion). OvR does not take label correlations into account. OvR is considered to be an intuitive and simple extension of the binary SVM and yet powerful enough to produce results as good as other classifiers [rifkin2004defenseOvA]. It can be extended to the multilabel problem by using a probabilistic scheme for obtaining confidence levels and thresholding the scores, so that each value or probability above the threshold gives rise to a label. However, as we will show in Section 5, there are cases in which this approach fails. Since a smaller number of patterns are used to train each classifier, OvR is susceptible to increased variance because partitioning the data this way makes the training set unbalanced [lee2004multicategory]. In the OvR approach, it is not clear how support vectors should be determined, and their interpretation is ambiguous, especially in the multilabel setting, since support vectors are relative.
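The two OvR decision rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-class scores are taken as given (training is not shown), and the function names, the example scores and the default threshold of zero are hypothetical.

```python
def ovr_multiclass(scores):
    """Winner-take-all: index of the classifier with the largest score."""
    return max(range(len(scores)), key=lambda k: scores[k])


def ovr_multilabel(scores, threshold=0.0):
    """Multilabel extension: every class whose score exceeds the threshold.

    The result may be empty when no classifier is confident enough.
    """
    return [k for k, s in enumerate(scores) if s > threshold]


# Hypothetical outputs of three independently trained per-class classifiers.
scores = [0.8, -0.3, 0.1]
single_label = ovr_multiclass(scores)   # class 0
label_set = ovr_multilabel(scores)      # classes 0 and 2
```

Because each score comes from a separately trained classifier, the scores are not calibrated against one another, which is one reason a probabilistic scheme is typically applied before thresholding.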
In the one versus one (OvO) approach a.k.a. all versus all, a binary classifier is trained for each pair of classes [hastie1998OvO, kressel1999pairwise]. Therefore, for a $K$-class problem, $K(K-1)/2$ classifiers are trained. A test pattern is assigned to the class with the maximum number of votes. OvO suffers from the transitivity problem, which may result in label contradictions in some areas of the input space. When the number of classes is large, these binary-classifier-based methods may suffer from either computational cost or highly imbalanced sample sizes in training. DAGSVM was proposed to combine the results of the classifiers obtained via the OvO strategy [platt2000DAGSVM]. A new test pattern is classified using $K-1$ classifiers which are chosen based on which path is traversed in the graph.
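A minimal Python sketch of OvO voting follows. The pairwise classifiers are replaced by a hypothetical lookup function, so the example only illustrates the vote count and the transitivity problem, not any particular training procedure.

```python
from itertools import combinations


def num_pairs(num_classes):
    """Number of pairwise classifiers needed for OvO: K(K-1)/2."""
    return num_classes * (num_classes - 1) // 2


def ovo_vote(pairwise_winner, num_classes):
    """Collect one vote per class pair; pairwise_winner(i, j) returns i or j."""
    votes = [0] * num_classes
    for i, j in combinations(range(num_classes), 2):
        votes[pairwise_winner(i, j)] += 1
    return votes


def cyclic_winner(i, j):
    """Hypothetical pairwise decisions: 0 beats 1, 1 beats 2, 2 beats 0."""
    wins = {(0, 1): 0, (1, 2): 1, (0, 2): 2}
    return wins[(i, j)]


votes = ovo_vote(cyclic_winner, 3)  # every class receives exactly one vote
```

With the cyclic decisions above, all three classes tie, so the majority vote cannot resolve the label. This is the kind of contradiction the transitivity problem refers to.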
Dietterich and Bakiri [dietterich1994error-correcting] proposed a general approach for multiclass classification based on error-correcting output codes (ECOC) for binary classifiers. A binary string of some fixed length, known as the codeword, is assigned to each class and then one binary function is learned for each bit position. At the decoding step, a code is obtained for a test pattern by evaluating all trained binary classifiers for that pattern. The test pattern is assigned to the class with minimum Hamming distance (which counts the number of bits that differ) between the binary output code of the test pattern and the base code of that class. ECOC has also been applied to margin-based methods including the SVM, where each codeword entry belongs to $\{-1, 0, +1\}$, with $0$ indicating that the way in which the corresponding classifier assigns this label is irrelevant [allwein2000errorcoding]. Label prediction is based on a loss-based decoding scheme. Loss-based decoding takes into account the output magnitudes of the binary classifiers, which are treated as certainty levels, and chooses the label which is most consistent with the classifiers’ outputs under the loss function. Most of the aforementioned approaches do not accommodate multilabel classification and cannot be extended to it in a straightforward manner.
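The Hamming decoding step can be illustrated with a short Python sketch. The codebook below is hypothetical, chosen so that any two codewords differ in at least three bits, which lets a single flipped bit be corrected; it is not taken from any of the cited works.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def ecoc_decode(output_bits, codebook):
    """Return the class whose codeword is closest to the classifiers' output."""
    return min(codebook, key=lambda label: hamming_distance(output_bits, codebook[label]))


# Hypothetical 6-bit codebook for three classes (minimum pairwise distance 3).
codebook = {
    "A": (0, 0, 0, 0, 0, 0),
    "B": (1, 1, 1, 0, 0, 0),
    "C": (0, 0, 0, 1, 1, 1),
}

# Output of the six binary classifiers for one test pattern; the third bit is
# wrong relative to the codeword of class "B".
observed = (1, 1, 0, 0, 0, 0)
predicted = ecoc_decode(observed, codebook)
```

Here `predicted` is `"B"`: the observed string is at distance 1 from the codeword of `"B"`, distance 2 from `"A"` and distance 5 from `"C"`.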
Approaches to multilabel learning mainly fall into the categories of algorithm adaptation and problem transformation. In general, there is no established approach for solving a multilabel classification problem [scikit-multilearn]. OvR, a.k.a. binary relevance (BR) in multilabel learning, is a baseline approach and usually serves as a benchmark for other multilabel approaches [luaces2012binary]. Boutell et al. [boutell2004SceneSVM] used the OvR approach with the SVM as the binary classifier on the scene dataset. They use cross training which basically includes each pattern as a positive example (label 1) for each class when training the corresponding classifier. The multilabel patterns are not used as negative examples while training the classifiers. In our framework, on the other hand, no patterns are used as negative examples for training any classifier since each classifier only uses information related to its own patterns to determine the hyperplane. Boutell et al. also offered an empirical (heuristic) decision criterion for testing and an evaluation metric for performance evaluation; these are discussed in Section 5. Two approaches are proposed in [godbole2004discriminative] to improve the margin in multilabel classification within the OvR framework. The first approach eliminates patterns from the opposite class (the rest). The second approach removes all the training data in confusing classes using a confusion matrix obtained from applying a fast and relatively accurate classifier. Chen et al. [chen2016mltsvm] extended TWSVM to the multilabel setting and proposed MLTSVM in which classifiers are trained separately in OvR fashion.
Elisseeff and Weston [elisseeff2002rankSVM] introduced a large margin ranking system known as Rank-SVM for multilabel classification. Rank-SVM is an algorithm adaptation method which can be considered as an extension of the Crammer and Singer multiclass SVM framework. The idea of Rank-SVM is to find classifiers (hyperplanes), one for each label, such that the relevant labels for each training pattern are ranked higher than irrelevant labels. To achieve this, a proposed ranking loss is minimized and relative margins are maximized. Since it is a ranking-based approach, the design of a set size predictor is needed. Formulating multilabel classification as a ranking problem imposes a quadratic number of constraints on the problem and makes optimization more challenging [hariharan2010large]. Rank-SVM, as its name indicates, ranks the labels for each pattern and then chooses a number of highly ranked labels for each pattern based on a probabilistic or heuristic approach for determining the cardinality of the set of labels. Therefore, Rank-SVM is not able to directly produce output labels.
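To make the ranking view concrete, the Python sketch below computes a pairwise ranking loss for one pattern and then selects a label set from the ranking. It is a sketch under stated assumptions: the loss is the commonly used fraction of misordered (relevant, irrelevant) label pairs, the label set size is supplied by hand in place of a learned set size predictor, and all names and numbers are hypothetical.

```python
def ranking_loss(scores, relevant):
    """Fraction of (relevant, irrelevant) label pairs ranked in the wrong order."""
    rel = [k for k in range(len(scores)) if k in relevant]
    irr = [k for k in range(len(scores)) if k not in relevant]
    if not rel or not irr:
        return 0.0  # no pairs to compare
    misordered = sum(1 for r in rel for q in irr if scores[r] <= scores[q])
    return misordered / (len(rel) * len(irr))


def top_labels(scores, set_size):
    """Keep the set_size highest-scoring labels (set_size comes from a separate predictor)."""
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return sorted(order[:set_size])


# Hypothetical scores of four label classifiers; labels 0 and 1 are relevant.
scores = [2.0, 0.5, 1.0, -1.0]
loss = ranking_loss(scores, {0, 1})  # only the pair (1, 2) is misordered
labels = top_labels(scores, 2)
```

The second function shows why a ranking alone does not yield output labels: the number of labels to keep has to come from somewhere else.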
Armano et al. [armano2012error] extended ECOC to multilabel text categorization. They generate a codeword for each class with only and
bits. Then the posterior probability of each class is obtained and the top-ranking categories are selected, where the number of selected categories is user defined or tuned on the validation set.
Incorporating label space structure and capturing label correlations within the framework of large margin methods has been studied in multilabel learning [tsochantaridis2004structuredSVM, liu2018svm, hariharan2010large, sun2011canonical, lampert2011maximum, guo2011adaptive, taskar2004max]. There are cases in which elements of the output space are structured objects such as trees, sequences, strings or sets. The challenge here is how to effectively use label correlation information to improve accuracy in multilabel prediction. The structured SVM (SSVM) was proposed to learn a mapping from the input space to structured output spaces like a parse tree in Natural Language Parsing [tsochantaridis2004structuredSVM, joachims2009predicting].
Hariharan et al. [hariharan2010large] developed a maximum margin multilabel formulation (M3L) which assumes labels are densely correlated but did not include pairwise label correlation terms in the objective function. Their work is the middle ground between approaches with independent labels assumption (linear complexity) and explicitly modeling label correlations (exponential complexity). They make the assumption that prior knowledge about label correlations is available and that labels have at most linear dense correlations. This prior knowledge is incorporated into the formulation in the form of a dense correlation matrix and densely correlated sub-problems are solved. They show that OvR is obtained as a special case of their model.
Canonical correlation analysis (CCA) is used for dimensionality reduction while exploiting label correlations in the multilabel setting [sun2011canonical, Jiepeng2016book]. In supervised dimensionality reduction via CCA, the two sets of variables are derived from the data and the class labels, followed by projection into a lower-dimensional space in which the patterns are maximally correlated. The authors incorporated a dimensionality reduction scheme into the SVM formulation and proposed a joint dimensionality reduction and multilabel classification framework. Their framework also uses opposition from other classes to learn classifiers. Liu et al. [liu2018svm] modified the SVM to take missing labels into consideration by extending the label set to include zero, which indicates a missing class label. They also take label correlations into consideration by proposing label and class smoothness terms and integrating these into the objective function.
These approaches still use patterns from other classes in an oppositional fashion to obtain each classifier. Our approach is a natural extension of an all-in-one multiclass SVM [weston1999support, crammer2001algorithmic] and does not need to incorporate every possible label set prediction into the model. Labels get assigned only after training, similar to multiclass classification in the SVM. Furthermore, the interpretation of labels is very important in multilabel classification problems since, in most of the existing work, not having a label for a class is automatically interpreted as negative membership. Our framework takes this into account, uses only the available information on memberships for classification and deviates from all oppositional classification frameworks. However, it is easily extensible to negative labels by assigning a negative label to patterns for which information about negative membership is available. Our approach resides between approaches with the independent labels assumption and the ones which consider direct pairwise label correlations. From this perspective, we do not have the exponential label space complexity issue since separation in our framework is against the origin. Our formulation does not explicitly explore label correlations; however, as we show, this information is implicitly taken into account. Finally, a dimensionality reduction scheme can also be easily incorporated into our formulation.
3 A Unified SVM Classification Framework
The underlying motivation for the present work originates from the desire to naturally extend the SVM formulation to a multilabel classification framework. The lack of such a unified multiclass and multilabel framework in the literature led us to revisit the original SVM formulation and suggest a new, geometrically driven approach. The basic idea is as follows. We (implicitly) learn a new origin for the feature vectors as well as the magnitude and direction of the weight vectors such that the inner product of each weight vector with the patterns, reconfigured with respect to this new origin, represents the confidence (certainty) of the membership of the patterns in each class. Consequently, each class chooses the magnitude and orientation of its weight vector in such a way that the patterns belonging to it are best represented (separated) without the need for explicit opposition to other classes. Inspired by this characteristic, our approach is called one versus none (OvN). In this section, we flesh out this idea, formulate a new objective function and present an algorithm for finding the global optimum.
3.1 Problem Formulation
In Section 1, we motivated the unified framework by jettisoning the parallel hyperplane constraint which is standard in almost all binary (two-class) SVMs. Furthermore, we introduced an origin variable and set up the margin constraints using this new variable. The picture that emerges from this description is of individual class-specific weight vectors which seek to satisfy margin constraints for the patterns belonging to that class. This is repeated for all classes. However, this alone does not ensure competition between classes, which is enshrined using parallel hyperplanes in the standard SVM. In our formulation, we introduce competition by minimizing the inner product between all pairs of weight vectors. Note that the minimum inner product between two unit weight vectors is attained when the two vectors are anti-parallel (as in the standard SVM). Later, in Section 3.5, we show that suitable constraints on the pair of weight vectors in a two-class problem make our formulation completely equivalent to the standard binary SVM.
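The anti-parallel claim follows from a one-line argument. Writing the two unit weight vectors as $\mathbf{w}_1$ and $\mathbf{w}_2$ and the angle between them as $\theta$ (symbols assumed here for illustration):

```latex
\mathbf{w}_1^{T}\mathbf{w}_2
  = \|\mathbf{w}_1\|\,\|\mathbf{w}_2\|\cos\theta
  = \cos\theta \;\geq\; -1,
\qquad \text{with equality if and only if } \theta = \pi,
\ \text{i.e., } \mathbf{w}_2 = -\mathbf{w}_1 .
```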
In common with virtually all previous soft margin approaches, the objective function includes regularization and outlier rejection terms [cortes1995support]. As mentioned above, in addition to the margin term for each weight vector, we have the sum of all pairwise inner products between weight vectors, which endeavors to make the weight vectors anti-parallel. Thus far, we have not discussed the bias term in this setup. When we introduced the origin as a variable in (4), we ended up with two terms that canceled each other out: and which assumes anti-parallel hyperplanes. In a two-class multilabel SVM, these become and . We then have the option of enforcing a hard constraint or of adding a soft constraint term to be minimized as part of the objective function. For the problem formulation with more than two classes, the former option is selected in this work since it performed better in practice in both the linear and kernel cases; it is fully worked out in the appendices. The other case can be worked out similarly. From a conceptual perspective, a hard constraint on the biases is an extension of the anti-parallel hyperplanes constraint, whereas the soft constraint on the biases is a relaxation of the same. A candidate multilabel (and multiclass) optimization problem is
where the first constraint is the hard constraint on the biases and the second set of constraints are margin violation constraints obtained by introducing outlier (slack) variables [cortes1995support]. The first term in the objective function relates to margin maximization, the second term implicitly models pairwise label correlations and the last term penalizes deviation of patterns from their corresponding class margin. The regularization parameters have standard objective function semantics. We reformulate the optimization problem in (5) by eliminating all the outlier variables via hinge losses and introducing the bias variables:
where we define as the set comprising all the normal vectors to the hyperplanes and as the set of biases. The hard constraint on biases is incorporated into the objective function of (6) by the method of Lagrange multipliers and results in the following Lagrangian (where is a Lagrange multiplier):
This is by no means the only available integrated formulation. In Section 3.4, we examine all four possibilities: the cross product of soft and hard constraints on the hyperplanes with soft and hard constraints on the biases. We have also treated the margins as in the conventional SVM. However, it is possible to incorporate margin-related variables in the formulation by setting up the problem similar to the $\nu$-SVC [scholkopf2000new].
3.2 Testing Scenario
After solving the optimization problem and obtaining the classifiers, a test instance is projected to each weight vector. We employ a simple decision criterion wherein each test instance gets assigned all class labels with projection greater than , i.e., if denotes the set of labels for test instance , then
From a geometric point of view, we have an ambiguous region which is inhabited by patterns whose projections are all less than 1. Therefore the decisions on label assignment could include this information. In the present work however, we eschew careful examination of pattern and label test set geometry and relegate this to future work. Instead, when this situation occurs, we revert to a multiclass strategy and pick the label corresponding to the maximum projection as explained below.
In the multiclass (single label) setting, a winner-take-all strategy (as mentioned above) is applied and the test pattern is assigned to the class with maximum projection value
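The testing rules of this section can be sketched as follows. The projections of a test instance onto the weight vectors are taken as given; the threshold value of 1 is an assumption inferred from the description of the ambiguous region above, the treatment of a projection exactly at the threshold is likewise assumed, and the function names are hypothetical.

```python
def assign_single_label(projections):
    """Multiclass rule: winner-take-all over the class projections."""
    return max(range(len(projections)), key=lambda k: projections[k])


def assign_labels(projections, threshold=1.0):
    """Multilabel rule: all classes whose projection reaches the threshold.

    If no projection reaches the threshold (the ambiguous region), fall back
    to the multiclass rule and return the single best class.
    """
    labels = [k for k, p in enumerate(projections) if p >= threshold]
    if not labels:
        labels = [assign_single_label(projections)]
    return labels


# Hypothetical projections of two test instances onto three weight vectors.
confident = assign_labels([1.3, 0.2, 1.1])    # classes 0 and 2
ambiguous = assign_labels([0.4, 0.9, -0.2])   # falls back to class 1
```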
3.3 A Quadratic Majorizer for Optimization
Most of the existing literature on SVM optimization is focused on formulating the primal problem from which the dual formulation is extracted followed by a solution in dual space. However, primal problems can also be efficiently solved. In fact, solving the primal may have advantages for large scale optimization since—at least in the linear SVM—the objective function is linear in the training set patterns (and not quadratic). Furthermore, often in large scale optimization, an approximate solution is sought which results in a large number of support vectors. In this case, the dual may result in a solution which is not meaningful [chapelle2007primalSVM].
One of the simplest and most effective algorithms for solving the primal is majorization-minimization (MM). The main advantage of iterative majorization for the SVM is that it results in simple linear system updates (achieved without the need for line-search parameter tuning). A secondary advantage of MM is that the objective function value is guaranteed to be non-increasing and is generally decreasing in each iteration. If the function to be optimized is strictly convex, which is the case with the SVM loss function, MM converges to the global minimum. In contrast, other methods which solve the dual must often completely converge in order to obtain a meaningful and reasonable solution. Variants of MM have also been proposed which deal with large-scale optimization problems.
Consider the standard, linear, binary SVM in the primal. It comprises a margin term and the hinge loss on the patterns. An alternative to majorization is to smooth the hinge loss (resulting in a differentiable approximation like the Huber loss). However, in this case, we are minimizing a different function which is an approximation to the original function. Furthermore, the smoothed hinge loss does not result in simple update equations (achieved without having to estimate a step-size parameter). Even if we approximate the hinge loss with the generalized logistic loss (which approaches the hinge loss as its sharpness parameter tends to infinity), we would still need to use nonlinear optimization on this logistic regression variant in order to obtain good solutions. Finally, the non-differentiability of the hinge loss forces us to adopt subgradient optimization which is also complex (from a line search perspective). For these reasons, we introduce a simple MM approach to minimize the SVM objective in our unified framework. Clearly, alternatives are available (including minimization in the dual) but we have relegated these to future work.
Majorization leads to a particular kind of iterative optimization algorithm. The basic idea behind majorization is to construct an auxiliary objective function which lies above the original objective function and is easier to minimize. The auxiliary objective function usually has the property of touching the original objective at the current time (iteration) instant. By descending on the auxiliary objective, we guarantee that we also descend on the original objective by virtue of the fact that the auxiliary objective is greater than or equal to the original [lange2013optimization]. For the task at hand, note that MM can specifically deal with nondifferentiable convex objective functions (like the hinge loss) by constructing a sequence of smooth convex functions which majorize the original nonsmooth objective function. A brief formal description of MM follows. A function $g(x \mid x^{(n)})$ majorizes $f(x)$ at $x^{(n)}$ if
$$g(x \mid x^{(n)}) \geq f(x) \ \text{for all } x, \qquad g(x^{(n)} \mid x^{(n)}) = f(x^{(n)}), \tag{8}$$
which means that $g(x \mid x^{(n)})$ lies above $f(x)$ everywhere and touches $f(x)$ at least at $x^{(n)}$, where $x^{(n)}$ is a point in the domain of $f$ at the current iteration. Now we minimize the surrogate function $g(x \mid x^{(n)})$ instead of $f(x)$ in the minimization version of the MM algorithm. If $x^{(n+1)}$ is the minimum of $g(x \mid x^{(n)})$, then the descent property $f(x^{(n+1)}) \leq f(x^{(n)})$ follows from (8). From the descent property, we see that it is possible to merely descend on $g$ instead of finding its global minimum:
$$f(x^{(n+1)}) \leq g(x^{(n+1)} \mid x^{(n)}) \leq g(x^{(n)} \mid x^{(n)}) = f(x^{(n)}).$$
There exist previous approaches which solve the SVM in the primal using MM [groenen2007majorize, SVM-Maj-2008, nguyen2017iterativelyS, van2016gensvm]. For example, [groenen2007majorize] propose a majorization approach which considers different loss functions (hinge, quadratic, Huber). We follow [lange2009sharp] in our approach. [groenen2007majorize] and [lange2009sharp] show that the hinge loss $\max(0, 1-u)$ is majorized by the following quadratic (in $u$):
$$h(u, z) = \frac{1}{4z}\left(1 - u + z\right)^2,$$
where $z > 0$. It is shown that $h$ is the best quadratic majorizer for the hinge loss [lange2009sharp]. Theoretically, it is possible to get arbitrarily close to the hinge loss through the above quadratic majorizer. In practice however, we choose a very small positive threshold value $\epsilon$ for $z$, i.e., $z \geq \epsilon$, for the sake of numerical stability. Here $z$ is an auxiliary variable. Assuming (for the sake of convenience) that $z$ is unconstrained (instead of requiring a positivity constraint), the derivative of $h$ w.r.t. $z$ is
$$\frac{\partial h}{\partial z} = \frac{z^2 - (1-u)^2}{4z^2}.$$
Setting this to zero yields $z^2 = (1-u)^2$ from which we obtain $z = |1-u|$ which is non-negative and is equal to zero when $u = 1$. It can easily be shown that setting $z = \epsilon$ when $|1-u| < \epsilon$ is the global minimum for $h$ (when $z$ is constrained to be greater than or equal to $\epsilon$). We can therefore use the above majorization trick to obtain a quadratic objective w.r.t. $u$. Note that $u$ will be replaced by a suitable function of the weight vector and bias in the formulation. The majorized multiclass/multilabel SVM optimization problem can be written as
where denotes the set of all strictly positive auxiliary variables . In order to simplify the notation in the optimization problem in (9), we concatenate and into one vector and define the following matrices:
where is the number of features. With these changes and defining the set of augmented weight vectors , we may rewrite the optimization problem as
and again by including the hard constraint on biases into the objective function via the method of Lagrange multipliers, the Lagrangian can be written as
Rather than solving for , we concatenate all the augmented weight vectors into one vector and alternate between least-squares solutions for and closed-form solutions for . The solution for follows the approach outlined above:
The solution for the concatenated weight vector merely requires some turgid algebra and is relegated to Appendix A. The pseudo code of the algorithm is presented in Algorithm 1. We briefly mention here that the regularization coefficients for soft constraints on weight vectors and hinge loss terms should be chosen in such a way that the Hessian obtained after concatenating the weight vectors is positive definite (as shown in Appendix A).
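The alternation between a least-squares step and a closed-form auxiliary-variable step can be illustrated on a much simpler problem than the one treated in this section. The sketch below applies MM to the standard binary linear SVM with a single feature, so that the least-squares step reduces to a 2x2 linear system. It is not Algorithm 1: the function names, the toy data, the choice of not regularizing the bias, and the values of `lam` and `eps` are all assumptions made for illustration. The majorizer used is the quadratic bound (1 - u + z)^2 / (4z) on the hinge loss max(0, 1 - u), with the auxiliary variable z updated as max(eps, |1 - u|).

```python
from typing import List, Sequence, Tuple


def hinge(u: float) -> float:
    return max(0.0, 1.0 - u)


def quad_majorizer(u: float, z: float) -> float:
    """Quadratic upper bound on the hinge loss, valid for any z > 0."""
    return (1.0 - u + z) ** 2 / (4.0 * z)


def update_z(u: float, eps: float = 1e-6) -> float:
    """Closed-form auxiliary update, clamped at eps for numerical stability."""
    return max(eps, abs(1.0 - u))


def svm_objective(w: float, b: float, xs: Sequence[float],
                  ys: Sequence[int], lam: float) -> float:
    """Primal objective: (lam / 2) * w^2 + sum of hinge losses."""
    return 0.5 * lam * w * w + sum(hinge(y * (w * x + b)) for x, y in zip(xs, ys))


def mm_binary_svm(xs: Sequence[float], ys: Sequence[int], lam: float = 0.1,
                  n_iter: int = 50, eps: float = 1e-6) -> Tuple[float, float, List[float]]:
    """Alternate closed-form z updates with a 2x2 least-squares solve for (w, b)."""
    w, b = 0.0, 0.0
    history = [svm_objective(w, b, xs, ys, lam)]
    for _ in range(n_iter):
        # Step 1: closed-form update of the auxiliary variables.
        zs = [update_z(y * (w * x + b), eps) for x, y in zip(xs, ys)]
        cs = [1.0 / (2.0 * z) for z in zs]
        # Step 2: minimize the quadratic majorizer over (w, b).
        # Normal equations: [[a11, a12], [a12, a22]] @ [w, b] = [r1, r2].
        a11 = lam + sum(c * x * x for c, x in zip(cs, xs))
        a12 = sum(c * x for c, x in zip(cs, xs))
        a22 = sum(cs)
        r1 = sum(c * (1.0 + z) * y * x for c, z, x, y in zip(cs, zs, xs, ys))
        r2 = sum(c * (1.0 + z) * y for c, z, y in zip(cs, zs, ys))
        det = a11 * a22 - a12 * a12  # positive whenever lam > 0
        w = (r1 * a22 - a12 * r2) / det
        b = (a11 * r2 - a12 * r1) / det
        history.append(svm_objective(w, b, xs, ys, lam))
    return w, b, history


# Toy 1-D data (hypothetical): negatives on the left, positives on the right.
toy_x = [-2.0, -1.5, -1.0, 0.2, 1.0, 1.5, 2.0]
toy_y = [-1, -1, -1, -1, 1, 1, 1]
w_hat, b_hat, objective_trace = mm_binary_svm(toy_x, toy_y)
```

Because the auxiliary variables are clamped at `eps`, the majorizer touches the hinge loss only approximately near u = 1, so in this sketch the primal objective is non-increasing only up to a tolerance of order `eps` per pattern.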
3.4 Incorporating soft and hard constraints into the formulation
The formulation in the previous section imposed a soft constraint on the hyperplane normal vectors, wherein the inner product of every pair of vectors was minimized. Furthermore, a hard constraint on the biases was used to make their sum equal to zero. However, depending on the problem and dataset, we could also consider hard constraints on both the normal vectors and the biases. Table 1 provides the four different possibilities for incorporating the constraints into the formulation. For the sake of simplicity, we omit the margin terms and hinge loss terms which are common to all cases. In all these models, hyperparameters should be chosen in such a way that the Hessian of the objective function remains positive definite, ensuring that the optimization problem remains bounded from below. The details of the formulation specific to the Hessian for the case of soft-hard are relegated to Appendix A.
|Constraints||Appearance in objective (Lagrangian) function|
3.5 Equivalence to the binary SVM
The original two-class soft margin SVM problem is equivalent to the minimization of the following objective function:
Note that we have used instead of since the equivalence will be shown using two hyperplanes. From (14), we see that the existence of two hyperplanes with equal norms and anti-parallel normal vectors is implicit within the standard soft margin binary SVM formulation. The original binary SVM is equivalent to our formulation by imposing hard constraints on and :
Incorporating these constraints into the objective function, we get
which is identical to (14). We have shown that when hard constraints are imposed on two weight vectors (making them anti-parallel but with equal norm) and on the two biases (making them sum to zero), we recover the original soft margin SVM. Furthermore, our soft-hard model (shown in Table 1), while providing a degree of flexibility to the weight vectors’ orientations and norms, is still capable of producing parallel hyperplanes (anti-parallel weight vectors) when the optimal solution dictates this to be the case. Consider the following example which is designed in such a way that the optimal solution must be two parallel hyperplanes. As shown in Figure 1, for any , we get parallel hyperplanes and the solution coincides with the binary SVM.
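One consequence of the hard constraints can be checked numerically at the level of decisions. The following sketch shows only that a pair of anti-parallel, equal-norm weight vectors with biases summing to zero yields, under the winner-take-all rule, the same predictions as the sign rule of a single binary hyperplane. It does not reproduce the objective-level equivalence derived above, and all function names and numbers are hypothetical.

```python
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(p * q for p, q in zip(a, b))


def binary_predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Conventional binary rule: class 1 if w.x + b >= 0, otherwise class 2."""
    return 1 if dot(w, x) + b >= 0 else 2


def two_hyperplane_predict(w1: Sequence[float], b1: float,
                           w2: Sequence[float], b2: float,
                           x: Sequence[float]) -> int:
    """Winner-take-all over two projections (ties go to class 1)."""
    return 1 if dot(w1, x) + b1 >= dot(w2, x) + b2 else 2


# Hypothetical binary classifier and its two-hyperplane counterpart.
w, b = [0.8, -1.2], 0.3
w1, b1 = w, b
w2, b2 = [-wi for wi in w], -b  # anti-parallel weights, biases sum to zero

points = [[1.0, 0.5], [-2.0, 1.0], [0.0, 0.25], [3.0, 4.0], [-0.5, -0.5]]
agreement = [binary_predict(w, b, x) == two_hyperplane_predict(w1, b1, w2, b2, x)
             for x in points]
```

With these constraints the second projection is the negative of the first, so comparing the two projections amounts to checking the sign of the first.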
4 Generalization to RKHS
In Section 3, we presented the formulation and algorithm assuming linear separability between classes. However, linear separability is not a realistic assumption in many problems. Therefore, a more complex approach is required to extend these models to problems with nonlinear separability. In the following subsections, we give an introduction to kernels and demonstrate how the origin is formulated in this setting. We then develop the kernel formulation for unified multiclass and multilabel SVMs.
Kernels were introduced as a tool to map a nonlinearly separable dataset into an implicit higher (or infinite) dimensional reproducing kernel Hilbert space (RKHS) where linear separability can be achieved without the need to explicitly compute the features in the transformed space [vapnik2013nature]. In order to adapt our formulation to an RKHS, two problems have to be dealt with before we can proceed. As we will show, the origin of the RKHS needs to be properly treated and correctly interpreted in an infinite dimensional Hilbert space. Furthermore, the inclusion of a hard constraint on the weight vectors is not as straightforward as in the linearly separable case. We first briefly summarize how RKHS kernels are used in the SVM context.
Deploying a kernel function , the inner product of feature mappings in a high dimensional RKHS is performed by function evaluation in the original space. If is a mapping from the original space to a feature space, i.e., , then
By the representer theorem [kimeldorf1971WahbaRepresenter], each weight vector can be written as a linear combination of all projected patterns in RKHS,
If we define as the subspace spanned by , i.e.,
we can write (the origin in the infinite dimensional feature space) as a summation of elements in and an element in the orthogonal complement of , i.e.,
However, in calculating , the inner product of all elements in taken with vanishes. Therefore, we consider only the component of . Replacing in with (15), we get
In our application of RKHS principles, we need to compute the following inner products with respect to some kernel in RKHS:
where is the Gram matrix formed by the set of pairwise RKHS inner products of patterns. Also, we have
where is the column of the Gram matrix .
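The role of the Gram matrix in these inner products can be verified on a small example. The sketch below uses a homogeneous degree-2 polynomial kernel on two-dimensional inputs because its feature map is finite-dimensional and can be written out explicitly. This kernel, the coefficient names `alpha_k` and `alpha_l`, and the data are assumptions chosen for illustration and are not taken from the paper. It checks that the inner product of two weight vectors expanded over mapped patterns equals a quadratic form in the Gram matrix, and that the projection of a mapped pattern onto a weight vector uses the corresponding Gram column.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def phi(x: Point) -> List[float]:
    """Explicit feature map of the kernel k(x, y) = (x . y)^2 in two dimensions."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]


def kernel(x: Point, y: Point) -> float:
    return (x[0] * y[0] + x[1] * y[1]) ** 2


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(p * q for p, q in zip(a, b))


def gram(X: Sequence[Point]) -> List[List[float]]:
    """Gram matrix of pairwise kernel evaluations."""
    return [[kernel(xi, xj) for xj in X] for xi in X]


def weight_vector(alpha: Sequence[float], X: Sequence[Point]) -> List[float]:
    """Explicit weight vector: a linear combination of mapped patterns."""
    feats = [phi(x) for x in X]
    return [sum(a * f[d] for a, f in zip(alpha, feats)) for d in range(3)]


def quad_form(a: Sequence[float], G: Sequence[Sequence[float]],
              b: Sequence[float]) -> float:
    return sum(a[i] * G[i][j] * b[j] for i in range(len(a)) for j in range(len(b)))


X = [(1.0, 2.0), (0.5, -1.0), (-1.5, 0.3)]
alpha_k = [0.2, -0.7, 1.1]
alpha_l = [-0.4, 0.9, 0.3]
G = gram(X)
w_k = weight_vector(alpha_k, X)
w_l = weight_vector(alpha_l, X)

# Inner product of two weight vectors versus the Gram quadratic form.
explicit_ww = dot(w_k, w_l)
kernel_ww = quad_form(alpha_k, G, alpha_l)

# Projection of each mapped pattern onto w_k versus the Gram-column expression.
explicit_proj = [dot(w_k, phi(x)) for x in X]
kernel_proj = [sum(alpha_k[i] * G[i][j] for i in range(len(X))) for j in range(len(X))]
```

For kernels whose feature space is infinite-dimensional, only the right-hand (Gram-based) expressions are computable, which is why the formulation is written in terms of the Gram matrix.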
4.2 A Kernel Formulation for Multiclass and Multilabel SVMs
We now extend our formulation to an RKHS. The RKHS objective function directly follows (7) with the main difference being the use of RKHS inner products:
Again, after majorizing the hinge loss followed by a change of variables , we have
Rewriting the Lagrangian using a matrix formulation of RKHS inner products, we get
where . Again by concatenating and into a vector and defining the following matrices
the Lagrangian is rewritten as
4.3 Hard constraints on weight vectors and biases
In a manner similar to the linear setting in Section 3.4, we can also incorporate any combination of soft and hard constraints on the normal vectors and biases into the formulation in the kernel setting. As mentioned above, using the representer theorem we can write each in an RKHS as a linear combination of mapped patterns. Note that we cannot directly incorporate a hard constraint on the weight vectors into the formulation as we did in the linear case since here is the origin of an infinite dimensional Hilbert space and is therefore not computable. One way to set the sum of the equal to in RKHS, i.e., , is by adding the following constraints to the problem:
These constraints are derived from the fact that each is a linear combination of mapped patterns
and therefore setting the sum to via
is equivalent to (16). Here, we only present the formulation for the hard constraints on both weight vectors and biases. The detailed formulation for the case of soft-hard is also presented in Appendix B. The other two cases can be obtained in similar fashion. The optimization problem is
Similarly, by concatenating to , we get the simplified optimization problem
We then concatenate all into one vector, , and solve for it using KKT optimality conditions. The four cases for the kernel formulation with different settings of constraints are summarized in Table 2. Similar to the linear case in Section 3, regularization terms and hinge loss terms are left out since they are common to all cases.
|Constraints||Appearance in objective (Lagrangian) function|
In this section, we present our results for some well known datasets using our models and compare them to the OvR approach with the soft margin two-class SVM as the base classifier—using the scikit-learn [pedregosa2011scikit] implementation. In Section 5.1, we discuss one of the drawbacks of the OvR approach for a toy dataset which does not have samples exclusively drawn from each of the classes. In Section 5.2, we first present the results for a few two-class datasets and compare the performance of our model with the soft margin binary SVM. Subsequently, we present the results of our model on some famous multiclass datasets and compare them with OvR and the Crammer-Singer (CS) model for the linear case. In Section 5.3, our model is tested on benchmark multilabel datasets and the results are compared to binary relevance (BR).
We use 3-fold cross validation to choose the hyperparameters. We assume a finite and discrete grid of parameter values. First, the data is shuffled and divided into three equal folds. We choose the first tuple of grid points and train the model on two folds (training set). Then the trained model with this tuple is validated on the remaining fold (validation set) and the model accuracy on the validation set is recorded. We repeat this procedure three times, choosing a different fold as the validation set on each occasion and the remaining two folds as the training set, and record all accuracy measures. In this way, we get three accuracy measures for the first choice of hyperparameters. This procedure is repeated for each and every tuple of grid points. Once again, all accuracy measures are recorded and the average accuracy on three folds for each tuple is computed. The maximum average accuracy is reported here. We tested our models on known datasets and compared the results against frequently used classification models in the literature.
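The selection procedure described above can be sketched as follows. This is a generic illustration and not the code used for the experiments: the function names are hypothetical, the folds are formed by striding over a shuffled index list, and a one-dimensional k-nearest-neighbour classifier with a single hyperparameter `k` stands in for the actual models and their hyperparameters.

```python
import random
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple


def grid_search_cv(X: Sequence[float], y: Sequence[int],
                   grid: Dict[str, Sequence], fit_predict: Callable,
                   n_folds: int = 3, seed: int = 0) -> Tuple[Dict, float]:
    """Return the grid tuple with the highest mean validation accuracy."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)                 # shuffle once
    folds = [idx[f::n_folds] for f in range(n_folds)]  # (near-)equal folds
    names = sorted(grid)
    best_params, best_acc = None, -1.0
    for values in product(*(grid[n] for n in names)):  # every tuple of grid points
        params = dict(zip(names, values))
        accs = []
        for f in range(n_folds):                     # each fold is the validation set once
            val = folds[f]
            train = [i for g in range(n_folds) if g != f for i in folds[g]]
            preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                                [X[i] for i in val], params)
            accs.append(sum(p == y[i] for p, i in zip(preds, val)) / len(val))
        mean_acc = sum(accs) / n_folds
        if mean_acc > best_acc:
            best_params, best_acc = params, mean_acc
    return best_params, best_acc


def knn_fit_predict(train_X: Sequence[float], train_y: Sequence[int],
                    val_X: Sequence[float], params: Dict) -> List[int]:
    """Stand-in model: 1-D k-nearest-neighbour vote over labels in {-1, +1}."""
    preds = []
    for x in val_X:
        nearest = sorted(range(len(train_X)), key=lambda i: abs(train_X[i] - x))[:params["k"]]
        preds.append(1 if sum(train_y[i] for i in nearest) > 0 else -1)
    return preds


# Hypothetical, well-separated toy data.
data_x = [-3.0, -2.6, -2.2, -1.8, -1.4, -1.0, 1.0, 1.4, 1.8, 2.2, 2.6, 3.0]
data_y = [-1] * 6 + [1] * 6
best_params, best_acc = grid_search_cv(data_x, data_y, {"k": [1, 3]}, knn_fit_predict)
```

The returned `best_acc` corresponds to the maximum average accuracy over the three folds, which is the quantity reported in this section.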
5.1 Generating unseen label set configuration
One of the drawbacks of the BR approach for multilabel problems is that we might not have data only belonging to one class in order to train its classifier. For illustration, consider the following scenario shown in Figure 2. In this two-class example, the training set patterns belong either to class 1 (label ) or to both classes (label ). Therefore there is no pattern which only belongs to class 2. Obviously, the OvR approach can not train a classifier for class 1 versus class 2 as there is no pattern belonging solely to the latter. Hence we need an approach for determining the hyperplanes of each class without requiring data belonging to opposing classes. In other words, we need each class to determine its classifier based on available information of its members without the need for direct opposition from other classes. This is the core idea of our approach and hence the monicker one versus none (OvN).
Scikit-learn handles this situation by just assigning the label to any test set pattern and therefore only produces label set or for each training or test pattern. The cross training approach of [boutell2004SceneSVM] is also not able to cope with this situation. They implemented the OvR approach to train separate classifiers. If a pattern belongs to both class 1 and class 2, it is considered a class 1 (class 2) pattern while training the class 1 (class 2) classifier. However, even this strategy is not able to mitigate the situation. With the OvN approach, we are capable of producing new label sets which have not occurred during training. Figure 2 demonstrates this scenario. As we can see from Figure 2, it does not make sense to assign label set to patterns